Supporting memory efficient dropout in flash attention #23

Merged: 10 commits merged into main on Jun 5, 2024

Conversation

@tongxin (Collaborator) commented on May 29, 2024

The main difference from Tri Dao's CUDA implementation is how we handle the Philox RNG state. We cannot easily control the per-thread Philox offset increment because Triton keeps the thread abstraction opaque, so we cannot reproduce the dropout masks exactly as the CUDA version does.
The second difference concerns the nuances of where to apply scaling, masking, and casting. The placement was settled by trial and error rather than by any guiding principle.
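
For context, here is a minimal standalone sketch of the seeded-dropout idea this relies on. It is not the PR's attention kernel, and the kernel and function names are illustrative: each element's keep/drop decision comes from `tl.rand(seed, offset)`, i.e. a Philox stream indexed by the element's flat offset rather than by a per-thread counter, so only the seed and offsets are needed to replay the same mask later (e.g. in the backward pass) without materializing a mask tensor.

```python
# Illustrative sketch of memory-efficient (seeded) dropout in Triton.
# Names (dropout_kernel, seeded_dropout) are hypothetical, not from the PR.
import torch
import triton
import triton.language as tl


@triton.jit
def dropout_kernel(x_ptr, out_ptr, n_elements, p_drop, seed, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    bounds = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=bounds)
    # tl.rand(seed, offsets) is deterministic in (seed, offset): the mask
    # depends only on each element's global position, not on thread layout,
    # so it can be regenerated from the seed instead of being stored.
    keep = tl.rand(seed, offsets) > p_drop
    y = tl.where(keep, x / (1.0 - p_drop), 0.0)
    tl.store(out_ptr + offsets, y, mask=bounds)


def seeded_dropout(x: torch.Tensor, p: float, seed: int) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    dropout_kernel[grid](x, out, n, p, seed, BLOCK=1024)
    return out
```

Because the CUDA version instead advances a per-thread Philox offset, the two implementations draw from differently indexed random streams and therefore produce different (though equally valid) dropout masks.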

@tongxin (Collaborator, Author) left a comment:

Looks good.

@iclementine (Collaborator) left a comment:

LGTM

@iclementine merged commit ee91638 into main on Jun 5, 2024 (1 check passed).