Add flash decoding (flash attention with split_kv) #17

Merged: 4 commits into FlagOpen:main on Feb 5, 2024

Conversation

iclementine (Collaborator) commented on Feb 1, 2024

Implement flash decoding (flash attention with split_kv): https://princeton-nlp.github.io/flash-decoding/. This algorithm is used when batch_size * num_heads * blocks_along_seqlen_q cannot saturate the GPU's SMs.
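For reference, a minimal PyTorch sketch of the split-KV idea (this is not the Triton kernel in this PR; shapes and names are illustrative): each split of the key/value sequence produces a partial softmax output plus its log-sum-exp, and the partials are combined with a numerically stable rescaling.

```python
# Minimal sketch of split-KV decoding, assuming seqlen_q == 1.
# NOT the Triton kernel from this PR; it only illustrates the
# split-and-combine math that the kernel parallelizes over the grid.
import math
import torch

def attention_split_kv(q, k, v, num_splits):
    # q: (num_heads, 1, d); k, v: (num_heads, seqlen_k, d)
    d = q.shape[-1]
    scale = 1.0 / math.sqrt(d)
    partial_out, partial_lse = [], []
    for idx in torch.chunk(torch.arange(k.shape[1]), num_splits):
        s = (q @ k[:, idx].transpose(-1, -2)) * scale       # (heads, 1, chunk)
        lse = torch.logsumexp(s, dim=-1, keepdim=True)      # per-split log-sum-exp
        partial_out.append(torch.exp(s - lse) @ v[:, idx])  # per-split softmax output
        partial_lse.append(lse)
    out = torch.stack(partial_out)                          # (splits, heads, 1, d)
    lse = torch.stack(partial_lse)                          # (splits, heads, 1, 1)
    # Reduce: weight each split by exp(lse_i - global_lse) and sum.
    weights = torch.exp(lse - torch.logsumexp(lse, dim=0))
    return (weights * out).sum(dim=0)
```

Each split is independent, so the splits can be mapped to extra thread blocks; only the final reduce step needs the per-split log-sum-exps.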

Benchmark results on RTX-3090.

batch_size=2, num_heads=32, seqlen_q=1, seqlen_k=N_CTX

attention_d-64_dtype-torch.float16 (ms):
       N_CTX  flag_attn      torch    flash-2
0      512.0   0.030585   0.037478   0.026489
1     1024.0   0.045577   0.047122   0.045771
2     2048.0   0.055814   0.068738   0.084344
3     4096.0   0.091942   0.109804   0.103477
4     8192.0   0.167185   0.192318   0.185000
5    16384.0   0.318114   0.358216   0.336406
6    32768.0   0.797097   0.725244   0.655855
7    65536.0   1.223194   1.454980   1.299972
8   131072.0   2.429410   2.982287   2.559437
9   262144.0   4.837376   6.334187   5.085689
10  524288.0   9.653147  13.354424  10.183790

batch_size=2, num_heads=16, seqlen_q=1, seqlen_k=N_CTX

attention_d-128_dtype-torch.float16 (ms):
       N_CTX  flag_attn      torch   flash-2
0      512.0   0.046705   0.033903  0.022163
1     1024.0   0.046658   0.046374  0.033584
2     2048.0   0.055688   0.067447  0.053302
3     4096.0   0.094723   0.107909  0.098266
4     8192.0   0.169925   0.189810  0.171835
5    16384.0   0.321480   0.348469  0.324517
6    32768.0   0.625961   0.676866  0.630376
7    65536.0   1.230001   1.345603  1.236284
8   131072.0   2.435531   2.695659  2.446050
9   262144.0   4.854816   5.393702  4.863354
10  524288.0   9.683682  10.757249  9.691238

@StrongSpoon (Collaborator) left a comment


👏

Inline comment on the diff:

return 1

num_n_blocks = triton.cdiv(N, BLOCK_N)
def num_split_avaiable(s):
Collaborator:

available?

Collaborator Author (iclementine):

When splitting num_n_blocks into s splits, some splits (the last split, for example) may end up with no valid workload.

For example, when splitting 64 blocks into 12 splits, with each split processing cdiv(64, 12) = 6 blocks, the last split is empty. So splitting 64 blocks into 12 splits is not available in this sense.
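To make that concrete, a hedged sketch of such a check in plain Python (the name follows the diff excerpt above, spelled correctly; the actual logic in the PR may differ):

```python
def cdiv(a, b):
    # ceiling division, the same operation as triton.cdiv
    return (a + b - 1) // b

def num_split_available(num_n_blocks, s):
    # s splits are only usable if none of them is empty: with each split
    # handling cdiv(num_n_blocks, s) blocks, the number of splits actually
    # needed at that size must equal s.
    blocks_per_split = cdiv(num_n_blocks, s)
    return cdiv(num_n_blocks, blocks_per_split) == s

# The example from the reply: cdiv(64, 12) = 6 blocks per split, but only
# 11 splits of 6 blocks are needed to cover 64 blocks, so the 12th split
# would be empty.
assert not num_split_available(64, 12)
assert num_split_available(64, 8)
```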

@iclementine iclementine merged commit 1641d0c into FlagOpen:main Feb 5, 2024
1 check passed