Supporting memory-efficient dropout in flash attention (#23)
1. Add dropout to regular flash attention.
2. Add `philox_cuda_seed_offset` to increment the offset of PyTorch's Philox random generator state.
---------

Co-authored-by: Clement Chan <[email protected]>
tongxin and iclementine authored Jun 5, 2024
1 parent 13664fc commit ee91638
Showing 9 changed files with 308 additions and 23 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -234,11 +234,11 @@ The performance of piecewise_attention has improved compared to that in v0.1. In
- supports computing the total attention each `k` gets from all `q`'s;
- supports returning the accumulated attention of each key.
- supports [MQA](https://arxiv.org/abs/1911.02150) and [GQA](https://arxiv.org/pdf/2305.13245).
- supports dropout of attention weights.

#### Limitations

- `headdim` should be in `[16, 32, 64, 128]`.
- dropout of attention weights is not supported yet.

## TODOs

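For context on the new feature bullet above, usage from the Python side would look roughly like the sketch below. This is a hedged example: the entry point `flag_attn.flash_attention` and the argument name `dropout_p` are assumptions for illustration, not confirmed by the excerpt shown here.

```python
import torch
import flag_attn  # hypothetical usage; `flash_attention` and `dropout_p` are assumed names

q = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)

# Dropout is applied to the attention weights inside the fused kernel; the mask
# is derived from PyTorch's Philox (seed, offset) state rather than stored.
o = flag_attn.flash_attention(q, k, v, causal=True, dropout_p=0.1)
```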
3 changes: 2 additions & 1 deletion README_cn.md
@@ -224,11 +224,12 @@ print(gq)
- supports forward and backward computation;
- the sequence length of K/V may differ from the sequence length of Q;
- supports computing the total attention each k gets from all q's.
- supports [MQA](https://arxiv.org/abs/1911.02150) and [GQA](https://arxiv.org/pdf/2305.13245).
- supports dropout of attention weights.

#### Limitations

- `headdim` must be one of `[16, 32, 64, 128]`;
- dropout of attention weights is not supported yet.

## TODOs

Expand Down
15 changes: 15 additions & 0 deletions src/flag_attn/dropout.py
@@ -0,0 +1,15 @@
import torch
import triton
import triton.language as tl

def philox_cuda_seed_offset(increment, device=None):
    # Read the (seed, offset) pair of PyTorch's default Philox CUDA generator and
    # advance its offset by `increment`, reserving that many counter values for
    # the caller so that later draws do not overlap with them.
    device = device or torch.cuda.current_device()
    gen = torch.cuda.default_generators[device]
    state_copy = gen.get_state()
    c0, c1 = state_copy.view(torch.int64)
    seed, offset = int(c0), int(c1)
    # Round the increment up to a multiple of 4: Philox produces 4 random
    # numbers per counter value.
    increment = (increment + 3) // 4 * 4
    c1 += increment
    # get_state returns a copy, so set_state is needed to write the advanced
    # offset back into the actual generator state.
    gen.set_state(state_copy)
    return seed, offset
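To show how this helper is meant to be consumed, here is a minimal sketch of a seed-based Triton dropout kernel driven by the returned `(seed, offset)` pair, in the spirit of this commit. The kernel `_dropout_kernel`, the `dropout` wrapper, and the block size are illustrative assumptions, not code from this repository.

```python
import torch
import triton
import triton.language as tl

from flag_attn.dropout import philox_cuda_seed_offset


@triton.jit
def _dropout_kernel(x_ptr, out_ptr, n_elements, p, seed, offset, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    idx = pid * BLOCK + tl.arange(0, BLOCK)
    mask = idx < n_elements
    x = tl.load(x_ptr + idx, mask=mask)
    # One Philox draw per element: offsetting by the element index keeps the
    # draws of this launch inside the range reserved by philox_cuda_seed_offset.
    r = tl.rand(seed, offset + idx)
    keep = r > p
    y = tl.where(keep, x / (1.0 - p), 0.0)
    tl.store(out_ptr + idx, y, mask=mask)


def dropout(x, p=0.1):
    # x is assumed to be a contiguous CUDA tensor.
    out = torch.empty_like(x)
    n = x.numel()
    # Reserve n counter values from PyTorch's Philox state; the returned offset
    # is the value *before* the increment, so this launch owns [offset, offset + n).
    seed, offset = philox_cuda_seed_offset(n)
    BLOCK = 1024
    grid = (triton.cdiv(n, BLOCK),)
    _dropout_kernel[grid](x, out, n, p, seed, offset, BLOCK=BLOCK)
    return out
```

Because the dropout mask is a pure function of `(seed, offset)`, the backward pass can regenerate exactly the same pattern instead of materializing the mask in memory, which is what makes this style of dropout memory-efficient.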