Why does the vllm flash attention build only support CUDA 12.1 (not 11.8)? #4801
Replies: 2 comments
-
I don't know the answer, but I was able to compile vllm-flash-attn for 11.8. This is the beginning of my Dockerfile:
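A minimal sketch of one way such a build could look, assuming the nvidia/cuda:11.8.0-devel-ubuntu22.04 base image and the cu118 PyTorch index; the package list, architecture list, and revision handling here are assumptions, not a verified recipe:

```dockerfile
# Sketch: build vllm-flash-attn from source against CUDA 11.8.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        git python3 python3-pip python3-dev build-essential \
    && rm -rf /var/lib/apt/lists/*

# PyTorch wheel built against CUDA 11.8, plus build tooling.
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu118 \
    && pip3 install --no-cache-dir ninja packaging setuptools wheel

# Restrict target architectures to keep compile time manageable.
ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9"

# Build the vLLM fork of flash-attention from source. Depending on the
# revision, you may need to check out a commit from before the CUDA 11.8
# removal or relax the CUDA version check in its setup.py.
RUN git clone https://github.com/vllm-project/flash-attention.git /opt/vllm-flash-attn \
    && cd /opt/vllm-flash-attn \
    && pip3 install --no-cache-dir . --no-build-isolation
```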
-
I'm also very curious about this. Looking through the forked repository of vllm-flash-attn (https://github.com/vllm-project/flash-attention), I couldn't discern any substantial differences from the original flash-attention apart from the changes in package names. In particular, I noticed that commit 7731823a43616ea4845cd02300b083889e840bca removed support for CUDA 11.8, so it seems the removal of CUDA 11.8 was made for a specific reason. If not, why not simply use the original flash-attn, which already has specific wheels for CUDA 11.8 builds and all sorts of other things?

EDIT: For anyone who gets stuck because of CUDA 11.8, edit the import statement in "vllm/attention/backends/flash_infer.py" as below.

```python
from dataclasses import dataclass
# ...
try:
    # ...
except ImportError:
    # ...
import torch
from vllm import _custom_ops as ops
```

I did this to use gemma-2-9b in my CUDA 11.8 environment, and it works perfectly fine. You must have the right flash-attn version installed, though.
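For context, a minimal sketch of what such an edit might amount to. It assumes the backend pulls flash_attn_varlen_func from the CUDA-12-only vllm_flash_attn package; the names and surrounding code are assumptions, not the file's actual contents:

```python
# Hypothetical shape of the patch in vllm/attention/backends/flash_infer.py:
# prefer the vllm-flash-attn wheel (built only for CUDA 12.x), but fall back
# to the upstream flash-attn package, which also ships CUDA 11.8 wheels.
try:
    from vllm_flash_attn import flash_attn_varlen_func  # assumed original import
except ImportError:
    from flash_attn import flash_attn_varlen_func  # upstream fallback for CUDA 11.8
```

Whether the upstream kernel is a drop-in replacement depends on the installed flash-attn version, since flash_attn_varlen_func's argument list has shifted between releases.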
-
Hi @youkaichao @WoosukKwon, I see that the original flash attention repo supports CUDA 11.6+. However, I am not sure why the vllm fork only supports builds for CUDA 12.1, or why vllm (master) does not use the original flash attention build.
How can I enable vllm to build flash attention with CUDA 11.8? (The master version only uses flash attention from the vllm fork, and that does not support CUDA 11.8.)
EDIT: I cloned the master version, updated after #4686.