Why does the vllm flash attention build only support CUDA 12.1 (not 11.8)? #4801
Replies: 2 comments
-
I don't know the answer, but I was able to compile vllm-flash-attn for 11.8. This is the beginning of my Dockerfile:
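A minimal sketch of one way such a build could look, assuming the nvidia/cuda:11.8.0-devel-ubuntu22.04 base image and the cu118 PyTorch index; the package list, architecture list, and revision handling here are assumptions, not a verified recipe:

```dockerfile
# Sketch: build vllm-flash-attn from source against CUDA 11.8.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        git python3 python3-pip python3-dev build-essential \
    && rm -rf /var/lib/apt/lists/*

# PyTorch wheel built against CUDA 11.8, plus build tooling.
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu118 \
    && pip3 install --no-cache-dir ninja packaging setuptools wheel

# Restrict target architectures to keep compile time manageable.
ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9"

# Build the vLLM fork of flash-attention from source. Depending on the
# revision, you may need to check out a commit from before the CUDA 11.8
# removal or relax the CUDA version check in its setup.py.
RUN git clone https://github.com/vllm-project/flash-attention.git /opt/vllm-flash-attn \
    && cd /opt/vllm-flash-attn \
    && pip3 install --no-cache-dir . --no-build-isolation
```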
-
I'm also very curious about this. Looking through the forked repository of vllm-flash-attn (https://github.com/vllm-project/flash-attention), I couldn't discern any substantial differences from the original flash-attention apart from the changes in package names. In particular, I noticed that commit 7731823a43616ea4845cd02300b083889e840bca removed support for CUDA 11.8, so it seems the removal of CUDA 11.8 was made for a specific reason. If not, why not simply use the original flash-attn, which already has specific wheels for CUDA 11.8 builds and all sorts of other things?

EDIT: For anyone who gets stuck because of CUDA 11.8, edit the import statement in "vllm/attention/backends/flash_infer.py" as below.

```python
from dataclasses import dataclass
# ...
try:
    # ...
except ImportError:
    # ...
import torch
from vllm import _custom_ops as ops
```

I did this to use gemma-2-9b in my CUDA 11.8 environment, and it works perfectly fine. You must have the right flash-attn version installed, though.
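For context, a minimal sketch of what such an edit might amount to. It assumes the backend pulls flash_attn_varlen_func from the CUDA-12-only vllm_flash_attn package; the names and surrounding code are assumptions, not the file's actual contents:

```python
# Hypothetical shape of the patch in vllm/attention/backends/flash_infer.py:
# prefer the vllm-flash-attn wheel (built only for CUDA 12.x), but fall back
# to the upstream flash-attn package, which also ships CUDA 11.8 wheels.
try:
    from vllm_flash_attn import flash_attn_varlen_func  # assumed original import
except ImportError:
    from flash_attn import flash_attn_varlen_func  # upstream fallback for CUDA 11.8
```

Whether the upstream kernel is a drop-in replacement depends on the installed flash-attn version, since flash_attn_varlen_func's argument list has shifted between releases.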
-
Hi @youkaichao @WoosukKwon, I see that the original flash attention repo supports CUDA 11.6+. However, I am not sure why the vllm fork only supports builds for CUDA 12.1, or why vllm (master) does not use the original flash attention build.
How can I enable vllm to build flash attention with CUDA 11.8? (The master version only uses flash attention from the vllm fork, and that does not support CUDA 11.8.)
EDIT: I cloned the master version, updated after #4686.