
Have any plans to optimize the prefill kernel for the Hopper architecture? #521

Open

alexngng opened this issue Oct 10, 2024 · 2 comments

@alexngng
I notice that the Flashinfer prefill kernel is much slower than FA3 and TRT-LLM FMHA on SM90.
Do you have any plans to use some SM90 features for optimization?

Here is some data from a test on SM90 (single H20 GPU, Llama2 7B):

| Token count | TRT-LLM FMHA | FA3      | Flashinfer |
|-------------|--------------|----------|------------|
| 512 × 1     | 37638.6      | 39334.6  | 74966.6    |
| 512 × 2     | 54729.9      | 61680.4  | 114800.0   |
| 512 × 4     | 103388.8     | 113056.2 | 190688.4   |
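The table above compares one timing per backend at each token count. A minimal harness for collecting this kind of comparison might look like the sketch below; the kernel callables are placeholders, not the author's actual benchmark, and a real GPU measurement would additionally need device synchronization (e.g. `torch.cuda.synchronize()`) before reading the clock.

```python
import time
from typing import Callable, Dict

def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Return mean wall-clock time per call in microseconds.

    This times an opaque host-side callable; for CUDA kernels you
    would synchronize the device around the timed region instead.
    """
    for _ in range(warmup):
        fn()  # warm caches / JIT before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6

def compare(kernels: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    # Hypothetical: each entry would wrap one attention backend
    # (TRT-LLM FMHA, FA3, Flashinfer) at a fixed token count.
    return {name: benchmark(fn) for name, fn in kernels.items()}
```

Running `compare` over the three backends at each of the 512 × 1/2/4 configurations would reproduce a table shaped like the one above.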
@yzh119
Collaborator

yzh119 commented Oct 10, 2024

Hi @alexngng, yes, for sure. I still have a slight bug to fix, and it's coming soon :)

@jason-huang03

Really looking forward to it!
