
Have any plans to optimize the prefill kernel for the Hopper architecture? #521

Open

alexngng opened this issue Oct 10, 2024 · 2 comments

@alexngng
I notice that the Flashinfer prefill kernel is much slower than FA3 and TRT-LLM FMHA on SM90.
Do you have any plans to use some SM90 features for optimization?

Here is some data from a test on SM90 (single H20 GPU, Llama2 7B):

| Token count | TRT-LLM FMHA | FA3      | Flashinfer |
|-------------|--------------|----------|------------|
| 512 × 1     | 37638.6      | 39334.6  | 74966.6    |
| 512 × 2     | 54729.9      | 61680.4  | 114800.0   |
| 512 × 4     | 103388.8     | 113056.2 | 190688.4   |
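The table above compares one timing per backend at each token count. A minimal harness for collecting this kind of comparison might look like the sketch below; the kernel callables are placeholders, not the author's actual benchmark, and a real GPU measurement would additionally need device synchronization (e.g. `torch.cuda.synchronize()`) before reading the clock.

```python
import time
from typing import Callable, Dict

def benchmark(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Return mean wall-clock time per call in microseconds.

    This times an opaque host-side callable; for CUDA kernels you
    would synchronize the device around the timed region instead.
    """
    for _ in range(warmup):
        fn()  # warm caches / JIT before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6

def compare(kernels: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    # Hypothetical: each entry would wrap one attention backend
    # (TRT-LLM FMHA, FA3, Flashinfer) at a fixed token count.
    return {name: benchmark(fn) for name, fn in kernels.items()}
```

Running `compare` over the three backends at each of the 512 × 1/2/4 configurations would reproduce a table shaped like the one above.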
@yzh119
Collaborator

yzh119 commented Oct 10, 2024

Hi @alexngng, yes, for sure. I still have a slight bug to fix, and it's coming soon :)

@jason-huang03

Really looking forward to it!
