I notice that the Flashinfer prefill kernel is much slower than FA3 and TRT-LLM FMHA on SM90.
Are there any plans to leverage SM90-specific features for optimization?
Here is some data I collected on a single H20 GPU (SM90) with Llama2 7B.
| Number of tokens | TRT-LLM FMHA | FA3 | Flashinfer |
|---|---|---|---|
| 512 x 1 | 37638.6 | 39334.6 | 74966.6 |
| 512 x 2 | 54729.9 | 61680.4 | 114800.0 |
| 512 x 4 | 103388.8 | 113056.2 | 190688.4 |
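For reference, a quick sketch (plain Python, no GPU needed) that computes the relative slowdown from the table above. It assumes the reported values are latencies (lower is better), which matches the observation that Flashinfer is slower:

```python
# Benchmark numbers copied from the table above. The unit is assumed to be
# latency (lower is better), consistent with Flashinfer being reported as slower.
results = {
    "512 x 1": {"TRT-LLM FMHA": 37638.6, "FA3": 39334.6, "Flashinfer": 74966.6},
    "512 x 2": {"TRT-LLM FMHA": 54729.9, "FA3": 61680.4, "Flashinfer": 114800.0},
    "512 x 4": {"TRT-LLM FMHA": 103388.8, "FA3": 113056.2, "Flashinfer": 190688.4},
}

def slowdown(row: dict, baseline: str = "TRT-LLM FMHA") -> float:
    """Flashinfer latency relative to a baseline kernel (x-factor)."""
    return round(row["Flashinfer"] / row[baseline], 2)

for tokens, row in results.items():
    print(f"{tokens}: {slowdown(row)}x vs TRT-LLM FMHA, "
          f"{slowdown(row, 'FA3')}x vs FA3")
```

So Flashinfer is roughly 1.8-2.1x slower than TRT-LLM FMHA and 1.7-1.9x slower than FA3 across these shapes.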