Question about `use_tensor_cores = True or False` #520

sleepwalker2017 · 2024-10-10T02:52:05Z

Hi, I'm benchmarking flashinfer on H100, and I'm running attention for the decoding stage.

I use q_head = kv_head = 40, which is the standard attention for llama 13B.

I tried use_tensor_cores = True and False, I get nearly the same performance.

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49025722         40  1225643.1  1222881.5   1219938   1231234       4342.4  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           374625          1   374625.0   374625.0    374625    374625          0.0  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosE…

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49189726         40  1229743.2  1229681.5   1228833   1231042        495.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           379936          1   379936.0   379936.0    379936    379936          0.0  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::Mas…
      0.0             1600          1     1600.0     1600.0      1600      1600          0.0  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…

My question is:

Is this result reliable? If use_tensor_cores=True, it will invoke the prefill kernel?
I tested the tensor core usage for both kernels, but find they both uses tensor cores, why is that?

  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosEncodingMode)0, (unsigned int)2, (unsigned int)4, (unsigned int)8, (unsigned int)16, (unsigned int)1, (unsigned int)8, (flashinfer::PageStorage)0, __half, __half, __half, int>(T10 *, T13 *, flashinfer::paged_kv_t<T9, T11, T13>, flashinfer::kv_partition_info_t<T13>, T12 *, float *, bool *, bool, int, float, float, float, float) (74, 40, 1)x(16, 1, 8), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         0.11
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst       166688
    ---------------------------------------------------------------------- ----------- ------------

  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::MaskMode)0, (flashinfer::PosEncodingMode)0, (unsigned int)1, (unsigned int)8, (unsigned int)2, (unsigned int)1, (unsigned int)4, (flashinfer::PageStorage)0, __half, __half, float, __half, int>(T14 *, T14 *, T14 *, T10 *, flashinfer::paged_kv_t<T9, T11, T14>, T14 *, unsigned char *, T14 *, T14 *, T14 *, T13 *, float *, bool *, T14 *, bool, flashinfer::uint_fastdiv, int, float, float, float, float) (74, 1, 40)x(32, 1, 4), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         3.60
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst      4899840
    ---------------------------------------------------------------------- ----------- ------------

The text was updated successfully, but these errors were encountered:

yzh119 · 2024-10-10T06:58:18Z

Hi @sleepwalker2017 , thanks for doing the benchmark!

Yes, use_tensor_cores=True invokes prefill kernel, and it's reasonable that you get nearly the same performance because decode operations are IO bound.
That's interesting! I didn't manually call tensor cores in flashinfer decode kernels but I'm not sure if nvcc can do some clever optimizations that turns some reduction to tensor cores, so I checked the sass code of decode kernels (you can check sass by cuobjdump -sass *.o), and I can confirm there is NO HMMA instructions in decode kernels' generated sass (and if you check prefill kernels' sass there will be many of them). Perhaps ncu count some other operations in this pipe.

cyang49 · 2024-10-22T20:21:08Z

Hi @yzh119 I'm also looking into the decode kernel implementation. I tested it with llama3-8B size (num_qo_heads=32, num_kv_heads=8) and input batch size=64, seq_len=1024. q_data_type=torch.bfloat16, kv_data_type=torch.float8_e4m3.

I noticed that the BatchDecodePagedKV kernel runs slower than the BatchPrefill... (forcing it by setting use_tensor_cores=True) counterpart. I profiled the decode kernel with nsight compute, and also looked into the code. Here are some of my observations. Could you let me know if these are correct?

The compute type of the decode kernel is fixed at FP32. For my setting (Q:BF16, KV: FP8), the QKV are all casted to FP32 before the compute (in compute_qk function)
I observed a lot of overhead in the type cast instructions, which seems to be the main bottleneck of the decode kernel.
In contrast to the BatchPrefill... kernel with the same input types, there the compute is in BF16 MMA, and the QKV are casted to BF16 before MMA. However, the overhead seems to be smaller, as fast_dequant_f8f16x4 function seems to be a faster dequantization procedure.

My questions:

Why is fast_dequant_f8f16x4 not used in the decode kernel path?
Would it make sense to keep the compute type at lower precision (fp8) and avoid the dequantization overhead?

tdoublep mentioned this issue Oct 18, 2024

[Performance]: FLASHINFER backend is slower than FLASH_ATTN on H100 vllm-project/vllm#9471

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about `use_tensor_cores = True or False` #520

Question about `use_tensor_cores = True or False` #520

sleepwalker2017 commented Oct 10, 2024

yzh119 commented Oct 10, 2024 •

edited

Loading

cyang49 commented Oct 22, 2024 •

edited

Loading

Question about use_tensor_cores = True or False #520

Question about use_tensor_cores = True or False #520

Comments

sleepwalker2017 commented Oct 10, 2024

yzh119 commented Oct 10, 2024 • edited Loading

cyang49 commented Oct 22, 2024 • edited Loading

Question about `use_tensor_cores = True or False` #520

Question about `use_tensor_cores = True or False` #520

yzh119 commented Oct 10, 2024 •

edited

Loading

cyang49 commented Oct 22, 2024 •

edited

Loading