FlashInfer generating NANs on A100 GPU #574

dbarbuzzi · 2024-10-30T16:58:18Z

When running a particular vLLM test on an A100 GPU, FlashInfer appears to generate NaNs in one specific scenario. The test fails in that scenario on an A100 while passing every scenario on both an H100 and an L4. We are using flashinfer-0.1.6+cu124torch2.4.

The test that fails is test_flashinfer_decode_with_paged_fp8_kv.

The failure occurs only when three parameters take the following values at the same time:

block_size = 32
head_size = 256
num_heads = (32, 8)
- 32 gets assigned to num_query_heads
- 8 gets assigned to num_kv_heads

If any of these parameters takes one of its other possible values, the test passes on the A100.
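The failing combination can be written down as a small predicate, which may be handy for isolating or marking the case in a parametrized test. This is only a sketch: `Scenario` and `is_failing_on_a100` are hypothetical names that do not come from vLLM or FlashInfer, and the non-failing value used in the usage lines is a placeholder, not one of the test's actual parameter values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One parameter combination of the test (hypothetical container)."""
    block_size: int
    head_size: int
    num_query_heads: int
    num_kv_heads: int


# The single combination reported as failing on the A100.
FAILING = Scenario(block_size=32, head_size=256, num_query_heads=32, num_kv_heads=8)


def is_failing_on_a100(scenario: Scenario) -> bool:
    """Return True only when all parameters match the reported failure."""
    return scenario == FAILING


# Usage: changing any one parameter leaves the failing scenario.
# block_size=16 is a placeholder value, not taken from the test.
print(is_failing_on_a100(FAILING))
print(is_failing_on_a100(Scenario(16, 256, 32, 8)))
```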

The failure message indicates that NaNs are being generated in this scenario:

AssertionError: Tensor-likes are not close!

Mismatched elements: 1024 / 24576 (4.2%)
Greatest absolute difference: nan at index (0, 0, 0) (up to 0.02 allowed)
Greatest relative difference: nan at index (0, 0, 0) (up to 0.01 allowed)
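A NaN difference always fails a tolerance check, because every ordered comparison against NaN is false. The sketch below illustrates this in plain Python. `within_tolerance` is a hypothetical helper that mimics the usual `|actual - expected| <= atol + rtol * |expected|` rule with the tolerances shown in the message; it is not the comparison code the test itself runs.

```python
import math


def within_tolerance(actual: float, expected: float,
                     atol: float = 0.02, rtol: float = 0.01) -> bool:
    """Closeness check of the form |a - e| <= atol + rtol * |e| (illustrative)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


nan = float("nan")

# A small finite error passes the check.
print(within_tolerance(1.0, 1.01))

# A NaN output never passes: the difference is NaN, and NaN <= x is False.
print(math.isnan(abs(nan - 1.0)))
print(within_tolerance(nan, 1.0))
```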

The error message is the same across failures. The only variations are the total number of elements (either 24576 or 32768; the number of mismatched elements is always 1024) and the index (either (0, 0, 0) or (3, 0, 0)).
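The element counts can be related to the failing parameters with a little arithmetic. The sketch below assumes the compared output has shape (num_seqs, num_query_heads, head_size), which the report does not state. Under that assumption the two totals correspond to 3 and 4 sequences, and 1024 mismatched elements equals the 4 query heads that share one KV head times the head size. That is one possible reading of the numbers, not a confirmed diagnosis.

```python
HEAD_SIZE = 256
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8

# Elements per sequence under the assumed (num_seqs, num_query_heads, head_size) shape.
PER_SEQ = NUM_QUERY_HEADS * HEAD_SIZE

# Query heads that share a single KV head.
GROUP_SIZE = NUM_QUERY_HEADS // NUM_KV_HEADS


def num_seqs(total_elements: int) -> int:
    """Number of sequences implied by a total element count (assumed shape)."""
    if total_elements % PER_SEQ != 0:
        raise ValueError("total is not a multiple of num_query_heads * head_size")
    return total_elements // PER_SEQ


def mismatch_percent(mismatched: int, total: int) -> float:
    """Mismatched share as a percentage, rounded to one decimal place."""
    return round(100 * mismatched / total, 1)


print(num_seqs(24576), num_seqs(32768))
print(GROUP_SIZE * HEAD_SIZE)
print(mismatch_percent(1024, 24576))
```

Under the same assumption, an index of (3, 0, 0) can only occur in the 32768-element case, since it requires at least four sequences.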

yzh119 · 2024-10-30T19:54:32Z

Hi @dbarbuzzi, thanks for reporting this issue. It appears for fp8, is that correct?

dbarbuzzi · 2024-10-30T20:15:02Z

> Hi @dbarbuzzi, thanks for reporting this issue. It appears for fp8, is that correct?

Yes, that is correct; I should have clarified that originally.

yzh119 self-assigned this Oct 30, 2024
