Hello @yzh119,

Currently, we are using two independent API calls for prefill and decode in a mixed batch setting. This makes defining a CUDA graph layout considerably harder. Ideally, if we could do both the prefill and decode attention computation in the prefill kernel, it would considerably simplify the CUDA graph layout. However, the main barrier right now is that we don't have explicit control over when split-KV is used. With mixed batches, split-KV appears to be beneficial in most cases, but it gets disabled for certain batch compositions, which significantly hurts latency. Would it be possible to add an optional override knob for this? Thanks!
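For illustration, here is a minimal sketch of the override semantics I have in mind. This is not FlashInfer's actual scheduler; the function name, the heuristic, and the `override` parameter are all hypothetical. The idea is simply: `None` keeps the existing heuristic, while `True`/`False` forces the choice.

```python
from typing import Optional

# Hypothetical planner sketch (NOT FlashInfer's real scheduler logic):
# shows how an explicit knob could bypass the split-KV heuristic.
def choose_split_kv(num_ctas: int, num_sms: int, kv_len: int,
                    override: Optional[bool] = None) -> bool:
    """Decide whether to split the KV dimension across CTAs.

    override: None -> use the heuristic; True/False -> force the choice.
    """
    if override is not None:
        return override
    # Assumed heuristic: split only when the grid underfills the GPU and
    # the KV sequence is long enough to amortize the extra reduction pass.
    return num_ctas < num_sms and kv_len >= 256

# A mixed batch (a few prefill CTAs plus many decode rows) can tip the
# heuristic either way; the override pins it.
print(choose_split_kv(num_ctas=48, num_sms=108, kv_len=4096))                   # heuristic -> True
print(choose_split_kv(num_ctas=200, num_sms=108, kv_len=4096))                  # heuristic -> False
print(choose_split_kv(num_ctas=200, num_sms=108, kv_len=4096, override=True))  # forced on
```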
Hi @AgrawalAmey, actually I found that our scheduler can be further optimized so that there is no wave quantization, and I'm working on a refactor for that. After that change, I expect split-KV to always be enabled.
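For context, wave quantization refers to CTAs launching in discrete waves across the GPU's SMs: a grid that is slightly larger than a multiple of the SM count wastes most of its last wave. A back-of-the-envelope illustration (the SM count of 108 is just an example, e.g. an A100):

```python
import math

def wave_efficiency(num_ctas: int, num_sms: int) -> float:
    """Fraction of SM-waves doing useful work, assuming CTAs launch in full waves."""
    waves = math.ceil(num_ctas / num_sms)
    return num_ctas / (waves * num_sms)

# 108 CTAs fill one wave exactly; 109 CTAs need a second, nearly empty wave.
print(wave_efficiency(108, 108))  # 1.0
print(wave_efficiency(109, 108))  # ~0.50
```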