Hello @yzh119,

Currently, we are using two independent API calls for prefill and decode in a mixed batch setting. This makes defining a CUDA graph layout considerably harder. Ideally, if we could do both the prefill and decode attention computation in the prefill kernel, it would considerably simplify the CUDA graph layout. However, the main barrier right now is that we don't have explicit control over when split-KV is used. With mixed batches, split-KV appears to be beneficial in most cases, but it gets disabled for certain batch compositions, which significantly hurts latency. Would it be possible to add an optional override knob for this? Thanks!
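For illustration, here is a minimal sketch of the override semantics I have in mind. This is not FlashInfer's actual scheduler; the function name, the heuristic, and the `override` parameter are all hypothetical. The idea is simply: `None` keeps the existing heuristic, while `True`/`False` forces the choice.

```python
from typing import Optional

# Hypothetical planner sketch (NOT FlashInfer's real scheduler logic):
# shows how an explicit knob could bypass the split-KV heuristic.
def choose_split_kv(num_ctas: int, num_sms: int, kv_len: int,
                    override: Optional[bool] = None) -> bool:
    """Decide whether to split the KV dimension across CTAs.

    override: None -> use the heuristic; True/False -> force the choice.
    """
    if override is not None:
        return override
    # Assumed heuristic: split only when the grid underfills the GPU and
    # the KV sequence is long enough to amortize the extra reduction pass.
    return num_ctas < num_sms and kv_len >= 256

# A mixed batch (a few prefill CTAs plus many decode rows) can tip the
# heuristic either way; the override pins it.
print(choose_split_kv(num_ctas=48, num_sms=108, kv_len=4096))                   # heuristic -> True
print(choose_split_kv(num_ctas=200, num_sms=108, kv_len=4096))                  # heuristic -> False
print(choose_split_kv(num_ctas=200, num_sms=108, kv_len=4096, override=True))  # forced on
```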
Hi @AgrawalAmey, actually I found that our scheduler can be further optimized so that there is no wave quantization, and I'm working on a refactor for that. After that change, I expect split-KV to always be enabled.
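For context, wave quantization refers to CTAs launching in discrete waves across the GPU's SMs: a grid that is slightly larger than a multiple of the SM count wastes most of its last wave. A back-of-the-envelope illustration (the SM count of 108 is just an example, e.g. an A100):

```python
import math

def wave_efficiency(num_ctas: int, num_sms: int) -> float:
    """Fraction of SM-waves doing useful work, assuming CTAs launch in full waves."""
    waves = math.ceil(num_ctas / num_sms)
    return num_ctas / (waves * num_sms)

# 108 CTAs fill one wave exactly; 109 CTAs need a second, nearly empty wave.
print(wave_efficiency(108, 108))  # 1.0
print(wave_efficiency(109, 108))  # ~0.50
```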