The Phenomenon of Latency Jumps in Inference #8202
daliwang777 started this conversation in Ideas
When I used vLLM to accelerate inference for Llama 3 70B, I observed an interesting phenomenon: when each input request consists of a single token, there is a latency jump every 256 requests. Does anyone know why this happens?

Replies: 1 comment · 1 reply

- Because the default scheduler batch size is 256: vLLM's scheduler admits at most `max_num_seqs` sequences per step, and `max_num_seqs` defaults to 256, so requests beyond that cap wait for the next scheduling wave, which shows up as a periodic latency jump.
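A quick way to check this is to raise `max_num_seqs` above the default and see whether the period of the latency jump moves with it. The sketch below is minimal and assumes details not in this thread (the exact model name, `tensor_parallel_size`, and prompt count are illustrative):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: model name, tensor_parallel_size, and prompt count are
# illustrative assumptions, not details taken from this thread.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    max_num_seqs=512,  # scheduler cap on concurrent sequences; default is 256
)

prompts = ["Hi"] * 1024                 # many very short (~1-token) requests
params = SamplingParams(max_tokens=32)

# With the default max_num_seqs of 256, the scheduler admits these requests in
# waves of 256; if that cap is the cause, raising it to 512 should change
# where the latency jump appears.
outputs = llm.generate(prompts, params)
print(len(outputs))
```

The same cap can be adjusted on the OpenAI-compatible server with the `--max-num-seqs` flag.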