Motivation.
In the current design, the user cannot set VLLM_DECODE_BLOCK_BUCKET_MIN/MAX/STEP properly for small and large batch sizes at the same time.
For example, consider requests with input_len 512 and max_output_len 1024, and batch sizes ranging from 1 to 128.
For bs=1, the user needs to set VLLM_DECODE_BLOCK_BUCKET_* to the min,max of one sequence.
For bs=128, the user needs to set VLLM_DECODE_BLOCK_BUCKET_* to the min,max of 128 sequences.
Right now, it is not possible to set these environment variables for both bs=1 and bs=128 in one warmup.
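To make the conflict concrete, here is a small sketch (BLOCK_SIZE = 128 is an assumed value; the per-sequence numbers follow from the input_len 512 / max_output_len 1024 example above):

```python
# Sketch of why one VLLM_DECODE_BLOCK_BUCKET_* range cannot serve both
# batch sizes well. BLOCK_SIZE = 128 is an assumption.
BLOCK_SIZE = 128
min_seq_len = 512           # input_len
max_seq_len = 512 + 1024    # input_len + max_output_len

min_blocks = min_seq_len // BLOCK_SIZE   # 4 blocks per sequence
max_blocks = max_seq_len // BLOCK_SIZE   # 12 blocks per sequence

for bs in (1, 128):
    print(f"bs={bs}: needs block bucket range [{bs * min_blocks}, {bs * max_blocks}]")
# bs=1:   [4, 12]
# bs=128: [512, 1536]
# A single MIN/MAX/STEP covering [4, 1536] either pads bs=1 decodes
# heavily (coarse STEP) or warms up far too many graphs (fine STEP).
```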
Proposed Change.
I am proposing to change VLLM_DECODE_BLOCK_BUCKET_* to VLLM_DECODE_SEQ_*:
- VLLM_DECODE_SEQ_MIN: minimum sequence length
- VLLM_DECODE_SEQ_MAX: maximum sequence length
- VLLM_DECODE_SEQ_STEP: sequence length step
When warming up graphs, vLLM would compute the graph shapes as:
(bs, total_block_number) = (bs, bs * (VLLM_DECODE_SEQ_MIN + VLLM_DECODE_SEQ_STEP * N) / BLOCK_SIZE)
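A minimal sketch of how warmup shapes could be derived under this proposal (the SEQ_* values and BLOCK_SIZE below are illustrative assumptions, not part of the proposal):

```python
# Illustrative sketch of the proposed shape computation; SEQ_* values
# and BLOCK_SIZE are assumptions, not values from the proposal.
BLOCK_SIZE = 128
SEQ_MIN, SEQ_MAX, SEQ_STEP = 512, 1536, 256

def decode_warmup_shapes(batch_sizes):
    shapes = []
    for bs in batch_sizes:
        for seq_len in range(SEQ_MIN, SEQ_MAX + 1, SEQ_STEP):
            # (bs, total_block_number) per the formula above
            shapes.append((bs, bs * seq_len // BLOCK_SIZE))
    return shapes

print(decode_warmup_shapes([1, 128]))
# bs=1   -> (1, 4), (1, 6), ..., (1, 12)
# bs=128 -> (128, 512), (128, 768), ..., (128, 1536)
```

The key property is that the total block number now scales with bs automatically, so one SEQ_MIN/MAX/STEP setting covers every batch size.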
For bs=1 and bs=128, the user can set VLLM_DECODE_SEQ_* as:
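(Hypothetical values for illustration only; these are not from the original proposal, but are consistent with the input_len 512 / max_output_len 1024 example.)

```python
import os

# Hypothetical settings: one range serves both bs=1 and bs=128, since
# total_block_number scales with bs under the proposed scheme.
os.environ["VLLM_DECODE_SEQ_MIN"] = "512"    # input_len
os.environ["VLLM_DECODE_SEQ_MAX"] = "1536"   # input_len + max_output_len
os.environ["VLLM_DECODE_SEQ_STEP"] = "128"   # warm up a graph every 128 tokens
```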
Feedback Period.
No response
CC List.
@kzawora-intel Please kindly provide your feedback. Thanks.
Any Other Things.
No response
Hi @michalkuligowski, I am afraid #345 is different from what I want to address here.
I understand #345 provides a way to help people configure the prompt/decode bucket settings. But the issue here is that padding overhead becomes excessive for small batch sizes when the vLLM decode batch bucket range (VLLM_DECODE_BS_BUCKET_*) spans from small to large batch sizes, e.g. [1, 128].