Motivation.
In the current design, the user cannot set VLLM_DECODE_BLOCK_BUCKET_MIN/MAX/STEP properly for small and large batch sizes at the same time.
For example, consider requests with input_len 512 and max_output_len 1024, and batch sizes ranging from 1 to 128.
For bs=1, the user needs to set VLLM_DECODE_BLOCK_BUCKET_* to the min,max of one sequence.
For bs=128, the user needs to set VLLM_DECODE_BLOCK_BUCKET_* to the min,max of 128 sequences.
Right now, it is not possible to set these environment variables for both bs=1 and bs=128 in one warmup.
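To make the conflict concrete, here is a small sketch (BLOCK_SIZE = 128 is an assumed value; the per-sequence numbers follow from the input_len 512 / max_output_len 1024 example above):

```python
# Sketch of why one VLLM_DECODE_BLOCK_BUCKET_* range cannot serve both
# batch sizes well. BLOCK_SIZE = 128 is an assumption.
BLOCK_SIZE = 128
min_seq_len = 512           # input_len
max_seq_len = 512 + 1024    # input_len + max_output_len

min_blocks = min_seq_len // BLOCK_SIZE   # 4 blocks per sequence
max_blocks = max_seq_len // BLOCK_SIZE   # 12 blocks per sequence

for bs in (1, 128):
    print(f"bs={bs}: needs block bucket range [{bs * min_blocks}, {bs * max_blocks}]")
# bs=1:   [4, 12]
# bs=128: [512, 1536]
# A single MIN/MAX/STEP covering [4, 1536] either pads bs=1 decodes
# heavily (coarse STEP) or warms up far too many graphs (fine STEP).
```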
Proposed Change.
I am proposing to change VLLM_DECODE_BLOCK_BUCKET_* to VLLM_DECODE_SEQ_*:
- VLLM_DECODE_SEQ_MIN: minimum sequence length
- VLLM_DECODE_SEQ_MAX: maximum sequence length
- VLLM_DECODE_SEQ_STEP: sequence length step
When warming up graphs, vLLM would compute the graph shapes as:
(bs, total_block_number) = (bs, bs * (VLLM_DECODE_SEQ_MIN + VLLM_DECODE_SEQ_STEP * N) / BLOCK_SIZE)
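A minimal sketch of how warmup shapes could be derived under this proposal (the SEQ_* values and BLOCK_SIZE below are illustrative assumptions, not part of the proposal):

```python
# Illustrative sketch of the proposed shape computation; SEQ_* values
# and BLOCK_SIZE are assumptions, not values from the proposal.
BLOCK_SIZE = 128
SEQ_MIN, SEQ_MAX, SEQ_STEP = 512, 1536, 256

def decode_warmup_shapes(batch_sizes):
    shapes = []
    for bs in batch_sizes:
        for seq_len in range(SEQ_MIN, SEQ_MAX + 1, SEQ_STEP):
            # (bs, total_block_number) per the formula above
            shapes.append((bs, bs * seq_len // BLOCK_SIZE))
    return shapes

print(decode_warmup_shapes([1, 128]))
# bs=1   -> (1, 4), (1, 6), ..., (1, 12)
# bs=128 -> (128, 512), (128, 768), ..., (128, 1536)
```

The key property is that the total block number now scales with bs automatically, so one SEQ_MIN/MAX/STEP setting covers every batch size.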
For bs=1 and bs=128, the user can set VLLM_DECODE_SEQ_* as:
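(Hypothetical values for illustration only; these are not from the original proposal, but are consistent with the input_len 512 / max_output_len 1024 example.)

```python
import os

# Hypothetical settings: one range serves both bs=1 and bs=128, since
# total_block_number scales with bs under the proposed scheme.
os.environ["VLLM_DECODE_SEQ_MIN"] = "512"    # input_len
os.environ["VLLM_DECODE_SEQ_MAX"] = "1536"   # input_len + max_output_len
os.environ["VLLM_DECODE_SEQ_STEP"] = "128"   # warm up a graph every 128 tokens
```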
Feedback Period.
No response
CC List.
@kzawora-intel Please kindly provide your feedback. Thanks.
Any Other Things.
No response
Hi @michalkuligowski, I am afraid #345 is different from what I want to address here.
I understand #345 provides a way to help people configure the prompt/decode bucket settings. But the issue here is that padding overhead becomes excessive for small batch sizes when the vLLM decode batch bucket range (VLLM_DECODE_BS_BUCKET_*) spans from small to large batch sizes, e.g. [1, 128].