
[RFC]: change VLLM_DECODE_BLOCK_BUCKET_* design to fit small AND large batch sizes in one warmup #328

ccrhx4 opened this issue Sep 24, 2024 · 2 comments


ccrhx4 commented Sep 24, 2024

Motivation.

In the current design, the user cannot set VLLM_DECODE_BLOCK_BUCKET_MIN/STEP/MAX properly for small and large batch sizes at the same time.

For example, consider requests with input_len 512 and max_output_len 1024, and batch sizes ranging from 1 to 128.

For bs=1, the user needs to set VLLM_DECODE_BLOCK_BUCKET_* to the min/max block range of one sequence:

VLLM_DECODE_BLOCK_BUCKET_MIN=1x512/128=4
VLLM_DECODE_BLOCK_BUCKET_STEP=1x128/128=1
VLLM_DECODE_BLOCK_BUCKET_MAX=1x(512+1024)/128=12

For bs=128, the user needs to set:

VLLM_DECODE_BLOCK_BUCKET_MIN=128x512/128=512
VLLM_DECODE_BLOCK_BUCKET_STEP=128x128/128=128
VLLM_DECODE_BLOCK_BUCKET_MAX=128x(512+1024)/128=1536

Right now, it is not possible to set these environment variables so that both bs=1 and bs=128 are covered in one warmup.
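
To make the conflict concrete, here is a minimal Python sketch (my own illustration, not vLLM code; `block_bucket_range` is a hypothetical helper) that derives the VLLM_DECODE_BLOCK_BUCKET_* triple implied by a single batch size, assuming BLOCK_SIZE=128:

```python
BLOCK_SIZE = 128  # block size assumed throughout the examples above

def block_bucket_range(bs, input_len, max_output_len, step_len=128):
    # Hypothetical helper: derive the VLLM_DECODE_BLOCK_BUCKET_* triple
    # (MIN, STEP, MAX) that fits exactly one batch size.
    bucket_min = bs * input_len // BLOCK_SIZE
    bucket_step = bs * step_len // BLOCK_SIZE
    bucket_max = bs * (input_len + max_output_len) // BLOCK_SIZE
    return bucket_min, bucket_step, bucket_max

print(block_bucket_range(1, 512, 1024))    # (4, 1, 12)
print(block_bucket_range(128, 512, 1024))  # (512, 128, 1536)
# No single triple covers both ends: MIN=4, STEP=1, MAX=1536 would warm up
# on the order of 1,500 block buckets, while MIN=512 skips every shape
# that a bs=1 request actually needs.
```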

Proposed Change.

I am proposing to change VLLM_DECODE_BLOCK_BUCKET_* to VLLM_DECODE_SEQ_*:
VLLM_DECODE_SEQ_MIN: minimum seq length
VLLM_DECODE_SEQ_MAX: maximum seq length
VLLM_DECODE_SEQ_STEP: seq length step

When warming up graphs, vLLM would compute the graph shapes as:
(bs, total_block_number) = (bs, bs x (VLLM_DECODE_SEQ_MIN + VLLM_DECODE_SEQ_STEP x N) / BLOCK_SIZE)

For bs=1 and bs=128, the user can set VLLM_DECODE_SEQ_* as:

VLLM_DECODE_SEQ_MIN=512
VLLM_DECODE_SEQ_MAX=512+1024=1536
VLLM_DECODE_SEQ_STEP=128
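
As an illustration of the proposal (a sketch under the assumptions above; `decode_warmup_shapes` is not a real vLLM function), the warmup enumeration implied by the formula would look like:

```python
BLOCK_SIZE = 128

def decode_warmup_shapes(bs, seq_min, seq_max, seq_step):
    # Enumerate (bs, total_block_number) per the proposed formula:
    # total_block_number = bs * (seq_min + seq_step * N) / BLOCK_SIZE
    shapes = []
    seq_len = seq_min
    while seq_len <= seq_max:
        shapes.append((bs, bs * seq_len // BLOCK_SIZE))
        seq_len += seq_step
    return shapes

# With VLLM_DECODE_SEQ_MIN=512, VLLM_DECODE_SEQ_MAX=1536, VLLM_DECODE_SEQ_STEP=128:
print(decode_warmup_shapes(1, 512, 1536, 128))
# [(1, 4), (1, 5), (1, 6), ..., (1, 12)]       -- the bs=1 range from above
print(decode_warmup_shapes(128, 512, 1536, 128))
# [(128, 512), (128, 640), ..., (128, 1536)]   -- the bs=128 range from above
```

The same three settings thus cover both ends of the batch-size range, because the block count scales with bs instead of being fixed directly by the environment variables.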

Feedback Period.

No response

CC List.

@kzawora-intel Please kindly provide your feedback. Thanks.

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@michalkuligowski added the intel (Issues or PRs submitted by Intel) label on Sep 24, 2024
@michalkuligowski

@ccrhx4 Hi, please refer to #345, which we were working on, and see if that provides the required configuration abilities.


ccrhx4 commented Oct 8, 2024

Hi @michalkuligowski , I am afraid #345 is different from what I want to address here.

I understand that #345 provides a way to help people set the prompt/decode bucket configuration. But the issue here is that padding is excessive for small batch sizes when the vLLM decode batch bucket (VLLM_DECODE_BS_BUCKET_*) ranges from small to large batch sizes, e.g. [1, 128].
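
To put a rough number on the padding (my own back-of-the-envelope arithmetic based on the example above, not measurements from #345):

```python
BLOCK_SIZE = 128

# If VLLM_DECODE_BLOCK_BUCKET_MIN is sized for bs=128 (512 blocks), a bs=1
# request with a 512-token context is still padded to the bucket boundary.
needed_blocks = 1 * 512 // BLOCK_SIZE   # 4 blocks actually used
bucket_blocks = 512                     # smallest bucket when MIN targets bs=128
padding = 1 - needed_blocks / bucket_blocks
print(f"padding fraction at bs=1: {padding:.1%}")  # ~99.2%
```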
