Make repetition penalty faster #442

Open · ccrhx4 wants to merge 3 commits into habana_main
Conversation

@ccrhx4 commented Oct 29, 2024

This PR fixes the very slow sampling process when repetition penalty is set.

The fix includes:

  1. Enable pin_memory on HPU.
  2. Pad prompt tokens and output tokens so their shapes stay stable, avoiding recompilation (see the sketch after this list).
  3. Replace slow ops (boolean indexing) with shape-stable equivalents such as masked_fill.
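
To illustrate fixes 1 and 2, here is a minimal sketch, not the PR's actual diff (`pad_to_bucket` and the bucket size of 128 are made-up for illustration): padding token-id tensors up to fixed bucket boundaries keeps their shapes stable across steps, so the HPU graph compiler can reuse cached graphs instead of recompiling for every new sequence length, and allocating the staging tensor in pinned host memory speeds up the host-to-device copy.

```python
import torch

def pad_to_bucket(token_ids: list, bucket_size: int = 128,
                  pad_id: int = 0) -> torch.Tensor:
    """Pad a token-id list up to the next multiple of bucket_size.

    Shapes then only change at bucket granularity, so a graph-compiled
    backend (e.g. HPU) hits its compile cache instead of recompiling
    for every distinct sequence length.
    """
    padded_len = -(-len(token_ids) // bucket_size) * bucket_size  # ceil to bucket
    # Pinned host memory makes the later host-to-device copy faster
    # (requires an accelerator-enabled build of PyTorch).
    out = torch.full((padded_len,), pad_id, dtype=torch.long, pin_memory=True)
    out[:len(token_ids)] = torch.tensor(token_ids, dtype=torch.long)
    return out
```

The pad positions must of course be excluded (e.g. via a mask) when counting repeated tokens, so the padding never changes the penalty itself.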

Before the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%|5/5 [03:24<00:00, 40.99s/it]
Avg latency: 40.98862759781768 seconds
10% percentile latency: 11.699748958216514 seconds
25% percentile latency: 11.73845003999304 seconds
50% percentile latency: 11.801458386995364 seconds
75% percentile latency: 11.861465670051984 seconds
90% percentile latency: 99.46527566103033 seconds
99% percentile latency: 152.02756165561732 seconds

After the fix:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=None
Warming up...
Profiling iterations: 100%| 5/5 [00:57<00:00, 11.59s/it]
Avg latency: 11.58703240059549 seconds
10% percentile latency: 11.444069900200702 seconds
25% percentile latency: 11.511425047006924 seconds
50% percentile latency: 11.525146245025098 seconds
75% percentile latency: 11.556680046953261 seconds
90% percentile latency: 11.788318535778672 seconds
99% percentile latency: 11.927301629073918 seconds

Testing code is by: https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh

Uses masked_fill instead of boolean indexing; the third commit adds padding to prompt tokens and output tokens to reduce recompilation.
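
As a rough sketch of that masked_fill change (hypothetical names; the real change is in the commits above): boolean indexing such as `logits[seen_mask]` produces a tensor whose size depends on the data, forcing a recompile per distinct shape on graph-compiled backends like HPU, whereas masked_fill and torch.where keep every shape static.

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             seen_mask: torch.Tensor,
                             penalty: float) -> torch.Tensor:
    # Slow: seen = logits[seen_mask] materializes a tensor whose size
    # depends on how many tokens were seen -> dynamic shape -> recompile.
    # Fast: build a dense per-token penalty with masked_fill (static shape).
    penalties = torch.ones_like(logits).masked_fill(seen_mask, penalty)
    # Usual repetition-penalty rule: divide positive logits, multiply
    # negative ones, so repeated tokens always become less likely.
    return torch.where(logits > 0, logits / penalties, logits * penalties)
```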