Commit

Remove redundant set_active_loras call during warmup (#413)
CUDA uses `capture` for warmup runs and `execute_model` for actual runs, and each path calls `set_active_loras` only once. HPU uses `execute_model` for both warmup and actual runs. Since `execute_model` already takes care of `set_active_loras` internally, the redundant call in the warmup path can be removed.
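
To make the call flow concrete, here is a minimal sketch (a hypothetical class with simplified bodies, not the actual `HPUModelRunner` code) of why a `set_active_loras` call in the warmup path duplicates work that `execute_model` already performs:

# Hypothetical, simplified sketch; not the real vLLM HPUModelRunner.
class ModelRunnerSketch:

    def set_active_loras(self, lora_requests, lora_mapping):
        # Activate LoRA adapters for the tokens scheduled in this step.
        print(f"set_active_loras with {len(lora_mapping)} mapping entries")

    def execute_model(self, seq_group_metadata_list):
        # execute_model builds the per-step LoRA mapping and activates it
        # itself, so callers do not need to do this beforehand.
        lora_mapping = [0] * len(seq_group_metadata_list)
        self.set_active_loras(set(), lora_mapping)
        # ... run the forward pass on the dummy or real inputs ...

    def warmup_scenario(self, dummy_seqs):
        # HPU warmup runs go through execute_model, so an extra
        # set_active_loras call here (as in the removed code below) would
        # only repeat what execute_model does anyway.
        self.execute_model(dummy_seqs)

Running `ModelRunnerSketch().warmup_scenario([None] * 4)` activates the LoRA mapping exactly once, from inside `execute_model`.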

This special handling is redundant and incorrect: it causes out-of-bound slicing in the decode phase, as reported in #405.

This PR removes the special handling of the `set_active_loras` call from warmup runs and resolves the issue in #405.
SanjuCSudhakaran authored Oct 22, 2024
1 parent 0cf5261 commit 3af4b6c
Showing 1 changed file with 0 additions and 6 deletions.
6 changes: 0 additions & 6 deletions vllm/worker/hpu_model_runner.py
@@ -1354,12 +1354,6 @@ def warmup_scenario(self,
         ]
         self.profiler.start('internal', scenario_name)
         times = 3 if use_graphs or is_pt_profiler_run else 1
-        if self.lora_config and not is_lora_profile_run:
-            lora_mapping = LoRAMapping(
-                **dict(index_mapping=[0] * batch_size * seq_len,
-                       prompt_mapping=[0] * batch_size * seq_len,
-                       is_prefill=is_prompt))
-            self.set_active_loras(set(), lora_mapping)
         if is_prompt:
             seqs = [
                 self.create_dummy_seq_group_metadata(
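
For context on the out-of-bound slicing, the removed block above sizes the dummy mapping as `batch_size * seq_len` for both prefill and decode. Below is a minimal sketch of the size mismatch, assuming one mapping entry per token scheduled in the step; the helper `expected_mapping_len` is illustrative, not vLLM code:

# Illustrative sketch only; assumes one LoRA mapping entry per scheduled token.
batch_size, seq_len = 4, 128

def expected_mapping_len(is_prompt: bool) -> int:
    # Prefill schedules seq_len tokens per sequence; decode schedules one.
    tokens_per_seq = seq_len if is_prompt else 1
    return batch_size * tokens_per_seq

# What the removed warmup code built for both phases:
warmup_mapping = [0] * batch_size * seq_len

print(len(warmup_mapping))                      # 512
print(expected_mapping_len(is_prompt=True))     # 512: matches prefill
print(expected_mapping_len(is_prompt=False))    # 4: decode needs far fewer

# Indexing decode-sized buffers with the oversized warmup mapping reaches
# positions past the decode batch, which matches the out-of-bound slicing
# described in #405.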
