Enable LoRA support for HPU #170

Merged: 10 commits merged into habana_main from enable_lora_v1 on Aug 20, 2024

Conversation

@SanjuCSudhakaran commented on Aug 9, 2024

This PR enables LoRA support on HPU.

  • Implemented a custom BGMV for the LoRA modules using the index-select operator (see the sketch below).
  • Both single-card and multi-card scenarios have been tested.
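
A minimal sketch of this index-select approach, assuming stacked LoRA A/B weight tensors and a per-token LoRA index (the function name, shapes, and scale argument are illustrative, not the actual kernel in this PR):

```python
import torch

def bgmv_index_select(x, lora_a_stacked, lora_b_stacked, lora_indices, scale=1.0):
    """Apply a per-token LoRA delta via index_select and batched matmul.

    x:              [num_tokens, hidden_dim] input activations
    lora_a_stacked: [num_loras, hidden_dim, rank] stacked LoRA A weights
    lora_b_stacked: [num_loras, rank, out_dim]    stacked LoRA B weights
    lora_indices:   [num_tokens] LoRA id assigned to each token (assumed >= 0)
    """
    # Gather each token's A/B weights with index_select; this avoids a custom
    # gather kernel and keeps shapes static, which suits HPU graph capture.
    a = torch.index_select(lora_a_stacked, 0, lora_indices)  # [T, hidden, rank]
    b = torch.index_select(lora_b_stacked, 0, lora_indices)  # [T, rank, out]
    shrunk = torch.bmm(x.unsqueeze(1), a)                    # [T, 1, rank]
    expanded = torch.bmm(shrunk, b).squeeze(1)               # [T, out]
    return expanded * scale
```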

@vivekgoe commented on Aug 9, 2024

@kzawora-intel please review this PR.

@vivekgoe left a comment


@SanjuCSudhakaran @hlahkar please check review comments. Thanks.

Review threads (now resolved) were opened on the following files:
  • tests/lora/test_multilora_hpu.py
  • vllm/lora/layers.py
  • vllm/lora/models.py
  • vllm/worker/habana_model_runner.py
@JHLEE17 commented on Aug 12, 2024

I ran some multi-LoRA tests on this branch and wanted to share a few additional issues I encountered, in case it helps.

[1] In tests/lora/test_llama_hpu.py, setting VLLM_SKIP_WARMUP=false seems to cause an error on multi-card setups; could you please verify this? My SynapseAI SDK version is 1.16, so the issue might be related to running with multiple cards.

Error message with 'VLLM_SKIP_WARMUP=false pytest tests/lora/test_llama_hpu.py':

============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056375036 KB

(RayWorkerWrapper pid=1491762) ============================= HABANA PT BRIDGE CONFIGURATION ===========================
(RayWorkerWrapper pid=1491762) PT_HPU_LAZY_MODE = 1
(RayWorkerWrapper pid=1491762) PT_RECIPE_CACHE_PATH =
(RayWorkerWrapper pid=1491762) PT_CACHE_FOLDER_DELETE = 0
(RayWorkerWrapper pid=1491762) PT_HPU_RECIPE_CACHE_CONFIG =
(RayWorkerWrapper pid=1491762) PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
(RayWorkerWrapper pid=1491762) PT_HPU_LAZY_ACC_PAR_MODE = 1
(RayWorkerWrapper pid=1491762) PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
(RayWorkerWrapper pid=1491762) ---------------------------: System Configuration :---------------------------
(RayWorkerWrapper pid=1491762) Num CPU Cores : 160
(RayWorkerWrapper pid=1491762) CPU RAM : 1056375036 KB
(RayWorkerWrapper pid=1491762) ------------------------------------------------------------------------------
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.73it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.20it/s]

Process Process-3:
Traceback (most recent call last):
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/tests/lora/test_llama_hpu.py", line 40, in _test_llama_lora
llm = vllm.LLM(MODEL_PATH,
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/entrypoints/llm.py", line 155, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/engine/llm_engine.py", line 455, in from_engine_args
engine = cls(
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/engine/llm_engine.py", line 265, in init
self._initialize_kv_caches()
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/engine/llm_engine.py", line 377, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
self._run_workers("initialize_cache",
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/executor/ray_habana_executor.py", line 321, in _run_workers
self.driver_worker.execute_method(method, *driver_args,
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/worker_base.py", line 383, in execute_method
raise e
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/worker_base.py", line 374, in execute_method
return executor(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_worker.py", line 190, in initialize_cache
self._warm_up_model()
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_worker.py", line 208, in _warm_up_model
self.model_runner.warmup_model(self.hpu_cache[0])
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1268, in warmup_model
self.warmup_graphs(
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1213, in warmup_graphs
self.warmup_scenario(batch_size, seq_len, is_prompt, kv_caches)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1126, in warmup_scenario
self.execute_model(inputs, kv_caches)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1521, in execute_model
hidden_states = self.model.forward(
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
return wrapped_hpugraph_forward(
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/habana_frameworks/torch/hpu/graphs.py", line 594, in wrapped_hpugraph_forward
outputs = orig_fwd(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 193, in forward
hidden_states = self.model(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1523, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/models/llama.py", line 423, in forward
model_output = self.model(input_ids, positions, kv_caches,
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/models/llama.py", line 313, in forward
hidden_states = self.get_input_embeddings(input_ids)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/models/llama.py", line 298, in get_input_embeddings
return self.embed_tokens(input_ids)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/lora/layers.py", line 387, in forward
full_output = self.base_layer.forward(
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/layers/vocab_parallel_embedding.py", line 365, in forward
output = tensor_model_parallel_all_reduce(output_parallel)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
return get_tp_group().all_reduce(input_)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/distributed/parallel_state.py", line 317, in all_reduce
return hpu_comm.all_reduce(input_)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/distributed/device_communicators/hpu_communicator.py", line 26, in all_reduce
dist.all_reduce(x, group=self.group)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1993, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: collective nonSFG is not supported during hpu graph capturing

2024-08-12 07:56:46,050 ERROR worker.py:409 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerWrapper.execute_method() (pid=1491762, ip=100.83.156.109, actor_id=71fbdd983c7fa33407b5196001000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f8bb5d21810>)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/worker_base.py", line 383, in execute_method
raise e
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/worker_base.py", line 374, in execute_method
return executor(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_worker.py", line 190, in initialize_cache
self._warm_up_model()
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_worker.py", line 208, in _warm_up_model
self.model_runner.warmup_model(self.hpu_cache[0])
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1268, in warmup_model
self.warmup_graphs(
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1213, in warmup_graphs
self.warmup_scenario(batch_size, seq_len, is_prompt, kv_caches)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1126, in warmup_scenario
self.execute_model(inputs, kv_caches)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1521, in execute_model
hidden_states = self.model.forward(
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
return wrapped_hpugraph_forward(
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/habana_frameworks/torch/hpu/graphs.py", line 594, in wrapped_hpugraph_forward
outputs = orig_fwd(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 193, in forward
hidden_states = self.model(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1523, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/models/llama.py", line 423, in forward
model_output = self.model(input_ids, positions, kv_caches,
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/models/llama.py", line 313, in forward
hidden_states = self.get_input_embeddings(input_ids)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/models/llama.py", line 298, in get_input_embeddings
return self.embed_tokens(input_ids)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/lora/layers.py", line 387, in forward
full_output = self.base_layer.forward(
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/layers/vocab_parallel_embedding.py", line 365, in forward
output = tensor_model_parallel_all_reduce(output_parallel)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
return get_tp_group().all_reduce(input_)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/distributed/parallel_state.py", line 317, in all_reduce
return hpu_comm.all_reduce(input_)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/distributed/device_communicators/hpu_communicator.py", line 26, in all_reduce
dist.all_reduce(x, group=self.group)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1993, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: collective nonSFG is not supported during hpu graph capturing
/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

[2] In tests/lora/test_multilora_hpu.py, not setting prompt_logprobs=1 in the sampling params seems to avoid a dimension error (a sketch of this workaround follows the traceback below). However, even with this workaround, I still hit the same error as in [1] on 2x and 4x multi-card setups. Additionally, in the 1x setup, 1 out of 6 prompts does not match between generated_texts and expected_output.

Error message with 'VLLM_SKIP_WARMUP=false pytest tests/lora/test_multilora_hpu.py':

-------------------------------------------------------- Captured stderr call --------------------------------------------------------
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056375036 KB

Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.28s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00, 1.20s/it]

Process Process-1:
Traceback (most recent call last):
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/tests/lora/test_multilora_hpu.py", line 111, in _test_llama_multilora
results = process_requests(engine, test_prompts)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/tests/lora/test_multilora_hpu.py", line 79, in process_requests
request_outputs: List[RequestOutput] = engine.step()
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/engine/llm_engine.py", line 925, in step
output = self.model_executor.execute_model(
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/executor/habana_executor.py", line 150, in execute_model
output = self.driver_worker.execute_model(execute_model_req)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/worker_base.py", line 272, in execute_model
output = self.model_runner.execute_model(
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 1534, in execute_model
logits = self.model.compute_logits(hidden_states,
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/worker/habana_model_runner.py", line 205, in compute_logits
return self.model.compute_logits(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/models/llama.py", line 430, in compute_logits
logits = self.logits_processor(self.lm_head, hidden_states,
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniconda3/envs/vllm-fork/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/lora/layers.py", line 1311, in forward
return type(self.base_layer).forward(self, *args, **kwargs)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/model_executor/layers/logits_processor.py", line 61, in forward
logits = self._get_logits(hidden_states, lm_head, embedding_bias)
File "/home/sdp/works/jongho/HabanaAI/vllm-fork/vllm/lora/layers.py", line 1293, in _get_logits
logits[:,
RuntimeError: The expanded size of the tensor (59) must match the existing size (128) at non-singleton dimension 0. Target sizes: [59, 256]. Tensor sizes: [128, 256]
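
For reference, here is a minimal sketch of the workaround described in [2], assuming the sampling params are built roughly like this (temperature and max_tokens are placeholder values, not the actual test settings):

```python
from vllm import SamplingParams

# Requesting prompt logprobs triggers the dimension error in _get_logits on this branch:
failing_params = SamplingParams(temperature=0.0, max_tokens=128, prompt_logprobs=1)

# Workaround: omit prompt_logprobs entirely.
working_params = SamplingParams(temperature=0.0, max_tokens=128)
```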

@SanjuCSudhakaran (Author) commented

> I ran some multi-LoRA tests on this branch and wanted to share a few additional issues I encountered, in case it helps.
>
> [1] In tests/lora/test_llama_hpu.py, setting VLLM_SKIP_WARMUP=false seems to cause an error on multi-card setups; could you please verify this? My SynapseAI SDK version is 1.16, so the issue might be related to running with multiple cards.
>
> Error message with 'VLLM_SKIP_WARMUP=false pytest tests/lora/test_llama_hpu.py'
>
> [2] In tests/lora/test_multilora_hpu.py, not setting prompt_logprobs=1 in the sampling params seems to avoid a dimension error. However, even with this workaround, I still hit the same error as in [1] on 2x and 4x multi-card setups. Additionally, in the 1x setup, 1 out of 6 prompts does not match between generated_texts and expected_output.
>
> Error message with 'VLLM_SKIP_WARMUP=false pytest tests/lora/test_multilora_hpu.py'

Thank you @JHLEE17 for identifying these issues. We had also identified them internally and are currently working on a fix.

@vivekgoe commented

@afierka-intel @madamczykhabana please help review this PR.

@afierka-intel left a comment


I have a few clarification questions and suggestions to reorganize the code a little.

Review threads (now resolved) were opened on the following files:
  • vllm/worker/habana_model_runner.py
  • vllm/lora/layers.py
@afierka-intel left a comment


Thank you for applying all my suggestions. The code looks good to me now. I'll merge the PR once you resolve the merge conflicts and all static checks pass.

@afierka-intel left a comment


Thank you for resolving the conflicts! All static checks and internal e2e tests passed. Approving the PR.

@afierka-intel merged commit 55ea658 into habana_main on Aug 20, 2024. 13 checks passed.
@vivekgoe commented

@JHLEE17 The PR is merged to habana_main. The 1x, 2x, and 4x tests in tests/lora/test_llama_hpu.py and tests/lora/test_multilora_hpu.py are all passing, with the following caveats (a usage sketch follows this list):

  • PT_HPU_ENABLE_LAZY_COLLECTIVES must be set to true for tensor-parallel inference with HPU Graphs (as described in vllm-fork/README_GAUDI.md).
  • The tests run in float32 precision, where the 1x, 2x, and 4x configurations can all use the same reference output. With bfloat16 there are small differences between the 1x and the 2x/4x outputs; this also happens without LoRA, and we are checking whether it can be fixed.
  • The current implementation of the BGMV operator is not fully optimized and impacts overall performance with LoRA; we are working on optimizing it further.
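
The following is a hedged sketch of how the first caveat applies when launching a tensor-parallel LoRA run on HPU; the model name, LoRA limits, and tensor-parallel size are placeholders rather than values taken from the tests:

```python
import os

# Must be set before the engine is created; required for tensor-parallel
# inference with HPU Graphs (see vllm-fork/README_GAUDI.md).
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model
    enable_lora=True,
    max_loras=2,
    max_lora_rank=8,
    tensor_parallel_size=2,            # 2x multi-card setup
    dtype="float32",                   # float32 keeps 1x/2x/4x outputs comparable
)
```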

zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request Aug 22, 2024
This PR enables LoRA support in HPU.

* Implemented custom BGMV for LoRA modules using index-select operator.
* Support for both single and multi card scenarios has been tested

---------

Co-authored-by: Himangshu Lahkar <[email protected]>
Co-authored-by: Himangshu Lahkar <[email protected]>
@hlahkar deleted the enable_lora_v1 branch on October 1, 2024, 10:05.