Enable LoRA support for HPU #170
Conversation
@kzawora-intel please review this PR.
@SanjuCSudhakaran @hlahkar please check review comments. Thanks.
I ran some multi-LoRA tests on this branch and wanted to share a few additional issues I encountered, in case it helps. [1] Error message with `VLLM_SKIP_WARMUP=false pytest tests/lora/test_llama_hpu.py`: `============================= HABANA PT BRIDGE CONFIGURATION ===========================`
Thank you @JHLEE17 for identifying the issues. We had identified the same issues internally as well, and are currently working on a fix.
@afierka-intel @madamczykhabana please help review this PR.
Force-pushed from 6600367 to 08721a7.
I have a few clarification questions and some suggestions for reorganizing the code a bit.
Thank you for applying all my suggestions. The code looks good to me now. I'll merge the PR once you resolve the merge conflicts and all static checks pass.
Squashed commit of the following: commit 7beaeba commit 549bffb commit 2769fd8 commit 1911f44 commit e154e3c commit 220460d commit 1256be5 commit 03d6bc3 commit 4b7468c commit b7d2d86 commit 712a7ed commit 1ee15b4 commit 5c6a312 commit ccb0569 commit c10afb4 commit 6b3a039 commit 4ef5a6d commit 301579d commit ed98772 commit 55c82ba commit d7dddc9 commit 7cc2b99 commit e120246
...Also update the test reference for test_multilora_hpu.py
...to fix the accuracy mismatch between tp_size = 1 and tp_size > 1
Force-pushed from 4221f2e to 78436a6.
Thank you for resolving conflicts! All static checks and internal e2e tests passed. Approving the PR.
@JHLEE17 The PR is merged to habana_main. The 1x, 2x, and 4x tests in tests/lora/test_llama_hpu.py and tests/lora/test_multilora_hpu.py are all passing, with the following caveats:
This PR enables LoRA support in HPU.
* Implemented custom BGMV for LoRA modules using the index-select operator.
* Support for both single-card and multi-card scenarios has been tested.

---------

Co-authored-by: Himangshu Lahkar <[email protected]>
Co-authored-by: Himangshu Lahkar <[email protected]>
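To illustrate the idea behind the description above, here is a minimal pure-Python sketch of a BGMV (batched grouped matrix-vector) computation as used in multi-LoRA serving, where an index-select step picks each token's adapter weights. All names here (`bgmv`, `lora_a`, `lora_b`, `indices`) are illustrative assumptions, not the actual vLLM/HPU kernel API, and the LoRA scaling factor and residual add are omitted for brevity.

```python
def matvec(mat, vec):
    # mat: list of rows, vec: list of floats -> mat @ vec
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def bgmv(xs, lora_a, lora_b, indices):
    """Batched grouped matrix-vector multiply (illustrative sketch).

    xs:      list of per-token input vectors
    lora_a:  stacked LoRA A matrices (rank x hidden), one per adapter
    lora_b:  stacked LoRA B matrices (hidden x rank), one per adapter
    indices: adapter index per token -- this gather is the
             "index-select" step the PR implements on HPU
    """
    outputs = []
    for x, idx in zip(xs, indices):
        a = lora_a[idx]  # index-select: pick this token's A matrix
        b = lora_b[idx]  # index-select: pick this token's B matrix
        outputs.append(matvec(b, matvec(a, x)))  # B @ (A @ x)
    return outputs

# Two tokens, each routed to a different rank-1 adapter:
xs = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[[1.0, 0.0]], [[0.0, 2.0]]]        # A per adapter (1 x 2)
lora_b = [[[1.0], [0.0]], [[0.0], [3.0]]]    # B per adapter (2 x 1)
print(bgmv(xs, lora_a, lora_b, [0, 1]))      # [[1.0, 0.0], [0.0, 6.0]]
```

In production kernels the gather is done over stacked weight tensors with `index_select` (so the whole batch runs as a few dense ops), rather than with a Python loop; the loop here is only to make the per-token adapter selection explicit.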