
LoRA support in model builder #955

Merged: 10 commits from asonawane/lora into main on Oct 21, 2024

Conversation

apsonawane (Contributor)

This PR adds the LoRA MatMul changes to the model builder. It includes the changes made by Kunal plus a few changes needed to make it work with Olive. It covers the scenario where the base_layer and the adapters are both float.

This PR will be followed by one that adds support for the quantized-models scenario.
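
For context, here is a minimal NumPy sketch of the computation a LoRA-augmented MatMul represents (float base weight plus float A/B adapters). The names lora_matmul, base_weight, lora_A, lora_B, and scaling are illustrative assumptions, not identifiers from builder.py:

import numpy as np

def lora_matmul(x, base_weight, lora_A, lora_B, scaling=1.0):
    # y = x @ W + scaling * (x @ A @ B), with W, A, and B all float.
    base_out = x @ base_weight            # original MatMul path
    lora_out = (x @ lora_A) @ lora_B      # low-rank adapter path
    return base_out + scaling * lora_out

# Example with hidden size 8 and LoRA rank 2.
x = np.random.randn(4, 8).astype(np.float32)
W = np.random.randn(8, 8).astype(np.float32)
A = np.random.randn(8, 2).astype(np.float32)
B = np.random.randn(2, 8).astype(np.float32)
print(lora_matmul(x, W, A, B, scaling=0.5).shape)  # (4, 8)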

@natke self-requested a review on October 9, 2024 21:03
@natke (Contributor) left a comment

Can you please update the README with any new options and usage of this feature? https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md

@apsonawane force-pushed the asonawane/lora branch 3 times, most recently from f665ef4 to d43f4c5, on October 19, 2024 17:47
@kunal-vaishnavi (Contributor)

Can you add adapter_path and explain its usage after use_qdq?

parser.add_argument(
    "--extra_options",
    required=False,
    metavar="KEY=VALUE",
    nargs='+',
    help=textwrap.dedent("""\
        Key value pairs for various options. Currently supports:
            int4_block_size = 16/32/64/128/256: Specify the block_size for int4 quantization.
            int4_accuracy_level = 1/2/3/4: Specify the minimum accuracy level for activation of MatMul in int4 quantization.
                4 is int8, which means input A of int4 quantized MatMul is quantized to int8 and input B is upcasted to int8 for computation.
                3 is bf16.
                2 is fp16.
                1 is fp32.
            num_hidden_layers = Manually specify the number of layers in your ONNX model (for unit testing purposes).
            filename = Filename for ONNX model (default is 'model.onnx').
                For models with multiple components, each component is exported to its own ONNX model.
                The filename for each component will be '<filename>_<component-name>.onnx' (ex: '<filename>_encoder.onnx', '<filename>_decoder.onnx').
            config_only = Generate config and pre/post processing files only.
                Use this option when you already have your optimized and/or quantized ONNX model.
            exclude_embeds = Remove embedding layer from your ONNX model.
                Use this option when you want to remove the embedding layer from within your ONNX model.
                Instead of `input_ids`, you will have `inputs_embeds` as the input to your ONNX model.
            exclude_lm_head = Remove language modeling head from your ONNX model.
                Use this option when you want to remove the language modeling head from within your ONNX model.
                Instead of `logits`, you will have `hidden_states` as the output to your ONNX model.
            enable_cuda_graph = 1 : The model can use CUDA graph capture for CUDA execution provider. If enabled, all nodes being placed on the CUDA EP
                is the prerequisite for the CUDA graph to be used correctly. It is not guaranteed that cuda graph be enabled as it depends on the model
                and the graph structure.
            use_8bits_moe = 1 : Use 8-bit quantization for MoE layers. Default is using 4-bit quantization.
            hf_token = false/token: Use this to disable authentication with Hugging Face or provide a custom authentication token that differs from the one stored in your environment. Default behavior is to use the authentication token stored by `huggingface-cli login`.
                If you have already authenticated via `huggingface-cli login`, you do not need to use this flag because Hugging Face has already stored your authentication token for you.
            use_qdq = 1 : Use the QDQ decomposition for quantized MatMul instead of the MatMulNBits operator.
        """),
)

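A minimal sketch of how a key=value option like adapter_path could be consumed once it is documented; the helper name parse_extra_options and the handling shown are assumptions for illustration, not the PR's actual implementation:

def parse_extra_options(kv_items):
    # Turn ['key1=value1', 'key2=value2', ...] from --extra_options into a dict.
    kv_pairs = {}
    for kv in kv_items or []:
        key, sep, value = kv.partition("=")
        if sep:
            kv_pairs[key.strip()] = value.strip()
    return kv_pairs

# Example: the builder could then pick up the adapter location like this
# (the option name 'adapter_path' follows the reviewer's suggestion above).
extra_options = parse_extra_options(["adapter_path=./phi2-lora-adapter", "use_qdq=1"])
adapter_path = extra_options.get("adapter_path")  # None when no adapter is supplied
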
@kunal-vaishnavi (Contributor)

This can be done in another PR, but we should add a LoRA model such as this one to the CIs.

import os

def get_model_paths():
    hf_paths = {
        "phi-2": "microsoft/phi-2",
        # "phi-3-mini": "microsoft/Phi-3-mini-128k-instruct",
    }

    ci_data_path = os.path.join("/", "data", "ortgenai_pytorch_models")
    if not os.path.exists(ci_data_path):
        return {}, hf_paths

    # Note: If a model has over 4B parameters, please add a quantized version
    # to `ci_paths` instead of `hf_paths` to reduce file size and testing time.
    ci_paths = {
        "llama-2": os.path.join(ci_data_path, "Llama-2-7B-Chat-GPTQ"),
        "llama-3": os.path.join(ci_data_path, "Meta-Llama-3-8B-AWQ"),
        "mistral-v0.2": os.path.join(ci_data_path, "Mistral-7B-Instruct-v0.2-GPTQ"),
        # "phi-2": os.path.join(ci_data_path, "phi2"),
        # "gemma-2b": os.path.join(ci_data_path, "gemma-1.1-2b-it"),
        "gemma-7b": os.path.join(ci_data_path, "gemma-7b-it-awq"),
        # "phi-3-mini": os.path.join(ci_data_path, "phi3-mini-128k-instruct"),
    }

    return ci_paths, hf_paths

The models in hf_paths are downloaded from Hugging Face, and the models in ci_paths are currently uploaded to /data/ortgenai_pytorch_models in one of the CI VMs.

--volume /data/ortgenai_pytorch_models:/data/ortgenai_pytorch_models \
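
As a rough sketch of the suggested follow-up, a LoRA test model could be registered alongside the existing entries; the dictionary key and model id below are placeholders, not names chosen in this PR:

# Hypothetical hf_paths entry for a LoRA test model; the real model id would
# be picked when the follow-up CI change lands.
hf_paths = {
    "phi-2": "microsoft/phi-2",
    "phi-2-lora": "some-org/phi-2-lora-adapter",  # placeholder id
}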

@apsonawane merged commit 4253ecc into main on Oct 21, 2024
12 of 13 checks passed
@apsonawane deleted the asonawane/lora branch on October 21, 2024 21:27