
LoRA support in model builder #955

Merged: 10 commits from asonawane/lora into main on Oct 21, 2024

Conversation

apsonawane (Contributor)

This PR adds the LoRA MatMul changes to the model builder. It includes the changes made by Kunal plus a few changes needed to make it work with Olive. It covers the scenario where the base_layer and the adapters are both float.

This PR will be followed by one that adds support for the quantized-models scenario.
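
For context, here is a minimal NumPy sketch of the computation a LoRA-augmented MatMul represents (float base weight plus float A/B adapters). The names lora_matmul, base_weight, lora_A, lora_B, and scaling are illustrative assumptions, not identifiers from builder.py:

import numpy as np

def lora_matmul(x, base_weight, lora_A, lora_B, scaling=1.0):
    # y = x @ W + scaling * (x @ A @ B), with W, A, and B all float.
    base_out = x @ base_weight            # original MatMul path
    lora_out = (x @ lora_A) @ lora_B      # low-rank adapter path
    return base_out + scaling * lora_out

# Example with hidden size 8 and LoRA rank 2.
x = np.random.randn(4, 8).astype(np.float32)
W = np.random.randn(8, 8).astype(np.float32)
A = np.random.randn(8, 2).astype(np.float32)
B = np.random.randn(2, 8).astype(np.float32)
print(lora_matmul(x, W, A, B, scaling=0.5).shape)  # (4, 8)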

@natke self-requested a review on October 9, 2024 21:03
@natke (Contributor) left a comment

Can you please update the README with any new options and usage of this feature? https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md

@apsonawane force-pushed the asonawane/lora branch 3 times, most recently from f665ef4 to d43f4c5, on October 19, 2024 17:47
@kunal-vaishnavi (Contributor)

Can you add adapter_path and explain its usage after use_qdq?

parser.add_argument(
    "--extra_options",
    required=False,
    metavar="KEY=VALUE",
    nargs='+',
    help=textwrap.dedent("""\
        Key value pairs for various options. Currently supports:
            int4_block_size = 16/32/64/128/256: Specify the block_size for int4 quantization.
            int4_accuracy_level = 1/2/3/4: Specify the minimum accuracy level for activation of MatMul in int4 quantization.
                4 is int8, which means input A of int4 quantized MatMul is quantized to int8 and input B is upcasted to int8 for computation.
                3 is bf16.
                2 is fp16.
                1 is fp32.
            num_hidden_layers = Manually specify the number of layers in your ONNX model (for unit testing purposes).
            filename = Filename for ONNX model (default is 'model.onnx').
                For models with multiple components, each component is exported to its own ONNX model.
                The filename for each component will be '<filename>_<component-name>.onnx' (ex: '<filename>_encoder.onnx', '<filename>_decoder.onnx').
            config_only = Generate config and pre/post processing files only.
                Use this option when you already have your optimized and/or quantized ONNX model.
            exclude_embeds = Remove embedding layer from your ONNX model.
                Use this option when you want to remove the embedding layer from within your ONNX model.
                Instead of `input_ids`, you will have `inputs_embeds` as the input to your ONNX model.
            exclude_lm_head = Remove language modeling head from your ONNX model.
                Use this option when you want to remove the language modeling head from within your ONNX model.
                Instead of `logits`, you will have `hidden_states` as the output to your ONNX model.
            enable_cuda_graph = 1 : The model can use CUDA graph capture for CUDA execution provider. If enabled, all nodes being placed on the CUDA EP
                is the prerequisite for the CUDA graph to be used correctly. It is not guaranteed that cuda graph be enabled as it depends on the model
                and the graph structure.
            use_8bits_moe = 1 : Use 8-bit quantization for MoE layers. Default is using 4-bit quantization.
            hf_token = false/token: Use this to disable authentication with Hugging Face or provide a custom authentication token that differs from the one stored in your environment. Default behavior is to use the authentication token stored by `huggingface-cli login`.
                If you have already authenticated via `huggingface-cli login`, you do not need to use this flag because Hugging Face has already stored your authentication token for you.
            use_qdq = 1 : Use the QDQ decomposition for quantized MatMul instead of the MatMulNBits operator.
        """),
)

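A minimal sketch of how a key=value option like adapter_path could be consumed once it is documented; the helper name parse_extra_options and the handling shown are assumptions for illustration, not the PR's actual implementation:

def parse_extra_options(kv_items):
    # Turn ['key1=value1', 'key2=value2', ...] from --extra_options into a dict.
    kv_pairs = {}
    for kv in kv_items or []:
        key, sep, value = kv.partition("=")
        if sep:
            kv_pairs[key.strip()] = value.strip()
    return kv_pairs

# Example: the builder could then pick up the adapter location like this
# (the option name 'adapter_path' follows the reviewer's suggestion above).
extra_options = parse_extra_options(["adapter_path=./phi2-lora-adapter", "use_qdq=1"])
adapter_path = extra_options.get("adapter_path")  # None when no adapter is supplied
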
@kunal-vaishnavi (Contributor)

This can be done in another PR, but we should add a LoRA model such as this one to the CIs.

import os

def get_model_paths():
    hf_paths = {
        "phi-2": "microsoft/phi-2",
        # "phi-3-mini": "microsoft/Phi-3-mini-128k-instruct",
    }

    ci_data_path = os.path.join("/", "data", "ortgenai_pytorch_models")
    if not os.path.exists(ci_data_path):
        return {}, hf_paths

    # Note: If a model has over 4B parameters, please add a quantized version
    # to `ci_paths` instead of `hf_paths` to reduce file size and testing time.
    ci_paths = {
        "llama-2": os.path.join(ci_data_path, "Llama-2-7B-Chat-GPTQ"),
        "llama-3": os.path.join(ci_data_path, "Meta-Llama-3-8B-AWQ"),
        "mistral-v0.2": os.path.join(ci_data_path, "Mistral-7B-Instruct-v0.2-GPTQ"),
        # "phi-2": os.path.join(ci_data_path, "phi2"),
        # "gemma-2b": os.path.join(ci_data_path, "gemma-1.1-2b-it"),
        "gemma-7b": os.path.join(ci_data_path, "gemma-7b-it-awq"),
        # "phi-3-mini": os.path.join(ci_data_path, "phi3-mini-128k-instruct"),
    }

    return ci_paths, hf_paths

The models in hf_paths are downloaded from Hugging Face, and the models in ci_paths are currently uploaded to /data/ortgenai_pytorch_models in one of the CI VMs.

--volume /data/ortgenai_pytorch_models:/data/ortgenai_pytorch_models \
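
As a rough sketch of the suggested follow-up, a LoRA test model could be registered alongside the existing entries; the dictionary key and model id below are placeholders, not names chosen in this PR:

# Hypothetical hf_paths entry for a LoRA test model; the real model id would
# be picked when the follow-up CI change lands.
hf_paths = {
    "phi-2": "microsoft/phi-2",
    "phi-2-lora": "some-org/phi-2-lora-adapter",  # placeholder id
}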

@apsonawane merged commit 4253ecc into main on Oct 21, 2024
12 of 13 checks passed
@apsonawane deleted the asonawane/lora branch on October 21, 2024 21:27