- Install Olive

  ```bash
  pip install git+https://github.com/microsoft/olive
  ```
- Build and install ONNX Runtime generate()

  TODO: replace this with 1.20 when it is released

  ```bash
  git clone https://github.com/microsoft/onnxruntime-genai.git
  cd onnxruntime-genai
  python build.py
  cd build\Windows\RelWithDebInfo\wheel
  pip install *.whl
  ```
- Install ONNX Runtime nightly

  TODO: remove this step when 1.20 is released

  ```bash
  pip uninstall onnxruntime
  pip install --pre --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ ort-nightly
  ```
- Install other dependencies

  ```bash
  pip install optimum peft
  ```
- Downgrade torch and transformers

  TODO: there is an export bug with torch 2.5.0 and an incompatibility with transformers>=4.45.0

  ```bash
  pip uninstall torch
  pip install torch==2.4
  pip uninstall transformers
  pip install transformers==4.44
  ```
- Choose a model

  In this example we'll use Llama-3-8b. You need to register with Meta for a license to use this model. You can do this by accessing the above page, signing in, and registering for access. Access should be granted quickly. Ensure that the huggingface-cli is installed (`pip install huggingface-hub[cli]`) and that you are logged in via `huggingface-cli login`.

- Locate datasets and/or existing adapters

  In this example, we will use two pre-tuned adapters.

  Note that the output path cannot contain any period (`.`) characters.

  Note also that this step requires 63GB of memory on the machine on which it is running.
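Because the output path must not contain period characters, it can be worth validating the path before launching a long-running command. This is a hypothetical helper (the name `check_output_path` is ours, not part of Olive) that rejects offending paths:

```python
from pathlib import PurePath

def check_output_path(path: str) -> str:
    """Raise ValueError if any component of the output path contains a period.

    For example, 'models\\Llama-3.1-8B' is rejected, while the
    'models\\Llama-3-1-8B-Instruct-LoRA' style used in this guide is accepted.
    """
    # Normalize Windows separators so the same check works on any platform.
    for part in PurePath(path.replace("\\", "/")).parts:
        if "." in part:
            raise ValueError(f"output path component {part!r} contains a period")
    return path

check_output_path("models/Llama-3-1-8B-Instruct-LoRA")  # passes silently
```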
- Export the model to ONNX format

  Note: add --use_model_builder when this is ready

  ```bash
  olive capture-onnx-graph -m meta-llama/Llama-3.1-8B-Instruct --adapter_path Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality -o models\Llama-3-1-8B-Instruct-LoRA --torch_dtype float32 --use_ort_genai
  ```
- (Optional) Quantize the model

  ```bash
  olive quantize -m models\Llama-3-1-8B-Instruct-LoRA\model --algorithm rtn --implementation matmul4 -o models\Llama-3-1-8B-Instruct-LoRA-int4
  ```
- Adapt the model

  ```bash
  olive generate-adapter -m models\Llama-3-1-8B-Instruct-LoRA-int4\model -o models\Llama-3-1-8B-Instruct-LoRA-int4\adapted --log_level 1
  ```
- Convert adapters to ONNX

  ```bash
  olive convert-adapters --adapter_path Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality --output_path adapters\Llama-1-8B-Instruct-Surfer-Dude-Personality --dtype float32 --quantize_int4
  olive convert-adapters --adapter_path Coldstart/Llama-3.1-8B-Instruct-Hillbilly-Personality --output_path adapters\Llama-1-8B-Instruct-Hillbilly-Personality --dtype float32 --quantize_int4
  ```
- Run the app

  See `app.py`.
- Fine-tune an adapter

  TODO: this requires CUDA

  ```bash
  olive finetune --method qlora -m meta-llama/Meta-Llama-3-8B -d nampdn-ai/tiny-codes --train_split "train[:4096]" --eval_split "train[4096:4224]" --text_template "### Language: {programming_language} \n### Question: {prompt} \n### Answer: {response}" --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --max_steps 150 --logging_steps 50 -o adapters\tiny-codes
  ```
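The `--text_template` argument is a Python-style format string applied to each dataset row to build a training example. A minimal sketch of that rendering, with made-up field values (the field names `programming_language`, `prompt`, and `response` come from the command above; the values are illustrative only):

```python
# The same template string passed to --text_template above, with real newlines.
template = "### Language: {programming_language} \n### Question: {prompt} \n### Answer: {response}"

# A hypothetical dataset row; real rows come from nampdn-ai/tiny-codes.
row = {
    "programming_language": "Python",
    "prompt": "Write a function that doubles a number.",
    "response": "def double(n):\n    return 2 * n",
}

# Each row is rendered by substituting its fields into the template.
print(template.format(**row))
```

Each rendered string becomes one training example for the QLoRA fine-tuning run.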