falcontune allows finetuning FALCONs (e.g., falcon-40b-4bit) on as little as one consumer-grade A100 40GB.
Its features tiny and easy-to-use codebase.
One benefit of being able to finetune larger LLMs on one GPU is the ability to easily leverage data parallelism for large models.
Underneath the hood, falcontune implements the LoRA algorithm over an LLM compressed using the GPTQ algorithm, which requires implementing a backward pass for the quantized LLM.
falcontune can generate a 50-token recipe on A100 40GB for ~ 10 seconds using triton backend
$ falcontune generate --interactive --model falcon-40b-instruct-4bit --weights gptq_model-4bit--1g.safetensors --max_new_tokens=50 --use_cache --do_sample --prompt "How to prepare pasta?"
How to prepare pasta?
Here's a simple recipe to prepare pasta:
Ingredients:
- 1 pound of dry pasta
- 4-6 cups of water
- Salt (optional)
Instructions:
1. Boil the water
Took 10.042 s
This example is based on the model: TheBloke/falcon-40b-instruct-GPTQ.
Here is a Google Colab. You will need a A100 40GB to finetune the model.
pip install -r requirements.txt
python setup.py install
The default backend is triton which is the fastest. For cuda support install also the CUDA kernels:
python setup_cuda.py install
The above process installs a falcontune
command in your environment.
First, start by downloading the weights of a FALCON model:
$ wget https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ/resolve/main/gptq_model-4bit--1g.safetensors
You can generate text directly from the command line. This generates text from the base model:
$ falcontune generate \
--interactive \
--model falcon-40b-instruct-4bit \
--weights gptq_model-4bit--1g.safetensors \
--max_new_tokens=50 \
--use_cache \
--do_sample \
--instruction "Who was the first person on the moon?"
You may also finetune a base model yourself. First, you need to download a dataset:
$ wget https://github.com/gururise/AlpacaDataCleaned/raw/main/alpaca_data_cleaned.json
You can finetune any model of the FALCON family:
FALCON-7B
$ falcontune finetune \
--model=falcon-7b \
--weights=tiiuae/falcon-7b \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-7b-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
The above commands will download the model and use LoRA to finetune the quantized model. The final adapters and the checkpoints will be saved in `falcon-7b-alpaca` and available for generation as follows:
$ falcontune generate \
--interactive \
--model falcon-7b \
--weights tiiuae/falcon-7b \
--lora_apply_dir falcon-7b-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
FALCON-7B-INSTRUCT
$ falcontune finetune \
--model=falcon-7b-instruct \
--weights=tiiuae/falcon-7b-instruct \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-7b-instruct-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
The above commands will download the model and use LoRA to finetune the quantized model. The final adapters and the checkpoints will be saved in `falcon-7b-instruct-alpaca` and available for generation as follows:
$ falcontune generate \
--interactive \
--model falcon-7b-instruct \
--weights mosaicml/falcon-7b-instruct \
--lora_apply_dir falcon-7b-instruct-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
FALCON-40B
$ falcontune finetune \
--model=falcon-40b \
--weights=tiiuae/falcon-40b \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-40b-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
The above commands will download the model and use LoRA to finetune the quantized model. The final adapters and the checkpoints will be saved in `falcon-40b-alpaca` and available for generation as follows:
$ falcontune generate \
--interactive \
--model falcon-40b \
--weights tiiuae/falcon-40b\
--lora_apply_dir falcon-40b-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
FALCON-40B-INSTRUCT
$ falcontune finetune \
--model=falcon-40b-instruct \
--weights=tiiuae/falcon-40b-instruct \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-40b-instruct-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
The above commands will download the model and use LoRA to finetune the quantized model. The final adapters and the checkpoints will be saved in `falcon-40b-instruct-alpaca` and available for generation as follows:
$ falcontune generate \
--interactive \
--model falcon-40b-instruct \
--weights tiiuae/falcon-40b-instruct\
--lora_apply_dir falcon-40b-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
FALCON-7B-INSTRUCT-4BIT
$ wget https://huggingface.co/TheBloke/falcon-7b-instruct-GPTQ/resolve/main/gptq_model-4bit-64g.safetensors
$ falcontune finetune \
--model=falcon-7b-instruct-4bit \
--weights=gptq_model-4bit-64g.safetensors \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-7b-instruct-4bit-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
The above commands will download the model and use LoRA to finetune the quantized model. The final adapters and the checkpoints will be saved in `falcon-7b-instruct-4bit-alpaca` and available for generation as follows:
$ falcontune generate \
--interactive \
--model falcon-7b-instruct-4bit \
--weights gptq_model-4bit-64g.safetensors \
--lora_apply_dir falcon-7b-instruct-4bit-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
FALCON-40B-INSTRUCT-4BIT
$ wget https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ/resolve/main/gptq_model-4bit--1g.safetensors
$ falcontune finetune \
--model=falcon-40b-instruct-4bit \
--weights=gptq_model-4bit--1g.safetensors \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-40b-instruct-4bit-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
The above commands will download the model and use LoRA to finetune the quantized model. The final adapters and the checkpoints will be saved in `falcon-40b-instruct-4bit-alpaca` and available for generation as follows:
$ falcontune generate \
--interactive \
--model falcon-40b-instruct-4bit \
--weights gptq_model-4bit--1g.safetensors \
--lora_apply_dir falcon-40b-instruct-4bit-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
falcontune is based on the following projects:
- The GPTQ algorithm and codebase by the IST-DASLAB with modifications by @qwopqwop200
- The
alpaca_lora_4bit
repo by johnsmith0031 - The PEFT repo and its implementation of LoRA
- The LLAMA, OPT, and BLOOM models by META FAIR and the BigScience consortium
- The
llmtune
repo by kuleshov-group
Need a custom solution? Let me know: [email protected]