Add support for QAT + LoRA #1931

Status: Draft · wants to merge 1 commit into main from try-qat-lora
Conversation

@andrewor14 (Contributor) commented Oct 31, 2024

TODO: write this

Helpful code review commands (diff the new QAT + LoRA recipe and configs against their existing LoRA counterparts):

diff --color recipes/lora_finetune_distributed.py recipes/qat_lora_finetune_distributed.py
diff --color recipes/configs/llama2/7B_lora.yaml recipes/configs/llama2/7B_qat_lora.yaml
diff --color recipes/configs/llama3/8B_lora.yaml recipes/configs/llama3/8B_qat_lora.yaml

Test Plan

Integration tests:

pytest -m integration_test tests/recipes/test_qat_lora_finetune_distributed.py

Manual tests (end-to-end: finetune with QAT + LoRA, quantize the resulting checkpoint to 8da4w, then evaluate the quantized model on wikitext):

export CUDA_VISIBLE_DEVICES=4,5,6,7
export NCCL_SHM_DISABLE=0
LOG_DIR=/home/andrewor/local/logs/tune/qat_lora

tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora \
    batch_size=8 \
    quantizer.groupsize=32 \
    checkpointer.output_dir="$LOG_DIR" \
    metric_logger.output_dir="${LOG_DIR}/metrics"

tune run quantize --config quantization \
    model._component_=torchtune.models.llama3.llama3_8b \
    checkpointer._component_=torchtune.training.FullModelMetaCheckpointer \
    checkpointer.checkpoint_dir="$LOG_DIR" \
    checkpointer.output_dir="$LOG_DIR" \
    checkpointer.checkpoint_files=["meta_model_0.pt"] \
    checkpointer.model_type=LLAMA3 \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
    quantizer.groupsize=32
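
The quantize step above applies torchao's 8da4w (int8 dynamic activations, int4 grouped weights) post-training quantization to the finetuned checkpoint. Roughly equivalent standalone Python might look like the sketch below, assuming torchtune's re-exported torchao quantizer and eliding checkpoint loading and saving, which the tune run wrapper handles:

import torch
from torchtune.models.llama3 import llama3_8b
from torchtune.training.quantization import Int8DynActInt4WeightQuantizer

# Sketch only: build the model and load the finetuned weights (elided),
# then quantize with the same groupsize used during QAT finetuning.
model = llama3_8b()
quantizer = Int8DynActInt4WeightQuantizer(groupsize=32)
model = quantizer.quantize(model)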

tune run eleuther_eval --config eleuther_evaluation \
    batch_size=1 \
    model._component_=torchtune.models.llama3.llama3_8b \
    checkpointer._component_=torchtune.training.FullModelTorchTuneCheckpointer \
    checkpointer.checkpoint_dir="$LOG_DIR" \
    checkpointer.output_dir="$LOG_DIR" \
    checkpointer.checkpoint_files=["meta_model_0.pt-8da4w"] \
    checkpointer.model_type=LLAMA3 \
    tokenizer._component_=torchtune.models.llama3.llama3_tokenizer \
    tokenizer.path=/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model \
    tasks=[wikitext] \
    quantizer._component_=torchtune.training.quantization.Int8DynActInt4WeightQuantizer \
    quantizer.groupsize=32

Results:

# Baseline (LoRA only, no QAT)

| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|------|---------------|---|------:|---|------|
|wikitext|      2|none  |None  |bits_per_byte  |↓  | 0.6676|±  |   N/A|
|        |       |none  |None  |byte_perplexity|↓  | 1.5884|±  |   N/A|
|        |       |none  |None  |word_perplexity|↓  |11.8741|±  |   N/A|

# LoRA + QAT (new recipe)

| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|------|---------------|---|------:|---|------|
|wikitext|      2|none  |None  |bits_per_byte  |↓  | 0.6623|±  |   N/A|
|        |       |none  |None  |byte_perplexity|↓  | 1.5826|±  |   N/A|
|        |       |none  |None  |word_perplexity|↓  |11.6457|±  |   N/A|
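
All three wikitext metrics improve slightly with QAT + LoRA over the LoRA-only baseline (e.g. word perplexity 11.6457 vs 11.8741), suggesting that fake quantization during finetuning helps the model tolerate the final 8da4w quantization.

For intuition, the core idea is to fake-quantize the frozen base weights in the forward pass while training only the LoRA adapters. Below is a minimal, self-contained PyTorch sketch of that idea, not the recipe's actual implementation; the symmetric per-group int4 scheme and group size mirror the 8da4w settings above, and all names are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize_int4_grouped(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    # Symmetric per-group int4 fake quantization: quantize then immediately
    # dequantize, so weights stay in float but take on quantized values.
    assert w.shape[-1] % group_size == 0
    wg = w.reshape(-1, group_size)
    qmax = 7  # symmetric int4 range is [-8, 7]
    scale = wg.abs().amax(dim=1, keepdim=True).clamp(min=1e-9) / qmax
    wq = torch.clamp(torch.round(wg / scale), -8, qmax) * scale
    return wq.reshape(w.shape)

class QATLoRALinear(nn.Module):
    # Frozen base linear whose weight is fake-quantized on each forward,
    # plus trainable low-rank LoRA adapters (illustrative names).
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0, group_size: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapters start as a no-op delta
        self.scaling = alpha / rank
        self.group_size = group_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = fake_quantize_int4_grouped(self.base.weight, self.group_size)
        return F.linear(x, w, self.base.bias) + self.scaling * self.lora_b(self.lora_a(x))

Training then proceeds as in a normal LoRA finetune, with only lora_a/lora_b receiving gradients. Note this sketch omits the int8 dynamic per-token activation fake quantization that the "8da4w" scheme also applies.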

@pytorch-bot commented Oct 31, 2024

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1931

❌ 5 New Failures, 5 Cancelled Jobs as of commit 1a48a20 with merge base f560cbb.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@andrewor14 marked this pull request as draft on October 31, 2024 00:10
@facebook-github-bot added the CLA Signed label on Oct 31, 2024
@@ -0,0 +1,24 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.

@andrewor14 (Contributor, Author) commented:

Will delete these files before landing.
@andrewor14 force-pushed the try-qat-lora branch 2 times, most recently from e20e891 to d09c71f on November 1, 2024 19:38
TODO: write this
# TODO: Expose fake quantize configs from torchao so we can get them
# directly from the quantizer. For now, we hardcode the configs for 8da4w.
# E.g. activation_config = quantizer.get_activation_fake_quantize_config()
# E.g. weight_config = quantizer.get_weight_fake_quantize_config()
@andrewor14 (Contributor, Author) commented:

Addressed in pytorch/ao#1214
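
For context, the hardcoded 8da4w configs that this TODO refers to would presumably look something like the sketch below, using torchao's FakeQuantizeConfig; the exact arguments and import path are assumptions based on the 8da4w scheme (dynamic asymmetric per-token int8 activations, symmetric per-group int4 weights) and may vary across torchao versions:

import torch
from torchao.quantization.qat import FakeQuantizeConfig

# Assumed stand-ins for the getters proposed in the TODO above.
# Note: the int4 dtype spelling differs across versions (older torchao
# releases used an internal enum rather than a torch dtype).
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)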
