Replies: 2 comments
-
I was able to work around this. In addition to adjusting batch size and gradient accumulation downward, I also needed to reduce the max sequence length, which I had increased for Llama 3.2 since the longer context is one of the new features we were interested in.
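For reference, the knobs I'm talking about live in the recipe YAML. A rough sketch of the relevant fields (names follow the stock torchtune single-device LoRA configs; the path and values here are placeholders, not my exact settings):

```yaml
# Illustrative excerpt of a torchtune LoRA recipe config - values are placeholders.
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /data/models/Llama-3.2-3B/original/tokenizer.model  # placeholder path
  max_seq_len: 4096            # capped, instead of the 128K-class value I had raised it to

batch_size: 2                  # adjusted downward
gradient_accumulation_steps: 8 # adjusted downward as well
```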
-
Hey @troy256 - sorry for taking so long to get round to this. I think you're right that the main culprit here is maximum sequence length - the llama3.2 model you're using still has a 130K sequence length, and it was going to be my first suggestion. Are the samples in your dataset long enough that they use sufficient memory to OOM on an A100? If you appropriately constrain your sequence length, you might not need to reduce your batch size or lean on gradient accumulation as much. One thing I'd also suggest is sample packing if there's variance in sequence length between samples in your dataset (https://pytorch.org/torchtune/stable/tutorials/datasets.html#sample-packing). This could help speed things up for you.
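For the packing part, it's just a flag on the dataset builder in your config, assuming you're using one of the built-in torchtune dataset builders that expose it (the stock configs do). Something like:

```yaml
# Minimal sketch of enabling sample packing - component name is illustrative,
# use whichever dataset builder your config already points at.
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  packed: True   # packs multiple shorter samples into each max_seq_len window
```

Packing needs a concrete tokenizer.max_seq_len to pack to, so it pairs naturally with capping the sequence length. You can also set both from the command line without editing the YAML, e.g. by appending dataset.packed=True and tokenizer.max_seq_len=4096 as overrides to your tune run command.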
-
I've been fine-tuning successfully with torchtune on Llama 3.1 8B and am now trying to do the same with Llama 3.2 3B. I upgraded to the latest nightly build of torchtune and am getting an out-of-memory error during fine-tuning. The system has an NVIDIA A100 with 80 GB.
What I've tried: smaller batch sizes, setting the env variable recommended in the error message, and upgrading to the latest torchtune nightly.
Any suggestions would be appreciated.
tune run lora_finetune_single_device --config /data/torchtune/tune-recipes/llama3.2-3b-LoRA.yaml
Output: