
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 79.35 GiB total capacity; 77.18 GiB already allocated; 57.19 MiB free; 77.97 GiB reserved in total by PyTorch) #19

Open
gpravi opened this issue Jun 9, 2023 · 4 comments

gpravi commented Jun 9, 2023

Ran into a CUDA OOM issue during fine-tuning:

File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 336, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory.

Any ideas to fix this?
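For what it's worth, a generic sketch of one mitigation (not specific to falcontune or bitsandbytes) is to release the caching allocator's unused blocks right before saving, so the temporary copy made in _save_to_state_dict has more headroom. save_with_headroom and its arguments below are placeholder names:

import gc

import torch


def save_with_headroom(model: torch.nn.Module, path: str) -> None:
    # Free Python-level garbage and return unused cached blocks to the GPU
    # before serializing. This does not change how bitsandbytes rebuilds the
    # weight layout in _save_to_state_dict; it only makes the extra ~256 MiB
    # allocation more likely to succeed.
    gc.collect()
    torch.cuda.empty_cache()
    torch.save(model.state_dict(), path)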

@SoumitriKolavennu

This happened to me as well with the 40B model. The error only occurs when saving a checkpoint. I tried saving the model once after all steps, instead of every 50 steps, and still got the error. A sketch of that setup is below.
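For reference, disabling intermediate checkpoints with a Hugging Face Trainer looks roughly like this (a sketch; output_dir and the batch/epoch values are placeholders, and the final save can still hit the same allocation):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./falcon-finetune",   # placeholder path
    save_strategy="no",               # skip checkpoint saves during training
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
# trainer.save_model()   # the one remaining save; it can still OOM here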

gpravi commented Jun 11, 2023

Any luck?

@angelovAlex

I am not sure if it is the same issue I had before, but please check rhulha/lora#1.

gpravi commented Jun 22, 2023

Thanks @angelovAlex. I applied the patch from that issue, but now I'm running into a similar error at a different line:

File "/root/.conda/envs/falcontune/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/root/.conda/envs/falcontune/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 330, in _save_to_state_dict
weight_clone = self.weight.data.clone()
^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 79.35 GiB total capacity; 75.73 GiB already allocated; 127.19 MiB free; 77.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
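The suggestion at the end of that message can be tried by configuring the caching allocator before PyTorch makes its first CUDA allocation; the 128 MiB split size below is an arbitrary starting value, not a tested recommendation:

import os

# Must be set before the first CUDA allocation (easiest: before importing
# torch, or as an environment variable when launching the script).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported afterwards so the allocator picks up the setting

Running export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the shell before launching the script has the same effect.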
