
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 79.35 GiB total capacity; 77.18 GiB already allocated; 57.19 MiB free; 77.97 GiB reserved in total by PyTorch) #19

Open
gpravi opened this issue Jun 9, 2023 · 4 comments

gpravi commented Jun 9, 2023

Ran into a CUDA OOM issue during fine-tuning:

File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 336, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory.

Any ideas to fix this?
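For what it's worth, a generic sketch of one mitigation (not specific to falcontune or bitsandbytes) is to release the caching allocator's unused blocks right before saving, so the temporary copy made in _save_to_state_dict has more headroom. save_with_headroom and its arguments below are placeholder names:

import gc

import torch


def save_with_headroom(model: torch.nn.Module, path: str) -> None:
    # Free Python-level garbage and return unused cached blocks to the GPU
    # before serializing. This does not change how bitsandbytes rebuilds the
    # weight layout in _save_to_state_dict; it only makes the extra ~256 MiB
    # allocation more likely to succeed.
    gc.collect()
    torch.cuda.empty_cache()
    torch.save(model.state_dict(), path)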

@SoumitriKolavennu

This happened to me as well with the 40B model. The error only occurs when saving a checkpoint. I tried saving the model once after all steps, instead of every 50 steps, and still got the error. A sketch of that setup is below.
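For reference, disabling intermediate checkpoints with a Hugging Face Trainer looks roughly like this (a sketch; output_dir and the batch/epoch values are placeholders, and the final save can still hit the same allocation):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./falcon-finetune",   # placeholder path
    save_strategy="no",               # skip checkpoint saves during training
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
# trainer.save_model()   # the one remaining save; it can still OOM here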

gpravi commented Jun 11, 2023

Any luck?

@angelovAlex

I am not sure if it is the same issue I had before, but please check rhulha/lora#1.

gpravi commented Jun 22, 2023

Thanks @angelovAlex. I applied the patch from that issue, but now I'm running into a similar error at a different line:

File "/root/.conda/envs/falcontune/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/root/.conda/envs/falcontune/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 330, in _save_to_state_dict
weight_clone = self.weight.data.clone()
^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 79.35 GiB total capacity; 75.73 GiB already allocated; 127.19 MiB free; 77.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
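The suggestion at the end of that message can be tried by configuring the caching allocator before PyTorch makes its first CUDA allocation; the 128 MiB split size below is an arbitrary starting value, not a tested recommendation:

import os

# Must be set before the first CUDA allocation (easiest: before importing
# torch, or as an environment variable when launching the script).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported afterwards so the allocator picks up the setting

Running export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the shell before launching the script has the same effect.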
