
Does fine-tuning support the multi-GPU training? #20

Open
cahuja1992 opened this issue Jun 9, 2023 · 5 comments

cahuja1992 commented Jun 9, 2023

Does fine-tuning support the multi-GPU training?

When trying to fine-tune with multiple GPUs, I got the following error.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1070, in forward
    transformer_outputs = self.transformer(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 965, in forward
    outputs = block(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 720, in forward
    mlp_output += attention_output
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
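
For context, a minimal sketch (not falcontune code, needs two visible GPUs) of what the last traceback frame shows: with device_map="auto" the transformer blocks are spread across the GPUs, so the residual add inside a block can receive tensors that live on different devices.

```python
# Minimal illustration of the failing operation: an in-place add between
# tensors created on two different GPUs raises the same RuntimeError.
import torch

attention_output = torch.randn(2, 8, device="cuda:0")
mlp_output = torch.randn(2, 8, device="cuda:1")

mlp_output += attention_output  # RuntimeError: Expected all tensors to be on the same device
```
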
@SoumitriKolavennu

Same issue for me as well.

gpravi commented Jun 11, 2023

I'm just using a single GPU for now, but I'm running into an OOM issue - #19

TeaCult commented Jul 3, 2023

I have the same issue; device_map=auto does not work for training. I guess we should copy the tokenizer to every device? Why can't device_map=auto handle this?
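
A possible workaround, sketched below under the assumption that the model is loaded via plain transformers (falcontune's own loading code may differ, and "tiiuae/falcon-7b" is just a placeholder): instead of device_map="auto", put the entire model on one GPU per process and launch one process per GPU with torchrun, so the HF Trainer can use DistributedDataParallel rather than sharding layers across devices.

```python
# Sketch: load the whole model onto this process's GPU (LOCAL_RANK is set by torchrun).
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.environ.get("LOCAL_RANK", 0))

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",                # placeholder base model
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map={"": local_rank},       # all weights on a single GPU per process
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
```

Launched with e.g. `torchrun --nproc_per_node=2 finetune.py ...`, each process then owns one full copy of the model and gradients are synchronized by DDP, which avoids the cross-device residual add above.
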

zepmck commented Jul 11, 2023

Same here. While training on multiple GPUs I get the following error:

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1)}.
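
This looks like the same root cause: device_map="auto" leaves the model sharded across GPUs, and the HF Trainer then tries to wrap it in DistributedDataParallel, which refuses a module whose parameters live on more than one device when device_ids is given. A standalone illustration (not falcontune code; assumes two visible GPUs and uses a toy module plus a single-process gloo group just to reach the DDP constructor):

```python
# Reproduces the DDP constructor check that raises the ValueError above.
import os
import torch.distributed as dist
import torch.nn as nn

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

sharded = nn.Sequential(nn.Linear(4, 4).to("cuda:0"), nn.Linear(4, 4).to("cuda:1"))
ddp = nn.parallel.DistributedDataParallel(sharded, device_ids=[0], output_device=0)
# ValueError: DistributedDataParallel device_ids and output_device arguments only work with ...
```
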

@baptistejamin

It seems falcontune doesn't rely on the Accelerate framework.

An easy quick win would be to rely on the new HuggingFace SFT Trainer instead: https://huggingface.co/docs/trl/sft_trainer
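
For reference, a rough sketch of that suggestion based on the linked TRL docs; the base model, dataset, and hyperparameters below are placeholders, not anything falcontune ships, and the 4-bit GPTQ loading that falcontune does would still need to be wired in separately.

```python
# Hypothetical SFTTrainer setup (placeholders throughout; adjust to your data).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "tiiuae/falcon-7b"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Example dataset with a plain "text" column.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
    args=TrainingArguments(
        output_dir="./falcon-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        bf16=True,
        num_train_epochs=1,
    ),
)
trainer.train()
```

Launched with `accelerate launch` or `torchrun`, each process gets its own GPU and the trainer handles the DDP wrapping itself.
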
