RuntimeError: Socket Timeout #97

Open
angeliababy opened this issue Apr 21, 2023 · 8 comments

Comments

@angeliababy

sh training/finetune_Pythia-Chat-Base-7B.sh

Namespace(use_cuda=True, cuda_id=0, cuda_num=1, debug_mem=True, dist_backend='cupy_nccl', dp_backend='nccl', dist_url='tcp://127.0.0.1:7033', world_size= train_data=['./glue_dataset/data/QQP/train.tsv'], valid_data=['./glue_dataset/data/QQP/test.tsv'], tokenizer_type='BertWordPieceLowerCase', vocab_file='', train_log_backend='print', project_name='together', batch_size=32, micro_batch_size=1, lr=1e-05, num_iters=10, fp16=True, loss_scale=0, initial_loss_slreduce', gradient_accumulate_step=1, model_name='/data/app/OpenChatKit/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/', toketype='gptneox', checkpoint_path='/data/app/OpenChatKit/training/../model_ckpts/Pythia-Chat-Base-7B', task_name='/data/app/OpenChatKit/training/../data/OI_checkpoint=True, seed=42, profiling='no-profiling', trace_postfix='default', evaluation_steps=0, evaluation_data=None, evaluation_num_batch=None, checkp
Traceback (most recent call last):
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 275, in main
    init_communicators(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 85, in init_communicators
    default_init(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 81, in default_init
    dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=5*60), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout

The error above is raised when running with a single GPU.
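For context: dist.init_process_group with a tcp:// init_method blocks until all world_size ranks have joined the rendezvous. If the launch script starts fewer processes than --world-size expects (e.g. one process with world_size > 1), every rank that did start waits out the timeout (5 minutes in the script above) and then raises this Socket Timeout. A minimal stdlib sketch of the same rendezvous pattern (illustrative only, not torch's implementation):

```python
import threading

def rendezvous(rank, world_size, barrier, timeout=2.0):
    # Like dist.init_process_group: block until all world_size ranks
    # arrive; a missing rank trips the timeout for everyone who did join.
    try:
        barrier.wait(timeout=timeout)
        return f"rank {rank}: comm init done"
    except threading.BrokenBarrierError:
        raise RuntimeError("Socket Timeout")  # the error torch surfaces

world_size = 2
results = []

# Launching only rank 0 while expecting world_size=2 reproduces the
# hang-then-error seen in the traceback:
try:
    rendezvous(0, world_size, threading.Barrier(world_size), timeout=0.5)
except RuntimeError as err:
    results.append(str(err))

# With every expected rank actually launched, the rendezvous completes:
barrier = threading.Barrier(world_size)
threads = [
    threading.Thread(
        target=lambda r=r: results.append(rendezvous(r, world_size, barrier))
    )
    for r in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

So for a true single-GPU run, --world-size (and the group sizes derived from it) must be reduced to match the one process actually launched.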

darrinh commented May 19, 2023

Getting same error here.

darrinh commented May 19, 2023

Some of the other parameters need to be adjusted for a single GPU:

< --num-layers 4 --embedding-dim 4096
< --world-size 1
That gets me:

Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
comm init done!!

but I forgot to download the pretrained model (as per the training instructions), so it stopped there. I'll post results once that step is complete.

cheers
Darrin

yxy123 commented May 31, 2023

Hi Darrin,
I'm also getting the same error here with two GPUs.
I only modified the fine-tuning script:
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0
&
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1
For fine-tuning, do I also need:
--num-layers 4 --embedding-dim 4096
--world-size 2 --pipeline-group-size 4 --data-group-size 2
right?
I have tried with a single GPU and modified the related parameters, but hit the issue below:
  File "/mnt/tet/OpenChatKit-main/training/comm/comm_utils.py", line 86, in init_communicators
    assert args.world_size == args.data_group_size * args.pipeline_group_size
AssertionError

Thanks
Yuanyuan

darrinh commented May 31, 2023

It won't train on my 12 GB GPU; it runs out of memory, so it needs more VRAM than I currently have.
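A rough back-of-envelope shows why 12 GB cannot be enough for full fine-tuning of a ~7B-parameter model (byte counts are assumptions: fp16 weights and gradients plus Adam's two fp32 moment tensors; activations and CUDA overhead are excluded, so real usage is even higher):

```python
def finetune_memory_gb(params_billions,
                       bytes_weights=2,     # fp16 weights
                       bytes_grads=2,       # fp16 gradients
                       bytes_optimizer=8):  # Adam m and v, fp32 each
    # Rough static footprint of full fine-tuning, excluding activations.
    # 1e9 params * bytes / 1e9 bytes-per-GB cancels, so this is just
    # params_billions * bytes-per-parameter.
    per_param = bytes_weights + bytes_grads + bytes_optimizer
    return params_billions * per_param

full_finetune = finetune_memory_gb(7)  # ~84 GB: far beyond a 12 GB card
weights_only = 7 * 2                   # ~14 GB just to load fp16 weights
```

Even holding the fp16 weights alone exceeds 12 GB, which is why the LoRA route suggested below the thread (training only small adapter matrices while the base weights stay frozen) is the usual answer on consumer cards.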

@orangetin
Member

@darrinh The fine-tuning script will most likely not work with 12 GB of VRAM. I'd recommend using LoRA for fine-tuning instead.

Here's some sample code to get you started: https://github.com/togethercomputer/OpenChatKit/blob/ecfe4d5d9b5f4b1a533c4468cc1b7e1107b9a819/training/lora/redpajama-incite-chat-3b.py

darrinh commented May 31, 2023

Thanks @orangetin, it starts but quickly runs out of memory. Thanks for the link, I will check it out.

thanks

@orangetin
Member


@yxy123 The arguments provided are invalid: args.world_size == args.data_group_size * args.pipeline_group_size must hold.

Change this line, --world-size 2 --pipeline-group-size 4 --data-group-size 2, so that world_size = pipeline_group_size * data_group_size.
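The constraint can be checked before launching anything; a small sketch of the same check (the function name is illustrative, not part of OpenChatKit):

```python
def validate_parallel_config(world_size, pipeline_group_size, data_group_size):
    # Mirrors the assertion in comm_utils.py: the ranks form a
    # pipeline_group_size x data_group_size grid, which must exactly
    # tile world_size.
    if world_size != pipeline_group_size * data_group_size:
        raise ValueError(
            f"world_size ({world_size}) must equal pipeline_group_size "
            f"({pipeline_group_size}) * data_group_size ({data_group_size})"
        )
    return True

# The failing combination from the comment above: 2 != 4 * 2
try:
    validate_parallel_config(2, 4, 2)
except ValueError as err:
    message = str(err)

# Two consistent ways to split 2 GPUs:
pipeline_split = validate_parallel_config(2, 2, 1)  # 2-stage pipeline
data_split = validate_parallel_config(2, 1, 2)      # pure data parallelism
```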

yxy123 commented May 31, 2023

@orangetin Got it, thanks very much, it worked.
