RuntimeError: Socket Timeout #97

Open
angeliababy opened this issue Apr 21, 2023 · 8 comments

Comments

@angeliababy

sh training/finetune_Pythia-Chat-Base-7B.sh

Namespace(use_cuda=True, cuda_id=0, cuda_num=1, debug_mem=True, dist_backend='cupy_nccl', dp_backend='nccl', dist_url='tcp://127.0.0.1:7033', world_size= train_data=['./glue_dataset/data/QQP/train.tsv'], valid_data=['./glue_dataset/data/QQP/test.tsv'], tokenizer_type='BertWordPieceLowerCase', vocab_file='', train_log_backend='print', project_name='together', batch_size=32, micro_batch_size=1, lr=1e-05, num_iters=10, fp16=True, loss_scale=0, initial_loss_slreduce', gradient_accumulate_step=1, model_name='/data/app/OpenChatKit/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/', toketype='gptneox', checkpoint_path='/data/app/OpenChatKit/training/../model_ckpts/Pythia-Chat-Base-7B', task_name='/data/app/OpenChatKit/training/../data/OI_checkpoint=True, seed=42, profiling='no-profiling', trace_postfix='default', evaluation_steps=0, evaluation_data=None, evaluation_num_batch=None, checkp
Traceback (most recent call last):
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 275, in main
    init_communicators(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 85, in init_communicators
    default_init(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 81, in default_init
    dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=5*60), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout

The error above is raised when running with a single GPU.
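For context: dist.init_process_group with a tcp:// init_method blocks until all world_size ranks have joined the rendezvous. If the launch script starts fewer processes than --world-size expects (e.g. one process with world_size > 1), every rank that did start waits out the timeout (5 minutes in the script above) and then raises this Socket Timeout. A minimal stdlib sketch of the same rendezvous pattern (illustrative only, not torch's implementation):

```python
import threading

def rendezvous(rank, world_size, barrier, timeout=2.0):
    # Like dist.init_process_group: block until all world_size ranks
    # arrive; a missing rank trips the timeout for everyone who did join.
    try:
        barrier.wait(timeout=timeout)
        return f"rank {rank}: comm init done"
    except threading.BrokenBarrierError:
        raise RuntimeError("Socket Timeout")  # the error torch surfaces

world_size = 2
results = []

# Launching only rank 0 while expecting world_size=2 reproduces the
# hang-then-error seen in the traceback:
try:
    rendezvous(0, world_size, threading.Barrier(world_size), timeout=0.5)
except RuntimeError as err:
    results.append(str(err))

# With every expected rank actually launched, the rendezvous completes:
barrier = threading.Barrier(world_size)
threads = [
    threading.Thread(
        target=lambda r=r: results.append(rendezvous(r, world_size, barrier))
    )
    for r in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

So for a true single-GPU run, --world-size (and the group sizes derived from it) must be reduced to match the one process actually launched.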

darrinh commented May 19, 2023

Getting same error here.

darrinh commented May 19, 2023

Some of the other parameters need to be adjusted for a single GPU:

< --num-layers 4 --embedding-dim 4096
< --world-size 1
That gets me:

Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0
comm init done!!

but I forgot to download the pretrained model (as per the training instructions), so it stopped there. I'll post results once that step is complete.

cheers
Darrin

yxy123 commented May 31, 2023

Hi Darrin,
I'm also getting the same error here with two GPUs.
I only modified the fine-tuning script:
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0
&
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1
For fine-tuning, do I also need:
--num-layers 4 --embedding-dim 4096
--world-size 2 --pipeline-group-size 4 --data-group-size 2
right?
I have tried with a single GPU and modified the related parameters, but hit the issue below:
  File "/mnt/tet/OpenChatKit-main/training/comm/comm_utils.py", line 86, in init_communicators
    assert args.world_size == args.data_group_size * args.pipeline_group_size
AssertionError

Thanks
Yuanyuan

darrinh commented May 31, 2023

It won't train on my 12 GB GPU; it runs out of memory, so it needs more VRAM than I currently have.
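A rough back-of-envelope shows why 12 GB cannot be enough for full fine-tuning of a ~7B-parameter model (byte counts are assumptions: fp16 weights and gradients plus Adam's two fp32 moment tensors; activations and CUDA overhead are excluded, so real usage is even higher):

```python
def finetune_memory_gb(params_billions,
                       bytes_weights=2,     # fp16 weights
                       bytes_grads=2,       # fp16 gradients
                       bytes_optimizer=8):  # Adam m and v, fp32 each
    # Rough static footprint of full fine-tuning, excluding activations.
    # 1e9 params * bytes / 1e9 bytes-per-GB cancels, so this is just
    # params_billions * bytes-per-parameter.
    per_param = bytes_weights + bytes_grads + bytes_optimizer
    return params_billions * per_param

full_finetune = finetune_memory_gb(7)  # ~84 GB: far beyond a 12 GB card
weights_only = 7 * 2                   # ~14 GB just to load fp16 weights
```

Even holding the fp16 weights alone exceeds 12 GB, which is why the LoRA route suggested below the thread (training only small adapter matrices while the base weights stay frozen) is the usual answer on consumer cards.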

@orangetin
Member

@darrinh The fine-tuning script will most likely not work with 12 GB of VRAM. I'd recommend using LoRA for fine-tuning instead.

Here's some sample code to get you started: https://github.com/togethercomputer/OpenChatKit/blob/ecfe4d5d9b5f4b1a533c4468cc1b7e1107b9a819/training/lora/redpajama-incite-chat-3b.py

darrinh commented May 31, 2023

Thanks @orangetin, it starts but quickly runs out of memory. Thanks for the link, I will check it out.

thanks

@orangetin
Member


@yxy123 The arguments provided are invalid: args.world_size == args.data_group_size * args.pipeline_group_size must hold.

Change this line, --world-size 2 --pipeline-group-size 4 --data-group-size 2, so that world_size = pipeline_group_size * data_group_size.
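The constraint can be checked before launching anything; a small sketch of the same check (the function name is illustrative, not part of OpenChatKit):

```python
def validate_parallel_config(world_size, pipeline_group_size, data_group_size):
    # Mirrors the assertion in comm_utils.py: the ranks form a
    # pipeline_group_size x data_group_size grid, which must exactly
    # tile world_size.
    if world_size != pipeline_group_size * data_group_size:
        raise ValueError(
            f"world_size ({world_size}) must equal pipeline_group_size "
            f"({pipeline_group_size}) * data_group_size ({data_group_size})"
        )
    return True

# The failing combination from the comment above: 2 != 4 * 2
try:
    validate_parallel_config(2, 4, 2)
except ValueError as err:
    message = str(err)

# Two consistent ways to split 2 GPUs:
pipeline_split = validate_parallel_config(2, 2, 1)  # 2-stage pipeline
data_split = validate_parallel_config(2, 1, 2)      # pure data parallelism
```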

yxy123 commented May 31, 2023

@orangetin Got it, thanks very much, it worked.
