Can’t continue training from a checkpoint in distributed mode #146

Open
666wodeyy opened this issue Sep 30, 2024

My dataset consists of 8,000 grayscale images of size 256×256. The following is my training script:

```bash
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"

DIFFUSION_FLAGS="--diffusion_steps 1000 \
                --noise_schedule cosine \
                --use_kl True"

TRAIN_FLAGS="--lr 1e-4 --batch_size 8"
export OPENAI_LOGDIR=XXXX

# Export NCCL settings so they reach the MPI worker processes.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_NTHREADS=8

# Pick a free TCP port for the distributed rendezvous.
MASTER_PORT=$(python -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(('',0)); print(s.getsockname()[1]); s.close()")
export MASTER_ADDR=localhost
export MASTER_PORT=$MASTER_PORT

NUM_GPUS="2"
mpiexec -n $NUM_GPUS python image_train.py --data_dir ./data/XXX \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS \
    --resume_checkpoint ./training_log/CREMI/model039000.pt
```
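For reference, a minimal sanity check (just a diagnostic sketch, reusing the checkpoint path from the command above) to confirm the checkpoint file itself is readable on CPU, outside of MPI/NCCL:

```bash
# Diagnostic: load the checkpoint in a single plain-Python process on CPU.
# If this fails, the problem is the file itself, not the distributed setup.
python - <<'PY'
import torch
sd = torch.load("./training_log/CREMI/model039000.pt", map_location="cpu")
print(f"loaded {len(sd)} entries; sample keys: {list(sd)[:3]}")
PY
```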

Strangely, when I do not specify a checkpoint (i.e., run without the --resume_checkpoint flag), the model trains normally on the two V100s, but when I add the checkpoint to continue training, it fails with the error below.

[screenshot of the error traceback]
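If resuming also restores optimizer and EMA state from companion files saved next to the model checkpoint (an assumption about this repo's checkpoint layout, not confirmed here), those files would need to exist for step 039000 as well:

```bash
# Assumption: resume also reads optimizer/EMA checkpoints saved alongside the
# model file; the opt*/ema_* names below are illustrative, not confirmed.
ls -lh ./training_log/CREMI/model039000.pt
ls -lh ./training_log/CREMI/opt039000.pt 2>/dev/null || echo "optimizer checkpoint missing?"
ls -lh ./training_log/CREMI/ema_*_039000.pt 2>/dev/null || echo "EMA checkpoint missing?"
```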
