Fine-tuning SVD on 4x RTX 4090: training saves the model and exits early with no error #78

Open
ImmNaruto opened this issue Sep 14, 2024 · 0 comments


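Following the multi-GPU configuration, I prepared a dataset of a few dozen samples and tried fine-tuning SVD. Training runs normally, but at a fixed step count it saves the model and exits without any error. Could anyone help explain this? The environment list and the details of the training run are below:
Environment list: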

Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    defaults
_openmp_mutex             5.1                       1_gnu    defaults
accelerate                0.21.0                   pypi_0    pypi
aiofiles                  23.2.1                   pypi_0    pypi
annotated-types           0.7.0                    pypi_0    pypi
antlr4-python3-runtime    4.9.3                    pypi_0    pypi
anyio                     4.4.0                    pypi_0    pypi
bzip2                     1.0.8                h5eee18b_6    defaults
ca-certificates           2024.7.2             h06a4308_0    defaults
certifi                   2024.8.30                pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
cmake                     3.30.3                   pypi_0    pypi
compel                    2.0.3                    pypi_0    pypi
contourpy                 1.3.0                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
decord                    0.6.0                    pypi_0    pypi
deepspeed                 0.13.5                   pypi_0    pypi
diffusers                 0.24.0                   pypi_0    pypi
easydict                  1.13                     pypi_0    pypi
einops                    0.8.0                    pypi_0    pypi
exceptiongroup            1.2.2                    pypi_0    pypi
fastapi                   0.112.2                  pypi_0    pypi
ffmpy                     0.4.0                    pypi_0    pypi
filelock                  3.15.4                   pypi_0    pypi
fonttools                 4.53.1                   pypi_0    pypi
fsspec                    2024.6.1                 pypi_0    pypi
gradio                    4.42.0                   pypi_0    pypi
gradio-client             1.3.0                    pypi_0    pypi
h11                       0.14.0                   pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
httpcore                  1.0.5                    pypi_0    pypi
httpx                     0.27.2                   pypi_0    pypi
huggingface-hub           0.24.6                   pypi_0    pypi
idna                      3.8                      pypi_0    pypi
imageio                   2.35.1                   pypi_0    pypi
imageio-ffmpeg            0.5.1                    pypi_0    pypi
importlib-metadata        8.4.0                    pypi_0    pypi
importlib-resources       6.4.4                    pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1    defaults
libffi                    3.4.4                h6a678d5_1    defaults
libgcc-ng                 11.2.0               h1234567_1    defaults
libgomp                   11.2.0               h1234567_1    defaults
libstdcxx-ng              11.2.0               h1234567_1    defaults
libuuid                   1.41.5               h5eee18b_0    defaults
lit                       18.1.8                   pypi_0    pypi
markdown-it-py            3.0.0                    pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.9.2                    pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    defaults
networkx                  3.3                      pypi_0    pypi
ninja                     1.11.1.1                 pypi_0    pypi
numpy                     1.26.4                   pypi_0    pypi
nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
nvidia-ml-py              12.560.30                pypi_0    pypi
nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
nvidia-nvjitlink-cu12     12.6.68                  pypi_0    pypi
nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
omegaconf                 2.3.0                    pypi_0    pypi
opencv-python             4.10.0.84                pypi_0    pypi
openssl                   3.0.14               h5eee18b_0    defaults
orjson                    3.10.7                   pypi_0    pypi
packaging                 24.1                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
pillow                    10.4.0                   pypi_0    pypi
pip                       24.2            py310h06a4308_0    defaults
psutil                    6.0.0                    pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
pydantic                  2.8.2                    pypi_0    pypi
pydantic-core             2.20.1                   pypi_0    pypi
pydub                     0.25.1                   pypi_0    pypi
pygments                  2.18.0                   pypi_0    pypi
pynvml                    11.5.3                   pypi_0    pypi
pyparsing                 3.1.4                    pypi_0    pypi
python                    3.10.14              h955ad1f_1    defaults
python-dateutil           2.9.0.post0              pypi_0    pypi
python-multipart          0.0.9                    pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.2                    pypi_0    pypi
readline                  8.2                  h5eee18b_0    defaults
regex                     2024.7.24                pypi_0    pypi
requests                  2.32.3                   pypi_0    pypi
rich                      13.8.0                   pypi_0    pypi
rotary-embedding-torch    0.6.5                    pypi_0    pypi
ruff                      0.6.3                    pypi_0    pypi
safetensors               0.4.4                    pypi_0    pypi
semantic-version          2.10.0                   pypi_0    pypi
setuptools                72.1.0          py310h06a4308_0    defaults
shellingham               1.5.4                    pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
sniffio                   1.3.1                    pypi_0    pypi
socksio                   1.0.0                    pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0    defaults
starlette                 0.38.2                   pypi_0    pypi
sympy                     1.13.2                   pypi_0    pypi
tk                        8.6.14               h39e8969_0    defaults
tokenizers                0.15.2                   pypi_0    pypi
tomlkit                   0.12.0                   pypi_0    pypi
torch                     2.0.0+cu118              pypi_0    pypi
torchvision               0.15.0+cu118             pypi_0    pypi
tqdm                      4.66.5                   pypi_0    pypi
transformers              4.36.2                   pypi_0    pypi
triton                    2.0.0                    pypi_0    pypi
typer                     0.12.5                   pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
urllib3                   2.2.2                    pypi_0    pypi
uvicorn                   0.30.6                   pypi_0    pypi
websockets                12.0                     pypi_0    pypi
wheel                     0.43.0          py310h06a4308_0    defaults
xz                        5.4.6                h5eee18b_1    defaults
zipp                      3.20.1                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1    defaults

DeepSpeed training config (YAML):

compute_environment: LOCAL_MACHINE
# debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
# GPUs to use
gpu_ids: 0,1,6,7
machine_rank: 0
main_training_function: main
main_process_port: 39500
# precision
mixed_precision: fp16
num_machines: 1
# number of processes
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Start of the training log, which contains OVERFLOW INFO messages:

09/14/2024 09:27:10 - INFO - __main__ - ***** Running training *****
09/14/2024 09:27:10 - INFO - __main__ -   Num examples = 62
09/14/2024 09:27:10 - INFO - __main__ -   Num Epochs = 125
09/14/2024 09:27:10 - INFO - __main__ -   Instantaneous batch size per device = 1
09/14/2024 09:27:10 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
09/14/2024 09:27:10 - INFO - __main__ -   Gradient Accumulation steps = 1
09/14/2024 09:27:10 - INFO - __main__ -   Total optimization steps = 8000
Steps:   0%|                                                                                                                                            | 0/8000 [00:00<?, ?it/s]
[2024-09-14 09:27:14,509] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
Steps:   0%|                                                                                                        | 1/8000 [00:03<7:50:21,  3.53s/it, lr=2e-5, step_loss=0.109][2024-09-14 09:27:15,833] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
Steps:   0%|                                                                                                       | 2/8000 [00:04<4:57:23,  2.23s/it, lr=2e-5, step_loss=0.0687][2024-09-14 09:27:17,365] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
Steps:   0%|                                                                                                        | 3/8000 [00:06<4:14:51,  1.91s/it, lr=2e-5, step_loss=0.123][2024-09-14 09:27:18,930] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
Steps:   0%|                                                                                                        | 4/8000 [00:07<3:56:32,  1.77s/it, lr=2e-5, step_loss=0.201][2024-09-14 09:27:20,186] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:03<00:00,  7.55it/s]
09/14/2024 09:27:24 - INFO - __main__ - Saved a new sample to ./output/svd/train_2024-09-14T09-26-49/samples/5_dataset-video_json.gif
Steps:   0%|                                                                                                        | 5/8000 [00:13<3:31:35,  1.59s/it, lr=2e-5, step_loss=0.102][2024-09-14 09:27:26,197] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
Steps:   0%|                                                                                                         | 6/8000 [00:15<6:51:55,  3.09s/it, lr=2e-5, step_loss=0.49][2024-09-14 09:27:27,880] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
Steps:   0%|                                                                                                        | 7/8000 [00:16<5:50:30,  2.63s/it, lr=2e-5, step_loss=0.244][2024-09-14 09:27:29,945] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
Steps:   0%|                                                                                                        | 8/8000 [00:18<5:26:29,  2.45s/it, lr=2e-5, step_loss=0.293][2024-09-14 09:27:31,350] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216, reducing to 8388608
Steps:   0%|                                                                                                        | 9/8000 [00:20<4:42:52,  2.12s/it, lr=2e-5, step_loss=0.133][2024-09-14 09:27:33,594] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608, reducing to 4194304
Steps:   0%|▏                                                                                                      | 10/8000 [00:22<4:47:47,  2.16s/it, lr=2e-5, step_loss=0.326][2024-09-14 09:27:34,967] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304, reducing to 2097152
Steps:   0%|▏                                                                                                     | 11/8000 [00:23<4:15:37,  1.92s/it, lr=2e-5, step_loss=0.0787][2024-09-14 09:27:37,142] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152, reducing to 1048576
Steps:   0%|▏                                                                                                     | 13/8000 [00:28<4:38:11,  2.09s/it, lr=2e-5, step_loss=0.0888][2024-09-14 09:27:40,729] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576, reducing to 524288
Steps:   0%|▏                                                                                                      | 14/8000 [00:29<4:05:47,  1.85s/it, lr=2e-5, step_loss=0.116][2024-09-14 09:27:42,393] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288, reducing to 262144
Steps:   0%|▏                                                                                                      | 18/8000 [00:40<6:01:01,  2.71s/it, lr=2e-5, step_loss=0.274][2024-09-14 09:27:52,837] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072
Steps:   0%|▎                                                                                                      | 20/8000 [00:43<4:47:32,  2.16s/it, lr=2e-5, step_loss=0.427]
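
As a quick sanity check on the OVERFLOW messages above, here is a minimal sketch that reproduces the sequence of loss scales. It assumes only what the log itself shows, namely that the scale is halved on every skipped step; `halve_on_overflow` is a hypothetical helper, not a DeepSpeed function.

```python
def halve_on_overflow(initial_scale: int, num_overflows: int) -> list[int]:
    """Return the loss scale after each overflow, assuming a plain halving rule."""
    scales = []
    scale = initial_scale
    for _ in range(num_overflows):
        scale //= 2  # the log shows each skipped step cutting the scale in half
        scales.append(scale)
    return scales


# 4294967296 (2**32) is the first "Attempted loss scale" in the log,
# and the excerpt contains 15 OVERFLOW lines in total.
scales = halve_on_overflow(2**32, 15)
print(scales[0], scales[-1])
```

The first and last values match the first and last "reducing to" figures in the excerpt (2147483648 and 131072). This pattern looks like the dynamic loss scaler settling at the start of fp16 training, so it may well be unrelated to the early exit.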

End of the training log:

Steps:  25%|█████████████████████████████▋                                                                                         | 1995/8000 [1:10:37<4:06:40,  2.46s/it, lr=2e-5,
Steps:  25%|█████████████████████████████▉                                                                                          | 1996/8000 [1:10:39<4:12:22,  2.52s/it, lr=2e-5
Steps:  25%|█████████████████████████████▉                                                                                          | 1997/8000 [1:10:42<4:02:19,  2.42s/it, lr=2e-5
Steps:  25%|█████████████████████████████▋                                                                                         | 1998/8000 [1:10:44<4:07:01,  2.47s/it, lr=2e-5,
Steps:  25%|█████████████████████████████▋                                                                                         | 1999/8000 [1:10:47<4:04:27,  2.44s/it, lr=2e-5,
Steps:  25%|█████████████████████████████▊                                                                                         | 2000/8000 [1:10:49<4:18:17,  2.58s/it, lr=2e-5,
Steps:  25%|█████████████████████████████▊                                                                                         | 2001/8000 [1:10:53<4:35:51,  2.76s/it, lr=2e-5,
Steps:  25%|█████████████████████████████▊                                                                                         | 2002/8000 [1:10:54<4:06:22,  2.46s/it, lr=2e-5, step_loss=0.0491]
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of /opt/xj_ai_suanfa/animate-anything/models/animate_anything_svdv1.
Loaded vae as AutoencoderKLTemporalDecoder from `vae` subfolder of /opt/xj_ai_suanfa/animate-anything/models/animate_anything_svdv1.
Loaded feature_extractor as CLIPImageProcessor from `feature_extractor` subfolder of /opt/xj_ai_suanfa/animate-anything/models/animate_anything_svdv1.
Loaded image_encoder as CLIPVisionModelWithProjection from `image_encoder` subfolder of /opt/xj_ai_suanfa/animate-anything/models/animate_anything_svdv1.
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  5.30it/s]
/opt/anaconda3/envs/svd/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py:761: FutureWarning: `torch_dtype` is deprecated and will be removed in version 0.25.0.
  deprecate("torch_dtype", "0.25.0", "")
Configuration saved in ./output/svd/train_2024-09-13T14-55-06/vae/config.json
Model weights saved in ./output/svd/train_2024-09-13T14-55-06/vae/diffusion_pytorch_model.safetensors
Configuration saved in ./output/svd/train_2024-09-13T14-55-06/unet/config.json
Model weights saved in ./output/svd/train_2024-09-13T14-55-06/unet/diffusion_pytorch_model.safetensors
Configuration saved in ./output/svd/train_2024-09-13T14-55-06/scheduler/scheduler_config.json
Configuration saved in ./output/svd/train_2024-09-13T14-55-06/model_index.json
09/13/2024 16:06:40 - INFO - __main__ - Saved model at ./output/svd/train_2024-09-13T14-55-06 on step 2002
Steps:  25%|█████████████████████████████▊                                                                                         | 2002/8000 [1:11:15<3:33:27,  2.14s/it, lr=2e-5, step_loss=0.0491]
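
The figures in the "Running training" header allow a rough check of how many optimizer steps the epoch loop can reach. The sketch below rests on assumptions that the training script does not confirm: that the dataloader is split evenly across the 4 processes, that the last partial batch is kept, and that training ends after `Num Epochs` epochs even if `Total optimization steps` has not been reached. `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples: int, per_device_batch_size: int,
                    num_processes: int, grad_accum_steps: int) -> int:
    """Optimizer steps per epoch when batches are sharded across processes."""
    # Each process sees roughly 1/num_processes of the batches (assumption).
    batches_per_process = math.ceil(
        num_examples / (per_device_batch_size * num_processes)
    )
    return math.ceil(batches_per_process / grad_accum_steps)


# Values copied from the log header: 62 examples, batch size 1 per device,
# 4 processes, gradient accumulation 1, 125 epochs.
per_epoch = steps_per_epoch(62, 1, 4, 1)
reachable_steps = per_epoch * 125
print(per_epoch, reachable_steps)
```

Under these assumptions the loop can reach about 2000 steps, which is close to step 2002 where the run saved the model and exited, and far short of the 8000 total steps reported. This suggests, but does not prove, that the epoch count rather than the step count is what ends training.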