
Should I be worried? My training counters are very high. #2962

Open
AbstractEyes opened this issue Nov 12, 2024 · 0 comments
AbstractEyes commented Nov 12, 2024

[screenshots attached]

I'm currently renting a couple of A100s and the projected outcome is dismal: roughly 350 hours for about 25,000 images, which is only about 420k total 1024x1024 samples. I'm fairly sure the high counters and the long estimate are connected.
A single 4090 manages roughly 1k samples per hour, so this estimate can't be anywhere near right.
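For reference, a rough sketch of the back-of-envelope math. The per-GPU rate is just the ~1k samples/hour 4090 figure above, and the image count is the bucketed count from the log below, so treat the numbers as assumptions:

```python
# Rough sanity check of the reported numbers. The 1000 samples/hour/GPU rate is
# only the ballpark 4090 figure from this post, not a measured A100 number.
images_after_bucketing = 21164   # from the max_train_steps line in the log below
epochs = 20
total_samples = images_after_bucketing * epochs      # ~423k 1024x1024 samples

rate_per_gpu = 1000              # samples per hour (assumed)
num_gpus = 2                     # two rented A100s

expected_hours = total_samples / (rate_per_gpu * num_gpus)
print(f"{total_samples} samples, ~{expected_hours:.0f} hours expected")
# -> 423280 samples, ~212 hours expected, versus the ~350 hours being reported
```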

[screenshot attached]

22:57:10-863550 INFO     Gradient accumulation steps: 1
22:57:10-864904 INFO     Epoch: 20
22:57:10-866186 INFO     max_train_steps (21164 / 14 / 1 * 20 * 1) = 30235
22:57:10-867668 INFO     stop_text_encoder_training = 0
22:57:10-868913 INFO     lr_warmup_steps = 0.1
22:57:10-881778 INFO     Saving training config to ./output/simulacrum_v3/simulacrum_v3_20241112-225710.json...
22:57:10-888016 INFO     Executing command: /workspace/kohya_ss/venv/bin/accelerate launch --dynamo_backend tensorrt --dynamo_mode max-autotune --dynamo_use_fullgraph --dynamo_use_dynamic
                         --mixed_precision bf16 --multi_gpu --num_processes 2 --num_machines 1 --num_cpu_threads_per_process 8 /workspace/kohya_ss/sd-scripts/flux_train_network.py
                         --config_file ./output/simulacrum_v3/config_lora-20241112-225710.toml --console_log_file ./logs/loggylog.txt
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
2024-11-12 22:57:55.580643: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-12 22:57:55.617967: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-12 22:57:55.618054: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-12 22:57:55.618118: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-12 22:57:55.629785: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-12 22:57:55.754205: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-12 22:57:55.802245: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-12 22:57:55.802350: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-12 22:57:55.802420: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-12 22:57:55.814859: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-12 22:58:01.108559: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-11-12 22:58:01.264904: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
2024-11-12 22:58:18 INFO     Loading settings from ./output/simulacrum_v3/config_lora-20241112-225710.toml...                                                                      train_util.py:4451
                    INFO     ./output/simulacrum_v3/config_lora-20241112-225710                                                                                                    train_util.py:4470
/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
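For what it's worth, the max_train_steps line above can be reproduced with the arithmetic below. This is only my reading of the logged expression, not the actual kohya_ss source, so the variable names are guesses:

```python
import math

# Reconstructing the logged expression "max_train_steps (21164 / 14 / 1 * 20 * 1) = 30235".
# Variable names are my interpretation of each factor, not kohya_ss internals.
image_steps = 21164       # bucketed image/step count reported by the dataloader
batch_size = 14
grad_accum_steps = 1
epochs = 20
extra_factor = 1          # the final "* 1" in the logged expression

max_train_steps = math.ceil(image_steps / batch_size / grad_accum_steps * epochs * extra_factor)
print(max_train_steps)    # 30235, matching the log
# Note: the logged expression itself contains no division by the 2 processes.
```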

The current version also seems to do almost no checking for invalid bucketed latents. It buckets everything without validation, and then when the run actually reaches those images it crashes 50+ times in a row while I filter out, one at a time, the images the system considers invalid. There's no automated removal or skipping, so I have to clean up manually after each crash.
The best I could do without hooking anything was to add a print statement, because half the time the error doesn't even say which latent is invalid.
I run my own validation checks in-house, and they counted for nothing once the images hit this system, because it is extremely sensitive to anything like missing bits, malformed headers, and so on. My tool uses PIL, so I don't see why half of these failures even happen.
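A minimal pre-flight scan along these lines is what I have in mind (the directory and extensions are placeholders, and it uses plain PIL, nothing from kohya_ss): fully decode every image up front so corrupt files are reported in one pass instead of one crash at a time.

```python
from pathlib import Path
from PIL import Image

# Pre-flight scan: fully decode every image before training so corrupt files
# are all reported in one pass. Directory and extensions are placeholders.
dataset_dir = Path("./dataset")
extensions = {".png", ".jpg", ".jpeg", ".webp"}

bad_files = []
for path in sorted(dataset_dir.rglob("*")):
    if path.suffix.lower() not in extensions:
        continue
    try:
        with Image.open(path) as img:
            img.load()              # force a full decode, not just a header read
            img.convert("RGB")      # also surface unexpected modes/channel counts
    except Exception as err:        # PIL can raise several different error types
        bad_files.append((path, err))

for path, err in bad_files:
    print(f"INVALID: {path} -> {err}")
print(f"{len(bad_files)} invalid files found")
```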

[screenshots attached]
