I'm currently renting a couple of A100s and the throughput is dismal: roughly 350 hours for about 25,000 images, which comes out to only about 420k total 1024x1024 samples. I'm pretty sure the two are connected.
A single 4090 reportedly manages about 1k samples per hour, so two A100s averaging roughly 600 samples per GPU per hour can't be anywhere near right.
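As a quick sanity check, here is the arithmetic behind those numbers, using only the figures quoted above (the totals are approximate):

```python
# Quick sanity check of the throughput figures quoted above.
total_samples = 420_000      # ~420k 1024x1024 samples seen during training
wall_hours = 350             # rented wall-clock time
num_gpus = 2                 # 2x A100

samples_per_hour = total_samples / wall_hours        # ~1200 samples/hour for the pair
samples_per_gpu_hour = samples_per_hour / num_gpus   # ~600 samples/hour per A100

print(f"{samples_per_hour:.0f} samples/hour total, "
      f"{samples_per_gpu_hour:.0f} samples/hour per A100")
# Compare against the ~1000 samples/hour a single 4090 reportedly manages.
```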
22:57:10-863550 INFO Gradient accumulation steps: 1
22:57:10-864904 INFO Epoch: 20
22:57:10-866186 INFO max_train_steps (21164 / 14 / 1 * 20 * 1) = 30235
22:57:10-867668 INFO stop_text_encoder_training = 0
22:57:10-868913 INFO lr_warmup_steps = 0.1
22:57:10-881778 INFO Saving training config to ./output/simulacrum_v3/simulacrum_v3_20241112-225710.json...
22:57:10-888016 INFO Executing command: /workspace/kohya_ss/venv/bin/accelerate launch --dynamo_backend tensorrt --dynamo_mode max-autotune --dynamo_use_fullgraph --dynamo_use_dynamic
--mixed_precision bf16 --multi_gpu --num_processes 2 --num_machines 1 --num_cpu_threads_per_process 8 /workspace/kohya_ss/sd-scripts/flux_train_network.py
--config_file ./output/simulacrum_v3/config_lora-20241112-225710.toml --console_log_file ./logs/loggylog.txt
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-12 22:57:55.580643: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-12 22:57:55.617967: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-12 22:57:55.618054: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-12 22:57:55.618118: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-12 22:57:55.629785: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-12 22:57:55.754205: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-11-12 22:57:55.802245: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-12 22:57:55.802350: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-12 22:57:55.802420: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-12 22:57:55.814859: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-12 22:58:01.108559: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-11-12 22:58:01.264904: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-12 22:58:18 INFO Loading settings from ./output/simulacrum_v3/config_lora-20241112-225710.toml... train_util.py:4451
INFO ./output/simulacrum_v3/config_lora-20241112-225710 train_util.py:4470
/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
The current version also seems to have almost no checking for invalid bucketed latents. It buckets everything without validation, and then once the latents are actually consumed it crashes; I've been through 50+ crashes so far, filtering out the images the system rejects one batch at a time. There's no automated check or removal up front, so the only way to find the bad files is after each crash.
The best I could do without hooking into the code was to add a print statement, because half the time the crash doesn't even report which latent is invalid.
I run my own validation checks in-house, and they turned out to mean nothing once the images hit this system, which is extremely sensitive to anything like truncated data or damaged headers. My tool uses PIL, so I don't see why this is even happening half the time.
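For anyone hitting the same thing, here is a minimal pre-flight filter I'd expect to catch most of these files before bucketing. This is a sketch, not part of kohya_ss; the `dataset_dir` path is hypothetical, and the two-pass verify-then-load approach reflects my assumption that truncated or partially written files are what the trainer is choking on:

```python
# Minimal pre-flight image check (a sketch, not part of kohya_ss).
# PIL's verify() only inspects headers; a second full load() is needed
# to surface truncated or partially written files.
from pathlib import Path
from PIL import Image, ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = False  # fail loudly instead of silently padding

dataset_dir = Path("./dataset")          # hypothetical dataset location
bad_files = []

for path in sorted(dataset_dir.rglob("*")):
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    try:
        with Image.open(path) as img:
            img.verify()                 # header / structure check only
        with Image.open(path) as img:
            img.load()                   # full decode, catches truncated data
    except Exception as exc:             # report the exact file instead of crashing mid-run
        bad_files.append((path, exc))
        print(f"INVALID: {path} ({exc})")

print(f"{len(bad_files)} invalid file(s) found")
```

Running something like this over the dataset and moving the reported files aside would at least replace the crash-and-filter loop with a single pass that names the offending images.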