feat: latest torch/comfyui; perf improvements; fix: SSL cert issues #309

Open
wants to merge 63 commits into main

Conversation

@tazlin (Member) commented Oct 4, 2024

New Features/Updates

Fixes and Improvements

  • Improved the stability and performance of high_performance_mode.
    • Jobs which are expected to be brief no longer block job pops. Additionally, less time is spent waiting overall when this mode is on.
  • Improved the stability and performance of max_threads values greater than one (a configuration sketch follows this list).
    • xx90 series cards will likely see a large improvement with max_threads: 2 and a bit of tuning.
      • Important: You almost certainly will want high_performance_mode if you have an xx90 card.
    • Note that cascade and flux, as well as high_memory_mode, can still lead to additional instability with max_threads: 2.
    • xx80 series cards may benefit from max_threads: 2 in SD1.5-only setups without controlnets/post-processing, or in other conservative configurations.
  • Improved process management with enhanced deadlock detection and handling.
    • In particular, hang-ups where all of the processes were available and waiting should be more readily detected and corrected.
  • Optimized image processing by using the raw PNG directly, reducing redundant operations.
    • The repeated call to PIL.Image.open(...) was highly inefficient, especially for very large images.
    • The already-encoded PNG sent from ComfyUI is used instead.
  • Added an SSL context using certifi to resolve certificate verification issues (a minimal sketch follows this list).
  • Updated documentation to reflect changes in CUDA version and new configuration options.
  • Fixed a bug where download_models.py would not exit if the compvis models failed to download. This could cause the worker to crash unexpectedly, since it expects the models to be available at worker start.
  • The docker image scheme has been substantially reworked. See the developer changes below for more information.
    • As a reminder, cloud systems such as runpod.io and vast.ai have good support for deploying docker images. See the new Dockerfiles/README.md for information on configuring these images.
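
As a rough illustration of the options above, a bridgeData.yaml might combine them like this. This is only a sketch: the key names are taken from the options mentioned in this PR, the values (aimed at an xx90-class card) are a starting point rather than a recommendation, and bridgeData_template.yaml remains the authoritative reference.

```yaml
# Hypothetical excerpt from bridgeData.yaml; values are illustrative only.
max_threads: 2                        # more than one thread mainly benefits xx90-class cards
high_performance_mode: true           # strongly recommended alongside max_threads: 2 on xx90 cards
high_memory_mode: false               # can add instability when combined with two threads
unload_models_from_vram_often: false  # feedback is requested on this when high_memory_mode is enabled
```

For the certifi change, the general pattern (not necessarily the worker's exact implementation) is to build an SSL context from certifi's CA bundle and hand it to whatever HTTP client performs the request:

```python
import ssl
import urllib.request

import certifi

# Build an SSL context backed by certifi's CA bundle so certificate
# verification works even when the system store is missing or stale.
ssl_context = ssl.create_default_context(cafile=certifi.where())

# Example only: pass the context to the client making the request.
with urllib.request.urlopen("https://aihorde.net", context=ssl_context) as response:
    print(response.status)
```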

Developer changes

feat: add ROCm and CUDA Dockerfiles with entrypoint and setup scripts


@tazlin (Member, Author) commented Oct 4, 2024

@CodiumAI-Agent /describe

@CodiumAI-Agent: This comment was marked as outdated.

If your system is set up properly (see [Prerequisites](#prerequisites))
you can just [setup](https://github.com/Haidra-Org/horde-worker-reGen?tab=readme-ov-file#configure) your bridgeData.yaml file and then run
```bash
docker compose -f Dockerfiles/compse.[cuda|rocm].yaml build --pull
```

Typo here, it should be compose.[ instead of compse.[

Good catch

@CIB (Contributor) commented Oct 30, 2024

The docker instructions aren't working for me (Arch Linux / nvidia GPU)

git clone --sparse --branch raw-png https://github.com/Haidra-Org/horde-worker-reGen.git horde-worker-reGen-png
cd horde-worker-reGen-png/
git sparse-checkout set --no-cone Dockerfiles /bridgeData_template.yaml
docker compose -f Dockerfiles/compose.cuda.yaml build --pull
docker compose -f Dockerfiles/compose.cuda.yaml up -dV
reGen  | [notice] A new release of pip is available: 24.0 -> 24.3.1
reGen  | [notice] To update, run: pip install --upgrade pip
reGen  | 2024-10-30 18:40:57.711 | DEBUG    | horde_worker_regen.load_env_vars:load_env_vars_from_config:68 - Using default AI Horde URL.
reGen  | 2024-10-30 18:40:57.740 | DEBUG    | horde_sdk:_dev_env_var_warnings:42 - AIWORKER_CACHE_HOME is ./models/.
reGen  | 2024-10-30 18:40:59.707 | DEBUG    | horde_model_reference.legacy.classes.legacy_converters:write_out_records:554 - Converted database written to: /horde-worker-reGen/models/horde_model_reference/stable_diffusion.json
reGen  | 2024-10-30 18:41:00.050 | DEBUG    | horde_model_reference.legacy.classes.legacy_converters:write_out_records:554 - Converted database written to: /horde-worker-reGen/models/horde_model_reference/stable_diffusion.json
reGen  | 2024-10-30 18:41:00.061 | WARNING  | horde_worker_regen.bridge_data.data_model:validate_performance_modes:162 - High memory mode is enabled. You may experience performance issues with more than one thread.
reGen  | 2024-10-30 18:41:00.061 | WARNING  | horde_worker_regen.bridge_data.data_model:validate_performance_modes:167 - Please let us know if `unload_models_from_vram_often` improves or degrades performance with `high_memory_mode` enabled.
reGen  | 2024-10-30 18:41:01.056 | WARNING  | horde_model_reference.model_reference_records:validator_is_style_known:132 - Unknown style control_qr for model control_qr
reGen  | 2024-10-30 18:41:01.056 | WARNING  | horde_model_reference.model_reference_records:validator_is_style_known:132 - Unknown style control_qr_xl for model control_qr_xl
reGen  | 2024-10-30 18:41:01.061 | DEBUG    | horde_sdk.ai_horde_worker.model_meta:remove_large_models:155 - Removing cascade models: {'Stable Cascade 1.0'}
reGen  | 2024-10-30 18:41:01.061 | DEBUG    | horde_sdk.ai_horde_worker.model_meta:remove_large_models:156 - Removing flux models: {'Flux.1-Schnell fp16 (Compact)', 'Flux.1-Schnell fp8 (Compact)'}
reGen  | /horde-worker-reGen/venv/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
reGen  |   warnings.warn(
reGen  | 2024-10-30 18:41:02.834 | INFO     | horde_safety.deep_danbooru_model:download_deep_danbooru_model:53 - Downloading DeepDanbooru model (~614 mb) to models/clip_blip/model-resnet_custom_v3.pt.
models/clip_blip/model-resnet_custom_v3.pt:   0% 0.00/644M [00:00<?, ?iB/s]2024-10-30 18:41:03.458 | INFO     | horde_safety.deep_danbooru_model:download_deep_danbooru_model:63 - Model already downloaded.
reGen  | 2024-10-30 18:41:03.458 | INFO     | horde_safety.deep_danbooru_model:verify_deep_danbooru_model_hash:30 - Verifying SHA256 hash of downloaded file.
models/clip_blip/model-resnet_custom_v3.pt:   0% 0.00/644M [00:00<?, ?iB/s]
reGen  | Loading CLIP model ViT-L-14/openai...
reGen  | /horde-worker-reGen/venv/lib/python3.11/site-packages/open_clip/factory.py:372: UserWarning: These pretrained weights were trained with QuickGELU activation but the model config does not have that enabled. Consider using a model config with a "-quickgelu" suffix or enable with a flag.
reGen  |   warnings.warn(
reGen  | Loaded CLIP model and data in 2.94 seconds.
reGen  | 2024-10-30 18:41:06.832 | INFO     | hordelib.comfy_horde:do_comfy_import:215 - Forcing normal vram mode
reGen  | Traceback (most recent call last):
reGen  |   File "/horde-worker-reGen/download_models.py", line 25, in <module>
reGen  |     download_all_models(
reGen  |   File "/horde-worker-reGen/horde_worker_regen/download_models.py", line 58, in download_all_models
reGen  |     hordelib.initialise()
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/initialisation.py", line 81, in initialise
reGen  |     hordelib.comfy_horde.do_comfy_import(
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/comfy_horde.py", line 229, in do_comfy_import
reGen  |     import execution
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/execution.py", line 13, in <module>
reGen  |     import nodes
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/nodes.py", line 21, in <module>
reGen  |     import comfy.diffusers_load
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/diffusers_load.py", line 3, in <module>
reGen  |     import comfy.sd
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/sd.py", line 5, in <module>
reGen  |     from comfy import model_management
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/model_management.py", line 143, in <module>
reGen  |     total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
reGen  |                                   ^^^^^^^^^^^^^^^^^^
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/model_management.py", line 112, in get_torch_device
reGen  |     return torch.device(torch.cuda.current_device())
reGen  |                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 778, in current_device
reGen  |     _lazy_init()
reGen  |   File "/horde-worker-reGen/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
reGen  |     torch._C._cuda_init()
reGen  | RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

@HPPinata

The docker instructions aren't working for me (Arch Linux / nvidia GPU)

git clone --sparse --branch raw-png https://github.com/Haidra-Org/horde-worker-reGen.git horde-worker-reGen-png
cd horde-worker-reGen-png/
git sparse-checkout set --no-cone Dockerfiles /bridgeData_template.yaml
docker compose -f Dockerfiles/compose.cuda.yaml build --pull
docker compose -f Dockerfiles/compose.cuda.yaml up -dV

Do you have your system set up to make cuda work at all?
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Ironically getting nvidia to work inside docker is not as painless as AMD, due to their custom Kernel stuff
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html

@HPPinata

I'm not sure what is and isn't required since I haven't tested NVIDIA GPUs on Linux for a while, but you might need (some portion of) the CUDA tooling installed locally.

@CIB (Contributor) commented Oct 30, 2024

The docker instructions aren't working for me (Arch Linux / nvidia GPU)

git clone --sparse --branch raw-png https://github.com/Haidra-Org/horde-worker-reGen.git horde-worker-reGen-png
cd horde-worker-reGen-png/
git sparse-checkout set --no-cone Dockerfiles /bridgeData_template.yaml
docker compose -f Dockerfiles/compose.cuda.yaml build --pull
docker compose -f Dockerfiles/compose.cuda.yaml up -dV

Do you have your system set up to make cuda work at all? sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Ironically getting nvidia to work inside docker is not as painless as AMD, due to their custom Kernel stuff https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html

Yes. In fact, I created my own Dockerfile before I knew this branch existed, and it's running fine on my system as we speak. So I'm also stumped. I can dive a bit more into comparing the two containers to figure out what's going on.

docker run --rm --gpus all ubuntu nvidia-smi --query-gpu=name --format=csv,noheader
NVIDIA GeForce RTX 4090

@HPPinata

Yes. In fact, I created my own Dockerfile before I knew this branch existed, and it's running fine on my system as we speak. So I'm also stumped. I can dive a bit more into comparing the two containers to figure out what's going on.

Please do. I haven't had much to do with the creation of the Dockerfile.cuda, and @tazlin found it to be working IIRC, but the compose.cuda.yaml is a complete blind shot based on what worked for AMD and what I found online.
There might very well be a few issues with that, especially around exposing the GPU to the container.

@CIB (Contributor) commented Oct 30, 2024

There might very well be a few issues with that, especially around exposing the GPU to the container.

Good call. I compared the two docker-compose.yml files and found that the GPU configurations were ever so slightly different. With `count: all` added here, the error is now gone.

    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            capabilities: [gpu]
            count: all

@HPPinata commented Oct 30, 2024

I think you can just create a small separate PR to be merged into raw-png (not main).
This wouldn't fit anything I have open, shouldn't conflict with much else either and you should be the one credited for fixing what was broken.

@CIB (Contributor) commented Oct 30, 2024

I think you can just create a small separate PR to be merged into raw-png (not main). This wouldn't fit anything I have open, shouldn't conflict with much else either and you should be the one credited for fixing what was broken.

Done. #334

@tazlin: This comment was marked as resolved.

@CodiumAI-Agent: This comment was marked as resolved.

tazlin and others added 28 commits November 4, 2024 09:38
* minor fixes

USER root, venv creation after clone (otherwise git complains) and COPY the setup scripts

* ninja for faster flash_attn build

* minimize diff
* rocm version via index url

--extra-index-url https://download.pytorch.org/whl/rocmX.Y is sufficient to install the right package.
This also makes A/B version testing easier.

* build/fix: bring rocm reqs.txt in line with main reqs.txt

* include wheel

If this is not included, the flash_attn build fails on ROCm cards

* Turn requirements.rocm.txt into a symlink

This is a temporary measure for backwards compatibility. Once all scripts and Dockerfiles are updated it can be removed.

* remove requirements.rocm.txt from scripts

---------

Co-authored-by: tazlin <[email protected]>
* Sparse checkout

* Create compose.rocm.yaml

* Create compose.cuda.yaml

* Docker Compose documentation

* Syntax, fixes and clarifications

* Update ROCm Version

* fix compose up

* build/fix: use env vars to control mounts w/ docker compose

* docs: warn about compose config/mount behavior

* docs: move old docker env config info to new location

---------

Co-authored-by: tazlin <[email protected]>
scaled_dot_product_attention had its function signature changed to add the `enable_gqa` argument. The hijack was therefore incompatible: if the hijack was called with the new argument, the argument would not be recognized and an exception would be raised.
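
A minimal sketch of the general fix pattern (hypothetical; not the actual hordelib hijack code): make the wrapper accept and forward arbitrary keyword arguments so newly added parameters such as `enable_gqa` pass straight through to the original function.

```python
import torch.nn.functional as F

# Keep a reference to the original implementation before patching it.
_original_sdpa = F.scaled_dot_product_attention

def _hijacked_sdpa(query, key, value, *args, **kwargs):
    # Forward all positional and keyword arguments (including newly added
    # ones such as `enable_gqa`) so the wrapper survives signature changes.
    return _original_sdpa(query, key, value, *args, **kwargs)

F.scaled_dot_product_attention = _hijacked_sdpa
```
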
These are already bad situations, and I would like to give the worker a fighting chance at staying alive. I have seen exceptions thrown here turn a potentially recoverable situation into an irrecoverable one, because they bubble up to the main process with no catch.
* Fixed gpu configuration for compose.cuda.yaml

The `count: all` setting was missing, which was causing CUDA to be unavailable in some cases.

* Fix typos in README

---------

Co-authored-by: Christian Bielert <[email protected]>
* Use SIGINT to stop the docker container

This should let the `docker stop` command shut down the Python process with SIGINT, allowing the process manager to stop the worker processes gracefully.

* update docker README

* Increase stop grace period

Give the running processes more time to finish when stopping the docker container.

* fix: give 2 minutes for worker shutdown w/ docker

* docs: explain docker stop timeout + maint reasoning

---------

Co-authored-by: Christian Bielert <[email protected]>
Co-authored-by: tazlin <[email protected]>
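
In Compose terms, the shutdown behaviour described above maps onto the standard stop_signal and stop_grace_period service keys. A hedged sketch only: the service name is hypothetical, and the real compose.cuda.yaml/compose.rocm.yaml may structure this differently.

```yaml
services:
  regen:                    # hypothetical service name
    stop_signal: SIGINT     # let `docker stop` deliver SIGINT so the process manager can exit gracefully
    stop_grace_period: 2m   # matches the two minutes given for worker shutdown before a forced kill
```
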
* triton branch

* install pytest

* even newer version

* set env variable container wide

This needs to be set during build and in the individual worker threads' context as well; otherwise a CUDA version is used.

* check env variable

* PyTorch 2.5.0

* tuning options

* ROCm 6.2

* make sure the build works in conda env

* try 256 head dimensions

* enable flash_attn

* check whether we want to build flash_attn

* check is in install_amd_go_fast.sh

* ROCm 6.2

* update optimizations

--amd (basically setting --use-pytorch-cross-attention) degrades performance

MIOPEN_FIND_MODE="FAST" is required for ROCm 6.2 to work as expected
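
Since the variable needs to be visible both at build time and to every worker process, the simplest container-wide mechanism is an ENV instruction in the ROCm image. A hypothetical sketch; the actual Dockerfile.rocm may set this elsewhere or alongside other tuning options.

```dockerfile
# Hypothetical placement: ENV makes the setting visible during the image
# build and to all processes started in the container, including worker
# subprocesses spawned by the process manager.
ENV MIOPEN_FIND_MODE=FAST
```
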
* fix/style: hadolint Dockerfile lint fixes/recommendations

* fix: re-add intended uninstall statement to ROCM image
There is some sort of incompatibility with the hadolint pre-commit hook in a GitHub workflow. I am just going to stick to the GitHub action for now.
docs: second revision of readme rewrite