feat: latest torch/comfyui; perf improvements; fix: SSL cert issues #309

Open
wants to merge 63 commits into
base: main
63 commits
4d5d752  fix: don't run image.save() twice (tazlin, Oct 3, 2024)
2462737  feat: use torch 2.4.1 and cu124 by default (tazlin, Oct 1, 2024)
63baf2e  feat: use latest horde deps w/ latest comfyui+fixes (tazlin, Oct 3, 2024)
e44c98f  build/fix: condense and update dockerfiles (tazlin, Oct 3, 2024)
4f98ca9  chore: version bump (tazlin, Oct 3, 2024)
72b79ad  fix: pop more often with threads>1 (tazlin, Oct 4, 2024)
af18d0a  fix: wait less time w/ high perf. mode (tazlin, Oct 4, 2024)
276aa38  fix: dont pause at all for short jobs on high perf mode (tazlin, Oct 4, 2024)
b277af1  fix: wait even less w/ high perf mode (tazlin, Oct 4, 2024)
94af426  docs/fix: clarify certain stats/config in logs and docstrings (tazlin, Oct 4, 2024)
f6b8fc6  fix: use sqrt as intended (tazlin, Oct 4, 2024)
25c4eac  fix: exit(1) on compvis model dl failure (tazlin, Oct 2, 2024)
c62cf09  fix: use a `certifi` ssl context for r2 uploads (tazlin, Oct 2, 2024)
d1f4900  fix: don't concurrently preload more than 1 model (tazlin, Oct 4, 2024)
46868f3  fix: don't spam preload delay messages (tazlin, Oct 4, 2024)
7211470  fix: include conditional to not spam delay messages (tazlin, Oct 4, 2024)
e92aac2  fix: give models a chance to load before failing (tazlin, Oct 4, 2024)
9f25e1d  fix: correct version pins across dep files (tazlin, Oct 4, 2024)
6aa128e  fix: use latest horde model reference (tazlin, Oct 4, 2024)
e55001d  style: fix (tazlin, Oct 4, 2024)
0a926bd  fix: better deadlock detection when all procs. aren't busy (tazlin, Oct 4, 2024)
737f9e9  fix: be slightly less aggressive w/ pops w/ high perf/threads (tazlin, Oct 4, 2024)
b15f772  fix: don't give conflicting advice about `high_memory_mode` and threads (tazlin, Oct 4, 2024)
66ebb72  chore: log a message to see if inf. proc. `preload_models` is called (tazlin, Oct 4, 2024)
cd8462f  fix: don't suggest `high_memory_mode` with <=32 sys ram (tazlin, Oct 4, 2024)
0badfbf  fix: avoid killing all processes before jobs are finished (tazlin, Oct 4, 2024)
1ea1cc3  chore: version bump (tazlin, Oct 4, 2024)
1758feb  fix: conflicting torchvision dep in update runtime (tazlin, Oct 4, 2024)
869ca62  fix: flag ending processes correctly (tazlin, Oct 4, 2024)
1a13664  fix: correctly download via `load_large_models` (tazlin, Oct 5, 2024)
8934f38  fix: docker installed python deps (HPPinata, Oct 6, 2024)
2c59e15  fix: "amdsmi not found" error with pytorch 2.4.1 (#324) (HPPinata, Oct 17, 2024)
f0dc905  feat: add ROCm and CUDA Dockerfiles with entrypoint and setup scripts… (tazlin, Oct 17, 2024)
f3ae91d  style: fix (tazlin, Oct 17, 2024)
a227235  build/fix: remove amd_go_fast from rocm dockerfile (tazlin, Oct 17, 2024)
bf3579d  fix: new dockerfile scheme fixes (#326) (HPPinata, Oct 17, 2024)
a20a90e  fix: rocm version via index url (#313) (HPPinata, Oct 17, 2024)
6350700  tests/fix: remove obsolete test (tazlin, Oct 17, 2024)
f7635fa  chore/dev: update developer dependencies (tazlin, Oct 17, 2024)
bec02fa  fix: remove obsolete numpy <2.0 pin (tazlin, Oct 17, 2024)
c9ce198  feat/fix: support for docker compose; docker tweaks/fixes (#328) (HPPinata, Oct 20, 2024)
154a4b3  style: fix (tazlin, Oct 20, 2024)
87e8124  feat: `torch==2.5.0`, latest comfyui via `horde_engine~=2.17.0` (tazlin, Oct 20, 2024)
2f0a688  chore: version bump (tazlin, Oct 20, 2024)
cba9a73  fix: match intended version pins (tazlin, Oct 20, 2024)
1968bf9  fix: amd go fast hijack func. signature change (tazlin, Oct 20, 2024)
8dbef89  fix: corrects missed amd hijack passthrough to func (tazlin, Oct 20, 2024)
b48afab  fix: avoid crashing on process kill/join (tazlin, Oct 20, 2024)
ae335d9  Revert "tests/fix: remove obsolete test" (tazlin, Oct 21, 2024)
9ea9bfe  tests: readd rocm file check (tazlin, Oct 21, 2024)
e670304  fix/tests: readd rocm reqs.txt, allow different torch versions (tazlin, Oct 21, 2024)
8410e67  style: fix (tazlin, Oct 21, 2024)
adeedd4  fix: accurate references to reqs.rocm.txt (tazlin, Oct 21, 2024)
0997620  Make req.rocm.txt not a symlink (HPPinata, Oct 21, 2024)
cd42907  fix: use `horde_engine==2.17.1` (tazlin, Oct 21, 2024)
9df214e  fix: gpu configuration for compose.cuda.yaml (#334) (CIB, Nov 2, 2024)
6627b7b  fix: use SIGINT to stop the docker container (#335) (CIB, Nov 2, 2024)
833a95e  style: fix (tazlin, Nov 3, 2024)
11ff693  chore/deps: update pre-commit+dev deps (tazlin, Nov 3, 2024)
6e4d31b  feat: flash triton for amd/rocm (#333) (HPPinata, Nov 3, 2024)
e7f77b3  fix/style: hadolint Dockerfile lint fixes/recomendations (#338) (tazlin, Nov 4, 2024)
fad9161  ci/fix: use github action for hadolint instead (tazlin, Nov 4, 2024)
c787cf5  docs: main readme rewrites (tazlin, Nov 4, 2024)
13 changes: 13 additions & 0 deletions .github/workflows/maintests.yml
@@ -26,6 +26,19 @@ jobs:
         with:
           extra_args: --all-files

+  dockerfile-lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Lint CUDA Dockerfile
+        uses: hadolint/hadolint-action@master
+        with:
+          dockerfile: "Dockerfiles/Dockerfile.cuda"
+      - name: Lint RoCM Dockerfile
+        uses: hadolint/hadolint-action@master
+        with:
+          dockerfile: "Dockerfiles/Dockerfile.rocm"
+
   unit-tests:
     runs-on: ubuntu-latest
     strategy:
13 changes: 13 additions & 0 deletions .github/workflows/prtests.yml
@@ -31,6 +31,19 @@ jobs:
         with:
           extra_args: --all-files

+  dockerfile-lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Lint CUDA Dockerfile
+        uses: hadolint/hadolint-action@master
+        with:
+          dockerfile: "Dockerfiles/Dockerfile.cuda"
+      - name: Lint RoCM Dockerfile
+        uses: hadolint/hadolint-action@master
+        with:
+          dockerfile: "Dockerfiles/Dockerfile.rocm"
+
   unit-tests:
     runs-on: ubuntu-latest
     strategy:
5 changes: 5 additions & 0 deletions .hadolint.yaml
@@ -0,0 +1,5 @@
ignored:
  - DL3008 # Pin versions in apt get install
  - DL3042 # Avoid cache directory with `pip install`
  - DL3002 # Last USER should not be root
failure-threshold: warning
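
Note: to reproduce the new `dockerfile-lint` job locally, something like the following could work. It is only a sketch, assuming the `hadolint` binary is installed and the command is run from the repository root so that `.hadolint.yaml` resolves. The function name `lint_dockerfiles` and the `HADOLINT_BIN` override are hypothetical, introduced here for illustration, and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch: lint Dockerfiles locally, roughly as the CI job does.
# HADOLINT_BIN is a hypothetical override (not from the PR); it defaults to "hadolint".
lint_dockerfiles() {
    local hadolint_bin="${HADOLINT_BIN:-hadolint}"
    local status=0
    local dockerfile
    for dockerfile in "$@"; do
        # --config points at the repo-level .hadolint.yaml added in this PR.
        "${hadolint_bin}" --config .hadolint.yaml "${dockerfile}" || status=1
    done
    # Non-zero if any file failed, so every file is still checked.
    return "${status}"
}
```

Usage would be `lint_dockerfiles Dockerfiles/Dockerfile.cuda Dockerfiles/Dockerfile.rocm`. Collecting the status instead of stopping at the first failure is a design choice of this sketch so that findings for both files appear in one run.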
16 changes: 8 additions & 8 deletions .pre-commit-config.yaml
@@ -1,20 +1,20 @@
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.6.0
+    rev: v5.0.0
     hooks:
       - id: check-yaml
       - id: end-of-file-fixer
       - id: trailing-whitespace
   - repo: https://github.com/psf/black
-    rev: 24.4.2
+    rev: 24.10.0
     hooks:
       - id: black
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.5.4
+    rev: v0.7.2
     hooks:
       - id: ruff
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: 'v1.11.0'
+    rev: 'v1.13.0'
     hooks:
       - id: mypy
         args: []
@@ -38,9 +38,9 @@ repos:
           - python-dotenv
           - aiohttp
           - horde_safety==0.2.3
-          - torch==2.3.1
+          - torch==2.5.0
           - ruamel.yaml
-          - horde_engine==2.15.3
-          - horde_sdk==0.14.11
-          - horde_model_reference==0.9.0
+          - horde_engine==2.17.1
+          - horde_sdk==0.15.1
+          - horde_model_reference==0.9.1
           - semver
30 changes: 0 additions & 30 deletions Dockerfiles/Dockerfile.12.1.1-22.04

This file was deleted.

30 changes: 0 additions & 30 deletions Dockerfiles/Dockerfile.12.2.2-22.04

This file was deleted.

30 changes: 0 additions & 30 deletions Dockerfiles/Dockerfile.12.3.2-22.04

This file was deleted.

71 changes: 71 additions & 0 deletions Dockerfiles/Dockerfile.cuda
@@ -0,0 +1,71 @@
# Stage 1: Base environment setup
ARG CUDA_VERSION=12.4.1
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu22.04 AS base

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTHON_VERSION=3.11
ENV PYTHON_VERSION=${PYTHON_VERSION}
ENV APP_HOME=/horde-worker-reGen

RUN apt-get update && \
    apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get install -y --no-install-recommends \
        python${PYTHON_VERSION} \
        python3-pip \
        python${PYTHON_VERSION}-venv \
        libgl1 \
        git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Extract CUDA version for PyTorch
ARG CUDA_VERSION
RUN CUDA_VERSION_SHORT=$(echo "${CUDA_VERSION}" | cut -d. -f1-2 | tr -d '.') && \
    echo "${CUDA_VERSION_SHORT}" && \
    echo "export CUDA_VERSION_SHORT=${CUDA_VERSION_SHORT}" >> /env_vars

# Stage 2: Clone repository and install dependencies
FROM base AS builder

ARG GIT_BRANCH=main
ARG GIT_OWNER=Haidra-Org

RUN echo "export GIT_BRANCH=${GIT_BRANCH}" >> /env_vars && \
    echo "export GIT_OWNER=${GIT_OWNER}" >> /env_vars

WORKDIR "${APP_HOME}"

# Clone the repository
RUN git clone "https://github.com/${GIT_OWNER}/horde-worker-reGen.git" . && \
    git switch "${GIT_BRANCH}"

# Create virtual environment
RUN python"${PYTHON_VERSION}" -m venv "${APP_HOME}/venv"
ENV PATH="${APP_HOME}/venv/bin:$PATH"

# Install dependencies
ARG PIP_CACHE_DIR=/pip-cache
ARG USE_PIP_CACHE=true

RUN --mount=type=cache,target="${PIP_CACHE_DIR}",sharing=locked,id=pip-cache \
    . /env_vars && \
    if [ "${USE_PIP_CACHE}" = "true" ]; then \
        pip install --cache-dir="${PIP_CACHE_DIR}" opencv-python-headless -r requirements.txt -U --extra-index-url "https://download.pytorch.org/whl/cu${CUDA_VERSION_SHORT}"; \
    else \
        pip install opencv-python-headless -r requirements.txt -U --extra-index-url "https://download.pytorch.org/whl/cu${CUDA_VERSION_SHORT}"; \
    fi

# Stage 3: Final stage
FROM builder AS final

WORKDIR "${APP_HOME}"
COPY entrypoint.sh /entrypoint.sh
COPY setup_*.sh "${APP_HOME}"
RUN chmod +x /entrypoint.sh

STOPSIGNAL SIGINT

ENTRYPOINT ["/entrypoint.sh"]
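
Note: the `CUDA_VERSION_SHORT` step above turns the `CUDA_VERSION` build arg into the suffix of the PyTorch wheel index. The sketch below replays that pipeline outside of Docker so the mapping is easy to check; the function names `cuda_version_short` and `pytorch_cuda_index_url` are hypothetical helpers for illustration and do not exist in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the PR) mirroring the Dockerfile logic.

# Keep major.minor of the version string, then drop the dot: "12.4.1" -> "124".
cuda_version_short() {
    echo "$1" | cut -d. -f1-2 | tr -d '.'
}

# Build the URL that the pip install step passes to --extra-index-url.
pytorch_cuda_index_url() {
    echo "https://download.pytorch.org/whl/cu$(cuda_version_short "$1")"
}

pytorch_cuda_index_url "12.4.1"   # prints https://download.pytorch.org/whl/cu124
```

With the default `CUDA_VERSION=12.4.1` this matches the `cu124` target named in the commit list. The sketch does not check whether wheels are actually published for other versions.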
75 changes: 75 additions & 0 deletions Dockerfiles/Dockerfile.rocm
@@ -0,0 +1,75 @@
# Stage 1: Base environment setup
ARG ROCM_VERSION=6.1.2
FROM rocm/rocm-terminal:${ROCM_VERSION} AS base

USER root
WORKDIR /
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

ARG DEBIAN_FRONTEND=noninteractive
ARG PYTHON_VERSION=3.11
ENV PYTHON_VERSION=${PYTHON_VERSION}
ENV APP_HOME=/horde-worker-reGen

RUN apt-get update && \
    apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get install -y --no-install-recommends \
        python${PYTHON_VERSION} \
        python3-pip \
        python${PYTHON_VERSION}-venv \
        python${PYTHON_VERSION}-dev \
        python${PYTHON_VERSION}-distutils \
        ninja-build \
        rocm \
        git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Extract ROCm version for PyTorch
ARG ROCM_VERSION
RUN ROCM_VERSION_SHORT=$(echo "${ROCM_VERSION}" | cut -d. -f1-2) && \
    echo "export ROCM_VERSION_SHORT=${ROCM_VERSION_SHORT}" >> /env_vars

# Stage 2: Clone repository and install dependencies
FROM base AS builder

ARG GIT_BRANCH=main
ARG GIT_OWNER=Haidra-Org

RUN echo "export GIT_BRANCH=${GIT_BRANCH}" >> /env_vars && \
    echo "export GIT_OWNER=${GIT_OWNER}" >> /env_vars

WORKDIR "${APP_HOME}"

# Clone the repository
RUN git clone "https://github.com/${GIT_OWNER}/horde-worker-reGen.git" . && \
    git switch "${GIT_BRANCH}"

# Create virtual environment
RUN python"${PYTHON_VERSION}" -m venv "${APP_HOME}/venv"
ENV PATH="${APP_HOME}/venv/bin:$PATH"

# Install dependencies
ARG PIP_CACHE_DIR=/pip-cache
ARG USE_PIP_CACHE=true

RUN --mount=type=cache,target="${PIP_CACHE_DIR}",sharing=locked,id=pip-cache \
    . /env_vars && \
    if [ "${USE_PIP_CACHE}" = "true" ]; then \
        pip install --cache-dir="${PIP_CACHE_DIR}" opencv-python-headless -r requirements.rocm.txt -U --extra-index-url "https://download.pytorch.org/whl/rocm${ROCM_VERSION_SHORT}"; \
    else \
        pip install opencv-python-headless -r requirements.rocm.txt -U --extra-index-url "https://download.pytorch.org/whl/rocm${ROCM_VERSION_SHORT}"; \
    fi && \
    pip uninstall -y pynvml nvidia-ml-py
# Stage 3: Final stage
FROM builder AS final

WORKDIR "${APP_HOME}"
COPY entrypoint.sh /entrypoint.sh
COPY setup_*.sh "${APP_HOME}"
RUN chmod +x /entrypoint.sh

STOPSIGNAL SIGINT

ENTRYPOINT ["/entrypoint.sh"]
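
Note: unlike the CUDA image, the ROCm image keeps the dot when shortening the version, because the ROCm wheel index path is built as `rocm<major>.<minor>`. A small sketch of that derivation follows; `rocm_version_short` and `pytorch_rocm_index_url` are hypothetical names used only here, not functions from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the PR) mirroring the ROCm Dockerfile logic.

# Keep major.minor and retain the dot: "6.1.2" -> "6.1".
rocm_version_short() {
    echo "$1" | cut -d. -f1-2
}

# Build the URL that the pip install step passes to --extra-index-url.
pytorch_rocm_index_url() {
    echo "https://download.pytorch.org/whl/rocm$(rocm_version_short "$1")"
}

pytorch_rocm_index_url "6.1.2"   # prints https://download.pytorch.org/whl/rocm6.1
```

The sketch only shows how the string is formed. Whether a matching wheel index exists for a given `ROCM_VERSION` build arg would still need to be confirmed before overriding the default.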