
Bump up Dockerfile and compose to newer syntax format, bump to Torch 2.4.0+CUDA 12.4, updated other deps and more #307

Open · C0rn3j wants to merge 8 commits into main

Conversation


@C0rn3j C0rn3j commented Aug 13, 2024

Now on CUDA 12.4, Torch 2.4.0, newer Docker Compose syntax, and a cleaner Dockerfile (including correctly created layers, apt cache cleanup, and heredocs for readability).

Also renames the Nvidia Dockerfile so it sorts with the other Dockerfiles.

I built the Dockerfile and ran docker-compose.yaml with the image swapped for my local build; it seems to run fine, including Nvidia support on my 4000-series card.

I don't see the purpose of the Nvidia Dockerfile and its standalone requirements file, so I left those mostly alone; the main Dockerfile already supports Nvidia. Shouldn't they just be deleted?

Note that there are quite a few warnings present at the moment.
[AllTalk Startup] Model is available     : Checking
[AllTalk Startup] Model is available     : Checked
[AllTalk Startup] Current Python Version : 3.10.12
[AllTalk Startup] Current PyTorch Version: 2.4.0+cu124
[AllTalk Startup] Current CUDA Version   : 12.4
[AllTalk Startup] Current TTS Version    : 0.24.1
[AllTalk Startup] Current Coqui-TTS Version is : Up to date
[AllTalk Startup] AllTalk Github updated : 1st July 2024 at 08:57
df: /root/.triton/autotune: No such file or directory
/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
[AllTalk Startup] Model is available     : Checking
[AllTalk Startup] Model is available     : Checked
[AllTalk Startup] Current Python Version : 3.10.12
[AllTalk Startup] Current PyTorch Version: 2.4.0+cu124
[AllTalk Startup] Current CUDA Version   : 12.4
[AllTalk Startup] Current TTS Version    : 0.24.1
[AllTalk Startup] Current Coqui-TTS Version is : Up to date
[AllTalk Startup] AllTalk Github updated : 1st July 2024 at 08:57
/usr/local/lib/python3.10/dist-packages/TTS/tts/layers/xtts/xtts_manager.py:6: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  self.speakers = torch.load(speaker_file_path)
/usr/local/lib/python3.10/dist-packages/TTS/utils/io.py:54: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(f, map_location=map_location, **kwargs)
Using /root/.cache/torch_extensions/py310_cu124 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu124/transformer_inference...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu124/transformer_inference/build.ninja...
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
[AllTalk Model] XTTSv2 Local Loading xttsv2_2.0.2 into cuda 
[1/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output pointwise_ops.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pointwise_ops.cu -o pointwise_ops.cuda.o 
[2/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output dequantize.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/dequantize.cu -o dequantize.cuda.o 
[3/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output transform.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu -o transform.cuda.o 
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(38): warning #177-D: variable "d0_stride" was declared but never referenced
      int d0_stride = hidden_dim * seq_length;
          ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(66): warning #177-D: variable "lane" was declared but never referenced
      int lane = d3 & 0x1f;
          ^

/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(109): warning #177-D: variable "half_dim" was declared but never referenced
      unsigned half_dim = (rotary_dim << 3) >> 1;
               ^
          detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281

/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(110): warning #177-D: variable "d0_stride" was declared but never referenced
      int d0_stride = hidden_dim * seq_length;
          ^
          detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281

/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(126): warning #177-D: variable "vals_half" was declared but never referenced
      T2* vals_half = reinterpret_cast<T2*>(&vals_arr);
          ^
          detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281

/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(127): warning #177-D: variable "output_half" was declared but never referenced
      T2* output_half = reinterpret_cast<T2*>(&output_arr);
          ^
          detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281

/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(144): warning #177-D: variable "lane" was declared but never referenced
      int lane = d3 & 0x1f;
          ^
          detected during instantiation of "void launch_bias_add_transform_0213(T *, T *, T *, const T *, const T *, int, int, unsigned int, int, int, int, int, int, __nv_bool, __nv_bool, cudaStream_t, int, int, float) [with T=__nv_bfloat16]" at line 281

[4/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output relu.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o 
[5/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output gelu.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o 
[6/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output apply_rotary_pos_emb.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu -o apply_rotary_pos_emb.cuda.o 
[7/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output rms_norm.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/rms_norm.cu -o rms_norm.cuda.o 
[8/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output softmax.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu -o softmax.cuda.o 
[9/11] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output layer_norm.cuda.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o 
[10/11] c++ -MMD -MF pt_binding.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp -o pt_binding.o 
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp: In instantiation of ‘std::vector<at::Tensor> ds_softmax_context(at::Tensor&, at::Tensor&, int, bool, bool, int, int, float, bool, bool, int, bool, unsigned int, unsigned int, at::Tensor&, float) [with T = float]’:
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:2033:5:   required from here
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:547:50: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  547 |                                      {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
      |                                       ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:547:50: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:548:41: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  548 |                                       k * InferenceContext::Instance().GetMaxTokenLength(),
      |                                       ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:548:41: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:556:38: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  556 |                          {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
      |                           ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:556:38: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:557:29: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  557 |                           k * InferenceContext::Instance().GetMaxTokenLength(),
      |                           ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:557:29: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp: In instantiation of ‘std::vector<at::Tensor> ds_rms_mlp_gemm(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, float, at::Tensor&, at::Tensor&, bool, int, bool) [with T = float]’:
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:2033:5:   required from here
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:1595:72: warning: narrowing conversion of ‘(size_t)mlp_1_out_neurons’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
 1595 |         at::from_blob(intermediate_ptr, {input.size(0), input.size(1), mlp_1_out_neurons}, options);
      |                                                                        ^~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:1595:72: warning: narrowing conversion of ‘mlp_1_out_neurons’ from ‘const size_t’ {aka ‘const long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp: In instantiation of ‘std::vector<at::Tensor> ds_softmax_context(at::Tensor&, at::Tensor&, int, bool, bool, int, int, float, bool, bool, int, bool, unsigned int, unsigned int, at::Tensor&, float) [with T = __half]’:
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:2034:5:   required from here
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:547:50: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  547 |                                      {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
      |                                       ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:547:50: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:548:41: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  548 |                                       k * InferenceContext::Instance().GetMaxTokenLength(),
      |                                       ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:548:41: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:556:38: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  556 |                          {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
      |                           ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:556:38: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:557:29: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  557 |                           k * InferenceContext::Instance().GetMaxTokenLength(),
      |                           ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:557:29: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp: In instantiation of ‘std::vector<at::Tensor> ds_rms_mlp_gemm(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, float, at::Tensor&, at::Tensor&, bool, int, bool) [with T = __half]’:
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:2034:5:   required from here
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:1595:72: warning: narrowing conversion of ‘(size_t)mlp_1_out_neurons’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
 1595 |         at::from_blob(intermediate_ptr, {input.size(0), input.size(1), mlp_1_out_neurons}, options);
      |                                                                        ^~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:1595:72: warning: narrowing conversion of ‘mlp_1_out_neurons’ from ‘const size_t’ {aka ‘const long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp: In instantiation of ‘std::vector<at::Tensor> ds_softmax_context(at::Tensor&, at::Tensor&, int, bool, bool, int, int, float, bool, bool, int, bool, unsigned int, unsigned int, at::Tensor&, float) [with T = __nv_bfloat16]’:
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:2036:5:   required from here
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:547:50: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  547 |                                      {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
      |                                       ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:547:50: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:548:41: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  548 |                                       k * InferenceContext::Instance().GetMaxTokenLength(),
      |                                       ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:548:41: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:556:38: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  556 |                          {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
      |                           ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:556:38: warning: narrowing conversion of ‘(((size_t)hidden_dim) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:557:29: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
  557 |                           k * InferenceContext::Instance().GetMaxTokenLength(),
      |                           ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:557:29: warning: narrowing conversion of ‘(((size_t)k) * (& InferenceContext::Instance())->InferenceContext::GetMaxTokenLength())’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp: In instantiation of ‘std::vector<at::Tensor> ds_rms_mlp_gemm(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, float, at::Tensor&, at::Tensor&, bool, int, bool) [with T = __nv_bfloat16]’:
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:2036:5:   required from here
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:1595:72: warning: narrowing conversion of ‘(size_t)mlp_1_out_neurons’ from ‘size_t’ {aka ‘long unsigned int’} to ‘long int’ [-Wnarrowing]
 1595 |         at::from_blob(intermediate_ptr, {input.size(0), input.size(1), mlp_1_out_neurons}, options);
      |                                                                        ^~~~~~~~~~~~~~~~~
/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp:1595:72: warning: narrowing conversion of ‘mlp_1_out_neurons’ from ‘const size_t’ {aka ‘const long unsigned int’} to ‘long int’ [-Wnarrowing]
[11/11] c++ pt_binding.o gelu.cuda.o relu.cuda.o layer_norm.cuda.o rms_norm.cuda.o softmax.cuda.o dequantize.cuda.o apply_rotary_pos_emb.cuda.o transform.cuda.o pointwise_ops.cuda.o -shared -lcurand -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o transformer_inference.so
Loading extension module transformer_inference...
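
For context, the FutureWarnings above come from DeepSpeed and Coqui-TTS themselves, not from changes in this PR. A minimal sketch of the replacements the messages suggest, purely illustrative (the class, function and file names below are made up, not actual AllTalk/DeepSpeed code):

```python
import os

import torch
from torch.autograd import Function

# The cpp_extension warning above suggests pinning TORCH_CUDA_ARCH_LIST before
# the JIT build; "8.9" matches the compute_89/sm_89 flags in the nvcc lines above.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.9"


class ScaledLinear(Function):
    # torch.cuda.amp.custom_fwd/custom_bwd are deprecated in Torch 2.4; the
    # torch.amp variants take an explicit device_type instead.
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        return input @ weight.t()

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        return grad_output @ weight, grad_output.t() @ input


def load_speakers(path: str):
    # torch.load now warns unless weights_only is set explicitly; for plain
    # tensor/dict checkpoints, weights_only=True is the safe choice.
    return torch.load(path, map_location="cpu", weights_only=True)
```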

Some files may still contain references to CUDA 11.8/12.1.

I am currently running CUDA 12.5 on the host and a 12.4 container with 12.4 libraries (plus an extra apt dependency for 11.8 compatibility) in this setup - it seems to generate and run Analyze TTS just fine.
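
As a quick sanity check inside the container (just a sketch mirroring the startup banner above):

```python
import torch

print("PyTorch :", torch.__version__)         # expect 2.4.0+cu124 with this image
print("CUDA    :", torch.version.cuda)        # expect 12.4
print("GPU OK  :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device  :", torch.cuda.get_device_name(0))
```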

It also includes an extra fix that sorts the WebUI voices.

EDIT: I've pushed c0rn3j/alltalk:1.9.c.1 to Docker Hub, if anyone is interested in the changeset.

…UDA 12.1

Also rename nvidia dockerfile to sort with other dockerfiles
@C0rn3j C0rn3j changed the title from "Bump up Dockerfile and compose to newer syntax format, Torch 2.4.0, update deps and more" to "(WIP) Bump up Dockerfile and compose to newer syntax format, Torch 2.4.0, update deps and more" Aug 13, 2024
…ch builds

Also tosses out a lot of extra whitespace; you can hide those changes in the GitHub diff view if needed.
@C0rn3j C0rn3j changed the title from "(WIP) Bump up Dockerfile and compose to newer syntax format, Torch 2.4.0, update deps and more" to "(WIP) Bump up Dockerfile and compose to newer syntax format, Torch 2.4.0+CUDA 12.4, updated other deps and more" Aug 13, 2024
@C0rn3j C0rn3j changed the title from "(WIP) Bump up Dockerfile and compose to newer syntax format, Torch 2.4.0+CUDA 12.4, updated other deps and more" to "Bump up Dockerfile and compose to newer syntax format, bump to Torch 2.4.0+CUDA 12.4, updated other deps and more" Aug 13, 2024
@erew123 (Owner) commented Aug 14, 2024

Hi @C0rn3j, thanks for everything you have sent over! Since there are so many changes, it will be a lot for me to validate.

A couple of things I have noted, though, are that some of the changes will break certain compatibilities:

  1. AllTalk is designed to work as a standalone and also integrated into Text-generation-webui (TGWUI). In that scenario it runs inside TGWUI's custom Python environment rather than AllTalk's own environment, so bumping the Torch and CUDA versions will break that compatibility. TGWUI is working on Torch 2.2.2 https://github.com/oobabooga/text-generation-webui/blob/main/one_click.py#L18C1-L21C29 and CUDA 12.1 https://github.com/oobabooga/text-generation-webui/blob/main/one_click.py#L118. As such, I would validate that the base requirements work on both TGWUI and Standalone - hence the two separate requirements files: the standalone requirements force installation of specific versions, while the TGWUI requirements just say "I want a version equal to or later than this" so that they don't step all over the files TGWUI installs as part of its own installation routine (a rough illustration of the two pinning styles follows this list). Obviously the Docker and other requirements were built off the back of that.

  2. I see you have bumped cuBLAS to v12. Do you know if eginhard has moved the training/finetuning code to support v12? I hadn't seen this in the releases (maybe I missed it) https://github.com/idiap/coqui-ai-TTS/releases. The issue here was that Coqui's scripts had an issue with v12 and refused to work correctly, though if v12 now works, great!

  3. DeepSpeed for Windows is complicated at best to compile and takes me 40-60 minutes per variation (PyTorch, CUDA and Python major versions in all combinations). As I recall, DeepSpeed wouldn't compile for PyTorch 2.3.x or later, so moving to 2.3.x broke DeepSpeed for all Windows users (I'm not sure what it's like on Linux, as I never tested that far once I found it broke the Windows installation). There is possible light at the end of the tunnel here: I did a decent bit of work with MS getting DeepSpeed to compile on Windows, and they have recently confirmed to me that as of the next release of DeepSpeed they will finally be building and compiling their own WHL files, though I just don't know which variations/versions of Torch they will be compiling for Update README.md Windows Instructions microsoft/DeepSpeed#4748 (comment). I see you have made references to DeepSpeed in the readme etc. Did you manage to test DeepSpeed compilation on Windows at all? (I haven't tested it with PyTorch 2.4.x.) If so, that eases one big issue off the list.

  4. Torch was only left in the requirements files as a "greater than this version" entry for Standalone users, because on Windows and Linux the atsetup.bat/atsetup.sh handles installing Torch to match the other versions in the requirements, e.g. https://github.com/erew123/alltalk_tts/blob/main/atsetup.bat#L538 and https://github.com/erew123/alltalk_tts/blob/main/atsetup.sh#L328. So pushing a new Torch version in the requirements files will impact the setup installer routines for Windows & Linux and affect TGWUI (as mentioned above).
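
A rough illustration of the two pinning styles from point 1 - the version numbers are the ones mentioned in this thread (Torch 2.2.2 for TGWUI, 2.4.0 in this PR), not necessarily what the repo pins:

```python
# Assumes the "packaging" library is installed; illustrative only.
from importlib.metadata import version
from packaging.specifiers import SpecifierSet

standalone_spec = SpecifierSet("==2.4.0")  # standalone reqs: force an exact build
tgwui_spec = SpecifierSet(">=2.2.2")       # TGWUI-side reqs: only assert a minimum

installed = version("torch").split("+")[0]  # drop the local tag, e.g. "+cu124"
print("satisfies standalone pin:", installed in standalone_spec)
print("satisfies TGWUI minimum :", installed in tgwui_spec)
```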

If you can let me know any thoughts/tests you did, I can look a bit more in depth at what may work or break elsewhere and then validate the setup across Windows & Linux, though this takes hours to do, as there are the Standalone environments and then the Text-gen-webui environments to test.

@C0rn3j (Author) commented Aug 14, 2024

  1. So far I've had good luck bumping 12.1 projects to 12.4 and Torch 2.4.0; usually projects just haven't had the chance to migrate, as 12.4 + 2.4.0 only became possible 3 weeks ago. That said, I don't see the point of tying the two environments together - using separate venvs for each project seems like the way to go to me. A newer CUDA on the host seems to work just fine for an app built against an older CUDA, so to my knowledge there would be no issue having a 12.4 project and a 12.1 project on a host with CUDA 12.4. Though I do understand wanting to keep the file sizes low if possible.

  2. I am not sure, but I had to migrate it for the Analyze TTS feature and had to add the old libs so it would keep working, which might have retained compatibility. This might be an issue for out-of-Docker environments, because the old libs won't be there, so I'd test the Docker build first before testing native environments. I've published the Docker image as mentioned in the edit, so that should be quick to test at least.

  3. I only tested the Docker build on Arch Linux. I have seen that DeepSpeed is enabled by default and installed as a pip requirement, and since generation seems to work fine with no errors I presumed it works. It's a bit hard for me to notice any improvements in generation, as I have a 4090 and generation already seems pretty much instant and beyond realtime.

  4. See point 1.

EDIT:

Actually, I may have inadvertently fixed a CUDA mishap with the bump: the 12.4 container reports 12.5 in nvidia-smi, but the 12.3.1 image that whisper.cpp uses (though they end up using the runtime image rather than devel) reports 12.3, which breaks on my system.

Building the current old Dockerfile and executing nvidia-smi in the container on a host with 12.5 or higher should show whether the project actually suffers from this issue (check the CUDA Version row).
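
If it helps, a tiny helper for that check (assumes nvidia-smi is on PATH inside the container; just a sketch):

```python
import re
import subprocess

# Run nvidia-smi and pull out the driver-reported "CUDA Version" field.
out = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", out)
print("Driver-reported CUDA:", match.group(1) if match else "not found")
```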

Going to try 12.6.0-[devel|runtime]-ubuntu24.04 and see what happens with whisper.cpp on my 12.5 host.
