This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

ROCM5.7 build pytorch failed #14

Closed
sdli1995 opened this issue Sep 8, 2023 · 36 comments

Comments

@sdli1995

sdli1995 commented Sep 8, 2023

I used the ROCmSoftwarePlatform PyTorch repository to build the latest pytorch-rocm and it failed.
I used the build script command from rocm_lab.
The error log is:

pytorch/torch/csrc/jit/ir/ir.cpp:1191:16: error: ‘set_stream’ is not a member of ‘torch::jit::cuda’; did you mean ‘c10::cuda::set_stream’?
 1191 |     case cuda::set_stream:
      |                ^~~~~~~~~~
@evshiron
Owner

evshiron commented Sep 8, 2023

Is ROCm 5.7 released?

Btw, the source code of PyTorch used in this repo comes from:

I am not sure if the repo you use will work or not.

@sdli1995
Author

sdli1995 commented Sep 8, 2023

Is ROCm 5.7 released?

Btw, the source code of PyTorch used in this repo comes from:

I am not sure if the repo you use will work or not.

Yes, it's released but not announced yet.
Here is the latest amdgpu-install link: https://repo.radeon.com/amdgpu-install/23.20/ubuntu/jammy/amdgpu-install_5.7.50700-1_all.deb

I use this repo https://github.com/ROCmSoftwarePlatform/pytorch/tree/rocm5.7_internal_testing because it has many optimizations for 5.7, but the build always fails.
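
For context, a rough sketch of the build flow being attempted (the exact rocm_lab script differs; the clone and build commands here are my assumptions based on the usual PyTorch-from-source procedure):

git clone --recursive -b rocm5.7_internal_testing https://github.com/ROCmSoftwarePlatform/pytorch
cd pytorch
python3 tools/amd_build/build_amd.py    # hipify step
python3 setup.py bdist_wheel            # fails with the ir.cpp error shown above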

@evshiron
Owner

evshiron commented Sep 8, 2023

Interesting! I'll give it a try too.

@evshiron
Owner

evshiron commented Sep 8, 2023

This issue should be caused by the outdated hipify script; you can fix it by running:

git checkout torch/csrc/jit/ir/ir.h

after python3 tools/amd_build/build_amd.py is done.

UPDATE:

An additional export ROCM_PATH=/opt/rocm is needed to make /opt/rocm-5.7/lib/cmake/hip/FindHIP.cmake work.
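
Putting the workaround together, a minimal sketch (the final build command is an assumption; use whatever you normally invoke):

export ROCM_PATH=/opt/rocm              # so FindHIP.cmake can locate HIP under /opt/rocm-5.7
python3 tools/amd_build/build_amd.py    # hipify
git checkout torch/csrc/jit/ir/ir.h     # undo the over-eager hipify edit that breaks ir.cpp
python3 setup.py bdist_wheel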

While it compiles pretty well, linking fails for some hipblas?ge*Batched functions.

@briansp2020

I think 5.7 is still under testing. I had issues with the 5.7 version of the kernel when I tried it the other day. Hopefully, they will fix it before they officially release it.

@evshiron
Owner

evshiron commented Sep 8, 2023

I built https://github.com/pytorch/pytorch against ROCm 5.7 just now and it succeeded.

I suspect that the rocm5.7_internal_testing branch is not complete and doesn't have the changes that fix the hipify script and link to hipblas correctly.

@evshiron
Owner

evshiron commented Sep 8, 2023

https://github.com/AUTOMATIC1111/stable-diffusion-webui failed to work with the PyTorch built upon the main repo, but https://github.com/vladmandic/automatic worked fine, although I didn't see a performance difference.

@briansp2020

Have you tried 5.7 amdgpu-dkms? It causes issues for me. But I'm not sure whether it's my setup or a problem with dkms modules since I seem to have a problematic setup. When I was running TF in a VM, AI benchmark would hang when running test 14. Using my current setup, it would crash X windows and log me out.

@evshiron
Owner

evshiron commented Sep 9, 2023

I haven't tried running ROCm in a virtualized environment yet.

I did not call amdgpu-dkms manually either, but I think amdgpu-install did it for me as I have seen some logs about it.

I do think that ROCm 5.7 might be a bit strange because stable-diffusion-webui works fine on ROCm 5.6.

@sdli1995
Author

sdli1995 commented Sep 9, 2023

I built https://github.com/pytorch/pytorch against ROCm 5.7 just now and it succeeded.

I suspect that the rocm5.7_internal_testing branch is not complete and doesn't have the changes that fix the hipify script and link to hipblas correctly.

I found this branch from a PyTorch PR, but it's now closed:
pytorch/pytorch#108153
Maybe it needs more fixes.

@sdli1995
Author

sdli1995 commented Sep 9, 2023

Have you tried 5.7 amdgpu-dkms? It causes issues for me. But I'm not sure whether it's my setup or a problem with dkms modules since I seem to have a problematic setup. When I was running TF in a VM, AI benchmark would hang when running test 14. Using my current setup, it would crash X windows and log me out.

I tried the 5.7 amdgpu-dkms only, and it works fine.

@briansp2020

I'm trying out 5.7 and started getting the following messages in dmesg.

[70320.243903] amdgpu 0000:2d:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[70320.244931] amdgpu 0000:2d:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[70320.245910] amdgpu 0000:2d:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[70320.246894] amdgpu 0000:2d:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[70320.247882] amdgpu 0000:2d:00.0: amdgpu: MORE_FAULTS: 0x0
[70320.248867] amdgpu 0000:2d:00.0: amdgpu: WALKER_ERROR: 0x5
[70320.249851] amdgpu 0000:2d:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[70320.250816] amdgpu 0000:2d:00.0: amdgpu: MAPPING_ERROR: 0x1
[70320.251757] amdgpu 0000:2d:00.0: amdgpu: RW: 0x0

Does anyone know what this means? It's in red. So, it makes me feel nervous. :(

@PennyFranklin

ROCm 5.7 is officially released, and I can't wait to see if there's any improvement compared with ROCm 5.6.

@evshiron
Owner

Unfortunately, I am on a business trip and don't have the chance to test them out.

With PyTorch built above, I didn't see a performance improvement compared to ROCm 5.6. We should re-evaluate it when the official builds come out.

I have looked through the changelog of ROCm 5.7 and it seems to mainly focus on new features, with less emphasis on performance optimizations related to AI.

The list I am interested in currently:

I haven't been testing Triton for ROCm for a long time, and I am curious about its performance and if the Fused Attention example works on Navi 3x or not.

@briansp2020

I ran a TF benchmark (https://pypi.org/project/new-ai-benchmark/) again and got Device AI Score: 37824 which is about 10% better than ROCm 5.6. Looking at the details, inferencing seems better than my NVidia 3080ti but training seems really bad.
I updated the gist with output from ROCm 5.7 (https://gist.github.com/briansp2020/e885f0eb6cbec45fcaf0c2eac8c3ee11#file-7900xtx-rocm-5-7)

The simple benchmark from https://cprimozic.net/notes/posts/machine-learning-benchmarks-on-the-7900-xtx/ is still really bad compared to the 3080ti.

I have not been able to run any pytorch benchmark yet. Does anyone know a simple pytorch benchmark that runs on pytorch & ROCm? So far, the issue I noticed when trying out fastai (ROCm/pytorch#1276 (comment)) is still there.

@briansp2020

I was able to run pytorch micro-benchmark from ROCm project. See ROCm/pytorch#1276

I built pytorch & torchvision on my machine before running the benchmark and do not know if they run with the ROCm 5.6 nightly build or not.
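
For anyone else looking for it, a rough sketch of running that micro-benchmark (the repository URL is an assumption; the script name and flags match the command quoted later in this thread):

git clone https://github.com/ROCmSoftwarePlatform/pytorch-micro-benchmarking
cd pytorch-micro-benchmarking
python3 micro_benchmarking_pytorch.py --network convnext_small --fp16 1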

@sdli1995
Author

On ROCm 5.6, I find the best-performing PyTorch is in the rocm/pytorch Docker image. Waiting for this Docker image https://hub.docker.com/r/rocm/pytorch to be updated.
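
For reference, a minimal sketch of pulling and running that image once it is updated (the tag and device flags follow AMD's usual ROCm container instructions and are assumptions):

docker pull rocm/pytorch:latest
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 8G rocm/pytorch:latest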

@briansp2020

Does anyone know how to build torchaudio? I get errors when I try to build it. The torch and torchvision builds worked fine for me. ROCm version detection does not seem to work quite right... Full output at https://gist.github.com/briansp2020/484f639aa59ccb308f35cf9ad6542881

(pt) root@rocm:~/audio# python3 setup.py bdist_wheel
/root/audio/setup.py:2: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
  import distutils.command.clean
-- Git branch: main
-- Git SHA: 4bbf65e44e5730d4b7deba745ba11375c1665db8
-- Git tag: None
-- PyTorch dependency: torch
-- Building version 2.2.0a0+4bbf65e
running bdist_wheel
running build
running build_py
copying torchaudio/version.py -> build/lib.linux-x86_64-3.10/torchaudio
running build_ext
HIP VERSION: 5.7.31921-d1770ee1b
CMake Error at cmake/LoadHIP.cmake:149 (file):
  file failed to open for reading (No such file or directory):

    /opt/rocm/.info/version-dev
Call Stack (most recent call first):
  CMakeLists.txt:75 (include)


CMake Error at cmake/LoadHIP.cmake:150 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.
Call Stack (most recent call first):
  CMakeLists.txt:75 (include)



***** ROCm version from /opt/rocm/.info/version-dev ****

ROCM_VERSION_DEV:
ROCM_VERSION_DEV_MAJOR:
ROCM_VERSION_DEV_MINOR:
ROCM_VERSION_DEV_PATCH:

***** Library versions from dpkg *****

rocm-developer-tools VERSION: 5.7.0.50700-63~22.04
rocm-device-libs VERSION: 1.0.0.50700-63~22.04
hsakmt-roct-dev VERSION: 20230704.2.5268.50700-63~22.04
hsa-rocr-dev VERSION: 1.11.0.50700-63~22.04

~~~~~~~~ cut off since too long ~~~~~~~~ 

@evshiron
Owner

@briansp2020

https://github.com/evshiron/rocm_lab/blob/master/scripts/build_torchaudio.sh#L25

The following code might work:

echo 5.7.0-63 > /opt/rocm/.info/version-dev

Otherwise, you might want to locate that file from some of the official Docker images (probably https://hub.docker.com/r/rocm/dev-ubuntu-20.04).
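A slightly more defensive variant of the same idea (run as root; it only creates the file LoadHIP.cmake expects if it is missing, and the version string mirrors the dpkg output above):

[ -f /opt/rocm/.info/version-dev ] || echo 5.7.0-63 > /opt/rocm/.info/version-dev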

@briansp2020

That helped with detecting ROCm version. But it still can't find rocrand.
Full output at
https://gist.github.com/briansp2020/0b16903b9eaddcedcfd003d61a3315a7

@sdli1995
Author

I built the master branch of PyTorch and got some performance improvement.
Here are the benchmark results:

rocm5.6
running benchmark for frameworks ['pytorch']
cuda version= None
cudnn version= 2020000
pytorch's batchsize at 16 vgg16 eval at fp32: 15.1ms avg
pytorch's batchsize at 16 vgg16 train at fp32: 77.5ms avg
pytorch's batchsize at 16 resnet152 eval at fp32: 32.7ms avg
pytorch's batchsize at 16 resnet152 train at fp32: 120.9ms avg
pytorch's batchsize at 16 densenet161 eval at fp32: 29.0ms avg
pytorch's batchsize at 16 densenet161 train at fp32: 114.6ms avg
pytorch's batchsize at 16 convnext_large eval at fp32: 623.2ms avg
pytorch's batchsize at 16 convnext_large train at fp32: 1590.5ms avg
pytorch's batchsize at 16 swinbig eval at fp32: 121.0ms avg
pytorch's batchsize at 16 swinbig train at fp32: 346.8ms avg
pytorch's batchsize at 16 vgg16 eval at fp16: 8.5ms avg
pytorch's batchsize at 16 vgg16 train at fp16: 44.0ms avg
pytorch's batchsize at 16 resnet152 eval at fp16: 20.6ms avg
pytorch's batchsize at 16 resnet152 train at fp16: 72.7ms avg
pytorch's batchsize at 16 densenet161 eval at fp16: 23.1ms avg
pytorch's batchsize at 16 densenet161 train at fp16: 86.2ms avg
pytorch's batchsize at 16 convnext_large eval at fp16: 391.5ms avg
pytorch's batchsize at 16 convnext_large train at fp16: 1041.9ms avg
pytorch's batchsize at 16 swinbig eval at fp16: 28.7ms avg
pytorch's batchsize at 16 swinbig train at fp16: 87.8ms avg

rocm5.7
running benchmark for frameworks ['pytorch']
cuda version= None
cudnn version= 2020000
pytorch's batchsize at 16 vgg16 eval at fp32: 14.7ms avg
pytorch's batchsize at 16 vgg16 train at fp32: 71.2ms avg
pytorch's batchsize at 16 resnet152 eval at fp32: 23.2ms avg
pytorch's batchsize at 16 resnet152 train at fp32: 95.6ms avg
pytorch's batchsize at 16 densenet161 eval at fp32: 26.0ms avg
pytorch's batchsize at 16 densenet161 train at fp32: 101.1ms avg
pytorch's batchsize at 16 convnext_large eval at fp32: 393.0ms avg
pytorch's batchsize at 16 convnext_large train at fp32: 1030.8ms avg
pytorch's batchsize at 16 swinbig eval at fp32: 42.8ms avg
pytorch's batchsize at 16 swinbig train at fp32: 141.8ms avg
pytorch's batchsize at 16 vgg16 eval at fp16: 8.8ms avg
pytorch's batchsize at 16 vgg16 train at fp16: 43.2ms avg
pytorch's batchsize at 16 resnet152 eval at fp16: 19.7ms avg
pytorch's batchsize at 16 resnet152 train at fp16: 67.7ms avg
pytorch's batchsize at 16 densenet161 eval at fp16: 22.2ms avg
pytorch's batchsize at 16 densenet161 train at fp16: 80.0ms avg
pytorch's batchsize at 16 convnext_large eval at fp16: 322.2ms avg
pytorch's batchsize at 16 convnext_large train at fp16: 851.1ms avg
pytorch's batchsize at 16 swinbig eval at fp16: 25.5ms avg
pytorch's batchsize at 16 swinbig train at fp16: 85.0ms avg

ref rtx3090

running benchmark for frameworks ['pytorch']
cuda version= 12.1
cudnn version= 8902
pytorch's batchsize at 16 vgg16 eval at fp32: 20.5ms avg
pytorch's batchsize at 16 vgg16 train at fp32: 58.7ms avg
pytorch's batchsize at 16 resnet152 eval at fp32: 27.9ms avg
pytorch's batchsize at 16 resnet152 train at fp32: 85.1ms avg
pytorch's batchsize at 16 densenet161 eval at fp32: 27.6ms avg
pytorch's batchsize at 16 densenet161 train at fp32: 87.4ms avg
pytorch's batchsize at 16 convnext_large eval at fp32: 57.3ms avg
pytorch's batchsize at 16 convnext_large train at fp32: 250.6ms avg
pytorch's batchsize at 16 swinbig eval at fp32: 39.9ms avg
pytorch's batchsize at 16 swinbig train at fp32: 124.2ms avg
pytorch's batchsize at 16 vgg16 eval at fp16: 12.4ms avg
pytorch's batchsize at 16 vgg16 train at fp16: 37.0ms avg
pytorch's batchsize at 16 resnet152 eval at fp16: 16.9ms avg
pytorch's batchsize at 16 resnet152 train at fp16: 63.1ms avg
pytorch's batchsize at 16 densenet161 eval at fp16: 20.4ms avg
pytorch's batchsize at 16 densenet161 train at fp16: 75.0ms avg
pytorch's batchsize at 16 convnext_large eval at fp16: 30.1ms avg
pytorch's batchsize at 16 convnext_large train at fp16: 87.6ms avg
pytorch's batchsize at 16 swinbig eval at fp16: 22.0ms avg
pytorch's batchsize at 16 swinbig train at fp16: 86.3ms avg

@sdli1995
Author

Nano GPT benchmark
batch_size = 1
max_sequence_len = 512
num_heads = 80
embed_dimension = 64

3090 ngc 23.06
Infer type is torch.float16
5033574400 4.6878814697265625
real layer infer time is 37.69169330596924 ms
real layer train time is 120.95082998275757 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
5033574400 4.6878814697265625

Infer type is torch.bfloat16
5033574400 4.6878814697265625
real layer infer time is 36.89380407333374 ms
real layer train time is 116.20726585388184 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
5033574400 4.6878814697265625

Infer type is torch.float32
10067148800 9.375762939453125
real layer infer time is 77.41017580032349 ms
real layer train time is 248.90074253082275 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
10067148800 9.375762939453125

7900xtx rocm-5.6.1
Infer type is torch.float16
5033574400 4.6878814697265625
real layer infer time is 43.32679271697998 ms
real layer train time is 111.61400318145752 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
5033574400 4.6878814697265625

Infer type is torch.bfloat16
5033574400 4.6878814697265625
real layer infer time is 43.55735778808594 ms
real layer train time is 122.41567373275757 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
5033574400 4.6878814697265625

Infer type is torch.float32
10067148800 9.375762939453125
real layer infer time is 457.5647735595703 ms
real layer train time is 1259.9600839614868 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
10067148800 9.375762939453125

7900xtx rocm-5.7

Infer type is torch.float16
5033574400 4.6878814697265625
real layer infer time is 43.288211822509766 ms
real layer train time is 111.6347599029541 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
5033574400 4.6878814697265625

Infer type is torch.bfloat16
5033574400 4.6878814697265625
real layer infer time is 43.8742470741272 ms
real layer train time is 122.43486881256104 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
5033574400 4.6878814697265625

Infer type is torch.float32
10067148800 9.375762939453125
real layer infer time is 113.32763433456421 ms
real layer train time is 373.21545124053955 ms
0.0048828125 0.1953125
12.03125 0.6493506493506493 0.3246753246753247
10067148800 9.37576293945312

@briansp2020

briansp2020 commented Sep 16, 2023

Stable Diffusion Control Net Pipeline worked! Compared to the 3080ti I have, it's still about 20% slower. But the code works unmodified. I wonder how much more performance they can squeeze out of the 7900.
https://gist.github.com/briansp2020/1830bbbf0c6400d620df0384bcd034a1

7900XTX does about 4.84it/s
3080ti does about 6.31it/s

Does anyone know of a way to measure the performance in absolute flops/bandwidth/etc. using pytorch and/or tensorflow?

Edited to add performance numbers.

@evshiron
Owner

evshiron commented Sep 17, 2023

@briansp2020

That helped with detecting ROCm version. But it still can't find rocrand.

It should be caused by the changes to CMake files in ROCm 5.7.

As I don't have a chance to try it at the moment, my recommendation is to dig into /opt/rocm/hip/lib/cmake/hip/ and see how it prepares the CMake configurations.
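
A few hedged things to try before rebuilding (not verified against ROCm 5.7; the paths are assumptions based on where ROCm normally installs its CMake packages):

export ROCM_PATH=/opt/rocm
export CMAKE_PREFIX_PATH=/opt/rocm:$CMAKE_PREFIX_PATH   # let find_package() see ROCm's CMake configs
ls /opt/rocm/lib/cmake/rocrand/                         # rocrand's CMake config usually lives here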

@sdli1995
Author

Unfortunately, I am on a business trip and don't have the chance to test them out.

With PyTorch built above, I didn't see a performance improvement compared to ROCm 5.6. We should re-evaluate it when the official builds come out.

I have looked through the changelog of ROCm 5.7 and it seems to mainly focus on new features, with less emphasis on performance optimizations related to AI.

The list I am interested in currently:

I haven't been testing Triton for ROCm for a long time, and I am curious about its performance and if the Fused Attention example works on Navi 3x or not.

I built the ROCmSoftwarePlatform PyTorch 2.0.1 and it performs well on some networks like convnext_large:
running benchmark for frameworks ['pytorch']
cuda version= None
cudnn version= 2020000
pytorch's batchsize at 16 vgg16 eval at fp32: 14.6ms avg
pytorch's batchsize at 16 vgg16 train at fp32: 71.2ms avg
pytorch's batchsize at 16 resnet152 eval at fp32: 23.2ms avg
pytorch's batchsize at 16 resnet152 train at fp32: 95.4ms avg
pytorch's batchsize at 16 densenet161 eval at fp32: 25.9ms avg
pytorch's batchsize at 16 densenet161 train at fp32: 100.3ms avg
pytorch's batchsize at 16 convnext_large eval at fp32: 108.3ms avg
pytorch's batchsize at 16 convnext_large train at fp32: 293.9ms avg
pytorch's batchsize at 16 swinbig eval at fp32: 43.3ms avg
pytorch's batchsize at 16 swinbig train at fp32: 142.4ms avg
pytorch's batchsize at 16 vgg16 eval at fp16: 8.8ms avg
pytorch's batchsize at 16 vgg16 train at fp16: 43.2ms avg
pytorch's batchsize at 16 resnet152 eval at fp16: 19.9ms avg
pytorch's batchsize at 16 resnet152 train at fp16: 72.5ms avg
pytorch's batchsize at 16 densenet161 eval at fp16: 22.2ms avg
pytorch's batchsize at 16 densenet161 train at fp16: 81.4ms avg
pytorch's batchsize at 16 convnext_large eval at fp16: 64.5ms avg
pytorch's batchsize at 16 convnext_large train at fp16: 168.2ms avg
pytorch's batchsize at 16 swinbig eval at fp16: 25.6ms avg
pytorch's batchsize at 16 swinbig train at fp16: 81.7ms avg

It seems the rocm_lab repo can be updated and it builds the same as your repo; only 2 changes are needed (sketched below):

  1. https://github.com/ROCmSoftwarePlatform/pytorch/blob/96c66b7e0b97ebb53422d1e2b760b0bc6b9e74bd/cmake/public/LoadHIP.cmake#L144C1-L144C1 set(CMAKE_MODULE_PATH ${HIP_PATH}/cmake ${CMAKE_MODULE_PATH})
  2. USE_GLOO=0 python setup.py bdist_wheel
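
That is, roughly (a sketch of the two changes above; nothing beyond them is assumed):

# 1) in cmake/public/LoadHIP.cmake (around the linked line), add HIP's CMake modules to the search path:
#      set(CMAKE_MODULE_PATH ${HIP_PATH}/cmake ${CMAKE_MODULE_PATH})
# 2) then build with Gloo disabled:
USE_GLOO=0 python setup.py bdist_wheel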

@PennyFranklin

The official PyTorch build with ROCm 5.7 was released, and I ran the vlad systeminfo benchmark. Compared with ROCm 5.6, ROCm 5.7 shows no improvement in sdwebui. Image_1696407357103.png

@sdli1995
Author

sdli1995 commented Oct 4, 2023

The official PyTorch build with ROCm 5.7 was released, and I ran the vlad systeminfo benchmark. Compared with ROCm 5.6, ROCm 5.7 shows no improvement in sdwebui. Image_1696407357103.png

The inference precision is half (fp16); in my benchmark there is no improvement at this setting, while fp32 performance improved more. It can be used for some training jobs, but there is still a large gap from its claimed performance of 61 TFLOPS fp32, 122 TFLOPS fp16, and 900+ GB/s bandwidth, whereas the RTX 3090 claims 39 TFLOPS fp32, 80 TFLOPS fp16, and 900+ GB/s bandwidth.

@briansp2020

I noticed a massive performance improvement using the latest pytorch nightly build. fastai/course22#96 (comment)

Is anyone else interested in trying out the latest pytorch build and reporting whether they see any performance improvements? The improvement in the convnext_small benchmark is massive and brings the fastai example performance of the 7900XTX up to 3080ti levels. Before, it was 4 to 5 times slower.

Sure, the 7900XTX should be even faster. But I'm so glad to see that AMD consumer graphics card support is coming along nicely!
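
For anyone who wants to try, a hedged sketch of grabbing the nightly wheels (the ROCm 5.7 index URL follows PyTorch's usual nightly layout and is an assumption):

pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm5.7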

@PennyFranklin

It seems that FP32 performance improves a lot; I hope FP16 performance will be improved in the upcoming ROCm 6.0.

@briansp2020

Does anyone have an idea how fast 7900XTX should be when all optimization is in place? I'm wondering whether "python micro_benchmarking_pytorch.py --network convnext_small --fp16 1" performance is about as good as it will get. Looking at the raw fp16 numbers, 7900XTX is said to have 123 TFLOPs (here) vs 3080ti's 272 TFLOPs tensorcore performance (here). Ratio of measured performance seems about right....

Also, unlike Nvidia, AMD does not list different numbers for fp16 vector peak performance vs matrix engine peak performance. Does anyone know whether 123 TFLOPs is from the vector engine or the matrix engine?

@evshiron
Owner

evshiron commented Oct 8, 2023

Based on my limited understanding, it is possible that RDNA 3 may not have processing units similar to Tensor Cores, and WMMA may just be an optimized instruction. In comparison, CDNA has its own Matrix Cores and XDL instructions.

Here is an article that may help in understanding the differences between GPUs from these vendors:

I am not a professional in this field, so as a consumer, I am satisfied as long as the RX 7900 XTX is comparable to the likes of RTX 4080 or RTX 3090 in terms of AI applications that consumers will use.

@sdli1995
Author

sdli1995 commented Oct 9, 2023

Does anyone have an idea how fast 7900XTX should be when all optimization is in place? I'm wondering whether "python micro_benchmarking_pytorch.py --network convnext_small --fp16 1" performance is about as good as it will get. Looking at the raw fp16 numbers, 7900XTX is said to have 123 TFLOPs (here) vs 3080ti's 272 TFLOPs tensorcore performance (here). Ratio of measured performance seems about right....

Also, unlike Nvidia, AMD does not list different numbers for fp16 vector peak performance vs matrix engine peak performance. Does anyone know whether 123 TFLOPs is from the vector engine or the matrix engine?

The 3080ti has only about 75 TFLOPS of fp16 performance. WMMA is likely the counterpart of Tensor Cores, but it's not as useful as Tensor Cores because it's memory bound.

here is my test in 7900xtx fp16 gemm

work@astrali-SuperServer:~/composable_kernel/build/bin$ ./example_gemm_wmma_fp16 0 2 1 16384 16384 16384 0 0 0
a_m_k: dim 2, lengths {16384, 16384}, strides {0, 1}
b_k_n: dim 2, lengths {16384, 16384}, strides {1, 0}
c_m_n: dim 2, lengths {16384, 16384}, strides {0, 1}
Perf: 105.451 ms, 83.4141 TFlops, 15.2736 GB/s, DeviceGemmWmma_CShuffle<128, 64, 128, 64, 8, 16, 16, 2, 4> AEnableLds: 1, BEnableLds: 1, NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1
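
As a quick sanity check, the reported figure is consistent with the dense-GEMM FLOP count:

# 2*M*N*K FLOPs for a dense GEMM at M=N=K=16384, over the 105.451 ms runtime
python3 -c "m = n = k = 16384; t = 0.105451; print(2*m*n*k / t / 1e12)"   # ~83.4, matching the TFlops above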

If you care about fp16 precision performance, more focus may be needed on WMMA op development.

@briansp2020

I re-ran the TF benchmark (https://pypi.org/project/new-ai-benchmark/) and got a Device AI Score of 40996, which is about 8% better than what I got before and is now better than the 3080ti.
I created a new gist (tensorflow-upstream source as of 10/14/2023 + ROCm 5.7.1) (https://gist.github.com/briansp2020/3e176c7a933cf23531642e326a2f91c5)

@evshiron
Owner

In my previous replies, I mentioned that I felt ROCm 5.7 was not working properly on my end, so I have been sticking with ROCm 5.6 for the time being.

Today, I tried updating to ROCm 5.7.1, but the situation did not improve: text and images disappear in Google Chrome, and running AI applications easily results in GPU resets. There are many reset logs in dmesg.

So I would like to ask: is your ROCm 5.7 working fine out of the box? Have you encountered the issues I mentioned above? If so, how did you solve them?

@briansp2020

@evshiron
What OS are you using? I noticed issues (e.g., new-ai-benchmark hangs) when using Ubuntu Desktop 22.04, which uses kernel 6.2. It seems to work much better under Ubuntu Server, which still uses kernel 5.15.
I'm glad that they announced support for the 7900 series cards. But they still have some way to go before ordinary users can use them for ML work. Even when using a 5.x kernel, I get page fault errors pretty often (ROCm/pytorch#1284). Then again, I had these issues when I was using ROCm 5.6 as well, so the issues I saw may not be related to what you are experiencing.

@evshiron
Owner

@briansp2020

I am currently using Ubuntu 22.04.1 with Linux kernel 6.2.0. As you mentioned, it is possible that the kernel version could be the reason.

Anyway, ROCm 5.6 is working fine on my end and PyTorch now distributes their stable version for ROCm 5.6, so I might stick with this version for a longer time.

I haven't used SHARK yet, but I think SHARK for AMD is a Vulkan thing.
