MI100 performance #289

Closed
briansp2020 opened this issue Oct 25, 2023 · 10 comments

Comments
@briansp2020

What is the expected performance of the MI100? I was expecting a much higher number, since its theoretical performance is more than 180 TF. I was getting higher numbers when I was testing a 7900XTX, even though it has a lower theoretical peak performance!

./perf_sgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 165.95, 736.587, 22.193
Finished!
./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 78.1733, 736.587, 47.1124
Finished!

@cgmillette
Contributor

Hi @briansp2020,
Thanks for reaching out!

Which release of ROCm are you using? With that I can check whether I can reproduce the performance you are seeing on MI-100.

NB: For this particular sample, you have to be careful with the supported block sizes on 7900XTX, as RDNA cards only support blockM/N of 16. The benchmark will run, but it won't validate successfully in debug mode. The challenge is that 'high performing' GEMMs may have different parameters on different architectures. This issue has been reported and will be addressed in a future release.
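For reference, the block size shows up as the BlockM / BlockN template parameters on the rocwmma::fragment types. A minimal sketch of the constraint (the exact type aliases used inside the sample may differ):

#include <rocwmma/rocwmma.hpp>

// BlockM/N = 16: accepted on both RDNA (e.g. 7900XTX) and CDNA (e.g. MI-100).
using FragA16 = rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16,
                                  rocwmma::float16_t, rocwmma::row_major>;

// BlockM/N = 32: CDNA-only. On a 7900XTX this configuration may appear to run
// but will not validate.
using FragA32 = rocwmma::fragment<rocwmma::matrix_a, 32, 32, 16,
                                  rocwmma::float16_t, rocwmma::row_major>;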

@cgmillette
Contributor

For example, I just ran the sample on MI-100 around the ROCm 5.6 release, and it achieved close to 90 TFlops, which appears typical for this release.

Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 41.825, 736.587, 88.0557
Finished!

@briansp2020
Author

@cgmillette
Did you have to specify any parameters to get 90 TF? I think I specified command-line parameters when I ran the test on my 7900XTX before. Unfortunately, I forgot what the parameters were (I got them from an internet search and did not write them down. Doh!). Or are you supposed to get close to 90 TF just by running perf_sgemm? I'm using a second-hand MI100 I got off eBay in my 7900XT PC, and I built rocWMMA from git today.

Since I have your attention, I'd like to ask some questions. I'm trying to figure out how fast the 7900XTX will eventually become once the software matures, by comparing it to the MI100, since I'm assuming MI100 software support is already mature and its theoretical fp16 peak is similar. So far I have run some micro-benchmarks and am getting conflicting results.

Using TensorFlow, MI100 is much faster with CNNs

root@rocm:/root/benchmarks/scripts/tf_cnn_benchmarks# python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

TensorFlow: 2.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 128 global
128 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 1135.2 +/- 0.0 (jitter = 0.0) 7.788
10 images/sec: 1138.1 +/- 1.0 (jitter = 3.5) 7.743
20 images/sec: 1138.7 +/- 0.7 (jitter = 4.3) 7.823
30 images/sec: 1138.5 +/- 0.5 (jitter = 3.4) 7.963
40 images/sec: 1138.2 +/- 0.4 (jitter = 2.4) 7.889
50 images/sec: 1137.9 +/- 0.4 (jitter = 2.4) 7.787
60 images/sec: 1137.7 +/- 0.4 (jitter = 2.4) 8.015
70 images/sec: 1137.3 +/- 0.4 (jitter = 2.9) 7.876
80 images/sec: 1137.1 +/- 0.3 (jitter = 2.9) 7.931
90 images/sec: 1136.8 +/- 0.3 (jitter = 3.3) 7.734
100 images/sec: 1136.4 +/- 0.3 (jitter = 3.3) 7.987

total images/sec: 1136.18

compared to 7900XTX

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --use_fp16=True --model=resnet50

Running warm up
Done warm up
Step Img/sec total_loss
1 images/sec: 706.8 +/- 0.0 (jitter = 0.0) 7.444
10 images/sec: 703.8 +/- 1.9 (jitter = 1.8) 7.422
20 images/sec: 703.2 +/- 1.0 (jitter = 2.4) 7.468
30 images/sec: 703.1 +/- 0.8 (jitter = 3.2) 7.564
40 images/sec: 702.9 +/- 0.8 (jitter = 3.1) 7.518
50 images/sec: 703.0 +/- 0.6 (jitter = 3.1) 7.447
60 images/sec: 703.4 +/- 0.6 (jitter = 3.1) 7.603
70 images/sec: 703.1 +/- 0.5 (jitter = 3.6) 7.516
80 images/sec: 703.0 +/- 0.5 (jitter = 3.7) 7.560
90 images/sec: 703.0 +/- 0.4 (jitter = 3.6) 7.433
100 images/sec: 702.9 +/- 0.4 (jitter = 3.7) 7.601

total images/sec: 702.71

But PyTorch micro bench gives the opposite result
MI100

python micro_benchmarking_pytorch.py --network convnext_small --fp16 1
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP16
Mini batch size [img] : 64
Time per mini-batch : 0.33329823017120364
Throughput [img/sec] : 192.02022155090785

7900XTX

python micro_benchmarking_pytorch.py --network convnext_small --fp16 1
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP16
Mini batch size [img] : 64
Time per mini-batch : 0.2943673968315125
Throughput [img/sec] : 217.41538189649373

When running more real-world tasks (e.g. fastai/course22#96), the MI100 and 7900XTX seem to perform very similarly. Do you expect that the 7900XTX will eventually perform better than the MI100, as it does in the PyTorch micro-benchmark, or does the MI100 still need more optimization? new-ai-benchmark shows the 7900XTX to be faster than the MI100 (see this and this), even though the MI100 has much higher theoretical fp16 performance.

Also, if you know of any document that shows the relative performance of different AMD hardware on ML tasks, I'd really like to see it.

Thank you!

@cgmillette
Contributor

Hi @briansp2020,

No, I built with cmake and no special parameters, just like in the README.md:

CC=hipcc CXX=hipcc cmake -B<build_dir> . -DAMDGPU_TARGETS=gfx908:xnack-
cd <build_dir>
make perf_hgemm

Just need to clarify:

The sample that I ran is perf_hgemm, which uses the fp16 input datatype (hgemm = fp16). This is supported on both 7900XTX (blockM/N = 16) and MI-100 (BlockM/N = 16, 32).

I noticed that you previously ran perf_sgemm, which uses the fp32 input datatype (sgemm = fp32). This is not supported on 7900XTX; however, it is supported on MI-100 (BlockM/N = 16, 32).

Please note that the performance of these two datatypes is very different, and the supported block sizes differ as well.
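In code terms, the two samples differ in the input element type of their fragments. A rough illustration, not the literal declarations from the samples:

// perf_hgemm path: fp16 (float16_t) inputs; works on both 7900XTX and MI-100.
using HgemmInputA = rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16,
                                      rocwmma::float16_t, rocwmma::col_major>;

// perf_sgemm path: fp32 (float32_t) inputs; MI-100 only for this sample.
// Your earlier perf_sgemm run used BlockM/N = 32 with BlockK = 16.
using SgemmInputA = rocwmma::fragment<rocwmma::matrix_a, 32, 32, 16,
                                      rocwmma::float32_t, rocwmma::col_major>;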

Comparing 7900XTX with MI-100 is not quite an "apples-to-apples" exercise.

The first major difference is the architectures - the former being RDNA, and the latter being CDNA.

They both have matrix-multiply functionality; however, RDNA cards are consumer-grade gaming / graphics cards, while CDNA cards are data-center, HPC-centric parts.

The two cards have vastly different properties in terms of CU count, clocks, memory type, capacity and bandwidth. This means that either of them might have an advantage depending on what kind of AI workload you are running. If your problem is compute-bound, you might see better performance with higher clocks and a higher CU count. Alternatively, if your problem is memory-bound, you may see better performance with more memory and higher bandwidth.

Because of the breadth of AI problems, there unfortunately is no blanket solution that wins every benchmark. We can only try to pick the best tool (card) for the particular job.

rocWMMA's job, in the meantime, is to enable users to leverage the matrix-multiply hardware, so our focus is on MFMA / WMMA enablement and performance. For other tools such as PyTorch or TensorFlow, their teams can give better answers to questions about their particular benchmarks.
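To make that concrete, the core of a rocWMMA GEMM looks roughly like the sketch below: one wavefront owns one BlockM x BlockN tile of the output and walks the K dimension with load_matrix_sync / mma_sync. This is only a minimal, unoptimized sketch against the public rocwmma API; the tuned perf_hgemm sample is considerably more elaborate (for example, the 32x32 blocks visible in your output).

#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

constexpr int WMMA_M = 16;
constexpr int WMMA_N = 16;
constexpr int WMMA_K = 16;

// One wavefront computes one 16x16 tile of D = A * B.
// A is row-major (lda), B is col-major (ldb), D is row-major (ldd).
// Launch with blockDim.x a multiple of the wavefront size (e.g. 128) and
// enough blocks to cover the M x N output.
__global__ void simple_hgemm(uint32_t m, uint32_t n, uint32_t k,
                             float16_t const* a, float16_t const* b, float32_t* d,
                             uint32_t lda, uint32_t ldb, uint32_t ldd)
{
    // fp16 inputs, fp32 accumulator.
    auto fragA   = rocwmma::fragment<rocwmma::matrix_a, WMMA_M, WMMA_N, WMMA_K,
                                     float16_t, rocwmma::row_major>();
    auto fragB   = rocwmma::fragment<rocwmma::matrix_b, WMMA_M, WMMA_N, WMMA_K,
                                     float16_t, rocwmma::col_major>();
    auto fragAcc = rocwmma::fragment<rocwmma::accumulator, WMMA_M, WMMA_N, WMMA_K,
                                     float32_t>();
    rocwmma::fill_fragment(fragAcc, 0.0f);

    // Map this wavefront to a tile of the output.
    auto waveRow = ((blockIdx.x * blockDim.x + threadIdx.x) / warpSize) * WMMA_M;
    auto waveCol = (blockIdx.y * blockDim.y + threadIdx.y) * WMMA_N;

    if(waveRow + WMMA_M <= m && waveCol + WMMA_N <= n)
    {
        // Accumulate A * B over K, one 16x16x16 MFMA/WMMA block at a time.
        for(uint32_t i = 0; i < k; i += WMMA_K)
        {
            rocwmma::load_matrix_sync(fragA, a + waveRow * lda + i, lda);
            rocwmma::load_matrix_sync(fragB, b + waveCol * ldb + i, ldb);
            rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
        }

        // Write the accumulated tile back out.
        rocwmma::store_matrix_sync(d + waveRow * ldd + waveCol, fragAcc, ldd,
                                   rocwmma::mem_row_major);
    }
}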

Cheers!

@briansp2020
Author

@cgmillette
Thank you for pointing out my mistake. I ran perf_hgemm on MI100 and the numbers are much more reasonable, though still much slower than they should be. Do you have any idea why I might be getting such a low score? I'm using a Ryzen 7900X and a cooler from eBay. The card does not get too hot, so I don't think it's a cooling issue...

./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 78.143, 736.587, 47.1307

@cgmillette
Contributor

@briansp2020 Can you tell me which version of ROCm you are using?

@briansp2020
Author

A ROCm 5.7.1 docker image I built (based on this), running on Ubuntu 22.04.3 server with kernel 5.15 + ROCm 5.7.1 dkms.
I'm now building a 5.6.1 docker image to try. If I still get low performance with 5.6.1, I'll try downgrading the kernel module to 5.6 and see if that helps, but it's easier to just try the 5.6 userland first.
Thank you!

@briansp2020
Author

I just ran it using the 5.6.1 docker image I built and the result looks broken. Do I need to match the docker userland files with the kernel module?

./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 0.0072, 736.587, 511519
Finished!

@briansp2020
Author

@cgmillette
I tried rocm/pytorch:latest and am getting the following, which looks much better. I guess the issue is my docker container, even though I have no idea why the image I built is having problems. I'll keep investigating. Thank you for your help.

./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 39.0735, 736.587, 94.2566
Finished!

@cgmillette
Contributor

Right on! Most welcome, and take care
