MI100 performance #289
What is the expected performance of MI100? I was expecting a much higher number, since its theoretical fp16 performance is more than 180 TFlops. I was getting higher numbers when testing the 7900XTX, even though it has a lower theoretical peak!

Comments
Hi @briansp2020, which release of ROCm are you using? Then I can see if I can reproduce the performance you are seeing on MI-100.

NB: For this particular sample, you have to be careful with the supported block sizes on the 7900XTX, as RDNA cards only support blockM/N of 16. The benchmark would run, but it won't validate successfully in debug mode. The challenge is that 'high-performing' GEMMs may have different parameters on different architectures. This issue has been reported and will be addressed in a future release.
For example, I just ran the sample on MI-100 around the ROCm 5.6 release, and it achieved close to 90 TFlops, which appears typical for this release.
@cgmillette Since I have your attention, I'd like to ask some questions. I'm trying to figure out how fast the 7900XTX will eventually become once the software matures. I'm comparing it to the MI100, since I assume MI100 software support is mature and its theoretical fp16 peak is similar. So far I have run some micro-benchmarks and am getting conflicting results: using TensorFlow, the MI100 is much faster with CNNs than the 7900XTX, but the PyTorch micro-benchmark gives the opposite result, with the 7900XTX ahead.

When running more real-world tasks (e.g. fastai/course22#96), the MI100 and 7900XTX seem to perform very similarly. Do you expect that the 7900XTX will eventually outperform the MI100, as it does in the PyTorch micro-benchmark, or does the MI100 still need more optimization? new-ai-benchmark shows the 7900XTX to be faster than the MI100 (see this and this), even though the MI100 has much higher theoretical fp16 performance. Also, if you know of any document that shows the relative performance of different AMD hardware for ML tasks, I'd really like to see it. Thank you!
Hi @briansp2020, no, I built with cmake with no special parameters, just like in the README.md:
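The standard flow from the README looks roughly like this (a sketch from memory; exact flags, paths, and the sample binary location may vary between releases):

```bash
# Build rocWMMA and its samples (assumes hipcc is on PATH, e.g. /opt/rocm/bin)
git clone https://github.com/ROCmSoftwarePlatform/rocWMMA.git
cd rocWMMA
CC=hipcc CXX=hipcc cmake -B build .
cmake --build build -- -j

# Then run the benchmark sample discussed in this thread
./build/samples/perf_hgemm
```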
Just to clarify: the sample that I ran is perf_hgemm, which uses the fp16 input datatype (hgemm = fp16). This is supported on both the 7900XTX (blockM/N = 16) and the MI-100 (blockM/N = 16, 32). I noticed that you previously ran perf_sgemm, which uses the fp32 input datatype (sgemm = fp32). This is not supported on the 7900XTX, but it is supported on the MI-100 (blockM/N = 16, 32). Please note that the performances for these two datatypes are very different, and that the supported block sizes differ as well.

Comparing the 7900XTX with the MI-100 is not quite an "apples-to-apples" exercise. The first major difference is the architecture: the former is RDNA and the latter is CDNA. Both have matrix-multiply functionality, but RDNA cards are consumer-grade gaming / graphics parts, while CDNA cards are data-center, HPC-centric parts. The two cards have vastly different properties in terms of CU count, clocks, memory type, capacity, and bandwidth. This means that either of them might have an advantage depending on what kind of AI workload you are running: if your problem is compute-bound, you might see better performance from higher clocks and a higher CU count; if your problem is memory-bound, you may see better performance from more memory and higher bandwidth.

Because of the vastness of different AI problems, there is unfortunately no blanket solution that wins every benchmark. We can only try to pick the best tool (card) for the particular job. rocWMMA's job in the meantime is to enable users to leverage matrix-multiply hardware, so our focus is on MFMA / WMMA enablement and performance. For tools such as PyTorch or TensorFlow, their maintainers could give better answers to questions about their particular benchmarks. Cheers!
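To make the block-size constraint concrete, here is a minimal sketch of how those blockM/N/K parameters appear in rocWMMA code (the kernel, its names, and the single-tile simplification are illustrative, not taken from the perf_hgemm sample):

```cpp
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

// RDNA (e.g. 7900XTX) only supports 16 here; CDNA (e.g. MI-100) also supports 32.
constexpr uint32_t BlockM = 16;
constexpr uint32_t BlockN = 16;
constexpr uint32_t BlockK = 16;

// Minimal single-wavefront kernel: computes the top-left BlockM x BlockN tile
// of D = A * B, accumulating in fp32 over the K dimension (hgemm-style).
__global__ void hgemm_tile(float16_t const* a, // row-major, leading dim k
                           float16_t const* b, // col-major, leading dim k
                           float32_t* d,       // row-major, leading dim n
                           uint32_t n, uint32_t k)
{
    rocwmma::fragment<rocwmma::matrix_a, BlockM, BlockN, BlockK, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, BlockM, BlockN, BlockK, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, BlockM, BlockN, BlockK, float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);

    // March along K in BlockK steps, one MFMA/WMMA multiply-accumulate per step.
    for (uint32_t i = 0; i < k; i += BlockK)
    {
        rocwmma::load_matrix_sync(fragA, a + i, k);
        rocwmma::load_matrix_sync(fragB, b + i, k);
        rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
    }

    rocwmma::store_matrix_sync(d, fragAcc, n, rocwmma::mem_row_major);
}
```

On the MI-100 the same fragments could also be instantiated with blockM/N = 32, which is exactly the kind of per-architecture tuning parameter described above.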
@cgmillette [results screenshot not preserved]
@briansp2020 Can you tell me which version of ROCm you are using?
A ROCm 5.7.1 docker image I built (based on this), running on an Ubuntu 22.04.3 server with kernel 5.15 and the ROCm 5.7.1 dkms driver.
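For anyone reproducing this setup, a typical ROCm container launch looks roughly like the following sketch (the image name is a placeholder for the locally built image):

```bash
# Expose the GPU to the container via the kfd and dri device nodes,
# using the standard flags from the ROCm docker documentation.
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --security-opt seccomp=unconfined \
    --group-add video \
    my-rocm-5.7.1-image
```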
I just ran it using a ROCm 5.6.1 docker image I built, and the result looks broken. Do I need to match the docker userland files with the kernel module?
@cgmillette Thank you!
Right on! Most welcome, and take care.