Matrix operations intrinsics #437
-
Hi. Are these operations implemented with simple stubs, or are they executed as is? And a follow-up question: is my understanding correct that the 4-by-4 classes are made to leverage the speed-up from tensor operations?
-
Hi @Laa , matrix types are mapped to 1D flat arrays in OpenCL/PTX/SPIR-V. When using matrix types that contain vector types, e.g., Matrix2DFloat4, the TornadoVM JIT compiler generates vector float4 instructions for all matrix elements.

At the moment (May 2024), TornadoVM does not generate any tensor instructions. Here I am assuming you refer to Intel AMX or NVIDIA Tensor types. It is in our plans to add this, but not in our immediate plans at the moment. If you would like to work on this, contributions are welcome.

Regarding the memory hierarchy, I am assuming you refer to shared memory (or local memory in OpenCL). The Parallel Loop API (@Parallel) does not optimize this. We had a PoC, but we haven't merged it into the master branch. However, you can still use local memory via the Kernel API, in which the programmer can "allocate" arrays, and copy and process elements in this memory region. The Kernel API in TornadoVM also exposes barriers to synchronize write accesses to local memory.
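To illustrate the "matrix types are mapped to 1D flat arrays" point above, here is a minimal plain-Java sketch of a row-major flat mapping. The class name FlatMatrix2D and the row-major layout are assumptions for illustration only; TornadoVM's actual internal layout may differ.

```java
// Sketch: a 2D matrix backed by a single 1D flat array,
// assuming a row-major layout for illustration.
final class FlatMatrix2D {
    private final float[] storage;
    private final int rows;
    private final int cols;

    FlatMatrix2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.storage = new float[rows * cols];
    }

    // Map a 2D coordinate (i, j) into the flat backing array.
    int flatIndex(int i, int j) {
        return i * cols + j;
    }

    float get(int i, int j) {
        return storage[flatIndex(i, j)];
    }

    void set(int i, int j, float value) {
        storage[flatIndex(i, j)] = value;
    }
}
```

With such a mapping, the JIT compiler only ever sees contiguous 1D accesses, which is what makes it straightforward to lower element accesses to flat loads/stores (and, for vector element types, to vector instructions).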
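The local-memory workflow the reply describes (allocate a local array per work-group, copy a slice in, process it with barriers between strided steps, write one result per group) can be emulated sequentially in plain Java. This is not the TornadoVM Kernel API itself, just an illustrative sketch of the pattern; on a real device, a local barrier would synchronize write accesses between the strided reduction steps.

```java
// Plain-Java emulation of the work-group local-memory reduction pattern.
// Assumes input.length is an exact multiple of groupSize.
final class LocalMemorySketch {
    static float[] groupReduce(float[] input, int groupSize) {
        int numGroups = input.length / groupSize;
        float[] partials = new float[numGroups];
        for (int g = 0; g < numGroups; g++) {
            // "Allocate" a local array for this work-group and copy its slice in.
            float[] local = new float[groupSize];
            System.arraycopy(input, g * groupSize, local, 0, groupSize);
            // Tree reduction inside the local array; on a GPU, a barrier
            // would separate each strided step to sync write accesses.
            for (int stride = groupSize / 2; stride > 0; stride /= 2) {
                for (int i = 0; i < stride; i++) {
                    local[i] += local[i + stride];
                }
            }
            partials[g] = local[0]; // one partial result per group
        }
        return partials;
    }
}
```

Each iteration of the outer loop corresponds to what one work-group would do in parallel on the device.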
-
Hi @jjfumero, may I ask a follow-up question about something that has been bothering me since our discussion?