Matrix operations intrinsics #437
-
Hi. Are these operations implemented with simple stubs, or are they executed as is? And a follow-up question: is my understanding correct that the 4-by-4 classes are made to leverage the speed-up from tensor operations?
-
Hi @Laa , matrix types are mapped to 1D flat arrays in OpenCL/PTX/SPIR-V. When using matrix types that contain vector types, e.g., Matrix2DFloat4, the TornadoVM JIT compiler generates vector float4 instructions for all matrix elements.

At the moment (May 2024), TornadoVM does not generate any tensor instructions. Here I am assuming you refer to Intel AMX or NVIDIA Tensor types. It is in our plans to add this, but not in our immediate plans at the moment. If you would like to work on this, contributions are welcome.

Regarding the memory hierarchy, I am assuming you refer to shared memory (or local memory in OpenCL). The Parallel Loop API (@Parallel) does not optimize this. We had a PoC, but we haven't merged it into the master branch. However, you can still use local memory via the Kernel API, in which the programmer can "allocate" arrays, and copy and process elements in this memory region. The Kernel API in TornadoVM also exposes barriers to synchronize write accesses to local memory.
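To illustrate the "matrix types are mapped to 1D flat arrays" point above, here is a minimal plain-Java sketch of a row-major flat mapping. The class name FlatMatrix2D and the row-major layout are assumptions for illustration only; TornadoVM's actual internal layout may differ.

```java
// Sketch: a 2D matrix backed by a single 1D flat array,
// assuming a row-major layout for illustration.
final class FlatMatrix2D {
    private final float[] storage;
    private final int rows;
    private final int cols;

    FlatMatrix2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.storage = new float[rows * cols];
    }

    // Map a 2D coordinate (i, j) into the flat backing array.
    int flatIndex(int i, int j) {
        return i * cols + j;
    }

    float get(int i, int j) {
        return storage[flatIndex(i, j)];
    }

    void set(int i, int j, float value) {
        storage[flatIndex(i, j)] = value;
    }
}
```

With such a mapping, the JIT compiler only ever sees contiguous 1D accesses, which is what makes it straightforward to lower element accesses to flat loads/stores (and, for vector element types, to vector instructions).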
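The local-memory workflow the reply describes (allocate a local array per work-group, copy a slice in, process it with barriers between strided steps, write one result per group) can be emulated sequentially in plain Java. This is not the TornadoVM Kernel API itself, just an illustrative sketch of the pattern; on a real device, a local barrier would synchronize write accesses between the strided reduction steps.

```java
// Plain-Java emulation of the work-group local-memory reduction pattern.
// Assumes input.length is an exact multiple of groupSize.
final class LocalMemorySketch {
    static float[] groupReduce(float[] input, int groupSize) {
        int numGroups = input.length / groupSize;
        float[] partials = new float[numGroups];
        for (int g = 0; g < numGroups; g++) {
            // "Allocate" a local array for this work-group and copy its slice in.
            float[] local = new float[groupSize];
            System.arraycopy(input, g * groupSize, local, 0, groupSize);
            // Tree reduction inside the local array; on a GPU, a barrier
            // would separate each strided step to sync write accesses.
            for (int stride = groupSize / 2; stride > 0; stride /= 2) {
                for (int i = 0; i < stride; i++) {
                    local[i] += local[i + stride];
                }
            }
            partials[g] = local[0]; // one partial result per group
        }
        return partials;
    }
}
```

Each iteration of the outer loop corresponds to what one work-group would do in parallel on the device.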
-
Hi @jjfumero, may I ask a follow-up question about something that has been bothering me since our discussion?