Hi TornadoVM team. I'm trying to implement GEMM (matrix multiplication) as described here: https://siboehm.com/articles/22/CUDA-MMM, but on TornadoVM. I'm currently stuck on the 2nd example (memory coalescing). The problem is that this approach requires setting blockDim and gridDim as follows:

    // gridDim stays the same
    dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
    // make blockDim 1-dimensional, but don't change number of threads
    dim3 blockDim(32 * 32);
    sgemm_coalescing<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

I tried to do something like this in TornadoVM:

    int BLOCK_SIZE = 32;
    WorkerGrid workerGrid = new WorkerGrid2D(m, n);
    workerGrid.setLocalWork(BLOCK_SIZE * BLOCK_SIZE, 1, 1);

Obviously, this won't work, because gridDim (the number of workgroups) is calculated automatically from the global and local work sizes:

    numOfWorkgroups = new long[globalWork.length];
    for (int i = 0; i < globalWork.length; i++) {
        numOfWorkgroups[i] = globalWork[i] / localWork[i];
    }

I was able to implement the example by recalculating indices inside the kernel:

    public static void coalescingKernel(
            KernelContext context, FloatArray a, FloatArray b, FloatArray c,
            int m, int k, int n,
            int mTiles, int nTiles,
            float alpha, float beta
    ) {
        // Flatten the 2D global id into a single linear thread ordinal.
        int ordinal = context.globalGroupSizeX * context.globalIdy + context.globalIdx;
        // Split the ordinal into a block index and a position inside that block.
        int blockOrdinal = ordinal / (BLOCK_SIZE * BLOCK_SIZE);
        int localOrdinal = ordinal % (BLOCK_SIZE * BLOCK_SIZE);
        int blockX = blockOrdinal / nTiles;
        int blockY = blockOrdinal % nTiles;
        // Consecutive ordinals map to consecutive columns (y) of the same row (x).
        int x = blockX * BLOCK_SIZE + (localOrdinal / BLOCK_SIZE);
        int y = blockY * BLOCK_SIZE + (localOrdinal % BLOCK_SIZE);
        if (x < m && y < n) {
            float sum = 0.0f;
            for (int i = 0; i < k; i++) {
                sum += a.get(x * k + i) * b.get(i * n + y);
            }
            c.set(x * n + y, alpha * sum + beta * c.get(x * n + y));
        }
    }

But I'm still not sure it will work in 100% of cases, and besides, all this index manipulation will probably affect the performance of the kernel. Can you point me in the right direction with this? Am I missing something? Thank you.
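One way to probe the "100% of cases" doubt without a GPU is to replay the same index arithmetic on the CPU and count how often each output cell is visited. The following is only a rough sketch under stated assumptions: the class and method names are hypothetical, `nTiles` is assumed to be `ceil(n / BLOCK_SIZE)`, and the thread ordinal is assumed to run from `0` to `threads - 1` without gaps. It does not use TornadoVM at all.

```java
final class CoalescedIndexCheck {
    static final int BLOCK_SIZE = 32;

    // Rounds up the integer division a / b (same role as CEIL_DIV in the CUDA snippet).
    static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    // Replays the kernel's index math for `threads` linear ordinals and returns
    // how many cells of the m x n output are visited exactly once.
    static int cellsCoveredOnce(int m, int n, int threads) {
        int nTiles = ceilDiv(n, BLOCK_SIZE); // assumption: how nTiles is computed on the host
        int[] hits = new int[m * n];
        for (int ordinal = 0; ordinal < threads; ordinal++) {
            int blockOrdinal = ordinal / (BLOCK_SIZE * BLOCK_SIZE);
            int localOrdinal = ordinal % (BLOCK_SIZE * BLOCK_SIZE);
            int x = (blockOrdinal / nTiles) * BLOCK_SIZE + localOrdinal / BLOCK_SIZE;
            int y = (blockOrdinal % nTiles) * BLOCK_SIZE + localOrdinal % BLOCK_SIZE;
            if (x < m && y < n) {
                hits[x * n + y]++;
            }
        }
        int once = 0;
        for (int h : hits) {
            if (h == 1) {
                once++;
            }
        }
        return once;
    }

    // Thread count padded to whole BLOCK_SIZE x BLOCK_SIZE tiles in both dimensions.
    static int paddedThreads(int m, int n) {
        return ceilDiv(m, BLOCK_SIZE) * ceilDiv(n, BLOCK_SIZE) * BLOCK_SIZE * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // Sizes that are multiples of BLOCK_SIZE: m * n threads are enough.
        System.out.println(cellsCoveredOnce(64, 64, 64 * 64) == 64 * 64);
        // Sizes that are not multiples: m * n threads leave some cells unvisited...
        System.out.println(cellsCoveredOnce(33, 33, 33 * 33) < 33 * 33);
        // ...while a thread count padded to whole tiles covers every cell once.
        System.out.println(cellsCoveredOnce(33, 33, paddedThreads(33, 33)) == 33 * 33);
    }
}
```

If this model matches what the kernel sees, the remapping looks safe when `m` and `n` are multiples of `BLOCK_SIZE`, but with a global size of exactly `m * n` threads it appears to skip cells for other sizes, so the launch would probably need to be padded to whole tiles.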
Replies: 1 comment 1 reply
Hmm, looks like there's a simpler (and pretty obvious) solution to this:

    WorkerGrid2D workerGrid = new WorkerGrid2D(m * BLOCK_SIZE, n / BLOCK_SIZE);
    workerGrid.setLocalWork(BLOCK_SIZE * BLOCK_SIZE, 1, 1);

Interesting that it is slower than my first implementation with recalculated indices. But that's probably another question.
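To see why this grid gives the same number of workgroups as the CUDA launch, the workgroup formula quoted in the question can be replayed in plain Java. This is just a sketch: the class and method names are hypothetical, only the division loop comes from the thread, and `m` and `n` are assumed to be multiples of `BLOCK_SIZE` so the integer division does not truncate.

```java
import java.util.Arrays;

final class WorkgroupCountSketch {
    // Same per-dimension division as the loop quoted in the question.
    static long[] numOfWorkgroups(long[] globalWork, long[] localWork) {
        long[] groups = new long[globalWork.length];
        for (int i = 0; i < globalWork.length; i++) {
            groups[i] = globalWork[i] / localWork[i];
        }
        return groups;
    }

    public static void main(String[] args) {
        long blockSize = 32;
        long m = 4096;
        long n = 4096;
        long[] local = {blockSize * blockSize, 1};

        // First attempt: global (m, n) with a 1024 x 1 workgroup.
        long[] firstAttempt = numOfWorkgroups(new long[] {m, n}, local);
        System.out.println(Arrays.toString(firstAttempt)); // [4, 4096]

        // Grid from the reply: global (m * 32, n / 32) with the same workgroup.
        long[] reshaped = numOfWorkgroups(new long[] {m * blockSize, n / blockSize}, local);
        System.out.println(Arrays.toString(reshaped)); // [128, 128]
    }
}
```

With the reshaped global size, each dimension ends up with `m / 32` and `n / 32` workgroups of 1024 threads, which matches `CEIL_DIV(M, 32)` and `CEIL_DIV(N, 32)` for sizes that divide evenly. For other sizes the division truncates, so the global size would likely need rounding up first.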