Manually controlling number of workgroups in WorkerGrid? #498

lpld · 2024-07-13T17:00:59Z

lpld
Jul 13, 2024

Hi TornadoVM team.

I'm trying to implement GEMM as described here: https://siboehm.com/articles/22/CUDA-MMM, but on TornadoVM. I'm currently stuck on the 2nd example (memory coalescing). The problem is that this approach requires setting blockDim and gridDim as follows:

// gridDim stays the same
dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
// make blockDim 1-dimensional, but don't change number of threads
dim3 blockDim(32 * 32);
sgemm_coalescing<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

As you can see, gridDim is (M/32, N/32, 1) and blockDim is (32 * 32, 1, 1).

Trying to do something like this in TornadoVM:

int BLOCK_SIZE = 32;

WorkerGrid workerGrid = new WorkerGrid2D(m, n);
workerGrid.setLocalWork(BLOCK_SIZE * BLOCK_SIZE, 1, 1);

Obviously, this won't work, because gridDim/number of workgroups is calculated automatically inside WorkerGrid:

numOfWorkgroups = new long[globalWork.length];
for (int i = 0; i < globalWork.length; i++) {
    numOfWorkgroups[i] = globalWork[i] / localWork[i];
}

So instead of M/32 x N/32 workgroups, I get M/(32^2) x N.
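To make the arithmetic concrete, here is a minimal plain-Java sketch that repeats the division loop quoted above, without any TornadoVM dependency. The class name, the method name, and the matrix sizes are hypothetical, and the global work is assumed to be (m, n, 1) for the 2D grid.

```java
public class WorkgroupCount {
    // Same per-dimension integer division as the WorkerGrid loop quoted above.
    static long[] numOfWorkgroups(long[] globalWork, long[] localWork) {
        long[] groups = new long[globalWork.length];
        for (int i = 0; i < globalWork.length; i++) {
            groups[i] = globalWork[i] / localWork[i];
        }
        return groups;
    }

    public static void main(String[] args) {
        long m = 2048, n = 1024, blockSize = 32; // hypothetical sizes
        // Global work (m, n, 1) with local work (32 * 32, 1, 1), as in the attempt above.
        long[] groups = numOfWorkgroups(
                new long[] {m, n, 1},
                new long[] {blockSize * blockSize, 1, 1});
        System.out.println("got:    " + groups[0] + " x " + groups[1]);
        System.out.println("wanted: " + (m / blockSize) + " x " + (n / blockSize));
    }
}
```

For these sizes the loop yields 2 x 1024 workgroups, while the CUDA launch would use 64 x 32.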

I tried to extend WorkerGrid2D and add a setter for numOfWorkgroups, but the value I set seems to be ignored. As far as I understand, it gets recalculated in PTXDeviceContext.

I was able to implement the example by recalculating indices inside the kernel:

    public static void coalescingKernel(
            KernelContext context, FloatArray a, FloatArray b, FloatArray c,
            int m, int k, int n,
            int mTiles, int nTiles,
            float alpha, float beta
    ) {
        int ordinal = context.globalGroupSizeX * context.globalIdy + context.globalIdx;

        int blockOrdinal = ordinal / (BLOCK_SIZE * BLOCK_SIZE);
        int localOrdinal = ordinal % (BLOCK_SIZE * BLOCK_SIZE);

        int blockX = blockOrdinal / nTiles;
        int blockY = blockOrdinal % nTiles;

        int x = blockX * BLOCK_SIZE + (localOrdinal / BLOCK_SIZE);
        int y = blockY * BLOCK_SIZE + (localOrdinal % BLOCK_SIZE);

        if (x < m && y < n) {
            float sum = 0.0f;
            for (int i = 0; i < k; i++) {
                sum += a.get(x * k + i) * b.get(i * n + y);
            }
            c.set(x * n + y, alpha * sum + beta * c.get(x * n + y));
        }
    }

But I'm still not sure it will work in 100% of cases, and besides, all this index manipulation will probably affect the performance of the kernel.

Can you point me in the right direction with this? Am I missing something? Thank you.
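One way to probe the "100% of cases" question is to replay the index remapping on the CPU and count how many output elements are visited exactly once. The sketch below is plain Java with no TornadoVM dependency. It assumes the kernel is launched on an m x n global grid (so globalGroupSizeX equals m) and that nTiles is the ceiling of n / BLOCK_SIZE. Neither assumption is stated in the post, and the class and method names are hypothetical.

```java
public class RemapCoverage {
    static final int BLOCK_SIZE = 32;

    // Returns how many of the m * n output elements are visited exactly once
    // when every thread of an m x n global grid applies the remapping above.
    static int coveredOnce(int m, int n) {
        int nTiles = (n + BLOCK_SIZE - 1) / BLOCK_SIZE; // assumption: ceiling division
        int[] hits = new int[m * n];
        for (int gy = 0; gy < n; gy++) {
            for (int gx = 0; gx < m; gx++) {
                int ordinal = m * gy + gx; // assumption: globalGroupSizeX == m
                int blockOrdinal = ordinal / (BLOCK_SIZE * BLOCK_SIZE);
                int localOrdinal = ordinal % (BLOCK_SIZE * BLOCK_SIZE);
                int x = (blockOrdinal / nTiles) * BLOCK_SIZE + localOrdinal / BLOCK_SIZE;
                int y = (blockOrdinal % nTiles) * BLOCK_SIZE + localOrdinal % BLOCK_SIZE;
                if (x < m && y < n) {
                    hits[x * n + y]++;
                }
            }
        }
        int once = 0;
        for (int h : hits) {
            if (h == 1) {
                once++;
            }
        }
        return once;
    }

    public static void main(String[] args) {
        // Sizes that are multiples of BLOCK_SIZE.
        System.out.println(coveredOnce(64, 96) + " of " + (64 * 96));
        // Sizes that are not multiples of BLOCK_SIZE.
        System.out.println(coveredOnce(33, 33) + " of " + (33 * 33));
    }
}
```

Under these assumptions, the 64 x 96 case visits every element once, while the 33 x 33 case leaves elements unvisited, because an m x n launch supplies fewer threads than the padded tiles require. Padding the global size up to a multiple of BLOCK_SIZE in each dimension would be one way to address that.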

Answered by lpld

lpld · 2024-07-14T11:18:12Z

lpld
Jul 14, 2024
Author

Hmm, looks like there's a simpler (and pretty obvious) solution to this:

WorkerGrid2D workerGrid = new WorkerGrid2D(m * BLOCK_SIZE, n / BLOCK_SIZE);
workerGrid.setLocalWork(BLOCK_SIZE * BLOCK_SIZE, 1, 1);

Interestingly, it is slower than my first implementation with recalculated indices, but that's probably another question.
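The following plain-Java sketch spells out why this configuration produces the intended launch shape, using the globalWork / localWork division shown earlier in the thread. It also shows how a thread's group and local IDs could map to an output element, in the same form as the x and y computation in the kernel above. The class, method, and parameter names are hypothetical, and m and n are assumed to be multiples of BLOCK_SIZE.

```java
public class CoalescedGrid {
    static final int BLOCK_SIZE = 32;

    // Workgroup counts for global work (m * BLOCK_SIZE, n / BLOCK_SIZE)
    // and local work (BLOCK_SIZE * BLOCK_SIZE, 1).
    static long[] groups(long m, long n) {
        long[] global = {m * BLOCK_SIZE, n / BLOCK_SIZE};
        long[] local = {BLOCK_SIZE * BLOCK_SIZE, 1};
        return new long[] {global[0] / local[0], global[1] / local[1]};
    }

    // Output element (x, y) for a thread, given its group IDs and its
    // one-dimensional local ID in the range [0, BLOCK_SIZE * BLOCK_SIZE).
    static int[] element(int groupIdX, int groupIdY, int localIdX) {
        int x = groupIdX * BLOCK_SIZE + localIdX / BLOCK_SIZE;
        int y = groupIdY * BLOCK_SIZE + localIdX % BLOCK_SIZE;
        return new int[] {x, y};
    }

    public static void main(String[] args) {
        long[] g = groups(2048, 1024); // hypothetical sizes
        System.out.println(g[0] + " x " + g[1] + " workgroups");
        int[] e = element(1, 2, 33);
        System.out.println("thread maps to (" + e[0] + ", " + e[1] + ")");
    }
}
```

With m = 2048 and n = 1024 this gives 64 x 32 workgroups, which matches m / 32 x n / 32. Consecutive local IDs map to consecutive y values, which is the property the coalescing example relies on.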

1 reply

stratika Jul 15, 2024
Maintainer

Hi @lpld, the way you use WorkerGrid2D seems to be correct. In general, when configuring a TornadoVM WorkerGrid, the globalWork determines the CUDA gridDim (as globalWork / localWork in each dimension), whereas the localWork corresponds to the blockDim.

Just one note regarding the first dimension: should it be m * BLOCK_SIZE or m / BLOCK_SIZE?
