
Cooperative groups #2307

Draft: MichaelVarvarin wants to merge 21 commits into develop from cooperative-groups

Conversation

MichaelVarvarin
Contributor

Add support for cooperative groups and related functionality

@fwyzard
Contributor

fwyzard commented Jul 6, 2024

Looks good so far :-)

One functionality that we will need is the possibility of querying the maximum number of blocks that can be used with a given kernel, so the user can store it and use it for launching the kernel.
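For illustration, here is a minimal plain-CUDA sketch of one way such a query could look, based on the occupancy API; the helper name maxCooperativeBlocks and the use of the raw CUDA runtime instead of the alpaka API are assumptions for the example, not part of this PR:

    #include <cuda_runtime.h>

    // Hypothetical helper: upper bound on the grid size usable for a cooperative
    // launch of `kernel`, since all blocks must be resident on the device at once.
    template<typename TKernel>
    int maxCooperativeBlocks(TKernel kernel, int blockSize, int device)
    {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, device);

        int blocksPerSm = 0;
        // Maximum number of blocks of `kernel` that can be resident on one multiprocessor.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, kernel, blockSize, 0 /* dynamic shared mem */);

        return blocksPerSm * prop.multiProcessorCount;
    }

The user could then store this value and use it as the block count when launching the kernel cooperatively.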

@MichaelVarvarin
Contributor Author

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount.


//! Hello world kernel, utilizing grid synchronization.
//! Prints hello world from a thread, performs a grid sync,
//! and prints the sum of the indices of this thread and the opposite thread (the sums have to be the same).
Contributor

[nit] Could you explain what the opposite thread is here?

Contributor Author

The thread that has the same distance from the end of the grid dimension as this one has from the start. So, if the IDs range from 0 to 9, the pairs are 0 and 9, 1 and 8, 2 and 7, and so on. Their sum is constant, so we can check whether the grid sync was performed successfully.
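For illustration, a minimal plain-CUDA sketch of that check (the PR itself goes through the alpaka API; the kernel name and the out buffer are made up for this example):

    #include <cstdio>
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Requires a cooperative launch; `out` must hold one int per thread.
    __global__ void helloGridSyncKernel(int* out)
    {
        cg::grid_group grid = cg::this_grid();

        int const idx = blockIdx.x * blockDim.x + threadIdx.x;
        int const gridSize = gridDim.x * blockDim.x;

        printf("Hello world from thread %d\n", idx);
        out[idx] = idx;                          // publish this thread's index
        grid.sync();                             // grid-wide barrier

        int const opposite = gridSize - 1 - idx; // thread mirrored across the grid
        // If the grid sync worked, every thread prints the same sum: gridSize - 1.
        printf("Thread %d: sum with opposite thread is %d\n", idx, idx + out[opposite]);
    }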

@mehmetyusufoglu
Contributor

Add support for cooperative groups and related functionality

Could you add more details to the PR description?

@MichaelVarvarin
Contributor Author

Add support for cooperative groups and related functionality

Could you add more details to the PR description?

That depends on the desired scope of the PR. I've deliberately kept it vague so we can decide later when to merge it.

@MichaelVarvarin
Contributor Author

MichaelVarvarin commented Jul 13, 2024

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount.

This looks like an upstream issue, at least locally when compiling with clang 17.0.6 and CUDA 12.1.1 and 12.5.

@fwyzard
Contributor

fwyzard commented Jul 13, 2024

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount.

This looks like an upstream issue, at least locally when compiling with clang 17.0.6 and CUDA 12.1.1 and 12.5.

Is it supposed to work?
What is the maximum number of concurrent blocks that can be used cooperatively with this kernel?

@MichaelVarvarin
Contributor Author

MichaelVarvarin commented Jul 13, 2024

For some reason gridSync locks up when compiled with CUDA Clang, if numberOfBlocks > 2 * multiProcessorCount.

This looks like an upstream issue, at least locally when compiling with clang 17.0.6 and CUDA 12.1.1 and 12.5.

Is it supposed to work? What is the maximum number of concurrent blocks that can be used cooperatively with this kernel?

Yes, it is. The maximum number reported is 16 * multiProcessorCount, and the kernel refuses to launch on both nvcc and clang if this number is exceeded.

@MichaelVarvarin force-pushed the cooperative-groups branch 2 times, most recently from bb5ccaa to 9614d9c on July 26, 2024 12:35
@fwyzard
Contributor

fwyzard commented Jul 26, 2024

Add cooperative kernel launch and grid sync support for HIP

Nice 👍🏻

@MichaelVarvarin
Contributor Author

MichaelVarvarin commented Jul 26, 2024

Add cooperative kernel launch and grid sync support for HIP

Nice 👍🏻

Unfortunately it doesn't actually work; I will investigate. It would be funny if I found a second compiler bug.
Update: it does work, it was a hardware issue.

@MichaelVarvarin force-pushed the cooperative-groups branch 2 times, most recently from 36c2af3 to 334cea9 on August 11, 2024 13:17
include/alpaka/alpaka.hpp (review comment outdated, resolved)
Comment on lines +257 to +267
    static inline Error_t launchCooperativeKernel(
        void const* func,
        dim3 gridDim,
        dim3 blockDim,
        void** args,
        size_t sharedMem,
        Stream_t stream)
    {
        return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
    }

Contributor

Could you change this to be templated on the func argument?

Suggested change

-    static inline Error_t launchCooperativeKernel(
-        void const* func,
-        dim3 gridDim,
-        dim3 blockDim,
-        void** args,
-        size_t sharedMem,
-        Stream_t stream)
-    {
-        return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
-    }
+    template <typename TFunc>
+    static inline Error_t launchCooperativeKernel(
+        TFunc func,
+        dim3 gridDim,
+        dim3 blockDim,
+        void** args,
+        size_t sharedMem,
+        Stream_t stream)
+    {
+        return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
+    }

Same for the HIP implementation.

void const* kernelArgs[] = {&threadElemExtent, &task.m_kernelFnObj, &args...};

TApi::launchCooperativeKernel(
reinterpret_cast<void*>(kernelName),
Contributor

Can you check if the cast can be removed after implementing https://github.com/alpaka-group/alpaka/pull/2307/files#r1714986927?

Contributor Author

No, you now have to do the same cast inside launchCooperativeKernel.
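To illustrate the point, a rough sketch following the templated wrapper suggested above (the names come from that suggestion and are not final API): the CUDA runtime entry point used here is the type-erased one, so the cast simply moves from the call site into the wrapper.

    template <typename TFunc>
    static inline Error_t launchCooperativeKernel(
        TFunc func,
        dim3 gridDim,
        dim3 blockDim,
        void** args,
        size_t sharedMem,
        Stream_t stream)
    {
        // The C-style runtime function takes a type-erased function pointer,
        // so the cast is performed here instead of at every call site.
        return ::cudaLaunchCooperativeKernel(
            reinterpret_cast<void const*>(func), gridDim, blockDim, args, sharedMem, stream);
    }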

@SimeonEhrig marked this pull request as ready for review September 10, 2024 06:22
@psychocoderHPC marked this pull request as draft September 10, 2024 08:14
@SimeonEhrig
Member

@MichaelVarvarin Sorry for removing the draft state. I thought I was starting the GitHub Actions jobs. Not sure what went wrong.

@psychocoderHPC added this to the 2.0.0 milestone Sep 10, 2024