Cooperative groups #2307
Draft: MichaelVarvarin wants to merge 21 commits into alpaka-group:develop from MichaelVarvarin:cooperative-groups
Changes from 9 commits
Commits (21), all by MichaelVarvarin:
- c13f61c Add CreateTaskCooperativeKernel, grid sync and HelloWorldGridSyncExam…
- ecbdcb1 Add comment about issue with grid sync on CUDA Clang
- dd0681f Add cooperative kernel launch and grid sync support for HIP
- e92cee1 Add m_cooperativeLaunch device prop and runtime check for CG support …
- e423026 Clean errors in previous commit
- 8fcd8ac Clean formatting
- 94f07a9 Add getMaxActiveBlocks to get the maximum allowed block count for lau…
- c6c12fd Rename maxActiveBlocks trait
- 4ad8bae Fix issues from bad rebase
- a892019 Add cooperative kernel launch, grid sync and getMaxActiveBlocks for A…
- f7efa76 Clean formatting
- d76e397 Correct the comment
- 93b704c Add cooperative kernel launch, grid sync and getMaxActiveBlocks for O…
- d09ee84 Clean formatting
- 7fdbb60 Update comments
- 47d0a1c Add include gridSync OMP to alpaka.hpp
- 0051222 Add cooperative kernel launch, grid sync and getMaxActiveBlocks for s…
- beee9db Clean warnings for CPU accelerators
- 4db26da Clean warnings for the HIP accelerator
- 7b3e194 Merge branch 'develop' into cooperative-groups
- 25b0e22 Merge branch 'alpaka-group:develop' into cooperative-groups
@@ -0,0 +1,47 @@
#
# Copyright 2024 Mykhailo Varvarin
# SPDX-License-Identifier: ISC
#

################################################################################
# Required CMake version.

cmake_minimum_required(VERSION 3.22)

set_property(GLOBAL PROPERTY USE_FOLDERS ON)

################################################################################
# Project.

set(_TARGET_NAME helloWorldGridSync)

project(${_TARGET_NAME} LANGUAGES CXX)

#-------------------------------------------------------------------------------
# Find alpaka.

if(NOT TARGET alpaka::alpaka)
    option(alpaka_USE_SOURCE_TREE "Use alpaka's source tree instead of an alpaka installation" OFF)

    if(alpaka_USE_SOURCE_TREE)
        # Don't build the examples recursively
        set(alpaka_BUILD_EXAMPLES OFF)
        add_subdirectory("${CMAKE_CURRENT_LIST_DIR}/../.." "${CMAKE_BINARY_DIR}/alpaka")
    else()
        find_package(alpaka REQUIRED)
    endif()
endif()

#-------------------------------------------------------------------------------
# Add executable.

alpaka_add_executable(
    ${_TARGET_NAME}
    src/helloWorldGridSync.cpp)
target_link_libraries(
    ${_TARGET_NAME}
    PUBLIC alpaka::alpaka)

set_target_properties(${_TARGET_NAME} PROPERTIES FOLDER example)

add_test(NAME ${_TARGET_NAME} COMMAND ${_TARGET_NAME})
@@ -0,0 +1,102 @@
/* Copyright 2024 Mykhailo Varvarin
 * SPDX-License-Identifier: MPL-2.0
 */

#include <alpaka/alpaka.hpp>

#include <cstdint>
#include <iostream>

//! Hello world kernel, utilizing grid synchronization.
//! Prints hello world from a thread, performs grid sync.
//! and prints the sum of indixes of this thread and the opposite thread (the sums have to be the same).
//! Prints an error if sum is incorrect.
struct HelloWorldKernel
{
    template<typename Acc>
    ALPAKA_FN_ACC void operator()(Acc const& acc, uint32_t* data) const
    {
        // Get index of the current thread in the grid and the total number of threads.
        uint32_t gridThreadIdx = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        uint32_t gridThreadExtent = alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc)[0];

        printf("Hello, World from alpaka thread %u!\n", gridThreadIdx);

        // Write the index of the thread to array.
        data[gridThreadIdx] = gridThreadIdx;

        // Perform grid synchronization.
        alpaka::syncGridThreads(acc);

        // Get the index of the opposite thread.
        uint32_t gridThreadIdxOpposite = data[gridThreadExtent - gridThreadIdx - 1];

        // Sum them.
        uint32_t sum = gridThreadIdx + gridThreadIdxOpposite;

        // Get the expected sum.
        uint32_t expectedSum = gridThreadExtent - 1;

        // Print the result and signify an error if the grid synchronization fails.
        printf(
            "After grid sync, this thread is %u, thread on the opposite side is %u. Their sum is %u, expected: %u.%s",
            gridThreadIdx,
            gridThreadIdxOpposite,
            sum,
            expectedSum,
            sum == expectedSum ? "\n" : " ERROR: the sum is incorrect.\n");
    }
};

auto main() -> int
{
    // Define dimensionality and type of indices to be used in kernels
    using Dim = alpaka::DimInt<1>;
    using Idx = uint32_t;

    // Define alpaka accelerator type, which corresponds to the underlying programming model
    using Acc = alpaka::AccGpuCudaRt<Dim, Idx>;

    // Select the first device available on a system, for the chosen accelerator
    auto const platformAcc = alpaka::Platform<Acc>{};
    auto const devAcc = getDevByIdx(platformAcc, 0u);

    // Define type for a queue with requested properties: Blocking.
    using Queue = alpaka::Queue<Acc, alpaka::Blocking>;
    // Create a queue for the device.
    auto queue = Queue{devAcc};

    // Define kernel execution configuration of blocks,
    // threads per block, and elements per thread.
    Idx blocksPerGrid = 10;
    Idx threadsPerBlock = 1;
    Idx elementsPerThread = 1;

    using WorkDiv = alpaka::WorkDivMembers<Dim, Idx>;
    auto workDiv = WorkDiv{blocksPerGrid, threadsPerBlock, elementsPerThread};

    // Allocate memory on the device.
    alpaka::Vec<Dim, Idx> bufferExtent{blocksPerGrid * threadsPerBlock};
    auto deviceMemory = alpaka::allocBuf<uint32_t, Idx>(devAcc, bufferExtent);

    // Instantiate the kernel object.
    HelloWorldKernel helloWorldKernel;

    int maxBlocks = alpaka::getMaxActiveBlocks<Acc>(
        devAcc,
        helloWorldKernel,
        threadsPerBlock,
        elementsPerThread,
        getPtrNative(deviceMemory));
    std::cout << "Maximum blocks for the kernel: " << maxBlocks << std::endl;

    // Create a task to run the kernel.
    // Note the cooperative kernel specification.
    // Only cooperative kernels can perform grid synchronization.
    auto taskRunKernel
        = alpaka::createTaskCooperativeKernel<Acc>(workDiv, helloWorldKernel, getPtrNative(deviceMemory));

    // Enqueue the kernel execution task..
    alpaka::enqueue(queue, taskRunKernel);
    return 0;
}
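The example prints `maxBlocks` but launches with a fixed `blocksPerGrid` of 10. With CUDA, a cooperative launch requires every block of the grid to be resident on the device at the same time, so a caller may want to compare the two values before enqueueing. The following is a minimal sketch of such a check, assuming the value returned by `getMaxActiveBlocks` is a device-wide limit; `fitsCooperativeLaunch` is a hypothetical helper and not part of alpaka or of this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an alpaka API): returns true when the requested
// number of blocks does not exceed the limit reported for a cooperative launch.
// Assumption: maxActiveBlocks is the device-wide limit, as printed by the example.
inline bool fitsCooperativeLaunch(std::uint32_t blocksPerGrid, int maxActiveBlocks)
{
    // A negative limit would indicate a failed query; treat it as "does not fit".
    if(maxActiveBlocks < 0)
        return false;
    return blocksPerGrid <= static_cast<std::uint32_t>(maxActiveBlocks);
}
```

In the example above, such a check could sit between the `getMaxActiveBlocks` call and `createTaskCooperativeKernel`, for instance to print a diagnostic and return early when the grid is too large.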
@@ -76,6 +76,7 @@ namespace alpaka
        static constexpr DeviceAttr_t deviceAttributeMaxThreadsPerBlock = ::cudaDevAttrMaxThreadsPerBlock;
        static constexpr DeviceAttr_t deviceAttributeMultiprocessorCount = ::cudaDevAttrMultiProcessorCount;
        static constexpr DeviceAttr_t deviceAttributeWarpSize = ::cudaDevAttrWarpSize;
        static constexpr DeviceAttr_t deviceAttributeCooperativeLaunch = ::cudaDevAttrCooperativeLaunch;

        static constexpr Limit_t limitPrintfFifoSize = ::cudaLimitPrintfFifoSize;
        static constexpr Limit_t limitMallocHeapSize = ::cudaLimitMallocHeapSize;

@@ -253,6 +254,17 @@ namespace alpaka
            return ::cudaHostUnregister(ptr);
        }

        static inline Error_t launchCooperativeKernel(
            void const* func,
            dim3 gridDim,
            dim3 blockDim,
            void** args,
            size_t sharedMem,
            Stream_t stream)
        {
            return ::cudaLaunchCooperativeKernel(func, gridDim, blockDim, args, sharedMem, stream);
        }

Review comment on lines +257 to +267:
Could you change this to be templated on the [...]
Same for the HIP implementation.

        static inline Error_t launchHostFunc(Stream_t stream, HostFn_t fn, void* userData)
        {
#    if CUDART_VERSION >= 10000

@@ -395,6 +407,16 @@ namespace alpaka
        {
            return ::make_cudaExtent(w, h, d);
        }

        template<class T>
        static inline Error_t occupancyMaxActiveBlocksPerMultiprocessor(
            int* numBlocks,
            T func,
            int blockSize,
            size_t dynamicSMemSize)
        {
            return ::cudaOccupancyMaxActiveBlocksPerMultiprocessor(numBlocks, func, blockSize, dynamicSMemSize);
        }
    };

} // namespace alpaka
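The wrapper above reports occupancy per multiprocessor, while a cooperative launch is limited by the whole device. How the PR's `getMaxActiveBlocks` trait combines the two is not visible in this excerpt; the sketch below only illustrates the calculation that the CUDA documentation describes for cooperative launches, namely blocks per multiprocessor times multiprocessor count. `maxCooperativeGridBlocks` is a hypothetical name used for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not part of alpaka or this PR): device-wide upper bound
// on the grid size of a cooperative launch.
// blocksPerMultiprocessor: result of an occupancy query such as
//     cudaOccupancyMaxActiveBlocksPerMultiprocessor
// multiprocessorCount:     value of a device attribute such as
//     cudaDevAttrMultiProcessorCount
inline int maxCooperativeGridBlocks(int blocksPerMultiprocessor, int multiprocessorCount)
{
    assert(blocksPerMultiprocessor >= 0);
    assert(multiprocessorCount >= 0);
    return blocksPerMultiprocessor * multiprocessorCount;
}
```

For example, a kernel that fits 2 blocks per multiprocessor on a device with 40 multiprocessors could be launched cooperatively with at most 80 blocks under this calculation.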
[nit] Could you explain what the opposite thread is here?
It is the thread that is the same distance from the end of the grid dimension as this thread is from the start. So, if the IDs range from 0 to 9, the pairs are 0 and 9, 1 and 8, 2 and 7, and so on. Their sum is constant, so we can check whether the grid sync was performed successfully.
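
The pairing described in this reply can be sketched on the host without any accelerator. The code below is a sequential simulation, not the kernel from the PR: the first loop stands in for the writes before the grid sync, the second for the reads after it. `oppositeIndex` and `oppositeSumsAreConsistent` are hypothetical names chosen for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Index of the thread that is as far from the end of the grid as idx is from
// the start. Precondition: idx < extent.
inline std::uint32_t oppositeIndex(std::uint32_t idx, std::uint32_t extent)
{
    return extent - idx - 1;
}

// Host-only simulation of the check performed by the example kernel.
inline bool oppositeSumsAreConsistent(std::uint32_t threadCount)
{
    std::vector<std::uint32_t> data(threadCount);

    // Phase 1 (before the grid sync): every "thread" writes its own index.
    for(std::uint32_t idx = 0; idx < threadCount; ++idx)
        data[idx] = idx;

    // Phase 2 (after the grid sync): every "thread" reads the value written by
    // its opposite thread and checks that the sum equals threadCount - 1.
    for(std::uint32_t idx = 0; idx < threadCount; ++idx)
    {
        std::uint32_t const opposite = data[oppositeIndex(idx, threadCount)];
        if(idx + opposite != threadCount - 1)
            return false;
    }
    return true;
}
```

On a real device the two phases run concurrently across blocks, so without the grid sync a thread could read a slot its opposite thread has not written yet, and the sum would be wrong.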