MappingUtils has been integrated into ROCm SDK 6.2. It defines wave coordinates <waveRows, waveCols> through a block dimension of the form
blockDim = (waveRows * warpSize, waveCols) // warpSize is 64 on AMD CDNA GPUs (RDNA GPUs default to 32) and 32 on NVIDIA GPUs
Here <waveRows, waveCols> are the warp coordinates within each thread block, and each thread block is distributed to an SM (NVIDIA) or a CU (AMD).
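The relation between blockDim and the wave coordinate can be sketched on the host in plain C++. This is only an illustration of the arithmetic, not the MappingUtils implementation: `Dim2`, `WaveCoord`, `makeBlockDim`, `waveCoordOf` and `laneIdOf` are hypothetical names, and the sketch assumes that blockDim.x is a multiple of warpSize so that each wave lies inside a single row of the block.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration only (not the ROCm API).
struct Dim2 { uint32_t x; uint32_t y; };
struct WaveCoord { uint32_t row; uint32_t col; };

// Block dimensions for a thread block holding waveRows x waveCols waves.
constexpr Dim2 makeBlockDim(uint32_t waveRows, uint32_t waveCols, uint32_t warpSize)
{
    return Dim2{waveRows * warpSize, waveCols};
}

// Wave coordinate of a thread inside its block.
// Assumes blockDim.x is a multiple of warpSize.
constexpr WaveCoord waveCoordOf(Dim2 tid, uint32_t warpSize)
{
    return WaveCoord{tid.x / warpSize, tid.y};
}

// Lane index of a thread inside its wave, under the same assumption.
constexpr uint32_t laneIdOf(Dim2 tid, uint32_t warpSize)
{
    return tid.x % warpSize;
}
```

With warpSize passed in as a parameter, the same thread index (70, 1) maps to wave row 1 when warpSize is 64 and to wave row 2 when warpSize is 32, which is the kind of difference a hard-coded warp size would get wrong.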
This feature eliminates hard-coded warp sizes and hard-coded partition hierarchy transformations, both of which depend on the hardware memory hierarchy, and so helps code work correctly across platforms.
Note that the partition hierarchy transformation and the hardware memory hierarchy can both change with the hardware. For example, the L2 cache may have a different number of memory banks (4 banks) than LDS (64 banks), which means that the best swizzling hyperparameters (if any exist) for memory level_{i} differ from those for memory level_{i+1}.
The MappingUtils code for a thread block looks like:
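To make the bank-count dependence concrete, here is a minimal host-side sketch of an XOR swizzle. It rests on simplifying assumptions that are not taken from any vendor documentation: the bank of a word is modeled as its word index modulo the number of banks, the tile row width and the swizzle period are powers of two, and the period does not exceed the row width. `swizzledIndex` and `distinctBanksForColumn` are hypothetical names.

```cpp
#include <cstdint>
#include <set>

// Word index of element (row, col) in a tile that is rowWidth words wide,
// with the column XOR-swizzled by (row % swizzlePeriod).
// swizzlePeriod == 1 means no swizzle.
constexpr uint32_t swizzledIndex(uint32_t row, uint32_t col,
                                 uint32_t rowWidth, uint32_t swizzlePeriod)
{
    return row * rowWidth + (col ^ (row % swizzlePeriod));
}

// Number of distinct banks touched when `rows` threads each read one
// element of the same column. Bank model (assumption): index % numBanks.
inline uint32_t distinctBanksForColumn(uint32_t rows, uint32_t col,
                                       uint32_t rowWidth, uint32_t swizzlePeriod,
                                       uint32_t numBanks)
{
    std::set<uint32_t> banks;
    for (uint32_t r = 0; r < rows; ++r)
    {
        banks.insert(swizzledIndex(r, col, rowWidth, swizzlePeriod) % numBanks);
    }
    return static_cast<uint32_t>(banks.size());
}
```

Under this model, a column read over 64 rows of a 64-word-wide tile hits 1 bank without swizzling, only 4 banks with a period tuned for 4 banks, and all 64 banks with a period of 64. A swizzle parameter chosen for one memory level is therefore not automatically the right one for the next.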
template <uint32_t BlockHeight, uint32_t BlockWidth, typename DataT, typename DataLayout>
struct MappingUtil {
    // Index of the current thread within its wave
    static inline uint32_t laneId();
    // Local wave coordinate relative to the workgroup, i.e. <waveRows, waveCols> in the example above, for warp-level programming with the warp sync API
    static inline WaveCoordT waveCoord();
    // Global block (grid) coordinate of the current wave
    static inline BlockCoordT blockCoord();
    // Matrix coordinate of the current wave
    static inline MatrixCoordT matrixCoord();
};
Moreover, the warp-level partition depends on the instruction used.
For example, the partition for an m8n8.x4 instruction (four 8x8 matrix fragments) must differ from the partition for an m16n16.x4 instruction.
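As a rough illustration of why the partition changes with the instruction shape, the sketch below counts how many fragment elements each lane would hold if the elements were spread evenly over the wave. The even split is an assumption made for the example; real instructions define their own per-lane layouts. `elementsPerLane` is a hypothetical helper.

```cpp
#include <cstdint>

// Elements held by each lane when `count` fragments of shape m x n are
// spread evenly across a wave of warpSize lanes (assumed even split).
constexpr uint32_t elementsPerLane(uint32_t m, uint32_t n,
                                   uint32_t count, uint32_t warpSize)
{
    return (m * n * count) / warpSize;
}
```

For a 32-lane warp this gives 8 elements per lane for m8n8.x4 and 32 for m16n16.x4; for a 64-lane wave the figures become 4 and 16. Both the instruction shape and the warp size therefore feed into the partition.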