[Direct Metal] Insert scaler CB in reduce kernels #1153
base: main
Conversation
Fixes #781.
When lowering a reduce op, the compiler will add an additional CB that represents the scaler CB needed by the LLK. The scaler CB is filled with a constant according to the type of reduce operation. The idea is to leave preallocated space at the top of L1 memory for miscellaneous usage in kernels; the scaler CB should be placed inside this region. Also, TTIR ops should not have any notion of the scaler CB; it should appear only in the TTKernel dialect, as an additional argument when lowering reduce operations. The scaler CB is populated at kernel runtime, and that should be done in the NOC thread, since that thread can read zeros in a burst.
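To make the last point concrete, here is a rough sketch of what the NOC-thread side could look like, loosely following the scaler-generation pattern used by existing tt-metal reduce reader kernels. The CB index, scaler value, and bf16 tile size are illustrative assumptions, not the exact code generated by this PR.

#include "dataflow_api.h"

void kernel_main() {
    constexpr uint32_t cb_scaler = 2;        // illustrative CB index for the scaler CB
    constexpr uint32_t scaler = 0x3f803f80;  // e.g. two packed bf16 1.0 values (reduce sum)
    constexpr uint32_t tile_bytes = 2048;    // one 32x32 bf16 tile

    cb_reserve_back(cb_scaler, 1);
    uint32_t write_addr = get_write_ptr(cb_scaler);

    // Burst-read the zero region into the CB instead of clearing it word by word.
    uint64_t zeros_noc_addr = get_noc_addr(MEM_ZEROS_BASE);
    for (uint32_t offset = 0; offset < tile_bytes; offset += MEM_ZEROS_SIZE) {
        noc_async_read(zeros_noc_addr, write_addr + offset, MEM_ZEROS_SIZE);
    }
    noc_async_read_barrier();

    // Write the scaler constant into the first row of each face of the tile.
    volatile tt_l1_ptr uint32_t *ptr =
        reinterpret_cast<volatile tt_l1_ptr uint32_t *>(write_addr);
    for (uint32_t face = 0; face < 4; ++face) {
        for (uint32_t col = 0; col < 8; ++col) {
            ptr[face * 128 + col] = scaler;
        }
    }
    cb_push_back(cb_scaler, 1);
}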
// using shadow buffer. This needs to be investigated further.
return CircularBufferConfig(totalSize,
                            {{cbRef->desc()->port(), dataFormat}})
// *buffers.at(CQExecutor::kReservedBufferId))
A CB can be allocated in two ways:
- Without a buffer: Metal will allocate it at an arbitrary address.
- With a (shadow) buffer: we can control the address and lifetime of the buffer.
I observed incorrect results from the math op when using the second option, which is the one we prefer because it lets us control allocations. I have not found the cause of the issue; maybe I programmed the shadow buffer incorrectly in createReservedL1Buffer.
By incorrect results I mean weird numbers and weird patterns: for example, doing a reduce over W, only the first 16 rows show a non-zero result in the leftmost column, while the rest of the rows are all zeros. It may have something to do with tilizing, as I've seen similar weird patterns before when I tested the tilize feature.
@nsmithtt If you have a hunch about this, please share.
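For reference, a minimal sketch of the two variants, using the identifiers from the snippet above (the constructor taking a shadow buffer is the tt-metal globally-allocated-CB overload; treat the exact signatures as an assumption):

// 1. No backing buffer: Metal picks the CB address itself.
auto cfg = CircularBufferConfig(totalSize, {{cbRef->desc()->port(), dataFormat}});

// 2. Shadow (globally allocated) buffer: the CB aliases an existing L1 buffer,
//    so we control its address and lifetime. This is the variant that currently
//    produces the incorrect math results and is commented out above.
auto cfgShadow = CircularBufferConfig(totalSize, {{cbRef->desc()->port(), dataFormat}},
                                      *buffers.at(CQExecutor::kReservedBufferId));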
Yeah this sounds most likely to be a tiling issue.
Just to be clear, I do not guarantee numerical correctness as of now (even without this tiling issue), as I have not tested against golden.
Did a first pass; I will try to review the rest of TTIRToTTMetal.cpp tonight or tomorrow.
   auto cbType = mlir::cast<ttkernel::CBType>(arg.getType());
   auto cbDesc = cache.getOrCreate(cbType, cbTypeToFlatbuffer);
-  auto tensorRef = operands[arg.getArgNumber()];
+  auto tensorRef = cbType.getIsInternal() ? 0 : operands[argNumber++];
Why do we do this? Is this for the workaround you mentioned below where we were seeing weird data mismatch tiling issues?
It's not a workaround for the mismatch. The scaler CB does not have a tensor operand reference; it's internal to the kernel itself. So we use this attribute to skip creating a tensor ref in the flatbuffer here.
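Roughly, the serialization loop is expected to behave like this sketch (identifiers follow the diff above; the loop framing itself is illustrative):

size_t argNumber = 0;
for (mlir::BlockArgument arg : blockArguments) {
  auto cbType = mlir::cast<ttkernel::CBType>(arg.getType());
  auto cbDesc = cache.getOrCreate(cbType, cbTypeToFlatbuffer);
  // Internal CBs (like the scaler CB) carry no tensor operand, so they get a
  // null tensor ref and do not consume an entry from `operands`.
  auto tensorRef = cbType.getIsInternal() ? 0 : operands[argNumber++];
  // ... emit the CB entry into the flatbuffer using cbDesc and tensorRef ...
}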
  createReservedL1Buffer();
}

void CQExecutor::createReservedL1Buffer() {
I'm actually not sure any runtime changes are needed. Shouldn't we get the address/cb info from the compiler just like all the others?
As I said regarding the is_internal attribute, there is no tensor ref coming from the compiler that represents this buffer.
I decided not to explicitly create a tensor in the graph to represent the scaler CB when lowering to a generic op. This is because it gets tricky to create layouts and empty ops before the TTIRLayoutPass, which is normally done after the generic-op lowering phase. Plus, logically it seems better not to pollute the TTIR graph with this kind of kernel intrinsic.
Therefore, I put this code in the runtime to allocate a small buffer at the top of L1.
Btw, I noticed that Metal has a notion of BufferType::L1_SMALL that seems suitable for this kind of thing, right? Is it only used for Metal-internal allocations?
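For illustration, a sketch of what the runtime-side allocation could look like, assuming the tt-metal host API (InterleavedBufferConfig / CreateBuffer) and that buffers maps ids to shared Buffer handles; the size is a placeholder:

void CQExecutor::createReservedL1Buffer() {
  // Small scratch region for kernel-internal CBs such as the scaler CB.
  constexpr uint32_t kReservedSize = 4096;
  tt::tt_metal::InterleavedBufferConfig config{
      .device = device,
      .size = kReservedSize,
      .page_size = kReservedSize,
      .buffer_type = tt::tt_metal::BufferType::L1,
  };
  // Relies on the L1 allocator placing this early allocation near the top of
  // L1 (assumption); address control is the open question discussed above.
  buffers[kReservedBufferId] = tt::tt_metal::CreateBuffer(config);
}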
// Prepare zero region read.
auto zerosBase = builder.create<ttkernel::MacroOp>(
    loc, builder.getI32Type(), "MEM_ZEROS_BASE");
I think it might be better to pretend this is a real function, memZerosBase, and not have a concept of a MacroOp at all.
I'm not sure what you mean exactly. A function emits a call like memZerosBase(), while we need only the macro name MEM_ZEROS_BASE. Or do you suggest wrapping this in a func.func?
I just mean from the dialect perspective, we should have a MemZerosBaseOp, which happens to be emitted in emitc as all caps.
I can do that, but I'm wondering what we gain from it. Isn't it better to have a single MacroOp that can be used for any C++ macro we have in kernels?
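To make the two options concrete (MemZerosBaseOp is hypothetical here; only the generic MacroOp exists in this change):

// Generic escape hatch: one op that can reference any C++ macro used in kernels.
auto zerosBase = builder.create<ttkernel::MacroOp>(
    loc, builder.getI32Type(), "MEM_ZEROS_BASE");

// Dedicated op per intrinsic, emitted as the all-caps macro during the emitc
// conversion.
auto zerosBaseAlt =
    builder.create<ttkernel::MemZerosBaseOp>(loc, builder.getI32Type());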
@@ -1344,6 +1370,268 @@ class TTIRToTTMetalDispatchRewriter : public OpRewritePattern<ttir::GenericOp> {
  return dm;
}

static bool needScaler(ttir::GenericOp op) {
  Block &block = op.getRegion().front();
  Operation &firstOp = block.getOperations().front();
Technically we probably need to walk the block to search for a reduce, but this is OK for now.
Agreed, this was a temporary solution anyway.
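A walk-based variant of needScaler sketched from the suggestion above; isReduceOp is a hypothetical predicate for whatever reduce ops can appear inside the generic region:

static bool needScaler(ttir::GenericOp op) {
  // Search the whole region instead of only inspecting the first op.
  return op.getRegion()
      .walk([](mlir::Operation *innerOp) {
        return isReduceOp(innerOp) ? mlir::WalkResult::interrupt()
                                   : mlir::WalkResult::advance();
      })
      .wasInterrupted();
}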
// TODO(radenko) move TTIRToTTMetalLayoutRewriter::createDataMovementThreads
// (& other data movement logic) into common place
int dmThreadIdx = 1;
I know this came from old code, but why does this start at 1 and then increment?
It's just a counter for creating new regions inside the DispatchOp; we start from 1 because the Tensix (compute) region is already created. It should probably just start from numRegions() and increment.
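A one-line sketch of that suggestion, assuming the dispatch op exposes getNumRegions():

// Start counting after the regions already present (the compute region),
// instead of hard-coding 1.
int dmThreadIdx = op->getNumRegions();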