[Direct Metal] Insert scaler CB in reduce kernels #1153

Open · wants to merge 3 commits into main
Conversation

rpavlovicTT (Contributor) commented:
fixes #781

When lowering a reduce op, the compiler will add an additional CB that represents the scaler CB needed by the LLK. The scaler CB is filled with a constant chosen according to the type of reduce operation.

The idea is to leave preallocated space at the top of L1 memory for miscellaneous usage in kernels, and to place the scaler CB inside this region. Also, TTIR ops should not have any notion of the scaler CB; it should appear only in the TTKernel dialect, as an additional argument introduced when lowering reduce operations.

The scaler CB is populated at kernel runtime, and that should be done in the NOC thread, since that thread can read zeros in a burst.
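Roughly, the NOC-thread fill could look like the following tt-metal-style sketch (the helper name fill_scaler_cb, the CB argument, and the exact datum layout are illustrative assumptions, not this PR's code; MEM_ZEROS_BASE/MEM_ZEROS_SIZE come from the firmware memory map):

// Sketch only: zero a one-tile scaler CB by burst-reading the firmware
// zeros region, then seed the scaler constant. Assumes a 32x32 tile of
// 16-bit datums (4 faces, 128 uint32 words per face).
void fill_scaler_cb(uint32_t cb_scaler, uint32_t packed_scaler) {
  cb_reserve_back(cb_scaler, 1);
  uint32_t write_addr = get_write_ptr(cb_scaler);

  uint64_t zeros_noc_addr = get_noc_addr(MEM_ZEROS_BASE);
  uint32_t bytes_left = get_tile_size(cb_scaler);
  uint32_t dst = write_addr;
  while (bytes_left > 0) {
    uint32_t burst = bytes_left < MEM_ZEROS_SIZE ? bytes_left : MEM_ZEROS_SIZE;
    noc_async_read(zeros_noc_addr, dst, burst); // burst-read zeros
    dst += burst;
    bytes_left -= burst;
  }
  noc_async_read_barrier();

  // Seed the scaler value; which datums must be non-zero depends on the
  // reduce LLK (assumption here: the first row of each face).
  volatile tt_l1_ptr uint32_t *ptr =
      reinterpret_cast<volatile tt_l1_ptr uint32_t *>(write_addr);
  for (uint32_t face = 0; face < 4; ++face) {
    for (uint32_t i = 0; i < 8; ++i) {
      ptr[face * 128 + i] = packed_scaler; // two 16-bit datums per word
    }
  }
  cb_push_back(cb_scaler, 1);
}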

// using shadow buffer. This needs to be investigated further.
return CircularBufferConfig(totalSize,
                            {{cbRef->desc()->port(), dataFormat}})
// *buffers.at(CQExecutor::kReservedBufferId))
rpavlovicTT (Author):
A CB can be allocated in two ways:

  1. No buffer: metal will allocate it at an arbitrary address.
  2. With a (shadow) buffer: we can control the address and lifetime of the buffer.

I observed incorrect results from the math op when using the second option, which is the one we prefer because it lets us control allocations. I did not find the cause of the issue; maybe I programmed the shadow buffer incorrectly in createReservedL1Buffer. The two construction paths are sketched below.
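For reference, a sketch of both paths, reusing identifiers from the snippet quoted above (tt-metal's CircularBufferConfig has an overload taking a shadow Buffer):

// Option 1: no buffer -- metal allocates the CB at an arbitrary address.
CircularBufferConfig cfgAuto(totalSize,
                             {{cbRef->desc()->port(), dataFormat}});

// Option 2: shadow buffer -- the CB is pinned to the buffer's address, so
// we control address and lifetime (the variant that gave bad results).
CircularBufferConfig cfgShadow(totalSize,
                               {{cbRef->desc()->port(), dataFormat}},
                               *buffers.at(CQExecutor::kReservedBufferId));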

By incorrect results I mean weird numbers and weird patterns. For example, doing a reduce along W, only the first 16 rows show a non-zero result in the leftmost column, while the rest of the rows are all zeros. It may have something to do with tilizing, as I've seen similar patterns before when I tested the tilize feature.

@nsmithtt If you have a hunch about this, please share.

nsmithtt (Contributor):
Yeah this sounds most likely to be a tiling issue.

rpavlovicTT (Author):
Just to be clear, I do not guarantee numerical correctness as of now (even aside from this tiling issue), since I did not test against golden.

nsmithtt (Contributor) left a review:
Did a first pass, I will try to review the rest of TTIRToTTMetal.cpp tonight / tomorrow.

  auto cbType = mlir::cast<ttkernel::CBType>(arg.getType());
  auto cbDesc = cache.getOrCreate(cbType, cbTypeToFlatbuffer);
- auto tensorRef = operands[arg.getArgNumber()];
+ auto tensorRef = cbType.getIsInternal() ? 0 : operands[argNumber++];
nsmithtt (Contributor):
Why do we do this? Is this for the workaround you mentioned below where we were seeing weird data mismatch tiling issues?

rpavlovicTT (Author):
It's not a workaround for the mismatch. The scaler CB does not have a tensor operand reference; it's internal to the kernel itself. So we use this attribute to skip creating a tensor ref in the flatbuffer here, roughly as sketched below.
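In other words, the serialization loop becomes roughly (a paraphrase of the surrounding code based on the diff above):

// Internal CBs (like the scaler CB) have no tensor operand, so they get a
// null tensor ref and do not consume an operand slot.
size_t argNumber = 0;
for (auto arg : block.getArguments()) {
  auto cbType = mlir::cast<ttkernel::CBType>(arg.getType());
  auto cbDesc = cache.getOrCreate(cbType, cbTypeToFlatbuffer);
  auto tensorRef = cbType.getIsInternal() ? 0 : operands[argNumber++];
  // ... serialize (cbDesc, tensorRef) into the flatbuffer ...
}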

  createReservedL1Buffer();
}

void CQExecutor::createReservedL1Buffer() {
nsmithtt (Contributor):
I'm actually not sure any runtime changes are needed. Shouldn't we get the address/cb info from the compiler just like all the others?

rpavlovicTT (Author):
As I said regarding the is_internal attribute, there is no tensor ref coming from the compiler that represents this buffer.

I decided not to explicitly create a tensor in the graph to represent the scaler CB when lowering to a generic op. This is because it gets tricky creating layouts and empty ops before TTIRLayoutPass, which is normally done after the generic-op lowering phase. Plus, logically it seems better not to pollute the TTIR graph with this kind of kernel intrinsic.

Therefore, I put this code in the runtime to allocate a small buffer at the top of L1, along the lines of the sketch below.
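Roughly like this (a sketch, not the PR's actual code; the size and the use of CreateBuffer with an InterleavedBufferConfig are assumptions, and pinning the buffer to the very top of L1 may need an explicit address instead):

void CQExecutor::createReservedL1Buffer() {
  constexpr uint32_t kReservedSize = 4096; // assumption: one tile + slack
  tt::tt_metal::InterleavedBufferConfig config{
      .device = device,
      .size = kReservedSize,
      .page_size = kReservedSize,
      .buffer_type = tt::tt_metal::BufferType::L1,
  };
  // Remember the buffer under a reserved id so CB configs can shadow it.
  buffers[kReservedBufferId] = tt::tt_metal::CreateBuffer(config);
}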

Btw, I noticed metal has a notion of BufferType::L1_SMALL suitable for this kind of thing, right? Is it only used for metal-internal allocations?


// Prepare zero region read.
auto zerosBase = builder.create<ttkernel::MacroOp>(
    loc, builder.getI32Type(), "MEM_ZEROS_BASE");
nsmithtt (Contributor):
I think it might be better to pretend this is a real function memZerosBase and not have a concept of a MacroOp at all.

rpavlovicTT (Author):
I'm not sure what you mean exactly. A function emits a call like memZerosBase(), while we need only the macro name MEM_ZEROS_BASE. Or do you suggest wrapping this in a func.func?

nsmithtt (Contributor):
I just mean from the dialect perspective, we should have MemZerosBaseOp which happens to be emitted in emitc as all caps.
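E.g., with a dedicated op (MemZerosBaseOp is a hypothetical name), the rewriter would read:

// A dedicated op instead of a string-carrying MacroOp; the TTKernel-to-EmitC
// lowering would print it as the macro MEM_ZEROS_BASE.
auto zerosBase =
    builder.create<ttkernel::MemZerosBaseOp>(loc, builder.getI32Type());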

rpavlovicTT (Author):
I can do that, but I'm wondering what we gain from it. Isn't it better to have a single MacroOp that can be used for any C++ macro we have in kernels?

@@ -1344,6 +1370,268 @@ class TTIRToTTMetalDispatchRewriter : public OpRewritePattern<ttir::GenericOp> {
  return dm;
}

static bool needScaler(ttir::GenericOp op) {
  Block &block = op.getRegion().front();
  Operation &firstOp = block.getOperations().front();
nsmithtt (Contributor):
Technically we probably need to walk the block to search for a reduce, but this is OK for now.
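Something along these lines (a sketch; ttir::ReduceOp stands in for whichever op actually marks a reduction):

static bool needScaler(ttir::GenericOp op) {
  // Walk the whole region instead of inspecting only the first op.
  auto walkResult = op.getRegion().walk([](Operation *nested) {
    return mlir::isa<ttir::ReduceOp>(nested) ? mlir::WalkResult::interrupt()
                                             : mlir::WalkResult::advance();
  });
  return walkResult.wasInterrupted();
}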

rpavlovicTT (Author):
Agreed, this was a temporary solution anyway.


// TODO(radenko) move TTIRToTTMetalLayoutRewriter::createDataMovementThreads
// (& other data movement logic) into common place
int dmThreadIdx = 1;
nsmithtt (Contributor):
I know this came from old code, but why does this start at 1 and then increment?

rpavlovicTT (Author):
It's just a counter for creating new regions inside the DispatchOp. We start from 1 because the Tensix (compute) region is already created. It should probably just start from numRegions() and increase, as in the sketch below.
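I.e., a sketch of that change (assuming op is the DispatchOp handle):

// Start data-movement region indices after the regions that already exist
// (the compute region), instead of hard-coding 1.
int dmThreadIdx = op->getNumRegions();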

Successfully merging this pull request may close these issues:

- Load scaler CB to reduce op (#781)