[Direct Metal] Insert scaler CB in reduce kernels #1153

Open · wants to merge 3 commits into main
Conversation

rpavlovicTT (Contributor) commented:
fixes #781

When lowering a reduce op, the compiler will add an additional CB that represents the scaler CB needed by the LLK. The scaler CB is filled with a constant chosen according to the type of reduce operation.

The idea is to leave preallocated space at the top of L1 memory for miscellaneous usage in kernels, and to place the scaler CB inside this region. Also, TTIR ops should not have any notion of the scaler CB; it should appear only in the TTKernel dialect, as an additional argument introduced when lowering reduce operations.

The scaler CB is populated at kernel runtime, and that should be done in the NOC thread, since that thread can read zeros in a burst.
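Roughly, the NOC-thread fill could look like the following tt-metal-style sketch (the helper name fill_scaler_cb, the CB argument, and the exact datum layout are illustrative assumptions, not this PR's code; MEM_ZEROS_BASE/MEM_ZEROS_SIZE come from the firmware memory map):

// Sketch only: zero a one-tile scaler CB by burst-reading the firmware
// zeros region, then seed the scaler constant. Assumes a 32x32 tile of
// 16-bit datums (4 faces, 128 uint32 words per face).
void fill_scaler_cb(uint32_t cb_scaler, uint32_t packed_scaler) {
  cb_reserve_back(cb_scaler, 1);
  uint32_t write_addr = get_write_ptr(cb_scaler);

  uint64_t zeros_noc_addr = get_noc_addr(MEM_ZEROS_BASE);
  uint32_t bytes_left = get_tile_size(cb_scaler);
  uint32_t dst = write_addr;
  while (bytes_left > 0) {
    uint32_t burst = bytes_left < MEM_ZEROS_SIZE ? bytes_left : MEM_ZEROS_SIZE;
    noc_async_read(zeros_noc_addr, dst, burst); // burst-read zeros
    dst += burst;
    bytes_left -= burst;
  }
  noc_async_read_barrier();

  // Seed the scaler value; which datums must be non-zero depends on the
  // reduce LLK (assumption here: the first row of each face).
  volatile tt_l1_ptr uint32_t *ptr =
      reinterpret_cast<volatile tt_l1_ptr uint32_t *>(write_addr);
  for (uint32_t face = 0; face < 4; ++face) {
    for (uint32_t i = 0; i < 8; ++i) {
      ptr[face * 128 + i] = packed_scaler; // two 16-bit datums per word
    }
  }
  cb_push_back(cb_scaler, 1);
}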

// using shadow buffer. This needs to be investigated further.
return CircularBufferConfig(totalSize,
                            {{cbRef->desc()->port(), dataFormat}})
// *buffers.at(CQExecutor::kReservedBufferId))
rpavlovicTT (Author):
A CB can be allocated in two ways:

  1. No buffer: metal will allocate it at an arbitrary address.
  2. With a (shadow) buffer: we can control the address and lifetime of the buffer.

I observed incorrect results from the math op when using the second option, which is the one we prefer because it lets us control allocations. I did not find the cause of the issue; maybe I programmed the shadow buffer incorrectly in createReservedL1Buffer. The two construction paths are sketched below.
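For reference, a sketch of both paths, reusing identifiers from the snippet quoted above (tt-metal's CircularBufferConfig has an overload taking a shadow Buffer):

// Option 1: no buffer -- metal allocates the CB at an arbitrary address.
CircularBufferConfig cfgAuto(totalSize,
                             {{cbRef->desc()->port(), dataFormat}});

// Option 2: shadow buffer -- the CB is pinned to the buffer's address, so
// we control address and lifetime (the variant that gave bad results).
CircularBufferConfig cfgShadow(totalSize,
                               {{cbRef->desc()->port(), dataFormat}},
                               *buffers.at(CQExecutor::kReservedBufferId));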

By incorrect results I mean weird numbers and weird patterns. For example, doing a reduce along W, only the first 16 rows show a non-zero result in the leftmost column, while the rest of the rows are all zeros. It may have something to do with tilizing, as I've seen similar patterns before when I tested the tilize feature.

@nsmithtt If you have a hunch about this, please share.

nsmithtt (Contributor):
Yeah this sounds most likely to be a tiling issue.

rpavlovicTT (Author):
Just to be clear, I do not guarantee numerical correctness as of now (even aside from this tiling issue), since I did not test against golden.

nsmithtt (Contributor) left a review:
Did a first pass, I will try to review the rest of TTIRToTTMetal.cpp tonight / tomorrow.

  auto cbType = mlir::cast<ttkernel::CBType>(arg.getType());
  auto cbDesc = cache.getOrCreate(cbType, cbTypeToFlatbuffer);
- auto tensorRef = operands[arg.getArgNumber()];
+ auto tensorRef = cbType.getIsInternal() ? 0 : operands[argNumber++];
nsmithtt (Contributor):
Why do we do this? Is this for the workaround you mentioned below where we were seeing weird data mismatch tiling issues?

rpavlovicTT (Author):
It's not a workaround for the mismatch. The scaler CB does not have a tensor operand reference; it's internal to the kernel itself. So we use this attribute to skip creating a tensor ref in the flatbuffer here, roughly as sketched below.
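In other words, the serialization loop becomes roughly (a paraphrase of the surrounding code based on the diff above):

// Internal CBs (like the scaler CB) have no tensor operand, so they get a
// null tensor ref and do not consume an operand slot.
size_t argNumber = 0;
for (auto arg : block.getArguments()) {
  auto cbType = mlir::cast<ttkernel::CBType>(arg.getType());
  auto cbDesc = cache.getOrCreate(cbType, cbTypeToFlatbuffer);
  auto tensorRef = cbType.getIsInternal() ? 0 : operands[argNumber++];
  // ... serialize (cbDesc, tensorRef) into the flatbuffer ...
}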

  createReservedL1Buffer();
}

void CQExecutor::createReservedL1Buffer() {
nsmithtt (Contributor):
I'm actually not sure any runtime changes are needed. Shouldn't we get the address/cb info from the compiler just like all the others?

rpavlovicTT (Author):
As I said regarding the is_internal attribute, there is no tensor ref coming from the compiler that represents this buffer.

I decided not to explicitly create a tensor in the graph to represent the scaler CB when lowering to a generic op. This is because it gets tricky creating layouts and empty ops before TTIRLayoutPass, which is normally done after the generic-op lowering phase. Plus, logically it seems better not to pollute the TTIR graph with this kind of kernel intrinsic.

Therefore, I put this code in the runtime to allocate a small buffer at the top of L1, along the lines of the sketch below.
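Roughly like this (a sketch, not the PR's actual code; the size and the use of CreateBuffer with an InterleavedBufferConfig are assumptions, and pinning the buffer to the very top of L1 may need an explicit address instead):

void CQExecutor::createReservedL1Buffer() {
  constexpr uint32_t kReservedSize = 4096; // assumption: one tile + slack
  tt::tt_metal::InterleavedBufferConfig config{
      .device = device,
      .size = kReservedSize,
      .page_size = kReservedSize,
      .buffer_type = tt::tt_metal::BufferType::L1,
  };
  // Remember the buffer under a reserved id so CB configs can shadow it.
  buffers[kReservedBufferId] = tt::tt_metal::CreateBuffer(config);
}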

Btw, I noticed metal has a notion of BufferType::L1_SMALL suitable for this kind of thing, right? Is it only used for metal-internal allocations?


// Prepare zero region read.
auto zerosBase = builder.create<ttkernel::MacroOp>(
    loc, builder.getI32Type(), "MEM_ZEROS_BASE");
nsmithtt (Contributor):
I think it might be better to pretend this is a real function memZerosBase and not have a concept of a MacroOp at all.

rpavlovicTT (Author):
I'm not sure what you mean exactly. A function emits a call like memZerosBase(), while we need only the macro name MEM_ZEROS_BASE. Or do you suggest wrapping this in a func.func?

nsmithtt (Contributor):
I just mean from the dialect perspective, we should have MemZerosBaseOp which happens to be emitted in emitc as all caps.
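E.g., with a dedicated op (MemZerosBaseOp is a hypothetical name), the rewriter would read:

// A dedicated op instead of a string-carrying MacroOp; the TTKernel-to-EmitC
// lowering would print it as the macro MEM_ZEROS_BASE.
auto zerosBase =
    builder.create<ttkernel::MemZerosBaseOp>(loc, builder.getI32Type());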

rpavlovicTT (Author):
I can do that, but I'm wondering what we gain from it. Isn't it better to have a single MacroOp that can be used for any C++ macro we have in kernels?

@@ -1344,6 +1370,268 @@ class TTIRToTTMetalDispatchRewriter : public OpRewritePattern<ttir::GenericOp> {
  return dm;
}

static bool needScaler(ttir::GenericOp op) {
  Block &block = op.getRegion().front();
  Operation &firstOp = block.getOperations().front();
nsmithtt (Contributor):
Technically we probably need to walk the block to search for a reduce, but this is OK for now.
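Something along these lines (a sketch; ttir::ReduceOp stands in for whichever op actually marks a reduction):

static bool needScaler(ttir::GenericOp op) {
  // Walk the whole region instead of inspecting only the first op.
  auto walkResult = op.getRegion().walk([](Operation *nested) {
    return mlir::isa<ttir::ReduceOp>(nested) ? mlir::WalkResult::interrupt()
                                             : mlir::WalkResult::advance();
  });
  return walkResult.wasInterrupted();
}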

rpavlovicTT (Author):
Agreed, this was a temporary solution anyway.


// TODO(radenko) move TTIRToTTMetalLayoutRewriter::createDataMovementThreads
// (& other data movement logic) into common place
int dmThreadIdx = 1;
nsmithtt (Contributor):
I know this came from old code, but why does this start at 1 and then increment?

rpavlovicTT (Author):
It's just a counter for creating new regions inside the DispatchOp. We start from 1 because the Tensix (compute) region is already created. It should probably just start from numRegions() and increase, as in the sketch below.
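I.e., a sketch of that change (assuming op is the DispatchOp handle):

// Start data-movement region indices after the regions that already exist
// (the compute region), instead of hard-coding 1.
int dmThreadIdx = op->getNumRegions();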

Successfully merging this pull request may close these issues:

- Load scaler CB to reduce op (#781)