Core coordinates prototype #266

pjanevskiTT · 2024-11-05T13:21:14Z

This PR is not going to be merged; it is meant for discussion only. Many things do not build here, because plugging everything into the current UMD design would require a lot of effort. Once we settle on a particular design, I will implement that version in a separate PR. I tried to keep the diff for this branch as clean as possible so that the two core coordinate prototypes are easy to understand, and I left as many comments as I could to help with that.

This PR implements two core coordinate prototypes. We expect to use one of them (V1 or V2).

As discussed, we are going to have a CoordSystem enum, which identifies the coordinate system of the coordinates used in the API.

enum class CoordSystem {
    LOGICAL,
    PHYSICAL,
    VIRTUAL,
    TRANSLATED,
};

For now, we support these 4 coordinate systems. There is a pending PR that cleans up the docs and explains each coordinate system, but understanding them should not be a prerequisite for reviewing this PR. The only thing to keep in mind is that each core coordinate can be expressed in multiple coordinate systems.

Note for reviewers

Since this PR uses the CoordinateManager class, it might be useful to skim PR #202 to better understand how that class is used. In general, it is mostly used to support translation between coordinate systems.

The most useful parts of the PR to review are (in my opinion):

1. For tt-metal people - test_core_coordinate.cpp file - these tests should represent how tt-metal (and other clients) should use coordinate types
2. For tt-umd people - everything else :)
3. Not in 1 or 2 - then probably same as tt-metal people, since this would be API change for all other clients

V1 coordinates

For the V1 prototype, we have an additional enum representing the core type

enum class CoreType {
  ARC,
  DRAM,
  ETH,
  PCIE,
  TENSIX,
  ROUTER_ONLY,
};

which is used in the main struct

struct CoreCoord_V1 : public tt_xy_pair {
    CoreType core_type;
    CoordSystem coord_system;
};

When using this, both the core type and the coordinate system must be set for the API to work properly.
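To make the "set both fields" requirement concrete, here is a minimal sketch. The `tt_xy_pair` below is a reduced stand-in for the real UMD type, and `make_core_coord_v1` is a hypothetical helper that is not part of the PR; it only illustrates one way to avoid leaving either field unset.

```cpp
#include <cstddef>

// Stand-in for UMD's tt_xy_pair, reduced to the two fields used here.
struct tt_xy_pair {
    std::size_t x = 0;
    std::size_t y = 0;
};

enum class CoordSystem { LOGICAL, PHYSICAL, VIRTUAL, TRANSLATED };
enum class CoreType { ARC, DRAM, ETH, PCIE, TENSIX, ROUTER_ONLY };

struct CoreCoord_V1 : public tt_xy_pair {
    CoreType core_type;
    CoordSystem coord_system;
};

// Hypothetical helper (not in the PR): takes every field as a parameter so
// that a caller cannot forget the core type or the coordinate system.
inline CoreCoord_V1 make_core_coord_v1(std::size_t x, std::size_t y,
                                       CoreType type, CoordSystem system) {
    CoreCoord_V1 coord{};
    coord.x = x;
    coord.y = y;
    coord.core_type = type;
    coord.coord_system = system;
    return coord;
}
```

A caller would then write something like `make_core_coord_v1(1, 2, CoreType::TENSIX, CoordSystem::LOGICAL)` and pass the result to the V1 read/write API.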

V2 coordinates

For the V2 prototype, we would have one struct for each core type that the V1 enum represents. In this PR, structs are implemented only for Tensix and DRAM cores, just to see the effect of having multiple core types.

struct CoreCoord_V2 : public tt_xy_pair {
    CoordSystem coord_system;
};

struct TensixCoreCoord_V2 : public CoreCoord_V2 {};
 
struct DramCoreCoord_V2 : public CoreCoord_V2 {};

When using this, only the coordinate system needs to be set; the core type comes from the struct type.
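As a rough illustration of "core type comes from struct type", the sketch below lets overload resolution pick the code path. `tt_xy_pair` is again a reduced stand-in, the default member initializer on `coord_system` is an addition for this example, and `describe` is a hypothetical function standing in for the per-type API overloads.

```cpp
#include <cstddef>
#include <string>

// Stand-in for UMD's tt_xy_pair, reduced to the two fields used here.
struct tt_xy_pair {
    std::size_t x = 0;
    std::size_t y = 0;
};

enum class CoordSystem { LOGICAL, PHYSICAL, VIRTUAL, TRANSLATED };

struct CoreCoord_V2 : public tt_xy_pair {
    // Default value is an assumption made for this sketch only.
    CoordSystem coord_system = CoordSystem::LOGICAL;
};

struct TensixCoreCoord_V2 : public CoreCoord_V2 {};
struct DramCoreCoord_V2 : public CoreCoord_V2 {};

// Hypothetical overload set: the compiler selects the function from the
// static type of the argument, so no runtime core-type check is needed.
inline std::string describe(const TensixCoreCoord_V2&) { return "tensix"; }
inline std::string describe(const DramCoreCoord_V2&) { return "dram"; }
```

The trade-off listed under the V2 cons is visible here: every new core type needs its own overload of each API function.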

API changes

The API is provided inside the tt_SiliconDevice class for both the V1 and V2 core coordinates. The focus was on providing an API for reads/writes as an example. I think looking at it in the PR is the best way to explain it.

The main point here is that the API is the same for all coordinate systems, which was one of the points that tt-metal folks made in the meeting. UMD can figure out how to properly program the TLBs and write to the device for any coordinate system. In this PR we rely on translating everything to physical (NOC) coordinates.
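The "translate everything to physical" idea can be sketched as follows. This is not the CoordinateManager from the PR: `ToyTranslator`, its lookup tables, and the plain `CoreCoord` struct are all assumptions for illustration, and only the LOGICAL and PHYSICAL systems are handled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-in for UMD's tt_xy_pair, reduced to the two fields used here.
struct tt_xy_pair {
    std::size_t x = 0;
    std::size_t y = 0;
};

enum class CoordSystem { LOGICAL, PHYSICAL, VIRTUAL, TRANSLATED };

struct CoreCoord : public tt_xy_pair {
    CoordSystem coord_system = CoordSystem::LOGICAL;
};

// Hypothetical translator: maps logical column/row indices to physical
// NOC columns/rows through two lookup tables supplied by the caller.
class ToyTranslator {
public:
    ToyTranslator(std::vector<std::size_t> physical_x, std::vector<std::size_t> physical_y)
        : physical_x_(std::move(physical_x)), physical_y_(std::move(physical_y)) {}

    CoreCoord to_physical(const CoreCoord& coord) const {
        switch (coord.coord_system) {
            case CoordSystem::PHYSICAL:
                return coord;  // already physical, nothing to translate
            case CoordSystem::LOGICAL: {
                CoreCoord result;
                // at() throws std::out_of_range for an index outside the table.
                result.x = physical_x_.at(coord.x);
                result.y = physical_y_.at(coord.y);
                result.coord_system = CoordSystem::PHYSICAL;
                return result;
            }
            default:
                throw std::invalid_argument("coordinate system not handled in this sketch");
        }
    }

private:
    std::vector<std::size_t> physical_x_;
    std::vector<std::size_t> physical_y_;
};
```

With a translator like this, a single read/write entry point can accept a coordinate in any supported system, call `to_physical` first, and only then program the TLB.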

Pros/cons

All pros/cons are my view, feel free to comment on everything

V1

Pros

coordinate translation logic is in one place
fewer API functions

Cons

one extra step to figure out the core type on each API call
readability
core type can only be determined at runtime

V2

Pros

readability
core type is clear from the type

Cons

the API blows up: functions need to be provided for each core type

Questions for reviewers

Do we want to be able to get worker/DRAM cores in all coordinate systems from the SoC descriptor?

TODOs

Think about CoreCoordRange: is that representation simply a set of CoreCoord (either V1 or V2)?

broskoTT · 2024-11-05T15:25:29Z

device/tt_silicon_driver.cpp

+// v2 functions
+
+// tensix core coord
+void tt_SiliconDevice::write_to_device(const void *mem_ptr, uint32_t size_in_bytes, chip_id_t chip, TensixCoreCoord_V2 core_coord, uint64_t addr) {


In any case we will probably end up with write_to_device functions per coreType.
I do like how V1 will hide this though, probably by having a switch case statement pointing to those functions.

Yes, we can discuss this in more detail later on, but one function can do the job as well. If we translate the coordinates to physical, we can just program the TLB and do the write, and that will work too.

broskoTT · 2024-11-05T15:26:29Z

device/coordinate_manager.h

@@ -45,6 +46,17 @@ class CoordinateManager {

    CoordinateManager(CoordinateManager& other) = default;


Since this is mostly translating CoreCoord_V1, and not providing additional functionality, maybe rename it to CoreCoordTranslator, or something similar

broskoTT · 2024-11-05T15:27:29Z

device/tt_soc_descriptor.h

@@ -20,6 +20,7 @@
 #include "device/tt_arch_types.h"

 #include "device/coordinate_manager.h"
+#include "device/tt_core_coordinates.h"


We will probably have this discussion with metal folks, but my guess is they'd ask us to enrich the SoCDescriptor class with getters which return vectors or maps of CoreCoords

I'm also okay with breaking coordinate translation/management out of the soc descriptor, with the soc descriptor being a simple representation of the soc desc yaml. On the other hand, we would want to know what type of coordinates the soc desc is using.

Thoughts @tt-asaigal?

I think users should at least know what coordinate system a SOC descriptor uses, and what core type each endpoint corresponds to. Today in TT-Metal, it's restricted to Physical, but in the future we may need to introduce translated-coordinate-based SOC descriptors. Tracking coordinate systems in different layers of the stack may lead to repeated code and confusion.
Having said that, the direct user of UMD APIs in Metal is tt_cluster, which knows the SOC descriptor variant that gets loaded. This PR already introduces APIs that allow switching between coordinate systems. I think just having APIs from the coordinate manager should be good enough. We likely won't need to heavily augment the tt_SocDescriptor class.

pjanevskiTT · 2024-11-05T17:32:54Z

@pgkeller @eyonland @TT-billteng - couldn't tag you as reviewers for some reason

pgkeller · 2024-11-05T17:56:21Z

device/tt_core_coordinates.h

+ * CoreType is an enum class that represents all types of cores
+ * present on the Tenstorrent chip.
+ */
+enum class CoreType {


need to be sure this is uint8 (1 byte)
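One way to guarantee the one-byte size is to fix the underlying type of the scoped enum and check it at compile time. The enumerator list below is copied from the PR description; the `static_assert` lines are an addition for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Fixing the underlying type makes the enum exactly one byte wide.
enum class CoreType : std::uint8_t {
    ARC,
    DRAM,
    ETH,
    PCIE,
    TENSIX,
    ROUTER_ONLY,
};

// Compile-time guards: the build fails if someone changes the underlying type.
static_assert(sizeof(CoreType) == 1, "CoreType must occupy exactly one byte");
static_assert(std::is_same<std::underlying_type<CoreType>::type, std::uint8_t>::value,
              "CoreType must be backed by uint8_t");
```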

pgkeller · 2024-11-05T17:57:00Z

device/tt_core_coordinates.h

+enum class CoreType {
+  ARC,
+  DRAM,
+  ETH,


I think we want to expose the concept of both active and idle eth from the bottom of the stack

When you say active/idle, do you mean that, for example, on a basic N300 card that is not in a cluster, we have 2 active cores (between the 2 chips) and 14 idle cores whose links are not connected? If so, I agree, we should represent this in UMD

I'm also wondering whether Active/Idle explains 1. whether the ETH core is even connected to an ETH peripheral, or 2. whether the connected ETH core is used for workload or not

yes, active is "has links" and inactive is "no links". active is running the "base firmware" to handle host read/write tunneling, "inactive" does not. we run workloads on both types, but different types of workloads.

Cluster::get_active_ethernet_cores in Metal has logic for deducing which core is active/idle based on information parsed out of the cluster desc yaml.

For BH this is slightly different because the BH chips don't have coordinates, details on distinguishing the two are in branch abhullar/bh-active-eth

pgkeller · 2024-11-05T17:58:51Z

device/tt_core_coordinates.h

+  ROUTER_ONLY,
+};
+
+ struct CoreCoord_V1 : public tt_xy_pair {


I prefer V1. I was thinking we could use derived types to construct the base type: TensixCoreCoord() sets core_type to Tensix, then keep the CoreType internal only

If I understand correctly, TensixCoreCoord would be a tt-metal abstraction? There would still be just CoreCoord_V1 in umd?

yes. is an API consideration for metal
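The suggestion above could look roughly like the sketch below, where derived types exist only to construct the V1 base type with the right `core_type`. The constructors, the `DramCoreCoord` type, and the reduced `tt_xy_pair` are assumptions for this example; per the thread, such wrappers would live on the tt-metal side rather than in UMD.

```cpp
#include <cstddef>

// Stand-in for UMD's tt_xy_pair, reduced to the two fields used here.
struct tt_xy_pair {
    std::size_t x;
    std::size_t y;
    tt_xy_pair(std::size_t x_, std::size_t y_) : x(x_), y(y_) {}
};

enum class CoordSystem { LOGICAL, PHYSICAL, VIRTUAL, TRANSLATED };
enum class CoreType { ARC, DRAM, ETH, PCIE, TENSIX, ROUTER_ONLY };

struct CoreCoord_V1 : public tt_xy_pair {
    CoreType core_type;
    CoordSystem coord_system;
    CoreCoord_V1(std::size_t x_, std::size_t y_, CoreType type, CoordSystem system)
        : tt_xy_pair(x_, y_), core_type(type), coord_system(system) {}
};

// Hypothetical client-side wrappers: callers never spell out CoreType,
// the wrapper fills it in when constructing the base.
struct TensixCoreCoord : public CoreCoord_V1 {
    TensixCoreCoord(std::size_t x_, std::size_t y_, CoordSystem system)
        : CoreCoord_V1(x_, y_, CoreType::TENSIX, system) {}
};

struct DramCoreCoord : public CoreCoord_V1 {
    DramCoreCoord(std::size_t x_, std::size_t y_, CoordSystem system)
        : CoreCoord_V1(x_, y_, CoreType::DRAM, system) {}
};
```

Because both wrappers derive from `CoreCoord_V1`, any API that takes a `const CoreCoord_V1&` accepts them unchanged.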

abhullar-tt · 2024-11-07T16:40:22Z

device/coordinate_manager.h

+    virtual CoreCoord_V1 to_physical(const CoreCoord_V1 core_coords);
+
+    // v2 functions
+    // We need as many functions as there are core types for each coordinate system.


these could be templated to avoid many functions
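A minimal sketch of the templated approach, assuming the V2 struct layout from the PR description. The free function `to_physical` and its "add one on each axis" rule are placeholders invented for this example; the real translation would come from CoordinateManager.

```cpp
#include <cstddef>
#include <stdexcept>
#include <type_traits>

// Stand-in for UMD's tt_xy_pair, reduced to the two fields used here.
struct tt_xy_pair {
    std::size_t x = 0;
    std::size_t y = 0;
};

enum class CoordSystem { LOGICAL, PHYSICAL, VIRTUAL, TRANSLATED };

struct CoreCoord_V2 : public tt_xy_pair {
    CoordSystem coord_system = CoordSystem::LOGICAL;
};
struct TensixCoreCoord_V2 : public CoreCoord_V2 {};
struct DramCoreCoord_V2 : public CoreCoord_V2 {};

// One template replaces a to_physical overload per core type and returns
// the same coordinate type it was given.
template <typename CoordT>
CoordT to_physical(const CoordT& coord) {
    static_assert(std::is_base_of<CoreCoord_V2, CoordT>::value,
                  "CoordT must derive from CoreCoord_V2");
    CoordT result = coord;
    switch (coord.coord_system) {
        case CoordSystem::PHYSICAL:
            break;  // already physical
        case CoordSystem::LOGICAL:
            // Placeholder rule, assumed for this sketch only.
            result.x = coord.x + 1;
            result.y = coord.y + 1;
            result.coord_system = CoordSystem::PHYSICAL;
            break;
        default:
            throw std::invalid_argument("coordinate system not handled in this sketch");
    }
    return result;
}
```

If a core type needs different translation logic, it could still get its own specialization or overload while the others share the generic template.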

abhullar-tt · 2024-11-07T16:42:21Z

device/tt_core_coordinates.h

+enum class CoreType {
+  ARC,
+  DRAM,
+  ETH,


Cluster::get_active_ethernet_cores in Metal has logic for deducing which core is active/idle based on information parsed out of the cluster desc yaml.

For BH this is slightly different because the BH chips don't have coordinates, details on distinguishing the two are in branch abhullar/bh-active-eth

abhullar-tt · 2024-11-07T16:43:18Z

device/tt_soc_descriptor.cpp

@@ -126,7 +126,7 @@ void tt_SocDescriptor::load_core_descriptors_from_device_descriptor(YAML::Node &
    for (const auto &core_string : worker_cores) {
        CoreDescriptor core_descriptor;
        core_descriptor.coord = format_node(core_string);
-        core_descriptor.type = CoreType::WORKER;
+        core_descriptor.type = CoreType::TENSIX;


thanks for the rename :)

abhullar-tt · 2024-11-07T16:45:16Z

device/tt_soc_descriptor.h

@@ -20,6 +20,7 @@
 #include "device/tt_arch_types.h"

 #include "device/coordinate_manager.h"
+#include "device/tt_core_coordinates.h"


I'm also okay with breaking coordinate translation/management out of the soc descriptor, with the soc descriptor being a simple representation of the soc desc yaml. On the other hand, we would want to know what type of coordinates the soc desc is using.

Thoughts @tt-asaigal?

broskoTT · 2024-11-07T17:47:26Z

@pjanevskiTT feel free to just delete the V2 coords from this PR and continue working on this one, rather than creating a new PR. This would nicely keep all the conversation threads at the same place...

pjanevskiTT self-assigned this Nov 5, 2024

Core coordinates prototype

b160428

pjanevskiTT force-pushed the pjanevski/core_coordinates branch from 9a910ec to b160428 Compare November 5, 2024 13:22

pjanevskiTT requested a review from broskoTT November 5, 2024 13:28

broskoTT reviewed Nov 5, 2024

pjanevskiTT requested review from patrickroberts, yan-zaretskiy, abhullar-tt, tt-asaigal, joelsmithTT and vtangTT November 5, 2024 17:32

pgkeller reviewed Nov 5, 2024

abhullar-tt reviewed Nov 7, 2024
