-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core coordinates prototype #266
base: main
Are you sure you want to change the base?
Conversation
9a910ec
to
b160428
Compare
// v2 functions | ||
|
||
// tensix core coord | ||
void tt_SiliconDevice::write_to_device(const void *mem_ptr, uint32_t size_in_bytes, chip_id_t chip, TensixCoreCoord_V2 core_coord, uint64_t addr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case we will probably end up with write_to_device functions per coreType.
I do like how V1 will hide this though, probably by having a switch case statement pointing to those functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can discuss about it in more detail later on, but one function can do the job as well. If we translate coordinates to physical, we can just program the TLB and do the write, it will work as well
@@ -45,6 +46,17 @@ class CoordinateManager { | |||
|
|||
CoordinateManager(CoordinateManager& other) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is mostly translating CoreCoord_V1, and not providing additional functionality, maybe rename to CoreCoordTranslator, or something alike
@@ -20,6 +20,7 @@ | |||
#include "device/tt_arch_types.h" | |||
|
|||
#include "device/coordinate_manager.h" | |||
#include "device/tt_core_coordinates.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will probably have this discussion with metal folks, but my guess is they'd request from us to enrich SoCDescriptor class with getters which return vectors or maps of CoreCoords
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im also okay with breaking coordinate translation/management out of soc descriptor and soc descriptor being simple representation of the soc desc yaml. On the other hand, we would want to know what type of coordinates the soc desc is using.
Thoughts @tt-asaigal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think users should at least know what coordinate system a SOC descriptor uses, and what core type each endpoint corresponds to. Today in TT-Metal, its restricted to Physical, but in future, we may need to introduce translated coordinate based SOC descriptors. Tracking coordinate systems in different layers of the stack may lead to repeated code and confusion.
Having said that, the direct user of UMD APIs in Metal is tt_cluster
, which knows the SOC descriptor variant that gets loaded. This PR already introduces APIs that allow switching between coordinate systems. I think just having APIs from the coordinate manager should be good enough. We likely won't need to heavily augment the tt_SocDescriptor
class.
@pgkeller @eyonland @TT-billteng - - couldn't tag you in reviewers for some reason |
* CoreType is an enum class that represents all types of cores | ||
* present on the Tenstorrent chip. | ||
*/ | ||
enum class CoreType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to be sure this is uint8 (1 byte)
enum class CoreType { | ||
ARC, | ||
DRAM, | ||
ETH, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to expose the concept of both active and idle eth from the bottom of the stack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say active/idle, is that for example that on a basic N300 cards, that are not in cluster, we have 2 active cores (between 2 chips) and 14 idle cores whose link is not connected? If so, I agree, we should represent this from UMD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering whether Active/Idle explains 1. whether the ETH core is even connected to an ETH peripheral, or 2. whether the connected ETH core is used for workload or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, active is "has links" and inactive is "no links". active is running the "base firmware" to handle host read/write tunneling, "inactive" does not. we run workloads on both types, but different types of workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cluster::get_active_ethernet_cores
in Metal has logic for deducing which core is active/idle based on information parsed out of the cluster desc yaml.
For BH this is slightly different because the BH chips don't have coordinates, details on distinguishing the two are in branch abhullar/bh-active-eth
ROUTER_ONLY, | ||
}; | ||
|
||
struct CoreCoord_V1 : public tt_xy_pair { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer V1. I was thinking we could use derived types to construct the base type: TensixCoreCoord() sets core_type to Tensix, then keep the CoreType internal only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand, TensixCoreCoord
would me tt-metal abstraction? There would still be just CoreCoord_V1
in umd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. is an API consideration for metal
virtual CoreCoord_V1 to_physical(const CoreCoord_V1 core_coords); | ||
|
||
// v2 functions | ||
// We need as many functions as there are core types for each coordinate system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these could be templated to avoid many functions
enum class CoreType { | ||
ARC, | ||
DRAM, | ||
ETH, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cluster::get_active_ethernet_cores
in Metal has logic for deducing which core is active/idle based on information parsed out of the cluster desc yaml.
For BH this is slightly different because the BH chips don't have coordinates, details on distinguishing the two are in branch abhullar/bh-active-eth
@@ -126,7 +126,7 @@ void tt_SocDescriptor::load_core_descriptors_from_device_descriptor(YAML::Node & | |||
for (const auto &core_string : worker_cores) { | |||
CoreDescriptor core_descriptor; | |||
core_descriptor.coord = format_node(core_string); | |||
core_descriptor.type = CoreType::WORKER; | |||
core_descriptor.type = CoreType::TENSIX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the rename :)
@@ -20,6 +20,7 @@ | |||
#include "device/tt_arch_types.h" | |||
|
|||
#include "device/coordinate_manager.h" | |||
#include "device/tt_core_coordinates.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im also okay with breaking coordinate translation/management out of soc descriptor and soc descriptor being simple representation of the soc desc yaml. On the other hand, we would want to know what type of coordinates the soc desc is using.
Thoughts @tt-asaigal?
@pjanevskiTT feel free to just delete the V2 coords from this PR and continue working on this one, rather than creating a new PR. This would nicely keep all the conversation threads at the same place... |
This PR is not going to be merged. It should be used just for disucssion. A lot of things are not building here since it would require a lot of effort to plugin everything within current UMD design. When we settle on particular design I will implement that version in a separate PR. I tried to make diff for this branch as clean as possible to understand two prototypes of core coordinates correctly. I tried to leave as much comments for better understanding
Implement two core coordinates prototypes. We expect to use one of these two prototypes (V1 or V2)
As discussed, we are going to have
CoordSystem
enum, which is going to be used to determine the appropriate coordiante system of the coordinates that are used in the APIFor now, we support these 4 coordinate systems. For explanations on each coordinate system there is a pending PR to cleanup the docs properly, but understanding these coordinate systems should not be prerequisite to review this PR. It is just needed to understand that each core coordinate can be in multiple coordinate systems.
Note for reviews
Since we have
CoordinateManager
class here, in order to better understand the use of that class it might be useful to skim the PR #202. In general it is mostly used to support the translation between coordinate systems.The most useful parts of the PR to review are (in my opinion)
V1 coordinates
For V1 prototype, we have additional enum representing core type
which is used in the main struct
When using this, it is needed to set both the core type and coordinate system, in order for API to work properly.
V2 coordinates
For V2 prototype, we would have one struct for each core type that is represented using enum in V1. In PR, structs are implemented for Tensix and DRAM cores, just to see the effects of having multiple core types
When using this it is needed to set just the coordinate system, core type comes from struct type.
API changes
API is provided inside tt_SiliconDevice class for both the V1 and V2 core coordinates. Focus was on providing API for reads/writes as the example. I think looking this from PR is the best way to explain it.
Main point to make here is that API is the same for all coordinate systems, which was one of the points that tt-metal folks made in the meeting. UMD can figure out how to properly program the TLBs and write to device for any coordinate system. In the PR we rely on translating everything to physical (NOC) coordinates
Pros/cons
All pros/cons are my view, feel free to comment on everything
V1
Pros
Cons
V2
Pros
Cons
Questions for reviewers
TODOs