Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Updating Training Operator documentation and fixing fragmented links (#3834)

* Adding architecture diagram for KFTO
  Signed-off-by: Francisco Javier Arceo <[email protected]>
* merging changes
  Signed-off-by: Francisco Javier Arceo <[email protected]>
* testing updating the fine tuning back to how things were
  Signed-off-by: Francisco Javier Arceo <[email protected]>
* incorporating changes for getting started
  Signed-off-by: Francisco Javier Arceo <[email protected]>
* Reverting back changes from commit aa65085
  Signed-off-by: Francisco Javier Arceo <[email protected]>

---------

Signed-off-by: Francisco Javier Arceo <[email protected]>
1 parent d3ca1b1, commit 07e5d81
Showing 20 changed files with 188 additions and 405 deletions.
2 changes: 1 addition & 1 deletion
...nt/en/docs/components/training/images/ml-lifecycle-training-operator.drawio.svg
Binary file modified: -151 KB (72%)
content/en/docs/components/training/images/training-operator-overview.drawio.png
4 changes: 4 additions & 0 deletions
...en/docs/components/training/images/training-operator-v1-architecture.drawio.svg
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
+++ | ||
title = "Reference" | ||
-description = "Reference docs for Training Operator" | ||
+description = "Reference docs for the Training Operator" | ||
weight = 50 | ||
+++ |
40 changes: 40 additions & 0 deletions
content/en/docs/components/training/reference/architecture.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
+++ | ||
title = "Architecture" | ||
description = "The Training Operator Architecture" | ||
weight = 10 | ||
+++ | ||
|
||
{{% stable-status %}} | ||
|
||
## What is the Training Operator Architecture? | ||
|
||
The original design was drafted in April 2021 and is [available here for reference](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/). | ||
The goal was to provide a unified Kubernetes operator that supports multiple | ||
machine learning/deep learning frameworks. This was done by having a "Frontend" | ||
operator that decomposes the job into different configurable Kubernetes | ||
components (e.g., Role, PodTemplate, and Fault-Tolerance), | ||
watches all Role Custom Resources, and manages pod performance. | ||
The dedicated "Backend" operator was never implemented; its responsibilities | ||
were instead consolidated into the "Frontend" operator. | ||
|
||
The benefits of this approach were: | ||
1. Shared testing and release infrastructure | ||
2. Production-grade features such as manifests and metadata support | ||
3. Simpler Kubeflow releases | ||
4. A Single Source of Truth (SSOT) for other Kubeflow components to interact with | ||
|
||
The V1 Training Operator architecture is shown in the diagram below: | ||
|
||
<img src="/docs/components/training/images/training-operator-v1-architecture.drawio.svg" | ||
alt="Training Operator V1 Architecture" | ||
class="mt-3 mb-3"> | ||
|
||
The diagram displays PyTorchJob and its configured communication methods, but | ||
each framework can have its own approach(es) to communicating across pods. | ||
Additionally, each framework can have its own set of configurable resources. | ||
|
||
As a concrete example, PyTorch has several | ||
[Communication Backends](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) | ||
available; see the [source code documentation for the full list](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group). |
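
To make the point about PyTorch communication backends more concrete, here is a minimal sketch (not part of the commit) of how a training script might check which `torch.distributed` backends are usable and pick one. The helper names `detect_available_backends` and `choose_backend` are hypothetical and introduced only for illustration. The preference order assumes the common rule of thumb of NCCL for GPU training and Gloo for CPU training. The detection step returns an empty set if PyTorch is not installed or was built without distributed support.

```python
from typing import Iterable, Set


def detect_available_backends() -> Set[str]:
    """Return the torch.distributed backends usable in this environment.

    Returns an empty set when PyTorch is missing or was built without
    distributed support.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return set()
    if not dist.is_available():
        return set()
    checks = {
        "nccl": dist.is_nccl_available,
        "gloo": dist.is_gloo_available,
        "mpi": dist.is_mpi_available,
    }
    return {name for name, is_present in checks.items() if is_present()}


def choose_backend(use_gpu: bool, available: Iterable[str]) -> str:
    """Hypothetical helper: pick a backend name from those available.

    Assumption: prefer NCCL when training on GPUs, otherwise Gloo,
    with MPI as a last resort.
    """
    available = set(available)
    preference = ["nccl", "gloo", "mpi"] if use_gpu else ["gloo", "mpi"]
    for name in preference:
        if name in available:
            return name
    raise ValueError(f"no suitable backend among {sorted(available)}")


if __name__ == "__main__":
    found = detect_available_backends()
    print("available backends:", sorted(found))
    if found:
        print("chosen for CPU training:", choose_backend(False, found))
```

The chosen name would then typically be passed as the `backend` argument of `torch.distributed.init_process_group`, with the rendezvous details supplied by whatever launches the pods.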