Updating Training Operator documentation and fixing fragmented links (#3834)

* Adding architecture diagram for KFTO

Signed-off-by: Francisco Javier Arceo <[email protected]>

* merging changes

Signed-off-by: Francisco Javier Arceo <[email protected]>

* testing updating the fine tuning back to how things were

Signed-off-by: Francisco Javier Arceo <[email protected]>

* incorporating changes for getting started

Signed-off-by: Francisco Javier Arceo <[email protected]>

* Reverting back changes from commit aa65085

Signed-off-by: Francisco Javier Arceo <[email protected]>

---------

Signed-off-by: Francisco Javier Arceo <[email protected]>
franciscojavierarceo authored Aug 27, 2024
1 parent d3ca1b1 commit 07e5d81
Showing 20 changed files with 188 additions and 405 deletions.
@@ -6,7 +6,7 @@ weight = 90

Katib offers a few installation options to install control plane. This page describes the options
and the features available with each option. Check
[the installation guide](/docs/components/katib/installation/#installing-control-plane) to
[the installation guide](/docs/components/katib/installation/#katib-control-plane-components) to
understand the Katib control plane components.

## The Default Katib Standalone Installation
26 changes: 13 additions & 13 deletions content/en/docs/components/training/explanation/fine-tuning.md
@@ -1,17 +1,17 @@
+++
title = "LLM Fine-Tuning with Training Operator"
description = "Why Training Operator needs fine-tuning API"
title = "LLM Fine-Tuning with the Training Operator"
description = "Why the Training Operator needs the fine-tuning API"
weight = 10
+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and Kubeflow community is looking for your feedback. Please
share your experience using [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
share your experience using the [#kubeflow-training Slack channel](/docs/about/community/#kubeflow-slack-channels)
or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

This page explains how [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
fits into Kubeflow ecosystem.
This page explains how the [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
fits into the Kubeflow ecosystem.

In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
the ability to fine-tune pre-trained models represents a significant leap towards achieving custom
@@ -22,23 +22,23 @@ to particular applications. Whether you're working in natural language processin
image classification, or another ML domain, fine-tuning can drastically improve performance and
applicability of pre-existing models to new datasets and problems.

## Why Training Operator Fine-Tune API Matter ?
## Why does the Training Operator's Fine-Tuning API Matter?

Training Operator Python SDK introduction of Fine-Tune API is a game-changer for ML practitioners
operating within the Kubernetes ecosystem. Historically, Training Operator has streamlined the
The introduction of the Fine-Tuning API in the Training Operator is a game-changer for ML practitioners
operating within the Kubernetes ecosystem. Historically, the Training Operator has streamlined the
orchestration of ML workloads on Kubernetes, making distributed training more accessible. However,
fine-tuning tasks often require extensive manual intervention, including the configuration of
training environments and the distribution of data across nodes. The Fine-Tune API aim to simplify
training environments and the distribution of data across nodes. The Fine-Tuning API aims to simplify
this process, offering an easy-to-use Python interface that abstracts away the complexity involved
in setting up and executing fine-tuning tasks on distributed systems.

## The Rationale Behind Kubeflow's Fine-Tune API
## The Rationale Behind Kubeflow's Fine-Tuning API

Implementing Fine-Tune API within Training Operator is a logical step in enhancing the platform's
Implementing the Fine-Tuning API within the Training Operator is a logical step in enhancing the platform's
capabilities. By providing this API, Training Operator not only simplifies the user experience for
ML practitioners but also leverages its existing infrastructure for distributed training.
This approach aligns with Kubeflow's mission to democratize distributed ML training, making it more
accessible and less cumbersome for users. The API facilitate a seamless transition from model
accessible and less cumbersome for users. The API facilitates a seamless transition from model
development to deployment, supporting the fine-tuning of LLMs on custom datasets without the need
for extensive manual setup or specialized knowledge of Kubernetes internals.
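
For readers who want to see what this looks like in practice, below is a minimal sketch of a fine-tuning call through the SDK's `train` API. It is illustrative only: the parameter classes (`HuggingFaceModelParams`, `HuggingFaceDatasetParams`, `HuggingFaceTrainerParams`), their fields, and the resource values are assumptions based on the alpha SDK and may differ between releases.

```python
import transformers
from kubeflow.training import TrainingClient
# These imports are assumptions based on the alpha SDK layout.
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

TrainingClient().train(
    name="fine-tune-bert",  # illustrative job name
    # Pre-trained model to fine-tune, pulled from the HuggingFace Hub.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Dataset to fine-tune on; only a slice of the split is used here.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",
    ),
    # Standard HuggingFace TrainingArguments are passed through unchanged.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
        ),
    ),
    # Distribution settings: the Training Operator handles the orchestration.
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"cpu": 4, "memory": "16G"},
)
```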

23 changes: 11 additions & 12 deletions content/en/docs/components/training/getting-started.md
@@ -1,33 +1,33 @@
+++
title = "Getting Started"
description = "Get started with Training Operator"
description = "Get started with the Training Operator"
weight = 30
+++

This guide describes how to get started with Training Operator and run a few simple examples.
This guide describes how to get started with the Training Operator and run a few simple examples.

## Prerequisites

You need to install the following components to run examples:

- Training Operator control plane [installed](/docs/components/training/installation/#installing-control-plane).
- Training Python SDK [installed](/docs/components/training/installation/#installing-python-sdk).
- The Training Operator control plane [installed](/docs/components/training/installation/#installing-the-control-plane).
- The Training Python SDK [installed](/docs/components/training/installation/#installing-the-python-sdk).

## Getting Started with PyTorchJob

You can create your first Training Operator distributed PyTorchJob using Python SDK. Define the
You can create your first Training Operator distributed PyTorchJob using the Python SDK. Define the
training function that implements end-to-end model training. Each Worker will execute this
function on the appropriate Kubernetes Pod. Usually, this function contains logic to
download the dataset, create the model, and train it.

Training Operator will automatically set `WORLD_SIZE` and `RANK` for the appropriate PyTorchJob
The Training Operator will automatically set `WORLD_SIZE` and `RANK` for the appropriate PyTorchJob
worker to perform [PyTorch Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

If you install Training Operator as part of Kubeflow Platform, you can open a new
If you install the Training Operator as part of the Kubeflow Platform, you can open a new
[Kubeflow Notebook](/docs/components/notebooks/quickstart-guide/) to run this script. If you
install Training Operator standalone, make sure that you
install the Training Operator standalone, make sure that you
[configure local `kubeconfig`](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#programmatic-access-to-the-api)
to access your Kubernetes cluster where you installed Training Operator.
to access your Kubernetes cluster where you installed the Training Operator.

```python
def train_func():
@@ -115,7 +115,7 @@ TrainingClient().create_job(

## Getting Started with TFJob

Similar to PyTorchJob example, you can use the Python SDK to create your first distributed
Similar to the PyTorchJob example, you can use the Python SDK to create your first distributed
TensorFlow job. Run the following script to create a TFJob with the pre-created Docker image
`docker.io/kubeflow/tf-mnist-with-summaries:latest`, which contains
[distributed TensorFlow code](https://github.com/kubeflow/training-operator/tree/e6b4300f9dfebb5c2a3269641c828add367688ee/examples/tensorflow/mnist_with_summaries):
@@ -140,9 +140,8 @@ TrainingClient().get_job_logs(
follow=True,
)
```
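
Since most of the PyTorchJob and TFJob example code above is collapsed in this diff view, here is a hedged, self-contained sketch of the PyTorchJob flow described earlier. The job name, resource values, and `create_job` arguments shown are illustrative assumptions and may not match the exact signature in your installed SDK release.

```python
from kubeflow.training import TrainingClient


def train_func():
    import torch.distributed as dist

    # The Training Operator injects WORLD_SIZE and RANK into each worker Pod,
    # so the process group can be created from environment variables.
    dist.init_process_group(backend="gloo")
    print(f"rank={dist.get_rank()} world_size={dist.get_world_size()}")

    # ... build the model, wrap it in DistributedDataParallel, and run the
    # training loop here ...

    dist.destroy_process_group()


client = TrainingClient()

# Package train_func into a PyTorchJob with two workers.
client.create_job(
    name="pytorch-ddp-example",  # illustrative name
    train_func=train_func,
    num_workers=2,
    resources_per_worker={"cpu": "2", "memory": "4Gi"},
)

# Stream the logs of the job while it runs.
client.get_job_logs(name="pytorch-ddp-example", follow=True)
```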

## Next steps

- Run [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) with using Training Operator Python SDK.
- Run the [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) using the Training Operator Python SDK.

- Learn more about [the PyTorchJob APIs](/docs/components/training/user-guides/pytorch/).
[3 changed binary image files (diagrams) cannot be displayed in the diff view]
32 changes: 16 additions & 16 deletions content/en/docs/components/training/installation.md
@@ -1,34 +1,34 @@
+++
title = "Installation"
description = "How to install Training Operator"
description = "How to install the Training Operator"
weight = 20
+++

This guide describes how to install Training Operator on your Kubernetes cluster.
Training Operator is a lightweight Kubernetes controller that orchestrates appropriate Kubernetes
workloads to perform distributed ML training and fine-tuning.
This guide describes how to install the Training Operator on your Kubernetes cluster.
The Training Operator is a lightweight Kubernetes controller that orchestrates the
appropriate Kubernetes workloads to perform distributed ML training and fine-tuning.

## Prerequisites

These are minimal requirements to install Training Operator:
These are the minimal requirements to install the Training Operator:

- Kubernetes >= 1.27
- `kubectl` >= 1.27
- Python >= 3.7

## Installing Training Operator
## Installing the Training Operator

You need to install Training Operator control plane and Python SDK to create training jobs.
You need to install the Training Operator control plane and Python SDK to create training jobs.

### Installing Control Plane
### Installing the Control Plane

You can skip these steps if you have already
[installed Kubeflow platform](https://www.kubeflow.org/docs/started/installing-kubeflow/)
using manifests or package distributions. Kubeflow platform includes Training Operator.
using manifests or package distributions. The Kubeflow platform includes the Training Operator.

You can install Training Operator as a standalone component.
You can install the Training Operator as a standalone component.

Run the following command to install the stable release of Training Operator control plane: `v1.7.0`
Run the following command to install the stable release of the Training Operator control plane: `v1.7.0`

```shell
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.7.0"
@@ -62,12 +62,12 @@ tfjobs.kubeflow.org 2023-06-09T00:31:04Z
xgboostjobs.kubeflow.org 2023-06-09T00:31:04Z
```
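
If you prefer to verify the installation from Python rather than `kubectl`, a short sketch using the official Kubernetes Python client is shown below. This snippet is not part of the docs page itself; it simply lists the installed `kubeflow.org` CRDs and assumes a local kubeconfig pointing at the cluster where the control plane was installed.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()

# List CustomResourceDefinitions and keep only the kubeflow.org ones.
api = client.ApiextensionsV1Api()
for crd in api.list_custom_resource_definition().items:
    if crd.metadata.name.endswith("kubeflow.org"):
        print(crd.metadata.name, crd.metadata.creation_timestamp)
```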

### Installing Python SDK
### Installing the Python SDK

Training Operator [implements Python SDK](https://pypi.org/project/kubeflow-training/)
The Training Operator [implements a Python SDK](https://pypi.org/project/kubeflow-training/)
to simplify creation of distributed training and fine-tuning jobs for Data Scientists.

Run the following command to install the latest stable release of Training SDK:
Run the following command to install the latest stable release of the Training SDK:

```shell
pip install -U kubeflow-training
@@ -85,9 +85,9 @@ Otherwise, you can also install the Training SDK using the specific GitHub commi
pip install git+https://github.com/kubeflow/training-operator.git@7345e33b333ba5084127efe027774dd7bed8f6e6#subdirectory=sdk/python
```

#### Install Python SDK with Fine-Tuning Capabilities
#### Install the Python SDK with Fine-Tuning Capabilities

If you want to use `train` API for LLM fine-tuning with Training Operator, install the Python SDK
If you want to use the `train` API for LLM fine-tuning with the Training Operator, install the Python SDK
with the additional packages from HuggingFace:

```shell
48 changes: 24 additions & 24 deletions content/en/docs/components/training/overview.md
@@ -1,65 +1,65 @@
+++
title = "Overview"
description = "An overview of Training Operator"
description = "An overview of the Training Operator"
weight = 10
+++

{{% stable-status %}}

## What is Training Operator ?
## What is the Training Operator?

Training Operator is a Kubernetes-native project for fine-tuning and scalable
distributed training of machine learning (ML) models created with various ML frameworks such as
The Training Operator is a Kubernetes-native project for fine-tuning and scalable
distributed training of machine learning (ML) models created with different ML frameworks such as
PyTorch, TensorFlow, XGBoost, and others.

User can integrate other ML libraries such as [HuggingFace](https://huggingface.co),
You can integrate other ML libraries such as [HuggingFace](https://huggingface.co),
[DeepSpeed](https://github.com/microsoft/DeepSpeed), or [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
with Training Operator to orchestrate their ML training on Kubernetes.
with the Training Operator to orchestrate your ML training on Kubernetes.

Training Operator allows you to use Kubernetes workloads to effectively train your large models
via Kubernetes Custom Resources APIs or using Training Operator Python SDK.
The Training Operator allows you to use Kubernetes workloads to effectively train your large models
via Kubernetes Custom Resources APIs or using the Training Operator Python SDK.

Training Operator implements centralized Kubernetes controller to orchestrate distributed training jobs.
The Training Operator implements a centralized Kubernetes controller to orchestrate distributed training jobs.

Users can run High-performance computing (HPC) tasks with Training Operator and MPIJob since it
You can run high-performance computing (HPC) tasks with the Training Operator and MPIJob since it
supports running Message Passing Interface (MPI) on Kubernetes which is heavily used for HPC.
Training Operator implements V1 API version of MPI Operator. For MPI Operator V2 version,
The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version,
please follow [this guide](/docs/components/training/user-guides/mpi/) to install MPI Operator V2.

<img src="/docs/components/training/images/training-operator-overview.drawio.png"
alt="Training Operator Overview"
class="mt-3 mb-3">

Training Operator is responsible for scheduling the appropriate Kubernetes workloads to implement
The Training Operator is responsible for scheduling the appropriate Kubernetes workloads to implement
various distributed training strategies for different ML frameworks.

## Why Training Operator ?
## Why use the Training Operator?

Training Operator addresses Model Training and Model Fine-Tuning step in AI/ML lifecycle as shown on
that diagram:
The Training Operator addresses the Model Training and Model Fine-Tuning steps in the AI/ML
lifecycle as shown in the diagram below:

<img src="/docs/components/training/images/ml-lifecycle-training-operator.drawio.svg"
alt="AI/ML Lifecycle Training Operator"
class="mt-3 mb-3">

- **Training Operator simplifies ability to run distributed training and fine-tuning.**
- **The Training Operator simplifies the ability to run distributed training and fine-tuning.**

Users can easily scale their model training from single machine to large-scale distributed
You can easily scale your model training from a single machine to a large-scale distributed
Kubernetes cluster using APIs and interfaces provided by Training Operator.

- **Training Operator is extensible and portable.**
- **The Training Operator is extensible and portable.**

Users can deploy Training Operator on any cloud where you have Kubernetes cluster and users can
You can deploy the Training Operator on any cloud where you have a Kubernetes cluster, and you can
integrate your own ML frameworks written in any programming language with the Training Operator.

- **Training Operator is integrated with Kubernetes ecosystem.**
- **The Training Operator is integrated with the Kubernetes ecosystem.**

Users can leverage Kubernetes advanced scheduling techniques such as Kueue, Volcano, and YuniKorn
with Training Operator to optimize cost savings for ML training resources.
You can leverage Kubernetes advanced scheduling techniques such as Kueue, Volcano, and YuniKorn
with the Training Operator to optimize cost savings for your ML training resources.

## Custom Resources for ML Frameworks

To perform distributed training Training Operator implements the following
To perform distributed training, the Training Operator implements the following
[Custom Resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
for each ML framework:

@@ -73,6 +73,6 @@ for each ML framework:

## Next steps

- Follow [the installation guide](/docs/components/training/installation/) to deploy Training Operator.
- Follow [the installation guide](/docs/components/training/installation/) to deploy the Training Operator.

- Run examples from [getting started guide](/docs/components/training/getting-started/).
2 changes: 1 addition & 1 deletion content/en/docs/components/training/reference/_index.md
@@ -1,5 +1,5 @@
+++
title = "Reference"
description = "Reference docs for Training Operator"
description = "Reference docs for the Training Operator"
weight = 50
+++
40 changes: 40 additions & 0 deletions content/en/docs/components/training/reference/architecture.md
@@ -0,0 +1,40 @@
+++
title = "Architecture"
description = "The Training Operator Architecture"
weight = 10
+++

{{% stable-status %}}

## What is the Training Operator Architecture?

The original design was drafted in April 2021 and is [available here for reference](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/).
The goal was to provide a unified Kubernetes operator that supports multiple
machine learning/deep learning frameworks. This was done by having a "Frontend"
operator that decomposes the job into different configurable Kubernetes
components (e.g., Role, PodTemplate, Fault-Tolerance, etc.),
watches all Role Custom Resources, and manages pod performance.
The dedicated "Backend" operator was not implemented and was instead
consolidated into the "Frontend" operator.

The benefits of this approach were:
1. Shared testing and release infrastructure
2. Unlocked production-grade features like manifests and metadata support
3. Simpler Kubeflow releases
4. A Single Source of Truth (SSOT) for other Kubeflow components to interact with

The V1 Training Operator architecture can be seen in the diagram below:

<img src="/docs/components/training/images/training-operator-v1-architecture.drawio.svg"
alt="Training Operator V1 Architecture"
class="mt-3 mb-3">

The diagram displays PyTorchJob and its configured communication methods, but it
is worth mentioning that each framework can have its own approach(es) to
communicating across pods. Additionally, each framework can have its own set of
configurable resources.

As a concrete example, PyTorch has several
[Communication Backends](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)
available; see the linked documentation for the full list.
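
As an illustration of how a worker selects one of those backends, here is a minimal sketch. It assumes the standard environment variables used by PyTorch's `env://` initialization (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`), which the Training Operator sets on each PyTorchJob replica; the `USE_GPU` flag is purely illustrative.

```python
import os

import torch.distributed as dist

# Pick a communication backend: NCCL for GPU training, Gloo for CPU-only runs.
backend = "nccl" if os.getenv("USE_GPU", "0") == "1" else "gloo"

# env:// reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the
# environment, which the operator has already populated for this replica.
dist.init_process_group(backend=backend, init_method="env://")

print(f"backend={dist.get_backend()} "
      f"rank={dist.get_rank()} world_size={dist.get_world_size()}")

dist.destroy_process_group()
```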