Updating Training Operator documentation and fixing fragmented links (#3834)

* Adding architecture diagram for KFTO

Signed-off-by: Francisco Javier Arceo <[email protected]>

* merging changes

Signed-off-by: Francisco Javier Arceo <[email protected]>

* testing updating the fine tuning back to how things were

Signed-off-by: Francisco Javier Arceo <[email protected]>

* incorporating changes for getting started

Signed-off-by: Francisco Javier Arceo <[email protected]>

* Reverting back changes from commit aa65085

Signed-off-by: Francisco Javier Arceo <[email protected]>

---------

Signed-off-by: Francisco Javier Arceo <[email protected]>
franciscojavierarceo authored Aug 27, 2024
1 parent d3ca1b1 commit 07e5d81
Showing 20 changed files with 188 additions and 405 deletions.
@@ -6,7 +6,7 @@ weight = 90

Katib offers a few installation options to install control plane. This page describes the options
and the features available with each option. Check
[the installation guide](/docs/components/katib/installation/#installing-control-plane) to
[the installation guide](/docs/components/katib/installation/#katib-control-plane-components) to
understand the Katib control plane components.

## The Default Katib Standalone Installation
26 changes: 13 additions & 13 deletions content/en/docs/components/training/explanation/fine-tuning.md
@@ -1,17 +1,17 @@
+++
title = "LLM Fine-Tuning with Training Operator"
description = "Why Training Operator needs fine-tuning API"
title = "LLM Fine-Tuning with the Training Operator"
description = "Why the Training Operator needs the fine-tuning API"
weight = 10
+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and Kubeflow community is looking for your feedback. Please
share your experience using [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
share your experience using the [#kubeflow-training Slack channel](/docs/about/community/#kubeflow-slack-channels)
or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

This page explains how [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
fits into Kubeflow ecosystem.
This page explains how the [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
fits into the Kubeflow ecosystem.

In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
the ability to fine-tune pre-trained models represents a significant leap towards achieving custom
@@ -22,23 +22,23 @@ to particular applications. Whether you're working in natural language processin
image classification, or another ML domain, fine-tuning can drastically improve performance and
applicability of pre-existing models to new datasets and problems.

## Why Training Operator Fine-Tune API Matter ?
## Why does the Training Operator's Fine-Tuning API Matter?

Training Operator Python SDK introduction of Fine-Tune API is a game-changer for ML practitioners
operating within the Kubernetes ecosystem. Historically, Training Operator has streamlined the
The introduction of the Fine-Tuning API in the Training Operator is a game-changer for ML practitioners
operating within the Kubernetes ecosystem. Historically, the Training Operator has streamlined the
orchestration of ML workloads on Kubernetes, making distributed training more accessible. However,
fine-tuning tasks often require extensive manual intervention, including the configuration of
training environments and the distribution of data across nodes. The Fine-Tune API aim to simplify
training environments and the distribution of data across nodes. The Fine-Tuning API aims to simplify
this process, offering an easy-to-use Python interface that abstracts away the complexity involved
in setting up and executing fine-tuning tasks on distributed systems.

## The Rationale Behind Kubeflow's Fine-Tune API
## The Rationale Behind Kubeflow's Fine-Tuning API

Implementing Fine-Tune API within Training Operator is a logical step in enhancing the platform's
Implementing the Fine-Tuning API within the Training Operator is a logical step in enhancing the platform's
capabilities. By providing this API, Training Operator not only simplifies the user experience for
ML practitioners but also leverages its existing infrastructure for distributed training.
This approach aligns with Kubeflow's mission to democratize distributed ML training, making it more
accessible and less cumbersome for users. The API facilitate a seamless transition from model
accessible and less cumbersome for users. The API facilitates a seamless transition from model
development to deployment, supporting the fine-tuning of LLMs on custom datasets without the need
for extensive manual setup or specialized knowledge of Kubernetes internals.
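
For readers who want to see what this looks like in practice, below is a minimal sketch of a fine-tuning call through the SDK's `train` API. It is illustrative only: the parameter classes (`HuggingFaceModelParams`, `HuggingFaceDatasetParams`, `HuggingFaceTrainerParams`), their fields, and the resource values are assumptions based on the alpha SDK and may differ between releases.

```python
import transformers
from kubeflow.training import TrainingClient
# These imports are assumptions based on the alpha SDK layout.
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

TrainingClient().train(
    name="fine-tune-bert",  # illustrative job name
    # Pre-trained model to fine-tune, pulled from the HuggingFace Hub.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Dataset to fine-tune on; only a slice of the split is used here.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",
    ),
    # Standard HuggingFace TrainingArguments are passed through unchanged.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
        ),
    ),
    # Distribution settings: the Training Operator handles the orchestration.
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"cpu": 4, "memory": "16G"},
)
```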

23 changes: 11 additions & 12 deletions content/en/docs/components/training/getting-started.md
@@ -1,33 +1,33 @@
+++
title = "Getting Started"
description = "Get started with Training Operator"
description = "Get started with the Training Operator"
weight = 30
+++

This guide describes how to get started with Training Operator and run a few simple examples.
This guide describes how to get started with the Training Operator and run a few simple examples.

## Prerequisites

You need to install the following components to run examples:

- Training Operator control plane [installed](/docs/components/training/installation/#installing-control-plane).
- Training Python SDK [installed](/docs/components/training/installation/#installing-python-sdk).
- The Training Operator control plane [installed](/docs/components/training/installation/#installing-the-control-plane).
- The Training Python SDK [installed](/docs/components/training/installation/#installing-the-python-sdk).

## Getting Started with PyTorchJob

You can create your first Training Operator distributed PyTorchJob using Python SDK. Define the
You can create your first Training Operator distributed PyTorchJob using the Python SDK. Define the
training function that implements end-to-end model training. Each Worker will execute this
function on the appropriate Kubernetes Pod. Usually, this function contains logic to
download the dataset, create the model, and train it.

Training Operator will automatically set `WORLD_SIZE` and `RANK` for the appropriate PyTorchJob
The Training Operator will automatically set `WORLD_SIZE` and `RANK` for the appropriate PyTorchJob
worker to perform [PyTorch Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).

If you install Training Operator as part of Kubeflow Platform, you can open a new
If you install the Training Operator as part of the Kubeflow Platform, you can open a new
[Kubeflow Notebook](/docs/components/notebooks/quickstart-guide/) to run this script. If you
install Training Operator standalone, make sure that you
install the Training Operator standalone, make sure that you
[configure local `kubeconfig`](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#programmatic-access-to-the-api)
to access your Kubernetes cluster where you installed Training Operator.
to access your Kubernetes cluster where you installed the Training Operator.

```python
def train_func():
@@ -115,7 +115,7 @@ TrainingClient().create_job(

## Getting Started with TFJob

Similar to PyTorchJob example, you can use the Python SDK to create your first distributed
Similar to the PyTorchJob example, you can use the Python SDK to create your first distributed
TensorFlow job. Run the following script to create a TFJob with the pre-created Docker image
`docker.io/kubeflow/tf-mnist-with-summaries:latest`, which contains
[distributed TensorFlow code](https://github.com/kubeflow/training-operator/tree/e6b4300f9dfebb5c2a3269641c828add367688ee/examples/tensorflow/mnist_with_summaries):
@@ -140,9 +140,8 @@ TrainingClient().get_job_logs(
follow=True,
)
```
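
Since most of the PyTorchJob and TFJob example code above is collapsed in this diff view, here is a hedged, self-contained sketch of the PyTorchJob flow described earlier. The job name, resource values, and `create_job` arguments shown are illustrative assumptions and may not match the exact signature in your installed SDK release.

```python
from kubeflow.training import TrainingClient


def train_func():
    import torch.distributed as dist

    # The Training Operator injects WORLD_SIZE and RANK into each worker Pod,
    # so the process group can be created from environment variables.
    dist.init_process_group(backend="gloo")
    print(f"rank={dist.get_rank()} world_size={dist.get_world_size()}")

    # ... build the model, wrap it in DistributedDataParallel, and run the
    # training loop here ...

    dist.destroy_process_group()


client = TrainingClient()

# Package train_func into a PyTorchJob with two workers.
client.create_job(
    name="pytorch-ddp-example",  # illustrative name
    train_func=train_func,
    num_workers=2,
    resources_per_worker={"cpu": "2", "memory": "4Gi"},
)

# Stream the logs of the job while it runs.
client.get_job_logs(name="pytorch-ddp-example", follow=True)
```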

## Next steps

- Run [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) with using Training Operator Python SDK.
- Run the [FashionMNIST example](https://github.com/kubeflow/training-operator/blob/7345e33b333ba5084127efe027774dd7bed8f6e6/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb) using the Training Operator Python SDK.

- Learn more about [the PyTorchJob APIs](/docs/components/training/user-guides/pytorch/).
[3 changed binary image files (diagrams) cannot be displayed in the diff view]
32 changes: 16 additions & 16 deletions content/en/docs/components/training/installation.md
@@ -1,34 +1,34 @@
+++
title = "Installation"
description = "How to install Training Operator"
description = "How to install the Training Operator"
weight = 20
+++

This guide describes how to install Training Operator on your Kubernetes cluster.
Training Operator is a lightweight Kubernetes controller that orchestrates appropriate Kubernetes
workloads to perform distributed ML training and fine-tuning.
This guide describes how to install the Training Operator on your Kubernetes cluster.
The Training Operator is a lightweight Kubernetes controller that orchestrates the
appropriate Kubernetes workloads to perform distributed ML training and fine-tuning.

## Prerequisites

These are minimal requirements to install Training Operator:
These are the minimal requirements to install the Training Operator:

- Kubernetes >= 1.27
- `kubectl` >= 1.27
- Python >= 3.7

## Installing Training Operator
## Installing the Training Operator

You need to install Training Operator control plane and Python SDK to create training jobs.
You need to install the Training Operator control plane and Python SDK to create training jobs.

### Installing Control Plane
### Installing the Control Plane

You can skip these steps if you have already
[installed Kubeflow platform](https://www.kubeflow.org/docs/started/installing-kubeflow/)
using manifests or package distributions. Kubeflow platform includes Training Operator.
using manifests or package distributions. The Kubeflow platform includes the Training Operator.

You can install Training Operator as a standalone component.
You can install the Training Operator as a standalone component.

Run the following command to install the stable release of Training Operator control plane: `v1.7.0`
Run the following command to install the stable release of the Training Operator control plane: `v1.7.0`

```shell
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.7.0"
@@ -62,12 +62,12 @@ tfjobs.kubeflow.org 2023-06-09T00:31:04Z
xgboostjobs.kubeflow.org 2023-06-09T00:31:04Z
```
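
If you prefer to verify the installation from Python rather than `kubectl`, a short sketch using the official Kubernetes Python client is shown below. This snippet is not part of the docs page itself; it simply lists the installed `kubeflow.org` CRDs and assumes a local kubeconfig pointing at the cluster where the control plane was installed.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()

# List CustomResourceDefinitions and keep only the kubeflow.org ones.
api = client.ApiextensionsV1Api()
for crd in api.list_custom_resource_definition().items:
    if crd.metadata.name.endswith("kubeflow.org"):
        print(crd.metadata.name, crd.metadata.creation_timestamp)
```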

### Installing Python SDK
### Installing the Python SDK

Training Operator [implements Python SDK](https://pypi.org/project/kubeflow-training/)
The Training Operator [implements a Python SDK](https://pypi.org/project/kubeflow-training/)
to simplify creation of distributed training and fine-tuning jobs for Data Scientists.

Run the following command to install the latest stable release of Training SDK:
Run the following command to install the latest stable release of the Training SDK:

```shell
pip install -U kubeflow-training
@@ -85,9 +85,9 @@ Otherwise, you can also install the Training SDK using the specific GitHub commi
pip install git+https://github.com/kubeflow/training-operator.git@7345e33b333ba5084127efe027774dd7bed8f6e6#subdirectory=sdk/python
```

#### Install Python SDK with Fine-Tuning Capabilities
#### Install the Python SDK with Fine-Tuning Capabilities

If you want to use `train` API for LLM fine-tuning with Training Operator, install the Python SDK
If you want to use the `train` API for LLM fine-tuning with the Training Operator, install the Python SDK
with the additional packages from HuggingFace:

```shell
48 changes: 24 additions & 24 deletions content/en/docs/components/training/overview.md
@@ -1,65 +1,65 @@
+++
title = "Overview"
description = "An overview of Training Operator"
description = "An overview of the Training Operator"
weight = 10
+++

{{% stable-status %}}

## What is Training Operator ?
## What is the Training Operator?

Training Operator is a Kubernetes-native project for fine-tuning and scalable
distributed training of machine learning (ML) models created with various ML frameworks such as
The Training Operator is a Kubernetes-native project for fine-tuning and scalable
distributed training of machine learning (ML) models created with different ML frameworks such as
PyTorch, TensorFlow, XGBoost, and others.

User can integrate other ML libraries such as [HuggingFace](https://huggingface.co),
You can integrate other ML libraries such as [HuggingFace](https://huggingface.co),
[DeepSpeed](https://github.com/microsoft/DeepSpeed), or [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
with Training Operator to orchestrate their ML training on Kubernetes.
with the Training Operator to orchestrate your ML training on Kubernetes.

Training Operator allows you to use Kubernetes workloads to effectively train your large models
via Kubernetes Custom Resources APIs or using Training Operator Python SDK.
The Training Operator allows you to use Kubernetes workloads to effectively train your large models
via Kubernetes Custom Resources APIs or using the Training Operator Python SDK.

Training Operator implements centralized Kubernetes controller to orchestrate distributed training jobs.
The Training Operator implements a centralized Kubernetes controller to orchestrate distributed training jobs.

Users can run High-performance computing (HPC) tasks with Training Operator and MPIJob since it
You can run high-performance computing (HPC) tasks with the Training Operator and MPIJob since it
supports running Message Passing Interface (MPI) on Kubernetes which is heavily used for HPC.
Training Operator implements V1 API version of MPI Operator. For MPI Operator V2 version,
The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version,
please follow [this guide](/docs/components/training/user-guides/mpi/) to install MPI Operator V2.

<img src="/docs/components/training/images/training-operator-overview.drawio.png"
alt="Training Operator Overview"
class="mt-3 mb-3">

Training Operator is responsible for scheduling the appropriate Kubernetes workloads to implement
The Training Operator is responsible for scheduling the appropriate Kubernetes workloads to implement
various distributed training strategies for different ML frameworks.

## Why Training Operator ?
## Why use the Training Operator?

Training Operator addresses Model Training and Model Fine-Tuning step in AI/ML lifecycle as shown on
that diagram:
The Training Operator addresses the Model Training and Model Fine-Tuning steps in the AI/ML
lifecycle as shown in the diagram below:

<img src="/docs/components/training/images/ml-lifecycle-training-operator.drawio.svg"
alt="AI/ML Lifecycle Training Operator"
class="mt-3 mb-3">

- **Training Operator simplifies ability to run distributed training and fine-tuning.**
- **The Training Operator simplifies the ability to run distributed training and fine-tuning.**

Users can easily scale their model training from single machine to large-scale distributed
You can easily scale your model training from a single machine to a large-scale distributed
Kubernetes cluster using APIs and interfaces provided by Training Operator.

- **Training Operator is extensible and portable.**
- **The Training Operator is extensible and portable.**

Users can deploy Training Operator on any cloud where you have Kubernetes cluster and users can
You can deploy the Training Operator on any cloud where you have a Kubernetes cluster, and you can
integrate your own ML frameworks written in any programming language with the Training Operator.

- **Training Operator is integrated with Kubernetes ecosystem.**
- **The Training Operator is integrated with the Kubernetes ecosystem.**

Users can leverage Kubernetes advanced scheduling techniques such as Kueue, Volcano, and YuniKorn
with Training Operator to optimize cost savings for ML training resources.
You can leverage Kubernetes advanced scheduling techniques such as Kueue, Volcano, and YuniKorn
with the Training Operator to optimize cost savings for your ML training resources.

## Custom Resources for ML Frameworks

To perform distributed training Training Operator implements the following
To perform distributed training, the Training Operator implements the following
[Custom Resources](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
for each ML framework:

@@ -73,6 +73,6 @@ for each ML framework:

## Next steps

- Follow [the installation guide](/docs/components/training/installation/) to deploy Training Operator.
- Follow [the installation guide](/docs/components/training/installation/) to deploy the Training Operator.

- Run examples from [getting started guide](/docs/components/training/getting-started/).
2 changes: 1 addition & 1 deletion content/en/docs/components/training/reference/_index.md
@@ -1,5 +1,5 @@
+++
title = "Reference"
description = "Reference docs for Training Operator"
description = "Reference docs for the Training Operator"
weight = 50
+++
40 changes: 40 additions & 0 deletions content/en/docs/components/training/reference/architecture.md
@@ -0,0 +1,40 @@
+++
title = "Architecture"
description = "The Training Operator Architecture"
weight = 10
+++

{{% stable-status %}}

## What is the Training Operator Architecture?

The original design was drafted in April 2021 and is [available here for reference](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/).
The goal was to provide a unified Kubernetes operator that supports multiple
machine learning/deep learning frameworks. This was done by having a "Frontend"
operator that decomposes the job into different configurable Kubernetes
components (e.g., Role, PodTemplate, Fault-Tolerance, etc.),
watches all Role Custom Resources, and manages pod performance.
The dedicated "Backend" operator was not implemented and was instead
consolidated into the "Frontend" operator.

The benefits of this approach were:
1. Shared testing and release infrastructure
2. Unlocked production-grade features like manifests and metadata support
3. Simpler Kubeflow releases
4. A Single Source of Truth (SSOT) for other Kubeflow components to interact with

The V1 Training Operator architecture can be seen in the diagram below:

<img src="/docs/components/training/images/training-operator-v1-architecture.drawio.svg"
alt="Training Operator V1 Architecture"
class="mt-3 mb-3">

The diagram displays PyTorchJob and its configured communication methods, but it
is worth mentioning that each framework can have its own approach(es) to
communicating across pods. Additionally, each framework can have its own set of
configurable resources.

As a concrete example, PyTorch has several
[Communication Backends](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)
available; see the linked documentation for the full list.
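
As an illustration of how a worker selects one of those backends, here is a minimal sketch. It assumes the standard environment variables used by PyTorch's `env://` initialization (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`), which the Training Operator sets on each PyTorchJob replica; the `USE_GPU` flag is purely illustrative.

```python
import os

import torch.distributed as dist

# Pick a communication backend: NCCL for GPU training, Gloo for CPU-only runs.
backend = "nccl" if os.getenv("USE_GPU", "0") == "1" else "gloo"

# env:// reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the
# environment, which the operator has already populated for this replica.
dist.init_process_group(backend=backend, init_method="env://")

print(f"backend={dist.get_backend()} "
      f"rank={dist.get_rank()} world_size={dist.get_world_size()}")

dist.destroy_process_group()
```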