docs: add diagram to reference (#3906)
* docs: Add Experiment CR diagram

Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* fix: fix diagram link

Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* Update content/en/docs/components/katib/reference/experiment-cr.md

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* Update content/en/docs/components/katib/reference/experiment-cr.md

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* docs: Update title

Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* docs: remove CR references

Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* docs: update diagram tags

Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* Update content/en/docs/components/katib/reference/experiment-cr.md

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

* docs: rename entities

Signed-off-by: Gonçalo Montalvão Marques <[email protected]>

---------

Signed-off-by: Gonçalo Montalvão Marques <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
gonmmarques and andreyvelich authored Oct 15, 2024
1 parent 823249d commit 3c7d3de
Showing 2 changed files with 55 additions and 0 deletions.
55 changes: 55 additions & 0 deletions content/en/docs/components/katib/reference/experiment-cr.md
@@ -0,0 +1,55 @@
+++
title = "Katib Experiment Lifecycle"
description = "What happens after an Experiment is created"
weight = 10
+++

## Katib Experiment Lifecycle

When a user creates an Experiment, the Katib Experiment controller,
Suggestion controller, and Trial controller work together to tune the
hyperparameters of the user's machine learning model. The Experiment
workflow looks as follows:

<img src="/docs/components/katib/images/katib-workflow.png" alt="Katib Workflow" class="mt-3 mb-3">

1. The Experiment is submitted to the Kubernetes API server. The Katib
Experiment mutating and validating webhooks are called to set the default
values for the Experiment and to validate the CR, respectively.

1. The Experiment controller creates the Suggestion.

1. The Suggestion controller creates the algorithm deployment and service
based on the new Suggestion.

1. When the Suggestion controller verifies that the algorithm service is
ready, it calls the service to generate
`spec.request - len(status.suggestions)` sets of hyperparameters and appends
them to `status.suggestions`.

1. The Experiment controller detects that the Suggestion has been updated and
generates one Trial for each new set of hyperparameters.

1. The Trial controller generates a `Worker Job` based on the `runSpec`
from the Trial with the new set of hyperparameters.

1. The related job controller
(Kubernetes batch Job, Kubeflow TFJob, Tekton Pipeline, etc.) generates
Kubernetes Pods.

1. The Katib Pod mutating webhook is called to inject the metrics collector
sidecar container into the candidate Pods.

1. While the ML model container runs, the metrics collector container
collects metrics from the injected Pod and persists them to the Katib
DB backend.

1. When the ML model training ends, the Trial controller updates the status
of the corresponding Trial.

1. When the Trial finishes, the Experiment controller increases the
`request` field of the corresponding Suggestion if needed,
and the workflow returns to `step 4`.
If the Experiment meets one of its `end` conditions
(it reaches `maxTrialCount`, `maxFailedTrialCount`, or the `goal`),
the Experiment controller marks the Experiment as finished.
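
The request and suggestion bookkeeping in steps 4, 5, and 11 can be sketched
in a few lines of Python. This is only an illustration of the arithmetic, not
Katib's controller code: the `Suggestion` class is a toy stand-in for the real
resource, and `propose`, `batch_size`, `missing_sets`, `sync_suggestion`, and
`run_loop` are hypothetical names introduced here. The sketch also assumes that
`maxTrialCount` is the only end condition and that every Trial finishes
immediately.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

HyperparameterSet = Dict[str, float]


@dataclass
class Suggestion:
    """Toy stand-in for a Suggestion; not the real Katib API type."""
    request: int = 0  # mirrors `spec.request` in the steps above
    suggestions: List[HyperparameterSet] = field(default_factory=list)  # mirrors `status.suggestions`


def missing_sets(suggestion: Suggestion) -> int:
    """Step 4: how many hyperparameter sets still have to be generated."""
    return max(suggestion.request - len(suggestion.suggestions), 0)


def sync_suggestion(
    suggestion: Suggestion, propose: Callable[[], HyperparameterSet]
) -> List[HyperparameterSet]:
    """Step 4: ask the (hypothetical) algorithm for the missing sets and append them."""
    new_sets = [propose() for _ in range(missing_sets(suggestion))]
    suggestion.suggestions.extend(new_sets)
    return new_sets


def run_loop(
    max_trial_count: int, batch_size: int, propose: Callable[[], HyperparameterSet]
) -> Suggestion:
    """Repeat steps 11 -> 4 -> 5 until the trial budget is used up."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    suggestion = Suggestion()
    trials_created = 0
    while trials_created < max_trial_count:
        # Step 11: raise the request, but never past the trial budget.
        suggestion.request += min(batch_size, max_trial_count - trials_created)
        # Steps 4-5: every newly appended set becomes exactly one Trial.
        trials_created += len(sync_suggestion(suggestion, propose))
    return suggestion
```

For example, `run_loop(5, 2, lambda: {"lr": 0.01})` raises `request` in
increments of 2, 2, and 1, so the loop ends with five entries in
`suggestions`. Because only the difference between `request` and the number of
stored suggestions is ever generated, repeating the sync without raising
`request` adds nothing.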
