docs: add new documentation and fix existing docs
Bobbins228 committed Oct 15, 2024
1 parent be10123 commit f3ecf40
Showing 10 changed files with 278 additions and 24 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -11,7 +11,7 @@ For guided demos and basics walkthroughs, check out the following links:
- these demos can be copied into your current working directory when using the `codeflare-sdk` by using the `codeflare_sdk.copy_demo_nbs()` function
- Additionally, we have a [video walkthrough](https://www.youtube.com/watch?v=U76iIfd9EmE) of these basic demos from June 2023

-Full documentation can be found [here](https://project-codeflare.github.io/codeflare-sdk/detailed-documentation)
+Full documentation can be found [here](https://project-codeflare.github.io/codeflare-sdk/index.html)

## Installation

@@ -32,11 +32,10 @@ It is possible to use the Release Github workflow to do the release. This is gen
The following instructions apply when doing release manually. This may be required in instances where the automation is failing.

- Check and update the version in "pyproject.toml" file.
-- Generate new documentation.
-  `pdoc --html -o docs src/codeflare_sdk && pushd docs && rm -rf cluster job utils && mv codeflare_sdk/* . && rm -rf codeflare_sdk && popd && find docs -type f -name "*.html" -exec bash -c "echo '' >> {}" \;` (it is possible to install **pdoc** using the following command `poetry install --with docs`)
- Commit all the changes to the repository.
- Create Github release (<https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository#creating-a-release>).
- Build the Python package. `poetry build`
- If not present already, add the API token to Poetry.
`poetry config pypi-token.pypi API_TOKEN`
- Publish the Python package. `poetry publish`
+- Trigger the [Publish Documentation](https://github.com/project-codeflare/codeflare-sdk/actions/workflows/publish-documentation.yaml) workflow
4 changes: 3 additions & 1 deletion docs/sphinx/index.rst
@@ -16,14 +16,16 @@ The CodeFlare SDK is an intuitive, easy-to-use python interface for batch resour
modules

.. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :caption: User Documentation:

    user-docs/authentication
    user-docs/cluster-configuration
    user-docs/ray-cluster-interaction
    user-docs/e2e
    user-docs/s3-compatible-storage
    user-docs/setup-kueue
    user-docs/ui-widgets

Quick Links
===========
4 changes: 2 additions & 2 deletions docs/sphinx/user-docs/authentication.rst
@@ -39,7 +39,7 @@ a login command like ``oc login --token=<token> --server=<server>``
their kubernetes config file should have updated. If the user has not
specifically authenticated through the SDK by other means such as
``TokenAuthentication`` then the SDK will try to use their default
-Kubernetes config file located at ``"/HOME/.kube/config"``.
+Kubernetes config file located at ``"$HOME/.kube/config"``.

Method 3 Specifying a Kubernetes Config File
--------------------------------------------
@@ -62,5 +62,5 @@ Method 4 In-Cluster Authentication
----------------------------------

If a user does not authenticate by any of the means detailed above and
-does not have a config file at ``"/HOME/.kube/config"`` the SDK will try
+does not have a config file at ``"$HOME/.kube/config"`` the SDK will try
to authenticate with the in-cluster configuration file.
125 changes: 117 additions & 8 deletions docs/sphinx/user-docs/cluster-configuration.rst
@@ -29,24 +29,133 @@ requirements for creating the Ray Cluster.
labels={"exampleLabel": "example", "secondLabel": "example"},
))
-Note: ‘quay.io/modh/ray:2.35.0-py39-cu121’ is the default image used by
-the CodeFlare SDK for creating a RayCluster resource. If you have your
-own Ray image which suits your purposes, specify it in image field to
-override the default image. If you are using ROCm compatible GPUs you
-can use ‘quay.io/modh/ray:2.35.0-py39-rocm61’. You can also find
-documentation on building a custom image
-`here <https://github.com/opendatahub-io/distributed-workloads/tree/main/images/runtime/examples>`__.
+.. note::
+   ``quay.io/modh/ray:2.35.0-py39-cu121`` is the default image used by
+   the CodeFlare SDK for creating a RayCluster resource. If you have
+   your own Ray image which suits your purposes, specify it in the
+   ``image`` field to override the default image. If you are using
+   ROCm-compatible GPUs you can use
+   ``quay.io/modh/ray:2.35.0-py39-rocm61``. You can also find
+   documentation on building a custom image
+   `here <https://github.com/opendatahub-io/distributed-workloads/tree/main/images/runtime/examples>`__.

The ``labels={"exampleLabel": "example"}`` parameter can be used to
apply additional labels to the RayCluster resource.

After creating their ``cluster``, a user can call ``cluster.up()`` and
``cluster.down()`` to create or remove the Ray Cluster, respectively.

Parameters of the ``ClusterConfiguration``
------------------------------------------

Below is a table explaining each of the ``ClusterConfiguration``
parameters and their default values.

.. list-table::
:header-rows: 1
:widths: auto

* - Name
- Type
- Description
- Default
* - ``name``
- ``str``
- The name of the Ray Cluster/AppWrapper
- Required - No default
* - ``namespace``
- ``Optional[str]``
- The namespace of the Ray Cluster/AppWrapper
- ``None``
* - ``head_cpu_requests``
- ``Union[int, str]``
- CPU resource requests for the Head Node
- ``2``
* - ``head_cpu_limits``
- ``Union[int, str]``
- CPU resource limits for the Head Node
- ``2``
* - ``head_memory_requests``
- ``Union[int, str]``
- Memory resource requests for the Head Node
- ``8``
* - ``head_memory_limits``
- ``Union[int, str]``
- Memory limits for the Head Node
- ``8``
* - ``head_extended_resource_requests``
- ``Dict[str, Union[str, int]]``
- Extended resource requests for the Head Node
- ``{}``
* - ``worker_cpu_requests``
- ``Union[int, str]``
- CPU resource requests for the Worker Node
- ``1``
* - ``worker_cpu_limits``
- ``Union[int, str]``
- CPU resource limits for the Worker Node
- ``1``
* - ``num_workers``
- ``int``
- Number of Worker Nodes for the Ray Cluster
- ``1``
* - ``worker_memory_requests``
- ``Union[int, str]``
- Memory resource requests for the Worker Node
- ``8``
* - ``worker_memory_limits``
- ``Union[int, str]``
- Memory resource limits for the Worker Node
- ``8``
* - ``appwrapper``
- ``bool``
- A boolean that wraps the Ray Cluster in an AppWrapper
- ``False``
* - ``envs``
- ``Dict[str, str]``
- A dictionary of environment variables to set for the Ray Cluster
- ``{}``
* - ``image``
- ``str``
- A parameter for specifying the Ray Image
- ``""``
* - ``image_pull_secrets``
- ``List[str]``
- A parameter for providing a list of Image Pull Secrets
- ``[]``
* - ``write_to_file``
- ``bool``
- A boolean for writing the Ray Cluster as a Yaml file if set to True
- ``False``
* - ``verify_tls``
- ``bool``
- A boolean indicating whether to verify TLS when connecting to the cluster
- ``True``
* - ``labels``
- ``Dict[str, str]``
- A dictionary of labels to apply to the cluster
- ``{}``
* - ``worker_extended_resource_requests``
- ``Dict[str, Union[str, int]]``
- Extended resource requests for the Worker Node
- ``{}``
* - ``extended_resource_mapping``
- ``Dict[str, str]``
- A dictionary of custom resource mappings to map extended resource requests to RayCluster resource names
- ``{}``
* - ``overwrite_default_resource_mapping``
- ``bool``
- A boolean indicating whether to overwrite the default resource mapping
- ``False``
* - ``local_queue``
- ``Optional[str]``
- A parameter for specifying the Local Queue label for the Ray Cluster
- ``None``

Deprecating Parameters
----------------------

-The following parameters of the ``ClusterConfiguration`` are being deprecated.
+The following parameters of the ``ClusterConfiguration`` are being
+deprecated.

.. list-table::
:header-rows: 1
19 changes: 10 additions & 9 deletions docs/sphinx/user-docs/e2e.rst
@@ -11,7 +11,7 @@ On KinD clusters

Pre-requisite for KinD clusters: please add in your local ``/etc/hosts``
file ``127.0.0.1 kind``. This will map your localhost IP address to the
-KinD clusters hostname. This is already performed on `GitHub
+KinD cluster's hostname. This is already performed on `GitHub
Actions <https://github.com/project-codeflare/codeflare-common/blob/1edd775e2d4088a5a0bfddafb06ff3a773231c08/github-actions/kind/action.yml#L70-L72>`__

If the system you run on contains NVidia GPU then you can enable the GPU
@@ -91,7 +91,7 @@ instructions <https://www.substratus.ai/blog/kind-with-gpus>`__.
poetry install --with test,docs
poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py

-- If the cluster doesnt have NVidia GPU support then we need to
+- If the cluster doesn't have NVidia GPU support then we need to
disable NVidia GPU tests by providing proper marker:

::
@@ -124,8 +124,8 @@ If the system you run on contains NVidia GPU then you can enable the GPU
support on OpenShift, this will allow you to run also GPU tests. To
enable GPU on OpenShift follow `these
instructions <https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/introduction.html>`__.
-Currently the SDK doesnt support tolerations, so e2e tests cant be
-executed on nodes with taint (i.e. GPU taint).
+Currently the SDK doesn't support tolerations, so e2e tests can't be
+executed on nodes with taint (i.e. GPU taint).

- Test Phase:

@@ -203,8 +203,8 @@ On OpenShift Disconnected clusters
AWS_STORAGE_BUCKET=<storage-bucket-name>
AWS_STORAGE_BUCKET_MNIST_DIR=<storage-bucket-MNIST-datasets-directory>

-Note : When using the Python Minio client to connect to a minio
-storage bucket, the ``AWS_DEFAULT_ENDPOINT`` environment
-variable by default expects secure endpoint where user can use
-endpoint url with https/http prefix for autodetection of
-secure/insecure endpoint.
+.. note::
+   When using the Python Minio client to connect to a Minio storage
+   bucket, the ``AWS_DEFAULT_ENDPOINT`` environment variable expects a
+   secure endpoint by default; supply an endpoint URL with an https or
+   http prefix so the client can auto-detect whether the endpoint is
+   secure or insecure.
Binary file added docs/sphinx/user-docs/images/ui-buttons.png
Binary file added docs/sphinx/user-docs/images/ui-view-clusters.png
90 changes: 90 additions & 0 deletions docs/sphinx/user-docs/ray-cluster-interaction.rst
@@ -0,0 +1,90 @@
Ray Cluster Interaction
=======================

The CodeFlare SDK offers multiple ways to interact with Ray Clusters
including the below methods.

get_cluster()
-------------

The ``get_cluster()`` function is used to initialise a ``Cluster``
object from a pre-existing Ray Cluster/AppWrapper. Below is an example
of its usage:

::

from codeflare_sdk import get_cluster
cluster = get_cluster(cluster_name="raytest", namespace="example", is_appwrapper=False, write_to_file=False)
-> output: Yaml resources loaded for raytest
cluster.status()
-> output:
🚀 CodeFlare Cluster Status 🚀
╭─────────────────────────────────────────────────────────────────╮
│ Name │
│ raytest Active ✅ │
│ │
│ URI: ray://raytest-head-svc.example.svc:10001 │
│ │
│ Dashboard🔗 │
│ │
╰─────────────────────────────────────────────────────────────────╯
(<CodeFlareClusterStatus.READY: 1>, True)
cluster.down()
cluster.up() # This function will create an exact copy of the retrieved Ray Cluster only if the Ray Cluster has been previously deleted.

| These are the parameters the ``get_cluster()`` function accepts:
| ``cluster_name: str # Required`` -> The name of the Ray Cluster.
| ``namespace: str # Default: "default"`` -> The namespace of the Ray Cluster.
| ``is_appwrapper: bool # Default: False`` -> When set to ``True`` the function will attempt to retrieve an AppWrapper instead of a Ray Cluster.
| ``write_to_file: bool # Default: False`` -> When set to ``True`` the Ray Cluster/AppWrapper will be written to a file similar to how it is done in ``ClusterConfiguration``.

list_all_queued()
-----------------

| The ``list_all_queued()`` function returns (and prints by default) a list of all currently queued-up Ray Clusters in a given namespace.
| It accepts the following parameters:
| ``namespace: str # Required`` -> The namespace you want to retrieve the list from.
| ``print_to_console: bool # Default: True`` -> Allows the user to print the list to their console.
| ``appwrapper: bool # Default: False`` -> When set to ``True`` allows the user to list queued AppWrappers.

list_all_clusters()
-------------------

| The ``list_all_clusters()`` function returns (and prints to the console by default) a list of detailed descriptions of Ray Clusters.
| It accepts the following parameters:
| ``namespace: str # Required`` -> The namespace you want to retrieve the list from.
| ``print_to_console: bool # Default: True`` -> A boolean that allows the user to print the list to their console.

.. note::

The following methods require a ``Cluster`` object to be
initialized. See :doc:`./cluster-configuration`

cluster.up()
------------

| The ``cluster.up()`` function creates a Ray Cluster in the given namespace.

cluster.down()
--------------

| The ``cluster.down()`` function deletes the Ray Cluster in the given namespace.

cluster.status()
----------------

| The ``cluster.status()`` function prints out the status of the Ray Cluster's state with a link to the Ray Dashboard.

cluster.details()
-----------------

| The ``cluster.details()`` function prints out a detailed description of the Ray Cluster's status, worker resources and a link to the Ray Dashboard.

cluster.wait_ready()
--------------------

| The ``cluster.wait_ready()`` function waits for the requested cluster to be ready, checking every 5 seconds, up to an optional timeout.
| It accepts the following parameters:
| ``timeout: Optional[int] # Default: None`` -> Allows the user to define a timeout for the ``wait_ready()`` function.
| ``dashboard_check: bool # Default: True`` -> If enabled the ``wait_ready()`` function will wait until the Ray Dashboard is ready too.
2 changes: 1 addition & 1 deletion docs/sphinx/user-docs/s3-compatible-storage.rst
@@ -82,5 +82,5 @@ Lastly the new ``run_config`` must be added to the Trainer:
To find more information on creating a Minio Bucket compatible with
RHOAI you can refer to this
`documentation <https://ai-on-openshift.io/tools-and-applications/minio/minio/>`__.
-Note: You must have ``sf3s`` and ``pyarrow`` installed in your
+Note: You must have ``s3fs`` and ``pyarrow`` installed in your
environment for this method.
53 changes: 53 additions & 0 deletions docs/sphinx/user-docs/ui-widgets.rst
@@ -0,0 +1,53 @@
Jupyter UI Widgets
==================

Below are some examples of the Jupyter UI Widgets that are included in
the CodeFlare SDK.

.. note::
   To use the widgets functionality you must be using the CodeFlare SDK
   in a Jupyter Notebook environment.

Cluster Up/Down Buttons
-----------------------

The Cluster Up/Down buttons appear after successfully initialising your
:doc:`ClusterConfiguration <./cluster-configuration>`.
There are two buttons and a checkbox, ``Cluster Up``, ``Cluster Down``
and ``Wait for Cluster?``, which mimic the ``cluster.up()``,
``cluster.down()`` and ``cluster.wait_ready()`` functionality described
in :doc:`./ray-cluster-interaction`.

After initialising their ``ClusterConfiguration`` a user can select the
``Wait for Cluster?`` checkbox then click the ``Cluster Up`` button to
create their Ray Cluster and wait until it is ready. The cluster can be
deleted by clicking the ``Cluster Down`` button.

.. image:: images/ui-buttons.png
:alt: An image of the up/down ui buttons

View Clusters UI Table
----------------------

The View Clusters UI Table allows a user to see a list of Ray Clusters
with information on their configuration, including number of workers and
CPU requests and limits, along with each cluster's status.

.. image:: images/ui-view-clusters.png
:alt: An image of the view clusters ui table

Above is a list of two Ray Clusters, ``raytest`` and ``raytest2``; each
of those headings is clickable and will update the table to view the
selected Cluster's information. There are three buttons under the table:

* The ``Cluster Down`` button will delete the selected Cluster.
* The ``View Jobs`` button will try to open the Ray Dashboard's Jobs
  view in a web browser. The link will also be printed to the console.
* The ``Open Ray Dashboard`` button will try to open the Ray Dashboard
  view in a web browser. The link will also be printed to the console.

The UI Table can be viewed by calling the following function.

.. code:: python

   from codeflare_sdk import view_clusters
   view_clusters() # Accepts namespace parameter but will try to gather the namespace from the current context
