[Add] Design doc for CodeFlare SDK #703

varshaprasad96 · 2024-10-08T18:38:08Z

This PR adds design documentation for the CodeFlare SDK to the repository for future reference.

codecov · 2024-10-08T18:43:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.11%. Comparing base (2e28f8a) to head (f2602fd).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #703   +/-   ##
=======================================
  Coverage   94.11%   94.11%           
=======================================
  Files          36       36           
  Lines        2412     2412           
=======================================
  Hits         2270     2270           
  Misses        142      142

KPostOffice · 2024-10-08T19:54:34Z

docs/designs/CodeFlare-SDK-design-doc.md

+1. Ease of use and integration: The SDK’s primary role is to abstract Kubernetes specifics. It should provide simple interfaces for interacting with any of the model training components on the server side.
+2. Lightweight: The SDK runs client-side and should minimize resource consumption. It must prioritize responsiveness and user experience. For example, using a polling mechanism to fetch status instead of a constant resource watch.
+3. Extensibility: The SDK currently integrates well with the CodeFlare stack, which uses Ray and TorchX (pytorch) distributed framework. In the future, components used for distributed training/tuning (as seen in fig [2]) should remain interchangeable.
+4. Security: The SDK must ensure users see only the information they are authorized to view. Authentication occurs on the client, while admission and validation are handled server-side. Kubernetes RBAC makes things easier.


This feels a bit off to me. Don't both authentication and authorization happen on the server with the SDK just providing a means for interacting and authenticating with server?

Good catch! My bad, I meant that the client-side tool is responsible for sending requests that carry all the credentials (tokens, certificates, etc.) required for validation. Actual authentication is not the SDK's responsibility; making sure that each request has the credentials in its header is the client-side authentication referred to here. I have reworded it, let me know if it reads better.
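
To make the distinction in this thread concrete, here is a minimal sketch of "client-side authentication" in the sense described above: the client only attaches credentials to the request, and the server decides whether they are valid. The function name `build_authenticated_request`, the URL, and the token are hypothetical illustrations, not CodeFlare SDK APIs.

```python
from urllib.request import Request


def build_authenticated_request(url: str, token: str) -> Request:
    """Attach a bearer token to a request; validation happens on the server."""
    if not token:
        raise ValueError("a token is required before sending the request")
    # The client only supplies credentials. It never decides whether they
    # are valid; the API server authenticates and authorizes the call.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical endpoint and token, used only to show the header being set.
req = build_authenticated_request(
    "https://api.example.invalid:6443/apis", "example-token"
)
print(req.get_header("Authorization"))
```

The request is built but never sent here, so the sketch shows only the client's share of the work.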

dimakis

Excellent work, Varsha. I have a couple of comments.

dimakis · 2024-10-09T12:03:26Z

docs/designs/CodeFlare-SDK-design-doc.md

+## Release:
+A new version of CodeFlare SDK will be released once every three weeks.
+For details on the release support matrix with other CodeFlare components, refer [here][codeflare_compatibility_matrix].
+RHOAI support matrix: CodeFlare SDK APIs fall under [Tier 2][RH_customer_API_support] support on `RHOAI` platform. This implies than an API would be support in for `n-1` major versions for a minimum of 9 months.


Unless I'm missing something (which is quite possible), we don't currently have a branching strategy that ties CFSDK releases to RHOAI releases. We should confirm that CFSDK is continuously scanned for CVEs for all supported versions, i.e. for 9 months after inclusion in a RHOAI release and 17 months in the case of an LTS EUS release.
Do we have a plan in place to easily identify the affected packages and upgrade them in case of CVEs?

Good point, CVE fixes need to be addressed here. IIRC the downstream Snyk scanner and upstream Dependabot look for CVEs, and they are fixed here. I'll check with @ChristianZaccaria on the process and come back to update this.

I have added a CVE Management section that covers Snyk downstream and Dependabot upstream, with automation to make sure that the version of the SDK shipped in NB is scanned throughout its lifecycle.
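
As a rough illustration of the scanning windows mentioned in this thread (9 months after inclusion in a RHOAI release, 17 months for an LTS EUS release), the sketch below computes the date until which a shipped SDK version would need scanning. The helper names `add_months` and `scan_until` and the example date are assumptions made for illustration; they are not part of the SDK or of any release tooling.

```python
import calendar
from datetime import date

# Window lengths quoted in the review comment above, in months.
STANDARD_SUPPORT_MONTHS = 9
EUS_SUPPORT_MONTHS = 17


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def scan_until(included_on: date, eus: bool = False) -> date:
    """Last date on which a shipped SDK version still needs CVE scanning."""
    months = EUS_SUPPORT_MONTHS if eus else STANDARD_SUPPORT_MONTHS
    return add_months(included_on, months)


print(scan_until(date(2024, 10, 8)))            # 2025-07-08
print(scan_until(date(2024, 10, 8), eus=True))  # 2026-03-08
```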

dimakis · 2024-10-09T12:07:41Z

docs/designs/CodeFlare-SDK-design-doc.md

+2. Lightweight: The SDK runs client-side and should minimize resource consumption. It must prioritize responsiveness and user experience. For example, using a polling mechanism to fetch status instead of a constant resource watch.
+3. Extensibility: The SDK currently integrates well with the CodeFlare stack, which uses Ray and TorchX (pytorch) distributed framework. In the future, components used for distributed training/tuning (as seen in fig [2]) should remain interchangeable.
+4. Security: The SDK must ensure users see only the information they are authorized to view. Authentication occurs on the client, while admission and validation are handled server-side. Kubernetes RBAC makes things easier.
+5. Typed Object Creation: The client should only allow the creation of known, typed K8s resources. This prevents arbitrary payloads from reaching the server which could be a threat. (configuring AppWrapper resource template is a concern here).


I think this section needs to be talked about from an AW perspective. Are we proposing we take out AW support from the SDK?

I have reworded this so that it no longer explicitly says AppWrapper needs to be removed, as that is not a trivial process. It now states that allowing only typed-object creation is a best practice. The other approach would be to refactor AppWrapper so that there is client-side validation to verify that the object being passed is valid k8s YAML. Since we cannot completely get rid of it, it would be better to at least work around it and ensure we follow the best possible practice on our end.
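
The client-side validation idea mentioned above could look roughly like the sketch below: before anything is sent to the server, the client checks that a resource has the basic shape of a Kubernetes object and that its kind is on an allowlist. The names `ALLOWED_KINDS` and `validate_resource`, the allowlist contents, and the rule that `metadata.name` must be set (which ignores `generateName`) are assumptions for illustration; the SDK's actual validation, if any, may differ.

```python
from typing import List

# Hypothetical allowlist of kinds the client is willing to create.
ALLOWED_KINDS = {"RayCluster", "AppWrapper"}
REQUIRED_FIELDS = ("apiVersion", "kind", "metadata")


def validate_resource(resource: object) -> List[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    if not isinstance(resource, dict):
        return ["resource must be a mapping"]
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in resource]
    kind = resource.get("kind")
    if kind is not None and kind not in ALLOWED_KINDS:
        problems.append(f"kind not allowed: {kind}")
    metadata = resource.get("metadata")
    if isinstance(metadata, dict):
        # Assumption: a name is required; generateName is not considered.
        if not metadata.get("name"):
            problems.append("metadata.name is required")
    elif "metadata" in resource:
        problems.append("metadata must be a mapping")
    return problems


ok = {"apiVersion": "ray.io/v1", "kind": "RayCluster", "metadata": {"name": "demo"}}
bad = {"kind": "Pod", "metadata": {}}
print(validate_resource(ok))
print(validate_resource(bad))
```

A check like this only rejects obviously malformed or unexpected payloads early; server-side admission and validation remain the authority.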

dimakis · 2024-10-09T12:10:31Z

docs/designs/CodeFlare-SDK-design-doc.md

+6. Version Compatibility: The SDK must maintain compatibility between client and server versions. Backward compatibility should be ensured even if one side is upgraded.
+
+#### Codebase Modularization:
+The CodeFlare-SDK codebase requires refactoring and modularization to facilitate easier addition or modification of components. The key requirements for this refactor include:


I think this needs to be reworded. This is the design doc for future development, so detailing what needs to change, imo, shouldn't be carried out here.

I have reworded this section to say that this is the modular best practice we need to follow, instead of explicitly stating that the codebase needs to be refactored. This will make sure that when we add any component in the future, we still follow the guidelines mentioned here. Let me know if this sounds better.

franciscojavierarceo · 2024-10-15T15:08:08Z

docs/designs/CodeFlare-SDK-design-doc.md

+Team: Distributed Workloads - Orchestration
+
+## Introduction
+This document outlines the design of the Project CodeFlare SDK, a Python SDK that facilitates interactions between users and the distributed workloads component of Red Hat OpenShift AI(RHOAI)/ OpenDataHub(ODH). Users, in this instance, are both data scientists and MLOps Engineers. The SDK provides a high-level abstraction for managing machine learning(ML) workflows, jobs and distributed computing resources.


Suggested change

This document outlines the design of the Project CodeFlare SDK, a Python SDK that facilitates interactions between users and the distributed workloads component of Red Hat OpenShift AI(RHOAI)/ OpenDataHub(ODH). Users, in this instance, are both data scientists and MLOps Engineers. The SDK provides a high-level abstraction for managing machine learning(ML) workflows, jobs and distributed computing resources.

Should this be more generic? Are we limiting our focus to RHOAI and ODH specifically?

> Are we limiting our focus to RHOAI and ODH specifically?

Ideally the CodeFlare stack can be used on any k8s cluster (through ODH or standalone). But since it is being shipped with RHOAI and ODH, the SDK design mentions that the intention is to make interaction with those components easier. Is supporting CodeFlare components standalone something we are planning in the future (and what would the use case be)? I believe as a data scientist I would prefer to adopt a whole ecosystem of components on my cluster (including NBs, IDE etc., through ODH), rather than just a piece of the training stack. Wdyt?

> I believe as a data scientist I would prefer to adopt a whole ecosystem of components on my cluster (including NBs, IDE etc., through ODH), rather than just a piece of the training stack. Wdyt?

Candidly, as a data scientist I think I would want to use an SDK from a trusted source. The CodeFlare SDK is not well known in the community, and it is a product we are advocating for customers to use. That's absolutely fine, but I do think DS customers will view it differently than if it were from Kubeflow or KubeRay (ignoring the fact that we are actually making their lives easier). And I think that perception could influence adoption.

I don't think that should change what we do in the short term but it is a realistic gap in our current strategy; i.e., CodeFlare has a very limited community and limited brand awareness so it may face adoption challenges. Of course, we can see how this plays out with customers but I want to call it out.

Completely fair, I feel we should track this somewhere. How about opening an issue here and also in JIRA, so that we can revisit this when we get to having some of the CodeFlare components upstream?

Having it upstream would help us to easily cross-reference wherever required.

franciscojavierarceo

some small nits but otherwise this lgtm!

Bobbins228 · 2024-10-17T14:25:10Z

Just need to resolve conflict and it lgtm

openshift-ci · 2024-10-17T20:18:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: franciscojavierarceo

Needs approval from an approver in each of these files:

~~OWNERS~~ [franciscojavierarceo]

openshift-ci · 2024-10-18T21:16:33Z

New changes are detected. LGTM label has been removed.

varshaprasad96 · 2024-10-18T21:17:29Z

@Bobbins228 could you approve the PR if everything looks good? Thanks! :)

openshift-ci bot requested review from dimakis and Fiona-Waters October 8, 2024 18:38

KPostOffice reviewed Oct 8, 2024

dimakis reviewed Oct 9, 2024

Bobbins228 requested changes Oct 10, 2024

openshift-ci bot assigned Bobbins228 Oct 10, 2024

varshaprasad96 force-pushed the add/sdk-desing-doc branch from f2eb95b to 63cf20c Compare October 11, 2024 02:21

varshaprasad96 requested review from KPostOffice, Bobbins228 and dimakis October 11, 2024 02:43

Bobbins228 reviewed Oct 11, 2024

varshaprasad96 force-pushed the add/sdk-desing-doc branch 4 times, most recently from c72f652 to 648b374 Compare October 14, 2024 22:32

Bobbins228 requested changes Oct 15, 2024
