[Add] Design doc for CodeFlare SDK #703
Codecov Report
All modified and coverable lines are covered by tests ✅

Coverage diff:

| | main | #703 | +/- |
|---|---|---|---|
| Coverage | 94.11% | 94.11% | |
| Files | 36 | 36 | |
| Lines | 2412 | 2412 | |
| Hits | 2270 | 2270 | |
| Misses | 142 | 142 | |
1. Ease of use and integration: The SDK’s primary role is to abstract Kubernetes specifics. It should provide simple interfaces for interacting with any of the model training components on the server side.
2. Lightweight: The SDK runs client-side and should minimize resource consumption. It must prioritize responsiveness and user experience. For example, using a polling mechanism to fetch status instead of a constant resource watch.
3. Extensibility: The SDK currently integrates well with the CodeFlare stack, which uses the Ray and TorchX (PyTorch) distributed frameworks. In the future, components used for distributed training/tuning (as seen in fig [2]) should remain interchangeable.
4. Security: The SDK must ensure users see only the information they are authorized to view. Authentication occurs on the client, while admission and validation are handled server-side. Kubernetes RBAC makes things easier.
This feels a bit off to me. Don't both authentication and authorization happen on the server, with the SDK just providing a means of interacting and authenticating with the server?
Good catch! My bad, I meant to say that it is the client-side tool's responsibility to send requests that carry all the credentials (tokens, certificates, etc.) required for validation. Actual authentication is not the SDK's responsibility; the "client-side authentication" referred to here is making sure that the request itself carries the credentials in its headers. I have reworded it, let me know if it reads better.
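To make the distinction concrete, here is a minimal sketch of what "client-side authentication" means in this reply: the client only attaches credentials to the request, and the server decides whether they are valid. The function name, URL, and token below are hypothetical and are not taken from the CodeFlare SDK.

```python
import urllib.request


def build_authenticated_request(api_url: str, token: str) -> urllib.request.Request:
    """Attach a bearer token to a request; the server performs the actual authentication."""
    if not token:
        # Fail early on the client instead of sending a request that cannot be validated.
        raise ValueError("a non-empty token is required")
    return urllib.request.Request(
        api_url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


# Hypothetical endpoint and token, for illustration only. Nothing is sent here;
# the request object is only constructed.
request = build_authenticated_request(
    "https://cluster.example.invalid/apis/ray.io/v1/namespaces/demo/rayclusters",
    "example-token",
)
```

Authorization (what the user is allowed to see) stays entirely on the server, for example through Kubernetes RBAC.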
Excellent work Varsha. I've a couple of comments
## Release:
A new version of CodeFlare SDK will be released once every three weeks.
For details on the release support matrix with other CodeFlare components, refer [here][codeflare_compatibility_matrix].
RHOAI support matrix: CodeFlare SDK APIs fall under [Tier 2][RH_customer_API_support] support on the `RHOAI` platform. This implies that an API would be supported for `n-1` major versions for a minimum of 9 months.
Unless I'm missing something (which is quite possible) we don't currently have a branching strategy which ties CFSDK releases to RHOAI releases. We should confirm that CFSDK is continuously scanned for CVEs for all supported versions, i.e. for 9 months after inclusion in a RHOAI release and 17 months in the case of an LTS EUS release.
Do we have a plan in place here to be able to easily identify the packages and upgrade them in case of CVEs etc.?
Good point, CVE fixes need to be addressed here. IIRC the downstream Snyk scanner and upstream Dependabot look for CVEs, and they are fixed in here. I'll check with @ChristianZaccaria on the process and come back to update this.
Have added a CVE Management section that talks about Snyk downstream and Dependabot upstream, with automation around making sure that the version of the SDK shipped in NB is scanned throughout its lifecycle.
2. Lightweight: The SDK runs client-side and should minimize resource consumption. It must prioritize responsiveness and user experience. For example, using a polling mechanism to fetch status instead of a constant resource watch.
3. Extensibility: The SDK currently integrates well with the CodeFlare stack, which uses the Ray and TorchX (PyTorch) distributed frameworks. In the future, components used for distributed training/tuning (as seen in fig [2]) should remain interchangeable.
4. Security: The SDK must ensure users see only the information they are authorized to view. Authentication occurs on the client, while admission and validation are handled server-side. Kubernetes RBAC makes things easier.
5. Typed Object Creation: The client should only allow the creation of known, typed K8s resources. This prevents arbitrary payloads from reaching the server, which could be a threat. (Configuring the AppWrapper resource template is a concern here.)
I think this section needs to be talked about from an AW perspective. Are we proposing we take out AW support from the SDK?
Have reworded this so that it no longer explicitly says AppWrapper needs to be removed, as that is not a trivial process. It now states that it is a best practice to allow only typed-object creation. The other approach would be to refactor AppWrapper so that there is validation on the client side to verify that the object being passed is valid k8s YAML. Since we cannot completely get rid of it, it would be better to at least work around it and ensure we follow the best possible practice from our end.
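As a rough illustration of the client-side validation idea mentioned above, the sketch below checks an already-parsed resource against an allowlist of known types before it would be wrapped or submitted. The allowlist contents and the function name are assumptions made for this example; they do not describe how the SDK or AppWrapper actually validates resources.

```python
# Hypothetical allowlist of (apiVersion, kind) pairs the client is willing to submit.
ALLOWED_KINDS = {
    ("ray.io/v1", "RayCluster"),
    ("batch/v1", "Job"),
}


def validate_wrapped_resource(resource: dict) -> None:
    """Reject payloads that are not a known, typed Kubernetes resource."""
    if not isinstance(resource, dict):
        raise TypeError("resource must be a mapping parsed from YAML or JSON")

    key = (resource.get("apiVersion"), resource.get("kind"))
    if key not in ALLOWED_KINDS:
        raise ValueError(f"unsupported resource type: {key}")

    metadata = resource.get("metadata")
    name = metadata.get("name") if isinstance(metadata, dict) else None
    if not isinstance(name, str) or not name:
        raise ValueError("metadata.name must be a non-empty string")


# A well-formed resource passes silently; anything else raises before reaching the server.
validate_wrapped_resource(
    {"apiVersion": "ray.io/v1", "kind": "RayCluster", "metadata": {"name": "demo"}}
)
```

This only narrows what the client will send; admission and schema validation on the server are still required.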
6. Version Compatibility: The SDK must maintain compatibility between client and server versions. Backward compatibility should be ensured even if one side is upgraded.
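
The "Lightweight" principle in the list above (polling for status rather than holding a constant resource watch) could look roughly like the following. This is a minimal sketch under stated assumptions: `wait_for_status`, its parameters, and the status strings are hypothetical and are not part of the SDK's API.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    terminal_states: Iterable[str],
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a terminal state or the timeout expires."""
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()  # one short request per iteration, no open connection
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in state {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for a call that would query the cluster for the current state.
fake_statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_status(
    lambda: next(fake_statuses),
    terminal_states={"SUCCEEDED", "FAILED"},
    interval_s=0.0,
)
```

Between polls the client holds no connection, which keeps it cheap at the cost of noticing state changes up to one interval late.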
#### Codebase Modularization:
The CodeFlare-SDK codebase requires refactoring and modularization to facilitate easier addition or modification of components. The key requirements for this refactor include:
I think this needs to be reworded. This is the design doc for future development, so detailing what needs to change, imo, shouldn't be carried out here.
Have reworded this section to say that this is the modularization best practice we need to follow, instead of explicitly mentioning that the codebase needs to be refactored. This will make sure that when we add any component in the future, we still follow the guidelines mentioned here. Let me know if this sounds better.
Team: Distributed Workloads - Orchestration

## Introduction
This document outlines the design of the Project CodeFlare SDK, a Python SDK that facilitates interactions between users and the distributed workloads component of Red Hat OpenShift AI (RHOAI) / OpenDataHub (ODH). Users, in this instance, are both data scientists and MLOps Engineers. The SDK provides a high-level abstraction for managing machine learning (ML) workflows, jobs and distributed computing resources.
Should this be more generic? Are we limiting our focus to RHOAI and ODH specifically?
> Are we limiting our focus to RHOAI and ODH specifically?

Ideally the CodeFlare stack can be used on any k8s cluster (through ODH or standalone). But since it is being shipped with RHOAI and ODH, the SDK design mentions that the intention is to make interaction with those components easier. Is supporting CodeFlare components standalone something we are planning in the future (and what would the use case be)? I believe as a data scientist I would prefer to adopt a whole ecosystem of components on my cluster (including NBs, IDE etc., through ODH), rather than just a piece of the training stack. Wdyt?
> I believe as a data scientist I would prefer to adopt a whole ecosystem of components on my cluster (including NBs, IDE etc., through ODH), rather than just a piece of the training stack. Wdyt?

Candidly, as a data scientist I think I would want to use an SDK from a trusted source. The CodeFlare SDK is not well known in the community, and it is a product we are advocating for customers to use. That's absolutely fine, but I do think DS customers will view it differently than if it were from Kubeflow or KubeRay (ignoring the fact that we are actually making their lives easier). And I think that perception could influence adoption.
I don't think that should change what we do in the short term, but it is a realistic gap in our current strategy; i.e., CodeFlare has a very limited community and limited brand awareness, so it may face adoption challenges. Of course, we can see how this plays out with customers, but I want to call it out.
Completely fair, I feel we should track this somewhere. How about opening an issue here and also in JIRA, so that we can revisit it when we get to having some of the CodeFlare components upstream?
Having it upstream would help us to easily cross-reference wherever required.
some small nits but otherwise this lgtm!
Just need to resolve conflict and it lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: franciscojavierarceo The full list of commands accepted by this bot can be found here. The pull request process is described here
This PR adds design documentation for CodeFlare SDK to the repository for reference in future.
New changes are detected. LGTM label has been removed.
@Bobbins228 could you approve the PR if everything looks good? Thanks! :)