[Add] Design doc for CodeFlare SDK #703

Open
wants to merge 1 commit into
base: main

varshaprasad96
Contributor

This PR adds design documentation for the CodeFlare SDK to the repository for future reference.

Issue link

What changes have been made

Verification steps

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

codecov bot commented Oct 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.11%. Comparing base (2e28f8a) to head (f2602fd).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #703   +/-   ##
=======================================
  Coverage   94.11%   94.11%           
=======================================
  Files          36       36           
  Lines        2412     2412           
=======================================
  Hits         2270     2270           
  Misses        142      142           


1. Ease of use and integration: The SDK’s primary role is to abstract Kubernetes specifics. It should provide simple interfaces for interacting with any of the model training components on the server side.
2. Lightweight: The SDK runs client-side and should minimize resource consumption. It must prioritize responsiveness and user experience. For example, using a polling mechanism to fetch status instead of a constant resource watch.
3. Extensibility: The SDK currently integrates well with the CodeFlare stack, which uses Ray and TorchX (pytorch) distributed framework. In the future, components used for distributed training/tuning (as seen in fig [2]) should remain interchangeable.
4. Security: The SDK must ensure users see only the information they are authorized to view. Authentication occurs on the client, while admission and validation are handled server-side. Kubernetes RBAC makes things easier.
Collaborator

This feels a bit off to me. Don't both authentication and authorization happen on the server with the SDK just providing a means for interacting and authenticating with server?

Contributor Author

Good catch! My bad. I meant that it is the client-side tool's responsibility to send requests that carry all the credentials (tokens, certificates, etc.) required for validation. Actual authentication is not the SDK's responsibility, but making sure each request has the credentials in its header is the client-side authentication I intended to refer to here. I have reworded it; let me know if it reads better.
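
For illustration only, the split described in this reply (the client attaches credentials, the server authenticates and authorizes) might look like the sketch below. It uses only the Python standard library and builds a request without sending it. The function name `build_authenticated_request`, the server URL, and the token value are hypothetical and are not taken from the CodeFlare SDK.

```python
import urllib.request


def build_authenticated_request(api_server: str, path: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that carries a bearer token.

    Hypothetical helper: the client only attaches credentials; the API
    server is the component that authenticates and authorizes the call.
    """
    if not token:
        raise ValueError("a bearer token is required")
    url = api_server.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


# Placeholder server and token, for illustration only.
req = build_authenticated_request(
    "https://api.example.invalid:6443",
    "/api/v1/namespaces/demo/pods",
    "hypothetical-token",
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Whether the server accepts the request is then decided entirely by its own authentication and RBAC configuration.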

Contributor

@dimakis left a comment

Excellent work, Varsha. I have a couple of comments.

## Release:
A new version of CodeFlare SDK will be released once every three weeks.
For details on the release support matrix with other CodeFlare components, refer to [this matrix][codeflare_compatibility_matrix].
RHOAI support matrix: CodeFlare SDK APIs fall under [Tier 2][RH_customer_API_support] support on the `RHOAI` platform. This implies that an API would be supported for `n-1` major versions for a minimum of 9 months.
Contributor

Unless I'm missing something (which is quite possible), we don't currently have a branching strategy that ties CFSDK releases to RHOAI releases. We should confirm that CFSDK is continuously scanned for CVEs for all supported versions, i.e. for 9 months after inclusion in a RHOAI release and 17 months in the case of an LTS EUS release.
Do we have a plan in place here to be able to easily identify the packages and upgrade them in case of CVEs etc.?

Contributor Author

Good point, CVE fixes need to be addressed here. IIRC, the downstream Snyk scanner and upstream Dependabot look for CVEs, and they are fixed here. I'll check with @ChristianZaccaria on the process and come back to update this.

Contributor Author

I have added a CVE Management section that covers Snyk downstream and Dependabot upstream, with automation to make sure that the version of the SDK shipped in NB is scanned throughout its lifecycle.

5. Typed Object Creation: The client should only allow the creation of known, typed K8s resources. This prevents arbitrary payloads from reaching the server which could be a threat. (configuring AppWrapper resource template is a concern here).
Contributor

I think this section needs to be talked about from an AW perspective. Are we proposing we take out AW support from the SDK?

Contributor Author

I have reworded this so that it no longer explicitly says AppWrapper needs to be removed, as that is not a trivial process. It now states that it is a best practice to allow only typed-object creation. The other approach would be to refactor AppWrapper so that there is client-side validation to verify that the object being passed is valid k8s YAML. Since we cannot completely get rid of it, it would be better to at least work around it and ensure we follow the best possible practice from our end.
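
As a rough illustration of the client-side validation idea in this reply, the sketch below checks an already-parsed manifest (a plain dictionary) against an allow-list of known resource types before it would be submitted. The function name `validate_manifest` and the contents of `ALLOWED_KINDS` are assumptions made for the example and do not describe how the SDK or AppWrapper handles this today.

```python
from typing import Any, List, Mapping

# Hypothetical allow-list of (apiVersion, kind) pairs the client may create.
ALLOWED_KINDS = {
    ("ray.io/v1", "RayCluster"),
    ("workload.codeflare.dev/v1beta2", "AppWrapper"),
}


def validate_manifest(obj: Any) -> List[str]:
    """Return a list of validation errors; an empty list means the manifest passed."""
    if not isinstance(obj, Mapping):
        return ["manifest must be a mapping"]

    errors: List[str] = []
    api_version = obj.get("apiVersion")
    kind = obj.get("kind")
    fields_ok = True
    if not isinstance(api_version, str) or not api_version:
        errors.append("apiVersion must be a non-empty string")
        fields_ok = False
    if not isinstance(kind, str) or not kind:
        errors.append("kind must be a non-empty string")
        fields_ok = False
    # Only consult the allow-list once both fields are well-formed.
    if fields_ok and (api_version, kind) not in ALLOWED_KINDS:
        errors.append(f"unsupported resource type: {api_version}/{kind}")

    metadata = obj.get("metadata")
    name = metadata.get("name") if isinstance(metadata, Mapping) else None
    if not isinstance(name, str) or not name:
        errors.append("metadata.name must be a non-empty string")
    return errors


good = {"apiVersion": "ray.io/v1", "kind": "RayCluster", "metadata": {"name": "demo"}}
bad = {"apiVersion": "v1", "kind": "Pod", "metadata": {}}
print(validate_manifest(good))
print(validate_manifest(bad))
```

A check like this only narrows what the client will send; server-side admission and validation would still be needed.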

6. Version Compatibility: The SDK must maintain compatibility between client and server versions. Backward compatibility should be ensured even if one side is upgraded.
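
The "Lightweight" principle in this list mentions polling for status instead of holding a constant resource watch. A minimal sketch of that pattern, assuming a caller-supplied `fetch_status` callable, could look like the following. The helper name `wait_for_status`, the status strings, and the default intervals are hypothetical and are not part of the SDK.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    target: str = "Ready",
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
) -> bool:
    """Poll fetch_status() until it returns target or the timeout elapses.

    Returns True if the target status was observed, False on timeout.
    Each poll is a short-lived call, so no long-running watch is kept open.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with a fake status source standing in for a real API call.
statuses = iter(["Pending", "Pending", "Ready"])
became_ready = wait_for_status(lambda: next(statuses), interval_s=0.0)
print(became_ready)
```

The trade-off is latency: a status change is noticed up to one interval late, in exchange for not keeping a connection open on the client.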

#### Codebase Modularization:
The CodeFlare-SDK codebase requires refactoring and modularization to facilitate easier addition or modification of components. The key requirements for this refactor include:
Contributor

I think this needs to be reworded. This is the design doc for future development, so detailing what needs to change, imo, shouldn't be carried out here.

Contributor Author

I have reworded this section to say that this is the modular best practice we need to follow, instead of explicitly mentioning that the codebase needs to be refactored. This will make sure that when we add any component in the future, we still follow the guidelines mentioned here. Let me know if this sounds better.

@varshaprasad96 varshaprasad96 force-pushed the add/sdk-desing-doc branch 4 times, most recently from c72f652 to 648b374 Compare October 14, 2024 22:32
Team: Distributed Workloads - Orchestration

## Introduction
This document outlines the design of the Project CodeFlare SDK, a Python SDK that facilitates interactions between users and the distributed workloads component of Red Hat OpenShift AI (RHOAI) / OpenDataHub (ODH). Users, in this instance, are both data scientists and MLOps engineers. The SDK provides a high-level abstraction for managing machine learning (ML) workflows, jobs, and distributed computing resources.
Contributor

Should this be more generic? Are we limiting our focus to RHOAI and ODH specifically?

Contributor Author

> Are we limiting our focus to RHOAI and ODH specifically?

Ideally, the CodeFlare stack can be used on any k8s cluster (through ODH or standalone). But since it is shipped with RHOAI and ODH, the SDK design mentions that the intention is to make interaction with those components easier. Is supporting CodeFlare components standalone something we are planning in the future (and what would be the use case)? I believe that, as a data scientist, I would prefer to adopt a whole ecosystem of components on my cluster (including NBs, IDE, etc., through ODH) rather than just a piece of the training stack. Wdyt?

Contributor

> I believe as a data scientist I would prefer to adopt a whole ecosystem of components on my cluster (including NBs, IDE etc - through ODH), rather than just a piece of training stack. Wdyt?

Candidly, as a data scientist I think I would want to use an SDK from a trusted source. The CodeFlare SDK is not well known in the community, and it is a product we are advocating for customers to use. That's absolutely fine, but I do think DS customers will view it differently than if it were from Kubeflow or KubeRay (ignoring the fact that we are actually making their lives easier). And I think that perception could influence adoption.

I don't think that should change what we do in the short term but it is a realistic gap in our current strategy; i.e., CodeFlare has a very limited community and limited brand awareness so it may face adoption challenges. Of course, we can see how this plays out with customers but I want to call it out.

Contributor Author

Completely fair; I feel we should track this somewhere. How about opening an issue here and also in JIRA, so that we can revisit it when we get to having some of the CodeFlare components upstream?

Having it upstream would help us to easily cross-reference wherever required.

Contributor

@franciscojavierarceo left a comment

Some small nits, but otherwise this LGTM!

@Bobbins228
Contributor

Just need to resolve conflict and it lgtm

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 17, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2024
Contributor

openshift-ci bot commented Oct 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: franciscojavierarceo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [franciscojavierarceo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2024
This PR adds design documentation for CodeFlare SDK to the repository for
reference in future.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2024
Contributor

openshift-ci bot commented Oct 18, 2024

New changes are detected. LGTM label has been removed.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2024
@varshaprasad96
Contributor Author

@Bobbins228 could you approve the PR if everything looks good? Thanks! :)

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
6 participants