
Ensure AI4DD workloads land on A100 GPU node(s) in nerc-ocp-prod cluster #762

dystewart opened this issue Oct 7, 2024 · 4 comments
@dystewart

Motivation

The AI4DD team is interested in using only A100 GPU nodes for their research. With V100s also in the cluster, simply requesting a GPU cannot guarantee that a workload lands on an A100 without some manual intervention. There are two ways we can attack this dilemma:

  1. Taint the A100 node(s) and leverage tolerations to land workloads on them. If these are long-running or constantly running workloads, or if they need to run on a single GPU host, I think this would make sense, and it's very simple to enable and disable this behavior. This also guarantees that the tainted A100 GPU node will be available when needed (see the sketch after this list).
  2. Utilize nodeSelector in AI4DD workloads to land on A100 nodes. This is the simpler option, but there is no guarantee that the A100 resources will necessarily be available (the same sketch below shows the nodeSelector form).
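
As a rough illustration of both options, here is a minimal pod sketch. The taint key/value and the `nvidia.com/gpu.product` label value are assumptions for illustration, not settings confirmed on nerc-ocp-prod, so they would need to be checked against the actual node labels and whatever taint we choose:

```yaml
# Option 1 (sketch): taint the A100 node(s), e.g.
#   oc adm taint nodes <a100-node> ai4dd=reserved:NoSchedule
# then give AI4DD workloads a matching toleration. The taint key/value here
# are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: ai4dd-gpu-example
spec:
  tolerations:
    - key: "ai4dd"
      operator: "Equal"
      value: "reserved"
      effect: "NoSchedule"
  # Option 2 (sketch): the nodeSelector alone, without any taint. The label
  # value below is an assumed GPU-feature-discovery label and should be
  # verified against the A100 nodes in the cluster.
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-40GB"
  containers:
    - name: workload
      image: example-image:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```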

Completion Criteria

Assist the AI4DD team in implementing the desired fix.

Description

  • Determine which solution from above will work best for the AI4DD folks

Completion dates

Desired - ASAP

@dystewart added the gpu label Oct 7, 2024
@dystewart self-assigned this Oct 7, 2024
@EldritchJS

Thanks @dystewart for creating this. One note of clarification: there will be three services/pods that need to be on A100 nodes, and 40-50 RHOAI workbenches that won't need GPU nodes. Ideally we'd have a means of wrangling these disparate needs accordingly. I assume for the workbenches we just specify in the .yaml that no accelerator is needed.
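
If it helps, keeping the workbenches off the GPU nodes should just amount to their pod specs not requesting any GPU at all; something like the following resources stanza (a sketch with placeholder values, not the exact spec RHOAI generates for a workbench):

```yaml
# Sketch of a workbench container's resources with no accelerator requested.
# With no nvidia.com/gpu request/limit, the scheduler has no reason to place
# the pod on a GPU node (values here are placeholders).
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 8Gi
```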

Your first option above seems to make sense to me, but I'll clarify the three services and how they're expected to operate:

  1. Fine-tuning application for the workshop presenter only
  2. Redundant instance of item 1 (the fine-tuning application)
  3. Inference service that is expected to take requests from the RHOAI workbenches

I expect the inference service will need autoscaling enabled, since the inference is reliant on GPU nodes.
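
For reference, here is a minimal autoscaling sketch, assuming the inference service runs as a plain Deployment named inference-service (a hypothetical name; if it is served via KServe/RHOAI model serving, that stack has its own autoscaling settings instead). Note this scales pods, not A100 nodes, so extra replicas will sit Pending if no A100 is free:

```yaml
# Minimal sketch: pod-level autoscaling on CPU utilization for a hypothetical
# Deployment named "inference-service". It does not add GPU nodes; replicas
# beyond the available A100s will remain Pending.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```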

Any thoughts on this?

@joachimweyl

Here is documentation on how to select a specific GPU.

@naved001

naved001 commented Oct 9, 2024

> Utilize nodeSelector in AI4DD workloads to land on A100 nodes. This is the simpler option, but there is no guarantee that the A100 resources will necessarily be available.

I think that's the solution we have in the NERC documentation. It is true that if there are no A100s available then the pod will not be scheduled and will stay in a Pending state; in that case it would be nice to get an estimate of how many GPUs should be available for this project.

@EldritchJS

Thanks for this. I am currently working with the project dev folks to get those estimates and will post them here as soon as I have them!
