
Ensure AI4DD workloads land on A100 GPU node(s) in nerc-ocp-prod cluster #762

dystewart opened this issue Oct 7, 2024 · 4 comments
@dystewart

Motivation

The AI4DD team is interested in using only A100 GPU nodes for their research. With V100s also in the cluster, simply requesting a GPU cannot guarantee that a workload lands on an A100 without some manual intervention. There are two ways we can attack this dilemma:

  1. Taint the A100 node(s) and leverage tolerations to land workloads on them. If these are long-running or constantly running workloads, or if they need to run on a single GPU host, I think this would make sense, and it's very simple to enable and disable this behavior. This also guarantees that the tainted A100 GPU node will be available when needed (see the sketch after this list).
  2. Utilize nodeSelector in AI4DD workloads to land on A100 nodes. This is the simpler option, but there is no guarantee that the A100 resources will necessarily be available (the same sketch below shows the nodeSelector form).
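
As a rough illustration of both options, here is a minimal pod sketch. The taint key/value and the `nvidia.com/gpu.product` label value are assumptions for illustration, not settings confirmed on nerc-ocp-prod, so they would need to be checked against the actual node labels and whatever taint we choose:

```yaml
# Option 1 (sketch): taint the A100 node(s), e.g.
#   oc adm taint nodes <a100-node> ai4dd=reserved:NoSchedule
# then give AI4DD workloads a matching toleration. The taint key/value here
# are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: ai4dd-gpu-example
spec:
  tolerations:
    - key: "ai4dd"
      operator: "Equal"
      value: "reserved"
      effect: "NoSchedule"
  # Option 2 (sketch): the nodeSelector alone, without any taint. The label
  # value below is an assumed GPU-feature-discovery label and should be
  # verified against the A100 nodes in the cluster.
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-40GB"
  containers:
    - name: workload
      image: example-image:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```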

Completion Criteria

Assist the AI4DD team in implementing the desired fix.

Description

  • Determine which solution from above will work best for the AI4DD folks

Completion dates

Desired - ASAP

@dystewart added the gpu label Oct 7, 2024
@dystewart self-assigned this Oct 7, 2024
@EldritchJS

Thanks @dystewart for creating this. One note of clarification: there will be three services/pods that need to be on A100 nodes, and 40-50 RHOAI workbenches that won't need GPU nodes. Ideally we'd have a means of wrangling these disparate needs accordingly. I assume for the workbenches we just specify in the .yaml that no accelerator is needed.
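
If it helps, keeping the workbenches off the GPU nodes should just amount to their pod specs not requesting any GPU at all; something like the following resources stanza (a sketch with placeholder values, not the exact spec RHOAI generates for a workbench):

```yaml
# Sketch of a workbench container's resources with no accelerator requested.
# With no nvidia.com/gpu request/limit, the scheduler has no reason to place
# the pod on a GPU node (values here are placeholders).
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 8Gi
```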

Your first option above seems to make sense to me, but I'll clarify the three services and how they're expected to operate:

  1. Fine-tuning application for the workshop presenter only
  2. Redundant instance of item 1 (the fine-tuning application)
  3. Inference service that is expected to take requests from the RHOAI workbenches

I expect the inference service will need autoscaling enabled, since the inference is reliant on GPU nodes.
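
For reference, here is a minimal autoscaling sketch, assuming the inference service runs as a plain Deployment named inference-service (a hypothetical name; if it is served via KServe/RHOAI model serving, that stack has its own autoscaling settings instead). Note this scales pods, not A100 nodes, so extra replicas will sit Pending if no A100 is free:

```yaml
# Minimal sketch: pod-level autoscaling on CPU utilization for a hypothetical
# Deployment named "inference-service". It does not add GPU nodes; replicas
# beyond the available A100s will remain Pending.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```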

Any thoughts on this?

@joachimweyl

Here is documentation on how to select a specific GPU.

@naved001

naved001 commented Oct 9, 2024

> Utilize nodeSelector in AI4DD workloads to land on A100 nodes. This is the simpler option, but there is no guarantee that the A100 resources will necessarily be available.

I think that's the solution we have in the NERC documentation. It is true that if there are no A100s available then the pod will not be scheduled and will stay in a Pending state; in that case it would be nice to get an estimate of how many GPUs should be available for this project.

@EldritchJS

Thanks for this. I am currently working with the project dev folks to get those estimates and will post them here as soon as I have them!
