bug: kruize: gpus not allocatable #782

Open
schwesig opened this issue Oct 22, 2024 · 7 comments
Assignees
schwesig

Labels
ai-telemetry · bug (Something isn't working) · observability · openshift (This issue pertains to NERC OpenShift)

Comments

@schwesig
Member

Motivation

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                Requests     Limits
  --------                --------     ------
  cpu                     804m (0%)    10m (0%)
  memory                  3213Mi (0%)  0 (0%)
  ephemeral-storage       0 (0%)       0 (0%)
  hugepages-1Gi           0 (0%)       0 (0%)
  hugepages-2Mi           0 (0%)       0 (0%)
  nvidia.com/gpu          0            0
  nvidia.com/mig-3g.20gb  0            0
  nvidia.com/mig-4g.20gb  0            0 
Capacity:
  cpu:                     128
  ephemeral-storage:       468097540Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1056462096Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0
  pods:                    250
Allocatable:
  cpu:                     127500m
  ephemeral-storage:       430324950326
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1055311120Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0

The pods are not getting launched due to insufficient GPU resources: the node reports 0 capacity and 0 allocatable for nvidia.com/gpu and the nvidia.com/mig-* resources.
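The output above is the kind of thing one would get from describing the node; a minimal way to re-check the GPU figures (a sketch, assuming the affected node is wrk-5, as named in the later comments):

  oc describe node wrk-5 | grep -i 'nvidia.com/'                # capacity / allocatable / allocated GPU and MIG lines
  oc get node wrk-5 -o jsonpath='{.status.allocatable}{"\n"}'   # full allocatable resource map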

Completion Criteria

  • pods can launch
  • GPUs can be allocated

Description

Completion dates

Desired - 2024-10-23
Required - 2024-10-25

Involved

@schwesig
@shekhar316
@bharathappali
@tssala23
@dystewart

maybe/FYI

@schwesig added the ai-telemetry, bug, observability, and openshift labels on Oct 22, 2024
@schwesig self-assigned this on Oct 22, 2024
@schwesig
Member Author

  • asked the kruize team to check the MIG configuration

@dystewart

@schwesig we should also have kruize check their clusterPolicy for errors
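One way to do that (a sketch, assuming the GPU operator's default ClusterPolicy name gpu-cluster-policy; adjust if this cluster uses a different name):

  oc get clusterpolicy                                                          # cluster-scoped CRD owned by the NVIDIA GPU operator
  oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'   # expected to report "ready"
  oc describe clusterpolicy gpu-cluster-policy                                  # look for error conditions in the status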

@schwesig
Member Author

@computate
Member

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.
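For example (a sketch, assuming the validator pods carry the usual app=nvidia-operator-validator label):

  oc get pods -n nvidia-gpu-operator -l app=nvidia-operator-validator -o wide
  oc logs -n nvidia-gpu-operator -l app=nvidia-operator-validator -c plugin-validation --tail=50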

@bharathappali

@schwesig If my understanding is correct, the node was restarted and the config we added to the default MIG config map got deleted (presumably the NVIDIA GPU operator rewrote the config map with its defaults).

Node wrk-5 still has the mig.config label set to the custom kruize config, which is no longer present after the rewrite. Ideally the MIG config manager should fall back to the default setting (all-disabled) when the requested config is missing from the config map, but for some reason that hasn't happened.
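One way to confirm this (a sketch; default-mig-parted-config is the operator's default MIG config map name and app=nvidia-mig-manager the usual pod label, so adjust both if they differ here):

  oc describe node wrk-5 | grep 'nvidia.com/mig.config'                       # which MIG profile the node is requesting / reporting
  oc get configmap default-mig-parted-config -n nvidia-gpu-operator -o yaml   # is the custom kruize profile still defined?
  oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50          # how the mig-manager handled the missing profile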

@schwesig
Member Author

schwesig commented Oct 25, 2024

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.

@computate
[three screenshots attached]

time="2024-10-25T08:27:35Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-10-25T08:27:36Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:41Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:46Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:51Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:56Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"

@schwesig
Member Author

schwesig commented Oct 25, 2024

fyi:

tried the node rebooting procedure from
https://docs.openshift.com/container-platform/4.15/nodes/nodes/nodes-nodes-rebooting.html
not successful, it didn't solve the problem yet

- oc adm cordon wrk-5    # mark the node unschedulable
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force    # evict remaining workloads
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force --disable-eviction    # retry, bypassing eviction/PodDisruptionBudget checks
- oc debug node/wrk-5    # open a debug pod on the node
- chroot /host
- systemctl reboot    # reboot the node from the host shell
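Per the linked rebooting procedure, once the node is back up it presumably still needs to be marked schedulable again:

- oc adm uncordon wrk-5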

[screenshot attached]
