bug: kruize: gpus not allocatable #782

Open
schwesig opened this issue Oct 22, 2024 · 7 comments
Assignees
schwesig

Labels
ai-telemetry · bug (Something isn't working) · observability · openshift (This issue pertains to NERC OpenShift)

Comments

@schwesig
Member

Motivation

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                Requests     Limits
  --------                --------     ------
  cpu                     804m (0%)    10m (0%)
  memory                  3213Mi (0%)  0 (0%)
  ephemeral-storage       0 (0%)       0 (0%)
  hugepages-1Gi           0 (0%)       0 (0%)
  hugepages-2Mi           0 (0%)       0 (0%)
  nvidia.com/gpu          0            0
  nvidia.com/mig-3g.20gb  0            0
  nvidia.com/mig-4g.20gb  0            0 
Capacity:
  cpu:                     128
  ephemeral-storage:       468097540Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1056462096Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0
  pods:                    250
Allocatable:
  cpu:                     127500m
  ephemeral-storage:       430324950326
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  1055311120Ki
  nvidia.com/gpu:          0
  nvidia.com/mig-3g.20gb:  0
  nvidia.com/mig-4g.20gb:  0

The pods are not getting launched due to insufficient GPU resources: the node reports 0 capacity and 0 allocatable for nvidia.com/gpu and the nvidia.com/mig-* resources.
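The output above is the kind of thing one would get from describing the node; a minimal way to re-check the GPU figures (a sketch, assuming the affected node is wrk-5, as named in the later comments):

  oc describe node wrk-5 | grep -i 'nvidia.com/'                # capacity / allocatable / allocated GPU and MIG lines
  oc get node wrk-5 -o jsonpath='{.status.allocatable}{"\n"}'   # full allocatable resource map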

Completion Criteria

  • pods can launch
  • GPUs can be allocated

Description

Completion dates

Desired - 2024-10-23
Required - 2024-10-25

Involved

@schwesig
@shekhar316
@bharathappali
@tssala23
@dystewart

maybe/FYI

@schwesig added the ai-telemetry, bug, observability, and openshift labels on Oct 22, 2024
@schwesig self-assigned this on Oct 22, 2024
@schwesig
Member Author

  • asked the kruize team to check the MIG configuration

@dystewart

@schwesig we should also have kruize check their clusterPolicy for errors
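One way to do that (a sketch, assuming the GPU operator's default ClusterPolicy name gpu-cluster-policy; adjust if this cluster uses a different name):

  oc get clusterpolicy                                                          # cluster-scoped CRD owned by the NVIDIA GPU operator
  oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'   # expected to report "ready"
  oc describe clusterpolicy gpu-cluster-policy                                  # look for error conditions in the status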

@schwesig
Member Author

@computate
Member

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.
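For example (a sketch, assuming the validator pods carry the usual app=nvidia-operator-validator label):

  oc get pods -n nvidia-gpu-operator -l app=nvidia-operator-validator -o wide
  oc logs -n nvidia-gpu-operator -l app=nvidia-operator-validator -c plugin-validation --tail=50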

@bharathappali

@schwesig If my understanding is correct, the node was restarted and the config we added to the default MIG config map got deleted (presumably the NVIDIA GPU operator rewrote the config map with its defaults).

Node wrk-5 still has the mig.config label set to the custom kruize config, which is no longer present after the rewrite. Ideally the MIG config manager should fall back to the default setting (all-disabled) when the requested config is missing from the config map, but for some reason that hasn't happened.
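One way to confirm this (a sketch; default-mig-parted-config is the operator's default MIG config map name and app=nvidia-mig-manager the usual pod label, so adjust both if they differ here):

  oc describe node wrk-5 | grep 'nvidia.com/mig.config'                       # which MIG profile the node is requesting / reporting
  oc get configmap default-mig-parted-config -n nvidia-gpu-operator -o yaml   # is the custom kruize profile still defined?
  oc logs -n nvidia-gpu-operator -l app=nvidia-mig-manager --tail=50          # how the mig-manager handled the missing profile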

@schwesig
Member Author

schwesig commented Oct 25, 2024

@schwesig you might check if the nvidia-operator-validator pods in the nvidia-gpu-operator namespace are failing to start with errors in the plugin-validation container to confirm it's the same problem as above.

@computate
[three screenshots attached]

time="2024-10-25T08:27:35Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-10-25T08:27:36Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:41Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:46Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:51Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"
time="2024-10-25T08:27:56Z" level=info msg="pod nvidia-cuda-validator-46k29 is curently in Pending phase"

@schwesig
Member Author

schwesig commented Oct 25, 2024

fyi:

tried the node rebooting procedure from
https://docs.openshift.com/container-platform/4.15/nodes/nodes/nodes-nodes-rebooting.html
not successful, it didn't solve the problem yet

- oc adm cordon wrk-5    # mark the node unschedulable
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force    # evict remaining workloads
- oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force --disable-eviction    # retry, bypassing eviction/PodDisruptionBudget checks
- oc debug node/wrk-5    # open a debug pod on the node
- chroot /host
- systemctl reboot    # reboot the node from the host shell
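Per the linked rebooting procedure, once the node is back up it presumably still needs to be marked schedulable again:

- oc adm uncordon wrk-5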

[screenshot attached]
