nebuly-nvidia-device plugin crash on new partitioning / config change #57

hasenbam commented Aug 1, 2024

I deployed nos with the nebuly-nvidia device plugin in MPS partitioning mode.
Whenever I apply a Deployment/pods that require a change of the GPU partitioning by the GPU partitioner, the nebuly-nvidia-device plugin crashes.

I tried to follow what is happening, and this is my guess:

  1. A new deployment gets applied; the GPU partitioner checks the pending pods to determine whether the partitioning needs to change.
  2. The GPU partitioner computes the new partitioning, writes it to a config, and references the new config in the node label nvidia.com/device-plugin.config.
  3. At the same time, the nebuly-device plugin is triggered by the label change and tries to read the new config referenced by the label.
  4. The referenced config does not exist (yet?) - maybe this is a timing issue, i.e. the label is already updated while the config takes a moment to become available (a way to observe this is sketched below the list).
  5. The non-existing config causes the nebuly-device-plugin to crash. Because this happens every time a new partitioning is necessary, after some time we run into the Kubernetes CrashLoopBackOff, meaning that the restart of the nebuly-device-plugin takes 5 minutes. After 5 minutes and the restart, the new partitioning becomes active and the pending pods start quickly with access to their configured MPS GPU fractions.

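If the race I am guessing at is real, it should be visible by watching the node label and the plugin's config source side by side while applying a Deployment that forces repartitioning. A rough sketch of what I mean (the node name vm125 is only a guess derived from the config name in the logs, and I am assuming the configs are delivered as ConfigMaps in the nebuly-nvidia namespace - adjust to your setup):

# Terminal 1: watch the label the sidecar reacts to
kubectl get node vm125 -L nvidia.com/device-plugin.config --watch

# Terminal 2: watch the config objects in the plugin namespace to see when the
# referenced config actually appears (assuming ConfigMap delivery)
kubectl get configmap -n nebuly-nvidia --watch
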
Here is the output of the logs of the nebuly-nvidia-device plugin. You can see that at 13:05 I deployed a Deployment with a pod requesting nvidia.com/gpu-2gb, which triggered a new partitioning and caused the crash:

kubectl logs pod/nvidia-device-plugin-1722514861-rrdhz -n nebuly-nvidia --follow
Defaulted container "nvidia-device-plugin-sidecar" out of: nvidia-device-plugin-sidecar, nvidia-mps-server, nvidia-device-plugin-ctr, set-compute-mode (init), set-nvidia-mps-volume-permissions (init), nvidia-device-plugin-init (init)
W0801 13:02:37.159120     270 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:02:37Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Updating to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Successfully updated to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Sending signal 'hangup' to 'nvidia-device-plugin'"
time="2024-08-01T13:02:37Z" level=info msg="Successfully sent signal"
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:05:02Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517497"
time="2024-08-01T13:05:02Z" level=info msg="Error: specified config vm125-1722517497 does not exist"

It still works in the end, but this way it always takes 5 minutes for my pods to start when the partitioning changes :(
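
As a stopgap (not a fix), deleting the crashed plugin pod right after the error appears should skip the remaining CrashLoopBackOff delay, since the DaemonSet recreates the pod immediately. For example, with the pod from the logs above (the name changes on every restart):

# delete the crashed sidecar pod so the DaemonSet recreates it without backoff
kubectl delete pod nvidia-device-plugin-1722514861-rrdhz -n nebuly-nvidia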
