NOS MPS leaves GPUs on node in exclusive mode #27

Open
Damowerko opened this issue Mar 28, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@Damowerko

In my use case, I often enable and disable NOS on individual nodes by adding or removing the label nos.nebuly.com/gpu-partitioning=mps. After the node is labeled, NOS changes the GPU compute mode to exclusive. However, after the label is removed, the GPU remains in exclusive mode.
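
A minimal sketch of the toggle described above, for anyone trying to reproduce this. The label key and value come from this issue; the function names and the KUBECTL override variable are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: toggle the NOS MPS partitioning label on a node.
# KUBECTL is a hypothetical override so the commands can be dry-run or stubbed.
KUBECTL="${KUBECTL:-kubectl}"
PARTITION_LABEL="nos.nebuly.com/gpu-partitioning"

# Add (or overwrite) the label so NOS starts managing the node with MPS.
enable_mps() {
    "$KUBECTL" label node "$1" "$PARTITION_LABEL=mps" --overwrite
}

# Remove the label; a trailing "-" after the key tells kubectl to delete it.
disable_mps() {
    "$KUBECTL" label node "$1" "$PARTITION_LABEL-"
}
```

After calling enable_mps and then disable_mps for a node, running nvidia-smi --query-gpu=index,compute_mode --format=csv on that node should show whether the GPUs are still in exclusive mode.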

Expected behavior: NOS should revert the GPU mode to whatever it was when it started or to default.

Workaround: After removing the label, set the compute mode back to default (or whichever mode you want) on every GPU. For example, to reset GPU 0 to default mode, run:

nvidia-smi -i 0 -c 0
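
To apply the workaround to every GPU on a node, the per-GPU command above can be wrapped in a loop. This is a sketch, assuming nvidia-smi is on the PATH and the shell has permission to change the compute mode (usually root); the function name and the NVIDIA_SMI override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: reset every GPU on this node to the default compute mode.
# NVIDIA_SMI is a hypothetical override, useful for stubbing in tests.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

reset_compute_mode() {
    # Print one GPU index per line, then reset each GPU in turn.
    "$NVIDIA_SMI" --query-gpu=index --format=csv,noheader |
    while read -r idx; do
        # -c 0 selects the default compute mode for the GPU given by -i.
        "$NVIDIA_SMI" -i "$idx" -c 0 </dev/null
    done
}
```

Call reset_compute_mode on the node once the label has been removed.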
Damowerko changed the title from "MPS server leaves GPUs on node in exclusive mode" to "NOS MPS leaves GPUs on node in exclusive mode" on Mar 28, 2023
Telemaco019 added the bug label on Mar 28, 2023
@Baenimyr

Baenimyr commented Feb 5, 2024

You could try adding a shutdown command to the set-compute-mode container.
The container must keep running and then run nvidia-smi -c 0 when it receives the termination signal (SIGTERM by default in Kubernetes).
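
The suggestion above could look roughly like the following entrypoint. This is a sketch, not the actual NOS container script. It assumes the container sets exclusive-process mode on start, and it traps both SIGTERM and SIGINT. The function names and the NVIDIA_SMI override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: set exclusive mode, wait, and restore default mode on shutdown.
# NVIDIA_SMI is a hypothetical override, useful for stubbing in tests.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Put all GPUs back into the default compute mode.
restore_default_mode() {
    "$NVIDIA_SMI" -c 0
}

run_until_signalled() {
    # Assumption: the container's startup job is to set exclusive-process mode.
    "$NVIDIA_SMI" -c EXCLUSIVE_PROCESS
    # On SIGTERM or SIGINT, restore the default mode and exit cleanly.
    trap 'restore_default_mode; exit 0' TERM INT
    # Sleep in the background and wait on it, so the trap runs promptly
    # rather than only after a long foreground sleep finishes.
    while :; do
        sleep 1 &
        wait $!
    done
}
```

A real entrypoint would end with a call to run_until_signalled. The restore has to finish within the pod's termination grace period, otherwise the container is killed before the mode is reset.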

@Damowerko
Author

When I am able, I will add a preStop hook to the container and test whether it resolves the issue.

@Baenimyr

Have you seen this PR? NVIDIA/k8s-device-plugin#490
Maybe you can use the MPS daemon from NVIDIA.

@Damowerko
Author

@Baenimyr Good that the device plugin supports MPS now. The problem is that it does not scale dynamically. Of course, NOS could use the NVIDIA plugin now. However, with the NVIDIA DRA driver on the horizon, it does not make sense for me personally to use NOS.
