
Is it compatible with different driver versions and CUDA versions? #482

Open
15929482853 opened this issue Sep 9, 2024 · 1 comment
Labels
kind/bug Something isn't working

Comments

@15929482853

What happened: All of the cluster's GPUs previously used driver version 515 and CUDA 11.7. Recently I added a machine with an L20 (which requires at least driver 535 and CUDA 12), and then I ran into a problem where the GPUs were not recognized correctly:
(Screenshots attached: two WeChat Work captures showing the GPUs not being recognized correctly.)

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
15929482853 added the kind/bug (Something isn't working) label on Sep 9, 2024
@archlitchi (Collaborator) commented on Sep 10, 2024

Can you re-submit the task with the env 'CUDA_DISABLE_CONTROL=true' and see if it reproduces this error?
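
For reference, a minimal sketch of how that environment variable could be set on the task pod. Only `CUDA_DISABLE_CONTROL` itself comes from the comment above; the pod name, image, and the `nvidia.com/gpu` resource name are illustrative placeholders and may differ from your actual HAMi configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                       # hypothetical task name
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
      env:
        - name: CUDA_DISABLE_CONTROL   # env var suggested above, to rule out HAMi's CUDA hooking
          value: "true"
      resources:
        limits:
          nvidia.com/gpu: 1            # assumed HAMi-managed GPU resource name
```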
