Virtual machine HAMI cannot call GPU resources #536

Open
1219354801 opened this issue Oct 8, 2024 · 4 comments

Comments

@1219354801

I deployed HAMi on a WSL2 virtual machine. Whenever I submit a task, HAMi fails to use my GPU resources. What should I do?

Please provide an in-depth description of the question you have:

What do you think about this question?:

Environment:

  • HAMi version:
  • Kubernetes version:
  • Others:
@1219354801 changed the title to append the English translation "Virtual machine HAMI cannot call GPU resources" Oct 8, 2024
@1219354801
Author

user@sfyfb:~$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   0/1     Pending   0          82s
user@sfyfb:~$ kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:
Labels:
Annotations:
Status:           Pending
IP:
IPs:
Containers:
  ubuntu-container:
    Image:      ubuntu:18.04
    Port:
    Host Port:
    Command:
      bash
      -c
      sleep 86400
    Limits:
      nvidia.com/gpu:     1
      nvidia.com/gpumem:  3k
    Requests:
      nvidia.com/gpu:     1
      nvidia.com/gpumem:  3k
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hndm4 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-hndm4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  103s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 Insufficient nvidia.com/gpumem. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
user@sfyfb:~$ kubectl get nodes -o wide
NAME             STATUS   ROLES           AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                       CONTAINER-RUNTIME
docker-desktop   Ready    control-plane   3d17h   v1.30.2   192.168.65.3                 Docker Desktop   5.15.153.1-microsoft-standard-WSL2   docker://27.1.1
user@sfyfb:~$ kubectl get nodes --show-labels
NAME             STATUS   ROLES           AGE     VERSION   LABELS
docker-desktop   Ready    control-plane   3d21h   v1.30.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,gpu=on,kubernetes.io/arch=amd64,kubernetes.io/hostname=docker-desktop,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node.kubernetes.io/exclude-from-external-load-balancers=
user@sfyfb:~$ kubectl get pods -n kube-system
NAME                                     READY   STATUS    RESTARTS   AGE
coredns-7db6d8ff4d-544gc                 1/1     Running   0          3d21h
coredns-7db6d8ff4d-slq8z                 1/1     Running   0          3d21h
etcd-docker-desktop                      1/1     Running   6          3d21h
kube-apiserver-docker-desktop            1/1     Running   6          3d21h
kube-controller-manager-docker-desktop   1/1     Running   6          3d21h
kube-proxy-fz4kk                         1/1     Running   0          3d21h
kube-scheduler-docker-desktop            1/1     Running   6          3d21h
storage-provisioner                      1/1     Running   0          3d21h
vpnkit-controller                        1/1     Running   0          3d21h

@1219354801
Author

image
Running nvidia-smi inside the virtual machine works without any problem.

@1219354801
Author

kubectl get pods shows the pod as Pending. The pod is Pending because no GPU resources are available: the scheduler reports Insufficient nvidia.com/gpu and Insufficient nvidia.com/gpumem.
🙏 What should I do about this?
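
One way to check what the scheduler message refers to is to look at the resources the node itself advertises. The sketch below assumes the node name `docker-desktop` from the output above and a kubectl version that prints `{.status.allocatable}` as JSON; `advertises_resource` is a hypothetical helper written for this thread, not part of kubectl or HAMi.

```shell
# Hypothetical helper: succeeds if the resource name passed as $1 appears
# as a quoted key in the allocatable data read from stdin.
advertises_resource() {
  grep -qF "\"$1\""
}

# Usage against the cluster (needs kubectl and cluster access):
#   kubectl get node docker-desktop -o jsonpath='{.status.allocatable}' \
#     | advertises_resource nvidia.com/gpu \
#     && echo "node advertises nvidia.com/gpu" \
#     || echo "node does not advertise nvidia.com/gpu"
```

If the node does not list `nvidia.com/gpu`, no device plugin has registered GPUs with the kubelet, which would be consistent with the `Insufficient nvidia.com/gpu` event even though nvidia-smi works in the virtual machine.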

@archlitchi
Collaborator

Based on the pods in kube-system, it seems you haven't installed HAMi correctly.
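
A quick way to repeat this check is to count the HAMi pods in a pod listing. This is only a sketch: it assumes the HAMi pods carry "hami" in their names, which depends on the Helm release name used at install time, and that they run in kube-system. `count_hami_pods` is a hypothetical helper, not a kubectl or HAMi command.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin, skips the
# header row, and prints how many pod names contain "hami" (any case).
count_hami_pods() {
  awk 'NR > 1 && tolower($1) ~ /hami/ { n++ } END { print n + 0 }'
}

# Usage against the cluster (needs kubectl and cluster access):
#   kubectl get pods -n kube-system | count_hami_pods
```

Run against the kube-system listing posted above, this prints 0, since that listing contains only the stock Docker Desktop components.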
