
DRA does not support Tesla P4 model GPUs because it does not support setting time slices by nvidia-smi #41

Open
wawa0210 opened this issue Dec 21, 2023 · 5 comments · May be fixed by #58
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments


wawa0210 commented Dec 21, 2023

When I ran DRA on a Tesla P4 node, I found that the pod failed to start.

environment

K8s version: v1.27.5
k8s-dra-driver: latest main branch

what happened

A pod deployed in the Tesla P4 environment that claims one GPU fails with the following error:

E1221 04:43:21.238356       1 nvlib.go:489]
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported
E1221 04:43:21.238522       1 nonblockinggrpcserver.go:127] "dra: handling request failed" err="error preparing devices for claim 5dc94ce6-1e6e-4359-bc73-3d2039797ff0: error setting up sharing: error setting timeslice for 5dc94ce6-1e6e-4359-bc73-3d2039797ff0: error setting time slice: error running nvidia-smi: exit status 3" requestID=2 request="&NodePrepareResourceRequest{Namespace:gpu-test1,ClaimUid:5dc94ce6-1e6e-4359-bc73-3d2039797ff0,ClaimName:pod1-gpu,ResourceHandle:,}"

Digging in, I found that even when the GPU is not configured for sharing, nvidia-smi compute-policy -i <uuid> --set-timeslice 0 is still invoked. The Tesla P4 does not support this command, so it fails:

root@nvidia-dcgm-exporter-pvgr8:/# nvidia-smi
Thu Dec 21 04:58:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:13:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      0MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@nvidia-dcgm-exporter-pvgr8:/# nvidia-smi -L
GPU 0: Tesla P4 (UUID: GPU-e290caca-2f0c-9582-acab-67a142b61ffa)
root@nvidia-dcgm-exporter-pvgr8:/# nvidia-smi compute-policy -i GPU-e290caca-2f0c-9582-acab-67a142b61ffa --set-timeslice 0
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

code ref

func (t *TimeSlicingManager) SetTimeSlice(devices *PreparedDevices, config *nascrd.TimeSlicingConfig) error {
    if devices.Mig != nil {
        return fmt.Errorf("setting a TimeSlice duration on MIG devices is unsupported")
    }
    // Even when no config is provided, the default time slice is applied,
    // so the nvidia-smi call below runs for unshared GPUs as well.
    timeSlice := nascrd.DefaultTimeSlice
    if config != nil && config.TimeSlice != nil {
        timeSlice = *config.TimeSlice
    }
    err := t.nvdevlib.setComputeMode(devices.UUIDs(), "DEFAULT")
    if err != nil {
        return fmt.Errorf("error setting compute mode: %w", err)
    }
    err = t.nvdevlib.setTimeSlice(devices.UUIDs(), timeSlice.Int())
    if err != nil {
        return fmt.Errorf("error setting time slice: %w", err)
    }
    return nil
}

Steps to reproduce

Test YAML:

cat <<EOF | kubectl apply -f -

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1
 
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: gpu.nvidia.com
spec:
  spec:
    resourceClassName: gpu.nvidia.com
 
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: chrstnhntschl/gpu_burn
    args: ["3600"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu.nvidia.com
EOF

other information

NAS info

apiVersion: nas.gpu.resource.nvidia.com/v1alpha1
kind: NodeAllocationState
metadata:
  creationTimestamp: "2023-12-20T11:43:30Z"
  generation: 41
  name: 172-30-43-122
  namespace: nvidia-dra-driver
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: 172-30-43-122
    uid: 0e49e1e1-e0b5-4bfc-a89c-286262f6265d
  resourceVersion: "121350"
  uid: 5b600060-08d4-4525-95d5-02ec746e7c3c
spec:
  allocatableDevices:
  - gpu:
      architecture: Pascal
      brand: Tesla
      cudaComputeCapability: "6.1"
      index: 0
      memoryBytes: 8053063680
      migEnabled: false
      productName: Tesla P4
      uuid: GPU-e290caca-2f0c-9582-acab-67a142b61ffa
  allocatedClaims:
    5dc94ce6-1e6e-4359-bc73-3d2039797ff0:
      claimInfo:
        name: pod1-gpu
        namespace: gpu-test1
        uid: 5dc94ce6-1e6e-4359-bc73-3d2039797ff0
      gpu:
        devices:
        - uuid: GPU-e290caca-2f0c-9582-acab-67a142b61ffa
status: Ready

In this case, when sharing is not configured, would it be possible to skip the setTimeSlice call entirely?
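
Something along these lines is what I have in mind (a minimal, untested sketch based on the SetTimeSlice code above, not a final implementation):

func (t *TimeSlicingManager) SetTimeSlice(devices *PreparedDevices, config *nascrd.TimeSlicingConfig) error {
    if devices.Mig != nil {
        return fmt.Errorf("setting a TimeSlice duration on MIG devices is unsupported")
    }
    // Proposed change (sketch): when no time-slicing config was requested,
    // skip the compute-mode/time-slice calls so that nvidia-smi is never
    // invoked for GPUs that do not support compute-policy (e.g. Tesla P4).
    if config == nil || config.TimeSlice == nil {
        return nil
    }
    timeSlice := *config.TimeSlice
    if err := t.nvdevlib.setComputeMode(devices.UUIDs(), "DEFAULT"); err != nil {
        return fmt.Errorf("error setting compute mode: %w", err)
    }
    if err := t.nvdevlib.setTimeSlice(devices.UUIDs(), timeSlice.Int()); err != nil {
        return fmt.Errorf("error setting time slice: %w", err)
    }
    return nil
}

(One caveat: with this sketch the GPU would no longer be actively reset to the default time slice when sharing is removed.)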

Looking forward to hearing from the community, and then I can try to fix it.

@wawa0210 wawa0210 changed the title DRA does not support Tesla P4 model GPUs because it does not support setting time slices DRA does not support Tesla P4 model GPUs because it does not support setting time slices by nvidia-smi Dec 21, 2023

wawa0210 commented Jan 8, 2024

@klueska @elezar friendly ping


elezar commented Jan 23, 2024

Hi. Sorry @wawa0210.

We have been focused on other development for the past couple of weeks.

It may make sense to not trigger the nvidia-smi call if sharing is not set. Would you be willing to create a PR with a proposal for us to review?


klueska commented Jan 23, 2024

It's called every time at the moment to ensure that, when sharing is not set, the GPU gets reset to the default time slice (in case it had been set to something else previously).

A better check might be to ensure that the architecture is Kepler+ before attempting to make the call.
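
Roughly something like the following guard (a sketch only; the helper name, the GpuInfo field, and the compute-capability cutoff are illustrative and would need to be confirmed, not the actual driver code):

// Sketch: gate the nvidia-smi compute-policy call on the GPU's CUDA compute
// capability. Assumes "strconv" and "strings" are imported.
func supportsTimeSlicePolicy(gpu *GpuInfo) bool {
    // Parse the major compute capability, e.g. "6.1" for the Tesla P4 above.
    parts := strings.Split(gpu.CudaComputeCapability, ".")
    major, err := strconv.Atoi(parts[0])
    if err != nil {
        return false
    }
    const minSupportedMajor = 7 // placeholder only; the real cutoff needs to be confirmed
    return major >= minSupportedMajor
}

// In SetTimeSlice, something like:
//   if !supportsTimeSlicePolicy(gpuInfo) {
//       return nil // or log and skip, rather than failing the claim
//   }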

@wawa0210

A better check might be to ensure that the architecture is Kepler+ before attempting to make the call.

I haven't been able to find accurate documentation describing which architectures support time-slice settings. Is there a reliable reference available?

@wawa0210

Hi. Sorry @wawa0210.

We have been focussed on other development for the past couple of weeks.

It may make sense to not trigger the nvidia-smi call if sharing is not set. Would you be willing to create a PR with a proposal for us to review?

okk

@klueska klueska mentioned this issue Jan 24, 2024
@klueska klueska added the bug Issue/PR to expose/discuss/fix a bug label Jan 25, 2024
@klueska klueska modified the milestone: v0.1.0 Jan 26, 2024