
DRA does not support Tesla P4 model GPUs because it does not support setting time slices by nvidia-smi #41

Open
wawa0210 opened this issue Dec 21, 2023 · 5 comments · May be fixed by #58
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments


wawa0210 commented Dec 21, 2023

When I ran DRA on a Tesla P4 node, I found that the pod failed to start.

environment

K8s version: v1.27.5
k8s-dra-driver: latest main branch

what happened

A pod deployed in the Tesla P4 environment that claims one GPU fails with the following error:

E1221 04:43:21.238356       1 nvlib.go:489]
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported
E1221 04:43:21.238522       1 nonblockinggrpcserver.go:127] "dra: handling request failed" err="error preparing devices for claim 5dc94ce6-1e6e-4359-bc73-3d2039797ff0: error setting up sharing: error setting timeslice for 5dc94ce6-1e6e-4359-bc73-3d2039797ff0: error setting time slice: error running nvidia-smi: exit status 3" requestID=2 request="&NodePrepareResourceRequest{Namespace:gpu-test1,ClaimUid:5dc94ce6-1e6e-4359-bc73-3d2039797ff0,ClaimName:pod1-gpu,ResourceHandle:,}"

Digging in, I found that even when the GPU is not configured for sharing, nvidia-smi compute-policy -i <uuid> --set-timeslice 0 is still invoked. The Tesla P4 does not support this command, so it fails:

root@nvidia-dcgm-exporter-pvgr8:/# nvidia-smi
Thu Dec 21 04:58:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:13:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      0MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@nvidia-dcgm-exporter-pvgr8:/# nvidia-smi -L
GPU 0: Tesla P4 (UUID: GPU-e290caca-2f0c-9582-acab-67a142b61ffa)
root@nvidia-dcgm-exporter-pvgr8:/# nvidia-smi compute-policy -i GPU-e290caca-2f0c-9582-acab-67a142b61ffa --set-timeslice 0
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

code ref

func (t *TimeSlicingManager) SetTimeSlice(devices *PreparedDevices, config *nascrd.TimeSlicingConfig) error {
    if devices.Mig != nil {
        return fmt.Errorf("setting a TimeSlice duration on MIG devices is unsupported")
    }
    // Even when no config is provided, the default time slice is applied,
    // so the nvidia-smi call below runs for unshared GPUs as well.
    timeSlice := nascrd.DefaultTimeSlice
    if config != nil && config.TimeSlice != nil {
        timeSlice = *config.TimeSlice
    }
    err := t.nvdevlib.setComputeMode(devices.UUIDs(), "DEFAULT")
    if err != nil {
        return fmt.Errorf("error setting compute mode: %w", err)
    }
    err = t.nvdevlib.setTimeSlice(devices.UUIDs(), timeSlice.Int())
    if err != nil {
        return fmt.Errorf("error setting time slice: %w", err)
    }
    return nil
}

Steps to reproduce

Test YAML:

cat <<EOF | kubectl apply -f -

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1
 
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: gpu.nvidia.com
spec:
  spec:
    resourceClassName: gpu.nvidia.com
 
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: chrstnhntschl/gpu_burn
    args: ["3600"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu.nvidia.com
EOF

other information

NAS info

apiVersion: nas.gpu.resource.nvidia.com/v1alpha1
kind: NodeAllocationState
metadata:
  creationTimestamp: "2023-12-20T11:43:30Z"
  generation: 41
  name: 172-30-43-122
  namespace: nvidia-dra-driver
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: 172-30-43-122
    uid: 0e49e1e1-e0b5-4bfc-a89c-286262f6265d
  resourceVersion: "121350"
  uid: 5b600060-08d4-4525-95d5-02ec746e7c3c
spec:
  allocatableDevices:
  - gpu:
      architecture: Pascal
      brand: Tesla
      cudaComputeCapability: "6.1"
      index: 0
      memoryBytes: 8053063680
      migEnabled: false
      productName: Tesla P4
      uuid: GPU-e290caca-2f0c-9582-acab-67a142b61ffa
  allocatedClaims:
    5dc94ce6-1e6e-4359-bc73-3d2039797ff0:
      claimInfo:
        name: pod1-gpu
        namespace: gpu-test1
        uid: 5dc94ce6-1e6e-4359-bc73-3d2039797ff0
      gpu:
        devices:
        - uuid: GPU-e290caca-2f0c-9582-acab-67a142b61ffa
status: Ready

In this case, when sharing is not configured, would it be possible to skip the setTimeSlice call entirely?
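
Something along these lines is what I have in mind (a minimal, untested sketch based on the SetTimeSlice code above, not a final implementation):

func (t *TimeSlicingManager) SetTimeSlice(devices *PreparedDevices, config *nascrd.TimeSlicingConfig) error {
    if devices.Mig != nil {
        return fmt.Errorf("setting a TimeSlice duration on MIG devices is unsupported")
    }
    // Proposed change (sketch): when no time-slicing config was requested,
    // skip the compute-mode/time-slice calls so that nvidia-smi is never
    // invoked for GPUs that do not support compute-policy (e.g. Tesla P4).
    if config == nil || config.TimeSlice == nil {
        return nil
    }
    timeSlice := *config.TimeSlice
    if err := t.nvdevlib.setComputeMode(devices.UUIDs(), "DEFAULT"); err != nil {
        return fmt.Errorf("error setting compute mode: %w", err)
    }
    if err := t.nvdevlib.setTimeSlice(devices.UUIDs(), timeSlice.Int()); err != nil {
        return fmt.Errorf("error setting time slice: %w", err)
    }
    return nil
}

(One caveat: with this sketch the GPU would no longer be actively reset to the default time slice when sharing is removed.)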

Looking forward to hearing from the community, and then I can try to fix it.

@wawa0210 wawa0210 changed the title DRA does not support Tesla P4 model GPUs because it does not support setting time slices DRA does not support Tesla P4 model GPUs because it does not support setting time slices by nvidia-smi Dec 21, 2023

wawa0210 commented Jan 8, 2024

@klueska @elezar friendly ping


elezar commented Jan 23, 2024

Hi. Sorry @wawa0210.

We have been focused on other development for the past couple of weeks.

It may make sense to not trigger the nvidia-smi call if sharing is not set. Would you be willing to create a PR with a proposal for us to review?


klueska commented Jan 23, 2024

It's called every time at the moment to ensure that, when sharing is not set, the GPU gets reset to the default time slice (in case it had been set to something else previously).

A better check might be to ensure that the architecture is Kepler+ before attempting to make the call.
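
Roughly something like the following guard (a sketch only; the helper name, the GpuInfo field, and the compute-capability cutoff are illustrative and would need to be confirmed, not the actual driver code):

// Sketch: gate the nvidia-smi compute-policy call on the GPU's CUDA compute
// capability. Assumes "strconv" and "strings" are imported.
func supportsTimeSlicePolicy(gpu *GpuInfo) bool {
    // Parse the major compute capability, e.g. "6.1" for the Tesla P4 above.
    parts := strings.Split(gpu.CudaComputeCapability, ".")
    major, err := strconv.Atoi(parts[0])
    if err != nil {
        return false
    }
    const minSupportedMajor = 7 // placeholder only; the real cutoff needs to be confirmed
    return major >= minSupportedMajor
}

// In SetTimeSlice, something like:
//   if !supportsTimeSlicePolicy(gpuInfo) {
//       return nil // or log and skip, rather than failing the claim
//   }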

@wawa0210

A better check might be to ensure that the architecture is Kepler+ before attempting to make the call.

I haven't been able to find accurate documentation describing which architectures support time-slice settings. Is there a reliable reference available?

@wawa0210

Hi. Sorry @wawa0210.

We have been focussed on other development for the past couple of weeks.

It may make sense to not trigger the nvidia-smi call if sharing is not set. Would you be willing to create a PR with a proposal for us to review?

okk

@klueska klueska mentioned this issue Jan 24, 2024
@klueska klueska added the bug Issue/PR to expose/discuss/fix a bug label Jan 25, 2024
@klueska klueska modified the milestone: v0.1.0 Jan 26, 2024