support skip old architecture version GPU settings time slice #58

wawa0210 · 2024-01-24T15:52:54Z

/kind feature

Decide whether to apply the time-slice setting based on the GPU architecture. If the GPU does not support time-slicing, skip the setting so that the rest of the workflow keeps working normally.

Test Results

Tesla P4 can run demo1 normally
https://github.com/NVIDIA/k8s-dra-driver/blob/main/demo/specs/quickstart/gpu-test1.yaml

➜  k8s-dra-driver git:(main) ✗ kubectl get all -n kubectl -n gpu-test1
NAME       READY   STATUS    RESTARTS   AGE
pod/pod2   1/1     Running   0          14m
➜  k8s-dra-driver git:(main) ✗ kubectl -n gpu-test1 exec -it pod2 nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Wed Jan 24 15:58:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:13:00.0 Off |                    0 |
| N/A   27C    P8               7W /  75W |      0MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+


➜  k8s-dra-driver git:(main) ✗ kubectl get ResourceClaim -n gpu-test1 pod2-gpu -o yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  creationTimestamp: "2024-01-24T15:43:31Z"
  finalizers:
  - gpu.resource.nvidia.com/deletion-protection
  name: pod2-gpu
  namespace: gpu-test1
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: pod2
    uid: 09c9eae5-44b2-44ed-ba92-a4212c8edfff
  resourceVersion: "53101"
  uid: 0e2c7b7c-7357-4742-8eb1-101b0c673fd1
spec:
  allocationMode: WaitForFirstConsumer
  resourceClassName: gpu.nvidia.com
status:
  allocation:
    availableOnNodes:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - 172-30-40-100
    shareable: true
  driverName: gpu.resource.nvidia.com
  reservedFor:
  - name: pod2
    resource: pods
    uid: 09c9eae5-44b2-44ed-ba92-a4212c8edfff

wawa0210 · 2024-01-24T15:55:18Z

@klueska @elezar

friendly ping
Looking forward to your review. In particular, detactSupportTimeSliceByArch needs to cover more architectures and needs a complete support matrix.

Can you help me fill in this table?

| arch | supports timeslice |
| --- | --- |
| Fermi | unknown |
| Kepler | unknown |
| Maxwell | unknown |
| Pascal | false |
| Volta | unknown |
| Turing | true |
| Ampere | true |
| Hopper | true |
| Ada | true |
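
For discussion, the table above could be encoded as a lookup with three states, so that "unknown" stays distinct from "false". This is only a sketch: the `Support` type and `lookupTimeSliceSupport` are hypothetical names for illustration and are not the function in this PR.

```go
package main

import "fmt"

// Support is a hypothetical three-state result, so that architectures
// that have not been verified are not silently treated as unsupported.
type Support int

const (
	Unknown Support = iota // not yet confirmed either way
	Unsupported
	Supported
)

// String makes the value readable in logs.
func (s Support) String() string {
	switch s {
	case Supported:
		return "supported"
	case Unsupported:
		return "unsupported"
	default:
		return "unknown"
	}
}

// timeSliceSupport mirrors the rows of the table that have a definite
// answer. Rows marked "unknown" are left out on purpose, so a lookup
// for them returns the zero value, Unknown.
var timeSliceSupport = map[string]Support{
	"Pascal": Unsupported,
	"Turing": Supported,
	"Ampere": Supported,
	"Hopper": Supported,
	"Ada":    Supported,
}

// lookupTimeSliceSupport is a hypothetical helper (assumption: the
// architecture is available as a plain string name).
func lookupTimeSliceSupport(arch string) Support {
	return timeSliceSupport[arch]
}

func main() {
	for _, arch := range []string{"Pascal", "Volta", "Ampere"} {
		fmt.Printf("%s: %s\n", arch, lookupTimeSliceSupport(arch))
	}
}
```

Keeping "unknown" as its own state leaves the caller free to decide whether unverified architectures should be skipped or rejected.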

elezar

Thanks for the PR.

Instead of making this change in:

func (l deviceLib) setTimeSlice(uuids []string, timeSlice int) error {

We could perform the checks in:

func (t *TimeSlicingManager) SetTimeSlice(devices *PreparedDevices, config *nascrd.TimeSlicingConfig) error {

This function already checks for MIG devices, so adding a loop over the devices there to filter out those that do not support timeslicing would be simpler than rediscovering the available devices.

Furthermore, PreparedDevices already stores the brand and other relevant information.

Another question that I have is what we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it. Do we expect an error to be raised, or do we want to continue, assuming blocking operation if the GPU is shared?
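
A rough sketch of the filtering loop suggested above, under stated assumptions: `gpu` is a hypothetical stand-in for the per-device information that PreparedDevices stores, and the allow-list contains only the architectures confirmed as supporting timeslicing earlier in this thread. It is not the driver's actual code.

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for the per-device data already
// available in PreparedDevices (field names are assumptions).
type gpu struct {
	UUID         string
	Architecture string
}

// knownSupported lists only architectures confirmed in this thread.
// Treating everything else as unsupported is an assumption.
var knownSupported = map[string]bool{
	"Turing": true,
	"Ampere": true,
	"Hopper": true,
	"Ada":    true,
}

// filterTimeSliceable splits devices into those whose time slice may be
// set and those that are skipped, without rediscovering devices.
func filterTimeSliceable(devices []gpu) (eligible, skipped []string) {
	for _, d := range devices {
		if knownSupported[d.Architecture] {
			eligible = append(eligible, d.UUID)
		} else {
			skipped = append(skipped, d.UUID)
		}
	}
	return eligible, skipped
}

func main() {
	devices := []gpu{
		{UUID: "GPU-example-a", Architecture: "Pascal"},
		{UUID: "GPU-example-b", Architecture: "Ampere"},
	}
	eligible, skipped := filterTimeSliceable(devices)
	fmt.Println("set time slice on:", eligible)
	fmt.Println("skipped:", skipped)
}
```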

cmd/nvidia-dra-plugin/nvlib.go

klueska · 2024-01-29T09:51:15Z

> Another question that I have is what we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it. Do we expect an error to be raised, or do we want to continue, assuming blocking operation if the GPU is shared?

I think we should error out and not start the plugin.

wawa0210 · 2024-01-29T15:40:12Z

> Another question that I have is what we expect the behavior to be when timeslicing is configured on a GPU that doesn't support it. Do we expect an error to be raised, or do we want to continue, assuming blocking operation if the GPU is shared?

> I think we should error out and not start the plugin.

Do you mean that a GPU that does not support setting a time slice will not be able to use this feature, and the plugin will error out?

elezar

Thanks for the changes.

It's definitely cleaner to do this at the higher level.

Also note that we could consider just exiting early if timeslicing is not supported by at least one device.

cmd/nvidia-dra-plugin/sharing.go

elezar

Thanks for the modifications.

It would be good to get @klueska's thoughts on my suggestions too.

cmd/nvidia-dra-plugin/sharing.go

wawa0210 · 2024-03-01T03:14:08Z

@elezar @klueska friendly ping

elezar · 2024-03-01T16:21:24Z

> @elezar @klueska friendly ping

@wawa0210 as already requested, it would be better to return an error if timeslicing is configured on a GPU that doesn't support it. This aligns with what is done for MIG.

I understand that crashing the plugin at this stage may not be desirable, but that is a separate issue that we would need to address.

wawa0210 · 2024-03-05T06:22:02Z

> @elezar @klueska friendly ping

> @wawa0210 as already requested, it would be better to return an error if timeslicing is configured on a GPU that doesn't support it. This aligns with what is done for MIG.

> I understand that crashing the plugin at this stage may not be desirable, but that is a separate issue that we would need to address.

Agreed, updated. Maybe we should discuss this scenario in a separate issue --> #81

Signed-off-by: wawa0210 <[email protected]>

wawa0210 · 2024-04-03T03:26:41Z

@elezar PTAL

klueska · 2024-09-13T12:17:43Z

The code base has undergone a major overhaul to accommodate the API changes in Kubernetes v1.31. I imagine this code needs a rebase.

wawa0210 force-pushed the main branch from c13215a to 35f009f Compare January 24, 2024 15:53

wawa0210 changed the title ~~skip old architecture version GPU settings time slice~~ support skip old architecture version GPU settings time slice Jan 24, 2024

wawa0210 force-pushed the main branch from 35f009f to b23c1b1 Compare January 25, 2024 05:55

elezar requested review from elezar and klueska January 25, 2024 09:43

elezar requested changes Jan 29, 2024

elezar self-assigned this Jan 29, 2024

wawa0210 force-pushed the main branch 2 times, most recently from f08aeb2 to 0dc4244 Compare January 29, 2024 15:17

elezar requested changes Jan 29, 2024

wawa0210 force-pushed the main branch 2 times, most recently from b46450f to a67b164 Compare January 29, 2024 15:59

wawa0210 requested a review from elezar January 29, 2024 16:01

wawa0210 force-pushed the main branch from 2615b47 to 9de24f5 Compare January 30, 2024 01:57

elezar requested changes Jan 31, 2024

wawa0210 force-pushed the main branch from 9de24f5 to a672dd7 Compare February 1, 2024 10:37

elezar mentioned this pull request Feb 19, 2024

Add resource.sharing-strategy labels NVIDIA/k8s-device-plugin#503

Merged

wawa0210 requested a review from elezar March 1, 2024 03:13

wawa0210 force-pushed the main branch from a672dd7 to 6722a16 Compare March 5, 2024 06:20

wawa0210 mentioned this pull request Mar 5, 2024

When the node GPU does not support setting timeslice, the plugin will crash directly. #81

Open

wawa0210 closed this Apr 3, 2024

wawa0210 force-pushed the main branch from 6722a16 to 0e01612 Compare April 3, 2024 02:49

wawa0210 reopened this Apr 3, 2024

wawa0210 force-pushed the main branch from 395dabc to 945f629 Compare April 3, 2024 03:18

wawa0210 force-pushed the main branch from 945f629 to 99a16f1 Compare April 3, 2024 03:20

skip old architecture version GPU settings time slice

40dae3c

Signed-off-by: wawa0210 <[email protected]>

wawa0210 force-pushed the main branch from 99a16f1 to 40dae3c Compare April 3, 2024 03:25
