Metax GPU topo-awareness support (#574)
* support metax topology-aware scheduling

Signed-off-by: root <[email protected]>

* fix ut

Signed-off-by: root <[email protected]>

---------

Signed-off-by: root <[email protected]>
Co-authored-by: root <[email protected]>
archlitchi and root authored Oct 28, 2024
1 parent 3f24a36 commit b030525
Showing 29 changed files with 629 additions and 61 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -67,6 +67,7 @@ will see 3G device memory inside container
[![iluvatar GPU](https://img.shields.io/badge/Iluvatar-GPU-blue)](docs/iluvatar-gpu-support.md)
[![mthreads GPU](https://img.shields.io/badge/Mthreads-GPU-blue)](docs/mthreads-support.md)
[![ascend NPU](https://img.shields.io/badge/Ascend-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README.md)
[![metax GPU](https://img.shields.io/badge/metax-GPU-blue)](docs/metax-support.md)

## Architect

1 change: 1 addition & 0 deletions README_cn.md
@@ -24,6 +24,7 @@
[![天数智芯 GPU](https://img.shields.io/badge/天数智芯-GPU-blue)](docs/iluvatar-gpu-support_cn.md)
[![摩尔线程 GPU](https://img.shields.io/badge/摩尔线程-GPU-blue)](docs/mthreads-support_cn.md)
[![华为昇腾 NPU](https://img.shields.io/badge/华为昇腾-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README_cn.md)
[![沐曦 GPU](https://img.shields.io/badge/metax-GPU-blue)](docs/metax-support_cn.md)


## Introduction
5 changes: 2 additions & 3 deletions cmd/scheduler/main.go
@@ -22,7 +22,6 @@ import (
"github.com/Project-HAMi/HAMi/pkg/device"
"github.com/Project-HAMi/HAMi/pkg/scheduler"
"github.com/Project-HAMi/HAMi/pkg/scheduler/config"
"github.com/Project-HAMi/HAMi/pkg/scheduler/policy"
"github.com/Project-HAMi/HAMi/pkg/scheduler/routes"
"github.com/Project-HAMi/HAMi/pkg/util"
"github.com/Project-HAMi/HAMi/pkg/version"
@@ -58,8 +57,8 @@ func init() {
rootCmd.Flags().Int32Var(&config.DefaultMem, "default-mem", 0, "default gpu device memory to allocate")
rootCmd.Flags().Int32Var(&config.DefaultCores, "default-cores", 0, "default gpu core percentage to allocate")
rootCmd.Flags().Int32Var(&config.DefaultResourceNum, "default-gpu", 1, "default gpu to allocate")
rootCmd.Flags().StringVar(&config.NodeSchedulerPolicy, "node-scheduler-policy", policy.NodeSchedulerPolicyBinpack.String(), "node scheduler policy")
rootCmd.Flags().StringVar(&config.GPUSchedulerPolicy, "gpu-scheduler-policy", policy.GPUSchedulerPolicySpread.String(), "GPU scheduler policy")
rootCmd.Flags().StringVar(&config.NodeSchedulerPolicy, "node-scheduler-policy", util.NodeSchedulerPolicyBinpack.String(), "node scheduler policy")
rootCmd.Flags().StringVar(&config.GPUSchedulerPolicy, "gpu-scheduler-policy", util.GPUSchedulerPolicySpread.String(), "GPU scheduler policy")
rootCmd.Flags().StringVar(&config.MetricsBindAddress, "metrics-bind-address", ":9395", "The TCP address that the scheduler should bind to for serving prometheus metrics(e.g. 127.0.0.1:9395, :9395)")
rootCmd.Flags().StringToStringVar(&config.NodeLabelSelector, "node-label-selector", nil, "key=value pairs separated by commas")
rootCmd.PersistentFlags().AddGoFlagSet(device.GlobalFlagSet())
65 changes: 65 additions & 0 deletions docs/metax-support.md
@@ -0,0 +1,65 @@
## Introduction

**We now support `metax-tech.com/gpu` by implementing topology awareness among Metax GPUs**:

When multiple GPUs are installed in a single server, each pair of cards is connected either through the same PCIe Switch or through MetaXLink, giving the pair a near-far (bandwidth) relationship. These relationships form a topology among all the cards on the server, as shown in the following figure:

![img](../imgs/metax_topo.jpg)

A user job requests a certain number of `metax-tech.com/gpu` resources, and Kubernetes schedules the pod to a node with enough remaining resources. gpu-device then allocates devices from the node's remaining resources according to the criteria below (a minimal scoring sketch follows the list):
1. MetaXLink takes precedence over PCIe Switch in two ways:
– When two cards are connected by both MetaXLink and a PCIe Switch, the connection is treated as a MetaXLink connection.
– When both the MetaXLink-interconnected and the PCIe-Switch-interconnected resources can satisfy the job request, the MetaXLink-interconnected resources are allocated.

2. When using `node-scheduler-policy=spread`, Metax resources are allocated under the same MetaXLink or PCIe Switch whenever possible, as the following figure shows:

![img](../imgs/metax_spread.jpg)

3. When using `node-scheduler-policy=binpack`, GPU resources are assigned so as to minimize the damage to the MetaXLink topology, as the following figure shows:

![img](../imgs/metax_binpack.jpg)
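
To make the precedence above concrete, here is a minimal, self-contained Go sketch of scoring a candidate set of GPUs so that MetaXLink pairs outrank PCIe Switch pairs. This is an illustration only, not HAMi's actual implementation; the `Link`, `linkBetween`, and `scoreCandidate` names are hypothetical.

```go
package main

import "fmt"

// Link describes how two GPUs on the same node are connected.
type Link int

const (
	LinkNone Link = iota
	LinkPCIeSwitch
	LinkMetaXLink
)

// linkBetween returns the effective link between two cards: when both a
// MetaXLink and a PCIe Switch connection exist, MetaXLink wins.
func linkBetween(hasMetaXLink, hasPCIeSwitch bool) Link {
	switch {
	case hasMetaXLink:
		return LinkMetaXLink
	case hasPCIeSwitch:
		return LinkPCIeSwitch
	default:
		return LinkNone
	}
}

// scoreCandidate gives a higher score to GPU sets whose members are
// interconnected by MetaXLink, falling back to PCIe Switch.
func scoreCandidate(links [][]Link, candidate []int) int {
	score := 0
	for i := 0; i < len(candidate); i++ {
		for j := i + 1; j < len(candidate); j++ {
			switch links[candidate[i]][candidate[j]] {
			case LinkMetaXLink:
				score += 10 // strongly prefer MetaXLink pairs
			case LinkPCIeSwitch:
				score += 1
			}
		}
	}
	return score
}

func main() {
	// Four free GPUs: 0-1 and 2-3 are MetaXLink pairs; every other pair only
	// shares a PCIe Switch. Pairs with both links resolve to MetaXLink.
	links := [][]Link{
		{LinkNone, linkBetween(true, true), linkBetween(false, true), linkBetween(false, true)},
		{linkBetween(true, true), LinkNone, linkBetween(false, true), linkBetween(false, true)},
		{linkBetween(false, true), linkBetween(false, true), LinkNone, linkBetween(true, true)},
		{linkBetween(false, true), linkBetween(false, true), linkBetween(true, true), LinkNone},
	}
	fmt.Println(scoreCandidate(links, []int{0, 1})) // 10: MetaXLink pair is preferred
	fmt.Println(scoreCandidate(links, []int{0, 2})) // 1: PCIe-Switch-only pair
}
```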

## Important Notes

1. Device sharing is not supported yet.

2. These features have been tested on MXC500.

## Prerequisites

* Metax GPU extensions >= 0.8.0
* Kubernetes >= 1.23

## Enabling topology-aware scheduling

* Deploy Metax GPU Extensions on Metax nodes (please consult your device provider to acquire its package and documentation)

* Deploy HAMi according to README.md

## Running Metax jobs

Metax GPUs can now be requested by a container
using the `metax-tech.com/gpu` resource type:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
```

> **NOTICE:** *You can find more examples in the [examples/metax folder](../examples/metax/).*

66 changes: 66 additions & 0 deletions docs/metax-support_cn.md
@@ -0,0 +1,66 @@
## Introduction

**We support optimized, topology-based scheduling for Metax devices**:

When multiple GPUs are installed in a single server, each pair of cards has a near-far (higher or lower bandwidth) relationship, depending on whether the two cards sit under the same PCIe Switch or MetaXLink. These relationships form a topology over all the cards on the server, as shown in the figure below.

![img](../imgs/metax_topo.jpg)

A user job requests a certain number of `metax-tech.com/gpu` resources; Kubernetes selects a node whose remaining resources satisfy the request and schedules the Pod there. gpu-device then handles the allocation of the node's remaining resources and assigns GPU devices to the job's containers with the following priority logic (an illustrative policy-selection sketch follows the list):
1. MetaXLink takes precedence over PCIe Switch, in two senses:
– When two cards are connected by both MetaXLink and a PCIe Switch, the connection is treated as a MetaXLink connection.
– When both the MetaXLink-interconnected and the PCIe-Switch-interconnected GPUs remaining on the server can satisfy the job request, the MetaXLink-interconnected resources are allocated.

2. When a job uses `node-scheduler-policy=spread`, the allocated GPUs are placed under the same MetaXLink or PCIe Switch whenever possible, as shown below:

![img](../imgs/metax_spread.jpg)

3. When using `node-scheduler-policy=binpack`, GPUs are allocated so that the remaining resources stay as intact as possible, as shown below:

![img](../imgs/metax_binpack.jpg)
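
For illustration only, the following Go sketch shows one way `spread` and `binpack` can differ when choosing among MetaXLink groups of free GPUs. The `group` type, the `pickGroup` helper, and the greedy rules are hypothetical and are not HAMi's actual algorithm: here `spread` favors the largest group that fits the request, while `binpack` favors the smallest group that still fits, leaving larger groups intact.

```go
package main

import "fmt"

// group holds the indices of free GPUs that share one MetaXLink domain.
type group []int

// pickGroup chooses which MetaXLink group to allocate `want` GPUs from.
// Hypothetical greedy rule for illustration only:
//   - "spread":  pick the largest group that fits, giving the job the
//     roomiest single interconnect domain.
//   - "binpack": pick the smallest group that still fits, so larger
//     domains stay intact for later jobs.
func pickGroup(groups []group, want int, policy string) (int, bool) {
	best := -1
	for i, g := range groups {
		if len(g) < want {
			continue // this group cannot satisfy the request on its own
		}
		if best == -1 {
			best = i
			continue
		}
		switch policy {
		case "spread":
			if len(g) > len(groups[best]) {
				best = i
			}
		case "binpack":
			if len(g) < len(groups[best]) {
				best = i
			}
		}
	}
	return best, best != -1
}

func main() {
	// Two MetaXLink domains with free GPUs: {0,1} and {2,3,4,5}.
	groups := []group{{0, 1}, {2, 3, 4, 5}}
	s, _ := pickGroup(groups, 2, "spread")
	b, _ := pickGroup(groups, 2, "binpack")
	fmt.Println("spread picks group", s)  // 1: the four-GPU domain
	fmt.Println("binpack picks group", b) // 0: the two-GPU domain, keeping the larger one whole
}
```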

## Important Notes

1. Slicing (device sharing) of Metax devices is not supported yet; only whole cards can be requested.

2. These features have been tested on MXC500.

## Prerequisites

* Metax GPU extensions >= 0.8.0
* Kubernetes >= 1.23

## Enabling topology-aware scheduling for Metax devices

* Deploy Metax GPU Extensions (please contact your device provider to obtain the package)

* Deploy HAMi according to README.md

## Running Metax jobs

A typical Metax job looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
```

> **NOTICE:** *You can find more examples in the [examples/metax folder](../examples/metax/).*

15 changes: 15 additions & 0 deletions examples/metax/binpack.yaml
@@ -0,0 +1,15 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "binpack" # when this parameter is set to binpack, the scheduler will try to minimize the topology loss.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
13 changes: 13 additions & 0 deletions examples/metax/default_use.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
15 changes: 15 additions & 0 deletions examples/metax/spread.yaml
@@ -0,0 +1,15 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
Binary file added imgs/metax_binpack.jpg
Binary file added imgs/metax_spread.jpg
Binary file added imgs/metax_topo.jpg
4 changes: 4 additions & 0 deletions pkg/device/ascend/device.go
@@ -272,3 +272,7 @@ func (dev *Devices) GenerateResourceRequests(ctr *corev1.Container) util.Contain
func (dev *Devices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *Devices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}
4 changes: 4 additions & 0 deletions pkg/device/cambricon/device.go
@@ -312,3 +312,7 @@ func (dev *CambriconDevices) PatchAnnotations(annoinput *map[string]string, pd u
func (dev *CambriconDevices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *CambriconDevices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}
10 changes: 6 additions & 4 deletions pkg/device/devices.go
@@ -27,6 +27,7 @@ import (
"github.com/Project-HAMi/HAMi/pkg/device/cambricon"
"github.com/Project-HAMi/HAMi/pkg/device/hygon"
"github.com/Project-HAMi/HAMi/pkg/device/iluvatar"
"github.com/Project-HAMi/HAMi/pkg/device/metax"
"github.com/Project-HAMi/HAMi/pkg/device/mthreads"
"github.com/Project-HAMi/HAMi/pkg/device/nvidia"
"github.com/Project-HAMi/HAMi/pkg/util"
@@ -52,6 +53,7 @@ type Devices interface {
GenerateResourceRequests(ctr *corev1.Container) util.ContainerDeviceRequest
PatchAnnotations(annoinput *map[string]string, pd util.PodDevices) map[string]string
CustomFilterRule(allocated *util.PodDevices, toAllicate util.ContainerDevices, device *util.DeviceUsage) bool
ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32
// This should not be associated with a specific device object
//ParseConfig(fs *flag.FlagSet)
}
@@ -77,15 +79,14 @@ func InitDevices() {
devices[hygon.HygonDCUDevice] = hygon.InitDCUDevice()
devices[iluvatar.IluvatarGPUDevice] = iluvatar.InitIluvatarDevice()
devices[mthreads.MthreadsGPUDevice] = mthreads.InitMthreadsDevice()
//devices[d.AscendDevice] = d.InitDevice()
//devices[ascend.Ascend310PName] = ascend.InitAscend310P()
devices[metax.MetaxGPUDevice] = metax.InitMetaxDevice()

DevicesToHandle = append(DevicesToHandle, nvidia.NvidiaGPUCommonWord)
DevicesToHandle = append(DevicesToHandle, cambricon.CambriconMLUCommonWord)
DevicesToHandle = append(DevicesToHandle, hygon.HygonDCUCommonWord)
DevicesToHandle = append(DevicesToHandle, iluvatar.IluvatarGPUCommonWord)
DevicesToHandle = append(DevicesToHandle, mthreads.MthreadsGPUCommonWord)
//DevicesToHandle = append(DevicesToHandle, d.AscendDevice)
//DevicesToHandle = append(DevicesToHandle, ascend.Ascend310PName)
DevicesToHandle = append(DevicesToHandle, metax.MetaxGPUCommonWord)
for _, dev := range ascend.InitDevices() {
devices[dev.CommonWord()] = dev
DevicesToHandle = append(DevicesToHandle, dev.CommonWord())
@@ -143,6 +144,7 @@ func GlobalFlagSet() *flag.FlagSet {
iluvatar.ParseConfig(fs)
nvidia.ParseConfig(fs)
mthreads.ParseConfig(fs)
metax.ParseConfig(fs)
fs.BoolVar(&DebugMode, "debug", false, "debug mode")
klog.InitFlags(fs)
return fs
4 changes: 4 additions & 0 deletions pkg/device/hygon/device.go
@@ -241,3 +241,7 @@ func (dev *DCUDevices) PatchAnnotations(annoinput *map[string]string, pd util.Po
func (dev *DCUDevices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *DCUDevices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}
4 changes: 4 additions & 0 deletions pkg/device/iluvatar/device.go
@@ -225,3 +225,7 @@ func (dev *IluvatarDevices) GenerateResourceRequests(ctr *corev1.Container) util
func (dev *IluvatarDevices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *IluvatarDevices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}