Metax GPU topo-awareness support (#574)
* support metax topology-aware scheduling

Signed-off-by: root <[email protected]>

* fix ut

Signed-off-by: root <[email protected]>

---------

Signed-off-by: root <[email protected]>
Co-authored-by: root <[email protected]>
archlitchi and root authored Oct 28, 2024
1 parent 3f24a36 commit b030525
Showing 29 changed files with 629 additions and 61 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -67,6 +67,7 @@ will see 3G device memory inside container
[![iluvatar GPU](https://img.shields.io/badge/Iluvatar-GPU-blue)](docs/iluvatar-gpu-support.md)
[![mthreads GPU](https://img.shields.io/badge/Mthreads-GPU-blue)](docs/mthreads-support.md)
[![ascend NPU](https://img.shields.io/badge/Ascend-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README.md)
[![metax GPU](https://img.shields.io/badge/metax-GPU-blue)](docs/metax-support.md)

## Architect

1 change: 1 addition & 0 deletions README_cn.md
@@ -24,6 +24,7 @@
[![天数智芯 GPU](https://img.shields.io/badge/天数智芯-GPU-blue)](docs/iluvatar-gpu-support_cn.md)
[![摩尔线程 GPU](https://img.shields.io/badge/摩尔线程-GPU-blue)](docs/mthreads-support_cn.md)
[![华为昇腾 NPU](https://img.shields.io/badge/华为昇腾-NPU-blue)](https://github.com/Project-HAMi/ascend-device-plugin/blob/main/README_cn.md)
[![沐曦 GPU](https://img.shields.io/badge/metax-GPU-blue)](docs/metax-support_cn.md)


## Introduction
5 changes: 2 additions & 3 deletions cmd/scheduler/main.go
@@ -22,7 +22,6 @@ import (
"github.com/Project-HAMi/HAMi/pkg/device"
"github.com/Project-HAMi/HAMi/pkg/scheduler"
"github.com/Project-HAMi/HAMi/pkg/scheduler/config"
"github.com/Project-HAMi/HAMi/pkg/scheduler/policy"
"github.com/Project-HAMi/HAMi/pkg/scheduler/routes"
"github.com/Project-HAMi/HAMi/pkg/util"
"github.com/Project-HAMi/HAMi/pkg/version"
@@ -58,8 +57,8 @@ func init() {
rootCmd.Flags().Int32Var(&config.DefaultMem, "default-mem", 0, "default gpu device memory to allocate")
rootCmd.Flags().Int32Var(&config.DefaultCores, "default-cores", 0, "default gpu core percentage to allocate")
rootCmd.Flags().Int32Var(&config.DefaultResourceNum, "default-gpu", 1, "default gpu to allocate")
rootCmd.Flags().StringVar(&config.NodeSchedulerPolicy, "node-scheduler-policy", policy.NodeSchedulerPolicyBinpack.String(), "node scheduler policy")
rootCmd.Flags().StringVar(&config.GPUSchedulerPolicy, "gpu-scheduler-policy", policy.GPUSchedulerPolicySpread.String(), "GPU scheduler policy")
rootCmd.Flags().StringVar(&config.NodeSchedulerPolicy, "node-scheduler-policy", util.NodeSchedulerPolicyBinpack.String(), "node scheduler policy")
rootCmd.Flags().StringVar(&config.GPUSchedulerPolicy, "gpu-scheduler-policy", util.GPUSchedulerPolicySpread.String(), "GPU scheduler policy")
rootCmd.Flags().StringVar(&config.MetricsBindAddress, "metrics-bind-address", ":9395", "The TCP address that the scheduler should bind to for serving prometheus metrics(e.g. 127.0.0.1:9395, :9395)")
rootCmd.Flags().StringToStringVar(&config.NodeLabelSelector, "node-label-selector", nil, "key=value pairs separated by commas")
rootCmd.PersistentFlags().AddGoFlagSet(device.GlobalFlagSet())
65 changes: 65 additions & 0 deletions docs/metax-support.md
@@ -0,0 +1,65 @@
## Introduction

**We now support `metax-tech.com/gpu` by implementing topology awareness among Metax GPUs**:

When multiple GPUs are installed in a single server, each pair of cards is connected either through the same PCIe Switch or through MetaXLink, giving the pair a near-far (bandwidth) relationship. These relationships form a topology among all the cards on the server, as shown in the following figure:

![img](../imgs/metax_topo.jpg)

A user job requests a certain number of `metax-tech.com/gpu` resources, and Kubernetes schedules the pod to a node with enough remaining resources. gpu-device then allocates devices from the node's remaining resources according to the criteria below (a minimal scoring sketch follows the list):
1. MetaXLink takes precedence over PCIe Switch in two ways:
– When two cards are connected by both MetaXLink and a PCIe Switch, the connection is treated as a MetaXLink connection.
– When both the MetaXLink-interconnected and the PCIe-Switch-interconnected resources can satisfy the job request, the MetaXLink-interconnected resources are allocated.

2. When using `node-scheduler-policy=spread`, Metax resources are allocated under the same MetaXLink or PCIe Switch whenever possible, as the following figure shows:

![img](../imgs/metax_spread.jpg)

3. When using `node-scheduler-policy=binpack`, GPU resources are assigned so as to minimize the damage to the MetaXLink topology, as the following figure shows:

![img](../imgs/metax_binpack.jpg)
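
To make the precedence above concrete, here is a minimal, self-contained Go sketch of scoring a candidate set of GPUs so that MetaXLink pairs outrank PCIe Switch pairs. This is an illustration only, not HAMi's actual implementation; the `Link`, `linkBetween`, and `scoreCandidate` names are hypothetical.

```go
package main

import "fmt"

// Link describes how two GPUs on the same node are connected.
type Link int

const (
	LinkNone Link = iota
	LinkPCIeSwitch
	LinkMetaXLink
)

// linkBetween returns the effective link between two cards: when both a
// MetaXLink and a PCIe Switch connection exist, MetaXLink wins.
func linkBetween(hasMetaXLink, hasPCIeSwitch bool) Link {
	switch {
	case hasMetaXLink:
		return LinkMetaXLink
	case hasPCIeSwitch:
		return LinkPCIeSwitch
	default:
		return LinkNone
	}
}

// scoreCandidate gives a higher score to GPU sets whose members are
// interconnected by MetaXLink, falling back to PCIe Switch.
func scoreCandidate(links [][]Link, candidate []int) int {
	score := 0
	for i := 0; i < len(candidate); i++ {
		for j := i + 1; j < len(candidate); j++ {
			switch links[candidate[i]][candidate[j]] {
			case LinkMetaXLink:
				score += 10 // strongly prefer MetaXLink pairs
			case LinkPCIeSwitch:
				score += 1
			}
		}
	}
	return score
}

func main() {
	// Four free GPUs: 0-1 and 2-3 are MetaXLink pairs; every other pair only
	// shares a PCIe Switch. Pairs with both links resolve to MetaXLink.
	links := [][]Link{
		{LinkNone, linkBetween(true, true), linkBetween(false, true), linkBetween(false, true)},
		{linkBetween(true, true), LinkNone, linkBetween(false, true), linkBetween(false, true)},
		{linkBetween(false, true), linkBetween(false, true), LinkNone, linkBetween(true, true)},
		{linkBetween(false, true), linkBetween(false, true), linkBetween(true, true), LinkNone},
	}
	fmt.Println(scoreCandidate(links, []int{0, 1})) // 10: MetaXLink pair is preferred
	fmt.Println(scoreCandidate(links, []int{0, 2})) // 1: PCIe-Switch-only pair
}
```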

## Important Notes

1. Device sharing is not supported yet.

2. These features have been tested on MXC500.

## Prerequisites

* Metax GPU extensions >= 0.8.0
* Kubernetes >= 1.23

## Enabling topology-aware scheduling

* Deploy Metax GPU Extensions on Metax nodes (please consult your device provider to acquire its package and documentation)

* Deploy HAMi according to README.md

## Running Metax jobs

Metax GPUs can now be requested by a container
using the `metax-tech.com/gpu` resource type:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
```

> **NOTICE:** *You can find more examples in the [examples/metax folder](../examples/metax/).*

66 changes: 66 additions & 0 deletions docs/metax-support_cn.md
@@ -0,0 +1,66 @@
## Introduction

**We support optimized, topology-based scheduling for Metax devices**:

When multiple GPUs are installed in a single server, each pair of cards has a near-far (higher or lower bandwidth) relationship, depending on whether the two cards sit under the same PCIe Switch or MetaXLink. These relationships form a topology over all the cards on the server, as shown in the figure below.

![img](../imgs/metax_topo.jpg)

A user job requests a certain number of `metax-tech.com/gpu` resources; Kubernetes selects a node whose remaining resources satisfy the request and schedules the Pod there. gpu-device then handles the allocation of the node's remaining resources and assigns GPU devices to the job's containers with the following priority logic (an illustrative policy-selection sketch follows the list):
1. MetaXLink takes precedence over PCIe Switch, in two senses:
– When two cards are connected by both MetaXLink and a PCIe Switch, the connection is treated as a MetaXLink connection.
– When both the MetaXLink-interconnected and the PCIe-Switch-interconnected GPUs remaining on the server can satisfy the job request, the MetaXLink-interconnected resources are allocated.

2. When a job uses `node-scheduler-policy=spread`, the allocated GPUs are placed under the same MetaXLink or PCIe Switch whenever possible, as shown below:

![img](../imgs/metax_spread.jpg)

3. When using `node-scheduler-policy=binpack`, GPUs are allocated so that the remaining resources stay as intact as possible, as shown below:

![img](../imgs/metax_binpack.jpg)
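
For illustration only, the following Go sketch shows one way `spread` and `binpack` can differ when choosing among MetaXLink groups of free GPUs. The `group` type, the `pickGroup` helper, and the greedy rules are hypothetical and are not HAMi's actual algorithm: here `spread` favors the largest group that fits the request, while `binpack` favors the smallest group that still fits, leaving larger groups intact.

```go
package main

import "fmt"

// group holds the indices of free GPUs that share one MetaXLink domain.
type group []int

// pickGroup chooses which MetaXLink group to allocate `want` GPUs from.
// Hypothetical greedy rule for illustration only:
//   - "spread":  pick the largest group that fits, giving the job the
//     roomiest single interconnect domain.
//   - "binpack": pick the smallest group that still fits, so larger
//     domains stay intact for later jobs.
func pickGroup(groups []group, want int, policy string) (int, bool) {
	best := -1
	for i, g := range groups {
		if len(g) < want {
			continue // this group cannot satisfy the request on its own
		}
		if best == -1 {
			best = i
			continue
		}
		switch policy {
		case "spread":
			if len(g) > len(groups[best]) {
				best = i
			}
		case "binpack":
			if len(g) < len(groups[best]) {
				best = i
			}
		}
	}
	return best, best != -1
}

func main() {
	// Two MetaXLink domains with free GPUs: {0,1} and {2,3,4,5}.
	groups := []group{{0, 1}, {2, 3, 4, 5}}
	s, _ := pickGroup(groups, 2, "spread")
	b, _ := pickGroup(groups, 2, "binpack")
	fmt.Println("spread picks group", s)  // 1: the four-GPU domain
	fmt.Println("binpack picks group", b) // 0: the two-GPU domain, keeping the larger one whole
}
```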

## Important Notes

1. Slicing (device sharing) of Metax devices is not supported yet; only whole cards can be requested.

2. These features have been tested on MXC500.

## Prerequisites

* Metax GPU extensions >= 0.8.0
* Kubernetes >= 1.23

## Enabling topology-aware scheduling for Metax devices

* Deploy Metax GPU Extensions (please contact your device provider to obtain the package)

* Deploy HAMi according to README.md

## Running Metax jobs

A typical Metax job looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
```

> **NOTICE:** *You can find more examples in the [examples/metax folder](../examples/metax/).*

15 changes: 15 additions & 0 deletions examples/metax/binpack.yaml
@@ -0,0 +1,15 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "binpack" # when this parameter is set to binpack, the scheduler will try to minimize the topology loss.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
13 changes: 13 additions & 0 deletions examples/metax/default_use.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
15 changes: 15 additions & 0 deletions examples/metax/spread.yaml
@@ -0,0 +1,15 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    hami.io/node-scheduler-policy: "spread" # when this parameter is set to spread, the scheduler will try to find the best topology for this task.
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/gpu: 1 # requesting 1 GPU
Binary file added imgs/metax_binpack.jpg
Binary file added imgs/metax_spread.jpg
Binary file added imgs/metax_topo.jpg
4 changes: 4 additions & 0 deletions pkg/device/ascend/device.go
@@ -272,3 +272,7 @@ func (dev *Devices) GenerateResourceRequests(ctr *corev1.Container) util.Contain
func (dev *Devices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *Devices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}
4 changes: 4 additions & 0 deletions pkg/device/cambricon/device.go
@@ -312,3 +312,7 @@ func (dev *CambriconDevices) PatchAnnotations(annoinput *map[string]string, pd u
func (dev *CambriconDevices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *CambriconDevices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}
10 changes: 6 additions & 4 deletions pkg/device/devices.go
@@ -27,6 +27,7 @@ import (
"github.com/Project-HAMi/HAMi/pkg/device/cambricon"
"github.com/Project-HAMi/HAMi/pkg/device/hygon"
"github.com/Project-HAMi/HAMi/pkg/device/iluvatar"
"github.com/Project-HAMi/HAMi/pkg/device/metax"
"github.com/Project-HAMi/HAMi/pkg/device/mthreads"
"github.com/Project-HAMi/HAMi/pkg/device/nvidia"
"github.com/Project-HAMi/HAMi/pkg/util"
@@ -52,6 +53,7 @@ type Devices interface {
GenerateResourceRequests(ctr *corev1.Container) util.ContainerDeviceRequest
PatchAnnotations(annoinput *map[string]string, pd util.PodDevices) map[string]string
CustomFilterRule(allocated *util.PodDevices, toAllicate util.ContainerDevices, device *util.DeviceUsage) bool
ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32
// This should not be associated with a specific device object
//ParseConfig(fs *flag.FlagSet)
}
@@ -77,15 +79,14 @@ func InitDevices() {
devices[hygon.HygonDCUDevice] = hygon.InitDCUDevice()
devices[iluvatar.IluvatarGPUDevice] = iluvatar.InitIluvatarDevice()
devices[mthreads.MthreadsGPUDevice] = mthreads.InitMthreadsDevice()
//devices[d.AscendDevice] = d.InitDevice()
//devices[ascend.Ascend310PName] = ascend.InitAscend310P()
devices[metax.MetaxGPUDevice] = metax.InitMetaxDevice()

DevicesToHandle = append(DevicesToHandle, nvidia.NvidiaGPUCommonWord)
DevicesToHandle = append(DevicesToHandle, cambricon.CambriconMLUCommonWord)
DevicesToHandle = append(DevicesToHandle, hygon.HygonDCUCommonWord)
DevicesToHandle = append(DevicesToHandle, iluvatar.IluvatarGPUCommonWord)
DevicesToHandle = append(DevicesToHandle, mthreads.MthreadsGPUCommonWord)
//DevicesToHandle = append(DevicesToHandle, d.AscendDevice)
//DevicesToHandle = append(DevicesToHandle, ascend.Ascend310PName)
DevicesToHandle = append(DevicesToHandle, metax.MetaxGPUCommonWord)
for _, dev := range ascend.InitDevices() {
devices[dev.CommonWord()] = dev
DevicesToHandle = append(DevicesToHandle, dev.CommonWord())
@@ -143,6 +144,7 @@ func GlobalFlagSet() *flag.FlagSet {
iluvatar.ParseConfig(fs)
nvidia.ParseConfig(fs)
mthreads.ParseConfig(fs)
metax.ParseConfig(fs)
fs.BoolVar(&DebugMode, "debug", false, "debug mode")
klog.InitFlags(fs)
return fs
4 changes: 4 additions & 0 deletions pkg/device/hygon/device.go
@@ -241,3 +241,7 @@ func (dev *DCUDevices) PatchAnnotations(annoinput *map[string]string, pd util.Po
func (dev *DCUDevices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *DCUDevices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}
4 changes: 4 additions & 0 deletions pkg/device/iluvatar/device.go
@@ -225,3 +225,7 @@ func (dev *IluvatarDevices) GenerateResourceRequests(ctr *corev1.Container) util
func (dev *IluvatarDevices) CustomFilterRule(allocated *util.PodDevices, toAllocate util.ContainerDevices, device *util.DeviceUsage) bool {
return true
}

func (dev *IluvatarDevices) ScoreNode(node *corev1.Node, podDevices util.PodSingleDevice, policy string) float32 {
return 0
}