1. Core Mission and Design Philosophy
1.1 The Mission of the DeviceShare Plugin
DeviceShare is a core plugin of the Koordinator scheduler, responsible for scheduling decisions for heterogeneous devices such as GPU, FPGA, and RDMA. It addresses three pain points of the native Kubernetes scheduler:
```text
┌────────────────────────────────────────────────────────┐
│ Limitations of the native Kubernetes Device Plugin     │
├────────────────────────────────────────────────────────┤
│ 1. Scheduling timing                                   │
│    └─ The Device Plugin assigns devices inside kubelet │
│    └─ The scheduler cannot see device topology/usage   │
│    └─ A scheduled Pod may later fail to get a device   │
│                                                        │
│ 2. Sharing and isolation                               │
│    └─ Whole-card allocation only, no fine-grained      │
│       sharing                                          │
│    └─ No QoS isolation between devices                 │
│                                                        │
│ 3. Cross-device management                             │
│    └─ GPU/FPGA/RDMA each need their own Device Plugin  │
│    └─ No unified management or scheduling              │
└────────────────────────────────────────────────────────┘
```
The three core capabilities of DeviceShare:
- Early awareness: the concrete device allocation is decided during scheduling
- Fine-grained sharing: GPU compute and memory can be allocated proportionally
- Unified abstraction: GPU/FPGA/RDMA share the same scheduling logic
1.2 Plugin Architecture (How)
DeviceShare implements 5 extension points of the Kubernetes Scheduler Framework:
```text
┌──────────────────────────────────────────────────────────┐
│ DeviceShare plugin extension points                      │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ ┌─────────────────────────────────────────────┐          │
│ │ 1. PreFilter (pre-processing)               │          │
│ │    - Validate the GPU request               │          │
│ │    - Convert it to the unified format       │          │
│ │    - Handle Reservation-reserved resources  │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 2. Filter (filtering)                       │          │
│ │    - Evaluate each candidate node           │          │
│ │    - Check device resources are sufficient  │          │
│ │    - Try an allocation (not committed)      │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 3. Reserve (reservation)                    │          │
│ │    - Perform the real allocation on the     │          │
│ │      selected node                          │          │
│ │    - Update device usage in the cache       │          │
│ │    - Support Reservation reserved resources │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 4. PreBind (before binding)                 │          │
│ │    - Inject the result into Pod annotations │          │
│ │    - Update the Device CRD status           │          │
│ │    - Patch the Pod to the APIServer         │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 5. Unreserve (rollback)                     │          │
│ │    - Release reserved resources when        │          │
│ │      scheduling fails                       │          │
│ │    - Restore the cache state                │          │
│ └─────────────────────────────────────────────┘          │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
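The five extension points above form one happy-path pipeline per Pod. The sketch below models that pipeline with a plain Go interface; the names `SchedulerPlugin` and `fakePlugin` are illustrative stand-ins, not the real scheduler framework types:

```go
package main

import "fmt"

// SchedulerPlugin is a simplified stand-in for the framework interfaces
// that DeviceShare implements (illustrative, not the real API).
type SchedulerPlugin interface {
	PreFilter(pod string) error
	Filter(pod, node string) error
	Reserve(pod, node string) error
	PreBind(pod, node string) error
	Unreserve(pod, node string)
}

// fakePlugin records the order in which extension points are invoked.
type fakePlugin struct{ calls []string }

func (p *fakePlugin) PreFilter(pod string) error     { p.calls = append(p.calls, "PreFilter"); return nil }
func (p *fakePlugin) Filter(pod, node string) error  { p.calls = append(p.calls, "Filter"); return nil }
func (p *fakePlugin) Reserve(pod, node string) error { p.calls = append(p.calls, "Reserve"); return nil }
func (p *fakePlugin) PreBind(pod, node string) error { p.calls = append(p.calls, "PreBind"); return nil }
func (p *fakePlugin) Unreserve(pod, node string)     { p.calls = append(p.calls, "Unreserve") }

func main() {
	fp := &fakePlugin{}
	var p SchedulerPlugin = fp
	// One successful scheduling cycle touches the first four points;
	// Unreserve only runs on failure.
	_ = p.PreFilter("gpu-pod")
	_ = p.Filter("gpu-pod", "node-1")
	_ = p.Reserve("gpu-pod", "node-1")
	_ = p.PreBind("gpu-pod", "node-1")
	fmt.Println(fp.calls)
}
```

On a scheduling failure after Reserve, the framework calls `Unreserve` instead of `PreBind`, which is why the rollback hook exists at all.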
2. Core Data Structures
2.1 PreFilterState: the Scheduling Context
preFilterState carries all of the per-Pod context used during a scheduling cycle:
```go
// pkg/scheduler/plugins/deviceshare/plugin.go
type preFilterState struct {
	// whether device scheduling should be skipped for this Pod
	skip bool
	// the Pod's device requests, converted to the unified format
	podRequests corev1.ResourceList
	// the final allocation result
	allocationResult apiext.DeviceAllocations
	// preemptible device resources (used for preemption)
	// keys: node name → device type → device minor → resource list
	preemptibleDevices map[string]map[schedulingv1alpha1.DeviceType]deviceResources
	// device resources reserved by Reservations
	// keys: node name → Reservation UID → device type → device minor → resource list
	reservedDevices map[string]map[types.UID]map[schedulingv1alpha1.DeviceType]deviceResources
}

// deviceResources maps a device minor number to its resource list,
// e.g. {0: {gpu-core: 100, gpu-memory: 16Gi}}
type deviceResources map[int]corev1.ResourceList
```
Key fields:

| Field | Type | Description | Example |
|---|---|---|---|
| `skip` | bool | whether device scheduling is skipped | true when the Pod requests no GPU |
| `podRequests` | ResourceList | requests in the unified format | `{gpu-core: 50, gpu-memory-ratio: 50}` |
| `allocationResult` | DeviceAllocations | the devices finally allocated | 50% of GPU 0 |
| `preemptibleDevices` | 3-level map | resources held by preemptible lower-priority Pods | node-1 → gpu → 0 → {core: 25} |
| `reservedDevices` | 4-level map | resources reserved by Reservations | node-1 → uid-123 → gpu → 0 → {core: 50} |

Lifecycle of the state:
```text
┌──────────────────────────────────────────────────────┐
│ preFilterState lifecycle                             │
├──────────────────────────────────────────────────────┤
│ 1. Created in PreFilter                              │
│    state := &preFilterState{                         │
│        skip:               false,                    │
│        podRequests:        {gpu-core: 50},           │
│        preemptibleDevices: {},                       │
│        reservedDevices:    {},                       │
│    }                                                 │
│    cycleState.Write("DeviceShare", state)            │
│                                                      │
│ 2. Read in Filter                                    │
│    state := cycleState.Read("DeviceShare")           │
│    match resources against state.podRequests         │
│                                                      │
│ 3. Updated in Reserve                                │
│    state.allocationResult = allocatedDevices         │
│                                                      │
│ 4. Consumed in PreBind                               │
│    SetDeviceAllocations(pod, state.allocationResult) │
│                                                      │
│ 5. Cleaned up in Unreserve (on failure)              │
│    state.allocationResult = nil                      │
└──────────────────────────────────────────────────────┘
```
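The lifecycle above hinges on passing state between phases through the cycle state. The sketch below models that pattern with a plain concurrency-safe map; `cycleState` here is a simplified stand-in, not the real framework type:

```go
package main

import (
	"fmt"
	"sync"
)

// cycleState is a simplified model of the scheduler's per-cycle store:
// a concurrency-safe map keyed by plugin name.
type cycleState struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func newCycleState() *cycleState {
	return &cycleState{data: map[string]interface{}{}}
}

func (c *cycleState) Write(key string, v interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = v
}

func (c *cycleState) Read(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// preFilterState mirrors the fields discussed above, with plain types.
type preFilterState struct {
	skip        bool
	podRequests map[string]int64
}

func main() {
	cs := newCycleState()
	// PreFilter writes the state ...
	cs.Write("DeviceShare", &preFilterState{podRequests: map[string]int64{"gpu-core": 50}})
	// ... and later phases read it back with a type assertion.
	v, _ := cs.Read("DeviceShare")
	state := v.(*preFilterState)
	fmt.Println(state.podRequests["gpu-core"]) // prints 50
}
```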
2.2 NodeDeviceCache: the Per-Node Device Cache
nodeDeviceCache is DeviceShare's core cache; it stores the device information of every node:
```go
// pkg/scheduler/plugins/deviceshare/device_cache.go
type nodeDeviceCache struct {
	lock sync.RWMutex
	// device information per node
	nodeDeviceInfos map[string]*nodeDevice
}

type nodeDevice struct {
	lock sync.RWMutex
	// total capacity: device type → minor → resource capacity
	deviceTotal map[schedulingv1alpha1.DeviceType]deviceResources
	// free capacity: device type → minor → remaining resources
	deviceFree map[schedulingv1alpha1.DeviceType]deviceResources
	// used capacity: device type → minor → used resources
	deviceUsed map[schedulingv1alpha1.DeviceType]deviceResources
	// allocations: device type → Pod → per-device allocations;
	// used to quickly look up which devices a given Pod occupies
	allocateSet map[schedulingv1alpha1.DeviceType]map[types.NamespacedName]deviceResources
}
```
Cache update flow:

```text
┌────────────────────────────────────────────────────────┐
│ NodeDeviceCache update flow                            │
├────────────────────────────────────────────────────────┤
│ 1. Device CRD add/update events                        │
│    Informer watch → onDeviceAdd/onDeviceUpdate         │
│                                                        │
│ 2. Parse the Device CRD                                │
│    for device in Device.Spec.Devices:                  │
│        deviceTotal[device.Type][device.Minor] =        │
│            device.Resources                            │
│                                                        │
│ 3. Recompute deviceFree and deviceUsed                 │
│    deviceFree = deviceTotal - deviceUsed               │
│                                                        │
│ 4. After a Pod is scheduled                            │
│    Reserve() → update deviceUsed                       │
│    deviceUsed[0] += {gpu-core: 50}                     │
│    deviceFree[0] -= {gpu-core: 50}                     │
│                                                        │
│ 5. After a Pod is deleted                              │
│    onPodDelete() → release from deviceUsed             │
│    deviceUsed[0] -= {gpu-core: 50}                     │
│    deviceFree[0] += {gpu-core: 50}                     │
└────────────────────────────────────────────────────────┘
```
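Step 3 of the flow is a per-minor subtraction. The sketch below reproduces it with plain Go maps standing in for the `corev1.ResourceList`-based types; `recomputeFree` is an illustrative helper, not the plugin's actual function:

```go
package main

import "fmt"

// resourceList and deviceResources are plain-Go stand-ins for the
// quantity-based types used by the real cache.
type resourceList map[string]int64
type deviceResources map[int]resourceList

// recomputeFree derives deviceFree = deviceTotal - deviceUsed per minor,
// mirroring step 3 of the update flow above.
func recomputeFree(total, used deviceResources) deviceResources {
	free := deviceResources{}
	for minor, capacity := range total {
		f := resourceList{}
		for name, c := range capacity {
			// a missing entry in used reads as 0
			f[name] = c - used[minor][name]
		}
		free[minor] = f
	}
	return free
}

func main() {
	total := deviceResources{0: {"gpu-core": 100, "gpu-memory": 16 << 30}}
	used := deviceResources{0: {"gpu-core": 50, "gpu-memory": 8 << 30}}
	fmt.Println(recomputeFree(total, used)[0]["gpu-core"]) // prints 50
}
```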
Example cache contents:
Assume a node with 2 GPUs and 2 Pods already running:

```go
nodeDevice := &nodeDevice{
	deviceTotal: {
		schedulingv1alpha1.GPU: {
			0: { // GPU 0
				"koordinator.sh/gpu-core":         resource.MustParse("100"),
				"koordinator.sh/gpu-memory":       resource.MustParse("16Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("100"),
			},
			1: { // GPU 1
				"koordinator.sh/gpu-core":         resource.MustParse("100"),
				"koordinator.sh/gpu-memory":       resource.MustParse("16Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("100"),
			},
		},
	},
	deviceUsed: {
		schedulingv1alpha1.GPU: {
			0: { // GPU 0: 50% used
				"koordinator.sh/gpu-core":         resource.MustParse("50"),
				"koordinator.sh/gpu-memory":       resource.MustParse("8Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("50"),
			},
			1: { // GPU 1: 25% used
				"koordinator.sh/gpu-core":         resource.MustParse("25"),
				"koordinator.sh/gpu-memory":       resource.MustParse("4Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("25"),
			},
		},
	},
	deviceFree: {
		schedulingv1alpha1.GPU: {
			0: { // GPU 0: 50% free
				"koordinator.sh/gpu-core":         resource.MustParse("50"),
				"koordinator.sh/gpu-memory":       resource.MustParse("8Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("50"),
			},
			1: { // GPU 1: 75% free
				"koordinator.sh/gpu-core":         resource.MustParse("75"),
				"koordinator.sh/gpu-memory":       resource.MustParse("12Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("75"),
			},
		},
	},
	allocateSet: {
		schedulingv1alpha1.GPU: {
			types.NamespacedName{Namespace: "default", Name: "pod-1"}: {
				0: {"koordinator.sh/gpu-core": resource.MustParse("50")},
			},
			types.NamespacedName{Namespace: "default", Name: "pod-2"}: {
				1: {"koordinator.sh/gpu-core": resource.MustParse("25")},
			},
		},
	},
}
```
3. The PreFilter Phase in Detail
3.1 GPU Request Validation
The core task of PreFilter is to validate that the Pod's GPU request is well-formed:
```go
// pkg/scheduler/plugins/deviceshare/plugin.go
func (p *Plugin) PreFilter(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod) *framework.Status {
	state := &preFilterState{
		skip:               true,
		podRequests:        make(corev1.ResourceList),
		preemptibleDevices: map[string]map[schedulingv1alpha1.DeviceType]deviceResources{},
		reservedDevices:    map[string]map[types.UID]map[schedulingv1alpha1.DeviceType]deviceResources{},
	}
	// collect the Pod's resource requests
	podRequests, _ := resource.PodRequestsAndLimits(pod)
	podRequests = apiext.TransformDeprecatedDeviceResources(podRequests)
	// handle each device type
	for deviceType := range DeviceResourceNames {
		switch deviceType {
		case schedulingv1alpha1.GPU:
			if !hasDeviceResource(podRequests, deviceType) {
				break
			}
			// validate the GPU request
			combination, err := ValidateGPURequest(podRequests)
			if err != nil {
				return framework.NewStatus(framework.Error, err.Error())
			}
			// convert it to the unified format
			state.podRequests = quotav1.Add(state.podRequests, ConvertGPUResource(podRequests, combination))
			state.skip = false
		case schedulingv1alpha1.RDMA, schedulingv1alpha1.FPGA:
			// similar validation logic
			// ...
		}
	}
	cycleState.Write(stateKey, state)
	return nil
}
```
3.2 The GPU Request Validation Algorithm
Validation rule matrix:
| Resource combination | Bitmask | Valid | Notes |
|---|---|---|---|
| `nvidia.com/gpu` | 0b00001 | ✅ | native whole-card allocation |
| `koordinator.sh/gpu` | 0b00010 | ✅ | Koordinator whole-card allocation |
| `gpu-core + gpu-memory` | 0b01100 | ✅ | explicit compute and memory |
| `gpu-core + gpu-memory-ratio` | 0b10100 | ✅ | explicit compute and memory ratio |
| `gpu-core` alone | 0b00100 | ❌ | must be paired with memory |
| `gpu-memory` alone | 0b01000 | ❌ | must be paired with compute |
| `nvidia.com/gpu + gpu-core` | 0b00101 | ❌ | formats cannot be mixed |

Validation code:
```go
// pkg/scheduler/plugins/deviceshare/utils.go
const (
	NvidiaGPUExist      = 1 << 0 // 0b00001
	KoordGPUExist       = 1 << 1 // 0b00010
	GPUCoreExist        = 1 << 2 // 0b00100
	GPUMemoryExist      = 1 << 3 // 0b01000
	GPUMemoryRatioExist = 1 << 4 // 0b10000
)

func ValidateGPURequest(podRequest corev1.ResourceList) (uint, error) {
	var gpuCombination uint
	// detect which resource names are present
	if _, exist := podRequest[apiext.ResourceNvidiaGPU]; exist {
		gpuCombination |= NvidiaGPUExist
	}
	if koordGPU, exist := podRequest[apiext.ResourceGPU]; exist {
		// koordinator.sh/gpu must be <= 100 (a fraction of one card)
		// or a multiple of 100 (whole cards)
		if koordGPU.Value() > 100 && koordGPU.Value()%100 != 0 {
			return gpuCombination, fmt.Errorf("failed to validate %v: %v", apiext.ResourceGPU, koordGPU.Value())
		}
		gpuCombination |= KoordGPUExist
	}
	if gpuCore, exist := podRequest[apiext.ResourceGPUCore]; exist {
		// gpu-core must be <= 100 or a multiple of 100
		if gpuCore.Value() > 100 && gpuCore.Value()%100 != 0 {
			return gpuCombination, fmt.Errorf("failed to validate %v: %v", apiext.ResourceGPUCore, gpuCore.Value())
		}
		gpuCombination |= GPUCoreExist
	}
	if _, exist := podRequest[apiext.ResourceGPUMemory]; exist {
		gpuCombination |= GPUMemoryExist
	}
	if gpuMemRatio, exist := podRequest[apiext.ResourceGPUMemoryRatio]; exist {
		if gpuMemRatio.Value() > 100 && gpuMemRatio.Value()%100 != 0 {
			return gpuCombination, fmt.Errorf("failed to validate %v: %v", apiext.ResourceGPUMemoryRatio, gpuMemRatio.Value())
		}
		gpuCombination |= GPUMemoryRatioExist
	}
	// check whether the combination is one of the legal ones
	if gpuCombination == NvidiaGPUExist ||
		gpuCombination == KoordGPUExist ||
		gpuCombination == (GPUCoreExist|GPUMemoryExist) ||
		gpuCombination == (GPUCoreExist|GPUMemoryRatioExist) {
		return gpuCombination, nil
	}
	return gpuCombination, fmt.Errorf("request is not valid, current combination: %v", quotav1.ResourceNames(quotav1.Mask(podRequest, DeviceResourceNames[schedulingv1alpha1.GPU])))
}
```
Test cases:

```go
// pkg/scheduler/plugins/deviceshare/plugin_test.go
func Test_ValidateGPURequest(t *testing.T) {
	tests := []struct {
		name       string
		podRequest corev1.ResourceList
		wantErr    bool
	}{
		{
			name: "valid nvidia.com/gpu",
			podRequest: corev1.ResourceList{
				"nvidia.com/gpu": resource.MustParse("1"),
			},
		},
		{
			name: "valid gpu-core + gpu-memory",
			podRequest: corev1.ResourceList{
				"koordinator.sh/gpu-core":   resource.MustParse("50"),
				"koordinator.sh/gpu-memory": resource.MustParse("8Gi"),
			},
		},
		{
			name: "invalid gpu-core only",
			podRequest: corev1.ResourceList{
				"koordinator.sh/gpu-core": resource.MustParse("50"),
			},
			wantErr: true,
		},
		{
			name: "invalid gpu value (101)",
			podRequest: corev1.ResourceList{
				"koordinator.sh/gpu": resource.MustParse("101"),
			},
			wantErr: true,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			combination, err := ValidateGPURequest(tt.podRequest)
			if tt.wantErr {
				assert.Error(t, err)
			} else {
				assert.NoError(t, err)
				assert.Greater(t, combination, uint(0))
			}
		})
	}
}
```
3.3 GPU Request Conversion
After validation, the different GPU request formats are converted into a single unified format:
```go
// pkg/scheduler/plugins/deviceshare/utils.go
func ConvertGPUResource(podRequest corev1.ResourceList, combination uint) corev1.ResourceList {
	switch combination {
	case GPUCoreExist | GPUMemoryExist:
		// already in the canonical format, no conversion needed
		return corev1.ResourceList{
			apiext.ResourceGPUCore:   podRequest[apiext.ResourceGPUCore],
			apiext.ResourceGPUMemory: podRequest[apiext.ResourceGPUMemory],
		}
	case GPUCoreExist | GPUMemoryRatioExist:
		// already in the canonical format, no conversion needed
		return corev1.ResourceList{
			apiext.ResourceGPUCore:        podRequest[apiext.ResourceGPUCore],
			apiext.ResourceGPUMemoryRatio: podRequest[apiext.ResourceGPUMemoryRatio],
		}
	case KoordGPUExist:
		// koordinator.sh/gpu: 100 → gpu-core: 100, gpu-memory-ratio: 100
		return corev1.ResourceList{
			apiext.ResourceGPUCore:        podRequest[apiext.ResourceGPU],
			apiext.ResourceGPUMemoryRatio: podRequest[apiext.ResourceGPU],
		}
	case NvidiaGPUExist:
		// nvidia.com/gpu: 1 → gpu-core: 100, gpu-memory-ratio: 100
		// nvidia.com/gpu: 2 → gpu-core: 200, gpu-memory-ratio: 200
		nvidiaGpu := podRequest[apiext.ResourceNvidiaGPU]
		return corev1.ResourceList{
			apiext.ResourceGPUCore:        *resource.NewQuantity(nvidiaGpu.Value()*100, resource.DecimalSI),
			apiext.ResourceGPUMemoryRatio: *resource.NewQuantity(nvidiaGpu.Value()*100, resource.DecimalSI),
		}
	}
	return nil
}
```
Conversion examples:

| Original request | After conversion |
|---|---|
| `nvidia.com/gpu: 1` | `gpu-core: 100, gpu-memory-ratio: 100` |
| `nvidia.com/gpu: 2` | `gpu-core: 200, gpu-memory-ratio: 200` |
| `koordinator.sh/gpu: 100` | `gpu-core: 100, gpu-memory-ratio: 100` |
| `koordinator.sh/gpu: 200` | `gpu-core: 200, gpu-memory-ratio: 200` |
| `gpu-core: 50, gpu-memory: 8Gi` | `gpu-core: 50, gpu-memory: 8Gi` (unchanged) |
| `gpu-core: 50, gpu-memory-ratio: 50` | `gpu-core: 50, gpu-memory-ratio: 50` (unchanged) |
4. The Filter Phase in Detail
4.1 Node Filtering Flow
Filter runs for each candidate node and checks whether the node can satisfy the GPU request:
```go
// pkg/scheduler/plugins/deviceshare/plugin.go
func (p *Plugin) Filter(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// read the PreFilter state
	state, status := getPreFilterState(cycleState)
	if !status.IsSuccess() {
		return status
	}
	if state.skip {
		return nil
	}
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	// look up the node's device cache
	nodeDeviceInfo := p.nodeDeviceCache.getNodeDevice(node.Name, false)
	if nodeDeviceInfo == nil {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrMissingDevice)
	}
	nodeDeviceInfo.lock.RLock()
	defer nodeDeviceInfo.lock.RUnlock()
	// prefer allocating from Reservation-reserved resources
	if reservedDevices := state.reservedDevices[node.Name]; len(reservedDevices) > 0 {
		for _, reserved := range reservedDevices {
			devices := nodeDeviceInfo.replaceWith(reserved)
			allocateResult, err := p.allocator.Allocate(nodeInfo.Node().Name, pod, state.podRequests, devices, state.preemptibleDevices[node.Name])
			if len(allocateResult) > 0 && err == nil {
				return nil
			}
		}
	}
	// otherwise try to allocate from the node's free resources
	allocateResult, err := p.allocator.Allocate(nodeInfo.Node().Name, pod, state.podRequests, nodeDeviceInfo, state.preemptibleDevices[node.Name])
	if len(allocateResult) != 0 && err == nil {
		return nil
	}
	return framework.NewStatus(framework.Unschedulable, ErrInsufficientDevices)
}
```
4.2 The Device Allocation Algorithm
The Allocate() path is the core of device allocation:
```go
// pkg/scheduler/plugins/deviceshare/device_cache.go
func (n *nodeDevice) tryAllocateDevice(podRequest corev1.ResourceList, preemptibleDevices map[schedulingv1alpha1.DeviceType]deviceResources) (apiext.DeviceAllocations, error) {
	allocateResult := make(apiext.DeviceAllocations)
	// iterate over all device types (GPU, FPGA, RDMA)
	for deviceType := range DeviceResourceNames {
		if !hasDeviceResource(podRequest, deviceType) {
			continue
		}
		err := n.tryAllocateDeviceByType(podRequest, deviceType, allocateResult, preemptibleDevices)
		if err != nil {
			return nil, err
		}
	}
	return allocateResult, nil
}

func (n *nodeDevice) tryAllocateDeviceByType(
	podRequest corev1.ResourceList,
	deviceType schedulingv1alpha1.DeviceType,
	allocateResult apiext.DeviceAllocations,
	preemptibleDevices map[schedulingv1alpha1.DeviceType]deviceResources,
) error {
	// extract the requests for this device type
	podRequest = quotav1.Mask(podRequest, DeviceResourceNames[deviceType])
	nodeDeviceTotal := n.deviceTotal[deviceType]
	if len(nodeDeviceTotal) == 0 {
		return fmt.Errorf("node does not have enough %v", deviceType)
	}
	// free resources for this device type
	freeDevices := n.deviceFree[deviceType]
	deviceUsed := n.deviceUsed[deviceType]
	// merge in the preemptible resources
	preemptible := preemptibleDevices[deviceType]
	var mergedFreeDevices deviceResources
	if len(preemptible) > 0 {
		mergedFreeDevices = make(deviceResources)
		for minor, v := range preemptible {
			used := quotav1.SubtractWithNonNegativeResult(deviceUsed[minor], v)
			remaining := quotav1.SubtractWithNonNegativeResult(nodeDeviceTotal[minor], used)
			if !quotav1.IsZero(remaining) {
				mergedFreeDevices[minor] = remaining
			}
		}
		// merge in the regular free resources
		for minor, v := range freeDevices {
			res := mergedFreeDevices[minor]
			if res == nil {
				mergedFreeDevices[minor] = v.DeepCopy()
			} else {
				util.AddResourceList(res, v)
			}
		}
		freeDevices = mergedFreeDevices
	}
	// GPU-specific handling: fill in the total GPU memory
	if deviceType == schedulingv1alpha1.GPU {
		if err := fillGPUTotalMem(nodeDeviceTotal, podRequest); err != nil {
			return err
		}
	}
	// work out how many devices are needed
	var deviceAllocations []*apiext.DeviceAllocation
	deviceWanted := int64(1)
	podRequestPerCard := podRequest
	// multi-card request? (e.g. gpu-core: 200 means 2 cards)
	if isPodRequestsMultipleDevice(podRequest, deviceType) {
		switch deviceType {
		case schedulingv1alpha1.GPU:
			gpuCore := podRequest[apiext.ResourceGPUCore]
			deviceWanted = gpuCore.Value() / 100
			// per-card resource requirements
			gpuMem := podRequest[apiext.ResourceGPUMemory]
			gpuMemRatio := podRequest[apiext.ResourceGPUMemoryRatio]
			podRequestPerCard = corev1.ResourceList{
				apiext.ResourceGPUCore:        *resource.NewQuantity(gpuCore.Value()/deviceWanted, resource.DecimalSI),
				apiext.ResourceGPUMemory:      *resource.NewQuantity(gpuMem.Value()/deviceWanted, resource.BinarySI),
				apiext.ResourceGPUMemoryRatio: *resource.NewQuantity(gpuMemRatio.Value()/deviceWanted, resource.DecimalSI),
			}
		}
	}
	// scan the free devices and try to allocate
	satisfiedDeviceCount := 0
	orderedDeviceResources := sortDeviceResourcesByMinor(freeDevices)
	for _, deviceResource := range orderedDeviceResources {
		// skip devices with no free resources (e.g. unhealthy devices)
		if quotav1.IsZero(deviceResource.resources) {
			continue
		}
		// does this device satisfy the per-card request?
		if satisfied, _ := quotav1.LessThanOrEqual(podRequestPerCard, deviceResource.resources); satisfied {
			satisfiedDeviceCount++
			deviceAllocations = append(deviceAllocations, &apiext.DeviceAllocation{
				Minor:     int32(deviceResource.minor),
				Resources: podRequestPerCard,
			})
		}
		// have we gathered enough devices?
		if satisfiedDeviceCount == int(deviceWanted) {
			allocateResult[deviceType] = deviceAllocations
			return nil
		}
	}
	return fmt.Errorf("node does not have enough %v", deviceType)
}
```
Allocation algorithm flow:

```text
┌────────────────────────────────────────────────────────┐
│ GPU device allocation algorithm                        │
├────────────────────────────────────────────────────────┤
│ Input:                                                 │
│   - podRequest: {gpu-core: 50, gpu-memory-ratio: 50}   │
│   - nodeDeviceFree: {0: 75%, 1: 25%, 2: 100%}          │
│                                                        │
│ Step 1: sort by minor                                  │
│   orderedDevices = [0, 1, 2]                           │
│                                                        │
│ Step 2: scan the devices                               │
│   for minor in orderedDevices:                         │
│       if deviceFree[minor] >= podRequest:              │
│           allocate on this device                      │
│           break                                        │
│                                                        │
│ Step 3: return the result                              │
│   allocationResult = {                                 │
│       gpu: [{minor: 0, resources: {gpu-core: 50}}]     │
│   }                                                    │
│                                                        │
│ Notes:                                                 │
│   1. Lower minors are tried first (a simple strategy)  │
│   2. With binpack enabled, partially used devices are  │
│      filled first                                      │
│   3. Multi-card requests are supported                 │
│      (gpu-core: 200 → 2 cards)                         │
└────────────────────────────────────────────────────────┘
```
Allocation examples:

```text
Scenario 1: single-card allocation
  Pod request: gpu-core: 50, gpu-memory-ratio: 50
  Node state:
    GPU 0: 25% used, 75% free
    GPU 1: 50% used, 50% free
    GPU 2:  0% used, 100% free
  Result: GPU 0 (the lowest minor with enough free resources)
  State afterwards:
    GPU 0: 75% used, 25% free

Scenario 2: multi-card allocation
  Pod request: gpu-core: 200, gpu-memory-ratio: 200
  Node state:
    GPU 0: 0% used, 100% free
    GPU 1: 0% used, 100% free
    GPU 2: 0% used, 100% free
  Result: GPU 0 and GPU 1 (2 whole cards, chosen in minor order)
  State afterwards:
    GPU 0: 100% used, 0% free
    GPU 1: 100% used, 0% free

Scenario 3: insufficient resources
  Pod request: gpu-core: 100, gpu-memory-ratio: 100
  Node state:
    GPU 0: 75% used, 25% free
    GPU 1: 50% used, 50% free
    GPU 2: 80% used, 20% free
  Result: failure (no GPU has 100% free)
  Filter returns: Unschedulable
```
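The three scenarios above can be reproduced with a self-contained sketch of the first-fit-by-minor strategy. Only `gpu-core` is modeled and the types are simplified stand-ins, not the actual Koordinator implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// allocation records which device minor serves how many gpu-cores.
type allocation struct {
	Minor int
	Core  int64
}

// allocateGPU sketches the first-fit-by-minor strategy: a request of
// N*100 cores is split into N whole cards, and minors are scanned in
// ascending order until the request is satisfied.
func allocateGPU(free map[int]int64, coreRequest int64) ([]allocation, error) {
	wanted := int64(1)
	perCard := coreRequest
	if coreRequest > 100 { // multi-card request, e.g. 200 → 2 full cards
		wanted = coreRequest / 100
		perCard = 100
	}
	minors := make([]int, 0, len(free))
	for m := range free {
		minors = append(minors, m)
	}
	sort.Ints(minors)

	var result []allocation
	for _, m := range minors {
		if free[m] >= perCard {
			result = append(result, allocation{Minor: m, Core: perCard})
			if int64(len(result)) == wanted {
				return result, nil
			}
		}
	}
	return nil, fmt.Errorf("node does not have enough GPU")
}

func main() {
	// Scenario 1: single-card request on a partially used node.
	fmt.Println(allocateGPU(map[int]int64{0: 75, 1: 50, 2: 100}, 50))
	// Scenario 3: no card has 100 cores free, so allocation fails.
	_, err := allocateGPU(map[int]int64{0: 25, 1: 50, 2: 20}, 100)
	fmt.Println(err)
}
```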
5. The Reserve and PreBind Phases
5.1 Reserve: Committing the Allocation
Reserve performs the actual resource allocation:
```go
func (p *Plugin) Reserve(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
	state, status := getPreFilterState(cycleState)
	if !status.IsSuccess() {
		return status
	}
	if state.skip {
		return nil
	}
	nodeDeviceInfo := p.nodeDeviceCache.getNodeDevice(nodeName, false)
	if nodeDeviceInfo == nil {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrMissingDevice)
	}
	nodeDeviceInfo.lock.Lock()
	defer nodeDeviceInfo.lock.Unlock()
	// fetch the resources reserved by Reservations
	reservedDevices := p.getReservationReservedDevices(cycleState, state, pod, nodeName)
	var nodeDeviceInfoToAllocate *nodeDevice
	if len(reservedDevices) > 0 {
		nodeDeviceInfoToAllocate = nodeDeviceInfo.replaceWith(reservedDevices)
	} else {
		nodeDeviceInfoToAllocate = nodeDeviceInfo
	}
	// perform the allocation
	allocateResult, err := p.allocator.Allocate(nodeName, pod, state.podRequests, nodeDeviceInfoToAllocate, state.preemptibleDevices[nodeName])
	if err != nil || len(allocateResult) == 0 {
		return framework.NewStatus(framework.Unschedulable, ErrInsufficientDevices)
	}
	// commit the result to the cache
	p.allocator.Reserve(pod, nodeDeviceInfo, allocateResult)
	state.allocationResult = allocateResult
	return nil
}
```
The allocator's Reserve implementation:

```go
// pkg/scheduler/plugins/deviceshare/allocator.go
func (a *defaultAllocator) Reserve(pod *corev1.Pod, nodeDevice *nodeDevice, allocations apiext.DeviceAllocations) {
	for deviceType, allocs := range allocations {
		// update deviceUsed
		nodeDevice.updateDeviceUsed(deviceType, allocs, true)
		// update allocateSet
		nodeDevice.updateAllocateSet(deviceType, allocs, pod, true)
	}
	// recompute deviceFree
	for deviceType := range allocations {
		nodeDevice.resetDeviceFree(deviceType)
	}
}
```
State update example:

```text
Before Reserve:
  deviceUsed[GPU][0] = {gpu-core: 25}
  deviceFree[GPU][0] = {gpu-core: 75}

After Reserve (allocating 50% of a GPU):
  deviceUsed[GPU][0] = {gpu-core: 75}   // 25 + 50
  deviceFree[GPU][0] = {gpu-core: 25}   // 100 - 75
  allocateSet[GPU][default/pod-new] = {0: {gpu-core: 50}}
```
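The bookkeeping above, and its inverse in Unreserve, can be sketched with plain types. `nodeDeviceState` here is an illustrative simplification (one device type, gpu-core only), not the plugin's real cache:

```go
package main

import "fmt"

// nodeDeviceState is a plain-Go sketch of the used/free/allocateSet
// bookkeeping performed by Reserve and Unreserve.
type nodeDeviceState struct {
	total map[int]int64
	used  map[int]int64
	free  map[int]int64
	// allocateSet: pod name → minor → allocated gpu-cores
	allocateSet map[string]map[int]int64
}

// reserve commits an allocation: bump used, rederive free, record the Pod.
func (n *nodeDeviceState) reserve(pod string, minor int, cores int64) {
	n.used[minor] += cores
	n.free[minor] = n.total[minor] - n.used[minor]
	if n.allocateSet[pod] == nil {
		n.allocateSet[pod] = map[int]int64{}
	}
	n.allocateSet[pod][minor] += cores
}

// unreserve rolls an allocation back on scheduling failure.
func (n *nodeDeviceState) unreserve(pod string, minor int, cores int64) {
	n.used[minor] -= cores
	n.free[minor] = n.total[minor] - n.used[minor]
	delete(n.allocateSet, pod)
}

func main() {
	n := &nodeDeviceState{
		total:       map[int]int64{0: 100},
		used:        map[int]int64{0: 25},
		free:        map[int]int64{0: 75},
		allocateSet: map[string]map[int]int64{},
	}
	n.reserve("default/pod-new", 0, 50)
	fmt.Println(n.used[0], n.free[0]) // prints: 75 25
}
```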
5.2 PreBind: Injecting the Allocation Result
PreBind writes the allocation result into the Pod's annotations:
```go
func (p *Plugin) PreBind(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
	return p.preBindObject(ctx, cycleState, pod, nodeName)
}

func (p *Plugin) preBindObject(ctx context.Context, cycleState *framework.CycleState, object runtime.Object, nodeName string) *framework.Status {
	state, status := getPreFilterState(cycleState)
	if !status.IsSuccess() {
		return status
	}
	if state.skip {
		return nil
	}
	originalObj := object.DeepCopyObject()
	metaObject := object.(metav1.Object)
	// write the device allocations into the annotations
	if err := apiext.SetDeviceAllocations(metaObject, state.allocationResult); err != nil {
		return framework.AsStatus(err)
	}
	// patch the Pod
	err := util.RetryOnConflictOrTooManyRequests(func() error {
		_, err1 := util.NewPatch().
			WithHandle(p.handle).
			AddAnnotations(metaObject.GetAnnotations()).
			Patch(ctx, originalObj.(metav1.Object))
		return err1
	})
	if err != nil {
		klog.V(3).ErrorS(err, "Failed to preBind", "object", klog.KObj(metaObject), "Devices", state.allocationResult, "node", nodeName)
		return framework.NewStatus(framework.Error, err.Error())
	}
	klog.V(4).Infof("Successfully preBind %T %v", object, klog.KObj(metaObject))
	return nil
}
```
Annotation format:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    scheduling.koordinator.sh/device-allocated: |
      {
        "gpu": [
          {
            "minor": 0,
            "resources": {
              "koordinator.sh/gpu-core": "50",
              "koordinator.sh/gpu-memory-ratio": "50",
              "koordinator.sh/gpu-memory": "8Gi"
            }
          }
        ]
      }
```
6. Production Tuning
6.1 Performance Optimization
Lock optimization in the cache:
```go
// reads dominate writes, so an RWMutex is used
type nodeDeviceCache struct {
	lock            sync.RWMutex // reader/writer lock
	nodeDeviceInfos map[string]*nodeDevice
}

// Filter phase: read lock
nodeDeviceInfo.lock.RLock()
defer nodeDeviceInfo.lock.RUnlock()

// Reserve phase: write lock
nodeDeviceInfo.lock.Lock()
defer nodeDeviceInfo.lock.Unlock()
```
Batch-processing optimization:

```text
Scenario: 100 Pods scheduled onto the same node at the same time
Before: each Pod acquires the lock separately → 100 lock contentions
After: one lock acquisition for the batch → 1 lock operation + 100 in-memory updates
Result: latency drops from ~100ms to ~10ms (about 10x)
```
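The batching idea can be sketched as follows; `counterCache` and `applyBatch` are illustrative names, not the plugin's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// alloc is one device allocation to apply to the cache.
type alloc struct {
	Minor int
	Core  int64
}

// counterCache models the node cache guarded by a single mutex.
type counterCache struct {
	mu   sync.Mutex
	used map[int]int64
}

// applyBatch takes the lock once and applies a whole batch of
// allocations, instead of locking once per Pod.
func (c *counterCache) applyBatch(allocs []alloc) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, a := range allocs {
		c.used[a.Minor] += a.Core
	}
}

func main() {
	c := &counterCache{used: map[int]int64{}}
	c.applyBatch([]alloc{{0, 50}, {0, 25}, {1, 100}})
	fmt.Println(c.used[0], c.used[1]) // prints: 75 100
}
```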
6.2 Monitoring Metrics

```yaml
# Prometheus metrics
- name: deviceshare_allocation_duration_seconds
  type: histogram
  help: "Device allocation latency"
- name: deviceshare_allocation_errors_total
  type: counter
  help: "Number of failed device allocations"
  labels: ["error_type", "device_type"]
- name: deviceshare_cache_size
  type: gauge
  help: "Number of nodes in the cache"
- name: deviceshare_device_utilization
  type: gauge
  help: "Device utilization"
  labels: ["node", "device_type", "minor"]
```
6.3 Troubleshooting
Common issue 1: Filter fails on every node
Steps to investigate:

```bash
# 1. Check that the Device CRD exists
kubectl get device
# 2. Check that the device information is correct
kubectl get device <node-name> -o yaml
# 3. Check the Pod's GPU request
kubectl get pod <pod-name> -o yaml | grep koordinator.sh/gpu
# 4. Inspect the DeviceShare logs
kubectl logs -n kube-system koord-scheduler-xxx | grep "DeviceShare\|Filter"
```
Common issue 2: PreBind fails
Steps to investigate:

```bash
# 1. Check whether the Pod already has the annotation
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'
# 2. Check that the APIServer is reachable
kubectl get --raw /api/v1/namespaces/default/pods/<pod-name>
# 3. Look at the detailed error message
kubectl logs -n kube-system koord-scheduler-xxx | grep "PreBind.*Failed"
```