1. Core Mission and Design Philosophy
1.1 The Mission of the DeviceShare Plugin
DeviceShare is a core plugin of the Koordinator scheduler, responsible for scheduling decisions for heterogeneous devices such as GPU, FPGA, and RDMA. It addresses three pain points of the native Kubernetes scheduler:
```text
┌────────────────────────────────────────────────────────┐
│ Limitations of the native Kubernetes Device Plugin     │
├────────────────────────────────────────────────────────┤
│ 1. Scheduling timing                                   │
│    └─ The Device Plugin assigns devices inside kubelet │
│    └─ The scheduler cannot see device topology/usage   │
│    └─ A scheduled Pod may later fail to get a device   │
│                                                        │
│ 2. Sharing and isolation                               │
│    └─ Whole-card allocation only, no fine-grained      │
│       sharing                                          │
│    └─ No QoS isolation between devices                 │
│                                                        │
│ 3. Cross-device management                             │
│    └─ GPU/FPGA/RDMA each need their own Device Plugin  │
│    └─ No unified management or scheduling              │
└────────────────────────────────────────────────────────┘
```
The three core capabilities of DeviceShare:
- Early awareness: the concrete device allocation is decided during scheduling
- Fine-grained sharing: GPU compute and memory can be allocated proportionally
- Unified abstraction: GPU/FPGA/RDMA share the same scheduling logic
1.2 Plugin Architecture (How)
DeviceShare implements 5 extension points of the Kubernetes Scheduler Framework:
```text
┌──────────────────────────────────────────────────────────┐
│ DeviceShare plugin extension points                      │
├──────────────────────────────────────────────────────────┤
│                                                          │
│ ┌─────────────────────────────────────────────┐          │
│ │ 1. PreFilter (pre-processing)               │          │
│ │    - Validate the GPU request               │          │
│ │    - Convert it to the unified format       │          │
│ │    - Handle Reservation-reserved resources  │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 2. Filter (filtering)                       │          │
│ │    - Evaluate each candidate node           │          │
│ │    - Check device resources are sufficient  │          │
│ │    - Try an allocation (not committed)      │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 3. Reserve (reservation)                    │          │
│ │    - Perform the real allocation on the     │          │
│ │      selected node                          │          │
│ │    - Update device usage in the cache       │          │
│ │    - Support Reservation reserved resources │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 4. PreBind (before binding)                 │          │
│ │    - Inject the result into Pod annotations │          │
│ │    - Update the Device CRD status           │          │
│ │    - Patch the Pod to the APIServer         │          │
│ └─────────────────────────────────────────────┘          │
│                      ↓                                   │
│ ┌─────────────────────────────────────────────┐          │
│ │ 5. Unreserve (rollback)                     │          │
│ │    - Release reserved resources when        │          │
│ │      scheduling fails                       │          │
│ │    - Restore the cache state                │          │
│ └─────────────────────────────────────────────┘          │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
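The five extension points above form one happy-path pipeline per Pod. The sketch below models that pipeline with a plain Go interface; the names `SchedulerPlugin` and `fakePlugin` are illustrative stand-ins, not the real scheduler framework types:

```go
package main

import "fmt"

// SchedulerPlugin is a simplified stand-in for the framework interfaces
// that DeviceShare implements (illustrative, not the real API).
type SchedulerPlugin interface {
	PreFilter(pod string) error
	Filter(pod, node string) error
	Reserve(pod, node string) error
	PreBind(pod, node string) error
	Unreserve(pod, node string)
}

// fakePlugin records the order in which extension points are invoked.
type fakePlugin struct{ calls []string }

func (p *fakePlugin) PreFilter(pod string) error     { p.calls = append(p.calls, "PreFilter"); return nil }
func (p *fakePlugin) Filter(pod, node string) error  { p.calls = append(p.calls, "Filter"); return nil }
func (p *fakePlugin) Reserve(pod, node string) error { p.calls = append(p.calls, "Reserve"); return nil }
func (p *fakePlugin) PreBind(pod, node string) error { p.calls = append(p.calls, "PreBind"); return nil }
func (p *fakePlugin) Unreserve(pod, node string)     { p.calls = append(p.calls, "Unreserve") }

func main() {
	fp := &fakePlugin{}
	var p SchedulerPlugin = fp
	// One successful scheduling cycle touches the first four points;
	// Unreserve only runs on failure.
	_ = p.PreFilter("gpu-pod")
	_ = p.Filter("gpu-pod", "node-1")
	_ = p.Reserve("gpu-pod", "node-1")
	_ = p.PreBind("gpu-pod", "node-1")
	fmt.Println(fp.calls)
}
```

On a scheduling failure after Reserve, the framework calls `Unreserve` instead of `PreBind`, which is why the rollback hook exists at all.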
2. Core Data Structures
2.1 PreFilterState: the Scheduling Context
preFilterState carries all of the per-Pod context used during a scheduling cycle:
```go
// pkg/scheduler/plugins/deviceshare/plugin.go
type preFilterState struct {
	// whether device scheduling should be skipped for this Pod
	skip bool
	// the Pod's device requests, converted to the unified format
	podRequests corev1.ResourceList
	// the final allocation result
	allocationResult apiext.DeviceAllocations
	// preemptible device resources (used for preemption)
	// keys: node name → device type → device minor → resource list
	preemptibleDevices map[string]map[schedulingv1alpha1.DeviceType]deviceResources
	// device resources reserved by Reservations
	// keys: node name → Reservation UID → device type → device minor → resource list
	reservedDevices map[string]map[types.UID]map[schedulingv1alpha1.DeviceType]deviceResources
}

// deviceResources maps a device minor number to its resource list,
// e.g. {0: {gpu-core: 100, gpu-memory: 16Gi}}
type deviceResources map[int]corev1.ResourceList
```
Key fields:

| Field | Type | Description | Example |
|---|---|---|---|
| `skip` | bool | whether device scheduling is skipped | true when the Pod requests no GPU |
| `podRequests` | ResourceList | requests in the unified format | `{gpu-core: 50, gpu-memory-ratio: 50}` |
| `allocationResult` | DeviceAllocations | the devices finally allocated | 50% of GPU 0 |
| `preemptibleDevices` | 3-level map | resources held by preemptible lower-priority Pods | node-1 → gpu → 0 → {core: 25} |
| `reservedDevices` | 4-level map | resources reserved by Reservations | node-1 → uid-123 → gpu → 0 → {core: 50} |

Lifecycle of the state:
```text
┌──────────────────────────────────────────────────────┐
│ preFilterState lifecycle                             │
├──────────────────────────────────────────────────────┤
│ 1. Created in PreFilter                              │
│    state := &preFilterState{                         │
│        skip:               false,                    │
│        podRequests:        {gpu-core: 50},           │
│        preemptibleDevices: {},                       │
│        reservedDevices:    {},                       │
│    }                                                 │
│    cycleState.Write("DeviceShare", state)            │
│                                                      │
│ 2. Read in Filter                                    │
│    state := cycleState.Read("DeviceShare")           │
│    match resources against state.podRequests         │
│                                                      │
│ 3. Updated in Reserve                                │
│    state.allocationResult = allocatedDevices         │
│                                                      │
│ 4. Consumed in PreBind                               │
│    SetDeviceAllocations(pod, state.allocationResult) │
│                                                      │
│ 5. Cleaned up in Unreserve (on failure)              │
│    state.allocationResult = nil                      │
└──────────────────────────────────────────────────────┘
```
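The lifecycle above hinges on passing state between phases through the cycle state. The sketch below models that pattern with a plain concurrency-safe map; `cycleState` here is a simplified stand-in, not the real framework type:

```go
package main

import (
	"fmt"
	"sync"
)

// cycleState is a simplified model of the scheduler's per-cycle store:
// a concurrency-safe map keyed by plugin name.
type cycleState struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func newCycleState() *cycleState {
	return &cycleState{data: map[string]interface{}{}}
}

func (c *cycleState) Write(key string, v interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = v
}

func (c *cycleState) Read(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// preFilterState mirrors the fields discussed above, with plain types.
type preFilterState struct {
	skip        bool
	podRequests map[string]int64
}

func main() {
	cs := newCycleState()
	// PreFilter writes the state ...
	cs.Write("DeviceShare", &preFilterState{podRequests: map[string]int64{"gpu-core": 50}})
	// ... and later phases read it back with a type assertion.
	v, _ := cs.Read("DeviceShare")
	state := v.(*preFilterState)
	fmt.Println(state.podRequests["gpu-core"]) // prints 50
}
```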
2.2 NodeDeviceCache: the Per-Node Device Cache
nodeDeviceCache is DeviceShare's core cache; it stores the device information of every node:
```go
// pkg/scheduler/plugins/deviceshare/device_cache.go
type nodeDeviceCache struct {
	lock sync.RWMutex
	// device information per node
	nodeDeviceInfos map[string]*nodeDevice
}

type nodeDevice struct {
	lock sync.RWMutex
	// total capacity: device type → minor → resource capacity
	deviceTotal map[schedulingv1alpha1.DeviceType]deviceResources
	// free capacity: device type → minor → remaining resources
	deviceFree map[schedulingv1alpha1.DeviceType]deviceResources
	// used capacity: device type → minor → used resources
	deviceUsed map[schedulingv1alpha1.DeviceType]deviceResources
	// allocations: device type → Pod → per-device allocations;
	// used to quickly look up which devices a given Pod occupies
	allocateSet map[schedulingv1alpha1.DeviceType]map[types.NamespacedName]deviceResources
}
```
Cache update flow:

```text
┌────────────────────────────────────────────────────────┐
│ NodeDeviceCache update flow                            │
├────────────────────────────────────────────────────────┤
│ 1. Device CRD add/update events                        │
│    Informer watch → onDeviceAdd/onDeviceUpdate         │
│                                                        │
│ 2. Parse the Device CRD                                │
│    for device in Device.Spec.Devices:                  │
│        deviceTotal[device.Type][device.Minor] =        │
│            device.Resources                            │
│                                                        │
│ 3. Recompute deviceFree and deviceUsed                 │
│    deviceFree = deviceTotal - deviceUsed               │
│                                                        │
│ 4. After a Pod is scheduled                            │
│    Reserve() → update deviceUsed                       │
│    deviceUsed[0] += {gpu-core: 50}                     │
│    deviceFree[0] -= {gpu-core: 50}                     │
│                                                        │
│ 5. After a Pod is deleted                              │
│    onPodDelete() → release from deviceUsed             │
│    deviceUsed[0] -= {gpu-core: 50}                     │
│    deviceFree[0] += {gpu-core: 50}                     │
└────────────────────────────────────────────────────────┘
```
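Step 3 of the flow is a per-minor subtraction. The sketch below reproduces it with plain Go maps standing in for the `corev1.ResourceList`-based types; `recomputeFree` is an illustrative helper, not the plugin's actual function:

```go
package main

import "fmt"

// resourceList and deviceResources are plain-Go stand-ins for the
// quantity-based types used by the real cache.
type resourceList map[string]int64
type deviceResources map[int]resourceList

// recomputeFree derives deviceFree = deviceTotal - deviceUsed per minor,
// mirroring step 3 of the update flow above.
func recomputeFree(total, used deviceResources) deviceResources {
	free := deviceResources{}
	for minor, capacity := range total {
		f := resourceList{}
		for name, c := range capacity {
			// a missing entry in used reads as 0
			f[name] = c - used[minor][name]
		}
		free[minor] = f
	}
	return free
}

func main() {
	total := deviceResources{0: {"gpu-core": 100, "gpu-memory": 16 << 30}}
	used := deviceResources{0: {"gpu-core": 50, "gpu-memory": 8 << 30}}
	fmt.Println(recomputeFree(total, used)[0]["gpu-core"]) // prints 50
}
```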
Example cache contents:
Assume a node with 2 GPUs and 2 Pods already running:

```go
nodeDevice := &nodeDevice{
	deviceTotal: {
		schedulingv1alpha1.GPU: {
			0: { // GPU 0
				"koordinator.sh/gpu-core":         resource.MustParse("100"),
				"koordinator.sh/gpu-memory":       resource.MustParse("16Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("100"),
			},
			1: { // GPU 1
				"koordinator.sh/gpu-core":         resource.MustParse("100"),
				"koordinator.sh/gpu-memory":       resource.MustParse("16Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("100"),
			},
		},
	},
	deviceUsed: {
		schedulingv1alpha1.GPU: {
			0: { // GPU 0: 50% used
				"koordinator.sh/gpu-core":         resource.MustParse("50"),
				"koordinator.sh/gpu-memory":       resource.MustParse("8Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("50"),
			},
			1: { // GPU 1: 25% used
				"koordinator.sh/gpu-core":         resource.MustParse("25"),
				"koordinator.sh/gpu-memory":       resource.MustParse("4Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("25"),
			},
		},
	},
	deviceFree: {
		schedulingv1alpha1.GPU: {
			0: { // GPU 0: 50% free
				"koordinator.sh/gpu-core":         resource.MustParse("50"),
				"koordinator.sh/gpu-memory":       resource.MustParse("8Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("50"),
			},
			1: { // GPU 1: 75% free
				"koordinator.sh/gpu-core":         resource.MustParse("75"),
				"koordinator.sh/gpu-memory":       resource.MustParse("12Gi"),
				"koordinator.sh/gpu-memory-ratio": resource.MustParse("75"),
			},
		},
	},
	allocateSet: {
		schedulingv1alpha1.GPU: {
			types.NamespacedName{Namespace: "default", Name: "pod-1"}: {
				0: {"koordinator.sh/gpu-core": resource.MustParse("50")},
			},
			types.NamespacedName{Namespace: "default", Name: "pod-2"}: {
				1: {"koordinator.sh/gpu-core": resource.MustParse("25")},
			},
		},
	},
}
```
3. The PreFilter Phase in Detail
3.1 GPU Request Validation
The core task of PreFilter is to validate that the Pod's GPU request is well-formed:
```go
// pkg/scheduler/plugins/deviceshare/plugin.go
func (p *Plugin) PreFilter(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod) *framework.Status {
	state := &preFilterState{
		skip:               true,
		podRequests:        make(corev1.ResourceList),
		preemptibleDevices: map[string]map[schedulingv1alpha1.DeviceType]deviceResources{},
		reservedDevices:    map[string]map[types.UID]map[schedulingv1alpha1.DeviceType]deviceResources{},
	}
	// collect the Pod's resource requests
	podRequests, _ := resource.PodRequestsAndLimits(pod)
	podRequests = apiext.TransformDeprecatedDeviceResources(podRequests)
	// handle each device type
	for deviceType := range DeviceResourceNames {
		switch deviceType {
		case schedulingv1alpha1.GPU:
			if !hasDeviceResource(podRequests, deviceType) {
				break
			}
			// validate the GPU request
			combination, err := ValidateGPURequest(podRequests)
			if err != nil {
				return framework.NewStatus(framework.Error, err.Error())
			}
			// convert it to the unified format
			state.podRequests = quotav1.Add(state.podRequests, ConvertGPUResource(podRequests, combination))
			state.skip = false
		case schedulingv1alpha1.RDMA, schedulingv1alpha1.FPGA:
			// similar validation logic
			// ...
		}
	}
	cycleState.Write(stateKey, state)
	return nil
}
```
3.2 The GPU Request Validation Algorithm
Validation rule matrix:
| Resource combination | Bitmask | Valid | Notes |
|---|---|---|---|
| `nvidia.com/gpu` | 0b00001 | ✅ | native whole-card allocation |
| `koordinator.sh/gpu` | 0b00010 | ✅ | Koordinator whole-card allocation |
| `gpu-core + gpu-memory` | 0b01100 | ✅ | explicit compute and memory |
| `gpu-core + gpu-memory-ratio` | 0b10100 | ✅ | explicit compute and memory ratio |
| `gpu-core` alone | 0b00100 | ❌ | must be paired with memory |
| `gpu-memory` alone | 0b01000 | ❌ | must be paired with compute |
| `nvidia.com/gpu + gpu-core` | 0b00101 | ❌ | formats cannot be mixed |

Validation code:
```go
// pkg/scheduler/plugins/deviceshare/utils.go
const (
	NvidiaGPUExist      = 1 << 0 // 0b00001
	KoordGPUExist       = 1 << 1 // 0b00010
	GPUCoreExist        = 1 << 2 // 0b00100
	GPUMemoryExist      = 1 << 3 // 0b01000
	GPUMemoryRatioExist = 1 << 4 // 0b10000
)

func ValidateGPURequest(podRequest corev1.ResourceList) (uint, error) {
	var gpuCombination uint
	// detect which resource names are present
	if _, exist := podRequest[apiext.ResourceNvidiaGPU]; exist {
		gpuCombination |= NvidiaGPUExist
	}
	if koordGPU, exist := podRequest[apiext.ResourceGPU]; exist {
		// koordinator.sh/gpu must be <= 100 (a fraction of one card)
		// or a multiple of 100 (whole cards)
		if koordGPU.Value() > 100 && koordGPU.Value()%100 != 0 {
			return gpuCombination, fmt.Errorf("failed to validate %v: %v", apiext.ResourceGPU, koordGPU.Value())
		}
		gpuCombination |= KoordGPUExist
	}
	if gpuCore, exist := podRequest[apiext.ResourceGPUCore]; exist {
		// gpu-core must be <= 100 or a multiple of 100
		if gpuCore.Value() > 100 && gpuCore.Value()%100 != 0 {
			return gpuCombination, fmt.Errorf("failed to validate %v: %v", apiext.ResourceGPUCore, gpuCore.Value())
		}
		gpuCombination |= GPUCoreExist
	}
	if _, exist := podRequest[apiext.ResourceGPUMemory]; exist {
		gpuCombination |= GPUMemoryExist
	}
	if gpuMemRatio, exist := podRequest[apiext.ResourceGPUMemoryRatio]; exist {
		if gpuMemRatio.Value() > 100 && gpuMemRatio.Value()%100 != 0 {
			return gpuCombination, fmt.Errorf("failed to validate %v: %v", apiext.ResourceGPUMemoryRatio, gpuMemRatio.Value())
		}
		gpuCombination |= GPUMemoryRatioExist
	}
	// check whether the combination is one of the legal ones
	if gpuCombination == NvidiaGPUExist ||
		gpuCombination == KoordGPUExist ||
		gpuCombination == (GPUCoreExist|GPUMemoryExist) ||
		gpuCombination == (GPUCoreExist|GPUMemoryRatioExist) {
		return gpuCombination, nil
	}
	return gpuCombination, fmt.Errorf("request is not valid, current combination: %v", quotav1.ResourceNames(quotav1.Mask(podRequest, DeviceResourceNames[schedulingv1alpha1.GPU])))
}
```
Test cases:

```go
// pkg/scheduler/plugins/deviceshare/plugin_test.go
func Test_ValidateGPURequest(t *testing.T) {
	tests := []struct {
		name       string
		podRequest corev1.ResourceList
		wantErr    bool
	}{
		{
			name: "valid nvidia.com/gpu",
			podRequest: corev1.ResourceList{
				"nvidia.com/gpu": resource.MustParse("1"),
			},
		},
		{
			name: "valid gpu-core + gpu-memory",
			podRequest: corev1.ResourceList{
				"koordinator.sh/gpu-core":   resource.MustParse("50"),
				"koordinator.sh/gpu-memory": resource.MustParse("8Gi"),
			},
		},
		{
			name: "invalid gpu-core only",
			podRequest: corev1.ResourceList{
				"koordinator.sh/gpu-core": resource.MustParse("50"),
			},
			wantErr: true,
		},
		{
			name: "invalid gpu value (101)",
			podRequest: corev1.ResourceList{
				"koordinator.sh/gpu": resource.MustParse("101"),
			},
			wantErr: true,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			combination, err := ValidateGPURequest(tt.podRequest)
			if tt.wantErr {
				assert.Error(t, err)
			} else {
				assert.NoError(t, err)
				assert.Greater(t, combination, uint(0))
			}
		})
	}
}
```
3.3 GPU Request Conversion
After validation, the different GPU request formats are converted into a single unified format:
```go
// pkg/scheduler/plugins/deviceshare/utils.go
func ConvertGPUResource(podRequest corev1.ResourceList, combination uint) corev1.ResourceList {
	switch combination {
	case GPUCoreExist | GPUMemoryExist:
		// already in the canonical format, no conversion needed
		return corev1.ResourceList{
			apiext.ResourceGPUCore:   podRequest[apiext.ResourceGPUCore],
			apiext.ResourceGPUMemory: podRequest[apiext.ResourceGPUMemory],
		}
	case GPUCoreExist | GPUMemoryRatioExist:
		// already in the canonical format, no conversion needed
		return corev1.ResourceList{
			apiext.ResourceGPUCore:        podRequest[apiext.ResourceGPUCore],
			apiext.ResourceGPUMemoryRatio: podRequest[apiext.ResourceGPUMemoryRatio],
		}
	case KoordGPUExist:
		// koordinator.sh/gpu: 100 → gpu-core: 100, gpu-memory-ratio: 100
		return corev1.ResourceList{
			apiext.ResourceGPUCore:        podRequest[apiext.ResourceGPU],
			apiext.ResourceGPUMemoryRatio: podRequest[apiext.ResourceGPU],
		}
	case NvidiaGPUExist:
		// nvidia.com/gpu: 1 → gpu-core: 100, gpu-memory-ratio: 100
		// nvidia.com/gpu: 2 → gpu-core: 200, gpu-memory-ratio: 200
		nvidiaGpu := podRequest[apiext.ResourceNvidiaGPU]
		return corev1.ResourceList{
			apiext.ResourceGPUCore:        *resource.NewQuantity(nvidiaGpu.Value()*100, resource.DecimalSI),
			apiext.ResourceGPUMemoryRatio: *resource.NewQuantity(nvidiaGpu.Value()*100, resource.DecimalSI),
		}
	}
	return nil
}
```
Conversion examples:

| Original request | After conversion |
|---|---|
| `nvidia.com/gpu: 1` | `gpu-core: 100, gpu-memory-ratio: 100` |
| `nvidia.com/gpu: 2` | `gpu-core: 200, gpu-memory-ratio: 200` |
| `koordinator.sh/gpu: 100` | `gpu-core: 100, gpu-memory-ratio: 100` |
| `koordinator.sh/gpu: 200` | `gpu-core: 200, gpu-memory-ratio: 200` |
| `gpu-core: 50, gpu-memory: 8Gi` | `gpu-core: 50, gpu-memory: 8Gi` (unchanged) |
| `gpu-core: 50, gpu-memory-ratio: 50` | `gpu-core: 50, gpu-memory-ratio: 50` (unchanged) |
4. The Filter Phase in Detail
4.1 Node Filtering Flow
Filter runs for each candidate node and checks whether the node can satisfy the GPU request:
```go
// pkg/scheduler/plugins/deviceshare/plugin.go
func (p *Plugin) Filter(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	// read the PreFilter state
	state, status := getPreFilterState(cycleState)
	if !status.IsSuccess() {
		return status
	}
	if state.skip {
		return nil
	}
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	// look up the node's device cache
	nodeDeviceInfo := p.nodeDeviceCache.getNodeDevice(node.Name, false)
	if nodeDeviceInfo == nil {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrMissingDevice)
	}
	nodeDeviceInfo.lock.RLock()
	defer nodeDeviceInfo.lock.RUnlock()
	// prefer allocating from Reservation-reserved resources
	if reservedDevices := state.reservedDevices[node.Name]; len(reservedDevices) > 0 {
		for _, reserved := range reservedDevices {
			devices := nodeDeviceInfo.replaceWith(reserved)
			allocateResult, err := p.allocator.Allocate(nodeInfo.Node().Name, pod, state.podRequests, devices, state.preemptibleDevices[node.Name])
			if len(allocateResult) > 0 && err == nil {
				return nil
			}
		}
	}
	// otherwise try to allocate from the node's free resources
	allocateResult, err := p.allocator.Allocate(nodeInfo.Node().Name, pod, state.podRequests, nodeDeviceInfo, state.preemptibleDevices[node.Name])
	if len(allocateResult) != 0 && err == nil {
		return nil
	}
	return framework.NewStatus(framework.Unschedulable, ErrInsufficientDevices)
}
```
4.2 The Device Allocation Algorithm
The Allocate() path is the core of device allocation:
```go
// pkg/scheduler/plugins/deviceshare/device_cache.go
func (n *nodeDevice) tryAllocateDevice(podRequest corev1.ResourceList, preemptibleDevices map[schedulingv1alpha1.DeviceType]deviceResources) (apiext.DeviceAllocations, error) {
	allocateResult := make(apiext.DeviceAllocations)
	// iterate over all device types (GPU, FPGA, RDMA)
	for deviceType := range DeviceResourceNames {
		if !hasDeviceResource(podRequest, deviceType) {
			continue
		}
		err := n.tryAllocateDeviceByType(podRequest, deviceType, allocateResult, preemptibleDevices)
		if err != nil {
			return nil, err
		}
	}
	return allocateResult, nil
}

func (n *nodeDevice) tryAllocateDeviceByType(
	podRequest corev1.ResourceList,
	deviceType schedulingv1alpha1.DeviceType,
	allocateResult apiext.DeviceAllocations,
	preemptibleDevices map[schedulingv1alpha1.DeviceType]deviceResources,
) error {
	// extract the requests for this device type
	podRequest = quotav1.Mask(podRequest, DeviceResourceNames[deviceType])
	nodeDeviceTotal := n.deviceTotal[deviceType]
	if len(nodeDeviceTotal) == 0 {
		return fmt.Errorf("node does not have enough %v", deviceType)
	}
	// free resources for this device type
	freeDevices := n.deviceFree[deviceType]
	deviceUsed := n.deviceUsed[deviceType]
	// merge in the preemptible resources
	preemptible := preemptibleDevices[deviceType]
	var mergedFreeDevices deviceResources
	if len(preemptible) > 0 {
		mergedFreeDevices = make(deviceResources)
		for minor, v := range preemptible {
			used := quotav1.SubtractWithNonNegativeResult(deviceUsed[minor], v)
			remaining := quotav1.SubtractWithNonNegativeResult(nodeDeviceTotal[minor], used)
			if !quotav1.IsZero(remaining) {
				mergedFreeDevices[minor] = remaining
			}
		}
		// merge in the regular free resources
		for minor, v := range freeDevices {
			res := mergedFreeDevices[minor]
			if res == nil {
				mergedFreeDevices[minor] = v.DeepCopy()
			} else {
				util.AddResourceList(res, v)
			}
		}
		freeDevices = mergedFreeDevices
	}
	// GPU-specific handling: fill in the total GPU memory
	if deviceType == schedulingv1alpha1.GPU {
		if err := fillGPUTotalMem(nodeDeviceTotal, podRequest); err != nil {
			return err
		}
	}
	// work out how many devices are needed
	var deviceAllocations []*apiext.DeviceAllocation
	deviceWanted := int64(1)
	podRequestPerCard := podRequest
	// multi-card request? (e.g. gpu-core: 200 means 2 cards)
	if isPodRequestsMultipleDevice(podRequest, deviceType) {
		switch deviceType {
		case schedulingv1alpha1.GPU:
			gpuCore := podRequest[apiext.ResourceGPUCore]
			deviceWanted = gpuCore.Value() / 100
			// per-card resource requirements
			gpuMem := podRequest[apiext.ResourceGPUMemory]
			gpuMemRatio := podRequest[apiext.ResourceGPUMemoryRatio]
			podRequestPerCard = corev1.ResourceList{
				apiext.ResourceGPUCore:        *resource.NewQuantity(gpuCore.Value()/deviceWanted, resource.DecimalSI),
				apiext.ResourceGPUMemory:      *resource.NewQuantity(gpuMem.Value()/deviceWanted, resource.BinarySI),
				apiext.ResourceGPUMemoryRatio: *resource.NewQuantity(gpuMemRatio.Value()/deviceWanted, resource.DecimalSI),
			}
		}
	}
	// scan the free devices and try to allocate
	satisfiedDeviceCount := 0
	orderedDeviceResources := sortDeviceResourcesByMinor(freeDevices)
	for _, deviceResource := range orderedDeviceResources {
		// skip devices with no free resources (e.g. unhealthy devices)
		if quotav1.IsZero(deviceResource.resources) {
			continue
		}
		// does this device satisfy the per-card request?
		if satisfied, _ := quotav1.LessThanOrEqual(podRequestPerCard, deviceResource.resources); satisfied {
			satisfiedDeviceCount++
			deviceAllocations = append(deviceAllocations, &apiext.DeviceAllocation{
				Minor:     int32(deviceResource.minor),
				Resources: podRequestPerCard,
			})
		}
		// have we gathered enough devices?
		if satisfiedDeviceCount == int(deviceWanted) {
			allocateResult[deviceType] = deviceAllocations
			return nil
		}
	}
	return fmt.Errorf("node does not have enough %v", deviceType)
}
```
Allocation algorithm flow:

```text
┌────────────────────────────────────────────────────────┐
│ GPU device allocation algorithm                        │
├────────────────────────────────────────────────────────┤
│ Input:                                                 │
│   - podRequest: {gpu-core: 50, gpu-memory-ratio: 50}   │
│   - nodeDeviceFree: {0: 75%, 1: 25%, 2: 100%}          │
│                                                        │
│ Step 1: sort by minor                                  │
│   orderedDevices = [0, 1, 2]                           │
│                                                        │
│ Step 2: scan the devices                               │
│   for minor in orderedDevices:                         │
│       if deviceFree[minor] >= podRequest:              │
│           allocate on this device                      │
│           break                                        │
│                                                        │
│ Step 3: return the result                              │
│   allocationResult = {                                 │
│       gpu: [{minor: 0, resources: {gpu-core: 50}}]     │
│   }                                                    │
│                                                        │
│ Notes:                                                 │
│   1. Lower minors are tried first (a simple strategy)  │
│   2. With binpack enabled, partially used devices are  │
│      filled first                                      │
│   3. Multi-card requests are supported                 │
│      (gpu-core: 200 → 2 cards)                         │
└────────────────────────────────────────────────────────┘
```
Allocation examples:

```text
Scenario 1: single-card allocation
  Pod request: gpu-core: 50, gpu-memory-ratio: 50
  Node state:
    GPU 0: 25% used, 75% free
    GPU 1: 50% used, 50% free
    GPU 2:  0% used, 100% free
  Result: GPU 0 (the lowest minor with enough free resources)
  State afterwards:
    GPU 0: 75% used, 25% free

Scenario 2: multi-card allocation
  Pod request: gpu-core: 200, gpu-memory-ratio: 200
  Node state:
    GPU 0: 0% used, 100% free
    GPU 1: 0% used, 100% free
    GPU 2: 0% used, 100% free
  Result: GPU 0 and GPU 1 (2 whole cards, chosen in minor order)
  State afterwards:
    GPU 0: 100% used, 0% free
    GPU 1: 100% used, 0% free

Scenario 3: insufficient resources
  Pod request: gpu-core: 100, gpu-memory-ratio: 100
  Node state:
    GPU 0: 75% used, 25% free
    GPU 1: 50% used, 50% free
    GPU 2: 80% used, 20% free
  Result: failure (no GPU has 100% free)
  Filter returns: Unschedulable
```
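The three scenarios above can be reproduced with a self-contained sketch of the first-fit-by-minor strategy. Only `gpu-core` is modeled and the types are simplified stand-ins, not the actual Koordinator implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// allocation records which device minor serves how many gpu-cores.
type allocation struct {
	Minor int
	Core  int64
}

// allocateGPU sketches the first-fit-by-minor strategy: a request of
// N*100 cores is split into N whole cards, and minors are scanned in
// ascending order until the request is satisfied.
func allocateGPU(free map[int]int64, coreRequest int64) ([]allocation, error) {
	wanted := int64(1)
	perCard := coreRequest
	if coreRequest > 100 { // multi-card request, e.g. 200 → 2 full cards
		wanted = coreRequest / 100
		perCard = 100
	}
	minors := make([]int, 0, len(free))
	for m := range free {
		minors = append(minors, m)
	}
	sort.Ints(minors)

	var result []allocation
	for _, m := range minors {
		if free[m] >= perCard {
			result = append(result, allocation{Minor: m, Core: perCard})
			if int64(len(result)) == wanted {
				return result, nil
			}
		}
	}
	return nil, fmt.Errorf("node does not have enough GPU")
}

func main() {
	// Scenario 1: single-card request on a partially used node.
	fmt.Println(allocateGPU(map[int]int64{0: 75, 1: 50, 2: 100}, 50))
	// Scenario 3: no card has 100 cores free, so allocation fails.
	_, err := allocateGPU(map[int]int64{0: 25, 1: 50, 2: 20}, 100)
	fmt.Println(err)
}
```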
5. The Reserve and PreBind Phases
5.1 Reserve: Committing the Allocation
Reserve performs the actual resource allocation:
```go
func (p *Plugin) Reserve(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
	state, status := getPreFilterState(cycleState)
	if !status.IsSuccess() {
		return status
	}
	if state.skip {
		return nil
	}
	nodeDeviceInfo := p.nodeDeviceCache.getNodeDevice(nodeName, false)
	if nodeDeviceInfo == nil {
		return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrMissingDevice)
	}
	nodeDeviceInfo.lock.Lock()
	defer nodeDeviceInfo.lock.Unlock()
	// fetch the resources reserved by Reservations
	reservedDevices := p.getReservationReservedDevices(cycleState, state, pod, nodeName)
	var nodeDeviceInfoToAllocate *nodeDevice
	if len(reservedDevices) > 0 {
		nodeDeviceInfoToAllocate = nodeDeviceInfo.replaceWith(reservedDevices)
	} else {
		nodeDeviceInfoToAllocate = nodeDeviceInfo
	}
	// perform the allocation
	allocateResult, err := p.allocator.Allocate(nodeName, pod, state.podRequests, nodeDeviceInfoToAllocate, state.preemptibleDevices[nodeName])
	if err != nil || len(allocateResult) == 0 {
		return framework.NewStatus(framework.Unschedulable, ErrInsufficientDevices)
	}
	// commit the result to the cache
	p.allocator.Reserve(pod, nodeDeviceInfo, allocateResult)
	state.allocationResult = allocateResult
	return nil
}
```
The allocator's Reserve implementation:

```go
// pkg/scheduler/plugins/deviceshare/allocator.go
func (a *defaultAllocator) Reserve(pod *corev1.Pod, nodeDevice *nodeDevice, allocations apiext.DeviceAllocations) {
	for deviceType, allocs := range allocations {
		// update deviceUsed
		nodeDevice.updateDeviceUsed(deviceType, allocs, true)
		// update allocateSet
		nodeDevice.updateAllocateSet(deviceType, allocs, pod, true)
	}
	// recompute deviceFree
	for deviceType := range allocations {
		nodeDevice.resetDeviceFree(deviceType)
	}
}
```
State update example:

```text
Before Reserve:
  deviceUsed[GPU][0] = {gpu-core: 25}
  deviceFree[GPU][0] = {gpu-core: 75}

After Reserve (allocating 50% of a GPU):
  deviceUsed[GPU][0] = {gpu-core: 75}   // 25 + 50
  deviceFree[GPU][0] = {gpu-core: 25}   // 100 - 75
  allocateSet[GPU][default/pod-new] = {0: {gpu-core: 50}}
```
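The bookkeeping above, and its inverse in Unreserve, can be sketched with plain types. `nodeDeviceState` here is an illustrative simplification (one device type, gpu-core only), not the plugin's real cache:

```go
package main

import "fmt"

// nodeDeviceState is a plain-Go sketch of the used/free/allocateSet
// bookkeeping performed by Reserve and Unreserve.
type nodeDeviceState struct {
	total map[int]int64
	used  map[int]int64
	free  map[int]int64
	// allocateSet: pod name → minor → allocated gpu-cores
	allocateSet map[string]map[int]int64
}

// reserve commits an allocation: bump used, rederive free, record the Pod.
func (n *nodeDeviceState) reserve(pod string, minor int, cores int64) {
	n.used[minor] += cores
	n.free[minor] = n.total[minor] - n.used[minor]
	if n.allocateSet[pod] == nil {
		n.allocateSet[pod] = map[int]int64{}
	}
	n.allocateSet[pod][minor] += cores
}

// unreserve rolls an allocation back on scheduling failure.
func (n *nodeDeviceState) unreserve(pod string, minor int, cores int64) {
	n.used[minor] -= cores
	n.free[minor] = n.total[minor] - n.used[minor]
	delete(n.allocateSet, pod)
}

func main() {
	n := &nodeDeviceState{
		total:       map[int]int64{0: 100},
		used:        map[int]int64{0: 25},
		free:        map[int]int64{0: 75},
		allocateSet: map[string]map[int]int64{},
	}
	n.reserve("default/pod-new", 0, 50)
	fmt.Println(n.used[0], n.free[0]) // prints: 75 25
}
```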
5.2 PreBind: Injecting the Allocation Result
PreBind writes the allocation result into the Pod's annotations:
```go
func (p *Plugin) PreBind(ctx context.Context, cycleState *framework.CycleState, pod *corev1.Pod, nodeName string) *framework.Status {
	return p.preBindObject(ctx, cycleState, pod, nodeName)
}

func (p *Plugin) preBindObject(ctx context.Context, cycleState *framework.CycleState, object runtime.Object, nodeName string) *framework.Status {
	state, status := getPreFilterState(cycleState)
	if !status.IsSuccess() {
		return status
	}
	if state.skip {
		return nil
	}
	originalObj := object.DeepCopyObject()
	metaObject := object.(metav1.Object)
	// write the device allocations into the annotations
	if err := apiext.SetDeviceAllocations(metaObject, state.allocationResult); err != nil {
		return framework.AsStatus(err)
	}
	// patch the Pod
	err := util.RetryOnConflictOrTooManyRequests(func() error {
		_, err1 := util.NewPatch().
			WithHandle(p.handle).
			AddAnnotations(metaObject.GetAnnotations()).
			Patch(ctx, originalObj.(metav1.Object))
		return err1
	})
	if err != nil {
		klog.V(3).ErrorS(err, "Failed to preBind", "object", klog.KObj(metaObject), "Devices", state.allocationResult, "node", nodeName)
		return framework.NewStatus(framework.Error, err.Error())
	}
	klog.V(4).Infof("Successfully preBind %T %v", object, klog.KObj(metaObject))
	return nil
}
```
Annotation format:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    scheduling.koordinator.sh/device-allocated: |
      {
        "gpu": [
          {
            "minor": 0,
            "resources": {
              "koordinator.sh/gpu-core": "50",
              "koordinator.sh/gpu-memory-ratio": "50",
              "koordinator.sh/gpu-memory": "8Gi"
            }
          }
        ]
      }
```
6. Production Tuning
6.1 Performance Optimization
Lock optimization in the cache:
```go
// reads dominate writes, so an RWMutex is used
type nodeDeviceCache struct {
	lock            sync.RWMutex // reader/writer lock
	nodeDeviceInfos map[string]*nodeDevice
}

// Filter phase: read lock
nodeDeviceInfo.lock.RLock()
defer nodeDeviceInfo.lock.RUnlock()

// Reserve phase: write lock
nodeDeviceInfo.lock.Lock()
defer nodeDeviceInfo.lock.Unlock()
```
Batch-processing optimization:

```text
Scenario: 100 Pods scheduled onto the same node at the same time
Before: each Pod acquires the lock separately → 100 lock contentions
After: one lock acquisition for the batch → 1 lock operation + 100 in-memory updates
Result: latency drops from ~100ms to ~10ms (about 10x)
```
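The batching idea can be sketched as follows; `counterCache` and `applyBatch` are illustrative names, not the plugin's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// alloc is one device allocation to apply to the cache.
type alloc struct {
	Minor int
	Core  int64
}

// counterCache models the node cache guarded by a single mutex.
type counterCache struct {
	mu   sync.Mutex
	used map[int]int64
}

// applyBatch takes the lock once and applies a whole batch of
// allocations, instead of locking once per Pod.
func (c *counterCache) applyBatch(allocs []alloc) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, a := range allocs {
		c.used[a.Minor] += a.Core
	}
}

func main() {
	c := &counterCache{used: map[int]int64{}}
	c.applyBatch([]alloc{{0, 50}, {0, 25}, {1, 100}})
	fmt.Println(c.used[0], c.used[1]) // prints: 75 100
}
```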
6.2 Monitoring Metrics

```yaml
# Prometheus metrics
- name: deviceshare_allocation_duration_seconds
  type: histogram
  help: "Device allocation latency"
- name: deviceshare_allocation_errors_total
  type: counter
  help: "Number of failed device allocations"
  labels: ["error_type", "device_type"]
- name: deviceshare_cache_size
  type: gauge
  help: "Number of nodes in the cache"
- name: deviceshare_device_utilization
  type: gauge
  help: "Device utilization"
  labels: ["node", "device_type", "minor"]
```
6.3 Troubleshooting
Common issue 1: Filter fails on every node
Steps to investigate:

```bash
# 1. Check that the Device CRD exists
kubectl get device
# 2. Check that the device information is correct
kubectl get device <node-name> -o yaml
# 3. Check the Pod's GPU request
kubectl get pod <pod-name> -o yaml | grep koordinator.sh/gpu
# 4. Inspect the DeviceShare logs
kubectl logs -n kube-system koord-scheduler-xxx | grep "DeviceShare\|Filter"
```
Common issue 2: PreBind fails
Steps to investigate:

```bash
# 1. Check whether the Pod already has the annotation
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations}'
# 2. Check that the APIServer is reachable
kubectl get --raw /api/v1/namespaces/default/pods/<pod-name>
# 3. Look at the detailed error message
kubectl logs -n kube-system koord-scheduler-xxx | grep "PreBind.*Failed"
```