[Kubernetes v1.21] (kube-scheduler 4) kube-scheduler internal cache, queues, and preemption

Part 4: kube-scheduler internal cache, queues, and preemption, a line-by-line analysis


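I. Module Positioning

1.1 Cache (cache layer)

Responsibility: the Cache is the scheduler's view of cluster state. It maintains aggregated, up-to-date information about every Node and Pod in the cluster. It is updated incrementally: rather than polling, it receives event-driven updates from Informers, which are fed by the Reflector's list/watch. Its core value lies in:

  • No API Server queries on the hot path: every Node lookup during a scheduling cycle is served from the Cache, so scheduling adds no read load on the API Server or etcd
  • Assumed Pods: once a node has been chosen, and before the bind call is sent, the scheduler marks the Pod as "assumed" on that node instead of waiting for the API Server to confirm the binding, which gives optimistic concurrency
  • Snapshot isolation: each scheduling cycle begins by calling UpdateSnapshot to refresh a snapshot that plugins treat as read-only, so cache updates that arrive mid-cycle do not affect the cycle in progress
  • Incremental image state: tracks which nodes hold each image, which supports the ImageLocality scoring plugin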

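1.2 SchedulingQueue (scheduling queue layer)

Responsibility: the SchedulingQueue is the priority queue of Pods waiting to be scheduled and is the scheduler's entry point. It addresses four problems:

  • Priority scheduling: higher-priority Pods are dequeued first
  • Backoff: a Pod that failed scheduling is not retried immediately; its retry delay grows with each attempt
  • Unschedulable Pod management: Pods that were tried and failed are parked in UnschedulablePodsMap until cluster state changes and they are re-enqueued
  • Event-driven re-enqueueing: cluster events such as Node, PVC, or PV changes move the affected Pods back for another attempt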

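1.3 DefaultPreemption (default preemption mechanism)

Responsibility: when no node passes the Filter phase for a Pod, DefaultPreemption steps in as a PostFilter plugin and evicts lower-priority Pods to make room for the higher-priority one. Core logic:

  • Candidate discovery: simulates eviction (dry run) on multiple nodes in parallel
  • PDB protection: prefers eviction plans that do not violate a PodDisruptionBudget
  • Best-node selection: narrows the candidates tier by tier, by number of PDB violations, highest victim priority, sum of victim priorities, and number of victims
  • Extender integration: hands the candidates to external extenders for further filtering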

II. Overall Module Structure

2.1 Cache interface and schedulerCache structure

Cache interface definition (interface.go)

```go
type Cache interface {
    NodeCount() int                              // number of nodes (tests only)
    PodCount() (int, error)                      // number of Pods, including Pods on deleted nodes
    AssumePod(pod *v1.Pod) error                 // assume the Pod is scheduled
    FinishBinding(pod *v1.Pod) error             // signal that binding finished; the assumed Pod may now expire
    ForgetPod(pod *v1.Pod) error                 // remove an assumed Pod
    AddPod(pod *v1.Pod) error                    // confirm an assumed Pod, or add back an expired one
    UpdatePod(oldPod, newPod *v1.Pod) error      // update Pod information
    RemovePod(pod *v1.Pod) error                 // remove a Pod
    GetPod(pod *v1.Pod) (*v1.Pod, error)         // look up a Pod
    IsAssumedPod(pod *v1.Pod) (bool, error)      // report whether the Pod is assumed
    AddNode(node *v1.Node) error                 // add a node
    UpdateNode(oldNode, newNode *v1.Node) error  // update a node
    RemoveNode(node *v1.Node) error              // remove a node
    UpdateSnapshot(nodeSnapshot *Snapshot) error // refresh the snapshot
    Dump() *Dump                                 // dump cache state (debugging)
}
```

schedulerCache fields (cache.go)
```go
type schedulerCache struct {
    stop   <-chan struct{} // stop signal channel
    ttl    time.Duration   // TTL after which an assumed Pod expires
    period time.Duration   // interval of the cleanup goroutine

    mu          sync.RWMutex                 // guards every field of this struct
    assumedPods sets.String                  // keys of assumed Pods (fast membership test)
    podStates   map[string]*podState         // Pod key → podState
    nodes       map[string]*nodeInfoListItem // node name → NodeInfo list item
    headNode    *nodeInfoListItem            // head of the doubly linked list (most recently updated node)
    nodeTree    *nodeTree                    // nodes grouped by zone (determines snapshot ordering)
    imageStates map[string]*imageState       // image name → image state
}
```

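Helper data structures:

```go
type podState struct {
    pod             *v1.Pod    // the Pod object
    deadline        *time.Time // expiry deadline (set by FinishBinding after the Pod is assumed)
    bindingFinished bool       // whether binding has finished
}

type imageState struct {
    size  int64       // image size in bytes
    nodes sets.String // names of the nodes that have this image
}

type nodeInfoListItem struct {
    info *framework.NodeInfo // aggregated node information
    next *nodeInfoListItem   // next item in the doubly linked list
    prev *nodeInfoListItem   // previous item in the doubly linked list
}
```

Pod state machine (from the comments in interface.go)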
```
  +-------------------------------------------+  +----+
  |                            Add            |  |    |
  |                                           |  |    | Update
  +      Assume                Add            v  v    |
Initial +--------> Assumed +------------+---> Added <--+
  ^                +   +               |       +
  |                |   |               |       |
  |                |   |           Add |       | Remove
  |                |   |               |       |
  |                |   |               +       |
  +----------------+   +-----------> Expired   +----> Deleted
        Forget             Expire
```

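Key rules:

  • A Pod is never assumed twice
  • A Pod can be added without being assumed first (a Pod that did not go through this scheduler)
  • A Pod that has not been added is never updated or removed
  • Expired and Deleted are both valid end states, although, as the diagram shows, an Expired Pod returns to Added if an Add event arrives later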
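The diagram and rules above can be read as a transition table. The following is a minimal sketch of that table, not the scheduler's implementation: the State and Event types, the transitions map, and the Next function are hypothetical names introduced for illustration, and the real cache encodes these states implicitly through podStates and assumedPods.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical enum mirroring the states in the diagram.
type State int

const (
	Initial State = iota
	Assumed
	Added
	Expired
	Deleted
)

// Event names the cache operation that drives a transition.
type Event string

// transitions encodes the edges of the diagram: state -> event -> next state.
var transitions = map[State]map[Event]State{
	Initial: {"Assume": Assumed, "Add": Added},
	Assumed: {"Add": Added, "Forget": Initial, "Expire": Expired},
	Added:   {"Update": Added, "Remove": Deleted},
	Expired: {"Add": Added},
}

var errIllegal = errors.New("illegal transition")

// Next returns the state reached from s on event e, or errIllegal
// when the diagram has no such edge.
func Next(s State, e Event) (State, error) {
	if next, ok := transitions[s][e]; ok {
		return next, nil
	}
	return s, errIllegal
}

func main() {
	s := Initial
	for _, e := range []Event{"Assume", "Add", "Update", "Remove"} {
		next, err := Next(s, e)
		fmt.Println(e, "->", next, err)
		s = next
	}
	// A Pod is never assumed twice, so this edge does not exist.
	_, err := Next(Assumed, "Assume")
	fmt.Println(err)
}
```

Writing the rules as data makes the constraints easy to check: Update and Remove appear only under Added, and Assume appears only under Initial.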

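2.2 NodeInfo data structure (framework/types.go)

```go
type NodeInfo struct {
    node *v1.Node // the Node object

    Pods                         []*PodInfo // all Pods on the node
    PodsWithAffinity             []*PodInfo // subset of Pods that declare affinity
    PodsWithRequiredAntiAffinity []*PodInfo // subset of Pods that declare required anti-affinity

    UsedPorts HostPortInfo // host ports already in use

    Requested        *Resource // total requests of all Pods (assumed Pods included)
    NonZeroRequested *Resource // requests with defaults substituted for unset CPU/memory
    Allocatable      *Resource // allocatable resources of the node

    ImageStates   map[string]*ImageStateSummary // state of the images on this node
    TransientInfo *TransientSchedulerInfo       // temporary data valid for one scheduling cycle
    Generation    int64                         // generation number (drives incremental snapshots)
}

type Resource struct {
    MilliCPU         int64                     // CPU in millicores
    Memory           int64                     // memory in bytes
    EphemeralStorage int64                     // ephemeral storage in bytes
    AllowedPodNumber int                       // maximum number of Pods
    ScalarResources  map[v1.ResourceName]int64 // scalar resources (GPUs and similar)
}
```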

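Generation mechanism: every change to a NodeInfo stamps it with a new value from nextGeneration(), which increments a global atomic counter. UpdateSnapshot uses Generation for incremental updates: it clones only the nodes that changed since the previous snapshot, which greatly reduces snapshot cost.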
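The generation idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the scheduler's code: the types nodeInfo, cache, and snapshot and the method touch are hypothetical, the counter is a plain integer rather than an atomic one, node deletion is ignored, and the loop scans every node instead of walking the most-recently-updated linked list that the real cache keeps.

```go
package main

import "fmt"

// nodeInfo is a simplified stand-in for framework.NodeInfo.
type nodeInfo struct {
	name       string
	requested  int64
	generation int64
}

// generation is the global counter; the real scheduler increments it atomically.
var generation int64

func nextGeneration() int64 {
	generation++
	return generation
}

type cache struct{ nodes map[string]*nodeInfo }

type snapshot struct {
	nodes      map[string]nodeInfo
	generation int64 // highest generation already copied into the snapshot
}

// touch records a change to a node and stamps it with a fresh generation.
func (c *cache) touch(name string, delta int64) {
	n, ok := c.nodes[name]
	if !ok {
		n = &nodeInfo{name: name}
		c.nodes[name] = n
	}
	n.requested += delta
	n.generation = nextGeneration()
}

// updateSnapshot copies only the nodes changed since the last snapshot
// and returns how many nodes it copied.
func (c *cache) updateSnapshot(s *snapshot) int {
	copied, latest := 0, s.generation
	for name, n := range c.nodes {
		if n.generation <= s.generation {
			continue // unchanged since the previous snapshot
		}
		s.nodes[name] = *n
		copied++
		if n.generation > latest {
			latest = n.generation
		}
	}
	s.generation = latest
	return copied
}

func main() {
	c := &cache{nodes: map[string]*nodeInfo{}}
	s := &snapshot{nodes: map[string]nodeInfo{}}
	c.touch("node-a", 100)
	c.touch("node-b", 200)
	c.touch("node-c", 300)
	fmt.Println(c.updateSnapshot(s)) // 3: the first snapshot copies every node
	c.touch("node-b", 50)
	fmt.Println(c.updateSnapshot(s)) // 1: only node-b changed
	fmt.Println(c.updateSnapshot(s)) // 0: nothing changed
}
```

Because the real cache keeps nodes ordered by recency, it can stop at the first node whose generation is not newer than the snapshot's, rather than scanning all nodes as this sketch does.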

2.3 SchedulingQueue interface and the PriorityQueue implementation

SchedulingQueue interface

```go
type SchedulingQueue interface {
    framework.PodNominator                      // embedded nominator interface
    Add(pod *v1.Pod) error                      // add a Pod to activeQ
    AddUnschedulableIfNotPresent(pInfo *framework.QueuedPodInfo, podSchedulingCycle int64) error // add an unschedulable Pod
    SchedulingCycle() int64                     // current scheduling cycle number
    Pop() (*framework.QueuedPodInfo, error)     // pop the highest-priority Pod
    Update(oldPod, newPod *v1.Pod) error        // update a Pod
    Delete(pod *v1.Pod) error                   // delete a Pod
    MoveAllToActiveOrBackoffQueue(event string) // move everything to activeQ or backoffQ
    AssignedPodAdded(pod *v1.Pod)               // a bound Pod was added
    AssignedPodUpdated(pod *v1.Pod)             // a bound Pod was updated
    PendingPods() []*v1.Pod                     // all Pods waiting to be scheduled
    Close()                                     // close the queue
    NumUnschedulablePods() int                  // number of unschedulable Pods
    Run()                                       // start the background goroutines
}
```

PriorityQueue fields
```go
type PriorityQueue struct {
    framework.PodNominator // nominator (reused through embedding)

    stop  chan struct{} // stop signal
    clock util.Clock    // injectable clock

    podInitialBackoffDuration time.Duration // initial backoff (default 1s)
    podMaxBackoffDuration     time.Duration // maximum backoff (default 10s)

    lock sync.RWMutex // protects the queues
    cond sync.Cond    // condition variable (Pop blocks on it)

    activeQ          *heap.Heap            // active queue (heap ordered by priority)
    podBackoffQ      *heap.Heap            // backoff queue (heap ordered by backoff completion time)
    unschedulableQ   *UnschedulablePodsMap // map of unschedulable Pods
    schedulingCycle  int64                 // scheduling cycle number
    moveRequestCycle int64                 // cycle number of the last move request

    clusterEventMap map[framework.ClusterEvent]sets.String // event → plugin names

    closed   bool                      // whether the queue is closed
    nsLister listersv1.NamespaceLister // Namespace lister
}
```

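2.4 ActiveQ / BackoffQ / UnschedulablePodsMap structure

activeQ: built on heap.Heap. Its ordering function lessFn is supplied from outside by the QueueSort plugin; the default PrioritySort plugin compares priority first and then the time the Pod entered the queue. The Pod at the top of the heap has the highest priority.

podBackoffQ: also built on heap.Heap, but ordered by podsCompareBackoffCompleted, that is, by backoff completion time. The Pod whose backoff finishes first sits at the top of the heap.

UnschedulablePodsMap

```go
type UnschedulablePodsMap struct {
    podInfoMap     map[string]*framework.QueuedPodInfo // Pod full name → QueuedPodInfo
    keyFunc        func(*v1.Pod) string                // key function (GetPodFullName)
    metricRecorder metrics.MetricRecorder              // metrics recorder
}
```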

2.5 DefaultPreemption structure

```go
type DefaultPreemption struct {
    fh        framework.Handle                        // framework handle
    args      config.DefaultPreemptionArgs            // plugin arguments
    podLister corelisters.PodLister                   // Pod lister
    pdbLister policylisters.PodDisruptionBudgetLister // PDB lister
}

// Candidate interface
type Candidate interface {
    Victims() *extenderv1.Victims // Pods to evict plus the number of PDB violations
    Name() string                 // candidate node name
}

// candidate implementation
type candidate struct {
    victims *extenderv1.Victims
    name    string
}
```

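2.6 Core methods and their roles

| Module | Method | Purpose |
|---|---|---|
| Cache | AssumePod | Marks the Pod as assumed and adds it to the NodeInfo |
| Cache | FinishBinding | Sets the expiry deadline of an assumed Pod |
| Cache | ForgetPod | Removes an assumed Pod (scheduling rollback) |
| Cache | AddPod | Confirms an assumed Pod or adds back an expired Pod |
| Cache | UpdatePod | Updates a confirmed Pod |
| Cache | RemovePod | Removes the Pod from the NodeInfo |
| Cache | UpdateSnapshot | Incrementally refreshes the snapshot (key performance optimization) |
| Queue | Add | Adds a new Pod to activeQ |
| Queue | Pop | Blocks until it can pop the top of activeQ |
| Queue | AddUnschedulableIfNotPresent | Re-queues a Pod that failed scheduling |
| Queue | flushBackoffQCompleted | Moves Pods whose backoff has expired to activeQ |
| Queue | flushUnschedulableQLeftover | Moves Pods that stayed too long in unschedulableQ |
| Queue | MoveAllToActiveOrBackoffQueue | Triggers a full re-scheduling pass |
| Preemption | PostFilter | Preemption entry point |
| Preemption | FindCandidates | Finds candidate nodes |
| Preemption | dryRunPreemption | Simulates eviction in parallel |
| Preemption | selectVictimsOnNode | Computes the minimal victim set on one node |
| Preemption | SelectCandidate | Picks the best candidate |
| Preemption | PrepareCandidate | Carries out the eviction |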

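III. Core Logic in Depth

3.1 Cache AddPod / RemovePod / UpdatePod

3.1.1 AddPod line by line

```go
func (cache *schedulerCache) AddPod(pod *v1.Pod) error {
    key, err := framework.GetPodKey(pod) // the key is derived from the Pod UID
    if err != nil {
        return err
    }

    cache.mu.Lock()
    defer cache.mu.Unlock()

    currState, ok := cache.podStates[key]
    switch {
    case ok && cache.assumedPods.Has(key):
        // Case 1: the Pod was assumed and the Add event now confirms it
        if currState.pod.Spec.NodeName != pod.Spec.NodeName {
            // Warning: the Pod landed on a different node than assumed (should not happen in theory)
            // Drop the stale assumed entry and add the Pod on its actual node
            cache.removePod(currState.pod)
            cache.addPod(pod)
        }
        // Remove from assumedPods: the Pod is no longer assumed
        delete(cache.assumedPods, key)
        cache.podStates[key].deadline = nil // clear the expiry deadline
        cache.podStates[key].pod = pod      // store the confirmed Pod object

    case !ok:
        // Case 2: the Pod is not in the cache (it expired earlier, or it never went through this scheduler)
        cache.addPod(pod)
        ps := &podState{pod: pod}
        cache.podStates[key] = ps

    default:
        // Case 3: the Pod is already in the Added state; a repeated Add is an error
        return fmt.Errorf("pod %v was already in added state", key)
    }
    return nil
}
```

Key insight: AddPod has two distinct semantic paths: confirming an assumed Pod, and adding a Pod that expired or was never assumed. The first is the normal, most common flow; the second handles edge cases.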

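3.1.2 addPod internal method (caller must hold the lock)

```go
func (cache *schedulerCache) addPod(pod *v1.Pod) {
    n, ok := cache.nodes[pod.Spec.NodeName]
    if !ok {
        // The node is not in the cache yet: create a placeholder NodeInfo
        n = newNodeInfoListItem(framework.NewNodeInfo())
        cache.nodes[pod.Spec.NodeName] = n
    }
    n.info.AddPod(pod)                          // add the Pod to the NodeInfo and update resource totals
    cache.moveNodeInfoToHead(pod.Spec.NodeName) // move to the list head (marks it most recently updated)
}
```

Note: addPod creates a NodeInfo even if the node has not yet been added through AddNode. This lets a Pod event arrive before the corresponding Node event, since event ordering is not guaranteed.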

3.1.3 RemovePod line by line

```go
func (cache *schedulerCache) RemovePod(pod *v1.Pod) error {
    key, err := framework.GetPodKey(pod)
    if err != nil {
        return err
    }

    cache.mu.Lock()
    defer cache.mu.Unlock()

    currState, ok := cache.podStates[key]
    switch {
    case ok && !cache.assumedPods.Has(key):
        // Remove is allowed only from the Added state;
        // an assumed Pod does not receive Delete events
        if currState.pod.Spec.NodeName != pod.Spec.NodeName {
            // Node name mismatch → the cache is corrupted, exit fatally
            klog.Fatalf("Schedulercache is corrupted...")
        }
        if err := cache.removePod(currState.pod); err != nil { // remove from the NodeInfo
            return err
        }
        delete(cache.podStates, key) // drop the podStates entry

    default:
        return fmt.Errorf("pod %v is not found in scheduler cache", key)
    }
    return nil
}
```

3.1.4 removePod internal method (caller must hold the lock)
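```go
func (cache *schedulerCache) removePod(pod *v1.Pod) error {
    n, ok := cache.nodes[pod.Spec.NodeName]
    if !ok {
        klog.Errorf("node %v not found when trying to remove pod %v", ...)
        return nil // the node does not exist: log and return without an error
    }
    if err := n.info.RemovePod(pod); err != nil {
        return err // the Pod was not found in the NodeInfo
    }
    if len(n.info.Pods) == 0 && n.info.Node() == nil {
        // No Pods left and the Node object is nil → drop the NodeInfo from the list
        cache.removeNodeInfoFromList(pod.Spec.NodeName)
    } else {
        // Pods or the Node still exist → move to the list head
        cache.moveNodeInfoToHead(pod.Spec.NodeName)
    }
    return nil
}
```

Ghost-node mechanism: when a node is deleted while some of its Pods are still present (their delete events have not arrived yet), the NodeInfo is kept as a "ghost node": its node field is nil but its Pods list is non-empty. This is necessary because Pod and Node events are delivered through separate watches, so their relative order cannot be guaranteed.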

3.1.5 UpdatePod line by line

```go
func (cache *schedulerCache) UpdatePod(oldPod, newPod *v1.Pod) error {
    key, err := framework.GetPodKey(oldPod)
    if err != nil {
        return err
    }

    cache.mu.Lock()
    defer cache.mu.Unlock()

    currState, ok := cache.podStates[key]
    switch {
    case ok && !cache.assumedPods.Has(key):
        // Update is allowed only from the Added state;
        // an assumed Pod does not receive Update events
        if currState.pod.Spec.NodeName != newPod.Spec.NodeName {
            klog.Fatalf("Schedulercache is corrupted...") // node mismatch → fatal
        }
        if err := cache.updatePod(oldPod, newPod); err != nil { // remove, then add
            return err
        }
        currState.pod = newPod // store the new Pod object

    default:
        return fmt.Errorf("pod %v is not added to scheduler cache", key)
    }
    return nil
}
```

updatePod is implemented as removePod(oldPod) followed by addPod(newPod). Because the caller holds cache.mu for the whole operation, other users of the cache never observe the intermediate state.

3.2 NodeInfo resource accounting

3.2.1 calculateResource: the core resource calculation

```go
func calculateResource(pod *v1.Pod) (res Resource, non0CPU int64, non0Mem int64) {
    resPtr := &res

    // Pass 1: sum the requests of all regular containers
    for _, c := range pod.Spec.Containers {
        resPtr.Add(c.Resources.Requests)
        non0CPUReq, non0MemReq := schedutil.GetNonzeroRequests(&c.Resources.Requests)
        non0CPU += non0CPUReq
        non0Mem += non0MemReq
    }

    // Pass 2: init containers run one at a time, so take the maximum instead of the sum
    for _, ic := range pod.Spec.InitContainers {
        resPtr.SetMaxResource(ic.Resources.Requests)
        non0CPUReq, non0MemReq := schedutil.GetNonzeroRequests(&ic.Resources.Requests)
        if non0CPU < non0CPUReq { non0CPU = non0CPUReq }
        if non0Mem < non0MemReq { non0Mem = non0MemReq }
    }

    // Pass 3: Pod overhead (for example, the extra cost of a VM-based runtime)
    // featureGateEnabled(PodOverhead) is shorthand for the feature-gate check
    if pod.Spec.Overhead != nil && featureGateEnabled(PodOverhead) {
        resPtr.Add(pod.Spec.Overhead)
        // ... add the overhead to the non-zero values as well
    }

    return
}
```

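Key formulas

```
PodRequest     = max(sum(Containers.Requests), max(InitContainers.Requests)) + Overhead
NonZeroRequest = max(sum(Containers.NonZeroRequests), max(InitContainers.NonZeroRequests)) + Overhead
```

Each formula is evaluated separately for every resource; NonZeroRequest covers CPU and memory only.

Why NonZeroRequested exists: it keeps large numbers of Pods without requests from piling onto one node. When a container does not specify a CPU or memory request, NonZeroRequested substitutes a default (100m CPU / 200Mi memory), so such Pods still carry weight in the scheduler's resource accounting.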
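A worked example makes the first formula concrete. The sketch below is an illustration only: the req type and the podRequest function are hypothetical, and only CPU and memory are modeled, whereas the real Resource type also tracks ephemeral storage and scalar resources.

```go
package main

import "fmt"

// req is a simplified request: CPU in millicores and memory in bytes.
type req struct{ milliCPU, memory int64 }

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// podRequest applies max(sum(containers), max(initContainers)) + overhead,
// evaluated independently for each resource.
func podRequest(containers, initContainers []req, overhead req) req {
	var total req
	for _, c := range containers {
		total.milliCPU += c.milliCPU
		total.memory += c.memory
	}
	for _, ic := range initContainers {
		total.milliCPU = maxInt64(total.milliCPU, ic.milliCPU)
		total.memory = maxInt64(total.memory, ic.memory)
	}
	total.milliCPU += overhead.milliCPU
	total.memory += overhead.memory
	return total
}

func main() {
	containers := []req{{200, 256 << 20}, {300, 128 << 20}} // sum: 500m, 384Mi
	initContainers := []req{{1000, 64 << 20}}               // max: 1000m, 64Mi
	overhead := req{50, 16 << 20}

	r := podRequest(containers, initContainers, overhead)
	// CPU:    max(500, 1000) + 50 = 1050m
	// memory: max(384Mi, 64Mi) + 16Mi = 400Mi
	fmt.Println(r.milliCPU, r.memory>>20) // 1050 400
}
```

Note that the maximum is taken per resource: here the init container dominates CPU while the regular containers dominate memory.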

3.2.2 AddPodInfo: adding a Pod to a NodeInfo

```go
func (n *NodeInfo) AddPodInfo(podInfo *PodInfo) {
    res, non0CPU, non0Mem := calculateResource(podInfo.Pod)

    n.Requested.MilliCPU += res.MilliCPU
    // Special case: an isolated-CPU Pod does not count against regular CPU requests
    // (this check does not appear to be part of upstream v1.21 and likely comes from a patched source tree)
    if podInfo.isIsolatedCpuPod() {
        n.Requested.MilliCPU -= res.MilliCPU
    }
    n.Requested.Memory += res.Memory
    n.Requested.EphemeralStorage += res.EphemeralStorage
    // Add scalar resources
    for rName, rQuant := range res.ScalarResources {
        n.Requested.ScalarResources[rName] += rQuant
    }

    n.NonZeroRequested.MilliCPU += non0CPU
    n.NonZeroRequested.Memory += non0Mem

    n.Pods = append(n.Pods, podInfo)
    // Maintain the affinity / anti-affinity subsets
    if podWithAffinity(podInfo.Pod) {
        n.PodsWithAffinity = append(n.PodsWithAffinity, podInfo)
    }
    if podWithRequiredAntiAffinity(podInfo.Pod) {
        n.PodsWithRequiredAntiAffinity = append(n.PodsWithRequiredAntiAffinity, podInfo)
    }

    n.updateUsedPorts(podInfo.Pod, true) // reserve host ports
    n.Generation = nextGeneration()      // bump the generation
}
```

3.2.3 RemovePod: removing a Pod from a NodeInfo
```go
func (n *NodeInfo) RemovePod(pod *v1.Pod) error {
    k, err := GetPodKey(pod)
    if err != nil {
        return err
    }

    // Remove from the affinity / anti-affinity subsets
    if podWithAffinity(pod) {
        n.PodsWithAffinity = removeFromSlice(n.PodsWithAffinity, k)
    }
    if podWithRequiredAntiAffinity(pod) {
        n.PodsWithRequiredAntiAffinity = removeFromSlice(n.PodsWithRequiredAntiAffinity, k)
    }

    // Find the Pod in the Pods list and remove it
    for i, pi := range n.Pods {
        k2, _ := GetPodKey(pi.Pod)
        if k == k2 {
            // Swap-delete: overwrite slot i with the last element, then truncate
            n.Pods[i] = n.Pods[len(n.Pods)-1]
            n.Pods = n.Pods[:len(n.Pods)-1]

            // Subtract the Pod's resources
            res, non0CPU, non0Mem := calculateResource(pod)
            n.Requested.MilliCPU -= res.MilliCPU
            if pi.isIsolatedCpuPod() {
                n.Requested.MilliCPU += res.MilliCPU // undo: isolated-CPU Pods were never counted
            }
            n.Requested.Memory -= res.Memory
            // ... subtract scalar resources
            n.NonZeroRequested.MilliCPU -= non0CPU
            n.NonZeroRequested.Memory -= non0Mem

            n.updateUsedPorts(pod, false) // release host ports
            n.Generation = nextGeneration()
            n.resetSlicesIfEmpty() // set empty slices to nil
            return nil
        }
    }
    return fmt.Errorf("no corresponding pod %s in pods of node %s", ...)
}
```

Important detail: removeFromSlice uses swap-delete, as does the loop over n.Pods: the target element is overwritten with the last element and the slice is truncated. The removal itself is O(1) but does not preserve order, which is acceptable here because nothing depends on the order in which these lists are traversed.
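As a standalone illustration of the swap-delete technique, here is a minimal sketch on a slice of strings. The swapDelete helper is a hypothetical name; the scheduler applies the same idea to slices of *PodInfo.

```go
package main

import "fmt"

// swapDelete removes the element at index i in O(1) by overwriting it
// with the last element and truncating. Element order is not preserved.
func swapDelete(s []string, i int) []string {
	s[i] = s[len(s)-1]
	return s[:len(s)-1]
}

func main() {
	pods := []string{"a", "b", "c", "d"}
	pods = swapDelete(pods, 1)
	fmt.Println(pods) // [a d c]
}
```

The trade-off is visible in the output: "d" has moved into the slot that "b" occupied, so callers must not rely on positions staying stable.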

3.3 PriorityQueue Pop / Push / Add flow

3.3.1 Add: enqueueing a Pod

```go
func (p *PriorityQueue) Add(pod *v1.Pod) error {
    p.lock.Lock()
    defer p.lock.Unlock()

    pInfo := p.newQueuedPodInfo(pod) // create a QueuedPodInfo with Timestamp = now
    p.activeQ.Add(pInfo)             // push onto the activeQ heap

    // If the Pod is also in unschedulableQ, delete it there (should not happen)
    if p.unschedulableQ.get(pod) != nil {
        p.unschedulableQ.delete(pod)
    }
    // If the Pod is in backoffQ, delete it there
    p.podBackoffQ.Delete(pInfo)

    // Record metrics
    metrics.SchedulerQueueIncomingPods.WithLabelValues("active", PodAdd).Inc()
    p.PodNominator.AddNominatedPod(pInfo.PodInfo, "")
    p.cond.Broadcast() // wake any blocked Pop
    return nil
}
```

Concurrency safety: p.lock protects every queue operation, and p.cond.Broadcast() ensures that a goroutine blocked in Pop is woken.

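3.3.2 Pop: dequeueing a Pod

```go
func (p *PriorityQueue) Pop() (*framework.QueuedPodInfo, error) {
    p.lock.Lock()
    defer p.lock.Unlock()

    for p.activeQ.Len() == 0 {
        // activeQ is empty → block until something arrives
        if p.closed {
            return nil, fmt.Errorf(queueClosed)
        }
        p.cond.Wait() // releases the lock while waiting for a Broadcast
    }

    obj, err := p.activeQ.Pop() // pop the heap top (highest-priority Pod)
    pInfo := obj.(*framework.QueuedPodInfo)
    pInfo.Attempts++    // count this scheduling attempt
    p.schedulingCycle++ // advance the scheduling cycle number
    return pInfo, err
}
```

Key design: Pop blocks on a condition variable (sync.Cond). When activeQ is empty, the Pop goroutine releases the lock and waits until Add or movePodsToActiveOrBackoffQueue calls Broadcast. The wait sits inside a for loop so that the condition is re-checked after every wake-up.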
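The locking pattern can be reproduced with a much smaller queue. The following is a minimal sketch, assuming a plain FIFO of strings instead of a priority heap; blockingQueue, newBlockingQueue, and errClosed are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("queue closed")

// blockingQueue is a hypothetical FIFO that mirrors the Pop/Add locking pattern.
type blockingQueue struct {
	mu     sync.Mutex
	cond   *sync.Cond
	items  []string
	closed bool
}

func newBlockingQueue() *blockingQueue {
	q := &blockingQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Add appends an item and wakes every goroutine waiting in Pop.
func (q *blockingQueue) Add(item string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, item)
	q.cond.Broadcast()
}

// Close marks the queue closed and wakes waiters so they can return.
func (q *blockingQueue) Close() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.closed = true
	q.cond.Broadcast()
}

// Pop blocks until an item is available or the queue is closed.
func (q *blockingQueue) Pop() (string, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		if q.closed {
			return "", errClosed
		}
		q.cond.Wait() // releases q.mu while waiting, reacquires it before returning
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, nil
}

func main() {
	q := newBlockingQueue()
	done := make(chan string)
	go func() {
		item, _ := q.Pop() // blocks until Add is called
		done <- item
	}()
	q.Add("pod-a")
	fmt.Println(<-done)

	q.Close()
	_, err := q.Pop()
	fmt.Println(err)
}
```

Close must also broadcast; otherwise a goroutine already waiting in Pop would never observe the closed flag.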

3.3.3 AddUnschedulableIfNotPresent: re-queueing a Pod that failed scheduling

```go
func (p *PriorityQueue) AddUnschedulableIfNotPresent(
    pInfo *framework.QueuedPodInfo, podSchedulingCycle int64) error {

    p.lock.Lock()
    defer p.lock.Unlock()
    pod := pInfo.Pod

    // Three duplicate checks, one per queue
    if p.unschedulableQ.get(pod) != nil { return fmt.Errorf("already in unschedulable") }
    if _, exists, _ := p.activeQ.Get(pInfo); exists { return fmt.Errorf("already in active") }
    if _, exists, _ := p.podBackoffQ.Get(pInfo); exists { return fmt.Errorf("already in backoff") }

    pInfo.Timestamp = p.clock.Now() // refresh the timestamp

    // Key decision: did a move request arrive during or after this Pod's scheduling cycle?
    if p.moveRequestCycle >= podSchedulingCycle {
        // A move request arrived recently → go to backoffQ instead of unschedulableQ
        p.podBackoffQ.Add(pInfo)
        metrics.SchedulerQueueIncomingPods.WithLabelValues("backoff", ScheduleAttemptFailure).Inc()
    } else {
        // Normal path: go to unschedulableQ
        p.unschedulableQ.addOrUpdate(pInfo)
        metrics.SchedulerQueueIncomingPods.WithLabelValues("unschedulable", ScheduleAttemptFailure).Inc()
    }

    p.PodNominator.AddNominatedPod(pInfo.PodInfo, "")
    return nil
}
```

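The subtlety of moveRequestCycle: when cluster state changes significantly (for example, a new node joins), MoveAllToActiveOrBackoffQueue sets moveRequestCycle = schedulingCycle. If a Pod's scheduling attempt was in flight at that moment and then fails, its podSchedulingCycle is no greater than moveRequestCycle, so the condition moveRequestCycle >= podSchedulingCycle holds and the Pod goes straight to BackoffQ rather than UnschedulableQ. Otherwise it would sit in UnschedulableQ even though the event that might have made it schedulable has already passed.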

3.4 Pod backoff algorithm in detail

3.4.1 Computing the backoff duration

```go
func (p *PriorityQueue) calculateBackoffDuration(podInfo *framework.QueuedPodInfo) time.Duration {
    duration := p.podInitialBackoffDuration   // starts at 1s
    for i := 1; i < podInfo.Attempts; i++ {   // exponential backoff
        duration = duration * 2
        if duration > p.podMaxBackoffDuration { // capped at 10s
            return p.podMaxBackoffDuration
        }
    }
    return duration
}
```

Backoff sequence: Attempts=1→1s, Attempts=2→2s, Attempts=3→4s, Attempts=4→8s, Attempts≥5→10s (cap)

```go
func (p *PriorityQueue) getBackoffTime(podInfo *framework.QueuedPodInfo) time.Time {
    duration := p.calculateBackoffDuration(podInfo)
    return podInfo.Timestamp.Add(duration) // enqueue time + backoff duration
}

func (p *PriorityQueue) isPodBackingoff(podInfo *framework.QueuedPodInfo) bool {
    boTime := p.getBackoffTime(podInfo)
    return boTime.After(p.clock.Now()) // now < backoff completion time
}
```

Key point: the backoff is measured from Timestamp (the time the Pod was last enqueued), not from the first scheduling attempt. Each time a Pod moves from BackoffQ to ActiveQ and fails again, Timestamp is refreshed and the backoff completion time is recomputed.

3.4.2 Moving Pods from BackoffQ to ActiveQ

```go
func (p *PriorityQueue) flushBackoffQCompleted() {
    p.lock.Lock()
    defer p.lock.Unlock()

    for {
        rawPodInfo := p.podBackoffQ.Peek() // peek at the heap top
        if rawPodInfo == nil {
            return // the queue is empty
        }

        boTime := p.getBackoffTime(rawPodInfo.(*framework.QueuedPodInfo))
        if boTime.After(p.clock.Now()) {
            return // the top has not expired, so nothing below it has either (heap order)
        }

        // Backoff expired: pop and move to activeQ
        if _, err := p.podBackoffQ.Pop(); err != nil {
            return
        }
        p.activeQ.Add(rawPodInfo)
        metrics.SchedulerQueueIncomingPods.WithLabelValues("active", BackoffComplete).Inc()
        defer p.cond.Broadcast() // wake Pop once this function returns
    }
}
```

Efficiency: because BackoffQ is ordered by backoff completion time (podsCompareBackoffCompleted), only the heap top needs checking. If the top has not expired, the function returns immediately, so the check is O(1).

The background goroutine started by Run() calls this method once per second:

```go
go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
```

3.5 Moving Pods from UnschedulablePods to ActiveQueue

3.5.1 Periodic timeout-based move

```go
func (p *PriorityQueue) flushUnschedulableQLeftover() {
    p.lock.Lock()
    defer p.lock.Unlock()

    var podsToMove []*framework.QueuedPodInfo
    currentTime := p.clock.Now()
    for _, pInfo := range p.unschedulableQ.podInfoMap {
        if currentTime.Sub(pInfo.Timestamp) > unschedulableQTimeInterval {
            // The Pod has been in unschedulableQ for more than 60 seconds
            podsToMove = append(podsToMove, pInfo)
        }
    }
    if len(podsToMove) > 0 {
        p.movePodsToActiveOrBackoffQueue(podsToMove, UnschedulableTimeout)
    }
}
```

The background goroutine started by Run() calls this method every 30 seconds:

```go
go wait.Until(p.flushUnschedulableQLeftover, 30*time.Second, p.stop)
```

3.5.2 movePodsToActiveOrBackoffQueue: the core move method
```go
func (p *PriorityQueue) movePodsToActiveOrBackoffQueue(
    podInfoList []*framework.QueuedPodInfo, event string) {

    moved := false
    for _, pInfo := range podInfoList {
        // Event match check: is the event relevant to the plugins that rejected this Pod?
        if len(pInfo.UnschedulablePlugins) != 0 && !p.podMatchesEvent(pInfo, event) {
            continue // the event cannot help this Pod, skip it
        }
        moved = true

        if p.isPodBackingoff(pInfo) {
            // Still backing off → go to backoffQ
            p.podBackoffQ.Add(pInfo)
            metrics.SchedulerQueueIncomingPods.WithLabelValues("backoff", event).Inc()
            p.unschedulableQ.delete(pInfo.Pod)
        } else {
            // Backoff finished → go to activeQ
            p.activeQ.Add(pInfo)
            metrics.SchedulerQueueIncomingPods.WithLabelValues("active", event).Inc()
            p.unschedulableQ.delete(pInfo.Pod)
        }
    }

    p.moveRequestCycle = p.schedulingCycle
    if moved {
        p.cond.Broadcast() // wake Pop
    }
}
```

podMatchesEvent: event routing

```go
func (p *PriorityQueue) podMatchesEvent(podInfo *framework.QueuedPodInfo, event string) bool {
    clusterEvent, ok := clusterEventReg[event]
    if !ok { return false }
    if clusterEvent == framework.WildCardEvent { return true } // the wildcard matches everything

    for evt, nameSet := range p.clusterEventMap {
        // Check: 1) the event types match 2) the plugin name sets intersect
        evtMatch := evt == framework.WildCardEvent ||
            (evt.Resource == clusterEvent.Resource &&
             evt.ActionType&clusterEvent.ActionType != 0)
        if evtMatch && intersect(nameSet, podInfo.UnschedulablePlugins) {
            return true
        }
    }
    return false
}
```

Design intent: not every cluster event matters to every Pod. For example, NodeTaintChange only affects Pods rejected by the TaintToleration plugin and is irrelevant to Pods rejected by NodeAffinity. podMatchesEvent matches clusterEventMap (declared by each plugin at registration) against UnschedulablePlugins (which records the plugins that rejected the Pod), which avoids pointless re-scheduling attempts.
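The matching rule combines a bitmask test with a set intersection. Here is a minimal sketch under stated assumptions: the ClusterEvent and ActionType types are simplified stand-ins, the plugin names PluginA and PluginB and their registrations are hypothetical, and the wildcard handling of the real function is omitted.

```go
package main

import "fmt"

// ActionType is a bitmask so that one registration can cover several actions.
type ActionType int

const (
	ActionAdd ActionType = 1 << iota
	ActionDelete
	ActionUpdate
)

// ClusterEvent is a simplified stand-in for framework.ClusterEvent.
type ClusterEvent struct {
	Resource   string
	ActionType ActionType
}

// podMatchesEvent reports whether any plugin that rejected the Pod registered
// interest in an event that overlaps the incoming one.
func podMatchesEvent(registered map[ClusterEvent][]string, incoming ClusterEvent, rejectedBy map[string]bool) bool {
	for evt, plugins := range registered {
		if evt.Resource != incoming.Resource || evt.ActionType&incoming.ActionType == 0 {
			continue // different resource, or no action in common
		}
		for _, name := range plugins {
			if rejectedBy[name] {
				return true
			}
		}
	}
	return false
}

func main() {
	// Hypothetical registrations, chosen for illustration only.
	registered := map[ClusterEvent][]string{
		{Resource: "Node", ActionType: ActionAdd | ActionUpdate}: {"PluginA"},
		{Resource: "PersistentVolume", ActionType: ActionAdd}:     {"PluginB"},
	}
	nodeUpdate := ClusterEvent{Resource: "Node", ActionType: ActionUpdate}

	fmt.Println(podMatchesEvent(registered, nodeUpdate, map[string]bool{"PluginA": true})) // true
	fmt.Println(podMatchesEvent(registered, nodeUpdate, map[string]bool{"PluginB": true})) // false
}
```

A Pod rejected only by PluginB stays in the unschedulable set on a Node update, because PluginB registered for PersistentVolume events only.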

3.5.3 Affinity-driven re-enqueueing

```go
func (p *PriorityQueue) AssignedPodAdded(pod *v1.Pod) {
    p.lock.Lock()
    p.movePodsToActiveOrBackoffQueue(
        p.getUnschedulablePodsWithMatchingAffinityTerm(pod), AssignedPodAdd)
    p.lock.Unlock()
}
```

getUnschedulablePodsWithMatchingAffinityTerm walks UnschedulablePods and returns the pending Pods that have an affinity term matching the newly bound Pod. Once a new Pod is bound to a node, Pods that previously could not be scheduled because of affinity may have become schedulable.

3.6 DefaultPreemption: the complete preemption flow

3.6.1 PostFilter entry point

```go
func (pl *DefaultPreemption) PostFilter(
    ctx context.Context, state *framework.CycleState,
    pod *v1.Pod, m framework.NodeToStatusMap) (*framework.PostFilterResult, *framework.Status) {

    defer func() { metrics.PreemptionAttempts.Inc() }()

    nnn, status := pl.preempt(ctx, state, pod, m)
    if !status.IsSuccess() { return nil, status }
    if nnn == "" {
        return nil, framework.NewStatus(framework.Unschedulable)
    }
    return &framework.PostFilterResult{NominatedNodeName: nnn},
           framework.NewStatus(framework.Success)
}
```

3.6.2 preempt: the five-step preemption flow
```go
func (pl *DefaultPreemption) preempt(ctx context.Context, state *framework.CycleState,
    pod *v1.Pod, m framework.NodeToStatusMap) (string, *framework.Status) {

    cs := pl.fh.ClientSet()
    nodeLister := pl.fh.SnapshotSharedLister().NodeInfos()

    // Step 0: fetch the latest version of the Pod
    pod, err := pl.podLister.Pods(pod.Namespace).Get(pod.Name)
    if err != nil {
        return "", framework.AsStatus(err)
    }

    // Step 1: check whether the Pod is eligible to preempt
    if !PodEligibleToPreemptOthers(pod, nodeLister, m[pod.Status.NominatedNodeName]) {
        return "", nil // not eligible
    }

    // Step 2: find all candidate nodes
    candidates, nodeToStatusMap, status := pl.FindCandidates(ctx, state, pod, m)
    if !status.IsSuccess() {
        return "", status
    }
    if len(candidates) == 0 {
        // No candidates → return a FitError built from nodeToStatusMap (construction elided)
        return "", framework.NewStatus(framework.Unschedulable, fitError.Error())
    }

    // Step 3: let extenders filter the candidates
    candidates, status = CallExtenders(pl.fh.Extenders(), pod, nodeLister, candidates)
    if !status.IsSuccess() {
        return "", status
    }

    // Step 4: pick the best candidate
    bestCandidate := SelectCandidate(candidates)
    if bestCandidate == nil || len(bestCandidate.Name()) == 0 {
        return "", nil
    }

    // Step 5: evict the victims and nominate the node
    if status := PrepareCandidate(bestCandidate, pl.fh, cs, pod, pl.Name()); !status.IsSuccess() {
        return "", status
    }

    return bestCandidate.Name(), nil
}
```

3.6.3 PodEligibleToPreemptOthers: eligibility check
```go
func PodEligibleToPreemptOthers(pod *v1.Pod, nodeInfos framework.NodeInfoLister,
    nominatedNodeStatus *framework.Status) bool {

    // PreemptNever policy → preemption is not allowed
    if pod.Spec.PreemptionPolicy != nil && *pod.Spec.PreemptionPolicy == v1.PreemptNever {
        return false
    }

    nomNodeName := pod.Status.NominatedNodeName
    if len(nomNodeName) > 0 {
        // The Pod already has a nominated node.
        // If that node is now UnschedulableAndUnresolvable → allow preempting again
        if nominatedNodeStatus.Code() == framework.UnschedulableAndUnresolvable {
            return true
        }

        // Check whether lower-priority Pods on the nominated node are still terminating
        if nodeInfo, _ := nodeInfos.Get(nomNodeName); nodeInfo != nil {
            podPriority := corev1helpers.PodPriority(pod)
            for _, p := range nodeInfo.Pods {
                if p.Pod.DeletionTimestamp != nil && corev1helpers.PodPriority(p.Pod) < podPriority {
                    // A Pod is still terminating gracefully → do not preempt more
                    return false
                }
            }
        }
    }
    return true
}
```

Design rationale: if the Pod already has a nominated node and lower-priority Pods on that node are still terminating, an earlier preemption is still taking effect, so no further preemption should be started. This prevents a preemption cascade in which one Pod keeps evicting Pods on node after node.

3.6.4 FindCandidates: candidate node discovery

```go
func (pl *DefaultPreemption) FindCandidates(
    ctx context.Context, state *framework.CycleState,
    pod *v1.Pod, m framework.NodeToStatusMap) ([]Candidate, framework.NodeToStatusMap, *framework.Status) {

    allNodes, _ := pl.fh.SnapshotSharedLister().NodeInfos().List()

    // Keep only the nodes where preemption might help
    potentialNodes, unschedulableNodeStatus := nodesWherePreemptionMightHelp(allNodes, m)
    if len(potentialNodes) == 0 {
        // Clear the nominated node name
        util.ClearNominatedNodeName(pl.fh.ClientSet(), pod)
        return nil, unschedulableNodeStatus, nil
    }

    // Fetch the PDB list
    pdbs, _ := getPodDisruptionBudgets(pl.pdbLister)

    // Compute a random offset and the number of candidates (sampling optimization)
    offset, numCandidates := pl.getOffsetAndNumCandidates(int32(len(potentialNodes)))

    // Simulate eviction in parallel
    candidates, nodeStatuses := dryRunPreemption(
        ctx, pl.fh, state, pod, potentialNodes, pdbs, offset, numCandidates)

    return candidates, nodeStatuses, nil
}
```

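nodesWherePreemptionMightHelp: ruling out hopeless nodes

```go
func nodesWherePreemptionMightHelp(nodes []*framework.NodeInfo, m framework.NodeToStatusMap) (
    []*framework.NodeInfo, framework.NodeToStatusMap) {

    var potentialNodes []*framework.NodeInfo
    nodeStatuses := make(framework.NodeToStatusMap)
    for _, node := range nodes {
        name := node.Node().Name
        // UnschedulableAndUnresolvable: a failure preemption cannot fix (for example, node affinity mismatch)
        if m[name].Code() == framework.UnschedulableAndUnresolvable {
            nodeStatuses[name] = framework.NewStatus(
                framework.UnschedulableAndUnresolvable, "Preemption is not helpful")
            continue
        }
        potentialNodes = append(potentialNodes, node)
    }
    return potentialNodes, nodeStatuses
}
```

Key distinction

  • Unschedulable: a failure that preemption might resolve (for example, insufficient resources, a host-port conflict, or an anti-affinity conflict with a running Pod)
  • UnschedulableAndUnresolvable: a failure that removing Pods cannot resolve (for example, a node selector or node affinity mismatch, an untolerated taint, or an unsatisfied Pod affinity rule)

3.6.5 calculateNumCandidates: sampling optimization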
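```go
func (pl *DefaultPreemption) calculateNumCandidates(numNodes int32) int32 {
    n := (numNodes * pl.args.MinCandidateNodesPercentage) / 100
    if n < pl.args.MinCandidateNodesAbsolute {
        n = pl.args.MinCandidateNodesAbsolute
    }
    if n > numNodes {
        n = numNodes
    }
    return n
}
```

Default parameters: MinCandidateNodesPercentage=10, MinCandidateNodesAbsolute=100

Meaning: in a 1000-node cluster the target is max(1000 × 10%, 100) = 100 candidates. The dry run starts at a random offset and stops early once that many candidates have been collected, instead of always evaluating every node. This greatly reduces the cost of preemption in large clusters, at the price of possibly missing the optimal node, which is acceptable because preemption does not need an exact optimum.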
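The arithmetic is easy to verify for a few cluster sizes. This sketch restates the calculation as a free function with the two settings passed in as parameters; numCandidates is a hypothetical name, and the values 10 and 100 are the defaults quoted above.

```go
package main

import "fmt"

// numCandidates returns the target number of candidates: the given percentage
// of numNodes, raised to at least minAbsolute and capped at numNodes.
func numCandidates(numNodes, minPercentage, minAbsolute int32) int32 {
	n := numNodes * minPercentage / 100
	if n < minAbsolute {
		n = minAbsolute
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

func main() {
	for _, nodes := range []int32{50, 1000, 5000} {
		fmt.Println(nodes, numCandidates(nodes, 10, 100))
	}
	// 50 nodes   -> 50  (the absolute minimum of 100 is capped at the cluster size)
	// 1000 nodes -> 100 (10% equals the absolute minimum)
	// 5000 nodes -> 500 (10% exceeds the absolute minimum)
}
```

The absolute minimum dominates in small and mid-sized clusters; the percentage only takes over above 1000 nodes with these defaults.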

3.6.6 dryRunPreemption: simulating eviction in parallel

```go
func dryRunPreemption(ctx context.Context, fh framework.Handle,
    state *framework.CycleState, pod *v1.Pod, potentialNodes []*framework.NodeInfo,
    pdbs []*policy.PodDisruptionBudget, offset int32, numCandidates int32) (
    []Candidate, framework.NodeToStatusMap) {

    nonViolatingCandidates := newCandidateList(numCandidates) // candidates that violate no PDB
    violatingCandidates := newCandidateList(numCandidates)    // candidates that violate a PDB

    parallelCtx, cancel := context.WithCancel(ctx)
    nodeStatuses := make(framework.NodeToStatusMap)
    var statusesLock sync.Mutex

    checkNode := func(i int) {
        // Circular indexing starting at offset
        nodeInfoCopy := potentialNodes[(int(offset)+i)%len(potentialNodes)].Clone()
        stateCopy := state.Clone()

        pods, numPDBViolations, status := selectVictimsOnNode(
            ctx, fh, stateCopy, pod, nodeInfoCopy, pdbs)

        if status.IsSuccess() {
            victims := extenderv1.Victims{
                Pods:             pods,
                NumPDBViolations: int64(numPDBViolations),
            }
            c := &candidate{victims: &victims, name: nodeInfoCopy.Node().Name}

            if numPDBViolations == 0 {
                nonViolatingCandidates.add(c)
            } else {
                violatingCandidates.add(c)
            }

            // Early termination: at least one non-violating candidate exists
            // and the two lists together hold numCandidates or more
            nvcSize, vcSize := nonViolatingCandidates.size(), violatingCandidates.size()
            if nvcSize > 0 && nvcSize+vcSize >= numCandidates {
                cancel() // cancel the parallel context so the other workers stop
            }
        } else {
            statusesLock.Lock()
            nodeStatuses[nodeInfoCopy.Node().Name] = status
            statusesLock.Unlock()
        }
    }

    fh.Parallelizer().Until(parallelCtx, len(potentialNodes), checkNode)

    // Non-violating candidates come first
    return append(nonViolatingCandidates.get(), violatingCandidates.get()...), nodeStatuses
}
```

Three performance optimizations

  1. Parallelism: Parallelizer.Until runs the simulation over the nodes concurrently
  2. Sampling: evaluation starts at a random offset and aims for numCandidates candidates rather than covering every node
  3. Early termination: once enough candidates have been found, cancel() stops the remaining work

candidateList 的线程安全

go 复制代码
type candidateList struct {
    idx   int32           // 原子递增的索引
    items []Candidate     // 预分配数组
}

func (cl *candidateList) add(c *candidate) {
    if idx := atomic.AddInt32(&cl.idx, 1); idx < int32(len(cl.items)) {
        cl.items[idx] = c   // 无锁写入预分配位置
    }
}

使用预分配数组 + 原子索引实现无锁并发写入,避免了锁竞争。
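The slot-claiming idea can be tried outside the scheduler. The following is a minimal sketch, not the scheduler's code: `slotList`, `newSlotList`, and `fillConcurrently` are hypothetical names, and the sketch assumes the index starts at -1 so that the first atomic increment yields slot 0.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// slotList is a hypothetical stand-in for candidateList: a fixed-size
// slice plus an atomically incremented write index.
type slotList struct {
	idx   int32 // index of the last claimed slot; starts at -1
	items []string
}

func newSlotList(size int32) *slotList {
	return &slotList{idx: -1, items: make([]string, size)}
}

// add claims the next slot with one atomic increment. Writers that
// arrive after the slice is full are dropped.
func (l *slotList) add(s string) {
	if i := atomic.AddInt32(&l.idx, 1); i < int32(len(l.items)) {
		l.items[i] = s
	}
}

// size reports how many slots have been claimed, capped at capacity.
func (l *slotList) size() int32 {
	n := atomic.LoadInt32(&l.idx) + 1
	if n > int32(len(l.items)) {
		n = int32(len(l.items))
	}
	return n
}

// fillConcurrently starts `writers` goroutines that each add one item
// and waits for all of them to finish.
func fillConcurrently(l *slotList, writers int) {
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			l.add(fmt.Sprintf("node-%d", w))
		}(w)
	}
	wg.Wait()
}

func main() {
	l := newSlotList(4)
	fillConcurrently(l, 10)
	fmt.Println(l.size(), len(l.items)) // prints "4 4"
}
```

Only the slot index is claimed atomically; the write into the slot is an ordinary store. The sketch therefore reads the items only after every writer has finished, which is also the safe way to consume such a list.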

3.7 selectVictimsOnNode: the per-node victim selection algorithm

This is the most complex function in the preemption path. A line-by-line walkthrough:

go
func selectVictimsOnNode(
    ctx context.Context, fh framework.Handle, state *framework.CycleState,
    pod *v1.Pod, nodeInfo *framework.NodeInfo,
    pdbs []*policy.PodDisruptionBudget) ([]*v1.Pod, int, *framework.Status) {

    var potentialVictims []*framework.PodInfo

    // Two helper closures
    removePod := func(rpi *framework.PodInfo) error {
        if err := nodeInfo.RemovePod(rpi.Pod); err != nil { return err }
        // Notify the PreFilter extensions that the Pod was removed
        status := fh.RunPreFilterExtensionRemovePod(ctx, state, pod, rpi, nodeInfo)
        if !status.IsSuccess() { return status.AsError() }
        return nil
    }

    addPod := func(api *framework.PodInfo) error {
        nodeInfo.AddPodInfo(api)
        // Notify the PreFilter extensions that the Pod was added back
        status := fh.RunPreFilterExtensionAddPod(ctx, state, pod, api, nodeInfo)
        if !status.IsSuccess() { return status.AsError() }
        return nil
    }

    // ====== Phase 1: remove every lower-priority Pod, then check schedulability ======
    podPriority := corev1helpers.PodPriority(pod)
    for _, pi := range nodeInfo.Pods {
        if corev1helpers.PodPriority(pi.Pod) < podPriority {
            potentialVictims = append(potentialVictims, pi)
            if err := removePod(pi); err != nil {
                return nil, 0, framework.AsStatus(err)
            }
        }
    }

    // No potential victims: this node cannot be used for preemption
    if len(potentialVictims) == 0 {
        return nil, 0, framework.NewStatus(framework.UnschedulableAndUnresolvable, ...)
    }

    // The Pod still does not fit with every lower-priority Pod removed: node unusable
    if status := fh.RunFilterPluginsWithNominatedPods(ctx, state, pod, nodeInfo);
       !status.IsSuccess() {
        return nil, 0, status
    }

    // ====== Phase 2: reprieve as many Pods as possible (greedy) ======
    var victims []*v1.Pod
    numViolatingVictim := 0

    // Sort the potential victims, most important (highest priority) first
    sort.Slice(potentialVictims, func(i, j int) bool {
        return util.MoreImportantPod(potentialVictims[i].Pod, potentialVictims[j].Pod)
    })

    // Split into a PDB-violating group and a non-violating group
    violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(
        potentialVictims, pdbs)

    // reprievePod: try to "reprieve" one Pod (put it back on the node)
    reprievePod := func(pi *framework.PodInfo) (bool, error) {
        if err := addPod(pi); err != nil { return false, err }

        // Does the incoming Pod still fit after the reprieve?
        status := fh.RunFilterPluginsWithNominatedPods(ctx, state, pod, nodeInfo)
        fits := status.IsSuccess()

        if !fits {
            // It no longer fits: this Pod must be evicted
            if err := removePod(pi); err != nil { return false, err }
            victims = append(victims, pi.Pod)
        }
        return fits, nil
    }

    // First try to reprieve the PDB-violating victims (highest priority first)
    for _, p := range violatingVictims {
        if fits, err := reprievePod(p); err != nil {
            return nil, 0, framework.AsStatus(err)
        } else if !fits {
            numViolatingVictim++
        }
    }

    // Then try to reprieve the non-violating victims
    for _, p := range nonViolatingVictims {
        if _, err := reprievePod(p); err != nil {
            return nil, 0, framework.AsStatus(err)
        }
    }

    return victims, numViolatingVictim, framework.NewStatus(framework.Success)
}

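Core ideas of the algorithm

  1. Global feasibility check: remove every lower-priority Pod first and ask whether the incoming Pod fits if all of them are evicted. If evicting everything is not enough, evicting a subset cannot be enough either
  2. Greedy reprieve: try to restore (reprieve) Pods from highest to lowest priority. Higher-priority Pods are kept first because evicting them hurts the workload more
  3. PDB protection first: the PDB-violating group is processed first, so that as few PDBs as possible are violated
  4. Small eviction set: the final victims list contains only Pods whose reprieve would make the incoming Pod unschedulable again. Because the pass is greedy, the set is sufficient but not guaranteed to be the globally smallest one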
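The two phases can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden toy, not the scheduler's implementation: the only filter is a single CPU capacity check, PDBs are ignored, and the names `pod` and `selectVictims` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, minimal Pod model: a priority and a CPU request.
type pod struct {
	name     string
	priority int32
	cpu      int64
}

// selectVictims follows the two-phase shape described above.
// Phase 1 removes every lower-priority Pod and checks that the incoming
// Pod fits. Phase 2 walks the removed Pods from highest to lowest
// priority and puts each one back if the incoming Pod still fits.
func selectVictims(capacity int64, running []pod, incoming pod) ([]string, bool) {
	var used int64
	var potential []pod
	for _, p := range running {
		if p.priority < incoming.priority {
			potential = append(potential, p) // tentatively removed
		} else {
			used += p.cpu // never a victim, always counts
		}
	}
	// No potential victims, or no fit even with all of them gone.
	if len(potential) == 0 || used+incoming.cpu > capacity {
		return nil, false
	}

	sort.SliceStable(potential, func(i, j int) bool {
		return potential[i].priority > potential[j].priority
	})

	var victims []string
	for _, p := range potential {
		if used+p.cpu+incoming.cpu <= capacity {
			used += p.cpu // reprieved: stays on the node
		} else {
			victims = append(victims, p.name) // must be evicted
		}
	}
	return victims, true
}

func main() {
	running := []pod{
		{"web", 100, 2},
		{"batch-a", 10, 2},
		{"batch-b", 5, 3},
		{"cache", 50, 1},
	}
	victims, ok := selectVictims(8, running, pod{"api", 80, 4})
	fmt.Println(victims, ok) // prints "[batch-a batch-b] true"
}
```

In this run `cache` is reprieved because the node still has room for it, while both batch Pods are evicted. Evicting only `batch-b` would not free enough CPU here, but in general a greedy pass of this kind can keep a Pod that a smaller eviction set would have sacrificed.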

filterPodsWithPDBViolation in detail

go
func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.PodDisruptionBudget) (
    violatingPodInfos, nonViolatingPodInfos []*framework.PodInfo) {

    pdbsAllowed := make([]int32, len(pdbs))
    for i, pdb := range pdbs {
        pdbsAllowed[i] = pdb.Status.DisruptionsAllowed  // disruptions the PDB still allows
    }

    for _, podInfo := range podInfos {
        pod := podInfo.Pod
        pdbForPodIsViolated := false

        if len(pod.Labels) != 0 {
            for i, pdb := range pdbs {
                // Namespace mismatch: skip
                if pdb.Namespace != pod.Namespace { continue }
                // Invalid selector or label mismatch: skip
                selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
                if err != nil { continue }
                if selector.Empty() || !selector.Matches(labels.Set(pod.Labels)) { continue }
                // Already listed in DisruptedPods: already processed, do not count it again
                if _, exist := pdb.Status.DisruptedPods[pod.Name]; exist { continue }

                pdbsAllowed[i]--                  // consume one unit of the PDB budget
                if pdbsAllowed[i] < 0 {
                    pdbForPodIsViolated = true     // budget exceeded: the PDB is violated
                }
            }
        }

        if pdbForPodIsViolated {
            violatingPodInfos = append(violatingPodInfos, podInfo)
        } else {
            nonViolatingPodInfos = append(nonViolatingPodInfos, podInfo)
        }
    }
    return
}

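Stability guarantee: this function preserves the input order within each group (it is stable), because the greedy reprieve that follows relies on the priority ordering.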
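The budget accounting is easy to trace on a small input. The sketch below is a simplified illustration under stated assumptions: a budget matches a Pod through a single hypothetical `app` label value instead of a real label selector, namespaces and DisruptedPods are left out, and `budget`, `pod`, and `partitionByBudget` are made-up names.

```go
package main

import "fmt"

// budget is a hypothetical, reduced PDB: it matches Pods by one label
// value and allows a fixed number of disruptions.
type budget struct {
	app     string
	allowed int32
}

// pod is a hypothetical Pod with a name and a single "app" label value.
type pod struct {
	name string
	app  string
}

// partitionByBudget walks the Pods in order and charges each matching
// budget one disruption. A Pod is "violating" as soon as any budget it
// matches drops below zero. Input order is preserved in both outputs.
func partitionByBudget(pods []pod, budgets []budget) (violating, nonViolating []string) {
	remaining := make([]int32, len(budgets))
	for i, b := range budgets {
		remaining[i] = b.allowed
	}
	for _, p := range pods {
		violated := false
		for i, b := range budgets {
			if b.app != p.app {
				continue
			}
			remaining[i]--
			if remaining[i] < 0 {
				violated = true
			}
		}
		if violated {
			violating = append(violating, p.name)
		} else {
			nonViolating = append(nonViolating, p.name)
		}
	}
	return violating, nonViolating
}

func main() {
	pods := []pod{{"db-0", "db"}, {"db-1", "db"}, {"web-0", "web"}, {"db-2", "db"}}
	v, nv := partitionByBudget(pods, []budget{{"db", 1}})
	fmt.Println(v, nv) // prints "[db-1 db-2] [db-0 web-0]"
}
```

Because the budget is consumed in input order, the Pods that appear first are the ones that fit inside it. With input sorted by importance, the most important Pods covered by a PDB land in the non-violating group.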

3.8 pickOneNodeForPreemption: choosing the best node

go
func pickOneNodeForPreemption(nodesToVictims map[string]*extenderv1.Victims) string {
    // ====== Round 1: fewest PDB violations ======
    minNumPDBViolatingPods := int64(math.MaxInt32)
    var minNodes1 []string
    lenNodes1 := 0
    for node, victims := range nodesToVictims {
        numPDBViolatingPods := victims.NumPDBViolations
        if numPDBViolatingPods < minNumPDBViolatingPods {
            minNumPDBViolatingPods = numPDBViolatingPods
            minNodes1 = nil
            lenNodes1 = 0
        }
        if numPDBViolatingPods == minNumPDBViolatingPods {
            minNodes1 = append(minNodes1, node)
            lenNodes1++
        }
    }
    if lenNodes1 == 1 { return minNodes1[0] }

    // ====== Round 2: lowest "highest-priority victim" ======
    minHighestPriority := int32(math.MaxInt32)
    minNodes2 := make([]string, lenNodes1)
    lenNodes2 := 0
    for _, node := range minNodes1 {
        victims := nodesToVictims[node]
        highestPodPriority := corev1helpers.PodPriority(victims.Pods[0])  // Pods are sorted by priority, highest first
        if highestPodPriority < minHighestPriority {
            minHighestPriority = highestPodPriority
            lenNodes2 = 0
        }
        if highestPodPriority == minHighestPriority {
            minNodes2[lenNodes2] = node; lenNodes2++
        }
    }
    if lenNodes2 == 1 { return minNodes2[0] }

    // ====== Round 3: lowest sum of priorities ======
    // (abridged: the survivors of this round are written back into minNodes1)
    minSumPriorities := int64(math.MaxInt64)
    for _, node := range minNodes2[:lenNodes2] {
        var sumPriorities int64
        for _, pod := range nodesToVictims[node].Pods {
            // Offset by MaxInt32+1 so that negative priorities are weighted sensibly
            sumPriorities += int64(corev1helpers.PodPriority(pod)) + int64(math.MaxInt32+1)
        }
        if sumPriorities < minSumPriorities { ... }
    }

    // ====== Round 4: fewest victims ======
    // (abridged: the survivors of this round are written back into minNodes2)
    minNumPods := math.MaxInt32
    for _, node := range minNodes1 { ... }

    // ====== Round 5: latest "earliest start time" ======
    latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
    nodeToReturn := minNodes2[0]
    for i := 1; i < lenNodes2; i++ {
        node := minNodes2[i]
        earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node])
        if earliestStartTimeOnNode.After(latestStartTime.Time) {
            latestStartTime = earliestStartTimeOnNode
            nodeToReturn = node
        }
    }

    return nodeToReturn   // Round 6: still tied, so the first node wins
}

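Six-level tie-breaking strategy

| Level | Criterion | Purpose |
| --- | --- | --- |
| 1 | Fewest PDB violations | Protect service availability |
| 2 | Lowest highest-priority victim | Reduce the impact on high-priority workloads |
| 3 | Lowest sum of priorities | Smallest overall impact |
| 4 | Fewest victims | Fewest evictions |
| 5 | Latest earliest start time | Prefer the node whose victims have been running for the shortest time ("younger" Pods are sacrificed first) |
| 6 | First remaining node | Fallback choice |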

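Design intent of the priority offset (MaxInt32+1): Pod priorities can be negative, for example when a PriorityClass carries a negative value. Summing the raw values would make a node look better the more negative-priority victims it has, so a node with several such victims could be picked over a node with fewer victims of the same priority. Adding MaxInt32+1 to every term makes all terms non-negative, so each additional victim always increases the sum.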
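A short worked example makes the effect of the offset concrete. This is an illustrative sketch with hypothetical helper names (`rawSum`, `offsetSum`) and made-up priorities; only the `int64(priority) + int64(math.MaxInt32+1)` term comes from the listing above.

```go
package main

import (
	"fmt"
	"math"
)

// rawSum adds the victim priorities without any offset.
func rawSum(priorities []int32) int64 {
	var s int64
	for _, p := range priorities {
		s += int64(p)
	}
	return s
}

// offsetSum shifts every priority by MaxInt32+1 before adding, so each
// term is non-negative and every extra victim increases the total.
func offsetSum(priorities []int32) int64 {
	var s int64
	for _, p := range priorities {
		s += int64(p) + int64(math.MaxInt32+1)
	}
	return s
}

func main() {
	nodeA := []int32{-10, -10, -10} // three victims with negative priority
	nodeB := []int32{-10}           // one victim with the same priority

	// Without the offset node A has the lower sum (-30 < -10), so the
	// node that loses three Pods would look like the better choice.
	fmt.Println(rawSum(nodeA) < rawSum(nodeB)) // prints "true"

	// With the offset node B has the lower sum, as intended.
	fmt.Println(offsetSum(nodeA) < offsetSum(nodeB)) // prints "false"
}
```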

3.9 PrepareCandidate: carrying out the eviction

go
func PrepareCandidate(c Candidate, fh framework.Handle, cs kubernetes.Interface,
    pod *v1.Pod, pluginName string) *framework.Status {

    // 1. Evict every victim Pod
    for _, victim := range c.Victims().Pods {
        if err := util.DeletePod(cs, victim); err != nil {
            return framework.AsStatus(err)
        }
        // 2. If the victim is a WaitingPod (waiting in the Permit phase), send a Reject
        if waitingPod := fh.GetWaitingPod(victim.UID); waitingPod != nil {
            waitingPod.Reject(pluginName, "preempted")
        }
        // 3. Record an event
        fh.EventRecorder().Eventf(victim, pod, v1.EventTypeNormal,
            "Preempted", "Preempting", "Preempted by %v/%v on node %v",
            pod.Namespace, pod.Name, c.Name())
    }

    metrics.PreemptionVictims.Observe(float64(len(c.Victims().Pods)))

    // 4. Clear NominatedNodeName on lower-priority Pods nominated to this node
    nominatedPods := getLowerPriorityNominatedPods(fh, pod, c.Name())
    if err := util.ClearNominatedNodeName(cs, nominatedPods...); err != nil {
        klog.ErrorS(err, "cannot clear 'NominatedNodeName' field")
        // Not a critical error, so do not return
    }

    return nil
}

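Why lower-priority nominations are cleared: once the higher-priority Pod has preempted Pods on the node, lower-priority Pods that were previously nominated to that node may no longer fit, because the higher-priority Pod will take the resources. Clearing their NominatedNodeName triggers an Update event, which moves them back into the ActiveQ to be scheduled again.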


IV. Auxiliary modules in depth

4.1 The Heap data structure (internal/heap/heap.go)

go
type Heap struct {
    data *data
    metricRecorder metrics.MetricRecorder
}

type data struct {
    items    map[string]*heapItem   // key → {obj, index}
    queue    []string               // keys in heap order
    keyFunc  KeyFunc                // object → key
    lessFunc lessFunc               // comparison function
}

type heapItem struct {
    obj   interface{}  // the stored object
    index int          // position in queue
}

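Dual-index design

  • items map: O(1) lookup of an object by key
  • queue slice: maintains heap order and supports O(log n) Push/Pop

Key methods

| Method | Complexity | Notes |
| --- | --- | --- |
| Add | O(log n) | Insert or update; calls heap.Fix if the key already exists |
| Pop | O(log n) | Remove and return the top of the heap |
| Peek | O(1) | Look at the top of the heap |
| Get/GetByKey | O(1) | Look up by key |
| Delete | O(log n) | heap.Remove |
| Update | O(log n) | Same as Add |

Add has upsert semantics: if the key already exists, it replaces the object and calls heap.Fix to restore the heap position instead of returning an error.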
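The dual-index layout can be reproduced with the standard `container/heap` package. The following is a minimal sketch rather than the scheduler's Heap: `keyedHeap`, `item`, and `PopTop` are hypothetical names, the key is passed in directly instead of being derived by a KeyFunc, and ordering is a plain integer priority (larger first).

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is the stored value plus its current position in the queue.
type item struct {
	key      string
	priority int
	index    int
}

// keyedHeap keeps a map for O(1) lookup and a slice of keys in heap order.
type keyedHeap struct {
	items map[string]*item
	queue []string
}

// The five methods below implement heap.Interface.
func (h *keyedHeap) Len() int { return len(h.queue) }
func (h *keyedHeap) Less(i, j int) bool {
	return h.items[h.queue[i]].priority > h.items[h.queue[j]].priority
}
func (h *keyedHeap) Swap(i, j int) {
	h.queue[i], h.queue[j] = h.queue[j], h.queue[i]
	h.items[h.queue[i]].index = i
	h.items[h.queue[j]].index = j
}
func (h *keyedHeap) Push(x interface{}) {
	it := x.(*item)
	it.index = len(h.queue)
	h.items[it.key] = it
	h.queue = append(h.queue, it.key)
}
func (h *keyedHeap) Pop() interface{} {
	key := h.queue[len(h.queue)-1]
	h.queue = h.queue[:len(h.queue)-1]
	it := h.items[key]
	delete(h.items, key)
	return it
}

// Add inserts a new key or, if the key exists, updates it in place and
// repairs the heap with heap.Fix (upsert semantics).
func (h *keyedHeap) Add(key string, priority int) {
	if it, ok := h.items[key]; ok {
		it.priority = priority
		heap.Fix(h, it.index)
		return
	}
	heap.Push(h, &item{key: key, priority: priority})
}

// PopTop removes and returns the key with the highest priority.
func (h *keyedHeap) PopTop() (string, bool) {
	if h.Len() == 0 {
		return "", false
	}
	return heap.Pop(h).(*item).key, true
}

func main() {
	h := &keyedHeap{items: map[string]*item{}}
	h.Add("a", 1)
	h.Add("b", 5)
	h.Add("c", 3)
	h.Add("a", 9) // same key: updated, not duplicated
	for k, ok := h.PopTop(); ok; k, ok = h.PopTop() {
		fmt.Println(k) // prints a, b, c on separate lines
	}
}
```

Storing each item's index in the map entry is what lets the update path call heap.Fix directly instead of searching the slice for the key.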

4.2 nodeTree: a zone-aware node tree

go
type nodeTree struct {
    tree     map[string][]string  // zone → []nodeName
    zones    []string             // list of all zones
    numNodes int                  // total number of nodes
}

The list() method iterates across zones in round-robin order:

go
func (nt *nodeTree) list() ([]string, error) {
    nodesList := make([]string, 0, nt.numNodes)
    numExhaustedZones := 0
    nodeIndex := 0

    for len(nodesList) < nt.numNodes {
        if numExhaustedZones >= len(nt.zones) {
            return nodesList, errors.New("all zones exhausted")
        }
        for zoneIndex := 0; zoneIndex < len(nt.zones); zoneIndex++ {
            na := nt.tree[nt.zones[zoneIndex]]
            if nodeIndex >= len(na) {
                if nodeIndex == len(na) { numExhaustedZones++ }
                continue
            }
            nodesList = append(nodesList, na[nodeIndex])
        }
        nodeIndex++
    }
    return nodesList, nil
}

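Design goal: spread nodes evenly across availability zones. The output sequence looks like [zone1-node1, zone2-node1, zone3-node1, zone1-node2, zone2-node2, ...], so the scheduler considers nodes from different zones before returning to the same zone.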
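The interleaving is easiest to see with zones of unequal size. The sketch below is a simplified, self-contained version of the same round-robin walk; `interleave` is a hypothetical name, and the expected total is computed from the input so the error path of list() is not needed.

```go
package main

import "fmt"

// interleave visits the zones in order and takes the idx-th node of each
// zone on pass idx, skipping zones that are already exhausted.
func interleave(zones []string, tree map[string][]string) []string {
	total := 0
	for _, z := range zones {
		total += len(tree[z])
	}
	out := make([]string, 0, total)
	for idx := 0; len(out) < total; idx++ {
		for _, z := range zones {
			if nodes := tree[z]; idx < len(nodes) {
				out = append(out, nodes[idx])
			}
		}
	}
	return out
}

func main() {
	zones := []string{"z1", "z2", "z3"}
	tree := map[string][]string{
		"z1": {"a1", "a2", "a3"},
		"z2": {"b1"},
		"z3": {"c1", "c2"},
	}
	fmt.Println(interleave(zones, tree)) // prints "[a1 b1 c1 a2 c2 a3]"
}
```

Once a small zone runs out of nodes, the remaining passes simply alternate between the zones that still have nodes left.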

4.3 The parallelization helper (internal/parallelize/)

go
type Parallelizer struct {
    parallelism int   // defaults to 16
}

func chunkSizeFor(n, parallelism int) int {
    s := int(math.Sqrt(float64(n)))        // sqrt(n), truncated
    if r := n/parallelism + 1; s > r {
        s = r                               // min(sqrt(n), n/parallelism+1)
    } else if s < 1 {
        s = 1
    }
    return s
}

func (p Parallelizer) Until(ctx context.Context, pieces int,
    doWorkPiece workqueue.DoWorkPieceFunc) {
    workqueue.ParallelizeUntil(ctx, p.parallelism, pieces, doWorkPiece,
        workqueue.WithChunkSize(chunkSizeFor(pieces, p.parallelism)))
}

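chunkSizeFor formula: max(1, min(√n, n/parallelism+1)), with a truncated square root and integer division

  • n=1000, parallelism=16 → chunkSize = min(31, 63) = 31
  • n=100, parallelism=16 → chunkSize = min(10, 7) = 7
  • n=10, parallelism=16 → chunkSize = max(1, min(3, 1)) = 1

The goal is a balance between parallelism and task granularity: chunks that are too large cause uneven load, while chunks that are too small add dispatch overhead.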
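The three worked values above can be checked by running the function on its own. The function body below is copied from the listing in this section; the surrounding `main` and the extra input n=5000 are only illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// chunkSizeFor returns max(1, min(sqrt(n), n/parallelism+1)) using a
// truncated square root and integer division.
func chunkSizeFor(n, parallelism int) int {
	s := int(math.Sqrt(float64(n)))
	if r := n/parallelism + 1; s > r {
		s = r
	} else if s < 1 {
		s = 1
	}
	return s
}

func main() {
	for _, n := range []int{10, 100, 1000, 5000} {
		fmt.Printf("n=%d chunk=%d\n", n, chunkSizeFor(n, 16))
	}
	// prints:
	// n=10 chunk=1
	// n=100 chunk=7
	// n=1000 chunk=31
	// n=5000 chunk=70
}
```

For small n the n/parallelism+1 term wins, which keeps chunks small enough that every worker gets something to do. For large n the square root wins, which caps the chunk size so that a slow chunk does not hold up the whole batch.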

4.4 ErrorChannel: non-blocking error propagation

go
type ErrorChannel struct {
    errCh chan error  // buffered channel with capacity 1
}

func (e *ErrorChannel) SendError(err error) {
    select {
    case e.errCh <- err:   // non-blocking send
    default:               // the channel already holds an error, so drop this one
    }
}

It collects the first error in a parallel run and ignores later ones. Combined with context.WithCancel it gives fail-fast behavior.
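The "first error wins, then cancel" pattern can be sketched with the standard library alone. The code below is an illustration under assumptions, not the scheduler's type: `errorChannel`, `sendErrorWithCancel`, `receive`, `keptMessage`, and `runWorkers` are hypothetical names chosen for this example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errorChannel keeps at most one error in a buffered channel of size 1.
type errorChannel struct{ errCh chan error }

func newErrorChannel() *errorChannel {
	return &errorChannel{errCh: make(chan error, 1)}
}

// sendErrorWithCancel stores err if the slot is free, drops it otherwise,
// and always cancels so that the remaining workers can stop early.
func (e *errorChannel) sendErrorWithCancel(err error, cancel context.CancelFunc) {
	select {
	case e.errCh <- err:
	default:
	}
	cancel()
}

// receive returns the stored error, or nil if none was sent.
func (e *errorChannel) receive() error {
	select {
	case err := <-e.errCh:
		return err
	default:
		return nil
	}
}

// keptMessage sends the given messages in order and reports which one
// was kept; it returns "" when nothing was sent.
func keptMessage(msgs ...string) string {
	ec := newErrorChannel()
	for _, m := range msgs {
		ec.sendErrorWithCancel(errors.New(m), func() {})
	}
	if err := ec.receive(); err != nil {
		return err.Error()
	}
	return ""
}

// runWorkers starts n failing workers. A worker that sees the context
// already cancelled returns without reporting anything.
func runWorkers(n int) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ec := newErrorChannel()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if ctx.Err() != nil {
				return
			}
			ec.sendErrorWithCancel(fmt.Errorf("worker %d failed", i), cancel)
		}(i)
	}
	wg.Wait()
	return ec.receive()
}

func main() {
	fmt.Println(keptMessage("first", "second")) // prints "first"
	fmt.Println(runWorkers(8) != nil)           // prints "true"
}
```

Which worker's error is kept in `runWorkers` depends on goroutine scheduling. The pattern only promises that one error survives and that no sender ever blocks.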

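4.5 CacheDebugger: a debugging tool

go
type CacheDebugger struct {
    Comparer CacheComparer   // checks that the cache state matches the Informer state
    Dumper   CacheDumper     // writes the cache state to the log
}

It listens for SIGUSR2 (Linux) or SIGINT (Windows). When the signal arrives:

  1. Comparer.Compare(): checks whether the Pods and Nodes in the Cache match the Informer cache
  2. Dumper.DumpAll(): writes the full state of the Cache and the Queue to the log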

V. Mermaid architecture diagrams (12 in total)

Figure 1: Overall architecture of SchedulerCache

(Flowchart; the rendered image is not available. Labels recovered from the diagram:)

  • Groups: event source, schedulerCache, background goroutine, snapshot system
  • Event source: Informer Reflector
  • Edge labels: AddPod, RemovePod, UpdatePod, AddNode, RemoveNode, AssumePod, ForgetPod, FinishBinding, clean up expired assumed Pods, read, used by the scheduling cycle
  • schedulerCache: mu sync.RWMutex, assumedPods sets.String, podStates map, nodes map + doubly linked list, headNode, nodeTree, imageStates map
  • Background goroutine: cleanupExpiredAssumedPods (every 1 second)
  • Snapshot system: Snapshot, UpdateSnapshot (incremental update)
  • Consumer: the scheduler

Figure 2: Class diagram of the Cache interface

(Class diagram; the rendered image is not available. Content recovered from the diagram:)

Relations labeled in the diagram: implements, creates

<<interface>> Cache
  +NodeCount() : int
  +PodCount() (int, error)
  +AssumePod(pod) : error
  +FinishBinding(pod) : error
  +ForgetPod(pod) : error
  +AddPod(pod) : error
  +UpdatePod(oldPod, newPod) : error
  +RemovePod(pod) : error
  +GetPod(pod) (*Pod, error)
  +IsAssumedPod(pod) (bool, error)
  +AddNode(node) : error
  +UpdateNode(oldNode, newNode) : error
  +RemoveNode(node) : error
  +UpdateSnapshot(snapshot) : error
  +Dump() : *Dump

schedulerCache
  -stop chan
  -ttl Duration
  -period Duration
  -mu sync.RWMutex
  -assumedPods sets.String
  -podStates map[string]*podState
  -nodes map[string]*nodeInfoListItem
  -headNode *nodeInfoListItem
  -nodeTree *nodeTree
  -imageStates map[string]*imageState
  +AssumePod(pod) : error
  +AddPod(pod) : error
  +RemovePod(pod) : error
  -addPod(pod)
  -removePod(pod) : error
  -updatePod(old, new) : error
  -moveNodeInfoToHead(name)
  -cleanupAssumedPods(now)
  -expirePod(key, ps) : error

Snapshot
  -nodeInfoMap map[string]*NodeInfo
  -nodeInfoList []*NodeInfo
  -havePodsWithAffinityNodeInfoList []*NodeInfo
  -havePodsWithRequiredAntiAffinityNodeInfoList []*NodeInfo
  -generation int64
  +NodeInfos() : NodeInfoLister
  +List() ([]*NodeInfo, error)
  +Get(nodeName) (*NodeInfo, error)

Figure 3: NodeInfo data structure

(Diagram not available: the Mermaid source failed to render in the original page.)

Figure 4: AddPod → NodeInfo update flow

(Flowchart; the rendered image is not available. Labels recovered from the diagram, in their original order:)

  • Edge labels: exists and is in assumedPods; node names match; node names differ; does not exist; exists and is not in assumedPods; not present
  • Start: the Informer delivers an AddPod event
  • Decision: is the Pod in podStates?
  • Decision: is the node name the same?
  • Action: delete from assumedPods, clear the deadline, update the pod reference
  • Action: removePod for the old Pod, addPod for the new Pod, delete from assumedPods
  • Action: addPod, create a new podState
  • Result: return the error "already in added state"
  • Decision: does nodes contain this node?
  • Action: create a NodeInfoListItem and add it to the nodes map
  • Action: fetch the existing NodeInfoListItem
  • Steps: NodeInfo.AddPod; calculateResource computes the Pod's resource requests; accumulate Requested/NonZeroRequested; update PodsWithAffinity/PodsWithRequiredAntiAffinity; updateUsedPorts records the used ports; moveNodeInfoToHead moves the item to the head of the list; nextGeneration increments the generation number

Figure 5: PriorityQueue architecture

(Flowchart; the rendered image is not available. Labels recovered from the diagram:)

  • Groups: event triggers, entry points, background goroutines, PriorityQueue
  • Edge labels: enqueue directly; moveRequestCycle < podSchedulingCycle; moveRequestCycle >= podSchedulingCycle; block and wait; after pop; backoff complete; more than 60 seconds (twice); scheduling failed; Broadcast wakes up Pop
  • activeQ: Heap ordered by priority; top of heap = highest priority
  • podBackoffQ: Heap ordered by backoff completion time; top of heap = first Pod to finish backoff
  • unschedulableQ: UnschedulablePodsMap; Map: podKey → QueuedPodInfo
  • Fields: lock sync.RWMutex, cond sync.Cond, schedulingCycle int64, moveRequestCycle int64
  • Background goroutines: flushBackoffQCompleted (every 1 second), flushUnschedulableQLeftover (every 30 seconds)
  • Entry points: Add (new Pod), AddUnschedulableIfNotPresent (Pod that failed scheduling), Update (Pod update), Pop (take a Pod for scheduling)
  • Event triggers: MoveAllToActiveOrBackoffQueue, AssignedPodAdded, AssignedPodUpdated

Figure 6: SchedulingQueue State Transition Diagram

[Diagram content]
States: ActiveQ, Popped, UnschedulableQ, BackoffQ.
Transition labels: Add (new Pod); Pop (scheduling starts); scheduling succeeded (binding); AddUnschedulableIfNotPresent (scheduling failed, no MoveAll); AddUnschedulableIfNotPresent (scheduling failed, MoveAll occurred); Update (Pod updated, still backing off); Update (Pod updated, backoff finished); movePodsToActiveOrBackoffQueue (backing off); movePodsToActiveOrBackoffQueue (backoff finished); flushUnschedulableQLeftover (more than 60 seconds, backing off); flushUnschedulableQLeftover (more than 60 seconds, backoff finished); flushBackoffQCompleted (backoff expired); Delete (three transitions).

Figure 7: Pod Backoff Flowchart

[Diagram content]
Default parameters: initialBackoff = 1s, maxBackoff = 10s, giving 1s → 2s → 4s → 8s → 10s → 10s...
Backoff duration (calculateBackoffDuration): duration = initialBackoff × 2^(Attempts-1); duration = min(duration, maxBackoff).
Flow nodes: Pod fails scheduling; AddUnschedulableIfNotPresent; isPodBackingoff?; enter BackoffQ; enter UnschedulableQ; flushUnschedulableQLeftover (more than 60 seconds); isPodBackingoff?; enter ActiveQ; flushBackoffQCompleted (checks the heap top every 1 second); backoff time reached?; return and wait for the next check; Pop from BackoffQ, Add to ActiveQ, Broadcast to wake Pop.
Branch labels: yes; no + MoveAll occurred; no + no MoveAll.
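
The backoff formula in Figure 7 can be sketched in a few lines of Go. This is a minimal illustration that assumes the 1 s initial and 10 s maximum defaults quoted in the figure; backoffDuration is a hypothetical helper name used here for illustration, not a function taken from the scheduler source.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration returns initial * 2^(attempts-1), capped at max.
// Doubling step by step (instead of computing a power) lets the loop
// stop as soon as the cap is reached, so the value never overflows.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		d *= 2
		if d > max {
			return max
		}
	}
	return d
}

func main() {
	// Walk through the first six attempts with the defaults from the figure.
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Println(attempts, backoffDuration(attempts, time.Second, 10*time.Second))
	}
}
```

Because the cap is applied inside the loop, a Pod with a very large attempt count costs at most a handful of iterations before the function returns the maximum.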

Figure 8: Unschedulable → Active Migration Sequence Diagram

[Diagram content]
Participants: Pop goroutine, activeQ, podBackoffQ, PriorityQueue, UnschedulablePodsMap.
Messages, in order: an event fires (NodeAdd/PvAdd/...); MoveAllToActiveOrBackoffQueue(event); iterate over all Pods in the unschedulable queue; for each Pod, podMatchesEvent(podInfo, event); if the event does not match, skip the Pod; if it matches, isPodBackingoff(podInfo); if still backing off, Add(pInfo) and delete(pod); if backoff has finished, Add(pInfo) and delete(pod); moveRequestCycle = schedulingCycle; cond.Broadcast(); the Pop goroutine is woken and takes a Pod from activeQ; Pop() returns the highest-priority Pod.

Figure 9: Complete DefaultPreemption Flowchart

[Diagram content]
PostFilter is called (the Pod cannot be scheduled) and invokes preempt.
Step 0: fetch the latest Pod object.
Step 1: PodEligibleToPreemptOthers. Not eligible: return empty, no preemption.
Step 2: FindCandidates: nodesWherePreemptionMightHelp (excludes UnschedulableAndUnresolvable nodes), getOffsetAndNumCandidates (computes the sampling offset and count), dryRunPreemption (simulates evictions in parallel), returns the candidates list. No candidates: return FitError.
Step 3: CallExtenders: iterate over the extenders; if an extender supports preemption, call extender.ProcessPreemption; on error, branch on whether the extender is ignorable or not (not ignorable: return the error). No candidates left after the iteration: stop.
Step 4: SelectCandidate: pickOneNodeForPreemption (6-level filtering) picks the best candidate.
Step 5: PrepareCandidate: DeletePod evicts the victims, Reject WaitingPod, record an Event, ClearNominatedNodeName for lower-priority nominated Pods.
Result: return NominatedNodeName.

Figure 10: selectVictimsOnNode Algorithm Flowchart

[Diagram content]
Entry: selectVictimsOnNode.
1. Iterate over all Pods on the node. If a Pod's priority is lower than the preemptor's, add it to potentialVictims and removePod; otherwise skip it.
2. If potentialVictims is empty, return UnschedulableAndUnresolvable.
3. RunFilterPluginsWithNominatedPods checks whether the preemptor is schedulable. Not schedulable: return "node unusable".
4. Sort potentialVictims by priority in descending order.
5. filterPodsWithPDBViolation splits them into violating and nonViolating groups.
6. Greedy reprieve loop over violatingVictims: reprievePod restores the Pod with addPod; if the preemptor is still schedulable, keep the Pod; otherwise removePod, add it to victims, and increment numViolatingVictim.
7. Greedy reprieve loop over nonViolatingVictims: same procedure, without the counter.
8. Return victims + numViolatingVictim.
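
The greedy reprieve procedure in Figure 10 can be sketched as follows. This is a simplified model under stated assumptions, not the scheduler's code: the Filter run is replaced by a CPU-only fit check, the sort uses priority alone, and the pod struct, the violatesPDB flag, and selectVictims are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, CPU-only model of a Pod.
type pod struct {
	name        string
	priority    int32
	milliCPU    int64
	violatesPDB bool // assumption: evicting this pod would violate a PDB
}

// selectVictims removes every lower-priority pod from the simulated node,
// then adds pods back one by one ("reprieve") as long as the preemptor
// still fits. Pods that cannot be added back become victims.
func selectVictims(capacity int64, running []pod, preemptor pod) (victims []pod, numViolating int, ok bool) {
	var used int64
	var potential []pod
	for _, p := range running {
		if p.priority < preemptor.priority {
			potential = append(potential, p) // removed from the simulated node
		} else {
			used += p.milliCPU
		}
	}
	fits := func() bool { return used+preemptor.milliCPU <= capacity }
	if len(potential) == 0 || !fits() {
		return nil, 0, false // preemption cannot help on this node
	}
	// Highest priority first, so the most important pods are reprieved first.
	sort.SliceStable(potential, func(i, j int) bool {
		return potential[i].priority > potential[j].priority
	})
	reprieve := func(p pod) bool {
		used += p.milliCPU
		if fits() {
			return true // the pod stays on the node
		}
		used -= p.milliCPU
		victims = append(victims, p)
		return false
	}
	// First pass: pods whose eviction would violate a PDB.
	for _, p := range potential {
		if p.violatesPDB && !reprieve(p) {
			numViolating++
		}
	}
	// Second pass: all remaining potential victims.
	for _, p := range potential {
		if !p.violatesPDB {
			reprieve(p)
		}
	}
	return victims, numViolating, true
}

func main() {
	running := []pod{
		{name: "a", priority: 1000, milliCPU: 1000},
		{name: "b", priority: 10, milliCPU: 1500, violatesPDB: true},
		{name: "c", priority: 5, milliCPU: 1000},
		{name: "d", priority: 1, milliCPU: 500},
	}
	victims, numViolating, ok := selectVictims(4000, running, pod{name: "p", priority: 100, milliCPU: 1500})
	fmt.Println(ok, numViolating)
	for _, v := range victims {
		fmt.Println("victim:", v.name)
	}
}
```

Handling the PDB-violating group first gives those pods the first chance to be reprieved, which is what keeps the number of PDB violations low.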

Figure 11: DryRunPreemption Execution Flowchart

[Diagram not rendered in the source (Mermaid parse error). The surviving fragment shows a "yes" branch that leads to cancel() for early termination.]

Figure 12: Parallelized Scheduling Architecture Diagram

[Diagram not rendered in the source (Mermaid parse error). The surviving fragment shows a node labeled Until(ctx, pieces, doWork.]


6. Core Design Patterns and Engineering Insights

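6.1 Optimistic Concurrency Control (the AssumePod Mechanism)

AssumePod in the Cache is one of the most important design decisions in the Kubernetes scheduler. After choosing a node, the scheduler first marks the Pod as "Assumed" (treated as already scheduled) and only then issues the Binding request. As a result:

  • The next scheduling cycle can start without waiting for the API Server to confirm the binding
  • The assumed Pod's requests are already aggregated into NodeInfo, so the next scheduling cycle sees an "optimistic" view of resources
  • If Binding fails, ForgetPod rolls the assumption back
  • If the Add event does not arrive for a long time (for example because of network problems), an expiration mechanism cleans the Pod up automatically

Expiration: once binding completes, FinishBinding sets deadline = now + ttl. The default TTL is 30 seconds. A background goroutine checks once per second and removes assumed Pods whose binding has finished and whose deadline has passed.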
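
The assume, forget, and expire lifecycle can be sketched as below. This is a toy model rather than the scheduler's implementation: toyCache, assumedPod, CleanupExpired, and the at helper are hypothetical names, resources are reduced to CPU millicores, and the clock is passed in explicitly so that the example is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// assumedPod is a hypothetical record for a pod assumed onto a node.
type assumedPod struct {
	node        string
	milliCPU    int64
	bindingDone bool
	deadline    time.Time
}

// toyCache is a stand-in for the scheduler cache, tracking CPU only.
type toyCache struct {
	ttl     time.Duration
	assumed map[string]*assumedPod // pod key -> assumed state
	usedCPU map[string]int64       // node name -> requested millicores
}

func newToyCache(ttlSeconds int64) *toyCache {
	return &toyCache{
		ttl:     time.Duration(ttlSeconds) * time.Second,
		assumed: map[string]*assumedPod{},
		usedCPU: map[string]int64{},
	}
}

// AssumePod charges the pod's request to the node before binding is confirmed.
func (c *toyCache) AssumePod(key, node string, milliCPU int64) error {
	if _, ok := c.assumed[key]; ok {
		return fmt.Errorf("pod %s is already assumed", key)
	}
	c.assumed[key] = &assumedPod{node: node, milliCPU: milliCPU}
	c.usedCPU[node] += milliCPU
	return nil
}

// FinishBinding starts the expiration clock: deadline = now + ttl.
func (c *toyCache) FinishBinding(key string, now time.Time) {
	if p, ok := c.assumed[key]; ok {
		p.bindingDone = true
		p.deadline = now.Add(c.ttl)
	}
}

// ForgetPod rolls back an assumption, for example after a failed binding.
func (c *toyCache) ForgetPod(key string) {
	if p, ok := c.assumed[key]; ok {
		c.usedCPU[p.node] -= p.milliCPU
		delete(c.assumed, key)
	}
}

// CleanupExpired removes assumed pods whose binding finished and whose
// deadline has passed, returning their keys.
func (c *toyCache) CleanupExpired(now time.Time) []string {
	var expired []string
	for key, p := range c.assumed {
		if p.bindingDone && now.After(p.deadline) {
			expired = append(expired, key)
		}
	}
	for _, key := range expired {
		c.ForgetPod(key)
	}
	return expired
}

// at builds a deterministic timestamp for the example.
func at(seconds int64) time.Time { return time.Unix(seconds, 0) }

func main() {
	c := newToyCache(30)
	if err := c.AssumePod("default/web", "node-1", 500); err != nil {
		fmt.Println(err)
		return
	}
	c.FinishBinding("default/web", at(0))
	fmt.Println(c.usedCPU["node-1"], c.CleanupExpired(at(10)))
	fmt.Println(c.CleanupExpired(at(31)), c.usedCPU["node-1"])
}
```

Pods whose binding is still in flight are never expired in this sketch; only FinishBinding starts the clock, which matches the deadline rule described above.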

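6.2 Incremental Snapshot (UpdateSnapshot)

UpdateSnapshot is on the performance-critical path. It is called once at the start of every scheduling cycle and is designed to minimize cloning cost:

  1. Generation filtering: only nodes with Generation > snapshotGeneration are cloned
  2. Doubly linked list: the most recently updated nodes sit at the head of the list, so the traversal can stop at the first node with Generation <= snapshotGeneration
  3. In-place update: *existing = *clone keeps the original pointer, so lists such as nodeInfoList do not need to be updated
  4. Lazy list rebuild: nodeInfoList is rebuilt only when nodes are added or removed, or when affinity state changes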
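
The first three points can be sketched as a generation-ordered list walk. This is a minimal sketch under assumptions: toyCache, listItem, and snapshot are hypothetical types, node removal and the lazy list rebuild are left out, and a single counter stands in for the generation source.

```go
package main

import "fmt"

type nodeInfo struct {
	name       string
	requested  int64
	generation int64
}

// listItem is one entry of a doubly linked list ordered by recency.
type listItem struct {
	info       *nodeInfo
	prev, next *listItem
}

type toyCache struct {
	head       *listItem // most recently updated node
	items      map[string]*listItem
	generation int64
}

type snapshot struct {
	nodes      map[string]*nodeInfo
	generation int64
}

func newToyCache() *toyCache { return &toyCache{items: map[string]*listItem{}} }
func newSnapshot() *snapshot { return &snapshot{nodes: map[string]*nodeInfo{}} }

// update records a change, bumps the generation, and moves the node to the head.
func (c *toyCache) update(name string, requested int64) {
	it, ok := c.items[name]
	if !ok {
		it = &listItem{info: &nodeInfo{name: name}}
		c.items[name] = it
	}
	c.generation++
	it.info.requested = requested
	it.info.generation = c.generation
	c.moveToHead(it)
}

func (c *toyCache) moveToHead(it *listItem) {
	if c.head == it {
		return
	}
	// Unlink from the current position (a no-op for a brand-new item).
	if it.prev != nil {
		it.prev.next = it.next
	}
	if it.next != nil {
		it.next.prev = it.prev
	}
	// Link in front of the old head.
	it.prev = nil
	it.next = c.head
	if c.head != nil {
		c.head.prev = it
	}
	c.head = it
}

// updateSnapshot clones only nodes newer than the snapshot and returns
// how many nodes were cloned.
func (c *toyCache) updateSnapshot(s *snapshot) int {
	cloned := 0
	for it := c.head; it != nil; it = it.next {
		if it.info.generation <= s.generation {
			break // everything further down the list is older
		}
		clone := *it.info
		if existing, ok := s.nodes[clone.name]; ok {
			*existing = clone // in-place update keeps the pointer stable
		} else {
			s.nodes[clone.name] = &clone
		}
		cloned++
	}
	if c.head != nil {
		s.generation = c.head.info.generation
	}
	return cloned
}

func main() {
	c, s := newToyCache(), newSnapshot()
	c.update("n1", 100)
	c.update("n2", 200)
	c.update("n3", 300)
	fmt.Println("first snapshot cloned:", c.updateSnapshot(s))
	c.update("n2", 250)
	fmt.Println("second snapshot cloned:", c.updateSnapshot(s))
}
```

Because every update moves its node to the head, the cost of a snapshot is proportional to the number of nodes changed since the previous one rather than to the cluster size.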

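6.3 Three-Tier Queue Layering

Each of the three queues inside PriorityQueue serves a distinct purpose:

  • ActiveQ: Pods that can be scheduled immediately, ordered by priority
  • podBackoffQ: Pods that are backing off, ordered by backoff completion time; flushed every 1 second
  • UnschedulablePodsMap: Pods waiting for cluster state to change; re-enqueued by events, with a check every 30 seconds that flushes Pods that have waited longer than 60 seconds

This layering prevents "scheduling storms", in which large numbers of failed Pods hit the scheduler over and over. Backoff makes a Pod's retry frequency fall as its failure count grows (up to the maximum backoff), while event-driven re-enqueueing ensures that Pods are retried promptly when cluster state changes.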
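
The choice between podBackoffQ and the unschedulable map for a failed Pod hinges on the two cycle counters shown in the PriorityQueue diagram (schedulingCycle and moveRequestCycle). The sketch below is a simplified model of that comparison; toyQueue and its methods are hypothetical names, and the initial moveRequestCycle of -1 is an assumption made for the example.

```go
package main

import "fmt"

type queueName string

const (
	backoffQ       queueName = "podBackoffQ"
	unschedulableQ queueName = "unschedulableQ"
)

// toyQueue tracks only the two counters that drive the routing decision.
type toyQueue struct {
	schedulingCycle  int64 // incremented on every pop
	moveRequestCycle int64 // cycle during which the last move request arrived
}

func newToyQueue() *toyQueue { return &toyQueue{moveRequestCycle: -1} }

// pop starts a scheduling attempt and returns the cycle it belongs to.
func (q *toyQueue) pop() int64 {
	q.schedulingCycle++
	return q.schedulingCycle
}

// moveAll models a cluster event that asks for unschedulable pods to be retried.
func (q *toyQueue) moveAll() { q.moveRequestCycle = q.schedulingCycle }

// routeFailedPod decides where a pod goes after a failed attempt.
// If a move request arrived during or after the pod's own cycle, the
// failure may already be stale, so the pod retries after backoff
// instead of being parked until the next event.
func (q *toyQueue) routeFailedPod(podSchedulingCycle int64) queueName {
	if q.moveRequestCycle >= podSchedulingCycle {
		return backoffQ
	}
	return unschedulableQ
}

func main() {
	q := newToyQueue()

	c1 := q.pop()
	fmt.Println(q.routeFailedPod(c1)) // no cluster event during the attempt

	c2 := q.pop()
	q.moveAll() // a cluster event arrives while the pod is being scheduled
	fmt.Println(q.routeFailedPod(c2))
}
```

Without this comparison, a Pod whose attempt overlapped with a relevant event would be parked in the unschedulable map and could miss the very change that makes it schedulable.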

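6.4 Sampling and Parallelism in Preemption

In large clusters preemption is expensive: every node needs a simulated eviction plus a Filter run. DefaultPreemption keeps the cost bounded with three optimizations:

  1. Node sampling: the dry run aims for max(10% of nodes, 100) candidates, capped at the number of nodes, instead of evaluating every node
  2. Parallelism: a Parallelizer checks nodes concurrently, with 16 goroutines by default
  3. Early termination: once enough candidates have been found, cancel stops the remaining workers

This "approximately optimal" strategy trades a little precision for performance. In practice preemption needs less precision than normal scheduling, because preemption is itself an emergency measure.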
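
The sampling target and the early-terminating parallel check can be sketched together. This is a minimal sketch, assuming that the two minimums from the constants table (10% and 100) act as lower bounds and that the result is capped at the node count; numCandidates and dryRun are hypothetical helpers, the random start offset is omitted, and a plain worker pool stands in for the scheduler's Parallelizer.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// numCandidates returns how many preemption candidates to collect:
// at least minPercentage of the nodes, at least minAbsolute,
// and never more than numNodes.
func numCandidates(numNodes, minPercentage, minAbsolute int32) int32 {
	n := numNodes * minPercentage / 100
	if n < minAbsolute {
		n = minAbsolute
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

// dryRun checks nodes 0..numNodes-1 with a fixed pool of workers and
// cancels the remaining work once `wanted` candidates were collected.
// Workers already in flight may add a few extra candidates.
func dryRun(numNodes, workers, wanted int, check func(node int) bool) []int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	pieces := make(chan int)
	go func() { // producer: hands out node indexes until cancelled
		defer close(pieces)
		for i := 0; i < numNodes; i++ {
			select {
			case pieces <- i:
			case <-ctx.Done():
				return
			}
		}
	}()

	var (
		mu    sync.Mutex
		found []int
		wg    sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for node := range pieces {
				if !check(node) {
					continue
				}
				mu.Lock()
				found = append(found, node)
				enough := len(found) >= wanted
				mu.Unlock()
				if enough {
					cancel() // early termination
				}
			}
		}()
	}
	wg.Wait()
	return found
}

func main() {
	fmt.Println(numCandidates(5000, 10, 100)) // the percentage dominates
	fmt.Println(numCandidates(500, 10, 100))  // the absolute minimum dominates
	fmt.Println(numCandidates(50, 10, 100))   // capped at the node count

	wanted := int(numCandidates(1000, 10, 100))
	// Pretend that preemption helps on every even-numbered node.
	got := dryRun(1000, 16, wanted, func(node int) bool { return node%2 == 0 })
	fmt.Println(len(got) >= wanted)
}
```

The candidate list may slightly overshoot the target, since cancellation only stops work that has not started yet; that is acceptable because the selection step picks a single node from whatever was collected.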

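6.5 The NominatedNode Mechanism

NominatedNodeName is the key output of the preemption flow. Once a Pod has been nominated for a node through preemption:

  • The Pod's status.nominatedNodeName is set
  • In the next scheduling cycle, the Pod takes part in Filter evaluation as a "nominated Pod"
  • The NodeInfo seen by other pending Pods already includes the nominated Pod's resource requests
  • If the nominated Pod ultimately cannot be scheduled (for example because Binding fails), NominatedNodeName is cleared

This avoids the "double preemption" problem, in which several high-priority Pods simultaneously preempt different low-priority Pods on the same node.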

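6.6 Precise Event Routing with UnschedulablePlugins

The UnschedulablePlugins field of QueuedPodInfo records which plugins rejected the Pod. When a cluster event occurs, podMatchesEvent uses this field to decide whether the event could make the Pod schedulable:

  • NodeTaintChange event → affects only Pods rejected by TaintToleration
  • PvAdd event → affects only Pods rejected by VolumeBinding
  • UnschedulableTimeout → wildcard, affects all Pods

This precise routing avoids pointless rescheduling attempts and reduces the scheduler's CPU usage.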
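
The routing rule can be illustrated with a small lookup table. This is a hedged sketch of the idea only: eventToPlugins and eventMayHelp are hypothetical names, events and plugins are plain strings, and the table contains just the two mappings listed above rather than the scheduler's full registration data.

```go
package main

import "fmt"

// eventToPlugins is a hypothetical routing table: for each cluster event,
// the plugins whose rejections that event might resolve.
var eventToPlugins = map[string][]string{
	"NodeTaintChange": {"TaintToleration"},
	"PvAdd":           {"VolumeBinding"},
}

// eventMayHelp reports whether an event could make a pod schedulable,
// given the set of plugins that rejected the pod.
func eventMayHelp(unschedulablePlugins map[string]bool, event string) bool {
	if event == "UnschedulableTimeout" {
		return true // wildcard: always retry
	}
	for _, plugin := range eventToPlugins[event] {
		if unschedulablePlugins[plugin] {
			return true
		}
	}
	return false
}

func main() {
	rejectedBy := map[string]bool{"TaintToleration": true}
	for _, event := range []string{"NodeTaintChange", "PvAdd", "UnschedulableTimeout"} {
		fmt.Println(event, eventMayHelp(rejectedBy, event))
	}
}
```

With this kind of table, a burst of volume events leaves Pods that were rejected only for taints untouched in the unschedulable map.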


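7. Key Constants and Configuration Parameters

| Constant / parameter | Default | Purpose |
| --- | --- | --- |
| cleanAssumedPeriod | 1 s | Interval between checks for expired assumed Pods |
| DefaultPodInitialBackoffDuration | 1 s | Initial Pod backoff duration |
| DefaultPodMaxBackoffDuration | 10 s | Maximum Pod backoff duration |
| unschedulableQTimeInterval | 60 s | Maximum time a Pod stays in UnschedulableQ |
| DefaultParallelism | 16 | Default degree of parallelism |
| MinCandidateNodesPercentage | 10% | Minimum percentage of nodes used as preemption candidates |
| MinCandidateNodesAbsolute | 100 | Minimum absolute number of preemption candidates |
| TTL (assumed Pod) | 30 s | Expiration time for assumed Pods |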

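8. Summary

The Cache, Queue, and Preemption modules of kube-scheduler together form the scheduler's core infrastructure:

  1. Cache manages cluster state efficiently through optimistic concurrency and incremental snapshots; the AssumePod mechanism lets the scheduler move on without waiting for API Server confirmation
  2. PriorityQueue combines three queues, exponential backoff, and precise event routing to balance throughput, fairness, and efficiency
  3. DefaultPreemption uses sampling, parallelism, and early termination to keep preemption feasible in large clusters

The three work together: Cache provides the state view → Queue supplies the Pods → Preemption is the fallback when regular scheduling fails. Together they close the loop of Kubernetes' declarative scheduling.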
