[Kubernetes] Scheduler Logic: From Predicates/Priorities to Filter/Score

The Evolution of the Kubernetes Scheduling Framework: From Predicates/Priorities to Filter/Score

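The Kubernetes scheduler moved from the legacy Predicates (filtering) and Priorities (scoring) to the Filter and Score extension points of the scheduling framework gradually, over roughly the Kubernetes 1.15 to 1.19 releases.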

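Timeline

Kubernetes 1.15 (released June 2019)
- First introduces the Scheduling Framework as an Alpha feature
- Design documents start proposing that Predicates/Priorities be refactored into a more flexible extension-point architecture
Kubernetes 1.16 (released September 2019)
- The scheduling framework keeps maturing
- Internal Predicates logic starts moving into Filter plugins
Kubernetes 1.17 (released December 2019)
- The framework's functionality continues to expand
- More built-in scheduler features move into the plugin system
Kubernetes 1.18 (released March 2020)
- The scheduling framework is largely complete
- Most Predicates and Priorities have been migrated to the corresponding Filter and Score plugins
Kubernetes 1.19 (released August 2020)
- The scheduling framework is stable (GA)
- The migration from the old architecture to the new one is complete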

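Main changes

Architecture:
- From a two-phase (filter/score) model to a multi-stage plugin architecture
- More fine-grained extension points: PreFilter, Filter, PostFilter, PreScore, Score, Reserve, and others
Code organization:
- Old code lived in pkg/scheduler/algorithm/predicates and pkg/scheduler/algorithm/priorities
- New code lives in pkg/scheduler/framework/plugins, organized by plugin
Configuration:
- The old scheduler configured predicate and priority functions through --policy-config-file (the Policy API was later removed, in Kubernetes 1.23)
- The new scheduler configures plugins and their order through ComponentConfig, that is, a KubeSchedulerConfiguration file passed with --config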

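Why the change

Extensibility: the scheduler can be extended through plugins without modifying its core code
Fine-grained control: plugins can intervene at each stage of the scheduling process
Maintainability: a clearer, modular code structure
Performance: staged processing lets plugins do per-Pod work once (for example in PreFilter or PreScore) instead of repeating it for every node

The refactoring made the Kubernetes scheduler more flexible and extensible while keeping backward compatibility during the transition, so that legacy scheduling policies could be migrated to the new architecture smoothly.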

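How Predicates and Priorities Are Organized in the Kubernetes Scheduler

In recent versions of the Kubernetes scheduler, the Predicates (filtering) and Priorities (scoring) functions have been folded into the unified Scheduling Framework. Under this architecture they are organized mainly in the following files: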

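Core files

1. Framework definition

The main files live under pkg/scheduler/framework (file names shift between releases, so check the tree for your version):

interface.go: defines the plugin interfaces, including Filter (formerly Predicates) and Score (formerly Priorities)
types.go and cycle_state.go: define shared types such as NodeInfo and CycleState
runtime/framework.go: implements the framework runtime that invokes the plugins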

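2. Plugin implementations

Scheduling plugins live under pkg/scheduler/framework/plugins. Each plugin is responsible for a specific piece of filtering or scoring logic:

nodeaffinity: node affinity
tainttoleration: taints and tolerations
noderesources: node resource fit
nodeports: host port availability
volumebinding: volume binding
...and so on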

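3. Configuration and registration

pkg/scheduler/scheduler.go: the main entry point that creates the scheduler
pkg/scheduler/profile/profile.go: manages scheduling profiles
pkg/scheduler/framework/plugins/registry.go: registers all available in-tree plugins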

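The scheduling flow

In current versions:

Predicates have been replaced by Filter plugins, which filter out nodes that do not meet the Pod's basic requirements
Priorities have been replaced by Score plugins, which score the nodes that passed filtering

The order in which the scheduler calls these plugins is driven by the scheduling-cycle code (pkg/scheduler/schedule_one.go in recent releases) and follows this flow:

1. PreFilter → 2. Filter → 3. PostFilter → 4. PreScore → 5. Score → 6. NormalizeScore → 7. Reserve

PostFilter is the exception in this sequence: it runs only when no node passes Filter (this is where preemption is implemented), and in that case the cycle ends without reaching the scoring steps.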
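The flow above can be sketched as a toy scheduling cycle. This is a minimal model rather than scheduler code: the Node type, the runCycle function, and the "most free CPU wins" scoring rule are all hypothetical, chosen only to show which extension points are reached on the success path and on the failure path.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a cluster node.
type Node struct {
	Name         string
	FreeMilliCPU int64
}

// runCycle models the order of extension points for a single Pod that
// requests requestMilliCPU. It returns the chosen node name ("" if none)
// and the list of extension points that were reached.
func runCycle(nodes []Node, requestMilliCPU int64) (string, []string) {
	// PreFilter runs once per Pod, before any node is examined.
	trace := []string{"PreFilter"}

	// Filter runs per node and keeps only the nodes that can host the Pod.
	trace = append(trace, "Filter")
	var feasible []Node
	for _, n := range nodes {
		if n.FreeMilliCPU >= requestMilliCPU {
			feasible = append(feasible, n)
		}
	}

	// PostFilter is reached only when no node survived Filter.
	if len(feasible) == 0 {
		trace = append(trace, "PostFilter")
		return "", trace
	}

	// PreScore, Score and NormalizeScore rank the feasible nodes.
	// Toy scoring rule (an assumption): the node with the most free CPU wins.
	trace = append(trace, "PreScore", "Score", "NormalizeScore")
	best := feasible[0]
	for _, n := range feasible[1:] {
		if n.FreeMilliCPU > best.FreeMilliCPU {
			best = n
		}
	}

	// Reserve marks the resources on the chosen node before binding.
	trace = append(trace, "Reserve")
	return best.Name, trace
}

func main() {
	nodes := []Node{{"node-a", 500}, {"node-b", 2000}, {"node-c", 1500}}

	chosen, trace := runCycle(nodes, 1000)
	fmt.Println(chosen, trace)

	chosen, trace = runCycle(nodes, 5000) // nothing fits
	fmt.Printf("%q %v\n", chosen, trace)
}
```

In the second call no node fits, so the trace stops at PostFilter and none of the scoring steps appear.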

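Plugin extension points

The modern Kubernetes scheduling framework divides the scheduling cycle into several extension points, among which:

Filter extension point: corresponds to the former Predicates
Score extension point: corresponds to the former Priorities

This plugin-based architecture makes the scheduler more flexible and extensible, and lets users customize scheduling behavior by combining plugins.

To read a specific piece of filtering or scoring logic, look at the corresponding plugin implementation under pkg/scheduler/framework/plugins.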

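Extension Points of the Kubernetes Scheduling Framework: Filter and Score

I. The Filter extension point

1. Concept and role

The Filter extension point (formerly Predicates) is the filtering phase of scheduling. It selects the nodes that meet the Pod's basic requirements.

When it runs: early in the scheduling cycle
What it does: filters out nodes that are not suitable for running the Pod
Return value: a binary pass/fail result
Behavior: as soon as one Filter plugin fails a node, that node is excluded and the remaining Filter plugins are not run for it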
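The short-circuit behavior described above can be illustrated with a small sketch. The FilterFunc type and the two toy filters are assumptions made for this example; real Filter plugins return a *framework.Status rather than a string.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a node.
type Node struct {
	Name          string
	Unschedulable bool
	FreeMilliCPU  int64
}

// FilterFunc is a simplified Filter: it returns "" when the node passes,
// or a human-readable reason when it does not.
type FilterFunc func(n Node, requestMilliCPU int64) string

func schedulable(n Node, _ int64) string {
	if n.Unschedulable {
		return "node is unschedulable"
	}
	return ""
}

func fitsCPU(n Node, requestMilliCPU int64) string {
	if n.FreeMilliCPU < requestMilliCPU {
		return "insufficient cpu"
	}
	return ""
}

// feasibleNodes runs the filters in order for every node. The first failing
// filter excludes the node, and later filters are skipped for that node.
func feasibleNodes(nodes []Node, requestMilliCPU int64, filters []FilterFunc) ([]string, map[string]string) {
	var passed []string
	rejected := map[string]string{}
	for _, n := range nodes {
		reason := ""
		for _, f := range filters {
			if reason = f(n, requestMilliCPU); reason != "" {
				break // short-circuit on the first failure
			}
		}
		if reason == "" {
			passed = append(passed, n.Name)
		} else {
			rejected[n.Name] = reason
		}
	}
	return passed, rejected
}

func main() {
	nodes := []Node{
		{"node-a", false, 2000},
		{"node-b", true, 100}, // fails both filters; only the first is recorded
		{"node-c", false, 100},
	}
	passed, rejected := feasibleNodes(nodes, 500, []FilterFunc{schedulable, fitsCPU})
	fmt.Println(passed)
	fmt.Println(rejected["node-b"])
	fmt.Println(rejected["node-c"])
}
```

node-b would fail both checks, but only the first reason is recorded because the CPU filter never runs for it.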

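2. Core built-in Filter plugins

Some important built-in Filter plugins:

- NodeName: checks the node name specified by the Pod
- NodeUnschedulable: checks whether the node is schedulable
- NodeResourcesFit: checks whether the node has enough resources
- NodeAffinity: implements node affinity rules
- TaintToleration: checks whether the Pod tolerates the node's taints
- PodTopologySpread: implements Pod topology spread constraints
- VolumeRestrictions: checks volume restrictions
- NodeVolumeLimits: checks the node's volume count limits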

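II. The Score extension point

1. Concept and role

The Score extension point (formerly Priorities) is the scoring phase of scheduling. It scores the nodes that passed filtering.

When it runs: after Filter
What it does: assigns each node a score from 0 to 100
Return value: a numeric score
Behavior: the scores from all Score plugins are multiplied by each plugin's weight and summed, and the node with the highest total is selected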
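As a worked illustration of the weighted sum, here is a minimal sketch. The pluginScores type, the plugin weights, and the per-node scores are made up for the example, and ties are broken by node name to keep the output deterministic (the real scheduler chooses among equally scored nodes differently).

```go
package main

import (
	"fmt"
	"sort"
)

// pluginScores holds one plugin's weight and its per-node scores,
// assumed to be already normalized to the 0-100 range.
type pluginScores struct {
	Weight int64
	Scores map[string]int64
}

// pickNode sums weight*score per node across all plugins and returns the
// node with the highest total together with all totals.
func pickNode(plugins []pluginScores) (string, map[string]int64) {
	totals := map[string]int64{}
	for _, p := range plugins {
		for node, score := range p.Scores {
			totals[node] += p.Weight * score
		}
	}

	// Iterate in sorted order so that ties resolve deterministically.
	names := make([]string, 0, len(totals))
	for name := range totals {
		names = append(names, name)
	}
	sort.Strings(names)

	best, bestTotal := "", int64(-1)
	for _, name := range names {
		if totals[name] > bestTotal {
			best, bestTotal = name, totals[name]
		}
	}
	return best, totals
}

func main() {
	plugins := []pluginScores{
		{Weight: 1, Scores: map[string]int64{"node-a": 80, "node-b": 20}},
		{Weight: 2, Scores: map[string]int64{"node-a": 30, "node-b": 70}},
	}
	best, totals := pickNode(plugins)
	// node-a: 1*80 + 2*30 = 140, node-b: 1*20 + 2*70 = 160
	fmt.Println(best, totals["node-a"], totals["node-b"])
}
```

Although node-a wins under the first plugin, the second plugin's weight of 2 tips the total in favor of node-b.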

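2. Core built-in Score plugins

Some important built-in Score plugins:

- NodeResourcesBalancedAllocation: balances resource usage on the node
- NodeResourcesFit: scores based on the requested resources
- NodeAffinity: scores preferred node affinity
- ImageLocality: scores based on images already present on the node
- InterPodAffinity: scores inter-Pod affinity
- TaintToleration: scores based on taints and tolerations
- PodTopologySpread: scores topology spread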

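III. How to extend these extension points

1. Implement the plugin interfaces

To create a custom scheduling plugin, implement the relevant interfaces:

```go
// Simplified from the framework package (pkg/scheduler/framework/interface.go).
// Exact signatures vary between releases, so check the version you build against.

// FilterPlugin is the Filter plugin interface.
type FilterPlugin interface {
    Plugin
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

// ScorePlugin is the Score plugin interface.
type ScorePlugin interface {
    Plugin
    Score(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (int64, *Status)
    // ScoreExtensions returns the optional normalization hook, or nil if none is needed.
    ScoreExtensions() ScoreExtensions
}

// Plugin is the interface common to all plugins.
type Plugin interface {
    Name() string
}
```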

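2. Register the plugin

Write a factory function that the scheduler calls to construct the plugin:

```go
// New is the plugin factory. Here runtime is k8s.io/apimachinery/pkg/runtime
// and framework is k8s.io/kubernetes/pkg/scheduler/framework.
// Newer releases add a leading ctx context.Context parameter to the factory
// signature, so match the version you vendor.
func New(args runtime.Object, h framework.Handle) (framework.Plugin, error) {
    // Decode the configuration arguments
    // Initialize the plugin
    return &YourPlugin{
        handle: h,
        // other fields
    }, nil
}
```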
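To show how a factory like this is typically consumed, here is a standard-library-only sketch of a name-to-factory registry. The Factory signature, the Build method, and the nodeTypeFilter plugin are simplified assumptions for illustration; the real framework passes a runtime.Object and a framework.Handle to the factory.

```go
package main

import "fmt"

// Plugin mirrors the framework's minimal Plugin interface.
type Plugin interface {
	Name() string
}

// Factory is a simplified, hypothetical plugin factory.
type Factory func(args map[string]string) (Plugin, error)

// Registry maps plugin names to the factories that build them.
type Registry map[string]Factory

// Register adds a factory and rejects duplicate names.
func (r Registry) Register(name string, f Factory) error {
	if _, exists := r[name]; exists {
		return fmt.Errorf("plugin %q is already registered", name)
	}
	r[name] = f
	return nil
}

// Build instantiates the plugins enabled by the configuration, in order.
func (r Registry) Build(enabled []string, args map[string]string) ([]Plugin, error) {
	plugins := make([]Plugin, 0, len(enabled))
	for _, name := range enabled {
		factory, ok := r[name]
		if !ok {
			return nil, fmt.Errorf("plugin %q is enabled but not registered", name)
		}
		p, err := factory(args)
		if err != nil {
			return nil, fmt.Errorf("initializing plugin %q: %w", name, err)
		}
		plugins = append(plugins, p)
	}
	return plugins, nil
}

// nodeTypeFilter is a hypothetical plugin used only for this example.
type nodeTypeFilter struct{ wantType string }

func (p *nodeTypeFilter) Name() string { return "NodeTypeFilter" }

func newNodeTypeFilter(args map[string]string) (Plugin, error) {
	return &nodeTypeFilter{wantType: args["nodeType"]}, nil
}

func main() {
	registry := Registry{}
	if err := registry.Register("NodeTypeFilter", newNodeTypeFilter); err != nil {
		panic(err)
	}

	plugins, err := registry.Build([]string{"NodeTypeFilter"}, map[string]string{"nodeType": "special"})
	fmt.Println(len(plugins), err)

	// Enabling a name that was never registered fails at startup.
	_, err = registry.Build([]string{"NotCompiledIn"}, nil)
	fmt.Println(err)
}
```

The second Build call shows why a plugin name that appears only in configuration, without a registered factory behind it, cannot be enabled.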

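3. Build and deploy the custom scheduler

There are two main ways to deploy custom scheduling plugins:

Option 1: compile the plugin into a custom scheduler

Create a custom scheduler binary that imports your plugin
Enable the plugin in the scheduler configuration
Deploy it as a standalone scheduler Pod

```go
package main

import (
    "os"

    "k8s.io/kubernetes/cmd/kube-scheduler/app"

    "yourpkg/yourplugin" // your plugin package (placeholder path)
)

func main() {
    // Register the plugin under the same name that its Name() method returns.
    command := app.NewSchedulerCommand(
        app.WithPlugin("YourPlugin", yourplugin.New),
    )
    // Start the scheduler
    if err := command.Execute(); err != nil {
        os.Exit(1)
    }
}
```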

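Option 2: configure the scheduler through a ConfigMap

Configuration alone cannot load new code: every plugin name enabled below must already be compiled into the scheduler binary that reads this file (the built-in plugins for the stock kube-scheduler, or plugins registered through app.WithPlugin in a custom build). Create a ConfigMap that holds the scheduler configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        plugins:
          filter:
            enabled:
            - name: "YourFilterPlugin"
          score:
            enabled:
            - name: "YourScorePlugin"
              weight: 5
```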

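Then change the scheduler deployment to use that configuration:

```yaml
spec:
  containers:
  - command:
    - kube-scheduler
    - --config=/etc/kubernetes/scheduler-config/scheduler-config.yaml
    volumeMounts:
      - name: scheduler-config
        # Mount into a dedicated directory so the ConfigMap does not hide
        # other files that may live under /etc/kubernetes/.
        mountPath: /etc/kubernetes/scheduler-config
        readOnly: true
  volumes:
    - name: scheduler-config
      configMap:
        name: scheduler-config
```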

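4. Use the custom scheduler

A Pod can ask for the custom scheduler in its spec. The value must match the schedulerName of a profile in the configuration of a running scheduler:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  schedulerName: your-custom-scheduler  # the scheduler that should place this Pod
  containers:
  - name: container
    image: nginx
```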

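Example: implementing simple Filter and Score plugins

A simple Filter plugin

```go
import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// NodeTypeFilter admits only nodes of a specific type, based on a node label.
type NodeTypeFilter struct{}

// Compile-time check against the FilterPlugin signature shown earlier.
var _ framework.FilterPlugin = &NodeTypeFilter{}

func (pl *NodeTypeFilter) Name() string {
    return "NodeTypeFilter"
}

func (pl *NodeTypeFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    // Get the node
    node := nodeInfo.Node()
    if node == nil {
        return framework.NewStatus(framework.Error, "node not found")
    }

    // Pass only nodes that carry the label node-type=special
    if value, ok := node.Labels["node-type"]; ok && value == "special" {
        return framework.NewStatus(framework.Success, "")
    }

    return framework.NewStatus(framework.Unschedulable, "node does not have the required type")
}
```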

A simple Score plugin

```go
import (
    "context"
    "fmt"
    "math"

    v1 "k8s.io/api/core/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodDensityScore scores a node by the number of Pods already on it.
type PodDensityScore struct {
    handle framework.Handle
}

func (pl *PodDensityScore) Name() string {
    return "PodDensityScore"
}

func (pl *PodDensityScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
    if err != nil {
        return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
    }

    // Count the Pods on the node
    podCount := len(nodeInfo.Pods)

    // Fewer Pods means a higher score, clamped at 0 for nodes with 100 or more Pods
    score := 100 - int64(podCount)
    if score < 0 {
        score = 0
    }

    return score, framework.NewStatus(framework.Success, "")
}

// ScoreExtensions exposes NormalizeScore to the framework.
func (pl *PodDensityScore) ScoreExtensions() framework.ScoreExtensions {
    return pl
}

// NormalizeScore stretches the raw scores so that they span the full 0-100 range.
func (pl *PodDensityScore) NormalizeScore(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status {
    // Find the highest and lowest scores
    var highest int64 = 0
    var lowest int64 = math.MaxInt64

    for _, nodeScore := range scores {
        if nodeScore.Score > highest {
            highest = nodeScore.Score
        }
        if nodeScore.Score < lowest {
            lowest = nodeScore.Score
        }
    }

    // Avoid dividing by zero; the raw scores are already within 0-100
    if highest == lowest {
        return framework.NewStatus(framework.Success, "")
    }

    // Min-max normalize into the 0-100 range
    for i, nodeScore := range scores {
        scores[i].Score = (nodeScore.Score - lowest) * 100 / (highest - lowest)
    }

    return framework.NewStatus(framework.Success, "")
}
```

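With the approach above you can implement custom scheduling logic for your own requirements and make Kubernetes scheduling fit specific scenarios better.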

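A note on init containers: the scheduler does not size a Pod by the init container with the largest request.cpu alone. It compares the largest init-container request.cpu with the sum of request.cpu across the regular containers and schedules the Pod using the larger of the two. Because init containers finish before the regular containers start, this reserves enough for both phases without double counting.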

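Kubernetes Scheduler resource calculation

The scheduler follows specific rules when computing the resources a Pod needs, particularly for Pods that have init containers. The current mechanism is as follows:

Resource calculation rules

Regular containers:
- For regular containers, the scheduler sums their CPU and memory requests, since they all run concurrently
Init containers:
- For init containers, the scheduler does not simply take the one with the largest CPU request
- Instead:
  - For each resource separately, it finds the largest request among all init containers (they run one at a time)
  - It compares that maximum with the regular-container sum and takes the larger as the Pod's request for that resource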

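Example

Suppose a Pod has two init containers and two regular containers:

Init container 1:

CPU: 2 cores
Memory: 1GB

Init container 2:

CPU: 1 core
Memory: 2GB

Regular containers 1 & 2, summed:

CPU: 1 core
Memory: 3GB

The Pod's effective resource request is then:

CPU: max(init maximum 2, regular sum 1) = 2 cores (the largest init-container CPU request wins)
Memory: max(init maximum 2, regular sum 3) = 3GB (the regular-container sum wins because it exceeds the largest init-container memory request)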
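The same arithmetic as a small sketch. The Resources type and the effectiveRequest function are hypothetical names for this example, and the split of the regular containers into 0.5 core/1GB and 0.5 core/2GB is an assumption (the example above only gives their sum). The sketch covers only the rule described here and ignores anything else that may contribute to a Pod's request.

```go
package main

import "fmt"

// Resources is a simplified resource vector: CPU in millicores, memory in GB.
type Resources struct {
	MilliCPU int64
	MemoryGB int64
}

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// effectiveRequest applies the rule per resource:
// max(largest init-container request, sum of regular-container requests).
func effectiveRequest(initContainers, containers []Resources) Resources {
	var sum, initMax Resources
	for _, c := range containers {
		sum.MilliCPU += c.MilliCPU
		sum.MemoryGB += c.MemoryGB
	}
	for _, c := range initContainers {
		initMax.MilliCPU = max64(initMax.MilliCPU, c.MilliCPU)
		initMax.MemoryGB = max64(initMax.MemoryGB, c.MemoryGB)
	}
	return Resources{
		MilliCPU: max64(sum.MilliCPU, initMax.MilliCPU),
		MemoryGB: max64(sum.MemoryGB, initMax.MemoryGB),
	}
}

func main() {
	initContainers := []Resources{
		{MilliCPU: 2000, MemoryGB: 1}, // init container 1
		{MilliCPU: 1000, MemoryGB: 2}, // init container 2
	}
	// Assumed split of the regular containers; only their sum (1 core, 3GB) matters.
	containers := []Resources{
		{MilliCPU: 500, MemoryGB: 1},
		{MilliCPU: 500, MemoryGB: 2},
	}
	fmt.Printf("%+v\n", effectiveRequest(initContainers, containers))
	// Prints: {MilliCPU:2000 MemoryGB:3}
}
```

Note that the two resources are decided independently: CPU comes from an init container while memory comes from the regular containers.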

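Where this lives in the source

This calculation is implemented in the scheduler code, in the pkg/scheduler/framework/plugins/noderesources package. The function that computes a Pod's resource request (computePodResourceRequest in fit.go in recent releases; the name has changed across versions) works out the init-container maximum and the regular-container sum for each resource and takes the larger of the two.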

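Conclusion

In summary, the Kubernetes scheduler does not just take the init container with the largest CPU request. Instead it:

Finds, for each resource type (CPU, memory, and so on), the largest request among the init containers
Compares each of those maxima with the sum of the regular containers' requests for the same resource
Takes the larger of the two as the Pod's total request for that resource

This ensures that scheduling accounts for both the peak resource needs of the init containers and the combined needs of the regular containers.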