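An In-Depth Analysis of Load Balancing in the Linux 2.6.10 Scheduler: From Theory to Implementation

Foreword

Load balancing is more than moving tasks from busy CPUs to idle ones. It is a systems problem that involves CPU topology awareness, cache friendliness, real-time responsiveness, and power management. Linux 2.6.10 addresses it with a hierarchy of scheduling domains (sched domains), a task-selection algorithm, and a migration mechanism that together aim to deliver performance without sacrificing stability.

This article starts at the sched_exec entry point and follows the complete task-migration path, examining the implementation, design rationale, and performance considerations of each key function. The core components covered are the load-evaluation algorithm in find_idlest_cpu, the task-selection strategy in move_tasks, the migration checks in can_migrate_task, and the migration operation in pull_task.

The article also examines the role of the migration thread in asynchronous task migration. This cooperation across subsystems, for example with the suspend-time refrigerator, shows how much coordination the kernel requires as a whole.

Readers should come away with a detailed understanding of load balancing in the Linux 2.6.10 scheduler and of how kernel design trades off performance, power, real-time behavior, and stability. These ideas remain a useful reference for operating-system tuning on modern multicore processors.

Migrating a task to a more suitable CPU: `sched_exec`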

void sched_exec(void)
{
	struct sched_domain *tmp, *sd = NULL;
	int new_cpu, this_cpu = get_cpu();

	schedstat_inc(this_rq(), sbe_cnt);
	/* Prefer the current CPU if there's only this task running */
	if (this_rq()->nr_running <= 1)
		goto out;

	for_each_domain(this_cpu, tmp)
		if (tmp->flags & SD_BALANCE_EXEC)
			sd = tmp;

	if (sd) {
		schedstat_inc(sd, sbe_attempts);
		new_cpu = find_idlest_cpu(current, this_cpu, sd);
		if (new_cpu != this_cpu) {
			schedstat_inc(sd, sbe_pushed);
			put_cpu();
			sched_migrate_task(current, new_cpu);
			return;
		}
	}
out:
	put_cpu();
}

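Function overview

sched_exec is called when a process executes the execve system call. If it is worthwhile, it migrates the task to a more suitable CPU, improving performance and load balance.

Code walkthrough

Variable declarations and initialization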

struct sched_domain *tmp, *sd = NULL;
int new_cpu, this_cpu = get_cpu();

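struct sched_domain *tmp, *sd = NULL:
- tmp: temporary cursor used to walk the scheduling domains
- sd: the scheduling domain selected for balancing, initially NULL
int new_cpu, this_cpu = get_cpu():
- new_cpu: the CPU to migrate to
- this_cpu = get_cpu(): reads the current CPU number and disables kernel preemption
- the get_cpu() macro returns the current CPU ID and increments preempt_count

Statistics and the trivial-case check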

schedstat_inc(this_rq(), sbe_cnt);
/* Prefer the current CPU if there's only this task running */
if (this_rq()->nr_running <= 1)
    goto out;

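schedstat_inc(this_rq(), sbe_cnt): increments the sbe_cnt statistics counter (sched balance exec count)
this_rq()->nr_running <= 1: checks whether the current task is the only runnable task on this runqueue
If so, jump straight to the out label and stay on the current CPU
This is an optimization: when the current CPU has nothing else to run, migration cannot help

Finding a scheduling domain that balances on exec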

for_each_domain(this_cpu, tmp)
    if (tmp->flags & SD_BALANCE_EXEC)
        sd = tmp;

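for_each_domain(this_cpu, tmp): walks the scheduling domains of the current CPU, from the lowest level up through each parent
tmp->flags & SD_BALANCE_EXEC: checks whether the domain allows balancing at exec time
sd = tmp: remembers the last matching domain, which is the highest-level (widest) one with SD_BALANCE_EXEC set
Scheduling domains are the hierarchical structure the Linux scheduler uses to describe CPU topology and the scope of load balancing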
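
To make the "last match wins" behavior of this loop concrete, here is a minimal userspace sketch, not kernel code. The `struct domain` type, its fields, and the flag value `DEMO_BALANCE_EXEC` are assumptions made for illustration; the only property carried over from the kernel is that the walk goes from a CPU's lowest domain upward through `parent` pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct sched_domain (hypothetical fields). */
struct domain {
	struct domain *parent;  /* next wider domain, NULL at the top */
	unsigned int flags;
};

#define DEMO_BALANCE_EXEC 0x4u  /* illustrative value, not the kernel's */

/* Walk from the base domain upward; the last match is the widest one. */
static struct domain *widest_domain_with_flag(struct domain *base,
					      unsigned int flag)
{
	struct domain *tmp, *sd = NULL;

	assert(flag != 0);
	for (tmp = base; tmp; tmp = tmp->parent)
		if (tmp->flags & flag)
			sd = tmp;
	return sd;
}
```

Because the loop never breaks on a match, a flag set at both a narrow and a wide level selects the wide one, which gives the exec-time search the largest set of candidate CPUs.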

Handling the selected domain

if (sd) {
    schedstat_inc(sd, sbe_attempts);
    new_cpu = find_idlest_cpu(current, this_cpu, sd);

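if (sd): a suitable scheduling domain was found
schedstat_inc(sd, sbe_attempts): counts the balancing attempt
new_cpu = find_idlest_cpu(current, this_cpu, sd): looks for the idlest CPU in the domain
- current: the current task
- this_cpu: the current CPU
- sd: the selected scheduling domain

Performing the migration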

    if (new_cpu != this_cpu) {
        schedstat_inc(sd, sbe_pushed);
        put_cpu();
        sched_migrate_task(current, new_cpu);
        return;
    }
}

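if (new_cpu != this_cpu): the idlest CPU found is not the current one
schedstat_inc(sd, sbe_pushed): counts the decision to push the task to another CPU
put_cpu(): re-enables kernel preemption (paired with the earlier get_cpu())
sched_migrate_task(current, new_cpu): migrates the current task to the new CPU
- Why migrate the current process?
- Because execve replaces the context of the current process rather than creating a new one, migrating the current process is equivalent to placing the new program
- Balancing here is worthwhile because the newly executed program may turn out to be heavyweight
return: returns directly, skipping the code below

Exit handling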

out:
put_cpu();

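out label: the common exit point
put_cpu(): re-enables kernel preemption
This path means that no migration was needed, no domain had SD_BALANCE_EXEC set, or no better CPU was found

Function summary

Main purpose: make a CPU placement decision when a process calls execve, to improve system load balance

Timing: CPU placement is re-evaluated at program execution
Load balancing: the task is migrated to a more suitable CPU
- finds the idlest CPU in the selected scheduling domain
- avoids CPU hot spots and improves overall throughput
Conditional: migration happens only when needed
- no migration when the current CPU has no other runnable task
- no migration when there is no suitable target CPU

Statistics:

sbe_cnt: number of times sched_exec was called
sbe_attempts: number of times a balancing domain was found and a target CPU was searched for
sbe_pushed: number of times the task was pushed to another CPU

Finding the idlest CPU in a given scheduling domain: `find_idlest_cpu`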

static int find_idlest_cpu(struct task_struct *p, int this_cpu,
			   struct sched_domain *sd)
{
	unsigned long load, min_load, this_load;
	int i, min_cpu;
	cpumask_t mask;

	min_cpu = UINT_MAX;
	min_load = ULONG_MAX;

	cpus_and(mask, sd->span, p->cpus_allowed);

	for_each_cpu_mask(i, mask) {
		load = target_load(i);

		if (load < min_load) {
			min_cpu = i;
			min_load = load;

			/* break out early on an idle CPU: */
			if (!min_load)
				break;
		}
	}

	/* add +1 to account for the new task */
	this_load = source_load(this_cpu) + SCHED_LOAD_SCALE;

	/*
	 * Would with the addition of the new task to the
	 * current CPU there be an imbalance between this
	 * CPU and the idlest CPU?
	 *
	 * Use half of the balancing threshold - new-context is
	 * a good opportunity to balance.
	 */
	if (min_load*(100 + (sd->imbalance_pct-100)/2) < this_load*100)
		return min_cpu;

	return this_cpu;
}

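Function overview

find_idlest_cpu looks for the idlest CPU within the given scheduling domain and decides whether the task should move to it

Code walkthrough

Variable declarations and initialization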

unsigned long load, min_load, this_load;
int i, min_cpu;
cpumask_t mask;

min_cpu = UINT_MAX;
min_load = ULONG_MAX;

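load: holds the load of the CPU currently being examined
min_load: smallest load seen so far, initialized to the largest unsigned long
this_load: load of the current CPU
i: loop variable
min_cpu: number of the idlest CPU seen so far, initialized to UINT_MAX (min_cpu is a signed int, so this is simply a sentinel that the first comparison overwrites)
mask: CPU mask holding the set of CPUs that may be searched
Initializing min_load to the maximum value guarantees that the first comparison updates both min_cpu and min_load

Computing the set of candidate CPUs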

cpus_and(mask, sd->span, p->cpus_allowed);

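cpus_and(mask, sd->span, p->cpus_allowed): stores the intersection of two CPU sets in mask
- sd->span: all CPUs covered by the scheduling domain
- p->cpus_allowed: CPUs the task is allowed to run on (restricted by its CPU affinity settings)
- mask: the result, containing only CPUs that are both in the domain and allowed for the task

Scanning the candidates for the idlest CPU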

for_each_cpu_mask(i, mask) {
    load = target_load(i);

    if (load < min_load) {
        min_cpu = i;
        min_load = load;

        /* break out early on an idle CPU: */
        if (!min_load)
            break;
    }
}

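for_each_cpu_mask(i, mask): iterates over every CPU in the mask
load = target_load(i): gets the load of CPU i, taking the larger of the historical (decayed) load and the instantaneous load
if (load < min_load): a less loaded CPU was found
- update min_cpu and min_load
if (!min_load) break: an optimization; once a completely idle CPU (load 0) is found, the search stops immediately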
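
The mask intersection and the early-exit scan can be modeled in a few lines of userspace C. This is a sketch under stated assumptions: CPU sets are plain `unsigned long` bitmasks rather than `cpumask_t`, per-CPU loads come from an array instead of `target_load()`, and the function name is hypothetical. Unlike the kernel code, it returns -1 when the intersection is empty.

```c
#include <assert.h>
#include <limits.h>

/* Return the least-loaded CPU in (span & allowed), or -1 if that set is empty. */
static int idlest_in_mask(unsigned long span, unsigned long allowed,
			  const unsigned long *load, int nr_cpus)
{
	unsigned long mask = span & allowed;   /* cpus_and() analogue */
	unsigned long min_load = ULONG_MAX;
	int i, min_cpu = -1;

	assert(nr_cpus > 0 && nr_cpus <= (int)(sizeof(mask) * CHAR_BIT));
	for (i = 0; i < nr_cpus; i++) {
		if (!(mask & (1UL << i)))
			continue;              /* not a candidate */
		if (load[i] < min_load) {
			min_cpu = i;
			min_load = load[i];
			if (!min_load)         /* idle CPU: nothing can beat it */
				break;
		}
	}
	return min_cpu;
}
```

With loads {256, 128, 0, 0} and all four CPUs allowed, the scan stops at CPU 2 without ever looking at CPU 3.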

Estimating the load of the current CPU

/* add +1 to account for the new task */
this_load = source_load(this_cpu) + SCHED_LOAD_SCALE;

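source_load(this_cpu): returns the load of the current CPU, taking the smaller of the historical (decayed) load and the instantaneous load
+ SCHED_LOAD_SCALE: adds the load contribution of the new task
- SCHED_LOAD_SCALE is 128 in this kernel and represents the load of one task
The result estimates the total load of the current CPU if the new task stays on it

The balancing decision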

/*
 * Would with the addition of the new task to the
 * current CPU there be an imbalance between this
 * CPU and the idlest CPU?
 *
 * Use half of the balancing threshold - new-context is
 * a good opportunity to balance.
 */
if (min_load*(100 + (sd->imbalance_pct-100)/2) < this_load*100)
    return min_cpu;

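Only half of the usual balancing threshold is used, because a new context is a good opportunity to balance
min_load*(100 + (sd->imbalance_pct-100)/2): the load of the idlest CPU, scaled up by the threshold
- sd->imbalance_pct: the domain's imbalance percentage (typically 125, meaning a 25% imbalance is tolerated)
- (sd->imbalance_pct-100)/2: half of that tolerance
- For example, with imbalance_pct = 125, (125-100)/2 = 12 under integer division
- so the left-hand side is min_load * 112
this_load*100: the estimated load of the current CPU times 100
Decision: migrate if the scaled load of the idlest CPU is less than the scaled load of the current CPU, that is, if this_load exceeds min_load by more than 12%

Staying on the current CPU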

return this_cpu;

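If the load difference is not large enough to justify a migration, the current CPU is returned

Key functions and concepts

target_load(i) and source_load(i)

Both functions return the load of a CPU, but they estimate it differently
target_load evaluates a prospective target CPU and takes the larger of the historical load and the instantaneous load
source_load evaluates the source CPU and takes the smaller of the two
Migration is deliberately biased toward not migrating: the target's load is overestimated and the source's load is underestimated, which avoids frequent migrations caused by load jitter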
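
The min/max asymmetry is easy to see in a small model. The sketch below is a userspace illustration, not the kernel's implementation: `struct demo_rq` and the function names are hypothetical, and it assumes the two inputs are a decayed historical load and an instantaneous load derived from the number of runnable tasks, as described above.

```c
#include <assert.h>

#define DEMO_LOAD_SCALE 128UL  /* mirrors the SCHED_LOAD_SCALE value quoted above */

struct demo_rq {
	unsigned long cpu_load;    /* decayed historical load */
	unsigned long nr_running;  /* tasks runnable right now */
};

/* Conservative view of a migration source: the smaller estimate. */
static unsigned long demo_source_load(const struct demo_rq *rq)
{
	unsigned long now;

	assert(rq);
	now = rq->nr_running * DEMO_LOAD_SCALE;
	return rq->cpu_load < now ? rq->cpu_load : now;
}

/* Conservative view of a migration target: the larger estimate. */
static unsigned long demo_target_load(const struct demo_rq *rq)
{
	unsigned long now;

	assert(rq);
	now = rq->nr_running * DEMO_LOAD_SCALE;
	return rq->cpu_load > now ? rq->cpu_load : now;
}
```

A CPU whose history says 300 but which currently runs a single task looks like 128 when considered as a source and like 300 when considered as a target, so a brief dip or spike alone does not trigger a migration.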

The balancing formula in detail

min_load * (100 + (imbalance_pct - 100) / 2) < this_load * 100

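Ordinary task migration tolerates an imbalance of 25% (with imbalance_pct = 125)
At exec time the task starts a fresh context with little cache state worth preserving, so balancing can be more aggressive and the tolerance is halved to 12%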
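
A worked example may help with the integer arithmetic. The helper below restates the comparison from find_idlest_cpu as a standalone function; the function name is hypothetical and the numbers in the note that follows assume imbalance_pct = 125.

```c
#include <assert.h>

/*
 * Exec-time decision: migrate only if the idlest CPU, scaled by half of
 * the domain's imbalance margin, is still lighter than the current CPU.
 */
static int exec_should_migrate(unsigned long min_load,
			       unsigned long this_load,
			       unsigned int imbalance_pct)
{
	assert(imbalance_pct >= 100);
	return min_load * (100 + (imbalance_pct - 100) / 2) < this_load * 100;
}
```

With imbalance_pct = 125 the multiplier is 112. For this_load = 256, the right-hand side is 25600, so the task moves when min_load is at most 228 (228 * 112 = 25536) and stays when min_load is 229 or more (229 * 112 = 25648).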

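Function summary

Main purpose: choose the most suitable CPU within the scheduling domain to run the new program

Search scope:
- only CPUs inside the scheduling domain are considered
- the task's CPU affinity is respected
Finding the idlest CPU:
- iterate over all candidate CPUs
- pick the one with the smallest load
- optimization: stop early when a completely idle CPU is found
Migration decision:
- estimate the load of the current CPU with the new task included
- apply the adjusted threshold to avoid unnecessary migrations
- migrate only when the imbalance is significant

Migrating a task to a specified target CPU: `sched_migrate_task`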

static void sched_migrate_task(task_t *p, int dest_cpu)
{
	migration_req_t req;
	runqueue_t *rq;
	unsigned long flags;

	rq = task_rq_lock(p, &flags);
	if (!cpu_isset(dest_cpu, p->cpus_allowed)
	    || unlikely(cpu_is_offline(dest_cpu)))
		goto out;

	schedstat_inc(rq, smt_cnt);
	/* force the process onto the specified CPU */
	if (migrate_task(p, dest_cpu, &req)) {
		/* Need to wait for migration thread (might exit: take ref). */
		struct task_struct *mt = rq->migration_thread;
		get_task_struct(mt);
		task_rq_unlock(rq, &flags);
		wake_up_process(mt);
		put_task_struct(mt);
		wait_for_completion(&req.done);
		return;
	}
out:
	task_rq_unlock(rq, &flags);
}

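Function overview

sched_migrate_task moves a task to the specified target CPU and handles the boundary conditions and synchronization involved

Code walkthrough

Variable declarations and lock acquisition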

migration_req_t req;
runqueue_t *rq;
unsigned long flags;

rq = task_rq_lock(p, &flags);

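migration_req_t req: the migration request, used to synchronize the caller with the migration thread
runqueue_t *rq: pointer to the runqueue the task is currently on
unsigned long flags: saves the interrupt state
rq = task_rq_lock(p, &flags): locks the task's current runqueue and disables interrupts
- this is the key synchronization step; it prevents other CPUs from changing the task's state during the migration

Validating the target CPU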

if (!cpu_isset(dest_cpu, p->cpus_allowed)
    || unlikely(cpu_is_offline(dest_cpu)))
    goto out;

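!cpu_isset(dest_cpu, p->cpus_allowed): checks whether the target CPU is missing from the task's allowed set
unlikely(cpu_is_offline(dest_cpu)): checks whether the target CPU is offline
- unlikely tells the compiler that the condition rarely holds, so it can favor the common path
If either condition holds, jump to out, unlock, and return

Migration statistics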

schedstat_inc(rq, smt_cnt);

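schedstat_inc(rq, smt_cnt): increments a scheduler statistics counter
smt_cnt counts calls to sched_migrate_task (the name abbreviates the function and is unrelated to simultaneous multithreading)
It can be used to monitor how often exec-time migrations are attempted

Attempting the migration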

/* force the process onto the specified CPU */
if (migrate_task(p, dest_cpu, &req)) {

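Force the process onto the specified CPU
migrate_task(p, dest_cpu, &req): attempts the migration
- a true return value means the move must be completed asynchronously by the migration thread (the task is on a runqueue or currently running)
- a false return value means the move completed immediately, because only the task's cpu field had to be updated

Handling asynchronous migration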

    /* Need to wait for migration thread (might exit: take ref). */
    struct task_struct *mt = rq->migration_thread;
    get_task_struct(mt);
    task_rq_unlock(rq, &flags);
    wake_up_process(mt);
    put_task_struct(mt);
    wait_for_completion(&req.done);
    return;

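struct task_struct *mt = rq->migration_thread: gets the migration thread of the task's runqueue
get_task_struct(mt): takes a reference on the migration thread so that it cannot be freed if it exits in the meantime
task_rq_unlock(rq, &flags): releases the runqueue lock so other work can proceed
wake_up_process(mt): wakes the migration thread to do the actual move
put_task_struct(mt): drops the reference
wait_for_completion(&req.done): waits for the migration to finish
- the migration thread calls complete(&req->done) when it is done
return: returns directly; the lock has already been released, so the unlock at out must be skipped

Fast exit path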

out:
task_rq_unlock(rq, &flags);

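out label: reached when the target CPU is invalid or when the migration completed synchronously
task_rq_unlock(rq, &flags): releases the runqueue lock and restores the interrupt state

Key mechanisms

Why is asynchronous migration needed?

Case 1: the task is running

CPU A: task P is executing and calls sched_migrate_task(P, CPU_B), for example from sched_exec
      → P cannot take itself off CPU A while it is running there
      → a request is queued and the migration thread on CPU A performs the move

Case 2: the task can be moved immediately

CPU A: task P is asleep (not on a runqueue and not running)
CPU B: a caller requests that P be moved to CPU B
      → only P's cpu field needs updating
      → the move completes synchronously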

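Function summary

Main purpose: safely move a task to a specified target CPU, covering both the synchronous and the asynchronous case

Safety checks:
- validate the target CPU
- check the CPU affinity constraint
- make sure the target CPU is online
Migration strategy:
- immediate migration (the task is neither queued nor running)
- asynchronous migration (the task is queued or running)

Deciding whether a task can be migrated immediately: `migrate_task`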

static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req)
{
	runqueue_t *rq = task_rq(p);

	/*
	 * If the task is not on a runqueue (and not running), then
	 * it is sufficient to simply update the task's cpu field.
	 */
	if (!p->array && !task_running(rq, p)) {
		set_task_cpu(p, dest_cpu);
		return 0;
	}

	init_completion(&req->done);
	req->type = REQ_MOVE_TASK;
	req->task = p;
	req->dest_cpu = dest_cpu;
	list_add(&req->list, &rq->migration_queue);
	return 1;
}

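Function overview

migrate_task is the decision point of task migration: it either completes the move on the spot or queues a request for asynchronous migration

Code walkthrough

Getting the runqueue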

runqueue_t *rq = task_rq(p);

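runqueue_t *rq = task_rq(p): gets the runqueue the task is currently on
task_rq(p) is a macro that returns a pointer to the runqueue of the CPU that task p belongs to
That runqueue holds the information about all runnable tasks on that CPU

Checking whether immediate migration is possible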

/*
 * If the task is not on a runqueue (and not running), then
 * it is sufficient to simply update the task's cpu field.
 */
if (!p->array && !task_running(rq, p)) {
    set_task_cpu(p, dest_cpu);
    return 0;
}

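If the task is not on a runqueue (and is not running), updating the task's cpu field is sufficient

!p->array: checks that the task is not in any runqueue's priority array
- p->array points to the priority array (active or expired) the task is queued on
- if it is NULL, the task is not runnable (it may be sleeping, stopped, and so on)
!task_running(rq, p): checks that the task is not currently executing
- task_running(rq, p) tests whether the task is running on a CPU

Performing the immediate migration

set_task_cpu(p, dest_cpu): sets the task's cpu field to the target CPU
return 0: the migration completed immediately and no asynchronous work is needed

Initializing the migration request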

init_completion(&req->done);

init_completion(&req->done): initializes the completion
req->done is a struct completion used to signal that the migration has finished
The migration thread calls complete(&req->done) to notify the waiter

Filling in the request

req->type = REQ_MOVE_TASK;
req->task = p;
req->dest_cpu = dest_cpu;

req->type = REQ_MOVE_TASK: sets the request type to "move task"
req->task = p: the task to migrate
req->dest_cpu = dest_cpu: the target CPU number

Queuing the request

list_add(&req->list, &rq->migration_queue);

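list_add(&req->list, &rq->migration_queue): adds the request to the runqueue's migration queue
&req->list: the list node embedded in the request
&rq->migration_queue: the head of the runqueue's list of migration requests
The migration thread takes requests off this queue and processes them

Returning the asynchronous-migration indicator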

return 1;

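return 1: asynchronous migration is required
The caller must wait for the migration thread to do the actual work

Function summary

Main purpose: choose the migration strategy from the task's current state, supporting both immediate and asynchronous migration

Immediate migration:
- the task is not on a runqueue (!p->array)
- the task is not currently executing (!task_running(rq, p))
- updating the cpu field is enough
Asynchronous migration:
- the task is running or queued to run
- the migration thread has to step in
- completion is signaled through a struct completion

Return value:

0: the migration completed immediately; the caller does not wait
1: asynchronous migration is required; the caller must wake the migration thread and wait for it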
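
The two return paths can be summarized with a small single-threaded model. Everything here is a simplification for illustration: `struct demo_task`, `struct demo_req`, and `demo_migrate_task` are hypothetical names, the two integer fields stand in for `p->array` and `task_running()`, and a hand-rolled singly linked list replaces the kernel's `list_head` and completion.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
	int cpu;
	int queued;    /* stands in for p->array != NULL */
	int running;   /* stands in for task_running(rq, p) */
};

struct demo_req {
	struct demo_task *task;
	int dest_cpu;
	struct demo_req *next;
};

/* Returns 0 if the move completed on the spot, 1 if a request was queued. */
static int demo_migrate_task(struct demo_task *p, int dest_cpu,
			     struct demo_req *req, struct demo_req **queue)
{
	assert(p && req && queue);
	if (!p->queued && !p->running) {
		p->cpu = dest_cpu;   /* no runqueue holds the task */
		return 0;
	}
	req->task = p;
	req->dest_cpu = dest_cpu;
	req->next = *queue;          /* push at the head of the pending list */
	*queue = req;
	return 1;
}
```

In the queued case the task's cpu field is left untouched; changing it is the job of whoever services the request, which in the kernel is the migration thread.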

The migration thread: `migration_thread`

static int migration_thread(void * data)
{
	runqueue_t *rq;
	int cpu = (long)data;

	rq = cpu_rq(cpu);
	BUG_ON(rq->migration_thread != current);

	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		struct list_head *head;
		migration_req_t *req;

		if (current->flags & PF_FREEZE)
			refrigerator(PF_FREEZE);

		spin_lock_irq(&rq->lock);

		if (cpu_is_offline(cpu)) {
			spin_unlock_irq(&rq->lock);
			goto wait_to_die;
		}

		if (rq->active_balance) {
			active_load_balance(rq, cpu);
			rq->active_balance = 0;
		}

		head = &rq->migration_queue;

		if (list_empty(head)) {
			spin_unlock_irq(&rq->lock);
			schedule();
			set_current_state(TASK_INTERRUPTIBLE);
			continue;
		}
		req = list_entry(head->next, migration_req_t, list);
		list_del_init(head->next);

		if (req->type == REQ_MOVE_TASK) {
			spin_unlock(&rq->lock);
			__migrate_task(req->task, smp_processor_id(),
					req->dest_cpu);
			local_irq_enable();
		} else if (req->type == REQ_SET_DOMAIN) {
			rq->sd = req->sd;
			spin_unlock_irq(&rq->lock);
		} else {
			spin_unlock_irq(&rq->lock);
			WARN_ON(1);
		}

		complete(&req->done);
	}
	__set_current_state(TASK_RUNNING);
	return 0;

wait_to_die:
	/* Wait for kthread_stop */
	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

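Function overview

migration_thread is the main function of the per-CPU migration thread, which services task-migration requests and active load balancing

Code walkthrough

Thread setup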

runqueue_t *rq;
int cpu = (long)data;

rq = cpu_rq(cpu);
BUG_ON(rq->migration_thread != current);

set_current_state(TASK_INTERRUPTIBLE);

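runqueue_t *rq: runqueue pointer
int cpu = (long)data: the CPU number, passed as the thread argument (the CPU this thread is bound to)
rq = cpu_rq(cpu): gets that CPU's runqueue
BUG_ON(rq->migration_thread != current): verifies that the current thread really is this CPU's migration thread
set_current_state(TASK_INTERRUPTIBLE): marks the thread as interruptible-sleeping; it does not actually sleep until schedule() is called

Start of the main loop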

while (!kthread_should_stop()) {
    struct list_head *head;
    migration_req_t *req;

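while (!kthread_should_stop()): the main loop, which runs until kthread_stop() asks the thread to stop
head: pointer to the head of the migration queue
req: pointer to a migration request

Freeze check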

if (current->flags & PF_FREEZE)
    refrigerator(PF_FREEZE);

Checks whether the current thread has the freeze flag set
refrigerator(PF_FREEZE): if the system is suspending, enter the frozen state

Locking and the CPU-offline check

spin_lock_irq(&rq->lock);

if (cpu_is_offline(cpu)) {
    spin_unlock_irq(&rq->lock);
    goto wait_to_die;
}

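spin_lock_irq(&rq->lock): takes the runqueue lock and disables interrupts
cpu_is_offline(cpu): checks whether the bound CPU has gone offline
If it has, release the lock and jump to the wait-to-exit path

Active load balancing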

if (rq->active_balance) {
    active_load_balance(rq, cpu);
    rq->active_balance = 0;
}

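rq->active_balance: checks whether active load balancing has been requested
active_load_balance(rq, cpu): performs the active load balancing
rq->active_balance = 0: clears the request flag

Checking the migration queue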

head = &rq->migration_queue;

if (list_empty(head)) {
    spin_unlock_irq(&rq->lock);
    schedule();
    set_current_state(TASK_INTERRUPTIBLE);
    continue;
}

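head = &rq->migration_queue: gets the head of the migration queue
list_empty(head): checks whether the queue is empty
If the queue is empty:
- release the lock
- schedule(): give up the CPU and wait to be woken
- set the state to TASK_INTERRUPTIBLE again
- continue: go back to the top of the loop

Taking a request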

req = list_entry(head->next, migration_req_t, list);
list_del_init(head->next);

req = list_entry(head->next, migration_req_t, list): gets the first request on the list
list_del_init(head->next): removes that request from the queue

Handling the different request types

if (req->type == REQ_MOVE_TASK) {
    spin_unlock(&rq->lock);
    __migrate_task(req->task, smp_processor_id(), req->dest_cpu);
    local_irq_enable();
} else if (req->type == REQ_SET_DOMAIN) {
    rq->sd = req->sd;
    spin_unlock_irq(&rq->lock);
} else {
    spin_unlock_irq(&rq->lock);
    WARN_ON(1);
}

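Task-migration request (REQ_MOVE_TASK):

spin_unlock(&rq->lock): releases the runqueue lock but leaves interrupts disabled
__migrate_task(req->task, smp_processor_id(), req->dest_cpu): performs the actual migration
- smp_processor_id(): the current CPU, which is the CPU the migration thread runs on
local_irq_enable(): re-enables local interrupts

Set-domain request (REQ_SET_DOMAIN):

rq->sd = req->sd: updates the runqueue's scheduling-domain pointer
spin_unlock_irq(&rq->lock): releases the lock and enables interrupts

Unknown request type:

WARN_ON(1): prints a warning

Notifying completion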

complete(&req->done);

complete(&req->done): tells the waiter that the request has been processed

Normal exit path

__set_current_state(TASK_RUNNING);
return 0;

After the loop ends, the thread sets its state to TASK_RUNNING and returns

Wait-to-exit path for an offline CPU

wait_to_die:
/* Wait for kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);
while (!kthread_should_stop()) {
    schedule();
    set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);
return 0;

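wait_to_die label: the dedicated exit path used when the CPU has gone offline
The thread sleeps interruptibly until it is asked to stop
It then sets its state to TASK_RUNNING and returns

Key mechanisms

Request types

REQ_MOVE_TASK (task migration):

Requires an actual migration
Lock and interrupt state must be handled carefully during the move

REQ_SET_DOMAIN (set scheduling domain):

Updates the scheduling topology information
A comparatively simple operation

Function summary

Main purpose: a dedicated per-CPU thread that handles asynchronous task migration and active load balancing

Task migration:
- takes requests off the migration queue
- performs the actual migration
- signals completion of the request
Active load balancing:
- detects a pending request and runs active_load_balance
- pushes tasks away from an overloaded CPU
Scheduling-domain management:
- responds to domain-update requests
- keeps the CPU's scheduling topology current

Moving a task from the source CPU to the target CPU: `__migrate_task`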

static void __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
{
	runqueue_t *rq_dest, *rq_src;

	if (unlikely(cpu_is_offline(dest_cpu)))
		return;

	rq_src = cpu_rq(src_cpu);
	rq_dest = cpu_rq(dest_cpu);

	double_rq_lock(rq_src, rq_dest);
	/* Already moved. */
	if (task_cpu(p) != src_cpu)
		goto out;
	/* Affinity changed (again). */
	if (!cpu_isset(dest_cpu, p->cpus_allowed))
		goto out;

	set_task_cpu(p, dest_cpu);
	if (p->array) {
		/*
		 * Sync timestamp with rq_dest's before activating.
		 * The same thing could be achieved by doing this step
		 * afterwards, and pretending it was a local activate.
		 * This way is cleaner and logically correct.
		 */
		p->timestamp = p->timestamp - rq_src->timestamp_last_tick
				+ rq_dest->timestamp_last_tick;
		deactivate_task(p, rq_src);
		activate_task(p, rq_dest, 0);
		if (TASK_PREEMPTS_CURR(p, rq_dest))
			resched_task(rq_dest->curr);
	}

out:
	double_rq_unlock(rq_src, rq_dest);
}

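Function overview

__migrate_task performs the actual migration: it moves the task from the source CPU to the target CPU and makes all required updates to task state and scheduler data structures

Code walkthrough

Target-CPU offline check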

if (unlikely(cpu_is_offline(dest_cpu)))
    return;

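unlikely(cpu_is_offline(dest_cpu)): checks whether the target CPU is offline
The unlikely macro tells the compiler that the condition rarely holds
If the target CPU is offline, return without migrating

Getting the runqueues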

rq_src = cpu_rq(src_cpu);
rq_dest = cpu_rq(dest_cpu);

rq_src = cpu_rq(src_cpu): the source CPU's runqueue
rq_dest = cpu_rq(dest_cpu): the target CPU's runqueue
A runqueue holds a CPU's scheduling state and its lists of runnable tasks

Taking both locks

double_rq_lock(rq_src, rq_dest);

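double_rq_lock(rq_src, rq_dest): locks both runqueues
This prevents other CPUs from changing the task's state during the migration
The locks are always taken in a fixed order (in this kernel, by comparing the runqueue addresses), which avoids deadlock

Verifying the task's location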

/* Already moved. */
if (task_cpu(p) != src_cpu)
    goto out;

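The task may already have been moved
task_cpu(p) != src_cpu: checks whether the task still belongs to the source CPU
If not, another migration has already moved it, so jump to out

Verifying CPU affinity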

/* Affinity changed (again). */
if (!cpu_isset(dest_cpu, p->cpus_allowed))
    goto out;

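The affinity may have changed
!cpu_isset(dest_cpu, p->cpus_allowed): checks whether the target CPU is missing from the task's allowed set
If it is, the task's affinity was changed while the request was pending, so jump to out

Updating the task's CPU field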

set_task_cpu(p, dest_cpu);

set_task_cpu(p, dest_cpu): sets the task's cpu field to the target CPU
This is the first step of the migration

Handling a queued (runnable) task

if (p->array) {
    p->timestamp = p->timestamp - rq_src->timestamp_last_tick
            + rq_dest->timestamp_last_tick;

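if (p->array): checks whether the task is on a runqueue (runnable)
p->timestamp = p->timestamp - rq_src->timestamp_last_tick + rq_dest->timestamp_last_tick:
- rebases the task's timestamp so that it is relative to the target runqueue's clock
- keeps the task's time-based scheduling bookkeeping consistent on the target CPU

Deactivating and activating the task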

    deactivate_task(p, rq_src);
    activate_task(p, rq_dest, 0);

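deactivate_task(p, rq_src): removes the task from the source runqueue
- deletes the task from the runqueue's priority array
- updates the runqueue's task count
activate_task(p, rq_dest, 0): adds the task to the target runqueue
- the 0 argument marks this as a non-local activation, that is, the task is being activated on a runqueue other than that of the executing CPU
- the task is inserted at the position given by its priority

Preemption check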

    if (TASK_PREEMPTS_CURR(p, rq_dest))
        resched_task(rq_dest->curr);

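TASK_PREEMPTS_CURR(p, rq_dest): checks whether the migrated task should preempt the task currently running on the target CPU
The test is based on task priority
resched_task(rq_dest->curr): if so, sets the reschedule flag on the target CPU's current task

Unlocking and exit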

out:
double_rq_unlock(rq_src, rq_dest);

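out label: the common exit point
double_rq_unlock(rq_src, rq_dest): releases both runqueue locks

Key mechanisms

The double-lock mechanism

Deadlock avoidance: the two locks are always taken in a fixed order (in this kernel, lower runqueue address first), so two CPUs can never each hold one lock while waiting for the other
Same-runqueue case: if both arguments are the same runqueue, the lock is taken only once, since taking the same spinlock twice would deadlock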
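
The ordering rule can be illustrated without real spinlocks. The sketch below is a userspace model with hypothetical names: `struct demo_lock` only records that it is held, and the function reports which lock was taken first. It assumes, as in the 2.6-era sched.c, that the order is decided by comparing the addresses of the two runqueues.

```c
#include <assert.h>
#include <stdint.h>

struct demo_lock { int held; };

/* "Acquire" two locks in address order; returns how many were taken. */
static int demo_double_lock(struct demo_lock *a, struct demo_lock *b,
			    struct demo_lock **first)
{
	assert(a && b && first);
	if (a == b) {                 /* same runqueue: lock it once */
		a->held = 1;
		*first = a;
		return 1;
	}
	/* Lower address first, so every caller agrees on the order. */
	*first = ((uintptr_t)a < (uintptr_t)b) ? a : b;
	(*first)->held = 1;
	(*first == a ? b : a)->held = 1;
	return 2;
}
```

Whichever way the arguments are passed, the same lock comes first, which is exactly the property that rules out the classic AB/BA deadlock.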

Why the timestamp is synchronized

p->timestamp = p->timestamp - rq_src->timestamp_last_tick + rq_dest->timestamp_last_tick;

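Why is synchronization needed?

Each runqueue has its own timestamp base
Scheduling decisions depend on relative time
Without the adjustment, the task's sleep and run times would be miscalculated on the target CPU and it could be treated unfairly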
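
The rebasing is plain offset arithmetic, shown here as a standalone sketch. `struct demo_task` and the function name are hypothetical, and the tick values in the note below are made-up numbers chosen only to show the offset being preserved.

```c
#include <assert.h>

struct demo_task { unsigned long long timestamp; };

/* Re-express p->timestamp, taken against the source clock, in the destination clock. */
static void demo_rebase_timestamp(struct demo_task *p,
				  unsigned long long src_last_tick,
				  unsigned long long dst_last_tick)
{
	assert(p);
	/* Unsigned arithmetic wraps, so the offset survives timestamp < src_last_tick. */
	p->timestamp = p->timestamp - src_last_tick + dst_last_tick;
}
```

A task stamped at 1000 on a source whose last tick was 1200 is 200 units behind that tick; after rebasing against a destination tick of 5000 its timestamp is 4800, still 200 units behind.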

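Function summary

Main purpose: safely move a task from one CPU to another while keeping the scheduler's data structures consistent

Safety checks:
- the target CPU is online
- the task has not already been moved
- the CPU affinity still permits the move
State transfer:
- update the task's cpu field
- rebase the timestamp
- runqueue operations (deactivate, then activate)
Scheduling decision:
- preemption check
- reschedule trigger

Freezing the current task: `refrigerator`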

void refrigerator(unsigned long flag)
{
	/* Hmm, should we be allowed to suspend when there are realtime
	   processes around? */
	long save;
	save = current->state;
	current->state = TASK_UNINTERRUPTIBLE;
	pr_debug("%s entered refrigerator\n", current->comm);
	printk("=");
	current->flags &= ~PF_FREEZE;

	spin_lock_irq(&current->sighand->siglock);
	recalc_sigpending(); /* We sent fake signal, clean it up */
	spin_unlock_irq(&current->sighand->siglock);

	current->flags |= PF_FROZEN;
	while (current->flags & PF_FROZEN)
		schedule();
	pr_debug("%s left refrigerator\n", current->comm);
	current->state = save;
}

Function overview

refrigerator "freezes" the current task by putting it into a special sleep, and is used mainly for system suspend and hibernate

Code walkthrough

Variable declaration

long save;

long save: holds the task's state so it can be restored later

Saving the state and switching to uninterruptible

save = current->state;
current->state = TASK_UNINTERRUPTIBLE;

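save = current->state: saves the task's current state (running, sleeping, and so on)
current->state = TASK_UNINTERRUPTIBLE: sets the state to uninterruptible sleep
- ensures that signals cannot wake the task
- a precondition for staying frozen

Debug output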

pr_debug("%s entered refrigerator\n", current->comm);
printk("=");

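pr_debug("%s entered refrigerator\n", current->comm): prints a debug message naming the task that entered the refrigerator
printk("="): prints an equals sign to the kernel log as a visual progress indicator

Clearing the freeze request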

current->flags &= ~PF_FREEZE;

current->flags &= ~PF_FREEZE: clears the task's PF_FREEZE flag
PF_FREEZE means the task has been asked to freeze; the request is now being handled, so the flag is cleared

Cleaning up signal state

spin_lock_irq(&current->sighand->siglock);
recalc_sigpending(); /* We sent fake signal, clean it up */
spin_unlock_irq(&current->sighand->siglock);

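spin_lock_irq(&current->sighand->siglock): takes the signal-handling lock and disables interrupts
recalc_sigpending(): recomputes the pending-signal state
- a fake signal may have been sent to get the task here, so the signal state must be cleaned up
spin_unlock_irq(&current->sighand->siglock): releases the lock and enables interrupts

Setting the frozen flag and waiting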

current->flags |= PF_FROZEN;
while (current->flags & PF_FROZEN)
    schedule();

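current->flags |= PF_FROZEN: sets PF_FROZEN to mark the task as frozen
while (current->flags & PF_FROZEN) schedule(): calls schedule() in a loop until PF_FROZEN is cleared
- this is the core wait; the task "sleeps" here
- the flag is cleared by other code only when the system resumes

Leaving the refrigerator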

pr_debug("%s left refrigerator\n", current->comm);
current->state = save;

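pr_debug("%s left refrigerator\n", current->comm): prints a debug message when the task leaves the refrigerator
current->state = save: restores the task's original state
- if it was running, it is running again
- if it was sleeping, it remains asleep

Key mechanisms

Task state flags

// PF_FREEZE: the task has been asked to freeze (request flag)
// PF_FROZEN: the task is frozen (state flag)

// State transitions:
// before entry: PF_FREEZE = 1, PF_FROZEN = 0
// after entry:  PF_FREEZE = 0, PF_FROZEN = 1
// after exit:   PF_FREEZE = 0, PF_FROZEN = 0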
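
The three states can be walked through with a tiny flag model. This is an illustrative sketch only: the `DEMO_PF_*` bit values are made up and do not match the kernel's PF_* constants, and the two functions are hypothetical stand-ins for what refrigerator() does on entry and what the resume path does to release the task.

```c
#include <assert.h>

/* Illustrative bit values; the kernel's actual PF_* constants differ. */
#define DEMO_PF_FREEZE 0x1u
#define DEMO_PF_FROZEN 0x2u

/* What refrigerator() does to the flags on entry. */
static unsigned int demo_enter_fridge(unsigned int flags)
{
	assert(flags & DEMO_PF_FREEZE);   /* only reached after a freeze request */
	flags &= ~DEMO_PF_FREEZE;         /* request acknowledged */
	flags |= DEMO_PF_FROZEN;          /* now waiting in the schedule() loop */
	return flags;
}

/* What the resume path does to let the task leave the loop. */
static unsigned int demo_thaw(unsigned int flags)
{
	return flags & ~DEMO_PF_FROZEN;
}
```

Keeping the request and the state in separate bits lets the code that freezes tasks tell apart a task that has merely been asked to freeze from one that has actually stopped.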

Why TASK_UNINTERRUPTIBLE?

current->state = TASK_UNINTERRUPTIBLE;

// Comparison:
// TASK_INTERRUPTIBLE: can be woken by a signal, so unsuitable for suspend
// TASK_UNINTERRUPTIBLE: cannot be woken by a signal, so the task stays frozen

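Why the signal cleanup is needed

Before a task freezes, the system may send it a fake signal to wake it up so that it enters the refrigerator. That signal state has to be cleaned up so that the task is not later woken by a signal that does not really exist

Function summary

Main purpose: put the current task into the frozen state for system suspend and hibernate.

State management:
- save and restore the task state
- set the appropriate task flags
Safe entry:
- make sure the task cannot be woken by accident
- clean up signal state that could interfere
Waiting:
- yield the CPU in a loop while frozen
- wait for the resume path to clear the flag
Debug support:
- debug messages on entry and exit
- a visual progress indicator

Pushing tasks from a busy CPU to idle CPUs: `active_load_balance`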

static void active_load_balance(runqueue_t *busiest_rq, int busiest_cpu)
{
	struct sched_domain *sd;
	struct sched_group *cpu_group;
	cpumask_t visited_cpus;

	schedstat_inc(busiest_rq, alb_cnt);
	/*
	 * Search for suitable CPUs to push tasks to in successively higher
	 * domains with SD_LOAD_BALANCE set.
	 */
	visited_cpus = CPU_MASK_NONE;
	for_each_domain(busiest_cpu, sd) {
		if (!(sd->flags & SD_LOAD_BALANCE) || busiest_rq->nr_running <= 1)
			break; /* no more domains to search or no more tasks to move */

		cpu_group = sd->groups;
		do { /* sched_groups should either use list_heads or be merged into the domains structure */
			int cpu, target_cpu = -1;
			runqueue_t *target_rq;

			for_each_cpu_mask(cpu, cpu_group->cpumask) {
				if (cpu_isset(cpu, visited_cpus) || cpu == busiest_cpu ||
				    !cpu_and_siblings_are_idle(cpu)) {
					cpu_set(cpu, visited_cpus);
					continue;
				}
				target_cpu = cpu;
				break;
			}
			if (target_cpu == -1)
				goto next_group; /* failed to find a suitable target cpu in this domain */

			target_rq = cpu_rq(target_cpu);

			/*
			 * This condition is "impossible", if it occurs we need to fix it
			 * Reported by Bjorn Helgaas on a 128-cpu setup.
			 */
			BUG_ON(busiest_rq == target_rq);

			/* move a task from busiest_rq to target_rq */
			double_lock_balance(busiest_rq, target_rq);
			if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE)) {
				schedstat_inc(busiest_rq, alb_lost);
				schedstat_inc(target_rq, alb_gained);
			} else {
				schedstat_inc(busiest_rq, alb_failed);
			}
			spin_unlock(&target_rq->lock);
next_group:
			cpu_group = cpu_group->next;
		} while (cpu_group != sd->groups && busiest_rq->nr_running > 1);
	}
}

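Function overview

active_load_balance implements active load balancing: once an imbalance has been detected, it pushes tasks from the busy CPU to idle CPUs

Code walkthrough

Variable declarations and initialization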

struct sched_domain *sd;
struct sched_group *cpu_group;
cpumask_t visited_cpus;

schedstat_inc(busiest_rq, alb_cnt);

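struct sched_domain *sd: scheduling-domain cursor used to walk the CPU topology levels
struct sched_group *cpu_group: CPU-group cursor for the groups within a domain
cpumask_t visited_cpus: CPU mask recording CPUs that have already been examined, to avoid repeated work
schedstat_inc(busiest_rq, alb_cnt): increments the active-load-balance counter

Loop setup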

visited_cpus = CPU_MASK_NONE;
for_each_domain(busiest_cpu, sd) {

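Search for suitable CPUs to push tasks to, in successively higher domains that have SD_LOAD_BALANCE set
visited_cpus = CPU_MASK_NONE: starts with an empty visited mask
for_each_domain(busiest_cpu, sd): walks all domain levels, starting from the busy CPU's lowest domain

Domain check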

    if (!(sd->flags & SD_LOAD_BALANCE) || busiest_rq->nr_running <= 1)
        break; /* no more domains to search or no more tasks to move */

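!(sd->flags & SD_LOAD_BALANCE): checks whether the domain has load balancing disabled
busiest_rq->nr_running <= 1: checks whether the busy runqueue has at most one runnable task
If either holds, leave the loop (no more domains to search or no more tasks to move)

Start of the CPU-group loop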

    cpu_group = sd->groups;
    do {
        int cpu, target_cpu = -1;
        runqueue_t *target_rq;

cpu_group = sd->groups: the first CPU group of the domain
target_cpu = -1: initializes the target CPU to an invalid value
target_rq: pointer to the target runqueue

Finding a target CPU in the group

        for_each_cpu_mask(cpu, cpu_group->cpumask) {
            if (cpu_isset(cpu, visited_cpus) || cpu == busiest_cpu ||
                !cpu_and_siblings_are_idle(cpu)) {
                cpu_set(cpu, visited_cpus);
                continue;
            }
            target_cpu = cpu;
            break;
        }

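for_each_cpu_mask(cpu, cpu_group->cpumask): iterates over all CPUs in the group
A CPU is skipped if:
- cpu_isset(cpu, visited_cpus): it has already been visited
- cpu == busiest_cpu: it is the busy CPU itself
- !cpu_and_siblings_are_idle(cpu): the CPU or one of its SMT siblings is not idle
A skipped CPU is marked as visited and the search continues
Once a suitable CPU is found, target_cpu is set and the loop exits

Checking the result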

        if (target_cpu == -1)
            goto next_group; /* failed to find a suitable target cpu in this domain */

If no suitable target CPU was found, move on to the next CPU group

Getting and validating the target runqueue

        target_rq = cpu_rq(target_cpu);
        BUG_ON(busiest_rq == target_rq);

target_rq = cpu_rq(target_cpu): the target CPU's runqueue
BUG_ON(busiest_rq == target_rq): triggers a kernel BUG if the busy and target runqueues are the same

Performing the migration

        /* move a task from busiest_rq to target_rq */
        double_lock_balance(busiest_rq, target_rq);
        if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE)) {
            schedstat_inc(busiest_rq, alb_lost);
            schedstat_inc(target_rq, alb_gained);
        } else {
            schedstat_inc(busiest_rq, alb_failed);
        }
        spin_unlock(&target_rq->lock);

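double_lock_balance(busiest_rq, target_rq): takes the target runqueue's lock in addition to the busy runqueue's lock, which the caller already holds
move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE): tries to move a task
- arguments: target runqueue, target CPU, source runqueue, maximum number of tasks to move, scheduling domain, idle type of the target CPU
Statistics:
- success: increments alb_lost (the source lost a task) and alb_gained (the target gained a task)
- failure: increments alb_failed
spin_unlock(&target_rq->lock): releases only the target runqueue's lock; the busy runqueue stays locked for the caller

Moving to the next CPU group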

next_group:
        cpu_group = cpu_group->next;
    } while (cpu_group != sd->groups && busiest_rq->nr_running > 1);
}

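next_group label: the entry point for advancing to the next CPU group
cpu_group = cpu_group->next: moves to the next group in the circular list
Loop condition: continue while the walk has not returned to the first group and the busy runqueue still has more than one runnable task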
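
The do/while walk over a circular list has one subtle property: the body always runs for the first group, even if the stop condition is already true. The sketch below models that with hypothetical names; `budget` stands in for the "busy runqueue still has tasks to give away" half of the condition and is an assumption of this example, not a kernel variable.

```c
#include <assert.h>

struct demo_group {
	struct demo_group *next;   /* circular: the last group points to the first */
};

/* Visit groups starting at first; stop after a full lap or when budget runs out. */
static int demo_walk_groups(const struct demo_group *first, int budget)
{
	const struct demo_group *g = first;
	int visited = 0;

	assert(first);
	do {
		visited++;         /* "try to push one task to this group" */
		budget--;
		g = g->next;
	} while (g != first && budget > 0);
	return visited;
}
```

With three groups, a large budget visits all three and then stops on returning to the start, a budget of 2 stops after two groups, and a budget of 0 still visits the first group once.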

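Key mechanisms

Target-CPU selection

A suitable target CPU:

has not been visited yet
is not the busy CPU itself
is idle, and so are all of its SMT siblings
belongs to the group currently being examined

Statistics

alb_cnt: number of times active load balancing ran
alb_lost/alb_gained: number of successful pushes, counted on the source and target runqueues respectively
alb_failed: number of failed pushes

Function summary

Main purpose: perform active load balancing on a multiprocessor system by pushing tasks from a busy CPU to idle CPUs

Hierarchical search: looks for target CPUs from near to far through the domain hierarchy
Target selection: only CPUs that are completely idle, including their SMT siblings, are chosen
Conditional migration: a task is pushed only while the busy runqueue still has more than one runnable task
Statistics: tracks how often balancing runs and whether it succeeds

Pulling tasks from the busy runqueue to the current runqueue: `move_tasks`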

static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest,
		      unsigned long max_nr_move, struct sched_domain *sd,
		      enum idle_type idle)
{
	prio_array_t *array, *dst_array;
	struct list_head *head, *curr;
	int idx, pulled = 0;
	task_t *tmp;

	if (max_nr_move <= 0 || busiest->nr_running <= 1)
		goto out;

	/*
	 * We first consider expired tasks. Those will likely not be
	 * executed in the near future, and they are most likely to
	 * be cache-cold, thus switching CPUs has the least effect
	 * on them.
	 */
	if (busiest->expired->nr_active) {
		array = busiest->expired;
		dst_array = this_rq->expired;
	} else {
		array = busiest->active;
		dst_array = this_rq->active;
	}

new_array:
	/* Start searching at priority 0: */
	idx = 0;
skip_bitmap:
	if (!idx)
		idx = sched_find_first_bit(array->bitmap);
	else
		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
	if (idx >= MAX_PRIO) {
		if (array == busiest->expired && busiest->active->nr_active) {
			array = busiest->active;
			dst_array = this_rq->active;
			goto new_array;
		}
		goto out;
	}

	head = array->queue + idx;
	curr = head->prev;
skip_queue:
	tmp = list_entry(curr, task_t, run_list);

	curr = curr->prev;

	if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle)) {
		if (curr != head)
			goto skip_queue;
		idx++;
		goto skip_bitmap;
	}

	/*
	 * Right now, this is the only place pull_task() is called,
	 * so we can safely collect pull_task() stats here rather than
	 * inside pull_task().
	 */
	schedstat_inc(this_rq, pt_gained[idle]);
	schedstat_inc(busiest, pt_lost[idle]);

	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
	pulled++;

	/* We only want to steal up to the prescribed number of tasks. */
	if (pulled < max_nr_move) {
		if (curr != head)
			goto skip_queue;
		idx++;
		goto skip_bitmap;
	}
out:
	return pulled;
}

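Function overview

move_tasks is the workhorse of load balancing: it moves tasks from the busy runqueue to the current runqueue, choosing them in an order that minimizes disruption

Code walkthrough

Variable declarations and basic checks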

prio_array_t *array, *dst_array;
struct list_head *head, *curr;
int idx, pulled = 0;
task_t *tmp;

if (max_nr_move <= 0 || busiest->nr_running <= 1)
    goto out;

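prio_array_t *array, *dst_array: source and destination priority arrays
struct list_head *head, *curr: list head and current node
int idx, pulled = 0: priority index and count of tasks moved so far
task_t *tmp: temporary task pointer
Basic check: if the number to move is 0 or the busy runqueue has at most one runnable task, jump to the end

Choosing the source array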

if (busiest->expired->nr_active) {
    array = busiest->expired;
    dst_array = this_rq->expired;
} else {
    array = busiest->active;
    dst_array = this_rq->active;
}

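Expired tasks are considered first. They are unlikely to run in the near future and are most likely cache-cold, so switching CPUs affects them least
The expired array is preferred if it contains any tasks
Otherwise the active array is used

Label for starting a new array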

new_array:
/* Start searching at priority 0: */
idx = 0;

new_array label: the entry point for scanning a new array
idx = 0: start searching at priority 0

The bitmap search loop

skip_bitmap:
if (!idx)
    idx = sched_find_first_bit(array->bitmap);
else
    idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
if (idx >= MAX_PRIO) {
    if (array == busiest->expired && busiest->active->nr_active) {
        array = busiest->active;
        dst_array = this_rq->active;
        goto new_array;
    }
    goto out;
}

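skip_bitmap label: the point where the search resumes, either after switching to the active array or after moving past the current priority level
On the first search (idx = 0), find the first set bit, that is, the highest-priority non-empty queue
Otherwise find the next set bit at or after idx
If all priorities have been searched (idx ≥ MAX_PRIO):
- if the current array is the expired array and the active array has tasks, switch to the active array and start over
- otherwise jump to the end

Getting the task list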

head = array->queue + idx;
curr = head->prev;

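head = array->queue + idx: the head of the task list for the current priority
curr = head->prev: start at the tail of the list, so the most recently queued tasks are examined first

Migration check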

skip_queue:
tmp = list_entry(curr, task_t, run_list);

curr = curr->prev;

if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle)) {
    if (curr != head)
        goto skip_queue;
    idx++;
    goto skip_bitmap;
}

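skip_queue label: the point where the scan continues with the next task in the list
tmp = list_entry(curr, task_t, run_list): gets the task structure from the list node
curr = curr->prev: steps to the previous node
can_migrate_task checks whether the task may be migrated:
- if not, and the list has more tasks, continue with the current list
- if the list is exhausted, move on to the next priority

Moving the task and updating statistics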

schedstat_inc(this_rq, pt_gained[idle]);
schedstat_inc(busiest, pt_lost[idle]);

pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
pulled++;

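Statistics are updated: the target runqueue gains a task and the source loses one
pull_task performs the actual move
pulled++ counts the task as moved

Checking whether to continue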

/* We only want to steal up to the prescribed number of tasks. */
if (pulled < max_nr_move) {
    if (curr != head)
        goto skip_queue;
    idx++;
    goto skip_bitmap;
}

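If the maximum number of tasks has not been reached:
- continue with the current list if it has more tasks
- otherwise move on to the next priority

Returning the result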

out:
return pulled;

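out label: the end of the function
Returns the number of tasks actually moved

Key mechanisms

Array selection strategy

// Why expired tasks are moved first:
// 1. They will not run soon, so moving them costs little performance
// 2. They are likely cache-cold, so little cached state is lost

Task search order

By priority, from highest to lowest (bitmap search)
Within each priority, from newest to oldest (list tail to head)
Expired array first, then the active array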
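
The priority scan boils down to "find the next set bit at or after idx". The sketch below shows that operation with a simple linear loop. It is an illustration, not the kernel's implementation: the kernel uses optimized bit-search helpers, the `DEMO_*` names are hypothetical, and 140 is used as the number of priority levels to match MAX_PRIO in this kernel generation.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_MAX_PRIO 140
#define DEMO_BITS (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS ((DEMO_MAX_PRIO + DEMO_BITS - 1) / DEMO_BITS)

/* First set bit at index >= from, or DEMO_MAX_PRIO if there is none. */
static int demo_next_prio(const unsigned long *bitmap, int from)
{
	int idx;

	assert(bitmap && from >= 0);
	for (idx = from; idx < DEMO_MAX_PRIO; idx++)
		if (bitmap[idx / DEMO_BITS] & (1UL << (idx % DEMO_BITS)))
			return idx;   /* lower index means higher priority */
	return DEMO_MAX_PRIO;
}
```

After a priority level has been exhausted, move_tasks increments idx and searches again, which corresponds to calling this helper with `from` set to the previous result plus one.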

统计信息说明

pt_gained[idle]：根据空闲类型统计获得的任务数
pt_lost[idle]：根据空闲类型统计失去的任务数
idle参数影响迁移策略的激进程度

函数功能总结

主要功能：从繁忙运行队列中智能选择并迁移任务到当前运行队列

迁移优先级顺序：

复制代码

高优先级过期任务 → 低优先级过期任务 → 高优先级活动任务 → 低优先级活动任务

Deciding whether a task can be migrated from its current CPU to the target CPU: `can_migrate_task`

static inline
int can_migrate_task(task_t *p, runqueue_t *rq, int this_cpu,
		     struct sched_domain *sd, enum idle_type idle)
{
	/*
	 * We do not migrate tasks that are:
	 * 1) running (obviously), or
	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
	 * 3) are cache-hot on their current CPU.
	 */
	if (task_running(rq, p))
		return 0;
	if (!cpu_isset(this_cpu, p->cpus_allowed))
		return 0;

	/* Aggressive migration if we've failed balancing */
	if (idle == NEWLY_IDLE ||
			sd->nr_balance_failed < sd->cache_nice_tries) {
		if (task_hot(p, rq->timestamp_last_tick, sd))
			return 0;
	}

	return 1;
}

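Function overview

can_migrate_task decides whether a task may be migrated from its current CPU to the target CPU. It considers whether the task is running, its CPU affinity, and its cache hotness.

Walking through the code

Function declaration and comment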

static inline
int can_migrate_task(task_t *p, runqueue_t *rq, int this_cpu,
		     struct sched_domain *sd, enum idle_type idle)
{
	/*
	 * We do not migrate tasks that are:
	 * 1) running (obviously), or
	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
	 * 3) are cache-hot on their current CPU.
	 */

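static inline: an inline function, which avoids function-call overhead
Parameters:
- p: the task to check
- rq: the runqueue the task is currently on
- this_cpu: the target CPU number
- sd: the scheduling domain
- idle: the idle type of the target CPU
The comment states that the following tasks are not migrated:
1. Tasks that are running (obviously)
2. Tasks that cpus_allowed forbids from running on this CPU
3. Tasks that are cache-hot on their current CPU

Running-state check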

	if (task_running(rq, p))
		return 0;

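task_running(rq, p): checks whether the task is currently executing on a CPU
If it is, return 0 (cannot migrate)
Reason: a task that is executing cannot be pulled off its runqueue here; it has to be scheduled out first

CPU affinity check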

	if (!cpu_isset(this_cpu, p->cpus_allowed))
		return 0;

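!cpu_isset(this_cpu, p->cpus_allowed): checks whether the target CPU is in the task's set of allowed CPUs
If it is not, return 0 (cannot migrate)
Reason: the task's CPU affinity setting must be respected

Condition for the cache-hotness check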

	/* Aggressive migration if we've failed balancing */
	if (idle == NEWLY_IDLE ||
			sd->nr_balance_failed < sd->cache_nice_tries) {

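nr_balance_failed is incremented each time load balancing in this domain fails to move any task, and is reset after a successful move. cache_nice_tries is not a counter: it is a fixed per-domain threshold that says how many failed attempts are tolerated before cache affinity is given up. Once nr_balance_failed reaches cache_nice_tries, preserving cache affinity is considered too costly and the cache-hotness check is skipped.

idle == NEWLY_IDLE: the target CPU has just become idle; in this case the cache-hotness check is always performed
sd->nr_balance_failed < sd->cache_nice_tries: the domain's balancing failures are still below the cache-nice threshold
If either condition holds, the cache-hotness check is performed

Cache-hotness check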

		if (task_hot(p, rq->timestamp_last_tick, sd))
			return 0;
	}

task_hot(p, rq->timestamp_last_tick, sd): checks whether the task is cache-hot. The macro is defined as:
```
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran)	\
				< (long long) (sd)->cache_hot_time)
```
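- The task is cache-hot if the time elapsed since it last ran on a CPU is less than the domain's cache_hot_time
If the task is cache-hot, return 0 (cannot migrate)
Reason: migrating a cache-hot task discards its warm cache state and costs performance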
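
A small worked example may help with the direction of the comparison. The sketch below rewrites the macro as a function over stand-in structures; toy_task, toy_domain and toy_task_hot are illustrative names, and the nanosecond values used with it are made up rather than taken from a real system.

```c
#include <assert.h>

/* Stand-ins for the two kernel fields that task_hot reads (assumed ns). */
struct toy_task { unsigned long long last_ran; };
struct toy_domain { unsigned long long cache_hot_time; };

/* Same comparison as the task_hot macro, written as a function. */
static int toy_task_hot(const struct toy_task *p, unsigned long long now,
			const struct toy_domain *sd)
{
	assert(p && sd);
	/* Hot means: less than cache_hot_time has passed since the last run. */
	return (long long)(now - p->last_ran) < (long long)sd->cache_hot_time;
}
```

With cache_hot_time set to 2,500,000 ns and now at 10,000,000 ns, a task that last ran at 9,000,000 ns is hot (1 ms elapsed), while one that last ran at 1,000,000 ns is not (9 ms elapsed). The casts to long long make the comparison signed, which on typical two's-complement platforms means a last_ran slightly ahead of now also counts as hot.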

Allowing migration

	return 1;
}

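If every check passes, return 1 (can migrate)

Function summary

Main function: decide whether a task is suitable for migration, balancing load distribution against performance protection

Basic safety:
- Running state: tasks that are executing are not migrated
- CPU affinity: the task's placement restrictions are respected
Performance optimization:
- Cache hotness: cache-hot tasks are left in place to avoid a performance loss
- Conditional check: the hotness test is dropped once balancing has failed cache_nice_tries times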
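
The decision logic can be exercised outside the kernel with a reduced model. Everything here is an assumption made for illustration: mig_task, mig_domain and mig_can_migrate are invented names, cpus_allowed is a single unsigned long rather than a cpumask, and cache hotness is passed in as a precomputed flag instead of being derived from timestamps.

```c
#include <assert.h>
#include <limits.h>

enum mig_idle_type { MIG_IDLE, MIG_NOT_IDLE, MIG_NEWLY_IDLE };

struct mig_task {
	int running;                 /* currently executing on its CPU */
	unsigned long cpus_allowed;  /* bit n set: CPU n is permitted */
	int cache_hot;               /* precomputed result of the hotness test */
};

struct mig_domain {
	unsigned int nr_balance_failed;  /* failed balancing attempts so far */
	unsigned int cache_nice_tries;   /* fixed threshold, not a counter */
};

/* Mirrors the three checks of can_migrate_task in the same order. */
static int mig_can_migrate(const struct mig_task *p, int this_cpu,
			   const struct mig_domain *sd,
			   enum mig_idle_type idle)
{
	assert(this_cpu >= 0 &&
	       this_cpu < (int)(sizeof(unsigned long) * CHAR_BIT));
	if (p->running)
		return 0;
	if (!(p->cpus_allowed & (1UL << this_cpu)))
		return 0;
	/* Hotness only matters while balancing has not failed too often. */
	if (idle == MIG_NEWLY_IDLE ||
	    sd->nr_balance_failed < sd->cache_nice_tries) {
		if (p->cache_hot)
			return 0;
	}
	return 1;
}
```

In this model a cache-hot task is refused while nr_balance_failed is below cache_nice_tries, becomes eligible once the two are equal, and is still refused for a newly idle CPU regardless of the failure count.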

Moving a task from the source runqueue to the destination runqueue: `pull_task`

static inline
void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p,
	       runqueue_t *this_rq, prio_array_t *this_array, int this_cpu)
{
	dequeue_task(p, src_array);
	src_rq->nr_running--;
	set_task_cpu(p, this_cpu);
	this_rq->nr_running++;
	enqueue_task(p, this_array);
	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
				+ this_rq->timestamp_last_tick;
	/*
	 * Note that idle threads have a prio of MAX_PRIO, for this test
	 * to be always true for them.
	 */
	if (TASK_PREEMPTS_CURR(p, this_rq))
		resched_task(this_rq->curr);
}

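Function overview

pull_task is the core function that performs the actual migration. It moves a task from the source runqueue to the destination runqueue and updates the scheduler data structures involved.

Walking through the code

Function declaration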

static inline
void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p,
	       runqueue_t *this_rq, prio_array_t *this_array, int this_cpu)

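static inline: an inline function, which avoids function-call overhead
Parameters:
- src_rq: the source runqueue (where the task currently is)
- src_array: the source priority array (the array the task is currently in)
- p: the task to migrate
- this_rq: the destination runqueue
- this_array: the destination priority array
- this_cpu: the destination CPU number

Removing the task from the source queue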

	dequeue_task(p, src_array);
	src_rq->nr_running--;

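dequeue_task(p, src_array): removes the task from the source priority array
- Decrements array->nr_active and unlinks the task from its priority list
- If that priority list is now empty, clears the corresponding bit in the priority bitmap
src_rq->nr_running--: decrements the source runqueue's task count

Updating the task's CPU and the destination count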

	set_task_cpu(p, this_cpu);
	this_rq->nr_running++;

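set_task_cpu(p, this_cpu): records the destination CPU as the task's CPU
- In 2.6.10 this sets p->thread_info->cpu = this_cpu
this_rq->nr_running++: increments the destination runqueue's task count

Adding the task to the destination queue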

	enqueue_task(p, this_array);

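enqueue_task(p, this_array): adds the task to the destination priority array
- Appends the task to the tail of the list for its priority (p->prio)
- Sets the corresponding bit in the priority bitmap and increments array->nr_active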
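
The bookkeeping that dequeue_task and enqueue_task share can be shown with a reduced model. This is only a sketch: toy_prio_array and the two helpers are hypothetical, TOY_MAX_PRIO is shrunk to 8 levels, and per-level counters take the place of the kernel's linked lists.

```c
#include <assert.h>

#define TOY_MAX_PRIO 8   /* small stand-in for MAX_PRIO (assumption) */

/* Reduced model of a priority array: counters replace the task lists. */
struct toy_prio_array {
	unsigned int nr_active;             /* tasks in the whole array */
	unsigned int bitmap;                /* bit n set: level n is non-empty */
	unsigned int queued[TOY_MAX_PRIO];  /* tasks queued per level */
};

/* Add one task at the given priority level. */
static void toy_enqueue_task(struct toy_prio_array *array, int prio)
{
	assert(prio >= 0 && prio < TOY_MAX_PRIO);
	array->queued[prio]++;
	array->bitmap |= 1U << prio;
	array->nr_active++;
}

/* Remove one task from the given priority level. */
static void toy_dequeue_task(struct toy_prio_array *array, int prio)
{
	assert(prio >= 0 && prio < TOY_MAX_PRIO && array->queued[prio] > 0);
	array->nr_active--;
	/* Clear the bitmap bit only when the level becomes empty. */
	if (--array->queued[prio] == 0)
		array->bitmap &= ~(1U << prio);
}
```

Moving a task then amounts to one toy_dequeue_task on the source array followed by one toy_enqueue_task on the destination array, which keeps both bitmaps consistent with their queues.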

Timestamp synchronization

	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
				+ this_rq->timestamp_last_tick;

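Rebases the task's timestamp onto the destination runqueue's time base. This is needed because timestamp_last_tick is taken from each CPU's own clock, and those clocks are not guaranteed to be synchronized across CPUs.
How it is computed:
- p->timestamp - src_rq->timestamp_last_tick: the task's time relative to the source queue's last tick
- + this_rq->timestamp_last_tick: converts that offset to the destination queue's time base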
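
The arithmetic is easy to check by hand. The sketch below uses a hypothetical helper, rebase_timestamp, with made-up values; it is meant only to show that the task's offset from the last tick survives the move.

```c
#include <assert.h>

/*
 * Rebase a task timestamp from the source runqueue's clock to the
 * destination's, keeping its offset from the last tick unchanged.
 * Unsigned arithmetic wraps, so a timestamp earlier than the source
 * tick still yields the right result.
 */
static unsigned long long rebase_timestamp(unsigned long long task_ts,
					   unsigned long long src_last_tick,
					   unsigned long long dst_last_tick)
{
	return (task_ts - src_last_tick) + dst_last_tick;
}

/* Worked example with illustrative values. */
static void rebase_example(void)
{
	/* 20 units before the source tick maps to 20 before the destination tick. */
	assert(rebase_timestamp(980, 1000, 5000) == 4980);
}
```

A task stamped 20 units before the source queue's last tick ends up 20 units before the destination queue's last tick, so later comparisons against the destination clock see the same age.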

Preemption check and rescheduling

	if (TASK_PREEMPTS_CURR(p, this_rq))
		resched_task(this_rq->curr);

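TASK_PREEMPTS_CURR(p, this_rq): checks whether the migrated task should preempt the task currently running on the destination CPU, that is, whether p->prio is numerically lower than this_rq->curr->prio
resched_task(this_rq->curr): if so, sets the reschedule flag on the destination CPU's current task
- The task switch then happens at the next scheduling point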
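
The preemption test reduces to a single priority comparison, which the following sketch models. The structures pre_task and pre_rq and the need_resched field are simplified stand-ins invented for this example; the real kernel sets a thread flag through resched_task rather than writing a struct member.

```c
#include <assert.h>

#define PRE_MAX_PRIO 140  /* priority of the idle thread in this model */

struct pre_task { int prio; int need_resched; };
struct pre_rq { struct pre_task *curr; };

/* Lower numeric prio means higher priority. */
static int pre_preempts_curr(const struct pre_task *p, const struct pre_rq *rq)
{
	return p->prio < rq->curr->prio;
}

/* Model of the tail of pull_task: flag the running task if it loses. */
static void pre_check_preempt(const struct pre_task *p, struct pre_rq *rq)
{
	assert(rq->curr);
	if (pre_preempts_curr(p, rq))
		rq->curr->need_resched = 1;
}
```

Because the idle thread sits at the lowest possible priority in this model, any pulled task preempts it, which matches the remark in the kernel comment that the test is always true for idle threads.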

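Function summary

Main function: move a task from one runqueue to another while keeping the scheduler data structures consistent

Queue management:
- Remove the task from the source queue
- Add the task to the destination queue
- Update the running-task counters
State updates:
- Update the task's CPU assignment
- Rebase the timestamp
- Maintain the priority bitmaps
Scheduling decision:
- Check whether immediate preemption is needed
- Trigger a reschedule when the pulled task has higher priority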
