[Linux Kernel, Part 29] Process Management: the CFS Scheduler's check_preempt_wakeup

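Continuing from the previous post: [Linux Kernel, Part 28] Process Management: the CFS Scheduler's yield_task_fair & yield_to_task_fair

This post picks up where that one left off.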

check_preempt_wakeup (full version)


static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	struct task_struct *curr = rq->curr;
	struct sched_entity *se = &curr->se, *pse = &p->se;
	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
	int scale = cfs_rq->nr_running >= sched_nr_latency;
	int next_buddy_marked = 0;
	int cse_is_idle, pse_is_idle;

	if (unlikely(se == pse))
		return;

	/*
	 * This is possible from callers such as attach_tasks(), in which we
	 * unconditionally check_preempt_curr() after an enqueue (which may have
	 * lead to a throttle).  This both saves work and prevents false
	 * next-buddy nomination below.
	 */
	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
		return;

	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
		set_next_buddy(pse);
		next_buddy_marked = 1;
	}

	/*
	 * We can come here with TIF_NEED_RESCHED already set from new task
	 * wake up path.
	 *
	 * Note: this also catches the edge-case of curr being in a throttled
	 * group (e.g. via set_curr_task), since update_curr() (in the
	 * enqueue of curr) will have resulted in resched being set.  This
	 * prevents us from potentially nominating it as a false LAST_BUDDY
	 * below.
	 */
	if (test_tsk_need_resched(curr))
		return;

	/* Idle tasks are by definition preempted by non-idle tasks. */
	if (unlikely(task_has_idle_policy(curr)) &&
	    likely(!task_has_idle_policy(p)))
		goto preempt;

	/*
	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
	 * is driven by the tick):
	 */
	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
		return;

	find_matching_se(&se, &pse);
	BUG_ON(!pse);

	cse_is_idle = se_is_idle(se);
	pse_is_idle = se_is_idle(pse);

	/*
	 * Preempt an idle group in favor of a non-idle group (and don't preempt
	 * in the inverse case).
	 */
	if (cse_is_idle && !pse_is_idle)
		goto preempt;
	if (cse_is_idle != pse_is_idle)
		return;

	update_curr(cfs_rq_of(se));
	if (wakeup_preempt_entity(se, pse) == 1) {
		/*
		 * Bias pick_next to pick the sched entity that is
		 * triggering this preemption.
		 */
		if (!next_buddy_marked)
			set_next_buddy(pse);
		goto preempt;
	}

	return;

preempt:
	resched_curr(rq);
	/*
	 * Only set the backward buddy when the current task is still
	 * on the rq. This can happen when a wakeup gets interleaved
	 * with schedule on the ->pre_schedule() or idle_balance()
	 * point, either of which can * drop the rq lock.
	 *
	 * Also, during early boot the idle thread is in the fair class,
	 * for obvious reasons its a bad idea to schedule back to it.
	 */
	if (unlikely(!se->on_rq || curr == rq->idle))
		return;

	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
		set_last_buddy(se);
}

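Why is check_preempt_wakeup needed?

In a multitasking operating system, a newly woken task may deserve the CPU more than the task that is currently running. For example:

Higher-priority task: in CFS a higher priority (a lower nice value) means a larger weight, so the task's vruntime advances more slowly and it tends to win the comparison against the current task.
Smaller virtual runtime: in the CFS scheduler, virtual runtime (vruntime) is the core metric that decides the order in which tasks run. If the newly woken task's vruntime is smaller than the current task's, it is "hungrier" and should run first.
Lower latency: responding quickly to a newly woken task reduces system response time and improves the user experience.

However, frequent preemption increases context-switch overhead, so check_preempt_wakeup runs a series of checks to make sure preemption is triggered only when it is warranted.

What is it?

The core job of check_preempt_wakeup is to decide whether a newly woken task should preempt the currently running one. The implementation works as follows:

Function logic walkthrough

Basic check: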
```
if (unlikely(se == pse))
    return;
```
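- If the current task and the newly woken task are the same task (se == pse), return immediately; there is nothing to preempt.
Policy and feature switch:
```
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
    return;
```
- p->policy != SCHED_NORMAL: batch and idle tasks do not preempt on wakeup; their preemption is driven by the tick.
- sched_feat(WAKEUP_PREEMPTION): checks whether the scheduler's "wakeup preemption" feature is enabled. If it is not, the function returns immediately.
Preemption test: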
```
if (wakeup_preempt_entity(se, pse) == 1) {
    goto preempt;
}
```
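- Calls wakeup_preempt_entity to compare the virtual runtimes of the current task and the newly woken task. In the full function, update_curr() runs first so that the current entity's vruntime is up to date.
- A return value of 1 means the newly woken task should preempt the current one. The full function then nominates pse as the next buddy (unless that was already done) and jumps to the preempt label.
Triggering the preemption: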
```
preempt:
    resched_curr(rq);
```
- Calls resched_curr, which sets the reschedule flag on the current task. The actual switch happens at the next scheduling point.
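
The walkthrough above covers only a few of the checks in the full listing, and the order of those checks matters. Below is a minimal userspace sketch of that order. The struct wakeup_inputs, the enum and the model_ function are hypothetical names invented for this illustration, and every kernel predicate is reduced to a boolean that the caller supplies, so this is a reading aid rather than kernel code.

```c
#include <stdbool.h>

/* Hypothetical outcome type for this model; not a kernel type. */
enum wakeup_decision { NO_PREEMPT = 0, PREEMPT = 1 };

/* Hypothetical flattened view of the state the kernel function inspects. */
struct wakeup_inputs {
	bool same_entity;       /* se == pse */
	bool p_throttled;       /* throttled_hierarchy(cfs_rq_of(pse)) */
	bool need_resched_set;  /* test_tsk_need_resched(curr) */
	bool curr_idle_policy;  /* task_has_idle_policy(curr) */
	bool p_idle_policy;     /* task_has_idle_policy(p) */
	bool p_sched_normal;    /* p->policy == SCHED_NORMAL */
	bool wakeup_preemption; /* sched_feat(WAKEUP_PREEMPTION) */
	bool cse_is_idle;       /* se_is_idle(se) after find_matching_se() */
	bool pse_is_idle;       /* se_is_idle(pse) after find_matching_se() */
	int entity_result;      /* wakeup_preempt_entity(se, pse): -1, 0 or 1 */
};

/*
 * Returns PREEMPT when the real function would reach the preempt label,
 * NO_PREEMPT when it would return without requesting a new reschedule.
 * Buddy bookkeeping is left out of the model.
 */
static enum wakeup_decision
model_check_preempt_wakeup(const struct wakeup_inputs *in)
{
	/* Early exits: nothing to compare, or a resched is already pending. */
	if (in->same_entity || in->p_throttled || in->need_resched_set)
		return NO_PREEMPT;

	/* An idle-policy task is preempted by any non-idle task. */
	if (in->curr_idle_policy && !in->p_idle_policy)
		return PREEMPT;

	/* Batch/idle wakers never preempt; the feature may also be off. */
	if (!in->p_sched_normal || !in->wakeup_preemption)
		return NO_PREEMPT;

	/* Idle group versus non-idle group, in either direction. */
	if (in->cse_is_idle && !in->pse_is_idle)
		return PREEMPT;
	if (in->cse_is_idle != in->pse_is_idle)
		return NO_PREEMPT;

	/* Finally the vruntime comparison decides. */
	return in->entity_result == 1 ? PREEMPT : NO_PREEMPT;
}
```

One thing the model makes visible: the idle-policy check sits before the WAKEUP_PREEMPTION test, so a SCHED_IDLE current task is preempted by a non-idle waker even when the feature is switched off.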

Key function: wakeup_preempt_entity

wakeup_preempt_entity is defined as follows:

static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(curr, se);
	if (vdiff > gran)
		return 1;

	return 0;
}

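vdiff: the difference between the current task's virtual runtime and the newly woken task's.
gran: the preemption granularity returned by wakeup_gran, a threshold introduced to avoid overly frequent preemption.
Return value:
- -1: the current task's vruntime is less than or equal to the newly woken task's, so no preemption is needed.
- 1: the current task's vruntime is greater than the newly woken task's and the difference exceeds the preemption granularity, so preemption is needed.
- 0: the difference does not exceed the preemption granularity, so no preemption for now.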
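
To see the three return values with concrete numbers, here is a small userspace sketch of the same comparison. The function name model_wakeup_preempt_entity is made up for this example, and gran is passed in as a parameter instead of being computed by wakeup_gran, which is an assumption made to keep the example standalone.

```c
#include <stdint.h>

/*
 * Userspace copy of the comparison. Both vruntimes are in nanoseconds;
 * gran stands in for the value wakeup_gran() would return.
 */
static int model_wakeup_preempt_entity(uint64_t curr_vruntime,
				       uint64_t se_vruntime, int64_t gran)
{
	/* Signed difference: positive when curr has run "more" than se. */
	int64_t vdiff = (int64_t)(curr_vruntime - se_vruntime);

	if (vdiff <= 0)
		return -1;	/* curr is not ahead: no preemption */

	if (vdiff > gran)
		return 1;	/* curr is ahead by more than gran: preempt */

	return 0;		/* curr is ahead, but within gran */
}
```

With gran set to 1,000,000 ns, a current task at 7,000,000 against a waker at 5,000,000 yields 1, a current task at 5,500,000 against the same waker yields 0, and a current task at 5,000,000 against a waker at 6,000,000 yields -1. A difference exactly equal to gran also yields 0, because the test is a strict greater-than.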

Related data structures

struct sched_entity:
```
struct sched_entity {
    u64 vruntime; // virtual runtime
    ...
};
```
- vruntime: the task's virtual runtime, used to measure how "hungry" the task is.
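
Because vruntime is an unsigned 64-bit counter that only grows, two values are compared through a signed difference rather than with a plain `<`. That is what the `s64 vdiff` in wakeup_preempt_entity does, and as far as I can tell it is the same idiom the kernel's entity_before() helper uses. A minimal sketch follows; model_vruntime_before is a hypothetical name for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when vruntime a is "before" b. The subtraction is done modulo 2^64
 * and the result is reinterpreted as signed, so the answer stays correct
 * even if the counter wrapped between a and b (as long as the two values
 * are less than 2^63 apart).
 */
static bool model_vruntime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}
```

For example, with a = UINT64_MAX - 4 and b = 5 (b has wrapped around and is really 10 ticks later), a plain `a < b` says a is not before b, while the signed difference correctly says it is.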

Preemption granularity:

static unsigned long wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
{
    unsigned long gran = sysctl_sched_wakeup_granularity;
    ...
    return gran;
}

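sysctl_sched_wakeup_granularity: the system-wide preemption granularity, stored in nanoseconds. The base default is 1 ms; with the default tunable scaling, the kernel multiplies it by a factor that grows with the number of CPUs.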
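
The body elided above is left as it is. To my understanding, in kernels of this generation it converts the granularity from real time into the waking entity's virtual time (via calc_delta_fair), so a heavier waker gets a smaller threshold and preempts more easily. The sketch below models that idea with plain integer arithmetic. The function name, the macro, and the simple division are assumptions for illustration; the kernel's fixed-point math differs in detail.

```c
#include <stdint.h>

/* Weight of a nice-0 task in the scheduler's weight table. */
#define MODEL_NICE_0_WEIGHT 1024ULL

/*
 * Scale a real-time granularity (ns) into the virtual time of an entity
 * with the given load weight: heavier entities see a smaller threshold.
 */
static uint64_t model_wakeup_gran(uint64_t gran_ns, uint64_t se_weight)
{
	return gran_ns * MODEL_NICE_0_WEIGHT / se_weight;
}
```

With a 1,000,000 ns granularity, a waker of weight 1024 keeps the full 1,000,000 threshold, a waker of weight 2048 needs only a 500,000 lead, and a waker of weight 512 needs 2,000,000.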

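How is it used?

check_preempt_wakeup is typically invoked in the following situations (the core scheduler reaches it through check_preempt_curr()):

Task wakeup: when a task is woken from sleep, check_preempt_wakeup is called to check whether the current task should be preempted.
Load balancing: when a task is migrated across CPUs, the function may be called after the task is attached to its new runqueue, so that the migrated task can run promptly.

Usage example

Suppose there are two tasks, taskA and taskB, where taskA is currently running and taskB has just been woken. The check can be made as follows:

Calling check_preempt_wakeup: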
```
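/* Illustrative only: the function is static to fair.c and expects rq's lock to be held. */
struct rq *rq = cpu_rq(task_cpu(taskA)); /* cpu_of() takes a struct rq *, not a task */
check_preempt_wakeup(rq, taskB, 0);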
```
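- Checks whether taskB should preempt taskA.
If preemption is needed, resched_curr sets the reschedule flag, and taskB is favored at the next scheduling point.

Notes

Preemption granularity: the granularity exists to avoid overly frequent preemption. If the newly woken task's vruntime is only slightly smaller than the current task's, preemption may not be triggered.
Feature switch: the WAKEUP_PREEMPTION feature switch lets wakeup preemption be enabled or disabled at runtime.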

find_matching_se


static void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
	int se_depth, pse_depth;

	/*
	 * Preemption test can be made between sibling entities who are in the
	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
	 * both tasks until we find their ancestors who are siblings of common
	 * parent.
	 */

	/* First walk up until both entities are at same depth */
	se_depth = (*se)->depth;
	pse_depth = (*pse)->depth;

	while (se_depth > pse_depth) {
		se_depth--;
		*se = parent_entity(*se);
	}

	while (pse_depth > se_depth) {
		pse_depth--;
		*pse = parent_entity(*pse);
	}

	while (!is_same_group(*se, *pse)) {
		*se = parent_entity(*se);
		*pse = parent_entity(*pse);
	}
}

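Why is find_matching_se needed?

In the CFS scheduler, scheduling entities (struct sched_entity) may sit on different per-group runqueues (struct cfs_rq), particularly when group scheduling (CONFIG_FAIR_GROUP_SCHED) is enabled. A preemption test is only meaningful between entities on the same cfs_rq, that is, between "siblings" with a common parent. Each cfs_rq keeps its own virtual timeline, so comparing vruntime values taken from different queues could lead to unfair scheduling decisions.

The core job of find_matching_se is therefore to walk both entities up to the pair of ancestors that are siblings under a common parent, so that the comparison happens at the same level.

find_matching_se walks the entity hierarchy and moves the two entities upward step by step until they are at the same depth and on the same cfs_rq. The implementation works as follows:

Function logic walkthrough

Reading the initial depths: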
```
se_depth = (*se)->depth;
pse_depth = (*pse)->depth;
```
- (*se)->depth and (*pse)->depth are the depths of the current task's and the newly woken task's scheduling entities in the group hierarchy.
Equalizing the depths:
```
while (se_depth > pse_depth) {
    se_depth--;
    *se = parent_entity(*se);
}

while (pse_depth > se_depth) {
    pse_depth--;
    *pse = parent_entity(*pse);
}
```
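- If the two entities are at different depths, the deeper one is moved up step by step until both are at the same depth.
- parent_entity(*se): returns the parent of the given scheduling entity.
Finding the sibling ancestors: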
```
while (!is_same_group(*se, *pse)) {
    *se = parent_entity(*se);
    *pse = parent_entity(*pse);
}
```
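- While the two entities are not in the same group, both keep moving up in lockstep. The loop stops at the pair of ancestors that share a parent, one level below their common ancestor.
- is_same_group(*se, *pse): tests whether the two scheduling entities belong to the same group, i.e. sit on the same cfs_rq.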
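
The walk is easier to follow on a tiny hierarchy. The sketch below is a userspace model: struct model_se and the model_ functions are hypothetical stand-ins, and "same cfs_rq" is approximated as "same parent pointer", which is an assumption of this example rather than how the kernel's is_same_group is written.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct sched_entity: only what the walk needs. */
struct model_se {
	int depth;               /* 0 for entities on the root runqueue */
	struct model_se *parent; /* owning group entity, NULL at the root */
};

/* Siblings share a parent; the model uses that to mean "same cfs_rq". */
static int model_is_same_group(const struct model_se *a,
			       const struct model_se *b)
{
	return a->parent == b->parent;
}

static void model_find_matching_se(struct model_se **se,
				   struct model_se **pse)
{
	int se_depth = (*se)->depth;
	int pse_depth = (*pse)->depth;

	/* First bring the deeper entity up to the shallower one's depth. */
	while (se_depth > pse_depth) {
		se_depth--;
		*se = (*se)->parent;
	}
	while (pse_depth > se_depth) {
		pse_depth--;
		*pse = (*pse)->parent;
	}

	/* Then climb in lockstep until the two entities are siblings. */
	while (!model_is_same_group(*se, *pse)) {
		*se = (*se)->parent;
		*pse = (*pse)->parent;
	}
}
```

Take a hierarchy where task t1 lives at depth 2 under root group A (through a subgroup) and task t2 lives at depth 1 under root group B. Starting from (t1, t2), the walk first lifts t1 to the subgroup, then lifts both sides once more and stops at (A, B), which are siblings on the root runqueue. Two tasks that already share a group come back unchanged.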

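Summary

check_preempt_wakeup is the key mechanism behind wakeup preemption in the CFS scheduler. By comparing the tasks' virtual runtimes against the preemption granularity, it decides whether to trigger preemption, so that higher-priority or "hungrier" tasks get to run promptly. This design preserves fairness while also taking system performance and responsiveness into account.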