A Deep Dive into the Linux Kernel's TCP Poll Mechanism: Waiting, Wakeup, and Synchronization

The poll/select/epoll machinery in the Linux kernel is the foundation of I/O multiplexing: it lets a user-space process monitor many file descriptors at once and wait for any of them to become ready. For TCP sockets, the tcp_poll function is the heart of this machinery. Based on the Linux 6.18 kernel sources, this article analyzes the implementation of tcp_poll and examines the wait queues, the wakeup path, and the synchronization primitives behind it, showing how the kernel performs this state check safely and efficiently without taking the socket lock.

1. The entry point: tcp_poll

tcp_poll, defined in net/ipv4/tcp.c, checks the state of a TCP socket and returns a mask describing the operations currently possible (readable, writable, exceptional conditions, and so on). Its prototype is:

__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
  • file: the associated file structure.
  • sock: the socket structure.
  • wait: a poll_table used to register a callback on the wait queue.

The function starts by calling sock_poll_wait(file, sock, wait), the key step that adds the current process to the socket's wait queue; we analyze it in detail below. It then loads the socket state:

state = inet_sk_state_load(sk);
if (state == TCP_LISTEN)
    return inet_csk_listen_poll(sk);

If the socket is listening, inet_csk_listen_poll is called directly to produce the poll mask for a listening socket (typically indicating that a new connection can be accepted). For non-listening sockets, execution continues into the main logic.

Building the mask

Next, the code builds mask step by step from the socket's shutdown state, connection state, and readability/writability conditions. A few key points:

  • EPOLLHUP handling: the in-source comment discusses at length how awkward EPOLLHUP is. The resolution: set EPOLLHUP only when the socket is shut down in both directions (shutdown == SHUTDOWN_MASK) or is in TCP_CLOSE.
  • Readability: tcp_stream_is_readable(sk, target) decides whether data is available to read, where target accounts for the receive low-water mark and urgent data.
  • Writability: __sk_stream_is_writeable(sk, 1) checks whether the send buffer has room. If not, the SOCK_NOSPACE flag is set, and a memory barrier ensures that a later release of write space will correctly raise the notification.
  • Urgent data: if urg_data is valid, EPOLLPRI is added.
  • TCP Fast Open: for a fast-open socket in TCP_SYN_SENT with DEFER_CONNECT set, EPOLLOUT is returned directly so that user space can call write() to trigger the SYN.
  • Error handling: finally, an smp_rmb() pairs with the smp_wmb() in tcp_reset to guarantee that the freshest error state is read; if sk_err is set or the error queue is non-empty, EPOLLERR is added.

Notably, a comment at the top of the function states: "Socket is not locked. We are protected from async events by poll logic and correct handling of state changes". In other words, tcp_poll does not need to hold the socket lock; it relies on the poll framework's concurrency control and carefully placed memory barriers for correctness.

2. Wait queues and the poll_table: the trick in sock_poll_wait

sock_poll_wait is an inline function defined in include/net/sock.h:

static inline void sock_poll_wait(struct file *filp, struct socket *sock,
				  poll_table *p)
{
	if (!poll_does_not_wait(p)) {
		poll_wait(filp, &sock->wq.wait, p);
		smp_mb();
	}
}

It first checks that the caller actually intends to wait (poll_does_not_wait(p) is false), then calls the generic poll_wait to attach the current process to the socket's wait queue head, sock->wq.wait. The crucial detail is the smp_mb() memory barrier inserted immediately afterwards.

Why is this barrier needed? The comment says: "We need to be sure we are in sync with the socket flags modification. This memory barrier is paired in the wq_has_sleeper." That is, after the wait entry is added, reads of the socket flags (such as the readable state) must not be reordered before the add, and a subsequent waker must be able to see that the wait entry is present. The barrier pairs with wq_has_sleeper on the wakeup path (which also implies an smp_mb()), forming a publish/subscribe style of synchronization that guarantees a waiter cannot miss a wakeup.

3. The wakeup path: from data arrival to runnable process

When TCP data arrives or send-buffer space is released, the kernel must wake the processes waiting on the socket. Taking data arrival as the example, functions such as tcp_data_queue eventually invoke the sk->sk_data_ready callback, which is normally sock_def_readable:

void sock_def_readable(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (skwq_has_sleeper(wq))
		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
						EPOLLRDNORM | EPOLLRDBAND);
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	rcu_read_unlock();
}
  • The wait queue head is obtained safely under RCU.
  • If there are sleepers, wake_up_interruptible_sync_poll wakes them.
  • An asynchronous notification (e.g. SIGIO) may also be triggered.

The wake_up_interruptible_sync_poll macro expands to __wake_up_sync_key, which ultimately calls __wake_up_common_lock: it takes the queue lock, walks the wait queue, and invokes each entry's callback function (usually default_wake_function).

__wake_up_sync_key passes the WF_SYNC flag, marking this as a synchronous wakeup: the waker is about to sleep itself, so the woken process should not be migrated to another CPU, reducing cache-line bouncing. This makes particular sense in the socket wakeup scenario.

The wait entry's callback: default_wake_function

default_wake_function simply calls try_to_wake_up on the task stored in the wait entry, curr->private. try_to_wake_up is the core of the wakeup operation, and its complexity lies in handling all the concurrency cases correctly.

4. try_to_wake_up in depth

try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) sets process p to TASK_RUNNING and places it on a runqueue. Its implementation is full of memory barriers and fine-grained locking to guarantee correctness.

The function first handles the special case of waking the current process (p == current), relying on program-order guarantees and disabled interrupts to avoid deadlock. For waking another process, it proceeds as follows:

  1. Take p->pi_lock: acquired with interrupts disabled (via scoped_guard(raw_spinlock_irqsave, ...)) to protect the process's scheduling-related fields.
  2. State match: ttwu_state_match checks whether the process's current state is in the allowed state mask. This also covers saved_state handling, used for RT mutexes and the freezer.
  3. Memory barrier: an smp_rmb() is issued before loading p->on_rq, pairing with barriers in the scheduler so the freshest on_rq value is observed.
  4. Already on a runqueue? If p->on_rq is true and ttwu_runnable succeeds (the process is already running or runnable on another CPU), return immediately.
  5. Handle migration: on SMP, the process may still be mid-schedule on another CPU. smp_cond_load_acquire(&p->on_cpu, !VAL) waits for that to finish (on_cpu becomes 0), then a suitable CPU is selected for migration.
  6. Enqueue: ttwu_queue puts the process on the target CPU's runqueue, possibly sending an IPI to wake that CPU.

The whole sequence uses several memory barriers, such as smp_mb__after_spinlock(), smp_rmb(), and smp_acquire__after_ctrl_dep(), which cooperate with barriers in the scheduler to guarantee ordering between the waker and the scheduler.

5. The scheduling point: schedule

When a process has nothing to do, it calls schedule() to yield the CPU. schedule is a thin wrapper around __schedule_loop, which eventually performs the context switch. Note that schedule is typically invoked while waiting for an event: in the poll system call, for instance, when no event is pending the process sleeps on the wait queue until it is woken.

6. Putting the synchronization together: the full chain from poll to wakeup

We can now string the whole flow together:

  • Entering poll: user space calls poll/select/epoll; the kernel eventually calls tcp_poll. sock_poll_wait adds the current process to the socket's wait queue and issues a memory barrier so the subsequent state checks are correct.
  • State checks: tcp_poll reads assorted socket state without locking (sk_shutdown, tp->urg_data, sk_err, and so on). These reads could be reordered by the compiler or CPU, but the earlier barrier, paired with barriers on the wakeup path, guarantees consistent values.
  • Data arrival: the network interrupt or softirq processes the data and ultimately calls sock_def_readable. It checks the wait queue and, if there are sleepers, triggers the wakeup via __wake_up_sync_key.
  • The wakeup: __wake_up_common walks the wait queue, calling default_wake_function for each entry, which makes the process runnable via try_to_wake_up. The barriers in try_to_wake_up ensure the process state becomes visible correctly.
  • Scheduling: the woken process is picked by the scheduler in due course, returns from the poll system call, and user space receives the event mask.

Throughout, memory barriers play the key role: they guarantee cross-CPU data consistency and prevent compiler and processor reordering from breaking the logic, which is what lets the lockless tcp_poll run safely. For example:

  • The smp_mb() in sock_poll_wait pairs with the barrier implied by wq_has_sleeper, ensuring that adding and checking the wait entry cannot lose a wakeup.
  • The smp_rmb() at the end of tcp_poll pairs with the smp_wmb() in tcp_reset, ensuring the error state that is read is up to date.
  • The many barriers in try_to_wake_up enforce the correct ordering of process-state and runqueue operations.

7. Conclusion

The Linux kernel's TCP poll machinery shows how, in a highly concurrent environment, careful use of wait queues, memory barriers, and lockless design yields efficient and correct I/O event notification. From the compact implementation of tcp_poll to the intricate logic of try_to_wake_up, every detail reflects the kernel developers' pursuit of both performance and correctness. Understanding this code deepens one's grasp of network programming, and of concurrent programming in general.

Source
/*
 *	Wait for a TCP event.
 *
 *	Note that we don't need to lock the socket, as the upper poll layers
 *	take care of normal races (between the test and the event) and we don't
 *	go look at any of the socket buffers directly.
 */
__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
	__poll_t mask;
	struct sock *sk = sock->sk;
	const struct tcp_sock *tp = tcp_sk(sk);
	u8 shutdown;
	int state;

	sock_poll_wait(file, sock, wait);

	state = inet_sk_state_load(sk);
	if (state == TCP_LISTEN)
		return inet_csk_listen_poll(sk);

	/* Socket is not locked. We are protected from async events
	 * by poll logic and correct handling of state changes
	 * made by other threads is impossible in any case.
	 */

	mask = 0;

	/*
	 * EPOLLHUP is certainly not done right. But poll() doesn't
	 * have a notion of HUP in just one direction, and for a
	 * socket the read side is more interesting.
	 *
	 * Some poll() documentation says that EPOLLHUP is incompatible
	 * with the EPOLLOUT/POLLWR flags, so somebody should check this
	 * all. But careful, it tends to be safer to return too many
	 * bits than too few, and you can easily break real applications
	 * if you don't tell them that something has hung up!
	 *
	 * Check-me.
	 *
	 * Check number 1. EPOLLHUP is _UNMASKABLE_ event (see UNIX98 and
	 * our fs/select.c). It means that after we received EOF,
	 * poll always returns immediately, making impossible poll() on write()
	 * in state CLOSE_WAIT. One solution is evident --- to set EPOLLHUP
	 * if and only if shutdown has been made in both directions.
	 * Actually, it is interesting to look how Solaris and DUX
	 * solve this dilemma. I would prefer, if EPOLLHUP were maskable,
	 * then we could set it on SND_SHUTDOWN. BTW examples given
	 * in Stevens' books assume exactly this behaviour, it explains
	 * why EPOLLHUP is incompatible with EPOLLOUT.	--ANK
	 *
	 * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
	 * blocking on fresh not-connected or disconnected socket. --ANK
	 */
	shutdown = READ_ONCE(sk->sk_shutdown);
	if (shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
		mask |= EPOLLHUP;
	if (shutdown & RCV_SHUTDOWN)
		mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;

	/* Connected or passive Fast Open socket? */
	if (state != TCP_SYN_SENT &&
	    (state != TCP_SYN_RECV || rcu_access_pointer(tp->fastopen_rsk))) {
		int target = sock_rcvlowat(sk, 0, INT_MAX);
		u16 urg_data = READ_ONCE(tp->urg_data);

		if (unlikely(urg_data) &&
		    READ_ONCE(tp->urg_seq) == READ_ONCE(tp->copied_seq) &&
		    !sock_flag(sk, SOCK_URGINLINE))
			target++;

		if (tcp_stream_is_readable(sk, target))
			mask |= EPOLLIN | EPOLLRDNORM;

		if (!(shutdown & SEND_SHUTDOWN)) {
			if (__sk_stream_is_writeable(sk, 1)) {
				mask |= EPOLLOUT | EPOLLWRNORM;
			} else {  /* send SIGIO later */
				sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);

				/* Race breaker. If space is freed after
				 * wspace test but before the flags are set,
				 * IO signal will be lost. Memory barrier
				 * pairs with the input side.
				 */
				smp_mb__after_atomic();
				if (__sk_stream_is_writeable(sk, 1))
					mask |= EPOLLOUT | EPOLLWRNORM;
			}
		} else
			mask |= EPOLLOUT | EPOLLWRNORM;

		if (urg_data & TCP_URG_VALID)
			mask |= EPOLLPRI;
	} else if (state == TCP_SYN_SENT &&
		   inet_test_bit(DEFER_CONNECT, sk)) {
		/* Active TCP fastopen socket with defer_connect
		 * Return EPOLLOUT so application can call write()
		 * in order for kernel to generate SYN+data
		 */
		mask |= EPOLLOUT | EPOLLWRNORM;
	}
	/* This barrier is coupled with smp_wmb() in tcp_reset() */
	smp_rmb();
	if (READ_ONCE(sk->sk_err) ||
	    !skb_queue_empty_lockless(&sk->sk_error_queue))
		mask |= EPOLLERR;

	return mask;
}
EXPORT_SYMBOL(tcp_poll);


/**
 * sock_poll_wait - place memory barrier behind the poll_wait call.
 * @filp:           file
 * @sock:           socket to wait on
 * @p:              poll_table
 *
 * See the comments in the wq_has_sleeper function.
 */
static inline void sock_poll_wait(struct file *filp, struct socket *sock,
				  poll_table *p)
{
	if (!poll_does_not_wait(p)) {
		poll_wait(filp, &sock->wq.wait, p);
		/* We need to be sure we are in sync with the
		 * socket flags modification.
		 *
		 * This memory barrier is paired in the wq_has_sleeper.
		 */
		smp_mb();
	}
}

void sock_def_readable(struct sock *sk)
{
	struct socket_wq *wq;

	trace_sk_data_ready(sk);

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (skwq_has_sleeper(wq))
		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
						EPOLLRDNORM | EPOLLRDBAND);
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	rcu_read_unlock();
}

#define wake_up_interruptible_sync_poll(x, m)					\
	__wake_up_sync_key((x), TASK_INTERRUPTIBLE, poll_to_key(m))

/**
 * __wake_up_sync_key - wake up threads blocked on a waitqueue.
 * @wq_head: the waitqueue
 * @mode: which threads
 * @key: opaque value to be passed to wakeup targets
 *
 * The sync wakeup differs that the waker knows that it will schedule
 * away soon, so while the target thread will be woken up, it will not
 * be migrated to another CPU - ie. the two threads are 'synchronized'
 * with each other. This can prevent needless bouncing between CPUs.
 *
 * On UP it can prevent extra preemption.
 *
 * If this function wakes up a task, it executes a full memory barrier before
 * accessing the task state.
 */
void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode,
			void *key)
{
	if (unlikely(!wq_head))
		return;

	__wake_up_common_lock(wq_head, mode, 1, WF_SYNC, key);
}
EXPORT_SYMBOL_GPL(__wake_up_sync_key);

static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key)
{
	unsigned long flags;
	int remaining;

	spin_lock_irqsave(&wq_head->lock, flags);
	remaining = __wake_up_common(wq_head, mode, nr_exclusive, wake_flags,
			key);
	spin_unlock_irqrestore(&wq_head->lock, flags);

	return nr_exclusive - remaining;
}


/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake that number of exclusive tasks, and potentially all
 * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
 * the list and any non-exclusive tasks will be woken first. A priority task
 * may be at the head of the list, and can consume the event without any other
 * tasks being woken.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key)
{
	wait_queue_entry_t *curr, *next;

	lockdep_assert_held(&wq_head->lock);

	curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);

	if (&curr->entry == &wq_head->head)
		return nr_exclusive;

	list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
		unsigned flags = curr->flags;
		int ret;

		ret = curr->func(curr, mode, wake_flags, key);
		if (ret < 0)
			break;
		if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}

	return nr_exclusive;
}

int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
			  void *key)
{
	WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
	return try_to_wake_up(curr->private, mode, wake_flags);
}
EXPORT_SYMBOL(default_wake_function);


/**
 * try_to_wake_up - wake up a thread
 * @p: the thread to be awakened
 * @state: the mask of task states that can be woken
 * @wake_flags: wake modifier flags (WF_*)
 *
 * Conceptually does:
 *
 *   If (@state & @p->state) @p->state = TASK_RUNNING.
 *
 * If the task was not queued/runnable, also place it back on a runqueue.
 *
 * This function is atomic against schedule() which would dequeue the task.
 *
 * It issues a full memory barrier before accessing @p->state, see the comment
 * with set_current_state().
 *
 * Uses p->pi_lock to serialize against concurrent wake-ups.
 *
 * Relies on p->pi_lock stabilizing:
 *  - p->sched_class
 *  - p->cpus_ptr
 *  - p->sched_task_group
 * in order to do migration, see its use of select_task_rq()/set_task_cpu().
 *
 * Tries really hard to only take one task_rq(p)->lock for performance.
 * Takes rq->lock in:
 *  - ttwu_runnable()    -- old rq, unavoidable, see comment there;
 *  - ttwu_queue()       -- new rq, for enqueue of the task;
 *  - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
 *
 * As a consequence we race really badly with just about everything. See the
 * many memory barriers and their comments for details.
 *
 * Return: %true if @p->state changes (an actual wakeup was done),
 *	   %false otherwise.
 */
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	guard(preempt)();
	int cpu, success = 0;

	if (p == current) {
		/*
		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
		 * == smp_processor_id()'. Together this means we can special
		 * case the whole 'p->on_rq && ttwu_runnable()' case below
		 * without taking any locks.
		 *
		 * In particular:
		 *  - we rely on Program-Order guarantees for all the ordering,
		 *  - we're serialized against set_special_state() by virtue of
		 *    it disabling IRQs (this allows not taking ->pi_lock).
		 */
		if (!ttwu_state_match(p, state, &success))
			goto out;

		trace_sched_waking(p);
		ttwu_do_wakeup(p);
		goto out;
	}

	/*
	 * If we are going to wake up a thread waiting for CONDITION we
	 * need to ensure that CONDITION=1 done by the caller can not be
	 * reordered with p->state check below. This pairs with smp_store_mb()
	 * in set_current_state() that the waiting thread does.
	 */
	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
		smp_mb__after_spinlock();
		if (!ttwu_state_match(p, state, &success))
			break;

		trace_sched_waking(p);

		/*
		 * Ensure we load p->on_rq _after_ p->state, otherwise it would
		 * be possible to, falsely, observe p->on_rq == 0 and get stuck
		 * in smp_cond_load_acquire() below.
		 *
		 * sched_ttwu_pending()			try_to_wake_up()
		 *   STORE p->on_rq = 1			  LOAD p->state
		 *   UNLOCK rq->lock
		 *
		 * __schedule() (switch to task 'p')
		 *   LOCK rq->lock			  smp_rmb();
		 *   smp_mb__after_spinlock();
		 *   UNLOCK rq->lock
		 *
		 * [task p]
		 *   STORE p->state = UNINTERRUPTIBLE	  LOAD p->on_rq
		 *
		 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
		 * __schedule().  See the comment for smp_mb__after_spinlock().
		 *
		 * A similar smp_rmb() lives in __task_needs_rq_lock().
		 */
		smp_rmb();
		if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
			break;

#ifdef CONFIG_SMP
		/*
		 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
		 * possible to, falsely, observe p->on_cpu == 0.
		 *
		 * One must be running (->on_cpu == 1) in order to remove oneself
		 * from the runqueue.
		 *
		 * __schedule() (switch to task 'p')	try_to_wake_up()
		 *   STORE p->on_cpu = 1		  LOAD p->on_rq
		 *   UNLOCK rq->lock
		 *
		 * __schedule() (put 'p' to sleep)
		 *   LOCK rq->lock			  smp_rmb();
		 *   smp_mb__after_spinlock();
		 *   STORE p->on_rq = 0			  LOAD p->on_cpu
		 *
		 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
		 * __schedule().  See the comment for smp_mb__after_spinlock().
		 *
		 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
		 * schedule()'s deactivate_task() has 'happened' and p will no longer
		 * care about it's own p->state. See the comment in __schedule().
		 */
		smp_acquire__after_ctrl_dep();

		/*
		 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
		 * == 0), which means we need to do an enqueue, change p->state to
		 * TASK_WAKING such that we can unlock p->pi_lock before doing the
		 * enqueue, such as ttwu_queue_wakelist().
		 */
		WRITE_ONCE(p->__state, TASK_WAKING);

		/*
		 * If the owning (remote) CPU is still in the middle of schedule() with
		 * this task as prev, considering queueing p on the remote CPUs wake_list
		 * which potentially sends an IPI instead of spinning on p->on_cpu to
		 * let the waker make forward progress. This is safe because IRQs are
		 * disabled and the IPI will deliver after on_cpu is cleared.
		 *
		 * Ensure we load task_cpu(p) after p->on_cpu:
		 *
		 * set_task_cpu(p, cpu);
		 *   STORE p->cpu = @cpu
		 * __schedule() (switch to task 'p')
		 *   LOCK rq->lock
		 *   smp_mb__after_spin_lock()		smp_cond_load_acquire(&p->on_cpu)
		 *   STORE p->on_cpu = 1		LOAD p->cpu
		 *
		 * to ensure we observe the correct CPU on which the task is currently
		 * scheduling.
		 */
		if (smp_load_acquire(&p->on_cpu) &&
		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
			break;

		/*
		 * If the owning (remote) CPU is still in the middle of schedule() with
		 * this task as prev, wait until it's done referencing the task.
		 *
		 * Pairs with the smp_store_release() in finish_task().
		 *
		 * This ensures that tasks getting woken will be fully ordered against
		 * their previous state and preserve Program Order.
		 */
		smp_cond_load_acquire(&p->on_cpu, !VAL);

		cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
		if (task_cpu(p) != cpu) {
			if (p->in_iowait) {
				delayacct_blkio_end(p);
				atomic_dec(&task_rq(p)->nr_iowait);
			}

			wake_flags |= WF_MIGRATED;
			psi_ttwu_dequeue(p);
			set_task_cpu(p, cpu);
		}
#else
		cpu = task_cpu(p);
#endif /* CONFIG_SMP */

		ttwu_queue(p, cpu, wake_flags);
	}
out:
	if (success)
		ttwu_stat(p, task_cpu(p), wake_flags);

	return success;
}


/*
 * Invoked from try_to_wake_up() to check whether the task can be woken up.
 *
 * The caller holds p::pi_lock if p != current or has preemption
 * disabled when p == current.
 *
 * The rules of saved_state:
 *
 *   The related locking code always holds p::pi_lock when updating
 *   p::saved_state, which means the code is fully serialized in both cases.
 *
 *   For PREEMPT_RT, the lock wait and lock wakeups happen via TASK_RTLOCK_WAIT.
 *   No other bits set. This allows to distinguish all wakeup scenarios.
 *
 *   For FREEZER, the wakeup happens via TASK_FROZEN. No other bits set. This
 *   allows us to prevent early wakeup of tasks before they can be run on
 *   asymmetric ISA architectures (eg ARMv9).
 */
static __always_inline
bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
{
	int match;

	if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
		WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
			     state != TASK_RTLOCK_WAIT);
	}

	*success = !!(match = __task_state_match(p, state));

	/*
	 * Saved state preserves the task state across blocking on
	 * an RT lock or TASK_FREEZABLE tasks.  If the state matches,
	 * set p::saved_state to TASK_RUNNING, but do not wake the task
	 * because it waits for a lock wakeup or __thaw_task(). Also
	 * indicate success because from the regular waker's point of
	 * view this has succeeded.
	 *
	 * After acquiring the lock the task will restore p::__state
	 * from p::saved_state which ensures that the regular
	 * wakeup is not lost. The restore will also set
	 * p::saved_state to TASK_RUNNING so any further tests will
	 * not result in false positives vs. @success
	 */
	if (match < 0)
		p->saved_state = TASK_RUNNING;

	return match > 0;
}

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

#ifdef CONFIG_RT_MUTEXES
	lockdep_assert(!tsk->sched_rt_mutex);
#endif

	if (!task_is_running(tsk))
		sched_submit_work(tsk);
	__schedule_loop(SM_NONE);
	sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);