The poll/select/epoll mechanisms in the Linux kernel are the foundation of I/O multiplexing: they let a user-space process monitor many file descriptors at once and wait for any of them to become ready. For TCP sockets, the tcp_poll function is the core of this machinery. Based on the Linux 6.18 kernel sources, this article analyzes the implementation of tcp_poll and explores the wait queues, wakeup path, and synchronization primitives behind it, showing how the kernel performs this state check safely and efficiently without taking the socket lock.
1. The Entry Point: tcp_poll
tcp_poll is defined in net/ipv4/tcp.c. It inspects the state of a TCP socket and returns a mask describing which operations are currently possible (readable, writable, exceptional conditions, and so on). Its prototype is:
```c
__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
```
Here file is the associated file structure, sock the socket structure, and wait the poll_table used to register a callback on the wait queue.
The function starts by calling sock_poll_wait(file, sock, wait), the key step that adds the current process to the socket's wait queue (analyzed in detail below). It then loads the socket state:
```c
state = inet_sk_state_load(sk);
if (state == TCP_LISTEN)
        return inet_csk_listen_poll(sk);
```
If the socket is in the listening state, inet_csk_listen_poll directly returns the poll mask for a listening socket (typically indicating that a new connection can be accepted). Non-listening sockets fall through to the main logic.
Building the mask
Next, the code builds mask step by step from the socket's shutdown state, connection state, and readability/writability conditions. The key points:
- EPOLLHUP handling: a long comment discusses the subtleties of EPOLLHUP. The conclusion: set EPOLLHUP when the socket is shut down in both directions (shutdown == SHUTDOWN_MASK) or is in the TCP_CLOSE state.
- Readability: tcp_stream_is_readable(sk, target) decides whether data can be read, where target accounts for the receive low-water mark and urgent data.
- Writability: __sk_stream_is_writeable(sk, 1) checks whether the send buffer has room. If not, the code sets the SOCK_NOSPACE flag and uses a memory barrier so that a later release of write space correctly raises the signal.
- Urgent data: if urg_data is valid, EPOLLPRI is added.
- TCP Fast Open: for a fast-open socket in TCP_SYN_SENT with DEFER_CONNECT set, EPOLLOUT is returned so that user space can call write() to trigger the SYN.
- Error handling: finally, an smp_rmb() paired with the smp_wmb() in tcp_reset ensures the latest error state is read; if sk_err is set or the error queue is non-empty, EPOLLERR is added.
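The low-water-mark behavior mentioned above is observable from user space. The sketch below is my own illustration (not kernel code): it creates a loopback TCP pair, sets SO_RCVLOWAT on the receiver, and checks that poll() reports readability only once the queued bytes reach the watermark, behavior Linux has honored for TCP since 2.6.28:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd became readable within timeout_ms, 0 otherwise. */
static int readable_within(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    return n > 0 && (pfd.revents & POLLIN);
}

int demo_rcvlowat(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET };
    socklen_t len = sizeof(addr);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0)
        return -1;
    getsockname(lfd, (struct sockaddr *)&addr, &len); /* learn the ephemeral port */

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    int sfd = accept(lfd, NULL, NULL);

    /* Ask poll to report readability only once 100 bytes are queued. */
    int lowat = 100;
    setsockopt(cfd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));

    char buf[100] = { 0 };
    (void)!write(sfd, buf, 10);             /* below the watermark */
    int below = readable_within(cfd, 200);  /* expect: not readable */

    (void)!write(sfd, buf, 100);            /* now 110 bytes queued */
    int above = readable_within(cfd, 2000); /* expect: readable */

    close(sfd); close(cfd); close(lfd);
    return (above && !below) ? 0 : 1;
}
```

This is exactly the target that tcp_poll hands to tcp_stream_is_readable: below the watermark the socket stays "not readable" even though bytes are queued.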
Notably, the comment at the top of the function states: "Socket is not locked. We are protected from async events by poll logic and correct handling of state changes". In other words, tcp_poll does not hold the socket lock; it relies on the concurrency control of the poll framework plus carefully placed memory barriers for correctness.
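To make the mask semantics concrete, here is a small user-space sketch (my own illustration of the interface, not kernel code; it uses an AF_UNIX socketpair, whose poll implementation builds an analogous mask): a fresh connected socket is writable but not readable, a peer write adds POLLIN, and a peer shutting down its send side adds POLLRDHUP on our read side:

```c
#define _GNU_SOURCE /* for POLLRDHUP */
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the revents mask poll() reports for fd right now (0 timeout). */
static short poll_mask(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT | POLLRDHUP };
    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return pfd.revents;
}

int demo_masks(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Fresh connected socket: writable, not yet readable. */
    short m = poll_mask(sv[0]);
    assert((m & POLLOUT) && !(m & POLLIN));

    /* Peer writes: POLLIN appears without losing POLLOUT. */
    (void)!write(sv[1], "x", 1);
    m = poll_mask(sv[0]);
    assert((m & POLLIN) && (m & POLLOUT));

    /* Peer shuts down its send side: we additionally see POLLRDHUP. */
    shutdown(sv[1], SHUT_WR);
    m = poll_mask(sv[0]);
    assert(m & POLLRDHUP);

    close(sv[0]);
    close(sv[1]);
    return 0;
}
```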
2. Wait Queues and poll_table: the Trick in sock_poll_wait
sock_poll_wait is an inline function defined in include/net/sock.h:
```c
static inline void sock_poll_wait(struct file *filp, struct socket *sock,
                                  poll_table *p)
{
        if (!poll_does_not_wait(p)) {
                poll_wait(filp, &sock->wq.wait, p);
                smp_mb();
        }
}
```
It first checks that p is non-NULL (that is, the caller actually intends to wait), then calls the generic poll_wait to attach the current process to the socket's wait queue head, sock->wq.wait. The crucial part is the smp_mb() memory barrier inserted right afterwards.
Why is this barrier needed? The comment says: "We need to be sure we are in sync with the socket flags modification. This memory barrier is paired in the wq_has_sleeper." After the wait entry is added, the subsequent reads of socket flags (such as the readable state) must not be reordered before the add, and the waker must be able to observe that the wait entry exists. This barrier pairs with wq_has_sleeper on the wakeup path (which also implies an smp_mb()), forming a publish/subscribe style of synchronization that guarantees the waiter cannot miss a wakeup.
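This pairing can be modeled in user space with C11 atomics. The sketch below is a simplified analogy, not the kernel code: the names have_sleeper and event_ready are mine, and a semaphore stands in for the scheduler. The waiter publishes itself before checking the condition, the waker sets the condition before checking for sleepers, and the two full fences guarantee at least one side observes the other, so the wakeup cannot be lost:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

static atomic_int event_ready;  /* "the socket became readable" */
static atomic_int have_sleeper; /* "a process sits on the wait queue" */
static sem_t wakeup;            /* stand-in for the real wakeup */

static void *waiter(void *arg)
{
    (void)arg;
    /* 1. add ourselves to the wait queue (poll_wait) */
    atomic_store_explicit(&have_sleeper, 1, memory_order_relaxed);
    /* 2. the smp_mb() in sock_poll_wait */
    atomic_thread_fence(memory_order_seq_cst);
    /* 3. re-check the condition (tcp_poll reading socket state) */
    if (!atomic_load_explicit(&event_ready, memory_order_relaxed))
        sem_wait(&wakeup);      /* safe to sleep: the waker must see us */
    return NULL;
}

static void *waker(void *arg)
{
    (void)arg;
    /* 1. make the event true (data queued on the socket) */
    atomic_store_explicit(&event_ready, 1, memory_order_relaxed);
    /* 2. the barrier implied by wq_has_sleeper */
    atomic_thread_fence(memory_order_seq_cst);
    /* 3. only wake if someone is on the queue */
    if (atomic_load_explicit(&have_sleeper, memory_order_relaxed))
        sem_post(&wakeup);
    return NULL;
}

int demo_no_lost_wakeup(void)
{
    pthread_t a, b;
    sem_init(&wakeup, 0, 0);
    pthread_create(&a, NULL, waiter, NULL);
    pthread_create(&b, NULL, waker, NULL);
    pthread_join(a, NULL); /* would hang forever if the wakeup were lost */
    pthread_join(b, NULL);
    sem_destroy(&wakeup);
    return 0;
}
```

This is the store-buffering pattern: without the two fences, both sides could read the other's flag as 0 and the waiter would sleep forever.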
3. The Wakeup Path: From Data Arrival to a Runnable Process
When TCP data arrives or send-buffer space is released, the kernel must wake up the processes waiting on the socket. Taking data arrival as the example, functions such as tcp_data_queue eventually invoke the sk->sk_data_ready callback, which is normally set to sock_def_readable:
```c
void sock_def_readable(struct sock *sk)
{
        struct socket_wq *wq;

        rcu_read_lock();
        wq = rcu_dereference(sk->sk_wq);
        if (skwq_has_sleeper(wq))
                wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
                                                EPOLLRDNORM | EPOLLRDBAND);
        sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
        rcu_read_unlock();
}
```
- The wait queue head is obtained safely under RCU.
- If there are sleepers, wake_up_interruptible_sync_poll wakes them.
- An asynchronous notification (such as SIGIO) may also be triggered.
The wake_up_interruptible_sync_poll macro expands to __wake_up_sync_key, which ultimately calls __wake_up_common_lock: it takes the wait-queue lock, walks the queue, and invokes each entry's callback function (usually default_wake_function).
__wake_up_sync_key passes the WF_SYNC flag, marking this as a synchronous wakeup: the waker is itself about to sleep, so the woken process should preferably not migrate to another CPU, which reduces cache bouncing. This matters in the socket wakeup scenario.
The wait entry's callback: default_wake_function
default_wake_function simply calls try_to_wake_up on the process stored in curr->private. try_to_wake_up is the heart of the wakeup operation; its complexity lies in handling the many possible concurrent interleavings correctly.
4. try_to_wake_up in Depth
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) sets process p to TASK_RUNNING and places it on a run queue. Its implementation is full of memory barriers and fine-grained locking to ensure correctness.
The function first special-cases waking the current process (p == current), relying on program-order guarantees and disabled interrupts to avoid deadlock. When waking another process, it follows these steps:
- Take p->pi_lock: raw_spin_lock_irqsave protects the process's scheduling-related fields.
- Match the state: ttwu_state_match checks whether the process's current state is within the allowed state mask. This also involves saved_state handling, used for RT mutexes and the freezer.
- Memory barrier: an smp_rmb() is issued before loading p->on_rq, pairing with barriers in the scheduler so that the latest on_rq value is observed.
- Already runnable?: if p->on_rq is set and ttwu_runnable succeeds (the process is already running or runnable on some CPU), return immediately.
- Handle migration: on SMP, the process may still be mid-schedule on another CPU. smp_cond_load_acquire(&p->on_cpu, !VAL) waits for that to finish (on_cpu becomes 0), and then a suitable CPU is selected for migration.
- Enqueue: ttwu_queue places the process on the target CPU's run queue, possibly sending an IPI to wake that CPU.
The whole sequence uses several memory barriers, such as smp_mb__after_spinlock(), smp_rmb(), and smp_acquire__after_ctrl_dep(); together with the scheduler's own barriers they guarantee ordering consistency between the waker and the scheduler.
5. The Scheduling Point: schedule
When a process has nothing to do, it calls schedule() to voluntarily yield the CPU. schedule is a thin wrapper around __schedule_loop, which ultimately performs the context switch. Note that schedule is typically reached while waiting for an event: in the poll system call, for instance, the process sleeps on the wait queue when no event is pending, until it is woken.
6. Synchronization Summary: the Full Chain from poll to Wakeup
We can now connect the whole flow:
- Entering poll: user space calls poll/select/epoll and the kernel eventually calls tcp_poll. sock_poll_wait adds the current process to the socket's wait queue and issues a memory barrier so the subsequent state checks are sound.
- State checks: tcp_poll reads the socket's various states (sk_shutdown, tp->urg_data, sk_err, and so on) without taking the lock. These reads could be reordered by the compiler or CPU, but the earlier barrier, paired with the barrier on the wakeup path, guarantees consistent values.
- Data arrival: the network interrupt or softirq processes incoming data and ends up in sock_def_readable, which checks the wait queue and, if there is a sleeper, triggers the wakeup via __wake_up_sync_key.
- Wakeup: __wake_up_common walks the wait queue and calls default_wake_function on each entry, which makes the process runnable through try_to_wake_up. The barriers inside try_to_wake_up ensure the visibility of the process state.
- Scheduling: the woken process is eventually picked by the scheduler, returns from the poll system call, and user space receives the event mask.
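The end-to-end chain can be exercised from user space: one thread blocks in epoll_wait, another writes to the peer socket, and the resulting sk_data_ready wakeup makes epoll_wait return with EPOLLIN. This sketch is my own illustration and uses an AF_UNIX socketpair for self-containment; the wakeup goes through the same sock_def_readable machinery:

```c
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static void *producer(void *arg)
{
    int fd = *(int *)arg;
    usleep(100 * 1000);          /* let the consumer block in epoll_wait first */
    (void)!write(fd, "ping", 4); /* queues data -> sk_data_ready -> wakeup */
    return NULL;
}

int demo_epoll_wakeup(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
    epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);

    pthread_t t;
    pthread_create(&t, NULL, producer, &sv[1]);

    /* Blocks on the wait queue until the write-side wakeup fires. */
    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, 5000);

    pthread_join(t, NULL);
    close(ep); close(sv[0]); close(sv[1]);
    return (n == 1 && (out.events & EPOLLIN)) ? 0 : 1;
}
```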
Throughout this chain, memory barriers play the key role: they guarantee cross-CPU data consistency and prevent compiler and processor reordering from breaking things, allowing the lock-free tcp_poll to run safely. For example:
- The smp_mb() in sock_poll_wait pairs with the implied barrier in wq_has_sleeper, ensuring that adding and checking wait entries cannot lose a wakeup.
- The smp_rmb() near the end of tcp_poll pairs with the smp_wmb() in tcp_reset, ensuring the error state read is up to date.
- The many barriers in try_to_wake_up order the process-state and run-queue operations correctly.
7. Conclusion
The Linux kernel's TCP poll machinery shows how, in a highly concurrent environment, careful use of wait queues, memory barriers, and lock-free design yields efficient and correct I/O event notification. From the compact implementation of tcp_poll to the intricate logic of try_to_wake_up, every detail reflects the kernel developers' relentless pursuit of both performance and correctness. Understanding this code deepens not only our knowledge of network programming but also our grasp of concurrent programming in general.
Appendix: Source
/*
* Wait for a TCP event.
*
* Note that we don't need to lock the socket, as the upper poll layers
* take care of normal races (between the test and the event) and we don't
* go look at any of the socket buffers directly.
*/
__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
__poll_t mask;
struct sock *sk = sock->sk;
const struct tcp_sock *tp = tcp_sk(sk);
u8 shutdown;
int state;
sock_poll_wait(file, sock, wait);
state = inet_sk_state_load(sk);
if (state == TCP_LISTEN)
return inet_csk_listen_poll(sk);
/* Socket is not locked. We are protected from async events
* by poll logic and correct handling of state changes
* made by other threads is impossible in any case.
*/
mask = 0;
/*
* EPOLLHUP is certainly not done right. But poll() doesn't
* have a notion of HUP in just one direction, and for a
* socket the read side is more interesting.
*
* Some poll() documentation says that EPOLLHUP is incompatible
* with the EPOLLOUT/POLLWR flags, so somebody should check this
* all. But careful, it tends to be safer to return too many
* bits than too few, and you can easily break real applications
* if you don't tell them that something has hung up!
*
* Check-me.
*
* Check number 1. EPOLLHUP is _UNMASKABLE_ event (see UNIX98 and
* our fs/select.c). It means that after we received EOF,
* poll always returns immediately, making impossible poll() on write()
* in state CLOSE_WAIT. One solution is evident --- to set EPOLLHUP
* if and only if shutdown has been made in both directions.
* Actually, it is interesting to look how Solaris and DUX
* solve this dilemma. I would prefer, if EPOLLHUP were maskable,
* then we could set it on SND_SHUTDOWN. BTW examples given
* in Stevens' books assume exactly this behaviour, it explains
* why EPOLLHUP is incompatible with EPOLLOUT. --ANK
*
* NOTE. Check for TCP_CLOSE is added. The goal is to prevent
* blocking on fresh not-connected or disconnected socket. --ANK
*/
shutdown = READ_ONCE(sk->sk_shutdown);
if (shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
mask |= EPOLLHUP;
if (shutdown & RCV_SHUTDOWN)
mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
/* Connected or passive Fast Open socket? */
if (state != TCP_SYN_SENT &&
(state != TCP_SYN_RECV || rcu_access_pointer(tp->fastopen_rsk))) {
int target = sock_rcvlowat(sk, 0, INT_MAX);
u16 urg_data = READ_ONCE(tp->urg_data);
if (unlikely(urg_data) &&
READ_ONCE(tp->urg_seq) == READ_ONCE(tp->copied_seq) &&
!sock_flag(sk, SOCK_URGINLINE))
target++;
if (tcp_stream_is_readable(sk, target))
mask |= EPOLLIN | EPOLLRDNORM;
if (!(shutdown & SEND_SHUTDOWN)) {
if (__sk_stream_is_writeable(sk, 1)) {
mask |= EPOLLOUT | EPOLLWRNORM;
} else { /* send SIGIO later */
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
/* Race breaker. If space is freed after
* wspace test but before the flags are set,
* IO signal will be lost. Memory barrier
* pairs with the input side.
*/
smp_mb__after_atomic();
if (__sk_stream_is_writeable(sk, 1))
mask |= EPOLLOUT | EPOLLWRNORM;
}
} else
mask |= EPOLLOUT | EPOLLWRNORM;
if (urg_data & TCP_URG_VALID)
mask |= EPOLLPRI;
} else if (state == TCP_SYN_SENT &&
inet_test_bit(DEFER_CONNECT, sk)) {
/* Active TCP fastopen socket with defer_connect
* Return EPOLLOUT so application can call write()
* in order for kernel to generate SYN+data
*/
mask |= EPOLLOUT | EPOLLWRNORM;
}
/* This barrier is coupled with smp_wmb() in tcp_reset() */
smp_rmb();
if (READ_ONCE(sk->sk_err) ||
!skb_queue_empty_lockless(&sk->sk_error_queue))
mask |= EPOLLERR;
return mask;
}
EXPORT_SYMBOL(tcp_poll);
/**
* sock_poll_wait - place memory barrier behind the poll_wait call.
* @filp: file
* @sock: socket to wait on
* @p: poll_table
*
* See the comments in the wq_has_sleeper function.
*/
static inline void sock_poll_wait(struct file *filp, struct socket *sock,
poll_table *p)
{
if (!poll_does_not_wait(p)) {
poll_wait(filp, &sock->wq.wait, p);
/* We need to be sure we are in sync with the
* socket flags modification.
*
* This memory barrier is paired in the wq_has_sleeper.
*/
smp_mb();
}
}
void sock_def_readable(struct sock *sk)
{
struct socket_wq *wq;
trace_sk_data_ready(sk);
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
if (skwq_has_sleeper(wq))
wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
EPOLLRDNORM | EPOLLRDBAND);
sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
rcu_read_unlock();
}
#define wake_up_interruptible_sync_poll(x, m) \
__wake_up_sync_key((x), TASK_INTERRUPTIBLE, poll_to_key(m))
/**
* __wake_up_sync_key - wake up threads blocked on a waitqueue.
* @wq_head: the waitqueue
* @mode: which threads
* @key: opaque value to be passed to wakeup targets
*
* The sync wakeup differs that the waker knows that it will schedule
* away soon, so while the target thread will be woken up, it will not
* be migrated to another CPU - ie. the two threads are 'synchronized'
* with each other. This can prevent needless bouncing between CPUs.
*
* On UP it can prevent extra preemption.
*
* If this function wakes up a task, it executes a full memory barrier before
* accessing the task state.
*/
void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode,
void *key)
{
if (unlikely(!wq_head))
return;
__wake_up_common_lock(wq_head, mode, 1, WF_SYNC, key);
}
EXPORT_SYMBOL_GPL(__wake_up_sync_key);
static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, int wake_flags, void *key)
{
unsigned long flags;
int remaining;
spin_lock_irqsave(&wq_head->lock, flags);
remaining = __wake_up_common(wq_head, mode, nr_exclusive, wake_flags,
key);
spin_unlock_irqrestore(&wq_head->lock, flags);
return nr_exclusive - remaining;
}
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
* number) then we wake that number of exclusive tasks, and potentially all
* the non-exclusive tasks. Normally, exclusive tasks will be at the end of
* the list and any non-exclusive tasks will be woken first. A priority task
* may be at the head of the list, and can consume the event without any other
* tasks being woken.
*
* There are circumstances in which we can try to wake a task which has already
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
int nr_exclusive, int wake_flags, void *key)
{
wait_queue_entry_t *curr, *next;
lockdep_assert_held(&wq_head->lock);
curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
if (&curr->entry == &wq_head->head)
return nr_exclusive;
list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
unsigned flags = curr->flags;
int ret;
ret = curr->func(curr, mode, wake_flags, key);
if (ret < 0)
break;
if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
}
return nr_exclusive;
}
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
void *key)
{
WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
return try_to_wake_up(curr->private, mode, wake_flags);
}
EXPORT_SYMBOL(default_wake_function);
/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
* @state: the mask of task states that can be woken
* @wake_flags: wake modifier flags (WF_*)
*
* Conceptually does:
*
* If (@state & @p->state) @p->state = TASK_RUNNING.
*
* If the task was not queued/runnable, also place it back on a runqueue.
*
* This function is atomic against schedule() which would dequeue the task.
*
* It issues a full memory barrier before accessing @p->state, see the comment
* with set_current_state().
*
* Uses p->pi_lock to serialize against concurrent wake-ups.
*
* Relies on p->pi_lock stabilizing:
* - p->sched_class
* - p->cpus_ptr
* - p->sched_task_group
* in order to do migration, see its use of select_task_rq()/set_task_cpu().
*
* Tries really hard to only take one task_rq(p)->lock for performance.
* Takes rq->lock in:
* - ttwu_runnable() -- old rq, unavoidable, see comment there;
* - ttwu_queue() -- new rq, for enqueue of the task;
* - psi_ttwu_dequeue() -- much sadness :-( accounting will kill us.
*
* As a consequence we race really badly with just about everything. See the
* many memory barriers and their comments for details.
*
* Return: %true if @p->state changes (an actual wakeup was done),
* %false otherwise.
*/
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
guard(preempt)();
int cpu, success = 0;
if (p == current) {
/*
* We're waking current, this means 'p->on_rq' and 'task_cpu(p)
* == smp_processor_id()'. Together this means we can special
* case the whole 'p->on_rq && ttwu_runnable()' case below
* without taking any locks.
*
* In particular:
* - we rely on Program-Order guarantees for all the ordering,
* - we're serialized against set_special_state() by virtue of
* it disabling IRQs (this allows not taking ->pi_lock).
*/
if (!ttwu_state_match(p, state, &success))
goto out;
trace_sched_waking(p);
ttwu_do_wakeup(p);
goto out;
}
/*
* If we are going to wake up a thread waiting for CONDITION we
* need to ensure that CONDITION=1 done by the caller can not be
* reordered with p->state check below. This pairs with smp_store_mb()
* in set_current_state() that the waiting thread does.
*/
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
smp_mb__after_spinlock();
if (!ttwu_state_match(p, state, &success))
break;
trace_sched_waking(p);
/*
* Ensure we load p->on_rq _after_ p->state, otherwise it would
* be possible to, falsely, observe p->on_rq == 0 and get stuck
* in smp_cond_load_acquire() below.
*
* sched_ttwu_pending() try_to_wake_up()
* STORE p->on_rq = 1 LOAD p->state
* UNLOCK rq->lock
*
* __schedule() (switch to task 'p')
* LOCK rq->lock smp_rmb();
* smp_mb__after_spinlock();
* UNLOCK rq->lock
*
* [task p]
* STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* A similar smp_rmb() lives in __task_needs_rq_lock().
*/
smp_rmb();
if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
break;
#ifdef CONFIG_SMP
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
*
* One must be running (->on_cpu == 1) in order to remove oneself
* from the runqueue.
*
* __schedule() (switch to task 'p') try_to_wake_up()
* STORE p->on_cpu = 1 LOAD p->on_rq
* UNLOCK rq->lock
*
* __schedule() (put 'p' to sleep)
* LOCK rq->lock smp_rmb();
* smp_mb__after_spinlock();
* STORE p->on_rq = 0 LOAD p->on_cpu
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* Form a control-dep-acquire with p->on_rq == 0 above, to ensure
* schedule()'s deactivate_task() has 'happened' and p will no longer
* care about it's own p->state. See the comment in __schedule().
*/
smp_acquire__after_ctrl_dep();
/*
* We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
* == 0), which means we need to do an enqueue, change p->state to
* TASK_WAKING such that we can unlock p->pi_lock before doing the
* enqueue, such as ttwu_queue_wakelist().
*/
WRITE_ONCE(p->__state, TASK_WAKING);
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, considering queueing p on the remote CPUs wake_list
* which potentially sends an IPI instead of spinning on p->on_cpu to
* let the waker make forward progress. This is safe because IRQs are
* disabled and the IPI will deliver after on_cpu is cleared.
*
* Ensure we load task_cpu(p) after p->on_cpu:
*
* set_task_cpu(p, cpu);
* STORE p->cpu = @cpu
* __schedule() (switch to task 'p')
* LOCK rq->lock
* smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
* STORE p->on_cpu = 1 LOAD p->cpu
*
* to ensure we observe the correct CPU on which the task is currently
* scheduling.
*/
if (smp_load_acquire(&p->on_cpu) &&
ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
break;
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, wait until it's done referencing the task.
*
* Pairs with the smp_store_release() in finish_task().
*
* This ensures that tasks getting woken will be fully ordered against
* their previous state and preserve Program Order.
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
delayacct_blkio_end(p);
atomic_dec(&task_rq(p)->nr_iowait);
}
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
#else
cpu = task_cpu(p);
#endif /* CONFIG_SMP */
ttwu_queue(p, cpu, wake_flags);
}
out:
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
return success;
}
/*
* Invoked from try_to_wake_up() to check whether the task can be woken up.
*
* The caller holds p::pi_lock if p != current or has preemption
* disabled when p == current.
*
* The rules of saved_state:
*
* The related locking code always holds p::pi_lock when updating
* p::saved_state, which means the code is fully serialized in both cases.
*
* For PREEMPT_RT, the lock wait and lock wakeups happen via TASK_RTLOCK_WAIT.
* No other bits set. This allows to distinguish all wakeup scenarios.
*
* For FREEZER, the wakeup happens via TASK_FROZEN. No other bits set. This
* allows us to prevent early wakeup of tasks before they can be run on
* asymmetric ISA architectures (eg ARMv9).
*/
static __always_inline
bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
{
int match;
if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
state != TASK_RTLOCK_WAIT);
}
*success = !!(match = __task_state_match(p, state));
/*
* Saved state preserves the task state across blocking on
* an RT lock or TASK_FREEZABLE tasks. If the state matches,
* set p::saved_state to TASK_RUNNING, but do not wake the task
* because it waits for a lock wakeup or __thaw_task(). Also
* indicate success because from the regular waker's point of
* view this has succeeded.
*
* After acquiring the lock the task will restore p::__state
* from p::saved_state which ensures that the regular
* wakeup is not lost. The restore will also set
* p::saved_state to TASK_RUNNING so any further tests will
* not result in false positives vs. @success
*/
if (match < 0)
p->saved_state = TASK_RUNNING;
return match > 0;
}
asmlinkage __visible void __sched schedule(void)
{
struct task_struct *tsk = current;
#ifdef CONFIG_RT_MUTEXES
lockdep_assert(!tsk->sched_rt_mutex);
#endif
if (!task_is_running(tsk))
sched_submit_work(tsk);
__schedule_loop(SM_NONE);
sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);