Introduction: the synchronization bridge between user space and the kernel
In multithreaded programming, locking is the foundation of data consistency. When we call pthread_mutex_lock in user space, a complex but carefully crafted kernel implementation is hiding behind it. Two core components are involved: the futex (Fast Userspace Mutex) and the spinlock. The futex handles the synchronization handshake between user space and the kernel, while the spinlock protects the kernel's internal data structures.
Based on the Linux 6.8.12 sources, this article traces the complete path from lock contention in user space to the wait queues in the kernel. We follow the execution flow, analyse the key functions line by line, and show how the kernel handles lock contention efficiently, how it organizes its wait queues, and how it uses low-level atomic operations to stay correct under concurrency.
Part 1: futex core concepts and data structures
1.1 What is a futex?
A futex is Linux's fast user-space mutex mechanism. Its core idea: in the uncontended case, everything happens in user space and no trap into the kernel is needed; only when contention occurs does a system call enter the kernel to queue and to wake.
This design is a huge performance win, because the vast majority of lock operations are uncontended. futex_wait and futex_wake are the two most important operations, as the sketch below illustrates.
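To make the fast-path/slow-path split concrete, here is a minimal user-space sketch built directly on the futex syscall. It follows the well-known 0/1/2 state scheme from Drepper's "Futexes Are Tricky"; the wrapper, variable names and state encoding are assumptions of this example, not taken from the kernel sources discussed below.
c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal sketch: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters. */
static atomic_uint lock_word;

static long futex(atomic_uint *uaddr, int op, unsigned int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void lock(void)
{
	unsigned int c = 0;
	/* Fast path: uncontended CAS from 0 to 1, no syscall at all. */
	if (atomic_compare_exchange_strong(&lock_word, &c, 1))
		return;
	/* Slow path: advertise contention (state 2) and sleep in the kernel. */
	if (c != 2)
		c = atomic_exchange(&lock_word, 2);
	while (c != 0) {
		futex(&lock_word, FUTEX_WAIT_PRIVATE, 2);	/* sleep while value == 2 */
		c = atomic_exchange(&lock_word, 2);
	}
}

static void unlock(void)
{
	if (atomic_exchange(&lock_word, 0) == 2)		/* were there waiters? */
		futex(&lock_word, FUTEX_WAKE_PRIVATE, 1);	/* wake one of them */
}
Note that the futex syscalls appear only on the contended paths; the common case is a single CAS in user space.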
1.2 The key data structures: futex_q and futex_hash_bucket
The sources define two crucial data structures:
c
struct futex_q {
struct plist_node list; // priority-list node
struct task_struct *task; // the waiting task
spinlock_t *lock_ptr; // points to the hash bucket's lock
u32 bitset; // bitmask used for conditional wakeups
union futex_key key; // uniquely identifies a futex object
struct futex_pi_state *pi_state;
struct rt_mutex_waiter *rt_waiter;
union futex_key *requeue_pi_key;
};
A futex_q represents one thread waiting on a futex. Every waiting thread owns a futex_q, and that structure is hooked into the list of one hash bucket.
c
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
futex_queues is the global array of hash buckets; each bucket contains a spinlock and a priority list. Hashing a futex address onto a specific bucket spreads the lock contention effectively.
1.3 futex_hash: locating the wait queue quickly
c
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
This uses jhash2 (a Jenkins hash) to reduce the futex_key to a 32-bit hash value. offsetof(typeof(*key), both.offset) / 4 is the number of 32-bit words to hash (jhash2 works in units of u32). The bucket index is hash & (futex_hashsize - 1), which requires futex_hashsize to be a power of two so that the bitwise AND can replace a modulo, as the quick check below demonstrates.
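A throwaway snippet (not from the kernel) confirming that the mask and the modulo agree whenever the table size is a power of two:
c
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t hashsize = 256; /* must be a power of two */

	for (uint32_t hash = 0; hash < 100000; hash++)
		assert((hash & (hashsize - 1)) == (hash % hashsize));
	return 0;
}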
Part 2: from user-space contention to kernel wait - the complete futex_wait flow
2.1 The system call entry point: futex_wait
c
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags, current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
// ... handle timeout and syscall restart
}
The function begins with a debug print (marked "// yym-gaizao" in the full listing at the end, i.e. instrumentation added for this analysis) that records the caller's process ID (tgid, printed as pid=), thread ID (printed as tid=), the address being waited on, the expected value, the bitset and the flags. It then arms a timeout timer through futex_setup_timer (if the caller supplied an absolute time) and finally calls the core function __futex_wait.
2.2 The core wait logic: __futex_wait
c
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
futex_wait_queue(hb, &q, to);
if (!futex_unqueue(&q))
return 0;
// ... handle signals and retry
}
This code lays out the three steps of waiting:
- futex_wait_setup: preparation - take the hash bucket lock and validate the user-space value
- futex_wait_queue: put the current thread on the wait queue and schedule away
- futex_unqueue: remove the entry from the queue (returns false if someone already woke us)
2.3 The critical setup phase: futex_wait_setup
c
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
This function carries a very important comment that explains why the lock must be taken before the read:
The basic logical guarantee of a futex is that it blocks ONLY if cond(var) is known to be true at the time of blocking, for any cond. If we locked the hash-bucket after testing *uaddr, that would open a race condition where we could block indefinitely with cond(var) false.
Consider the following race:
- Thread A checks that *uaddr == val, the condition holds, and it prepares to enter the kernel and wait.
- At the same moment, thread B stores a new value to *uaddr and calls futex_wake.
- If A tested the condition first and only took the lock afterwards, B's wakeup could arrive before A is on the wait queue.
- Result: A sleeps forever, because the value has already changed and no further wakeup will ever come.
By taking the lock first and only then reading the user-space value, we guarantee that if A ultimately decides to wait, any wakeup issued by B will find A on the queue.
2.4 Taking the hash bucket lock: futex_q_lock
c
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
futex_hb_waiters_inc(hb); // bump the waiter count; implies smp_mb()
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
There is a subtle piece of design here: the waiter count is incremented before the lock is taken. The comment explains why:
Increment the counter before taking the lock so that a potential waker won't miss a to-be-slept task that is waiting for the spinlock.
Without this increment, the following could happen:
- Thread A has computed its hash bucket but has not taken the lock yet.
- Thread B calls futex_wake, sees a waiter count of 0, and returns immediately.
- Thread A then takes the lock and joins the queue, but is never woken.
futex_hb_waiters_inc implies a full memory barrier (smp_mb()), ordering the increment before the waiter's later accesses so that the other CPUs cannot miss it; the helper is sketched below.
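For reference, the helper boils down to roughly the following (a sketch of what lives in kernel/futex/futex.h, assuming CONFIG_SMP; treat it as illustrative rather than verbatim source):
c
static inline void futex_hb_waiters_inc(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
	atomic_inc(&hb->waiters);
	/* Full barrier (A); pairs with the barrier on the futex_wake() side. */
	smp_mb__after_atomic();
#endif
}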
2.5 Entering the wait queue: futex_wait_queue
c
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
futex_queue(q, hb);
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
if (likely(!plist_node_empty(&q->list))) {
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
Key points:
- set_current_state: moves the task into an interruptible sleep state; it is implemented with smp_store_mb(), which provides the required memory barrier.
- futex_queue: adds q to the hash bucket's list and then releases the spinlock.
- schedule: is called only while q is still on the list (if it is not, another task has already woken us); if the timeout has already fired, timeout->task is NULL and the call is skipped as well.
2.6 Joining the priority queue: futex_queue and __futex_queue
c
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// debug output: dump all waiters on this bucket (added instrumentation)
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
A priority list (plist) is used here instead of a plain FIFO queue. The priority is computed as follows:
- Real-time threads (priority < MAX_RT_PRIO): use their actual priority.
- Normal threads: use MAX_RT_PRIO.
So when several threads wait on the same futex, real-time threads are woken first, ordered among themselves by priority, and all normal threads come last in FIFO order. For real-time systems this matters a great deal.
2.7 plist_add: the priority-list insertion algorithm
c
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next, struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
}
A plist is really two linked lists:
- prio_list: ordered by priority, with each priority value appearing at most once
- node_list: every node, with nodes of equal priority grouped together
On insertion:
- Walk prio_list to find the first node whose prio value is greater than the new node's (i.e. a lower priority) and insert before it.
- If the new node's priority differs from prev's, add it to prio_list as well.
- Splice the new node into node_list at the corresponding position.
This layout keeps priority-ordered traversal cheap while storing only one prio_list entry per distinct priority, as illustrated below.
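A small illustration (the priorities are made up for the example) of how the two lists relate after three waiters with priorities 10, 10 and 100 have been queued:
c
/*
 * node_list:  A(10) -> B(10) -> C(100)   (every waiter, grouped by priority)
 * prio_list:  A(10) ----------> C(100)   (one entry per distinct priority)
 *
 * The highest-priority waiter is simply plist_first(), i.e. the head of
 * node_list. Finding an insertion point walks prio_list only, which has one
 * node per distinct priority rather than one per waiter.
 */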
Part 3: the spinlock - a cornerstone of kernel concurrency
In the futex implementation, spin_lock(&hb->lock) is used everywhere. The spinlock is one of the most basic synchronization primitives in the kernel; here it protects the hash bucket's list operations. Let's dig into how it is implemented.
3.1 From spin_lock to raw_spin_lock
c
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
#define raw_spin_lock(lock) _raw_spin_lock(lock)
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
The important configuration option here is CONFIG_INLINE_SPIN_LOCK. When inlining is enabled, _raw_spin_lock is expanded at the call site, saving a function call; otherwise it is built as a standalone, exported function that modules can use.
3.2 The core implementation: __raw_spin_lock
c
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
Three things happen here:
- preempt_disable: disables kernel preemption. This is essential for spinlocks - the holder must not be preempted, otherwise other CPUs could spin forever waiting for this CPU to release the lock.
- spin_acquire: the lockdep annotation; it only does real work when CONFIG_LOCKDEP is enabled, and is used to detect potential deadlocks.
- LOCK_CONTENDED: a macro that acquires the lock; with lock statistics enabled it first attempts a trylock so that contention can be recorded before falling back to the contended acquisition (see the sketch after this list).
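A hedged sketch of what LOCK_CONTENDED expands to, based on include/linux/lockdep.h (the exact wording may differ between kernel versions):
c
#ifdef CONFIG_LOCK_STAT
#define LOCK_CONTENDED(_lock, try, lock)			\
do {								\
	if (!try(_lock)) {					\
		lock_contended(&(_lock)->dep_map, _RET_IP_);	\
		lock(_lock);					\
	}							\
	lock_acquired(&(_lock)->dep_map, _RET_IP_);		\
} while (0)
#else
/* Without lock statistics the macro is just a plain acquisition. */
#define LOCK_CONTENDED(_lock, try, lock) \
	lock(_lock)
#endif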
3.3 Down to the architecture: arch_spin_lock
c
#define arch_spin_lock(l) queued_spin_lock(l)
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
On x86 this resolves to the queued spinlock (qspinlock). The key design:
- Fast path: a single atomic CAS tries to take the lock; if the lock word is currently 0 (unlocked), it is set to _Q_LOCKED_VAL (1, locked).
- Slow path: if the CAS fails because the lock is already held, we fall into queued_spin_lock_slowpath.
atomic_try_cmpxchg_acquire is used instead of atomic_cmpxchg because try_cmpxchg automatically updates the old value with the current value when the exchange fails, saving an extra re-read.
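The difference is easiest to see in the classic retry-loop idiom. The following kernel-style fragment is purely illustrative (FLAG and the variable names are made up for the example):
c
/* With cmpxchg: on failure we must pick up the new value ourselves. */
old = atomic_read(v);
for (;;) {
	new = old | FLAG;
	cur = atomic_cmpxchg(v, old, new);
	if (cur == old)
		break;		/* the exchange happened */
	old = cur;		/* somebody changed it; retry from the fresh value */
}

/* With try_cmpxchg: 'old' is refreshed in place whenever the exchange fails. */
old = atomic_read(v);
do {
	new = old | FLAG;
} while (!atomic_try_cmpxchg(v, &old, new));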
3.4 Implementing the atomic: from try_cmpxchg down to the LOCK prefix
c
static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
{
return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_try_cmpxchg(ptr, pold, new) __try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
This eventually expands to inline assembly:
c
case __X86_CASE_L: // the 32-bit case
{
volatile u32 *__ptr = (volatile u32 *)(_ptr);
asm volatile(lock "cmpxchgl %[new], %[ptr]"
CC_SET(z)
: CC_OUT(z) (success),
[ptr] "+m" (*__ptr),
[old] "+a" (__old)
: [new] "r" (__new)
: "memory");
break;
}
This combines the LOCK prefix with the cmpxchg instruction:
- LOCK prefix: makes the instruction atomic; on modern CPUs this usually locks the cache line rather than the whole bus.
- cmpxchg: compares eax (__old) with the destination memory; if they are equal the new value is stored, otherwise the current memory value is loaded into eax.
- CC_SET(z): the zero flag records the outcome, and the success variable receives it.
This is a classic RMW (read-modify-write) operation; on modern multicore systems its atomicity rests on the cache-coherence protocol (MESI and its derivatives).
3.5 The slow path: queued_spin_lock_slowpath
When the fast path fails - the lock is already held - the kernel enters the slow path. The function is long (well over 200 lines), so the abridged version below keeps only the core logic:
c
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
// check whether the pending-bit optimization applies
if (val == _Q_PENDING_VAL) {
// wait (a bounded number of spins) for the pending bit to clear
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
if (val & ~_Q_LOCKED_MASK)
goto queue;
// try to set the pending bit
val = queued_fetch_set_pending_acquire(lock);
if (unlikely(val & ~_Q_LOCKED_MASK)) {
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
// wait for the lock holder to go away
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
clear_pending_set_locked(lock);
return;
queue:
// queueing logic
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
// publish this node at the tail of the queue
old = xchg_tail(lock, tail);
next = NULL;
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
arch_mcs_spin_lock_contended(&node->locked);
}
// wait until we reach the head of the queue
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
// take the lock (the full code first tries an uncontended exit that also
// clears the tail when we are the only queued waiter; omitted here)
set_locked(lock);
// hand the MCS lock to the next waiter, fetching the pointer if needed
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
}
The core design ideas of the queued spinlock:
- Pending-bit optimization: with a single waiter, the pending bit avoids building a queue at all.
- MCS lock: the MCS (Mellor-Crummey and Scott) algorithm lets every CPU spin on its own local variable, so waiters do not hammer one shared variable and cause cache-line bouncing.
- FIFO order: guarantees fairness and avoids starvation.
Part 4: the MCS lock - a scalable spinlock
4.1 The problem with traditional spinlocks
In a traditional spinlock, every waiter spins on the same lock variable. When the lock is released, all of them try to grab it at once, which causes:
- Cache-line bouncing: the lock's cache line ping-pongs between CPUs.
- Amplified contention: many CPUs issue CAS operations at the same time and most of them fail.
This is the classic thundering-herd problem; the naive test-and-set lock sketched below shows where it comes from.
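For contrast, here is a minimal user-space test-and-set spinlock in C11 atomics (purely illustrative, not kernel code). Every waiter spins on the single flag, so on release all of them race for the same cache line:
c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } tas_lock_t;

static void tas_lock(tas_lock_t *l)
{
	/* Everyone spins on the same variable - the scalability problem. */
	while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire)) {
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			; /* spin read-only until it looks free, then race again */
	}
}

static void tas_unlock(tas_lock_t *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}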
4.2 The core idea of the MCS lock
An MCS lock gives every waiter its own local node, and each CPU spins only on its own node. The nodes form a linked list, and the lock holder notifies only the next node in line.
struct mcs_spinlock has two relevant fields:
- locked: becomes 1 once this node has been granted the lock
- next: points to the next node
Enqueueing:
- Create your own node.
- Atomically swing the tail pointer to your node, obtaining the previous tail, prev.
- If prev exists, set prev->next to your node.
- Spin on your own node->locked until it becomes 1.
Unlocking:
- If your next is non-NULL, set next->locked to 1.
- If there is no next, try to atomically reset the tail pointer back to NULL. A sketch of the whole scheme follows.
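A compact user-space rendering of the MCS algorithm in C11 atomics (illustrative only; the kernel's qspinlock packs the state into one 32-bit word and layers the pending-bit fast path on top):
c
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
	struct mcs_node *_Atomic next;
	atomic_bool locked;	/* true once we have been granted the lock */
};

typedef struct { struct mcs_node *_Atomic tail; } mcs_lock_t;

static void mcs_lock(mcs_lock_t *lock, struct mcs_node *node)
{
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, false, memory_order_relaxed);

	/* Publish ourselves as the new tail and learn who was there before. */
	struct mcs_node *prev =
		atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev == NULL)
		return;	/* queue was empty: lock acquired immediately */

	/* Link in behind the predecessor, then spin on our OWN node only. */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

static void mcs_unlock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *next =
		atomic_load_explicit(&node->next, memory_order_acquire);

	if (next == NULL) {
		/* No visible successor: try to swing the tail back to NULL. */
		struct mcs_node *expected = node;
		if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected,
				NULL, memory_order_acq_rel, memory_order_acquire))
			return;	/* nobody was queued behind us */
		/* A successor is in the middle of enqueueing; wait for the link. */
		while ((next = atomic_load_explicit(&node->next,
						    memory_order_acquire)) == NULL)
			;
	}
	/* Hand the lock straight to the successor, which spins on its own node. */
	atomic_store_explicit(&next->locked, true, memory_order_release);
}
Each thread passes in its own node (the kernel uses the per-CPU qnodes array for this), which is why the waiting loop only ever touches a cache line local to that CPU.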
4.3 The MCS implementation in Linux
Linux's queued spinlock combines the MCS idea with a conventional lock word. Each CPU gets four MCS nodes (defined in kernel/locking/qspinlock.c):
c
struct qnodes {
struct mcs_spinlock mcs[MAX_NODES];
};
static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
Why four nodes? Because a CPU may need to take a spinlock again from nested contexts - task, soft interrupt, hard interrupt and NMI - so up to four levels of nesting are supported. If the limit is ever exceeded (which is extremely rare), the code degrades to spinning on the lock directly:
c
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
Part 5: memory barriers and memory ordering
5.1 Why are memory barriers needed?
On multicore systems, both the compiler and the CPU may reorder instructions for performance. A single thread never notices, but in multi-threaded synchronization this can cause serious bugs.
Take futex as an example:
c
set_current_state(TASK_INTERRUPTIBLE);
futex_queue(q, hb);
If the compiler or the CPU were allowed to reorder these two steps, the following could happen:
- futex_queue runs first (adding q to the list and releasing the lock).
- set_current_state runs afterwards.
- Before the state is set, another CPU sees q on the list and tries to wake it.
- But the task is still in TASK_RUNNING, so the wakeup has no effect, and the subsequent sleep never ends.
5.2 Where the barriers appear in the code
The sources use several different kinds of memory barriers:
- smp_mb(): a full memory barrier
c
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
- smp_store_mb(): a store followed by a full barrier
c
set_current_state(TASK_INTERRUPTIBLE);
// set_current_state() uses smp_store_mb() internally
- acquire/release semantics
c
atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)
- acquire: later reads and writes cannot be moved before the acquire
- release: earlier reads and writes cannot be moved after the release
- smp_cond_load_acquire(): a conditional load with acquire ordering
c
smp_cond_load_acquire(&lock->locked, !VAL);
5.3 What is special about x86
x86 has a comparatively strong memory model (TSO - Total Store Order), so in many cases no explicit barrier instruction is needed:
- plain stores already have release semantics
- plain loads already have acquire semantics
- instructions with the LOCK prefix act as full barriers
To stay portable, however, the kernel still uses the generic barrier macros, which expand to whatever each architecture requires. The sketch below shows the canonical acquire/release pairing that these macros implement.
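A minimal C11 message-passing sketch (user-space and illustrative; the kernel's smp_store_release()/smp_load_acquire() follow the same pattern):
c
#include <stdatomic.h>

static int data;		/* payload, written before publication */
static atomic_int ready;

/* Producer: the release store orders the data write before the flag write. */
static void producer(void)
{
	data = 42;
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: the acquire load orders the flag read before the data read. */
static int consumer(void)
{
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		; /* spin until published */
	return data; /* guaranteed to observe 42 */
}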
Part 6: contention scenarios from end to end
6.1 Scenario 1: going to sleep
In the truly uncontended case a user-space lock never issues the futex syscall at all - its atomic fast path succeeds in user space. Once a thread does decide to wait, the user program looks like this:
c
// the expected value equals the current value
if (*futex == 1) {
futex_wait(&futex, 1); // blocks in the kernel while *futex is still 1
}
Because the value still matches, the thread enters the kernel and:
- futex_wait_setup takes hb->lock
- uaddr is read again to confirm the value has not changed
- futex_queue adds the thread to the queue and releases the lock
- schedule puts it to sleep
6.2 Scenario 2: wakeup
Another thread runs:
c
*futex = 2;
futex_wake(&futex, 1);
The wake path:
- The same hash computation finds the same hb.
- Take hb->lock.
- Walk the list and find the highest-priority waiter whose key matches.
- Remove it from the list and mark the task runnable.
- Release the lock.
- At the next scheduling opportunity, the waiting thread runs again.
6.3 Scenario 3: timeout handling
The timeout is implemented with an hrtimer_sleeper:
c
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
When the timer expires, its callback wakes the task. If the task is still on the list, it removes itself and __futex_wait returns -ETIMEDOUT (the to->task check in the listing above).
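From user space, a timed wait is the same syscall with a timeout argument. A minimal sketch (with plain FUTEX_WAIT the timeout is relative; the variable names are made up for the example):
c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	uint32_t word = 1;
	struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 }; /* relative: 1 second */

	/* Sleeps while word == 1; nobody wakes us, so the timer fires. */
	long ret = syscall(SYS_futex, &word, FUTEX_WAIT_PRIVATE, 1, &ts, NULL, 0);
	if (ret == -1 && errno == ETIMEDOUT)
		printf("futex wait timed out as expected\n");
	return 0;
}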
6.4 Scenario 4: spurious wakeups
Because of signals or other events, schedule() may return without an explicit wakeup. The code decides whether a wakeup was genuine by checking plist_node_empty(&q->list) (via futex_unqueue):
- If the node is no longer on the list, some thread really woke us: return success.
- If the node is still on the list, the wakeup may have been spurious: go back to sleep (retry), or handle the timeout or signal.
Part 7: performance tricks
7.1 Separating the fast and slow paths
The spinlock fast path is a single cmpxchg, so its overhead is tiny. The common, uncontended case succeeds and returns immediately; only contention falls into the slow path.
7.2 Cache-line placement
c
} __futex_data __read_mostly __aligned(2*sizeof(long));
futex_queues and futex_hashsize are wrapped in one struct and aligned to 2*sizeof(long) so that they land in the same cache line. The kernel comment (see the listing at the end) gives the reason: after initialization the two fields are only ever used together, inside futex_hash(), and both are __read_mostly, so packing them together lets a single cache-line fill serve both reads.
7.3 Per-CPU variables
The MCS nodes are allocated with DEFINE_PER_CPU_ALIGNED, so each CPU has its own node array and never contends on another CPU's cache line.
7.4 Prefetching the next node
c
if (next)
prefetchw(next);
Prefetching the next node's cache line (for write) shrinks the cache miss in the upcoming unlock hand-off.
Part 8: debugging and tracing
8.1 lockdep dependency checking
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_) feeds the lockdep machinery. It records the order in which locks are taken and detects possible deadlocks at runtime; when a circular dependency is found it prints a warning together with the call stacks involved.
8.2 Tracepoints
c
trace_contention_begin(lock, LCB_F_SPIN);
trace_contention_end(lock, 0);
These tracepoints let you monitor lock contention with perf or ftrace and track down system bottlenecks.
8.3 lockevent counters
c
lockevent_inc(lock_slowpath);
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
With CONFIG_LOCK_EVENT_COUNTS enabled at build time, counters for the various lock events can be read back through debugfs (under /sys/kernel/debug/lockevent/).
8.4 Debug output in the code
In futex_queue:
c
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
pr_debug("futex:queue:total %d waiters\n", count);
These pr_debug calls only produce output when DEBUG is defined or dynamic debug is enabled, so the log level can be controlled at runtime.
Conclusion: the complete chain from user space down to the hardware
Starting from a single futex_wait call in user space, we followed the execution path into the kernel and all the way down to the CPU's atomic instructions. The chain involves:
- the futex system call: the boundary between user space and the kernel
- the hash bucket machinery: mapping futex addresses onto a bounded set of wait queues
- priority lists: ensuring real-time threads are woken first
- spinlocks: protecting the kernel's critical data structures against concurrent access
- queued spinlocks: a scalable multicore lock algorithm built around the MCS idea
- atomic operations: the LOCK prefix and cmpxchg, keeping multicore updates safe
- memory barriers: keeping memory accesses in the required order
The chain shows how the Linux kernel balances performance against correctness:
- fast paths stay in user space or in a handful of simple instructions wherever possible
- slow paths use careful queueing algorithms to reduce cache contention
- precisely placed memory barriers keep the concurrency correct
- debugging and tracing facilities help developers analyse and optimize the result
Understanding these mechanisms not only helps us write better concurrent programs, it also deepens our understanding of the core ideas of operating system design. After decades of evolution, the kernel's synchronization machinery has become a model implementation in the field of concurrent programming.
## Source listing
cpp
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
__acquire(lock);
arch_spin_lock(&lock->raw_lock);
mmiowb_spin_lock();
}
# define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
#define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i)
#define spin_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i)
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
#define raw_spin_lock(lock) _raw_spin_lock(lock)
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
* corresponding hash bucket in the global hash.
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
/* The key must be already stored in q->key. */
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
* waiting for the spinlock. This is safe as all futex_q_lock()
* users end up calling futex_queue(). Similarly, for housekeeping,
* decrement the counter at futex_q_unlock() when some error has
* occurred and we don't end up adding the task to the list.
*/
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
/*
* The base of the bucket array and its size are always used together
* (after initialization only in futex_hash()), so ensure that they
* reside in the same cacheline.
*/
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
/**
* futex_wait_setup() - Prepare to wait on a futex
* @uaddr: the futex userspace address
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
* @hb: storage for hash_bucket pointer to be returned to caller
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
* Return with the hb lock held on success, and unlocked on failure.
*
* Return:
* - 0 - uaddr contains val and hb has been locked;
* - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
*
* Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
* Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
*
* The basic logical guarantee of a futex is that it blocks ONLY
* if cond(var) is known to be true at the time of blocking, for
* any cond. If we locked the hash-bucket after testing *uaddr, that
* would open a race condition where we could block indefinitely with
* cond(var) false, which would violate the guarantee.
*
* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.
*/
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
// yym-gaizao
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
/* No timeout, nothing to clean up. */
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
/**
* futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
/**
* futex_queue() - Enqueue the futex_q on the futex_hash_bucket
* @q: The futex_q to enqueue
* @hb: The destination hash bucket
*
* The hb->lock must be held by the caller, and is released here. A call to
* futex_queue() is typically paired with exactly one call to futex_unqueue(). The
* exceptions involve the PI related operations, which may use futex_unqueue_pi()
* or nothing if the unqueue is done as part of the wake process and the unqueue
* state is implicit in the state of woken task (see futex_wait_requeue_pi() for
* an example).
*/
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// yym-gaizao
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
// yym-gaizao
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio;
/*
* The priority used to register this element is
* - either the real thread-priority for the real-time threads
* (i.e. threads with a priority lower than MAX_RT_PRIO)
* - or MAX_RT_PRIO for non-RT threads.
* Thus, all RT-threads are woken first in priority order, and
* the others are woken last, in FIFO order.
*/
prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
/**
* plist_add - add @node to @head
*
* @node: &struct plist_node pointer
* @head: &struct plist_head pointer
*/
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
plist_check_head(head);
WARN_ON(!plist_node_empty(node));
WARN_ON(!list_empty(&node->prio_list));
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next,
struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
plist_check_head(head);
}
static __always_inline void spin_unlock(spinlock_t *lock)
{
raw_spin_unlock(&lock->rlock);
}
#define raw_spin_unlock(lock) _raw_spin_unlock(lock)
static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
mmiowb_spin_unlock();
arch_spin_unlock(&lock->raw_lock);
__release(lock);
}
# define __release(x) (void)0
#define arch_spin_lock(l) queued_spin_lock(l)
#define arch_spin_unlock(l) queued_spin_unlock(l)
#ifndef queued_spin_unlock
/**
* queued_spin_unlock - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
/*
* unlock() needs release semantics:
*/
smp_store_release(&lock->locked, 0);
}
#endif
/**
* queued_spin_lock - acquire a queued spinlock
* @lock: Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
/**
* atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_acquire() there.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
instrument_atomic_read_write(v, sizeof(*v));
instrument_atomic_read_write(old, sizeof(*old));
return raw_atomic_try_cmpxchg_acquire(v, old, new);
}
/**
* raw_atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Safe to use in noinstr code; prefer atomic_try_cmpxchg_acquire() elsewhere.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
raw_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
#if defined(arch_atomic_try_cmpxchg_acquire)
return arch_atomic_try_cmpxchg_acquire(v, old, new);
#elif defined(arch_atomic_try_cmpxchg_relaxed)
bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
__atomic_acquire_fence();
return ret;
#elif defined(arch_atomic_try_cmpxchg)
return arch_atomic_try_cmpxchg(v, old, new);
#else
int r, o = *old;
r = raw_atomic_cmpxchg_acquire(v, o, new);
if (unlikely(r != o))
*old = r;
return likely(r == o);
#endif
}
static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
{
return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
#define arch_try_cmpxchg(ptr, pold, new) \
__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
#define __try_cmpxchg(ptr, pold, new, size) \
__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "q" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_W: \
{ \
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_L: \
{ \
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
/**
* queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
* @val: Current value of the queued spinlock 32-bit word
*
* (queue tail, pending bit, lock value)
*
* fast : slow : unlock
* : :
* uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
* : | ^--------.------. / :
* : v \ \ | :
* pending : (0,1,1) +--> (0,1,0) \ | :
* : | ^--' | | :
* : v | | :
* uncontended : (n,x,y) +--> (n,0,0) --' | :
* queue : | ^--' | :
* : v | :
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
if (pv_enabled())
goto pv_queue;
if (virt_spin_lock(lock))
return;
/*
* Wait for in-progress pending->locked hand-overs with a bounded
* number of spins so that we guarantee forward progress.
*
* 0,1,0 -> 0,0,1
*/
if (val == _Q_PENDING_VAL) {
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
/*
* If we observe any contention; queue.
*/
if (val & ~_Q_LOCKED_MASK)
goto queue;
/*
* trylock || pending
*
* 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
*/
val = queued_fetch_set_pending_acquire(lock);
/*
* If we observe contention, there is a concurrent locker.
*
* Undo and queue; our setting of PENDING might have made the
* n,0,0 -> 0,0,0 transition fail and it will now be waiting
* on @next to become !NULL.
*/
if (unlikely(val & ~_Q_LOCKED_MASK)) {
/* Undo PENDING if we set it. */
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
/*
* We're pending, wait for the owner to go away.
*
* 0,1,1 -> *,1,0
*
* this wait loop must be a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because not all
* clear_pending_set_locked() implementations imply full
* barriers.
*/
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
/*
* take ownership and clear the pending bit.
*
* 0,1,0 -> 0,0,1
*/
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
return;
/*
* End of pending bit optimistic spinning and beginning of MCS
* queuing.
*/
queue:
lockevent_inc(lock_slowpath);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
trace_contention_begin(lock, LCB_F_SPIN);
/*
* 4 nodes are allocated based on the assumption that there will
* not be nested NMIs taking spinlocks. That may not be true in
* some architectures even though the chance of needing more than
* 4 nodes will still be extremely unlikely. When that happens,
* we fall back to spinning on the lock directly without using
* any MCS node. This is not the most elegant solution, but is
* simple enough.
*/
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
node = grab_mcs_node(node, idx);
/*
* Keep counts of non-zero index values:
*/
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
/*
* Ensure that we increment the head node->count before initialising
* the actual node. If the compiler is kind enough to reorder these
* stores, then an IRQ could overwrite our assignments.
*/
barrier();
node->locked = 0;
node->next = NULL;
pv_init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
* attempt the trylock once more in the hope someone let go while we
* weren't watching.
*/
if (queued_spin_trylock(lock))
goto release;
/*
* Ensure that the initialisation of @node is complete before we
* publish the updated tail via xchg_tail() and potentially link
* @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
*/
smp_wmb();
/*
* Publish the updated tail.
* We have already touched the queueing cacheline; don't bother with
* pending stuff.
*
* p,*,* -> n,*,*
*/
old = xchg_tail(lock, tail);
next = NULL;
/*
* if there was a previous node; link it and wait until reaching the
* head of the waitqueue.
*/
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
/*
* While waiting for the MCS lock, the next pointer may have
* been set by another lock waiter. We optimistically load
* the next pointer & prefetch the cacheline for writing
* to reduce latency in the upcoming MCS unlock operation.
*/
next = READ_ONCE(node->next);
if (next)
prefetchw(next);
}
/*
* we're at the head of the waitqueue, wait for the owner & pending to
* go away.
*
* *,x,y -> *,0,0
*
* this wait loop must use a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
*
* The PV pv_wait_head_or_lock function, if active, will acquire
* the lock and return a non-zero value. So we have to skip the
* atomic_cond_read_acquire() call. As the next PV queue head hasn't
* been designated yet, there is no way for the locked value to become
* _Q_SLOW_VAL. So both the set_locked() and the
* atomic_cmpxchg_relaxed() calls will be safe.
*
* If PV isn't active, 0 will be returned instead.
*
*/
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
/*
* claim the lock:
*
* n,0,0 -> 0,0,1 : lock, uncontended
* *,*,0 -> *,*,1 : lock, contended
*
* If the queue head is the only one in the queue (lock value == tail)
* and nobody is pending, clear the tail code and grab the lock.
* Otherwise, we only need to grab the lock.
*/
/*
* In the PV case we might already have _Q_LOCKED_VAL set, because
* of lock stealing; therefore we must also allow:
*
* n,0,1 -> 0,0,1
*
* Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
* above wait condition, therefore any concurrent setting of
* PENDING will make the uncontended transition fail.
*/
if ((val & _Q_TAIL_MASK) == tail) {
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
goto release; /* No contention */
}
/*
* Either somebody is queued behind us or _Q_PENDING_VAL got set
* which will then detect the remaining tail and queue behind us
* ensuring we'll see a @next.
*/
set_locked(lock);
/*
* contended path; wait for next if not observed yet, release.
*/
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
/*
* release the node
*/
__this_cpu_dec(qnodes[0].mcs.count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);