目录
- 引言:用户态同步与内核的交汇点
- Futex 机制概览:快速用户空间互斥锁的设计哲学
- Futex 核心数据结构解析
- Futex 哈希与等待队列机制
- 自旋锁基础:从 raw_spin_lock 到 spin_lock
- 排队自旋锁(Queued Spinlock)与 MCS 锁
- 原子操作与 cmpxchg 指令
- Futex 等待路径的完整流程分析
- Futex 唤醒机制与竞争条件处理
- 性能考量与优化策略
- 总结与展望
引言:用户态同步与内核的交汇点
在现代操作系统中,进程/线程同步是一个核心且复杂的话题。从用户态的 pthread_mutex_lock 到内核态的底层锁实现,整个同步链条涉及多个层次的设计权衡。Linux 内核的 Futex(Fast Userspace muTEX)机制正是这种权衡的典范:它允许在无竞争的情况下完全在用户态完成锁操作,仅在出现竞争时才陷入内核进行线程的睡眠与唤醒。
本文基于 Linux 内核 6.8.12 的源码片段,深入剖析从 Futex 系统调用到最底层 MCS 队列自旋锁的完整实现链条。我们将看到,一个看似简单的 futex_wait 调用,背后涉及哈希表定位、自旋锁获取、优先级排序队列、原子操作、内存屏障、以及复杂的竞争条件处理。理解这些机制,对于编写高性能并发程序、调试死锁问题、以及优化系统性能都具有重要意义。
Futex 机制概览:快速用户空间互斥锁的设计哲学
设计背景与核心思想
Futex 机制诞生于 2002 年,由 Hubertus Franke、Rusty Russell 和 Matthew Kirkwood 设计,并在 Ottawa Linux Symposium 上首次提出。其核心洞察非常简洁:锁操作的开销主要来自竞争,而非锁本身。在大多数应用场景中,锁的持有时间很短,竞争并不频繁。如果每次加锁/解锁都要进行系统调用,那么系统调用的开销(约 100-300 纳秒)将成为性能瓶颈。
Futex 的设计哲学是:
- 无竞争路径完全在用户态:通过原子指令(如 cmpxchg)尝试获取锁,成功则直接返回,无需内核介入。
- 有竞争路径才陷入内核:当用户态发现锁已被占用时,调用 futex() 系统调用,让内核将当前线程放入等待队列并使其睡眠。
- 解锁时唤醒等待者:锁持有者释放锁后,通过 futex() 系统调用唤醒等待队列中的一个或多个线程。
这种设计使得 glibc 的 pthread_mutex_t、pthread_cond_t、pthread_rwlock_t、sem_t 以及 C++ 的 std::mutex 都能构建在 Futex 之上,获得优异的性能。
Futex 系统调用的核心操作
futex(2) 系统调用提供两个核心操作:
- FUTEX_WAIT:检查 futex 变量的值,如果等于期望值,则将当前线程加入等待队列并睡眠;否则立即返回。
- FUTEX_WAKE:唤醒等待队列中指定数量的线程。
这两个操作构成了用户态锁机制与内核调度器之间的桥梁。
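下面给出一个极简的用户态互斥锁示意,展示这一协议如何落地(参考 Ulrich Drepper《Futexes Are Tricky》中的三状态设计;futex()/lock()/unlock() 均为演示用的假想封装,省略错误处理):
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* 状态:0 = 未上锁,1 = 上锁且无等待者,2 = 上锁且可能有等待者 */
void lock(atomic_int *f)
{
        int c = 0;

        /* 无竞争快速路径:0 -> 1,一条原子指令,完全不进内核 */
        if (atomic_compare_exchange_strong(f, &c, 1))
                return;
        if (c != 2)
                c = atomic_exchange(f, 2);       /* 标记"可能有等待者" */
        while (c != 0) {
                futex(f, FUTEX_WAIT_PRIVATE, 2); /* 值仍为 2 时睡眠 */
                c = atomic_exchange(f, 2);
        }
}

void unlock(atomic_int *f)
{
        /* 仅当可能有等待者(状态 2)时才进内核唤醒一个线程 */
        if (atomic_exchange(f, 0) == 2)
                futex(f, FUTEX_WAKE_PRIVATE, 1);
}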
Futex 核心数据结构解析
union futex_key:标识唯一的 Futex
union futex_key {
        struct {
                u64 i_seq;                /* 共享 futex:inode 序列号 */
                unsigned long pgoff;      /* 页在文件中的偏移 */
                unsigned int offset;      /* futex 在页内的偏移 */
        } shared;
        struct {
                union {
                        struct mm_struct *mm;  /* 私有 futex:mm 指针 */
                        u64 __tmp;
                };
                unsigned long address;    /* 页对齐后的虚拟地址 */
                unsigned int offset;
        } private;
        struct {
                u64 ptr;
                unsigned long word;
                unsigned int offset;      /* futex_hash() 以此作为哈希种子 */
        } both;
};
futex_key 是内核标识一个 futex 的唯一方式。根据 futex 是进程私有(private)还是共享(shared),使用不同的字段组合:
- Private Futex:使用 (current->mm, 对齐后的虚拟地址, 页内偏移) 作为 key。同一进程内的不同线程通过虚拟地址即可唯一标识 futex,无需页表遍历,性能更优。
- Shared Futex:使用 (inode->i_sequence, page->index, 页内偏移) 作为 key。这允许多个进程映射同一物理页面时,通过相同的 key 找到同一个等待队列。
struct futex_q:每个等待线程的队列节点
struct futex_q {                          /* 仅列出与本文相关的字段 */
struct plist_node list; /* sorted by priority in hash bucket */
struct task_struct *task; /* the sleeping task */
spinlock_t *lock_ptr; /* hash bucket lock */
union futex_key key; /* what futex we're waiting on */
struct futex_pi_state *pi_state; /* priority inheritance state */
u32 bitset; /* for FUTEX_WAIT_BITSET */
};
每个调用 FUTEX_WAIT 的线程,内核都会在其内核栈上初始化一个 futex_q 结构(见 __futex_wait 中的局部变量 q),并将其挂入对应的哈希桶等待链表。plist_node 表示这是一个按优先级排序的链表节点,确保实时线程优先被唤醒。
struct futex_hash_bucket:哈希桶
struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain; /* list of futex_q entries */
} ____cacheline_aligned_in_smp;
所有等待同一 futex(或哈希冲突的不同 futex)的线程,其 futex_q 节点都会被挂入同一个哈希桶的 chain 链表中。哈希桶本身包含一个自旋锁 lock,用于保护链表操作,以及一个原子计数器 waiters,用于快速判断是否有等待者。
Futex 哈希与等待队列机制
哈希函数 futex_hash
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
内核使用 jhash2(Jenkins hash)对 futex_key 进行哈希计算,然后映射到全局哈希表 futex_queues 的某个桶中。哈希表大小 futex_hashsize 恒为 2 的幂,因此可以用位与运算代替取模。
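举例来说,若 futex_hashsize 为 4096,掩码即 0xFFF:哈希值 0x5A3C1 对应的桶下标为 0x5A3C1 & 0xFFF = 0x3C1,与 0x5A3C1 % 4096 的结果相同,但位与只需一条指令。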
全局哈希表定义
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
哈希表在系统初始化时分配;__futex_data 的对齐保证 queues 指针与 hashsize 落在同一缓存行,而每个哈希桶又以 ____cacheline_aligned_in_smp 按缓存行对齐,以减少多 CPU 访问时的伪共享(false sharing)。
自旋锁基础:从 raw_spin_lock 到 spin_lock
在深入 Futex 的等待与唤醒逻辑之前,我们必须先理解 Linux 内核的自旋锁实现,因为哈希桶的保护、MCS 队列的维护都依赖于自旋锁。
自旋锁的层次结构
Linux 内核的自旋锁实现分为多个层次,这种分层设计既保证了通用性,又允许架构特定的优化:
/* 最底层:架构相关的自旋锁操作 */
#define arch_spin_lock(l) queued_spin_lock(l)
#define arch_spin_unlock(l) queued_spin_unlock(l)
/* 原始自旋锁:关闭抢占,获取锁 */
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
__acquire(lock);
arch_spin_lock(&lock->raw_lock);
mmiowb_spin_lock();
}
/* 中间层:包含锁依赖映射(用于死锁检测) */
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
/* 导出符号(非内联版本) */
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
/* 最上层:通用自旋锁接口 */
#define raw_spin_lock(lock) _raw_spin_lock(lock)
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
关键步骤解析
- preempt_disable():关闭内核抢占。自旋锁持有期间不允许被其他任务抢占,否则可能导致死锁或长时间自旋。
- spin_acquire():记录锁的获取,用于内核的锁依赖检测(lockdep)功能,帮助调试死锁。
- LOCK_CONTENDED():开启 CONFIG_LOCK_STAT 时,先用 do_raw_spin_trylock 尝试一次,失败则记录一次竞争事件再调用 do_raw_spin_lock;未开启统计时直接展开为 do_raw_spin_lock。
- arch_spin_lock():架构相关的实际加锁操作,在 x86 上映射为 queued_spin_lock()。
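作为参照,下面给出自旋锁的典型使用示意(my_lock、my_item 等均为假想命名):由于持锁期间抢占已被关闭,临界区必须短小,且绝不能调用可能睡眠的函数。
#include <linux/list.h>
#include <linux/spinlock.h>

struct my_item {
        struct list_head node;
        int val;
};

static DEFINE_SPINLOCK(my_lock);
static LIST_HEAD(my_list);

void add_item(struct my_item *item)
{
        spin_lock(&my_lock);                  /* 进入临界区:抢占已关闭 */
        list_add_tail(&item->node, &my_list); /* 只做简短的数据结构操作 */
        spin_unlock(&my_lock);                /* 解锁后恢复抢占 */
}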
自旋锁释放
static __always_inline void spin_unlock(spinlock_t *lock)
{
raw_spin_unlock(&lock->rlock);
}
#define raw_spin_unlock(lock) _raw_spin_unlock(lock)
static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
mmiowb_spin_unlock();
arch_spin_unlock(&lock->raw_lock);
__release(lock);
}
释放锁时,首先执行 mmiowb_spin_unlock():在需要保证 MMIO 写顺序的架构上,它会在解锁前刷出未完成的 MMIO 写(x86 上为空操作);随后调用架构相关的解锁操作,即 queued_spin_unlock() 以 release 语义将 locked 字节清零;最后的 __release(lock) 仅供编译期静态分析使用,运行时为空操作。
排队自旋锁(Queued Spinlock)与 MCS 锁
MCS 锁的基本原理
传统的测试并设置(test-and-set)乃至测试-测试并设置(TTAS)自旋锁存在严重的缓存行抖动(cache line bouncing)问题:所有等待者都在同一个内存位置(锁变量)上自旋,锁状态每次变化都会使所有等待 CPU 的缓存行失效,产生大量跨 CPU 的缓存一致性流量。
MCS 锁(由 Mellor-Crummey 和 Scott 提出)解决了这个问题。其核心思想是:
- 每个等待者在自己的本地变量上自旋,而非在全局锁变量上自旋。
- 等待者通过链表排队,锁释放时只通知下一个等待者。
- 这消除了缓存行抖动,实现了公平性(FIFO),并显著提升了大规模系统的性能。
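为了直观理解,下面用 C11 原子操作给出一个教学用的最小 MCS 锁(用户态示意,非内核实现,所有命名均为演示假定):
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;               /* 等待者只在自己的这个字段上自旋 */
};

struct mcs_lock {
        _Atomic(struct mcs_node *) tail;  /* 队尾;NULL 表示锁空闲 */
};

void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
        struct mcs_node *prev;

        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);

        /* 原子地把自己接到队尾,并取得前驱节点 */
        prev = atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
        if (prev == NULL)
                return;                   /* 队列原为空:直接持锁 */

        atomic_store_explicit(&prev->next, me, memory_order_release);
        /* 本地自旋:只读自己的 locked,不产生跨核缓存流量 */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
                ;
}

void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *me)
{
        struct mcs_node *next =
                atomic_load_explicit(&me->next, memory_order_acquire);

        if (next == NULL) {
                struct mcs_node *expected = me;
                /* 没有可见后继:尝试把 tail 从自己换回 NULL */
                if (atomic_compare_exchange_strong_explicit(
                            &lock->tail, &expected, NULL,
                            memory_order_acq_rel, memory_order_acquire))
                        return;
                /* 有人正在入队:等它把 next 链接好 */
                while (!(next = atomic_load_explicit(&me->next,
                                                     memory_order_acquire)))
                        ;
        }
        /* 只通知下一个等待者,锁按 FIFO 交接 */
        atomic_store_explicit(&next->locked, false, memory_order_release);
}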
Linux 内核的排队自旋锁实现
Linux 内核对传统 MCS 锁进行了精巧的压缩,把队尾编码、pending 位和持锁标志全部装入一个 32 位字(qspinlock),使其能直接内嵌在 spinlock_t 中:
/**
* queued_spin_lock - acquire a queued spinlock
* @lock: Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
快速路径(无竞争)
atomic_try_cmpxchg_acquire 尝试将锁从 0(未锁定)原子地设置为 _Q_LOCKED_VAL(已锁定)。这是最常见的无竞争情况,只需一条原子指令即可完成加锁。
慢速路径(有竞争)
当快速路径失败时,进入 queued_spin_lock_slowpath,这是整个排队自旋锁的核心:
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
if (pv_enabled())
goto pv_queue;
if (virt_spin_lock(lock))
return;
/* 等待正在进行的 pending->locked 交接 */
if (val == _Q_PENDING_VAL) {
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
/* 观察到竞争,进入队列 */
if (val & ~_Q_LOCKED_MASK)
goto queue;
/* 尝试设置 pending 位 */
val = queued_fetch_set_pending_acquire(lock);
/* 如果仍有竞争,撤销 pending 并排队 */
if (unlikely(val & ~_Q_LOCKED_MASK)) {
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
/* 等待锁持有者释放 */
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
/* 获取锁并清除 pending 位 */
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
return;
queue:
/* MCS 队列逻辑 */
lockevent_inc(lock_slowpath);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
/* 防止嵌套 NMI 超过节点限制 */
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
node = grab_mcs_node(node, idx);
barrier();
node->locked = 0;
node->next = NULL;
pv_init_node(node);
/* 尝试最后一次快速获取 */
if (queued_spin_trylock(lock))
goto release;
smp_wmb();
/* 将当前节点加入队列尾部 */
old = xchg_tail(lock, tail);
next = NULL;
/* 如果有前驱节点,链接并等待 */
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
/* 预取下一个节点的缓存行 */
next = READ_ONCE(node->next);
if (next)
prefetchw(next);
}
/* 等待锁持有者离开 */
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
/* 获取锁 */
if ((val & _Q_TAIL_MASK) == tail) {
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
goto release;
}
set_locked(lock);
/* 通知下一个等待者 */
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
__this_cpu_dec(qnodes[0].mcs.count);
}
状态机解析
源码注释中给出了清晰的状态转换图:
fast : slow : unlock
: :
uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
: | ^--------.------. / :
: v \ \ | :
pending : (0,1,1) +--> (0,1,0) \ | :
: | ^--' | | :
: v | | :
uncontended : (n,x,y) +--> (n,0,0) --' | :
queue : | ^--' | :
: v | :
contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
queue : ^--' :
其中三个位域分别表示:
- tail:队列尾部指针(编码了 CPU ID 和节点索引)
- pending:是否有线程正在乐观自旋等待
- locked:锁是否被持有
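在常见配置(NR_CPUS < 16384)下,这三个位域在 32 位锁字中的布局与尾部编码大致如下(依据 asm-generic/qspinlock_types.h 与 kernel/locking/qspinlock.c 整理的示意):
/*
 *  bits  0- 7: locked 字节(1 = 已持锁)
 *  bits  8-15: pending 字节(乐观自旋的第一个等待者)
 *  bits 16-17: tail idx(per-CPU MCS 节点索引,对应 4 种嵌套上下文)
 *  bits 18-31: tail cpu(CPU 编号 + 1,0 表示队列为空)
 */
#define _Q_TAIL_IDX_OFFSET      16
#define _Q_TAIL_CPU_OFFSET      18

static u32 encode_tail(int cpu, int idx)
{
        u32 tail;

        tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;  /* +1 以区分空队列 */
        tail |= idx << _Q_TAIL_IDX_OFFSET;        /* idx 恒小于 4 */

        return tail;
}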
关键优化点
- 乐观自旋(Optimistic Spinning):第一个等待者不是立即进入 MCS 队列,而是先设置 pending 位并在锁变量上自旋。如果锁很快释放,可以避免队列操作的开销。
- 本地自旋:进入 MCS 队列后,等待者在 node->locked 上自旋,这是本地(per-CPU)变量,不会引起缓存行抖动。
- 锁窃取(Lock Stealing):在某些情况下,新到达的线程可以直接获取锁,而不必等待队列中的线程。这提高了吞吐量,但需要配合防饥饿机制。
- 虚拟化支持(PV) :在虚拟化环境中,与 hypervisor 协作,避免在锁持有者被抢占时无效自旋。
原子操作与 cmpxchg 指令
排队自旋锁的实现依赖于底层的原子操作,其中最关键的是比较并交换(Compare-and-Swap, CAS)。
atomic_try_cmpxchg_acquire 实现
static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
instrument_atomic_read_write(v, sizeof(*v));
instrument_atomic_read_write(old, sizeof(*old));
return raw_atomic_try_cmpxchg_acquire(v, old, new);
}
x86 架构的 cmpxchg 内联汇编
#define arch_try_cmpxchg(ptr, pold, new) \
__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
#define __try_cmpxchg(ptr, pold, new, size) \
__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "q" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_W: \
{ \
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_L: \
{ \
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
关键点
- lock 前缀:确保指令在多处理器环境下的原子性。
- +m 约束:表示内存操作数会被读写。
- +a 约束:__old 放在 eax/rax 寄存器中,cmpxchg 会自动比较该寄存器与内存值。
- CC_SET(z)/CC_OUT(z):利用 x86 的零标志位(ZF)判断比较是否成功。
- "memory" 破坏描述符:防止编译器重排内存操作,确保内存屏障语义。
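这种"失败时把旧值写回 *old"的 try_cmpxchg 语义与 C11 的 atomic_compare_exchange 一致,可以省去 CAS 循环中的重复加载。下面用一个用户态小例子演示(atomic_inc_demo 为演示用的假想函数):
#include <stdatomic.h>

/* 用 CAS 循环原子地把 *v 加一,返回新值 */
int atomic_inc_demo(atomic_int *v)
{
        int old = atomic_load_explicit(v, memory_order_relaxed);

        /* 比较失败时,old 被自动更新为 *v 的当前值,无需再次 load */
        while (!atomic_compare_exchange_weak_explicit(
                       v, &old, old + 1,
                       memory_order_acq_rel, memory_order_relaxed))
                ;
        return old + 1;
}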
Futex 等待路径的完整流程分析
futex_wait 入口
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
// yym-gaizao
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
/* 清理超时定时器 */
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
/* 处理可重启系统调用 */
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
futex_wait 首先设置超时定时器(如果有),然后调用 __futex_wait 执行核心逻辑。如果因信号中断(-ERESTARTSYS),则设置重启块,以便信号处理完成后重新执行系统调用。
__futex_wait 核心逻辑
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
关键流程
- futex_wait_setup:准备等待,获取哈希桶锁,检查 futex 值。
- futex_wait_queue:将当前线程加入等待队列并睡眠。
- 唤醒后检查:如果被正常唤醒(已从队列移除),返回成功;如果超时,返回 -ETIMEDOUT;如果被信号中断,返回 -ERESTARTSYS;如果是虚假唤醒,重试。
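站在调用者的角度,这几种结果都需要处理。下面是一个用户态包装的示意(wait_on 为假想函数,错误处理从简;注意用户态看到的是 errno 形式,且 EWOULDBLOCK 与 EAGAIN 同值):
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int wait_on(atomic_int *f, int expected, const struct timespec *timeout)
{
        for (;;) {
                long r = syscall(SYS_futex, f, FUTEX_WAIT_PRIVATE,
                                 expected, timeout, NULL, 0);
                if (r == 0)
                        return 0;  /* 被唤醒(可能是虚假唤醒,调用者需复查条件) */
                if (errno == EAGAIN)
                        return 0;  /* 值已变化,对应内核的 -EWOULDBLOCK */
                if (errno == ETIMEDOUT)
                        return -1; /* 超时 */
                if (errno == EINTR)
                        continue;  /* 被信号打断:重试 */
                return -1;         /* 其他错误,如 EFAULT */
        }
}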
futex_wait_setup:竞争条件的核心处理
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
*
* Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
* Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
*
* The basic logical guarantee of a futex is that it blocks ONLY
* if cond(var) is known to be true at the time of blocking, for
* any cond. If we locked the hash-bucket after testing *uaddr, that
* would open a race condition where we could block indefinitely with
* cond(var) false, which would violate the guarantee.
*
* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.
*/
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
关键设计:锁顺序与值检查
注释中详细解释了为何必须先锁定哈希桶,再检查 futex 值:
- 防止丢失唤醒:如果先检查值再锁桶,可能在检查值和锁桶之间发生唤醒,导致线程永远睡眠。
- 防止吸收错误唤醒:如果在值不匹配时已经入队,可能错误地消耗一个本不属于它的唤醒。
futex_q_lock:获取哈希桶锁
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
* waiting for the spinlock. This is safe as all futex_q_lock()
* users end up calling futex_queue(). Similarly, for housekeeping,
* decrement the counter at futex_q_unlock() when some error has
* occurred and we don't end up adding the task to the list.
*/
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
关键细节:waiters 计数器
在获取自旋锁之前先增加 waiters 计数器,这是为了防止以下竞争:
- 等待者 A 计算出哈希桶,但尚未获取锁。
- 唤醒者 B 获取锁,发现 waiters 为 0,直接返回(认为没有等待者)。
- A 获取锁并入队,但已错过唤醒,永远睡眠。
通过先增加计数器(隐含内存屏障),确保唤醒者能看到即将入队的等待者。
futex_wait_queue:入队并睡眠
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
关键设计:状态设置与锁释放的顺序
- set_current_state(TASK_INTERRUPTIBLE):将当前线程标记为可中断睡眠状态,内部用 smp_store_mb() 实现,自带内存屏障。
- futex_queue(q, hb):在持有哈希桶锁的情况下将当前线程加入队列,并在最后释放锁。
- 检查是否已被唤醒:由于 futex_queue 释放锁时包含内存屏障,如果唤醒发生在 schedule() 之前,队列节点已被移除,此时可以跳过 schedule()。
- schedule():主动放弃 CPU,进入睡眠。
- __set_current_state(TASK_RUNNING):被唤醒后恢复运行状态。
futex_queue:将线程加入优先级队列
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// yym-gaizao
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
__futex_queue:核心入队逻辑
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio;
/*
* The priority used to register this element is
* - either the real thread-priority for the real-time threads
* (i.e. threads with a priority lower than MAX_RT_PRIO)
* - or MAX_RT_PRIO for non-RT threads.
* Thus, all RT-threads are woken first in priority order, and
* the others are woken last, in FIFO order.
*/
prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
优先级队列的设计
- 实时线程:使用实际优先级(< MAX_RT_PRIO),按优先级顺序唤醒。
- 非实时线程:统一使用 MAX_RT_PRIO,按 FIFO 顺序唤醒。
这确保了实时线程的响应性,同时为非实时线程提供公平性。
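举一个具体的例子(基于内核的优先级编码,数值越小优先级越高):SCHED_FIFO、rt_priority 为 90 的线程,其 normal_prio = MAX_RT_PRIO - 1 - 90 = 9,入队优先级即 min(9, 100) = 9;nice 为 0 的普通线程 normal_prio 为 120,入队优先级被钳制为 min(120, 100) = 100。因此该实时线程必然排在所有普通线程之前。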
plist_add:优先级链表操作
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
plist_check_head(head);
WARN_ON(!plist_node_empty(node));
WARN_ON(!list_empty(&node->prio_list));
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next,
struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
plist_check_head(head);
}
plist(priority list)是 Linux 内核的一种特殊链表,同时维护两个链表:
- node_list:按插入顺序排列、包含所有节点的链表。
- prio_list:按优先级排列的链表,每个优先级只有第一个节点挂在其中,作为该优先级组的入口。
这使得按优先级遍历和按插入顺序遍历都能高效进行。
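举例来说,依次插入优先级为 100、5、100、5 的四个节点后,node_list 的顺序是 5、5、100、100(同优先级保持 FIFO);prio_list 上只挂着两个"组长"节点(第一个 5 和第一个 100)。plist_add 沿 prio_list 跳跃查找插入位置,代价只与不同优先级的个数成正比,而非节点总数。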
Futex 唤醒机制与竞争条件处理
虽然提供的源码片段中没有包含 futex_wake 的完整实现,但我们可以从等待路径的设计中推断唤醒机制的关键点。
唤醒路径的伪代码
基于内核文档和相关源码,唤醒路径大致如下:
int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
DEFINE_WAKE_Q(wake_q);
int ret;
if (!bitset)
return -EINVAL;
ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
if ((flags & FLAGS_STRICT) && !nr_wake)
return 0;
hb = futex_hash(&key);
/* 快速路径:如果没有等待者,直接返回 */
if (!futex_hb_waiters_pending(hb))
return ret;
spin_lock(&hb->lock);
plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (futex_match(&this->key, &key)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
break;
}
/* 检查 bitset 是否匹配 */
if (!(this->bitset & bitset))
continue;
this->wake(&wake_q, this);
if (++ret >= nr_wake)
break;
}
}
spin_unlock(&hb->lock);
wake_up_q(&wake_q); /* 实际唤醒操作在释放锁后进行 */
return ret;
}
关键设计点
- 锁内标记,锁外唤醒:唤醒操作在持有哈希桶锁时只做标记(把任务加入 wake_q 唤醒队列),实际唤醒(wake_up_process)在释放锁后进行。这减少了锁持有时间,避免在锁保护下执行复杂的调度操作。
- Bitset 匹配:支持 FUTEX_WAIT_BITSET 和 FUTEX_WAKE_BITSET,允许线程等待/唤醒特定的位模式,实现更灵活的同步语义。
- 优先级继承(PI)支持:对于实时线程,futex 支持优先级继承协议,防止优先级反转。这通过 futex_pi_state 和 rt_waiter 字段实现。
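"锁内标记,锁外唤醒"正是内核 wake_q 机制的标准用法,示意如下(wake_q_add/wake_up_q 为真实内核 API;锁、链表与 waiter 结构均为假想):
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

struct waiter {
        struct list_head node;
        struct task_struct *task;
};

void wake_all_waiters(spinlock_t *lock, struct list_head *waiters)
{
        DEFINE_WAKE_Q(wake_q);
        struct waiter *w;

        spin_lock(lock);
        list_for_each_entry(w, waiters, node)
                wake_q_add(&wake_q, w->task);  /* 锁内只做标记 */
        spin_unlock(lock);

        wake_up_q(&wake_q);                    /* 真正的唤醒在锁外进行 */
}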
性能考量与优化策略
哈希桶大小与扩展性
哈希桶的数量(futex_hashsize)直接影响并发性能:
- 桶太少:大量不同 futex 映射到同一桶,增加锁竞争。
- 桶太多:内存占用增加,缓存效率降低。
默认大小并非固定的 256:futex_init() 按 roundup_pow_of_two(256 * num_possible_cpus()) 计算,随 CPU 数自动扩展(CONFIG_BASE_SMALL 配置下仅 16 个桶)。
乐观自旋与 MCS 锁的结合
在 Futex 的哈希桶锁(spinlock_t)层面,Linux 内核使用了排队自旋锁。这意味着:
- 多个 CPU 同时访问同一哈希桶时,会在 MCS 队列中排队。
- 每个等待者在本地变量上自旋,避免了缓存行抖动。
- 有公开报告显示,在 240 核级别的大型系统上,这类优化可以把耗在锁上的系统时间从约 54% 降到 2% 左右。
用户态乐观自旋
现代 glibc 在用户态也实现了乐观自旋:在调用 futex_wait 之前,先自旋一段时间等待锁释放。如果锁很快释放,可以完全避免系统调用开销。
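例如,glibc 提供的"自适应"互斥锁类型就采用了这一策略(PTHREAD_MUTEX_ADAPTIVE_NP 为 GNU 扩展),初始化示意如下:
#define _GNU_SOURCE
#include <pthread.h>

static pthread_mutex_t m;

void init_adaptive_mutex(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* 加锁失败时先做有限次自旋,再退回 futex_wait */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&m, &attr);
        pthread_mutexattr_destroy(&attr);
}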
锁竞争诊断
对于生产环境中的锁竞争问题,可以使用 perf lock 工具进行诊断:
| perf lock 表现 | 根因 | 优化方案 |
|---|---|---|
| contended 高 + avg wait 低(< 1us) | 锁粒度太粗,频繁短竞争 | 拆锁(per-bucket / per-CPU)或无锁 |
| contended 低 + avg wait 高(> 10us) | 临界区内有 I/O / malloc / 日志 | 移出锁外 + Collect-Release-Execute |
| contended 高 + avg wait 高 | 严重设计问题 | 无锁队列(MPSC/SPSC)彻底替代 |
| spinlock type + 高 contended | 临界区持有时间超过自旋收益 | 改用 mutex(允许休眠) |
| futex_wake 路径占比 > 20% | hash bucket 竞争(二阶效应) | 竞争已极严重,必须无锁化 |
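在实际使用中,典型流程是先用 perf lock record 采集锁事件(需要内核启用相应的锁跟踪点),再用 perf lock report 按锁查看 contended 次数与等待时间;较新版本的 perf 还提供 perf lock contention 子命令,可基于 BPF 在线统计而无需先落盘。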
总结与展望
通过本文对 Linux 内核 6.8.12 源码的深入分析,我们完整地梳理了从 Futex 系统调用到 MCS 队列自旋锁的实现链条:
- 用户态锁(如 pthread_mutex_t)在无竞争时完全在用户态通过原子操作完成。
- 出现竞争时,调用 futex_wait/futex_wake 陷入内核。
- 内核通过 futex_key 定位哈希桶,使用自旋锁保护等待队列。
- 自旋锁底层使用 MCS 排队机制,确保公平性和缓存效率。
- 原子操作(cmpxchg)是整个同步机制的基石。
这种分层设计体现了 Linux 内核在性能、公平性和可扩展性之间的精妙权衡。随着硬件的发展(更多核心、NUMA 架构、新原子指令),内核的同步机制也在不断演进。未来的发展方向可能包括:
- 更细粒度的哈希桶:减少桶级竞争。
- 自适应自旋:根据锁持有历史动态调整自旋时间。
- 硬件事务内存(HTM):利用 Intel TSX 等硬件特性实现无锁同步。
- 更高效的虚拟化支持:减少 hypervisor 层的锁开销。
理解这些底层机制,不仅能帮助我们编写更高效的并发程序,也为诊断和解决复杂的性能问题提供了坚实的基础。
源码
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
__acquire(lock);
arch_spin_lock(&lock->raw_lock);
mmiowb_spin_lock();
}
# define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
#define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i)
#define spin_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i)
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
#define raw_spin_lock(lock) _raw_spin_lock(lock)
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
* corresponding hash bucket in the global hash.
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
/* The key must be already stored in q->key. */
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
* waiting for the spinlock. This is safe as all futex_q_lock()
* users end up calling futex_queue(). Similarly, for housekeeping,
* decrement the counter at futex_q_unlock() when some error has
* occurred and we don't end up adding the task to the list.
*/
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
/*
* The base of the bucket array and its size are always used together
* (after initialization only in futex_hash()), so ensure that they
* reside in the same cacheline.
*/
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
/**
* futex_wait_setup() - Prepare to wait on a futex
* @uaddr: the futex userspace address
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
* @hb: storage for hash_bucket pointer to be returned to caller
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
* Return with the hb lock held on success, and unlocked on failure.
*
* Return:
* - 0 - uaddr contains val and hb has been locked;
* - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
*
* Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
* Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
*
* The basic logical guarantee of a futex is that it blocks ONLY
* if cond(var) is known to be true at the time of blocking, for
* any cond. If we locked the hash-bucket after testing *uaddr, that
* would open a race condition where we could block indefinitely with
* cond(var) false, which would violate the guarantee.
*
* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.
*/
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
// yym-gaizao
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
/* No timeout, nothing to clean up. */
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
/**
* futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
/**
* futex_queue() - Enqueue the futex_q on the futex_hash_bucket
* @q: The futex_q to enqueue
* @hb: The destination hash bucket
*
* The hb->lock must be held by the caller, and is released here. A call to
* futex_queue() is typically paired with exactly one call to futex_unqueue(). The
* exceptions involve the PI related operations, which may use futex_unqueue_pi()
* or nothing if the unqueue is done as part of the wake process and the unqueue
* state is implicit in the state of woken task (see futex_wait_requeue_pi() for
* an example).
*/
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// yym-gaizao
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
// yym-gaizao
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio;
/*
* The priority used to register this element is
* - either the real thread-priority for the real-time threads
* (i.e. threads with a priority lower than MAX_RT_PRIO)
* - or MAX_RT_PRIO for non-RT threads.
* Thus, all RT-threads are woken first in priority order, and
* the others are woken last, in FIFO order.
*/
prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
/**
* plist_add - add @node to @head
*
* @node: &struct plist_node pointer
* @head: &struct plist_head pointer
*/
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
plist_check_head(head);
WARN_ON(!plist_node_empty(node));
WARN_ON(!list_empty(&node->prio_list));
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next,
struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
plist_check_head(head);
}
static __always_inline void spin_unlock(spinlock_t *lock)
{
raw_spin_unlock(&lock->rlock);
}
#define raw_spin_unlock(lock) _raw_spin_unlock(lock)
static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
mmiowb_spin_unlock();
arch_spin_unlock(&lock->raw_lock);
__release(lock);
}
# define __release(x) (void)0
#define arch_spin_lock(l) queued_spin_lock(l)
#define arch_spin_unlock(l) queued_spin_unlock(l)
#ifndef queued_spin_unlock
/**
* queued_spin_unlock - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
/*
* unlock() needs release semantics:
*/
smp_store_release(&lock->locked, 0);
}
#endif
/**
* queued_spin_lock - acquire a queued spinlock
* @lock: Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
/**
* atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_acquire() there.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
instrument_atomic_read_write(v, sizeof(*v));
instrument_atomic_read_write(old, sizeof(*old));
return raw_atomic_try_cmpxchg_acquire(v, old, new);
}
/**
* raw_atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Safe to use in noinstr code; prefer atomic_try_cmpxchg_acquire() elsewhere.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
raw_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
#if defined(arch_atomic_try_cmpxchg_acquire)
return arch_atomic_try_cmpxchg_acquire(v, old, new);
#elif defined(arch_atomic_try_cmpxchg_relaxed)
bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
__atomic_acquire_fence();
return ret;
#elif defined(arch_atomic_try_cmpxchg)
return arch_atomic_try_cmpxchg(v, old, new);
#else
int r, o = *old;
r = raw_atomic_cmpxchg_acquire(v, o, new);
if (unlikely(r != o))
*old = r;
return likely(r == o);
#endif
}
static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
{
return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
#define arch_try_cmpxchg(ptr, pold, new) \
__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
#define __try_cmpxchg(ptr, pold, new, size) \
__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "q" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_W: \
{ \
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_L: \
{ \
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
/**
* queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
* @val: Current value of the queued spinlock 32-bit word
*
* (queue tail, pending bit, lock value)
*
* fast : slow : unlock
* : :
* uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
* : | ^--------.------. / :
* : v \ \ | :
* pending : (0,1,1) +--> (0,1,0) \ | :
* : | ^--' | | :
* : v | | :
* uncontended : (n,x,y) +--> (n,0,0) --' | :
* queue : | ^--' | :
* : v | :
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
if (pv_enabled())
goto pv_queue;
if (virt_spin_lock(lock))
return;
/*
* Wait for in-progress pending->locked hand-overs with a bounded
* number of spins so that we guarantee forward progress.
*
* 0,1,0 -> 0,0,1
*/
if (val == _Q_PENDING_VAL) {
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
/*
* If we observe any contention; queue.
*/
if (val & ~_Q_LOCKED_MASK)
goto queue;
/*
* trylock || pending
*
* 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
*/
val = queued_fetch_set_pending_acquire(lock);
/*
* If we observe contention, there is a concurrent locker.
*
* Undo and queue; our setting of PENDING might have made the
* n,0,0 -> 0,0,0 transition fail and it will now be waiting
* on @next to become !NULL.
*/
if (unlikely(val & ~_Q_LOCKED_MASK)) {
/* Undo PENDING if we set it. */
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
/*
* We're pending, wait for the owner to go away.
*
* 0,1,1 -> *,1,0
*
* this wait loop must be a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because not all
* clear_pending_set_locked() implementations imply full
* barriers.
*/
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
/*
* take ownership and clear the pending bit.
*
* 0,1,0 -> 0,0,1
*/
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
return;
/*
* End of pending bit optimistic spinning and beginning of MCS
* queuing.
*/
queue:
lockevent_inc(lock_slowpath);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
trace_contention_begin(lock, LCB_F_SPIN);
/*
* 4 nodes are allocated based on the assumption that there will
* not be nested NMIs taking spinlocks. That may not be true in
* some architectures even though the chance of needing more than
* 4 nodes will still be extremely unlikely. When that happens,
* we fall back to spinning on the lock directly without using
* any MCS node. This is not the most elegant solution, but is
* simple enough.
*/
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
node = grab_mcs_node(node, idx);
/*
* Keep counts of non-zero index values:
*/
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
/*
* Ensure that we increment the head node->count before initialising
* the actual node. If the compiler is kind enough to reorder these
* stores, then an IRQ could overwrite our assignments.
*/
barrier();
node->locked = 0;
node->next = NULL;
pv_init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
* attempt the trylock once more in the hope someone let go while we
* weren't watching.
*/
if (queued_spin_trylock(lock))
goto release;
/*
* Ensure that the initialisation of @node is complete before we
* publish the updated tail via xchg_tail() and potentially link
* @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
*/
smp_wmb();
/*
* Publish the updated tail.
* We have already touched the queueing cacheline; don't bother with
* pending stuff.
*
* p,*,* -> n,*,*
*/
old = xchg_tail(lock, tail);
next = NULL;
/*
* if there was a previous node; link it and wait until reaching the
* head of the waitqueue.
*/
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
/*
* While waiting for the MCS lock, the next pointer may have
* been set by another lock waiter. We optimistically load
* the next pointer & prefetch the cacheline for writing
* to reduce latency in the upcoming MCS unlock operation.
*/
next = READ_ONCE(node->next);
if (next)
prefetchw(next);
}
/*
* we're at the head of the waitqueue, wait for the owner & pending to
* go away.
*
* *,x,y -> *,0,0
*
* this wait loop must use a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
*
* The PV pv_wait_head_or_lock function, if active, will acquire
* the lock and return a non-zero value. So we have to skip the
* atomic_cond_read_acquire() call. As the next PV queue head hasn't
* been designated yet, there is no way for the locked value to become
* _Q_SLOW_VAL. So both the set_locked() and the
* atomic_cmpxchg_relaxed() calls will be safe.
*
* If PV isn't active, 0 will be returned instead.
*
*/
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
/*
* claim the lock:
*
* n,0,0 -> 0,0,1 : lock, uncontended
* *,*,0 -> *,*,1 : lock, contended
*
* If the queue head is the only one in the queue (lock value == tail)
* and nobody is pending, clear the tail code and grab the lock.
* Otherwise, we only need to grab the lock.
*/
/*
* In the PV case we might already have _Q_LOCKED_VAL set, because
* of lock stealing; therefore we must also allow:
*
* n,0,1 -> 0,0,1
*
* Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
* above wait condition, therefore any concurrent setting of
* PENDING will make the uncontended transition fail.
*/
if ((val & _Q_TAIL_MASK) == tail) {
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
goto release; /* No contention */
}
/*
* Either somebody is queued behind us or _Q_PENDING_VAL got set
* which will then detect the remaining tail and queue behind us
* ensuring we'll see a @next.
*/
set_locked(lock);
/*
* contended path; wait for next if not observed yet, release.
*/
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
/*
* release the node
*/
__this_cpu_dec(qnodes[0].mcs.count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);