目录
- 引言:用户态同步与内核的交汇点
- Futex 机制概览:快速用户空间互斥锁的设计哲学
- Futex 核心数据结构解析
- Futex 哈希与等待队列机制
- 自旋锁基础:从 raw_spin_lock 到 spin_lock
- 排队自旋锁(Queued Spinlock)与 MCS 锁
- 原子操作与 cmpxchg 指令
- Futex 等待路径的完整流程分析
- Futex 唤醒机制与竞争条件处理
- 性能考量与优化策略
- 总结与展望
引言:用户态同步与内核的交汇点
在现代操作系统中,进程/线程同步是一个核心且复杂的话题。从用户态的 pthread_mutex_lock 到内核态的底层锁实现,整个同步链条涉及多个层次的设计权衡。Linux 内核的 Futex(Fast Userspace muTEX)机制正是这种权衡的典范:它允许在无竞争的情况下完全在用户态完成锁操作,仅在出现竞争时才陷入内核进行线程的睡眠与唤醒。
本文基于 Linux 内核 6.8.12 的源码片段,深入剖析从 Futex 系统调用到最底层 MCS 队列自旋锁的完整实现链条。我们将看到,一个看似简单的 futex_wait 调用,背后涉及哈希表定位、自旋锁获取、优先级排序队列、原子操作、内存屏障、以及复杂的竞争条件处理。理解这些机制,对于编写高性能并发程序、调试死锁问题、以及优化系统性能都具有重要意义。
Futex 机制概览:快速用户空间互斥锁的设计哲学
设计背景与核心思想
Futex 机制诞生于 2002 年,由 Hubertus Franke、Rusty Russell 和 Matthew Kirkwood 设计,并在 Ottawa Linux Symposium 上首次提出。其核心洞察非常简洁:锁操作的开销主要来自竞争,而非锁本身。在大多数应用场景中,锁的持有时间很短,竞争并不频繁。如果每次加锁/解锁都要进行系统调用,那么系统调用的开销(约 100-300 纳秒)将成为性能瓶颈。
Futex 的设计哲学是:
- 无竞争路径完全在用户态:通过原子指令(如 cmpxchg)尝试获取锁,成功则直接返回,无需内核介入。
- 有竞争路径才陷入内核:当用户态发现锁已被占用时,调用 futex() 系统调用,让内核将当前线程放入等待队列并使其睡眠。
- 解锁时唤醒等待者:锁持有者释放锁后,通过 futex() 系统调用唤醒等待队列中的一个或多个线程。
这种设计使得 glibc 的 pthread_mutex_t、pthread_cond_t、pthread_rwlock_t、sem_t 以及 C++ 的 std::mutex 都能构建在 Futex 之上,获得优异的性能。
Futex 系统调用的核心操作
futex(2) 系统调用提供两个核心操作:
- FUTEX_WAIT:检查 futex 变量的值,如果等于期望值,则将当前线程加入等待队列并睡眠;否则立即返回。
- FUTEX_WAKE:唤醒等待队列中指定数量的线程。
这两个操作构成了用户态锁机制与内核调度器之间的桥梁。
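下面给出一个极简的用户态互斥锁示意,展示这一协议如何落地(参考 Ulrich Drepper《Futexes Are Tricky》中的三状态设计;futex()/lock()/unlock() 均为演示用的假想封装,省略错误处理):
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* 状态:0 = 未上锁,1 = 上锁且无等待者,2 = 上锁且可能有等待者 */
void lock(atomic_int *f)
{
        int c = 0;

        /* 无竞争快速路径:0 -> 1,一条原子指令,完全不进内核 */
        if (atomic_compare_exchange_strong(f, &c, 1))
                return;
        if (c != 2)
                c = atomic_exchange(f, 2);       /* 标记"可能有等待者" */
        while (c != 0) {
                futex(f, FUTEX_WAIT_PRIVATE, 2); /* 值仍为 2 时睡眠 */
                c = atomic_exchange(f, 2);
        }
}

void unlock(atomic_int *f)
{
        /* 仅当可能有等待者(状态 2)时才进内核唤醒一个线程 */
        if (atomic_exchange(f, 0) == 2)
                futex(f, FUTEX_WAKE_PRIVATE, 1);
}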
Futex 核心数据结构解析
union futex_key:标识唯一的 Futex
union futex_key {
        struct {
                u64 i_seq;                /* 共享 futex:inode 序列号 */
                unsigned long pgoff;      /* 页在文件中的偏移 */
                unsigned int offset;      /* futex 在页内的偏移 */
        } shared;
        struct {
                union {
                        struct mm_struct *mm;  /* 私有 futex:mm 指针 */
                        u64 __tmp;
                };
                unsigned long address;    /* 页对齐后的虚拟地址 */
                unsigned int offset;
        } private;
        struct {
                u64 ptr;
                unsigned long word;
                unsigned int offset;      /* futex_hash() 以此作为哈希种子 */
        } both;
};
futex_key 是内核标识一个 futex 的唯一方式。根据 futex 是进程私有(private)还是共享(shared),使用不同的字段组合:
- Private Futex:使用 (current->mm, 对齐后的虚拟地址, 页内偏移) 作为 key。同一进程内的不同线程通过虚拟地址即可唯一标识 futex,无需页表遍历,性能更优。
- Shared Futex:使用 (inode->i_sequence, page->index, 页内偏移) 作为 key。这允许多个进程映射同一物理页面时,通过相同的 key 找到同一个等待队列。
struct futex_q:每个等待线程的队列节点
struct futex_q {                          /* 仅列出与本文相关的字段 */
struct plist_node list; /* sorted by priority in hash bucket */
struct task_struct *task; /* the sleeping task */
spinlock_t *lock_ptr; /* hash bucket lock */
union futex_key key; /* what futex we're waiting on */
struct futex_pi_state *pi_state; /* priority inheritance state */
u32 bitset; /* for FUTEX_WAIT_BITSET */
};
每个调用 FUTEX_WAIT 的线程,内核都会在其内核栈上初始化一个 futex_q 结构(见 __futex_wait 中的局部变量 q),并将其挂入对应的哈希桶等待链表。plist_node 表示这是一个按优先级排序的链表节点,确保实时线程优先被唤醒。
struct futex_hash_bucket:哈希桶
struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain; /* list of futex_q entries */
} ____cacheline_aligned_in_smp;
所有等待同一 futex(或哈希冲突的不同 futex)的线程,其 futex_q 节点都会被挂入同一个哈希桶的 chain 链表中。哈希桶本身包含一个自旋锁 lock,用于保护链表操作,以及一个原子计数器 waiters,用于快速判断是否有等待者。
Futex 哈希与等待队列机制
哈希函数 futex_hash
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
内核使用 jhash2(Jenkins hash)对 futex_key 进行哈希计算,然后映射到全局哈希表 futex_queues 的某个桶中。哈希表大小 futex_hashsize 恒为 2 的幂,因此可以用位与运算代替取模。
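举例来说,若 futex_hashsize 为 4096,掩码即 0xFFF:哈希值 0x5A3C1 对应的桶下标为 0x5A3C1 & 0xFFF = 0x3C1,与 0x5A3C1 % 4096 的结果相同,但位与只需一条指令。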
全局哈希表定义
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
哈希表在系统初始化时分配;__futex_data 的对齐保证 queues 指针与 hashsize 落在同一缓存行,而每个哈希桶又以 ____cacheline_aligned_in_smp 按缓存行对齐,以减少多 CPU 访问时的伪共享(false sharing)。
自旋锁基础:从 raw_spin_lock 到 spin_lock
在深入 Futex 的等待与唤醒逻辑之前,我们必须先理解 Linux 内核的自旋锁实现,因为哈希桶的保护、MCS 队列的维护都依赖于自旋锁。
自旋锁的层次结构
Linux 内核的自旋锁实现分为多个层次,这种分层设计既保证了通用性,又允许架构特定的优化:
/* 最底层:架构相关的自旋锁操作 */
#define arch_spin_lock(l) queued_spin_lock(l)
#define arch_spin_unlock(l) queued_spin_unlock(l)
/* 原始自旋锁:关闭抢占,获取锁 */
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
__acquire(lock);
arch_spin_lock(&lock->raw_lock);
mmiowb_spin_lock();
}
/* 中间层:包含锁依赖映射(用于死锁检测) */
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
/* 导出符号(非内联版本) */
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
/* 最上层:通用自旋锁接口 */
#define raw_spin_lock(lock) _raw_spin_lock(lock)
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
关键步骤解析
- preempt_disable():关闭内核抢占。自旋锁持有期间不允许被其他任务抢占,否则可能导致死锁或长时间自旋。
- spin_acquire():记录锁的获取,用于内核的锁依赖检测(lockdep)功能,帮助调试死锁。
- LOCK_CONTENDED():开启 CONFIG_LOCK_STAT 时,先用 do_raw_spin_trylock 尝试一次,失败则记录一次竞争事件再调用 do_raw_spin_lock;未开启统计时直接展开为 do_raw_spin_lock。
- arch_spin_lock():架构相关的实际加锁操作,在 x86 上映射为 queued_spin_lock()。
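作为参照,下面给出自旋锁的典型使用示意(my_lock、my_item 等均为假想命名):由于持锁期间抢占已被关闭,临界区必须短小,且绝不能调用可能睡眠的函数。
#include <linux/list.h>
#include <linux/spinlock.h>

struct my_item {
        struct list_head node;
        int val;
};

static DEFINE_SPINLOCK(my_lock);
static LIST_HEAD(my_list);

void add_item(struct my_item *item)
{
        spin_lock(&my_lock);                  /* 进入临界区:抢占已关闭 */
        list_add_tail(&item->node, &my_list); /* 只做简短的数据结构操作 */
        spin_unlock(&my_lock);                /* 解锁后恢复抢占 */
}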
自旋锁释放
static __always_inline void spin_unlock(spinlock_t *lock)
{
raw_spin_unlock(&lock->rlock);
}
#define raw_spin_unlock(lock) _raw_spin_unlock(lock)
static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
mmiowb_spin_unlock();
arch_spin_unlock(&lock->raw_lock);
__release(lock);
}
释放锁时,首先执行 mmiowb_spin_unlock():在需要保证 MMIO 写顺序的架构上,它会在解锁前刷出未完成的 MMIO 写(x86 上为空操作);随后调用架构相关的解锁操作,即 queued_spin_unlock() 以 release 语义将 locked 字节清零;最后的 __release(lock) 仅供编译期静态分析使用,运行时为空操作。
排队自旋锁(Queued Spinlock)与 MCS 锁
MCS 锁的基本原理
传统的测试并设置(test-and-set)乃至测试-测试并设置(TTAS)自旋锁存在严重的缓存行抖动(cache line bouncing)问题:所有等待者都在同一个内存位置(锁变量)上自旋,锁状态每次变化都会使所有等待 CPU 的缓存行失效,产生大量跨 CPU 的缓存一致性流量。
MCS 锁(由 Mellor-Crummey 和 Scott 提出)解决了这个问题。其核心思想是:
- 每个等待者在自己的本地变量上自旋,而非在全局锁变量上自旋。
- 等待者通过链表排队,锁释放时只通知下一个等待者。
- 这消除了缓存行抖动,实现了公平性(FIFO),并显著提升了大规模系统的性能。
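为了直观理解,下面用 C11 原子操作给出一个教学用的最小 MCS 锁(用户态示意,非内核实现,所有命名均为演示假定):
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;               /* 等待者只在自己的这个字段上自旋 */
};

struct mcs_lock {
        _Atomic(struct mcs_node *) tail;  /* 队尾;NULL 表示锁空闲 */
};

void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
        struct mcs_node *prev;

        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);

        /* 原子地把自己接到队尾,并取得前驱节点 */
        prev = atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
        if (prev == NULL)
                return;                   /* 队列原为空:直接持锁 */

        atomic_store_explicit(&prev->next, me, memory_order_release);
        /* 本地自旋:只读自己的 locked,不产生跨核缓存流量 */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
                ;
}

void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *me)
{
        struct mcs_node *next =
                atomic_load_explicit(&me->next, memory_order_acquire);

        if (next == NULL) {
                struct mcs_node *expected = me;
                /* 没有可见后继:尝试把 tail 从自己换回 NULL */
                if (atomic_compare_exchange_strong_explicit(
                            &lock->tail, &expected, NULL,
                            memory_order_acq_rel, memory_order_acquire))
                        return;
                /* 有人正在入队:等它把 next 链接好 */
                while (!(next = atomic_load_explicit(&me->next,
                                                     memory_order_acquire)))
                        ;
        }
        /* 只通知下一个等待者,锁按 FIFO 交接 */
        atomic_store_explicit(&next->locked, false, memory_order_release);
}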
Linux 内核的排队自旋锁实现
Linux 内核对传统 MCS 锁进行了精巧的压缩,把队尾编码、pending 位和持锁标志全部装入一个 32 位字(qspinlock),使其能直接内嵌在 spinlock_t 中:
/**
* queued_spin_lock - acquire a queued spinlock
* @lock: Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
快速路径(无竞争)
atomic_try_cmpxchg_acquire 尝试将锁从 0(未锁定)原子地设置为 _Q_LOCKED_VAL(已锁定)。这是最常见的无竞争情况,只需一条原子指令即可完成加锁。
慢速路径(有竞争)
当快速路径失败时,进入 queued_spin_lock_slowpath,这是整个排队自旋锁的核心:
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
if (pv_enabled())
goto pv_queue;
if (virt_spin_lock(lock))
return;
/* 等待正在进行的 pending->locked 交接 */
if (val == _Q_PENDING_VAL) {
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
/* 观察到竞争,进入队列 */
if (val & ~_Q_LOCKED_MASK)
goto queue;
/* 尝试设置 pending 位 */
val = queued_fetch_set_pending_acquire(lock);
/* 如果仍有竞争,撤销 pending 并排队 */
if (unlikely(val & ~_Q_LOCKED_MASK)) {
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
/* 等待锁持有者释放 */
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
/* 获取锁并清除 pending 位 */
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
return;
queue:
/* MCS 队列逻辑 */
lockevent_inc(lock_slowpath);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
/* 防止嵌套 NMI 超过节点限制 */
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
node = grab_mcs_node(node, idx);
barrier();
node->locked = 0;
node->next = NULL;
pv_init_node(node);
/* 尝试最后一次快速获取 */
if (queued_spin_trylock(lock))
goto release;
smp_wmb();
/* 将当前节点加入队列尾部 */
old = xchg_tail(lock, tail);
next = NULL;
/* 如果有前驱节点,链接并等待 */
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
/* 预取下一个节点的缓存行 */
next = READ_ONCE(node->next);
if (next)
prefetchw(next);
}
/* 等待锁持有者离开 */
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
/* 获取锁 */
if ((val & _Q_TAIL_MASK) == tail) {
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
goto release;
}
set_locked(lock);
/* 通知下一个等待者 */
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
__this_cpu_dec(qnodes[0].mcs.count);
}
状态机解析
源码注释中给出了清晰的状态转换图:
fast : slow : unlock
: :
uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
: | ^--------.------. / :
: v \ \ | :
pending : (0,1,1) +--> (0,1,0) \ | :
: | ^--' | | :
: v | | :
uncontended : (n,x,y) +--> (n,0,0) --' | :
queue : | ^--' | :
: v | :
contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
queue : ^--' :
其中三个位域分别表示:
- tail:队列尾部指针(编码了 CPU ID 和节点索引)
- pending:是否有线程正在乐观自旋等待
- locked:锁是否被持有
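在常见配置(NR_CPUS < 16384)下,这三个位域在 32 位锁字中的布局与尾部编码大致如下(依据 asm-generic/qspinlock_types.h 与 kernel/locking/qspinlock.c 整理的示意):
/*
 *  bits  0- 7: locked 字节(1 = 已持锁)
 *  bits  8-15: pending 字节(乐观自旋的第一个等待者)
 *  bits 16-17: tail idx(per-CPU MCS 节点索引,对应 4 种嵌套上下文)
 *  bits 18-31: tail cpu(CPU 编号 + 1,0 表示队列为空)
 */
#define _Q_TAIL_IDX_OFFSET      16
#define _Q_TAIL_CPU_OFFSET      18

static u32 encode_tail(int cpu, int idx)
{
        u32 tail;

        tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;  /* +1 以区分空队列 */
        tail |= idx << _Q_TAIL_IDX_OFFSET;        /* idx 恒小于 4 */

        return tail;
}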
关键优化点
- 乐观自旋(Optimistic Spinning):第一个等待者不是立即进入 MCS 队列,而是先设置 pending 位并在锁变量上自旋。如果锁很快释放,可以避免队列操作的开销。
- 本地自旋:进入 MCS 队列后,等待者在 node->locked 上自旋,这是本地(per-CPU)变量,不会引起缓存行抖动。
- 锁窃取(Lock Stealing):在某些情况下,新到达的线程可以直接获取锁,而不必等待队列中的线程。这提高了吞吐量,但需要配合防饥饿机制。
- 虚拟化支持(PV) :在虚拟化环境中,与 hypervisor 协作,避免在锁持有者被抢占时无效自旋。
原子操作与 cmpxchg 指令
排队自旋锁的实现依赖于底层的原子操作,其中最关键的是比较并交换(Compare-and-Swap, CAS)。
atomic_try_cmpxchg_acquire 实现
static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
instrument_atomic_read_write(v, sizeof(*v));
instrument_atomic_read_write(old, sizeof(*old));
return raw_atomic_try_cmpxchg_acquire(v, old, new);
}
x86 架构的 cmpxchg 内联汇编
#define arch_try_cmpxchg(ptr, pold, new) \
__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
#define __try_cmpxchg(ptr, pold, new, size) \
__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "q" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_W: \
{ \
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_L: \
{ \
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
关键点
- lock 前缀:确保指令在多处理器环境下的原子性。
- +m 约束:表示内存操作数会被读写。
- +a 约束:__old 放在 eax/rax 寄存器中,cmpxchg 会自动比较该寄存器与内存值。
- CC_SET(z)/CC_OUT(z):利用 x86 的零标志位(ZF)判断比较是否成功。
- "memory" 破坏描述符:防止编译器重排内存操作,确保内存屏障语义。
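这种"失败时把旧值写回 *old"的 try_cmpxchg 语义与 C11 的 atomic_compare_exchange 一致,可以省去 CAS 循环中的重复加载。下面用一个用户态小例子演示(atomic_inc_demo 为演示用的假想函数):
#include <stdatomic.h>

/* 用 CAS 循环原子地把 *v 加一,返回新值 */
int atomic_inc_demo(atomic_int *v)
{
        int old = atomic_load_explicit(v, memory_order_relaxed);

        /* 比较失败时,old 被自动更新为 *v 的当前值,无需再次 load */
        while (!atomic_compare_exchange_weak_explicit(
                       v, &old, old + 1,
                       memory_order_acq_rel, memory_order_relaxed))
                ;
        return old + 1;
}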
Futex 等待路径的完整流程分析
futex_wait 入口
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
// yym-gaizao
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
/* 清理超时定时器 */
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
/* 处理可重启系统调用 */
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
futex_wait 首先设置超时定时器(如果有),然后调用 __futex_wait 执行核心逻辑。如果因信号中断(-ERESTARTSYS),则设置重启块,以便信号处理完成后重新执行系统调用。
__futex_wait 核心逻辑
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
关键流程
- futex_wait_setup:准备等待,获取哈希桶锁,检查 futex 值。
- futex_wait_queue:将当前线程加入等待队列并睡眠。
- 唤醒后检查:如果被正常唤醒(已从队列移除),返回成功;如果超时,返回 -ETIMEDOUT;如果被信号中断,返回 -ERESTARTSYS;如果是虚假唤醒,重试。
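站在调用者的角度,这几种结果都需要处理。下面是一个用户态包装的示意(wait_on 为假想函数,错误处理从简;注意用户态看到的是 errno 形式,且 EWOULDBLOCK 与 EAGAIN 同值):
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int wait_on(atomic_int *f, int expected, const struct timespec *timeout)
{
        for (;;) {
                long r = syscall(SYS_futex, f, FUTEX_WAIT_PRIVATE,
                                 expected, timeout, NULL, 0);
                if (r == 0)
                        return 0;  /* 被唤醒(可能是虚假唤醒,调用者需复查条件) */
                if (errno == EAGAIN)
                        return 0;  /* 值已变化,对应内核的 -EWOULDBLOCK */
                if (errno == ETIMEDOUT)
                        return -1; /* 超时 */
                if (errno == EINTR)
                        continue;  /* 被信号打断:重试 */
                return -1;         /* 其他错误,如 EFAULT */
        }
}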
futex_wait_setup:竞争条件的核心处理
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
*
* Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
* Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
*
* The basic logical guarantee of a futex is that it blocks ONLY
* if cond(var) is known to be true at the time of blocking, for
* any cond. If we locked the hash-bucket after testing *uaddr, that
* would open a race condition where we could block indefinitely with
* cond(var) false, which would violate the guarantee.
*
* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.
*/
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
关键设计:锁顺序与值检查
注释中详细解释了为何必须先锁定哈希桶,再检查 futex 值:
- 防止丢失唤醒:如果先检查值再锁桶,可能在检查值和锁桶之间发生唤醒,导致线程永远睡眠。
- 防止吸收错误唤醒:如果在值不匹配时已经入队,可能错误地消耗一个本不属于它的唤醒。
futex_q_lock:获取哈希桶锁
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
* waiting for the spinlock. This is safe as all futex_q_lock()
* users end up calling futex_queue(). Similarly, for housekeeping,
* decrement the counter at futex_q_unlock() when some error has
* occurred and we don't end up adding the task to the list.
*/
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
关键细节:waiters 计数器
在获取自旋锁之前先增加 waiters 计数器,这是为了防止以下竞争:
- 等待者 A 计算出哈希桶,但尚未获取锁。
- 唤醒者 B 获取锁,发现 waiters 为 0,直接返回(认为没有等待者)。
- A 获取锁并入队,但已错过唤醒,永远睡眠。
通过先增加计数器(隐含内存屏障),确保唤醒者能看到即将入队的等待者。
futex_wait_queue:入队并睡眠
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
关键设计:状态设置与锁释放的顺序
- set_current_state(TASK_INTERRUPTIBLE):将当前线程标记为可中断睡眠状态,内部用 smp_store_mb() 实现,自带内存屏障。
- futex_queue(q, hb):在持有哈希桶锁的情况下将当前线程加入队列,并在最后释放锁。
- 检查是否已被唤醒:由于 futex_queue 释放锁时包含内存屏障,如果唤醒发生在 schedule() 之前,队列节点已被移除,此时可以跳过 schedule()。
- schedule():主动放弃 CPU,进入睡眠。
- __set_current_state(TASK_RUNNING):被唤醒后恢复运行状态。
futex_queue:将线程加入优先级队列
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// yym-gaizao
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
__futex_queue:核心入队逻辑
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio;
/*
* The priority used to register this element is
* - either the real thread-priority for the real-time threads
* (i.e. threads with a priority lower than MAX_RT_PRIO)
* - or MAX_RT_PRIO for non-RT threads.
* Thus, all RT-threads are woken first in priority order, and
* the others are woken last, in FIFO order.
*/
prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
优先级队列的设计
- 实时线程:使用实际优先级(< MAX_RT_PRIO),按优先级顺序唤醒。
- 非实时线程:统一使用 MAX_RT_PRIO,按 FIFO 顺序唤醒。
这确保了实时线程的响应性,同时为非实时线程提供公平性。
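举一个具体的例子(基于内核的优先级编码,数值越小优先级越高):SCHED_FIFO、rt_priority 为 90 的线程,其 normal_prio = MAX_RT_PRIO - 1 - 90 = 9,入队优先级即 min(9, 100) = 9;nice 为 0 的普通线程 normal_prio 为 120,入队优先级被钳制为 min(120, 100) = 100。因此该实时线程必然排在所有普通线程之前。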
plist_add:优先级链表操作
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
plist_check_head(head);
WARN_ON(!plist_node_empty(node));
WARN_ON(!list_empty(&node->prio_list));
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next,
struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
plist_check_head(head);
}
plist(priority list)是 Linux 内核的一种特殊链表,同时维护两个链表:
- node_list:按插入顺序排列、包含所有节点的链表。
- prio_list:按优先级排列的链表,每个优先级只有第一个节点挂在其中,作为该优先级组的入口。
这使得按优先级遍历和按插入顺序遍历都能高效进行。
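举例来说,依次插入优先级为 100、5、100、5 的四个节点后,node_list 的顺序是 5、5、100、100(同优先级保持 FIFO);prio_list 上只挂着两个"组长"节点(第一个 5 和第一个 100)。plist_add 沿 prio_list 跳跃查找插入位置,代价只与不同优先级的个数成正比,而非节点总数。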
Futex 唤醒机制与竞争条件处理
虽然提供的源码片段中没有包含 futex_wake 的完整实现,但我们可以从等待路径的设计中推断唤醒机制的关键点。
唤醒路径的伪代码
基于内核文档和相关源码,唤醒路径大致如下:
int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
DEFINE_WAKE_Q(wake_q);
int ret;
if (!bitset)
return -EINVAL;
ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
if ((flags & FLAGS_STRICT) && !nr_wake)
return 0;
hb = futex_hash(&key);
/* 快速路径:如果没有等待者,直接返回 */
if (!futex_hb_waiters_pending(hb))
return ret;
spin_lock(&hb->lock);
plist_for_each_entry_safe(this, next, &hb->chain, list) {
if (futex_match(&this->key, &key)) {
if (this->pi_state || this->rt_waiter) {
ret = -EINVAL;
break;
}
/* 检查 bitset 是否匹配 */
if (!(this->bitset & bitset))
continue;
this->wake(&wake_q, this);
if (++ret >= nr_wake)
break;
}
}
spin_unlock(&hb->lock);
wake_up_q(&wake_q); /* 实际唤醒操作在释放锁后进行 */
return ret;
}
关键设计点
- 锁内标记,锁外唤醒:唤醒操作在持有哈希桶锁时只做标记(把任务加入 wake_q 唤醒队列),实际唤醒(wake_up_process)在释放锁后进行。这减少了锁持有时间,避免在锁保护下执行复杂的调度操作。
- Bitset 匹配:支持 FUTEX_WAIT_BITSET 和 FUTEX_WAKE_BITSET,允许线程等待/唤醒特定的位模式,实现更灵活的同步语义。
- 优先级继承(PI)支持:对于实时线程,futex 支持优先级继承协议,防止优先级反转。这通过 futex_pi_state 和 rt_waiter 字段实现。
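"锁内标记,锁外唤醒"正是内核 wake_q 机制的标准用法,示意如下(wake_q_add/wake_up_q 为真实内核 API;锁、链表与 waiter 结构均为假想):
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

struct waiter {
        struct list_head node;
        struct task_struct *task;
};

void wake_all_waiters(spinlock_t *lock, struct list_head *waiters)
{
        DEFINE_WAKE_Q(wake_q);
        struct waiter *w;

        spin_lock(lock);
        list_for_each_entry(w, waiters, node)
                wake_q_add(&wake_q, w->task);  /* 锁内只做标记 */
        spin_unlock(lock);

        wake_up_q(&wake_q);                    /* 真正的唤醒在锁外进行 */
}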
性能考量与优化策略
哈希桶大小与扩展性
哈希桶的数量(futex_hashsize)直接影响并发性能:
- 桶太少:大量不同 futex 映射到同一桶,增加锁竞争。
- 桶太多:内存占用增加,缓存效率降低。
默认大小并非固定的 256:futex_init() 按 roundup_pow_of_two(256 * num_possible_cpus()) 计算,随 CPU 数自动扩展(CONFIG_BASE_SMALL 配置下仅 16 个桶)。
乐观自旋与 MCS 锁的结合
在 Futex 的哈希桶锁(spinlock_t)层面,Linux 内核使用了排队自旋锁。这意味着:
- 多个 CPU 同时访问同一哈希桶时,会在 MCS 队列中排队。
- 每个等待者在本地变量上自旋,避免了缓存行抖动。
- 有公开报告显示,在 240 核级别的大型系统上,这类优化可以把耗在锁上的系统时间从约 54% 降到 2% 左右。
用户态乐观自旋
现代 glibc 在用户态也实现了乐观自旋:在调用 futex_wait 之前,先自旋一段时间等待锁释放。如果锁很快释放,可以完全避免系统调用开销。
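例如,glibc 提供的"自适应"互斥锁类型就采用了这一策略(PTHREAD_MUTEX_ADAPTIVE_NP 为 GNU 扩展),初始化示意如下:
#define _GNU_SOURCE
#include <pthread.h>

static pthread_mutex_t m;

void init_adaptive_mutex(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* 加锁失败时先做有限次自旋,再退回 futex_wait */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&m, &attr);
        pthread_mutexattr_destroy(&attr);
}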
锁竞争诊断
对于生产环境中的锁竞争问题,可以使用 perf lock 工具进行诊断:
| perf lock 表现 | 根因 | 优化方案 |
|---|---|---|
| contended 高 + avg wait 低(< 1us) | 锁粒度太粗,频繁短竞争 | 拆锁(per-bucket / per-CPU)或无锁 |
| contended 低 + avg wait 高(> 10us) | 临界区内有 I/O / malloc / 日志 | 移出锁外 + Collect-Release-Execute |
| contended 高 + avg wait 高 | 严重设计问题 | 无锁队列(MPSC/SPSC)彻底替代 |
| spinlock type + 高 contended | 临界区持有时间超过自旋收益 | 改用 mutex(允许休眠) |
| futex_wake 路径占比 > 20% | hash bucket 竞争(二阶效应) | 竞争已极严重,必须无锁化 |
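在实际使用中,典型流程是先用 perf lock record 采集锁事件(需要内核启用相应的锁跟踪点),再用 perf lock report 按锁查看 contended 次数与等待时间;较新版本的 perf 还提供 perf lock contention 子命令,可基于 BPF 在线统计而无需先落盘。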
总结与展望
通过本文对 Linux 内核 6.8.12 源码的深入分析,我们完整地梳理了从 Futex 系统调用到 MCS 队列自旋锁的实现链条:
- 用户态锁(如 pthread_mutex_t)在无竞争时完全在用户态通过原子操作完成。
- 出现竞争时,调用 futex_wait/futex_wake 陷入内核。
- 内核通过 futex_key 定位哈希桶,使用自旋锁保护等待队列。
- 自旋锁底层使用 MCS 排队机制,确保公平性和缓存效率。
- 原子操作(cmpxchg)是整个同步机制的基石。
这种分层设计体现了 Linux 内核在性能、公平性和可扩展性之间的精妙权衡。随着硬件的发展(更多核心、NUMA 架构、新原子指令),内核的同步机制也在不断演进。未来的发展方向可能包括:
- 更细粒度的哈希桶:减少桶级竞争。
- 自适应自旋:根据锁持有历史动态调整自旋时间。
- 硬件事务内存(HTM):利用 Intel TSX 等硬件特性实现无锁同步。
- 更高效的虚拟化支持:减少 hypervisor 层的锁开销。
理解这些底层机制,不仅能帮助我们编写更高效的并发程序,也为诊断和解决复杂的性能问题提供了坚实的基础。
源码
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
__acquire(lock);
arch_spin_lock(&lock->raw_lock);
mmiowb_spin_lock();
}
# define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
#define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i)
#define spin_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i)
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
#define raw_spin_lock(lock) _raw_spin_lock(lock)
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
* corresponding hash bucket in the global hash.
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
/* The key must be already stored in q->key. */
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
* waiting for the spinlock. This is safe as all futex_q_lock()
* users end up calling futex_queue(). Similarly, for housekeeping,
* decrement the counter at futex_q_unlock() when some error has
* occurred and we don't end up adding the task to the list.
*/
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
/*
* The base of the bucket array and its size are always used together
* (after initialization only in futex_hash()), so ensure that they
* reside in the same cacheline.
*/
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
/**
* futex_wait_setup() - Prepare to wait on a futex
* @uaddr: the futex userspace address
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
* @hb: storage for hash_bucket pointer to be returned to caller
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
* Return with the hb lock held on success, and unlocked on failure.
*
* Return:
* - 0 - uaddr contains val and hb has been locked;
* - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
*
* Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
* Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
*
* The basic logical guarantee of a futex is that it blocks ONLY
* if cond(var) is known to be true at the time of blocking, for
* any cond. If we locked the hash-bucket after testing *uaddr, that
* would open a race condition where we could block indefinitely with
* cond(var) false, which would violate the guarantee.
*
* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.
*/
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
// yym-gaizao
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
/* No timeout, nothing to clean up. */
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
/**
* futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
/**
* futex_queue() - Enqueue the futex_q on the futex_hash_bucket
* @q: The futex_q to enqueue
* @hb: The destination hash bucket
*
* The hb->lock must be held by the caller, and is released here. A call to
* futex_queue() is typically paired with exactly one call to futex_unqueue(). The
* exceptions involve the PI related operations, which may use futex_unqueue_pi()
* or nothing if the unqueue is done as part of the wake process and the unqueue
* state is implicit in the state of woken task (see futex_wait_requeue_pi() for
* an example).
*/
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// yym-gaizao
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
// yym-gaizao
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio;
/*
* The priority used to register this element is
* - either the real thread-priority for the real-time threads
* (i.e. threads with a priority lower than MAX_RT_PRIO)
* - or MAX_RT_PRIO for non-RT threads.
* Thus, all RT-threads are woken first in priority order, and
* the others are woken last, in FIFO order.
*/
prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
/**
* plist_add - add @node to @head
*
* @node: &struct plist_node pointer
* @head: &struct plist_head pointer
*/
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
plist_check_head(head);
WARN_ON(!plist_node_empty(node));
WARN_ON(!list_empty(&node->prio_list));
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next,
struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
plist_check_head(head);
}
static __always_inline void spin_unlock(spinlock_t *lock)
{
raw_spin_unlock(&lock->rlock);
}
#define raw_spin_unlock(lock) _raw_spin_unlock(lock)
static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
mmiowb_spin_unlock();
arch_spin_unlock(&lock->raw_lock);
__release(lock);
}
# define __release(x) (void)0
#define arch_spin_lock(l) queued_spin_lock(l)
#define arch_spin_unlock(l) queued_spin_unlock(l)
#ifndef queued_spin_unlock
/**
* queued_spin_unlock - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
/*
* unlock() needs release semantics:
*/
smp_store_release(&lock->locked, 0);
}
#endif
/**
* queued_spin_lock - acquire a queued spinlock
* @lock: Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
/**
* atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_acquire() there.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
instrument_atomic_read_write(v, sizeof(*v));
instrument_atomic_read_write(old, sizeof(*old));
return raw_atomic_try_cmpxchg_acquire(v, old, new);
}
/**
* raw_atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Safe to use in noinstr code; prefer atomic_try_cmpxchg_acquire() elsewhere.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
raw_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
#if defined(arch_atomic_try_cmpxchg_acquire)
return arch_atomic_try_cmpxchg_acquire(v, old, new);
#elif defined(arch_atomic_try_cmpxchg_relaxed)
bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
__atomic_acquire_fence();
return ret;
#elif defined(arch_atomic_try_cmpxchg)
return arch_atomic_try_cmpxchg(v, old, new);
#else
int r, o = *old;
r = raw_atomic_cmpxchg_acquire(v, o, new);
if (unlikely(r != o))
*old = r;
return likely(r == o);
#endif
}
static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
{
return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
#define arch_try_cmpxchg(ptr, pold, new) \
__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
#define __try_cmpxchg(ptr, pold, new, size) \
__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "q" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_W: \
{ \
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_L: \
{ \
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
/**
* queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
* @val: Current value of the queued spinlock 32-bit word
*
* (queue tail, pending bit, lock value)
*
* fast : slow : unlock
* : :
* uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
* : | ^--------.------. / :
* : v \ \ | :
* pending : (0,1,1) +--> (0,1,0) \ | :
* : | ^--' | | :
* : v | | :
* uncontended : (n,x,y) +--> (n,0,0) --' | :
* queue : | ^--' | :
* : v | :
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
if (pv_enabled())
goto pv_queue;
if (virt_spin_lock(lock))
return;
/*
* Wait for in-progress pending->locked hand-overs with a bounded
* number of spins so that we guarantee forward progress.
*
* 0,1,0 -> 0,0,1
*/
if (val == _Q_PENDING_VAL) {
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
/*
* If we observe any contention; queue.
*/
if (val & ~_Q_LOCKED_MASK)
goto queue;
/*
* trylock || pending
*
* 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
*/
val = queued_fetch_set_pending_acquire(lock);
/*
* If we observe contention, there is a concurrent locker.
*
* Undo and queue; our setting of PENDING might have made the
* n,0,0 -> 0,0,0 transition fail and it will now be waiting
* on @next to become !NULL.
*/
if (unlikely(val & ~_Q_LOCKED_MASK)) {
/* Undo PENDING if we set it. */
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
/*
* We're pending, wait for the owner to go away.
*
* 0,1,1 -> *,1,0
*
* this wait loop must be a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because not all
* clear_pending_set_locked() implementations imply full
* barriers.
*/
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
/*
* take ownership and clear the pending bit.
*
* 0,1,0 -> 0,0,1
*/
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
return;
/*
* End of pending bit optimistic spinning and beginning of MCS
* queuing.
*/
queue:
lockevent_inc(lock_slowpath);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
trace_contention_begin(lock, LCB_F_SPIN);
/*
* 4 nodes are allocated based on the assumption that there will
* not be nested NMIs taking spinlocks. That may not be true in
* some architectures even though the chance of needing more than
* 4 nodes will still be extremely unlikely. When that happens,
* we fall back to spinning on the lock directly without using
* any MCS node. This is not the most elegant solution, but is
* simple enough.
*/
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
node = grab_mcs_node(node, idx);
/*
* Keep counts of non-zero index values:
*/
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
/*
* Ensure that we increment the head node->count before initialising
* the actual node. If the compiler is kind enough to reorder these
* stores, then an IRQ could overwrite our assignments.
*/
barrier();
node->locked = 0;
node->next = NULL;
pv_init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
* attempt the trylock once more in the hope someone let go while we
* weren't watching.
*/
if (queued_spin_trylock(lock))
goto release;
/*
* Ensure that the initialisation of @node is complete before we
* publish the updated tail via xchg_tail() and potentially link
* @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
*/
smp_wmb();
/*
* Publish the updated tail.
* We have already touched the queueing cacheline; don't bother with
* pending stuff.
*
* p,*,* -> n,*,*
*/
old = xchg_tail(lock, tail);
next = NULL;
/*
* if there was a previous node; link it and wait until reaching the
* head of the waitqueue.
*/
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
/*
* While waiting for the MCS lock, the next pointer may have
* been set by another lock waiter. We optimistically load
* the next pointer & prefetch the cacheline for writing
* to reduce latency in the upcoming MCS unlock operation.
*/
next = READ_ONCE(node->next);
if (next)
prefetchw(next);
}
/*
* we're at the head of the waitqueue, wait for the owner & pending to
* go away.
*
* *,x,y -> *,0,0
*
* this wait loop must use a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
*
* The PV pv_wait_head_or_lock function, if active, will acquire
* the lock and return a non-zero value. So we have to skip the
* atomic_cond_read_acquire() call. As the next PV queue head hasn't
* been designated yet, there is no way for the locked value to become
* _Q_SLOW_VAL. So both the set_locked() and the
* atomic_cmpxchg_relaxed() calls will be safe.
*
* If PV isn't active, 0 will be returned instead.
*
*/
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
/*
* claim the lock:
*
* n,0,0 -> 0,0,1 : lock, uncontended
* *,*,0 -> *,*,1 : lock, contended
*
* If the queue head is the only one in the queue (lock value == tail)
* and nobody is pending, clear the tail code and grab the lock.
* Otherwise, we only need to grab the lock.
*/
/*
* In the PV case we might already have _Q_LOCKED_VAL set, because
* of lock stealing; therefore we must also allow:
*
* n,0,1 -> 0,0,1
*
* Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
* above wait condition, therefore any concurrent setting of
* PENDING will make the uncontended transition fail.
*/
if ((val & _Q_TAIL_MASK) == tail) {
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
goto release; /* No contention */
}
/*
* Either somebody is queued behind us or _Q_PENDING_VAL got set
* which will then detect the remaining tail and queue behind us
* ensuring we'll see a @next.
*/
set_locked(lock);
/*
* contended path; wait for next if not observed yet, release.
*/
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
/*
* release the node
*/
__this_cpu_dec(qnodes[0].mcs.count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);