Inside Linux Kernel 6.8.12: A Complete Walkthrough of Synchronization, from Futex to the MCS Queued Spinlock

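Contents

  1. Introduction: where user-space synchronization meets the kernel
  2. Futex overview: the design philosophy of the fast user-space mutex
  3. Futex core data structures
  4. Futex hashing and wait queues
  5. Spinlock basics: from raw_spin_lock to spin_lock
  6. Queued spinlocks and the MCS lock
  7. Atomic operations and the cmpxchg instruction
  8. The futex wait path, step by step
  9. Futex wakeup and race-condition handling
  10. Performance considerations and optimization strategies
  11. Summary and outlook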

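Introduction: where user-space synchronization meets the kernel

Process and thread synchronization is a central and complicated topic in any modern operating system. From pthread_mutex_lock in user space down to the lowest-level lock implementations in the kernel, the synchronization chain involves design trade-offs at several layers. The Linux futex (Fast Userspace muTEX) mechanism is a textbook example of such a trade-off: with no contention the lock operation completes entirely in user space, and the kernel is entered only under contention, to put threads to sleep and wake them up.

This article works from source excerpts of Linux kernel 6.8.12 and follows the full implementation chain from the futex system call down to the MCS queued spinlock at the bottom. A seemingly simple futex_wait call turns out to involve hash-table lookup, spinlock acquisition, a priority-sorted queue, atomic operations, memory barriers, and careful handling of race conditions. Understanding these mechanisms matters for writing high-performance concurrent programs, debugging deadlocks, and tuning system performance.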


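Futex overview: the design philosophy of the fast user-space mutex

Design background and core idea

The futex mechanism was designed in 2002 by Hubertus Franke, Rusty Russell, and Matthew Kirkwood and first presented at the Ottawa Linux Symposium. Its core insight is simple: the cost of locking comes mainly from contention, not from the lock itself. In most workloads locks are held briefly and are rarely contended. If every lock and unlock required a system call, the system-call overhead (roughly 100 to 300 nanoseconds, depending on the hardware) would become the bottleneck.

The futex design philosophy is:

  1. The uncontended path stays entirely in user space: an atomic instruction (such as cmpxchg) tries to take the lock, and on success returns immediately without involving the kernel.
  2. Only the contended path enters the kernel: when user space finds the lock already held, it calls the futex() system call so that the kernel puts the current thread on a wait queue and lets it sleep.
  3. Unlock wakes waiters: after releasing the lock, the holder calls futex() to wake one or more threads on the wait queue.

This design lets glibc's pthread_mutex_t, pthread_cond_t, pthread_rwlock_t, and sem_t, as well as C++'s std::mutex, be built on top of futexes and perform very well.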

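Core operations of the futex system call

The futex(2) system call provides two core operations:

  • FUTEX_WAIT: checks the value of the futex word; if it equals the expected value, the current thread is added to the wait queue and put to sleep, otherwise the call returns immediately.
  • FUTEX_WAKE: wakes up to a given number of threads on the wait queue.

These two operations form the bridge between user-space lock implementations and the kernel scheduler.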
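
As a minimal sketch of how a user-space lock might combine these two operations, the following uses the well-known three-state design (0 unlocked, 1 locked, 2 locked with possible waiters). The names demo_mutex, demo_futex, demo_mutex_lock, and demo_mutex_unlock are hypothetical and this is not glibc's implementation; it assumes Linux and C11 atomics.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked and there may be sleepers */
typedef struct { _Atomic uint32_t state; } demo_mutex;

/* Thin wrapper: glibc exposes no futex() function, so use syscall(). */
static long demo_futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void demo_mutex_lock(demo_mutex *m)
{
    uint32_t c = 0;

    /* Uncontended path: 0 -> 1 in user space, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Contended path: advertise a sleeper, then wait while state == 2. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        demo_futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void demo_mutex_unlock(demo_mutex *m)
{
    uint32_t prev = atomic_exchange(&m->state, 0);

    assert(prev != 0);          /* unlocking an unlocked mutex is a bug */
    if (prev == 2)              /* someone may be asleep: wake one */
        demo_futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
}
```

In the uncontended case neither function enters the kernel: lock is one compare-and-swap and unlock is one exchange.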


Futex core data structures

union futex_key: identifying a unique futex

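union futex_key {
    struct {
        u64 i_seq;               /* shared futexes: inode sequence number */
        unsigned long pgoff;     /* page offset within the file */
        unsigned int offset;     /* offset within the page, plus flag bits */
    } shared;
    struct {
        union {
            struct mm_struct *mm;    /* private futexes: owning address space */
            u64 __tmp;
        };
        unsigned long address;   /* page-aligned virtual address */
        unsigned int offset;     /* offset within the page */
    } private;
    struct {
        u64 ptr;                 /* generic view used for hashing and comparison */
        unsigned long word;
        unsigned int offset;
    } both;
};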

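futex_key is how the kernel uniquely identifies a futex. Depending on whether the futex is process-private or shared, a different combination of fields is used:

  • Private futex: the key is (current->mm, address, 0). Threads of the same process can identify the futex by its virtual address alone, so the backing page never has to be looked up, which is faster.
  • Shared futex: the key is (inode->i_sequence, page->index, offset_within_page). When several processes map the same physical page, they all arrive at the same key and therefore at the same wait queue.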
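
The private case can be sketched in ordinary C. This is an illustration only, assuming 4 KiB pages and assuming the address is split into a page-aligned part and an in-page offset; demo_private_key, demo_make_private_key, and demo_key_match are hypothetical names, and the mm field is just an opaque number standing in for the address space.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the private view of a futex key. */
struct demo_private_key {
    uint64_t mm;        /* identifies the address space (opaque id here) */
    uint64_t address;   /* page-aligned part of the user address */
    uint32_t offset;    /* offset of the futex word within its page */
};

static struct demo_private_key demo_make_private_key(uint64_t mm_id,
                                                     uint64_t uaddr)
{
    struct demo_private_key k;

    k.offset  = (uint32_t)(uaddr % DEMO_PAGE_SIZE);
    k.address = uaddr - k.offset;
    k.mm      = mm_id;
    return k;
}

/* Two keys name the same futex only if every component matches. */
static bool demo_key_match(const struct demo_private_key *a,
                           const struct demo_private_key *b)
{
    return a->mm == b->mm && a->address == b->address &&
           a->offset == b->offset;
}
```

The same virtual address in two different address spaces yields two different keys, which is why private futexes never collide across processes.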

struct futex_q: the queue node of each waiting thread

struct futex_q {
    struct plist_node list;       /* sorted by priority in hash bucket */
    struct task_struct *task;     /* the sleeping task */
    spinlock_t *lock_ptr;         /* hash bucket lock */
    union futex_key key;          /* what futex we're waiting on */
    struct futex_pi_state *pi_state; /* priority inheritance state */
    u32 bitset;                   /* for FUTEX_WAIT_BITSET */
};

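Every thread that calls FUTEX_WAIT gets a futex_q, which __futex_wait keeps on the waiter's own kernel stack and links into the wait list of the matching hash bucket. The plist_node makes it a node in a priority-sorted list, so real-time threads are woken first. The listing above is abridged: the structure in 6.8 has further fields, including the wake callback and rt_waiter that futex_wake uses later in this article.

struct futex_hash_bucket: the hash bucket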

struct futex_hash_bucket {
    atomic_t waiters;
    spinlock_t lock;
    struct plist_head chain;      /* list of futex_q entries */
} ____cacheline_aligned_in_smp;

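The futex_q nodes of all threads waiting on the same futex (or on different futexes that collide in the hash) are linked into the chain list of one hash bucket. The bucket carries a spinlock, lock, that protects list operations, and an atomic counter, waiters, that allows a quick check for whether anyone is waiting.


Futex hashing and wait queues

The hash function futex_hash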

struct futex_hash_bucket *futex_hash(union futex_key *key)
{
    u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
              key->both.offset);

    return &futex_queues[hash & (futex_hashsize - 1)];
}

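The kernel hashes the futex_key with jhash2 (Jenkins hash) and maps the result to a bucket in the global table futex_queues. The table size futex_hashsize is always a power of two (futex_init rounds 256 times the number of possible CPUs up to the next power of two), so the modulo reduces to a single bitwise AND.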
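
The masking trick can be checked in a few lines of C. This sketch does not reimplement jhash2; it only shows the table-size rounding and the index computation, with the hypothetical helpers demo_roundup_pow2 and demo_bucket_index.

```c
#include <stdint.h>

/* Round n up to the next power of two (valid for 1 <= n <= 2^31). */
static uint32_t demo_roundup_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Bucket index for a table whose size is a power of two:
 * hash & (size - 1) gives the same result as hash % size,
 * without a division.
 */
static uint32_t demo_bucket_index(uint32_t hash, uint32_t table_size)
{
    return hash & (table_size - 1);
}
```

For example, a machine with 6 possible CPUs would get 256 * 6 = 1536 rounded up to 2048 buckets under this rule.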

Definition of the global hash table

static struct {
    struct futex_hash_bucket *queues;
    unsigned long            hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues   (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)

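The table is allocated during system initialization. The small __futex_data struct is aligned so that the base pointer and the size share one cache line, and each bucket is itself cache-line aligned (____cacheline_aligned_in_smp) to limit false sharing between CPUs.


Spinlock basics: from raw_spin_lock to spin_lock

Before digging into the futex wait and wake logic we need to understand the kernel's spinlock implementation, because every hash bucket is protected by a spinlock, and that spinlock is in turn built on an MCS queue.

The spinlock layers

The Linux spinlock implementation is split into several layers. The layering keeps the API generic while still allowing architecture-specific optimizations: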

/* Lowest layer: architecture-specific spinlock operations */
#define arch_spin_lock(l)    queued_spin_lock(l)
#define arch_spin_unlock(l)  queued_spin_unlock(l)

/* Raw spinlock: sparse annotation, arch lock, MMIO write-barrier bookkeeping */
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
    __acquire(lock);
    arch_spin_lock(&lock->raw_lock);
    mmiowb_spin_lock();
}

/* Middle layer: disables preemption and feeds lockdep (deadlock detection) */
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
    preempt_disable();
    spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
    LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

/* Exported symbol (out-of-line version) */
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
    __raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif

/* Top layer: the generic spinlock API */
#define raw_spin_lock(lock)  _raw_spin_lock(lock)

static __always_inline void spin_lock(spinlock_t *lock)
{
    raw_spin_lock(&lock->rlock);
}

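Key steps

  1. preempt_disable(): disables kernel preemption. A task holding a spinlock must not be preempted by another task, otherwise deadlock or very long spinning can result.
  2. spin_acquire(): records the acquisition for the kernel's lock-dependency checker (lockdep), which helps debug deadlocks.
  3. LOCK_CONTENDED(): with CONFIG_LOCK_STAT enabled it first calls do_raw_spin_trylock and, only if that fails, records a contention event before falling back to do_raw_spin_lock; without lock statistics it expands to a plain do_raw_spin_lock call. Either way the caller spins and never sleeps.
  4. arch_spin_lock(): the architecture-specific lock operation, which maps to queued_spin_lock() on x86.

Releasing a spinlock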

static __always_inline void spin_unlock(spinlock_t *lock)
{
    raw_spin_unlock(&lock->rlock);
}

#define raw_spin_unlock(lock)    _raw_spin_unlock(lock)

static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
    mmiowb_spin_unlock();
    arch_spin_unlock(&lock->raw_lock);
    __release(lock);
}

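On release, do_raw_spin_unlock first calls mmiowb_spin_unlock (an MMIO write barrier that is a no-op on most architectures, x86 included), then the architecture-specific unlock, and finally __release(lock). __release is an annotation for the sparse static checker and compiles to nothing.


Queued spinlocks and the MCS lock

The basic idea of the MCS lock

A traditional test-and-test-and-set (TTAS) spinlock suffers from severe cache-line bouncing: every waiter spins on the same memory location (the lock word), so each change of the lock state invalidates the line in every CPU's cache and generates heavy cross-CPU cache traffic.

The MCS lock (proposed by Mellor-Crummey and Scott) solves this problem. Its core ideas are:

  • Each waiter spins on its own local variable instead of the global lock word.
  • Waiters queue up in a linked list, and on release only the next waiter is notified.
  • This removes the cache-line bouncing, gives FIFO fairness, and markedly improves performance on large systems.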
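
The three ideas above can be sketched as a textbook MCS lock in C11 atomics. This is the classic pointer-based form, not the kernel's compressed 32-bit variant; all demo_mcs_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_mcs_node {
    _Atomic(struct demo_mcs_node *) next;
    atomic_bool locked;     /* becomes true when the lock is handed to us */
};

struct demo_mcs_lock {
    _Atomic(struct demo_mcs_node *) tail;   /* NULL means free, no queue */
};

static void demo_mcs_acquire(struct demo_mcs_lock *l, struct demo_mcs_node *me)
{
    struct demo_mcs_node *prev;

    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, false, memory_order_relaxed);

    /* Append ourselves to the queue in one atomic step. */
    prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (!prev)
        return;             /* queue was empty: we own the lock */

    /* Link behind the predecessor, then spin on our own node only. */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (!atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

static void demo_mcs_release(struct demo_mcs_lock *l, struct demo_mcs_node *me)
{
    struct demo_mcs_node *next =
        atomic_load_explicit(&me->next, memory_order_acquire);

    if (!next) {
        struct demo_mcs_node *expected = me;

        /* Nobody visible behind us: try to reset the tail to empty. */
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                                                    memory_order_acq_rel,
                                                    memory_order_acquire))
            return;
        /* A successor swapped the tail but has not linked in yet. */
        while (!(next = atomic_load_explicit(&me->next, memory_order_acquire)))
            ;
    }
    /* Hand over: only the next waiter's cache line is written. */
    atomic_store_explicit(&next->locked, true, memory_order_release);
}
```

Each caller supplies its own node, typically on its stack, and must pass the same node to acquire and release.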

The kernel's queued spinlock implementation

The kernel compresses the classic MCS lock so that the whole lock fits in a 32-bit word:

/**
 * queued_spin_lock - acquire a queued spinlock
 * @lock: Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
    int val = 0;

    if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
        return;

    queued_spin_lock_slowpath(lock, val);
}

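Fast path (no contention)

atomic_try_cmpxchg_acquire tries to change the lock word atomically from 0 (unlocked) to _Q_LOCKED_VAL (locked). This is the common uncontended case, and a single atomic instruction completes the acquisition.

Slow path (contention)

When the fast path fails, control enters queued_spin_lock_slowpath, the heart of the queued spinlock: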

void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
    struct mcs_spinlock *prev, *next, *node;
    u32 old, tail;
    int idx;

    BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

    if (pv_enabled())
        goto pv_queue;

    if (virt_spin_lock(lock))
        return;

    /* Wait for an in-progress pending->locked hand-over */
    if (val == _Q_PENDING_VAL) {
        int cnt = _Q_PENDING_LOOPS;
        val = atomic_cond_read_relaxed(&lock->val,
                           (VAL != _Q_PENDING_VAL) || !cnt--);
    }

    /* Contention observed: go and queue */
    if (val & ~_Q_LOCKED_MASK)
        goto queue;

    /* Try to set the pending bit */
    val = queued_fetch_set_pending_acquire(lock);

    /* Still contended: undo pending if we set it, then queue */
    if (unlikely(val & ~_Q_LOCKED_MASK)) {
        if (!(val & _Q_PENDING_MASK))
            clear_pending(lock);
        goto queue;
    }

    /* Wait for the current owner to release the lock */
    if (val & _Q_LOCKED_MASK)
        smp_cond_load_acquire(&lock->locked, !VAL);

    /* Take the lock and clear the pending bit */
    clear_pending_set_locked(lock);
    lockevent_inc(lock_pending);
    return;

queue:
    /* MCS queue logic */
    lockevent_inc(lock_slowpath);
pv_queue:
    node = this_cpu_ptr(&qnodes[0].mcs);
    idx = node->count++;
    tail = encode_tail(smp_processor_id(), idx);

    /* Nesting deeper than MAX_NODES (e.g. nested NMIs): spin without a node */
    if (unlikely(idx >= MAX_NODES)) {
        lockevent_inc(lock_no_node);
        while (!queued_spin_trylock(lock))
            cpu_relax();
        goto release;
    }

    node = grab_mcs_node(node, idx);
    barrier();

    node->locked = 0;
    node->next = NULL;
    pv_init_node(node);

    /* One last attempt to take the lock before queueing */
    if (queued_spin_trylock(lock))
        goto release;

    smp_wmb();

    /* Publish this node as the new queue tail */
    old = xchg_tail(lock, tail);
    next = NULL;

    /* If there was a predecessor, link behind it and wait */
    if (old & _Q_TAIL_MASK) {
        prev = decode_tail(old);
        WRITE_ONCE(prev->next, node);

        pv_wait_node(node, prev);
        arch_mcs_spin_lock_contended(&node->locked);

        /* Prefetch the next node's cache line */
        next = READ_ONCE(node->next);
        if (next)
            prefetchw(next);
    }

    /* Head of the queue: wait for the owner and any pending waiter to go away */
    if ((val = pv_wait_head_or_lock(lock, node)))
        goto locked;

    val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

locked:
    /* Claim the lock */
    if ((val & _Q_TAIL_MASK) == tail) {
        if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
            goto release;
    }

    set_locked(lock);

    /* Notify the next waiter */
    if (!next)
        next = smp_cond_load_relaxed(&node->next, (VAL));

    arch_mcs_spin_unlock_contended(&next->locked);
    pv_kick_node(lock, next);

release:
    trace_contention_end(lock, 0);
    __this_cpu_dec(qnodes[0].mcs.count);
}

The state machine

The source comments give a clear state-transition diagram:

              fast     :    slow                                  :    unlock
                       :                                          :
uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
                       :       | ^--------.------.             /  :
                       :       v           \      \            |  :
pending               :    (0,1,1) +--> (0,1,0)   \           |  :
                       :       | ^--'              |           |  :
                       :       v                   |           |  :
uncontended           :    (n,x,y) +--> (n,0,0) --'           |  :
  queue               :       | ^--'                          |  :
                       :       v                               |  :
contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
  queue               :         ^--'                             :

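The three fields of each tuple are:

  • tail: the queue tail (an encoded CPU number and per-CPU node index)
  • pending: set while a single waiter spins on the lock word itself, waiting to take over from the owner
  • locked: whether the lock is held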
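
A small sketch of how the tail field might be packed into the 32-bit word. The bit positions below are an assumption for illustration (locked in bits 0-7, pending at bit 8, node index in bits 16-17, CPU number plus one in bits 18-31, as used when the CPU count is small enough); the DEMO_* macros and demo_* helpers are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: see the lead-in; not copied from the kernel headers. */
#define DEMO_LOCKED_MASK      0x000000ffu
#define DEMO_PENDING_MASK     0x00000100u
#define DEMO_TAIL_IDX_OFFSET  16
#define DEMO_TAIL_CPU_OFFSET  18

static uint32_t demo_encode_tail(int cpu, int idx)
{
    /* The CPU number is stored +1 so that tail == 0 means "no queue". */
    return ((uint32_t)(cpu + 1) << DEMO_TAIL_CPU_OFFSET) |
           ((uint32_t)idx << DEMO_TAIL_IDX_OFFSET);
}

static int demo_tail_cpu(uint32_t tail)
{
    return (int)(tail >> DEMO_TAIL_CPU_OFFSET) - 1;
}

static int demo_tail_idx(uint32_t tail)
{
    return (int)((tail >> DEMO_TAIL_IDX_OFFSET) & 3u);
}
```

Because a (CPU, index) pair fits in 16 bits, the queue tail can name a per-CPU MCS node without storing a full pointer in the lock word.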

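Key optimizations

  1. Pending-bit spinning: the first waiter does not enter the MCS queue right away; it sets the pending bit and spins on the lock word. If the lock is released soon, the cost of queueing is avoided entirely.
  2. Local spinning: once queued, a waiter spins on node->locked in its own per-CPU node, so waiters do not bounce a shared cache line.
  3. Lock stealing: in the paravirtualized variant, a newly arriving CPU may take a free lock ahead of queued waiters. This improves throughput but must be paired with protection against starvation. The native fast path shown above succeeds only when the whole lock word is 0, so it does not steal.
  4. Paravirtualization (PV) support: under virtualization the lock cooperates with the hypervisor to avoid useless spinning while the lock holder's vCPU is preempted.

Atomic operations and the cmpxchg instruction

The queued spinlock relies on low-level atomic operations, the most important of which is compare-and-swap (CAS).

The atomic_try_cmpxchg_acquire implementation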

static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
    instrument_atomic_read_write(v, sizeof(*v));
    instrument_atomic_read_write(old, sizeof(*old));
    return raw_atomic_try_cmpxchg_acquire(v, old, new);
}

The x86 cmpxchg inline assembly

#define arch_try_cmpxchg(ptr, pold, new)                \
    __try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))

#define __try_cmpxchg(ptr, pold, new, size)             \
    __raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)

#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)    \
({                                      \
    bool success;                       \
    __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);  \
    __typeof__(*(_ptr)) __old = *_old;          \
    __typeof__(*(_ptr)) __new = (_new);          \
    switch (size) {                         \
    case __X86_CASE_B:                      \
    {                               \
        volatile u8 *__ptr = (volatile u8 *)(_ptr); \
        asm volatile(lock "cmpxchgb %[new], %[ptr]"   \
                 CC_SET(z)                  \
                 : CC_OUT(z) (success),         \
                   [ptr] "+m" (*__ptr),         \
                   [old] "+a" (__old)           \
                 : [new] "q" (__new)            \
                 : "memory");               \
        break;                          \
    }                               \
    case __X86_CASE_W:                      \
    {                               \
        volatile u16 *__ptr = (volatile u16 *)(_ptr);   \
        asm volatile(lock "cmpxchgw %[new], %[ptr]"   \
                 CC_SET(z)                  \
                 : CC_OUT(z) (success),         \
                   [ptr] "+m" (*__ptr),         \
                   [old] "+a" (__old)           \
                 : [new] "r" (__new)            \
                 : "memory");               \
        break;                          \
    }                               \
    case __X86_CASE_L:                      \
    {                               \
        volatile u32 *__ptr = (volatile u32 *)(_ptr);   \
        asm volatile(lock "cmpxchgl %[new], %[ptr]"   \
                 CC_SET(z)                  \
                 : CC_OUT(z) (success),         \
                   [ptr] "+m" (*__ptr),         \
                   [old] "+a" (__old)           \
                 : [new] "r" (__new)            \
                 : "memory");               \
        break;                          \
    }                               \
    case __X86_CASE_Q:                      \
    {                               \
        volatile u64 *__ptr = (volatile u64 *)(_ptr);   \
        asm volatile(lock "cmpxchgq %[new], %[ptr]"   \
                 CC_SET(z)                  \
                 : CC_OUT(z) (success),         \
                   [ptr] "+m" (*__ptr),         \
                   [old] "+a" (__old)           \
                 : [new] "r" (__new)            \
                 : "memory");               \
        break;                          \
    }                               \
    default:                            \
        __cmpxchg_wrong_size();                 \
    }                               \
    if (unlikely(!success))                     \
        *_old = __old;                      \
    likely(success);                        \
})

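Key points

  1. lock prefix: makes the instruction atomic across processors.
  2. +m constraint: the memory operand is both read and written.
  3. +a constraint: __old lives in the accumulator (al/ax/eax/rax, depending on operand size); cmpxchg implicitly compares that register with the memory operand and, on failure, loads the current memory value into it.
  4. CC_SET(z)/CC_OUT(z): use the x86 zero flag (ZF) to report whether the compare succeeded.
  5. "memory" clobber: stops the compiler from reordering or caching memory accesses across the asm statement; the hardware ordering itself comes from the locked instruction.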
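
The calling convention of try_cmpxchg (on failure, the caller's expected value is refreshed with what is actually in memory) is the same contract C11 gives atomic_compare_exchange_strong. The sketch below uses C11 atomics rather than inline assembly; demo_try_cmpxchg and demo_atomic_max are hypothetical helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* On failure *old receives the current value, as with try_cmpxchg. */
static bool demo_try_cmpxchg(atomic_int *v, int *old, int new_val)
{
    return atomic_compare_exchange_strong_explicit(v, old, new_val,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/*
 * Typical retry loop: atomically raise *v to at least floor.
 * Returns the value seen just before our update, or the existing
 * value if it was already >= floor.
 */
static int demo_atomic_max(atomic_int *v, int floor)
{
    int old = atomic_load_explicit(v, memory_order_relaxed);

    /* A failed attempt has already refreshed old, so just re-test. */
    while (old < floor && !demo_try_cmpxchg(v, &old, floor))
        ;
    return old;
}
```

Because the failed attempt returns the fresh value, the loop needs no separate reload, which is exactly what the `*_old = __old` line in the macro provides.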

The futex wait path, step by step

The futex_wait entry point

int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
    struct hrtimer_sleeper timeout, *to;
    struct restart_block *restart;
    int ret;

    // yym-gaizao
    pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
         current->tgid, current->pid, uaddr, val, bitset, flags);

    to = futex_setup_timer(abs_time, &timeout, flags,
                   current->timer_slack_ns);

    ret = __futex_wait(uaddr, flags, val, to, bitset);

    /* No timeout, nothing to clean up. */
    if (!to)
        return ret;

    hrtimer_cancel(&to->timer);
    destroy_hrtimer_on_stack(&to->timer);

    /* Arrange for the system call to be restarted */
    if (ret == -ERESTARTSYS) {
        restart = &current->restart_block;
        restart->futex.uaddr = uaddr;
        restart->futex.val = val;
        restart->futex.time = *abs_time;
        restart->futex.bitset = bitset;
        restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;

        return set_restart_fn(restart, futex_wait_restart);
    }

    return ret;
}

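futex_wait first sets up the timeout timer (if there is one) and then calls __futex_wait for the core logic. If a wait with a timeout is interrupted by a signal (-ERESTARTSYS), it fills in the restart block so that the system call is re-executed once the signal has been handled. The pr_debug call marked yym-gaizao is a local debugging addition and is not part of the upstream function.

__futex_wait core logic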

int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
         struct hrtimer_sleeper *to, u32 bitset)
{
    struct futex_q q = futex_q_init;
    struct futex_hash_bucket *hb;
    int ret;

    if (!bitset)
        return -EINVAL;

    q.bitset = bitset;

retry:
    /*
     * Prepare to wait on uaddr. On success, it holds hb->lock and q
     * is initialized.
     */
    ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
    if (ret)
        return ret;

    /* futex_queue and wait for wakeup, timeout, or a signal. */
    futex_wait_queue(hb, &q, to);

    /* If we were woken (and unqueued), we succeeded, whatever. */
    if (!futex_unqueue(&q))
        return 0;

    if (to && !to->task)
        return -ETIMEDOUT;

    /*
     * We expect signal_pending(current), but we might be the
     * victim of a spurious wakeup as well.
     */
    if (!signal_pending(current))
        goto retry;

    return -ERESTARTSYS;
}

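Key steps

  1. futex_wait_setup: prepares the wait, takes the hash-bucket lock, and checks the futex value.
  2. futex_wait_queue: adds the current thread to the wait queue and sleeps.
  3. Checks after waking: if the thread was woken normally (already removed from the queue), return success; on timeout, return -ETIMEDOUT; if interrupted by a signal, return -ERESTARTSYS; on a spurious wakeup, retry.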
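
Seen from user space, these outcomes mean a FUTEX_WAIT caller should always loop and re-check its condition. A minimal sketch, assuming Linux and C11 atomics; demo_wait_while_equal is a hypothetical helper, not a libc function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Block until *addr no longer holds val.
 * Returns 0 on success, or -1 with errno set on an unexpected error.
 */
static int demo_wait_while_equal(_Atomic uint32_t *addr, uint32_t val)
{
    assert(addr != NULL);

    while (atomic_load(addr) == val) {
        long r = syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val,
                         NULL, NULL, 0);
        if (r == 0)
            continue;   /* woken, possibly spuriously: re-check the value */
        if (errno == EAGAIN || errno == EINTR)
            continue;   /* value already changed, or a signal arrived */
        return -1;      /* anything else is a real error */
    }
    return 0;
}
```

The loop treats a successful return, EAGAIN (the kernel's -EWOULDBLOCK), and EINTR the same way: none of them proves the condition holds, so the value is read again.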

futex_wait_setup: the heart of the race-condition handling

int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
             struct futex_q *q, struct futex_hash_bucket **hb)
{
    u32 uval;
    int ret;

    /*
     * Access the page AFTER the hash-bucket is locked.
     * Order is important:
     *
     *   Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
     *   Userspace waker:  if (cond(var)) { var = new; futex_wake(&var); }
     *
     * The basic logical guarantee of a futex is that it blocks ONLY
     * if cond(var) is known to be true at the time of blocking, for
     * any cond.  If we locked the hash-bucket after testing *uaddr, that
     * would open a race condition where we could block indefinitely with
     * cond(var) false, which would violate the guarantee.
     *
     * On the other hand, we insert q and release the hash-bucket only
     * after testing *uaddr.  This guarantees that futex_wait() will NOT
     * absorb a wakeup if *uaddr does not match the desired values
     * while the syscall executes.
     */
retry:
    ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
    if (unlikely(ret != 0))
        return ret;

retry_private:
    *hb = futex_q_lock(q);

    ret = futex_get_value_locked(&uval, uaddr);

    if (ret) {
        futex_q_unlock(*hb);

        ret = get_user(uval, uaddr);
        if (ret)
            return ret;

        if (!(flags & FLAGS_SHARED))
            goto retry_private;

        goto retry;
    }

    if (uval != val) {
        futex_q_unlock(*hb);
        ret = -EWOULDBLOCK;
    }

    return ret;
}

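Key design: lock ordering and the value check

The comment explains in detail why the hash bucket must be locked before the futex value is read, and why the waiter is queued only after the check:

  1. No lost wakeups: if the value were checked before the bucket was locked, a waker could change the value and issue its wakeup between the check and the lock. The waiter would then queue itself after the wakeup had already passed and could sleep forever.
  2. No absorbed wakeups: because q is inserted only after the value has been found to match, a caller whose value does not match is never on the queue and cannot consume a wakeup meant for another waiter.

futex_q_lock: taking the hash-bucket lock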

struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
    __acquires(&hb->lock)
{
    struct futex_hash_bucket *hb;

    hb = futex_hash(&q->key);

    /*
     * Increment the counter before taking the lock so that
     * a potential waker won't miss a to-be-slept task that is
     * waiting for the spinlock. This is safe as all futex_q_lock()
     * users end up calling futex_queue(). Similarly, for housekeeping,
     * decrement the counter at futex_q_unlock() when some error has
     * occurred and we don't end up adding the task to the list.
     */
    futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */

    q->lock_ptr = &hb->lock;

    spin_lock(&hb->lock);
    return hb;
}

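Key detail: the waiters counter

The counter is incremented before the spinlock is taken in order to close the following race:

  1. Waiter A has computed its hash bucket but has not yet acquired the lock (it may still be spinning on it).
  2. Waker B updates the futex word and calls futex_wake, which checks waiters without taking the lock, sees 0, and returns immediately on the assumption that nobody is waiting.
  3. If A's read of the futex word could still observe the old value, A would queue itself after the wakeup had passed and sleep forever.

Incrementing the counter first, with its implied full memory barrier, guarantees that at least one side notices the other: either B sees a nonzero waiters count and scans the bucket, or A sees the new futex value and returns -EWOULDBLOCK.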
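
The ordering on both sides can be written down with sequentially consistent C11 atomics. This is a single-file illustration of the protocol only, not kernel code; demo_bucket, demo_futex_word, demo_waiter_should_sleep, and demo_waker_must_scan are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_bucket {
    atomic_int waiters;
};

static _Atomic uint32_t demo_futex_word;

/* Waiter side: announce ourselves first, then read the futex word. */
static bool demo_waiter_should_sleep(struct demo_bucket *hb, uint32_t expected)
{
    atomic_fetch_add(&hb->waiters, 1);      /* seq_cst: acts as a full barrier */
    if (atomic_load(&demo_futex_word) != expected) {
        atomic_fetch_sub(&hb->waiters, 1);  /* value changed: do not queue */
        return false;
    }
    return true;                            /* safe to queue and sleep */
}

/* Waker side: publish the new value first, then look for waiters. */
static bool demo_waker_must_scan(struct demo_bucket *hb, uint32_t new_val)
{
    atomic_store(&demo_futex_word, new_val);    /* seq_cst store */
    return atomic_load(&hb->waiters) != 0;      /* nonzero: take the lock */
}
```

With both sides ordered this way, the outcome "waiter read the old value and waker read waiters == 0" cannot occur, which is the guarantee the paragraph above describes.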

futex_wait_queue: enqueue and sleep

void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
                struct hrtimer_sleeper *timeout)
{
    /*
     * The task state is guaranteed to be set before another task can
     * wake it. set_current_state() is implemented using smp_store_mb() and
     * futex_queue() calls spin_unlock() upon completion, both serializing
     * access to the hash list and forcing another memory barrier.
     */
    set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
    futex_queue(q, hb);

    /* Arm the timer */
    if (timeout)
        hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);

    /*
     * If we have been removed from the hash list, then another task
     * has tried to wake us, and we can skip the call to schedule().
     */
    if (likely(!plist_node_empty(&q->list))) {
        /*
         * If the timer has already expired, current will already be
         * flagged for rescheduling. Only call schedule if there
         * is no timeout, or if it has yet to expire.
         */
        if (!timeout || timeout->task)
            schedule();
    }
    __set_current_state(TASK_RUNNING);
}

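Key design: ordering of the state change and the unlock

  1. set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE): marks the current thread as sleeping interruptibly. It is implemented with smp_store_mb(), so it includes a memory barrier.
  2. futex_queue(q, hb): adds the current thread to the queue while the bucket lock is held, then releases the lock.
  3. Check for an early wakeup: once the lock is dropped a waker may run at any time. If it removes the node from the queue before schedule() is reached, the plist_node_empty() test sees this and schedule() is skipped.
  4. schedule(): gives up the CPU and sleeps.
  5. __set_current_state(TASK_RUNNING): restores the running state after the wakeup.

futex_queue: adding the thread to the priority queue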

static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
    __releases(&hb->lock)
{
    __futex_queue(q, hb);

    // yym-gaizao
    struct plist_head *head = &hb->chain;
    struct plist_node *node;
    struct futex_q *q_temp;
    int count = 0;
    
    if (!plist_head_empty(head)) {
        plist_for_each(node, head) {
            q_temp = container_of(node, struct futex_q, list);
            pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
            count++;
        }
        pr_debug("futex:queue:total %d waiters\n", count);
    }

    spin_unlock(&hb->lock);
}

__futex_queue: the core enqueue logic

void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
    int prio;

    /*
     * The priority used to register this element is
     * - either the real thread-priority for the real-time threads
     * (i.e. threads with a priority lower than MAX_RT_PRIO)
     * - or MAX_RT_PRIO for non-RT threads.
     * Thus, all RT-threads are woken first in priority order, and
     * the others are woken last, in FIFO order.
     */
    prio = min(current->normal_prio, MAX_RT_PRIO);

    plist_node_init(&q->list, prio);
    plist_add(&q->list, &hb->chain);
    q->task = current;
}

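Priority queue design

  • Real-time threads: queued with their actual priority (< MAX_RT_PRIO) and woken in priority order.
  • Non-real-time threads: all queued with MAX_RT_PRIO and woken in FIFO order.

This keeps real-time threads responsive while treating non-real-time threads fairly.

plist_add: the priority-list operation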

void plist_add(struct plist_node *node, struct plist_head *head)
{
    struct plist_node *first, *iter, *prev = NULL;
    struct list_head *node_next = &head->node_list;

    plist_check_head(head);
    WARN_ON(!plist_node_empty(node));
    WARN_ON(!list_empty(&node->prio_list));

    if (plist_head_empty(head))
        goto ins_node;

    first = iter = plist_first(head);

    do {
        if (node->prio < iter->prio) {
            node_next = &iter->node_list;
            break;
        }

        prev = iter;
        iter = list_entry(iter->prio_list.next,
                struct plist_node, prio_list);
    } while (iter != first);

    if (!prev || prev->prio != node->prio)
        list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
    list_add_tail(&node->node_list, node_next);

    plist_check_head(head);
}

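plist (priority list) is a special kernel list type that maintains two linked lists at once:

  • node_list: every node, sorted by ascending prio value, with nodes of equal priority kept in insertion (FIFO) order.
  • prio_list: only the first node of each distinct priority, so an insertion can skip whole groups of equal-priority nodes.

As a result, finding the insertion point costs time proportional to the number of distinct priorities rather than the number of nodes, and walking node_list yields the waiters in wake order.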
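
The resulting order can be reproduced with a much simpler structure. The sketch below is a singly linked sorted list, not the kernel's two-list plist, and it assumes MAX_RT_PRIO is 100; demo_waiter, demo_queue_prio, and demo_plist_add are hypothetical names.

```c
#include <stddef.h>

#define DEMO_MAX_RT_PRIO 100   /* assumption: matches the kernel's value */

struct demo_waiter {
    int prio;                   /* lower value is woken earlier */
    int id;
    struct demo_waiter *next;
};

/* Same clamp as __futex_queue: non-RT threads all share one priority. */
static int demo_queue_prio(int normal_prio)
{
    return normal_prio < DEMO_MAX_RT_PRIO ? normal_prio : DEMO_MAX_RT_PRIO;
}

/* Insert w after every entry with prio <= w->prio, so ties stay FIFO. */
static void demo_plist_add(struct demo_waiter **head, struct demo_waiter *w)
{
    struct demo_waiter **pos = head;

    while (*pos && (*pos)->prio <= w->prio)
        pos = &(*pos)->next;
    w->next = *pos;
    *pos = w;
}
```

Unlike this linear scan, the kernel's prio_list lets plist_add jump from one priority group to the next, but the final ordering of waiters is the same.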


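Futex wakeup and race-condition handling

The source excerpts that accompany this article do not include the full futex_wake implementation, but its key points can be inferred from the design of the wait path.

The wake path in outline

Based on the kernel documentation and related sources, the wake path looks roughly like this: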

int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
    struct futex_hash_bucket *hb;
    struct futex_q *this, *next;
    union futex_key key = FUTEX_KEY_INIT;
    DEFINE_WAKE_Q(wake_q);
    int ret;

    if (!bitset)
        return -EINVAL;

    ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
    if (unlikely(ret != 0))
        return ret;

    if ((flags & FLAGS_STRICT) && !nr_wake)
        return 0;

    hb = futex_hash(&key);

    /* Fast path: no waiters, return immediately */
    if (!futex_hb_waiters_pending(hb))
        return ret;

    spin_lock(&hb->lock);

    plist_for_each_entry_safe(this, next, &hb->chain, list) {
        if (futex_match(&this->key, &key)) {
            if (this->pi_state || this->rt_waiter) {
                ret = -EINVAL;
                break;
            }

            /* Check that the two bitsets overlap */
            if (!(this->bitset & bitset))
                continue;

            this->wake(&wake_q, this);
            if (++ret >= nr_wake)
                break;
        }
    }

    spin_unlock(&hb->lock);
    wake_up_q(&wake_q);  /* the actual wakeups happen after the lock is dropped */
    return ret;
}

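Key design points

  1. Mark under the lock, wake outside it: while the hash-bucket lock is held, matching tasks are only marked (added to a wake queue); the actual wakeup (wake_up_process) happens after the lock has been released. This shortens the lock hold time and keeps complex scheduler work out of the critical section.
  2. Bitset matching: FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET let threads wait for and wake specific bit patterns; a waiter is woken only if its bitset shares at least one bit with the waker's.
  3. Priority inheritance (PI): for real-time threads, futexes support the priority-inheritance protocol to prevent priority inversion, tracked through the futex_pi_state and rt_waiter fields. As the listing shows, a plain futex_wake refuses such waiters and returns -EINVAL.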
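
The first two points can be illustrated with a plain array scan. This sketch models only the selection phase that would run under the bucket lock; demo_wait_entry and demo_collect are hypothetical, and unlike the kernel loop a limit of 0 selects nobody.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_wait_entry {
    uintptr_t key;      /* which futex this entry waits on */
    uint32_t  bitset;   /* bits the waiter is interested in */
    int       id;       /* stands in for the task to wake */
};

/*
 * Phase 1 (would run with the bucket lock held): record up to nr_wake
 * matching waiters in out[]. Phase 2, the actual wakeups, would walk
 * out[] only after the lock has been released.
 */
static size_t demo_collect(const struct demo_wait_entry *q, size_t n,
                           uintptr_t key, uint32_t bitset, size_t nr_wake,
                           int *out)
{
    size_t picked = 0;

    for (size_t i = 0; i < n && picked < nr_wake; i++) {
        if (q[i].key != key)
            continue;               /* hash collision: a different futex */
        if (!(q[i].bitset & bitset))
            continue;               /* no common bit: leave it asleep */
        out[picked++] = q[i].id;    /* remember it; wake it later */
    }
    return picked;
}
```

Keeping the scan cheap and deferring the wakeups is what allows the bucket lock to be held only for the list walk.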

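Performance considerations and optimization strategies

Hash table size and scalability

The number of hash buckets (futex_hashsize) directly affects concurrency:

  • Too few buckets: many unrelated futexes map to the same bucket and contend for its lock.
  • Too many buckets: memory use grows and cache efficiency drops.

By default the kernel sizes the table at 256 buckets per possible CPU, rounded up to a power of two.

Optimistic spinning combined with the MCS lock

The futex hash-bucket lock is a spinlock_t, which the kernel implements as a queued spinlock. This means:

  • CPUs that hit the same bucket at the same time line up in the MCS queue.
  • Each waiter spins on its own node, which avoids cache-line bouncing.
  • One reported figure for a 240-core system is a drop in system time from 54% to 2%; the workload behind that number is not identified here.

Spinning in user space

glibc can also spin in user space: for adaptive mutexes (PTHREAD_MUTEX_ADAPTIVE_NP) it spins briefly, waiting for the lock to be released, before calling futex_wait. If the lock is freed quickly, the system call is avoided altogether.

Diagnosing lock contention

For lock contention in production, the perf lock tool can help with diagnosis: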

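perf lock symptom | Root cause | Remedy
contended high + avg wait low (< 1 us) | Lock too coarse; frequent short contention | Split the lock (per-bucket / per-CPU) or go lock-free
contended low + avg wait high (> 10 us) | I/O, malloc, or logging inside the critical section | Move it outside the lock; Collect-Release-Execute
contended high + avg wait high | Serious design problem | Replace with a lock-free queue (MPSC/SPSC)
spinlock type + high contended | Critical section held longer than spinning pays off | Switch to a mutex (which may sleep)
futex_wake path share > 20% | Hash-bucket contention (second-order effect) | Contention is already severe; go lock-free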

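Summary and outlook

Working through the Linux 6.8.12 sources, this article has traced the implementation chain from the futex system call down to the MCS queued spinlock:

  1. User-space locks (such as pthread_mutex_t) complete entirely in user space with atomic operations when there is no contention.
  2. Under contention, they enter the kernel through futex_wait and futex_wake.
  3. The kernel uses the futex_key to locate a hash bucket and protects the wait queue with a spinlock.
  4. The spinlock itself uses MCS queueing underneath, which provides fairness and cache efficiency.
  5. Atomic operations (cmpxchg) are the foundation of the whole mechanism.

This layered design reflects the kernel's careful balance between performance, fairness, and scalability. As hardware evolves (more cores, NUMA, new atomic instructions), the kernel's synchronization mechanisms keep evolving too. Possible future directions include:

  • Finer-grained hash buckets: less bucket-level contention.
  • Adaptive spinning: adjusting spin time dynamically from the lock's hold history.
  • Hardware transactional memory (HTM): using hardware features such as Intel TSX for lock-free synchronization.
  • More efficient virtualization support: lower lock overhead at the hypervisor layer.

Understanding these low-level mechanisms helps in writing more efficient concurrent programs and provides a solid basis for diagnosing and fixing complex performance problems.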

Source code

static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);
	arch_spin_lock(&lock->raw_lock);
	mmiowb_spin_lock();
}

# define lock_acquire(l, s, t, r, c, n, i)	do { } while (0)
#define lock_acquire_exclusive(l, s, t, n, i)		lock_acquire(l, s, t, 0, 1, n, i)
#define spin_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
	__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif

#define raw_spin_lock(lock)	_raw_spin_lock(lock)

static __always_inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
}

/**
 * futex_hash - Return the hash bucket in the global hash
 * @key:	Pointer to the futex key for which the hash is calculated
 *
 * We hash on the keys returned from get_futex_key (see below) and return the
 * corresponding hash bucket in the global hash.
 */
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
	u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
			  key->both.offset);

	return &futex_queues[hash & (futex_hashsize - 1)];
}

/* The key must be already stored in q->key. */
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
	__acquires(&hb->lock)
{
	struct futex_hash_bucket *hb;

	hb = futex_hash(&q->key);

	/*
	 * Increment the counter before taking the lock so that
	 * a potential waker won't miss a to-be-slept task that is
	 * waiting for the spinlock. This is safe as all futex_q_lock()
	 * users end up calling futex_queue(). Similarly, for housekeeping,
	 * decrement the counter at futex_q_unlock() when some error has
	 * occurred and we don't end up adding the task to the list.
	 */
	futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */

	q->lock_ptr = &hb->lock;

	spin_lock(&hb->lock);
	return hb;
}

/*
 * The base of the bucket array and its size are always used together
 * (after initialization only in futex_hash()), so ensure that they
 * reside in the same cacheline.
 */
static struct {
	struct futex_hash_bucket *queues;
	unsigned long            hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues   (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)


/**
 * futex_wait_setup() - Prepare to wait on a futex
 * @uaddr:	the futex userspace address
 * @val:	the expected value
 * @flags:	futex flags (FLAGS_SHARED, etc.)
 * @q:		the associated futex_q
 * @hb:		storage for hash_bucket pointer to be returned to caller
 *
 * Setup the futex_q and locate the hash_bucket.  Get the futex value and
 * compare it with the expected value.  Handle atomic faults internally.
 * Return with the hb lock held on success, and unlocked on failure.
 *
 * Return:
 *  -  0 - uaddr contains val and hb has been locked;
 *  - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
 */
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
		     struct futex_q *q, struct futex_hash_bucket **hb)
{
	u32 uval;
	int ret;

	/*
	 * Access the page AFTER the hash-bucket is locked.
	 * Order is important:
	 *
	 *   Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
	 *   Userspace waker:  if (cond(var)) { var = new; futex_wake(&var); }
	 *
	 * The basic logical guarantee of a futex is that it blocks ONLY
	 * if cond(var) is known to be true at the time of blocking, for
	 * any cond.  If we locked the hash-bucket after testing *uaddr, that
	 * would open a race condition where we could block indefinitely with
	 * cond(var) false, which would violate the guarantee.
	 *
	 * On the other hand, we insert q and release the hash-bucket only
	 * after testing *uaddr.  This guarantees that futex_wait() will NOT
	 * absorb a wakeup if *uaddr does not match the desired values
	 * while the syscall executes.
	 */
retry:
	ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
	if (unlikely(ret != 0))
		return ret;

retry_private:
	*hb = futex_q_lock(q);

	ret = futex_get_value_locked(&uval, uaddr);

	if (ret) {
		futex_q_unlock(*hb);

		ret = get_user(uval, uaddr);
		if (ret)
			return ret;

		if (!(flags & FLAGS_SHARED))
			goto retry_private;

		goto retry;
	}

	if (uval != val) {
		futex_q_unlock(*hb);
		ret = -EWOULDBLOCK;
	}

	return ret;
}

int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
		 struct hrtimer_sleeper *to, u32 bitset)
{
	struct futex_q q = futex_q_init;
	struct futex_hash_bucket *hb;
	int ret;

	if (!bitset)
		return -EINVAL;

	q.bitset = bitset;

retry:
	/*
	 * Prepare to wait on uaddr. On success, it holds hb->lock and q
	 * is initialized.
	 */
	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
	if (ret)
		return ret;

	/* futex_queue and wait for wakeup, timeout, or a signal. */
	futex_wait_queue(hb, &q, to);

	/* If we were woken (and unqueued), we succeeded, whatever. */
	if (!futex_unqueue(&q))
		return 0;

	if (to && !to->task)
		return -ETIMEDOUT;

	/*
	 * We expect signal_pending(current), but we might be the
	 * victim of a spurious wakeup as well.
	 */
	if (!signal_pending(current))
		goto retry;

	return -ERESTARTSYS;
}

int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
	struct hrtimer_sleeper timeout, *to;
	struct restart_block *restart;
	int ret;

	// yym-gaizao
	pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
		 current->tgid, current->pid, uaddr, val, bitset, flags);

	to = futex_setup_timer(abs_time, &timeout, flags,
			       current->timer_slack_ns);

	ret = __futex_wait(uaddr, flags, val, to, bitset);

	/* No timeout, nothing to clean up. */
	if (!to)
		return ret;

	hrtimer_cancel(&to->timer);
	destroy_hrtimer_on_stack(&to->timer);

	if (ret == -ERESTARTSYS) {
		restart = &current->restart_block;
		restart->futex.uaddr = uaddr;
		restart->futex.val = val;
		restart->futex.time = *abs_time;
		restart->futex.bitset = bitset;
		restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;

		return set_restart_fn(restart, futex_wait_restart);
	}

	return ret;
}

/**
 * futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
 * @hb:		the futex hash bucket, must be locked by the caller
 * @q:		the futex_q to queue up on
 * @timeout:	the prepared hrtimer_sleeper, or null for no timeout
 */
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
			    struct hrtimer_sleeper *timeout)
{
	/*
	 * The task state is guaranteed to be set before another task can
	 * wake it. set_current_state() is implemented using smp_store_mb() and
	 * futex_queue() calls spin_unlock() upon completion, both serializing
	 * access to the hash list and forcing another memory barrier.
	 */
	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
	futex_queue(q, hb);

	/* Arm the timer */
	if (timeout)
		hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);

	/*
	 * If we have been removed from the hash list, then another task
	 * has tried to wake us, and we can skip the call to schedule().
	 */
	if (likely(!plist_node_empty(&q->list))) {
		/*
		 * If the timer has already expired, current will already be
		 * flagged for rescheduling. Only call schedule if there
		 * is no timeout, or if it has yet to expire.
		 */
		if (!timeout || timeout->task)
			schedule();
	}
	__set_current_state(TASK_RUNNING);
}


/**
 * futex_queue() - Enqueue the futex_q on the futex_hash_bucket
 * @q:	The futex_q to enqueue
 * @hb:	The destination hash bucket
 *
 * The hb->lock must be held by the caller, and is released here. A call to
 * futex_queue() is typically paired with exactly one call to futex_unqueue().  The
 * exceptions involve the PI related operations, which may use futex_unqueue_pi()
 * or nothing if the unqueue is done as part of the wake process and the unqueue
 * state is implicit in the state of woken task (see futex_wait_requeue_pi() for
 * an example).
 */
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
	__releases(&hb->lock)
{
	/* yym: local modification, dumps every waiter queued on this bucket */
	struct plist_head *head = &hb->chain;
	struct plist_node *node;
	struct futex_q *q_temp;
	int count = 0;

	__futex_queue(q, hb);

	/* hb->chain holds the waiters of every futex that hashes to this bucket. */
	if (!plist_head_empty(head)) {
		plist_for_each(node, head) {
			q_temp = container_of(node, struct futex_q, list);
			pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
			count++;
		}
		pr_debug("futex:queue:total %d waiters\n", count);
	}

	spin_unlock(&hb->lock);
}

void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
	int prio;

	/*
	 * The priority used to register this element is
	 * - either the real thread-priority for the real-time threads
	 * (i.e. threads with a priority lower than MAX_RT_PRIO)
	 * - or MAX_RT_PRIO for non-RT threads.
	 * Thus, all RT-threads are woken first in priority order, and
	 * the others are woken last, in FIFO order.
	 */
	prio = min(current->normal_prio, MAX_RT_PRIO);

	plist_node_init(&q->list, prio);
	plist_add(&q->list, &hb->chain);
	q->task = current;
}

/**
 * plist_add - add @node to @head
 *
 * @node:	&struct plist_node pointer
 * @head:	&struct plist_head pointer
 */
void plist_add(struct plist_node *node, struct plist_head *head)
{
	struct plist_node *first, *iter, *prev = NULL;
	struct list_head *node_next = &head->node_list;

	plist_check_head(head);
	WARN_ON(!plist_node_empty(node));
	WARN_ON(!list_empty(&node->prio_list));

	if (plist_head_empty(head))
		goto ins_node;

	first = iter = plist_first(head);

	do {
		if (node->prio < iter->prio) {
			node_next = &iter->node_list;
			break;
		}

		prev = iter;
		iter = list_entry(iter->prio_list.next,
				struct plist_node, prio_list);
	} while (iter != first);

	if (!prev || prev->prio != node->prio)
		list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
	list_add_tail(&node->node_list, node_next);

	plist_check_head(head);
}
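
The ordering rule used by __futex_queue() and plist_add() can be sketched in user space as follows. This is a simplified illustration, not the kernel's implementation: it uses one singly linked list instead of plist's two lists, and the names `struct waiter`, `waiter_prio` and `waiter_add` are hypothetical. Only `MAX_RT_PRIO` (100) is taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RT_PRIO 100	/* same value as the kernel constant */

/* Simplified waiter: one linked list instead of plist's node/prio lists. */
struct waiter {
	int prio;		/* lower value means higher priority */
	int id;			/* arrival order, for illustration only */
	struct waiter *next;
};

/* Mirrors prio = min(current->normal_prio, MAX_RT_PRIO). */
int waiter_prio(int normal_prio)
{
	return normal_prio < MAX_RT_PRIO ? normal_prio : MAX_RT_PRIO;
}

/*
 * Insert before the first node whose prio value is strictly larger, so
 * nodes of equal priority keep their FIFO order, as in plist_add().
 */
void waiter_add(struct waiter **head, struct waiter *node)
{
	struct waiter **pos = head;

	assert(head != NULL && node != NULL);
	while (*pos && (*pos)->prio <= node->prio)
		pos = &(*pos)->next;
	node->next = *pos;
	*pos = node;
}
```

All non-real-time threads collapse to the same priority value, which is why they are woken in arrival order after every real-time waiter.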

static __always_inline void spin_unlock(spinlock_t *lock)
{
	raw_spin_unlock(&lock->rlock);
}

#define raw_spin_unlock(lock)		_raw_spin_unlock(lock)

static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
	mmiowb_spin_unlock();
	arch_spin_unlock(&lock->raw_lock);
	__release(lock);
}
# define __release(x)	(void)0

#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)

#ifndef queued_spin_unlock
/**
 * queued_spin_unlock - release a queued spinlock
 * @lock : Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	/*
	 * unlock() needs release semantics:
	 */
	smp_store_release(&lock->locked, 0);
}
#endif

/**
 * queued_spin_lock - acquire a queued spinlock
 * @lock: Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}
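
The fast path of queued_spin_lock() and the release store in queued_spin_unlock() can be approximated with C11 atomics. The sketch below is an assumption-laden toy: `struct toy_lock` and the `toy_lock_*` functions are hypothetical names, there is no pending bit or MCS queue, and contended callers simply retry instead of entering a slow path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED_VAL 1u	/* stands in for _Q_LOCKED_VAL */

struct toy_lock {
	atomic_uint val;	/* 0 = free, LOCKED_VAL = held */
};

/* Fast path only: a single compare-and-swap with acquire ordering. */
bool toy_lock_try(struct toy_lock *lock)
{
	unsigned int expected = 0;

	return atomic_compare_exchange_strong_explicit(
		&lock->val, &expected, LOCKED_VAL,
		memory_order_acquire, memory_order_relaxed);
}

/* Retry the fast path until it succeeds; the kernel queues instead. */
void toy_lock_acquire(struct toy_lock *lock)
{
	while (!toy_lock_try(lock))
		;
}

/*
 * Release store, as in queued_spin_unlock(). The kernel writes only the
 * locked byte; this toy has no other fields, so it clears the whole word.
 */
void toy_lock_release(struct toy_lock *lock)
{
	assert(atomic_load_explicit(&lock->val, memory_order_relaxed) == LOCKED_VAL);
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}
```

The acquire on the successful exchange pairs with the release on unlock, so writes made inside one critical section are visible to the next holder.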

/**
 * atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
 * @v: pointer to atomic_t
 * @old: pointer to int value to compare with
 * @new: int value to assign
 *
 * If (@v == @old), atomically updates @v to @new with acquire ordering.
 * Otherwise, updates @old to the current value of @v.
 *
 * Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_acquire() there.
 *
 * Return: @true if the exchange occurred, @false otherwise.
 */
static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
	instrument_atomic_read_write(v, sizeof(*v));
	instrument_atomic_read_write(old, sizeof(*old));
	return raw_atomic_try_cmpxchg_acquire(v, old, new);
}

/**
 * raw_atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
 * @v: pointer to atomic_t
 * @old: pointer to int value to compare with
 * @new: int value to assign
 *
 * If (@v == @old), atomically updates @v to @new with acquire ordering.
 * Otherwise, updates @old to the current value of @v.
 *
 * Safe to use in noinstr code; prefer atomic_try_cmpxchg_acquire() elsewhere.
 *
 * Return: @true if the exchange occurred, @false otherwise.
 */
static __always_inline bool
raw_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
#if defined(arch_atomic_try_cmpxchg_acquire)
	return arch_atomic_try_cmpxchg_acquire(v, old, new);
#elif defined(arch_atomic_try_cmpxchg_relaxed)
	bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
	__atomic_acquire_fence();
	return ret;
#elif defined(arch_atomic_try_cmpxchg)
	return arch_atomic_try_cmpxchg(v, old, new);
#else
	int r, o = *old;
	r = raw_atomic_cmpxchg_acquire(v, o, new);
	if (unlikely(r != o))
		*old = r;
	return likely(r == o);
#endif
}

static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
{
	return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg

#define arch_try_cmpxchg(ptr, pold, new) 				\
	__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
	
#define __try_cmpxchg(ptr, pold, new, size)				\
	__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
	

#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)		\
({									\
	bool success;							\
	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
	__typeof__(*(_ptr)) __old = *_old;				\
	__typeof__(*(_ptr)) __new = (_new);				\
	switch (size) {							\
	case __X86_CASE_B:						\
	{								\
		volatile u8 *__ptr = (volatile u8 *)(_ptr);		\
		asm volatile(lock "cmpxchgb %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "q" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	case __X86_CASE_W:						\
	{								\
		volatile u16 *__ptr = (volatile u16 *)(_ptr);		\
		asm volatile(lock "cmpxchgw %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "r" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	case __X86_CASE_L:						\
	{								\
		volatile u32 *__ptr = (volatile u32 *)(_ptr);		\
		asm volatile(lock "cmpxchgl %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "r" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	case __X86_CASE_Q:						\
	{								\
		volatile u64 *__ptr = (volatile u64 *)(_ptr);		\
		asm volatile(lock "cmpxchgq %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "r" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	default:							\
		__cmpxchg_wrong_size();					\
	}								\
	if (unlikely(!success))						\
		*_old = __old;						\
	likely(success);						\
})


/**
 * queued_spin_lock_slowpath - acquire the queued spinlock
 * @lock: Pointer to queued spinlock structure
 * @val: Current value of the queued spinlock 32-bit word
 *
 * (queue tail, pending bit, lock value)
 *
 *              fast     :    slow                                  :    unlock
 *                       :                                          :
 * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
 *                       :       | ^--------.------.             /  :
 *                       :       v           \      \            |  :
 * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
 *                       :       | ^--'              |           |  :
 *                       :       v                   |           |  :
 * uncontended           :    (n,x,y) +--> (n,0,0) --'           |  :
 *   queue               :       | ^--'                          |  :
 *                       :       v                               |  :
 * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
 *   queue               :         ^--'                             :
 */
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	struct mcs_spinlock *prev, *next, *node;
	u32 old, tail;
	int idx;

	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

	if (pv_enabled())
		goto pv_queue;

	if (virt_spin_lock(lock))
		return;

	/*
	 * Wait for in-progress pending->locked hand-overs with a bounded
	 * number of spins so that we guarantee forward progress.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOPS;
		val = atomic_cond_read_relaxed(&lock->val,
					       (VAL != _Q_PENDING_VAL) || !cnt--);
	}

	/*
	 * If we observe any contention; queue.
	 */
	if (val & ~_Q_LOCKED_MASK)
		goto queue;

	/*
	 * trylock || pending
	 *
	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
	 */
	val = queued_fetch_set_pending_acquire(lock);

	/*
	 * If we observe contention, there is a concurrent locker.
	 *
	 * Undo and queue; our setting of PENDING might have made the
	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
	 * on @next to become !NULL.
	 */
	if (unlikely(val & ~_Q_LOCKED_MASK)) {

		/* Undo PENDING if we set it. */
		if (!(val & _Q_PENDING_MASK))
			clear_pending(lock);

		goto queue;
	}

	/*
	 * We're pending, wait for the owner to go away.
	 *
	 * 0,1,1 -> *,1,0
	 *
	 * this wait loop must be a load-acquire such that we match the
	 * store-release that clears the locked bit and create lock
	 * sequentiality; this is because not all
	 * clear_pending_set_locked() implementations imply full
	 * barriers.
	 */
	if (val & _Q_LOCKED_MASK)
		smp_cond_load_acquire(&lock->locked, !VAL);

	/*
	 * take ownership and clear the pending bit.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	clear_pending_set_locked(lock);
	lockevent_inc(lock_pending);
	return;

	/*
	 * End of pending bit optimistic spinning and beginning of MCS
	 * queuing.
	 */
queue:
	lockevent_inc(lock_slowpath);
pv_queue:
	node = this_cpu_ptr(&qnodes[0].mcs);
	idx = node->count++;
	tail = encode_tail(smp_processor_id(), idx);

	trace_contention_begin(lock, LCB_F_SPIN);

	/*
	 * 4 nodes are allocated based on the assumption that there will
	 * not be nested NMIs taking spinlocks. That may not be true in
	 * some architectures even though the chance of needing more than
	 * 4 nodes will still be extremely unlikely. When that happens,
	 * we fall back to spinning on the lock directly without using
	 * any MCS node. This is not the most elegant solution, but is
	 * simple enough.
	 */
	if (unlikely(idx >= MAX_NODES)) {
		lockevent_inc(lock_no_node);
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}

	node = grab_mcs_node(node, idx);

	/*
	 * Keep counts of non-zero index values:
	 */
	lockevent_cond_inc(lock_use_node2 + idx - 1, idx);

	/*
	 * Ensure that we increment the head node->count before initialising
	 * the actual node. If the compiler is kind enough to reorder these
	 * stores, then an IRQ could overwrite our assignments.
	 */
	barrier();

	node->locked = 0;
	node->next = NULL;
	pv_init_node(node);

	/*
	 * We touched a (possibly) cold cacheline in the per-cpu queue node;
	 * attempt the trylock once more in the hope someone let go while we
	 * weren't watching.
	 */
	if (queued_spin_trylock(lock))
		goto release;

	/*
	 * Ensure that the initialisation of @node is complete before we
	 * publish the updated tail via xchg_tail() and potentially link
	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
	 */
	smp_wmb();

	/*
	 * Publish the updated tail.
	 * We have already touched the queueing cacheline; don't bother with
	 * pending stuff.
	 *
	 * p,*,* -> n,*,*
	 */
	old = xchg_tail(lock, tail);
	next = NULL;

	/*
	 * if there was a previous node; link it and wait until reaching the
	 * head of the waitqueue.
	 */
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		pv_wait_node(node, prev);
		arch_mcs_spin_lock_contended(&node->locked);

		/*
		 * While waiting for the MCS lock, the next pointer may have
		 * been set by another lock waiter. We optimistically load
		 * the next pointer & prefetch the cacheline for writing
		 * to reduce latency in the upcoming MCS unlock operation.
		 */
		next = READ_ONCE(node->next);
		if (next)
			prefetchw(next);
	}

	/*
	 * we're at the head of the waitqueue, wait for the owner & pending to
	 * go away.
	 *
	 * *,x,y -> *,0,0
	 *
	 * this wait loop must use a load-acquire such that we match the
	 * store-release that clears the locked bit and create lock
	 * sequentiality; this is because the set_locked() function below
	 * does not imply a full barrier.
	 *
	 * The PV pv_wait_head_or_lock function, if active, will acquire
	 * the lock and return a non-zero value. So we have to skip the
	 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
	 * been designated yet, there is no way for the locked value to become
	 * _Q_SLOW_VAL. So both the set_locked() and the
	 * atomic_cmpxchg_relaxed() calls will be safe.
	 *
	 * If PV isn't active, 0 will be returned instead.
	 *
	 */
	if ((val = pv_wait_head_or_lock(lock, node)))
		goto locked;

	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

locked:
	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,*,0 -> *,*,1 : lock, contended
	 *
	 * If the queue head is the only one in the queue (lock value == tail)
	 * and nobody is pending, clear the tail code and grab the lock.
	 * Otherwise, we only need to grab the lock.
	 */

	/*
	 * In the PV case we might already have _Q_LOCKED_VAL set, because
	 * of lock stealing; therefore we must also allow:
	 *
	 * n,0,1 -> 0,0,1
	 *
	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
	 *       above wait condition, therefore any concurrent setting of
	 *       PENDING will make the uncontended transition fail.
	 */
	if ((val & _Q_TAIL_MASK) == tail) {
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release; /* No contention */
	}

	/*
	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
	 * which will then detect the remaining tail and queue behind us
	 * ensuring we'll see a @next.
	 */
	set_locked(lock);

	/*
	 * contended path; wait for next if not observed yet, release.
	 */
	if (!next)
		next = smp_cond_load_relaxed(&node->next, (VAL));

	arch_mcs_spin_unlock_contended(&next->locked);
	pv_kick_node(lock, next);

release:
	trace_contention_end(lock, 0);

	/*
	 * release the node
	 */
	__this_cpu_dec(qnodes[0].mcs.count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);	
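
The (queue tail, pending bit, lock value) triples in the comments above are fields of one 32-bit word. The sketch below shows how a tail could be packed and unpacked, assuming the layout used when NR_CPUS is below 16K: locked byte in bits 0-7, pending in bits 8-15, node index in bits 16-17, CPU number plus one in bits 18-31. The `toy_*` function names and the `Q_*` macro names are hypothetical; the kernel's own helpers are encode_tail() and decode_tail(), the latter returning a per-CPU node pointer rather than a pair of integers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for NR_CPUS < 16K (see lead-in). */
#define Q_LOCKED_MASK		0x000000ffu
#define Q_PENDING_VAL		(1u << 8)
#define Q_TAIL_IDX_OFFSET	16
#define Q_TAIL_IDX_MASK		(0x3u << Q_TAIL_IDX_OFFSET)
#define Q_TAIL_CPU_OFFSET	18
#define Q_TAIL_MASK		0xffff0000u

/* CPU numbers are stored as cpu + 1 so that a tail of 0 means "no queue". */
uint32_t toy_encode_tail(unsigned int cpu, unsigned int idx)
{
	assert(idx < 4);	/* at most 4 nesting levels per CPU */
	return ((uint32_t)(cpu + 1) << Q_TAIL_CPU_OFFSET) |
	       ((uint32_t)idx << Q_TAIL_IDX_OFFSET);
}

/* Inverse of toy_encode_tail(); the tail must be non-zero. */
void toy_decode_tail(uint32_t tail, unsigned int *cpu, unsigned int *idx)
{
	assert(tail & Q_TAIL_MASK);
	*cpu = (tail >> Q_TAIL_CPU_OFFSET) - 1;
	*idx = (tail & Q_TAIL_IDX_MASK) >> Q_TAIL_IDX_OFFSET;
}

/* Mirrors the `val & ~_Q_LOCKED_MASK` test that sends a locker to the queue. */
int toy_has_contention(uint32_t val)
{
	return (val & ~Q_LOCKED_MASK) != 0;
}
```

Packing the tail into the lock word is what lets a single atomic exchange (xchg_tail) both publish the new waiter and return its predecessor.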