Introduction: the synchronization bridge between user space and the kernel
In multithreaded programming, locking is the foundation of data consistency. When we call pthread_mutex_lock in user space, a complex but carefully crafted kernel implementation is hiding behind it. Two core components are involved: the futex (Fast Userspace Mutex) and the spinlock. The futex handles the synchronization handshake between user space and the kernel, while the spinlock protects the kernel's internal data structures.
Based on the Linux 6.8.12 sources, this article traces the complete path from lock contention in user space to the wait queues in the kernel. We follow the execution flow, analyse the key functions line by line, and show how the kernel handles lock contention efficiently, how it organizes its wait queues, and how it uses low-level atomic operations to stay correct under concurrency.
Part 1: futex core concepts and data structures
1.1 What is a futex?
A futex is Linux's fast user-space mutex mechanism. Its core idea: in the uncontended case, everything happens in user space and no trap into the kernel is needed; only when contention occurs does a system call enter the kernel to queue and to wake.
This design is a huge performance win, because the vast majority of lock operations are uncontended. futex_wait and futex_wake are the two most important operations, as the sketch below illustrates.
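To make the fast-path/slow-path split concrete, here is a minimal user-space sketch built directly on the futex syscall. It follows the well-known 0/1/2 state scheme from Drepper's "Futexes Are Tricky"; the wrapper, variable names and state encoding are assumptions of this example, not taken from the kernel sources discussed below.
c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal sketch: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters. */
static atomic_uint lock_word;

static long futex(atomic_uint *uaddr, int op, unsigned int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void lock(void)
{
	unsigned int c = 0;
	/* Fast path: uncontended CAS from 0 to 1, no syscall at all. */
	if (atomic_compare_exchange_strong(&lock_word, &c, 1))
		return;
	/* Slow path: advertise contention (state 2) and sleep in the kernel. */
	if (c != 2)
		c = atomic_exchange(&lock_word, 2);
	while (c != 0) {
		futex(&lock_word, FUTEX_WAIT_PRIVATE, 2);	/* sleep while value == 2 */
		c = atomic_exchange(&lock_word, 2);
	}
}

static void unlock(void)
{
	if (atomic_exchange(&lock_word, 0) == 2)		/* were there waiters? */
		futex(&lock_word, FUTEX_WAKE_PRIVATE, 1);	/* wake one of them */
}
Note that the futex syscalls appear only on the contended paths; the common case is a single CAS in user space.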
1.2 The key data structures: futex_q and futex_hash_bucket
The sources define two crucial data structures:
c
struct futex_q {
struct plist_node list; // priority-list node
struct task_struct *task; // the waiting task
spinlock_t *lock_ptr; // points to the hash bucket's lock
u32 bitset; // bitmask used for conditional wakeups
union futex_key key; // uniquely identifies a futex object
struct futex_pi_state *pi_state;
struct rt_mutex_waiter *rt_waiter;
union futex_key *requeue_pi_key;
};
A futex_q represents one thread waiting on a futex. Every waiting thread owns a futex_q, and that structure is hooked into the list of one hash bucket.
c
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
futex_queues is the global array of hash buckets; each bucket contains a spinlock and a priority list. Hashing a futex address onto a specific bucket spreads the lock contention effectively.
1.3 futex_hash: locating the wait queue quickly
c
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
This uses jhash2 (a Jenkins hash) to reduce the futex_key to a 32-bit hash value. offsetof(typeof(*key), both.offset) / 4 is the number of 32-bit words to hash (jhash2 works in units of u32). The bucket index is hash & (futex_hashsize - 1), which requires futex_hashsize to be a power of two so that the bitwise AND can replace a modulo, as the quick check below demonstrates.
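A throwaway snippet (not from the kernel) confirming that the mask and the modulo agree whenever the table size is a power of two:
c
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t hashsize = 256; /* must be a power of two */

	for (uint32_t hash = 0; hash < 100000; hash++)
		assert((hash & (hashsize - 1)) == (hash % hashsize));
	return 0;
}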
Part 2: from user-space contention to kernel wait - the complete futex_wait flow
2.1 The system call entry point: futex_wait
c
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags, current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
// ... handle timeout and syscall restart
}
The function begins with a debug print (marked "// yym-gaizao" in the full listing at the end, i.e. instrumentation added for this analysis) that records the caller's process ID (tgid, printed as pid=), thread ID (printed as tid=), the address being waited on, the expected value, the bitset and the flags. It then arms a timeout timer through futex_setup_timer (if the caller supplied an absolute time) and finally calls the core function __futex_wait.
2.2 The core wait logic: __futex_wait
c
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
futex_wait_queue(hb, &q, to);
if (!futex_unqueue(&q))
return 0;
// ... handle signals and retry
}
This code lays out the three steps of waiting:
- futex_wait_setup: preparation - take the hash bucket lock and validate the user-space value
- futex_wait_queue: put the current thread on the wait queue and schedule away
- futex_unqueue: remove the entry from the queue (returns false if someone already woke us)
2.3 The critical setup phase: futex_wait_setup
c
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
This function carries a very important comment that explains why the lock must be taken before the read:
The basic logical guarantee of a futex is that it blocks ONLY if cond(var) is known to be true at the time of blocking, for any cond. If we locked the hash-bucket after testing *uaddr, that would open a race condition where we could block indefinitely with cond(var) false.
Consider the following race:
- Thread A checks that *uaddr == val, the condition holds, and it prepares to enter the kernel and wait.
- At the same moment, thread B stores a new value to *uaddr and calls futex_wake.
- If A tested the condition first and only took the lock afterwards, B's wakeup could arrive before A is on the wait queue.
- Result: A sleeps forever, because the value has already changed and no further wakeup will ever come.
By taking the lock first and only then reading the user-space value, we guarantee that if A ultimately decides to wait, any wakeup issued by B will find A on the queue.
2.4 Taking the hash bucket lock: futex_q_lock
c
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
futex_hb_waiters_inc(hb); // bump the waiter count; implies smp_mb()
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
There is a subtle piece of design here: the waiter count is incremented before the lock is taken. The comment explains why:
Increment the counter before taking the lock so that a potential waker won't miss a to-be-slept task that is waiting for the spinlock.
Without this increment, the following could happen:
- Thread A has computed its hash bucket but has not taken the lock yet.
- Thread B calls futex_wake, sees a waiter count of 0, and returns immediately.
- Thread A then takes the lock and joins the queue, but is never woken.
futex_hb_waiters_inc implies a full memory barrier (smp_mb()), ordering the increment before the waiter's later accesses so that the other CPUs cannot miss it; the helper is sketched below.
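For reference, the helper boils down to roughly the following (a sketch of what lives in kernel/futex/futex.h, assuming CONFIG_SMP; treat it as illustrative rather than verbatim source):
c
static inline void futex_hb_waiters_inc(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
	atomic_inc(&hb->waiters);
	/* Full barrier (A); pairs with the barrier on the futex_wake() side. */
	smp_mb__after_atomic();
#endif
}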
2.5 Entering the wait queue: futex_wait_queue
c
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
futex_queue(q, hb);
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
if (likely(!plist_node_empty(&q->list))) {
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
Key points:
- set_current_state: moves the task into an interruptible sleep state; it is implemented with smp_store_mb(), which provides the required memory barrier.
- futex_queue: adds q to the hash bucket's list and then releases the spinlock.
- schedule: is called only while q is still on the list (if it is not, another task has already woken us); if the timeout has already fired, timeout->task is NULL and the call is skipped as well.
2.6 Joining the priority queue: futex_queue and __futex_queue
c
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// debug output: dump all waiters on this bucket (added instrumentation)
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
A priority list (plist) is used here instead of a plain FIFO queue. The priority is computed as follows:
- Real-time threads (priority < MAX_RT_PRIO): use their actual priority.
- Normal threads: use MAX_RT_PRIO.
So when several threads wait on the same futex, real-time threads are woken first, ordered among themselves by priority, and all normal threads come last in FIFO order. For real-time systems this matters a great deal.
2.7 plist_add: the priority-list insertion algorithm
c
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next, struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
}
A plist is really two linked lists:
- prio_list: ordered by priority, with each priority value appearing at most once
- node_list: every node, with nodes of equal priority grouped together
On insertion:
- Walk prio_list to find the first node whose prio value is greater than the new node's (i.e. a lower priority) and insert before it.
- If the new node's priority differs from prev's, add it to prio_list as well.
- Splice the new node into node_list at the corresponding position.
This layout keeps priority-ordered traversal cheap while storing only one prio_list entry per distinct priority, as illustrated below.
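A small illustration (the priorities are made up for the example) of how the two lists relate after three waiters with priorities 10, 10 and 100 have been queued:
c
/*
 * node_list:  A(10) -> B(10) -> C(100)   (every waiter, grouped by priority)
 * prio_list:  A(10) ----------> C(100)   (one entry per distinct priority)
 *
 * The highest-priority waiter is simply plist_first(), i.e. the head of
 * node_list. Finding an insertion point walks prio_list only, which has one
 * node per distinct priority rather than one per waiter.
 */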
Part 3: the spinlock - a cornerstone of kernel concurrency
In the futex implementation, spin_lock(&hb->lock) is used everywhere. The spinlock is one of the most basic synchronization primitives in the kernel; here it protects the hash bucket's list operations. Let's dig into how it is implemented.
3.1 From spin_lock to raw_spin_lock
c
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
#define raw_spin_lock(lock) _raw_spin_lock(lock)
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
The important configuration option here is CONFIG_INLINE_SPIN_LOCK. When inlining is enabled, _raw_spin_lock is expanded at the call site, saving a function call; otherwise it is built as a standalone, exported function that modules can use.
3.2 The core implementation: __raw_spin_lock
c
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
Three things happen here:
- preempt_disable: disables kernel preemption. This is essential for spinlocks - the holder must not be preempted, otherwise other CPUs could spin forever waiting for this CPU to release the lock.
- spin_acquire: the lockdep annotation; it only does real work when CONFIG_LOCKDEP is enabled, and is used to detect potential deadlocks.
- LOCK_CONTENDED: a macro that acquires the lock; with lock statistics enabled it first attempts a trylock so that contention can be recorded before falling back to the contended acquisition (see the sketch after this list).
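A hedged sketch of what LOCK_CONTENDED expands to, based on include/linux/lockdep.h (the exact wording may differ between kernel versions):
c
#ifdef CONFIG_LOCK_STAT
#define LOCK_CONTENDED(_lock, try, lock)			\
do {								\
	if (!try(_lock)) {					\
		lock_contended(&(_lock)->dep_map, _RET_IP_);	\
		lock(_lock);					\
	}							\
	lock_acquired(&(_lock)->dep_map, _RET_IP_);		\
} while (0)
#else
/* Without lock statistics the macro is just a plain acquisition. */
#define LOCK_CONTENDED(_lock, try, lock) \
	lock(_lock)
#endif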
3.3 Down to the architecture: arch_spin_lock
c
#define arch_spin_lock(l) queued_spin_lock(l)
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
On x86 this resolves to the queued spinlock (qspinlock). The key design:
- Fast path: a single atomic CAS tries to take the lock; if the lock word is currently 0 (unlocked), it is set to _Q_LOCKED_VAL (1, locked).
- Slow path: if the CAS fails because the lock is already held, we fall into queued_spin_lock_slowpath.
atomic_try_cmpxchg_acquire is used instead of atomic_cmpxchg because try_cmpxchg automatically updates the old value with the current value when the exchange fails, saving an extra re-read.
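The difference is easiest to see in the classic retry-loop idiom. The following kernel-style fragment is purely illustrative (FLAG and the variable names are made up for the example):
c
/* With cmpxchg: on failure we must pick up the new value ourselves. */
old = atomic_read(v);
for (;;) {
	new = old | FLAG;
	cur = atomic_cmpxchg(v, old, new);
	if (cur == old)
		break;		/* the exchange happened */
	old = cur;		/* somebody changed it; retry from the fresh value */
}

/* With try_cmpxchg: 'old' is refreshed in place whenever the exchange fails. */
old = atomic_read(v);
do {
	new = old | FLAG;
} while (!atomic_try_cmpxchg(v, &old, new));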
3.4 Implementing the atomic: from try_cmpxchg down to the LOCK prefix
c
static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
{
return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_try_cmpxchg(ptr, pold, new) __try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
This eventually expands to inline assembly:
c
case __X86_CASE_L: // the 32-bit case
{
volatile u32 *__ptr = (volatile u32 *)(_ptr);
asm volatile(lock "cmpxchgl %[new], %[ptr]"
CC_SET(z)
: CC_OUT(z) (success),
[ptr] "+m" (*__ptr),
[old] "+a" (__old)
: [new] "r" (__new)
: "memory");
break;
}
This combines the LOCK prefix with the cmpxchg instruction:
- LOCK prefix: makes the instruction atomic; on modern CPUs this usually locks the cache line rather than the whole bus.
- cmpxchg: compares eax (__old) with the destination memory; if they are equal the new value is stored, otherwise the current memory value is loaded into eax.
- CC_SET(z): the zero flag records the outcome, and the success variable receives it.
This is a classic RMW (read-modify-write) operation; on modern multicore systems its atomicity rests on the cache-coherence protocol (MESI and its derivatives).
3.5 The slow path: queued_spin_lock_slowpath
When the fast path fails - the lock is already held - the kernel enters the slow path. The function is long (well over 200 lines), so the abridged version below keeps only the core logic:
c
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
// check whether the pending-bit optimization applies
if (val == _Q_PENDING_VAL) {
// wait (a bounded number of spins) for the pending bit to clear
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
if (val & ~_Q_LOCKED_MASK)
goto queue;
// try to set the pending bit
val = queued_fetch_set_pending_acquire(lock);
if (unlikely(val & ~_Q_LOCKED_MASK)) {
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
// wait for the lock holder to go away
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
clear_pending_set_locked(lock);
return;
queue:
// queueing logic
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
// publish this node at the tail of the queue
old = xchg_tail(lock, tail);
next = NULL;
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
arch_mcs_spin_lock_contended(&node->locked);
}
// wait until we reach the head of the queue
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
// take the lock (the full code first tries an uncontended exit that also
// clears the tail when we are the only queued waiter; omitted here)
set_locked(lock);
// hand the MCS lock to the next waiter, fetching the pointer if needed
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
}
The core design ideas of the queued spinlock:
- Pending-bit optimization: with a single waiter, the pending bit avoids building a queue at all.
- MCS lock: the MCS (Mellor-Crummey and Scott) algorithm lets every CPU spin on its own local variable, so waiters do not hammer one shared variable and cause cache-line bouncing.
- FIFO order: guarantees fairness and avoids starvation.
Part 4: the MCS lock - a scalable spinlock
4.1 The problem with traditional spinlocks
In a traditional spinlock, every waiter spins on the same lock variable. When the lock is released, all of them try to grab it at once, which causes:
- Cache-line bouncing: the lock's cache line ping-pongs between CPUs.
- Amplified contention: many CPUs issue CAS operations at the same time and most of them fail.
This is the classic thundering-herd problem; the naive test-and-set lock sketched below shows where it comes from.
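For contrast, here is a minimal user-space test-and-set spinlock in C11 atomics (purely illustrative, not kernel code). Every waiter spins on the single flag, so on release all of them race for the same cache line:
c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } tas_lock_t;

static void tas_lock(tas_lock_t *l)
{
	/* Everyone spins on the same variable - the scalability problem. */
	while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire)) {
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			; /* spin read-only until it looks free, then race again */
	}
}

static void tas_unlock(tas_lock_t *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}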
4.2 The core idea of the MCS lock
An MCS lock gives every waiter its own local node, and each CPU spins only on its own node. The nodes form a linked list, and the lock holder notifies only the next node in line.
struct mcs_spinlock has two relevant fields:
- locked: becomes 1 once this node has been granted the lock
- next: points to the next node
Enqueueing:
- Create your own node.
- Atomically swing the tail pointer to your node, obtaining the previous tail, prev.
- If prev exists, set prev->next to your node.
- Spin on your own node->locked until it becomes 1.
Unlocking:
- If your next is non-NULL, set next->locked to 1.
- If there is no next, try to atomically reset the tail pointer back to NULL. A sketch of the whole scheme follows.
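A compact user-space rendering of the MCS algorithm in C11 atomics (illustrative only; the kernel's qspinlock packs the state into one 32-bit word and layers the pending-bit fast path on top):
c
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
	struct mcs_node *_Atomic next;
	atomic_bool locked;	/* true once we have been granted the lock */
};

typedef struct { struct mcs_node *_Atomic tail; } mcs_lock_t;

static void mcs_lock(mcs_lock_t *lock, struct mcs_node *node)
{
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, false, memory_order_relaxed);

	/* Publish ourselves as the new tail and learn who was there before. */
	struct mcs_node *prev =
		atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
	if (prev == NULL)
		return;	/* queue was empty: lock acquired immediately */

	/* Link in behind the predecessor, then spin on our OWN node only. */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

static void mcs_unlock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *next =
		atomic_load_explicit(&node->next, memory_order_acquire);

	if (next == NULL) {
		/* No visible successor: try to swing the tail back to NULL. */
		struct mcs_node *expected = node;
		if (atomic_compare_exchange_strong_explicit(&lock->tail, &expected,
				NULL, memory_order_acq_rel, memory_order_acquire))
			return;	/* nobody was queued behind us */
		/* A successor is in the middle of enqueueing; wait for the link. */
		while ((next = atomic_load_explicit(&node->next,
						    memory_order_acquire)) == NULL)
			;
	}
	/* Hand the lock straight to the successor, which spins on its own node. */
	atomic_store_explicit(&next->locked, true, memory_order_release);
}
Each thread passes in its own node (the kernel uses the per-CPU qnodes array for this), which is why the waiting loop only ever touches a cache line local to that CPU.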
4.3 The MCS implementation in Linux
Linux's queued spinlock combines the MCS idea with a conventional lock word. Each CPU gets four MCS nodes (defined in kernel/locking/qspinlock.c):
c
struct qnodes {
struct mcs_spinlock mcs[MAX_NODES];
};
static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
Why four nodes? Because a CPU may need to take a spinlock again from nested contexts - task, soft interrupt, hard interrupt and NMI - so up to four levels of nesting are supported. If the limit is ever exceeded (which is extremely rare), the code degrades to spinning on the lock directly:
c
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
Part 5: memory barriers and memory ordering
5.1 Why are memory barriers needed?
On multicore systems, both the compiler and the CPU may reorder instructions for performance. A single thread never notices, but in multi-threaded synchronization this can cause serious bugs.
Take futex as an example:
c
set_current_state(TASK_INTERRUPTIBLE);
futex_queue(q, hb);
If the compiler or the CPU were allowed to reorder these two steps, the following could happen:
- futex_queue runs first (adding q to the list and releasing the lock).
- set_current_state runs afterwards.
- Before the state is set, another CPU sees q on the list and tries to wake it.
- But the task is still in TASK_RUNNING, so the wakeup has no effect, and the subsequent sleep never ends.
5.2 Where the barriers appear in the code
The sources use several different kinds of memory barriers:
- smp_mb(): a full memory barrier
c
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
- smp_store_mb(): a store followed by a full barrier
c
set_current_state(TASK_INTERRUPTIBLE);
// set_current_state() uses smp_store_mb() internally
- acquire/release semantics
c
atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)
- acquire: later reads and writes cannot be moved before the acquire
- release: earlier reads and writes cannot be moved after the release
- smp_cond_load_acquire(): a conditional load with acquire ordering
c
smp_cond_load_acquire(&lock->locked, !VAL);
5.3 What is special about x86
x86 has a comparatively strong memory model (TSO - Total Store Order), so in many cases no explicit barrier instruction is needed:
- plain stores already have release semantics
- plain loads already have acquire semantics
- instructions with the LOCK prefix act as full barriers
To stay portable, however, the kernel still uses the generic barrier macros, which expand to whatever each architecture requires. The sketch below shows the canonical acquire/release pairing that these macros implement.
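A minimal C11 message-passing sketch (user-space and illustrative; the kernel's smp_store_release()/smp_load_acquire() follow the same pattern):
c
#include <stdatomic.h>

static int data;		/* payload, written before publication */
static atomic_int ready;

/* Producer: the release store orders the data write before the flag write. */
static void producer(void)
{
	data = 42;
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: the acquire load orders the flag read before the data read. */
static int consumer(void)
{
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		; /* spin until published */
	return data; /* guaranteed to observe 42 */
}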
Part 6: contention scenarios from end to end
6.1 Scenario 1: going to sleep
In the truly uncontended case a user-space lock never issues the futex syscall at all - its atomic fast path succeeds in user space. Once a thread does decide to wait, the user program looks like this:
c
// the expected value equals the current value
if (*futex == 1) {
futex_wait(&futex, 1); // blocks in the kernel while *futex is still 1
}
Because the value still matches, the thread enters the kernel and:
- futex_wait_setup takes hb->lock
- uaddr is read again to confirm the value has not changed
- futex_queue adds the thread to the queue and releases the lock
- schedule puts it to sleep
6.2 Scenario 2: wakeup
Another thread runs:
c
*futex = 2;
futex_wake(&futex, 1);
The wake path:
- The same hash computation finds the same hb.
- Take hb->lock.
- Walk the list and find the highest-priority waiter whose key matches.
- Remove it from the list and mark the task runnable.
- Release the lock.
- At the next scheduling opportunity, the waiting thread runs again.
6.3 Scenario 3: timeout handling
The timeout is implemented with an hrtimer_sleeper:
c
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
When the timer expires, its callback wakes the task. If the task is still on the list, it removes itself and __futex_wait returns -ETIMEDOUT (the to->task check in the listing above).
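From user space, a timed wait is the same syscall with a timeout argument. A minimal sketch (with plain FUTEX_WAIT the timeout is relative; the variable names are made up for the example):
c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	uint32_t word = 1;
	struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 }; /* relative: 1 second */

	/* Sleeps while word == 1; nobody wakes us, so the timer fires. */
	long ret = syscall(SYS_futex, &word, FUTEX_WAIT_PRIVATE, 1, &ts, NULL, 0);
	if (ret == -1 && errno == ETIMEDOUT)
		printf("futex wait timed out as expected\n");
	return 0;
}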
6.4 Scenario 4: spurious wakeups
Because of signals or other events, schedule() may return without an explicit wakeup. The code decides whether a wakeup was genuine by checking plist_node_empty(&q->list) (via futex_unqueue):
- If the node is no longer on the list, some thread really woke us: return success.
- If the node is still on the list, the wakeup may have been spurious: go back to sleep (retry), or handle the timeout or signal.
Part 7: performance tricks
7.1 Separating the fast and slow paths
The spinlock fast path is a single cmpxchg, so its overhead is tiny. The common, uncontended case succeeds and returns immediately; only contention falls into the slow path.
7.2 Cache-line placement
c
} __futex_data __read_mostly __aligned(2*sizeof(long));
futex_queues and futex_hashsize are wrapped in one struct and aligned to 2*sizeof(long) so that they land in the same cache line. The kernel comment (see the listing at the end) gives the reason: after initialization the two fields are only ever used together, inside futex_hash(), and both are __read_mostly, so packing them together lets a single cache-line fill serve both reads.
7.3 Per-CPU variables
The MCS nodes are allocated with DEFINE_PER_CPU_ALIGNED, so each CPU has its own node array and never contends on another CPU's cache line.
7.4 Prefetching the next node
c
if (next)
prefetchw(next);
Prefetching the next node's cache line (for write) shrinks the cache miss in the upcoming unlock hand-off.
Part 8: debugging and tracing
8.1 lockdep dependency checking
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_) feeds the lockdep machinery. It records the order in which locks are taken and detects possible deadlocks at runtime; when a circular dependency is found it prints a warning together with the call stacks involved.
8.2 Tracepoints
c
trace_contention_begin(lock, LCB_F_SPIN);
trace_contention_end(lock, 0);
These tracepoints let you monitor lock contention with perf or ftrace and track down system bottlenecks.
8.3 lockevent counters
c
lockevent_inc(lock_slowpath);
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
With CONFIG_LOCK_EVENT_COUNTS enabled at build time, counters for the various lock events can be read back through debugfs (under /sys/kernel/debug/lockevent/).
8.4 Debug output in the code
In futex_queue:
c
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
pr_debug("futex:queue:total %d waiters\n", count);
These pr_debug calls only produce output when DEBUG is defined or dynamic debug is enabled, so the log level can be controlled at runtime.
Conclusion: the complete chain from user space down to the hardware
Starting from a single futex_wait call in user space, we followed the execution path into the kernel and all the way down to the CPU's atomic instructions. The chain involves:
- the futex system call: the boundary between user space and the kernel
- the hash bucket machinery: mapping futex addresses onto a bounded set of wait queues
- priority lists: ensuring real-time threads are woken first
- spinlocks: protecting the kernel's critical data structures against concurrent access
- queued spinlocks: a scalable multicore lock algorithm built around the MCS idea
- atomic operations: the LOCK prefix and cmpxchg, keeping multicore updates safe
- memory barriers: keeping memory accesses in the required order
The chain shows how the Linux kernel balances performance against correctness:
- fast paths stay in user space or in a handful of simple instructions wherever possible
- slow paths use careful queueing algorithms to reduce cache contention
- precisely placed memory barriers keep the concurrency correct
- debugging and tracing facilities help developers analyse and optimize the result
Understanding these mechanisms not only helps us write better concurrent programs, it also deepens our understanding of the core ideas of operating system design. After decades of evolution, the kernel's synchronization machinery has become a model implementation in the field of concurrent programming.
## Source listing
cpp
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
__acquire(lock);
arch_spin_lock(&lock->raw_lock);
mmiowb_spin_lock();
}
# define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
#define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i)
#define spin_acquire(l, s, t, i) lock_acquire_exclusive(l, s, t, NULL, i)
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
#ifndef CONFIG_INLINE_SPIN_LOCK
noinline void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);
#endif
#define raw_spin_lock(lock) _raw_spin_lock(lock)
static __always_inline void spin_lock(spinlock_t *lock)
{
raw_spin_lock(&lock->rlock);
}
/**
* futex_hash - Return the hash bucket in the global hash
* @key: Pointer to the futex key for which the hash is calculated
*
* We hash on the keys returned from get_futex_key (see below) and return the
* corresponding hash bucket in the global hash.
*/
struct futex_hash_bucket *futex_hash(union futex_key *key)
{
u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
key->both.offset);
return &futex_queues[hash & (futex_hashsize - 1)];
}
/* The key must be already stored in q->key. */
struct futex_hash_bucket *futex_q_lock(struct futex_q *q)
__acquires(&hb->lock)
{
struct futex_hash_bucket *hb;
hb = futex_hash(&q->key);
/*
* Increment the counter before taking the lock so that
* a potential waker won't miss a to-be-slept task that is
* waiting for the spinlock. This is safe as all futex_q_lock()
* users end up calling futex_queue(). Similarly, for housekeeping,
* decrement the counter at futex_q_unlock() when some error has
* occurred and we don't end up adding the task to the list.
*/
futex_hb_waiters_inc(hb); /* implies smp_mb(); (A) */
q->lock_ptr = &hb->lock;
spin_lock(&hb->lock);
return hb;
}
/*
* The base of the bucket array and its size are always used together
* (after initialization only in futex_hash()), so ensure that they
* reside in the same cacheline.
*/
static struct {
struct futex_hash_bucket *queues;
unsigned long hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)
/**
* futex_wait_setup() - Prepare to wait on a futex
* @uaddr: the futex userspace address
* @val: the expected value
* @flags: futex flags (FLAGS_SHARED, etc.)
* @q: the associated futex_q
* @hb: storage for hash_bucket pointer to be returned to caller
*
* Setup the futex_q and locate the hash_bucket. Get the futex value and
* compare it with the expected value. Handle atomic faults internally.
* Return with the hb lock held on success, and unlocked on failure.
*
* Return:
* - 0 - uaddr contains val and hb has been locked;
* - <1 - -EFAULT or -EWOULDBLOCK (uaddr does not contain val) and hb is unlocked
*/
int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
/*
* Access the page AFTER the hash-bucket is locked.
* Order is important:
*
* Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
* Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
*
* The basic logical guarantee of a futex is that it blocks ONLY
* if cond(var) is known to be true at the time of blocking, for
* any cond. If we locked the hash-bucket after testing *uaddr, that
* would open a race condition where we could block indefinitely with
* cond(var) false, which would violate the guarantee.
*
* On the other hand, we insert q and release the hash-bucket only
* after testing *uaddr. This guarantees that futex_wait() will NOT
* absorb a wakeup if *uaddr does not match the desired values
* while the syscall executes.
*/
retry:
ret = get_futex_key(uaddr, flags, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
retry_private:
*hb = futex_q_lock(q);
ret = futex_get_value_locked(&uval, uaddr);
if (ret) {
futex_q_unlock(*hb);
ret = get_user(uval, uaddr);
if (ret)
return ret;
if (!(flags & FLAGS_SHARED))
goto retry_private;
goto retry;
}
if (uval != val) {
futex_q_unlock(*hb);
ret = -EWOULDBLOCK;
}
return ret;
}
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
// yym-gaizao
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
/* No timeout, nothing to clean up. */
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
/**
* futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
/**
* futex_queue() - Enqueue the futex_q on the futex_hash_bucket
* @q: The futex_q to enqueue
* @hb: The destination hash bucket
*
* The hb->lock must be held by the caller, and is released here. A call to
* futex_queue() is typically paired with exactly one call to futex_unqueue(). The
* exceptions involve the PI related operations, which may use futex_unqueue_pi()
* or nothing if the unqueue is done as part of the wake process and the unqueue
* state is implicit in the state of woken task (see futex_wait_requeue_pi() for
* an example).
*/
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
__futex_queue(q, hb);
// yym-gaizao
struct plist_head *head = &hb->chain;
struct plist_node *node;
struct futex_q *q_temp;
int count = 0;
// yym-gaizao
if (!plist_head_empty(head)) {
plist_for_each(node, head) {
q_temp = container_of(node, struct futex_q, list);
pr_debug("futex:queue:PID %d (%s)\n", q_temp->task->pid, q_temp->task->comm);
count++;
}
pr_debug("futex:queue:total %d waiters\n", count);
}
spin_unlock(&hb->lock);
}
void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
{
int prio;
/*
* The priority used to register this element is
* - either the real thread-priority for the real-time threads
* (i.e. threads with a priority lower than MAX_RT_PRIO)
* - or MAX_RT_PRIO for non-RT threads.
* Thus, all RT-threads are woken first in priority order, and
* the others are woken last, in FIFO order.
*/
prio = min(current->normal_prio, MAX_RT_PRIO);
plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
q->task = current;
}
/**
* plist_add - add @node to @head
*
* @node: &struct plist_node pointer
* @head: &struct plist_head pointer
*/
void plist_add(struct plist_node *node, struct plist_head *head)
{
struct plist_node *first, *iter, *prev = NULL;
struct list_head *node_next = &head->node_list;
plist_check_head(head);
WARN_ON(!plist_node_empty(node));
WARN_ON(!list_empty(&node->prio_list));
if (plist_head_empty(head))
goto ins_node;
first = iter = plist_first(head);
do {
if (node->prio < iter->prio) {
node_next = &iter->node_list;
break;
}
prev = iter;
iter = list_entry(iter->prio_list.next,
struct plist_node, prio_list);
} while (iter != first);
if (!prev || prev->prio != node->prio)
list_add_tail(&node->prio_list, &iter->prio_list);
ins_node:
list_add_tail(&node->node_list, node_next);
plist_check_head(head);
}
static __always_inline void spin_unlock(spinlock_t *lock)
{
raw_spin_unlock(&lock->rlock);
}
#define raw_spin_unlock(lock) _raw_spin_unlock(lock)
static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
{
mmiowb_spin_unlock();
arch_spin_unlock(&lock->raw_lock);
__release(lock);
}
# define __release(x) (void)0
#define arch_spin_lock(l) queued_spin_lock(l)
#define arch_spin_unlock(l) queued_spin_unlock(l)
#ifndef queued_spin_unlock
/**
* queued_spin_unlock - release a queued spinlock
* @lock : Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
/*
* unlock() needs release semantics:
*/
smp_store_release(&lock->locked, 0);
}
#endif
/**
* queued_spin_lock - acquire a queued spinlock
* @lock: Pointer to queued spinlock structure
*/
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
int val = 0;
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;
queued_spin_lock_slowpath(lock, val);
}
/**
* atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Unsafe to use in noinstr code; use raw_atomic_try_cmpxchg_acquire() there.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
instrument_atomic_read_write(v, sizeof(*v));
instrument_atomic_read_write(old, sizeof(*old));
return raw_atomic_try_cmpxchg_acquire(v, old, new);
}
/**
* raw_atomic_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_t
* @old: pointer to int value to compare with
* @new: int value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Safe to use in noinstr code; prefer atomic_try_cmpxchg_acquire() elsewhere.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
raw_atomic_try_cmpxchg_acquire(atomic_t *v, int *old, int new)
{
#if defined(arch_atomic_try_cmpxchg_acquire)
return arch_atomic_try_cmpxchg_acquire(v, old, new);
#elif defined(arch_atomic_try_cmpxchg_relaxed)
bool ret = arch_atomic_try_cmpxchg_relaxed(v, old, new);
__atomic_acquire_fence();
return ret;
#elif defined(arch_atomic_try_cmpxchg)
return arch_atomic_try_cmpxchg(v, old, new);
#else
int r, o = *old;
r = raw_atomic_cmpxchg_acquire(v, o, new);
if (unlikely(r != o))
*old = r;
return likely(r == o);
#endif
}
static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
{
return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
#define arch_try_cmpxchg(ptr, pold, new) \
__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
#define __try_cmpxchg(ptr, pold, new, size) \
__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "q" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_W: \
{ \
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_L: \
{ \
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
/**
* queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
* @val: Current value of the queued spinlock 32-bit word
*
* (queue tail, pending bit, lock value)
*
* fast : slow : unlock
* : :
* uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
* : | ^--------.------. / :
* : v \ \ | :
* pending : (0,1,1) +--> (0,1,0) \ | :
* : | ^--' | | :
* : v | | :
* uncontended : (n,x,y) +--> (n,0,0) --' | :
* queue : | ^--' | :
* : v | :
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
if (pv_enabled())
goto pv_queue;
if (virt_spin_lock(lock))
return;
/*
* Wait for in-progress pending->locked hand-overs with a bounded
* number of spins so that we guarantee forward progress.
*
* 0,1,0 -> 0,0,1
*/
if (val == _Q_PENDING_VAL) {
int cnt = _Q_PENDING_LOOPS;
val = atomic_cond_read_relaxed(&lock->val,
(VAL != _Q_PENDING_VAL) || !cnt--);
}
/*
* If we observe any contention; queue.
*/
if (val & ~_Q_LOCKED_MASK)
goto queue;
/*
* trylock || pending
*
* 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
*/
val = queued_fetch_set_pending_acquire(lock);
/*
* If we observe contention, there is a concurrent locker.
*
* Undo and queue; our setting of PENDING might have made the
* n,0,0 -> 0,0,0 transition fail and it will now be waiting
* on @next to become !NULL.
*/
if (unlikely(val & ~_Q_LOCKED_MASK)) {
/* Undo PENDING if we set it. */
if (!(val & _Q_PENDING_MASK))
clear_pending(lock);
goto queue;
}
/*
* We're pending, wait for the owner to go away.
*
* 0,1,1 -> *,1,0
*
* this wait loop must be a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because not all
* clear_pending_set_locked() implementations imply full
* barriers.
*/
if (val & _Q_LOCKED_MASK)
smp_cond_load_acquire(&lock->locked, !VAL);
/*
* take ownership and clear the pending bit.
*
* 0,1,0 -> 0,0,1
*/
clear_pending_set_locked(lock);
lockevent_inc(lock_pending);
return;
/*
* End of pending bit optimistic spinning and beginning of MCS
* queuing.
*/
queue:
lockevent_inc(lock_slowpath);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
trace_contention_begin(lock, LCB_F_SPIN);
/*
* 4 nodes are allocated based on the assumption that there will
* not be nested NMIs taking spinlocks. That may not be true in
* some architectures even though the chance of needing more than
* 4 nodes will still be extremely unlikely. When that happens,
* we fall back to spinning on the lock directly without using
* any MCS node. This is not the most elegant solution, but is
* simple enough.
*/
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}
node = grab_mcs_node(node, idx);
/*
* Keep counts of non-zero index values:
*/
lockevent_cond_inc(lock_use_node2 + idx - 1, idx);
/*
* Ensure that we increment the head node->count before initialising
* the actual node. If the compiler is kind enough to reorder these
* stores, then an IRQ could overwrite our assignments.
*/
barrier();
node->locked = 0;
node->next = NULL;
pv_init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
* attempt the trylock once more in the hope someone let go while we
* weren't watching.
*/
if (queued_spin_trylock(lock))
goto release;
/*
* Ensure that the initialisation of @node is complete before we
* publish the updated tail via xchg_tail() and potentially link
* @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
*/
smp_wmb();
/*
* Publish the updated tail.
* We have already touched the queueing cacheline; don't bother with
* pending stuff.
*
* p,*,* -> n,*,*
*/
old = xchg_tail(lock, tail);
next = NULL;
/*
* if there was a previous node; link it and wait until reaching the
* head of the waitqueue.
*/
if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
/*
* While waiting for the MCS lock, the next pointer may have
* been set by another lock waiter. We optimistically load
* the next pointer & prefetch the cacheline for writing
* to reduce latency in the upcoming MCS unlock operation.
*/
next = READ_ONCE(node->next);
if (next)
prefetchw(next);
}
/*
* we're at the head of the waitqueue, wait for the owner & pending to
* go away.
*
* *,x,y -> *,0,0
*
* this wait loop must use a load-acquire such that we match the
* store-release that clears the locked bit and create lock
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
*
* The PV pv_wait_head_or_lock function, if active, will acquire
* the lock and return a non-zero value. So we have to skip the
* atomic_cond_read_acquire() call. As the next PV queue head hasn't
* been designated yet, there is no way for the locked value to become
* _Q_SLOW_VAL. So both the set_locked() and the
* atomic_cmpxchg_relaxed() calls will be safe.
*
* If PV isn't active, 0 will be returned instead.
*
*/
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
locked:
/*
* claim the lock:
*
* n,0,0 -> 0,0,1 : lock, uncontended
* *,*,0 -> *,*,1 : lock, contended
*
* If the queue head is the only one in the queue (lock value == tail)
* and nobody is pending, clear the tail code and grab the lock.
* Otherwise, we only need to grab the lock.
*/
/*
* In the PV case we might already have _Q_LOCKED_VAL set, because
* of lock stealing; therefore we must also allow:
*
* n,0,1 -> 0,0,1
*
* Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
* above wait condition, therefore any concurrent setting of
* PENDING will make the uncontended transition fail.
*/
if ((val & _Q_TAIL_MASK) == tail) {
if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
goto release; /* No contention */
}
/*
* Either somebody is queued behind us or _Q_PENDING_VAL got set
* which will then detect the remaining tail and queue behind us
* ensuring we'll see a @next.
*/
set_locked(lock);
/*
* contended path; wait for next if not observed yet, release.
*/
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
arch_mcs_spin_unlock_contended(&next->locked);
pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
/*
* release the node
*/
__this_cpu_dec(qnodes[0].mcs.count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);