Linux kernel spinlocks (4): queued spinlock

Contents

Preface
1. queued spinlock
- 1.1 Overview
- 1.2 spin_lock/spin_unlock
2. Source code analysis
- 2.1 struct qspinlock
- 2.2 struct qnode
- 2.3 queued_spin_lock
  - 2.3.1 Fast path
  - 2.3.2 Slow path
  - queued_spin_lock_slowpath
  - Summary
- 2.4 queued_spin_unlock
3. Summary
References

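Preface

(1) For the spinlock used by old Linux kernels (before 2.6.25), see: Linux kernel spinlocks (1)

Its drawback: the lock is contended in no particular order, so it does not guarantee that the first requester gets the lock first, which is unfair. All threads race for the spinlock, and whoever wins takes it, regardless of whether a thread has been waiting for a long time or has only just started to spin. When contention is rare the unfairness is barely noticeable. As hardware evolved, however, core counts kept growing and contention between cores became more intense, so the unordered spinlock turned into a performance problem.

(2) Since 2.6.25 the kernel spinlock has been the ticket spinlock, a queued spinlock based on a FIFO algorithm. See: Linux kernel spinlocks (2): ticket spinlock

The ticket spinlock lets CPUs take ownership of the lock in their order of arrival, which turns the race into ordered contention.

Its drawback: every CPU that requests the lock spins on the same variable, so cache synchronization is expensive and the lock does not scale to systems with many processors. Acquiring or releasing the lock modifies the lock word, which invalidates the cache line holding it on every other waiting CPU. On large systems with hundreds or even thousands of processors, contention can be heavy, the cache traffic becomes very costly, and system performance drops sharply.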
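The shared-variable problem is easy to see in a minimal user-space sketch of a ticket lock. This is an illustration written with C11 atomics, not the kernel's implementation; the names `ticket_lock`, `ticket_lock_acquire` and `ticket_lock_release` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space ticket lock, for illustration only. */
struct ticket_lock {
	atomic_uint next;   /* next ticket to hand out */
	atomic_uint owner;  /* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Taking a ticket fixes our position: arrival order is service order. */
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);

	/*
	 * Every waiter polls the same word. Each release writes it, which
	 * invalidates the cache line on all waiting CPUs at once.
	 */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Serve the next ticket in line. */
	atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Fairness comes from the ticket counter, while the scalability cost comes from the single `owner` word that all waiters read.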

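(3) Because the ticket spinlock suffers from cache-line bouncing, kernel developers proposed the MCS lock. See: Linux kernel spinlocks (3): MCS locks

With an MCS lock the CPUs no longer all wait on the same spinlock variable. Each one waits on its own per-CPU variable, so it normally only polls the local cache line holding that variable, and it needs to go to memory and refresh the line only when the variable actually changes.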
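A minimal user-space sketch of the MCS idea, assuming C11 atomics, may help here. It is not the kernel's `mcs_spinlock` code; the names `mcs_lock`, `mcs_node`, `mcs_acquire` and `mcs_release` are hypothetical, and each caller is assumed to supply its own node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
	struct mcs_node *_Atomic next;  /* successor in the wait queue */
	atomic_int locked;              /* set to 1 when the lock is handed to us */
};

struct mcs_lock {
	struct mcs_node *_Atomic tail;  /* last waiter, NULL when the lock is free */
};

static void mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
{
	atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&me->locked, 0, memory_order_relaxed);

	/* Swap ourselves in as the new tail of the queue. */
	struct mcs_node *prev = atomic_exchange_explicit(&l->tail, me,
							 memory_order_acq_rel);
	if (!prev)
		return;  /* queue was empty: we own the lock */

	/* Link behind the predecessor, then spin on our own node only. */
	atomic_store_explicit(&prev->next, me, memory_order_release);
	while (!atomic_load_explicit(&me->locked, memory_order_acquire))
		;
}

static void mcs_release(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *next = atomic_load_explicit(&me->next,
						     memory_order_acquire);

	if (!next) {
		/* No visible successor: try to mark the queue empty. */
		struct mcs_node *expected = me;

		if (atomic_compare_exchange_strong_explicit(&l->tail, &expected,
				NULL, memory_order_release, memory_order_relaxed))
			return;

		/* A successor swapped the tail but has not linked yet. */
		while (!(next = atomic_load_explicit(&me->next,
						     memory_order_acquire)))
			;
	}

	/* Hand the lock over by writing the successor's private flag. */
	atomic_store_explicit(&next->locked, 1, memory_order_release);
}
```

The lock itself is a pointer and each node carries another one, so this form needs more space than a 4-byte lock word.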

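An MCS lock makes the spinlock structure larger (one pointer more than the ticket spinlock, and a pointer is 8 bytes on 64-bit platforms). Many kernel data structures embed a spinlock, struct page being one example, so the MCS lock was not adopted as the implementation of spinlock_t.

(4) Since 4.2 the kernel spinlock has been the queued spinlock, a queued spinlock based on the MCS algorithm:

The queued spinlock is also a queueing lock, but it does not increase the size of the spinlock structure. Requesters queue in the order in which they ask for the lock, and the first to ask is the first to get it. The MCS strategy is to give each processor its own copy of the variable and let it spin on that local copy, which solves the performance problem.

This article focuses on the queued spinlock.

A traditional spinlock causes heavy cache-line contention under high load because every waiting CPU keeps reading and modifying the same lock variable. The queued spinlock addresses this by building on the MCS lock (Mellor-Crummey and Scott lock):

Each waiting CPU has a local node (MCS node), so the CPUs do not fight over one cache line.

Waiters queue in order, which guarantees fairness.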

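1. queued spinlock

1.1 Overview

qspinlock is built on top of the MCS spinlock. The main idea is to let each spinner spin on its own per-CPU variable, which avoids constant cache-line bouncing between CPUs. Bouncing happens when all CPUs spin on the same lock variable and read it over and over. When one CPU unlocks, the variable is modified, the cache line is invalidated on every other CPU, and they all have to fetch it again, which costs performance. The MCS lock mitigates this by having each CPU spin on a dedicated variable, so there is no contention on a single lock word.

(1) Design background of qspinlock

Limitations of the MCS lock:

The MCS lock reduces cache-line contention through queued waiting and local spinning, but its extra struct mcs_spinlock adds memory overhead.

Many kernel data structures (such as struct page) cannot tolerate a larger lock.

(2) Goal of qspinlock:

Keep the performance benefit of the MCS lock while compressing it into 32 bits (4 bytes), so that it can be embedded in existing data structures.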

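1.2 spin_lock/spin_unlock

Usage, with dynamic initialization:

spinlock_t lock;        /* declare a spinlock variable */
spin_lock_init(&lock);  /* initialize the spinlock */

Static definition:

DEFINE_SPINLOCK(lock);

Locking and unlocking:

spin_lock(&lock);    /* acquire the lock */
/* critical section */
spin_unlock(&lock);  /* release the lock */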

// include/linux/spinlock_types.h

#define DEFINE_SPINLOCK(x)	spinlock_t x = __SPIN_LOCK_UNLOCKED(x)

// include/linux/spinlock.h

# define spin_lock_init(_lock)			\
do {						\
	spinlock_check(_lock);			\
	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);	\
} while (0)

// include/linux/spinlock.h

#define raw_spin_lock(lock)	_raw_spin_lock(lock)

static __always_inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
}

// kernel/locking/spinlock.c

void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)
{
	__raw_spin_lock(lock);
}
EXPORT_SYMBOL(_raw_spin_lock);

//  /include/linux/spinlock_api_smp.h

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

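preempt_disable(): disables preemption by incrementing the preempt count, so the task running on the current CPU cannot be preempted.

spin_acquire(): a lockdep annotation that records the acquisition for lock-dependency checking. It compiles to nothing when lockdep is disabled.

LOCK_CONTENDED(): with CONFIG_LOCK_STAT enabled, this macro first calls do_raw_spin_trylock and, only if that fails, records the contention and then calls do_raw_spin_lock. Without lock statistics it calls do_raw_spin_lock directly.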

static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);
	arch_spin_lock(&lock->raw_lock);
	mmiowb_spin_lock();
}

// /include/asm-generic/qspinlock.h

#define arch_spin_lock(l)		queued_spin_lock(l)

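The call chain is: spin_lock() -> raw_spin_lock() -> _raw_spin_lock() -> __raw_spin_lock() -> do_raw_spin_lock() -> arch_spin_lock() -> queued_spin_lock().

2. Source code analysis

2.1 struct qspinlock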

typedef struct qspinlock {
	union {
		atomic_t val;

		/*
		 * By using the whole 2nd least significant byte for the
		 * pending bit, we can allow better optimization of the lock
		 * acquisition for the pending bit holder.
		 */
#ifdef __LITTLE_ENDIAN
		struct {
			u8	locked;
			u8	pending;
		};
		struct {
			u16	locked_pending;
			u16	tail;
		};
		......
	};
} arch_spinlock_t;

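The qspinlock data structure is 4 bytes in total.

The 32 bits of the lock are divided into four fields:

(1) locked: whether the lock is held (0 = free, 1 = held). It is one byte long and occupies bits 0-7.

(2) pending: occupies bits 8-15, of which only bit 8 is used as the flag. The first CPU that has to wait for the lock sets it. It indicates that a CPU is waiting outside the MCS queue and spinning on the locked field.

(3) index: an array index in bits 16-17. It tells which entry of the per-CPU node array (qnodes in v5.15, mcs_nodes in older kernels) the tail node of the queue uses. The value is the nesting level on that CPU. Two bits are enough because at most four contexts can nest (task, softirq, hardirq, nmi).

(4) cpu: the CPU number of the tail node of the queue, in bits 18-31. The stored value is the CPU number plus 1, so the real CPU number is the field minus 1, and a tail of 0 means the queue is empty.

The index and cpu fields together form the tail field. In the default layout tail takes 2 bytes and consists of the tail index (bits 16-17) and the tail cpu (bits 18-31). It describes the tail node of the queue, and its layout has two cases:

(1) If the number of CPUs is less than 2^14, bits 9-15 are unused, bits 16-17 are the index field, and bits 18-31 are the cpu field.

(2) If the number of CPUs is greater than or equal to 2^14, bits 9-10 are the index field and bits 11-31 are the cpu field.

The first case is the layout used when the system has fewer than 16K CPUs. When there are more, the tail has to grow at the expense of the pending field: pending shrinks to a single bit and the other 7 bits of its byte go to the tail.

The key to fitting an MCS spinlock into 4 bytes is to store a CPU number and an array index instead of the address of the tail node.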
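The layout for fewer than 16K CPUs can be written out as a small user-space sketch. The constants mirror the bit positions described above; the helper names `tail_encode`, `tail_cpu` and `tail_idx` are hypothetical and only illustrate the packing, they are not the kernel's functions.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout for fewer than 16K CPUs. */
#define Q_LOCKED_MASK      0x000000ffu  /* bits 0-7 */
#define Q_PENDING_MASK     0x0000ff00u  /* bits 8-15, only bit 8 is set */
#define Q_TAIL_IDX_OFFSET  16
#define Q_TAIL_IDX_MASK    0x00030000u  /* bits 16-17 */
#define Q_TAIL_CPU_OFFSET  18
#define Q_TAIL_CPU_MASK    0xfffc0000u  /* bits 18-31 */

/* Pack a CPU number and a nesting index into the tail bits. */
static uint32_t tail_encode(unsigned int cpu, unsigned int idx)
{
	/* cpu + 1 keeps "CPU 0, index 0" distinct from "no tail". */
	return ((uint32_t)(cpu + 1) << Q_TAIL_CPU_OFFSET) |
	       ((uint32_t)idx << Q_TAIL_IDX_OFFSET);
}

/* Recover the CPU number from a lock word whose tail is non-zero. */
static unsigned int tail_cpu(uint32_t val)
{
	return ((val & Q_TAIL_CPU_MASK) >> Q_TAIL_CPU_OFFSET) - 1;
}

/* Recover the nesting index from a lock word. */
static unsigned int tail_idx(uint32_t val)
{
	return (val & Q_TAIL_IDX_MASK) >> Q_TAIL_IDX_OFFSET;
}
```

For example, CPU 2 at nesting level 1 encodes to 0x000d0000, and the pair (CPU, index) is enough to find the matching per-CPU node without storing a pointer.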

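The kernel optimizes the MCS spinlock: the first CPU waiting for the lock spins on the lock itself rather than on its own mcs_spinlock structure. The benefit is that releasing the lock does not have to touch the cache line of an mcs_spinlock structure, which saves one cache miss. Later CPUs spin on their own mcs_spinlock structures until they move to the head of the queue.

The pending bit extends this optimization. The first waiter simply sets the pending bit and does not need its own mcs_spinlock structure. The second CPU sees that pending is set, starts building the wait queue, and spins on the locked field of its own mcs_spinlock structure. This removes cache synchronization between the two waiters, and because the first waiter never uses an mcs_spinlock structure, one more cache miss is avoided.

Pending bit optimization: fewer cache-line accesses

The first waiter only sets the pending bit and does not touch a qnode, which reduces cache-line accesses.

The second waiter starts building the queue and spins on its local qnode.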

2.2 struct qnode

// v5.15/source/kernel/locking/qspinlock.c
/*
 * The basic principle of a queue-based spinlock can best be understood
 * by studying a classic queue-based spinlock implementation called the
 * MCS lock. A copy of the original MCS lock paper ("Algorithms for Scalable
 * Synchronization on Shared-Memory Multiprocessors by Mellor-Crummey and
 * Scott") is available at
 *
 * https://bugzilla.kernel.org/show_bug.cgi?id=206115
 *
 * This queued spinlock implementation is based on the MCS lock, however to
 * make it fit the 4 bytes we assume spinlock_t to be, and preserve its
 * existing API, we must modify it somehow.
 *
 * In particular; where the traditional MCS lock consists of a tail pointer
 * (8 bytes) and needs the next pointer (another 8 bytes) of its own node to
 * unlock the next pending (next->locked), we compress both these: {tail,
 * next->locked} into a single u32 value.
 *
 * Since a spinlock disables recursion of its own context and there is a limit
 * to the contexts that can nest; namely: task, softirq, hardirq, nmi. As there
 * are at most 4 nesting levels, it can be encoded by a 2-bit number. Now
 * we can encode the tail by combining the 2-bit nesting level with the cpu
 * number. With one byte for the lock value and 3 bytes for the tail, only a
 * 32-bit word is now needed. Even though we only need 1 bit for the lock,
 * we extend it to a full byte to achieve better performance for architectures
 * that support atomic byte write.
 *
 * We also change the first spinner to spin on the lock bit instead of its
 * node; whereby avoiding the need to carry a node from lock to unlock, and
 * preserving existing lock API. This also makes the unlock code simpler and
 * faster.
 *
 * N.B. The current implementation only supports architectures that allow
 *      atomic operations on smaller 8-bit and 16-bit data types.
 *
 */

#define MAX_NODES	4

/*
 * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
 * size and four of them will fit nicely in one 64-byte cacheline. For
 * pvqspinlock, however, we need more space for extra data. To accommodate
 * that, we insert two more long words to pad it up to 32 bytes. IOW, only
 * two of them can fit in a cacheline in this case. That is OK as it is rare
 * to have more than 2 levels of slowpath nesting in actual use. We don't
 * want to penalize pvqspinlocks to optimize for a rare case in native
 * qspinlocks.
 */
struct qnode {
	struct mcs_spinlock mcs;
	......
#endif
};

/*
 * Per-CPU queue node structures; we can never have more than 4 nested
 * contexts: task, softirq, hardirq, nmi.
 *
 * Exactly fits one 64-byte cacheline on a 64-bit architecture.
 *
 * PV doubles the storage and uses the second cacheline for PV state.
 */
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);

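The comments above explain the basic principle of the MCS lock and why it cannot directly replace the existing ticket spinlock. The MCS lock reduces cache-line contention by queueing waiters, but each lock needs an additional structure, which increases memory overhead. That is not acceptable for size-sensitive structures such as struct page.

(1) qspinlock solves this by compressing the MCS tail pointer and the state information into 32 bits. The nesting level and the CPU number are combined into the tail field, so both must be encoded in a limited number of bits. The pending bit optimization is equally important, because it lets the first waiter avoid using an extra structure and therefore reduces cache accesses.

Compressing the MCS lock:

The MCS tail pointer (8 bytes) and the next pointer needed to reach next->locked (another 8 bytes) are compressed into 32 bits.

The tail field encodes the CPU number of the queue tail and its nesting level.

(2) The comments also cover how qspinlock is handled in the different contexts (task, softirq, hardirq, NMI) and how the per-CPU qnodes array is allocated and managed.

Context isolation:

Each CPU preallocates an array of qnodes (qnodes[MAX_NODES]), one element per nesting level.

Lock requests at different nesting levels use separate qnodes, so nested requests do not collide.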

// v5.15/source/kernel/locking/mcs_spinlock.h

struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked; /* 1 if lock acquired */
	int count;  /* nesting count, see qspinlock.c */
};

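Each CPU needs 4 queue nodes because lock requests can nest across four contexts:

contexts: task, softirq, hardirq, nmi.

(1) task: the lock function disables kernel preemption, so a task waiting for a spinlock cannot be preempted by another task.

(2) softirq: a task waiting for a spinlock may be interrupted by a softirq, which then waits for another spinlock.

(3) hardirq: a softirq waiting for a spinlock may be interrupted by a hardirq, which then waits for another spinlock.

(4) nmi: a hardirq waiting for a spinlock may be interrupted by a non-maskable interrupt, which then waits for another spinlock.

In short, one CPU can be waiting for at most 4 spinlocks at the same time.

Compared with the ticket spinlock, the extra memory cost of the queued spinlock is the per-CPU node array (qnodes in v5.15, mcs_nodes in older kernels).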

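(Figure not reproduced here. Source: http://www.wowotech.net/kernel_synchronization/queued_spinlock.html)

Suppose that in thread context a CPU fails to take lock A and starts to spin, so its MCS node must be linked into the queue of spinlock A. While it spins, a softirq may run on that CPU and try to take lock B in its handler. If that fails, an MCS node of the same CPU must be linked into the queue of spinlock B. In this scenario the MCS node for thread context and the one for softirq context must be distinct. Such nesting is at most four levels deep: thread context, softirq context, hardirq context and NMI context. That is why each CPU defines several MCS nodes (currently four) to handle nested spinlock requests.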
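The nesting scheme can be modelled in user space as follows. This is a sketch under the assumption that one `struct cpu_nodes` stands in for one CPU's node array; the names `cpu_nodes`, `cpu_nodes_init`, `node_grab` and `node_release` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4

struct node {
	struct node *next;
	int locked;
	int count;  /* nesting depth, only meaningful in slot 0 */
};

/* Stands in for one CPU's array of queue nodes. */
struct cpu_nodes {
	struct node slot[MAX_NODES];
};

static void cpu_nodes_init(struct cpu_nodes *c)
{
	for (int i = 0; i < MAX_NODES; i++) {
		c->slot[i].next = NULL;
		c->slot[i].locked = 0;
		c->slot[i].count = 0;
	}
}

/*
 * Claim the node for the current nesting level. Returns NULL when all
 * four levels are in use; the caller would then spin on the lock word.
 * node_release() must be called in both cases.
 */
static struct node *node_grab(struct cpu_nodes *c, int *idx)
{
	*idx = c->slot[0].count++;
	if (*idx >= MAX_NODES)
		return NULL;

	c->slot[*idx].locked = 0;
	c->slot[*idx].next = NULL;
	return &c->slot[*idx];
}

/* Leave the current nesting level. */
static void node_release(struct cpu_nodes *c)
{
	c->slot[0].count--;
}
```

A task that is spinning holds slot 0. If a softirq interrupts it and also has to queue, the softirq gets slot 1, and so on, so no two nested requests share a node.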

2.3 queued_spin_lock

/**
 * queued_spin_lock - acquire a queued spinlock
 * @lock: Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	/* fast path */
	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	/* slow path */
	queued_spin_lock_slowpath(lock, val);
}

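Fast path:

(1) CPU0 holds the lock

Initially no CPU holds the lock. CPU0 requests it and, since no other CPU is competing, gets it immediately.

queued_spin_lock wraps the atomic_try_cmpxchg_acquire call in likely(), which reflects the expectation that for most spinlocks only one CPU is trying to take the lock.

This matches how spinlocks are meant to be used: in the common case only one CPU is trying to acquire the lock at any given moment.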
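As a user-space illustration of the fast path, the following sketch uses a C11 compare-and-swap in place of the kernel's `atomic_try_cmpxchg_acquire`. The function name `qlock_fast_path` and its `seen` out-parameter are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED_VAL 1u

/*
 * Try to move the lock word from 0 (free, no waiters) to Q_LOCKED_VAL.
 * On failure *seen holds the value that was observed, which is what a
 * slow path would go on to inspect.
 */
static bool qlock_fast_path(_Atomic uint32_t *word, uint32_t *seen)
{
	*seen = 0;
	return atomic_compare_exchange_strong_explicit(word, seen, Q_LOCKED_VAL,
						       memory_order_acquire,
						       memory_order_relaxed);
}
```

The whole uncontended acquisition is a single atomic instruction, which is why the common case stays cheap.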

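Slow path:

(2) CPU1 sets the pending field and spins until CPU0 releases the lock

CPU0 holds the lock and CPU1 tries to take it. CPU1 calls queued_spin_lock_slowpath and enters the slow path. It finds that the pending field is not set, sets it to 1 and spins, becoming first in line for the lock. (When CPU0 releases the lock, CPU1 takes it.)

(3) CPU2 joins the MCS wait queue and spins until both the locked and pending fields of the lock are cleared

CPU0 holds the lock, CPU1 is waiting, and CPU2 also tries to take it. CPU2 calls queued_spin_lock_slowpath, enters the slow path and finds the pending field set to 1, so the queued spinlock falls back on the MCS mechanism for queueing. CPU2 first obtains its own mcs_spinlock node, normally entry 0 of the per-CPU node array, and joins the MCS wait queue.

(4) CPU3 joins the MCS wait queue and spins in its own MCS node until its locked field is set to 1

CPU0 holds the lock, CPU1 and CPU2 are waiting, and CPU3 also tries to take it. CPU3 calls queued_spin_lock_slowpath, enters the slow path and finds that another CPU (CPU2) is already in the MCS wait queue. CPU3 appends its node to the end of the queue by setting the pointer in prev->next, and then spins in its own MCS node until node->locked is set to 1 by its predecessor.

CPU0 holds the lock, CPU1 is first in line, and CPU2 and CPU3 wait in the MCS queue.

For CPU2 and CPU3, and for any further waiters such as CPU4, 5, 6, 7 and 8, see: Linux kernel spinlocks (3): MCS locks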

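2.3.1 Fast path

CPU0 requests the lock

CPU0 calls queued_spin_lock. No other CPU is competing, so the old value of the lock is 0, which means the lock was free at the time of the request. CPU0 acquires it and the lock word is set to _Q_LOCKED_VAL (1), marking the lock as held. At this point there is one owner and no spinner. The resulting state is:

locked_pending.locked = 1

2.3.2 Slow path

CPU0/1 request the lock

Suppose CPU1 tries to take the lock while the owner holds it. CPU1 has to wait, and we call this waiting CPU the "pender". The pender sets pending to 1 to announce its presence and then spins until spinlock->locked becomes 0. (In the original figure, dashed arrows mark the wait relationship and the corresponding code.)

Note that the pending field exists solely for the pender. When the pender goes away, the field returns to 0. The resulting state is:

locked_pending.locked = 1, locked_pending.pending = 1.

CPU0/1/2 request the lock

Here another CPU (CPU2) tries to take the lock in addition to the pender. The new CPU spins on spinlock->{locked, pending}. We call this CPU the successor.

In this case spinlock->tail records the successor's CPU ID, and the idx field records the nesting level at which the successor queued (process context, softirq, hardirq, NMI). That is enough for tail to "point at" the successor, because the CPU number and the index together identify one per-CPU MCS node.

There are now two spinners: the pender (CPU1) waits for spinlock->locked to become 0, while the successor (CPU2) waits for spinlock->{locked, pending} to become 0 together.

CPU2 finds that both locked_pending.locked and locked_pending.pending are 1, so it has to use an additional MCS node.

CPU0/1/2/3 request the lock

In this more complex case yet another CPU (CPU3) tries to take the lock while the successor (CPU2) is waiting. This CPU waits on its own dedicated MCS node and is called the queuer (CPU3).

Each CPU has its own MCS nodes, which contain a member named locked, written mcs->locked. It starts at 0, and the queuer spins until its mcs->locked becomes 1. The queuer therefore does not keep reading the spinlock word that the pender (CPU1) and the successor (CPU2) are already reading, which reduces cache contention and bouncing.

In this state the pender and the successor each wait on their own part of the lock word, while the queuer waits on its dedicated MCS node. Note that spinlock->tail now points at the queuer, in line with the term "tail", which denotes the end of the wait queue.

At this stage the three spinners are named after what they wait on: pender (CPU1), successor (CPU2) and queuer (CPU3). The naming distinguishes their roles and behaviour during synchronization.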

queued_spin_lock_slowpath

/**
 * queued_spin_lock_slowpath - acquire the queued spinlock
 * @lock: Pointer to queued spinlock structure
 * @val: Current value of the queued spinlock 32-bit word
 *
 * (queue tail, pending bit, lock value)
 *
 *              fast     :    slow                                  :    unlock
 *                       :                                          :
 * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
 *                       :       | ^--------.------.             /  :
 *                       :       v           \      \            |  :
 * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
 *                       :       | ^--'              |           |  :
 *                       :       v                   |           |  :
 * uncontended           :    (n,x,y) +--> (n,0,0) --'           |  :
 *   queue               :       | ^--'                          |  :
 *                       :       v                               |  :
 * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
 *   queue               :         ^--'                             :
 */
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	struct mcs_spinlock *prev, *next, *node;
	u32 old, tail;
	int idx;

	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

	if (pv_enabled())
		goto pv_queue;

	if (virt_spin_lock(lock))
		return;

	/*
	 * Wait for in-progress pending->locked hand-overs with a bounded
	 * number of spins so that we guarantee forward progress.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOPS;
		val = atomic_cond_read_relaxed(&lock->val,
					       (VAL != _Q_PENDING_VAL) || !cnt--);
	}

	/*
	 * If we observe any contention; queue.
	 */
	if (val & ~_Q_LOCKED_MASK)
		goto queue;

	/*
	 * trylock || pending
	 *
	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
	 */
	val = queued_fetch_set_pending_acquire(lock);

	/*
	 * If we observe contention, there is a concurrent locker.
	 *
	 * Undo and queue; our setting of PENDING might have made the
	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
	 * on @next to become !NULL.
	 */
	if (unlikely(val & ~_Q_LOCKED_MASK)) {

		/* Undo PENDING if we set it. */
		if (!(val & _Q_PENDING_MASK))
			clear_pending(lock);

		goto queue;
	}

	/*
	 * We're pending, wait for the owner to go away.
	 *
	 * 0,1,1 -> 0,1,0
	 *
	 * this wait loop must be a load-acquire such that we match the
	 * store-release that clears the locked bit and create lock
	 * sequentiality; this is because not all
	 * clear_pending_set_locked() implementations imply full
	 * barriers.
	 */
	if (val & _Q_LOCKED_MASK)
		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));

	/*
	 * take ownership and clear the pending bit.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	clear_pending_set_locked(lock);
	lockevent_inc(lock_pending);
	return;

	/*
	 * End of pending bit optimistic spinning and beginning of MCS
	 * queuing.
	 */
queue:
	lockevent_inc(lock_slowpath);
pv_queue:
	node = this_cpu_ptr(&qnodes[0].mcs);
	idx = node->count++;
	tail = encode_tail(smp_processor_id(), idx);

	/*
	 * 4 nodes are allocated based on the assumption that there will
	 * not be nested NMIs taking spinlocks. That may not be true in
	 * some architectures even though the chance of needing more than
	 * 4 nodes will still be extremely unlikely. When that happens,
	 * we fall back to spinning on the lock directly without using
	 * any MCS node. This is not the most elegant solution, but is
	 * simple enough.
	 */
	if (unlikely(idx >= MAX_NODES)) {
		lockevent_inc(lock_no_node);
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}

	node = grab_mcs_node(node, idx);

	/*
	 * Keep counts of non-zero index values:
	 */
	lockevent_cond_inc(lock_use_node2 + idx - 1, idx);

	/*
	 * Ensure that we increment the head node->count before initialising
	 * the actual node. If the compiler is kind enough to reorder these
	 * stores, then an IRQ could overwrite our assignments.
	 */
	barrier();

	node->locked = 0;
	node->next = NULL;
	pv_init_node(node);

	/*
	 * We touched a (possibly) cold cacheline in the per-cpu queue node;
	 * attempt the trylock once more in the hope someone let go while we
	 * weren't watching.
	 */
	if (queued_spin_trylock(lock))
		goto release;

	/*
	 * Ensure that the initialisation of @node is complete before we
	 * publish the updated tail via xchg_tail() and potentially link
	 * @node into the waitqueue via WRITE_ONCE(prev->next, node) below.
	 */
	smp_wmb();

	/*
	 * Publish the updated tail.
	 * We have already touched the queueing cacheline; don't bother with
	 * pending stuff.
	 *
	 * p,*,* -> n,*,*
	 */
	old = xchg_tail(lock, tail);
	next = NULL;

	/*
	 * if there was a previous node; link it and wait until reaching the
	 * head of the waitqueue.
	 */
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		pv_wait_node(node, prev);
		arch_mcs_spin_lock_contended(&node->locked);

		/*
		 * While waiting for the MCS lock, the next pointer may have
		 * been set by another lock waiter. We optimistically load
		 * the next pointer & prefetch the cacheline for writing
		 * to reduce latency in the upcoming MCS unlock operation.
		 */
		next = READ_ONCE(node->next);
		if (next)
			prefetchw(next);
	}

	/*
	 * we're at the head of the waitqueue, wait for the owner & pending to
	 * go away.
	 *
	 * *,x,y -> *,0,0
	 *
	 * this wait loop must use a load-acquire such that we match the
	 * store-release that clears the locked bit and create lock
	 * sequentiality; this is because the set_locked() function below
	 * does not imply a full barrier.
	 *
	 * The PV pv_wait_head_or_lock function, if active, will acquire
	 * the lock and return a non-zero value. So we have to skip the
	 * atomic_cond_read_acquire() call. As the next PV queue head hasn't
	 * been designated yet, there is no way for the locked value to become
	 * _Q_SLOW_VAL. So both the set_locked() and the
	 * atomic_cmpxchg_relaxed() calls will be safe.
	 *
	 * If PV isn't active, 0 will be returned instead.
	 *
	 */
	if ((val = pv_wait_head_or_lock(lock, node)))
		goto locked;

	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

locked:
	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,*,0 -> *,*,1 : lock, contended
	 *
	 * If the queue head is the only one in the queue (lock value == tail)
	 * and nobody is pending, clear the tail code and grab the lock.
	 * Otherwise, we only need to grab the lock.
	 */

	/*
	 * In the PV case we might already have _Q_LOCKED_VAL set, because
	 * of lock stealing; therefore we must also allow:
	 *
	 * n,0,1 -> 0,0,1
	 *
	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
	 *       above wait condition, therefore any concurrent setting of
	 *       PENDING will make the uncontended transition fail.
	 */
	if ((val & _Q_TAIL_MASK) == tail) {
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release; /* No contention */
	}

	/*
	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
	 * which will then detect the remaining tail and queue behind us
	 * ensuring we'll see a @next.
	 */
	set_locked(lock);

	/*
	 * contended path; wait for next if not observed yet, release.
	 */
	if (!next)
		next = smp_cond_load_relaxed(&node->next, (VAL));

	arch_mcs_spin_unlock_contended(&next->locked);
	pv_kick_node(lock, next);

release:
	/*
	 * release the node
	 */
	__this_cpu_dec(qnodes[0].mcs.count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);

(1) Waiting out a pending handover

if (val == _Q_PENDING_VAL) {
    int cnt = _Q_PENDING_LOOPS;
    val = atomic_cond_read_relaxed(&lock->val,
                     (VAL != _Q_PENDING_VAL) || !cnt--);
}

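Pending handover:

val == _Q_PENDING_VAL means that pending is set while locked and tail are clear, i.e. the pending waiter is in the middle of taking the lock (0,1,0 -> 0,0,1).

atomic_cond_read_relaxed re-reads the lock word until it leaves that state or the bounded spin count (_Q_PENDING_LOOPS) runs out, which guarantees forward progress.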
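The bounded wait can be sketched in user space like this. It is an illustration with C11 atomics; the function name `wait_pending_handover` and its `loops` parameter are hypothetical stand-ins for the kernel's condition and `_Q_PENDING_LOOPS`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define Q_PENDING_VAL (1u << 8)

/*
 * Re-read the lock word while it equals Q_PENDING_VAL, but give up
 * after `loops` extra reads. Returns the last value that was read.
 */
static uint32_t wait_pending_handover(_Atomic uint32_t *word, int loops)
{
	uint32_t val;

	do {
		val = atomic_load_explicit(word, memory_order_relaxed);
	} while (val == Q_PENDING_VAL && loops-- > 0);

	return val;
}
```

Because the loop is bounded, a waiter never depends on the handover completing: if the state does not change in time, it proceeds with the stale value and ends up queueing.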

(2) Checking for contention

if (val & ~_Q_LOCKED_MASK)
    goto queue;

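Contention: if any bit outside the locked byte is set (val & ~_Q_LOCKED_MASK), meaning that a pending waiter or a queue tail already exists, jump to the queueing logic. A lock that is merely held does not take this branch.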
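The test can be tried on a few lock-word values in isolation. The helper name `must_queue` is hypothetical; the mask value follows the layout in which the locked byte is bits 0-7.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED_MASK 0xffu

/* True when pending or tail bits are set, i.e. someone else is waiting. */
static bool must_queue(uint32_t val)
{
	return (val & ~Q_LOCKED_MASK) != 0;
}
```

With this helper, 0x00000001 (held, no waiters) yields false, while 0x00000100 (pending set) and 0x00040001 (tail set, held) both yield true.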

(3) Setting the pending bit

val = queued_fetch_set_pending_acquire(lock);

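Set pending: atomically set the pending bit to announce that the current CPU is waiting for the lock. The call returns the lock word as it was before the bit was set.

(4) Undoing pending and joining the queue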

if (unlikely(val & ~_Q_LOCKED_MASK)) {
    if (!(val & _Q_PENDING_MASK))
        clear_pending(lock);
    goto queue;
}

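Undo pending: if the returned old value shows that another CPU was already pending or queued, the current CPU clears the pending bit, but only if it was the one that set it, and jumps to the queueing logic.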
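Steps (3) and (4) can be combined into one user-space sketch. It uses C11 `atomic_fetch_or` and `atomic_fetch_and` as stand-ins for the kernel helpers; the function name `try_become_pending` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED_MASK   0x000000ffu
#define Q_PENDING_VAL   (1u << 8)
#define Q_PENDING_MASK  0x0000ff00u

/*
 * Returns true if the caller became the pending waiter, false if it
 * has to queue instead. *old receives the value seen before the OR.
 */
static bool try_become_pending(_Atomic uint32_t *word, uint32_t *old)
{
	*old = atomic_fetch_or_explicit(word, Q_PENDING_VAL,
					memory_order_acquire);

	if (*old & ~Q_LOCKED_MASK) {
		/* Someone was already pending or queued. Undo only our own bit. */
		if (!(*old & Q_PENDING_MASK))
			atomic_fetch_and_explicit(word, ~Q_PENDING_VAL,
						  memory_order_relaxed);
		return false;
	}
	return true;
}
```

If the old value already had pending set, the bit belongs to another CPU and must be left alone; if only the tail was set, the bit was ours and is taken back before queueing.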

(5) Waiting for the holder to release the lock

if (val & _Q_LOCKED_MASK)
    atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));

Spin: if the lock is held, spin with acquire semantics until the holder clears the locked byte (_Q_LOCKED_MASK).

(6) Taking the lock and clearing pending

clear_pending_set_locked(lock);

Take the lock: clear the pending bit and set the locked byte in one step (0,1,0 -> 0,0,1). The current CPU now owns the lock.
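One way to perform this transition on the whole 32-bit word is a single atomic add, because adding (locked value minus pending value) flips both fields at once. The sketch below assumes C11 atomics; the name `pending_to_locked` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define Q_LOCKED_VAL   1u
#define Q_PENDING_VAL  (1u << 8)

/*
 * *,1,0 -> *,0,1 in one atomic add. Must only be called while pending
 * is set and locked is clear; the tail bits are left untouched.
 */
static void pending_to_locked(_Atomic uint32_t *word)
{
	/* Unsigned wrap-around makes this equal to adding 1 and subtracting 0x100. */
	atomic_fetch_add_explicit(word, Q_LOCKED_VAL - Q_PENDING_VAL,
				  memory_order_relaxed);
}
```

Starting from 0x00000100 the word becomes 0x00000001, and starting from 0x00040100 it becomes 0x00040001. The relaxed ordering is acceptable in this sketch only because the preceding wait loop is assumed to have used acquire semantics.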

(7) Selecting the local node

node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);

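Get the local node: take the current CPU's first mcs_spinlock node from the preallocated qnodes array. Its count field gives the nesting level idx and is incremented.

Encode the tail: pack the CPU number and the node index into the tail value.

(8) Checking the node index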

if (unlikely(idx >= MAX_NODES)) {
    lockevent_inc(lock_no_node);
    while (!queued_spin_trylock(lock))
        cpu_relax();
    goto release;
}

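Index out of range: if the node index reaches MAX_NODES (4), no node is available, so the CPU spins on the lock itself with queued_spin_trylock.

(9) Initializing the node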

node->locked = 0;
node->next = NULL;
pv_init_node(node);

Initialize the node: set locked to 0 and next to NULL, and initialize the PV-related fields.

(10) Trying the lock again

if (queued_spin_trylock(lock))
    goto release;

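Retry: initializing the node may have touched a cold cache line, so the lock is tried once more before enqueueing in case it was released in the meantime. This avoids unnecessary queue operations.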
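A user-space sketch of such a trylock, assuming C11 atomics, is shown below. The name `qlock_trylock` is hypothetical; the idea is to read the word first and attempt the compare-and-swap only when it is completely free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED_VAL 1u

/* Take the lock only if the word is 0: not held, nobody pending or queued. */
static bool qlock_trylock(_Atomic uint32_t *word)
{
	uint32_t val = atomic_load_explicit(word, memory_order_relaxed);

	if (val)
		return false;  /* skip the CAS when it cannot succeed */

	return atomic_compare_exchange_strong_explicit(word, &val, Q_LOCKED_VAL,
						       memory_order_acquire,
						       memory_order_relaxed);
}
```

The same helper also serves as the fallback when a CPU has run out of queue nodes: it can call it in a loop and spin on the lock word directly.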

(11) Publishing the tail

old = xchg_tail(lock, tail);

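Update the tail: atomically swap the current CPU's encoded value into the tail field. The returned old value tells whether a previous tail existed.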
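When the tail cannot be exchanged as a separate 16-bit unit, the swap can be done with a compare-and-swap loop over the whole word. The following user-space sketch assumes C11 atomics and the layout with the tail in bits 16-31; the name `tail_exchange` is hypothetical, and the acq_rel ordering is a conservative choice for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define Q_TAIL_MASK 0xffff0000u

/*
 * Replace the tail bits with `tail`, keeping locked and pending as they
 * are. Returns the complete old lock word.
 */
static uint32_t tail_exchange(_Atomic uint32_t *word, uint32_t tail)
{
	uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
	uint32_t updated;

	do {
		updated = (old & ~Q_TAIL_MASK) | tail;
		/* On failure `old` is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak_explicit(word, &old, updated,
							memory_order_acq_rel,
							memory_order_relaxed));
	return old;
}
```

If `old & Q_TAIL_MASK` is non-zero, the caller has a predecessor and must link behind it; otherwise it is already at the head of the queue.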

(12) Linking behind the predecessor

if (old & _Q_TAIL_MASK) {
    prev = decode_tail(old);
    WRITE_ONCE(prev->next, node);
    pv_wait_node(node, prev);
    arch_mcs_spin_lock_contended(&node->locked);
}

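Link into the queue: if a predecessor exists, store the current node in the predecessor's next field.

Spin: spin on the locked field of the local node until the predecessor hands over the head of the queue.

(13) Waiting at the head of the queue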

val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

Spin: at the head of the queue, wait until both the locked byte and the pending bit are clear (_Q_LOCKED_PENDING_MASK).

(14) Taking the lock

set_locked(lock);

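Set locked: if the current node is the only one in the queue, a single atomic_try_cmpxchg_relaxed clears the tail and takes the lock. Otherwise set_locked stores _Q_LOCKED_VAL into the locked byte, and the current CPU owns the lock.

(15) Handing over to the successor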

arch_mcs_spin_unlock_contended(&next->locked);
pv_kick_node(lock, next);

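Hand over: set the successor's locked field to 1, which ends its local spin and makes it the new head of the queue. pv_kick_node is the paravirtualization hook for waking the successor.

(16) Releasing the node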

	/*
	 * release the node
	 */
	__this_cpu_dec(qnodes[0].mcs.count);

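Summary

(1) Pending bit: when contention is light, it avoids entering the more complex queueing logic right away.

The CPU sets the pending bit of the lock to show that it is trying to take the lock.

If the lock is released soon, the CPU takes it directly without queueing.

If other contenders are detected (the pending bit or the tail was already set), the CPU undoes its own pending bit if it set one and enters the queueing logic.

(2) Queueing logic (MCS queue):

MCS node:

Each CPU has local MCS nodes used to wait for the lock in the queue.

A node contains a locked field (for local spinning) and a next pointer (to link the following node).

Enqueue:

The CPU appends its MCS node to the end of the queue by updating the tail field of the lock with xchg_tail.

If the queue was not empty, it links its node into the next pointer of the previous node.

Waiting:

The CPU spins on its own MCS node until the previous node sets its locked field.

The local spin is implemented by arch_mcs_spin_lock_contended.

(3) Taking the lock:

When the CPU reaches the head of the queue, it tries to take the lock:

It waits for the locked and pending bits of the lock to be cleared.

It uses an atomic operation (such as atomic_try_cmpxchg_relaxed) to mark the lock as held.

If other waiters remain in the queue, it hands the head of the queue to the next waiter by setting the locked field of the next node.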

2.4 queued_spin_unlock

/**
 * queued_spin_unlock - release a queued spinlock
 * @lock : Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	/*
	 * unlock() needs release semantics:
	 */
	smp_store_release(&lock->locked, 0);
}

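The function stores 0 into lock->locked to mark the lock as released, using a store with release semantics so that all memory operations before it are visible to other CPUs. Once locked is 0, other CPUs can try to take the lock.

Memory ordering:

All writes made before the release are visible to other CPUs.

A CPU that acquires the lock afterwards sees every modification made before the release.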
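The release side can be sketched in user space as follows. The kernel stores to the locked byte alone; portable C11 has no byte-wide store into a 32-bit atomic, so this sketch subtracts the locked value from the whole word instead, which is an assumption of the example rather than the kernel's method. The name `qlock_unlock` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define Q_LOCKED_VAL 1u

/*
 * Clear the locked byte with release semantics. Pending and tail bits
 * are preserved so that waiters keep their place.
 */
static void qlock_unlock(_Atomic uint32_t *word)
{
	/* Release: critical-section writes become visible before the lock looks free. */
	atomic_fetch_sub_explicit(word, Q_LOCKED_VAL, memory_order_release);
}
```

A waiter that observes the cleared locked byte with an acquire load is then guaranteed to see everything written inside the previous critical section.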

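CPU0 releases the lock by setting locked_pending.locked to 0, and CPU1 takes it.

3. Summary

(1)

If only one CPU takes the queued spinlock, it gets the lock through the fast path.

If only two CPUs take the queued spinlock, the pending field of locked_pending is sufficient.

If only 1 or 2 CPUs try to take the lock, a single 4-byte qspinlock is all that is needed, and it occupies the same amount of memory as a ticket spinlock.

(2)

If three or more CPUs take the queued spinlock, extra MCS nodes are needed. The third CPU spins on the lock word until both the pending and locked fields of locked_pending are cleared. The fourth and later CPUs can only spin in their MCS nodes, waiting for the predecessor to set the locked field of their mcs_spinlock node to 1. Only after the predecessor has handed over the head of the queue do they get to spin on the spinlock itself.

With N CPUs (N >= 3) trying to take the lock, one qspinlock plus (N-2) MCS nodes are needed.

For a well-designed spinlock, contention should not be heavy in most cases: the most likely situation is that only 1 CPU is trying to take the lock, followed by 2, with the probability decreasing from there.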

References

https://lwn.net/Articles/590243/
http://www.wowotech.net/kernel_synchronization/queued_spinlock.html
http://www.wowotech.net/kernel_synchronization/460.html
https://zhuanlan.zhihu.com/p/100546935
https://systemsresearch.io/posts/f22352cfc/
https://www.slideshare.net/slideshow/spinlockpdf/254977958