前言
本文基于 Linux 6.8.12 内核源码与 glibc-2.39 用户态库源码,对 pthread_mutex_lock 的完整实现进行逐层剖析。我们将沿着一条清晰的逻辑链条,从 glibc 用户态入口出发,穿越 futex 系统调用边界,深入内核态的线程阻塞与唤醒机制,覆盖普通互斥锁、递归锁、自适应锁、错误检测锁、健壮锁(Robust Mutex)、优先级继承锁(PI Mutex)以及优先级天花板锁(PP Mutex)等全部类型。整条链路涵盖用户态快速路径、内核态慢速路径、哈希队列管理、定时器处理、信号中断恢复等关键环节,力求内容贴合源码、上下文连贯、逻辑链条清晰。
第一章:glibc 用户态入口 --- ___pthread_mutex_lock
1.1 函数入口与类型分发
pthread_mutex_lock 的 glibc 实现入口位于 nptl/pthread_mutex_lock.c 中,实际符号名为 ___pthread_mutex_lock(通过版本控制宏暴露为 pthread_mutex_lock)。该函数的首要任务是根据互斥锁的类型进行快速分发。
int PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
{
unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
LIBC_PROBE (mutex_entry, 1, mutex);
if (__builtin_expect (type & ~(PTHREAD_MUTEX_KIND_MASK_NP
| PTHREAD_MUTEX_ELISION_FLAGS_NP), 0))
return __pthread_mutex_lock_full (mutex);
// ...
}
这里 PTHREAD_MUTEX_TYPE_ELISION 宏从 mutex->__data.__kind 中提取锁的类型信息,同时包含 elision(事务内存优化)标志位。PTHREAD_MUTEX_KIND_MASK_NP 定义了基本类型掩码,PTHREAD_MUTEX_ELISION_FLAGS_NP 定义了 elision 相关的标志位。
这个检查非常关键:如果 type 中出现了不在上述两个掩码覆盖范围内的位,说明这是一个非标准类型(如健壮锁、PI 锁、PP 锁等),需要走完整路径 __pthread_mutex_lock_full。这个分支以 __builtin_expect(..., 0) 标注为不太可能发生,暗示这些类型在常规使用中较少见,编译器会据此优化主路径的指令布局。
1.2 普通互斥锁(PTHREAD_MUTEX_TIMED_NP)快速路径
对于最常见的 PTHREAD_MUTEX_TIMED_NP 类型(即默认的 normal mutex),代码进入最优化路径:
if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_NP))
{
FORCE_ELISION (mutex, goto elision);
simple:
/* Normal mutex. */
LLL_MUTEX_LOCK_OPTIMIZED (mutex);
assert (mutex->__data.__owner == 0);
}
FORCE_ELISION 宏用于支持 Intel TSX(事务同步扩展)硬件事务内存。如果系统支持 elision 且编译时启用了 ENABLE_ELISION_SUPPORT,会尝试用事务内存指令(如 XBEGIN/XEND)来优化锁获取。如果 elision 失败或不可用,则跳转到 simple 标签,执行常规的 LLL_MUTEX_LOCK_OPTIMIZED。
LLL_MUTEX_LOCK_OPTIMIZED 是一个关键的优化宏,其完整定义如下:
static inline void lll_mutex_lock_optimized (pthread_mutex_t *mutex)
{
int private = PTHREAD_MUTEX_PSHARED (mutex);
if (private == LLL_PRIVATE && SINGLE_THREAD_P && mutex->__data.__lock == 0)
mutex->__data.__lock = 1;
else
lll_lock (mutex->__data.__lock, private);
}
这个优化体现了 futex 设计的核心哲学:在无竞争场景下完全避免系统调用。SINGLE_THREAD_P 是一个运行时检查(glibc 在创建第一个额外线程时会设置多线程标志,该宏据此判断),当程序只有一个线程时,可以直接将 __lock 从 0 改为 1,无需任何原子操作或系统调用。这个优化仅对 LLL_PRIVATE(进程私有)互斥锁有效,因为进程共享的互斥锁可能位于共享内存映射中,即使本进程只有单个线程,也需要考虑与其他进程的同步。
如果上述优化条件不满足,则调用 lll_lock,进入标准 futex 路径。
1.3 lll_lock 与底层 futex 交互
lll_lock 定义在 sysdeps/nptl/lowlevellock.h 中,最终会调用到架构相关的 __lll_lock_wait 或 __lll_lock_wait_private 函数。以 x86_64 为例,其核心逻辑是:
- 第一次 CAS 尝试:使用 atomic_compare_and_exchange_val_acq 将 __lock 从 0 原子地改为 1。如果成功,直接返回。
- 竞争检测:如果 CAS 失败,说明锁已被占用。此时将 __lock 设为 2(表示"已锁定且有等待者"),然后调用 futex_wait 系统调用进入内核等待。
- 循环重试:被唤醒后,再次尝试 CAS,直到成功。
这里的状态机设计非常精妙:
- 0:未锁定
- 1:已锁定,无等待者
- 2:已锁定,有等待者(或正在等待)
这种三态设计使得解锁时能够判断是否需要执行 futex_wake 系统调用:如果 __lock 原来是 1,说明没有等待者,直接改为 0 即可;如果是 2,则需要唤醒等待队列中的线程。
1.4 递归互斥锁(PTHREAD_MUTEX_RECURSIVE_NP)
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
== PTHREAD_MUTEX_RECURSIVE_NP, 1))
{
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
/* Check whether we already hold the mutex. */
if (mutex->__data.__owner == id)
{
if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
return EAGAIN; /* Overflow of the counter. */
++mutex->__data.__count;
return 0;
}
/* We have to get the mutex. */
LLL_MUTEX_LOCK_OPTIMIZED (mutex);
assert (mutex->__data.__owner == 0);
mutex->__data.__count = 1;
}
递归锁允许同一线程多次加锁。实现上通过 __owner 字段记录持有者 TID,__count 字段记录递归深度。当线程再次加锁时,先检查 __owner == id,如果匹配则增加计数器。这里有一个溢出检查:__count 是 unsigned int 类型,当值为 0xFFFFFFFF 时再加 1 会溢出为 0,此时返回 EAGAIN。
如果当前线程不是持有者,则走正常的 LLL_MUTEX_LOCK_OPTIMIZED 路径获取锁,成功后设置 __owner = id 和 __count = 1。
1.5 自适应互斥锁(PTHREAD_MUTEX_ADAPTIVE_NP)
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
== PTHREAD_MUTEX_ADAPTIVE_NP, 1))
{
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
{
int cnt = 0;
int max_cnt = MIN (max_adaptive_count (),
mutex->__data.__spins * 2 + 10);
int spin_count, exp_backoff = 1;
unsigned int jitter = get_jitter ();
do
{
spin_count = exp_backoff + (jitter & (exp_backoff - 1));
cnt += spin_count;
if (cnt >= max_cnt)
{
LLL_MUTEX_LOCK (mutex);
break;
}
do
atomic_spin_nop ();
while (--spin_count > 0);
exp_backoff = get_next_backoff (exp_backoff);
}
while (LLL_MUTEX_READ_LOCK (mutex) != 0
|| LLL_MUTEX_TRYLOCK (mutex) != 0);
mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
}
assert (mutex->__data.__owner == 0);
}
自适应锁是 glibc 的一种性能优化策略,它结合了自旋锁 和互斥锁的优点。其核心思想是:如果锁的持有时间很短,自旋等待可能比进入内核睡眠更高效;如果持有时间较长,则应该尽快进入内核等待以避免 CPU 浪费。
实现细节包括:
- 首次尝试:先调用 LLL_MUTEX_TRYLOCK(即 lll_trylock,使用 CAS 尝试获取锁),如果成功直接返回。
- 自适应自旋:如果首次尝试失败,进入自旋循环。max_adaptive_count() 返回系统允许的最大自旋次数(通常与 CPU 核心数相关)。mutex->__data.__spins 记录了历史自旋统计,用于动态调整。
- 指数退避 + 随机抖动:exp_backoff 从 1 开始,每次循环翻倍(通过 get_next_backoff),但上限受 max_cnt 限制。jitter 引入随机性,避免多个线程同步自旋导致的"惊群效应"(thundering herd)。
- 自旋 NOP:atomic_spin_nop() 在 x86_64 上通常是 PAUSE 指令,它告诉 CPU 这是一个自旋等待循环,可以优化流水线并降低功耗。
- 动态学习:每次自旋结束后,更新 __spins 字段:mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8。这是一个指数移动平均,使得自旋策略能够根据历史行为自适应调整。
- 最终回退:如果自旋次数达到上限仍未获取锁,则调用 LLL_MUTEX_LOCK 进入内核等待。
LLL_MUTEX_READ_LOCK 是一个只读检查,用于在自旋期间快速检测锁是否已释放,避免不必要的 CAS 操作。
1.6 错误检测互斥锁(PTHREAD_MUTEX_ERRORCHECK_NP)
else
{
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
assert (PTHREAD_MUTEX_TYPE (mutex) == PTHREAD_MUTEX_ERRORCHECK_NP);
/* Check whether we already hold the mutex. */
if (__glibc_unlikely (mutex->__data.__owner == id))
return EDEADLK;
goto simple;
}
错误检测锁用于调试场景,能够检测死锁。如果当前线程已经持有该锁,再次加锁会返回 EDEADLK 错误,而不是死锁。如果未持有,则跳转到 simple 标签,走普通锁的获取路径。
1.7 所有权记录与探针
所有成功获取锁的路径(除了 elision 路径)最终都会执行:
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
/* Record the ownership. */
mutex->__data.__owner = id;
#ifndef NO_INCR
++mutex->__data.__nusers;
#endif
LIBC_PROBE (mutex_acquired, 1, mutex);
return 0;
__owner 字段记录当前持有者的 TID,用于递归锁检测、错误检测锁的死锁检测、以及健壮锁的恢复。__nusers 是一个使用计数器(如果未定义 NO_INCR 宏)。LIBC_PROBE 是 SystemTap/DTrace 探针,用于动态追踪,不影响正常执行路径。
第二章:完整路径 --- __pthread_mutex_lock_full
当互斥锁类型为健壮锁、PI 锁、PP 锁或不可识别类型时,进入 __pthread_mutex_lock_full 函数。这是整个实现中最复杂的部分,涵盖了多种高级互斥锁语义。
2.1 健壮互斥锁(Robust Mutex)
健壮互斥锁解决了"线程持有锁时异常终止"的问题。普通互斥锁在这种情况下会导致死锁:锁永远被持有,但持有者已不存在。健壮锁通过在线程退出时自动释放其持有的锁,并向下一个加锁者返回 EOWNERDEAD 来解决这个问题。
2.1.1 健壮锁状态与链表管理
case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
&mutex->__data.__list.__next);
__asm ("" ::: "memory");
每个线程有一个 robust_head 结构,维护一个由该线程持有的健壮锁组成的链表。list_op_pending 指向当前正在操作的锁节点,这是一个关键的原子性标记:它必须在实际修改锁状态之前设置,确保如果在操作过程中线程崩溃,内核清理代码能够识别出这个未完成的操作。
__asm ("" ::: "memory") 是一个编译器屏障(compiler barrier),防止编译器重排内存操作,确保 list_op_pending 的写入在后续锁操作之前完成。
2.1.2 锁获取循环与 CAS 操作
oldval = mutex->__data.__lock;
unsigned int assume_other_futex_waiters = LLL_ROBUST_MUTEX_LOCK_MODIFIER;
while (1)
{
/* Try to acquire the lock through a CAS from 0 to our TID | waiters. */
if (__glibc_likely (oldval == 0))
{
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
id | assume_other_futex_waiters, 0);
if (__glibc_likely (oldval == 0))
break;
}
健壮锁的 __lock 字段编码了更多信息:
- 低位(FUTEX_TID_MASK):持有者 TID
- FUTEX_WAITERS 位:是否有等待者
- FUTEX_OWNER_DIED 位:原持有者是否已死亡
CAS 操作尝试将 __lock 从 0 改为 id | assume_other_futex_waiters。assume_other_futex_waiters 初始值为 LLL_ROBUST_MUTEX_LOCK_MODIFIER(即 0),但在后续循环中可能被设置为 FUTEX_WAITERS。
2.1.3 处理原持有者死亡(FUTEX_OWNER_DIED)
if ((oldval & FUTEX_OWNER_DIED) != 0)
{
int newval = id;
#ifdef NO_INCR
newval |= FUTEX_WAITERS;
#else
newval |= (oldval & FUTEX_WAITERS) | assume_other_futex_waiters;
#endif
newval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
newval, oldval);
if (newval != oldval)
{
oldval = newval;
continue;
}
/* We got the mutex. */
mutex->__data.__count = 1;
mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;
__asm ("" ::: "memory");
ENQUEUE_MUTEX (mutex);
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
#ifdef NO_INCR
--mutex->__data.__nusers;
#endif
return EOWNERDEAD;
}
当检测到 FUTEX_OWNER_DIED 位时,说明原持有该锁的线程已经异常终止。此时:
- 尝试 CAS 获取锁,同时保留 FUTEX_WAITERS 位(如果有等待者的话)。
- 如果 CAS 成功,设置 __owner = PTHREAD_MUTEX_INCONSISTENT,表示锁处于不一致状态(因为原持有者可能在临界区内死亡,数据结构可能已损坏)。
- 将锁加入当前线程的健壮锁链表(ENQUEUE_MUTEX)。
- 清除 list_op_pending,标记操作完成。
- 返回 EOWNERDEAD 给调用者,调用者需要决定如何恢复(通常需要调用 pthread_mutex_consistent 来标记锁状态为一致)。
注意这里的 NO_INCR 处理:在特殊场景下(如条件变量内部使用的互斥锁),不需要增加 __nusers,反而需要递减,因为原持有者不应被计数。
2.1.4 死锁检测与递归处理
if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
{
int kind = PTHREAD_MUTEX_TYPE (mutex);
if (kind == PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP)
{
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
return EDEADLK;
}
if (kind == PTHREAD_MUTEX_ROBUST_RECURSIVE_NP)
{
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
return EAGAIN;
++mutex->__data.__count;
return 0;
}
}
健壮锁也支持错误检测和递归语义。如果当前线程已经持有该锁:
- 错误检测类型:返回 EDEADLK
- 递归类型:增加计数器
2.1.5 Futex 等待与唤醒
if ((oldval & FUTEX_WAITERS) == 0)
{
int val = atomic_compare_and_exchange_val_acq
(&mutex->__data.__lock, oldval | FUTEX_WAITERS, oldval);
if (val != oldval)
{
oldval = val;
continue;
}
oldval |= FUTEX_WAITERS;
}
assume_other_futex_waiters |= FUTEX_WAITERS;
futex_wait ((unsigned int *) &mutex->__data.__lock, oldval,
PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
oldval = mutex->__data.__lock;
如果无法获取锁,需要进入内核等待。关键步骤:
- 设置 FUTEX_WAITERS:如果当前没有等待者标志,先通过 CAS 设置它。这样解锁线程就知道需要唤醒等待者。
- 更新 assume_other_futex_waiters:一旦设置了 FUTEX_WAITERS,后续循环中必须保留它,避免"丢失唤醒"(lost wake-up)问题。
- futex_wait:调用 futex 系统调用,传入当前 __lock 值作为期望值。内核会比较实际值与期望值,如果不同则立即返回(EAGAIN),避免无效等待。
2.1.6 不可恢复锁检测
if (__builtin_expect (mutex->__data.__owner
== PTHREAD_MUTEX_NOTRECOVERABLE, 0))
{
mutex->__data.__count = 0;
int private = PTHREAD_ROBUST_MUTEX_PSHARED (mutex);
lll_unlock (mutex->__data.__lock, private);
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
return ENOTRECOVERABLE;
}
如果锁被标记为 NOTRECOVERABLE(不可恢复),说明之前已经有线程收到 EOWNERDEAD 但没有调用 pthread_mutex_consistent 来恢复。此时释放锁并返回 ENOTRECOVERABLE。
2.2 优先级继承互斥锁(PI Mutex)
优先级继承(Priority Inheritance)是解决优先级反转问题的经典方案。当高优先级线程等待低优先级线程持有的锁时,低优先级线程的优先级被临时提升到高优先级线程的级别,从而避免中等优先级线程抢占 CPU 导致高优先级线程无限期等待。
#ifdef __NR_futex
case PTHREAD_MUTEX_PI_RECURSIVE_NP:
case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
case PTHREAD_MUTEX_PI_NORMAL_NP:
case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
case PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP:
case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
case PTHREAD_MUTEX_PI_ROBUST_ADAPTIVE_NP:
PI 锁的实现完全依赖内核的 futex_lock_pi 和 futex_unlock_pi 系统调用,用户态只做少量封装。
2.2.1 PI 锁的 kind 与 robust 分离
{
int mutex_kind = atomic_load_relaxed (&(mutex->__data.__kind));
kind = mutex_kind & PTHREAD_MUTEX_KIND_MASK_NP;
robust = mutex_kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
}
PI 锁的类型编码在 __kind 中,需要分离出基本类型(kind)和健壮标志(robust)。这里使用 atomic_load_relaxed 是因为 __kind 在初始化后通常不变,但在某些并发场景下需要保证读取的原子性。
2.2.2 健壮 PI 锁的 op_pending 设置
if (robust)
{
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
(void *) (((uintptr_t) &mutex->__data.__list.__next) | 1));
__asm ("" ::: "memory");
}
健壮 PI 锁的 list_op_pending 设置与普通健壮锁不同:地址的最低位被置 1,这是内核区分 PI 锁和非 PI 锁健壮操作的标记。
2.2.3 首次 CAS 与内核接管
oldval = mutex->__data.__lock;
if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
{
// 死锁检测和递归处理(同普通健壮锁)
}
int newval = id;
#ifdef NO_INCR
newval |= FUTEX_WAITERS;
#endif
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
newval, 0);
if (oldval != 0)
{
int private = (robust
? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
: PTHREAD_MUTEX_PSHARED (mutex));
int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* unused */,
NULL, private);
// ...
oldval = mutex->__data.__lock;
}
PI 锁的首次 CAS 尝试将 __lock 从 0 改为 id(或 id | FUTEX_WAITERS)。如果失败(oldval != 0),则调用 __futex_lock_pi64,将控制权完全交给内核。
__futex_lock_pi64 是 glibc 对内核 futex_lock_pi 系统调用的封装。内核会:
- 检查锁的当前状态
- 如果已被持有,执行优先级继承逻辑
- 将当前线程加入等待队列
- 必要时提升持有者的优先级
- 当锁释放时,按照优先级顺序唤醒等待者
2.2.4 ESRCH 与 EDEADLK 处理
if (e == ESRCH || e == EDEADLK)
{
assert (e != EDEADLK
|| (kind != PTHREAD_MUTEX_ERRORCHECK_NP
&& kind != PTHREAD_MUTEX_RECURSIVE_NP));
assert (e != ESRCH || !robust);
/* Delay the thread indefinitely. */
while (1)
__futex_abstimed_wait64 (&(unsigned int){0}, 0,
0 /* ignored */, NULL, private);
}
ESRCH 在非健壮 PI 锁中表示锁的持有者已死亡。EDEADLK 表示检测到死锁。对于这些错误,代码进入一个无限等待循环,实际上是将线程永久阻塞。这是一种保守的处理策略,避免在无法安全恢复的情况下继续执行。
2.2.5 持有者死亡与不可恢复处理
if (__glibc_unlikely (oldval & FUTEX_OWNER_DIED))
{
atomic_fetch_and_acquire (&mutex->__data.__lock, ~FUTEX_OWNER_DIED);
mutex->__data.__count = 1;
mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;
__asm ("" ::: "memory");
ENQUEUE_MUTEX_PI (mutex);
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
#ifdef NO_INCR
--mutex->__data.__nusers;
#endif
return EOWNERDEAD;
}
if (robust
&& __builtin_expect (mutex->__data.__owner
== PTHREAD_MUTEX_NOTRECOVERABLE, 0))
{
mutex->__data.__count = 0;
futex_unlock_pi ((unsigned int *) &mutex->__data.__lock,
PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
return ENOTRECOVERABLE;
}
PI 锁的持有者死亡和不可恢复处理逻辑与普通健壮锁类似,但使用 ENQUEUE_MUTEX_PI 将锁加入 PI 特定的健壮链表,并使用 futex_unlock_pi 释放锁。
2.3 优先级天花板互斥锁(PP Mutex)
优先级天花板(Priority Protect/Protection)是另一种解决优先级反转的方案。它为每个锁分配一个优先级天花板值,任何获取该锁的线程的优先级都会被提升到天花板级别。
case PTHREAD_MUTEX_PP_RECURSIVE_NP:
case PTHREAD_MUTEX_PP_ERRORCHECK_NP:
case PTHREAD_MUTEX_PP_NORMAL_NP:
case PTHREAD_MUTEX_PP_ADAPTIVE_NP:
2.3.1 优先级检查与提升
int kind = atomic_load_relaxed (&(mutex->__data.__kind))
& PTHREAD_MUTEX_KIND_MASK_NP;
oldval = mutex->__data.__lock;
if (mutex->__data.__owner == id)
{
if (kind == PTHREAD_MUTEX_ERRORCHECK_NP)
return EDEADLK;
if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
{
if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
return EAGAIN;
++mutex->__data.__count;
return 0;
}
}
PP 锁的死锁检测和递归处理与前面类似。
2.3.2 优先级天花板协议实现
int oldprio = -1, ceilval;
do
{
int ceiling = (oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK)
>> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
if (__pthread_current_priority () > ceiling)
{
if (oldprio != -1)
__pthread_tpp_change_priority (oldprio, -1);
return EINVAL;
}
int retval = __pthread_tpp_change_priority (oldprio, ceiling);
if (retval)
return retval;
ceilval = ceiling << PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
oldprio = ceiling;
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
#ifdef NO_INCR
ceilval | 2,
#else
ceilval | 1,
#endif
ceilval);
// ...
} while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
PP 锁的实现关键点:
- 提取天花板优先级:从 __lock 的高位提取 ceiling 值。
- 当前优先级检查:如果当前线程的优先级高于天花板,返回 EINVAL(POSIX 要求)。
- 优先级提升:调用 __pthread_tpp_change_priority 将线程优先级提升到天花板级别。如果之前已经提升过(oldprio != -1),先恢复旧优先级。
- CAS 获取锁:尝试将 __lock 从 ceilval 改为 ceilval | 1(或 ceilval | 2)。ceilval 是天花板优先级左移后的值。
- 等待循环:如果 CAS 失败,说明有其他线程正在竞争。此时进入 futex 等待循环,等待锁释放。
2.3.3 Futex 等待
do
{
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
ceilval | 2,
ceilval | 1);
if ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval)
break;
if (oldval != ceilval)
futex_wait ((unsigned int *) &mutex->__data.__lock,
ceilval | 2,
PTHREAD_MUTEX_PSHARED (mutex));
}
while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
ceilval | 2, ceilval)
!= ceilval);
PP 锁的等待逻辑:
- 尝试将 __lock 从 ceilval | 1(已锁定无等待者)改为 ceilval | 2(已锁定有等待者)。
- 如果天花板值发生变化(说明锁被重新初始化),跳出循环重试。
- 如果锁仍被持有,调用 futex_wait 等待。
- 被唤醒后,再次尝试 CAS 获取锁。
第三章:从 glibc 到内核 --- futex 系统调用
3.1 futex_wait 的内核入口
当 glibc 的 lll_futex_wait 宏展开后,最终通过 syscall 指令进入内核。在 x86_64 上,这对应 SYS_futex 系统调用。
SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
const struct __kernel_timespec __user *, utime,
u32 __user *, uaddr2, u32, val3)
{
int ret, cmd = op & FUTEX_CMD_MASK;
ktime_t t, *tp = NULL;
struct timespec64 ts;
if (utime && futex_cmd_has_timeout(cmd)) {
if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
return -EFAULT;
if (get_timespec64(&ts, utime))
return -EFAULT;
ret = futex_init_timeout(cmd, op, &ts, &t);
if (ret)
return ret;
tp = &t;
}
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
SYSCALL_DEFINE6 是 Linux 内核定义 6 参数系统调用的宏。futex 系统调用的参数包括:
- uaddr:用户空间地址(futex 字)
- op:操作码 + 标志位
- val:期望值(对于 WAIT)或唤醒数量(对于 WAKE)
- utime:超时时间(可选)
- uaddr2:第二个地址(用于 REQUEUE 等操作)
- val3:第三个值
op 参数的低 8 位是命令(FUTEX_CMD_MASK),高位是标志位(如 FUTEX_PRIVATE_FLAG 表示进程私有)。
3.2 do_futex --- 命令分发中心
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
unsigned int flags = futex_to_flags(op);
int cmd = op & FUTEX_CMD_MASK;
pr_debug("do_futex: pid=%d, tid=%d, cmd=%d, uaddr=%p, val=%u, flags=0x%x\n",
current->tgid, current->pid, cmd, uaddr, val, flags);
if (flags & FLAGS_CLOCKRT) {
if (cmd != FUTEX_WAIT_BITSET &&
cmd != FUTEX_WAIT_REQUEUE_PI &&
cmd != FUTEX_LOCK_PI2)
return -ENOSYS;
}
switch (cmd) {
case FUTEX_WAIT:
val3 = FUTEX_BITSET_MATCH_ANY;
fallthrough;
case FUTEX_WAIT_BITSET:
return futex_wait(uaddr, flags, val, timeout, val3);
case FUTEX_WAKE:
val3 = FUTEX_BITSET_MATCH_ANY;
fallthrough;
case FUTEX_WAKE_BITSET:
return futex_wake(uaddr, flags, val, val3);
case FUTEX_REQUEUE:
return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, NULL, 0);
case FUTEX_CMP_REQUEUE:
return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 0);
case FUTEX_WAKE_OP:
return futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
case FUTEX_LOCK_PI:
flags |= FLAGS_CLOCKRT;
fallthrough;
case FUTEX_LOCK_PI2:
return futex_lock_pi(uaddr, flags, timeout, 0);
case FUTEX_UNLOCK_PI:
return futex_unlock_pi(uaddr, flags);
case FUTEX_TRYLOCK_PI:
return futex_lock_pi(uaddr, flags, NULL, 1);
case FUTEX_WAIT_REQUEUE_PI:
val3 = FUTEX_BITSET_MATCH_ANY;
return futex_wait_requeue_pi(uaddr, flags, val, timeout, val3, uaddr2);
case FUTEX_CMP_REQUEUE_PI:
return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 1);
}
return -ENOSYS;
}
do_futex 是 futex 子系统的核心分发函数。它支持的操作包括:
| 命令 | 功能 |
|---|---|
| FUTEX_WAIT / FUTEX_WAIT_BITSET | 等待 futex 字等于期望值 |
| FUTEX_WAKE / FUTEX_WAKE_BITSET | 唤醒等待者 |
| FUTEX_REQUEUE / FUTEX_CMP_REQUEUE | 将等待者从一个 futex 迁移到另一个 |
| FUTEX_WAKE_OP | 条件唤醒(用于条件变量优化) |
| FUTEX_LOCK_PI / FUTEX_LOCK_PI2 | 优先级继承锁获取 |
| FUTEX_UNLOCK_PI | 优先级继承锁释放 |
| FUTEX_TRYLOCK_PI | 非阻塞 PI 锁获取 |
| FUTEX_WAIT_REQUEUE_PI | PI 条件变量的等待与重排队 |
| FUTEX_CMP_REQUEUE_PI | 带比较的 PI 重排队 |
3.3 futex_wait --- 内核等待入口
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
futex_wait 处理带超时的等待场景:
- 定时器设置:futex_setup_timer 根据 abs_time 和 flags 设置高精度定时器(hrtimer)。current->timer_slack_ns 是内核的定时器松弛值,用于电源管理优化。
- 核心等待:调用 __futex_wait 执行实际的等待逻辑。
- 定时器清理:如果设置了超时,取消并销毁定时器。
- 系统调用重启:如果返回 -ERESTARTSYS(被信号中断),设置重启块(restart block),使得信号处理完成后可以自动重启系统调用。这是 Linux 内核系统调用重启机制的一部分。
3.4 __futex_wait --- 核心等待逻辑
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
futex_wait_queue(hb, &q, to);
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
__futex_wait 是 futex 等待的核心,其逻辑流程如下:
- 初始化 futex_q:futex_q 是表示一个 futex 等待请求的结构体,包含等待者的任务结构、key、plist 节点等。
- bitset 检查:bitset 用于支持条件唤醒(FUTEX_WAKE_BITSET),0 是非法值。
- 等待设置(futex_wait_setup):这是最关键的一步,它确保在等待之前 futex 字的值仍然是期望值,防止"丢失唤醒"。
- 入队等待(futex_wait_queue):将当前任务加入哈希桶的等待队列,并进入睡眠。
- 检查唤醒原因:
  - 如果 futex_unqueue 返回 false,说明被正常唤醒(已被从队列中移除),返回 0。
  - 如果超时且 to->task 为 NULL,说明定时器已到期,返回 -ETIMEDOUT。
  - 如果没有信号 pending,说明是虚假唤醒(spurious wakeup),跳转到 retry 重试。
  - 如果有信号 pending,返回 -ERESTARTSYS。
3.5 futex_wait_setup --- 防止丢失唤醒的关键
static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
struct futex_q *q, struct futex_hash_bucket **hb)
{
u32 uval;
int ret;
retry:
ret = get_futex_value_locked(&uval, uaddr);
if (ret)
return ret;
if (uval != val)
return -EWOULDBLOCK;
q->key = FUTEX_KEY_INIT;
ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, FUTEX_READ);
if (unlikely(ret != 0))
return ret;
*hb = hash_futex(&q->key);
spin_lock(&(*hb)->lock);
ret = get_futex_value_locked(&uval, uaddr);
if (ret) {
spin_unlock(&(*hb)->lock);
return ret;
}
if (uval != val) {
spin_unlock(&(*hb)->lock);
return -EWOULDBLOCK;
}
__futex_queue(q, *hb);
return 0;
}
futex_wait_setup 是 futex 机制正确性的核心保障,它解决了经典的"检查-等待"竞态条件:
- 首次读取 futex 值:get_futex_value_locked 安全地从用户空间读取 uaddr 处的值。
- 值检查:如果值已不等于 val,说明在 glibc 设置值和进入内核之间,锁已被释放并重新获取,立即返回 -EWOULDBLOCK(glibc 会重试)。
- 计算 key:get_futex_key 根据用户地址计算一个全局唯一的 futex_key。对于私有 futex,key 包含 mm_struct 指针和页内偏移;对于共享 futex,key 包含 inode 序列号和页偏移。
- 哈希定位:hash_futex 根据 key 计算哈希值,定位到全局 futex_queues 数组中的某个桶。
- 加锁并二次检查:获取哈希桶的 spinlock 后,再次读取 futex 值。这是关键:在获取 spinlock 之前,其他 CPU 可能已修改了 futex 值。
- 二次值检查:如果值再次变化,释放 spinlock 并返回 -EWOULDBLOCK。
- 入队:通过 __futex_queue 将 futex_q 加入哈希桶的 plist(优先级链表)。
这种"双重检查"模式确保了:如果 futex 值在任何时候发生了变化,等待者不会错误地进入睡眠,从而避免丢失唤醒。
3.6 futex_wait_queue --- 进入睡眠
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
if (likely(!plist_node_empty(&q->list))) {
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}
futex_wait_queue 执行实际的睡眠操作:
- 设置任务状态:set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE) 将当前任务设置为可中断睡眠状态,同时允许在系统冻结(suspend/hibernate)时被暂停。这里使用 set_current_state 而不是直接赋值,因为它包含内存屏障,确保状态设置在入队之前对其他 CPU 可见。
- 入队:futex_queue 将 q 加入 hb 的 plist。plist 是按优先级排序的链表,支持优先级继承。
- 启动定时器:如果设置了超时,启动 hrtimer。
- 检查并调度:如果 q->list 不为空(说明仍在队列中),且没有超时或定时器未到期,调用 schedule() 主动放弃 CPU。
- 恢复运行状态:被唤醒后,设置任务状态为 TASK_RUNNING。
这里有一个重要的竞态处理:在 set_current_state 和 schedule() 之间,如果另一个 CPU 尝试唤醒当前任务,唤醒操作会设置任务状态并尝试将任务从队列中移除。由于 set_current_state 的内存屏障和 futex_queue 中 spin_unlock 的内存屏障,这种竞态是安全的。
3.7 Futex 哈希表与 Key 机制
futex 的核心数据结构是全局哈希表 futex_queues,其大小在启动时根据系统的 CPU 数量动态计算(并向上取整为 2 的幂)。每个哈希桶包含一个 spinlock 和一个 plist(优先级排序链表)。
futex_key 的设计非常精巧:
union futex_key {
struct {
u64 i_seq;
unsigned long pgoff;
unsigned int offset;
} shared; /* 用于进程共享 futex */
struct {
union {
struct mm_struct *mm;
u64 __tmp;
};
unsigned long address;
unsigned int offset;
} private; /* 用于进程私有 futex */
struct {
u64 ptr;
unsigned long word;
unsigned int offset;
} both; /* 通用访问 */
};
对于进程私有 futex (FUTEX_PRIVATE_FLAG),key 由 mm_struct 指针和虚拟地址组成。由于每个进程有独立的地址空间,不同进程的相同虚拟地址不会产生冲突。
对于进程共享 futex,key 由文件的 inode 序列号和页内偏移组成。这要求共享 futex 必须位于文件映射(file-backed mapping)或共享匿名映射(shared anonymous mapping)中。
get_futex_key 的实现涉及页表遍历、VMA(虚拟内存区域)查找等复杂逻辑,需要处理各种边界情况(如页未映射、文件映射、特殊映射等)。
第四章:系统调用层 --- 从 glibc 到内核的桥梁
4.1 lll_futex_syscall 宏展开
在 glibc 中,lll_futex_wait 和 lll_futex_wake 等宏最终展开为 lll_futex_syscall,后者展开为内联汇编系统调用:
#define lll_futex_syscall(nargs, futexp, op, ...) \
({ \
long int __ret = INTERNAL_SYSCALL (futex, nargs, futexp, op, \
__VA_ARGS__); \
(__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (__ret)) \
? -INTERNAL_SYSCALL_ERRNO (__ret) : 0); \
})
INTERNAL_SYSCALL 进一步展开为架构特定的系统调用指令。在 x86_64 上:
#define internal_syscall4(number, arg1, arg2, arg3, arg4) \
({ \
unsigned long int resultvar; \
TYPEFY (arg4, __arg4) = ARGIFY (arg4); \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg4, _a4) asm ("r10") = __arg4; \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
这里展示了 x86_64 Linux 系统调用的标准 ABI:
- 系统调用号放在 rax 寄存器(通过 "0" (number) 约束,复用输出操作数)
- 参数 1-6 分别放在 rdi, rsi, rdx, r10, r8, r9
- syscall 指令触发系统调用
- memory 和 REGISTERS_CLOBBERED_BY_SYSCALL 作为 clobber 列表,告诉编译器这些资源被修改
4.2 futex_wait 的返回值处理
在 glibc 的 futex_wait 包装函数中:
static __always_inline int
futex_wait (unsigned int *futex_word, unsigned int expected, int private)
{
int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
switch (err)
{
case 0:
case -EAGAIN:
case -EINTR:
return -err;
case -ETIMEDOUT:
case -EFAULT:
case -EINVAL:
case -ENOSYS:
default:
futex_fatal_error ();
}
}
返回值处理:
- 0:正常唤醒
- -EAGAIN:futex 值不等于期望值(在 futex_wait_setup 的二次检查中发现)
- -EINTR:被信号中断
- -ETIMEDOUT:超时(但这里传入的 timeout 为 NULL,不会发生)
- -EFAULT / -EINVAL / -ENOSYS:严重错误,调用 futex_fatal_error() 终止程序
第五章:完整逻辑链条总结
5.1 无竞争快速路径
pthread_mutex_lock()
└── type == PTHREAD_MUTEX_TIMED_NP
└── LLL_MUTEX_LOCK_OPTIMIZED()
├── SINGLE_THREAD_P && private == LLL_PRIVATE && __lock == 0
│ └── __lock = 1 [无原子操作,无系统调用]
└── else
└── lll_lock()
└── atomic CAS: 0 -> 1
└── 成功,返回
这条路径在无竞争、单线程或首次尝试时,完全不涉及系统调用,仅需要一次原子 CAS 操作,性能极高。
5.2 有竞争慢速路径
pthread_mutex_lock()
└── type == PTHREAD_MUTEX_TIMED_NP
└── LLL_MUTEX_LOCK_OPTIMIZED()
└── lll_lock()
└── CAS 失败
└── __lll_lock_wait()
├── atomic_exchange: __lock -> 2
│ └── 如果原值 == 0,成功获取,返回
└── 否则
└── futex_wait(__lock, 2, private)
└── syscall(SYS_futex, FUTEX_WAIT, ...)
└── 内核: do_futex()
└── futex_wait()
├── futex_wait_setup()
│ ├── 读取 uaddr
│ ├── 值检查
│ ├── get_futex_key()
│ ├── hash_futex()
│ ├── spin_lock(hb)
│ ├── 二次读取 + 检查
│ └── __futex_queue()
├── futex_wait_queue()
│ ├── set_current_state(INTERRUPTIBLE)
│ ├── futex_queue()
│ ├── hrtimer_start() [如果有超时]
│ └── schedule()
└── 被唤醒后
├── futex_unqueue()
├── 检查超时
├── 检查信号
└── 返回用户态
这条路径涉及完整的系统调用、内核哈希表操作、任务状态切换和调度,开销较大,但只在真正有竞争时才会触发。
5.3 健壮锁路径
pthread_mutex_lock()
└── type == ROBUST_*
└── __pthread_mutex_lock_full()
├── 设置 list_op_pending
├── CAS 循环
│ ├── oldval == 0: 获取成功
│ ├── FUTEX_OWNER_DIED: 处理死亡,返回 EOWNERDEAD
│ ├── 已持有: 死锁检测 / 递归处理
│ └── 否则: 设置 FUTEX_WAITERS, futex_wait()
├── 检查 NOTRECOVERABLE
├── ENQUEUE_MUTEX()
└── 清除 list_op_pending
5.4 PI 锁路径
pthread_mutex_lock()
└── type == PI_*
└── __pthread_mutex_lock_full()
├── 分离 kind 和 robust
├── 设置 list_op_pending [如果 robust]
├── CAS: 0 -> id
│ └── 失败: __futex_lock_pi64()
│ └── syscall(SYS_futex, FUTEX_LOCK_PI, ...)
│ └── 内核: futex_lock_pi()
│ ├── 优先级继承链处理
│ ├── 入 PI 等待队列
│ └── 调度等待
├── 检查 FUTEX_OWNER_DIED
├── 检查 NOTRECOVERABLE
└── ENQUEUE_MUTEX_PI() [如果 robust]
第六章:关键设计思想与优化策略
6.1 Futex 的核心哲学
Futex(Fast Userspace Mutex)的设计核心在于 "快速路径在用户态,慢速路径在内核态" 的分层策略:
- 无竞争时:完全在用户态通过原子操作完成,无需陷入内核,开销与自旋锁相当。
- 有竞争时:通过系统调用进入内核,由内核管理等待队列和调度,避免 CPU 忙等浪费。
这种分层使得 pthread_mutex_lock 在大多数实际场景下(锁竞争不激烈)具有极高的性能,同时在竞争激烈时仍能正确、高效地工作。
6.2 三态锁状态机
glibc 的 futex 实现使用三态状态机:
| 状态 | 值 | 含义 |
|---|---|---|
| 未锁定 | 0 | 锁可用 |
| 已锁定,无等待者 | 1 | 锁被持有,无需唤醒 |
| 已锁定,有等待者 | 2 | 锁被持有,释放时需要唤醒 |
这种设计使得解锁操作能够判断是否需要系统调用:只有当状态为 2 时才需要 futex_wake。
6.3 自适应自旋策略
自适应锁通过历史统计动态调整自旋策略:
- 短临界区:自旋等待通常比进入内核更高效
- 长临界区:应尽快进入内核睡眠,避免 CPU 浪费
mutex->__data.__spins 的指数移动平均更新使得锁能够"学习"临界区的典型长度,逐步优化自旋策略。
6.4 防止丢失唤醒
futex_wait_setup 的双重检查机制是防止丢失唤醒的关键:
- 用户态在调用 futex_wait 前设置 futex 值(如改为 2)
- 内核态在 futex_wait_setup 中再次检查值是否仍为预期值
- 如果两者之间值已变化(如锁被释放),立即返回 EAGAIN,用户态重试
这种设计确保了即使在高并发下,唤醒操作也不会被遗漏。
6.5 系统调用重启
当 futex 等待被信号中断时,内核返回 -ERESTARTSYS。glibc 的 futex_wait 函数设置 restart block,使得信号处理完成后,系统调用可以自动重启。这对应用透明,确保了 pthread_mutex_lock 的语义正确性(除非被信号明确中断)。
第七章:源码中的调试与追踪
7.1 pr_debug 调试输出
在 Linux 6.8.12 内核的 futex 实现中,可以看到多处 pr_debug 调用:
pr_debug("do_futex: pid=%d, tid=%d, cmd=%d, uaddr=%p, val=%u, flags=0x%x\n",
current->tgid, current->pid, cmd, uaddr, val, flags);
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
这些调试输出需要在编译内核时启用 CONFIG_DYNAMIC_DEBUG,并通过 /sys/kernel/debug/dynamic_debug/control 打开对应的调试点才能看到。它们对于理解 futex 的调用流程和排查问题非常有价值。
7.2 SystemTap/DTrace 探针
glibc 中使用了 LIBC_PROBE 宏:
LIBC_PROBE (mutex_entry, 1, mutex);
LIBC_PROBE (mutex_acquired, 1, mutex);
这些探针可以在不重新编译的情况下,通过 SystemTap 或 DTrace 动态插入探针,监控互斥锁的获取和释放行为,对于性能分析和故障排查非常有用。
第八章:版本演进与兼容性
8.1 glibc 版本控制
#if PTHREAD_MUTEX_VERSIONS
libc_hidden_ver (___pthread_mutex_lock, __pthread_mutex_lock)
# ifndef SHARED
strong_alias (___pthread_mutex_lock, __pthread_mutex_lock)
# endif
versioned_symbol (libpthread, ___pthread_mutex_lock, pthread_mutex_lock,
GLIBC_2_0);
glibc 使用符号版本控制(symbol versioning)来保持 ABI 兼容性。___pthread_mutex_lock 是内部实现,__pthread_mutex_lock 是兼容符号,pthread_mutex_lock 是对外暴露的符号,版本为 GLIBC_2_0。这允许 glibc 在更新实现的同时,保持与旧二进制文件的兼容性。
8.2 与历史版本的差异
相比早期版本(如 glibc 2.34 之前),glibc-2.39 的主要改进包括:
- 更完善的 elision 支持
- 改进的自适应锁算法
- 更好的健壮锁和 PI 锁集成
- 对新型 CPU 指令集(如 ARM LSE、x86 TSX)的优化
结语
从 glibc-2.39 的 ___pthread_mutex_lock 到 Linux 6.8.12 内核的 futex_wait,pthread_mutex_lock 的实现展现了一个高度优化、层次分明的同步原语设计。整个链路从用户态的原子操作快速路径,到内核态的哈希队列和调度管理,涵盖了无竞争优化、自适应自旋、优先级继承、健壮性恢复、超时处理、信号中断恢复等多种复杂场景。
理解这条完整的逻辑链条,不仅有助于编写高效、正确的多线程程序,也为深入理解操作系统内核的同步机制、内存模型和调度原理提供了绝佳的切入点。futex 机制作为 Linux 特有的创新,其"用户态快速路径 + 内核态慢速路径"的分层思想,至今仍是操作系统同步原语设计的典范。
源码
# define LLL_MUTEX_LOCK(mutex) \
lll_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex))
# define LLL_MUTEX_LOCK_OPTIMIZED(mutex) lll_mutex_lock_optimized (mutex)
# define LLL_MUTEX_TRYLOCK(mutex) \
lll_trylock ((mutex)->__data.__lock)
# define LLL_ROBUST_MUTEX_LOCK_MODIFIER 0
# define LLL_MUTEX_LOCK_ELISION(mutex) \
lll_lock_elision ((mutex)->__data.__lock, (mutex)->__data.__elision, \
PTHREAD_MUTEX_PSHARED (mutex))
# define LLL_MUTEX_TRYLOCK_ELISION(mutex) \
lll_trylock_elision((mutex)->__data.__lock, (mutex)->__data.__elision, \
PTHREAD_MUTEX_PSHARED (mutex))
# define PTHREAD_MUTEX_LOCK ___pthread_mutex_lock
# define PTHREAD_MUTEX_VERSIONS 1
#endif
#ifndef LLL_MUTEX_READ_LOCK
# define LLL_MUTEX_READ_LOCK(mutex) \
atomic_load_relaxed (&(mutex)->__data.__lock)
#endif
static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
__attribute_noinline__;
int
PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
{
/* See concurrency notes regarding mutex type which is loaded from __kind
in struct __pthread_mutex_s in sysdeps/nptl/bits/thread-shared-types.h. */
unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
LIBC_PROBE (mutex_entry, 1, mutex);
if (__builtin_expect (type & ~(PTHREAD_MUTEX_KIND_MASK_NP
| PTHREAD_MUTEX_ELISION_FLAGS_NP), 0))
return __pthread_mutex_lock_full (mutex);
if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_NP))
{
FORCE_ELISION (mutex, goto elision);
simple:
/* Normal mutex. */
LLL_MUTEX_LOCK_OPTIMIZED (mutex);
assert (mutex->__data.__owner == 0);
}
#if ENABLE_ELISION_SUPPORT
else if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_ELISION_NP))
{
elision: __attribute__((unused))
/* This case can never happen on a system without elision,
as the mutex type initialization functions will not
allow to set the elision flags. */
/* Don't record owner or users for elision case. This is a
tail call. */
return LLL_MUTEX_LOCK_ELISION (mutex);
}
#endif
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
== PTHREAD_MUTEX_RECURSIVE_NP, 1))
{
/* Recursive mutex. */
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
/* Check whether we already hold the mutex. */
if (mutex->__data.__owner == id)
{
/* Just bump the counter. */
if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
/* Overflow of the counter. */
return EAGAIN;
++mutex->__data.__count;
return 0;
}
/* We have to get the mutex. */
LLL_MUTEX_LOCK_OPTIMIZED (mutex);
assert (mutex->__data.__owner == 0);
mutex->__data.__count = 1;
}
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
== PTHREAD_MUTEX_ADAPTIVE_NP, 1))
{
if (LLL_MUTEX_TRYLOCK (mutex) != 0)
{
int cnt = 0;
int max_cnt = MIN (max_adaptive_count (),
mutex->__data.__spins * 2 + 10);
int spin_count, exp_backoff = 1;
unsigned int jitter = get_jitter ();
do
{
/* In each loop, spin count is exponential backoff plus
random jitter, random range is [0, exp_backoff-1]. */
spin_count = exp_backoff + (jitter & (exp_backoff - 1));
cnt += spin_count;
if (cnt >= max_cnt)
{
/* If cnt exceeds max spin count, just go to wait
queue. */
LLL_MUTEX_LOCK (mutex);
break;
}
do
atomic_spin_nop ();
while (--spin_count > 0);
/* Prepare for next loop. */
exp_backoff = get_next_backoff (exp_backoff);
}
while (LLL_MUTEX_READ_LOCK (mutex) != 0
|| LLL_MUTEX_TRYLOCK (mutex) != 0);
mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
}
assert (mutex->__data.__owner == 0);
}
else
{
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
assert (PTHREAD_MUTEX_TYPE (mutex) == PTHREAD_MUTEX_ERRORCHECK_NP);
/* Check whether we already hold the mutex. */
if (__glibc_unlikely (mutex->__data.__owner == id))
return EDEADLK;
goto simple;
}
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
/* Record the ownership. */
mutex->__data.__owner = id;
#ifndef NO_INCR
++mutex->__data.__nusers;
#endif
LIBC_PROBE (mutex_acquired, 1, mutex);
return 0;
}
static int
__pthread_mutex_lock_full (pthread_mutex_t *mutex)
{
int oldval;
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
switch (PTHREAD_MUTEX_TYPE (mutex))
{
case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
&mutex->__data.__list.__next);
/* We need to set op_pending before starting the operation. Also
see comments at ENQUEUE_MUTEX. */
__asm ("" ::: "memory");
oldval = mutex->__data.__lock;
/* This is set to FUTEX_WAITERS iff we might have shared the
FUTEX_WAITERS flag with other threads, and therefore need to keep it
set to avoid lost wake-ups. We have the same requirement in the
simple mutex algorithm.
We start with value zero for a normal mutex, and FUTEX_WAITERS if we
are building the special case mutexes for use from within condition
variables. */
unsigned int assume_other_futex_waiters = LLL_ROBUST_MUTEX_LOCK_MODIFIER;
while (1)
{
/* Try to acquire the lock through a CAS from 0 (not acquired) to
our TID | assume_other_futex_waiters. */
if (__glibc_likely (oldval == 0))
{
oldval
= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
id | assume_other_futex_waiters, 0);
if (__glibc_likely (oldval == 0))
break;
}
if ((oldval & FUTEX_OWNER_DIED) != 0)
{
/* The previous owner died. Try locking the mutex. */
int newval = id;
#ifdef NO_INCR
/* We are not taking assume_other_futex_waiters into account
here simply because we'll set FUTEX_WAITERS anyway. */
newval |= FUTEX_WAITERS;
#else
newval |= (oldval & FUTEX_WAITERS) | assume_other_futex_waiters;
#endif
newval
= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
newval, oldval);
if (newval != oldval)
{
oldval = newval;
continue;
}
/* We got the mutex. */
mutex->__data.__count = 1;
/* But it is inconsistent unless marked otherwise. */
mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;
/* We must not enqueue the mutex before we have acquired it.
Also see comments at ENQUEUE_MUTEX. */
__asm ("" ::: "memory");
ENQUEUE_MUTEX (mutex);
/* We need to clear op_pending after we enqueue the mutex. */
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
/* Note that we deliberately exit here. If we fall
through to the end of the function __nusers would be
incremented which is not correct because the old
owner has to be discounted. If we are not supposed
to increment __nusers we actually have to decrement
it here. */
#ifdef NO_INCR
--mutex->__data.__nusers;
#endif
return EOWNERDEAD;
}
/* Check whether we already hold the mutex. */
if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
{
int kind = PTHREAD_MUTEX_TYPE (mutex);
if (kind == PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP)
{
/* We do not need to ensure ordering wrt another memory
access. Also see comments at ENQUEUE_MUTEX. */
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
NULL);
return EDEADLK;
}
if (kind == PTHREAD_MUTEX_ROBUST_RECURSIVE_NP)
{
/* We do not need to ensure ordering wrt another memory
access. */
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
NULL);
/* Just bump the counter. */
if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
/* Overflow of the counter. */
return EAGAIN;
++mutex->__data.__count;
return 0;
}
}
/* We cannot acquire the mutex nor has its owner died. Thus, try
to block using futexes. Set FUTEX_WAITERS if necessary so that
other threads are aware that there are potentially threads
blocked on the futex. Restart if oldval changed in the
meantime. */
if ((oldval & FUTEX_WAITERS) == 0)
{
int val = atomic_compare_and_exchange_val_acq
(&mutex->__data.__lock, oldval | FUTEX_WAITERS, oldval);
if (val != oldval)
{
oldval = val;
continue;
}
oldval |= FUTEX_WAITERS;
}
/* It is now possible that we share the FUTEX_WAITERS flag with
another thread; therefore, update assume_other_futex_waiters so
that we do not forget about this when handling other cases
above and thus do not cause lost wake-ups. */
assume_other_futex_waiters |= FUTEX_WAITERS;
/* Block using the futex and reload current lock value. */
futex_wait ((unsigned int *) &mutex->__data.__lock, oldval,
PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
oldval = mutex->__data.__lock;
}
/* We have acquired the mutex; check if it is still consistent. */
if (__builtin_expect (mutex->__data.__owner
== PTHREAD_MUTEX_NOTRECOVERABLE, 0))
{
/* This mutex is now not recoverable. */
mutex->__data.__count = 0;
int private = PTHREAD_ROBUST_MUTEX_PSHARED (mutex);
lll_unlock (mutex->__data.__lock, private);
/* FIXME This violates the mutex destruction requirements. See
__pthread_mutex_unlock_full. */
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
return ENOTRECOVERABLE;
}
mutex->__data.__count = 1;
/* We must not enqueue the mutex before we have acquired it.
Also see comments at ENQUEUE_MUTEX. */
__asm ("" ::: "memory");
ENQUEUE_MUTEX (mutex);
/* We need to clear op_pending after we enqueue the mutex. */
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
break;
/* The PI support requires the Linux futex system call. If that's not
available, pthread_mutex_init should never have allowed the type to
be set. So it will get the default case for an invalid type. */
#ifdef __NR_futex
case PTHREAD_MUTEX_PI_RECURSIVE_NP:
case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
case PTHREAD_MUTEX_PI_NORMAL_NP:
case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
case PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP:
case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
case PTHREAD_MUTEX_PI_ROBUST_ADAPTIVE_NP:
{
int kind, robust;
{
/* See concurrency notes regarding __kind in struct __pthread_mutex_s
in sysdeps/nptl/bits/thread-shared-types.h. */
int mutex_kind = atomic_load_relaxed (&(mutex->__data.__kind));
kind = mutex_kind & PTHREAD_MUTEX_KIND_MASK_NP;
robust = mutex_kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
}
if (robust)
{
/* Note: robust PI futexes are signaled by setting bit 0. */
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
(void *) (((uintptr_t) &mutex->__data.__list.__next)
| 1));
/* We need to set op_pending before starting the operation. Also
see comments at ENQUEUE_MUTEX. */
__asm ("" ::: "memory");
}
oldval = mutex->__data.__lock;
/* Check whether we already hold the mutex. */
if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
{
if (kind == PTHREAD_MUTEX_ERRORCHECK_NP)
{
/* We do not need to ensure ordering wrt another memory
access. */
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
return EDEADLK;
}
if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
{
/* We do not need to ensure ordering wrt another memory
access. */
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
/* Just bump the counter. */
if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
/* Overflow of the counter. */
return EAGAIN;
++mutex->__data.__count;
return 0;
}
}
int newval = id;
# ifdef NO_INCR
newval |= FUTEX_WAITERS;
# endif
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
newval, 0);
if (oldval != 0)
{
/* The mutex is locked. The kernel will now take care of
everything. */
int private = (robust
? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
: PTHREAD_MUTEX_PSHARED (mutex));
int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* unused */,
NULL, private);
if (e == ESRCH || e == EDEADLK)
{
assert (e != EDEADLK
|| (kind != PTHREAD_MUTEX_ERRORCHECK_NP
&& kind != PTHREAD_MUTEX_RECURSIVE_NP));
/* ESRCH can happen only for non-robust PI mutexes where
the owner of the lock died. */
assert (e != ESRCH || !robust);
/* Delay the thread indefinitely. */
while (1)
__futex_abstimed_wait64 (&(unsigned int){0}, 0,
0 /* ignored */, NULL, private);
}
oldval = mutex->__data.__lock;
assert (robust || (oldval & FUTEX_OWNER_DIED) == 0);
}
if (__glibc_unlikely (oldval & FUTEX_OWNER_DIED))
{
atomic_fetch_and_acquire (&mutex->__data.__lock, ~FUTEX_OWNER_DIED);
/* We got the mutex. */
mutex->__data.__count = 1;
/* But it is inconsistent unless marked otherwise. */
mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;
/* We must not enqueue the mutex before we have acquired it.
Also see comments at ENQUEUE_MUTEX. */
__asm ("" ::: "memory");
ENQUEUE_MUTEX_PI (mutex);
/* We need to clear op_pending after we enqueue the mutex. */
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
/* Note that we deliberately exit here. If we fall
through to the end of the function __nusers would be
incremented which is not correct because the old owner
has to be discounted. If we are not supposed to
increment __nusers we actually have to decrement it here. */
# ifdef NO_INCR
--mutex->__data.__nusers;
# endif
return EOWNERDEAD;
}
if (robust
&& __builtin_expect (mutex->__data.__owner
== PTHREAD_MUTEX_NOTRECOVERABLE, 0))
{
/* This mutex is now not recoverable. */
mutex->__data.__count = 0;
futex_unlock_pi ((unsigned int *) &mutex->__data.__lock,
PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
/* To the kernel, this will be visible after the kernel has
acquired the mutex in the syscall. */
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
return ENOTRECOVERABLE;
}
mutex->__data.__count = 1;
if (robust)
{
/* We must not enqueue the mutex before we have acquired it.
Also see comments at ENQUEUE_MUTEX. */
__asm ("" ::: "memory");
ENQUEUE_MUTEX_PI (mutex);
/* We need to clear op_pending after we enqueue the mutex. */
__asm ("" ::: "memory");
THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
}
}
break;
#endif /* __NR_futex. */
case PTHREAD_MUTEX_PP_RECURSIVE_NP:
case PTHREAD_MUTEX_PP_ERRORCHECK_NP:
case PTHREAD_MUTEX_PP_NORMAL_NP:
case PTHREAD_MUTEX_PP_ADAPTIVE_NP:
{
/* See concurrency notes regarding __kind in struct __pthread_mutex_s
in sysdeps/nptl/bits/thread-shared-types.h. */
int kind = atomic_load_relaxed (&(mutex->__data.__kind))
& PTHREAD_MUTEX_KIND_MASK_NP;
oldval = mutex->__data.__lock;
/* Check whether we already hold the mutex. */
if (mutex->__data.__owner == id)
{
if (kind == PTHREAD_MUTEX_ERRORCHECK_NP)
return EDEADLK;
if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
{
/* Just bump the counter. */
if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
/* Overflow of the counter. */
return EAGAIN;
++mutex->__data.__count;
return 0;
}
}
int oldprio = -1, ceilval;
do
{
int ceiling = (oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK)
>> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
if (__pthread_current_priority () > ceiling)
{
if (oldprio != -1)
__pthread_tpp_change_priority (oldprio, -1);
return EINVAL;
}
int retval = __pthread_tpp_change_priority (oldprio, ceiling);
if (retval)
return retval;
ceilval = ceiling << PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
oldprio = ceiling;
oldval
= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
#ifdef NO_INCR
ceilval | 2,
#else
ceilval | 1,
#endif
ceilval);
if (oldval == ceilval)
break;
do
{
oldval
= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
ceilval | 2,
ceilval | 1);
if ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval)
break;
if (oldval != ceilval)
futex_wait ((unsigned int * ) &mutex->__data.__lock,
ceilval | 2,
PTHREAD_MUTEX_PSHARED (mutex));
}
while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
ceilval | 2, ceilval)
!= ceilval);
}
while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
assert (mutex->__data.__owner == 0);
mutex->__data.__count = 1;
}
break;
default:
/* Correct code cannot set any other type. */
return EINVAL;
}
/* Record the ownership. */
mutex->__data.__owner = id;
#ifndef NO_INCR
++mutex->__data.__nusers;
#endif
LIBC_PROBE (mutex_acquired, 1, mutex);
return 0;
}
#if PTHREAD_MUTEX_VERSIONS
libc_hidden_ver (___pthread_mutex_lock, __pthread_mutex_lock)
# ifndef SHARED
strong_alias (___pthread_mutex_lock, __pthread_mutex_lock)
# endif
versioned_symbol (libpthread, ___pthread_mutex_lock, pthread_mutex_lock,
GLIBC_2_0);
#endif
static __always_inline int
futex_wait (unsigned int *futex_word, unsigned int expected, int private)
{
int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
switch (err)
{
case 0:
case -EAGAIN:
case -EINTR:
return -err;
case -ETIMEDOUT: /* Cannot have happened as we provided no timeout. */
case -EFAULT: /* Must have been caused by a glibc or application bug. */
case -EINVAL: /* Either due to wrong alignment or due to the timeout not
being normalized. Must have been caused by a glibc or
application bug. */
case -ENOSYS: /* Must have been caused by a glibc bug. */
/* No other errors are documented at this time. */
default:
futex_fatal_error ();
}
}
# define lll_futex_timed_wait(futexp, val, timeout, private) \
lll_futex_syscall (4, futexp, \
__lll_private_flag (FUTEX_WAIT, private), \
val, timeout)
# define lll_futex_syscall(nargs, futexp, op, ...) \
({ \
long int __ret = INTERNAL_SYSCALL (futex, nargs, futexp, op, \
__VA_ARGS__); \
(__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (__ret)) \
? -INTERNAL_SYSCALL_ERRNO (__ret) : 0); \
})
#define INTERNAL_SYSCALL(name, nr, args...) \
internal_syscall##nr (SYS_ify (name), args)
#undef internal_syscall4
#define internal_syscall4(number, arg1, arg2, arg3, arg4) \
({ \
unsigned long int resultvar; \
TYPEFY (arg4, __arg4) = ARGIFY (arg4); \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg4, _a4) asm ("r10") = __arg4; \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
const struct __kernel_timespec __user *, utime,
u32 __user *, uaddr2, u32, val3)
{
int ret, cmd = op & FUTEX_CMD_MASK;
ktime_t t, *tp = NULL;
struct timespec64 ts;
if (utime && futex_cmd_has_timeout(cmd)) {
if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
return -EFAULT;
if (get_timespec64(&ts, utime))
return -EFAULT;
ret = futex_init_timeout(cmd, op, &ts, &t);
if (ret)
return ret;
tp = &t;
}
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
unsigned int flags = futex_to_flags(op);
int cmd = op & FUTEX_CMD_MASK;
// yym-gaizao
pr_debug("do_futex: pid=%d, tid=%d, cmd=%d, uaddr=%p, val=%u, flags=0x%x\n",
current->tgid, current->pid, cmd, uaddr, val, flags);
if (flags & FLAGS_CLOCKRT) {
if (cmd != FUTEX_WAIT_BITSET &&
cmd != FUTEX_WAIT_REQUEUE_PI &&
cmd != FUTEX_LOCK_PI2)
return -ENOSYS;
}
switch (cmd) {
case FUTEX_WAIT:
val3 = FUTEX_BITSET_MATCH_ANY;
fallthrough;
case FUTEX_WAIT_BITSET:
return futex_wait(uaddr, flags, val, timeout, val3);
case FUTEX_WAKE:
val3 = FUTEX_BITSET_MATCH_ANY;
fallthrough;
case FUTEX_WAKE_BITSET:
return futex_wake(uaddr, flags, val, val3);
case FUTEX_REQUEUE:
return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, NULL, 0);
case FUTEX_CMP_REQUEUE:
return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 0);
case FUTEX_WAKE_OP:
return futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
case FUTEX_LOCK_PI:
flags |= FLAGS_CLOCKRT;
fallthrough;
case FUTEX_LOCK_PI2:
return futex_lock_pi(uaddr, flags, timeout, 0);
case FUTEX_UNLOCK_PI:
return futex_unlock_pi(uaddr, flags);
case FUTEX_TRYLOCK_PI:
return futex_lock_pi(uaddr, flags, NULL, 1);
case FUTEX_WAIT_REQUEUE_PI:
val3 = FUTEX_BITSET_MATCH_ANY;
return futex_wait_requeue_pi(uaddr, flags, val, timeout, val3,
uaddr2);
case FUTEX_CMP_REQUEUE_PI:
return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 1);
}
return -ENOSYS;
}
int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
int ret;
// yym-gaizao
pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
current->tgid, current->pid, uaddr, val, bitset, flags);
to = futex_setup_timer(abs_time, &timeout, flags,
current->timer_slack_ns);
ret = __futex_wait(uaddr, flags, val, to, bitset);
/* No timeout, nothing to clean up. */
if (!to)
return ret;
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
if (ret == -ERESTARTSYS) {
restart = &current->restart_block;
restart->futex.uaddr = uaddr;
restart->futex.val = val;
restart->futex.time = *abs_time;
restart->futex.bitset = bitset;
restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;
return set_restart_fn(restart, futex_wait_restart);
}
return ret;
}
int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
struct hrtimer_sleeper *to, u32 bitset)
{
struct futex_q q = futex_q_init;
struct futex_hash_bucket *hb;
int ret;
if (!bitset)
return -EINVAL;
q.bitset = bitset;
retry:
/*
* Prepare to wait on uaddr. On success, it holds hb->lock and q
* is initialized.
*/
ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
if (ret)
return ret;
/* futex_queue and wait for wakeup, timeout, or a signal. */
futex_wait_queue(hb, &q, to);
/* If we were woken (and unqueued), we succeeded, whatever. */
if (!futex_unqueue(&q))
return 0;
if (to && !to->task)
return -ETIMEDOUT;
/*
* We expect signal_pending(current), but we might be the
* victim of a spurious wakeup as well.
*/
if (!signal_pending(current))
goto retry;
return -ERESTARTSYS;
}
/**
* futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
*/
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
* wake it. set_current_state() is implemented using smp_store_mb() and
* futex_queue() calls spin_unlock() upon completion, both serializing
* access to the hash list and forcing another memory barrier.
*/
set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);
/* Arm the timer */
if (timeout)
hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
/*
* If we have been removed from the hash list, then another task
* has tried to wake us, and we can skip the call to schedule().
*/
if (likely(!plist_node_empty(&q->list))) {
/*
* If the timer has already expired, current will already be
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
if (!timeout || timeout->task)
schedule();
}
__set_current_state(TASK_RUNNING);
}