A Deep Dive into the Complete pthread_mutex_lock Implementation in Linux 6.8.12 and glibc 2.39

Preface

This article dissects the complete implementation of pthread_mutex_lock layer by layer, based on the Linux 6.8.12 kernel sources and the glibc 2.39 userspace library sources. Following one continuous logical thread, we start at the glibc userspace entry point, cross the futex system-call boundary, and descend into the kernel's thread blocking and wakeup machinery, covering every mutex flavor: normal, recursive, adaptive, error-checking, robust, priority-inheritance (PI), and priority-protection (PP, priority-ceiling) mutexes. Along the way we cover the userspace fast path, the kernel slow path, hash-queue management, timer handling, and recovery from signal interruption, staying close to the source code throughout.


Chapter 1: The glibc Userspace Entry Point --- ___pthread_mutex_lock

1.1 Function Entry and Type Dispatch

The glibc implementation of pthread_mutex_lock lives in nptl/pthread_mutex_lock.c; the actual symbol is ___pthread_mutex_lock (exposed as pthread_mutex_lock through symbol-versioning macros). The function's first job is a fast dispatch on the mutex type.

int PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
{
    unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
    
    LIBC_PROBE (mutex_entry, 1, mutex);
    
    if (__builtin_expect (type & ~(PTHREAD_MUTEX_KIND_MASK_NP
                                   | PTHREAD_MUTEX_ELISION_FLAGS_NP), 0))
        return __pthread_mutex_lock_full (mutex);
    // ...
}

The PTHREAD_MUTEX_TYPE_ELISION macro extracts the lock type from mutex->__data.__kind, including the elision (hardware transactional memory) flag bits. PTHREAD_MUTEX_KIND_MASK_NP defines the mask for the basic types, and PTHREAD_MUTEX_ELISION_FLAGS_NP defines the elision-related flags.

The very first check is crucial: if type contains any bit outside those two masks, the mutex is a non-standard type (robust, PI, PP, and so on) and must take the full path, __pthread_mutex_lock_full. The branch is annotated __builtin_expect(..., 0), hinting that these types are rare in ordinary use, so the compiler optimizes the instruction layout of the main path.

1.2 The Fast Path for Normal Mutexes (PTHREAD_MUTEX_TIMED_NP)

For the most common type, PTHREAD_MUTEX_TIMED_NP (the default normal mutex), the code takes the most optimized path:

if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_NP))
{
    FORCE_ELISION (mutex, goto elision);
simple:
    /* Normal mutex.  */
    LLL_MUTEX_LOCK_OPTIMIZED (mutex);
    assert (mutex->__data.__owner == 0);
}

The FORCE_ELISION macro supports Intel TSX (Transactional Synchronization Extensions) hardware transactional memory. If the system supports elision and glibc was built with ENABLE_ELISION_SUPPORT, lock acquisition is first attempted with transactional instructions (XBEGIN/XEND). If elision fails or is unavailable, control falls through to the simple label and the regular LLL_MUTEX_LOCK_OPTIMIZED.

LLL_MUTEX_LOCK_OPTIMIZED is a key optimization macro, defined as follows:

static inline void lll_mutex_lock_optimized (pthread_mutex_t *mutex)
{
    int private = PTHREAD_MUTEX_PSHARED (mutex);
    if (private == LLL_PRIVATE && SINGLE_THREAD_P && mutex->__data.__lock == 0)
        mutex->__data.__lock = 1;
    else
        lll_lock (mutex->__data.__lock, private);
}

This optimization embodies the core philosophy of futexes: avoid the system call entirely when there is no contention. SINGLE_THREAD_P is a runtime check, maintained by the runtime as threads are created; while the process is still single-threaded, __lock can be set from 0 to 1 directly, with no atomic instruction and no system call. The optimization applies only to LLL_PRIVATE (process-private) mutexes: a process-shared mutex may live in a shared memory mapping, so even a single-threaded process must synchronize with other processes.

If those conditions do not hold, lll_lock is called and the standard futex path begins.

1.3 lll_lock and the Low-Level Futex Interaction

lll_lock is defined in sysdeps/nptl/lowlevellock.h and ultimately reaches the architecture-specific __lll_lock_wait or __lll_lock_wait_private. On x86_64 the core logic is:

  1. First CAS attempt: use atomic_compare_and_exchange_val_acq to atomically change __lock from 0 to 1. On success, return immediately.
  2. Contention detected: if the CAS fails, the lock is held. Set __lock to 2 ("locked with waiters") and call futex_wait to block in the kernel.
  3. Retry loop: after being woken, try the CAS again until it succeeds.

The state machine here is elegantly minimal:

  • 0: unlocked
  • 1: locked, no waiters
  • 2: locked, with waiters (or a thread about to wait)

This three-state design lets the unlock path decide whether a futex_wake system call is needed: if __lock was 1, there are no waiters and storing 0 is enough; if it was 2, the waiters queued in the kernel must be woken.
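The 0/1/2 protocol can be condensed into a working miniature. The following is a sketch rather than glibc's actual code: mini_lock, mini_unlock, and the futex() wrapper are names invented here, but the transitions mirror __lll_lock_wait and the matching unlock (the algorithm follows Drepper's classic formulation):

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_uint *uaddr, int op, unsigned int val)
{
    /* Raw futex syscall; glibc exposes no public wrapper for it. */
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void mini_lock(atomic_uint *f)
{
    unsigned int c = 0;
    /* Fast path: 0 -> 1, uncontended, no syscall. */
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;
    /* Slow path: advertise waiters (state 2), then sleep. */
    if (c != 2)
        c = atomic_exchange(f, 2);
    while (c != 0) {                     /* old value 0 means we took it */
        futex(f, FUTEX_WAIT_PRIVATE, 2); /* sleep only while *f == 2 */
        c = atomic_exchange(f, 2);
    }
}

static void mini_unlock(atomic_uint *f)
{
    /* Only state 2 requires a wakeup syscall. */
    if (atomic_exchange(f, 0) == 2)
        futex(f, FUTEX_WAKE_PRIVATE, 1);
}

static atomic_uint lock_word;
static int counter;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        mini_lock(&lock_word);
        counter++;                       /* protected by the mini mutex */
        mini_unlock(&lock_word);
    }
    return NULL;
}

int mini_mutex_demo(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;                      /* 200000 if the lock is correct */
}
```

Note that a woken thread conservatively sets the state to 2 even when it may be the only waiter; correctness is preserved at the cost of an occasional spurious futex_wake.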

1.4 Recursive Mutexes (PTHREAD_MUTEX_RECURSIVE_NP)

else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
                           == PTHREAD_MUTEX_RECURSIVE_NP, 1))
{
    pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
    
    /* Check whether we already hold the mutex.  */
    if (mutex->__data.__owner == id)
    {
        if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
            return EAGAIN;  /* Overflow of the counter.  */
        ++mutex->__data.__count;
        return 0;
    }
    
    /* We have to get the mutex.  */
    LLL_MUTEX_LOCK_OPTIMIZED (mutex);
    
    assert (mutex->__data.__owner == 0);
    mutex->__data.__count = 1;
}

A recursive mutex lets the same thread lock it multiple times. The implementation records the owner's TID in __owner and the recursion depth in __count. On a repeated lock, the code first checks __owner == id and, on a match, increments the counter. Note the overflow check: __count is an unsigned int, so incrementing 0xFFFFFFFF would wrap to 0; in that case EAGAIN is returned.

If the current thread is not the owner, the normal LLL_MUTEX_LOCK_OPTIMIZED path acquires the lock, after which __owner = id and __count = 1 are set.

1.5 Adaptive Mutexes (PTHREAD_MUTEX_ADAPTIVE_NP)

else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
                           == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
{
    if (LLL_MUTEX_TRYLOCK (mutex) != 0)
    {
        int cnt = 0;
        int max_cnt = MIN (max_adaptive_count (),
                           mutex->__data.__spins * 2 + 10);
        int spin_count, exp_backoff = 1;
        unsigned int jitter = get_jitter ();
        do
        {
            spin_count = exp_backoff + (jitter & (exp_backoff - 1));
            cnt += spin_count;
            if (cnt >= max_cnt)
            {
                LLL_MUTEX_LOCK (mutex);
                break;
            }
            do
                atomic_spin_nop ();
            while (--spin_count > 0);
            
            exp_backoff = get_next_backoff (exp_backoff);
        }
        while (LLL_MUTEX_READ_LOCK (mutex) != 0
               || LLL_MUTEX_TRYLOCK (mutex) != 0);
        
        mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
    }
    assert (mutex->__data.__owner == 0);
}

The adaptive mutex is a glibc performance strategy that combines the strengths of spinlocks and sleeping mutexes. The core idea: if the lock is held only briefly, spinning can beat going to sleep in the kernel; if it is held for long, the thread should sleep promptly instead of burning CPU.

The implementation details:

  1. First attempt: LLL_MUTEX_TRYLOCK (i.e. lll_trylock, a CAS attempt) is tried first; on success the function returns immediately.
  2. Adaptive spinning: on failure, a spin loop begins. max_adaptive_count() returns the system's maximum spin count (typically related to the CPU topology). mutex->__data.__spins keeps a running statistic used for dynamic tuning.
  3. Exponential backoff with jitter: exp_backoff starts at 1 and grows each iteration (via get_next_backoff), bounded by max_cnt. jitter adds randomness so that multiple spinners do not synchronize and stampede the cache line (the "thundering herd" effect).
  4. Spin NOP: atomic_spin_nop() is typically the PAUSE instruction on x86_64, which tells the CPU this is a spin-wait loop so it can relax the pipeline and save power.
  5. Online learning: after spinning, __spins is updated with mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8, an exponential moving average that adapts the spin budget to the lock's observed behavior.
  6. Final fallback: if the spin budget is exhausted without acquiring the lock, LLL_MUTEX_LOCK is called and the thread waits in the kernel.

LLL_MUTEX_READ_LOCK is a read-only probe used during spinning to detect quickly whether the lock has been released, avoiding needless CAS traffic.
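Adaptive mutexes are requested through the GNU-specific type constant, which requires _GNU_SOURCE; a minimal setup sketch (adaptive_demo is a hypothetical name):

```c
#define _GNU_SOURCE            /* PTHREAD_MUTEX_ADAPTIVE_NP is a GNU extension */
#include <assert.h>
#include <pthread.h>

int adaptive_demo(void)
{
    pthread_mutexattr_t a;
    pthread_mutex_t m;
    int type;

    pthread_mutexattr_init(&a);
    assert(pthread_mutexattr_settype(&a, PTHREAD_MUTEX_ADAPTIVE_NP) == 0);
    pthread_mutexattr_gettype(&a, &type);
    assert(type == PTHREAD_MUTEX_ADAPTIVE_NP);

    pthread_mutex_init(&m, &a);
    assert(pthread_mutex_lock(&m) == 0);   /* uncontended: the trylock succeeds */
    assert(pthread_mutex_unlock(&m) == 0);

    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&a);
    return 0;
}
```

The spinning behavior itself only shows up under contention, when another thread holds the lock for a short time.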

1.6 Error-Checking Mutexes (PTHREAD_MUTEX_ERRORCHECK_NP)

else
{
    pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
    assert (PTHREAD_MUTEX_TYPE (mutex) == PTHREAD_MUTEX_ERRORCHECK_NP);
    /* Check whether we already hold the mutex.  */
    if (__glibc_unlikely (mutex->__data.__owner == id))
        return EDEADLK;
    goto simple;
}

The error-checking mutex targets debugging: it detects self-deadlock. If the calling thread already holds the lock, a second lock attempt returns EDEADLK instead of deadlocking. Otherwise the code jumps to the simple label and takes the normal acquisition path.

1.7 Ownership Recording and Probes

Every successful acquisition path (except elision) ends with:

pid_t id = THREAD_GETMEM (THREAD_SELF, tid);

/* Record the ownership.  */
mutex->__data.__owner = id;
#ifndef NO_INCR
++mutex->__data.__nusers;
#endif

LIBC_PROBE (mutex_acquired, 1, mutex);

return 0;

The __owner field records the holder's TID, used for recursion detection, error-checking deadlock detection, and robust-mutex recovery. __nusers is a usage counter (maintained unless NO_INCR is defined). LIBC_PROBE is a SystemTap/DTrace probe point for dynamic tracing; it does not affect the normal execution path.


Chapter 2: The Full Path --- __pthread_mutex_lock_full

When the mutex type is robust, PI, PP, or unrecognized, execution enters __pthread_mutex_lock_full. This is the most intricate part of the whole implementation, covering all the advanced mutex semantics.

2.1 Robust Mutexes

Robust mutexes solve the "owner dies while holding the lock" problem. With an ordinary mutex this deadlocks: the lock stays held forever although its owner no longer exists. A robust mutex fixes this by automatically releasing the locks a thread holds when it exits and notifying the next waiter with EOWNERDEAD.

2.1.1 Robust-Lock State and List Management

case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
                   &mutex->__data.__list.__next);
    __asm ("" ::: "memory");

Each thread carries a robust_head structure that maintains a linked list of the robust mutexes it holds. list_op_pending points at the lock node currently being operated on; it is a critical atomicity marker that must be written before the lock state is actually modified, so that if the thread crashes mid-operation, the kernel's cleanup code can recognize the unfinished operation.

__asm ("" ::: "memory") is a compiler barrier: it stops the compiler from reordering memory operations, guaranteeing the write to list_op_pending completes before the subsequent lock manipulation.

2.1.2 The Acquisition Loop and the CAS

oldval = mutex->__data.__lock;
unsigned int assume_other_futex_waiters = LLL_ROBUST_MUTEX_LOCK_MODIFIER;

while (1)
{
    /* Try to acquire the lock through a CAS from 0 to our TID | waiters.  */
    if (__glibc_likely (oldval == 0))
    {
        oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
                    id | assume_other_futex_waiters, 0);
        if (__glibc_likely (oldval == 0))
            break;
    }

The __lock field of a robust mutex encodes more information:

  • low bits (FUTEX_TID_MASK): the owner's TID
  • the FUTEX_WAITERS bit: whether there are waiters
  • the FUTEX_OWNER_DIED bit: whether the previous owner died

The CAS tries to change __lock from 0 to id | assume_other_futex_waiters. assume_other_futex_waiters starts as LLL_ROBUST_MUTEX_LOCK_MODIFIER (normally 0) but may be set to FUTEX_WAITERS in later iterations.

2.1.3 Handling a Dead Owner (FUTEX_OWNER_DIED)

if ((oldval & FUTEX_OWNER_DIED) != 0)
{
    int newval = id;
#ifdef NO_INCR
    newval |= FUTEX_WAITERS;
#else
    newval |= (oldval & FUTEX_WAITERS) | assume_other_futex_waiters;
#endif

    newval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
                                                   newval, oldval);

    if (newval != oldval)
    {
        oldval = newval;
        continue;
    }

    /* We got the mutex.  */
    mutex->__data.__count = 1;
    mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;

    __asm ("" ::: "memory");
    ENQUEUE_MUTEX (mutex);
    __asm ("" ::: "memory");
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

#ifdef NO_INCR
    --mutex->__data.__nusers;
#endif

    return EOWNERDEAD;
}

A set FUTEX_OWNER_DIED bit means the thread that held the lock terminated abnormally. The code then:

  1. Tries a CAS to take the lock, preserving the FUTEX_WAITERS bit if waiters exist.
  2. On CAS success, sets __owner = PTHREAD_MUTEX_INCONSISTENT, marking the lock's protected state as inconsistent (the previous owner may have died inside the critical section, leaving data structures corrupted).
  3. Links the mutex into the current thread's robust list (ENQUEUE_MUTEX).
  4. Clears list_op_pending, marking the operation complete.
  5. Returns EOWNERDEAD to the caller, who must decide how to recover (typically by calling pthread_mutex_consistent to declare the state consistent again).

Note the NO_INCR handling: in special contexts (such as the mutex used internally by condition variables), __nusers must not be incremented and is instead decremented, since the dead owner should not be counted.

2.1.4 Deadlock Detection and Recursion

if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
{
    int kind = PTHREAD_MUTEX_TYPE (mutex);
    if (kind == PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP)
    {
        THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
        return EDEADLK;
    }

    if (kind == PTHREAD_MUTEX_ROBUST_RECURSIVE_NP)
    {
        THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
        
        if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
            return EAGAIN;
        ++mutex->__data.__count;
        return 0;
    }
}

Robust mutexes also support error-checking and recursive semantics. If the current thread already holds the lock:

  • error-checking kind: return EDEADLK
  • recursive kind: increment the counter

2.1.5 Futex Wait and Wakeup

if ((oldval & FUTEX_WAITERS) == 0)
{
    int val = atomic_compare_and_exchange_val_acq
        (&mutex->__data.__lock, oldval | FUTEX_WAITERS, oldval);
    if (val != oldval)
    {
        oldval = val;
        continue;
    }
    oldval |= FUTEX_WAITERS;
}

assume_other_futex_waiters |= FUTEX_WAITERS;

futex_wait ((unsigned int *) &mutex->__data.__lock, oldval,
            PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
oldval = mutex->__data.__lock;

If the lock cannot be acquired, the thread must wait in the kernel. The key steps:

  1. Set FUTEX_WAITERS: if the waiters flag is not yet set, set it with a CAS first, so the unlocking thread knows it must wake waiters.
  2. Update assume_other_futex_waiters: once FUTEX_WAITERS has been set, later iterations must preserve it to avoid lost wake-ups.
  3. futex_wait: issue the futex system call with the current __lock value as the expected value. The kernel compares the actual value against it and returns immediately (EAGAIN) on a mismatch, preventing a pointless sleep.

2.1.6 Detecting an Unrecoverable Lock

if (__builtin_expect (mutex->__data.__owner
                      == PTHREAD_MUTEX_NOTRECOVERABLE, 0))
{
    mutex->__data.__count = 0;
    int private = PTHREAD_ROBUST_MUTEX_PSHARED (mutex);
    lll_unlock (mutex->__data.__lock, private);
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
    return ENOTRECOVERABLE;
}

A lock marked NOTRECOVERABLE means some earlier thread received EOWNERDEAD but never called pthread_mutex_consistent. The lock is released and ENOTRECOVERABLE is returned.

2.2 Priority-Inheritance Mutexes (PI Mutexes)

Priority inheritance is the classic answer to priority inversion. When a high-priority thread blocks on a lock held by a low-priority thread, the holder's priority is temporarily boosted to the waiter's level, so a medium-priority thread cannot preempt the CPU and starve the high-priority thread indefinitely.

#ifdef __NR_futex
case PTHREAD_MUTEX_PI_RECURSIVE_NP:
case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
case PTHREAD_MUTEX_PI_NORMAL_NP:
case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
case PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP:
case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
case PTHREAD_MUTEX_PI_ROBUST_ADAPTIVE_NP:

PI locks rely entirely on the kernel's futex_lock_pi and futex_unlock_pi operations; userspace only adds a thin wrapper.

2.2.1 Separating kind from robust

{
    int mutex_kind = atomic_load_relaxed (&(mutex->__data.__kind));
    kind = mutex_kind & PTHREAD_MUTEX_KIND_MASK_NP;
    robust = mutex_kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
}

The PI lock's type is encoded in __kind, from which the basic type (kind) and the robustness flag (robust) are separated. atomic_load_relaxed is used because __kind normally never changes after initialization, yet the read must still be atomic in some concurrent scenarios.

2.2.2 Setting op_pending for Robust PI Locks

if (robust)
{
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
                   (void *) (((uintptr_t) &mutex->__data.__list.__next) | 1));
    __asm ("" ::: "memory");
}

The list_op_pending setup for a robust PI lock differs from a plain robust lock: the lowest bit of the address is set to 1, which is how the kernel distinguishes PI from non-PI robust cleanup.

2.2.3 The First CAS and the Kernel Takeover

oldval = mutex->__data.__lock;

if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
{
    // deadlock detection and recursion handling (same as for plain robust locks)
}

int newval = id;
#ifdef NO_INCR
newval |= FUTEX_WAITERS;
#endif
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
                                               newval, 0);

if (oldval != 0)
{
    int private = (robust
                   ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
                   : PTHREAD_MUTEX_PSHARED (mutex));
    int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* unused */,
                                NULL, private);
    // ...
    oldval = mutex->__data.__lock;
}

The PI lock's first CAS tries to change __lock from 0 to id (or id | FUTEX_WAITERS). On failure (oldval != 0), __futex_lock_pi64 hands control entirely to the kernel.

__futex_lock_pi64 is glibc's wrapper around the kernel's futex_lock_pi operation. The kernel will:

  1. Examine the lock's current state
  2. If it is held, run the priority-inheritance logic
  3. Put the current thread on the wait queue
  4. Boost the holder's priority when necessary
  5. Wake waiters in priority order when the lock is released

2.2.4 Handling ESRCH and EDEADLK

if (e == ESRCH || e == EDEADLK)
{
    assert (e != EDEADLK
            || (kind != PTHREAD_MUTEX_ERRORCHECK_NP
                && kind != PTHREAD_MUTEX_RECURSIVE_NP));
    assert (e != ESRCH || !robust);

    /* Delay the thread indefinitely.  */
    while (1)
        __futex_abstimed_wait64 (&(unsigned int){0}, 0,
                                  0 /* ignored */, NULL, private);
}

For a non-robust PI lock, ESRCH means the lock's owner has died; EDEADLK means the kernel detected a deadlock. For these errors the code enters an infinite wait loop, effectively blocking the thread forever. This is a deliberately conservative policy: rather than continue in a state that cannot be recovered safely, the thread is parked.

2.2.5 Dead Owners and Unrecoverable State

if (__glibc_unlikely (oldval & FUTEX_OWNER_DIED))
{
    atomic_fetch_and_acquire (&mutex->__data.__lock, ~FUTEX_OWNER_DIED);
    mutex->__data.__count = 1;
    mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;
    
    __asm ("" ::: "memory");
    ENQUEUE_MUTEX_PI (mutex);
    __asm ("" ::: "memory");
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
    
#ifdef NO_INCR
    --mutex->__data.__nusers;
#endif
    return EOWNERDEAD;
}

if (robust
    && __builtin_expect (mutex->__data.__owner
                         == PTHREAD_MUTEX_NOTRECOVERABLE, 0))
{
    mutex->__data.__count = 0;
    futex_unlock_pi ((unsigned int *) &mutex->__data.__lock,
                     PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
    return ENOTRECOVERABLE;
}

The dead-owner and unrecoverable handling for PI locks mirrors that of plain robust locks, except that ENQUEUE_MUTEX_PI links the mutex into the PI-specific robust list and futex_unlock_pi releases the lock.

2.3 Priority-Protection Mutexes (PP Mutexes)

Priority protection (the priority-ceiling protocol) is the other classic remedy for priority inversion. Each lock is assigned a priority ceiling, and any thread that acquires the lock has its priority raised to that ceiling.

case PTHREAD_MUTEX_PP_RECURSIVE_NP:
case PTHREAD_MUTEX_PP_ERRORCHECK_NP:
case PTHREAD_MUTEX_PP_NORMAL_NP:
case PTHREAD_MUTEX_PP_ADAPTIVE_NP:

2.3.1 Priority Check and Boosting

int kind = atomic_load_relaxed (&(mutex->__data.__kind))
    & PTHREAD_MUTEX_KIND_MASK_NP;

oldval = mutex->__data.__lock;

if (mutex->__data.__owner == id)
{
    if (kind == PTHREAD_MUTEX_ERRORCHECK_NP)
        return EDEADLK;
    if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
    {
        if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
            return EAGAIN;
        ++mutex->__data.__count;
        return 0;
    }
}

The deadlock detection and recursion handling for PP locks is the same as before.

2.3.2 Implementing the Priority-Ceiling Protocol

int oldprio = -1, ceilval;
do
{
    int ceiling = (oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK)
                  >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;

    if (__pthread_current_priority () > ceiling)
    {
        if (oldprio != -1)
            __pthread_tpp_change_priority (oldprio, -1);
        return EINVAL;
    }

    int retval = __pthread_tpp_change_priority (oldprio, ceiling);
    if (retval)
        return retval;

    ceilval = ceiling << PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
    oldprio = ceiling;

    oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
#ifdef NO_INCR
                                                   ceilval | 2,
#else
                                                   ceilval | 1,
#endif
                                                   ceilval);
    // ...
} while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);

The key points of the PP implementation:

  1. Extract the ceiling: the ceiling value is read from the high bits of __lock.
  2. Check the current priority: if the calling thread's priority exceeds the ceiling, return EINVAL (required by POSIX).
  3. Boost the priority: __pthread_tpp_change_priority raises the thread to the ceiling; if a boost was already applied (oldprio != -1), the old priority is restored first.
  4. CAS to acquire: try to change __lock from ceilval to ceilval | 1 (or ceilval | 2), where ceilval is the ceiling shifted into its bit position.
  5. Wait loop: if the CAS fails, another thread is competing, and the code enters a futex wait loop until the lock is released.

2.3.3 The Futex Wait

do
{
    oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
                                                     ceilval | 2,
                                                     ceilval | 1);

    if ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval)
        break;

    if (oldval != ceilval)
        futex_wait ((unsigned int *) &mutex->__data.__lock,
                    ceilval | 2,
                    PTHREAD_MUTEX_PSHARED (mutex));
}
while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
                                               ceilval | 2, ceilval)
       != ceilval);

The PP wait logic:

  1. Try to change __lock from ceilval | 1 (locked, no waiters) to ceilval | 2 (locked, with waiters).
  2. If the ceiling bits changed (the lock was reinitialized), break out and retry.
  3. If the lock is still held, call futex_wait.
  4. After being woken, try the acquisition CAS again.

Chapter 3: From glibc into the Kernel --- The futex System Call

3.1 The Kernel Entry for futex_wait

When glibc's lll_futex_wait macro expands, it eventually issues the syscall instruction and enters the kernel. On x86_64 this is the SYS_futex system call.

SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
                const struct __kernel_timespec __user *, utime,
                u32 __user *, uaddr2, u32, val3)
{
    int ret, cmd = op & FUTEX_CMD_MASK;
    ktime_t t, *tp = NULL;
    struct timespec64 ts;

    if (utime && futex_cmd_has_timeout(cmd)) {
        if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
            return -EFAULT;
        if (get_timespec64(&ts, utime))
            return -EFAULT;
        ret = futex_init_timeout(cmd, op, &ts, &t);
        if (ret)
            return ret;
        tp = &t;
    }

    return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}

SYSCALL_DEFINE6 is the kernel macro for defining a six-argument system call. The futex arguments are:

  • uaddr: the userspace address (the futex word)
  • op: the command plus flag bits
  • val: the expected value (for WAIT) or the number of tasks to wake (for WAKE)
  • utime: an optional timeout
  • uaddr2: a second address (for REQUEUE-style operations)
  • val3: a third value

The low 8 bits of op hold the command (FUTEX_CMD_MASK); the high bits are flags (such as FUTEX_PRIVATE_FLAG for process-private futexes).
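The raw system call can be driven directly. The sketch below (the futex() helper and futex_handshake_demo are names invented here; FUTEX_*_PRIVATE simply OR in FUTEX_PRIVATE_FLAG) shows the canonical wait/wake handshake; the waiter re-checks the word in a loop precisely because FUTEX_WAIT may return spuriously or with EAGAIN:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_uint flag;   /* the futex word is just ordinary memory */

static long futex(atomic_uint *uaddr, int op, unsigned int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waiter(void *arg)
{
    /* Sleep only while the word still holds the expected value 0;
       the kernel re-checks it under the hash-bucket lock. */
    while (atomic_load(&flag) == 0)
        futex(&flag, FUTEX_WAIT_PRIVATE, 0);
    return NULL;
}

int futex_handshake_demo(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);

    usleep(10000);                        /* let the waiter block (not required for correctness) */
    atomic_store(&flag, 1);               /* change the word first... */
    futex(&flag, FUTEX_WAKE_PRIVATE, 1);  /* ...then wake one waiter */

    pthread_join(t, NULL);
    return (int) atomic_load(&flag);
}
```

Changing the word before calling FUTEX_WAKE is essential: a waiter that races in between will see the new value and refuse to sleep.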

3.2 do_futex --- The Command Dispatch Hub

long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
              u32 __user *uaddr2, u32 val2, u32 val3)
{
    unsigned int flags = futex_to_flags(op);
    int cmd = op & FUTEX_CMD_MASK;

    pr_debug("do_futex: pid=%d, tid=%d, cmd=%d, uaddr=%p, val=%u, flags=0x%x\n",
             current->tgid, current->pid, cmd, uaddr, val, flags);

    if (flags & FLAGS_CLOCKRT) {
        if (cmd != FUTEX_WAIT_BITSET &&
            cmd != FUTEX_WAIT_REQUEUE_PI &&
            cmd != FUTEX_LOCK_PI2)
            return -ENOSYS;
    }

    switch (cmd) {
    case FUTEX_WAIT:
        val3 = FUTEX_BITSET_MATCH_ANY;
        fallthrough;
    case FUTEX_WAIT_BITSET:
        return futex_wait(uaddr, flags, val, timeout, val3);
    case FUTEX_WAKE:
        val3 = FUTEX_BITSET_MATCH_ANY;
        fallthrough;
    case FUTEX_WAKE_BITSET:
        return futex_wake(uaddr, flags, val, val3);
    case FUTEX_REQUEUE:
        return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, NULL, 0);
    case FUTEX_CMP_REQUEUE:
        return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 0);
    case FUTEX_WAKE_OP:
        return futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
    case FUTEX_LOCK_PI:
        flags |= FLAGS_CLOCKRT;
        fallthrough;
    case FUTEX_LOCK_PI2:
        return futex_lock_pi(uaddr, flags, timeout, 0);
    case FUTEX_UNLOCK_PI:
        return futex_unlock_pi(uaddr, flags);
    case FUTEX_TRYLOCK_PI:
        return futex_lock_pi(uaddr, flags, NULL, 1);
    case FUTEX_WAIT_REQUEUE_PI:
        val3 = FUTEX_BITSET_MATCH_ANY;
        return futex_wait_requeue_pi(uaddr, flags, val, timeout, val3, uaddr2);
    case FUTEX_CMP_REQUEUE_PI:
        return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 1);
    }
    return -ENOSYS;
}

do_futex is the dispatch core of the futex subsystem. The supported operations:

Command                              Purpose
FUTEX_WAIT / FUTEX_WAIT_BITSET       sleep while the futex word equals the expected value
FUTEX_WAKE / FUTEX_WAKE_BITSET       wake waiters
FUTEX_REQUEUE / FUTEX_CMP_REQUEUE    move waiters from one futex to another
FUTEX_WAKE_OP                        conditional wake (a condition-variable optimization)
FUTEX_LOCK_PI / FUTEX_LOCK_PI2       acquire a priority-inheritance lock
FUTEX_UNLOCK_PI                      release a priority-inheritance lock
FUTEX_TRYLOCK_PI                     non-blocking PI lock acquisition
FUTEX_WAIT_REQUEUE_PI                wait-and-requeue for PI condition variables
FUTEX_CMP_REQUEUE_PI                 PI requeue with a compare

3.3 futex_wait --- The Kernel Wait Entry

int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
    struct hrtimer_sleeper timeout, *to;
    struct restart_block *restart;
    int ret;

    pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
             current->tgid, current->pid, uaddr, val, bitset, flags);

    to = futex_setup_timer(abs_time, &timeout, flags,
                           current->timer_slack_ns);

    ret = __futex_wait(uaddr, flags, val, to, bitset);

    if (!to)
        return ret;

    hrtimer_cancel(&to->timer);
    destroy_hrtimer_on_stack(&to->timer);

    if (ret == -ERESTARTSYS) {
        restart = &current->restart_block;
        restart->futex.uaddr = uaddr;
        restart->futex.val = val;
        restart->futex.time = *abs_time;
        restart->futex.bitset = bitset;
        restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;

        return set_restart_fn(restart, futex_wait_restart);
    }

    return ret;
}

futex_wait handles the timed-wait case:

  1. Timer setup: futex_setup_timer arms a high-resolution timer (hrtimer) according to abs_time and flags. current->timer_slack_ns is the task's timer-slack value, used for power-management coalescing.
  2. Core wait: __futex_wait performs the actual waiting.
  3. Timer teardown: if a timeout was set, the timer is canceled and destroyed.
  4. Syscall restart: on -ERESTARTSYS (interrupted by a signal), a restart block is filled in so the system call can be restarted automatically after the signal handler runs. This is part of the kernel's general syscall-restart machinery.

3.4 __futex_wait --- The Core Wait Logic

int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
                 struct hrtimer_sleeper *to, u32 bitset)
{
    struct futex_q q = futex_q_init;
    struct futex_hash_bucket *hb;
    int ret;

    if (!bitset)
        return -EINVAL;

    q.bitset = bitset;

retry:
    ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
    if (ret)
        return ret;

    futex_wait_queue(hb, &q, to);

    if (!futex_unqueue(&q))
        return 0;

    if (to && !to->task)
        return -ETIMEDOUT;

    if (!signal_pending(current))
        goto retry;

    return -ERESTARTSYS;
}

__futex_wait is the heart of futex waiting; its flow is:

  1. Initialize futex_q: futex_q represents one wait request; it holds the waiter's task struct, the key, the plist node, and so on.

  2. Check bitset: bitset supports selective wakeups (FUTEX_WAKE_BITSET); 0 is invalid.

  3. Wait setup (futex_wait_setup): the crucial step that re-verifies the futex word still holds the expected value before sleeping, preventing lost wakeups.

  4. Queue and sleep (futex_wait_queue): enqueue the task on the hash bucket's wait list and go to sleep.

  5. Determine why we woke:

    • If futex_unqueue returns false, we were woken normally (already removed from the queue); return 0.
    • If a timeout was armed and to->task is NULL, the timer expired; return -ETIMEDOUT.
    • If no signal is pending, this was a spurious wakeup; jump back to retry.
    • If a signal is pending, return -ERESTARTSYS.

3.5 futex_wait_setup --- The Key to Preventing Lost Wakeups

static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
                            struct futex_q *q, struct futex_hash_bucket **hb)
{
    u32 uval;
    int ret;
    
retry:
    ret = get_futex_value_locked(&uval, uaddr);
    if (ret)
        return ret;

    if (uval != val)
        return -EWOULDBLOCK;

    q->key = FUTEX_KEY_INIT;
    ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q->key, FUTEX_READ);
    if (unlikely(ret != 0))
        return ret;

    *hb = hash_futex(&q->key);

    spin_lock(&(*hb)->lock);

    ret = get_futex_value_locked(&uval, uaddr);
    if (ret) {
        spin_unlock(&(*hb)->lock);
        return ret;
    }

    if (uval != val) {
        spin_unlock(&(*hb)->lock);
        return -EWOULDBLOCK;
    }

    __futex_queue(q, *hb);
    return 0;
}

futex_wait_setup is the cornerstone of futex correctness; it closes the classic check-then-wait race:

  1. First read of the futex word: get_futex_value_locked safely reads the value at uaddr from userspace.
  2. Value check: if the value no longer equals val, the lock was released (and perhaps retaken) between glibc's store and the kernel entry; return -EWOULDBLOCK immediately (glibc retries).
  3. Compute the key: get_futex_key derives a globally unique futex_key from the user address. For a private futex the key combines the mm_struct pointer and the offset within the page; for a shared futex it combines the inode sequence number and the page offset.
  4. Hash lookup: hash_futex hashes the key to pick a bucket in the global futex_queues array.
  5. Lock and re-read: after taking the bucket spinlock, the futex word is read again. This matters: before the spinlock was held, another CPU may have changed the value.
  6. Second value check: if the value changed again, drop the spinlock and return -EWOULDBLOCK.
  7. Enqueue: __futex_queue inserts the futex_q into the bucket's plist (priority-ordered list).

This double-check pattern guarantees that if the futex word changes at any point, the waiter never goes to sleep incorrectly, so no wakeup is lost.

3.6 futex_wait_queue --- Going to Sleep

void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
                      struct hrtimer_sleeper *timeout)
{
    set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
    futex_queue(q, hb);

    if (timeout)
        hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);

    if (likely(!plist_node_empty(&q->list))) {
        if (!timeout || timeout->task)
            schedule();
    }
    __set_current_state(TASK_RUNNING);
}

futex_wait_queue performs the actual sleep:

  1. Set the task state: set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE) marks the task interruptibly asleep and freezable during suspend/hibernate. set_current_state is used instead of a plain assignment because it includes a memory barrier, ordering the state change before the queue insertion becomes visible.
  2. Enqueue: futex_queue adds q to the bucket's plist; the plist is priority-ordered, supporting priority-aware wakeups.
  3. Start the timer: if a timeout exists, the hrtimer is started.
  4. Check and schedule: if q->list is non-empty (the task is still queued) and no timer has expired, schedule() voluntarily yields the CPU.
  5. Back to running: once woken, the task state is set to TASK_RUNNING.

An important race is handled here: between set_current_state and schedule(), another CPU may try to wake this task, setting its state and removing it from the queue. The memory barrier in set_current_state and the one implied by the spin_unlock paired with futex_queue make this race safe.

3.7 The Futex Hash Table and the Key Mechanism

The central futex data structure is the global hash table futex_queues, sized at boot according to available memory. Each bucket contains a spinlock and a plist (priority-ordered list).

The design of futex_key is quite elegant:

union futex_key {
    struct {
        u64 i_seq;
        unsigned long pgoff;
        unsigned int offset;
    } shared;      /* for process-shared futexes */
    struct {
        union {
            struct mm_struct *mm;
            u64 __tmp;
        };
        unsigned long address;
        unsigned int offset;
    } private;     /* for process-private futexes */
    struct {
        u64 ptr;
        unsigned long word;
        unsigned int offset;
    } both;        /* generic access */
};

For a process-private futex (FUTEX_PRIVATE_FLAG), the key combines the mm_struct pointer and the virtual address. Since each process has its own address space, identical virtual addresses in different processes cannot collide.

For a process-shared futex, the key combines the file's inode sequence number and the offset within the mapping. This is why a shared futex must live in a file-backed mapping or a shared anonymous mapping.

The implementation of get_futex_key involves page-table walks and VMA (virtual memory area) lookups, handling many edge cases (unmapped pages, file mappings, special mappings, and so on).


Chapter 4: The System-Call Layer --- The Bridge from glibc to the Kernel

4.1 Expanding lll_futex_syscall

In glibc, macros such as lll_futex_wait and lll_futex_wake ultimately expand to lll_futex_syscall, which expands to an inline-assembly system call:

#define lll_futex_syscall(nargs, futexp, op, ...)                      \
  ({                                                                    \
    long int __ret = INTERNAL_SYSCALL (futex, nargs, futexp, op,      \
                                       __VA_ARGS__);                    \
    (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (__ret))               \
     ? -INTERNAL_SYSCALL_ERRNO (__ret) : 0);                            \
  })

INTERNAL_SYSCALL expands further into the architecture-specific syscall instruction. On x86_64:

#define internal_syscall4(number, arg1, arg2, arg3, arg4)              \
({                                                                      \
    unsigned long int resultvar;                                        \
    TYPEFY (arg4, __arg4) = ARGIFY (arg4);                              \
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);                              \
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);                              \
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);                              \
    register TYPEFY (arg4, _a4) asm ("r10") = __arg4;                   \
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;                   \
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;                   \
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;                   \
    asm volatile (                                                      \
    "syscall\n\t"                                                       \
    : "=a" (resultvar)                                                  \
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4)         \
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);                        \
    (long int) resultvar;                                               \
})

This shows the standard x86_64 Linux system-call ABI:

  • the syscall number goes in rax (via the "0" (number) constraint, reusing the output operand)
  • arguments 1-6 go in rdi, rsi, rdx, r10, r8, r9 respectively
  • the syscall instruction enters the kernel
  • memory and REGISTERS_CLOBBERED_BY_SYSCALL in the clobber list tell the compiler which resources the call may modify

4.2 Handling futex_wait Return Values

In glibc's futex_wait wrapper:

static __always_inline int
futex_wait (unsigned int *futex_word, unsigned int expected, int private)
{
    int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
    switch (err)
    {
    case 0:
    case -EAGAIN:
    case -EINTR:
        return -err;
    case -ETIMEDOUT:
    case -EFAULT:
    case -EINVAL:
    case -ENOSYS:
    default:
        futex_fatal_error ();
    }
}

Return-value handling:

  • 0: woken normally
  • -EAGAIN: the futex value did not match the expected value (caught by the double check in futex_wait_setup)
  • -EINTR: interrupted by a signal
  • -ETIMEDOUT: timed out (cannot happen here, since the timeout passed in is NULL)
  • -EFAULT / -EINVAL / -ENOSYS: fatal errors; futex_fatal_error() aborts the process

Chapter 5: The Complete Logic Chain, Summarized

5.1 The Uncontended Fast Path

pthread_mutex_lock()
  └── type == PTHREAD_MUTEX_TIMED_NP
        └── LLL_MUTEX_LOCK_OPTIMIZED()
              ├── SINGLE_THREAD_P && private == LLL_PRIVATE && __lock == 0
              │     └── __lock = 1  [no atomic op, no syscall]
              └── else
                    └── lll_lock()
                          └── atomic CAS: 0 -> 1
                                └── success, return

On this path, with no contention, a single thread, or a successful first attempt, no system call is involved at all: at most one atomic CAS, which makes it extremely fast.

5.2 The Contended Slow Path

pthread_mutex_lock()
  └── type == PTHREAD_MUTEX_TIMED_NP
        └── LLL_MUTEX_LOCK_OPTIMIZED()
              └── lll_lock()
                    └── CAS fails
                          └── __lll_lock_wait()
                                ├── atomic_exchange: __lock -> 2
                                │     └── if the old value was 0, acquired; return
                                └── otherwise
                                      └── futex_wait(__lock, 2, private)
                                            └── syscall(SYS_futex, FUTEX_WAIT, ...)
                                                  └── kernel: do_futex()
                                                        └── futex_wait()
                                                              ├── futex_wait_setup()
                                                              │     ├── read uaddr
                                                              │     ├── value check
                                                              │     ├── get_futex_key()
                                                              │     ├── hash_futex()
                                                              │     ├── spin_lock(hb)
                                                              │     ├── re-read + re-check
                                                              │     └── __futex_queue()
                                                              ├── futex_wait_queue()
                                                              │     ├── set_current_state(INTERRUPTIBLE)
                                                              │     ├── futex_queue()
                                                              │     ├── hrtimer_start() [if a timeout exists]
                                                              │     └── schedule()
                                                              └── after wakeup
                                                                    ├── futex_unqueue()
                                                                    ├── check for timeout
                                                                    ├── check for signals
                                                                    └── return to userspace

This path involves a full system call, kernel hash-table operations, a task state switch, and scheduling. It is costly, but it is triggered only under real contention.

5.3 健壮锁路径

pthread_mutex_lock()
  └── type == ROBUST_*
        └── __pthread_mutex_lock_full()
              ├── 设置 list_op_pending
              ├── CAS 循环
              │     ├── oldval == 0: 获取成功
              │     ├── FUTEX_OWNER_DIED: 处理死亡,返回 EOWNERDEAD
              │     ├── 已持有: 死锁检测 / 递归处理
              │     └── 否则: 设置 FUTEX_WAITERS, futex_wait()
              ├── 检查 NOTRECOVERABLE
              ├── ENQUEUE_MUTEX()
              └── 清除 list_op_pending

5.4 PI 锁路径

pthread_mutex_lock()
  └── type == PI_*
        └── __pthread_mutex_lock_full()
              ├── 分离 kind 和 robust
              ├── 设置 list_op_pending [如果 robust]
              ├── CAS: 0 -> id
              │     └── 失败: __futex_lock_pi64()
              │           └── syscall(SYS_futex, FUTEX_LOCK_PI, ...)
              │                 └── 内核: futex_lock_pi()
              │                       ├── 优先级继承链处理
              │                       ├── 入 PI 等待队列
              │                       └── 调度等待
              ├── 检查 FUTEX_OWNER_DIED
              ├── 检查 NOTRECOVERABLE
              └── ENQUEUE_MUTEX_PI() [如果 robust]

第六章:关键设计思想与优化策略

6.1 Futex 的核心哲学

Futex(Fast Userspace Mutex)的设计核心在于 "快速路径在用户态,慢速路径在内核态" 的分层策略:

  1. 无竞争时:完全在用户态通过原子操作完成,无需陷入内核,开销与自旋锁相当。
  2. 有竞争时:通过系统调用进入内核,由内核管理等待队列和调度,避免 CPU 忙等浪费。

这种分层使得 pthread_mutex_lock 在大多数实际场景下(锁竞争不激烈)具有极高的性能,同时在竞争激烈时仍能正确、高效地工作。

6.2 三态锁状态机

glibc 的低级锁(lowlevellock)实现使用三态状态机:

状态值 | 状态 | 含义
0 | 未锁定 | 锁可用
1 | 已锁定,无等待者 | 锁被持有,释放时无需唤醒
2 | 已锁定,有等待者 | 锁被持有,释放时需要唤醒

这种设计使得解锁操作能够判断是否需要系统调用:只有当状态为 2 时才需要调用 futex_wake。

6.3 自适应自旋策略

自适应锁通过历史统计动态调整自旋策略:

  • 短临界区:自旋等待通常比进入内核更高效
  • 长临界区:应尽快进入内核睡眠,避免 CPU 浪费

mutex->__data.__spins 的指数移动平均更新使得锁能够"学习"临界区的典型长度,逐步优化自旋策略。

6.4 防止丢失唤醒

futex_wait_setup 的双重检查机制是防止丢失唤醒的关键:

  1. 用户态在调用 futex_wait 前设置 futex 值(如改为 2)
  2. 内核态在 futex_wait_setup 中再次检查值是否仍为预期值
  3. 如果在之间值已变化(如锁被释放),立即返回 EAGAIN,用户态重试

这种设计确保了即使在高并发下,唤醒操作也不会被遗漏。

6.5 系统调用重启

当 futex 等待被信号中断时,内核返回 -ERESTARTSYS。对于带超时的等待,内核的 futex_wait 会在返回前设置 restart block(重启函数为 futex_wait_restart),使信号处理完成后系统调用能携带剩余超时自动重启;无超时的等待则直接按原参数重启。整个过程对应用透明,保证了 pthread_mutex_lock 不会因信号而虚假失败。


第七章:源码中的调试与追踪

7.1 pr_debug 调试输出

为了观察 futex 的调用流程,可以在 Linux 6.8.12 内核 futex 实现的关键路径上插入 pr_debug 调试输出(下例即本文作者在 do_futex 与 futex_wait 入口添加的打印,见源码中的 yym-gaizao 注释):

pr_debug("do_futex: pid=%d, tid=%d, cmd=%d, uaddr=%p, val=%u, flags=0x%x\n",
         current->tgid, current->pid, cmd, uaddr, val, flags);

pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
         current->tgid, current->pid, uaddr, val, bitset, flags);

这些调试输出需要在编译内核时启用 CONFIG_DYNAMIC_DEBUG,并在运行时通过 /proc/dynamic_debug/control(或 debugfs 下的 dynamic_debug/control)打开对应调用点,才能出现在内核日志中。它们对于理解 futex 的调用流程和排查问题非常有价值。

7.2 SystemTap/DTrace 探针

glibc 中使用了 LIBC_PROBE 宏:

LIBC_PROBE (mutex_entry, 1, mutex);
LIBC_PROBE (mutex_acquired, 1, mutex);

这些探针可以在不重新编译的情况下,通过 SystemTap 或 DTrace 动态插入探针,监控互斥锁的获取和释放行为,对于性能分析和故障排查非常有用。


第八章:版本演进与兼容性

8.1 glibc 版本控制

#if PTHREAD_MUTEX_VERSIONS
libc_hidden_ver (___pthread_mutex_lock, __pthread_mutex_lock)
# ifndef SHARED
strong_alias (___pthread_mutex_lock, __pthread_mutex_lock)
# endif
versioned_symbol (libpthread, ___pthread_mutex_lock, pthread_mutex_lock,
                  GLIBC_2_0);

glibc 使用符号版本控制(symbol versioning)来保持 ABI 兼容性。___pthread_mutex_lock 是内部实现,__pthread_mutex_lock 是兼容符号,pthread_mutex_lock 是对外暴露的符号,版本为 GLIBC_2_0。这允许 glibc 在更新实现的同时,保持与旧二进制文件的兼容性。

8.2 与历史版本的差异

相比早期版本(如 glibc 2.34 之前),glibc-2.39 的主要改进包括:

  • 更完善的 elision 支持
  • 改进的自适应锁算法
  • 更好的健壮锁和 PI 锁集成
  • 对新型 CPU 指令集(如 ARM LSE、x86 TSX)的优化

结语

从 glibc-2.39 的 ___pthread_mutex_lock 到 Linux 6.8.12 内核的 futex_wait,pthread_mutex_lock 的实现展现了一个高度优化、层次分明的同步原语设计。整个链路从用户态的原子操作快速路径,到内核态的哈希队列和调度管理,涵盖了无竞争优化、自适应自旋、优先级继承、健壮性恢复、超时处理、信号中断恢复等多种复杂场景。

理解这条完整的逻辑链条,不仅有助于编写高效、正确的多线程程序,也为深入理解操作系统内核的同步机制、内存模型和调度原理提供了绝佳的切入点。futex 机制作为 Linux 特有的创新,其"用户态快速路径 + 内核态慢速路径"的分层思想,至今仍是操作系统同步原语设计的典范。


附录:相关源码

# define LLL_MUTEX_LOCK(mutex)						\
  lll_lock ((mutex)->__data.__lock, PTHREAD_MUTEX_PSHARED (mutex))
# define LLL_MUTEX_LOCK_OPTIMIZED(mutex) lll_mutex_lock_optimized (mutex)
# define LLL_MUTEX_TRYLOCK(mutex) \
  lll_trylock ((mutex)->__data.__lock)
# define LLL_ROBUST_MUTEX_LOCK_MODIFIER 0
# define LLL_MUTEX_LOCK_ELISION(mutex) \
  lll_lock_elision ((mutex)->__data.__lock, (mutex)->__data.__elision, \
		   PTHREAD_MUTEX_PSHARED (mutex))
# define LLL_MUTEX_TRYLOCK_ELISION(mutex) \
  lll_trylock_elision((mutex)->__data.__lock, (mutex)->__data.__elision, \
		   PTHREAD_MUTEX_PSHARED (mutex))
# define PTHREAD_MUTEX_LOCK ___pthread_mutex_lock
# define PTHREAD_MUTEX_VERSIONS 1
#endif

#ifndef LLL_MUTEX_READ_LOCK
# define LLL_MUTEX_READ_LOCK(mutex) \
  atomic_load_relaxed (&(mutex)->__data.__lock)
#endif

static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
     __attribute_noinline__;

int
PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
{
  /* See concurrency notes regarding mutex type which is loaded from __kind
     in struct __pthread_mutex_s in sysdeps/nptl/bits/thread-shared-types.h.  */
  unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);

  LIBC_PROBE (mutex_entry, 1, mutex);

  if (__builtin_expect (type & ~(PTHREAD_MUTEX_KIND_MASK_NP
				 | PTHREAD_MUTEX_ELISION_FLAGS_NP), 0))
    return __pthread_mutex_lock_full (mutex);

  if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_NP))
    {
      FORCE_ELISION (mutex, goto elision);
    simple:
      /* Normal mutex.  */
      LLL_MUTEX_LOCK_OPTIMIZED (mutex);
      assert (mutex->__data.__owner == 0);
    }
#if ENABLE_ELISION_SUPPORT
  else if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_ELISION_NP))
    {
  elision: __attribute__((unused))
      /* This case can never happen on a system without elision,
         as the mutex type initialization functions will not
	 allow to set the elision flags.  */
      /* Don't record owner or users for elision case.  This is a
         tail call.  */
      return LLL_MUTEX_LOCK_ELISION (mutex);
    }
#endif
  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
			     == PTHREAD_MUTEX_RECURSIVE_NP, 1))
    {
      /* Recursive mutex.  */
      pid_t id = THREAD_GETMEM (THREAD_SELF, tid);

      /* Check whether we already hold the mutex.  */
      if (mutex->__data.__owner == id)
	{
	  /* Just bump the counter.  */
	  if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
	    /* Overflow of the counter.  */
	    return EAGAIN;

	  ++mutex->__data.__count;

	  return 0;
	}

      /* We have to get the mutex.  */
      LLL_MUTEX_LOCK_OPTIMIZED (mutex);

      assert (mutex->__data.__owner == 0);
      mutex->__data.__count = 1;
    }
  else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
			  == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
    {
      if (LLL_MUTEX_TRYLOCK (mutex) != 0)
	{
	  int cnt = 0;
	  int max_cnt = MIN (max_adaptive_count (),
			     mutex->__data.__spins * 2 + 10);
	  int spin_count, exp_backoff = 1;
	  unsigned int jitter = get_jitter ();
	  do
	    {
	      /* In each loop, spin count is exponential backoff plus
		 random jitter, random range is [0, exp_backoff-1].  */
	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
	      cnt += spin_count;
	      if (cnt >= max_cnt)
		{
		  /* If cnt exceeds max spin count, just go to wait
		     queue.  */
		  LLL_MUTEX_LOCK (mutex);
		  break;
		}
	      do
		atomic_spin_nop ();
	      while (--spin_count > 0);
	      /* Prepare for next loop.  */
	      exp_backoff = get_next_backoff (exp_backoff);
	    }
	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
		 || LLL_MUTEX_TRYLOCK (mutex) != 0);

	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
	}
      assert (mutex->__data.__owner == 0);
    }
  else
    {
      pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
      assert (PTHREAD_MUTEX_TYPE (mutex) == PTHREAD_MUTEX_ERRORCHECK_NP);
      /* Check whether we already hold the mutex.  */
      if (__glibc_unlikely (mutex->__data.__owner == id))
	return EDEADLK;
      goto simple;
    }

  pid_t id = THREAD_GETMEM (THREAD_SELF, tid);

  /* Record the ownership.  */
  mutex->__data.__owner = id;
#ifndef NO_INCR
  ++mutex->__data.__nusers;
#endif

  LIBC_PROBE (mutex_acquired, 1, mutex);

  return 0;
}

static int
__pthread_mutex_lock_full (pthread_mutex_t *mutex)
{
  int oldval;
  pid_t id = THREAD_GETMEM (THREAD_SELF, tid);

  switch (PTHREAD_MUTEX_TYPE (mutex))
    {
    case PTHREAD_MUTEX_ROBUST_RECURSIVE_NP:
    case PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP:
    case PTHREAD_MUTEX_ROBUST_NORMAL_NP:
    case PTHREAD_MUTEX_ROBUST_ADAPTIVE_NP:
      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
		     &mutex->__data.__list.__next);
      /* We need to set op_pending before starting the operation.  Also
	 see comments at ENQUEUE_MUTEX.  */
      __asm ("" ::: "memory");

      oldval = mutex->__data.__lock;
      /* This is set to FUTEX_WAITERS iff we might have shared the
	 FUTEX_WAITERS flag with other threads, and therefore need to keep it
	 set to avoid lost wake-ups.  We have the same requirement in the
	 simple mutex algorithm.
	 We start with value zero for a normal mutex, and FUTEX_WAITERS if we
	 are building the special case mutexes for use from within condition
	 variables.  */
      unsigned int assume_other_futex_waiters = LLL_ROBUST_MUTEX_LOCK_MODIFIER;
      while (1)
	{
	  /* Try to acquire the lock through a CAS from 0 (not acquired) to
	     our TID | assume_other_futex_waiters.  */
	  if (__glibc_likely (oldval == 0))
	    {
	      oldval
	        = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
	            id | assume_other_futex_waiters, 0);
	      if (__glibc_likely (oldval == 0))
		break;
	    }

	  if ((oldval & FUTEX_OWNER_DIED) != 0)
	    {
	      /* The previous owner died.  Try locking the mutex.  */
	      int newval = id;
#ifdef NO_INCR
	      /* We are not taking assume_other_futex_waiters into account
		 here simply because we'll set FUTEX_WAITERS anyway.  */
	      newval |= FUTEX_WAITERS;
#else
	      newval |= (oldval & FUTEX_WAITERS) | assume_other_futex_waiters;
#endif

	      newval
		= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
						       newval, oldval);

	      if (newval != oldval)
		{
		  oldval = newval;
		  continue;
		}

	      /* We got the mutex.  */
	      mutex->__data.__count = 1;
	      /* But it is inconsistent unless marked otherwise.  */
	      mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;

	      /* We must not enqueue the mutex before we have acquired it.
		 Also see comments at ENQUEUE_MUTEX.  */
	      __asm ("" ::: "memory");
	      ENQUEUE_MUTEX (mutex);
	      /* We need to clear op_pending after we enqueue the mutex.  */
	      __asm ("" ::: "memory");
	      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

	      /* Note that we deliberately exit here.  If we fall
		 through to the end of the function __nusers would be
		 incremented which is not correct because the old
		 owner has to be discounted.  If we are not supposed
		 to increment __nusers we actually have to decrement
		 it here.  */
#ifdef NO_INCR
	      --mutex->__data.__nusers;
#endif

	      return EOWNERDEAD;
	    }

	  /* Check whether we already hold the mutex.  */
	  if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
	    {
	      int kind = PTHREAD_MUTEX_TYPE (mutex);
	      if (kind == PTHREAD_MUTEX_ROBUST_ERRORCHECK_NP)
		{
		  /* We do not need to ensure ordering wrt another memory
		     access.  Also see comments at ENQUEUE_MUTEX. */
		  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
				 NULL);
		  return EDEADLK;
		}

	      if (kind == PTHREAD_MUTEX_ROBUST_RECURSIVE_NP)
		{
		  /* We do not need to ensure ordering wrt another memory
		     access.  */
		  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
				 NULL);

		  /* Just bump the counter.  */
		  if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
		    /* Overflow of the counter.  */
		    return EAGAIN;

		  ++mutex->__data.__count;

		  return 0;
		}
	    }

	  /* We cannot acquire the mutex nor has its owner died.  Thus, try
	     to block using futexes.  Set FUTEX_WAITERS if necessary so that
	     other threads are aware that there are potentially threads
	     blocked on the futex.  Restart if oldval changed in the
	     meantime.  */
	  if ((oldval & FUTEX_WAITERS) == 0)
	    {
	      int val = atomic_compare_and_exchange_val_acq
		(&mutex->__data.__lock, oldval | FUTEX_WAITERS, oldval);
	      if (val != oldval)
		{
		  oldval = val;
		  continue;
		}
	      oldval |= FUTEX_WAITERS;
	    }

	  /* It is now possible that we share the FUTEX_WAITERS flag with
	     another thread; therefore, update assume_other_futex_waiters so
	     that we do not forget about this when handling other cases
	     above and thus do not cause lost wake-ups.  */
	  assume_other_futex_waiters |= FUTEX_WAITERS;

	  /* Block using the futex and reload current lock value.  */
	  futex_wait ((unsigned int *) &mutex->__data.__lock, oldval,
		      PTHREAD_ROBUST_MUTEX_PSHARED (mutex));
	  oldval = mutex->__data.__lock;
	}

      /* We have acquired the mutex; check if it is still consistent.  */
      if (__builtin_expect (mutex->__data.__owner
			    == PTHREAD_MUTEX_NOTRECOVERABLE, 0))
	{
	  /* This mutex is now not recoverable.  */
	  mutex->__data.__count = 0;
	  int private = PTHREAD_ROBUST_MUTEX_PSHARED (mutex);
	  lll_unlock (mutex->__data.__lock, private);
	  /* FIXME This violates the mutex destruction requirements.  See
	     __pthread_mutex_unlock_full.  */
	  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
	  return ENOTRECOVERABLE;
	}

      mutex->__data.__count = 1;
      /* We must not enqueue the mutex before we have acquired it.
	 Also see comments at ENQUEUE_MUTEX.  */
      __asm ("" ::: "memory");
      ENQUEUE_MUTEX (mutex);
      /* We need to clear op_pending after we enqueue the mutex.  */
      __asm ("" ::: "memory");
      THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
      break;

    /* The PI support requires the Linux futex system call.  If that's not
       available, pthread_mutex_init should never have allowed the type to
       be set.  So it will get the default case for an invalid type.  */
#ifdef __NR_futex
    case PTHREAD_MUTEX_PI_RECURSIVE_NP:
    case PTHREAD_MUTEX_PI_ERRORCHECK_NP:
    case PTHREAD_MUTEX_PI_NORMAL_NP:
    case PTHREAD_MUTEX_PI_ADAPTIVE_NP:
    case PTHREAD_MUTEX_PI_ROBUST_RECURSIVE_NP:
    case PTHREAD_MUTEX_PI_ROBUST_ERRORCHECK_NP:
    case PTHREAD_MUTEX_PI_ROBUST_NORMAL_NP:
    case PTHREAD_MUTEX_PI_ROBUST_ADAPTIVE_NP:
      {
	int kind, robust;
	{
	  /* See concurrency notes regarding __kind in struct __pthread_mutex_s
	     in sysdeps/nptl/bits/thread-shared-types.h.  */
	  int mutex_kind = atomic_load_relaxed (&(mutex->__data.__kind));
	  kind = mutex_kind & PTHREAD_MUTEX_KIND_MASK_NP;
	  robust = mutex_kind & PTHREAD_MUTEX_ROBUST_NORMAL_NP;
	}

	if (robust)
	  {
	    /* Note: robust PI futexes are signaled by setting bit 0.  */
	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending,
			   (void *) (((uintptr_t) &mutex->__data.__list.__next)
				     | 1));
	    /* We need to set op_pending before starting the operation.  Also
	       see comments at ENQUEUE_MUTEX.  */
	    __asm ("" ::: "memory");
	  }

	oldval = mutex->__data.__lock;

	/* Check whether we already hold the mutex.  */
	if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
	  {
	    if (kind == PTHREAD_MUTEX_ERRORCHECK_NP)
	      {
		/* We do not need to ensure ordering wrt another memory
		   access.  */
		THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
		return EDEADLK;
	      }

	    if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
	      {
		/* We do not need to ensure ordering wrt another memory
		   access.  */
		THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

		/* Just bump the counter.  */
		if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
		  /* Overflow of the counter.  */
		  return EAGAIN;

		++mutex->__data.__count;

		return 0;
	      }
	  }

	int newval = id;
# ifdef NO_INCR
	newval |= FUTEX_WAITERS;
# endif
	oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
						      newval, 0);

	if (oldval != 0)
	  {
	    /* The mutex is locked.  The kernel will now take care of
	       everything.  */
	    int private = (robust
			   ? PTHREAD_ROBUST_MUTEX_PSHARED (mutex)
			   : PTHREAD_MUTEX_PSHARED (mutex));
	    int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* unused  */,
				       NULL, private);
	    if (e == ESRCH || e == EDEADLK)
	      {
		assert (e != EDEADLK
			|| (kind != PTHREAD_MUTEX_ERRORCHECK_NP
			    && kind != PTHREAD_MUTEX_RECURSIVE_NP));
		/* ESRCH can happen only for non-robust PI mutexes where
		   the owner of the lock died.  */
		assert (e != ESRCH || !robust);

		/* Delay the thread indefinitely.  */
		while (1)
		  __futex_abstimed_wait64 (&(unsigned int){0}, 0,
					   0 /* ignored */, NULL, private);
	      }

	    oldval = mutex->__data.__lock;

	    assert (robust || (oldval & FUTEX_OWNER_DIED) == 0);
	  }

	if (__glibc_unlikely (oldval & FUTEX_OWNER_DIED))
	  {
	    atomic_fetch_and_acquire (&mutex->__data.__lock, ~FUTEX_OWNER_DIED);

	    /* We got the mutex.  */
	    mutex->__data.__count = 1;
	    /* But it is inconsistent unless marked otherwise.  */
	    mutex->__data.__owner = PTHREAD_MUTEX_INCONSISTENT;

	    /* We must not enqueue the mutex before we have acquired it.
	       Also see comments at ENQUEUE_MUTEX.  */
	    __asm ("" ::: "memory");
	    ENQUEUE_MUTEX_PI (mutex);
	    /* We need to clear op_pending after we enqueue the mutex.  */
	    __asm ("" ::: "memory");
	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

	    /* Note that we deliberately exit here.  If we fall
	       through to the end of the function __nusers would be
	       incremented which is not correct because the old owner
	       has to be discounted.  If we are not supposed to
	       increment __nusers we actually have to decrement it here.  */
# ifdef NO_INCR
	    --mutex->__data.__nusers;
# endif

	    return EOWNERDEAD;
	  }

	if (robust
	    && __builtin_expect (mutex->__data.__owner
				 == PTHREAD_MUTEX_NOTRECOVERABLE, 0))
	  {
	    /* This mutex is now not recoverable.  */
	    mutex->__data.__count = 0;

	    futex_unlock_pi ((unsigned int *) &mutex->__data.__lock,
			     PTHREAD_ROBUST_MUTEX_PSHARED (mutex));

	    /* To the kernel, this will be visible after the kernel has
	       acquired the mutex in the syscall.  */
	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
	    return ENOTRECOVERABLE;
	  }

	mutex->__data.__count = 1;
	if (robust)
	  {
	    /* We must not enqueue the mutex before we have acquired it.
	       Also see comments at ENQUEUE_MUTEX.  */
	    __asm ("" ::: "memory");
	    ENQUEUE_MUTEX_PI (mutex);
	    /* We need to clear op_pending after we enqueue the mutex.  */
	    __asm ("" ::: "memory");
	    THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);
	  }
      }
      break;
#endif  /* __NR_futex.  */

    case PTHREAD_MUTEX_PP_RECURSIVE_NP:
    case PTHREAD_MUTEX_PP_ERRORCHECK_NP:
    case PTHREAD_MUTEX_PP_NORMAL_NP:
    case PTHREAD_MUTEX_PP_ADAPTIVE_NP:
      {
	/* See concurrency notes regarding __kind in struct __pthread_mutex_s
	   in sysdeps/nptl/bits/thread-shared-types.h.  */
	int kind = atomic_load_relaxed (&(mutex->__data.__kind))
	  & PTHREAD_MUTEX_KIND_MASK_NP;

	oldval = mutex->__data.__lock;

	/* Check whether we already hold the mutex.  */
	if (mutex->__data.__owner == id)
	  {
	    if (kind == PTHREAD_MUTEX_ERRORCHECK_NP)
	      return EDEADLK;

	    if (kind == PTHREAD_MUTEX_RECURSIVE_NP)
	      {
		/* Just bump the counter.  */
		if (__glibc_unlikely (mutex->__data.__count + 1 == 0))
		  /* Overflow of the counter.  */
		  return EAGAIN;

		++mutex->__data.__count;

		return 0;
	      }
	  }

	int oldprio = -1, ceilval;
	do
	  {
	    int ceiling = (oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK)
			  >> PTHREAD_MUTEX_PRIO_CEILING_SHIFT;

	    if (__pthread_current_priority () > ceiling)
	      {
		if (oldprio != -1)
		  __pthread_tpp_change_priority (oldprio, -1);
		return EINVAL;
	      }

	    int retval = __pthread_tpp_change_priority (oldprio, ceiling);
	    if (retval)
	      return retval;

	    ceilval = ceiling << PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
	    oldprio = ceiling;

	    oldval
	      = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
#ifdef NO_INCR
						     ceilval | 2,
#else
						     ceilval | 1,
#endif
						     ceilval);

	    if (oldval == ceilval)
	      break;

	    do
	      {
		oldval
		  = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
							 ceilval | 2,
							 ceilval | 1);

		if ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval)
		  break;

		if (oldval != ceilval)
		  futex_wait ((unsigned int * ) &mutex->__data.__lock,
			      ceilval | 2,
			      PTHREAD_MUTEX_PSHARED (mutex));
	      }
	    while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
							ceilval | 2, ceilval)
		   != ceilval);
	  }
	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);

	assert (mutex->__data.__owner == 0);
	mutex->__data.__count = 1;
      }
      break;

    default:
      /* Correct code cannot set any other type.  */
      return EINVAL;
    }

  /* Record the ownership.  */
  mutex->__data.__owner = id;
#ifndef NO_INCR
  ++mutex->__data.__nusers;
#endif

  LIBC_PROBE (mutex_acquired, 1, mutex);

  return 0;
}

#if PTHREAD_MUTEX_VERSIONS
libc_hidden_ver (___pthread_mutex_lock, __pthread_mutex_lock)
# ifndef SHARED
strong_alias (___pthread_mutex_lock, __pthread_mutex_lock)
# endif
versioned_symbol (libpthread, ___pthread_mutex_lock, pthread_mutex_lock,
		  GLIBC_2_0);

static __always_inline int
futex_wait (unsigned int *futex_word, unsigned int expected, int private)
{
  int err = lll_futex_timed_wait (futex_word, expected, NULL, private);
  switch (err)
    {
    case 0:
    case -EAGAIN:
    case -EINTR:
      return -err;

    case -ETIMEDOUT: /* Cannot have happened as we provided no timeout.  */
    case -EFAULT: /* Must have been caused by a glibc or application bug.  */
    case -EINVAL: /* Either due to wrong alignment or due to the timeout not
		     being normalized.  Must have been caused by a glibc or
		     application bug.  */
    case -ENOSYS: /* Must have been caused by a glibc bug.  */
    /* No other errors are documented at this time.  */
    default:
      futex_fatal_error ();
    }
}

# define lll_futex_timed_wait(futexp, val, timeout, private)     \
  lll_futex_syscall (4, futexp,                                 \
		     __lll_private_flag (FUTEX_WAIT, private),  \
		     val, timeout)

# define lll_futex_syscall(nargs, futexp, op, ...)                      \
  ({                                                                    \
    long int __ret = INTERNAL_SYSCALL (futex, nargs, futexp, op, 	\
				       __VA_ARGS__);                    \
    (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (__ret))         	\
     ? -INTERNAL_SYSCALL_ERRNO (__ret) : 0);                     	\
  })

#define INTERNAL_SYSCALL(name, nr, args...)				\
	internal_syscall##nr (SYS_ify (name), args)
    
#undef internal_syscall4
#define internal_syscall4(number, arg1, arg2, arg3, arg4)		\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg4, __arg4) = ARGIFY (arg4);			 	\
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);			 	\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg4, _a4) asm ("r10") = __arg4;			\
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;			\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4)		\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
		const struct __kernel_timespec __user *, utime,
		u32 __user *, uaddr2, u32, val3)
{
	int ret, cmd = op & FUTEX_CMD_MASK;
	ktime_t t, *tp = NULL;
	struct timespec64 ts;

	if (utime && futex_cmd_has_timeout(cmd)) {
		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
			return -EFAULT;
		if (get_timespec64(&ts, utime))
			return -EFAULT;
		ret = futex_init_timeout(cmd, op, &ts, &t);
		if (ret)
			return ret;
		tp = &t;
	}

	return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}

long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
		u32 __user *uaddr2, u32 val2, u32 val3)
{
	unsigned int flags = futex_to_flags(op);
	int cmd = op & FUTEX_CMD_MASK;

	// yym-gaizao
	pr_debug("do_futex: pid=%d, tid=%d, cmd=%d, uaddr=%p, val=%u, flags=0x%x\n",
		 current->tgid, current->pid, cmd, uaddr, val, flags);

	if (flags & FLAGS_CLOCKRT) {
		if (cmd != FUTEX_WAIT_BITSET &&
		    cmd != FUTEX_WAIT_REQUEUE_PI &&
		    cmd != FUTEX_LOCK_PI2)
			return -ENOSYS;
	}

	switch (cmd) {
	case FUTEX_WAIT:
		val3 = FUTEX_BITSET_MATCH_ANY;
		fallthrough;
	case FUTEX_WAIT_BITSET:
		return futex_wait(uaddr, flags, val, timeout, val3);
	case FUTEX_WAKE:
		val3 = FUTEX_BITSET_MATCH_ANY;
		fallthrough;
	case FUTEX_WAKE_BITSET:
		return futex_wake(uaddr, flags, val, val3);
	case FUTEX_REQUEUE:
		return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, NULL, 0);
	case FUTEX_CMP_REQUEUE:
		return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 0);
	case FUTEX_WAKE_OP:
		return futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
	case FUTEX_LOCK_PI:
		flags |= FLAGS_CLOCKRT;
		fallthrough;
	case FUTEX_LOCK_PI2:
		return futex_lock_pi(uaddr, flags, timeout, 0);
	case FUTEX_UNLOCK_PI:
		return futex_unlock_pi(uaddr, flags);
	case FUTEX_TRYLOCK_PI:
		return futex_lock_pi(uaddr, flags, NULL, 1);
	case FUTEX_WAIT_REQUEUE_PI:
		val3 = FUTEX_BITSET_MATCH_ANY;
		return futex_wait_requeue_pi(uaddr, flags, val, timeout, val3,
					     uaddr2);
	case FUTEX_CMP_REQUEUE_PI:
		return futex_requeue(uaddr, flags, uaddr2, flags, val, val2, &val3, 1);
	}
	return -ENOSYS;
}

int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
	struct hrtimer_sleeper timeout, *to;
	struct restart_block *restart;
	int ret;

	// yym-gaizao
	pr_debug("futex_wait: pid=%d, tid=%d, uaddr=%p, val=%u, bitset=0x%x, flags=0x%x\n",
		 current->tgid, current->pid, uaddr, val, bitset, flags);

	to = futex_setup_timer(abs_time, &timeout, flags,
			       current->timer_slack_ns);

	ret = __futex_wait(uaddr, flags, val, to, bitset);

	/* No timeout, nothing to clean up. */
	if (!to)
		return ret;

	hrtimer_cancel(&to->timer);
	destroy_hrtimer_on_stack(&to->timer);

	if (ret == -ERESTARTSYS) {
		restart = &current->restart_block;
		restart->futex.uaddr = uaddr;
		restart->futex.val = val;
		restart->futex.time = *abs_time;
		restart->futex.bitset = bitset;
		restart->futex.flags = flags | FLAGS_HAS_TIMEOUT;

		return set_restart_fn(restart, futex_wait_restart);
	}

	return ret;
}

int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
		 struct hrtimer_sleeper *to, u32 bitset)
{
	struct futex_q q = futex_q_init;
	struct futex_hash_bucket *hb;
	int ret;

	if (!bitset)
		return -EINVAL;

	q.bitset = bitset;

retry:
	/*
	 * Prepare to wait on uaddr. On success, it holds hb->lock and q
	 * is initialized.
	 */
	ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
	if (ret)
		return ret;

	/* futex_queue and wait for wakeup, timeout, or a signal. */
	futex_wait_queue(hb, &q, to);

	/* If we were woken (and unqueued), we succeeded, whatever. */
	if (!futex_unqueue(&q))
		return 0;

	if (to && !to->task)
		return -ETIMEDOUT;

	/*
	 * We expect signal_pending(current), but we might be the
	 * victim of a spurious wakeup as well.
	 */
	if (!signal_pending(current))
		goto retry;

	return -ERESTARTSYS;
}

/**
 * futex_wait_queue() - futex_queue() and wait for wakeup, timeout, or signal
 * @hb:		the futex hash bucket, must be locked by the caller
 * @q:		the futex_q to queue up on
 * @timeout:	the prepared hrtimer_sleeper, or null for no timeout
 */
void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
			    struct hrtimer_sleeper *timeout)
{
	/*
	 * The task state is guaranteed to be set before another task can
	 * wake it. set_current_state() is implemented using smp_store_mb() and
	 * futex_queue() calls spin_unlock() upon completion, both serializing
	 * access to the hash list and forcing another memory barrier.
	 */
	set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
	futex_queue(q, hb);

	/* Arm the timer */
	if (timeout)
		hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);

	/*
	 * If we have been removed from the hash list, then another task
	 * has tried to wake us, and we can skip the call to schedule().
	 */
	if (likely(!plist_node_empty(&q->list))) {
		/*
		 * If the timer has already expired, current will already be
		 * flagged for rescheduling. Only call schedule if there
		 * is no timeout, or if it has yet to expire.
		 */
		if (!timeout || timeout->task)
			schedule();
	}
	__set_current_state(TASK_RUNNING);
}