Chapter 5: Concurrency and Race Conditions (12): Locking Traps

Continuing from the previous post, Chapter 5: Concurrency and Race Conditions (11): Reader/Writer Spinlocks, let's go ahead.

Many years of experience with locks (experience that predates Linux) have shown that locking can be very hard to get right. Managing concurrency is an inherently tricky undertaking, and there are many ways of making mistakes. In this section, we take a quick look at things that can go wrong.

Ambiguous Rules

As has already been said above, a proper locking scheme requires clear and explicit rules. When you create a resource that can be accessed concurrently, you should define which lock will control that access. Locking should really be laid out at the beginning; it can be a hard thing to retrofit in afterward. Time taken at the outset usually is paid back generously at debugging time.

As you write your code, you will doubtless encounter several functions that all require access to structures protected by a specific lock. At this point, you must be careful: if one function acquires a lock and then calls another function that also attempts to acquire the lock, your code deadlocks. Neither semaphores nor spinlocks allow a lock holder to acquire the lock a second time; should you attempt to do so, things simply hang.
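
The self-deadlock described above can be observed safely in user space. The following is only a sketch: it uses a POSIX error-checking mutex as a stand-in for a kernel semaphore or spinlock (an assumption made so that the example reports the problem instead of hanging), and relock_error_code is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper: take the same lock twice from one thread and
 * return the error code reported by the second acquisition attempt.
 */
int relock_error_code(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;
    int err;

    pthread_mutexattr_init(&attr);
    /* An error-checking mutex reports a self-deadlock instead of hanging. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&lock);          /* first acquisition succeeds */
    err = pthread_mutex_lock(&lock);    /* second attempt by the holder fails */

    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return err;
}
```

With an error-checking mutex the second call is expected to return EDEADLK. A kernel semaphore or spinlock gives no such warning; the second acquisition simply never returns.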

To make your locking work properly, you have to write some functions with the assumption that their caller has already acquired the relevant lock(s). Usually, only your internal, static functions can be written in this way; functions called from outside must handle locking explicitly. When you write internal functions that make assumptions about locking, do yourself (and anybody else who works with your code) a favor and document those assumptions explicitly. It can be very hard to come back months later and figure out whether you need to hold a lock to call a particular function or not.

In the case of scull, the design decision taken was to require all functions invoked directly from system calls to acquire the semaphore applying to the device structure that is accessed. All internal functions, which are only called from other scull functions, can then assume that the semaphore has been properly acquired.
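
The split between locking entry points and lock-assuming internal helpers might look roughly like this. It is a user-space sketch, not scull's actual code: struct demo_dev, demo_replace, and the *_locked helpers are hypothetical names, and a pthread mutex stands in for the device semaphore.

```c
#include <pthread.h>

/* Hypothetical stand-in for a device structure such as scull's. */
struct demo_dev {
    pthread_mutex_t lock;   /* protects size */
    long size;
};

/* Internal helper. Caller must hold dev->lock. */
static void demo_reset_locked(struct demo_dev *dev)
{
    dev->size = 0;
}

/* Internal helper. Caller must hold dev->lock. */
static void demo_grow_locked(struct demo_dev *dev, long delta)
{
    dev->size += delta;
}

/*
 * Entry point (the analogue of a function invoked from a system call):
 * takes the lock exactly once, then calls only *_locked helpers, so the
 * lock is never acquired a second time.
 */
long demo_replace(struct demo_dev *dev, long new_size)
{
    long result;

    pthread_mutex_lock(&dev->lock);
    demo_reset_locked(dev);
    demo_grow_locked(dev, new_size);
    result = dev->size;
    pthread_mutex_unlock(&dev->lock);
    return result;
}
```

The _locked suffix is just one possible naming convention; what matters is that the locking assumption is visible at every call site and stated in a comment.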

Lock Ordering Rules

In systems with a large number of locks (and the kernel is becoming such a system), it is not unusual for code to need to hold more than one lock at once. If some sort of computation must be performed using two different resources, each of which has its own lock, there is often no alternative to acquiring both locks.

Taking multiple locks can be dangerous, however. If you have two locks, called Lock1 and Lock2, and code needs to acquire both at the same time, you have a potential deadlock. Just imagine one thread locking Lock1 while another simultaneously takes Lock2. Then each thread tries to get the one it doesn't have. Both threads will deadlock.

The solution to this problem is usually simple: when multiple locks must be acquired, they should always be acquired in the same order. As long as this convention is followed, simple deadlocks like the one described above can be avoided. However, following lock ordering rules can be easier said than done. It is very rare that such rules are actually written down anywhere. Often the best you can do is to see what other code does.
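
A fixed acquisition order can be sketched as follows, again in user space with pthread mutexes standing in for kernel locks. The names lock1, lock2, move_units, and run_ordered_demo are illustrative only. Both worker threads take lock1 before lock2, so neither can end up holding one lock while waiting for the other in the opposite order.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
static long shared_a;   /* protected by lock1 */
static long shared_b;   /* protected by lock2 */

/* Every path that needs both locks takes lock1 first, then lock2. */
static void *move_units(void *arg)
{
    long n = *(long *)arg;

    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&lock1);
        pthread_mutex_lock(&lock2);
        shared_a--;
        shared_b++;
        pthread_mutex_unlock(&lock2);
        pthread_mutex_unlock(&lock1);
    }
    return NULL;
}

/*
 * Run two workers concurrently. Returns the final value of shared_b,
 * or -1 if a thread could not be created.
 */
long run_ordered_demo(long iterations)
{
    pthread_t t1, t2;

    shared_a = 0;
    shared_b = 0;
    if (pthread_create(&t1, NULL, move_units, &iterations) != 0)
        return -1;
    if (pthread_create(&t2, NULL, move_units, &iterations) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_b;
}
```

If one of the workers took lock2 first instead, the program could hang intermittently, which is exactly the scenario described for Lock1 and Lock2.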

A couple of rules of thumb can help. If you must obtain a lock that is local to your code (a device lock, say) along with a lock belonging to a more central part of the kernel, take your lock first. If you have a combination of semaphores and spinlocks, you must, of course, obtain the semaphore(s) first; calling down (which can sleep) while holding a spinlock is a serious error. But most of all, try to avoid situations where you need more than one lock.
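
The "semaphore first, spinlock second" rule can be illustrated with a user-space analogue. This sketch assumes a Linux system with POSIX unnamed semaphores and pthread spinlocks; sem_wait plays the role of down, and struct stats_dev with its functions is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical device with one sleeping lock and one spinning lock. */
struct stats_dev {
    sem_t sem;                  /* sleeping lock, may block the caller */
    pthread_spinlock_t spin;    /* spinning lock, short critical sections */
    int events;                 /* updated under both locks in this sketch */
};

/* Returns 0 on success, -1 if either lock could not be initialized. */
int stats_dev_init(struct stats_dev *dev)
{
    dev->events = 0;
    if (sem_init(&dev->sem, 0, 1) != 0)
        return -1;
    if (pthread_spin_init(&dev->spin, PTHREAD_PROCESS_PRIVATE) != 0) {
        sem_destroy(&dev->sem);
        return -1;
    }
    return 0;
}

/* Returns the updated event count, or -1 if the semaphore wait failed. */
int stats_dev_record(struct stats_dev *dev)
{
    int seen;

    /* 1. Sleeping lock first: blocking here is harmless, no spinlock is held. */
    if (sem_wait(&dev->sem) != 0)
        return -1;

    /* 2. Spinlock second: nothing between lock and unlock may sleep. */
    pthread_spin_lock(&dev->spin);
    seen = ++dev->events;
    pthread_spin_unlock(&dev->spin);

    sem_post(&dev->sem);
    return seen;
}
```

Reversing the two acquisitions would mean calling a function that can sleep while spinning is in effect, which in the kernel is the serious error the text warns about.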

Supplementary notes:

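Why re-acquiring a lock deadlocks

Semaphores and spinlocks are both non-reentrant: when a thread tries to acquire a lock it already holds, it either blocks forever (semaphore) or spins forever (spinlock). Note that mutex_trylock does not provide reentrancy; it only attempts the acquisition without blocking, and it fails if the mutex is already held, even by the caller. Code that really needs reentrant behavior has to implement it itself (for example, by tracking the owner and a nesting count), but reentrant locks should be avoided in the kernel wherever possible because they add complexity.

Practical lock-ordering principles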

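Unwritten lock-ordering conventions commonly followed in the kernel community:
1. Acquire semaphores before spinlocks (a spinlock holder must not sleep, so the spinlock has to be taken last).
2. Acquire locks in address order (for dynamically allocated locks, compare their memory addresses and take the lower one first).
3. Acquire locks from the narrowest scope to the widest (for example, the device lock first, then the subsystem lock, and finally the global lock).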
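
Address ordering can be sketched like this, in user space with pthread mutexes. lock_pair and unlock_pair are hypothetical helpers, not kernel APIs. Whichever way the caller passes the two locks, the one at the lower address is taken first.

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Lock two mutexes, lower address first. Returns the mutex that was
 * taken first. Passing the same mutex twice locks it only once.
 */
pthread_mutex_t *lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_t *first, *second;

    if (a == b) {
        pthread_mutex_lock(a);
        return a;
    }
    /* Compare as integers: ordering unrelated pointers with < is not defined in C. */
    first = ((uintptr_t)a < (uintptr_t)b) ? a : b;
    second = (first == a) ? b : a;

    pthread_mutex_lock(first);
    pthread_mutex_lock(second);
    return first;
}

/* Release both mutexes; the release order does not affect deadlock safety. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (b != a)
        pthread_mutex_unlock(b);
}
```

Because lock_pair(x, y) and lock_pair(y, x) acquire in the same order, two threads calling it with swapped arguments cannot produce the Lock1/Lock2 deadlock.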

Best practice for documenting locking rules

Add explicit comments to internal functions, for example:

/*
 * scull_internal_func: process device data
 * @dev: target device structure
 * Note: the caller must hold the dev->sem semaphore before calling!
 */
static void scull_internal_func(struct scull_dev *dev) { ... }

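Alternatives for multi-lock scenarios

If you must hold multiple locks, the following measures reduce the risk:
- Restructure the code to reduce atomic operations that span several resources.
- Use trylock for non-blocking acquisition (on failure, release the locks already held and retry).
- Shorten the time during which multiple locks are held (hold them together only when necessary and release them as soon as possible).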
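
The trylock approach from the list above might be sketched as follows. This is a user-space illustration with pthread mutexes; lock_both_or_back_off and its max_tries parameter are hypothetical, and real code would need its own policy for how long to retry.

```c
#include <pthread.h>
#include <sched.h>

/*
 * Try to hold both locks at once. Returns 0 with both locks held,
 * or -1 with neither held after max_tries attempts.
 */
int lock_both_or_back_off(pthread_mutex_t *first, pthread_mutex_t *second,
                          int max_tries)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        pthread_mutex_lock(first);
        if (pthread_mutex_trylock(second) == 0)
            return 0;   /* success: the caller now holds both locks */

        /* second is busy: drop first so that its holder can make progress */
        pthread_mutex_unlock(first);
        sched_yield();  /* give other threads a chance before retrying */
    }
    return -1;
}
```

Because the function never waits for the second lock while holding the first, it cannot take part in a deadlock cycle, at the cost of possible retries under contention.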