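What Every Computer Scientist Should Know About Floating-Point Arithmetic (5)

Note: this article is a translation of material on floating-point arithmetic; it was machine-translated and has not been proofread.

The layout has been lightly rearranged; if anything looks wrong, consult the original.

Because of CSDN's length limit, the article is serialized in several parts.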

Exception Handling

The topics discussed up to now have primarily concerned systems implications of accuracy and precision. Trap handlers also raise some interesting systems issues. The IEEE standard strongly recommends that users be able to specify a trap handler for each of the five classes of exceptions, and the section "Trap Handlers" gave some applications of user-defined trap handlers. In the case of invalid operation and division by zero exceptions, the handler should be provided with the operands; for the other exceptions, it should be provided with the exactly rounded result. Depending on the programming language being used, the trap handler might be able to access other variables in the program as well. For all exceptions, the trap handler must be able to identify what operation was being performed and the precision of its destination.

The IEEE standard assumes that operations are conceptually serial and that when an interrupt occurs, it is possible to identify the operation and its operands. On machines that have pipelining or multiple arithmetic units, it may not be enough, when an exception occurs, to have the trap handler examine the program counter. Hardware support for identifying exactly which operation trapped may be necessary.

Another problem is illustrated by the following program fragment.

    x = y*z;    // first multiply
    z = x*w;    // second multiply (may trap)
    a = b + c;  // addition
    d = a/x;    // division

Suppose the second multiply raises an exception, and the trap handler wants to use the value of $a$. On hardware that can do an add and a multiply in parallel, an optimizer would probably move the addition ahead of the second multiply, so that the add can proceed in parallel with the first multiply. Thus when the second multiply traps, $a = b + c$ has already been executed, potentially changing the value of $a$. It would not be reasonable for a compiler to avoid this kind of optimization, because every floating-point operation can potentially trap, and thus virtually all instruction scheduling optimizations would be eliminated. This problem can be avoided by prohibiting trap handlers from accessing any variables of the program directly. Instead, the handler can be given the operands or result as an argument.
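The idea of handing the operands to the handler as arguments can be sketched in a few lines. This is only an illustration, not an IEEE trap mechanism: Python has no floating-point trap handlers, so `ZeroDivisionError` stands in for the trap, and the names `trapping_divide` and `substitute_infinity` are hypothetical.

```python
import math


def trapping_divide(a, b, handler):
    """Divide a by b; on a zero divisor, hand the operands to the handler.

    The handler receives only the values passed here, never the caller's
    variables, so reordering or reassigning those variables cannot affect it.
    """
    try:
        return a / b
    except ZeroDivisionError:
        return handler("divide", a, b)


def substitute_infinity(operation, a, b):
    """Hypothetical handler: NaN for 0/0, otherwise a correctly signed infinity."""
    if a == 0:
        return math.nan
    return math.copysign(math.inf, a) * math.copysign(1.0, b)
```

Because the handler works from its arguments alone, the scheduler is free to move unrelated assignments across the operation that traps.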

But there are still problems. In the fragment $x = y * z;\; z = a + b;$ the two instructions might well be executed in parallel. If the multiply traps, its argument $z$ could already have been overwritten by the addition, especially since addition is usually faster than multiplication. Computer systems that support the IEEE standard must provide some way to save the value of $z$, either in hardware or by having the compiler avoid such a situation in the first place.

W. Kahan has proposed using presubstitution instead of trap handlers to avoid these problems. In this method, the user specifies an exception and the value to be used as the result when that exception occurs. As an example, suppose that in code for computing $(\sin x)/x$, the user decides that $x = 0$ is so rare that it would improve performance to avoid a test for $x = 0$, and instead handle this case when a $0/0$ trap occurs. Using IEEE trap handlers, the user would write a handler that returns a value of 1 and install it before computing $(\sin x)/x$. Using presubstitution, the user would specify that when an invalid operation occurs, the value 1 should be used. Kahan calls this presubstitution, because the value to be used must be specified before the exception occurs. When using trap handlers, the value to be returned can be computed when the trap occurs.
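As a rough software emulation of presubstitution for $(\sin x)/x$, the sketch below fills a table with the substitute values before the division runs. It is an assumption-laden illustration: Python raises `ZeroDivisionError` rather than signalling IEEE exceptions, and the table keys and the names `PRESUBSTITUTED`, `divide_with_presubstitution` and `sinc` are hypothetical.

```python
import math

# Hypothetical presubstitution table: exception class -> value to use.
# It is filled in before any operation runs, which is the point of the name.
PRESUBSTITUTED = {"invalid": 1.0, "division_by_zero": math.inf}


def divide_with_presubstitution(num, den, table):
    """Return num/den; if the division fails, look the result up in table."""
    try:
        return num / den
    except ZeroDivisionError:
        # 0/0 is an invalid operation; a nonzero numerator is a division by zero.
        kind = "invalid" if num == 0 else "division_by_zero"
        return table[kind]


def sinc(x):
    """Compute sin(x)/x without an explicit test for x == 0."""
    return divide_with_presubstitution(math.sin(x), x, PRESUBSTITUTED)
```

The common path contains no comparison against zero; the table is consulted only when the exceptional case actually occurs.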

The advantage of presubstitution is that it has a straightforward hardware implementation.¹ As soon as the type of exception has been determined, it can be used to index a table that contains the desired result of the operation. Although presubstitution has some attractive attributes, the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers.

¹ The difficulty with presubstitution is that it requires either direct hardware implementation, or continuable floating-point traps if implemented in software. -- Ed.

The Details

A number of claims have been made in this paper concerning properties of floating-point arithmetic. We now proceed to show that floating-point is not black magic, but rather is a straightforward subject whose claims can be verified mathematically. This section is divided into three parts. The first part presents an introduction to error analysis, and provides the details for the section "Rounding Error". The second part explores binary to decimal conversion, filling in some gaps from the section "The IEEE Standard". The third part discusses the Kahan summation formula, which was used as an example in the section "Systems Aspects".

Rounding Error

In the discussion of rounding error, it was stated that a single guard digit is enough to guarantee that addition and subtraction will always be accurate (Theorem 2). We now proceed to verify this fact. Theorem 2 has two parts, one for subtraction and one for addition. The part for subtraction is as follows.

Theorem 9

If $x$ and $y$ are positive floating-point numbers in a format with parameters $\beta$ and $p$, and if subtraction is done with $p + 1$ digits (i.e. one guard digit), then the relative rounding error in the result is less than

$$\left(\frac{\beta}{2} + 1\right)\beta^{-p} = \left(1 + \frac{2}{\beta}\right)\varepsilon \le 2\varepsilon,$$

where $\varepsilon = \frac{\beta}{2}\beta^{-p}$ is the machine epsilon.

Proof

Interchange $x$ and $y$ if necessary so that $x > y$. It is also harmless to scale $x$ and $y$ so that $x$ is represented by $x_0.x_1 \ldots x_{p-1} \times \beta^0$. If $y$ is represented as $y_0.y_1 \ldots y_{p-1}$, then the difference is exact. If $y$ is represented as $0.y_1 \ldots y_p$, then the guard digit ensures that the computed difference will be the exact difference rounded to a floating-point number, so the rounding error is at most $\varepsilon$. In general, let $y = 0.0 \ldots 0y_{k+1} \ldots y_{k+p}$ (with $k$ leading zeros after the radix point), and let $\bar{y}$ be $y$ truncated to $p + 1$ digits. Then

$$y - \bar{y} < (\beta - 1)\left(\beta^{-p-1} + \beta^{-p-2} + \cdots + \beta^{-p-k}\right)$$

From the definition of guard digit, the computed value of $x - y$ is $x - \bar{y}$ rounded to be a floating-point number, that is, $(x - \bar{y}) + \delta$, where the rounding error $\delta$ satisfies

$$|\delta| \le \frac{\beta}{2}\beta^{-p} \tag{16}$$

The exact difference is $x - y$, so the error is $(x - y) - (x - \bar{y} + \delta) = \bar{y} - y + \delta$. There are three cases.

Case 1: If $x - y \ge 1$, then the relative error is bounded by

$$\frac{y - \bar{y} + |\delta|}{x - y} \le \beta^{-p}\left[(\beta - 1)\left(\beta^{-1} + \cdots + \beta^{-k}\right) + \frac{\beta}{2}\right] < \beta^{-p}\left(1 + \frac{\beta}{2}\right) \tag{17}$$

The last step uses $(\beta - 1)\left(\beta^{-1} + \cdots + \beta^{-k}\right) = 1 - \beta^{-k} < 1$.

Case 2: If $x - \bar{y} < 1$, then $\delta = 0$ (no rounding is needed). Since the smallest that $x - y$ can be is

$$1.0 - 0.\underbrace{0 \ldots 0}_{k}\underbrace{\rho \ldots \rho}_{p} > (\beta - 1)\left(\beta^{-1} + \cdots + \beta^{-k}\right), \quad \rho = \beta - 1,$$

the relative error is bounded by

$$\frac{y - \bar{y}}{x - y} < \frac{(\beta - 1)\beta^{-p}\left(\beta^{-1} + \cdots + \beta^{-k}\right)}{(\beta - 1)\left(\beta^{-1} + \cdots + \beta^{-k}\right)} = \beta^{-p} \tag{18}$$
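To make Case 2 concrete, here is a small worked instance. The numbers are chosen purely for illustration and assume $\beta = 10$, $p = 3$, $x = 1.00$ and $y = 0.00998$, so that $k = 2$:

$$\begin{aligned}
\bar{y} &= 0.009 && \text{($y$ truncated to $p + 1 = 4$ digits)} \\
x - \bar{y} &= 0.991 < 1 && \text{(fits in 3 digits, so } \delta = 0\text{)} \\
x - y &= 0.99002 > (\beta - 1)\left(\beta^{-1} + \beta^{-2}\right) = 0.99 \\
\frac{y - \bar{y}}{x - y} &= \frac{0.00098}{0.99002} \approx 9.90 \times 10^{-4} < \beta^{-p} = 10^{-3}
\end{aligned}$$

The relative error sits just under the bound of (18), which suggests the estimate in this case is close to tight.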

Case 3: If $x - y < 1$ but $x - \bar{y} \ge 1$, then the only possibility is $x - \bar{y} = 1$, in which case $\delta = 0$. Bound (18) therefore applies, and the relative error is again bounded by $\beta^{-p} < \left(1 + \frac{\beta}{2}\right)\beta^{-p}$.

Combining the three cases, the relative error is less than $\left(1 + \frac{\beta}{2}\right)\beta^{-p} = \left(1 + \frac{2}{\beta}\right)\varepsilon \le 2\varepsilon$ (since $\beta \ge 2$).
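The bound can also be checked by brute force on a toy format. The following is a minimal sketch, assuming a significand/exponent representation $x = m_x \beta^{e_x - p + 1}$, truncation of $y$ to $p + 1$ digits and round-half-even for the final rounding; all function names are hypothetical.

```python
from fractions import Fraction


def round_to_p_digits(n, beta, p):
    """Round the non-negative integer n to p significant base-beta digits
    (round half to even) and return the rounded integer."""
    drop = 0
    while n >= beta ** (p + drop):
        drop += 1
    if drop == 0:
        return n
    unit = beta ** drop
    q, r = divmod(n, unit)
    if 2 * r > unit or (2 * r == unit and q % 2 == 1):
        q += 1
    return q * unit


def subtract_with_guard_digit(mx, ex, my, ey, beta, p):
    """Return (computed, exact) for x - y with x = mx * beta**(ex - p + 1) and
    y = my * beta**(ey - p + 1), both p-digit floats with x > y > 0."""
    shift = ex - ey
    scale = Fraction(beta) ** (ex - p)       # value of one guard-digit unit
    x_units = mx * beta                      # x, exact, with one extra digit
    y_units = (my * beta) // beta ** shift   # y truncated to p + 1 digits
    computed = round_to_p_digits(x_units - y_units, beta, p) * scale
    exact = (mx * Fraction(beta) ** (ex - p + 1)
             - my * Fraction(beta) ** (ey - p + 1))
    return computed, exact


def worst_relative_error(beta, p, max_shift):
    """Largest relative error over all pairs x > y of p-digit floats whose
    exponents differ by at most max_shift."""
    worst = Fraction(0)
    lo, hi = beta ** (p - 1), beta ** p
    for shift in range(max_shift + 1):
        for mx in range(lo, hi):
            for my in range(lo, hi):
                if shift == 0 and my >= mx:
                    continue  # need x > y
                computed, exact = subtract_with_guard_digit(
                    mx, 0, my, -shift, beta, p)
                worst = max(worst, abs(computed - exact) / exact)
    return worst


def theorem_9_bound(beta, p):
    """The bound (beta/2 + 1) * beta**(-p) from the theorem."""
    return (Fraction(beta, 2) + 1) * Fraction(beta) ** (-p)
```

For example, `worst_relative_error(10, 2, 4)` enumerates every pair of two-digit decimal floats up to four exponents apart and can be compared with `theorem_9_bound(10, 2)`.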

When $\beta = 2$, the bound is exactly $2\varepsilon$, and this bound is achieved for $x = 1 + 2^{2-p}$ and $y = 2^{1-p} - 2^{1-2p}$ in the limit as $p \to \infty$. When adding numbers of the same sign, a guard digit is not necessary to achieve good accuracy, as the following result shows.

Theorem 10

If $x \ge 0$ and $y \ge 0$, then the relative error in computing $x + y$ is at most $2\varepsilon$, even if no guard digits are used.

Proof

The algorithm for addition with $k$ guard digits is similar to that for subtraction. If $x \ge y$, shift $y$ right until the radix points of $x$ and $y$ are aligned. Discard any digits shifted past the $p + k$ position. Compute the sum of these two $p + k$ digit numbers exactly. Then round to $p$ digits.

We will verify the theorem when no guard digits are used ($k = 0$); the general case is similar. There is no loss of generality in assuming that $x \ge y \ge 0$ and that $x$ is scaled to be of the form $d.dd \ldots d \times \beta^0$.

Case 1: No carry-out. The digits shifted off the end of $y$ have a value less than $\beta^{-p+1}$, and the sum satisfies $x + y \ge x \ge 1$, so the relative error is less than $\beta^{-p+1}/1 = 2\varepsilon$ (since $\varepsilon = \frac{\beta}{2}\beta^{-p}$).

Case 2: Carry-out. The error from shifting is less than $\beta^{-p+1}$, and the rounding error is at most $\frac{1}{2}\beta^{-p+2}$. Because of the carry-out, the sum satisfies $x + y \ge \beta$, so the relative error is less than

$$\frac{\beta^{-p+1} + \frac{1}{2}\beta^{-p+2}}{\beta} = \left(1 + \frac{\beta}{2}\right)\beta^{-p} \le 2\varepsilon$$

Combining the two cases proves the theorem.
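A small worked instance of the carry-out case may help. The numbers are illustrative only and assume $\beta = 10$, $p = 3$, no guard digit, $x = 9.99$ and $y = 0.0999$:

$$\begin{aligned}
y \text{ aligned and cut to 3 digits} &= 0.09, && \text{shift error } 0.0099 < \beta^{-p+1} = 0.01 \\
9.99 + 0.09 &= 10.08 \;\to\; 10.1, && \text{rounding error } 0.02 \le \tfrac{1}{2}\beta^{-p+2} = 0.05 \\
\frac{|10.1 - 10.0899|}{10.0899} &\approx 1.0 \times 10^{-3} && < \left(1 + \tfrac{\beta}{2}\right)\beta^{-p} = 6 \times 10^{-3} \le 2\varepsilon = 10^{-2}
\end{aligned}$$

Here the shift error and the rounding error have opposite signs and partly cancel, so the actual error is well inside the worst-case bound.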

Combining these two theorems (Theorems 9 and 10) gives Theorem 2. Theorem 2 gives the relative error for performing one operation. Comparing the rounding error of $x^2 - y^2$ and $(x + y)(x - y)$ requires knowing the relative error of multiple operations. The relative error of $x \ominus y$ is $\delta_1 = \frac{(x \ominus y) - (x - y)}{x - y}$, which satisfies $|\delta_1| \le 2\varepsilon$. Or to write it another way:

$$x \ominus y = (x - y)(1 + \delta_1), \quad |\delta_1| \le 2\varepsilon \tag{19}$$

Similarly:

$$x \oplus y = (x + y)(1 + \delta_2), \quad |\delta_2| \le 2\varepsilon \tag{20}$$

Assuming that multiplication is performed by computing the exact product and then rounding, the relative error is at most $0.5$ ulp, so

$$u \otimes v = uv(1 + \delta_3), \quad |\delta_3| \le \varepsilon \tag{21}$$
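The three models (19), (20) and (21) can be spot-checked on ordinary double-precision numbers, using exact rational arithmetic as the reference. This is a sketch under stated assumptions: Python floats are IEEE doubles ($\beta = 2$, $p = 53$, so $\varepsilon = 2^{-53}$), the sampling range is arbitrary, and the names are hypothetical. A random sample illustrates the bounds but does not prove them.

```python
import random
from fractions import Fraction

EPSILON = Fraction(1, 2 ** 53)  # (beta/2) * beta**(-p) for beta = 2, p = 53


def relative_error(computed, exact):
    """Relative error of a float against an exact rational value."""
    return abs(Fraction(computed) - exact) / abs(exact)


def worst_errors(samples, seed=0):
    """Largest observed relative error of -, + and * on random doubles."""
    rng = random.Random(seed)
    worst = {"sub": Fraction(0), "add": Fraction(0), "mul": Fraction(0)}
    for _ in range(samples):
        x = rng.uniform(0.001, 1000.0)
        y = rng.uniform(0.001, 1000.0)
        fx, fy = Fraction(x), Fraction(y)
        if x != y:  # the relative error of an exact zero is undefined
            worst["sub"] = max(worst["sub"], relative_error(x - y, fx - fy))
        worst["add"] = max(worst["add"], relative_error(x + y, fx + fy))
        worst["mul"] = max(worst["mul"], relative_error(x * y, fx * fy))
    return worst
```

On IEEE hardware every basic operation is exactly rounded, so the observed errors for subtraction and addition stay within $\varepsilon$, comfortably inside the $2\varepsilon$ that the guard-digit analysis allows.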

for any floating-point numbers $u$ and $v$. Putting these three equations together (letting $u = x \ominus y$ and $v = x \oplus y$) gives

$$(x \ominus y) \otimes (x \oplus y) = (x - y)(1 + \delta_1)(x + y)(1 + \delta_2)(1 + \delta_3) \tag{22}$$

So the relative error incurred when computing $(x - y)(x + y)$ is

$$\frac{(x \ominus y) \otimes (x \oplus y) - (x^2 - y^2)}{x^2 - y^2} = (1 + \delta_1)(1 + \delta_2)(1 + \delta_3) - 1$$

This relative error is equal to $\delta_1 + \delta_2 + \delta_3 + \delta_1\delta_2 + \delta_1\delta_3 + \delta_2\delta_3 + \delta_1\delta_2\delta_3$, which is bounded by $5\varepsilon + 8\varepsilon^2$. In other words, the maximum relative error is about 5 rounding errors (since $\varepsilon^2$ is almost negligible).
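For readers who want to see where the coefficients come from, the bound follows term by term from $|\delta_1| \le 2\varepsilon$, $|\delta_2| \le 2\varepsilon$ and $|\delta_3| \le \varepsilon$:

$$\begin{aligned}
|\delta_1 + \delta_2 + \delta_3| &\le 2\varepsilon + 2\varepsilon + \varepsilon = 5\varepsilon \\
|\delta_1\delta_2 + \delta_1\delta_3 + \delta_2\delta_3| &\le 4\varepsilon^2 + 2\varepsilon^2 + 2\varepsilon^2 = 8\varepsilon^2 \\
|\delta_1\delta_2\delta_3| &\le 4\varepsilon^3
\end{aligned}$$

Strictly speaking the cubic term $4\varepsilon^3$ should be added to $5\varepsilon + 8\varepsilon^2$; the stated bound appears to drop it as negligible, which does not affect the conclusion of about 5 rounding errors.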

A similar analysis of $(x \otimes x) \ominus (y \otimes y)$ cannot result in a small value for the relative error, because when two nearby values of $x$ and $y$ are plugged into $x^2 - y^2$, the relative error will usually be quite large. Another way to see this is to try to duplicate the analysis that worked on $(x \ominus y) \otimes (x \oplus y)$, yielding

$$\begin{aligned}
(x \otimes x) \ominus (y \otimes y) &= \left[x^2(1 + \delta_1) - y^2(1 + \delta_2)\right](1 + \delta_3) \\
&= \left((x^2 - y^2)(1 + \delta_1) + (\delta_1 - \delta_2)y^2\right)(1 + \delta_3)
\end{aligned}$$

When $x$ and $y$ are nearby, the error term $(\delta_1 - \delta_2)y^2$ can be as large as the result $x^2 - y^2$. These computations formally justify our claim that $(x - y)(x + y)$ is more accurate than $x^2 - y^2$.
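The difference between the two forms is easy to observe in double precision. The sketch below is illustrative only: the inputs $x = 1 + 2^{-30}$ and $y = 1$ are chosen so that every step can be followed by hand, and the function names are hypothetical. Exact rational arithmetic supplies the reference value.

```python
from fractions import Fraction


def relative_error(computed, exact):
    """Relative error of a float against an exact rational value."""
    return abs(Fraction(computed) - exact) / abs(exact)


def compare_forms(x, y):
    """Return the relative errors of x*x - y*y and of (x - y)*(x + y)."""
    exact = Fraction(x) ** 2 - Fraction(y) ** 2
    squares = x * x - y * y
    factored = (x - y) * (x + y)
    return relative_error(squares, exact), relative_error(factored, exact)


x, y = 1.0 + 2.0 ** -30, 1.0
err_squares, err_factored = compare_forms(x, y)
print(float(err_squares), float(err_factored))
```

For these inputs $x \otimes x$ rounds away the $2^{-60}$ term of $x^2 = 1 + 2^{-29} + 2^{-60}$, so the first form has a relative error of roughly $2^{-31}$, about four million times $\varepsilon = 2^{-53}$, while every operation in the factored form happens to be exact.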

We next turn to an analysis of the formula for the area of a triangle. In order to estimate the maximum error that can occur when computing with (7), the following fact will be needed.

Theorem 11

If subtraction is performed with a guard digit, and $y/2 \le x \le 2y$, then $x \ominus y$ is computed exactly.

Proof

Note that if $x$ and $y$ have the same exponent, then certainly $x \ominus y$ is exact. Otherwise, from the condition $y/2 \le x \le 2y$, the exponents can differ by at most 1. Scale and interchange $x$ and $y$ if necessary so that $0 \le y \le x$, and $x$ is represented as $x_0.x_1 \ldots x_{p-1}$ and $y$ as $0.y_1 \ldots y_p$ (the digit $y_p$ occupies the guard position). Then the algorithm for computing $x \ominus y$ will compute $x - y$ exactly and round to a floating-point number. If the difference is of the form $0.d_1 \ldots d_p$, it is already $p$ digits long and no rounding is necessary. Since $x \le 2y$, $x - y \le y$, and since $y$ is of the form $0.d_1 \ldots d_p$, so is $x - y$. Thus $x \ominus y = x - y$ is exact.
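The theorem can be spot-checked in double precision by comparing each floating-point difference with the exact rational difference. This is a minimal sketch with hypothetical names and an arbitrary sampling range; it illustrates the theorem on random inputs rather than proving it.

```python
import random
from fractions import Fraction


def subtraction_is_exact(x, y):
    """True if the floating-point result x - y equals the exact difference."""
    return Fraction(x - y) == Fraction(x) - Fraction(y)


def count_inexact(samples, seed=0):
    """Count sampled pairs with y/2 <= x <= 2y whose difference is rounded."""
    rng = random.Random(seed)
    inexact = 0
    for _ in range(samples):
        y = rng.uniform(1e-3, 1e3)
        x = rng.uniform(y / 2, 2 * y)
        if not (y / 2 <= x <= 2 * y):
            continue  # guard against rounding at the ends of the interval
        if not subtraction_is_exact(x, y):
            inexact += 1
    return inexact
```

Outside the range $y/2 \le x \le 2y$ exactness is no longer guaranteed: `subtraction_is_exact(1.0, 2.0 ** -60)` is false, because $1 - 2^{-60}$ rounds back to $1$.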

When $\beta > 2$, the hypothesis of Theorem 11 cannot be replaced by $y/\beta \le x \le \beta y$; the stronger condition $y/2 \le x \le 2y$ is still necessary. The analysis of the error in $(x - y)(x + y)$, immediately following the proof of Theorem 10, used the fact that the relative error in the basic operations of addition and subtraction is small (namely equations (19) and (20)). This is the most common kind of error analysis. However, analyzing formula (7) requires something more, namely Theorem 11, as the following proof will show.

Theorem 12

If subtraction uses a guard digit, and if $a$, $b$ and $c$ are the sides of a triangle ($a \ge b \ge c$), then the relative error in computing $(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))$ is at most $16\varepsilon$, provided $\varepsilon < 0.005$.

Proof

Let's examine the factors one by one.

First factor $a + (b + c)$: By Theorem 10, $b \oplus c = (b + c)(1 + \delta_1)$, where $|\delta_1| \le 2\varepsilon$. Then $a \oplus (b \oplus c) = (a + (b \oplus c))(1 + \delta_2) = (a + (b + c)(1 + \delta_1))(1 + \delta_2)$ with $|\delta_2| \le 2\varepsilon$, and thus

$$(a + b + c)(1 - 2\varepsilon)^2 \le a \oplus (b \oplus c) \le (a + b + c)(1 + 2\varepsilon)^2$$

This means that there is an $\eta_1$ such that $a \oplus (b \oplus c) = (a + b + c)(1 + \eta_1)^2$ with $|\eta_1| \le 2\varepsilon$.

Second factor $c - (a - b)$: Since $a$, $b$ and $c$ are the sides of a triangle, $a \le b + c$. Combining this with the ordering $c \le b \le a$ gives $a \le b + c \le 2b \le 2a$, hence $b \le a \le 2b$, so $a$ and $b$ satisfy the hypothesis of Theorem 11. Thus $a \ominus b = a - b$ is exact, and the remaining subtraction is harmless. By Theorem 9, $c \ominus (a \ominus b) = (c - (a - b))(1 + \eta_2)$ with $|\eta_2| \le 2\varepsilon$.

Third factor $c + (a - b)$: This is the sum of two exact positive quantities. By Theorem 10, $c \oplus (a \ominus b) = (c + (a - b))(1 + \eta_3)$ with $|\eta_3| \le 2\varepsilon$.

Fourth factor $a + (b - c)$: This is handled like the first factor, using both Theorem 9 and Theorem 10. Theorem 11 does not apply to $b \ominus c$ in general, because the sides of a triangle need not satisfy $b \le 2c$. Instead, Theorem 9 gives $b \ominus c = (b - c)(1 + \delta)$ with $|\delta| \le 2\varepsilon$, and Theorem 10 bounds the addition, so $a \oplus (b \ominus c) = (a + (b - c))(1 + \eta_4)^2$ with $|\eta_4| \le 2\varepsilon$.

Combining the four factors, and assuming that multiplication is exactly rounded so that $x \otimes y = xy(1 + \zeta)$ with $|\zeta| \le \varepsilon$, gives

$$\text{computed product} = \text{exact product} \cdot E$$

where

$$E = (1 + \eta_1)^2(1 + \eta_2)(1 + \eta_3)(1 + \eta_4)^2(1 + \zeta_1)(1 + \zeta_2)(1 + \zeta_3)$$

An upper bound for $E$ is $(1 + 2\varepsilon)^6(1 + \varepsilon)^3$, whose first-order expansion is $1 + 15\varepsilon$. The higher-order terms do not need to be ignored: $\left((1 + 2\varepsilon)^6(1 + \varepsilon)^3 - 1\right)/\varepsilon$ is a polynomial in $\varepsilon$ with positive coefficients, hence increasing, and at $\varepsilon = 0.005$ its value is approximately $15.5$. Therefore, for $\varepsilon < 0.005$,

$$(1 + 2\varepsilon)^6(1 + \varepsilon)^3 < 1 + 16\varepsilon$$

Similarly, $E \ge (1 - 2\varepsilon)^6(1 - \varepsilon)^3 \ge 1 - 15\varepsilon > 1 - 16\varepsilon$. Thus the relative error is at most $16\varepsilon$.
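The bound of Theorem 12 can be observed directly in double precision, where $\varepsilon = 2^{-53}$. The following is a sketch with hypothetical function names; the sample triangles are arbitrary, and exact rational arithmetic on the same floating-point inputs provides the reference product.

```python
from fractions import Fraction

EPSILON = Fraction(1, 2 ** 53)  # machine epsilon for IEEE double precision


def product_of_factors(a, b, c):
    """Evaluate (a+(b+c))(c-(a-b))(c+(a-b))(a+(b-c)) in floating point.

    Assumes a >= b >= c > 0 are the sides of a non-degenerate triangle;
    the parentheses must be kept exactly as written.
    """
    return (a + (b + c)) * (c - (a - b)) * (c + (a - b)) * (a + (b - c))


def relative_error_of_product(a, b, c):
    """Relative error of the floating-point product against exact rationals."""
    fa, fb, fc = Fraction(a), Fraction(b), Fraction(c)
    exact = ((fa + (fb + fc)) * (fc - (fa - fb))
             * (fc + (fa - fb)) * (fa + (fb - fc)))
    return abs(Fraction(product_of_factors(a, b, c)) - exact) / exact
```

For instance, `relative_error_of_product(9.0, 4.53, 4.53)` evaluates the needle-like triangle with sides 9, 4.53 and 4.53 and can be compared with `16 * EPSILON`.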

Theorem 12 certainly shows that there is no catastrophic cancellation in formula (7). So although it is not necessary to show formula (7) is numerically stable, it is satisfying to have a bound for the entire formula, which is what Theorem 3 of "Cancellation" gives.

Proof of Theorem 3

Let

$$q = (a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))$$

and

$$Q = (a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c)).$$

Then Theorem 12 shows that $Q = q(1 + \delta)$, with $|\delta| \le 16\varepsilon$. It is easy to check that

$$1 - 0.52|\delta| \le \sqrt{1 - |\delta|} \le \sqrt{1 + |\delta|} \le 1 + 0.52|\delta|$$

provided $|\delta| \le 0.04/(0.52)^2 \approx 0.15$, and since $|\delta| \le 16\varepsilon \le 16(0.005) = 0.08$, $\delta$ does satisfy the condition. Thus $\sqrt{Q} = \sqrt{q(1 + \delta)} = \sqrt{q}(1 + \delta_1)$, with $|\delta_1| \le 0.52|\delta| \le 8.5\varepsilon$. If square roots are computed to within $0.5$ ulp, then the error when computing $\sqrt{Q}$ is $(1 + \delta_1)(1 + \delta_2)$ with $|\delta_2| \le \varepsilon$. If $\beta = 2$, then there is no further error committed when dividing by 4. Otherwise, one more factor $1 + \delta_3$ with $|\delta_3| \le \varepsilon$ is necessary for the division, and using the method in the proof of Theorem 12, the final error bound of $(1 + \delta_1)(1 + \delta_2)(1 + \delta_3)$ is dominated by $1 + \delta_4$, with $|\delta_4| \le 11\varepsilon$.¹
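The square-root inequality used in this proof can be verified by squaring. Writing $d = |\delta|$ and noting that $1 - 0.52d > 0$ in the range considered:

$$\begin{aligned}
(1 - 0.52d)^2 = 1 - 1.04d + 0.2704d^2 \le 1 - d
&\iff 0.2704d^2 \le 0.04d
\iff d \le \frac{0.04}{(0.52)^2} \approx 0.148 \\
(1 + 0.52d)^2 = 1 + 1.04d + 0.2704d^2 &\ge 1 + d \quad \text{for all } d \ge 0
\end{aligned}$$

So only the lower bound needs the restriction on $|\delta|$, and the constant $0.52$ is what makes that restriction as loose as $0.15$.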

¹ Left as an exercise to the reader: extend the proof to bases other than 2. -- Ed.

¹ 留给读者的练习：将该证明扩展到基数不为 2 的情形。（编者注）
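The square-root inequality and its validity threshold can be spot-checked numerically. The following is a minimal sketch, not part of the original proof: the function name `sqrt_bounds_hold`, the 1000-point grid, and the probe value 0.2 are illustrative assumptions.

```python
import math

# Threshold quoted in the text: |delta| <= .04 / (.52)^2, roughly .15
DELTA_MAX = 0.04 / 0.52 ** 2


def sqrt_bounds_hold(delta: float) -> bool:
    """Check 1 - .52|d| <= sqrt(1 - |d|) <= sqrt(1 + |d|) <= 1 + .52|d|."""
    d = abs(delta)
    lower = 1 - 0.52 * d
    upper = 1 + 0.52 * d
    return lower <= math.sqrt(1 - d) <= math.sqrt(1 + d) <= upper


# Sample [0, DELTA_MAX) on a hypothetical grid; the endpoint itself is skipped
# because the leftmost inequality becomes an equality there.
samples = [DELTA_MAX * k / 1000 for k in range(1000)]
all_hold = all(sqrt_bounds_hold(d) for d in samples)

# Beyond the threshold the leftmost inequality is expected to break.
fails_beyond = not sqrt_bounds_hold(0.2)
```

Squaring the leftmost inequality shows why the threshold has this form: $1 - 1.04|\delta| + (.52)^2|\delta|^2 \leq 1 - |\delta|$ reduces to $|\delta| \leq .04/(.52)^2$.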

To make the heuristic explanation immediately following the statement of Theorem 4 precise, the next theorem describes just how closely $\mu(x)$ approximates a constant.

为使定理 4 陈述之后的直观解释更加严谨，下一定理将详细说明 $\mu(x)$ 逼近常数的程度。

Theorem 13

If $\mu(x)=\ln(1+x)/x$, then for $0 \leq x \leq \frac{3}{4}$, $\frac{1}{2} \leq \mu(x) \leq 1$ and the derivative satisfies

若 $\mu(x)=\ln(1+x)/x$，则当 $0 \leq x \leq \frac{3}{4}$ 时，$\frac{1}{2} \leq \mu(x) \leq 1$，且其导数满足：
$$\left|\mu'(x)\right| \leq \frac{1}{2}$$

Proof / 证明

Note that $\mu(x)=1-x/2+x^{2}/3-x^{3}/4+\cdots$ is an alternating series with decreasing terms, so for $x \leq 1$ (and in particular for $x \leq 3/4$), $\mu(x) \geq 1-x/2 \geq 1/2$. It is even easier to see that because the series for $\mu(x)$ is alternating, $\mu(x) \leq 1$. The Taylor series of $\mu'(x)$ is also alternating and has decreasing terms for $x \leq \frac{3}{4}$: $\mu'(x)=-1/2+2x/3-3x^{2}/4+\cdots$. Thus $-\frac{1}{2} \leq \mu'(x) \leq -\frac{1}{2}+2x/3$. For $x \leq 3/4$, $2x/3 \leq 1/2$, so $-\frac{1}{2} \leq \mu'(x) \leq 0$, hence $|\mu'(x)| \leq \frac{1}{2}$.

注意到 $\mu(x)=1-x/2+x^{2}/3-x^{3}/4+\cdots$ 是一个各项递减的交错级数。当 $x \leq 1$ 时（此处 $x \leq 3/4$），有 $\mu(x) \geq 1-x/2 \geq 1/2$。同样，由于该级数为交错级数，易知 $\mu(x) \leq 1$。$\mu'(x)$ 的泰勒级数同样是交错级数，且当 $x \leq \frac{3}{4}$ 时各项递减：$\mu'(x)=-1/2+2x/3-3x^{2}/4+\cdots$。因此 $-\frac{1}{2} \leq \mu'(x) \leq -\frac{1}{2}+2x/3$。当 $x \leq 3/4$ 时，$2x/3 \leq 1/2$，故 $-\frac{1}{2} \leq \mu'(x) \leq 0$，即 $|\mu'(x)| \leq \frac{1}{2}$。
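The two bounds of Theorem 13 can be sanity-checked numerically. This is only a sketch: the closed form of the derivative is obtained by differentiating $\ln(1+x)/x$ directly rather than from the Taylor series used in the proof, and the names `mu`, `mu_prime` and the 1000-point grid are illustrative assumptions.

```python
import math


def mu(x: float) -> float:
    """mu(x) = ln(1 + x) / x for x > 0."""
    return math.log1p(x) / x


def mu_prime(x: float) -> float:
    """Derivative of mu in closed form: 1 / (x (1 + x)) - ln(1 + x) / x^2."""
    return 1.0 / (x * (1.0 + x)) - math.log1p(x) / (x * x)


# Hypothetical grid of 1000 points in (0, 3/4]; x = 0 is skipped because
# mu is only defined there as a limit (mu -> 1, mu' -> -1/2).
grid = [0.75 * k / 1000 for k in range(1, 1001)]

mu_in_range = all(0.5 <= mu(x) <= 1.0 for x in grid)
slope_in_range = all(-0.5 <= mu_prime(x) <= 0.0 for x in grid)
```

A grid check of this kind is no substitute for the series argument; it only guards against a mis-stated constant.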

Proof of Theorem 4

Since the Taylor series for $\ln(1+x)$ is $x-\frac{x^{2}}{2}+\frac{x^{3}}{3}-\cdots$ (an alternating series), $0<x-\ln(1+x)<x^{2}/2$ for $x>0$, so the relative error incurred when approximating $\ln(1+x)$ by $x$ is bounded by $x/2$. If $1 \oplus x=1$, then $|x|<\varepsilon$, so the relative error is bounded by $\varepsilon/2$.

由于 $\ln(1+x)$ 的泰勒级数为 $x-\frac{x^{2}}{2}+\frac{x^{3}}{3}-\cdots$（交错级数），对于 $x>0$，有 $0<x-\ln(1+x)<x^{2}/2$。因此，用 $x$ 逼近 $\ln(1+x)$ 的相对误差上限为 $x/2$。若 $1 \oplus x=1$，则 $|x|<\varepsilon$，此时相对误差上限为 $\varepsilon/2$。
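The step from the absolute bound $x^{2}/2$ to the relative bound $x/2$ is stated without detail. One way to fill it in (a supplementary argument, not taken from the original text) uses the elementary inequality $\ln(1+x) \geq 2x/(2+x)$ for $x \geq 0$:

$$\begin{aligned} f(x) &= \ln(1+x)-\frac{2x}{2+x}, \qquad f(0)=0, \qquad f'(x)=\frac{1}{1+x}-\frac{4}{(2+x)^{2}}=\frac{x^{2}}{(1+x)(2+x)^{2}} \geq 0, \\ \frac{x-\ln(1+x)}{\ln(1+x)} &= \frac{x}{\ln(1+x)}-1 \leq \frac{2+x}{2}-1=\frac{x}{2} \qquad (x>0). \end{aligned}$$

Since $f$ is nondecreasing and vanishes at $0$, it is nonnegative for $x \geq 0$, which gives the inequality used in the second line.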

When $1 \oplus x \neq 1$, define $\hat{x}$ via $1 \oplus x=1+\hat{x}$. Then since $0 \leq x<3/4$, $(1 \oplus x) \ominus 1=\hat{x}$ (exact, by Theorem 11, since $1 \leq 1+\hat{x} \leq 1+3/4=7/4 \leq 2 \times 1$). The computed value of the expression $\ln(1+x)/((1+x)-1)$ is

当 $1 \oplus x \neq 1$ 时，令 $\hat{x}$ 满足 $1 \oplus x=1+\hat{x}$。由于 $0 \leq x<3/4$，根据定理 11（$1 \leq 1+\hat{x} \leq 1+3/4=7/4 \leq 2 \times 1$），有 $(1 \oplus x) \ominus 1=\hat{x}$（精确值）。表达式 $\ln(1+x)/((1+x)-1)$ 的计算值为：
$$\begin{aligned}\frac{\ln(1 \oplus x)}{(1 \oplus x) \ominus 1}(1+\delta_{1})(1+\delta_{2}) &= \frac{\ln(1+\hat{x})}{\hat{x}}(1+\delta_{1})(1+\delta_{2}) \\ &= \mu(\hat{x})(1+\delta_{1})(1+\delta_{2})\end{aligned}$$

where $|\delta_{1}| \leq \varepsilon$ and $|\delta_{2}| \leq \varepsilon$ (since division and logarithm are computed to within 1/2 ulp).

其中 $|\delta_{1}| \leq \varepsilon$ 且 $|\delta_{2}| \leq \varepsilon$（因除法和对数运算精度在 1/2 ulp 以内）。

To estimate $\mu(\hat{x})$, use the mean value theorem: $\mu(\hat{x})-\mu(x)=(\hat{x}-x)\mu'(\xi)$ for some $\xi$ between $x$ and $\hat{x}$. From the definition of $\hat{x}$, $|\hat{x}-x| \leq \varepsilon$ (since $1 \oplus x$ is the rounded value of $1+x$). Combining this with Theorem 13 gives $|\mu(\hat{x})-\mu(x)| \leq \varepsilon/2$, or $|\mu(\hat{x})/\mu(x)-1| \leq \varepsilon/(2|\mu(x)|) \leq \varepsilon$ (since $\mu(x) \geq 1/2$). Thus $\mu(\hat{x})=\mu(x)(1+\delta_{3})$, with $|\delta_{3}| \leq \varepsilon$.

利用中值定理估计 $\mu(\hat{x})$：存在 $x$ 与 $\hat{x}$ 之间的 $\xi$，使得 $\mu(\hat{x})-\mu(x)=(\hat{x}-x)\mu'(\xi)$。由 $\hat{x}$ 的定义可知，$|\hat{x}-x| \leq \varepsilon$（因 $1 \oplus x$ 是 $1+x$ 的舍入值）。结合定理 13 可得 $|\mu(\hat{x})-\mu(x)| \leq \varepsilon/2$，即 $|\mu(\hat{x})/\mu(x)-1| \leq \varepsilon/(2|\mu(x)|) \leq \varepsilon$（因 $\mu(x) \geq 1/2$）。因此 $\mu(\hat{x})=\mu(x)(1+\delta_{3})$，其中 $|\delta_{3}| \leq \varepsilon$。

Finally, multiplying by $x$ introduces a final error $\delta_{4}$ ($|\delta_{4}| \leq \varepsilon$), so the computed value of $\frac{x \cdot \text{LN}(1 \oplus x)}{(1 \oplus x) \ominus 1}$ is

最后，乘以 $x$ 会引入最终误差 $\delta_{4}$（$|\delta_{4}| \leq \varepsilon$），因此 $\frac{x \cdot \text{LN}(1 \oplus x)}{(1 \oplus x) \ominus 1}$ 的计算值为：
$$\frac{x\ln(1+x)}{(1+x)-1}(1+\delta_{1})(1+\delta_{2})(1+\delta_{3})(1+\delta_{4})$$

It is easy to check that if $\varepsilon<0.1$, then $(1+\delta_{1})(1+\delta_{2})(1+\delta_{3})(1+\delta_{4})=1+\delta$, with $|\delta| \leq 5\varepsilon$.

易验证，当 $\varepsilon<0.1$ 时，$(1+\delta_{1})(1+\delta_{2})(1+\delta_{3})(1+\delta_{4})=1+\delta$，且 $|\delta| \leq 5\varepsilon$。
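The formula analysed in this proof can be tried out in IEEE double precision. The sketch below rests on several assumptions: $\varepsilon$ is taken as $2^{-53}$ (half of `sys.float_info.epsilon`), `math.log` stands in for a logarithm that the theorem requires to be accurate to 1/2 ulp (which the platform library does not guarantee), and a 50-digit `decimal` computation serves as the reference. The names `ln1p_theorem4` and `rel_error` and the sample points are illustrative.

```python
import math
import sys
from decimal import Decimal, getcontext

getcontext().prec = 50                 # high-precision reference arithmetic
EPS = sys.float_info.epsilon / 2       # machine epsilon in the text's sense: 2**-53


def ln1p_theorem4(x: float) -> float:
    """Evaluate ln(1 + x) as x * (ln(w) / (w - 1)) with w = 1 (+) x, for 0 <= x < 3/4."""
    w = 1.0 + x                        # 1 (+) x, rounded to the nearest double
    if w == 1.0:
        return x                       # case 1 (+) x == 1: approximate ln(1 + x) by x
    return x * (math.log(w) / (w - 1.0))


def rel_error(approx: float, x: float) -> float:
    """Relative error of approx against a 50-digit reference value of ln(1 + x)."""
    exact = (Decimal(1) + Decimal(x)).ln()
    return float(abs(Decimal(approx) - exact) / exact)


xs = [1e-20, 1e-10, 1e-5, 0.1, 0.5, 0.74]        # hypothetical sample points
worst = max(rel_error(ln1p_theorem4(x), x) for x in xs)
naive = rel_error(math.log(1.0 + 1e-10), 1e-10)  # direct evaluation, for contrast

print(f"worst error of the formula: {worst / EPS:.2f} eps")
print(f"naive log(1 + x) at x = 1e-10: {naive / EPS:.3g} eps")
```

The printed worst-case figure can be compared with the $5\varepsilon$ bound; because `math.log` may be off by slightly more than 1/2 ulp, a value marginally above 5 would not contradict the theorem. The naive evaluation loses accuracy because rounding $1+x$ discards most of the digits of a small $x$, which is the effect the division by $(1 \oplus x) \ominus 1$ compensates for.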

计算机科学家应了解的浮点运算知识（6）-CSDN博客
https://blog.csdn.net/u013669912/article/details/158927889