注:本文为 "线性代数 · SVD" 相关英文引文,机翻未校。
如有内容异常,请看原文。
csdn 篇幅字数限制,分为两篇,此为第一篇。
ON THE EARLY HISTORY OF THE SINGULAR VALUE DECOMPOSITION*
奇异值分解的早期历史*
G. W. STEWART
For Gene Golub on his 15th birthday.
SIAM REVIEW Vol. 35, No. 4, pp. 551-566, December 1993
© 1993 Society for Industrial and Applied Mathematics
Abstract
摘要
This paper surveys the contributions of five mathematicians---Eugenio Beltrami (1835-1899), Camille Jordan (1838-1921), James Joseph Sylvester (1814-1897), Erhard Schmidt (1876-1959), and Hermann Weyl (1885-1955)---who were responsible for establishing the existence of the singular value decomposition and developing its theory.
本文综述了五位数学家的贡献,他们分别是欧金尼奥·贝尔特拉米(Eugenio Beltrami,1835-1899)、卡米耶·若尔当(Camille Jordan,1838-1921)、詹姆斯·约瑟夫·西尔维斯特(James Joseph Sylvester,1814-1897)、埃哈德·施密特(Erhard Schmidt,1876-1959)以及赫尔曼·外尔(Hermann Weyl,1885-1955)。正是这五位数学家证实了奇异值分解的存在性,并推动了其理论的发展。
Key words. singular value decomposition, history
AMS subject classifications. 01A, 15-03, 15A18
1. Introduction
1. 引言
One of the most fruitful ideas in the theory of matrices is that of a matrix decomposition or canonical form. The theoretical utility of matrix decompositions has long been appreciated. More recently, they have become the mainstay of numerical linear algebra, where they serve as computational platforms from which a variety of problems can be solved.
矩阵分解或标准型的概念,是矩阵理论中最具成效的思想之一。长期以来,矩阵分解的理论效用已得到广泛认可。近年来,矩阵分解更是成为数值线性代数的核心基础,它可作为求解各类问题的计算平台。
Of the many useful decompositions, the singular value decomposition---that is, the factorization of a matrix $A$ into the product $U \Sigma V^{H}$ of a unitary matrix $U$, a diagonal matrix $\Sigma$, and another unitary matrix $V$---has assumed a special role. There are several reasons. In the first place, the fact that the decomposition is achieved by unitary matrices makes it an ideal vehicle for discussing the geometry of $n$-space. Second, it is stable; small perturbations in $A$ correspond to small perturbations in $\Sigma$, and conversely. Third, the diagonality of $\Sigma$ makes it easy to determine when $A$ is near to a rank-degenerate matrix; and when it is, the decomposition provides optimal low rank approximations to $A$. Finally, thanks to the pioneering efforts of Gene Golub, there exist efficient, stable algorithms to compute the singular value decomposition.
在众多实用的矩阵分解方法中,奇异值分解(即把矩阵 $A$ 分解为酉矩阵 $U$、对角矩阵 $\Sigma$ 与另一酉矩阵 $V$ 的乘积 $U \Sigma V^{H}$)具有特殊地位,原因如下:首先,由于该分解通过酉矩阵实现,使其成为研究 $n$ 维空间几何性质的理想工具;其次,它具有稳定性,矩阵 $A$ 的微小扰动会对应对角矩阵 $\Sigma$ 的微小扰动,反之亦然;第三,对角矩阵 $\Sigma$ 的结构便于判断矩阵 $A$ 是否接近降秩矩阵,且当 $A$ 接近降秩矩阵时,奇异值分解能为 $A$ 提供最优的低秩近似;最后,得益于吉恩·戈卢布的开创性工作,目前已存在高效、稳定的奇异值分解计算算法。
The purpose of this paper is to survey the contributions of five mathematicians---Eugenio Beltrami (1835-1899), Camille Jordan (1838-1921), James Joseph Sylvester (1814-1897), Erhard Schmidt (1876-1959), and Hermann Weyl (1885-1955)---who were responsible for establishing the existence of the singular value decomposition and developing its theory. Beltrami, Jordan, and Sylvester came to the decomposition through what we should now call linear algebra; Schmidt and Weyl approached it from integral equations. To give this survey context, we will begin with a brief description of the historical background.
本文旨在综述五位数学家的贡献,他们分别是欧金尼奥·贝尔特拉米(1835-1899)、卡米耶·若尔当(1838-1921)、詹姆斯·约瑟夫·西尔维斯特(1814-1897)、埃哈德·施密特(1876-1959)以及赫尔曼·外尔(1885-1955)。这五位数学家证实了奇异值分解的存在性,并推动了其理论发展。其中,贝尔特拉米、若尔当和西尔维斯特是从如今被称为线性代数的领域入手研究奇异值分解的,而施密特和外尔则是从积分方程的角度展开研究。为了让本综述更具背景支撑,下文将首先简要介绍相关历史背景。
It is an intriguing observation that most of the classical matrix decompositions predated the widespread use of matrices: they were cast in terms of determinants, linear systems of equations, and especially bilinear and quadratic forms. Gauss is the father of this development. Writing in 1823 [20, 31], he describes his famous elimination algorithm (first sketched in [19, 1809]) as follows:
一个有趣的现象是,大多数经典的矩阵分解方法在矩阵被广泛使用之前就已出现:这些分解方法最初是借助行列式、线性方程组,尤其是双线性型和二次型来表述的。高斯(Gauss)是这一发展领域的奠基人。他在 1823 年的文献 [20, 31] 中,对其著名的消元算法(该算法首次概述于 1809 年的文献 [19])进行了如下描述:
Specifically, the function f 2 f_2 f2 [a quadratic function of x , y , z x, y, z x,y,z, etc.] can be reduced to the form
具体而言,函数 f 2 f_2 f2([关于 x , y , z x, y, z x,y,z 等变量的二次函数])可化简为如下形式:
$$\frac{u^{0} u^{0}}{A^{0}}+\frac{u' u'}{B'}+\frac{u'' u''}{C''}+\frac{u''' u'''}{D'''} + \text{etc.} + M,$$
in which the divisors A 0 , B ′ , C ′ ′ , D ′ ′ ′ A^{0}, B', C'', D''' A0,B′,C′′,D′′′, etc. are constants and u 0 , u ′ , u ′ ′ u^{0}, u', u'' u0,u′,u′′, etc. are linear functions of x , y , z x, y, z x,y,z, etc. The second function, u ′ u' u′, is independent of x x x; the third, u ′ ′ u'' u′′, is independent of x x x and y y y; the fourth, u ′ ′ ′ u''' u′′′, is independent of x , y x, y x,y, and z z z, and so on. The last function depends only on the last of the unknowns x , y , z x, y, z x,y,z, etc. Moreover, the coefficients A 0 , B ′ , C ′ ′ A^{0}, B', C'' A0,B′,C′′, etc. multiply x , y , z x, y, z x,y,z, etc. in u 0 , u ′ , u ′ ′ u^{0}, u', u'' u0,u′,u′′, etc., respectively.
式中,除数 A 0 , B ′ , C ′ ′ , D ′ ′ ′ A^{0}, B', C'', D''' A0,B′,C′′,D′′′ 等均为常数,而 u 0 , u ′ , u ′ ′ u^{0}, u', u'' u0,u′,u′′ 等则是关于 x , y , z x, y, z x,y,z 等变量的线性函数。其中,第二个函数 u ′ u' u′ 与 x x x 无关;第三个函数 u ′ ′ u'' u′′ 与 x x x 和 y y y 均无关;第四个函数 u ′ ′ ′ u''' u′′′ 与 x , y , z x, y, z x,y,z 均无关,以此类推。最后一个函数则仅依赖于未知量 x , y , z x, y, z x,y,z 等中的最后一个未知量。此外,系数 A 0 , B ′ , C ′ ′ A^{0}, B', C'' A0,B′,C′′ 等分别在 u 0 , u ′ , u ′ ′ u^{0}, u', u'' u0,u′,u′′ 等函数中与 x , y , z x, y, z x,y,z 等变量相乘。
From this we easily see that Gauss's algorithm factors the matrix of the quadratic form $x^{T} A x$ into the product $R^{T} D^{-1} R$, where $D$ is diagonal and $R$ is upper triangular with the diagonals of $D$ on its diagonal. Gauss's functions $u^{0}, u', u''$, etc. are the components of the vector $u = R x$.
由此我们不难看出,高斯算法将二次型 $x^{T} A x$ 所对应的矩阵分解为 $R^{T} D^{-1} R$ 的形式,其中 $D$ 是对角矩阵,$R$ 是上三角矩阵,且 $R$ 的对角元素与 $D$ 的对角元素相同。高斯所定义的函数 $u^{0}, u', u''$ 等,正是向量 $u = R x$ 的各个分量。
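To make Gauss's reduction concrete, here is a minimal numerical sketch (my own illustration in numpy, not Gauss's notation): symmetric Gaussian elimination applied to a small positive definite matrix produces the divisors $d_k$ and the pivot rows of $R$, and one can check that $A = R^{T} D^{-1} R$ and that the quadratic form becomes a sum of squares divided by the $d_k$.

```python
import numpy as np

# A small symmetric positive definite matrix standing in for Gauss's quadratic form.
A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])
n = A.shape[0]

S = A.copy()
R = np.zeros_like(A)           # upper triangular; its rows carry Gauss's u^0, u', u'', ...
d = np.zeros(n)                # Gauss's divisors A^0, B', C'', ...
for k in range(n):
    d[k] = S[k, k]
    R[k, k:] = S[k, k:]        # pivot row; its diagonal entry equals d[k]
    S[k:, k:] -= np.outer(S[k:, k], S[k, k:]) / d[k]   # eliminate the k-th unknown

D = np.diag(d)
print(np.allclose(A, R.T @ np.linalg.inv(D) @ R))   # A = R^T D^{-1} R -> True

x = np.random.randn(n)
u = R @ x
print(np.isclose(x @ A @ x, np.sum(u**2 / d)))      # x^T A x = sum of u_k^2 / d_k -> True
```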
Gauss was also able to effectively obtain the inverse of a matrix by a process of eliminatio indefinita, in which the system of equations is transformed into the inverse system. Gauss's skill in manipulating quadratic forms and systems of equations made possible his very general treatment of the theory and practice of least squares.
此外,高斯还通过"不定消元法"(eliminatio indefinita)有效地求出了矩阵的逆,该方法可将原方程组转化为其逆方程组。高斯在处理二次型和方程组方面的精湛技巧,使其能够对最小二乘法的理论与实践进行全面且深入的研究。
Other developments followed. Cauchy [7, 1829] established the properties of the eigenvalues and eigenvectors of a symmetric system (including the interlacing property) by considering the corresponding homogeneous system of equations. In 1846, Jacobi [30] gave his famous algorithm for diagonalizing a symmetric matrix, and in a posthumous paper [31, 1857] he obtained the LU decomposition by decomposing a bilinear form in the style of Gauss. Weierstrass [63, 1868] established canonical forms for pairs of bilinear functions---what we should today call the generalized eigenvalue problem. Thus the advent of the singular value decomposition in 1873 is seen as one of a long line of results on canonical forms.
此后,相关领域又有了诸多新进展。柯西(Cauchy)在 1829 年的文献 [7] 中,通过研究对称系统对应的齐次方程组,确立了对称系统特征值与特征向量的性质(包括交错性质)。1846 年,雅可比(Jacobi)在文献 [30] 中提出了著名的对称矩阵对角化算法;在其 1857 年的遗作 [31] 中,他效仿高斯分解双线性型的方法,得到了矩阵的 LU 分解。外尔斯特拉斯(Weierstrass)在 1868 年的文献 [63] 中,建立了双线性函数对的标准型,这便是如今我们所说的广义特征值问题。由此可见,1873 年奇异值分解的出现,是一系列关于标准型研究成果中的重要一环。
We will use modern matrix notation to describe the early work on the singular value decomposition. Most of it slips as easily into matrix terminology as Gauss's description of his decomposition; and we shall be in no danger of anachronism, provided we take care to use matrix notation only as an expository device, and otherwise stick close to the writer's argument. The greatest danger is that the use of modern notation will trivialize the writer's accomplishments by making them obvious to our eyes. On the other hand, presenting derivations in the original scalar form would probably exaggerate the obstacles these people had to overcome, since they were accustomed, as we are not, to grasping sets of equations as a whole.
本文将采用现代矩阵符号来描述早期关于奇异值分解的研究工作。与高斯对其分解方法的描述类似,早期大多数奇异值分解的研究内容都能很自然地用矩阵术语来表述。只要我们仅将矩阵符号作为一种阐述工具,且在其他方面严格遵循原作者的论证逻辑,就不会出现时代错位的问题。不过,使用现代符号存在一个最大隐患:它可能会让原作者的成果在我们看来变得显而易见,从而淡化这些成果的价值。另一方面,若完全按照原始的标量形式来呈现推导过程,又可能会夸大原作者当时所克服的困难------因为他们习惯于从整体上把握方程组,而这种能力是我们现在所不具备的。
With a single author, it is usually possible to modernize notation in such a way that it corresponds naturally to what he actually wrote. Here we are dealing with several authors, and uniformity is more important than correspondence with the original. Consequently, throughout this paper we will be concerned with the singular value decomposition
对于单一作者的研究,通常可以对其符号进行现代化处理,使其与作者的原文表述自然对应。但本文涉及多位作者的研究,此时符号的统一性比与原文符号的一致性更为重要。因此,在整篇论文中,我们所讨论的奇异值分解均表示为如下形式:
$$A = U \Sigma V^{T}$$
where $A$ is a real matrix of order $n$,
其中,$A$ 是 $n$ 阶实矩阵,
$$\Sigma = \operatorname{diag}(\sigma_{1}, \sigma_{2}, \dots, \sigma_{n})$$
has nonnegative diagonal elements arranged in descending order of magnitude, and
$\Sigma$ 的对角元素均为非负值,且按从大到小的顺序排列;
$$U = \begin{pmatrix} u_{1} & u_{2} & \cdots & u_{n} \end{pmatrix} \quad \text{and} \quad V = \begin{pmatrix} v_{1} & v_{2} & \cdots & v_{n} \end{pmatrix}$$
are orthogonal. The symbol $\| \cdot \|$ will denote the Frobenius norm defined by
$U$ 和 $V$ 均为正交矩阵。符号 $\| \cdot \|$ 表示弗罗贝尼乌斯(Frobenius)范数,其定义为
$$\| A \|^{2} = \sum_{i,j} a_{i,j}^{2} = \sum_{i} \sigma_{i}^{2}.$$
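As a quick numerical check of this identity (a minimal sketch in numpy; the matrix is an arbitrary random example), the squared Frobenius norm equals the sum of the squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))                      # an arbitrary real matrix of order n

sigma = np.linalg.svd(A, compute_uv=False)
print(np.isclose(np.sum(A**2), np.sum(sigma**2)))    # ||A||^2 = sum of sigma_i^2 -> True
```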
In summarizing the contributions I have followed the principle that if you try to say everything you end up saying nothing. Most of the works treated here are richer than the following sketches would indicate, and the reader is advised to go to the sources for the full story.
在综述各位学者的贡献时,我遵循了"若试图面面俱到,反而会一无所获"的原则。本文所提及的大多数研究成果,其内涵远比下文的简述更为丰富,因此建议读者查阅原文以全面了解相关内容。
2. Beltrami [5, 1873]
2. 贝尔特拉米的研究 [5, 1873]
Together, Beltrami and Jordan are the progenitors of the singular value decomposition, Beltrami by virtue of first publication and Jordan by the completeness and elegance of his treatment. Beltrami's contribution appeared in the Journal of Mathematics for the Use of the Students of the Italian Universities, and its purpose was to encourage students to become familiar with bilinear forms.
贝尔特拉米和若尔当共同被誉为奇异值分解的开创者。其中,贝尔特拉米因首次发表相关研究成果而获此称谓,若尔当则因其研究的完整性与严谨性而获此荣誉。贝尔特拉米的研究成果发表在《意大利大学学生用数学期刊》(Journal of Mathematics for the Use of the Students of the Italian Universities)上,其目的是帮助学生熟悉双线性型。
The derivation
推导过程
Beltrami begins with a bilinear form
贝尔特拉米从双线性型入手:
$$f(x, y) = x^{T} A y,$$
where $A$ is real and of order $n$. If one makes the substitutions
其中,$A$ 是 $n$ 阶实矩阵。若进行如下变量替换:
$$x = U \xi \quad \text{and} \quad y = V \eta,$$
then
则(双线性型可转化为):
$$f(x, y) = \xi^{T} S \eta,$$
where
其中,
$$(2.1) \quad S = U^{T} A V.$$
Beltrami now observes that if U U U and V V V are required to be orthogonal, then there are n 2 − n n^2 - n n2−n degrees of freedom in their choice, and he proposes to use these degrees of freedom to annihilate the off-diagonal elements of S S S.
贝尔特拉米指出,若要求 U U U 和 V V V 为正交矩阵,则在选择这两个矩阵时存在 n 2 − n n^2 - n n2−n 个自由度,他提议利用这些自由度来消去矩阵 S S S 的非对角元素。
Assume that $S$ is diagonal, i.e., $S = \Sigma$. Then it follows from (2.1) and the orthogonality of $V$ that
假设 $S$ 为对角矩阵,即 $S = \Sigma$。由式 (2.1) 以及 $V$ 的正交性可推出:
$$(2.2) \quad U^{T} A = \Sigma V^{T}.$$
Similarly,
同理可得:
$$(2.3) \quad A V = U \Sigma.$$
Substituting the value of U T U^{T} UT obtained from (2.3) into (2.2), Beltrami obtains the equation
将由式 (2.3) 得到的 U T U^{T} UT 代入式 (2.2),贝尔特拉米得到如下方程:
$$(2.4) \quad U^{T} (A A^{T}) = \Sigma^{2} U^{T},$$
and similarly he obtains
同理,还可得到:
$$(A^{T} A) V = V \Sigma^{2}.$$
Thus the σ i \sigma_{i} σi are the roots of the equations
因此, σ i \sigma_{i} σi 是下列方程的根:
$$(2.5) \quad \det(A A^{T} - \sigma^{2} I) = 0$$
and
以及
$$(2.6) \quad \det(A^{T} A - \sigma^{2} I) = 0.$$
Note that the derivation, as presented by Beltrami, assumes that $\Sigma$ (and hence $A$) is nonsingular.¹
需要注意的是,贝尔特拉米的上述推导过程假设 $\Sigma$(进而 $A$)为非奇异矩阵。¹
¹ However, it is possible to derive the equations without assuming that $A$ is nonsingular, e.g., $U^{T} A A^{T} = \Sigma V^{T} A^{T} = \Sigma^{2} U^{T}$, the first equality following on multiplying (2.2) by $A^{T}$, and the second on substituting the transpose of (2.2). Thanks to Anne Greenbaum for pointing this fact out.
¹ 不过,不假设 $A$ 是非奇异的,也有可能推导出这些方程,例如,$U^{T} A A^{T} = \Sigma V^{T} A^{T} = \Sigma^{2} U^{T}$,第一个等式是将 (2.2) 乘以 $A^{T}$ 得到的,第二个等式是代入 (2.2) 的转置得到的。感谢安妮·格林鲍姆指出这一事实。
Beltrami now argues that the two functions (2.5) and (2.6) are identical because they are polynomials of degree n n n that assume the same values at n n n points and the common value det 2 ( A ) \det^{2}(A) det2(A) at σ = 0 \sigma = 0 σ=0---an argument that presupposes that the singular values are distinct and nonzero.
贝尔特拉米进一步指出,函数 (2.5) 和 (2.6) 是完全相同的。因为这两个函数均为 n n n 次多项式,且在 n n n 个点上取值相同,同时当 σ = 0 \sigma = 0 σ=0 时,两者都等于 det 2 ( A ) \det^{2}(A) det2(A)。不过,该论证的前提是奇异值互不相等且均不为零。
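A quick numerical check of Beltrami's claim (again a sketch with an arbitrary random matrix): the characteristic polynomials of $A A^{T}$ and $A^{T} A$ coincide, so (2.5) and (2.6) have the same roots, and their common value at $\sigma = 0$ is $\det^{2}(A)$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

p1 = np.poly(A @ A.T)          # coefficients of the characteristic polynomial of A A^T
p2 = np.poly(A.T @ A)          # ... and of A^T A
print(np.allclose(p1, p2))                        # (2.5) and (2.6) are the same polynomial -> True
print(np.isclose(p1[-1], np.linalg.det(A)**2))    # constant term equals det(A A^T) = det(A)^2 -> True
```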
Beltrami next states that by a well-known theorem, the roots of (2.5) are real. Moreover, they are positive. To show this he notes that
接着,贝尔特拉米依据一个著名定理指出,方程 (2.5) 的根均为实数,且均为正值。为证明这一点,他给出如下推导:
$$(2.7) \quad 0 < \| x^{T} A \|^{2} = x^{T} (A A^{T}) x = \xi^{T} \Sigma^{2} \xi,$$
the last equation following from the theory of quadratic forms. This inequality immediately implies that the σ i 2 \sigma_{i}^{2} σi2 are positive.
上式最后一个等式可由二次型理论推出。该不等式直接表明 σ i 2 \sigma_{i}^{2} σi2 均为正值。
There is some confusion here. Beltrami appears to be assuming the existence of the vector ξ \xi ξ whose very existence he is trying to establish. The vector required by his argument is an eigenvector of A A T A A^{T} AAT corresponding to σ i 2 \sigma_{i}^{2} σi2. The fact that the two vectors turn out to be the same apparently caused Beltrami to leap ahead of himself and use ξ \xi ξ in (2.7).
此处存在一处逻辑模糊。贝尔特拉米似乎假设了向量 ξ \xi ξ 的存在性,但实际上他原本需要证明的正是 ξ \xi ξ 的存在性。他的论证过程中所需要的向量,是 A A T A A^{T} AAT 对应于特征值 σ i 2 \sigma_{i}^{2} σi2 的特征向量。而由于最终证明的向量与假设的向量是同一个,这显然使得贝尔特拉米在论证中提前使用了 (2.7) 中的 ξ \xi ξ。
Beltrami is now ready to give an algorithm to determine the diagonalizing transformation.
随后,贝尔特拉米提出了确定对角化变换的算法,步骤如下:
- Find the roots of (2.5).
  求解方程 (2.5) 的根。
- Determine $U$ from (2.4). Here Beltrami notes that the columns of $U$ are determined up to factors of $\pm 1$, which is true only if the $\sigma_{i}$ are distinct. He also tacitly assumes that the resulting $U$ will be orthogonal, which also requires that the $\sigma_{i}$ be distinct.
  根据式 (2.4) 确定矩阵 $U$。贝尔特拉米指出,矩阵 $U$ 的列向量在相差一个 $\pm 1$ 的因子范围内是确定的,但这一结论仅在 $\sigma_{i}$ 互不相等时成立。同时,他还默认由此得到的 $U$ 是正交矩阵,而这一前提同样要求 $\sigma_{i}$ 互不相等。
- Determine $V$ from (2.3). This step requires that $\Sigma$ be nonsingular. (A numerical sketch of these three steps follows this list.)
  根据式 (2.3) 确定矩阵 $V$。该步骤要求 $\Sigma$ 为非奇异矩阵。
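The following is a minimal numerical sketch of these three steps (assuming, as Beltrami does, a nonsingular matrix with distinct singular values; numpy's symmetric eigensolver stands in for "finding the roots of (2.5)"):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))          # real, square, and (with probability one) nonsingular

# Step 1: the sigma_i^2 are the roots of det(A A^T - sigma^2 I) = 0,
# Step 2: and by (2.4) the columns of U are the corresponding eigenvectors of A A^T.
w, U = np.linalg.eigh(A @ A.T)
order = np.argsort(w)[::-1]              # descending order of magnitude
sigma = np.sqrt(w[order])
U = U[:, order]

# Step 3: determine V from (2.3), A V = U Sigma (here Beltrami needs Sigma nonsingular).
V = np.linalg.solve(A, U * sigma)

print(np.allclose(A, U @ np.diag(sigma) @ V.T))   # A = U Sigma V^T -> True
print(np.allclose(V.T @ V, np.eye(4)))            # V comes out orthogonal -> True
```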
Discussion
讨论
From the foregoing it is clear that Beltrami derived the singular value decomposition for a real, square, nonsingular matrix having distinct singular values. His derivation is the one given in most textbooks, but it lacks the extras needed to handle degeneracies. It may be that in omitting these extras Beltrami was simplifying things for his student audience, but a certain slackness in the exposition suggests that he had not thought the problem through.
由上述内容可知,贝尔特拉米推导的是具有互不相等奇异值的实方阵、非奇异矩阵的奇异值分解。他的推导方法与如今大多数教材中所采用的方法一致,但该方法无法处理奇异值退化(如奇异值相等或存在零奇异值)的情况。或许贝尔特拉米是为了让学生更容易理解,才省略了处理退化情况的内容;但从其阐述中存在的逻辑疏漏可以看出,他可能并未完全透彻地研究这一问题。
3. Jordan [32], [33]
3. 若尔当的研究 [32], [33]
Camille Jordan can rightly be called the codiscoverer of the singular value decomposition. Although he published his derivation a year after Beltrami, it is clear that the work is independent. In fact, his "Mémoire sur les formes bilinéaires" treats three problems, of which the reduction of a bilinear form to a diagonal form by orthogonal substitutions is the simplest 2 ^2 2.
卡米耶·若尔当完全有资格被称为奇异值分解的共同发现者。尽管他的推导成果比贝尔特拉米晚发表一年,但显然其研究是独立完成的。事实上,他的论文《关于双线性型的研究报告》(Mémoire sur les formes bilinéaires)探讨了三个问题,其中通过正交替换将双线性型化为对角型是最简单的一个问题。
² The other two are to reduce a bilinear form by the same substitution of both sets of variables and to reduce a pair of bilinear forms by two substitutions, one for each set of variables. Jordan notes that the former problem had been considered by Kronecker [37, 1866] in a different form, and the latter by Weierstrass [63, 1868].
² 另外两个问题分别是:通过对两组变量进行相同的替换化简双线性型,以及通过两组不同的替换(每组变量对应一组替换)化简一对双线性型。若尔当指出,前者曾由克罗内克(Kronecker)在 1866 年的文献 [37] 中以另一种形式研究过,而后者则由外尔斯特拉斯在 1868 年的文献 [63] 中探讨过。
The derivation
推导过程
Jordan starts with the form
若尔当从如下双线性型入手:
$$P = x^{T} A y$$
and seeks the maximum and minimum of P P P subject to
并在如下约束条件下寻求 P P P 的最大值与最小值:
$$(3.1) \quad \| x \|^{2} = \| y \|^{2} = 1.$$
The maximum is determined by the equation
最大值可通过如下方程确定:
$$(3.2) \quad 0 = dP = dx^{T} A y + x^{T} A \, dy,$$
which must be satisfied for all d x dx dx and d y dy dy that satisfy
该方程需对所有满足下列条件的 d x dx dx 和 d y dy dy 成立:
$$(3.3) \quad dx^{T} x = 0 \quad \text{and} \quad dy^{T} y = 0.$$
Jordan then asserts that "equation (3.2) will therefore be a combination of the equations (3.3)," from which one obtains 3 ^3 3
随后,若尔当提出"方程 (3.2) 可表示为方程 (3.3) 的线性组合",由此可推出:
³ Jordan's argument is not very clear. Possibly he means to say that for some constants $\sigma$ and $\tau$ we must have $dx^{T} A y + x^{T} A \, dy = \sigma \, dx^{T} x + \tau \, dy^{T} y$, from which the subsequent equations follow from the independence of $dx$ and $dy$.
³ 若尔当的论证不是很清晰。他可能想说,对于某些常数 $\sigma$ 和 $\tau$,我们必定有 $dx^{T} A y + x^{T} A \, dy = \sigma \, dx^{T} x + \tau \, dy^{T} y$,后续的方程可由 $dx$ 和 $dy$ 的独立性推导得出。
$$(3.4) \quad A y = \sigma x$$
and
以及
$$(3.5) \quad x^{T} A = \tau y^{T}.$$
From (3.4) it follows that the maximum is
由式 (3.4) 可知,最大值满足:
$$x^{T} (A y) = \sigma x^{T} x = \sigma.$$
Similarly the maximum is also τ \tau τ, so that σ = τ \sigma = \tau σ=τ.
同理,最大值也等于 τ \tau τ,因此 σ = τ \sigma = \tau σ=τ。
Jordan now observes that σ \sigma σ is determined by the vanishing of the determinant
若尔当进一步指出, σ \sigma σ 可由下列行列式等于零来确定:
$$D = \begin{vmatrix} -\sigma I & A \\ A^{T} & -\sigma I \end{vmatrix}$$
of the system (3.4)-(3.5). He shows that this determinant contains only even powers of σ \sigma σ.
该行列式对应方程组 (3.4)-(3.5) 的系数矩阵。若尔当证明了该行列式中仅包含 σ \sigma σ 的偶次幂。
Now let σ 1 \sigma_{1} σ1 be a root of the equation D = 0 D = 0 D=0 and let (3.4) and (3.5) be satisfied by x = u x = u x=u and y = v y = v y=v, where ∥ u ∥ 2 = ∥ v ∥ 2 = 1 \| u \|^{2} = \| v \|^{2} = 1 ∥u∥2=∥v∥2=1 (Jordan notes that one can find such a solution, even when it is not unique). Let
令 σ 1 \sigma_{1} σ1 为方程 D = 0 D = 0 D=0 的一个根,且 x = u x = u x=u、 y = v y = v y=v(其中 ∥ u ∥ 2 = ∥ v ∥ 2 = 1 \| u \|^{2} = \| v \|^{2} = 1 ∥u∥2=∥v∥2=1)满足方程 (3.4) 和 (3.5)(若尔当指出,即便解不唯一,也能找到这样的解)。令
$$\hat{U} = (u \quad U_{*}) \quad \text{and} \quad \hat{V} = (v \quad V_{*})$$
be orthogonal, and let
为正交矩阵,并令
$$x = \hat{U} \hat{x} \quad \text{and} \quad y = \hat{V} \hat{y}.$$
With these substitutions, let
通过上述替换,令
$$P = \hat{x}^{T} \hat{A} \hat{y}, \quad \text{where } \hat{A} = \hat{U}^{T} A \hat{V}.$$
In this system, $P$ attains its maximum⁴ for $\hat{x} = e_{1}$, $\hat{y} = e_{1}$, where $e_{1} = (1, 0, \dots, 0)^{T}$. Moreover, at the maximum we have
在该系统中,当 $\hat{x} = e_{1}$、$\hat{y} = e_{1}$(其中 $e_{1} = (1, 0, \dots, 0)^{T}$)时,$P$ 取得最大值⁴。此外,在最大值点处,有
$$\hat{A} \hat{y} = \sigma_{1} \hat{x} \quad \text{and} \quad \hat{x}^{T} \hat{A} = \sigma_{1} \hat{y}^{T},$$
which implies that
这表明
$$\hat{A} = \begin{pmatrix} \sigma_{1} & 0 \\ 0 & A_{1} \end{pmatrix}$$
Thus with $\xi_{1} = \hat{x}_{1}$ and $\eta_{1} = \hat{y}_{1}$, $P$ assumes the form
因此,令 $\xi_{1} = \hat{x}_{1}$、$\eta_{1} = \hat{y}_{1}$,则 $P$ 可表示为
$$\sigma_{1} \xi_{1} \eta_{1} + P_{1}$$
where P 1 P_{1} P1 is independent of ξ 1 \xi_{1} ξ1 and η 1 \eta_{1} η1. Jordan now applies the reduction inductively to P 1 P_{1} P1 to arrive at the canonical form
其中, P 1 P_{1} P1 与 ξ 1 \xi_{1} ξ1 和 η 1 \eta_{1} η1 均无关。若尔当通过对 P 1 P_{1} P1 进行归纳化简,最终得到双线性型的标准型:
$$P = \xi^{T} \Sigma \eta$$
Finally, Jordan notes that when the roots of the characteristic equation D = 0 D = 0 D=0 are simple, the columns of U U U and V V V can be calculated directly from (3.1), (3.4), and (3.5).
最后,若尔当指出,当特征方程 D = 0 D = 0 D=0 的根均为单根时,矩阵 U U U 和 V V V 的列向量可直接由式 (3.1)、(3.4) 和 (3.5) 计算得到。
⁴ Jordan nods here, since he has not explicitly selected the largest root $\sigma_{1}$.
⁴ 此处若尔当的表述不够严谨,因为他并未明确指出 $\sigma_{1}$ 是最大的根。
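A minimal numerical sketch of Jordan's deflation idea (an illustration, not Jordan's own computation): take the top pair $(u, v)$ satisfying (3.4)-(3.5) -- supplied here by numpy's SVD purely as a stand-in for solving the stationarity equations -- embed $u$ and $v$ as the first columns of orthogonal matrices, and recurse on the trailing block of the transformed matrix.

```python
import numpy as np

def top_pair(A):
    """sigma_1 and unit vectors u, v with A v = sigma_1 u and A^T u = sigma_1 v."""
    # Stand-in for solving (3.4)-(3.5); only the deflation below is done "by hand".
    U, s, Vt = np.linalg.svd(A)
    return s[0], U[:, 0], Vt[0]

def orth_with_first_column(w):
    """An orthogonal matrix whose first column is the unit vector w (a Householder reflector)."""
    n = w.size
    e1 = np.zeros(n)
    e1[0] = 1.0
    p = w - e1
    if np.linalg.norm(p) < 1e-14:
        return np.eye(n)
    return np.eye(n) - 2.0 * np.outer(p, p) / (p @ p)

def jordan_svd(A):
    """Assemble the SVD of a square matrix by Jordan-style deflation."""
    n = A.shape[0]
    if n == 1:
        u = np.array([[1.0 if A[0, 0] >= 0 else -1.0]])
        return u, np.array([abs(A[0, 0])]), np.eye(1)
    sigma1, u, v = top_pair(A)
    Uh, Vh = orth_with_first_column(u), orth_with_first_column(v)
    Ahat = Uh.T @ A @ Vh                        # has the form [[sigma1, 0], [0, A1]]
    U1, s1, V1 = jordan_svd(Ahat[1:, 1:])       # recurse on the deflated block A1
    embed = lambda Q: np.block([[np.eye(1), np.zeros((1, n - 1))],
                                [np.zeros((n - 1, 1)), Q]])
    return Uh @ embed(U1), np.concatenate(([sigma1], s1)), Vh @ embed(V1)

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
U, s, V = jordan_svd(A)
print(np.allclose(A, U @ np.diag(s) @ V.T))     # A = U Sigma V^T -> True
print(np.allclose(U.T @ U, np.eye(5)), np.allclose(V.T @ V, np.eye(5)))   # True True
```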
Discussion
讨论
In this paper we see the sure hand of a skilled professional. Jordan proceeds from problem to solution with economy and elegance. His approach of using a partial solution of the problem to reduce it to one of smaller size---deflation is the modern term---avoids the degeneracies that complicate Beltrami's approach. Incidentally, the technique of deflation apparently lay fallow until Schur [52, 1917] used it to establish his triangular form of a general matrix. It is now a widely used theoretical and algorithmic tool.
从这篇论文中,我们能感受到一位资深学者的严谨与精湛。若尔当以简洁且严谨的方式从问题出发,逐步推导出解决方案。他采用"通过部分解将原问题转化为低维问题"的方法(现代称之为"收缩法",deflation),成功避开了贝尔特拉米方法中因奇异值退化而产生的复杂问题。顺便提一句,收缩法在当时并未得到广泛应用,直到 1917 年,舒尔(Schur)在文献 [52] 中利用该方法建立了一般矩阵的三角分解,此后收缩法才成为一种被广泛使用的理论与算法工具。
The matrix
矩阵
$$\begin{pmatrix} 0 & A \\ A^{T} & 0 \end{pmatrix}$$
from which the determinant D D D was formed, is also widely used. Its present-day popularity is due to Wielandt (see [18, p.113]) and Lanczos [38, 1958]. The latter apparently rediscovered the singular value decomposition independently.
上述用于构造行列式 D D D 的矩阵如今也被广泛应用。该矩阵之所以在现代受到重视,得益于维兰特(Wielandt,参见文献 [18, 第 113 页])和兰索斯(Lanczos,1958 年文献 [38])的研究。其中,兰索斯似乎独立重新发现了奇异值分解。
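A quick numerical illustration of why this matrix is so convenient (a sketch with a random example): its eigenvalues are exactly $\pm\sigma_{i}$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))

J = np.block([[np.zeros((4, 4)), A],
              [A.T, np.zeros((4, 4))]])
eigvals = np.sort(np.linalg.eigvalsh(J))
sigma = np.linalg.svd(A, compute_uv=False)
print(np.allclose(eigvals, np.sort(np.concatenate([sigma, -sigma]))))   # eigenvalues are +/- sigma_i -> True
```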
Yet another consequence of Jordan's approach is the variational characterization of the largest singular value as the maximum of a function. This and related characterizations have played an important role in perturbation and localization theorems for singular values (for more, see [55, 4.4]).
若尔当的方法还带来了另一个重要成果:将最大奇异值表示为某个函数的最大值,即最大奇异值的变分特征。这一特征及其相关表述,在奇异值的扰动定理和局部化定理中发挥了重要作用(更多细节可参见文献 [55, 4.4 节])。
4. Sylvester [57, 1889], [59, 1889], [58, 1889]
4. 西尔维斯特的研究 [57, 1889], [59, 1889], [58, 1889]
Sylvester wrote a footnote and two papers on the subject of the singular value decomposition. The footnote appears at the end of a paper in The Messenger of Mathematics [57] entitled "A new proof that a general quadric may be reduced to its canonical form (that is, a linear function of squares) by means of a real orthogonal substitution." In the paper Sylvester describes an iterative algorithm for reducing a quadratic form to diagonal form. In the footnote he points out that an analogous iteration can be used to diagonalize a bilinear form and says that he has "sent for insertion in the C. R. of the Institute, a Note in which I give the rule for effecting this reduction." The rule turns out to be Beltrami's algorithm. In a final paper [58, 1889], Sylvester presents both the iterative algorithm and the rule.
西尔维斯特撰写了一篇关于奇异值分解的注释以及两篇相关论文。其中,注释附在发表于《数学通讯》(The Messenger of Mathematics)的一篇论文 [57] 末尾,该论文标题为《关于一般二次型可通过实正交替换化为标准型(即平方项的线性组合)的新证明》。在这篇论文中,西尔维斯特提出了一种将二次型化为对角型的迭代算法;而在注释中,他指出类似的迭代方法也可用于双线性型的对角化,并提到"已向《法国科学院院报》(C. R. of the Institute)投稿一篇短文,文中给出了实现该对角化的规则"。后来证实,他所说的规则正是贝尔特拉米提出的算法。在最后一篇论文 [58, 1889] 中,西尔维斯特同时阐述了该迭代算法与上述规则。
The rule
规则
Here we follow [59, 1889]. Sylvester begins with the bilinear form
下文将依据文献 [59, 1889] 展开介绍。西尔维斯特从双线性型入手:
$$B = x^{T} A y$$
and considers the quadratic form
并研究如下二次型:
$$M = \sum_{i} \left( \frac{\partial B}{\partial y_{i}} \right)^{2}$$
(which is $x^{T} A A^{T} x$, a fact tacitly assumed by Sylvester). Let $\sum \lambda_{i} \xi_{i}^{2}$ be the canonical form of $M$. If $B$ has the canonical form $B = \sum \sigma_{i} \xi_{i} \eta_{i}$, then $M$ is orthogonally equivalent to $\sum \sigma_{i}^{2} \xi_{i}^{2}$, which implies that $\lambda_{i} = \sigma_{i}^{2}$ in some order.
(该二次型即 $x^{T} A A^{T} x$,西尔维斯特默认了这一事实)。设 $\sum \lambda_{i} \xi_{i}^{2}$ 为 $M$ 的标准型。若 $B$ 的标准型为 $B = \sum \sigma_{i} \xi_{i} \eta_{i}$,则 $M$ 与 $\sum \sigma_{i}^{2} \xi_{i}^{2}$ 正交等价,这意味着在某种顺序下有 $\lambda_{i} = \sigma_{i}^{2}$。
To find the substitutions, Sylvester introduces the matrices M = A A T M = A A^{T} M=AAT and N = A T A N = A^{T} A N=ATA and asserts that the substitution for x x x is the substitution that diagonalizes M M M and the substitution for y y y is the one that diagonalizes N N N. In general, this is true only if the singular values of A A A are distinct.
为了确定变量替换,西尔维斯特引入矩阵 M = A A T M = A A^{T} M=AAT 和 N = A T A N = A^{T} A N=ATA,并指出:用于 x x x 的替换是使 M M M 对角化的替换,用于 y y y 的替换是使 N N N 对角化的替换。但通常情况下,该结论仅在 A A A 的奇异值互不相等时成立。
In his Comptes Rendus note, Sylvester gives the following rule for finding the coefficients of the x x x-substitution corresponding to a singular value σ \sigma σ: Strike a row of the matrix M − σ 2 I M - \sigma^{2} I M−σ2I. Then the vector of coefficients is the vector of minors of order n − 1 n - 1 n−1 of the reduced matrix, normalized so that their sum of squares is one. Coefficients of the y y y-substitution may be obtained analogously from N − σ 2 I N - \sigma^{2} I N−σ2I. This only works if the singular value σ \sigma σ is simple.
在发表于《法国科学院院报》的短文中,西尔维斯特给出了求解对应于奇异值 σ \sigma σ 的 x x x 变量替换系数的规则:删除矩阵 M − σ 2 I M - \sigma^{2} I M−σ2I 的某一行,得到一个降阶矩阵,该降阶矩阵的 n − 1 n - 1 n−1 阶子式构成一个向量,将该向量标准化(使其各元素平方和为 1),即为所求的系数向量。同理,通过矩阵 N − σ 2 I N - \sigma^{2} I N−σ2I 可求得 y y y 变量替换的系数。但该规则仅在奇异值 σ \sigma σ 为单根时有效。
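A minimal numerical sketch of the rule (using numpy, for a simple singular value as the rule requires; signed minors, i.e. cofactors, are used here so that the struck-row system is actually annihilated):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
M = A @ A.T                                    # Sylvester's M

U, s, Vt = np.linalg.svd(A)                    # reference decomposition
sigma = s[1]                                   # a simple (non-repeated) singular value

B = np.delete(M - sigma**2 * np.eye(n), 0, axis=0)    # strike a row of M - sigma^2 I
# Signed minors of order n-1, obtained by deleting each column of the reduced matrix in turn.
x = np.array([(-1) ** j * np.linalg.det(np.delete(B, j, axis=1)) for j in range(n)])
x /= np.linalg.norm(x)                         # normalize so the sum of squares is one

u = U[:, 1]                                    # the column of U belonging to sigma
print(np.allclose(x, u) or np.allclose(x, -u)) # coefficients agree up to sign -> True
```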
Infinitesimal iteration
无穷小迭代法
Sylvester first proposed this method as a technique for showing that a quadratic form could be diagonalized, and he later extended it to bilinear forms. It is already intricate enough for quadratic forms, and we will confine ourselves to a sketch of that case.
西尔维斯特最初提出该方法是为了证明二次型可对角化,后来他将其推广到双线性型的情形。仅针对二次型,该方法已较为复杂,因此下文仅简要介绍二次型情形下的无穷小迭代法。
Sylvester proceeds inductively, assuming that he can solve a problem of order n − 1 n - 1 n−1. Thus for n = 3 n = 3 n=3 he can assume the matrix is of the form
西尔维斯特采用归纳法进行推导,假设已能解决 n − 1 n - 1 n−1 阶问题。因此,对于 n = 3 n = 3 n=3 的情形,可假设矩阵具有如下形式:
$$A = \begin{pmatrix} a & 0 & f \\ 0 & b & g \\ f & g & c \end{pmatrix},$$
the zeros being introduced by the induction step. His problem is then to get rid of f f f and g g g without destroying the zeros previously introduced.
其中,零元素是通过归纳步骤引入的。此时,西尔维斯特需要解决的问题是:在不破坏已有的零元素的前提下,消去元素 f f f 和 g g g。
Sylvester proposes to make an "infinitesimal orthogonal substitution" of the form
西尔维斯特提出采用如下形式的"无穷小正交替换":
$$\begin{pmatrix} x_{1} \\ x_{2} \\ x_{3} \end{pmatrix} = \begin{pmatrix} 1 & \epsilon & \eta \\ -\epsilon & 1 & \theta \\ -\eta & -\theta & 1 \end{pmatrix} \begin{pmatrix} \xi_{1} \\ \xi_{2} \\ \xi_{3} \end{pmatrix},$$
where the off-diagonal quantities are so small that powers higher than the first can be neglected. Then the (2, 1)- and (1, 2)-elements of the transformed matrix are
其中,非对角元素的取值非常小,以至于其高于一次的幂次均可忽略不计。经过该替换后,变换后矩阵的 (2, 1) 元和 (1, 2) 元为:
$$(4.1) \quad (a - b) \epsilon - f \theta - g \eta,$$
while the change in f 2 + g 2 f^{2} + g^{2} f2+g2 is given by
而 f 2 + g 2 f^{2} + g^{2} f2+g2 的变化量满足:
$$\frac{1}{2} \delta (f^{2} + g^{2}) = (a - c) f \eta + (b - c) g \theta.$$
If either f f f or g g g is nonzero, η \eta η and θ \theta θ can be chosen to decrease f 2 + g 2 f^{2} + g^{2} f2+g2. If ( a − b ) (a - b) (a−b) is nonzero, ϵ \epsilon ϵ may then be chosen so that (4.1) is zero, i.e., so that the zero previously introduced is preserved. Sylvester shows how special cases like a = b a = b a=b can be handled by explicitly deflating the problem.
若 f f f 或 g g g 中至少有一个非零,则可选择 η \eta η 和 θ \theta θ 的值来减小 f 2 + g 2 f^{2} + g^{2} f2+g2。若 ( a − b ) ≠ 0 (a - b) \neq 0 (a−b)=0,则可选择 ϵ \epsilon ϵ 的值使式 (4.1) 等于零,即保持之前引入的零元素不变。西尔维斯特还通过明确的收缩法,说明了如何处理 a = b a = b a=b 等特殊情况。
Sylvester now claims that an infinite sequence of these infinitesimal transformations will reduce one of f f f or g g g to zero, or will reduce the problem to one of the special cases.
西尔维斯特认为,通过无限次重复上述无穷小变换,要么可将 f f f 或 g g g 中的一个化为零,要么可将问题转化为上述特殊情况之一。
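Sylvester's infinitesimal scheme is hard to pin down (see the discussion below), but its finite-rotation analogue -- the classical Jacobi iteration mentioned there, whose two-sided generalization Kogbetliantz later applied to the singular value decomposition -- is easy to sketch for the quadratic-form case. The following is an illustration in numpy, not Sylvester's own algorithm:

```python
import numpy as np

def jacobi_diagonalize(B, sweeps=10):
    """Cyclic Jacobi iteration: zero each off-diagonal pair in turn with a plane rotation."""
    A = B.copy()
    n = A.shape[0]
    Q = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-15:
                    continue
                # Rotation angle chosen so that the transformed (p, q) entry vanishes.
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J
                Q = Q @ J
    return Q, A

B = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 1.0]])
Q, D = jacobi_diagonalize(B)
print(np.allclose(D, np.diag(np.diag(D)), atol=1e-10))   # off-diagonal part driven to zero -> True
print(np.allclose(Q @ D @ Q.T, B))                        # B = Q D Q^T with Q orthogonal -> True
```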
Discussion
讨论
These are not easy papers to read. The style is opaque, and Sylvester pontificates without proving, leaving too many details to the reader. The mathematical reasoning harks back to an earlier, less rigorous era.
西尔维斯特的这些论文阅读难度较大。其表述晦涩难懂,且他常常只给出结论而不加以证明,将大量细节留给读者自行推导。从数学推理方式来看,其风格更偏向早期不够严谨的数学研究范式。
The fact that Sylvester sent a note to Comptes Rendus, the very organ where Jordan announced his results a decade and a half earlier, makes it clear that he was working in ignorance of his predecessors. It also suggests the importance he attached to his discovery, since a note in Comptes Rendus was tantamount to laying claim to a new result.
西尔维斯特选择向《法国科学院院报》投稿,而该期刊正是 15 年前若尔当发表其研究成果的刊物。这一事实表明,西尔维斯特在开展研究时并未了解到前人(若尔当等)的相关成果。同时,这也体现出他对自己发现的重视------因为在《法国科学院院报》上发表短文,相当于公开宣告自己取得了新的研究成果。
Sylvester was also working in ignorance of the iterative algorithm of Jacobi [30, 1846] for diagonalizing a quadratic form. The generalization of this algorithm to the singular value decomposition is due to Kogbetliantz [36].
此外,西尔维斯特在研究时也未了解到雅可比在 1846 年文献 [30] 中提出的二次型对角化迭代算法。后来,科格别利扬茨(Kogbetliantz)在文献 [36] 中将雅可比算法推广到了奇异值分解的情形。
It is not clear whether Sylvester intended to ignore second-order terms in his iteration or whether he regards the diagonalization as being composed of an (uncountably) infinite number of infinitesimal transformations. Though the preponderance of his statements favor the latter, neither interpretation truly squares with everything he writes. In the first interpretation, small, but finite, terms replace the zeros previously introduced, so that a true diagonalization is not achieved. The second has the flavor of some recent algorithms in which discrete transformations are replaced by continuous transformations defined by differential equations (for applications of this approach to the singular value decomposition see [8] and [11]). But Sylvester does not give enough detail to write down such equations.
目前尚不清楚西尔维斯特是有意忽略迭代过程中的二阶项,还是将对角化视为由(不可数)无穷多次无穷小变换构成的过程。尽管他的大部分表述更支持后一种解释,但这两种解释都无法完全与他的所有论述一致。若按第一种解释,之前引入的零元素会被微小但有限的项取代,从而无法实现真正的对角化;第二种解释则与近年来的一些算法思想相似------这些算法用微分方程定义的连续变换替代离散变换(该方法在奇异值分解中的应用可参见文献 [8] 和 [11])。然而,西尔维斯特并未提供足够的细节来推导这类微分方程。
5. Schmidt [50, 1907]
5. 施密特的研究 [50, 1907]
Our story now moves from the domain of linear algebra to integral equations, one of the hot topics of the first decades of our century. In his treatment of integral equations with unsymmetric kernels, Erhard Schmidt (of Gram-Schmidt fame and a student of Hilbert) introduced the infinite-dimensional analogue of the singular value decomposition. But he went beyond the mere existence of the decomposition by showing how it can be used to obtain optimal, low-rank approximations to an operator. In doing so he transformed the singular value decomposition from a mathematical curiosity to an important theoretical and computational tool.
接下来,我们的叙述将从线性代数领域转向积分方程领域------积分方程是 20 世纪最初几十年的热门研究课题之一。埃哈德·施密特(因格拉姆-施密特正交化方法闻名,同时也是希尔伯特(Hilbert)的学生)在研究具有非对称核的积分方程时,提出了奇异值分解在无穷维空间中的类似形式(即无穷维奇异值分解)。不仅如此,他还进一步研究了该分解的应用:证明了如何利用无穷维奇异值分解获得算子的最优低秩近似。通过这一工作,施密特使奇异值分解从一个纯粹的数学理论问题,转变为兼具重要理论意义与实用价值的计算工具。
Symmetric kernels
对称核
Schmidt's approach is essentially the same as Beltrami's; however, because he worked in infinite-dimensional spaces of functions he could not appeal to previous results on quadratic forms. Consequently, the first part of his paper is devoted to symmetric kernels.
施密特的研究思路与贝尔特拉米基本一致,但由于他研究的是无穷维函数空间,无法直接沿用之前关于二次型的研究成果。因此,他的论文第一部分专门探讨了对称核的相关问题。
Schmidt begins with a kernel A ( s , t ) A(s, t) A(s,t) that is continuous and symmetric on [ a , b ] × [ a , b ] [a, b] \times [a, b] [a,b]×[a,b]. A continuous, nonvanishing function φ ( s ) \varphi(s) φ(s) satisfying
施密特从定义在 [ a , b ] × [ a , b ] [a, b] \times [a, b] [a,b]×[a,b] 上的连续对称核 A ( s , t ) A(s, t) A(s,t) 入手。若连续非零函数 φ ( s ) \varphi(s) φ(s) 满足
$$\varphi(s) = \lambda \int_{a}^{b} A(s, t) \varphi(t) \, dt$$
is said to be an eigenfunction of A A A corresponding to the eigenvalue λ \lambda λ. Note that Schmidt's eigenvalues are the reciprocals of ours.
则称 φ ( s ) \varphi(s) φ(s) 为核 A A A 对应于特征值 λ \lambda λ 的特征函数。需要注意的是,施密特所定义的特征值与我们现在所用的特征值互为倒数。
Schmidt then establishes the following facts.
随后,施密特证明了如下结论:
- The kernel $A$ has at least one eigenfunction.
  核 $A$ 至少存在一个特征函数。
- The eigenvalues and their eigenfunctions are real.
  特征值及其对应的特征函数均为实值。
- Each eigenvalue of $A$ has at most a finite number of linearly independent eigenfunctions.
  核 $A$ 的每个特征值最多对应有限个线性无关的特征函数。
- The kernel $A$ has a complete, orthonormal system of eigenfunctions; that is, a sequence of orthonormal eigenfunctions $\varphi_{1}(s), \varphi_{2}(s), \dots$ such that every eigenfunction can be expressed as a linear combination of a finite number of the $\varphi_{j}(s)$.⁵
  核 $A$ 存在完备的标准正交特征函数系,即存在一列标准正交的特征函数 $\varphi_{1}(s), \varphi_{2}(s), \dots$,使得每个特征函数都能表示为有限个 $\varphi_{j}(s)$ 的线性组合。⁵
- The eigenvalues satisfy
  特征值满足
  $$\int_{a}^{b} \int_{a}^{b} (A(s, t))^{2} \, ds \, dt \geq \sum_{i} \frac{1}{\lambda_{i}^{2}},$$
  which implies that the sequence of eigenvalues is unbounded.
  这表明特征值序列是无界的。
⁵ This usage of the word "complete" is at variance with today's usage, in which a sequence is complete if its finite linear combinations are dense.
⁵ 此处"完备"(complete)一词的含义与现代定义不同。在现代数学中,若一列函数的有限线性组合在某个函数空间中稠密,则称该函数列是完备的。
Unsymmetric kernels
非对称核
Schmidt now allows A ( s , t ) A(s, t) A(s,t) to be unsymmetric and calls any nonzero pair u ( s ) u(s) u(s) and v ( s ) v(s) v(s) satisfying
接下来,施密特考虑核 A ( s , t ) A(s, t) A(s,t) 为非对称的情形,并将满足下列条件的非零函数对 u ( s ) u(s) u(s) 和 v ( s ) v(s) v(s) 称为
$$u(s) = \lambda \int_{a}^{b} A(s, t) v(t) \, dt$$
and
以及
$$v(t) = \lambda \int_{a}^{b} A(s, t) u(s) \, ds,$$
a pair of adjoint eigenfunctions corresponding to the eigenvalue λ \lambda λ. He then introduces the symmetric kernels 6 ^6 6
对应于特征值 λ \lambda λ 的伴随特征函数对。随后,他引入如下对称核:6
$$\underline{A}(s, t) = \int_{a}^{b} A(s, r) A(t, r) \, dr$$
and
以及
$$\overline{A}(s, t) = \int_{a}^{b} A(r, s) A(r, t) \, dr$$
and shows that if u 1 ( s ) , u 2 ( s ) , ... u_{1}(s), u_{2}(s), \dots u1(s),u2(s),... is a complete orthonormal system for A ‾ ( s , t ) \underline{A}(s, t) A(s,t) corresponding to the eigenvalues μ 1 , μ 2 , ... \mu_{1}, \mu_{2}, \dots μ1,μ2,..., then the functions v i ( t ) v_{i}(t) vi(t) defined by
施密特证明:若 u 1 ( s ) , u 2 ( s ) , ... u_{1}(s), u_{2}(s), \dots u1(s),u2(s),... 是对称核 A ‾ ( s , t ) \underline{A}(s, t) A(s,t) 对应于特征值 μ 1 , μ 2 , ... \mu_{1}, \mu_{2}, \dots μ1,μ2,... 的完备标准正交特征函数系,则由下式定义的函数 v i ( t ) v_{i}(t) vi(t)
$$v_{i}(t) = \lambda_{i} \int_{a}^{b} A(s, t) u_{i}(s) \, ds, \quad i = 1, 2, \dots$$
is a complete orthonormal system for $\overline{A}(s, t)$. Moreover, for $i = 1, 2, \dots$ the functions $u_{i}(s)$ and $v_{i}(s)$ form an adjoint pair for $A(s, t)$ with $\lambda_{i} = 1/\sqrt{\mu_{i}}$.
是对称核 $\overline{A}(s, t)$ 的完备标准正交特征函数系。此外,对于 $i = 1, 2, \dots$,函数对 $u_{i}(s)$ 和 $v_{i}(s)$ 是核 $A(s, t)$ 对应于特征值 $\lambda_{i} = 1/\sqrt{\mu_{i}}$ 的伴随特征函数对。
⁶ Again the usage differs from ours, but now in two ways. We work with the reciprocal of $\lambda$, calling it a singular value, and we distinguish between the singular values of a matrix and its eigenvalues.
⁶ 此处术语的用法与现代也存在差异,且差异体现在两个方面:一是我们现在使用 $\lambda$ 的倒数,并称之为奇异值;二是我们会明确区分矩阵的奇异值与特征值。
Schmidt then goes on to consider the expansion of functions in series of eigenfunctions. Specifically, if
随后,施密特研究了函数按特征函数系的展开问题。具体而言,若
$$g(s) = \int_{a}^{b} A(s, t) h(t) \, dt,$$
then
则有
$$g(s) = \sum_{i} \frac{u_{i}(s)}{\lambda_{i}} \int_{a}^{b} h(t) v_{i}(t) \, dt,$$
and the convergence is absolute and uniform. Finally, he shows that if g g g and h h h are continuous, then
且该级数的收敛是绝对收敛且一致收敛的。最后,施密特证明:若 g g g 和 h h h 均为连续函数,则
$$(5.1) \quad \int_{a}^{b} \int_{a}^{b} A(s, t) g(s) h(t) \, ds \, dt = \sum_{i} \frac{1}{\lambda_{i}} \int_{a}^{b} g(s) u_{i}(s) \, ds \int_{a}^{b} h(t) v_{i}(t) \, dt,$$
an expression which Schmidt says "corresponds to the canonical decomposition of a bilinear form."
施密特指出,上式"对应于双线性型的标准分解"。
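In modern matrix terms (a small numerical sketch, writing $\sigma_{i} = 1/\lambda_{i}$), (5.1) is simply the statement that a bilinear form diagonalizes in the singular vector bases, $g^{T} A h = \sum_{i} \sigma_{i}\,(g^{T} u_{i})(h^{T} v_{i})$:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))
g, h = rng.standard_normal(5), rng.standard_normal(5)

U, s, Vt = np.linalg.svd(A)
lhs = g @ A @ h
rhs = sum(s[i] * (g @ U[:, i]) * (h @ Vt[i]) for i in range(5))
print(np.isclose(lhs, rhs))   # the bilinear form diagonalizes in the singular bases -> True
```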
The approximation theorem
近似定理
Up to now, our exposition has been cast in the language of integral equations, principally to keep issues of analysis in the foreground. These issues are not as important in what follows, and we will therefore return to matrix notation, taking care, as always, to follow Schmidt's development closely.
到目前为止,我们一直使用积分方程的语言进行阐述,主要是为了突出分析学相关问题的重要性。但在下文的讨论中,这些问题的重要性会有所降低,因此我们将回归矩阵符号表述,同时一如既往地严格遵循施密特的研究思路。
The problem Schmidt sets out to solve is that of finding the best approximation to A A A of the form
施密特旨在解决的问题是:寻找如下形式的矩阵,使其成为矩阵 A A A 的最优近似:
$$A \cong \sum_{i=1}^{k} x_{i} y_{i}^{T}$$
in the sense that
这里的"最优"定义为
$$\left\| A - \sum_{i=1}^{k} x_{i} y_{i}^{T} \right\| = \min.$$
In other words, he is looking for the best approximation of rank not greater than k k k.
换句话说,他要寻找的是秩不超过 k k k 的矩阵中对 A A A 的最优近似。
Schmidt begins by noting that if
施密特首先指出,若
$$(5.2) \quad A_{k} = \sum_{i=1}^{k} \sigma_{i} u_{i} v_{i}^{T},$$
then
则有
$$\left\| A - A_{k} \right\|^{2} = \| A \|^{2} - \sum_{i=1}^{k} \sigma_{i}^{2}.$$
Consequently, if it can be shown that for arbitrary x i x_{i} xi and y i y_{i} yi
因此,若能证明对任意的 x i x_{i} xi 和 y i y_{i} yi,都有
$$(5.3) \quad \left\| A - \sum_{i=1}^{k} x_{i} y_{i}^{T} \right\|^{2} \geq \| A \|^{2} - \sum_{i=1}^{k} \sigma_{i}^{2},$$
then A k A_{k} Ak will be the desired approximation.
则 A k A_{k} Ak 即为所求的最优近似矩阵。
Without loss of generality we may assume that the vectors x 1 , ... , x k x_{1}, \dots, x_{k} x1,...,xk are orthonormal. For if they are not, we can use Gram-Schmidt orthogonalization to express them as linear combinations of orthonormal vectors, substitute these expressions in ∑ i = 1 k x i y i T \sum_{i=1}^{k} x_{i} y_{i}^{T} ∑i=1kxiyiT, and collect terms in the new vectors.
不失一般性,可假设向量 x 1 , ... , x k x_{1}, \dots, x_{k} x1,...,xk 是标准正交的。若这些向量不是标准正交的,可通过格拉姆-施密特正交化方法将其表示为标准正交向量的线性组合,将该表达式代入 ∑ i = 1 k x i y i T \sum_{i=1}^{k} x_{i} y_{i}^{T} ∑i=1kxiyiT 中,并按新的标准正交向量整理项即可。
Now
此时有
$$\begin{aligned} \left\| A - \sum_{i=1}^{k} x_{i} y_{i}^{T} \right\|^{2} &= \operatorname{trace}\left( \left( A - \sum_{i=1}^{k} x_{i} y_{i}^{T} \right)^{T} \left( A - \sum_{i=1}^{k} x_{i} y_{i}^{T} \right) \right) \\ &= \operatorname{trace}\left( A^{T} A + \sum_{i=1}^{k} (y_{i} - A^{T} x_{i}) (y_{i} - A^{T} x_{i})^{T} - \sum_{i=1}^{k} A^{T} x_{i} x_{i}^{T} A \right). \end{aligned}$$
Since $\operatorname{trace}((y_{i} - A^{T} x_{i})(y_{i} - A^{T} x_{i})^{T}) \geq 0$ and $\operatorname{trace}(A^{T} x_{i} x_{i}^{T} A) = \|A^{T} x_{i}\|^{2}$, the result will be established if it can be shown that
由于 $\operatorname{trace}((y_{i} - A^{T} x_{i})(y_{i} - A^{T} x_{i})^{T}) \geq 0$ 且 $\operatorname{trace}(A^{T} x_{i} x_{i}^{T} A) = \|A^{T} x_{i}\|^{2}$,如果能够证明
$$\sum_{i=1}^{k} \|A^{T} x_{i}\|^{2} \leq \sum_{i=1}^{k} \sigma_{i}^{2},$$
则结果将被证明。
Let $U = (U_{1}, U_{2})$, where $U_{1}$ has $k$ columns, and let $\Sigma = \operatorname{diag}(\Sigma_{1}, \Sigma_{2})$ be a conformal partition of $\Sigma$. Then
设 $U = (U_{1}, U_{2})$,其中 $U_{1}$ 有 $k$ 列,设 $\Sigma = \operatorname{diag}(\Sigma_{1}, \Sigma_{2})$ 是 $\Sigma$ 的一个共形划分。那么
$$(5.4) \quad \begin{aligned} \| A^{T} x_{i} \|^{2} &= \sigma_{k}^{2} + \left( \| \Sigma_{1} U_{1}^{T} x_{i} \|^{2} - \sigma_{k}^{2} \| U_{1}^{T} x_{i} \|^{2} \right) \\ &\quad - \left( \sigma_{k}^{2} \| U_{2}^{T} x_{i} \|^{2} - \| \Sigma_{2} U_{2}^{T} x_{i} \|^{2} \right) \\ &\quad - \sigma_{k}^{2} \left( 1 - \| U^{T} x_{i} \|^{2} \right). \end{aligned}$$
Now the last two terms in (5.4) are clearly nonnegative. Hence
显然,式 (5.4) 中的最后两项均非负。因此
$$\begin{aligned} \sum_{i=1}^{k} \| A^{T} x_{i} \|^{2} &\leq k \sigma_{k}^{2} + \sum_{i=1}^{k} \left( \| \Sigma_{1} U_{1}^{T} x_{i} \|^{2} - \sigma_{k}^{2} \| U_{1}^{T} x_{i} \|^{2} \right) \\ &= k \sigma_{k}^{2} + \sum_{i=1}^{k} \sum_{j=1}^{k} (\sigma_{j}^{2} - \sigma_{k}^{2}) | u_{j}^{T} x_{i} |^{2} \\ &= \sum_{j=1}^{k} \left( \sigma_{k}^{2} + (\sigma_{j}^{2} - \sigma_{k}^{2}) \sum_{i=1}^{k} | u_{j}^{T} x_{i} |^{2} \right) \\ &\leq \sum_{j=1}^{k} \left( \sigma_{k}^{2} + (\sigma_{j}^{2} - \sigma_{k}^{2}) \right) \\ &= \sum_{j=1}^{k} \sigma_{j}^{2}, \end{aligned}$$
which establishes the result.
由此可证得所需结论。
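A small numerical illustration of the approximation theorem (a sketch with a random matrix): the truncated expansion (5.2) attains $\| A - A_{k} \|^{2} = \sum_{i>k} \sigma_{i}^{2}$, and a random rank-$k$ competitor does no better.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 8, 3
A = rng.standard_normal((n, n))

U, s, Vt = np.linalg.svd(A)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]         # the truncated expansion (5.2)

err_opt = np.sum((A - A_k) ** 2)
print(np.isclose(err_opt, np.sum(s[k:] ** 2)))   # ||A - A_k||^2 = sum of discarded sigma_i^2 -> True

# Any other approximation of rank at most k does no better.
X, Y = rng.standard_normal((n, k)), rng.standard_normal((n, k))
print(np.sum((A - X @ Y.T) ** 2) >= err_opt)     # True
```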
Discussion
讨论
Schmidt's two contributions to the singular value decomposition are its generalization to function spaces and his approximation theorem. Although Schmidt did not refer to earlier work on the decomposition in finite-dimensional spaces, the quote following (5.1) suggests that he knew of its existence. Nonetheless, his contribution here is substantial, especially since he had to deal with many of the problems of functional analysis without modern tools.
施密特对奇异值分解的贡献主要体现在两个方面:一是将奇异值分解推广到函数空间(即无穷维空间),二是提出了上述近似定理。尽管施密特在论文中并未提及早期关于有限维空间奇异值分解的研究,但从式 (5.1) 后的表述可以推测,他知晓有限维奇异值分解的存在。即便如此,他的研究仍具有重大意义------尤其是考虑到他在没有现代泛函分析工具的情况下,解决了泛函分析中的诸多难题。
An important difference in Schmidt's version of the decomposition is the treatment of null vectors of A A A. In his predecessors' treatments they are part of the substitution that reduces the bilinear form x T A y x^{T} A y xTAy to its canonical form. For Schmidt they are not part of the decomposition. The effect of this can be seen in the third term of (5.4), which in the usual approach is zero but in Schmidt's approach can be nonzero.
施密特提出的奇异值分解与前人成果的一个重要区别,在于对矩阵 A A A 零空间向量的处理方式。在其前人的研究中,零空间向量是将双线性型 x T A y x^{T} A y xTAy 化为标准型的变量替换的一部分;而在施密特的分解中,零空间向量并未包含在分解之内。这一差异的影响可在式 (5.4) 的第三项中体现:在常规的奇异值分解方法中,该项为零,但在施密特的方法中,该项可能非零。
The crowning glory of Schmidt's work is his approximation theorem, which is nontrivial to conjecture and hard to prove from scratch. Schmidt's proof is certainly not pretty---we will examine the more elegant approach of Weyl in the next section---but it does establish what can properly be termed the fundamental theorem of the singular value decomposition.
施密特研究成果的最大亮点便是上述近似定理。该定理的猜想并非显而易见,且从零开始证明难度极大。尽管施密特的证明过程并不简洁(下一节将介绍外尔提出的更简洁的证明方法),但他确实证明了这一堪称奇异值分解基本定理的重要结论。
csdn 篇幅字数限制,未完......
- 线性代数 · SVD | 奇异值分解的早期历史(二)-CSDN博客
https://blog.csdn.net/u013669912/article/details/151970576
via:
- On the Early History of the Singular Value Decomposition | SIAM Review
https://epubs.siam.org/doi/10.1137/1035134