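Linear Algebra · Matrices | SVD and PCA: How Their Applications Differ

Note: this post is a compilation of material on "Linear Algebra · Matrices | SVD and PCA".

English excerpts are machine-translated and have not been proofread.

Chinese excerpts have been lightly rearranged.

Image clarity is limited by the original figures in the quoted sources.

If anything looks wrong, please consult the original sources.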

1 Singular Value Decomposition and Principal Component Analysis

In these lectures we discuss the SVD and PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA) is a linear dimensionality reduction method dating back to Pearson (1901), and it is one of the most useful techniques in exploratory data analysis. It is also known under other names, such as the Karhunen-Loève transform, the Hotelling transform, and Proper Orthogonal Decomposition (POD). PCA is applied to a data set comprising $n$ vectors $x_{1}, \dots, x_{n} \in \mathbb{R}^{d}$ and returns a new basis for $\mathbb{R}^{d}$ whose elements are termed the principal components. Importantly, the method is completely data-dependent: the new basis is a function of the data alone. PCA builds on the SVD (or the spectral theorem), so we start with the SVD.

1.1 Singular Value Decomposition (SVD)

Consider a matrix $A \in \mathbb{R}^{m \times n}$ or $\mathbb{C}^{m \times n}$ and assume that $m \geq n$. Then the singular value decomposition (SVD) of $A$ is given by [1]

$$A = U D W^{*}$$

where $U$ is $m \times m$, $D$ is $m \times n$, $W$ is $n \times n$, $U$ and $W$ are unitary (i.e., $U^{*} U = U U^{*} = I_{m}$ and $W W^{*} = W^{*} W = I_{n}$), and $D$ is a rectangular diagonal matrix

$$D = \begin{bmatrix} \sigma_{1} & 0 & \dots & 0 \\ 0 & \sigma_{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_{n} \\ 0 & 0 & \dots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & 0 \end{bmatrix}$$

with $D_{ii} = \sigma_{i} > 0$. Here the $\sigma_{i}$ are called the singular values of $A$, the columns of $U$ are the corresponding left singular vectors, and the columns of $W$ are the corresponding right singular vectors.

Let $U = [u_{1}, \dots, u_{m}]$, $W = [w_{1}, \dots, w_{n}]$, and let $r$ be the rank of $A$. Then we can write

$$A = \sum_{i=1}^{r} \sigma_{i} u_{i} w_{i}^{*}$$

with $r \leq n$ and $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{r}$. (So $A$ is a weighted sum of rank-one matrices.) The SVD exists for any finite-dimensional matrix.
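
The factorization and the rank-one expansion can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random test matrix, the seed, and the variable names are illustrative choices and not part of the lecture notes. Note that `np.linalg.svd` returns $W^{*}$ (here `Wh`) rather than $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))

# full_matrices=True returns U (m x m), the singular values s (length n), and Wh = W^* (n x n).
U, s, Wh = np.linalg.svd(A, full_matrices=True)

# Build the rectangular diagonal matrix D (m x n) from the singular values.
D = np.zeros((m, n))
D[:n, :n] = np.diag(s)

# Reconstruct A = U D W^*.
A_from_factors = U @ D @ Wh

# Equivalent sum of weighted rank-one matrices sigma_i * u_i * w_i^*.
r = np.linalg.matrix_rank(A)
A_from_rank_one = sum(s[i] * np.outer(U[:, i], Wh[i, :]) for i in range(r))

print(np.allclose(A, A_from_factors), np.allclose(A, A_from_rank_one))
```

Both comparisons should agree up to floating-point rounding.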

Remarks

- The $u_{i}$ are eigenvectors of $A A^{*}$ and the $w_{i}$ are eigenvectors of $A^{*} A$.
- $A A^{*}$ and $A^{*} A$ are positive semidefinite, so their eigenvalues are nonnegative.
- If $\lambda_{i}$ are the eigenvalues of $A^{*} A$, then $\sigma_{i}^{2} = \lambda_{i}$ whenever $\lambda_{i} > 0$. (Here we require singular values to be positive, but this is largely a matter of taste.)
- If $A$ is square, Hermitian, and positive semidefinite, then the SVD and the eigenvalue decomposition coincide. (For a general Hermitian $A$, the singular values are the absolute values of the eigenvalues of $A$, so the two decompositions agree only up to signs.)
- We could alternatively define the SVD with $U$ as an $m \times n$ matrix, $D$ as an $n \times n$ matrix, and $W$ as an $n \times n$ matrix. In this case $U^{*} U = I_{n}$ and $W^{*} W = W W^{*} = I_{n}$.

Some intuition for the SVD: the unitary changes of basis given by $U$ and $W$ rotate $A$ into diagonal form, $U^{*} A W = D$.
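
The remarks above lend themselves to a quick numerical sanity check. This is a sketch under the assumption of a real random matrix (so $A^{*} = A^{T}$) and uses the thin SVD from NumPy; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))

# Thin SVD: U is m x n, the singular values fill an n x n diagonal, Wh = W^* is n x n.
U, s, Wh = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of the positive semidefinite matrix A^* A, sorted in decreasing order.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]

# sigma_i^2 should match the eigenvalues of A^* A.
print(np.allclose(s**2, eigvals))

# U has orthonormal columns (U^* U = I_n) even though it is not square.
print(np.allclose(U.T @ U, np.eye(n)))

# Changing bases diagonalizes A: U^* A W = D.
D = U.T @ A @ Wh.T
print(np.allclose(D, np.diag(s)))
```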

2 Principal Component Analysis (PCA)

2.1 Motivation

Given $x_{1}, \dots, x_{n} \in \mathbb{R}^{d}$, we want to project the $x_{i}$ onto $\mathbb{R}^{k}$ with $k < d$. How do we choose $k$ and the orientation of the subspace? We consider two ideas:

1. Find the $k$-dimensional subspace for which the projections of $x_{1}, \dots, x_{n}$ best approximate the original points $x_{1}, \dots, x_{n}$. ("Best approximation" is meant in the sense of the 2-norm.)
2. We also want to preserve what makes the data points different from each other. Hence, find the $k$-dimensional projection of $x_{1}, \dots, x_{n}$ that preserves most of the variance of the $x_{i}$.

Both ideas are realized by principal component analysis (PCA).

2.2 Optimization Problem Formulation [following lecture notes of Singer and Bandeira]

We denote the sample mean by

$$\mu_{n} := \frac{1}{n} \sum_{i=1}^{n} x_{i}$$

and the sample covariance matrix by

$$\Sigma_{n} := \frac{1}{n} \sum_{i=1}^{n} \left(x_{i}-\mu_{n}\right)\left(x_{i}-\mu_{n}\right)^{*}.$$
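
As a small illustration of these two definitions, the sketch below computes the sample mean and the sample covariance for synthetic data stored column-wise, assuming NumPy. The comparison with `np.cov` only serves as a cross-check; note the $\frac{1}{n}$ normalization used in these notes, whereas `np.cov` defaults to $\frac{1}{n-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 200
X = rng.standard_normal((d, n))          # columns are the data points x_1, ..., x_n

mu_n = X.mean(axis=1, keepdims=True)     # d x 1 sample mean
Y = X - mu_n                             # centered data
Sigma_n = (Y @ Y.T) / n                  # d x d sample covariance with 1/n normalization

# np.cov uses 1/(n-1) by default; bias=True switches to the 1/n convention used here.
print(np.allclose(Sigma_n, np.cov(X, bias=True)))
```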

Let us focus on the first idea in Section 2.1. We want to approximate the $x_{i}$ by points in a low-dimensional affine subspace, so that for each $x_{i}$ we have

$$x_{i} \approx \mu+\sum_{j=1}^{k}\left(\alpha_{i}\right)_{j} v_{j},$$

where $V := [v_{1}, \dots, v_{k}]$ is an orthonormal basis to be determined. We can rewrite the above as $x_{i} \approx \mu + V \alpha_{i}$, where

$$\alpha_{i}=\begin{bmatrix} \alpha_{i1} \\ \alpha_{i2} \\ \vdots \\ \alpha_{ik} \end{bmatrix}$$

and $V$ is a $d \times k$ matrix satisfying $V^{*} V = I_{k}$. We now try to solve the optimization problem

$$\min_{\substack{\mu,\, V,\, \alpha_{1}, \dots, \alpha_{n} \\ V^{*} V=I_{k}}} I := \sum_{i=1}^{n}\left\| x_{i}-\left(\mu+V \alpha_{i}\right)\right\|_{2}^{2}.$$

That is, we minimize the total squared $\ell_{2}$-error over all vectors $x_{i}$. (Unlike in the Johnson-Lindenstrauss (JL) approach, we do not strive to keep the error uniformly small (within an $\varepsilon$-range) across all $x_{i}$, but rather to minimize the average error.)

2.3 Solving the Optimization Problem

Fortunately, this problem separates: we can optimize first over $\mu$, then over the $\alpha_{i}$, and finally over $V$. (There are optimization problems that look similar but for which this separation of variables does not work.)

Let us first optimize with respect to $\mu$. Without loss of generality, we can assume that $\sum_{i=1}^{n} \alpha_{i} = 0$, because otherwise we could absorb the nonzero mean of the $\alpha_{i}$ (mapped through $V$) into $\mu$. Then

$$\frac{\partial I}{\partial \mu} = -2 \sum_{i=1}^{n}\left(x_{i}-\mu-V \alpha_{i}\right).$$

Setting the right-hand side equal to zero and using $\sum_{i} \alpha_{i} = 0$, we get $\mu = \frac{1}{n} \sum_{i=1}^{n} x_{i} = \mu_{n}$.

Now let us optimize over the $\alpha_{i}$. We calculate

$$\frac{\partial I}{\partial \alpha_{i}} = -2\, V^{*}\left(x_{i}-\mu-V \alpha_{i}\right).$$

Setting the right-hand side equal to zero and using $V^{*} V = I_{k}$, we get $\alpha_{i} = V^{*}\left(x_{i}-\mu\right)$.

Plugging the expressions for $\mu$ and $\alpha_{i}$ into $I$, we get

$$I = \sum_{i=1}^{n}\left\| x_{i}-\mu_{n}-V V^{*}\left(x_{i}-\mu_{n}\right)\right\|_{2}^{2},$$

where $V V^{*}$ is an orthogonal projection matrix. Thus, letting $y_{i} := x_{i}-\mu_{n}$,

$$I = \sum_{i=1}^{n}\left\| y_{i}-V V^{*} y_{i}\right\|_{2}^{2}.$$
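
The two closed-form minimizers can be checked numerically for a fixed $V$. The sketch below assumes NumPy and an arbitrary orthonormal $V$ obtained from a QR factorization (any $V$ with $V^{*}V = I_{k}$ would do); it compares $\alpha_{i} = V^{*}(x_{i} - \mu_{n})$ with a generic least-squares solve and confirms that perturbing $\mu$ or the $\alpha_{i}$ does not lower the objective. The helper `objective` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, k = 5, 50, 2
X = rng.standard_normal((d, n))

# Any matrix with orthonormal columns will do for this check; QR provides one.
V, _ = np.linalg.qr(rng.standard_normal((d, k)))

mu_n = X.mean(axis=1, keepdims=True)
Y = X - mu_n

# Closed-form coefficients alpha_i = V^* (x_i - mu_n), stacked as columns.
alpha = V.T @ Y

# Generic least-squares solution of V alpha_i ~ y_i, for comparison.
alpha_lstsq, *_ = np.linalg.lstsq(V, Y, rcond=None)

def objective(mu, coeffs):
    """I = sum_i || x_i - (mu + V alpha_i) ||_2^2."""
    return np.sum((X - mu - V @ coeffs) ** 2)

I_opt = objective(mu_n, alpha)
# Perturbing mu or alpha should not decrease the objective.
I_shifted_mu = objective(mu_n + 0.1, alpha)
I_shifted_alpha = objective(mu_n, alpha + 0.1)
print(np.allclose(alpha, alpha_lstsq), I_opt <= I_shifted_mu, I_opt <= I_shifted_alpha)
```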

Denote $Y = [y_{1}, \dots, y_{n}]$. Then

$$\begin{aligned} \min_{V: V^{*} V=I_{k}} \sum_{i=1}^{n}\left\| y_{i}-V V^{*} y_{i}\right\|_{2}^{2} &= \min_{V: V^{*} V=I_{k}} \operatorname{trace}\left[\left(Y-V V^{*} Y\right)^{*}\left(Y-V V^{*} Y\right)\right] \\ &= \min_{V: V^{*} V=I_{k}} \operatorname{trace}\left[Y^{*}\left(I-V V^{*}\right)\left(I-V V^{*}\right) Y\right]. \end{aligned}$$

Using properties of the trace (the cyclic property and linearity) and the fact that $(I - V V^{*})(I - V V^{*}) = I - V V^{*}$, we have

$$\begin{aligned} \min_{V: V^{*} V=I_{k}} \sum_{i=1}^{n}\left\| y_{i}-V V^{*} y_{i}\right\|_{2}^{2} &= \min_{V: V^{*} V=I_{k}} \operatorname{trace}\left[Y Y^{*}\left(I-V V^{*}\right)\right] \\ &= \min_{V: V^{*} V=I_{k}} \left[\operatorname{trace}\left(Y Y^{*}\right)-\operatorname{trace}\left(Y Y^{*} V V^{*}\right)\right] \\ &= \min_{V: V^{*} V=I_{k}} \left[\operatorname{trace}\left(Y Y^{*}\right)-\operatorname{trace}\left(V^{*} Y Y^{*} V\right)\right]. \end{aligned}$$

But $Y$ does not depend on $V$! Hence $\operatorname{trace}(Y Y^{*})$ is a constant that does not affect the minimizer, and the problem above is equivalent to the following optimization problem:

$$\max_{V: V^{*} V=I_{k}} \frac{1}{n} \operatorname{trace}\left(V^{*} Y Y^{*} V\right) = \max_{V: V^{*} V=I_{k}} \operatorname{trace}\left(V^{*} \Sigma_{n} V\right),$$

since $\Sigma_{n} = \frac{1}{n} Y Y^{*}$.
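
The trace identity behind this reformulation is easy to verify numerically. This is a minimal sketch assuming NumPy, with random centered data and an arbitrary orthonormal $V$; it checks that the residual $\sum_{i} \| y_{i} - V V^{*} y_{i} \|_{2}^{2}$ equals $\operatorname{trace}(Y Y^{*}) - \operatorname{trace}(V^{*} Y Y^{*} V)$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 6, 40, 3
X = rng.standard_normal((d, n))
Y = X - X.mean(axis=1, keepdims=True)             # centered data, columns y_i
V, _ = np.linalg.qr(rng.standard_normal((d, k)))  # arbitrary V with V^* V = I_k

P = V @ V.T                                       # orthogonal projector onto range(V)
residual = np.sum((Y - P @ Y) ** 2)               # sum_i || y_i - V V^* y_i ||_2^2

total = np.trace(Y @ Y.T)                         # trace(Y Y^*), independent of V
captured = np.trace(V.T @ Y @ Y.T @ V)            # trace(V^* Y Y^* V)

print(np.isclose(residual, total - captured))
```

Minimizing the residual is therefore the same as maximizing the captured term, since the total is fixed by the data.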

Let $\Sigma_{n}$ have the eigenvalue decomposition

$$\Sigma_{n} = \sum_{i=1}^{d} \lambda_{i} v_{i} v_{i}^{*}$$

(note: the original text uses $n$ as the upper limit; it is corrected here to $d$ to match the dimension of the data space $\mathbb{R}^{d}$), where $\lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_{d} \geq 0$. (The $\lambda_{i}$ cannot be negative because $\Sigma_{n}$ is positive semidefinite.) Here the $\lambda_{i}$ are the eigenvalues of $\Sigma_{n}$ and the $v_{i}$ are the corresponding eigenvectors. Since $\Sigma_{n}$ is symmetric, the $v_{i}$ can be chosen to be orthonormal.

From linear algebra, we know that

$$\max_{V: V^{*} V=I_{k}} \operatorname{trace}\left(V^{*} \Sigma_{n} V\right) = \sum_{i=1}^{k} \lambda_{i}.$$

Moreover, the maximum is attained by $V = [v_{1}, \dots, v_{k}]$, where $v_{1}, \dots, v_{k}$ are eigenvectors corresponding to the $k$ largest eigenvalues of $\Sigma_{n}$. Hence these $v_{j}$ provide the desired optimal orthonormal basis for our data $x_{i}$.
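
The maximization result can be illustrated empirically. The sketch below assumes NumPy and synthetic data with unequal variances per coordinate (an illustrative choice); it checks that the top-$k$ eigenvectors attain $\sum_{i=1}^{k} \lambda_{i}$ and that randomly drawn orthonormal matrices never do better. A random search is of course only a plausibility check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, k = 6, 100, 2
X = rng.standard_normal((d, n)) * np.arange(1, d + 1)[:, None]  # unequal variances per coordinate
Y = X - X.mean(axis=1, keepdims=True)
Sigma_n = (Y @ Y.T) / n

# eigh returns eigenvalues in ascending order; flip to get lambda_1 >= ... >= lambda_d.
eigvals, eigvecs = np.linalg.eigh(Sigma_n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

V_opt = eigvecs[:, :k]
best = np.trace(V_opt.T @ Sigma_n @ V_opt)
print(np.isclose(best, eigvals[:k].sum()))

# No random orthonormal V should capture more variance than the top-k eigenvectors.
worst_gap = np.inf
for trial in range(1000):
    V_rand, _ = np.linalg.qr(rng.standard_normal((d, k)))
    worst_gap = min(worst_gap, best - np.trace(V_rand.T @ Sigma_n @ V_rand))
print(worst_gap >= -1e-9)
```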

2.4 Intuition for PCA

PCA first computes the eigenvalue decomposition of $\Sigma_{n}$. The principal components are then the projections of the centered data points (centered means that the sample mean $\mu_{n}$ has been subtracted) onto the top $k$ eigenvectors of the sample covariance matrix $\Sigma_{n}$, that is, the eigenvectors associated with the $k$ largest eigenvalues.
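
The procedure described above can be sketched in a few lines. The function name `pca_eig`, its return values, and the synthetic data are assumptions made for illustration; this follows the covariance-eigendecomposition route from the text and is not meant as a reference implementation.

```python
import numpy as np

def pca_eig(X, k):
    """PCA of the columns of X (d x n) via the eigendecomposition of the sample covariance.

    Returns the sample mean (d x 1), the top-k eigenvectors V (d x k),
    the k largest eigenvalues, and the projected data V^* (x_i - mu_n) (k x n).
    """
    n = X.shape[1]
    mu_n = X.mean(axis=1, keepdims=True)
    Y = X - mu_n
    Sigma_n = (Y @ Y.T) / n
    eigvals, eigvecs = np.linalg.eigh(Sigma_n)      # ascending order
    order = np.argsort(eigvals)[::-1][:k]           # indices of the k largest eigenvalues
    V = eigvecs[:, order]
    return mu_n, V, eigvals[order], V.T @ Y

rng = np.random.default_rng(6)
# Synthetic 3-D data that varies mostly along two directions.
X = rng.standard_normal((3, 500)) * np.array([[5.0], [2.0], [0.1]])
mu_n, V, top_eigvals, scores = pca_eig(X, k=2)

# The projected coordinates are uncorrelated, with variances equal to the top eigenvalues.
scores_cov = (scores @ scores.T) / X.shape[1]
print(np.allclose(scores_cov, np.diag(top_eigvals)))
```

The low-dimensional representation of each point is the corresponding column of `scores`; an approximation in the original space is `mu_n + V @ scores`.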

2.5 Cost for PCA

The computational cost of the PCA procedure without the SVD is as follows:

- Constructing the sample covariance matrix $\Sigma_{n}$ requires $O(n d^{2})$ operations (where $n$ is the number of data points and $d$ is the dimension of each data point).
- Computing the eigenvalue decomposition of $\Sigma_{n}$ with a traditional dense method to obtain $V$ requires $O(d^{3})$ operations.

However, the cost can be reduced by using the SVD, as explained below.

Let $X = [x_{1}, \dots, x_{n}]$ (a $d \times n$ matrix whose columns are the data points) and

$$1_{n} := \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$

(an $n \times 1$ vector of ones). Then the sample covariance matrix can be rewritten as

$$\Sigma_{n} = \frac{1}{n}\left(X-\mu_{n} 1_{n}^{*}\right)\left(X-\mu_{n} 1_{n}^{*}\right)^{*}.$$

The key idea for reducing the computational cost is to compute the SVD of the centered data matrix $A := X-\mu_{n} 1_{n}^{*}$ directly, instead of first constructing $\Sigma_{n}$.

The left singular vectors of $A$ are exactly the eigenvectors of $A A^{*} = n \Sigma_{n}$, so they coincide with the eigenvectors $v_{1}, \dots, v_{d}$ of $\Sigma_{n}$, and the eigenvalues are $\lambda_{i} = \sigma_{i}^{2} / n$. Therefore the computational cost of PCA via the SVD is $O(\min \{n^{2} d, n d^{2}\})$, which never exceeds the $O(n d^{2} + d^{3})$ of the covariance route and is substantially lower when $d \gg n$.
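
The equivalence of the two routes can be checked directly. The following sketch assumes NumPy and small synthetic data; because eigenvectors and singular vectors are only determined up to sign, it compares the projectors onto the top-$k$ subspaces rather than the vectors themselves.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, k = 8, 30, 3
X = rng.standard_normal((d, n)) * np.linspace(3.0, 0.5, d)[:, None]

mu_n = X.mean(axis=1, keepdims=True)
A = X - mu_n                                   # centered data matrix A = X - mu_n 1_n^*

# Route 1: eigendecomposition of the covariance matrix.
Sigma_n = (A @ A.T) / n
eigvals, eigvecs = np.linalg.eigh(Sigma_n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: thin SVD of A, never forming Sigma_n.
U, s, Wh = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of Sigma_n are sigma_i^2 / n.
print(np.allclose(s**2 / n, eigvals))

# Compare the projectors onto the top-k subspace (insensitive to sign flips).
P_eig = eigvecs[:, :k] @ eigvecs[:, :k].T
P_svd = U[:, :k] @ U[:, :k].T
print(np.allclose(P_eig, P_svd))
```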

Furthermore, the full SVD of $A$ yields all left singular vectors $v_{1}, \dots, v_{d}$. In practice, however, we only need the first $k$ left singular vectors $v_{1}, \dots, v_{k}$ (where $k < d$, and often $k \ll d$ for dimensionality reduction). Computing only the top $k$ singular vectors takes $O(d n k)$ operations, which is much faster than computing the full SVD. In MATLAB this can be done with the svds command, which internally uses Lanczos-type iterative methods to find the top $k$ singular vectors efficiently.
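
To make the $O(d n k)$ cost per sweep concrete, here is a minimal sketch of subspace (block power) iteration in NumPy. This is deliberately a simpler technique than the Lanczos-type methods mentioned above and is not how svds works internally; the function name, the iteration count, and the synthetic matrix with a clear spectral gap are all assumptions for illustration.

```python
import numpy as np

def top_k_left_singular_vectors(A, k, n_iter=50, seed=0):
    """Approximate the top-k left singular vectors of A (d x n) by subspace iteration.

    Each sweep multiplies by A^* and A, costing O(d n k), and never forms A A^*.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[0], k)))
    for sweep in range(n_iter):
        Z = A @ (A.T @ Q)          # apply A A^* to the current basis
        Q, _ = np.linalg.qr(Z)     # re-orthonormalize to keep the iteration stable
    # Small SVD inside the converged subspace to separate the individual vectors.
    Ub, s, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ Ub, s

rng = np.random.default_rng(8)
d, n, k = 20, 15, 3
# Synthetic matrix with a clear gap after the third singular value.
U0, _ = np.linalg.qr(rng.standard_normal((d, n)))
W0, _ = np.linalg.qr(rng.standard_normal((n, n)))
s0 = np.concatenate(([10.0, 8.0, 6.0], np.linspace(1.0, 0.1, n - 3)))
A = (U0 * s0) @ W0.T

U_k, s_k = top_k_left_singular_vectors(A, k)
U_ref, s_ref, _ = np.linalg.svd(A, full_matrices=False)
print(np.allclose(s_k, s_ref[:k]))
print(np.allclose(U_k @ U_k.T, U_ref[:, :k] @ U_ref[:, :k].T))
```

Convergence of this simple scheme depends on the gap between $\sigma_{k}$ and $\sigma_{k+1}$; with a small gap, more sweeps (or a Krylov method) would be needed.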

We also note that randomized SVD algorithms can reduce this cost further, to $O(n d \log (k)+(n+d) k^{2})$. Such algorithms first use random sampling to reduce the dimension of the original matrix and then compute the SVD of the resulting low-dimensional matrix. This is particularly effective for large data sets, and we will discuss it in more detail later.
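
The sample-then-decompose idea can be sketched as follows, assuming NumPy. This basic variant uses a dense Gaussian test matrix, so its sampling step costs $O(n d k)$; the $\log(k)$ bound quoted above is associated with structured random transforms, which are not implemented here. The function name `randomized_svd`, the `oversample` parameter, and the exactly rank-3 test matrix are illustrative assumptions.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate rank-k SVD of A (d x n) from a random sketch of its range."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    ell = min(k + oversample, d, n)             # sketch size, slightly larger than k
    Omega = rng.standard_normal((n, ell))       # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)              # orthonormal basis for the sampled range (d x ell)
    B = Q.T @ A                                 # small ell x n matrix
    Ub, s, Wh = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Wh[:k, :]

rng = np.random.default_rng(9)
# Exactly rank-3 test matrix, so a rank-3 approximation can be exact up to rounding.
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))

U_k, s_k, Wh_k = randomized_svd(A, k=3)
A_approx = (U_k * s_k) @ Wh_k
print(np.allclose(A, A_approx))
print(np.allclose(s_k, np.linalg.svd(A, compute_uv=False)[:3]))
```

For matrices that are only approximately low rank, the quality of the result depends on how fast the singular values decay; power iterations are a common remedy, which this sketch omits.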

2.6 Another Optimality Property of the SVD

Let $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, and let its SVD be $A = \sum_{i=1}^{n} \sigma_{i} u_{i} w_{i}^{*}$ (where $\sigma_{1} \geq \sigma_{2} \geq \dots \geq \sigma_{n} > 0$ are the singular values, $u_{i} \in \mathbb{R}^{m}$ are the left singular vectors, and $w_{i} \in \mathbb{R}^{n}$ are the right singular vectors). For $k < n$, define the rank-$k$ truncated SVD of $A$ as

$$A_{k} = \sum_{i=1}^{k} \sigma_{i} u_{i} w_{i}^{*}.$$

Then for any matrix $B$ of rank at most $k$, the following best-approximation result holds:

$$\left\| A-A_{k}\right\|_{op} \leq \left\| A-B\right\|_{op}$$

and

$$\left\| A-A_{k}\right\|_{op} = \sigma_{k+1}.$$

In other words, the rank-$k$ truncated SVD $A_{k}$ is the best rank-$k$ approximation to $A$ in the operator norm (also known as the spectral norm). This property is the link between the SVD and PCA: projecting the centered data matrix onto the top $k$ principal directions yields exactly its rank-$k$ truncated SVD.
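
Both statements can be checked on a small example. This sketch assumes NumPy and a random test matrix; the comparison against randomly drawn rank-$k$ matrices $B$ is only a plausibility check of the inequality, not a proof.

```python
import numpy as np

rng = np.random.default_rng(10)
m, n, k = 7, 5, 2
A = rng.standard_normal((m, n))

U, s, Wh = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Wh[:k, :]            # rank-k truncated SVD

# ord=2 gives the operator (spectral) norm, i.e. the largest singular value.
err_k = np.linalg.norm(A - A_k, ord=2)
print(np.isclose(err_k, s[k]))                  # s[k] is sigma_{k+1} (0-based indexing)

# Random rank-k competitors B should never beat A_k.
smallest_competitor = np.inf
for trial in range(1000):
    B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
    smallest_competitor = min(smallest_competitor, np.linalg.norm(A - B, ord=2))
print(smallest_competitor >= err_k)
```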

References

The result is the coordinate vector of $\boldsymbol{x}$ in the new basis, and it agrees exactly with the inner-product computation ($\boldsymbol{x} \cdot \boldsymbol{b}_1 = \frac{5}{\sqrt{2}}$, $\boldsymbol{x} \cdot \boldsymbol{b}_2 = -\frac{1}{\sqrt{2}}$).
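
The vector and basis behind these numbers are not shown in this excerpt. As a purely hypothetical illustration, the sketch below assumes $\boldsymbol{x} = (3, 2)$, $\boldsymbol{b}_1 = \frac{1}{\sqrt{2}}(1, 1)$ and $\boldsymbol{b}_2 = \frac{1}{\sqrt{2}}(-1, 1)$, which happen to reproduce the quoted inner products; it shows that the matrix form of the change of basis and the individual inner products give the same coordinates.

```python
import numpy as np

# Hypothetical inputs chosen to reproduce the quoted inner products; they are not given in the text.
x = np.array([3.0, 2.0])
b1 = np.array([1.0, 1.0]) / np.sqrt(2)
b2 = np.array([-1.0, 1.0]) / np.sqrt(2)

B = np.vstack([b1, b2])       # rows are the new orthonormal basis vectors
coords = B @ x                # matrix form of the change of basis

print(np.allclose(coords, [x @ b1, x @ b2]))
print(np.allclose(coords, [5 / np.sqrt(2), -1 / np.sqrt(2)]))
```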

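6. Summary: the body of knowledge and how it fits together

- Concepts: eigenvalues are specific to square matrices and describe the invariant directions of a linear transformation together with the scaling along them; singular values generalize eigenvalues beyond square matrices to arbitrary matrices and are defined as the square roots of the eigenvalues of $A^{T}A$.
- Decompositions: the eigenvalue decomposition of a real symmetric matrix ($A = Q \Lambda Q^{T}$) is the core of PCA; the SVD of an arbitrary matrix ($A = U \Sigma V^{T}$) underlies matrix approximation and compression, and it also offers an indirect and often more efficient route to PCA.
- Applications: PCA suits dimensionality reduction of small-scale dense data; the SVD suits large-scale sparse data, image compression, recommender systems, and similar settings, so its range of application is broader.
- Mathematical foundation: dimensionality reduction is at heart a change of basis; with an orthonormal basis, inner products are projections, which simplifies the computation and is the key principle by which PCA selects principal components and the SVD selects singular vectors.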

via:

1 Singular Value Decomposition and Principal Component Analysis - 180lecture_svd_pca.pdf
https://math.ucdavis.edu/~strohmer/courses/180BigData/180lecture_svd_pca.pdf
The Relationship Between Eigenvalues and Singular Values - CSDN blog, posted by Never-Giveup on 2018-08-25 16:40:40
https://blog.csdn.net/qq_36653505/article/details/82052593
Fundamentals of Eigenvalues and Singular Values - litaotao_doctor - cnblogs, posted on 2016-03-26 17:08 by litaotao_doctor
https://www.cnblogs.com/litaotao-doctor/p/5320521.html
Connections and Differences Between the Dimensionality Reduction Methods PCA and SVD - Byron_NG - cnblogs
https://www.cnblogs.com/bjwu/p/9280492.html
- Reference
  1. Linear Algebra and Its Applications. David C. Lay
  2. Singular Value Decomposition (SVD): Principles Explained and Derived - CSDN blog
    https://blog.csdn.net/zhongkejingwang/article/details/43053513
  3. A First Look at Dimensionality Reduction Methods in Data Analysis - 郑瀚 - cnblogs
    http://www.cnblogs.com/LittleHann/p/6558575.html
  4. Singular Value Decomposition (SVD): Principles and Application to Dimensionality Reduction - 刘建平 Pinard - cnblogs
    https://www.cnblogs.com/pinard/p/6251584.html
  5. PCA and SVD Explained - CSDN blog
    https://blog.csdn.net/wangjian1204/article/details/50642732
  6. Machine Learning. 周志华 (Zhou Zhihua)

... ...