[Machine Learning-03] Matrix Equations and Vector Differentiation

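Having covered the basics of matrices and linear algebra, we now convert the system of equations from [Machine Learning-01] into matrix form and solve it with matrix methods. [Machine Learning-01] also introduced the basic idea of least squares, a fundamental and important optimization algorithm whose mathematical derivation and practical use both deserve closer study. Starting with this section, we therefore begin from matrix equations, review the relevant matrix operations, and explain the techniques of matrix differentiation. After that, we will discuss the principles of least squares from a more rigorous mathematical standpoint.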

1. Solving Systems of Equations and Matrix Equations

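In [Machine Learning-01] Basic Concepts of Machine Learning and the Modeling Workflow, we solved for the parameters of a simple linear regression model using the system of equations obtained from the partial derivatives of the loss function:

$$20w + 8b - 28 = 0$$

$$8w + 4b - 12 = 0$$

Such a system can be solved in several ways. As described in [Machine Learning-01], we can eliminate one variable to solve for the other (here $w = 1$) and then substitute that value into the remaining equation to obtain the second variable (here $b = 1$). This works well by hand. To solve the system with programming tools, however, we need to recast it as a matrix equation, which lets the computer solve even complicated systems automatically. Knowing how to solve matrix equations is therefore essential for machine learning modeling in code.

We let: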
$$A = \left[\begin{array}{cc} 20 & 8 \\ 8 & 4 \end{array}\right]$$

$$B = \left[\begin{array}{c} 28 \\ 12 \end{array}\right]$$

$$X = \left[\begin{array}{c} w \\ b \end{array}\right]$$

Here $X$ is the parameter vector. Using matrix operations, the system above can be written equivalently as $A \cdot X - B = 0$, that is, $A \cdot X = B$.
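As a quick check on this rewrite, expanding the product row by row recovers the two original equations. This is a worked expansion that uses only the $A$, $X$ and $B$ defined above:

$$A \cdot X = \left[\begin{array}{cc} 20 & 8 \\ 8 & 4 \end{array}\right] \left[\begin{array}{c} w \\ b \end{array}\right] = \left[\begin{array}{c} 20w + 8b \\ 8w + 4b \end{array}\right] = \left[\begin{array}{c} 28 \\ 12 \end{array}\right] = B$$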

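We have now turned the system of equations into a matrix equation and can solve for the parameter vector $X$ directly with matrix operations. To do this in NumPy, we create two-dimensional arrays (2-D tensors) to represent the matrix $A$ and the vector $B$, and then use NumPy's matrix routines to obtain $X$.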

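```python
import numpy as np

# Coefficient matrix A of the system
A = np.array([[20, 8], [8, 4]])
A
# array([[20,  8],
#        [ 8,  4]])

# Right-hand side B, transposed into a 2x1 column vector
B = np.array([[28, 12]]).T
B
# array([[28],
#        [12]])
```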

Note that $B$ is also a two-dimensional array here, so matrix multiplication can be applied to it.

```python
B.ndim
# 2
```

Next, we compute the rank of $A$ as a simple check that it is full rank:

```python
np.linalg.matrix_rank(A)
# 2
```

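We can also check whether the determinant of $A$ is nonzero to decide whether $A$ is full rank. The exact determinant here is 16; the printed value differs slightly because of floating-point rounding.

```python
np.linalg.det(A)
# 15.999999999999991
```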
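Because the determinant is computed in floating point, comparing it with zero using `==` can be fragile. The sketch below shows one way to run both checks with a tolerance. The helper names `is_full_rank` and `det_is_zero`, and the singular matrix `S`, are illustrative assumptions that do not appear in the original text.

```python
import numpy as np

A = np.array([[20, 8], [8, 4]], dtype=float)
# Hypothetical singular matrix for contrast: its second column is twice the first.
S = np.array([[2, 4], [1, 2]], dtype=float)

def is_full_rank(M):
    """Return True when the square matrix M has rank equal to its size."""
    return bool(np.linalg.matrix_rank(M) == M.shape[0])

def det_is_zero(M):
    """Compare the determinant with 0 using a tolerance instead of `==`."""
    return bool(np.isclose(np.linalg.det(M), 0.0))

print("A:", is_full_rank(A), det_is_zero(A))
print("S:", is_full_rank(S), det_is_zero(S))
```

With these inputs, `A` should be reported as full rank with a nonzero determinant, and `S` as rank-deficient with a determinant that is zero within tolerance.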

For a full-rank matrix, we can compute its inverse:

```python
np.linalg.inv(A)
# array([[ 0.25, -0.5 ],
#        [-0.5 ,  1.25]])
```
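This inverse can be checked by hand. For a $2 \times 2$ matrix, the standard closed form swaps the diagonal entries, negates the off-diagonal entries, and divides by the determinant. Applying it to $A$ (a worked check added for illustration) gives:

$$A^{-1} = \frac{1}{20 \cdot 4 - 8 \cdot 8} \left[\begin{array}{cc} 4 & -8 \\ -8 & 20 \end{array}\right] = \frac{1}{16} \left[\begin{array}{cc} 4 & -8 \\ -8 & 20 \end{array}\right] = \left[\begin{array}{cc} 0.25 & -0.5 \\ -0.5 & 1.25 \end{array}\right]$$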

We then left-multiply both sides of the matrix equation by this inverse to solve for $X$:

$$A^{-1}AX = A^{-1}B$$

$$X = A^{-1}B$$

```python
np.matmul(np.linalg.inv(A), B)
# array([[1.],
#        [1.]])

# The dot method also works; for 2-D arrays, dot performs matrix multiplication
np.linalg.inv(A).dot(B)
# array([[1.],
#        [1.]])
```

That is,

$$X = \left[\begin{array}{c} w \\ b \end{array}\right] = \left[\begin{array}{c} 1 \\ 1 \end{array}\right]$$

Besides computing the inverse and multiplying by hand, NumPy also provides a function, `np.linalg.solve`, for solving matrix equations of the form $A \cdot X = B$. It returns the parameter vector $X$ directly, which spares us the manual steps and makes solving the matrix equation both simpler and more efficient.

```python
np.linalg.solve(A, B)
# array([[1.],
#        [1.]])
```
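A solution is easy to validate by substituting it back into the equation. The following minimal sketch rebuilds $A$ and $B$, solves the system both ways, and checks the residual $A \cdot X - B$. The variable names `X_solve`, `X_inv` and `residual` are assumptions introduced for this example.

```python
import numpy as np

A = np.array([[20, 8], [8, 4]], dtype=float)
B = np.array([[28], [12]], dtype=float)

X_solve = np.linalg.solve(A, B)   # solves A·X = B directly
X_inv = np.linalg.inv(A) @ B      # explicit inverse followed by a matrix product

# For a correct solution, the residual A·X - B is zero up to rounding.
residual = A @ X_solve - B
print(np.allclose(residual, 0.0))
print(np.allclose(X_solve, X_inv))
```

Both checks compare with a tolerance (`np.allclose`) rather than exact equality, since the two routes can differ in the last few floating-point digits.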

2. Vector Differentiation

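In programming practice, matrices and vectors are more common and more efficient than systems of scalar equations. For that reason, we will use matrices and vectors as the basic data structures when explaining concepts and deriving formulas for least squares and many other optimization methods. Before examining the mathematics of least squares, we first need some background on vector differentiation, which provides the theoretical basis for the analysis and computation that follow.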

2.1 Basic Method of Vector Differentiation

We start with the relatively simple case of vector differentiation, which shows what it means to differentiate with respect to a structured variable. This is both an important mathematical technique and the basis for later derivations of machine learning algorithms. Suppose we have the following function of two variables:

$$f(x_1, x_2) = 2x_1 + x_2$$

To study how this function changes with $x_1$ and $x_2$, we take the partial derivative with respect to each variable in turn, which gives the local rate of change of the function along that variable:

$$\frac{\partial f}{\partial x_1} = 2$$

$$\frac{\partial f}{\partial x_2} = 1$$

Now consider rewriting this set of partial derivatives in matrix form. As introduced earlier, we can arrange the two variables of the function in order to form a vector variable, that is, a vector made up of several variables: $x = [x_1, x_2]^T$

If we now follow the order of the variables inside the vector variable and fill each position with the partial derivative with respect to that variable, the result is the derivative of the function $f$ with respect to the vector variable $x$:

$$\frac{\partial f(x)}{\partial x} = \left[\begin{array}{c} 2 \\ 1 \end{array}\right]$$

Here $x$ is the vector variable.

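This completes the basic procedure of vector differentiation. The key is to follow the order of the variables in the vector variable and fill in the corresponding partial derivatives one by one. To stay consistent with the matrix/vector form of the system of equations, the original function should be rewritten in the same way, so that matrix operations and vector differentiation can be applied to it:

$$f(x) = A^T \cdot x$$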

where $A = [2, 1]^T$ and $x = [x_1, x_2]^T$, and the original equation is $y = f(x_1, x_2) = 2x_1 + x_2$.

Comparing this with the derivative computed earlier, we see that $\frac{\partial f(x)}{\partial x}$ is exactly $A$:

$$\frac{\partial f(x)}{\partial x} = \frac{\partial(A^T \cdot x)}{\partial x} = A$$
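This result can also be checked numerically. The sketch below estimates each partial derivative of $f(x) = A^T \cdot x$ by central differences and compares the result with $A$. The helper `numerical_gradient`, the step size `h`, and the evaluation point `x0` are assumptions chosen for illustration.

```python
import numpy as np

A = np.array([2.0, 1.0])   # coefficient vector of f(x) = 2*x1 + x2

def f(x):
    """f(x) = A^T · x, written for 1-D NumPy arrays."""
    return A @ x

def numerical_gradient(func, x, h=1e-6):
    """Central-difference estimate of [df/dx_1, ..., df/dx_n]."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h                       # perturb only the i-th variable
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

x0 = np.array([0.3, -1.7])   # arbitrary evaluation point
grad = numerical_gradient(f, x0)
print(np.allclose(grad, A))
```

Because $f$ is linear, the estimate should match $A$ at any choice of `x0`, up to floating-point rounding.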

Here $x$ is the vector variable and $A$ is a column vector. This conclusion extends to the general case, and we give the proof in the next subsection. For ease of understanding and use, we first state the general formula for differentiating a function of a vector variable, which will let us handle such problems more efficiently.

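We often do not distinguish strictly between vector equations and matrix equations, and simply call any equation whose variable is a vector or a matrix a matrix equation. Likewise, any expression containing vectors or matrices is called a matrix expression. This convention lets us apply the rules of matrix and vector operations more flexibly and simplifies problem solving.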

Definition-based vector differentiation: let $f(x)$ be a function of $x$, where $x$ is a vector variable and $x = [x_1, x_2, \ldots, x_n]^T$.

Then

$$\frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$

This expression is also called the gradient vector form of the vector derivative:

$$\nabla_x f(x) = \frac{\partial f}{\partial x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$$

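Computing a vector derivative by working out the gradient vector of the function in this way is called the definition-based method.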

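Note that for a differentiable multivariate function we can always compute its gradient vector. However, this gradient vector, that is, the result of vector differentiation, cannot always be expressed directly in terms of vectors that have already been defined. In the example above, $A$ happens to express the vector derivative of $f(x)$, but such a direct vector representation is not available in every case.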
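A small example may make this concrete. The function below is chosen purely for illustration and does not come from the original text:

$$f(x_1, x_2) = x_1^2 x_2, \qquad \nabla_x f(x) = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right]^T = \left[2x_1 x_2, \; x_1^2\right]^T$$

The gradient is well defined, but its entries mix the components of $x$, so it is neither a constant vector like $A$ nor a simple multiple of $x$.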

2.2 Common Vector Differentiation Formulas

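In the early stages of our study, mathematical derivations frequently involve differentiation with respect to a vector variable. Beyond the basic method, we therefore need to derive a few commonly used vector differentiation formulas. What these formulas have in common is that the result can be written concisely in terms of vectors that are already defined. Throughout, we assume that $x$ is a column vector of $n$ variables, that is, $x = [x_1, x_2, \ldots, x_n]^T$. With these formulas, problems involving vector variables can be handled much more efficiently.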

(1) $\frac{\partial a}{\partial x} = 0$, where $a$ is a constant and the $0$ on the right-hand side is the $n$-dimensional zero vector.

Proof:

$$\frac{\partial a}{\partial x} = \left[\frac{\partial a}{\partial x_1}, \frac{\partial a}{\partial x_2}, \ldots, \frac{\partial a}{\partial x_n}\right]^T = [0, 0, \ldots, 0]^T$$

(2)

$$\frac{\partial(x^T \cdot A)}{\partial x} = \frac{\partial(A^T \cdot x)}{\partial x} = A$$

Proof:

Here $A$ is a constant vector with $n$ components. Let $A = [a_1, a_2, \ldots, a_n]^T$; then

$$
\begin{aligned}
\frac{\partial(x^T \cdot A)}{\partial x}
& = \frac{\partial(A^T \cdot x)}{\partial x} \\
& = \frac{\partial(a_1 \cdot x_1 + a_2 \cdot x_2 + \cdots + a_n \cdot x_n)}{\partial x} \\
& = \left[\begin{array}{c}
\frac{\partial(a_1 \cdot x_1 + a_2 \cdot x_2 + \cdots + a_n \cdot x_n)}{\partial x_1} \\
\frac{\partial(a_1 \cdot x_1 + a_2 \cdot x_2 + \cdots + a_n \cdot x_n)}{\partial x_2} \\
\vdots \\
\frac{\partial(a_1 \cdot x_1 + a_2 \cdot x_2 + \cdots + a_n \cdot x_n)}{\partial x_n}
\end{array}\right] \\
& = \left[\begin{array}{c} a_1 \\ a_2 \\ \vdots \\ a_n \end{array}\right] = A
\end{aligned}
$$
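Formula (2) can be spot-checked numerically for a larger $n$. The sketch below draws a random constant vector and a random evaluation point, then compares central-difference gradients of $x^T \cdot A$ and $A^T \cdot x$ with $A$ itself. The names `a`, `numerical_gradient`, `grad_left` and `grad_right`, along with the seed and the dimension, are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
a = rng.normal(size=n)   # constant vector, playing the role of A
x = rng.normal(size=n)   # point at which the gradient is evaluated

def numerical_gradient(func, x, h=1e-6):
    """Central-difference gradient; each row of `steps` perturbs one variable."""
    steps = h * np.eye(x.size)
    return np.array([(func(x + e) - func(x - e)) / (2 * h) for e in steps])

grad_left = numerical_gradient(lambda v: v @ a, x)    # derivative of x^T · A
grad_right = numerical_gradient(lambda v: a @ v, x)   # derivative of A^T · x

print(np.allclose(grad_left, a, atol=1e-6))
print(np.allclose(grad_right, a, atol=1e-6))
```

A numerical agreement like this is only a sanity check at one point, not a substitute for the proof above.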

(3) $\frac{\partial(x^T \cdot x)}{\partial x} = 2x$

Proof:

$$
\begin{aligned}
\frac{\partial(x^T \cdot x)}{\partial x}
& = \frac{\partial(x_1^2 + x_2^2 + \cdots + x_n^2)}{\partial x} \\
& = \left[\begin{array}{c}
\frac{\partial(x_1^2 + x_2^2 + \cdots + x_n^2)}{\partial x_1} \\
\frac{\partial(x_1^2 + x_2^2 + \cdots + x_n^2)}{\partial x_2} \\
\vdots \\
\frac{\partial(x_1^2 + x_2^2 + \cdots + x_n^2)}{\partial x_n}
\end{array}\right] \\
& = \left[\begin{array}{c} 2x_1 \\ 2x_2 \\ \vdots \\ 2x_n \end{array}\right] = 2x
\end{aligned}
$$
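The same kind of numerical check works for formula (3). In this sketch, the sample vector `x` and the helper `numerical_gradient` are illustrative assumptions. The last line also confirms that $x^T \cdot x$ equals the squared Euclidean norm of $x$.

```python
import numpy as np

def numerical_gradient(func, x, h=1e-6):
    """Central-difference gradient; each row of `steps` perturbs one variable."""
    steps = h * np.eye(x.size)
    return np.array([(func(x + e) - func(x - e)) / (2 * h) for e in steps])

x = np.array([1.0, -2.0, 0.5])   # arbitrary sample point

grad = numerical_gradient(lambda v: v @ v, x)   # derivative of x^T · x
print(np.allclose(grad, 2 * x, atol=1e-6))

# x^T · x is the sum of squared components, i.e. the squared Euclidean norm.
print(np.isclose(x @ x, np.linalg.norm(x) ** 2))
```

Unlike formula (2), the gradient here depends on the evaluation point, so changing `x` changes the expected result $2x$ accordingly.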

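Here $x^T x$ is also called the cross-product of the vector (crossprod). The term is used in the sense of a matrix cross-product, so $x^T x$ is the inner product of $x$ with itself. It should not be confused with the geometric cross product of two vectors.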

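This completes the proofs of the common vector differentiation formulas. As the proofs show, the definition-based method tends to be tedious, even though each step is clear. We will therefore introduce other ways of proving common vector differentiation formulas later on.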

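Matrix differentiation works in a similar way. When the variable is a matrix, we follow the structure of the matrix and take the partial derivative with respect to the variable at each position. Because a matrix has one more dimension than a vector, its structure is more complex and the computation is more laborious. Since most of the equations we meet at first involve vector variables, we will derive the common matrix differentiation formulas step by step in later sections.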

Finally, we briefly distinguish between two concepts, matrix functions and matrix equations:

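Matrix equation: an equation whose variable is a matrix.
Matrix function: similar to a function matrix, it is a function whose independent and dependent variables are both matrices of order $n$. It can also be understood informally as a matrix made up of functions, each of which takes a matrix as its variable. With this distinction in mind, we can see more clearly how the two concepts are used in mathematics and machine learning.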