方向导数和梯度

[1 导数的回忆](#1 导数的回忆)
[2 偏导数及其向量形式](#2 偏导数及其向量形式)
- 偏导数的几何意义
- 偏导数的向量形式
[3 方向导数](#3 方向导数)
[4 梯度](#4 梯度)
[5 梯度下降算法](#5 梯度下降算法)

1 导数的回忆

导数的几何意义如图所示：

当 P 0 P_{0} P0点不断接近 P P P时，导数如下定义：
f ′ ( x 0 ) = lim ⁡ △ x → 0 △ y △ x = lim ⁡ △ x → 0 f ( x 0 + △ x ) − f ( x 0 ) △ x {f}'(x_{0})=\lim\limits_{△x→0}\frac{△y}{△x} =\lim\limits_{△x→0}\frac{f(x_{0}+△x)-f(x_{0})}{△x} f′(x0)=△x→0lim△x△y=△x→0lim△xf(x0+△x)−f(x0)

2 偏导数及其向量形式

偏导数的几何意义

对 x x x的偏导数定义如下：
∂ f ∂ x ( x 0 , y 0 ) = lim ⁡ △ x → 0 f ( x 0 + △ x , y 0 ) − f ( x 0 , y 0 ) △ x \frac{\partial f}{\partial x}(x_{0},y_{0})=\lim\limits_{△x→0} \frac{f(x_{0}+△x,y_{0})-f(x_{0},y_{0})}{△x} ∂x∂f(x0,y0)=△x→0lim△xf(x0+△x,y0)−f(x0,y0)

如图所示， M 0 ( x 0 , y 0 , f ( x 0 , y 0 ) ) M_{0}(x_{0},y{0},f(x_{0},y_{0})) M0(x0,y0,f(x0,y0))是曲线 z = f ( x , y ) z=f(x,y) z=f(x,y)的一点，过 M 0 M_{0} M0作平面 y = y 0 y=y_{0} y=y0,截此曲面得一曲线，此曲线的方程为： z = f ( x , y 0 ) z=f(x,y_{0}) z=f(x,y0)，则上述对 x x x的偏导数就是在点 M 0 M_{0} M0处的切线 M 0 T x M_{0}T_{x} M0Tx对x轴的斜率。对 y y y的偏导数的几何意义同理。

偏导数的向量形式

为了等一下方便理解方向导数，将上述的偏导数表示成向量形式。

设 a ⃗ = [ x 0 y 0 ] \vec{a}=\begin{bmatrix}x_{0}\\ y_{0}\end{bmatrix} a =[x0y0]，则对 x x x的偏导数为：
∂ f ∂ x ( a ⃗ ) = lim ⁡ h → 0 f ( a ⃗ + h i ^ ) − f ( a ⃗ ) h \frac{\partial f}{\partial x}(\vec{a})=\lim\limits_{h→0} \frac{f(\vec{a}+h\hat{i})-f(\vec{a})}{h} ∂x∂f(a )=h→0limhf(a +hi^)−f(a )

其中 i ^ = [ 1 0 ] \hat{i}=\begin{bmatrix}1\\ 0\end{bmatrix} i^=[10]

3 方向导数

向量形式

从偏导数的向量形式可知：当 i ^ \hat{i} i^方向改变时，就产生了方向导数。

方向导数的定义如下：
∇ v ⃗ f ( a ⃗ ) = ∂ f ∂ v ⃗ ( a ⃗ ) = lim ⁡ h → 0 f ( a ⃗ + h v ⃗ ) − f ( a ⃗ ) h \nabla_{\vec{v}}f(\vec{a})=\frac{\partial f}{\partial \vec{v}}(\vec{a})=\lim\limits_{h→0} \frac{f(\vec{a}+h\vec{v})-f(\vec{a})}{h} ∇v f(a )=∂v ∂f(a )=h→0limhf(a +hv )−f(a )

这种定义有一种问题，假设 v ⃗ \vec{v} v 变为原来的两倍，但分母不变，则实际上方向导数也会变成2倍。

如果要把方向导数表示成该方向的斜率，则应该用如下定义：
∇ v ⃗ f ( a ⃗ ) = ∂ f ∂ v ⃗ ( a ⃗ ) = lim ⁡ h → 0 f ( a ⃗ + h v ⃗ ) − f ( a ⃗ ) h ∣ v ⃗ ∣ \nabla_{\vec{v}}f(\vec{a})=\frac{\partial f}{\partial \vec{v}}(\vec{a})=\lim\limits_{h→0} \frac{f(\vec{a}+h\vec{v})-f(\vec{a})}{h|\vec{v}|} ∇v f(a )=∂v ∂f(a )=h→0limh∣v ∣f(a +hv )−f(a )

在该定义下，表示的是函数在某点沿着 a ⃗ \vec{a} a 方向上的导数，即斜率

几何意义

方向导数的非向量形式如下：

设 e l = ( c o s α , c o s β ) e_{l}=(cos\alpha,cosβ) el=(cosα,cosβ)是与 l l l同方向的单位向量，设射线 l l l上 P P P的坐标为 P = ( x 0 + t c o s α , y 0 + t c o s β ) P=(x_{0}+tcos\alpha,y_{0}+tcosβ) P=(x0+tcosα,y0+tcosβ)，此时如下图所示：

此时函数增量与距离 ∣ P P 0 ∣ = t |PP_{0}|=t ∣PP0∣=t的比值为：
f ( x 0 + t c o s α , y 0 + t c o s β ) − f ( x 0 , y 0 ) t \frac{f(x_{0}+tcos\alpha,y_{0}+tcosβ)-f(x_{0},y_{0})}{t} tf(x0+tcosα,y0+tcosβ)−f(x0,y0)

当 P P P沿着 l l l趋于 P 0 P_{0} P0时，若极限存在，则称为函数 f ( x , y ) f(x,y) f(x,y)在点 P 0 P_{0} P0沿方向 l l l的的方向导数，即上面向量形式的第二种定义（只不过这里用了非向量形式），如下：
∂ f ∂ l ∣ ( x 0 , y 0 ) = lim ⁡ t → 0 ＋ f ( x 0 + t c o s α , y 0 + t c o s β ) − f ( x 0 , y 0 ) t \frac{\partial f}{\partial l}\vert_{(x_{0},y_{0})}=\lim\limits_{t→0^{＋}}\frac{f(x_{0}+tcos\alpha,y_{0}+tcosβ)-f(x_{0},y_{0})}{t} ∂l∂f∣(x0,y0)=t→0＋limtf(x0+tcosα,y0+tcosβ)−f(x0,y0)

其几何意义如下图所示：如果要求A点在紫色向量方向上的斜率(红色线圈出来的)，则可用方向导数

方向导数和偏导的关系

定理：如果函数 f ( x , y ) f(x,y) f(x,y)在点 P 0 ( x 0 , y 0 ) P_{0}(x_{0},y_{0}) P0(x0,y0)可微分，那么函数在该点沿任一方向 l l l的方向导数存在，且有
∂ f ∂ l ∣ ( x 0 , y 0 ) = f x ( x 0 , y 0 ) c o s α + f y c o s β \frac{\partial f}{\partial l}\vert_{(x_{0},y_{0})}=f_{x}(x_{0},y_{0})cos\alpha+f_{y}cosβ ∂l∂f∣(x0,y0)=fx(x0,y0)cosα+fycosβ

其中 c o s α cos\alpha cosα和 c o s β cosβ cosβ是方向 l l l的方向余弦

证明：由函数 f ( x , y ) f(x,y) f(x,y)在点 P 0 ( x 0 , y 0 ) P_{0}(x_{0},y_{0}) P0(x0,y0)可微分可得：
f ( x 0 + △ x , y 0 + △ y ) − f ( x 0 , y 0 ) = f x ( x 0 , y 0 ) △ x + f y △ y + o ( △ x 2 + △ y 2 ) f(x_{0}+△x,y_{0}+△y)-f(x_{0},y_{0})=f_{x}(x_{0},y_{0})△x+f_{y}△y+o(\sqrt{△x^{2}+△y^{2} }) f(x0+△x,y0+△y)−f(x0,y0)=fx(x0,y0)△x+fy△y+o(△x2+△y2 )

由上述可知： △ x = t c o s α , △ y = t c o s β △x=tcos\alpha,△y=tcosβ △x=tcosα,△y=tcosβ,则有： △ x 2 + △ y 2 = t \sqrt{△x^{2}+△y^{2} }=t △x2+△y2 =t

从而有：
lim ⁡ t → 0 ＋ f ( x 0 + t c o s α , y 0 + t c o s β ) − f ( x 0 , y 0 ) t = lim ⁡ t → 0 ＋ f x ( x 0 , y 0 ) t c o s α + f y t c o s β + o ( t ) t = f x ( x 0 , y 0 ) c o s α + f y c o s β \begin {aligned} {} \lim\limits_{t→0^{＋}}\frac{f(x_{0}+tcos\alpha,y_{0}+tcosβ)-f(x_{0},y_{0})}{t}& =\lim\limits_{t→0^{＋}}\frac{f_{x}(x_{0},y_{0})tcos\alpha+f_{y}tcosβ+o(t)}{t} \\&=f_{x}(x_{0},y_{0})cos\alpha+f_{y}cosβ \end {aligned} t→0＋limtf(x0+tcosα,y0+tcosβ)−f(x0,y0)=t→0＋limtfx(x0,y0)tcosα+fytcosβ+o(t)=fx(x0,y0)cosα+fycosβ

相当于方向导数是偏导数的线性组合

4 梯度

在二元函数情况下，设函数 f ( x , y ) f(x,y) f(x,y)在平面区域 D D D内具有一阶连续偏导数，则对于每一点 P 0 ( x 0 , y 0 ) ∈ D P_{0}(x_{0},y_{0})∈D P0(x0,y0)∈D，都存在一个梯度，记作 grad f ( x 0 , y 0 ) \textbf{grad}f(x_{0},y_{0}) gradf(x0,y0)或 ▽ f ( x 0 , y 0 ) ▽f(x_{0},y_{0}) ▽f(x0,y0)：
grad f ( x 0 , y 0 ) = ▽ f ( x 0 , y 0 ) = [ ∂ f ∂ x ∂ f ∂ y ] \textbf{grad}f(x_{0},y_{0})=▽f(x_{0},y_{0})= \begin{bmatrix}\frac{\partial f}{\partial x} \\ \\ \frac{\partial f}{\partial y}\end{bmatrix} gradf(x0,y0)=▽f(x0,y0)= ∂x∂f∂y∂f

因此，假设函数 f ( x , y ) f(x,y) f(x,y)在点 P 0 ( x 0 , y 0 ) P_{0}(x_{0},y_{0}) P0(x0,y0)可微分， e l = ( c o s α , c o s β ) e_{l}=(cos\alpha,cosβ) el=(cosα,cosβ)是与方向 l l l同向的方向向量，则方向导数和梯度的关系是：
∂ f ∂ l ∣ ( x 0 , y 0 ) = f x ( x 0 , y 0 ) c o s α + f y c o s β = ▽ f ( x 0 , y 0 ) ⋅ e l = ∣ ▽ f ( x 0 , y 0 ) ∣ ∣ e l ∣ c o s θ = ∣ ▽ f ( x 0 , y 0 ) ∣ c o s θ \begin {aligned} {} \frac{\partial f}{\partial l}\vert_{(x_{0},y_{0})}& = f_{x}(x_{0},y_{0})cos\alpha+f_{y}cosβ \\&=▽f(x_{0},y_{0}) \ \bm{\cdot} \ e_{l} \\&=|▽f(x_{0},y_{0})| \ |e_{l}| \ cos\theta \\&=|▽f(x_{0},y_{0})|\ cos\theta \end {aligned} ∂l∂f∣(x0,y0)=fx(x0,y0)cosα+fycosβ=▽f(x0,y0) ⋅ el=∣▽f(x0,y0)∣ ∣el∣ cosθ=∣▽f(x0,y0)∣ cosθ

其中， θ \theta θ是向量 ▽ f ( x 0 , y 0 ) ▽f(x_{0},y_{0}) ▽f(x0,y0)与向量 e l e_{l} el所成夹角，因此可以得出下述结论：

θ = 0 \theta=0 θ=0，向量 ▽ f ( x 0 , y 0 ) ▽f(x_{0},y_{0}) ▽f(x0,y0)与向量 e l e_{l} el方向相同，此时方向导数最大，函数 f ( x , y ) f(x,y) f(x,y)增长最快
θ = π \theta=π θ=π，向量 ▽ f ( x 0 , y 0 ) ▽f(x_{0},y_{0}) ▽f(x0,y0)与向量 e l e_{l} el方向相反，此时方向导数最小，函数 f ( x , y ) f(x,y) f(x,y)减少最快
θ = π 2 \theta=\frac{π}{2} θ=2π，向量 ▽ f ( x 0 , y 0 ) ▽f(x_{0},y_{0}) ▽f(x0,y0)与向量 e l e_{l} el方向正交，函数 f ( x , y ) f(x,y) f(x,y)变化率为0
综上，沿着梯度方向函数增长最快

5 梯度下降算法

梯度下降法(Gradient Descent)是一种用于寻找函数极小值的一阶迭代优化算法，又称为最速下降(Steepest Descent)。以下是梯度下降的基本公式：
θ ← θ − η ∂ L ( θ ) ∂ θ \theta \leftarrow \theta -η\frac{\partial L(\theta)}{\partial \theta } θ←θ−η∂θ∂L(θ)

可以看出：

L ( θ ) L(\theta) L(θ)是关于 θ \theta θ的损失函数
η η η是学习率，称为梯度下降的步长，
梯度下降法是让方向导数最小时做迭代

举一个例子：设 L ( θ ) = θ 2 L(\theta)=\theta^{2} L(θ)=θ2，则梯度 ▽ L ( θ ) = ∂ L ( θ ) ∂ θ = 2 θ ▽L(\theta)=\frac{\partial L(\theta)}{\partial \theta }=2\theta ▽L(θ)=∂θ∂L(θ)=2θ；设学习率为 η = 0.2 η=0.2 η=0.2；设初始值为 ( θ 0 , L ( θ 0 ) ) = ( 10 , 100 ) (\theta_{0},L(\theta_{0}))=(10,100) (θ0,L(θ0))=(10,100),此时梯度为： ▽ L ( θ 0 ) = 2 θ 0 = 20 ▽L(\theta_{0})=2\theta_{0}=20 ▽L(θ0)=2θ0=20。

更新 θ \theta θ： θ 1 ← θ 0 − η ∂ L ( θ 0 ) ∂ θ 0 = 10 − 0.2 × 20 = 6 \theta_{1} \leftarrow \theta_{0} -η\frac{\partial L(\theta_{0})}{\partial \theta_{0} } = 10-0.2×20=6 θ1←θ0−η∂θ0∂L(θ0)=10−0.2×20=6
重复上述步骤，直至 θ \theta θ收敛

代码如下：

python 复制代码

import numpy as np
import matplotlib.pyplot as plt

# 定义损失函数 y = x^2
def f(x):
    return x**2

# 梯度下降函数
def gradient_descent(x, eta):
    # 计算斜率
    slope = 2 * x
    # 更新 x 的值
    x_out = x - eta * slope
    return x_out, slope

# 主程序
def main():
    # 初始化参数
    x_data = np.linspace(-10, 10, 1000)  # x 范围
    LRate = 0.2  # 学习率
    slope_thresh = 0.0001  # 斜率阈值

    # 绘制损失函数图像
    plt.plot(x_data, f(x_data), 'c', linewidth=2)
    plt.title('y = x^2 (learning rate = 0.2)')
    plt.xlabel('x')
    plt.ylabel('y = x^2')
    plt.grid(True)

    # 初始点设置为 (10, f(10))
    x = 10
    y = f(10)
    plt.plot(x, y, 'r*')

    # 开始梯度下降迭代
    slope = float('inf')  # 初始斜率设置为无穷大
    while abs(slope) > slope_thresh:
        x_new, slope = gradient_descent(x, LRate)
        y_new = f(x_new)
        # 绘制当前点到更新后点的连线
        plt.plot([x, x_new], [y, y_new], 'k--', linewidth=1)
        # 绘制点
        plt.plot(x_new, y_new, 'r*')
        plt.legend(['y = x^2', 'Gradient descent path'])
        plt.draw()
        plt.pause(0.2)  # 暂停一小段时间，使得动态图像可见
        x = x_new
        y = y_new

    plt.show()


# 当这个 .py 文件被直接运行时（作为主程序），执行 main() 函数；
# 当这个 .py 文件被导入到其他模块中时，不执行 main() 函数，因为 __name__ 的值不是 '__main__'。
if __name__ == '__main__':
    main()

应尽可能选择适中的学习率，过大会震荡，过小迭代次数会过多，如下所示，学习率为0.2更好。

当学习率 η = 0.2 η=0.2 η=0.2时，图像如下：

当学习率 η = 0.9 η=0.9 η=0.9时，图像如下：

当学习率 η = 0.01 η=0.01 η=0.01时，图像如下：