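Machine Learning Algorithms: Gradient Descent

Gradient descent is an iterative first-order optimization algorithm for finding a local minimum of a given function (with the sign of the update flipped, the same idea finds a local maximum). It is commonly used in machine learning and deep learning to minimize a cost function, also called a loss function.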

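⒈ Gradient

The gradient measures how much the error changes in response to a change in each weight. It can be thought of as the slope of a function: the larger the gradient, the steeper the slope and the faster the model learns. If the gradient is 0, the model stops learning.

In mathematical terms, the gradient is the set of partial derivatives of a function with respect to its inputs. For a function of one variable, the gradient at a given point is the first derivative at that point; for a function of several variables, the gradient is the vector of derivatives along each variable's axis. In practice, each component is computed separately as the partial derivative with respect to one variable while the others are held fixed.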

The gradient of a function $f$ of $n$ variables at a given point $P$ is written as:
$$\nabla f(P) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(P)\\ \vdots\\ \frac{\partial f}{\partial x_n}(P) \end{bmatrix}$$

Take the two-variable function $f(x, y) = 0.5x^2 + y^2$ as an example and compute its gradient at the point $P(5, 5)$.

The partial derivatives of the function, and their values at $P(5, 5)$, are
$$\nabla f(x, y) = \begin{bmatrix} x\\ 2y \end{bmatrix}$$
$$\nabla f(5, 5) = \begin{bmatrix} 5\\ 10 \end{bmatrix}$$

At this point, the gradient component along the $y$ axis is twice the component along the $x$ axis.
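As a quick check of the example above, the following sketch evaluates the analytic gradient at $P(5, 5)$ and compares it with a central finite-difference estimate. The helper names `f`, `grad_f`, and `numerical_gradient`, and the step size `h`, are illustrative assumptions and not part of the original text.

```python
import numpy as np


def f(x: float, y: float) -> float:
    """Example function f(x, y) = 0.5 * x^2 + y^2."""
    return 0.5 * x**2 + y**2


def grad_f(x: float, y: float) -> np.ndarray:
    """Analytic gradient [x, 2y] of the example function."""
    return np.array([x, 2.0 * y])


def numerical_gradient(x: float, y: float, h: float = 1e-5) -> np.ndarray:
    """Central finite-difference estimate of the gradient (h is an assumed step size)."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([df_dx, df_dy])


analytic = grad_f(5.0, 5.0)
estimate = numerical_gradient(5.0, 5.0)
print(analytic)                                     # analytic gradient at P(5, 5)
print(np.allclose(analytic, estimate, atol=1e-6))   # do the two agree?
```

A finite-difference comparison like this is a common way to sanity-check a hand-derived gradient before using it in an optimizer.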

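⒉ Gradient Descent

For gradient descent to be guaranteed to reach the minimum, the function must be differentiable and convex. A function is differentiable if a derivative exists at every point in its domain.

To check whether a function of one variable is convex, connect any two points on its graph with a line segment: if the segment always coincides with the graph or lies above it, the function is convex.

Another way to check convexity for a function of one variable is to look at its second derivative: if the second derivative is always greater than 0, the function is convex. Take $f(x) = x^2 - x + 3$ as an example: its second derivative is 2, which is always greater than 0, so the function is convex.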
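Both tests can be spot-checked numerically. The sketch below is only a sampled check under assumed intervals and tolerances, not a proof of convexity; the helper names `second_derivative` and `chord_above_graph` are made up for this illustration.

```python
import numpy as np


def f(x):
    """Example function f(x) = x^2 - x + 3."""
    return x**2 - x + 3


def second_derivative(func, x, h=1e-3):
    """Central finite-difference estimate of the second derivative."""
    return (func(x + h) - 2 * func(x) + func(x - h)) / h**2


def chord_above_graph(func, a, b, num=101):
    """Check on sampled points that the chord from a to b lies on or above the graph."""
    t = np.linspace(0.0, 1.0, num)
    points = (1 - t) * a + t * b
    chord = (1 - t) * func(a) + t * func(b)
    return bool(np.all(chord >= func(points) - 1e-9))


xs = np.linspace(-10.0, 10.0, 41)
second = second_derivative(f, xs)
print(np.allclose(second, 2.0, atol=1e-4))    # second derivative is 2 everywhere sampled
print(chord_above_graph(f, -3.0, 7.0))        # chord test for the convex example
print(chord_above_graph(np.sin, 0.0, np.pi))  # sin is not convex on [0, pi]
```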

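⑴ Basic steps of the gradient descent algorithm

The main goal of gradient descent is to iteratively adjust the model parameters so that the value of the loss function (also called the cost function, objective function, or error function) becomes as small as possible, which improves the model's predictive performance. The algorithm runs through the following steps: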

Initialize the model parameters

Taking simple linear regression as an example, the intercept ($\theta_0$) and slope ($\theta_1$) of the linear function must be initialized before gradient descent starts, which fixes the initial fitted line. Model parameters are usually initialized to small random values or to 0.

Compute the predictions

Using the current model parameters and the input values, compute the predictions $\hat{y_i} = \theta_0 + \theta_1 \times x_i$

Compute the loss

Use the loss function to measure the error between the predictions and the true values. Staying with simple linear regression, a commonly used loss function is the MSE (mean squared error):
$$J(\theta_0, \theta_1) = \frac{1}{n} \sum_{i=1}^{n}(\hat{y_i} - y_i)^2$$

Compute the gradient

Compute the partial derivatives of the loss function with respect to each parameter (the gradient):
$$\begin{aligned} \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} &= \frac{2}{n}\sum_{i=1}^{n}(\hat{y_i} - y_i) \\ \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} &= \frac{2}{n}\sum_{i=1}^{n}(\hat{y_i} - y_i) \times x_i \end{aligned}$$

Update the parameters

Once the gradient has been computed, the parameters are updated along it. This is where the learning rate $\alpha$ comes in: it controls the size (step length) of each parameter update:
$$\begin{aligned} \theta_0 &= \theta_0 - \alpha \times \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} \\ \theta_1 &= \theta_1 - \alpha \times \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} \end{aligned}$$

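Repeat steps 2 to 5

Repeat the steps from computing the predictions through updating the parameters until the loss converges to within a sufficiently small range or the maximum number of iterations is reached.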
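To make the steps concrete, here is a minimal sketch that applies the scalar formulas above directly, without matrices. The four-point dataset (generated from $y = 1 + 2x$ with no noise), the zero initialization, the learning rate of 0.1, and the fixed count of 1000 iterations are all assumptions chosen for illustration.

```python
# Assumed toy dataset generated from y = 1 + 2x (no noise)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = 0.0, 0.0  # step 1: initialize the parameters
alpha = 0.1                # learning rate (assumed value)
n = len(xs)

for _ in range(1000):      # step 6: repeat steps 2 to 5
    # Step 2: predictions from the current parameters
    preds = [theta0 + theta1 * x for x in xs]
    # Step 3: MSE loss (tracked only for inspection)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
    # Step 4: partial derivatives of the loss
    grad0 = (2 / n) * sum(p - y for p, y in zip(preds, ys))
    grad1 = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    # Step 5: move against the gradient
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # should be close to the generating values 1 and 2
```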

Python

import numpy as np
import matplotlib.pyplot as plt

# Generate sample data: X and y are 100 x 1 matrices
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Initialize the parameters
theta = np.random.randn(2, 1)  # 2 x 1 matrix holding the intercept θ₀ and the slope θ₁
learning_rate = 0.1  # learning rate
iterations = 1000  # maximum number of iterations
tolerance = 1e-6  # convergence threshold on the change in loss
m = len(X)

# Add a column of ones so the intercept term is handled by the matrix product
X_b = np.c_[np.ones((m, 1)), X]  # 100 x 2 matrix

# Loss function (MSE)
def compute_loss(X_b, y, theta):
    m = len(y)
    predictions = X_b.dot(theta)  # matrix product
    loss = (1/m) * np.sum((predictions - y) ** 2)
    return loss

# Gradient descent
loss_history = []
previous_loss = float('inf')

for iteration in range(iterations):
    # Compute the predictions
    y_pred = X_b.dot(theta)
    
    # Compute the gradient
    gradients = (2/m) * X_b.T.dot(y_pred - y)
    
    # Update the parameters
    theta -= learning_rate * gradients
    
    # Compute the loss
    loss = compute_loss(X_b, y, theta)
    loss_history.append(loss)
    
    # Stop when the loss has converged
    if abs(previous_loss - loss) < tolerance:
        print(f"Converged after {iteration + 1} iterations")
        break
    
    previous_loss = loss
    
    if iteration % 100 == 0:
        print(f"Iteration {iteration}: Loss = {loss:.4f}")

print(f"Optimized theta: {theta.ravel()}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))

# Plot the loss over the iterations
ax1.plot(range(len(loss_history)), loss_history)
ax1.set_xlabel("Iteration")
ax1.set_ylabel("Loss")
ax1.set_title("Loss versus iteration")

# Plot the fitted line
ax2.scatter(X, y)
ax2.plot(X, X_b.dot(theta), color='red')
ax2.set_xlabel("X")
ax2.set_ylabel("y")
ax2.set_title("Data points and fitted line")

plt.show()

Code walkthrough

Generating the sample data

The variables $X$ and $y$ are both $100 \times 1$ matrices. To account for the intercept term of simple linear regression (theta[0] in the code), a column of ones is added in front of $X$, so that $X\_b$ becomes a $100 \times 2$ matrix.
$$X\_b = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{100} \end{bmatrix}$$

The variable $theta$ holds the initial values of the linear function's coefficients and is a $2 \times 1$ matrix.
$$theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

Computing the predictions

The linear function of simple linear regression is $y = \theta_0 + \theta_1 \times x$. Here the matrix of predictions $y\_pred$ is obtained directly by matrix multiplication.
Multiplying an $m \times n$ matrix by an $n \times p$ matrix gives an $m \times p$ matrix.
$$\begin{aligned} y\_pred &= X\_b \times theta \\ &= \begin{bmatrix} 1 \times \theta_0 + x_1 \times \theta_1 \\ 1 \times \theta_0 + x_2 \times \theta_1 \\ \vdots \\ 1 \times \theta_0 + x_{100} \times \theta_1 \end{bmatrix} \\ &= \begin{bmatrix} y_1\_pred \\ y_2\_pred \\ \vdots \\ y_{100}\_pred \end{bmatrix} \end{aligned}$$
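The shape rule and the row-by-row expansion above can be confirmed with a few lines of NumPy. This is a small sketch with assumed values: the random inputs and the fixed coefficients 4 and 3 are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))        # 100 x 1 inputs (assumed random data)
X_b = np.c_[np.ones((100, 1)), X]   # 100 x 2 after prepending the column of ones
theta = np.array([[4.0], [3.0]])    # 2 x 1, assumed coefficients

y_pred = X_b.dot(theta)                  # (100 x 2) times (2 x 1) gives 100 x 1
elementwise = theta[0, 0] + theta[1, 0] * X  # theta_0 + theta_1 * x_i for every row

print(X_b.shape, theta.shape, y_pred.shape)
print(np.allclose(y_pred, elementwise))  # the matrix product matches the per-row formula
```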

Computing the gradient

The gradient is the set of partial derivatives of the loss function with respect to the coefficients; the loss function used here is the MSE. $X\_b$ is a $100 \times 2$ matrix, so its transpose is a $2 \times 100$ matrix:
$$X\_b.T = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_{100} \end{bmatrix}$$

The gradient is computed in matrix form (with $m = 100$ samples):
$$\begin{aligned} gradients &= \frac{2}{m} \times X\_b.T \times (y\_pred - y) \\ &= \frac{2}{m} \times \begin{bmatrix} 1 \times (y_1\_pred - y_1) + 1 \times (y_2\_pred - y_2) + \cdots + 1 \times (y_{100}\_pred - y_{100}) \\ x_1 \times (y_1\_pred - y_1) + x_2 \times (y_2\_pred - y_2) + \cdots + x_{100} \times (y_{100}\_pred - y_{100}) \end{bmatrix} \\ &= \begin{bmatrix} \frac{2}{m} \times \sum_{i = 1}^{m}(y_i\_pred - y_i) \\ \frac{2}{m} \times \sum_{i = 1}^{m}(y_i\_pred - y_i) \times x_i \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} \\ \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} \end{bmatrix} \end{aligned}$$
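The equality between the matrix expression and the two per-parameter sums can be checked numerically. The sketch below uses assumed random data and random coefficients; only the two gradient formulas come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))   # assumed noisy linear data
X_b = np.c_[np.ones((m, 1)), X]
theta = rng.standard_normal((2, 1))           # arbitrary coefficients

# Matrix form of the gradient
y_pred = X_b.dot(theta)
gradients = (2 / m) * X_b.T.dot(y_pred - y)

# Per-parameter sums from the partial derivative formulas
residuals = (y_pred - y).ravel()
grad_theta0 = (2 / m) * np.sum(residuals)
grad_theta1 = (2 / m) * np.sum(residuals * X.ravel())

print(np.allclose(gradients.ravel(), [grad_theta0, grad_theta1]))
```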

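⑵ Learning rate

The learning rate is a hyperparameter: it must be set manually before training starts and cannot be learned from the sample data. It determines the step size of each parameter update during gradient descent, and it also affects how fast the loss converges and how stable the whole optimization process is.

A larger learning rate lets the loss move toward the optimal parameters faster, but a learning rate that is too large can overshoot the minimum of the loss function and end up oscillating around it or diverging. A smaller learning rate slows convergence: it avoids overshooting the minimum but needs more iterations to reach the optimal parameters. Choosing an appropriate learning rate is therefore critical. In practice, a moderate learning rate is usually chosen at the start and then adjusted dynamically during training based on how the model behaves.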

Python

import numpy as np
import matplotlib.pyplot as plt


# Loss function
def compute_loss(x: float):
    return x**2 - 4*x + 1


# Gradient (derivative) of the loss
def compute_gradient(x: float):
    return 2*x - 4


# Gradient descent
def gradient_descent(start: float, learning_rate: float, max_iterations: int, threshold: float=0.01):
    x = start
    steps = [start]

    for _ in range(max_iterations):
        gradients = compute_gradient(x)
        difference = learning_rate * gradients

        # Stop once the update step is smaller than the threshold
        if np.abs(difference) < threshold:
            break

        x -= difference
        steps.append(x)

    return steps


# Learning rates to compare
start = 10
learning_rates = [0.1, 0.5, 0.8]
iterations = 100

# Data for drawing the function curve
x = np.linspace(-8, 12, 10000)
y = [compute_loss(param) for param in x]

fig, axes = plt.subplots(1, 3, figsize=(12, 8))

for i, learning_rate in enumerate(learning_rates):
    points_x = gradient_descent(start, learning_rate, iterations)
    points_y = [compute_loss(x) for x in points_x]
    axes[i].plot(x, y)
    axes[i].scatter(points_x, points_y, color="red")
    axes[i].plot(points_x, points_y, "o--r", label=f"Learning rate: {learning_rate}")
    axes[i].set_xlabel("X")
    axes[i].set_ylabel("Y")
    axes[i].legend(loc="upper right")

fig.suptitle("Effect of the learning rate on convergence speed and stability")
plt.show()

The code above uses a made-up loss function $y = x^2 - 4x + 1$ to show how different learning rates affect convergence speed and stability.
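For this particular loss the effect can also be worked out by hand. The update is $x \leftarrow x - \alpha(2x - 4)$, so the distance to the minimizer $x = 2$ is multiplied by $(1 - 2\alpha)$ at every step: the iterates converge when $0 < \alpha < 1$, jump straight to the minimum when $\alpha = 0.5$, oscillate around it when $0.5 < \alpha < 1$, and diverge when $\alpha > 1$. The sketch below checks this; the learning rate 1.1 and the fixed 20 steps are assumptions added to show divergence, and the early-stopping threshold is left out for simplicity.

```python
def run_gradient_descent(start: float, learning_rate: float, steps: int) -> float:
    """Run a fixed number of updates on the loss x^2 - 4x + 1 and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * (2 * x - 4)  # gradient of the loss is 2x - 4
    return x


start, minimizer, steps = 10.0, 2.0, 20

for learning_rate in [0.1, 0.5, 0.8, 1.1]:
    final_x = run_gradient_descent(start, learning_rate, steps)
    # Distance predicted by the contraction factor |1 - 2 * alpha|
    predicted = abs(1 - 2 * learning_rate) ** steps * abs(start - minimizer)
    print(f"alpha={learning_rate}: distance={abs(final_x - minimizer):.6g}, predicted={predicted:.6g}")
```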

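⒊ Types of gradient descent

Three types of gradient descent are in wide use. They differ mainly in how many samples are used for each gradient computation.

⑴ Batch gradient descent

Batch gradient descent uses all of the samples to compute the gradient in every iteration, which makes each gradient exact. Because the computation is heavy, however, training takes longer and consumes more resources. If the loss function is convex, batch gradient descent (with a suitable learning rate) is guaranteed to converge to the global optimum; if the loss function is non-convex, the result may get stuck in a local minimum so that the model never reaches the global optimum.

Batch gradient descent is best suited to cases where the dataset is small, the loss function is convex, and the model itself is fairly simple.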
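One way to see the difference is to count how many parameter updates each variant performs in a single pass over the data. The sketch below assumes 1000 samples and a mini-batch size of 32; the helper name `updates_per_epoch` is made up for this illustration.

```python
from math import ceil


def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of parameter updates in one pass over the data."""
    return ceil(num_samples / batch_size)


num_samples = 1000  # assumed dataset size

batch_updates = updates_per_epoch(num_samples, num_samples)  # all samples per update
stochastic_updates = updates_per_epoch(num_samples, 1)       # one sample per update
mini_batch_updates = updates_per_epoch(num_samples, 32)      # 32 samples per update

print(batch_updates, stochastic_updates, mini_batch_updates)
```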

Python

import numpy as np
from time import time_ns


def batch_gradient_descent(features: np.ndarray, targets: np.ndarray, weights: np.ndarray, iterations: int,
                           learning_rate: float, loss_threshold: float):
    # Relies on compute_loss(features, targets, weights) from the comparison script at the end of this section
    losses = []
    milliseconds = []
    epochs = []

    weights = weights.copy()  # work on a copy so the caller's initial parameters are not modified
    m = len(targets)
    loss_current = np.inf
    start_time = int(time_ns() // 1e6)  # time in milliseconds

    for i in range(iterations):
        epochs.append(i)
        loss_prev = loss_current
        shuffled_indexes = np.random.permutation(m)
        # Shuffle the data before each iteration (this does not change the full-batch gradient)
        features, targets = features[shuffled_indexes], targets[shuffled_indexes]
        # Compute the predictions
        predictions = features.dot(weights)
        # Compute the gradient over all samples
        gradients = (2 / m) * features.T.dot(predictions - targets)
        # Update the parameters
        weights -= learning_rate * gradients
        # Compute the loss
        loss_current = compute_loss(features, targets, weights)
        losses.append(loss_current)
        milliseconds.append(int(time_ns() // 1e6) - start_time)

        # Early stop when the change in loss between two iterations falls below the threshold
        if abs(loss_current - loss_prev) < loss_threshold:
            break

    return losses, milliseconds, epochs

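⑵ Stochastic gradient descent

Stochastic gradient descent computes the gradient from a single randomly chosen sample at each update. This reduces the time needed for training and can sometimes escape local minima of the loss function, but the randomness of the sample selection makes the updates noisy, so more iterations are needed for the result to converge toward the global optimum.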

Python

import numpy as np
from time import time_ns


def stochastic_gradient_descent(features: np.ndarray, targets: np.ndarray, weights: np.ndarray, iterations: int,
                                learning_rate: float, loss_threshold: float):
    # Relies on compute_loss(features, targets, weights) from the comparison script at the end of this section
    losses = []
    milliseconds = []
    epochs = []

    weights = weights.copy()  # work on a copy so the caller's initial parameters are not modified
    m = len(targets)
    loss_current = np.inf
    start_time = int(time_ns() // 1e6)  # time in milliseconds

    for i in range(iterations):
        epochs.append(i)
        loss_prev = loss_current
        shuffled_indexes = np.random.permutation(m)
        # Shuffle the data before each pass so samples are visited in random order
        features, targets = features[shuffled_indexes], targets[shuffled_indexes]

        for j in range(m):
            # Take a single training sample
            features_train, targets_train = features[j:j+1], targets[j:j+1]
            # Compute the prediction
            predictions = features_train.dot(weights)
            # Compute the gradient from this one sample
            gradients = 2 * features_train.T.dot(predictions - targets_train)
            # Update the parameters
            weights -= learning_rate * gradients
        
        # Compute the loss over the whole dataset
        loss_current = compute_loss(features, targets, weights)
        losses.append(loss_current)
        milliseconds.append(int(time_ns() // 1e6) - start_time)

        # Early stop when the change in loss between two passes falls below the threshold
        if abs(loss_current - loss_prev) < loss_threshold:
            break

    return losses, milliseconds, epochs
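The noise in stochastic gradient descent comes from the fact that each single-sample gradient is only an estimate of the full gradient. As a hedged illustration on assumed random data, the sketch below shows that the single-sample gradients average out to exactly the full-batch gradient at fixed weights, while individual samples scatter around it.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))   # assumed noisy linear data
X_b = np.c_[np.ones((m, 1)), X]
weights = rng.standard_normal((2, 1))         # arbitrary fixed weights

# Full-batch gradient of the MSE
full_gradient = (2 / m) * X_b.T.dot(X_b.dot(weights) - y)

# One gradient per sample, using the same formula as a single stochastic update
per_sample = np.stack([
    2 * X_b[j:j + 1].T.dot(X_b[j:j + 1].dot(weights) - y[j:j + 1])
    for j in range(m)
])  # shape (m, 2, 1)

print(np.allclose(per_sample.mean(axis=0), full_gradient))  # the average matches the full gradient
print(per_sample.std(axis=0).ravel())                       # spread of the single-sample estimates
```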

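⑶ Mini-batch gradient descent

Mini-batch gradient descent is a compromise between the two previous variants: each update computes the gradient from a small, randomly selected batch of samples. Compared with batch gradient descent it runs faster, and compared with stochastic gradient descent its results are more stable. The choice of batch size is critical, though: a batch that is too small behaves like stochastic gradient descent, while a batch that is too large behaves like batch gradient descent. The batch size is usually set to a power of 2.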

Python

import numpy as np
from math import ceil
from time import time_ns


def mini_batch_gradient_descent(features: np.ndarray, targets: np.ndarray, weights: np.ndarray, iterations: int,
                                learning_rate: float, loss_threshold: float, batch_size: int):
    # Relies on compute_loss(features, targets, weights) from the comparison script at the end of this section
    losses = []
    milliseconds = []
    epochs = []

    weights = weights.copy()  # work on a copy so the caller's initial parameters are not modified
    m = len(targets)
    loss_current = np.inf
    start_time = int(time_ns() // 1e6)  # time in milliseconds
    batch_num = ceil(m / batch_size)

    for i in range(iterations):
        epochs.append(i)
        loss_prev = loss_current
        shuffled_indexes = np.random.permutation(m)
        # Shuffle the data before each pass
        features, targets = features[shuffled_indexes], targets[shuffled_indexes]

        for j in range(batch_num):
            # Take one batch of samples (the last batch may be smaller than batch_size)
            batch_start, batch_stop = j * batch_size, min(m, (j + 1) * batch_size)
            features_train = features[batch_start:batch_stop]
            targets_train = targets[batch_start:batch_stop]
            # Compute the predictions
            predictions = features_train.dot(weights)
            # Compute the gradient, averaged over the actual number of samples in this batch
            gradients = (2 / len(targets_train)) * features_train.T.dot(predictions - targets_train)
            # Update the parameters
            weights -= learning_rate * gradients

        # Compute the loss over the whole dataset
        loss_current = compute_loss(features, targets, weights)
        losses.append(loss_current)
        milliseconds.append(int(time_ns() // 1e6) - start_time)

        # Early stop when the change in loss between two passes falls below the threshold
        if abs(loss_current - loss_prev) < loss_threshold:
            break

    return losses, milliseconds, epochs
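A detail worth noting in the slicing logic: when the number of samples is not a multiple of the batch size, the last batch is smaller than the others, so a gradient averaged over the batch should be divided by the actual number of rows rather than the nominal batch size. The sketch below lists the batch boundaries for an assumed 1000 samples and batch size 32; `batch_slices` is a hypothetical helper that mirrors the index arithmetic used above.

```python
from math import ceil


def batch_slices(num_samples: int, batch_size: int) -> list:
    """Return (start, stop) index pairs covering all samples in order."""
    batch_num = ceil(num_samples / batch_size)
    return [(j * batch_size, min(num_samples, (j + 1) * batch_size)) for j in range(batch_num)]


slices = batch_slices(1000, 32)
sizes = [stop - start for start, stop in slices]

print(len(sizes))            # number of batches per pass
print(sizes[0], sizes[-1])   # size of the first and of the last batch
print(sum(sizes))            # every sample is used exactly once
```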

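Comparing the three variants

So that all three algorithms can be run under the same settings, the learning rate and the loss threshold are both set fairly small. As a result, batch gradient descent runs through all 10000 iterations without reaching the loss threshold.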

Python

import numpy as np
from math import ceil
import matplotlib.pyplot as plt
from time import time_ns

# Assumes the three gradient descent functions above are already defined

# Generate the sample data
np.random.seed(0)
X = 2 * np.random.rand(1000, 1)
y = 4 + 3 * X + np.random.randn(1000, 1)

# Add a column of ones to X to account for the intercept
X_b = np.c_[np.ones((len(X), 1)), X]

# Initialize the intercept and slope parameters
theta = np.random.randn(2, 1)

learning_rate = 0.0003
batch_size = 32
epochs = 10000
loss_threshold = 1e-7


# Loss function (MSE)
def compute_loss(features: np.ndarray, targets: np.ndarray, weights: np.ndarray):
    m = len(targets)
    # Compute the predictions
    predictions = features.dot(weights)
    # Compute the loss
    loss = np.sum((predictions - targets) ** 2) / m
    return loss
    

fig, axes = plt.subplots(3, 2, figsize=(12, 8))

# Pass a copy of theta to each run so that all three start from the same initial parameters
losses, milliseconds, iterations = batch_gradient_descent(X_b, y, theta.copy(), epochs, learning_rate, loss_threshold)

axes[0][0].plot(iterations, losses)
axes[0][0].set_xlabel("Iteration")
axes[0][0].set_ylabel("Loss")
axes[0][0].set_title("Batch gradient descent: loss versus iteration")

axes[0][1].plot(milliseconds, losses)
axes[0][1].set_xlabel("Time (ms)")
axes[0][1].set_ylabel("Loss")
axes[0][1].set_title("Batch gradient descent: loss versus time")

losses, milliseconds, iterations = stochastic_gradient_descent(X_b, y, theta.copy(), epochs, learning_rate, loss_threshold)

axes[1][0].plot(iterations, losses)
axes[1][0].set_xlabel("Iteration")
axes[1][0].set_ylabel("Loss")
axes[1][0].set_title("Stochastic gradient descent: loss versus iteration")

axes[1][1].plot(milliseconds, losses)
axes[1][1].set_xlabel("Time (ms)")
axes[1][1].set_ylabel("Loss")
axes[1][1].set_title("Stochastic gradient descent: loss versus time")

losses, milliseconds, iterations = mini_batch_gradient_descent(X_b, y, theta.copy(), epochs, learning_rate, loss_threshold, batch_size)

axes[2][0].plot(iterations, losses)
axes[2][0].set_xlabel("Iteration")
axes[2][0].set_ylabel("Loss")
axes[2][0].set_title("Mini-batch gradient descent: loss versus iteration")

axes[2][1].plot(milliseconds, losses)
axes[2][1].set_xlabel("Time (ms)")
axes[2][1].set_ylabel("Loss")
axes[2][1].set_title("Mini-batch gradient descent: loss versus time")

plt.tight_layout()
plt.show()

Results: