
Table of Contents
- Gradient Descent in C++: The Core Iterative Algorithm of Numerical Optimization
  - Introduction
  - I. Core Principles of Gradient Descent
    - 1. Mathematical Foundations: Gradients and Optimization
    - 2. Core Variants of Gradient Descent
    - 3. Key Parameters of Gradient Descent
  - II. A Basic C++ Gradient Descent Framework
    - 1. Common Data Structures and Utility Functions
    - 2. Core Gradient Descent Implementation
  - III. Case Study 1: Finding a Function Minimum (Unconstrained Optimization)
    - 1. Problem Statement
    - 2. Gradient Derivation
    - 3. Full Implementation
  - IV. Case Study 2: Linear Regression (Supervised Learning)
    - 1. Problem Statement
    - 2. Gradient Derivation
    - 3. Full Implementation
  - V. Optimization Techniques for Gradient Descent
    - 1. Learning Rate Scheduling
    - 2. Momentum
    - 3. Adaptive Learning Rates (Adam)
  - VI. Complete Runnable Code
  - VII. Common Pitfalls and How to Avoid Them
    - 1. Poorly Chosen Learning Rate
    - 2. Incorrect Gradient Computation
    - 3. Unnormalized Data
    - 4. Vanishing/Exploding Gradients
    - 5. Local Minima
  - VIII. Summary
Gradient Descent in C++: The Core Iterative Algorithm of Numerical Optimization
Introduction
Gradient descent (GD) is the classic first-order optimization algorithm. Its core idea is to iteratively update parameters along the direction opposite to the gradient of the loss function, stepping closer to a minimum with each iteration. It underpins machine learning (linear regression, neural network training), numerical optimization, and nonlinear equation solving, and it is simple to implement, computationally cheap per step, and broadly applicable.
This article walks through the mathematical principles of gradient descent, its core variants, a C++ implementation framework, and two hands-on case studies (function minimization and linear regression), with complete code to help you master both the core ideas and their engineering practice.
I. Core Principles of Gradient Descent
1. Mathematical Foundations: Gradients and Optimization
(1) Definition of the gradient
For a multivariate function $f(\boldsymbol{x})$ with $\boldsymbol{x} = [x_1, x_2, ..., x_n]^T$, the gradient is the vector
$$\nabla f(\boldsymbol{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n} \right)^T$$
Geometrically, the gradient points in the direction of steepest ascent at a point; its opposite is the direction of steepest descent.
(2) The core update formula
Gradient descent iterates
$$\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta \cdot \nabla f(\boldsymbol{x}_t)$$
where:
- $\boldsymbol{x}_t$: the parameter value at iteration $t$;
- $\eta$ (learning rate): the step size, controlling how far each update moves;
- $\nabla f(\boldsymbol{x}_t)$: the gradient of $f$ at $\boldsymbol{x}_t$;
- the minus sign: update against the gradient direction (toward a minimum).
(3) Convergence criteria
Common stopping conditions (any one suffices):
- the iteration count reaches a preset maximum;
- the gradient norm falls below a threshold ($\|\nabla f(\boldsymbol{x}_t)\| < \epsilon$, e.g. $\epsilon = 10^{-6}$);
- the parameter update is smaller than a threshold ($\|\boldsymbol{x}_{t+1} - \boldsymbol{x}_t\| < \epsilon$);
- the change in loss is smaller than a threshold ($|f(\boldsymbol{x}_{t+1}) - f(\boldsymbol{x}_t)| < \epsilon$).
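The update rule and the gradient-norm stopping criterion above fit in a few lines of C++. A minimal 1-D sketch, assuming the illustrative objective f(x) = (x − 3)² (chosen for this example only; it is not part of the framework below):

```cpp
#include <cmath>

// Minimal 1-D gradient descent on f(x) = (x - 3)^2, whose gradient is 2(x - 3).
// Stops when |∇f(x)| < eps (convergence) or after max_iter steps.
double minimize_1d(double x0, double eta = 0.1, double eps = 1e-8, int max_iter = 10000) {
    double x = x0;
    for (int t = 0; t < max_iter; ++t) {
        double grad = 2.0 * (x - 3.0);     // ∇f(x_t)
        if (std::fabs(grad) < eps) break;  // stopping criterion: gradient magnitude below threshold
        x -= eta * grad;                   // x_{t+1} = x_t - η·∇f(x_t)
    }
    return x;
}
```

Starting from x₀ = 0 with η = 0.1, each step shrinks the distance to the minimizer by a factor of 0.8, so `minimize_1d(0.0)` lands within about 5·10⁻⁹ of x = 3.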
2. Core Variants of Gradient Descent
Based on how data is used per update, gradient descent comes in three core variants:
| Variant | Core idea | Pros | Cons | Typical use |
|---|---|---|---|---|
| Batch gradient descent (BGD) | Computes the gradient over the entire training set | Stable convergence; finds the global optimum on convex problems | Expensive per step; slow on large datasets | Small datasets, convex problems |
| Stochastic gradient descent (SGD) | Computes the gradient from a single randomly chosen sample | Fast per step; can escape local minima | Noisy, oscillating convergence | Large datasets, non-convex problems |
| Mini-batch gradient descent (MBGD) | Computes the gradient over a small batch (e.g. batch_size = 32) | Balances BGD and SGD: stable and efficient | batch_size needs tuning | Most machine learning workloads |
3. Key Parameters of Gradient Descent
| Parameter | Role | Tuning advice |
|---|---|---|
| Learning rate $\eta$ | Controls the update step size | Too small: slow convergence; too large: oscillation or divergence; start from 0.01 or 0.001 |
| Max iterations | Bounds the run time | Combine with a convergence threshold to avoid wasted iterations |
| batch_size | Samples per mini-batch (MBGD) | 32/64/128 are common; trades computational efficiency against stability |
| Convergence threshold $\epsilon$ | Decides when to stop | Typically 1e-6 to 1e-8, depending on required precision |
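The learning-rate row of the table can be demonstrated on the simplest possible objective. A small sketch, assuming f(x) = x² (gradient 2x), where each step multiplies the iterate by (1 − 2η):

```cpp
#include <cmath>

// Runs a fixed number of gradient steps on f(x) = x^2 and returns |x| afterwards.
// Each step computes x_{t+1} = x_t - η·2x_t = (1 - 2η)·x_t, so the iterate
// contracts when |1 - 2η| < 1 and diverges when |1 - 2η| > 1.
double run_steps(double x0, double eta, int steps) {
    double x = x0;
    for (int t = 0; t < steps; ++t) {
        x -= eta * 2.0 * x;
    }
    return std::fabs(x);
}
```

`run_steps(1.0, 0.1, 50)` contracts toward 0 (factor 0.8 per step), while `run_steps(1.0, 1.1, 50)` grows without bound (factor −1.2 per step): exactly the oscillating divergence the table warns about.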
II. A Basic C++ Gradient Descent Framework
1. Common Data Structures and Utility Functions

```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <numeric>
#include <algorithm>
#include <iomanip>
#include <stdexcept>
#include <string>
using namespace std;
// Vector type (for multivariate optimization)
using Vector = vector<double>;
// Matrix type (for batched data)
using Matrix = vector<Vector>;
// Random number generator (singleton, high-quality randomness)
class RandomGenerator {
public:
    static RandomGenerator& get_instance() {
        static RandomGenerator instance;
        return instance;
    }
    // Uniform random double in [min, max]
    double rand_double(double min = 0.0, double max = 1.0) {
        uniform_real_distribution<double> dist(min, max);
        return dist(rng);
    }
    // Normally distributed random number (for parameter initialization)
    double rand_normal(double mean = 0.0, double stddev = 1.0) {
        normal_distribution<double> dist(mean, stddev);
        return dist(rng);
    }
    // Expose the engine (needed by std::shuffle and integer sampling below)
    mt19937& get_rng() { return rng; }
private:
    RandomGenerator() {
        random_device rd;
        rng = mt19937(rd()); // Mersenne Twister engine
    }
    // Non-copyable
    RandomGenerator(const RandomGenerator&) = delete;
    RandomGenerator& operator=(const RandomGenerator&) = delete;
    mt19937 rng;
};
// Dot product
double dot_product(const Vector& a, const Vector& b) {
    if (a.size() != b.size()) {
        throw invalid_argument("vector dimensions do not match");
    }
    double res = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        res += a[i] * b[i];
    }
    return res;
}
// Vector addition
Vector vector_add(const Vector& a, const Vector& b) {
    if (a.size() != b.size()) {
        throw invalid_argument("vector dimensions do not match");
    }
    Vector res(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        res[i] = a[i] + b[i];
    }
    return res;
}
// Scalar multiplication
Vector scalar_multiply(double scalar, const Vector& vec) {
    Vector res(vec.size());
    for (size_t i = 0; i < vec.size(); ++i) {
        res[i] = scalar * vec[i];
    }
    return res;
}
// L2 norm (length) of a vector
double vector_norm(const Vector& vec) {
    double sum = 0.0;
    for (double v : vec) {
        sum += v * v;
    }
    return sqrt(sum);
}
// Print a vector
void print_vector(const Vector& vec, const string& name = "vector") {
    cout << name << ": [";
    for (size_t i = 0; i < vec.size(); ++i) {
        cout << fixed << setprecision(6) << vec[i];
        if (i != vec.size() - 1) {
            cout << ", ";
        }
    }
    cout << "]" << endl;
}
// Print a matrix
void print_matrix(const Matrix& mat, const string& name = "matrix") {
    cout << name << ":" << endl;
    for (const auto& row : mat) {
        cout << "[";
        for (size_t i = 0; i < row.size(); ++i) {
            cout << fixed << setprecision(4) << row[i];
            if (i != row.size() - 1) {
                cout << ", ";
            }
        }
        cout << "]" << endl;
    }
}
```
2. Core Gradient Descent Implementation

```cpp
#include <functional> // std::function (add this to the includes above)
// Gradient descent configuration
struct GDParams {
    double learning_rate = 0.01; // learning rate
    int max_iter = 10000;        // maximum number of iterations
    double tol = 1e-6;           // convergence threshold (applied to the gradient norm)
    int batch_size = 32;         // mini-batch size (set to the dataset size for BGD, to 1 for SGD)
    bool verbose = true;         // print progress?
    int print_interval = 1000;   // print every N iterations
};
// Gradient descent result
struct GDResult {
    Vector params;     // optimized parameters
    double final_loss; // final loss value
    int iter_num;      // actual number of iterations
    bool converged;    // whether the run converged
};
// Core gradient descent routine (generic: works with any loss whose gradient is computable)
// Arguments:
// - init_params: initial parameter values
// - loss_func: loss function (parameters + data -> loss value)
// - grad_func: gradient function (parameters + data -> gradient vector)
// - data: training data (may be empty for pure function optimization)
// - gd_params: gradient descent configuration
GDResult gradient_descent(
    Vector init_params,
    function<double(const Vector&, const Matrix&)> loss_func,
    function<Vector(const Vector&, const Matrix&)> grad_func,
    const Matrix& data,
    const GDParams& gd_params
) {
    GDResult result;
    result.params = init_params;
    result.final_loss = 0.0;
    result.iter_num = 0;
    result.converged = false;
    int n_samples = data.size();
    // An empty data set is allowed: loss_func/grad_func may ignore the data
    // entirely (see the unconstrained optimization case below).
    for (int iter = 0; iter < gd_params.max_iter; ++iter) {
        // 1. Select the batch (BGD / SGD / MBGD)
        Matrix batch_data;
        if (n_samples == 0 || gd_params.batch_size >= n_samples) {
            // Batch gradient descent (BGD): use all the data
            batch_data = data;
        } else if (gd_params.batch_size == 1) {
            // Stochastic gradient descent (SGD): pick one sample at random
            uniform_int_distribution<int> dist(0, n_samples - 1);
            batch_data.push_back(data[dist(RandomGenerator::get_instance().get_rng())]);
        } else {
            // Mini-batch gradient descent (MBGD): pick batch_size samples at random
            vector<int> indices(n_samples);
            iota(indices.begin(), indices.end(), 0);
            shuffle(indices.begin(), indices.end(), RandomGenerator::get_instance().get_rng());
            for (int i = 0; i < gd_params.batch_size; ++i) {
                batch_data.push_back(data[indices[i]]);
            }
        }
        // 2. Current loss and gradient
        double current_loss = loss_func(result.params, batch_data);
        Vector grad = grad_func(result.params, batch_data);
        double grad_norm = vector_norm(grad);
        // 3. Progress output
        if (gd_params.verbose && iter % gd_params.print_interval == 0) {
            cout << "iter: " << iter
                 << " | loss: " << fixed << setprecision(6) << current_loss
                 << " | grad norm: " << fixed << setprecision(8) << grad_norm << endl;
        }
        // 4. Convergence check (gradient norm below threshold)
        if (grad_norm < gd_params.tol) {
            result.converged = true;
            result.iter_num = iter;
            result.final_loss = current_loss;
            break;
        }
        // 5. Parameter update: x = x - η·∇f(x)
        Vector update = scalar_multiply(gd_params.learning_rate, grad);
        for (size_t i = 0; i < result.params.size(); ++i) {
            result.params[i] -= update[i];
        }
        // 6. Bookkeeping
        result.iter_num = iter + 1;
        result.final_loss = current_loss;
    }
    // Final convergence report
    if (!result.converged && gd_params.verbose) {
        cout << "warning: reached the maximum number of iterations without converging" << endl;
    }
    return result;
}
```
III. Case Study 1: Finding a Function Minimum (Unconstrained Optimization)
1. Problem Statement
Find a minimum of the two-variable function $f(x, y) = x^2 + 2y^2 + 2\sin(x)\cos(y)$.
2. Gradient Derivation
The gradient of the function is:
$$\nabla f(x, y) = \left( 2x + 2\cos(x)\cos(y),\; 4y - 2\sin(x)\sin(y) \right)^T$$
3. Full Implementation

```cpp
// Objective: f(x, y) = x² + 2y² + 2·sin(x)·cos(y)
double target_function(const Vector& params) {
    if (params.size() != 2) {
        throw invalid_argument("parameters must be a 2-D vector (x, y)");
    }
    double x = params[0];
    double y = params[1];
    return x*x + 2*y*y + 2*sin(x)*cos(y);
}
// Loss wrapper (no data needed; returns the objective directly)
double loss_func_unconstrained(const Vector& params, const Matrix& /*data*/) {
    return target_function(params);
}
// Gradient of the objective
Vector grad_func_unconstrained(const Vector& params, const Matrix& /*data*/) {
    if (params.size() != 2) {
        throw invalid_argument("parameters must be a 2-D vector (x, y)");
    }
    double x = params[0];
    double y = params[1];
    Vector grad(2);
    grad[0] = 2*x + 2*cos(x)*cos(y); // ∂f/∂x
    grad[1] = 4*y - 2*sin(x)*sin(y); // ∂f/∂y
    return grad;
}
// Test: unconstrained optimization (function minimization)
void test_unconstrained_optimization() {
    cout << "===== Unconstrained optimization: function minimum =====" << endl;
    // 1. Random initial parameters
    Vector init_params = {
        RandomGenerator::get_instance().rand_double(-3.0, 3.0),
        RandomGenerator::get_instance().rand_double(-3.0, 3.0)
    };
    print_vector(init_params, "initial point (x, y)");
    cout << "initial value: " << fixed << setprecision(6) << target_function(init_params) << endl;
    // 2. Gradient descent configuration
    GDParams gd_params;
    gd_params.learning_rate = 0.01;
    gd_params.max_iter = 20000;
    gd_params.tol = 1e-8;
    gd_params.verbose = true;
    gd_params.print_interval = 2000;
    gd_params.batch_size = 1; // no data, so the variant choice is irrelevant
    // 3. Empty data set (unconstrained optimization needs no training data)
    Matrix empty_data;
    // 4. Run gradient descent
    GDResult result = gradient_descent(
        init_params,
        loss_func_unconstrained,
        grad_func_unconstrained,
        empty_data,
        gd_params
    );
    // 5. Report
    cout << "\n===== Result =====" << endl;
    print_vector(result.params, "minimizer (x, y)");
    cout << "minimum value: " << fixed << setprecision(6) << result.final_loss << endl;
    cout << "iterations: " << result.iter_num << endl;
    cout << "converged: " << (result.converged ? "yes" : "no") << endl;
}
```
IV. Case Study 2: Linear Regression (Supervised Learning)
1. Problem Statement
Given a dataset $\{(\boldsymbol{x}_i, y_i)\}$, fit the linear model $y = \boldsymbol{w}^T \boldsymbol{x} + b$ (that is, $y = w_1x_1 + w_2x_2 + ... + w_nx_n + b$) by minimizing the mean squared error (MSE) loss:
$$L(\boldsymbol{w}, b) = \frac{1}{m} \sum_{i=1}^m \left(y_i - (\boldsymbol{w}^T \boldsymbol{x}_i + b)\right)^2$$
2. Gradient Derivation
Folding the bias into the parameters (let $w_0 = b$, $x_{i0} = 1$) simplifies the model to $y = \boldsymbol{w}^T \boldsymbol{x}$, and the gradient of the loss becomes:
$$\frac{\partial L}{\partial w_j} = -\frac{2}{m} \sum_{i=1}^m \left(y_i - \boldsymbol{w}^T \boldsymbol{x}_i\right) x_{ij}$$
3. Full Implementation

```cpp
// Generate linear regression test data (y = 2·x1 + 3·x2 + 4 + noise)
Matrix generate_linear_regression_data(int n_samples, int n_features, double noise_std = 0.1) {
    Matrix data(n_samples, Vector(n_features + 1)); // last column is y
    // Ground-truth parameters: w = [2, 3], b = 4
    Vector true_weights = {2.0, 3.0};
    double true_bias = 4.0;
    for (int i = 0; i < n_samples; ++i) {
        // Features x1, x2
        for (int j = 0; j < n_features; ++j) {
            data[i][j] = RandomGenerator::get_instance().rand_double(0.0, 10.0);
        }
        // Label y = 2·x1 + 3·x2 + 4 + noise
        double y = true_bias;
        for (int j = 0; j < n_features; ++j) {
            y += true_weights[j] * data[i][j];
        }
        // Gaussian noise
        y += RandomGenerator::get_instance().rand_normal(0.0, noise_std);
        data[i][n_features] = y;
    }
    return data;
}
// MSE loss for linear regression
double loss_func_linear_regression(const Vector& params, const Matrix& batch_data) {
    int n_samples = batch_data.size();
    int n_features = batch_data[0].size() - 1; // last column is y
    if ((int)params.size() != n_features + 1) { // params = [b, w1, w2] (b is the bias, w1/w2 the weights)
        throw invalid_argument("wrong parameter dimension: expected " + to_string(n_features + 1));
    }
    double loss = 0.0;
    for (const auto& sample : batch_data) {
        // Prediction: y_pred = b + w1·x1 + w2·x2
        double y_pred = params[0]; // bias b
        for (int j = 0; j < n_features; ++j) {
            y_pred += params[j+1] * sample[j];
        }
        // Ground truth
        double y_true = sample[n_features];
        // Squared error
        loss += (y_true - y_pred) * (y_true - y_pred);
    }
    return loss / n_samples; // mean loss
}
// Gradient of the MSE loss
Vector grad_func_linear_regression(const Vector& params, const Matrix& batch_data) {
    int n_samples = batch_data.size();
    int n_features = batch_data[0].size() - 1;
    if ((int)params.size() != n_features + 1) {
        throw invalid_argument("wrong parameter dimension: expected " + to_string(n_features + 1));
    }
    Vector grad(params.size(), 0.0);
    for (const auto& sample : batch_data) {
        // Prediction
        double y_pred = params[0];
        for (int j = 0; j < n_features; ++j) {
            y_pred += params[j+1] * sample[j];
        }
        // Ground truth
        double y_true = sample[n_features];
        double error = y_true - y_pred;
        // Accumulate gradients
        grad[0] -= 2 * error / n_samples; // gradient of the bias b
        for (int j = 0; j < n_features; ++j) {
            grad[j+1] -= 2 * error * sample[j] / n_samples; // gradient of weight wj
        }
    }
    return grad;
}
// Test: linear regression
void test_linear_regression() {
    cout << "\n===== Linear regression via gradient descent =====" << endl;
    // 1. Generate test data
    int n_samples = 1000; // number of samples
    int n_features = 2;   // number of features (x1, x2)
    Matrix data = generate_linear_regression_data(n_samples, n_features);
    cout << "generated " << n_samples << " samples with " << n_features << " features" << endl;
    // print_matrix(data, "first samples"); // optional: inspect the data
    // 2. Initialize parameters (b, w1, w2)
    Vector init_params(n_features + 1);
    for (size_t i = 0; i < init_params.size(); ++i) {
        init_params[i] = RandomGenerator::get_instance().rand_normal(0.0, 0.1); // small random init
    }
    print_vector(init_params, "initial (b, w1, w2)");
    // 3. Gradient descent configuration (mini-batch)
    GDParams gd_params;
    gd_params.learning_rate = 0.001;
    gd_params.max_iter = 10000;
    gd_params.tol = 1e-7;
    gd_params.verbose = true;
    gd_params.print_interval = 1000;
    gd_params.batch_size = 64; // mini-batch size
    // 4. Run gradient descent
    GDResult result = gradient_descent(
        init_params,
        loss_func_linear_regression,
        grad_func_linear_regression,
        data,
        gd_params
    );
    // 5. Report
    cout << "\n===== Fit result =====" << endl;
    print_vector(result.params, "fitted [b, w1, w2]");
    cout << "true parameters: [4.0, 2.0, 3.0]" << endl;
    cout << "final MSE loss: " << fixed << setprecision(6) << result.final_loss << endl;
    cout << "iterations: " << result.iter_num << endl;
    cout << "converged: " << (result.converged ? "yes" : "no") << endl;
    // 6. Prediction check
    Vector test_sample = {5.0, 6.0}; // x1 = 5, x2 = 6
    double y_pred = result.params[0] + result.params[1]*test_sample[0] + result.params[2]*test_sample[1];
    double y_true = 4.0 + 2.0*5.0 + 3.0*6.0;
    cout << "\nPrediction test:" << endl;
    cout << "sample (x1, x2): [" << test_sample[0] << ", " << test_sample[1] << "]" << endl;
    cout << "predicted: " << fixed << setprecision(4) << y_pred << endl;
    cout << "actual: " << fixed << setprecision(4) << y_true << endl;
    cout << "error: " << fixed << setprecision(4) << abs(y_pred - y_true) << endl;
}
```
V. Optimization Techniques for Gradient Descent
1. Learning Rate Scheduling
A fixed learning rate tends to converge slowly or oscillate; dynamic schedules are common:

```cpp
// Exponential (staircase) decay: the rate is scaled by decay_rate every decay_step iterations
double exponential_decay_lr(double initial_lr, int iter, double decay_rate = 0.99, int decay_step = 100) {
    return initial_lr * pow(decay_rate, iter / decay_step); // integer division is intentional
}
// Linear decay from initial_lr down to end_lr over max_iter iterations
double linear_decay_lr(double initial_lr, int iter, int max_iter, double end_lr = 0.0001) {
    return initial_lr - (initial_lr - end_lr) * (double)iter / max_iter;
}
```
2. Momentum
Momentum mimics physical inertia, accelerating convergence and damping oscillation:

```cpp
// Gradient descent with momentum
GDResult gradient_descent_with_momentum(
    Vector init_params,
    function<double(const Vector&, const Matrix&)> loss_func,
    function<Vector(const Vector&, const Matrix&)> grad_func,
    const Matrix& data,
    const GDParams& gd_params,
    double momentum = 0.9 // momentum coefficient (0.9 is typical)
) {
    GDResult result;
    result.params = init_params;
    result.final_loss = 0.0;
    result.iter_num = 0;
    result.converged = false;
    Vector velocity(init_params.size(), 0.0); // velocity term (momentum)
    int n_samples = data.size();
    for (int iter = 0; iter < gd_params.max_iter; ++iter) {
        // 1. Select the batch (same as the basic version; empty data is allowed)
        Matrix batch_data;
        if (n_samples == 0 || gd_params.batch_size >= n_samples) {
            batch_data = data;
        } else if (gd_params.batch_size == 1) {
            uniform_int_distribution<int> dist(0, n_samples - 1);
            batch_data.push_back(data[dist(RandomGenerator::get_instance().get_rng())]);
        } else {
            vector<int> indices(n_samples);
            iota(indices.begin(), indices.end(), 0);
            shuffle(indices.begin(), indices.end(), RandomGenerator::get_instance().get_rng());
            for (int i = 0; i < gd_params.batch_size; ++i) {
                batch_data.push_back(data[indices[i]]);
            }
        }
        // 2. Loss and gradient
        double current_loss = loss_func(result.params, batch_data);
        Vector grad = grad_func(result.params, batch_data);
        double grad_norm = vector_norm(grad);
        // 3. Progress output
        if (gd_params.verbose && iter % gd_params.print_interval == 0) {
            cout << "iter: " << iter
                 << " | loss: " << fixed << setprecision(6) << current_loss
                 << " | grad norm: " << fixed << setprecision(8) << grad_norm << endl;
        }
        // 4. Convergence check
        if (grad_norm < gd_params.tol) {
            result.converged = true;
            result.iter_num = iter;
            result.final_loss = current_loss;
            break;
        }
        // 5. Momentum update
        double lr = exponential_decay_lr(gd_params.learning_rate, iter); // decayed learning rate
        for (size_t i = 0; i < result.params.size(); ++i) {
            velocity[i] = momentum * velocity[i] + lr * grad[i]; // velocity update
            result.params[i] -= velocity[i];                     // parameter update
        }
        result.iter_num = iter + 1;
        result.final_loss = current_loss;
    }
    return result;
}
```
3. Adaptive Learning Rates (Adam)
Adam combines momentum with per-parameter adaptive learning rates and is currently the most widely used gradient descent variant:

```cpp
// Adam optimizer (simplified)
GDResult gradient_descent_adam(
    Vector init_params,
    function<double(const Vector&, const Matrix&)> loss_func,
    function<Vector(const Vector&, const Matrix&)> grad_func,
    const Matrix& data,
    const GDParams& gd_params,
    double beta1 = 0.9,   // first-moment decay coefficient
    double beta2 = 0.999, // second-moment decay coefficient
    double eps = 1e-8     // guards against division by zero
) {
    GDResult result;
    result.params = init_params;
    result.final_loss = 0.0;
    result.iter_num = 0;
    result.converged = false;
    Vector m(init_params.size(), 0.0); // first moment (momentum)
    Vector v(init_params.size(), 0.0); // second moment (adaptive scaling)
    int n_samples = data.size();
    for (int iter = 0; iter < gd_params.max_iter; ++iter) {
        // 1. Select the batch (empty data is allowed)
        Matrix batch_data;
        if (n_samples == 0 || gd_params.batch_size >= n_samples) {
            batch_data = data;
        } else if (gd_params.batch_size == 1) {
            uniform_int_distribution<int> dist(0, n_samples - 1);
            batch_data.push_back(data[dist(RandomGenerator::get_instance().get_rng())]);
        } else {
            vector<int> indices(n_samples);
            iota(indices.begin(), indices.end(), 0);
            shuffle(indices.begin(), indices.end(), RandomGenerator::get_instance().get_rng());
            for (int i = 0; i < gd_params.batch_size; ++i) {
                batch_data.push_back(data[indices[i]]);
            }
        }
        // 2. Loss and gradient
        double current_loss = loss_func(result.params, batch_data);
        Vector grad = grad_func(result.params, batch_data);
        double grad_norm = vector_norm(grad);
        // 3. Progress output
        if (gd_params.verbose && iter % gd_params.print_interval == 0) {
            cout << "iter: " << iter
                 << " | loss: " << fixed << setprecision(6) << current_loss
                 << " | grad norm: " << fixed << setprecision(8) << grad_norm << endl;
        }
        // 4. Convergence check
        if (grad_norm < gd_params.tol) {
            result.converged = true;
            result.iter_num = iter;
            result.final_loss = current_loss;
            break;
        }
        // 5. Adam update
        double lr = gd_params.learning_rate;
        double t = iter + 1;
        for (size_t i = 0; i < result.params.size(); ++i) {
            // Update biased first and second moments
            m[i] = beta1 * m[i] + (1 - beta1) * grad[i];
            v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];
            // Bias correction
            double m_hat = m[i] / (1 - pow(beta1, t));
            double v_hat = v[i] / (1 - pow(beta2, t));
            // Parameter update
            result.params[i] -= lr * m_hat / (sqrt(v_hat) + eps);
        }
        result.iter_num = iter + 1;
        result.final_loss = current_loss;
    }
    return result;
}
```
VI. Complete Runnable Code

```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <numeric>
#include <algorithm>
#include <iomanip>
#include <functional>
#include <stdexcept>
#include <string>
using namespace std;
// Vector and matrix types
using Vector = vector<double>;
using Matrix = vector<Vector>;
// Random number generator singleton
class RandomGenerator {
public:
    static RandomGenerator& get_instance() {
        static RandomGenerator instance;
        return instance;
    }
    double rand_double(double min = 0.0, double max = 1.0) {
        uniform_real_distribution<double> dist(min, max);
        return dist(rng);
    }
    double rand_normal(double mean = 0.0, double stddev = 1.0) {
        normal_distribution<double> dist(mean, stddev);
        return dist(rng);
    }
    mt19937& get_rng() { return rng; }
private:
    RandomGenerator() {
        random_device rd;
        rng = mt19937(rd());
    }
    RandomGenerator(const RandomGenerator&) = delete;
    RandomGenerator& operator=(const RandomGenerator&) = delete;
    mt19937 rng;
};
// Vector utilities
double dot_product(const Vector& a, const Vector& b) {
    if (a.size() != b.size()) {
        throw invalid_argument("vector dimensions do not match");
    }
    double res = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        res += a[i] * b[i];
    }
    return res;
}
Vector vector_add(const Vector& a, const Vector& b) {
    if (a.size() != b.size()) {
        throw invalid_argument("vector dimensions do not match");
    }
    Vector res(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        res[i] = a[i] + b[i];
    }
    return res;
}
Vector scalar_multiply(double scalar, const Vector& vec) {
    Vector res(vec.size());
    for (size_t i = 0; i < vec.size(); ++i) {
        res[i] = scalar * vec[i];
    }
    return res;
}
double vector_norm(const Vector& vec) {
    double sum = 0.0;
    for (double v : vec) {
        sum += v * v;
    }
    return sqrt(sum);
}
void print_vector(const Vector& vec, const string& name = "vector") {
    cout << name << ": [";
    for (size_t i = 0; i < vec.size(); ++i) {
        cout << fixed << setprecision(6) << vec[i];
        if (i != vec.size() - 1) {
            cout << ", ";
        }
    }
    cout << "]" << endl;
}
void print_matrix(const Matrix& mat, const string& name = "matrix") {
    cout << name << ":" << endl;
    for (const auto& row : mat) {
        cout << "[";
        for (size_t i = 0; i < row.size(); ++i) {
            cout << fixed << setprecision(4) << row[i];
            if (i != row.size() - 1) {
                cout << ", ";
            }
        }
        cout << "]" << endl;
    }
}
// Gradient descent configuration
struct GDParams {
    double learning_rate = 0.01;
    int max_iter = 10000;
    double tol = 1e-6;
    int batch_size = 32;
    bool verbose = true;
    int print_interval = 1000;
};
// Gradient descent result
struct GDResult {
    Vector params;
    double final_loss;
    int iter_num;
    bool converged;
};
// Core gradient descent routine
GDResult gradient_descent(
    Vector init_params,
    function<double(const Vector&, const Matrix&)> loss_func,
    function<Vector(const Vector&, const Matrix&)> grad_func,
    const Matrix& data,
    const GDParams& gd_params
) {
    GDResult result;
    result.params = init_params;
    result.final_loss = 0.0;
    result.iter_num = 0;
    result.converged = false;
    int n_samples = data.size();
    // An empty data set is allowed: loss_func/grad_func may ignore the data
    for (int iter = 0; iter < gd_params.max_iter; ++iter) {
        // Select the batch (BGD / SGD / MBGD)
        Matrix batch_data;
        if (n_samples == 0 || gd_params.batch_size >= n_samples) {
            batch_data = data;
        } else if (gd_params.batch_size == 1) {
            uniform_int_distribution<int> dist(0, n_samples - 1);
            batch_data.push_back(data[dist(RandomGenerator::get_instance().get_rng())]);
        } else {
            vector<int> indices(n_samples);
            iota(indices.begin(), indices.end(), 0);
            shuffle(indices.begin(), indices.end(), RandomGenerator::get_instance().get_rng());
            for (int i = 0; i < gd_params.batch_size; ++i) {
                batch_data.push_back(data[indices[i]]);
            }
        }
        // Loss and gradient
        double current_loss = loss_func(result.params, batch_data);
        Vector grad = grad_func(result.params, batch_data);
        double grad_norm = vector_norm(grad);
        // Progress output
        if (gd_params.verbose && iter % gd_params.print_interval == 0) {
            cout << "iter: " << iter
                 << " | loss: " << fixed << setprecision(6) << current_loss
                 << " | grad norm: " << fixed << setprecision(8) << grad_norm << endl;
        }
        // Convergence check
        if (grad_norm < gd_params.tol) {
            result.converged = true;
            result.iter_num = iter;
            result.final_loss = current_loss;
            break;
        }
        // Parameter update
        Vector update = scalar_multiply(gd_params.learning_rate, grad);
        for (size_t i = 0; i < result.params.size(); ++i) {
            result.params[i] -= update[i];
        }
        result.iter_num = iter + 1;
        result.final_loss = current_loss;
    }
    if (!result.converged && gd_params.verbose) {
        cout << "warning: reached the maximum number of iterations without converging" << endl;
    }
    return result;
}
// ===================== Case 1: unconstrained optimization (function minimum) =====================
double target_function(const Vector& params) {
    if (params.size() != 2) {
        throw invalid_argument("parameters must be a 2-D vector (x, y)");
    }
    double x = params[0];
    double y = params[1];
    return x*x + 2*y*y + 2*sin(x)*cos(y);
}
double loss_func_unconstrained(const Vector& params, const Matrix& /*data*/) {
    return target_function(params);
}
Vector grad_func_unconstrained(const Vector& params, const Matrix& /*data*/) {
    if (params.size() != 2) {
        throw invalid_argument("parameters must be a 2-D vector (x, y)");
    }
    double x = params[0];
    double y = params[1];
    Vector grad(2);
    grad[0] = 2*x + 2*cos(x)*cos(y);
    grad[1] = 4*y - 2*sin(x)*sin(y);
    return grad;
}
void test_unconstrained_optimization() {
    cout << "===== Unconstrained optimization: function minimum =====" << endl;
    Vector init_params = {
        RandomGenerator::get_instance().rand_double(-3.0, 3.0),
        RandomGenerator::get_instance().rand_double(-3.0, 3.0)
    };
    print_vector(init_params, "initial point (x, y)");
    cout << "initial value: " << fixed << setprecision(6) << target_function(init_params) << endl;
    GDParams gd_params;
    gd_params.learning_rate = 0.01;
    gd_params.max_iter = 20000;
    gd_params.tol = 1e-8;
    gd_params.verbose = true;
    gd_params.print_interval = 2000;
    gd_params.batch_size = 1;
    Matrix empty_data;
    GDResult result = gradient_descent(
        init_params,
        loss_func_unconstrained,
        grad_func_unconstrained,
        empty_data,
        gd_params
    );
    cout << "\n===== Result =====" << endl;
    print_vector(result.params, "minimizer (x, y)");
    cout << "minimum value: " << fixed << setprecision(6) << result.final_loss << endl;
    cout << "iterations: " << result.iter_num << endl;
    cout << "converged: " << (result.converged ? "yes" : "no") << endl;
}
// ===================== Case 2: linear regression =====================
Matrix generate_linear_regression_data(int n_samples, int n_features, double noise_std = 0.1) {
    Matrix data(n_samples, Vector(n_features + 1));
    Vector true_weights = {2.0, 3.0};
    double true_bias = 4.0;
    for (int i = 0; i < n_samples; ++i) {
        for (int j = 0; j < n_features; ++j) {
            data[i][j] = RandomGenerator::get_instance().rand_double(0.0, 10.0);
        }
        double y = true_bias;
        for (int j = 0; j < n_features; ++j) {
            y += true_weights[j] * data[i][j];
        }
        y += RandomGenerator::get_instance().rand_normal(0.0, noise_std);
        data[i][n_features] = y;
    }
    return data;
}
double loss_func_linear_regression(const Vector& params, const Matrix& batch_data) {
    int n_samples = batch_data.size();
    int n_features = batch_data[0].size() - 1;
    if ((int)params.size() != n_features + 1) {
        throw invalid_argument("wrong parameter dimension: expected " + to_string(n_features + 1));
    }
    double loss = 0.0;
    for (const auto& sample : batch_data) {
        double y_pred = params[0];
        for (int j = 0; j < n_features; ++j) {
            y_pred += params[j+1] * sample[j];
        }
        double y_true = sample[n_features];
        loss += (y_true - y_pred) * (y_true - y_pred);
    }
    return loss / n_samples;
}
Vector grad_func_linear_regression(const Vector& params, const Matrix& batch_data) {
    int n_samples = batch_data.size();
    int n_features = batch_data[0].size() - 1;
    if ((int)params.size() != n_features + 1) {
        throw invalid_argument("wrong parameter dimension: expected " + to_string(n_features + 1));
    }
    Vector grad(params.size(), 0.0);
    for (const auto& sample : batch_data) {
        double y_pred = params[0];
        for (int j = 0; j < n_features; ++j) {
            y_pred += params[j+1] * sample[j];
        }
        double y_true = sample[n_features];
        double error = y_true - y_pred;
        grad[0] -= 2 * error / n_samples;
        for (int j = 0; j < n_features; ++j) {
            grad[j+1] -= 2 * error * sample[j] / n_samples;
        }
    }
    return grad;
}
void test_linear_regression() {
    cout << "\n===== Linear regression via gradient descent =====" << endl;
    int n_samples = 1000;
    int n_features = 2;
    Matrix data = generate_linear_regression_data(n_samples, n_features);
    Vector init_params(n_features + 1);
    for (size_t i = 0; i < init_params.size(); ++i) {
        init_params[i] = RandomGenerator::get_instance().rand_normal(0.0, 0.1);
    }
    print_vector(init_params, "initial (b, w1, w2)");
    GDParams gd_params;
    gd_params.learning_rate = 0.001;
    gd_params.max_iter = 10000;
    gd_params.tol = 1e-7;
    gd_params.verbose = true;
    gd_params.print_interval = 1000;
    gd_params.batch_size = 64;
    GDResult result = gradient_descent(
        init_params,
        loss_func_linear_regression,
        grad_func_linear_regression,
        data,
        gd_params
    );
    cout << "\n===== Fit result =====" << endl;
    print_vector(result.params, "fitted [b, w1, w2]");
    cout << "true parameters: [4.0, 2.0, 3.0]" << endl;
    cout << "final MSE loss: " << fixed << setprecision(6) << result.final_loss << endl;
    cout << "iterations: " << result.iter_num << endl;
    cout << "converged: " << (result.converged ? "yes" : "no") << endl;
    Vector test_sample = {5.0, 6.0};
    double y_pred = result.params[0] + result.params[1]*test_sample[0] + result.params[2]*test_sample[1];
    double y_true = 4.0 + 2.0*5.0 + 3.0*6.0;
    cout << "\nPrediction test:" << endl;
    cout << "sample (x1, x2): [" << test_sample[0] << ", " << test_sample[1] << "]" << endl;
    cout << "predicted: " << fixed << setprecision(4) << y_pred << endl;
    cout << "actual: " << fixed << setprecision(4) << y_true << endl;
    cout << "error: " << fixed << setprecision(4) << abs(y_pred - y_true) << endl;
}
// Entry point
int main() {
    // Case 1: unconstrained optimization
    test_unconstrained_optimization();
    // Case 2: linear regression
    test_linear_regression();
    return 0;
}
```
VII. Common Pitfalls and How to Avoid Them
1. Poorly Chosen Learning Rate
- Pitfall: a learning rate that is too large causes oscillation or divergence; one that is too small makes convergence extremely slow.
- Fix: start from 0.01 or 0.001, combine with learning rate decay (exponential/linear), or use an adaptive method such as Adam.
2. Incorrect Gradient Computation
- Pitfall: sign errors or wrong partial derivatives when deriving gradients by hand.
- Fix:
  - for simple functions, verify against a numerical gradient ($\frac{\partial f}{\partial x} \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$);
  - for complex functions, cross-check with an automatic differentiation tool (e.g. PyTorch).
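The central-difference check above is only a few lines of code. A self-contained sketch (the quadratic test function in the usage note is an assumption for illustration):

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Central-difference numerical gradient: ∂f/∂x_i ≈ (f(x + ε·e_i) - f(x - ε·e_i)) / (2ε).
// Compare its output against a hand-derived analytic gradient at a few test points.
std::vector<double> numerical_gradient(
    const std::function<double(const std::vector<double>&)>& f,
    std::vector<double> x, double eps = 1e-5) {
    std::vector<double> grad(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        double orig = x[i];
        x[i] = orig + eps; double fp = f(x);  // f(x + ε·e_i)
        x[i] = orig - eps; double fm = f(x);  // f(x - ε·e_i)
        x[i] = orig;                          // restore the coordinate
        grad[i] = (fp - fm) / (2.0 * eps);
    }
    return grad;
}
```

For example, for f(x, y) = x² + 2y², whose analytic gradient is (2x, 4y), `numerical_gradient` returns approximately (2, 8) at the point (1, 2); a mismatch much beyond 1e-6 signals a derivation error.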
3. Unnormalized Data
- Pitfall: features with very different scales (e.g. x1 ∈ [0, 1], x2 ∈ [0, 1000]) make gradient updates unbalanced across dimensions.
- Fix: standardize the data (Z-score: $x' = (x - \mu)/\sigma$) or normalize it (Min-Max: $x' = (x - min)/(max - min)$).
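Z-score standardization of a single feature column takes only a few lines; a minimal sketch (applying it column by column across a dataset is left to the caller):

```cpp
#include <cmath>
#include <vector>

// Z-score standardization of one feature column: x' = (x - μ) / σ.
// Afterwards the column has mean ≈ 0 and (population) standard deviation ≈ 1,
// so no single feature dominates the gradient updates.
std::vector<double> z_score(const std::vector<double>& col) {
    double mean = 0.0;
    for (double v : col) mean += v;
    mean /= col.size();
    double var = 0.0;
    for (double v : col) var += (v - mean) * (v - mean);
    double stddev = std::sqrt(var / col.size());
    std::vector<double> out(col.size());
    for (size_t i = 0; i < col.size(); ++i) {
        out[i] = (col[i] - mean) / stddev;  // assumes stddev > 0 (non-constant column)
    }
    return out;
}
```

For example, `z_score({1, 2, 3})` maps the middle value to 0 and the endpoints to ±√1.5.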
4. Vanishing/Exploding Gradients
- Pitfall: in deep neural networks, gradient values can shrink toward 0 or blow up toward infinity.
- Fix: use ReLU activations, proper weight initialization (Xavier/He), and gradient clipping.
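Gradient clipping is straightforward to bolt onto any of the update loops in this article; a sketch of clipping by global L2 norm:

```cpp
#include <cmath>
#include <vector>

// Gradient clipping by global L2 norm: if ||g|| > max_norm, rescale g so that
// ||g|| == max_norm. The direction is preserved; only the step size is bounded.
void clip_gradient(std::vector<double>& grad, double max_norm) {
    double norm = 0.0;
    for (double g : grad) norm += g * g;
    norm = std::sqrt(norm);
    if (norm > max_norm) {
        double scale = max_norm / norm;
        for (double& g : grad) g *= scale;
    }
}
```

Calling `clip_gradient(grad, 1.0)` on a gradient (3, 4) of norm 5 rescales it to (0.6, 0.8); gradients already within the bound are left untouched. Apply it right after the gradient is computed, before the parameter update.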
5. Local Minima
- Pitfall: on non-convex functions, gradient descent may converge to a local rather than the global minimum.
- Fix: restart from multiple random initializations, use momentum, or inject noise into the updates.
VIII. Summary
Key takeaways
- Core idea: update parameters against the gradient of the loss to approach a minimum, via $\boldsymbol{x}_{t+1} = \boldsymbol{x}_t - \eta \cdot \nabla f(\boldsymbol{x}_t)$;
- Core variants:
  - BGD: gradient over all data, stable but slow;
  - SGD: gradient from one sample, fast but noisy;
  - MBGD: a balance of the two, and the industry default;
- Key refinements:
  - learning rate decay: lower the rate over time to speed convergence;
  - momentum: add inertia to damp oscillation;
  - Adam: momentum plus adaptive per-parameter learning rates, the most broadly applicable;
- Typical applications: function minimization, linear regression, neural network training, and other numerical optimization problems.
Study suggestions
- Master the basic implementation first and understand the iterative update loop;
- Verify gradient computations on simple functions (e.g. quadratics);
- Study momentum, Adam, and other variants, and compare their convergence behavior;
- Apply gradient descent to real problems (linear regression, logistic regression) to build engineering intuition.
Gradient descent is numerical optimization's "entry