[Deep Learning Networks: From Getting Started to Getting Buried] Backpropagation (backprop)
Personal links
Zhihu: https://www.zhihu.com/people/byzh_rc
CSDN: https://blog.csdn.net/qq_54636039
Note: this article only gives a framework-level walkthrough; for the details, consult other related materials or source code.
Reference articles: various sources
Table of Contents
- [Deep Learning Networks: From Getting Started to Getting Buried] Backpropagation (backprop)
  - Personal links
  - References
  - Background
  - Architecture (formulas)
    - 1. One layer first: linear layer + activation
    - 2. Backprop through an L-layer MLP (delta notation)
    - 3. Batch form
    - 4. Two "must-memorize" simplifications (extremely common)
  - Key contributions
  - Extension - gradient derivation for residual blocks
  - Code - numpy
  - Code - pytorch
References
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature.
Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
Background
Training a neural network is, at its core, minimizing a loss function:
- Given an input $x$, the network outputs $\hat{y}=f(x;\theta)$
- Define a loss $\mathcal{L}(\hat{y}, y)$
- We want to update the parameters $\theta$ so that $\mathcal{L}$ decreases (gradient descent / SGD / Adam)
The question: a deep network has many layers and many parameters, so how do we compute the gradients efficiently?
- Direct numerical differencing (finite differences) is too slow: every parameter has to be perturbed separately (the gradient-check sketch below uses exactly this idea, but only as a correctness test)
- Backpropagation (backprop) uses the chain rule: one backward pass yields the gradients of all parameters
- The cost is roughly: one forward pass + one backward pass ≈ 2× a forward pass (a very good deal)
Intuitively:
The forward pass computes "input → output → loss".
The backward pass sends "how the loss depends on each layer" back, layer by layer.
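As a sanity check on any hand-written backward pass, it is common to compare analytic gradients against central finite differences on a tiny problem. Below is a minimal sketch of that idea; the function `f` and its analytic gradient `grad_f` are made up purely for illustration:
```py
import numpy as np

# Toy objective: f(w) = ||X w - y||^2, whose analytic gradient is 2 X^T (X w - y).
def f(w, X, y):
    r = X @ w - y
    return float(np.sum(r * r))

def grad_f(w, X, y):
    return 2.0 * X.T @ (X @ w - y)

def numerical_grad(fun, w, eps=1e-6):
    """Central finite differences: two function evaluations per parameter."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (fun(w + e) - fun(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
# Maximum difference between analytic and numerical gradients; should be tiny (~1e-7)
print(np.max(np.abs(grad_f(w, X, y) - numerical_grad(lambda v: f(v, X, y), w))))
```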
Architecture (formulas)
Backpropagation is a general algorithm for differentiating through a computational graph.
```text
(h_0 = x)
z_1 = W*h_0 + b   <-> error term δ
h_1 = σ(z_1)
...
logits = z_L
```
Shapes: $x \in \mathbb{R}^{d \times 1}$, $W \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^{m \times 1}$ $\;\Rightarrow\;$ $z, h \in \mathbb{R}^{m \times 1}$
1. One layer first: linear layer + activation
Suppose one layer is:
$$z = Wx + b,\qquad h = \phi(z)$$
If we already have the upstream gradient $\frac{\partial \mathcal{L}}{\partial h}$, then:
1) Backprop through the activation:
$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial h} \odot \frac{\partial h}{\partial z} = \frac{\partial \mathcal{L}}{\partial h} \odot \phi'(z)$$
$\odot$: element-wise multiplication.
The activation acts element-wise, which is why $\odot$ appears here.
2) Backprop through the linear layer:
$$\frac{\partial \mathcal{L}}{\partial W} = \left(\frac{\partial \mathcal{L}}{\partial z}\right)x^\top,\qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z}\frac{\partial z}{\partial b} = \frac{\partial \mathcal{L}}{\partial z},\qquad \frac{\partial \mathcal{L}}{\partial x} = W^\top\left(\frac{\partial \mathcal{L}}{\partial z}\right)$$
These are matrix products, not $\odot$, and the derivative brings in a transpose.
The most important idea here: chain rule → local gradient × upstream gradient.
The gradient at $h$ is passed to $z$, and the gradient at $z$ is passed on to $x$.
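A minimal numpy sketch of these two local rules, in the same column-vector convention; the upstream gradient `dh` is just random here, and ReLU stands in for $\phi$:
```py
import numpy as np

d, m = 4, 3
rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(m, d)), rng.normal(size=(m, 1)), rng.normal(size=(d, 1))

# forward: z = Wx + b, h = phi(z), with phi = ReLU for this sketch
z = W @ x + b
h = np.maximum(0.0, z)

dh = rng.normal(size=(m, 1))   # upstream gradient dL/dh (assumed given)
dz = dh * (z > 0)              # activation backprop: dL/dz = dL/dh ⊙ phi'(z)
dW = dz @ x.T                  # dL/dW = dz x^T  (shape m×d, same as W)
db = dz                        # dL/db = dz
dx = W.T @ dz                  # dL/dx = W^T dz  (passed further upstream)
```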
2. Backprop through an L-layer MLP (delta notation)
Forward:
$$z_l = W_l h_{l-1} + b_l,\qquad h_l=\phi_l(z_l)$$
Define the error term (delta):
$$\delta_l \triangleq \frac{\partial \mathcal{L}}{\partial z_l}$$
Then the standard backprop form is:
- Parameter gradients:
$$\frac{\partial \mathcal{L}}{\partial W_l} = \delta_l\, h_{l-1}^\top,\qquad \frac{\partial \mathcal{L}}{\partial b_l} = \delta_l$$
- Error recursion (from back to front):
$$\delta_{l-1} = \frac{\partial \mathcal{L}}{\partial z_{l-1}} = \frac{\partial \mathcal{L}}{\partial h_{l-1}} \odot \phi'_{l-1}(z_{l-1}) = (W_l^\top \delta_l) \odot \phi'_{l-1}(z_{l-1})$$
This is the most classic way of writing "backpropagation" for an MLP.
3. Batch form
For a batch $X\in\mathbb{R}^{B\times d}$ (one sample per row):
$$Z_l = H_{l-1}W_l + \mathbf{1}b_l,\qquad H_l=\phi(Z_l)$$
Backward:
$$\nabla W_l = H_{l-1}^\top \Delta_l,\qquad \nabla b_l = \sum_{i=1}^{B}\Delta_{l,i},\qquad \Delta_{l-1} = (\Delta_l W_l^\top)\odot \phi'(Z_{l-1})$$
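In code, the three batch-form formulas become three lines; the full numpy implementation at the end of the article follows exactly this pattern. A runnable sketch with arbitrary toy shapes and ReLU as the example activation:
```py
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 8, 4, 3
H_prev = rng.normal(size=(B, d_in))      # H_{l-1}
Z_prev = rng.normal(size=(B, d_in))      # Z_{l-1}, cached from the forward pass
W = rng.normal(size=(d_in, d_out))       # W_l
dZ = rng.normal(size=(B, d_out))         # Δ_l, assumed given from the layer above
dphi = lambda z: (z > 0).astype(float)   # φ' for ReLU

dW = H_prev.T @ dZ                       # ∇W_l = H_{l-1}^T Δ_l        [d_in, d_out]
db = dZ.sum(axis=0, keepdims=True)       # ∇b_l = Σ_i Δ_{l,i}          [1, d_out]
dZ_prev = (dZ @ W.T) * dphi(Z_prev)      # Δ_{l-1} = (Δ_l W_l^T) ⊙ φ'(Z_{l-1})
```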
4. Two "must-memorize" simplifications (extremely common)
(1) Softmax + cross-entropy
If the last-layer logits are $Z$, the probabilities are $P=\text{softmax}(Z)$, and the one-hot labels are $Y$, the cross-entropy is:
$$\mathcal{L} = -\frac{1}{B}\sum_{i,k}Y_{i,k}\log P_{i,k}$$
Then:
$$\frac{\partial \mathcal{L}}{\partial Z} = \frac{1}{B}(P - Y)$$
This is why, in practice, one uses the numerically stable "CE with softmax built in" implementations (e.g. PyTorch's CrossEntropyLoss).
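A quick check of this simplification against PyTorch's autograd; a sketch with arbitrary toy sizes:
```py
import torch
import torch.nn.functional as F

B, K = 5, 4
Z = torch.randn(B, K, requires_grad=True)
y = torch.randint(0, K, (B,))

loss = F.cross_entropy(Z, y)          # softmax + CE, averaged over the batch
loss.backward()

P = torch.softmax(Z, dim=1).detach()
Y = F.one_hot(y, K).float()
print(torch.allclose(Z.grad, (P - Y) / B, atol=1e-6))  # expected: True
```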
(2) Sigmoid + BCE
With binary-classification logits $z$ and probabilities $p=\sigma(z)$, the BCE is:
$$\mathcal{L}=-\frac{1}{B}\sum_i\big[y_i\log p_i+(1-y_i)\log(1-p_i)\big]$$
Then:
$$\frac{\partial \mathcal{L}}{\partial z}=\frac{1}{B}(p-y)$$
This is why, in practice, BCEWithLogitsLoss (feeding logits directly into the loss) is the more stable choice.
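The same kind of check for the BCE case, again a sketch with arbitrary toy values:
```py
import torch
import torch.nn.functional as F

B = 6
z = torch.randn(B, requires_grad=True)
y = torch.randint(0, 2, (B,)).float()

loss = F.binary_cross_entropy_with_logits(z, y)   # sigmoid + BCE, averaged over the batch
loss.backward()

p = torch.sigmoid(z).detach()
print(torch.allclose(z.grad, (p - y) / B, atol=1e-6))  # expected: True
```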
Key contributions
- Makes training deep networks scalable: one backward pass yields all parameter gradients, at roughly 2× the cost of a forward pass, which is very efficient.
- Modularity: each layer only has to implement its local backward. Linear, activation, convolution, and normalization layers all compose through the same chain rule.
- Established the modern deep-learning recipe: backprop + (SGD/Adam) + (mini-batches) became the standard training template. Automatic differentiation (autograd) is essentially backprop, automated.
Extension - gradient derivation for residual blocks
The most common form of a residual block:
$$y = x + F(x;\theta)$$
Let the upstream gradient be:
$$g \triangleq \frac{\partial \mathcal{L}}{\partial y}$$
1) Gradient with respect to the input $x$
At the addition node $y=x+F$, the partial derivative with respect to each branch is 1, so the gradients from the two paths add:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\frac{\partial y}{\partial x} = g\left(1 + \frac{\partial F}{\partial x}\right) = g + g\frac{\partial F}{\partial x}$$
The key point: there is always a direct gradient path carrying $g$ (independent of $F$), which is the intuitive reason residual connections alleviate vanishing gradients.
2) Gradient with respect to the residual-branch parameters $\theta$
$$\frac{\partial \mathcal{L}}{\partial \theta}= \frac{\partial \mathcal{L}}{\partial y}\frac{\partial y}{\partial \theta}= g\cdot \frac{\partial F(x;\theta)}{\partial \theta}$$
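A small PyTorch sketch of the "at least one direct path" claim: even if the residual branch is deliberately dead (here $F(x) = w \odot x$ with $w = 0$), the input still receives the full upstream gradient through the skip connection:
```py
import torch

x = torch.randn(4, requires_grad=True)
w = torch.zeros(4, requires_grad=True)   # residual branch F(x) = w * x, deliberately "dead"

y = x + w * x                            # residual block: y = x + F(x)
g = torch.ones_like(y)                   # pretend the upstream gradient dL/dy is all ones
y.backward(g)

print(x.grad)   # equals g * (1 + w) = all ones: the identity path keeps the gradient alive
```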
Code - numpy
```py
import numpy as np

# =========================
# 1) Utility functions
# =========================
def relu(z):
    """ReLU activation: max(0, z)"""
    return np.maximum(0.0, z)

def relu_grad(z):
    """ReLU derivative: 1 where z > 0, else 0"""
    return (z > 0.0).astype(float)

def softmax(logits):
    """Stable softmax: subtract the row max first to avoid exp overflow"""
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def one_hot(y, K):
    """Convert integer labels y (shape [B]) to one-hot (shape [B, K])"""
    y = np.asarray(y, dtype=int)
    Y = np.zeros((y.shape[0], K), dtype=float)
    Y[np.arange(y.shape[0]), y] = 1.0
    return Y

def cross_entropy(P, Y, eps=1e-12):
    """
    Cross-entropy:
        L = -mean(sum_k Y_k log P_k)
    """
    P = np.clip(P, eps, 1.0)
    return float(-np.mean(np.sum(Y * np.log(P), axis=1)))

# =========================
# 2) Two-layer MLP: D -> H -> K
# =========================
class MLP2:
    """
    Two-layer MLP (to demonstrate hand-written backprop):
        X -> Linear(W1,b1) -> ReLU -> Linear(W2,b2) -> Softmax -> CE
    The point: backward() writes out every gradient by hand.
    """
    def __init__(self, in_dim, hidden_dim, out_dim, seed=0, weight_scale=0.01):
        rng = np.random.default_rng(seed)
        # Parameter init: small random numbers so the initial logits are not too large
        self.W1 = rng.normal(0.0, weight_scale, size=(in_dim, hidden_dim))
        self.b1 = np.zeros((1, hidden_dim), dtype=float)
        self.W2 = rng.normal(0.0, weight_scale, size=(hidden_dim, out_dim))
        self.b2 = np.zeros((1, out_dim), dtype=float)
        # cache: intermediate values from the forward pass, used by backward()
        self.cache = {}

    def forward(self, X):
        """
        Forward pass:
            Z1 = X W1 + b1
            A1 = ReLU(Z1)
            Z2 = A1 W2 + b2
            P  = softmax(Z2)
        """
        Z1 = X @ self.W1 + self.b1   # [B,H]
        A1 = relu(Z1)                # [B,H]
        Z2 = A1 @ self.W2 + self.b2  # [B,K]
        P = softmax(Z2)              # [B,K]
        # save intermediates
        self.cache = {"X": X, "Z1": Z1, "A1": A1, "Z2": Z2, "P": P}
        return P

    def loss(self, X, y):
        """Compute the loss (only used for logging)"""
        P = self.forward(X)
        Y = one_hot(y, P.shape[1])
        return cross_entropy(P, Y)

    def backward(self, y):
        """
        Backward pass: gradients written out by hand.
        Key simplification (softmax + CE):
            dZ2 = (P - Y) / B
        """
        X = self.cache["X"]    # [B,D]
        Z1 = self.cache["Z1"]  # [B,H]
        A1 = self.cache["A1"]  # [B,H]
        P = self.cache["P"]    # [B,K]
        B = X.shape[0]
        K = P.shape[1]
        Y = one_hot(y, K)      # [B,K]
        # ===== output-layer gradient (softmax + CE simplification) =====
        dZ2 = (P - Y) / B                         # [B,K]
        # gradients of W2, b2
        dW2 = A1.T @ dZ2                          # [H,K]
        db2 = np.sum(dZ2, axis=0, keepdims=True)  # [1,K]
        # ===== propagate back to the hidden layer =====
        dA1 = dZ2 @ self.W2.T                     # [B,H]
        dZ1 = dA1 * relu_grad(Z1)                 # [B,H]
        # gradients of W1, b1
        dW1 = X.T @ dZ1                           # [D,H]
        db1 = np.sum(dZ1, axis=0, keepdims=True)  # [1,H]
        return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

    def step(self, grads, lr=0.1):
        """Gradient-descent parameter update"""
        self.W1 -= lr * grads["dW1"]
        self.b1 -= lr * grads["db1"]
        self.W2 -= lr * grads["dW2"]
        self.b2 -= lr * grads["db2"]

    def predict(self, X):
        """Return the predicted class"""
        P = self.forward(X)
        return np.argmax(P, axis=1)

# =========================
# 3) Build a simple 3-class toy dataset and train
# =========================
def make_toy_3class(n=300, seed=0):
    """
    2D three-class Gaussian clusters:
        class 0 around (0, 0)
        class 1 around (2.5, 0)
        class 2 around (1.2, 2.2)
    """
    rng = np.random.default_rng(seed)
    n_per = n // 3
    centers = np.array([[0, 0], [2.5, 0], [1.2, 2.2]], dtype=float)
    X_list, y_list = [], []
    for k in range(3):
        Xk = rng.normal(loc=centers[k], scale=0.5, size=(n_per, 2))
        yk = np.full(n_per, k, dtype=int)
        X_list.append(Xk)
        y_list.append(yk)
    X = np.concatenate(X_list, axis=0)
    y = np.concatenate(y_list, axis=0)
    # shuffle
    idx = rng.permutation(X.shape[0])
    return X[idx], y[idx]

def demo_backprop():
    X, y = make_toy_3class(n=300, seed=42)
    # train/test split
    n_train = 240
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]
    model = MLP2(in_dim=2, hidden_dim=16, out_dim=3, seed=1)
    lr = 0.2
    epochs = 200
    batch_size = 32
    rng = np.random.default_rng(0)
    for epoch in range(1, epochs + 1):
        # mini-batch SGD
        idx = rng.permutation(n_train)
        for s in range(0, n_train, batch_size):
            bidx = idx[s:s + batch_size]
            xb, yb = X_train[bidx], y_train[bidx]
            # 1) forward
            _ = model.forward(xb)
            # 2) backward
            grads = model.backward(yb)
            # 3) step
            model.step(grads, lr=lr)
        if epoch % 20 == 0 or epoch == 1:
            train_loss = model.loss(X_train, y_train)
            pred = model.predict(X_test)
            acc = np.mean(pred == y_test)
            print(f"epoch={epoch:3d} train_loss={train_loss:.4f} test_acc={acc:.3f}")
    print("Done.")

if __name__ == "__main__":
    demo_backprop()
```
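A quick way to trust `MLP2.backward` is the same finite-difference check from the Background section, applied to a single weight. A minimal sketch, assuming the code above is available in the same session:
```py
X, y = make_toy_3class(n=60, seed=3)
model = MLP2(in_dim=2, hidden_dim=8, out_dim=3, seed=0)

model.forward(X)
dW1_analytic = model.backward(y)["dW1"]

eps = 1e-6
i, j = 0, 0                      # check one entry of W1
model.W1[i, j] += eps
L_plus = model.loss(X, y)
model.W1[i, j] -= 2 * eps
L_minus = model.loss(X, y)
model.W1[i, j] += eps            # restore the original value

# analytic vs. central-difference gradient for W1[i, j]; the two numbers should match closely
print(dW1_analytic[i, j], (L_plus - L_minus) / (2 * eps))
```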
Code - pytorch
```py
import torch
import torch.nn as nn
import torch.optim as optim

class TinyMLP(nn.Module):
    def __init__(self, in_dim=2, hidden=16, out_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim)  # outputs logits
        )

    def forward(self, x):
        return self.net(x)

def demo_torch_autograd():
    # make a little toy data
    X = torch.randn(64, 2)
    y = torch.randint(0, 3, (64,))
    model = TinyMLP()
    criterion = nn.CrossEntropyLoss()  # does softmax + CE internally
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(1, 101):
        logits = model(X)
        loss = criterion(logits, y)
        optimizer.zero_grad()
        loss.backward()  # backward pass: autograd computes all gradients
        optimizer.step()
        if epoch % 20 == 0 or epoch == 1:
            print(f"epoch={epoch:3d} loss={loss.item():.4f}")
    print("Done.")

if __name__ == "__main__":
    demo_torch_autograd()
```