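The Mathematical Backbone: Four Pillars That Hold Up Large Models (Linear Algebra, Probability and Statistics, Optimization Theory, Information Theory)

Abstract

Taking a large model into production rests on precise command of four areas of mathematics: linear algebra, probability and statistics, optimization theory, and information theory. This article approaches them through four entry points (parameter counting, the physical meaning of cross-entropy, AdamW with the WSD scheduler, and the information bottleneck) and gives source-level implementations together with postmortems of pitfalls met in production.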

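1. Linear Algebra: Tensor Operations Are the Native Language of Neural Networks

Without a firm grasp of tensor shapes you cannot debug a model, let alone estimate its memory footprint. LLaMA-3 8B uses d_model=4096, n_layers=32, n_heads=32 (with 8 key/value heads under grouped-query attention), and d_ff=14336; the per-layer parameter count directly determines memory use in both training and inference.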
[Figure: parameter-count flow for d_model=4096. The attention block contributes the q/k/v/o projections and the MLP contributes the gate/up/down projections; their sum is the per-layer count, which is multiplied by 32 layers and then added to the embedding and output head to reach the 8.03B total.]

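python
# Source: PyTorch 2.5.0 / torch.nn.Linear (weight shapes only; no layers are built)

# Exact parameter count for LLaMA-3 8B (norm weights omitted as negligible)
d_model, n_layers, d_ff = 4096, 32, 14336
n_heads, n_kv_heads, vocab = 32, 8, 128256
d_kv = d_model // n_heads * n_kv_heads  # GQA: k/v project to 1024 dims, not 4096
# Attention: q/o are d_model x d_model, k/v are d_model x d_kv
attn_params = 2 * d_model * d_model + 2 * d_model * d_kv  # 41.9M
# MLP: gate/up project up, down projects back (SwiGLU)
mlp_params = 3 * d_model * d_ff  # 176.2M
per_layer = attn_params + mlp_params
embed_params = 2 * vocab * d_model  # input embedding + untied output head
total = per_layer * n_layers + embed_params
print(f"per layer {per_layer/1e6:.1f}M, total {total/1e9:.2f}B")
# Output: per layer 218.1M, total 8.03B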

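In numbers: 8.03B parameters occupy about 16 GB as BF16 weights. Training adds the AdamW state (m and v, one FP32 copy each) and the gradients, so the static footprint is 16 (weights) + 64 (optimizer) + 16 (gradients) = 96 GB. That already exceeds a single A100 80G before any activations are stored, which is why pretraining a model of this size is sharded across multiple cards, typically a full node of 8 A100 80G.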
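The arithmetic above can be wrapped in a small helper. This is a minimal sketch that counts static training state only (weights, gradients, AdamW moments); activations, communication buffers, and fragmentation are left out. The function name and the optional FP32 master-weight flag are illustrative assumptions, not part of the original estimate.

```python
def training_state_gb(n_params, weight_bytes=2, grad_bytes=2,
                      optim_bytes_per_moment=4, master_weights=False):
    """Estimate static training memory in GB (1 GB = 1e9 bytes).

    Counts weights, gradients, and the two AdamW moments (m and v).
    Activations and framework overhead are NOT included.
    """
    weights = n_params * weight_bytes
    grads = n_params * grad_bytes
    optim = n_params * optim_bytes_per_moment * 2  # m and v
    # Assumed option: some mixed-precision setups also keep an FP32 copy of the weights
    master = n_params * 4 if master_weights else 0
    total = weights + grads + optim + master
    parts = dict(weights=weights, grads=grads, optimizer=optim,
                 master=master, total=total)
    return {name: size / 1e9 for name, size in parts.items()}


est = training_state_gb(8.03e9)
print({name: round(size, 1) for name, size in est.items()})
```

Dividing `est["total"]` by the per-card capacity gives a lower bound on the number of cards needed before activations are considered.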

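1.1 SVD Compression: The Engineering Limits of Low-Rank Approximation

Low-rank approximation of weight matrices is the mathematical basis of model compression. SVD factors W∈R^(m×n) into UΣV^T; keeping the top-k singular values yields the optimal rank-k approximation (the Eckart-Young theorem guarantees the minimum Frobenius-norm error).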
[Figure: SVD compression flow. The weight matrix W (4096x4096) is decomposed into U (4096x4096), the singular values Σ, and V^T (4096x4096); keeping the top k=256 singular values gives the rank-256 approximation W_k, stored at 256*2/4096 = 12.5% of the original size.]

python
# Source: PyTorch 2.5.0 / torch.linalg.svd
import torch

def svd_compress(weight_matrix, k):
    # SVD: W = U @ diag(S) @ Vt
    U, S, Vt = torch.linalg.svd(weight_matrix, full_matrices=False)
    # Keep the top-k singular values; Eckart-Young makes this the optimal rank-k approximation
    U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]
    weight_approx = U_k @ torch.diag(S_k) @ Vt_k
    # Storage ratio of the two factors: k*(m+n) / (m*n)
    orig = weight_matrix.numel()
    compressed = k * (weight_matrix.size(0) + weight_matrix.size(1))
    return weight_approx, compressed / orig

# Shape of a LLaMA-7B MLP weight, 11008x4096, rank=256.
# A random matrix has no low-rank structure, so the error printed here is a worst case.
w = torch.randn(11008, 4096)
w_k, ratio = svd_compress(w, k=256)
print(f"compression ratio {ratio:.1%}, Frobenius error {torch.linalg.norm(w - w_k).item():.2f}")

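In numbers: after rank-256 SVD compression, LLaMA-7B reportedly loses no more than 0.4 PPL while cutting parameters by 75%. SVD works poorly on the Q/K projections in attention, however: those matrices have a high effective rank and slowly decaying singular values, so hard truncation distorts the attention distribution. The gate/up matrices in the MLP have fast-decaying singular values and compress well.

Caveat: SVD assumes the weights form a static matrix and does not apply to already-quantized weights. The perturbation introduced by quantization changes the singular-value structure, so run SVD first and quantize afterwards; reversing the order amplifies the error by a factor of 2-3.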
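To make the Eckart-Young statement concrete, the sketch below checks numerically that the Frobenius error of a rank-k truncation equals the square root of the sum of the discarded squared singular values, and then picks the smallest rank that retains a given share of spectral energy. The helper name `rank_for_energy` and the 0.99 threshold are illustrative assumptions, and the synthetic matrix is only a stand-in for a real weight.

```python
import torch

torch.manual_seed(0)

def rank_for_energy(S, energy=0.99):
    """Smallest k whose top-k squared singular values hold `energy` of the total."""
    cum = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    k = int((cum < energy).sum().item()) + 1
    return min(k, S.numel())

# Synthetic low-rank-plus-noise matrix (a stand-in, not a real model weight)
m, n, true_rank = 256, 128, 8
W = torch.randn(m, true_rank, dtype=torch.float64) @ torch.randn(true_rank, n, dtype=torch.float64)
W = W + 0.01 * torch.randn(m, n, dtype=torch.float64)

U, S, Vt = torch.linalg.svd(W, full_matrices=False)
k = 8
W_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

# Eckart-Young: ||W - W_k||_F equals sqrt(sum of the discarded sigma_i^2)
err = torch.linalg.norm(W - W_k)            # Frobenius norm for a 2-D input
tail = torch.sqrt(torch.sum(S[k:] ** 2))
print(f"truncation error {err.item():.4f}, tail energy {tail.item():.4f}")
print("rank for 99% energy:", rank_for_energy(S))
```

Running the same energy check on each projection of a real checkpoint is one way to see which matrices decay fast enough to be worth truncating.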

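2. Probability and Statistics: Cross-Entropy Is the North Star of LLMs

The cross-entropy loss L = -Σ_t log P(x_t|x_<t) is the objective that language-model pretraining minimizes. Understanding its physical meaning is what makes training anomalies diagnosable.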
[Figure: cross-entropy decomposition. CE = entropy H(P) + KL divergence D_KL, where H(P) is the uncertainty inherent in the data and D_KL is the gap between the model and the true distribution. Training minimizes D_KL, so a falling CE means the model is approaching the true distribution. exp(CE) is the perplexity (PPL), which is comparable across models only when the tokenizer and evaluation set are the same.]
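The decomposition of cross-entropy into entropy plus KL divergence follows in one line from the definitions. Here P denotes the true data distribution and Q the model distribution; both symbols are assumed for this derivation.

```latex
\begin{aligned}
\mathrm{CE}(P, Q) &= -\sum_x P(x)\,\log Q(x) \\
&= -\sum_x P(x)\,\log P(x) + \sum_x P(x)\,\log\frac{P(x)}{Q(x)} \\
&= H(P) + D_{\mathrm{KL}}(P \,\|\, Q)
\end{aligned}
```

Because H(P) does not depend on the model, minimizing CE over Q is the same as minimizing the KL term.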

python
# Source: PyTorch 2.5.0 / torch.nn.functional.cross_entropy
import torch
import torch.nn.functional as F

def compute_ppl(logits, labels):
    """Compute perplexity from logits, for training monitoring."""
    # logits: [batch, seq, vocab], labels: [batch, seq]
    # Position t predicts token t+1, so logits[:, :-1] align with labels[:, 1:]
    shift_logits = logits[..., :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[..., 1:].reshape(-1)
    # Cross-entropy: mean negative log-likelihood (in nats)
    ce_loss = F.cross_entropy(shift_logits, shift_labels)
    # PPL = exp(CE), an intuitive measure of how "perplexed" the model is
    return torch.exp(ce_loss).item()

# Baseline: with vocab=128K a uniform random model has CE=ln(128000)=11.76, PPL=128000
# LLaMA-3 8B on general text: CE≈2.5, PPL≈12.2
# Human-level language modeling: PPL<5, close to the entropy floor

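In numbers: with vocab=128K a uniform random model has CE=11.76 (ln(vocab)) and PPL=128000. LLaMA-3 8B on general text reaches CE≈2.5, PPL≈12.2. Human level is PPL<5. If CE stalls or rebounds during training, suspect data quality or the learning rate: the trend is a more reliable diagnostic signal than the absolute loss value.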
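The baseline figures can be reproduced with the standard library alone. The sketch below assumes natural-log probabilities (nats); the function names are illustrative.

```python
import math

def uniform_baseline(vocab_size):
    """CE (in nats) and PPL of a model that gives every token equal probability."""
    ce = math.log(vocab_size)
    return ce, math.exp(ce)

def perplexity(token_log_probs):
    """PPL from the natural-log probabilities assigned to the observed tokens."""
    ce = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(ce)

ce, ppl = uniform_baseline(128_000)
print(f"uniform baseline: CE={ce:.2f}, PPL={ppl:.0f}")
# A model that averages 2.5 nats per token
print(f"PPL at CE=2.5: {perplexity([-2.5] * 4):.1f}")
```

A freshly initialized model should start near the uniform baseline; a first logged loss far above ln(vocab) usually points to a label or shift bug rather than a modeling problem.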

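2.1 Probabilistic Bias in LLM-as-Judge

When a large model serves as the evaluation judge, three systematic biases appear: a preference for longer answers (+20%), self-preference (+8%), and position bias (+10%). Their mathematical root is miscalibration of the judge model's generation probabilities.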

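python
# Source: LLM-as-Judge debiasing practice / 2024
def debiased_judge_scores(scores_a, scores_b, length_ratio, length_penalty=0.2):
    """Remove position bias by averaging two judgments made in opposite orders."""
    # scores_a: score with A shown first; scores_b: score with A shown second
    # length_ratio: normalized answer length (assumed to be supplied by the caller)
    final = 0.5 * (scores_a + scores_b)
    # Length bias: penalize in proportion to answer length (0.2 is an empirical value)
    final = final - length_penalty * length_ratio
    return final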

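In numbers: an ensemble of judges (three different models voting) can bring the bias from 20% down to under 5%. Pairwise comparison beats absolute scoring because a relative judgment cancels absolute calibration error.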
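One way to combine the two ideas, pairwise comparison with swapped order and a multi-judge vote, is sketched below. The `judge(first, second)` callable is hypothetical and stands in for a call to a judge model; the toy judges exist only to make the example runnable.

```python
from collections import Counter

def pairwise_verdict(judge, answer_a, answer_b):
    """Ask one judge twice with the order swapped; return 'A', 'B', or 'tie'.

    `judge(first, second)` is a hypothetical callable returning 'first' or 'second'.
    """
    forward = judge(answer_a, answer_b)    # A shown first
    backward = judge(answer_b, answer_a)   # B shown first
    a_wins = (forward == "first") + (backward == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the verdict flipped with position, so treat it as position bias

def ensemble_verdict(judges, answer_a, answer_b):
    """Majority vote over several judges; an even split stays 'tie'."""
    votes = Counter(pairwise_verdict(j, answer_a, answer_b) for j in judges)
    a, b = votes["A"], votes["B"]
    return "A" if a > b else "B" if b > a else "tie"

# Toy judges for illustration only
def prefers_longer(first, second):
    return "first" if len(first) >= len(second) else "second"

def always_first(first, second):
    return "first"  # pure position bias

print(ensemble_verdict([prefers_longer, always_first, prefers_longer],
                       "a detailed answer", "short"))
```

The purely position-biased judge contributes a tie rather than a vote, which is the point of the swap. Length bias is not removed by this scheme, as the toy `prefers_longer` judge shows.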

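3. Optimization Theory: AdamW and the WSD Scheduler

The root reason plain SGD fails on large models is that the condition number of the loss landscape (the ratio of the largest to the smallest Hessian eigenvalue) can reach 10³-10⁴. AdamW uses a second-moment estimate to adapt the step size of each parameter, reducing the effect of the condition number by roughly 100x.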
[Figure: AdamW update flow. The gradient g_t feeds the first moment m_t (momentum) and the second moment v_t (gradient variance); both are bias-corrected to m_hat and v_hat; the update lr * m_hat / sqrt(v_hat) gives a per-parameter adaptive step size, reducing the effect of the condition number by about 100x.]
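The update rule can be written out for a single scalar parameter. This is a minimal sketch of the standard AdamW step with decoupled weight decay, not a replacement for the library optimizer; the function name is illustrative.

```python
import math

def adamw_step(p, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter; t is the 1-based step index."""
    p = p * (1 - lr * weight_decay)          # decoupled weight decay
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (uncentered)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Two parameters whose gradients differ by 1000x take nearly the same first step
for g in (1e-3, 1.0):
    p, m, v = adamw_step(p=0.0, g=g, m=0.0, v=0.0, t=1)
    print(f"grad={g:g} -> step={-p:.6f}")
```

Dividing by sqrt(v_hat) normalizes away the gradient scale, which is how the per-parameter step size becomes insensitive to badly scaled directions.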

python
# Source: PyTorch 2.5.0 / torch.optim.AdamW
import math
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(8, 8)  # placeholder; substitute the real model

# Standard LLaMA-family configuration
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),    # beta2=0.95 instead of the default 0.999, to cope with high gradient variance
    weight_decay=0.1,      # decoupled weight decay, applied separately from the gradient update
    eps=1e-8
)

class WSDScheduler:
    """WSD: Warmup-Stable-Decay; lets training continue on added data at any time."""
    def __init__(self, optimizer, warmup=2000, total=500000):
        self.opt = optimizer
        self.warmup = warmup
        self.stable_end = int(total * 0.9)  # stable phase runs until 90% of the steps
        self.total = total
        self.base_lr = optimizer.defaults['lr']

    def get_lr(self, step):
        if step < self.warmup:
            return self.base_lr * step / self.warmup
        if step < self.stable_end:
            return self.base_lr  # constant lr during the stable phase
        # Decay phase: cosine annealing to 0; clamp so lr stays at 0 past `total`
        progress = min(1.0, (step - self.stable_end) / (self.total - self.stable_end))
        return self.base_lr * 0.5 * (1 + math.cos(math.pi * progress))

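In numbers: the gradient norm of LLaMA-7B normally sits in the range 1-3 and exceeds 10 during a spike. One team saw LLaMA-2 training diverge under SGD and recover after switching to AdamW: the gradient variance in the MLP is 5-10x that in attention, and a single global SGD step size cannot serve both. The advantage of WSD over cosine is that the lr is constant during the stable phase, so new data can be added and training continued at any point without re-planning the schedule.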
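The claim that WSD tolerates a changed training budget can be checked directly. The sketch below expresses both schedules as multipliers of the base learning rate; the function names and the `decay_frac` parameter are illustrative assumptions.

```python
import math

def wsd_factor(step, warmup=2000, total=500_000, decay_frac=0.1):
    """LR multiplier in [0, 1] for a Warmup-Stable-Decay schedule."""
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return step / warmup
    if step < decay_start:
        return 1.0
    progress = min(1.0, (step - decay_start) / (total - decay_start))
    return 0.5 * (1 + math.cos(math.pi * progress))

def cosine_factor(step, warmup=2000, total=500_000):
    """Plain warmup followed by cosine decay, for comparison."""
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * (1 + math.cos(math.pi * progress))

# Doubling the planned budget leaves the WSD stable phase unchanged,
# while the cosine value at the same step shifts.
step = 300_000
for total in (500_000, 1_000_000):
    print(total, round(wsd_factor(step, total=total), 4),
          round(cosine_factor(step, total=total), 4))
```

A multiplier of this shape can be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, which scales the optimizer's base lr by the returned factor at each scheduler step.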

3.1 Handling a Loss Spike

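python
# Source: large-model training incident postmortem / 2024
# load_checkpoint and log_spike are project-specific helpers passed in by the caller.
def handle_loss_spike(model, optimizer, grad_norm, step, batch_hash,
                      load_checkpoint, log_spike, threshold=10.0):
    """Automatic response when the gradient norm exceeds the threshold.

    Returns True when the caller should skip the current batch.
    """
    if grad_norm <= threshold:
        return False
    # 1. Roll back to the checkpoint saved 100 steps earlier
    load_checkpoint(model, optimizer, step - 100)
    # 2. Cut the learning rate by 10x
    for pg in optimizer.param_groups:
        pg['lr'] *= 0.1
    # 3. Record the event for later analysis
    log_spike(step, grad_norm, batch_hash)
    # 4. Tell the caller to skip this batch (it may contain dirty data)
    return True

def monitor_gradient_norm(model):
    """Compute the global L2 gradient norm for live monitoring during training."""
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.detach().norm(2).item() ** 2
    return total_sq ** 0.5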

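Caveat: the threshold of 10.0 is empirical and varies widely across architectures. In MoE models, uneven expert load makes the normal fluctuation of the gradient norm 3-5x larger than in dense models, so the threshold should be raised to 30-50.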
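Spike detection is often paired with gradient clipping, which bounds the damage from a single bad step. The sketch below uses `torch.nn.utils.clip_grad_norm_`, whose return value is the total norm measured before clipping and can therefore feed the same monitoring. The tiny model, the inflated loss, and `max_norm=1.0` are assumptions chosen only to make the effect visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)              # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x, y = torch.randn(8, 16), torch.randn(8, 4)
# Output is scaled up on purpose to force a large gradient
loss = nn.functional.mse_loss(model(x) * 100, y)
loss.backward()

max_norm = 1.0
# Returns the total norm measured BEFORE clipping
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
post_clip_norm = torch.sqrt(sum(p.grad.norm(2) ** 2 for p in model.parameters()))
print(f"before {pre_clip_norm.item():.2f}, after {post_clip_norm.item():.2f}")

optimizer.step()
optimizer.zero_grad()
```

Clipping handles isolated outliers. A pre-clip norm that stays above the spike threshold for many consecutive steps is still a reason to roll back.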

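4. Information Theory: Understanding Layer Differences Through Mutual Information

Information bottleneck theory offers an account of why different Transformer layers capture different information: mutual information in shallow layers concentrates on syntax, in middle layers on semantics, and in deep layers on reasoning. Mutual information peaks in the second- and third-to-last layers, which is why representations are usually extracted from those layers at inference time.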
[Figure: information flow across layers. The input x passes through shallow layers (high mutual information with syntax), middle layers (semantics), and deep layers (reasoning); total mutual information peaks in the second- and third-to-last layers, the best place to extract representations.]

python
# Source: Transformer layer analysis under the information bottleneck framework / 2023
import math
import torch
import torch.nn.functional as F

def mutual_information_estimator(h_x, h_y, temperature=0.07):
    """InfoNCE lower bound on mutual information, for layer-wise representation analysis."""
    # h_x, h_y: [batch, dim], two views of the same samples (row i pairs with row i)
    h_x = F.normalize(h_x, dim=-1)
    h_y = F.normalize(h_y, dim=-1)
    # Similarity matrix
    sim = h_x @ h_y.T / temperature
    # InfoNCE: -log(positive pair / all pairs)
    labels = torch.arange(h_x.size(0), device=h_x.device)
    loss = F.cross_entropy(sim, labels)
    # MI lower bound = -loss + log(batch_size)
    mi_lower_bound = -loss + math.log(h_x.size(0))
    return mi_lower_bound.item()

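Quantification: the InfoNCE temperature τ=0.07 is an empirical value from contrastive learning. If τ is too large, the softmax over similarities degenerates toward a uniform distribution; if τ is too small, the loss attends only to the hardest negatives and gradient variance blows up. In distillation, the loss α·CE + (1-α)·τ²·KL(teacher||student) is commonly configured with α=0.5 and τ=2.0; the temperature stretches and softens the teacher distribution.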
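
The distillation loss above can be sketched in plain Python to see what the temperature does. This is a minimal illustration, not a training-ready implementation: the helper names `softmax` and `distill_loss` and the example logits are assumptions made for this sketch.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature softmax, computed stably by subtracting the max logit."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, label, alpha=0.5, tau=2.0):
    """alpha * CE(hard label) + (1 - alpha) * tau^2 * KL(teacher || student)."""
    # Hard-label cross-entropy uses the student's ordinary (tau = 1) distribution
    ce = -math.log(softmax(student_logits)[label])
    # The KL term compares both distributions softened by the same temperature
    p_teacher = softmax(teacher_logits, tau)
    p_student = softmax(student_logits, tau)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    return alpha * ce + (1 - alpha) * tau ** 2 * kl

teacher = [4.0, 1.0, 0.0]  # hypothetical teacher logits
print("teacher max prob at tau=1:", max(softmax(teacher, tau=1.0)))
print("teacher max prob at tau=2:", max(softmax(teacher, tau=2.0)))
print("loss:", distill_loss([2.0, 1.0, 0.5], teacher, label=0))
```

The τ² factor compensates for the 1/τ² shrinkage of the soft-target gradients, so the two terms stay on a comparable scale as τ changes.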

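4.1 The implicit KL constraint in DPO

In the DPO loss L = -log σ(β(log[π(y_w|x)/π_ref(y_w|x)] - log[π(y_l|x)/π_ref(y_l|x)])), the parameter β controls the KL distance from the reference model. At β=0.1 the implicit KL is about 5-10 nats. If β is too large the model barely updates; if β is too small the model drifts into reward hacking.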

python
# Source: original DPO implementation / Direct Preference Optimization 2023
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss with its implicit KL constraint. Inputs are sequence log-probabilities."""
    # Log-probability ratios: log(π/π_ref)
    chosen_ratio = policy_chosen - ref_chosen      # [batch]
    rejected_ratio = policy_rejected - ref_rejected
    # β controls how strongly the policy may deviate from the reference model
    logits = beta * (chosen_ratio - rejected_ratio)
    # loss = -log sigmoid(logits)
    loss = -F.logsigmoid(logits).mean()
    return loss

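Quantification: β=0.1 is an empirical value used with the LLaMA series. One team set β=0.01 and the model overfit the preference data, so generation quality went down instead of up. Without understanding what the KL constraint means physically, DPO is hard to tune.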
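
To make the role of β concrete, the per-pair loss can be evaluated for a fixed log-ratio margin. This is a minimal sketch in plain Python; the function names and the margin of 5 nats are assumptions chosen only for illustration.

```python
import math

def dpo_loss_single(margin, beta):
    """-log sigmoid(beta * margin), written as a numerically stable softplus(-beta * margin)."""
    z = beta * margin
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def gradient_weight(margin, beta):
    """Magnitude of d(loss)/d(margin) = beta * sigmoid(-beta * margin)."""
    return beta / (1.0 + math.exp(beta * margin))

margin = 5.0  # hypothetical (chosen - rejected) log-ratio margin, in nats
for beta in (0.01, 0.1, 1.0):
    print(f"beta={beta:<5} loss={dpo_loss_single(margin, beta):.4f} "
          f"grad_weight={gradient_weight(margin, beta):.4f}")
```

With a small β the loss stays close to log 2 even at a sizeable margin, so the optimizer can only reduce it by pushing the policy's log-ratios far from the reference, which is consistent with the overfitting described above.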

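5. Putting the math into engineering practice

Three gaps separate mathematical theory from engineering practice: memory estimation, numerical precision, and distributed synchronization.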
[Diagram: parameter-count estimate → weight memory: N * dtype_bytes (7B in BF16 = 14 GB); KV cache: 2 * n_layers * d_model * seq * batch * dtype_bytes (seq=4096, batch=64: about 137 GB) → total inference memory: about 151 GB → does not fit on a single 80 GB card]

python
# Source: engineering formula for LLM memory estimation / 2024
def estimate_inference_memory(params_b, dtype_bytes=2, seq_len=4096, batch=1):
    """Estimate inference memory in GB (params_b is the parameter count in billions)."""
    n_layers, d_model = 32, 4096  # 7B configuration
    # 1. Weights: billions of parameters * bytes per parameter = GB
    weight_mem = params_b * dtype_bytes
    # 2. KV cache: 2 (K and V) * n_layers * d_model * seq * batch * dtype_bytes
    kv_cache = 2 * n_layers * d_model * seq_len * batch * dtype_bytes / 1e9
    # 3. Activations (roughly 10-20% of the weights)
    activation = weight_mem * 0.15
    total = weight_mem + kv_cache + activation
    return {
        'weight_gb': weight_mem,
        'kv_cache_gb': kv_cache,
        'activation_gb': activation,
        'total_gb': total
    }

# 7B BF16, seq=4096, batch=64
mem = estimate_inference_memory(7, dtype_bytes=2, seq_len=4096, batch=64)
# weight=14 GB, kv_cache≈137.4 GB, activation≈2.1 GB, total≈153.5 GB -> far beyond a single 80 GB card

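Quantification: for a 7B model at seq=4096 and batch=64, the formula 2 * n_layers * d_model * seq * batch * dtype_bytes gives a KV cache of about 137 GB in BF16 (0.5 MiB per token), roughly 10 times the 14 GB of weights. This is why long-context inference has to use KV cache quantization or PagedAttention. PagedAttention allocates the cache in fixed-size blocks and cuts KV cache memory fragmentation from 60% to 5%, while FlashAttention tiles the attention computation to reduce HBM reads and writes.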
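
The KV cache arithmetic can be checked with a few lines of plain Python, including the effect of grouped-query attention. This is a sketch under stated assumptions: `kv_cache_gb` is a hypothetical helper, head_dim=128 is taken as d_model / n_heads = 4096 / 32, and the 8-KV-head case mirrors the LLaMA-3 8B configuration mentioned elsewhere in this article.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache size in GB (1e9 bytes): K and V for every layer, KV head, and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token_bytes * seq_len * batch / 1e9

# Multi-head attention: every one of the 32 heads keeps its own K and V
mha = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=64)
# Grouped-query attention: only 8 KV heads are cached
gqa = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096, batch=64)
print(f"MHA: {mha:.1f} GB, GQA: {gqa:.1f} GB")  # MHA: 137.4 GB, GQA: 34.4 GB
```

Cutting the number of KV heads from 32 to 8 shrinks the cache by exactly 4x, which is independent of, and stacks with, cache quantization.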

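5.1 Numerical limits of BF16 and FP8

BF16 is the standard format for large-model training: 8 exponent bits (the same dynamic range as FP32) plus 7 mantissa bits (low precision, but sufficient). FP8 needs block-wise quantization, with one scaling factor per group of 128 elements; otherwise values overflow the narrow range and error accumulates.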

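python
# Source: NVIDIA Transformer Engine / FP8 training practice
# BF16:     range ±3e38,  relative precision 2^-7 ≈ 0.008
# FP8 E4M3: range ±448,   relative precision 2^-3 = 0.125
# FP8 E5M2: range ±57344, relative precision 2^-2 = 0.25

# Block-wise quantization for FP8 training
def fp8_quantize(tensor, block_size=128):
    """Block-wise quantization with an independent scale per block.

    tensor.numel() must be divisible by block_size.
    """
    shape = tensor.shape
    blocks = tensor.reshape(-1, block_size)
    # Map each block's max to the E4M3 limit of 448; the clamp avoids dividing by zero on all-zero blocks
    scale = (blocks.abs().max(dim=-1, keepdim=True).values / 448.0).clamp_min(1e-12)
    # Quantize, then dequantize; round() is a coarse stand-in for a real FP8 cast
    quantized = (blocks / scale).round().clamp(-448, 448)
    return (quantized * scale).reshape(shape), scale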

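Boundary: BF16 has only 7 mantissa bits, so error becomes significant once more than about 128 values are accumulated. This is why softmax and layer norm under BF16 accumulate in FP32 and then cast back to BF16: accumulating directly in BF16 loses precision. The loss-scaling factor in FP8 training has to be adjusted dynamically; a static factor overflows or underflows at different stages of training.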
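
The accumulation problem can be reproduced without a GPU by emulating BF16 rounding in plain Python. This is a minimal sketch: `to_bf16` is a hypothetical helper that rounds to nearest-even on the float32 bit pattern and does not handle NaN or infinity, and Python's double-precision `sum` stands in for an FP32 accumulator.

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias applied to the 16 discarded bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def sum_bf16(values):
    """Accumulate with the running sum rounded to BF16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc

def sum_wide_then_cast(values):
    """Accumulate in higher precision and round to BF16 once at the end."""
    return to_bf16(sum(to_bf16(v) for v in values))

ones = [1.0] * 1000
print(sum_bf16(ones))            # 256.0
print(sum_wide_then_cast(ones))  # 1000.0
```

In this toy case the BF16 running sum stalls at 256, because 257 falls halfway between two representable values and rounds back down; the exact point where error shows up depends on the data and the rounding mode.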

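6. The matrix-partitioning math of tensor parallelism

The core of tensor parallelism is splitting a large matrix multiplication across multiple cards. Combining column parallelism (splitting along the output dimension) with row parallelism (splitting along the input dimension) means the forward pass needs only one All-Reduce.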
[Diagram: input X → column parallel: W split along d_out → card 0: X*W0, card 1: X*W1 → after the row-parallel layer, partial results Y0 and Y1 → All-Reduce: Y0+Y1 → full output Y]

python
# Source: Megatron-LM / tensor_parallel.py
import torch
import torch.distributed as dist

def column_parallel_linear(x, weight, bias, world_size, rank):
    """Column parallel: weight is split along the output dimension; each card holds a slice of the outputs."""
    # Full weight: [d_out, d_in]; after the split each card holds [d_out/world_size, d_in]
    out_local = x @ weight.t()  # [batch, d_out/world_size]
    if bias is not None:
        out_local = out_local + bias  # bias is sharded too, so it is not added twice
    return out_local

def row_parallel_linear(x, weight, world_size, rank):
    """Row parallel: weight is split along the input dimension; the output needs an All-Reduce."""
    # Full weight: [d_out, d_in]; after the split each card holds [d_out, d_in/world_size]
    out_local = x @ weight.t()  # [batch, d_out], a partial sum
    # Sum across cards to obtain the full output
    dist.all_reduce(out_local, op=dist.ReduceOp.SUM)
    return out_local

# Classic MLP combination: column parallel (gate+up) -> activation -> row parallel (down)
# The forward pass needs a single All-Reduce; its volume is batch*seq*d_model, independent of the card count

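Quantification: with 8-way tensor parallelism, per-card memory for a 7B model drops from 96 GB to 12 GB. Communication takes 15-25% of total training time, and the share grows with the number of cards. This is why TP usually does not exceed 8; beyond that, pipeline parallelism is combined with data parallelism.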
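
Why the column-then-row split needs only one reduction can be verified on a toy MLP in plain Python. This is a sketch with made-up 2x4 and 4x2 matrices and ReLU as the activation; the two "cards" are simulated by slicing lists, and the final addition stands in for the All-Reduce.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X = [[1.0, -2.0], [0.5, 3.0]]                            # [batch=2, d_model=2]
A = [[1.0, -1.0, 2.0, 0.0], [0.5, 1.0, -1.0, 3.0]]       # up projection [d_model=2, d_ff=4]
B = [[1.0, 0.0], [2.0, -1.0], [0.0, 1.0], [-1.0, 0.5]]   # down projection [d_ff=4, d_model=2]

reference = matmul(relu(matmul(X, A)), B)                # unsplit result

# Column-parallel split of A (by output columns), row-parallel split of B (by input rows)
A0, A1 = [row[:2] for row in A], [row[2:] for row in A]
B0, B1 = B[:2], B[2:]
partial0 = matmul(relu(matmul(X, A0)), B0)               # computed on "card 0"
partial1 = matmul(relu(matmul(X, A1)), B1)               # computed on "card 1"
combined = add(partial0, partial1)                       # stands in for the single All-Reduce

print(reference)  # [[0.0, 4.0], [-2.0, 2.0]]
print(combined)   # [[0.0, 4.0], [-2.0, 2.0]]
```

The element-wise activation is what makes this work: each card can apply it to its own slice of the hidden dimension without seeing the other slices, so no communication is needed between the two matmuls.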

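Boundary: after the column-parallel split, the Q/K/V heads of Attention are spread across cards. Under GQA each card must receive a whole number of KV heads; otherwise the All-Reduce communication pattern breaks. LLaMA-3 8B has 32 heads and 8 KV heads, so at TP=8 each card gets 4 Q heads paired with 1 KV head, an exact division.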
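
The divisibility requirement is easy to encode as a pre-flight check. The sketch below uses a hypothetical helper `partition_heads`; it simply rejects uneven splits and does not model the KV-head replication that some frameworks apply when the TP degree exceeds the number of KV heads.

```python
def partition_heads(n_q_heads, n_kv_heads, tp_size):
    """Return (q_heads_per_rank, kv_heads_per_rank), or raise if the split is uneven."""
    if n_q_heads % tp_size != 0:
        raise ValueError(f"{n_q_heads} Q heads cannot be split evenly across {tp_size} ranks")
    if n_kv_heads % tp_size != 0:
        raise ValueError(f"{n_kv_heads} KV heads cannot be split evenly across {tp_size} ranks")
    return n_q_heads // tp_size, n_kv_heads // tp_size

print(partition_heads(32, 8, 8))  # (4, 1)
print(partition_heads(32, 8, 4))  # (8, 2)
```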

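7. The complex-rotation math of RoPE

Rotary position embedding (RoPE) folds absolute position information into Q/K through complex rotation, so the inner product naturally depends on relative position and decays with distance. Its mathematical core is Euler's formula e^(iθ) = cos θ + i sin θ.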
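
The relative-position property can be checked numerically on a single 2-D feature pair treated as one complex number. This is a minimal sketch using Python's `cmath`; the values of `q`, `k`, and `theta` are arbitrary and chosen only for illustration.

```python
import cmath

def rotate(vec, pos, theta):
    """Rotate one complex feature pair by the angle pos * theta."""
    return vec * cmath.exp(1j * pos * theta)

def score(q, k, m, n, theta):
    """Real inner product of the rotated pair: Re(q_m * conj(k_n))."""
    return (rotate(q, m, theta) * rotate(k, n, theta).conjugate()).real

q, k, theta = complex(0.3, -1.2), complex(0.8, 0.5), 0.01
a = score(q, k, m=10, n=4, theta=theta)        # offset 6
b = score(q, k, m=1006, n=1000, theta=theta)   # offset 6, both positions shifted by 1000
c = score(q, k, m=10, n=5, theta=theta)        # offset 5
print(a, b, c)
```

Here `a` and `b` agree up to floating-point error while `c` differs, because conjugating the key turns the two rotations into a single factor e^(i(m-n)θ).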
[Diagram: query q at position m → rotation angle m*θ_i → q_m * e^(imθ); key k at position n → rotation angle n*θ_i → k_n * e^(inθ); inner product = Re(q * conj(k) * e^(i(m-n)θ)) → depends only on the relative position m-n]

python
# Source: RoFormer / Rotary Position Embedding 2021
import torch

def rotate_half(x):
    """Negate the second half and move it to the front; together with cos/sin this implements the complex product."""
    x1, x2 = x[..., :x.size(-1)//2], x[..., x.size(-1)//2:]
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    """Apply rotary position embedding to q and k."""
    # cos, sin: the [seq, d_head/2] tables concatenated with themselves to [seq, d_head]
    # q_rot = q*cos + rotate_half(q)*sin
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

def precompute_freqs_cis(dim, end, theta=10000.0):
    """Precompute rotation frequencies: each dimension pair gets its own base frequency."""
    # frequency_j = theta^(-2j/dim) for j = 0, 1, ..., dim/2 - 1
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end, dtype=torch.float32)
    freqs = torch.outer(t, freqs)  # [seq, dim/2]
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex form: e^(i * t * freq)
    return freqs_cis

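Quantification: LLaMA-3 raises the RoPE base from 10000 to 500000. NTK-aware scaling pulls the same lever at inference time and is reported to give 2-4x context extrapolation without retraining. A larger base lowers the rotation frequencies θ_i = base^(-2i/d), so phases wrap around more slowly and long-distance attention retains more signal. YaRN goes further and treats dimensions in segments: high-frequency (short-range) dimensions are left unscaled and low-frequency (long-range) dimensions are interpolated, with a PPL increase of only 0.3 at 128K context.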
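
The effect of the base on the frequency spectrum can be made concrete by listing wavelengths. This is a small sketch in plain Python; `rope_wavelengths` is a hypothetical helper and head_dim=128 is an assumed per-head dimension.

```python
import math

def rope_wavelengths(head_dim, base):
    """Wavelength in tokens of each rotary frequency theta_i = base^(-2i/head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

for base in (10_000.0, 500_000.0):
    w = rope_wavelengths(128, base)
    print(f"base={base:>9.0f}  shortest={w[0]:.2f} tokens  longest={w[-1]:,.0f} tokens")
```

The shortest wavelength stays at 2π tokens for any base, while the longest one grows with the base, which is why a larger base leaves more dimensions that have not yet completed a full rotation at long distances.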

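Boundary: RoPE extrapolation is not unlimited. After the base is raised to 500000, accuracy on short-text tasks can drop: over-retained high-frequency components flatten the attention distribution at short range. This calls for a trade-off between long-context and short-text tasks, or for a dynamic base adjusted to the actual seq_len.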

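8. Advanced numerical-stability traps

Besides BF16 accumulation error, three more numerical traps are easy to miss: logsumexp overflow, zero variance in layer norm, and gradient underflow.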
[Diagram: numerical traps → logsumexp: exp(1000) = inf → subtract the max: logΣexp(x) = m + logΣexp(x-m); LayerNorm: variance = 0 causes division by zero → add epsilon: 1e-5; gradient underflow: smallest FP16 subnormal ≈ 6e-8 → loss scaling: multiply the loss by 2^16 before the backward pass]

python
# Source: PyTorch 2.5.0 / torch.nn.functional
import torch
import torch.nn.functional as F

# Trap 1: computing softmax directly overflows
def naive_softmax(x):
    # Wrong: exp(1000) = inf, so the result becomes inf/inf = NaN
    e = torch.exp(x)
    return e / e.sum(dim=-1, keepdim=True)

def stable_softmax(x):
    # Correct: subtract the row max first for numerical stability
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

# Trap 2: LayerNorm divides by zero when the variance is zero
def stable_layer_norm(x, eps=1e-5):
    mean = x.mean(dim=-1, keepdim=True)
    # unbiased=False gives the population variance, the same estimator torch.nn.LayerNorm uses
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    # eps keeps the denominator away from zero; this article recommends 1e-4 under BF16
    return (x - mean) / torch.sqrt(var + eps)

# Trap 3: FP16 gradient underflow
def gradient_scaling(loss, model, scale=2**16):
    """Loss scaling: enlarge the loss so gradients do not underflow, then divide them back after backward."""
    scaled_loss = loss * scale
    scaled_loss.backward()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(scale)  # restore the true gradient

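Quantification: the smallest positive FP16 value is about 6e-8 (the smallest subnormal; the smallest normal number is about 6e-5), and many gradients are naturally smaller than that and underflow to 0. After the loss is scaled by 2^16, the effective gradient floor extends to about 1e-12, covering 99% of real gradients. BF16 has the same dynamic range as FP32 (smallest normal about 1e-38) and does not need loss scaling, which is the key advantage that led BF16 to replace FP16.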
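
The underflow and its fix can be reproduced with the half-precision format code in Python's `struct` module. This is a minimal sketch: `to_fp16` is a hypothetical helper, and the gradient value 1e-9 is an arbitrary example below the FP16 floor.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-9
scale = 2.0 ** 16

unscaled = to_fp16(true_grad)        # below the smallest FP16 subnormal, so it becomes 0.0
scaled = to_fp16(true_grad * scale)  # about 6.55e-5, representable in FP16
recovered = scaled / scale           # unscale in higher precision

print(unscaled)   # 0.0
print(recovered)  # close to 1e-9
```

The unscaling step has to happen in a wider format (FP32 master gradients in practice); dividing inside FP16 would underflow again.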

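Boundary: under BF16 the LayerNorm eps should be raised from 1e-5 to 1e-4. One model produced NaN from LayerNorm during long-sequence training; investigation showed that when the variance was extremely small (<1e-6), eps was not large enough to keep the denominator safe. RMSNorm drops the mean subtraction and is numerically more stable than LayerNorm, which is one reason the LLaMA series chose it.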
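
The difference between the two normalizations on a zero-variance input can be seen in a few lines of plain Python. This is a sketch only: both functions omit the learnable gain and bias, and the input vector is a made-up constant row.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without gain/bias: subtract the mean, divide by sqrt(var + eps)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm without gain: divide by sqrt(mean(x^2) + eps), no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

x = [3.0, 3.0, 3.0, 3.0]  # constant row: the variance is exactly zero
print(layer_norm(x))      # [0.0, 0.0, 0.0, 0.0]; only eps prevents 0/0
print(rms_norm(x))        # every entry close to 1.0
```

For LayerNorm the denominator on this input consists of eps alone, whereas the RMSNorm denominator is dominated by the signal itself, so it is far less sensitive to the choice of eps.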

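9. Boundaries and failure modes

Engineering failures rooted in the math usually come from ignoring boundary conditions. Misjudging the scenario makes a solution fail, so typical failure modes should be identified and fallbacks prepared in advance.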
[Flowchart nodes: scenario identification; fit assessment (edge label: high); proceed with the main plan; fallback plan; terminate or redesign; monitor gradient norm + CE; metrics met; roll back + lower lr + post-mortem; continuous optimization]

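python
# Source: general framework for engineering failure modes / v1.0
def assess_boundary(scene_metrics, thresholds):
    """Assess whether the current scenario falls outside the plan's boundaries.

    Returns a list of violated metrics, or None if every metric is in range.
    """
    failures = []
    for k, limit in thresholds.items():
        # A missing metric defaults to 0, so it is reported whenever the lower bound is positive.
        actual = scene_metrics.get(k, 0)
        if actual < limit["min"] or actual > limit["max"]:
            failures.append({"metric": k, "actual": actual, "limit": limit})
    return failures if failures else None

# Typical boundary thresholds for the math-related metrics
thresholds = {
    "grad_norm": {"min": 0.1, "max": 10.0},          # gradient norm
    "ce_loss": {"min": 0.5, "max": 15.0},            # cross-entropy
    "condition_number": {"min": 1, "max": 1e4},      # Hessian condition number
    "svd_effective_rank": {"min": 64, "max": 4096},  # SVD effective rank
}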
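
The thresholds only report violations; the decision they feed (continue, or roll back, lower the learning rate, and run a post-mortem) can be sketched as follows. This is a minimal illustration, not the author's implementation: `next_action`, its `lr_decay` parameter, and the returned dictionary layout are hypothetical, and the input is assumed to have the shape returned by `assess_boundary` (a list of violation dicts, or `None`).

```python
def next_action(violations, lr, lr_decay=0.5):
    """Map boundary violations to a training action (hypothetical helper).

    violations: None, or a list of {"metric", "actual", "limit"} dicts,
    i.e. the shape returned by assess_boundary.
    """
    if not violations:
        # Every monitored metric is inside its range: keep training.
        return {"action": "continue", "lr": lr, "reasons": []}
    reasons = []
    for v in violations:
        # Work out which side of the allowed range was crossed.
        bound = "min" if v["actual"] < v["limit"]["min"] else "max"
        reasons.append(f'{v["metric"]}={v["actual"]} violates {bound}={v["limit"][bound]}')
    # Assumed policy: any violation triggers a rollback with a reduced learning rate.
    return {"action": "rollback", "lr": lr * lr_decay, "reasons": reasons}


if __name__ == "__main__":
    spike = [{"metric": "grad_norm", "actual": 42.0, "limit": {"min": 0.1, "max": 10.0}}]
    print(next_action(None, lr=3e-4))
    print(next_action(spike, lr=3e-4))
```

The `reasons` list is meant to be stored alongside the checkpoint that was rolled back to, so the post-mortem starts from the exact metric that tripped.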

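Typical failure modes

  1. SGD diverges when training a large model: the condition number is too large, and training recovers after switching to AdamW. The root cause is that gradient variance differs by 5-10x between the MLP and Attention blocks.
  2. Accuracy collapses after SVD compression: the Attention matrices, which have high effective rank, were compressed. Only the MLP gate/up projections should be compressed.
  3. BF16 accumulation error: softmax sums more than 128 terms directly in BF16. BF16 keeps only 8 significant bits, so small terms are rounded away as the running sum grows. Accumulate in FP32 and cast back.
  4. Generation quality drops after DPO: β is too small, so the KL constraint stops binding and the model reward-hacks.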
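
Failure mode 3 can be reproduced without a GPU. The sketch below emulates BF16 by rounding a float32 bit pattern to its top 16 bits (round to nearest, ties to even). `to_bf16` and `accumulate` are illustrative helpers rather than a library API, the emulation ignores NaN/Inf, and the high-precision path uses Python's 64-bit floats as a stand-in for an FP32 accumulator (both are exact for this input).

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value (emulated; NaN/Inf not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that BF16 drops.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(values, bf16_accumulator: bool) -> float:
    """Sum values, optionally rounding the running sum to BF16 after each add."""
    total = 0.0
    for v in values:
        total = total + v
        if bf16_accumulator:
            total = to_bf16(total)
    return total


if __name__ == "__main__":
    terms = [1.0] * 512
    print(accumulate(terms, bf16_accumulator=True))   # running sum stalls at 256.0
    print(accumulate(terms, bf16_accumulator=False))  # 512.0
```

Once the running sum reaches 256, adding 1.0 lands exactly halfway between two representable BF16 values and rounds back down, so the remaining terms contribute nothing. The same mechanism silently drops small softmax terms when the denominator is accumulated in BF16.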

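9.1 Case study: a team's SVD compression failure

A team applied rank-128 SVD compression to every layer of a 7B model, and PPL jumped from 5.8 to 23.4. Investigation showed that the Attention Q/K matrices have high effective rank and slowly decaying singular values, so hard truncation destroyed the attention distribution.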

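python
# Source: SVD compression failure post-mortem / 2024
import torch

def diagnose_svd_quality(weight_matrix, threshold=0.95):
    """Diagnose whether a weight matrix is a good candidate for SVD compression."""
    # torch.linalg.svd supports float/double inputs, so cast BF16/FP16 weights to FP32 first
    _, S, _ = torch.linalg.svd(weight_matrix.float(), full_matrices=False)
    # Normalize the singular values so they sum to 1
    S_norm = S / S.sum()
    # Cumulative share carried by the first k singular values
    cumulative = torch.cumsum(S_norm, dim=0)
    # Rank needed to reach the threshold share (95% by default)
    k_95 = (cumulative < threshold).sum().item() + 1
    total_rank = S.size(0)
    ratio = k_95 / total_rank
    # ratio < 0.3: suitable for compression; ratio > 0.7: not suitable
    return {
        'effective_rank': k_95,
        'total_rank': total_rank,
        'compression_suitable': ratio < 0.3,
        'energy_ratio': ratio
    }

# Diagnosis results compared:
# MLP gate (11008x4096): effective_rank=180, ratio=0.04, suitable for compression
# Attention Q (4096x4096): effective_rank=3200, ratio=0.78, not suitable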

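Fix: compress only the MLP layers and keep the original Attention weights. The compression ratio drops from the claimed 75% to 35%, but PPL rises by only 0.3, which is acceptable.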
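
The parameter arithmetic behind this fix is worth spelling out. Replacing an m x n matrix with rank-r factors of shape m x r and r x n stores r(m+n) values instead of mn, so factorization only saves memory when r < mn/(m+n). The sketch below applies this to the two shapes quoted in the diagnosis comments; the helper names are illustrative, not from the original post.

```python
def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters stored by a rank-r factorization of an m x n matrix."""
    return r * (m + n)


def break_even_rank(m: int, n: int) -> float:
    """Rank above which the factorized form is larger than the dense matrix."""
    return m * n / (m + n)


def param_ratio(m: int, n: int, r: int) -> float:
    """Factorized size divided by dense size (below 1.0 means a saving)."""
    return low_rank_params(m, n, r) / (m * n)


if __name__ == "__main__":
    # MLP gate, 11008 x 4096, truncated to rank 128
    print(low_rank_params(11008, 4096, 128), round(param_ratio(11008, 4096, 128), 4))
    # Attention Q, 4096 x 4096: break-even rank and cost at the diagnosed rank of 3200
    print(break_even_rank(4096, 4096), param_ratio(4096, 4096, 3200))
```

For the 4096 x 4096 Attention matrix the break-even rank is 2048, while the diagnosed effective rank is about 3200. A factorization faithful enough to keep 95% of the singular-value mass would therefore store more parameters than the dense matrix, which is a second reason to leave Attention untouched.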

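9.2 Case study: NaN in BF16 training

A model training on long sequences (seq=8192) started producing NaN at the LayerNorm output. The reported root cause is that under BF16 the variance became extremely small (<1e-6) and eps=1e-5 no longer protected the denominator, leading to a division by zero. Note that 1e-5 itself is representable in BF16, whose smallest positive normal value is about 1.2e-38, so the loss is more plausibly a rounding or cancellation effect in the low-precision variance computation than a literal underflow of eps.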

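python
# Source: BF16 long-sequence training NaN investigation / 2024
import torch

def diagnose_layernorm_stability(x, eps=1e-5):
    """Diagnose the numerical stability of LayerNorm for the current input."""
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    # Check whether the smallest variance is close to eps
    min_var = var.min().item()
    # BF16 keeps only about 8 significant bits, so tiny variances are easily lost to rounding;
    # flag inputs whose variance is not well above eps
    stable = min_var > eps * 10  # 10x safety margin
    return {
        'min_variance': min_var,
        'eps': eps,
        'stable': stable,
        'recommendation': 'increase eps to 1e-4' if not stable else 'ok'
    }

# Fix: under BF16, raise the LayerNorm/RMSNorm eps from 1e-5 to 1e-4,
# or switch to RMSNorm (no mean subtraction, numerically more stable)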

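Measured result: after the fix, long-sequence training ran stably for 50K steps with no NaN. RMSNorm is roughly 3-5x more numerically stable than LayerNorm under BF16, which is the engineering rationale for the LLaMA and Qwen families standardizing on RMSNorm.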
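
Why removing the mean subtraction helps can be seen on a toy input whose values are nearly constant. The pure-Python sketch below runs in double precision with no BF16 emulation, and `layer_norm_gain` and `rms_norm_gain` are illustrative names. It compares the factor by which each normalization rescales its input.

```python
import math


def layer_norm_gain(x, eps):
    """Scale factor 1/sqrt(var + eps) that LayerNorm applies after centering."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return 1.0 / math.sqrt(var + eps)


def rms_norm_gain(x, eps):
    """Scale factor 1/sqrt(mean(x^2) + eps) that RMSNorm applies."""
    mean_sq = sum(v * v for v in x) / len(x)
    return 1.0 / math.sqrt(mean_sq + eps)


if __name__ == "__main__":
    # Activations sitting at 1.0 with tiny +/-1e-4 wiggles: variance is about 1e-8.
    x = [1.0 + 1e-4 * (1 if i % 2 else -1) for i in range(8)]
    for eps in (1e-5, 1e-4):
        print(eps, layer_norm_gain(x, eps), rms_norm_gain(x, eps))
```

On this input LayerNorm multiplies the centered values by roughly 316 at eps=1e-5 and by roughly 100 at eps=1e-4, so any rounding noise left after centering is amplified by the same factor. RMSNorm's gain stays close to 1 because mean(x^2) remains near 1 even when the variance collapses.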

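Summary

The key to putting the mathematical foundations of large models into engineering practice is turning abstract formulas into numerical metrics that can be monitored. Cross-entropy is the north star of training health, the gradient norm is the alarm for stability, the condition number explains why AdamW is required, and the information bottleneck guides where to extract representations.

What matters in engineering practice is not how advanced the theory is, but a clear understanding of boundary conditions and contingency plans prepared in advance for failure modes. Set up a real-time dashboard for gradient norm, CE, and PPL at project kickoff, with automatic alert thresholds, so that problems are not discovered halfway through training. Technical choices have to balance the theoretical ceiling against engineering reality: BF16 rather than FP32, AdamW rather than SGD, and SVD compression restricted to the MLP layers all reflect that balance.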
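
The dashboard recommendation can start very small. The sketch below converts CE to PPL (PPL = exp(CE) when CE is measured in nats per token) and raises an alert when CE jumps well above an exponential moving average of recent healthy steps. The `LossMonitor` class and its `beta` and `spike_ratio` defaults are assumptions for illustration, not values from the text.

```python
import math


def perplexity(ce_nats: float) -> float:
    """Perplexity for a cross-entropy given in nats per token."""
    return math.exp(ce_nats)


class LossMonitor:
    """Flag CE spikes relative to an exponential moving average (hypothetical helper)."""

    def __init__(self, beta: float = 0.9, spike_ratio: float = 1.5):
        self.beta = beta                # EMA smoothing factor (assumed default)
        self.spike_ratio = spike_ratio  # alert when CE exceeds spike_ratio * EMA (assumed)
        self.ema = None

    def update(self, ce: float) -> bool:
        """Record one step; return True if the step should raise an alert."""
        if not math.isfinite(ce):
            return True  # NaN or Inf always alerts
        if self.ema is None:
            self.ema = ce
            return False
        alert = ce > self.spike_ratio * self.ema
        if not alert:
            # Only healthy steps update the baseline, so a spike cannot hide itself.
            self.ema = self.beta * self.ema + (1 - self.beta) * ce
        return alert


if __name__ == "__main__":
    monitor = LossMonitor()
    # CE values corresponding to PPL 5.8 (healthy) and PPL 23.4 (after the bad compression)
    stream = [math.log(5.8)] * 3 + [math.log(23.4), float("nan")]
    for ce in stream:
        ppl = perplexity(ce) if math.isfinite(ce) else float("nan")
        print(round(ce, 3), round(ppl, 1), monitor.update(ce))
```

Alerting on CE rather than on PPL keeps the threshold multiplicative in log space, so the same `spike_ratio` behaves sensibly early in training, when PPL is large, and late, when it is small.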