Fine-tuning
├── Full Fine-tuning (update all parameters)
└── Parameter-Efficient Fine-tuning (PEFT)
    ├── Additive methods (add new parameters)
    │   ├── Adapter (insert small bottleneck modules)
    │   ├── Prompt Tuning (soft prompts at the input layer)
    │   └── Prefix Tuning (learnable KV prefixes at every layer)
    ├── Selective methods (train a subset of existing parameters)
    └── Reparameterization methods (low-rank approximation)
        └── LoRA
How LoRA Works
Given a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, traditional fine-tuning learns $W = W_0 + \Delta W$ directly.
LoRA factorizes $\Delta W$ as the product of two low-rank matrices: $\Delta W = BA$
where:
$B \in \mathbb{R}^{d \times r}$
$A \in \mathbb{R}^{r \times k}$
rank $r \ll \min(d, k)$ (typically $r = 4, 8, 16, 64$)
The forward pass becomes:
$h = W_0 x + \Delta W x = W_0 x + BAx$
| Advantage | Explanation |
|---|---|
| Parameter-efficient | Trainable parameters drop from $d \times k$ to $r \times (d + k)$, typically a 99.9%+ reduction |
| Original weights untouched | $W_0$ stays frozen; at inference the update can be merged in or kept separate |
| Modular | Train a separate $(B, A)$ pair per task and swap pairs quickly |
| No inference latency | Merge $W = W_0 + BA$ at deployment; compute cost matches the original model |
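As a quick sanity check of the reduction figure, here is a minimal sketch; the dimensions $d = k = 12288$ (roughly GPT-3 scale) and $r = 4$ are illustrative assumptions:

```python
# Illustrative parameter count for one projection matrix (assumed d = k = 12288, r = 4).
d, k, r = 12288, 12288, 4

full_params = d * k        # full fine-tuning: every entry of W is trainable
lora_params = r * (d + k)  # LoRA: B (d x r) plus A (r x k)

print(f"full: {full_params:,}")   # 150,994,944
print(f"lora: {lora_params:,}")   # 98,304
print(f"reduction: {1 - lora_params / full_params:.4%}")  # ~99.93%
```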
Initialization Strategy
A is initialized with a random Gaussian (breaks symmetry)
B is initialized to zero (guarantees $\Delta W = 0$ at the start of training, i.e. $h = W_0 x$)
The model therefore starts smoothly from the pretrained state, avoiding instability in early training. A minimal sketch of this property follows below.
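A minimal check of the zero-at-start property, with assumed shapes ($d = k = 768$, $r = 8$):

```python
import torch

# Assumed demo shapes: d = k = 768, r = 8.
d, k, r = 768, 768, 8

A = torch.randn(r, k) * 0.02  # random Gaussian: breaks symmetry between rank directions
B = torch.zeros(d, r)         # zeros: the product B @ A starts out exactly 0

delta_W = B @ A
print(delta_W.abs().max())    # tensor(0.) -> h = W_0 x at step 0
```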
Where to Apply LoRA
Typically applied to the Transformer's:
Query/Value projection matrices ($W_q$, $W_v$) --- best results in practice
Can also be extended to $W_k$, $W_o$, the FFN layers, etc.
Transformer Block
├── Multi-Head Attention
│   ├── Q Projection: x @ W_q^T (d_model → d_model)
│   ├── K Projection: x @ W_k^T (d_model → d_model)
│   ├── V Projection: x @ W_v^T (d_model → d_model)
│   └── O Projection: attn_out @ W_o^T (d_model → d_model)
│
└── Feed-Forward Network (FFN)
    ├── Gate/Up: x @ W_gate^T (d_model → 4·d_model)
    └── Down: x @ W_down^T (4·d_model → d_model)
| Target modules | Parameter count | Effect | Recommendation |
|---|---|---|---|
| $W_q$, $W_v$ | 2×(d×d) | ⭐⭐⭐ best | Always add |
| $W_k$ | d×d | ⭐⭐ moderate gain | Optional |
| $W_o$ | d×d | ⭐⭐ moderate gain | Optional |
| FFN (Up/Down) | d×4d + 4d×d = 8d² | ⭐⭐⭐ strong gain | Optional; largest parameter cost |
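In practice this choice is usually expressed as a target_modules list. A minimal sketch with the Hugging Face peft library; the checkpoint and the module names "q_proj"/"v_proj" are assumptions (they follow OPT/LLaMA-style naming and vary by model):

```python
# Hypothetical usage sketch with Hugging Face's peft library.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # example checkpoint

config = LoraConfig(
    r=8,                                  # rank
    lora_alpha=16,                        # scaling numerator (alpha / r applied internally)
    target_modules=["q_proj", "v_proj"],  # the "always add" choice from the table above
    lora_dropout=0.05,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # reports trainable vs. total parameters
```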
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional
class LoRALayer(nn.Module):
    """
    Core LoRA layer: implements the low-rank adaptation W = W0 + BA * scaling
    """
    def __init__(
        self,
        in_features: int,           # input feature dimension (e.g. 768)
        out_features: int,          # output feature dimension (e.g. 768)
        rank: int = 4,              # low-rank dimension r, typically 4-64
        lora_alpha: float = 1.0,    # scaling coefficient alpha
        lora_dropout: float = 0.0,  # dropout probability
    ):
        super().__init__()
        self.rank = rank                  # store the rank
        self.lora_alpha = lora_alpha      # store alpha
        self.scaling = lora_alpha / rank  # scaling factor: alpha / r
        # Store dimensions (used for shape checks)
        self.in_features = in_features
        self.out_features = out_features
        # ============================================
        # Trainable parameters: low-rank matrices A and B
        # ============================================
        # A: [rank, in_features] = [r, d]
        # compresses the input from d dims down to r dims
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        # B: [out_features, rank] = [d, r]
        # expands from r dims back up to d dims
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        # Dropout: regularization; randomly drops part of the path during training
        self.lora_dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0 else nn.Identity()
        # ============================================
        # Initialization strategy (critical for training stability)
        # ============================================
        # A: Kaiming/He initialization -- random values with stable variance.
        # a=sqrt(5) is the LeakyReLU slope; bound = sqrt(6 / ((1 + 5) * fan_in))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B: zeros -- guarantees delta_W = B @ A = 0 at the start of training,
        # so the model starts smoothly from the pretrained state with no initial perturbation
        nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor, original_output: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: compute the LoRA branch and add it to the original output
        Args:
            x: input tensor [batch_size, seq_len, in_features]
            original_output: output of the original linear layer [batch_size, seq_len, out_features]
        Returns:
            adapted output [batch_size, seq_len, out_features]
        """
        # ============================================
        # LoRA branch: h = x @ A^T @ B^T * scaling
        # ============================================
        # Step 1: dropout (regularization)
        x_dropped = self.lora_dropout(x)  # [batch, seq, in_features]
        # Step 2: matmul path (x @ A^T) @ B^T
        # x @ A^T: [batch, seq, in] @ [in, rank] -> [batch, seq, rank]
        # i.e. compress the input into the low-rank space
        x_compressed = x_dropped @ self.lora_A.T  # [batch, seq, rank]
        # Step 3: @ B^T: [batch, seq, rank] @ [rank, out] -> [batch, seq, out]
        # expand from the low-rank space back to the output space
        lora_output = x_compressed @ self.lora_B.T  # [batch, seq, out_features]
        # Step 4: apply the scaling factor alpha / r
        lora_output = lora_output * self.scaling
        # ============================================
        # Residual connection: original output + LoRA adaptation
        # ============================================
        # Shape check: original_output and lora_output are both [batch, seq, out]
        return original_output + lora_output

    def merge_weights(self) -> torch.Tensor:
        """
        Compute the equivalent merged weight update: W_merged = B @ A * scaling
        Used at inference time to eliminate the extra compute
        Returns:
            merged low-rank update matrix [out_features, in_features]
        """
        # B @ A: [out, rank] @ [rank, in] -> [out, in]
        return self.lora_B @ self.lora_A * self.scaling

    def get_trainable_params(self) -> int:
        """Return the number of trainable parameters (for reporting)"""
        return self.lora_A.numel() + self.lora_B.numel()

class LinearWithLoRA(nn.Module):
    """
    Wrapper: upgrades a plain nn.Linear into a LoRA-augmented version
    """
    def __init__(
        self,
        original_linear: nn.Linear,  # the pretrained original linear layer
        rank: int = 4,
        lora_alpha: float = 1.0,
        lora_dropout: float = 0.0,
    ):
        super().__init__()
        # ============================================
        # Freeze the original weights (key: no gradients for them, saving memory)
        # ============================================
        self.original_linear = original_linear
        for param in self.original_linear.parameters():
            param.requires_grad = False  # freeze
        # Store the original layer's configuration
        self.in_features = original_linear.in_features
        self.out_features = original_linear.out_features
        # ============================================
        # Create and attach the LoRA layer
        # ============================================
        self.lora = LoRALayer(
            in_features=original_linear.in_features,
            out_features=original_linear.out_features,
            rank=rank,
            lora_alpha=lora_alpha,
            lora_dropout=lora_dropout,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass
        Args:
            x: input [batch, seq, in_features]
        Returns:
            output [batch, seq, out_features]
        """
        # ============================================
        # Original branch. Note: no torch.no_grad() here -- the frozen weights
        # receive no gradients anyway (requires_grad=False), but gradients must
        # still flow through this branch w.r.t. x, or LoRA layers in earlier
        # blocks of a stacked model would get wrong gradients.
        # ============================================
        original_out = self.original_linear(x)  # [batch, seq, out_features]
        # ============================================
        # LoRA branch (trainable)
        # ============================================
        # self.lora returns original_out + lora_delta; the shape is unchanged
        return self.lora(x, original_out)  # [batch, seq, out_features]

    def merge_and_unload(self) -> nn.Linear:
        """
        Merge LoRA into the original layer and return a plain Linear (inference optimization)
        Returns:
            merged standard nn.Linear layer
        """
        # Merged weight: W_original + delta_W
        # original_linear.weight: [out, in]
        # merge_weights():        [out, in]
        merged_weight = self.original_linear.weight.data + self.lora.merge_weights()
        # Build a fresh standard Linear layer
        new_linear = nn.Linear(
            self.in_features,
            self.out_features,
            bias=self.original_linear.bias is not None,
            device=merged_weight.device,
            dtype=merged_weight.dtype,
        )
        # Copy the merged weights
        new_linear.weight.data = merged_weight
        # Copy the bias (if present)
        if self.original_linear.bias is not None:
            new_linear.bias.data = self.original_linear.bias.data
        return new_linear

class LoRATransformerBlock(nn.Module):
    """
    A complete Transformer block with LoRA.
    LoRA is applied only to the attention Q and V projections (best practice).
    """
    def __init__(
        self,
        d_model: int,  # model dimension (e.g. 768, 1024, 2048)
        n_heads: int,  # number of attention heads (e.g. 12, 16)
        rank: int = 4, # LoRA rank
        lora_alpha: float = 1.0,
        dropout: float = 0.1,
    ):
        super().__init__()
        # Store the configuration
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads  # per-head dimension (e.g. 768 / 12 = 64)
        # ============================================
        # Standard attention projections (stand-ins for pretrained weights)
        # ============================================
        # Q projection: input -> query space [d_model, d_model]
        self.q_proj = nn.Linear(d_model, d_model)
        # K projection: input -> key space [d_model, d_model]
        self.k_proj = nn.Linear(d_model, d_model)
        # V projection: input -> value space [d_model, d_model]
        self.v_proj = nn.Linear(d_model, d_model)
        # O projection: attention output -> model space [d_model, d_model]
        self.o_proj = nn.Linear(d_model, d_model)
        # ============================================
        # Add LoRA to Q and V (empirically the most effective placement)
        # ============================================
        # LoRA on Q: adapts "what to attend to"
        self.q_proj_lora = LoRALayer(d_model, d_model, rank, lora_alpha, dropout)
        # LoRA on V: adapts "what information to extract"
        self.v_proj_lora = LoRALayer(d_model, d_model, rank, lora_alpha, dropout)
        # Freeze the original Q and V weights (only the LoRA parts are trained)
        for param in self.q_proj.parameters():
            param.requires_grad = False
        for param in self.v_proj.parameters():
            param.requires_grad = False
        # K and O may be trainable or frozen (left trainable here; freeze if desired)
        # ============================================
        # Feed-forward network (no LoRA here; kept as-is)
        # ============================================
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand 4x
            nn.GELU(),                        # activation
            nn.Linear(4 * d_model, d_model),  # project back to the model dimension
        )
        # LayerNorms (usually trained directly or frozen during fine-tuning; no LoRA)
        self.ln1 = nn.LayerNorm(d_model)  # norm before attention
        self.ln2 = nn.LayerNorm(d_model)  # norm before the FFN
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def attention(
        self,
        x: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Multi-head self-attention (with LoRA on Q and V)
        Args:
            x: input [batch_size, seq_len, d_model]
            mask: attention mask [batch_size, 1, seq_len, seq_len] or None
        Returns:
            attention output [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape
        # ============================================
        # Step 1: linear projections to Q, K, V
        # ============================================
        # Q = X @ Wq^T + LoRA_Q(X), shape: [batch, seq, d_model]
        # (passing zeros as the "original output" yields just the LoRA delta;
        #  this works here because in_features == out_features == d_model)
        q = self.q_proj(x) + self.q_proj_lora(x, torch.zeros_like(x))
        # K = X @ Wk^T (no LoRA), shape: [batch, seq, d_model]
        k = self.k_proj(x)
        # V = X @ Wv^T + LoRA_V(X), shape: [batch, seq, d_model]
        v = self.v_proj(x) + self.v_proj_lora(x, torch.zeros_like(x))
        # ============================================
        # Step 2: multi-head split
        # ============================================
        # view: split the last dimension d_model into (n_heads, head_dim)
        # e.g. [2, 512, 768] -> [2, 512, 12, 64]
        # Q: [batch, seq, d_model] -> [batch, seq, n_heads, head_dim]
        q = q.view(batch_size, seq_len, self.n_heads, self.head_dim)
        # transpose(1, 2): swap the seq and n_heads dimensions
        # -> [batch, n_heads, seq, head_dim] = [2, 12, 512, 64]
        # purpose: let each head compute independently over the last dimension
        q = q.transpose(1, 2)
        # K likewise: [batch, n_heads, seq, head_dim]
        k = k.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        # V likewise: [batch, n_heads, seq, head_dim]
        v = v.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        # ============================================
        # Step 3: attention scores Attention(Q, K, V)
        # ============================================
        # Q @ K^T: [batch, heads, seq, head_dim] @ [batch, heads, head_dim, seq]
        # -> [batch, heads, seq, seq] = each position's attention score over all positions
        scores = torch.matmul(q, k.transpose(-2, -1))
        # Scale by sqrt(head_dim) to keep softmax gradients healthy
        scores = scores / math.sqrt(self.head_dim)
        # Apply the mask (e.g. causal mask, padding mask)
        if mask is not None:
            # positions where mask == 0 are filled with -inf, which softmax turns into 0
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Softmax normalization: each row sums to 1
        attn_weights = F.softmax(scores, dim=-1)  # [batch, heads, seq, seq]
        # Dropout (randomly drops attention connections during training)
        attn_weights = self.dropout(attn_weights)
        # Weighted sum: [batch, heads, seq, seq] @ [batch, heads, seq, head_dim]
        # -> [batch, heads, seq, head_dim]
        attn_output = torch.matmul(attn_weights, v)
        # ============================================
        # Step 4: merge the heads and project the output
        # ============================================
        # transpose(1, 2): [batch, heads, seq, dim] -> [batch, seq, heads, dim]
        attn_output = attn_output.transpose(1, 2)
        # contiguous(): view requires contiguous memory
        # view: [batch, seq, heads, head_dim] -> [batch, seq, d_model]
        attn_output = attn_output.contiguous().view(batch_size, seq_len, self.d_model)
        # Final linear projection: [batch, seq, d_model] -> [batch, seq, d_model]
        output = self.o_proj(attn_output)
        return output

    def forward(
        self,
        x: torch.Tensor,
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Transformer block forward pass (pre-norm structure)
        Args:
            x: input [batch, seq, d_model]
            mask: attention mask
        Returns:
            output [batch, seq, d_model]
        """
        # ============================================
        # Sublayer 1: multi-head self-attention + residual connection
        # ============================================
        # Pre-norm: LayerNorm before attention (common in modern architectures);
        # more stable than post-norm, with smoother gradient flow
        ln1_out = self.ln1(x)
        # Attention
        attn_out = self.attention(ln1_out, mask)
        # Residual connection: input + sublayer output (a gradient highway)
        x = x + attn_out
        # ============================================
        # Sublayer 2: FFN + residual connection
        # ============================================
        # Pre-norm
        ln2_out = self.ln2(x)
        # FFN: two linear layers around an activation
        ffn_out = self.ffn(ln2_out)
        # Residual connection
        x = x + ffn_out
        return x

# ============================================
# Usage example and test
# ============================================
def test_lora_implementation():
    """Test the LoRA implementation"""
    # Configuration
    batch_size = 2
    seq_len = 128
    d_model = 768
    n_heads = 12
    rank = 16
    # Build the model
    block = LoRATransformerBlock(
        d_model=d_model,
        n_heads=n_heads,
        rank=rank,
        lora_alpha=32,  # commonly set to 2 * rank
        dropout=0.1,
    )
    # Random input
    x = torch.randn(batch_size, seq_len, d_model)
    # Causal mask (prevents attending to future positions):
    # a lower-triangular matrix of shape [1, 1, seq, seq]
    causal_mask = torch.tril(torch.ones(seq_len, seq_len))
    causal_mask = causal_mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq, seq]
    # Forward pass
    output = block(x, mask=causal_mask)
    print(f"input shape: {x.shape}")
    print(f"output shape: {output.shape}")
    print("parameter counts:")
    # Count trainable parameters
    total_params = sum(p.numel() for p in block.parameters())
    trainable_params = sum(p.numel() for p in block.parameters() if p.requires_grad)
    print(f"total parameters: {total_params:,}")
    print(f"trainable parameters: {trainable_params:,}")
    print(f"trainable fraction: {trainable_params / total_params * 100:.4f}%")
    # Check the output shape
    assert output.shape == x.shape, "output shape should match the input"
    print("\ntest passed!")

if __name__ == "__main__":
    test_lora_implementation()
```
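As a follow-up, a small sketch (assuming the LinearWithLoRA class defined above) verifying that merge_and_unload leaves the layer's function unchanged:

```python
# Check that merging the LoRA update into the base weight is output-equivalent.
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(768, 768)
wrapped = LinearWithLoRA(base, rank=8, lora_alpha=16)

# Give B a nonzero value so delta_W != 0 and the check is non-trivial.
with torch.no_grad():
    wrapped.lora.lora_B.normal_(std=0.02)

x = torch.randn(2, 16, 768)
wrapped.eval()  # disable any dropout so both paths are deterministic
merged = wrapped.merge_and_unload()

assert torch.allclose(wrapped(x), merged(x), atol=1e-5)
print("merged layer matches the LoRA-wrapped layer")
```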
Visualizing the Flow
```python
# Input X: [batch, seq, in_features]
# W: [out_features, in_features] = [768, 768]
# A: [rank, in_features]         = [4, 768]
# B: [out_features, rank]        = [768, 4]

# Standard linear layer
output = X @ W.T  # [batch, seq, 768] @ [768, 768] = [batch, seq, 768]

# LoRA branch
lora_out = X @ A.T @ B.T
# Step 1: [batch, seq, 768] @ [768, 4] = [batch, seq, 4]
# Step 2: [batch, seq, 4] @ [4, 768]  = [batch, seq, 768]

# Same shape! Both are [batch, seq, 768]
final_out = output + lora_out * scaling
```
Input X: [batch, seq, d_model=768]
    │
    ├──► Q_proj ──► [batch, seq, 768] ──┐
    │    (+ LoRA)                       │
    ├──► K_proj ──► [batch, seq, 768] ──┼──► Multi-Head Split
    │    (frozen)                       │    view + transpose(1, 2)
    └──► V_proj ──► [batch, seq, 768] ──┘         │
         (+ LoRA)                                 ▼
                                      Q: [2, 12, 512, 64]
                                      K: [2, 12, 512, 64]
                                      V: [2, 12, 512, 64]
                                                  │
                                                  ▼
                    Q @ K^T ──► Scale ──► Softmax ──► @ V
                    [2, 12, 512, 512]        [2, 12, 512, 64]
                                                  │
                                                  ▼
                    Merge Heads ──► O_proj ──► Output
                    [2, 512, 768]              [2, 512, 768]
| Dimension | Meaning | Where it comes from |
|---|---|---|
| 2 | batch size | the input batch |
| 12 | n_heads (number of attention heads) | model config (e.g. 12 heads) |
| 512 | seq_len (sequence length) | number of input tokens |
| 64 | head_dim (per-head dimension) | d_model / n_heads = 768 / 12 |
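A quick way to verify this split/merge bookkeeping (numbers match the table; purely illustrative):

```python
import torch

# Numbers from the table above: batch=2, heads=12, seq=512, head_dim=64 (768 / 12).
batch, n_heads, seq, head_dim = 2, 12, 512, 64
d_model = n_heads * head_dim  # 768

x = torch.randn(batch, seq, d_model)

# Split: [2, 512, 768] -> [2, 512, 12, 64] -> [2, 12, 512, 64]
q = x.view(batch, seq, n_heads, head_dim).transpose(1, 2)
assert q.shape == (2, 12, 512, 64)

# Merge: [2, 12, 512, 64] -> [2, 512, 12, 64] -> [2, 512, 768]
merged = q.transpose(1, 2).contiguous().view(batch, seq, d_model)
assert torch.equal(merged, x)  # the round trip recovers the original tensor
print("shape bookkeeping checks out")
```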