LLM Learning Notes

```
Fine-tuning
├── Full Fine-tuning (update all parameters)
└── Parameter-Efficient Fine-tuning (PEFT)
    ├── Additive methods (add new parameters)
    │   ├── Prompt Tuning (soft prompt at the input layer)
    │   ├── Prefix Tuning (learnable KV prefixes at every layer)
    │   └── Adapter (small bottleneck modules inserted into each block)
    ├── Selective methods (train a subset of existing parameters)
    └── Reparameterization methods (low-rank reparameterization)
        └── LoRA
```
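To make the additive branch above concrete, here is a minimal, illustrative sketch of Prompt Tuning: the base model is frozen and only a small learnable soft prompt, prepended to the input embeddings, is trained. The class and argument names are made up for illustration and do not come from any specific library.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Illustrative Prompt Tuning wrapper: frozen base model + learnable soft prompt."""
    def __init__(self, base_model: nn.Module, d_model: int, prompt_len: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False           # freeze the entire base model
        # The only trainable parameters: prompt_len vectors of size d_model
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq, d_model]
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the soft prompt to the sequence, then run the frozen model
        return self.base_model(torch.cat([prompt, input_embeds], dim=1))
```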

How LoRA works

The pretrained weight matrix is W_0 ∈ R^{d×k}. Traditional fine-tuning learns W = W_0 + ΔW directly.

LoRA factorizes ΔW into the product of two low-rank matrices: ΔW = BA

where:

B ∈ R^{d×r}
A ∈ R^{r×k}

The rank r ≪ min(d, k) (typically r = 4, 8, 16, or 64).

The forward pass becomes:
h = W_0 x + ΔW x = W_0 x + BA x
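A minimal numeric sketch of this decomposition (the dimensions and scale factors below are illustrative, not taken from any particular model):

```python
import torch

d, k, r = 768, 768, 8              # illustrative dimensions and rank
W0 = torch.randn(d, k)             # frozen pretrained weight
A  = torch.randn(r, k) * 0.01      # A: [r, k]
B  = torch.randn(d, r)             # B: [d, r]

x = torch.randn(k)
delta_W = B @ A                               # ΔW = BA: [d, k]
h = W0 @ x + B @ (A @ x)                      # project down to r dims, then back up

print(delta_W.shape)                          # torch.Size([768, 768])
print(torch.linalg.matrix_rank(delta_W))      # at most r (here 8)
print(torch.allclose(h, (W0 + delta_W) @ x, atol=1e-3))  # True: same as the merged weight
```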

| Advantage | Explanation |
|-----------|-------------|
| Parameter efficiency | Trainable parameters drop from d×k to r×(d+k), typically a 99.9%+ reduction |
| Original weights untouched | W_0 stays frozen; at inference the update can be merged in or kept separate |
| Modularity | Train a different (B, A) pair per task and switch between them quickly |
| No inference latency | Merge W = W_0 + BA at deployment; compute cost equals the original model |
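To see where the reduction in the first row comes from, take a hypothetical projection matrix with d = k = 4096 and r = 8 (numbers chosen purely for illustration):

```python
d, k, r = 4096, 4096, 8                  # illustrative matrix size and rank

full = d * k                             # parameters updated by full fine-tuning
lora = r * (d + k)                       # parameters trained by LoRA (A and B)

print(f"Full fine-tuning: {full:,}")     # 16,777,216
print(f"LoRA (r=8):       {lora:,}")     # 65,536
print(f"Ratio: {lora / full:.4%}")       # ~0.39% of the original for this single matrix
```

For a whole model the overall trainable fraction is usually even smaller, because only a few projection matrices receive LoRA adapters.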

Initialization strategy

A is initialized randomly, e.g. with a Gaussian (the implementation below uses Kaiming uniform), to break symmetry

B is initialized to zero (so that ΔW = 0 at the start of training, i.e. h = W_0 x)

This way the model starts smoothly from its pretrained state, avoiding instability early in training.
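A few lines are enough to verify this (shapes are illustrative; the Kaiming initialization mirrors what the reference implementation below uses):

```python
import math
import torch
import torch.nn as nn

d, r = 768, 8
A = nn.Parameter(torch.empty(r, d))
B = nn.Parameter(torch.zeros(d, r))            # B starts at zero
nn.init.kaiming_uniform_(A, a=math.sqrt(5))    # A starts random, breaking symmetry

x = torch.randn(2, 16, d)                      # [batch, seq, d]
delta = (x @ A.T) @ B.T                        # output of the LoRA branch
print(torch.all(delta == 0).item())            # True: ΔW·x = 0 at the start of training
```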

Where to apply LoRA

LoRA is usually applied to the following parts of a Transformer:

Query/Value projection matrices (W_q, W_v): best results

It can also be extended to W_k, W_o, the FFN layers, etc.

```
Transformer Block
├── Multi-Head Attention
│   ├── Q Projection: x @ W_q^T        (d_model → d_model)
│   ├── K Projection: x @ W_k^T        (d_model → d_model)
│   ├── V Projection: x @ W_v^T        (d_model → d_model)
│   └── O Projection: attn_out @ W_o^T (d_model → d_model)
└── Feed-Forward Network (FFN)
    ├── Gate/Up: x @ W_gate^T  (d_model → 4·d_model)
    └── Down:    x @ W_down^T  (4·d_model → d_model)
```

| Target module | Parameter count | Effect | Recommendation |
|---------------|-----------------|--------|----------------|
| W_q, W_v | 2×(d×d) | ⭐⭐⭐ best | always add |
| W_k | d×d | ⭐⭐ moderate gain | optional |
| W_o | d×d | ⭐⭐ moderate gain | optional |
| FFN (Gate/Up/Down) | d×4d + 4d×d = 8d² | ⭐⭐⭐ | |
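In practice you would normally not hand-roll these matrices; libraries such as Hugging Face's peft let you pick the target modules by name. The sketch below assumes the peft and transformers packages are installed and uses an example model name; treat it as a rough illustration and defer to the library documentation for details.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model

config = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=16,                         # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # apply LoRA to the Q/V projections only
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()         # reports the trainable-parameter ratio
```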
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional, Tuple


class LoRALayer(nn.Module):
    """
    LoRA 核心层: 实现低秩适配 W = W0 + BA * scaling
    """
    def __init__(
        self,
        in_features: int,      # 输入特征维度 (如 768)
        out_features: int,     # 输出特征维度 (如 768)
        rank: int = 4,         # 低秩维度 r,通常 4-64
        lora_alpha: float = 1.0,  # 缩放系数 α
        lora_dropout: float = 0.0,  # Dropout 概率
    ):
        super().__init__()
        
        self.rank = rank                    # 保存秩参数
        self.lora_alpha = lora_alpha        # 保存 alpha
        self.scaling = lora_alpha / rank    # 计算缩放因子: α/r
        
        # 保存维度信息(用于形状检查)
        self.in_features = in_features
        self.out_features = out_features
        
        # ============================================
        # 可训练参数: 低秩矩阵 A 和 B
        # ============================================
        
        # A: [rank, in_features] = [r, d] 
        # 将输入从 d 维压缩到 r 维
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        
        # B: [out_features, rank] = [d, r]
        # 将 r 维扩展回 d 维
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Dropout 层: 防止过拟合,训练时随机丢弃部分路径
        self.lora_dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0 else nn.Identity()
        
        # ============================================
        # 关键初始化策略(决定训练稳定性)
        # ============================================
        
        # A 用 Kaiming/He 初始化: 随机高斯分布,保持方差稳定
        # a=sqrt(5) 对应 LeakyReLU 参数,bound = sqrt(6/((1+5)*fan_in))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        
        # B 初始化为零: 确保训练开始时 ΔW = B@A = 0
        # 这样模型从预训练状态平滑启动,无初始扰动
        nn.init.zeros_(self.lora_B)
    
    def forward(self, x: torch.Tensor, original_output: torch.Tensor) -> torch.Tensor:
        """
        前向传播: 计算 LoRA 分支并加到原始输出
        
        Args:
            x: 输入张量 [batch_size, seq_len, in_features]
            original_output: 原始线性层输出 [batch_size, seq_len, out_features]
        
        Returns:
            增强后的输出 [batch_size, seq_len, out_features]
        """
        batch_size, seq_len, _ = x.shape
        
        # ============================================
        # LoRA 分支计算: h = x @ A^T @ B^T * scaling
        # ============================================
        
        # Step 1: Dropout (正则化)
        x_dropped = self.lora_dropout(x)  # [batch, seq, in_features]
        
        # Step 2: 矩阵乘法路径 (x @ A^T) @ B^T
        # x @ A^T: [batch, seq, in] @ [in, rank] -> [batch, seq, rank]
        # 即将输入压缩到低秩空间
        x_compressed = x_dropped @ self.lora_A.T  # [batch, seq, rank]
        
        # Step 3: @ B^T: [batch, seq, rank] @ [rank, out] -> [batch, seq, out]
        # 从低秩空间扩展回输出空间
        lora_output = x_compressed @ self.lora_B.T  # [batch, seq, out_features]
        
        # Step 4: 应用缩放因子 α/r
        lora_output = lora_output * self.scaling
        
        # ============================================
        # 残差连接: 原始输出 + LoRA 适配
        # ============================================
        # 维度检查: original_output 和 lora_output 都是 [batch, seq, out]
        return original_output + lora_output
    
    def merge_weights(self) -> torch.Tensor:
        """
        计算等效的合并权重: W_merged = B @ A * scaling
        用于推理时消除额外计算开销
        
        Returns:
            合并后的低秩更新矩阵 [out_features, in_features]
        """
        # B @ A: [out, rank] @ [rank, in] -> [out, in]
        return self.lora_B @ self.lora_A * self.scaling
    
    def get_trainable_params(self) -> int:
        """返回可训练参数数量(用于统计)"""
        return self.lora_A.numel() + self.lora_B.numel()


class LinearWithLoRA(nn.Module):
    """
    包装器: 将普通 nn.Linear 增强为带 LoRA 的版本
    """
    def __init__(
        self,
        original_linear: nn.Linear,  # 预训练的原始线性层
        rank: int = 4,
        lora_alpha: float = 1.0,
        lora_dropout: float = 0.0,
    ):
        super().__init__()
        
        # ============================================
        # 冻结原始权重(关键:不计算梯度,节省显存)
        # ============================================
        self.original_linear = original_linear
        for param in self.original_linear.parameters():
            param.requires_grad = False  # 冻结标记
        
        # 保存原始层的配置
        self.in_features = original_linear.in_features
        self.out_features = original_linear.out_features
        
        # ============================================
        # 创建并附加 LoRA 层
        # ============================================
        self.lora = LoRALayer(
            in_features=original_linear.in_features,
            out_features=original_linear.out_features,
            rank=rank,
            lora_alpha=lora_alpha,
            lora_dropout=lora_dropout,
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        前向传播流程
        
        Args:
            x: 输入 [batch, seq, in_features]
        
        Returns:
            输出 [batch, seq, out_features]
        """
        # ============================================
        # 原始分支(无梯度,节省计算)
        # ============================================
        with torch.no_grad():  # 禁用梯度计算上下文
            original_out = self.original_linear(x)
            # 输出形状: [batch, seq, out_features]
        
        # ============================================
        # LoRA 分支(有梯度,可训练)
        # ============================================
        # lora 返回的是 original_out + lora_delta,形状不变
        return self.lora(x, original_out)  # [batch, seq, out_features]
    
    def merge_and_unload(self) -> nn.Linear:
        """
        将 LoRA 合并到原始层,返回普通 Linear(推理优化)
        
        Returns:
            合并后的标准 nn.Linear 层
        """
        # 计算合并后的权重: W_original + ΔW
        # original_linear.weight: [out, in]
        # merge_weights(): [out, in]
        merged_weight = self.original_linear.weight.data + self.lora.merge_weights()
        
        # 创建新的标准 Linear 层
        new_linear = nn.Linear(
            self.in_features,
            self.out_features,
            bias=self.original_linear.bias is not None,
            device=merged_weight.device,
            dtype=merged_weight.dtype,
        )
        
        # 复制合并后的权重
        new_linear.weight.data = merged_weight
        
        # 复制偏置(如果存在)
        if self.original_linear.bias is not None:
            new_linear.bias.data = self.original_linear.bias.data
            
        return new_linear


class LoRATransformerBlock(nn.Module):
    """
    完整的 LoRA Transformer Block
    只对 Attention 的 Q, V 应用 LoRA(最佳实践)
    """
    def __init__(
        self,
        d_model: int,      # 模型维度(如 768, 1024, 2048)
        n_heads: int,      # 注意力头数(如 12, 16)
        rank: int = 4,     # LoRA 秩
        lora_alpha: float = 1.0,
        dropout: float = 0.1,
    ):
        super().__init__()
        
        # 保存配置
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads  # 每个头的维度(如 768/12=64)
        
        # ============================================
        # 标准 Attention 投影层(模拟预训练权重)
        # ============================================
        
        # Q 投影: 输入 -> 查询空间 [d_model, d_model]
        self.q_proj = nn.Linear(d_model, d_model)
        # K 投影: 输入 -> 键空间 [d_model, d_model]
        self.k_proj = nn.Linear(d_model, d_model)
        # V 投影: 输入 -> 值空间 [d_model, d_model]
        self.v_proj = nn.Linear(d_model, d_model)
        # O 投影: 注意力输出 -> 模型空间 [d_model, d_model]
        self.o_proj = nn.Linear(d_model, d_model)
        
        # ============================================
        # 为 Q 和 V 添加 LoRA(经验证效果最佳)
        # ============================================
        
        # Q 的 LoRA: 控制"关注什么"
        self.q_proj_lora = LoRALayer(d_model, d_model, rank, lora_alpha, dropout)
        # V 的 LoRA: 控制"提取什么信息"
        self.v_proj_lora = LoRALayer(d_model, d_model, rank, lora_alpha, dropout)
        
        # 冻结原始 Q, V 的权重(只训练 LoRA 部分)
        for param in self.q_proj.parameters():
            param.requires_grad = False
        for param in self.v_proj.parameters():
            param.requires_grad = False
        
        # K 和 O 保持可训练或冻结(这里保持默认可训练,也可冻结)
        
        # ============================================
        # 前馈网络 FFN(通常不加 LoRA,保持原样)
        # ============================================
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # 扩展 4 倍
            nn.GELU(),                         # 激活函数
            nn.Linear(4 * d_model, d_model), # 投影回原始维度
        )
        
        # LayerNorm(通常微调时不加 LoRA,直接训练或冻结)
        self.ln1 = nn.LayerNorm(d_model)  # Attention 前的 Norm
        self.ln2 = nn.LayerNorm(d_model)  # FFN 前的 Norm
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def attention(
        self, 
        x: torch.Tensor, 
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        多头自注意力机制(带 LoRA 的 Q, V)
        
        Args:
            x: 输入 [batch_size, seq_len, d_model]
            mask: 注意力掩码 [batch_size, 1, seq_len, seq_len] 或 None
        
        Returns:
            注意力输出 [batch_size, seq_len, d_model]
        """
        batch_size, seq_len, _ = x.shape
        
        # ============================================
        # Step 1: 线性投影得到 Q, K, V
        # ============================================
        
        # Q = X @ Wq^T + LoRA_Q(X),形状: [batch, seq, d_model]
        q = self.q_proj(x) + self.q_proj_lora(x, torch.zeros_like(x))
        
        # K = X @ Wk^T(无 LoRA),形状: [batch, seq, d_model]
        k = self.k_proj(x)
        
        # V = X @ Wv^T + LoRA_V(X),形状: [batch, seq, d_model]
        v = self.v_proj(x) + self.v_proj_lora(x, torch.zeros_like(x))
        
        # ============================================
        # Step 2: 多头拆分 (Multi-Head Split)
        # ============================================
        
        # view: 将最后一个维度 d_model 拆分为 (n_heads, head_dim)
        # 例如: [2, 512, 768] -> [2, 512, 12, 64]
        
        # Q: [batch, seq, d_model] -> [batch, seq, n_heads, head_dim]
        q = q.view(batch_size, seq_len, self.n_heads, self.head_dim)
        # transpose(1, 2): 交换 seq 和 n_heads 维度
        # -> [batch, n_heads, seq, head_dim] = [2, 12, 512, 64]
        # 目的: 让每个头在最后一个维度独立计算
        q = q.transpose(1, 2)
        
        # K 同理: [batch, n_heads, seq, head_dim]
        k = k.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        
        # V 同理: [batch, n_heads, seq, head_dim]
        v = v.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        
        # ============================================
        # Step 3: 计算注意力分数 Attention(Q, K, V)
        # ============================================
        
        # Q @ K^T: [batch, heads, seq, head_dim] @ [batch, heads, head_dim, seq]
        # -> [batch, heads, seq, seq] = 每个位置对其他位置的注意力分数
        scores = torch.matmul(q, k.transpose(-2, -1))
        
        # 缩放: 防止 softmax 梯度消失,除以 sqrt(head_dim)
        scores = scores / math.sqrt(self.head_dim)
        
        # 应用掩码(如因果掩码、填充掩码)
        if mask is not None:
            # mask == 0 的位置填充 -inf,softmax 后变为 0
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Softmax 归一化: 每行之和为 1
        attn_weights = F.softmax(scores, dim=-1)  # [batch, heads, seq, seq]
        
        # Dropout(训练时随机丢弃注意力连接)
        attn_weights = self.dropout(attn_weights)
        
        # 加权求和: [batch, heads, seq, seq] @ [batch, heads, seq, head_dim]
        # -> [batch, heads, seq, head_dim]
        attn_output = torch.matmul(attn_weights, v)
        
        # ============================================
        # Step 4: 合并多头并输出投影
        # ============================================
        
        # transpose(1, 2): [batch, heads, seq, dim] -> [batch, seq, heads, dim]
        attn_output = attn_output.transpose(1, 2)
        
        # contiguous(): 确保内存连续,view 需要连续内存
        # view: [batch, seq, heads, head_dim] -> [batch, seq, d_model]
        attn_output = attn_output.contiguous().view(batch_size, seq_len, self.d_model)
        
        # 最终线性投影: [batch, seq, d_model] -> [batch, seq, d_model]
        output = self.o_proj(attn_output)
        
        return output
    
    def forward(
        self, 
        x: torch.Tensor, 
        mask: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        """
        Transformer Block 前向传播(预归一化结构)
        
        Args:
            x: 输入 [batch, seq, d_model]
            mask: 注意力掩码
        
        Returns:
            输出 [batch, seq, d_model]
        """
        # ============================================
        # 子层 1: 多头自注意力 + 残差连接
        # ============================================
        
        # 预归一化: LayerNorm 在 Attention 之前(现代架构常用)
        # 比后归一化更稳定,梯度流动更顺畅
        ln1_out = self.ln1(x)
        
        # 注意力计算
        attn_out = self.attention(ln1_out, mask)
        
        # 残差连接: 输入 + 子层输出(梯度高速公路)
        x = x + attn_out
        
        # ============================================
        # 子层 2: FFN + 残差连接
        # ============================================
        
        # 预归一化
        ln2_out = self.ln2(x)
        
        # FFN: 两个线性层夹激活函数
        ffn_out = self.ffn(ln2_out)
        
        # 残差连接
        x = x + ffn_out
        
        return x


# ============================================
# Usage example and test
# ============================================

def test_lora_implementation():
    """Test the LoRA implementation"""
    
    # Configuration
    batch_size = 2
    seq_len = 128
    d_model = 768
    n_heads = 12
    rank = 16
    
    # Build the model
    block = LoRATransformerBlock(
        d_model=d_model,
        n_heads=n_heads,
        rank=rank,
        lora_alpha=32,  # commonly set to 2 * rank
        dropout=0.1,
    )
    
    # Random input
    x = torch.randn(batch_size, seq_len, d_model)
    
    # Causal mask (prevents attending to future positions)
    # lower-triangular matrix of shape [1, 1, seq, seq]
    causal_mask = torch.tril(torch.ones(seq_len, seq_len))
    causal_mask = causal_mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq, seq]
    
    # Forward pass
    output = block(x, mask=causal_mask)
    
    print(f"Input shape:  {x.shape}")
    print(f"Output shape: {output.shape}")
    print("Parameter counts:")
    
    # Count trainable parameters
    total_params = sum(p.numel() for p in block.parameters())
    trainable_params = sum(p.numel() for p in block.parameters() if p.requires_grad)
    
    print(f"Total parameters:     {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"Trainable fraction:   {trainable_params/total_params*100:.4f}%")
    
    # Verify the output shape
    assert output.shape == x.shape, "output shape should match the input"
    print("\nTest passed!")


if __name__ == "__main__":
    test_lora_implementation()
```
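Building on the classes above, one more sketch verifies the "no inference latency" claim from the table earlier: after (simulated) training, the LoRA update can be merged back into a plain Linear and the outputs should match. This function relies on the LinearWithLoRA class defined above; the seed and the fake post-training values for B are only for illustration.

```python
def test_merge_and_unload():
    """Check that merging LoRA into the base Linear preserves the output."""
    torch.manual_seed(0)

    base = nn.Linear(768, 768)
    wrapped = LinearWithLoRA(base, rank=8, lora_alpha=16)

    # Simulate a trained state: give B non-zero values so that ΔW != 0
    nn.init.normal_(wrapped.lora.lora_B, std=0.02)

    x = torch.randn(2, 16, 768)
    out_wrapped = wrapped(x)              # original layer + LoRA branch
    merged = wrapped.merge_and_unload()   # fold LoRA into a plain nn.Linear
    out_merged = merged(x)

    assert torch.allclose(out_wrapped, out_merged, atol=1e-5)
    print("Outputs match after merging: no extra cost at inference time")
```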

Visualizing the flow

```python
# Input X: [batch, seq, in_features]
# W: [out_features, in_features] = [768, 768]
# A: [rank, in_features] = [4, 768]
# B: [out_features, rank] = [768, 4]

# Standard linear layer
output = X @ W.T  # [batch,seq,768] @ [768,768] = [batch,seq,768]

# LoRA branch
lora_out = X @ A.T @ B.T
# Step 1: [batch,seq,768] @ [768,4] = [batch,seq,4]
# Step 2: [batch,seq,4] @ [4,768] = [batch,seq,768]

# Same shape: both are [batch, seq, 768]
final_out = output + lora_out * scaling
```

```
Input X: [batch, seq, d_model=768]
 ├──► Q_proj (+ LoRA)  ──► [batch, seq, 768] ──┐
 ├──► K_proj (frozen)  ──► [batch, seq, 768] ──┼──► Multi-Head Split
 └──► V_proj (+ LoRA)  ──► [batch, seq, 768] ──┘    view + transpose(1,2)
                                                         │
                                                         ▼
                                              Q: [2, 12, 512, 64]
                                              K: [2, 12, 512, 64]
                                              V: [2, 12, 512, 64]
                                                         │
                                                         ▼
             Q @ K^T ──► Scale ──► Softmax ──► @ V
             [2,12,512,512]                    [2,12,512,64]
                                                         │
                                                         ▼
             Merge Heads ──► O_proj ──► Output
             [2,512,768]                 [2,512,768]
```

| Dimension | Meaning | Where it comes from |
|-----------|---------|---------------------|
| 2 | batch size | the input batch |
| 12 | n_heads (number of attention heads) | model configuration (e.g. 12 heads) |
| 512 | seq_len (sequence length) | number of input tokens |
| 64 | head_dim (dimension per head) | d_model / n_heads = 768 / 12 |
android·学习