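Choosing the Objective: CLM Reigns, MLM Retreats, FIM Fills the Gap, plus Multi-Token and Multilingual Training. Five Debates on Pretraining Objectives

Abstract

The pretraining objective determines what a model learns. This article looks at five angles: why CLM is an engineering inevitability, where MLM still applies, what FIM contributes to code completion, how multi-token prediction strengthens training, and how multilingual objectives are balanced. It gives source-level implementations and an enterprise-grade decision framework for choosing an objective.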

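1. CLM: The Engineering Inevitability of Causal Language Modeling

CLM (Causal Language Modeling) predicts the next token and is the dominant pretraining objective for large language models. Its win is no accident: it stacks three engineering dividends, namely train/inference consistency, parallel efficiency, and stable scaling.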
[Figure: flowchart "CLM advantages". Node labels: train/inference consistency (autoregressive); full-position parallelism (teacher forcing); stable scaling (loss decreases monotonically); no train/inference mismatch; high training efficiency (one forward pass yields the loss at every position); no divergence at the hundred-billion-parameter scale; multi-task unification (generation doubles as understanding).]

python
# Source: PyTorch 2.5.0 / CLM loss implementation
import torch
import torch.nn.functional as F

def clm_loss(logits, labels):
    """Causal language modeling loss: predict token t+1 from position t."""
    # logits: [batch, seq, vocab], labels: [batch, seq]
    # Key point: logits[t] predicts labels[t+1]; the model's causal mask hides the future
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Cross-entropy: mean negative log-likelihood over the shifted positions
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1)
    )
    return loss

# CLM parallelism: one forward pass yields the loss at every position
# logits = model(input_ids)  # one forward pass, all positions
# loss = clm_loss(logits, input_ids)  # loss at all positions computed in parallel
# Compare MLM: only the ~15% masked positions contribute, so supervision density is about 6-7x lower

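Quantitatively: teacher forcing lets CLM compute the loss at every position in parallel, so each sequence carries roughly 6-7x more supervised positions than MLM, where only the 15% masked positions count (1 / 0.15 ≈ 6.7). This is a ratio of supervision density per token processed, not a measured wall-clock speedup. LLaMA-3 8B was trained with CLM on about 15T tokens with a steadily decreasing loss, and the article takes CLM scaling to be stable up to the hundred-billion-parameter level. Because training and inference follow the same autoregressive procedure, CLM does not degrade on generation tasks, which the author regards as the core reason Decoder-only beat Encoder-decoder.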
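The density ratio quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: `supervised_positions` is a hypothetical helper, the sequence length of 512 is an arbitrary choice, and the result counts loss-bearing positions per sequence rather than training speed.

```python
def supervised_positions(seq_len: int, objective: str, mask_prob: float = 0.15) -> float:
    """Expected number of positions that contribute to the loss in one sequence."""
    if objective == "clm":
        # Every position except the last has a next token to predict.
        return float(seq_len - 1)
    if objective == "mlm":
        # Only the randomly selected positions carry a label.
        return seq_len * mask_prob
    raise ValueError(f"unknown objective: {objective}")

seq_len = 512  # arbitrary example length
clm = supervised_positions(seq_len, "clm")
mlm = supervised_positions(seq_len, "mlm")
density_ratio = clm / mlm
print(f"CLM: {clm:.0f} positions, MLM: {mlm:.1f} positions, ratio: {density_ratio:.2f}")
```

For any reasonably long sequence the ratio approaches 1 / 0.15, which is where the "6-7x" figure comes from.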

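Limits: the causal mask prevents earlier tokens from seeing later ones, so on understanding tasks early CLM models trailed bidirectional attention (BERT). Once the scale is large enough, Decoder-only models catch up or even pull ahead on understanding tasks. That is why BERT-family models survive in vector retrieval while general-purpose large models have converged on CLM. The author also considers CLM weak at modeling long-range dependencies, which has to be compensated through scale and context length.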
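The visibility difference behind this limit is easy to see on a tiny example. The sketch below uses plain Python lists and 0/1 flags instead of attention tensors; both function names are illustrative and not taken from any library.

```python
from typing import List

def causal_mask(seq_len: int) -> List[List[int]]:
    """mask[i][j] == 1 when position i may attend to position j (only j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def bidirectional_mask(seq_len: int) -> List[List[int]]:
    """Every position may attend to every other position, as in BERT."""
    return [[1] * seq_len for _ in range(seq_len)]

print("causal:")
for row in causal_mask(4):
    print(row)
print("bidirectional:")
for row in bidirectional_mask(4):
    print(row)
```

The first row of the causal mask contains a single 1: the first token never sees anything after it, which is exactly the handicap on understanding tasks described above.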

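2. MLM: The Residual Value of Masked Language Modeling

MLM (Masked Language Modeling) randomly masks 15% of the tokens and asks the model to predict them; it is BERT's training objective. CLM has replaced it in generative models, but it remains useful for understanding tasks and vector representations.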
[Figure: flowchart "MLM characteristics". Node labels: bidirectional attention (sees context on both sides); 15% masking (sparse loss); train/inference mismatch; strong on understanding tasks (global information); low training efficiency (signal density about 1/7); degraded generation (no autoregression); surviving use cases: vector retrieval and classification.]

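python
# Source: HuggingFace Transformers / BERT MLM implementation
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels, mask_token_id):
    """Masked language modeling loss: only masked positions contribute."""
    # Non-masked positions in labels are -100 and are ignored automatically
    # Only the ~15% selected positions carry a real label
    # (mask_token_id is unused here; it is kept so the signature stays unchanged)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100  # skip non-masked positions
    )
    return loss

def apply_mlm_mask(input_ids, vocab_size, mask_prob=0.15, mask_token_id=103):
    """Apply BERT-style MLM masking; returns (masked_input_ids, labels)."""
    # 103 is [MASK] in bert-base-uncased; pass your own tokenizer's mask id
    labels = input_ids.clone()
    input_ids = input_ids.clone()  # do not mutate the caller's tensor
    device = input_ids.device
    # Select positions: 15% of tokens
    mask = torch.rand(input_ids.shape, device=device) < mask_prob
    # 80% of the selected positions become [MASK]
    mask_mask = mask & (torch.rand(input_ids.shape, device=device) < 0.8)
    input_ids[mask_mask] = mask_token_id
    # 10% become a random token (half of the remaining 20%)
    mask_random = mask & ~mask_mask & (torch.rand(input_ids.shape, device=device) < 0.5)
    random_tokens = torch.randint(0, vocab_size, input_ids.shape, device=device)
    input_ids[mask_random] = random_tokens[mask_random]
    # The last 10% keep the original token
    # Labels at non-selected positions are set to -100
    labels[~mask] = -100
    return input_ids, labels

# MLM vs CLM supervision density:
# CLM: 100% of positions contribute to the loss, density 1.0
# MLM: 15% of positions contribute to the loss, density 0.15
# CLM is about 6-7x denser (1.0 / 0.15)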

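Quantitatively: MLM computes the loss on only 15% of positions, so its training-signal density is about 1/7 that of CLM and, by the same measure, its training efficiency is 6-7x lower. By the article's figures, bidirectional attention still scores 3-5 points higher than CLM on understanding tasks such as classification and extraction, which is why BERT survives in vector retrieval. MLM trains on [MASK]-corrupted inputs that never appear at inference and has no autoregressive decoding path, so generation degrades badly and it is unsuitable for generative models.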

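Limits: the 15% masking ratio is an empirical compromise: too high breaks semantic continuity, too low leaves too little training signal. The 80/10/10 strategy (80% [MASK], 10% random token, 10% unchanged) is the configuration BERT adopted. MLM suits Encoder-only models on understanding tasks and should not be used for generative models. For vector retrieval, BERT-family models remain the first choice in the author's view, because bidirectional attention yields more global representations.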
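To make the 15% selection and the 80/10/10 split concrete, here is a small pure-Python simulation on string tokens. It is a sketch under stated assumptions: `bert_style_mask` is a hypothetical helper, the vocabulary is synthetic, and it uses a single random draw per selected token rather than the two-draw tensor logic shown earlier.

```python
import random

MASK = "[MASK]"

def bert_style_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels are None where no loss is taken."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)                # not selected: unchanged, no label
            labels.append(None)
            continue
        labels.append(tok)                       # selected: the model must recover tok
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)               # 80%: replace with the mask symbol
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels

vocab = [f"w{i}" for i in range(1000)]           # synthetic vocabulary
rng = random.Random(0)
tokens = [rng.choice(vocab) for _ in range(100_000)]
corrupted, labels = bert_style_mask(tokens, vocab, rng=rng)

selected = [i for i, lab in enumerate(labels) if lab is not None]
masked = sum(corrupted[i] == MASK for i in selected)
print(f"selected: {len(selected) / len(tokens):.3f}")
print(f"[MASK] share of selected: {masked / len(selected):.3f}")
```

Overall about 12% of all tokens end up as [MASK], 1.5% as random tokens, and 1.5% are selected but left unchanged.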

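3. FIM: Fill-in-the-Middle and Code Completion

FIM (Fill-In-the-Middle) splits a document into prefix | middle | suffix and reorders it as <PRE> prefix <SUF> suffix <MID> middle <EOT>, so that the model learns to complete the middle from the context on both sides. It is the key objective for code completion.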

python
# Source: BigCode / FIM implementation
import random

class FIMTransformer:
    """FIM: fill-in-the-middle data reordering."""
    def __init__(self, fim_rate=0.5, prefix_token='<fim_prefix>',
                 middle_token='<fim_middle>', suffix_token='<fim_suffix>'):
        self.fim_rate = fim_rate  # apply FIM to 50% of documents
        self.prefix_tok = prefix_token
        self.middle_tok = middle_token
        self.suffix_tok = suffix_token

    def transform(self, document):
        """Convert a document to FIM format."""
        if random.random() > self.fim_rate:
            return document  # non-FIM documents stay unchanged
        # Random split points, taken at character level so that newlines and
        # indentation survive (whitespace splitting would flatten code layout)
        n = len(document)
        # Split into prefix | middle | suffix
        split1 = random.randint(0, n // 2)
        split2 = random.randint(n // 2, n)
        prefix = document[:split1]
        middle = document[split1:split2]
        suffix = document[split2:]
        # Reorder to PSM: <fim_prefix> prefix <fim_suffix> suffix <fim_middle> middle
        # (the end-of-text token is assumed to be appended at tokenization time)
        fim_doc = f"{self.prefix_tok}{prefix}{self.suffix_tok}{suffix}{self.middle_tok}{middle}"
        return fim_doc

    def batch_transform(self, documents):
        """Batch transform: roughly 50% FIM + 50% original."""
        return [self.transform(doc) for doc in documents]

# Effect of FIM on code completion (figures as cited in this article):
# - code completion accuracy +5%
# - general tasks -2% (reordering introduces noise)
# - net benefit: worthwhile only for code models; general models drop it
# LLaMA reportedly dropped FIM after trials: +5% on code did not offset -2% on general tasks
# DeepSeek-Coder adopts FIM: a code model focuses on code ability

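Quantitatively: by the article's figures, FIM improves code completion accuracy by 5% but costs 2% on general tasks. LLaMA reportedly dropped FIM after trying it (net benefit negative), while DeepSeek-Coder kept it because a code model prioritizes code. The 50% FIM rate is a balance: a higher rate hurts general ability too much, a lower rate makes the completion gain insignificant.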
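How the reordered format is used at inference time may be worth spelling out. The sketch below assumes the three sentinel strings from the FIM code in this section and assumes they never occur inside the document itself; `to_psm`, `infill_prompt`, and `from_psm` are hypothetical helper names.

```python
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Training sample in prefix-suffix-middle (PSM) order."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def infill_prompt(prefix: str, suffix: str) -> str:
    """Inference prompt: stop after the middle sentinel; the model writes the rest."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"

def from_psm(sample: str) -> str:
    """Recover the original left-to-right document from a PSM sample."""
    body = sample[len(PRE):]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix

code = "def add(a, b):\n    return a + b\n"
# The cursor sits after "return a"; everything after it is the suffix.
prefix, middle, suffix = code[:15], code[15:27], code[27:]
sample = to_psm(prefix, middle, suffix)
print(repr(infill_prompt(prefix, suffix)))
print(from_psm(sample) == code)
```

Because the middle comes last, ordinary left-to-right decoding produces it, which is why FIM needs no change to the CLM loss itself.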

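Limits: in the author's assessment FIM only pays off for code, and general-purpose models should not use it. FIM reordering introduces noise, so it has to be mixed with untouched CLM data (50% FIM + 50% original). The choice of split points matters: splitting code at AST node boundaries works better than random splitting but is more complex to implement. Real completion quality has to be validated in an actual IDE environment, not only on synthetic benchmarks.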
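One way to approximate AST-aligned splitting for Python source is to take a whole statement as the middle, using the standard-library `ast` module. This is a rough sketch rather than the method of any particular code model: `ast_aligned_split` is a hypothetical name, it works at line granularity, and it only handles Python input that parses.

```python
import ast
import random

def ast_aligned_split(source: str, rng: random.Random):
    """Split source into (prefix, middle, suffix) where middle is one whole statement."""
    tree = ast.parse(source)
    # Every statement node carries 1-based lineno / end_lineno (Python 3.8+).
    stmts = [node for node in ast.walk(tree) if isinstance(node, ast.stmt)]
    if not stmts:
        raise ValueError("source contains no statements")
    node = rng.choice(stmts)
    lines = source.splitlines(keepends=True)
    start, end = node.lineno - 1, node.end_lineno
    return "".join(lines[:start]), "".join(lines[start:end]), "".join(lines[end:])

source = (
    "def area(w, h):\n"
    "    size = w * h\n"
    "    return size\n"
    "\n"
    "print(area(2, 3))\n"
)
prefix, middle, suffix = ast_aligned_split(source, random.Random(0))
print(repr(middle))
```

The middle is always a syntactically complete unit (a function, an assignment, a return, and so on), which is the property random character splits lack. A production pipeline would also need a fallback for files that fail to parse.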

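4. Multi-Token Prediction: Planning Ability at Training Time

MTP (Multi-Token Prediction) predicts several future tokens at once, forcing the model to encode longer-horizon planning information in its hidden states. DeepSeek-V3 uses it as a training-time enhancement.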
[Figure: flowchart comparing standard CLM with MTP training. Standard CLM: predicts t+1; learns only local coherence. MTP training: predicts t+1, t+2, ..., t+k simultaneously; hidden states encode planning; reasoning ability +5-10%; training cost +15%; by-product: speculative decoding.]

python
# Source: DeepSeek-V3 / MTP implementation
# Note: this version uses independent linear heads; DeepSeek-V3 itself chains sequential MTP modules
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPrediction(nn.Module):
    """MTP: predict the next k+1 tokens at once."""
    def __init__(self, d_model, vocab_size, k=1):
        super().__init__()
        self.k = k
        # k+1 output heads, one per offset
        self.heads = nn.ModuleList([
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(k + 1)
        ])
        # Loss weights: nearer offsets weigh more; halved per offset, [1.0, 0.5] when k=1
        self.loss_weights = [0.5 ** i for i in range(k + 1)]

    def forward(self, hidden, labels):
        """hidden: [B, S, d], labels: [B, S]"""
        losses = []
        for i, head in enumerate(self.heads):
            # Head i predicts token t+i+1 from position t
            # (i=0 is standard next-token prediction, larger i looks further ahead)
            logits = head(hidden[:, :-(i+1), :])
            target = labels[:, i+1:]
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                target.reshape(-1)
            )
            losses.append(self.loss_weights[i] * loss)
        # Normalize by the weight sum so the scale matches a single CLM loss
        total_loss = sum(losses) / sum(self.loss_weights)
        return total_loss

# The two-fold value of MTP (figures as cited in this article):
# 1. Training enhancement: forces hidden states to encode planning, reasoning +5-10%
# 2. Speculative decoding: MTP heads act as the draft model and propose several tokens in parallel
# DeepSeek-V3: k=1, training cost +15%, reasoning +5-10%
# Speculative decoding: acceptance rate 60-70%, inference throughput +1.5-2x

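Quantitatively: by the article's figures, DeepSeek-V3 uses MTP with k=1, which raises training cost by 15% and improves reasoning tasks by 5-10%. When the MTP heads serve as the draft for speculative decoding, the acceptance rate is 60-70% and inference throughput rises 1.5-2x. With k=2 the training cost grows by 30% with diminishing returns, so k=1 offers the best cost/benefit.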
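The link between acceptance rate and throughput can be sketched with the usual geometric-sum estimate for speculative decoding. The model below is a simplification under explicit assumptions: each draft token is accepted independently with the same probability, the main model always contributes one token per verification pass, and `overhead` is a hypothetical knob for the extra cost of drafting and verifying.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int = 1) -> float:
    """Expected tokens emitted per verification pass.

    The main model always yields one token; draft token i only counts if all
    earlier draft tokens were accepted, hence the geometric sum.
    """
    return sum(accept_rate ** i for i in range(draft_len + 1))

def speedup(accept_rate: float, draft_len: int = 1, overhead: float = 0.0) -> float:
    """Throughput relative to plain decoding; overhead is the extra cost per pass."""
    return expected_tokens_per_step(accept_rate, draft_len) / (1.0 + overhead)

for p in (0.6, 0.7):
    print(f"accept={p}: tokens/step={expected_tokens_per_step(p):.2f}, "
          f"speedup at 10% overhead={speedup(p, overhead=0.1):.2f}")
```

With one draft token the estimate is simply 1 + p, so a 60-70% acceptance rate gives at most about 1.6-1.7 tokens per step before overhead, which sits inside the 1.5-2x range quoted above.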

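Limits: the MTP loss weights need tuning: if the near-term weight dominates, training degenerates into standard CLM; if the far-term weights are too high, training becomes unstable. MTP heads must share the main model's hidden states and cannot be trained independently. The speculative-decoding gain depends on the heads' prediction accuracy; below about 50%, verification overhead cancels the parallel gain. MTP suits models with heavy reasoning workloads, and pure chat models gain little.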
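Since every head reads the same hidden states, the only thing that changes between heads is which future token each position is paired with. The toy example below shows that alignment on a list of words; `mtp_pairs` is an illustrative helper that mirrors the `hidden[:, :-(i+1)]` versus `labels[:, i+1:]` slicing in the MTP code of this section.

```python
def mtp_pairs(tokens, offset):
    """Pairs (token at the input position, target token) for the head with this offset.

    offset=0 is ordinary next-token prediction; offset=1 looks one token further.
    """
    shift = offset + 1
    return list(zip(tokens[:-shift], tokens[shift:]))

tokens = ["The", "cat", "sat", "on", "mats"]
print(mtp_pairs(tokens, 0))
print(mtp_pairs(tokens, 1))
```

Each extra offset loses one training pair at the end of the sequence, so a sequence of length S gives S - 1 pairs to the first head and S - 2 to the second.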

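5. Balancing Strategies for Multilingual Objectives

A multilingual training objective is not a simple mix: the corpus ratios have to be balanced so that dominant languages do not drown out low-resource ones.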
[Figure: flowchart "Multilingual objective". Node labels: corpus ratios; vocabulary balance; deduplication strategy; English 40-50% (dominant); Chinese 25-35%; others 10-20%; multilingual vocabulary covering every language; cross-lingual dedup of translated duplicates; keep dominant languages from drowning out the rest; upsample low-resource languages.]

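python
# Source: multilingual training balance / 2024
import collections

class MultilingualBalancer:
    """Balance multilingual training data."""
    # Note: detect_language, encode_multilingual and cosine_sim are not defined
    # in this excerpt; a subclass has to provide them.
    def __init__(self, target_ratios):
        # Target ratios keep dominant languages from drowning out low-resource ones
        self.target_ratios = target_ratios  # {'en': 0.45, 'zh': 0.30, 'ja': 0.05, ...}

    def balance(self, documents):
        """Balance the data according to the target ratios."""
        # 1. Group documents by detected language
        lang_docs = collections.defaultdict(list)
        for doc in documents:
            lang = self.detect_language(doc)
            lang_docs[lang].append(doc)
        # 2. Sample according to the ratios
        balanced = []
        total_target = sum(self.target_ratios.values())
        for lang, ratio in self.target_ratios.items():
            docs = lang_docs.get(lang, [])
            n_select = int(len(documents) * ratio / total_target)
            if len(docs) < n_select:
                # Upsample the low-resource language (repeated sampling)
                balanced.extend(self.upsample(docs, n_select))
            else:
                balanced.extend(docs[:n_select])
        return balanced

    def upsample(self, docs, target_count):
        """Upsample a low-resource language by repeating its documents."""
        if not docs:
            return []  # nothing to repeat; also avoids an infinite loop
        upsampled = []
        while len(upsampled) < target_count:
            upsampled.extend(docs)
        return upsampled[:target_count]

    def cross_lingual_dedup(self, documents):
        """Cross-lingual dedup: remove translated duplicates."""
        # Detect translation pairs with multilingual embeddings
        embeddings = self.encode_multilingual(documents)
        unique = []
        seen = []
        for doc, emb in zip(documents, embeddings):
            # Check for an already seen document with high similarity
            # (pairwise comparison, O(n^2); large corpora need an index)
            is_duplicate = any(
                self.cosine_sim(emb, seen_emb) > 0.9
                for seen_emb in seen
            )
            if not is_duplicate:
                unique.append(doc)
                seen.append(emb)
        return unique

# Ratio strategy:
# English 45%, Chinese 30%, code 15%, others 10%
# Upsample low-resource languages (Japanese/Korean/French) so they are not drowned out
# Cross-lingual dedup removes translated duplicates (saves 5-10% of the data)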

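Quantitatively: by the article's figures, when English exceeds 60% of a multilingual corpus, low-resource languages (Japanese/Korean) lose 15-20 points. Upsampling them compensates, but over-upsampling leads to memorization of repeated text. Cross-lingual dedup removes 5-10% of the data as translated duplicates. The article credits Qwen2.5's multilingual balance with a 10-15 point lead over English-only models on Japanese/Korean tasks.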
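The tension between upsampling and repetition can be quantified. The sketch below swaps in exponent-smoothed sampling (probability proportional to count ** alpha), a common alternative to hand-set target ratios and not what this article's balancer does. The token counts are made-up toy numbers, and `alpha=0.5` is an arbitrary illustrative choice.

```python
def smoothed_ratios(token_counts: dict, alpha: float = 0.5) -> dict:
    """Sampling probability proportional to count ** alpha (0 < alpha <= 1)."""
    weights = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

def repeat_factor(token_counts: dict, ratios: dict) -> dict:
    """How often each language's data is seen when the total budget equals the corpus size."""
    total = sum(token_counts.values())
    return {lang: ratios[lang] * total / n for lang, n in token_counts.items()}

counts = {"en": 8100, "zh": 1600, "ja": 100, "ko": 100}  # toy counts, hypothetical
natural = {lang: n / sum(counts.values()) for lang, n in counts.items()}
smoothed = smoothed_ratios(counts, alpha=0.5)
repeats = repeat_factor(counts, smoothed)

for lang in counts:
    print(f"{lang}: natural={natural[lang]:.3f} smoothed={smoothed[lang]:.3f} "
          f"repeats={repeats[lang]:.1f}x")
```

A repeat factor well above 1 for a small language is the signal to watch: it is the point at which upsampling starts turning into memorization of the same documents.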

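Limits: upsampling low-resource languages is a double-edged sword: it restores ability but raises the risk of repetition, so it must be paired with dedup. The similarity threshold for cross-lingual dedup needs tuning: 0.9 suits translation pairs, while 0.7 wrongly removes documents that are merely on the same topic. The multilingual vocabulary must cover every language, otherwise low-resource languages tokenize inefficiently. Code is treated as a "language" of its own with an independent ratio.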
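The effect of the threshold is easy to reproduce with toy vectors. In the sketch below the 2-D "embeddings" are hand-picked stand-ins rather than outputs of a real multilingual encoder, and `greedy_dedup` is a hypothetical helper that follows the same keep-if-not-similar logic as the dedup routine in this section.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def greedy_dedup(items, threshold):
    """Keep an item only if it is not too similar to any item kept so far."""
    kept = []
    for name, emb in items:
        if all(cosine_sim(emb, kept_emb) <= threshold for _, kept_emb in kept):
            kept.append((name, emb))
    return [name for name, _ in kept]

items = [
    ("en_original", (1.0, 0.0)),
    ("zh_translation", (0.95, 0.31)),  # cosine to the original is about 0.95
    ("same_topic", (0.8, 0.6)),        # cosine to the original is 0.80
]
print(greedy_dedup(items, 0.9))
print(greedy_dedup(items, 0.7))
```

At 0.9 only the translation is removed; at 0.7 the merely related document is removed as well, which is the false-positive case the paragraph above warns about.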

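6. Denoising Objectives: The Prefix-LM and UL2 Explorations

Pure CLM's causal mask limits understanding, and pure MLM cannot generate. Prefix-LM and UL2 are two compromise designs that try to serve both understanding and generation.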
[Figure: flowchart "Denoising objective explorations". Node labels: Prefix-LM (bidirectional prefix + causal generation); UL2 (unified denoising framework); bidirectional attention over the prefix; causal attention over the generated part; serves both understanding and generation; mixes several denoising modes; unifies different objectives; status: both displaced by pure CLM; the complexity is not worth the gain.]

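python
# Source: Prefix-LM / GLM implementation
import random

import torch


class PrefixLMMask:
    """Prefix-LM: bidirectional prefix + causal generation."""

    def create_mask(self, seq_len, prefix_len):
        """Build the mixed additive mask (0 = visible, -inf = blocked)."""
        mask = torch.full((seq_len, seq_len), float('-inf'))
        # 1. Prefix rows (row < prefix_len): bidirectional, see every prefix column
        mask[:prefix_len, :prefix_len] = 0
        # 2. Generation rows (row >= prefix_len): causal
        for i in range(prefix_len, seq_len):
            mask[i, :i + 1] = 0  # sees the prefix plus itself and earlier positions
        return mask


# UL2: unified denoising
class UL2Objective:
    """UL2: mixture of several denoising modes."""

    def __init__(self):
        # Simplified configuration. In the UL2 paper the S-denoiser is sequential
        # (prefix-LM style) denoising rather than a medium-span mode.
        self.modes = [
            {'mask_ratio': 0.15, 'span_len': 2, 'name': 'R-Denoiser'},   # short spans
            {'mask_ratio': 0.5, 'span_len': 8, 'name': 'S-Denoiser'},    # medium spans
            {'mask_ratio': 0.5, 'span_len': 32, 'name': 'X-Denoiser'},   # long spans
        ]

    def sample_mode(self):
        """Pick a denoising mode at random."""
        return random.choice(self.modes)

    def apply_denoising(self, text, mode):
        """Apply span masking; returns (masked_text, targets)."""
        tokens = text.split()
        # Clamp so short inputs do not hand randint a negative upper bound
        span_len = min(mode['span_len'], len(tokens))
        n_mask = int(len(tokens) * mode['mask_ratio'])
        # Sample spans until the mask budget is covered (spans may overlap)
        spans = []
        while len(spans) * span_len < n_mask:
            start = random.randint(0, len(tokens) - span_len)
            spans.append((start, span_len))
        # Mask and record targets; targets are read from the original tokens so
        # an overlapping span never captures an already-written '<mask>'
        masked = list(tokens)
        targets = []
        for start, length in spans:
            targets.append(' '.join(tokens[start:start + length]))
            for i in range(start, start + length):
                masked[i] = '<mask>'
        return ' '.join(masked), targets

# Prefix-LM was dropped by GLM-4, which returned to pure CLM
# UL2 was adopted in the T5 family but never became mainstream
# Shared conclusion: the complexity is not worth the gain; pure CLM plus scale is enough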

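Quantified: Prefix-LM scores 3-5 points higher than pure CLM on understanding tasks, but it is slightly worse on generation and harder to support in inference engines, and GLM-4 eventually returned to pure CLM. UL2's multi-mode denoising beats single-mode MLM by 2-3 points on T5, at the cost of a more complex training setup. Both have been displaced by pure CLM at larger scale.

Boundaries: Prefix-LM needs the prefix length fixed in advance, and the best prefix length differs by task, so it is inflexible. UL2's mode mixture adds hyperparameters to tune. Both share a poor ecosystem fit: mainstream inference engines are optimized for pure CLM, and a mixed mask needs separate adaptation. Once the model is large enough, pure CLM catches up on understanding tasks and the gain from the more complex objectives disappears.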
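
To make the inference-engine point concrete, here is a minimal sketch of the visibility pattern a Prefix-LM request needs, written in plain Python so it runs without torch. The helper names `prefix_lm_visibility` and `render` are illustrative assumptions, not part of any library. The pattern changes with `prefix_len`, which can differ per request, so an engine tuned for one fixed lower-triangular mask cannot reuse it as is.

```python
def prefix_lm_visibility(seq_len, prefix_len):
    """Return a seq_len x seq_len matrix; True means row i may attend to column j.

    Hypothetical helper, for illustration only.
    """
    visible = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if i < prefix_len:
                # Prefix rows see the whole prefix (bidirectional), nothing after it
                row.append(j < prefix_len)
            else:
                # Generation rows are causal: the prefix plus positions up to i
                row.append(j <= i)
        visible.append(row)
    return visible


def render(visible):
    """Render the matrix as text: 'x' = visible, '.' = blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in visible)


if __name__ == "__main__":
    print(render(prefix_lm_visibility(5, 2)))
    print()
    # prefix_len = 0 degenerates to the plain causal (lower-triangular) mask
    print(render(prefix_lm_visibility(5, 0)))
```

With `prefix_len = 0` the result is the ordinary causal mask, which is one way to see pure CLM as the special case that engines already optimize for.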

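7. Contrastive learning objectives: SimCSE and vector representations

Contrastive learning is an effective objective for training vector representations and complements CLM. Methods such as SimCSE use a contrastive loss to train high-quality sentence embeddings.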
[Diagram: Contrastive learning objectives. SimCSE (sentence-embedding contrast) uses the InfoNCE loss, with positives from the same sentence under different dropout and negatives from the other sentences in the batch. Result: high-quality sentence embeddings, strong on retrieval and classification. Complementary to CLM: CLM learns generation, the contrastive objective learns representation.]

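python
# Source: SimCSE / contrastive sentence embeddings
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimCSELoss(nn.Module):
    """SimCSE: contrastive learning for sentence embeddings."""

    def __init__(self, temperature=0.05):
        super().__init__()
        self.temperature = temperature

    def forward(self, sentence_embeddings):
        """Compute the InfoNCE loss."""
        # sentence_embeddings: [2*batch, dim], two forward passes of the same
        # sentences concatenated on dim 0 (row i and row i+batch form a pair)
        # Positives: same sentence, different dropout masks
        # Negatives: the other sentences in the batch
        # Normalize, then split back into the two views
        z1, z2 = F.normalize(sentence_embeddings, dim=-1).chunk(2, dim=0)
        # Cross-view similarity matrix [batch, batch]
        sim = z1 @ z2.T / self.temperature
        # Labels: the diagonal holds the positive pairs
        labels = torch.arange(sim.size(0), device=sim.device)
        # InfoNCE loss
        return F.cross_entropy(sim, labels)


class ContrastivePreTrainer(nn.Module):
    """Contrastive pretraining as a complement to CLM."""

    def __init__(self, model, clm_weight=0.7, contrast_weight=0.3):
        super().__init__()
        self.model = model  # assumed to return logits and expose get_hidden()
        self.clm_weight = clm_weight
        self.contrast_weight = contrast_weight
        self.contrast_loss = SimCSELoss()

    @staticmethod
    def clm_loss(logits, input_ids):
        """Next-token cross-entropy: position t predicts token t+1."""
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )

    @staticmethod
    def mean_pool(hidden):
        """Average hidden states over the sequence dimension (ignores padding)."""
        return hidden.mean(dim=1)

    def forward(self, input_ids):
        # 1. CLM loss (generation ability)
        logits = self.model(input_ids)
        clm_loss = self.clm_loss(logits, input_ids)
        # 2. Contrastive loss (representation ability)
        # Mean-pooled sentence vectors; dropout must be active (train mode)
        emb1 = self.mean_pool(self.model.get_hidden(input_ids))
        emb2 = self.mean_pool(self.model.get_hidden(input_ids))  # different dropout
        embeddings = torch.cat([emb1, emb2], dim=0)  # [2*batch, dim]
        contrast_loss = self.contrast_loss(embeddings)
        # Weighted combination
        return self.clm_weight * clm_loss + self.contrast_weight * contrast_loss

# SimCSE results:
# Sentence-embedding retrieval accuracy +15-20 points (vs mean pooling over pure CLM)
# Temperature τ=0.05 works best: too large flattens toward uniform, too small raises gradient variance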

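Quantified: sentence embeddings trained with SimCSE score 15-20 points higher on retrieval than mean pooling over a pure CLM. Temperature τ=0.05 works best: a larger value flattens the softmax toward a uniform distribution, while a smaller one makes the loss attend only to the hardest negatives and destabilizes the gradients. Contrastive learning and CLM complement each other: CLM learns generation, the contrastive objective learns representation.

Boundaries: negative-sample quality is critical. In-batch negatives may include semantically similar sentences (false negatives), which calls for explicit hard-negative mining. The contrastive and CLM loss weights have to be balanced, because too much contrastive weight hurts generation. Contrastive learning suits vector-retrieval scenarios; a pure generation model does not necessarily need it.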
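
The temperature effect described above can be checked with a few lines of plain Python. This is a minimal sketch for a single anchor; the similarity values and the helper names `softmax_with_temperature` and `info_nce` are assumptions made for illustration. In InfoNCE, each negative's gradient is proportional to its softmax probability, so the printed probabilities show where the training signal goes.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over cosine similarities scaled by 1 / temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce(similarities, positive_index, temperature):
    """InfoNCE loss for one anchor: -log p(positive)."""
    probs = softmax_with_temperature(similarities, temperature)
    return -math.log(probs[positive_index])


if __name__ == "__main__":
    # Hypothetical cosine similarities: index 0 is the positive,
    # index 1 is a hard negative, the rest are easy negatives.
    sims = [0.80, 0.70, 0.10, 0.05]
    for tau in (0.01, 0.05, 1.0):
        probs = softmax_with_temperature(sims, tau)
        print(tau, [round(p, 3) for p in probs], round(info_nce(sims, 0, tau), 3))
```

At a large temperature the four probabilities are nearly equal, so every negative contributes about the same weak signal. At a very small temperature almost all of the negative probability mass sits on the single hardest negative, which is exactly where a false negative does the most damage.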

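8. Boundaries and failure modes

Failures in choosing a pretraining objective usually come from misjudging how well the objective matches the task, or from underestimating the side effects of FIM and MTP.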
[Diagram: Objective selection by task type. Generation + dialogue: CLM is the only choice. Code completion: CLM + FIM. Vector retrieval: MLM (BERT family). Reasoning enhancement: CLM + MTP. All branches then lead to: monitor loss and downstream tasks, check that metrics are met, review the objective, keep optimizing.]

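python
# Source: pretraining objective failure diagnosis / 2024
def diagnose_objective_failure(task_performance, objective, task_type, baseline=None):
    """Diagnose objective-selection problems.

    task_performance: dict of scores, e.g. {'retrieval': 70, 'reasoning': 55}
    objective: string such as 'clm', 'mlm', 'clm+fim', 'clm+mtp'
    baseline: reference retrieval score to compare against (optional)
    """
    if objective == 'mlm' and task_type == 'generation':
        return {'issue': 'MLM used for a generation task', 'action': 'switch to CLM'}
    if objective == 'clm' and task_type == 'retrieval':
        retrieval = task_performance.get('retrieval', float('inf'))
        if baseline is not None and retrieval < baseline * 0.85:
            return {'issue': 'CLM is weak at retrieval', 'action': 'add MLM pretraining or use BERT'}
    if 'fim' in objective and task_type != 'code':
        return {'issue': 'FIM used for a non-code task', 'action': 'remove FIM'}
    if task_performance.get('reasoning', float('inf')) < 60 and 'mtp' not in objective:
        return {'issue': 'insufficient reasoning ability', 'action': 'add the MTP training objective'}
    return {'issue': 'healthy'}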

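Typical failure modes

  1. MLM used for generation degrades badly: one team built a dialogue system on a BERT architecture and generation quality was very poor. Switching to CLM fixed it.
  2. FIM on a general model costs capability: adding FIM to a non-code model lowered general tasks by 2 points. Use FIM only for code models.
  3. An unbalanced language mix drowns low-resource languages: with English at 70%, Japanese tasks dropped 20 points. Upsample the low-resource languages.
  4. Mis-set MTP loss weights destabilize training: with the far-horizon weight too high, the loss oscillates. Put most of the weight on the near-horizon prediction.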

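8.1 Retrospective: FIM backfires on a general model

One team added a FIM objective to a general-purpose 7B model, expecting better code ability. Code tasks gained only 3 points while general tasks lost 2, a net gain of just 1 point, too small to justify the objective.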

python
# Source: FIM on a general model, retrospective / 2024
def evaluate_fim_tradeoff(code_perf, general_perf, baseline):
    """Evaluate the net benefit of FIM."""
    code_gain = code_perf - baseline['code']
    general_loss = baseline['general'] - general_perf
    net = code_gain - general_loss
    return {
        'code_gain': code_gain,
        'general_loss': general_loss,
        'net_benefit': net,
        'recommendation': 'keep FIM' if net > 3 else 'remove FIM'
    }

# Measured: code_gain=3, general_loss=2, net=1 (not worth it)
# Conclusion: remove FIM from general models; use it only for code models
# The LLaMA team reportedly reached the same conclusion from the same experiment and dropped FIM

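Quantified: on a general model FIM gives +3 on code and -2 on general tasks, a net gain of only 1 point, which is not worth it. The LLaMA team reportedly dropped FIM after the same experiment. Only code-specialized models (DeepSeek-Coder) use FIM, because code ability carries enough weight there to accept the general-task loss.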
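
The last point, that a code model can accept the general-task loss because code ability weighs more, can be written as a weighted version of the same trade-off. This is a sketch under an assumed weighting scheme: `weighted_net_benefit` and the weights 0.5 and 0.9 are hypothetical and were not measured by the team.

```python
def weighted_net_benefit(code_gain, general_loss, code_weight):
    """Net benefit when code and general ability are weighted by product priority.

    code_weight is in [0, 1]; general ability gets 1 - code_weight.
    The weighting scheme is an illustrative assumption.
    """
    if not 0.0 <= code_weight <= 1.0:
        raise ValueError("code_weight must be within [0, 1]")
    return code_weight * code_gain - (1.0 - code_weight) * general_loss


if __name__ == "__main__":
    # Figures from the retrospective: code +3, general -2
    for label, weight in (("general model", 0.5), ("code model", 0.9)):
        print(label, weighted_net_benefit(3, 2, weight))
```

The same +3 and -2 produce a much larger net benefit once code ability dominates the weighting, which is the reasoning behind keeping FIM in code-specialized models only.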

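8.2 Retrospective: unbalanced multilingual mix

One team trained a multilingual model with 70% English data and only 5% Japanese. Accuracy on Japanese tasks was just 45%, far below the 85% reached on English.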

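python
# Source: multilingual mix retrospective / 2024
def diagnose_multilingual_imbalance(perf_by_lang, ratio_by_lang):
    """Diagnose an unbalanced multilingual mix."""
    issues = []
    for lang, perf in perf_by_lang.items():
        ratio = ratio_by_lang.get(lang, 0)
        # Flag languages that are both under-represented and under-performing
        if ratio < 0.1 and perf < 60:
            issues.append({
                'lang': lang,
                'issue': f'{lang} share {ratio:.0%} is too low, accuracy {perf}%',
                'action': f'upsample {lang} to 15-20%'
            })
    return issues

# Fix: Japanese share 5% -> 15%, a 3x upsample
# After the fix: Japanese accuracy 45% -> 72% (+27 points)
# Cost: English share 70% -> 55%, English accuracy 85% -> 83% (-2 points)
# Net benefit: balancing the languages is worth it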

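Quantified: raising the Japanese share from 5% to 15% lifted Japanese accuracy from 45% to 72% (+27 points), while English dropped only 2 points. The net benefit of balancing is clear. A low-resource language needs a share of at least 10-15% to keep basic capability.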
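
One common way to arrive at an upsampled mix is exponent-based (temperature) sampling, where each language's share is raised to a power `alpha` below 1 and then renormalized. The sketch below shows that technique; it is not necessarily what the team in the retrospective did. The function name `temperature_sampling` and the 25% share for other languages are assumptions.

```python
def temperature_sampling(ratios, alpha):
    """Rescale language ratios as p_i ** alpha, then renormalize.

    alpha = 1.0 keeps the natural distribution; a smaller alpha flattens it,
    which upsamples low-resource languages.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    scaled = {lang: p ** alpha for lang, p in ratios.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


if __name__ == "__main__":
    # en and ja come from the retrospective; 'other' = 25% is an assumed remainder
    natural = {"en": 0.70, "ja": 0.05, "other": 0.25}
    for alpha in (1.0, 0.7, 0.5):
        mix = temperature_sampling(natural, alpha)
        print(alpha, {lang: round(p, 3) for lang, p in mix.items()})
```

Lowering `alpha` moves share from English to Japanese without hand-picking a target per language, so a single knob can be swept in an ablation.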

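8.3 Retrospective: mis-set MTP weights cause training oscillation

One team trained with MTP k=2 and set the far-horizon prediction weight to 0.8 (close to the near-horizon weight of 1.0). The training loss oscillated sharply between 2.5 and 3.5 and did not converge.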

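python
# Source: MTP weight tuning retrospective / 2024
import numpy as np


def diagnose_mtp_instability(loss_history, weights):
    """Diagnose unstable MTP training."""
    # Standard deviation of the loss over the last 100 steps
    loss_std = np.std(loss_history[-100:])
    if loss_std > 0.3:
        # Check the weight configuration
        far_weight = weights[-1]  # weight of the farthest prediction
        near_weight = weights[0]  # weight of the nearest prediction
        if far_weight > near_weight * 0.5:
            return {
                'issue': 'far-horizon weight too high, training oscillates',
                'action': f'far-horizon weight {far_weight} -> {near_weight * 0.25}',
                'reason': 'far-horizon prediction is hard; a high weight inflates gradient variance'
            }
    return {'issue': 'healthy'}

# Fix: weights [1.0, 0.8] -> [1.0, 0.25]
# After the fix: loss stays within 2.4-2.6 and converges normally
# Rule of thumb: far-horizon weight <= 1/4 of the near-horizon weight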

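Quantified: after lowering the far-horizon weight from 0.8 to 0.25, the loss oscillation shrank from ±0.5 to ±0.1 and training converged normally. Rule of thumb: the far-horizon prediction weight should be at most 1/4 of the near-horizon weight, because far-horizon prediction is much harder and a high weight inflates gradient variance.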
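
A simple way to respect the 1/4 rule of thumb for any number of prediction heads is a geometric decay. The sketch below is one possible scheme, not the team's actual code; `mtp_weights`, `mtp_total_loss`, and the per-head loss values are hypothetical.

```python
def mtp_weights(num_heads, decay=0.25):
    """Weight for head k (0 = next token) is decay ** k, so farther heads count less."""
    if num_heads < 1:
        raise ValueError("num_heads must be at least 1")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    return [decay ** k for k in range(num_heads)]


def mtp_total_loss(head_losses, weights):
    """Weighted sum of the per-head losses."""
    if len(head_losses) != len(weights):
        raise ValueError("need one weight per head")
    return sum(w * loss for w, loss in zip(weights, head_losses))


if __name__ == "__main__":
    # Hypothetical per-head losses; the far head is harder, so its loss is larger
    losses = [2.5, 4.0]
    print(mtp_total_loss(losses, [1.0, 0.8]))      # the unstable configuration
    print(mtp_total_loss(losses, mtp_weights(2)))  # the 1/4 rule of thumb
```

With two heads this reproduces the fixed configuration `[1.0, 0.25]`; with more heads, each step further out is scaled down by the same factor, so the noisiest predictions contribute the least to the total loss.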

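Summary

Putting pretraining objectives into practice comes down to five points: the engineering inevitability of CLM, MLM's remaining niche in understanding tasks, FIM's value for code completion, MTP's reasoning boost, and multilingual balance. CLM is the only choice for generative models (train/inference consistency, parallelism, stable scaling); MLM survives only in vector retrieval; FIM pays off only for code models (on a general model the net gain is too small); MTP strengthens reasoning and can double for speculative decoding; multilingual training needs a balanced mix so low-resource languages are not drowned out.

The key to engineering this well is a clear view of objective-task fit. Generation requires CLM; code completion adds FIM (code models only); reasoning adds MTP (k=1 gives the best cost-benefit); vector retrieval uses MLM (BERT family). Before training, build an objective-task matrix, validate the net gain of FIM and MTP with ablations, keep low-resource languages at 10-15% of the mix or more, and keep the MTP far-horizon weight at no more than 1/4 of the near-horizon weight. The pretraining objective determines what the model learns; no amount of data or compute makes up for choosing the wrong one.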
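
The objective-task matrix recommended above can start as a plain lookup table. The mapping below is taken from this summary; the dictionary layout, the task-type keys, and the `plan_objectives` helper are assumptions made for illustration.

```python
# Mapping taken from the summary above; the dict layout is an assumption.
OBJECTIVE_MATRIX = {
    "generation": ["clm"],
    "dialogue": ["clm"],
    "code_completion": ["clm", "fim"],
    "reasoning": ["clm", "mtp"],
    "retrieval": ["mlm"],
}

# Objectives that should pass an ablation before adoption
NEEDS_ABLATION = {"fim", "mtp"}


def plan_objectives(task_type):
    """Return the recommended objectives and which of them need an ablation first."""
    if task_type not in OBJECTIVE_MATRIX:
        raise KeyError(f"unknown task type: {task_type!r}")
    objectives = OBJECTIVE_MATRIX[task_type]
    return {
        "objectives": objectives,
        "ablate_first": sorted(NEEDS_ABLATION.intersection(objectives)),
    }


if __name__ == "__main__":
    for task in ("generation", "code_completion", "reasoning", "retrieval"):
        print(task, plan_objectives(task))
```

Keeping the matrix as data rather than branching logic makes it easy to review before a training run and to extend when a new task type appears.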
