Decoder-only Architecture in Depth: Why Did GPT Choose This Technical Path?

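1. Introduction

Nearly all of today's mainstream large language models (GPT-4, Claude, Llama, Tongyi Qianwen (Qwen), and others) use a Decoder-only architecture. This choice is no accident; it is a technical path validated by extensive practice. This article examines the design principles and core mechanisms of the Decoder-only architecture, its strengths and weaknesses, and how it compares with the Encoder-only and Encoder-Decoder architectures, to help you understand why the GPT series took this path.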

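2. Overview of the Three Mainstream Architectures

Before diving into Decoder-only, let's review the three Transformer variants:

Architecture	Representative models	Core characteristics	Typical applications
Encoder-only	BERT, RoBERTa	Bidirectional attention, geared toward understanding	Text classification, NER, semantic similarity
Decoder-only	GPT, Llama, Claude	Causal attention, autoregressive generation	Text generation, dialogue, code generation
Encoder-Decoder	T5, BART	Encoder for understanding + decoder for generation	Machine translation, summarization, text rewriting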


┌─────────────────────────────────────────────────────────────────────────┐
│                     Transformer Architecture Family                     │
├─────────────────────┬─────────────────────┬─────────────────────────────┤
│    Encoder-only     │    Decoder-only     │      Encoder-Decoder        │
│                     │                     │                             │
│  [Input]            │  [Input]            │  [Input]    [Output]        │
│    ↓                │    ↓                │    ↓          ↑             │
│  ┌─────┐            │  ┌─────┐            │  ┌─────┐   ┌─────┐          │
│  │Enc  │            │  │Dec  │            │  │Enc  │ → │Dec  │          │
│  │×N   │            │  │×N   │            │  │×N   │   │×N   │          │
│  └─────┘            │  └─────┘            │  └─────┘   └─────┘          │
│    ↓                │    ↓                │                             │
│  [Output]           │  [Output]           │  Separate encoder/decoder   │
│                     │                     │                             │
│  Bidirectional attn │  Causal attention   │  Bidir. enc + causal dec    │
│  Understanding tasks│  Generation tasks   │  Seq-to-seq tasks           │
└─────────────────────┴─────────────────────┴─────────────────────────────┘

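3. Core Design of the Decoder-only Architecture

3.1 Overall Structure

A Decoder-only model is a stack of N identical Decoder layers. Each layer has two core sublayers, masked multi-head self-attention and a feed-forward network; the cross-attention sublayer of the original Transformer decoder is dropped because there is no encoder to attend to: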


       Input Embedding + Positional Encoding
                    ↓
        ┌───────────────────────┐
        │    Decoder Layer 1    │
        │  ┌─────────────────┐  │
        │  │  Masked Multi-  │  │
        │  │  Head Self-Attn │  │
        │  └────────┬────────┘  │
        │           ↓ Add&Norm  │
        │  ┌─────────────────┐  │
        │  │    Cross-Attn   │  │  ← usually omitted in Decoder-only
        │  │   (Encoder-     │  │
        │  │    Decoder)     │  │
        │  └─────────────────┘  │
        │           ↓ Add&Norm  │
        │  ┌─────────────────┐  │
        │  │    Feed Forward │  │
        │  │      Network    │  │
        │  └────────┬────────┘  │
        │           ↓ Add&Norm  │
        └───────────────────────┘
                    ↓
          [repeat for N layers]
                    ↓
      Output linear layer + Softmax
                    ↓
         Predict the next token

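Key features:

No separate Encoder module
Uses causal self-attention
Generates autoregressively

3.2 Core Mechanism 1: Causal Self-Attention

This is the biggest difference between a Decoder-only model and an Encoder.

3.2.1 Visualizing the Attention Mask

Bidirectional attention in an Encoder (sees the full context):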


Pos:     1    2    3    4    5
        ───────────────────────
1  │     ✓    ✓    ✓    ✓    ✓
2  │     ✓    ✓    ✓    ✓    ✓
3  │     ✓    ✓    ✓    ✓    ✓
4  │     ✓    ✓    ✓    ✓    ✓
5  │     ✓    ✓    ✓    ✓    ✓
        ───────────────────────
Every position can see all positions (including future ones)

Causal attention in a Decoder (sees only the past):


Pos:     1    2    3    4    5
        ───────────────────────
1  │     ✓    ✗    ✗    ✗    ✗
2  │     ✓    ✓    ✗    ✗    ✗
3  │     ✓    ✓    ✓    ✗    ✗
4  │     ✓    ✓    ✓    ✓    ✗
5  │     ✓    ✓    ✓    ✓    ✓
        ───────────────────────
Each position can see only itself and earlier positions (triangular mask)
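
The two grids above can be reproduced with a few lines of code. The following is a minimal, dependency-free sketch; the helper names `build_mask` and `render` are made up for this illustration, and `True` is taken to mean "this query position may attend to this key position".

```python
def build_mask(seq_len, causal):
    """Return a seq_len x seq_len grid; True means the query (row) may attend to the key (column)."""
    return [[(not causal) or (k <= q) for k in range(seq_len)] for q in range(seq_len)]


def render(mask):
    """Draw the grid with the same check/cross symbols used in the tables above."""
    return "\n".join(" ".join("✓" if ok else "✗" for ok in row) for row in mask)


bidirectional = build_mask(5, causal=False)  # Encoder-style: everything visible
causal = build_mask(5, causal=True)          # Decoder-style: lower triangle only

print(render(bidirectional))
print()
print(render(causal))
```

Row q of the causal grid contains q + 1 visible positions (counting from 0), which is why a length-5 sequence has 15 allowed query-key pairs instead of 25.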

3.2.2 Implementation


import torch
import torch.nn as nn
import math

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0

        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # Q, K, V projections plus the output projection
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # Note: `mask` is accepted for interface compatibility but is not used here;
        # the causal mask is always built internally.
        B, T, C = x.size()  # Batch, Sequence length, Channels

        # Compute Q, K, V and split into heads: (B, n_heads, T, head_dim)
        Q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        K = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        V = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        # Attention scores: (B, n_heads, T, head_dim) @ (B, n_heads, head_dim, T) -> (B, n_heads, T, T)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Key step: apply the causal mask (entries above the diagonal become -inf)
        causal_mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(causal_mask, float('-inf'))

        # Softmax turns scores into attention weights
        attn_weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values, then merge the heads back together
        out = torch.matmul(attn_weights, V)  # (B, n_heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)

        return self.o_proj(out)

Key code:


# Build the upper-triangular mask (True above the diagonal)
causal_mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
# Set the scores of future positions to -inf (their weights become 0 after softmax)
scores = scores.masked_fill(causal_mask, float('-inf'))
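
To see why filling with negative infinity works, it helps to run softmax by hand on a single row of scores. The sketch below uses only the standard library; `masked_softmax` and the sample numbers are illustrative assumptions, not part of the model code above.

```python
import math


def masked_softmax(scores, masked):
    """Softmax over one row of scores; positions flagged in `masked` are set to -inf first."""
    filled = [float("-inf") if m else s for s, m in zip(scores, masked)]
    peak = max(filled)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in filled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# The query sits at index 2 of a length-4 sequence, so index 3 is "the future".
row_scores = [1.0, 2.0, 3.0, 10.0]   # the future position has the largest raw score
future = [False, False, False, True]

weights = masked_softmax(row_scores, future)
print(weights)
```

Even though the future position has by far the largest raw score, its weight is exactly zero, and the remaining weights still sum to one over the visible positions.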

3.3 Core Mechanism 2: Autoregressive Generation

A Decoder-only model generates text by predicting one token at a time.

3.3.1 Walking Through Generation


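import torch

# Toy vocabulary and stand-ins for a real tokenizer and model (illustration only)
vocab = ["The", " development", " of", " AI", " has", " changed", " our", " lives", "<|endoftext|>"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def model(input_ids):
    # Stand-in for a trained LM: random logits, one row per input position
    return torch.randn(len(input_ids), len(vocab))

def sample(logits):
    # Greedy decoding; Top-k / Top-p sampling would slot in here as well
    return int(torch.argmax(logits))

# Input prompt
prompt = "The development of AI"

# Tokenization
tokens = ["The", " development", " of", " AI"]
max_new_tokens = 10

# Autoregressive generation loop
for i in range(max_new_tokens):
    # Current input sequence
    input_ids = [token_to_id[t] for t in tokens]

    # Forward pass: a predicted distribution for every position
    logits = model(input_ids)

    # Prediction at the last position (the next token)
    next_token_logits = logits[-1]

    # Sampling strategy (Greedy / Top-k / Top-p)
    next_token = vocab[sample(next_token_logits)]

    # Append to the sequence
    tokens.append(next_token)

    # Stop once the end-of-text token is generated
    if next_token == "<|endoftext|>":
        break

# Final output; with a trained model it might read:
# "The development of AI has profoundly changed the way we live..."
print("".join(tokens))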

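The generation loop mentions three sampling strategies: Greedy, Top-k, and Top-p. A rough, standard-library-only sketch of each follows. The function names and the sample logits are assumptions made for this illustration; production implementations usually operate on batched tensors and add a temperature parameter.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the single highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits, k, rng=random):
    """Keep the k highest-scoring tokens, renormalize, and sample among them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


def top_p_sample(logits, p, rng=random):
    """Keep the smallest set of tokens whose cumulative probability reaches p, then sample."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 0.5, -1.0, 4.0]  # made-up scores for a 4-token vocabulary
print(greedy(logits), top_k_sample(logits, k=2), top_p_sample(logits, p=0.9))
```

Greedy decoding is deterministic, while Top-k and Top-p trade some of that determinism for diversity by sampling from a truncated distribution.
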
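3.3.2 KV Cache Optimization

Because generation is autoregressive, each new token would otherwise recompute the Keys and Values of every earlier token. KV Cache is the key technique for avoiding that repeated work.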


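import math
import torch
import torch.nn as nn

class DecoderWithKVCache(nn.Module):
    """Single-head attention with a KV cache (simplified for illustration)."""
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.k_cache = None  # cached Keys,   shape (B, T_past, d_model)
        self.v_cache = None  # cached Values, shape (B, T_past, d_model)

    def forward(self, x, use_cache=True):
        # x holds only the new positions: (B, T_new, d_model)
        Q = self.q_proj(x)
        K_new = self.k_proj(x)
        V_new = self.v_proj(x)

        if use_cache and self.k_cache is not None:
            # Prepend the cached history of K and V
            K = torch.cat([self.k_cache, K_new], dim=1)
            V = torch.cat([self.v_cache, V_new], dim=1)
        else:
            K, V = K_new, V_new

        # Update the cache
        if use_cache:
            self.k_cache = K
            self.v_cache = V

        # Attention: Q covers only the new positions, K/V include the history
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_model)

        # Causal mask: a new position must not see keys that come after it
        T_new, T_total = Q.size(1), K.size(1)
        causal_mask = torch.triu(
            torch.ones(T_new, T_total, device=x.device), diagonal=T_total - T_new + 1
        ).bool()
        scores = scores.masked_fill(causal_mask, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)  # (B, T_new, d_model)

        return output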

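Effect of the optimization (counting attention score computations, with the model dimension treated as a constant):

Without KV Cache: step t recomputes attention over the whole prefix at O(t²) cost, so generating N tokens costs O(N³) in total
With KV Cache: step t scores one new query against t cached keys at O(t) cost, so the total drops to O(N²), at the price of O(N) extra memory per layer for the cache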
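
The difference is easy to check by counting query-key score computations directly. The sketch below is a back-of-the-envelope count under simplifying assumptions: a single layer and head, no savings from the causal mask, and function names invented for this example.

```python
def attention_pairs_without_cache(prompt_len, new_tokens):
    """Every step re-runs attention over the whole current sequence: t queries x t keys."""
    return sum(t * t for t in range(prompt_len, prompt_len + new_tokens))


def attention_pairs_with_cache(prompt_len, new_tokens):
    """Process the prompt once (t x t), then each later step scores 1 query against t keys."""
    prefill = prompt_len * prompt_len
    decode = sum(t for t in range(prompt_len + 1, prompt_len + new_tokens))
    return prefill + decode


for n in (100, 1000):
    slow = attention_pairs_without_cache(10, n)
    fast = attention_pairs_with_cache(10, n)
    print(f"N={n}: without cache {slow}, with cache {fast}, ratio {slow / fast:.1f}x")
```

The gap widens as N grows, which matches the cubic-versus-quadratic growth described above.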

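4. Advantages of Decoder-only

4.1 Why Did GPT Choose Decoder-only?

Advantage	Details
Simple architecture	No Encoder-Decoder interaction to design; simpler to implement
Efficient training	A single next-token prediction objective; no need to design multiple pretraining tasks
Natural generation	Autoregressive, left-to-right generation matches how people produce language
Good scalability	Capability grows by stacking layers; scaling-law behavior has been studied most extensively on this architecture
In-context learning	In-context Learning ability is especially prominent in Decoder-only models
Engineering-friendly	A uniform generation interface that is easy to deploy and optimize (e.g., KV Cache)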
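
The "single objective" row can be made concrete: any token sequence turns into training examples by pairing each prefix with the token that follows it. A minimal sketch, where the helper name `next_token_pairs` and the sample ids are made up for illustration:

```python
def next_token_pairs(token_ids):
    """Turn one sequence into (context, target) pairs for next-token prediction."""
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]


sequence = [5, 8, 2, 9]  # arbitrary token ids
for context, target in next_token_pairs(sequence):
    print(f"context={context} -> target={target}")
```

A sequence of length T yields T - 1 examples, and with a causal mask all of them can be scored in a single forward pass rather than one pass per prefix.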

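4.2 Comparison with Encoder-Decoder

Dimension	Decoder-only (GPT)	Encoder-Decoder (T5)
Pretraining task	Single: next-token prediction	More involved: Span Corruption, etc.
Generation	Autoregressive, token by token	Encoder encodes the input, then the Decoder generates
Bidirectional context	Sees only the left context during generation	Encoder sees the input bidirectionally
Understanding	Achieved through prompt engineering	Encoder is naturally suited to understanding
Training efficiency	High (one uniform objective)	Lower (multiple tasks)
Long-form generation	Natively supported	Needs special design
Typical applications	Dialogue, creative writing, code	Translation, summarization, rewriting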

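4.3 Emergent Abilities

As they scale up, Decoder-only models display striking emergent abilities:

In-context Learning: learning a new task from examples in the prompt
Chain-of-Thought: solving complex problems by writing out reasoning steps
Instruction Following: carrying out a wide range of tasks from human instructions
Zero-shot/Few-shot: adapting to new tasks without fine-tuning

These abilities have generally been weaker in Encoder-only and Encoder-Decoder models.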

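5. Limitations of Decoder-only and Improvements

5.1 Main Limitations

Limitation	Description	Direction for improvement
Unidirectional attention	Cannot use future context during generation	Adding bidirectional attention over part of the input (e.g., UniLM)
Length limit	Finite context window	Long-context techniques (RoPE, ALiBi)
Inference efficiency	Autoregressive generation cannot be parallelized across output tokens	Speculative sampling, parallel decoding
Understanding tasks	Less direct than Encoder architectures	Prompt engineering, instruction tuning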

5.2 Directions for Architectural Improvement

5.2.1 Long-Context Extension


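import torch
import torch.nn as nn

# RoPE (Rotary Position Embedding): a relative position encoding that is
# commonly used as the basis for long-context extension
class RoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048):
        super().__init__()
        self.max_seq_len = max_seq_len  # nominal training length; not enforced here
        # One rotation frequency per pair of dimensions
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, seq_len):
        # Positions as floats so their dtype matches inv_freq
        t = torch.arange(seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq)    # (seq_len, dim // 2)
        emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, dim)
        return emb.cos(), emb.sin()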
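
The class above only produces the cos/sin tables; the rotation still has to be applied to the query and key vectors. The following dependency-free sketch shows the idea and the property that makes RoPE a relative encoding: the attention score depends only on the offset between two positions. It is an illustration under assumptions. It rotates adjacent pairs of dimensions (the layout used in the original RoPE formulation), whereas the `torch.cat((freqs, freqs))` layout above pairs dimension i with dimension i + dim/2, and the names `rope_rotate` and `dot` are invented here.

```python
import cmath


def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/dim)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]   # arbitrary query vector
k = [1.1, 0.4, -0.6, 0.9]   # arbitrary key vector

# Same offset (2) at different absolute positions gives the same score
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_b = dot(rope_rotate(q, 10), rope_rotate(k, 8))
print(score_a, score_b)
```

Because only the offset matters, the same rotation rule is defined for any position index, which is what context-extension methods build on.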

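Technical progress:

Position encoding improvements: RoPE and ALiBi support longer contexts
Sparse attention: Longformer and BigBird reduce the computational cost of long sequences
Context compression: compressing history through summarization or retrieval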

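5.2.2 Inference Acceleration

Technique	Principle	Effect
KV Cache	Cache past Keys/Values	Total attention cost of generation drops from O(N³) to O(N²)
Speculative sampling	Small model drafts, large model verifies	Typically reported as a 2-3x speedup
Quantized inference	INT8/INT4 weights	Weight memory roughly halved (INT8) or quartered (INT4) relative to FP16, with faster inference
Continuous Batching	Dynamic batching	Higher GPU utilization
PagedAttention	Non-contiguous memory management for the KV cache	Supports higher concurrency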
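
The draft-then-verify idea behind speculative sampling can be sketched in its simplest, greedy form: the small model proposes k tokens, the large model checks them, and the longest agreeing prefix is kept. This is a toy illustration under stated assumptions. Both "models" are made-up deterministic functions over integer tokens, and the published algorithm uses a probabilistic accept/reject rule rather than exact matching. In a real system the verification loop is a single batched forward pass of the large model, which is where the speedup comes from.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round: draft k tokens, verify them with the target model, return tokens to append."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)    # in practice: all checked in one target forward pass
        if expected != tok:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when every draft token matched
    return accepted


# Hypothetical toy models over integer tokens
def target_next(ctx):
    return (sum(ctx) + len(ctx)) % 7


def draft_next(ctx):
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 7  # deliberately wrong every 4th position


def generate_speculative(prefix, n, k=3):
    out = list(prefix)
    while len(out) - len(prefix) < n:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + n]


def generate_greedy(prefix, n):
    out = list(prefix)
    for _ in range(n):
        out.append(target_next(out))
    return out


print(generate_speculative([1, 2], 20))
print(generate_greedy([1, 2], 20))
```

Both calls print the same sequence: in this greedy variant, drafting changes how many target-model rounds are needed, not what gets generated.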

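6. Comparison of Mainstream Decoder-only Models

Model	Released	Parameters	Key innovation
GPT-3	2020	175B	Demonstrated the emergent abilities of large-scale Decoder-only models
GPT-4	2023	Undisclosed	Multimodality, stronger reasoning, safety
LLaMA	2023	7B-65B	Openly released weights (research license) that energized the community
LLaMA 2	2023	7B-70B	Longer context, stronger performance, license permitting commercial use
LLaMA 3	2024	8B-70B	Larger vocabulary, more efficient training
Claude	2023-2024	Undisclosed	Long context (200K), safety
Tongyi Qianwen (Qwen)	2023-2024	0.5B-110B	Chinese-English bilingual, open-source ecosystem
DeepSeek	2024	16B-236B	MoE architecture, efficient inference
GPT-4o	2024	Undisclosed	Natively multimodal, end-to-end
o1/o3	2024-2025	Undisclosed	Inference-time scaling, slow thinking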

7. Hands-On: Building a Simple Decoder-only Model


import torch
import torch.nn as nn
import math

# Note: CausalSelfAttention is the class defined in Section 3.2.2

class DecoderOnlyTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6, max_seq_len=512):
        super().__init__()
        self.d_model = d_model
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)

        self.layers = nn.ModuleList([
            DecoderLayer(d_model, n_heads) for _ in range(n_layers)
        ])

        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, x, mask=None):
        B, T = x.shape

        # Token embedding + learned positional embedding
        tok_emb = self.token_emb(x)
        pos_emb = self.pos_emb(torch.arange(T, device=x.device))
        x = tok_emb + pos_emb

        # Pass through the Decoder layers
        for layer in self.layers:
            x = layer(x, mask)

        x = self.norm(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        return logits

class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = CausalSelfAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Pre-norm structure: normalize before each sublayer, then add the residual
        x = x + self.self_attn(self.norm1(x), mask)
        x = x + self.ffn(self.norm2(x))
        return x

# Usage example
model = DecoderOnlyTransformer(vocab_size=50000)
input_ids = torch.randint(0, 50000, (2, 100))  # (batch_size, seq_len)
logits = model(input_ids)  # (2, 100, 50000)

# Next-token prediction loss: position t predicts token t+1,
# so drop the last logit and the first target
shift_logits = logits[:, :-1, :]  # (2, 99, 50000)
shift_targets = input_ids[:, 1:]  # (2, 99)
loss = nn.CrossEntropyLoss()(shift_logits.reshape(-1, 50000), shift_targets.reshape(-1))
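
As a sanity check on the model size, the parameter count of this configuration can be worked out by hand. The sketch below assumes the layer definitions shown above: four biased `nn.Linear` projections in the attention block, a 4x feed-forward expansion, two LayerNorms per layer, and an untied `lm_head` without bias. The function name `count_parameters` is made up for this example.

```python
def count_parameters(vocab_size, d_model=512, n_layers=6, max_seq_len=512):
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    attention = 4 * (d_model * d_model + d_model)  # q, k, v, o projections with bias
    ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # weight + bias for norm1 and norm2
    per_layer = attention + ffn + layer_norms
    final_norm = 2 * d_model
    lm_head = d_model * vocab_size                 # bias=False
    return token_emb + pos_emb + n_layers * per_layer + final_norm + lm_head


total = count_parameters(vocab_size=50000)
print(f"{total:,} parameters")
```

With a 50,000-token vocabulary, the two vocabulary-sized matrices (`token_emb` and `lm_head`) account for more than two thirds of the total, which is one reason small models often tie these two weight matrices.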

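8. Summary

With its simple design, efficient training, strong generation ability, and excellent scalability, the Decoder-only architecture has become the mainstream choice for today's large language models. Understanding its two core mechanisms, causal self-attention and autoregressive generation, is the foundation for mastering modern LLM technology.

As long-context, multimodal, and inference-optimization techniques advance, the Decoder-only architecture continues to evolve. For developers, a solid understanding of this architecture makes it easier to use, fine-tune, and deploy large models.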

9. References

Attention Is All You Need - the original Transformer paper
GPT-3: Language Models are Few-Shot Learners
LLaMA: Open and Efficient Foundation Language Models
The Illustrated Transformer - a visual walkthrough
Hugging Face Transformers documentation