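Large Language Models (LLMs) Explained

1. Overview

1.1 What Is a Large Language Model

A large language model (LLM) is a neural language model built on the Transformer architecture and pretrained on large-scale text corpora. Parameter counts typically range from billions to trillions.

1.2 Core Characteristics

Characteristic	Description
Scale	Very large parameter counts (7B to 175B and beyond)
Emergent abilities	Capabilities such as reasoning, programming, and translation that appear as scale grows
In-context learning	Learns a task from examples in the prompt, with no fine-tuning
Generality	A single model handles many kinds of tasks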

1.3 Timeline

2018: GPT (117M), BERT (340M)
2019: GPT-2 (1.5B), T5 (11B)
2020: GPT-3 (175B), the start of the large-model era
2022: ChatGPT, aligned with RLHF, sets off the AI boom
2023: GPT-4, LLaMA, Claude, Qwen
2024: LLaMA 3, Mixtral, GPT-4o

2. A History of Language Models

2.1 Statistical Language Models

N-gram model:
$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

Limitations: data sparsity, and difficulty capturing long-range dependencies.
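The n-gram approximation can be made concrete with a bigram model (n = 2), where each word is conditioned only on the previous one. The sketch below is a minimal illustration on a toy corpus invented for this example; it uses raw counts with no smoothing, which is exactly why unseen pairs get probability zero.

```python
from collections import Counter

def train_bigram(tokens):
    """Count contexts and bigrams, return a function estimating P(next | prev)."""
    contexts = Counter(tokens[:-1])                  # how often each word appears as a context
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # how often each adjacent pair appears

    def prob(prev, nxt):
        if contexts[prev] == 0:
            return 0.0                               # unseen context: the data sparsity problem
        return bigrams[(prev, nxt)] / contexts[prev]

    return prob

corpus = "the cat sat on the mat the cat ran".split()
prob = train_bigram(corpus)
print(prob("the", "cat"))   # 2 of the 3 occurrences of "the" are followed by "cat"
print(prob("cat", "mat"))   # never observed, so the estimate is zero
```

Because the context is a single word, nothing earlier in the sentence can influence the prediction, which illustrates the long-range dependency limitation.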

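2.2 Neural Language Models

Bengio et al. (2003), feed-forward neural network:
$$P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \text{softmax}(Wh + b)$$

2.3 The RNN Era

LSTM/GRU: mitigate the long-range dependency problem, but computation is still sequential across time steps and cannot be parallelized.

2.4 The Transformer Era

Self-attention: fully parallel across positions, and able to relate tokens at any distance.

Pretraining paradigms:

GPT: autoregressive generation
BERT: bidirectional understanding
T5: unified text-to-text framework

2.5 The Large-Model Era

Scaling laws:
$$L(N) \propto N^{-\alpha}$$

Loss falls as a power law in parameter count, dataset size, and compute. Here $N$ is the parameter count and $\alpha > 0$ is an empirically fitted exponent.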
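To see what a power law implies in practice, the sketch below evaluates $L(N) = (N_c / N)^{\alpha}$ at two model sizes and recovers the exponent from the two points. The constants `N_C` and `ALPHA` are illustrative assumptions chosen for this example, not fitted values from any paper.

```python
import math

def loss_from_params(n_params, n_c, alpha):
    """Power-law loss L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

def fit_alpha(n1, loss1, n2, loss2):
    """Recover alpha from two (N, L) points: the negative slope in log-log space."""
    return -(math.log(loss2) - math.log(loss1)) / (math.log(n2) - math.log(n1))

# Illustrative constants only (assumptions for this sketch)
N_C, ALPHA = 1e13, 0.08

l_small = loss_from_params(1e9, N_C, ALPHA)
l_large = loss_from_params(1e10, N_C, ALPHA)
ratio = l_large / l_small                      # equals 10 ** -ALPHA for a 10x larger model
alpha_hat = fit_alpha(1e9, l_small, 1e10, l_large)
print(f"loss ratio for 10x params: {ratio:.4f}, recovered alpha: {alpha_hat:.4f}")
```

A small exponent means each 10x increase in parameters buys only a modest multiplicative reduction in loss, which is why scaling requires large jumps in size.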

3. Pretraining Techniques

3.1 Causal Language Modeling (CLM)

Objective: predict the next token.
$$\mathcal{L}_{CLM} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})$$

Used by: the GPT series, LLaMA, Mistral
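As a worked example of the CLM objective, the sketch below sums the negative log-probabilities that a model assigned to the correct next tokens. The probabilities are hypothetical numbers chosen for illustration; in a real model they come from a softmax over the vocabulary.

```python
import math

def clm_loss(token_probs):
    """Sum of -log P(x_t | x_<t) over the sequence, as in the CLM objective."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities assigned to each correct next token
probs = [0.5, 0.25, 0.8]
loss = clm_loss(probs)
perplexity = math.exp(loss / len(probs))   # exponentiated mean loss per token
print(f"loss={loss:.4f}, perplexity={perplexity:.4f}")
```

Training frameworks usually report the mean rather than the sum, so the loss is comparable across sequence lengths.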

3.2 Masked Language Modeling (MLM)

Objective: predict the masked tokens.
$$\mathcal{L}_{MLM} = -\sum_{t \in \text{mask}} \log P(x_t \mid x_{\backslash t})$$

Used by: BERT, RoBERTa

3.3 Denoising Objectives

T5's span corruption:

Input: "The <X> sat on <Y> mat"
Target: "<X> cat <Y> the"
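The input/target pair above can be produced mechanically. The following sketch assumes the spans to corrupt are already chosen and given as sorted, non-overlapping index ranges; the function name `span_corrupt` and the sentinel strings are illustrative, and a real pipeline would sample the spans at random.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel; the target lists the removed spans.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    """
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])   # keep the text before the span
        inputs.append(sentinel)               # mark where the span was removed
        targets.append(sentinel)
        targets.extend(tokens[start:end])     # the model must reproduce the span
        cursor = end
    inputs.extend(tokens[cursor:])            # keep the text after the last span
    return " ".join(inputs), " ".join(targets)

tokens = "The cat sat on the mat".split()
model_input, target = span_corrupt(tokens, spans=[(1, 2), (4, 5)])
print(model_input)   # The <X> sat on <Y> mat
print(target)        # <X> cat <Y> the
```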

3.4 Mixed Objectives

Some models combine several pretraining objectives to learn better representations.

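4. Model Architecture

4.1 Mainstream Architectures

Decoder-only (GPT style):

The current mainstream choice
- GPT series
- LLaMA series
- Mistral/Mixtral
- Qwen

Encoder-only (BERT style):

Mainly used for understanding tasks
- BERT, RoBERTa
- DeBERTa

Encoder-decoder (T5 style):

Used for sequence-to-sequence tasks
- T5, BART
- Flan-T5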
4.2 Components of a Modern LLM Architecture

Input → Token Embedding
    ↓
┌─────────────────────────┐
│    Transformer Block × N │
│  ┌───────────────────┐  │
│  │  RMSNorm          │  │
│  │  GQA Attention    │  │
│  │  + RoPE           │  │
│  │  + Residual       │  │
│  ├───────────────────┤  │
│  │  RMSNorm          │  │
│  │  SwiGLU FFN       │  │
│  │  + Residual       │  │
│  └───────────────────┘  │
└─────────────────────────┘
    ↓
  RMSNorm → LM Head → output probabilities

4.3 Key Components

RMSNorm

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$

Cheaper than LayerNorm because it skips mean centering and the bias term, with comparable quality in practice.
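A small numeric check can make the formula tangible. This is a plain-Python sketch on a two-element vector, with `gamma` set to ones as an assumption for the example; after normalization the root mean square of the output is approximately 1.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide x by its root mean square, then scale element-wise by gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gamma)]

x = [3.0, 4.0]
y = rms_norm(x, gamma=[1.0, 1.0])
output_rms = math.sqrt(sum(v * v for v in y) / len(y))
print(y, output_rms)
```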

SwiGLU

$$\text{SwiGLU}(x) = \left(\text{Swish}(xW_1) \otimes xW_3\right) W_2$$

Generally performs better than a ReLU feed-forward layer.

RoPE (Rotary Position Embedding)

Encodes position as a rotation of the query and key vectors, so attention scores depend naturally on relative position.
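The relative-position property can be checked in two dimensions. In this sketch a query at position m and a key at position n are each rotated by an angle proportional to their position; the dot product then depends only on m - n. The vectors and the single frequency `theta` are arbitrary values chosen for illustration (real RoPE applies many such rotations, one per pair of dimensions, each with its own frequency).

```python
import math

def rotate(vec, position, theta=1.0):
    """Rotate a 2-D vector by the angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)
score_a = dot(rotate(q, 5), rotate(k, 3))       # positions 5 and 3, offset 2
score_b = dot(rotate(q, 102), rotate(k, 100))   # positions 102 and 100, offset 2
score_c = dot(rotate(q, 5), rotate(k, 4))       # offset 1 gives a different score
print(abs(score_a - score_b) < 1e-9, abs(score_a - score_c) > 1e-3)
```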

GQA (Grouped-Query Attention)

Several query heads share one key/value head, which shrinks the KV cache.
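The saving can be estimated with simple arithmetic. The sketch below assumes a 32-layer model with head dimension 128, a 4096-token sequence, and 2 bytes per element (FP16/BF16); these numbers are assumptions for the example rather than measurements of a particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    """Size of the KV cache: keys and values for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Multi-head attention: one KV head per query head (32)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
# Grouped-query attention: 32 query heads share 8 KV heads
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB, ratio: {mha // gqa}x")
```

With 8 KV heads instead of 32, the cache is four times smaller per sequence, which directly increases how many sequences fit in a batch.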

4.4 Model Sizes

Model	Parameters	Layers	Hidden size	Attention heads
LLaMA-7B	7B	32	4096	32
LLaMA-13B	13B	40	5120	40
LLaMA 2 70B	70B	80	8192	64
GPT-3	175B	96	12288	96
Mixtral-8x7B	47B	32	4096	32
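The parameter counts in the table follow from the layer count and hidden size. The sketch below uses two approximations: the common 12·L·d² rule of thumb for a GPT-style block, and a LLaMA-style count that assumes a SwiGLU feed-forward width of 11008, a 32000-token vocabulary, untied input/output embeddings, and no biases (the values used for the 7B configuration elsewhere in this article).

```python
def gpt_style_params(n_layers, d_model):
    """Rule of thumb: 4*d^2 for attention plus 8*d^2 for a 4x-wide MLP per layer."""
    return 12 * n_layers * d_model ** 2

def llama_style_params(n_layers, d_model, d_ff, vocab_size):
    attention = 4 * d_model * d_model        # Wq, Wk, Wv, Wo (no GQA, no biases)
    ffn = 3 * d_model * d_ff                 # SwiGLU uses three weight matrices
    norms = 2 * d_model                      # two RMSNorm weight vectors per block
    per_layer = attention + ffn + norms
    embeddings = 2 * vocab_size * d_model    # untied input and output embeddings
    return n_layers * per_layer + embeddings + d_model   # plus the final RMSNorm

gpt3 = gpt_style_params(96, 12288)
llama7b = llama_style_params(32, 4096, d_ff=11008, vocab_size=32000)
print(f"GPT-3 estimate: {gpt3 / 1e9:.1f}B")
print(f"LLaMA-7B estimate: {llama7b / 1e9:.2f}B")
```

The GPT-3 estimate ignores embeddings and biases, which is why it lands slightly under the quoted 175B.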

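5. Training Data

5.1 Data Sources

Dataset	Scale	Notes
Common Crawl	Petabytes	Web crawl; requires filtering
Wikipedia	Tens of GB	High-quality knowledge
Books	Hundreds of GB	Long-form text, in-depth knowledge
Code	Hundreds of GB	Code from GitHub
Academic papers	Hundreds of GB	Scientific knowledge

5.2 Data Processing Pipeline

Raw data → Deduplication → Quality filtering → Toxicity filtering → PII removal → Tokenization → Training data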

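Deduplication

# MinHash-based near-duplicate detection
from datasketch import MinHash, MinHashLSH

def compute_minhash(text, num_perm=128):
    """Build a MinHash signature from the whitespace-separated words of `text`."""
    m = MinHash(num_perm=num_perm)
    for word in text.split():
        m.update(word.encode('utf8'))
    return m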
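To show why matching MinHash signatures indicate near-duplicates, here is a dependency-free sketch of the underlying estimator. It is not the `datasketch` implementation: it simulates each permutation with a seeded SHA-1 hash, works on word 3-grams (shingles), and the helper names are invented for this example. The fraction of signature positions that agree estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams in the text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode("utf8")).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"   # near-duplicate of doc_a
doc_c = "large language models are trained on text"     # unrelated

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
sim_same = estimated_jaccard(sig_a, sig_a)
sim_near = estimated_jaccard(sig_a, sig_b)
sim_far = estimated_jaccard(sig_a, sig_c)
print(sim_same, sim_near, sim_far)
```

In a deduplication pipeline, documents whose estimated similarity exceeds a chosen threshold are treated as duplicates and all but one are dropped.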

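Quality Filtering

from langdetect import detect  # assumes the `langdetect` package is installed

def quality_filter(text):
    # Length filter
    if len(text) < 100 or len(text) > 100000:
        return False

    # Language detection: keep English only
    if detect(text) != 'en':
        return False

    # Share of characters that are neither alphanumeric nor spaces
    special_ratio = sum(1 for c in text if not c.isalnum() and c != ' ') / len(text)
    if special_ratio > 0.1:
        return False

    return True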

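5.3 Data Mixture

A typical mixture:
- Web data: 60-70%
- Books: 10-15%
- Code: 10-15%
- Wikipedia: 5-10%
- Academic: 5-10%

5.4 Tokenization

BPE (Byte Pair Encoding):

Training procedure:
1. Initialize: every byte is a base token
2. Count the frequency of adjacent token pairs
3. Merge the most frequent pair
4. Repeat until the target vocabulary size is reached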
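The four steps above can be sketched in a few lines. This is a minimal byte-level trainer for illustration only: it runs a fixed number of merges on a toy string rather than targeting a vocabulary size, and production tokenizers add pre-tokenization and far more efficient pair counting.

```python
from collections import Counter

def train_bpe(text, num_merges):
    tokens = [bytes([b]) for b in text.encode("utf8")]    # step 1: start from single bytes
    merges = []
    for _ in range(num_merges):                           # step 4: repeat
        pairs = Counter(zip(tokens, tokens[1:]))          # step 2: count adjacent pairs
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]   # step 3: most frequent pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)               # replace the pair with one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges.append((left, right))
    return merges, tokens

merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(tokens)
```

The learned `merges` list is the tokenizer: encoding new text replays the merges in the same order.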

Vocabulary sizes (approximate):

32K: LLaMA
50K: GPT-2
100K: GPT-4
150K: Qwen

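6. Distributed Training

6.1 Data Parallelism (DP)

Each GPU holds a full copy of the model and processes a different slice of the data:

import torch.nn as nn

# Single-process, multi-GPU data parallelism; `model` is an existing nn.Module
model = nn.parallel.DataParallel(model)

6.2 Distributed Data Parallelism (DDP)

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; a launcher such as torchrun sets LOCAL_RANK for each process
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])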

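6.3 Model Parallelism

Tensor Parallelism

Splits a matrix computation across several GPUs:

# Column parallel: split W by columns, then concatenate the partial outputs
Y = XW = X[W1, W2] = [XW1, XW2]

# Row parallel: split W by rows and X by columns, then sum the partial outputs
Y = XW = [X1, X2][W1; W2] = X1W1 + X2W2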
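Both identities can be verified numerically. The sketch below uses plain Python lists in place of GPU tensors; each "GPU" is just a separate `matmul` call, the concatenation stands in for an all-gather, and the sum stands in for an all-reduce. The matrices are arbitrary small integers chosen for the example.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, k):
    """Split a matrix into its first k columns and the remaining columns."""
    return [row[:k] for row in m], [row[k:] for row in m]

def concat_columns(a, b):
    return [ra + rb for ra, rb in zip(a, b)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]            # shape (2, 4)
W = [[1, 0], [0, 1], [2, 1], [1, 3]]        # shape (4, 2)
full = matmul(X, W)

# Column parallel: each device holds some columns of W and the whole X
W1, W2 = split_columns(W, 1)
col_parallel = concat_columns(matmul(X, W1), matmul(X, W2))

# Row parallel: each device holds some rows of W and the matching columns of X
X1, X2 = split_columns(X, 2)
W_top, W_bottom = W[:2], W[2:]
row_parallel = add(matmul(X1, W_top), matmul(X2, W_bottom))

print(full, col_parallel, row_parallel)
```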

Pipeline Parallelism

Places different layers on different GPUs:

GPU 0: Layer 0-15
GPU 1: Layer 16-31
GPU 2: Layer 32-47
GPU 3: Layer 48-63

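6.4 ZeRO Optimization

# DeepSpeed ZeRO
ds_config = {
    "zero_optimization": {
        "stage": 2,  # shard optimizer states and gradients
        "allgather_partitions": True,
        "allgather_bucket_size": 5e8,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
    }
}

Stage	What is sharded	Memory savings (model states, large GPU counts)
Stage 1	Optimizer states	~4x
Stage 2	+ Gradients	~8x
Stage 3	+ Parameters	~Nx (N = number of data-parallel GPUs)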
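The savings in the table can be reproduced with a back-of-the-envelope calculation. The sketch assumes mixed-precision Adam, where each parameter costs 2 bytes for the FP16 weight, 2 bytes for the FP16 gradient, and 12 bytes of FP32 optimizer state (master weight, momentum, variance). It counts model states only and ignores activations and buffers; the 7.5B-parameter, 64-GPU setting is chosen for illustration.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Per-GPU model-state memory in GB (1e9 bytes) for mixed-precision Adam."""
    params = 2 * n_params    # FP16 weights
    grads = 2 * n_params     # FP16 gradients
    optim = 12 * n_params    # FP32 master weights, momentum, variance
    if stage >= 1:
        optim /= n_gpus      # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= n_gpus      # Stage 2 also shards gradients
    if stage >= 3:
        params /= n_gpus     # Stage 3 also shards parameters
    return (params + grads + optim) / 1e9

memory = {stage: zero_memory_gb(7.5e9, n_gpus=64, stage=stage) for stage in range(4)}
for stage, gb in memory.items():
    print(f"stage {stage}: {gb:.1f} GB per GPU ({memory[0] / gb:.1f}x smaller)")
```

The ratios approach 4x and 8x for the first two stages only when the GPU count is large, and Stage 3 scales linearly with the number of GPUs.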

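6.5 3D Parallelism

Combines all three forms of parallelism:
- Data parallelism: handles large batches
- Tensor parallelism: handles large layers
- Pipeline parallelism: handles deep networks

Example: a Megatron-LM configuration
TP=8 (within a node), PP=16 (across nodes), DP=64, for 8 × 16 × 64 = 8192 GPUs in total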

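6.6 Mixed-Precision Training

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Scale the loss so small FP16 gradients do not underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients and skips the step on inf/NaN
    scaler.update()

BF16 vs FP16:

BF16: wider dynamic range (same exponent width as FP32), so it rarely overflows
FP16: narrower range, so it needs loss scaling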
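The difference in range can be demonstrated without a GPU. The sketch below round-trips values through IEEE half precision using the standard `struct` module, and emulates BF16 by truncating a float32 to its top 16 bits (an approximation: real hardware usually rounds instead of truncating). The gradient value and the scale factor of 1024 are arbitrary choices for illustration.

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

tiny_grad = 1e-8
print(to_fp16(tiny_grad))           # too small for FP16: becomes 0.0
print(to_fp16(tiny_grad * 1024))    # representable once scaled up (loss scaling)
print(to_bf16(tiny_grad))           # BF16 keeps it, with reduced precision

try:
    to_fp16(70000.0)                # above the FP16 maximum of 65504
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True
print(fp16_overflowed, to_bf16(70000.0))
```

This is why FP16 training multiplies the loss by a scale factor before the backward pass and divides the gradients by the same factor before the optimizer step.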

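7. Fine-Tuning Techniques

7.1 Full Fine-Tuning

Updates every parameter:

from torch.optim import AdamW
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llama-7b")
optimizer = AdamW(model.parameters(), lr=2e-5)

for batch in dataloader:
    outputs = model(**batch)   # batch must include `labels` for the loss to be computed
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()      # clear gradients before the next step

Drawback: requires a large amount of GPU memory, since gradients and optimizer states are kept for every parameter.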

7.2 LoRA (Low-Rank Adaptation)

Core Idea

Freeze the original weights and train only a low-rank update:
$$W' = W + BA$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$.
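A quick calculation shows how much smaller the update is than the weight it modifies. The sketch assumes a square 4096 × 4096 projection and rank 16 as example values, and also builds a tiny rank-1 update to show that $BA$ has the same shape as $W$.

```python
def lora_param_counts(d, r):
    full = d * d            # parameters in the frozen d x d weight W
    lora = d * r + r * d    # parameters in B (d x r) plus A (r x d)
    return full, lora

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

full, lora = lora_param_counts(d=4096, r=16)
print(f"W: {full:,}  B and A: {lora:,}  ratio: {lora / full:.4%}")

# Tiny example with d = 2, r = 1: the product BA is a full 2 x 2 update of rank 1
B = [[1.0], [2.0]]
A = [[0.5, -1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
delta = matmul(B, A)
W_prime = [[w + dw for w, dw in zip(rw, rd)] for rw, rd in zip(W, delta)]
print(delta, W_prime)
```

Since $W' = W + BA$ is an ordinary matrix of the original shape, the update can be merged into the base weight after training.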

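Implementation

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                    # rank
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# `base_model` is a causal LM loaded beforehand, e.g. with AutoModelForCausalLM
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# For LLaMA-7B with this config, expect roughly 16.8M trainable parameters (about 0.25% of the total)

Advantages

Memory savings: only a fraction of a percent of the parameters is trained
Cheaper training: far fewer gradients and optimizer states to compute and store
No added inference latency: the update can be merged into the base weights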

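7.3 QLoRA

Applies LoRA on top of a 4-bit quantized base model:

import torch
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "llama-7b",
    quantization_config=bnb_config,
    device_map="auto"
)

# Then apply LoRA; `lora_config` is a LoraConfig such as the one in 7.2
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)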

7.4 Instruction Fine-Tuning (SFT)

Data Format

{
    "instruction": "Translate the following sentence into English",
    "input": "今天天气很好",
    "output": "The weather is nice today."
}

Prompt Template

### System: You are a helpful assistant

### User: {instruction}
{input}

### Assistant: {output}

Training Code

def formatting_func(examples):
    # `tokenizer` is the tokenizer of the model being fine-tuned
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Use the same role markers as the prompt template
        text = f"### User: {instruction}\n{input_text}\n\n### Assistant: {output}"
        texts.append(text)

    return tokenizer(texts, truncation=True, max_length=2048)

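8. Inference Optimization

8.1 KV Cache

import torch

class KVCache:
    """Keeps keys/values of past tokens so each decode step only computes the new token."""
    def __init__(self):
        self.key_cache = None
        self.value_cache = None

    def update(self, key_states, value_states):
        # key_states / value_states: (batch, n_kv_heads, new_tokens, head_dim)
        if self.key_cache is None:
            self.key_cache = key_states
            self.value_cache = value_states
        else:
            # Append along the sequence dimension
            self.key_cache = torch.cat([self.key_cache, key_states], dim=2)
            self.value_cache = torch.cat([self.value_cache, value_states], dim=2)

        return self.key_cache, self.value_cache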

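8.2 Quantization

INT8 Quantization

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_int8 = AutoModelForCausalLM.from_pretrained(
    "llama-7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

INT4 Quantization (GPTQ)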

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "llama-7b-gptq-4bit",
    device="cuda:0"
)

AWQ Quantization

from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "llama-7b-awq-4bit",
    fuse_layers=True
)

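8.3 Speculative Decoding

A small draft model proposes tokens and the large target model verifies them:

import random

def speculative_decode(draft_model, target_model, prompt, num_speculative=5):
    # Sketch only: `generate_with_probs` and `get_probs` are hypothetical helpers that
    # return, for each drafted token, the probability the model assigns to it.

    # The draft model proposes candidate tokens
    draft_tokens, draft_probs = draft_model.generate_with_probs(
        prompt, max_new_tokens=num_speculative
    )

    # The target model scores all candidates in one parallel forward pass
    target_probs = target_model.get_probs(prompt, draft_tokens)

    # Accept token i with probability min(1, p_target / p_draft); stop at the first rejection
    accepted = 0
    for i in range(num_speculative):
        if random.random() < min(1.0, target_probs[i] / draft_probs[i]):
            accepted += 1
        else:
            break

    # The full algorithm then samples one more token from the target model
    # (from the residual distribution after a rejection); omitted here.
    return draft_tokens[:accepted]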
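The accept/reject step deserves a closer look, because it is what keeps the output distributed exactly as the target model would sample it. The sketch below works on explicit toy distributions over a three-token vocabulary (the numbers are invented for the example): a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(p - q, 0). Sampling many times shows the empirical frequencies approaching the target distribution `p`, not the draft distribution `q`.

```python
import random

def verify_token(token, p_target, q_draft, rng):
    """Accept `token` with probability min(1, p/q); otherwise resample from the residual."""
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    # Residual distribution: probability mass where the target exceeds the draft
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    replacement = rng.choices(range(len(residual)), weights=residual)[0]
    return replacement, False

rng = random.Random(0)
p = [0.6, 0.3, 0.1]   # target model distribution (toy values)
q = [0.2, 0.5, 0.3]   # draft model distribution (toy values)

trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    drafted = rng.choices(range(3), weights=q)[0]   # the draft model proposes a token
    token, _accepted = verify_token(drafted, p, q, rng)
    counts[token] += 1

freqs = [c / trials for c in counts]
print(freqs)   # close to p
```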

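8.4 Continuous Batching

Requests join and leave the running batch at every decoding step, instead of waiting for a whole batch to finish:

class ContinuousBatcher:
    # Sketch only: `prepare_batch`, `decode_step`, and `handle_completions` are placeholders
    def __init__(self, model, max_batch_size=32):
        self.model = model
        self.max_batch_size = max_batch_size
        self.waiting = []           # requests not yet scheduled
        self.active_requests = []   # requests currently being decoded

    def add_request(self, request):
        self.waiting.append(request)

    def step(self):
        # Fill free slots with waiting requests
        while self.waiting and len(self.active_requests) < self.max_batch_size:
            self.active_requests.append(self.waiting.pop(0))

        # Run one decoding step for every active request
        batch = self.prepare_batch(self.active_requests)
        outputs = self.model.decode_step(**batch)

        # Return finished requests and free their slots
        finished = self.handle_completions(outputs)
        self.active_requests = [r for r in self.active_requests if r not in finished]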

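8.5 Flash Attention

import torch.nn.functional as F

# PyTorch 2.0+: dispatches to a FlashAttention kernel when the inputs and hardware allow it
output = F.scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=True
)

Advantages:

Memory is O(n) in sequence length instead of O(n²), because the full attention matrix is never materialized
Roughly 2-4x faster, depending on hardware and sequence length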
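The linear memory footprint comes from computing softmax in a streaming fashion. The sketch below is not the FlashAttention kernel itself (which tiles the computation into blocks that fit in on-chip memory); it only illustrates the online-softmax idea for a single query, processing one key/value pair at a time while keeping a running maximum, normalizer, and output. The result matches the naive version that stores every score first.

```python
import math

def naive_attention(q, keys, values):
    """Materialize all scores, then apply softmax and the weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(len(values[0]))]

def streaming_attention(q, keys, values):
    """Single pass over keys/values with constant extra state."""
    running_max, normalizer = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)   # rescale earlier partial sums
        weight = math.exp(score - new_max)
        normalizer = normalizer * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]

q = [0.3, -1.2]
keys = [[1.0, 0.5], [-0.7, 2.0], [0.2, 0.1], [3.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
out_naive = naive_attention(q, keys, values)
out_stream = streaming_attention(q, keys, values)
print(out_naive, out_stream)
```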

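8.6 Inference Frameworks

Framework	Features
vLLM	PagedAttention, continuous batching
TensorRT-LLM	NVIDIA-optimized kernels for high-throughput GPU inference
llama.cpp	CPU inference, quantization support
MLC-LLM	Multi-platform deployment

9. Evaluation

9.1 Benchmarks

Benchmark	What it measures
MMLU	Multi-domain knowledge
GSM8K	Mathematical reasoning
HumanEval	Code generation
TruthfulQA	Truthfulness
HellaSwag	Commonsense reasoning
ARC	Science reasoning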

9.2 Evaluation Code

import lm_eval

# Uses lm-evaluation-harness (v0.4+); "llama-7b" stands in for a real model id or path
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-7b",
    tasks=["mmlu", "hellaswag", "truthfulqa"],
    num_fewshot=0,
    batch_size=32
)

print(results['results'])

9.3 Human Evaluation

Evaluation dimensions:
- Helpfulness
- Honesty
- Harmlessness
- Format correctness
- Factual accuracy

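10. Representative Models

10.1 GPT Series

GPT-3 (175B):

96 layers, hidden size 12288, 96 heads
Trained on 300B tokens
Few-shot learning ability

GPT-4:

Multimodal (text + images)
Stronger reasoning
Improved safety

10.2 LLaMA Series

LLaMA 2:

Architecture highlights
- RMSNorm
- SwiGLU activation
- RoPE position encoding
- GQA (70B version)

LLaMA 3:

128K vocabulary
8K context (extended to 128K in LLaMA 3.1)
More training data (15T tokens)

10.3 Mistral/Mixtral

Mistral 7B:

Sliding-window attention
GQA
Reported to outperform LLaMA 2 13B

Mixtral 8x7B:

MoE architecture
8 experts, 2 active per token
47B total parameters, 13B active parameters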
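The "8 experts, 2 active" design can be sketched as top-2 routing. In this toy version the input is a scalar, each expert is a function that multiplies by a constant, and the gate logits are fixed numbers; all of these are stand-ins invented for the example. In a real MoE layer the logits come from a learned linear gate applied to each token's hidden state, and each expert is a feed-forward network.

```python
import math

def top2_moe(x, gate_logits, experts):
    """Route x to the two highest-scoring experts and mix their outputs."""
    top2 = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:2]
    # Softmax over the two selected logits only
    m = max(gate_logits[i] for i in top2)
    exps = [math.exp(gate_logits[i] - m) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Only the selected experts are evaluated; the other six cost no compute
    output = sum(w * experts[i](x) for w, i in zip(weights, top2))
    return output, top2, weights

# Toy experts: expert i multiplies its input by i + 1
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

output, chosen, weights = top2_moe(1.0, gate_logits, experts)
print(chosen, weights, output)
```

Because only two of the eight experts run per token, the compute per token tracks the active parameter count rather than the total.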

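10.4 Chinese Models

Qwen (Tongyi Qianwen):

Chinese-English bilingual
150K vocabulary
Long-context support

ChatGLM:

GLM architecture
Chinese-English bilingual
Optimized for dialogue

DeepSeek:

Strong coding ability
MoE architecture (V2)
Cost-effective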

11. Full Code Implementation

11.1 A Simplified LLM Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass

@dataclass
class LLMConfig:
    vocab_size: int = 32000
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8  # GQA
    d_ff: int = 11008
    max_seq_len: int = 2048
    dropout: float = 0.0

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

class RoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        t = torch.arange(max_seq_len).float()
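        freqs = torch.outer(t, inv_freq)          # (max_seq_len, dim / 2)
        # Repeat the angles for both halves so cos/sin have shape (max_seq_len, dim)
        # and broadcast against q and k in apply_rotary_pos_emb
        emb = torch.cat([freqs, freqs], dim=-1)
        self.register_buffer('cos_cached', emb.cos())
        self.register_buffer('sin_cached', emb.sin())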

    def forward(self, x, seq_len):
        return (
            self.cos_cached[:seq_len].to(x.device),
            self.sin_cached[:seq_len].to(x.device)
        )

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    q_embed = q * cos + rotate_half(q) * sin
    k_embed = k * cos + rotate_half(k) * sin
    return q_embed, k_embed

class GroupedQueryAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_heads = config.n_heads
        self.n_kv_heads = config.n_kv_heads
        self.d_k = config.d_model // config.n_heads
        self.n_rep = config.n_heads // config.n_kv_heads

        self.wq = nn.Linear(config.d_model, config.n_heads * self.d_k, bias=False)
        self.wk = nn.Linear(config.d_model, config.n_kv_heads * self.d_k, bias=False)
        self.wv = nn.Linear(config.d_model, config.n_kv_heads * self.d_k, bias=False)
        self.wo = nn.Linear(config.n_heads * self.d_k, config.d_model, bias=False)

    def forward(self, x, mask=None, freqs_cis=None):
        B, L, D = x.shape

        q = self.wq(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)
        k = self.wk(x).view(B, L, self.n_kv_heads, self.d_k).transpose(1, 2)
        v = self.wv(x).view(B, L, self.n_kv_heads, self.d_k).transpose(1, 2)

        # RoPE
        if freqs_cis is not None:
            cos, sin = freqs_cis
            q, k = apply_rotary_pos_emb(q, k, cos, sin)

        # Expand KV heads so each group of query heads shares one KV head
        k = k.repeat_interleave(self.n_rep, dim=1)
        v = v.repeat_interleave(self.n_rep, dim=1)

        # Attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)

        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, L, -1)
        return self.wo(out)

class SwiGLUFFN(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.w1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.w3 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.w2 = nn.Linear(config.d_ff, config.d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention_norm = RMSNorm(config.d_model)
        self.attention = GroupedQueryAttention(config)
        self.ffn_norm = RMSNorm(config.d_model)
        self.ffn = SwiGLUFFN(config)

    def forward(self, x, mask=None, freqs_cis=None):
        h = x + self.attention(self.attention_norm(x), mask, freqs_cis)
        out = h + self.ffn(self.ffn_norm(h))
        return out

class LLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.tok_embeddings = nn.Embedding(config.vocab_size, config.d_model)
        self.layers = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
        self.norm = RMSNorm(config.d_model)
        self.output = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # RoPE
        self.rope = RoPE(config.d_model // config.n_heads, config.max_seq_len)

        # Causal mask
        mask = torch.tril(torch.ones(config.max_seq_len, config.max_seq_len))
        self.register_buffer('mask', mask)

    def forward(self, tokens, targets=None):
        B, L = tokens.shape

        # Embedding
        h = self.tok_embeddings(tokens)

        # RoPE
        freqs_cis = self.rope(h, L)

        # Causal mask
        mask = self.mask[:L, :L].unsqueeze(0).unsqueeze(0)

        # Transformer layers
        for layer in self.layers:
            h = layer(h, mask, freqs_cis)

        # Output
        h = self.norm(h)
        logits = self.output(h)

        # Compute the loss (targets are expected to be shifted by one position already)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

    @torch.no_grad()
    def generate(self, prompt, max_new_tokens=100, temperature=1.0, top_k=50):
        for _ in range(max_new_tokens):
            # Truncate to the maximum context length
            idx_cond = prompt[:, -self.config.max_seq_len:]

            # Forward pass
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature

            # Top-k: mask out everything below the k-th largest logit
            if top_k > 0:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')

            # Sample
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            prompt = torch.cat([prompt, next_token], dim=1)

        return prompt

11.2 Training Loop

def train_llm(dataloader, max_steps):
    # Configuration
    config = LLMConfig(
        vocab_size=32000,
        d_model=4096,
        n_layers=32,
        n_heads=32,
        n_kv_heads=8,
        d_ff=11008,
        max_seq_len=2048
    )

    # Model
    model = LLM(config).cuda()

    # Optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

    # Learning-rate schedule: linear warmup over 1000 steps, then cosine decay
    def lr_lambda(step):
        if step < 1000:
            return step / 1000
        return 0.5 * (1 + math.cos(math.pi * (step - 1000) / (max_steps - 1000)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # Training; `labels` are assumed to be the inputs shifted by one token
    for step, batch in enumerate(dataloader):
        if step >= max_steps:
            break
        tokens = batch['input_ids'].cuda()
        targets = batch['labels'].cuda()

        logits, loss = model(tokens, targets)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss.item():.4f}, LR: {scheduler.get_last_lr()[0]:.6f}")

12. Deployment and Applications

12.1 Deploying with vLLM

from vllm import LLM, SamplingParams

# Load the model
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Generate
prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

12.2 Inference with Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Generate
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

12.3 API Service

from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

# `model` and `tokenizer` are assumed to be loaded as in 12.2

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7

@app.post("/generate")
def generate(request: GenerateRequest):
    # Plain `def` so FastAPI runs the blocking generate call in a worker thread
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=request.max_tokens,
        do_sample=True,  # temperature only takes effect when sampling
        temperature=request.temperature
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"text": result}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

12.4 Gradio Interface

import gradio as gr

def predict(message, history):
    # Build the conversation; assumes `history` is a list of (user, assistant) pairs
    messages = []
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    # Generate
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens, not the prompt
    response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

    return response

demo = gr.ChatInterface(predict, title="LLM Chat")
demo.launch()

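13. Frontier Directions

13.1 Long Context

RoPE extrapolation: NTK-aware scaling, YaRN
Attention optimization: Ring Attention, Striped Attention
Compression: Landmark Attention

13.2 Multimodality

Vision-language: GPT-4V, LLaVA, Qwen-VL
Audio: Whisper, Qwen-Audio
Video: Video-LLaVA

13.3 Agents

Tool use: function calling, code execution
Planning: task decomposition, chains of reasoning
Memory: long-term memory, retrieval augmentation

13.4 Efficiency

MoE: sparse activation, many parameters at low compute cost
Quantization: 4-bit, 2-bit
Distillation: a small model learns from a large one

13.5 Safety and Alignment

RLHF/DPO: alignment with human preferences
Constitutional AI: constraint by explicit principles
Red teaming: adversarial testing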

14. References

Key Papers

GPT-3: "Language Models are Few-Shot Learners" (2020)
LLaMA: "LLaMA: Open and Efficient Foundation Language Models" (2023)
LLaMA 2: "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
Mistral: "Mistral 7B" (2023)
Mixtral: "Mixtral of Experts" (2024)
Scaling Laws: "Scaling Laws for Neural Language Models" (2020)
Chinchilla: "Training Compute-Optimal Large Language Models" (2022)

Open-Source Projects

Transformers: https://github.com/huggingface/transformers
vLLM: https://github.com/vllm-project/vllm
llama.cpp: https://github.com/ggerganov/llama.cpp
Megatron-LM: https://github.com/NVIDIA/Megatron-LM
DeepSpeed: https://github.com/microsoft/DeepSpeed

Model Resources

HuggingFace Hub: https://huggingface.co/models
LLaMA: https://github.com/facebookresearch/llama
Mistral: https://mistral.ai
