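1. Overview
1.1 What Is a Large Language Model
A large language model (LLM) is a neural language model, typically built on the Transformer architecture and pretrained on large-scale text corpora, with parameter counts ranging from a few billion to several trillion.
1.2 Core Characteristics
| Characteristic | Description |
|---|---|
| Scale | Very large parameter counts (7B to 175B and beyond) |
| Emergent abilities | Scale brings qualitative change: reasoning, coding, translation, and more |
| In-context learning | Learns a task from examples in the prompt, without fine-tuning |
| Generality | A single model handles many kinds of tasks |
1.3 Timeline
2018: GPT (117M), BERT (340M)
2019: GPT-2 (1.5B), T5 (11B)
2020: GPT-3 (175B), which opened the era of large models
2022: ChatGPT, aligned with RLHF, which set off the current wave of interest in AI
2023: GPT-4, LLaMA, Claude, Qwen
2024: LLaMA 3, Mixtral, GPT-4o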
2. A Brief History of Language Models
2.1 Statistical Language Models
N-gram model:
$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$
Limitations: data sparsity (most n-grams never appear in the training corpus) and difficulty capturing long-range dependencies, since the context is cut off at n-1 tokens.
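The approximation above can be made concrete with a count-based bigram model (n = 2). This is a minimal sketch on a toy corpus; the function names `train_bigram` and `bigram_prob` are illustrative, not from any library.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count-based bigram model: P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1})."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        pair_counts[prev][cur] += 1
    return pair_counts

def bigram_prob(pair_counts, prev, cur):
    """Relative frequency of `cur` among all tokens that followed `prev`."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][cur] / total if total else 0.0

corpus = "the cat sat on the mat".split()
model = train_bigram(corpus)
p_cat = bigram_prob(model, "the", "cat")  # "the" is followed once by "cat" and once by "mat"
p_dog = bigram_prob(model, "the", "dog")  # unseen pair gets probability 0: data sparsity
```

The zero probability for an unseen pair is the sparsity problem in miniature; practical n-gram models add smoothing to avoid it.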
2.2 Neural Language Models
Bengio et al. (2003), feed-forward neural network:
$$P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \text{softmax}(Wh + b)$$
Here h is the hidden representation computed from the embeddings of the n-1 context words.
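2.3 The RNN Era
LSTM/GRU: gating mitigates the long-range dependency problem, but computation is still sequential over time steps and cannot be parallelized across a sequence.
2.4 The Transformer Era
Self-attention: all positions are processed in parallel, and any two positions interact directly regardless of distance.
Pretraining paradigms:
- GPT: autoregressive generation
- BERT: bidirectional understanding
- T5: unified text-to-text framework
2.5 The Large-Model Era
Scaling laws:
$$L(N) \propto N^{-\alpha}$$
Loss falls as a power law in parameter count, dataset size, and compute.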
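To get a feel for what a power law implies, the sketch below computes how much the loss shrinks when the parameter count N grows by a given factor. The exponent `ALPHA` is an assumed value used purely for illustration; the real exponent depends on the model family and data.

```python
def loss_ratio(scale_factor, alpha):
    """Factor by which the loss changes when N is multiplied by scale_factor,
    assuming L(N) is proportional to N ** (-alpha)."""
    return scale_factor ** (-alpha)

ALPHA = 0.076  # assumed exponent, for illustration only
ratio_10x = loss_ratio(10, ALPHA)
ratio_100x = loss_ratio(100, ALPHA)
print(f"10x parameters  -> loss multiplied by {ratio_10x:.3f}")
print(f"100x parameters -> loss multiplied by {ratio_100x:.3f}")
```

With a small exponent, each tenfold increase in parameters buys only a modest, but predictable, reduction in loss.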
3. Pretraining Techniques
3.1 Causal Language Modeling (CLM)
Objective: predict the next token.
$$\mathcal{L}_{CLM} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})$$
Used by: the GPT series, LLaMA, Mistral
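As a worked instance of the CLM objective, the following sketch evaluates the loss for a four-token sequence. The probabilities are hypothetical values standing in for what a model assigned to each correct next token.

```python
import math

def clm_loss(token_probs):
    """Sum of -log P(x_t | x_<t), given the probability the model assigned
    to each actual next token."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities for the 4 correct next tokens of one sequence.
probs = [0.5, 0.25, 0.8, 0.1]
loss = clm_loss(probs)
perplexity = math.exp(loss / len(probs))  # per-token perplexity
print(f"loss = {loss:.4f}, perplexity = {perplexity:.4f}")
```

In practice the loss is usually averaged over tokens rather than summed, which is why perplexity is defined from the per-token mean.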
3.2 Masked Language Modeling (MLM)
Objective: predict the masked tokens from the surrounding unmasked context.
$$\mathcal{L}_{MLM} = -\sum_{t \in \text{mask}} \log P(x_t \mid x_{\backslash t})$$
Used by: BERT, RoBERTa
3.3 Denoising Objectives
T5 span corruption (original sentence: "The cat sat on the mat"):
Input: "The <X> sat on <Y> mat"
Target: "<X> cat <Y> the"
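The input/target pair above can be generated mechanically. This is a minimal sketch that takes the spans to corrupt as explicit arguments; real pipelines pick spans at random, and the helper name `span_corrupt` and the `<X>`/`<Y>`/`<Z>` sentinel strings are illustrative.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel; the target lists each
    sentinel followed by the tokens it replaced. Spans must be sorted and
    non-overlapping; `end` is exclusive."""
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # span becomes one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # target recovers the span
        cursor = end
    inputs.extend(tokens[cursor:])
    return " ".join(inputs), " ".join(targets)

words = "The cat sat on the mat".split()
corrupted, target = span_corrupt(words, [(1, 2), (4, 5)])
print(corrupted)
print(target)
```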
3.4 Mixed Objectives
Some models combine several pretraining objectives to learn better representations.
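4. Model Architecture
4.1 Mainstream Architectures
Decoder-only (GPT style):
The dominant choice today
- GPT series
- LLaMA series
- Mistral/Mixtral
- Qwen
Encoder-only (BERT style):
Mainly used for understanding tasks
- BERT, RoBERTa
- DeBERTa
Encoder-decoder (T5 style):
Used for sequence-to-sequence tasks
- T5, BART
- Flan-T5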
4.2 Components of a Modern LLM
Input → Token Embedding
↓
┌─────────────────────────┐
│ Transformer Block × N │
│ ┌───────────────────┐ │
│ │ RMSNorm │ │
│ │ GQA Attention │ │
│ │ + RoPE │ │
│ │ + Residual │ │
│ ├───────────────────┤ │
│ │ RMSNorm │ │
│ │ SwiGLU FFN │ │
│ │ + Residual │ │
│ └───────────────────┘ │
└─────────────────────────┘
↓
RMSNorm → LM Head → Output probabilities
4.3 Key Components
RMSNorm
$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$
Cheaper than LayerNorm because it skips mean subtraction and the bias term, with comparable quality in practice.
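The formula can be checked on a single vector with plain Python. This sketch is only meant to show the arithmetic; `rms_norm` is an illustrative helper, not a library function.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gamma)]

x = [3.0, 4.0]
y = rms_norm(x, gamma=[1.0, 1.0])
# mean(x^2) = (9 + 16) / 2 = 12.5, so every element is divided by sqrt(12.5)
print(y)
```

With gamma set to ones, the output always has a root mean square of approximately 1, whatever the scale of the input.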
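SwiGLU
$$\text{SwiGLU}(x) = \left(\text{Swish}(xW_1) \otimes xW_3\right) W_2$$
Generally gives better quality than a ReLU feed-forward layer at a similar parameter budget.
RoPE (Rotary Position Embedding)
Encodes position as a rotation of the query and key vectors, so attention scores depend naturally on relative position.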
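The relative-position property is easy to verify in two dimensions. The sketch below rotates a toy query and key by an angle proportional to their positions and compares the dot product at two different absolute positions with the same offset. The vectors and the single frequency `theta` are assumptions made for the demonstration.

```python
import math

def rotate(pair, position, theta=1.0):
    """Rotate a 2-D (x, y) pair by the angle position * theta."""
    x, y = pair
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)
# Same relative offset (3) at two very different absolute positions.
score_near = dot(rotate(q, 5), rotate(k, 2))
score_far = dot(rotate(q, 105), rotate(k, 102))
print(score_near, score_far)
```

Real RoPE applies this same rotation to many 2-D pairs of the head dimension, each with its own frequency.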
GQA (Grouped-Query Attention)
Several query heads share one key/value head, which shrinks the KV cache.
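A back-of-the-envelope calculation shows the size of the saving. The model shape below (32 layers, 32 query heads of size 128, 4096-token context, FP16 values) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """KV cache size: keys and values (factor 2) for every layer, KV head,
    position, and sequence in the batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Standard multi-head attention: one KV head per query head.
mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=4096)
# GQA with 8 KV heads: every 4 query heads share one KV head.
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB per sequence")
```

The cache shrinks by exactly the ratio of query heads to KV heads, here a factor of 4.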
4.4 Model Sizes
| Model | Parameters | Layers | Hidden size | Attention heads |
|---|---|---|---|---|
| LLaMA-7B | 7B | 32 | 4096 | 32 |
| LLaMA-13B | 13B | 40 | 5120 | 40 |
| LLaMA 2 70B | 70B | 80 | 8192 | 64 |
| GPT-3 | 175B | 96 | 12288 | 96 |
| Mixtral-8x7B | 47B (total) | 32 | 4096 | 32 |
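The parameter counts in the table follow from the layer shapes. The sketch below reproduces the LLaMA-7B figure under stated assumptions: full multi-head attention, a SwiGLU feed-forward layer with intermediate size 11008, a 32000-token vocabulary, untied input and output embeddings, and no bias terms.

```python
def llama_style_param_count(n_layers, d_model, d_ff, vocab_size):
    """Parameter count for a decoder with full multi-head attention, SwiGLU FFN,
    RMSNorm, and untied input/output embeddings (no biases)."""
    attention = 4 * d_model * d_model      # Wq, Wk, Wv, Wo
    ffn = 3 * d_model * d_ff               # W1, W2, W3
    norms = 2 * d_model                    # two RMSNorm weights per block
    per_layer = attention + ffn + norms
    embeddings = 2 * vocab_size * d_model  # token embedding + LM head
    return n_layers * per_layer + embeddings + d_model  # + final RMSNorm

total = llama_style_param_count(n_layers=32, d_model=4096, d_ff=11008, vocab_size=32000)
print(f"{total:,}")  # 6,738,415,616
```

Most of the budget sits in the feed-forward layers, which account for about two thirds of each block.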
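5. Training Data
5.1 Data Sources
| Dataset | Scale | Notes |
|---|---|---|
| Common Crawl | Petabytes | Web crawl; requires filtering |
| Wikipedia | Tens of GB | High-quality knowledge |
| Books | Hundreds of GB | Long-form text, in-depth knowledge |
| Code | Hundreds of GB | Source code from GitHub |
| Academic papers | Hundreds of GB | Scientific knowledge |
5.2 Data Processing Pipeline
Raw data → Deduplication → Quality filtering → Toxicity filtering → PII removal → Tokenization → Training data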
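Deduplication
python
# Near-duplicate detection with MinHash + LSH
from datasketch import MinHash, MinHashLSH

def compute_minhash(text, num_perm=128):
    """Build a MinHash signature from the whitespace-separated words of text."""
    m = MinHash(num_perm=num_perm)
    for word in text.split():
        m.update(word.encode('utf8'))
    return m

# Index one document, then query: query() returns the keys of likely near-duplicates
lsh = MinHashLSH(threshold=0.8, num_perm=128)
lsh.insert("doc-0", compute_minhash("the quick brown fox jumps over the lazy dog"))
duplicates = lsh.query(compute_minhash("the quick brown fox jumps over the lazy dog"))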
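Quality filtering
python
from langdetect import detect  # assumed language-ID library; the original does not name one

def quality_filter(text):
    # Length filter
    if len(text) < 100 or len(text) > 100000:
        return False
    # Language detection (keep English only)
    if detect(text) != 'en':
        return False
    # Share of characters that are neither alphanumeric nor spaces
    special_ratio = sum(1 for c in text if not c.isalnum() and c != ' ') / len(text)
    if special_ratio > 0.1:
        return False
    return True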
5.3 Data Mixture
A typical mixture:
- Web data: 60-70%
- Books: 10-15%
- Code: 10-15%
- Wikipedia: 5-10%
- Academic papers: 5-10%
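One simple way to realize such a mixture is to sample the source of each training document according to fixed weights. The weights below are assumptions picked from inside the ranges above so that they sum to 1; `sample_sources` is an illustrative helper.

```python
import random

# Assumed weights, chosen inside the ranges above so that they sum to 1.
MIXTURE = {"web": 0.65, "books": 0.10, "code": 0.10, "wikipedia": 0.08, "academic": 0.07}

def sample_sources(n, mixture=MIXTURE, seed=0):
    """Pick the source of each of n training documents according to the mixture."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(10_000)
web_share = draws.count("web") / len(draws)
print(f"web share in sample: {web_share:.3f}")
```

Sampling by weight also makes it easy to upsample small, high-quality sources such as Wikipedia beyond their raw share of the corpus.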
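5.4 Tokenization
BPE (Byte Pair Encoding):
Training procedure:
1. Initialize: every byte is a base token
2. Count the frequency of adjacent token pairs
3. Merge the most frequent pair into a new token
4. Repeat until the target vocabulary size is reached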
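The four steps above can be sketched in a few lines. This toy version trains on a single string and stops after a fixed number of merges instead of a target vocabulary size (which would be 256 base bytes plus the number of merges); production tokenizers also pre-split text and handle ties and special tokens more carefully.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to num_merges BPE merges over the UTF-8 bytes of text."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]  # step 1: bytes as base tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))          # step 2: count adjacent pairs
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]   # step 3: most frequent pair
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)               # replace the pair with one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged                                   # step 4: repeat on the new sequence
    return merges, tokens

merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(tokens)
```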
Vocabulary sizes (approximate):
- 32K: LLaMA
- 50K: GPT-2
- 100K: GPT-4
- 150K: Qwen
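6. Distributed Training
6.1 Data Parallelism (DP)
Each GPU holds a full replica of the model and processes a different slice of the data:
python
import torch.nn as nn

# Single-process data parallelism; `model` is an existing nn.Module
model = nn.parallel.DataParallel(model)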
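6.2 Distributed Data Parallelism (DDP)
python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, launched with torchrun, which sets LOCAL_RANK
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])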
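6.3 Model Parallelism
Tensor parallelism
Split a matrix multiplication across several GPUs:
python
# Column parallel: split W by columns, then concatenate the partial outputs
# Y = XW = X[W1, W2] = [XW1, XW2]
# Row parallel: split W by rows and X by columns, then sum the partial outputs
# Y = XW = [X1, X2][W1; W2] = X1W1 + X2W2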
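Both identities can be verified numerically on small matrices. The sketch below uses plain Python lists and two simulated "GPUs"; the matrices are arbitrary toy values, and the final sum in the row-parallel case stands in for the all-reduce a real system would perform.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def hsplit(m, k):
    """Split a matrix into its first k columns and the remaining columns."""
    return [row[:k] for row in m], [row[k:] for row in m]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]       # 2 x 4 activations
W = [[1, 0], [0, 1], [2, 1], [1, 3]]   # 4 x 2 weights
Y = matmul(X, W)

# Column parallel: each "GPU" holds some columns of W; outputs are concatenated.
W1, W2 = hsplit(W, 1)
Y_col = [r1 + r2 for r1, r2 in zip(matmul(X, W1), matmul(X, W2))]

# Row parallel: each "GPU" holds some rows of W and the matching columns of X;
# partial outputs are summed.
X1, X2 = hsplit(X, 2)
P1, P2 = matmul(X1, W[:2]), matmul(X2, W[2:])
Y_row = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(P1, P2)]
print(Y, Y_col, Y_row)
```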
Pipeline parallelism
Place different layers on different GPUs:
GPU 0: Layers 0-15
GPU 1: Layers 16-31
GPU 2: Layers 32-47
GPU 3: Layers 48-63
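The assignment above is an even split of contiguous layers. A minimal sketch of that partitioning follows; `partition_layers` is an illustrative helper, and real frameworks may also balance stages by compute or memory rather than by layer count.

```python
def partition_layers(n_layers, n_stages):
    """Assign contiguous layer ranges to pipeline stages as evenly as possible."""
    base, extra = divmod(n_layers, n_stages)
    stages, start = [], 0
    for stage in range(n_stages):
        size = base + (1 if stage < extra else 0)  # earlier stages absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages

stages = partition_layers(64, 4)
for gpu, layers in enumerate(stages):
    print(f"GPU {gpu}: Layers {layers.start}-{layers.stop - 1}")
```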
6.4 ZeRO Optimization
python
# DeepSpeed ZeRO
ds_config = {
    "zero_optimization": {
        "stage": 2,  # shard optimizer states and gradients
        "allgather_partitions": True,
        "allgather_bucket_size": 5e8,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
    }
}
| Stage | What is sharded | Memory savings (model states) |
|---|---|---|
| Stage 1 | Optimizer states | up to ~4x |
| Stage 2 | + gradients | up to ~8x |
| Stage 3 | + parameters | ~Nx (N = number of data-parallel GPUs) |
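The savings in the table can be derived from a per-parameter memory budget. The sketch assumes mixed-precision Adam, with 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of optimizer state per parameter, and counts only model states (activations and buffers are ignored). The 7B model on 64 GPUs is a hypothetical setup.

```python
def model_state_bytes_per_gpu(n_params, n_gpus, stage):
    """Per-GPU bytes for model states under mixed-precision Adam:
    2 B/param FP16 weights, 2 B/param FP16 gradients, and
    12 B/param optimizer states (FP32 weights, momentum, variance)."""
    weights, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= n_gpus    # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= n_gpus    # Stage 2 also shards gradients
    if stage >= 3:
        weights /= n_gpus  # Stage 3 also shards parameters
    return weights + grads + optim

N_PARAMS, N_GPUS = 7e9, 64  # hypothetical 7B model on 64 data-parallel GPUs
baseline = model_state_bytes_per_gpu(N_PARAMS, N_GPUS, stage=0)
savings = {}
for stage in (1, 2, 3):
    used = model_state_bytes_per_gpu(N_PARAMS, N_GPUS, stage)
    savings[stage] = baseline / used
    print(f"Stage {stage}: {used / 1e9:.1f} GB per GPU, {savings[stage]:.1f}x smaller")
```

The 4x and 8x figures are limits reached as the number of GPUs grows; with 64 GPUs the savings come out slightly lower.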
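6.5 3D Parallelism
Combine the three forms of parallelism:
- Data parallelism: scales to large batches
- Tensor parallelism: splits layers that are too large for one GPU
- Pipeline parallelism: splits deep networks across GPUs
Example: a Megatron-LM configuration
TP=8 (within a node), PP=16 (across nodes), DP=64, for a total of 8 × 16 × 64 = 8192 GPUs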
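6.6 Mixed-Precision Training
python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(batch)
        loss = criterion(outputs)
    scaler.scale(loss).backward()  # scale the loss so small FP16 gradients do not underflow
    scaler.step(optimizer)         # unscales gradients and skips the step on inf/NaN
    scaler.update()
BF16 vs FP16:
- BF16: wider dynamic range, so it rarely overflows, at the cost of fewer mantissa bits
- FP16: narrower range, so it needs loss scaling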
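The range difference can be observed without a GPU. In this sketch, FP16 is exercised through the half-precision `e` format of Python's `struct` module, and BF16 is emulated by keeping the top 16 bits of the float32 encoding (truncation rather than proper rounding, which is a simplification).

```python
import struct

def to_fp16(x):
    """Round-trip through IEEE half precision; raises OverflowError when out of range."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]

value = 70000.0              # above the FP16 maximum of 65504
bf16_value = to_bf16(value)  # representable in BF16, but only coarsely
try:
    to_fp16(value)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True
print(f"BF16: {bf16_value}, FP16 overflow: {fp16_overflowed}")
```

BF16 keeps the value in range but loses precision, which is the trade-off summarized in the list above.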
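7. Fine-Tuning Techniques
7.1 Full Fine-Tuning
Update every parameter:
python
from torch.optim import AdamW
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llama-7b")  # placeholder model ID
optimizer = AdamW(model.parameters(), lr=2e-5)
for batch in dataloader:
    outputs = model(**batch)  # batch must include labels for the loss to be computed
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()     # clear gradients before the next step
Drawback: requires a large amount of GPU memory, since gradients and optimizer states are kept for every parameter.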
7.2 LoRA (Low-Rank Adaptation)
Core idea
Freeze the original weights and train only a low-rank update:
$$W' = W + BA$$
where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$.
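The shapes of B and A determine how few parameters are trained. The sketch below counts them for assumed LLaMA-7B-like shapes (32 layers, 4096 x 4096 attention projections) with rank 16; `lora_trainable_params` is an illustrative helper.

```python
def lora_trainable_params(d_in, d_out, r, n_modules_per_layer, n_layers):
    """Each adapted weight of shape (d_out, d_in) adds B (d_out x r) and A (r x d_in)."""
    per_module = r * (d_in + d_out)
    return per_module * n_modules_per_layer * n_layers

# Assumed LLaMA-7B-like shapes: 32 layers, 4096 x 4096 attention projections.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, r=16, n_modules_per_layer=1, n_layers=1)
total = lora_trainable_params(4096, 4096, r=16, n_modules_per_layer=4, n_layers=32)
print(f"one projection: {adapter:,} adapter params vs {full:,} frozen ({adapter / full:.2%})")
print(f"q/k/v/o projections over 32 layers: {total:,} trainable params")
```

Because the count grows linearly in r, doubling the rank doubles the trainable parameters while the frozen base stays untouched.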
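Implementation
python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,            # rank
    lora_alpha=32,   # scaling factor (the update is scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# For a LLaMA-7B base this prints roughly (exact format varies by PEFT version):
# trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484
Advantages
- Memory savings: only a small fraction of the parameters (well under 1% here) is trained, so gradients and optimizer states stay small
- Faster training: far fewer gradients to compute and synchronize
- No added inference latency: the low-rank update can be merged back into the base weights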
7.3 QLoRA
Apply LoRA on top of a 4-bit quantized base model:
python
import torch
from peft import get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "llama-7b",
    quantization_config=bnb_config,
    device_map="auto"
)
# Then apply LoRA (lora_config is a LoraConfig such as the one in 7.2)
model = get_peft_model(model, lora_config)
7.4 Instruction Fine-Tuning (SFT)
Data format
json
{
    "instruction": "Translate the following sentence into English",
    "input": "今天天气很好",
    "output": "The weather is nice today."
}
Prompt template
### System: You are a helpful assistant
### User: {instruction}
### Assistant: {output}
Training code
python
def formatting_func(examples):
    """Render a batch of records with the prompt template above and tokenize them."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        text = (
            "### System: You are a helpful assistant\n"
            f"### User: {instruction}\n{input_text}\n"
            f"### Assistant: {output}"
        )
        texts.append(text)
    # `tokenizer` is assumed to be loaded elsewhere
    return tokenizer(texts, truncation=True, max_length=2048)
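8. Inference Optimization
8.1 KV Cache
python
import torch

class KVCache:
    """Stores keys and values of past tokens so each decoding step only computes the new token's."""
    def __init__(self):
        self.key_cache = None
        self.value_cache = None

    def update(self, key_states, value_states):
        # key_states / value_states: (batch, n_kv_heads, new_tokens, head_dim)
        if self.key_cache is None:
            self.key_cache = key_states
            self.value_cache = value_states
        else:
            # Append along the sequence dimension
            self.key_cache = torch.cat([self.key_cache, key_states], dim=2)
            self.value_cache = torch.cat([self.value_cache, value_states], dim=2)
        return self.key_cache, self.value_cache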
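8.2 Quantization
INT8 quantization
python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_int8 = AutoModelForCausalLM.from_pretrained(
    "llama-7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)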
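To show what INT8 quantization does to a weight, here is a minimal sketch of symmetric absmax quantization on a short list of values. It illustrates the general idea only and is not the scheme used by any particular library, which typically quantize per row or per block and treat outliers separately. The weights are made-up values and at least one is assumed to be non-zero.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts accuracy for every other weight.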
INT4 quantization (GPTQ)
python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "llama-7b-gptq-4bit",
    device="cuda:0"
)
AWQ quantization
python
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "llama-7b-awq-4bit",
    fuse_layers=True
)
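8.3 Speculative Decoding
A small draft model proposes several tokens and the large target model verifies them in a single forward pass:
python
import random

def speculative_decode(draft_model, target_model, prompt, num_speculative=5):
    # `generate_with_probs` and `get_probs` are hypothetical helpers that return,
    # for each drafted token, the probability the respective model assigns to it.
    # The draft model proposes candidate tokens
    draft_tokens, draft_probs = draft_model.generate_with_probs(
        prompt, max_new_tokens=num_speculative
    )
    # The target model scores all candidates in one parallel forward pass
    target_probs = target_model.get_probs(prompt, draft_tokens)
    # Accept token i with probability min(1, p_target / p_draft); stop at the first rejection
    accepted = 0
    for i in range(len(draft_tokens)):
        if random.random() < min(1.0, target_probs[i] / draft_probs[i]):
            accepted += 1
        else:
            break
    # A complete implementation also resamples the rejected position from the
    # normalized residual max(0, p_target - p_draft); omitted here.
    return draft_tokens[:accepted]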
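8.4 Continuous Batching
python
from collections import deque

class ContinuousBatcher:
    """Iteration-level scheduling: finished requests leave the batch and waiting
    requests join it after every decoding step, not once per full batch."""
    def __init__(self, model, max_batch_size=32):
        self.model = model
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.active_requests = []

    def add_request(self, request):
        self.waiting.append(request)

    def step(self):
        # Fill free slots from the waiting queue
        while self.waiting and len(self.active_requests) < self.max_batch_size:
            self.active_requests.append(self.waiting.popleft())
        if not self.active_requests:
            return []
        # One decoding step for every active request.
        # `decode_step` is a hypothetical model method that appends one token to each
        # request and sets `request.finished` on EOS or when the token limit is hit.
        self.model.decode_step(self.active_requests)
        # Return completed requests and free their slots
        finished = [r for r in self.active_requests if r.finished]
        self.active_requests = [r for r in self.active_requests if not r.finished]
        return finished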
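8.5 Flash Attention
python
import torch.nn.functional as F

# PyTorch 2.0+: dispatches to a fused FlashAttention kernel when device, dtype, and shapes allow
# query, key, value: (batch, n_heads, seq_len, head_dim)
output = F.scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=True
)
Advantages:
- Memory is O(n) in sequence length rather than O(n²), because the full attention matrix is never materialized
- Typically around 2-4x faster than a standard attention implementation, depending on hardware and sequence length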
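The memory saving rests on computing softmax attention block by block with running statistics, often called online softmax. The sketch below does this for a single query in plain Python and checks the result against the straightforward version. It illustrates the numerical trick only; the real kernel also tiles over queries and keeps blocks in fast on-chip memory.

```python
import math

def attention_row(q, keys, values):
    """Reference: softmax(q . K) V for one query, materializing all scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_row_streaming(q, keys, values, block=2):
    """Same result, visiting keys/values block by block with a running max,
    running normalizer, and running weighted sum."""
    dim = len(values[0])
    running_max, normalizer, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk, v_blk = keys[start:start + block], values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale what was accumulated so far
        normalizer *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_row(q, keys, values)
streamed = attention_row_streaming(q, keys, values)
print(full, streamed)
```

Only one block of scores exists at a time, so working memory does not grow with the number of keys.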
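8.6 Inference Frameworks
| Framework | Highlights |
|---|---|
| vLLM | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA-optimized; among the fastest options on NVIDIA GPUs |
| llama.cpp | CPU inference, quantization support |
| MLC-LLM | Multi-platform deployment |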
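9. Evaluation
9.1 Benchmarks
| Benchmark | What it measures |
|---|---|
| MMLU | Multi-domain knowledge |
| GSM8K | Mathematical reasoning |
| HumanEval | Code generation |
| TruthfulQA | Truthfulness |
| HellaSwag | Commonsense reasoning |
| ARC | Science reasoning |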
9.2 Evaluation Code
python
import lm_eval

# lm-evaluation-harness v0.4+; backend and task names can differ between versions
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-7b",
    tasks=["mmlu", "hellaswag", "truthfulqa"],
    num_fewshot=0,
    batch_size=32
)
print(results['results'])
9.3 Human Evaluation
Evaluation dimensions:
- Helpfulness
- Honesty
- Harmlessness
- Format correctness
- Factual accuracy
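10. Representative Models
10.1 GPT Series
GPT-3 (175B):
- 96 layers, hidden size 12288, 96 heads
- Trained on 300B tokens
- Few-shot learning ability
GPT-4:
- Multimodal (text + image)
- Stronger reasoning
- Improved safety
10.2 LLaMA Series
LLaMA 2:
Architecture:
- RMSNorm
- SwiGLU activation
- RoPE position encoding
- GQA (70B model)
LLaMA 3:
- 128K vocabulary
- 8K context (extended to 128K in Llama 3.1)
- More training data (15T tokens)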
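10.3 Mistral/Mixtral
Mistral 7B:
- Sliding-window attention
- GQA
- Outperforms LLaMA 2 13B on the benchmarks reported in its paper
Mixtral 8x7B:
- Mixture-of-Experts (MoE) architecture
- 8 experts per layer, 2 activated per token
- 47B total parameters, 13B active per token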
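The "8 experts, 2 active" routing can be sketched for a single token as follows. The experts here are toy functions and the router scores are made-up numbers; in a real model each expert is a feed-forward network and the scores come from a learned linear layer.

```python
import math

def top2_moe(x, router_logits, experts):
    """Route one token to its 2 highest-scoring experts and mix their outputs
    with softmax weights computed over just those 2 logits."""
    top2 = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:2]
    m = max(router_logits[i] for i in top2)
    exp = {i: math.exp(router_logits[i] - m) for i in top2}
    norm = sum(exp.values())
    output = [0.0] * len(x)
    for i in top2:
        y = experts[i](x)  # only 2 of the 8 experts are evaluated
        output = [o + (exp[i] / norm) * yj for o, yj in zip(output, y)]
    return output, top2

# 8 toy "experts": expert i multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vj for vj in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # hypothetical router scores
out, chosen = top2_moe([1.0, -1.0], logits, experts)
print(chosen, out)
```

Because only the selected experts run, compute per token tracks the active parameter count rather than the total.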
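10.4 Chinese-Language Models
Qwen (Tongyi Qianwen):
- Bilingual Chinese/English support
- 150K vocabulary
- Long-context support
ChatGLM:
- GLM architecture
- Bilingual Chinese/English
- Optimized for dialogue
DeepSeek:
- Strong coding ability
- MoE architecture (V2)
- Cost-effective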
11. Complete Code Implementation
11.1 A Simplified LLM
python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass

@dataclass
class LLMConfig:
    vocab_size: int = 32000
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8  # GQA
    d_ff: int = 11008
    max_seq_len: int = 2048
    dropout: float = 0.0
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
class RoPE(nn.Module):
    def __init__(self, dim, max_seq_len=2048):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        t = torch.arange(max_seq_len).float()
        freqs = torch.outer(t, inv_freq)         # (max_seq_len, dim / 2)
        # Duplicate the angles so cos/sin span the full head dimension,
        # matching the half-split layout used by rotate_half below
        emb = torch.cat([freqs, freqs], dim=-1)  # (max_seq_len, dim)
        self.register_buffer('cos_cached', emb.cos())
        self.register_buffer('sin_cached', emb.sin())

    def forward(self, x, seq_len):
        return (
            self.cos_cached[:seq_len].to(x.device),
            self.sin_cached[:seq_len].to(x.device)
        )

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # cos/sin: (seq_len, head_dim), broadcast over (batch, heads, seq_len, head_dim)
    q_embed = q * cos + rotate_half(q) * sin
    k_embed = k * cos + rotate_half(k) * sin
    return q_embed, k_embed
class GroupedQueryAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_heads = config.n_heads
        self.n_kv_heads = config.n_kv_heads
        self.d_k = config.d_model // config.n_heads
        self.n_rep = config.n_heads // config.n_kv_heads  # query heads per KV head
        self.wq = nn.Linear(config.d_model, config.n_heads * self.d_k, bias=False)
        self.wk = nn.Linear(config.d_model, config.n_kv_heads * self.d_k, bias=False)
        self.wv = nn.Linear(config.d_model, config.n_kv_heads * self.d_k, bias=False)
        self.wo = nn.Linear(config.n_heads * self.d_k, config.d_model, bias=False)

    def forward(self, x, mask=None, freqs_cis=None):
        B, L, D = x.shape
        q = self.wq(x).view(B, L, self.n_heads, self.d_k).transpose(1, 2)
        k = self.wk(x).view(B, L, self.n_kv_heads, self.d_k).transpose(1, 2)
        v = self.wv(x).view(B, L, self.n_kv_heads, self.d_k).transpose(1, 2)
        # RoPE
        if freqs_cis is not None:
            cos, sin = freqs_cis
            q, k = apply_rotary_pos_emb(q, k, cos, sin)
        # Repeat each KV head so it is shared by n_rep query heads
        k = k.repeat_interleave(self.n_rep, dim=1)
        v = v.repeat_interleave(self.n_rep, dim=1)
        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, L, -1)
        return self.wo(out)
class SwiGLUFFN(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.w1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.w3 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.w2 = nn.Linear(config.d_ff, config.d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention_norm = RMSNorm(config.d_model)
        self.attention = GroupedQueryAttention(config)
        self.ffn_norm = RMSNorm(config.d_model)
        self.ffn = SwiGLUFFN(config)

    def forward(self, x, mask=None, freqs_cis=None):
        h = x + self.attention(self.attention_norm(x), mask, freqs_cis)
        out = h + self.ffn(self.ffn_norm(h))
        return out
class LLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.tok_embeddings = nn.Embedding(config.vocab_size, config.d_model)
        self.layers = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
        self.norm = RMSNorm(config.d_model)
        self.output = nn.Linear(config.d_model, config.vocab_size, bias=False)
        # RoPE
        self.rope = RoPE(config.d_model // config.n_heads, config.max_seq_len)
        # Causal mask
        mask = torch.tril(torch.ones(config.max_seq_len, config.max_seq_len))
        self.register_buffer('mask', mask)

    def forward(self, tokens, targets=None):
        B, L = tokens.shape
        # Embedding
        h = self.tok_embeddings(tokens)
        # RoPE
        freqs_cis = self.rope(h, L)
        # Causal mask
        mask = self.mask[:L, :L].unsqueeze(0).unsqueeze(0)
        # Transformer layers
        for layer in self.layers:
            h = layer(h, mask, freqs_cis)
        # Output
        h = self.norm(h)
        logits = self.output(h)
        # Compute the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, prompt, max_new_tokens=100, temperature=1.0, top_k=50):
        for _ in range(max_new_tokens):
            # Truncate to the context window
            idx_cond = prompt[:, -self.config.max_seq_len:]
            # Forward pass
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            # Top-k filtering
            if top_k > 0:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            # Sample the next token
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            prompt = torch.cat([prompt, next_token], dim=1)
        return prompt
11.2 Training Loop
python
def train_llm(dataloader, max_steps, warmup_steps=1000):
    # Configuration: a 7B-class model. In practice a model this size needs the
    # distributed techniques from section 6 rather than a single GPU.
    config = LLMConfig(
        vocab_size=32000,
        d_model=4096,
        n_layers=32,
        n_heads=32,
        n_kv_heads=8,
        d_ff=11008,
        max_seq_len=2048
    )
    # Model
    model = LLM(config).cuda()
    # Optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

    # Learning-rate schedule: linear warmup followed by cosine decay
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
        return 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    # Training
    for step, batch in enumerate(dataloader):
        if step >= max_steps:
            break
        tokens = batch['input_ids'].cuda()
        targets = batch['labels'].cuda()  # assumed to be input_ids shifted left by one
        logits, loss = model(tokens, targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        if step % 100 == 0:
            print(f"Step {step}, Loss: {loss.item():.4f}, LR: {scheduler.get_last_lr()[0]:.6f}")
12. Deployment and Applications
12.1 Deploying with vLLM
python
from vllm import LLM, SamplingParams

# Load the model
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
# Generate
prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
12.2 Inference with Transformers
python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# Generate
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
12.3 API Service
python
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

# Assumes `model` and `tokenizer` are already loaded as in 12.2
app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7

@app.post("/generate")
def generate(request: GenerateRequest):
    # Declared with plain `def` so FastAPI runs the blocking call in a worker thread
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=request.max_tokens,
        do_sample=True,  # temperature only takes effect when sampling
        temperature=request.temperature
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"text": result}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
12.4 Gradio Interface
python
import gradio as gr

def predict(message, history):
    # Build the conversation; assumes `history` is a list of (user, assistant) pairs,
    # the tuple-style format of older Gradio versions
    messages = []
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})
    # Generate
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens, not the prompt
    response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    return response

demo = gr.ChatInterface(predict, title="LLM Chat")
demo.launch()
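13. Frontier Directions
13.1 Long Context
- RoPE extrapolation: NTK-aware scaling, YaRN
- Attention optimization: Ring Attention, Striped Attention
- Compression: Landmark Attention
13.2 Multimodality
- Vision-language: GPT-4V, LLaVA, Qwen-VL
- Audio: Whisper, Qwen-Audio
- Video: Video-LLaVA
13.3 Agents
- Tool use: function calling, code execution
- Planning: task decomposition, reasoning chains
- Memory: long-term memory, retrieval augmentation
13.4 Efficiency
- MoE: sparse activation, many parameters at low compute per token
- Quantization: 4-bit, 2-bit
- Distillation: a small model learns from a large one
13.5 Safety and Alignment
- RLHF/DPO: alignment with human preferences
- Constitutional AI: principle-based constraints
- Red teaming: adversarial testing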
14. References
Core papers
- GPT-3: "Language Models are Few-Shot Learners" (2020)
- LLaMA: "LLaMA: Open and Efficient Foundation Language Models" (2023)
- LLaMA 2: "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
- Mistral: "Mistral 7B" (2023)
- Mixtral: "Mixtral of Experts" (2024)
- Scaling Laws: "Scaling Laws for Neural Language Models" (2020)
- Chinchilla: "Training Compute-Optimal Large Language Models" (2022)
Open-source projects
- Transformers: https://github.com/huggingface/transformers
- vLLM: https://github.com/vllm-project/vllm
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- DeepSpeed: https://github.com/microsoft/DeepSpeed
Model resources
- HuggingFace Hub: https://huggingface.co/models
- LLaMA: https://github.com/facebookresearch/llama
- Mistral: https://mistral.ai