Table of Contents
- [1 DLLM: Diffusion Language Models](#1-dllm-diffusion-language-models)
  - [1.1 DLLM Training](#11-dllm-training)
  - [1.2 DLLM Inference](#12-dllm-inference)
- [2 Block Diffusion](#2-block-diffusion)
  - [2.1 Block Diffusion vs. DLLM](#21-block-diffusion-vs-dllm)
  - [2.2 Training & Inference](#22-training--inference)
    - [2.2.1 Training Procedure](#221-training-procedure)
    - [2.2.2 Inference Procedure](#222-inference-procedure)
  - [2.3 Complete Block Diffusion Example](#23-complete-block-diffusion-example)
1 DLLM: Diffusion Language Models
In traditional autoregressive language models (such as the GPT series), text is generated in an autoregressive (AR) fashion: the model predicts the next token, appends the prediction to the context, and repeats until a full sentence has been produced. This yields high generation quality, but tokens cannot be generated in parallel, so decoding speed has an inherent bottleneck.
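That token-by-token loop looks roughly like this; a minimal greedy-decoding sketch using GPT-2 via `transformers`, purely for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt")["input_ids"]
for _ in range(10):                                  # one new token per iteration
    with torch.no_grad():
        logits = lm(ids).logits                      # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax().view(1, 1)      # greedy: most likely next token
    ids = torch.cat([ids, next_id], dim=1)           # append the prediction to the context
print(tok.decode(ids[0]))
```

Every new token requires another forward pass over the growing context; this sequential dependency is exactly the bottleneck that diffusion language models try to remove.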
Diffusion language models (DLLMs) borrow the core iterative-denoising idea of image diffusion models and recast the text-generation paradigm:

- Start from a fully noised sequence (every position set to the [MASK] token, or to random noise tokens);
- The model denoises iteratively, step by step turning the noisy sequence back into fluent, realistic text;
- Every step can process all positions of the sequence in parallel;
- This suits batched parallel prediction, text correction, and rewriting, and breaks the strict left-to-right, token-by-token constraint of autoregressive models.

Training follows the classic diffusion recipe: forward diffusion (noising) plus reverse denoising, so the model learns to recover the original discrete text sequence from any noise level.
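For masking-based discrete diffusion, the reverse-denoising objective is usually a masked-token cross-entropy averaged over random timesteps. A schematic form (per-timestep weighting factors vary between papers):

$$
\mathcal{L} \;=\; \mathbb{E}_{x_0,\; t \sim \mathcal{U}\{1,\dots,T\}}\left[\; \sum_{i \in \mathrm{masked}(x_t)} -\log p_\theta\!\left(x_0^i \mid x_t,\, t\right) \right]
$$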
1.1 DLLM Training
Image diffusion models add continuous Gaussian noise, but text lives in a discrete token space where Gaussian noise cannot be added directly. Discrete diffusion therefore relies on one of two noising strategies (a minimal sketch of both follows the list):

- Mainstream choice: randomly replace sequence tokens with the [MASK] token
- Alternative: randomly replace tokens with other, unrelated tokens from the vocabulary
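A minimal sketch of the two strategies; the helper names `mask_noise` and `random_token_noise` are illustrative only, and the masking probability would come from the noise schedule (t / T) used in the training code below:

```python
import random

def mask_noise(token_ids, mask_id, prob):
    """Mainstream strategy: replace each token with [MASK] with probability `prob`."""
    return [mask_id if random.random() < prob else tok for tok in token_ids]

def random_token_noise(token_ids, vocab_size, prob):
    """Alternative strategy: replace each token with a random vocabulary token."""
    return [random.randrange(vocab_size) if random.random() < prob else tok
            for tok in token_ids]

tokens = [101, 2054, 2003, 1996, 3007, 102]            # some example token ids
print(mask_noise(tokens, mask_id=103, prob=0.5))       # 103 is BERT's [MASK] id
print(random_token_noise(tokens, vocab_size=30522, prob=0.5))
```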
Below is a minimal, runnable training example for a QA-style DLLM:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from transformers import AutoTokenizer, AutoModel
# =====================
# Hyperparameters
# =====================
T = 20
MAX_LEN = 64
LR = 1e-4
EPOCHS = 10
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# =====================
# QA data (toy examples)
# =====================
qa_data = [
("What is the capital of France?", "Paris"),
("Who wrote Hamlet?", "Shakespeare"),
("What is the largest planet?", "Jupiter"),
("What language is spoken in China?", "Chinese"),
]
# =====================
# tokenizer
# =====================
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MASK_ID = tokenizer.mask_token_id
PAD_ID = tokenizer.pad_token_id
def encode_qa(q, a):
    """
    Build the sequence: Question + Answer
    """
    text = f"Question: {q} Answer: {a}"
    encoding = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"][0]
    # locate the start of the Answer span
    answer_text = f"Answer: {a}"
    answer_ids = tokenizer(answer_text, add_special_tokens=False)["input_ids"]
    # find the position of a sub-sequence
    def find_subsequence(seq, sub):
        for i in range(len(seq) - len(sub) + 1):
            if seq[i:i+len(sub)].tolist() == sub:
                return i
        return -1
    start = find_subsequence(input_ids, answer_ids)
    answer_mask = torch.zeros_like(input_ids).bool()
    if start != -1:
        answer_mask[start:start+len(answer_ids)] = True
    return input_ids, answer_mask
dataset = [encode_qa(q, a) for q, a in qa_data]
# =====================
# Noise Schedule
# =====================
def get_mask_prob(t, T):
    return t / T

# =====================
# Forward Diffusion (QA version)
# =====================
def forward_diffusion(x0, t, answer_mask):
    """
    Mask the Answer span more aggressively than the Question
    """
    prob = get_mask_prob(t, T)
    xt = x0.clone()
    mask = torch.zeros_like(x0).bool()
    for i in range(len(x0)):
        if x0[i] == PAD_ID:
            continue
        # ⭐ Key point: Answer tokens are more likely to be masked
        if answer_mask[i]:
            if random.random() < prob:
                xt[i] = MASK_ID
                mask[i] = True
        else:
            # Question tokens are masked only lightly
            if random.random() < prob * 0.3:
                xt[i] = MASK_ID
                mask[i] = True
    return xt, mask
# =====================
# Model
# =====================
class DLLM(nn.Module):
    def __init__(self, vocab_size, hidden_size=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.t_embed = nn.Embedding(T + 1, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, t, labels=None, loss_mask=None):
        outputs = self.encoder(input_ids=input_ids)
        hidden = outputs.last_hidden_state
        # add the timestep embedding to every position's hidden state
        t_emb = self.t_embed(t).unsqueeze(1)
        hidden = hidden + t_emb
        logits = self.lm_head(hidden)
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(reduction="none")
            loss_all = loss_fct(
                logits.view(-1, logits.size(-1)),
                labels.view(-1)
            ).view(labels.size())
            # average the loss over the masked positions only
            loss = (loss_all * loss_mask).sum() / (loss_mask.sum() + 1e-6)
        return logits, loss
model = DLLM(tokenizer.vocab_size).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# =====================
# Training
# =====================
print("==== Training QA DLLM ====")
for epoch in range(EPOCHS):
    total_loss = 0
    for x0, answer_mask in dataset:
        x0 = x0.to(DEVICE)
        # 1️⃣ sample a timestep
        t = random.randint(1, T)
        t_tensor = torch.tensor([t], device=DEVICE)
        # 2️⃣ forward diffusion (noising)
        xt, mask = forward_diffusion(x0, t, answer_mask)
        xt = xt.unsqueeze(0).to(DEVICE)
        x0_batch = x0.unsqueeze(0).to(DEVICE)
        # ⭐ train only on the masked positions
        loss_mask = mask.unsqueeze(0).float().to(DEVICE)
        # 3️⃣ forward pass
        logits, loss = model(
            xt,
            t_tensor,
            labels=x0_batch,
            loss_mask=loss_mask
        )
        # 4️⃣ backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch} loss: {total_loss:.4f}")
```
Key steps of the training flow

Step 1: Sample a timestep. With a maximum timestep T, this simply means drawing a random integer uniformly from 1 to T (T = 20 in the code above).

Step 2: Noising via forward_diffusion
(1) Concatenate the question and the answer into one sequence and tokenize it;
(2) Divide the sampled timestep t by the maximum timestep T to obtain the masking probability;
(3) Walk over the sequence and, with that probability, replace tokens with [MASK]; typically only (or mostly) the answer part is masked.

Step 3: Run the model forward and compute the loss only on the masked tokens (cross-entropy is the usual choice). In the forward pass, the prepared input_ids are fed into the encoder to obtain hidden embeddings, the timestep embedding is added on top, and the result goes through a linear output head. This is how the timestep information enters the forward pass.

Step 4: Backpropagate.
【Question】Why sample a single timestep during training instead of iterating over all of them?

Sampling one random timestep gives a Monte Carlo estimate of the "average loss over all timesteps" (a random sample approximates the expectation). This cuts the per-step cost from O(T) to O(1) while keeping the training objective correct, as the quick check below shows.
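A quick numerical check of this Monte Carlo argument; the per-timestep losses below are made-up numbers used only for illustration:

```python
import random

T = 20
losses = {t: 1.0 + 0.1 * t for t in range(1, T + 1)}   # hypothetical per-timestep losses L(t)

# exact average over all timesteps (what looping over every t would compute)
exact = sum(losses.values()) / T

# Monte Carlo estimate: average the loss at uniformly sampled timesteps
samples = [losses[random.randint(1, T)] for _ in range(100_000)]
mc = sum(samples) / len(samples)

print(f"exact = {exact:.4f}, monte-carlo ≈ {mc:.4f}")   # the two values agree closely
```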
1.2 DLLM Inference
A complete inference example:
```python
# =====================
# Inference (QA)
# =====================
def generate_answer(model, question):
    model.eval()
    prompt = f"Question: {question} Answer:"
    tokens = tokenizer(
        prompt,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN
    )["input_ids"].to(DEVICE)
    # initialization: keep the question, set every remaining (PAD) position to [MASK]
    xt = tokens.clone()
    prompt_positions = tokens[0] != PAD_ID   # the question/prompt tokens stay fixed
    for i in range(xt.shape[1]):
        if xt[0, i] == PAD_ID:
            xt[0, i] = MASK_ID
    # reverse diffusion process
    for t in reversed(range(1, T + 1)):
        t_tensor = torch.tensor([t], device=DEVICE)
        with torch.no_grad():
            logits, _ = model(xt, t_tensor)
        probs = F.softmax(logits, dim=-1)
        xt = torch.multinomial(
            probs.view(-1, probs.size(-1)),
            1
        ).view(1, -1)
        # restore the prompt tokens so the question is never overwritten
        xt[0, prompt_positions] = tokens[0, prompt_positions]
    return tokenizer.decode(xt[0].cpu().tolist())

print("\n==== QA Inference ====")
print(generate_answer(model, "What is the capital of France?"))
```
Key steps of the inference flow:

Step 1: Apart from the question, the remaining MAX_LEN − len(question) positions are all set to [MASK].

Step 2: Iterate over the timesteps and gradually recover the answer. The timesteps run in reverse order: [T, T-1, T-2, ..., 1].
【Question】Why does inference have to sweep the timesteps in reverse order?

Inference must walk through the timesteps because each DLLM output depends on the state from the previous step: denoising is an iterative process in which every step refines the previous step's prediction.

Training: the model learns to recover the original $x_0$ from any noise level → a random $t$ is enough.
Inference: start from the fully noised $x_T$ → denoise step by step → end with a coherent answer → must proceed in order $t = T \to 0$.
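Written out, generation follows the reverse chain (schematic form):

$$
x_T \rightarrow x_{T-1} \rightarrow \cdots \rightarrow x_0, \qquad x_{t-1} \sim p_\theta\!\left(x_{t-1} \mid x_t\right),
$$

so $x_{t-1}$ cannot be sampled before $x_t$ exists, which is why the loop must run from $t = T$ down to $t = 1$.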
2 Block Diffusion
Although a DLLM generates the whole sequence in parallel, it has clear weaknesses:

- Long sequences lack local dependency constraints, so the generated length is hard to control;
- Fully parallel decoding ignores contextual word order, so long texts tend to lose coherence.
Block Diffusion targets exactly these issues: it splits the full sequence into blocks and generates with parallel denoising inside each block and sequential dependencies across blocks. The core idea:

- Split the sequence into $K$ blocks: $x = [x^1, x^2, \dots, x^K]$;
- Run a DLLM-style denoising iteration inside each block, modeled as $p_\theta(x_0^k \mid x_t^k, x_0^{<k})$

where:

- $x_0^{<k}$ are the blocks that have already been generated
- positions inside a block can be denoised in parallel
- blocks are generated in order, which keeps the length and the context consistent (the factorized likelihood is written out below)
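The overall likelihood factorizes autoregressively over blocks, with each factor modeled by a conditional discrete diffusion process (schematic form):

$$
\log p_\theta(x) \;=\; \sum_{k=1}^{K} \log p_\theta\!\left(x^k \mid x^{<k}\right)
$$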
2.1 Block Diffusion vs. DLLM
| Property | DLLM | Block Diffusion |
|---|---|---|
| Parallel generation | Fully parallel over the whole sequence | Parallel within a block, sequential across blocks |
| Length control | Generated length hard to constrain | Fixed block length, easy to control globally |
| Dependencies | Weak global dependencies, no local word-order constraints | Strong conditioning between blocks, more coherent context |
| Inference cost | T iterations over the full sequence | T iterations per block, overall controllable |
2.2 Training & Inference
2.2.1 Training Procedure
- Split the sequence into K blocks
- Randomly sample a timestep t for each block
- Mask the tokens inside the current block (the answer or content part)
- Compute the loss only on the masked block
- Update the parameters and repeat
Note: the conditioning between blocks does not need to be trained explicitly; the preceding blocks appear unmasked in the input, so the conditioning arises naturally, as the sketch below illustrates.
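A minimal sketch of how a training input for one block is assembled; the helper name `build_block_training_input` and the uniform masking probability are illustrative assumptions:

```python
import random

MASK_ID = 103  # BERT's [MASK] id, used here only as an example

def build_block_training_input(x0, block_start, block_end, mask_prob):
    """Noise only the current block; all other blocks stay clean, so the model
    automatically sees the preceding blocks as conditioning."""
    xt = list(x0)
    loss_positions = []
    for i in range(block_start, block_end):
        if random.random() < mask_prob:
            xt[i] = MASK_ID
            loss_positions.append(i)    # the loss is computed only at these positions
    return xt, loss_positions

x0 = [7592, 2088, 2003, 1037, 2307, 2173, 2000, 2444]  # some example token ids
xt, loss_positions = build_block_training_input(x0, block_start=4, block_end=8, mask_prob=0.5)
print(xt, loss_positions)
```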
2.2.2 Inference Procedure
- Initialize the answer part of every block to [MASK]
- Start from block 1 and iteratively denoise it → obtain $x_0^{(1)}$
- Use $x_0^{(1)}$ as conditioning → generate $x_0^{(2)}$
- Generate all K blocks in turn → concatenate them into the final sequence
2.3 Complete Block Diffusion Example
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from transformers import AutoTokenizer, AutoModel
# =====================
# Basic configuration
# =====================
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
T = 10             # diffusion steps
MAX_LEN = 64       # maximum sequence length
BLOCK_SIZE = 8     # block length, tunable
HIDDEN_SIZE = 768  # must match the hidden size of bert-base-uncased
LR = 1e-4
EPOCHS = 2
# =====================
# QA data
# =====================
qa_data = [
("What is the capital of France?", "Paris"),
("Who wrote Hamlet?", "Shakespeare"),
]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MASK_ID = tokenizer.mask_token_id
PAD_ID = tokenizer.pad_token_id
# =====================
# Data encoding
# =====================
def encode_qa(q, a):
    text = f"Question: {q} Answer: {a}"
    encoding = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"][0]
    # locate the answer span
    answer_text = f"Answer: {a}"
    answer_ids = tokenizer(answer_text, add_special_tokens=False)["input_ids"]
    start = -1
    for i in range(len(input_ids) - len(answer_ids) + 1):
        if (input_ids[i:i+len(answer_ids)] == torch.tensor(answer_ids)).all():
            start = i
            break
    answer_mask = torch.zeros_like(input_ids).bool()
    if start != -1:
        answer_mask[start:start+len(answer_ids)] = True
    return input_ids, answer_mask
dataset = [encode_qa(q, a) for q, a in qa_data]
# =====================
# Noise schedule & forward diffusion
# =====================
def get_mask_prob(t):
    return t / T

def forward_diffusion_block(x0, t, block_mask):
    prob = get_mask_prob(t)
    xt = x0.clone()
    mask = torch.zeros_like(x0).bool()
    for i in range(len(x0)):
        if x0[i] == PAD_ID:
            continue
        if block_mask[i] and random.random() < prob:
            xt[i] = MASK_ID
            mask[i] = True
    return xt, mask
# =====================
# Model
# =====================
class BlockDLLM(nn.Module):
    def __init__(self, vocab_size, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.t_embed = nn.Embedding(T + 1, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, t, labels=None, loss_mask=None):
        hidden = self.encoder(input_ids=input_ids).last_hidden_state
        # add the timestep embedding to every position
        t_emb = self.t_embed(t).unsqueeze(1)
        hidden = hidden + t_emb
        logits = self.lm_head(hidden)
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(reduction="none")
            loss_all = loss_fct(
                logits.view(-1, logits.size(-1)),
                labels.view(-1)
            ).view(labels.size())
            loss = (loss_all * loss_mask).sum() / (loss_mask.sum() + 1e-6)
        return logits, loss
model = BlockDLLM(tokenizer.vocab_size).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# =====================
# Training
# =====================
print("==== Training Block Diffusion QA ====")
for epoch in range(EPOCHS):
    total_loss = 0
    for x0, answer_mask in dataset:
        x0 = x0.to(DEVICE)
        answer_mask = answer_mask.to(DEVICE)
        # split the sequence into fixed-size blocks
        num_blocks = (MAX_LEN + BLOCK_SIZE - 1) // BLOCK_SIZE
        blocks = [(i*BLOCK_SIZE, min((i+1)*BLOCK_SIZE, MAX_LEN)) for i in range(num_blocks)]
        for start, end in blocks:
            # noise only the answer tokens that fall inside the current block
            block_mask = torch.zeros_like(x0).bool()
            block_mask[start:end] = answer_mask[start:end]
            t = random.randint(1, T)
            xt, mask = forward_diffusion_block(x0, t, block_mask)
            xt = xt.unsqueeze(0)
            x0_batch = x0.unsqueeze(0)
            loss_mask = mask.unsqueeze(0).float()
            logits, loss = model(
                xt,
                torch.tensor([t], device=DEVICE),
                labels=x0_batch,
                loss_mask=loss_mask
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
    print(f"Epoch {epoch}, loss: {total_loss:.4f}")
# =====================
# Inference
# =====================
def generate_answer_block(model, question):
    model.eval()
    prompt = f"Question: {question} Answer:"
    tokens = tokenizer(prompt, padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")["input_ids"].to(DEVICE)
    xt = tokens.clone()
    num_blocks = (MAX_LEN + BLOCK_SIZE - 1) // BLOCK_SIZE
    blocks = [(i*BLOCK_SIZE, min((i+1)*BLOCK_SIZE, MAX_LEN)) for i in range(num_blocks)]
    # generate the blocks in order
    for start, end in blocks:
        # mask only the positions of this block that are not part of the prompt
        block_prompt = tokens[0, start:end] != PAD_ID
        xt[0, start:end][~block_prompt] = MASK_ID
        for t in reversed(range(1, T + 1)):
            t_tensor = torch.tensor([t], device=DEVICE)
            with torch.no_grad():
                logits, _ = model(xt, t_tensor)
            probs = F.softmax(logits, dim=-1)
            sampled = torch.multinomial(probs[0, start:end], 1).squeeze(-1)
            # keep the prompt tokens inside this block fixed; only resample the masked positions
            sampled[block_prompt] = tokens[0, start:end][block_prompt]
            xt[0, start:end] = sampled
    return tokenizer.decode(xt[0].cpu().tolist())

print("\n==== QA Inference ====")
print(generate_answer_block(model, "What is the capital of France?"))
```