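Understanding Diffusion Language Models (DLLM) Through Training Code

1. Introduction

Diffusion-LLM (DLLM) stands for Diffusion Large Language Model. How does it differ from an ordinary LLM?

(1) In common: both are built mainly on the Transformer architecture.

(2) Differences:

An LLM generates text autoregressively, from left to right: conditioned on the tokens already generated to its left, it produces the next token one at a time until it emits an end-of-sequence token.
A DLLM does not predict the next token. It generates by iterative denoising. The method has two phases: forward corruption, which is used to build training examples, and reverse denoising, which is used to generate. The core idea is to start from a fully noised or fully masked sequence and recover a valid sequence over multiple refinement steps. Generation is bidirectional, parallel across positions, and open to iterative revision.

That sounds abstract, so the rest of this post works through it at the code level.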

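2. A Minimal DLLM in Code

2.1. Simplified DLLM code

Below is about the simplest DLLM training script possible. It covers a tiny dataset, preprocessing, training, and generation: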


import torch
import torch.nn as nn
import torch.optim as optim

# ===================== Hyperparameters =====================
vocab_size = 1000  # placeholder; overwritten below once the vocabulary is built
embed_dim = 64
seq_len = 16
hidden_dim = 128
timesteps = 20
batch_size = 4
lr = 1e-3

# ===================== Toy vocabulary and English sentences =====================
# Real English sentences (truncated/padded to seq_len by tokenize() below)
sentences = [
    "i love deep learning and diffusion models",
    "language models can generate coherent text",
    "diffusion models work by denoising gradually",
    "transformer is the backbone of modern llm",
    "we train a diffusion language model today"
]

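# Naive whitespace tokenization + vocabulary construction
# (sorted so that word ids are stable across runs; set order is not)
words = sorted({w for s in sentences for w in s.split()})
word2idx = {w: i+1 for i, w in enumerate(words)}  # 0 = pad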
idx2word = {i: w for w, i in word2idx.items()}
vocab_size = len(word2idx) + 1

# Convert a sentence to token ids and pad to seq_len
def tokenize(s):
    tokens = [word2idx[w] for w in s.split() if w in word2idx]
    tokens = tokens[:seq_len]
    tokens += [0] * (seq_len - len(tokens))
    return torch.tensor(tokens)

dataset = torch.stack([tokenize(s) for s in sentences])
print('dataset.shape={0}'.format(dataset.shape))  # torch.Size([5, 16])

# ===================== Embedding layer =====================
# Not passed to the optimizer below, so it keeps its random initialization
embedding = nn.Embedding(vocab_size, embed_dim)

# ===================== Diffusion denoising model =====================
class DenoiseTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.time_emb = nn.Embedding(timesteps, embed_dim)
        self.layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=2, dim_feedforward=hidden_dim,
            batch_first=True, activation='gelu'
        )
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, t):
        # x: [B, L, D]
        # t: [B]
        t_emb = self.time_emb(t).unsqueeze(1)  # [B,1,D]
        x = x + t_emb
        x = self.layer(x)
        return self.out(x)

model = DenoiseTransformer()
opt = optim.Adam(model.parameters(), lr=lr)

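# ===================== Forward diffusion process =====================
def forward_process(x0, t):
    noise = torch.randn_like(x0)
    # Simple linear schedule. alpha grows with t here, so t = 0 is the
    # noisiest level (alpha = 0.05) and t = timesteps - 1 the cleanest (0.99).
    alpha = torch.linspace(0.05, 0.99, timesteps)[t]
    alpha = alpha.view(-1,1,1)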
    xt = torch.sqrt(alpha) * x0 + torch.sqrt(1 - alpha) * noise
    return xt, noise

# ===================== Training =====================
print("Start training DLLM...\n")
for step in range(2000):
    idx = torch.randint(0, len(dataset), (batch_size,))
    x_ids = dataset[idx]
    x0 = embedding(x_ids)  # clean word embeddings [B,L,D]

    t = torch.randint(0, timesteps, (batch_size,))
    xt, noise = forward_process(x0, t)

    pred_noise = model(xt, t)
    loss = (pred_noise - noise).pow(2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 200 == 0:
        print(f"step {step:04d} | loss {loss.item():.4f}")

# ===================== DLLM reverse generation (sampling) =====================
print("\n=== DLLM generated sentence ===")
x = torch.randn(1, seq_len, embed_dim)  # start from pure noise
print('x.shape={0}'.format(x.shape)) # x.shape=torch.Size([1, 16, 64])

# At the start, x is pure noise; decode it once for comparison
logits = x @ embedding.weight.T
pred_ids = logits.argmax(-1).squeeze().tolist()
pred_words = [idx2word.get(i, "<pad>") for i in pred_ids if i != 0]
print("Initial noise:", " ".join(pred_words))


for t_step in reversed(range(timesteps)):
    with torch.no_grad():
        pred_n = model(x, torch.tensor([t_step]))
    # alpha grows with t, so this loop walks from alpha = 0.99 down to alpha = 0.05
    alpha = torch.linspace(0.05, 0.99, timesteps)[t_step]
    # One denoising step: solve the forward formula for x0 using the predicted noise
    x = (x - torch.sqrt(1 - alpha) * pred_n) / torch.sqrt(alpha)

# Map back to words
logits = x @ embedding.weight.T
pred_ids = logits.argmax(-1).squeeze().tolist()
pred_words = [idx2word.get(i, "<pad>") for i in pred_ids if i != 0]

print("Generated sentence:", " ".join(pred_words))

Running the script prints the following (no random seed is set, so the exact numbers and words vary from run to run):

dataset.shape=torch.Size([5, 16])
Start training DLLM...

step 0000 | loss 1.3353
step 0200 | loss 0.7625
step 0400 | loss 0.6536
step 0600 | loss 0.5462
step 0800 | loss 0.5811
step 1000 | loss 0.4207
step 1200 | loss 0.3732
step 1400 | loss 0.5663
step 1600 | loss 0.3291
step 1800 | loss 0.3746

=== DLLM generated sentence ===
x.shape=torch.Size([1, 16, 64])
Initial noise: of is by models we train of love text deep is learning the the today work
Generated sentence: by language language language language models by by by by

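What stands out is how the DLLM generates: it takes a noise tensor x (standing in for a whole sentence), denoises it, and produces the final sentence directly. x starts as pure random noise, the model removes noise step by step, and every position is updated in parallel at each step; nothing is predicted as a "next word". With only five training sentences and an embedding table that is never trained, the generated sentence is not fluent; the point here is the mechanics.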
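
The last step of the script, mapping denoised vectors back to words, is a nearest-token lookup. The script scores each position by a dot product with the embedding table. As a minimal sketch of the same idea, assuming Euclidean nearest-neighbour rounding instead (a swapped-in choice for illustration, not what the script does), with the hypothetical names `round_to_tokens` and `table`:

```python
import torch

torch.manual_seed(0)

vocab_size, embed_dim = 10, 8               # toy sizes, chosen for illustration
table = torch.randn(vocab_size, embed_dim)  # stands in for embedding.weight

def round_to_tokens(x, table):
    """Map each vector in x [B, L, D] to the id of the closest row of table [V, D]."""
    # Squared Euclidean distance from every position to every table row: [B, L, V]
    dists = (x.unsqueeze(2) - table.view(1, 1, *table.shape)).pow(2).sum(-1)
    return dists.argmin(-1)

ids = torch.tensor([[3, 1, 4, 1, 5]])
clean = table[ids]                              # exact embeddings, shape [1, 5, 8]
noisy = clean + 0.01 * torch.randn_like(clean)  # slightly perturbed embeddings

recovered_clean = round_to_tokens(clean, table)
recovered_noisy = round_to_tokens(noisy, table)
print(recovered_clean.tolist(), recovered_noisy.tolist())
```

Distance-based rounding maps an exact embedding back to its own id, whereas a dot-product score can favour rows with a large norm. Which one works better for a given model is an empirical question.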

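An ordinary LLM (GPT, LLaMA) works differently: given the input "I love", it predicts the next word, "deep"; given "I love deep", it predicts "learning". Generation is word by word, serial, and autoregressive.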
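
The left-to-right loop can be sketched in a few lines. This is only an illustration: the lookup table `next_word` is made up and stands in for a trained model, which would condition on the whole prefix rather than on the last word alone.

```python
# Toy stand-in for a trained LLM: maps the last word to the "most likely" next word.
# The table is hypothetical and exists only to make the loop runnable.
next_word = {
    "i": "love",
    "love": "deep",
    "deep": "learning",
    "learning": "<eos>",
}

def generate_autoregressive(prompt, max_new_tokens=10):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = next_word.get(tokens[-1], "<eos>")  # one "forward pass" per new token
        if nxt == "<eos>":                        # stop at the end-of-sequence marker
            break
        tokens.append(nxt)                        # the new token joins the context
    return " ".join(tokens)

result = generate_autoregressive("i love")
print(result)  # i love deep learning
```

Each new token requires its own pass and depends on everything generated before it, which is what makes the process serial.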

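2.2. DLLM characteristics

(1) Forward diffusion (adding noise)

def forward_process(x0, t):
    noise = torch.randn_like(x0)
    alpha = torch.linspace(0.05, 0.99, timesteps)[t].view(-1, 1, 1)
    xt = torch.sqrt(alpha) * x0 + torch.sqrt(1 - alpha) * noise
    return xt, noise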
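
To get a feel for what alpha controls, the sketch below mixes a clean tensor with noise at three alpha values and measures how much of the original survives. The helper `add_noise` and the random tensor standing in for unit-variance word embeddings are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def add_noise(x0, alpha):
    """Mix clean data x0 with Gaussian noise; alpha in (0, 1) is the share of signal kept."""
    noise = torch.randn_like(x0)
    xt = torch.sqrt(alpha) * x0 + torch.sqrt(1 - alpha) * noise
    return xt, noise

x0 = torch.randn(4, 16, 64)  # stand-in for a batch of unit-variance embeddings
sims = []
for a in (0.99, 0.5, 0.05):
    xt, _ = add_noise(x0, torch.tensor(a))
    # For unit-variance x0 the cosine similarity is roughly sqrt(alpha)
    sim = F.cosine_similarity(x0.flatten(), xt.flatten(), dim=0).item()
    sims.append(sim)
    print(f"alpha={a:.2f}  cosine(x0, xt)={sim:.3f}  var(xt)={xt.var().item():.3f}")
```

A large alpha leaves xt close to x0, and a small alpha leaves it close to pure noise, while the overall variance stays near 1 in both cases.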

(2) Predict the noise

pred_noise = model(xt, t)
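
Predicting the noise is enough to recover the clean embedding, because the forward formula in (1) can be solved for x0. Written out, with \alpha_t the schedule value at step t and \hat{\epsilon} the model's prediction (the hat notation is introduced here for illustration and does not appear in the code):

```latex
x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
\quad\Longrightarrow\quad
\hat{x}_0 = \frac{x_t - \sqrt{1-\alpha_t}\,\hat{\epsilon}}{\sqrt{\alpha_t}},
\qquad \hat{\epsilon} = \mathrm{model}(x_t, t)
```

If \hat{\epsilon} equals the true \epsilon, then \hat{x}_0 equals x_0 exactly; the quality of the reconstruction depends entirely on the quality of the noise prediction.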

(3) Noise loss (MSE)

loss = (pred_noise - noise).pow(2).mean()

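(4) Reverse sampling: generate by denoising

x = torch.randn(1, seq_len, embed_dim)  # start from pure noise
for t_step in reversed(range(timesteps)):
    with torch.no_grad():
        pred_n = model(x, torch.tensor([t_step]))
    alpha = torch.linspace(0.05, 0.99, timesteps)[t_step]
    x = (x - torch.sqrt(1 - alpha) * pred_n) / torch.sqrt(alpha)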
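
The update in (4) computes the clean-data estimate implied by the predicted noise and applies it at every step. Samplers in the DDPM/DDIM family usually re-noise that estimate to the next, lower noise level instead. The sketch below shows a deterministic DDIM-style variant under stated assumptions: the schedule `alpha_bar` is ordered so that a larger t means more noise (the script's schedule runs the other way, from 0.05 up to 0.99), and `oracle_eps` is a hypothetical perfect noise predictor used in place of a trained model so the example is self-contained.

```python
import torch

torch.manual_seed(0)

timesteps = 20
# Assumption: the signal share falls with t, so t = timesteps - 1 is the noisiest level.
alpha_bar = torch.linspace(0.99, 0.05, timesteps)

x0_true = torch.randn(1, 16, 64)  # the "clean" target, known only to the oracle

def oracle_eps(xt, t):
    """Hypothetical perfect predictor: the noise that turns x0_true into xt at level t."""
    a = alpha_bar[t]
    return (xt - torch.sqrt(a) * x0_true) / torch.sqrt(1 - a)

def ddim_sample(eps_model, shape):
    x = torch.randn(shape)  # start from pure noise at the noisiest level
    for t in reversed(range(timesteps)):
        a_t = alpha_bar[t]
        eps = eps_model(x, t)
        # Estimate of the clean data, same inversion as in (4)
        x0_hat = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
        if t == 0:
            x = x0_hat  # last step: return the estimate itself
        else:
            a_prev = alpha_bar[t - 1]  # next, less noisy level
            x = torch.sqrt(a_prev) * x0_hat + torch.sqrt(1 - a_prev) * eps
    return x

sample = ddim_sample(oracle_eps, x0_true.shape)
max_err = (sample - x0_true).abs().max().item()
print(f"max abs error vs. x0_true: {max_err:.2e}")
```

With a perfect predictor the loop lands on the target up to floating-point error. With a trained model, each step only has to move the sample to a slightly cleaner noise level, which is the setting the model saw during training.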

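3. LLM vs. DLLM

With the code above in mind, the differences are easier to understand:

How content is generated
- An LLM generates autoregressively from left to right: conditioned on the tokens already generated, it produces the next token one at a time until it emits an end-of-sequence token.
- A DLLM does not predict the next token. It starts from a fully noised or fully masked sequence and recovers a valid sequence over multiple denoising steps. Generation is bidirectional, parallel across positions, and open to iterative revision.
Training objective
- An LLM is trained mainly to predict the next token; the loss is usually cross-entropy (CrossEntropyLoss).
- The DLLM in this post is trained to predict the Gaussian noise that was added; the loss is the MSE between predicted and true noise, as in loss = (pred_noise - noise).pow(2).mean() above. Masked-diffusion models such as LLaDA (Large Language Diffusion Models, listed in the references) instead corrupt text by masking tokens and train with cross-entropy on the masked positions.
Data representation
- An LLM models token IDs directly and outputs a probability distribution over the vocabulary. It does not run diffusion over continuous vectors.
- The DLLM here adds noise to, and removes noise from, continuous word embeddings, and maps back to words only at the end.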
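
The two training objectives can be put side by side on toy tensors. This is a sketch only: random tensors stand in for the outputs of both models, and the sizes B, L, V, D are arbitrary values picked for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, L, V, D = 2, 5, 11, 8  # toy batch, sequence length, vocab and embedding sizes
token_ids = torch.randint(0, V, (B, L))

# Autoregressive objective: logits at position i are scored against token i + 1.
logits = torch.randn(B, L, V)  # stand-in for an LLM's output
ar_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, V),  # predictions at positions 0 .. L-2
    token_ids[:, 1:].reshape(-1),   # targets shifted left by one
)

# Continuous-diffusion objective: regress the Gaussian noise added to the embeddings.
noise = torch.randn(B, L, D)
pred_noise = torch.randn(B, L, D)  # stand-in for the denoiser's output
diff_loss = F.mse_loss(pred_noise, noise)

print(f"next-token cross-entropy: {ar_loss.item():.4f}")
print(f"noise-prediction MSE:     {diff_loss.item():.4f}")
```

The first loss is a classification loss over discrete token ids with a one-position shift; the second is a regression loss over continuous vectors with no shift, since every position is predicted at once.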

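4. Summary

DLLM:

✅ Uses diffusion
✅ Adds noise to sentence embeddings
✅ Trains the model to predict the noise
✅ Generates a whole sentence from noise
✅ Not autoregressive
✅ Does not predict the next word

Ordinary LLM:

❌ No diffusion
❌ No noising
❌ No denoising
✅ Generates word by word

Diffusion = the full pipeline of adding noise and then removing it

5. References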

Large Language Diffusion Models. https://arxiv.org/abs/2502.09992
https://github.com/Diffusion-LLM/Awesome-DiffusionLLM
https://zhuanlan.zhihu.com/p/1913691243197752405