Table of Contents
- [1 DLLM: Diffusion Language Models](#1-dllm-diffusion-language-models)
  - [1.1 DLLM Training](#11-dllm-training)
  - [1.2 DLLM Inference](#12-dllm-inference)
- [2 Block Diffusion](#2-block-diffusion)
  - [2.1 Block Diffusion vs. DLLM](#21-block-diffusion-vs-dllm)
  - [2.2 Training & Inference](#22-training--inference)
    - [2.2.1 Training Procedure](#221-training-procedure)
    - [2.2.2 Inference Procedure](#222-inference-procedure)
  - [2.3 Complete Block Diffusion Example](#23-complete-block-diffusion-example)
1 DLLM: Diffusion Language Models
In traditional autoregressive language models (such as the GPT series), text is generated in an autoregressive (AR) fashion: the model predicts the next token, appends the prediction to the context, and repeats until a full sentence has been produced. This yields high generation quality, but tokens cannot be generated in parallel, so decoding speed has an inherent bottleneck.
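That token-by-token loop looks roughly like this; a minimal greedy-decoding sketch using GPT-2 via `transformers`, purely for illustration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt")["input_ids"]
for _ in range(10):                                  # one new token per iteration
    with torch.no_grad():
        logits = lm(ids).logits                      # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax().view(1, 1)      # greedy: most likely next token
    ids = torch.cat([ids, next_id], dim=1)           # append the prediction to the context
print(tok.decode(ids[0]))
```

Every new token requires another forward pass over the growing context; this sequential dependency is exactly the bottleneck that diffusion language models try to remove.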
Diffusion language models (DLLMs) borrow the core iterative-denoising idea of image diffusion models and recast the text-generation paradigm:

- Start from a fully noised sequence (every position set to the [MASK] token, or to random noise tokens);
- The model denoises iteratively, step by step turning the noisy sequence back into fluent, realistic text;
- Every step can process all positions of the sequence in parallel;
- This suits batched parallel prediction, text correction, and rewriting, and breaks the strict left-to-right, token-by-token constraint of autoregressive models.

Training follows the classic diffusion recipe: forward diffusion (noising) plus reverse denoising, so the model learns to recover the original discrete text sequence from any noise level.
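For masking-based discrete diffusion, the reverse-denoising objective is usually a masked-token cross-entropy averaged over random timesteps. A schematic form (per-timestep weighting factors vary between papers):

$$
\mathcal{L} \;=\; \mathbb{E}_{x_0,\; t \sim \mathcal{U}\{1,\dots,T\}}\left[\; \sum_{i \in \mathrm{masked}(x_t)} -\log p_\theta\!\left(x_0^i \mid x_t,\, t\right) \right]
$$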
1.1 DLLM Training
Image diffusion models add continuous Gaussian noise, but text lives in a discrete token space where Gaussian noise cannot be added directly. Discrete diffusion therefore relies on one of two noising strategies (a minimal sketch of both follows the list):

- Mainstream choice: randomly replace sequence tokens with the [MASK] token
- Alternative: randomly replace tokens with other, unrelated tokens from the vocabulary
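A minimal sketch of the two strategies; the helper names `mask_noise` and `random_token_noise` are illustrative only, and the masking probability would come from the noise schedule (t / T) used in the training code below:

```python
import random

def mask_noise(token_ids, mask_id, prob):
    """Mainstream strategy: replace each token with [MASK] with probability `prob`."""
    return [mask_id if random.random() < prob else tok for tok in token_ids]

def random_token_noise(token_ids, vocab_size, prob):
    """Alternative strategy: replace each token with a random vocabulary token."""
    return [random.randrange(vocab_size) if random.random() < prob else tok
            for tok in token_ids]

tokens = [101, 2054, 2003, 1996, 3007, 102]            # some example token ids
print(mask_noise(tokens, mask_id=103, prob=0.5))       # 103 is BERT's [MASK] id
print(random_token_noise(tokens, vocab_size=30522, prob=0.5))
```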
Below is a minimal, runnable training example for a QA-style DLLM:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from transformers import AutoTokenizer, AutoModel
# =====================
# Hyperparameters
# =====================
T = 20
MAX_LEN = 64
LR = 1e-4
EPOCHS = 10
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# =====================
# QA data (toy examples)
# =====================
qa_data = [
("What is the capital of France?", "Paris"),
("Who wrote Hamlet?", "Shakespeare"),
("What is the largest planet?", "Jupiter"),
("What language is spoken in China?", "Chinese"),
]
# =====================
# tokenizer
# =====================
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MASK_ID = tokenizer.mask_token_id
PAD_ID = tokenizer.pad_token_id
def encode_qa(q, a):
    """
    Build the sequence: Question + Answer
    """
    text = f"Question: {q} Answer: {a}"
    encoding = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"][0]
    # locate the start of the Answer span
    answer_text = f"Answer: {a}"
    answer_ids = tokenizer(answer_text, add_special_tokens=False)["input_ids"]
    # find the position of a sub-sequence
    def find_subsequence(seq, sub):
        for i in range(len(seq) - len(sub) + 1):
            if seq[i:i+len(sub)].tolist() == sub:
                return i
        return -1
    start = find_subsequence(input_ids, answer_ids)
    answer_mask = torch.zeros_like(input_ids).bool()
    if start != -1:
        answer_mask[start:start+len(answer_ids)] = True
    return input_ids, answer_mask
dataset = [encode_qa(q, a) for q, a in qa_data]
# =====================
# Noise Schedule
# =====================
def get_mask_prob(t, T):
    return t / T

# =====================
# Forward Diffusion (QA version)
# =====================
def forward_diffusion(x0, t, answer_mask):
    """
    Mask the Answer span more aggressively than the Question
    """
    prob = get_mask_prob(t, T)
    xt = x0.clone()
    mask = torch.zeros_like(x0).bool()
    for i in range(len(x0)):
        if x0[i] == PAD_ID:
            continue
        # ⭐ Key point: Answer tokens are more likely to be masked
        if answer_mask[i]:
            if random.random() < prob:
                xt[i] = MASK_ID
                mask[i] = True
        else:
            # Question tokens are masked only lightly
            if random.random() < prob * 0.3:
                xt[i] = MASK_ID
                mask[i] = True
    return xt, mask
# =====================
# Model
# =====================
class DLLM(nn.Module):
    def __init__(self, vocab_size, hidden_size=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.t_embed = nn.Embedding(T + 1, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, t, labels=None, loss_mask=None):
        outputs = self.encoder(input_ids=input_ids)
        hidden = outputs.last_hidden_state
        # add the timestep embedding to every position's hidden state
        t_emb = self.t_embed(t).unsqueeze(1)
        hidden = hidden + t_emb
        logits = self.lm_head(hidden)
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(reduction="none")
            loss_all = loss_fct(
                logits.view(-1, logits.size(-1)),
                labels.view(-1)
            ).view(labels.size())
            # average the loss over the masked positions only
            loss = (loss_all * loss_mask).sum() / (loss_mask.sum() + 1e-6)
        return logits, loss
model = DLLM(tokenizer.vocab_size).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# =====================
# Training
# =====================
print("==== Training QA DLLM ====")
for epoch in range(EPOCHS):
    total_loss = 0
    for x0, answer_mask in dataset:
        x0 = x0.to(DEVICE)
        # 1️⃣ sample a timestep
        t = random.randint(1, T)
        t_tensor = torch.tensor([t], device=DEVICE)
        # 2️⃣ forward diffusion (noising)
        xt, mask = forward_diffusion(x0, t, answer_mask)
        xt = xt.unsqueeze(0).to(DEVICE)
        x0_batch = x0.unsqueeze(0).to(DEVICE)
        # ⭐ train only on the masked positions
        loss_mask = mask.unsqueeze(0).float().to(DEVICE)
        # 3️⃣ forward pass
        logits, loss = model(
            xt,
            t_tensor,
            labels=x0_batch,
            loss_mask=loss_mask
        )
        # 4️⃣ backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch} loss: {total_loss:.4f}")
```
Key steps of the training flow

Step 1: Sample a timestep. With a maximum timestep T, this simply means drawing a random integer uniformly from 1 to T (T = 20 in the code above).

Step 2: Noising via forward_diffusion
(1) Concatenate the question and the answer into one sequence and tokenize it;
(2) Divide the sampled timestep t by the maximum timestep T to obtain the masking probability;
(3) Walk over the sequence and, with that probability, replace tokens with [MASK]; typically only (or mostly) the answer part is masked.

Step 3: Run the model forward and compute the loss only on the masked tokens (cross-entropy is the usual choice). In the forward pass, the prepared input_ids are fed into the encoder to obtain hidden embeddings, the timestep embedding is added on top, and the result goes through a linear output head. This is how the timestep information enters the forward pass.

Step 4: Backpropagate.
【Question】Why sample a single timestep during training instead of iterating over all of them?

Sampling one random timestep gives a Monte Carlo estimate of the "average loss over all timesteps" (a random sample approximates the expectation). This cuts the per-step cost from O(T) to O(1) while keeping the training objective correct, as the quick check below shows.
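A quick numerical check of this Monte Carlo argument; the per-timestep losses below are made-up numbers used only for illustration:

```python
import random

T = 20
losses = {t: 1.0 + 0.1 * t for t in range(1, T + 1)}   # hypothetical per-timestep losses L(t)

# exact average over all timesteps (what looping over every t would compute)
exact = sum(losses.values()) / T

# Monte Carlo estimate: average the loss at uniformly sampled timesteps
samples = [losses[random.randint(1, T)] for _ in range(100_000)]
mc = sum(samples) / len(samples)

print(f"exact = {exact:.4f}, monte-carlo ≈ {mc:.4f}")   # the two values agree closely
```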
1.2 DLLM Inference
A complete inference example:
```python
# =====================
# Inference (QA)
# =====================
def generate_answer(model, question):
    model.eval()
    prompt = f"Question: {question} Answer:"
    tokens = tokenizer(
        prompt,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN
    )["input_ids"].to(DEVICE)
    # initialization: keep the question, set every remaining (PAD) position to [MASK]
    xt = tokens.clone()
    prompt_positions = tokens[0] != PAD_ID   # the question/prompt tokens stay fixed
    for i in range(xt.shape[1]):
        if xt[0, i] == PAD_ID:
            xt[0, i] = MASK_ID
    # reverse diffusion process
    for t in reversed(range(1, T + 1)):
        t_tensor = torch.tensor([t], device=DEVICE)
        with torch.no_grad():
            logits, _ = model(xt, t_tensor)
        probs = F.softmax(logits, dim=-1)
        xt = torch.multinomial(
            probs.view(-1, probs.size(-1)),
            1
        ).view(1, -1)
        # restore the prompt tokens so the question is never overwritten
        xt[0, prompt_positions] = tokens[0, prompt_positions]
    return tokenizer.decode(xt[0].cpu().tolist())

print("\n==== QA Inference ====")
print(generate_answer(model, "What is the capital of France?"))
```
Key steps of the inference flow:

Step 1: Apart from the question, the remaining MAX_LEN − len(question) positions are all set to [MASK].

Step 2: Iterate over the timesteps and gradually recover the answer. The timesteps run in reverse order: [T, T-1, T-2, ..., 1].
【Question】Why does inference have to sweep the timesteps in reverse order?

Inference must walk through the timesteps because each DLLM output depends on the state from the previous step: denoising is an iterative process in which every step refines the previous step's prediction.

Training: the model learns to recover the original $x_0$ from any noise level → a random $t$ is enough.
Inference: start from the fully noised $x_T$ → denoise step by step → end with a coherent answer → must proceed in order $t = T \to 0$.
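Written out, generation follows the reverse chain (schematic form):

$$
x_T \rightarrow x_{T-1} \rightarrow \cdots \rightarrow x_0, \qquad x_{t-1} \sim p_\theta\!\left(x_{t-1} \mid x_t\right),
$$

so $x_{t-1}$ cannot be sampled before $x_t$ exists, which is why the loop must run from $t = T$ down to $t = 1$.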
2 Block Diffusion
Although a DLLM generates the whole sequence in parallel, it has clear weaknesses:

- Long sequences lack local dependency constraints, so the generated length is hard to control;
- Fully parallel decoding ignores contextual word order, so long texts tend to lose coherence.
Block Diffusion targets exactly these issues: it splits the full sequence into blocks and generates with parallel denoising inside each block and sequential dependencies across blocks. The core idea:

- Split the sequence into $K$ blocks: $x = [x^1, x^2, \dots, x^K]$;
- Run a DLLM-style denoising iteration inside each block, modeled as $p_\theta(x_0^k \mid x_t^k, x_0^{<k})$

where:

- $x_0^{<k}$ are the blocks that have already been generated
- positions inside a block can be denoised in parallel
- blocks are generated in order, which keeps the length and the context consistent (the factorized likelihood is written out below)
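The overall likelihood factorizes autoregressively over blocks, with each factor modeled by a conditional discrete diffusion process (schematic form):

$$
\log p_\theta(x) \;=\; \sum_{k=1}^{K} \log p_\theta\!\left(x^k \mid x^{<k}\right)
$$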
2.1 Block Diffusion vs. DLLM
| Property | DLLM | Block Diffusion |
|---|---|---|
| Parallel generation | Fully parallel over the whole sequence | Parallel within a block, sequential across blocks |
| Length control | Generated length hard to constrain | Fixed block length, easy to control globally |
| Dependencies | Weak global dependencies, no local word-order constraints | Strong conditioning between blocks, more coherent context |
| Inference cost | T iterations over the full sequence | T iterations per block, overall controllable |
2.2 Training & Inference
2.2.1 Training Procedure
- Split the sequence into K blocks
- Randomly sample a timestep t for each block
- Mask the tokens inside the current block (the answer or content part)
- Compute the loss only on the masked block
- Update the parameters and repeat
Note: the conditioning between blocks does not need to be trained explicitly; the preceding blocks appear unmasked in the input, so the conditioning arises naturally, as the sketch below illustrates.
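A minimal sketch of how a training input for one block is assembled; the helper name `build_block_training_input` and the uniform masking probability are illustrative assumptions:

```python
import random

MASK_ID = 103  # BERT's [MASK] id, used here only as an example

def build_block_training_input(x0, block_start, block_end, mask_prob):
    """Noise only the current block; all other blocks stay clean, so the model
    automatically sees the preceding blocks as conditioning."""
    xt = list(x0)
    loss_positions = []
    for i in range(block_start, block_end):
        if random.random() < mask_prob:
            xt[i] = MASK_ID
            loss_positions.append(i)    # the loss is computed only at these positions
    return xt, loss_positions

x0 = [7592, 2088, 2003, 1037, 2307, 2173, 2000, 2444]  # some example token ids
xt, loss_positions = build_block_training_input(x0, block_start=4, block_end=8, mask_prob=0.5)
print(xt, loss_positions)
```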
2.2.2 Inference Procedure
- Initialize the answer part of every block to [MASK]
- Start from block 1 and iteratively denoise it → obtain $x_0^{(1)}$
- Use $x_0^{(1)}$ as conditioning → generate $x_0^{(2)}$
- Generate all K blocks in turn → concatenate them into the final sequence
2.3 Complete Block Diffusion Example
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
from transformers import AutoTokenizer, AutoModel
# =====================
# Basic configuration
# =====================
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
T = 10             # diffusion steps
MAX_LEN = 64       # maximum sequence length
BLOCK_SIZE = 8     # block length, tunable
HIDDEN_SIZE = 768  # must match the hidden size of bert-base-uncased
LR = 1e-4
EPOCHS = 2
# =====================
# QA data
# =====================
qa_data = [
("What is the capital of France?", "Paris"),
("Who wrote Hamlet?", "Shakespeare"),
]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MASK_ID = tokenizer.mask_token_id
PAD_ID = tokenizer.pad_token_id
# =====================
# Data encoding
# =====================
def encode_qa(q, a):
    text = f"Question: {q} Answer: {a}"
    encoding = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"][0]
    # locate the answer span
    answer_text = f"Answer: {a}"
    answer_ids = tokenizer(answer_text, add_special_tokens=False)["input_ids"]
    start = -1
    for i in range(len(input_ids) - len(answer_ids) + 1):
        if (input_ids[i:i+len(answer_ids)] == torch.tensor(answer_ids)).all():
            start = i
            break
    answer_mask = torch.zeros_like(input_ids).bool()
    if start != -1:
        answer_mask[start:start+len(answer_ids)] = True
    return input_ids, answer_mask
dataset = [encode_qa(q, a) for q, a in qa_data]
# =====================
# Noise schedule & forward diffusion
# =====================
def get_mask_prob(t):
    return t / T

def forward_diffusion_block(x0, t, block_mask):
    prob = get_mask_prob(t)
    xt = x0.clone()
    mask = torch.zeros_like(x0).bool()
    for i in range(len(x0)):
        if x0[i] == PAD_ID:
            continue
        if block_mask[i] and random.random() < prob:
            xt[i] = MASK_ID
            mask[i] = True
    return xt, mask
# =====================
# Model
# =====================
class BlockDLLM(nn.Module):
    def __init__(self, vocab_size, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.t_embed = nn.Embedding(T + 1, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, t, labels=None, loss_mask=None):
        hidden = self.encoder(input_ids=input_ids).last_hidden_state
        # add the timestep embedding to every position
        t_emb = self.t_embed(t).unsqueeze(1)
        hidden = hidden + t_emb
        logits = self.lm_head(hidden)
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(reduction="none")
            loss_all = loss_fct(
                logits.view(-1, logits.size(-1)),
                labels.view(-1)
            ).view(labels.size())
            loss = (loss_all * loss_mask).sum() / (loss_mask.sum() + 1e-6)
        return logits, loss
model = BlockDLLM(tokenizer.vocab_size).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# =====================
# Training
# =====================
print("==== Training Block Diffusion QA ====")
for epoch in range(EPOCHS):
    total_loss = 0
    for x0, answer_mask in dataset:
        x0 = x0.to(DEVICE)
        answer_mask = answer_mask.to(DEVICE)
        # split the sequence into fixed-size blocks
        num_blocks = (MAX_LEN + BLOCK_SIZE - 1) // BLOCK_SIZE
        blocks = [(i*BLOCK_SIZE, min((i+1)*BLOCK_SIZE, MAX_LEN)) for i in range(num_blocks)]
        for start, end in blocks:
            # noise only the answer tokens that fall inside the current block
            block_mask = torch.zeros_like(x0).bool()
            block_mask[start:end] = answer_mask[start:end]
            t = random.randint(1, T)
            xt, mask = forward_diffusion_block(x0, t, block_mask)
            xt = xt.unsqueeze(0)
            x0_batch = x0.unsqueeze(0)
            loss_mask = mask.unsqueeze(0).float()
            logits, loss = model(
                xt,
                torch.tensor([t], device=DEVICE),
                labels=x0_batch,
                loss_mask=loss_mask
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
    print(f"Epoch {epoch}, loss: {total_loss:.4f}")
# =====================
# Inference
# =====================
def generate_answer_block(model, question):
    model.eval()
    prompt = f"Question: {question} Answer:"
    tokens = tokenizer(prompt, padding="max_length", truncation=True, max_length=MAX_LEN, return_tensors="pt")["input_ids"].to(DEVICE)
    xt = tokens.clone()
    num_blocks = (MAX_LEN + BLOCK_SIZE - 1) // BLOCK_SIZE
    blocks = [(i*BLOCK_SIZE, min((i+1)*BLOCK_SIZE, MAX_LEN)) for i in range(num_blocks)]
    # generate the blocks in order
    for start, end in blocks:
        # mask only the positions of this block that are not part of the prompt
        block_prompt = tokens[0, start:end] != PAD_ID
        xt[0, start:end][~block_prompt] = MASK_ID
        for t in reversed(range(1, T + 1)):
            t_tensor = torch.tensor([t], device=DEVICE)
            with torch.no_grad():
                logits, _ = model(xt, t_tensor)
            probs = F.softmax(logits, dim=-1)
            sampled = torch.multinomial(probs[0, start:end], 1).squeeze(-1)
            # keep the prompt tokens inside this block fixed; only resample the masked positions
            sampled[block_prompt] = tokens[0, start:end][block_prompt]
            xt[0, start:end] = sampled
    return tokenizer.decode(xt[0].cpu().tolist())

print("\n==== QA Inference ====")
print(generate_answer_block(model, "What is the capital of France?"))
```