Diffusion and Flow Matching: Mathematical Principles and Their Application to VLA Action Generation

Authors: Asimov+Codex | Date: 2026-06-11

Audience: newcomers to Diffusion / Flow Matching with an interest in VLA (Vision-Language-Action) models

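1. Mathematical Principles of Diffusion Models

1.1 Core Intuition

Diffusion models are inspired by diffusion processes in thermodynamics:

Forward process: gradually add Gaussian noise to the data $x_0$ until it becomes (approximately) pure noise, $x_T \sim \mathcal{N}(0, I)$.
Reverse process: train a neural network that starts from pure noise and denoises step by step to recover the original data.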

1.2 Forward Diffusion Process (Adding Noise)

Given real data $x_0 \sim q(x_0)$, define a fixed Markov chain that adds noise at every step:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

where $\beta_t$ is a predefined variance schedule and $t$ runs from $1$ to $T$.

Key property: because a sum of independent Gaussians is again Gaussian, $x_t$ can be sampled from $x_0$ in a single step for any $t$:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.

Equivalently (the reparameterization trick):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
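As a quick sanity check of the closed form, the sketch below propagates the coefficient on $x_0$ and the noise variance through the one-step recursion and compares them with $\sqrt{\bar{\alpha}_t}$ and $1 - \bar{\alpha}_t$. The linear schedule from 1e-4 to 0.02 with $T = 1000$ is a commonly used DDPM default and is an assumption here; the function names are illustrative.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variance schedule beta_1..beta_T (assumed DDPM-style default)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def closed_form(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def recursive_moments(betas):
    """Propagate x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps for a scalar x_0.

    Tracks (coefficient on x_0, variance of the accumulated noise).
    """
    coef, var, out = 1.0, 0.0, []
    for beta in betas:
        coef = math.sqrt(1.0 - beta) * coef
        var = (1.0 - beta) * var + beta
        out.append((coef, var))
    return out

betas = linear_betas(1000)
alpha_bars = closed_form(betas)
moments = recursive_moments(betas)

# The recursion and the closed form agree at every step (up to rounding).
max_err = max(
    max(abs(c - math.sqrt(ab)), abs(v - (1.0 - ab)))
    for (c, v), ab in zip(moments, alpha_bars)
)
print(f"max deviation: {max_err:.2e}")
print(f"alpha_bar_T = {alpha_bars[-1]:.2e}")  # close to 0: x_T is almost pure noise
```

Because $\bar{\alpha}_T$ is tiny under this schedule, almost none of the signal $x_0$ survives at $t = T$, which is what justifies starting the reverse process from $\mathcal{N}(0, I)$.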

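1.3 Reverse Denoising Process (Generation)

The reverse process is also a Markov chain, but its transition kernel is unknown and has to be approximated by a neural network with parameters $\theta$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

In practice $\Sigma_\theta$ is usually fixed to a constant (for example $\beta_t I$), so the network only has to predict the mean $\mu_\theta$.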

1.4 Training Objective: ELBO and the Simplified Loss

Starting from the variational lower bound (ELBO), the objective simplifies to a remarkably compact noise-prediction loss, which is closely related to denoising score matching:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$

where:

$\epsilon \sim \mathcal{N}(0, I)$ is the noise that was added
$\epsilon_\theta$ is the neural network (typically a U-Net or a DiT)
$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$

Intuition: the network learns to guess which noise was added to the data; subtracting that noise recovers the data.

1.5 Sampling: DDPM and DDIM

DDPM sampling (step-by-step denoising):

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$$

where $\sigma_t^2 = \beta_t$ is a common choice and no noise is added at the final step.

DDIM sampling (accelerated, non-Markovian): time steps can be skipped, compressing $T = 1000$ steps down to roughly $50 \sim 100$.
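The skip-step idea can be sketched with the deterministic DDIM update (the $\eta = 0$ case): estimate $x_0$ from the current sample and the predicted noise, then jump directly to an earlier noise level. To keep the example self-contained, an oracle that returns the true noise stands in for the trained network; `ddim_step`, `oracle_eps`, the linear schedule, and the stride of 100 are illustrative assumptions.

```python
import math
import random

def alpha_bars_linear(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for t = 1..T under a linear beta schedule (assumed)."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Deterministic DDIM update (eta = 0) from noise level ab_t to ab_prev."""
    x0_hat = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
              for x, e in zip(x_t, eps_pred)]
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
            for x0, e in zip(x0_hat, eps_pred)]

T = 1000
abar = alpha_bars_linear(T)

random.seed(0)
x0 = [0.3, -1.2, 0.7]                       # toy "data" vector
eps = [random.gauss(0.0, 1.0) for _ in x0]  # the noise actually added

# Forward noising straight to the last step with the closed form.
x = [math.sqrt(abar[-1]) * a + math.sqrt(1.0 - abar[-1]) * e
     for a, e in zip(x0, eps)]

def oracle_eps(x_t, t_idx):
    """Stand-in for eps_theta: returns the true noise (a real model only approximates it)."""
    return eps

# Visit only 10 of the 1000 time steps: 999, 899, ..., 99, then jump to clean data.
schedule = list(range(T - 1, -1, -100))
for i, t_idx in enumerate(schedule):
    ab_prev = abar[schedule[i + 1]] if i + 1 < len(schedule) else 1.0
    x = ddim_step(x, oracle_eps(x, t_idx), abar[t_idx], ab_prev)

recovery_error = max(abs(a - b) for a, b in zip(x, x0))
print(f"steps used: {len(schedule)}, recovery error: {recovery_error:.2e}")
```

With a perfect noise predictor the stride does not matter; with a learned $\epsilon_\theta$ each jump introduces some error, which is why step counts in the tens are typically used rather than one or two.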

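2. Mathematical Principles of Flow Matching

2.1 Motivation: Pain Points of Diffusion Models

Diffusion models have some inherent drawbacks:

Many sampling steps: the original DDPM sampler needs about 1000 iterations
Likelihoods are hard to compute
The formulation is not optimal when viewed in continuous time

Flow Matching starts from a different viewpoint: directly learn a probability path from noise to data.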

2.2 Starting from Continuous Normalizing Flows (CNF)

Define a time-dependent vector field $v_t(x): [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$, which generates a probability path through an ordinary differential equation (ODE):

$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x))$$

where $\phi_t$ is the flow, $\phi_0(x) = x$, and $\phi_1$ maps noise to data.

Pushing the initial distribution $p_0$ (a standard Gaussian) through this flow yields the data distribution at $t = 1$: $p_1 \approx q(x)$.

2.3 The Flow Matching Objective

Ideally we would minimize:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ x \sim p_t(x)}\left[\left\|v_t(x) - u_t(x)\right\|^2\right]$$

where $v_t$ is the learned field and $u_t(x)$ is the true vector field that generates $p_t$, but we do not know $u_t$!

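2.4 Conditional Flow Matching (the Key Breakthrough)

Core idea: define a conditional probability path $p_t(x \mid x_1)$ for every data point $x_1$, chosen so that averaging over $x_1$ reproduces the marginal path $p_t(x)$. Regressing onto the conditional vector field then yields the same gradients as the intractable objective above.

Choose the simplest Gaussian interpolation path (the optimal-transport path of Lipman et al. with $\sigma_{\min} \to 0$; Rectified Flow uses the same straight-line interpolation):

$$p_t(x \mid x_1) = \mathcal{N}\left(x;\ t x_1,\ (1 - t)^2 I\right)$$

The corresponding conditional vector field is:

$$u_t(x \mid x_1) = \frac{x_1 - x}{1 - t}$$

A more stable and more commonly used form (the linear interpolation path):

$$x_t = (1 - t)\, x_0 + t\, x_1, \quad x_0 \sim \mathcal{N}(0, I)$$

The vector field simplifies to:

$$u_t(x \mid x_0, x_1) = x_1 - x_0$$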
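The two forms are consistent: evaluating the first conditional field at a point $x_t$ on the linear path recovers the constant velocity. A short check, using only the definitions above:

```latex
\begin{aligned}
u_t(x_t \mid x_1)
  &= \frac{x_1 - x_t}{1 - t}
   = \frac{x_1 - (1 - t)\,x_0 - t\,x_1}{1 - t} \\
  &= \frac{(1 - t)\,(x_1 - x_0)}{1 - t}
   = x_1 - x_0 .
\end{aligned}
```

The division by $1 - t$ is what makes the first form numerically delicate as $t \to 1$, which is one reason the constant-velocity form is preferred in practice.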

Final loss:

$$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ x_0 \sim \mathcal{N}(0,I),\ x_1 \sim q(x_1)}\left[\left\|v_\theta(x_t, t) - (x_1 - x_0)\right\|^2\right]$$

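2.5 Key Advantages

Aspect	Diffusion Models	Flow Matching
Path	Stochastic (SDE) in the standard formulation	Deterministic (ODE)
Sampling steps	Tens to thousands	Roughly ten to a few dozen
Training target	Predict the noise $\epsilon$	Predict the velocity field $v_\theta$
Theoretical simplicity	Requires a variational derivation	Direct regression onto a vector field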

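2.6 A Unified View: from Diffusion to Flow Matching

In fact, diffusion models can be viewed as a special case of Flow Matching:

DDPM noising path: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
Flow Matching linear path: $x_t = (1 - t)\, x_0 + t\, x_1$

Both are interpolations between a source distribution and a target distribution; they differ only in the interpolation coefficients and the noise schedule. Note the opposite conventions: in the DDPM formula $x_0$ is data and larger $t$ means more noise, whereas in the Flow Matching formula $x_0$ is noise and $x_1$ is data.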

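3. Application to VLA Action Generation

3.1 What Is a VLA (Vision-Language-Action) Model?

A VLA is a multimodal large-model architecture that takes vision and a language instruction as input and outputs actions.

Representative examples:

RT-2 (Google DeepMind): tokenizes robot actions so that the language model emits them directly
π0 (pi-zero) (Physical Intelligence): generates actions with Flow Matching
Octo: a generalist robot policy with a diffusion action head
OpenVLA: an open-source 7B VLA that, like RT-2, predicts discretized action tokens autoregressively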

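3.2 Why VLAs Need Diffusion / Flow Matching

Key tension: the robot action space is continuous and multimodal.

For example, given "put the cup on the tray":

The arm can reach straight for the cup or take a detour
It can release the cup from higher up or from lower down
A single instruction admits infinitely many reasonable action trajectories

Problems with autoregressive (AR) modeling:

An LLM predicts a discrete distribution over tokens, so quantizing continuous actions loses precision
Multimodal uncertainty is hard to exploit: with greedy decoding the model returns only the single most likely action

What Diffusion / Flow Matching are good at:

Modeling continuous distributions directly
Naturally supporting multimodal outputs (one prompt maps to several reasonable actions)
Generating diverse action trajectories through sampling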
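A toy calculation makes the multimodality argument concrete. Assume, hypothetically, that half of the demonstrations pass an obstacle on the left (lateral offset -1) and half on the right (+1). A deterministic regressor trained with squared error converges to the mean, an action nobody demonstrated, whereas a sampler that models the distribution returns one of the two valid modes. The numbers and function names are illustrative.

```python
import random

# Hypothetical demonstrations: lateral offset of the gripper when passing an obstacle.
demos = [-1.0] * 50 + [+1.0] * 50

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - y) ** 2 for y in targets) / len(targets)

# The constant that minimizes squared error is the sample mean.
mse_optimal_action = sum(demos) / len(demos)

# A generative policy draws from the data distribution instead
# (here simply by resampling the demonstrations).
random.seed(0)
sampled_actions = [random.choice(demos) for _ in range(5)]

print("MSE-optimal action:", mse_optimal_action)   # 0.0, straight into the obstacle
print("sampled actions:   ", sampled_actions)      # each one is -1.0 or +1.0
print("MSE at 0.0:", mse(0.0, demos), "| MSE at +1.0:", mse(1.0, demos))
```

The regressor prefers 0.0 because its squared error (1.0) is lower than that of either real mode (2.0), even though 0.0 is the one choice that collides with the obstacle.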

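3.3 Concrete Architectures

3.3.1 Diffusion-based Policies (e.g. Diffusion Policy / Octo)

Input: image I + language instruction L
           |
Vision encoder (SigLIP / DINOv2) + text encoder (LLM)
           |
    Multimodal feature fusion (cross-attention)
           |
    Noisy action x_T -> denoising U-Net / DiT -> predicted action x_0
    Conditioning: vision + language features injected via cross-attention
           |
    Output: robot action sequence (joint angles / end-effector pose)

Mathematically:

Training: noise the ground-truth action $a_0$ into $a_t$; the network predicts the noise $\epsilon_\theta(a_t, t, I, L)$
Inference: start from pure noise $a_T$ and denoise step by step to obtain $a_0$, conditioned on the observation $(I, L)$
Loss: $\mathcal{L} = \mathbb{E}_{t, a_0, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(a_t, t, I, L)\right\|^2\right]$

Example (illustrative workflow of a diffusion-based policy):

User instruction: "Pick up the red cup"
Image: tabletop scene captured by the camera
Action space: 7-DoF end-effector command (x, y, z, roll, pitch, yaw, gripper)
Output: the robot's next action
      the diffusion sampler runs, for example, 16 denoising steps to produce one smooth action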

3.3.2 Flow Matching VLA (e.g. π0)

π0 (Physical Intelligence, 2024) generates actions with Flow Matching:

Input: image I + language instruction L + current robot state s
           |
    Pretrained VLM backbone (PaliGemma)
           |
    Flow Matching head (action decoder)
      - uses an ODE solver to turn noise into an action trajectory
      - conditioning: multimodal tokens produced by the VLM
           |
    Output: an action chunk (joint positions for the next N steps)

Mathematically (conditional Flow Matching for a VLA):

Define the conditional action distribution:

Noise action: $a_0 \sim \mathcal{N}(0, I)$
Ground-truth action: $a_1$ (from human teleoperation data)
Interpolation path: $a_t = (1 - t)\, a_0 + t\, a_1$
Vector field: $v_\theta(a_t, t, I, L, s)$ predicts $a_1 - a_0$
Loss: $\mathcal{L} = \mathbb{E}_{t, a_0, a_1}\left[\left\|v_\theta(a_t, t, I, L, s) - (a_1 - a_0)\right\|^2\right]$

At inference time:

1. Sample a_0 ~ N(0, I)
2. Iterate with an Euler ODE solver:
   a_{t+dt} = a_t + v_theta(a_t, t, I, L, s) * dt
3. Integrate from t=0 to t=1; the result a_1 is the final action
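The sampling loop can be tried end to end on a toy problem where the exact vector field is known. If the data distribution is a single point $a_1$, the marginal field is $u_t(a) = (a_1 - a)/(1 - t)$, so no trained network is needed. `euler_sample` and `point_target_field` are illustrative names, and the 7-dimensional target is made up.

```python
import random

def euler_sample(vector_field, a0, n_steps):
    """Integrate da/dt = v(a, t) from t = 0 to t = 1 with fixed-step Euler."""
    a, dt = list(a0), 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v = vector_field(a, t)
        a = [ai + vi * dt for ai, vi in zip(a, v)]
    return a

def point_target_field(target):
    """Exact flow-matching field when the data distribution is the single point `target`."""
    def field(a, t):
        return [(ti - ai) / (1.0 - t) for ti, ai in zip(target, a)]
    return field

random.seed(0)
target = [0.32, -0.15, 0.08, 0.0, 0.0, 0.5, 0.02]   # made-up 7-DoF action
a0 = [random.gauss(0.0, 1.0) for _ in target]        # step 1: sample noise

errors = {}
for n_steps in (1, 2, 5, 10):
    a1 = euler_sample(point_target_field(target), a0, n_steps)
    errors[n_steps] = max(abs(x - y) for x, y in zip(a1, target))
    print(f"n_steps={n_steps:2d}  max error={errors[n_steps]:.1e}")
```

Because this toy field produces perfectly straight trajectories, even a single Euler step lands on the target. With real data the learned field is generally curved, so a few more steps are needed, but the straighter the paths, the fewer steps suffice.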

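3.4 Comparison: Diffusion vs Flow Matching for Actions

Aspect	Diffusion Policy	Flow Matching (π0)
Path type	SDE (stochastic)	ODE (deterministic)
Sampling steps	10~100	2~10
Action diversity	✅ High (stochastic term)	Adjustable (initial noise)
Real-time inference	❌ Slower	✅ Faster
Smoothness	Needs extra handling	✅ Smooth sampling path
Representative work	Diffusion Policy, Octo	π0, Rectified Flow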

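3.5 A Practical Example: the Full Grasping Pipeline

Suppose the robot sees a cup on the table and the instruction is "Pick up the cup".

Step 1: Vision + language encoding

Image (224x224) -> SigLIP visual features
Text -> LLM text embedding
     |
  Multimodal fusion -> conditioning feature c

Step 2: Action generation (using Flow Matching as the example)

Initialize: a_0 ~ N(0, I)    # 7-dimensional action noise
t = 0.0
while t < 1.0:
    # evaluate the conditional vector field
    v = v_theta(a_t, t, c)
    # one Euler integration step
    a_{t+dt} = a_t + v * dt
    t += dt

Output: a_1 = [0.32, -0.15, 0.08, 0.0, 0.0, 0.5, 0.02]  # 7-DoF action (illustrative values)

Step 3: Execute the action

The robot moves to the grasp pose and closes the gripper
It then enters the next observe-act cycle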

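3.6 Why Is Flow Matching a Better Fit for VLAs?

Speed: VLAs need real-time control (10-50 Hz), and Flow Matching can get by with as few as 2-5 steps
Continuity and smoothness: the ODE sampling path is smooth, which helps reduce robot jitter
Determinism: given the same observation and the same initial noise, the generated action is identical (reproducible)
Scalability: it integrates easily into the latent space of a large model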
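A back-of-the-envelope budget shows why the step count matters for real-time control. The sketch assumes a hypothetical 4 ms per forward pass of the action head and ignores the cost of encoding the observation; both the number and the helper name are placeholders to be replaced with measurements from your own system.

```python
def max_sampling_steps(control_hz, ms_per_forward, overhead_ms=0.0):
    """Largest number of network evaluations that fits into one control period."""
    period_ms = 1000.0 / control_hz
    budget_ms = period_ms - overhead_ms
    return max(0, int(budget_ms // ms_per_forward))

MS_PER_FORWARD = 4.0  # hypothetical latency of one action-head evaluation

for hz in (10, 50):
    steps = max_sampling_steps(hz, MS_PER_FORWARD)
    print(f"{hz} Hz -> {1000.0 / hz:.0f} ms per cycle -> at most {steps} steps")
```

Under these assumed numbers a 100-step sampler fits neither rate if it has to run once per cycle, while a 5-step ODE solve fits even at 50 Hz. Policies that predict a chunk of future actions can amortize one sampling run over several control cycles, which relaxes the budget.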

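4. Summary

Concept	One-sentence summary
Diffusion	Learn a denoising network that recovers data from pure noise step by step (predicts $\epsilon$)
Flow Matching	Learn a vector field that "flows" noise to data along an ODE path (predicts $v$)
VLA action generation	Condition on vision + language and use Diffusion / Flow Matching to generate continuous, multimodal robot actions
Diffusion Policy	Models actions as a conditional diffusion process and samples actions from it
π0 / Flow Matching VLA	Uses an ODE to flow noise into actions; faster and smoother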

References

DDPM: Denoising Diffusion Probabilistic Models (Ho et al., 2020)
Flow Matching for Generative Modeling (Lipman et al., 2023)
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (Chi et al., 2023)
OpenVLA: An Open-Source Vision-Language-Action Model (Kim et al., 2024)
π0: A Vision-Language-Action Flow Model for General Robot Control (Physical Intelligence, 2024)

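5. Advanced Topic: Anchor-based Diffusion Policy

5.1 Motivation: the Cost of Starting from Pure Noise

The standard Diffusion Policy discussed above has an inherent limitation: every denoising run starts from pure Gaussian noise $a_T \sim \mathcal{N}(0, I)$.

This means that:

Even when the observation $O$ already implies a clear behavior mode (such as "grasp to the right" or "grasp to the left"), the model still has to search the whole action space
For multimodal behavior (several reasonable actions for the same instruction), sampling is inefficient and needs many denoising steps
Actions generated at inference time may be incoherent, because the sampled mode can jump between successive control steps

Core idea of the anchor-based approach: first cluster the training trajectories to extract anchors that serve as prototypes of typical behaviors; at inference time, start from noise around an anchor instead of from a standard Gaussian.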

5.2 Overall Architecture

┌──────────────────────────────────────────────────────────┐
│                     Offline Phase                        │
│                                                          │
│  Training trajectories {τ₁, τ₂, ..., τ_N}                │
│            │                                             │
│      K-Means / behavior clustering                       │
│            │                                             │
│  K anchor trajectories {c₁, c₂, ..., c_K}                │
│  (cluster centers = typical behavior modes)              │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                     Training Phase                       │
│                                                          │
│  for each trajectory τ:                                  │
│      find its anchor c_k                                 │
│      train: ε_θ(a_t, t, c_k, O) → ε                      │
│      loss: ∥ε - ε_θ(a_t, t, c_k, O)∥²                    │
│                                                          │
├──────────────────────────────────────────────────────────┤
│                     Inference Phase                      │
│                                                          │
│  Given observation O:                                    │
│      ① score anchors → select the top-scoring anchors    │
│      ② init noise near each anchor → denoise in parallel │
│      ③ aggregate / pick the best action                  │
└──────────────────────────────────────────────────────────┘

5.3 Mathematical Formulation: Offline Phase (Trajectory Clustering)

Suppose we have $N$ training trajectories, each an action sequence of length $H$:

$$\tau_i = \left[a_i^{(1)}, a_i^{(2)}, \ldots, a_i^{(H)}\right] \in \mathbb{R}^{H \times d}$$

Run $K$-means clustering:

$$\min_{\{c_k\}} \sum_{i=1}^{N} \min_{k} \left\|\tau_i - c_k\right\|^2$$

This yields $K$ anchors (cluster centers) $c_1, c_2, \ldots, c_K \in \mathbb{R}^{H \times d}$.

Each anchor represents one typical behavior mode. For example, in a "grasp the cup" task:

$c_1$: top grasp (approach the object from above)
$c_2$: side grasp (approach the object from the side)
$c_3$: translate-then-grasp (translate first, then grasp)

5.4 Training: Anchor-conditioned Diffusion

Given an observation $O$ and its anchor $c_k$, train an anchor-conditioned denoising network $\epsilon_\theta(a_t, t, c_k, O)$.

Variant 1: hard assignment

Each trajectory belongs to exactly one anchor, the nearest one, $k^* = \arg\min_k \|\tau - c_k\|^2$:

$$\mathcal{L}_{\text{hard}} = \mathbb{E}_{t, \tau, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(\tau_t, t, c_{k^*}, O)\right\|^2\right]$$

Variant 2: soft weighting

Each trajectory is softly assigned to several anchors (better suited to behaviors that vary continuously):

$$\mathcal{L}_{\text{soft}} = \mathbb{E}_{t, \tau, \epsilon}\left[\sum_{k=1}^{K} w_k(\tau) \cdot \left\|\epsilon - \epsilon_\theta(\tau_t, t, c_k, O)\right\|^2\right]$$

where $\tau_t$ is the noised trajectory and the weights, with temperature $\sigma > 0$, are:

$$w_k(\tau) = \frac{\exp\left(-\|\tau - c_k\|^2 / \sigma\right)}{\sum_{j=1}^{K} \exp\left(-\|\tau - c_j\|^2 / \sigma\right)}$$
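The weight formula is a softmax over negative squared distances, so its behavior is easy to probe numerically. The sketch below uses made-up flattened trajectories and an illustrative helper `soft_anchor_weights`; it shows how the parameter $\sigma$ interpolates between hard assignment and uniform weighting.

```python
import math

def soft_anchor_weights(traj, anchors, sigma):
    """Softmax over negative squared distances; sigma acts as a temperature."""
    sq_dists = [sum((a - c) ** 2 for a, c in zip(traj, anchor)) for anchor in anchors]
    # Subtract the smallest distance before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    d_min = min(sq_dists)
    unnorm = [math.exp(-(d - d_min) / sigma) for d in sq_dists]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Flattened toy trajectories (H * d = 4) standing in for real anchors.
anchors = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, 0.5, -1.0, 0.5],
]
traj = [0.9, 0.8, 1.1, 0.7]   # closest to the second anchor

weights_by_sigma = {}
for sigma in (0.1, 1.0, 10.0):
    weights_by_sigma[sigma] = soft_anchor_weights(traj, anchors, sigma)
    rounded = [round(w, 3) for w in weights_by_sigma[sigma]]
    print(f"sigma={sigma:5.1f}  weights={rounded}")
```

A small $\sigma$ puts almost all weight on the nearest anchor and recovers the hard-assignment loss; a large $\sigma$ spreads the weight and lets one trajectory supervise several anchors.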

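5.5 Inference: Top-K Anchor Selection + Parallel Denoising

Step 1: Predict anchor relevance

Given the observation $O$, an anchor predictor (a lightweight MLP or an attention layer, for instance) computes a relevance score for every anchor:

$$s_k = f_\phi(O)_k, \quad k = 1, \ldots, K$$

Select the highest-scoring anchors (the "top-K" of the heading; since $K$ already denotes the number of clusters, write $M \le K$ for how many are kept):

$$\mathcal{K}_{\text{top}} = \{k : s_k \text{ is among the } M \text{ largest scores}\}$$

Alternatively, use CLIP-style cross-modal matching, or let a VLM infer the most likely behavior mode.

Step 2: Initialize near the anchor

For each selected anchor $c_k$, initialize the noisy action as:

$$a_T^{(k)} \sim \mathcal{N}(c_k, \sigma_k^2 I)$$

where $\sigma_k$ is the empirical standard deviation within cluster $k$ (it reflects how much this kind of behavior varies).

Step 3: Parallel denoising

For each $k \in \mathcal{K}_{\text{top}}$, run the denoising process:

$$a_{t-1}^{(k)} = \text{denoise}\left(a_t^{(k)}, t, c_k, O\right), \quad t = T, T-1, \ldots, 1$$

Each anchor produces one candidate action trajectory $a_0^{(k)}$.

Step 4: Aggregate / select

Three aggregation strategies:

Strategy	How	Characteristics
Best-of-K	Execute the candidate with the highest confidence	Simple, but may be incoherent over time
Weighted average	Fuse candidates weighted by anchor score	Smooth, but may average distinct modes together
Temporal aggregation	Each action step is a weighted average over several anchor branches	Smoothest, but most expensive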

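5.6 Comparison with the Standard Diffusion Policy

Aspect	Standard Diffusion Policy	Anchor-based Diffusion Policy
Initial noise	$a_T \sim \mathcal{N}(0, I)$	$a_T \sim \mathcal{N}(c_k, \sigma_k^2 I)$
Conditioning	Observation $O$ only	Observation $O$ + anchor $c_k$
Sampling steps	10~100	5~20 (the starting point is closer)
Multimodality	Relies on random noise for exploration	Models the modes explicitly
Action consistency	Needs many steps	Encouraged by the anchor constraint
Inference cost	A single denoising run	One run per selected anchor, which can execute in parallel
Extra training	None	Anchor clustering + anchor predictor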

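5.7 A Concrete Data-flow Example

Scenario: a tabletop grasping task with the instruction "Pick up the cup on the left".

Offline clustering result ($K = 4$):

Anchor 1: go around the obstacle on the right → descend to the cup handle → close the gripper
Anchor 2: descend straight from above → close the gripper
Anchor 3: detour on the left → reach forward → close the gripper
Anchor 4: translate to directly above the cup → descend → close the gripper

Online inference:

# Pseudocode: encode, anchor_predictor, sample_from_anchor, eps_theta and
# ddpm_step are placeholders for the components described above.

# Step 1: observation image I, instruction L -> anchor prediction
O = encode(I, L)
scores = anchor_predictor(O)       # 4 scores, one per anchor
# e.g. [0.05, 0.80, 0.05, 0.10]
# -> anchor 2 (straight down from above) matches the scene best

# Step 2: denoise the top-2 anchors in parallel
selected = [2, 4]                  # anchor IDs sorted by score, best first
candidates = []
for k in selected:
    a = sample_from_anchor(c[k], sigma[k])    # initialize near anchor k
    for t in reversed(range(T)):
        eps_pred = eps_theta(a, t, c[k], O)
        a = ddpm_step(a, t, eps_pred)         # one denoising step
    candidates.append(a)                      # candidate action a_0^(k)

# Step 3: pick the candidate with the highest anchor score
final_action = candidates[0]                  # the candidate from anchor 2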

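5.8 Relation to Prior Work

Similar ideas in the existing literature:

Line of work	Core method	Similarities to and differences from the anchor-based approach
Skill-based RL	Discretizes behavior into skills with a VAE or clustering	Similar clustering idea, but uses VAE latents rather than explicit anchors
Hierarchical skill-based planning	Selects a skill first, then plans a trajectory	Hierarchical; a skill plays a role similar to an anchor
MoE-Diffusion	Several expert heads, each handling one behavior mode	Analogous to K anchors mapping to K expert heads
Conditional Flow Matching	Uses text or class labels as the condition	An anchor can be seen as a discrete latent condition
Consistency Models	Single-step generation	Anchor constraints and consistency models could be combined for very fast sampling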

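5.9 Strengths and Weaknesses

Strengths:

✅ More controllable mode coverage: $K$ behavior prototypes are specified explicitly
✅ Faster inference: convergence in fewer steps
✅ Interpretability: you can tell which behavior mode is currently active
✅ Convenient for human-robot interaction: a user can select or adjust anchors manually

Challenges:

⚠️ Choosing $K$: too few anchors drop modes, too many over-segment the behaviors
⚠️ Clustering quality: K-Means assumes roughly spherical clusters, which may not suit high-dimensional trajectories
⚠️ Anchor-predictor accuracy: a wrongly predicted anchor can make generation fail
⚠️ Compute cost: every selected anchor needs its own denoising run, so cost grows linearly with the number of selected anchors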

5.10 Further Extensions

Dynamic anchors: the anchor is adjusted across time steps instead of being fixed

$$c_k^{(t)} = c_k + \Delta_t(c_k, O)$$

Hierarchical anchors: coarse-grained → fine-grained anchors

$$c_{\text{level1}} \rightarrow c_{\text{level2}} \rightarrow \ldots$$

Continuous anchors: instead of discrete clustering, embed anchors in a continuous space with a VAE or a normalizing flow

Anchor + Flow Matching:

$$a_t = (1 - t)\, a_0 + t\, a_1, \quad a_0 \sim \mathcal{N}(c_k, \sigma_k^2 I)$$

where $a_1$ is a ground-truth action trajectory assigned to anchor $c_k$. Integrating the Flow Matching conditional vector field along this anchor-initialized path could make generation in as few as 2-3 steps possible.

6. Summary: the Evolution from Standard Diffusion to Anchor-based Methods

[Flowchart, nodes in order of appearance: DDPM (1000 steps, pure noise); DDIM (50 steps, skip-step acceleration); Diffusion Policy (conditional denoising generates actions); Anchor-based DP (clustered anchors + constrained initialization); + Flow Matching (ODE path, fewer steps); Anchor+FM (anchor constraint + few-step integration)]

7. Complete Pseudocode Examples

7.1 Standard Diffusion Policy: Training + Inference

Training

# ============================================================
# Standard Diffusion Policy -- training loop
# ============================================================
import torch
import torch.nn as nn

class DiffusionPolicy(nn.Module):
    """Noise-prediction network eps_theta(a_t, t, obs)."""
    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.obs_encoder = nn.Linear(obs_dim, hidden_dim)
        self.net = nn.Sequential(
            nn.Linear(action_dim + hidden_dim * 2, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, a_t, t, obs):
        # a_t: [B, H, action_dim], t: [B, 1], obs: [B, obs_dim]
        H = a_t.size(1)
        # Broadcast the per-sample features along the horizon dimension
        t_feat = self.time_mlp(t).unsqueeze(1).expand(-1, H, -1)       # [B, H, hidden]
        o_feat = self.obs_encoder(obs).unsqueeze(1).expand(-1, H, -1)  # [B, H, hidden]
        return self.net(torch.cat([a_t, t_feat, o_feat], dim=-1))


def train_diffusion_policy(dataloader, model, T=100, lr=1e-4, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()

    # Linear noise schedule, identical to the one used at inference time
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bars = torch.cumprod(1 - betas, dim=0)

    for epoch in range(epochs):
        for batch_actions, batch_obs in dataloader:
            B = batch_actions.shape[0]

            # Sample a random time-step index; the network sees t normalized to [0, 1)
            t_idx = torch.randint(0, T, (B,))
            t = (t_idx.float() / T).unsqueeze(-1)          # [B, 1]

            # Sample the ground-truth noise
            eps = torch.randn_like(batch_actions)

            # Add noise: a_t = sqrt(alpha_bar) * a_0 + sqrt(1 - alpha_bar) * eps
            alpha_bar_t = alpha_bars[t_idx].view(B, 1, 1)
            a_t = alpha_bar_t.sqrt() * batch_actions + \
                  (1 - alpha_bar_t).sqrt() * eps

            # Predict the noise
            eps_pred = model(a_t, t, batch_obs)
            loss = mse(eps_pred, eps)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if epoch % 20 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.6f}")


def ddpm_step(model, a_t, t, obs, alpha_bar_t, beta_t):
    """One DDPM denoising step with sigma_t^2 = beta_t."""
    eps_pred = model(a_t, t, obs)
    alpha_t = 1 - beta_t
    coef1 = 1.0 / alpha_t.sqrt()
    coef2 = beta_t / (1 - alpha_bar_t).sqrt()
    z = torch.randn_like(a_t) if t.item() > 0 else 0.0   # no noise on the final step
    return coef1 * (a_t - coef2 * eps_pred) + beta_t.sqrt() * z

Inference

# ============================================================
# Standard Diffusion Policy -- inference
# ============================================================
@torch.no_grad()
def infer_diffusion_policy(model, obs, T=100, action_dim=7, horizon=100):
    # 1. Start from pure noise: [batch=1, horizon, action_dim]
    a_t = torch.randn(1, horizon, action_dim)

    # 2. Precompute the noise schedule (training should use the same one)
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bars = torch.cumprod(1 - betas, dim=0)

    # 3. Denoise step by step
    for t_idx in reversed(range(T)):
        t = torch.tensor([[t_idx / T]]).float()
        a_t = ddpm_step(model, a_t, t, obs,
                        alpha_bars[t_idx], betas[t_idx])

    return a_t


# ===== Example usage =====
# model = DiffusionPolicy(obs_dim=64, action_dim=7)
# model.load_state_dict(torch.load("model.pth"))
# while True:
#     obs = get_observation()
#     action_seq = infer_diffusion_policy(model, obs)
#     execute_action(action_seq[:, 0, :])

7.2 Flow Matching for VLA: Training + Inference

Training

# ============================================================
# Flow Matching -- training loop
# ============================================================
import torch
import torch.nn as nn

class FlowMatchingVLA(nn.Module):
    """Conditional vector-field network: v_theta(a_t, t, cond)"""
    def __init__(self, action_dim, cond_dim, hidden_dim=512):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.cond_encoder = nn.Linear(cond_dim, hidden_dim)
        self.net = nn.Sequential(
            nn.Linear(action_dim + hidden_dim * 2, hidden_dim),
            nn.SiLU(), nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(), nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, a_t, t, cond):
        # a_t: [B, H, action_dim], t: [B, 1], cond: [B, cond_dim]
        t_feat = self.time_mlp(t)                             # [B, hidden]
        c_feat = self.cond_encoder(cond)                      # [B, hidden]
        t_feat = t_feat.unsqueeze(1).expand(-1, a_t.size(1), -1)
        c_feat = c_feat.unsqueeze(1).expand(-1, a_t.size(1), -1)
        x = torch.cat([a_t, t_feat, c_feat], dim=-1)
        return self.net(x)


def train_flow_matching(dataloader, model, lr=1e-4, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()

    for epoch in range(epochs):
        for actions, cond in dataloader:
            B = actions.shape[0]

            # Sample time t ~ Uniform[0, 1]
            t = torch.rand(B, 1)

            # Initial noise
            a_0 = torch.randn_like(actions)

            # Linear interpolation path: a_t = (1 - t) * a_0 + t * a_1
            a_t = (1 - t.unsqueeze(-1)) * a_0 + t.unsqueeze(-1) * actions

            # Target vector field: u_t = a_1 - a_0
            u_t = actions - a_0

            # Predict the vector field
            v_pred = model(a_t, t, cond)
            loss = mse(v_pred, u_t)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if epoch % 20 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.6f}")

Inference (ODE solve)

# ============================================================
# Flow Matching -- inference (Euler ODE solver)
# ============================================================
@torch.no_grad()
def infer_flow_matching(model, cond, n_steps=10, horizon=100, action_dim=7):
    """
    cond:     [1, cond_dim]  observation + language condition
    n_steps:  number of ODE solver steps (2~20)
    """
    # Sample the initial noise from a standard Gaussian
    a_t = torch.randn(1, horizon, action_dim)

    # Euler ODE: t goes from 0 to 1
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = torch.tensor([[step * dt]])
        v = model(a_t, t, cond)
        a_t = a_t + v * dt       # Euler integration step

    return a_t                   # a_1, an approximate sample of a real action


# ===== Example usage =====
# img_feat = encode_image(camera_frame)
# txt_feat = encode_text("pick up the cup")
# cond = torch.cat([img_feat, txt_feat], dim=-1)
# action_seq = infer_flow_matching(model, cond, n_steps=5)
# execute_action(action_seq[0, :10])
# # Only 5 network evaluations instead of the 100 used by the diffusion sampler above

7.3 Anchor-based Diffusion Policy: Clustering, Training, Inference

Stage 1: Trajectory clustering (offline)

# ============================================================
# Anchor-based DP -- stage 1: trajectory clustering
# ============================================================
from sklearn.cluster import KMeans
import numpy as np

def extract_anchors(trajectory_dataset, K=8):
    """Cluster flattened trajectories; return anchors [K, H, d], labels [N], sigmas [K]."""
    H, d = trajectory_dataset[0].shape
    X = np.array([traj.reshape(-1) for traj in trajectory_dataset])   # [N, H*d]
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    anchors = kmeans.cluster_centers_.reshape(K, H, d)

    # Per-coordinate RMS deviation from the center, so that
    # N(c_k, sigma_k^2 I) matches the spread of cluster k
    sigmas = []
    for k in range(K):
        cluster_trajs = X[labels == k]
        if len(cluster_trajs) > 0:
            residual = cluster_trajs - kmeans.cluster_centers_[k]
            sigmas.append(np.sqrt((residual ** 2).mean()))
        else:
            sigmas.append(1.0)   # fallback for an empty cluster
    return anchors, labels, np.array(sigmas)

# Example usage
# anchors, labels, sigmas = extract_anchors(all_trajectories, K=8)
# print(f"Clustering done: {len(anchors)} anchors")

Stage 2: Anchor-conditioned diffusion training

# ============================================================
# Anchor-based DP -- stage 2: anchor-conditioned training
# ============================================================
class AnchorDiffusionPolicy(nn.Module):
    """Anchor-conditioned denoising network"""
    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.obs_encoder = nn.Linear(obs_dim, hidden_dim)
        self.anchor_encoder = nn.Linear(action_dim, hidden_dim)  # extra anchor encoder
        self.fusion = nn.Linear(hidden_dim * 3, hidden_dim)

        self.net = nn.Sequential(
            nn.Linear(action_dim + hidden_dim, hidden_dim),
            nn.SiLU(), nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(), nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, a_t, t, obs, anchor):
        t_feat = self.time_mlp(t)
        o_feat = self.obs_encoder(obs)
        a_feat = self.anchor_encoder(anchor.mean(dim=1))   # [B, hidden], global anchor feature

        cond = self.fusion(torch.cat([t_feat, o_feat, a_feat], dim=-1))
        cond = cond.unsqueeze(1).expand(-1, a_t.size(1), -1)

        x = torch.cat([a_t, cond], dim=-1)
        return self.net(x)


def train_anchor_diffusion(dataloader, model, anchors, sigmas, T=20, lr=1e-4, epochs=100):
    """
    dataloader: yields (action_seq, obs) batches
    anchors: [K, H, d]
    sigmas:  [K]
    T:       number of diffusion steps; must match the T used at inference
    """
    anchors_t = torch.tensor(anchors, dtype=torch.float32)
    sigmas_t  = torch.tensor(sigmas, dtype=torch.float32)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()

    # Same linear schedule as in infer_anchor_diffusion
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bars = torch.cumprod(1 - betas, dim=0)
    K = anchors_t.shape[0]

    for epoch in range(epochs):
        for actions, obs in dataloader:
            B = actions.shape[0]

            # Hard assignment: each trajectory -> nearest anchor
            anc_flat = anchors_t.view(K, -1).unsqueeze(0)   # [1, K, H*d]
            act_flat = actions.view(B, -1).unsqueeze(1)     # [B, 1, H*d]
            dists = ((act_flat - anc_flat) ** 2).sum(dim=-1)  # [B, K]
            anchor_idx = dists.argmin(dim=1)                 # [B]

            selected_anchors = anchors_t[anchor_idx]         # [B, H, d]
            selected_sigmas  = sigmas_t[anchor_idx]          # [B]

            # Add noise (its standard deviation is set by the within-cluster spread)
            t_idx = torch.randint(0, T, (B,))
            t = (t_idx.float() / T).unsqueeze(-1)            # [B, 1]
            eps_scale = selected_sigmas.view(B, 1, 1)
            eps = torch.randn_like(actions) * eps_scale

            alpha_bar_t = alpha_bars[t_idx].view(B, 1, 1)
            a_t = alpha_bar_t.sqrt() * actions + \
                  (1 - alpha_bar_t).sqrt() * eps

            eps_pred = model(a_t, t, obs, selected_anchors)
            loss = mse(eps_pred, eps)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if epoch % 20 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.6f}")

阶段三：Top-K 锚点推理 + Anchor Predictor

python 复制代码

# ============================================================
# Anchor-based DP -- 阶段 3a: 锚点预测器
# ============================================================
class AnchorPredictor(nn.Module):
    """从观测预测各锚点的相关分数"""
    def __init__(self, obs_dim, K=8, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, K),
        )

    def forward(self, obs):
        return self.net(obs)  # [B, K] logits


@torch.no_grad()
def infer_anchor_diffusion(model, anchor_predictor,
                           anchors, sigmas, obs, top_k=3, T=20):
    """
    model:             AnchorDiffusionPolicy
    anchor_predictor:  AnchorPredictor
    anchors:           [K, H, d]
    sigmas:            [K]
    obs:               [1, obs_dim]
    top_k:             并行去噪的锚点数
    T:                 每分支去噪步数 (较小, ~20)

    返回: best_action [1, H, d]
    """
    K, H, d = anchors.shape
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bars = torch.cumprod(1 - betas, dim=0)

    # ===== Step 1: 预测锚点分数 =====
    scores = anchor_predictor(obs)                 # [1, K]
    probs = torch.softmax(scores, dim=-1)
    topk_scores, topk_indices = torch.topk(probs, top_k, dim=-1)
    topk_indices = topk_indices.squeeze(0).tolist()

    # ===== Step 2: 每个锚点分支并行去噪 =====
    candidates = []
    for anchor_id in topk_indices:
        anchor = anchors[anchor_id:anchor_id+1]    # [1, H, d]
        sigma  = sigmas[anchor_id]

        # 关键: 从锚点附近初始化, 而非纯噪声
        a_t = anchor + torch.randn_like(anchor) * sigma

        for t_idx in reversed(range(T)):
            t = torch.tensor([[t_idx / T]]).float()
            eps_pred = model(a_t, t, obs, anchor)

            # DDPM 一步
            alpha_t = 1 - betas[t_idx]
            coef1 = 1.0 / alpha_t.sqrt()
            coef2 = betas[t_idx] / (1 - alpha_bars[t_idx]).sqrt()
            z = torch.randn_like(a_t) if t_idx > 0 else 0.0
            a_t = coef1 * (a_t - coef2 * eps_pred) \
                  + betas[t_idx].sqrt() * z

        candidates.append(a_t)

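    # ===== Step 3: Aggregate (Best-of-K) =====
    # torch.topk returns indices sorted by descending score, so candidates[0]
    # is the trajectory generated from the highest-scoring anchor.
    best_idx = 0
    best_action = candidates[best_idx]
    return best_action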


# ===== 实战走通 =====
def demo_anchor_pipeline():
    # 1. 离线聚类
    dummy = [np.random.randn(50, 7) for _ in range(2000)]
    anchors, labels, sigmas = extract_anchors(dummy, K=8)

    # 2. 训练锚点条件扩散
    model = AnchorDiffusionPolicy(obs_dim=64, action_dim=7)
    # train_anchor_diffusion(dataloader, model, anchors, sigmas)

    # 3. 训练锚点预测器 (训练方式类似分类器)
    anchor_pred = AnchorPredictor(obs_dim=64, K=8)
    # 对每个 obs, 标签 = 该 obs 下真实动作所属的锚点 ID
    # 用 CrossEntropyLoss 训练

    # 4. Top-K 推理
    obs = torch.randn(1, 64)
    action = infer_anchor_diffusion(
        model, anchor_pred,
        torch.tensor(anchors, dtype=torch.float32),
        torch.tensor(sigmas, dtype=torch.float32),
        obs, top_k=3, T=20,  # 仅 20 步去噪
    )
    print(f"生成动作: {action.shape}")
    return action

7.4 三者在伪代码层面的核心区别对比

环节	标准 Diffusion	Flow Matching	Anchor-based Diffusion
初始化	a_T ~ N(0, I)	a_0 ~ N(0, I)	a_T ~ N(c_k, sigma_k^2 I)
路径	sqrt(alpha)a_0 + sqrt(1-alpha)eps	(1-t)a_0 + ta_1	同 Diffusion, 但锚点引导
预测目标	eps_theta 预测噪声	v_theta 预测速度场	eps_theta 预测噪声 + 锚点条件
步数	50~100	2~20	10~30 (因更近)
多模态来源	随机噪声	随机噪声	显式锚点选择

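# ========== The essential difference in a few lines ==========
# ddpm_step, v, select_anchor, anchors, sigma_k, obs, H, d, T and N are
# placeholders for the components defined in the sections above.

# Standard Diffusion: stochastic walk from pure noise to data
a_t = torch.randn(1, H, d)                    # a_T ~ N(0, I)
for t in reversed(range(T)):
    a_t = ddpm_step(a_t, t, obs)              # fresh noise is injected at every step

# Flow Matching: integrate an ODE from noise to data
a_t = torch.randn(1, H, d)                    # a_0 ~ N(0, I)
dt = 1.0 / N
for step in range(N):
    t = step * dt
    a_t = a_t + v(a_t, t, obs) * dt           # Euler step; a handful of steps is often enough

# Anchor-based: start near an anchor and denoise conditioned on it
c_k = anchors[select_anchor(obs)]             # pick an anchor first
a_t = c_k + torch.randn_like(c_k) * sigma_k   # a_T ~ N(c_k, sigma_k^2 I)
for t in reversed(range(T)):
    a_t = ddpm_step(a_t, t, obs, c_k)         # anchor-conditioned denoising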

八、如何选择：实际工程决策指南

你的场景	推荐方案	原因
动作空间简单 (2~3 动作)	标准 Diffusion Policy	足够, 无需额外复杂度
需要高频控制 (50+ Hz)	Flow Matching	2~5 步 ODE 足够快
动作模式多样 (多峰明显)	Anchor-based DP	显式建模各模式
人机交互 (需要解释行为)	Anchor-based DP	展示选了哪个锚点
资源受限 (边缘部署)	Flow Matching	步数少, 模型轻
有大量离线数据	Anchor-based DP	聚类信息充分利用

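Appendix A: Prerequisite Math Concepts Explained (for Beginners)

This appendix covers the core mathematical concepts that the main text uses without explaining in depth, in the order in which they appear.

A.1 Reparameterization Trick

Appears in: the Diffusion forward process $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$

Intuition: suppose you want to sample $z$ from $\mathcal{N}(\mu, \sigma^2)$ and still let gradients flow through $\mu$ and $\sigma$.

Sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly is not differentiable (the sampling operation has no gradient).

The trick: separate the randomness from the parameters:

$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$

Now gradients with respect to $\mu$ and $\sigma$ pass through ordinary, deterministic addition and multiplication.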

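# Non-reparameterized: z is drawn directly, so no gradient reaches mean / std
z = torch.normal(mean, std)

# Reparameterized: the randomness lives in eps, z is a deterministic function
eps = torch.randn_like(mean)   # sample from a standard Gaussian
z = mean + std * eps           # gradients flow to mean and std

Why it matters: in a VAE, this is what lets gradients reach the encoder. In diffusion training the forward process has no learnable parameters, so the trick plays a different role: it gives $x_t$ in closed form as a deterministic function of $(x_0, \epsilon)$, which makes the sampled $\epsilon$ an explicit regression target for $\epsilon_\theta(x_t, t)$.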
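As a small numeric illustration of the closed-form forward step, the sketch below uses only the Python standard library. The helper names `alpha_bar` and `q_sample` and the linear beta schedule are assumptions made for this example, not code from the main text.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 1..t (t is 1-indexed)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod

def q_sample(x0, eps, abar):
    """Closed-form forward step: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    return [math.sqrt(abar) * a + math.sqrt(1.0 - abar) * e for a, e in zip(x0, eps)]

# Hypothetical linear schedule with T = 1000 steps
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]

abar = alpha_bar(betas, 500)
x_t = q_sample(x0, eps, abar)

# x_t is a deterministic function of (x0, eps), so eps can be recovered exactly;
# this is why the sampled noise can serve as a regression target.
eps_rec = [(xt - math.sqrt(abar) * a) / math.sqrt(1.0 - abar) for xt, a in zip(x_t, x0)]
```

The recovery step is only possible because the randomness was isolated in `eps`; a real training loop would feed `x_t` to the network and compare its output with `eps`.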

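A.2 Markov Chain

Appears in: the Diffusion forward and reverse processes, both defined as Markov chains

Intuition: the "Markov property" means the future depends only on the present, not on the past.

$$q(x_t | x_{t-1}, x_{t-2}, \dots, x_0) = q(x_t | x_{t-1})$$

Analogy: a person with no memory, who knows only where they were one second ago and nothing earlier.

In diffusion models:

Forward chain: $x_t$ is obtained by adding noise to $x_{t-1}$ only; $x_{t-2}$ influences $x_t$ solely through $x_{t-1}$
Reverse chain: $x_{t-1}$ is obtained by denoising $x_t$ only

# Non-Markov: the update needs the whole history
def denoise_with_history(x_t, history): ...

# Markov: the update needs only the current state
def denoise(x_t, t): ...  # x_t already carries everything required

Why it matters: it simplifies model design. The denoising network takes only the current $x_t$ and the time $t$ as input and does not have to maintain the full history.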

A.3 Evidence Lower Bound (ELBO)

Appears in: the full derivation of the Diffusion training objective

Why it is needed: we want to maximize the likelihood $\log p_\theta(x_0)$ of data drawn from the true distribution $q(x_0)$, but computing $p_\theta(x_0)$ requires integrating out all intermediate variables $x_1, \dots, x_T$, which is intractable.

Core idea of the ELBO: rather than maximizing $\log p_\theta(x_0)$ directly (too hard), maximize a lower bound on it:

$$\log p_\theta(x_0) \geq \underbrace{\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]}_{\text{ELBO}}$$

Intuition:

log p(x) = true height of the mountain (cannot be measured directly)
ELBO     = height of a hot-air balloon that always stays below the summit (computable)
  Raising the balloon pushes it toward the summit; the remaining gap is
  KL(q(x_{1:T}|x_0) || p_theta(x_{1:T}|x_0))

So in diffusion models the ELBO is maximized as a tractable surrogate for the true likelihood: raising the bound raises a guaranteed floor on $\log p_\theta(x_0)$, and the two coincide only when the gap vanishes.

Simplification (abridged; expectations over $x_t \sim q(x_t|x_0)$ are omitted in the middle term):

$$\text{ELBO} = \underbrace{\mathbb{E}_{q(x_1|x_0)}\left[\log p_\theta(x_0|x_1)\right]}_{\text{reconstruction term}} - \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(x_{t-1}|x_t, x_0) \,\|\, p_\theta(x_{t-1}|x_t)\big)}_{\text{denoising matching term}} - \underbrace{D_{KL}\big(q(x_T|x_0) \,\|\, p(x_T)\big)}_{\text{prior matching term}}$$

In the end, each denoising matching term reduces to $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ up to a time-dependent weight, which $\mathcal{L}_{\text{simple}}$ drops.
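To make the "lower bound" property concrete, here is a minimal sketch on a toy model with a single binary latent variable, small enough that the exact evidence can be computed. The probabilities are made up for illustration, and the model has nothing to do with diffusion beyond sharing the same inequality.

```python
import math

# Toy latent-variable model with one binary latent z (all numbers are made up)
p_z = [0.3, 0.7]            # prior p(z)
p_x_given_z = [0.9, 0.2]    # likelihood p(x | z) for one fixed observation x

def log_evidence():
    """Exact log p(x), tractable here because z takes only two values."""
    return math.log(sum(pz * px for pz, px in zip(p_z, p_x_given_z)))

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a variational distribution q over z."""
    return sum(qz * (math.log(pz * px) - math.log(qz))
               for qz, pz, px in zip(q, p_z, p_x_given_z) if qz > 0)

joint = [pz * px for pz, px in zip(p_z, p_x_given_z)]
posterior = [j / sum(joint) for j in joint]

loose = elbo([0.5, 0.5])     # arbitrary q: strictly below log p(x)
tight = elbo(posterior)      # q equal to the true posterior: the bound is tight
```

With an arbitrary `q` the bound sits below `log_evidence()`; it touches the true value only when `q` is the exact posterior, which is the gap described in the balloon analogy.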

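A.4 Score Matching

Appears in: the diffusion training loss, which is known as "denoising score matching"

What the score function is:

$$\text{Score}(x) = \nabla_x \log p(x)$$

It is not the probability itself but the direction in which the log-density increases fastest.

Intuition:

p(x)  = data distribution (what we want but do not know)
s(x)  = ∇_x log p(x) (an "arrow" pointing toward high-probability regions)

Analogy:
  p(x) = terrain elevation       (absolute height, hard to measure)
  s(x) = slope / contour gradient (relative direction, learnable)

Denoising Score Matching: instead of estimating $\nabla_x \log p(x)$ of the clean data directly, a diffusion model estimates the score of the noised data. For a given $x_0$ the conditional score is available in closed form:

$$\nabla_{x_t} \log q(x_t | x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}} \propto -\epsilon$$

So predicting the noise $\epsilon_\theta$ is equivalent, up to a known scale, to estimating the score function. Denoising means taking one step along the score direction toward a higher-probability region.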
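The relation between the score and the noise can be checked numerically in one dimension. The sketch below compares a finite-difference derivative of the Gaussian log-density with the two closed forms; the values of `abar`, `x0` and `eps` are arbitrary choices for the example.

```python
import math

def log_gauss(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def score_fd(x, mean, std, h=1e-5):
    """Central finite-difference estimate of d/dx log N(x; mean, std^2)."""
    return (log_gauss(x + h, mean, std) - log_gauss(x - h, mean, std)) / (2 * h)

abar = 0.6                      # hypothetical value of alpha_bar_t
x0, eps = 1.5, -0.8             # a data point and the noise added to it
mean, std = math.sqrt(abar) * x0, math.sqrt(1 - abar)
x_t = mean + std * eps

score_numeric = score_fd(x_t, mean, std)
score_closed = -(x_t - mean) / std ** 2     # -(x_t - sqrt(abar) * x0) / (1 - abar)
score_from_eps = -eps / std                 # -eps / sqrt(1 - abar)
```

All three quantities agree, which is the sense in which a noise predictor doubles as a score estimator.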

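A.5 KL Divergence (Kullback-Leibler Divergence)

Appears in: the ELBO derivation, where it measures the difference between two distributions

Definition:

$$D_{KL}(q \| p) = \mathbb{E}_{x \sim q}\left[\log \frac{q(x)}{p(x)}\right]$$

Intuition: the average amount of information lost when distribution $p$ is used to approximate distribution $q$.

q = true distribution (target)
p = model distribution (our approximation)
D_KL(q||p) = the "cost" of using p in place of q

Properties:
- D_KL >= 0, with equality if and only if q = p
- Asymmetric: in general D_KL(q||p) != D_KL(p||q)

In diffusion models: the ELBO term $D_{KL}(q(x_{t-1}|x_t,x_0) \| p_\theta(x_{t-1}|x_t))$ measures the gap between the "true denoising distribution" and the "denoising distribution learned by the network". Minimizing it pushes each learned denoising step toward the true one.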

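A.6 Vector Field

Appears in: the core Flow Matching object $v_t(x)$

Intuition:

Picture the velocity field of a fluid:
  - every point in space (x) carries an arrow (v)
  - the arrow says: if you are at this point, in which direction and how fast you move
  - the arrow may change over time (t)

Mathematical definition:

$$v: [0,1] \times \mathbb{R}^d \to \mathbb{R}^d, \qquad (t, x) \mapsto v_t(x)$$

Input: time $t$ and position $x$
Output: the velocity vector at that point

Analogy:

Concept	Real-world analogy
Vector field $v_t(x)$	Speed and direction of the current at every point of a river
Particle path $\phi_t(x)$	The trajectory of a leaf drifting on the surface
Initial position $x_0$	Where the leaf is dropped
Position at $t=1$, $x_1 = \phi_1(x_0)$	Where the leaf ends up

# A vector field answers: given a position and a time, which way do I move next?
def vector_field(x, t):
    # x: current position (e.g. 3-D coordinates), t: current time
    # returns the velocity at that point; this toy field pulls x toward the origin
    return -x

# Flow Matching trains a neural network v_theta(x, t) to play the role of this function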

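A.7 Flow and Continuous Normalizing Flows (CNF)

Appears in: the Flow Matching ODE $\frac{d}{dt} \phi_t(x) = v_t(\phi_t(x))$

What a flow is:

A flow is a map that sends an initial position $x$ to its position $\phi_t(x)$ at time $t$:

$$x_t = \phi_t(x_0)$$

Two views of a flow:

View 1: Lagrangian (follow a single particle)
  - φ_t(x_0) = where the particle that started at x_0 is at time t

View 2: Eulerian (stand at a fixed position and watch particles pass)
  - v_t(x) = flow velocity at position x and time t

How the two relate:

$$\phi_t(x_0) = x_0 + \int_0^t v_s(\phi_s(x_0)) \, ds$$

That is, the flow is the integral of the vector field along the particle's path.

Continuous Normalizing Flow (CNF):

Normalizing: $\phi_t$ is invertible (a bijection), so the change-of-variables formula keeps the transformed density normalized
Continuous: it varies smoothly with time $t$
Flow: it transforms a simple distribution (Gaussian) into a complex one (data)

Simple distribution p_0 (Gaussian noise)
|
| φ_t (continuous transformation)
|
Complex distribution p_1 (real data)

Analogy:

Think of a CNF as rolling out dough:
  p_0 = a perfectly round ball of dough (noise)
  φ_t = the continuous rolling process (flow)
  v_t = direction and force of each rolling-pin stroke (vector field)
  p_1 = the final flat sheet (data distribution)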

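A.8 SDE vs ODE (Stochastic vs Deterministic Processes)

Appears in: the difference in path type between Diffusion with DDPM-style sampling (SDE) and Flow Matching (ODE)

SDE (stochastic differential equation):

$$dx_t = \mu(x_t, t)\, dt + \sigma(x_t, t)\, dW_t$$

Deterministic term $dt$: the trend direction
Stochastic term $dW_t$: Brownian motion (random perturbation)

Analogy: a drunk person walking, heading roughly forward but swaying randomly at every step.

ODE (ordinary differential equation):

$$\frac{dx_t}{dt} = v_t(x_t)$$

Only a deterministic term
Given the starting point, the path is fully determined

Analogy: a train on rails, with a fixed trajectory and no randomness.

# SDE (Euler-Maruyama): fresh noise at every step
dt = 1.0 / 100
for step in range(100):
    t = step * dt
    noise = torch.randn_like(x)
    x = x + drift(x, t) * dt + diffusion(x, t) * noise * dt ** 0.5

# ODE (Euler): every step is deterministic
dt = 1.0 / 10
for step in range(10):
    t = step * dt
    x = x + vector_field(x, t) * dt   # no random term

Impact in VLA:

Property	SDE (Diffusion)	ODE (Flow Matching)
Action smoothness	Random jitter from per-step noise	Smooth by construction
Run-to-run repeatability	Low (differs every run)	High (same initial noise → same output)
Source of diversity	Noise at every step	Initial noise only
Sampling steps	Many steps needed to average out the noise	Can be very few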

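A.9 K-Means Clustering

Appears in: the offline trajectory clustering of Anchor-based DP

Goal: split $N$ trajectories into $K$ groups so that trajectories within a group are as similar as possible.

Algorithm:

import numpy as np

def k_means(trajectories, K=8, max_iter=100, seed=0):
    # Flatten each trajectory [H, d] into one vector so distances are well defined
    X = np.asarray(trajectories, dtype=float).reshape(len(trajectories), -1)  # [N, H*d]
    rng = np.random.default_rng(seed)
    # 1. Initialise with K distinct trajectories as centers
    centers = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(max_iter):
        # 2. Assign: each trajectory -> nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)  # [N, K]
        labels = dists.argmin(axis=1)

        # 3. Update: each center -> mean of its members (keep the old center if empty)
        new_centers = np.stack([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])

        # 4. Converged once the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return centers, labels

Why it is used for Anchor-based DP:

Minimizing $\sum_i \|\tau_i - c_{k_i}\|^2$ makes each anchor represent one typical class of behavior
The cluster standard deviation $\sigma_k$ reflects how much that class of behavior varies, and is used to set the range of the noise initialization

# Example: 8 anchors corresponding to 8 grasp modes
# anchor_0: grasp from directly above             (sigma=0.02, tightly clustered)
# anchor_1: reach around an obstacle from the right (sigma=0.15, more spread out)
# anchor_2: two-finger pinch                      (sigma=0.03, precise control)
# ...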

A.10 Numerical ODE Solvers (Euler / Runge-Kutta)

Appears in: ODE integration during Flow Matching inference

Euler method, the simplest numerical integrator:

$$\phi_{t+\Delta t}(x) = \phi_t(x) + v_t(\phi_t(x)) \cdot \Delta t$$

# Euler: look at the current velocity, then move along it for dt
def euler_ode(v_func, x_0, t_start=0, t_end=1, n_steps=10):
    x = x_0
    dt = (t_end - t_start) / n_steps
    for step in range(n_steps):
        t = t_start + step * dt
        v = v_func(x, t)       # velocity at the current point
        x = x + v * dt         # take a small step along it
    return x

RK4 (fourth-order Runge-Kutta), more accurate:

def rk4_step(v_func, x, t, dt):
    k1 = v_func(x, t)
    k2 = v_func(x + dt*k1/2, t + dt/2)
    k3 = v_func(x + dt*k2/2, t + dt/2)
    k4 = v_func(x + dt*k3, t + dt)
    return x + dt/6 * (k1 + 2*k2 + 2*k3 + k4)

How the two differ in Flow Matching:

Solver	Network evaluations per step	Accuracy	Recommended steps
Euler	1	Low	10~20
RK4	4	High	2~5

In practice either Euler with more steps or RK4 with a few steps is used. Since network inference is the bottleneck, the fair comparison is total network evaluations: 10 to 20 Euler steps cost 10 to 20 evaluations, while 2 to 5 RK4 steps cost 8 to 20.
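To see the accuracy gap at an equal evaluation budget, the sketch below integrates a toy field whose exact solution is known. The field `dx/dt = -x` is an assumption chosen for the example; a learned velocity field is less smooth, so the gap in a real policy may be smaller.

```python
import math

def v(x, t):
    """Toy vector field dx/dt = -x, whose exact flow is x(t) = x0 * exp(-t)."""
    return -x

def euler(x, n_steps):
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + v(x, i * dt) * dt
    return x

def rk4(x, n_steps):
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        k1 = v(x, t)
        k2 = v(x + dt * k1 / 2, t + dt / 2)
        k3 = v(x + dt * k2 / 2, t + dt / 2)
        k4 = v(x + dt * k3, t + dt)
        x = x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

exact = math.exp(-1.0)
err_euler = abs(euler(1.0, 20) - exact)   # 20 steps = 20 field evaluations
err_rk4 = abs(rk4(1.0, 5) - exact)        # 5 steps  = 20 field evaluations
```

Both runs spend 20 evaluations of `v`, yet the RK4 error is orders of magnitude smaller on this smooth problem.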

A.11 Basic Properties of the Gaussian Distribution

Appears in: every $\mathcal{N}(\mu, \sigma^2 I)$ throughout the text

PDF (probability density function):

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

Key properties:

1. Closure under addition (the core of diffusion models):

$$x_0 \sim \mathcal{N}(0, I), \; \epsilon \sim \mathcal{N}(0, I) \text{ independent}$$

$$\Rightarrow \sqrt{\bar{\alpha}}\, x_0 + \sqrt{1 - \bar{\alpha}}\, \epsilon \sim \mathcal{N}(0, I)$$

That is, a sum of independent Gaussians is again Gaussian, and the variances add: $\bar{\alpha} + (1 - \bar{\alpha}) = 1$.

2. Conditional Gaussians (DDPM reverse process):

Given $x_t$ and $x_0$, the joint Gaussian implies that $q(x_{t-1}|x_t, x_0)$ is also Gaussian.

3. Gaussians and KL divergence:

The KL divergence between two Gaussians has a closed form, which is the key to simplifying the diffusion loss.
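As a minimal sketch of that closed form, here is the one-dimensional case. The function name `kl_gauss` and the test values are choices made for this example.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for 1-D Gaussians, in nats."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

same = kl_gauss(0.0, 1.0, 0.0, 1.0)       # identical distributions
forward = kl_gauss(0.0, 1.0, 1.0, 2.0)
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)    # arguments swapped

# With equal variances the KL reduces to a scaled squared distance between the means,
# which is how the per-step KL terms of the ELBO turn into a squared-error loss.
equal_var = kl_gauss(0.3, 0.5, 0.8, 0.5)
```

The swapped call gives a different value, illustrating the asymmetry of the KL divergence, and the equal-variance case shows where the squared error in the training loss comes from.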

A.12 Notation Quick Reference

Symbol	Meaning	First appears in
$\mathcal{N}(\mu, \Sigma)$	Gaussian distribution (mean $\mu$, covariance $\Sigma$)	Diffusion forward process
$\beta_t$	Variance parameter of the noise schedule	$q(x_t|x_{t-1})$
$\alpha_t = 1 - \beta_t$	Fraction of signal retained per step	Definition of $\bar{\alpha}_t$
$\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$	Cumulative fraction of signal retained	$q(x_t|x_0)$
$\epsilon$	Standard Gaussian noise, $\epsilon \sim \mathcal{N}(0, I)$	Reparameterization
$T$	Total number of diffusion steps (typically 1000)	DDPM
$\phi_t(x)$	Flow (map from initial position $x$ to its position at time $t$)	CNF
$v_t(x)$	Vector field at $(t, x)$	Flow Matching
$p_t(x)$	Probability distribution at time $t$	Flow Matching
$\mathcal{L}_{\text{simple}}$	Simplified diffusion loss	Training objective
$\mathcal{L}_{\text{CFM}}$	Conditional Flow Matching loss	Training objective
$D_{KL}$	KL divergence	ELBO
$\nabla_x \log p(x)$	Score function	Score matching
$K$	Number of anchors	Anchor-based DP
$c_k$	The $k$-th anchor (cluster center)	Anchor-based DP
$\sigma_k$	Standard deviation of the $k$-th cluster	Anchor-based DP
$H$	Action sequence length (horizon)	VLA Action
$d$	Action dimension (e.g. 7-DOF)	VLA Action

A.13 学习路线图（前置知识依赖）


┌────────────────────────────────────────────┐
│ 学这个笔记需要的前置知识 (按需补)            │
├────────────────────────────────────────────┤
│                                            │
│  1. 基础概率论                              │
│     ├─ 随机变量, 概率密度函数               │
│     ├─ 高斯分布 (A.11)                     │
│     └─ 条件概率, 边缘概率                   │
│                                            │
│  2. 基础线性代数                            │
│     ├─ 向量, 矩阵乘法                       │
│     ├─ 范数 (∥·∥²)                        │
│     └─ 协方差矩阵, 单位矩阵 I              │
│                                            │
│  3. 基础信息论 (可选)                       │
│     └─ KL 散度 (A.5)                      │
│                                            │
│  4. 机器学习基础                            │
│     ├─ 损失函数, 梯度下降                   │
│     ├─ 重参数化技巧 (A.1)                  │
│     └─ 变分推断 → ELBO (A.3)              │
│                                            │
│  5. 本笔记的核心概念                         │
│     ├─ Diffusion (第一章)                   │
│     ├─ Flow Matching (第二章)              │
│     └─ VLA 应用 (第三章)                   │
│         └─ Anchor-based (第五章)           │
│                                            │
└────────────────────────────────────────────┘

提示：以上这些前置概念不需要一次性学完。建议路线：

先理解重参数化技巧 (A.1) 和马尔可夫链 (A.2)
理解 KL 散度 (A.5) 后即可看懂 Diffusion 的 ELBO 推导
向量场 (A.6) 和流 (A.7) 是 Flow Matching 的基石
最后补 SDE vs ODE (A.8) 理解两种范式的本质差异

附录 B：DiT（Diffusion Transformer）详解

B.1 为什么需要 DiT？

背景：早期的 Diffusion 模型（DDPM, Stable Diffusion 1.x）都使用 U-Net 作为去噪骨干网络。

U-Net 的局限：

局部归纳偏置太重：卷积天然关注局部，难以建模动作序列的长程依赖
多模态融合不自然：文本/视觉条件需要用 Cross-Attention 额外插入
扩展性差：增大 U-Net 的参数量收益递减

DiT 的解决思路 ：用 Transformer 替代 U-Net。

Transformer 的 Self-Attention 天然建模全局依赖
多种模态都可以作为 token 序列 自然融合
参数规模可像 LLM 一样平滑扩展

U-Net 时代: DiT 时代:
┌──────┐ ┌──────────┐
│图像 │ │图像 → patch │
│ ↓ │ │ ↓ │
│Conv │ │Embed │
│ ↓ │ │ ↓ │
│UNet │ │Transformer│
│ ↓ │ │ ↓ │
│噪声ε │ │噪声ε / 速度v│
└──────┘ └──────────┘

B.2 DiT 架构总览


                        输入: 带噪图像 z_t (或带噪动作 a_t)
                                   |
                              Patchify
                          (将空间/时序分块)
                                   |
                     ┌───────────────┐
                     │  Token 序列    │
                     │  [CLS, t1, t2, ...] │
                     └───────────────┘
                                   |
                        位置编码 (Pos Embed)
                                   |
                     ┌───────────────┐
                     │  DiT Block × N │
                     │               │
                     │  ① adaLN  → 时间+条件  │
                     │  ② Multi-Head Self-Attn │
                     │  ③ MLP (FFN)          │
                     └───────────────┘
                                   |
                       Un-patchify (重组)
                                   |
                        输出: 预测噪声 / 速度

B.3 核心组件详解

B.3.1 Patchify（分块嵌入）

将输入 x∈RL×dx \in \mathbb{R}^{L \times d}x∈RL×d（例如 L=100L=100L=100 步动作, d=7d=7d=7 维）切分成 patches：


def patchify(x, patch_size=4):
    """
    x: [B, L, d]    (L=100 步动作, d=7 维)
    patch_size=4: 每 4 步拼成一个 patch

    输出: [B, L/patch_size, patch_dim]
    """
    B, L, d = x.shape
    num_patches = L // patch_size                    # 25 patches
    x = x.view(B, num_patches, patch_size * d)       # [B, 25, 28]
    return x

意义：

减少 token 数量（100 → 25），降低计算量
每个 patch 包含局部时序信息

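B.3.2 adaLN (Adaptive Layer Normalization)

This is the conditioning mechanism DiT relies on: condition information enters the Transformer through its normalization layers. (The best-performing variant in the DiT paper, adaLN-Zero, additionally initializes every residual branch as the identity.)

Standard LayerNorm:

$$y = \gamma \odot \frac{x - \mu}{\sigma} + \beta$$

where $\gamma, \beta$ are learnable parameters shared across all timesteps.

adaLN instead generates $\gamma, \beta$ dynamically from the condition $c$ (timestep $t$, class label, text, and so on):

$$y = \gamma_\theta(c) \odot \frac{x - \mu}{\sigma} + \beta_\theta(c)$$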


class AdaLN(nn.Module):
    """自适应 LayerNorm: γ, β 由条件动态生成"""
    def __init__(self, hidden_dim, cond_dim):
        super().__init__()
        # 条件编码器: 从条件预测 γ 和 β
        self.condition_proj = nn.Sequential(
            nn.SiLU(),
            nn.Linear(cond_dim, hidden_dim * 2),   # 输出 [γ, β]
        )

    def forward(self, x, cond):
        """
        x:    [B, N, hidden]   token 序列
        cond: [B, cond_dim]    条件 (时间+文本+图像)
        """
        gamma, beta = self.condition_proj(cond).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(1)  # [B, 1, hidden]
        beta  = beta.unsqueeze(1)   # [B, 1, hidden]

        # 标准 LayerNorm 后, 再缩放+平移
        mean = x.mean(dim=-1, keepdim=True)
        var  = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / (var + 1e-6).sqrt()
        return x_norm * (1 + gamma) + beta

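Why adaLN rather than Cross-Attention?

Cross-Attention adds an extra attention layer to every block (a QK^T product against the condition tokens)
adaLN is only a per-feature scale and shift, so its compute overhead is negligible
In the DiT paper's class-conditional ImageNet experiments, adaLN-style conditioning reached lower FID than Cross-Attention conditioning at lower compute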
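The "scale and shift" operation is small enough to write out for a single token. The sketch below uses the `(1 + gamma)` convention and plain Python lists; the gamma and beta values are made up, whereas in a real model they come from a small network applied to the condition.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ada_layer_norm(x, gamma, beta):
    """Normalize, then apply a condition-dependent scale (1 + gamma) and shift beta."""
    return [n * (1.0 + g) + b for n, g, b in zip(layer_norm(x), gamma, beta)]

token = [0.2, -1.3, 0.7, 2.4]

# Zero modulation: behaves like a plain LayerNorm without affine parameters
plain = ada_layer_norm(token, gamma=[0.0] * 4, beta=[0.0] * 4)

# A hypothetical condition that doubles the scale and shifts every feature by 0.5
modulated = ada_layer_norm(token, gamma=[1.0] * 4, beta=[0.5] * 4)
```

The modulation costs one multiply and one add per feature, which is why it is so much cheaper than attending over condition tokens.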

B.3.3 DiT Block 完整实现


class DiTBlock(nn.Module):
    """单个 DiT Transformer Block"""
    def __init__(self, hidden_dim, num_heads, cond_dim):
        super().__init__()
        self.adaLN1 = AdaLN(hidden_dim, cond_dim)        # 条件 LN
        self.attn   = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.adaLN2 = AdaLN(hidden_dim, cond_dim)        # 条件 LN
        self.mlp    = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.GELU(),
            nn.Linear(hidden_dim * 4, hidden_dim),
        )

    def forward(self, x, cond):
        """
        x:    [B, N, hidden]
        cond: [B, cond_dim]
        """
        # adaLN → Self-Attn → residual (query, key and value are the same tokens)
        h = self.adaLN1(x, cond)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # adaLN → MLP → residual
        x = x + self.mlp(self.adaLN2(x, cond))
        return x

B.3.4 完整 DiT 实现（用于动作生成）


class DiT(nn.Module):
    """Diffusion Transformer --- 用于 VLA 动作生成"""
    def __init__(self, action_dim=7, horizon=100, patch_size=5,
                 hidden_dim=512, num_heads=8, num_blocks=12, cond_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = horizon // patch_size

        # 1. 分块嵌入
        self.patch_embed = nn.Linear(patch_size * action_dim, hidden_dim)

        # 2. 可学习位置编码
        self.pos_embed = nn.Parameter(
            torch.randn(1, self.num_patches + 1, hidden_dim) * 0.02
        )

        # 3. [CLS] token (用于聚合全局信息)
        self.cls_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)

        # 4. 条件编码器 (将时间 + 观测 + 语言 融合为条件向量)
        self.cond_encoder = nn.Sequential(
            nn.Linear(cond_dim + 1, hidden_dim * 4),  # +1 是时间 t
            nn.SiLU(),
            nn.Linear(hidden_dim * 4, hidden_dim),
        )

        # 5. N 个 DiT Block
        self.blocks = nn.ModuleList([
            DiTBlock(hidden_dim, num_heads, hidden_dim)
            for _ in range(num_blocks)
        ])

        # 6. Output layer: predict noise / velocity
        #    (two separate modules, because nn.Sequential cannot pass the condition to AdaLN)
        self.final_norm = AdaLN(hidden_dim, hidden_dim)
        self.final_proj = nn.Linear(hidden_dim, patch_size * action_dim)

    def forward(self, a_t, t, cond):
        """
        a_t:  [B, H, action_dim]  noisy actions
        t:    [B] or [B, 1]       timestep
        cond: [B, cond_dim]       fused observation + language condition
        """
        B, _, action_dim = a_t.shape

        # --- Patchify ---
        x = a_t.reshape(B, self.num_patches, self.patch_size * action_dim)
        x = self.patch_embed(x)                         # [B, N, hidden]

        # --- Prepend [CLS] ---
        cls_tokens = self.cls_token.expand(B, -1, -1)   # [B, 1, hidden]
        x = torch.cat([cls_tokens, x], dim=1)           # [B, N+1, hidden]
        x = x + self.pos_embed

        # --- Encode the condition ---
        t_flat = (t.unsqueeze(-1) if t.dim() == 1 else t).to(cond.dtype)  # [B, 1]
        c = self.cond_encoder(torch.cat([cond, t_flat], dim=-1))          # [B, hidden]

        # --- Transformer Blocks ---
        for block in self.blocks:
            x = block(x, c)

        # --- Output (drop [CLS], un-patchify) ---
        x = x[:, 1:, :]                                 # [B, N, hidden]
        x = self.final_proj(self.final_norm(x, c))      # [B, N, patch_size * action_dim]
        x = x.reshape(B, -1, action_dim)                # [B, H, action_dim]

        return x

B.4 DiT 用于 Diffusion vs Flow Matching


# ===== Diffusion 模式: 预测噪声 =====
def dit_diffusion_forward(dit, a_t, t, cond):
    eps_pred = dit(a_t, t, cond)        # DiT 输出直接是噪声
    return eps_pred

# ===== Flow Matching 模式: 预测向量场 =====
def dit_flowmatching_forward(dit, a_t, t, cond):
    v_pred = dit(a_t, t, cond)          # 同样架构, 只是语义不同
    return v_pred

同一套 DiT 架构可以用于 Diffusion 或 Flow Matching，只改变损失函数。

B.5 DiT 在 VLA 中的应用现状

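B.5.1 The DiT-style architecture in π0

Physical Intelligence's π0 model pairs a pretrained vision-language model with a Transformer "action expert" trained with Flow Matching:

Input: images from the robot's cameras + language instruction + proprioceptive state
  |
PaliGemma VLM backbone (SigLIP vision encoder + Gemma language model)
  |
Action expert (separate Transformer weights that attend to the VLM tokens)
  |
Flow Matching output (predicts the vector field over the action chunk)
  |
Action chunk (the next N steps of actions)

Characteristics of the π0 design:

Image, language, state and noisy-action tokens are processed together in one Transformer sequence
Action tokens attend to the image and language tokens through Self-Attention
It uses Flow Matching rather than DDPM-style Diffusion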

B.5.2 为什么 DiT 适合 VLA？

对比维度	U-Net 方案	DiT 方案
动作时序建模	CNN 感受野有限	Self-Attention 全局可见
多模态融合	需额外 Cross-Attention	所有 token 一起做 Self-Attention
条件注入	层层拼接/AdaGN	adaLN 高效条件调制
参数量扩展	收益递减	像 LLM 一样平滑扩展
推理速度	卷积快	需优化 (Flash Attention, KV cache)

核心优势 ：DiT 可以把视觉 token、语言 token、动作 token 拼成一个长序列，用统一的 Transformer 处理------这非常适合 VLA 的多模态性质。


# U-Net 风格:
#    visual_feat → Conv  (各自处理)
#    text_feat   → CrossAttn (额外融合)
#    action      → Conv
#    三者在不同空间, 互不共享

# DiT 风格:
#    [visual_token_1, ..., visual_token_N,
#     text_token_1, ..., text_token_M,
#     action_token_1, ..., action_token_K]
#    → 一个 Transformer 同时建模所有模态的交互

B.6 Scalable（可扩展性）：DiT 的核心卖点

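The most important finding of the DiT paper (Peebles & Xie, 2023) is that generation quality improves steadily as the model grows:

Model (patch size 2)   Params   FID-50K at 400K steps, no guidance (lower is better)
DiT-S/2                33M      ≈ 68
DiT-B/2                130M     ≈ 43
DiT-L/2                458M     ≈ 23
DiT-XL/2               675M     ≈ 19
                       (bigger is better)

With longer training and classifier-free guidance, DiT-XL/2 reaches FID 2.27 on ImageNet 256×256, ahead of the earlier U-Net-based diffusion models the paper compares against (ADM, LDM).

A scaling law in the spirit of LLMs:

DiT parameter count = f(blocks, hidden_dim, heads)

DiT-S:    num_blocks=12, hidden_dim=384,  heads=6    (33M)
DiT-B:    num_blocks=12, hidden_dim=768,  heads=12   (130M)
DiT-L:    num_blocks=24, hidden_dim=1024, heads=16   (458M)
DiT-XL:   num_blocks=28, hidden_dim=1152, heads=16   (675M)
Larger:   as with LLMs, model sizes keep growing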

B.7 DiT 与其他架构的比较

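Architecture	Condition injection	Global modeling	Multimodal	Scalability	Representative work
U-Net	AdaGN / CrossAttn	Weak (CNN locality)	Needs extra fusion	Poor	DDPM, Stable Diffusion 1.x
DiT	adaLN	Strong (Self-Attn)	Native	Good	DiT, SD3, π0, Sora
Mamba	Cross-Scan	Medium	Medium	Good	Emerging direction
MoE-DiT	adaLN + Router	Strong	Native	Very good	Emerging direction

Note: Sora (OpenAI video generation), Stable Diffusion 3 and π0 all use DiT-style architectures. DiT has become a widely adopted backbone for diffusion and flow-matching models.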

B.8 在 VLA 中 DiT 的简化伪代码


# ============================================================
# 完整的 DiT-based Flow Matching VLA 推理
# ============================================================
class DiTVLA(nn.Module):
    """基于 DiT 的 VLA 策略 (Flow Matching 版本)"""
    def __init__(self, action_dim=7, horizon=50, cond_dim=768):
        super().__init__()
        self.dit = DiT(
            action_dim=action_dim,
            horizon=horizon,
            patch_size=5,          # 每 5 步一个 patch
            hidden_dim=768,
            num_heads=12,
            num_blocks=12,
            cond_dim=cond_dim,
        )

    def forward(self, a_t, t, image_feat, text_feat):
        # 融合视觉和语言条件
        cond = torch.cat([image_feat, text_feat], dim=-1)
        # DiT 预测向量场
        return self.dit(a_t, t, cond)


@torch.no_grad()
def infer_dit_vla(model, image, text, n_ode_steps=5, horizon=50):
    """
    使用 DiT + Flow Matching 生成动作
    """
    # 1. 编码多模态输入
    img_feat = model.image_encoder(image)    # [1, 512]
    txt_feat = model.text_encoder(text)      # [1, 256]
    cond = torch.cat([img_feat, txt_feat], dim=-1)  # [1, 768]

    # 2. 从噪声出发
    a_t = torch.randn(1, horizon, 7)

    # 3. ODE 求解 (仅需少数几步)
    dt = 1.0 / n_ode_steps
    for step in range(n_ode_steps):
        t = torch.tensor([[step * dt]]).float()
        v = model.dit(a_t, t, cond)          # DiT 预测向量场
        a_t = a_t + v * dt

    return a_t  # [1, 50, 7] 未来 50 步动作

B.9 DiT 关键论文速览

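Paper	Year	Core contribution
DiT (Peebles & Xie)	2023	Replaced the U-Net with a Transformer in latent diffusion, showed that it scales, and introduced adaLN-Zero conditioning
SD3 (Stability AI)	2024	Built MM-DiT on top of DiT for text-to-image and switched to Flow Matching
Sora (OpenAI)	2024	DiT architecture for video generation with joint spatio-temporal attention
π0 (Physical Intelligence)	2024	DiT-style architecture for robot VLA, with Flow Matching
Flux (Black Forest Labs)	2024	DiT + Flow Matching, state-of-the-art image generation at release

Summary: DiT = diffusion + Transformer = better scalability + native multimodal support. In VLA, DiT lets vision, language and action tokens interact inside one Transformer sequence, which a U-Net can only approximate through extra fusion modules. π0, SD3 and Sora all point to DiT as the current mainstream choice.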

附录 C：DiT / Diffusion 如何生成多模态轨迹？

C.1 先看清问题：什么是"多模态轨迹"？

在 VLA 场景中，"多模态"不是指多种传感器模态，而是指------给定完全相同的观测和指令，模型可以输出多种不同的合理动作。

举例：桌面抓取任务，"Pick up the cup"。


观测 O (不变的图像 + 指令)
        |
        ├── 轨迹 A: 从上方直抓 → 速度平滑, 耗时 0.5s
        ├── 轨迹 B: 从左侧绕过障碍 → 旋转手腕, 耗时 1.2s
        ├── 轨迹 C: 从右侧抓把手 → 调整手爪方向, 耗时 0.8s
        └── 轨迹 D: 先平移再下抓 → 大范围移动, 耗时 1.5s

这四种轨迹都是合理的 ，只是抓取策略不同。传统回归方法（如 Behavior Cloning + MSE loss）会取所有合理轨迹的平均值 ，结果往往是一条不合理的、模糊的中间轨迹（比如抓到一个不存在的平均位置）。
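The averaging failure can be reproduced with two numbers. In the sketch below each demonstration is reduced to a single scalar (an approach offset of -1.0 or +1.0); these values are made up purely to illustrate why an MSE-trained regressor lands between the modes.

```python
# Two equally good "demonstrations" for the same observation, one scalar each:
# approach offset -1.0 (from the left) or +1.0 (from the right). Values are made up.
demos = [-1.0] * 50 + [1.0] * 50

def mse(pred, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# For a fixed observation, the MSE-optimal constant prediction is the mean of the targets
best_pred = sum(demos) / len(demos)

# Distance from that prediction to the nearest demonstrated mode
gap_to_nearest_mode = min(abs(best_pred - m) for m in set(demos))
```

The optimal prediction is 0.0, which has lower MSE than either real mode even though no demonstration ever went there. A model that samples from the distribution returns -1.0 or +1.0 instead.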

C.2 扩散/流匹配生成多模态轨迹的底层机制

扩散/流匹配能生成多模态轨迹，本质原因只有一个：

扩散模型学习的是整个分布的概率结构，而不是一个输入到输出的单值映射。

Regression (MSE):     O → one averaged action (mode averaging)
Diffusion:            O → the whole action distribution, then sample from it
                        ↙  ↓  ↘
                  Traj A  Traj B  Traj C   (all reasonable actions)

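C.2.1 The map from noise space to action space

This is the most fundamental mechanism. A diffusion model defines a bijection from noise to actions (in the ODE view):

$$a_1 = f_\theta(a_0, O), \quad a_0 \sim \mathcal{N}(0, I)$$

Different $a_0$ map to different $a_1$
Different regions of noise space correspond to different modes in action space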


# 噪声空间的分区结构 (示意图)
#
#          噪声空间 (a_0)                   动作空间 (a_1)
#
#     ┌─────────────────┐              ┌─────────────────┐
#     │  区域 A          │              │                  │
#     │  (右上高斯块)    │  ──→         │  轨迹A (上抓式)  │
#     │                  │              │                  │
#     ├─────────────────┤              ├─────────────────┤
#     │  区域 B          │  ──→         │  轨迹B (左侧绕)  │
#     │  (左下高斯块)    │              │                  │
#     │                  │              │                  │
#     ├─────────────────┤              ├─────────────────┤
#     │  区域 C          │  ──→         │  轨迹C (右侧抓)  │
#     │  (中间高斯块)    │              │                  │
#     └─────────────────┘              └─────────────────┘
#
# 训练中: 同一条指令的多种轨迹, 对应噪声空间中不同区域
# 推理中: 随机采 a_0, 自然落到某个区域的生成结果

C.2.2 逐步去噪过程中的"模式选择"

扩散过程是多步的，模式选择不是一次性决定的，而是逐步明确：


# 去噪过程中的"模式细化" (t=T → t=0)
#
# t=100:  纯噪声 a_T              (完全模糊, 不知道什么动作)
#   ↓
# t=80:   出现了大致轮廓           (好像在靠近一个物体)
#   ↓
# t=60:   轨迹形态初现             (看起来是"从上往下"的形状)
#   ↓
# t=40:   动作细节更明确           (手爪朝向朝下)
#   ↓
# t=20:   几乎确定是"上方抓取"     (模式已选好)
#   ↓
# t=0:    最终动作 a_0             (精确到毫米级的完整轨迹)
#
# 核心: 模式选择是 "渐进式" 的, 而非一步到位

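Why does this prevent mode collapse?

At the early, high-noise levels ($t$ close to $T$) the model constrains the action only loosely: all reasonable trajectories look alike at this "blurry" level. As denoising proceeds, different initial noises naturally converge into the basins of attraction of the different modes in the training data.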


    训练数据中不同模式的吸引域:
    
              轨迹A的吸引域
              ████████
         ██████████████████
    ██████████████████████████
         ██████████████████      ← 轨迹B的吸引域
              ████████
                   ████████████
                ████████████████  ← 轨迹C的吸引域
                     ████████

    每个随机噪声 a_0 落在不同的吸引域 → 生成不同的模式

C.2.3 Flow Matching 的 ODE 轨迹分支

Flow Matching 用的是 ODE（确定性），但初始噪声是随机的，所以多条 ODE 轨迹构成多模态：


# Flow Matching: 不同初始点的 ODE 轨迹自然分叉
#
#    a_0^{(1)} ──→  ODE  ──→  轨迹A (上方抓取)
#    a_0^{(2)} ──→  ODE  ──→  轨迹B (左侧绕行)
#    a_0^{(3)} ──→  ODE  ──→  轨迹C (右侧抓把)
#
# 每个初始点都走一条确定的路径, 但起点不同路径就不同
#
# 可视化:
# 噪声空间:          ODE路径:          动作空间:
#   ● a_0^{(1)}  ──→ ╱╲             ┌─ 轨迹A
#                     ╱  ╲           │
#   ● a_0^{(2)}  ──→ ╱    ╲         ├─ 轨迹B
#                   ╱      ╲        │
#   ● a_0^{(3)}  ──→         ╲      └─ 轨迹C

C.3 DiT 的 Self-Attention 如何帮助多模态

这是 DiT 相比 U-Net 的关键优势。

C.3.1 全局感受野 → 区分整体模式

DiT 的 Self-Attention 让每个动作 patch 都能看到整条轨迹的所有其他 patch。这意味着：


# U-Net: 每个时间步的动作只看到邻域
# 问题: 只能看到局部, 难以区分 "全局上抓" 和 "全局侧抓"
#   U-Net感受野: [a_{t-2}, a_{t-1}, a_t, a_{t+1}, a_{t+2}]
#   → "看起来都是 '手在下降', 无法判断是上抓还是侧抓"

# DiT: 每个 patch 看到整个轨迹
#   DiT注意力: [a_1, a_2, a_3, ..., a_H] (全可见)
#   → "看到整条轨迹是上抓形状 → 朝上抓模式细化"
#   → "看到整条轨迹是侧抓形状 → 朝侧抓模式细化"

直觉：就像看图猜动作------只看 1 帧分不清是上抓还是侧抓，看完整段视频才能确定。DiT 的 Self-Attention = 能看完整段视频 ，U-Net = 只看局部几帧。

C.3.2 交叉注意力（Cross-Attention）的可选性

DiT 可以用 adaLN 注入条件，不需要 Cross-Attention。但如果你想显式控制模式，可以加一个 Cross-Attention 到不同的模式嵌入上：

# Optional "mode selection" Cross-Attention (one possible design)
class ModeAwareDiTBlock(nn.Module):
    def __init__(self, hidden_dim, num_heads, num_modes):
        super().__init__()
        self.mode_cross_attn = nn.MultiheadAttention(
            hidden_dim, num_heads, batch_first=True
        )
        # mode_embeddings: [num_modes, hidden], learnable
        self.mode_embeddings = nn.Parameter(torch.randn(num_modes, hidden_dim) * 0.02)

    def forward(self, x, cond, mode_idx=None):
        if mode_idx is not None:
            # mode_idx: [B] integer tensor -> one mode token per sample
            mode_tokens = self.mode_embeddings[mode_idx].unsqueeze(1)  # [B, 1, hidden]
            x = x + self.mode_cross_attn(x, mode_tokens, mode_tokens)[0]
        # continue with the standard DiT block
        ...

这相当于是 Anchor-based 思想在 DiT 内部的实现------用可学习嵌入向量代表不同的行为模式，通过 Cross-Attention 注入到 DiT 的每一层。

C.4 控制多模态的三种手段

C.4.1 方法 1：随机采样（默认方式）


# 最简单, 零额外成本
a_0 = torch.randn(1, H, d)      # 换一个种子, 换一种模式
action = sample(model, a_0, O)  # 可能是轨迹A/B/C中的一种

特点：

完全靠随机种子决定模式
无法控制选哪种模式
适合探索阶段

C.4.2 方法 2：Classifier-Free Guidance (CFG)

CFG 可以拉大不同模式之间的差距，让采样结果更"极端"、更清晰：

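@torch.no_grad()
def cfg_sample(model, a_t, t, O, null_cond, w=1.0):
    """
    null_cond: the "empty" condition the model saw during training via condition dropout
    w: guidance scale
    w=0.0: unconditional (fully free, most diverse but possibly unreasonable)
    w=1.0: standard conditional generation (default)
    w>1.0: strong conditioning (sharper modes, less diversity)
    """
    eps_uncond = model(a_t, t, null_cond)      # unconditional prediction
    eps_cond   = model(a_t, t, O)              # conditional prediction
    eps_cfg    = eps_uncond + w * (eps_cond - eps_uncond)
    return eps_cfg


# Effect of w on multimodality:
# w=0.5:  high action diversity, but possibly unreasonable
# w=1.0:  balanced
# w=3.0:  very decisive actions, but almost a single mode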

CFG 与多模态的关系：


CFG w 小 (← 更多多样性)
  很多种轨迹, 但有些可能不自然
          ↓
CFG w = 1.0 (标准)
  自然的多模态------每种轨迹都合理
          ↓
CFG w 大 (→ 更确定, 多样性 ↓)
  只产出最高概率的那种轨迹
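The guidance formula is a linear interpolation for w between 0 and 1 and an extrapolation beyond 1. A minimal numeric sketch, with made-up prediction vectors standing in for the two network outputs:

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond), element-wise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Made-up predictions for a 3-dimensional action
eps_uncond = [0.0, 1.0, -2.0]
eps_cond = [1.0, 1.0, 0.0]

at_zero = cfg_combine(eps_uncond, eps_cond, w=0.0)    # equals the unconditional prediction
at_one = cfg_combine(eps_uncond, eps_cond, w=1.0)     # equals the conditional prediction
at_three = cfg_combine(eps_uncond, eps_cond, w=3.0)   # extrapolates past the conditional one
```

At w=3.0 every coordinate where the two predictions disagree is pushed three times as far from the unconditional value, which is what sharpens the dominant mode at the expense of diversity.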

C.4.3 方法 3：显式锚点/模式选择（Anchor-based）

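def anchor_mode_sample(model, anchor_predictor, anchors, sigmas, O, top_k=3):
    # 1. Predict the most likely modes
    scores = anchor_predictor(O)                                   # [1, K]
    top_modes = scores.topk(top_k, dim=-1).indices.squeeze(0).tolist()

    candidates = []
    for mode_id in top_modes:
        anchor = anchors[mode_id:mode_id + 1]                      # [1, H, d]
        a_T = anchor + torch.randn_like(anchor) * sigmas[mode_id]  # initialise near this mode
        # denoise() stands for the DDPM loop used in infer_anchor_diffusion
        action = denoise(model, a_T, O, anchor)
        candidates.append(action)

    return candidates  # one trajectory per selected mode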

C.4.4 三种方法对比

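Method	Control granularity	Diversity	Extra cost	Suitable for
Random sampling	None (pure chance)	Highest	None	Exploration, data collection
CFG	Continuous (parameter w)	Adjustable	Condition dropout during training, plus one extra unconditional forward pass per step	Most scenarios
Anchor-based	Explicit mode selection	Controllable (choose how many anchors)	Offline clustering + anchor predictor	Interpretability, engineering deployment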

C.5 多模态与轨迹质量的关系


# ============================================================
# 评估多模态的量化指标
# ============================================================

# 1. Mode Coverage (模式覆盖率)
#    生成 N 条轨迹, 看它们覆盖了多少训练集中的模式
#    聚类后: 每条轨迹属于哪个锚点
#    好的覆盖: N 条轨迹均匀分布在 K 个锚点中

# 2. Intra-mode Variance (模内方差)
#    同一个模式内的轨迹有多分散
#    太小 → 模式内缺乏灵活性
#    太大 → 模式定义不清晰

# 3. Mode Distance (模式间距)
#    不同模式的轨迹中心之间的距离
#    太小 → 模式混在一起 (mode collapse)
#    太大 → 模式间不连续

# ============================================================
# 可视化: 多模态轨迹分布
# ============================================================

# 用 t-SNE 或 PCA 将轨迹投影到 2D:
#
#      ↑ 轨迹A组 (上抓式)
#      |   ● ● ● ● ●
#      |  ● ● ● ● ●    ● 轨迹B中的离群点
#      |
#      |     ● ● ● ● ● ● ●    ← 轨迹B组 (侧抓式)
#      |    ● ● ● ● ●
#      |
#      |  ● ● ● ●    轨迹C组 (右侧抓)
#      | ● ● ●
#      └────────────────────────→
#
# 好的多模态: K 个清晰的聚类 (如上图)
# 模式崩塌:  所有点挤在一起
# 模式过多:  分不清聚类边界
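The first metric above, mode coverage, can be sketched in a few lines. The version below assigns each generated trajectory to its nearest anchor and reports the fraction of anchors that were hit; trajectories are flattened to short lists, and all numbers are hypothetical.

```python
def nearest_anchor(traj, anchors):
    """Index of the anchor with the smallest squared Euclidean distance to traj."""
    dists = [sum((a - b) ** 2 for a, b in zip(traj, anchor)) for anchor in anchors]
    return dists.index(min(dists))

def mode_coverage(trajs, anchors):
    """Fraction of anchors that receive at least one generated trajectory."""
    hit = {nearest_anchor(t, anchors) for t in trajs}
    return len(hit) / len(anchors)

# Flattened toy trajectories in 2-D (hypothetical numbers)
anchors = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
diverse = [[0.1, -0.2], [4.8, 0.3], [0.2, 5.1], [5.2, -0.1]]
collapsed = [[0.1, 0.0], [-0.2, 0.1], [0.0, 0.3], [0.2, -0.1]]

coverage_diverse = mode_coverage(diverse, anchors)
coverage_collapsed = mode_coverage(collapsed, anchors)
```

A sampler that reaches every anchor scores 1.0, while one that collapses onto a single anchor scores 1/K. Counting trajectories per anchor instead of using a set would additionally show how evenly the modes are covered.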

C.6 一个直观的对比实验

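Suppose the same scene is used to generate 20 trajectories with each of three methods (the counts in the table below are illustrative expectations, not measured results):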


# ============================================================
# 三种方法生成 20 条轨迹的可视化对比
# ============================================================
import matplotlib.pyplot as plt

def compare_multimodality(model, obs, n_samples=20):
    """对比不同方法的多模态生成能力"""

    # --- 方法 1: 标准 Flow Matching (随机采样) ---
    trajectories_1 = []
    for _ in range(n_samples):
        a_0 = torch.randn(1, 50, 7)
        traj = flow_matching_infer(model, a_0, obs)
        trajectories_1.append(traj)

    # --- 方法 2: CFG 强引导 ---
    trajectories_2 = []
    for _ in range(n_samples):
        a_0 = torch.randn(1, 50, 7)
        traj = flow_matching_cfg(model, a_0, obs, w=3.0)
        trajectories_2.append(traj)

    # --- 方法 3: Anchor-based ---
    trajectories_3 = []
    for k in range(K):  # K 个锚点, 各采样几份
        for _ in range(n_samples // K):
            a_T = anchors[k] + torch.randn_like(anchors[k]) * sigmas[k]
            traj = anchor_denoise(model, a_T, obs, anchors[k])
            trajectories_3.append(traj)

    # --- 可视化对比 ---
    # 标准随机采样:  3~5 种模式, 分布均匀  ✅
    # CFG w=3.0:    1~2 种模式, 高度集中  ⚠️ 多样性降低
    # Anchor-based: 正好 K 种模式, 可解释  ✅ 最受控

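Method	Modes among the generated trajectories	Trajectories per mode	Outlier trajectories?
Random sampling	4~6	3~7	Possibly 1~2
CFG (w=3.0)	1~2	10~19	Almost none
Anchor-based (K=8)	8	2 (16 in total, since 20 // 8 = 2)	Very few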

C.7 总结：DiT 多模态的完整链路


训练数据中的多模态轨迹
  (K 种不同的抓取方式)
        |
        ↓
训练时, 扩散/流匹配模型学习"整个动作分布"
  (不是平均, 而是把每种模式的概率结构都记住)
        |
        ↓
DiT 的 Self-Attention:
  - 每个 patch 看到完整轨迹 → 区分整体模式
  - adaLN 高效注入条件 → 不影响多样性
        |
        ↓
推理时:
  1. 随机噪声 a_0 落在不同区域
  2. 去噪/ODE过程, 噪声逐步细化为特定模式
  3. 模式在去噪过程中"渐进式确定"
  4. 最终 → 每种初始噪声对应一种合理动作
        |
        ↓
三种控制手段:
  - 随机采样: 交给随机性 (高多样性)
  - CFG:     调节 w 控制多样性强度
  - Anchors: 显式选择模式 (完全可控)

一句话总结 ：DiT 不直接"选择"模式------它通过 Self-Attention 理解全局轨迹结构 ，再依靠扩散/流匹配过程从随机噪声中自然采样出不同模式 。这不是分类问题，而是分布采样问题。
