a) Explicit policy with different types of action representations
b) Implicit policy learns an energy function conditioned on both action and observation and optimizes for actions that minimize the energy landscape
c) Behavior is generated through a "conditional denoising diffusion process on the robot action space": instead of directly outputting an action, the policy infers the action-score gradient, conditioned on visual observations, and runs K denoising iterations
1.1.2 Advantages of Diffusion Policy: expressing multimodal action distributions, high-dimensional output spaces, stable training
Diffusion Policy has several key properties:
It can express multimodal action distributions
By learning the gradient of the action score function (Song and Ermon, 2019) and performing Stochastic Langevin Dynamics sampling on this gradient field, ++Diffusion Policy can express arbitrary normalizable distributions++ (Neal et al., 2011), which includes multimodal action distributions.
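For reference, the stochastic Langevin dynamics update on a learned score field takes roughly the standard form below (a generic textbook formulation, not an equation quoted from the paper):

$$x^{k+1} = x^{k} + \frac{\epsilon}{2}\,\nabla_{x}\log p\!\left(x^{k}\right) + \sqrt{\epsilon}\,z^{k},\qquad z^{k}\sim\mathcal{N}(0, I)$$

Sampling from such a gradient field is what lets the policy keep several distinct action modes instead of averaging them into one.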
Closed-loop action sequences
The policy's ability to predict high-dimensional action sequences is combined with ++receding-horizon control++ to achieve robust execution (a sketch of this loop follows below).
This design allows the policy to continuously re-plan its actions in a closed-loop manner while maintaining temporal action consistency, striking a balance between long-horizon planning and responsiveness.
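A minimal sketch of this receding-horizon execution loop in Python (names such as `env`, `policy`, `T_o`, and `T_a` are illustrative placeholders, not identifiers from the Diffusion Policy repository; a classic gym-style `env.step` is assumed):

import collections

def run_receding_horizon(env, policy, T_o=2, T_a=8, max_steps=300):
    """Closed-loop execution: predict a sequence, execute only the first T_a steps, then re-plan."""
    obs_history = collections.deque(maxlen=T_o)   # keep the latest T_o observations
    obs = env.reset()
    for _ in range(T_o):
        obs_history.append(obs)

    steps = 0
    while steps < max_steps:
        # predict a full action sequence conditioned on the last T_o observations
        action_seq = policy.predict_action(list(obs_history))
        # execute only the first T_a actions before re-planning (receding horizon)
        for action in action_seq[:T_a]:
            obs, reward, done, info = env.step(action)
            obs_history.append(obs)
            steps += 1
            if done or steps >= max_steps:
                return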
Visual conditioning
A vision-conditioned diffusion policy is introduced, in which the visual observations are treated as conditioning rather than as part of the joint data distribution. In this formulation, the policy extracts the visual representation only once, regardless of the number of denoising iterations, which drastically reduces computation and enables real-time action inference.
Minimizing the loss function in Eq. 3 also minimizes the variational lower bound of the KL-divergence between the data distribution p(x0) and the distribution q(x0) of samples drawn from the DDPM via Eq. 1.
Changing the output x, which previously represented an image, to represent robot actions
Making the denoising process conditioned on the input observation Ot (the resulting modified equations are sketched below)
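With these two changes, the DDPM denoising step and training loss become, roughly (a reconstruction following the forms of the paper's Eq. 1 and Eq. 3; notation may differ slightly from the original):

$$A_t^{k-1} = \alpha\left(A_t^{k} - \gamma\,\varepsilon_\theta\!\left(O_t, A_t^{k}, k\right) + \mathcal{N}\!\left(0, \sigma^2 I\right)\right)$$

$$\mathcal{L} = \mathrm{MSE}\!\left(\varepsilon^{k},\ \varepsilon_\theta\!\left(O_t,\ A_t^{0} + \varepsilon^{k},\ k\right)\right)$$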
To accomplish these two modifications, the following measures are taken.
++I. Closed-loop action-sequence prediction++
An effective action formulation should encourage temporal consistency and smoothness in long-horizon planning while allowing prompt reactions to unexpected observations.
To accomplish this goal, the policy commits to the ++action-sequence prediction++ produced by the diffusion model for a fixed duration before re-planning.
a) General formulation: at time step t, the policy takes the latest To steps of observation data Ot as input and outputs Ta steps of actions At.
b) In the CNN-based Diffusion Policy, FiLM (Feature-wise Linear Modulation) conditioning [Film: Visual reasoning with a general conditioning layer] on the observation features Ot is applied to every convolution layer, channel-wise.
Then, starting from At^K drawn from Gaussian noise, the output of the noise-prediction network εθ is subtracted, repeating K times to obtain At^0, the denoised action sequence (this iterative denoising is the essence of DDPMs; if unfamiliar, see 《图像生成发展起源:从VAE、扩散模型DDPM、DETR到ViT、Swin transformer》). A sketch of this denoising loop follows.
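A sketch of that K-step denoising loop, assuming a noise-prediction network `net` with the interface described above and Hugging Face diffusers' `DDPMScheduler` (which the Diffusion Policy codebase also builds on); shapes and names are illustrative:

import torch
from diffusers import DDPMScheduler

@torch.no_grad()
def sample_action_sequence(net, obs_cond, horizon, action_dim, K=100, device='cpu'):
    """Run K denoising steps: start from Gaussian noise A_t^K and iteratively
    remove the predicted noise to obtain the denoised action sequence A_t^0."""
    scheduler = DDPMScheduler(num_train_timesteps=K, beta_schedule='squaredcos_cap_v2')
    scheduler.set_timesteps(K)

    sample = torch.randn(obs_cond.shape[0], horizon, action_dim, device=device)  # A_t^K ~ N(0, I)
    for k in scheduler.timesteps:                        # k = K-1, ..., 0
        eps = net(sample, k, global_cond=obs_cond)       # predicted noise ε_θ(O_t, A_t^k, k)
        sample = scheduler.step(eps, k, sample).prev_sample   # one reverse-diffusion update
    return sample                                        # A_t^0, the denoised action sequence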
c) In the Transformer-based Diffusion Policy, the embedding of the observation Ot is passed into a multi-head cross-attention layer of each transformer decoder block.
Each action embedding is constrained to attend only to itself and previous action embeddings (causal attention, similar to GPT's decoding strategy) using the attention mask illustrated; a sketch of such a mask follows.
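A minimal sketch of building such a causal mask in PyTorch (illustrative, not the repository's exact code):

import torch

def causal_mask(T: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if token i may NOT attend to token j:
    each token attends only to itself and earlier tokens (GPT-style causal attention)."""
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# usage with nn.MultiheadAttention / nn.TransformerDecoder:
# out, _ = mha(q, k, v, attn_mask=causal_mask(T))   # T = number of action tokens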
++II. Visual observation conditioning++
In short, they use a DDPM to approximate the conditional distribution p(At|Ot) rather than the joint distribution p(At, Ot) used for planning in Janner et al. (2022a).
This formulation allows the model to predict actions conditioned on observations without the cost of inferring future states, which speeds up the diffusion process and improves the accuracy of the generated actions.
First, only the conditional distribution p(At|Ot) is modeled, by conditioning the action generation process on the observation features Ot with Feature-wise Linear Modulation (FiLM) during the denoising iterations (a FiLM sketch follows after this list).
Second, only the action trajectory is predicted, rather than the concatenated observation-action trajectory.
Third, because the framework uses a receding prediction horizon, inpainting-based goal-state conditioning is removed due to incompatibility.
However, goal conditioning is still possible with the same FiLM conditioning method used for observations.
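A minimal FiLM sketch in PyTorch (an illustration of channel-wise feature modulation, not the repository's exact conditional residual block; `cond_dim` and the GroupNorm group count are placeholder choices):

import torch
import torch.nn as nn

class FiLMConv1dBlock(nn.Module):
    """1D conv block whose output channels are modulated by the condition via FiLM."""
    def __init__(self, in_channels, out_channels, cond_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
        self.norm = nn.GroupNorm(8, out_channels)   # out_channels should be divisible by the group count
        self.act = nn.Mish()
        # predict a per-channel scale and bias from the observation features
        self.film = nn.Linear(cond_dim, 2 * out_channels)

    def forward(self, x, cond):
        # x: (B, C_in, T), cond: (B, cond_dim)
        h = self.act(self.norm(self.conv(x)))
        scale, bias = self.film(cond).chunk(2, dim=-1)             # (B, C_out) each
        return h * (1 + scale.unsqueeze(-1)) + bias.unsqueeze(-1)  # channel-wise modulation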
In practice, the CNN-based backbone was found to work well on most tasks without much hyperparameter tuning. However, it performs poorly when the desired action sequence changes quickly and sharply over time (e.g., a velocity-command action space), likely because of the inductive bias of temporal convolutions to prefer low-frequency signals.
Actions with noise At^k are passed in as input tokens for the transformer decoder blocks, with the sinusoidal embedding for the diffusion iteration k prepended as the first token.
The observation Ot is transformed into an observation embedding sequence by a shared MLP, which is then passed into the transformer decoder stack as input features.
The "gradient" εθ(Ot, At^k, k) is predicted by each corresponding output token of the decoder stack.
However, the transformer was found to be more sensitive to hyperparameters. Of course, the difficulty of transformer training [25] is not unique to Diffusion Policy and could potentially be resolved in the future with improved transformer training techniques or increased data scale.
Therefore, in general, it is recommended to start with the CNN-based diffusion policy implementation as the first attempt at a new task.
If performance is low due to task complexity or high-rate action changes, the Time-series Diffusion Transformer formulation can be used to potentially improve performance, at the cost of additional tuning.
1.3.2 Visual encoder: mapping the image sequence into a latent embedding, trained end-to-end with the diffusion policy
The visual encoder maps the raw image sequence into a latent embedding Ot and is trained end-to-end with the diffusion policy.
1) Replace global average pooling with spatial softmax pooling to maintain spatial information [29] (see the sketch after this list).
2) Replace BatchNorm with GroupNorm [57] for stable training. This is important when the normalization layer is used in conjunction with an Exponential Moving Average [17] (commonly used in DDPMs).
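A minimal sketch of spatial softmax pooling (a common "keypoint" formulation: per-channel softmax over spatial positions, then the expected (x, y) coordinate of each channel; illustrative rather than the repository's exact module):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Replaces global average pooling: returns the expected (x, y) location of each feature channel."""
    def forward(self, feat):
        B, C, H, W = feat.shape
        # softmax over all H*W spatial positions, separately for each channel
        attention = F.softmax(feat.reshape(B, C, H * W), dim=-1).reshape(B, C, H, W)
        ys = torch.linspace(-1.0, 1.0, H, device=feat.device).view(1, 1, H, 1)
        xs = torch.linspace(-1.0, 1.0, W, device=feat.device).view(1, 1, 1, W)
        expected_x = (attention * xs).sum(dim=(2, 3))       # (B, C)
        expected_y = (attention * ys).sum(dim=(2, 3))       # (B, C)
        return torch.cat([expected_x, expected_y], dim=1)   # (B, 2C) keypoint features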
However, if each action in the sequence is predicted as an independent multimodal distribution (as done in BC-RNN and BET), consecutive actions could be drawn from different modes, resulting in jittery actions that alternate between two valid trajectories.
Robustness to idle actions: idle behavior occurs when a demonstration is paused, producing sequences of consecutive actions at the same position or with near-zero velocity; this is common in tasks such as teleoperation.
However, single-step policies easily overfit to this pausing behavior. For example, BC-RNN and IBC often get stuck in real-world experiments when the idle actions are not explicitly removed from the training data.
// To be updated
1.4.2 Training stability of diffusion models
An implicit policy represents the action distribution using an Energy-Based Model (EBM), as shown in Equation 6:
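The EBM form referred to here (the paper's Eq. 6) is, up to notation:

$$p_\theta(\mathbf{a}\mid\mathbf{o}) \;=\; \frac{e^{-E_\theta(\mathbf{o},\mathbf{a})}}{Z(\mathbf{o},\theta)}$$

where Z(o, θ) is an intractable normalization constant; having to estimate it (e.g., with negative samples) is what makes implicit-policy training less stable than the diffusion objective, which only regresses the added noise.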
# Imports this excerpt relies on (in the repo they live at the top of the file;
# the source of parse_version is assumed here):
from typing import Callable
import torch
import torch.nn as nn
from packaging.version import parse as parse_version

def replace_submodules(
        root_module: nn.Module,
        predicate: Callable[[nn.Module], bool],
        func: Callable[[nn.Module], nn.Module]) -> nn.Module:
    """
    Replace all submodules selected by the predicate with
    the output of func.
    predicate: Return true if the module is to be replaced.
    func: Return new module to use.
    """
It takes three parameters:
root_module: the root module, of type nn.Module
predicate: a predicate function that takes a module as input and returns a boolean indicating whether that module should be replaced
func: a function that takes a module as input and returns a new module to replace the original with
First, check whether the root module itself satisfies the predicate; if so, directly return its replacement:
if predicate(root_module):
return func(root_module)
Check that the PyTorch version is at least 1.9; if it is too old, raise an ImportError:
if parse_version(torch.__version__) < parse_version('1.9.0'):
raise ImportError('This function requires pytorch >= 1.9.0')
Find all submodules that satisfy the predicate:
bn_list = [k.split('.') for k, m
in root_module.named_modules(remove_duplicate=True)
if predicate(m)]
Replace the matching submodules. Specifically, for each match: find its parent module, apply func to the original submodule to obtain its replacement, and swap it in:
for *parent, k in bn_list:
    parent_module = root_module
    if len(parent) > 0:
        parent_module = root_module.get_submodule('.'.join(parent))
    # fetch the original submodule, then build its replacement with func
    if isinstance(parent_module, nn.Sequential):
        src_module = parent_module[int(k)]
    else:
        src_module = getattr(parent_module, k)
    tgt_module = func(src_module)
    # swap the new module in
    if isinstance(parent_module, nn.Sequential):
        parent_module[int(k)] = tgt_module
    else:
        setattr(parent_module, k, tgt_module)
Finally, verify that all matching modules have been replaced, and return the root module:
# verify that all modules are replaced
bn_list = [k.split('.') for k, m
           in root_module.named_modules(remove_duplicate=True)
           if predicate(m)]
assert len(bn_list) == 0
return root_module   # the root module, with all matching submodules replaced in place
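For context, a hedged usage sketch of this helper: the pattern motivated earlier, swapping every BatchNorm for GroupNorm in a vision backbone (the ResNet-18 backbone and the `num_features // 16` group count are illustrative choices):

import torch.nn as nn
import torchvision

vision_backbone = torchvision.models.resnet18(weights=None)    # example image encoder
vision_backbone = replace_submodules(
    root_module=vision_backbone,
    predicate=lambda m: isinstance(m, nn.BatchNorm2d),          # select every BatchNorm2d
    func=lambda m: nn.GroupNorm(num_groups=m.num_features // 16,
                                num_channels=m.num_features))   # replace it with GroupNorm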
def forward(self, sample, timestep, global_cond=None):   # forward pass (simplified: local_cond omitted)
    '''
    sample      : [ batch_size x horizon x input_dim ]    noisy action sequence A_t^k
    timestep    : diffusion iteration k
    global_cond : [ batch_size x cond_dim ]               observation features O_t
    returns:
    out         : [ batch_size x horizon x input_dim ]
    '''
    sample = einops.rearrange(sample, 'b h t -> b t h')   # to channels-first for the 1D conv stack
    timesteps = timestep                                  # (the repo also broadcasts scalar timesteps here)
    global_feature = self.diffusion_step_encoder(timesteps)   # embed the diffusion step into a global feature
    if global_cond is not None:                           # if observation conditioning is provided
        global_feature = torch.cat([                      # concatenate step embedding with the condition
            global_feature, global_cond
        ], axis=-1)
The down-sampling path:
x = sample                # input tensor, now (B, input_dim, horizon)
h = []                    # intermediate results for the skip connections
for idx, (resnet, resnet2, downsample) in enumerate(self.down_modules):   # iterate over the down-sampling modules
    x = resnet(x, global_feature)     # first conditional residual block
    x = resnet2(x, global_feature)    # second conditional residual block
    h.append(x)                       # store the result for the corresponding skip connection
    x = downsample(x)                 # down-sampling layer
The middle blocks:
for mid_module in self.mid_modules:      # iterate over the middle modules
    x = mid_module(x, global_feature)    # apply each middle block
The up-sampling path:
for idx, (resnet, resnet2, upsample) in enumerate(self.up_modules):   # iterate over the up-sampling modules
    x = torch.cat((x, h.pop()), dim=1)   # concatenate with the corresponding skip connection
    x = resnet(x, global_feature)        # first conditional residual block
    x = resnet2(x, global_feature)       # second conditional residual block
    x = upsample(x)                      # up-sampling layer
The final convolution layer:
x = self.final_conv(x)   # final 1D convolution layer
Finally, return the output tensor:
x = x.moveaxis(-1, -2)   # move the channel dimension back: (B, C, T) -> (B, T, C)
# (B, T, C)
return x                 # return the predicted output
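To complement the forward-pass walkthrough, here is a hedged sketch of a single DDPM training step for such a network, assuming diffusers' `DDPMScheduler` and a `net` with the forward interface above; variable names and shapes are illustrative:

import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

def diffusion_training_step(net, optimizer, actions, obs_cond, scheduler: DDPMScheduler):
    """One training step: add noise to the clean action sequence at a random
    diffusion step k, predict that noise, and regress it with an MSE loss."""
    B = actions.shape[0]
    noise = torch.randn_like(actions)
    k = torch.randint(0, scheduler.config.num_train_timesteps, (B,), device=actions.device)
    noisy_actions = scheduler.add_noise(actions, noise, k)   # forward process: A_t^0 + ε^k
    pred = net(noisy_actions, k, global_cond=obs_cond)       # ε_θ(O_t, A_t^k, k)
    loss = F.mse_loss(pred, noise)                           # Eq. 3 / Eq. 5 style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()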
Part 3 (optional) Diff-Control: improving the diffusion policy used by UMI (with a brief introduction to ControlNet)
3.1 What Diff-Control is and the background behind it
3.1.1 Background
Since the wave of robots from Stanford and elsewhere that emerged in early 2024, imitation learning has become an important way to train robots. Among these methods, diffusion-based policies [4 - Diffusion policy: Visuomotor policy learning via action diffusion; for a detailed explanation of diffusion policy, see Part 3 of 《UMI——斯坦福刷盘机器人:从手持夹持器到动作预测Diffusion Policy(含代码解读)》] stand out for their ability to effectively model multimodal action distributions, which improves performance.
However, in practice, inconsistency in the action representation remains a persistent challenge. This inconsistency can lead to noticeable discrepancies between the robot's trajectory distribution and the underlying environment, limiting the effectiveness of the control policy [5 - Robot learning from human demonstrations with inconsistent contexts].
The main causes of this inconsistency usually stem from:
the rich contextual nature of human demonstrations [6 - What matters in learning from offline human demonstrations for robot manipulation]
the distribution-shift problem [7 - A reduction of imitation learning and structured prediction to no-regret online learning]
and the volatility of highly dynamic environments. In essence, these policies are fundamentally stateless, lacking provisions for incorporating memory and prior knowledge into the controller, potentially leading to inconsistent action generation.
Prior methods such as action chunking [8 - Learning fine-grained bimanual manipulation with low-cost hardware, i.e. ACT; for details see 《ACT的原理解析:斯坦福炒虾机器人Moblie Aloha的动作分块算法ACT》] and predicting closed-loop action sequences [4 - Diffusion Policy] have been proposed to address this problem.
In addition, Hydra [9] and waypoint-based manipulation [10] modify the action representation to ensure consistency.
However, these methods tackle the problem by changing the action representation rather than working directly with the raw actions.
Instead, can temporal consistency be imposed explicitly by incorporating temporal transitions into the diffusion policy? In the field of deep state-space models [11 - Deep state space models for time series forecasting] through [13 - How to train your differentiable filter], effectively learning a state-transition model makes it possible to identify latent dynamic patterns.
The Diff-Control team extends the basic principle of ControlNet ++from image generation to action generation++ and uses it as a state-space model, in which the internal state of the system affects the output of the policy in conjunction with observations (camera input) and human language instructions.
The figure below shows Diff-Control at work on the "open lid" task.
Within each time window (depicted in red), Diff-Control generates action sequences.
When generating subsequent action sequences, it utilizes the previous actions as an additional control input, shown in blue.
This temporal transition is achieved through a Bayesian formulation, effectively bridging the gap between standalone policies and state-space modeling.
For example, in panel (a) on the left of the figure above, "SD Encoder Block A" contains 4 ResNet layers and 2 ViTs, and "×3" indicates that the block is repeated three times. Text prompts are encoded using the CLIP text encoder [66 - Learning transferable visual models from natural language supervision], and diffusion timesteps are encoded with a time encoder using positional encoding.
Finally, the ControlNet structure is applied to each encoder level of the U-net (i.e., ControlNet operates on a copy of each U-net encoder level), as shown in the upper right of panel (b) on the right of the figure above.
This is also confirmed in the FAQ on ControlNet's GitHub page: Q: If the weight of a conv layer is zero, the gradient will also be zero, and the network will not learn anything. Why does "zero convolution" work?
A: This is wrong; see the full derivation in ControlNet/blob/main/docs/faq.md.
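The gist of that answer can be sketched with a one-parameter example (standard calculus, paraphrased rather than quoted):

$$y = w\,x + b,\qquad \frac{\partial y}{\partial w} = x,\quad \frac{\partial y}{\partial b} = 1,\quad \frac{\partial y}{\partial x} = w$$

With w = b = 0 at initialization, the branch contributes nothing and passes no gradient to its input (∂y/∂x = 0), yet its own parameters still receive non-zero gradients whenever x ≠ 0, so the zero convolution moves away from zero after the first optimization steps.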
Then the 1D temporal convolutional network from [15 - Planning with diffusion for flexible behavior synthesis] is adopted to build the U-net backbone.
The policy can execute autonomously and generate actions without relying on any temporal information.
++For the transition model, as shown on the right of the figure below++, the team incorporates ControlNet as the transition module: the Diff-Control policy is implemented using a locked U-net diffusion-policy architecture; it replicates the encoder and middle blocks and incorporates zero convolution layers.
This effectively extends the capability of the policy network to include temporal conditioning.
To achieve this, the previously generated action sequences are used as the prompt input to ControlNet.
By doing so, the base policy π̄_ψ becomes informed about the previous actions a[W_{t−h}]; ControlNet is implemented by creating a trainable replica of the π̄_ψ encoders and then freezing the base policy π̄_ψ.
The trainable replica is connected to the frozen model with zero convolution layers [33] (a minimal sketch of such a layer follows).
Finally, ControlNet takes a[W_{t−h}] as the conditioning vector and reuses the trained base policy π̄_ψ to construct the next action sequence a[W_t].
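A minimal PyTorch sketch of such a zero convolution layer (illustrative; the actual ControlNet/Diff-Control code may wire it differently):

import torch.nn as nn

def zero_conv1d(channels: int) -> nn.Conv1d:
    """1x1 Conv1d whose weights and bias start at zero, so the trainable ControlNet
    branch initially contributes nothing to the frozen base policy."""
    conv = nn.Conv1d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv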
To address this, a more practical approach is to incorporate a fuse layer and increase the embedding size for the visual and language representations, instead of concatenating them directly; this modification improves the policy's overall performance on language-conditioned tasks.
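A hedged sketch of such a fuse layer (the class name `FuseLayer` and the projection sizes are illustrative assumptions, not taken from the Diff-Control code):

import torch
import torch.nn as nn

class FuseLayer(nn.Module):
    """Fuses visual and language embeddings into one conditioning vector,
    projecting each to a larger shared size before mixing them with an MLP."""
    def __init__(self, vis_dim, lang_dim, out_dim):
        super().__init__()
        self.proj_vis = nn.Linear(vis_dim, out_dim)    # enlarge the visual embedding
        self.proj_lang = nn.Linear(lang_dim, out_dim)  # enlarge the language embedding
        self.mlp = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, vis_emb, lang_emb):
        h = torch.cat([self.proj_vis(vis_emb), self.proj_lang(lang_emb)], dim=-1)
        return self.mlp(h)   # fused conditioning vector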