offline meta-RL | Quick-read notes on recent work


Contents

  • [📌 Recent work 1](#📌 Recent work 1)
    • [(UBER) Unsupervised Behavior Extraction via Random Intent Priors [NeurIPS 2023]](#(UBER) Unsupervised Behavior Extraction via Random Intent Priors [NeurIPS 2023])
    • [Entropy Regularized Task Representation Learning for Offline Meta-Reinforcement Learning [AAAI 2025]](#Entropy Regularized Task Representation Learning for Offline Meta-Reinforcement Learning [AAAI 2025])
    • [Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning [ICML 2022]](#Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning [ICML 2022])
    • [Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement [NeurIPS 2024]](#Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement [NeurIPS 2024])
    • [Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning [NeurIPS 2024]](#Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning [NeurIPS 2024])
    • [Skill-based Meta-Reinforcement Learning [ICLR 2022]](#Skill-based Meta-Reinforcement Learning [ICLR 2022])
    • [Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning [ICLR 2025]](#Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning [ICLR 2025])
    • [Generalizable Task Representation Learning for Offline Meta-Reinforcement Learning with Data Limitations [AAAI 2024]](#Generalizable Task Representation Learning for Offline Meta-Reinforcement Learning with Data Limitations [AAAI 2024])
    • [Provably Improved Context-Based Offline Meta-RL with Attention and Contrastive Learning [ICLR 2022]](#Provably Improved Context-Based Offline Meta-RL with Attention and Contrastive Learning [ICLR 2022])
    • [(UDS) How to Leverage Unlabeled Data in Offline Reinforcement Learning [ICML 2022]](#(UDS) How to Leverage Unlabeled Data in Offline Reinforcement Learning [ICML 2022])
  • [📌 Recent work 2](#📌 Recent work 2)
    • [(IDAQ) Offline Meta Reinforcement Learning with In-Distribution Online Adaptation [ICML 2023]](#(IDAQ) Offline Meta Reinforcement Learning with In-Distribution Online Adaptation [ICML 2023])
    • [Context Shift Reduction for Offline Meta-Reinforcement Learning [NeurIPS 2023]](#Context Shift Reduction for Offline Meta-Reinforcement Learning [NeurIPS 2023])
    • [Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision [NeurIPS 2025]](#Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision [NeurIPS 2025])
    • [Efficient Offline Meta-Reinforcement Learning via Robust Task Representations and Adaptive Policy Generation [IJCAI 2024]](#Efficient Offline Meta-Reinforcement Learning via Robust Task Representations and Adaptive Policy Generation [IJCAI 2024])
    • [Meta-Reinforcement Learning via Exploratory Task Clustering [AAAI 2024]](#Meta-Reinforcement Learning via Exploratory Task Clustering [AAAI 2024])
    • [Contextual Transformer for Offline Meta Reinforcement Learning [NeurIPS 2022 workshop]](#Contextual Transformer for Offline Meta Reinforcement Learning [NeurIPS 2022 workshop])
    • [Model-Based Offline Meta-Reinforcement Learning with Regularization](#Model-Based Offline Meta-Reinforcement Learning with Regularization)
    • [Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data [AAAI 2025]](#Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data [AAAI 2025])
    • [Offline Meta-Reinforcement Learning with Online Self-Supervision [ICML 2022]](#Offline Meta-Reinforcement Learning with Online Self-Supervision [ICML 2022])

See also: offline meta-RL | Quick-read notes on classic papers


📌 Recent work 1

(UBER) Unsupervised Behavior Extraction via Random Intent Priors [NeurIPS 2023]

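Main content:

  • setting: we are given a single-task offline dataset without rewards, and want to use it to learn policies that can solve related tasks.
  • method: label the dataset with N random rewards, train one policy per reward to get N policies, then use PEX for offline-to-online.
  • theory (written from memory, so it may be inaccurate):
    • Proposition 4.1 says that, given any policy, one can always construct a reward under which that policy is one of the optimal policies.
    • Theorem 4.2 says that as long as the target behavior is well covered by the dataset, it can be learned efficiently. With an offline dataset of size N (here N is the dataset size, not the number of random rewards), the gap between the best performance learned this way and the optimal policy can be bounded in terms of N. The proof uses the linear MDP and PEVI machinery, which I don't understand.
    • Theorem 4.3 seems to say that the random rewards constructed by UBER can get close enough to the true reward. The proof goes through ridge regression, which I don't understand either.
  • experiments: D4RL and Meta-World. I haven't read them closely yet. The results below are carried over from the reference blog.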
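
Proposition 4.1 can be sanity-checked on a toy problem. The sketch below is an illustration, not the paper's proof: it assumes a small random tabular MDP, a deterministic target policy, and the indicator reward r(s, a) = 1 if a = π(s) else 0. The names `random_mdp`, `indicator_reward`, `q_values` and `value_iteration` are hypothetical helpers written for this example.

```python
import random


def random_mdp(n_states, n_actions, rng):
    """Build random transitions: P[s][a] is a list of (prob, next_state)."""
    P = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            per_action.append([(w / total, s2) for s2, w in enumerate(weights)])
        P.append(per_action)
    return P


def indicator_reward(policy, n_actions):
    """Reward 1 when the action agrees with the target policy, else 0."""
    return [[1.0 if a == policy[s] else 0.0 for a in range(n_actions)]
            for s in range(len(policy))]


def q_values(P, R, V, gamma):
    """One-step lookahead: Q(s, a) = R(s, a) + gamma * E[V(s')]."""
    return [[R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
             for a in range(len(R[s]))] for s in range(len(R))]


def value_iteration(P, R, gamma=0.9, iters=500):
    """Plain tabular value iteration; returns the value and Q tables."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [max(row) for row in q_values(P, R, V, gamma)]
    return V, q_values(P, R, V, gamma)


rng = random.Random(0)
n_states, n_actions = 5, 3
P = random_mdp(n_states, n_actions, rng)
# An arbitrary deterministic policy that we want to make optimal.
target_policy = [rng.randrange(n_actions) for _ in range(n_states)]

R = indicator_reward(target_policy, n_actions)
V, Q = value_iteration(P, R)
greedy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]
print(greedy == target_policy)
```

With γ = 0.9 the optimal value is 1 / (1 − γ) = 10 in every state, and any deviating action loses exactly 1 in Q-value, which is why the greedy policy under the constructed reward matches the target policy.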

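Result 1: random intents do produce diverse, high-quality behaviors. The experiments show that the behavior policies extracted by UBER:

  • Outperform the original data: especially when the original data is of low quality
  • Are more diverse: the entropy of the return distribution is significantly higher than for the original dataset and for behavior cloning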
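
One way a return-distribution entropy like this could be estimated is with a histogram over episode returns. The paper's exact estimator is not recorded in these notes, so the binning scheme, the function name `histogram_entropy`, and the toy return values below are all assumptions for illustration.

```python
import math


def histogram_entropy(returns, n_bins=10, low=None, high=None):
    """Shannon entropy (in nats) of episode returns binned into a histogram."""
    low = min(returns) if low is None else low
    high = max(returns) if high is None else high
    if high <= low:
        return 0.0  # all mass in a single point
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for g in returns:
        # Clamp so that values on or outside the edges fall into the end bins.
        idx = min(max(int((g - low) / width), 0), n_bins - 1)
        counts[idx] += 1
    total = len(returns)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Toy data: one set of policies with identical returns, one with spread-out returns.
narrow_returns = [100.0] * 20
diverse_returns = [10.0 * i for i in range(20)]  # 0, 10, ..., 190

# Shared bin edges make the two numbers comparable.
h_narrow = histogram_entropy(narrow_returns, n_bins=10, low=0.0, high=200.0)
h_diverse = histogram_entropy(diverse_returns, n_bins=10, low=0.0, high=200.0)
print(h_narrow, h_diverse)
```

Returns that all land in one bin give entropy 0, while returns spread evenly over all bins reach the maximum log(n_bins).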

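Result 2: online learning is accelerated significantly. On the MuJoCo locomotion tasks, compared with the baselines, UBER:

  • Learns faster: higher return at the same number of environment steps
  • Reaches better final performance: at or near expert level on most tasks

Result 3: cross-task transfer. In the multi-task Meta-World experiments, the behavior policies learned by UBER transfer successfully to different downstream tasks, which demonstrates cross-task generalization. A possible reason is that random rewards produce general motion primitives (such as "approaching an object" or "precisely controlling the end effector") that transfer across tasks.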
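
The labeling step from the method bullet can be sketched as follows. This assumes each random intent is a small randomly initialized tanh network mapping (state, action) to a scalar reward; the network shape, the names `make_random_reward` and `label_with_random_intents`, and the toy dataset are hypothetical, and the offline RL training and PEX stages are left out.

```python
import math
import random


def make_random_reward(obs_dim, act_dim, hidden=16, seed=0):
    """Return a fixed random reward function r(s, a) (one 'random intent')."""
    rng = random.Random(seed)
    in_dim = obs_dim + act_dim
    W1 = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
          for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0 / math.sqrt(hidden)) for _ in range(hidden)]

    def reward(obs, act):
        x = list(obs) + list(act)
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
        return sum(w * hi for w, hi in zip(w2, h))

    return reward


def label_with_random_intents(dataset, n_intents, obs_dim, act_dim):
    """Turn one reward-free dataset of (s, a, s') into n_intents labeled copies."""
    labeled = []
    for i in range(n_intents):
        reward_fn = make_random_reward(obs_dim, act_dim, seed=i)
        labeled.append([(s, a, reward_fn(s, a), s2) for (s, a, s2) in dataset])
    return labeled


# Toy reward-free dataset of (state, action, next_state) transitions.
rng = random.Random(42)
obs_dim, act_dim = 3, 2
dataset = [
    ([rng.uniform(-1, 1) for _ in range(obs_dim)],
     [rng.uniform(-1, 1) for _ in range(act_dim)],
     [rng.uniform(-1, 1) for _ in range(obs_dim)])
    for _ in range(50)
]

labeled_datasets = label_with_random_intents(dataset, 4, obs_dim, act_dim)
print(len(labeled_datasets), len(labeled_datasets[0]))
```

Each of the N labeled datasets would then be handed to an offline RL algorithm to obtain one policy per intent.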

Entropy Regularized Task Representation Learning for Offline Meta-Reinforcement Learning [AAAI 2025]

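Main content:

  • The task encoder \(e(z|c)\) may become entangled with information about the behavior policy \(\pi_\beta\) (the policies that generated the offline dataset). As a result, when the agent meets OOD transitions at inference time, the encoder cannot infer the correct task.
  • To address this, the idea is to minimize the mutual information between the task encoder \(e(z|c)\) and the behavior policy \(\pi_\beta\). A GAN is used to model the behavior policy \(\pi_\beta\): the generator produces actions realistic enough to pass for real ones, and the discriminator tells real actions from generated ones.
  • Minimizing this mutual information seems to be equivalent to maximizing the entropy \(H(\pi_\beta | p(z_i))\); I haven't looked at the details yet.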
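
The link between mutual information and entropy in the last bullet is consistent with a standard identity. As an assumption about notation (not taken from the paper), let \(S\) and \(A\) be a state and the behavior-policy action drawn from the dataset, and let \(Z\) be the task representation produced by the encoder:

\[
I(Z; A \mid S) = H(A \mid S) - H(A \mid S, Z)
\]

The first term depends only on the fixed dataset and not on the encoder, so minimizing \(I(Z; A \mid S)\) over the encoder is the same as maximizing the conditional entropy \(H(A \mid S, Z)\). Whether the paper conditions on exactly these variables is not confirmed by these notes.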

Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning [ICML 2022]

Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement [NeurIPS 2024]

Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning [NeurIPS 2024]

  • arxiv:
  • It seems to propose a unified framework that summarizes existing offline meta-RL methods.

Skill-based Meta-Reinforcement Learning [ICLR 2022]

Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning [ICLR 2025]

Generalizable Task Representation Learning for Offline Meta-Reinforcement Learning with Data Limitations [AAAI 2024]

  • arxiv:
  • Possibly somewhat relevant; it is one of the newer OMRL works.

Provably Improved Context-Based Offline Meta-RL with Attention and Contrastive Learning [ICLR 2022]

(UDS) How to Leverage Unlabeled Data in Offline Reinforcement Learning [ICML 2022]

It seems to build on CDS and UDS, but I have heard that neither method is easy to reproduce.

📌 Recent work 2

(IDAQ) Offline Meta Reinforcement Learning with In-Distribution Online Adaptation [ICML 2023]

Context Shift Reduction for Offline Meta-Reinforcement Learning [NeurIPS 2023]

  • arxiv:https://arxiv.org/abs/2311.03695
  • The problem it targets seems similar to IDAQ's: both address the distribution mismatch between the offline dataset and the data actually collected at rollout time.

Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision [NeurIPS 2025]

Efficient Offline Meta-Reinforcement Learning via Robust Task Representations and Adaptive Policy Generation [IJCAI 2024]

Meta-Reinforcement Learning via Exploratory Task Clustering [AAAI 2024]

Contextual Transformer for Offline Meta Reinforcement Learning [NeurIPS 2022 workshop]

Model-Based Offline Meta-Reinforcement Learning with Regularization

Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data [AAAI 2025]

  • arxiv:
  • Possibly slightly relevant.

Offline Meta-Reinforcement Learning with Online Self-Supervision [ICML 2022]

Thanks to my junior labmate and the reference blog for the explanations 🍵
