Table of Contents
- [Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning](#Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning)
- [Some offline RL works on labeled data / expert demo + unlabeled data](#Some offline RL works on labeled data / expert demo + unlabeled data)
- [(HILP) Foundation policies with hilbert representations](#(HILP) Foundation policies with hilbert representations)
- [Multi-Task Learning as Multi-Objective Optimization](#Multi-Task Learning as Multi-Objective Optimization)
- [Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences](#Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences)
- [MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration](#MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration)
- [Absolute Zero: Reinforced Self-play Reasoning with Zero Data](#Absolute Zero: Reinforced Self-play Reasoning with Zero Data)
- [CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery](#CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery)
- [auto-curriculum learning (Jiang et al., 2021b)](#auto-curriculum learning (Jiang et al., 2021b))
- [Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL](#Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL)
- [Unsupervised Skill Discovery via Recurrent Skill Training](#Unsupervised Skill Discovery via Recurrent Skill Training)
- [Learning to Discover Skills through Guidance](#Learning to Discover Skills through Guidance)
- [One After Another: Learning Incremental Skills for a Changing World](#One After Another: Learning Incremental Skills for a Changing World)
- [Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching](#Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching)
- [Horizon Generalization in Reinforcement Learning](#Horizon Generalization in Reinforcement Learning)
- [HIQL: Offline Goal-Conditioned RL with Latent States as Actions](#HIQL: Offline Goal-Conditioned RL with Latent States as Actions)
- [Contrastive Preference Learning: Learning from Human Feedback without RL](#Contrastive Preference Learning: Learning from Human Feedback without RL)
- [Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning](#Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning)
- [Rethinking Reward Modeling in Preference-based Large Language Model Alignment](#Rethinking Reward Modeling in Preference-based Large Language Model Alignment)
- [DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback](#DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback)
- [Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset](#Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset)
- [Data Center Cooling System Optimization Using Offline Reinforcement Learning](#Data Center Cooling System Optimization Using Offline Reinforcement Learning)
- [SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking](#SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking)
- [Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment](#Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment)
- [Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning](#Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning)
- [Thinkless: LLM Learns When to Think](#Thinkless: LLM Learns When to Think)
- [Learning to Reason without External Rewards](#Learning to Reason without External Rewards)
Exploiting Unlabeled Data for Feedback Efficient Human Preference based Reinforcement Learning
- arxiv:https://arxiv.org/abs/2302.08738
- Source: stumbled upon it; AAAI 2023.
- Main idea: proposes two unsupervised / self-supervised techniques for PbRL to exploit unlabeled data online. 1. Treat every unlabeled segment as human-preferred, take [R1 R2 ... RH] as a reward vector, and run contrastive learning with a (to me, mysterious) triplet loss; 2. Encourage the distances between state embeddings inside the reward model (I did not look closely at what exactly these embeddings are) to match temporal distance, via an MSE loss (a rough sketch of this second idea follows this entry).
- Haven't read it closely.
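- A minimal sketch of my reading of the second technique (not the paper's code), assuming the reward model contains a state encoder: an MSE loss pushes embedding distances toward temporal distances along the same segment. `RewardEncoder`, `temporal_distance_loss`, and `scale` are hypothetical names/choices of mine.

```python
import torch
import torch.nn as nn

class RewardEncoder(nn.Module):  # hypothetical stand-in for the reward model's state encoder
    def __init__(self, state_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, s):
        return self.net(s)

def temporal_distance_loss(encoder, segment, scale=1.0):
    """segment: (H, state_dim) tensor of consecutive states from one unlabeled segment."""
    H = segment.shape[0]
    z = encoder(segment)                        # (H, emb_dim)
    i, j = torch.randint(0, H, (2, 256))        # random index pairs within the segment
    emb_dist = (z[i] - z[j]).norm(dim=-1)       # distance in embedding space
    temp_dist = scale * (i - j).abs().float()   # temporal distance |i - j|
    return nn.functional.mse_loss(emb_dist, temp_dist)
```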
Some offline RL works on labeled data / expert demo + unlabeled data
- Besides CDS and UDS, there are also:
- The Provable Benefits of Unsupervised Data Sharing for Offline Reinforcement Learning, https://arxiv.org/abs/2302.13493 , ICLR 2023, a senior labmate's work. Looks quite theoretical; haven't read it.
- CLUE: Calibrated Latent Guidance for Offline Reinforcement Learning, https://arxiv.org/abs/2104.07749 , CoRL 2023:
- Calibrated latent guidance: learn a latent representation of state-action pairs with a CVAE, but add a key regularizer that forces all expert-data embeddings to collapse to the origin (mean / variance ≈ 0). Expert behavior is thus tied to a single point in latent space, so any sample's distance to it naturally serves as a task-oriented intrinsic reward: the more expert-like, the higher the reward. No adversarial training, no temporal modeling; distance is the reward. (See the sketch after this list.)
- 🥑 This paper also wants to label rewards with distances in a latent space.
- Looks like there is no theory; purely heuristic.
- Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories, https://arxiv.org/abs/2210.06518 , ICML 2023.
(HILP) Foundation policies with hilbert representations
- arxiv:https://arxiv.org/abs/2402.15567
- website:https://seohong.me/projects/hilp/
- Source: offline METRA(?)
Multi-Task Learning as Multi-Objective Optimization
- arxiv:https://arxiv.org/abs/1810.04650
- Source: a paper mentioned by a collaborator; it solves multi-task learning with a multi-objective approach. NeurIPS 2018.
- (For RL, if the tasks in a multi-task setup share the same transition dynamics and differ only in reward, the setting does feel quite similar to multi-objective optimization. See the sketch below.)
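- The paper's core ingredient, in the two-task case, is the min-norm convex combination of the two task gradients, which has a closed form; a minimal sketch (function names are mine, not the paper's API):

```python
import torch

def min_norm_two_grads(g1: torch.Tensor, g2: torch.Tensor):
    """Return alpha in [0, 1] minimizing || alpha * g1 + (1 - alpha) * g2 ||."""
    diff = g1 - g2
    alpha = (g2.dot(g2) - g1.dot(g2)) / diff.dot(diff).clamp_min(1e-12)
    return alpha.clamp(0.0, 1.0)

# Usage: per-task gradients of the shared parameters -> a common descent direction.
g1 = torch.randn(1000)   # flattened gradient of task-1 loss w.r.t. shared parameters
g2 = torch.randn(1000)   # flattened gradient of task-2 loss
alpha = min_norm_two_grads(g1, g2)
g = alpha * g1 + (1 - alpha) * g2
```

For more than two tasks the paper solves the min-norm problem over the convex hull of all task gradients with a Frank-Wolfe-style solver, but the two-gradient case above already shows the idea.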
Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences
- Source: stumbled upon it while searching. ICRA 2025.
- arxiv:https://arxiv.org/abs/2409.07268
- GitHub:https://github.com/FeiCuiLengMMbb/paper_MTPL
- Curious whether it is multi-type preferences + PbRL.
MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration
- arxiv:https://arxiv.org/abs/2006.08170
- Source: a skill + meta-RL paper a collaborator found interesting; ICML 2021.
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- arxiv:https://arxiv.org/abs/2505.03335
- Source: a NeurIPS 2025 spotlight by Yue Yang, first author of a NeurIPS 2025 best paper. Drawn in by the title; purely curious and want to give it a read.
CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery
- arxiv:https://arxiv.org/abs/2202.00161
- Source: it came to mind and I want to take a look.
auto-curriculum learning (Jiang et al., 2021b)
- Source: RSD. Seems capable of automatic curriculum learning; might be inspiring.
Meta-Motivo (Tirinzoni et al., 2025), zero-shot goal-conditioned RL
- Source: RGSD. Probably involves a skill library, so I want to look at it too; a quick skim should be enough.
Unsupervised Skill Discovery via Recurrent Skill Training
- Source: prior skill-discovery work recommended by a collaborator.
Learning to Discover Skills through Guidance
- Source: same as above.
One After Another: Learning Incremental Skills for a Changing World
- Source: same as above.
Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching
- Source: same as above.
Horizon Generalization in Reinforcement Learning
- arxiv:https://arxiv.org/abs/2501.02709
- website:https://horizon-generalization.github.io/
- Source: a recent arXiv paper by Benjamin Eysenbach; a classmate said it is interesting.
HIQL: Offline Goal-Conditioned RL with Latent States as Actions
- arxiv:https://arxiv.org/abs/2307.11949
- website:https://seohong.me/projects/hiql/
- Source: recommended by a collaborator; it also seems to be by Benjamin Eysenbach.
Contrastive Preference Learning: Learning from Human Feedback without RL
- arxiv:https://arxiv.org/abs/2310.13639
- GitHub:https://github.com/jhejna/cpl
- Source: stumbled upon it while searching; ICLR 2024. I think I have read it before.
- Main idea:
Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning
- arxiv:https://arxiv.org/abs/2502.08985
- Source: a classmate's latest work.
- Main idea:
- This paper targets offline multi-task MARL; in particular, agents are trained only on (say) three-agent cooperation scenarios and then generalize to cooperation among an arbitrary number of agents. The story my classmate told is that a transformer acts as a translator, translating three-agent cooperative actions into many-agent ones, which sounds like a very nice story.
Rethinking Reward Modeling in Preference-based Large Language Model Alignment
- arxiv:https://arxiv.org/abs/2411.04991
- OpenReview:https://openreview.net/forum?id=rfdblE10qm
- Source: ICLR 2025 oral.
- Main idea:
- This paper is about RLHF for LLMs. Reportedly, instead of modeling the reward with a Bradley-Terry model, it directly trains a classifier that learns whether an (x, y) pair is good or bad, and then uses the classifier's probability logit as the RLHF reward (a minimal sketch of this reading follows this entry).
- Does it use unpaired comparisons \((x_1, y_1^+, x_2, y_2^-)\) rather than shuffled paired comparisons \((x, y^+, y^-)\)? (?)
- Are the experiments too toy? (?) What does the theory roughly say? (?)
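- A minimal sketch of the "classifier instead of Bradley-Terry" reading above, under my own assumptions (not the paper's code): `ClassifierRewardModel` stands in for an LLM-backbone reward head trained with binary cross-entropy on good/bad labels, and its raw logit is reused as the RLHF reward.

```python
import torch
import torch.nn as nn

class ClassifierRewardModel(nn.Module):  # hypothetical stand-in for an LLM reward head
    def __init__(self, feat_dim):
        super().__init__()
        # maps features of (x, y) to a scalar logit; sigmoid(logit) = P(good | x, y)
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, xy_feat):
        return self.head(xy_feat).squeeze(-1)

def classification_loss(model, xy_feat, is_good):
    """is_good: float tensor of 0/1 labels; no pairing of (y+, y-) is required."""
    return nn.functional.binary_cross_entropy_with_logits(model(xy_feat), is_good)

def rlhf_reward(model, xy_feat):
    with torch.no_grad():
        return model(xy_feat)  # use the raw logit as the reward signal
```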
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
- arxiv:https://arxiv.org/abs/2410.05527
- OpenReview:https://openreview.net/forum?id=2iYVBqRHK4
- Source: recommended by a collaborator.
- Main idea:
- Preference-based index policy(?)
- Whittle index: one main result, two equivalent conditions, and a proof in the style of the classical problem.
Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset
- Source: a senior labmate's paper.
Data Center Cooling System Optimization Using Offline Reinforcement Learning
- arxiv:https://arxiv.org/pdf/2501.15085
- Source: a new paper from Xianyuan Zhan's group.
- Main idea:
- T-symmetry.
SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
- arxiv:https://arxiv.org/abs/2407.04752
- Source: a mysterious paper recommended by a senior labmate; ICLR 2025 poster.
Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment
- arxiv:https://arxiv.org/abs/2410.23680
- Source: came across it by chance.
Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
- arxiv:https://arxiv.org/abs/2505.21067
- Source: came across it by chance.
Thinkless: LLM Learns When to Think
- arxiv:https://arxiv.org/abs/2505.13379
- Source: came across it by chance.
Learning to Reason without External Rewards
- arxiv:https://arxiv.org/abs/2505.19590
- Source: came across it by chance.