offline 2 online | Cal-QL: calibrating the Q values learned by conservative offline training so that their scale matches the true returns

Paper title: Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning.
NeurIPS 2023, review scores 5 5 6 6, poster; ICLR RRL workshop 2023 spotlight (oddly enough), with two "4: Good paper, strong accept" ratings. Presumably it was submitted to the ICLR workshop first and then to NeurIPS 2023...
pdf: https://arxiv.org/pdf/2303.05479.pdf
html: https://ar5iv.labs.arxiv.org/html/2303.05479
open review: https://openreview.net/forum?id=GcEIvidYSw , https://openreview.net/forum?id=PhCWNmatOX
project website: https://nakamotoo.github.io/Cal-QL/ (explains things very clearly)
video: https://youtu.be/r9CCdLeMJTg
GitHub: https://github.com/nakamotoo/Cal-QL

0 abstract

A compelling use case of offline reinforcement learning (RL) is to obtain an effective policy initialization from existing datasets, which allows efficient fine-tuning with limited amounts of active online interaction in the environment. Many existing offline RL methods tend to exhibit poor fine-tuning performance. On the contrary, while naive online RL methods achieve compelling empirical performance, online methods suffer from a large sample complexity without a good policy initialization from the offline data. Our goal in this paper is to devise an approach for learning an effective offline initialization that also unlocks fast online fine-tuning capabilities. Our approach, calibrated Q-learning (Cal-QL) accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, meaning that the learned value estimation still upper-bounds the ground-truth value of some other reference policy (e.g., the behavior policy). Both theoretically and empirically, we show that imposing these conditions speeds up online fine-tuning, and brings in benefits of the offline data. In practice, Cal-QL can be implemented on top of existing offline RL methods without any extra hyperparameter tuning. Empirically, Cal-QL outperforms state-of-the-art methods on a wide range of fine-tuning tasks from both state and visual observations, across several benchmarks.

For the story, I feel the motivation on the project website reads better than the abstract.

(A sudden thought: is RLHF also a kind of offline 2 online? Just a random association.)

story

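Observations:
- IQL uses an implicit policy constraint, while CQL is based on conservatism. IQL turns out to learn slowly during fine-tuning, and CQL unlearns.
- Policy-constraint methods such as IQL are slow to approach their asymptotic performance (fine-tuning learns slowly). Conservative methods can reach good fine-tuning performance, but they "waste" part of the online samples on an unlearn + relearn phase that corrects the offline Q function.
- The goal is therefore a fine-tuning method that builds on existing conservative offline RL (to get good asymptotic performance) but "calibrates" the Q function to avoid the initial drop in performance.
Why unlearning happens:
- Prior work claims that conservative offline RL provides a good starting point for offline 2 online. In practice an unlearning phenomenon appears: the return curve first drops sharply, then recovers, and only then learns normally.
- The cause is that conservatism makes the scale of the Q values too small, clearly below the ground-truth return. During online fine-tuning, when the agent tries an action that is actually worse, the return it observes can still exceed the overly small offline Q values of the better actions. The worse action is then mistaken for a better one, and the policy initialization is eventually destroyed.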
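
To make the unlearning argument concrete, here is a minimal toy sketch (not from the paper). It assumes a single-state, two-action bandit in which only the explored worse action gets updated online; all numbers, including the behavior value 8.0 used for the calibrated case, are hypothetical.

```python
import numpy as np

def greedy_after_exploring_bad(q_init, true_return, steps=20, lr=0.5):
    """Toy bandit: action 1 (the worse one) is explored online and its
    Q value is moved toward its true return; action 0 is left untouched."""
    q = np.array(q_init, dtype=float)
    for _ in range(steps):
        q[1] += lr * (true_return[1] - q[1])
    return int(np.argmax(q)), q

# Hypothetical ground truth: action 0 is better (10) than action 1 (6).
true_return = np.array([10.0, 6.0])

# Conservative offline Q: correct ranking, but far below the true scale.
conservative_q = [2.0, 1.0]
# Calibrated offline Q: action 0 is kept at or above a hypothetical behavior value of 8.
calibrated_q = [8.0, 1.0]

a_cons, q_cons = greedy_after_exploring_bad(conservative_q, true_return)
a_cal, q_cal = greedy_after_exploring_bad(calibrated_q, true_return)

print("conservative init -> greedy action", a_cons, q_cons)
print("calibrated init   -> greedy action", a_cal, q_cal)
```

With the conservative initialization the greedy action flips to the worse action, because its observed return of 6 overtakes the pessimistic estimate of 2. With the calibrated initialization the better action keeps its lead.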
Calibration:
- We want to stop conservatism from learning Q values that are too small. The learned Q function of $\pi$ is said to be calibrated with respect to a reference policy μ if $Q^{\pi}_{\theta}(s,a)\ge Q^{\mu}(s,a), ~~ \forall(s,a)$, i.e. the learned Q function is at least as large as the Q function of a base policy.
- In Cal-QL, μ is the behavior policy.
A metric:
- The cumulative regret is defined as $Reg(K):=E_{s_0\sim\rho}\sum_{k=1}^K [V^*(s_0)-V^{\pi^k}(s_0)]$, where ρ is the env's initial state distribution. Roughly, it is the gap between the optimal value function and the value function of the policy we learn. Smaller regret is better.
- In the later theory, the regret is split into two terms, $Reg(K)=E_{s_0\sim\rho}\sum_{k=1}^K [V^*(s_0)-\max_a Q^k_\theta(s_0,a)] + E_{s_0\sim\rho}\sum_{k=1}^K [\max_a Q^k_\theta(s_0,a)-V^{\pi^k}(s_0)]$, where the first term is the degree of miscalibration and the second term is the over-estimation.
- (The theory lost me.)
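
The decomposition is just adding and subtracting $\max_a Q^k_\theta(s_0,a)$ inside the sum. A small numerical sketch can confirm that the two terms add up to the regret. It assumes two start states and K = 3 rounds, and every value in the arrays is made up for illustration.

```python
import numpy as np

# Hypothetical setup: 2 possible start states, K = 3 fine-tuning rounds.
rho = np.array([0.5, 0.5])            # initial state distribution
v_star = np.array([10.0, 8.0])        # V*(s0) for each start state
max_q = np.array([[4.0, 3.0],         # max_a Q^k_theta(s0, a), one row per round k
                  [7.0, 6.0],
                  [9.5, 8.5]])
v_pi = np.array([[6.0, 5.0],          # V^{pi^k}(s0), one row per round k
                 [8.0, 6.5],
                 [9.0, 7.5]])

def expected_sum(per_round):
    """Sum over rounds k of the expectation over s0 ~ rho."""
    return float(np.sum(per_round @ rho))

regret = expected_sum(v_star - v_pi)
miscalibration = expected_sum(v_star - max_q)
over_estimation = expected_sum(max_q - v_pi)

print(regret, miscalibration, over_estimation)
```

Note that the individual terms can be negative (here the learned Q is pessimistic, so the over-estimation term is below zero) while their sum is still the regret.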

method

Built on CQL, with a hacked Q-network update.

The objective function to minimize:

\[
\begin{aligned}
J_Q(\theta) := {} & \alpha \left( E_{s\sim D,a\sim \pi} [\max (Q_\theta(s, a), Q^{\mu}(s, a))] - E_{s,a\sim D} [Q_\theta(s, a)] \right) \\
& + \frac12 E_{s,a,s'\sim D} [(Q_\theta(s, a) - B^\pi\bar Q(s, a))^2] .
\end{aligned}
\]

The first line, the α-weighted difference of two expectations of Q, is called the calibrated conservative regularizer R(θ).
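
As a rough sketch of how this objective could be evaluated on a batch, the function below takes already-computed Q values as plain arrays. The function name, argument names, and the default `alpha` are assumptions for illustration and are not taken from the official code; the reference values `q_ref` stand in for $Q^{\mu}(s,a)$.

```python
import numpy as np

def calql_q_loss(q_pi, q_data, q_ref, td_target, alpha=1.0):
    """Batch estimate of J_Q(theta) (hypothetical helper).

    q_pi:      Q_theta(s, a) with s ~ D and a ~ pi
    q_data:    Q_theta(s, a) with (s, a) ~ D
    q_ref:     reference values Q^mu(s, a) for the same (s, a ~ pi) pairs
    td_target: Bellman backup B^pi Qbar(s, a) for the dataset pairs
    """
    # Calibrated conservative regularizer R(theta): the push-down term is
    # clipped from below by the reference values.
    regularizer = np.mean(np.maximum(q_pi, q_ref)) - np.mean(q_data)
    # Standard squared TD error on dataset transitions.
    td_error = 0.5 * np.mean((q_data - td_target) ** 2)
    return float(alpha * regularizer + td_error)

# Made-up batch of two samples.
q_pi = np.array([1.0, 4.0])
q_data = np.array([2.0, 3.0])
q_ref = np.array([3.0, 3.0])
td_target = np.array([2.5, 2.0])

calql = calql_q_loss(q_pi, q_data, q_ref, td_target)
# A reference of -inf makes max(Q_theta, ref) equal to Q_theta,
# which gives the regularizer without the max.
no_max = calql_q_loss(q_pi, q_data, np.full(2, -np.inf), td_target)
print(calql, no_max)
```

The only difference between the two calls is the clipping inside the regularizer; the TD term is shared.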

The original CQL objective looks like this:

\[
\begin{aligned}
J_Q(\theta) := {} & \alpha \left( E_{s\sim D,a\sim \pi} [Q_\theta(s, a)] - E_{s,a\sim D} [Q_\theta(s, a)] \right) \\
& + \frac12 E_{s,a,s'\sim D} [(Q_\theta(s, a) - B^\pi\bar Q(s, a))^2] .
\end{aligned}
\]

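where D is the offline dataset. Interpreting CQL:

If (s,a) is more likely under $\pi_\theta$ than in D, then Q(s,a) has a positive coefficient in J(θ), so minimizing J(θ) pushes the Q function of (s,a) down.
Conversely, if (s,a) is less likely under $\pi_\theta$ than in D, then Q(s,a) has a negative coefficient in J(θ), so we push the Q function of (s,a) up.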
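
The "coefficient" view can be checked in a tabular sketch. For a single state, assume the regularizer is $\sum_a (\pi(a|s) - \beta(a|s))\,Q(s,a)$, where β is the action distribution in D; its gradient with respect to Q(s,a) is then exactly that coefficient. The probabilities, step size, and function name below are made up.

```python
import numpy as np

def cql_regularizer_grad(pi, beta):
    """Gradient of sum_a (pi(a) - beta(a)) * Q(a) with respect to Q(a)."""
    return pi - beta

# Hypothetical action probabilities in one state (3 actions).
pi = np.array([0.7, 0.2, 0.1])     # learned policy
beta = np.array([0.2, 0.3, 0.5])   # empirical action distribution in D

q = np.zeros(3)
alpha, lr = 1.0, 0.1
# One gradient-descent step on the regularizer alone.
q_new = q - lr * alpha * cql_regularizer_grad(pi, beta)
print(q_new)
```

Action 0 is more likely under the policy than in the data, so its Q value goes down; actions 1 and 2 are less likely under the policy, so their Q values go up.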

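Now look at the form of Cal-QL.

If (s,a) is more likely under $\pi_\theta$ than in D, then Q(s,a) has a positive coefficient in J(θ), and we push the Q function of (s,a) down. But once $Q_\theta(s,a)$ is below the behavior policy's Q function $Q^{\mu}(s,a)$, the max returns $Q^{\mu}(s,a)$, which does not depend on θ, so only the $-E_{s,a\sim D}[Q_\theta(s,a)]$ term is left and the regularizer pushes the Q function of (s,a) up instead.
Conversely, if (s,a) is less likely under $\pi_\theta$ than in D, then Q(s,a) has a negative coefficient in J(θ), and we push the Q function of (s,a) up. Likewise, if $Q_\theta(s,a)$ is below the behavior policy's Q function, the regularizer still pushes the Q function up.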
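
The same tabular sketch can be extended with the max to show the switch in behavior. It assumes a single state, a regularizer of the form $\sum_a \pi(a)\max(Q(a), Q^{\mu}(a)) - \sum_a \beta(a)Q(a)$, and made-up reference values; the function name is hypothetical.

```python
import numpy as np

def calql_regularizer_grad(q, q_mu, pi, beta):
    """Gradient with respect to Q(a) of
    sum_a pi(a) * max(Q(a), Q_mu(a)) - sum_a beta(a) * Q(a)."""
    # The push-down term only has gradient where Q is above the reference.
    push_down = pi * (q > q_mu)
    return push_down - beta

pi = np.array([0.7, 0.2, 0.1])       # learned policy (hypothetical)
beta = np.array([0.2, 0.3, 0.5])     # action distribution in D (hypothetical)
q_mu = np.array([5.0, 5.0, 5.0])     # reference values (hypothetical)

q_above = np.array([6.0, 6.0, 6.0])  # learned Q above the reference
q_below = np.array([1.0, 1.0, 1.0])  # learned Q below the reference

grad_above = calql_regularizer_grad(q_above, q_mu, pi, beta)
grad_below = calql_regularizer_grad(q_below, q_mu, pi, beta)
print(grad_above, grad_below)
```

Above the reference the gradient is π − β, so the regularizer behaves like the uncalibrated one. Below the reference the gradient is −β for every action, so a descent step can only raise the Q values of actions that appear in the data and never lowers any of them.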

experiment

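Quite a lot of experiments seem to have been run, but I did not read them closely; I will come back to them if I ever need to.

Reportedly, the narrower the data distribution (such data requires correcting the scale of the Q values at the start of online fine-tuning), the larger the improvement over CQL.

https://zhuanlan.zhihu.com/p/614001660 (this author seems really good)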