offline RL | IQL：通过 sarsa 式 Q 更新避免 unseen actions

题目：Offline Reinforcement Learning with Implicit Q-Learning，Sergey Levine 组，2022 ICLR，5 6 8。
pdf 版本：https://arxiv.org/pdf/2110.06169.pdf
html 版本：https://ar5iv.labs.arxiv.org/html/2110.06169
open review：https://openreview.net/forum?id=68n2s9ZJWF8
github：
- https://github.com/ikostrikov/implicit_q_learning
- https://github.com/rail-berkeley/rlkit/tree/master/examples/iql
两篇相关博客：
- https://zhuanlan.zhihu.com/p/497358947
- https://blog.csdn.net/wxc971231/article/details/128803648

0 abstract

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This tradeoff is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values.

We propose a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function, without any explicit policy. Then, we extract the policy via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions.

We dub our method Implicit Q-learning (IQL). IQL is easy to implement, computationally efficient, and only requires fitting an additional critic with an asymmetric L2 loss.

IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.

background：
- offline RL 需要调和两个相互冲突的目标：① 学习一种比 behavior policy 改进的策略，② 尽量减少与 behavior policy 的偏差，以避免由于 distribution shift 而导致的错误。
- 这种 trade-off 至关重要，因为当前的大多数 offline RL 方法，都需要在训练期间查询 unseen actions 的 value 来改进策略，因此需要将这些 action 限制为 in distribution，或者将 value 正则化。
method：
- 我们提出了一种新的 offline RL 方法 Implicit Q-learning（IQL），完全不需要评估数据集外的 action，但仍然使学习到的策略，能够通过泛化（generalization），大大改善数据中的最佳行为。
- main insight：我们不去评估 latest policy 中的 unseen actions，而是去 implicitly 近似 policy improvement step，通过将 state value function（V function？）视为随机变量，其中 randomness 由 action 决定（同时仍对动态（dynamics）进行积分（integrating）以避免过度乐观），然后取 state value function 的 conditioned on state 的上限期望，来估计该状态下最佳行动的值。
- IQL 利用了 function approximator 的泛化能力，来估计给定状态下最佳 available action 的 value，而非直接使用 unseen action 来 query Q function。
- IQL 交替进行 ① 拟合这个期望上限（upper expectile）的 value function，② 将其备份到 Q function。IQL 没有任何显式的 policy，通过 advantage-weighted behavioral cloning 来提取策略，这也避免了查询 out-of-sample actions。
- IQL 易于实现，计算效率高，并且只需要拟合一个具有非对称 L2 loss 的额外的 critic。
results：
- IQL 在 D4RL 上取得了最先进的性能，D4RL 是 offline RL 的 standard benchmark。
- offline 2 online：我们还（通过实验）证明，IQL 在离线初始化后使用在线交互，实现了强大的性能 fine-tuning。

open review

contribution：
- IQL 基于期望回归（expectile regression）的 novel idea，通过专注于 in-sample actions，避免查询 unseen actions 的 values，具有稳定的性能。The paper is well written。
- 在 policy improvement 阶段，in-sample policy evaluation + advantage-weighted regression。利用类似 sarsa 的 TD 来更新 Q-function。（sarsa：收集 transition (s, a, r, s', a')，更新 Q(s, a) = r(s, a) + γQ(s', a') 。）
- 研究了如何在 Q 更新期间，避免使用 out-of-sample 或 unseen 的 actions 进行更新。
- 正如 BCQ（好像是一篇 offline 工作）所说，无约束的策略提取方案，在 offline RL 中失败；因此，我们选择受约束的策略提取方案，advantage-weighted regression（AWR）。
- expectile regression + training Q-function + 使用 advantage-weighted behavioral cloning 来提取 policy 。
实验：
- 我们的方法比最近的方法（例如 TD3+BC，是 NeurIPS Spotlight，以及 Onestep RL，也来自 NeurIPS）有很大的改进：在 locomotion 上有改进，在 AntMaze 上获得了 3 倍的改进。在先前方法中，性能上与我们最接近的是 CQL。然而，在最困难的任务 AntMaze 中，我们比 CQL 提高了 25%。
- 请注意，TD3+BC 在性能方面并没有改进（如表 1 所示），它只是简单。我们的工作既引入了一种更简单的方法，并且取得了更好的性能。此外，我们的方法在运行时（快 4 倍）和 fine-tune 实验（改进 2 倍，表 2）方面，比 CQL 有非常大的改进。
优点：
- novel idea：对 ID actions 使用 expectile regression，来学习基于 ID 的 high-performance actions 的 value function。
  - 之前有很多关于在 RL 上使用分位数回归（quantile regression）的研究，但大多数研究都学习一个价值函数分布，其中随机性来自环境，而分位数回归（quantile regression）通常为了提高最坏情况的鲁棒性。
  - 然而，本文对值函数使用期望回归（expectile regression），其中随机性来自动作，并表明它推广了贝尔曼期望方程（Bellman expectation equation）和贝尔曼最优方程（Bellman optimality equation）。这是第一个提出的工作。
  - 同时使用 V 和 Q 函数来实现这种学习的技巧，似乎也非常聪明。
- 理论也好，实验结果也好。类似的魔改，或许可以应用在 online RL 上。
疑惑：
- 为什么选择 expectile regression 进行 value function 更新，而非均值回归（mean regression）？（然后发现这个实验其实做过了，reviewer 看漏了）
- IQL 在 gym locomotion 任务的结果，与 CQL 的结果相当或更差，这有点违反直觉，因为 IQL 比 CQL 更类似于 behavior cloning，但在数据质量更好的任务上表现不佳。
- 既然学到的最优 Q 函数非常好且准确，为什么不直接根据学到的 Q 函数，去优化一个参数化策略（确定性或高斯策略），例如最大化 E_(s,a) [Q(s,a)] ？为什么必须使用行为克隆？事实上，BC 很难超过数据集中的最佳策略。回答：IQL 学习的 Q(s,a) 未针对 OOD actions 进行定义。因此，Q(s,a) 的无约束最大化，可能会导致选择 value 被错误高估的 action。
- 从证明来看，只有当期望值 tau 达到极限 1 时，学习到的 value function 才能在数据下是最优的。从代码中，我观察到 MuJoCo tau=0.7、Adroit tau=0.7、Ant-maze tau=0.9，都没有达到 limit 1，因此认为理论与实现之间存在差距。

建议直接看这些博客...

https://zhuanlan.zhihu.com/p/497358947 ，感觉写的已经很好了。

https://blog.csdn.net/wxc971231/article/details/128803648 ，可以同时参考这一篇。

博客 1.1 1.2 对应 section 2 的 related work。
1.3 对应 section 3 preliminaries + section 4 的介绍（4.1 前）。
1.4 讲解了 expectile regression（期望回归），这是一个非对称的 L2 损失，用来 minimize \(L(θ)=\mathbb E_{(s,a,r,s',a')\sim D}\big[L_2^\tau\big(r+γQ_{\hat θ}(s',a')-Q_θ(s,a)\big)\big]\) 。
- expectile regression ： loss = \(\sum\max[\tau(y_i-\hat y_i),(\tau-1)(y_i-\hat y_i)]\) 。
- 所以，这个 loss 其实也可以用 MSE 或 quantile regression（也是一种 loss function 嘛？），但作者声称 expectile regression 最好用。
- 貌似后面有基于 expectile regression 的证明（？）
- 当 τ = 0.5 时 loss 退化为 MSE；τ 越接近 1，模型就越倾向拟合那些 TD error 更大的 transition，从而使 Q 估计靠近数据集上的上界；当 τ → 1 时，可认为得到了 Q* 。
2.1 对应原文 section 4.2。
- 直接使用上面说的 expectile loss，问题在于引入了环境随机性 \(s'\sim p(·|s,a)\) ：一个大的 TD target 可能只是来自碰巧采样到的 "好状态"，即使这个概率很小，也会被 expectile regression 找出来，导致 Q value 高估。
- 为此，IQL 又学习了一个独立的 state value function，它的 loss function 是 \(L_V(\psi)=E_{(s,a)\sim D}[L_2^\tau(Q_{\hat \theta}(s,a) -V_\psi(s))]\) 。
  - θ hat 在前面 θ 的 loss function 也出现过，大概是为了防止 Q 偏移而设置的 Q target。
  - 总之就是学习一个 value function，其中 action 对于特定 state 的分布是数据集 D 中给定的。
- 然后，使用这个 state value function 来更新 Q function，来避免因为随机"好状态"而错判一个 action 为好 action。
  - \(L_Q(\theta)=E_{(s,a,s')\sim D}\big[r(s,a)+γV_\psi(s')-Q_θ(s,a)\big]^2\) 。
  - 这个 loss function 是 MSE，好像不是 expectile loss 的形式。
  - （虽然不太明白为什么能规避随机"好状态" 的问题）
- 使用 Clipped Double Q-Learning 来缓解 Q value 的高估。
2.2 2.3 对应原文 section 4.3。
- Advantage-Weighted Regression，通过 dataset + value function 得到 policy。
- AWR 的 loss： \(L_\pi(\phi)=E_{(s,a)\sim D}[\exp[β(Q_{\hat θ}(s,a)-V_\psi(s))]\log \pi_\phi(a|s)]\) 。
- 若 β = 0，完全变成 behavior cloning；若 β 变大，则变成加权 behavior cloning，权重是 exp advantage。
- IQL 算法流程：先更新 \(L_V(\psi)\) （使用了 expectile regression 的 loss），再更新 \(L_Q(θ)\) （MSE loss 的形式），然后 \(\hat θ\leftarrow(1-α)\hat θ+αθ\) ，这样迭代得到 value function Q 和 V，最后使用 AWR 提取 policy。
- 搬运：
- \(L_Q(\theta)=E_{(s,a,s')\sim D}\big[r(s,a)+γV_\psi(s')-Q_θ(s,a)\big]^2\) 。
- \(L_V(\psi)=E_{(s,a)\sim D}[L_2^\tau(Q_{\hat \theta}(s,a) -V_\psi(s))]\) 。
- 规避随机"好状态" ：大概因为，若 s' 特别好，则 Q 和 V 都会很好，policy 的加权是 advantage = Q - V，不会受影响。
section 4.4 貌似是数学证明。
- 我们证明了在某些假设下，我们的方法确实逼近了最优的 state-action value Q* 。
- 可以证明，lim τ→1 时（τ 大概是 expectile regression 的参数）， \(V_τ(s) → \max_{\pi_\beta(a|s)>0}Q^*(s,a)\) ，能达到 dataset 里面出现的最好 action 的 Q value。
section 5 experiment：
- 次优轨迹拼接能力更强：超过 DT、one-step 方法。
- 不使用 unseen action，缓解 distribution shift：超过 TD3+BC 和 CQL 等约束 policy 的方法。
- IQL 在训练时间上也有优势。
- online fine-tune：听说 AWAC 是专门为 online fine-tune 而提出的。先 offline，再 online 10w 步，IQL 性能最好，CQL 次之，AWAC 因为 offline 性能初始化不好而最差。