0 Paper information

1 A step-by-step walkthrough of SAC

Standard reinforcement learning can be viewed simply as maximizing the expected sum of rewards, $\sum_t \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\left[r(\mathbf{s}_t, \mathbf{a}_t)\right]$. As noted in the earlier post 《DBC 论文阅读补充》, the authors add an entropy term to this objective so that reinforcement learning performs better (using the method of Lagrange multipliers):
$$
J(\pi)=\sum_{t=0}^{T} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi}\left[r(\mathbf{s}_t, \mathbf{a}_t)+\alpha \mathcal{H}\left(\pi(\cdot \mid \mathbf{s}_t)\right)\right] \tag{1}
$$

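Here $\alpha$ is the Lagrange parameter, also called the temperature parameter. In practice a suitable value is usually chosen first, and the objective is then optimized as a whole. The next step is to see how the Soft Actor-Critic (SAC) algorithm is derived from this objective.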
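To make the role of $\alpha$ in equation (1) concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: the state sequence is fixed, so the expectation over $\rho_\pi$ reduces to an expectation over actions, and the toy rewards and the names `entropy` and `soft_objective` are my own illustrations, not anything from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum_a p(a) log p(a), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_objective(rewards, policies, alpha):
    """Toy version of equation (1) for a fixed sequence of states.

    rewards[t][a]  : reward of action a at step t (hypothetical data)
    policies[t][a] : probability that the policy picks action a at step t
    """
    total = 0.0
    for r_t, pi_t in zip(rewards, policies):
        expected_reward = sum(p * r for p, r in zip(pi_t, r_t))
        total += expected_reward + alpha * entropy(pi_t)
    return total

# Two steps, two actions; action 0 always pays 1, action 1 pays 0.
rewards = [[1.0, 0.0], [1.0, 0.0]]
greedy = [[1.0, 0.0], [1.0, 0.0]]    # deterministic policy, zero entropy
uniform = [[0.5, 0.5], [0.5, 0.5]]   # maximum-entropy policy

j_greedy = soft_objective(rewards, greedy, alpha=1.0)    # 2.0
j_uniform = soft_objective(rewards, uniform, alpha=1.0)  # 1 + 2 ln 2, about 2.386
```

With $\alpha = 0$ the greedy policy wins (2.0 versus 1.0). With $\alpha = 1$ the entropy bonus is large enough that the uniform policy scores higher, which shows how the temperature trades reward against randomness.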

Review of the Bellman equation. In the earlier post 《从马尔可夫决策到 DQN 算法族(上)》 we introduced the Bellman equation, which is stated as follows:

In a Markov reward process, the expected return of a state (the expectation of the cumulative future reward starting from that state) is called the value of the state. The values of all states together form the value function, which takes a state as input and returns its value. We write the value function as $V(s)=\mathbb{E}\left[G_t \mid S_t=s\right]$, which expands to:
$$
\begin{aligned}
V(s) & =\mathbb{E}\left[G_t \mid S_t=s\right] \\
& =\mathbb{E}\left[R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\ldots \mid S_t=s\right] \\
& =\mathbb{E}\left[R_t+\gamma\left(R_{t+1}+\gamma R_{t+2}+\ldots\right) \mid S_t=s\right] \\
& =\mathbb{E}\left[R_t+\gamma G_{t+1} \mid S_t=s\right] \\
& =\mathbb{E}\left[R_t+\gamma V\left(S_{t+1}\right) \mid S_t=s\right]
\end{aligned}
$$

In the last line above, the expected immediate reward is exactly the output of the reward function, $\mathbb{E}\left[R_t \mid S_t=s\right]=r(s)$, and the remaining term $\mathbb{E}\left[\gamma V\left(S_{t+1}\right) \mid S_t=s\right]$ follows from the transition probabilities out of state $s$. This gives
$$
V(s)=r(s)+\gamma \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s\right) V\left(s^{\prime}\right) \tag{2}
$$

Equation (2) is called the Bellman equation.
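As a quick sanity check on equation (2), the sketch below applies it repeatedly to a two-state Markov reward process until the values stop changing. The rewards, transition matrix, and the name `evaluate_mrp` are illustrative assumptions chosen so that the answer can also be worked out by hand.

```python
def evaluate_mrp(r, P, gamma, iters=1000):
    """Iterate V(s) <- r(s) + gamma * sum_s' p(s'|s) V(s') from V = 0."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n))
             for s in range(n)]
    return V

# State 0 pays reward 1 and stays put with probability 0.5;
# state 1 is absorbing and pays nothing.
r = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]
gamma = 0.9

V = evaluate_mrp(r, P, gamma)
```

By hand, $V(1)=0$ and $V(0)=1+0.9 \cdot 0.5 \cdot V(0)$, so $V(0)=1/0.55 \approx 1.818$, which is what the iteration converges to.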

In the policy evaluation step of soft policy iteration, the authors want to compute the value of a policy $\pi$ according to the maximum entropy objective in equation (1). For a fixed policy, the soft Q-value can be computed iteratively, starting from any function $Q: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ and repeatedly applying a modified Bellman backup operator $\mathcal{T}^{\pi}$:
$$
\mathcal{T}^\pi Q\left(\mathbf{s}_t, \mathbf{a}_t\right) \triangleq r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V\left(\mathbf{s}_{t+1}\right)\right] \tag{3}
$$

where
$$
V\left(\mathbf{s}_t\right)=\mathbb{E}_{\mathbf{a}_t \sim \pi}\left[Q\left(\mathbf{s}_t, \mathbf{a}_t\right)-\log \pi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)\right] \tag{4}
$$

Here $V\left(\mathbf{s}_t\right)$ is the soft state value function (as in the paper, the temperature $\alpha$ is absorbed into the reward scale, so it no longer appears explicitly). Repeatedly applying $\mathcal{T}^{\pi}$ yields the soft value function of any policy $\pi$, which gives the following lemma:

Lemma 1 (Soft policy evaluation). Consider the soft Bellman backup operator $\mathcal{T}^{\pi}$ in equation (3) and a mapping $Q^0: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ with $|\mathcal{A}|<\infty$, and define $Q^{k+1}=\mathcal{T}^\pi Q^k$. Then the sequence $Q^k$ converges to the soft Q-value of $\pi$ as $k \rightarrow \infty$.
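Lemma 1 can be tried out directly in the tabular case. The following sketch iterates equations (3) and (4) for a fixed policy. The one-state MDP and the helper names `soft_value` and `soft_policy_evaluation` are assumptions made for illustration, picked so that the fixed point can be checked by hand.

```python
import math

def soft_value(Q_s, pi_s):
    """Equation (4): V(s) = E_{a ~ pi}[Q(s, a) - log pi(a|s)]."""
    return sum(p * (q - math.log(p)) for p, q in zip(pi_s, Q_s) if p > 0.0)

def soft_policy_evaluation(r, P, pi, gamma, iters=500):
    """Iterate Q <- T^pi Q (equation (3)) starting from Q = 0.

    r[s][a]     : reward
    P[s][a][s2] : transition probability p(s2 | s, a)
    pi[s][a]    : the fixed policy being evaluated
    """
    n_states, n_actions = len(r), len(r[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = [soft_value(Q[s], pi[s]) for s in range(n_states)]
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q

# One state, two actions, both leading back to the same state.
r = [[1.0, 0.0]]
P = [[[1.0], [1.0]]]
pi = [[0.5, 0.5]]

Q = soft_policy_evaluation(r, P, pi, gamma=0.5)
# Fixed point by hand: V = 1 + 2 ln 2, so Q[0][0] = 1.5 + ln 2 and Q[0][1] = 0.5 + ln 2.
```

Because $\gamma<1$ the backup is a contraction, so the starting point does not matter; starting from zero is only a convenience.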

In the policy improvement step, the policy is updated towards the exponential of the new $Q$ function. This particular choice of update is guaranteed to produce an improved policy in terms of its soft value. Since tractable policies are preferred in practice, the policy is additionally restricted to some set of policies $\Pi$, which can correspond, for example, to a parameterized family of distributions such as Gaussians. To account for the constraint $\pi\in\Pi$, the improved policy is projected onto the desired set of policies. In principle any projection could be used, but the information projection defined by the KL divergence turns out to be convenient. In other words, in the policy improvement step, the policy is updated for each state according to:
$$
\pi_{\text{new}}=\arg \min_{\pi^{\prime} \in \Pi} \mathrm{D}_{\mathrm{KL}}\left(\pi^{\prime}\left(\cdot \mid \mathbf{s}_t\right) \,\middle\|\, \frac{\exp \left(Q^{\pi_{\text{old}}}\left(\mathbf{s}_t, \cdot\right)\right)}{Z^{\pi_{\text{old}}}\left(\mathbf{s}_t\right)}\right) \tag{5}
$$

In terms of the KL divergence, the goal is for the distribution $\pi^{\prime}\left(\cdot \mid \mathbf{s}_t\right)$ to be as close as possible to the distribution $\frac{\exp \left(Q^{\pi_{\text{old}}}\left(\mathbf{s}_t, \cdot\right)\right)}{Z^{\pi_{\text{old}}}\left(\mathbf{s}_t\right)}$. The partition function $Z^{\pi_{\text{old}}}\left(\mathbf{s}_t\right)$ normalizes the distribution. Although it is intractable in general, it does not contribute to the gradient with respect to the new policy and can therefore be ignored. For this projection, one can show that the new, projected policy has a higher value than the old policy with respect to the objective in equation (1). Lemma 2 formalizes this result:

Lemma 2 (Soft policy improvement). Let $\pi_{\text{old}} \in \Pi$ and let $\pi_{\text{new}}$ be the optimizer of the minimization problem defined in equation (5). Then $Q^{\pi_{\text{new}}}\left(\mathbf{s}_t, \mathbf{a}_t\right) \geq Q^{\pi_{\text{old}}}\left(\mathbf{s}_t, \mathbf{a}_t\right)$ for all $\left(\mathbf{s}_t, \mathbf{a}_t\right) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}|<\infty$.
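For a single state with a finite action set, the projection in equation (5) can be illustrated with a few lines of Python. This is a sketch under the assumption that $\Pi$ is just a small hand-picked set of candidate distributions that happens to contain the softmax of the Q-values; the numbers and candidate names are made up for illustration.

```python
import math

def softmax(q):
    """exp(Q(s, .)) / Z(s), computed stably by subtracting the maximum."""
    m = max(q)
    exps = [math.exp(x - m) for x in q]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

Q_old_s = [1.0, 0.0, -1.0]   # hypothetical Q^{pi_old}(s, .) for three actions
target = softmax(Q_old_s)    # the distribution we project onto Pi

candidates = {
    "uniform": [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0],
    "near-greedy": [0.9, 0.05, 0.05],
    "softmax(Q)": target,
}
best = min(candidates, key=lambda name: kl(candidates[name], target))
```

Here `best` is `"softmax(Q)"`, since its KL divergence to the target is zero. When $\Pi$ is a restricted family such as Gaussians, the minimizer is only the closest member of the family rather than the target itself.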

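The full soft policy iteration algorithm alternates between the soft policy evaluation and soft policy improvement steps, and it provably converges to the optimal maximum entropy policy among the policies in $\Pi$ (Theorem 1). Although the algorithm provably finds the optimal solution, it can be carried out in its exact form only in the tabular case. The authors therefore go on to approximate the algorithm for continuous domains, where a function approximator is needed to represent the $Q$ values and where running the two steps until convergence would be too computationally expensive. The approximation gives rise to a new practical algorithm called Soft Actor-Critic (SAC).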

Theorem 1 (Soft policy iteration). Repeated application of soft policy evaluation and soft policy improvement from any $\pi\in\Pi$ converges to a policy $\pi^*$ such that $Q^{\pi^*}\left(\mathbf{s}_t, \mathbf{a}_t\right) \geq Q^{\pi}\left(\mathbf{s}_t, \mathbf{a}_t\right)$ for all $\pi\in\Pi$ and $\left(\mathbf{s}_t, \mathbf{a}_t\right) \in \mathcal{S} \times \mathcal{A}$, assuming $|\mathcal{A}|<\infty$.
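The tabular form of soft policy iteration is small enough to run end to end. The sketch below alternates evaluation (equations (3) and (4)) with improvement (equation (5)), assuming $\Pi$ is the set of all distributions so that the improvement step is simply a softmax over the current Q-values. The two-state MDP and the function names are hypothetical and serve only to illustrate the loop.

```python
import math

def softmax(q):
    m = max(q)
    exps = [math.exp(x - m) for x in q]
    z = sum(exps)
    return [e / z for e in exps]

def evaluate(r, P, pi, gamma, iters=300):
    """Soft policy evaluation: iterate Q <- T^pi Q from Q = 0."""
    n_s, n_a = len(r), len(r[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [sum(p * (q - math.log(p)) for p, q in zip(pi[s], Q[s]) if p > 0.0)
             for s in range(n_s)]
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)]
             for s in range(n_s)]
    return Q

def soft_policy_iteration(r, P, gamma, rounds=30):
    """Alternate evaluation and softmax improvement, starting from uniform."""
    n_s, n_a = len(r), len(r[0])
    pi = [[1.0 / n_a] * n_a for _ in range(n_s)]
    history = []
    for _ in range(rounds):
        Q = evaluate(r, P, pi, gamma)
        history.append(Q)
        pi = [softmax(Q[s]) for s in range(n_s)]   # improvement step
    return pi, history

# Action 0 stays in the current state, action 1 switches to the other state.
r = [[1.0, 0.0], [0.0, 2.0]]
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]

pi_final, history = soft_policy_iteration(r, P, gamma=0.9)
```

Comparing consecutive entries of `history` shows the monotone improvement promised by Lemma 2: no Q-value decreases from one round to the next, up to numerical error.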

As discussed above, large continuous domains require a practical approximation to soft policy iteration (most real problems are continuous). To that end, the authors use function approximators for both the $Q$ function and the policy, and instead of running evaluation and improvement to convergence, they alternate between optimizing the two networks with stochastic gradient descent. They consider a parameterized state value function $V_\psi\left(\mathbf{s}_t\right)$, a soft Q-function $Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)$, and a tractable policy $\pi_\phi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)$. The parameters of these networks are $\psi$, $\theta$, and $\phi$. For example, the value functions can be modeled as expressive neural networks, and the policy as a Gaussian whose mean and covariance are given by neural networks. The update rules for these parameter vectors are derived next.

The state value function approximates the soft value. In principle there is no need for a separate function approximator for the state value, since it is related to the $Q$ function and the policy through equation (4). This quantity can be estimated from a single action sample from the current policy without introducing bias, but in practice a separate function approximator for the soft value can stabilize training and is convenient to train simultaneously with the other networks. The soft value function is trained to minimize the squared residual error
$$
J_V(\psi)=\mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}}\left[\frac{1}{2}\left(V_\psi\left(\mathbf{s}_t\right)-\mathbb{E}_{\mathbf{a}_t \sim \pi_\phi}\left[Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)-\log \pi_\phi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)\right]\right)^2\right] \tag{6}
$$

where $\mathcal{D}$ is the distribution of previously sampled states and actions, or a replay buffer. The gradient of equation (6) can be estimated with the unbiased estimator
$$
\hat{\nabla}_\psi J_V(\psi)=\nabla_\psi V_\psi\left(\mathbf{s}_t\right)\left(V_\psi\left(\mathbf{s}_t\right)-Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)+\log \pi_\phi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)\right) \tag{7}
$$
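To see what the estimator in equation (7) does, consider a minimal sketch with a linear value function $V_\psi(\mathbf{s})=\psi^\top \mathbf{s}$, for which $\nabla_\psi V_\psi(\mathbf{s})=\mathbf{s}$. The linear model, the fixed sample values, and the names `v_linear`, `value_grad_estimate`, and `sgd_step` are all assumptions for illustration. The paper uses neural networks and fresh samples at every step.

```python
def v_linear(psi, s):
    """Hypothetical linear value function V_psi(s) = psi . s."""
    return sum(w * x for w, x in zip(psi, s))

def value_grad_estimate(psi, s, q_sa, log_pi_a):
    """Equation (7) for the linear model, where grad_psi V_psi(s) = s."""
    residual = v_linear(psi, s) - q_sa + log_pi_a
    return [x * residual for x in s]

def sgd_step(psi, grad, lr):
    return [w - lr * g for w, g in zip(psi, grad)]

# One fixed sample: state features, Q_theta(s, a), and log pi_phi(a|s)
# for an action a drawn from the current policy.
s = [1.0, 2.0]
q_sa = 3.0
log_pi_a = -1.0          # the regression target is Q - log pi = 4.0

psi = [0.0, 0.0]
for _ in range(50):
    psi = sgd_step(psi, value_grad_estimate(psi, s, q_sa, log_pi_a), lr=0.1)

v_fitted = v_linear(psi, s)   # approaches 4.0
```

Each step halves the residual in this toy setup, so `v_fitted` ends up essentially equal to $Q - \log \pi = 4$. The entropy bonus enters the value target only through the $-\log \pi$ term.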

In equation (7), the actions are sampled from the current policy rather than from the replay buffer. The soft $Q$ function parameters can be trained to minimize the soft Bellman residual
$$
J_Q(\theta)=\mathbb{E}_{\left(\mathbf{s}_t, \mathbf{a}_t\right) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)-\hat{Q}\left(\mathbf{s}_t, \mathbf{a}_t\right)\right)^2\right] \tag{8}
$$

where
$$
\hat{Q}\left(\mathbf{s}_t, \mathbf{a}_t\right)=r\left(\mathbf{s}_t, \mathbf{a}_t\right)+\gamma \mathbb{E}_{\mathbf{s}_{t+1} \sim p}\left[V_{\bar{\psi}}\left(\mathbf{s}_{t+1}\right)\right] \tag{9}
$$

which can again be optimized with stochastic gradients
$$
\hat{\nabla}_\theta J_Q(\theta)=\nabla_\theta Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)\left(Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)-r\left(\mathbf{s}_t, \mathbf{a}_t\right)-\gamma V_{\bar{\psi}}\left(\mathbf{s}_{t+1}\right)\right) \tag{10}
$$
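Equations (9) and (10) translate into very little code once a model is fixed. The sketch below assumes a linear $Q_\theta(\mathbf{s}, \mathbf{a})=\theta^\top \mathbf{x}(\mathbf{s}, \mathbf{a})$ over hypothetical features $\mathbf{x}$, and it also shows the exponential-moving-average form of the target weights $\bar{\psi}$ that appears in the last line of Algorithm 1. All function names and numbers are illustrative assumptions.

```python
def q_target(reward, gamma, v_target_next):
    """Single-sample version of equation (9): r + gamma * V_psibar(s')."""
    return reward + gamma * v_target_next

def q_grad_estimate(features, theta, reward, gamma, v_target_next):
    """Equation (10) for a linear Q_theta(s, a) = theta . features(s, a)."""
    q = sum(w * x for w, x in zip(theta, features))
    residual = q - q_target(reward, gamma, v_target_next)
    return [x * residual for x in features]

def ema_update(psi_bar, psi, tau):
    """Target update: psi_bar <- tau * psi + (1 - tau) * psi_bar."""
    return [tau * w + (1.0 - tau) * w_bar for w, w_bar in zip(psi, psi_bar)]

# One transition with made-up numbers.
grad = q_grad_estimate(features=[1.0, 1.0], theta=[0.0, 0.0],
                       reward=1.0, gamma=0.9, v_target_next=10.0)
# target = 1 + 0.9 * 10 = 10, Q = 0, so the residual is -10.

psi_bar = ema_update(psi_bar=[0.0, 0.0], psi=[1.0, 2.0], tau=0.1)
# psi_bar moves 10% of the way towards psi.
```

The target weights move slowly because $\tau$ is small, so the regression target in equation (9) changes gradually even while $\psi$ itself is being updated at every gradient step.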

The update in equation (10) makes use of a target value network $V_{\bar{\psi}}$, where $\bar{\psi}$ can be an exponential moving average of the value network weights. Alternatively, the target weights can be updated periodically to match the current value function weights. Finally, the policy parameters can be learned by directly minimizing the expected KL divergence in equation (5):
$$
J_\pi(\phi)=\mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}}\left[\mathrm{D}_{\mathrm{KL}}\left(\pi_\phi\left(\cdot \mid \mathbf{s}_t\right) \,\middle\|\, \frac{\exp \left(Q_\theta\left(\mathbf{s}_t, \cdot\right)\right)}{Z_\theta\left(\mathbf{s}_t\right)}\right)\right] \tag{11}
$$

There are several options for minimizing $J_{\pi}$. A typical solution in policy gradient methods is the likelihood ratio gradient estimator, which does not require backpropagating the gradient through the policy and the target density networks. In this case, however, the target density is the $Q$ function, which is represented by a neural network and is differentiable, so it is convenient to apply the reparameterization trick instead, which results in a lower-variance estimator. To that end, the policy is reparameterized using a neural network transformation
$$
\mathbf{a}_t=f_\phi\left(\epsilon_t ; \mathbf{s}_t\right) \tag{12}
$$

where $\epsilon_t$ is an input noise vector sampled from some fixed distribution, such as a spherical Gaussian. The objective in equation (11) can now be rewritten as
$$
J_\pi(\phi)=\mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}, \epsilon_t \sim \mathcal{N}}\left[\log \pi_\phi\left(f_\phi\left(\epsilon_t ; \mathbf{s}_t\right) \mid \mathbf{s}_t\right)-Q_\theta\left(\mathbf{s}_t, f_\phi\left(\epsilon_t ; \mathbf{s}_t\right)\right)\right] \tag{13}
$$

where $\pi_{\phi}$ is defined implicitly in terms of $f_{\phi}$. Note that the partition function is independent of $\phi$ and can therefore be omitted. The gradient of equation (13) can be approximated with
$$
\hat{\nabla}_\phi J_\pi(\phi)=\nabla_\phi \log \pi_\phi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)+\left(\nabla_{\mathbf{a}_t} \log \pi_\phi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)-\nabla_{\mathbf{a}_t} Q_\theta\left(\mathbf{s}_t, \mathbf{a}_t\right)\right) \nabla_\phi f_\phi\left(\epsilon_t ; \mathbf{s}_t\right) \tag{14}
$$

where $\mathbf{a}_t$ is evaluated at $f_\phi\left(\epsilon_t ; \mathbf{s}_t\right)$. This unbiased gradient estimator extends DDPG-style policy gradients to any tractable stochastic policy.
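The structure of equation (14) is easier to see in one dimension. The sketch below assumes a stateless Gaussian policy with learnable mean $\mu$ and fixed standard deviation $\sigma$, so that $f_\phi(\epsilon)=\mu+\sigma\epsilon$, together with a made-up quadratic critic that peaks at $a=2$. Neither choice comes from the paper. They are picked so that each term of the estimator can be written out by hand.

```python
import random

def dq_da(a):
    """Action gradient of a hypothetical critic Q(a) = -(a - 2)^2."""
    return -2.0 * (a - 2.0)

def policy_grad_mu(mu, sigma, eps):
    """Single-sample estimate of equation (14) with respect to the mean mu."""
    a = mu + sigma * eps                   # equation (12): a = f_phi(eps)
    dlogpi_dmu = (a - mu) / sigma ** 2     # explicit dependence on mu, a held fixed
    dlogpi_da = -(a - mu) / sigma ** 2     # dependence through the action
    df_dmu = 1.0                           # d f_phi / d mu
    return dlogpi_dmu + (dlogpi_da - dq_da(a)) * df_dmu

rng = random.Random(0)                     # fixed seed for reproducibility
mu, sigma, lr = 0.0, 0.5, 0.05
for _ in range(2000):
    mu -= lr * policy_grad_mu(mu, sigma, rng.gauss(0.0, 1.0))
```

For the mean, the two $\log \pi$ terms cancel, so the estimate reduces to $-\nabla_a Q$ evaluated at the sampled action, and `mu` drifts to the critic's maximum near 2 with some sampling noise. With a learnable $\sigma$ the $\log \pi$ terms no longer cancel, and that is where the entropy bonus acts.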

Algorithm 1 Soft Actor-Critic

Initialize parameter vectors $\psi, \bar{\psi}, \theta, \phi$.
for each iteration do
- for each environment step do
  - $\mathbf{a}_t \sim \pi_\phi\left(\mathbf{a}_t \mid \mathbf{s}_t\right)$
  - $\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t\right)$
  - $\mathcal{D} \leftarrow \mathcal{D} \cup\left\{\left(\mathbf{s}_t, \mathbf{a}_t, r\left(\mathbf{s}_t, \mathbf{a}_t\right), \mathbf{s}_{t+1}\right)\right\}$
- end for
- for each gradient step do
  - $\psi \leftarrow \psi-\lambda_V \hat{\nabla}_\psi J_V(\psi)$
  - $\theta_i \leftarrow \theta_i-\lambda_Q \hat{\nabla}_{\theta_i} J_Q\left(\theta_i\right)$ for $i \in\{1,2\}$
  - $\phi \leftarrow \phi-\lambda_\pi \hat{\nabla}_\phi J_\pi(\phi)$
  - $\bar{\psi} \leftarrow \tau \psi+(1-\tau) \bar{\psi}$
- end for
end for

2 Reflections

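Even though the idea itself is not hard to follow, the paper relies on quite a few small tricks, and after working through the whole derivation I still do not have an intuitive feel for it. I will keep working on it and update this post as my understanding improves.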
