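0 Introduction

When I first read the DBC paper there was a lot I did not understand, so I decided to study the relevant part of reinforcement learning first (the TD algorithm) and then come back to the paper. Even after finishing the paper, though, many basic concepts were still unclear. At first I just collected the points I did not understand and read several articles about them, but after hours of reading I was still going in circles, and I realized that my grasp of the fundamentals of reinforcement learning was lacking (the concepts feel convoluted, especially something like Actor-Critic, which combines everything). I also found that what I had written earlier while following the Hands-on Reinforcement Learning (动手学强化学习) tutorial was rather one-sided, so I watched a number of courses (special thanks to Shusen Wang's course; I will not give the link, it is on YouTube, and also on Bilibili though not in 2K). This post was written while watching and thinking, with a fair amount of tearing down and starting over ~~, very much like me rolling around in the pond of wisdom without understanding anything~~.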

1 Siamese Network

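A Siamese network is literally a "Siam network", but it is best understood as a "twin neural network" ("Siamese" is used in English in the sense of "Siamese twins"). The original post shows a schematic at this point. The "twin" part refers to two sub-networks that have exactly the same structure and share the weights $W$; these sub-networks are called branches. A branch can be a convolutional neural network (CNN), a recurrent neural network (RNN), a fully connected network, and so on, depending on the task.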

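The mechanism and principle of the Siamese network can be explained as follows (the structure has a strong contrastive-learning flavor):

Shared-parameter structure: by sharing the parameters of its sub-networks, a Siamese network lets two or more input samples go through the same feature extraction and encoding process. Sharing parameters reduces the number of parameters and improves the model's efficiency and generalization. It also makes the network easier to train and reduces the risk of overfitting.
Ability to learn representations: a Siamese network maps input samples into a low-dimensional feature space and learns discriminative feature representations. These representations capture the key information of the inputs, so that samples from different classes are well separated in feature space; this is what makes Siamese networks perform well in similarity comparison and learning tasks.
Optimization of a contrastive loss: a Siamese network is usually trained with a contrastive loss, which encourages samples of the same class to move closer together in feature space and samples of different classes to spread apart. This strengthens the model's discriminative power, so that it separates classes better and generalizes better.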
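To make the shared-weight idea and the contrastive loss concrete, here is a minimal sketch in plain Python. It is not taken from the original post: the single linear layer `embed`, the margin value, and the particular margin-based form of the loss are all assumptions chosen for illustration; a real branch would be a CNN, RNN, or similar.

```python
import math

def embed(x, W):
    """Shared branch: one linear layer (a hypothetical stand-in for a CNN/RNN)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(x1, x2, same, W, margin=1.0):
    """One common margin-based contrastive loss (assumed form).

    same=1: pull the pair together; same=0: push it apart up to `margin`.
    """
    d = distance(embed(x1, W), embed(x2, W))  # both inputs use the SAME weights W
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, purely for illustration
loss_same = contrastive_loss([0.0, 0.0], [0.3, 0.4], 1, W)       # distance 0.5
loss_diff_near = contrastive_loss([0.0, 0.0], [0.3, 0.4], 0, W)  # inside the margin
loss_diff_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], 0, W)   # beyond the margin
```

A dissimilar pair that is already farther apart than the margin contributes no loss, which is what lets different classes spread out without being pushed apart indefinitely.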

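Besides the pure Siamese network there is also the pseudo-Siamese network, whose two branches, unlike those of a Siamese network, do not share parameters. The advantage is that each branch can learn a feature representation suited to its own input; the drawback is a larger parameter count and a harder training problem.

There are also Siamese networks with more than two branches. These can learn feature representations for several samples at once, but the overall model is larger and harder to train.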

2 Value-based and policy-based reinforcement learning methods

2.0 Review of basic reinforcement learning concepts

Agent: a system embedded in an environment that takes actions to change the state of the environment, for example an indoor robot, or Mario in Super Mario.
State ($S$): a state can be seen as a summary of the system's history that determines its future evolution. The state space $\mathcal{S}$ is the set of all possible states. At step $t$ we can observe the values of the past states $s_1,\ldots,s_t$, whereas the future states $S_{t+1},S_{t+2},\ldots$ are unobserved random variables.
Action ($A$): the agent's decision, based on the state and other considerations. The action space $\mathcal{A}$ is the set of all actions; it can be discrete, such as {"left", "right", "up"}, or continuous, such as $[0,1]\times[-90,90]$. At step $t$ we can observe the past actions $a_1,\ldots,a_t$, whereas the future actions $A_{t+1},A_{t+2},\ldots$ are unobserved random variables.
Reward ($R$): the reward is the value the agent receives from the environment, a direct response to the agent's action. At step $t$ we can observe the past rewards $r_1,\ldots,r_t$. A future reward $R_i$ (for $i>t$), however, is unobserved; it depends on states and actions that are still random, starting with $S_{t+1}$ and $A_{t+1}$. Hence at step $t$ the future rewards $R_{t+1},R_{t+2},\ldots$ are random variables.
Policy function ($\pi$): the agent's decision function. The policy is a probability density function (PDF): $\pi(a \mid s)=\mathbb{P}(A=a \mid S=s)$; it maps the observed state $S=s$ to a probability distribution over all actions in $\mathcal{A}$. Because $\pi$ is a probability distribution, $\sum_{a \in \mathcal{A}} \pi(a \mid s)=1$ (an integral when the action space is continuous). For every $a\in\mathcal{A}$, the agent executes action $a$ with probability $\pi(a \mid s)$.
State transition ($p$): given the current state $S=s$, the agent's action $A=a$ makes the environment move to a new state $S'$. The state transition function is the probability density function (PDF) $p\left(s^{\prime} \mid s, a\right)=\mathbb{P}\left(S^{\prime}=s^{\prime} \mid S=s, A=a\right)$. The environment decides the new state $s'$ according to the transition probabilities $p\left(s^{\prime} \mid s, a\right)$ for all $s'\in\mathcal{S}$.
Trajectory: the interaction between the agent and the environment produces a sequence of (state, action, reward) triples: $s_1,a_1,r_1,s_2,a_2,r_2,\ldots$
Return ($U$): the return combines the reward available right now with the rewards available in the future. Clearly, a future reward is worth less than a reward that can be collected immediately (intuitively, it carries more risk), so when accounting for future rewards we introduce a discount factor $\gamma\in(0,1)$ to balance them:

$$U_t=R_t+\gamma \cdot R_{t+1}+\gamma^2 \cdot R_{t+2}+\gamma^3 \cdot R_{t+3}+\cdots$$
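As a quick numeric illustration of the discounted return, the sketch below evaluates the sum for a finite list of already observed rewards. The function name `discounted_return` and the sample rewards are made up for this example; in the text the future rewards are random variables, whereas here they are plain numbers.

```python
def discounted_return(rewards, gamma):
    """Compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite list."""
    u = 0.0
    # Work backwards so that each step applies u <- r + gamma * u once.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

rewards = [1.0, 0.0, 2.0]               # hypothetical observed rewards
u_t = discounted_return(rewards, 0.5)   # 1 + 0.5 * 0 + 0.25 * 2
```

Accumulating from the last reward backwards avoids computing powers of $\gamma$ explicitly.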

That covers the basics of reinforcement learning; the two families of methods below each bring in further ideas and concepts.

2.1 Value-based reinforcement learning

Value-based reinforcement learning aims, for a given state and a given action, at the largest possible return (i.e. the optimum), so it introduces the following two concepts:

Action-value function ($Q_{\pi}$): the action-value function $Q_\pi\left(s_t, a_t\right)$ measures how good action $a_t$ is in state $s_t$ under policy $\pi$. Formally:

$$Q_\pi\left(s_t, a_t\right)=\mathbb{E}\left[U_t \mid S_t=s_t, A_t=a_t\right]$$

That is, for a fixed state $s_t$ and action $a_t$, it is the expected value of $U_t$ under policy $\pi$.

Optimal action-value function ($Q^{\star}$): the optimal action-value function $Q^{\star}\left(s_t, a_t\right)$ measures how good action $a_t$ is in state $s_t$ when the best possible policy is followed. Formally:

$$Q^{\star}(s, a)=\max_\pi Q_\pi(s, a).$$

Here $Q^{\star}(s, a)$ does not depend on the policy function $\pi$.

Informally, value-based reinforcement learning wants to learn a stable and powerful $Q$ function: it is like finding an oracle who guides us whenever we are about to take an action in the current state, for example by telling us that in the Mario game "jump up" now pays off better than "run right". The key question is how to learn such an oracle (one can see that value-based reinforcement learning depends only weakly on the policy function $\pi$: although the methods studied earlier add some randomness so that the search is not too limited, this scoring clearly does not require learning $\pi$).

We use a neural network with parameters $\mathbf{w}$ to learn the $Q$ function. The action taken at step $t$ is then:
$$a_t=\underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q^{\star}\left(s_t, a\right)\approx\underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q\left(s_t, a ; \mathbf{w}\right)$$

This matches the temporal difference (TD) part covered earlier. Combining it with the recursive property of the return, $U_t=R_t+\gamma \cdot U_{t+1}$ (easy to derive, so omitted here), we get:
$$Q_\pi\left(s_t, a_t\right)=\mathbb{E}\left[U_t \mid s_t, a_t\right]=\mathbb{E}\left[R_t+\gamma \cdot U_{t+1} \mid s_t, a_t\right]=\mathbb{E}\left[R_t+\gamma \cdot Q_\pi\left(S_{t+1}, A_{t+1}\right) \mid s_t, a_t\right]$$

Together with $Q\left(s_t, a_t ; \mathbf{w}\right) \approx \max_\pi \mathbb{E}\left[U_t \mid s_t, a_t\right]$, and replacing the expectation by the observed sample, this gives:
$$Q\left(s_t, a_t ; \mathbf{w}\right) \approx r_t+\gamma \cdot Q\left(s_{t+1}, a_{t+1} ; \mathbf{w}\right)$$

With the TD objective over $N$ sampled transitions, where $y_i=r_i+\gamma \cdot Q\left(s_{i+1}, a_{i+1} ; \mathbf{w}\right)$ is the TD target,
$$L=\frac{1}{N} \sum_i\left(y_i-Q\left(s_i, a_i ; \mathbf{w}\right)\right)^2$$

we can optimize the network parameters $\mathbf{w}$, and step-by-step training yields a better and better oracle $Q$.
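The TD update can be sketched with a lookup table in place of the neural network. This is only an illustration under stated assumptions: the table `Q`, the two-state toy problem, the learning rate, and the use of a max over next actions to pick $a_{t+1}$ (the Q-learning choice) are all mine, not the original post's.

```python
def td_step(Q, s, a, r, s_next, actions, gamma=0.9, lr=0.5):
    """One gradient step on 0.5 * (Q(s, a) - y)^2 with the TD target y held fixed."""
    # TD target: observed reward plus discounted estimate at the next state.
    y = r + gamma * max(Q[(s_next, b)] for b in actions)
    td_error = Q[(s, a)] - y
    # For a table, dQ(s, a)/dw is 1 for the visited entry and 0 elsewhere.
    Q[(s, a)] -= lr * td_error
    return y

actions = ["left", "right"]
Q = {(s, a): 0.0 for s in (0, 1) for a in actions}  # hypothetical 2-state table
Q[(1, "right")] = 1.0

y = td_step(Q, 0, "right", 1.0, 1, actions)
q_after = Q[(0, "right")]
```

With a neural network, the table lookup becomes a forward pass and the in-place assignment becomes a gradient step on $\mathbf{w}$; the target and the error are computed the same way.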

2.2 Policy-based reinforcement learning

Policy-based reinforcement learning wants the policy function $\pi$ to assign reasonable probabilities to the different actions in a given state (the difference from value-based reinforcement learning is that the chosen action is fundamentally random). Two more concepts are needed:

State-value function ($V_{\pi}$): the state-value function $V_\pi\left(s_t\right)$ measures how good the current state $s_t$ is under a given policy $\pi$. Specifically:

$$V_\pi\left(s_t\right)=\mathbb{E}_{A \sim \pi\left(\cdot \mid s_t\right)}\left[Q_\pi\left(s_t, A\right)\right]=\int_{\mathcal{A}} \pi\left(a \mid s_t\right) \cdot Q_\pi\left(s_t, a\right) d a$$

Here the action $A$ is treated as a random variable and integrated out.

Optimal state-value function ($V^{\star}$): the optimal state-value function $V^{\star}\left(s_t\right)$ measures how good the current state $s_t$ is under the best possible policy. Formally:

$$V^{\star}(s)=\max_\pi V_\pi(s)$$

Note that $V^{\star}$ does not depend on the policy function $\pi$.

One can see that policy-based and value-based reinforcement learning have different training goals. Policy-based reinforcement learning also trains an oracle, but this oracle is a bit like Doctor Strange: its instructions are not definite. For example, in the Mario game the oracle suggests choosing "jump up" with probability $0.7$, "run right" with probability $0.2$, and "run left" with probability $0.1$. You can only say that it recommends "jump up" more; you cannot say that it tells you not to "run left", or that "run left" is necessarily bad (it feels as if, to some extent, probabilities relax the hard value estimates, which gives a larger search space and is more reasonable). The training goal is therefore to make the policy function given by the oracle, i.e. the probabilities it assigns, more reasonable. Recall the definition of $V_\pi(s)$ (for discrete actions):
$$V_\pi(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s) \cdot Q_\pi(s, a)$$
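For the Mario example, the state value is just a probability-weighted average of the action values. The probabilities below come from the text; the action values in `q` are invented purely so that there is something to average.

```python
# Policy probabilities from the Mario example in the text.
pi = {"jump up": 0.7, "run right": 0.2, "run left": 0.1}
# Hypothetical action values Q_pi(s, a); these numbers are assumptions.
q = {"jump up": 10.0, "run right": 5.0, "run left": -5.0}

# V_pi(s) = sum over actions of pi(a | s) * Q_pi(s, a)
v = sum(pi[a] * q[a] for a in pi)
```

Shifting probability mass toward actions with higher $Q_\pi$ raises $V_\pi(s)$, which is the intuition behind improving the policy by gradient ascent on $V$.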

As usual, when in doubt we fit it with a neural network: the network approximating the policy function $\pi(a\mid s)$ has parameters $\theta$. Based on this approximation of the policy function, $V_\pi(s)$ can be written as:
$$V(s ; \boldsymbol{\theta})=\sum_{a \in \mathcal{A}} \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)$$

Next we differentiate with respect to $\theta$ to obtain the direction of improvement:
$$\begin{aligned} \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} & =\frac{\partial \sum_{a \in \mathcal{A}} \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)}{\partial \boldsymbol{\theta}} \\ & =\sum_{a \in \mathcal{A}} \frac{\partial \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)}{\partial \boldsymbol{\theta}} \end{aligned}$$

Strictly speaking, $Q_\pi(s, a)$ does depend on $\theta$, since $\theta$ parameterizes $\pi(a\mid s)$; but following Prof. Wang's "harmless" simplification (it indeed does not change the final result), we treat it as independent of $\theta$, which gives:
$$\begin{aligned} &\quad \sum_{a \in \mathcal{A}} \frac{\partial \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)}{\partial \boldsymbol{\theta}}\\ & \approx\sum_{a \in \mathcal{A}} Q_\pi(s, a) \cdot \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ & =\sum_{a \in \mathcal{A}} Q_\pi(s, a) \cdot \pi(a \mid s ; \boldsymbol{\theta}) \cdot \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} . \end{aligned}$$

The third line is not obvious from the second at first sight, but it is easy to verify: by the chain rule, $\frac{\partial \log \pi}{\partial \boldsymbol{\theta}}=\frac{1}{\pi} \cdot \frac{\partial \pi}{\partial \boldsymbol{\theta}}$, so $\pi \cdot \frac{\partial \log \pi}{\partial \boldsymbol{\theta}}=\frac{\partial \pi}{\partial \boldsymbol{\theta}}$. The equation above can therefore be written equivalently as:
$$\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[Q_\pi(s, A) \cdot \frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]$$
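The expectation form can be checked numerically. The sketch below assumes a softmax policy over three discrete actions whose logits are the parameters `theta`, and treats the action values `q` as constants independent of `theta` (the same simplification as in the derivation). It compares the expectation form with a finite-difference gradient of $V$. All names and numbers are illustrative.

```python
import math

def softmax(theta):
    m = max(theta)                       # subtract the max for numerical stability
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def value(theta, q):
    """V(s; theta) = sum_a pi(a | s; theta) * Q(s, a), with Q held fixed."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))

def grad_score_form(theta, q):
    """E_{A ~ pi}[ Q(s, A) * d log pi(A | s) / d theta ] for a softmax policy."""
    pi = softmax(theta)
    n = len(theta)
    # For softmax logits: d log pi(a) / d theta_k = 1[a == k] - pi_k
    return [sum(pi[a] * q[a] * ((1.0 if a == k else 0.0) - pi[k]) for a in range(n))
            for k in range(n)]

def grad_finite_diff(theta, q, eps=1e-6):
    """Central finite differences of V with respect to each theta_k."""
    grad = []
    for k in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((value(up, q) - value(dn, q)) / (2 * eps))
    return grad

theta = [0.2, -0.1, 0.5]   # hypothetical logits
q = [1.0, 3.0, -2.0]       # hypothetical action values
g_score = grad_score_form(theta, q)
g_fd = grad_finite_diff(theta, q)
```

In practice the expectation is not summed exactly but estimated from sampled actions, which is why the log-derivative form is the convenient one.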

Different implementations handle this result in different ways.

3 Actor-critic

The Actor-Critic algorithm can be understood as a combination of the value-based and policy-based approaches. It has two neural networks:

The policy network $\pi(a \mid s ; \boldsymbol{\theta})$, called the actor, approximates the policy function $\pi(a \mid s)$.
The value network $q(s, a ; \mathbf{w})$, called the critic, approximates the action-value function $Q_\pi(s, a)$.

The state-value function $V_\pi(s)$ is then approximated by:
$$V(s ; \mathbf{w}, \boldsymbol{\theta})=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}[q(s, A ; \mathbf{w})]=\sum_{a \in \mathcal{A}} \pi(a \mid s ; \boldsymbol{\theta}) \cdot q(s, a ; \mathbf{w})$$

The corresponding policy gradient is
$$\frac{\partial V(s ; \mathbf{w}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[q(s, A ; \mathbf{w}) \cdot \frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]$$

The policy network is updated by (stochastic) policy gradient ascent. The value network is updated by temporal difference (TD) learning. One iteration of the algorithm is summarized below:

Observe the state $s_t$, then randomly sample an action $a_t \sim \pi\left(\cdot \mid s_t ; \boldsymbol{\theta}_t\right)$.
The agent executes action $a_t$ and observes the reward $r_t$ and the new state $s_{t+1}$.
Randomly sample an action $a_{t+1} \sim \pi\left(\cdot \mid s_{t+1} ; \boldsymbol{\theta}_t\right)$. (The agent does not execute $a_{t+1}$.)
Evaluate with the value network to get $q_t=q\left(s_t, a_t ; \mathbf{w}_t\right)$ and $q_{t+1}=q\left(s_{t+1}, a_{t+1} ; \mathbf{w}_t\right)$.
Compute the TD error: $\delta_t=q_t-\left(r_t+\gamma \cdot q_{t+1}\right)$.
Update the value network: $\mathbf{w}_{t+1} \longleftarrow \mathbf{w}_t-\alpha \cdot \delta_t \cdot \left.\frac{\partial q\left(s_t, a_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\right|_{\mathbf{w}=\mathbf{w}_t}$.
Update the policy network: $\boldsymbol{\theta}_{t+1} \longleftarrow \boldsymbol{\theta}_t+\beta \cdot q_t \cdot \left.\frac{\partial \log \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}$.
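One iteration of these steps can be sketched with tables standing in for both networks. This is a toy under stated assumptions: `theta[s]` holds softmax logits for the actor, `q[s][a]` is a tabular critic, and the transition `(s, a, r, s_next, a_next)` is passed in by hand rather than produced by a real environment. The helper names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def sample_action(theta, s, rng):
    """Sample a ~ pi(. | s; theta) from the softmax over theta[s]."""
    pi = softmax(theta[s])
    return rng.choices(range(len(pi)), weights=pi)[0]

def actor_critic_update(theta, q, s, a, r, s_next, a_next,
                        gamma=0.9, alpha=0.1, beta=0.1):
    q_t, q_next = q[s][a], q[s_next][a_next]   # critic evaluations q_t, q_{t+1}
    delta = q_t - (r + gamma * q_next)         # TD error
    q[s][a] -= alpha * delta                   # critic: descent (table gradient is 1)
    pi = softmax(theta[s])
    for k in range(len(pi)):                   # actor: ascent along q_t * d log pi
        dlog = (1.0 if k == a else 0.0) - pi[k]
        theta[s][k] += beta * q_t * dlog
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]   # 2 states x 2 actions, uniform policy
q = [[1.0, 0.0], [0.0, 2.0]]       # hypothetical critic values
a_demo = sample_action(theta, 0, random.Random(0))
delta = actor_critic_update(theta, q, s=0, a=0, r=1.0, s_next=1, a_next=1)
```

Note that the actor update uses `q_t` computed before the critic is changed, matching the order of the steps above.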

3.1 Deterministic Policy Gradient

DPG is a deterministic policy gradient algorithm. The basic idea of policy gradient algorithms is to represent the policy by a parameterized probability distribution $\pi_{\theta}(a \mid s) = P[a \mid s;\theta]$; since the policy is a distribution, the action $a$ is sampled randomly from it, which is why this is called the stochastic policy gradient algorithm.

DPG drops the probabilistic representation and instead represents the policy by a deterministic function $a=\mu_{\theta}(s)$. In other words, given the current state $s$, the chosen action $a$ is fixed. Turning a stochastic policy into a deterministic one has its own advantages and drawbacks:

Advantage: it can be shown theoretically that the gradient of a deterministic policy is the expectation of the gradient of the Q function, which makes the deterministic method computationally more efficient than the stochastic one.
Drawback: a fixed state always yields the same action. This is the same issue that motivated the $\epsilon$-greedy strategy in an earlier post: if a deterministic policy is used all the time, some state-action pairs $(s, a)$ may never appear in the sampled sequences, so their action values cannot be estimated and there is no guarantee that the improved policy is better than the original one.

To mitigate this drawback, DPG is used off-policy: the sampling policy differs from the policy being optimized. The sampling policy is stochastic while the optimized policy is deterministic, and the randomness of the sampling policy ensures that the state-action pairs $(s,a)$ are covered.

The key result, following the summary in the reference material, is:
$$\begin{aligned} \nabla_{\theta^\mu} J & \approx \mathbb{E}_{s_t \sim \rho^\beta}\left[\left.\nabla_{\theta^\mu} Q\left(s, a \mid \theta^Q\right)\right|_{s=s_t, a=\mu\left(s_t \mid \theta^\mu\right)}\right] \\ &=\mathbb{E}_{s_t \sim \rho^\beta}\left[\left.\nabla_{a} Q\left(s, a \mid \theta^Q\right)\right|_{s=s_t, a=\mu\left(s_t\right)} \left.\nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s=s_t}\right] \end{aligned}$$

The symbols have the following meanings:

$s$ and $a$ are the state and the action.
$Q$ is the Q function, i.e. the action-value function, and $\theta^Q$ are its parameters.
$J$ is the expected return under the initial state distribution, i.e. the total return we expect to obtain given the distribution of initial states (possibly with the discount factor $\gamma$); the goal is to make $J$ as large as possible.
$s_t$ is the state at time $t$, and $\rho^\beta$ is the distribution over states visited under the stochastic sampling policy $\beta$.
$\mu$ is the deterministic policy being optimized, and $\theta^\mu$ are its parameters.

A more intuitive reading of the formula:

The first line says that the gradient of the expected return with respect to the parameters of the optimized policy $\mu$, i.e. $\nabla_{\theta^\mu} J$, can be approximated by the expectation ($\mathbb{E}$), under the state-visitation distribution of the stochastic sampling policy $\beta$ (i.e. $s_t \sim \rho^\beta$), of the gradient of the Q function with respect to the parameters of $\mu$, i.e. $\left.\nabla_{\theta^\mu} Q\left(s, a \mid \theta^Q\right)\right|_{s=s_t, a=\mu\left(s_t \mid \theta^\mu\right)}$.
Applying the chain rule to $\left.\nabla_{\theta^\mu} Q\left(s, a \mid \theta^Q\right)\right|_{s=s_t, a=\mu\left(s_t \mid \theta^\mu\right)}$, the gradient of the Q function with respect to the parameters of $\mu$ equals the gradient of the Q function with respect to the action $a$ times the gradient of the action $a$ with respect to the parameters of $\mu$, which gives the second line.
$\nabla_{\theta^\mu} J$ is the DPG policy gradient; running gradient ascent with it optimizes the policy $\mu$ so as to maximize the expected return.
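The chain rule in the second line can be illustrated with scalars. In this sketch the policy is $\mu_\theta(s)=\theta s$ and the critic is a fixed, hand-written function $Q(s,a)=-(a-2s)^2$, so the best action is $a=2s$ and gradient ascent should drive $\theta$ toward 2. Both functions, the sampled states, and the step size are assumptions made for the example; a real critic would be learned.

```python
def mu(theta, s):
    """Deterministic policy a = mu_theta(s) (hypothetical linear form)."""
    return theta * s

def dq_da(s, a):
    """Gradient of the assumed critic Q(s, a) = -(a - 2 s)^2 w.r.t. the action."""
    return -2.0 * (a - 2.0 * s)

def dmu_dtheta(s):
    """Gradient of mu_theta(s) = theta * s w.r.t. theta."""
    return s

def dpg_gradient(theta, states):
    """Sample average of grad_a Q(s, a)|_{a = mu(s)} * grad_theta mu(s)."""
    return sum(dq_da(s, mu(theta, s)) * dmu_dtheta(s) for s in states) / len(states)

states = [-1.0, 0.5, 2.0]   # stand-in for states drawn from the behavior policy
theta = 0.0
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states)   # gradient ascent on J
```

No maximization over actions appears anywhere: the actor is improved only through the slope of the critic at the action it currently outputs.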

3.2 Deep Deterministic Policy Gradient

DDPG is a deep reinforcement learning algorithm built on DPG (Deterministic Policy Gradient).

3.2.1 From DPG to DDPG

Recall the DQN loss function:
$$\ell=\frac{1}{2 N} \sum_{i=1}^N\left[Q_\omega\left(s_i, a_i\right)-\left(r_i+\gamma \max_{a^{\prime}} Q_\omega\left(s_i^{\prime}, a^{\prime}\right)\right)\right]^2$$

It contains a maximization over actions. If the action space is very large, or even continuous (the non-discrete case mentioned earlier), this maximization cannot be carried out directly. Even discretizing the continuous space leads to very poor efficiency. The DPG policy gradient has no such maximization; recall:
$$\begin{aligned} \nabla_{\theta^\mu} J & \approx \mathbb{E}_{s_t \sim \rho^\beta}\left[\left.\nabla_{\theta^\mu} Q\left(s, a \mid \theta^Q\right)\right|_{s=s_t, a=\mu\left(s_t \mid \theta^\mu\right)}\right] \\ &=\mathbb{E}_{s_t \sim \rho^\beta}\left[\left.\nabla_{a} Q\left(s, a \mid \theta^Q\right)\right|_{s=s_t, a=\mu\left(s_t\right)} \left.\nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s=s_t}\right] \end{aligned}$$

What DDPG does is bring the two techniques that DQN uses when fitting the Q function with a neural network (they appear in the algorithm below as the replay buffer and the target networks) into DPG, so that the Q function in DPG also becomes a neural network. The complete DDPG algorithm is given below:

Algorithm : DDPG algorithm

Initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
Initialize target network $Q^{\prime}$ and $\mu^{\prime}$ with weights $\theta^{Q^{\prime}} \leftarrow \theta^Q, \theta^{\mu^{\prime}} \leftarrow \theta^\mu$.
Initialize replay buffer $R$
for episode $=1, M$ do
  Initialize a random process $\mathcal{N}$ for action exploration.
  Receive initial observation state $s_1$.
  for $t=1, T$ do
    Select action $a_t=\mu\left(s_t \mid \theta^\mu\right)+\mathcal{N}_t$ according to the current policy and exploration noise
    Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
    Store transition $\left(s_t, a_t, r_t, s_{t+1}\right)$ in $R$
    Sample a random minibatch of $N$ transitions $\left(s_i, a_i, r_i, s_{i+1}\right)$ from $R$
    Set $y_i=r_i+\gamma Q^{\prime}\left(s_{i+1}, \mu^{\prime}\left(s_{i+1} \mid \theta^{\mu^{\prime}}\right) \mid \theta^{Q^{\prime}}\right)$
    Update critic by minimizing the loss: $L=\frac{1}{N} \sum_i\left(y_i-Q\left(s_i, a_i \mid \theta^Q\right)\right)^2$
    Update the actor policy using the sampled policy gradient: $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \left.\nabla_a Q\left(s, a \mid \theta^Q\right)\right|_{s=s_i, a=\mu\left(s_i\right)} \left.\nabla_{\theta^\mu} \mu\left(s \mid \theta^\mu\right)\right|_{s_i}$
    Update the target networks: $\theta^{Q^{\prime}} \leftarrow \tau \theta^Q+(1-\tau) \theta^{Q^{\prime}}$, $\theta^{\mu^{\prime}} \leftarrow \tau \theta^\mu+(1-\tau) \theta^{\mu^{\prime}}$
  end for
end for
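Three pieces of the algorithm that are easy to get wrong are the replay buffer, the target $y_i$ computed from the target networks, and the soft target update. The sketch below shows only those pieces, using plain functions as hypothetical stand-ins for $\mu^{\prime}$ and $Q^{\prime}$ and made-up transitions instead of a real environment; the actual critic and actor gradient steps are omitted.

```python
import random
from collections import deque

rng = random.Random(0)
replay = deque(maxlen=1000)              # replay buffer R

def mu_target(s):                        # hypothetical target actor mu'
    return 0.5 * s

def q_target(s, a):                      # hypothetical target critic Q'
    return -(a - s) ** 2

# Fill the buffer with made-up transitions (s, a, r, s_next).
for t in range(20):
    s = 0.1 * t
    a = 0.5 * s + rng.gauss(0.0, 0.1)    # deterministic action plus exploration noise
    replay.append((s, a, -abs(a), s + 0.1))

batch = rng.sample(list(replay), 4)      # random minibatch of N = 4 transitions
gamma = 0.99
# y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
targets = [r + gamma * q_target(s2, mu_target(s2)) for (_, _, r, s2) in batch]

def soft_update(target_w, online_w, tau=0.01):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target_w, online_w)]

target_w = soft_update([0.0, 0.0], [1.0, -2.0], tau=0.1)
```

With a small $\tau$ the target weights trail the online weights slowly, which keeps the regression target $y_i$ from shifting abruptly between updates.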

In addition, DDPG has another advantage: it can learn directly from raw data (for example, the frames of Atari games), i.e. end to end.

3.3 Soft Actor-Critic

SAC (Soft Actor-Critic) is a reinforcement learning algorithm built on the idea of maximum entropy. Like PPO, it uses a stochastic policy, and it is an off-policy algorithm with an actor-critic architecture. What distinguishes it most from other reinforcement learning algorithms is that, while optimizing the policy for higher cumulative reward, SAC also maximizes the entropy of the policy.

The benefit of bringing entropy into a reinforcement learning algorithm is that it keeps the policy as random as possible: the agent explores the state space more thoroughly, the policy avoids falling into a local optimum too early, and several feasible ways of completing the task can be discovered, which improves robustness to disturbances. (Doesn't this look a lot like the entropy regularization term in neural networks?)

Computing entropy: entropy measures the randomness of a random variable, and in practice it is computed from the distribution the variable follows. To compute the entropy of a variable $x$ that follows the distribution $P$, the entropy $H(P)$ is:

$$H(P)=\underset{x \sim P}{\mathbb{E}}[-\log P(x)]$$
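For a discrete distribution the expectation is a finite sum, so the entropy is easy to compute directly. The distributions below are made-up examples meant to show that a uniform policy has the highest entropy and a deterministic one has zero entropy.

```python
import math

def entropy(p):
    """H(P) = E_{x ~ P}[-log P(x)] for a discrete distribution given as a list."""
    # Terms with probability 0 contribute nothing (0 * log 0 is taken as 0).
    return -sum(x * math.log(x) for x in p if x > 0.0)

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])   # most random: log(4)
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])    # nearly deterministic
h_det = entropy([1.0, 0.0, 0.0, 0.0])           # deterministic
```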

The goal of a standard reinforcement learning algorithm is to find the policy that collects the most cumulative reward:

$$\pi_{std}^*=\arg \max_\pi \sum_t \mathbb{E}_{\left(s_t, a_t\right) \sim \rho_\pi}\left[r\left(s_t, a_t\right)\right]$$

The target policy of a reinforcement learning algorithm with entropy maximization is (this is essentially a Lagrange-multiplier-style term; larger entropy means a more uniform distribution, so the term is added rather than subtracted):

$$\pi_{\text{MaxEnt}}^*=\arg \max_\pi \sum_t \mathbb{E}_{\left(s_t, a_t\right) \sim \rho_\pi}\left[r\left(s_t, a_t\right)+\alpha H\left(\pi\left(\cdot \mid s_t\right)\right)\right]$$

Consequently, its Q and V functions also differ from those of the ordinary Actor-Critic algorithm; they are the soft Q function and the soft V function:
$$\begin{aligned} Q_{soft}^{\pi}(s, a) &=\underset{\substack{s^{\prime} \sim p(s^{\prime} \mid s,a) \\ a^{\prime} \sim \pi}}{\mathbb{E}}\left[r\left(s, a \right) + \gamma\left(Q_{soft}^{\pi}\left(s^{\prime}, a^{\prime}\right)+\alpha H\left(\pi\left(\cdot \mid s^{\prime}\right)\right)\right)\right]\\ &=\underset{s^{\prime} \sim p(s^{\prime} \mid s,a)}{\mathbb{E}}\left[r\left(s, a \right)+\gamma V_{soft}^{\pi}\left(s^{\prime}\right)\right]\\ V_{soft}^{\pi}(s) &=\underset{a \sim \pi}{\mathbb{E}}\left[Q_{soft}^{\pi}(s, a) - \alpha \log \pi(a \mid s) \right] \end{aligned}$$

基于能量的策略模型 (Energy Based Policy, EBP) : 回顾熵最大化的强化学习算法的目标策略 (Max Entropy Reinforcement Learning, MERL) :

为了适应更复杂的任务，MERL 中的策略不再是以往的高斯分布形式，而是用基于能量的模型 (energy-based model) 来表示策略 :
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"> π ( a t ∣ s t ) ∝ exp ⁡ ( − E ( s t , a t ) ) \pi\left({a}{t} | {s}{t}\right) \propto \exp \left(-\mathcal{E}\left({s}{t}, {a}{t}\right)\right) </math>π(at∣st)∝exp(−E(st,at))

Here $\mathcal{E}$ is the energy function, which can be fitted with a neural network. To connect the energy-based policy (EBP) with the value function, MERL sets:

$$
\mathcal{E}(s_t, a_t)=-\frac{1}{\alpha} Q_{soft}(s_t, a_t)
$$

This gives $\pi(a_t \mid s_t) \propto \exp\left(\frac{1}{\alpha} Q_{soft}(s_t, a_t)\right)$.
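When the action set is discrete, this energy-based policy is just a softmax over $Q_{soft}/\alpha$. The following minimal sketch assumes a toy state with three actions; the Q values and the helper name `boltzmann_policy` are hypothetical. It illustrates how the temperature α controls how greedy the policy is.

```python
import math

def boltzmann_policy(q_values, alpha):
    """pi(a|s) proportional to exp(Q(s, a) / alpha), normalized over discrete actions."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)   # partition function (up to the factor exp(m / alpha))
    return [w / z for w in weights]

q_values = [1.0, 2.0, 0.5]  # toy values (assumption)
pi_sharp = boltzmann_policy(q_values, alpha=0.1)   # small alpha: nearly greedy
pi_flat = boltzmann_policy(q_values, alpha=10.0)   # large alpha: nearly uniform
```

With a small α almost all probability mass lands on the action with the largest Q value, while a large α pushes the policy toward the uniform distribution.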

In SAC (Soft Actor-Critic) the ideal policy is still the EBP above. However, the EBP still cannot be sampled from directly, so a Gaussian distribution $\pi$ is used in its place to interact with the environment, and during policy optimization this Gaussian $\pi$ is pushed as close to the EBP as possible. The distance between $\pi$ and the EBP is measured with the KL divergence. The policy optimization formula (whose goal is to bring $\pi$ closer to the EBP) is:

$$
\pi_{new}=\arg\min_{\pi \in \Pi} D_{KL}\left(\pi(\cdot \mid s_t) \,\Big\|\, \frac{\exp\left(\frac{1}{\alpha} Q_{soft}^{\pi_{old}}(s_t, \cdot)\right)}{Z_{soft}^{\pi_{old}}(s_t)}\right)
$$

From the definition of $Q_{soft}^{\pi}(s, a)$, the loss function for training the $Q$ function is:

$$
J_{Q}(\theta)=\underset{\substack{(s_t, a_t, s_{t+1}) \sim \mathcal{D} \\ a_{t+1} \sim \pi_{\phi}}}{\mathbb{E}}\left[\frac{1}{2}\left(Q_{\theta}(s_t, a_t)-\left(r(s_t, a_t)+\gamma\left(Q_{\theta}(s_{t+1}, a_{t+1})-\alpha \log \pi_{\phi}(a_{t+1} \mid s_{t+1})\right)\right)\right)^{2}\right]
$$
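For a single transition, the quantity inside the expectation can be computed as follows. This is a minimal sketch with made-up numbers; the function names `soft_q_target` and `soft_q_loss` are hypothetical. Practical implementations usually evaluate the next-state Q value with a separate target network and treat the target as a constant, which goes beyond the formula above and is not modelled here.

```python
import math

def soft_q_target(reward, gamma, q_next, log_prob_next, alpha):
    """y = r + gamma * (Q(s', a') - alpha * log pi(a'|s'))."""
    return reward + gamma * (q_next - alpha * log_prob_next)

def soft_q_loss(q_pred, target):
    """0.5 * (Q(s, a) - y)^2 for one transition."""
    return 0.5 * (q_pred - target) ** 2

# Toy transition (assumptions): a' was sampled with probability 0.5.
target = soft_q_target(reward=1.0, gamma=0.99, q_next=2.0,
                       log_prob_next=math.log(0.5), alpha=0.2)
plain_target = 1.0 + 0.99 * 2.0   # ordinary TD target without the entropy bonus
loss = soft_q_loss(q_pred=2.5, target=target)
```

Since $\log \pi_\phi(a_{t+1} \mid s_{t+1})$ is negative for a probability below 1, the term $-\alpha \log \pi_\phi$ raises the target above the ordinary TD target; this is how the entropy bonus enters the critic update.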

As before, $(s_t, a_t)$ is data produced by the agent interacting with the environment, while $a_{t+1}$ is chosen according to the policy $\pi_{\phi}$. The loss function for training $\pi_{\phi}$ is then:

$$
\begin{aligned}
J_{\pi}(\phi) &= D_{KL}\left(\pi_{\phi}(\cdot \mid s_t) \,\Big\|\, \exp\left(\frac{1}{\alpha} Q_{\theta}(s_t, \cdot)-\log Z(s_t)\right)\right) \\
&= \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_{\phi}}\left[\log\left(\frac{\pi_{\phi}(a_t \mid s_t)}{\exp\left(\frac{1}{\alpha} Q_{\theta}(s_t, a_t)-\log Z(s_t)\right)}\right)\right] \\
&= \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_{\phi}}\left[\log \pi_{\phi}(a_t \mid s_t)-\frac{1}{\alpha} Q_{\theta}(s_t, a_t)+\log Z(s_t)\right]
\end{aligned}
$$
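In the discrete case the last line of this derivation can be evaluated exactly. The sketch below assumes a toy state with three actions and hypothetical helper names `log_partition` and `policy_kl`. It checks that the KL divergence is zero when the policy equals the EBP itself and positive for another policy such as the uniform one.

```python
import math

def log_partition(q_values, alpha):
    """log Z(s) = log sum_a exp(Q(s, a) / alpha), computed stably."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)
    return m + math.log(sum(math.exp(x - m) for x in scaled))

def policy_kl(probs, q_values, alpha):
    """E_{a~pi}[log pi(a|s) - Q(s, a) / alpha + log Z(s)]."""
    log_z = log_partition(q_values, alpha)
    return sum(p * (math.log(p) - q / alpha + log_z)
               for p, q in zip(probs, q_values) if p > 0.0)

q_values = [1.0, 2.0, 0.5]  # toy values (assumption)
alpha = 0.5
log_z = log_partition(q_values, alpha)
ebp = [math.exp(q / alpha - log_z) for q in q_values]  # the target distribution
uniform = [1.0 / 3.0] * 3

kl_ebp = policy_kl(ebp, q_values, alpha)
kl_uniform = policy_kl(uniform, q_values, alpha)
```

Note that `log_z` depends only on the Q values and α, not on `probs`, so it shifts the loss by the same constant for every candidate policy.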

The difference here is that, while $s_t$ is still data produced by the agent interacting with the environment, $a_t$ is sampled from the current policy $\pi_\phi$. Writing this sample as a function of noise $\varepsilon_t$ drawn from a standard normal distribution gives:

$$
a_t=f_{\phi}(\varepsilon_t ; s_t)=f_{\phi}^{\mu}(s_t)+\varepsilon_t \odot f_{\phi}^{\sigma}(s_t)
$$
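A scalar version of this sampling step can be sketched as follows. The numbers `mu` and `sigma` stand in for the network outputs $f_{\phi}^{\mu}(s_t)$ and $f_{\phi}^{\sigma}(s_t)$ and are assumptions, as are the helper names. The point is that the randomness lives entirely in $\varepsilon_t$, so the action is a deterministic, differentiable function of the mean and standard deviation.

```python
import math
import random

def sample_action(mu, sigma, rng):
    """a = mu + eps * sigma with eps ~ N(0, 1); returns (a, eps)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + eps * sigma, eps

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2) for a scalar action."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

rng = random.Random(0)    # fixed seed so the sketch is reproducible
mu, sigma = 0.3, 0.5      # stand-ins for the policy network outputs
samples = [sample_action(mu, sigma, rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((a - sample_mean) ** 2 for a in samples) / len(samples)
```

For a fixed `eps`, the derivative of the action with respect to `mu` is 1 and with respect to `sigma` is `eps`, which is what lets gradients flow from the Q function back into the policy parameters.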

Moreover, since $Z$ does not depend on the policy parameters $\phi$, it vanishes when differentiating with respect to $\phi$ and can simply be dropped. Multiplying the remaining terms by the constant $\alpha$ then gives:

$$
J_{\pi}(\phi)=\mathbb{E}_{s_t \sim \mathcal{D},\, \varepsilon_t \sim \mathcal{N}}\left[\alpha \log \pi_{\phi}\left(f_{\phi}(\varepsilon_t ; s_t) \mid s_t\right)-Q_{\theta}\left(s_t, f_{\phi}(\varepsilon_t ; s_t)\right)\right]
$$
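As a worked example, take a fixed state, a scalar Gaussian policy, and the toy function $Q(s, a) = -(a - c)^2$ (all assumptions made for illustration, as are the function names). For this Q the EBP $\exp(Q/\alpha)$ is itself a Gaussian with mean $c$ and variance $\alpha/2$, and the loss has the closed form $-\alpha\left(\log \sigma + \tfrac{1}{2}\log(2\pi e)\right) + (\mu - c)^2 + \sigma^2$. The sketch compares a Monte Carlo estimate of the loss with this closed form.

```python
import math
import random

def q_fn(a, center=1.0):
    """Toy soft Q function Q(s, a) = -(a - center)^2 for one fixed state."""
    return -(a - center) ** 2

def policy_loss_mc(mu, sigma, alpha, n=50000, seed=0):
    """Monte Carlo estimate of E_eps[alpha * log pi(f(eps)) - Q(f(eps))]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        a = mu + eps * sigma  # reparameterized action
        # log N(a; mu, sigma^2), using (a - mu) / sigma == eps
        log_pi = -0.5 * eps ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        total += alpha * log_pi - q_fn(a)
    return total / n

def policy_loss_exact(mu, sigma, alpha, center=1.0):
    """Closed form of the same expectation for the quadratic toy Q."""
    neg_entropy = -(math.log(sigma) + 0.5 * math.log(2.0 * math.pi * math.e))
    return alpha * neg_entropy + (mu - center) ** 2 + sigma ** 2

alpha = 0.2
ebp_sigma = math.sqrt(alpha / 2.0)  # standard deviation of exp(Q / alpha)
loss_at_ebp = policy_loss_exact(1.0, ebp_sigma, alpha)
loss_off = policy_loss_exact(0.5, 0.8, alpha)
loss_mc = policy_loss_mc(0.5, 0.8, alpha)
```

In this toy setting the loss is smallest exactly when the Gaussian policy matches the EBP, which is the behaviour the KL objective is meant to produce; with a neural-network Q function no such closed form exists and the loss is minimized by stochastic gradient descent on the sampled estimate.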

Why is there no $V$ function any more? In the first version of SAC, the authors stated that maintaining both value functions made training more stable. In the second version, however, they introduced a method for automatically tuning the temperature coefficient $\alpha$, which made SAC more stable, so only the $Q$ function was kept.

Supplementary notes on reading the DBC paper