强化学习学习（一）从MDP到Actor-critic演员-评论家算法

文章目录

- - - [From Markov chains to Markov decision process (MDP):](#From Markov chains to Markov decision process (MDP):)
    - [partially observed Markov decision peocess](#partially observed Markov decision peocess)
  - [Value Functions](#Value Functions)
  - - Q-function
    - [value function](#value function)
    - [Using Q π Q^\pi Qπ and V π V^\pi Vπ](#Using Q π Q^\pi Qπ and V π V^\pi Vπ)
  - [Types of RL algorithms](#Types of RL algorithms)
  - - [Examples of algorithms](#Examples of algorithms)
  - [Policy gradient](#Policy gradient)
  - - 大量有趣的数学技巧，包括为梯度和=增加约束
  - Actor-critic
  - - [Estimate V](#Estimate V)
    - [algorithms-with discount](#algorithms-with discount)

From Markov chains to Markov decision process (MDP):

M = S , A , T , r \mathcal{M}=\mathcal{S},\mathcal{A},\mathcal{T},r M=S,A,T,r
T \mathcal{T} T now is a tensor of 3 division: T i , j , k = p ( s t + 1 = i ∣ s t = j , a t = k ) \mathcal{T}{i,j,k}=p(s{t+1}=i|s_t=j,a_t=k) Ti,j,k=p(st+1=i∣st=j,at=k)
r r r-reward funtion r : S × A → R r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R} r:S×A→R,

partially observed Markov decision peocess

M = S , A , O , T , E , r \mathcal{M}=\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{E},r M=S,A,O,T,E,r
E \mathcal{E} E-emmision probobality p ( o t ∣ s t ) p(o_t|s_t) p(ot∣st)

由于 a t a_t at是根据 s t s_t st和 π θ \pi_{\theta} πθ共同推出的： p ( ( s t + 1 , a t + 1 ) ∣ s t , a t ) = p ( s t + 1 ∣ s t , a t ) π θ ( a t + 1 ∣ s t + 1 ) p((s_{t+1},a_{t+1})|s_{t},a_{t})=p(s_{t+1}|s_t,a_t)\pi_{\theta}(a_{t+1}|s_{t+1}) p((st+1,at+1)∣st,at)=p(st+1∣st,at)πθ(at+1∣st+1)

statinary: the same before and after transition: μ = T μ \mu=\mathcal{T}\mu μ=Tμ，因此 μ \mu μ是 T \mathcal{T} T的特征值为1的特征向量！

for statinary distribution, μ = p θ ( s , a ) \mu=p_{\theta}(s,a) μ=pθ(s,a)

利用期望，是为了将不连续的奖励变成连续的变量，从而适用于梯度下降等算法

Value Functions

Q-function

Q π ( s t , a t ) = ∑ t ′ = t T E π θ [ r ( s t ′ , a t ′ ) ∣ s t , a t ] Q^{\pi}(s_t,a_t)=\sum^T_{t'=t}E_{\pi_{\theta}}[r(s_{t'},a_{t'})|s_t,a_t] Qπ(st,at)=∑t′=tTEπθ[r(st′,at′)∣st,at]: 在 s t s_t st状态下执行动作 a t a_t at的所有奖励

value function

V π ( s t ) = ∑ t ′ = t T E π θ [ r ( s t ′ , a t ′ ) ∣ s t ] V^{\pi}(s_t)=\sum^T_{t'=t}E_{\pi_{\theta}}[r(s_{t'},a_{t'})|s_t] Vπ(st)=∑t′=tTEπθ[r(st′,at′)∣st]： s t s_t st状态下的所有奖励（注意看自变量和它对什么去了取了期望）

So, V π ( s t ) = E a t ∼ π ( a t ∣ s t ) [ Q π ( s t , a t ) ] V^{\pi}(s_t)=E_{a_t\sim\pi(a_t|s_t)}[Q^{\pi}(s_{t},a_{t})] Vπ(st)=Eat∼π(at∣st)[Qπ(st,at)]，如果我们对状态再求一次期望呢？：
E s 1 ∼ p ( s 1 ) [ V π ( s 1 ) ] E_{s_1\sim p(s_1)}[V^{\pi}(s_1)] Es1∼p(s1)[Vπ(s1)] is the RL objective (对所有可能的出发的状态进行求期望，就是我们上面说的goal)

Using Q π Q^\pi Qπ and V π V^\pi Vπ

if we have policy π \pi π, and we know Q π ( s , a ) Q^\pi(s,a) Qπ(s,a), then we can improve π \pi π :
- set π ′ ( a ∣ s ) = 1 \pi'(a|s)=1 π′(a∣s)=1 if a = arg ⁡ max ⁡ a Q π ( s , a ) a=\arg\max_aQ^\pi(s,a) a=argmaxaQπ(s,a)
- this policy is better than π \pi π
compute gradient to increase probobality of good actions a:
- if Q π ( s , a ) > V π ( s ) Q^\pi(s,a)>V^\pi(s) Qπ(s,a)>Vπ(s), then a is better than average 因为 V π V^\pi Vπ是平均的各种动作奖励值
- modify π ( a ∣ s ) \pi(a|s) π(a∣s) to increase probobality of a if Q π ( s , a ) > V π ( s ) Q^\pi(s,a)>V^\pi(s) Qπ(s,a)>Vπ(s)
  
  Q-value和V-value都是用来评价策略好坏的

Types of RL algorithms

Our goal:
θ ∗ = arg ⁡ max ⁡ θ E τ ∼ p θ ( τ ) [ ∑ t r ( s t , a t ) ] \theta^*=\arg\max_\theta E_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right] θ∗=argθmaxEτ∼pθ(τ)[t∑r(st,at)]

$!NOTE\] Types of RL algorithms * Policy gradients: 直接计算目标函数对 θ \\theta θ的导数，然后执行梯度下降 * Value-based: 为最优策略去模拟（神经网络去模拟）值函数或者Q函数。（没有明确的策略） * Actor-critic: 上面两种的结合，模拟值/Q函数，然后去更新策略 * Model-based: estimate the transition model T \\mathcal{T} T * 用它来planning（没有明确的策略） * 用来更新策略$

Examples of algorithms

Value function fitting
- Q-learning, DQN
- Temporal difference learning
- Fitted value iteration
Policy gradient methods
- Reinforce
- Natual policy gradient
- Trust region policy optimization
Actor-critic
- A3C
- SAC
- DDPG
Model-based RL
- Dyna
- Guided policy search

Policy gradient

我们对着这个幻灯片来讨论讨论：

首先这个 J ( θ ) J(\theta) J(θ)表示某个特定的策略的轨迹奖励，也就按照这个策略，跑完整个的所有reward

因为最终的目标就是找到最佳策略使得 J ( θ ) J(\theta) J(θ)的期望最大，也就是说它就是优化函数，因此我们要对 J ( θ ) J(\theta) J(θ)求导。

积分和求导符号可以换位置。要求期望就得知道概率分布 p θ ( τ ) p_\theta(\tau) pθ(τ)，而这个是不知道的。因此利用了蓝框里的，对 p θ ( τ ) p_\theta(\tau) pθ(τ)求导变成对 log ⁡ p θ ( τ ) \log p_\theta(\tau) logpθ(τ)求导，有了下面的：

几个和策略 θ \theta θ无关项求导之后都等于零， p θ ( τ ) p_\theta(\tau) pθ(τ)的展开又可以利用log转换为求和

因此得到最后一行，也就是剩下的参数我们都是知道的（ π \pi π、 r r r）

由于 J ( θ ) J(\theta) J(θ)是个期望，那么就可以通过多次实验采样然后平均得到：
∇ θ J ( θ ) ≈ 1 N ∑ i = 1 N ( ∑ t = 1 T ∇ θ log ⁡ π θ ( a i , t ∣ s i , t ) ) ( ∑ t = 1 T r ( s i , t , a i , t ) ) \nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^Tr(s_{i,t},a_{i,t})\right) ∇θJ(θ)≈N1i=1∑N(t=1∑T∇θlogπθ(ai,t∣si,t))(t=1∑Tr(si,t,ai,t))

我们再对应着那三个box的颜色看看：

这里的 π θ ( a t ∣ s t ) \pi_\theta(a_t|s_t) πθ(at∣st)是什么？

可以是神经网络输出的概率（离散），

也可以是动作的概率分布（连续）：
π θ ( a t ∣ s t ) = N ( f neural network ( s t ) ; Σ ) \pi_\theta(a_t|s_t)=\mathcal{N}(f_{\text{neural network}(s_t)};\Sigma) πθ(at∣st)=N(fneural network(st);Σ) （高斯分布例子）

把上面的式子稍微简化一点：
∇ θ J ( θ ) ≈ 1 N ∑ i = 1 N ( ∑ t = 1 T ∇ θ log ⁡ π θ ( a i , t ∣ s i , t ) ) ( ∑ t = 1 T r ( s i , t , a i , t ) ) = 1 N ∑ i = 1 N ∇ θ log ⁡ π θ ( τ i ) r ( τ i ) \begin{align}\nabla_\theta J(\theta)&\approx\frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^Tr(s_{i,t},a_{i,t})\right)\\&=\frac{1}{N}\sum_{i=1}^N\nabla_\theta \log\pi_\theta(\tau_i)r(\tau_i)\end{align} ∇θJ(θ)≈N1i=1∑N(t=1∑T∇θlogπθ(ai,t∣si,t))(t=1∑Tr(si,t,ai,t))=N1i=1∑N∇θlogπθ(τi)r(τi)

考虑到因果性：现在的决策不会影响过去的reward，绿色框中可以改写一下：
Q ^ π ( x t , u t ) = ∑ t ′ = t T r ( x t ′ , u t ′ ) \hat{Q}^\pi(x_t,u_t)=\sum_{t'=t}^Tr(x_{t'},u_{t'}) Q^π(xt,ut)=∑t′=tTr(xt′,ut′)

我们把轨迹 τ \tau τ写开，再利用因果性的 Q ^ \hat{Q} Q^：
∇ θ J ( θ ) ≈ 1 N ∑ i = 1 N ∑ t = 1 T ∇ θ log ⁡ π θ ( a i , t ∣ s i , t ) Q ^ i , t π \nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\hat{Q}^\pi_{i,t} ∇θJ(θ)≈N1i=1∑Nt=1∑T∇θlogπθ(ai,t∣si,t)Q^i,tπ
Q ^ i , t π \hat{Q}^\pi_{i,t} Q^i,tπ表示如果你在 s i , t s_{i,t} si,t状态采取 a i , t a_{i,t} ai,t动作，并按照 π \pi π策略继续下去跑完整个轨迹的奖励的估计。

重点就在于求和的下标，这样做的最大好处就是能够减少有限样本带来的方差

其实policy gradient的原理就是在最大似然估计的基础上，按照 Q ^ π ( x t , u t ) \hat{Q}^\pi(x_t,u_t) Q^π(xt,ut)进行加权：

J ~ ( θ ) ≈ 1 N ∑ i = 1 N ∑ i = 1 N log ⁡ π θ ( a i , t ∣ s i , t ) Q ^ i , t \tilde{J} (\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{i=1}^N \log\pi_\theta(a_{i,t}|s_{i,t})\hat{Q}_{i,t} J~(θ)≈N1i=1∑Ni=1∑Nlogπθ(ai,t∣si,t)Q^i,t

大量有趣的数学技巧，包括为梯度和=增加约束

在L5，gradient policy的最后一节，Trust region policy optimization

Actor-critic

来个baseline，接着我们上面的说，正因为 Q ^ \hat{Q} Q^只是多次采样的对期望的模拟，因此假设我们有一个理想的reward的期望:
Q ( s t , a t ) = ∑ t ′ = t T E π θ [ r ( s t ′ , a t ′ ) ∣ s t , a t ] Q(s_t,a_t)=\sum_{t'=t}^TE_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t] Q(st,at)=t′=t∑TEπθ[r(st′,at′)∣st,at]

然后考虑在一个特定的状态 s t s_t st下，预期的所有动作的平均reward，就是值函数的定义（注意自变量和它是对谁取了期望）
V ( s t ) = E a t ∼ π θ ( a t ∣ s t ) [ Q ( s t , a t ) ] V(s_t)=E_{a_t\sim\pi_\theta}(a_t|s_t)[Q(s_t,a_t)] V(st)=Eat∼πθ(at∣st)[Q(st,at)]

这个就可以作为我们的baseline，代表就是平均的行动回报：（代替原本的恒定的 b b b）

$!NOTE\] AC和policy gradient的区别所以以前的 ∑ t = 1 T r ( s i , t , a i , t ) \\sum_{t=1}\^Tr(s_{i,t},a_{i,t}) ∑t=1Tr(si,t,ai,t)其实是一种蒙特卡洛估计，虽然无偏但是方差很大，我们用些许的偏差换来方差的巨大减小，也就是要直接去fit Q π , V π , o r A π Q\^\\pi,V\^\\pi,or A\^\\pi Qπ,Vπ,orAπ. 因为后者本质是期望，哪怕是我们拟合的期望函数，也比采样得到的 r ( ) r() r()更好$

第一项实际上是确定的不是随机变量，因为是当下的状态和动作。 Q π ( s t , a t ) ≈ r ( s t , a t ) + V π ( s t + 1 ) Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1}) Qπ(st,at)≈r(st,at)+Vπ(st+1) 这里进行了小的近似：从t到t+1的过程相当于用单样本进行了近似，因为理论上来说这里的状态转移也是要取期望的，随后就可以得到：
A π ( s t , a t ) ≈ r ( s t , a t ) + V π ( s t + 1 ) − V π ( s t ) A^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_{t}) Aπ(st,at)≈r(st,at)+Vπ(st+1)−Vπ(st)
A π ( s t , a t ) A^\pi(s_t,a_t) Aπ(st,at)意义是动作 a t a_t at比根据策略 π \pi π产生的平均动作reward好多少：Advantage

A和Q都取决于两个变量：状态和动作，而V只取决于状态，因此更好去模拟fit------接下来就用基于V函数的critic

Estimate V

V π ( s t ) ≈ ∑ t ′ = t T r ( s t ′ , a t ′ ) V^\pi(s_t)\approx\sum_{t'=t}^T r(s_{t'},a_{t'}) Vπ(st)≈∑t′=tTr(st′,at′)

not as good as: V π ( s t ) ≈ 1 N ∑ n = 1 N ∑ t ′ = t T r ( s t ′ , a t ′ ) V^\pi(s_t)\approx\frac{1}{N}\sum_{n=1}^N\sum_{t'=t}^T r(s_{t'},a_{t'}) Vπ(st)≈N1∑n=1N∑t′=tTr(st′,at′), but still pretty good

So, our training data: { ( s i , t , ∑ t ′ = t T r ( s i , t ′ , a i , t ′ ) ) } \left\{(s_{i,t},\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}))\right\} {(si,t,∑t′=tTr(si,t′,ai,t′))} ，右边的就是 y i , t y_{i,t} yi,t

supervised learning: L ( ϕ ) = 1 2 ∑ i ∣ ∣ V ^ ϕ π ( s i ) − y i ∣ ∣ 2 \mathcal{L}(\phi)=\frac{1}{2}\sum_i||\hat{V}^\pi_\phi(s_i)-y_i||^2 L(ϕ)=21∑i∣∣V^ϕπ(si)−yi∣∣2

algorithms-with discount

上面的方法需要我们去使用两个神经网络去拟合函数：

s-> V ^ ϕ π \hat{V}^\pi_\phi V^ϕπ和s-> π θ ( a ∣ s ) \pi_\theta(a|s) πθ(a∣s)