lec 6, 2026/3/17-19, Actor-Critic Algorithms
- State & state-action value functions
$Q^\pi(s_t,a_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]$: total reward from taking $a_t$ in $s_t$
$V^\pi(s_t)=\mathbb{E}_{a_t\sim\pi_\theta(a_t|s_t)}[Q^\pi(s_t,a_t)]$: total reward from $s_t$ onward
$A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$: how much better $a_t$ is than the average action under $\pi$
$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log{\pi_\theta(a_{i,t}|s_{i,t})}A^\pi(s_{i,t},a_{i,t})$: the better the estimate of $A^\pi$, the lower the variance of this gradient estimator
- Value function fitting
$\sum_{t'=t}^T r(s_{t'}, a_{t'}) = r(s_t, a_t) + \sum_{t'=t+1}^T r(s_{t'}, a_{t'})$
$Q^\pi(s_t,a_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]=r(s_t,a_t)+\sum^T_{t'=t+1}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]=r(s_t,a_t)+\mathbb{E}_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}[V^\pi(s_{t+1})]\approx r(s_t,a_t)+V^\pi(s_{t+1})$
$\sum^T_{t'=t+1}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]$ can be read as two nested expectations: the outer one is over which $s_{t+1}$ the environment transitions to after taking $a_t$, and the inner one is the expected return of following $\pi$ from $s_{t+1}$ onward, which is exactly the definition of $V^\pi(s_{t+1})$.
$Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})$, $A^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_{t})$
- Policy evaluation
$V^\pi(s_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]$
$J(\theta)=\mathbb{E}[\sum^T_{t=1}r(s_t,a_t)]=\mathbb{E}_{s_1\sim p(s_1)}[V^\pi(s_1)]$
perform Monte Carlo policy evaluation (just like policy gradient does): $V^\pi(s_t)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t'=t}r(s_{t'},a_{t'})$
3.1 Monte Carlo evaluation with function approximation
the single-rollout estimate $V^\pi(s_t)\approx\sum^T_{t'=t}r(s_{t'},a_{t'})$ is not as good as $V^\pi(s_t)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t'=t}r(s_{t'},a_{t'})$, but still pretty good.
training data $\{(s_{i,t},\underbrace{\sum^T_{t'=t}r(s_{i,t'},a_{i,t'})}_{y_{i,t}})\}$, fit by supervised regression: $\mathcal{L}(\phi)=\frac{1}{2}\sum_i\lVert\hat{V}_\phi^\pi(s_i)-y_i\rVert^2$
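a minimal sketch (mine, not from the lecture) of this regression in PyTorch; `v_net`, `states`, and `targets` are assumed names, with `targets` being the reward-to-go labels $y_{i,t}$ computed from sampled trajectories:

```python
# Hedged sketch: fit V^pi_phi by supervised regression on Monte Carlo returns.
# `v_net` (an nn.Module mapping states to a scalar), `states`, and `rewards`
# are assumed names for illustration.
import torch

def mc_returns(rewards):
    """Reward-to-go sum_{t'=t}^T r_{t'} for one trajectory (no discount here)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + running
        out.append(running)
    return list(reversed(out))

def fit_value_mc(v_net, states, targets, lr=1e-3, epochs=50):
    """states: FloatTensor [M, obs_dim]; targets: FloatTensor [M] of MC returns."""
    opt = torch.optim.Adam(v_net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = 0.5 * ((v_net(states).squeeze(-1) - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```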
3.2 do better?
ideal target: $y_{i,t}=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}]\approx r(s_{i,t},a_{i,t})+\sum^T_{t'=t+1}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t+1}]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat{V}_\phi^\pi(s_{i,t+1})$, i.e. apply the same trick again!
training data: $\{(s_{i,t},\underbrace{r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})}_{y_{i,t}})\}$
same supervised regression: $\mathcal{L}(\phi)=\frac{1}{2}\sum_i\lVert\hat{V}_\phi^\pi(s_i)-y_i\rVert^2$
- from evaluation to Actor Critic
discount factors: if $T\text{ (episode length)}\to\infty$, $\hat{V}^\pi_\phi$ can get infinitely large in many cases
4.1 simple trick: better to get rewards sooner than later, $y_{i,t}\approx r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})$, where $\gamma\in[0,1]$ is the discount factor (0.99 works well)
with critic ($\hat{V}^\pi_\phi$): $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log{\pi_\theta(a_{i,t}|s_{i,t})}\overbrace{(r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})-\hat{V}^\pi_\phi(s_{i,t}))}^{\hat{A}^\pi(s_{i,t},a_{i,t})}$
4.2 what about policy gradients?
(1) option 1: apply the causality trick first, then multiply in the discount (later actions cannot affect earlier rewards)
$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log{\pi_\theta(a_{i,t}|s_{i,t})}\left(\sum^T_{t'=t}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$
(2) option 2: multiply in the discount first, then apply the causality trick
$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\left(\sum^T_{t=1}\nabla_\theta\log{\pi_\theta(a_{i,t}|s_{i,t})}\right)\left(\sum^T_{t'=1}\gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right)$
$\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log{\pi_\theta(a_{i,t}|s_{i,t})}\left(\sum^T_{t'=t}\gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right)$
$\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\gamma^{t-1}\nabla_\theta\log{\pi_\theta(a_{i,t}|s_{i,t})}\left(\sum^T_{t'=t}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$
in option 2, $\gamma^{t-1}$ can be read as the probability of still being alive at step $t$; later steps don't matter if you're dead. In practice option 1 is usually used
4.3 Actor-critic algorithms (with discount)
4.3.1 batch actor-critic algorithm
(1) sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on the robot)
(2) fit $\hat{V}^\pi_\phi(s)$ to sampled reward sums
(3) evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\gamma\hat{V}^\pi_\phi(s_i')-\hat{V}^\pi_\phi(s_i)$
(4) $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log{\pi_\theta(a_i|s_i)}\hat{A}^\pi(s_i,a_i)$
(5) $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$
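a hedged sketch of one iteration of this loop for discrete actions; `actor`, `critic`, the optimizers, and the batch tensors (including `reward_to_go` for step (2)) are assumed names, not from the lecture:

```python
# Sketch of one batch actor-critic iteration (discrete actions, PyTorch).
import torch

def batch_actor_critic_step(actor, critic, actor_opt, critic_opt,
                            states, actions, rewards, next_states, dones,
                            reward_to_go, gamma=0.99):
    # (2) fit V_phi to the sampled reward sums (Monte Carlo targets)
    v = critic(states).squeeze(-1)
    critic_loss = 0.5 * ((v - reward_to_go) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # (3) A(s,a) = r + gamma * V(s') - V(s), with V(s') = 0 at terminal states
    with torch.no_grad():
        adv = rewards + gamma * (1 - dones) * critic(next_states).squeeze(-1) \
              - critic(states).squeeze(-1)

    # (4)-(5) advantage-weighted policy gradient step
    log_prob = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
    actor_loss = -(log_prob * adv).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```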
4.3.2 online actor-critic algorithm (update per step)
(1) take action $a\sim\pi_\theta(a|s)$, get $(s,a,s',r)$
(2) update $\hat{V}^\pi_\phi$ using target $r+\gamma\hat{V}^\pi_\phi(s')$
(3) evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}^\pi_\phi(s')-\hat{V}^\pi_\phi(s)$
(4) $\nabla_\theta J(\theta)\approx\nabla_\theta\log{\pi_\theta(a|s)}\hat{A}^\pi(s,a)$
(5) $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$
training the $\hat{V}^\pi_\phi$ (critic) and $\pi_\theta$ (actor) networks separately is simple & stable, but there are no shared features between actor & critic
also, as usual with SGD, single-sample updates have very high variance, so steps (2) & (4) work best with a batch (e.g., parallel workers)
synchronized parallel actor-critic: run several workers in lockstep, update per step
asynchronous parallel actor-critic: run several workers asynchronously (each may be at a different step); compared to the synchronized version you don't need as many simultaneous rollouts, and updates happen once enough data has been collected; the downside is that the collected data may come from different (older) parameters
once we're asynchronous anyway, we might as well drop the parallelism altogether, giving off-policy actor-critic with a replay buffer:
(1) take action $a\sim\pi_\theta(a|s)$, get $(s,a,s',r)$ and store it in $\mathcal{R}$
(2) sample a batch $\{s_i,a_i,r_i,s_i'\}$ from buffer $\mathcal{R}$
(3) update $\hat{V}^\pi_\phi$ using targets $y_i=r_i+\gamma\hat{V}^\pi_\phi(s_i')$ for each $s_i$
$\mathcal{L}(\phi)=\frac{1}{N}\sum_i\lVert\hat{V}^\pi_\phi(s_i)-y_i\rVert^2$, where $N$ is the batch size
(4) evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\gamma\hat{V}^\pi_\phi(s_i')-\hat{V}^\pi_\phi(s_i)$
(5) $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log{\pi_\theta(a_i|s_i)}\hat{A}^\pi(s_i,a_i)$
(6) $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$
problem: $s_i'$ is not the result of an action the latest actor would have taken; since the policy has been updated, $\hat{V}^\pi_\phi(s_i')$ no longer reflects the value of the current policy, and likewise $a_i$ is not an action the current policy would sample at $s_i$, so it cannot be used to compute the current policy's gradient
4.3.3 fixing the value function
update $\hat{Q}^\pi_\phi$ using targets $y_i=r_i+\gamma\hat{V}^\pi_\phi(s_i')$ for each $s_i$, $a_i$
$\mathcal{L}(\phi)=\frac{1}{N}\sum_i\lVert\hat{Q}^\pi_\phi(s_i,a_i)-y_i\rVert^2$
recall: $V^\pi(s_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]=\mathbb{E}_{a_t\sim\pi(a_t|s_t)}[Q^\pi(s_t,a_t)]$, so we can sample $a_i'\sim\pi_\theta(a_i'|s_i')$ and use $\hat{V}^\pi_\phi(s_i')=\hat{Q}^\pi_\phi(s_i',a_i')$
4.3.4 fixing the policy update
use the same trick: sample $a_i^\pi\sim\pi_\theta(a|s_i)$, then $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log{\pi_\theta(a_i^\pi|s_i)}\hat{A}^\pi(s_i,a_i^\pi)$
in practice: $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log{\pi_\theta(a_i^\pi|s_i)}\hat{Q}^\pi(s_i,a_i^\pi)$, higher variance but convenient
why is higher variance OK here? reducing variance still requires more samples, but here "sampling" only means drawing extra actions from the policy, not running more simulation
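a rough sketch of one off-policy update with the two fixes above (Q critic with $a_i'$ sampled at $s_i'$, and fresh actions $a_i^\pi$ for the actor); `q_net` maps a state to a vector of Q-values for discrete actions, and all names are assumptions for illustration:

```python
# Hedged sketch of an off-policy actor-critic step with a replay-buffer batch.
import torch

def off_policy_ac_step(actor, q_net, actor_opt, q_opt, batch, gamma=0.99):
    s, a, r, s2, done = batch  # sampled from the replay buffer R

    # critic: y_i = r_i + gamma * Q_phi(s_i', a_i'),  a_i' ~ pi_theta(.|s_i')
    with torch.no_grad():
        a2 = torch.distributions.Categorical(logits=actor(s2)).sample()
        y = r + gamma * (1 - done) * q_net(s2).gather(1, a2.unsqueeze(1)).squeeze(1)
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q_loss = ((q - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # actor: sample a^pi ~ pi_theta(.|s_i) and weight log-prob by Q(s_i, a^pi)
    dist = torch.distributions.Categorical(logits=actor(s))
    a_pi = dist.sample()
    with torch.no_grad():
        q_pi = q_net(s).gather(1, a_pi.unsqueeze(1)).squeeze(1)
    actor_loss = -(dist.log_prob(a_pi) * q_pi).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```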
4.3.5 control variates: action-dependent baselines
$\hat{Q}_{i,t}$ and $Q^\pi_\phi(s_t,a_t)$ are highly correlated, so if the latter could be subtracted as a baseline the variance would drop a lot. But because this baseline depends on the action, its contribution does not vanish in expectation, so subtracting it alone would introduce bias; the correction term $\nabla_\theta \mathbb{E}[Q_\phi^\pi]$ must be added back.
i.e. $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log{\pi_\theta(a_{i,t}|s_{i,t})}\left(\hat{Q}_{i,t}-Q^\pi_\phi(s_{i,t},a_{i,t})\right)+\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\mathbb{E}_{a\sim\pi_\theta(a_t|s_{i,t})}[Q^\pi_\phi(s_{i,t},a_t)]$
4.4 n-step returns
$\hat{A}^\pi_{\text{C}}(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}^\pi_\phi(s_{t+1})-\hat{V}^\pi_\phi(s_t)$: lower variance (a learned value estimate replaces the sampled $\sum r$), but higher bias if the value estimate is wrong
(Monte Carlo) $\hat{A}^\pi_{\text{MC}}(s_t,a_t)=\sum^\infty_{t'=t}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}^\pi_\phi(s_t)$: no bias, but higher variance (a single-sample estimate over a long sequence accumulates randomness)
Can we combine the two to control the bias/variance tradeoff? n-step returns: cut the return estimate off before the variance gets too big (choosing $n>1$ often works better)
$\hat{A}^\pi_n(s_t,a_t)=\sum^{t+n}_{t'=t}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}^\pi_\phi(s_t)+\gamma^n\hat{V}^\pi_\phi(s_{t+n})$
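a small numpy sketch of the n-step estimator, using the common convention of summing $n$ real rewards before bootstrapping; `rewards` (length $T$) and `values` (length $T+1$, last entry being the value after the final step, or 0 if terminal) are assumed inputs:

```python
# Hedged sketch: n-step advantage estimates for a single trajectory.
import numpy as np

def n_step_advantages(rewards, values, gamma=0.99, n=5):
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(t + n, T)
        # discounted sum of up to n real rewards
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        # bootstrap with the value estimate at the cut-off
        ret += gamma ** (horizon - t) * values[horizon]
        adv[t] = ret - values[t]
    return adv
```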
4.5 Generalized advantage estimation
$\hat{A}^\pi_{\text{GAE}}(s_t,a_t)=\sum^\infty_{n=1}w_n\hat{A}^\pi_n(s_t,a_t)=\sum^\infty_{t'=t}(\gamma\lambda)^{t'-t}\delta_{t'}$, where $\delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}^\pi_\phi(s_{t'+1})-\hat{V}^\pi_\phi(s_{t'})$
weighted combination of n-step returns with exponential falloff $w_n\propto\lambda^{n-1}$; the factor $(\gamma\lambda)^{t'-t}$ acts like a discount
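the GAE sum can be computed in one backward pass via the recursion $\hat{A}_t=\delta_t+\gamma\lambda\hat{A}_{t+1}$; a hedged numpy sketch, with `rewards` and `values` as in the n-step sketch above:

```python
# Sketch: generalized advantage estimation via the backward recursion.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```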
lec 7, 2026/3/21, value function
1.1 value-based method, policy iteration for discrete actions
step 1: evaluate $A^\pi(s,a)$
step 2: set $\pi\leftarrow\pi'$
$\pi'(a_t|s_t)=\begin{cases}1\text{ if }a_t=\arg\max_{a_t}A^\pi(s_t,a_t)\\0\text{ otherwise}\end{cases}$, at least as good as $\pi$
as before: $A^\pi(s,a)=r(s,a)+\gamma\mathbb{E}[V^\pi(s')]-V^\pi(s)$, so let's evaluate $V^\pi(s)$
1.2 Dynamic programming
let's assume we know $p(s'|s,a)$, and that $s$ and $a$ are both discrete (and small), i.e. a tabular MDP: e.g. 16 states, 4 actions per state; we can store the full $V^\pi(s)$ in a table, and the transition operator $\mathcal{T}$ is a $16\times16\times4$ tensor
bootstrapped update: $V^\pi(s)\leftarrow\mathbb{E}_{a\sim\pi(a|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim p(s'|s,a)}[V^\pi(s')]]$, where $\mathbb{E}_{s'\sim p(s'|s,a)}[V^\pi(s')]$ just uses the current estimate of $V^\pi$
$\pi'(a_t|s_t)=\begin{cases}1\text{ if }a_t=\arg\max_{a_t}A^\pi(s_t,a_t)\\0\text{ otherwise}\end{cases}$ is a deterministic policy $\pi(s)=a$
simplified: $V^\pi(s)\leftarrow r(s,\pi(s))+\gamma\mathbb{E}_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]$
the above can be written as $v=r^\pi+\gamma P^\pi v$, where $P^\pi$ is the state-transition matrix under $\pi$; rearranging gives $(I-\gamma P^\pi)v=r^\pi$, i.e. $v=(I-\gamma P^\pi)^{-1}r^\pi$, which can be solved with any linear equation solver
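a tiny numpy example of that exact solve; `P_pi` ($P^\pi[s,s']=p(s'|s,\pi(s))$) and `r_pi` ($r^\pi[s]=r(s,\pi(s))$) are assumed to be given for the current deterministic policy:

```python
# Sketch: exact tabular policy evaluation by solving (I - gamma * P^pi) v = r^pi.
import numpy as np

def policy_evaluation_exact(P_pi, r_pi, gamma=0.99):
    S = P_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```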
1.3 value iteration algorithm (even simpler dynamic programming)
$\arg\max_{a_t}A^\pi(s_t,a_t)=\arg\max_{a_t}Q^\pi(s_t,a_t)$
$Q^\pi(s,a)=r(s,a)+\gamma\mathbb{E}[V^\pi(s')]$ (a bit simpler than $A^\pi$)
$\arg\max_a Q(s,a)\to\text{policy}$: the max over $Q$ approximates the new policy's value, so we can skip the explicit policy and compute values directly
step 1: set $Q(s,a)\leftarrow r(s,a)+\gamma\mathbb{E}[V(s')]$
step 2: set $V(s)\leftarrow\max_a{Q(s,a)}$ (since the policy picks the $\arg\max_a Q$ action with probability 1, the expectation collapses to a max)
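a minimal tabular sketch of these two steps, assuming a reward table `R[s, a]` and transition tensor `P[s, a, s']` as in the 16-state, 4-action example:

```python
# Sketch: tabular value iteration.
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V        # step 1: Q(s,a) = r(s,a) + gamma * E[V(s')]
        V = Q.max(axis=1)            # step 2: V(s) = max_a Q(s,a)
    policy = Q.argmax(axis=1)        # greedy policy from the final Q
    return V, policy
```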
2.1 Fitted value iteration
a big table hits the curse of dimensionality, so use a neural network $V:\mathcal{S}\to\mathbb{R}$
$\mathcal{L}(\phi)=\frac{1}{2}\lVert V_\phi(s)-\max_a{Q^\pi(s,a)}\rVert^2$
step 1: set $y_i\leftarrow\max_{a_i}{(r(s_i,a_i)+\gamma\mathbb{E}[V_\phi(s_i')])}$
step 2: set $\phi\leftarrow\arg\min_\phi{\frac{1}{2}\sum_i\lVert V_\phi(s_i)-y_i\rVert^2}$
2.1.1 without knowing the transitions
but $\mathbb{E}[V_\phi(s_i')]$ requires knowing the outcomes of different actions, i.e. the transition dynamics
$Q^\pi(s,a)\leftarrow r(s,a)+\gamma\mathbb{E}_{s'\sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]$ only needs sampled $(s,a)$ pairs (and the resulting $s'$), not the model
because $\pi'(a|s)=1$ if $a=\arg\max_a Q$, the expectation over $V$ collapses to a max; apply the same trick again
step 1: set $y_i\leftarrow r(s_i,a_i)+\gamma\mathbb{E}[V_\phi(s_i')]$, approximating $\mathbb{E}[V(s_i')]\approx\max_{a_i'}{Q_\phi(s_i',a_i')}$
step 2: set $\phi\leftarrow\arg\min_\phi{\frac{1}{2}\sum_i\lVert Q_\phi(s_i,a_i)-y_i\rVert^2}$
2.2 full fitted Q-iteration algorithm
step 1: collect dataset $\{(s_i,a_i,s_i',r_i)\}$ using some policy (parameters: dataset size $N$, collection policy)
repeat $\times K$: $\begin{cases}\text{step 2: set }y_i\leftarrow r(s_i,a_i)+\gamma\max_{a_i'}Q_\phi(s_i',a_i')\\\text{step 3: set }\phi\leftarrow\arg\min_\phi{\frac{1}{2}\sum_i\lVert Q_\phi(s_i,a_i)-y_i\rVert^2}\text{ (parameters: gradient steps }S)\end{cases}$ (parameters: iterations $K$)
error $\varepsilon=\frac{1}{2}\mathbb{E}_{(s,a)\sim\beta}\left[\left(Q_\phi(s,a)-[r(s,a)+\gamma\max_{a'}Q_\phi(s',a')]\right)^2\right]$
if $\varepsilon=0$, then $Q_\phi(s,a)=r(s,a)+\gamma\max_{a'}Q_\phi(s',a')$; this is the optimal Q-function, corresponding to the optimal policy $\pi'(a_t|s_t)=\begin{cases}1\text{ if }a_t=\arg\max_{a_t}Q_\phi(s_t,a_t)\\0\text{ otherwise}\end{cases}$, which maximizes reward; sometimes written $Q^*$ and $\pi^*$
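a hedged sketch of the repeated steps 2-3 for discrete actions, assuming a network `q_net(s)` returning one Q-value per action and a previously collected dataset of tensors:

```python
# Sketch: full fitted Q-iteration inner loop (off-policy data is fine).
import torch

def fitted_q_iteration(q_net, opt, dataset, gamma=0.99, K=10, grad_steps=100):
    s, a, r, s2, done = dataset            # step 1: previously collected transitions
    for _ in range(K):
        with torch.no_grad():              # step 2: y_i = r + gamma * max_a' Q(s', a')
            y = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
        for _ in range(grad_steps):        # step 3: regress Q(s_i, a_i) onto y_i
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = 0.5 * ((q - y) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
```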
3.1 Online Q-learning algorithm
step 1: take some action $a_i$ and observe $(s_i,a_i,s_i',r_i)$
step 2: $y_i=r(s_i,a_i)+\gamma\max_{a_i'}Q_\phi(s_i',a_i')$
step 3: $\phi\leftarrow\phi-\alpha\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i)-y_i)$
inject some additional randomness to produce better exploration, e.g. $\varepsilon$-greedy $\pi(a_t|s_t)=\begin{cases}1-\varepsilon\text{ if }a_t=\arg\max_{a_t}{Q_\phi(s_t,a_t)}\\\varepsilon/(|\mathcal{A}|-1)\text{ otherwise}\end{cases}$ or Boltzmann exploration $\pi(a_t|s_t)\propto\exp{(Q_\phi(s_t,a_t))}$
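a rough sketch of one online step with $\varepsilon$-greedy exploration; `q_net` (state tensor in, Q-vector out) and the older gym-style 4-tuple `env.step()` are assumptions for illustration:

```python
# Sketch: one online Q-learning step with epsilon-greedy action selection.
import random
import torch

def epsilon_greedy(q_net, s, epsilon=0.1):
    with torch.no_grad():
        q = q_net(s)
    if random.random() < epsilon:
        return random.randrange(q.shape[-1])
    return int(q.argmax())

def online_q_step(q_net, opt, env, s, epsilon=0.1, gamma=0.99):
    a = epsilon_greedy(q_net, s, epsilon)                  # step 1: act and observe
    s2, r, done, info = env.step(a)
    with torch.no_grad():                                  # step 2: target y_i
        y = r + gamma * (0.0 if done else q_net(s2).max())
    loss = 0.5 * (q_net(s)[a] - y) ** 2                    # step 3: gradient step
    opt.zero_grad(); loss.backward(); opt.step()
    return s2, done
```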
lec 8, 2026/3/23, Deep RL with Q-Functions
1.1 Correlated samples in online Q-learning
in the online Q-learning algorithm, sequential states are strongly correlated and the target value is always changing
solution: synchronized parallel Q-learning / asynchronous parallel Q-learning / replay buffers
1.2 full Q-learning with replay buffer
samples are no longer correlated, and multiple samples in the batch give a low-variance gradient ($\frac{\sigma^2}{N}$):
step 1: collect dataset $\{(s_i,a_i,s_i',r_i)\}$ using some policy, add it to $\mathcal{B}$
repeat $\times K$: $\begin{cases}\text{step 2: sample a batch }(s_i,a_i,s_i',r_i)\text{ from }\mathcal{B}\\\text{step 3: }\phi\leftarrow\phi-\alpha\sum_i\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i)-[r(s_i,a_i)+\gamma\max_{a_i'}Q_\phi(s_i',a_i')])\end{cases}$
$K=1$ is common, though larger $K$ is more efficient
but while updating $\phi$, the target value $r(s_i,a_i)+\gamma\max_{a_i'}Q_\phi(s_i',a_i')$ is also changing; we want the target to stay fixed while we fit, so introduce a target network $\phi'$
2.1 Q-Learning with replay buffer and target networks
step 1: save target network parameters: $\phi'\leftarrow\phi$
repeat $\times N$: $\begin{cases}\text{step 2: collect dataset }\{(s_i,a_i,s_i',r_i)\}\text{ using some policy, add it to }\mathcal{B}\\\text{repeat }\times K\begin{cases}\text{step 3: sample a batch }(s_i,a_i,s_i',r_i)\text{ from }\mathcal{B}\\\text{step 4: }\phi\leftarrow\phi-\alpha\sum_i\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i)-[r(s_i,a_i)+\gamma\max_{a_i'}Q_{\phi'}(s_i',a_i')])\end{cases}\end{cases}$
so targets don't change in inner loop
"classic" deep Q-learning algorithm: K=1K=1K=1
2.2 Alternative target network
copy $\phi'\leftarrow\phi$ every $N$ steps: right after the copy, $\phi'$ exactly equals $\phi$; just before the next copy, $\phi'$ is maximally stale, which makes the lag weirdly uneven
popular alternative (similar to Polyak averaging), a small update every step: $\phi'\leftarrow\tau\phi'+(1-\tau)\phi$ with $\tau=0.999$ works well
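a tiny sketch of that per-step update, assuming both networks are torch modules with matching parameters:

```python
# Sketch: Polyak-style target update phi' <- tau * phi' + (1 - tau) * phi.
import torch

@torch.no_grad()
def polyak_update(target_net, q_net, tau=0.999):
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(tau).add_((1.0 - tau) * p)
```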
3.1 Overestimation in Q-learning
$\max_{a_j'}Q_{\phi'}(s_j',a_j')$ overestimates the next value: $\mathbb{E}[\max(X_1,X_2)]\ge\max(\mathbb{E}[X_1],\mathbb{E}[X_2])$
note that $\max_{a'}Q_{\phi'}(s',a')=Q_{\phi'}(s',\arg\max_{a'}Q_{\phi'}(s',a'))$: both the selected action and its value come from $Q_{\phi'}$
the target value is overestimated, so after the update Q grows; a larger Q makes the $\max$ overestimate even more, a self-reinforcing loop
if the noise in these is decorrelated, the problem goes away
idea: don't use the same network to choose the action and evaluate value
3.2 Double Q-learning
use two networks: $Q_{\phi_A}(s,a)\leftarrow r+\gamma Q_{\phi_B}(s',\arg\max_{a'}Q_{\phi_A}(s',a'))$, $Q_{\phi_B}(s,a)\leftarrow r+\gamma Q_{\phi_A}(s',\arg\max_{a'}Q_{\phi_B}(s',a'))$
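a small sketch of those two targets: each network selects the argmax action with itself but evaluates it with the other network; `q_a` / `q_b` (state in, Q-vector out) and the batch tensors are assumed names:

```python
# Sketch: double Q-learning targets for two networks Q_A and Q_B.
import torch

@torch.no_grad()
def double_q_targets(q_a, q_b, r, s2, done, gamma=0.99):
    a_star_a = q_a(s2).argmax(dim=1, keepdim=True)   # action chosen by Q_A
    a_star_b = q_b(s2).argmax(dim=1, keepdim=True)   # action chosen by Q_B
    y_a = r + gamma * (1 - done) * q_b(s2).gather(1, a_star_a).squeeze(1)
    y_b = r + gamma * (1 - done) * q_a(s2).gather(1, a_star_b).squeeze(1)
    return y_a, y_b
```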
3.3 Q-learning with N-step return
$y_{j,t}=\sum^{t+N-1}_{t'=t}\gamma^{t'-t}r_{j,t'}+\gamma^N\max_{a_{j,t+N}}Q_{\phi'}(s_{j,t+N},a_{j,t+N})$
less biased target values when Q-values are inaccurate (the possibly-wrong bootstrap term is discounted by $\gamma^N$, so real rewards dominate the target), typically faster learning especially early on, but only actually correct when learning on-policy
the problem is that the rewards in the middle of the N steps come from actions chosen by an older policy, not the actions your current policy would have taken (not an issue when $N=1$)
solutions: just ignore the mismatch; cut the trace, i.e. dynamically choose $N$ to use only on-policy data (works well when the data is mostly on-policy and the action space is small); or importance sampling ($w=\frac{\pi_{\text{new}}(a|s)}{\pi_{\text{old}}(a|s)}$)
4.1 Q-learning with continuous actions
how do we perform $a_t=\arg\max_{a_t}Q_\phi(s_t,a_t)$ and evaluate $\max_{a_j'}Q_{\phi'}(s_j',a_j')$ in the target?
option 1: optimization
gradient-based optimization (e.g., SGD) is a bit slow in the inner loop
the action space is typically low-dimensional, so stochastic optimization (e.g., sampling candidate actions and taking the best) also works well
option 2: use function class that is easy to optimize
option 3: learn an approximate maximizer: train another network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx\arg\max_aQ_\phi(s,a)$
DDPG:
step 1: take some action $a_i$ and observe $(s_i,a_i,s_i',r_i)$, add it to $\mathcal{B}$
step 2: sample a mini-batch $\{s_j,a_j,s_j',r_j\}$ from $\mathcal{B}$ uniformly
step 3: compute $y_j=r_j+\gamma Q_{\phi'}(s_j',\mu_{\theta'}(s_j'))$ using target nets $Q_{\phi'}$ and $\mu_{\theta'}$
step 4: $\phi\leftarrow\phi-\alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a_j)(Q_\phi(s_j,a_j)-y_j)$
step 5: $\theta\leftarrow\theta+\beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,\mu(s_j))$
step 6: update $\phi'$ and $\theta'$ (e.g., with Polyak averaging)
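a hedged sketch of steps 3-5 (critic and actor updates); the actor `mu`, critic `q_net(s, a)`, their target copies, and the sampled batch are all assumed names, and step 6 reuses the Polyak update from lec 8, 2.2:

```python
# Sketch: DDPG critic regression and deterministic-policy-gradient actor step.
import torch

def ddpg_update(mu, q_net, mu_target, q_target, mu_opt, q_opt, batch, gamma=0.99):
    s, a, r, s2, done = batch
    # step 3: y_j = r_j + gamma * Q_phi'(s_j', mu_theta'(s_j'))
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s2, mu_target(s2)).squeeze(-1)
    # step 4: regress Q_phi(s_j, a_j) onto y_j
    q_loss = ((q_net(s, a).squeeze(-1) - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # step 5: ascend dmu/dtheta * dQ/da, i.e. maximize Q(s, mu(s)) w.r.t. theta
    actor_loss = -q_net(s, mu(s)).mean()
    mu_opt.zero_grad(); actor_loss.backward(); mu_opt.step()
```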
5.1 simple practical tips for Q-learning
large replay buffers help improve stability
start with high exploration (epsilon) and gradually reduce
Bellman error gradients can be big; clip gradients or use the Huber loss (see the sketch after this list)
Double Q-learning and N-step returns help a lot
schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too
run multiple random seeds
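an illustration of the Huber-loss-plus-clipping tip above; `q_net`, the optimizer, and the prediction/target tensors are assumed names:

```python
# Sketch: Huber (smooth L1) Bellman loss with gradient-norm clipping.
import torch

def clipped_huber_step(q_net, opt, q_pred, y, max_grad_norm=10.0):
    loss = torch.nn.functional.smooth_l1_loss(q_pred, y)   # Huber loss on the Bellman error
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_grad_norm)
    opt.step()
```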