cs2385_note1 (lec6-lec8)

lec 6, 2026/3/17-19, Actor-Critic Algorithms
  1. State & state-action value functions

$Q^\pi(s_t,a_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t,a_t]$: total reward from taking $a_t$ in $s_t$

$V^\pi(s_t)=\mathbb{E}_{a_t\sim\pi_\theta(a_t|s_t)}[Q^\pi(s_t,a_t)]$: total reward from $s_t$

$A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)$: how much better $a_t$ is

$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\,A^\pi(s_{i,t},a_{i,t})$: the better the estimate of $A^\pi$, the lower the variance of this gradient

  2. Value function fitting

$\sum_{t'=t}^T r(s_{t'},a_{t'}) = r(s_t,a_t) + \sum_{t'=t+1}^T r(s_{t'},a_{t'})$

$Q^\pi(s_t,a_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t,a_t]=r(s_t,a_t)+\sum^T_{t'=t+1}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t,a_t]=r(s_t,a_t)+\mathbb{E}_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}[V^\pi(s_{t+1})]\approx r(s_t,a_t)+V^\pi(s_{t+1})$

$\sum^T_{t'=t+1}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t,a_t]$ can be read as two nested expectations: the outer one is over which $s_{t+1}$ the environment transitions to after taking $a_t$, and the inner one is the expected return of following policy $\pi$ from $s_{t+1}$ to the end, which is exactly the definition of $V^\pi(s_{t+1})$.

$Q^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})$, $A^\pi(s_t,a_t)\approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)$

  3. Policy evaluation

$V^\pi(s_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t]$

$J(\theta)=\mathbb{E}[\sum^T_{t=1}r(s_t,a_t)]=\mathbb{E}_{s_1\sim p(s_1)}[V^\pi(s_1)]$

perform Monte Carlo policy evaluation (just like policy gradient does): $V^\pi(s_t)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t'=t}r(s_{t'},a_{t'})$

3.1 Monte Carlo evaluation with function approximation

the single-rollout estimate $V^\pi(s_t)\approx\sum^T_{t'=t}r(s_{t'},a_{t'})$ is not as good as $V^\pi(s_t)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t'=t}r(s_{t'},a_{t'})$, but still pretty good.

training data $\{(s_{i,t},\underbrace{\sum^T_{t'=t}r(s_{i,t'},a_{i,t'})}_{y_{i,t}})\}$, then do supervised regression: $\mathcal{L}(\phi)=\frac{1}{2}\sum_i\lVert\hat{V}_\phi^\pi(s_i)-y_i\rVert^2$
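A minimal sketch of this Monte Carlo value regression in PyTorch; the 4-dimensional observation, network sizes, learning rate, and the `reward_to_go` helper are illustrative assumptions, not values from the lecture:

```python
# Monte Carlo value-function fitting: regress V_phi(s_{i,t}) onto reward-to-go targets y_{i,t}.
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V_phi, obs_dim assumed to be 4
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def reward_to_go(rewards):
    """y_{i,t} = sum_{t'=t..T} r_{i,t'} for one trajectory (undiscounted, as above)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + running
        out.append(running)
    return list(reversed(out))

def fit_value_function(trajectories, epochs=50):
    """trajectories: list of (states, rewards) pairs with states shaped (T, 4)."""
    states = torch.cat([torch.as_tensor(s, dtype=torch.float32) for s, _ in trajectories])
    targets = torch.tensor([y for _, r in trajectories for y in reward_to_go(r)],
                           dtype=torch.float32)
    for _ in range(epochs):
        loss = 0.5 * ((value_net(states).squeeze(-1) - targets) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```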

3.2 do better?

ideal target: $y_{i,t}=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_{i,t}]\approx r(s_{i,t},a_{i,t})+\sum^T_{t'=t+1}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_{i,t+1}]\approx r(s_{i,t},a_{i,t})+V^\pi(s_{i,t+1})\approx r(s_{i,t},a_{i,t})+\hat{V}_\phi^\pi(s_{i,t+1})$, the same trick again!

training data: $\{(s_{i,t},\underbrace{r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})}_{y_{i,t}})\}$

same supervised regression: $\mathcal{L}(\phi)=\frac{1}{2}\sum_i\lVert\hat{V}_\phi^\pi(s_i)-y_i\rVert^2$

  4. From evaluation to actor-critic

discount factors: if $T$ (the episode length) $\to\infty$, $\hat{V}^\pi_\phi$ can get infinitely large in many cases

4.1 simple trick: better to get rewards sooner than later, $y_{i,t}\approx r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})$, where $\gamma\in[0,1]$ is the discount factor (0.99 works well)

with critic ($\hat{V}^\pi_\phi$): $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\overbrace{(r(s_{i,t},a_{i,t})+\gamma\hat{V}^\pi_\phi(s_{i,t+1})-\hat{V}^\pi_\phi(s_{i,t}))}^{\hat{A}^\pi(s_{i,t},a_{i,t})}$

4.2 what about policy gradients?

(1) option 1: apply the causality trick first, then multiply in the discount (the policy at later steps has no effect on earlier rewards)

$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum^T_{t'=t}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$

(2) option 2: multiply in the discount first, then apply the causality trick

$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\left(\sum^T_{t=1}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum^T_{t'=1}\gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right)$

$\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum^T_{t'=t}\gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right)$

$\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\gamma^{t-1}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum^T_{t'=t}\gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$

$\gamma^{t-1}$ can be interpreted as the probability of surviving to step $t$: later steps don't matter if you're dead. In practice option 1 is usually chosen.

4.3 Actor-critic algorithms (with discount)

4.3.1 batch actor-critic algorithm

(1) sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on the robot)

(2) fit $\hat{V}^\pi_\phi(s)$ to sampled reward sums

(3) evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\gamma\hat{V}^\pi_\phi(s_i')-\hat{V}^\pi_\phi(s_i)$

(4) $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log\pi_\theta(a_i|s_i)\hat{A}^\pi(s_i,a_i)$

(5) $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$
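A hedged sketch of steps (3)-(5) for a discrete-action policy, assuming the critic has already been fit as in the regression sketch above; `obs_dim`, `n_actions`, layer sizes, and the learning rate are placeholder assumptions:

```python
# One batch actor-critic policy update: advantage from the critic, then a policy-gradient step.
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # assumed problem sizes
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))  # logits of pi_theta
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))         # V_phi
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def actor_update(states, actions, rewards, next_states, dones):
    """states/next_states: (B, obs_dim); actions: (B,) long; rewards/dones: (B,) float."""
    with torch.no_grad():                                          # step (3): advantage estimate
        v = critic(states).squeeze(-1)
        v_next = critic(next_states).squeeze(-1) * (1.0 - dones)   # don't bootstrap past terminals
        advantage = rewards + gamma * v_next - v                   # A_hat(s_i, a_i)
    log_prob = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
    loss = -(log_prob * advantage).mean()                          # steps (4)-(5): ascend grad J(theta)
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```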

4.3.2 online actor-critic algorithm (update per step)

(1) take action $a\sim\pi_\theta(a|s)$, get $(s,a,s',r)$

(2) update $\hat{V}^\pi_\phi$ using target $r+\gamma\hat{V}^\pi_\phi(s')$

(3) evaluate $\hat{A}^\pi(s,a)=r(s,a)+\gamma\hat{V}^\pi_\phi(s')-\hat{V}^\pi_\phi(s)$

(4) $\nabla_\theta J(\theta)\approx\nabla_\theta\log\pi_\theta(a|s)\hat{A}^\pi(s,a)$

(5) $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$

training $\hat{V}^\pi_\phi$ (critic) and $\pi_\theta$ (actor) as two separate networks is simple & stable, but then there are no shared features between actor & critic

also, as we know from SGD, a single-sample update has very high variance, so steps (2) & (4) work best with a batch (e.g., parallel workers)

synchronized parallel actor-critic: run multiple trajectories in parallel in lockstep, update per step

asynchronous parallel actor-critic: run several trajectories asynchronously (each worker may be at a different step); compared with the synchronous version it needs fewer simultaneous workers and updates once enough data has been collected; the drawback is that the collected data may come from slightly different (older) parameters

since we are already asynchronous, we might as well drop the parallelism entirely; off-policy actor-critic with a replay buffer:

(1) take action $a\sim\pi_\theta(a|s)$, get $(s,a,s',r)$ and store it in $\mathcal{R}$

(2) sample a batch $\{s_i,a_i,r_i,s_i'\}$ from buffer $\mathcal{R}$

(3) update $\hat{V}^\pi_\phi$ using targets $y_i=r_i+\gamma\hat{V}^\pi_\phi(s_i')$ for each $s_i$

$\mathcal{L}(\phi)=\frac{1}{N}\sum_i\lVert\hat{V}^\pi_\phi(s_i)-y_i\rVert^2$, where $N$ is the batch size

(4) evaluate $\hat{A}^\pi(s_i,a_i)=r(s_i,a_i)+\gamma\hat{V}^\pi_\phi(s_i')-\hat{V}^\pi_\phi(s_i)$

(5) $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log\pi_\theta(a_i|s_i)\hat{A}^\pi(s_i,a_i)$

(6) $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$

problem: $s_i'$ is not the result of taking action $a_i$ with the latest actor; because the policy has since been updated, the target built from $\hat{V}^\pi_\phi(s_i')$ is no longer the value of the current policy; for the same reason $a_i$ is not an action the current policy would sample at $s_i$, so it cannot be used to compute the current policy's gradient

4.3.3 fixing the value function

update $\hat{Q}^\pi_\phi$ using targets $y_i=r_i+\gamma\hat{V}^\pi_\phi(s_i')$ for each $s_i$, $a_i$

$\mathcal{L}(\phi)=\frac{1}{N}\sum_i\lVert\hat{Q}^\pi_\phi(s_i,a_i)-y_i\rVert^2$

recall: $V^\pi(s_t)=\sum^T_{t'=t}\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})\mid s_t]=\mathbb{E}_{a_t\sim\pi(a_t|s_t)}[Q^\pi(s_t,a_t)]$, so we can sample $a_i'\sim\pi_\theta(a_i'|s_i')$ and use $\hat{V}^\pi_\phi(s_i')=\hat{Q}^\pi_\phi(s_i',a_i')$

4.3.4 fixing the policy update

use the same trick: sample $a_i^\pi\sim\pi_\theta(a|s_i)$, then $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log\pi_\theta(a_i^\pi|s_i)\hat{A}^\pi(s_i,a_i^\pi)$

in practice: $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_i\nabla_\theta\log\pi_\theta(a_i^\pi|s_i)\hat{Q}^\pi(s_i,a_i^\pi)$, higher variance but convenient

why is higher variance OK here? reducing it still requires more samples, but here sampling only means drawing extra actions from the policy, not running extra simulation
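Putting both fixes together, a sketch of one off-policy update from a replay-buffer batch with a discrete-action $\hat{Q}^\pi_\phi$; the per-action Q head, the layer sizes, and the omission of terminal-state handling are simplifying assumptions of mine:

```python
# Off-policy actor-critic: Q-critic targets use a' ~ pi_theta(.|s'), and the actor update
# uses fresh actions a_pi ~ pi_theta(.|s) rather than the stale buffer actions.
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))  # pi_theta (logits)
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))  # Q_phi(s, .)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def off_policy_update(s, a, r, s_next):
    """s, s_next: (B, obs_dim); a: (B,) long buffer actions; r: (B,) float."""
    # critic fix: V(s'_i) := Q(s'_i, a'_i) with a'_i drawn from the *current* policy
    with torch.no_grad():
        a_next = torch.distributions.Categorical(logits=actor(s_next)).sample()
        y = r + gamma * q_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = 0.5 * ((q_sa - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor fix: sample a_pi ~ pi_theta(.|s_i) and use Q as the (higher-variance) signal
    dist = torch.distributions.Categorical(logits=actor(s))
    a_pi = dist.sample()
    with torch.no_grad():
        q_pi = q_net(s).gather(1, a_pi.unsqueeze(1)).squeeze(1)
    actor_loss = -(dist.log_prob(a_pi) * q_pi).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```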

4.3.5 control variates: action-dependent baselines

the sampled return $\hat{Q}_{i,t}$ and $Q^\pi_\phi(s_t,a_t)$ are highly correlated, so if the latter could be subtracted as a baseline the variance would drop a lot. But its expectation is not zero, so subtracting it introduces bias unless we add back $\nabla_\theta\mathbb{E}[Q^\pi_\phi]$:

i.e. $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\left(\hat{Q}_{i,t}-Q^\pi_\phi(s_{i,t},a_{i,t})\right)+\frac{1}{N}\sum^N_{i=1}\sum^T_{t=1}\nabla_\theta\mathbb{E}_{a\sim\pi_\theta(a_t|s_{i,t})}[Q^\pi_\phi(s_{i,t},a_t)]$

4.4 n-step returns

$\hat{A}^\pi_{\text{C}}(s_t,a_t)=r(s_t,a_t)+\gamma\hat{V}^\pi_\phi(s_{t+1})-\hat{V}^\pi_\phi(s_t)$: lower variance (a learned value estimate replaces $\sum r$), but higher bias if the value estimate is wrong

(Monte Carlo) $\hat{A}^\pi_{\text{MC}}(s_t,a_t)=\sum^\infty_{t'=t}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}^\pi_\phi(s_t)$: no bias, but higher variance (a single-sample estimate of a long sequence, so randomness accumulates)

Can we combine those two to control the bias/variance tradeoff? n-step returns: cut the estimate off before the variance gets too big (choosing $n>1$ often works better)

$\hat{A}^\pi_n(s_t,a_t)=\sum^{t+n}_{t'=t}\gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}^\pi_\phi(s_t)+\gamma^n\hat{V}^\pi_\phi(s_{t+n})$

4.5 Generalized advantage estimation

$\hat{A}^\pi_{\text{GAE}}(s_t,a_t)=\sum^\infty_{n=1}w_n\hat{A}^\pi_n(s_t,a_t)=\sum^\infty_{t'=t}(\gamma\lambda)^{t'-t}\delta_{t'}$, where $\delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}^\pi_\phi(s_{t'+1})-\hat{V}^\pi_\phi(s_{t'})$

weighted combination of n-step returns with $w_n\propto\lambda^{n-1}$ (exponential falloff); $(\gamma\lambda)^{t'-t}$ plays a role similar to a discount
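A small sketch of computing GAE for one trajectory from its rewards and value estimates; the backward recursion is standard, but the exact interface and the $\lambda$ default are my own:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length T; values: length T+1 (V_phi of every state plus the state after
    the last step; set values[-1] = 0 if that state is terminal)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t
        running = delta + gamma * lam * running                  # sum of (gamma*lam)^(t'-t) * delta_t'
        advantages[t] = running
    return advantages
```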

lec 7, 2026/3/21, value function

1.1 value-based method, policy iteration for discrete actions

step 1: evaluate $A^\pi(s,a)$

step 2: set $\pi\leftarrow\pi'$

$\pi'(a_t|s_t)=\begin{cases}1&\text{if }a_t=\arg\max_{a_t}A^\pi(s_t,a_t)\\0&\text{otherwise}\end{cases}$, at least as good as $\pi$

as before: $A^\pi(s,a)=r(s,a)+\gamma\mathbb{E}[V^\pi(s')]-V^\pi(s)$, so let's evaluate $V^\pi(s)$

1.2 Dynamic programming

let's assume we know $p(s'|s,a)$, and $s$ and $a$ are both discrete (and small), i.e. a tabular MDP: e.g. 16 states, 4 actions per state, so the full $V^\pi(s)$ fits in a table and the transition operator $\mathcal{T}$ is a $16\times16\times4$ tensor

bootstrapped update: $V^\pi(s)\leftarrow\mathbb{E}_{a\sim\pi(a|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim p(s'|s,a)}[V^\pi(s')]]$, where $\mathbb{E}_{s'\sim p(s'|s,a)}[V^\pi(s')]$ just uses the current estimate

$\pi'(a_t|s_t)=\begin{cases}1&\text{if }a_t=\arg\max_{a_t}A^\pi(s_t,a_t)\\0&\text{otherwise}\end{cases}$, a deterministic policy $\pi(s)=a$

simplified: $V^\pi(s)\leftarrow r(s,\pi(s))+\gamma\mathbb{E}_{s'\sim p(s'|s,\pi(s))}[V^\pi(s')]$

the update above can be written as $v=r^\pi+\gamma P^\pi v$, where $P^\pi$ is the state-transition matrix under $\pi$; rearranging gives $(I-\gamma P^\pi)v=r^\pi$, i.e. $v=(I-\gamma P^\pi)^{-1}r^\pi$, which can be solved with any linear equation solver
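A sketch of this exact solve for a tabular MDP in NumPy; the array layout (states, actions, next states) is an assumption of mine:

```python
import numpy as np

def evaluate_policy(P, r, policy, gamma=0.99):
    """Exact policy evaluation v = (I - gamma * P_pi)^(-1) r_pi.
    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    policy: length-S array of deterministic actions pi(s)."""
    S = P.shape[0]
    P_pi = P[np.arange(S), policy]                        # (S, S): rows are p(s'|s, pi(s))
    r_pi = r[np.arange(S), policy]                        # (S,): r(s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```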

1.3 value iteration algorithm (even simpler dynamic programming)

$\arg\max_{a_t}A^\pi(s_t,a_t)=\arg\max_{a_t}Q^\pi(s_t,a_t)$

$Q^\pi(s,a)=r(s,a)+\gamma\mathbb{E}[V^\pi(s')]$ (a bit simpler than $A^\pi$)

$\arg\max_a Q(s,a)\to\text{policy}$: the max over $Q$ already gives the new value, so we can skip the explicit policy and compute values directly

step 1: set $Q(s,a)\leftarrow r(s,a)+\gamma\mathbb{E}[V(s')]$

step 2: set $V(s)\leftarrow\max_a Q(s,a)$ (since the greedy policy picks the $\max Q$ action with probability 1, the expectation collapses into a max)
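A tabular value-iteration sketch of steps 1-2, with the same assumed array layout as the policy-evaluation sketch above:

```python
import numpy as np

def value_iteration(P, r, gamma=0.99, iters=1000):
    """P: (S, A, S) transitions, r: (S, A) rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V        # step 1: Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        V = Q.max(axis=1)            # step 2: V(s) = max_a Q(s,a)
    policy = Q.argmax(axis=1)        # greedy policy recovered from the final Q
    return V, policy
```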

2.1 Fitted value iteration

a big table hits the curse of dimensionality, so use a neural network $V:\mathcal{S}\to\mathbb{R}$

$\mathcal{L}(\phi)=\frac{1}{2}\lVert V_\phi(s)-\max_a Q^\pi(s,a)\rVert^2$

step 1: set $y_i\leftarrow\max_{a_i}(r(s_i,a_i)+\gamma\mathbb{E}[V_\phi(s_i')])$

step 2: set $\phi\leftarrow\arg\min_\phi\frac{1}{2}\sum_i\lVert V_\phi(s_i)-y_i\rVert^2$

2.1.1 without knowing the transitions

however, $\mathbb{E}[V_\phi(s_i')]$ (and the max over actions in step 1) requires knowing the outcomes of different actions, i.e. the transition dynamics

$Q^\pi(s,a)\leftarrow r(s,a)+\gamma\mathbb{E}_{s'\sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]$ only needs sampled $(s,a)$ pairs

because $\pi'=1$ if $a=\arg\max Q$, the expectation over $V$ collapses into a max; apply the same trick again

step 1: set $y_i\leftarrow r(s_i,a_i)+\gamma\mathbb{E}[V_\phi(s_i')]$, approximating $\mathbb{E}[V(s_i')]\approx\max_{a'}Q_\phi(s_i',a_i')$

step 2: set $\phi\leftarrow\arg\min_\phi\frac{1}{2}\sum_i\lVert Q_\phi(s_i,a_i)-y_i\rVert^2$

2.2 full fitted Q-iteration algorithm

step 1: collect dataset $\{(s_i,a_i,s_i',r_i)\}$ using some policy; parameters: dataset size $N$, collection policy

repeat $K$ times (parameter: number of iterations $K$): {step 2: set $y_i\leftarrow r(s_i,a_i)+\gamma\max_{a_i'}Q_\phi(s_i',a_i')$; step 3: set $\phi\leftarrow\arg\min_\phi\frac{1}{2}\sum_i\lVert Q_\phi(s_i,a_i)-y_i\rVert^2$ (parameter: number of gradient steps $S$)}
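A hedged sketch of that inner loop (steps 2-3) on a fixed dataset with a discrete-action Q-network; the `done` masking, layer sizes, and the $K$, $S$ defaults are my assumptions:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # Q_phi(s, .)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_iteration(s, a, r, s_next, done, K=10, S=100):
    """s, s_next: (B, obs_dim); a: (B,) long; r, done: (B,) float. The dataset stays fixed."""
    for _ in range(K):
        with torch.no_grad():                                     # step 2: recompute the targets y_i
            y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
        for _ in range(S):                                        # step 3: S gradient steps of regression
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = 0.5 * ((q_sa - y) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```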

error: $\varepsilon=\frac{1}{2}\mathbb{E}_{(s,a)\sim\beta}\left[\left(Q_\phi(s,a)-[r(s,a)+\gamma\max_{a'}Q_\phi(s',a')]\right)^2\right]$

if $\varepsilon=0$, then $Q_\phi(s,a)=r(s,a)+\gamma\max_{a'}Q_\phi(s',a')$; this is the optimal Q-function, corresponding to the optimal policy $\pi'(a_t|s_t)=\begin{cases}1&\text{if }a_t=\arg\max_{a_t}A^\pi(s_t,a_t)\\0&\text{otherwise}\end{cases}$, which maximizes reward; sometimes written $Q^*$ and $\pi^*$

3.1 Online Q-learning algorithm

step 1: take some action $a_i$ and observe $(s_i,a_i,s_i',r_i)$

step 2: $y_i=r(s_i,a_i)+\gamma\max_{a_i'}Q_\phi(s_i',a_i')$

step 3: $\phi\leftarrow\phi-\alpha\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i)-y_i)$

inject some additional randomness to produce better exploration, e.g. $\varepsilon$-greedy $\pi(a_t|s_t)=\begin{cases}1-\varepsilon&\text{if }a_t=\arg\max_{a_t}Q_\phi(s_t,a_t)\\\varepsilon/(|\mathcal{A}|-1)&\text{otherwise}\end{cases}$ or Boltzmann exploration $\pi(a_t|s_t)\propto\exp(Q_\phi(s_t,a_t))$
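Small NumPy sketches of the two exploration rules; the interfaces are my own, and `q_values` stands for the vector $Q_\phi(s_t,\cdot)$ of the current state:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, otherwise uniform over the other actions."""
    n = len(q_values)
    probs = np.full(n, epsilon / (n - 1))
    probs[np.argmax(q_values)] = 1.0 - epsilon
    return np.random.choice(n, p=probs)

def boltzmann(q_values):
    """pi(a|s) proportional to exp(Q_phi(s, a))."""
    logits = np.asarray(q_values, dtype=np.float64)
    logits -= logits.max()                                # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return np.random.choice(len(q_values), p=probs)
```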

lec 8, 2026/3/23, Deep RL with Q-Functions

1.1 Correlated samples in online Q-learning

in the online Q-iteration algorithm, sequential states are strongly correlated and the target value is always changing

solution: synchronized parallel Q-learning / asynchronous parallel Q-learning / replay buffers

1.2 full Q-learning with replay buffer

samples are no longer correlated, and multiple samples in the batch give a low-variance gradient ($\frac{\sigma^2}{N}$):

step 1: collect dataset $\{(s_i,a_i,s_i',r_i)\}$ using some policy, add it to $\mathcal{B}$

repeat $K$ times: {step 2: sample a batch $(s_i,a_i,s_i',r_i)$ from $\mathcal{B}$; step 3: $\phi\leftarrow\phi-\alpha\sum_i\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i)-[r(s_i,a_i)+\gamma\max_{a'}Q_\phi(s_i',a_i')])$}

$K=1$ is common, though larger $K$ is more efficient

but when $\phi$ is updated, the target $r(s_i,a_i)+\gamma\max_{a'}Q_\phi(s_i',a_i')$ also moves; we would like the target values to stay fixed while fitting, so introduce a target network $\phi'$

2.1 Q-Learning with replay buffer and target networks

step 1: save target network parameters: $\phi'\leftarrow\phi$

repeat $N$ times: {step 2: collect dataset $\{(s_i,a_i,s_i',r_i)\}$ using some policy, add it to $\mathcal{B}$; repeat $K$ times: {step 3: sample a batch $(s_i,a_i,s_i',r_i)$ from $\mathcal{B}$; step 4: $\phi\leftarrow\phi-\alpha\sum_i\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i)-[r(s_i,a_i)+\gamma\max_{a'}Q_{\phi'}(s_i',a_i')])$}}

so targets don't change in the inner loop

the "classic" deep Q-learning algorithm: $K=1$

2.2 Alternative target network

copying $\phi'\leftarrow\phi$ every $N$ steps means that right after a copy $\phi'$ equals $\phi$ exactly, while just before the next copy $\phi'$ is already ancient; the target lag is weirdly uneven

popular alternative (similar to Polyak averaging), a small update every step: $\phi'\leftarrow\tau\phi'+(1-\tau)\phi$, with $\tau=0.999$ working well
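A condensed PyTorch sketch of the whole loop from sections 2.1-2.2: a replay buffer $\mathcal{B}$, the step-4 regression against a frozen $Q_{\phi'}$, and both target-update styles (periodic hard copy vs the Polyak-style soft update above); buffer size, batch size, and network shapes are illustrative assumptions:

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))       # Q_phi
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # Q_phi'
target_net.load_state_dict(q_net.state_dict())             # step 1: phi' <- phi
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)                              # replay buffer B of (s, a, r, s', done)

def q_update(batch_size=64):
    """Steps 3-4: sample a batch from B and regress Q_phi onto targets built from the frozen Q_phi'."""
    batch = random.sample(buffer, batch_size)
    s, a, r, s_next, done = (torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch))
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = 0.5 * ((q_sa - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def hard_target_update():
    """Periodic copy phi' <- phi (the 'weirdly uneven' variant)."""
    target_net.load_state_dict(q_net.state_dict())

def polyak_target_update(tau=0.999):
    """Soft update phi' <- tau * phi' + (1 - tau) * phi, applied after every gradient step."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), q_net.parameters()):
            p_t.mul_(tau).add_((1.0 - tau) * p)
```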

3.1 Overestimation in Q-learning

$\max_{a_j'}Q_{\phi'}(s_j',a_j')$ overestimates the next value, since $\mathbb{E}[\max(X_1,X_2)]\ge\max(\mathbb{E}[X_1],\mathbb{E}[X_2])$

note that $\max_{a'}Q_{\phi'}(s',a')=Q_{\phi'}(s',\arg\max_{a'}Q_{\phi'}(s',a'))$: both the selected action and its value come from $Q_{\phi'}$

the target is overestimated, so after the update $Q$ grows; a larger $Q$ makes the max overestimate even more, and the errors feed on themselves

if the noise in these is decorrelated, the problem goes away

idea: don't use the same network to choose the action and evaluate value

3.2 Double Q-learning

use two networks: $Q_{\phi_A}(s,a)\leftarrow r+\gamma Q_{\phi_B}(s',\arg\max_{a'}Q_{\phi_A}(s',a'))$, $Q_{\phi_B}(s,a)\leftarrow r+\gamma Q_{\phi_A}(s',\arg\max_{a'}Q_{\phi_B}(s',a'))$
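A sketch of the double-Q target for updating $Q_{\phi_A}$ (swap the roles for $Q_{\phi_B}$ by symmetry); note that using the current network and the target network as the two networks is a common implementation choice, not something stated above:

```python
import torch

def double_q_target(r, s_next, done, q_a, q_b, gamma=0.99):
    """Target for Q_A: the action is chosen by Q_A, but its value is evaluated by Q_B,
    so the selection noise and the evaluation noise are (hopefully) decorrelated."""
    with torch.no_grad():
        a_star = q_a(s_next).argmax(dim=1, keepdim=True)         # arg max_{a'} Q_A(s', a')
        next_value = q_b(s_next).gather(1, a_star).squeeze(1)    # Q_B(s', a*)
        return r + gamma * (1.0 - done) * next_value
```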

3.3 Q-learning with N-step return

$y_{j,t}=\sum^{t+N-1}_{t'=t}\gamma^{t'-t}r_{j,t'}+\gamma^N\max_{a_{j,t+N}}Q_{\phi'}(s_{j,t+N},a_{j,t+N})$

less biased target values when Q-values are inaccurate (more of the target comes from real discounted rewards), typically faster learning especially early on, but only actually correct when learning on-policy

the problem is that the rewards in the middle of the N-step window come from actions chosen by an old policy, which are not the actions the new policy would have taken (not an issue when $N=1$)

solutions: ignore the problem; cut the trace, i.e. dynamically choose $N$ so that only on-policy data is kept (works well when data is mostly on-policy and the action space is small); or importance sampling ($w=\frac{\pi_{\text{new}}(a|s)}{\pi_{\text{old}}(a|s)}$)
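A sketch of building the N-step targets for one trajectory; the handling of episode ends (truncating the sum and dropping the final bootstrap when the episode terminated) is my simplification:

```python
import torch

def n_step_targets(rewards, states, q_target, N=3, gamma=0.99, terminal=True):
    """rewards: length-T list; states: (T+1, obs_dim) tensor including the state after the last step;
    q_target: the target network Q_phi' mapping states to per-action values."""
    T = len(rewards)
    with torch.no_grad():
        bootstrap = q_target(states).max(dim=1).values            # max_{a'} Q_phi'(s, a') per state
    if terminal:
        bootstrap[T] = 0.0                                        # no bootstrap past the episode end
    targets = []
    for t in range(T):
        end = min(t + N, T)                                       # window truncated at the episode end
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        targets.append(ret + gamma ** (end - t) * bootstrap[end].item())
    return torch.tensor(targets)
```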

4.1 Q-learning with continuous actions

how do we perform $a_t=\arg\max_{a_t}Q_\phi(s_t,a_t)$ and evaluate $\gamma\max_{a_j'}Q_{\phi'}(s_j',a_j')$ when actions are continuous?

option 1: optimization

gradient-based optimization (e.g., SGD) is a bit slow in the inner loop

the action space is typically low-dimensional, so stochastic optimization also works well

option 2: use function class that is easy to optimize

option 3: learn an approximate maximizer; train another neural network $\mu_\theta(s)$ such that $\mu_\theta(s)\approx\arg\max_a Q_\phi(s,a)$

DDPG:

step 1: take some action $a_i$ and observe $(s_i,a_i,s_i',r_i)$, add it to $\mathcal{B}$

step 2: sample a mini-batch $\{s_j,a_j,s_j',r_j\}$ from $\mathcal{B}$ uniformly

step 3: compute $y_j=r_j+\gamma Q_{\phi'}(s_j',\mu_{\theta'}(s_j'))$ using the target nets $Q_{\phi'}$ and $\mu_{\theta'}$

step 4: $\phi\leftarrow\phi-\alpha\sum_j\frac{dQ_\phi}{d\phi}(s_j,a_j)(Q_\phi(s_j,a_j)-y_j)$

step 5: $\theta\leftarrow\theta+\beta\sum_j\frac{d\mu}{d\theta}(s_j)\frac{dQ_\phi}{da}(s_j,\mu(s_j))$

step 6: update ϕ′\phi'ϕ′ and θ′\theta'θ′ (e.g., Polyak averaging)
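A hedged PyTorch sketch of one DDPG update (steps 3-6); the Q-network taking the concatenation $[s,a]$, the tanh-squashed actor, and all sizes and learning rates are illustrative assumptions:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 4, 2, 0.99

class QNet(nn.Module):
    """Q_phi(s, a) for continuous actions: an MLP on the concatenation [s, a]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def make_actor():  # mu_theta(s); tanh keeps actions in [-1, 1]
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())

q_net, q_target = QNet(), QNet()
actor, actor_target = make_actor(), make_actor()
q_target.load_state_dict(q_net.state_dict())
actor_target.load_state_dict(actor.state_dict())
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def ddpg_update(s, a, r, s_next, done, tau=0.999):
    # steps 3-4: critic regression onto the target-network value y_j
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_target(s_next, actor_target(s_next))
    q_loss = 0.5 * ((q_net(s, a) - y) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # step 5: maximize Q_phi(s, mu_theta(s)) by backpropagating dQ/da through mu_theta
    actor_loss = -q_net(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # step 6: Polyak-average both target networks
    with torch.no_grad():
        for net, tgt in ((q_net, q_target), (actor, actor_target)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.mul_(tau).add_((1.0 - tau) * p)
```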

5.1 simple practical tips for Q-learning

large replay buffers help improve stability

start with high exploration (epsilon) and gradually reduce

Bellman error gradients can be big; clip gradients or use Huber loss

Double Q-learning and N-step returns help a lot

schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too

run multiple random seeds
