lec 2, 2026/3/13, imitation learning/supervised learning of behaviors
- The distributional shift (分布偏移) problem
we train under $p_{\text{data}}(o_t)$: $\max_\theta \mathbb{E}_{o_t\sim p_{\text{data}}(o_t)}[\log \pi_\theta(a_t|o_t)]$
we test under $p_{\pi_\theta}(o_t)$, but $p_{\text{data}}(o_t)\neq p_{\pi_\theta}(o_t)$
(1) error accumulation: a small mistake during execution shifts the trajectory slightly away from the expected one; the next $o_{t+1}$ may then fall outside $p_{\text{data}}$, leaving the model clueless and drifting even further from the expected trajectory
(2) distribution mismatch: supervised learning assumes training and test data are i.i.d., but in RL the actions change the future inputs
- what makes a learned $\pi_\theta(a_t|o_t)$ good or bad?
is it $\max_\theta \mathbb{E}_{o_t\sim p_{\text{data}}(o_t)}[\log \pi_\theta(a_t|o_t)]$? No: the expected log-likelihood only measures how closely the policy matches the expert's actions, not the consequences of getting them wrong. But any mistake is bad, so use a cost function instead: $c(s_t,a_t)=\begin{cases}0 & \text{if }a_t=\pi^*(s_t)\\1 & \text{otherwise}\end{cases}$, where $\pi^*(s_t)$ is the expert's action in state $s_t$
goal: minimize $\mathbb{E}_{s_t\sim p_{\pi_\theta}(s_t)}[c(s_t,a_t)]$, i.e., minimize the number of mistakes the policy makes when we run it
2.1 assume: $\pi_\theta(a\neq\pi^*(s)|s)\leq\epsilon$ for all $s\in\mathcal{D}_{\text{train}}$
$\mathbb{E}[\sum_t c(s_t,a_t)]\leq \epsilon T+(1-\epsilon)[\epsilon(T-1)+(1-\epsilon)(\epsilon(T-2)+(1-\epsilon)(\dots))]$, i.e., one mistake snowballs into more mistakes; $T$ terms, each $O(\epsilon T)$, so $O(\epsilon T^2)$ total
2.2 for a more general analysis, assume: $\pi_\theta(a\neq\pi^*(s)|s)\leq\epsilon$ for $s\sim p_{\text{train}}(s)$
actually it is enough that $\mathbb{E}_{p_{\text{train}}(s)}[\pi_\theta(a\neq\pi^*(s)|s)]\leq\epsilon$
with Dataset Aggregation (DAgger), $p_{\text{train}}(s)\to p_\theta(s)$, so $\mathbb{E}[\sum_t c(s_t,a_t)]\leq\epsilon T$
if $p_{\text{train}}(s)\neq p_\theta(s)$: $p_\theta(s_t)=(1-\epsilon)^t p_{\text{train}}(s_t)+(1-(1-\epsilon)^t)p_{\text{mistake}}(s_t)$, which gives $\sum_{s_t}|p_\theta(s_t)-p_{\text{train}}(s_t)|=(1-(1-\epsilon)^t)\sum_{s_t}|p_{\text{mistake}}(s_t)-p_{\text{train}}(s_t)|\leq 2(1-(1-\epsilon)^t)$
useful identity: $(1-\epsilon)^t\geq 1-\epsilon t$ for $\epsilon\in[0,1]$
therefore $\sum_{s_t}|p_\theta(s_t)-p_{\text{train}}(s_t)|\leq 2(1-(1-\epsilon)^t)\leq 2\epsilon t$
$\sum_t\mathbb{E}_{p_\theta(s_t)}[c_t]=\sum_t\sum_{s_t}p_\theta(s_t)c_t(s_t)\leq\sum_t\sum_{s_t}[p_{\text{train}}(s_t)c(s_t)+|p_\theta(s_t)-p_{\text{train}}(s_t)|c_{\max}]\leq\sum_t(\epsilon+2\epsilon t)\leq\epsilon T+2\epsilon T^2$, which is $O(\epsilon T^2)$
this bound is pessimistic: in reality we can often recover from mistakes, but vanilla imitation learning never teaches the model to recover. Hence a paradox: imitation learning can work better if the data has more mistakes (and recoveries)
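The compounding-error picture behind the $O(\epsilon T^2)$ bound can be checked with a tiny simulation; a minimal sketch assuming the worst case where the policy errs with probability $\epsilon$ while on-distribution and errs at every step once it has left the data distribution (the function name `expected_mistakes` is made up):

```python
import random

def expected_mistakes(eps, T, trials=20000, seed=0):
    """Monte Carlo estimate under the pessimistic model: on-distribution steps
    err with probability eps; after the first mistake the policy is
    off-distribution and errs at every remaining step."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        for t in range(T):
            if rng.random() < eps:   # first mistake happens at step t
                total += T - t       # ...and every step after it is also a mistake
                break
    return total / trials

# the mistake count grows superlinearly in T, consistent with O(eps * T^2)
print(expected_mistakes(0.01, 10), expected_mistakes(0.01, 100))
```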
- autoregressive discretization
$a_t=\begin{pmatrix}0.1\\1.2\\-0.3\end{pmatrix}=\begin{pmatrix}a_{t,0}\\a_{t,1}\\a_{t,2}\end{pmatrix}$; image $\xrightarrow{\text{ConvNet encoder}}$ sequence model block $\to a_{t,0}\to$ sequence model block $\to a_{t,1}\to$ sequence model block $\to a_{t,2}$; discretizing one dimension at a time, each conditioned on the previous ones, avoids the exponential blowup of a joint softmax over all action dimensions
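A minimal numpy sketch of the autoregressive idea, with a made-up `head` function standing in for the sequence-model blocks: each action dimension gets its own $K$-way softmax conditioned on the previously sampled bins, so the $K^d$-way joint softmax is never materialized:

```python
import numpy as np

# Toy sketch (not the lecture's exact network): sample a d-dimensional action
# one dimension at a time, each conditioned on the previously sampled bins, so
# we only need d softmaxes of K bins instead of one joint softmax over K**d bins.
K, d = 11, 3                       # 11 bins per dimension, 3 action dimensions
rng = np.random.default_rng(0)

def head(prev_bins):
    """Made-up stand-in for a sequence-model block: maps previously chosen
    bins to a distribution over the K bins of the next dimension."""
    logits = rng.standard_normal(K) + 0.1 * sum(prev_bins)
    p = np.exp(logits - logits.max())
    return p / p.sum()

bins = []
for _ in range(d):
    bins.append(int(rng.choice(K, p=head(bins))))

action = np.array(bins) / (K - 1) * 2 - 1   # map bin indices back to [-1, 1]
print(action)                                # one sampled 3-dim action
print(K ** d, "joint bins vs", K * d, "autoregressive outputs")
```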
- Dataset Aggregation (DAgger): can we make $p_{\text{data}}(o_t)=p_{\pi_\theta}(o_t)$?
idea: instead of being clever about $p_{\pi_\theta}(o_t)$, be clever about $p_{\text{data}}(o_t)$
goal: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{\text{data}}(o_t)$
how? just run $\pi_\theta(a_t|o_t)$
(1) train $\pi_\theta(a_t|o_t)$ on human data $\mathcal{D}=\{o_1,a_1,\dots,o_N,a_N\}$
(2) run $\pi_\theta(a_t|o_t)$ to get dataset $\mathcal{D}_\pi=\{o_1,\dots,o_M\}$
(3) ask a human to label $\mathcal{D}_\pi$ with actions $a_t$
(4) aggregate: $\mathcal{D}\leftarrow\mathcal{D}\cup\mathcal{D}_\pi$
then go back to (1), retrain the policy and repeat
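Steps (1)-(4) can be sketched as a loop; `train`, `rollout`, and `expert_label` below are hypothetical stand-ins for the real trainer, policy rollout, and human labeler:

```python
# Minimal DAgger loop sketch; the three callables are made-up placeholders.
def dagger(train, rollout, expert_label, D_init, iterations=5):
    D = list(D_init)                      # (1) start from human data D
    policy = train(D)
    for _ in range(iterations):
        observations = rollout(policy)    # (2) run pi_theta to collect D_pi
        labeled = [(o, expert_label(o)) for o in observations]  # (3) label D_pi
        D.extend(labeled)                 # (4) aggregate D <- D ∪ D_pi
        policy = train(D)                 # retrain on the aggregated data, repeat
    return policy
```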
hw1-analysis
Consider the problem of imitation learning within a discrete MDP with horizon $T$ and an expert policy $\pi^*$. We gather expert demonstrations from $\pi^*$ and fit an imitation policy $\pi_\theta$ to these trajectories so that $\mathbb{E}_{p_{\pi^*}(s)}[\pi_\theta(a\neq\pi^*(s)|s)]=\frac{1}{T}\sum_{t=1}^T\mathbb{E}_{p_{\pi^*}(s_t)}[\pi_\theta(a_t\neq\pi^*(s_t)|s_t)]\leq\varepsilon$, i.e., the expected likelihood that the learned policy $\pi_\theta$ disagrees with the expert $\pi^*$ within the training distribution $p_{\pi^*}$ of states drawn from random expert trajectories is at most $\varepsilon$.
For convenience, the notation $p_\pi(s_t)$ indicates the state distribution under $\pi$ at time step $t$, while $p_\pi(s)$ indicates the state marginal of $\pi$ across time steps, unless indicated otherwise.
- Show that $\sum_{s_t}|p_{\pi_\theta}(s_t)-p_{\pi^*}(s_t)|\leq 2T\varepsilon$.
Hint: we showed a similar inequality under the stronger assumption $\pi_\theta(a_t\neq\pi^*(s_t)|s_t)\leq\varepsilon$ for every $s_t\in\text{supp}(p_{\pi^*})$. Try converting the inequality above into an expectation over $p_{\pi^*}$ and use a union bound ($\Pr[\cup_i E_i]\leq\sum_i\Pr[E_i]$) to get the desired result.
Let $E_i$ denote the event that at time step $i$ the imitation policy $\pi_\theta$ takes a different action from the expert $\pi^*$, where the state $s_i$ is sampled from the expert's state distribution $p_{\pi^*}(s_i)$.
By assumption, the per-step mistake probability is $\Pr(E_i)=\mathbb{E}_{p_{\pi^*}(s_i)}[\pi_\theta(a_i\neq\pi^*(s_i)|s_i)]$, with $\frac{1}{T}\sum_{i=1}^T\Pr(E_i)\leq\varepsilon$, hence $\sum_{i=1}^T\Pr(E_i)\leq T\varepsilon$
$\Pr(\bigcup_{i=1}^t E_i)\leq\sum_{i=1}^t\Pr(E_i)\leq\sum_{i=1}^T\Pr(E_i)\leq T\varepsilon$
$p_{\pi_\theta}(s_t)=(1-\Pr(\bigcup_{i=1}^t E_i))\,p_{\pi^*}(s_t)+\Pr(\bigcup_{i=1}^t E_i)\,p_{\text{mistake}}(s_t)$
$p_{\pi_\theta}(s_t)-p_{\pi^*}(s_t)=\Pr(\bigcup_{i=1}^t E_i)(p_{\text{mistake}}(s_t)-p_{\pi^*}(s_t))$
$\sum_{s_t}|p_{\pi_\theta}(s_t)-p_{\pi^*}(s_t)|=\Pr(\bigcup_{i=1}^t E_i)\sum_{s_t}|p_{\text{mistake}}(s_t)-p_{\pi^*}(s_t)|\leq T\varepsilon(\sum_{s_t}p_{\text{mistake}}(s_t)+\sum_{s_t}p_{\pi^*}(s_t))\leq 2T\varepsilon$
- Consider the expected return of the learned policy $\pi_\theta$ for a state-dependent reward $r(s_t)$, where we assume the reward is bounded with $|r(s_t)|\leq R_{\max}$: $J(\pi)=\sum_{t=1}^T\mathbb{E}_{p_\pi(s_t)}[r(s_t)]$.
(a) Show that $J(\pi^*)-J(\pi_\theta)=\mathcal{O}(T\varepsilon)$ when the reward only depends on the last state, i.e., $r(s_t)=0$ for all $t<T$.
Since the reward depends only on the last step, $J(\pi)=\mathbb{E}_{p_\pi(s_T)}[r(s_T)]$
$|J(\pi^*)-J(\pi_\theta)|=|\mathbb{E}_{p_{\pi^*}(s_T)}[r(s_T)]-\mathbb{E}_{p_{\pi_\theta}(s_T)}[r(s_T)]|=|\sum_{s_T}r(s_T)(p_{\pi^*}(s_T)-p_{\pi_\theta}(s_T))|$
$\leq\sum_{s_T}|r(s_T)|\,|p_{\pi^*}(s_T)-p_{\pi_\theta}(s_T)|\leq R_{\max}\sum_{s_T}|p_{\pi^*}(s_T)-p_{\pi_\theta}(s_T)|$
By part 1, $\sum_{s_t}|p_{\pi_\theta}(s_t)-p_{\pi^*}(s_t)|\leq 2T\varepsilon$, so the above is $\leq R_{\max}\cdot 2T\varepsilon=2R_{\max}T\varepsilon=\mathcal{O}(T\varepsilon)$
(b) Show that $J(\pi^*)-J(\pi_\theta)=\mathcal{O}(T^2\varepsilon)$ for an arbitrary reward.
$|J(\pi^*)-J(\pi_\theta)|=|\sum_{t=1}^T\mathbb{E}_{p_{\pi^*}(s_t)}[r(s_t)]-\sum_{t=1}^T\mathbb{E}_{p_{\pi_\theta}(s_t)}[r(s_t)]|\leq\sum_{t=1}^T|\mathbb{E}_{p_{\pi^*}(s_t)}[r(s_t)]-\mathbb{E}_{p_{\pi_\theta}(s_t)}[r(s_t)]|$
$\leq\sum_{t=1}^T\sum_{s_t}|r(s_t)|\,|p_{\pi^*}(s_t)-p_{\pi_\theta}(s_t)|\leq\sum_{t=1}^T R_{\max}\sum_{s_t}|p_{\pi^*}(s_t)-p_{\pi_\theta}(s_t)|$
$\leq\sum_{t=1}^T R_{\max}\cdot 2T\varepsilon=2R_{\max}T^2\varepsilon=\mathcal{O}(T^2\varepsilon)$
lec 4, 2026/3/15, reinforcement learning
- the goal of reinforcement learning
$p_\theta(\tau)=p_\theta(s_1,a_1,\dots,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)$, $\theta^*=\text{argmax}_\theta\,\mathbb{E}_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]=\text{argmax}_\theta\sum_{t=1}^T\mathbb{E}_{(s_t,a_t)\sim p_\theta(s_t,a_t)}[r(s_t,a_t)]$
what if $T=\infty$?
introduce the state-action transition operator $\mathcal{T}$: $\begin{pmatrix}s_{t+1}\\a_{t+1}\end{pmatrix}=\mathcal{T}\begin{pmatrix}s_t\\a_t\end{pmatrix}$, $\begin{pmatrix}s_{t+k}\\a_{t+k}\end{pmatrix}=\mathcal{T}^k\begin{pmatrix}s_t\\a_t\end{pmatrix}$
does $p(s_t,a_t)$ converge to a stationary distribution? i.e., $\mu=\mathcal{T}\mu$; stationary = the same before and after a transition; $\mu=p_\theta(s,a)$ is the stationary distribution
$(\mathcal{T}-I)\mu=0$, so $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue $1$! (it always exists under some regularity conditions)
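This can be checked numerically on a small chain; a sketch that recovers the stationary distribution of a made-up $3\times 3$ transition operator as the eigenvector with eigenvalue 1:

```python
import numpy as np

# Find mu = T mu for a small state-action chain as the eigenvector of the
# transition operator with eigenvalue 1. Columns of T sum to 1:
# T[j, i] = probability of moving to state-action j from state-action i.
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

eigvals, eigvecs = np.linalg.eig(T)
k = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
mu = np.real(eigvecs[:, k])
mu = mu / mu.sum()                     # normalize to a probability distribution

print(mu)
print(np.allclose(T @ mu, mu))         # stationary: unchanged by the transition
```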
- value function
how do we deal with the expectation $\mathbb{E}_{\tau\sim p_\theta(\tau)}[\sum_{t=1}^T r(s_t,a_t)]=\mathbb{E}_{s_1\sim p(s_1)}[\mathbb{E}_{a_1\sim\pi(a_1|s_1)}[r(s_1,a_1)+\mathbb{E}_{s_2\sim p(s_2|s_1,a_1)}[\mathbb{E}_{a_2\sim\pi(a_2|s_2)}[r(s_2,a_2)+\dots|s_2]|s_1,a_1]|s_1]]$?
let $Q(s_1,a_1)=r(s_1,a_1)+\mathbb{E}_{s_2\sim p(s_2|s_1,a_1)}[\mathbb{E}_{a_2\sim\pi(a_2|s_2)}[r(s_2,a_2)+\dots|s_2]|s_1,a_1]$
then $\mathbb{E}_{\tau\sim p_\theta(\tau)}[\sum_{t=1}^T r(s_t,a_t)]=\mathbb{E}_{s_1\sim p(s_1)}[\mathbb{E}_{a_1\sim\pi(a_1|s_1)}[Q(s_1,a_1)|s_1]]$, which is easy to optimize over $\pi_\theta(a_1|s_1)$ if $Q(s_1,a_1)$ is known! example: $\pi(a_1|s_1)=1$ if $a_1=\text{argmax}_{a_1}Q(s_1,a_1)$
2.1 Q-function
$Q^\pi(s_t,a_t)=\sum_{t'=t}^T\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]$: total reward from taking $a_t$ in $s_t$
2.2 value function
$V^\pi(s_t)=\sum_{t'=t}^T\mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t]$: total reward from $s_t$
also $V^\pi(s_t)=\mathbb{E}_{a_t\sim\pi(a_t|s_t)}[Q^\pi(s_t,a_t)]$
$\mathbb{E}_{s_1\sim p(s_1)}[V^\pi(s_1)]$ is the RL objective!
2.3 using Q-functions and value functions
idea 1: if we have a policy $\pi$ and we know $Q^\pi(s,a)$, then we can improve $\pi$:
set $\pi'(a|s)=1$ if $a=\text{argmax}_a Q^\pi(s,a)$; this policy is at least as good as $\pi$ (and probably better), and it doesn't matter what $\pi$ is
idea 2: compute a gradient to increase the probability of good actions $a$:
if $Q^\pi(s,a)>V^\pi(s)$, then $a$ is better than average (recall that $V^\pi(s)=\mathbb{E}[Q^\pi(s,a)]$ under $\pi(a|s)$); modify $\pi(a|s)$ to increase the probability of $a$ whenever $Q^\pi(s,a)>V^\pi(s)$
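A tiny tabular sketch of idea 1 with made-up Q-values: in every state, the greedy action's Q-value is at least the average $V^\pi(s)=\mathbb{E}_{a\sim\pi}[Q^\pi(s,a)]$, so switching to the greedy policy cannot hurt:

```python
import numpy as np

# Made-up Q-table for 2 states x 2 actions under some policy pi.
Q = np.array([[1.0, 3.0],     # Q^pi(s0, a0), Q^pi(s0, a1)
              [2.0, 0.5]])    # Q^pi(s1, a0), Q^pi(s1, a1)
pi = np.array([[0.5, 0.5],    # current stochastic policy pi(a|s)
               [0.5, 0.5]])

V = (pi * Q).sum(axis=1)                    # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
greedy = Q.argmax(axis=1)                   # improved policy: argmax_a Q^pi(s, a)
V_greedy_backup = Q[np.arange(2), greedy]   # value of the greedy action per state

print(V)                  # average-action value per state
print(greedy)             # best action per state
print(V_greedy_backup)    # at least V in every state
```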
- Types of RL algorithms
$\theta^*=\text{argmax}_\theta\,\mathbb{E}_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]$
Policy gradients: directly differentiate the above objective, $\theta\leftarrow\theta+\alpha\nabla_\theta\mathbb{E}[\sum_t r(s_t,a_t)]$
Value-based: estimate the value function or Q-function of the optimal policy (no explicit policy), set $\pi(s)=\text{argmax}_a Q(s,a)$
Actor-critic: estimate the value function or Q-function of the current policy, use it to improve the policy (value functions + policy gradients), $\theta\leftarrow\theta+\alpha\nabla_\theta\mathbb{E}[\sum_t Q(s_t,a_t)]$
Model-based RL: estimate the transition model, then either use it for planning (no explicit policy), backpropagate gradients into the policy, or learn a value function
- tradeoff
4.1 sample efficiency
(1) sample efficiency=how many samples do we need to get a good policy?
(2) most important question: is the algorithm off policy?
off policy: able to improve the policy without generating new samples from that policy
on policy: each time the policy is changed, even a little bit (just one gradient step), we need to generate new samples
(3) assumptions
- common assumption #1: full observability
generally assumed by value function fitting methods; can be mitigated by adding recurrence (using history to compensate for missing state)
- common assumption #2: episodic learning
often assumed by pure policy gradient methods (updates only happen once an episode ends); assumed by some model-based RL methods (they need to plan over whole sequences)
- common assumption #3: continuity or smoothness
assumed by some continuous value function learning methods (they rely on gradient descent); often assumed by some model-based RL methods
lec 5, 2026/3/16, policy gradient
- evaluating the objective
$J(\theta)=\mathbb{E}_{\tau\sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T r(s_{i,t},a_{i,t})$, $\theta^*=\text{argmax}_\theta J(\theta)$
let $r(\tau)=\sum_t r(s_t,a_t)$, so $J(\theta)=\mathbb{E}_{\tau\sim p_\theta(\tau)}[r(\tau)]=\int p_\theta(\tau)r(\tau)d\tau$ and $\nabla_\theta J(\theta)=\int\nabla_\theta p_\theta(\tau)r(\tau)d\tau$
using the identity $p_\theta(\tau)\nabla_\theta\log p_\theta(\tau)=p_\theta(\tau)\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}=\nabla_\theta p_\theta(\tau)$, the gradient becomes an expectation: $\nabla_\theta J(\theta)=\int\nabla_\theta p_\theta(\tau)r(\tau)d\tau=\int p_\theta(\tau)\nabla_\theta\log p_\theta(\tau)r(\tau)d\tau=\mathbb{E}_{\tau\sim p_\theta(\tau)}[\nabla_\theta\log p_\theta(\tau)r(\tau)]$
recall: $p_\theta(\tau)=p_\theta(s_1,a_1,\dots,s_T,a_T)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)$, so $\log p_\theta(\tau)=\log p(s_1)+\sum_{t=1}^T[\log\pi_\theta(a_t|s_t)+\log p(s_{t+1}|s_t,a_t)]$
$\nabla_\theta\log p(s_1)=0$ and $\nabla_\theta\log p(s_{t+1}|s_t,a_t)=0$: neither term depends on $\theta$, so their gradients vanish
therefore $\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim p_\theta(\tau)}[(\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t|s_t))(\sum_{t=1}^T r(s_t,a_t))]\approx\frac{1}{N}\sum_{i=1}^N(\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t}))(\sum_{t=1}^T r(s_{i,t},a_{i,t}))$, then update $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$
1.1 REINFORCE algorithm
(1) sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run the policy)
(2) $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N(\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t^i|s_t^i))(\sum_{t=1}^T r(s_t^i,a_t^i))$
(3) $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$
this is very similar to maximum likelihood estimation, $\nabla_\theta J_{\text{ML}}(\theta)\approx\frac{1}{N}\sum_{i=1}^N\nabla_\theta\log\pi_\theta(\tau_i)$; weighting by reward means good stuff is made more likely and bad stuff is made less likely, which simply formalizes the notion of "trial and error"!
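The trial-and-error flavor shows up even on a toy problem. A minimal REINFORCE sketch on a made-up 2-armed bandit ($T=1$): sample actions from $\pi_\theta$, weight the score function by the return, and ascend; the rewarding arm becomes more likely:

```python
import numpy as np

# REINFORCE on a 2-armed bandit with a softmax policy over logits theta.
rng = np.random.default_rng(0)
theta = np.zeros(2)                         # policy logits
rewards = np.array([0.0, 1.0])              # arm 1 is the good arm (made up)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(200):
    p = softmax(theta)
    a = rng.choice(2, size=64, p=p)         # (1) sample a batch from pi_theta
    r = rewards[a]
    one_hot = np.eye(2)[a]
    # (2) grad log pi(a) for a softmax policy is (one_hot(a) - p); weight by reward
    grad = ((one_hot - p) * r[:, None]).mean(axis=0)
    theta += 1.0 * grad                     # (3) gradient ascent step

print(softmax(theta))                       # probability mass shifts to the good arm
```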
1.2 What's wrong with the policy gradient?
the gradient estimate has huge variance, and the policy gradient is extremely sensitive to the absolute scale of the rewards; canceling out the sampling noise requires averaging over many samples to get an even slightly accurate gradient direction, so sample efficiency is low.
1.3 Reducing variance
note causality: the policy at time $t'$ cannot affect the reward at time $t$ when $t<t'$
therefore $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}))=\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\hat{Q}_{i,t}$
$\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$ is the "reward to go" $\hat{Q}_{i,t}$, an estimate of the Q-function
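Computing the reward-to-go weights $\hat{Q}_{i,t}$ for one sampled trajectory takes a single backward pass over the rewards:

```python
# Turn per-step rewards into reward-to-go: Q_hat[t] = sum_{t' >= t} r[t'].
def reward_to_go(rewards):
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]   # accumulate future rewards from the end
        out[t] = running
    return out

print(reward_to_go([1.0, 0.0, 2.0]))   # [3.0, 2.0, 2.0]
```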
1.4 Baselines
$b=\frac{1}{N}\sum_{i=1}^N r(\tau_i)$, $\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\nabla_\theta\log p_\theta(\tau_i)[r(\tau_i)-b]$
$\mathbb{E}[\nabla_\theta\log p_\theta(\tau)b]=\int p_\theta(\tau)\nabla_\theta\log p_\theta(\tau)b\,d\tau=\int\nabla_\theta p_\theta(\tau)b\,d\tau=b\nabla_\theta\int p_\theta(\tau)d\tau=b\nabla_\theta(1)=0$, so subtracting a baseline keeps the estimator unbiased in expectation
solving $\frac{d\text{Var}}{db}=0$ gives the optimal baseline $b=\frac{\mathbb{E}[g(\tau)^2 r(\tau)]}{\mathbb{E}[g(\tau)^2]}$ with $g(\tau)=\nabla_\theta\log p_\theta(\tau)$; since the optimal $b$ differs for every parameter of $\theta$, which is cumbersome, in practice we just use the expected reward
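The effect of the baseline can be seen numerically; a sketch with a 1-parameter Gaussian policy and a made-up reward $r(a)=5+a$, where the constant offset inflates the variance of the plain estimator but not its mean:

```python
import numpy as np

# Score-function gradient for a Gaussian "policy" over a single action,
# with and without the average-return baseline. True gradient is 1.0 here.
rng = np.random.default_rng(0)
theta, sigma = 0.0, 1.0
a = rng.normal(theta, sigma, size=100_000)
g = (a - theta) / sigma**2          # score: d/dtheta log N(a; theta, sigma)
r = 5.0 + a                         # made-up reward with a large constant offset

est_plain = g * r                   # plain estimator samples
est_base = g * (r - r.mean())       # baseline b = average return

print(est_plain.mean(), est_base.mean())   # both near the true gradient
print(est_plain.var(), est_base.var())     # baseline variance is far smaller
```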
1.5 Off-Policy Policy Gradients
Policy gradient is on-policy, inefficient!
importance sampling: $\mathbb{E}_{x\sim p(x)}[f(x)]=\int p(x)f(x)dx=\int\frac{q(x)}{q(x)}p(x)f(x)dx=\int q(x)\frac{p(x)}{q(x)}f(x)dx=\mathbb{E}_{x\sim q(x)}[\frac{p(x)}{q(x)}f(x)]$
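The identity is easy to verify numerically; a sketch estimating $\mathbb{E}_{x\sim p}[x^2]$ for $p=\mathcal{N}(0,1)$ using only samples from $q=\mathcal{N}(1,1)$:

```python
import numpy as np

# E_{x~p}[f(x)] = E_{x~q}[p(x)/q(x) f(x)]: reweight samples from q by p/q.
rng = np.random.default_rng(0)

def pdf(x, mu, s):
    """Gaussian density N(x; mu, s^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
x_q = rng.normal(1.0, 1.0, size=200_000)         # samples from q = N(1, 1)
w = pdf(x_q, 0.0, 1.0) / pdf(x_q, 1.0, 1.0)      # importance weights p/q
is_estimate = np.mean(w * f(x_q))                # estimates E_{x~N(0,1)}[x^2] = 1

print(is_estimate)
```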
with this, what if we don't have samples from $p_\theta(\tau)$? (we have samples from some $\bar{p}(\tau)$ instead)
$J(\theta)=\mathbb{E}_{\tau\sim\bar{p}(\tau)}[\frac{p_\theta(\tau)}{\bar{p}(\tau)}r(\tau)]$, with $\frac{p_\theta(\tau)}{\bar{p}(\tau)}=\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T\bar{\pi}(a_t|s_t)p(s_{t+1}|s_t,a_t)}=\frac{\prod_{t=1}^T\pi_\theta(a_t|s_t)}{\prod_{t=1}^T\bar{\pi}(a_t|s_t)}$ (the initial-state and transition terms cancel)
$J(\theta')=\mathbb{E}_{\tau\sim p_\theta(\tau)}[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}r(\tau)]$, $\nabla_{\theta'}J(\theta')=\mathbb{E}_{\tau\sim p_\theta(\tau)}[\frac{\nabla_{\theta'}p_{\theta'}(\tau)}{p_\theta(\tau)}r(\tau)]=\mathbb{E}_{\tau\sim p_\theta(\tau)}[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\nabla_{\theta'}\log p_{\theta'}(\tau)r(\tau)]$; estimating locally at $\theta'=\theta$ recovers the on-policy gradient $\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim p_\theta(\tau)}[\nabla_\theta\log p_\theta(\tau)r(\tau)]$
1.6 Policy gradient with automatic differentiation
$\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\hat{Q}_{i,t}$
recall maximum likelihood: $J_{\text{ML}}(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\log\pi_\theta(a_{i,t}|s_{i,t})$, $\nabla_\theta J_{\text{ML}}(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})$
just implement a "pseudo-loss" as a weighted maximum likelihood: $\tilde{J}(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\log\pi_\theta(a_{i,t}|s_{i,t})\hat{Q}_{i,t}$
(1) discrete actions: cross-entropy loss $\text{Loss}_{\text{CE}}=-\sum_k y_k\log\hat{y}_k$, where $y$ is the one-hot label and $\hat{y}$ is the model's predicted probability. For the sampled action $a_{i,t}$, only the $a_{i,t}$ entry of the one-hot vector is 1 and the rest are 0, so $\text{Loss}_{\text{CE}}=-\log\pi_\theta(a_{i,t}|s_{i,t})$
(2) continuous actions: $\pi_\theta(a|s)=\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{(a-\mu_\theta(s))^2}{2\sigma^2})$, $\log\pi_\theta(a|s)=-\frac{(a-\mu_\theta(s))^2}{2\sigma^2}-\log(\sigma\sqrt{2\pi})$, so $\nabla_\theta\log\pi_\theta\propto\nabla_\theta\underbrace{-(a-\mu_\theta(s))^2}_{\text{negative squared error}}$ (an MSE loss)
```python
# Pseudocode example with discrete actions
'''Given:
actions - (N * T) tensor of action indices
states - (N * T) x Ds tensor of states
q_values - (N * T) x 1 tensor of estimated state-action values
'''
logits = policy.predict(states)  # (N * T) x Da tensor of action logits
negative_likelihoods = torch.nn.functional.cross_entropy(logits, actions, reduction='none')
weighted_negative_likelihoods = negative_likelihoods * q_values.squeeze(-1)
loss = torch.mean(weighted_negative_likelihoods)
loss.backward()
```
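A companion sketch for the continuous case, assuming a Gaussian policy with fixed $\sigma$ and a linear mean $\mu_\theta(s)=Ws$ so the Q-weighted gradient is available in closed form (all names and data below are made up):

```python
import numpy as np

# Pseudo-loss for continuous actions: J~ = mean_i Q_i * log pi(a_i|s_i),
# with log pi ∝ -||a - mu_theta(s)||^2 / (2 sigma^2), i.e. Q-weighted MSE.
rng = np.random.default_rng(0)
N_T, Da, Ds, sigma = 4, 2, 3, 1.0
states = rng.standard_normal((N_T, Ds))
actions = rng.standard_normal((N_T, Da))
q_values = rng.standard_normal((N_T, 1))
W = np.zeros((Da, Ds))                  # parameters of the policy mean

mu = states @ W.T                       # mu_theta(s) for each sample
pseudo_loss = np.mean(q_values * -((actions - mu) ** 2).sum(1, keepdims=True)
                      / (2 * sigma ** 2))
# gradient of the pseudo-loss wrt W: a Q-weighted regression gradient
grad_W = (q_values * (actions - mu) / sigma ** 2).T @ states / N_T
W += 0.1 * grad_W                       # one gradient ascent step on J~
print(grad_W.shape)                     # (2, 3), same shape as W
```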
1.7 Policy gradient in practice
(1) gradient has high variance
(2) consider using much larger batches
(3) tweaking learning rate is very hard
1.8 Covariant/natural policy gradient
plain gradient ascent $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$ solves $\theta'\leftarrow\text{argmax}_{\theta'}(\theta'-\theta)^T\nabla_\theta J(\theta)$ subject to $\|\theta'-\theta\|^2\leq\epsilon$; this treats a fixed distance in parameter space as a uniform change in the policy, but in reality the policy's sensitivity differs across parameters.
can we rescale the gradient so this doesn't happen?
constrain the change in the policy distribution instead: $D_{\text{KL}}(\pi_{\theta'}\|\pi_\theta)\leq\epsilon$, where $D_{\text{KL}}(\pi_{\theta'}\|\pi_\theta)=\mathbb{E}_{\pi_{\theta'}}[\log\pi_{\theta'}-\log\pi_\theta]\approx(\theta'-\theta)^T F(\theta'-\theta)$, with $F$ the Fisher information matrix $F=\mathbb{E}_{\pi_\theta}[\nabla_\theta\log\pi_\theta(a|s)\nabla_\theta\log\pi_\theta(a|s)^T]$, which can be estimated from samples
so solve $\theta'\leftarrow\text{argmax}_{\theta'}(\theta'-\theta)^T\nabla_\theta J(\theta)$ subject to $\|\theta'-\theta\|_F^2\leq\epsilon$, giving the update $\theta\leftarrow\theta+\alpha F^{-1}\nabla_\theta J(\theta)$
natural gradient: pick α\alphaα
trust region policy optimization: pick $\epsilon$
we can solve for the optimal $\alpha$ while computing $F^{-1}\nabla_\theta J(\theta)$: $\alpha=\sqrt{\frac{\epsilon}{\nabla_\theta J(\theta)^T F^{-1}\nabla_\theta J(\theta)}}$
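The whole natural-gradient step can be sketched numerically; the score vectors and $\nabla_\theta J$ below are made-up stand-ins for real policy samples, and the final check confirms the step has KL length $\epsilon$ under the quadratic approximation:

```python
import numpy as np

# Natural gradient step: estimate F from sampled score vectors, solve for
# F^{-1} grad J, and pick alpha from epsilon as above.
rng = np.random.default_rng(0)
scores = rng.standard_normal((1000, 2)) * np.array([10.0, 0.1])  # grad log pi per sample
grad_J = np.array([1.0, 1.0])
eps = 0.01

F = scores.T @ scores / len(scores)       # Fisher matrix estimate E[g g^T]
nat_grad = np.linalg.solve(F, grad_J)     # F^{-1} grad J without an explicit inverse
alpha = np.sqrt(eps / (grad_J @ nat_grad))
theta_step = alpha * nat_grad             # theta <- theta + theta_step

# under the quadratic KL approximation, step^T F step equals eps exactly
print(theta_step @ F @ theta_step)
```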