cs2385_note5 (lec18-lec19) Variational Inference & Control as Inference

lec 18, 2026/4/7-4/9, Variational Inference

1.1 Latent variable models

p(x)p(x)p(x)是一个复杂的分布。p(z)p(z)p(z)是一个简单的分布 (e.g., Gaussian), 且p(x∣z)p(x|z)p(x∣z)也是一个简单的分布 (e.g., p(x∣z)=N(μnn(z),σnn(z))p(x|z)=\mathcal{N}(\mu_{\text{nn}}(z),\sigma_{\text{nn}}(z))p(x∣z)=N(μnn(z),σnn(z)), 即均值和标准差由神经网络给出), 那么根据p(x)=∫p(x∣z)p(z)dzp(x)=\int p(x|z)p(z)dzp(x)=∫p(x∣z)p(z)dz, 复杂的分布可以被简单的分布表示出来

1.2 How do we train latent variable models

the model: pθ(x)p_\theta(x)pθ(x); the data: D={x1,x2,x3,...,xN}\mathcal{D}=\{x_1,x_2,x_3,\dots,x_N\}D={x1,x2,x3,...,xN}

maximum likelihood fit: θ←arg⁡max⁡θ1N∑ilog⁡pθ(xi)\theta\leftarrow\arg\max_\theta\frac{1}{N}\sum_i\log{p_\theta(x_i)}θ←argmaxθN1∑ilogpθ(xi), 最大化真实样本的概率

根据p(x)=∫p(x∣z)p(z)dzp(x)=\int p(x|z)p(z)dzp(x)=∫p(x∣z)p(z)dz, 上式也可以写作θ←arg⁡max⁡θ1N∑ilog⁡∫pθ(xi∣z)p(z)dz\theta\leftarrow\arg\max_\theta\frac{1}{N}\sum_i\log{\int p_\theta(x_i|z)p(z)dz}θ←argmaxθN1∑ilog∫pθ(xi∣z)p(z)dz, 但算微分太难了

1.3 Estimating the log-likelihood

alternative: θ←arg⁡max⁡θ1N∑iEz∼p(z∣xi)log⁡pθ(xi,z)\theta\leftarrow\arg\max_\theta\frac{1}{N}\sum_i\mathbb{E}_{z\sim p(z|x_i)}\\log{p_\\theta(x_i,z)}θ←argmaxθN1∑iEz∼p(z∣xi)logpθ(xi,z)

intuition: "guess" most likely zzz given xix_ixi, and pretend it's the right one. there are many possible values of zzz so use the distribution p(z∣xi)p(z|x_i)p(z∣xi), but how do we calculate p(z∣xi)p(z|x_i)p(z∣xi)?

2.1 The variational approximation

approximate p(z∣xi)p(z|x_i)p(z∣xi) with qi(z)=N(μi,σi)q_i(z)=\mathcal{N}(\mu_i,\sigma_i)qi(z)=N(μi,σi)

log⁡p(xi)=log⁡∫zp(xi∣z)p(z)=log⁡∫zp(xi∣z)p(z)qi(z)qi(z)=log⁡Ez∼qi(z)p(xi∣z)p(z)qi(z)\log{p(x_i)}=\log{\int_z p(x_i|z)p(z)}=\log{\int_z p(x_i|z)p(z)\frac{q_i(z)}{q_i(z)}}=\log{\mathbb{E}_{z\sim q_i(z)}\\frac{p(x_i\|z)p(z)}{q_i(z)}}logp(xi)=log∫zp(xi∣z)p(z)=log∫zp(xi∣z)p(z)qi(z)qi(z)=logEz∼qi(z)qi(z)p(xi∣z)p(z)

根据Jensen's inequality: log⁡Ey≥Elog⁡y\log{\mathbb{E}y}\ge\mathbb{E}\\log{y}logEy≥Elogy, 上式≥Ez∼qi(z)log⁡p(xi∣z)p(z)qi(z)=Ez∼qi(z)log⁡p(xi∣z)+log⁡p(z)−Ez∼qi(z)log⁡qi(z)\ge\mathbb{E}{z\sim q_i(z)}\\log{\\frac{p(x_i\|z)p(z)}{q_i(z)}}=\mathbb{E}{z\sim q_i(z)}\\log{p(x_i\|z)}+\\log{p(z)}-\mathbb{E}{z\sim q_i(z)}\\log{q_i(z)}≥Ez∼qi(z)logqi(z)p(xi∣z)p(z)=Ez∼qi(z)logp(xi∣z)+logp(z)−Ez∼qi(z)logqi(z), −Ez∼qi(z)log⁡qi(z)=H(qi)-\mathbb{E}{z\sim q_i(z)}\\log{q_i(z)}=\mathcal{H}(q_i)−Ez∼qi(z)logqi(z)=H(qi)

maximizing log⁡p(xi∣z)\log{p(x_i|z)}logp(xi∣z) could maximize log⁡p(xi)\log{p(x_i)}logp(xi), 令Li(p,qi)=Ez∼qi(z)log⁡p(xi∣z)+log⁡p(z)+H(qi)\mathcal{L}i(p,q_i)=\mathbb{E}{z\sim q_i(z)}\\log{p(x_i\|z)}+\\log{p(z)}+\mathcal{H}(q_i)Li(p,qi)=Ez∼qi(z)logp(xi∣z)+logp(z)+H(qi)

what make a good qi(z)q_i(z)qi(z)? intuition: qi(z)q_i(z)qi(z) should approximate p(z∣xi)p(z|x_i)p(z∣xi), 因而想到KL-divergence: DKL(qi(z)∣∣p(z∣x))D_{\text{KL}}(q_i(z)||p(z|x))DKL(qi(z)∣∣p(z∣x))

DKL(qi(z)∣∣p(z∣x))=Ez∼qi(z)log⁡qi(z)p(z∣xi)=Ez∼qi(z)log⁡qi(z)p(xi)p(xi,z)D_{\text{KL}}(q_i(z)||p(z|x))=\mathbb{E}{z\sim q_i(z)}\\log{\\frac{q_i(z)}{p(z\|x_i)}}=\mathbb{E}{z\sim q_i(z)}\\log{\\frac{q_i(z)p(x_i)}{p(x_i,z)}}DKL(qi(z)∣∣p(z∣x))=Ez∼qi(z)logp(z∣xi)qi(z)=Ez∼qi(z)logp(xi,z)qi(z)p(xi)

=−Ez∼qi(z)log⁡p(xi∣z)+log⁡p(z)+Ez∼qi(z)log⁡qi(z)+Ez∼qi(z)log⁡p(xi)=-\mathbb{E}{z\sim q_i(z)}\\log{p(x_i\|z)}+\\log{p(z)}+\mathbb{E}{z\sim q_i(z)}\\log{q_i(z)}+\mathbb{E}_{z\sim q_i(z)}\\log{p(x_i)}=−Ez∼qi(z)logp(xi∣z)+logp(z)+Ez∼qi(z)logqi(z)+Ez∼qi(z)logp(xi)

=−Ez∼qi(z)log⁡p(xi∣z)+log⁡p(z)−H(qi)+log⁡p(xi)=−Li(p,qi)+log⁡p(xi)=-\mathbb{E}_{z\sim q_i(z)}\\log{p(x_i\|z)}+\\log{p(z)}-\mathcal{H}(q_i)+\log{p(x_i)}=-\mathcal{L}_i(p,q_i)+\log{p(x_i)}=−Ez∼qi(z)logp(xi∣z)+logp(z)−H(qi)+logp(xi)=−Li(p,qi)+logp(xi)

log⁡p(xi)=DKL(qi(z)∣∣p(z∣x))+Li(p,qi)≥Li(p,qi)\log{p(x_i)}=D_{\text{KL}}(q_i(z)||p(z|x))+\mathcal{L}_i(p,q_i)\ge\mathcal{L}_i(p,q_i)logp(xi)=DKL(qi(z)∣∣p(z∣x))+Li(p,qi)≥Li(p,qi)

转而θ←arg⁡max⁡θ1N∑iLi(p,qi)\theta\leftarrow\arg\max_\theta\frac{1}{N}\sum_i\mathcal{L}_i(p,q_i)θ←argmaxθN1∑iLi(p,qi)

2.2 How do we use this

for each xix_ixi (or mini-batch):

step 1: calculate ∇θLi(p,qi)\nabla_\theta\mathcal{L}i(p,q_i)∇θLi(p,qi): sample z∼qi(xi)z\sim q_i(x_i)z∼qi(xi), ∇θLi(p,qi)≈∇θlog⁡pθ(xi∣z)\nabla\theta\mathcal{L}i(p,q_i)\approx\nabla\theta\log{p_\theta(x_i|z)}∇θLi(p,qi)≈∇θlogpθ(xi∣z)

step 2: θ←θ+α∇θLi(p,qi)\theta\leftarrow\theta+\alpha\nabla_\theta\mathcal{L}_i(p,q_i)θ←θ+α∇θLi(p,qi)

step 3: update qiq_iqi to maximize Li(p,qi)\mathcal{L}_i(p,q_i)Li(p,qi)

2.3 What's the problem?

let's say qi(z)=N(μi,σi)q_i(z)=\mathcal{N}(\mu_i,\sigma_i)qi(z)=N(μi,σi), use gradient ∇μiLi(p,qi)\nabla_{\mu_i}\mathcal{L}i(p,q_i)∇μiLi(p,qi) and ∇σiLi(p,qi)\nabla{\sigma_i}\mathcal{L}_i(p,q_i)∇σiLi(p,qi), gradient ascent on μi\mu_iμi, σi\sigma_iσi

q(z)q(z)q(z)的参数总量是∣θ∣+(∣μi∣+∣σi∣)×N|\theta|+(|\mu_i|+|\sigma_i|)\times N∣θ∣+(∣μi∣+∣σi∣)×N, 如果样本量NNN很大, 这个网络很爆炸

intuition: qi(z)q_i(z)qi(z) should approximate p(z∣xi)p(z|x_i)p(z∣xi), what if we learn a network qi(z)=q(z∣xi)≈p(z∣xi)q_i(z)=q(z|x_i)\approx p(z|x_i)qi(z)=q(z∣xi)≈p(z∣xi)? (既然训练qi(z)q_i(z)qi(z)的目的是去逼近p(z∣xi)p(z|x_i)p(z∣xi), 而p(z∣xi)p(z|x_i)p(z∣xi)本质上是从xxx到zzz的映射, 为何不直接训练一个神经网络来完成此映射呢?)

3.1 Amortized variational inference

训练网络qϕ(z∣x)=N(μϕ(x),σϕ(x))q_\phi(z|x)=\mathcal{N}(\mu_\phi(x),\sigma_\phi(x))qϕ(z∣x)=N(μϕ(x),σϕ(x)), 根据样本xxx生成对应的分布; 训练网络pθ(x∣z)p_\theta(x|z)pθ(x∣z), 和预先定好的p(z)p(z)p(z)相乘得到p(x,z)p(x,z)p(x,z)对zzz求marginal期望得到想要的p(x)p(x)p(x)

log⁡p(xi)≥Li(p,qi)=Ez∼qϕ(z∣xi)log⁡pθ(xi∣z)+log⁡p(z)+H(qϕ(z∣xi))=L(pθ(xi∣z),qϕ(z∣xi))\log{p(x_i)}\ge\mathcal{L}i(p,q_i)=\mathbb{E}{z\sim q_\phi(z|x_i)}\\log{p_\\theta(x_i\|z)}+\\log{p(z)}+\mathcal{H}(q_\phi(z|x_i))=\mathcal{L}(p_\theta(x_i|z),q_\phi(z|x_i))logp(xi)≥Li(p,qi)=Ez∼qϕ(z∣xi)logpθ(xi∣z)+logp(z)+H(qϕ(z∣xi))=L(pθ(xi∣z),qϕ(z∣xi))

for each xix_ixi (or mini-batch):

step 1: calculate ∇θL(pθ(xi∣z),qϕ(z∣xi))\nabla_\theta\mathcal{L}(p_\theta(x_i|z),q_\phi(z|x_i))∇θL(pθ(xi∣z),qϕ(z∣xi)): sample z∼qϕ(z∣xi)z\sim q_\phi(z|x_i)z∼qϕ(z∣xi), ∇θL≈∇θlog⁡pθ(xi∣z)\nabla_\theta\mathcal{L}\approx\nabla_\theta\log{p_\theta(x_i|z)}∇θL≈∇θlogpθ(xi∣z)

step 2: θ←θ+α∇θL\theta\leftarrow\theta+\alpha\nabla_\theta\mathcal{L}θ←θ+α∇θL

step 3: ϕ←ϕ+α∇ϕL\phi\leftarrow\phi+\alpha\nabla_\phi\mathcal{L}ϕ←ϕ+α∇ϕL

3.1.1 How to calculate ∇ϕL\nabla_\phi\mathcal{L}∇ϕL

J(ϕ)=Ez∼qϕ(z∣xi)log⁡pθ(xi∣z)+log⁡p(z)=Ez∼qϕ(z∣xi)r(xi,z)J(\phi)=\mathbb{E}{z\sim q\phi(z|x_i)}\\log{p_\\theta(x_i\|z)}+\\log{p(z)}=\mathbb{E}{z\sim q\phi(z|x_i)}r(x_i,z)J(ϕ)=Ez∼qϕ(z∣xi)logpθ(xi∣z)+logp(z)=Ez∼qϕ(z∣xi)r(xi,z), can just use policy gradient: ∇J(ϕ)≈1M∑j∇ϕlog⁡qϕ(zj∣xi)r(xi,zj)\nabla J(\phi)\approx\frac{1}{M}\sum_j\nabla_\phi\log{q_\phi(z_j|x_i)}r(x_i,z_j)∇J(ϕ)≈M1∑j∇ϕlogqϕ(zj∣xi)r(xi,zj)

重参数J(ϕ)=Ez∼qϕ(z∣xi)r(xi,z)=Ez∼qϕ(z∣xi)r(xi,μϕ(xi)+ϵσϕ(xi))J(\phi)=\mathbb{E}{z\sim q\phi(z|x_i)}r(x_i,z)=\mathbb{E}{z\sim q\phi(z|x_i)}r(x_i,\\mu_\\phi(x_i)+\\epsilon\\sigma_\\phi(x_i))J(ϕ)=Ez∼qϕ(z∣xi)r(xi,z)=Ez∼qϕ(z∣xi)r(xi,μϕ(xi)+ϵσϕ(xi))之后, estimating ∇ϕJ(ϕ)\nabla_\phi J(\phi)∇ϕJ(ϕ): sample ϵ1,...,ϵM\epsilon_1,\dots,\epsilon_Mϵ1,...,ϵM from N(0,1)\mathcal{N}(0,1)N(0,1), ∇J(ϕ)≈1M∑j∇ϕr(xi,μϕ(xi)+ϵσϕ(xi))\nabla J(\phi)\approx\frac{1}{M}\sum_j\nabla_\phi r(x_i,\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))∇J(ϕ)≈M1∑j∇ϕr(xi,μϕ(xi)+ϵσϕ(xi))

3.1.2 another way to look at it

Li=Ez∼qϕ(z∣xi)log⁡pθ(xi∣z)+Ez∼qϕ(z∣xi)log⁡p(z)+H(qϕ(z∣xi))\mathcal{L}i=\mathbb{E}{z\sim q_\phi(z|x_i)}\\log{p_\\theta(x_i\|z)}+\mathbb{E}{z\sim q\phi(z|x_i)}\\log{p(z)}+\mathcal{H}(q_\phi(z|x_i))Li=Ez∼qϕ(z∣xi)logpθ(xi∣z)+Ez∼qϕ(z∣xi)logp(z)+H(qϕ(z∣xi))

=Ez∼qϕ(z∣xi)log⁡pθ(xi∣z)−DKL(qϕ(z∣xi)∣∣p(z))=\mathbb{E}{z\sim q\phi(z|x_i)}\\log{p_\\theta(x_i\|z)}-D_{\text{KL}}(q_\phi(z|x_i)||p(z))=Ez∼qϕ(z∣xi)logpθ(xi∣z)−DKL(qϕ(z∣xi)∣∣p(z))

=Eϵ∼N(0,1)log⁡pθ(xi∣μϕ(xi)+ϵσϕ(xi))−DKL(qϕ(z∣xi)∣∣p(z))≈log⁡pθ(xi∣μϕ(xi)+ϵσϕ(xi))−DKL(qϕ(z∣xi)∣∣p(z))=\mathbb{E}{\epsilon\sim\mathcal{N}(0,1)}\\log{p_\\theta(x_i\|\\mu_\\phi(x_i)+\\epsilon\\sigma_\\phi(x_i))}-D{\text{KL}}(q_\phi(z|x_i)||p(z))\approx\log{p_\theta(x_i|\mu_\phi(x_i)+\epsilon\sigma_\phi(x_i))}-D_{\text{KL}}(q_\phi(z|x_i)||p(z))=Eϵ∼N(0,1)logpθ(xi∣μϕ(xi)+ϵσϕ(xi))−DKL(qϕ(z∣xi)∣∣p(z))≈logpθ(xi∣μϕ(xi)+ϵσϕ(xi))−DKL(qϕ(z∣xi)∣∣p(z))

4.1 Example applications

4.1.1 Representation learning - zzz is a representation of sss

step 1: train VAE on states in replay buffer R\mathcal{R}R

step 2: run RL, using zzz as the state instead of sss

4.1.2 Conditional models - Multimodal imitation learning

Li=Ez∼qϕ(z∣xi,yi)log⁡pθ(yi∣xi,z)+log⁡p(z∣xi)+H(qϕ(z∣xi,yi))\mathcal{L}i=\mathbb{E}{z\sim q_\phi(z|x_i,y_i)}\\log{p_\\theta(y_i\|x_i,z)}+\\log{p(z\|x_i)}+\mathcal{H}(q_\phi(z|x_i,y_i))Li=Ez∼qϕ(z∣xi,yi)logpθ(yi∣xi,z)+logp(z∣xi)+H(qϕ(z∣xi,yi)), generating yiy_iyi and everything is conditioned on xix_ixi

p(z)p(z)p(z) can optionally depend on xxx

at test time: z∼p(z∣xi)z\sim p(z|x_i)z∼p(z∣xi), y∼p(y∣xi,z)y\sim p(y|x_i,z)y∼p(y∣xi,z)

4.1.3 State space models

xxx: (o1,...,oT)(o_1,\dots,o_T)(o1,...,oT), zzz: (z1,...,zT)(z_1,\dots,z_T)(z1,...,zT)

prior: p(z)=p(zi)∏tp(zt+1∣zt,at)p(z)=p(z_i)\prod_tp(z_{t+1}|z_t,a_t)p(z)=p(zi)∏tp(zt+1∣zt,at), 其中p(z1)∼N(0,I)p(z_1)\sim\mathcal{N}(0,I)p(z1)∼N(0,I)

decoder: pθ(o∣z)=∏tp(ot∣zt)p_\theta(o|z)=\prod_tp(o_t|z_t)pθ(o∣z)=∏tp(ot∣zt)

encoder: qϕ(z∣o)=∏tqϕ(zt∣o1:t)q_\phi(z|o)=\prod_tq_\phi(z_t|o_{1:t})qϕ(z∣o)=∏tqϕ(zt∣o1:t)

4.1.4 Representation learning and model-based RL

lec 19, 2026/4/10-4/13, Control as Inference

1.1 Inference=planning

引入随机变量Ot\mathcal{O}tOt表示在时间步ttt是否是最优的。已知我们在所有时间步都是最优的 (即O1:T=1\mathcal{O}{1:T}=1O1:T=1), 推断最可能的动作序列ata_tat和状态序列sts_tst。最优概率p(Ot∣st,at)∝exp⁡(r(st,at))p(\mathcal{O}_t|s_t,a_t)\propto\exp{(r(s_t,a_t))}p(Ot∣st,at)∝exp(r(st,at))

1.2 how to do inference?

step 1: compute backward messages βt(st,at)=p(Ot:T∣st,at)\beta_t(s_t,a_t)=p(\mathcal{O}_{t:T}|s_t,a_t)βt(st,at)=p(Ot:T∣st,at) (probability that we can be optimal at steps ttt through TTT given that we take action ata_tat in state sts_tst)

step 2: compute policy p(at∣st,O1:T)p(a_t|s_t,\mathcal{O}_{1:T})p(at∣st,O1:T)

step 3: compute forward messages αt(st)=p(st∣O1:t−1)\alpha_t(s_t)=p(s_t|\mathcal{O}_{1:t-1})αt(st)=p(st∣O1:t−1) (在过去一直保持最优的情况下, 当前处于状态sts_tst的概率)

2.1 Backward messages

βt(st,at)=p(Ot:T∣st,at)=∫p(Ot:T,st+1∣st,at)dst+1=∫p(Ot+1:T∣st+1)p(st+1∣st,at)p(Ot∣st,at)dst+1\beta_t(s_t,a_t)=p(\mathcal{O}{t:T}|s_t,a_t)=\int p(\mathcal{O}{t:T},s_{t+1}|s_t,a_t)ds_{t+1}=\int p(\mathcal{O}{t+1:T}|s{t+1})p(s_{t+1}|s_t,a_t)p(\mathcal{O}t|s_t,a_t)ds{t+1}βt(st,at)=p(Ot:T∣st,at)=∫p(Ot:T,st+1∣st,at)dst+1=∫p(Ot+1:T∣st+1)p(st+1∣st,at)p(Ot∣st,at)dst+1

p(Ot+1:T∣st+1)=∫p(Ot+1:T∣st+1,at+1)p(at+1∣st+1)dat+1p(\mathcal{O}{t+1:T}|s{t+1})=\int p(\mathcal{O}{t+1:T}|s{t+1},a_{t+1})p(a_{t+1}|s_{t+1})da_{t+1}p(Ot+1:T∣st+1)=∫p(Ot+1:T∣st+1,at+1)p(at+1∣st+1)dat+1, p(Ot+1:T∣st+1,at+1)=βt(st+1∣at+1)p(\mathcal{O}{t+1:T}|s{t+1},a_{t+1})=\beta_t(s_{t+1}|a_{t+1})p(Ot+1:T∣st+1,at+1)=βt(st+1∣at+1), p(at+1∣st+1)p(a_{t+1}|s_{t+1})p(at+1∣st+1)假设是先验概率, which actions are likely a priori, assume uniform for now

for t=T−1t=T-1t=T−1 to 111:

​ βt(st,at)=p(Ot∣st,at)Est+1∼p(st+1∣st,at)βt+1(st+1)\beta_t(s_t,a_t)=p(\mathcal{O}t|s_t,a_t)\mathbb{E}{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}\\beta_{t+1}(s_{t+1})βt(st,at)=p(Ot∣st,at)Est+1∼p(st+1∣st,at)βt+1(st+1)

​ βt(st)=Eat∼p(at∣st)βt(st,at)\beta_t(s_t)=\mathbb{E}_{a_t\sim p(a_t|s_t)}\\beta_t(s_t,a_t)βt(st)=Eat∼p(at∣st)βt(st,at)

2.1.1 A closer look at the backward pass

βt(st,at)=p(Ot:T∣st,at)\beta_t(s_t,a_t)=p(\mathcal{O}_{t:T}|s_t,a_t)βt(st,at)=p(Ot:T∣st,at)是从当前时间步ttt为最优的概率一直连乘到最终步TTT, 因此log of βt\beta_tβt is "Q-function-like"

let Vt(st)=log⁡βt(st)V_t(s_t)=\log{\beta_t(s_t)}Vt(st)=logβt(st), Qt(st,at)=log⁡βt(st,at)Q_t(s_t,a_t)=\log{\beta_t(s_t,a_t)}Qt(st,at)=logβt(st,at)

Vt(st)=log⁡∫exp⁡(Qt(st,at))datV_t(s_t)=\log\int\exp(Q_t(s_t,a_t))da_tVt(st)=log∫exp(Qt(st,at))dat, 有点像softmax

Vt(st)→max⁡atQt(st,at)V_t(s_t)\to\max_{a_t}Q_t(s_t,a_t)Vt(st)→maxatQt(st,at) as Qt(st,at)Q_t(s_t,a_t)Qt(st,at) gets bigger

根据 p(Ot∣st,at)=exp⁡(r(st,at))p(\mathcal{O}t|s_t,a_t)=\exp{(r(s_t,a_t))}p(Ot∣st,at)=exp(r(st,at)), βt(st,at)=p(Ot∣st,at)Est+1∼p(st+1∣st,at)βt+1(st+1)\beta_t(s_t,a_t)=p(\mathcal{O}t|s_t,a_t)\mathbb{E}{s{t+1}\sim p(s_{t+1}|s_t,a_t)}\\beta_{t+1}(s_{t+1})βt(st,at)=p(Ot∣st,at)Est+1∼p(st+1∣st,at)βt+1(st+1)可以化成exp⁡(Qt(st,at))=exp⁡(r(st,at))Est+1exp⁡(Vt+1(st+1))\exp(Q_t(s_t,a_t))=\exp(r(s_t,a_t))\mathbb{E}{s{t+1}}\\exp(V_{t+1}(s_{t+1}))exp(Qt(st,at))=exp(r(st,at))Est+1exp(Vt+1(st+1)), 等号两边同时取对数得到Qt(st,at)=r(st,at)+log⁡Est+1exp⁡(Vt+1(st+1))Q_t(s_t,a_t)=r(s_t,a_t)+\log\mathbb{E}{s{t+1}}\\exp(V_{t+1}(s_{t+1}))Qt(st,at)=r(st,at)+logEst+1exp(Vt+1(st+1)), 有点像Bellman

2.2 The action prior

what if the action prior p(at+1∣st+1)p(a_{t+1}|s_{t+1})p(at+1∣st+1) is not uniform?

V(st)=log⁡∫exp⁡(Q(st,at)+log⁡p(at∣st))datV(s_t)=\log\int\exp(Q(s_t,a_t)+\log p(a_t|s_t))da_tV(st)=log∫exp(Q(st,at)+logp(at∣st))dat

Q(st,at)=r(st,at)+log⁡Eexp⁡(V(st+1))Q(s_t,a_t)=r(s_t,a_t)+\log\mathbb{E}\\exp(V(s_{t+1}))Q(st,at)=r(st,at)+logEexp(V(st+1))

构造不同的reward function accounts for action prior: let Q~(st,at)=r(st,at)+log⁡p(at∣st)+log⁡Eexp⁡(V(st+1))\tilde{Q}(s_t,a_t)=r(s_t,a_t)+\log p(a_t|s_t)+\log\mathbb{E}\\exp(V(s_{t+1}))Q~(st,at)=r(st,at)+logp(at∣st)+logEexp(V(st+1)), 则V(st)=log⁡∫exp⁡(Q~(st,at))datV(s_t)=\log\int\exp(\tilde{Q}(s_t,a_t))da_tV(st)=log∫exp(Q~(st,at))dat和原先的形式一样

can always fold the action prior into the reward; action prior can be assumed without loss of generality

2.3 Policy computation p(at∣st,O1:T)p(a_t|s_t,\mathcal{O}_{1:T})p(at∣st,O1:T)

p(at∣st,O1:T)=π(at∣st)=p(at∣st,Ot:T)=p(at,st∣Ot:T)p(st∣Ot:T)p(a_t|s_t,\mathcal{O}{1:T})=\pi(a_t|s_t)=p(a_t|s_t,\mathcal{O}{t:T})=\frac{p(a_t,s_t|\mathcal{O}{t:T})}{p(s_t|\mathcal{O}{t:T})}p(at∣st,O1:T)=π(at∣st)=p(at∣st,Ot:T)=p(st∣Ot:T)p(at,st∣Ot:T)

=p(Ot:T∣at,st)p(at,st)/p(Ot:T)p(Ot:T∣st)p(st)/p(Ot:T)=p(Ot:T∣at,st)p(Ot:T∣st)p(at,st)p(st)=βt(st,at)βt(st)p(at∣st)=\frac{p(\mathcal{O}{t:T}|a_t,s_t)p(a_t,s_t)/p(\mathcal{O}{t:T})}{p(\mathcal{O}{t:T}|s_t)p(s_t)/p(\mathcal{O}{t:T})}=\frac{p(\mathcal{O}{t:T}|a_t,s_t)}{p(\mathcal{O}{t:T}|s_t)}\frac{p(a_t,s_t)}{p(s_t)}=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}p(a_t|s_t)=p(Ot:T∣st)p(st)/p(Ot:T)p(Ot:T∣at,st)p(at,st)/p(Ot:T)=p(Ot:T∣st)p(Ot:T∣at,st)p(st)p(at,st)=βt(st)βt(st,at)p(at∣st)

p(at∣st)p(a_t|s_t)p(at∣st)是action prior, assume uniform是常数

π(at∣st)=βt(st,at)βt(st)=exp⁡(Qt(st,at)−Vt(st))=exp⁡(At(st,at))\pi(a_t|s_t)=\frac{\beta_t(s_t,a_t)}{\beta_t(s_t)}=\exp(Q_t(s_t,a_t)-V_t(s_t))=\exp(A_t(s_t,a_t))π(at∣st)=βt(st)βt(st,at)=exp(Qt(st,at)−Vt(st))=exp(At(st,at))

with temperature: π(at∣st)=exp⁡(1αAt(st,at))\pi(a_t|s_t)=\exp(\frac{1}{\alpha}A_t(s_t,a_t))π(at∣st)=exp(α1At(st,at)), approaches greedy policy as temperature decreases

2.4 Forward messages

αt(st)=p(st∣O1:t−1)=∫p(st,st−1,at−1∣O1:t−1)dst−1dat−1\alpha_t(s_t)=p(s_t|\mathcal{O}{1:t-1})=\int p(s_t,s{t-1},a_{t-1}|\mathcal{O}{1:t-1})ds{t-1}da_{t-1}αt(st)=p(st∣O1:t−1)=∫p(st,st−1,at−1∣O1:t−1)dst−1dat−1

=∫p(st∣st−1,at−1,O1:t−1)p(at−1∣st−1,O1:t−1)p(st−1∣O1:t−1)dst−1dat−1=\int p(s_t|s_{t-1},a_{t-1},\mathcal{O}{1:t-1})p(a{t-1}|s_{t-1},\mathcal{O}{1:t-1})p(s{t-1}|\mathcal{O}{1:t-1})ds{t-1}da_{t-1}=∫p(st∣st−1,at−1,O1:t−1)p(at−1∣st−1,O1:t−1)p(st−1∣O1:t−1)dst−1dat−1

状态转移与是否最优O\mathcal{O}O无关, 因此p(st∣st−1,at−1,O1:t−1)=p(st∣st−1,at−1)p(s_t|s_{t-1},a_{t-1},\mathcal{O}{1:t-1})=p(s_t|s{t-1},a_{t-1})p(st∣st−1,at−1,O1:t−1)=p(st∣st−1,at−1)

根据贝叶斯公式P(A∣B,C)=P(A,B,C)P(B,C)=P(C∣A,B)P(A,B)P(C∣B)P(B)=P(C∣A,B)P(A∣B)P(B)P(C∣B)P(B)=P(C∣A,B)p(A∣B)P(C∣B)P(A|B,C)=\frac{P(A,B,C)}{P(B,C)}=\frac{P(C|A,B)P(A,B)}{P(C|B)P(B)}=\frac{P(C|A,B)P(A|B)P(B)}{P(C|B)P(B)}=\frac{P(C|A,B)p(A|B)}{P(C|B)}P(A∣B,C)=P(B,C)P(A,B,C)=P(C∣B)P(B)P(C∣A,B)P(A,B)=P(C∣B)P(B)P(C∣A,B)P(A∣B)P(B)=P(C∣B)P(C∣A,B)p(A∣B),
p(at−1∣st−1,O1:t−1)p(st−1∣O1:t−1)=p(Ot−1∣st−1,at−1)p(at−1∣st−1)p(Ot−1∣st−1)p(Ot−1∣st−1)p(st−1∣O1:t−2)p(Ot−1∣O1:t−2)p(a_{t-1}|s_{t-1},\mathcal{O}{1:t-1})p(s{t-1}|\mathcal{O}{1:t-1})=\frac{p(\mathcal{O}{t-1}|s_{t-1},a_{t-1})p(a_{t-1}|s_{t-1})}{p(\mathcal{O}{t-1}|s{t-1})}\frac{p(\mathcal{O}{t-1}|s{t-1})p(s_{t-1}|\mathcal{O}{1:t-2})}{p(\mathcal{O}{t-1}|\mathcal{O}{1:t-2})}p(at−1∣st−1,O1:t−1)p(st−1∣O1:t−1)=p(Ot−1∣st−1)p(Ot−1∣st−1,at−1)p(at−1∣st−1)p(Ot−1∣O1:t−2)p(Ot−1∣st−1)p(st−1∣O1:t−2) (后一项的O1:t−1\mathcal{O}{1:t-1}O1:t−1可以拆成O1:t−2,Ot−1\mathcal{O}{1:t-2},\mathcal{O}{t-1}O1:t−2,Ot−1)
=p(Ot−1∣st−1,at−1)p(at−1∣st−1)p(st−1∣O1:t−2)p(O1:t−2)=\frac{p(\mathcal{O}{t-1}|s{t-1},a_{t-1})p(a_{t-1}|s_{t-1})p(s_{t-1}|\mathcal{O}{1:t-2})}{p(\mathcal{O}{1:t-2})}=p(O1:t−2)p(Ot−1∣st−1,at−1)p(at−1∣st−1)p(st−1∣O1:t−2), 其中p(st−1∣O1:t−2)=αt−1(st−1)p(s_{t-1}|\mathcal{O}{1:t-2})=\alpha{t-1}(s_{t-1})p(st−1∣O1:t−2)=αt−1(st−1)

what if we want p(st∣O1:T)p(s_t|\mathcal{O}_{1:T})p(st∣O1:T)? 状态估计的时候, 未来的表现可以反过来修正对当前状态的认知

p(st∣O1:T)=p(st,O1:T)p(O1:T)=p(Ot:T∣st,O1:t−1)p(st,O1:t−1)p(O1:T)=p(Ot:T∣st)p(st,O1:t−1)p(O1:T)p(s_t|\mathcal{O}{1:T})=\frac{p(s_t,\mathcal{O}{1:T})}{p(\mathcal{O}{1:T})}=\frac{p(\mathcal{O}{t:T}|s_t,\mathcal{O}{1:t-1})p(s_t,\mathcal{O}{1:t-1})}{p(\mathcal{O}{1:T})}=\frac{p(\mathcal{O}{t:T}|s_t)p(s_t,\mathcal{O}{1:t-1})}{p(\mathcal{O}{1:T})}p(st∣O1:T)=p(O1:T)p(st,O1:T)=p(O1:T)p(Ot:T∣st,O1:t−1)p(st,O1:t−1)=p(O1:T)p(Ot:T∣st)p(st,O1:t−1) (p(Ot:T∣st,O1:t−1)=p(Ot:T∣st)p(\mathcal{O}{t:T}|s_t,\mathcal{O}{1:t-1})=p(\mathcal{O}_{t:T}|s_t)p(Ot:T∣st,O1:t−1)=p(Ot:T∣st), 未来表现好不好取决于当下的状态sts_tst, 与过去表现好不好无关)

p(Ot:T∣st)=βt(st)p(\mathcal{O}{t:T}|s_t)=\beta_t(s_t)p(Ot:T∣st)=βt(st), p(st∣O1:T)∝βt(st)p(st∣O1:t−1)p(O1:t−1)p(s_t|\mathcal{O}{1:T})\propto\beta_t(s_t)p(s_t|\mathcal{O}{1:t-1})p(\mathcal{O}{1:t-1})p(st∣O1:T)∝βt(st)p(st∣O1:t−1)p(O1:t−1), 其中p(st∣O1:t−1)=αt(st)p(s_t|\mathcal{O}{1:t-1})=\alpha_t(s_t)p(st∣O1:t−1)=αt(st), p(O1:t−1)p(\mathcal{O}{1:t-1})p(O1:t−1)和sts_tst无关是神秘unknown constant, 因此p(st∣O1:T)∝βt(st)αt(st)p(s_t|\mathcal{O}_{1:T})\propto\beta_t(s_t)\alpha_t(s_t)p(st∣O1:T)∝βt(st)αt(st)

3.1 The optimism problem

Q(st,at)=r(st,at)+log⁡Eexp⁡(V(st+1))Q(s_t,a_t)=r(s_t,a_t)+\log\mathbb{E}\\exp(V(s_{t+1}))Q(st,at)=r(st,at)+logEexp(V(st+1)), log⁡Eexp⁡(V(st+1))\log\mathbb{E}\\exp(V(s_{t+1}))logEexp(V(st+1))会使agent倾向于发生概率低但一旦发生回报极高的状态转移 ("optimistic" transition)。

3.1.1 the inference problem: p(s1:T,a1:T∣O1:T)p(s_{1:T},a_{1:T}|\mathcal{O}_{1:T})p(s1:T,a1:T∣O1:T)

"given that you obtained high reward, what was your action probability?" marginalizing and conditioning, we get: p(at∣st,O1:T)p(a_t|s_t,\mathcal{O}_{1:T})p(at∣st,O1:T) (the policy)

"given that you obtained high reward, what was your transition probability?" marginalizing and conditioning, we get: p(st+1∣st,at,O1:T)p(s_{t+1}|s_t,a_t,\mathcal{O}{1:T})p(st+1∣st,at,O1:T), 但p(st+1∣st,at,O1:T)≠p(st+1∣st,at)p(s{t+1}|s_t,a_t,\mathcal{O}{1:T})\ne p(s{t+1}|s_t,a_t)p(st+1∣st,at,O1:T)=p(st+1∣st,at) (从一定成功的前提去回推状态转移时, 推断出环境的随机抖动也会配合达成目标, 即直接计算后验概率会导致dynamics被扭曲)

3.2 Addressing the optimism problem

"given that you obtained high reward, what was your action probability, given that your transition probability did not change?"

can we find another distribution q(s1:T,a1:T)q(s_{1:T},a_{1:T})q(s1:T,a1:T) that is close to p(s1:T,a1:T∣O1:T)p(s_{1:T},a_{1:T}|\mathcal{O}{1:T})p(s1:T,a1:T∣O1:T) but has dynamics p(st+1∣st,at)p(s{t+1}|s_t,a_t)p(st+1∣st,at)?

let x=O1:Tx=\mathcal{O}{1:T}x=O1:T and z=(s1:T,a1:T)z=(s{1:T},a_{1:T})z=(s1:T,a1:T), find q(z)q(z)q(z) to approximate p(z∣x)p(z|x)p(z∣x)

let q(s1:T,a1:T)=p(s1)∏tp(st+1∣st,at)q(at∣st)q(s_{1:T},a_{1:T})=p(s_1)\prod_tp(s_{t+1}|s_t,a_t)q(a_t|s_t)q(s1:T,a1:T)=p(s1)∏tp(st+1∣st,at)q(at∣st), 其中p(s1)p(s_1)p(s1)和p(st+1∣st,at)p(s_{t+1}|s_t,a_t)p(st+1∣st,at)是same dynamics and initial state as ppp, 唯一能控制的只有动作分布q(at∣st)q(a_t|s_t)q(at∣st)

根据概率论的链式法则, p(O1:T,s1:T,a1:T)=p(s1)∏tp(at∣st,at−1)p(st+1∣st,at)p(Ot∣st,at)=p(s1)∏tp(at∣st)p(st+1∣st,at)p(Ot∣st,at)p(\mathcal{O}{1:T},s{1:T},a_{1:T})=p(s_1)\prod_tp(a_t|s_t,a_{t-1})p(s_{t+1}|s_{t},a_{t})p(\mathcal{O}t|s{t},a_{t})=p(s_1)\prod_tp(a_t|s_t)p(s_{t+1}|s_{t},a_{t})p(\mathcal{O}t|s{t},a_{t})p(O1:T,s1:T,a1:T)=p(s1)∏tp(at∣st,at−1)p(st+1∣st,at)p(Ot∣st,at)=p(s1)∏tp(at∣st)p(st+1∣st,at)p(Ot∣st,at), 其中先验p(at∣st)p(a_t|s_t)p(at∣st)是常数项可以略去

代入ELBO公式log⁡p(x)≥Ez∼p(z)log⁡p(x,z)−log⁡q(z)\log p(x)\ge\mathbb{E}{z\sim p(z)}\\log p(x,z)-\\log q(z)logp(x)≥Ez∼p(z)logp(x,z)−logq(z),
log⁡p(O1:T)≥E(s1:T,a1:T)∼qlog⁡p(s1)+∑t=1Tlog⁡p(st+1∣st,at)+∑t=1Tlog⁡p(Ot∣st,at)−log⁡p(s1)−∑t=1Tlog⁡p(st+1∣st,at)−∑t=1Tlog⁡q(at∣st)\log p(\mathcal{O}
{1:T})\ge\mathbb{E}{(s{1:T},a_{1:T})\sim q}\\log p(s_1)+\\sum\^T_{t=1}\\log p(s_{t+1}\|s_t,a_t)+\\sum\^T_{t=1}\\log p(\\mathcal{O}_t\|s_t,a_t)-\\log p(s_1)-\\sum\^T_{t=1}\\log p(s_{t+1}\|s_t,a_t)-\\sum\^T_{t=1}\\log q(a_t\|s_t)logp(O1:T)≥E(s1:T,a1:T)∼qlogp(s1)+∑t=1Tlogp(st+1∣st,at)+∑t=1Tlogp(Ot∣st,at)−logp(s1)−∑t=1Tlogp(st+1∣st,at)−∑t=1Tlogq(at∣st)

log⁡p(O1:T)≥E(s1:T,a1:T)∼q∑t=1Tlog⁡p(Ot∣st,at)−∑t=1Tlog⁡q(at∣st)=E(s1:T,a1:T)∼q∑t=1T(r(st,at)−log⁡q(at∣st))=∑tE(st,at)∼qr(st,at)+H(q(at∣st))\log p(\mathcal{O}{1:T})\ge\mathbb{E}{(s_{1:T},a_{1:T})\sim q}\\sum\^T_{t=1}\\log p(\\mathcal{O}_t\|s_t,a_t)-\\sum\^T_{t=1}\\log q(a_t\|s_t)=\mathbb{E}{(s{1:T},a_{1:T})\sim q}\\sum\^T_{t=1}\\left(r(s_t,a_t)-\\log q(a_t\|s_t)\\right)=\sum_t\mathbb{E}_{(s_t,a_t)\sim q}r(s_t,a_t)+\\mathcal{H}(q(a_t\|s_t))logp(O1:T)≥E(s1:T,a1:T)∼q∑t=1Tlogp(Ot∣st,at)−∑t=1Tlogq(at∣st)=E(s1:T,a1:T)∼q∑t=1T(r(st,at)−logq(at∣st))=∑tE(st,at)∼qr(st,at)+H(q(at∣st))

maximize reward and action entropy

3.3 Optimizing the variational lower bound

base case: solve for q(aT∣sT)q(a_T|s_T)q(aT∣sT), 最大化p(O1:T)p(\mathcal{O}_{1:T})p(O1:T)的下界

q(aT∣sT)=arg⁡max⁡EsT∼q(sT)EaT∼q(aT∣sT)\[r(sT,aT)+H(q(aT∣sT))]q(a_T|s_T)=\arg\max\mathbb{E}_{s_T\sim q(s_T)}\\mathbb{E}_{a_T\\sim q(a_T\|s_T)}\[r(s_T,a_T)+\mathcal{H}(q(a_T|s_T))]q(aT∣sT)=argmaxEsT∼q(sT)EaT∼q(aT∣sT)\[r(sT,aT)+H(q(aT∣sT))]

=arg⁡max⁡EsT∼q(sT)EaT∼q(aT∣sT)\[r(sT,aT)−log⁡p(aT∣sT)]=\arg\max\mathbb{E}_{s_T\sim q(s_T)}\\mathbb{E}_{a_T\\sim q(a_T\|s_T)}\[r(s_T,a_T)-\\log p(a_T\|s_T)]=argmaxEsT∼q(sT)EaT∼q(aT∣sT)\[r(sT,aT)−logp(aT∣sT)]

optimized when q(aT∣sT)∝exp⁡(r(sT,aT))q(a_T|s_T)\propto\exp(r(s_T,a_T))q(aT∣sT)∝exp(r(sT,aT)), 最后一刻动作被选中的概率与该动作带来的奖励成正比: q(aT∣sT)=exp⁡(r(sT,aT))∫exp⁡(r(sT,aT))daT=exp⁡(Q(sT,aT)−V(sT))q(a_T|s_T)=\frac{\exp(r(s_T,a_T))}{\int\exp(r(s_T,a_T))da_T}=\exp(Q(s_T,a_T)-V(s_T))q(aT∣sT)=∫exp(r(sT,aT))daTexp(r(sT,aT))=exp(Q(sT,aT)−V(sT))

EsT∼q(sT)EaT∼q(aT∣sT)\[r(sT,aT)−log⁡q(aT∣sT)]=EsT∼q(sT)EaT∼q(aT∣sT)\[V(sT)]\mathbb{E}{s_T\sim q(s_T)}\left\\mathbb{E}_{a_T\\sim q(a_T\|s_T)}\[r(s_T,a_T)-\\log q(a_T\|s_T)\right]=\mathbb{E}{s_T\sim q(s_T)}\\mathbb{E}_{a_T\\sim q(a_T\|s_T)}\[V(s_T)]EsT∼q(sT)EaT∼q(aT∣sT)\[r(sT,aT)−logq(aT∣sT)]=EsT∼q(sT)EaT∼q(aT∣sT)\[V(sT)]

来到非base case, q(at∣st)=arg⁡max⁡Est∼q(st)Eat∼q(at∣st)\[r(st,at)+Est+1∼p(st+1∣st,at)\[V(st+1)]+H(q(at∣st))]q(a_t|s_t)=\arg\max\mathbb{E}_{s_t\sim q(s_t)}\\mathbb{E}_{a_t\\sim q(a_t\|s_t)}\[r(s_t,a_t)+\\mathbb{E}_{s_{t+1}\\sim p(s_{t+1}\|s_t,a_t)}\[V(s_{t+1})]+\mathcal{H}(q(a_t|s_t))]q(at∣st)=argmaxEst∼q(st)Eat∼q(at∣st)\[r(st,at)+Est+1∼p(st+1∣st,at)\[V(st+1)]+H(q(at∣st))]

recall regular Bellman backup: Qt(st,at)=r(st,at)+EVt+1(st+1)Q_t(s_t,a_t)=r(s_t,a_t)+\mathbb{E}V_{t+1}(s_{t+1})Qt(st,at)=r(st,at)+EVt+1(st+1)

q(at∣st)=arg⁡max⁡Est∼q(st)Eat∼q(at∣st)\[Q(st,at)+H(q(at∣st))]=arg⁡max⁡Est∼q(st)Eat∼q(at∣st)\[Q(st,at)−logq(at∣st)]q(a_t|s_t)=\arg\max\mathbb{E}{s_t\sim q(s_t)}\\mathbb{E}_{a_t\\sim q(a_t\|s_t)}\[Q(s_t,a_t)+\mathcal{H}(q(a_t|s_t))]=\arg\max\mathbb{E}{s_t\sim q(s_t)}\\mathbb{E}_{a_t\\sim q(a_t\|s_t)}\[Q(s_t,a_t)-log q(a_t|s_t)]q(at∣st)=argmaxEst∼q(st)Eat∼q(at∣st)\[Q(st,at)+H(q(at∣st))]=argmaxEst∼q(st)Eat∼q(at∣st)\[Q(st,at)−logq(at∣st)]

optimized when q(at∣st)∝exp⁡(Q(st,at))q(a_t|s_t)\propto\exp(Q(s_t,a_t))q(at∣st)∝exp(Q(st,at)): q(at∣st)=exp⁡(Q(st,at)−V(st))q(a_t|s_t)=\exp(Q(s_t,a_t)-V(s_t))q(at∣st)=exp(Q(st,at)−V(st))

soft value iteration algorithm:

step 1: set Q(s,a)←r(s,a)+γEV(s′)Q(s,a)\leftarrow r(s,a)+\gamma\mathbb{E}V(s')Q(s,a)←r(s,a)+γEV(s′)

step 2: set V(s)←softmaxaQ(s,a)V(s)\leftarrow\text{softmax}_aQ(s,a)V(s)←softmaxaQ(s,a)

4.1 Q-learning with soft optimality

ϕ←ϕ+α∇ϕQϕ(s,a)(r(s,a)+γV(s′)−Qϕ(s,a))\phi\leftarrow\phi+\alpha\nabla_\phi Q_\phi(s,a)(r(s,a)+\gamma V(s')-Q_\phi(s,a))ϕ←ϕ+α∇ϕQϕ(s,a)(r(s,a)+γV(s′)−Qϕ(s,a))

target value: V(s′)=softmaxa′Qϕ(s′,a′)=log⁡∫exp⁡(Qϕ(s′,a′))da′V(s')=\text{softmax}{a'}Q\phi(s',a')=\log\int\exp(Q_\phi(s',a'))da'V(s′)=softmaxa′Qϕ(s′,a′)=log∫exp(Qϕ(s′,a′))da′

π(a∣s)=exp⁡(Qϕ(s,a)−V(s))=exp⁡(A(s,a))\pi(a|s)=\exp(Q_\phi(s,a)-V(s))=\exp(A(s,a))π(a∣s)=exp(Qϕ(s,a)−V(s))=exp(A(s,a))

4.2 policy gradient and Q-learning

只有通过πθ\pi_\thetaπθ采样产生的动作才会对θ\thetaθ产生梯度, V(st)V(s_t)V(st)和动作无关可以消去

5.1 Soft actor-critic

step 1: Q-function update

update Q-function to evaluate current policy: Q(s,a)←r(s,a)+Es′∼ps,a′∼πQ(s′,a′)−log⁡π(a′∣s′)Q(s,a)\leftarrow r(s,a)+\mathbb{E}_{s'\sim p_s,a'\sim\pi}Q(s',a')-\\log{\\pi(a'\|s')}Q(s,a)←r(s,a)+Es′∼ps,a′∼πQ(s′,a′)−logπ(a′∣s′), this convergences to QπQ^\piQπ

step 2: Update policy

update the policy with gradient of information projection: πnew=arg⁡min⁡π′DKL(π′(⋅∣s)∣∣1Zexp⁡Qπold(s,⋅))\pi_{\text{new}}=\arg\min_{\pi'}D_{\text{KL}}\left(\pi'(\cdot|s)||\frac{1}{Z}\exp Q^{\pi_{\text{old}}}(s,\cdot)\right)πnew=argminπ′DKL(π′(⋅∣s)∣∣Z1expQπold(s,⋅))

in practice, only take one gradient step on this objective

step 3: interact with the world, collect more data, goto step 1

jection: πnew=arg⁡min⁡π′DKL(π′(⋅∣s)∣∣1Zexp⁡Qπold(s,⋅))\pi_{\text{new}}=\arg\min_{\pi'}D_{\text{KL}}\left(\pi'(\cdot|s)||\frac{1}{Z}\exp Q^{\pi_{\text{old}}}(s,\cdot)\right)πnew=argminπ′DKL(π′(⋅∣s)∣∣Z1expQπold(s,⋅))

in practice, only take one gradient step on this objective

step 3: interact with the world, collect more data, goto step 1

相关推荐
啄缘之间28 分钟前
8.【学习】工业级详细接口约束&覆盖率
开发语言·笔记·学习·uvm·sv
Quz2 小时前
将Markdown文件推送到浮墨笔记
人工智能·笔记
Brilliantwxx2 小时前
【C++】 深入理解红黑树:实现与原理全解
数据结构·c++·笔记·算法·青少年编程·红黑树
U盘失踪了2 小时前
claude code /skill-creator 创建skill
笔记
jscxy52062 小时前
ospf笔记
笔记
MAXrxc2 小时前
ospf笔记
网络·笔记
想不明白的过度思考者3 小时前
Unity学习笔记——虚拟摇杆实现笔记(事件触发器的使用、UGUI 坐标转换)
笔记·学习·unity
疯狂打码的少年3 小时前
流水线冒险(结构冒险/数据冒险/控制冒险)
笔记
问心无愧05134 小时前
ctf show web入门261
android·前端·笔记
智者知已应修善业4 小时前
【分立元件OCL电路】2024-5-17
驱动开发·经验分享·笔记·硬件架构·硬件工程