lec 5, 2026/5/5-5/10, VAEs
1.1 Deep Latent Variable Models
Use neural networks to model the conditionals (deep latent variable models): $z\sim\mathcal{N}(0,I)$, $p(x|z)=\mathcal{N}(\mu_\theta(z),\Sigma_\theta(z))$, where $\mu_\theta,\Sigma_\theta$ are neural networks.
Hope that after training, $z$ will correspond to meaningful latent factors of variation (features). This is unsupervised representation learning.
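A minimal NumPy sketch of sampling from such a model (first $z$, then $x$ given $z$). The decoder here is a hypothetical, randomly initialized one-hidden-layer network: the names `W1`, `W_mu`, `W_logvar` and all sizes are assumptions for illustration, and $\Sigma_\theta(z)$ is assumed diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden_dim, data_dim = 2, 8, 5

# Hypothetical decoder weights, randomly initialised (stand-ins for a trained theta).
W1 = rng.normal(size=(latent_dim, hidden_dim))
W_mu = rng.normal(size=(hidden_dim, data_dim))
W_logvar = 0.1 * rng.normal(size=(hidden_dim, data_dim))

def decoder(z):
    """Map z to the mean and (assumed diagonal) variance of p(x | z)."""
    h = np.tanh(z @ W1)
    return h @ W_mu, np.exp(h @ W_logvar)

def sample_x(n):
    z = rng.standard_normal((n, latent_dim))                 # z ~ N(0, I)
    mu, var = decoder(z)
    x = mu + np.sqrt(var) * rng.standard_normal(mu.shape)    # x ~ N(mu(z), diag(var(z)))
    return z, x

z, x = sample_x(4)
print(z.shape, x.shape)
```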
1.2 Mixture of Gaussians: a Shallow Latent Variable Model
$z\sim\text{Categorical}(1,\dots,K)$, $p(x|z=k)=\mathcal{N}(\mu_k,\Sigma_k)$, used for clustering.
$p(x)=\sum_zp(x,z)=\sum_zp(z)p(x|z)=\sum_{k=1}^{K}p(z=k)\mathcal{N}(x;\mu_k,\Sigma_k)$
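A small sketch of this marginal for a 1-D mixture with $K=2$. The weights, means and variances are made-up numbers, and `responsibilities` (the posterior $p(z=k|x)$ used for clustering) is a helper name introduced here.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Made-up mixture parameters.
weights = np.array([0.3, 0.7])       # p(z = k)
means = np.array([-2.0, 1.0])        # mu_k
variances = np.array([0.5, 1.5])     # Sigma_k (scalars in 1-D)

def gmm_pdf(x):
    """p(x) = sum_k p(z = k) N(x; mu_k, Sigma_k)."""
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(weights * gaussian_pdf(x, means, variances), axis=-1)

def responsibilities(x):
    """p(z = k | x) for a scalar x, via Bayes' rule."""
    joint = weights * gaussian_pdf(x, means, variances)
    return joint / joint.sum()

grid = np.linspace(-12.0, 12.0, 24001)
area = np.sum(gmm_pdf(grid)) * (grid[1] - grid[0])   # should be close to 1
print(area, responsibilities(-2.0))
```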
1.3 Marginal Likelihood
Suppose only some of the pixels are visible during training. Let $X$ denote the observed random variables and $Z$ the unobserved ones (also called hidden or latent), and suppose we have a model for the joint distribution $p(X,Z;\theta)$.
The probability of observing a training data point $\bar{x}$ is $p(X=\bar{x};\theta)=\sum_zp(X=\bar{x},Z=z;\theta)=\sum_zp(\bar{x},z;\theta)$.
For a variational autoencoder, $z$ is continuous, so the sum becomes an integral: $p(X=\bar{x};\theta)=\int_zp(X=\bar{x},Z=z;\theta)\,dz=\int_zp(\bar{x},z;\theta)\,dz$.
We have a dataset $\mathcal{D}=\{x^{(1)},\dots,x^{(M)}\}$, where for each datapoint the $X$ variables are observed and the $Z$ variables are never observed.
Do maximum likelihood learning: $\log\prod_{x\in\mathcal{D}}p(x;\theta)=\sum_{x\in\mathcal{D}}\log p(x;\theta)=\sum_{x\in\mathcal{D}}\log\sum_zp(x,z;\theta)$. Computing $\log\sum_zp(x,z;\theta)$ and its gradient $\nabla_\theta$ is expensive, so we need cheap approximations!
(1) first attempt: Naive Monte Carlo
$p_\theta(x)=\sum_{z\in\mathcal{Z}}p_\theta(x,z)=|\mathcal{Z}|\sum_{z\in\mathcal{Z}}\frac{1}{|\mathcal{Z}|}p_\theta(x,z)=|\mathcal{Z}|\,\mathbb{E}_{z\sim\text{Uniform}(\mathcal{Z})}\left[p_\theta(x,z)\right]$
Sample $z^{(1)},\dots,z^{(k)}$ uniformly at random and approximate the expectation with a sample average: $\sum_zp_\theta(x,z)\approx|\mathcal{Z}|\frac{1}{k}\sum_{j=1}^{k}p_\theta(x,z^{(j)})$
For most uniformly sampled $z$, $p_\theta(x,z)$ is very small, so this estimator has high variance. We need a clever way to select $z^{(j)}$ to reduce the variance of the estimator.
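A sketch of the naive estimator on a made-up discrete model, assuming $|\mathcal{Z}|=1000$, a uniform prior over $z$, and a narrow Gaussian likelihood centered at `centers[z]` (all names and numbers are illustrative assumptions). With few samples most draws miss the handful of $z$ that explain $x$, so the estimate jumps around; with many samples it approaches the exact sum.

```python
import numpy as np

rng = np.random.default_rng(0)

num_states = 1000                                  # |Z|
centers = np.linspace(-5.0, 5.0, num_states)       # hypothetical mean of x for each z
noise_std = 0.05
x = 1.0

def joint(z_idx):
    """p(x, z) = p(z) p(x | z), uniform prior and Gaussian likelihood."""
    prior = 1.0 / num_states
    lik = np.exp(-0.5 * ((x - centers[z_idx]) / noise_std) ** 2)
    lik /= noise_std * np.sqrt(2 * np.pi)
    return prior * lik

exact = joint(np.arange(num_states)).sum()         # brute-force sum over all z

def naive_mc(k):
    z = rng.integers(0, num_states, size=k)        # z^(j) ~ Uniform(Z)
    return num_states * joint(z).mean()            # |Z| * (1/k) * sum_j p(x, z^(j))

small = [naive_mc(10) for _ in range(5)]           # very noisy, often exactly 0
large = naive_mc(200_000)
print(exact, small, large)
```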
(2) second attempt: Importance Sampling
$p_\theta(x)=\sum_{z\in\mathcal{Z}}p_\theta(x,z)=\sum_{z\in\mathcal{Z}}\frac{q(z)}{q(z)}p_\theta(x,z)=\mathbb{E}_{z\sim q(z)}\left[\frac{p_\theta(x,z)}{q(z)}\right]$
Sample $z^{(1)},\dots,z^{(k)}$ from $q(z)$ and approximate the expectation with a sample average (still Monte Carlo, an unbiased estimator): $p_\theta(x)\approx\frac{1}{k}\sum_{j=1}^{k}\frac{p_\theta(x,z^{(j)})}{q(z^{(j)})}$
What is a good choice for $q(z)$? It should frequently sample the $z$ that are likely under $p_\theta(x,z)$ given $x$.
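A sketch of the importance sampling estimator on a made-up discrete model ($|\mathcal{Z}|=1000$, uniform prior, narrow Gaussian likelihood; all names and numbers are illustrative assumptions). The proposal `q` is a hypothetical choice: it concentrates on the $z$ that explain $x$ but is three times wider than the likelihood, so the ratios $p_\theta(x,z)/q(z)$ stay bounded.

```python
import numpy as np

rng = np.random.default_rng(0)

num_states = 1000                                  # |Z|
centers = np.linspace(-5.0, 5.0, num_states)       # hypothetical mean of x for each z
noise_std = 0.05
x = 1.0

def joint(z_idx):
    """p(x, z) = p(z) p(x | z), uniform prior and Gaussian likelihood."""
    prior = 1.0 / num_states
    lik = np.exp(-0.5 * ((x - centers[z_idx]) / noise_std) ** 2)
    lik /= noise_std * np.sqrt(2 * np.pi)
    return prior * lik

exact = joint(np.arange(num_states)).sum()

# Hypothetical proposal q(z), normalised over the discrete latent space.
q = np.exp(-0.5 * ((x - centers) / (3 * noise_std)) ** 2)
q /= q.sum()

def importance_sampling(k):
    z = rng.choice(num_states, size=k, p=q)        # z^(j) ~ q(z)
    return np.mean(joint(z) / q[z])                # (1/k) * sum_j p(x, z^(j)) / q(z^(j))

estimate = importance_sampling(5_000)
print(exact, estimate)
```

Only 5,000 samples are used here; a uniform proposal over the same latent space would need far more draws for comparable accuracy.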
Training requires the log-likelihood, which we estimate as $\log p_\theta(x)\approx\log\left(\frac{1}{k}\sum_{j=1}^{k}\frac{p_\theta(x,z^{(j)})}{q(z^{(j)})}\right)$
However, $\log$ is a concave function. By Jensen's inequality, any concave function $f$ satisfies $\mathbb{E}[f(Y)]\le f(\mathbb{E}[Y])$, therefore $\mathbb{E}_{z\sim q(z)}\left[\log\left(\frac{p_\theta(x,z)}{q(z)}\right)\right]\le\log\left(\mathbb{E}_{z\sim q(z)}\left[\frac{p_\theta(x,z)}{q(z)}\right]\right)=\log p_\theta(x)$, so the log of the sample average is a biased estimator of $\log p_\theta(x)$! The left-hand side is the Evidence Lower Bound (ELBO).
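A numerical check of this bias, assuming a toy model chosen only because its marginal is known in closed form: $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z,1)$, hence $x\sim\mathcal{N}(0,2)$, with $q(z)$ set to the prior. Averaged over many repeats, the log of the sample average sits below $\log p(x)$ and moves toward it as $k$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.5

def log_normal_pdf(value, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (value - mean) ** 2 / var

# Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2) exactly.
log_px = log_normal_pdf(x, 0.0, 2.0)

def mean_log_estimate(k, repeats=2000):
    """Average over repeats of log((1/k) * sum_j p(x, z_j) / q(z_j)), with q = prior."""
    z = rng.standard_normal((repeats, k))            # z ~ q(z) = N(0, 1)
    log_ratio = log_normal_pdf(x, z, 1.0)            # log p(x, z) - log q(z) = log p(x | z)
    log_avg = np.log(np.mean(np.exp(log_ratio), axis=1))
    return log_avg.mean()

for k in (1, 10, 100):
    print(k, mean_log_estimate(k), "log p(x) =", log_px)
```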
1.4 Variational inference
Suppose $q(z)$ is any probability distribution over the hidden variables. The evidence lower bound (ELBO) holds for any $q$: $\log p(x;\theta)\ge\sum_zq(z)\log\left(\frac{p_\theta(x,z)}{q(z)}\right)=\sum_zq(z)\log p_\theta(x,z)\underbrace{-\sum_zq(z)\log q(z)}_{\text{Entropy }H(q)\text{ of }q}=\sum_zq(z)\log p_\theta(x,z)+H(q)$. Equality holds if $q(z)=p(z|x;\theta)$, but this posterior is hard to compute.
lec 6, 2026/5/11-5/28, VAEs
1.1 Variational inference
$D_{\text{KL}}(q(z)\|p(z|x;\theta))=\mathbb{E}_{z\sim q(z)}\left[\log q(z)-\log p(z|x;\theta)\right]=\sum_zq(z)\log q(z)-\sum_zq(z)\log p(z|x;\theta)$
$=-H(q)-\sum_zq(z)\log\frac{p(x,z;\theta)}{p(x;\theta)}=-H(q)-\sum_zq(z)\log p(x,z;\theta)+\sum_zq(z)\log p(x;\theta)=-H(q)-\sum_zq(z)\log p(x,z;\theta)+\log p(x;\theta)$
Because the KL divergence is always $\ge0$, we get $\log p(x;\theta)\ge\sum_zq(z)\log p(x,z;\theta)+H(q)$, which derives the ELBO again, this time from the KL divergence perspective.
Equality holds if $q(z)=p(z|x;\theta)$, because then $D_{\text{KL}}(q(z)\|p(z|x;\theta))=0$.
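A tiny numerical check of $\log p(x;\theta)=\text{ELBO}+D_{\text{KL}}$ for a discrete $z\in\{0,1,2\}$. The joint values and the choice of $q$ are made-up numbers for illustration.

```python
import numpy as np

joint = np.array([0.10, 0.05, 0.25])     # made-up p(x, z) for one fixed x, z in {0, 1, 2}
log_px = np.log(joint.sum())             # log p(x) = log sum_z p(x, z)
posterior = joint / joint.sum()          # p(z | x)

def elbo(q):
    """sum_z q(z) log p(x, z) + H(q)."""
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.5, 0.3, 0.2])            # an arbitrary q(z)
print(elbo(q), kl(q, posterior), log_px)
print(elbo(posterior), log_px)           # the bound is tight at q = posterior
```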
1.2 Intractable Posteriors
What if the posterior $p(z|x;\theta)$ is hard to compute? In a VAE this corresponds to "inverting" the neural networks $\mu_\theta,\Sigma_\theta$ defining $p(x|z)=\mathcal{N}(\mu_\theta(z),\Sigma_\theta(z))$.
Suppose $q(z;\phi)$ is a tractable probability distribution over the hidden variables parameterized by $\phi$ (variational parameters), e.g., $q(z;\phi)=\mathcal{N}(\phi_1,\phi_2)$.
Pick $\phi$ so that $q(z;\phi)$ is as close as possible to $p(z|x;\theta)$. Truly a case of "the original broth digests the original food"!
$\log p(x;\theta)\ge\sum_zq(z;\phi)\log p(z,x;\theta)+H(q(z;\phi))=\underbrace{\mathcal{L}(x;\theta,\phi)}_{\text{ELBO}}$
$\log p(x;\theta)=\mathcal{L}(x;\theta,\phi)+D_{\text{KL}}(q(z;\phi)\|p(z|x;\theta))$
The better $q(z;\phi)$ approximates the posterior $p(z|x;\theta)$, the smaller the $D_{\text{KL}}(q(z;\phi)\|p(z|x;\theta))$ we can achieve, and the closer the ELBO will be to $\log p(x;\theta)$.
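A sketch with a Gaussian variational family $q(z;\phi)=\mathcal{N}(m,s^2)$ on an assumed toy model, $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z,1)$, picked because the marginal $\mathcal{N}(x;0,2)$ and the posterior $\mathcal{N}(x/2,1/2)$ are available in closed form. The ELBO plus the KL gap always adds up to $\log p(x)$, and the gap vanishes when $q$ matches the posterior.

```python
import numpy as np

x = 1.5
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0     # log N(x; 0, 2)

def elbo(m, s):
    """Closed-form ELBO for q = N(m, s^2) under the assumed toy model."""
    expected_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + s ** 2)
    expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s ** 2)
    return expected_log_prior + expected_log_lik + entropy

def kl_to_posterior(m, s):
    """KL(N(m, s^2) || N(x/2, 1/2)), the gap between the ELBO and log p(x)."""
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (np.log(post_var / s ** 2)
                  + (s ** 2 + (m - post_mean) ** 2) / post_var - 1.0)

candidates = [(0.0, 1.0), (0.5, 0.8), (x / 2.0, np.sqrt(0.5))]
for m, s in candidates:
    print(m, s, elbo(m, s), kl_to_posterior(m, s), log_px)
```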
1.3 apply ELBO to the entire dataset
$\text{minimize }D_{\text{KL}}(P_{\text{data}}\|P_\theta)\iff\text{maximize }\sum_{x^i\in\mathcal{D}}\log p(x^i;\theta)=l(\theta;\mathcal{D})\ge\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$
Therefore, $\max_\theta l(\theta;\mathcal{D})\ge\max_{\theta,\phi^1,\dots,\phi^M}\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$
Note that we use different variational parameters $\phi^i$ for every data point $x^i$, because the true posterior $p(z|x^i;\theta)$ is different across datapoints $x^i$.
1.4 Learning via stochastic variational inference (SVI)
Optimize $\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$ as a function of $\theta,\phi^1,\dots,\phi^M$ using stochastic gradient descent (on the negative ELBO, i.e., gradient ascent on $\mathcal{L}$): $\mathcal{L}(x^i;\theta,\phi^i)=\sum_zq(z;\phi^i)\log p(z,x^i;\theta)+H(q(z;\phi^i))=\mathbb{E}_{q(z;\phi^i)}\left[\log p(z,x^i;\theta)-\log q(z;\phi^i)\right]$
step 1: Initialize $\theta,\phi^1,\dots,\phi^M$
step 2: Randomly sample a data point $x^i$ from $\mathcal{D}$
step 3: Repeat $\phi^i=\phi^i+\eta\nabla_{\phi^i}\mathcal{L}(x^i;\theta,\phi^i)$ until convergence to $\phi^{i,*}\approx\arg\max_\phi\mathcal{L}(x^i;\theta,\phi)$
step 4: Compute $\nabla_\theta\mathcal{L}(x^i;\theta,\phi^{i,*})$
step 5: Update $\theta$ in the gradient direction, then go to step 2
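The five steps above can be sketched on an assumed toy model, $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z+\theta,1)$, with $q(z;\phi^i)=\mathcal{N}(m_i,s_i^2)$. To keep the sketch short it uses the closed-form ELBO gradients of this particular model instead of Monte Carlo estimates; the step sizes and loop counts are arbitrary choices. Since the marginal is $\mathcal{N}(\theta,2)$, the learned $\theta$ should end up near the data mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: z ~ N(0, 1), x | z ~ N(z + theta, 1).
true_theta, num_points = 3.0, 50
data = true_theta + rng.standard_normal(num_points) + rng.standard_normal(num_points)

# step 1: initialise theta and phi^i = (m_i, log s_i), one pair per data point
theta = 0.0
m = np.zeros(num_points)
log_s = np.zeros(num_points)

def elbo_grads(x, theta, m_i, log_s_i):
    """Closed-form ELBO gradients for this Gaussian toy model."""
    grad_m = (x - theta - m_i) - m_i                  # d ELBO / d m
    grad_log_s = 1.0 - 2.0 * np.exp(2.0 * log_s_i)    # d ELBO / d log s
    grad_theta = x - theta - m_i                      # d ELBO / d theta
    return grad_m, grad_log_s, grad_theta

eta_phi, eta_theta = 0.1, 0.02
for step in range(2000):
    i = rng.integers(num_points)                      # step 2: pick a data point
    for _ in range(50):                               # step 3: ascend in phi^i
        g_m, g_log_s, _ = elbo_grads(data[i], theta, m[i], log_s[i])
        m[i] += eta_phi * g_m
        log_s[i] += eta_phi * g_log_s
    _, _, g_theta = elbo_grads(data[i], theta, m[i], log_s[i])   # step 4
    theta += eta_theta * g_theta                      # step 5

print(theta, data.mean())
```

The inner loop is what makes this expensive: every visit to a data point re-optimizes its own $\phi^i$ before $\theta$ moves at all.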
Classic Monte Carlo: sample $z^1,\dots,z^K$ from $q(z;\phi^i)$ and estimate $\mathbb{E}_{q(z;\phi^i)}\left[\log p(z,x^i;\theta)-\log q(z;\phi^i)\right]\approx\frac{1}{K}\sum_k\left[\log p(z^k,x^i;\theta)-\log q(z^k;\phi^i)\right]$
Key assumption: $q(z;\phi)$ is tractable, i.e., easy to sample from and evaluate (e.g., a flow model; counterexample: a generative model whose density is hard to evaluate, such as a latent variable model).
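A sketch of this Monte Carlo ELBO estimate, assuming the toy model $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z+\theta,1)$ and a Gaussian $q(z;\phi)=\mathcal{N}(m,s^2)$ with arbitrary $\phi$. The closed-form value is specific to this model and is included only as a reference for the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x, theta = 1.5, 0.0
m, s = 0.5, 0.8                                   # phi = (m, s), chosen arbitrarily

def log_normal_pdf(value, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (value - mean) ** 2 / var

def log_joint(z):
    """log p(z, x; theta) = log N(z; 0, 1) + log N(x; z + theta, 1)."""
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z + theta, 1.0)

K = 200_000
z = m + s * rng.standard_normal(K)                # z^k ~ q(z; phi): easy to sample
log_q = log_normal_pdf(z, m, s ** 2)              # log q(z^k; phi): easy to evaluate
elbo_mc = np.mean(log_joint(z) - log_q)

# Closed-form ELBO of this toy model, for reference.
elbo_exact = (-np.log(2 * np.pi) - 0.5 * (m ** 2 + s ** 2)
              - 0.5 * ((x - theta - m) ** 2 + s ** 2)
              + 0.5 * np.log(2 * np.pi * np.e * s ** 2))
print(elbo_mc, elbo_exact)
```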
How do we compute $\nabla_\theta\mathcal{L}(x;\theta,\phi)$ and $\nabla_\phi\mathcal{L}(x;\theta,\phi)$?
The gradient with respect to $\theta$ is easy; estimate it with a Monte Carlo average: $\nabla_\theta\mathbb{E}_{q(z;\phi)}\left[\log p(z,x;\theta)-\log q(z;\phi)\right]=\mathbb{E}_{q(z;\phi)}\left[\nabla_\theta\log p(z,x;\theta)\right]\approx\frac{1}{K}\sum_k\nabla_\theta\log p(z^k,x;\theta)$
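A sketch of the $\theta$-gradient estimate, assuming the toy model $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z+\theta,1)$, for which $\nabla_\theta\log p(z,x;\theta)=x-\theta-z$ can be written by hand. The samples come from $q(z;\phi)=\mathcal{N}(m,s^2)$ with $\phi$ held fixed, so the gradient simply moves inside the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
x, theta = 1.5, 0.4
m, s = 0.5, 0.8                                   # phi = (m, s), held fixed

def grad_theta_log_joint(z):
    # log p(z, x; theta) = log N(z; 0, 1) + log N(x; z + theta, 1)
    return x - theta - z

K = 100_000
z = m + s * rng.standard_normal(K)                # z^k ~ q(z; phi)
grad_mc = np.mean(grad_theta_log_joint(z))        # (1/K) * sum_k grad_theta log p(z^k, x; theta)
grad_exact = x - theta - m                        # exact value for this toy model
print(grad_mc, grad_exact)
```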
The gradient with respect to $\phi$ is more complicated because the expectation itself depends on $\phi$. Use the reparameterization trick: suppose $q(z;\phi)=\mathcal{N}(\mu,\sigma^2I)$ with $\phi=(\mu,\sigma)$. Instead of sampling $z\sim q(z;\phi)$ directly, sample $\epsilon\sim\mathcal{N}(0,I)$ and set $z=\mu+\sigma\epsilon=g(\epsilon;\phi)$, where $g$ is deterministic.
With this, $\mathbb{E}_{z\sim q(z;\phi)}\left[r(z)\right]=\int q(z;\phi)r(z)\,dz=\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\left[r(g(\epsilon;\phi))\right]=\int\mathcal{N}(\epsilon)r(\mu+\sigma\epsilon)\,d\epsilon$
$\nabla_\phi\mathbb{E}_{q(z;\phi)}\left[r(z)\right]=\nabla_\phi\mathbb{E}_\epsilon\left[r(g(\epsilon;\phi))\right]=\mathbb{E}_\epsilon\left[\nabla_\phi r(g(\epsilon;\phi))\right]$, which is easy to estimate via Monte Carlo if $r$ and $g$ are differentiable: $\mathbb{E}_\epsilon\left[\nabla_\phi r(g(\epsilon;\phi))\right]\approx\frac{1}{K}\sum_k\nabla_\phi r(g(\epsilon^k;\phi))$, where $\epsilon^1,\dots,\epsilon^K\sim\mathcal{N}(0,I)$
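A sketch of the reparameterized gradient estimate in one dimension, assuming the test function $r(z)=z^2$ (chosen here because $\mathbb{E}[r(z)]=\mu^2+\sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$). The derivative `r_prime` is written by hand; an autodiff library would normally do this step.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5                              # phi = (mu, sigma)

def r(z):
    return z ** 2

def r_prime(z):
    return 2.0 * z

K = 200_000
eps = rng.standard_normal(K)                      # eps^k ~ N(0, 1)
z = mu + sigma * eps                              # z = g(eps; phi), deterministic in phi

# Chain rule through g: dz/dmu = 1 and dz/dsigma = eps.
grad_mu = np.mean(r_prime(z) * 1.0)
grad_sigma = np.mean(r_prime(z) * eps)
expected_r = np.mean(r(z))

print(grad_mu, 2 * mu)                            # estimate vs exact 2 * mu
print(grad_sigma, 2 * sigma)                      # estimate vs exact 2 * sigma
```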
1.5 Amortized Inference
$\max_\theta l(\theta;\mathcal{D})\ge\max_{\theta,\phi^1,\dots,\phi^M}\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$. So far we have maintained/stored a separate set of variational parameters $\phi^i$ for every training example $x^i$, which does not scale.
Amortization: now we learn a single parametric function $f_\lambda$ that maps each $x$ to a set of (good) variational parameters, like doing regression on $x^i\to\phi^{i,*}$.
Example: approximate the true posterior $p_\theta(z|x^i)$ with $q(z|x^i)$. If the $q(z|x^i)$ are Gaussians with different means $\mu^1,\dots,\mu^M$, learn a single neural network $f_\lambda$ mapping $x^i$ to $\mu^i$, then approximate the posteriors $q(z|x^i)$ using this distribution $q_\lambda(z|x)$. One way to read this: $q_\lambda(z|x^i)\approx q_{\phi^{i,*}}(z|x^i)\approx p_\theta(z|x^i)$.
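A sketch of amortization on an assumed toy model, $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z,1)$ with $\theta$ fixed, where the per-datapoint optimum is known: $m^{i,*}=x^i/2$, $s^2=1/2$. Instead of storing one $(m_i,s_i)$ per point, a single hypothetical linear map $f_\lambda(x)=ax+b$ (plus a shared $\log s$) is trained by gradient ascent on the average ELBO, using this model's closed-form gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.sqrt(2.0) * rng.standard_normal(200)    # x ~ N(0, 2), the toy model's marginal

# Hypothetical amortised encoder: q(z | x) = N(a * x + b, s^2), lambda = (a, b, log s).
a, b, log_s = 0.0, 0.0, 0.0
lr = 0.1
for _ in range(500):
    m = a * data + b                              # f_lambda(x^i) for every data point
    grad_m = data - 2.0 * m                       # d ELBO / d m, per data point
    a += lr * np.mean(grad_m * data)              # chain rule: dm/da = x
    b += lr * np.mean(grad_m)                     # chain rule: dm/db = 1
    log_s += lr * (1.0 - 2.0 * np.exp(2.0 * log_s))

print(a, b, np.exp(2.0 * log_s))                  # approaches 0.5, 0.0, 0.5
```

The learned map reproduces every $\phi^{i,*}$ at once, and it also applies to new $x$ without any further per-point optimization.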
1.6 A variational approximation to the posterior
(1) Assume $p(z,x^i;\theta)$ is close to $p_{\text{data}}(z,x^i)$. Suppose $z$ captures information such as the digit identity (label), style, etc.
(2) Suppose $q(z;\phi^i)$ is a tractable probability distribution over the hidden variables $z$ parameterized by $\phi^i$. For each $x^i$, we need to find a good $\phi^{i,*}$ (via optimization, expensive).
(3) (Amortized inference) Learn how to map $x^i$ to a good set of parameters $\phi^i$ via $q(z;f_\lambda(x^i))$. $f_\lambda$ learns how to solve the optimization problem.
In the literature, $q(z;f_\lambda(x^i))$ is often denoted $q_\phi(z|x)$.
1.7 Autoencoder perspective
$\log p(x;\theta)\ge\mathcal{L}(x;\theta,\phi)=\mathbb{E}_{q_\phi(z|x)}\left[\log p(z,x;\theta)-\log q_\phi(z|x)\right]=\mathbb{E}_{q_\phi(z|x)}\left[\log p(z,x;\theta)-\log p(z)+\log p(z)-\log q_\phi(z|x)\right]$
$=\mathbb{E}_{q_\phi(z|x)}\left[\log p(x|z;\theta)\right]-D_{\text{KL}}(q_\phi(z|x)\|p(z))$
step 1: Take a data point $x^i$ and map it to $\hat{z}$ by sampling from $q_\phi(z|x^i)$ (encoder). Sample from a Gaussian with parameters $(\mu,\sigma)=\text{encoder}_\phi(x^i)$
step 2: Reconstruct $\hat{x}$ by sampling from $p(x|\hat{z};\theta)$ (decoder). Sample from a Gaussian with parameters $\text{decoder}_\theta(\hat{z})$
What does the training objective $\mathcal{L}(x;\theta,\phi)$ do?
The first term encourages $\hat{x}\approx x^i$: for a Gaussian decoder with fixed covariance $\sigma^2I$, $\log p_\theta(x^i|z)=-\frac{1}{2\sigma^2}\|x^i-\mu_\theta(z)\|^2+C$, so maximizing $\log p_\theta(x^i|z)$ is equivalent to minimizing $\|x^i-\mu_\theta(z)\|^2$.
The second term encourages $\hat{z}$ to have a distribution similar to the prior $p(z)$: the more similar the two distributions, the smaller the KL divergence.
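A sketch of a single-sample estimate of this objective (reconstruction term minus KL term), assuming a prior $p(z)=\mathcal{N}(0,I)$, a diagonal Gaussian encoder, and a decoder with fixed variance. The linear encoder/decoder weights are hypothetical and randomly initialized, and the KL term uses the standard closed form for a diagonal Gaussian against $\mathcal{N}(0,I)$.

```python
import numpy as np

rng = np.random.default_rng(0)
data_dim, latent_dim = 6, 2

# Hypothetical linear encoder/decoder weights; a real VAE would use deeper networks.
W_enc_mu = rng.normal(scale=0.3, size=(data_dim, latent_dim))
W_enc_logvar = rng.normal(scale=0.3, size=(data_dim, latent_dim))
W_dec = rng.normal(scale=0.3, size=(latent_dim, data_dim))
decoder_var = 1.0          # assumed fixed sigma^2 in p(x | z) = N(mu_theta(z), sigma^2 I)

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) in closed form."""
    return 0.5 * np.sum(mu ** 2 + np.exp(logvar) - 1.0 - logvar, axis=-1)

def elbo_estimate(x):
    mu, logvar = x @ W_enc_mu, x @ W_enc_logvar          # step 1: encoder
    eps = rng.standard_normal(mu.shape)
    z_hat = mu + np.exp(0.5 * logvar) * eps              # sample z_hat ~ q_phi(z | x)
    x_mean = z_hat @ W_dec                               # step 2: decoder mean
    log_lik = (-0.5 * np.sum((x - x_mean) ** 2, axis=-1) / decoder_var
               - 0.5 * data_dim * np.log(2 * np.pi * decoder_var))
    return log_lik - kl_to_standard_normal(mu, logvar)   # reconstruction term - KL term

x_batch = rng.normal(size=(4, data_dim))
values = elbo_estimate(x_batch)
print(values)
```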
Maximizing $\log p(x;\theta)$ over the data is maximum likelihood estimation, $\max_\theta\mathbb{E}_{x\sim p_{\text{data}}}\left[\log p_\theta(x)\right]$, which is equivalent to minimizing $\mathbb{E}_{x\sim p_{\text{data}}}\left[-\log p_\theta(x)\right]=H(p_{\text{data}},p_\theta)=H(p_{\text{data}})+D_{\text{KL}}(p_{\text{data}}\|p_\theta)$. Since $H(p_{\text{data}})$ does not depend on $\theta$, this is equivalent to minimizing $D_{\text{KL}}(p_{\text{data}}\|p_\theta)$, which brings the model distribution $p_\theta(x)$ as close as possible to the true data distribution $p_{\text{data}}(x)$.
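A quick numerical check of the decomposition $H(p_{\text{data}},p_\theta)=H(p_{\text{data}})+D_{\text{KL}}(p_{\text{data}}\|p_\theta)$ on two made-up discrete distributions.

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])        # made-up data distribution
p_model = np.array([0.4, 0.4, 0.2])       # made-up model distribution

cross_entropy = -np.sum(p_data * np.log(p_model))    # E_{p_data}[-log p_model]
entropy = -np.sum(p_data * np.log(p_data))           # H(p_data)
kl = np.sum(p_data * np.log(p_data / p_model))       # KL(p_data || p_model)

print(cross_entropy, entropy + kl)
```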
It took me this long to finish lec 6. I could instantly make up eight hundred excuses for myself, but I am ashamed all the same. 2026/5/28