cs236_note1 (lec5-lec6) VAEs

lec 5, 2026/5/5-5/10, VAEs

1.1 Deep Latent Variable Models

Use neural networks to model the conditionals (deep latent variable models): $z\sim\mathcal{N}(0,I)$, $p(x|z)=\mathcal{N}(\mu_\theta(z),\Sigma_\theta(z))$, where $\mu_\theta,\Sigma_\theta$ are neural networks.

The hope is that after training, $z$ will correspond to meaningful latent factors of variation (features), i.e., unsupervised representation learning.

1.2 Mixture of Gaussians: a Shallow Latent Variable Model

$z\sim\text{Categorical}(1,\dots,K)$, $p(x|z=k)=\mathcal{N}(\mu_k,\Sigma_k)$, useful for clustering.

$p(x)=\sum_zp(x,z)=\sum_zp(z)p(x|z)=\sum_{k=1}^Kp(z=k)\mathcal{N}(x;\mu_k,\Sigma_k)$
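As a minimal sketch of the mixture marginal above, restricted to one dimension: the weights, means, and variances below are made-up illustration values, and `responsibilities` (the posterior $p(z=k|x)$ used for clustering) is a helper name introduced here, not something from the lecture.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """p(x) = sum_k p(z=k) * N(x; mu_k, sigma_k^2)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def responsibilities(x, weights, means, variances):
    """Posterior p(z=k | x): the soft cluster assignment of x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component mixture (illustration values only).
weights, means, variances = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]
print(mixture_density(0.0, weights, means, variances))
print(responsibilities(0.0, weights, means, variances))
```

Because $K$ is small, the sum over $z$ is exact and cheap here; that is what makes this model "shallow" compared with the deep case.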

1.3 Marginal Likelihood

Suppose only some of the pixels are visible during training. Let $X$ denote the observed random variables and $Z$ the unobserved ones (also called hidden or latent), and suppose we have a model for the joint distribution $p(X,Z;\theta)$.

The probability of observing a training data point $\bar{x}$ is $p(X=\bar{x};\theta)=\sum_zp(X=\bar{x},Z=z;\theta)=\sum_zp(\bar{x},z;\theta)$

For a variational autoencoder, where $z$ is continuous, the sum becomes an integral: $p(X=\bar{x};\theta)=\int_zp(X=\bar{x},Z=z;\theta)dz=\int_zp(\bar{x},z;\theta)dz$

We have a dataset $\mathcal{D}=\{x^{(1)},\dots,x^{(M)}\}$, where for each data point the variables $X$ are observed and the variables $Z$ are never observed.

Maximum likelihood learning: $\log\prod_{x\in\mathcal{D}}p(x;\theta)=\sum_{x\in\mathcal{D}}\log p(x;\theta)=\sum_{x\in\mathcal{D}}\log\sum_zp(x,z;\theta)$. Evaluating $\log\sum_zp(x,z;\theta)$ and its gradient with respect to $\theta$ is expensive, so we need cheap approximations!

(1) first attempt: Naive Monte Carlo

$p_\theta(x)=\sum_{z\in\mathcal{Z}}p_\theta(x,z)=|\mathcal{Z}|\sum_{z\in\mathcal{Z}}\frac{1}{|\mathcal{Z}|}p_\theta(x,z)=|\mathcal{Z}|\,\mathbb{E}_{z\sim\text{Uniform}(\mathcal{Z})}[p_\theta(x,z)]$

Sample $z^{(1)},\dots,z^{(k)}$ uniformly at random and approximate the expectation with a sample average: $\sum_zp_\theta(x,z)\approx|\mathcal{Z}|\frac{1}{k}\sum_{j=1}^kp_\theta(x,z^{(j)})$

We need a clever way to select $z^{(j)}$ to reduce the variance of the estimator.
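To see why uniform sampling can behave badly, here is a small sketch of the naive estimator. The joint `peaked_joint` is a hypothetical toy in which almost all of the mass of $p_\theta(x,z)$ sits on one latent value; the numbers are illustration values, not from the lecture.

```python
import random

def naive_mc_marginal(joint, num_latents, k, rng):
    """Estimate p(x) = sum_z p(x, z) as |Z| * (1/k) * sum_j p(x, z_j), z_j uniform."""
    samples = [rng.randrange(num_latents) for _ in range(k)]
    return num_latents * sum(joint(z) for z in samples) / k

def peaked_joint(z):
    """Hypothetical p(x, z) at a fixed x: nearly all mass sits on z = 7."""
    return 0.2 if z == 7 else 1e-6

rng = random.Random(0)
exact = sum(peaked_joint(z) for z in range(1000))
estimates = [naive_mc_marginal(peaked_joint, 1000, 10, rng) for _ in range(5)]
print("exact:", exact)
print("estimates:", estimates)
```

Most runs never draw the one important $z$ and badly underestimate $p(x)$, while a run that does draw it overshoots by a large factor. The estimator is unbiased, but its variance is high.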

(2) second attempt: Importance Sampling

$p_\theta(x)=\sum_{z\in\mathcal{Z}}p_\theta(x,z)=\sum_{z\in\mathcal{Z}}\frac{q(z)}{q(z)}p_\theta(x,z)=\mathbb{E}_{z\sim q(z)}\left[\frac{p_\theta(x,z)}{q(z)}\right]$

Sample $z^{(1)},\dots,z^{(k)}$ from $q(z)$ and approximate the expectation with a sample average (still Monte Carlo, and an unbiased estimator): $p_\theta(x)\approx\frac{1}{k}\sum_{j=1}^k\frac{p_\theta(x,z^{(j)})}{q(z^{(j)})}$

What is a good choice for $q(z)$? It should frequently sample the $z$ that are likely under $p_\theta(x,z)$ given $x$.
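A small sketch of the importance-sampling estimator for a discrete latent, under made-up values of $p_\theta(x,z)$. It compares a uniform proposal with the extreme case $q(z)=p(z|x)$, where every importance weight equals $p_\theta(x)$ and the estimate is exact. The function name and the toy numbers are assumptions for illustration.

```python
import random

def importance_sampling_marginal(joint, proposal, k, rng):
    """Estimate p(x) = E_{z~q}[p(x, z) / q(z)] with k samples from q.

    joint[z] holds p(x, z) and proposal[z] holds q(z) for z = 0..len(proposal)-1.
    """
    latents = list(range(len(proposal)))
    samples = rng.choices(latents, weights=proposal, k=k)
    return sum(joint[z] / proposal[z] for z in samples) / k

joint = [0.1, 0.3, 0.2]                       # hypothetical p(x, z) at a fixed x
evidence = sum(joint)                         # exact p(x)
uniform = [1 / 3] * 3
posterior = [pz / evidence for pz in joint]   # p(z | x), the ideal proposal

rng = random.Random(0)
print("uniform q:  ", importance_sampling_marginal(joint, uniform, 10, rng))
print("posterior q:", importance_sampling_marginal(joint, posterior, 10, rng))
print("exact p(x): ", evidence)
```

In practice the posterior is exactly what we cannot compute, which is why the notes go on to approximate it.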

Training requires the log-likelihood, which we estimate as $\log p_\theta(x)\approx\log\left(\frac{1}{k}\sum_{j=1}^k\frac{p_\theta(x,z^{(j)})}{q(z^{(j)})}\right)$

However, $\log$ is a concave function, and Jensen's inequality states that $\mathbb{E}[f(Y)]\le f(\mathbb{E}[Y])$ for any concave function $f$. Therefore $\mathbb{E}_{z\sim q(z)}\left[\log\left(\frac{p_\theta(x,z)}{q(z)}\right)\right]\le\log\left(\mathbb{E}_{z\sim q(z)}\left[\frac{p_\theta(x,z)}{q(z)}\right]\right)=\log p_\theta(x)$, so the estimator of the log-likelihood is biased! The left-hand side is the evidence lower bound (ELBO).
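The bias can be checked exactly on a tiny discrete example by enumerating every possible draw of $z^{(1)},\dots,z^{(k)}$ instead of sampling. This is only a sketch: the joint values and the uniform proposal are made up, and `expected_log_estimate` is a helper name introduced here.

```python
import itertools
import math

def expected_log_estimate(joint, q, k):
    """Exact E[log((1/k) * sum_j p(x, z_j) / q(z_j))] for z_1..z_k drawn i.i.d. from q."""
    total = 0.0
    for zs in itertools.product(range(len(q)), repeat=k):
        prob = math.prod(q[z] for z in zs)               # probability of this draw
        estimate = sum(joint[z] / q[z] for z in zs) / k  # importance-sampling estimate
        total += prob * math.log(estimate)
    return total

joint = [0.1, 0.3, 0.2]        # hypothetical p(x, z) at a fixed x
q = [1 / 3, 1 / 3, 1 / 3]      # uniform proposal
log_px = math.log(sum(joint))

for k in (1, 2, 4):
    print(k, expected_log_estimate(joint, q, k), "<=", log_px)
```

With $k=1$ the expectation is exactly the ELBO. In this toy example the gap to $\log p_\theta(x)$ shrinks as $k$ grows but never closes for a proposal that differs from the posterior.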

1.4 Variational inference

Suppose $q(z)$ is any probability distribution over the hidden variables. The evidence lower bound (ELBO) holds for any $q$: $\log p(x;\theta)\ge\sum_zq(z)\log\left(\frac{p_\theta(x,z)}{q(z)}\right)=\sum_zq(z)\log p_\theta(x,z)\underbrace{-\sum_zq(z)\log q(z)}_{\text{Entropy }H(q)\text{ of }q}=\sum_zq(z)\log p_\theta(x,z)+H(q)$. Equality holds if $q(z)=p(z|x;\theta)$, but this posterior is hard to compute.

lec 6, 2026/5/11-5/28, VAEs

1.1 Variational inference

$D_{\text{KL}}(q(z)||p(z|x;\theta))=\mathbb{E}_{z\sim q(z)}[\log q(z)-\log p(z|x;\theta)]=\sum_zq(z)\log q(z)-\sum_zq(z)\log p(z|x;\theta)$

$=-H(q)-\sum_zq(z)\log\frac{p(x,z;\theta)}{p(x;\theta)}=-H(q)-\sum_zq(z)\log p(x,z;\theta)+\sum_zq(z)\log p(x;\theta)=-H(q)-\sum_zq(z)\log p(x,z;\theta)+\log p(x;\theta)$

Since the KL divergence is always $\ge0$, we get $\log p(x;\theta)\ge\sum_zq(z)\log p(x,z;\theta)+H(q)$, which derives the ELBO again, this time from the KL divergence viewpoint.

Equality holds if $q(z)=p(z|x;\theta)$, because then $D_{\text{KL}}(q(z)||p(z|x;\theta))=0$
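The identity behind this derivation, $\log p(x;\theta)=\text{ELBO}+D_{\text{KL}}(q\|p(z|x;\theta))$, can be verified numerically on a small discrete model. The sketch below uses made-up joint probabilities and an arbitrary $q$; all names are illustrative.

```python
import math

def entropy(q):
    """H(q) = -sum_z q(z) log q(z)."""
    return -sum(qz * math.log(qz) for qz in q if qz > 0)

def elbo(joint, q):
    """sum_z q(z) log p(x, z) + H(q)."""
    return sum(qz * math.log(pz) for pz, qz in zip(joint, q) if qz > 0) + entropy(q)

def kl_divergence(q, p):
    """D_KL(q || p) = sum_z q(z) log(q(z) / p(z))."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

joint = [0.1, 0.3, 0.2]                       # hypothetical p(x, z) at a fixed x
evidence = sum(joint)                         # p(x)
posterior = [pz / evidence for pz in joint]   # p(z | x)
q = [0.5, 0.25, 0.25]                         # an arbitrary variational distribution

gap = math.log(evidence) - elbo(joint, q)
print("log p(x) - ELBO:", gap)
print("KL(q || p(z|x)):", kl_divergence(q, posterior))
print("ELBO at q = posterior:", elbo(joint, posterior), "log p(x):", math.log(evidence))
```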

1.2 Intractable Posteriors

What if the posterior $p(z|x;\theta)$ is intractable to compute? In a VAE, this corresponds to "inverting" the neural networks $\mu_\theta,\Sigma_\theta$ defining $p(x|z)=\mathcal{N}(\mu_\theta(z),\Sigma_\theta(z))$

Suppose $q(z;\phi)$ is a tractable probability distribution over the hidden variables, parameterized by $\phi$ (the variational parameters), e.g., $q(z;\phi)=\mathcal{N}(\phi_1,\phi_2)$

Pick $\phi$ so that $q(z;\phi)$ is as close as possible to $p(z|x;\theta)$. Fittingly, the intractable inference problem is itself handed back to optimization!

$\log p(x;\theta)\ge\sum_zq(z;\phi)\log p(z,x;\theta)+H(q(z;\phi))=\underbrace{\mathcal{L}(x;\theta,\phi)}_{\text{ELBO}}$

$\log p(x;\theta)=\mathcal{L}(x;\theta,\phi)+D_{\text{KL}}(q(z;\phi)||p(z|x;\theta))$

The better $q(z;\phi)$ can approximate the posterior $p(z|x;\theta)$, the smaller the $D_{\text{KL}}(q(z;\phi)||p(z|x;\theta))$ we can achieve, and the closer the ELBO will be to $\log p(x;\theta)$

1.3 apply ELBO to the entire dataset

Minimizing $D_{\text{KL}}(P_{\text{data}}||P_\theta)$ is equivalent to maximizing the log-likelihood $l(\theta;\mathcal{D})=\sum_{x^i\in\mathcal{D}}\log p(x^i;\theta)$, which is bounded below by the sum of per-example ELBOs: $l(\theta;\mathcal{D})\ge\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$

Therefore, $\max_\theta l(\theta;\mathcal{D})\ge\max_{\theta,\phi^1,\dots,\phi^M}\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$

Note that we use different variational parameters $\phi^i$ for every data point $x^i$, because the true posterior $p(z|x^i;\theta)$ is different across data points $x^i$

1.4 Learning via stochastic variational inference (SVI)

Optimize $\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$ as a function of $\theta,\phi^1,\dots,\phi^M$ using stochastic gradient descent: $\mathcal{L}(x^i;\theta,\phi^i)=\sum_zq(z;\phi^i)\log p(z,x^i;\theta)+H(q(z;\phi^i))=\mathbb{E}_{q(z;\phi^i)}[\log p(z,x^i;\theta)-\log q(z;\phi^i)]$

step 1: Initialize $\theta,\phi^1,\dots,\phi^M$

step 2: Randomly sample a data point $x^i$ from $\mathcal{D}$

step 3: Repeat $\phi^i=\phi^i+\eta\nabla_{\phi^i}\mathcal{L}(x^i;\theta,\phi^i)$ until convergence to $\phi^{i,*}\approx\arg\max_\phi\mathcal{L}(x^i;\theta,\phi)$

step 4: Compute $\nabla_\theta\mathcal{L}(x^i;\theta,\phi^{i,*})$

step 5: Update $\theta$ in the gradient direction, then go to step 2
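The five steps can be sketched on a toy model where everything is checkable by hand. The model is an assumption made for illustration, not the lecture's: $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z+\theta,1)$, $q(z;\phi)=\mathcal{N}(m,s^2)$ with $\phi=(m,s)$. For this linear-Gaussian case the ELBO is $-\frac12(x-m-\theta)^2-\frac12m^2-s^2+\log s+\text{const}$, so the gradients below are closed form rather than Monte Carlo estimates. Step sizes and iteration counts are arbitrary choices.

```python
import random

def elbo_grads(x, theta, m, s):
    """Closed-form ELBO gradients for the toy model
    z ~ N(0, 1), x | z ~ N(z + theta, 1), q(z; phi) = N(m, s^2)."""
    grad_theta = x - theta - m
    grad_m = (x - theta - m) - m
    grad_s = 1.0 / s - 2.0 * s
    return grad_theta, grad_m, grad_s

def svi(data, outer_steps=2000, inner_steps=50, lr_phi=0.1, lr_theta=0.05, seed=0):
    rng = random.Random(seed)
    theta = 0.0                               # step 1: initialize theta ...
    phis = [[0.0, 1.0] for _ in data]         # ... and one phi^i = (m, s) per data point
    for _ in range(outer_steps):
        i = rng.randrange(len(data))          # step 2: sample a data point
        x, (m, s) = data[i], phis[i]
        for _ in range(inner_steps):          # step 3: ascend the ELBO in phi^i
            _, grad_m, grad_s = elbo_grads(x, theta, m, s)
            m, s = m + lr_phi * grad_m, s + lr_phi * grad_s
        phis[i] = [m, s]
        grad_theta, _, _ = elbo_grads(x, theta, m, s)   # step 4: gradient in theta
        theta += lr_theta * grad_theta        # step 5: update theta, back to step 2
    return theta, phis

data = [1.0, 2.0, 3.0, 2.5, 1.5]
theta, phis = svi(data)
print("theta:", theta)   # should hover near the data mean, the MLE for this toy model
print("phis:", phis)     # each s should approach sqrt(1/2), the true posterior std
```

Note how much work step 3 does for a single data point; this per-example inner loop is the cost that amortized inference later removes.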

Standard Monte Carlo: sample $z^1,\dots,z^K$ from $q(z;\phi^i)$ and estimate $\mathbb{E}_{q(z;\phi^i)}[\log p(z,x^i;\theta)-\log q(z;\phi^i)]\approx\frac{1}{K}\sum_k\left(\log p(z^k,x^i;\theta)-\log q(z^k;\phi^i)\right)$
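A sketch of this Monte Carlo ELBO estimate, again on an assumed toy model ($z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z+\theta,1)$, $q(z;\phi)=\mathcal{N}(m,s^2)$) chosen because its true posterior $\mathcal{N}((x-\theta)/2,\,1/2)$ and evidence $\mathcal{N}(x;\theta,2)$ are known in closed form. The function names and numbers are illustrative.

```python
import math
import random

def log_normal(v, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo_estimate(x, theta, m, s, k, rng):
    """(1/K) * sum_k [log p(z_k, x; theta) - log q(z_k; phi)] with z_k ~ q = N(m, s^2)."""
    total = 0.0
    for _ in range(k):
        z = rng.gauss(m, s)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z + theta, 1.0)
        total += log_joint - log_normal(z, m, s)
    return total / k

x, theta = 1.0, 0.2
log_px = log_normal(x, theta, math.sqrt(2.0))   # exact log evidence for the toy model
rng = random.Random(0)
print("q = prior:    ", elbo_estimate(x, theta, 0.0, 1.0, 5000, rng))
print("q = posterior:", elbo_estimate(x, theta, (x - theta) / 2, math.sqrt(0.5), 5000, rng))
print("log p(x):     ", log_px)
```

When $q$ equals the posterior, every sample contributes exactly $\log p(x;\theta)$, so the estimate has zero variance. For any other $q$ it averages to something smaller.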

Key assumption: $q(z;\phi)$ is tractable, i.e., easy to sample from and easy to evaluate (e.g., a flow model; a counterexample would be a generative model whose density is hard to evaluate)

How do we compute $\nabla_\theta\mathcal{L}(x;\theta,\phi)$ and $\nabla_\phi\mathcal{L}(x;\theta,\phi)$?

The gradient with respect to $\theta$ is easy; estimate it with a Monte Carlo average: $\nabla_\theta\mathbb{E}_{q(z;\phi)}[\log p(z,x;\theta)-\log q(z;\phi)]=\mathbb{E}_{q(z;\phi)}[\nabla_\theta\log p(z,x;\theta)]\approx\frac{1}{K}\sum_k\nabla_\theta\log p(z^k,x;\theta)$

The gradient with respect to $\phi$ is more complicated because the expectation itself depends on $\phi$. Use the reparameterization trick: suppose $q(z;\phi)=\mathcal{N}(\mu,\sigma^2I)$ with $\phi=(\mu,\sigma)$. Instead of sampling $z\sim q(z;\phi)$ directly, sample $\epsilon\sim\mathcal{N}(0,I)$ and set $z=\mu+\sigma\epsilon=g(\epsilon;\phi)$, where $g$ is deterministic.

After this change of variables, $\mathbb{E}_{z\sim q(z;\phi)}[r(z)]=\int q(z;\phi)r(z)dz=\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}[r(g(\epsilon;\phi))]=\int\mathcal{N}(\epsilon)r(\mu+\sigma\epsilon)d\epsilon$

$\nabla_\phi\mathbb{E}_{q(z;\phi)}[r(z)]=\nabla_\phi\mathbb{E}_\epsilon[r(g(\epsilon;\phi))]=\mathbb{E}_\epsilon[\nabla_\phi r(g(\epsilon;\phi))]$, which is easy to estimate via Monte Carlo if $r$ and $g$ are differentiable: $\mathbb{E}_\epsilon[\nabla_\phi r(g(\epsilon;\phi))]\approx\frac{1}{K}\sum_k\nabla_\phi r(g(\epsilon^k;\phi))$, where $\epsilon^1,\dots,\epsilon^K\sim\mathcal{N}(0,I)$
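A minimal 1-D sketch of the reparameterized gradient estimate. To keep it checkable, the test function is $r(z)=z^2$ (an assumption for illustration), for which $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[r(z)]=\mu^2+\sigma^2$ and the true gradients are $2\mu$ and $2\sigma$. The derivative `r_prime` is passed in by hand since no autodiff library is used.

```python
import random

def reparam_grad(r_prime, mu, sigma, k, rng):
    """Estimate the gradient of E_{z~N(mu, sigma^2)}[r(z)] w.r.t. (mu, sigma)
    using z = g(eps; phi) = mu + sigma * eps with eps ~ N(0, 1)."""
    grad_mu = grad_sigma = 0.0
    for _ in range(k):
        eps = rng.gauss(0.0, 1.0)
        dr_dz = r_prime(mu + sigma * eps)   # chain rule through g(eps; phi)
        grad_mu += dr_dz                    # dz/dmu = 1
        grad_sigma += dr_dz * eps           # dz/dsigma = eps
    return grad_mu / k, grad_sigma / k

mu, sigma = 1.5, 0.5
rng = random.Random(0)
g_mu, g_sigma = reparam_grad(lambda z: 2.0 * z, mu, sigma, 50000, rng)
print("estimated:", g_mu, g_sigma)
print("exact:    ", 2 * mu, 2 * sigma)
```

The randomness now lives entirely in $\epsilon$, whose distribution does not depend on $\phi$, which is why the gradient can move inside the expectation.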

1.5 Amortized Inference

$\max_\theta l(\theta;\mathcal{D})\ge\max_{\theta,\phi^1,\dots,\phi^M}\sum_{x^i\in\mathcal{D}}\mathcal{L}(x^i;\theta,\phi^i)$. So far, a separate set of variational parameters $\phi^i$ has been maintained (stored) for every training example $x^i$, which does not scale.

Amortization: now we learn a single parametric function $f_\lambda$ that maps each $x$ to a set of (good) variational parameters, like doing regression on $x^i\to\phi^{i,*}$

Example: approximate the true posterior $p_\theta(z|x)$ with $q(z|x^i)$. If the $q(z|x^i)$ are Gaussians with different means $\mu^1,\dots,\mu^M$, learn a single neural network $f_\lambda$ mapping $x^i$ to $\mu^i$, then approximate the posteriors $q(z|x^i)$ using this distribution $q_\lambda(z|x)$. This can be understood as $q_\lambda(z|x^i)\approx q_{\phi^{i,*}}(z|x^i)\approx p_\theta(z|x^i)$
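As a rough sketch of amortization, assume the toy model $z\sim\mathcal{N}(0,1)$, $x|z\sim\mathcal{N}(z+\theta,1)$ with $\theta$ held fixed, and let $f_\lambda(x)=ax+b$ predict the mean of $q$ (a linear map standing in for the neural network; both the model and the map are assumptions for illustration). The ELBO gradient with respect to the mean $m$ is $(x-\theta)-2m$ for this model, and the chain rule passes it on to $\lambda=(a,b)$.

```python
def fit_amortized_mean(data, theta, steps=3000, lr=0.1):
    """Gradient ascent on the average ELBO over lambda = (a, b), with m_i = a * x_i + b."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x in data:
            m = a * x + b                      # amortized variational mean f_lambda(x)
            grad_m = (x - theta) - 2.0 * m     # dELBO/dm for the toy model
            grad_a += grad_m * x               # chain rule: dm/da = x
            grad_b += grad_m                   # chain rule: dm/db = 1
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b

data = [1.0, 2.0, 3.0, 2.5, 1.5]
theta = 0.4
a, b = fit_amortized_mean(data, theta)
print("learned:", a, b)
print("optimal per-example means:", [(x - theta) / 2 for x in data])
print("amortized means:          ", [a * x + b for x in data])
```

Here the optimal per-example mean $(x-\theta)/2$ happens to be linear in $x$, so one shared pair $(a,b)$ reproduces every $\phi^{i,*}$ without storing any of them. With a real decoder network the fit would only be approximate.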

1.6 A variational approximation to the posterior

(1) Assume $p(z,x^i;\theta)$ is close to $p_{\text{data}}(z,x^i)$, and suppose $z$ captures information such as the digit identity (label), style, etc.

(2) Suppose $q(z;\phi^i)$ is a tractable probability distribution over the hidden variables $z$, parameterized by $\phi^i$. For each $x^i$, we need to find a good $\phi^{i,*}$ (via optimization, which is expensive).

(3) (Amortized inference) Learn how to map $x^i$ to a good set of parameters $\phi^i$ via $q(z;f_\lambda(x^i))$. $f_\lambda$ learns how to solve the optimization problem.

In the literature, $q(z;f_\lambda(x^i))$ is often denoted $q_\phi(z|x)$.

1.7 Autoencoder perspective

$\log p(x;\theta)\ge\mathcal{L}(x;\theta,\phi)=\mathbb{E}_{q_\phi(z|x)}[\log p(z,x;\theta)-\log q_\phi(z|x)]=\mathbb{E}_{q_\phi(z|x)}[\log p(z,x;\theta)-\log p(z)+\log p(z)-\log q_\phi(z|x)]$

$=\mathbb{E}_{q_\phi(z|x)}[\log p(x|z;\theta)]-D_{\text{KL}}(q_\phi(z|x)||p(z))$

step 1: Take a data point $x^i$ and map it to $\hat{z}$ by sampling from $q_\phi(z|x^i)$ (encoder), i.e., sample from a Gaussian with parameters $(\mu,\sigma)=\text{encoder}_\phi(x^i)$

step 2: Reconstruct $\hat{x}$ by sampling from $p(x|\hat{z};\theta)$ (decoder), i.e., sample from a Gaussian with parameters $\text{decoder}_\theta(\hat{z})$

What does the training objective $\mathcal{L}(x;\theta,\phi)$ do?

The first term encourages $\hat{x}\approx x^i$: for a Gaussian decoder with fixed covariance $\sigma^2I$, $\log p_\theta(x^i|z)=-\frac{1}{2\sigma^2}\|x^i-\mu_\theta(z)\|^2+C$, so maximizing $\log p_\theta(x^i|z)$ is equivalent to minimizing $\|x^i-\mu_\theta(z)\|^2$

The second term encourages $\hat{z}$ to have a distribution similar to the prior $p(z)$: the more similar the two distributions, the smaller the KL divergence
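The two terms can be written out for one data point as a short sketch. It assumes a diagonal Gaussian encoder output `(mu, sigma)`, a standard normal prior, and a Gaussian decoder with fixed standard deviation `obs_std`; `decode` is a hypothetical stand-in for the decoder network's mean. The KL term uses the closed form for a diagonal Gaussian against $\mathcal{N}(0,I)$, namely $\frac12\sum_d(\mu_d^2+\sigma_d^2-1-\log\sigma_d^2)$.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def elbo_single_sample(x, mu, sigma, decode, obs_std, rng):
    """One-sample ELBO estimate: log p(x | z_hat) - KL(q(z|x) || p(z))."""
    z_hat = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]  # encoder sample
    x_mean = decode(z_hat)                                           # decoder mean
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, x_mean))
    log_lik = (-0.5 * sq_err / obs_std ** 2
               - len(x) * (math.log(obs_std) + 0.5 * math.log(2 * math.pi)))
    return log_lik - kl_to_standard_normal(mu, sigma)

# Hypothetical 2-D data point, 1-D latent, and a fixed linear "decoder".
rng = random.Random(0)
x = [0.5, -1.0]
decode = lambda z: [z[0], -2.0 * z[0]]
print(elbo_single_sample(x, mu=[0.5], sigma=[0.1], decode=decode, obs_std=1.0, rng=rng))
```

The reconstruction term rewards a small squared error between $x$ and the decoded mean, while the KL term penalizes encoder outputs that drift away from $\mu=0,\sigma=1$.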

Maximizing $\log p(x;\theta)$ over the data is maximum likelihood estimation, $\max_\theta\mathbb{E}_{x\sim p_{\text{data}}}[\log p_\theta(x)]$, which is equivalent to minimizing $\mathbb{E}_{x\sim p_{\text{data}}}[-\log p_\theta(x)]=H(p_{\text{data}},p_\theta)=H(p_{\text{data}})+D_{KL}(p_{\text{data}}\|p_\theta)$. Since $H(p_{\text{data}})$ does not depend on $\theta$, this is equivalent to minimizing $D_{KL}(p_{\text{data}}\|p_\theta)$, which pushes the model distribution $p_\theta(x)$ as close as possible to the true data distribution $p_{\text{data}}(x)$.

It took me this long to finish lec 6. I could come up with eight hundred excuses on the spot, but I am ashamed all the same. 2026/5/28