Python Full-Stack Series [Stage 4] (50)

Chapter 5: Deep Learning

11. Diffusion Models

4. Appendix: Mathematical Derivation of Diffusion

4.2 Derivation of the Forward Diffusion Process

Let the initial data $x_0$ follow the distribution $q(x_0)$, i.e. the training-set distribution. Gaussian noise is then added to it step by step. The noise itself has no trainable parameters; its mean and variance are fixed, and the strength of the added noise is controlled by the variance coefficients $\beta_1, \cdots, \beta_T$, which are small numbers between 0 and 1 and typically increase with $t$. The process is fixed as a Markov chain: each step has the transition distribution $q(x_t|x_{t-1})$, and the overall posterior $q(x_{1:T}|x_0)$ is written as a product:

$$
q(x_t|x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I) \\
q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1})
$$

A key property of the forward process is that the distribution $q(x_t|x_0)$ at any timestep $t$ can be computed directly from the coefficients $\beta$ and $x_0$:

$$
q(x_t|x_0) = \mathcal{N}(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I)
$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$.
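As a concrete illustration, here is a minimal NumPy sketch of the forward process in closed form. The schedule values ($\beta$ linear from 1e-4 to 0.02, $T=1000$) are the common DDPM defaults, assumed here for the example rather than stated in the text above.

```python
import numpy as np

T = 1000
# Assumed linear schedule (the common DDPM default): beta_1 = 1e-4 ... beta_T = 0.02.
betas = np.linspace(1e-4, 0.02, T)      # index i stores beta_{i+1}
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alphas_bar = np.cumprod(alphas)         # alpha_bar_t = prod_{i=1}^{t} alpha_i

def q_sample(x0, t, eps=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).

    `t` is a 0-based index, so t = 0 corresponds to one diffusion step."""
    if eps is None:
        eps = np.random.randn(*x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
```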

4.2.1 Derivation of the Forward Diffusion Process

Using the reparameterization trick, $x_t$ can be written as $x_{t-1}$ plus a noise term:

$$
x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1}; \quad \text{where } \epsilon_{t-1}, \epsilon_{t-2}, \cdots \sim \mathcal{N}(0, I)
$$

In the same way, $x_{t-1}$ can be written as $x_{t-2}$ plus a noise term, and substituting it in gives:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, \big(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\, \epsilon_{t-2}\big) + \sqrt{1-\alpha_t}\, \epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1-\alpha_t}\, \epsilon_{t-1}
\end{aligned}
$$

The second and third terms above are the sum of two Gaussian noises. The sum of two zero-mean Gaussians still has mean 0, and the variances add: $\sigma_1^2 + \sigma_2^2 = \alpha_t(1-\alpha_{t-1}) + 1 - \alpha_t = 1 - \alpha_t \alpha_{t-1}$.

So the second and third terms can be merged:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\, \bar\epsilon_{t-2}; \quad \text{where } \bar\epsilon_{t-2} \text{ merges the two Gaussians} \\
&= \cdots \\
&= \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon
\end{aligned}
$$

Thus the distribution at any timestep $t$ can be obtained directly from the initial data and $t$.
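A quick numerical check of this derivation (a sketch reusing the `betas`/`alphas`/`alphas_bar` arrays from the earlier snippet): apply the one-step update $t$ times and compare the empirical statistics with the closed-form $q(x_t|x_0)$.

```python
np.random.seed(0)
x0 = np.ones(100_000)                   # 100k copies of the scalar x_0 = 1
t = 200                                 # 1-based step count

x = x0.copy()
for i in range(t):                      # steps 1..t; index i stores step i+1
    x = np.sqrt(alphas[i]) * x + np.sqrt(betas[i]) * np.random.randn(*x.shape)

print(x.mean(), np.sqrt(alphas_bar[t - 1]))   # both ~ sqrt(alpha_bar_t) * x_0
print(x.var(), 1.0 - alphas_bar[t - 1])       # both ~ 1 - alpha_bar_t
```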

4.2.2 Derivation of the Reverse Denoising Process

If the forward diffusion process is likened to ink diffusing in water, the reverse process corresponds to extracting the ink back out of the water. To simplify the analysis, it is also assumed to be a Markov chain with Gaussian transition distributions; the problem then becomes one of parameter estimation, and a neural network is used to learn the transitions.

$$
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t), \qquad p_\theta(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
$$

Here the network takes $x_t$ and $t$ as input; $\mu_\theta$ is the mean and $\Sigma_\theta$ the covariance of the transition distribution, and $\theta$ denotes the model parameters (the quantity to be learned); the transition distribution $p_\theta$ is unknown. The reverse process is harder than the forward one (dissolving ink into water is easy; extracting it back out is hard). The Diffusion model's approach is to derive, via the formulas below, the forward-process posterior $q(x_{t-1}|x_t, x_0)$ in closed form and use it as the target that the reverse transition $p_\theta(x_{t-1}|x_t)$ is trained to approximate:

$$
q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde\mu(x_t, x_0),\ \tilde\beta_t I\big)
$$

By Bayes' theorem:

$$
\begin{aligned}
q(x_{t-1}|x_t, x_0) &= \frac{q(x_{t-1})\, q(x_t, x_0|x_{t-1})}{q(x_t, x_0)} \\
&= \frac{q(x_{t-1})\, q(x_t|x_{t-1})\, q(x_0|x_{t-1})}{q(x_0)\, q(x_t|x_0)} \\
&= \frac{q(x_{t-1})\, q(x_t|x_{t-1})}{q(x_0)\, q(x_t|x_0)} \times \frac{q(x_0)\, q(x_{t-1}|x_0)}{q(x_{t-1})} \\
&= \frac{q(x_t|x_{t-1})\, q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&= q(x_t|x_{t-1}, x_0)\, \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}
\end{aligned}
$$

Write this as an exponential, drop the constant prefactors, expand, and collect terms into a quadratic in $x_{t-1}$:

$$
\begin{aligned}
& q(x_t|x_{t-1}, x_0)\, \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&\propto \exp\bigg( -\frac{1}{2} \Big( \frac{(x_t - \sqrt{\alpha_t}\, x_{t-1})^2}{\beta_t} + \frac{(x_{t-1} - \sqrt{\bar\alpha_{t-1}}\, x_0)^2}{1-\bar\alpha_{t-1}} - \frac{(x_t - \sqrt{\bar\alpha_t}\, x_0)^2}{1-\bar\alpha_t} \Big) \bigg) \\
&= \exp\bigg( -\frac{1}{2} \Big( \frac{x_t^2 - 2\sqrt{\alpha_t}\, x_t x_{t-1} + \alpha_t x_{t-1}^2}{\beta_t} + \frac{x_{t-1}^2 - 2\sqrt{\bar\alpha_{t-1}}\, x_0 x_{t-1} + \bar\alpha_{t-1} x_0^2}{1-\bar\alpha_{t-1}} - \frac{(x_t - \sqrt{\bar\alpha_t}\, x_0)^2}{1-\bar\alpha_t} \Big) \bigg) \\
&= \exp\bigg( -\frac{1}{2} \Big( \big(\tfrac{\alpha_t}{\beta_t} + \tfrac{1}{1-\bar\alpha_{t-1}}\big)\, x_{t-1}^2 - \big(\tfrac{2\sqrt{\alpha_t}}{\beta_t} x_t + \tfrac{2\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}} x_0\big)\, x_{t-1} + C(x_t, x_0) \Big) \bigg)
\end{aligned}
$$

The exponent has the form $-\frac{1}{2}(a x_{t-1}^2 - b x_{t-1} + C)$. Completing the square and comparing with the Gaussian density $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\big(-\frac{(x-\mu)^2}{2\sigma^2}\big)$ shows that the mean is $\frac{b}{2a}$ and the variance is $\frac{1}{a}$. Substituting gives the mean and variance (constants omitted below):

$$
\begin{aligned}
\tilde\mu(x_t, x_0) &= \Big(\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}} x_0\Big) \Big/ \Big(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}\Big) \\
&= \Big(\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}} x_0\Big) \cdot \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \cdot \beta_t \\
&= \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1-\bar\alpha_t} x_0 \\
\tilde\beta_t &= 1 \Big/ \Big(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar\alpha_{t-1}}\Big) = 1 \Big/ \Big(\frac{\alpha_t - \bar\alpha_t + \beta_t}{\beta_t (1-\bar\alpha_{t-1})}\Big) = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t} \cdot \beta_t
\end{aligned}
$$

Here the earlier definitions were used:

$$
\alpha_t = 1 - \beta_t, \qquad \bar\alpha_t = \prod_{i=1}^{t} \alpha_i
$$
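The posterior parameters translate directly into code. A sketch, reusing the schedule arrays from the earlier snippets (0-based index $i$ stores the value for step $t = i+1$, with the convention $\bar\alpha_0 := 1$):

```python
alphas_bar_prev = np.append(1.0, alphas_bar[:-1])   # alpha_bar_{t-1}

# beta_tilde_t and the two coefficients of mu_tilde(x_t, x_0) derived above
posterior_variance = betas * (1.0 - alphas_bar_prev) / (1.0 - alphas_bar)
coef_xt = np.sqrt(alphas) * (1.0 - alphas_bar_prev) / (1.0 - alphas_bar)
coef_x0 = np.sqrt(alphas_bar_prev) * betas / (1.0 - alphas_bar)

def q_posterior_mean(x0, xt, t):
    """mu_tilde(x_t, x_0): mean of q(x_{t-1} | x_t, x_0); `t` is 0-based."""
    return coef_xt[t] * xt + coef_x0[t] * x0
```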

This gives an analytic expression for the forward-process posterior $q(x_{t-1}|x_t, x_0)$, which is again a Gaussian. Its mean is an expression in $\alpha_t$, $\bar\alpha_t$, $\beta_t$, $x_0$ and $x_t$, while its variance is a pure constant, independent of $x$. Furthermore, from the relation between $x_0$ and $x_t$ obtained via the reparameterization trick in the forward process, we get:

$$
x_0 = \frac{1}{\sqrt{\bar\alpha_t}} \big(x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_t\big)
$$

Substituting this into the mean expression above to eliminate $x_0$, $\tilde\mu_t$ becomes:

$$
\begin{aligned}
\tilde\mu_t &= \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1-\bar\alpha_t} \cdot \frac{1}{\sqrt{\bar\alpha_t}} \big(x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_t\big) \\
&= \frac{1}{\sqrt{\alpha_t}} \Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_t\Big)
\end{aligned}
$$

Here $\epsilon$ is a random value sampled from the standard normal distribution at step $t$. This completes the derivation of the forward-process posterior $q(x_{t-1}|x_t, x_0)$: it is still a Gaussian, its mean depends only on $x_t$ and the standard-normal noise, and its variance depends only on the constants $\alpha$ and $\beta$. The significance of diffusion models is that they provide a new generative-modeling paradigm that better describes how data evolves. In summary, both the forward and reverse processes are Markov chains. The forward process is deterministic and controllable: noise is added step by step under the schedule of coefficients $\beta_t$, and its transition distributions are Gaussian. The reverse process is harder, but its transition distribution $p_\theta(x_{t-1}|x_t)$ can likewise be assumed Gaussian and approximated with a neural network. Since it cannot be solved directly for lack of usable data, we first derived the forward-process posterior $q(x_{t-1}|x_t, x_0)$, which is easier to obtain and has a closed form, and use it as the target for $p_\theta(x_{t-1}|x_t)$. These three distributions in some sense capture the entire evolution of the Diffusion model and will be used below when deriving the loss function.
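Putting the pieces together, here is a sketch of a single reverse (denoising) step built from the quantities above. `eps_model` is a hypothetical stand-in for a trained noise-prediction network $\epsilon_\theta(x_t, t)$ (anticipating the parameterization in 4.2.4), and the posterior variance $\tilde\beta_t$ from the previous snippet is used as the sampling variance.

```python
def p_sample(eps_model, xt, t):
    """One reverse step x_t -> x_{t-1}; `eps_model(xt, t)` is a hypothetical
    trained network standing in for epsilon_theta; `t` is 0-based."""
    eps_theta = eps_model(xt, t)
    # mu from the closed form: (x_t - (1 - alpha_t) / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
    mean = (xt - (1.0 - alphas[t]) / np.sqrt(1.0 - alphas_bar[t]) * eps_theta) \
           / np.sqrt(alphas[t])
    if t == 0:
        return mean                                   # final step: output the mean directly
    z = np.random.randn(*xt.shape)
    return mean + np.sqrt(posterior_variance[t]) * z  # variance beta_tilde_t
```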

4.2.3 Variational Derivation of the Loss Function

We start from the negative log-likelihood of the data. It cannot be minimized directly, so instead we look for an upper bound, obtained by adding a KL-divergence term, since KL divergence is non-negative:

$$
-\log p_\theta(x_0) \leq -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T}|x_0)\, \|\, p_\theta(x_{1:T}|x_0)\big)
$$

Minimizing the negative log-likelihood is equivalent to minimizing its upper bound (the right-hand side above). Rewrite the right-hand side: first expand the KL term using Bayes' theorem, then cancel the final term:

$$
\begin{aligned}
-\log p_\theta(x_0) &\leq -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T}|x_0)\, \|\, p_\theta(x_{1:T}|x_0)\big) \\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T}|x_0)} \bigg[ \log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)} \bigg] \\
&= -\log p_\theta(x_0) + \mathbb{E}_q \bigg[ \log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0) \bigg] \\
&= \mathbb{E}_q \bigg[ \log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})} \bigg]
\end{aligned}
$$

Taking the expectation over $q(x_0)$ on both sides of the inequality, the left side becomes the cross entropy and the right side becomes an expectation over $q(x_{0:T})$:

$$
-\mathbb{E}_{q(x_0)} \log p_\theta(x_0) \leq \mathbb{E}_{q(x_{0:T})} \bigg[ \log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})} \bigg]
$$

Minimizing the cross entropy is thus equivalent to minimizing its upper bound. The right-hand side is the Evidence Lower Bound (ELBO) from variational inference, except for the sign flip: maximizing the log-likelihood becomes minimizing the negative log-likelihood, and the lower bound becomes an upper bound. The loss function chosen by the Diffusion model is therefore the cross entropy of the target data, bounded via this variational argument and then simplified step by step; note that the numerator is the conditional distribution of the forward process and the denominator is the joint distribution of the reverse process. What follows is a sustained expansion of the right-hand side into an iterative form:

$$
\begin{aligned}
L_{VLB} &= \mathbb{E}_{q(x_{0:T})} \bigg[ \log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})} \bigg] \\
&= \mathbb{E}_q \bigg[ \log \frac{\prod_{t=1}^{T} q(x_t|x_{t-1})}{p_\theta(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)} \bigg] \\
&= \mathbb{E}_q \bigg[ -\log p_\theta(x_T) + \sum_{t=1}^{T} \log \frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)} \bigg] \\
&= \mathbb{E}_q \bigg[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)} + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} \bigg] \\
&= \mathbb{E}_q \bigg[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \Big( \frac{q(x_{t-1}|x_t, x_0)}{p_\theta(x_{t-1}|x_t)} \cdot \frac{q(x_t|x_0)}{q(x_{t-1}|x_0)} \Big) + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} \bigg] \\
&= \mathbb{E}_q \bigg[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_{t-1}|x_t, x_0)}{p_\theta(x_{t-1}|x_t)} + \sum_{t=2}^{T} \log \frac{q(x_t|x_0)}{q(x_{t-1}|x_0)} + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} \bigg] \\
&= \mathbb{E}_q \bigg[ -\log p_\theta(x_T) + \sum_{t=2}^{T} \log \frac{q(x_{t-1}|x_t, x_0)}{p_\theta(x_{t-1}|x_t)} + \log \frac{q(x_T|x_0)}{q(x_1|x_0)} + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} \bigg] \\
&= \mathbb{E}_q \bigg[ \log \frac{q(x_T|x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log \frac{q(x_{t-1}|x_t, x_0)}{p_\theta(x_{t-1}|x_t)} - \log p_\theta(x_0|x_1) \bigg] \\
&= \mathbb{E}_q \Big[ \underbrace{D_{KL}\big(q(x_T|x_0)\, \|\, p_\theta(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(x_{t-1}|x_t, x_0)\, \|\, p_\theta(x_{t-1}|x_t)\big)}_{L_{t-1}} - \underbrace{\log p_\theta(x_0|x_1)}_{L_0} \Big]
\end{aligned}
$$

4.2.4 Parameterizing the Loss Function

The KL divergence between two Gaussian distributions p and q can actually be computed in closed form, depending only on their means and variances:

$$
KL(p, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
$$
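A quick sanity check of this closed form in code (a sketch for 1-D Gaussians):

```python
def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), per the formula above."""
    return (np.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0 for identical distributions
print(gaussian_kl(1.0, 1.0, 0.0, 2.0))  # positive otherwise
```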

In the KL-divergence loss derived above, the variances of both Gaussians are constants and contribute nothing to the optimization, so they can be ignored. What remains is the term involving the two means:

$$
L_t = \mathbb{E}_{x_0, \epsilon} \bigg[ \frac{1}{2\|\Sigma_\theta(x_t, t)\|_2^2} \big\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2 \bigg]
$$

Here $\tilde\mu(x_t, x_0)$ is the mean of $q(x_{t-1}|x_t, x_0)$ and $\mu_\theta(x_t, t)$ is the mean of $p_\theta(x_{t-1}|x_t)$. The optimization objective is for the reverse-process mean $\mu_\theta$ to approach the forward-process posterior mean $\tilde\mu$; in other words, training amounts to making $\mu_\theta$ predict $\tilde\mu$. The latter was already derived above in closed form:

$$
\begin{aligned}
\tilde\mu_t &= \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t}{1-\bar\alpha_t} \cdot \frac{1}{\sqrt{\bar\alpha_t}} \big(x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_t\big) \\
&= \frac{1}{\sqrt{\alpha_t}} \Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_t\Big)
\end{aligned}
$$

Since $x_t$ is known during training, $\mu_\theta$ can likewise be written, via the reparameterization trick, in terms of $x_t$ and a parameterized noise prediction $\epsilon_\theta$ (the second step below). Collecting terms then cancels $x_t$, leaving only the difference between the two $\epsilon$ values. Finally, using the forward-process reparameterization, $x_t$ is replaced by its expression in $x_0$ and $\epsilon_t$:

$$
\begin{aligned}
L_t &= \mathbb{E}_{x_0, \epsilon} \bigg[ \frac{1}{2\|\Sigma_\theta(x_t, t)\|_2^2} \big\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2 \bigg] \\
&= \mathbb{E}_{x_0, \epsilon} \bigg[ \frac{1}{2\|\Sigma_\theta\|_2^2} \Big\| \frac{1}{\sqrt{\alpha_t}} \Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_t\Big) - \frac{1}{\sqrt{\alpha_t}} \Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\Big) \Big\|^2 \bigg] \\
&= \mathbb{E}_{x_0, \epsilon} \bigg[ \frac{(1-\alpha_t)^2}{2\alpha_t (1-\bar\alpha_t) \|\Sigma_\theta\|_2^2} \big\|\epsilon_t - \epsilon_\theta(x_t, t)\big\|^2 \bigg] \\
&= \mathbb{E}_{x_0, \epsilon} \bigg[ \frac{(1-\alpha_t)^2}{2\alpha_t (1-\bar\alpha_t) \|\Sigma_\theta\|_2^2} \big\|\epsilon_t - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon_t,\ t\big)\big\|^2 \bigg]
\end{aligned}
$$

The meaning of the expression above: a neural network takes $x_t$ (built from $x_0$ and $\epsilon_t$) together with the timestep $t$ as input, and outputs a prediction $\epsilon_\theta$ that approximates the forward-process noise $\epsilon_t$. This realizes the optimization of the negative log-likelihood. The paper's authors further found that the leading coefficient can be dropped without hurting the results, and training is even more stable, so the loss simplifies to:

$$
\begin{aligned}
L_t^{simple} &= \mathbb{E}_{t \sim [1, T],\, x_0,\, \epsilon_t} \Big[ \big\|\epsilon_t - \epsilon_\theta(x_t, t)\big\|^2 \Big] \\
&= \mathbb{E}_{t \sim [1, T],\, x_0,\, \epsilon_t} \Big[ \big\|\epsilon_t - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon_t,\ t\big)\big\|^2 \Big]
\end{aligned}
$$

Here $\epsilon_\theta$ is the value predicted by a parameterized neural network, and $\epsilon_t$ is the random noise at the current timestep. This completes the derivation of the loss function: the process is fairly involved, but the result is remarkably simple.
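The simplified loss maps almost line-for-line onto code. A sketch of one Monte-Carlo sample of $L_t^{simple}$, reusing the schedule arrays from earlier; `eps_model` is again the hypothetical $\epsilon_\theta$ network:

```python
def simple_loss(eps_model, x0):
    """One Monte-Carlo sample of L_simple: draw t and epsilon, form x_t in
    closed form, and regress the predicted noise onto the true noise."""
    t = np.random.randint(0, T)                    # t ~ Uniform{1, ..., T} (0-based index)
    eps = np.random.randn(*x0.shape)               # epsilon_t ~ N(0, I)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)  # || eps - eps_theta(x_t, t) ||^2
```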
