Hung-yi Lee Machine Learning Notes: Flow-based Generative Model (Supplement)

Table of Contents

Introduction
Review of the generation problem: Generator
[Math Background](#Math Background)
- [Jacobian Matrix](#Jacobian Matrix)
- [Determinant](#Determinant 行列式)
- [Change of Variable Theorem](#Change of Variable Theorem)
Constraints on the network G
Flow-based network architecture
- Training G
- [Coupling Layer](#Coupling Layer)
  - [Coupling Layer: computing the inverse](#Coupling Layer反函数计算)
  - [Coupling Layer: computing the Jacobian matrix](#Coupling Layer Jacobian矩阵计算)
  - [Coupling Layer Stacking](#Coupling Layer Stacking)
- [1×1 Convolution](#1×1 Convolution)
GLOW results
Other work

The original lecture video is on YouTube: https://www.youtube.com/watch?v=uXY18nzdSsM

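Introduction

Three kinds of generative models were covered earlier:

1. Component-by-component (also called the auto-regressive model): the output is generated one component at a time. How should the best generation order be chosen? Generating components one by one is also slow. This is especially painful for speech, where about 200,000 sample points have to be generated per second of audio; some people claim that one second of audio takes 90 minutes to synthesize.

2. Autoencoder (VAE): this model provably optimizes a lower bound on the likelihood rather than maximizing the likelihood itself, so it is hard to say how good the result really is.

3. Generative Adversarial Network (GAN): very powerful, but hard to train.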

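Review of the generation problem: Generator

A generator $G$ is a network. The network defines a probability distribution $p_G$.

Why does a generator network define a probability distribution? Consider the following pipeline.

$G$ takes a vector $z$ as input and produces an output $x=G(z)$. This $x$ is a high-dimensional vector representing an image: each dimension of $x$ is one pixel of the image.

The input vector $z$ is sampled from a normal distribution.

Sampling $z$ many times and passing each sample through $G$ therefore yields a more complex distribution $p_G$.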
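The sampling pipeline above can be sketched in a few lines of Python. This is only an illustration: the function `G` here is a hypothetical toy map on two-dimensional vectors, not a trained network, and `sample_from_p_G` is a name introduced for this sketch.

```python
import random


def G(z):
    """Hypothetical toy generator: a fixed nonlinear map from R^2 to R^2.

    A real generator would be a neural network; this stand-in only shows
    that a simple input distribution becomes a more complex one.
    """
    z1, z2 = z
    return (z1 + 0.5 * z2 ** 2, z2)


def sample_from_p_G(n, seed=0):
    """Draw n samples x = G(z), with each component of z ~ N(0, 1)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        samples.append(G(z))
    return samples


xs = sample_from_p_G(10000)
# The first coordinate is no longer zero-mean Gaussian:
# E[x1] = E[z1] + 0.5 * E[z2^2] = 0.5
mean_x1 = sum(x[0] for x in xs) / len(xs)
print(round(mean_x1, 2))
```

Even with this very simple `G`, the samples `xs` follow a curved, non-Gaussian distribution, which is the sense in which the network "defines" $p_G$.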

We want to find a $G$ whose distribution $p_G$ is as close as possible to the real image distribution $p_{data}(x)$.

"As close as possible" here means maximum likelihood: samples drawn from $p_{data}(x)$ should have as high a likelihood under $p_G$ as possible. In symbols:
$$\begin{aligned} G^*&=\arg\max_G\sum_{i=1}^{m}\log p_G(x^i),\quad \{x^1,x^2,\cdots,x^m\}\text{ from } p_{data}(x)\\ &\approx \arg\min_G KL(p_{data}\|p_G)\end{aligned}$$

In other words, maximizing the likelihood is approximately the same as minimizing the KL divergence between the two distributions.
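The step from maximum likelihood to KL divergence is not spelled out in these notes. A sketch of the standard argument: the sample sum is approximated by an expectation over $p_{data}$, and a term that does not depend on $G$ is subtracted, which leaves the optimum unchanged.

$$\begin{aligned}
G^* &= \arg\max_G \sum_{i=1}^{m}\log p_G(x^i)\\
&\approx \arg\max_G E_{x\sim p_{data}}\left[\log p_G(x)\right]\\
&= \arg\max_G \left(\int_x p_{data}(x)\log p_G(x)\,dx-\int_x p_{data}(x)\log p_{data}(x)\,dx\right)\\
&= \arg\min_G \int_x p_{data}(x)\log\cfrac{p_{data}(x)}{p_G(x)}\,dx\\
&= \arg\min_G KL(p_{data}\|p_G)
\end{aligned}$$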

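Because $G$ is a network, the likelihood $p_G(x)$ is very hard to compute, which makes maximizing it difficult. The flow-based generative model offers a way to maximize the likelihood directly. This is the hard part, so some mathematical background comes first.

Math Background

Three tools are needed: the Jacobian, the determinant, and the change of variable theorem.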

Jacobian Matrix

Suppose a function $x=f(z)$ takes a two-dimensional vector $z=\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}$ as input and outputs $x=\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$. (In general, the input and output dimensions of a Jacobian matrix do not have to match; the example is simplified.)

This function can be viewed as the generator $G$ mentioned above.

The Jacobian matrix $J_f$ of $x=f(z)$ is the matrix of partial derivatives of every output with respect to every input:
$$J_f=\begin{bmatrix} \cfrac{\partial x_1}{\partial z_1} & \cfrac{\partial x_1}{\partial z_2}\\ \cfrac{\partial x_2}{\partial z_1} &\cfrac{\partial x_2}{\partial z_2} \end{bmatrix}\tag{1}$$

A small Jacobian matrix example. Suppose we have the function:

$$\begin{bmatrix} z_1+z_2 \\ 2z_1 \end{bmatrix}=f\left(\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\right)$$

Then by Equation (1):

$$J_f=\begin{bmatrix} \cfrac{\partial (z_1+z_2)}{\partial z_1} & \cfrac{\partial (z_1+z_2)}{\partial z_2}\\ \cfrac{\partial (2z_1)}{\partial z_1} &\cfrac{\partial (2z_1)}{\partial z_2} \end{bmatrix}=\begin{bmatrix} 1 & 1\\ 2 &0 \end{bmatrix}$$
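The Jacobian above can be checked numerically with finite differences. The sketch below assumes the example map is $f(z)=(z_1+z_2,\ 2z_1)$, which is the map that the stated result $\begin{bmatrix} 1 & 1\\ 2 & 0 \end{bmatrix}$ corresponds to; `numerical_jacobian` is a helper name introduced only for this illustration.

```python
def f(z):
    """Example map assumed here: f(z1, z2) = (z1 + z2, 2 * z1)."""
    z1, z2 = z
    return [z1 + z2, 2.0 * z1]


def numerical_jacobian(func, z, eps=1e-6):
    """Central-difference estimate of J[i][j] = d func(z)[i] / d z[j]."""
    n_out = len(func(z))
    n_in = len(z)
    J = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_in):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        f_plus = func(z_plus)
        f_minus = func(z_minus)
        for i in range(n_out):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


# The evaluation point is arbitrary: f is linear, so its Jacobian is constant.
J_f = numerical_jacobian(f, [0.3, -1.2])
for row in J_f:
    print([round(v, 4) for v in row])
```

Row $i$, column $j$ of the result is the derivative of output $x_i$ with respect to input $z_j$, matching the layout of Equation (1).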

Similarly, if $z=f^{-1}(x)$, the Jacobian matrix of the inverse function of $f$ is:
$$J_{f^{-1}}=\begin{bmatrix} \cfrac{\partial z_1}{\partial x_1} & \cfrac{\partial z_1}{\partial x_2}\\ \cfrac{\partial z_2}{\partial x_1} &\cfrac{\partial z_2}{\partial x_2} \end{bmatrix}\tag{2}$$

The matrices in Equations (1) and (2) are inverses of each other: their product is the identity matrix (ones on the diagonal, zeros everywhere else).

A small example of the Jacobian matrix of an inverse function. Suppose we have the function:

$$\begin{bmatrix} x_2/2 \\ x_1-x_2/2 \end{bmatrix}=f^{-1}\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)$$

Then by Equation (2):

$$J_{f^{-1}}=\begin{bmatrix} \cfrac{\partial (x_2/2)}{\partial x_1} & \cfrac{\partial (x_2/2)}{\partial x_2}\\ \cfrac{\partial (x_1-x_2/2)}{\partial x_1} &\cfrac{\partial (x_1-x_2/2)}{\partial x_2} \end{bmatrix}=\begin{bmatrix} 0 & 1/2\\ 1 &-1/2 \end{bmatrix}$$

Multiplying the results of the two examples:
$$J_fJ_{f^{-1}}=\begin{bmatrix} 1 & 1\\ 2 &0 \end{bmatrix}\begin{bmatrix} 0 & 1/2\\ 1 &-1/2 \end{bmatrix}=I$$
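The product can be verified with a few lines of plain Python. This is a minimal sketch: `matmul` and `f_inv` are helper names introduced here, and the two matrices are copied from the examples above.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def f_inv(x):
    """Inverse map from the example: f_inv(x1, x2) = (x2 / 2, x1 - x2 / 2)."""
    x1, x2 = x
    return [x2 / 2.0, x1 - x2 / 2.0]


J_f = [[1.0, 1.0], [2.0, 0.0]]
J_f_inv = [[0.0, 0.5], [1.0, -0.5]]

product = matmul(J_f, J_f_inv)          # J_f times J_{f^-1}
product_reversed = matmul(J_f_inv, J_f)  # J_{f^-1} times J_f
print(product)
print(product_reversed)
```

Both orders of multiplication give the identity matrix, as expected for a matrix and its inverse.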

Determinant

The determinant of a square matrix is a scalar that provides information about the matrix.

For a 2×2 matrix:
$$A=\begin{bmatrix} a&b \\ c &d \end{bmatrix}$$

we have:
$$\det(A)=ad-bc$$
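As a quick illustration, the 2×2 formula can be applied to the two Jacobian matrices from the earlier examples. `det2` is a helper name introduced for this sketch.

```python
def det2(A):
    """Determinant of a 2x2 matrix [[a, b], [c, d]], computed as a*d - b*c."""
    (a, b), (c, d) = A
    return a * d - b * c


J_f = [[1.0, 1.0], [2.0, 0.0]]
J_f_inv = [[0.0, 0.5], [1.0, -0.5]]

det_f = det2(J_f)          # 1*0 - 1*2
det_f_inv = det2(J_f_inv)  # 0*(-1/2) - (1/2)*1
print(det_f, det_f_inv, det_f * det_f_inv)
```

For this pair of mutually inverse matrices the two determinants multiply to 1, consistent with the general fact that $\det(A^{-1})=1/\det(A)$.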

For a 3×3 matrix: