VideoGPT: Video Generation Using VQ-VAE and Transformers

1 Title

VideoGPT: Video Generation using VQ-VAE and Transformers (Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas)

2 Conclusion

This paper presents VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings.
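
Axial self-attention restricts each attention operation to a single axis of the latent video (time, height or width), so each position attends only to the positions on one axis line instead of all T·H·W positions. Below is a minimal pure-Python sketch, assuming single-head attention with no learned query/key/value projections; the function names are hypothetical, and applying the three axes one after another is an assumption, not necessarily how VideoGPT combines them.

```python
import math


def attend(vectors):
    """Single-head scaled dot-product self-attention over a list of vectors.

    Queries, keys and values are the input vectors themselves (no learned
    projections), which keeps the sketch short.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out


def axial_attention(grid, axis):
    """Run attend() independently on every line of a T x H x W grid along one axis.

    axis=0 attends over time, axis=1 over height, axis=2 over width.
    """
    sizes = (len(grid), len(grid[0]), len(grid[0][0]))
    T, H, W = sizes
    out = [[[None] * W for _ in range(H)] for _ in range(T)]
    other1, other2 = (axis + 1) % 3, (axis + 2) % 3
    # Hold the two other axes fixed and attend along the chosen one.
    for a in range(sizes[other1]):
        for b in range(sizes[other2]):
            coords = []
            for i in range(sizes[axis]):
                idx = [0, 0, 0]
                idx[axis], idx[other1], idx[other2] = i, a, b
                coords.append(tuple(idx))
            line = [grid[t][h][w] for t, h, w in coords]
            for (t, h, w), v in zip(coords, attend(line)):
                out[t][h][w] = v
    return out


def axial_block(grid):
    """Attend along time, then height, then width (sequential order is an assumption)."""
    for axis in range(3):
        grid = axial_attention(grid, axis)
    return grid
```

For example, on a grid whose vectors do not change over time, attending along axis 0 returns the grid unchanged, while attending along the width axis mixes information only within each row.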

3 Good Sentences

1. High-fidelity natural videos is one notable modality that has not seen the same level of progress in generative modeling as compared to images, audio, and text. This is reasonable since the complexity of natural videos requires modeling correlations across both space and time with much higher input dimensions. Video modeling is therefore a natural next challenge for current deep generative models. (The significance of this work)

2. The above line of reasoning leads us to our proposed model: VideoGPT, a simple video generation architecture that is a minimal adaptation of VQ-VAE and GPT architectures for videos. (The reason for choosing VideoGPT)

3. Although the VQ-VAE is trained unconditionally, we can generate conditional samples by training a conditional prior. We use two types of conditioning: Cross Attention and Conditional Norms. (How to go from unconditional to conditional generation)
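
A conditional norm can be sketched as a layer normalization whose gain and bias are predicted from a conditioning vector (for example an embedding of a class label or of conditioning frames) instead of being fixed learned parameters. The plain-Python sketch below assumes a single linear map each for the gain and the bias; the names conditional_layer_norm, gain_params and bias_params are hypothetical. Cross attention, the other option, would instead let the latent tokens attend to a sequence of conditioning features.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """Hypothetical linear layer: one weight row and one bias per output."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


def conditional_layer_norm(x, cond, gain_params, bias_params):
    """LayerNorm whose per-feature gain and bias are predicted from cond.

    gain_params and bias_params are (weights, bias) pairs of hypothetical
    linear layers mapping cond (length C) to a vector of length len(x).
    """
    normed = layer_norm(x)
    gamma = linear(*gain_params, cond)  # conditioning-dependent scale
    beta = linear(*bias_params, cond)   # conditioning-dependent shift
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]
```

With zero weights, a gain bias of ones and a shift bias of zeros, the function reduces to a plain layer norm, which makes that a natural starting point before the conditioning weights are learned.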


Background

VQ-VAE

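VQ-VAE uses a learned codebook to encode images into discrete representations: each vector produced by the encoder is replaced by its nearest codebook entry, so an image becomes a grid of codebook indices.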
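
The codebook lookup itself is a nearest-neighbour search under squared Euclidean distance. The plain-Python sketch below, with hypothetical function names, maps encoder output vectors to codebook indices and back; training details such as the straight-through gradient estimator are omitted.

```python
def quantize(vectors, codebook):
    """Return, for each encoder output vector, the index of its nearest codebook entry."""
    indices = []
    for z in vectors:
        # Squared Euclidean distance from z to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        indices.append(min(range(len(codebook)), key=dists.__getitem__))
    return indices


def dequantize(indices, codebook):
    """Replace each index by a copy of its codebook vector (the decoder's input)."""
    return [list(codebook[i]) for i in indices]
```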

Method

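As shown in the figure, training is split into two stages: training the VQ-VAE (left) and training an autoregressive Transformer in the latent space (right).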

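The first stage is similar to the original VQ-VAE training procedure, except that the encoder and decoder operate on videos, using 3D convolutions and axial self-attention.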
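
For reference, the objective of the original VQ-VAE combines a reconstruction term, a codebook term and a commitment term. Here E and D denote the encoder and decoder, e is the codebook vector nearest to E(x), sg[·] is the stop-gradient operator and β is the commitment weight; this notation is introduced for illustration and does not come from these notes. Many implementations replace the codebook term with exponential-moving-average codebook updates; these notes do not say which variant VideoGPT uses.

```latex
\mathcal{L} = \lVert x - D(e) \rVert_2^2
  + \lVert \mathrm{sg}[E(x)] - e \rVert_2^2
  + \beta \, \lVert \mathrm{sg}[e] - E(x) \rVert_2^2
```

Because the nearest-neighbour lookup is not differentiable, the gradient of the reconstruction term is copied straight through from the decoder input e to the encoder output E(x).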

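In the second stage, the trained VQ-VAE encodes the video data into latent sequences, which serve as training data for the prior model. To generate a video, a latent sequence is first sampled from the prior, and the VQ-VAE then decodes it into a video sample. (The Transformer prior is also where conditioning is introduced, using either cross attention or conditional norms.)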
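
Sampling in the second stage can be sketched as an autoregressive loop over a flattened latent grid. In the sketch below, prior is a hypothetical callable standing in for the trained Transformer: given the indices sampled so far, it returns one unnormalized weight per codebook entry. The raster order (time outermost, then height, then width) is an assumption for illustration, and the VQ-VAE decoder that would turn the resulting T x H x W index grid into frames is not shown.

```python
import random


def sample_latents(prior, shape, num_codes, rng):
    """Autoregressively sample a T x H x W grid of codebook indices.

    prior(prefix) is a hypothetical stand-in for the Transformer prior: it
    returns num_codes non-negative weights for the next index, given the
    raster-order list of indices sampled so far.
    """
    T, H, W = shape
    seq = []
    for _ in range(T * H * W):
        weights = prior(seq)
        # Draw the next index in proportion to the prior's weights.
        seq.append(rng.choices(range(num_codes), weights=weights, k=1)[0])
    # Un-flatten the raster-order sequence back into the latent grid.
    return [[seq[(t * H + h) * W:(t * H + h + 1) * W] for h in range(H)] for t in range(T)]
```

A conditional prior would receive the conditioning information as an additional input alongside the prefix; the sampling loop itself stays the same.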
