VideoGPT: Video Generation Using VQ-VAE and Transformers

1 Title

VideoGPT: Video Generation using VQ-VAE and Transformers (Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas)

2 Conclusion

This paper presents VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings.

3 Good Sentences

1、High-fidelity natural videos is one notable modality that has not seen the same level of progress in generative modeling as compared to images, audio, and text. This is reasonable since the complexity of natural videos requires modeling correlations across both space and time with much higher input dimensions. Video modeling is therefore a natural next challenge for current deep generative models. (The significance of this work)

2、The above line of reasoning leads us to our proposed model: VideoGPT, a simple video generation architecture that is a minimal adaptation of VQ-VAE and GPT architectures for videos. (The reason for choosing VideoGPT)

3、Although the VQ-VAE is trained unconditionally, we can generate conditional samples by training a conditional prior. We use two types of conditioning: Cross Attention and Conditional Norms. (How to go from unconditional to conditional generation)


Background

VQ-VAE

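A VQ-VAE uses a codebook mechanism to encode an image into discrete vectors: each continuous encoder output is replaced by its nearest entry in a learned codebook, so the image is represented by a grid of codebook indices.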
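The codebook lookup can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `quantize`, the tiny hand-written codebook, and the use of squared Euclidean distance on flattened vectors are assumptions made for illustration.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder vector to its nearest codebook entry.

    z_e:      (N, D) array of continuous encoder outputs
    codebook: (K, D) array of K code vectors
    returns:  indices of shape (N,) and quantized vectors of shape (N, D)
    """
    # Squared Euclidean distance between every encoder vector and every code
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy example: K = 3 codes of dimension D = 2 (values chosen by hand)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.9, 1.2], [-0.1, 0.1], [-0.8, 1.7]])
indices, z_q = quantize(z_e, codebook)
print(indices)  # [1 0 2]
```

The integer `indices` are the discrete representation; `z_q` is what a decoder would receive.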

Method

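As shown in the figure, the overall training procedure has two parts: training the VQ-VAE (left) and training an autoregressive Transformer in the latent space (right).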

The first stage is similar to the training procedure of the original VQ-VAE.
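For reference, the objective of the original VQ-VAE can be written as below. This is the original image formulation, shown with a squared-error reconstruction term as an assumption (the original paper writes that term as a negative log-likelihood); these notes do not state which terms VideoGPT keeps or modifies. Here $x$ is the input, $E$ and $D$ are the encoder and decoder, $e$ is the codebook entry nearest to $E(x)$, $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\beta$ is the commitment weight.

```latex
\mathcal{L}
  = \underbrace{\lVert x - D(e) \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \mathrm{sg}[E(x)] - e \rVert_2^2}_{\text{codebook}}
  + \beta \, \underbrace{\lVert \mathrm{sg}[e] - E(x) \rVert_2^2}_{\text{commitment}}
```

The codebook term pulls the selected code toward the encoder output, while the commitment term keeps the encoder output close to the code it selected.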

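In the second stage, the trained VQ-VAE encodes the video data into latent sequences, which serve as training data for the prior model. To generate a video, a latent sequence is first sampled from the prior and then decoded into a video sample by the VQ-VAE. (Conditioning is introduced in the Transformer prior, using either cross-attention or conditional norms.)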
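The sample-then-decode step described above can be sketched as follows. This is a toy illustration under stated assumptions, not VideoGPT's code: `toy_prior_logits` is a hypothetical stand-in for the trained Transformer prior, the codebook is random, and `decode` only performs the codebook lookup and reshape that would precede a real VQ-VAE decoder.

```python
import numpy as np

def toy_prior_logits(prefix, num_codes):
    """Hypothetical stand-in for the Transformer prior: logits over the
    next code index, given the list of codes sampled so far."""
    logits = np.zeros(num_codes)
    if prefix:
        # Favor repeating the previous code, only to make the toy non-uniform
        logits[prefix[-1]] = 2.0
    return logits

def sample_latents(seq_len, num_codes, rng):
    """Ancestral sampling: draw one code at a time, conditioned on the prefix."""
    codes = []
    for _ in range(seq_len):
        logits = toy_prior_logits(codes, num_codes)
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        codes.append(int(rng.choice(num_codes, p=probs)))
    return np.array(codes)

def decode(codes, codebook, grid_shape):
    """Look up code vectors and reshape them to a (T, H, W, D) latent grid.
    A real VQ-VAE decoder would then upsample this grid back to pixels."""
    return codebook[codes].reshape(*grid_shape, codebook.shape[1])

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # assumed: K = 8 codes of dimension D = 4
grid_shape = (2, 3, 3)               # assumed: T x H x W latent grid
codes = sample_latents(int(np.prod(grid_shape)), num_codes=8, rng=rng)
latent_grid = decode(codes, codebook, grid_shape)
print(codes.shape, latent_grid.shape)
```

Because the prior operates on the downsampled latent grid rather than on raw pixels, the sequence it has to model is far shorter than the video itself.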
