Scalable Diffusion Models with Transformers (DiT)

Official PyTorch Implementation

Paper | Project Page | Run DiT-XL/2

This repo contains PyTorch model definitions, pre-trained weights and training/sampling code for our paper exploring diffusion models with transformers (DiTs). You can find more visualizations on our project page.

Scalable Diffusion Models with Transformers
William Peebles, Saining Xie

UC Berkeley, New York University

We train latent diffusion models, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops---through increased transformer depth/width or increased number of input tokens---consistently have lower FID. In addition to good scalability properties, our DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

This repository contains:

  • PyTorch model definitions for DiT
  • Pre-trained class-conditional DiT-XL/2 weights for 256x256 and 512x512 ImageNet
  • Training (train.py) and sampling (sample.py) scripts

An implementation of DiT is also available directly in Hugging Face diffusers.
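
For quick experimentation outside this repo, here is a minimal sampling sketch via the diffusers DiTPipeline, assuming the facebook/DiT-XL-2-256 checkpoint on the Hugging Face Hub and a CUDA GPU:

import torch
from diffusers import DiTPipeline

# Load the 256x256 class-conditional DiT-XL/2 pipeline from the Hub.
pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Map human-readable ImageNet class names to label ids.
class_ids = pipe.get_label_ids(["golden retriever"])

# Sample with classifier-free guidance.
image = pipe(class_labels=class_ids, guidance_scale=4.0, num_inference_steps=250).images[0]
image.save("sample.png")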

Setup

First, download and set up the repo:

git clone https://github.com/facebookresearch/DiT.git
cd DiT

We provide an environment.yml file that can be used to create a Conda environment. If you only want to run pre-trained models locally on CPU, you can remove the cudatoolkit and pytorch-cuda requirements from the file.

conda env create -f environment.yml
conda activate DiT

Sampling

Pre-trained DiT checkpoints. You can sample from our pre-trained DiT models with sample.py. Weights for our pre-trained DiT model will be automatically downloaded depending on the model you use. The script has various arguments to switch between the 256x256 and 512x512 models, adjust sampling steps, change the classifier-free guidance scale, etc. For example, to sample from our 512x512 DiT-XL/2 model, you can use:

python sample.py --image-size 512 --seed 1
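
The sampling-steps and guidance flags mentioned above map to sample.py's --num-sampling-steps and --cfg-scale arguments; for example, to draw 256x256 samples with 250 DDPM steps and a guidance scale of 4.0:

python sample.py --image-size 256 --num-sampling-steps 250 --cfg-scale 4.0 --seed 1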

For convenience, our pre-trained DiT models can be downloaded directly here as well:

| DiT Model | Image Resolution | FID-50K | Inception Score | Gflops |
|-----------|------------------|---------|-----------------|--------|
| XL/2      | 256x256          | 2.27    | 278.24          | 119    |
| XL/2      | 512x512          | 3.04    | 240.82          | 525    |
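
To load downloaded weights programmatically rather than through sample.py, a minimal sketch (find_model is this repo's checkpoint helper in download.py; DiT_XL_2 is the constructor defined in models.py):

import torch
from download import find_model  # downloads/caches the checkpoint if not already present
from models import DiT_XL_2

# 256x256 images correspond to 32x32 latents (the VAE downsamples by 8x).
model = DiT_XL_2(input_size=32)
state_dict = find_model("DiT-XL-2-256x256.pt")
model.load_state_dict(state_dict)
model.eval()  # the released checkpoints are EMA weights; use eval mode for sampling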

Custom DiT checkpoints. If you've trained a new DiT model with train.py (see below), you can add the --ckpt argument to use your own checkpoint instead. For example, to sample from the EMA weights of a custom 256x256 DiT-L/4 model, run:

python sample.py --model DiT-L/4 --image-size 256 --ckpt /path/to/model.pt

Training DiT

We provide a training script for DiT in train.py. This script can be used to train class-conditional DiT models, but it can be easily modified to support other types of conditioning. To launch DiT-XL/2 (256x256) training with N GPUs on one node:

torchrun --nnodes=1 --nproc_per_node=N train.py --model DiT-XL/2 --data-path /path/to/imagenet/train
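
Under the hood, each training step follows the standard latent-diffusion recipe; below is a condensed sketch of the loss computation, using this repo's create_diffusion and training_losses, with a random placeholder batch standing in for real VAE latents:

import torch
from diffusion import create_diffusion  # this repo's diffusion utilities
from models import DiT_XL_2

model = DiT_XL_2(input_size=32, num_classes=1000)
diffusion = create_diffusion(timestep_respacing="")  # default 1000-step training schedule

x = torch.randn(8, 4, 32, 32)                        # placeholder VAE latents
y = torch.randint(0, 1000, (8,))                     # ImageNet class labels
t = torch.randint(0, diffusion.num_timesteps, (8,))  # random diffusion timesteps

# Class-conditional loss; to support other kinds of conditioning,
# change what goes into model_kwargs (and the model's embedding layers).
loss = diffusion.training_losses(model, x, t, model_kwargs=dict(y=y))["loss"].mean()
loss.backward()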

PyTorch Training Results

We've trained DiT-XL/2 and DiT-B/4 models from scratch with the PyTorch training script to verify that it reproduces the original JAX results up to several hundred thousand training iterations. Across our experiments, the PyTorch-trained models give similar (and sometimes slightly better) results compared to the JAX-trained models up to reasonable random variation. Some data points:

| DiT Model | Train Steps | FID-50K (JAX Training) | FID-50K (PyTorch Training) | PyTorch Global Training Seed |
|-----------|-------------|------------------------|----------------------------|------------------------------|
| XL/2      | 400K        | 19.5                   | 18.1                       | 42                           |
| B/4       | 400K        | 68.4                   | 68.9                       | 42                           |
| B/4       | 400K        | 68.4                   | 68.3                       | 100                          |

These models were trained at 256x256 resolution; we used 8x A100s to train XL/2 and 4x A100s to train B/4. Note that FID here is computed with 250 DDPM sampling steps, with the mse VAE decoder and without guidance (cfg-scale=1).

TF32 Note (important for A100 users). When we ran the above tests, TF32 matmuls were disabled per PyTorch's defaults. We've enabled them at the top of train.py and sample.py because they make training and sampling substantially faster on A100s (and should on other Ampere GPUs too), but note that using TF32 may lead to small numerical differences relative to the above results.
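
Concretely, TF32 is toggled via the standard PyTorch switches; a sketch of the flags in question (the exact lines at the top of train.py and sample.py may differ slightly):

import torch

# Allow TF32 matmuls/convolutions on Ampere+ GPUs:
# large speedups at slightly reduced numerical precision.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True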

Enhancements

Training (and sampling) could likely be sped up significantly by:

  • using Flash Attention in the DiT model
  • using torch.compile in PyTorch 2.0 (both options are sketched below)
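
A minimal sketch of both options, assuming PyTorch 2.x, where torch.nn.functional.scaled_dot_product_attention dispatches to FlashAttention kernels on supported GPUs (DiT_XL_2 is from this repo's models.py; tensor shapes are illustrative):

import torch
import torch.nn.functional as F
from models import DiT_XL_2

model = DiT_XL_2(input_size=32).cuda()
model = torch.compile(model)  # PyTorch 2.0: compile the model into optimized kernels

# Inside an attention block, the fused kernel would replace manual softmax(QK^T)V:
# shape (batch, heads, tokens, head_dim); 16 heads x 72 dims matches XL's width of 1152.
q = k = v = torch.randn(8, 16, 256, 72, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v)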

Basic features that would be nice to add:

  • Monitor FID and other metrics
  • Generate and save samples from the EMA model periodically
  • Resume training from a checkpoint
  • AMP/bfloat16 support (a minimal autocast sketch follows this list)
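
For the last item, a minimal bfloat16 sketch using standard PyTorch autocast, reusing the names (model, diffusion, x, t, y) from the training sketch above; opt is a placeholder optimizer, and none of this is in the current train.py:

import torch

# bfloat16 autocast on Ampere GPUs typically needs no gradient scaler.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = diffusion.training_losses(model, x, t, model_kwargs=dict(y=y))["loss"].mean()
loss.backward()
opt.step()
opt.zero_grad()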

🔥 Feature Update. Check out chuanyangjin/fast-DiT on GitHub (Fast Diffusion Models with Transformers) to preview a selection of training-speed and memory-saving features, including gradient checkpointing, mixed-precision training, and pre-extracted VAE features. With these improvements, we have achieved a training speed of 0.84 steps/sec for DiT-XL/2 using just a single A100 GPU.
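
Of these, pre-extracted VAE features simply means encoding the dataset to latents once, up front, instead of running the VAE encoder on every training step; a minimal sketch using the same sd-vae-ft-mse autoencoder this repo samples with (the output path and batch are illustrative):

import torch
from diffusers.models import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").cuda().eval()

@torch.no_grad()
def encode_batch(x):
    # x: (N, 3, 256, 256) images in [-1, 1] -> (N, 4, 32, 32) latents
    return vae.encode(x).latent_dist.sample().mul_(0.18215)

images = torch.randn(8, 3, 256, 256, device="cuda")  # placeholder batch
torch.save(encode_batch(images).cpu(), "latents_batch0.pt")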
