Microsoft Open-Sources Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Contributors (in alphabetical order):

Baining Guo, Chong Luo, Dong Chen†, Dongdong Chen, Fangyun Wei†, Ji Li, Jianmin Bao, Jiawei Zhang*, Jinjing Zhao*, Lei Shi, Qinhong Yang, Sirui Zhang*, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yang Yue*, Yitong Wang, Yunuo Chen, Zhiyang Liang*, Ziyu Wan†

Microsoft | *Core contributor | †Project lead

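Lens is a 3.8B-parameter foundational text-to-image generation model designed for efficient training and fast high-resolution generation. It combines dense-caption pretraining, mixed-resolution learning, multi-layer GPT-OSS text features, and the FLUX.2 semantic VAE to reach competitive generation quality with far less training compute than larger text-to-image models.

This repository provides minimal inference code for generating images from Lens DiT checkpoints.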

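Highlights

Efficient foundation: trained on Lens-800M, a dataset of 800 million image-text pairs with long GPT-4.1 captions, to maximize the information density of each training batch.
Compact yet powerful: a 48-block MMDiT denoiser combines FLUX.2 latents with concatenated multi-layer GPT-OSS features for stronger prompt following and multilingual generalization.
Flexible resolution: mixed-resolution training supports aspect ratios from 1:2 to 2:1 and generation at resolutions up to 1440×1440.
Fine-tuned variants: reinforcement-learning tuning improves visual quality and suppresses artifacts; the distilled Lens-Turbo supports fast 4-step generation.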

Showcase

Sample 000 · 1440x1440

A generous portion of classic British fish and chips served on a sheet of white paper, golden crispy beer-battered cod fillet alongside thick-cut chips, a wedge of lemon, mushy peas in a small dish, malt vinegar bottle nearby, wooden pub table, overhead shot

Sample 001 · 1440x1440

The iconic Big Ben clock tower and the Houses of Parliament in London at golden hour, the River Thames reflecting warm amber light, Westminster Bridge in the foreground, a classic red double-decker bus crossing, dramatic clouds lit by sunset

Sample 002 · 1440x1440

La Tour Eiffel au crépuscule vue depuis le Trocadéro, la structure en fer illuminée de milliers de lumières dorées scintillantes, le ciel passant du bleu profond au violet, les fontaines du Trocadéro au premier plan avec des reflets dorés, silhouettes de promeneurs

Sample 003 · 1248x1664

A crystal dragon soaring through an aurora borealis sky, its entire body made of transparent faceted crystal refracting the green and purple aurora light into rainbow spectra, ice particles trailing from its wings, high fantasy digital art

Sample 004 · 1664x1248

Aerial view of Yuanyang rice terraces in Yunnan province at sunrise, thousands of cascading water-filled paddies reflecting golden and pink sky colors, morning mist weaving between terrace layers, lush green hillside with scattered palm trees, drone photography

Sample 005 · 1664x1248

A green iguana basking on a moss-covered fallen log in a tropical rainforest, every scale and spine rendered in sharp detail, dewdrops clinging to its skin, a blurred waterfall and lush tropical foliage in the background, National Geographic wildlife photography style

Installation

Tested environment: Python 3.12 · CUDA 12.6 · PyTorch 2.11.0+cu126 · TorchVision 0.26.0+cu126

conda create -n lens python=3.12 -y
conda activate lens

uv pip install torch==2.11.0+cu126 torchvision==0.26.0+cu126 \
    --index-url https://download.pytorch.org/whl/cu126
uv pip install -r requirements.txt

The default GPT-OSS encoder and FLUX.2 VAE are loaded from Hugging Face. Make sure your environment has access to any gated model repositories you use.

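Checkpoints

Repository	Description	Steps	CFG
`microsoft/Lens`	Default release; RL-tuned for visual quality	20	5.0
`microsoft/Lens-Turbo`	Distilled from the RL model; supports fast 4-step sampling	4	1.0
`microsoft/Lens-Base`	Supervised base model (no RL, no distillation)	50	5.0

Select a variant by passing its repository ID to --repo_id (command line) or LensPipeline.from_pretrained(...) (Python).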
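
The recommended step counts and CFG scales in the table can be kept next to the repository IDs, so that switching variants also switches the sampling settings. The following is a minimal sketch, not part of the Lens package: `RECOMMENDED_SETTINGS` and `sampling_kwargs` are hypothetical names, and the keyword names `num_inference_steps` and `guidance_scale` are taken from the pipeline call in this README's Python example.

```python
# Recommended sampling settings copied from the checkpoint table.
# RECOMMENDED_SETTINGS and sampling_kwargs are hypothetical helper names.
RECOMMENDED_SETTINGS = {
    "microsoft/Lens": {"num_inference_steps": 20, "guidance_scale": 5.0},
    "microsoft/Lens-Turbo": {"num_inference_steps": 4, "guidance_scale": 1.0},
    "microsoft/Lens-Base": {"num_inference_steps": 50, "guidance_scale": 5.0},
}


def sampling_kwargs(repo_id: str) -> dict:
    """Return a copy of the recommended step count and CFG scale for a checkpoint."""
    try:
        return dict(RECOMMENDED_SETTINGS[repo_id])
    except KeyError:
        raise ValueError(f"No recommended settings recorded for {repo_id!r}") from None
```

The returned dictionary could then be unpacked into the pipeline call, for example `pipe(prompt=..., **sampling_kwargs("microsoft/Lens-Turbo"))`.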

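Inference

Important: run from the root of the cloned repository so that from lens import LensPipeline resolves to this package. Importing the lens module registers LensGptOssEncoder and LensTransformer2DModel into the transformers and diffusers namespaces referenced by model_index.json.

Python API: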

import torch
from lens import LensPipeline

pipe = LensPipeline.from_pretrained(
    "microsoft/Lens", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="A cat holding a sign that says \"hello world\"",
    base_resolution=1440, aspect_ratio="1:1",
    num_inference_steps=20, guidance_scale=5.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("lens.png")

To trade speed for lower GPU memory use, replace .to("cuda") with pipe.enable_model_cpu_offload().
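
Because `.to("cuda")` is chained directly onto `from_pretrained(...)` in the example, the swap means splitting that expression in two. The sketch below shows one way to do it, assuming `LensPipeline` behaves like a standard diffusers pipeline with respect to `enable_model_cpu_offload()`; the helper name `load_lens_pipeline` and its `offload` flag are hypothetical.

```python
def load_lens_pipeline(repo_id: str = "microsoft/Lens", offload: bool = False):
    """Load a Lens pipeline, optionally with model CPU offload (hypothetical helper)."""
    # Imports are deferred so the helper can be defined without a GPU stack.
    # `lens` must be importable, i.e. run this from the repository root.
    import torch
    from lens import LensPipeline

    pipe = LensPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    if offload:
        # Do not also call .to("cuda"): offload handles device placement.
        pipe.enable_model_cpu_offload()
    else:
        pipe = pipe.to("cuda")
    return pipe
```

Calling `load_lens_pipeline(offload=True)` would then return a pipeline that can be used exactly like `pipe` in the example above.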

Command-line interface, basic usage:

python inference.py \
    --repo_id "microsoft/Lens" \
    --prompt "A cinematic mountain lake at sunrise, soft mist, detailed reflections" \
    --base_resolution 1440 --aspect_ratio 1:1 \
    --steps 20 --cfg 5.0 --n 1 --seed 42 \
    --out ./outputs

Batch generation: separate multiple prompts with |

python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a red fox in snow|a glass greenhouse at night"
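
As an illustration of the `|` convention, the sketch below splits one prompt string into separate prompts. It is not the code in inference.py: `split_prompts` is a hypothetical helper, and stripping whitespace and dropping empty entries are assumptions about how such input might reasonably be handled.

```python
def split_prompts(raw: str) -> list[str]:
    """Split a '|'-separated prompt string into individual prompts."""
    # Stripping whitespace and skipping empty entries is an assumption;
    # inference.py may treat these cases differently.
    return [part.strip() for part in raw.split("|") if part.strip()]


prompts = split_prompts("a red fox in snow|a glass greenhouse at night")
print(prompts)
```

One consequence of this convention is that a prompt cannot itself contain a literal `|` character.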

A100 / V100 (no MXFP4 kernel support): dequantize the GPT encoder to bf16:

python inference.py \
    --repo_id "microsoft/Lens" \
    --steps 20 --cfg 5.0 \
    --prompt "a cat" \
    --disable_mxfp4 --offload
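
Whether `--disable_mxfp4` is needed could be decided from the GPU's CUDA compute capability. The sketch below assumes that "Hopper and newer" corresponds to compute capability 9.0 or higher, which matches V100 (7.0), A100 (8.0), and H100 (9.0). The helper `should_disable_mxfp4` is hypothetical, and actual kernel availability also depends on the installed libraries.

```python
def should_disable_mxfp4(compute_capability: tuple[int, int]) -> bool:
    """Return True when the GPU predates Hopper (compute capability 9.0)."""
    major, _minor = compute_capability
    # Assumed threshold: MXFP4 kernels are only kept on major version >= 9.
    return major < 9


# With PyTorch and a visible GPU, the capability could be queried like this:
#   import torch
#   should_disable_mxfp4(torch.cuda.get_device_capability())
print(should_disable_mxfp4((8, 0)))  # A100
```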

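Options

Argument	Description	Default
`--repo_id`	HF repository ID (or local path) from which the Lens pipeline is assembled	`microsoft/Lens`
`--base_resolution`	`1024` or `1440`	`1440`
`--aspect_ratio`	`1:2`, `9:16`, `2:3`, `3:4`, `1:1`, `4:3`, `3:2`, `16:9`, `2:1`	`1:1`
`--steps`	Number of denoising steps	`20`
`--cfg`	Classifier-free guidance scale	`5.0`
`--n`	Number of images generated per prompt	`1`
`--seed`	Random seed (non-deterministic if omitted)	---
`--out`	Output directory	`./outputs`
`--dtype`	Compute dtype: `bfloat16`, `float16`, `float32`	`bfloat16`
`--disable_mxfp4`	Dequantize the GPT-OSS text encoder to `--dtype` (required on A100/V100; on Hopper and newer, MXFP4 is kept by default to save GPU memory)	---
`--offload`	Enable diffusers CPU offload (`text_encoder->transformer->vae`) to lower peak GPU memory	---
`--reasoner`	Refine the prompt with the loaded GPT-OSS encoder before generation	---
`--api_url` / `--api_key` / `--api_model`	Refine the prompt through an OpenAI-compatible API (takes precedence over `--reasoner`)	---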
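
The README does not spell out how `--base_resolution` and `--aspect_ratio` map to pixel dimensions. The showcase sizes (1440x1440, 1248x1664, 1664x1248) are consistent with keeping the pixel count near `base_resolution` squared and snapping each side to a multiple of 16. The sketch below implements that inferred rule; `target_size` and the `multiple` parameter are hypothetical, and the repository's actual sizes for other ratios may differ.

```python
import math


def target_size(base_resolution: int, aspect_ratio: str, multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with roughly base_resolution**2 pixels."""
    w_ratio, h_ratio = (int(x) for x in aspect_ratio.split(":"))
    # Keep the area near base_resolution**2 while matching the requested ratio.
    width = base_resolution * math.sqrt(w_ratio / h_ratio)
    height = base_resolution * math.sqrt(h_ratio / w_ratio)

    def snap(value: float) -> int:
        # Round to the nearest multiple (16 is an assumed value).
        return int(round(value / multiple)) * multiple

    return snap(width), snap(height)
```

Under these assumptions, `target_size(1440, "3:4")` returns `(1248, 1664)` and `target_size(1440, "4:3")` returns `(1664, 1248)`, matching Sample 003 and Sample 004 in the showcase.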

Citation

@article{zhao2026lens,
  title   = {Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models},
  author  = {Guo, Baining and Luo, Chong and Chen, Dong and Chen, Dongdong and Wei, Fangyun and Li, Ji and Bao, Jianmin and Zhang, Jiawei and Zhao, Jinjing and Shi, Lei and Yang, Qinhong and Zhang, Sirui and Wu, Xiuyu and Feng, Xuelu and Lu, Yan and Dong, Yanchen and Yue, Yang and Wang, Yitong and Chen, Yunuo and Liang, Zhiyang and Wan, Ziyu},
  journal = {arXiv preprint arXiv:2605.21573},
  year    = {2026}
}

License

This project is licensed under the MIT License.