一、前言
参考资料:ControlNet原论文
Adding Conditional Control to Text-to-Image Diffusion Models
论文地址:https://arxiv.org/pdf/2302.05543
ControlNet的作者团队由斯坦福大学的博士生张吕敏(Lvmin Zhang)领衔,另两位合作者分别是现香港科技大学的助理教授饶安逸(Anyi Rao)和斯坦福大学的Maneesh Agrawala教授。
👤 核心作者:张吕敏(Lvmin Zhang)
个人代号:昵称是 "lllyasviel" ,因其卓越贡献被网友尊称为AI界的 "赛博佛祖" 。
教育背景:出生于中国,2021年在苏州大学获得工学学士学位,2022年起进入斯坦福大学攻读计算机科学博士学位,师从Maneesh Agrawala教授。
技术起点:大一就发表了AI绘画相关论文,本科期间已在ICCV/CVPR/ECCV等顶级会议发表了10篇论文。
代表作:除了ControlNet,还开发了Style2Paints、Fooocus、IC-Light、LayerDiffusion等知名项目。
🎓 其他作者介绍
饶安逸 (Anyi Rao):团队的重要成员,同时也是ControlNet论文的合著者。他现任香港科技大学 (HKUST) 助理教授,领导"多媒体创意实验室"。
Maneesh Agrawala:斯坦福大学Forest Baskett讲席教授,也是张吕敏在斯坦福的博士生导师。他是麦克阿瑟天才奖和ACM Fellow得主,在人机交互与计算机图形学领域享有极高声誉。
二、摘要
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls.
我们提出了ControlNet,一种为大型预训练文本到图像扩散模型添加空间条件控制的神经网络架构。ControlNet锁定生产就绪的大型扩散模型,并重用其以数十亿张图像预训练得到的深层且鲁棒的编码层,作为学习多样化条件控制的强大骨干。
The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts.
该神经网络架构通过"零卷积"(零初始化的卷积层)进行连接,这些卷积层的参数从零开始逐步增长,并确保不会有有害噪声影响微调。我们使用 Stable Diffusion 测试了各种条件控制,例如边缘、深度、分割、人体姿态等,支持单个或多个条件,带或不带提示词。
We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
我们证明了ControlNet的训练对于小型(<50k)和大型(>1m)数据集都是鲁棒的。大量实验结果表明,ControlNet 有望促进控制图像扩散模型的更广泛应用。
1. 引言
Many of us have experienced flashes of visual inspiration that we wish to capture in a unique image. With the advent of text-to-image diffusion models [54, 62, 72], we can now create visually stunning images by typing in a text prompt.
我们中的许多人都曾经历过瞬间的视觉灵感,希望能将其捕捉成一幅独特的图像。随着文本到图像扩散模型[54, 62, 72]的出现,现在我们只需输入文本提示,就能创造出视觉上令人惊叹的图像。
Yet, text-to-image models are limited in the control they provide over the spatial composition of the image; precisely expressing complex layouts, poses, shapes and forms can be difficult via text prompts alone. Generating an image that accurately matches our mental imagery often requires numerous trial-and-error cycles of editing a prompt, inspecting the resulting images and then re-editing the prompt.
然而,文本到图像模型对图像空间构图的控制能力有限;仅靠文本提示很难精确表达复杂的布局、姿势、形状和形态。要生成与我们脑海中的意象精确匹配的图像,通常需要多轮反复试错:编辑提示词、检查生成的图像,然后再次编辑提示词。
Can we enable finer grained spatial control by letting users provide additional images that directly specify their desired image composition? In computer vision and machine learning, these additional images (e.g., edge maps, human pose skeletons, segmentation maps, depth, normals, etc.) are often treated as conditioning on the image generation process.
我们能否通过允许用户提供直接指定其所需图像构成的额外图像,从而实现更精细的空间控制?在计算机视觉和机器学习中,这些额外的图像(例如,边缘图、人体姿态骨骼、分割图、深度图、法线图等)通常被视为图像生成过程的条件。
Image-to-image translation models [34, 98] learn the mapping from conditioning images to target images. The research community has also taken steps to control text-to-image models with spatial masks [6, 20], image editing instructions [10], personalization via finetuning [21, 75], etc. While a few problems (e.g., generating image variations, inpainting) can be resolved with training-free techniques like constraining the denoising diffusion process or editing attention layer activations, a wider variety of problems like depth-to-image, pose-to-image, etc., require end-to-end learning and data-driven solutions.
图像到图像的翻译模型 [34, 98] 学习从条件图像到目标图像的映射。研究界也已采取措施通过空间掩码 [6, 20]、图像编辑指令 [10]、通过微调进行个性化 [21, 75] 等来控制文本到图像模型。虽然少数问题(例如,生成图像变体、图像修复)可以通过约束去噪扩散过程或编辑注意力层激活等无训练技术来解决,但更广泛的问题(例如,深度到图像、姿态到图像等)需要端到端学习和数据驱动的解决方案。
Learning conditional controls for large text-to-image diffusion models in an end-to-end way is challenging. The amount of training data for a specific condition may be significantly smaller than the data available for general text-to-image training. For instance, the largest datasets for various specific problems (e.g., object shape/normal, human pose extraction, etc.) are usually about 100K in size, which is 50,000 times smaller than the LAION-5B [79] dataset that was used to train Stable Diffusion [82].
以端到端的方式为大型文本到图像扩散模型学习条件控制是具有挑战性的。特定条件的训练数据量可能远小于可用于通用文本到图像训练的数据量。例如,针对各种特定问题(例如,物体形状/法线、人体姿态提取等)的最大数据集通常约为 10 万个样本,这比用于训练 Stable Diffusion [82] 的 LAION-5B [79] 数据集小 50,000 倍。
The direct finetuning or continued training of a large pretrained model with limited data may cause overfitting and catastrophic forgetting [31, 75]. Researchers have shown that such forgetting can be alleviated by restricting the number or rank of trainable parameters [14, 25, 31, 92].
直接使用有限数据对大型预训练模型进行微调或持续训练可能会导致过拟合和灾难性遗忘 [31, 75]。研究人员表明,通过限制可训练参数的数量或秩可以缓解这种遗忘 [14, 25, 31, 92]。
For our problem, designing deeper or more customized neural architectures might be necessary for handling in-the-wild conditioning images with complex shapes and diverse high-level semantics.
对于我们所面临的问题,要处理真实场景中(in-the-wild)具有复杂形状和多样化高层语义的条件图像,可能需要设计更深或更定制化的神经网络架构。
This paper presents ControlNet, an end-to-end neural network architecture that learns conditional controls for large pretrained text-to-image diffusion models (Stable Diffusion in our implementation). ControlNet preserves the quality and capabilities of the large model by locking its parameters, and also making a trainable copy of its encoding layers.
本文提出了ControlNet,一种端到端的神经网络架构,它为大型预训练文本到图像扩散模型(在我们的实现中为Stable Diffusion)学习条件控制。ControlNet通过锁定大型模型的参数,并制作其编码层的可训练副本,来保留大型模型的质量和能力。
This architecture treats the large pretrained model as a strong backbone for learning diverse conditional controls. The trainable copy and the original, locked model are connected with zero convolution layers, with weights initialized to zeros so that they progressively grow during the training.
该架构将大型预训练模型视为学习各种条件控制的强大骨干。可训练副本和原始的、锁定的模型通过零卷积层连接,其权重初始化为零,以便在训练过程中逐步增长。
This architecture ensures that harmful noise is not added to the deep features of the large diffusion model at the beginning of training, and protects the large-scale pretrained backbone in the trainable copy from being damaged by such noise.
注:protect ... from ... 这个结构里,from 表示 "让坏的东西远离被保护对象",因此翻译成 "免于 / 免受"。
该架构确保在大规模扩散模型训练初期不会向其深层特征添加有害噪声,并保护可训练副本中的大规模预训练骨干网络免受此类噪声的损害。
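下面用 PyTorch 写一个极简示意(仅为帮助理解的草图,并非 ControlNet 官方实现;`zero_conv`、`ControlledBlock` 等名称和连接细节均为假设),展示"冻结主干 + 可训练副本 + 零卷积"的连接方式:

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 卷积,权重与偏置均初始化为 0:训练开始时输出恒为 0。"""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """示意:冻结的原编码块 + 可训练副本,两端用零卷积连接。"""

    def __init__(self, locked_block: nn.Module, channels: int):
        super().__init__()
        self.copy = copy.deepcopy(locked_block)   # 可训练副本,继承预训练权重
        self.locked = locked_block                # 原模型编码块,参数锁定
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)        # 条件输入进入副本前的零卷积
        self.zero_out = zero_conv(channels)       # 副本输出汇回主干前的零卷积

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        y = self.locked(x)                         # 冻结主干的原始输出
        c = self.copy(x + self.zero_in(cond))      # 副本同时接收特征与条件
        return y + self.zero_out(c)                # 初始时第二项恒为 0,不扰动主干
```

由于两个零卷积在训练开始时输出恒为 0,`ControlledBlock` 的初始输出与冻结主干完全相同,这正对应上文"训练初期不向深层特征添加有害噪声"的设计。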
Our experiments show that ControlNet can control Stable Diffusion with various conditioning inputs, including Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, etc. (Figure 1). We test our approach using a single conditioning image, with or without text prompts, and we demonstrate how our approach supports the composition of multiple conditions.
注:"Hough lines" 指的是一种计算机视觉中的经典算法------霍夫变换(Hough Transform),主要用于检测图像中的直线。
我们的实验表明,ControlNet 可以通过各种条件输入来控制 Stable Diffusion,包括 Canny 边缘、Hough 线、用户涂鸦、人体关键点、分割图、形状法线、深度图等(图 1)。我们使用单个条件图像(带或不带文本提示)来测试我们的方法,并展示了我们的方法如何支持多条件组合。
Additionally, we report that the training of ControlNet is robust and scalable on datasets of different sizes, and that for some tasks like depth-to-image conditioning, training ControlNets on a single NVIDIA RTX 3090Ti GPU can achieve results competitive with industrial models trained on large computation clusters. Finally, we conduct ablative studies to investigate the contribution of each component of our model, and compare our models to several strong conditional image generation baselines with user studies.
此外,我们报告称,ControlNet 的训练在不同规模的数据集上是稳健且可扩展的,并且对于某些任务,如深度到图像的条件生成,在单个 NVIDIA RTX 3090Ti GPU 上训练 ControlNets 即可获得与在大型计算集群上训练的工业级模型相媲美的结果。最后,我们进行了消融研究,以探究我们模型各组件的贡献,并通过用户研究将我们的模型与几个强大的条件图像生成基线进行了比较。
In summary, (1) we propose ControlNet, a neural network architecture that can add spatially localized input conditions to a pretrained text-to-image diffusion model via efficient finetuning, (2) we present pretrained ControlNets to control Stable Diffusion, conditioned on Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, and cartoon line drawings, and (3) we validate the method with ablative experiments comparing to several alternative architectures, and conduct user studies focused on several previous baselines across different tasks.
总而言之,(1)我们提出了 ControlNet,这是一种神经网络架构,可以通过高效的微调将空间局部化的输入条件添加到预训练的文本到图像扩散模型中;(2)我们展示了预训练的 ControlNet,用于控制 Stable Diffusion,并以 Canny 边缘、Hough 线、用户涂鸦、人体关键点、分割图、形状法线、深度和卡通线条画为条件;(3)我们通过消融实验验证了该方法,将其与几种替代架构进行了比较,并进行了用户研究,重点关注不同任务中的几个先前基线。
注:
spatially localized:空间上局部化的 → 指这种输入条件只影响图像的特定区域(比如只对图像左边或某个物体位置施加控制),而不是全局。
conduct:进行
2. 相关工作
2.1. 微调神经网络
One way to finetune a neural network is to directly continue training it with the additional training data. But this approach can lead to overfitting, mode collapse, and catastrophic forgetting. Extensive research has focused on developing finetuning strategies that avoid such issues.
微调神经网络的一种方法是直接使用额外的训练数据继续训练它。但是,这种方法可能导致过拟合、模式崩溃和灾难性遗忘。大量的研究集中在开发避免这些问题的微调策略上。
HyperNetwork is an approach that originated in the Natural Language Processing (NLP) community [25], with the aim of training a small recurrent neural network to influence the weights of a larger one. It has been applied to image generation with generative adversarial networks (GANs) [4, 18].
超网络(HyperNetwork)是一种起源于自然语言处理(NLP)社区的方法 [25],旨在训练一个小型循环神经网络来影响一个大型神经网络的权重。它已被应用于生成对抗网络(GANs)的图像生成中 [4, 18]。
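下面是一个极简的超网络草图(仅作示意并做了简化:原文 [25] 使用的是小型循环网络,这里用一个线性层代替;`TinyHyperNetwork` 等名称与维度均为假设),展示"用小网络生成大层权重"的基本思想:

```python
import torch
import torch.nn as nn


class TinyHyperNetwork(nn.Module):
    """示意:由一个小网络为目标线性层生成权重矩阵。"""

    def __init__(self, embed_dim: int, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.gen = nn.Linear(embed_dim, in_features * out_features)  # 小网络

    def forward(self, task_embedding: torch.Tensor) -> torch.Tensor:
        w = self.gen(task_embedding)                      # 由任务嵌入生成权重
        return w.view(self.out_features, self.in_features)


# 用法示意:目标层的前向计算直接使用超网络生成的权重
hyper = TinyHyperNetwork(embed_dim=16, in_features=64, out_features=32)
weight = hyper(torch.randn(16))
x = torch.randn(8, 64)
y = nn.functional.linear(x, weight)   # 等价于一个 64→32 的线性层,但权重由小网络决定
```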
[25] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
Heathen et al. [26] and Kurumuz [43] implement HyperNetworks for Stable Diffusion [72] to change the artistic style of its output images.
Heathen 等人 [26] 和 Kurumuz [43] 为 Stable Diffusion [72] 实现超网络(HyperNetworks),以改变其输出图像的艺术风格。
Adapter methods are widely used in NLP for customizing a pretrained transformer model to other tasks by embedding new module layers into it [30, 84]. In computer vision, adapters are used for incremental learning [74] and domain adaptation [70]. This technique is often used with CLIP [66] for transferring pretrained backbone models to different tasks [23, 66, 85, 94].
在自然语言处理中,适配器(Adapter)方法通过将新的模块层嵌入预训练的 Transformer 模型中,广泛用于将该模型定制到其他任务 [30, 84]。在计算机视觉领域,适配器被用于增量学习 [74] 和领域自适应 [70]。该技术常与 CLIP [66] 结合使用,将预训练的骨干模型迁移到不同任务 [23, 66, 85, 94]。
More recently, adapters have yielded successful results in vision transformers [49, 50] and ViT-Adapter [14]. In **concurrent work** with ours, T2I-Adapter [56] **adapts** Stable Diffusion **to** external conditions.
近期,适配器在视觉 Transformer [49, 50] 和 ViT-Adapter [14] 中取得了成功。在与我们同期进行的工作中,T2I-Adapter [56] 将 Stable Diffusion 适配到外部条件。
[14] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In International Conference on Learning Representations, 2023.
**Additive Learning** **circumvents forgetting** by freezing the original model weights and adding a small number of new parameters using **learned weight masks** [51, 74], **pruning** [52], or hard attention [80]. **Side-Tuning** [92] uses a side branch model to **learn extra functionality** by **linearly blending** the outputs of a frozen model and an added network, with a predefined blending weight schedule.
加法式学习(Additive Learning)通过冻结原始模型权重,并借助学习到的权重掩码 [51, 74]、剪枝 [52] 或硬注意力 [80] 添加少量新参数,从而避免遗忘。Side-Tuning [92] 用一个侧分支模型来学习额外功能:它把冻结的原模型输出和新增侧分支的输出按一定权重线性混合,其中混合权重按照预设的计划变化(比如训练初期原模型占比大,后期侧分支占比大)。(下文附有一段极简的代码示意。)
**注:**
Additive Learning:一种方法的名字("加法式学习")。
circumvents forgetting:绕过/避免遗忘 → 也就是解决"灾难性遗忘"问题(学新任务时把旧任务忘了)。
learned weight masks:学出来的权重掩码 → 决定新参数的哪些位置被激活,像一个可学习的开关。
pruning:剪枝 → 通常指删掉不重要的参数;在这里的语境下,可以理解为先剪掉一些原模型的参数,再在空出来的位置上加入新参数(或者只训练一部分稀疏的新连接)。
hard attention:硬注意力 → 一种二值化的注意力(要么完全保留某个神经元,要么完全忽略),用来决定新参数加在哪儿。
[51] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision (ECCV), pages 67-82, 2018.
[74] Amir Rosenfeld and John K. Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651-663, 2018.
[52] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7765-7773, 2018.
[80] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548-4557. PMLR, 2018.
[92] Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas J. Guibas, and Jitendra Malik. Side-tuning: Network adaptation via additive side networks. In European Conference on Computer Vision (ECCV), pages 698-714. Springer, 2020.
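上文提到的 Adapter 与 Side-Tuning,可以用下面的极简草图帮助理解(仅作示意,`Adapter`、`side_tuning` 等名称与瓶颈维度均为假设,并非某个库的真实接口):

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """示意:插入冻结 Transformer 层之间的瓶颈 Adapter,只训练少量新增参数。"""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # 降维
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # 升维
        nn.init.zeros_(self.up.weight)           # 初始时 Adapter 近似恒等映射
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))   # 残差连接,不改动冻结主干


def side_tuning(frozen_out: torch.Tensor, side_out: torch.Tensor, alpha: float) -> torch.Tensor:
    """示意:Side-Tuning 按预设的权重 alpha 线性混合冻结模型与侧分支的输出。"""
    return alpha * frozen_out + (1.0 - alpha) * side_out
```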
Low-Rank Adaptation (LoRA) prevents catastrophic forgetting [31] by learning **the offset of parameters** with **low-rank matrices**, based on the observation that many over-parameterized models **reside** in a **low intrinsic dimension subspace** [2, 47].
低秩适应(LoRA)基于"许多过参数化模型实际上处于一个低内在维度子空间"这一观察 [2, 47],通过用低秩矩阵学习参数的偏移量来防止灾难性遗忘 [31]。(本节稍后给出一个极简的 LoRA 代码示意。)
**注:**
offset of parameters:参数的偏移量,即"原参数应该调整多少",而不是直接替换原参数。
low-rank matrices:低秩矩阵,一种数学上紧凑的表示方式,参数量很少。
→ LoRA 通过低秩矩阵来学习参数的调整量(而不是直接改原参数),从而防止遗忘。
**为什么这样做能防止遗忘?**因为 LoRA 不修改原模型:原模型的参数被冻结(frozen),一直保留着旧知识;新知识被存在那些"低秩矩阵"里,学完之后再加到原参数上。原参数始终不变 → 旧知识永远在 → 不会遗忘。
**为什么可以用低秩矩阵?**
overparameterized models:过参数化的模型(参数量远多于必要量的模型,比如大语言模型、扩散模型)。
low intrinsic dimension subspace:低内在维度的子空间。通俗解释:虽然模型有几十亿参数,但真正有效的"自由度"其实很少,好比一个高维数据(比如人脸照片),实际上可以用很少的几个特征(眼睛间距、鼻子形状等)来近似表示。
observation:研究者观察到的现象。
→ 既然模型真正有效的变化空间很小,那么"参数的调整量"也可以在这个小空间里完成,用低秩矩阵就足以表达想要的更新。
reside:"存在于""位于",更通俗的说法是"处于……空间之中"。它表示"这些模型的参数虽然处在高维空间,但它们的有效变化实际上局限在一个低维子空间里"。
**完整通俗翻译:**"LoRA 通过低秩矩阵来学习原模型参数的调整值,以此防止灾难性遗忘。这样做的基础是:很多参数量巨大的模型,实际上内在的有效维度很低,也就是说,只需要在很小的子空间里调整参数,就能达到不错的微调效果。"
**更口语化:**"LoRA 不直接改原模型的参数,而是另外加两个小矩阵(低秩矩阵)来记录'原参数要改多少'。因为很多大模型真正需要的自由度很少,所以用这些小矩阵就能学得很好,这样既不会忘记旧知识,又省显存。"
**一个直观类比(帮助你真正"理解"):**想象你有一本写满知识的大百科全书(原模型)。全参数微调:直接在原书上修改字句,可能改着改着,旧的内容就被覆盖或删掉了(遗忘)。LoRA:你在书旁边贴便利贴,便利贴上只写"改动差异",比如"第 3 页第五行改为……"。便利贴很小(低秩),而且原书内容完好无损;阅读时同时看原书和便利贴上的改动即可。为什么便利贴够用?因为原书已经很好了,你只需要在很小的改动范围内修正它在特定任务上的表现,不需要重写整本书。
[31] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319-7328, Online, Aug. 2021. Association for Computational Linguistics.
[47] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Representations, 2018.
Zero-Initialized Layers are used by ControlNet for connecting network blocks. **Research on neural networks** has **extensively discussed** the **initialization and manipulation** of network weights [36, 37, 44, 45, 46, 76, 83, 95]. For example, Gaussian initialization of weights can be less risky than initializing with zeros [1].
ControlNet 使用零初始化层来连接网络块。关于神经网络的研究已广泛讨论了网络权重的初始化和操作 [36, 37, 44, 45, 46, 76, 83, 95]。例如,高斯初始化权重通常比零初始化风险更低 [1]。
**注:**
Zero-Initialized Layers:权重全部初始化为 0 的网络层,ControlNet 用它来连接不同的网络块。
→ ControlNet 在连接主模型和可训练副本时,用了把权重设为 0 的层;而一般情况下,用高斯初始化比用零初始化更安全。
**为什么零初始化有风险?**因为如果一层所有权重都是 0,那么它的输出也是 0,反向传播时所有神经元的梯度相同,导致它们难以学到不同的特征(对称性问题)。所以通常不建议全零初始化。
**这两者之间的"矛盾"怎么理解?**一般人会觉得:零初始化有风险,那 ControlNet 为什么还要用?这正是 ControlNet 的创意所在:ControlNet 的零初始化层并不是独立训练的层,而是作为连接桥梁。训练开始时,零初始化的层输出为 0,相当于"桥是断开的",可训练副本对原模型完全没有影响;随着训练进行,权重从 0 逐渐被更新,控制信号从无到有、平滑引入。这样就不会在训练初期用随机的、未学习的控制信号破坏预训练模型已经学好的特征。所以:一般的风险是"所有神经元对称,学不动",但 ControlNet 恰恰希望一开始什么都不学(输出为 0),然后逐步介入;这是有意设计,不是缺陷。
**一句话总结这段话的核心意思:**虽然传统研究认为零初始化有风险(不如高斯初始化安全),但 ControlNet 却专门用零初始化来连接网络块:这样训练开始时新增部分输出为零,不会干扰预训练模型,之后才逐渐学会施加控制。
**如果还不清楚,看这个类比:**想象你在给一辆已经调校好的赛车加装一个辅助动力装置。常见的零初始化问题:如果直接把辅助动力装置接上,而所有参数都是 0,那它一开始不工作,而且学起来很慢。ControlNet 的做法:故意让连接点一开始是"断开"的(输出为 0),等装好了再慢慢接通动力,这样不会在安装过程中损坏原来的引擎。
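结合上文对 LoRA 的解释,下面给出一个极简的 LoRA 线性层草图(仅作示意,`LoRALinear` 等名称、秩 r 与缩放方式均为假设,细节以论文 [31] 为准)。可以注意到,这里把矩阵 B 初始化为零、使初始偏移为 0,与上面讨论的零初始化思路一脉相承:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """示意:冻结原线性层 W,额外学习低秩偏移 ΔW = B @ A(r 远小于输入/输出维度)。"""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # 原参数冻结,旧知识不被覆盖
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B 置零:初始偏移为 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 输出 = 冻结层输出 + 低秩偏移(类比"便利贴":只记录"要改多少")
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```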
More recently, Nichol et al. [59] discussed how to scale the initial weight of convolution layers in a diffusion model to improve the training, and their implementation of "zero module" is an extreme case to scale weights to zero. Stability's model cards [83] also mention the use of zero weights in neural layers.
近期,Nichol 等人 [59] 讨论了如何缩放扩散模型中卷积层的初始权重以改进训练,他们实现的"零模块"(zero module)是将权重缩放到零的极端情况。Stability 的模型卡 [83] 也提到了在神经网络层中使用零权重。
**注:**
这段话的目的是:为 ControlNet 使用"零初始化"提供背景支持和合理性论证,告诉读者"这种做法不是我们凭空发明的,别人也用过"。
scale the initial weight:缩放初始权重,即把原本随机初始化的权重乘以一个系数(比如 0.5、0.1、0.0 等)。
"zero module":他们实现的一种模块,名字就叫"零模块"。
extreme case to scale weights to zero:缩放权重的极端情况,直接缩到 0。
→ Nichol 等人的工作里,为了改善训练,会缩放卷积层的初始权重;其中一种极端做法就是把权重设为零,他们称之为"zero module"。隐含逻辑:把权重设为零并不是一个疯狂的、不可行的操作,反而是"缩放权重"这一连续操作的自然边界情况,ControlNet 正是用了这个"边界情况"。
→ Stability AI 的公开技术文档中,也提到他们用过零权重。
Manipulating the initial convolution weights is also discussed in ProGAN [36], StyleGAN [37], and Noise2Noise [46].
在 ProGAN [36]、StyleGAN [37] 和 Noise2Noise [46] 中也讨论了对初始卷积权重的操作。
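顺着上面"缩放初始权重"的思路,可以用类似下面的辅助函数来表达这种做法(极简示意,并非 Nichol 等人或 ControlNet 代码库的原样实现;函数名与通道数均为假设):

```python
import torch.nn as nn


def scale_module(module: nn.Module, scale: float) -> nn.Module:
    """按系数缩放模块的全部初始参数;scale=0 即为"零模块"这一极端情况。"""
    for p in module.parameters():
        p.detach().mul_(scale)
    return module


def zero_module(module: nn.Module) -> nn.Module:
    """极端情况:所有参数置零,训练开始时该模块输出恒为 0。"""
    return scale_module(module, 0.0)


# 用法示意:ControlNet 连接处的 1x1 零卷积可以这样构造(320 通道仅为示例)
zero_conv = zero_module(nn.Conv2d(320, 320, kernel_size=1))
```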