一、前言
参考资料:ControlNet原论文
Adding Conditional Control to Text-to-Image Diffusion Models
论文地址:https://arxiv.org/pdf/2302.05543
ControlNet的作者团队由斯坦福大学的博士生张吕敏(Lvmin Zhang)领衔,另两位合作者分别是现香港科技大学的助理教授饶安逸(Anyi Rao)和斯坦福大学的Maneesh Agrawala教授。
👤 核心作者:张吕敏(Lvmin Zhang)
个人代号:昵称是 "lllyasviel" ,因其卓越贡献被网友尊称为AI界的 "赛博佛祖" 。
教育背景:出生于中国,2021年在苏州大学获得工学学士学位,2022年起进入斯坦福大学攻读计算机科学博士学位,师从Maneesh Agrawala教授。
技术起点:大一就发表了AI绘画相关论文,本科期间已在ICCV/CVPR/ECCV等顶级会议发表了10篇论文。
代表作:除了ControlNet,还开发了Style2Paints、Fooocus、IC-Light、LayerDiffusion等知名项目。
🎓 其他作者介绍
饶安逸 (Anyi Rao):团队的重要成员,同时也是ControlNet论文的合著者。他现任香港科技大学 (HKUST) 助理教授,领导"多媒体创意实验室"。
Maneesh Agrawala:斯坦福大学Forest Baskett讲席教授,也是张吕敏在斯坦福的博士生导师。他是麦克阿瑟天才奖和ACM Fellow得主,在人机交互与计算机图形学领域享有极高声誉。
二、摘要
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls.
我们提出了ControlNet,一种为大型预训练文本到图像扩散模型添加空间条件控制的神经网络架构。ControlNet锁定生产就绪的大型扩散模型,并重用其以数十亿张图像预训练得到的深层且鲁棒的编码层,作为学习多样化条件控制的强大骨干。
The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts.
该神经网络架构通过"零卷积"(零初始化的卷积层)进行连接,这些卷积层的参数从零开始逐步增长,并确保不会有有害噪声影响微调。我们使用 Stable Diffusion 测试了各种条件控制,例如边缘、深度、分割、人体姿态等,支持单个或多个条件,带或不带提示词。
We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
我们证明了ControlNet的训练对于小型(<50k)和大型(>1m)数据集都是鲁棒的。大量实验结果表明,ControlNet 有望促进控制图像扩散模型的更广泛应用。
1. 引言
Many of us have experienced flashes of visual inspiration that we wish to capture in a unique image. With the advent of text-to-image diffusion models [54, 62, 72], we can now create visually stunning images by typing in a text prompt.
我们中的许多人都曾经历过瞬间的视觉灵感,希望能将其捕捉成一幅独特的图像。随着文本到图像扩散模型[54, 62, 72]的出现,现在我们只需输入文本提示,就能创造出视觉上令人惊叹的图像。
Yet, text-to-image models are limited in the control they provide over the spatial composition of the image; precisely expressing complex layouts, poses, shapes and forms can be difficult via text prompts alone. Generating an image that accurately matches our mental imagery often requires numerous trial-and-error cycles of editing a prompt, inspecting the resulting images and then re-editing the prompt.
然而,文本到图像模型对图像空间构图的控制能力有限;仅靠文本提示很难精确表达复杂的布局、姿势、形状和形态。要生成与我们脑海中的意象精确匹配的图像,通常需要多轮反复试错:编辑提示词、检查生成的图像,然后再次编辑提示词。
Can we enable finer grained spatial control by letting users provide additional images that directly specify their desired image composition? In computer vision and machine learning, these additional images (e.g., edge maps, human pose skeletons, segmentation maps, depth, normals, etc.) are often treated as conditioning on the image generation process.
我们能否通过允许用户提供直接指定其所需图像构成的额外图像,从而实现更精细的空间控制?在计算机视觉和机器学习中,这些额外的图像(例如,边缘图、人体姿态骨骼、分割图、深度图、法线图等)通常被视为图像生成过程的条件。
Image-to-image translation models [34, 98] learn the mapping from conditioning images to target images. The research community has also taken steps to control text-to-image models with spatial masks [6, 20], image editing instructions [10], personalization via finetuning [21, 75], etc. While a few problems (e.g., generating image variations, inpainting) can be resolved with training-free techniques like constraining the denoising diffusion process or editing attention layer activations, a wider variety of problems like depth-to-image, pose-to-image, etc., require end-to-end learning and data-driven solutions.
图像到图像的翻译模型 [34, 98] 学习从条件图像到目标图像的映射。研究界也已采取措施通过空间掩码 [6, 20]、图像编辑指令 [10]、通过微调进行个性化 [21, 75] 等来控制文本到图像模型。虽然少数问题(例如,生成图像变体、图像修复)可以通过约束去噪扩散过程或编辑注意力层激活等无训练技术来解决,但更广泛的问题(例如,深度到图像、姿态到图像等)需要端到端学习和数据驱动的解决方案。
Learning conditional controls for large text-to-image diffusion models in an end-to-end way is challenging. The amount of training data for a specific condition may be significantly smaller than the data available for general text-to-image training. For instance, the largest datasets for various specific problems (e.g., object shape/normal, human pose extraction, etc.) are usually about 100K in size, which is 50,000 times smaller than the LAION-5B [79] dataset that was used to train Stable Diffusion [82].
以端到端的方式为大型文本到图像扩散模型学习条件控制是具有挑战性的。特定条件的训练数据量可能远小于可用于通用文本到图像训练的数据量。例如,针对各种特定问题(例如,物体形状/法线、人体姿态提取等)的最大数据集通常约为 10 万个样本,这比用于训练 Stable Diffusion [82] 的 LAION-5B [79] 数据集小 50,000 倍。
The direct finetuning or continued training of a large pretrained model with limited data may cause overfitting and catastrophic forgetting [31, 75]. Researchers have shown that such forgetting can be alleviated by restricting the number or rank of trainable parameters [14, 25, 31, 92].
直接使用有限数据对大型预训练模型进行微调或持续训练可能会导致过拟合和灾难性遗忘 [31, 75]。研究人员表明,通过限制可训练参数的数量或秩可以缓解这种遗忘 [14, 25, 31, 92]。
For our problem, designing deeper or more customized neural architectures might be necessary for handling in-the-wild conditioning images with complex shapes and diverse high-level semantics.
对于我们所面临的问题,要处理真实场景中(in-the-wild)具有复杂形状和多样化高层语义的条件图像,可能需要设计更深或更定制化的神经网络架构。
This paper presents ControlNet, an end-to-end neural network architecture that learns conditional controls for large pretrained text-to-image diffusion models (Stable Diffusion in our implementation). ControlNet preserves the quality and capabilities of the large model by locking its parameters, and also making a trainable copy of its encoding layers.
本文提出了ControlNet,一种端到端的神经网络架构,它为大型预训练文本到图像扩散模型(在我们的实现中为Stable Diffusion)学习条件控制。ControlNet通过锁定大型模型的参数,并制作其编码层的可训练副本,来保留大型模型的质量和能力。
This architecture treats the large pretrained model as a strong backbone for learning diverse conditional controls. The trainable copy and the original, locked model are connected with zero convolution layers, with weights initialized to zeros so that they progressively grow during the training.
该架构将大型预训练模型视为学习各种条件控制的强大骨干。可训练副本和原始的、锁定的模型通过零卷积层连接,其权重初始化为零,以便在训练过程中逐步增长。
This architecture ensures that harmful noise is not added to the deep features of the large diffusion model at the beginning of training, and protects the large-scale pretrained backbone in the trainable copy from being damaged by such noise.
注:protect ... from ... 这个结构里,from 表示 "让坏的东西远离被保护对象",因此翻译成 "免于 / 免受"。
该架构确保在大规模扩散模型训练初期不会向其深层特征添加有害噪声,并保护可训练副本中的大规模预训练骨干网络免受此类噪声的损害。
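下面用 PyTorch 写一个极简示意(仅为帮助理解的草图,并非 ControlNet 官方实现;`zero_conv`、`ControlledBlock` 等名称和连接细节均为假设),展示"冻结主干 + 可训练副本 + 零卷积"的连接方式:

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 卷积,权重与偏置均初始化为 0:训练开始时输出恒为 0。"""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """示意:冻结的原编码块 + 可训练副本,两端用零卷积连接。"""

    def __init__(self, locked_block: nn.Module, channels: int):
        super().__init__()
        self.copy = copy.deepcopy(locked_block)   # 可训练副本,继承预训练权重
        self.locked = locked_block                # 原模型编码块,参数锁定
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)        # 条件输入进入副本前的零卷积
        self.zero_out = zero_conv(channels)       # 副本输出汇回主干前的零卷积

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        y = self.locked(x)                         # 冻结主干的原始输出
        c = self.copy(x + self.zero_in(cond))      # 副本同时接收特征与条件
        return y + self.zero_out(c)                # 初始时第二项恒为 0,不扰动主干
```

由于两个零卷积在训练开始时输出恒为 0,`ControlledBlock` 的初始输出与冻结主干完全相同,这正对应上文"训练初期不向深层特征添加有害噪声"的设计。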
Our experiments show that ControlNet can control Stable Diffusion with various conditioning inputs, including Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, etc. (Figure 1). We test our approach using a single conditioning image, with or without text prompts, and we demonstrate how our approach supports the composition of multiple conditions.
注:"Hough lines" 指的是一种计算机视觉中的经典算法------霍夫变换(Hough Transform),主要用于检测图像中的直线。
我们的实验表明,ControlNet 可以通过各种条件输入来控制 Stable Diffusion,包括 Canny 边缘、Hough 线、用户涂鸦、人体关键点、分割图、形状法线、深度图等(图 1)。我们使用单个条件图像(带或不带文本提示)来测试我们的方法,并展示了我们的方法如何支持多条件组合。
Additionally, we report that the training of ControlNet is robust and scalable on datasets of different sizes, and that for some tasks like depth-to-image conditioning, training ControlNets on a single NVIDIA RTX 3090Ti GPU can achieve results competitive with industrial models trained on large computation clusters. Finally, we conduct ablative studies to investigate the contribution of each component of our model, and compare our models to several strong conditional image generation baselines with user studies.
此外,我们报告称,ControlNet 的训练在不同规模的数据集上是稳健且可扩展的,并且对于某些任务,如深度到图像的条件生成,在单个 NVIDIA RTX 3090Ti GPU 上训练 ControlNets 即可获得与在大型计算集群上训练的工业级模型相媲美的结果。最后,我们进行了消融研究,以探究我们模型各组件的贡献,并通过用户研究将我们的模型与几个强大的条件图像生成基线进行了比较。
In summary, (1) we propose ControlNet, a neural network architecture that can add spatially localized input conditions to a pretrained text-to-image diffusion model via efficient finetuning, (2) we present pretrained ControlNets to control Stable Diffusion, conditioned on Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, and cartoon line drawings, and (3) we validate the method with ablative experiments comparing to several alternative architectures, and conduct user studies focused on several previous baselines across different tasks.
总而言之,(1)我们提出了 ControlNet,这是一种神经网络架构,可以通过高效的微调将空间局部化的输入条件添加到预训练的文本到图像扩散模型中;(2)我们展示了预训练的 ControlNet,用于控制 Stable Diffusion,并以 Canny 边缘、Hough 线、用户涂鸦、人体关键点、分割图、形状法线、深度和卡通线条画为条件;(3)我们通过消融实验验证了该方法,将其与几种替代架构进行了比较,并进行了用户研究,重点关注不同任务中的几个先前基线。
注:
spatially localized:空间上局部化的 → 指这种输入条件只影响图像的特定区域(比如只对图像左边或某个物体位置施加控制),而不是全局。
conduct:进行
2. 相关工作
2.1. 微调神经网络
One way to finetune a neural network is to directly continue training it with the additional training data. But this approach can lead to overfitting, mode collapse, and catastrophic forgetting. Extensive research has focused on developing finetuning strategies that avoid such issues.
微调神经网络的一种方法是直接使用额外的训练数据继续训练它。但是,这种方法可能导致过拟合、模式崩溃和灾难性遗忘。大量的研究集中在开发避免这些问题的微调策略上。
HyperNetwork is an approach that originated in the Natural Language Processing (NLP) community [25], with the aim of training a small recurrent neural network to influence the weights of a larger one. It has been applied to image generation with generative adversarial networks (GANs) [4, 18].
超网络(HyperNetwork)是一种起源于自然语言处理(NLP)社区的方法 [25],旨在训练一个小型循环神经网络来影响一个大型神经网络的权重。它已被应用于生成对抗网络(GANs)的图像生成中 [4, 18]。
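下面是一个极简的超网络草图(仅作示意并做了简化:原文 [25] 使用的是小型循环网络,这里用一个线性层代替;`TinyHyperNetwork` 等名称与维度均为假设),展示"用小网络生成大层权重"的基本思想:

```python
import torch
import torch.nn as nn


class TinyHyperNetwork(nn.Module):
    """示意:由一个小网络为目标线性层生成权重矩阵。"""

    def __init__(self, embed_dim: int, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.gen = nn.Linear(embed_dim, in_features * out_features)  # 小网络

    def forward(self, task_embedding: torch.Tensor) -> torch.Tensor:
        w = self.gen(task_embedding)                      # 由任务嵌入生成权重
        return w.view(self.out_features, self.in_features)


# 用法示意:目标层的前向计算直接使用超网络生成的权重
hyper = TinyHyperNetwork(embed_dim=16, in_features=64, out_features=32)
weight = hyper(torch.randn(16))
x = torch.randn(8, 64)
y = nn.functional.linear(x, weight)   # 等价于一个 64→32 的线性层,但权重由小网络决定
```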
[25] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
Heathen et al. [26] and Kurumuz [43] implement HyperNetworks for Stable Diffusion [72] to change the artistic style of its output images.
Heathen 等人 [26] 和 Kurumuz [43] 为 Stable Diffusion [72] 实现超网络(HyperNetworks),以改变其输出图像的艺术风格。
Adapter methods are widely used in NLP for customizing a pretrained transformer model to other tasks by embedding new module layers into it [30, 84]. In computer vision, adapters are used for incremental learning [74] and domain adaptation [70]. This technique is often used with CLIP [66] for transferring pretrained backbone models to different tasks [23, 66, 85, 94].
在自然语言处理中,适配器(Adapter)方法通过将新的模块层嵌入预训练的 Transformer 模型中,广泛用于将该模型定制到其他任务 [30, 84]。在计算机视觉领域,适配器被用于增量学习 [74] 和领域自适应 [70]。该技术常与 CLIP [66] 结合使用,将预训练的骨干模型迁移到不同任务 [23, 66, 85, 94]。
More recently, adapters have yielded successful results in vision transformers [49, 50] and ViT-Adapter [14]. In **concurrent work** with ours, T2I-Adapter [56] **adapts** Stable Diffusion **to** external conditions.
近期,适配器在视觉 Transformer [49, 50] 和 ViT-Adapter [14] 中取得了成功。在与我们同期进行的工作中,T2I-Adapter [56] 将 Stable Diffusion 适配到外部条件。
[14] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In International Conference on Learning Representations, 2023.
**Additive Learning** **circumvents forgetting** by freezing the original model weights and adding a small number of new parameters using **learned weight masks** [51, 74], **pruning** [52], or hard attention [80]. **Side-Tuning** [92] uses a side branch model to **learn extra functionality** by **linearly blending** the outputs of a frozen model and an added network, with a predefined blending weight schedule.
加法式学习(Additive Learning)通过冻结原始模型权重,并借助学习到的权重掩码 [51, 74]、剪枝 [52] 或硬注意力 [80] 添加少量新参数,从而避免遗忘。Side-Tuning [92] 用一个侧分支模型来学习额外功能:它把冻结的原模型输出和新增侧分支的输出按一定权重线性混合,其中混合权重按照预设的计划变化(比如训练初期原模型占比大,后期侧分支占比大)。(下文附有一段极简的代码示意。)
**注:**
Additive Learning:一种方法的名字("加法式学习")。
circumvents forgetting:绕过/避免遗忘 → 也就是解决"灾难性遗忘"问题(学新任务时把旧任务忘了)。
learned weight masks:学出来的权重掩码 → 决定新参数的哪些位置被激活,像一个可学习的开关。
pruning:剪枝 → 通常指删掉不重要的参数;在这里的语境下,可以理解为先剪掉一些原模型的参数,再在空出来的位置上加入新参数(或者只训练一部分稀疏的新连接)。
hard attention:硬注意力 → 一种二值化的注意力(要么完全保留某个神经元,要么完全忽略),用来决定新参数加在哪儿。
[51] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In European Conference on Computer Vision (ECCV), pages 67-82, 2018.
[74] Amir Rosenfeld and John K. Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651-663, 2018.
[52] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7765-7773, 2018.
[80] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548-4557. PMLR, 2018.
[92] Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas J. Guibas, and Jitendra Malik. Side-tuning: Network adaptation via additive side networks. In European Conference on Computer Vision (ECCV), pages 698-714. Springer, 2020.
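上文提到的 Adapter 与 Side-Tuning,可以用下面的极简草图帮助理解(仅作示意,`Adapter`、`side_tuning` 等名称与瓶颈维度均为假设,并非某个库的真实接口):

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """示意:插入冻结 Transformer 层之间的瓶颈 Adapter,只训练少量新增参数。"""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # 降维
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # 升维
        nn.init.zeros_(self.up.weight)           # 初始时 Adapter 近似恒等映射
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))   # 残差连接,不改动冻结主干


def side_tuning(frozen_out: torch.Tensor, side_out: torch.Tensor, alpha: float) -> torch.Tensor:
    """示意:Side-Tuning 按预设的权重 alpha 线性混合冻结模型与侧分支的输出。"""
    return alpha * frozen_out + (1.0 - alpha) * side_out
```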
Low-Rank Adaptation (LoRA) prevents catastrophic forgetting [31] by learning **the offset of parameters** with **low-rank matrices**, based on the observation that many over-parameterized models **reside** in a **low intrinsic dimension subspace** [2, 47].
低秩适应(LoRA)基于"许多过参数化模型实际上处于一个低内在维度子空间"这一观察 [2, 47],通过用低秩矩阵学习参数的偏移量来防止灾难性遗忘 [31]。(本节稍后给出一个极简的 LoRA 代码示意。)
**注:**
offset of parameters:参数的偏移量,即"原参数应该调整多少",而不是直接替换原参数。
low-rank matrices:低秩矩阵,一种数学上紧凑的表示方式,参数量很少。
→ LoRA 通过低秩矩阵来学习参数的调整量(而不是直接改原参数),从而防止遗忘。
**为什么这样做能防止遗忘?**因为 LoRA 不修改原模型:原模型的参数被冻结(frozen),一直保留着旧知识;新知识被存在那些"低秩矩阵"里,学完之后再加到原参数上。原参数始终不变 → 旧知识永远在 → 不会遗忘。
**为什么可以用低秩矩阵?**
overparameterized models:过参数化的模型(参数量远多于必要量的模型,比如大语言模型、扩散模型)。
low intrinsic dimension subspace:低内在维度的子空间。通俗解释:虽然模型有几十亿参数,但真正有效的"自由度"其实很少,好比一个高维数据(比如人脸照片),实际上可以用很少的几个特征(眼睛间距、鼻子形状等)来近似表示。
observation:研究者观察到的现象。
→ 既然模型真正有效的变化空间很小,那么"参数的调整量"也可以在这个小空间里完成,用低秩矩阵就足以表达想要的更新。
reside:"存在于""位于",更通俗的说法是"处于……空间之中"。它表示"这些模型的参数虽然处在高维空间,但它们的有效变化实际上局限在一个低维子空间里"。
**完整通俗翻译:**"LoRA 通过低秩矩阵来学习原模型参数的调整值,以此防止灾难性遗忘。这样做的基础是:很多参数量巨大的模型,实际上内在的有效维度很低,也就是说,只需要在很小的子空间里调整参数,就能达到不错的微调效果。"
**更口语化:**"LoRA 不直接改原模型的参数,而是另外加两个小矩阵(低秩矩阵)来记录'原参数要改多少'。因为很多大模型真正需要的自由度很少,所以用这些小矩阵就能学得很好,这样既不会忘记旧知识,又省显存。"
**一个直观类比(帮助你真正"理解"):**想象你有一本写满知识的大百科全书(原模型)。全参数微调:直接在原书上修改字句,可能改着改着,旧的内容就被覆盖或删掉了(遗忘)。LoRA:你在书旁边贴便利贴,便利贴上只写"改动差异",比如"第 3 页第五行改为……"。便利贴很小(低秩),而且原书内容完好无损;阅读时同时看原书和便利贴上的改动即可。为什么便利贴够用?因为原书已经很好了,你只需要在很小的改动范围内修正它在特定任务上的表现,不需要重写整本书。
[31] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319-7328, Online, Aug. 2021. Association for Computational Linguistics.
[47] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Representations, 2018.
Zero-Initialized Layers are used by ControlNet for connecting network blocks. **Research on neural networks** has **extensively discussed** the **initialization and manipulation** of network weights [36, 37, 44, 45, 46, 76, 83, 95]. For example, Gaussian initialization of weights can be less risky than initializing with zeros [1].
ControlNet 使用零初始化层来连接网络块。关于神经网络的研究已广泛讨论了网络权重的初始化和操作 [36, 37, 44, 45, 46, 76, 83, 95]。例如,高斯初始化权重通常比零初始化风险更低 [1]。
**注:**
Zero-Initialized Layers:权重全部初始化为 0 的网络层,ControlNet 用它来连接不同的网络块。
→ ControlNet 在连接主模型和可训练副本时,用了把权重设为 0 的层;而一般情况下,用高斯初始化比用零初始化更安全。
**为什么零初始化有风险?**因为如果一层所有权重都是 0,那么它的输出也是 0,反向传播时所有神经元的梯度相同,导致它们难以学到不同的特征(对称性问题)。所以通常不建议全零初始化。
**这两者之间的"矛盾"怎么理解?**一般人会觉得:零初始化有风险,那 ControlNet 为什么还要用?这正是 ControlNet 的创意所在:ControlNet 的零初始化层并不是独立训练的层,而是作为连接桥梁。训练开始时,零初始化的层输出为 0,相当于"桥是断开的",可训练副本对原模型完全没有影响;随着训练进行,权重从 0 逐渐被更新,控制信号从无到有、平滑引入。这样就不会在训练初期用随机的、未学习的控制信号破坏预训练模型已经学好的特征。所以:一般的风险是"所有神经元对称,学不动",但 ControlNet 恰恰希望一开始什么都不学(输出为 0),然后逐步介入;这是有意设计,不是缺陷。
**一句话总结这段话的核心意思:**虽然传统研究认为零初始化有风险(不如高斯初始化安全),但 ControlNet 却专门用零初始化来连接网络块:这样训练开始时新增部分输出为零,不会干扰预训练模型,之后才逐渐学会施加控制。
**如果还不清楚,看这个类比:**想象你在给一辆已经调校好的赛车加装一个辅助动力装置。常见的零初始化问题:如果直接把辅助动力装置接上,而所有参数都是 0,那它一开始不工作,而且学起来很慢。ControlNet 的做法:故意让连接点一开始是"断开"的(输出为 0),等装好了再慢慢接通动力,这样不会在安装过程中损坏原来的引擎。
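结合上文对 LoRA 的解释,下面给出一个极简的 LoRA 线性层草图(仅作示意,`LoRALinear` 等名称、秩 r 与缩放方式均为假设,细节以论文 [31] 为准)。可以注意到,这里把矩阵 B 初始化为零、使初始偏移为 0,与上面讨论的零初始化思路一脉相承:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """示意:冻结原线性层 W,额外学习低秩偏移 ΔW = B @ A(r 远小于输入/输出维度)。"""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # 原参数冻结,旧知识不被覆盖
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B 置零:初始偏移为 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 输出 = 冻结层输出 + 低秩偏移(类比"便利贴":只记录"要改多少")
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```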
More recently, Nichol et al. [59] discussed how to scale the initial weight of convolution layers in a diffusion model to improve the training, and their implementation of "zero module" is an extreme case to scale weights to zero. Stability's model cards [83] also mention the use of zero weights in neural layers.
近期,Nichol 等人 [59] 讨论了如何缩放扩散模型中卷积层的初始权重以改进训练,他们实现的"零模块"(zero module)是将权重缩放到零的极端情况。Stability 的模型卡 [83] 也提到了在神经网络层中使用零权重。
**注:**
这段话的目的是:为 ControlNet 使用"零初始化"提供背景支持和合理性论证,告诉读者"这种做法不是我们凭空发明的,别人也用过"。
scale the initial weight:缩放初始权重,即把原本随机初始化的权重乘以一个系数(比如 0.5、0.1、0.0 等)。
"zero module":他们实现的一种模块,名字就叫"零模块"。
extreme case to scale weights to zero:缩放权重的极端情况,直接缩到 0。
→ Nichol 等人的工作里,为了改善训练,会缩放卷积层的初始权重;其中一种极端做法就是把权重设为零,他们称之为"zero module"。隐含逻辑:把权重设为零并不是一个疯狂的、不可行的操作,反而是"缩放权重"这一连续操作的自然边界情况,ControlNet 正是用了这个"边界情况"。
→ Stability AI 的公开技术文档中,也提到他们用过零权重。
Manipulating the initial convolution weights is also discussed in ProGAN [36], StyleGAN [37], and Noise2Noise [46].
在 ProGAN [36]、StyleGAN [37] 和 Noise2Noise [46] 中也讨论了对初始卷积权重的操作。
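顺着上面"缩放初始权重"的思路,可以用类似下面的辅助函数来表达这种做法(极简示意,并非 Nichol 等人或 ControlNet 代码库的原样实现;函数名与通道数均为假设):

```python
import torch.nn as nn


def scale_module(module: nn.Module, scale: float) -> nn.Module:
    """按系数缩放模块的全部初始参数;scale=0 即为"零模块"这一极端情况。"""
    for p in module.parameters():
        p.detach().mul_(scale)
    return module


def zero_module(module: nn.Module) -> nn.Module:
    """极端情况:所有参数置零,训练开始时该模块输出恒为 0。"""
    return scale_module(module, 0.0)


# 用法示意:ControlNet 连接处的 1x1 零卷积可以这样构造(320 通道仅为示例)
zero_conv = zero_module(nn.Conv2d(320, 320, kernel_size=1))
```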