CV/UIGM/OmniGen: Translation and Commentary on "OmniGen: Unified Image Generation"
Overview: This paper introduces OmniGen, a diffusion model for unified image generation.
>> Background and pain points: Most current image generation models are built for a single task, such as text-to-image generation. Each new task requires designing a dedicated module and fine-tuning the model, which holds back progress in image generation. Worse, these task-specific models create a significant problem: users cannot accomplish a task simply by typing an instruction; instead they need a complex and cumbersome workflow involving multiple models and many steps. For example, ControlNet requires a detector to extract the conditioning signal and then loads the corresponding condition module; InstantID only handles single-person images and must first run face detection and then encode the face with a face encoder. If users then want to edit the generated image, yet another model such as InstructPix2Pix has to be loaded. In short, there is no single model that can complete a wide range of tasks end-to-end from user instructions, the way ChatGPT does for language tasks.
>> Proposed solution: The paper proposes OmniGen, a unified image generation framework designed to address these pain points. OmniGen is a diffusion-based framework whose core design principles are universality and simplicity. It contains only two main components: a VAE (variational autoencoder) and a pretrained large Transformer model.
The VAE extracts continuous visual features from images, while the Transformer model generates images conditioned on the instruction. OmniGen supports arbitrarily interleaved text and image inputs as conditions to guide image generation, rather than accepting only plain text or images. To train this unified model, the paper constructs X2I (Anything to Image), a large-scale multi-task image generation dataset. X2I covers a wide variety of image generation tasks, all standardized into a unified format.
>> Core approach: OmniGen's core steps are as follows:
● Dataset construction: Build the X2I dataset, covering text-to-image, multimodal-to-image (mixed-modality prompts, subject-driven generation, and computer vision tasks), few-shot-to-image, and other tasks, all standardized into one input/output format.
● Model design: Adopt a minimal architecture containing only a VAE and a Transformer, avoiding extra encoders and simplifying the pipeline. The Transformer's attention mechanism is modified to combine causal attention over the sequence with bidirectional attention within each image, so the model handles image information better (see the attention-mask sketch after this list).
● Training strategy: Train with rectified flow, and use a weighted loss for image editing tasks so the model does not learn the shortcut of simply copying the input image (see the training-step sketch after this list). Training proceeds in stages with progressively higher image resolution.
● Inference: Generate images iteratively via flow matching, using a kv-cache to speed up inference.
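To make the attention modification concrete, below is a minimal PyTorch sketch of such a mask: tokens attend causally across the sequence, while tokens belonging to the same input image attend to each other bidirectionally. The bookkeeping tensors `segment_ids` and `is_image` are hypothetical; OmniGen's actual implementation may organize this differently.

```python
import torch

def causal_mask_with_image_blocks(segment_ids: torch.Tensor,
                                  is_image: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask: True means "query may attend to key".

    Causal over the whole sequence, but fully bidirectional inside each
    image's token block. `segment_ids[i]` gives the segment a token belongs
    to (all tokens of one image share an id); `is_image[i]` marks image
    tokens. Both tensors are 1-D with length equal to the sequence length.
    """
    n = segment_ids.numel()
    idx = torch.arange(n)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)          # lower-triangular
    same_segment = segment_ids.unsqueeze(1) == segment_ids.unsqueeze(0)
    both_image = is_image.unsqueeze(1) & is_image.unsqueeze(0)
    return causal | (same_segment & both_image)            # open image blocks

# Example: [text, text, image (3 tokens), text]; the three image tokens
# may attend to each other in both directions.
seg = torch.tensor([0, 1, 2, 2, 2, 3])
img = torch.tensor([False, False, True, True, True, False])
mask = causal_mask_with_image_blocks(seg, img)
```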
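The training strategy can likewise be sketched in code. Below is a minimal PyTorch rendering of one rectified-flow step with a region-weighted loss for editing data; the `model` signature, the change-detection threshold, and the weight value `w_changed` are illustrative assumptions rather than OmniGen's exact recipe.

```python
import torch

def rectified_flow_step(model, x1, cond, edit_src=None, w_changed=2.0):
    """One rectified-flow training step (velocity-prediction form).

    `x1` is the clean image latent (B, C, H, W), `cond` the instruction
    conditioning, and `edit_src` the source-image latent for editing tasks
    (None for other tasks). `model` is assumed to predict the velocity
    v_theta(x_t, t, cond).
    """
    x0 = torch.randn_like(x1)                     # Gaussian noise sample
    t = torch.rand(x1.size(0), device=x1.device)  # uniform timesteps in [0, 1]
    t_ = t.view(-1, 1, 1, 1)
    xt = t_ * x1 + (1.0 - t_) * x0                # linear interpolation path
    target = x1 - x0                              # rectified-flow velocity target

    loss = (model(xt, t, cond) - target) ** 2     # per-element squared error

    if edit_src is not None:
        # Upweight latent regions where the target differs from the source
        # image, so the model cannot shortcut by copying its input.
        changed = (x1 - edit_src).abs().mean(dim=1, keepdim=True) > 1e-3
        loss = loss * (1.0 + (w_changed - 1.0) * changed.float())

    return loss.mean()
```

The key point is the per-region weight: under a plain MSE, an editing model that merely reproduces the input already achieves a low loss, so the changed regions need to dominate the objective.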
>> Advantages:
● Unification: OmniGen handles a wide range of image generation tasks, including text-to-image generation, image editing, subject-driven generation, and visual-conditional generation, without any extra plugins.
● Simplicity: The architecture is minimal, with only two main components (a VAE and a Transformer), making it easy to use.
● Knowledge transfer: Through unified training, OmniGen transfers knowledge effectively across different tasks, handles unseen tasks and domains, and exhibits emergent capabilities.
● End-to-end workflow: Complex tasks can be completed end-to-end from user instructions, without cumbersome intermediate steps.
>> Conclusions and viewpoints:
● OmniGen is the first attempt at building a general-purpose image generation model.
● On multiple benchmarks, OmniGen achieves results comparable to or better than existing state-of-the-art models, while being smaller in parameter count and more efficient.
● OmniGen exhibits in-context learning, reasoning abilities, and a chain-of-thought mechanism.
● Despite these strong results, OmniGen still has limitations, such as limited text rendering and occasional unwanted details in generated images; the authors argue these can be addressed by further scaling the dataset and the model.
● The paper holds that the future paradigm of image generation should be simple and flexible, letting users generate arbitrary images directly from any multimodal instruction without complex workflows; OmniGen is an important step toward a foundational model for universal image generation.
In summary, by building a unified dataset and a minimal model architecture, OmniGen unifies image generation tasks, greatly simplifies the workflow, and demonstrates strong multimodal understanding and generation abilities, pointing to a new direction for building a universal image generation foundation model. The paper also acknowledges that the model still has room for improvement, particularly in image quality and in handling more complex tasks.
Contents
Translation and Commentary on "OmniGen: Unified Image Generation"
Figure 1: OmniGen can flexibly follow instructions to complete various tasks
Figure 2: The framework of OmniGen
8 Limitations and Conclusion
Translation and Commentary on "OmniGen: Unified Image Generation"
| Item    | Details                                                      |
|---------|--------------------------------------------------------------|
| Paper   | https://arxiv.org/abs/2409.11340                             |
| Date    | Submitted September 17, 2024; last revised November 21, 2024 |
| Authors | BAAI team                                                    |
Abstract
The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at this https URL to foster future advancements.
Figure 1: OmniGen can flexibly follow instructions to complete various tasks, without the need for any additional plugins. Complex tasks (e.g., the examples in the second row) can also be completed end-to-end without cumbersome intermediate steps.
Figure 2: The framework of OmniGen. Text is tokenized into tokens, while input images are transformed into embeddings via the VAE. OmniGen can accept free-form multimodal prompts and generate images through the rectified flow approach.
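As a rough illustration of this pipeline, the sketch below assembles an interleaved list of strings and images into a single embedding sequence for the Transformer. All interfaces here (`tokenizer`, `embed`, `vae.encode`, `patchify`) are hypothetical stand-ins, and details such as special tokens, positional encodings, and the timestep embedding are omitted.

```python
import torch

def assemble_multimodal_prompt(parts, tokenizer, embed, vae, patchify):
    """Map a free-form prompt (interleaved strings and image tensors) to a
    single (seq_len, dim) embedding sequence, following the Figure 2 idea:
    text goes through the tokenizer and embedding table, images go through
    the VAE and are flattened into patch embeddings.
    """
    chunks = []
    for part in parts:
        if isinstance(part, str):
            ids = torch.tensor(tokenizer(part))   # text -> token ids
            chunks.append(embed(ids))             # ids -> (n_tokens, dim)
        else:
            latent = vae.encode(part)             # image -> latent grid
            chunks.append(patchify(latent))       # latent -> (n_patches, dim)
    return torch.cat(chunks, dim=0)               # one interleaved sequence
```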
1 Introduction
The pursuit of Artificial General Intelligence (AGI) has intensified the demand for generative foundation models capable of handling various tasks within a single framework. In the field of Natural Language Processing (NLP), Large Language Models (LLMs) have become exemplary in achieving this goal, demonstrating remarkable versatility across numerous language tasks. However, the realm of visual generation has yet to reveal a counterpart that matches the universality of LLMs. Current image generation models have demonstrated proficiency in specialized tasks. For instance, in the text-to-image generation field, models such as the SD series [55, 51, 12], DALL-E [54], and Imagen [25] have made significant strides. Meanwhile, many efforts have been made to extend the capabilities of diffusion models for specific tasks, such as ControlNet [75], InstantID [64], and InstructPix2Pix [3]. Currently, for each new task, designing a specific module and fine-tuning it is necessary, which hinders the development of image generation. What's worse, these task-specific models lead to a significant issue: we cannot accomplish tasks simply through input instructions but require a complex and cumbersome workflow involving multiple models and numerous steps. For example, ControlNet needs to use a detector to extract conditions (such as pose estimation maps) and then load the corresponding condition module. InstantID can only process single-person images and requires a face detection model to detect faces beforehand, followed by using a face encoder to encode the face. If users want to edit the generated images, additional models like InstructPix2Pix need to be loaded.
Can a single model complete various tasks end-to-end through user instructions, similar to how ChatGPT handles language tasks? We envision a future where image generation is made simple and flexible: any task can be accomplished directly through user instructions. Motivated by this goal, we propose a unified framework: OmniGen. As shown in Figure 1, this framework allows for convenient and flexible image generation for any purpose, with no additional plugins or operations needed. Given its flexibility in following arbitrary instructions, the new framework also helps to inspire more interesting image generation tasks. Unlike popular diffusion models, OmniGen features a very concise structure, comprising only two main components: a VAE and a transformer model. OmniGen supports arbitrarily interleaved text and image inputs as conditions to guide image generation, instead of only accepting pure text or images. To train this architecture as a unified model, we construct a large-scale multi-task image generation dataset X2I, which unifies different tasks in one uniform format. We evaluate the well-trained model on multiple benchmarks, demonstrating its superior generation capability compared to existing models. Remarkably, OmniGen enables effective knowledge transfer across different scenarios, allowing it to handle unseen tasks and domains while also fostering the emergence of new abilities.
Our contributions are summarized as follows:
• We introduce OmniGen, a unified image generation model that excels in multiple domains. The model demonstrates competitive text-to-image generation capabilities and inherently supports a variety of downstream tasks, such as controllable image generation and classic computer vision tasks. Furthermore, it can handle complex tasks end-to-end without any lengthy intermediate steps. To our knowledge, OmniGen is the first image generation model to achieve such comprehensive functionality.
• We construct the first comprehensive image generation dataset, named X2I, which stands for "anything to image". This dataset covers a wide range of image generation tasks, all of which are standardized into a unified format.
• By unified training on the multi-task dataset, OmniGen can apply learned knowledge to tackle unseen tasks and domains, as well as exhibit new capabilities. Additionally, we explore the model's reasoning capabilities and chain-of-thought mechanism.
• OmniGen takes an initial step toward a foundational model for general image generation. We will open-source the relevant resources (model, code, and data) and hope this work can provide some insights for future image generation models.
8 Limitations and Conclusion
In this work, we introduce OmniGen, the first unified image generation model. It is designed to be simple, flexible, and easy to use. We also construct the first large-scale unified image generation dataset, X2I, to activate the general capabilities of the OmniGen model. OmniGen has demonstrated excellent image-generation abilities across various tasks. However, the current model still has some issues. The text rendering capability is limited, and long text cannot be accurately generated. The output images may contain undesired details (e.g., abnormal hands). Besides, previously unseen image types (e.g., surface normal maps) can hardly be processed as expected. More failure cases are provided in the supplementary materials. We believe that the future paradigm of image generation should be simple and flexible, allowing for the direct generation of various images through any multimodal instructions without complex workflows. OmniGen represents an important step toward a foundational model for universal image generation. We will open-source the relevant resources, hoping to provide insights for the future of image generation.