[Paper Reading] Deep Incubation: Training Large Models by Divide-and-Conquering
Computer Vision · Archer · 2023-09-23 11:50
Recent years have witnessed a remarkable success of large deep learning models.
However, training these models is challenging due to high computational costs, painfully slow convergence, and overfitting issues.
In this paper, we present Deep Incubation, a novel approach that enables the efficient and effective training of large models by dividing them into smaller sub-modules which can be trained separately and assembled seamlessly.
A key challenge for implementing this idea is to ensure the compatibility of the independently trained sub-modules.
To address this issue, we first introduce a global, shared meta model, which is leveraged to implicitly link all the modules together, and can be designed as an extremely small network with negligible computational overhead.
Then we propose a module incubation algorithm, which trains each sub-module to replace the corresponding component of the meta model and accomplish a given learning task. Despite the simplicity, our approach effectively encourages each sub-module to be aware of its role in the target large model, such that the finally-learned sub-modules can collaborate with each other smoothly after being assembled.
Empirically, our method outperforms end-to-end (E2E) training in terms of both final accuracy and training efficiency.
For example, on top of ViT-Huge, it improves the accuracy by 2.7% on ImageNet or achieves similar performance with 4× less training time. Notably, the gains are significant for downstream tasks as well (e.g., object detection and image segmentation on COCO and ADE20K).
Figure 1: An illustration of our idea. We first train the sub-modules of a large model fully independently, and then assemble the trained modules to obtain the target model.
Figure 2: Comparison of 3 implementations of modular training when training Module II in the target model (K = 3).
In each implementation, the model above is the meta model M̂*, and the model below is the target model M. L is any measure of distance in feature space, e.g., the L1 distance. L_E2E is the original E2E training loss. Modules not involved in the training pipeline are greyed out.
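To make the caption's notation concrete, the following is a minimal PyTorch-style sketch (not the paper's code) of the two objectives referred to above, assuming K = 3 and that Module II is the one being trained: a Module Imitation loss that matches the module's output features to those of its meta counterpart using the L1 distance as L, and the Module Incubation loss that plugs the module into the meta model and reuses the original E2E objective L_E2E. The function names, the use of no_grad on the meta modules, and the shape compatibility between modules are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def imitation_loss(meta_modules, target_module, x):
    """Fig. 2 (b)-style Module Imitation: the trained module M_2 mimics the output
    features of the meta module M̂*_2, with L instantiated as the L1 distance."""
    with torch.no_grad():
        h = meta_modules[0](x)        # features entering Module II (from meta Module I)
        h_meta = meta_modules[1](h)   # meta Module II's output features
    h_pred = target_module(h)         # trained Module II's output features
    return F.l1_loss(h_pred, h_meta)  # L in the caption

def incubation_loss(meta_modules, target_module, x, y, criterion):
    """Fig. 2 (c)-style Module Incubation: M_2 replaces M̂*_2 inside the meta model
    and is trained with the original end-to-end objective."""
    with torch.no_grad():
        h = meta_modules[0](x)        # meta Module I (kept fixed)
    h = target_module(h)              # trainable Module II
    logits = meta_modules[2](h)       # meta Module III (kept fixed; gradients still flow through it)
    return criterion(logits, y)       # L_E2E in the caption
```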
In this paper, we propose a divide-and-conquer strategy to improve the effectiveness (better generalization performance) and the efficiency (lower training cost) of training large models.
Specifically, we divide a large model into smaller sub-modules, train these modules separately, and then assemble them to obtain the final model.
Compared with directly training the whole large network from scratch, starting the learning on top of smaller modules yields a faster and more stable convergence process and higher robustness against overfitting.
The independent nature also allows the training of each module to be performed on different machines with no communication needed.
We refer to this paradigm as "modular training", and illustrate it in Fig. 1.
Importantly, designing an effective modular training mechanism is non-trivial, as there exists a dilemma between independency and compatibility:
although training sub-modules independently enjoys advantages in terms of optimization efficiency and generalization performance, it is challenging to make these modules compatible with each other when they are assembled.
Some preliminary works alleviate this problem by leveraging approximated gradients [19, 11, 18] or local objectives [3, 4, 40], at the price of only achieving partial independency.
However, the modules in these methods are still highly entangled during forward propagation, and such approaches generally have not exhibited the ability to effectively address the optimization issues faced when training recently proposed large models (e.g., ViTs, see Tab. 2).
Empirically, extensive experiments on image recognition, object detection and semantic/instance segmentation on competitive benchmarks (e.g., ImageNet-1K [34], ADE20K [47] and COCO [26]) demonstrate the effectiveness of Deep Incubation.
For example, with ViT-H, in terms of generalization performance, Deep Incubation improves the accuracy by 2.7% on ImageNet and the mIoU by 3.4 on ADE20K compared to the E2E baseline.
In terms of training efficiency, Deep Incubation can achieve performance similar to E2E training with 4× less training cost.
Decoupled learning of neural networks is receiving more and more attention due to its biological plausibility and its potential in accelerating the model training process.
Auxiliary variable methods [37, 46, 1, 24] achieve a certain level of decoupling with strong convergence guarantees.
Another line of research [5, 25, 22, 29] uses biologically motivated methods to achieve decoupled learning. Using auxiliary networks [3, 4, 40] for local supervision is yet another way to achieve decoupling.
However, most of the above methods focus on decoupling modules during back-propagation, while the modules remain highly entangled during forward propagation.
In contrast, our modular training process completely decouples the modules and optimizes each of them independently.
Model stitching [23, 2, 10] aims to build hybrid models by "stitching" parts from different pre-trained models with stitching layers.
The aim is usually to investigate the internal representation similarity of different neural networks.
A recent work [42] also applies model stitching to transfer the knowledge of pre-trained models for downstream tasks.
However, the models obtained by stitching are limited by the architecture and training dataset of the pre-trained models, while our method is a general training paradigm that can be applied to any novel architectures and new datasets.
Knowledge distillation [17, 33, 35] trains a small student model to mimic the behavior of a larger model, thus transferring knowledge from the teacher model to the student model and achieving model compression.
This imitative feature has some resemblance to a naïve variant of our method, which is called Module Imitation (see Fig. 2 (b)).
However, they are essentially different. Specifically, the meta models in our work are much smaller than the target models, while in knowledge distillation the teacher networks are typically larger and more powerful than the student networks.
Moreover, our goal is not to compress a large model into a smaller one, but to effectively train a large model with the help of a small meta model.
3. Deep Incubation
As aforementioned, training large models is typically challenging, e.g., the learning process tends to be unstable, resource/data-hungry, and vulnerable to overfitting.
To tackle these challenges, we propose Deep Incubation, a divide-and-conquer strategy that improves the effectiveness and efficiency of large model training.
In this section, we introduce the concept of modular training.
By discussing the difficulties it faces, we present our Deep Incubation approach and summarize it in Alg. 1 and Fig. 3.
Modular training first divides a large model into smaller modules, and then optimizes each module independently.
As modern neural networks are generally constituted by a stack of layers, it is natural to divide the model along the depth dimension.
Formally, given a large target model M with n layers, we can divide M into K (K ≤ n) modules:

M = M_K ∘ M_{K-1} ∘ ⋯ ∘ M_1,
where ◦ represents function composition.
Then, each module M_i is trained independently in modular training.
In this way, the cumbersome task of directly training a large model is decomposed into easier sub-tasks of training small modules.
Moreover, these sub-tasks can be distributed to different machines and executed fully in parallel, with no communication needed.
After this process, we can simply assemble the trained modules, thus avoiding training the large model directly from scratch.
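As a rough illustration of this division and re-assembly, here is a hedged PyTorch sketch that assumes the target model is a plain stack of layers wrapped in nn.Sequential (the paper's actual partitioning of architectures such as ViTs may differ):

```python
import torch.nn as nn

def divide(layers: nn.Sequential, K: int):
    """Split a depth-n stack of layers into K consecutive modules M_1, ..., M_K."""
    n = len(layers)
    assert K <= n
    # Distribute the n layers over K modules as evenly as possible.
    sizes = [n // K + (1 if i < n % K else 0) for i in range(K)]
    modules, start = [], 0
    for size in sizes:
        modules.append(nn.Sequential(*list(layers)[start:start + size]))
        start += size
    return modules  # each M_i can now be trained independently, even on different machines

def assemble(modules):
    """Recover M = M_K ∘ M_{K-1} ∘ ... ∘ M_1 by chaining the trained modules."""
    return nn.Sequential(*modules)
```

Each module returned by divide can then be dispatched to a separate training job, matching the fully parallel, communication-free setting described above.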
Therefore, if implemented properly, modular training can be a highly effective and efficient way for large model training.
However, designing a proper modular training mechanism is a non-trivial task. In the following, we discuss in detail the challenges and present our solutions.
Dilemma I: independency vs. compatibility. At the core of modular training is the requirement of independency.
However, if the modules are trained completely unaware of the other modules, they may have low compatibility with each other, hence negatively affecting the performance of the assembled model.
Solution: meta model. We argue that the root of the above dilemma is that the requirement of independency prevents explicit information exchange between modules. Consequently, the modules cannot adapt to each other during training, causing the incompatibility issue.
Driven by this analysis, we propose to address the dilemma by introducing a global, shared meta model M̂* to enable implicit information exchange between the modules. Notably, the meta model M̂* is designed to have the same number of modules as the target model M:

M̂* = M̂*_K ∘ M̂*_{K-1} ∘ ⋯ ∘ M̂*_1.
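A minimal sketch of how the meta model and the incubation of a single module could be wired up in PyTorch is given below. The tiny linear meta modules, the feature width dim, and the hybrid_for helper are hypothetical illustrations of the structure described here (K meta modules, with the i-th one swapped for the target sub-module M_i during its training), not the paper's implementation; keeping the remaining meta modules fixed follows the greyed-out modules in Fig. 2, and the sub-module is assumed to operate on features of the same width as the meta modules.

```python
import torch.nn as nn

class MetaModel(nn.Module):
    """A global, shared meta model M̂* with the same number of modules (K) as the
    target model M, but deliberately tiny so its computational overhead is negligible."""
    def __init__(self, K: int, dim: int):
        super().__init__()
        self.meta_modules = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(K)
        )

    def forward(self, x):
        # Apply the K meta modules in sequence: M̂*_K ∘ ... ∘ M̂*_1.
        for m in self.meta_modules:
            x = m(x)
        return x

def hybrid_for(meta: MetaModel, target_module: nn.Module, i: int) -> nn.Sequential:
    """Build M̂*_K ∘ ... ∘ M_i ∘ ... ∘ M̂*_1: the i-th meta module is replaced by the
    target sub-module M_i, which is then trained with the ordinary E2E loss."""
    parts = list(meta.meta_modules)
    parts[i] = target_module                 # only M_i receives gradient updates
    for j, part in enumerate(parts):
        if j != i:
            part.requires_grad_(False)       # keep the shared meta modules fixed
    return nn.Sequential(*parts)
```

Because each hybrid contains one trainable sub-module plus K − 1 tiny fixed meta modules, the K incubation runs stay lightweight and can proceed in parallel before the trained modules are assembled into M.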