[Paper Reading] Deep Incubation: Training Large Models by Divide-and-Conquering
Computer Vision · Archer · 2023-09-23 11:50
Recent years have witnessed a remarkable success of large deep learning models.
However, training these models is challenging due to high computational costs, painfully slow convergence, and overfitting issues.
In this paper, we present Deep Incubation, a novel approach that enables the efficient and effective training of large models by dividing them into smaller sub-modules which can be trained separately and assembled seamlessly.
A key challenge for implementing this idea is to ensure the compatibility of the independently trained sub-modules.
To address this issue, we first introduce a global, shared meta model, which is leveraged to implicitly link all the modules together, and can be designed as an extremely small network with negligible computational overhead.
Then we propose a module incubation algorithm, which trains each sub-module to replace the corresponding component of the meta model and accomplish a given learning task. Despite the simplicity, our approach effectively encourages each sub-module to be aware of its role in the target large model, such that the finally-learned sub-modules can collaborate with each other smoothly after being assembled.
Empirically, our method outperforms end-to-end (E2E) training in terms of both final accuracy and training efficiency.
For example, on top of ViT-Huge, it improves the accuracy by 2.7% on ImageNet or achieves similar performance with 4× less training time. Notably, the gains are significant for downstream tasks as well (e.g., object detection and image segmentation on COCO and ADE20K).
Figure 1: An illustration of our idea. We first train the sub-modules of a large model fully independently, and then assemble the trained modules to obtain the target model.
Figure 2: Comparison of 3 implementations of modular training when training Module II in the target model (K = 3).
In each implementation, the model above is the meta model M̂*, and the model below is the target model M. L is any measure of distance in feature space, e.g., the L1 distance. L_E2E is the original E2E training loss. Modules not involved in the training pipeline are greyed out.
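To make the caption's notation concrete, the following is a minimal PyTorch-style sketch (not the paper's code) of the two objectives referred to above, assuming K = 3 and that Module II is the one being trained: a Module Imitation loss that matches the module's output features to those of its meta counterpart using the L1 distance as L, and the Module Incubation loss that plugs the module into the meta model and reuses the original E2E objective L_E2E. The function names, the use of no_grad on the meta modules, and the shape compatibility between modules are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def imitation_loss(meta_modules, target_module, x):
    """Fig. 2 (b)-style Module Imitation: the trained module M_2 mimics the output
    features of the meta module M̂*_2, with L instantiated as the L1 distance."""
    with torch.no_grad():
        h = meta_modules[0](x)        # features entering Module II (from meta Module I)
        h_meta = meta_modules[1](h)   # meta Module II's output features
    h_pred = target_module(h)         # trained Module II's output features
    return F.l1_loss(h_pred, h_meta)  # L in the caption

def incubation_loss(meta_modules, target_module, x, y, criterion):
    """Fig. 2 (c)-style Module Incubation: M_2 replaces M̂*_2 inside the meta model
    and is trained with the original end-to-end objective."""
    with torch.no_grad():
        h = meta_modules[0](x)        # meta Module I (kept fixed)
    h = target_module(h)              # trainable Module II
    logits = meta_modules[2](h)       # meta Module III (kept fixed; gradients still flow through it)
    return criterion(logits, y)       # L_E2E in the caption
```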
In this paper, we propose a divide-and-conquer strategy to improve the effectiveness (better generalization performance) and the efficiency (lower training cost) of training large models.
Specifically, we divide a large model into smaller sub-modules, train these modules separately, and then assemble them to obtain the final model.
Compared with directly training the whole large network from scratch, starting the learning on top of smaller modules yields a faster and more stable convergence process and higher robustness against overfitting.
The independent nature also allows the training of each module to be performed on different machines with no communication needed.
We refer to this paradigm as "modular training", and illustrate it in Fig. 1.
Importantly, designing an effective modular training mechanism is non-trivial, as there exists a dilemma between independency and compatibility:
although training sub-modules independently enjoys advantages in terms of optimization efficiency and generalization performance, it is challenging to make these modules compatible with each other when they are assembled.
Some preliminary works alleviate this problem by leveraging approximated gradients [19, 11, 18] or local objectives [3, 4, 40], at the price of only achieving partial independency.
However, the modules in these methods are still highly entangled during forward propagation, and such approaches generally have not exhibited the ability to effectively address the optimization issues faced when training recently proposed large models (e.g., ViTs, see Tab. 2).
Empirically, extensive experiments on image recognition, object detection and semantic/instance segmentation on competitive benchmarks (e.g., ImageNet-1K [34], ADE20K [47] and COCO [26]) demonstrate the effectiveness of Deep Incubation.
For example, with ViT-H, in terms of generalization performance, Deep Incubation improves the accuracy by 2.7% on ImageNet and the mIoU by 3.4 on ADE20K compared to the E2E baseline.
In terms of training efficiency, Deep Incubation can achieve performance similar to E2E training with 4× less training cost.
Decoupled learning of neural networks is receiving more and more attention due to its biological plausibility and its potential in accelerating the model training process.
Auxiliary variable methods [37, 46, 1, 24] achieve a certain level of decoupling with strong convergence guarantees.
Another line of research [5, 25, 22, 29] uses biologically motivated methods to achieve decoupled learning. Using auxiliary networks [3, 4, 40] for local supervision is yet another way to achieve decoupling.
However, most of the above methods focus on decoupling modules during back-propagation, while the modules remain highly entangled during forward propagation.
In contrast, our modular training process completely decouples the modules and optimizes each of them independently.
Model stitching [23, 2, 10] aims to build hybrid models by "stitching" parts from different pre-trained models with stitching layers.
The aim is usually to investigate the internal representation similarity of different neural networks.
A recent work [42] also applies model stitching to transfer the knowledge of pre-trained models for downstream tasks.
However, the models obtained by stitching are limited by the architecture and training dataset of the pre-trained models, while our method is a general training paradigm that can be applied to any novel architectures and new datasets.
Knowledge distillation [17, 33, 35] trains a small student model to mimic the behavior of a larger model, thus transferring knowledge from the teacher model to the student model and achieving model compression.
This imitative feature has some resemblance to a naïve variant of our method, which is called Module Imitation (see Fig. 2 (b)).
However, they are essentially different. Specifically, the meta models in our work are much smaller than the target models, while in knowledge distillation the teacher networks are typically larger and more powerful than the student networks.
Moreover, our goal is not to compress a large model into a smaller one, but to effectively train a large model with the help of a small meta model.
3. Deep Incubation
As aforementioned, training large models is typically challenging, e.g., the learning process tends to be unstable, resource/data-hungry, and vulnerable to overfitting.
To tackle these challenges, we propose Deep Incubation, a divide-and-conquer strategy that improves the effectiveness and efficiency of large model training.
In this section, we introduce the concept of modular training.
By discussing the difficulties it faces, we present our Deep Incubation approach and summarize it in Alg. 1 and Fig. 3.
Modular training first divides a large model into smaller modules, and then optimizes each module independently.
As modern neural networks are generally constituted by a stack of layers, it is natural to divide the model along the depth dimension.
Formally, given a large target model M with n layers, we can divide M into K (K ≤ n) modules:

M = M_K ∘ M_{K-1} ∘ ⋯ ∘ M_1,
where ◦ represents function composition.
Then, each module M_i is trained independently in modular training.
In this way, the cumbersome task of directly training a large model is decomposed into easier sub-tasks of training small modules.
Moreover, these sub-tasks can be distributed to different machines and executed fully in parallel, with no communication needed.
After this process, we can simply assemble the trained modules, thus avoiding training the large model directly from scratch.
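As a rough illustration of this division and re-assembly, here is a hedged PyTorch sketch that assumes the target model is a plain stack of layers wrapped in nn.Sequential (the paper's actual partitioning of architectures such as ViTs may differ):

```python
import torch.nn as nn

def divide(layers: nn.Sequential, K: int):
    """Split a depth-n stack of layers into K consecutive modules M_1, ..., M_K."""
    n = len(layers)
    assert K <= n
    # Distribute the n layers over K modules as evenly as possible.
    sizes = [n // K + (1 if i < n % K else 0) for i in range(K)]
    modules, start = [], 0
    for size in sizes:
        modules.append(nn.Sequential(*list(layers)[start:start + size]))
        start += size
    return modules  # each M_i can now be trained independently, even on different machines

def assemble(modules):
    """Recover M = M_K ∘ M_{K-1} ∘ ... ∘ M_1 by chaining the trained modules."""
    return nn.Sequential(*modules)
```

Each module returned by divide can then be dispatched to a separate training job, matching the fully parallel, communication-free setting described above.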
Therefore, if implemented properly, modular training can be a highly effective and efficient way for large model training.
However, designing a proper modular training mechanism is a non-trivial task. In the following, we discuss in detail the challenges and present our solutions.
Dilemma I: independency vs. compatibility. At the core of modular training is the requirement of independency.
However, if the modules are trained completely unaware of the other modules, they may have low compatibility with each other, hence negatively affecting the performance of the assembled model.
Solution: meta model. We argue that the root of the above dilemma is that the requirement of independency prevents explicit information exchange between modules. Consequently, the modules cannot adapt to each other during training, causing the incompatibility issue.
Driven by this analysis, we propose to address the dilemma by introducing a global, shared meta model M̂* to enable implicit information exchange between the modules. Notably, the meta model M̂* is designed to have the same number of modules as the target model M:

M̂* = M̂*_K ∘ M̂*_{K-1} ∘ ⋯ ∘ M̂*_1.
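A minimal sketch of how the meta model and the incubation of a single module could be wired up in PyTorch is given below. The tiny linear meta modules, the feature width dim, and the hybrid_for helper are hypothetical illustrations of the structure described here (K meta modules, with the i-th one swapped for the target sub-module M_i during its training), not the paper's implementation; keeping the remaining meta modules fixed follows the greyed-out modules in Fig. 2, and the sub-module is assumed to operate on features of the same width as the meta modules.

```python
import torch.nn as nn

class MetaModel(nn.Module):
    """A global, shared meta model M̂* with the same number of modules (K) as the
    target model M, but deliberately tiny so its computational overhead is negligible."""
    def __init__(self, K: int, dim: int):
        super().__init__()
        self.meta_modules = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(K)
        )

    def forward(self, x):
        # Apply the K meta modules in sequence: M̂*_K ∘ ... ∘ M̂*_1.
        for m in self.meta_modules:
            x = m(x)
        return x

def hybrid_for(meta: MetaModel, target_module: nn.Module, i: int) -> nn.Sequential:
    """Build M̂*_K ∘ ... ∘ M_i ∘ ... ∘ M̂*_1: the i-th meta module is replaced by the
    target sub-module M_i, which is then trained with the ordinary E2E loss."""
    parts = list(meta.meta_modules)
    parts[i] = target_module                 # only M_i receives gradient updates
    for j, part in enumerate(parts):
        if j != i:
            part.requires_grad_(False)       # keep the shared meta modules fixed
    return nn.Sequential(*parts)
```

Because each hybrid contains one trainable sub-module plus K − 1 tiny fixed meta modules, the K incubation runs stay lightweight and can proceed in parallel before the trained modules are assembled into M.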