Abstract ---Neural architecture search (NAS) has emerged as a promising avenue for automatically designing task-specific neural networks. Existing NAS approaches require one complete search for each deployment specification of hardware or objective. This is a computationally impractical endeavor given the potentially large number of application scenarios. In this paper, we propose Neural Architecture Transfer (NAT) to overcome this limitation. NAT is designed to efficiently generate task-specific custom models that are competitive under multiple conflicting objectives. To realize this goal we learn task-specific supernets from which specialized subnets can be sampled without any additional training. The key to our approach is an integrated online transfer learning and many-objective evolutionary search procedure. A pre-trained supernet is iteratively adapted while simultaneously searching for task-specific subnets.
We demonstrate the efficacy of NAT on 11 benchmark image classification tasks ranging from large-scale multi-class to small-scale fine-grained datasets. In all cases, including ImageNet, NATNets improve upon the state-of-the-art under mobile settings (≤ 600M Multiply-Adds). Surprisingly, small-scale fine-grained datasets benefit the most from NAT. At the same time, the architecture search and transfer is orders of magnitude more efficient than existing NAS methods. Overall, experimental evaluation indicates that, across diverse image classification tasks and computational objectives, NAT is an appreciably more effective alternative to conventional transfer learning of fine-tuning weights of an existing network architecture learned on standard datasets. Code is available at
Index Terms ---Convolutional Neural Networks, Neural Architecture Search, AutoML, Transfer Learning, Evolutionary Algorithms.
1 INTRODUCTION
IMAGE classification is a fundamental task in computer vision,
where given a dataset and, possibly, multiple objectives to
optimize, one seeks to learn a model to classify images. Solutions to
address this problem fall into two categories: (a) Sufficient Data: A
custom convolutional neural network architecture is designed and
its parameters are trained from scratch using variants of stochastic
gradient descent, and (b) Insufficient Data: An existing architecture
designed on a large-scale dataset, such as ImageNet [1], along
with its pre-trained weights (e.g., VGG [2], ResNet [3]), is fine-tuned
for the task at hand. These two approaches have emerged as
the mainstays of present day computer vision.
The success of the aforementioned approaches is primarily attributed
to architectural advances in convolutional neural networks.
Initial efforts at designing neural architectures relied on
human ingenuity. Steady advances by skilled practitioners have
resulted in designs, such as AlexNet [4], VGG [2], GoogLeNet
[5], ResNet [3], DenseNet [6] and many more, which have led to
performance gains on the ImageNet Large Scale Visual Recognition
Challenge [1]. In most other cases, a recent large-scale study
[7] has shown that, across many tasks, transfer learning by fine-tuning
ImageNet pre-trained networks outperforms networks that
are trained from scratch on the same data.
Moving beyond manually designed network architectures,
Neural Architecture Search (NAS) [8] seeks to automate this
process and find not only good architectures, but also their
associated weights for a given image classification task. This goal
has led to notable improvements in convolutional neural network
architectures on standard image classification benchmarks, such
as ImageNet, CIFAR-10 [9], CIFAR-100 [9], etc., in terms of
predictive performance, computational complexity and model size.
However, apart from transfer learning by fine-tuning the weights,
current NAS approaches have failed to deliver new models for
both weights and topology on custom non-standard datasets. The
key barrier to realizing the full potential of NAS is the large
data and computational requirements for employing existing NAS
algorithms on new tasks.
In this paper, we introduce Neural Architecture Transfer (NAT)
to breach this barrier. Given an image classification task, NAT
obtains custom neural networks (both topology and weights),
optimized for possibly many conflicting objectives, and does so
without the steep computational burden of running NAS for each
new task from scratch. A single run of NAT efficiently obtains
multiple custom neural networks spanning the entire trade-off
front of objectives.
Our solution builds upon the concept of a supernet [10], which
comprises many subnets. All subnets are trained simultaneously
through weight sharing, and can be sampled very efficiently. This
procedure decouples the network training and the search phases of
NAS. A many-objective search can then be employed on top of
the supernet to find all network architectures that provide the best
trade-off among the objectives. However, training such supernets
for each task from scratch is very computationally and data
intensive. The key idea of NAT is to leverage an existing supernet
and efficiently transfer it into a task-specific supernet, whilst
simultaneously searching for architectures that offer the best
trade-off between the objectives of interest. Therefore, unlike standard
supernet-based NAS, we combine supernet transfer learning with
the search process. At the conclusion of this process, NAT returns
(i) subnets that span the entire objective trade-off front, and (ii)
a task-specific supernet. The latter can now be utilized for all
future deployment-specific NAS, i.e., new and different hardware
or objectives, without any additional training.
Fig. 1: Overview: Given a dataset and objectives to optimize, NAT designs custom architectures spanning the objective trade-off front. NAT comprises two main components, supernet adaptation and evolutionary search, that are iteratively executed. NAT also uses an online accuracy predictor model to improve its computational efficiency.
The core of NAT's efficiency lies in only adapting the subnets
of the supernet that will lie on the efficient trade-off front of the
new dataset, instead of all possible subnets. But, the structure
of the corresponding subnets is unknown before adaptation. We
resolve this "chicken-and-egg problem" by adopting an online
procedure that alternates between the two primary stages of NAT:
(a) supernet adaptation of subnets that are at the current trade-off
front, and (b) evolutionary search for subnets that span the
many-objective trade-off front. A pictorial overview of the entire NAT
method is shown in Fig. 1.
In the adaptation stage, we first construct a layer-wise empirical
distribution from the promising subnets returned by evolutionary
search. Then, subnets sampled from this distribution
are fine-tuned. In the search stage, to improve the efficiency of
the search, we adopt a surrogate model to quickly predict the
objectives of any sampled subnet without a full-blown and costly
evaluation. Furthermore, the predictor model itself is also learned
online from previously evaluated subnets. We alternate between
these two stages until our computational budget is exhausted.
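The adaptation stage described above can be illustrated with a minimal sketch. It assumes that subnets are encoded as equal-length integer lists and that each position is sampled independently from the frequencies observed among the promising subnets. The function names `layerwise_distribution` and `sample_subnet` and the toy archive are hypothetical and are not taken from the NAT code.

```python
import random
from collections import Counter

def layerwise_distribution(archive):
    """Per-position empirical distribution over the integer choices in `archive`."""
    length = len(archive[0])
    dists = []
    for pos in range(length):
        counts = Counter(arch[pos] for arch in archive)
        total = sum(counts.values())
        dists.append({value: n / total for value, n in counts.items()})
    return dists

def sample_subnet(dists, rng):
    """Draw one encoded subnet, sampling every position independently (assumption)."""
    return [
        rng.choices(list(d.keys()), weights=list(d.values()), k=1)[0]
        for d in dists
    ]

# Toy "promising subnets" returned by a search step (hypothetical values).
promising = [[1, 3, 0, 2], [1, 4, 0, 2], [2, 3, 0, 5]]
dists = layerwise_distribution(promising)
subnet = sample_subnet(dists, random.Random(0))
```

Choices that never appear among the promising subnets receive zero probability, so fine-tuning effort concentrates on the part of the supernet that the current trade-off front actually uses.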
The key contributions of this paper are:
-- We introduce Neural Architecture Transfer as a NAS-powered alternative to fine-tuning based transfer learning. NAT is powered by a simple, yet highly effective online supernet fine-tuning and online accuracy predicting surrogate model.
-- We demonstrate the scalability and practicality of NAT on multiple datasets corresponding to different scenarios: large-scale multi-class (ImageNet [1], CINIC-10 [12]), medium-scale multi-class (CIFAR-10, CIFAR-100 [9]), small-scale multi-class (STL-10 [13]), large-scale fine-grained (Food-101 [14]), medium-scale fine-grained (Stanford Cars [15], FGVC Aircraft [16]) and small-scale fine-grained (DTD [17], Oxford-IIIT Pets [18], Oxford Flowers102 [19]) datasets.
-- Under mobile settings (≤ 600M MAdds), NATNets lead to state-of-the-art performance across all these tasks. For instance, on ImageNet, NATNet achieves a Top-1 accuracy of 80.5% at 600M MAdds.
2 RELATED WORK
Recent years have witnessed growing interest in neural architecture
search. The promise of being able to automatically search for
task-dependent network architectures is particularly appealing as
deep neural networks are widely deployed in diverse applications
and computational environments. Early methods [33], [34] made
efforts to simultaneously evolve the topology of neural networks
along with weights and hyperparameters. These methods perform
competitively with hand-crafted networks on simple control
tasks with shallow fully connected networks. Recent efforts [35]
primarily focus on designing deep convolutional neural network
architectures.
The development of NAS largely happened in two phases.
Starting from NASNet [8], the first wave of methods, including
Block-QNN [36], Hierarchical NAS [37], and AmoebaNet [38],
focused primarily on improving the predictive accuracy of CNNs.
These methods relied on Reinforcement Learning (RL)
or Evolutionary Algorithms (EA) to search for an optimal modular
structure that is repeatedly stacked together to form a network
architecture. The search was typically carried out on relatively
small-scale datasets (e.g., CIFAR-10/100 [9]), following which the
best architectures were transferred to ImageNet for validation. A
steady stream of improvements over the state of the art on numerous
datasets was reported. The focus of the second wave of NAS
methods was on improving the search efficiency.
A few methods have also been proposed to adapt NAS to other
scenarios. These include meta-learning based approaches [39],
[40] with application to few-shot learning tasks. XferNAS [41]
and EAT-NAS [42] illustrate how architectures can be transferred
between similar datasets or from smaller to larger datasets. Some
approaches [43], [44] proposed RL-based NAS methods that
search on multiple tasks during training and transfer the learned
search strategy, as opposed to searched networks, to new tasks at
inference. Next, we provide short overviews of methods that are
closely related to the technical approach in this paper. Table 1 provides
a comparative overview of NAT and existing NAS approaches.
TABLE 1: Comparison of NAT and existing NAS methods. † indicates methods that scalarize multiple objectives into one composite objective or as an additional constraint, see text for details.
Performance Prediction: Evaluating the performance of an architecture
requires a computationally intensive process of iteratively
optimizing model weights. To alleviate this computational burden,
regression models have been learned to predict an architecture's
performance without actually training it. Baker et al. [45] use a
radial basis function to estimate the final accuracy of an architecture
from its accuracy in the first 25% of training iterations. PNAS
[23] uses a multilayer perceptron (MLP) and a recurrent neural
network to estimate the expected improvement in accuracy if the
current modular structure (which is later stacked together to form
a network) is expanded with a new branch. Conceptually, both
of these methods seek to learn a prediction model that extrapolates
(rather than interpolates), resulting in poor correlation in prediction.
OnceForAll [31] also uses an MLP to predict accuracy from the
architecture encoding. However, the model is trained offline for the
entire search space, thereby requiring a large number of samples
for learning (16K samples -> 2 GPU-days for just constructing
the surrogate model). Instead of using uniformly sampled architectures
to train the prediction model to approximate the entire
landscape, ChamNet [29] trains many architectures through full
SGD and selects only 300 samples of high accuracy with diverse
efficiency (Multiply-Adds, Latency, Energy) to train a prediction
model offline. In contrast, NAT learns a prediction model in an
online fashion only on the samples at the current trade-off front
as we explore the search space. Such an approach only needs to
interpolate over a much smaller space of architectures constituting
the current trade-off front. Consequently, this procedure significantly
improves both the accuracy and the sample complexity of
constructing the prediction model.
Weight Sharing: Approaches in this category involve training a
supernet that contains all searchable architectures as its subnets.
They can be broadly classified into two categories depending on
whether the supernet training is coupled with architecture search
or decoupled into a two-stage process. Approaches of the former
kind [24], [26], [46] are computationally efficient but return sub-optimal
models. Numerous studies [47], [48], [49] allude to weak
correlation between performance at the search and final evaluation
stages. Methods of the latter kind [10], [31], [50] use performance
of subnets (obtained by sampling the trained supernet) as a metric
to select architectures during search. However, training a supernet
beforehand for each new task is computationally prohibitive. In
this work, we take an integrated approach where we train a
supernet on large-scale datasets (e.g. ImageNet) once and couple
it with our architecture search to quickly adapt it to a new
task. A more detailed discussion connecting our method to existing
approaches is provided in Section A.
Multi-Objective NAS: Methods that consider multiple objectives
for designing hardware specific models have also been developed.
The objectives are optimized either through (i) scalarization, or (ii)
Pareto-based solutions. The former include ProxylessNAS [26],
MnasNet [27], ChamNet [29], MobileNetV3 [22], and FBNetV2
[32], which use a scalarized objective or an additional constraint
to encourage high accuracy and penalize compute inefficiency at
the same time, e.g., maximize $\mathrm{Acc} \times (\mathrm{Latency}/\mathrm{Target})^{-0.07}$.
Conceptually, the search of architectures is still guided by a single
objective and only one architecture is obtained per search. Empirically,
multiple runs with different weighting of the objectives
are needed to find an architecture with the desired trade-off, or
multiple architectures with different complexities. Methods in the
latter category include [25], [51], [52], [53], [54] and aim to
approximate the entire Pareto-efficient frontier simultaneously---
i.e. multiple architectures with different complexities are obtained
in a single run. These approaches rely on heuristics (e.g., EA)
to efficiently navigate the search space allowing practitioners to
visualize the trade-off between the objectives and to choose a
suitable network a posteriori to the search. NAT falls into the
latter category and uses an accuracy prediction model and weight
sharing for efficient architecture transfer to new tasks.
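To make the scalarization case concrete, the sketch below scores a few candidates with a reward of the form $\mathrm{Acc} \times (\mathrm{Latency}/\mathrm{Target})^{w}$ and shows that the single winner changes with the weight $w$. The candidate accuracies and latencies, the target of 80, and the second weight of -1.0 are made-up illustrative values; only the form of the reward and the exponent -0.07 come from the text.

```python
def scalarized_reward(acc, latency, target, w=-0.07):
    """Single composite objective: accuracy scaled by a latency penalty."""
    return acc * (latency / target) ** w

# Hypothetical candidates: name -> (accuracy, latency in ms).
candidates = {"A": (0.82, 160.0), "B": (0.76, 80.0), "C": (0.70, 40.0)}
TARGET = 80.0  # hypothetical latency target

def best(w):
    """Name of the candidate that maximizes the scalarized reward for weight w."""
    return max(candidates, key=lambda k: scalarized_reward(*candidates[k], TARGET, w))

mild = best(-0.07)   # weak latency penalty favours the most accurate model
harsh = best(-1.0)   # strong latency penalty favours the fastest model
```

Each weight yields exactly one architecture, which is why a scalarized search has to be rerun to explore a different trade-off, whereas a Pareto-based search keeps all mutually non-dominated candidates from a single run.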
3 PROPOSED APPROACH
Neural Architecture Transfer consists of three main components:
an accuracy predictor, an evolutionary search routine, and a
supernet. NAT starts with an archive $A$ of architectures (subnets)
created by uniform sampling from our search space. We evaluate
the performance $f_i$ of each subnet ($a_i$) using weights inherited
from the supernet. The accuracy predictor is then constructed
from $(a_i, f_i)$ pairs which (jointly with any additional objectives
provided by the user) drives the subsequent many-objective evolutionary
search towards optimal architectures. Promising architectures
at the conclusion of the evolutionary process are added to the
archive $A$. The (partial) weights of the supernet corresponding to
the top-ranked subnets in the archive are fine-tuned. NAT repeats
this process for a pre-specified number of iterations. At the conclusion,
we output both the archive and the task-specific supernet.
Networks that offer the best trade-off among the objectives can
be post-selected from the archive. Detailed descriptions of each
component of NAT are provided in the following subsections.
Figure 1 and Algorithm 1 provide an overview of our entire approach.
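The alternation described in this section can be summarized as a structural skeleton. This is only a sketch of the control flow, not the authors' Algorithm 1: every callable (`sample`, `evaluate`, `fit_predictor`, `search`, `adapt`) is a hypothetical placeholder, the toy stand-ins below use a single accuracy objective, and ranking the archive by accuracy alone is a simplification of the many-objective ranking.

```python
import random

def nat_skeleton(sample, evaluate, fit_predictor, search, adapt,
                 n_init=8, n_iter=3, top_k=4):
    """Alternate between predictor-guided search and supernet adaptation."""
    # Archive of (architecture, measured accuracy) pairs from uniform sampling.
    archive = [(a, evaluate(a)) for a in (sample() for _ in range(n_init))]
    for _ in range(n_iter):
        predictor = fit_predictor(archive)       # predictor learned online
        for a in search(predictor, archive):     # search driven by the predictor
            archive.append((a, evaluate(a)))     # evaluate with inherited weights
        ranked = sorted(archive, key=lambda pair: pair[1], reverse=True)
        adapt([a for a, _ in ranked[:top_k]])    # fine-tune top-ranked subnets
    return archive

# Toy stand-ins so that the skeleton runs end to end (all hypothetical).
rng = random.Random(0)

def sample():
    return [rng.randint(0, 9) for _ in range(4)]

def evaluate(arch):
    return sum(arch) / 36.0  # toy "accuracy" in [0, 1]

def fit_predictor(archive):
    def predict(arch):
        # 1-nearest-neighbour lookup as a placeholder surrogate model.
        nearest = min(archive, key=lambda p: sum(abs(x - y) for x, y in zip(p[0], arch)))
        return nearest[1]
    return predict

def search(predictor, archive):
    # Placeholder for evolutionary search: keep the two best predicted candidates.
    proposals = [sample() for _ in range(10)]
    return sorted(proposals, key=predictor, reverse=True)[:2]

adapted = []  # records which subnets would have been fine-tuned in each iteration
archive = nat_skeleton(sample, evaluate, fit_predictor, search, adapted.append)
```

The point of the skeleton is the data flow: the archive feeds the predictor, the predictor steers the search, and the search results decide which supernet weights are adapted next.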
Fig. 2: The architectures in our search space are variants of the MobileNetV2
family of models [22], [27], [28], [56]. (a) Each network consists of five
stages. Each stage has two to four layers. Each layer is an inverted residual
bottleneck block. The search space includes the input image resolution (R), the width
multiplier (W), the number of layers in each stage, the # of output channels
(expansion ratio E) of the first 1 × 1 convolution and the kernel size (K) of the
depth-wise separable convolution in each layer. (b) Networks are represented
as 22-integer strings, where the first two correspond to resolution and width
multiplier, and the rest correspond to the layers. Each value indicates a choice,
e.g., the third integer ($L_1$) taking a value of "1" corresponds to using an expansion
ratio of 3 and a kernel size of 3 in layer 1 of stage 1.
3.2 Search Space and Encoding
The search for optimal network architectures can be performed
over many different search spaces. The generality of the chosen
search space has a major influence on the quality of results that
are feasible. We adopt a modular design for the overall structure of
the network, consisting of a stem, multiple stages and a tail (see
Fig. 2a). The stem and tail are common to all networks and not
searched. Each stage in turn comprises multiple layers, and
each layer itself is an inverted residual bottleneck structure [56].
-- Network: We search for the input image resolution and the width multiplier (a factor that scales the # of output channels of each layer uniformly [57]). Following previous work [27], [28], [31], we segment the CNN architecture into five sequentially connected stages. The stages gradually reduce the feature map size and increase the number of channels (Fig. 2a Left).
-- Stage: We search over the number of layers, where only the first layer uses stride 2 if the feature map size decreases, and we allow each stage to have a minimum of two and a maximum of four layers (Fig. 2a Middle).
-- Layer: We search over the expansion ratio (between the # of output and input channels) of the first 1 × 1 convolution and the kernel size of the depth-wise separable convolution (Fig. 2a Right).
Fig. 3: Top Path: A typical process of evaluating an architecture in NAS
algorithms. Bottom Path: Accuracy predictor aims to bypass the time-consuming
components for evaluating a network's performance by directly
regressing its accuracy $f$ from $a$ (architecture in the encoded space).
Overall, we search over four primary hyperparameters of CNNs, i.e., the depth (# of layers), the width (# of channels), the kernel size, and the input resolution. The resulting volume of our
search space is approximately $3.5 \times 10^{19}$ for each combination of image resolution and width multiplier. To encode these architectural choices, we use an integer string of length 22, as shown in Fig. 2b. The first two values represent the input image resolution and width multiplier, respectively. The remaining 20 values denote the expansion ratio and kernel size settings for each of the 20 layers. The available options for expansion ratio and kernel size are [3, 4, 6] and [3, 5, 7], respectively. It is worth noting that we sort the layer settings in ascending #MAdds order, which is beneficial to the mutation operator used in our evolutionary search algorithm.
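The 22-integer encoding can be illustrated with a small decoder. Several details here are assumptions rather than facts from the text: that a value of 0 marks a skipped layer (which is how variable depth is represented in this sketch), that values 1 to 9 index (expansion ratio, kernel size) pairs in the order produced below, and the option lists `RESOLUTIONS` and `WIDTHS`, which are hypothetical placeholders. Only the value "1" mapping to expansion ratio 3 and kernel size 3 is stated in the Fig. 2 caption.

```python
from itertools import product

EXPANSION = [3, 4, 6]          # expansion ratio options from the text
KERNEL = [3, 5, 7]             # kernel size options from the text
RESOLUTIONS = [192, 224, 256]  # hypothetical resolution options
WIDTHS = [1.0, 1.2]            # hypothetical width multiplier options

# Index 0: layer skipped (assumption). Indices 1..9: (E, K) pairs, so that
# index 1 is (3, 3) as in the Fig. 2 caption. The exact #MAdds ordering of the
# remaining pairs is an assumption of this sketch.
LAYER_CHOICES = [None] + list(product(EXPANSION, KERNEL))

def decode(code):
    """Turn a 22-integer string into a readable architecture description."""
    if len(code) != 22:
        raise ValueError("expected 22 integers")
    stages = []
    for s in range(5):                           # five stages, four slots each
        genes = code[2 + 4 * s: 6 + 4 * s]
        layers = [LAYER_CHOICES[g] for g in genes if g != 0]
        if not 2 <= len(layers) <= 4:
            raise ValueError("each stage needs two to four layers")
        stages.append(layers)
    return {"resolution": RESOLUTIONS[code[0]],
            "width": WIDTHS[code[1]],
            "stages": stages}

code = [1, 0,  1, 5, 0, 0,  9, 9, 9, 9,  2, 3, 4, 0,  1, 1, 0, 0,  5, 5, 5, 5]
arch = decode(code)
depth = sum(len(stage) for stage in arch["stages"])
```

Because neighbouring integer values are meant to correspond to neighbouring compute costs, a small change to one gene yields a subnet with a similar #MAdds budget.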
3.3 Accuracy Predictor
The main computational bottleneck of NAS arises from the nested
nature of the bi-level optimization problem. The inner optimization
requires the weights of the subnets to be thoroughly learned
prior to evaluating their performance. Methods like weight-sharing
[31], [46], [50] allow sampled subnets to inherit weights among
themselves or from a supernet, avoiding the time-consuming
process (typically requiring hours) of learning weights through
SGD. However, standalone weight-sharing still requires inference
on validation data (typically requiring minutes) to assess performance.
Therefore, simply having to evaluate the subnets can
still render the overall process computationally prohibitive for
methods [8], [27], [38] that sample thousands of architectures
during search.
To mitigate the computational burden of fully evaluating the
subnets, we adopt a surrogate accuracy predictor that regresses the
performance of a sampled subnet without performing training or
inference. By learning a functional relation between the integer
strings (subnets in the encoded space) and the corresponding
performance, this approach decouples the evaluation of an architecture
from data-processing (including both SGD and inference).
Consequently, the evaluation time reduces from hours/minutes to
seconds. We illustrate this concept in Fig. 3. The effectiveness of
this idea, however, is critically dependent on the quality of the
surrogate model. Below we identify three desired properties of
such a model:
1) Reliable prediction: high rank-order correlation between
predicted and true performance.
2) Consistent prediction: the quality of the prediction should
be consistent across different datasets.
3) Sample efficiency: minimizing the number of training
examples necessary to construct an accurate predictor
model, since each training sample requires costly training
and evaluation of a subnet.
Fig. 4: Accuracy predictor performance as a function of training samples. For
each model, we show the mean and standard deviation of the Spearman rank
correlation on 11 datasets (Table 3). The size of the RBF ensemble is 500.
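Property (1) is measured with the Spearman rank correlation, which depends only on how the predictor orders the subnets and not on the predicted values themselves. The following dependency-free sketch computes it with average ranks for ties; the accuracy values are made up for illustration.

```python
def ranks(values):
    """1-based ranks of `values`, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

true_acc = [0.71, 0.74, 0.78, 0.80]          # toy measured accuracies
pred_same_order = [0.60, 0.65, 0.70, 0.90]   # wrong values, correct ordering
pred_one_swap = [0.65, 0.60, 0.70, 0.90]     # first two subnets swapped

rho_same = spearman(true_acc, pred_same_order)
rho_swap = spearman(true_acc, pred_one_swap)
```

A predictor that is badly calibrated but preserves the ordering still reaches a correlation of 1, which is what matters when the predictor is only used to rank candidates during search.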
Current approaches [23], [29], [31] that use surrogate-based
accuracy predictors, however, do not satisfy properties (1) and (3)
simultaneously. For instance, PNAS [23] uses 1,160 subnets to
build the surrogate but only achieves a rank-order correlation of
0.476. Similarly, OnceForAll [31] uses 16,000 subnets to build the
surrogate. The poor sample complexity and rank-order correlation
of these approaches are due to the offline learning of the surrogate
model. Instead of focusing on models that are at the trade-off front
of the objectives, these surrogate models are built for the entire
search space. Consequently, these methods require a significantly
larger and more complex surrogate model.
We overcome the aforementioned limitation by restricting the
surrogate model to the search space that constitutes the current
objective trade-off. Such a solution significantly reduces the sample
complexity of the surrogate and increases the reliability of
its predictions. We adopt four low-complexity predictors, namely,
Gaussian Process (GP) [29], Radial Basis Function (RBF) [45],
Multilayer Perceptron (MLP) [23], and Decision Tree (DT) [58].
Empirically, we observe that RBFs are consistently better than the
other three models if the # of training samples is more than 100. To
further improve RBF's performance, especially under a high sample
efficiency regime, we construct an ensemble of RBF models.
As outlined in Algorithm 2, each RBF model is constructed with a
subset of samples and features randomly selected from the training
instances. The correlation between predicted accuracy and true
accuracy from an ensemble of 500 RBF models outperforms all
Fig. 5: (a) Crossover Operator : new offspring architectures are created
by recombining integers from two parent architectures. The probability of
choosing from either one of the parents is equal. (b) Mutation Operator :
histograms showing the probabilities of mutated values with current value at 5
under different hyperparameter $\eta_m$ settings.
integer. The PM operator inherits the parent-centric convention, in
which the offspring are intentionally created around the parents.
The centricity is controlled via an index hyperparameter $\eta_m$. In
particular, high values of $\eta_m$ tend to create mutated offspring
around the parent, and low values encourage mutated offspring
to be further away from the parent architecture. See Fig. 5b for
a visualization of the effect of $\eta_m$. It is worth noting that the
PM operator was originally proposed for continuous optimization
where distances between variable values are naturally defined. In
contrast, in the context of our encoding, our variables are categorical
in nature, indicating a particular layer hyperparameter. So we sort
the searched subnets in ascending order of #MAdds, such that $\eta_m$
now controls the difference in #MAdds between the parent and
the mutated offspring.
We apply PM to every member in the offspring population
(created from crossover). We then merge the mutated offspring
population with the parent population and select the top half using
the many-objective selection operator described in Algorithm 4. This
procedure creates the parent population for the next generation.
We repeat this overall process for a pre-specified number of
generations and output the parent population at the conclusion
of the evolution.
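The two variation operators can be sketched as follows. The crossover follows the description in the Fig. 5 caption (each integer is taken from either parent with equal probability). The mutation uses the basic, unbounded form of polynomial mutation followed by rounding and clipping to the integer range; the rounding step, the per-gene mutation probability `p_gene`, and all function names are assumptions of this sketch rather than details from the text.

```python
import random

def uniform_crossover(parent1, parent2, rng):
    """Pick each integer from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent1, parent2)]

def polynomial_mutation(x, low, high, eta_m, rng):
    """Perturb integer x; a larger eta_m keeps the result closer to x."""
    u = rng.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
    mutated = round(x + delta * (high - low))   # rounding is an assumption
    return min(high, max(low, mutated))

def mutate(arch, bounds, eta_m, p_gene, rng):
    """Apply PM to each gene with probability p_gene (hypothetical parameter)."""
    return [polynomial_mutation(x, lo, hi, eta_m, rng) if rng.random() < p_gene else x
            for x, (lo, hi) in zip(arch, bounds)]

def mean_shift(eta_m, trials=2000, seed=0):
    """Average distance between parent value 5 and its mutated value in [0, 9]."""
    rng = random.Random(seed)
    return sum(abs(polynomial_mutation(5, 0, 9, eta_m, rng) - 5)
               for _ in range(trials)) / trials

rng = random.Random(1)
p1, p2 = [1, 2, 3, 4, 5], [9, 8, 7, 6, 5]
child = uniform_crossover(p1, p2, rng)
mutant = mutate(child, [(0, 9)] * 5, eta_m=1.0, p_gene=0.5, rng=rng)
shift_low_eta, shift_high_eta = mean_shift(1.0), mean_shift(20.0)
```

Because the choices are sorted by #MAdds, a small integer step produced under a high $\eta_m$ corresponds to a small change in compute cost, which is what gives the distance-based operator a meaning on categorical variables.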
3.5 Many-Objective Selection
In addition to high predictive accuracy, real-world applications
demand NAS algorithms to simultaneously balance a few other
conflicting objectives that are specific to the deployment scenarios.
For instance, mobile or embedded devices often have restrictions
in terms of model size, multiply-adds, latency, power consumption,
and memory footprint. With no prior assumption on the
correlation among these objectives, a scalable (to the number
of objectives) selection is required to drive the search towards
the high dimensional Pareto front. In this work, we adopt the
reference point guided selection originally proposed in NSGA-III
[11], which has been shown to be effective in handling problems
A solution $a_i$ is said to be non-dominated if these conditions hold
against all the other solutions $a_j$ (with $j \neq i$) in the entire search
space of $a$.
With the above definition, we can sort solutions into different
ranks of domination, where solutions in the same rank are non-dominated
with respect to each other, and for any solution in a higher rank there
exists at least one solution in a lower rank that dominates it. Thus, a
lower non-dominated ranked set is lexicographically better than a
higher ranked set. This process is referred to as non-dominated sorting,
and it is the first step in the selection process. During the many-objective
selection process, the lower ranked sets are chosen one
at a time until no more sets can be included without exceeding the population
size. The final accepted set may have to be split to choose
only a part. For this purpose, we choose the most diverse subset
based on a diversity-maintaining mechanism. We first create a
set of reference directions from a set of uniformly distributed (in
$(m-1)$-dimensional space) reference points in the unit simplex
by using the Das-and-Dennis method [61]. Then we associate each
solution to a reference direction based on the orthogonal distance of
the solution from the direction. Then, for every reference direction,
we choose the closest associated solution in a systematic manner
by adaptively computing a niche count $\rho$ so that every reference
direction gets an equal opportunity to choose a representative
closest solution in the selected population. The domination and
diversity-preserving procedures are easily scalable to any number
of objectives and importantly are free from any user-defined
hyperparameter. See Algorithm 4 for the pseudocode and Fig. 6
for a graphical illustration. A more detailed discussion on the
necessity of the reference point based selection is provided in
Section B.
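The first two ingredients of this selection, non-dominated sorting and the generation of uniformly spaced reference points on the unit simplex, can be sketched in a few lines. The sketch assumes that all objectives are minimized (accuracy is expressed as an error rate), uses the standard Pareto dominance relation, and omits the association and niche-count steps; it is not the pseudocode of Algorithm 4, and the objective values are invented for illustration.

```python
def dominates(f, g):
    """f dominates g: no worse in every objective, strictly better in at least one."""
    return all(a <= b for a, b in zip(f, g)) and any(a < b for a, b in zip(f, g))

def non_dominated_sort(objs):
    """Split solution indices into ranks; rank 0 is the non-dominated set."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objs[j], objs[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

def das_dennis(m, p):
    """Points on the unit simplex in m dimensions whose coordinates are multiples of 1/p."""
    points = []

    def recurse(prefix, left):
        if len(prefix) == m - 1:
            points.append([k / p for k in prefix + [left]])
            return
        for k in range(left + 1):
            recurse(prefix + [k], left - k)

    recurse([], p)
    return points

# Toy objective vectors: (top-1 error, #MAdds in millions), both minimized.
objs = [(0.20, 600), (0.25, 300), (0.30, 200), (0.26, 400), (0.35, 500)]
fronts = non_dominated_sort(objs)
reference_points = das_dennis(3, 4)
```

Here the first three solutions trade error against #MAdds and form rank 0, the fourth is dominated only by the second and forms rank 1, and the fifth forms rank 2. With three objectives and four divisions per axis, the construction yields 15 evenly spread reference points.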
3.6 Supernet Adaptation
Instead of training every architecture sampled during search from
scratch, NAS with weight sharing [24], [46] inherits weights from
previously-trained networks or from a supernet. Directly inheriting
the weights obviates the need to optimize the weights from scratch
and speeds up the search from thousands of GPU days to only a
few. In this work, we focus on the supernet approach [10], [31].
It involves first training a large network model (in which searchable
architectures become subnets) prior to the search. Then the
performance of the subnets, evaluated with the inherited weights,
is used to guide the selection of architectures during search. The
key to the success of this approach is that the performance of the
subnets with the inherited weights be highly correlated with the
performance of the same subnet when thoroughly trained from
scratch. Satisfying this desideratum necessitates that the supernet
weights be learned in such a way that all subnets are optimized
simultaneously.
Existing methods [30], [53] attempt to achieve the above goal
by imposing fairness in training the supernet, where the probability
of training any particular subnet for each batch of data is
uniform in expectation. However, we argue that simultaneously
training all the subnets in the search space is practically not
feasible and, more importantly, not necessary. Firstly, it is evident
from existing NAS approaches [26], [62] that different objectives
(#Params, #MAdds, latency on different hardware, etc.) require