Neural Architecture Transfer

Abstract ---Neural architecture search (NAS) has emerged as a promising avenue for automatically designing task-specific neural networks. Existing NAS approaches require one complete search for each deployment specification of hardware or objective. This is a computationally impractical endeavor given the potentially large number of application scenarios. In this paper, we propose Neural Architecture Transfer (NAT) to overcome this limitation. NAT is designed to efficiently generate task-specific custom models that are competitive under multiple conflicting objectives. To realize this goal we learn task-specific supernets from which specialized subnets can be sampled without any additional training. The key to our approach is an integrated online transfer learning and many-objective evolutionary search procedure. A pre-trained supernet is iteratively adapted while simultaneously searching for task-specific subnets.
We demonstrate the efficacy of NAT on 11 benchmark image classification tasks ranging from large-scale multi-class to small-scale fine-grained datasets. In all cases, including ImageNet, NATNets improve upon the state-of-the-art under mobile settings (≤ 600M Multiply-Adds). Surprisingly, small-scale fine-grained datasets benefit the most from NAT. At the same time, the architecture search and transfer is orders of magnitude more efficient than existing NAS methods. Overall, experimental evaluation indicates that, across diverse image classification tasks and computational objectives, NAT is an appreciably more effective alternative to conventional transfer learning of fine-tuning weights of an existing network architecture learned on standard datasets. Code is available at
Index Terms ---Convolutional Neural Networks, Neural Architecture Search, AutoML, Transfer Learning, Evolutionary Algorithms.
1 INTRODUCTION
Image classification is a fundamental task in computer vision, where, given a dataset and, possibly, multiple objectives to optimize, one seeks to learn a model to classify images. Solutions to address this problem fall into two categories: (a) Sufficient Data: a custom convolutional neural network architecture is designed and its parameters are trained from scratch using variants of stochastic gradient descent; and (b) Insufficient Data: an existing architecture designed on a large-scale dataset, such as ImageNet [1], along with its pre-trained weights (e.g., VGG [2], ResNet [3]), is fine-tuned for the task at hand. These two approaches have emerged as the mainstays of present-day computer vision.
The success of the aforementioned approaches is primarily attributed to architectural advances in convolutional neural networks. Initial efforts at designing neural architectures relied on human ingenuity. Steady advances by skilled practitioners have resulted in designs, such as AlexNet [4], VGG [2], GoogLeNet [5], ResNet [3], DenseNet [6] and many more, which have led to performance gains on the ImageNet Large Scale Visual Recognition Challenge [1]. In most other cases, a recent large-scale study [7] has shown that, across many tasks, transfer learning by fine-tuning ImageNet pre-trained networks outperforms networks that are trained from scratch on the same data.
Moving beyond manually designed network architectures, Neural Architecture Search (NAS) [8] seeks to automate this process and find not only good architectures, but also their associated weights for a given image classification task. This goal has led to notable improvements in convolutional neural network architectures on standard image classification benchmarks, such as ImageNet, CIFAR-10 [9] and CIFAR-100 [9], in terms of predictive performance, computational complexity and model size. However, apart from transfer learning by fine-tuning the weights, current NAS approaches have failed to deliver new models, in both weights and topology, on custom non-standard datasets. The key barrier to realizing the full potential of NAS is the large data and computational requirements of employing existing NAS algorithms on new tasks.
In this paper, we introduce Neural Architecture Transfer (NAT)
to breach this barrier. Given an image classification task, NAT
obtains custom neural networks (both topology and weights),
optimized for possibly many conflicting objectives, and does so
without the steep computational burden of running NAS for each
new task from scratch. A single run of NAT efficiently obtains
multiple custom neural networks spanning the entire trade-off
front of objectives.
Our solution builds upon the concept of a supernet [10], which comprises many subnets. All subnets are trained simultaneously through weight sharing and can be sampled very efficiently. This procedure decouples the network training and search phases of NAS. A many-objective search can then be employed on top of the supernet to find all network architectures that provide the best trade-off among the objectives. However, training such supernets for each task from scratch is very computationally and data intensive. The key idea of NAT is to leverage an existing supernet and efficiently transfer it into a task-specific supernet, while simultaneously searching for architectures that offer the best trade-off between the objectives of interest. Therefore, unlike standard supernet-based NAS, we combine supernet transfer learning with the search process. At the conclusion of this process, NAT returns (i) subnets that span the entire objective trade-off front, and (ii) a task-specific supernet. The latter can now be utilized for all future deployment-specific NAS, i.e., new and different hardware or objectives, without any additional training.

Fig. 1: Overview: Given a dataset and objectives to optimize, NAT designs custom architectures spanning the objective trade-off front. NAT comprises two main components, supernet adaptation and evolutionary search, that are executed iteratively. NAT also uses an online accuracy predictor model to improve its computational efficiency.
The core of NAT's efficiency lies in adapting only the subnets of the supernet that will lie on the efficient trade-off front of the new dataset, instead of all possible subnets. But the structure of the corresponding subnets is unknown before adaptation. We resolve this "chicken-and-egg" problem by adopting an online procedure that alternates between the two primary stages of NAT: (a) supernet adaptation of subnets that are at the current trade-off front, and (b) evolutionary search for subnets that span the many-objective trade-off front. A pictorial overview of the entire NAT method is shown in Fig. 1.

In the adaptation stage, we first construct a layer-wise empirical distribution from the promising subnets returned by evolutionary search. Then, subnets sampled from this distribution are fine-tuned. In the search stage, to improve its efficiency, we adopt a surrogate model that quickly predicts the objectives of any sampled subnet without a full-blown and costly evaluation. Furthermore, the predictor model itself is also learned online from previously evaluated subnets. We alternate between these two stages until our computational budget is exhausted.
The key contributions of this paper are:
-- We introduce Neural Architecture Transfer as a NAS-powered alternative to fine-tuning-based transfer learning. NAT is powered by a simple, yet highly effective, online supernet fine-tuning procedure and an online accuracy-predicting surrogate model.
-- We demonstrate the scalability and practicality of NAT on multiple datasets corresponding to different scenarios: large-scale multi-class (ImageNet [1], CINIC-10 [12]), medium-scale multi-class (CIFAR-10, CIFAR-100 [9]), small-scale multi-class (STL-10 [13]), large-scale fine-grained (Food-101 [14]), medium-scale fine-grained (Stanford Cars [15], FGVC Aircraft [16]) and small-scale fine-grained (DTD [17], Oxford-IIIT Pets [18], Oxford Flowers-102 [19]) datasets.
-- Under mobile settings (≤ 600M MAdds), NATNets lead to state-of-the-art performance across all these tasks. For instance, on ImageNet, NATNet achieves a Top-1 accuracy of 80.5% at 600M MAdds.
2 RELATED WORK
Recent years have witnessed growing interest in neural architecture search. The promise of being able to automatically search for task-dependent network architectures is particularly appealing as deep neural networks are widely deployed in diverse applications and computational environments. Early methods [33], [34] made efforts to simultaneously evolve the topology of neural networks along with their weights and hyperparameters. These methods perform competitively with hand-crafted networks on simple control tasks with shallow fully connected networks. Recent efforts [35] primarily focus on designing deep convolutional neural network architectures.
The development of NAS has largely happened in two phases. Starting from NASNet [8], the focus of the first wave of methods was primarily on improving the predictive accuracy of CNNs, including Block-QNN [36], Hierarchical NAS [37], AmoebaNet [38], etc. These methods relied on Reinforcement Learning (RL) or Evolutionary Algorithms (EA) to search for an optimal modular structure that is repeatedly stacked together to form a network architecture. The search was typically carried out on relatively small-scale datasets (e.g., CIFAR-10/100 [9]), following which the best architectures were transferred to ImageNet for validation. A steady stream of improvements over the state-of-the-art on numerous datasets was reported. The focus of the second wave of NAS methods was on improving the search efficiency.
A few methods have also been proposed to adapt NAS to other scenarios. These include meta-learning based approaches [39], [40] with application to few-shot learning tasks. XferNAS [41] and EAT-NAS [42] illustrate how architectures can be transferred between similar datasets or from smaller to larger datasets. Some approaches [43], [44] proposed RL-based NAS methods that search on multiple tasks during training and transfer the learned search strategy, as opposed to the searched networks, to new tasks at inference. Next, we provide short overviews of methods that are closely related to the technical approach in this paper. Table 1 provides a comparative overview of NAT and existing NAS approaches.

TABLE 1: Comparison of NAT and existing NAS methods. † indicates methods that scalarize multiple objectives into one composite objective or treat them as an additional constraint; see text for details.
Performance Prediction: Evaluating the performance of an architecture requires a computationally intensive process of iteratively optimizing model weights. To alleviate this computational burden, regression models have been learned to predict an architecture's performance without actually training it. Baker et al. [45] use a radial basis function to estimate the final accuracy of architectures from their accuracy in the first 25% of training iterations. PNAS [23] uses a multilayer perceptron (MLP) and a recurrent neural network to estimate the expected improvement in accuracy if the current modular structure (which is later stacked together to form a network) is expanded with a new branch. Conceptually, both of these methods seek to learn a prediction model that extrapolates (rather than interpolates), resulting in poor correlation between predicted and true performance. OnceForAll [31] also uses an MLP to predict accuracy from the architecture encoding. However, the model is trained offline for the entire search space, thereby requiring a large number of samples for learning (16K samples, i.e., more than 2 GPU-days just for constructing the surrogate model). Instead of using uniformly sampled architectures to train the prediction model to approximate the entire landscape, ChamNet [29] trains many architectures through full SGD and selects only 300 samples of high accuracy with diverse efficiency (multiply-adds, latency, energy) to train a prediction model offline. In contrast, NAT learns a prediction model in an online fashion, only on the samples at the current trade-off front, as we explore the search space. Such an approach only needs to interpolate over a much smaller space of architectures constituting the current trade-off front. Consequently, this procedure significantly improves both the accuracy and the sample complexity of constructing the prediction model.
Weight Sharing: Approaches in this category involve training a supernet that contains all searchable architectures as its subnets. They can be broadly classified into two categories depending on whether the supernet training is coupled with the architecture search or decoupled into a two-stage process. Approaches of the former kind [24], [26], [46] are computationally efficient but return sub-optimal models; numerous studies [47], [48], [49] allude to a weak correlation between performance at the search and final evaluation stages. Methods of the latter kind [10], [31], [50] use the performance of subnets (obtained by sampling the trained supernet) as a metric to select architectures during search. However, training a supernet beforehand for each new task is computationally prohibitive. In this work, we take an integrated approach in which we train a supernet on a large-scale dataset (e.g., ImageNet) once and couple it with our architecture search to quickly adapt it to a new task. An elaborated discussion connecting our method to existing approaches is provided in Section A.
Multi-Objective NAS: Methods that consider multiple objectives for designing hardware-specific models have also been developed. The objectives are optimized either through (i) scalarization, or (ii) Pareto-based solutions. The former include ProxylessNAS [26], MnasNet [27], ChamNet [29], MobileNetV3 [22], and FBNetV2 [32], which use a scalarized objective or an additional constraint to encourage high accuracy and penalize compute inefficiency at the same time, e.g., maximize Acc × (Latency/Target)^(−0.07). Conceptually, the search for architectures is still guided by a single objective and only one architecture is obtained per search. Empirically, multiple runs with different weightings of the objectives are needed to find an architecture with the desired trade-off, or multiple architectures with different complexities. Methods in the latter category include [25], [51], [52], [53], [54] and aim to approximate the entire Pareto-efficient frontier simultaneously, i.e., multiple architectures with different complexities are obtained in a single run. These approaches rely on heuristics (e.g., EA) to efficiently navigate the search space, allowing practitioners to visualize the trade-off between the objectives and to choose a suitable network a posteriori to the search. NAT falls into the latter category and uses an accuracy prediction model and weight sharing for efficient architecture transfer to new tasks.
3 PROPOSED APPROACH
Neural Architecture Transfer consists of three main components: an accuracy predictor, an evolutionary search routine, and a supernet. NAT starts with an archive A of architectures (subnets) created by uniform sampling from our search space. We evaluate the performance f_i of each subnet a_i using weights inherited from the supernet. The accuracy predictor is then constructed from the (a_i, f_i) pairs and (jointly with any additional objectives provided by the user) drives the subsequent many-objective evolutionary search towards optimal architectures. Promising architectures at the conclusion of the evolutionary process are added to the archive A. The (partial) weights of the supernet corresponding to the top-ranked subnets in the archive are fine-tuned. NAT repeats this process for a pre-specified number of iterations. At the conclusion, we output both the archive and the task-specific supernet. Networks that offer the best trade-off among the objectives can be post-selected from the archive. Detailed descriptions of each component of NAT are provided in the following subsections. Figure 1 and Algorithm 1 provide an overview of our entire approach.
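To make the interplay of the three components concrete, the following self-contained Python sketch mimics the loop of Algorithm 1 on synthetic data. The accuracy and #MAdds functions, the linear least-squares surrogate, and the mutation-only search are stand-ins chosen for brevity (NAT uses supernet-inherited weights, an RBF ensemble, and the crossover/mutation/selection operators of Sections 3.3-3.5); the supernet adaptation step is only indicated by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy search space: 22 integers, each indexing one of 9 options (cf. Fig. 2b).
N_VARS, N_OPTIONS = 22, 9

def sample_subnets(n):
    return rng.integers(0, N_OPTIONS, size=(n, N_VARS))

def toy_accuracy(x):
    # Stand-in for evaluating a subnet with weights inherited from the supernet.
    return 70.0 + 10.0 * np.sin(x @ np.linspace(0.1, 1.0, N_VARS)) - 0.05 * x.sum()

def toy_madds(x):
    # Stand-in for the second objective (#MAdds): larger option indices cost more.
    return 100.0 + 25.0 * x.sum()

def fit_predictor(X, y):
    # Linear least-squares surrogate (NAT uses an ensemble of RBFs instead).
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Q: np.hstack([Q, np.ones((len(Q), 1))]) @ w

def surrogate_search(predictor, parents, n_offspring=200, keep=20):
    # Mutate archive members and keep offspring that the surrogate predicts to be
    # accurate yet cheap -- a crude stand-in for the many-objective search.
    offspring = parents[rng.integers(0, len(parents), n_offspring)].copy()
    flips = rng.random(offspring.shape) < 0.1
    offspring[flips] = rng.integers(0, N_OPTIONS, flips.sum())
    score = predictor(offspring) - 0.01 * np.apply_along_axis(toy_madds, 1, offspring)
    return offspring[np.argsort(-score)[:keep]]

archive = sample_subnets(100)                                  # initial archive A
for it in range(5):
    acc = np.apply_along_axis(toy_accuracy, 1, archive)        # evaluate archive
    predictor = fit_predictor(archive, acc)                    # online accuracy predictor
    promising = surrogate_search(predictor, archive)           # surrogate-guided search
    archive = np.vstack([archive, promising])                  # add promising subnets
    # In NAT proper, the supernet weights used by the top-ranked subnets in the
    # archive would be fine-tuned here (supernet adaptation); omitted in this toy.
    print(f"iteration {it}: best accuracy so far = {acc.max():.2f}")
```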


Fig. 2: The architectures in our search space are variants of the MobileNetV2 family of models [22], [27], [28], [56]. (a) Each network consists of five stages, and each stage has two to four layers. Each layer is an inverted residual bottleneck block. The search space includes the input image resolution (R), the width multiplier (W), the number of layers in each stage, the expansion ratio (E, which sets the # of output channels) of the first 1 × 1 convolution and the kernel size (K) of the depth-wise separable convolution in each layer. (b) Networks are represented as 22-integer strings, where the first two integers correspond to resolution and width multiplier, and the rest correspond to the layers. Each value indicates a choice; e.g., the third integer (L1) taking a value of "1" corresponds to using an expansion ratio of 3 and a kernel size of 3 in layer 1 of stage 1.


3.2 Search Space and Encoding
The search for optimal network architectures can be performed over many different search spaces. The generality of the chosen search space has a major influence on the quality of the results that are attainable. We adopt a modular design for the overall structure of the network, consisting of a stem, multiple stages and a tail (see Fig. 2a). The stem and tail are common to all networks and are not searched. Each stage in turn comprises multiple layers, and each layer itself is an inverted residual bottleneck structure [56].
- Network: We search for the input image resolution and the width multiplier (a factor that uniformly scales the # of output channels of each layer [57]). Following previous work [27], [28], [31], we segment the CNN architecture into five sequentially connected stages. The stages gradually reduce the feature map size and increase the number of channels (Fig. 2a, left).
- Stage: We search over the number of layers, where only the first layer uses stride 2 if the feature map size decreases. We allow each stage to have a minimum of two and a maximum of four layers (Fig. 2a, middle).
- Layer: We search over the expansion ratio (between the # of output and input channels) of the first 1 × 1 convolution and the kernel size of the depth-wise separable convolution (Fig. 2a, right).

Fig. 3: Top path: the typical process of evaluating an architecture in NAS algorithms. Bottom path: the accuracy predictor bypasses the time-consuming components of evaluating a network's performance by directly regressing its accuracy f from a (the architecture in the encoded space).
Overall, we search over four primary hyperparameters of a CNN, i.e., the depth (# of layers), the width (# of channels), the kernel size, and the input resolution. The resulting volume of our search space is approximately 3.5 × 10^19 for each combination of image resolution and width multiplier. To encode these architectural choices, we use an integer string of length 22, as shown in Fig. 2b. The first two values represent the input image resolution and the width multiplier, respectively. The remaining 20 values denote the expansion ratio and kernel size settings for each of the 20 layers. The available options for the expansion ratio and kernel size are [3, 4, 6] and [3, 5, 7], respectively. It is worth noting that we sort the layer settings in ascending #MAdds order, which is beneficial to the mutation operator used in our evolutionary search algorithm.
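As a concrete illustration of this encoding, the snippet below decodes a 22-integer string into its architectural settings. The resolution and width-multiplier option tables and the exact ordering of the nine (expansion ratio, kernel size) pairs are assumptions made for the example; NAT sorts the layer options by #MAdds.

```python
from itertools import product

# Illustrative decoder for the 22-integer encoding of Fig. 2b. The option tables
# below are assumptions for the sake of a concrete example; the exact resolution
# and width choices, and the #MAdds-sorted ordering of the nine layer options,
# follow the paper only in spirit.
RESOLUTIONS = [192, 208, 224, 256]                   # assumed candidate resolutions
WIDTH_MULTS = [1.0, 1.2]                             # assumed width multipliers
LAYER_OPTIONS = list(product([3, 4, 6], [3, 5, 7]))  # 9 (expansion, kernel) pairs;
                                                     # NAT sorts these by #MAdds.

def decode(encoding):
    """Turn a 22-integer string into (resolution, width multiplier, layer settings)."""
    assert len(encoding) == 22
    resolution = RESOLUTIONS[encoding[0]]
    width_mult = WIDTH_MULTS[encoding[1]]
    # Layer values are 1-indexed into the option table ("1" -> E=3, K=3).
    layers = [LAYER_OPTIONS[v - 1] for v in encoding[2:]]
    return resolution, width_mult, layers

encoding = [2, 0] + [1] * 20       # smallest layer option everywhere
res, width, layers = decode(encoding)
print(res, width, layers[0])       # 224 1.0 (3, 3): expansion ratio 3, kernel size 3
```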
3.3 Accuracy Predictor
The main computational bottleneck of NAS arises from the nested nature of the bi-level optimization problem. The inner optimization requires the weights of a subnet to be thoroughly learned prior to evaluating its performance. Methods like weight-sharing [31], [46], [50] allow sampled subnets to inherit weights among themselves or from a supernet, avoiding the time-consuming process (typically requiring hours) of learning weights through SGD. However, weight-sharing alone still requires inference on validation data (typically requiring minutes) to assess performance. Therefore, simply having to evaluate the subnets can still render the overall process computationally prohibitive for methods [8], [27], [38] that sample thousands of architectures during search.

To mitigate the computational burden of fully evaluating the subnets, we adopt a surrogate accuracy predictor that regresses the performance of a sampled subnet without performing training or inference. By learning a functional relation between the integer strings (subnets in the encoded space) and the corresponding performance, this approach decouples the evaluation of an architecture from data processing (both SGD and inference). Consequently, the evaluation time reduces from hours or minutes to seconds. We illustrate this concept in Fig. 3. The effectiveness of this idea, however, is critically dependent on the quality of the surrogate model. Below we identify three desired properties of such a model:
1) Reliable prediction: high rank-order correlation between predicted and true performance.
2) Consistent prediction: the quality of the prediction should be consistent across different datasets.
3) Sample efficiency: minimizing the number of training examples necessary to construct an accurate predictor model, since each training sample requires costly training and evaluation of a subnet.

Fig. 4: Accuracy predictor performance as a function of training samples. For each model, we show the mean and standard deviation of the Spearman rank correlation on 11 datasets (Table 3). The size of the RBF ensemble is 500.
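Property (1) is what Fig. 4 measures via Spearman's rank correlation; the following minimal example (with synthetic accuracy values) shows the quantity being reported:

```python
from scipy.stats import spearmanr

# Synthetic example: predicted vs. measured top-1 accuracies for a few subnets.
predicted = [72.1, 74.8, 76.3, 75.0, 77.9]
measured  = [71.5, 75.2, 76.0, 74.1, 78.3]

rho, _ = spearmanr(predicted, measured)
print(f"Spearman rank correlation: {rho:.3f}")  # 1.0 only if the rankings coincide
```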
Current approaches [23], [29], [31] that use surrogate-based accuracy predictors, however, do not satisfy properties (1) and (3) simultaneously. For instance, PNAS [23] uses 1,160 subnets to build the surrogate but only achieves a rank-order correlation of 0.476. Similarly, OnceForAll [31] uses 16,000 subnets to build the surrogate. The poor sample complexity and rank-order correlation of these approaches is due to the offline learning of the surrogate model. Instead of focusing on models that are at the trade-off front of the objectives, these surrogate models are built for the entire search space. Consequently, these methods require a significantly larger and more complex surrogate model.
We overcome the aforementioned limitation by restricting the surrogate model to the part of the search space that constitutes the current objective trade-off front. Such a solution significantly reduces the sample complexity of the surrogate and increases the reliability of its predictions. We consider four low-complexity predictors, namely, Gaussian Process (GP) [29], Radial Basis Function (RBF) [45], Multilayer Perceptron (MLP) [23], and Decision Tree (DT) [58]. Empirically, we observe that RBFs are consistently better than the other three models when the # of training samples exceeds 100. To further improve the RBF's performance, especially in the high sample-efficiency regime, we construct an ensemble of RBF models. As outlined in Algorithm 2, each RBF model is constructed from a subset of samples and features randomly selected from the training instances. An ensemble of 500 RBF models yields a correlation between predicted and true accuracy that outperforms all four individual predictors (Fig. 4).
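A minimal sketch of this ensemble idea is given below. It assumes Gaussian-kernel RBF members, bandwidths set from the median pairwise distance, and simple averaging of member outputs; the subset sampling follows the spirit of Algorithm 2, not its exact pseudocode.

```python
import numpy as np

class RBFEnsemble:
    """Minimal bagged-RBF accuracy predictor: every member fits a Gaussian-kernel
    RBF interpolant on a random subset of training samples and encoding features."""

    def __init__(self, n_models=500, sample_frac=0.8, feature_frac=0.8, seed=0):
        self.n_models = n_models
        self.sample_frac = sample_frac
        self.feature_frac = feature_frac
        self.rng = np.random.default_rng(seed)
        self.members = []

    @staticmethod
    def _sq_dists(A, B):
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y, float)
        n, d = X.shape
        for _ in range(self.n_models):
            rows = self.rng.choice(n, max(2, int(self.sample_frac * n)), replace=False)
            cols = self.rng.choice(d, max(1, int(self.feature_frac * d)), replace=False)
            centers = X[np.ix_(rows, cols)]
            d2 = self._sq_dists(centers, centers)
            gamma = 1.0 / (np.median(d2[d2 > 0]) + 1e-12)   # per-member bandwidth
            weights = np.linalg.pinv(np.exp(-gamma * d2)) @ y[rows]
            self.members.append((cols, centers, gamma, weights))
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        preds = [np.exp(-g * self._sq_dists(X[:, cols], centers)) @ w
                 for cols, centers, g, w in self.members]
        return np.mean(preds, axis=0)   # ensemble prediction = mean over members

# Illustrative usage with synthetic (encoding, accuracy) pairs.
rng = np.random.default_rng(1)
X = rng.integers(1, 10, size=(130, 22)).astype(float)
y = 70.0 + X.mean(axis=1) + rng.normal(0, 0.2, size=130)
predictor = RBFEnsemble(n_models=50).fit(X[:100], y[:100])
print(np.corrcoef(predictor.predict(X[100:]), y[100:])[0, 1])  # held-out correlation
```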


Fig. 5: (a) Crossover operator: new offspring architectures are created by recombining integers from two parent architectures, with an equal probability of choosing from either parent. (b) Mutation operator: histograms showing the probabilities of mutated values, with the current value at 5, under different settings of the hyperparameter η_m.

For mutation, we use the polynomial mutation (PM) operator, applied to each integer. The PM operator inherits the parent-centric convention, in which the offspring are intentionally created around the parents. The centricity is controlled via an index hyperparameter η_m. In particular, high values of η_m tend to create mutated offspring close to the parent, and low values encourage mutated offspring to be further away from the parent architecture; see Fig. 5b for a visualization of the effect of η_m. It is worth noting that the PM operator was originally proposed for continuous optimization, where distances between variable values are naturally defined. In contrast, in the context of our encoding, the variables are categorical in nature, each indicating a particular layer hyperparameter. We therefore sort the layer options in ascending order of #MAdds, such that η_m now controls the difference in #MAdds between the parent and the mutated offspring.
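The sketch below shows a simplified polynomial mutation adapted to the integer encoding: a perturbation drawn from the PM distribution (controlled by eta_m) is added to the parent value, then rounded and clipped back onto the valid option indices. The bounds in the example are illustrative, and the boundary-aware refinement of the original PM operator is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def polynomial_mutation(parent, low, high, eta_m=20.0, prob=None):
    """Simplified polynomial mutation for an integer-encoded architecture.

    `parent`, `low`, `high` are integer arrays of equal length; `eta_m` is the
    distribution index (large -> offspring near the parent). Each position is
    mutated with probability `prob` (default 1/len(parent)), then rounded and
    clipped back onto the valid option indices.
    """
    parent = np.asarray(parent, dtype=float)
    low, high = np.asarray(low, float), np.asarray(high, float)
    prob = prob if prob is not None else 1.0 / len(parent)

    child = parent.copy()
    for i in range(len(parent)):
        if rng.random() > prob:
            continue
        u = rng.random()
        if u < 0.5:
            delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
        else:
            delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
        child[i] = parent[i] + delta * (high[i] - low[i])
    return np.clip(np.rint(child), low, high).astype(int)

# Example on the 22-integer encoding: resolution/width indices in {0..3}/{0..1},
# layer options indexed 1..9 and sorted by #MAdds as described in the text.
low    = np.array([0, 0] + [1] * 20)
high   = np.array([3, 1] + [9] * 20)
parent = np.array([2, 0] + [5] * 20)
print(polynomial_mutation(parent, low, high, eta_m=20.0))
```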
We apply PM to every member in the offspring population (created through crossover). We then merge the mutated offspring population with the parent population and select the top half using the many-objective selection operator described in Algorithm 4. This procedure creates the parent population for the next generation. We repeat this overall process for a pre-specified number of generations and output the parent population at the conclusion of the evolution.
3.5 Many-Objective Selection
In addition to high predictive accuracy, real-world applications
demand NAS algorithms to simultaneously balance a few other
conflicting objectives that are specific to the deployment scenarios.
For instance, mobile or embedded devices often have restrictions
in terms of model size, multiply-adds, latency, power consump
tion, and memory footprint. With no prior assumption on the
correlation among these objectives, a scalable (to the number
of objectives) selection is required to drive the search towards
the high dimensional Pareto front. In this work, we adopt the
reference point guided selection originally proposed in NSGA-III
[11], which has been shown to be effective in handling problems



A solution a i is said to be non-dominated if these conditions hold
against all the other solutions a j (with j = i ) in the entire search
space of a . With the above definition, we can sort solutions to different
ranks of domination, where solutions in the same rank are non-dominated with respect to each other, and there exists at least one solution in a lower rank that dominates any solution in a higher rank. Thus, a lower non-dominated rank is lexicographically better than a higher one. This process is referred to as non-dominated sorting, and it is the first step of the selection process. During the many-objective selection process, the lower-ranked sets are chosen one at a time until no more sets can be included while maintaining the population size. The final accepted set may have to be split so that only a part of it is chosen. For this purpose, we choose the most diverse subset based on a diversity-maintaining mechanism. We first create a set of reference directions from a set of uniformly distributed (in the (m − 1)-dimensional unit simplex) reference points using the Das-and-Dennis method [61]. Then we associate each solution to a reference direction based on the orthogonal distance of the solution from that direction. Finally, for every reference direction, we choose the closest associated solution in a systematic manner by adaptively computing a niche count ρ, so that every reference direction gets an equal opportunity to choose a representative closest solution in the selected population. The domination- and diversity-preserving procedures are easily scalable to any number of objectives and, importantly, are free from any user-defined hyperparameter. See Algorithm 4 for the pseudocode and Fig. 6 for a graphical illustration. A more elaborate discussion on the necessity of reference-point-based selection is provided in Section B.
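For reference, the sketch below implements the two ingredients described above, non-dominated sorting and association of solutions to Das-and-Dennis reference directions, on a toy two-objective example; the adaptive niche-count bookkeeping of Algorithm 4 and the exact ideal/nadir normalization are omitted.

```python
import numpy as np
from itertools import combinations

def non_dominated_sort(F):
    """Sort objective vectors F (n x m, minimization) into domination ranks."""
    n = len(F)
    dominates = lambda p, q: np.all(F[p] <= F[q]) and np.any(F[p] < F[q])
    ranks, remaining = [], set(range(n))
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        ranks.append(front)
        remaining -= set(front)
    return ranks

def das_dennis(m, divisions):
    """Uniformly distributed reference points on the (m-1)-simplex."""
    pts = []
    for bars in combinations(range(divisions + m - 1), m - 1):
        coords = np.diff((-1,) + bars + (divisions + m - 1,)) - 1
        pts.append(coords / divisions)
    return np.array(pts)

def associate(F, ref_dirs):
    """Associate each (normalized) solution with its closest reference direction
    by perpendicular distance, as used by the diversity-preserving selection."""
    Fn = (F - F.min(0)) / (np.ptp(F, axis=0) + 1e-12)   # crude normalization
    d = ref_dirs / np.linalg.norm(ref_dirs, axis=1, keepdims=True)
    proj = Fn @ d.T                                      # projection length per direction
    perp = np.linalg.norm(Fn[:, None, :] - proj[:, :, None] * d[None], axis=2)
    return perp.argmin(axis=1)                           # index of closest direction

# Toy usage: (classification error, #MAdds) for six candidate subnets.
F = np.array([[0.20, 300], [0.18, 450], [0.25, 200],
              [0.19, 320], [0.30, 150], [0.22, 500]], dtype=float)
print(non_dominated_sort(F))            # fronts, best (rank-0) first
print(associate(F, das_dennis(2, 4)))   # closest reference direction per solution
```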
3.6 Supernet Adaptation
Instead of training every architectures sampled during search from
scratch, NAS with weight sharing [24], [46] inherits weights from
previously-trained networks or from a supernet. Directly inheriting
the weights obviates the need to optimize the weights from scratch
and speeds up the search from thousands of GPU days to only a
few. In this work, we focus on the supernet approach [10], [31].
It involves first training a large network model (in which search
able architectures become subnets) prior to the search. Then the
performance of the subnets, evaluated with the inherited weights,
is used to guide the selection of architectures during search. The
key to the success of this approach is that the performance of the
subnets with the inherited weights be highly correlated with the
performance of the same subnet when thoroughly trained from
scratch. Satisfying this desideratum necessitates that the supernet
weights be learned in such a way that all subnets are optimized
simultaneously .
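The toy example below illustrates the weight-inheritance mechanism that makes such subnet evaluation cheap: a subnet layer reuses a centered crop of the supernet's largest kernel and the leading slice of its expanded channels. This mirrors the elastic kernel/width scheme of [31]; NAT's supernet may apply additional transformations, so the slicing rule here is only indicative.

```python
import numpy as np

# Toy illustration of weight inheritance from a supernet layer. The supernet
# stores the largest configuration (kernel 7x7, expansion ratio 6); a subnet
# inherits by taking a centered kernel crop and the leading expanded channels.
in_ch, max_exp, max_k = 16, 6, 7
supernet_w = np.random.randn(in_ch * max_exp, in_ch, max_k, max_k)  # [out, in, kH, kW]

def inherit(weights, expansion, kernel_size, in_channels=in_ch):
    """Slice supernet weights for a subnet with a smaller expansion ratio / kernel."""
    out_channels = in_channels * expansion          # keep only the leading channels
    start = (weights.shape[-1] - kernel_size) // 2  # centered kernel crop
    end = start + kernel_size
    return weights[:out_channels, :, start:end, start:end]

sub_w = inherit(supernet_w, expansion=3, kernel_size=3)
print(supernet_w.shape, "->", sub_w.shape)   # (96, 16, 7, 7) -> (48, 16, 3, 3)
```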
Existing methods [30], [53] attempt to achieve the above goal by imposing fairness in training the supernet, where the probability of training any particular subnet on each batch of data is uniform in expectation. However, we argue that simultaneously training all the subnets in the search space is practically infeasible and, more importantly, not necessary. Firstly, it is evident from existing NAS approaches [26], [62] that different objectives (#Params, #MAdds, latency on different hardware, etc.) require
