Attention plays a critical role in human visual experience. Furthermore, it has recently been demonstrated that attention can also play an important role when applying artificial neural networks to a variety of tasks from fields such as computer vision and NLP. In this work we show that, by properly defining attention for convolutional neural networks, we can actually use this type of information to significantly improve the performance of a student CNN by forcing it to mimic the attention maps of a powerful teacher network. To that end, we propose several novel methods of transferring attention, showing consistent improvement across a variety of datasets and convolutional neural network architectures. Code and models for our experiments are available at https://github.com/szagoruyko/attention-transfer.
This brings us to the main questions of this paper: how does attention differ within artificial vision systems, and can we use attention information to improve the performance of convolutional neural networks? More specifically, can a teacher network improve the performance of a student network by providing it with information about where it looks, i.e., where it concentrates its attention?
To study these questions, one first needs to properly specify how attention is defined w.r.t. a given convolutional neural network. To that end, we consider attention as a set of spatial maps that essentially try to encode the spatial areas of the input on which the network focuses most when taking its output decision (e.g., when classifying an image). Furthermore, these maps can be defined w.r.t. various layers of the network, so that they capture low-, mid-, and high-level representation information. More specifically, in this work we define two types of spatial attention maps: activation-based and gradient-based. We explore how both types of attention maps change across various datasets and architectures, and show that they contain valuable information that can be used to significantly improve the performance of convolutional neural network architectures (of various types and trained for various tasks). To that end, we propose several novel ways of transferring attention from a powerful teacher network to a smaller student network with the goal of improving the performance of the latter (Fig. 1).
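As a concrete illustration of the activation-based case, a spatial attention map can be obtained by collapsing the channel dimension of an activation tensor, e.g., by summing the absolute values of the channel activations raised to a power p (the exponent discussed later in the text). The following is a minimal PyTorch sketch; the function name and the l2 normalization of each flattened map are our illustration choices:

```python
import torch.nn.functional as F

def activation_attention_map(A, p=2):
    """Collapse a (N, C, H, W) activation tensor into N spatial
    attention maps by summing |A_c|**p over the channel dimension,
    then l2-normalizing each flattened map."""
    am = A.abs().pow(p).sum(dim=1)                       # (N, H, W)
    return F.normalize(am.view(am.size(0), -1), dim=1)   # (N, H*W)
```

Larger values of p put relatively more weight on the most discriminative spatial locations, a behaviour we return to when discussing Fig. 4.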
To summarize, the contributions of this work are as follows:
• We propose attention as a mechanism of transferring knowledge from one network to another
• We propose the use of both activation-based and gradient-based spatial attention maps
• We show experimentally that our approach provides significant improvements across a variety of datasets and deep network architectures, including both residual and non-residual networks
• We show that activation-based attention transfer gives better improvements than full-activation transfer, and can be combined with knowledge distillation
Due to the above fact, and because thin deep networks are less parallelizable than wider ones, we think that knowledge transfer needs to be revisited; taking an approach opposite to FitNets, we try to learn less deep student networks. The attention maps we use for transfer are similar to both the gradient-based and the activation-based maps mentioned above, and they play a role similar to the "hints" in FitNets, although we do not introduce new weights.
We also examined networks of the same architecture, width and depth, but trained with different frameworks and with significantly different performance. We found that the above statistics of hidden activations not only are spatially correlated with the predicted objects at the image level, but that these correlations also tend to be higher in networks with higher accuracy: stronger networks have peaks in attention where weaker networks do not (e.g., see Fig. 4). Furthermore, attention maps focus on different parts of the input at different layers of the network. In the first layers, neuron activation is high at low-level gradient points; in the middle layers, it is higher for the most discriminative regions such as eyes or wheels; and in the top layers, it reflects full objects. For example, the mid-level attention maps of a network trained for face recognition (Parkhi et al. 2015) will have higher activations around the eyes, nose and lips, while the top-level activations will correspond to the full face (Fig. 2).
To further illustrate the differences between these functions, we visualized the attention maps of three networks with sufficiently different classification performance: Network-In-Network (62% top-1 val accuracy), ResNet-34 (73% top-1 val accuracy) and ResNet-101 (77.3% top-1 val accuracy). In each network we took the last pre-downsampling activation maps: mid-level activations are shown on the left and top pre-average-pooling activations on the right of Fig. 4. The top-level maps are blurry because their original spatial resolution is 7 × 7. It is clear that the most discriminative regions have higher activation levels, e.g., the face of the wolf, and that shape details disappear as the parameter p (used as an exponent) increases.
In attention transfer, given the spatial attention maps of a teacher network (computed using any of the above attention mapping functions), the goal is to train a student network that will not only make correct predictions but will also have attention maps that are similar to those of the teacher.
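Concretely, a transfer loss of the following form can be used (a sketch consistent with the description above; here $\mathcal{L}(\mathbf{W}_S, x)$ denotes the standard task loss of the student, $Q_S^j$ and $Q_T^j$ denote the vectorized attention maps of the $j$-th student–teacher layer pair within a set $\mathcal{I}$ of transfer layers, and $\beta$ is a weighting hyperparameter):

$$\mathcal{L}_{AT} = \mathcal{L}(\mathbf{W}_S, x) + \frac{\beta}{2} \sum_{j \in \mathcal{I}} \left\| \frac{Q_S^j}{\|Q_S^j\|_2} - \frac{Q_T^j}{\|Q_T^j\|_2} \right\|_p ,$$

where normalizing each map by its $l_2$ norm keeps the transfer term comparable across layers and architectures.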
In general, one can place transfer losses on attention maps computed across several layers. For instance, in the case of ResNet architectures, one can consider the following two cases, depending on the depths of the teacher and the student (a code sketch of the resulting loss follows the list):
• Same depth: attention transfer is possible after every residual block
• Different depth: attention transfer on the output activations of each group of residual blocks
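A minimal PyTorch sketch of the second case, reusing the activation_attention_map function sketched earlier and assuming the student and teacher expose the output activations of each residual group (the names student_groups, teacher_groups and the β value are ours for illustration):

```python
def attention_transfer_loss(student_groups, teacher_groups, beta=1e3):
    """Sum of l2 distances between normalized student and teacher
    attention maps, one term per residual group. Assumes matching
    spatial resolutions at each paired group."""
    loss = 0.0
    for As, At in zip(student_groups, teacher_groups):
        Qs = activation_attention_map(As)   # (N, H*W), l2-normalized
        Qt = activation_attention_map(At)   # (N, H*W), l2-normalized
        loss = loss + (Qs - Qt).pow(2).mean()
    return beta / 2 * loss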
Attention transfer can also be combined with knowledge distillation (Hinton et al. 2015), in which case an additional term, corresponding to the cross entropy between the softened output distributions of the teacher and the student, simply needs to be added to the above loss. When combined, attention transfer adds very little computational cost, as the attention maps of the teacher can easily be computed during the forward pass that is needed for distillation anyway.
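For reference, a common form of the distillation term is sketched below (a hedged rendering of the Hinton et al. (2015) formulation; the temperature T and the weight alpha are hyperparameters chosen here for illustration):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Cross entropy on hard labels plus KL divergence between
    temperature-softened teacher and student distributions."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```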
We also propose to enforce horizontal flip invariance on gradient attention maps. To do that, we propagate both horizontally flipped images and the originals, backpropagate, and flip the resulting gradient attention maps back. We then add l2 losses on the obtained attention maps and outputs, and perform a second backpropagation:

$$\mathcal{L}_{sym}(\mathbf{W}, x) = \mathcal{L}(\mathbf{W}, x) + \mathcal{L}(\mathbf{W}, \mathrm{flip}(x)) + \beta \left\| \frac{\partial}{\partial x}\mathcal{L}(\mathbf{W}, x) - \mathrm{flip}\!\left(\frac{\partial}{\partial x}\mathcal{L}(\mathbf{W}, \mathrm{flip}(x))\right) \right\|_2 ,$$
where flip(x) denotes the flip operator. This is similar in spirit to the Group Equivariant CNN approach of Cohen & Welling (2016); however, here it is not a hard constraint. We experimentally find that this has a regularizing effect on training.
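A minimal PyTorch sketch of this double-backpropagation procedure, assuming a classification loss (the function name and the β default are ours for illustration):

```python
import torch
import torch.nn.functional as F

def flip_invariance_loss(model, x, y, beta=1e-3):
    """Classification losses on original and flipped inputs, plus an l2
    penalty between the gradient attention map of the original and the
    flipped-back gradient attention map of the flipped input."""
    x = x.requires_grad_(True)
    xf = torch.flip(x, dims=[3]).detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss_f = F.cross_entropy(model(xf), y)
    # create_graph=True keeps the graph so the symmetry term below can
    # itself be backpropagated (the "second backpropagation")
    g = torch.autograd.grad(loss, x, create_graph=True)[0]
    gf = torch.autograd.grad(loss_f, xf, create_graph=True)[0]
    sym = (g - torch.flip(gf, dims=[3])).pow(2).mean()
    return loss + loss_f + beta * sym
```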
We presented several ways of transferring attention from one network to another, with experimental results over several image recognition datasets. It would be interesting to see how attention transfer works in cases where spatial information is more important, e.g., object detection or weakly-supervised localization, which is something that we plan to explore in the future.
Overall, we think that our findings will help to further advance knowledge distillation, and the understanding of convolutional neural networks in general.