Attention plays a critical role in human visual experience. Furthermore, it has recently been demonstrated that attention can also play an important role when applying artificial neural networks to a variety of tasks from fields such as computer vision and NLP. In this work we show that, by properly defining attention for convolutional neural networks, we can actually use this type of information to significantly improve the performance of a student CNN by forcing it to mimic the attention maps of a powerful teacher network. To that end, we propose several novel methods of transferring attention, showing consistent improvement across a variety of datasets and convolutional neural network architectures. Code and models for our experiments are available at https://github.com/szagoruyko/attention-transfer.
This brings us to the main questions of this paper: how does attention differ within artificial vision systems, and can we use attention information to improve the performance of convolutional neural networks? More specifically, can a teacher network improve the performance of a student network by providing it with information about where it looks, i.e., where it concentrates its attention?
To study these questions, one first needs to properly specify how attention is defined w.r.t. a given convolutional neural network. To that end, we consider attention as a set of spatial maps that essentially try to encode the spatial areas of the input on which the network focuses most when taking its output decision (e.g., when classifying an image). Furthermore, these maps can be defined w.r.t. various layers of the network, so that they capture low-, mid-, and high-level representation information. More specifically, in this work we define two types of spatial attention maps: activation-based and gradient-based. We explore how both types of attention maps change across various datasets and architectures, and show that they contain valuable information that can be used to significantly improve the performance of convolutional neural network architectures (of various types and trained for various tasks). To that end, we propose several novel ways of transferring attention from a powerful teacher network to a smaller student network with the goal of improving the performance of the latter (Fig. 1).
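As a concrete illustration of the activation-based case, a spatial attention map can be obtained by collapsing the channel dimension of an activation tensor, e.g., by summing the absolute values of the channel activations raised to a power p (the exponent discussed later in the text). The following is a minimal PyTorch sketch; the function name and the l2 normalization of each flattened map are our illustration choices:

```python
import torch.nn.functional as F

def activation_attention_map(A, p=2):
    """Collapse a (N, C, H, W) activation tensor into N spatial
    attention maps by summing |A_c|**p over the channel dimension,
    then l2-normalizing each flattened map."""
    am = A.abs().pow(p).sum(dim=1)                       # (N, H, W)
    return F.normalize(am.view(am.size(0), -1), dim=1)   # (N, H*W)
```

Larger values of p put relatively more weight on the most discriminative spatial locations, a behaviour we return to when discussing Fig. 4.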
To summarize, the contributions of this work are as follows:
• We propose attention as a mechanism of transferring knowledge from one network to another
• We propose the use of both activation-based and gradient-based spatial attention maps
• We show experimentally that our approach provides significant improvements across a variety of datasets and deep network architectures, including both residual and non-residual networks
• We show that activation-based attention transfer gives better improvements than full-activation transfer, and can be combined with knowledge distillation
Due to the above fact, and because thin deep networks are less parallelizable than wider ones, we think that knowledge transfer needs to be revisited; taking an approach opposite to FitNets, we try to learn less deep student networks. The attention maps we use for transfer are similar to both the gradient-based and the activation-based maps mentioned above, and they play a role similar to the "hints" in FitNets, although we do not introduce new weights.
We also examined networks of the same architecture, width and depth, but trained with different frameworks and with significantly different performance. We found that the above statistics of hidden activations not only are spatially correlated with the predicted objects at the image level, but that these correlations also tend to be higher in networks with higher accuracy: stronger networks have peaks in attention where weaker networks do not (e.g., see Fig. 4). Furthermore, attention maps focus on different parts of the input at different layers of the network. In the first layers, neuron activation is high at low-level gradient points; in the middle layers, it is higher for the most discriminative regions such as eyes or wheels; and in the top layers, it reflects full objects. For example, the mid-level attention maps of a network trained for face recognition (Parkhi et al. 2015) will have higher activations around the eyes, nose and lips, while the top-level activations will correspond to the full face (Fig. 2).
To further illustrate the differences between these functions, we visualized the attention maps of three networks with sufficiently different classification performance: Network-In-Network (62% top-1 val accuracy), ResNet-34 (73% top-1 val accuracy) and ResNet-101 (77.3% top-1 val accuracy). In each network we took the last pre-downsampling activation maps: mid-level activations are shown on the left and top pre-average-pooling activations on the right of Fig. 4. The top-level maps are blurry because their original spatial resolution is 7 × 7. It is clear that the most discriminative regions have higher activation levels, e.g., the face of the wolf, and that shape details disappear as the parameter p (used as an exponent) increases.
In attention transfer, given the spatial attention maps of a teacher network (computed using any of the above attention mapping functions), the goal is to train a student network that will not only make correct predictions but will also have attention maps that are similar to those of the teacher.
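Concretely, a transfer loss of the following form can be used (a sketch consistent with the description above; here $\mathcal{L}(\mathbf{W}_S, x)$ denotes the standard task loss of the student, $Q_S^j$ and $Q_T^j$ denote the vectorized attention maps of the $j$-th student–teacher layer pair within a set $\mathcal{I}$ of transfer layers, and $\beta$ is a weighting hyperparameter):

$$\mathcal{L}_{AT} = \mathcal{L}(\mathbf{W}_S, x) + \frac{\beta}{2} \sum_{j \in \mathcal{I}} \left\| \frac{Q_S^j}{\|Q_S^j\|_2} - \frac{Q_T^j}{\|Q_T^j\|_2} \right\|_p ,$$

where normalizing each map by its $l_2$ norm keeps the transfer term comparable across layers and architectures.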
In general, one can place transfer losses on attention maps computed across several layers. For instance, in the case of ResNet architectures, one can consider the following two cases, depending on the depths of the teacher and the student (a code sketch of the resulting loss follows the list):
• Same depth: attention transfer is possible after every residual block
• Different depth: attention transfer on the output activations of each group of residual blocks
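A minimal PyTorch sketch of the second case, reusing the activation_attention_map function sketched earlier and assuming the student and teacher expose the output activations of each residual group (the names student_groups, teacher_groups and the β value are ours for illustration):

```python
def attention_transfer_loss(student_groups, teacher_groups, beta=1e3):
    """Sum of l2 distances between normalized student and teacher
    attention maps, one term per residual group. Assumes matching
    spatial resolutions at each paired group."""
    loss = 0.0
    for As, At in zip(student_groups, teacher_groups):
        Qs = activation_attention_map(As)   # (N, H*W), l2-normalized
        Qt = activation_attention_map(At)   # (N, H*W), l2-normalized
        loss = loss + (Qs - Qt).pow(2).mean()
    return beta / 2 * loss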
Attention transfer can also be combined with knowledge distillation (Hinton et al. 2015), in which case an additional term, corresponding to the cross entropy between the softened output distributions of the teacher and the student, simply needs to be added to the above loss. When combined, attention transfer adds very little computational cost, as the attention maps of the teacher can easily be computed during the forward pass that is needed for distillation anyway.
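For reference, a common form of the distillation term is sketched below (a hedged rendering of the Hinton et al. (2015) formulation; the temperature T and the weight alpha are hyperparameters chosen here for illustration):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Cross entropy on hard labels plus KL divergence between
    temperature-softened teacher and student distributions."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```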
We also propose to enforce horizontal flip invariance on gradient attention maps. To do that, we propagate both horizontally flipped images and the originals, backpropagate, and flip the resulting gradient attention maps back. We then add l2 losses on the obtained attention maps and outputs, and perform a second backpropagation:

$$\mathcal{L}_{sym}(\mathbf{W}, x) = \mathcal{L}(\mathbf{W}, x) + \mathcal{L}(\mathbf{W}, \mathrm{flip}(x)) + \beta \left\| \frac{\partial}{\partial x}\mathcal{L}(\mathbf{W}, x) - \mathrm{flip}\!\left(\frac{\partial}{\partial x}\mathcal{L}(\mathbf{W}, \mathrm{flip}(x))\right) \right\|_2 ,$$
where flip(x) denotes the flip operator. This is similar in spirit to the Group Equivariant CNN approach of Cohen & Welling (2016); however, here it is not a hard constraint. We experimentally find that this has a regularizing effect on training.
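A minimal PyTorch sketch of this double-backpropagation procedure, assuming a classification loss (the function name and the β default are ours for illustration):

```python
import torch
import torch.nn.functional as F

def flip_invariance_loss(model, x, y, beta=1e-3):
    """Classification losses on original and flipped inputs, plus an l2
    penalty between the gradient attention map of the original and the
    flipped-back gradient attention map of the flipped input."""
    x = x.requires_grad_(True)
    xf = torch.flip(x, dims=[3]).detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss_f = F.cross_entropy(model(xf), y)
    # create_graph=True keeps the graph so the symmetry term below can
    # itself be backpropagated (the "second backpropagation")
    g = torch.autograd.grad(loss, x, create_graph=True)[0]
    gf = torch.autograd.grad(loss_f, xf, create_graph=True)[0]
    sym = (g - torch.flip(gf, dims=[3])).pow(2).mean()
    return loss + loss_f + beta * sym
```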
We presented several ways of transferring attention from one network to another, with experimental results over several image recognition datasets. It would be interesting to see how attention transfer works in cases where spatial information is more important, e.g., object detection or weakly-supervised localization, which is something that we plan to explore in the future.
Overall, we think that our findings will help to further advance knowledge distillation, and the understanding of convolutional neural networks in general.