Self-Attention Mechanism & Multi-Head Attention Mechanism Explained

Research Background and Significance

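Before the Transformer was published, most sequence transduction models were built on complex recurrent neural networks (RNNs) or convolutional neural networks (CNNs), each containing an encoder and a decoder. A sequence transduction model takes a sequence as input and outputs a transformed sequence. Chinese-to-English translation is a typical example: a Chinese sentence goes in and the corresponding English sentence comes out. The best-performing sequence transduction models also combined an encoder and a decoder with an attention mechanism. The Transformer's contribution was to build a simple network architecture from attention alone, with no recurrence or convolutions at all. Its advantage is higher parallelism, which greatly reduces training time while also giving better results. At the time, the authors already anticipated that the Transformer could perform well on tasks involving images, audio, video, and other data that is less strictly sequential. In hindsight, that was a fairly accurate prediction.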

Before the Transformer, the two most common kinds of sequence transduction models were recurrent language models and encoder-decoder architectures. The model in common use was the RNN (recurrent neural network), whose architecture gives it two drawbacks: low parallelism and high memory overhead. As shown in the figure below, an RNN processes a sequence step by step from left to right. If the sequence is a sentence, it is processed one word at a time. For the $t$-th word the network produces a hidden state $h_t$, which is determined by the previous hidden state $h_{t-1}$ and the $t$-th word itself, so $h_t$ cannot be computed until the preceding $t-1$ words have been processed. This makes the RNN an inherently serial structure that cannot exploit the parallelism of hardware such as GPUs or NPUs. In addition, because an RNN passes historical information forward one step at a time, early information may be lost when the sequence is very long. Avoiding that loss requires a very large $h_t$, which in turn makes the memory overhead very high.
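The serial dependency described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `rnn_step` and `run_rnn` are hypothetical, and the weights are two scalars applied element-wise rather than the weight matrices a real RNN would learn.

```python
import math

def rnn_step(h_prev, x_t, w_h, w_x):
    """One recurrent step: h_t = tanh(w_h * h_{t-1} + w_x * x_t), element-wise."""
    return [math.tanh(w_h * h + w_x * x) for h, x in zip(h_prev, x_t)]

def run_rnn(xs, w_h=0.5, w_x=1.0):
    """Process the sequence strictly from left to right."""
    h = [0.0] * len(xs[0])      # initial hidden state h_0
    states = []
    for x_t in xs:              # step t needs the h produced by step t-1
        h = rnn_step(h, x_t, w_h, w_x)
        states.append(h)
    return states

# Three "words", each a 2-dimensional vector (toy values).
states = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because each iteration of the loop reads the `h` written by the previous one, the iterations cannot be run side by side, which is the parallelism problem the paragraph describes.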

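Before this paper, the attention mechanism had already been used successfully in encoder-decoder architectures, but usually together with an RNN. The Transformer kept attention and discarded the RNN structure, relying on the self-attention mechanism and proposing the multi-head attention mechanism. It targets sequential models directly, solves their low-parallelism problem, breaks out of an entrenched way of thinking, and points later researchers in a new direction.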

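Traditional Attention Mechanism

The traditional attention mechanism is mainly used with recurrent neural networks (RNNs), especially in sequence-to-sequence (Seq2Seq) models. Its steps are described in detail below: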

1. Input sequence and initialization

Input sequence: suppose we have an input sequence $X = [x_1, x_2, \ldots, x_n]$, where each $x_i$ is a vector representing one element of the sequence (for example, a word embedding).
Encoder: an RNN or LSTM is typically used to encode the input sequence into a sequence of hidden states $H = [h_1, h_2, \ldots, h_n]$.

2. Compute the attention scores

Target state: at each decoder time step $t$ we have a target state $s_t$, usually the decoder's hidden state.
Attention score computation: the attention score is computed from the target state $s_t$ and an encoder hidden state $h_i$. A common form is $\text{AttentionScore}(s_t, h_i) = v^T \tanh(W_s s_t + W_h h_i)$, where $W_s$ and $W_h$ are weight matrices and $v$ is a learnable vector.

3. Compute the attention weights

Normalization: the attention scores are normalized with the softmax function to obtain the final attention weights: $\text{AttentionWeight}(s_t, h_i) = \frac{\exp(\text{AttentionScore}(s_t, h_i))}{\sum_{j=1}^{n} \exp(\text{AttentionScore}(s_t, h_j))}$. These weights express how important each encoder hidden state is for the current decoder state.

4. Compute the context vector

Weighted sum: the attention weights are used to take a weighted sum of the encoder hidden states, giving the context vector: $\text{ContextVector} = \sum_{i=1}^{n} \text{AttentionWeight}(s_t, h_i) \cdot h_i$. This context vector carries the information from the input sequence that is relevant to the current decoder state.

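5. Output and iteration

Output generation: the context vector is combined with the decoder's current state to generate the output for the current time step.
Iteration: the whole process repeats until the complete output sequence has been generated.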
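Steps 2 to 4 above can be sketched for a single decoder time step as follows. This is a minimal pure-Python illustration, not a reference implementation: the helper names (`matvec`, `additive_score`, `softmax`, `context_vector`) are hypothetical, and the toy values for $W_s$, $W_h$ and $v$ are fixed by hand here, whereas in a real model they would be learned.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def additive_score(s_t, h_i, W_s, W_h, v):
    """Step 2: v^T tanh(W_s s_t + W_h h_i)."""
    a, b = matvec(W_s, s_t), matvec(W_h, h_i)
    return sum(vk * math.tanh(ak + bk) for vk, ak, bk in zip(v, a, b))

def softmax(scores):
    """Step 3: normalize the scores into weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, H):
    """Step 4: weighted sum of the encoder hidden states."""
    return [sum(w * h[d] for w, h in zip(weights, H)) for d in range(len(H[0]))]

# Toy values (assumptions): identity projections and an all-ones v.
I2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # encoder hidden states h_1..h_3
s_t = [1.0, 1.0]                            # decoder state at time step t

scores = [additive_score(s_t, h_i, I2, I2, v) for h_i in H]
weights = softmax(scores)
context = context_vector(weights, H)
```

With these toy values, $h_3$ receives the highest score, so it contributes the most to `context`. In a full decoder the same three calls would run once per output time step with a new `s_t`.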

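Multi-Head Attention Mechanism

In "Attention Is All You Need", the authors use the self-attention mechanism and propose the multi-head attention mechanism so that attention scores and attention weights can be computed for a large batch of sequence positions at once, regardless of how far apart those positions are in time or space. In the paper, multi-head attention can be divided into two parts: Scaled Dot-Product Attention and the multi-head mechanism.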

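Attention can be implemented as additive attention or dot-product attention. Both are commonly used and have similar theoretical complexity. The paper chooses dot-product attention because it can be implemented with highly optimized matrix multiplication, which makes it faster and more space-efficient in practice. The formula for Scaled Dot-Product Attention is: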

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:

$Q$ is the query matrix.
$K$ is the key matrix.
$V$ is the value matrix.
$d_k$ is the dimension of the key vectors. Dividing the dot products by $\sqrt{d_k}$ keeps the softmax out of the region where gradients vanish.
The $\text{softmax}$ function normalizes each query's dot products with all of the keys so that the resulting weights sum to 1.

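The core idea of this formula is to compute attention weights from the dot products of the queries and keys, then use those weights to take a weighted sum of the values, which gives the final output.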
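The formula can be written out directly. The sketch below uses plain Python lists so that every step is visible; the function names and the toy matrices are assumptions made for illustration, and a practical implementation would use a tensor library with batched matrix multiplication instead of loops.

```python
import math

def softmax(row):
    """Normalize one row of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: n x d_k, K: m x d_k, V: m x d_v."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One scaled dot product per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

Q = [[0.0, 0.0], [10.0, 0.0]]            # n = 2 queries, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0]]             # m = 2 keys
V = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]   # m = 2 values, d_v = 3
out = scaled_dot_product_attention(Q, K, V)
```

The first query is equally similar to both keys, so its output is the average of the two value rows. The second query matches the first key far more strongly, so its output is close to the first value row.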

The scaling factor $\sqrt{d_k}$ deserves a closer look. $d_k$ is the dimension of the queries and keys: $Q$ is an $n \times d_k$ matrix and $K$ is an $m \times d_k$ matrix. $V$ is an $m \times d_v$ matrix with the same number of rows as $K$, because keys and values come in pairs, and in the paper $d_v = d_k$. When $d_k$ is small, it makes little difference whether the scaling factor is used. When $d_k$ is large, however, the unscaled dot products grow large in magnitude and the output of the $\text{softmax}$ function is pushed toward the two extremes, 0 and 1. That is where we want the weights to end up after training (unimportant, low-confidence positions near 0 and important, high-confidence positions near 1), but once the softmax saturates its gradients become extremely small and training stalls. The scaling factor is there to avoid this. In the base Transformer, $d_{model} = 512$ and there are $h = 8$ heads, so $d_k = d_v = d_{model}/h = 64$. The flow of Scaled Dot-Product Attention is shown in the figure below.
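The effect of the scaling can be checked numerically. The sketch below assumes that the components of each query and key are independent draws from a standard normal distribution, which is an idealization and not something a trained model guarantees. Under that assumption the dot product has standard deviation about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1. The function name `dot_product_std` is hypothetical.

```python
import math
import random

def dot_product_std(d_k, scale, trials=2000, seed=0):
    """Empirical standard deviation of q . k for random q, k of dimension d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / trials
    return math.sqrt(sum((d - mean) ** 2 for d in dots) / trials)

raw_std = dot_product_std(64, scale=False)     # roughly sqrt(64) = 8
scaled_std = dot_product_std(64, scale=True)   # roughly 1
```

Scores with a spread of about 8 differ by several units between keys, which is enough to drive the softmax close to a one-hot output. Scores with a spread of about 1 leave it in a range where the gradients are still useful.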

The diagram includes a layer named Mask (opt.). Its purpose is this: when the sequence is $X = [x_1, x_2, \ldots, x_m, \ldots, x_n]$ but we only want to train on $X_p = [x_1, x_2, \ldots, x_m]$, the scores of the remaining positions are replaced with a mask value, which is a very large negative number. The value is not set to 0 because the $\text{softmax}$ exponentiates its inputs and $e^0 = 1$, so a score of 0 would still receive a nonzero weight. Only a very large negative number pushes the weight toward 0.
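The difference between the two fill values is easy to see in a small example. The sketch below is an illustration only: `masked_softmax`, its `keep` list, and the fill value `-1e9` are assumptions chosen for the demo, not values taken from the paper.

```python
import math

def masked_softmax(scores, keep, fill=-1e9):
    """Softmax over scores after replacing masked-out positions with `fill`.

    keep[i] is True where position i is allowed to receive attention.
    """
    filled = [s if k else fill for s, k in zip(scores, keep)]
    m = max(filled)
    exps = [math.exp(s - m) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0, 4.0]
keep = [True, True, False, False]    # only the first m = 2 positions are allowed

masked = masked_softmax(scores, keep)            # fill with a large negative number
leaky = masked_softmax(scores, keep, fill=0.0)   # fill with 0 instead
```

In `masked`, the last two weights are 0 and the first two share all of the probability mass. In `leaky`, the last two positions still receive a positive weight, which is exactly what the mask is supposed to prevent.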

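As for the multi-head mechanism, the authors argue that rather than performing a single attention operation, it is better to project Query, Key, and Value into lower-dimensional matrices h times, run h dot-product attention operations, and finally concatenate the results and project them back up to the original dimension, as shown in the figure below.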

The reason is that a single dot-product attention operation has no learnable parameters, yet we sometimes want the attention mechanism to have several different ways of measuring similarity. The authors' approach is to project $V$, $K$, and $Q$ into $h$ groups with linear layers. The weights $W$ in each $\text{Linear}$ layer are learnable, the hope being that they learn different projections suited to the similarity functions that different patterns need. This is somewhat like the idea of multiple output channels in a convolution.
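A compact sketch of the project, attend, concatenate, project pipeline is given below. It is written in pure Python for readability, with hypothetical helper names, and the projection matrices are random stand-ins for parameters that a real model would learn. The sizes ($d_{model} = 4$, $h = 2$) are toy values chosen to keep the example small.

```python
import math
import random

def matmul(A, B):
    """Plain matrix product of A (n x p) and B (p x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(Q, K, V, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the h head outputs along the feature axis, row by row.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(Q))]
    return matmul(concat, W_o)   # project back up to d_model

rng = random.Random(0)
d_model, h = 4, 2
d_k = d_model // h               # each head works in a lower dimension
heads = [tuple(random_matrix(rng, d_model, d_k) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(rng, h * d_k, d_model)
X = random_matrix(rng, 3, d_model)          # 3 positions
out = multi_head_attention(X, X, X, heads, W_o)
```

Each head sees the same inputs through its own projections, so each can score similarity differently. The output keeps the input shape of 3 positions by $d_{model}$ features.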

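Self-Attention Mechanism

For the self-attention mechanism, the figure above shows that the three inputs of the multi-head attention layer in the lower left all come from the same data copied three times: the Query is the input itself, the Key is the input itself, and the Value is the input itself. With n Queries we get n outputs, which means the Query and the Value have the same length. Ignoring the multi-head mechanism, each output is simply a weighted sum of the inputs, with the weights given entirely by the similarity between that input vector and every input vector. With the multi-head mechanism and the projections, the model learns $h$ different metric spaces, so the outputs differ somewhat.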
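The single-head case without projections can be sketched as below, to show that each output row really is a weighted sum of the input rows. The function name `self_attention` and the toy input are assumptions for illustration, and the learned projections of the real model are deliberately left out.

```python
import math

def self_attention(X):
    """Attention with Q = K = V = X and no learned projections."""
    d = len(X[0])
    weights = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Each output row is a weighted sum of the input rows.
    output = [[sum(w * x[j] for w, x in zip(row, X)) for j in range(d)]
              for row in weights]
    return weights, output

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 input vectors
weights, output = self_attention(X)
```

There are three inputs and three outputs, and every row of `weights` sums to 1, so each output is a convex combination of the inputs.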

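The input a neural network receives is often many vectors of varying sizes, and those vectors are related to one another. In practice, however, training cannot make full use of the relationships among the inputs, so the trained model performs very poorly. Examples include text-processing problems such as machine translation (a sequence-to-sequence problem in which the model itself decides how many labels to output), part-of-speech tagging (one label per vector), and semantic analysis (one label for multiple vectors). A fully connected network cannot establish correlations among multiple related inputs. The self-attention mechanism addresses this problem: its aim is to let the model attend to the correlations between different parts of the whole input.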

That concludes the discussion!