Original Text 48
7 Conclusion
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
Translation
7 Conclusion
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention. It replaces the recurrent layers (that is, recurrent neural network layers) most commonly used in encoder-decoder architectures with multi-headed self-attention.
Key Sentence Analysis
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
[Analysis]
This long, complex sentence forms a paragraph on its own. Its structure is: prepositional phrase + main clause + appositive + present-participle phrase.
The prepositional phrase at the start of the sentence serves as an adverbial that establishes the context.
The main clause is "we presented the Transformer," whose grammatical structure is subject-verb-object.
The phrase that follows, "the first sequence transduction model based entirely on attention," is an appositive that supplements the preceding object noun "the Transformer." Its head word is "model"; "the first" and "sequence transduction" are attributive modifiers of "model," and "based entirely on attention" is a past-participle phrase acting as a postmodifier of "model" as well. In effect, this past-participle phrase is equivalent to a passive relative clause: which/that was based entirely on attention.
The present-participle phrase "replacing...with..." supplements the description of "the Transformer" (namely, how it replaced the recurrent layers); the logical subject of "replacing" is "the Transformer." If we let A stand for "the recurrent layers" and B for "multi-headed self-attention," the core of the participle phrase reduces to "replacing A with B" (replace A with B). In addition, "most commonly used in encoder-decoder architectures" is a past-participle phrase acting as a postmodifier of the noun phrase "the recurrent layers." It, too, is equivalent to a passive relative clause: which/that are most commonly used in encoder-decoder architectures. Here "most commonly used" means just that, and the prepositional phrase "in encoder-decoder architectures" indicates where the recurrent layers are used.
[Reference Translation]
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention. It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
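To make the mechanism named in this sentence concrete, the following is a minimal sketch of multi-headed self-attention in NumPy. It is an illustrative reconstruction, not the paper's or tensor2tensor's implementation; the toy dimensions (d_model = 8, two heads, sequence length 5) and the helper names are assumptions chosen for the example.

# Minimal sketch of multi-head self-attention (illustrative, not the original code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_*: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads
    # Project the whole sequence at once, then split the feature dimension into heads.
    q = (x @ w_q).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention for every position in parallel --
    # no step-by-step recurrence over the sequence.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                 # (heads, seq, d_k)
    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 5
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads).shape)  # (5, 8)

Because every position attends to every other position in one matrix operation, there is no sequential recurrence over time steps, which is exactly what the sentence means by replacing recurrent layers with self-attention.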
Original Text 49
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
Translation
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. On the English-to-German task, our best model outperforms even all previously reported ensembles.
Original Text 50
We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours. The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
Translation
We are excited about the future of attention-based models and plan to apply them to other tasks. We intend to extend the Transformer to input and output modalities other than text, and to investigate local, restricted attention mechanisms in order to efficiently handle large inputs and outputs such as images, audio, and video. Making generation less sequential is another of our research goals. The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
Key Sentence Analysis
We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.
[Analysis]
The skeleton of the sentence is: We plan to extend the Transformer (to problems) and to investigate (local, restricted attention) mechanisms. The two coordinated infinitives joined by "and," "to extend..." and "to investigate...," both serve as objects of the predicate verb "plan." "extend...to..." means to extend something to something else; "local," "restricted," and "attention" jointly modify "mechanisms."
The present-participle phrase "involving input and output modalities other than text" is a postmodifier of "problems." It can be reduced to "involving A other than B" (involving A apart from B), where A stands for "input and output modalities" and B for "text."
The other infinitive phrase, "to efficiently handle large inputs and outputs...," is an adverbial of purpose, and "such as images, audio and video" at the end gives examples of the preceding noun phrase "large inputs and outputs."
[Reference Translation]
We plan to extend the Transformer to input and output modalities other than text and to investigate local, restricted attention mechanisms in order to efficiently handle large inputs and outputs such as images, audio, and video.
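As a companion to the sentence above, here is a hedged sketch of what a local, restricted attention pattern might look like: each position may only attend to neighbours within a fixed window, so the cost of the attention step grows with the window size rather than with the square of the sequence length. This is one plausible reading of "local, restricted attention," not the authors' implementation; the window size, the helper names, and the use of NumPy are assumptions made for illustration.

# Sketch of local (windowed) attention via a banded mask (illustrative only).
import numpy as np

def local_attention_mask(seq_len, window):
    """Return a (seq_len, seq_len) boolean mask; True means attention is allowed."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with out-of-window positions blocked."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores = np.where(mask, scores, -np.inf)   # disallowed positions get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 6, 4
q = k = v = rng.normal(size=(seq_len, d))
mask = local_attention_mask(seq_len, window=1)
out = masked_attention(q, k, v, mask)
print(mask.astype(int))  # banded pattern: each row has at most 2*window + 1 ones
print(out.shape)         # (6, 4)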