Machine Learning Weekly Report (Transformer Study 1)

Table of Contents

- Abstract
- [1 Transformer](#1 Transformer)
- [1.1 Data, Input Processing, and Embedding](#1.1 Data, Input Processing, and Embedding)
- [1.2 Positional Encoding](#1.2 Positional Encoding)
- Summary

Abstract

This article introduces the input structure of the Transformer model, covering the role and implementation of the Embedding layer and positional encoding. It explains how raw data is converted into embedding vectors that the model can process, and why positional encoding matters for capturing order information in a sequence. A concrete example shows how to preprocess data to meet the Transformer's input requirements, and made-up data is used to make each component's computation and working principle easier to follow.

1 Transformer

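Figure: overall structure of the Transformer

The Transformer is a seq2seq model with two main parts, an Encoder and a Decoder. As the figure above shows, the Encoder part is a stack of 6 identical encoders and the Decoder part is a stack of 6 identical decoders. Unlike an encoder, every decoder also receives the output of the last encoder.

The depth is a design choice: the paper uses 6 encoder layers and the same number of decoder layers, and you can try other depths in practice.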

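An encoder's input first flows into a Self-Attention layer. This layer lets the encoder use information from the other words in the input sentence while encoding a particular word (when translating a word, we look not only at that word but also at the words around it). The output of the Self-Attention layer then flows into a feed-forward network. All encoders have the same structure, but each one has its own weights.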
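The data flow described above (Self-Attention, then a feed-forward network) can be sketched in a few lines of plain Python. This is a minimal sketch under several assumptions that are not in the text: a single attention head with scaled dot-product attention, random weights, a ReLU feed-forward network, and no residual connections or layer normalization. The sizes `seq_len`, `d_model`, and `d_ff` and all helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def self_attention(x, wq, wk, wv):
    """Every position attends to every position of the same sequence."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose of k
    scores = matmul(q, k_t)                        # (seq_len x seq_len)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def feed_forward(x, w1, w2):
    """Position-wise network: linear, ReLU, linear (biases omitted)."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

rng = random.Random(0)
seq_len, d_model, d_ff = 3, 4, 8                   # toy sizes, chosen arbitrarily
x = random_matrix(seq_len, d_model, rng)           # stand-in for the embedded input
wq, wk, wv = (random_matrix(d_model, d_model, rng) for _ in range(3))
w1, w2 = random_matrix(d_model, d_ff, rng), random_matrix(d_ff, d_model, rng)

attended, weights = self_attention(x, wq, wk, wv)  # step 1: Self-Attention
out = feed_forward(attended, w1, w2)               # step 2: feed-forward network
print(len(out), len(out[0]))                       # same shape as the input: 3 4
```

Each row of `weights` sums to 1 and says how much one position draws on every other position, which is the "looking at other words" behaviour described above.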

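A decoder has the same two layers as an encoder, with one more attention layer between them (the Encoder-Decoder Attention), which helps the decoder focus on the relevant parts of the input sentence (similar to attention in earlier seq2seq models).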

1.1 Data, Input Processing, and Embedding

We use Chinese-to-English translation as the running example:

Preprocess the data:

- Convert the strings to integer ids
- Filter sentences by length
- Add start and end tokens
- Split into mini-batches and pad each batch
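The four steps above can be sketched as follows. This is an illustrative sketch, not the author's pipeline: the toy sentences, character-level tokenization, the special-token ids `PAD`, `BOS`, `EOS`, and the helper names are all assumptions.

```python
# Hypothetical special-token ids; real tokenizers may reserve different values.
PAD, BOS, EOS = 0, 1, 2

def build_vocab(sentences):
    """Assign an integer id to every distinct token, after the reserved ids."""
    vocab = {"<pad>": PAD, "<bos>": BOS, "<eos>": EOS}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab

def preprocess(sentences, vocab, max_len, batch_size):
    # Step 1: convert strings to integer ids.
    ids = [[vocab[token] for token in sentence] for sentence in sentences]
    # Step 2: drop sentences longer than max_len tokens.
    kept = [seq for seq in ids if len(seq) <= max_len]
    # Step 3: wrap each sentence in start and end tokens.
    wrapped = [[BOS] + seq + [EOS] for seq in kept]
    # Step 4: group into mini-batches and pad to the longest sequence in each batch.
    batches = []
    for start in range(0, len(wrapped), batch_size):
        chunk = wrapped[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [PAD] * (width - len(seq)) for seq in chunk])
    return batches

# Toy source-side data, tokenized per character (an assumption for brevity).
sentences = [list("我爱你"), list("你好"), list("今天天气非常好啊")]
vocab = build_vocab(sentences)
batches = preprocess(sentences, vocab, max_len=5, batch_size=2)
print(batches)  # [[[1, 3, 4, 5, 2], [1, 5, 6, 2, 0]]]
```

The 8-character sentence is filtered out by `max_len=5`, and the shorter of the two remaining sentences receives one `PAD` id so that both rows in the batch have the same length.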

Pick one of the batches to study.

Apply Embedding, so that each token (here, each character) is represented by a 4-dimensional vector.
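An embedding layer is essentially a lookup table with one row per token id. The sketch below shows that lookup with 4-dimensional vectors. The id batch, the vocabulary size of 7, and the random initialization are made up for illustration; in a real model the table is a trainable parameter.

```python
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """One random d_model-dimensional row per token id (stand-in for trained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(batch, table):
    """Replace every token id in the batch with its row of the table."""
    return [[table[token_id] for token_id in seq] for seq in batch]

# Hypothetical batch of 2 padded sentences, 5 token ids each.
batch = [[1, 3, 4, 5, 2], [1, 5, 6, 2, 0]]
table = make_embedding_table(vocab_size=7, d_model=4)
embedded = embed(batch, table)

# Shape goes from (batch, seq_len) to (batch, seq_len, d_model).
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 5 4
```

The same id always maps to the same vector: id 5 appears in both sentences, so `embedded[0][3]` and `embedded[1][1]` are identical.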

1.2 Positional Encoding

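So far, the model we have described is missing one thing: a way to represent the order of the words in the sequence. To address this, the Transformer adds a vector to each input word embedding. These vectors follow a specific pattern (in the original paper, fixed sinusoids rather than learned weights) that helps the model determine the position of each word, or the distance between different words in the sequence.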

Positional encoding formulas:

$$PE_{(pos,2i)}=\sin\left(pos/10000^{2i/d_{model}}\right)$$
$$PE_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d_{model}}\right)$$

together with the angle-addition identities

$$\cos(\alpha+\beta)=\cos(\alpha)\cos(\beta)-\sin(\alpha)\sin(\beta)$$

$$\sin(\alpha+\beta)=\sin(\alpha)\cos(\beta)+\cos(\alpha)\sin(\beta)$$

Here $pos$ is the position in the sequence and $i$ indexes the dimension pair. These functions let the model learn relative positions between tokens: for any fixed offset $k$, $PE_{(pos+k)}$ can be written as a linear function of $PE_{(pos)}$ whose coefficients are the entries of $PE_{(k)}$:

$$PE_{(pos+k,2i)}=PE_{(pos,2i)}\times PE_{(k,2i+1)}+PE_{(pos,2i+1)}\times PE_{(k,2i)}$$
$$PE_{(pos+k,2i+1)}=PE_{(pos,2i+1)}\times PE_{(k,2i+1)}-PE_{(pos,2i)}\times PE_{(k,2i)}$$
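The two relations above are just the angle-addition identities applied to the angle $pos/10000^{2i/d_{model}}$, so they can be checked numerically. The sketch below does this for a few arbitrarily chosen values of $pos$, $k$, and $i$; the helper names `pe_pair` and `shift_error` are hypothetical.

```python
import math

def pe_pair(pos, i, d_model):
    """Return (PE(pos, 2i), PE(pos, 2i+1)): the sine and cosine of the same angle."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)

def shift_error(pos, k, i, d_model):
    """Largest absolute gap between PE(pos+k) and its expansion in PE(pos) and PE(k)."""
    sin_p, cos_p = pe_pair(pos, i, d_model)
    sin_k, cos_k = pe_pair(k, i, d_model)
    sin_pk, cos_pk = pe_pair(pos + k, i, d_model)
    even_gap = abs(sin_pk - (sin_p * cos_k + cos_p * sin_k))  # dimension 2i
    odd_gap = abs(cos_pk - (cos_p * cos_k - sin_p * sin_k))   # dimension 2i+1
    return max(even_gap, odd_gap)

d_model = 512
for pos, k in [(3, 1), (2, 2), (10, 7)]:
    for i in (0, 1, 100, 255):
        # Only floating-point rounding error should remain.
        assert shift_error(pos, k, i, d_model) < 1e-12
print("linear relation holds for all tested (pos, k, i)")
```

Because the coefficients depend only on $k$, a shift by a fixed offset acts on each (sine, cosine) pair as the same rotation regardless of $pos$.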

The figure shows a 512-dimensional embedding vector. Computing i = dim_index // 2 gives 0, 0, 1, 1, ..., 255, 255.
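The index mapping and the sinusoidal formulas can be put together in a short sketch. It uses `d_model=512` as in the figure; `max_len=10` and the function name `positional_encoding` are arbitrary choices made for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal table with max_len rows: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(max_len):
        row = []
        for dim_index in range(d_model):
            i = dim_index // 2  # 0, 0, 1, 1, ...: each sin/cos pair shares one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if dim_index % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pair_index = [dim_index // 2 for dim_index in range(512)]
print(pair_index[:4], pair_index[-2:])  # [0, 0, 1, 1] [255, 255]

pe = positional_encoding(max_len=10, d_model=512)
print(pe[0][:4])  # position 0: sin(0) and cos(0) alternate -> [0.0, 1.0, 0.0, 1.0]
```

The resulting table has the same width as the embeddings, so row `pos` can be added element-wise to the embedding of the token at position `pos`.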

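The figure makes it clear that the positional encoding value at pos=4, i=0 is determined by four other values: by the formulas above, these are the sine and cosine entries, at the same i, of two positions whose indices sum to 4.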

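Summary

By analyzing the Transformer's input structure and its core components, the Embedding layer and positional encoding, this article shows how the Transformer handles sequence data. The example data makes the role of each module and the details of its computation clearer. This groundwork helps in understanding the Transformer's internal mechanics and prepares for later work on model optimization and application. Overall, Embedding and positional encoding are key to understanding and using the Transformer.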