注:本文为《图解 Transformer》(The Illustrated Transformer)的译文,正文附英文原文;译文由机器翻译生成、未经校对,如有内容异常,请参阅英文原文。
The Illustrated Transformer
图解 Transformer
Jay Alammar
Written on June 27, 2018
Visualizing machine learning one concept at a time.
一次拆解一个机器学习概念,让其可视化。
Watch: MIT's Deep Learning State of the Art lecture referencing this post
相关视频:麻省理工学院《深度学习前沿》课程中对本文的引用讲解
Featured in courses at Stanford, Harvard, MIT, Princeton, CMU and others
本文被斯坦福大学、哈佛大学、麻省理工学院、普林斯顿大学、卡内基梅隆大学等院校的课程列为参考资料
Update: This post has now become a book! Check out LLM-book.com which contains (Chapter 3) an updated and expanded version of this post speaking about the latest Transformer models and how they've evolved in the seven years since the original Transformer (like Multi-Query Attention and RoPE Positional embeddings).
更新说明:本文现已整理成书!可访问 LLM-book.com 查看,书中第 3 章在本文基础上进行了更新和拓展,介绍了最新的 Transformer 模型,以及自初代 Transformer 问世后的七年间模型的演变历程,包括多查询注意力(Multi-Query Attention)和旋转位置编码(RoPE)等技术。
In the previous post, we looked at Attention -- a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer -- a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use The Transformer as a reference model to use their Cloud TPU offering. So let's try to break the model apart and look at how it functions.
在之前的文章中,我们介绍了注意力机制------这一现代深度学习模型中应用广泛的方法。注意力机制有效提升了神经机器翻译模型的性能。本文将讲解 Transformer 模型,这一借助注意力机制提升训练速度的模型。在特定任务中,Transformer 模型的表现优于谷歌神经机器翻译模型,而其最大的优势在于具备天然的并行训练特性。谷歌云也推荐将 Transformer 作为参考模型,搭配其云张量处理单元使用。接下来,我们将拆解该模型,解析其工作原理。
The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand to people without in-depth knowledge of the subject matter.
Transformer 模型首次提出于论文《Attention is All You Need》,基于 TensorFlow 的实现版本被收录在 Tensor2Tensor 工具包中。哈佛大学自然语言处理研究组还发布了该论文的解读指南,并附上了基于 PyTorch 的实现代码。本文将对相关概念做简化处理,逐一讲解,力求让非专业读者也能理解。
2025 Update: We've built a free short course that brings the contents of this post up-to-date with animations:
2025 年更新:我们制作了一门免费的短期课程,通过动画形式对本文内容进行了更新和呈现:
- Learn how ChatGPT and DeepSeek models work: How Transformer LLMs Work [Free Course] - YouTube
https://www.youtube.com/watch?v=k1ILy23t89E
A High-Level Look
整体概览
Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.
我们先将 Transformer 视作一个整体的黑箱。在机器翻译任务中,该模型接收一种语言的句子,输出其另一种语言的译文。

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.
打开这个如同擎天柱般的模型框架,我们能看到其包含编码模块、解码模块,以及连接两个模块的交互层。

The encoding component is a stack of encoders (the paper stacks six of them on top of each other -- there's nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.
编码模块由若干个编码器堆叠而成(论文中使用了 6 个编码器堆叠的结构,数字 6 并无特殊含义,研究者可根据需求尝试不同的堆叠数量),解码模块则由数量相同的解码器堆叠而成。

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:
所有编码器的结构完全相同(但彼此间不共享权重),每个编码器都包含两个子层:

The encoder's inputs first flow through a self-attention layer -- a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We'll look closer at self-attention later in the post.
编码器的输入首先经过自注意力层,该层能让编码器在对某个词进行编码时,同时关注输入句子中的其他词汇。本文后续将详细解析自注意力层的工作原理。
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
自注意力层的输出会传入前馈神经网络,且同一个前馈神经网络会独立作用于每个位置的特征向量。
The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
解码器同样包含自注意力层和前馈神经网络层,且在两个层之间增设了一个编码器-解码器注意力层,该层能让解码器关注输入句子的相关部分,其作用与序列到序列模型中的注意力机制类似。

Bringing The Tensors Into The Picture
张量的流转过程
Now that we've seen the major components of the model, let's start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.
了解了模型的主要组成后,我们接下来分析各类向量/张量在模块间的流转过程,以及训练完成的模型如何将输入转化为输出。
As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
与大多数自然语言处理任务的处理流程一致,我们首先通过词嵌入算法将每个输入词转化为向量。

Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
每个词被嵌入为 512 维的向量,本文将用简易的方框来表示这些向量。
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 -- In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below. The size of this list is a hyperparameter we can set -- basically it would be the length of the longest sentence in our training dataset.
词嵌入操作仅在最底层的编码器中进行。所有编码器的共性在于,都会接收一个由 512 维向量组成的序列:最底层编码器的输入为词嵌入向量,其余编码器的输入则为其直接下层编码器的输出。该向量序列的长度是一个可设置的超参数,通常设为训练数据集中最长句子的长度。
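As a minimal sketch of this step (NumPy standing in for a real framework, and the vocabulary size chosen arbitrarily rather than taken from the paper), the embedding is just a lookup into a trained matrix with one 512-dimensional row per vocabulary word:

```python
import numpy as np

d_model = 512                      # embedding size used in the paper
vocab_size = 10000                 # hypothetical vocabulary size for illustration
rng = np.random.default_rng(0)

# In a real model this matrix is learned during training; here it is random.
embedding_table = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

token_ids = np.array([5, 71, 233])          # made-up ids for "je", "suis", "étudiant"
x = embedding_table[token_ids]              # shape (3, 512): one 512-dim vector per word
print(x.shape)                              # (3, 512)
```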
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
输入序列中的词汇完成词嵌入后,每个词的向量都会依次经过编码器的两个子层。

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
由此我们能看到 Transformer 的一个关键特性:输入序列中每个位置的词,在编码器中都有独立的处理路径。这些路径在自注意力层中存在依赖关系,但在前馈神经网络层中无任何依赖,因此各路径在前馈神经网络层中可并行计算。
Next, we'll switch up the example to a shorter sentence and we'll look at what happens in each sub-layer of the encoder.
接下来,我们将用更简短的句子作为示例,解析编码器各子层的具体处理过程。
Now We're Encoding!
编码过程解析
As we've mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.
如前文所述,编码器接收一个向量序列作为输入,先将其传入自注意力层处理,再传入前馈神经网络层,最后将输出结果向上传递至下一个编码器。

The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.
每个位置的词向量先经过自注意力层的处理,再各自传入前馈神经网络层------所有词向量使用同一个前馈神经网络,且彼此独立计算。
Self-Attention at a High Level
自注意力机制概览
Don't be fooled by me throwing around the word "self-attention" like it's a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.
不要因为我频繁提及"自注意力"就误以为这是一个众所周知的概念,事实上我也是在阅读《Attention is All You Need》这篇论文时才首次接触到该概念。接下来,我们将提炼其主要工作原理。
Say the following sentence is an input sentence we want to translate:
假设我们要翻译以下这个输入句子:
The animal didn't cross the street because it was too tired
What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm.
这句话中的代词 it 指代什么?是街道还是动物?人类能轻易回答这个问题,但算法却难以判断。
When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
当模型处理 it 这个词时,自注意力机制能让模型将其与 animal 关联起来。
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
模型在处理输入序列中每个位置的词时,自注意力机制能让模型参考序列中其他位置的词,为当前词的编码提供线索,从而得到更优质的编码结果。
If you're familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it's processing. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.
如果你熟悉循环神经网络,就会知道循环神经网络通过维护隐藏状态,将已处理词汇/向量的特征融入当前处理的词汇/向量中。而自注意力机制,就是 Transformer 实现这一功能的方式------将对其他相关词汇的理解融入当前词汇的编码中。

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".
当我们在第 5 个编码器(堆叠结构中最顶层的编码器)中对 it 这个词进行编码时,注意力机制的一部分会关注 The Animal 这个短语,并将其部分特征融入 it 的编码结果中。
Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.
你可以查看 Tensor2Tensor 的交互式笔记本,在其中加载 Transformer 模型,并通过交互式可视化工具解析模型的工作过程。
Self-Attention in Detail
自注意力机制详解
Let's first look at how to calculate self-attention using vectors, then proceed to look at how it's actually implemented -- using matrices.
我们先讲解基于向量的自注意力计算方式,再介绍实际工程中采用的基于矩阵的实现方法。
The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
计算自注意力的第一步,是从编码器的每个输入向量(此处为词嵌入向量)生成三个新的向量,即查询向量(Query)、键向量(Key)和值向量(Value)。这三个向量由词嵌入向量分别与训练过程中得到的三个权重矩阵相乘得到。
Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don't HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
需要注意的是,这三个新向量的维度小于词嵌入向量的维度:查询、键、值向量的维度为 64,而词嵌入向量及编码器的输入、输出向量维度为 512。并非必须将其维度设置更小,这一架构设计是为了让多头注意力的计算量保持基本恒定。

Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
将向量 $x_1$ 与权重矩阵 $W_Q$ 相乘,得到该词对应的查询向量 $q_1$。最终,输入句子中的每个词都会被映射为对应的查询向量、键向量和值向量。
What are the "query", "key", and "value" vectors?
什么是查询向量、键向量和值向量?
They're abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you'll know pretty much all you need to know about the role each of these vectors plays.
它们是为了方便注意力机制的计算和理解而提出的抽象概念。当你阅读完下文的注意力计算过程后,就能理解这三个向量各自的作用。
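Before moving on to the scoring step, here is a rough NumPy sketch of this first step; the weights are random placeholders rather than trained parameters, and the two input rows stand for the example words "Thinking" and "Machines":

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(1)

x = rng.normal(size=(2, d_model))         # embeddings for "Thinking", "Machines"

# Learned projection matrices (random placeholders here, trained in the real model).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q = x @ W_Q    # queries, shape (2, 64)
k = x @ W_K    # keys,    shape (2, 64)
v = x @ W_V    # values,  shape (2, 64)
```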
The second step in calculating self-attention is to calculate a score. Say we're calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
计算自注意力的第二步是计算注意力得分。假设我们要计算示例中第一个词 Thinking 的自注意力得分,需要将该词与输入序列中的所有词进行相似度打分。该得分决定了模型在对某个位置的词进行编码时,对输入序列其他部分的关注程度。
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we're scoring. So if we're processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
注意力得分由当前词的查询向量与待打分词的键向量进行点积运算得到。例如,在计算第 1 个位置词的自注意力时,第一个得分是 $q_1$ 与 $k_1$ 的点积,第二个得分是 $q_1$ 与 $k_2$ 的点积。

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper -- 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.
计算自注意力的第三步和第四步,是将注意力得分除以 8(论文中使用的键向量维度为 64,8 是 64 的平方根,这一操作能让模型的梯度更稳定,也可选用其他数值,8 为默认值),再将结果传入 Softmax 函数。Softmax 函数会对得分做归一化处理,使所有得分均为正数且求和为 1。

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.
归一化后的 Softmax 得分,代表了输入序列中各词在当前位置的表达权重。显然,当前位置的词会拥有最高的权重,但有时关注与当前词相关的其他词,能让编码结果更优。
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
计算自注意力的第五步,是将每个值向量与对应的 Softmax 得分相乘(为后续的求和操作做准备)。这样做的目的是保留模型重点关注词汇的特征,同时弱化无关词汇的特征------例如将无关词汇的特征向量乘以 0.001 这样的极小值。
The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
计算自注意力的第六步,是将所有加权后的值向量求和,得到自注意力层在该位置(此处为第一个词的位置)的输出向量。

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let's look at that now that we've seen the intuition of the calculation on the word level.
以上就是基于向量的自注意力计算全过程,得到的输出向量将被传入前馈神经网络层。在实际工程实现中,为了提升计算效率,会采用矩阵形式完成上述计算。在理解了词级别的自注意力计算逻辑后,我们接下来讲解基于矩阵的计算方式。
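Before switching to matrices, here is a self-contained sketch of steps two through six for the first position, mirroring the description above (random placeholder weights; the division by √64 = 8 and the softmax are the parts specified in the paper):

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(1)
x = rng.normal(size=(2, d_model))              # embeddings for "Thinking", "Machines"
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
q, k, v = x @ W_Q, x @ W_K, x @ W_V

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Step 2: score the first word's query against every key.
scores = q[0] @ k.T                            # shape (2,)

# Steps 3-4: divide by sqrt(d_k) = 8, then softmax.
weights = softmax(scores / np.sqrt(d_k))       # positive, sums to 1

# Steps 5-6: weight each value vector by its score and sum them up.
z1 = weights @ v                               # shape (64,): self-attention output for position #1
```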
Matrix Calculation of Self-Attention
自注意力的矩阵计算方式
The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we've trained (WQ, WK, WV).
第一步是计算查询矩阵、键矩阵和值矩阵。将所有词嵌入向量组合为嵌入矩阵 $X$,再将其分别与训练得到的权重矩阵 $W_Q$、$W_K$、$W_V$ 相乘,即可得到对应的三个矩阵。

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure)
嵌入矩阵 $X$ 中的每一行对应输入句子中一个词的嵌入向量。我们能再次看到嵌入向量(维度 512,图中用 4 个方框表示)与查询/键/值向量(维度 64,图中用 3 个方框表示)的维度差异。
Finally , since we're dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
最后,基于矩阵的运算特性,我们可以将前文的第二步到第六步浓缩为一个公式,直接计算自注意力层的输出矩阵。

The self-attention calculation in matrix form
自注意力计算的矩阵形式表达式
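Written out, the formula that condenses steps two through six (Equation 1 in the paper) is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad d_k = 64
$$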
The Beast With Many Heads
多头注意力机制
The paper further refined the self-attention layer by adding a mechanism called "multi-headed" attention. This improves the performance of the attention layer in two ways:
论文中对自注意力层做了进一步优化,提出了多头注意力机制。该机制从两个方面提升了注意力层的性能:
-
It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we're translating a sentence like "The animal didn't cross the street because it was too tired", it would be useful to know which word "it" refers to.
提升模型对输入序列不同位置的关注能力。在上述的单头注意力示例中,输出向量 $z_1$ 虽然融合了其他所有词的少量编码信息,但仍可能被当前词本身的特征所主导。在翻译 "The animal didn't cross the street because it was too tired" 这类句子时,模型需要明确 it 的指代对象,多头注意力就能更好地满足这一需求。
-
It gives the attention layer multiple "representation subspaces". As we'll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
为注意力层提供多个特征表示子空间。多头注意力机制会设置多组独立的查询/键/值权重矩阵(Transformer 中使用了 8 个注意力头,因此每个编码器和解码器都对应 8 组权重矩阵),每组矩阵均随机初始化。经过训练后,每组矩阵会将输入的嵌入向量(或下层编码器/解码器的输出向量)映射到不同的特征表示子空间中。

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.
多头注意力机制中,每个注意力头都有独立的查询/键/值权重矩阵,因此会得到多组不同的查询/键/值矩阵。与单头注意力的计算方式一致,将嵌入矩阵 $X$ 与各组权重矩阵分别相乘,即可得到对应注意力头的查询/键/值矩阵。
If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.
使用 8 组不同的权重矩阵,分别执行上述的自注意力计算,最终会得到 8 个不同的输出矩阵 $Z$。

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices -- it's expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.
这就带来了一个问题:前馈神经网络层的输入并非 8 个矩阵,而是一个统一的矩阵(每个词对应一个向量)。因此,我们需要将这 8 个输出矩阵融合为一个矩阵。
How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.
具体的融合方式为:先将 8 个输出矩阵沿特征维度拼接,再将拼接后的矩阵与一个额外的权重矩阵 $W_O$ 相乘。

That's pretty much all there is to multi-headed self-attention. It's quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place
以上就是多头自注意力机制的主要内容。能看出整个过程涉及大量矩阵运算,接下来我们用一张图整合所有矩阵的运算关系,方便整体理解。

Now that we have touched upon attention heads, let's revisit our example from before to see where the different attention heads are focusing as we encode the word "it" in our example sentence:
了解了注意力头的概念后,我们回到之前的示例,看看模型在对 it 这个词进行编码时,不同的注意力头分别关注输入序列的哪些位置:

As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
模型在对 it 进行编码时,一个注意力头主要关注 the animal,另一个注意力头则关注 tired。从某种意义上来说,模型对 it 这个词的特征表示,融合了 animal 和 tired 两个词的部分特征。
If we add all the attention heads to the picture, however, things can be harder to interpret:
但如果将所有注意力头的关注位置都展示出来,整体的注意力分布会变得难以解读:

Representing The Order of The Sequence Using Positional Encoding
基于位置编码的序列顺序表示
One thing that's missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.
截至目前,我们讲解的模型架构中,缺少一个关键的模块:用于表示输入序列中词汇顺序的模块。
To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they're projected into Q/K/V vectors and during dot-product attention.
为了解决这一问题,Transformer 为每个词嵌入向量添加了一个位置编码向量。这些向量遵循特定的规律,模型能通过学习该规律判断每个词在序列中的位置,以及不同词之间的位置距离。这样做的主要逻辑是:将位置编码向量与词嵌入向量相加后,得到的融合向量在映射为 查询/键/值 向量并进行点积注意力计算时,能体现出词汇间的位置距离特征。

To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.
为了让模型感知词汇的序列顺序,我们为词嵌入向量添加位置编码向量,这些向量的数值遵循特定的规律。
If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:
假设词嵌入向量的维度为 4,对应的位置编码向量的数值分布如下:

A real example of positional encoding with a toy embedding size of 4
词嵌入维度为 4 时的位置编码示例。
What might this pattern look like?
位置编码的数值规律具体是怎样的?
In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we'd add to the embedding of the first word in an input sequence. Each row contains 512 values -- each with a value between 1 and -1. We've color-coded them so the pattern is visible.
在下图中,每一行对应一个位置的编码向量,第一行的向量将与输入序列中第一个词的嵌入向量相加。每个位置编码向量包含 512 个数值,数值范围在 -1 到 1 之间。我们对数值进行了颜色编码,以便更清晰地看到其分布规律。

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.
输入序列长度为 20、词嵌入维度为 512 时的位置编码示例(行代表词汇位置,列代表向量维度)。能看到向量在中间位置被分为左右两部分,左半部分的数值由正弦函数生成,右半部分的数值由余弦函数生成,将两部分拼接后,即可得到完整的位置编码向量。
The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
论文的 3.5 节详细给出了位置编码的计算公式,生成位置编码的代码可参考函数 get_timing_signal_1d()。这并非位置编码的唯一实现方式,但该方式的优势在于能适配训练集中未出现过的长序列------例如,让训练完成的模型翻译比训练集中所有句子都长的文本。
July 2020 Update: The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method shown in the paper is slightly different in that it doesn't directly concatenate, but interweaves the two signals. The following figure shows what that looks like. Here's the code to generate it:
2020 年 7 月更新:上文展示的位置编码是 Tensor2Tensor 框架中 Transformer 的实现方式,而论文中提出的实现方式略有不同------并非将正弦和余弦生成的向量直接拼接,而是将两个向量的数值交错融合。下图为论文中位置编码的数值分布,生成该图的代码可参考相关开源仓库:

The Residuals
残差连接与层归一化
One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
在继续讲解前,我们需要补充编码器架构的一个细节:编码器中的每个子层(自注意力层、前馈神经网络层)都配备了残差连接,且子层的输出会经过层归一化处理。

If we're to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:
我们用向量的流转过程展示自注意力层的残差连接和层归一化操作,具体如下:

This goes for the sub-layers of the decoder as well. If we're to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
解码器的每个子层也采用了相同的设计。若一个 Transformer 模型由 2 个编码器和 2 个解码器堆叠而成,其残差连接和层归一化的整体分布如下:

The Decoder Side
解码器的工作过程
Now that we've covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let's take a look at how they work together.
理解了编码器的主要概念后,解码器的各组件工作原理就不难掌握了。接下来,我们讲解解码器各组件的协同工作过程。
The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its "encoder-decoder attention" layer which helps the decoder focus on appropriate places in the input sequence:
编码器首先对输入序列进行处理,最顶层编码器的输出会被转化为一组键向量 $K$ 和值向量 $V$,供每个解码器的编码器-解码器注意力层调用,帮助解码器关注输入序列中的相关位置:

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).
编码阶段完成后,模型进入解码阶段。解码阶段的每一步都会输出目标序列中的一个元素(本示例中为英语译文的一个词)。
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
模型会重复上述解码步骤,直到输出代表解码结束的特殊符号。每一步的解码输出会作为下一时间步的输入,传入最底层的解码器,解码器的输出也会像编码器一样逐层向上传递。与编码器的输入处理方式一致,解码器的输入也会经过词嵌入和位置编码处理,以体现词汇的位置信息。

The self attention layers in the decoder operate in a slightly different way than the one in the encoder:
解码器中的自注意力层与编码器中的自注意力层工作方式略有不同:
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
解码器的自注意力层仅能关注目标序列中当前位置之前的词汇,实现方式为:在自注意力计算的 Softmax 步骤前,对未来位置的注意力得分进行掩码处理------将其设置为负无穷。
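A small sketch of that masking trick (illustrative only, not the Tensor2Tensor implementation): scores for future positions are set to -inf before the softmax, so their attention weights come out exactly zero.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.default_rng(3).normal(size=(seq_len, seq_len))   # raw q·k scores

# Mask out future positions: row i may only attend to columns 0..i.
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf

weights = softmax(scores)        # the upper triangle is exactly 0 after the softmax
```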
The "Encoder-Decoder Attention" layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
编码器-解码器注意力层的工作方式与多头自注意力层基本一致,唯一的区别在于:该层的查询矩阵由解码器的下层输出生成,而键矩阵和值矩阵则来自编码器堆叠结构的最终输出。
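The only structural change from self-attention is where Q, K, and V come from; a hedged NumPy sketch with placeholder weights and made-up sequence lengths:

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(4)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

encoder_out = rng.normal(size=(6, d_model))   # top-encoder output for a 6-word source sentence
decoder_x = rng.normal(size=(3, d_model))     # decoder states for the 3 target words produced so far

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))

Q = decoder_x @ W_Q            # queries come from the decoder layer below
K = encoder_out @ W_K          # keys come from the encoder stack's output
V = encoder_out @ W_V          # values come from the encoder stack's output

weights = softmax(Q @ K.T / np.sqrt(d_k))   # (3, 6): each target position attends over the source
context = weights @ V                        # (3, 64)
```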
The Final Linear and Softmax Layer
最终的线性层与 Softmax 层
The decoder stack outputs a vector of floats. How do we turn that into a word? That's the job of the final Linear layer which is followed by a Softmax Layer.
解码器堆叠结构的输出是一个浮点型向量,如何将其转化为具体的词汇?这一工作由模型最后的线性层和 Softmax 层完成。
The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
线性层是一个简单的全连接神经网络,其作用是将解码器的输出向量映射为维度更高的对数几率向量(logits 向量)。
Let's assume that our model knows 10,000 unique English words (our model's "output vocabulary") that it's learned from its training dataset. This would make the logits vector 10,000 cells wide -- each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
假设模型从训练数据集中学习到了 10000 个不同的英语词汇(即模型的目标词汇表),那么对数几率向量的维度就为 10000,向量中的每个元素对应词汇表中一个词汇的得分。这就是线性层输出结果的含义。
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
Softmax 层会将这些得分转化为概率值(所有概率值均为正数且求和为 1.0),模型会选择概率值最高的元素,其对应的词汇即为当前时间步的输出。

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.
该图从解码器堆叠结构的输出向量开始,展示了向量如何逐步转化为最终的输出词汇。
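A sketch of that final step, with random placeholder weights and the 10,000-word vocabulary used in the example above:

```python
import numpy as np

d_model, vocab_size = 512, 10000
rng = np.random.default_rng(5)

decoder_output = rng.normal(size=(d_model,))            # vector produced by the top of the decoder stack
W_vocab = rng.normal(size=(d_model, vocab_size)) * 0.02 # the final Linear layer's weights (placeholder)

logits = decoder_output @ W_vocab                       # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                    # softmax: all positive, sums to 1.0

predicted_id = int(np.argmax(probs))                    # index of the most probable word
```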
Recap Of Training
模型训练过程回顾
Now that we've covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.
讲解完训练完成的 Transformer 的前向传播过程后,我们接下来梳理模型的主要训练逻辑。
During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.
模型训练阶段,未训练的模型会执行与前向传播完全相同的计算过程。由于训练使用的是带标签的数据集,我们可以将模型的输出与真实的标签结果进行对比。
To visualize this, let's assume our output vocabulary only contains six words ("a", "am", "i", "thanks", "student", and "<eos>" (short for 'end of sentence')).
为了更直观地展示,我们假设模型的目标词汇表仅包含 6 个词汇:a、am、i、thanks、student,以及代表句子结束的特殊符号 <eos>。

The output vocabulary of our model is created in the preprocessing phase before we even begin training.
模型的目标词汇表在训练开始前的预处理阶段就已确定。
Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word "am" using the following vector:
确定目标词汇表后,我们可以用与词汇表维度相同的向量表示每个词汇,这种表示方式被称为独热编码。例如,词汇 am 可以用如下的独热向量表示:

Example: one-hot encoding of our output vocabulary
目标词汇表的独热编码示例。
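In code, the one-hot vector for "am" over this toy vocabulary is just a 1 in the slot for "am" and 0 everywhere else:

```python
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word):
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("am"))   # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```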
Following this recap, let's discuss the model's loss function -- the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.
梳理完上述基础概念后,我们讲解模型的损失函数------训练阶段需要优化的评价指标,通过优化该指标,让模型的预测结果逐渐接近真实值。
The Loss Function
损失函数
Say we are training our model. Say it's our first step in the training phase, and we're training it on a simple example -- translating "merci" into "thanks".
假设我们正在训练模型,训练的第一个示例是将法语词汇 merci 翻译为英语词汇 thanks。
What this means, is that we want the output to be a probability distribution indicating the word "thanks". But since this model is not yet trained, that's unlikely to happen just yet.
这意味着,我们期望模型的输出是一个以 thanks 为最高概率的概率分布,但由于模型尚未经过训练,其输出很难达到这一效果。

Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.
模型的参数(权重)均为随机初始化,因此未训练的模型输出的概率分布中,每个词汇的概率值都是随机的。我们将该概率分布与真实的概率分布对比,再通过反向传播算法调整模型的所有权重,让模型的输出逐渐接近期望结果。
How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback--Leibler divergence.
如何量化两个概率分布的差异?主要方法是计算二者的差值,具体可参考交叉熵和KL散度的相关理论。
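In practice the comparison is typically cross-entropy rather than a literal subtraction; a minimal sketch for a single time step over the toy vocabulary (the predicted numbers are made up):

```python
import numpy as np

target = np.array([0, 0, 0, 1, 0, 0], dtype=float)        # one-hot distribution for "thanks"
predicted = np.array([0.2, 0.1, 0.1, 0.4, 0.1, 0.1])       # what an untrained model might output

cross_entropy = -np.sum(target * np.log(predicted))        # = -log(0.4) ≈ 0.916
kl_divergence = np.sum(target * np.log(target.clip(1e-12) / predicted))  # equals cross-entropy here, since the target is one-hot
```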
But note that this is an oversimplified example. More realistically, we'll use a sentence longer than one word. For example -- input: "je suis étudiant" and expected output: "i am a student". What this really means, is that we want our model to successively output probability distributions where:
需要注意的是,这是一个高度简化的示例。实际的训练任务中,输入和输出通常为长句。例如,输入为法语句子 je suis étudiant,期望输出为英语句子 i am a student。这意味着我们期望模型能依次输出一系列概率分布,且满足以下要求:
-
Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 30,000 or 50,000)
每个概率分布由一个与词汇表维度相同的向量表示(上述简易示例中维度为 6,实际任务中通常为 30000 或 50000 左右)
-
The first probability distribution has the highest probability at the cell associated with the word "i"
第一个概率分布中,词汇 i 对应的位置概率值最高
-
The second probability distribution has the highest probability at the cell associated with the word "am"
第二个概率分布中,词汇 am 对应的位置概率值最高
-
And so on, until the fifth output distribution indicates the '<end of sentence>' symbol, which also has a cell associated with it from the 10,000 element vocabulary.
以此类推,直到第五个概率分布中,代表句子结束的 <eos> 符号对应的位置概率值最高(该符号在包含 10000 个词汇的真实词汇表中也有对应的位置)。

The targeted probability distributions we'll train our model against in the training example for one sample sentence.
单个训练样本对应的目标概率分布,模型的训练过程就是向该目标分布不断逼近的过程。
After training the model for enough time on a large enough dataset, we would hope the produced probability distributions would look like this:
当模型在足够大的数据集上训练足够长的时间后,我们期望模型输出的概率分布如下:

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.
理想情况下,训练完成的模型能输出我们期望的正确译文。当然,如果该例句出现在训练集中,其翻译结果并不能真实反映模型的泛化能力,这一点需要通过交叉验证来检验。需要注意的是,即使某个词汇并非当前时间步的理想输出,其对应的概率值也不为 0,这是 Softmax 函数的重要特性,能有效辅助模型的训练过程。
Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That's one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, 'I' and 'a' for example), then in the next step, run the model twice: once assuming the first output position was the word 'I', and another time assuming the first output position was the word 'a', and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3...etc. This method is called "beam search", where in our example, beam_size was two (meaning that at all times, two partial hypotheses (unfinished translations) are kept in memory), and top_beams is also two (meaning we'll return two translations). These are both hyperparameters that you can experiment with.
模型的输出为逐词生成,最直接的解码方式是在每个时间步选择概率值最高的词汇作为输出,其余词汇则被舍弃,这种方式被称为贪心解码。另一种更优的解码方式是束搜索:例如,在第一个时间步保留概率值最高的 2 个词汇(如 I 和 a),然后基于这两个词汇分别执行下一个时间步的解码,计算两个候选结果在前两个位置的整体误差,保留误差更小的结果;再对第二个时间步的结果重复上述操作,以此类推。在该示例中,束宽(beam_size)设为 2,代表模型在每个时间步都会保留 2 个候选的部分翻译结果,输出的最优译文数量(top_beams)也设为 2,代表模型最终会返回 2 个翻译结果。束宽和最优译文数量均为可调试的超参数。
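To make the two strategies concrete, here is a hedged sketch that assumes a hypothetical `step(prefix)` function returning the next-word probability distribution for a given partial output (standing in for a full decoder forward pass); it illustrates the idea rather than any reference implementation:

```python
import numpy as np

def greedy_decode(step, max_len, eos_id):
    """Pick the single most probable word at every time step."""
    output = []
    for _ in range(max_len):
        probs = step(output)                    # distribution over the vocabulary
        next_id = int(np.argmax(probs))
        output.append(next_id)
        if next_id == eos_id:
            break
    return output

def beam_search(step, max_len, eos_id, beam_size=2):
    """Keep the `beam_size` best partial translations at every time step."""
    beams = [([], 0.0)]                         # (partial output, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:       # finished hypotheses are carried over unchanged
                candidates.append((seq, score))
                continue
            probs = step(seq)
            for word_id in np.argsort(probs)[-beam_size:]:
                candidates.append((seq + [int(word_id)],
                                   score + float(np.log(probs[word_id]))))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams                                 # the top_beams candidate translations
```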
Go Forth And Transform
深入学习建议
I hope you've found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I'd suggest these next steps:
希望本文能帮助你入门 Transformer 的主要概念。如果你想深入学习,可参考以下学习资源:
-
Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
阅读《Attention is All You Need》原论文、谷歌官方发布的 Transformer 解读博客,以及 Tensor2Tensor 工具包的发布说明。
-
Watch Łukasz Kaiser's talk walking through the model and its details
观看 Łukasz Kaiser 讲解 Transformer 模型细节的演讲视频。
-
Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo
实操 Tensor2Tensor 仓库中提供的 Jupyter 交互式笔记本。
-
Explore the Tensor2Tensor repo.
研究 Tensor2Tensor 开源仓库的源码。
Follow-up works:
相关后续研究成果:
- Depthwise Separable Convolutions for Neural Machine Translation(深度可分离卷积在神经机器翻译中的应用)
- One Model To Learn Them All(一个模型通学所有任务)
- Discrete Autoencoders for Sequence Models(面向序列模型的离散自编码器)
- Generating Wikipedia by Summarizing Long Sequences(通过长序列摘要生成维基百科文本)
- Image Transformer(图像 Transformer)
- Training Tips for the Transformer Model(Transformer 模型的训练技巧)
- Self-Attention with Relative Position Representations(基于相对位置表示的自注意力机制)
- Fast Decoding in Sequence Models using Discrete Latent Variables(基于离散潜变量的序列模型快速解码方法)
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost(Adafactor:具有亚线性内存开销的自适应学习率优化器)
Acknowledgements
致谢
Thanks to Illia Polosukhin, Jakob Uszkoreit, Llion Jones, Lukasz Kaiser, Niki Parmar, and Noam Shazeer for providing feedback on earlier versions of this post.
感谢 Illia Polosukhin、Jakob Uszkoreit、Llion Jones、Lukasz Kaiser、Niki Parmar、Noam Shazeer 对本文初稿提出的修改意见。
Please hit me up on Twitter for any corrections or feedback.
若你发现本文的错误或有相关建议,可在 Twitter 上与我交流。
via:
- The Illustrated Transformer -- Jay Alammar -- Visualizing machine learning one concept at a time.
  https://jalammar.github.io/illustrated-transformer/
- The Illustrated Transformer | Hacker News
  https://news.ycombinator.com/item?id=18351674
- The Annotated Transformer
  https://nlp.seas.harvard.edu//2018/04/03/attention.html
- Translations: Arabic, Chinese (Simplified) 1, Chinese (Simplified) 2
- The Illustrated Transformer【译】- CSDN 博客(2018)
  https://blog.csdn.net/yujianmin1990/article/details/85221271
- 图解 Transformer | The Illustrated Transformer - CSDN 博客(2023)
  https://blog.csdn.net/qq_36667170/article/details/124359818