Transformer-BERT---Scattered Notes---MLM, NSP

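These are scattered notes I took while digging into BERT after learning the Transformer architecture. I will tidy them up and add more when I have time, and I hope to eventually cover the whole Transformer series here.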

1. Self-supervised learning

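Unlike the original Transformer, BERT is pre-trained on a large amount of unlabeled data and then fine-tuned on a small amount of labeled data for each downstream task. The pre-training stage relies on self-supervised learning.

Put plainly, self-supervised learning attaches an auxiliary (pretext) task to the raw data so that the labels can be generated from the data itself.

Here are a few simple examples of common self-supervised tasks. (PS: BERT uses MLM, which is explained in the last example.)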

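1.1 Images

1.1.1 Inpainting

Cut a patch out of an image and have the model fill it in.
Input: the image with a patch removed
Output: the predicted content of the missing patch
Label: the patch that was cut out of the original image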

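1.1.2 Jigsaw

Take a patch A from the image and one of its neighbouring patches B as input, and predict the position of B relative to A.
Input: (patch A) + (patch B)
Output: an integer from 1 to 8 giving the position of B relative to A
Label: 5 (the patch numbered 5 in the original image)

Auxiliary tasks of this kind train the model to recognise where local features sit relative to one another.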
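To make the 1-to-8 label concrete, here is a minimal sketch of how such a label could be computed from patch coordinates. The numbering scheme (neighbours counted in reading order, skipping A itself) and the name relative_position_label are my own assumptions, since the notes above do not pin down the numbering.

```python
# Offsets (row, col) of the 8 neighbours around patch A, in reading order.
# ASSUMPTION: this ordering is illustrative; a real dataset may number them differently.
NEIGHBOUR_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1),
    (0, -1),           (0, 1),
    (1, -1),  (1, 0),  (1, 1),
]

def relative_position_label(pos_a, pos_b):
    """Return a label in 1..8 for where patch B sits relative to patch A.

    pos_a and pos_b are (row, col) grid coordinates. Raises ValueError
    if B is not one of the 8 neighbours of A.
    """
    offset = (pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
    return NEIGHBOUR_OFFSETS.index(offset) + 1

# B directly to the right of A
label = relative_position_label((1, 1), (1, 2))
print(label)
```

Under this assumed numbering, the patch directly to the right of A gets label 5.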

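1.2 Text

1.2.1 Cloze (fill in the blank)

Simply put, remove one or more words from the raw text and have the model fill them back in.

Original: All the world's a stage, and all the men and women merely players.
Input: All the world's a stage, and all the __ and women merely players.
Output: the predicted word
Label: men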

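1.2.2 Masked Language Model (MLM) (this is the important one)

MLM randomly selects the tokens to mask, about 15% of them. Its main purpose is to let the model pick up semantics and syntax.

PS: because the selection is random, implementations usually add a parameter such as max_pred that caps how many tokens can be masked in one sentence.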

Original: All the world's a stage, and all the men and women merely players.
Input: All the world's a stage, and all the [MASK] and [MASK] merely players.
Output: the predicted words
Label: men, women
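As a rough illustration of the selection step, the sketch below picks about 15% of the positions and caps the count with max_pred. The helper name choose_mask_positions, the whitespace tokenisation, and the "at least one position" rule are assumptions of mine, not something taken from the BERT code.

```python
import random

def choose_mask_positions(tokens, mask_ratio=0.15, max_pred=5, rng=random):
    """Pick which token positions will be predicted (hypothetical helper)."""
    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")]
    # Roughly 15% of the tokens, at least one (assumption), never more than max_pred.
    n_pred = min(max_pred, max(1, round(len(tokens) * mask_ratio)))
    # Cannot select more positions than there are candidates.
    n_pred = min(n_pred, len(candidates))
    return sorted(rng.sample(candidates, n_pred))

# Naive whitespace tokenisation, just for the demo (BERT itself uses WordPiece).
tokens = "all the world 's a stage , and all the men and women merely players .".split()
positions = choose_mask_positions(tokens, rng=random.Random(0))
print(positions, [tokens[i] for i in positions])
```

With 16 tokens, 15% rounds to 2 selected positions, which is the same number of masks as in the example sentence; which two positions get picked depends on the random seed.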

To better match downstream tasks, the BERT authors adjusted the MLM masking rule slightly.

A selected word, for example men, is replaced as follows:

men -> [MASK] (80%)
men -> apple, a random word (10%)
men -> men, left unchanged (10%)

Whichever replacement is applied, the model is still asked to predict the original word at every selected position.
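The 80/10/10 rule is easy to get subtly wrong, so here is a small sketch of it. The function names corrupt_token and apply_mlm and the toy vocabulary are hypothetical; the only thing taken from the notes above is the three-way split and the fact that the label is always the original word.

```python
import random

def corrupt_token(token, vocab, rng=random):
    """Return the input-side replacement for one selected token (hypothetical helper)."""
    r = rng.random()
    if r < 0.8:
        return "[MASK]"           # 80%: replace with the mask token
    if r < 0.9:
        return rng.choice(vocab)  # 10%: replace with a random word
    return token                  # 10%: keep the word unchanged

def apply_mlm(tokens, positions, vocab, rng=random):
    """Corrupt the selected positions and record the original words as labels."""
    corrupted = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the target is always the original word
        corrupted[pos] = corrupt_token(tokens[pos], vocab, rng)
    return corrupted, labels

tokens = "all the men and women merely players".split()
toy_vocab = ["apple", "stage", "world"]  # ASSUMPTION: toy vocabulary for the demo
corrupted, labels = apply_mlm(tokens, [2, 4], toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Note that labels holds men and women regardless of what ended up in the corrupted input, so the loss is computed on all selected positions, not only on the ones that show [MASK].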

This is how the original paper describes it:

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a "masked LM" (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask 15% of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace "masked" words with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, T_i will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.
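The cross entropy step in the quoted passage can be sketched in plain Python. I assume here that each T_i has already been projected to a list of vocabulary logits, and that the per-position losses are averaged; mlm_loss and its argument layout are hypothetical names chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(logits_per_position, labels):
    """Mean cross entropy over the selected positions only.

    logits_per_position: dict position -> list of vocabulary logits
        (ASSUMPTION: T_i has already been projected onto the vocabulary)
    labels: dict position -> index of the original token in the vocabulary
    """
    losses = []
    for pos, target in labels.items():
        probs = softmax(logits_per_position[pos])
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Two positions have logits, but only position 3 was selected for prediction.
logits = {3: [0.0, 0.0, 0.0, 0.0], 7: [5.0, 0.0, 0.0, 0.0]}
loss = mlm_loss(logits, {3: 1})
print(loss)
```

Positions that were not selected (position 7 in the demo) contribute nothing to the loss, which is the "only predict the masked words" part of the description.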

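2. The NSP task

NSP (Next Sentence Prediction) in BERT is essentially a binary classification task.

The model predicts whether sentence 2 is the sentence that actually follows sentence 1. In 50% of the examples sentence 2 really is the next sentence, and in the other 50% it is a sentence picked at random from the corpus. The goal is for the model to learn relationships between sentences.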
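Input: [CLS] sentence 1 [SEP] sentence 2 [SEP]

PS: [SEP] is a special token in the vocabulary that marks the end of a sentence and is also used to separate sentences. [CLS] is the special token at the start of the sequence; its final hidden vector is what the classifier uses.
Output: 0 or 1
Label: 0 or 1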
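Building NSP training pairs could look roughly like the sketch below. The helper make_nsp_example is hypothetical, it draws the random sentence from the same small list rather than from a full corpus, and the convention that 1 means "is the next sentence" is my assumption (the notes only say the label is 0 or 1).

```python
import random

def make_nsp_example(sentences, index, rng=random):
    """Build one NSP example starting from sentences[index] (hypothetical helper).

    Requires at least 3 sentences and index < len(sentences) - 1.
    ASSUMPTION: label 1 means sentence 2 really follows sentence 1.
    """
    first = sentences[index]
    if rng.random() < 0.5:
        # Positive example: the true next sentence.
        second, label = sentences[index + 1], 1
    else:
        # Negative example: any sentence other than sentence 1 and its true successor.
        candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
        second, label = rng.choice(candidates), 0
    tokens = ["[CLS]"] + first.split() + ["[SEP]"] + second.split() + ["[SEP]"]
    return tokens, label

doc = [
    "all the world is a stage",
    "and all the men and women merely players",
    "they have their exits and their entrances",
    "and one man in his time plays many parts",
]
example_tokens, example_label = make_nsp_example(doc, 0, rng=random.Random(0))
print(example_tokens, example_label)
```

Over many calls roughly half of the examples come out positive and half negative, which matches the 50/50 split described above.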