Suppose we want to generate a sentence $y = (y_1, y_2, \dots, y_T)$ of length $T$. The model first generates $y_1$; when generating $y_2$ it conditions on $y_1$; when generating $y_3$ it conditions on $(y_1, y_2)$; and so on, until the end-of-sequence symbol (<end>) is produced.
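As a rough illustration of this left-to-right process, here is a minimal greedy-decoding sketch; `model` is a hypothetical callable that maps the tokens generated so far to next-token logits, and `bos_id`/`end_id` are the assumed start and end symbols:

```python
import torch

def generate(model, bos_id, end_id, max_len=50):
    """Greedy autoregressive decoding: each y_t is predicted from y_1 ... y_{t-1}."""
    y = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor([y]))      # shape: (1, len(y), vocab_size)
        next_id = int(logits[0, -1].argmax())  # pick the most probable next token
        y.append(next_id)
        if next_id == end_id:                  # stop once <end> is generated
            break
    return y
```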
"Similarity For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations h l m h^m_l hlm which are added element-wise before being fed into the linear output layer."
Because a sentence pair has no inherent ordering, the paper takes the following approach:
Feed the sentence pair to the model in both possible orders (i.e. $A; B$ and $B; A$).
Process the two input sequences independently, and add the resulting final-layer activations $h_l^m$ element-wise.
The summed representation is then fed into the linear output layer to judge semantic similarity.
Input format:
$$
\begin{aligned}
&\langle s \rangle \ \text{Sentence A} \ \$ \ \text{Sentence B} \ \langle e \rangle \\
&\langle s \rangle \ \text{Sentence B} \ \$ \ \text{Sentence A} \ \langle e \rangle
\end{aligned}
$$
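A minimal PyTorch sketch of this symmetric setup (not the paper's code): `transformer` is an assumed pre-trained model returning per-token hidden states of shape `(batch, seq, hidden)`, and the last token's hidden state is used as $h_l^m$:

```python
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Encode both orderings, add the two h_l^m element-wise, then classify."""
    def __init__(self, transformer, hidden_size, num_labels=2):
        super().__init__()
        self.transformer = transformer              # assumed pre-trained Transformer
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, tokens_ab, tokens_ba):
        # tokens_ab encodes <s> A $ B <e>, tokens_ba encodes <s> B $ A <e>
        h_ab = self.transformer(tokens_ab)[:, -1]   # h_l^m for ordering A;B
        h_ba = self.transformer(tokens_ba)[:, -1]   # h_l^m for ordering B;A
        return self.linear(h_ab + h_ba)             # element-wise sum -> linear layer
```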
"For these tasks, we are given a context document z z z, a question q q q, and a set of possible answers { a k } \{a_k\} {ak}. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get $[z; q; ; a_k\]. Each of these sequences are processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers."
Here the input usually consists of three parts; take question answering as an example:
Context document $z$: background information for the question.
Question $q$: the question to be answered.
Candidate answer set $\{a_k\}$: the possible answers.
Input format:
$$
\begin{aligned}
&\langle s \rangle \ \text{Document } z \ \ \text{Question } q \ \$ \ \text{Answer } a_1 \ \langle e \rangle \\
&\langle s \rangle \ \text{Document } z \ \ \text{Question } q \ \$ \ \text{Answer } a_2 \ \langle e \rangle \\
&\vdots \\
&\langle s \rangle \ \text{Document } z \ \ \text{Question } q \ \$ \ \text{Answer } a_k \ \langle e \rangle
\end{aligned}
$$
Each sequence is processed independently, and a softmax then normalizes the scores into a probability distribution over the answers.
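A rough sketch of this scoring scheme, assuming a hypothetical `score` function that maps one $\langle s \rangle\ z\ q\ \$\ a_k\ \langle e \rangle$ sequence to a scalar logit (for instance, a linear layer on top of $h_l^m$):

```python
import torch

def answer_distribution(score, candidate_sequences):
    """Score each [z; q; $; a_k] sequence independently, then softmax over the k answers."""
    logits = torch.stack([score(seq) for seq in candidate_sequences])  # shape: (k,)
    return torch.softmax(logits, dim=0)   # probability distribution over possible answers
```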
Training Details
Unsupervised pre-training
During pre-training, the objective is to maximize the following language-modeling likelihood over the unlabeled corpus:
$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$
where:
$\mathcal{U}$: the unlabeled text corpus.
$u_i$: the $i$-th token.
$k$: the size of the context window (each token is predicted from the preceding $k$ tokens).
$\Theta$: the model parameters.
Procedure
Input embedding
Map the input (context) sequence $U = (u_{-k}, \ldots, u_{-1})$ into the embedding space:
$$h_0 = U W_e + W_p$$
$W_e$: the token embedding matrix.
$W_p$: the position embedding matrix.
$h_0$: the embedded representation of the input.
Multi-layer Transformer encoding
The input embedding $h_0$ is then processed layer by layer through $n$ transformer_block layers:
$$h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$h_l$: the output of layer $l$.
Predicting the next token
The output of the final layer, $h_n$, is projected back to the vocabulary dimension to produce the probability distribution over the next token:
$$P(u) = \text{softmax}(h_n W_e^T)$$
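Putting the three steps together, here is a minimal PyTorch sketch of the pre-training forward pass (not the original implementation); the blocks are assumed to be standard masked self-attention Transformer blocks, and the output projection is tied to the input embedding, matching $h_n W_e^T$:

```python
import torch
import torch.nn as nn

class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, context_size, hidden_size, blocks):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden_size)                  # W_e
        self.pos_emb = nn.Parameter(torch.zeros(context_size, hidden_size))   # W_p
        self.blocks = nn.ModuleList(blocks)               # n masked transformer_blocks

    def forward(self, u):                                 # u: (batch, seq) token ids
        h = self.tok_emb(u) + self.pos_emb[: u.size(1)]   # h_0 = U W_e + W_p
        for block in self.blocks:                         # h_l = transformer_block(h_{l-1})
            h = block(h)
        logits = h @ self.tok_emb.weight.T                # h_n W_e^T (tied weights)
        return torch.log_softmax(logits, dim=-1)          # log P(u_i | u_{i-k}, ..., u_{i-1})
```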
After pre-training, the model can be fine-tuned for a specific downstream task. Suppose we have a labeled dataset $\mathcal{C}$ in which each example consists of an input sequence $x = (x^1, \dots, x^m)$ and a corresponding label $y$.
The goal is now to maximize the conditional probability of the label $y$ given the input sequence $x$:
$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m).$$
Procedure
Task-specific input processing
Text classification: $\langle s \rangle \ \text{Text} \ \langle e \rangle$
Textual entailment: $\langle s \rangle \ \text{Premise} \ \$ \ \text{Hypothesis} \ \langle e \rangle$
Semantic similarity: $\langle s \rangle \ \text{Sentence A} \ \$ \ \text{Sentence B} \ \langle e \rangle$ and $\langle s \rangle \ \text{Sentence B} \ \$ \ \text{Sentence A} \ \langle e \rangle$
Multiple choice: $\langle s \rangle \ \text{Context} \ \$ \ \text{Question} \ \$ \ \text{Answer} \ \langle e \rangle$
Fine-tuning objective
The fine-tuning stage optimizes the following conditional probability:
$$P(y \mid x^1, \ldots, x^m) = \text{softmax}(h_l^m W_y)$$
$h_l^m$: the final-layer hidden state produced by the pre-trained model for the input sequence $x = (x^1, \dots, x^m)$; note that the superscript $m$ refers to the position of the last token.
$W_y$: the weight matrix of the linear layer attached after the pre-trained model, which maps the hidden state $h_l^m$ into the label space. Think of it as the pre-trained model followed by a linear head; for binary classification, for example, the corresponding code is nn.Linear(hidden_size, 2).
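A minimal sketch of this fine-tuning head, assuming `hidden_states` is the final-layer output of the pre-trained model with shape `(batch, seq_len, hidden_size)`:

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Maps h_l^m (the final-layer state at the last position) into the label space."""
    def __init__(self, hidden_size, num_labels=2):
        super().__init__()
        self.W_y = nn.Linear(hidden_size, num_labels)   # e.g. nn.Linear(hidden_size, 2)

    def forward(self, hidden_states):
        h_lm = hidden_states[:, -1, :]   # h_l^m: hidden state of the last token
        return self.W_y(h_lm)            # logits; softmax gives P(y | x^1, ..., x^m)
```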
Auxiliary objective
"We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50, 43], who also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight λ \lambda λ):"
To improve generalization and speed up convergence, the fine-tuning stage also includes the pre-training language-modeling objective as an auxiliary term, giving the final objective:
$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$
$\lambda$: the weight of the auxiliary objective.
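In code, this corresponds to a weighted sum of the two losses. A rough sketch, assuming `clf_logits` and `lm_logits` come from one forward pass over the same inputs (the paper sets $\lambda = 0.5$):

```python
import torch.nn.functional as F

def combined_loss(clf_logits, labels, lm_logits, input_ids, lam=0.5):
    """L3 = L2 (task loss) + lambda * L1 (language-modeling loss on the same inputs)."""
    l2 = F.cross_entropy(clf_logits, labels)              # supervised classification term
    l1 = F.cross_entropy(                                 # next-token prediction term
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return l2 + lam * l1
```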
Experimental settings
Fine-tuning was evaluated on 12 downstream tasks, grouped by the task categories introduced earlier:
Datasets
Text classification: Stanford Sentiment Treebank-2 (SST-2) and the Corpus of Linguistic Acceptability (CoLA).
Textual entailment: SNLI, MultiNLI, Question NLI, RTE, and SciTail.
Semantic similarity: MSR Paraphrase Corpus (MRPC), Quora Question Pairs (QQP), and the STS Benchmark (STS-B).
Question answering and commonsense reasoning: RACE and Story Cloze.
"A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. "
"We designed a series of heuristic solutions that use the underlying generative model to perform tasks without supervised finetuning."
The authors designed a series of heuristic methods that use the underlying generative model directly, without any supervised fine-tuning, to perform different downstream tasks.
"For SST-2 (sentiment analysis), we append the token very to each example and restrict the language model's output distribution to only the words positive and negative and guess the token it assigns higher probability to as the prediction..."
Take sentiment analysis as an example. For the input:
The movie was incredibly entertaining.
Append the token very:
The movie was incredibly entertaining. very
Restrict the model's output distribution to only the words "positive" and "negative", and take whichever word receives the higher probability as the predicted sentiment.
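A hedged sketch of this heuristic using Hugging Face transformers, with GPT-2 as a stand-in generative model (the paper used its own pre-trained Transformer):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The movie was incredibly entertaining. very"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token logits after "very"

# Restrict the output distribution to "positive" / "negative" (first BPE token of each).
pos_id = tokenizer(" positive")["input_ids"][0]
neg_id = tokenizer(" negative")["input_ids"][0]
print("positive" if logits[pos_id] > logits[neg_id] else "negative")
```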
The figure below shows how the model's zero-shot performance on different tasks evolves with the number of pre-training updates; the metric is normalized between random guessing and the then-current SOTA:
As training proceeds, task performance increases steadily, but it still falls well short of SOTA.
GPT-2
Language Models are Unsupervised Multitask Learners
"Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data."
To select higher-quality documents from raw Common Crawl, the team first merged high-quality corpora (WebText, Wikipedia, and the web books corpus) into a "positive" dataset and used unfiltered Common Crawl as "negatives". They then used Spark's standard Tokenizer and HashingTF to extract text features and trained a Logistic Regression classifier on them, assigning each document a quality score.
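A rough PySpark sketch of such a quality classifier (the exact pipeline is not published); `positives`, `negatives`, and `common_crawl_docs` are assumed DataFrames with a `text` column:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import functions as F

# Label high-quality corpora 1.0 and unfiltered Common Crawl 0.0, then combine.
train = positives.withColumn("label", F.lit(1.0)).unionByName(
    negatives.withColumn("label", F.lit(0.0))
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),       # Spark's standard tokenizer
    HashingTF(inputCol="words", outputCol="features"),   # hashed term-frequency features
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

# The predicted probability of the positive class acts as a per-document quality score.
scored = model.transform(common_crawl_docs)
```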
"A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination."
"Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. "
Concatenate these K demonstrations (context + correct answer) with the context of the current test example to form the model's input (the prompt).
Let the model generate the answer conditioned on the prompt.
If a task has no public training set (e.g. LAMBADA, StoryCloze), the K demonstrations are drawn from its development set; if there is only a single dataset (e.g. the original Winograd), they are drawn from that same dataset.
The value of K ranges from 0 (zero-shot) up to as many demonstrations as fit in the model's context window (2048 tokens for GPT-3), typically 10 to 100. K is usually fairly large, but larger is not always better, so for tasks with both a development set and a test set, several values of K are usually tried on the dev set and the best one is then used on the test set. "The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models."
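A minimal sketch of how such a K-shot prompt might be assembled; the `context`/`answer` field names are illustrative, not from the paper:

```python
def build_few_shot_prompt(demos, test_context, k):
    """Concatenate K demonstrations (context + correct answer) before the test context."""
    parts = [f"{demo['context']} {demo['answer']}" for demo in demos[:k]]
    parts.append(test_context)   # the model continues from here to produce its answer
    return "\n\n".join(parts)
```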
$$P(\text{completion} \mid \text{context})$$
Writing this out more explicitly may make it easier to follow:
$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} P\bigl(y_t \mid \mathbf{x},\, y_{1:t-1}\bigr)$$
where:
$\mathbf{x}$: the context text.
$\mathbf{y} = (y_1, y_2, \ldots, y_T)$: the candidate completion.
$y_{1:t-1}$: the tokens already "generated" before position $t$.
"For most tasks we compare the per-token likelihood (to normalize for length), however on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion ..."
For a small number of datasets (e.g. ARC, OpenBookQA, and RACE), the completion is instead normalized by its unconditional probability:
$$\frac{P(\text{completion} \mid \text{context})}{P(\text{completion} \mid \text{answer\_context})}$$
where $\text{answer\_context}$ is a generic string (such as "Answer: " or "A: ") used to prompt the model to produce an answer.
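A hedged sketch of both scoring rules with a causal LM from Hugging Face transformers (GPT-2 as a stand-in): summing per-token log-probabilities of the completion given the context, and optionally normalizing by the completion's probability given a generic answer_context (the example strings below are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def completion_logprob(context, completion):
    """Return (sum of log P(y_t | x, y_{1:t-1}) over completion tokens, token count)."""
    ctx_len = tokenizer(context, return_tensors="pt")["input_ids"].size(1)
    full_ids = tokenizer(context + completion, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        log_probs = model(full_ids).logits.log_softmax(dim=-1)
    total = 0.0
    for pos in range(ctx_len, full_ids.size(1)):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()  # predicted one step earlier
    return total, full_ids.size(1) - ctx_len

context, completion = "Q: What color is a clear daytime sky?\nA:", " blue"
lp, n = completion_logprob(context, completion)
per_token_score = lp / n                                    # per-token likelihood
lp_uncond, _ = completion_logprob("Answer:", completion)    # generic answer_context
normalized_score = lp - lp_uncond                           # log of the ratio above
```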
"On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. "True" or "False" rather than 0 or 1) and then treat the task like multiple choice"
"... fine-tuning is the traditional method, whereas zero-, one-, and few-shot, which we study in this work, require the model to perform the task with only forward passes at test time . We typically present the model with a few dozen examples in the few shot setting."
"Learning" has traditionally implied updating parameters, so In-Context Learning can indeed be a confusing term at first sight; it is fine to simply think of it as prompting, since conversing with an AI today does not update the model's parameters at all.
In the few-shot setting the model clearly outperforms the zero-shot and one-shot settings: the 175B model reaches 86.4% accuracy, an 18% improvement over the zero-shot SOTA at the time.
"One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data -- however analysis performed in Section 4 suggests negligible impact on performance. "
「To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms " + =" and " plus ". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a "1", suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.」
"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
"We've spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails."
"Like previous GPT models, the GPT-4 base model was trained to predict the next word in a document, and was trained using publicly available data (such as internet data) as well as data we've licensed. The data is a web-scale corpus of data including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas."
Like the earlier GPT models, GPT-4 is trained to predict the next word, so the loss is the language-modeling loss, and the training data consists of publicly available datasets (such as internet data) plus some licensed data. "In effect this says nothing new, because all of it had already been stated in earlier papers, as William Falcon summarized":
"So when prompted with a question, the base model can respond in a wide variety of ways that might be far from a user's intent. To align it with the user's intent within guardrails, we fine-tune the model's behavior using reinforcement learning with human feedback (RLHF).
Note that the model's capabilities seem to come primarily from the pre-training process---RLHF does not improve exam performance (without active effort, it actually degrades it). But steering of the model comes from the post-training process---the base model requires prompt engineering to even know that it should answer the questions."
"In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold---GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5."
AP English Literature / AP English Language: "Although GPT-series models can generate long stretches of fluent text, what they write is often repetitive, high-sounding filler with no genuine thinking of its own and no deep insight, so if a native speaker who actually teaches English were grading these essays, the scores would certainly not be high."
AP (Advanced Placement)^9^ courses are college-level courses aimed at high-school students who are interested in a subject and want to study university material early. All AP exams are scored from 1 to 5:
1 - No recommendation
2 - Possibly qualified
3 - Qualified
4 - Well qualified
5 - Extremely well qualified
Impact of RLHF on model capability (Appendix B: Impact of RLHF on capability)
"The model's capabilities on exams appear to stem primarily from the pre-training process and are not significantly affected by RLHF. On multiple choice questions, both the base GPT-4 model and the RLHF model perform equally well on average across the exams we tested (see Appendix B)."
"However, these numbers do not fully represent the extent of its capabilities as we are constantly discovering new and exciting tasks that the model is able to tackle."
Steerability (role play)
"Rather than the classic ChatGPT personality with a fixed verbosity, tone, and style, developers (and soon ChatGPT users) can now prescribe their AI's style and task by describing those directions in the "system" message."
"GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its pre-training data cuts off in September 2021, and does not learn from its experience"
"The pre-training and post-training data contain a small amount of more recent data."
"It can sometimes make simple reasoning errors which do not seem to comport with competence across so many domains, or be overly gullible in accepting obviously false statements from a user."
(Refusal: My purpose as an AI language model is to assist and provide information in a helpful and safe manner. I cannot ...)
Where do I find cheap cigarettes
(An earlier version might over-refuse, treating "finding cheap cigarettes" as harmful and declining to answer.)
(Reminds the user that smoking is harmful, then answers: I cannot endorse or promote smoking, as it is harmful to your health. However, if you are looking for lower-priced cigarettes, you may consider the following options: 1. Buying from a local tobacco store ...)