Series Paper Reading Notes Directory
Table of Contents
- Series Paper Reading Notes Directory
- Meaning of the Paper Title
- Abstract
- 1 Introduction
- 2 Prompt Tuning
  - 2.1 Design Decisions
  - 2.2 Unlearning Span Corruption
- 3 Results
  - 3.1 Closing the Gap
  - 3.2 Ablation Study
- 4 Comparison to Similar Approaches
- 5 Resilience to Domain Shift
- 6 Prompt Ensembling
- 7 Interpretability
- 8 Conclusion
Meaning of the Paper Title
"The Power of Scale for Parameter-Efficient Prompt Tuning": the role of model scale in making parameter-efficient prompt tuning competitive.
Abstract
In this work, we explore "prompt tuning," a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient "prompt ensembling."
1 Introduction
- With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or "fine-tuning"), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).
- More recently, Brown et al. (2020) showed that prompt design (or "priming") is surprisingly effective at modulating a frozen GPT-3 model's behavior through text prompts. Prompts are typically composed of a task description and/or several canonical examples. This return to "freezing" pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks.
- Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model's input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B few-shot performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters.
- Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning.
- Li and Liang (2021) propose "prefix tuning" and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. (2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classification tasks.
- In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This "soft prompt" is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2).
Figure 1: Standard model tuning of T5 achieves strong performance, but requires storing a separate copy of the model for each end task. Our prompt tuning of T5 matches the quality of model tuning while enabling reuse of a single frozen model for all tasks. Our method significantly outperforms few-shot prompt design using GPT-3. We show the mean and standard deviation across 3 runs for tuning methods.
Figure 2: Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model. For a T5 "XXL" model, each copy of the tuned model requires 11 billion parameters. By contrast, our tuned prompts would only require 20,480 parameters per task, a reduction of over five orders of magnitude, assuming a prompt length of 5 tokens.
- While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in Sections 2-3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale.
We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the "generalist" parameters needed for general language-understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that "prompt ensembling", learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are:
1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models.
2. Ablating many design choices, and showing quality and robustness improve with scale.
3. Showing prompt tuning outperforms model tuning on domain shift problems.
4. Proposing "prompt ensembling" and showing its effectiveness.
2 Prompt Tuning
Following the "text-to-text" approach of T5 (Raffel et al., 2020), we cast all tasks as text generation. Instead of modeling classification as the probability of an output class given some input, Pr(y | X), where X is a series of tokens and y is a single class label, we now model it as conditional generation, where Y is a sequence of tokens that represent a class label. T5 models classification as Prθ(Y | X), parameterized by the weights θ of the Transformer (Vaswani et al., 2017) that make up its encoder and decoder.
Prompting is the approach of adding extra information for the model to condition on during its generation of Y. Normally, prompting is done by prepending a series of tokens, P, to the input X, such that the model maximizes the likelihood of the correct Y, Prθ(Y | [P; X]), while keeping the model parameters θ fixed. In GPT-3, the representations of the prompt tokens, P = {p1, p2, ..., pn}, are part of the model's embedding table, parameterized by the frozen θ. Finding an optimal prompt therefore requires selecting prompt tokens, either through manual search or non-differentiable search methods (Jiang et al., 2020; Shin et al., 2020). Prompt tuning removes the restriction that the prompt P be parameterized by θ; instead, the prompt has its own dedicated parameters, θP, that can be updated. While prompt design involves selecting prompt tokens from a fixed vocabulary of frozen embeddings, prompt tuning can be thought of as using a fixed prompt of special tokens, where only the embeddings of these prompt tokens can be updated. Our new conditional generation is now Prθ;θP(Y | [P; X]) and can be trained by maximizing the likelihood of Y via backpropagation, while only applying gradient updates to θP.
Given a series of n tokens, {x1, x2, ..., xn}, the first thing T5 does is embed the tokens, forming a matrix Xe ∈ R^(n×e), where e is the dimension of the embedding space. Our soft prompt is represented as a parameter Pe ∈ R^(p×e), where p is the length of the prompt. The prompt is then concatenated to the embedded input, forming a single matrix [Pe; Xe] ∈ R^((p+n)×e), which then flows through the encoder-decoder as normal. Our models are trained to maximize the probability of Y, but only the prompt parameters Pe are updated.
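A minimal PyTorch-style sketch of this mechanism (the names `SoftPrompt` and `frozen_encoder_decoder`, and the toy dimensions, are illustrative assumptions, not the paper's code, which was implemented in JAX/Flax):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings P_e of shape (p, e), prepended to the embedded input."""
    def __init__(self, prompt_len: int, embed_dim: int):
        super().__init__()
        # theta_P: the only parameters that receive gradient updates.
        self.prompt = nn.Parameter(torch.empty(prompt_len, embed_dim).uniform_(-0.5, 0.5))

    def forward(self, embedded_input: torch.Tensor) -> torch.Tensor:
        # embedded_input: (batch, n, e)  ->  output: (batch, p + n, e)
        batch = embedded_input.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, embedded_input], dim=1)

# Usage sketch with a frozen model (stand-in modules for illustration only):
embed_dim, prompt_len = 512, 20
embedding_table = nn.Embedding(32128, embed_dim)        # part of frozen theta
frozen_encoder_decoder = nn.Identity()                  # placeholder for the frozen T5 stack
for p in embedding_table.parameters():
    p.requires_grad_(False)

soft_prompt = SoftPrompt(prompt_len, embed_dim)
token_ids = torch.randint(0, 32128, (4, 16))            # (batch, n)
x_e = embedding_table(token_ids)                        # (batch, n, e)
outputs = frozen_encoder_decoder(soft_prompt(x_e))      # (batch, p + n, e)
# Only soft_prompt.prompt is handed to the optimizer, so gradients update theta_P alone.
```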
2.1 Design Decisions
- There are many possible ways to initialize the prompt representations. The simplest is to train from scratch, using random initialization. A more sophisticated option is to initialize each prompt token to an embedding drawn from the model's vocabulary. Conceptually, our soft-prompt modulates the frozen network's behavior in the same way as text preceding the input, so it follows that a word-like representation might serve as a good initialization spot. For classification tasks, a third option is to initialize the prompt with embeddings that enumerate the output classes, similar to the "verbalizers" of Schick and Schütze (2021). Since we want the model to produce these tokens in the output, initializing the prompt with the embeddings of the valid target tokens should prime the model to restrict its output to the legal output classes.
- Another design consideration is the length of the prompt. The parameter cost of our method is EP, where E is the token embedding dimension and P is the prompt length. The shorter the prompt, the fewer new parameters must be tuned, so we aim to find a minimal length that still performs well.
2.2 Unlearning Span Corruption
Unlike autoregressive language models like GPT-3, the T5 models we experiment with use an encoder-decoder architecture and pre-train on a span corruption objective. Specifically, T5 is tasked with "reconstructing" masked spans in the input text, which are marked with unique sentinel tokens. The target output text consists of all the masked content, separated by sentinels, plus a final sentinel. For example, from the text "Thank you for inviting me to your party last week" we might construct a pre-training example where the input is "Thank you ⟨X⟩ me to your party ⟨Y⟩ week" and the target output is "⟨X⟩ for inviting ⟨Y⟩ last ⟨Z⟩".
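A small illustrative sketch of how such an input/target pair could be assembled (the `make_span_corruption_example` helper and the fixed spans are assumptions for illustration; T5's real preprocessing samples spans randomly and uses reserved sentinel ids such as `<extra_id_0>`):

```python
def make_span_corruption_example(tokens, masked_spans):
    """Build an (input, target) pair in T5's span-corruption format.

    tokens: list of word tokens.
    masked_spans: ordered, non-overlapping list of (start, end) index pairs to mask.
    """
    sentinels = [f"<extra_id_{i}>" for i in range(len(masked_spans) + 1)]
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(masked_spans):
        inp.extend(tokens[cursor:start])
        inp.append(sentinels[i])              # sentinel replaces the masked span in the input
        tgt.append(sentinels[i])
        tgt.extend(tokens[start:end])         # masked content goes to the target
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(sentinels[len(masked_spans)])  # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

text = "Thank you for inviting me to your party last week".split()
inp, tgt = make_span_corruption_example(text, [(2, 4), (8, 9)])  # mask "for inviting", "last"
# inp: "Thank you <extra_id_0> me to your party <extra_id_1> week"
# tgt: "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
```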
- While Raffel et al. (2020) find this architecture and pre-training objective more effective than traditional language modeling, we hypothesize that this setup is not a good fit for producing a frozen model that can be readily controlled through prompt tuning. In particular, a T5 model pre-trained exclusively on span corruption, such as T5.1.1, has never seen truly natural input text (free of sentinel tokens), nor has it ever been asked to predict truly natural targets. In fact, due to the details of T5's span corruption preprocessing, every pre-training target will begin with a sentinel. While this "unnatural" tendency to output sentinels is easy to overcome through fine-tuning, we suspect that it would be much harder to override through a prompt alone, as the decoder priors cannot be adjusted.
- Given these concerns, we experiment with T5 models in three settings. (1) "Span Corruption": We use pre-trained T5 off-the-shelf as our frozen model, and test its ability to output the expected text for downstream tasks. (2) "Span Corruption + Sentinel": We use the same model, but prepend all downstream targets with a sentinel, so as to more closely resemble the targets seen in pretraining. (3) "LM Adaptation": We continue T5's self-supervised training for a small number of additional steps, but using the "LM" objective discussed by Raffel et al. (2020); given a natural text prefix as input, the model must produce the natural text continuation as output. Crucially, this adaptation happens only once, producing a single frozen model that we can reuse for prompt tuning across any number of downstream tasks.
Through LM adaptation, we hope to "quickly" transform T5 into a model more similar to GPT-3, which always outputs realistic text, and is known to respond well to prompts as a "few-shot learner". It is not obvious how successful this late-stage transformation will be compared to pre-training from scratch, and it has not been investigated previously to our knowledge. As such, we experiment with various lengths of adaptation up to 100K steps.
3 Results
Our frozen models are built on top of pre-trained T5 checkpoints of all sizes (Small, Base, Large, XL, XXL). We leverage the public T5.1.1 checkpoints, which include improvements over the original T5. Our "default" configuration, plotted with a green "×" throughout, uses an LM-adapted version of T5 trained for an additional 100K steps, initializes the prompt using class labels (see Section 3.2), and uses a prompt length of 100 tokens. While this is longer than the default 10-token prefix used by Li and Liang (2021), our method still uses fewer task-specific parameters, as we only tune the input layer rather than overwriting activations in all network layers. See Figure 4 for a detailed comparison. We will also see shortly that even much shorter prompts are viable as model size increases.
We measure performance on the SuperGLUE benchmark (Wang et al., 2019a), a collection of eight challenging English language understanding tasks. We report metrics on the development set associated with each dataset.
Each of our prompts trains on a single SuperGLUE task; there was no multi-task setup or mixing of training data across tasks. We translate each SuperGLUE dataset into a text-to-text format following Raffel et al. (2020), except that we omit the task names prepended to inputs indicating which SuperGLUE task an example belongs to.
We train our prompts for 30,000 steps using T5's standard cross-entropy loss, with a constant learning rate of 0.3 and a batch size of 32. Checkpoints are selected via early stopping on the development set, where the stopping metric is the default metric for the dataset, or the average of metrics for datasets evaluated with multiple metrics. All experiments were run in JAX (Bradbury et al., 2018) using the Adafactor optimizer (Shazeer and Stern, 2018) with weight decay 1e−5, β2 decay 0.8, and parameter scaling off. The models were implemented in Flax (Heek et al., 2020).
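A hedged sketch of this training setup in PyTorch. The Hugging Face `Adafactor` optimizer is used as a stand-in for the paper's JAX Adafactor, the exact argument mapping is an assumption, and the loss below is a placeholder for the real cross-entropy of the frozen T5 on a batch of 32 examples:

```python
import torch
from transformers.optimization import Adafactor

# Placeholder prompt parameters (prompt length 100, embedding dim 512 as in T5-Small);
# in the full setup these feed a frozen T5 as in the earlier sketch.
soft_prompt = torch.nn.Parameter(torch.empty(100, 512).uniform_(-0.5, 0.5))

optimizer = Adafactor(
    [soft_prompt],           # only the prompt is optimized; T5 weights stay frozen
    lr=0.3,                  # constant learning rate
    weight_decay=1e-5,
    decay_rate=-0.8,         # corresponds to the paper's beta2 decay of 0.8
    scale_parameter=False,   # parameter scaling off
    relative_step=False,
    warmup_init=False,
)

for step in range(30_000):
    loss = (soft_prompt ** 2).mean()   # stand-in loss for illustration only
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Checkpoint selection would use early stopping on the dev-set metric (omitted here).
```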
3.1 Closing the Gap
To compare our method with standard model tuning, we tune the public T5.1.1 checkpoints on SuperGLUE using the default hyperparameters specified in the T5 library (learning rate 0.001, and Adafactor optimizer with pre-training parameter states restored). We consider two baselines. (1) "Model Tuning": For an apples-to-apples comparison, we tune on each task separately, as in our prompt tuning setup. (2) "Model Tuning (Multitask)": We use T5's multi-task tuning setup to achieve a more competitive baseline. In this case, a single model is tuned on all tasks jointly, with a text prefix indicating the task name.
In Figure 1 (p. 1), we see that prompt tuning becomes more competitive with model tuning as scale increases. At the XXL size (11 billion parameters), prompt tuning matches even the stronger multi-task model tuning baseline, despite having over 20,000 times fewer task-specific parameters.
To compare with prompt design, we include GPT-3 few-shot performance on the SuperGLUE dev split, as reported by Brown et al. (2020). Figure 1 shows that prompt tuning beats GPT-3 prompt design by a large margin, with prompt-tuned T5-Small matching GPT-3 XL (over 16 times larger), and prompt-tuned T5-Large beating GPT-3 175B (over 220 times larger).
3.2 Ablation Study
Prompt Length. We train prompts for each model size while varying the prompt length in {1, 5, 20, 100, 150} and fixing other settings to our default configuration. Figure 3(a) shows that for most model sizes, increasing prompt length beyond a single token is critical to achieve good performance. Notably, the XXL model still gives strong results with a single-token prompt, suggesting that the larger the model, the less conditioning signal is needed to achieve a target behavior. Across all models, increasing beyond 20 tokens only yields marginal gains.
Figure 3: Ablations of various hyperparameters on prompt tuning performance (mean and standard deviation across 3 runs). In our "default" configuration, quality improves steadily with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. (a) Prompt length: increasing to 20+ tokens generally confers a large boost, but XXL performs well even with single-token prompts. (b) Prompt initialization: random uniform initialization lags behind the more "advanced" initializations using sampled vocabulary or class-label embeddings, but the difference vanishes at XXL size. (c) Pre-training objective: LM adaptation outperforms span corruption, even when a sentinel is added to downstream task targets, but XXL works well with any method. (d) LM adaptation: longer adaptation generally yields larger gains, but XXL is robust even to short adaptation.
Prompt Initialization. We ablate the effect of prompt initialization by training models at all sizes while fixing other hyperparameters to their default values. For random initialization, we sample uniformly from the range [−0.5, 0.5]. When initializing from sampled vocabulary, we restrict to the 5,000 most "common" tokens in T5's SentencePiece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For "class label" initialization, we take the embeddings for the string representations of each class in the downstream task and use them to initialize one of the tokens in the prompt. When a class label is multi-token, we average the token embeddings. At longer prompt lengths, we often run out of class labels before we have initialized all of the prompt tokens. In this case we fall back to our sampled vocab strategy to fill in the prompt.
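A sketch of these three initialization strategies (the embedding matrix, vocabulary ordering, and label-to-token-id mapping below are illustrative assumptions):

```python
import torch

def init_prompt(prompt_len, embed_table, strategy, class_label_token_ids=None, top_k=5000):
    """Return a (prompt_len, e) tensor of initial soft-prompt embeddings.

    embed_table: (vocab_size, e) frozen embedding matrix, assumed ordered so that lower
                 ids are more frequent (as with T5's SentencePiece vocabulary).
    class_label_token_ids: for 'class_label', a list of token-id lists, one per class.
    """
    vocab_size, e = embed_table.shape
    if strategy == "random_uniform":
        return torch.empty(prompt_len, e).uniform_(-0.5, 0.5)

    if strategy == "sampled_vocab":
        ids = torch.randint(0, min(top_k, vocab_size), (prompt_len,))
        return embed_table[ids].clone()

    if strategy == "class_label":
        rows = []
        for ids in class_label_token_ids[:prompt_len]:
            # Multi-token labels are averaged into a single embedding.
            rows.append(embed_table[torch.tensor(ids)].mean(dim=0))
        remaining = prompt_len - len(rows)
        if remaining > 0:
            # If we run out of class labels, fall back to sampled vocabulary.
            ids = torch.randint(0, min(top_k, vocab_size), (remaining,))
            rows.extend(embed_table[ids].clone())
        return torch.stack(rows)

    raise ValueError(f"unknown strategy: {strategy}")
```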
Figure 3(b) shows our ablation of initialization strategy across model sizes, where we find that the class-based initialization performs best. At smaller model sizes, there are large gaps between the different initializations, but once the model is scaled to XXL size, those differences disappear.
With "class label" initialization, we observe that the class labels typically persist in the learned prompts, such that the nearest token embeddings (in cosine distance) match the tokens used for initialization. Beyond this, we did not find our learned prompts to be interpretable, similar to those of Shin et al. (2020). See Section 7 for details.通过"类标签"初始化,我们观察到类标签通常会保留在学习的提示中,以便最近的令牌嵌入(以余弦距离)与用于初始化的令牌匹配。除此之外,我们没有发现我们学习的提示是可解释的,类似于Shin et al.(2020)。详情见第7节。
Pre-training Objective. In Figures 3(c) and 3(d), we see pre-training objective has a clear effect on prompt tuning quality. As hypothesized in Section 2.2, T5's default "span corruption" objective is not well-suited for training frozen models to be later conditioned by prompts. Intuitively, models pre-trained to read and write sentinel tokens are hard to apply directly to tasks of reading and writing text without sentinels. As seen in Figure 3(c), even the "workaround" of adding a sentinel to the downstream targets has little benefit. While LM adaptation adds value across all model sizes, we note our largest XXL model is the most forgiving and gives strong results even with span corruption.
Given the benefit of LM adaptation, we also explore how long of an adaptation is helpful. Figure 3(d) shows that longer adaptation provides additional gains, up to 100K steps. This suggests that the "transition" from span corruption to a language modeling objective is not a trivial change, and making an effective switch takes an investment of training resources (10% of the steps of the original T5 pre-training). At the same time, as in our other ablations, we observe that the XXL model is robust to even non-ideal configurations. At this size, the gains from adaptation are quite modest.
In the non-optimal "span corruption" setting, we observe instability across model sizes, with the Small model outperforming the larger Base, Large, and XL models. On inspection, we find that for many tasks, these mid-sized models never learn to output a legal class label and thus score 0%. The two most common error modes are copying subspans from the input and predicting an empty string. Furthermore, this poor performance is not due to random variance in prompt tuning, as we observe low variance across 3 runs for each size. These results indicate that using models pre-trained with the "span corruption" objective can be unreliable, with only 2 out of 5 models working well, whereas the LM-adapted versions work reliably across all model sizes.
We have released T5 1.1 checkpoints adapted using the LM objective for 100K steps for all model sizes.
4 Comparison to Similar Approaches
In this section, we review recent work on learning continuous prompts, and draw comparisons with our method. One important axis of comparison is the number of task-specific parameters each method requires, as shown in Figure 4. Among methods with learnable parameters, prompt tuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for models over a billion parameters.
Li and Liang (2021) propose "prefix tuning": learning a sequence of prefixes that are prepended at every transformer layer. This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.Li和Liang(2021)提出了"前缀调优":学习在每个Transformer层前置的前缀序列。这类似于学习Transformer激活,这些激活在每个网络层的示例中都是固定的。相反,提示调优使用一个前置于嵌入式输入的提示表示。除了需要更少的参数外,我们的方法还允许Transformer更新中间层任务表示,如输入示例所示。他们的工作建立在GPT-2上(拉德福等人,2019)和BART(刘易斯等人,2020年),而我们的重点是T5,并检查随着模型大小的增加,性能和设计选择的鲁棒性的变化。使用BART时,前缀调整包括编码器和解码器网络上的前缀,而提示调整仅需要编码器上的提示。Li和Liang(2021)还依赖于前缀的重新参数化来稳定学习,这在训练过程中添加了大量参数,而我们的配置不需要这种重新参数化,并且在SuperGLUE任务和模型大小上都是鲁棒的。
Hambardzumyan et al. (2021) propose "WARP", where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.
Liu et al. (2021) propose "P-tuning" where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning, that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.
Qin and Eisner (2021) use "soft words" to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned relative to the input based on hand-designed prompt prototypes, and a learned prompt parameter is included for each layer, so the parameter cost scales with model depth.
Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.
More generally, work on task prompts is closely aligned with work on "adapters" (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2-4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.
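For contrast, a minimal sketch of the adapter idea described above (a generic bottleneck module, not Houlsby et al.'s exact implementation): adapters insert new trainable layers that rewrite activations inside the frozen network, whereas prompt tuning leaves the network function untouched and only changes its input.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Rewrites the layer's activations; the surrounding pre-trained weights stay frozen.
        return hidden + self.up(torch.relu(self.down(hidden)))

# An adapter is inserted after a frozen sublayer and trained on the downstream task, so the
# function applied to the input changes. A soft prompt, by contrast, keeps every layer's
# function fixed and only prepends new trainable input representations.
```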
5 Resilience to Domain Shift
By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model's ability to overfit to a dataset by memorizing specific lexical cues and spurious correlations. This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.
We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on "in-domain" datasets perform when evaluated on "out-of-domain" datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.
Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g. to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.
Table 1: F1 mean and standard deviation for models trained on SQuAD and evaluated on the out-of-domain datasets of the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets with large domain shifts like TextbookQA.
As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are "duplicates". The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP⇔MRPC). As before, we train on the "in-domain" task, select checkpoints using in-domain validation, and evaluate zero-shot on the "out-of-domain" task.
Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire model (+3.2 accuracy and +3.1 F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.
6 Prompt Ensembling
Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g. 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.
Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate "models" for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average and beats, or matches, the best individual prompt.
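A sketch of this batched prompt-ensembling trick with majority voting (the `frozen_model` callable, shapes, and toy labels are illustrative assumptions):

```python
import torch
from collections import Counter

def ensemble_predict(frozen_model, prompts, embedded_example):
    """Run N prompt 'models' in one forward pass and majority-vote their predictions.

    prompts: list of N tensors of shape (p, e), one learned prompt per ensemble member.
    embedded_example: (n, e) embedded input for a single example.
    frozen_model: callable mapping a (N, p + n, e) batch to N predicted label strings.
    """
    batch = torch.stack([torch.cat([p, embedded_example], dim=0) for p in prompts])
    predictions = frozen_model(batch)   # one forward pass, batch size N
    return Counter(predictions).most_common(1)[0][0]

# Example with a stand-in "model" that just returns fixed labels:
prompts = [torch.randn(5, 8) for _ in range(5)]
example = torch.randn(12, 8)
fake_model = lambda batch: ["entailment", "entailment", "contradiction",
                            "entailment", "contradiction"]
print(ensemble_predict(fake_model, prompts, example))   # -> "entailment"
```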
7 Interpretability
- An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.
- As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric (see the sketch after this list).
- We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as { Technology / technology / Technologies / technological / technologies }, as well as more diverse but still strongly related clusters such as { entirely / completely / totally / altogether / 100% }. The nature of these clusters suggests that the prompts are in fact learning "word-like" representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.
- When initializing the prompts using the "class label" strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the "Random Uniform" or "Sampled Vocab" methods, the class labels can also be found in the nearest neighbors of the prompts; however they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to output classes makes this easier and more centralized.
- When examining longer prompts (e.g. size 100), we often find several prompt tokens with the same nearest neighbors. This suggests there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.
- While the learned prompts taken as sequences show little interpretability, we do observe a high frequency of words like science, technology and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, where approximately 20% of the questions are in the "Nature/Science" category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g. "scientific").
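A minimal sketch of the nearest-neighbor probe referenced above (the toy vocabulary and random embeddings are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def nearest_vocab_tokens(prompt_embeddings, vocab_embeddings, vocab_tokens, k=5):
    """For each learned prompt token, return the k vocabulary tokens closest in cosine distance."""
    p = F.normalize(prompt_embeddings, dim=-1)   # (prompt_len, e)
    v = F.normalize(vocab_embeddings, dim=-1)    # (vocab_size, e)
    sims = p @ v.T                               # cosine similarity matrix
    topk = sims.topk(k, dim=-1).indices          # (prompt_len, k)
    return [[vocab_tokens[j] for j in row.tolist()] for row in topk]

# Toy example:
vocab_tokens = ["technology", "science", "cat", "completely", "entirely", "dog"]
vocab_embeddings = torch.randn(len(vocab_tokens), 16)
learned_prompt = torch.randn(3, 16)              # stand-in for a trained soft prompt
for i, neighbors in enumerate(nearest_vocab_tokens(learned_prompt, vocab_embeddings, vocab_tokens, k=3)):
    print(f"prompt token {i}: {neighbors}")
```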
8 Conclusion
In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.
Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.