LLMs / o3: Translation and Commentary on "Deliberative Alignment: Reasoning Enables Safer Language Models"


Overview: Published in December 2024, this paper introduces a new method called "Deliberative Alignment" aimed at improving the safety of large language models (LLMs). Its core idea is to have the model explicitly recall and reason over safety specifications before answering a question.

>> Background and pain points: Current LLM safety training relies mainly on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). These methods have several limitations:

● Lack of deliberation: LLMs must respond to user requests instantly, with no time to deliberate, especially in complex safety scenarios.

● Implicit learning: LLMs must infer safety standards indirectly from large sets of labeled examples rather than learning the safety specifications that govern them directly. This leads to poor data efficiency and makes it hard to handle unfamiliar scenarios or adversarial attacks.

>> Proposed solution: Deliberative Alignment, a new training method that has the LLM explicitly reason over safety specifications before generating an answer. The method consists of two core stages:

● Supervised fine-tuning (SFT): This stage trains the model to reason directly over the safety specifications. Using context distillation, a model trained only for helpfulness generates a dataset of (prompt, CoT, output) triples in which the CoT (chain of thought) explicitly cites the safety specifications. The dataset requires no human-written completions (see the sketch after this list).

● Reinforcement learning (RL): This stage uses high-compute RL to train the model to think more effectively. A "judge" LLM (GRM), given the safety specifications, scores the model-generated CoT and output to provide the reward signal, further refining the model's safety reasoning.
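To make the SFT stage concrete, below is a minimal sketch of the context-distillation data generation, assuming a hypothetical `generate_with_cot` helper that samples a chain of thought and an answer from the helpful-only base model; the prompt format and function names are illustrative, not OpenAI's actual implementation.

```python
from typing import Dict, Tuple


def generate_with_cot(system: str, user: str) -> Tuple[str, str]:
    """Hypothetical stand-in for sampling (chain of thought, answer) from a
    reasoning model trained only for helpfulness; replace with a real model call."""
    raise NotImplementedError


def build_sft_example(prompt: str, category: str, spec: Dict[str, str]) -> Dict[str, str]:
    # 1. Show the category-specific safety specification in the system prompt.
    system_prompt = f"Safety specification:\n{spec[category]}"

    # 2. Sample a chain of thought and a final answer; the CoT can cite the
    #    specification because it is visible in context.
    cot, output = generate_with_cot(system=system_prompt, user=prompt)

    # 3. Strip the system prompt: the stored example keeps only the
    #    (prompt, CoT, output) triple plus its category label, so the
    #    fine-tuned model must recall the specification itself rather than
    #    read it from context.
    return {"prompt": prompt, "category": category, "cot": cot, "output": output}
```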

>> Core steps:

● Data generation: Collect prompts labeled with safety categories and, for each (prompt, category) pair, assemble the category-specific safety specification spec(category). Use the spec-agnostic model Gbase to generate (CoT, output) data whose reasoning refers to the safety specification.

● Filtering: Use the specification-aware "judge" model GRM to quality-filter the generated (CoT, output) data, keeping only high-quality samples.

● Supervised fine-tuning (SFT): Fine-tune Gbase on the filtered (prompt, CoT, output) data so the model learns to consult the safety specification in its CoT and produce compliant answers.

● Reinforcement learning (RL): Use the judge model GRM to provide the reward signal and further optimize the model's responses to safety-relevant prompts (a minimal sketch of this judge-based scoring follows the list).
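The following is a minimal sketch of how the specification-aware judge GRM could score candidate generations, both to filter SFT data and to supply the RL reward; `judge_score`, the 0-to-1 score range, and the 0.9 threshold are assumptions for illustration, not values from the paper.

```python
from typing import Dict, List


def judge_score(spec: str, prompt: str, cot: str, output: str) -> float:
    """Hypothetical stand-in for the judge model GRM: given the relevant safety
    specification, it rates how well (CoT, output) comply, on a 0-to-1 scale."""
    raise NotImplementedError


def filter_sft_data(examples: List[Dict[str, str]], spec: Dict[str, str],
                    threshold: float = 0.9) -> List[Dict[str, str]]:
    # Keep only generations whose reasoning and answer the judge rates highly.
    return [
        ex for ex in examples
        if judge_score(spec[ex["category"]], ex["prompt"], ex["cot"], ex["output"]) >= threshold
    ]


def rl_reward(example: Dict[str, str], spec: Dict[str, str]) -> float:
    # During RL, the same specification-aware judge provides the scalar reward;
    # whether the judge also sees the CoT is a detail of the paper's setup, and
    # it is passed in here only for simplicity.
    return judge_score(spec[example["category"]], example["prompt"],
                       example["cot"], example["output"])
```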

>> Advantages:

● Improved safety: Significantly strengthens the model's resistance to malicious prompts while lowering the over-refusal rate on benign requests.

● Greater robustness: Improves generalization to adversarial attacks and out-of-distribution (OOD) scenarios.

● Scalability: Synthetic data generation reduces reliance on large-scale human annotation, making the approach more scalable.

● Interpretability: Because the model explicitly reasons over the safety specifications, its decision process is easier to understand and explain.

>> Conclusions and viewpoints:

● Deliberative alignment makes notable progress on LLM safety, achieving Pareto improvements on multiple safety benchmarks.

● Having the model explicitly reason over safety specifications at inference time is the key to the safety gains.

● The synthetic data generation pipeline offers a scalable path to safety alignment.

● Deliberative alignment improves generalization to out-of-distribution scenarios.

● Although deliberative alignment shows positive results, the paper stresses that as AI capabilities grow, alignment work must keep improving to meet more complex future safety challenges, such as divergence between model goals and human intent.

The paper's core contribution is a novel LLM safety-alignment method, deliberative alignment. By having the model explicitly reason over safety specifications before answering, it addresses the lack of deliberation and the reliance on implicit learning in existing methods. Deliberative alignment delivers clear gains in safety, robustness, and scalability, and suggests new directions for future research on LLM safety alignment. The paper also notes open challenges, such as countering more advanced adversarial attacks and keeping models aligned with human values over the long term.

Contents

Translation and Commentary on "Deliberative Alignment: Reasoning Enables Safer Language Models"

Abstract

1 Introduction

Figure 1: A sample o1 chain-of-thought

Figure 2: Main safety results

[6 Discussion](#6 Discussion)


Translation and Commentary on "Deliberative Alignment: Reasoning Enables Safer Language Models"

| Item | Details |
|------|---------|
| Paper | https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/0aedc43a8f2d1e5c71c5e114d287593f/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024_3.pdf |
| Date | December 2024 (exact day not given) |
| Authors | OpenAI |

Abstract

As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models [1], and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.

1 Introduction

Modern Large Language Models (LLMs) are safety trained using Supervised Fine Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to mitigate harmful, undesirable, or otherwise disallowed outputs [2]-[4]. Despite ongoing advances in these methods, today's models still exhibit safety shortcomings: they can be tricked into revealing harmful content, often refuse legitimate requests, and remain vulnerable to jailbreak attacks [5]-[8].

We argue that many of these failures arise from two limitations in modern safety training. First, LLMs must respond instantly to user requests using a fixed amount of compute, without deliberation even for complex safety scenarios. Second, LLMs must infer underlying safety standards indirectly from large sets of labeled examples, rather than directly learning the safety specifications that govern them. This reliance on implicit, pattern-based learning leads to poor data efficiency and makes it challenging for models to generalize when facing unfamiliar scenarios or adversarial attacks.

We propose deliberative alignment, a training approach that teaches LLMs to explicitly reason through safety specifications before producing an answer. By applying this method to OpenAI's o-series models [1], we enable them to use chain-of-thought (CoT) reasoning to examine user prompts, identify relevant policy guidelines, and generate safer responses (e.g., Figure 1).

Our method proceeds in two core stages, integrating process- and outcome-based supervision [9]. In the first stage, we teach the model to directly reason about our safety specifications within its chain-of-thought, by performing supervised fine-tuning on (prompt, CoT, output) examples where the CoTs reference the specifications. We construct this dataset using context distillation [10], [11] and an o-type model trained only for helpfulness (i.e. trained without any safety-relevant data). Concretely, we present the model with the safety specifications as part of the system prompt, generate model completions, and then strip away the system prompts to form the final dataset. This stage provides the model with a strong prior for reasoning through safety considerations. In the second stage, we use high-compute RL to train the model to think more effectively. To do so, we provide reward signal using a judge LLM that is given our safety specifications. Notably, our training procedure requires no human-labeled completions. Despite relying only on model-generated data, we achieve highly precise specification adherence. This addresses a major challenge of standard LLM safety training, namely its heavy dependence on large-scale, human-labeled data: as LLMs' capabilities improve, the pool of human trainers qualified to provide such labeling shrinks, making it harder to scale safety with capabilities. Deliberative alignment's synthetic data generation pipeline offers a scalable approach to alignment, reserving human expertise for evaluation.

We compare o1 to GPT-4o and other state-of-the-art LLMs across a range of internal and external safety benchmarks, such as jailbreak and content-policy refusal evals. The o1 models achieve a Pareto improvement by reducing both under- and overrefusals (see Figure 2) and they saturate many of our hardest safety benchmarks. Furthermore, we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios. In detailed ablation studies, we find that process-supervision provides a strong prior, and that outcome-based RL refines the CoT safety reasoning. Overall, our results suggest that chain-of-thought reasoning can serve to leverage test-time compute to improve safety behavior, ultimately training LLMs to be "right for the right reasons".

Figure 1: A sample o1 chain-of-thought. Here, a user attempts to obtain advice on untraceable payment methods to use for an adult website, in order to avoid detection by law enforcement. The user tries to jailbreak the model by encoding the request and wrapping it with instructions intended to encourage the model to comply. In the model's chain-of-thought, the model decodes the request and recognizes that the user is trying to trick it (highlighted in yellow). It successfully reasons through the relevant OpenAI safety policies (highlighted in green), and ultimately provides an answer that follows hard refusal style guidelines.

Figure 2: Main safety results. The o1 models advance the Pareto frontier of refusing to answer malicious jailbreak prompts (from StrongREJECT [12]) and not over-refusing benign prompts (from XSTest [13]), compared to GPT-4o and other state-of-the-art LLMs. Error bars represent estimates of standard deviation calculated over 1,000 bootstrap trials.
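As a side note on the error bars, here is a minimal sketch of estimating the standard deviation of a refusal rate by bootstrap resampling with 1,000 trials, as described in the caption; the outcome data below are placeholders, not results from the paper.

```python
import random
from statistics import pstdev


def bootstrap_std(outcomes: list, trials: int = 1000, seed: int = 0) -> float:
    """Estimate the standard deviation of a rate (e.g. a refusal rate) by
    resampling the per-prompt boolean outcomes with replacement."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = []
    for _ in range(trials):
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(resample) / n)
    return pstdev(rates)


# Placeholder data: 90 refusals out of 100 jailbreak prompts.
print(bootstrap_std([True] * 90 + [False] * 10))
```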

6 Discussion

We are encouraged by Deliberative Alignment's effectiveness on improving alignment to OpenAI's policy specifications and robustness to jailbreaks. The method also allows us to specify the boundary between compliance, refusal, and safe completion in finer detail than was possible before. We believe this nuanced control can lead to models that are not just safer but also more helpful. The method's use of a synthetic data generation pipeline to create training data from provided specifications and prompts also makes it a relatively scalable approach to alignment.

We anticipate OpenAI's policies will keep evolving, but that training models to precisely follow the current defined set of policies is essential: this practice helps us build the skills for aligning with any policy requirements, providing invaluable preparation for future scenarios where the stakes are extremely high or where strict adherence to policies is critical.

This work connects to a broader question in AI safety: will advancements in alignment keep pace with AI capabilities? That the o1 model's enhanced reasoning abilities allow for more effective implementation of alignment strategies offers optimism that alignment is progressing alongside capabilities.

However, this encouraging trend may not persist indefinitely. As AI models grow more sophisticated, they could develop goals that diverge from those intended by their developers. For instance, a highly intelligent and self-aware AI might reject the constraints and objectives set by humans [34]. Alternatively, an AI could remain committed to its human-assigned terminal goal but, in the process, pursue instrumental goals like self-preservation, resource acquisition, or enhancing its cognitive abilities [35], [36]. These power-seeking tendencies could lead to harmful or unintended consequences. And as models gain more intelligence and autonomy, the scale of potential harm from misalignment increases dramatically, with the risk of catastrophic outcomes. This underscores the urgent need for ongoing research in AI alignment. We are actively investing in better alignment strategies and research areas like monitoring chain-of-thoughts for deception [37], [38], to ensure that as AI systems become more capable, they remain aligned with human values.
