TRE: Encouraging Exploration in the Trust Region

Institution: Baidu

Code: https://github.com/WhyChaos/TRE-Encouraging-Exploration-in-the-Trust-Region

Abstract

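Entropy regularization is a standard technique for improving exploration in reinforcement learning (RL). In large language models (LLMs), however, it often yields little benefit and can even degrade performance. We argue that this failure stems from the cumulative tail risk inherent to LLMs, which arises from their massive vocabulary sizes and long generation lengths.

In such settings, standard global entropy maximization indiscriminately spreads probability mass over a large number of invalid tail tokens instead of concentrating it on plausible candidates, thereby disrupting coherent reasoning.

To address this, we propose Trust Region Entropy (TRE), a method that encourages the model to explore only within its "trust region". Extensive experiments on mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) show that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines across all tasks.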

Contribution

• We introduce Trust Region Entropy (TRE), a method that encourages exploration strictly within a trust region via local entropy maximization.

• We demonstrate through extensive experiments on mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines.

RL for LLM Alignment

Following the standard Reinforcement Learning from Human Feedback (RLHF) pipeline (Ouyang et al., 2022), models initially trained via supervised fine-tuning are further optimized using algorithms such as Proximal Policy Optimization (PPO) (Schulman et al., 2017) to maximize non-differentiable reward signals. This paradigm has proven effective across various domains, from improving helpfulness and safety (Bai et al., 2022) to enhancing mathematical reasoning capabilities (Guo et al., 2025; Yu et al., 2025).

Entropy Regularization

Entropy regularization is a cornerstone technique in modern RL, encouraging exploration via the entropy term.
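
In its usual form, the regularized objective adds the policy entropy to the expected return. The coefficient $\beta$ and the state distribution $d^{\pi_\theta}$ are notation assumed here for illustration, not symbols taken from the paper:

```latex
J_\beta(\theta) = J(\theta) + \beta \, \mathbb{E}_{s \sim d^{\pi_\theta}}\left[ \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) \right],
\qquad
\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) = -\sum_{a} \pi_\theta(a \mid s) \log \pi_\theta(a \mid s)
```

For an LLM, the sum over $a$ runs over the entire vocabulary, so the bonus rewards probability mass on every token, plausible or not.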

While highly effective in low-dimensional continuous control, naive entropy maximization proves problematic for LLMs due to massive vocabulary sizes (Cui et al., 2025).
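
To give a feel for how vocabulary-level tail mass compounds over long generations, here is a small numerical sketch. It is not the paper's definition of cumulative tail risk: it assumes a hypothetical per-step probability `tail_mass` of sampling an invalid tail token and treats sampling steps as independent, both simplifications made here for illustration.

```python
def prob_any_tail_token(tail_mass: float, seq_len: int) -> float:
    """Probability that at least one of seq_len independent sampling steps
    lands in the tail, given a per-step tail probability tail_mass.

    Both parameters are hypothetical quantities used only for illustration.
    """
    return 1.0 - (1.0 - tail_mass) ** seq_len


# Sweep a few assumed per-step tail masses and generation lengths.
for tail_mass in (1e-4, 1e-3, 1e-2):
    for seq_len in (100, 1000):
        p = prob_any_tail_token(tail_mass, seq_len)
        print(f"tail_mass={tail_mass:g} seq_len={seq_len}: {p:.3f}")
```

Under these assumptions, a per-step tail mass of 0.001 already gives a better-than-even chance of sampling at least one tail token somewhere in a 1,000-token generation.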

To mitigate this, contemporaneous works have proposed selective constraint mechanisms.

For instance, Wang et al. (2025) propose Forking-Tokens, which restricts optimization to steps with high entropy to preserve exploratory potential.

Similarly, Cui et al. (2025) introduce KL-Cov, which identifies steps with high covariance between advantage estimates and log-probabilities and selectively imposes a strong KL penalty on these critical steps to stabilize training dynamics.

Trust Region

The concept of a Trust Region is foundational to stable optimization in reinforcement learning.

First, what a Trust Region is:

In policy gradient methods, we are essentially doing gradient ascent on the expected return J(θ) of the current policy.
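
In standard notation, the objective and its gradient can be written as follows. The trajectory $\tau$, return $R(\tau)$, and per-step states and actions $s_t, a_t$ are symbols assumed here, since the notes do not fix them:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]
```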

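The problem:

If a single update step is too large 👉 the policy distribution changes drastically
The importance sampling ratio blows up
Training becomes unstable or even collapses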

This is especially pronounced in RL for LLMs: the policy is a softmax over roughly 50k dimensions, so even a slightly oversized update can derail it.
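
A quick numerical sketch of how the importance sampling ratio reacts to update size. The log-probability values are made up for illustration, and `importance_ratio` is a hypothetical helper, not code from the TRE repository.

```python
import math


def importance_ratio(logp_new: float, logp_old: float) -> float:
    """Importance sampling ratio pi_new(a|s) / pi_old(a|s) from log-probabilities."""
    return math.exp(logp_new - logp_old)


# A small update barely moves the ratio away from 1.
small = importance_ratio(logp_new=-2.05, logp_old=-2.0)
# A large update on a previously unlikely token inflates the ratio sharply.
large = importance_ratio(logp_new=-1.0, logp_old=-6.0)

print(f"small step ratio: {small:.3f}")
print(f"large step ratio: {large:.1f}")
```

Because the ratio is exponential in the log-probability shift, a change of a few nats on a single token is enough to make one sample dominate the gradient estimate.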

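So the core question becomes:

❓ How do we ensure that each policy update does not stray too far from the previous policy?

This is where the Trust Region idea comes from.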

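The progression from TRPO to PPO is essentially a move from "theoretically optimal + complex constraint" to "practical to engineer + approximate substitute".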

TRPO (2015): Trust Region Policy Optimization

TRPO (Schulman et al., 2015) constrains the policy update by enforcing a strict KL-divergence constraint on a surrogate objective, ensuring monotonic improvement while maintaining stability. This surrogate objective is designed to approximate the true objective while keeping the updates within a trust region defined by the KL-divergence.

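Core idea

Add a KL constraint directly to the optimization problem: maximize the surrogate objective subject to the new policy not moving too far from the old policy (the KL divergence between them is bounded).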
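
In commonly used notation, the constrained problem reads as follows. The trust-region radius $\delta$, the advantage estimate $\hat{A}_t$, and the old parameters $\theta_{\text{old}}$ are symbols assumed here, since the notes do not fix them:

```latex
\max_\theta \; \mathbb{E}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\left[ D_{\mathrm{KL}}\big( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \big) \right] \le \delta
```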

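Surrogate objective

The original objective J(θ) is hard to optimize directly, so a surrogate is constructed instead.

That is, importance sampling + advantage: advantages estimated under the old policy are reweighted by the probability ratio between the new and old policies.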
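
The two ingredients of the TRPO problem, the importance-weighted surrogate and the KL divergence between old and new policies, can be sketched in plain Python. This is an illustrative toy with made-up numbers and hypothetical function names, not TRPO itself, which additionally needs a constrained optimizer.

```python
import math


def surrogate_objective(logp_new, logp_old, advantages):
    """Mean of ratio * advantage over sampled (state, action) pairs."""
    terms = [math.exp(n - o) * a for n, o, a in zip(logp_new, logp_old, advantages)]
    return sum(terms) / len(terms)


def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two categorical distributions on the same support."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)


logp_old = [-1.2, -0.7, -2.3]
advantages = [0.5, -0.2, 1.0]

# With identical policies every ratio is 1, so the surrogate is the mean advantage.
print(surrogate_objective(logp_old, logp_old, advantages))

# The KL divergence is zero for identical distributions and grows as they diverge.
print(kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```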

PPO (2017): Proximal Policy Optimization

In contrast, **PPO (Schulman et al., 2017)** simplifies this approach by introducing a clipped surrogate objective that penalizes large policy deviations, making it more tractable and efficient while still achieving stability similar to TRPO.

Instead of posing a constrained optimization problem, it modifies the objective function directly.
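
A minimal sketch of the clipped surrogate from Schulman et al. (2017), written in plain Python for illustration. With r_t the probability ratio and Â_t the advantage, each sample contributes min(r_t Â_t, clip(r_t, 1 - ε, 1 + ε) Â_t). The function name and the default `clip_eps=0.2` are choices made here, not taken from the notes.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for n, o, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(n - o)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum keeps the more pessimistic of the two estimates.
        total += min(ratio * a, clipped_ratio * a)
    return total / len(advantages)


# Positive advantage, ratio e^1 > 1 + eps: the gain is capped at (1 + eps) * A.
print(ppo_clipped_objective([0.0], [-1.0], [1.0]))
# Negative advantage, same ratio: the unclipped (worse) value is kept.
print(ppo_clipped_objective([0.0], [-1.0], [-1.0]))
```

The clip removes the incentive to push the ratio beyond 1 ± ε when doing so would raise the objective, which acts as a soft stand-in for an explicit KL constraint.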