LLMs - rStar: Translation and Interpretation of 《Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers》
Overview: This paper proposes rStar, a self-play mutual reasoning approach that enhances the reasoning capability of small language models (SLMs) without fine-tuning and without relying on a stronger model. Using a generation-discrimination mechanism and a richer set of reasoning actions, rStar achieves substantial performance gains without external supervision or powerful teacher models, and it points to a new direction for applying SLMs to reasoning tasks.
>> Background and pain points: Large language models (LLMs) perform poorly on complex reasoning tasks; even advanced models achieve low accuracy on datasets such as GSM8K. Fine-tuning can improve reasoning ability, but most fine-tuning data is distilled from stronger models such as GPT-4, which constrains the progress of smaller models. Existing self-improvement methods based on self-exploration have two main problems: they struggle to explore the solution space effectively, and they struggle to assess the quality of reasoning steps and final answers; both problems are more pronounced for SLMs. Weak instruction following and unreliable self-rewarding make self-improvement ineffective for SLMs, or even counterproductive.
>> Proposed solution: rStar decouples the reasoning process into a self-play generation-discrimination process (a minimal sketch follows this list):
● Generation: the target SLM is augmented with Monte Carlo Tree Search (MCTS) and a richer set of human-like reasoning actions, such as decomposing the problem, proposing sub-questions, and rephrasing the question, to generate higher-quality reasoning trajectories.
● Discrimination: a second SLM with similar capability acts as a discriminator and verifies each trajectory generated by MCTS. The discriminator is hinted with a partial trajectory and asked to complete the remaining reasoning steps, and its completion is compared with the original trajectory; if the two agree, the trajectory is considered more reliable.
● Final selection: the target SLM selects the final reasoning trajectory as the solution based on the reward score and the consistency score.
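To make the three stages above concrete, here is a minimal Python sketch of the control flow, assuming hypothetical `generate_candidates` and `is_mutually_consistent` callables that wrap the two SLMs; it is an illustrative outline, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    steps: List[str]   # intermediate reasoning steps produced during MCTS rollouts
    answer: str        # final answer extracted from the last step
    reward: float      # reward score accumulated by MCTS for this trajectory

def rstar_solve(
    question: str,
    generate_candidates: Callable[[str], List[Trajectory]],    # target SLM + MCTS (generation)
    is_mutually_consistent: Callable[[str, Trajectory], bool], # second SLM with partial hints (discrimination)
) -> str:
    """Generation -> discrimination -> selection, mirroring the three bullets above."""
    candidates = generate_candidates(question)                                  # 1) MCTS rollouts
    agreed = [t for t in candidates if is_mutually_consistent(question, t)]     # 2) keep mutually agreed trajectories
    pool = agreed if agreed else candidates                                     # fallback (an added assumption) if nothing agrees
    return max(pool, key=lambda t: t.reward).answer                             # 3) pick by reward score
```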
>> Core steps:
● Problem formulation: cast the reasoning problem as a multi-step reasoning generation task.
● MCTS trajectory generation: use MCTS to explore a richer space of human-like reasoning actions and generate multiple candidate reasoning trajectories; five human-like reasoning actions (A1-A5) are defined (see the sketch after this list).
● Mutual-consistency verification: a second SLM acts as a discriminator to verify the candidate trajectories and assess their consistency.
● Final trajectory selection: the target SLM selects the final reasoning trajectory based on the reward score and the consistency score.
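For reference, the sketch below enumerates the five action types (A1-A5, paraphrased from the paper) and the standard UCT rule commonly used in MCTS to decide which action to expand next; the exploration constant `c` and the `select_action` helper are illustrative assumptions rather than the authors' exact formulation.

```python
import math
from enum import Enum

class Action(Enum):
    A1_PROPOSE_ONE_STEP = 1      # propose the next single reasoning step
    A2_PROPOSE_REMAINING = 2     # propose all remaining steps at once (CoT-style)
    A3_NEXT_SUBQUESTION = 3      # pose the next sub-question together with its answer
    A4_REANSWER_SUBQUESTION = 4  # re-answer the previous sub-question
    A5_REPHRASE_QUESTION = 5     # rephrase the question or sub-question

def uct_score(total_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: balance exploiting high-reward actions and exploring rarely tried ones."""
    if visits == 0:
        return float("inf")      # always expand unvisited actions first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_action(children: dict, parent_visits: int) -> Action:
    """`children` maps Action -> (total_reward, visits); pick the child with the highest UCT."""
    return max(children, key=lambda a: uct_score(*children[a], parent_visits))
```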
>> Advantages:
● No fine-tuning or stronger models: rStar enhances SLM reasoning directly, without relying on a stronger model for fine-tuning or data synthesis.
● Richer reasoning actions: simulating human reasoning behaviors improves the quality of the generated reasoning trajectories.
● Mutual-reasoning consistency: verification by a second SLM makes answer selection more reliable and avoids the overfitting risk associated with training a reward model.
● Efficient inference: parallelized processing improves inference efficiency.
>> Conclusions and findings:
● rStar achieves state-of-the-art performance across five SLMs and five reasoning tasks, substantially outperforming existing multi-round prompting and self-improvement methods.
● rStar shows that SLMs already possess strong reasoning ability; they simply need the right guidance.
● Extensive experiments validate rStar's effectiveness, and ablation studies analyze the contribution of each component.
● Even with a small number of rollouts, rStar significantly improves reasoning accuracy.
● Self-evaluation works poorly for SLMs, whereas mutual-consistency verification is more effective.
Contents
《Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers》: Translation and Interpretation
1 Introduction
Figure 1: rStar with 32 rounds of inference vs. domain-specialized SFT
Figure 2: The self-play mutual reasoning generation-discrimination process
《Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers》: Translation and Interpretation
| Item | Details |
|------|---------|
| Paper | https://arxiv.org/abs/2408.06195 |
| Date | August 12, 2024 |
| Authors | Microsoft Research Asia, Harvard University |
Abstract
This paper introduces rStar, a self-play mutual reasoning approach that significantly improves the reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher-quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutually consistent and are thus more likely to be correct. Extensive experiments across five SLMs demonstrate that rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at this https URL.
1 Introduction
Despite their success, large language models (LLMs) face significant challenges in complex reasoning (Valmeekam et al., 2022; Weng et al., 2023). For example, state-of-the-art models like Mistral-7B (Jiang et al., 2023) can only achieve 36.5% accuracy on the GSM8K dataset, even with techniques like Chain-of-Thought (CoT) prompting (Wei et al., 2022). Although fine-tuning is shown to be an effective way to improve reasoning capability, most LLMs rely on fine-tuning data distilled or synthesized by superior models like GPT-4 (Wang et al., 2024a; Gou et al., 2023). Meanwhile, the community has been actively working on a complementary and yet more challenging approach: reasoning improvements without a superior teacher LLM.

First, LLMs often struggle to effectively explore the solution space during reasoning. Self-exploration often gets trapped in a solution space of low-quality reasoning steps even after many attempts. For example, our experiments reveal that after 32 rounds of self-exploration with RAP (Hao et al., 2023), only 24% of the trajectories generated by LLaMA2-7B on GSM8K are correct. Second, even when self-exploration finds high-quality reasoning steps, it is difficult for SLMs to tell which reasoning steps are of higher quality or to determine which final answers are correct, so it is hard to effectively guide the self-exploration. Our study shows that naïve reward-based self-exploration guidance can lead to results no better than random guesses (see Appendix A.1).
A more troublesome fact is that the above two issues are more pronounced in the smaller versions of LLMs, i.e., SLMs, due to their weaker capabilities. For instance, while GPT-4 can improve by self-refining its output (Madaan et al., 2024; Wu et al., 2024; Zhou et al., 2024), these approaches are less effective in SLMs and may even lead to worse performance (Forsman, 2024). This significantly hinders the adoption of neural language models.

This paper introduces Self-play muTuAl Reasoning (rStar), a novel approach that boosts SLMs' reasoning capability during inference without fine-tuning or superior models. To address the aforementioned challenges, rStar decouples reasoning into a self-play mutual generation-discrimination process, as illustrated in Fig. 2. Specifically, rStar is unique in the following ways. First, although it relies on a conventional Monte Carlo Tree Search (MCTS) for SLMs to self-generate reasoning steps, rStar advocates a richer set of reasoning actions in the self-exploration. The newly proposed actions simulate human reasoning behaviors given the current reasoning state, such as decomposing and searching for a specific reasoning step, proposing a new sub-question, or rephrasing the given question. This enables SLMs to generate high-quality candidate reasoning trajectories during self-exploration. Second, to effectively guide the exploration among the generated reasoning trajectories, rStar augments the MCTS process with a new discrimination process called mutual consistency. In particular, rStar employs a second SLM with similar capability, acting as a discriminator to provide unsupervised feedback on each candidate reasoning trajectory generated by MCTS. To improve the accuracy of the feedback, rStar hints the second SLM with sampled partial reasoning trajectories, asking it to complete the remaining reasoning steps, and it deems the mutually agreed reasoning trajectories to be of higher quality. Mutual consistency mirrors the common human practice in the absence of supervision, where agreement among peers (i.e., two SLMs) on derived answers suggests a higher likelihood of correctness. As a result, mutual consistency offers more effective reasoning across diverse tasks than approaches like self-consistency (Wang et al., 2023) and avoids the risk of overfitting when training a reward model (Chen et al., 2024a; Wang et al., 2024b).
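The mutual-consistency check described above can be sketched as follows. The paper hints the discriminator with sampled partial trajectories; this simplified version always splits the trajectory at its midpoint, and the prompt format, the `complete` callable wrapping the second SLM, and the `extract_answer` helper are illustrative assumptions.

```python
from typing import Callable, List

def extract_answer(text: str) -> str:
    """Toy answer extraction: treat the last non-empty line as the final answer."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def mutually_consistent(
    question: str,
    steps: List[str],                # candidate trajectory produced by the MCTS generator
    original_answer: str,            # the generator's final answer for this trajectory
    complete: Callable[[str], str],  # second SLM: completes the reasoning given a partial hint
) -> bool:
    # Hint the discriminator with a partial trajectory (here: roughly the first half of the steps).
    split = max(1, len(steps) // 2)
    hint = "\n".join(steps[:split])
    prompt = (
        f"Question: {question}\n"
        f"Partial reasoning:\n{hint}\n"
        "Complete the remaining reasoning steps and state the final answer:"
    )
    completion = complete(prompt)
    # The two SLMs "mutually agree" when their final answers match.
    return extract_answer(completion) == original_answer
```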
Extensive experiments across five SLMs and five diverse reasoning tasks demonstrate the effectiveness of rStar. With just 32 rounds of MCTS inference, rStar significantly enhances SLMs' reasoning capabilities, matching or even surpassing the accuracy achieved after fine-tuning. For example, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral, and from 47.23% to 85.52% for LLaMA3-8B. Furthermore, we conduct comprehensive experiments to verify rStar's superiority over state-of-the-art baselines, including single-round inference techniques like few-shot CoT, multi-round prompting approaches such as self-consistency, and self-improvement techniques such as RAP, ToT, self-evaluation, and self-verification.
Figure 1: With 32 rounds of inference, rStar makes SLMs highly capable problem-solvers, matching or even surpassing the reasoning performance achieved after domain-specialized SFT.

A promising paradigm to improve reasoning without superior models is to leverage the knowledge within LLMs themselves (Wang et al., 2023; Hao et al., 2023; Madaan et al., 2024). For example, RAP (Hao et al., 2023) adopts a self-exploration solution to iteratively improve the LLM's reasoning performance through self-rewarded feedback. Unfortunately, studies suggest that this paradigm often suffers from two fundamental issues.
Figure 2: Our self-play mutual reasoning is a generation-discrimination process: (1) a self-generator augments the target SLM to generate candidate reasoning trajectories using MCTS; (2) the discriminator uses another SLM to provide unsupervised feedback on each trajectory based on partial hints; (3) based on this feedback, the target SLM decides a final reasoning trajectory as the solution.
Conclusion
In this work, we present rStar, a generator-discriminator self-play approach that significantly grows the reasoning capabilities of SLMs at inference time. Our approach reveals that SLMs, such as LLaMA2-7B, already exhibit strong reasoning capabilities prior to domain-specialized supervised fine-tuning. rStar achieves state-of-the-art performance across five SLMs and five diverse reasoning tasks, substantially outperforming existing multi-round prompting and self-improvement approaches. Furthermore, we conduct extensive ablation studies and analysis, contributing to the development of more advanced self-improved reasoning for SLMs.