PAL: Program-aided Language Models

arXiv: https://arxiv.org/pdf/2211.10435

Project page: https://reasonwithpal.com/

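1. Motivation

  • Large language models (LLMs) combined with Chain-of-Thought (CoT) are good at decomposing a complex problem into sub-problems and reasoning through them step by step;
  • However, for more complex arithmetic, or for symbolic computation involving large numbers, the LLM can still make calculation errors even when its planning and decomposition of the task are correct, and the model usually cannot fix these errors on its own;
  • This paper therefore has the LLM decompose the reasoning task first, then generate Python code, and leaves the actual computation to an interpreter. Introducing code compensates for the errors caused by faulty calculation.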

This bridges an important gap in chain-of-thought-like methods, where reasoning chains can be correct but produce an incorrect answer.

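2. Method

This paper proposes the Program-Aided Language Model (PAL).

As in Chain-of-Thought, each exemplar contains a reasoning path, but here the reasoning path interleaves natural language with Python code. The exemplar ends with the complete program only and does not state the final answer. Guided by this prompt, the LLM reasons about the target test sample and generates code, and the final answer is obtained by running that code in a Python interpreter.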

Figure: a reasoning path that contains both natural language and Python code.

Figure: comparison of PAL and CoT.
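
The flow described in this section (the LLM writes a program, the interpreter computes the answer) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: `fake_llm`, `run_generated_program`, and `pal_answer` are hypothetical names, and `fake_llm` returns a hard-coded program instead of calling a real model.

```python
def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: returns a generated program as text."""
    return (
        "def solution():\n"
        "    # Olivia starts with 23 dollars\n"
        "    money_initial = 23\n"
        "    # She buys 5 bagels at 3 dollars each\n"
        "    bagels = 5\n"
        "    bagel_cost = 3\n"
        "    money_spent = bagels * bagel_cost\n"
        "    money_left = money_initial - money_spent\n"
        "    return money_left\n"
    )


def run_generated_program(code: str):
    """Execute the generated code in a fresh namespace and call its solution()."""
    namespace = {}
    # Caution: exec runs arbitrary code; a real system should sandbox this step.
    exec(code, namespace)
    return namespace["solution"]()


def pal_answer(question: str, llm=fake_llm):
    """Build a prompt, let the (stub) LLM write a program, and run it."""
    prompt = f"Q: {question}\n\n# solution in Python:\n"
    code = llm(prompt)
    return run_generated_program(code)


answer = pal_answer(
    "Olivia has $23. She bought five bagels for $3 each. "
    "How much money does she have left?"
)
print(answer)  # 8
```

The point of the design is that the model never has to state the number itself: the arithmetic is done by the interpreter, so a correct program yields a correct answer even when the operands are large.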

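Constructing the exemplars

For each evaluation dataset, exemplars provided by prior work are used directly when available; otherwise, 3 to 6 annotated samples are randomly selected as exemplars.

Variable names in the code of the reasoning path should match the entities in the original question and are written in underscore-separated form.

For example, a variable that describes the number of apples in the basket should have a name such as num_apples_in_basket. This keeps the generated code linked to the entities in the question.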

3. Experiments

Datasets:

  • Mathematical reasoning: GSM8K, SVAMP, ASDIV, MAWPS;
  • Symbolic reasoning: BIG-Bench Hard (BBH);
  • Algorithmic reasoning: BIG-Bench Hard (BBH).

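GSM-Hard (a harder version of GSM8K):

The authors built a new dataset by heuristically changing the numbers in the questions to larger ones. Using this dataset, they found that in 50% of the cases the LLM produced a correct line of reasoning but still predicted the wrong final answer because it miscalculated with the large numbers.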
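
One way such a number-replacement heuristic could look is sketched below. The text above does not spell out the exact procedure, so everything here is an assumption made for illustration: the function name `make_harder`, the choice to replace only the first integer in the question, and the 7-digit range.

```python
import random
import re


def make_harder(question: str, rng: random.Random, digits: int = 7):
    """Replace the first integer in `question` with a random `digits`-digit integer.

    Returns the new question, the original value, and the new value.
    This is an illustrative heuristic, not the authors' exact procedure.
    """
    match = re.search(r"\d+", question)
    if match is None:
        raise ValueError("question contains no number to replace")
    old_value = int(match.group())
    new_value = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
    new_question = question[:match.start()] + str(new_value) + question[match.end():]
    return new_question, old_value, new_value


rng = random.Random(0)  # fixed seed so the sketch is reproducible
question = (
    "Olivia has $23. She bought five bagels for $3 each. "
    "How much money does she have left?"
)
hard_question, old_value, new_value = make_harder(question, rng)
print(hard_question)

# The same program structure still gives the exact answer for the new number,
# because the interpreter, not the model, performs the subtraction.
money_left = new_value - 5 * 3
print(money_left)
```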

Figure: a prompt example for symbolic reasoning.

Figure: experimental results on mathematical reasoning.

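4. Implementation

Prompt

Separate prompts that mix programming language and natural language were designed for the three task types: mathematical reasoning, symbolic reasoning, and algorithmic reasoning.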

```python
MATH_CHAT_BETA_PROMPT = '''
Let's use python to solve math problems. Here are three examples how to do it,
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?

def solution():
    """Olivia has $23. She bought five bagels for $3 each. How much money does she have left?"""
    money_initial = 23
    bagels = 5
    bagel_cost = 3
    money_spent = bagels * bagel_cost
    money_left = money_initial - money_spent
    result = money_left
    return result

Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?

def solution():
    """Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?"""
    golf_balls_initial = 58
    golf_balls_lost_tuesday = 23
    golf_balls_lost_wednesday = 2
    golf_balls_left = golf_balls_initial - golf_balls_lost_tuesday - golf_balls_lost_wednesday
    result = golf_balls_left
    return result

Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?

def solution():
    """There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?"""
    computers_initial = 9
    computers_per_day = 5
    num_days = 4  # 4 days between monday and thursday
    computers_added = computers_per_day * num_days
    computers_total = computers_initial + computers_added
    result = computers_total
    return result

How about this question?
Q: {question}
'''.strip()
```
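
The prompt ends with a `{question}` placeholder, so a new problem can be slotted in with ordinary string formatting. The snippet below is a minimal sketch of that step: `PROMPT_TEMPLATE` is a shortened, hypothetical stand-in for the full prompt above, and `build_prompt` is a name introduced here for illustration.

```python
# Shortened, hypothetical stand-in for the full few-shot prompt.
PROMPT_TEMPLATE = '''
Let's use python to solve math problems.
How about this question?
Q: {question}
'''.strip()


def build_prompt(question: str) -> str:
    """Fill the {question} placeholder with the target problem."""
    return PROMPT_TEMPLATE.format(question=question)


prompt = build_prompt(
    "A farmer has 12 cows and buys 7 more. How many cows does he have?"
)
print(prompt)
```

`str.format` treats every pair of braces as a placeholder, so this approach works only as long as the exemplar code inside the prompt contains no literal `{` or `}`. Exemplars that use dicts, sets, or f-strings would need their braces doubled or a different substitution method.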