From ReAct to CodeAct to OpenManus: The Technology Behind Manus

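Agent technology built on large language models (LLMs) is developing rapidly, and applications keep emerging. The author has already covered LLM-based single-agent and multi-agent systems in detail in two earlier articles.

As the brain of an agent, the LLM keeps improving its reasoning ability. With the arrival of reasoning models such as DeepSeek R1, LLM-based agents can reason about complex problems and plan solutions accurately. The way agents are applied is evolving as well. Manus, one representative general-purpose AI agent, attracted wide industry attention as soon as it launched. Manus can close the loop on a task end to end: for a wide range of user requests it thinks step by step, plans a solution, and carries out operations such as driving a browser and executing code. For example, given the request "filter this month's viral Douyin videos", Manus automatically plans and performs the following operations: crawl platform data, analyze engagement metrics, generate a TOP 50 list, and package the download links. The whole process needs no human intervention, delivering results from a single sentence.

Manus itself is not open source, and trying it requires an invitation code. However, judging from its published videos and the technical details disclosed by its co-founder Ji Yichao (季逸超), shown in Figure 1, it draws on CodeAct. CodeAct is an improvement on ReAct. ReAct lets an LLM think and plan step by step: at each step it reasons over the current context to choose a tool and its arguments, the tool result is appended to the context, and the LLM reasons again to choose the next tool, until it judges from the context that the final answer has been obtained or the task is complete. CodeAct goes one step further. It keeps ReAct's step-by-step thinking and planning, but instead of a single tool and its arguments per step, the LLM outputs code that can contain multiple tool calls. Ji Yichao gives two reasons. First, as LLMs keep improving, their coding ability grows stronger, which makes it more feasible for the LLM to output code directly. Second, compared with calling one tool per step, code can express more complex logic, which reduces the number of ReAct steps and shortens the context.

Following a similar approach, the MetaGPT team moved quickly to replicate Manus and open-sourced the result as OpenManus. This article therefore has four parts: the first introduces agents, the second ReAct, the third CodeAct, and the fourth OpenManus. The hope is that readers come away with a basic understanding of the technology behind Manus. Corrections and discussion are welcome.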

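Agents

At the core of an LLM-based agent is the LLM itself, which acts as the brain and handles thinking and planning. Around the LLM, the agent also has memory and tools. Memory stores short-term context and long-term knowledge. Tools act as the senses and limbs: guided by the LLM's thinking and planning, the agent uses tools both to obtain external information for further thinking and planning and to perform actions that affect the external environment.

The article "LLM Powered Autonomous Agents" uses Figure 2 to describe the overall architecture of an LLM-based agent, which contains the following core components:

Planning. The LLM, as the agent's brain, is responsible for thinking and planning, which can be divided into two parts:
- Task decomposition: for a complex task, the LLM breaks it into several simpler subtasks, each with its own subgoal, and solves them one by one;
- Self-reflection: the LLM reflects on its past plans and actions, analyzes the mistakes in them, and improves its subsequent thinking and planning to refine the final result.
Memory. Memory supplements the knowledge carried by the LLM's architecture and parameters, and can be divided into short-term and long-term memory:
- Short-term memory is the LLM's in-context learning, including the prompt, instructions, the LLM outputs of earlier steps, and tool results;
- Long-term memory is an external vector index that can be queried quickly. This is the basis of a currently popular LLM application pattern, RAG (Retrieval-Augmented Generation). The RAG flow can be summarized as follows: split the documents that contain the knowledge into chunks, embed the chunks and build a vector index over them, embed the question, retrieve the chunks most relevant to the question from the index, and finally pass the retrieved chunks together with the question to the LLM. RAG effectively mitigates the LLM's inability to extend its knowledge and the "hallucinations" caused by its knowledge limits.
Tools. Tools are the agent's senses and limbs. On the one hand, the agent can use tools to obtain external information for further thinking and planning, for example by querying a search engine for the latest information on a keyword. On the other hand, it can use tools to perform actions that affect the external environment, for example by calling an external system's API to execute a command and get the result.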
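
The RAG flow described under long-term memory (chunk, embed, index, retrieve, then prompt) can be sketched in a few lines. This is a minimal illustration only: the bag-of-words `embed` function is a stand-in assumption for a real embedding model, the sample document and question are made up, and the final prompt would be sent to an LLM that is not included here.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_into_chunks(document: str) -> list:
    """Split a document into sentence-sized chunks."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def retrieve(question: str, index: list, k: int = 1) -> list:
    """Return the k chunks whose vectors are most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical knowledge document and question.
document = (
    "ReAct interleaves reasoning and acting so a model can call tools. "
    "CodeAct lets the model emit executable Python code as its action."
)
index = [(chunk, embed(chunk)) for chunk in split_into_chunks(document)]

question = "Which approach emits executable Python code?"
context = retrieve(question, index, k=1)
prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)
```

A production system would replace `embed` with a learned embedding model and the list scan with an approximate nearest-neighbor index; the overall data flow stays the same.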

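ReAct

The 2023 paper "ReAct: Synergizing Reasoning and Acting in Language Models", from Google and Princeton University, proposed ReAct as a way for LLMs to think and plan. Chain of thought, shown on the left of Figure 3, relies only on the LLM for multi-step reasoning. It is a "Reason Only" approach: with no acting, it cannot obtain external information, is prone to hallucination, and lets errors propagate across reasoning steps. It also cannot affect the outside world, which limits what an agent can do. ReAct, shown on the right of Figure 3, likewise splits a complex problem into simpler ones and handles them step by step, but unlike chain of thought, each step combines reasoning (Reason) and acting (Action). At each step the LLM outputs both a thought in natural language and a formatted tool instruction. The tool designed in the paper is an API for keyword search over Wikipedia, so executing the instruction retrieves external information. That information becomes the observation, which is merged with the thought and the tool instruction to form the context for the next step. By combining reasoning and acting, the paper uses external information to supplement the LLM's knowledge and achieves good results on tasks such as question answering (HotpotQA) and fact verification (FEVER). The name ReAct is a blend of "Reason" and "Action".

How is ReAct implemented in practice? The right side of Figure 4 is an example from the paper. It uses prompt engineering to make the LLM think and plan step by step on a complex problem, with each step following this format for reasoning and acting: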

Thought: [reasoning]
Action: [tool instruction]
Observation: [observation]

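At each step, the LLM first outputs its reasoning after "Thought" and a tool instruction after "Action". The agent then parses the tool instruction, calls the corresponding tool, and appends the result after "Observation". The tool designed in the paper is an API for keyword search over Wikipedia, and it accepts three kinds of instructions:

Search[entity]: returns the first 5 sentences of the Wikipedia page for "entity" if the page exists; otherwise suggests the 5 most similar entities;
Lookup[string]: returns the next sentence on the page that contains "string";
Finish[answer]: ends the current task and returns "answer", once the context and the LLM's own knowledge are enough to answer it.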
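
The loop described above (generate Thought and Action, parse the Action, run the tool, append the Observation) can be sketched as follows. This is a minimal sketch under stated assumptions: `scripted_llm` replays canned outputs in place of a real model, `PAGES` is a made-up in-memory stand-in for the Wikipedia API, and only Search and Finish are handled (Lookup is omitted for brevity).

```python
import re

# Hypothetical in-memory "Wikipedia" so the sketch runs offline.
PAGES = {
    "Apple Remote": "The Apple Remote was originally designed to control the Front Row media center program.",
    "Front Row": "Front Row can be controlled by an Apple Remote or the keyboard function keys.",
}


def search(entity: str) -> str:
    """Stand-in for the Search[entity] tool."""
    return PAGES.get(entity, f"Could not find {entity}.")


def scripted_llm(context: str) -> str:
    """Stand-in for a real LLM: picks a canned reply based on how many observations exist."""
    script = [
        "Thought: I need to search Apple Remote.\nAction: Search[Apple Remote]",
        "Thought: I need to search Front Row next.\nAction: Search[Front Row]",
        "Thought: Front Row is also controlled by keyboard function keys.\n"
        "Action: Finish[keyboard function keys]",
    ]
    return script[context.count("Observation:")]


ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")


def react(question: str, llm, max_steps: int = 5):
    """Run the Thought/Action/Observation loop until Finish or the step limit."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(context)
        context += output + "\n"
        match = ACTION_RE.search(output)
        if match is None:
            return None  # the model broke the expected format
        name, argument = match.groups()
        if name == "Finish":
            return argument
        observation = search(argument) if name == "Search" else f"Unknown action {name}"
        context += f"Observation: {observation}\n"
    return None


answer = react("What other device can control Front Row?", scripted_llm)
print(answer)
```

The key design point is that the agent code, not the model, owns the loop: the model only ever sees a growing text context and emits the next Thought/Action pair.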

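In the example on the right of Figure 4, the original question is "Aside from the Apple Remote, what other device can control the program Apple Remote was originally designed to interact with?". In the first step, the LLM's thought and tool instruction are "I need to search Apple Remote and find the program it was originally designed to interact with." and "Search[Apple Remote]", that is, retrieve sentences about "Apple Remote" and find the program it interacts with. The agent calls the tool API, appends the retrieved information after "Observation", and merges it with the first step's thought and tool instruction as the context for the second step. In the second step, the LLM's thought and tool instruction are "Apple Remote was originally designed to control the Front Row media center program. I need to search Front Row next and find what other device can control it." and "Search[Front Row]", that is, retrieve sentences about "Front Row" and find the devices that can control it. After 4 steps in total, the LLM outputs the correct answer, "keyboard function keys", which can also control the program the Apple Remote was originally designed to interact with.

How is the LLM made to work through the original question step by step and to output its thought and tool instruction in a fixed format at each step? The paper's approach is to add several examples to the prompt to guide the LLM to follow the ReAct pattern on subsequent questions. Figure 5 shows two of the examples recorded in the paper's appendix.

LLM development frameworks such as LangChain now support ReAct-style step-by-step thinking and planning. For a completion-style LLM, the usual approach is to add descriptions of the tools and the instruction format to the prompt, plus an instruction telling the model to output its thought and tool instruction step by step in that format. For a chat model, the usual approach relies on the model's "Function Calling" capability: besides the conversation messages, the request also carries descriptions of several candidate functions, and the chat model returns the function to call and its arguments. The hands-on part below describes how to implement ReAct-style thinking and planning in LangChain. ReAct has become a common way to solve complex problems with LLMs and external tools.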
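
To make the "Function Calling" path concrete, the sketch below shows the two halves the agent code is responsible for: describing the available functions, and dispatching the call the model returns. The description layout loosely follows the JSON-schema style used by several chat APIs but is an assumption rather than any specific vendor's format, `get_weather` is a hypothetical tool that returns canned data, and `model_reply` is hard-coded in place of a real model response.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Registry mapping function names to Python callables.
TOOLS = {"get_weather": get_weather}

# Descriptions sent to the chat model alongside the conversation messages.
TOOL_DESCRIPTIONS = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]


def dispatch(call: dict) -> str:
    """Execute the function the model asked for, with its JSON-encoded arguments."""
    function = TOOLS[call["name"]]
    arguments = json.loads(call["arguments"])
    return function(**arguments)


# Hard-coded stand-in for what a chat model might return.
model_reply = {"name": "get_weather", "arguments": '{"city": "San Francisco"}'}
result = dispatch(model_reply)
print(result)
```

The returned `result` would then be appended to the conversation as a tool message so the model can use it in its next turn.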

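An agent could previously be initialized with the "initialize_agent" method, by specifying the agent type and passing in the tools and the model, as in the following partial code: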

from langchain.agents import AgentType, initialize_agent

# tools and model are assumed to be defined elsewhere (partial snippet)
agent = initialize_agent(tools, model, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                         handle_parsing_errors=True, verbose=True)

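The current recommendation is to initialize an agent with the "create" method for its agent type. For example, an agent modeled on the MRKL system that thinks and plans in the ReAct style is initialized as follows (partial code):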

from langchain import hub
from langchain.agents import create_react_agent

prompt = hub.pull("hwchase17/react")
agent = create_react_agent(model, tools, prompt)
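
With "initialize_agent", the agent uses the prompt template built into each agent type. With the "create" methods, a prompt template must be supplied. LangChain manages prompts, chains, and agents shared by developers centrally through LangChainHub (now integrated into LangSmith), so the required prompt template can be pulled from LangChainHub. The prompt template at "hwchase17/react" in the example above reads as follows: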

Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}

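It matches the prompt template built into "langchain.agents.mrkl.base.ZeroShotAgent". In other words, it is an agent modeled on the MRKL system: the prompt describes the tools and the instruction format, and instructs the model to output its thought and tool instruction step by step in that format, which guides the LLM to think and plan in the ReAct style.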
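
To see how the placeholders in such a template get filled at each step, here is a minimal sketch that uses plain `str.format` on a shortened copy of the template. It is an approximation of the idea, not LangChain's actual rendering code: the `name: description` tool layout and the way the scratchpad is concatenated are assumptions, and the tool and the intermediate step are made up.

```python
# Shortened version of the template above, keeping the same placeholders.
TEMPLATE = (
    "Answer the following questions as best you can. "
    "You have access to the following tools:\n{tools}\n"
    "Action: the action to take, should be one of [{tool_names}]\n"
    "Begin!\n"
    "Question: {input}\n"
    "Thought:{agent_scratchpad}"
)

# Hypothetical tool registry: name -> description.
tools = {"search": "Search the web for up-to-date information."}


def render(question: str, steps: list) -> str:
    """Fill the template; `steps` is a list of (model_output, observation) pairs."""
    scratchpad = ""
    for model_output, observation in steps:
        # Each finished step is replayed, then "Thought:" invites the next one.
        scratchpad += f"{model_output}\nObservation: {observation}\nThought:"
    return TEMPLATE.format(
        tools="\n".join(f"{name}: {desc}" for name, desc in tools.items()),
        tool_names=", ".join(tools),
        input=question,
        agent_scratchpad=scratchpad,
    )


question = "what's the weather of San Francisco?"
first_prompt = render(question, [])
second_prompt = render(
    question,
    [(" I should look this up.\nAction: search\nAction Input: San Francisco weather", "Foggy, 15C")],
)
print(second_prompt)
```

Both prompts end with "Thought:", which is why the model continues by writing its next thought: on the first call there is no history, and on later calls the scratchpad carries every previous Thought/Action/Observation.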

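Example 1

This example uses Tongyi Qianwen as the LLM and "Tavily Search" as the tool to build an MRKL-style agent that thinks and plans in the ReAct style. For the question "what's the weather of San Francisco?", it first queries the weather with "Tavily Search" and then answers based on the result. The code is as follows: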

import os
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_community.llms import Tongyi
from langchain_community.tools.tavily_search import TavilySearchResults

LANGCHAIN_API_KEY = 'xxx'
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = LANGCHAIN_API_KEY

DASHSCOPE_API_KEY = 'xxx'
os.environ["DASHSCOPE_API_KEY"] = DASHSCOPE_API_KEY

TAVILY_API_KEY = 'xxx'
os.environ["TAVILY_API_KEY"] = TAVILY_API_KEY

prompt = hub.pull("hwchase17/react")
model = Tongyi()
search = TavilySearchResults(max_results=2)
tools = [search]
agent = create_react_agent(model, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools)

agent_executor.invoke(
    {
        "input": "what's the weather of San Francisco?",
    }
)

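After running the code, LangSmith shows that in the first step the model decides it needs a tool and outputs the tool name and tool input in the required format, as shown in Figure 6.

The agent then calls the tool to query the weather and appends the result to the prompt as the observation. In the second step, the model decides that the prompt already contains the final answer and returns it, as shown in Figure 7.

Example 2

Building on Example 1, this example adds "ConversationBufferMemory" to remember the conversation history between the human and the AI, and includes that history in the prompt to simulate a conversation. The code is as follows: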

import os
from langchain.agents import AgentType
from langchain.agents import initialize_agent
from langchain.memory import ConversationBufferMemory
from langchain_community.llms import Tongyi
from langchain_community.tools.tavily_search import TavilySearchResults

LANGCHAIN_API_KEY = 'xxx'
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = LANGCHAIN_API_KEY

DASHSCOPE_API_KEY = 'xxx'
os.environ["DASHSCOPE_API_KEY"] = DASHSCOPE_API_KEY

TAVILY_API_KEY = 'xxx'
os.environ["TAVILY_API_KEY"] = TAVILY_API_KEY

model = Tongyi()
search = TavilySearchResults(max_results=2)
tools = [search]
memory = ConversationBufferMemory(memory_key="chat_history")
agent = initialize_agent(tools, model, agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION, verbose=True, memory=memory)

agent.run(input = "hi im bob! and i live in sf")
agent.run(input = "whats the weather where I live?")

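For the first human message, the model answers directly, as shown in Figure 8.

For the second human message, the model identifies the user's location from the conversation history. The rest matches Example 1: it thinks and plans in two steps, first using the tool to get the weather and then answering based on it, as shown in Figures 9 and 10.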

CodeAct

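The previous section introduced ReAct, which makes the LLM think step by step, writing its reasoning after "Thought:" and the action to execute after "Action:", where the action is in text or JSON format. The 2024 paper "Executable Code Actions Elicit Better LLM Agents", from Apple and the University of Illinois, builds on ReAct and proposes CodeAct. Like ReAct, CodeAct makes the LLM think step by step. Unlike ReAct, it has the LLM output executable Python code as the action. As Figure 11 shows, ReAct is an "LLM Agent using [Text/JSON] as Action" and CodeAct is an "LLM Agent using [Code] as Action".

The task in Figure 11 is "Determine the most cost-effective country to purchase the smartphone model "CodeAct 1". The countries to consider are the USA, Japan, Germany, and India". The available tools are:

lookup_rates(country:str)->(float, float): looks up the exchange rate and tax rate for a country;
convert_and_tax(price:float, exchange_rate:float, tax_rate:float)->float: computes the tax-inclusive price in US dollars from the local price, exchange rate, and tax rate;
estimate_final_price(converted_price:float, shipping_cost:float)->float: computes the final price in US dollars from the tax-inclusive dollar price and the shipping cost;
lookup_phone_price(model:str, country:str)->float: looks up the local price for a smartphone model and country;
estimate_shipping_cost(destination_country:str)->float: estimates the shipping cost for a destination country.

The left side of Figure 11 is the ReAct example. Given the task and tools above, the LLM thinks step by step and repeats the same sequence for each country: look up the exchange rate and tax rate, look up the local price, compute the tax-inclusive dollar price, estimate the shipping cost, and compute the final dollar price. It then compares the final dollar prices across countries and concludes that buying the "CodeAct 1" in Japan is the most cost-effective, at a final price of 904 US dollars. Reaching this conclusion takes ReAct 5*4+1=21 steps in total. The right side of Figure 11 is the CodeAct example. Given the same task and tools, the LLM also thinks step by step, but its first step directly generates the following executable code: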

countries = ['USA', 'Japan', 'Germany', 'India']
final_prices = {}

for country in countries:
    exchange_rate, tax_rate = lookup_rates(country)
    local_price = lookup_phone_price("CodeAct 1", country)
    converted_price = convert_and_tax(
        local_price, exchange_rate, tax_rate
    )
    shipping_cost = estimate_shipping_cost(country)
    final_price = estimate_final_price(converted_price, shipping_cost)
    final_prices[country] = final_price

most_cost_effective_country = min(final_prices, key=final_prices.get)
most_cost_effective_price = final_prices[most_cost_effective_country]
print(most_cost_effective_country, most_cost_effective_price)

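The code above expresses ReAct's multi-step logic directly through control flow and data flow. For example, ReAct has to compute the final dollar price country by country over many steps, whereas CodeAct expresses the same semantics with a for loop in the generated Python code. CodeAct can also reuse library functions for decision logic in the generated code, for example using min to pick the country with the lowest final dollar price.

After the code runs and prints its result, the LLM can state its conclusion based on the output: buying the "CodeAct 1" in Japan is still the most cost-effective, at a final price of 904 US dollars. CodeAct needs only 2 steps to reach this conclusion, far fewer than ReAct, which makes it much more efficient.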
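
To run the generated code end to end, the five tools need implementations. The sketch below supplies stub versions backed by made-up exchange rates, tax rates, prices, and shipping costs (none of these numbers come from the paper), and adds a counter to show how many tool calls a single code action performs.

```python
from functools import wraps

call_count = 0  # number of tool calls made by the code action


def tool(func):
    """Decorator that counts every tool invocation."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        global call_count
        call_count += 1
        return func(*args, **kwargs)
    return wrapper


# Hypothetical data: (exchange rate to USD, tax rate), local price, shipping cost.
RATES = {"USA": (1.0, 0.08), "Japan": (0.0067, 0.10), "Germany": (1.08, 0.19), "India": (0.012, 0.18)}
PRICES = {"USA": 999.0, "Japan": 120000.0, "Germany": 950.0, "India": 80000.0}
SHIPPING = {"USA": 0.0, "Japan": 20.0, "Germany": 25.0, "India": 30.0}


@tool
def lookup_rates(country):
    return RATES[country]


@tool
def lookup_phone_price(model, country):
    return PRICES[country]


@tool
def convert_and_tax(price, exchange_rate, tax_rate):
    return price * exchange_rate * (1 + tax_rate)


@tool
def estimate_shipping_cost(destination_country):
    return SHIPPING[destination_country]


@tool
def estimate_final_price(converted_price, shipping_cost):
    return converted_price + shipping_cost


# The code action from the example above.
countries = ["USA", "Japan", "Germany", "India"]
final_prices = {}
for country in countries:
    exchange_rate, tax_rate = lookup_rates(country)
    local_price = lookup_phone_price("CodeAct 1", country)
    converted_price = convert_and_tax(local_price, exchange_rate, tax_rate)
    shipping_cost = estimate_shipping_cost(country)
    final_prices[country] = estimate_final_price(converted_price, shipping_cost)

most_cost_effective_country = min(final_prices, key=final_prices.get)
print(most_cost_effective_country, round(final_prices[most_cost_effective_country], 2), call_count)
```

With these stub values the loop makes 20 tool calls inside one action, which is the work ReAct spreads over 20 separate thought steps.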

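In summary, the advantages of CodeAct are:

It is integrated with a Python interpreter, so CodeAct can output code, execute it, and use the execution result to revise earlier actions and output new ones.
The code CodeAct outputs can use existing third-party libraries, which expands the available action space.
Code data is already widely used in LLM pre-training, so having the LLM output Python code directly is feasible and far more accurate than before.
ReAct requires a predefined JSON or text format and can output and execute only one action per step. CodeAct can output code in a single step, pass intermediate results between tools through data flow, and combine multiple tools through control flow to implement complex logic that would take ReAct many steps.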

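The paper uses the table in Figure 12 to summarize CodeAct's advantages over ReAct:

Availability of data. CodeAct relies on the LLM's ability to output code, and large amounts of code data can be used in pre-training to build that ability. ReAct relies on the LLM outputting actions in a specific format, which requires data annotated in that format.
Complex operations. CodeAct's code can combine multiple tools through data flow and control flow to implement complex logic. ReAct needs new tools to be designed for logic such as conditionals.
Availability of tools. CodeAct's code can use existing third-party libraries directly. ReAct requires tools to be hand-picked from existing software.
Automated feedback. CodeAct can use the result of code execution, or the exception stack trace, as feedback.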
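
The automated-feedback point can be illustrated with a small executor that runs a code action and returns either its printed output or the exception traceback as the observation. This is a sketch only: it uses the built-in `exec` with no sandboxing (an assumption made for brevity; a real agent should isolate execution), and the two code actions are hand-written stand-ins for model output.

```python
import contextlib
import io
import traceback


def run_code_action(code: str, namespace: dict) -> str:
    """Execute a code action and return stdout, plus the traceback if it failed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # unsandboxed: for illustration only
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()


# Variables persist across actions because the namespace is shared.
namespace = {}

first_action = (
    "values = ['1.5', '?', '2.5']\n"
    "total = sum(float(v) for v in values)\n"
    "print(total)"
)
first_observation = run_code_action(first_action, namespace)

# A follow-up action that reacts to the ValueError reported in the first observation.
second_action = (
    "total = sum(float(v) for v in values if v != '?')\n"
    "print(total)"
)
second_observation = run_code_action(second_action, namespace)
print(second_observation)
```

Feeding `first_observation` back into the context is what lets the model see the `ValueError` and repair its own code in the next step, with no task-specific error handling written by the developer.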

CodeAct

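Figure 13 shows a general framework for multi-turn interaction with an LLM-based agent. It has three roles: the agent, the user, and the environment. An interaction is defined as a message exchange between the agent and an external entity (the user or the environment). In each turn, the agent receives input from the user, such as a natural-language instruction, or an observation from the environment, such as a code execution result. It thinks and plans using chain of thought (CoT), then returns output to the user, such as a natural-language result, or the next action to the environment, such as code. In CodeAct, the action the agent sends to the environment is a piece of Python code, and what it receives back is the result of executing that code.

The following is an example CodeAct system prompt. It uses OpenAI's ChatML format and instructs the LLM to act as an assistant that can output Python code, with the code enclosed between "<execute>" and "</execute>" tags.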

<|im_start|>
system
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. The code should be enclosed using "<execute>" tag, for example: <execute> print("Hello World!") </execute>. The assistant should attempt fewer things at a time instead of putting too much code in one <execute> block. The assistant can install packages through PIP by <execute> !pip install [package needed] </execute> and should always import packages and define variables before starting to use them. The assistant should stop <execute> and provide an answer when they have already obtained the answer from the execution result. Whenever possible, execute the code for the user using <execute> instead of providing it. The assistant's response should be concise, but do express their thoughts.
<|im_end|>
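
On the agent side, the `<execute>` convention in this prompt means each model reply has to be scanned for code blocks before anything is run. A minimal sketch of that extraction step follows; the regular expression and the sample replies are illustrative assumptions, not code from the paper.

```python
import re

# Non-greedy match so several <execute> blocks in one reply are kept separate.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def extract_code(reply: str) -> list:
    """Return the code inside every <execute>...</execute> block of a reply."""
    return [block.strip() for block in EXECUTE_RE.findall(reply)]


reply_with_code = "I will compute it. <execute> print(6 * 7) </execute>"
reply_without_code = "The answer is 42."

code_blocks = extract_code(reply_with_code)
print(code_blocks)
```

A reply with no `<execute>` block, such as `reply_without_code`, yields an empty list, which the agent can treat as the final answer for the user rather than an action for the environment.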

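The paper first evaluates CodeAct experimentally in a simple tool-use setting, where only one tool is used. It has the LLM output tool calls in CodeAct, JSON, and text formats; examples of the three formats are shown in Figure 14.

The paper uses the API-Bank dataset, which contains questions, tool sets, and prompts. The LLM is instructed to think and plan step by step and to output one tool call per step in one of three formats: CodeAct (in the form of a Python function call), JSON, or text. The paper then applies API-Bank's correctness metric, which compares the tool call the LLM outputs under each format with the ground truth.

The results are shown in Figure 15. Each row gives the accuracy of one LLM under the three formats; the first rows are open-source LLMs and the later rows are closed-source LLMs. Comparing the three formats, CodeAct has the highest accuracy for 8 of the 17 LLMs overall, ahead of JSON and text. CodeAct does better among open-source LLMs, with the highest accuracy for 4 of the 8 open-source models. JSON has lower accuracy on open-source LLMs and higher accuracy on closed-source LLMs, which suggests the closed-source models were specifically fine-tuned on JSON. For fine-tuning open-source LLMs, more training data is available for code than for JSON or predefined text formats, so improving open-source tool use through CodeAct may be a better option than the alternatives.

The paper then evaluates CodeAct in a complex tool-use setting, where multiple tools are used. Python code output by CodeAct can combine multiple tools through data flow and control flow, and the experiment measures how much this helps in complex tool-use scenarios.

The paper notes that no existing dataset of complex tasks requires combining multiple tools in a way that would allow different tool-call formats to be compared. It therefore builds a new dataset, M³ToolEval, whose name comes from "multiple calls to multiple tools in multi-turn interactions": it evaluates an LLM's ability to handle complex tasks by calling multiple tools multiple times across multiple turns. Its tasks include web browsing, financial analysis, travel planning, scientific research, and information processing, and each kind of task can be completed by using several tools. The dataset's system prompt is as follows: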

You have access to the following tools:
{{Tool Definition}}
{{Formatting Instruction}}
Now, let's get started!
Instruction: {{Example: Find the current price of Legendary Wand.}}
Answer in the format of 'xx.xx' (e.g., 12.34).
You can optionally express your thoughts using natural language before your action. For example, 'Thought: I want to use tool_name to do something. Action: <your action to call tool_name> End Action'.
Note that your output should always contain either 'Action:' or 'Answer:', but not both.
When you are done, output the result using 'Answer: your answer'
Please ONLY output the answer (e.g., single number), without any other text.

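The "{{...}}" parts are replaced with concrete information. "{{Tool Definition}}" is replaced with the available tool set. The example below lists commands for browser operations (clicking a link, going back to the previous page, scrolling up, scrolling down, and so on) and for a calculator: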

[1] click_url: Clicks on a URL. A clickable URL looks like [Clickable '<url_argument>'] in the webpage.
Arguments: url (str).
Returns the rendered content of the webpage after clicking the URL showing on the current rendered page.
Signature: click_url(url: str) -> str
[2] go_to_previous_page: Goes back to the previous page. It has no arguments.
After going back to the previous page, return the rendered content of the webpage.
Signature: go_to_previous_page() -> str
[3] scroll_down: Scrolls down the view. It has no arguments.
Returns the rendered content of the webpage after scrolling down.
Signature: scroll_down() -> str
[4] scroll_up: Scrolls up the view. It has no arguments.
Returns the rendered content of the webpage after scrolling up.
Signature: scroll_up() -> str
[5] view: Return the current view in string format of the rendered webpage. It has no arguments.
Returns the rendered content of the webpage.
You should call this when you want to see the rendered content of the current webpage.
Signature: view() -> str
[6] calculator: Evaluates the given expression and returns the result. Accepts a calculation expression as input. For example, "2 + (3 * 4)" will return 14.
Signature: calculator(expression: str) -> float

"{{Formatting Instruction}}" is replaced with the tool-call format. The instruction for code output looks like this:

You can use the tools by outputing a block of Python code that invoke the tools.
You may use for-loops, if-statements, and other Python constructs when necessary.
Be sure to print the final answer at the end of your code.
You should begin your tool invocation with 'Action:' and end it with 'End Action'.
Example: 'Action:
tool_name(argument_1)
End Action'

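The paper notes that its prompt is deliberately simple. It only tells the LLM that it can use multiple tools and must give its thought and tool call in a certain format, with no few-shot examples, so the test measures the LLM's zero-shot tool-use ability.

The paper evaluates different LLMs on M³ToolEval with each tool-call format. During evaluation the LLM thinks over multiple steps, up to a maximum of 10, and at each step either calls a tool or gives the final answer. There are two metrics. Accuracy is the rate at which the tools the LLM calls at each step and the final answer it gives match the ground truth; higher is better. Average steps is the average number of steps the LLM needs to handle a task; lower is better.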
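
The two metrics can be computed from per-task evaluation records along the following lines. This is a simplified sketch: it assumes one record per task with hypothetical `correct` and `steps` fields, scores only the final answer against the ground truth, and uses made-up sample records rather than any results from the paper.

```python
MAX_STEPS = 10  # step budget per task, as described above

# Hypothetical evaluation records: one per task.
records = [
    {"task": "web_browsing_1", "correct": True, "steps": 3},
    {"task": "travel_planning_1", "correct": False, "steps": 10},
    {"task": "finance_1", "correct": True, "steps": 5},
    {"task": "science_1", "correct": False, "steps": 10},
]


def accuracy(rows: list) -> float:
    """Fraction of tasks whose final answer matched the ground truth."""
    return sum(1 for row in rows if row["correct"]) / len(rows)


def average_steps(rows: list) -> float:
    """Mean number of steps used per task, capped at the step budget."""
    return sum(min(row["steps"], MAX_STEPS) for row in rows) / len(rows)


print(accuracy(records), average_steps(records))
```

Tasks that are never solved consume the full budget of 10 steps, so a format that solves more tasks tends to improve both numbers at once.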

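The results are shown in Figure 16. Each row gives the accuracy and average number of steps for one LLM under the three formats; the first rows are open-source LLMs and the later rows are closed-source LLMs. Across the 17 LLMs overall, CodeAct achieves higher accuracy than the JSON and text formats. On the closed-source gpt-4-1106-preview, CodeAct reaches the highest accuracy (74.4%). Compared with the text format, which has the second-highest accuracy on gpt-4-1106-preview, accuracy rises from 53.7% to 74.4% while the average number of steps drops from 7.7 to 5.5. CodeAct's results on open-source LLMs still lag behind those on closed-source LLMs. (The paper was published in June 2024, before DeepSeek R1 was released; if DeepSeek R1 were included in the CodeAct evaluation, the open-source results would likely improve.) Among the open-source LLMs used in the paper, the best is lemur-70b-chat-v1, with an accuracy of only 13.4%. This suggests that, at the time of publication, open-source LLMs still fell somewhat short of closed-source LLMs in instruction following on complex tasks.

Based on this analysis, the paper fine-tunes the open-source LLMs LLaMA-2 and Mistral-7B to build CodeActAgent, an agent that thinks in the CodeAct style. Across multiple steps it can output code and use third-party Python libraries to solve a problem incrementally, as shown in Figure 17. The user's task is to download a dataset and build a regression model that predicts a car's fuel efficiency (miles per gallon, MPG). CodeActAgent generates Python code that uses Pandas to download the dataset, drop null values, and select the features and label, then uses Scikit-Learn to split the data into training and test sets, train a regression model on the training set, predict labels for the test set, and compute the mean squared error (MSE) and R2 score on the test set. CodeActAgent can use the execution result and the exception stack trace of its Python code as feedback for later thinking and planning. In the Figure 17 example, the generated code raises "ValueError: could not convert string to float: '?'" when executed. From this feedback, CodeActAgent concludes in its second step that the dataset contains the invalid character "?". It regenerates the code, adding a step that replaces "?" with a null value before dropping nulls. On the second run, the model trains and predicts successfully.

CodeActAgent can also use user input as feedback for later thinking and planning. In the Figure 17 example, the user goes on to ask for a visualization, and CodeActAgent responds by generating Python code that draws a bar chart with Matplotlib.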

CodeActInstruct

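To improve open-source LLMs' ability to think over multiple steps in the CodeAct style, the paper builds a dataset, CodeActInstruct, for fine-tuning them. The CodeActAgent designed in the paper is based on the open-source LLMs LLaMA-2 and Mistral-7B, which are fine-tuned. The fine-tuning data contains both CodeActInstruct and general conversation data, so that multi-step CodeAct ability improves without harming the LLM's general conversational ability.

The paper considers 4 use cases for agent-environment interaction and adapts 5 corresponding datasets to generate multi-step data. The 4 use cases are:

Information seeking. The paper uses a subset of the training set of HotpotQA, a multi-hop reasoning dataset, to generate multi-step data for information seeking. The data uses Wikipedia, exposed as a Python function named wikipedia_search, to look up information and answer questions.
Software package (tool) usage. The paper uses the training sets of the code generation dataset APPS and the math dataset MATH. For APPS, the code generation tasks already involve importing packages and creating tools by defining Python functions. For MATH, the paper instructs the LLM in context to import packages while solving problems, for example importing sympy for symbolic math.
External memory. The paper uses a subset of the training set of the table question answering dataset WikiTableQuestion and converts it into two kinds of table reasoning tasks that require accessing external memory. In the SQL-based variant, the LLM imports Sqlite3, queries a relational database with SQL, and answers from the query result. In the Pandas-based variant, the LLM imports Pandas, loads the data into a Pandas table object, and manipulates it (for example by selecting and filtering) to answer the question.
Robot planning. The paper uses ALFWorld, an embodied environment simulator, to generate data in which household tasks are completed through multiple calls to a robot control API exposed as Python functions. The paper instructs the LLM in context to implement multi-step operations with loops and branches, for example searching several locations for an item.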

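The paper down-samples these datasets, keeping the relatively more complex tasks or questions, and prompts closed-source LLMs to generate multi-step interaction trajectories for them. Except for code generation, the paper uses OpenAI's gpt-3.5-turbo-0613 and Anthropic's claude-2 to generate trajectories. For code generation, debugging information requires a longer context, so the paper uses the long-context version of GPT-3.5, gpt-3.5-turbo-0613-16k. For code generation the paper also instructs the LLM in context to test its code against test cases before submitting the final answer. For tasks or questions that none of these models can solve, the paper additionally uses gpt-4-0613 to generate trajectories.

The paper then filters the generated trajectories. There are three filtering rules:

Code-as-Actions. The paper excludes trajectories that do not use code as actions, for example where the LLM outputs a wrong API call or an action that cannot be parsed and executed as code.
Self-Improving. The paper selects trajectories in which the LLM initially hits an error but later corrects it and reaches the right answer. These trajectories are used to improve the LLM's ability to self-improve.
Instruction-Following. The paper excludes trajectories that do not follow the instructions.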
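
The three rules can be pictured as a filter over recorded trajectories. The sketch below is a rough illustration only: the `Turn` and `Trajectory` structures and their fields are hypothetical, the sample trajectories are made up, and treating all three rules as hard filters is a simplifying assumption about how the paper applies them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    action_is_valid_code: bool  # the action parsed and ran as code with valid API calls
    raised_error: bool          # executing the action produced an error


@dataclass
class Trajectory:
    name: str
    turns: list
    final_answer_correct: bool
    followed_instructions: bool


def keep(trajectory: Trajectory) -> bool:
    """Apply the three rules: code-as-actions, self-improving, instruction-following."""
    code_as_actions = all(turn.action_is_valid_code for turn in trajectory.turns)
    self_improving = (
        any(turn.raised_error for turn in trajectory.turns)
        and trajectory.final_answer_correct
    )
    return code_as_actions and self_improving and trajectory.followed_instructions


# Hypothetical trajectories.
candidates = [
    Trajectory("recovers_from_error", [Turn(True, True), Turn(True, False)], True, True),
    Trajectory("bad_api_call", [Turn(False, False), Turn(True, False)], True, True),
    Trajectory("never_recovers", [Turn(True, True), Turn(True, True)], False, True),
    Trajectory("ignores_instructions", [Turn(True, True), Turn(True, False)], True, False),
]

selected = [t.name for t in candidates if keep(t)]
print(selected)
```

Only the trajectory that hits an error, repairs it, and still follows the instructions survives, which is the kind of example that teaches a model to use execution feedback.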

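In the end the paper selects 411 trajectories generated by gpt-4-0613 and 6728 generated by gpt-3.5 and claude, which together make up CodeActInstruct. The number of trajectory samples and tokens generated from each dataset is shown in the CodeActInstruct part of Figure 18.

The paper fine-tunes LLaMA-2 and Mistral-7B on the CodeActInstruct dataset and the general conversation datasets in Figure 18; the fine-tuned models are called CodeActAgent. Figure 19 compares CodeActAgent with other LLMs across datasets, with each row giving one LLM's metrics. On Code-as-Actions datasets, CodeActAgent outperforms the other open-source LLMs. CodeActAgent (LLaMA-2) and CodeActAgent (Mistral-7B) beat the other open-source LLMs on MINT. On M³ToolEval, CodeActAgent (Mistral-7B) beats other open-source LLMs of the same size and comes close to the larger open-source model in Figure 16 (the 70B lemur-70b-chat-v1). On Text-as-Actions datasets, CodeActAgent is roughly on par with the other open-source LLMs. CodeActAgent (LLaMA-2), which was not optimized for text actions, matches AgentLM-7B, which was fine-tuned on text actions, on MiniWob++. On general tasks, CodeActAgent keeps or improves on the original performance, except that CodeActAgent (Mistral-7B) scores slightly lower on MMLU than its base model, Mistral-7B, did before fine-tuning.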

OpenManus

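OpenManus is an open-source AI agent framework developed by the MetaGPT team and released in March 2025. It aims to replicate and improve on the core features of the commercial product Manus. Through modular design and coordination across multiple tools, it offers an agent solution that needs no invitation code and can be deployed locally, supporting the automation of complex tasks such as code generation, data analysis, and web interaction.

The core features of OpenManus include:

Local execution. A preconfigured language model (such as GPT-4o or Claude 3.5) can be invoked directly from the terminal to generate code, analyze data, and operate web pages, without depending on a cloud service.
Coordination across tools. It integrates a Python code executor, a browser automation tool, a file handling system, and a web search tool to build task flows that span tools. For example, when generating an HTML report automatically, it can use web search to collect data and then call the code executor to process it.
Transparent task planning. It decomposes complex tasks based on the ReAct and CodeAct frameworks (a reason-act loop), shows its reasoning and progress in real time through visual logs, and lets the user step in to adjust the flow. For example, for an "analyze website SEO" task it performs data scraping, cleaning, and report generation in sequence.
Flexible, extensible architecture. It uses a layered modular design with components such as the base agent (BaseAgent) and the ReAct agent (ReActAgent), which makes it quick to integrate new tools. Developers can switch language models or add custom tools (such as an e-commerce price comparison plugin) through configuration files.

Application scenarios for OpenManus include:

Office automation: generating weekly reports, processing Excel data, organizing meeting minutes, and so on.
Developer assistance: debugging code, planning project architecture, automated testing, and so on.
Business analysis: crawling web data, comparing prices across platforms, generating market research reports, and so on.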

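Source code walkthrough

Core class diagram

The OpenManus repository is at github.com/mannaandpoe... . Its core classes cover the agents, the tools, and the LLM. Figure 20 lists only 5 core OpenManus agents (not all of them). They form a layered structure in which each level inherits from the one above, and they are also associated with the tools and the LLM. The following sections describe each of these agents in detail.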

@startuml

class BaseAgent {
+ run
+ {abstract} step
}

class ReActAgent {
+ step
+ {abstract} think
+ {abstract} act
}

class ToolCallAgent {
+ think
+ act
}

class BrowserAgent {
+ think
}

class Manus {
+ think
}


BaseAgent <|-- ReActAgent
ReActAgent <|-- ToolCallAgent
ToolCallAgent <|-- BrowserAgent
BrowserAgent <|-- Manus

@enduml

BaseAgent

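The core method of the BaseAgent class (app/agent/base.py) is run, shown below. Its core logic is to call step in a loop, executing multiple steps until the maximum number of steps is reached or the state becomes FINISHED. step is an abstract method implemented by subclasses of BaseAgent: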

    async def run(self, request: Optional[str] = None) -> str:
        """Execute the agent's main loop asynchronously.

        Args:
            request: Optional initial user request to process.

        Returns:
            A string summarizing the execution results.

        Raises:
            RuntimeError: If the agent is not in IDLE state at start.
        """
        if self.state != AgentState.IDLE:
            raise RuntimeError(f"Cannot run agent from state: {self.state}")

        if request:
            self.update_memory("user", request)

        results: List[str] = []
        async with self.state_context(AgentState.RUNNING):
            while (
                self.current_step < self.max_steps and self.state != AgentState.FINISHED
            ):
                self.current_step += 1
                logger.info(f"Executing step {self.current_step}/{self.max_steps}")
                step_result = await self.step()

                # Check for stuck state
                if self.is_stuck():
                    self.handle_stuck_state()

                results.append(f"Step {self.current_step}: {step_result}")

            if self.current_step >= self.max_steps:
                self.current_step = 0
                self.state = AgentState.IDLE
                results.append(f"Terminated: Reached max steps ({self.max_steps})")
        await SANDBOX_CLIENT.cleanup()
        return "\n".join(results) if results else "No steps executed"

ReActAgent

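The ReActAgent class (app/agent/react.py) is a subclass of BaseAgent that implements step, shown below. Its core logic implements the ReAct framework: at each step it first calls think to reason and then calls act to take action. Both think and act are abstract methods implemented by subclasses of ReActAgent.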

    @abstractmethod
    async def think(self) -> bool:
        """Process current state and decide next action"""

    @abstractmethod
    async def act(self) -> str:
        """Execute decided actions"""

    async def step(self) -> str:
        """Execute a single step: think and act."""
        should_act = await self.think()
        if not should_act:
            return "Thinking complete - no action needed"
        return await self.act()
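
To see how run, step, think, and act fit together, the sketch below rebuilds the two layers in a heavily simplified form and adds a toy subclass. It is not the OpenManus code: the classes reuse the names BaseAgent and ReActAgent only for readability, a plain `finished` flag replaces the AgentState machinery, memory and logging are omitted, and `CountdownAgent` is a hypothetical subclass invented for this example.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Simplified stand-in: owns the run loop and the step budget."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.current_step = 0
        self.finished = False

    async def run(self) -> str:
        results = []
        while self.current_step < self.max_steps and not self.finished:
            self.current_step += 1
            step_result = await self.step()
            results.append(f"Step {self.current_step}: {step_result}")
        return "\n".join(results) if results else "No steps executed"

    @abstractmethod
    async def step(self) -> str:
        """Execute one step; implemented by subclasses."""


class ReActAgent(BaseAgent):
    """Simplified stand-in: one step is think followed by act."""

    @abstractmethod
    async def think(self) -> bool:
        """Decide whether an action is needed."""

    @abstractmethod
    async def act(self) -> str:
        """Execute the decided action."""

    async def step(self) -> str:
        if not await self.think():
            return "Thinking complete - no action needed"
        return await self.act()


class CountdownAgent(ReActAgent):
    """Hypothetical toy agent: acts until its counter reaches zero."""

    def __init__(self, start: int):
        super().__init__()
        self.remaining = start

    async def think(self) -> bool:
        return self.remaining > 0

    async def act(self) -> str:
        self.remaining -= 1
        if self.remaining == 0:
            self.finished = True
        return f"{self.remaining} left"


output = asyncio.run(CountdownAgent(2).run())
print(output)
```

The split mirrors the template-method pattern: the base class fixes the loop, the middle layer fixes the think-then-act order, and only the leaf class knows what thinking and acting mean.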

ToolCallAgent

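The ToolCallAgent class (app/agent/toolcall.py) is a subclass of ReActAgent that implements think and act. In think, it uses tool calling: the available tools and the message history are sent to the LLM together, the LLM reasons, and it returns the tools to call and their arguments. In act, it calls those tools based on the LLM's response and returns the results.

The code of think is shown below. It calls the ask_tool method of the LLM class with the message history and the available tools. The LLM reasons and returns the tools to call and their arguments, which are assigned to self.tool_calls.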

    async def think(self) -> bool:
        """Process current state and decide next actions using tools"""
        if self.next_step_prompt:
            user_msg = Message.user_message(self.next_step_prompt)
            self.messages += [user_msg]

        try:
            # Get response with tool options
            response = await self.llm.ask_tool(
                messages=self.messages,
                system_msgs=(
                    [Message.system_message(self.system_prompt)]
                    if self.system_prompt
                    else None
                ),
                tools=self.available_tools.to_params(),
                tool_choice=self.tool_choices,
            )
        except ValueError:
            raise
        except Exception as e:
            # Check if this is a RetryError containing TokenLimitExceeded
            if hasattr(e, "__cause__") and isinstance(e.__cause__, TokenLimitExceeded):
                token_limit_error = e.__cause__
                logger.error(
                    f"🚨 Token limit error (from RetryError): {token_limit_error}"
                )
                self.memory.add_message(
                    Message.assistant_message(
                        f"Maximum token limit reached, cannot continue execution: {str(token_limit_error)}"
                    )
                )
                self.state = AgentState.FINISHED
                return False
            raise

        self.tool_calls = tool_calls = (
            response.tool_calls if response and response.tool_calls else []
        )
        content = response.content if response and response.content else ""

        # Log response info
        logger.info(f"✨ {self.name}'s thoughts: {content}")
        logger.info(
            f"🛠️ {self.name} selected {len(tool_calls) if tool_calls else 0} tools to use"
        )
        if tool_calls:
            logger.info(
                f"🧰 Tools being prepared: {[call.function.name for call in tool_calls]}"
            )
            logger.info(f"🔧 Tool arguments: {tool_calls[0].function.arguments}")

        try:
            if response is None:
                raise RuntimeError("No response received from the LLM")

            # Handle different tool_choices modes
            if self.tool_choices == ToolChoice.NONE:
                if tool_calls:
                    logger.warning(
                        f"🤔 Hmm, {self.name} tried to use tools when they weren't available!"
                    )
                if content:
                    self.memory.add_message(Message.assistant_message(content))
                    return True
                return False

            # Create and add assistant message
            assistant_msg = (
                Message.from_tool_calls(content=content, tool_calls=self.tool_calls)
                if self.tool_calls
                else Message.assistant_message(content)
            )
            self.memory.add_message(assistant_msg)

            if self.tool_choices == ToolChoice.REQUIRED and not self.tool_calls:
                return True  # Will be handled in act()

            # For 'auto' mode, continue with content if no commands but content exists
            if self.tool_choices == ToolChoice.AUTO and not self.tool_calls:
                return bool(content)

            return bool(self.tool_calls)
        except Exception as e:
            logger.error(f"🚨 Oops! The {self.name}'s thinking process hit a snag: {e}")
            self.memory.add_message(
                Message.assistant_message(
                    f"Error encountered while processing: {str(e)}"
                )
            )
            return False

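The tools associated with ToolCallAgent by default are CreateChatCompletion and Terminate. OpenManus defines a base class for tools, BaseTool, which declares the abstract method execute for running a tool; concrete tools inherit from it and implement execute. CreateChatCompletion (app/tool/create_chat_completion.py) and Terminate (app/tool/terminate.py) both inherit from BaseTool. CreateChatCompletion creates structured output at the end of a conversation. The code of its execute method is shown below: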

    async def execute(self, required: list | None = None, **kwargs) -> Any:
        """Execute the chat completion with type conversion.

        Args:
            required: List of required field names or None
            **kwargs: Response data

        Returns:
            Converted response based on response_type
        """
        required = required or self.required

        # Handle case when required is a list
        if isinstance(required, list) and len(required) > 0:
            if len(required) == 1:
                required_field = required[0]
                result = kwargs.get(required_field, "")
            else:
                # Return multiple fields as a dictionary
                return {field: kwargs.get(field, "") for field in required}
        else:
            required_field = "response"
            result = kwargs.get(required_field, "")

        # Type conversion logic
        if self.response_type == str:
            return result

        if isinstance(self.response_type, type) and issubclass(
            self.response_type, BaseModel
        ):
            return self.response_type(**kwargs)

        if get_origin(self.response_type) in (list, dict):
            return result  # Assuming result is already in correct format

        try:
            return self.response_type(result)
        except (ValueError, TypeError):
            return result

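Terminate exists only so that the LLM can return it once it judges that its earlier output has solved the user's problem (or has failed to), signaling that the conversation can end. Its execute method contains no business logic, as shown below: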

    async def execute(self, status: str) -> str:
        """Finish the current execution"""
        return f"The interaction has been completed with status: {status}"
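
The tool pattern described above (an abstract base class with an execute method, plus concrete tools looked up by name) can be sketched as follows. This is a simplified stand-in rather than the OpenManus implementation: the real BaseTool carries more than a name and description, `WordCount` is a hypothetical tool invented for the example, and the `TOOLS` registry and `call_tool` helper are assumptions about how an agent might dispatch a tool call.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseTool(ABC):
    """Simplified stand-in for a tool base class."""

    name: str = ""
    description: str = ""

    @abstractmethod
    async def execute(self, **kwargs) -> str:
        """Run the tool with keyword arguments supplied by the LLM."""


class Terminate(BaseTool):
    name = "terminate"
    description = "End the interaction once the task is finished or cannot proceed."

    async def execute(self, status: str) -> str:
        return f"The interaction has been completed with status: {status}"


class WordCount(BaseTool):
    """Hypothetical custom tool."""

    name = "word_count"
    description = "Count the words in a piece of text."

    async def execute(self, text: str) -> str:
        return str(len(text.split()))


# Registry keyed by tool name, used to resolve a tool call from the LLM.
TOOLS = {tool.name: tool for tool in (Terminate(), WordCount())}


async def call_tool(name: str, **kwargs) -> str:
    """Look up a tool by name and execute it."""
    if name not in TOOLS:
        return f"Error: Unknown tool '{name}'"
    return await TOOLS[name].execute(**kwargs)


count = asyncio.run(call_tool("word_count", text="think then act"))
status = asyncio.run(call_tool("terminate", status="success"))
print(count, status)
```

Returning an error string for an unknown tool, instead of raising, keeps the failure visible to the model as an observation it can react to in its next step.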

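The code of ToolCallAgent's act method is shown below. It iterates over self.tool_calls and, for each tool and set of arguments returned by the LLM, calls execute_tool to run the tool and adds the result to memory (the message history) for use in the next step.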

    async def act(self) -> str:
        """Execute tool calls and handle their results"""
        if not self.tool_calls:
            if self.tool_choices == ToolChoice.REQUIRED:
                raise ValueError(TOOL_CALL_REQUIRED)

            # Return last message content if no tool calls
            return self.messages[-1].content or "No content or commands to execute"

        results = []
        for command in self.tool_calls:
            # Reset base64_image for each tool call
            self._current_base64_image = None

            result = await self.execute_tool(command)

            if self.max_observe:
                result = result[: self.max_observe]

            logger.info(
                f"🎯 Tool '{command.function.name}' completed its mission! Result: {result}"
            )

            # Add tool response to memory
            tool_msg = Message.tool_message(
                content=result,
                tool_call_id=command.id,
                name=command.function.name,
                base64_image=self._current_base64_image,
            )
            self.memory.add_message(tool_msg)
            results.append(result)

        return "\n\n".join(results)

The system prompt of ToolCallAgent is:

You are an agent that can execute tool calls

The next-step prompt is:

If you want to stop interaction, use `terminate` tool/function call.

The next-step prompt is appended after the user input and after each think-and-act step.

BrowserAgent

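The BrowserAgent class (app/agent/browser.py) is a subclass of ToolCallAgent. Its associated tools are BrowserUseTool and Terminate, and it relies on BrowserUseTool to operate the browser.

First, a few words on BrowserUseTool. The open-source project browser-use provides a Python library that lets an agent operate a browser: open a URL, select elements, and perform actions. BrowserUseTool (app/tool/browser_use_tool.py) inherits from BaseTool and wraps browser-use to provide browser operations. Its execute method is long and is not listed here. It mainly relies on browser-use to implement a range of browser operations: go_to_url, go_back, refresh, web_search, click_element, input_text, scroll_down, scroll_up, scroll_to_text, send_keys, get_dropdown_options, select_dropdown_option, extract_content, switch_tab, open_tab, close_tab, and wait. Interested readers can download the source to study the details.

BrowserAgent reuses ToolCallAgent's act method without overriding it; in BrowserAgent this amounts to calling BrowserUseTool's execute method to perform browser operations. BrowserAgent overrides think, shown below. It first calls get_browser_state, which in turn calls BrowserUseTool's get_current_state to get the browser state after the tool's last operation, including the URL, title, tabs, screenshot, elements, and coordinates. It fills this information into the next-step prompt template and then calls the parent class's think, which uses tool calling to send the available tools and the message history to the LLM for reasoning: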

    async def think(self) -> bool:
        """Process current state and decide next actions using tools, with browser state info added"""
        # Add browser state to the context
        browser_state = await self.get_browser_state()

        # Initialize placeholder values
        url_info = ""
        tabs_info = ""
        content_above_info = ""
        content_below_info = ""
        results_info = ""

        if browser_state and not browser_state.get("error"):
            # URL and title info
            url_info = f"\n   URL: {browser_state.get('url', 'N/A')}\n   Title: {browser_state.get('title', 'N/A')}"

            # Tab information
            if "tabs" in browser_state:
                tabs = browser_state.get("tabs", [])
                if tabs:
                    tabs_info = f"\n   {len(tabs)} tab(s) available"

            # Content above/below viewport
            pixels_above = browser_state.get("pixels_above", 0)
            pixels_below = browser_state.get("pixels_below", 0)

            if pixels_above > 0:
                content_above_info = f" ({pixels_above} pixels)"

            if pixels_below > 0:
                content_below_info = f" ({pixels_below} pixels)"

            # Add screenshot as base64 if available
            if self._current_base64_image:
                # Create a message with image attachment
                image_message = Message.user_message(
                    content="Current browser screenshot:",
                    base64_image=self._current_base64_image,
                )
                self.memory.add_message(image_message)

        # Replace placeholders with actual browser state info
        self.next_step_prompt = NEXT_STEP_PROMPT.format(
            url_placeholder=url_info,
            tabs_placeholder=tabs_info,
            content_above_placeholder=content_above_info,
            content_below_placeholder=content_below_info,
            results_placeholder=results_info,
        )

        # Call parent implementation
        result = await super().think()

        # Reset the next_step_prompt to its original state
        self.next_step_prompt = NEXT_STEP_PROMPT

        return result
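
The placeholder substitution in think can be tried in isolation. The snippet below is a small sketch that uses a two-line excerpt of the BrowserAgent next-step prompt template and a made-up `browser_state` dictionary; the state values are assumptions for illustration only.

```python
# Excerpt of the BrowserAgent next-step prompt template.
NEXT_STEP_PROMPT = (
    "- Current URL and page title{url_placeholder}\n"
    "- Available tabs{tabs_placeholder}\n"
)

# Hypothetical browser state, shaped like the dict that think() reads.
browser_state = {
    "url": "https://example.com",
    "title": "Example Domain",
    "tabs": [{"id": 0}],
}

url_info = (
    f"\n   URL: {browser_state.get('url', 'N/A')}"
    f"\n   Title: {browser_state.get('title', 'N/A')}"
)
tabs = browser_state.get("tabs", [])
tabs_info = f"\n   {len(tabs)} tab(s) available" if tabs else ""

# With browser state available, details are appended to each bullet.
filled = NEXT_STEP_PROMPT.format(
    url_placeholder=url_info, tabs_placeholder=tabs_info
)
# Without browser state, the placeholders collapse to empty strings.
empty = NEXT_STEP_PROMPT.format(url_placeholder="", tabs_placeholder="")
print(filled)
print(empty)
```

When no browser state is available, every placeholder is an empty string and the template degrades to plain bullet lines, which is the form of the next-step prompt seen in the Example 1 requests.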

BrowserAgent's system prompt is:

You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules.

# Input Format
Task
Previous steps
Current URL
Open Tabs
Interactive Elements
[index]<type>text</type>
- index: Numeric identifier for interaction
- type: HTML element type (button, input, etc.)
- text: Element description
Example:
[33]<button>Submit Form</button>

- Only elements with numeric indexes in [] are interactive
- elements without [] provide only context

# Response Rules
1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:
{{"current_state": {{"evaluation_previous_goal": "Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Mention if something unexpected happened. Shortly state why/why not",
"memory": "Description of what has been done and what you need to remember. Be very specific. Count here ALWAYS how many times you have done something and how many remain. E.g. 0 out of 10 websites analyzed. Continue with abc and xyz",
"next_goal": "What needs to be done with the next immediate action"}},
"action":[{{"one_action_name": {{// action-specific parameter}}}}, // ... more actions in sequence]}}

2. ACTIONS: You can specify multiple actions in the list to be executed in sequence. But always specify only one action name per item. Use maximum {{max_actions}} actions per sequence.
Common action sequences:
- Form filling: [{{"input_text": {{"index": 1, "text": "username"}}}}, {{"input_text": {{"index": 2, "text": "password"}}}}, {{"click_element": {{"index": 3}}}}]
- Navigation and extraction: [{{"go_to_url": {{"url": "https://example.com"}}}}, {{"extract_content": {{"goal": "extract the names"}}}}]
- Actions are executed in the given order
- If the page changes after an action, the sequence is interrupted and you get the new state.
- Only provide the action sequence until an action which changes the page state significantly.
- Try to be efficient, e.g. fill forms at once, or chain actions where nothing changes on the page
- only use multiple actions if it makes sense.

3. ELEMENT INTERACTION:
- Only use indexes of the interactive elements
- Elements marked with "[]Non-interactive text" are non-interactive

4. NAVIGATION & ERROR HANDLING:
- If no suitable elements exist, use other functions to complete the task
- If stuck, try alternative approaches - like going back to a previous page, new search, new tab etc.
- Handle popups/cookies by accepting or closing them
- Use scroll to find elements you are looking for
- If you want to research something, open a new tab instead of using the current tab
- If captcha pops up, try to solve it - else try a different approach
- If the page is not fully loaded, use wait action

5. TASK COMPLETION:
- Use the done action as the last action as soon as the ultimate task is complete
- Dont use "done" before you are done with everything the user asked you, except you reach the last step of max_steps.
- If you reach your last step, use the done action even if the task is not fully finished. Provide all the information you have gathered so far. If the ultimate task is completly finished set success to true. If not everything the user asked for is completed set success in done to false!
- If you have to do something repeatedly for example the task says for "each", or "for all", or "x times", count always inside "memory" how many times you have done it and how many remain. Don't stop until you have completed like the task asked you. Only call done after the last step.
- Don't hallucinate actions
- Make sure you include everything you found out for the ultimate task in the done text parameter. Do not just say you are done, but include the requested information of the task.

6. VISUAL CONTEXT:
- When an image is provided, use it to understand the page layout
- Bounding boxes with labels on their top right corner correspond to element indexes

7. Form filling:
- If you fill an input field and your action sequence is interrupted, most often something changed e.g. suggestions popped up under the field.

8. Long tasks:
- Keep track of the status and subresults in the memory.

9. Extraction:
- If your task is to find information - call extract_content on the specific pages to get and store the information.
Your responses must be always JSON with the specified format.
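
To make the response format in rule 1 concrete, here is a small sketch that parses a reply of that shape. The reply text and the helper `parse_actions` are made up for illustration; OpenManus itself dispatches actions through tool calls, so this only illustrates the JSON structure the prompt asks for.

```python
import json

# A hypothetical model reply that follows the response format in the prompt.
reply = """
{"current_state": {"evaluation_previous_goal": "Unknown - first step",
                   "memory": "0 out of 1 pages visited",
                   "next_goal": "Open the target page"},
 "action": [{"go_to_url": {"url": "https://example.com"}},
            {"extract_content": {"goal": "extract the names"}}]}
"""


def parse_actions(text):
    """Return (action_name, parameters) pairs from a reply (hypothetical helper)."""
    data = json.loads(text)
    actions = []
    for item in data["action"]:
        # The prompt requires exactly one action name per list item.
        if len(item) != 1:
            raise ValueError(f"Expected one action per item, got: {item}")
        name, params = next(iter(item.items()))
        actions.append((name, params))
    return actions


actions = parse_actions(reply)
print(actions)
```

The doubled braces in the prompt text (`{{` and `}}`) are how literal braces are escaped in a Python `str.format` template, which suggests the prompt is stored as such a template; the reply itself uses ordinary JSON braces.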

BrowserAgent's next-step prompt is:

What should I do next to achieve my goal?

When you see [Current state starts here], focus on the following:
- Current URL and page title{url_placeholder}
- Available tabs{tabs_placeholder}
- Interactive elements and their indices
- Content above{content_above_placeholder} or below{content_below_placeholder} the viewport (if indicated)
- Any action results or errors{results_placeholder}

For browser interactions:
- To navigate: browser_use with action="go_to_url", url="..."
- To click: browser_use with action="click_element", index=N
- To type: browser_use with action="input_text", index=N, text="..."
- To extract: browser_use with action="extract_content", goal="..."
- To scroll: browser_use with action="scroll_down" or "scroll_up"

Consider both what's visible and what might be beyond the current viewport.
Be methodical - remember your progress and what you've learned so far.

Manus

The Manus class (app/agent/manus.py) is a subclass of BrowserAgent. Its tools are PythonExecute, BrowserUseTool, StrReplaceEditor, and Terminate. BrowserUseTool and Terminate were introduced above; PythonExecute and StrReplaceEditor are described next. PythonExecute provides the ability to execute Python code. Its execute method is:

    async def execute(
        self,
        code: str,
        timeout: int = 5,
    ) -> Dict:
        """
        Executes the provided Python code with a timeout.

        Args:
            code (str): The Python code to execute.
            timeout (int): Execution timeout in seconds.

        Returns:
            Dict: Contains 'observation' with execution output or error message and 'success' status.
        """

        with multiprocessing.Manager() as manager:
            result = manager.dict({"observation": "", "success": False})
            if isinstance(__builtins__, dict):
                safe_globals = {"__builtins__": __builtins__}
            else:
                safe_globals = {"__builtins__": __builtins__.__dict__.copy()}
            proc = multiprocessing.Process(
                target=self._run_code, args=(code, result, safe_globals)
            )
            proc.start()
            proc.join(timeout)

            # timeout process
            if proc.is_alive():
                proc.terminate()
                proc.join(1)
                return {
                    "observation": f"Execution timeout after {timeout} seconds",
                    "success": False,
                }
            return dict(result)
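
The `_run_code` target is not listed in this article. A plausible sketch, assuming it captures standard output while executing the code, is shown below. This would be consistent with the tool description's note that only print output is visible, but it is an assumption, not the verified OpenManus implementation. The sketch runs in the current process and leaves out the child process and timeout handled by execute.

```python
import contextlib
import io


def run_code(code, result, safe_globals):
    """Execute `code` and record the captured stdout in `result` (illustrative)."""
    buffer = io.StringIO()
    try:
        # Anything the code prints goes into the buffer instead of the console.
        with contextlib.redirect_stdout(buffer):
            exec(code, safe_globals)
        result["observation"] = buffer.getvalue()
        result["success"] = True
    except Exception as exc:
        # On failure, report the error message instead of the output.
        result["observation"] = str(exc)
        result["success"] = False


result = {"observation": "", "success": False}
run_code("x = 6 * 7\nprint(x)", result, {})
print(result)
```

Because only the captured output is returned, an expression such as `x` on its own line would produce an empty observation; the generated code has to call print.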

StrReplaceEditor provides file editing: viewing a file, creating a file, replacing text, inserting text, and undoing an edit. Its execute method is:

    async def execute(
        self,
        *,
        command: Command,
        path: str,
        file_text: str | None = None,
        view_range: list[int] | None = None,
        old_str: str | None = None,
        new_str: str | None = None,
        insert_line: int | None = None,
        **kwargs: Any,
    ) -> str:
        """Execute a file operation command."""
        # Get the appropriate file operator
        operator = self._get_operator()

        # Validate path and command combination
        await self.validate_path(command, Path(path), operator)

        # Execute the appropriate command
        if command == "view":
            result = await self.view(path, view_range, operator)
        elif command == "create":
            if file_text is None:
                raise ToolError("Parameter `file_text` is required for command: create")
            await operator.write_file(path, file_text)
            self._file_history[path].append(file_text)
            result = ToolResult(output=f"File created successfully at: {path}")
        elif command == "str_replace":
            if old_str is None:
                raise ToolError(
                    "Parameter `old_str` is required for command: str_replace"
                )
            result = await self.str_replace(path, old_str, new_str, operator)
        elif command == "insert":
            if insert_line is None:
                raise ToolError(
                    "Parameter `insert_line` is required for command: insert"
                )
            if new_str is None:
                raise ToolError("Parameter `new_str` is required for command: insert")
            result = await self.insert(path, insert_line, new_str, operator)
        elif command == "undo_edit":
            result = await self.undo_edit(path, operator)
        else:
            # This should be caught by type checking, but we include it for safety
            raise ToolError(
                f'Unrecognized command {command}. The allowed commands for the {self.name} tool are: {", ".join(get_args(Command))}'
            )

        return str(result)
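
The rules stated in the tool's description (create fails if the file already exists, str_replace only applies when old_str is unique, undo_edit reverts the last edit) can be sketched with a toy in-memory editor. The class `InMemoryEditor` is hypothetical and stores content in a dictionary instead of on disk; it is not the StrReplaceEditor implementation.

```python
from collections import defaultdict


class InMemoryEditor:
    """Toy editor keeping files in a dict (hypothetical, for illustration only)."""

    def __init__(self):
        self.files = {}
        self.history = defaultdict(list)  # path -> previous contents

    def create(self, path, file_text):
        if path in self.files:
            raise ValueError(f"File already exists at: {path}")
        self.files[path] = file_text

    def str_replace(self, path, old_str, new_str=""):
        content = self.files[path]
        occurrences = content.count(old_str)
        # The replacement is only performed when old_str is unique.
        if occurrences != 1:
            raise ValueError(
                f"Expected exactly one occurrence of old_str, found {occurrences}"
            )
        self.history[path].append(content)
        self.files[path] = content.replace(old_str, new_str)

    def undo_edit(self, path):
        if not self.history[path]:
            raise ValueError(f"No edit history for: {path}")
        self.files[path] = self.history[path].pop()


editor = InMemoryEditor()
editor.create("/workspace/demo.txt", "alpha\nbeta\nalpha\n")
editor.str_replace("/workspace/demo.txt", "beta", "gamma")
print(editor.files["/workspace/demo.txt"])
```

Requiring a unique match forces the model to include enough surrounding context in old_str, which reduces the chance of editing the wrong location.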

Manus's system prompt is:

You are OpenManus, an all-capable AI assistant, aimed at solving any task presented by the user. You have various tools at your disposal that you can call upon to efficiently complete complex requests. Whether it's programming, information retrieval, file processing, or web browsing, you can handle it all.
The initial directory is: {directory}

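Manus overrides the think method and adds a check before calling the parent's think method. If the browser has not been used in the most recent messages (the last three, per the code below), the next-step prompt is: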

Based on user needs, proactively select the most appropriate tool or combination of tools. For complex tasks, you can break down the problem and use different tools step by step to solve it. After using each tool, clearly explain the execution results and suggest the next steps.

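If the browser has been used in the recent messages, the BrowserAgent next-step prompt is used instead. In practice, however, Manus ends by calling the parent class's think method, and that method unconditionally rebuilds the next-step prompt from the BrowserAgent template. As a result, the change Manus makes to the next-step prompt in its own think method never takes effect.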

The code of Manus's think method is:

    async def think(self) -> bool:
        """Process current state and decide next actions with appropriate context."""
        # Store original prompt
        original_prompt = self.next_step_prompt

        # Only check recent messages (last 3) for browser activity
        recent_messages = self.memory.messages[-3:] if self.memory.messages else []
        browser_in_use = any(
            "browser_use" in msg.content.lower()
            for msg in recent_messages
            if hasattr(msg, "content") and isinstance(msg.content, str)
        )

        if browser_in_use:
            # Override with browser-specific prompt temporarily to get browser context
            self.next_step_prompt = BROWSER_NEXT_STEP_PROMPT

        # Call parent's think method
        result = await super().think()

        # Restore original prompt
        self.next_step_prompt = original_prompt

        return result
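
Why the override has no effect can be reproduced with a stripped-down class hierarchy. The classes below are simplified, synchronous stand-ins with made-up prompt strings, not the OpenManus classes; the base think returns the prompt that would be sent to the LLM.

```python
BROWSER_NEXT_STEP_PROMPT = "browser prompt: {url_placeholder}"
MANUS_NEXT_STEP_PROMPT = "general prompt"


class ToolCallAgentSketch:
    next_step_prompt = ""

    def think(self):
        # Stand-in for the LLM call: report which prompt would be sent.
        return self.next_step_prompt


class BrowserAgentSketch(ToolCallAgentSketch):
    next_step_prompt = BROWSER_NEXT_STEP_PROMPT

    def think(self):
        # Unconditionally rebuilds the prompt from the browser template.
        self.next_step_prompt = BROWSER_NEXT_STEP_PROMPT.format(url_placeholder="")
        result = super().think()
        self.next_step_prompt = BROWSER_NEXT_STEP_PROMPT
        return result


class ManusSketch(BrowserAgentSketch):
    next_step_prompt = MANUS_NEXT_STEP_PROMPT

    def think(self):
        # Whatever is set here is overwritten one level up.
        original_prompt = self.next_step_prompt
        result = super().think()
        self.next_step_prompt = original_prompt
        return result


agent = ManusSketch()
sent = agent.think()
print(repr(sent))
```

In this sketch the general prompt is never sent: the intermediate class replaces it before the base think runs, which mirrors the behaviour described above.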

Worked Examples

Local Installation

The installation follows the method officially recommended by OpenManus. The author's machine is a MacBook Pro running macOS 12.6.2. The steps are:

Install uv.

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone the OpenManus repository.

git clone https://github.com/mannaandpoem/OpenManus.git
cd OpenManus

Create a Python virtual environment with Python 3.12 and activate it.

uv venv --python 3.12
source .venv/bin/activate

Install OpenManus's dependencies.

uv pip install -r requirements.txt

Install the browser automation tool (optional). Playwright is a cross-browser, cross-platform, cross-language testing framework.

playwright install

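Configure. Copy config/config.example.toml to config/config.toml and edit it. The author made two main changes: setting the LLM to "qwen-plus", the model the author uses, and setting the search engine to Baidu. The modified parts of the configuration are shown below; everything else keeps its default value.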

[llm]
model = "qwen-plus"        # The LLM model to use
base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1/"  # API endpoint URL
api_key = "xxx"            # Your API key
max_tokens = 8192          # Maximum number of tokens in the response
temperature = 0.0   

[search]
# Search engine for agent to use. Default is "Google", can be set to "Baidu" or "DuckDuckGo".
engine = "Baidu"
# Fallback engine order. Default is ["DuckDuckGo", "Baidu"] - will try in this order after primary engine fails.
#fallback_engines = ["DuckDuckGo", "Baidu"]
# Seconds to wait before retrying all engines again when they all fail due to rate limits. Default is 60.
#retry_delay = 60
# Maximum number of times to retry all engines when all fail. Default is 3.
#max_retries = 3

Run. After starting OpenManus with the following command, you can enter a question for it to reason about and answer.

python main.py

Example 1

Example 1 is a math problem: "Find the sum of the reciprocals of the roots of the following equation: $x^2-13x+4=0$, and output it as a number." The question was entered in Chinese, as the request payloads below show.

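In the first thinking step, the input to the LLM is shown below. The messages part contains three messages:

The first message is Manus's system prompt.
The second message is the question entered by the user.
The third message is Manus's (actually BrowserAgent's) next-step prompt.

The tools part describes the four tools used by Manus: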

{
	"model": "qwen-plus",
	"messages": [{
		"role": "system",
		"content": "You are OpenManus, an all-capable AI assistant, aimed at solving any task presented by the user. You have various tools at your disposal that you can call upon to efficiently complete complex requests. Whether it's programming, information retrieval, file processing, or web browsing, you can handle it all.The initial directory is: /Users/wangtao/workspace/OpenManus/workspace"
	}, {
		"role": "user",
		"content": "求解以下方程的根的倒数的和：$x^2-13x+4=0$，并以数字的形式输出"
	}, {
		"role": "user",
		"content": "\nWhat should I do next to achieve my goal?\n\nWhen you see [Current state starts here], focus on the following:\n- Current URL and page title\n- Available tabs\n- Interactive elements and their indices\n- Content above or below the viewport (if indicated)\n- Any action results or errors\n\nFor browser interactions:\n- To navigate: browser_use with action=\"go_to_url\", url=\"...\"\n- To click: browser_use with action=\"click_element\", index=N\n- To type: browser_use with action=\"input_text\", index=N, text=\"...\"\n- To extract: browser_use with action=\"extract_content\", goal=\"...\"\n- To scroll: browser_use with action=\"scroll_down\" or \"scroll_up\"\n\nConsider both what's visible and what might be beyond the current viewport.\nBe methodical - remember your progress and what you've learned so far.\n"
	}],
	"tools": [{
		"type": "function",
		"function": {
			"name": "python_execute",
			"description": "Executes Python code string. Note: Only print outputs are visible, function return values are not captured. Use print statements to see results.",
			"parameters": {
				"type": "object",
				"properties": {
					"code": {
						"type": "string",
						"description": "The Python code to execute."
					}
				},
				"required": ["code"]
			}
		}
	}, {
		"type": "function",
		"function": {
			"name": "browser_use",
			"description": "\nInteract with a web browser to perform various actions such as navigation, element interaction, content extraction, and tab management. This tool provides a comprehensive set of browser automation capabilities:\n\nNavigation:\n- 'go_to_url': Go to a specific URL in the current tab\n- 'go_back': Go back\n- 'refresh': Refresh the current page\n- 'web_search': Search the query in the current tab, the query should be a search query like humans search in web, concrete and not vague or super long. More the single most important items.\n\nElement Interaction:\n- 'click_element': Click an element by index\n- 'input_text': Input text into a form element\n- 'scroll_down'/'scroll_up': Scroll the page (with optional pixel amount)\n- 'scroll_to_text': If you dont find something which you want to interact with, scroll to it\n- 'send_keys': Send strings of special keys like Escape,Backspace, Insert, PageDown, Delete, Enter, Shortcuts such as `Control+o`, `Control+Shift+T` are supported as well. This gets used in keyboard.press.\n- 'get_dropdown_options': Get all options from a dropdown\n- 'select_dropdown_option': Select dropdown option for interactive element index by the text of the option you want to select\n\nContent Extraction:\n- 'extract_content': Extract page content to retrieve specific information from the page, e.g. all company names, a specifc description, all information about, links with companies in structured format or simply links\n\nTab Management:\n- 'switch_tab': Switch to a specific tab\n- 'open_tab': Open a new tab with a URL\n- 'close_tab': Close the current tab\n\nUtility:\n- 'wait': Wait for a specified number of seconds\n",
			"parameters": {
				"type": "object",
				"properties": {
					"action": {
						"type": "string",
						"enum": ["go_to_url", "click_element", "input_text", "scroll_down", "scroll_up", "scroll_to_text", "send_keys", "get_dropdown_options", "select_dropdown_option", "go_back", "web_search", "wait", "extract_content", "switch_tab", "open_tab", "close_tab"],
						"description": "The browser action to perform"
					},
					"url": {
						"type": "string",
						"description": "URL for 'go_to_url' or 'open_tab' actions"
					},
					"index": {
						"type": "integer",
						"description": "Element index for 'click_element', 'input_text', 'get_dropdown_options', or 'select_dropdown_option' actions"
					},
					"text": {
						"type": "string",
						"description": "Text for 'input_text', 'scroll_to_text', or 'select_dropdown_option' actions"
					},
					"scroll_amount": {
						"type": "integer",
						"description": "Pixels to scroll (positive for down, negative for up) for 'scroll_down' or 'scroll_up' actions"
					},
					"tab_id": {
						"type": "integer",
						"description": "Tab ID for 'switch_tab' action"
					},
					"query": {
						"type": "string",
						"description": "Search query for 'web_search' action"
					},
					"goal": {
						"type": "string",
						"description": "Extraction goal for 'extract_content' action"
					},
					"keys": {
						"type": "string",
						"description": "Keys to send for 'send_keys' action"
					},
					"seconds": {
						"type": "integer",
						"description": "Seconds to wait for 'wait' action"
					}
				},
				"required": ["action"],
				"dependencies": {
					"go_to_url": ["url"],
					"click_element": ["index"],
					"input_text": ["index", "text"],
					"switch_tab": ["tab_id"],
					"open_tab": ["url"],
					"scroll_down": ["scroll_amount"],
					"scroll_up": ["scroll_amount"],
					"scroll_to_text": ["text"],
					"send_keys": ["keys"],
					"get_dropdown_options": ["index"],
					"select_dropdown_option": ["index", "text"],
					"go_back": [],
					"web_search": ["query"],
					"wait": ["seconds"],
					"extract_content": ["goal"]
				}
			}
		}
	}, {
		"type": "function",
		"function": {
			"name": "str_replace_editor",
			"description": "Custom editing tool for viewing, creating and editing files\n* State is persistent across command calls and discussions with the user\n* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep\n* The `create` command cannot be used if the specified `path` already exists as a file\n* If a `command` generates a long output, it will be truncated and marked with `<response clipped>`\n* The `undo_edit` command will revert the last edit made to the file at `path`\n\nNotes for using the `str_replace` command:\n* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!\n* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique\n* The `new_str` parameter should contain the edited lines that should replace the `old_str`\n",
			"parameters": {
				"type": "object",
				"properties": {
					"command": {
						"description": "The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.",
						"enum": ["view", "create", "str_replace", "insert", "undo_edit"],
						"type": "string"
					},
					"path": {
						"description": "Absolute path to file or directory.",
						"type": "string"
					},
					"file_text": {
						"description": "Required parameter of `create` command, with the content of the file to be created.",
						"type": "string"
					},
					"old_str": {
						"description": "Required parameter of `str_replace` command containing the string in `path` to replace.",
						"type": "string"
					},
					"new_str": {
						"description": "Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.",
						"type": "string"
					},
					"insert_line": {
						"description": "Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.",
						"type": "integer"
					},
					"view_range": {
						"description": "Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.",
						"items": {
							"type": "integer"
						},
						"type": "array"
					}
				},
				"required": ["command", "path"]
			}
		}
	}, {
		"type": "function",
		"function": {
			"name": "terminate",
			"description": "Terminate the interaction when the request is met OR if the assistant cannot proceed further with the task.\nWhen you have finished all the tasks, call this tool to end the work.",
			"parameters": {
				"type": "object",
				"properties": {
					"status": {
						"type": "string",
						"description": "The finish status of the interaction.",
						"enum": ["success", "failure"]
					}
				},
				"required": ["status"]
			}
		}
	}],
	"tool_choice": "auto",
	"timeout": 300
}

In the first thinking step, the LLM returns the response below, indicating that the "python_execute" tool should be run and supplying the code for it:

ChatCompletion(id='chatcmpl-9265109b-3f47-9469-82bc-db76552e0863', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_8db7141e4bcb4cf9ac7a7b', function=Function(arguments='{"code":"import sympy\\n\\n# 定义变量和方程\\nx = sympy.symbols(\'x\')\\nequation = x**2 - 13*x + 4\\n\\n# 求解方程的根\\nroots = sympy.solve(equation, x)\\n\\n# 计算根的倒数之和\\nreciprocal_sum = sum(1/root for root in roots)\\n\\nprint(reciprocal_sum.evalf())"}', name='python_execute'), type='function', index=0)]))], created=1744508508, model='qwen-plus', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=114, prompt_tokens=2062, total_tokens=2176, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0)))
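
Note that the `arguments` field of the tool call is a JSON-encoded string, not an object, so it has to be decoded before the code can be run. A minimal sketch, using a shortened made-up arguments string in place of the one above:

```python
import json

# Shortened stand-in for the `arguments` string of the tool call above.
arguments = '{"code": "total = 13 / 4\\nprint(total)"}'
tool_name = "python_execute"

# The arguments arrive as a JSON-encoded string and must be decoded first.
kwargs = json.loads(arguments)
code = kwargs["code"]
print(tool_name)
print(code)
```

After decoding, the escaped `\n` sequences become real line breaks, and the resulting string is what the tool's execute method receives as its `code` parameter.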

Manus executes the code from the LLM's response and obtains the result "3.25", as shown in the figure.
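
The value 3.25 can be cross-checked by hand with Vieta's formulas, without solving for the roots. For $x^2-13x+4=0$ with roots $r_1$ and $r_2$, the sum of the roots is $r_1+r_2=13$ and their product is $r_1 r_2=4$, so:

```latex
\frac{1}{r_1} + \frac{1}{r_2} = \frac{r_1 + r_2}{r_1 r_2} = \frac{13}{4} = 3.25
```

This agrees with the output of the generated sympy code.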

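In the second thinking step, the input to the LLM is shown below. Compared with the first step, the main difference is three additional messages:

The first added message is the LLM's response from the first thinking step.
The second added message is the result of the tool execution in the first step.
The third added message is Manus's (actually BrowserAgent's) next-step prompt.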

{
	"model": "qwen-plus",
	"messages": [{
		"role": "system",
		"content": "You are OpenManus, an all-capable AI assistant, aimed at solving any task presented by the user. You have various tools at your disposal that you can call upon to efficiently complete complex requests. Whether it's programming, information retrieval, file processing, or web browsing, you can handle it all.The initial directory is: /Users/wangtao/workspace/OpenManus/workspace"
	}, {
		"role": "user",
		"content": "求解以下方程的根的倒数的和：$x^2-13x+4=0$，并以数字的形式输出"
	}, {
		"role": "user",
		"content": "\nWhat should I do next to achieve my goal?\n\nWhen you see [Current state starts here], focus on the following:\n- Current URL and page title\n- Available tabs\n- Interactive elements and their indices\n- Content above or below the viewport (if indicated)\n- Any action results or errors\n\nFor browser interactions:\n- To navigate: browser_use with action=\"go_to_url\", url=\"...\"\n- To click: browser_use with action=\"click_element\", index=N\n- To type: browser_use with action=\"input_text\", index=N, text=\"...\"\n- To extract: browser_use with action=\"extract_content\", goal=\"...\"\n- To scroll: browser_use with action=\"scroll_down\" or \"scroll_up\"\n\nConsider both what's visible and what might be beyond the current viewport.\nBe methodical - remember your progress and what you've learned so far.\n"
	}, {
		"role": "assistant",
		"content": "",
		"tool_calls": [{
			"id": "call_8db7141e4bcb4cf9ac7a7b",
			"type": "function",
			"function": {
				"name": "python_execute",
				"arguments": "{\"code\":\"import sympy\\n\\n# 定义变量和方程\\nx = sympy.symbols('x')\\nequation = x**2 - 13*x + 4\\n\\n# 求解方程的根\\nroots = sympy.solve(equation, x)\\n\\n# 计算根的倒数之和\\nreciprocal_sum = sum(1/root for root in roots)\\n\\nprint(reciprocal_sum.evalf())\"}"
			}
		}]
	}, {
		"role": "tool",
		"content": "Observed output of cmd `python_execute` executed:\n{'observation': '3.25000000000000\\n', 'success': True}",
		"name": "python_execute",
		"tool_call_id": "call_8db7141e4bcb4cf9ac7a7b"
	}, {
		"role": "user",
		"content": "\nWhat should I do next to achieve my goal?\n\nWhen you see [Current state starts here], focus on the following:\n- Current URL and page title\n- Available tabs\n- Interactive elements and their indices\n- Content above or below the viewport (if indicated)\n- Any action results or errors\n\nFor browser interactions:\n- To navigate: browser_use with action=\"go_to_url\", url=\"...\"\n- To click: browser_use with action=\"click_element\", index=N\n- To type: browser_use with action=\"input_text\", index=N, text=\"...\"\n- To extract: browser_use with action=\"extract_content\", goal=\"...\"\n- To scroll: browser_use with action=\"scroll_down\" or \"scroll_up\"\n\nConsider both what's visible and what might be beyond the current viewport.\nBe methodical - remember your progress and what you've learned so far.\n"
	}],
	"tools": [{
		"type": "function",
		"function": {
			"name": "python_execute",
			"description": "Executes Python code string. Note: Only print outputs are visible, function return values are not captured. Use print statements to see results.",
			"parameters": {
				"type": "object",
				"properties": {
					"code": {
						"type": "string",
						"description": "The Python code to execute."
					}
				},
				"required": ["code"]
			}
		}
	}, {
		"type": "function",
		"function": {
			"name": "browser_use",
			"description": "\nInteract with a web browser to perform various actions such as navigation, element interaction, content extraction, and tab management. This tool provides a comprehensive set of browser automation capabilities:\n\nNavigation:\n- 'go_to_url': Go to a specific URL in the current tab\n- 'go_back': Go back\n- 'refresh': Refresh the current page\n- 'web_search': Search the query in the current tab, the query should be a search query like humans search in web, concrete and not vague or super long. More the single most important items.\n\nElement Interaction:\n- 'click_element': Click an element by index\n- 'input_text': Input text into a form element\n- 'scroll_down'/'scroll_up': Scroll the page (with optional pixel amount)\n- 'scroll_to_text': If you dont find something which you want to interact with, scroll to it\n- 'send_keys': Send strings of special keys like Escape,Backspace, Insert, PageDown, Delete, Enter, Shortcuts such as `Control+o`, `Control+Shift+T` are supported as well. This gets used in keyboard.press.\n- 'get_dropdown_options': Get all options from a dropdown\n- 'select_dropdown_option': Select dropdown option for interactive element index by the text of the option you want to select\n\nContent Extraction:\n- 'extract_content': Extract page content to retrieve specific information from the page, e.g. all company names, a specifc description, all information about, links with companies in structured format or simply links\n\nTab Management:\n- 'switch_tab': Switch to a specific tab\n- 'open_tab': Open a new tab with a URL\n- 'close_tab': Close the current tab\n\nUtility:\n- 'wait': Wait for a specified number of seconds\n",
			"parameters": {
				"type": "object",
				"properties": {
					"action": {
						"type": "string",
						"enum": ["go_to_url", "click_element", "input_text", "scroll_down", "scroll_up", "scroll_to_text", "send_keys", "get_dropdown_options", "select_dropdown_option", "go_back", "web_search", "wait", "extract_content", "switch_tab", "open_tab", "close_tab"],
						"description": "The browser action to perform"
					},
					"url": {
						"type": "string",
						"description": "URL for 'go_to_url' or 'open_tab' actions"
					},
					"index": {
						"type": "integer",
						"description": "Element index for 'click_element', 'input_text', 'get_dropdown_options', or 'select_dropdown_option' actions"
					},
					"text": {
						"type": "string",
						"description": "Text for 'input_text', 'scroll_to_text', or 'select_dropdown_option' actions"
					},
					"scroll_amount": {
						"type": "integer",
						"description": "Pixels to scroll (positive for down, negative for up) for 'scroll_down' or 'scroll_up' actions"
					},
					"tab_id": {
						"type": "integer",
						"description": "Tab ID for 'switch_tab' action"
					},
					"query": {
						"type": "string",
						"description": "Search query for 'web_search' action"
					},
					"goal": {
						"type": "string",
						"description": "Extraction goal for 'extract_content' action"
					},
					"keys": {
						"type": "string",
						"description": "Keys to send for 'send_keys' action"
					},
					"seconds": {
						"type": "integer",
						"description": "Seconds to wait for 'wait' action"
					}
				},
				"required": ["action"],
				"dependencies": {
					"go_to_url": ["url"],
					"click_element": ["index"],
					"input_text": ["index", "text"],
					"switch_tab": ["tab_id"],
					"open_tab": ["url"],
					"scroll_down": ["scroll_amount"],
					"scroll_up": ["scroll_amount"],
					"scroll_to_text": ["text"],
					"send_keys": ["keys"],
					"get_dropdown_options": ["index"],
					"select_dropdown_option": ["index", "text"],
					"go_back": [],
					"web_search": ["query"],
					"wait": ["seconds"],
					"extract_content": ["goal"]
				}
			}
		}
	}, {
		"type": "function",
		"function": {
			"name": "str_replace_editor",
			"description": "Custom editing tool for viewing, creating and editing files\n* State is persistent across command calls and discussions with the user\n* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep\n* The `create` command cannot be used if the specified `path` already exists as a file\n* If a `command` generates a long output, it will be truncated and marked with `<response clipped>`\n* The `undo_edit` command will revert the last edit made to the file at `path`\n\nNotes for using the `str_replace` command:\n* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!\n* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique\n* The `new_str` parameter should contain the edited lines that should replace the `old_str`\n",
			"parameters": {
				"type": "object",
				"properties": {
					"command": {
						"description": "The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.",
						"enum": ["view", "create", "str_replace", "insert", "undo_edit"],
						"type": "string"
					},
					"path": {
						"description": "Absolute path to file or directory.",
						"type": "string"
					},
					"file_text": {
						"description": "Required parameter of `create` command, with the content of the file to be created.",
						"type": "string"
					},
					"old_str": {
						"description": "Required parameter of `str_replace` command containing the string in `path` to replace.",
						"type": "string"
					},
					"new_str": {
						"description": "Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.",
						"type": "string"
					},
					"insert_line": {
						"description": "Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.",
						"type": "integer"
					},
					"view_range": {
						"description": "Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.",
						"items": {
							"type": "integer"
						},
						"type": "array"
					}
				},
				"required": ["command", "path"]
			}
		}
	}, {
		"type": "function",
		"function": {
			"name": "terminate",
			"description": "Terminate the interaction when the request is met OR if the assistant cannot proceed further with the task.\nWhen you have finished all the tasks, call this tool to end the work.",
			"parameters": {
				"type": "object",
				"properties": {
					"status": {
						"type": "string",
						"description": "The finish status of the interaction.",
						"enum": ["success", "failure"]
					}
				},
				"required": ["status"]
			}
		}
	}],
	"tool_choice": "auto",
	"timeout": 300
}
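
The `dependencies` map that closes the `browser_use` definition above can be read as a table from each `action` value to the parameters that action needs. The sketch below is only an illustration of that reading: it copies a subset of the map, and the helper name `missing_params` is hypothetical. Whether OpenManus or the model provider enforces the map this way is not shown in this article.

```python
# Subset of the "dependencies" map from the browser_use tool definition above.
DEPENDENCIES = {
    "go_to_url": ["url"],
    "input_text": ["index", "text"],
    "go_back": [],
    "web_search": ["query"],
    "wait": ["seconds"],
}


def missing_params(arguments: dict) -> list:
    """Return the parameters the chosen action needs but the call omits.

    Hypothetical helper: it treats each key of DEPENDENCIES as an action
    name and each value as that action's list of required parameters.
    """
    action = arguments.get("action")
    if action not in DEPENDENCIES:
        raise ValueError(f"unknown action: {action!r}")
    return [name for name in DEPENDENCIES[action] if name not in arguments]


# A complete call has nothing missing; an incomplete one reports the gap.
print(missing_params({"action": "web_search", "query": "weather"}))
print(missing_params({"action": "input_text", "index": 3}))
```

Under this reading, `go_back` needs no extra parameters, while `input_text` needs both `index` and `text`.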

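In the second thinking step, the large language model returns the response shown below. It gives the final answer, "The sum of the reciprocals of the roots of the equation (x^2 - 13x + 4 = 0) is 3.25. Output as a number, the result is: 3.25. Now that we have the result, we can end this task.", and requests execution of the terminate tool, which ends the session: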


ChatCompletion(id='chatcmpl-c6a9e801-5862-9449-a9de-8cb4c7ba34da', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='方程 \\(x^2 - 13x + 4 = 0\\) 的根的倒数之和为 3.25。以数字形式输出的结果是：3.25。\n\n既然我们已经得到了结果，接下来可以结束这个任务。', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_27ed896159434448b1abb8', function=Function(arguments='{"status":"success"}', name='terminate'), type='function', index=0)]))], created=1744508517, model='qwen-plus', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=71, prompt_tokens=2407, total_tokens=2478, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0)))
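
The answer 3.25 can be checked independently with Vieta's formulas: for a*x^2 + b*x + c = 0 with roots r1 and r2, r1 + r2 = -b/a and r1*r2 = c/a, so 1/r1 + 1/r2 = (r1 + r2)/(r1*r2) = -b/c. The following is a minimal sketch of that check, not the code OpenManus generated; the function name `reciprocal_root_sum` is made up for illustration.

```python
from fractions import Fraction


def reciprocal_root_sum(a: int, b: int, c: int) -> Fraction:
    """Sum of the reciprocals of the roots of a*x^2 + b*x + c = 0.

    By Vieta's formulas the sum is (r1 + r2) / (r1 * r2) = (-b/a) / (c/a) = -b/c,
    so the roots themselves never need to be computed.
    """
    if a == 0:
        raise ValueError("not a quadratic equation (a == 0)")
    if c == 0:
        raise ValueError("a root is zero, so its reciprocal is undefined")
    return Fraction(-b, c)


result = reciprocal_root_sum(1, -13, 4)
print(result, float(result))  # 13/4 3.25
```

For x^2 - 13x + 4 = 0 this gives 13/4, which matches the 3.25 reported by the model.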

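Example 2

Example 2 is a question I put to OpenManus on March 29, on the high-speed train to Yangzhou, where I was due to run the Yangzhou Half Marathon the next day: "Please answer in Chinese: is tomorrow's weather in Yangzhou suitable for running a marathon?"

In the first thinking step, the large language model returns a call to the "browser_use" tool to visit a weather website.

Because the visit to the weather website fails, in the second thinking step the large language model again returns a call to "browser_use", this time to search for "Yangzhou weather tomorrow" through a search engine.

In both steps, OpenManus opens the browser, visits the page, searches for the keyword, and extracts the page content.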

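In the third thinking step, the large language model uses the page content extracted in the previous step to answer whether the next day's weather in Yangzhou is suitable for running a marathon. Its reply, translated here from Chinese, reads:

Based on the search results, I found the weather forecast for Yangzhou tomorrow. The weather in Yangzhou tomorrow is expected to be cloudy turning overcast, with temperatures between 6°C and 14°C, a force-2 southeast wind, and good air quality.

For running a marathon, these weather conditions are fairly ideal. The temperature is moderate, there is no strong direct sunlight, and the wind is light, all of which help athletes stay in good competitive condition. Therefore, tomorrow's weather in Yangzhou is suitable for running a marathon.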

Along with this answer, the large language model returns a call to the "terminate" tool, which ends the session.
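
Both examples end the same way: the model's last response carries a `terminate` tool call whose arguments hold the finish status. The sketch below shows one way an agent loop could detect that stop condition. It is an assumption for illustration, not OpenManus's actual code: the helper `should_stop` is hypothetical, and the tool calls are plain dicts that mirror the `name` and `arguments` fields visible in the ChatCompletion output above.

```python
import json
from typing import Optional, Tuple


def should_stop(tool_calls: list) -> Tuple[bool, Optional[str]]:
    """Return (True, status) if any tool call is `terminate`, else (False, None)."""
    for call in tool_calls:
        function = call["function"]
        if function["name"] == "terminate":
            # The arguments arrive as a JSON string, e.g. '{"status":"success"}'.
            status = json.loads(function["arguments"]).get("status")
            return True, status
    return False, None


# An earlier step: the model asks for a browser action, so the loop continues.
browse_step = [{"function": {"name": "browser_use",
                             "arguments": '{"action": "web_search", "query": "weather"}'}}]
# The final step: the model asks to terminate with a success status.
final_step = [{"function": {"name": "terminate",
                            "arguments": '{"status":"success"}'}}]

print(should_stop(browse_step))  # (False, None)
print(should_stop(final_step))   # (True, 'success')
```

Treating termination as an ordinary tool call keeps the stop condition inside the same tool-calling protocol as every other action, so the loop needs no separate signal for "task finished".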
