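In the previous technical analysis we looked at how the Browser-use framework annotates page elements. This article focuses on how it constructs prompts, the core mechanism by which the AI understands a browser page.
Previous article (first published): Demystifying the AI automation framework Browser-use (Part 1): how the eye-catching page element annotation is implemented
For more implementation details and source-code walkthroughs, follow my WeChat official account **【松哥ai自动化】**. Every week I publish an in-depth technical article there first, dissecting how practical tools work from the source code up.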
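Prompt system overview
Browser-use's prompt system is made up of several cooperating components:
- `SystemPrompt`: loads and customizes the system-level instructions
- `AgentMessagePrompt`: builds the prompt that describes the page state
- `PlannerPrompt`: provides the prompt that guides the planner
- `MessageManager`: manages the message flow and context
These components work together while the `Agent` class runs, so that the model receives accurate, structured input.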
Building and customizing the system prompt
The system prompt is the model's core instruction set: it defines the boundaries of the model's behavior and its capabilities. When the `Agent` class is initialized, Browser-use builds the system prompt through the `SystemPrompt` class:
```python
self._message_manager = MessageManager(
    task=task,
    system_message=SystemPrompt(
        action_description=self.available_actions,
        max_actions_per_step=self.settings.max_actions_per_step,
        override_system_message=override_system_message,
        extend_system_message=extend_system_message,
    ).get_system_message(),
    settings=MessageManagerSettings(
        max_input_tokens=self.settings.max_input_tokens,
        include_attributes=self.settings.include_attributes,
        message_context=self.settings.message_context,
        sensitive_data=sensitive_data,
        available_file_paths=self.settings.available_file_paths,
    ),
    state=self.state.message_manager_state,
)
```
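The `SystemPrompt` class supports three modes:
- Default mode: loads the system prompt from a predefined template
- Extend mode: appends to the default prompt through the `extend_system_message` parameter
- Override mode: replaces the default prompt entirely through `override_system_message`
The system prompt typically contains:
- The agent's role and task description
- The input format specification (URL, tabs, interactive elements, and so on)
- Descriptions of the available actions and how to use them
- Output format requirements and examples
This design lets developers tailor the system prompt to their needs while the core functionality stays unchanged.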
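To make the three modes concrete, here is a minimal sketch of how such a resolution could work. It is not Browser-use's actual implementation: `build_system_prompt` and `DEFAULT_PROMPT` are hypothetical names, the template text is a stand-in, and letting the override win when both parameters are passed is an assumption of this sketch.

```python
from typing import Optional

# Stand-in for the packaged prompt template (assumption, not the real text).
DEFAULT_PROMPT = "You are an AI agent designed to automate browser tasks."


def build_system_prompt(
    default_prompt: str = DEFAULT_PROMPT,
    override_system_message: Optional[str] = None,
    extend_system_message: Optional[str] = None,
) -> str:
    """Resolve the final system prompt from the three modes (illustrative only)."""
    # Override mode: the caller's text replaces the default template entirely.
    if override_system_message:
        return override_system_message
    # Extend mode: the caller's text is appended to the default template.
    if extend_system_message:
        return f"{default_prompt}\n{extend_system_message}"
    # Default mode: the template is used unchanged.
    return default_prompt


print(build_system_prompt(extend_system_message="Always visit baidu.com first."))
```

Extending keeps the action descriptions and output format rules that the agent depends on, which is why it is usually the lower-risk choice.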
Building the page-state prompt
So that the model can understand the current state of the web page, Browser-use uses the `AgentMessagePrompt` class to build a prompt containing the page's full state. The message manager's `add_state_message` method shows how it is used:
```python
def add_state_message(
    self,
    state: BrowserState,
    result: Optional[List[ActionResult]] = None,
    step_info: Optional[AgentStepInfo] = None,
    use_vision=True,
) -> None:
    """Add browser state as human message"""
    # Handle action results and error messages
    if result:
        for r in result:
            if r.include_in_memory:
                if r.extracted_content:
                    msg = HumanMessage(content='Action result: ' + str(r.extracted_content))
                    self._add_message_with_tokens(msg)
                if r.error:
                    # Keep only the last line of the error message
                    last_line = r.error.split('\n')[-1]
                    msg = HumanMessage(content='Action error: ' + last_line)
                    self._add_message_with_tokens(msg)
                result = None  # The result is already in the history, so do not add it again
    # Build the message for the current page state
    state_message = AgentMessagePrompt(
        state,
        result,
        include_attributes=self.settings.include_attributes,
        step_info=step_info,
    ).get_user_message(use_vision)
    self._add_message_with_tokens(state_message)
```
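The `get_user_message` method of `AgentMessagePrompt` builds the page-state prompt, which contains the following key information:
- URL: the URL of the current page
- Tabs: all available tabs
- Interactive elements: the tree of interactive elements on the page (as a flattened, indexed representation)
- Scroll state: whether there is more content to scroll to above or below the current viewport
- Step information: the current step number and the total number of steps
- Visual information: when `use_vision=True`, a Base64-encoded screenshot of the page
This structured representation gives the model a complete picture of the current page and of the actions it can take.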
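The following sketch shows one way a state message with these six parts could be assembled. It is a simplified illustration under stated assumptions: `PageState` and `build_user_message` are hypothetical names, and the exact wording of each line is invented for the example rather than copied from Browser-use.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union


@dataclass
class PageState:
    """Hypothetical, simplified stand-in for the browser state."""
    url: str
    tabs: List[str]
    elements: List[str]  # already flattened, e.g. "[33]<button>Submit Form</button>"
    pixels_above: int = 0
    pixels_below: int = 0
    screenshot_b64: Optional[str] = None


def build_user_message(
    state: PageState, step: int, max_steps: int, use_vision: bool = True
) -> Union[str, List[Dict[str, Any]]]:
    """Return plain text, or a text + image content list when a screenshot is attached."""
    lines = [f"Current url: {state.url}", f"Available tabs: {state.tabs}", "Interactive elements:"]
    if state.pixels_above > 0:
        lines.append(f"... {state.pixels_above} pixels above, scroll up to see more ...")
    lines.extend(state.elements)
    if state.pixels_below > 0:
        lines.append(f"... {state.pixels_below} pixels below, scroll down to see more ...")
    lines.append(f"Current step: {step}/{max_steps}")
    text = "\n".join(lines)

    # With vision enabled and a screenshot present, send a multimodal content list.
    if use_vision and state.screenshot_b64:
        image_url = f"data:image/png;base64,{state.screenshot_b64}"
        return [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    return text


state = PageState(
    url="https://example.com",
    tabs=["Example Domain"],
    elements=["[0]<a>More information...</a>"],
    pixels_below=120,
)
print(build_user_message(state, step=1, max_steps=10))
```

Returning a plain string when no screenshot is available keeps text-only models on the simplest possible message format.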
Building the planner prompt
For complex tasks, Browser-use provides a planner, whose dedicated planning prompt is built by the `PlannerPrompt` class:
```python
class PlannerPrompt(SystemPrompt):
    def get_system_message(self) -> SystemMessage:
        return SystemMessage(
            content="""You are a planning agent that helps break down tasks into smaller steps and reason about the current state.
Your role is to:
1. Analyze the current state and history
2. Evaluate progress towards the ultimate goal
3. Identify potential challenges or roadblocks
4. Suggest the next high-level steps to take
Inside your messages, there will be AI messages from different agents with different formats.
Your output format should be always a JSON object with the following fields:
{
    "state_analysis": "Brief analysis of the current state and what has been done so far",
    "progress_evaluation": "Evaluation of progress towards the ultimate goal (as percentage and description)",
    "challenges": "List any potential challenges or roadblocks",
    "next_steps": "List 2-3 concrete next steps to take",
    "reasoning": "Explain your reasoning for the suggested next steps"
}
Ignore the other AI messages output structures.
Keep your responses concise and focused on actionable insights."""
        )
```
The planner is run by the `_run_planner` method of the `Agent` class:
```python
async def _run_planner(self) -> Optional[str]:
    """Run the planner to analyze state and suggest next steps"""
    # Skip planning if no planner LLM is configured
    if not self.settings.planner_llm:
        return None
    # Build the planner's message history (the full history except the first system message)
    planner_messages = [
        PlannerPrompt(self.controller.registry.get_prompt_description()).get_system_message(),
        *self._message_manager.get_messages()[1:],
    ]
    # If the planner does not use vision, remove the screenshot
    if not self.settings.use_vision_for_planner and self.settings.use_vision:
        last_state_message: HumanMessage = planner_messages[-1]
        # Strip the image from the last state message
        new_msg = ''
        if isinstance(last_state_message.content, list):
            for msg in last_state_message.content:
                if msg['type'] == 'text':
                    new_msg += msg['text']
                elif msg['type'] == 'image_url':
                    continue
        else:
            new_msg = last_state_message.content
        planner_messages[-1] = HumanMessage(content=new_msg)
    # Convert the input messages to the format the model type expects
    planner_messages = convert_input_messages(planner_messages, self.planner_model_name)
    # Get the planner output
    response = await self.settings.planner_llm.ainvoke(planner_messages)
    plan = str(response.content)
    # Model-specific handling (e.g. deepseek-reasoner)
    if self.planner_model_name == 'deepseek-reasoner':
        plan = self._remove_think_tags(plan)
    # Try to parse the plan as JSON and log it
    try:
        plan_json = json.loads(plan)
        logger.info(f'Planning Analysis:\n{json.dumps(plan_json, indent=4)}')
    except json.JSONDecodeError:
        logger.info(f'Planning Analysis:\n{plan}')
    except Exception as e:
        logger.debug(f'Error parsing planning analysis: {e}')
        logger.info(f'Plan: {plan}')
    return plan
```
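The planner's output is added to the message history, giving the agent high-level guidance:
```python
# Run the planner at the configured interval
if self.settings.planner_llm and self.state.n_steps % self.settings.planner_interval == 0:
    plan = await self._run_planner()
    # Insert the plan just before the last state message
    self._message_manager.add_plan(plan, position=-1)
```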
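Two small pieces of the planner logic above are easy to try in isolation: stripping the screenshot from a multimodal message, and the modulo check that decides when the planner runs. The sketch below uses plain dictionaries instead of LangChain message objects, and `strip_images` and `should_run_planner` are hypothetical helper names introduced only for illustration.

```python
from typing import Any, Dict, List, Union

Content = Union[str, List[Dict[str, Any]]]


def strip_images(content: Content) -> str:
    """Keep only the text parts of a message's content."""
    if isinstance(content, str):
        return content
    return "".join(part["text"] for part in content if part.get("type") == "text")


def should_run_planner(n_steps: int, planner_interval: int) -> bool:
    """True whenever the step counter is a multiple of the interval."""
    return n_steps % planner_interval == 0


multimodal = [
    {"type": "text", "text": "Current url: https://example.com"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,QUJD"}},
]
print(strip_images(multimodal))
print([n for n in range(1, 10) if should_run_planner(n, 3)])
```

With an interval of 3, the planner in this sketch fires on steps 3, 6 and 9, so the extra LLM call is paid only on every third step.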
Message management and context maintenance
Browser-use uses the `MessageManager` class to manage the flow of all messages and their context:
```python
class MessageManager:
    def __init__(
        self,
        task: str,
        system_message: SystemMessage,
        settings: MessageManagerSettings = MessageManagerSettings(),
        state: MessageManagerState = MessageManagerState(),
    ):
        self.task = task
        self.settings = settings
        self.state = state
        self.system_prompt = system_message
        # Initialize the messages only when the state is empty
        if len(self.state.history.messages) == 0:
            self._init_messages()
```
On initialization, the message manager sets up the basic message structure:
```python
def _init_messages(self) -> None:
    """Initialize the message history with system message, context, task, and other initial messages"""
    # Add the system prompt
    self._add_message_with_tokens(self.system_prompt)
    # Add the context, if any
    if self.settings.message_context:
        context_message = HumanMessage(content='Context for the task' + self.settings.message_context)
        self._add_message_with_tokens(context_message)
    # Add the task description
    task_message = HumanMessage(
        content=f'Your ultimate task is: """{self.task}""". If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.'
    )
    self._add_message_with_tokens(task_message)
    # Add the sensitive-data placeholders, if any
    if self.settings.sensitive_data:
        info = f'Here are placeholders for sensitve data: {list(self.settings.sensitive_data.keys())}'
        info += 'To use them, write <secret>the placeholder name</secret>'
        info_message = HumanMessage(content=info)
        self._add_message_with_tokens(info_message)
    # Add the example-output marker
    placeholder_message = HumanMessage(content='Example output:')
    self._add_message_with_tokens(placeholder_message)
    # Build the example tool call
    tool_calls = [
        {
            'name': 'AgentOutput',
            'args': {
                'current_state': {
                    'evaluation_previous_goal': 'Success - I opend the first page',
                    'memory': 'Starting with the new task. I have completed 1/10 steps',
                    'next_goal': 'Click on company a',
                },
                'action': [{'click_element': {'index': 0}}],
            },
            'id': str(self.state.tool_id),
            'type': 'tool_call',
        }
    ]
    # Add the example tool call
    example_tool_call = AIMessage(
        content='',
        tool_calls=tool_calls,
    )
    self._add_message_with_tokens(example_tool_call)
    self.add_tool_message(content='Browser started')
    # Add the marker where the task history begins
    placeholder_message = HumanMessage(content='[Your task history memory starts here]')
    self._add_message_with_tokens(placeholder_message)
    # Add the available file paths, if any
    if self.settings.available_file_paths:
        filepaths_msg = HumanMessage(content=f'Here are file paths you can use: {self.settings.available_file_paths}')
        self._add_message_with_tokens(filepaths_msg)
```
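The initialization code above only tells the model which placeholder names exist and to wrap them in `<secret>` tags. How the placeholders are later swapped for real values is not shown in this article, so the sketch below is purely an assumption about how such a substitution could be done: `replace_secrets` and `mask_secrets` are hypothetical helpers, not Browser-use functions.

```python
import re
from typing import Dict

SECRET_PATTERN = re.compile(r"<secret>(.*?)</secret>")


def replace_secrets(text: str, sensitive_data: Dict[str, str]) -> str:
    """Swap <secret>name</secret> placeholders for real values; unknown names stay untouched."""
    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        return sensitive_data.get(name, match.group(0))

    return SECRET_PATTERN.sub(_substitute, text)


def mask_secrets(text: str, sensitive_data: Dict[str, str]) -> str:
    """Reverse direction: hide real values before text is shown to the model."""
    for name, value in sensitive_data.items():
        if value:
            text = text.replace(value, f"<secret>{name}</secret>")
    return text


secrets = {"x_password": "hunter2"}
print(replace_secrets("Type <secret>x_password</secret> into the form", secrets))
print(mask_secrets("The page echoed hunter2", secrets))
```

The point of the placeholder scheme is that the model only ever sees names such as `x_password`, while the real values are filled in outside the prompt.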
`MessageManager` also implements token counting and truncation so that the input stays within the model's context window:
```python
def _add_message_with_tokens(self, message: BaseMessage, position: int | None = None) -> None:
    """Add message to history with token counting"""
    # Count the tokens in the message
    token_count = self._count_tokens(message)
    # Add the message to the history
    if position is not None:
        self.state.history.messages.insert(position, message)
        self.state.history.message_tokens.insert(position, token_count)
    else:
        self.state.history.messages.append(message)
        self.state.history.message_tokens.append(token_count)
    # Update the running token total
    self.state.history.current_tokens += token_count
```
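The bookkeeping above can be reproduced in a few lines of standalone Python. The sketch below is a simplified illustration: `History`, `count_tokens` and `trim` are hypothetical names, the whitespace-based counter stands in for a real tokenizer, and dropping the oldest messages first is a strategy chosen for this example, not necessarily what `cut_messages` does.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class History:
    messages: List[str] = field(default_factory=list)
    message_tokens: List[int] = field(default_factory=list)
    current_tokens: int = 0

    def add(self, message: str, position: Optional[int] = None) -> None:
        """Append or insert a message and keep the per-message and total counts in sync."""
        tokens = count_tokens(message)
        if position is None:
            self.messages.append(message)
            self.message_tokens.append(tokens)
        else:
            self.messages.insert(position, message)
            self.message_tokens.insert(position, tokens)
        self.current_tokens += tokens

    def trim(self, max_input_tokens: int, keep_first: int = 1) -> None:
        """Drop the oldest messages after the first keep_first until the budget fits."""
        while self.current_tokens > max_input_tokens and len(self.messages) > keep_first:
            self.messages.pop(keep_first)
            self.current_tokens -= self.message_tokens.pop(keep_first)


history = History()
history.add("system prompt here")
history.add("old step one two three")
history.add("newest step")
history.add("plan", position=-1)  # lands just before the last message
history.trim(max_input_tokens=6)
print(history.messages, history.current_tokens)
```

Storing the token count next to each message means trimming never has to re-tokenize the history; it only subtracts the counts of whatever it removes.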
(Key point) Sending the prompt to the model and handling the returned output
In the `get_next_action` method of the `Agent` class, Browser-use handles the model output in one of several ways, depending on `tool_calling_method`:
```python
@time_execution_async('--get_next_action (agent)')
async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutput:
    """Get next action from LLM based on current state"""
    input_messages = self._convert_input_messages(input_messages)
    if self.tool_calling_method == 'raw':
        output = self.llm.invoke(input_messages)
        output.content = self._remove_think_tags(str(output.content))
        try:
            parsed_json = extract_json_from_model_output(output.content)
            parsed = self.AgentOutput(**parsed_json)
        except (ValueError, ValidationError) as e:
            logger.warning(f'Failed to parse model output: {output} {str(e)}')
            raise ValueError('Could not parse response.')
    elif self.tool_calling_method is None:
        structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)
        response: dict[str, Any] = await structured_llm.ainvoke(input_messages)
        parsed: AgentOutput | None = response['parsed']
    else:
        structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True, method=self.tool_calling_method)
        response: dict[str, Any] = await structured_llm.ainvoke(input_messages)
        parsed: AgentOutput | None = response['parsed']
    if parsed is None:
        raise ValueError('Could not parse response.')
    # Limit the number of actions per step
    if len(parsed.action) > self.settings.max_actions_per_step:
        parsed.action = parsed.action[: self.settings.max_actions_per_step]
    return parsed
```
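Browser-use supports three ways of handling output, selected by `tool_calling_method`:
- Raw mode (`raw`): parses the text returned by the model directly
- Structured output (`None`): uses LangChain's structured output feature with its default method
- Tool calling (`function_calling` / `json_mode`): passes the chosen method to LangChain's structured output, for example the tool-calling feature of OpenAI and similar models
This flexibility lets Browser-use adapt to different LLM interfaces and output formats.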
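Raw mode is the one path where the agent has to clean up free-form text itself. The sketch below shows the general idea with the standard library only. `remove_think_tags` and `extract_json` are simplified, hypothetical stand-ins for the helpers named in the code above, and the assumption that the model wraps its JSON in a Markdown code fence is ours.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def remove_think_tags(text: str) -> str:
    """Drop <think>...</think> reasoning blocks that some models emit before the answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def extract_json(text: str) -> Dict[str, Any]:
    """Parse JSON from model output, tolerating a code fence around it."""
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as exc:
        raise ValueError("Could not parse response.") from exc


raw = (
    "<think>The link is at index 0.</think>\n"
    + FENCE + 'json\n{"action": [{"click_element": {"index": 0}}]}\n' + FENCE
)
print(extract_json(remove_think_tags(raw)))
```

Raising `ValueError` on unparseable output mirrors the error path in `get_next_action`, so the caller can treat every parsing failure the same way.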
Prompt-driven output validation
Browser-use implements an output validation mechanism: the `_validate_output` method checks whether the model's output matches what was expected:
```python
async def _validate_output(self) -> bool:
    """Validate the output of the last action is what the user wanted"""
    system_msg = (
        f'You are a validator of an agent who interacts with a browser. '
        f'Validate if the output of last action is what the user wanted and if the task is completed. '
        f'If the task is unclear defined, you can let it pass. But if something is missing or the image does not show what was requested dont let it pass. '
        f'Try to understand the page and help the model with suggestions like scroll, do x, ... to get the solution right. '
        f'Task to validate: {self.task}. Return a JSON object with 2 keys: is_valid and reason. '
        f'is_valid is a boolean that indicates if the output is correct. '
        f'reason is a string that explains why it is valid or not.'
    )
    # Get the current browser state
    state = await self.browser_context.get_state()
    content = AgentMessagePrompt(
        state=state,
        result=self.state.last_result,
        include_attributes=self.settings.include_attributes,
    )
    msg = [SystemMessage(content=system_msg), content.get_user_message(self.settings.use_vision)]

    # Define the model for the validation result
    class ValidationResult(BaseModel):
        is_valid: bool
        reason: str

    # Use structured output to get the validation result
    validator = self.llm.with_structured_output(ValidationResult, include_raw=True)
    response: dict[str, Any] = await validator.ainvoke(msg)
    parsed: ValidationResult = response['parsed']
    is_valid = parsed.is_valid
    # Handle the validation result
    if not is_valid:
        logger.info(f'❌ Validator decision: {parsed.reason}')
        msg = f'The output is not yet correct. {parsed.reason}.'
        self.state.last_result = [ActionResult(extracted_content=msg, include_in_memory=True)]
    else:
        logger.info(f'✅ Validator decision: {parsed.reason}')
    return is_valid
```
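This validation mechanism makes Browser-use more reliable: it can detect problems during task execution and feed corrective suggestions back to the agent.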
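The feedback loop can be illustrated without an LLM by feeding in a validator reply by hand. In the sketch below, `parse_validation` and `feedback_for_agent` are hypothetical helpers, and the dataclass stands in for the Pydantic model used in the method above; only the two JSON keys and the wording of the correction message come from the source.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ValidationResult:
    is_valid: bool
    reason: str


def parse_validation(raw: str) -> ValidationResult:
    """Turn the validator's JSON reply into a typed result."""
    data = json.loads(raw)
    return ValidationResult(is_valid=bool(data["is_valid"]), reason=str(data["reason"]))


def feedback_for_agent(result: ValidationResult) -> Optional[str]:
    """Return the correction to store in the agent's memory, or None when the output passed."""
    if result.is_valid:
        return None
    return f"The output is not yet correct. {result.reason}."


reply = '{"is_valid": false, "reason": "Only 7 of the 10 attractions are listed, scroll down"}'
print(feedback_for_agent(parse_validation(reply)))
```

Because the correction is stored as an ordinary action result, the agent reads it on its next step like any other observation.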
Multi-model configuration and prompt customization
Browser-use supports a variety of LLMs, and some models need model-specific message handling:
```python
def _convert_input_messages(self, input_messages: list[BaseMessage]) -> list[BaseMessage]:
    """Convert input messages to the correct format"""
    if self.model_name == 'deepseek-reasoner' or self.model_name.startswith('deepseek-r1'):
        return convert_input_messages(input_messages, self.model_name)
    else:
        return input_messages
```
For models that do not support function calling, Browser-use applies special handling:
```python
def _convert_messages_for_non_function_calling_models(input_messages: list[BaseMessage]) -> list[BaseMessage]:
    """Convert messages for non-function-calling models"""
    output_messages = []
    for message in input_messages:
        if isinstance(message, HumanMessage):
            output_messages.append(message)
        elif isinstance(message, SystemMessage):
            output_messages.append(message)
        elif isinstance(message, ToolMessage):
            # Tool results are passed on as ordinary human messages
            output_messages.append(HumanMessage(content=message.content))
        elif isinstance(message, AIMessage):
            # Serialize any tool calls to JSON text in the message content
            if message.tool_calls:
                tool_calls = json.dumps(message.tool_calls)
                output_messages.append(AIMessage(content=tool_calls))
            else:
                output_messages.append(message)
        else:
            raise ValueError(f'Unknown message type: {type(message)}')
    return output_messages
```
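In practice: using different models
Browser-use works with multiple LLM providers, such as OpenAI, Anthropic, and Qwen:
```python
from browser_use import Agent
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

# Using OpenAI
agent = Agent(
    task="Search for the top 10 attractions in Sichuan",
    llm=ChatOpenAI(model="gpt-4o"),
)
# Using Anthropic Claude
agent = Agent(
    task="Search for the top 10 attractions in Sichuan",
    llm=ChatAnthropic(model_name="claude-3-5-sonnet"),
)
# Using a Qwen model hosted on ModelScope (OpenAI-compatible endpoint)
agent = Agent(
    task="Search for the top 10 attractions in Sichuan",
    llm=ChatOpenAI(
        model='Qwen/Qwen2.5-72B-Instruct',
        api_key='xxx',
        base_url='https://api.modelscope.cn/v1/'
    ),
    browser=browser,  # a Browser instance created elsewhere
    use_vision=False,
)
```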
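Best practices and optimization tips
- Token management: Browser-use limits the number of input tokens with the `max_input_tokens` parameter so that the model's limit is not exceeded. When the model reports that the token limit has been reached, the limit is lowered by 500 tokens and the message history is cut:
```python
if 'Max token limit reached' in error_msg:
    # Lower the token limit
    self._message_manager.settings.max_input_tokens = self.settings.max_input_tokens - 500
    self._message_manager.cut_messages()
```
- Extending the system prompt: using `extend_system_message` is safer than overriding the system prompt completely:
```python
extend_system_message = """
Important rule: whatever the task, always open a new tab first and visit baidu.com before anything else.
"""
agent = Agent(
    task="Search for the top 10 attractions in Sichuan",
    llm=ChatOpenAI(model='gpt-4'),
    extend_system_message=extend_system_message
)
```
- Planner configuration: for complex tasks, configuring a planner and a planning interval can improve execution efficiency:
```python
agent = Agent(
    task="Search for and compare ticket prices and visiting times for the top 10 attractions in Sichuan",
    llm=ChatOpenAI(model='gpt-4o'),
    planner_llm=ChatOpenAI(model='gpt-4o'),
    planner_interval=3  # run the planner every 3 steps
)
```
- Vision: enable or disable vision according to what the task needs:
```python
agent = Agent(
    task="Extract the text content of a web page",
    llm=ChatOpenAI(model='gpt-4o'),
    use_vision=True,  # enable vision for the agent
    use_vision_for_planner=False  # the planner does not use vision
)
```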
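Summary
With a solid understanding of how Browser-use constructs its prompts, developers can tune their automation applications and tackle more complex tasks while keeping them reliable and adaptable. Prompt engineering is at the core of the Browser-use framework, and it is what allows the framework to cope with a wide range of complex web scenarios.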
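In the next article we will take a closer look at how Browser-use handles complex UI interactions, including advanced scenarios such as form filling, multi-step navigation, and dynamic content. Stay tuned!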
Appendix
(1) Example system prompt output
```text
Message Type: SystemMessage
Content: You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules.
# Input Format
Task
Previous steps
Current URL
Open Tabs
Interactive Elements
[index]<type>text</type>
- index: Numeric identifier for interaction
- type: HTML element type (button, input, etc.)
- text: Element description
Example:
[33]<button>Submit Form</button>
- Only elements with numeric indexes in [] are interactive
- elements without [] provide only context
```
(2) Example user message prompt output
```text
Content: Your ultimate task is: """Collect the top 10 attractions in Sichuan""". If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.
Message Type: HumanMessage
Content: Example output:
```
(3) Example tool-call message output (the example `AgentOutput` call added in `_init_messages`)
```text
Tool Calls: [
    {
        "name": "AgentOutput",
        "args": {
            "current_state": {
                "evaluation_previous_goal": "Success - I opend the first page",
                "memory": "Starting with the new task. I have completed 1/10 steps",
                "next_goal": "Click on company a"
            },
            "action": [
                {
                    "click_element": {
                        "index": 0
                    }
                }
            ]
        },
        "id": "1",
        "type": "tool_call"
    }
]
```