揭秘AI自动化框架Browser-use(二),如何构造大模型提示词

在上一篇技术分析中,我们探讨了Browser-use框架如何实现页面元素标注。本文将聚焦于其提示词构造流程,揭示AI如何理解浏览器界面的核心机制。

上一篇(首发)-揭秘AI自动化框架Browser-use(一),如何实现炫酷的页面元素标注效果

想了解更多技术实现细节和源码解析,欢迎关注我的微信公众号**【松哥ai自动化】**。每周我都会在公众号首发一篇深度技术文章,从源码角度剖析各种实用工具的实现原理。

提示词系统概览

Browser-use的提示词系统由多个协同工作的组件构成:

  1. SystemPrompt - 负责系统级指令的加载与定制
  2. AgentMessagePrompt - 构造页面状态的提示词
  3. PlannerPrompt - 为规划器提供指导的提示词
  4. MessageManager - 管理消息流和上下文
graph TD A[提示词系统] --> B[SystemPrompt] A --> C[AgentMessagePrompt] A --> D[PlannerPrompt] B --> E[系统行为定义] C --> F[状态描述] D --> G[任务规划]

这些组件在Agent类的执行过程中共同作用,确保大模型获得准确且结构化的输入信息。

系统提示词的构造与定制

系统提示词是大模型的核心指令集,定义了模型的行为边界和能力。Browser-use在Agent类初始化时通过SystemPrompt类构造系统提示词:

python 复制代码
self._message_manager = MessageManager(
    task=task,
    system_message=SystemPrompt(
        action_description=self.available_actions,
        max_actions_per_step=self.settings.max_actions_per_step,
        override_system_message=override_system_message,
        extend_system_message=extend_system_message,
    ).get_system_message(),
    settings=MessageManagerSettings(
        max_input_tokens=self.settings.max_input_tokens,
        include_attributes=self.settings.include_attributes,
        message_context=self.settings.message_context,
        sensitive_data=sensitive_data,
        available_file_paths=self.settings.available_file_paths,
    ),
    state=self.state.message_manager_state,
)

SystemPrompt类支持三种模式:

  1. 默认模式:从预定义模板加载系统提示词
  2. 扩展模式 :通过extend_system_message参数扩展默认提示词
  3. 覆盖模式 :通过override_system_message完全替换默认提示词

系统提示词通常包含以下关键内容:

  • Agent的角色和任务说明
  • 输入格式规范(URL、标签页、交互元素等)
  • 可用操作的描述和使用方法
  • 输出格式要求和示例

这种设计使开发者能够根据需求灵活定制系统提示词,同时保持核心功能不变。

页面状态提示词构造

为了让大模型理解当前网页状态,Browser-use使用AgentMessagePrompt类构造包含页面完整信息的提示词:

python 复制代码
def add_state_message(
    self,
    state: BrowserState,
    result: Optional[List[ActionResult]] = None,
    step_info: Optional[AgentStepInfo] = None,
    use_vision=True,
) -> None:
    """Add browser state as human message"""
    
    # 处理操作结果和错误信息
    if result:
        for r in result:
            if r.include_in_memory:
                if r.extracted_content:
                    msg = HumanMessage(content='Action result: ' + str(r.extracted_content))
                    self._add_message_with_tokens(msg)
                if r.error:
                    # 获取错误信息的最后一行
                    last_line = r.error.split('\n')[-1]
                    msg = HumanMessage(content='Action error: ' + last_line)
                    self._add_message_with_tokens(msg)
                result = None  # 结果已加入历史,不再重复添加
    
    # 构造当前页面状态消息
    state_message = AgentMessagePrompt(
        state,
        result,
        include_attributes=self.settings.include_attributes,
        step_info=step_info,
    ).get_user_message(use_vision)
    self._add_message_with_tokens(state_message)

AgentMessagePromptget_user_message方法负责构造页面状态提示词,它包含以下关键信息:

  1. URL信息:当前页面的URL
  2. 标签页信息:所有可用的标签页
  3. 交互元素:页面上可交互的元素树(带索引的扁平化表示)
  4. 滚动状态:页面上下方是否还有内容可滚动
  5. 步骤信息:当前执行到的步骤和总步骤数
  6. 视觉信息 :当use_vision=True时,包含网页截图的Base64编码

这种结构化的状态表示帮助大模型全面了解当前页面的状态和可执行的操作。

规划器提示词构造

对于复杂任务,Browser-use实现了规划器功能,通过PlannerPrompt类构造专门的规划提示词:

python 复制代码
class PlannerPrompt(SystemPrompt):
    def get_system_message(self) -> SystemMessage:
        return SystemMessage(
            content="""You are a planning agent that helps break down tasks into smaller steps and reason about the current state.
Your role is to:
1. Analyze the current state and history
2. Evaluate progress towards the ultimate goal
3. Identify potential challenges or roadblocks
4. Suggest the next high-level steps to take

Inside your messages, there will be AI messages from different agents with different formats.

Your output format should be always a JSON object with the following fields:
{
    "state_analysis": "Brief analysis of the current state and what has been done so far",
    "progress_evaluation": "Evaluation of progress towards the ultimate goal (as percentage and description)",
    "challenges": "List any potential challenges or roadblocks",
    "next_steps": "List 2-3 concrete next steps to take",
    "reasoning": "Explain your reasoning for the suggested next steps"
}

Ignore the other AI messages output structures.

Keep your responses concise and focused on actionable insights."""
        )

规划器的执行由Agent类中的_run_planner方法实现:

python 复制代码
async def _run_planner(self) -> Optional[str]:
    """Run the planner to analyze state and suggest next steps"""
    # 如果未设置规划器LLM,则跳过规划
    if not self.settings.planner_llm:
        return None

    # 创建规划器消息历史(使用完整消息历史,除了第一条系统消息)
    planner_messages = [
        PlannerPrompt(self.controller.registry.get_prompt_description()).get_system_message(),
        *self._message_manager.get_messages()[1:],
    ]

    # 如果规划器不使用视觉信息,则移除截图
    if not self.settings.use_vision_for_planner and self.settings.use_vision:
        last_state_message: HumanMessage = planner_messages[-1]
        # 从最后的状态消息中移除图像
        new_msg = ''
        if isinstance(last_state_message.content, list):
            for msg in last_state_message.content:
                if msg['type'] == 'text':
                    new_msg += msg['text']
                elif msg['type'] == 'image_url':
                    continue
        else:
            new_msg = last_state_message.content

        planner_messages[-1] = HumanMessage(content=new_msg)

    # 根据模型类型转换输入消息格式
    planner_messages = convert_input_messages(planner_messages, self.planner_model_name)

    # 获取规划器输出
    response = await self.settings.planner_llm.ainvoke(planner_messages)
    plan = str(response.content)
    
    # 特定模型处理(如deepseek-reasoner)
    if self.planner_model_name == 'deepseek-reasoner':
        plan = self._remove_think_tags(plan)
    
    # 尝试解析JSON并记录
    try:
        plan_json = json.loads(plan)
        logger.info(f'Planning Analysis:\n{json.dumps(plan_json, indent=4)}')
    except json.JSONDecodeError:
        logger.info(f'Planning Analysis:\n{plan}')
    except Exception as e:
        logger.debug(f'Error parsing planning analysis: {e}')
        logger.info(f'Plan: {plan}')

    return plan

规划器输出被添加到消息历史中,为Agent提供高层次的指导:

python 复制代码
# 在指定间隔运行规划器
if self.settings.planner_llm and self.state.n_steps % self.settings.planner_interval == 0:
    plan = await self._run_planner()
    # 将计划添加到最后一条状态消息之前
    self._message_manager.add_plan(plan, position=-1)

消息管理与上下文维护

Browser-use使用MessageManager类管理所有消息的流动和上下文:

python 复制代码
class MessageManager:
    def __init__(
        self,
        task: str,
        system_message: SystemMessage,
        settings: MessageManagerSettings = MessageManagerSettings(),
        state: MessageManagerState = MessageManagerState(),
    ):
        self.task = task
        self.settings = settings
        self.state = state
        self.system_prompt = system_message

        # 仅当状态为空时初始化消息
        if len(self.state.history.messages) == 0:
            self._init_messages()

消息管理器在初始化时会设置基本消息结构:

python 复制代码
def _init_messages(self) -> None:
    """Initialize the message history with system message, context, task, and other initial messages"""
    # 添加系统提示词
    self._add_message_with_tokens(self.system_prompt)

    # 添加上下文(如果有)
    if self.settings.message_context:
        context_message = HumanMessage(content='Context for the task' + self.settings.message_context)
        self._add_message_with_tokens(context_message)

    # 添加任务描述
    task_message = HumanMessage(
        content=f'Your ultimate task is: """{self.task}""". If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.'
    )
    self._add_message_with_tokens(task_message)

    # 添加敏感数据占位符(如果有)
    if self.settings.sensitive_data:
        info = f'Here are placeholders for sensitve data: {list(self.settings.sensitive_data.keys())}'
        info += 'To use them, write <secret>the placeholder name</secret>'
        info_message = HumanMessage(content=info)
        self._add_message_with_tokens(info_message)

    # 添加输出示例
    placeholder_message = HumanMessage(content='Example output:')
    self._add_message_with_tokens(placeholder_message)

    # 构造工具调用示例
    tool_calls = [
        {
            'name': 'AgentOutput',
            'args': {
                'current_state': {
                    'evaluation_previous_goal': 'Success - I opend the first page',
                    'memory': 'Starting with the new task. I have completed 1/10 steps',
                    'next_goal': 'Click on company a',
                },
                'action': [{'click_element': {'index': 0}}],
            },
            'id': str(self.state.tool_id),
            'type': 'tool_call',
        }
    ]

    # 添加示例工具调用
    example_tool_call = AIMessage(
        content='',
        tool_calls=tool_calls,
    )
    self._add_message_with_tokens(example_tool_call)
    self.add_tool_message(content='Browser started')

    # 添加任务历史标记
    placeholder_message = HumanMessage(content='[Your task history memory starts here]')
    self._add_message_with_tokens(placeholder_message)

    # 添加可用文件路径(如果有)
    if self.settings.available_file_paths:
        filepaths_msg = HumanMessage(content=f'Here are file paths you can use: {self.settings.available_file_paths}')
        self._add_message_with_tokens(filepaths_msg)

MessageManager还实现了令牌计数和截断功能,确保输入不超过模型的上下文窗口限制:

python 复制代码
def _add_message_with_tokens(self, message: BaseMessage, position: int | None = None) -> None:
    """Add message to history with token counting"""
    # 计算消息的令牌数
    token_count = self._count_tokens(message)
    
    # 添加消息到历史
    if position is not None:
        self.state.history.messages.insert(position, message)
        self.state.history.message_tokens.insert(position, token_count)
    else:
        self.state.history.messages.append(message)
        self.state.history.message_tokens.append(token_count)
    
    # 更新当前令牌总数
    self.state.history.current_tokens += token_count

(关键点)大模型提示词输入与返回输出处理

Agent类的get_next_action方法中,Browser-use通过不同方式处理模型输出:

python 复制代码
@time_execution_async('--get_next_action (agent)')
async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutput:
    """Get next action from LLM based on current state"""
    input_messages = self._convert_input_messages(input_messages)
    
    if self.tool_calling_method == 'raw':
        output = self.llm.invoke(input_messages)
        output.content = self._remove_think_tags(str(output.content))
        try:
            parsed_json = extract_json_from_model_output(output.content)
            parsed = self.AgentOutput(**parsed_json)
        except (ValueError, ValidationError) as e:
            logger.warning(f'Failed to parse model output: {output} {str(e)}')
            raise ValueError('Could not parse response.')

    elif self.tool_calling_method is None:
        structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)
        response: dict[str, Any] = await structured_llm.ainvoke(input_messages)
        parsed: AgentOutput | None = response['parsed']
    else:
        structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True, method=self.tool_calling_method)
        response: dict[str, Any] = await structured_llm.ainvoke(input_messages)
        parsed: AgentOutput | None = response['parsed']

    if parsed is None:
        raise ValueError('Could not parse response.')

    # 限制每步操作数量
    if len(parsed.action) > self.settings.max_actions_per_step:
        parsed.action = parsed.action[: self.settings.max_actions_per_step]

    return parsed

Browser-use支持三种输出处理方式:

  1. 原始模式raw):直接解析模型输出的文本
  2. 结构化输出None):使用LangChain的结构化输出功能
  3. 工具调用function_calling/json_mode):使用OpenAI等模型的工具调用功能

这种灵活性使Browser-use能够适应不同的LLM接口和输出格式。

提示词验证机制

Browser-use实现了输出验证机制,通过_validate_output方法检查模型输出是否符合预期:

python 复制代码
async def _validate_output(self) -> bool:
    """Validate the output of the last action is what the user wanted"""
    system_msg = (
        f'You are a validator of an agent who interacts with a browser. '
        f'Validate if the output of last action is what the user wanted and if the task is completed. '
        f'If the task is unclear defined, you can let it pass. But if something is missing or the image does not show what was requested dont let it pass. '
        f'Try to understand the page and help the model with suggestions like scroll, do x, ... to get the solution right. '
        f'Task to validate: {self.task}. Return a JSON object with 2 keys: is_valid and reason. '
        f'is_valid is a boolean that indicates if the output is correct. '
        f'reason is a string that explains why it is valid or not.'
    )
    
    # 获取当前浏览器状态
    state = await self.browser_context.get_state()
    content = AgentMessagePrompt(
        state=state,
        result=self.state.last_result,
        include_attributes=self.settings.include_attributes,
    )
    msg = [SystemMessage(content=system_msg), content.get_user_message(self.settings.use_vision)]
    
    # 定义验证结果模型
    class ValidationResult(BaseModel):
        is_valid: bool
        reason: str

    # 使用结构化输出获取验证结果
    validator = self.llm.with_structured_output(ValidationResult, include_raw=True)
    response: dict[str, Any] = await validator.ainvoke(msg)
    parsed: ValidationResult = response['parsed']
    is_valid = parsed.is_valid
    
    # 处理验证结果
    if not is_valid:
        logger.info(f'❌ Validator decision: {parsed.reason}')
        msg = f'The output is not yet correct. {parsed.reason}.'
        self.state.last_result = [ActionResult(extracted_content=msg, include_in_memory=True)]
    else:
        logger.info(f'✅ Validator decision: {parsed.reason}')
    
    return is_valid

这种验证机制增强了Browser-use的可靠性,能够在任务执行过程中自动检测问题并提供纠正建议。

多模型配置与提示词定制

Browser-use支持多种LLM,不同模型可能需要特定的提示词处理:

python 复制代码
def _convert_input_messages(self, input_messages: list[BaseMessage]) -> list[BaseMessage]:
    """Convert input messages to the correct format"""
    if self.model_name == 'deepseek-reasoner' or self.model_name.startswith('deepseek-r1'):
        return convert_input_messages(input_messages, self.model_name)
    else:
        return input_messages

对于不支持函数调用的模型,Browser-use会做特殊处理:

python 复制代码
def _convert_messages_for_non_function_calling_models(input_messages: list[BaseMessage]) -> list[BaseMessage]:
    """Convert messages for non-function-calling models"""
    output_messages = []
    for message in input_messages:
        if isinstance(message, HumanMessage):
            output_messages.append(message)
        elif isinstance(message, SystemMessage):
            output_messages.append(message)
        elif isinstance(message, ToolMessage):
            output_messages.append(HumanMessage(content=message.content))
        elif isinstance(message, AIMessage):
            # 检查tool_calls是否为有效的JSON对象
            if message.tool_calls:
                tool_calls = json.dumps(message.tool_calls)
                output_messages.append(AIMessage(content=tool_calls))
            else:
                output_messages.append(message)
        else:
            raise ValueError(f'Unknown message type: {type(message)}')
    return output_messages

实践案例:使用不同的模型

Browser-use支持多种LLM提供商,如OpenAI、Anthropic、千问等:

python 复制代码
# 使用OpenAI
agent = Agent(
    task="搜索四川的10大景点",
    llm=ChatOpenAI(model="gpt-4o"),
)

# 使用Anthropic Claude
agent = Agent(
    task="搜索四川的10大景点",
    llm=ChatAnthropic(model_name="claude-3-5-sonnet"),
)

# 使用ModelScope的千问模型
agent = Agent(
    task="搜索四川的10大景点",
    llm=ChatOpenAI(
        model='Qwen/Qwen2.5-72B-Instruct',
        api_key='xxx',
        base_url='https://api.modelscope.cn/v1/'
    ),
    browser=browser,
    use_vision=False,
)

最佳实践与优化建议

  1. 令牌管理 :Browser-use通过max_input_tokens参数控制输入令牌数量,防止超出模型限制。当接近限制时,会自动裁剪历史消息:
python 复制代码
if 'Max token limit reached' in error_msg:
    # 减少令牌限制
    self._message_manager.settings.max_input_tokens = self.settings.max_input_tokens - 500
    self._message_manager.cut_messages()
  1. 扩展系统提示词 :使用extend_system_message比完全覆盖系统提示词更安全:
python 复制代码
extend_system_message = """
重要规则:无论任务是什么,始终先打开一个新标签页并首先访问baidu.com。
"""

agent = Agent(
    task="搜索四川的10大景点",
    llm=ChatOpenAI(model='gpt-4'),
    extend_system_message=extend_system_message
)
  1. 规划器配置:对于复杂任务,配置规划器和规划间隔可以提高执行效率:
python 复制代码
agent = Agent(
    task="搜索并比较四川十大景点的门票价格和游览时间",
    llm=ChatOpenAI(model='gpt-4o'),
    planner_llm=ChatOpenAI(model='gpt-4o'),
    planner_interval=3  # 每3步执行一次规划
)
  1. 视觉选择:根据任务需要选择是否使用视觉能力:
python 复制代码
agent = Agent(
    task="提取网页文本内容",
    llm=ChatOpenAI(model='gpt-4o'),
    use_vision=True,  # 启用视觉能力
    use_vision_for_planner=False  # 规划器不使用视觉能力
)

总结

通过深入理解Browser-use的提示词构造机制,开发者可以优化自动化应用,实现更复杂的任务,同时保持高可靠性和适应性。提示词工程是Browser-use框架的核心,也是其能够应对各种复杂Web场景的关键所在。

想了解更多技术实现细节和源码解析,欢迎关注我的微信公众号**【松哥ai自动化】**。每周我都会在公众号首发一篇深度技术文章,从源码角度剖析各种实用工具的实现原理。

下一篇我们将深入分析Browser-use如何处理复杂的界面交互操作,包括表单填写、多步骤导航和动态内容处理等高级场景,敬请关注!

附录

(一)系统提示词输出示例

text 复制代码
Message Type: SystemMessage
Content: You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules.

# Input Format
Task
Previous steps
Current URL
Open Tabs
Interactive Elements
[index]<type>text</type>
- index: Numeric identifier for interaction
- type: HTML element type (button, input, etc.)
- text: Element description
Example:
[33]<button>Submit Form</button>

- Only elements with numeric indexes in [] are interactive
- elements without [] provide only context

(二)用户消息提示词输出示例

text 复制代码
Content: Your ultimate task is: """采集四川的10大景点""". If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.

Message Type: HumanMessage
Content: Example output:

(三)规划器提示词输出示例

text 复制代码
Tool Calls: [
  {
    "name": "AgentOutput",
    "args": {
      "current_state": {
        "evaluation_previous_goal": "Success - I opend the first page",
        "memory": "Starting with the new task. I have completed 1/10 steps",
        "next_goal": "Click on company a"
      },
      "action": [
        {
          "click_element": {
            "index": 0
          }
        }
      ]
    },
    "id": "1",
    "type": "tool_call"
  }
]
相关推荐
小众AI1 分钟前
UI-TARS: 基于视觉语言模型的多模式代理
人工智能·ui·语言模型
北京地铁1号线18 分钟前
卷积神经网络(CNN)前向传播手撕
人工智能·pytorch·深度学习
伊织code23 分钟前
PyTorch API 7 - TorchScript、hub、矩阵、打包、profile
人工智能·pytorch·python·ai·矩阵·api
AI不止绘画1 小时前
分享一个可以用GPT打标的傻瓜式SD图片打标工具——辣椒炒肉图片打标助手
人工智能·ai·aigc·图片打标·图片模型训练·lora训练打标·sd打标
视觉语言导航1 小时前
昆士兰科技大学无人机自主导航探索新框架!UAVNav:GNSS拒止与视觉受限环境中的无人机导航与目标检测
人工智能·无人机·具身智能
新知图书2 小时前
OpenCV实现数字水印的相关函数和示例代码
人工智能·opencv·计算机视觉
创客匠人老蒋2 小时前
刘强东 “猪猪侠” 营销:重构创始人IP的符号革命|创客匠人热点评述
人工智能·创始人ip
买了一束花2 小时前
数据预处理之数据平滑处理详解
开发语言·人工智能·算法·matlab
神州问学3 小时前
数智驱动——AI:企业数字化转型的“超级引擎”
人工智能
说私域3 小时前
桑德拉精神与开源链动2+1模式AI智能名片S2B2C商城小程序的协同价值研究
人工智能·小程序·开源·零售