Agent Series (23): Web Agent, Letting the Agent Actually Browse the Web

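Why a Web Agent

An LLM's knowledge has a cutoff date. Ask it "What is the latest version of LangGraph?" and it can only report the version found in its training data. A Web Agent solves this problem: the Agent actually goes online, retrieves live data, and then answers.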

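But "looking things up online" is harder than it sounds:

Web pages are HTML, not text: dumping them straight into the context drags in a large amount of useless markup
A single page can run to tens of thousands of tokens: more than the LLM's context can reasonably hold
The Agent can loop forever: page A links to B, B links to C, and it never stops
URLs can be hallucinated: the LLM will make up links that do not exist

These four problems map to four engineering designs: HTML cleaning, Token Budget, Step Limit, and URL error handling. This article assembles them into a runnable Web Agent.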

Architecture

The overall structure is a standard two-node LangGraph graph:

User question
    │
    ▼
┌─────────────────────────────────────┐
│         agent_node                  │
│  SystemPrompt + messages → LLM      │
│  bound_llm.invoke(msgs)             │
└────────┬────────────────────────────┘
         │
    tool_calls present?
         │
    ┌────┴─────┐
   yes         no (or steps >= MAX_STEPS)
    │               │
    ▼               ▼
tools_node         END
web_search /
fetch_page
    │
    └──→ agent_node (loop)

The State has only two fields:

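from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages

class WState(TypedDict):
    messages: Annotated[list, add_messages]  # accumulated messages (merged by the reducer)
    steps: int                                # steps used so far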

steps is specific to the Web Agent. An ordinary Agent does not need an explicit step counter, but a Web Agent can hop between pages indefinitely, so it needs a hard limit.

Two tools

web_search: DuckDuckGo search

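import requests
from bs4 import BeautifulSoup
from langchain_core.tools import tool

# HEADERS (a dict carrying a browser-like User-Agent) is defined elsewhere in the module.
@tool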
def web_search(query: str) -> str:
    """
    Search the web with DuckDuckGo.
    Returns up to 5 results, each with title, snippet, and URL.
    Use the URLs from results to call fetch_page; never invent URLs.
    """
    try:
        resp = requests.get(
            "https://html.duckduckgo.com/html/",
            params={"q": query},
            headers=HEADERS,
            timeout=12,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        results = []
        for i, block in enumerate(soup.select(".result"), 1):
            if i > 5:
                break
            title   = (block.select_one(".result__title")   or soup.new_tag("x")).get_text(strip=True)
            snippet = (block.select_one(".result__snippet") or soup.new_tag("x")).get_text(strip=True)
            url_raw = (block.select_one(".result__url")     or soup.new_tag("x")).get_text(strip=True)
            url = f"https://{url_raw}" if url_raw and not url_raw.startswith("http") else url_raw
            results.append(f"{i}. {title}\n   {snippet}\n   URL: {url}")
        return "\n\n".join(results) if results else "No results found."
    except Exception as exc:
        return f"Search error: {exc}"

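This uses DuckDuckGo's HTML endpoint, which needs no API key. The tool selects elements with the .result CSS class, extracts the title, snippet, and URL from each, and returns them to the LLM as structured text.

The tool description contains one key instruction: use the URLs from the search results when calling fetch_page, and never invent URLs. This is the first line of defense against URL hallucination: the prompt itself tells the model where legitimate URLs come from.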
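The prompt-level rule could also be backed by a check in code. The sketch below is not part of the original agent: it assumes two hypothetical helpers, urls_seen_in_search and is_allowed, and relies only on the "URL: ..." line format that web_search emits, so that a fetch can be refused when the requested URL never appeared in any earlier search result.

```python
import re

# Matches the "   URL: https://..." lines produced by web_search.
URL_LINE = re.compile(r"^\s*URL:\s*(\S+)\s*$", re.MULTILINE)


def urls_seen_in_search(search_outputs: list[str]) -> set[str]:
    """Collect every URL found in earlier web_search outputs (hypothetical helper)."""
    seen: set[str] = set()
    for text in search_outputs:
        seen.update(URL_LINE.findall(text))
    return seen


def is_allowed(url: str, search_outputs: list[str]) -> bool:
    """True only if the URL appeared verbatim in some search result."""
    return url in urls_seen_in_search(search_outputs)


# Illustrative history: one search result in the format web_search returns.
history = [
    "1. LangGraph\n   Build stateful agents\n   URL: https://pypi.org/project/langgraph/",
]
print(is_allowed("https://pypi.org/project/langgraph/", history))  # True
print(is_allowed("https://example.invalid/made-up", history))      # False
```

The match is deliberately verbatim, so a URL that differs only by a trailing slash would be rejected; whether to normalize URLs first is a judgment call for the application.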

fetch_page: page fetching + cleaning

@tool
def fetch_page(url: str) -> str:
    """
    Fetch a web page and return its cleaned text (truncated to token budget).
    Only call with real URLs obtained from web_search results.
    """
    try:
        resp = requests.get(url, headers=HEADERS, timeout=12)
        resp.raise_for_status()
        full_text = clean_html(resp.text)
        orig_tokens = count_tokens(full_text)
        displayed = truncate_to_budget(full_text)
        shown_tokens = min(orig_tokens, PAGE_TOKEN_BUDGET)
        return (
            f"[URL: {url}]\n"
            f"[Size: {orig_tokens} tokens → showing {shown_tokens} tokens "
            f"(budget={PAGE_TOKEN_BUDGET})]\n\n"
            f"{displayed}"
        )
    except requests.HTTPError as exc:
        return f"HTTP {exc.response.status_code} --- could not fetch {url}"
    except requests.ConnectionError:
        return f"Connection error --- {url} may not exist or be unreachable"
    except Exception as exc:
        return f"Error fetching {url}: {type(exc).__name__}: {exc}"

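It works in three steps:

clean_html: BeautifulSoup strips script/style/nav/footer and returns plain text
truncate_to_budget: anything beyond the Token Budget is cut off
Error classification: HTTP errors, connection errors, and all other exceptions each return a different safe string

Note that requests.HTTPError and requests.ConnectionError represent two different failure scenarios: in the former the server did return a response (4xx/5xx); in the latter the connection itself failed (the domain does not exist or the network is unreachable).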
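The article does not show clean_html itself. As a rough sketch of the idea, assuming nothing beyond what the step list says (drop script/style/nav/footer, keep the text), the version below uses the standard library's html.parser instead of BeautifulSoup so that it runs without dependencies. The names clean_html_sketch and SKIP_TAGS are illustrative, not the author's.

```python
from html.parser import HTMLParser

# Tags whose entire content is discarded (assumed from the step list above).
SKIP_TAGS = {"script", "style", "nav", "footer"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside a skipped tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0          # > 0 while inside a skipped tag
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def clean_html_sketch(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><nav>Home</nav><p>Hello <b>world</b></p><footer>Copyright</footer></body></html>"
)
print(clean_html_sketch(sample))
```

One limitation of this sketch: a skipped tag that is never closed suppresses everything after it, which a more forgiving parser such as BeautifulSoup handles better.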

Three engineering guards

Guard 1: URL error handling

Test with a domain that does not exist at all:

fetch_page(https://totally-made-up-domain-xyz99999.org/docs/n...)
→ Connection error --- https://totally-made-up-domain-xyz99999.org/docs/nonexistent may not exist or be unreachable

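No crash and no raised exception: the tool returns a safe error string. Once the LLM receives that string, it can try another URL or a different search query.

This is the key design principle behind the guards: an error is a tool return value, not an exception. A failed tool call should not abort the whole Agent run; instead, the LLM adapts based on the error message.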
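The same principle can be factored out of individual tools. The decorator below is a minimal sketch, not something the original code uses: errors_as_text and the toy parse_port function are hypothetical names, chosen only to show an exception being turned into a string the LLM can read.

```python
import functools


def errors_as_text(fn):
    """Wrap a tool function so that any exception becomes a readable return value."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Same shape as fetch_page's generic fallback: exception type plus message.
            return f"Error in {fn.__name__}: {type(exc).__name__}: {exc}"

    return wrapper


@errors_as_text
def parse_port(value: str) -> str:
    """Toy stand-in for a tool body that may raise."""
    return f"port={int(value)}"


print(parse_port("8080"))  # port=8080
print(parse_port("abc"))   # an error string, not a traceback
```

A blanket wrapper like this loses the per-category messages that fetch_page provides, so it fits best as a last-resort fallback around tools that already handle their expected failures.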

Guard 2: Token Budget truncation

Test with the PyPI page for langgraph:

fetch_page(pypi.org/project/langgraph/)
→ [Size: 4576 tokens → showing 800 tokens (budget=800)]

The original page is 4576 tokens and is truncated to 800, saving over 82% of the context space.

The truncation is simple to implement:

PAGE_TOKEN_BUDGET = 800   # max tokens of page text sent to LLM per fetch

def count_tokens(text: str) -> int:
    """Rough estimate: ~3 chars per token for English/Chinese mix."""
    return max(1, len(text) // 3)

def truncate_to_budget(text: str, budget: int = PAGE_TOKEN_BUDGET) -> str:
    if count_tokens(text) <= budget:
        return text
    cutoff = budget * 3
    return text[:cutoff] + f"\n\n[... content truncated to ~{budget}-token budget ...]"

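count_tokens is a rough estimate (3 chars ≈ 1 token), not an exact tokenizer. For truncation, precision matters less than speed. The token counts shown in the fetch_page header are therefore estimates too.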
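To make the arithmetic concrete, the snippet below repeats the two helpers and runs them on a synthetic string sized to match the PyPI example. The page variable is a stand-in made of filler characters, not the real page content.

```python
PAGE_TOKEN_BUDGET = 800


def count_tokens(text: str) -> int:
    """Rough estimate: about 3 characters per token."""
    return max(1, len(text) // 3)


def truncate_to_budget(text: str, budget: int = PAGE_TOKEN_BUDGET) -> str:
    if count_tokens(text) <= budget:
        return text
    return text[: budget * 3] + f"\n\n[... content truncated to ~{budget}-token budget ...]"


# Synthetic stand-in: 13728 characters, i.e. an estimated 4576 tokens.
page = "x" * (4576 * 3)

orig_tokens = count_tokens(page)
shown_tokens = min(orig_tokens, PAGE_TOKEN_BUDGET)
kept = truncate_to_budget(page)
saved = 1 - shown_tokens / orig_tokens

print(f"{orig_tokens} -> {shown_tokens} tokens, {saved:.1%} saved")
# 4576 -> 800 tokens, 82.5% saved
```

The kept text is the first 2400 characters plus the truncation marker, so the real cost is slightly above 800 estimated tokens.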

Guard 3: Step Limit

MAX_STEPS = 8

def router(state: WState) -> str:
    if state["steps"] >= MAX_STEPS:
        return END
    last = state["messages"][-1]
    if isinstance(last, AIMessage) and last.tool_calls:
        return "tools"
    return END

state["steps"] is incremented by 1 every time agent_node runs:

def agent_node(state: WState) -> dict:
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
    response = bound_llm.invoke(msgs)
    return {"messages": [response], "steps": state["steps"] + 1}

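The router checks the step count first and tool_calls second. Even if the LLM still wants to call tools, execution is forced to end once the step count is reached. This is the hard boundary that prevents infinite loops.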
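The effect of the router can be checked without LangGraph or a real model. The simulation below is a sketch under stated assumptions: messages are plain dicts rather than LangChain message objects, END is a local placeholder string, and greedy_agent is a hypothetical stand-in for an LLM that asks for another tool call on every turn.

```python
MAX_STEPS = 8
END = "__end__"  # local placeholder, standing in for LangGraph's END


def router(state: dict) -> str:
    """Same decision order as the article's router: step count first, then tool_calls."""
    if state["steps"] >= MAX_STEPS:
        return END
    last = state["messages"][-1]
    if last.get("tool_calls"):
        return "tools"
    return END


def greedy_agent(state: dict) -> dict:
    """Worst case: the model requests a tool on every single turn."""
    msg = {"role": "ai", "tool_calls": [{"name": "web_search"}]}
    return {"messages": state["messages"] + [msg], "steps": state["steps"] + 1}


def fake_tools(state: dict) -> dict:
    """Answers the pending tool call with a stub result; does not touch steps."""
    msg = {"role": "tool", "content": "stub result"}
    return {"messages": state["messages"] + [msg], "steps": state["steps"]}


def run(state: dict) -> dict:
    while True:
        state = greedy_agent(state)
        if router(state) == END:
            return state
        state = fake_tools(state)


final = run({"messages": [{"role": "human", "content": "question"}], "steps": 0})
print(final["steps"], len(final["messages"]))  # 8 16
```

Note what the final state looks like in this worst case: the last message is still an AI message with unanswered tool_calls, not a written answer, so a caller may want to detect that and report that the step budget ran out.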

The step counter is initialized at call time:

state = graph.invoke(
    {"messages": [HumanMessage(content=query)], "steps": 0},
    config={"recursion_limit": MAX_STEPS * 3},
)

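recursion_limit is LangGraph's built-in guard: it caps the number of super-steps in a run, and each agent or tools node execution counts as one. steps is the application-level guard. The two work independently; MAX_STEPS * 3 leaves enough headroom that the steps check in the router fires first.

Run results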

======================================================================
Web Agent Demo
Model: glm-4-flash  |  Token budget/page: 800  |  Max steps: 8
======================================================================

=== Part 3: Engineering Guards ===

──────────────────────────────────────────────────────────────────────
[Guard 1] URL error handling (bad / hallucinated URL)
  fetch_page(https://totally-made-up-domain-xyz99999.org/docs/n...)
  → Connection error --- https://totally-made-up-domain-xyz99999.org/docs/nonexistent may not exist or be unreachable

──────────────────────────────────────────────────────────────────────
[Guard 2] Token budget enforcement (budget=800 tokens/page)
  fetch_page(pypi.org/project/langgraph/)
  → [Size: 4576 tokens → showing 800 tokens (budget=800)]

──────────────────────────────────────────────────────────────────────
[Guard 3] Step limit (MAX_STEPS=8) --- agent cannot loop forever
  Graph router returns END when state['steps'] >= 8
  Even if tool_calls remain, execution stops.

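All three guards work as expected.

The research part (Parts 1 & 2) ran into DuckDuckGo rate limiting: the searches returned empty results, and the model correctly reported the failure instead of fabricating an answer. That in itself shows the guards working: the Agent did not keep looping after the searches failed, and instead told the user plainly that it could not retrieve the data.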

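Limitations of DuckDuckGo

The DuckDuckGo HTML endpoint is a keyless option and is not reliable in production:

Frequent requests get rate-limited or return empty results
The HTML structure can change at any time, breaking the CSS selectors
The tool has no rate-limit control of its own, so it easily triggers blocking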
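The last point can be softened with a client-side throttle. The class below is a minimal sketch and not part of the original agent: MinIntervalThrottle is a hypothetical name, and the clock and sleep parameters are injectable only so that the behavior can be tested without real waiting.

```python
import time


class MinIntervalThrottle:
    """Block until at least `min_interval` seconds have passed since the previous call."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Intended use: call throttle.wait() right before each outgoing search request.
throttle = MinIntervalThrottle(min_interval=2.0)
print(throttle.wait())  # 0.0 (the first call never waits)
```

A fixed minimum interval is the simplest possible policy; it reduces the chance of tripping a rate limit but does not react to one, so error handling in the tool is still needed.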

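Production alternatives:

Option	Characteristics
Tavily API	Designed for LLM Agents; returns structured results
SerpAPI	Multiple search engines; stable; paid
Brave Search API	Fairly generous free tier; independent index
Jina Reader	Specializes in page-to-text conversion; good results

Switching only requires replacing the implementation of the web_search tool; the Agent graph structure stays the same.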
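One way to keep that swap cheap is to separate "get results" from "format results for the LLM". The sketch below makes no claim about any provider's API: SearchHit, SearchBackend, format_hits, make_search, and stub_backend are all hypothetical names, and the stub returns canned data. Only the output format (numbered entries with an indented snippet and a "URL:" line) is taken from the article's web_search.

```python
from typing import Callable, TypedDict


class SearchHit(TypedDict):
    title: str
    snippet: str
    url: str


# Any function mapping a query to a list of hits can serve as a backend.
SearchBackend = Callable[[str], list[SearchHit]]


def format_hits(hits: list[SearchHit], limit: int = 5) -> str:
    """Render hits in the same text layout that web_search returns."""
    if not hits:
        return "No results found."
    entries = [
        f"{i}. {h['title']}\n   {h['snippet']}\n   URL: {h['url']}"
        for i, h in enumerate(hits[:limit], 1)
    ]
    return "\n\n".join(entries)


def make_search(backend: SearchBackend) -> Callable[[str], str]:
    """Build a search function whose failures come back as strings, not exceptions."""

    def search(query: str) -> str:
        try:
            return format_hits(backend(query))
        except Exception as exc:
            return f"Search error: {exc}"

    return search


def stub_backend(query: str) -> list[SearchHit]:
    """Canned data standing in for a real provider call."""
    return [{
        "title": "LangGraph",
        "snippet": f"Result for {query}",
        "url": "https://pypi.org/project/langgraph/",
    }]


search = make_search(stub_backend)
print(search("langgraph latest version"))
```

With this split, moving to a different provider means writing one new backend function; the text the LLM sees, and therefore the prompt and the graph, stay unchanged.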

Complete graph code

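from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages

# llm is the chat model instance created elsewhere (the demo run uses glm-4-flash);
# web_search, fetch_page, and MAX_STEPS are defined in the sections above.
TOOLS   = [web_search, fetch_page]
TOOL_MAP = {t.name: t for t in TOOLS}
bound_llm = llm.bind_tools(TOOLS)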

SYSTEM_PROMPT = f"""You are a web research agent. Answer the user's question by browsing the web.

Workflow:
1. Call web_search to find relevant pages.
2. Call fetch_page on promising URLs to read content.
3. If you find the answer, give a clear, concise final response.
4. If a page doesn't help, try a different search query.

Strict rules:
- Only use URLs from web_search results; never invent or guess URLs.
- If fetch_page returns an error, try a different URL or search query.
- You have at most {MAX_STEPS} total steps. Be efficient.
- Once you have enough information, stop browsing and answer directly."""


class WState(TypedDict):
    messages: Annotated[list, add_messages]
    steps: int


def agent_node(state: WState) -> dict:
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
    response = bound_llm.invoke(msgs)
    return {"messages": [response], "steps": state["steps"] + 1}


def tools_node(state: WState) -> dict:
    last = state["messages"][-1]
    results = []
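    for tc in last.tool_calls:
        tool_fn = TOOL_MAP.get(tc["name"])
        # A hallucinated tool name is reported back to the LLM instead of raising KeyError.
        output = tool_fn.invoke(tc["args"]) if tool_fn else f"Unknown tool: {tc['name']}"
        results.append(ToolMessage(content=str(output), tool_call_id=tc["id"]))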
    return {"messages": results}


def router(state: WState) -> str:
    if state["steps"] >= MAX_STEPS:
        return END
    last = state["messages"][-1]
    if isinstance(last, AIMessage) and last.tool_calls:
        return "tools"
    return END


def build_graph():
    g = StateGraph(WState)
    g.add_node("agent", agent_node)
    g.add_node("tools", tools_node)
    g.set_entry_point("agent")
    g.add_conditional_edges("agent", router, {"tools": "tools", END: END})
    g.add_edge("tools", "agent")
    return g.compile()

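After compilation the graph is assigned to the module-level variable graph, and run_research calls graph.invoke() directly.

Design checklist

Tool design

HTML cleaning: strip script/style/nav/footer and keep only the main content
Error classification: HTTP errors / connection errors / everything else, each returning a safe string
State the URL source rule explicitly in the tool description: never invent URLs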

Engineering Guard

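Token Budget: truncate page text to a reasonable cap (800-2000 tokens)
Step Limit: the router checks the step count first, then tool_calls
Two layers of protection: application-level steps + LangGraph recursion_limit

State design

messages: Annotated[list, add_messages] (a reducer is required, otherwise messages do not accumulate)
steps: int (a field specific to the Web Agent; an ordinary Agent can omit it)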
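Why the reducer matters can be shown with a toy model of how a node's returned dict is merged into the state. This is a simplified illustration, not LangGraph's implementation: apply_update, overwrite, and append_reducer are hypothetical names, messages are plain strings, and the real add_messages does more (for example, it matches messages by ID).

```python
def overwrite(old, new):
    """Default behavior for a field without a reducer: the new value replaces the old."""
    return new


def append_reducer(old, new):
    """Reducer behavior for a list field: the update is appended to what is there."""
    return old + new


def apply_update(state: dict, update: dict, reducers: dict) -> dict:
    """Merge a node's partial update into the state, field by field."""
    merged = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key, overwrite)
        merged[key] = reducer(state[key], value) if key in state else value
    return merged


state = {"messages": ["human: question"], "steps": 0}
update = {"messages": ["ai: answer"], "steps": 1}  # same shape agent_node returns

with_reducer = apply_update(state, update, {"messages": append_reducer})
without_reducer = apply_update(state, update, {})

print(with_reducer["messages"])     # ['human: question', 'ai: answer']
print(without_reducer["messages"])  # ['ai: answer']
```

Without the reducer the user's question is gone after the first agent step, which is why agent_node can return only the new response and still rely on the full history being present on the next turn.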

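Productionizing

Replace the search tool with a stable, API-key-based option (Tavily/SerpAPI)
Set the User-Agent to a real browser UA to avoid being rejected
Request timeout: timeout=12 (set separately for search and page fetching)

Summary

Three conclusions:

Guards are independent: a tool failure is not an Agent failure; returning errors as values lets the LLM adapt instead of aborting the run
Token Budget is a must: one ordinary web page came to 4576 tokens, and cutting it to 800 saves over 82% of the context, which adds up quickly when browsing many pages
Step Limit is a hard boundary: steps >= MAX_STEPS → END lives in the router and does not rely on the prompt's "self-discipline"; however much the LLM wants to continue, it stops when the step count is reached

The essence of a Web Agent: give the LLM controllable eyes, not unlimited network access.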
