Agent Series (14): Agent Observability: Trace Every Decision and Open Up the Black Box

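The Agent Inside a Black Box

You start an agent, send a request, wait 6 seconds, and get an answer.

What happened during those 6 seconds?

How many times did the LLM reason?
How many tool calls were made, with what arguments, and what did they return?
Which step is the latency stuck on?
If the answer is wrong, which step went wrong?

A plain agent.invoke() tells you none of this. Agent observability addresses exactly that: it lets you see the decision made at every step and what each step cost.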

Three Layers of Observability

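Development  →  Live trace             (see what happened)
Analysis     →  Latency timeline       (see where the time went)
Production   →  Structured audit log   (record every request so it can be traced back)

All three modes build on the same foundation: LangChain's BaseCallbackHandler.

The Core: AgentTracer

LangChain's BaseCallbackHandler exposes hooks that cover the lifecycle of LLM calls and tool calls: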

import time
import uuid

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult

# StepRecord and _extract_text are project helpers that this excerpt does not show.

class AgentTracer(BaseCallbackHandler):
    def __init__(self, verbose: bool = True, trace_id: str = "") -> None:
        super().__init__()
        self.verbose = verbose
        self.trace_id = trace_id or str(uuid.uuid4())[:8]
        self.steps: list[StepRecord] = []
        self._llm_t0: float = 0.0
        self._tool_t0: float = 0.0
        self._tool_name: str = ""
        self._tool_input: str = ""

    # LLM start (chat models fire on_chat_model_start, not on_llm_start)
    def on_chat_model_start(self, serialized, messages, **kwargs) -> None:
        self._llm_t0 = time.time()
        if self.verbose:
            print("  [LLM →] reasoning...")

    # Completion-style (non-chat) models fire on_llm_start instead
    def on_llm_start(self, serialized, prompts, **kwargs) -> None:
        self._llm_t0 = time.time()
        if self.verbose:
            print("  [LLM →] reasoning...")

    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        t1 = time.time()
        # Extract the text (content is empty on tool-call turns; only the final-answer turn has text)
        output = _extract_text(response)
        self.steps.append(StepRecord(
            step_type="llm", name="LLM",
            output_preview=output,
            start_time=self._llm_t0, end_time=t1,
        ))
        if self.verbose:
            # time.time() returns seconds, so convert to milliseconds before printing
            print(f"  [LLM ←] {(t1 - self._llm_t0) * 1000:.0f}ms  |  {output[:70]}")

    def on_tool_start(self, serialized, input_str: str, **kwargs) -> None:
        self._tool_t0 = time.time()
        self._tool_name = (serialized or {}).get("name", "tool")
        # Keep the raw input so the audit log can report it later
        self._tool_input = str(input_str)
        if self.verbose:
            print(f"  [TOOL→] {self._tool_name}({self._tool_input[:60]})")

    def on_tool_end(self, output: str, **kwargs) -> None:
        t1 = time.time()
        self.steps.append(StepRecord(
            step_type="tool", name=self._tool_name,
            input_preview=self._tool_input[:80],
            output_preview=str(output)[:80],
            start_time=self._tool_t0, end_time=t1,
        ))
        if self.verbose:
            print(f"  [TOOL←] {str(output)[:80]}  [{(t1 - self._tool_t0)*1000:.0f}ms]")

Key detail: chat models fire on_chat_model_start, not on_llm_start. Implement both hooks to cover every model type.

To use the tracer, pass it to agent.invoke through config:

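from langchain_core.messages import HumanMessage

# `agent` and `query` are assumed to be defined earlier
tracer = AgentTracer(verbose=True)
result = agent.invoke(
    {"messages": [HumanMessage(query)]},
    config={"callbacks": [tracer]},
)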

Demo 1: Live Trace

Query: "What is the weather in Beijing and Shanghai? Calculate the temperature difference."

Actual output:

  [LLM →] reasoning...
  [LLM ←] 1544ms  |                              ← tool-call turn, content is empty
  [TOOL→] get_weather({'city': 'Beijing'})
  [TOOL←] {"city": "Beijing", "temp": 25, "condition": "sunny"}  [1ms]
  [LLM →] reasoning...
  [LLM ←] 1236ms  |                              ← tool-call turn, content is empty
  [TOOL→] get_weather({'city': 'Shanghai'})
  [TOOL←] {"city": "Shanghai", "temp": 22, "condition": "cloudy"}  [1ms]
  [LLM →] reasoning...
  [LLM ←] 2895ms  |                              ← tool-call turn, content is empty
  [TOOL→] calculator({'expression': '25.0-22.0'})
  [TOOL←] 25.0-22.0 = 3.0  [1ms]
  [LLM →] reasoning...
  [LLM ←] 2465ms  |  The temperature in Beijing is 25°C and Shanghai is 22°C...

Trace summary: 7 steps  (4 LLM calls, 3 tool calls)

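Why are the intermediate LLM output previews empty?

This is how a ReAct agent works: each time the LLM decides to call a tool, its output is tool_calls JSON rather than text. The LangGraph framework consumes that JSON, and it never appears in the content text field. Only the last LLM call, where the model decides to stop calling tools and answer directly, produces text content.

This is exactly what a trace is for. Without one you cannot see these 4 LLM calls; all you know is that you "waited 8 seconds". With one you know there were 4 LLM calls and 3 tool calls, and which reasoning step was the slowest.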
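
If an empty preview is too opaque, the tracer could summarize the tool calls instead. The sketch below assumes the turn is described by its content string and a tool_calls list of dicts with name and args keys, which is the shape LangChain's AIMessage uses for tool calls. The helper name describe_llm_turn is hypothetical, and plain arguments are used so the example runs standalone.

```python
def describe_llm_turn(content: str, tool_calls: list[dict], limit: int = 70) -> str:
    """Build a one-line preview for an LLM turn.

    Final-answer turns carry text in `content`; tool-call turns usually
    carry an empty `content` and a non-empty `tool_calls` list.
    """
    if content:
        return content[:limit]
    if tool_calls:
        calls = ", ".join(f"{c['name']}({c.get('args', {})})" for c in tool_calls)
        return f"<tool_calls: {calls}>"[:limit]
    return ""


# Shapes mirror the Beijing/Shanghai demo above.
tool_turn = describe_llm_turn("", [{"name": "get_weather", "args": {"city": "Beijing"}}])
final_turn = describe_llm_turn("The temperature in Beijing is 25°C and Shanghai is 22°C...", [])
print(tool_turn)
print(final_turn)
```

Inside on_llm_end, the message for a chat model is typically reachable as response.generations[0][0].message, so its content and tool_calls could be passed to a helper like this one.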

Demo 2: Latency Timeline

Query: "Tell me the WonderBot Pro price and calculate 299 * 12 for annual cost."

After a silent run, analyze tracer.steps to generate an ASCII timeline:

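# duration_ms is assumed to be (end_time - start_time) * 1000 on each step
total_ms  = sum(s.duration_ms for s in tracer.steps)
bar_scale = 40 / total_ms if total_ms else 0.0  # 40 characters span the whole run

for i, step in enumerate(tracer.steps, 1):
    bar   = "█" * max(int(step.duration_ms * bar_scale), 1)  # at least one block per step
    label = "LLM reasoning" if step.step_type == "llm" else f"tool: {step.name}"
    print(f"  Step {i}  {label:<25} [{step.duration_ms:>6.0f}ms]  {bar}")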

Actual output:

Step-by-step breakdown:

  Step 1  LLM reasoning             [  2409ms]  ██████████████
  Step 2  tool: get_product_info    [     1ms]  █
  Step 3  LLM reasoning             [  2069ms]  ████████████
  Step 4  tool: calculator          [     1ms]  █
  Step 5  LLM reasoning             [  1977ms]  ████████████

  ────────────────────────────────────────────────────────────
  Total : 6457ms
  LLM   : 6455ms  (100.0% of wall time)
  Tools :    2ms  (0.0% of wall time)

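LLM calls take essentially 100% of the time; tools round to 0%.

This is the most important performance takeaway from the run: the tool calls here are almost free (2 ms in total), and nearly all of the latency comes from the LLM. So when you want a faster agent, reduce the number of LLM calls (merge requests, cut reasoning steps) before you tune tool implementations. The tools in these demos return in about 1 ms; a tool that does real network or database I/O will take a larger share, which is one more reason to measure before optimizing.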
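
The totals printed under the timeline can be derived with a few lines. This is a minimal sketch, assuming each step has been reduced to a (step_type, duration_ms) pair; the function name summarize_latency is hypothetical, and the sample durations are the five steps from the output above.

```python
def summarize_latency(steps: list[tuple[str, float]]) -> dict:
    """Split the summed step time into LLM time and tool time.

    `steps` holds (step_type, duration_ms) pairs, where step_type is
    "llm" or "tool". Percentages are relative to the summed step time.
    """
    llm_ms = sum(ms for kind, ms in steps if kind == "llm")
    tool_ms = sum(ms for kind, ms in steps if kind == "tool")
    total_ms = llm_ms + tool_ms

    def pct(ms: float) -> float:
        return round(100.0 * ms / total_ms, 1) if total_ms else 0.0

    return {
        "total_ms": total_ms,
        "llm_ms": llm_ms,
        "tool_ms": tool_ms,
        "llm_pct": pct(llm_ms),
        "tool_pct": pct(tool_ms),
    }


demo2 = [("llm", 2409), ("tool", 1), ("llm", 2069), ("tool", 1), ("llm", 1977)]
summary = summarize_latency(demo2)
print(summary)
```

With these inputs the LLM share is 6455 / 6457, which rounds to 100.0%, matching the breakdown shown in the timeline.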

Demo 3: Structured Audit Log

Query: "What's the weather in Shenzhen and how much does WonderBot Basic cost?"

After the run finishes, serialize tracer.steps to JSON:

def build_audit_log(tracer: AgentTracer, query: str, answer: str) -> dict:
    steps_log = []
    for s in tracer.steps:
        entry = {"type": s.step_type, "duration_ms": round(s.duration_ms, 1)}
        if s.step_type == "tool":
            entry["tool"]   = s.name
            entry["input"]  = s.input_preview
            entry["output"] = s.output_preview
        else:
            entry["output_preview"] = s.output_preview
        steps_log.append(entry)
    ...

Actual output:

{
  "trace_id": "6c8e4b64",
  "query": "What's the weather in Shenzhen and how much does WonderBot Basic cost?",
  "answer": "The weather in Shenzhen is rainy with a temperature of 30 degrees Celsius. The cost of WonderBot Basic is not available.",
  "steps": [
    {"type": "llm",  "duration_ms": 593.0,  "output_preview": ""},
    {"type": "tool", "duration_ms": 1.1,    "tool": "get_weather",    "input": "{'city': 'Shenzhen'}",       "output": "..."},
    {"type": "llm",  "duration_ms": 1164.9, "output_preview": ""},
    {"type": "tool", "duration_ms": 1.0,    "tool": "get_product_info","input": "{'product_name': 'Basic'}", "output": "Product 'Basic' not found..."},
    {"type": "llm",  "duration_ms": 1190.6, "output_preview": "The weather in Shenzhen is rainy..."}
  ],
  "summary": {
    "step_count": 5, "tool_call_count": 2,
    "total_ms": 2950.6, "llm_ms": 2948.5, "tool_ms": 2.0
  }
}

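Note that the input to get_product_info is {'product_name': 'Basic'}: the LLM passed "Basic" without the "wonderbot" prefix, so the lookup failed. The audit log records the tool's raw input, which is critical when you are working out why the agent answered incorrectly. Without this record you would only see "not available" in the final answer, with no indication of where things went wrong.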
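
Once logs have this shape, spotting such failures can be automated. The sketch below scans an audit log for tool steps whose output looks like a failure and returns the raw inputs that caused them. The function name and the marker strings are assumptions: real tools report failures in different ways, so the markers would need to match your own tools' messages.

```python
def find_suspect_tool_calls(audit_log: dict, markers=("not found", "error")) -> list[dict]:
    """Return tool steps whose output contains a failure marker (case-insensitive)."""
    suspects = []
    for step in audit_log.get("steps", []):
        if step.get("type") != "tool":
            continue
        output = str(step.get("output", "")).lower()
        if any(marker in output for marker in markers):
            suspects.append({"tool": step.get("tool"), "input": step.get("input")})
    return suspects


# Steps copied from the audit log above (outputs abbreviated as in the article).
log = {
    "trace_id": "6c8e4b64",
    "steps": [
        {"type": "llm", "duration_ms": 593.0, "output_preview": ""},
        {"type": "tool", "duration_ms": 1.1, "tool": "get_weather",
         "input": "{'city': 'Shenzhen'}", "output": "..."},
        {"type": "llm", "duration_ms": 1164.9, "output_preview": ""},
        {"type": "tool", "duration_ms": 1.0, "tool": "get_product_info",
         "input": "{'product_name': 'Basic'}", "output": "Product 'Basic' not found..."},
    ],
}
suspects = find_suspect_tool_calls(log)
print(suspects)
```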

Comparing the Three Modes

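Mode              Use case                       Performance impact         Output
─────────────────────────────────────────────────────────────────────────────────────────────
Live trace        Development and debugging      Yes (print I/O)            Console event stream
Latency timeline  Performance analysis           None (post-processing)     ASCII chart
Audit log         Production, compliance audits  None (post-processing)     JSON file / database

All three modes share the same AgentTracer. verbose=True/False toggles live printing, and the post-processing analysis does not depend on the printed output.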

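Design Checklist

Tracer implementation

For chat models, use on_chat_model_start, not on_llm_start (implementing both is safer)
Record the tool name and input in on_tool_start, then pair them in on_tool_end through the same instance variables
Store start_time / end_time instead of computing only a delta in the end hook (this supports nested calls)
Make StepRecord a dataclass so it is easy to serialize and analyze later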
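
A single _tool_t0 field pairs start and end events correctly only while calls run one at a time. LangChain passes a run_id keyword argument to each callback, so one way to handle overlapping or nested calls is to key the start times by that ID. The sketch below shows only the pairing logic: the RunTimer class is hypothetical, and in a tracer its methods would be called from on_tool_start / on_tool_end with kwargs["run_id"].

```python
import time
import uuid


class RunTimer:
    """Pairs start and end events by run ID so overlapping calls do not clobber each other."""

    def __init__(self) -> None:
        self._open: dict[str, tuple[str, float]] = {}        # run_id -> (name, start_time)
        self.finished: list[tuple[str, float, float]] = []   # (name, start_time, end_time)

    def start(self, run_id, name: str) -> None:
        self._open[str(run_id)] = (name, time.time())

    def end(self, run_id) -> None:
        # Ignore end events whose start was never seen.
        entry = self._open.pop(str(run_id), None)
        if entry is None:
            return
        name, t0 = entry
        self.finished.append((name, t0, time.time()))


timer = RunTimer()
run_a, run_b = uuid.uuid4(), uuid.uuid4()
timer.start(run_a, "get_weather")
timer.start(run_b, "calculator")   # starts before the first call has ended
timer.end(run_a)
timer.end(run_b)
print([name for name, _, _ in timer.finished])
```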

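Audit log

Generate a unique trace_id for every request (the first 8 characters of a UUID are enough)
Record each tool's raw input, not just its name: the input is the key to troubleshooting
Expect an empty output_preview on intermediate LLM steps: empty content on tool-call turns is normal behavior
In production, write the audit log to a database or logging system rather than only printing it to stdout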
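
As a starting point for the last item, here is a minimal sketch that appends each audit record to a JSON Lines file, one record per line. It stands in for a real database or log pipeline; the function names and the file path are hypothetical, and the second trace ID in the demo is made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def append_audit_log(log: dict, path: str) -> None:
    """Append one audit record as a single JSON line."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as f:
        f.write(json.dumps(log, ensure_ascii=False) + "\n")


def read_audit_logs(path: str) -> list[dict]:
    """Load every record back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = f"{tmp}/audit/agent_audit.jsonl"
    append_audit_log({"trace_id": "6c8e4b64", "summary": {"step_count": 5}}, log_path)
    append_audit_log({"trace_id": "example02", "summary": {"step_count": 3}}, log_path)
    records = read_audit_logs(log_path)

print(len(records), records[0]["trace_id"])
```

One record per line keeps writes append-only and lets standard log shippers pick the file up without custom parsing.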

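Performance analysis

Separate llm_ms from tool_ms; the former usually accounts for 99%+
Optimization priority: fewer LLM calls > faster tool execution
Establish p50/p95 latency baselines to detect regressions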
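
The baseline in the last item can be computed from the total_ms values of stored audit logs. This is a sketch under stated assumptions: it uses the nearest-rank percentile definition, a 20% tolerance chosen arbitrarily, and sample totals that are invented for illustration (only 6457 and 8140 echo runs from this article).

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


def latency_regressed(baseline_p95: float, current_p95: float, tolerance: float = 0.2) -> bool:
    """Flag a regression when the current p95 exceeds the baseline by more than `tolerance`."""
    return current_p95 > baseline_p95 * (1 + tolerance)


totals_ms = [2800, 2900, 2950, 3000, 3050, 3100, 3200, 3300, 6457, 8140]
p50 = percentile(totals_ms, 50)
p95 = percentile(totals_ms, 95)
print(p50, p95)
```

Comparing p95 rather than the mean keeps a handful of slow requests visible instead of averaging them away.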

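Summary

Five key takeaways:

LangChain callbacks are the foundation of observability: four BaseCallbackHandler methods cover the full lifecycle of LLM and tool calls
Chat models use on_chat_model_start: this is an easy trap to fall into, because ChatOpenAI emits on_chat_model_start as its start event, not on_llm_start
The content of an LLM turn can be empty: on a tool-call turn the LLM output is tool_calls JSON, not text
The LLM accounts for 99-100% of latency: tool calls take milliseconds, so cut latency by making fewer LLM calls
The value of the audit log lies in tool inputs: recording only the final answer is not enough; the raw arguments a tool received are the key to troubleshooting

Next up: advanced agent memory systems. Short-term memory, long-term memory, and summary compression: how do you make an agent remember context across sessions?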
