多Agent协作：从采集到报告的流水线设计

做CLI工具容易，做4个Agent串联还能稳定跑的CLI工具不容易。

SmartInspector的全量分析流程是：用户说"帮我全面分析一下" → 采集设备trace → LLM分析性能瓶颈 → 定位到项目源码 → 生成结构化报告。四个步骤，四个Agent，数据在它们之间流转。

这篇文章拆解这个流水线是怎么搭的，踩了哪些坑。

Orchestrator：不干活的调度中心

很多多Agent系统的Orchestrator自己干一堆事------解析意图、准备数据、甚至直接调LLM。我的做法是：Orchestrator只做一件事------路由。

python 复制代码

@node_error_handler("orchestrator")
def orchestrator_node(state: AgentState) -> dict:
    """Pure LLM classification to decide routing."""
    messages = state.get("messages", [])
    
    # 提取最后一条用户消息
    user_msg = ""
    for m in reversed(messages):
        if isinstance(m, dict):
            if m.get("role") == "user":
                user_msg = m.get("content", "")
                break
        else:
            content = getattr(m, "content", "")
            msg_type = getattr(m, "type", "")
            if content and msg_type == "human":
                user_msg = content
                break

    # LLM分类，max_tokens=5，只返回一个标签
    llm = _get_route_llm()
    response = llm.invoke(orch_input)
    raw = response.content.strip().lower()
    
    # 解析路由标签
    decision = RouteDecision.END
    for v, rd in valid.items():
        if v in raw:
            decision = rd
            break
    
    return {"messages": [], "_route": decision, "skip_wait": skip_wait}

关键设计决策：

Orchestrator不修改任何数据字段 。它只往state里塞一个_route字符串，其他字段原样透传。这保证了下游节点拿到的数据是干净的。
用max_tokens=5 ，LLM只需要返回一个标签（如full_analysis），不浪费Token。
fallback机制：LLM分类失败时路由到fallback节点，不会直接报错挂掉。

路由结果存在AgentState["_route"]里，LangGraph的conditional edges根据这个值决定走哪条边：

python 复制代码

def route_from_orchestrator(state: AgentState) -> str:
    decision = state.get("_route", "end")
    mapping = {
        RouteDecision.FULL_ANALYSIS: "collector",
        RouteDecision.ANDROID: "android_expert",
        RouteDecision.EXPLORER: "explorer",
        RouteDecision.END: "fallback",
    }
    return mapping.get(decision, "fallback")

AgentState：数据流转的骨架

多Agent系统最核心的设计是状态。每个Agent都是一个纯函数：输入state → 输出部分state更新。LangGraph负责merge。

python 复制代码

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]    # 消息列表，只追加不覆盖
    perf_summary: str          # JSON: 采集结果
    perf_analysis: str         # Markdown: 分析结果
    attribution_data: str      # JSON: 可归因的slice列表
    attribution_result: str    # JSON: 归因结果（含源码片段）
    _route: str                # 路由决策
    _trace_path: str           # trace文件路径

这里有三个设计要点：

1. messages用operator.add而非覆盖

LangGraph的Annotated[list, operator.add]意味着每个节点返回的messages会被追加到已有列表，而不是替换。这样collector产生的[trace collected]消息不会丢，reporter产生的最终报告也不会覆盖前面的进度信息。

2. 数据字段用字符串不用对象

perf_summary、attribution_result都是JSON字符串，不是Python对象。原因很简单：LangGraph的state serialization要求所有字段可序列化。用字符串虽然每次要json.loads()，但避免了自定义serializer的麻烦。

3. _pass_through保证数据不丢

每个节点只修改自己负责的字段，其他字段必须透传：

python 复制代码

def _pass_through(state: AgentState, *, extra_keys: tuple = ()) -> dict:
    keys = _PASS_THROUGH_KEYS + extra_keys
    return {k: state.get(k, "") for k in keys}

这解决了一个实际Bug：早期attributor节点忘了透传perf_analysis，导致reporter拿到的分析结果是空的。加了_pass_through后，每个节点都显式声明"我不改的字段原样传下去"，再没出过数据丢失的问题。

四个Agent，四步流水线

Collector：采集

Collector是最重的节点------调adb采集Perfetto trace，解析成结构化JSON，还要和WebSocket通道合并block事件。

python 复制代码

def collector_node(state: AgentState) -> dict:
    # 冷启动模式：先force-stop，采集开始后再launch
    if is_startup and cold_start_target:
        _adb_force_stop(cold_start_target)
    
    # 采集trace
    trace_path = PerfettoCollector.pull_trace_from_device(
        duration_ms=duration_ms,
        target_process=target_process,
    )
    
    # 解析并生成summary
    collector = PerfettoCollector(trace_path)
    summary = collector.summarize()
    
    # 合并WS block事件（SQL数据+App端stack_trace）
    merged = _merge_block_events(sql_events, ws_events)
    summary.block_events = merged
    
    return {
        "perf_summary": summary.to_json(),    # 核心输出
        "_trace_path": trace_path,            # 传给下游
    }

Collector的输出是perf_summary（JSON字符串），这是整个流水线的数据源。后续所有节点都依赖这个字段。

Analyzer：分析

Analyzer拿perf_summary做LLM分析，输出Markdown格式的性能诊断：

python 复制代码

def analyzer_node(state: AgentState) -> dict:
    perf_json = state.get("perf_summary", "")
    analysis = analyze_perf(perf_json)
    
    return {
        "messages": [AIMessage(content=analysis)],
        "perf_analysis": analysis,     # 核心输出
    }

这里有个perf_analyzer_node和analyzer_node的区别------前者是独立使用的（用户说"分析一下这份数据"），后者是pipeline中的一环。功能一样，区别在于perf_analyzer_node会尝试从历史消息中找数据，而pipeline版本的analyzer_node只从state取。

Attributor：归因

Attributor做的是从性能热点追溯到项目源码：

python 复制代码

def attributor_node(state: AgentState) -> dict:
    perf_json = state.get("perf_summary", "")
    
    # 第一步：提取可归因的slice（过滤系统类）
    attributable = extract_attributable_slices(perf_json, min_dur_ms=1.0)
    
    # 第二步：在项目源码中搜索方法定义
    results = run_attribution(attributable)
    
    return {
        "attribution_data": json.dumps(attributable),
        "attribution_result": json.dumps(results),  # 核心输出
    }

归因的核心逻辑：Perfetto记录的是SI$tag#ClassName.methodName#duration格式的slice，Attributor从中提取类名和方法名，用glob+grep在项目目录搜索匹配的源文件，返回文件路径和代码片段。

Reporter：报告

Reporter是流水线最后一环，把前面三个节点的输出合并成一份结构化报告：

python 复制代码

def reporter_node(state: AgentState) -> dict:
    perf_json = state.get("perf_summary", "")
    perf_analysis = state.get("perf_analysis", "")
    attribution_result = state.get("attribution_result", "")
    
    # 拼装输入：归因数据放最前面（最重要，不能被截断）
    user_parts = []
    user_parts.extend(format_attribution_section(attribution_result))
    user_parts.extend(format_perf_sections(perf_json))
    user_parts.append(f"## 性能分析\n{perf_analysis}")
    
    # Token超限时按段落截断
    if estimated_tokens > MAX_REPORT_INPUT_TOKENS:
        sections = user_content.split("\n\n")
        truncated = []
        for sec in sections:
            if total + len(sec) > target_chars and truncated:
                break
            truncated.append(sec)
    
    # 流式生成报告
    full_content = generate_report(report_prompt, user_content)
    
    return {
        "messages": [AIMessage(content=complete_report)],
    }

归因数据放最前面这个设计是踩坑后的结论。早期把perf_summary放最前面，结果Token超限截断时，归因数据（最核心的输入）被截掉了，报告质量断崖式下降。调整顺序后，即使截断也只丢一些辅助指标数据。

错误处理：一个挂了不影响全局

多Agent系统最怕的是某个节点异常后整个pipeline崩掉。我用了一个装饰器统一处理：

python 复制代码

def node_error_handler(node_name: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(state: AgentState) -> dict:
            try:
                return func(state)
            except Exception as e:
                print(f"  [{node_name}] ERROR: {e}", flush=True)
                return {
                    "messages": [AIMessage(content=f"[{node_name}] Error: {e}")],
                    **_pass_through(state),
                }
        return wrapper
    return decorator

核心思路：任何节点报错都返回一个安全的state更新------把错误信息塞进messages，其他字段透传。这样：

Collector挂了 → Analyzer拿到空的perf_summary → 跳过分析 → 用户看到错误提示
Analyzer挂了 → Attributor仍然能基于perf_summary做归因（降级但不中断）
Attributor挂了 → Reporter用perf_analysis生成报告（少源码归因，但报告不缺）

降级而不中断，这是流水线系统的核心原则。

Graph构建：把节点串起来

最后看LangGraph的图构建：

python 复制代码

def create_graph():
    builder = StateGraph(AgentState)
    
    # 注册所有节点
    builder.add_node("orchestrator", orchestrator_node)
    builder.add_node("collector", collector_node)
    builder.add_node("analyzer", analyzer_node)
    builder.add_node("attributor", attributor_node)
    builder.add_node("reporter", reporter_node)
    
    # 入口
    builder.add_edge(START, "orchestrator")
    
    # Orchestrator条件路由
    builder.add_conditional_edges("orchestrator", route_from_orchestrator, ...)
    
    # 全量分析流水线
    builder.add_edge("collector", "analyzer")
    builder.add_conditional_edges("analyzer", _route_from_analyzer, ...)
    builder.add_edge("attributor", "reporter")
    builder.add_edge("reporter", END)
    
    return builder.compile(checkpointer=MemorySaver())

注意analyzer后面有条件分支：TRACE模式直接END（只要分析），STARTUP模式走冷启动分析，FULL_ANALYSIS模式走attributor→reporter。同一个analyzer节点服务三种场景，靠_route字段区分后续路径。

总结：几个关键决策

Orchestrator只路由不处理------职责单一，好维护，Token消耗极低（5个Token）
State用字符串不用对象------序列化零风险，代价只是多一点json.loads
_pass_through强制透传------防止节点遗漏数据，这是个Bug驱动的抽象
错误处理统一装饰器------降级不中断，用户总能拿到部分结果
归因数据放报告输入最前面------Token截断时保核心丢边缘

这套流水线跑了快两个月，从最开始的collector→analyzer两步，逐步加到现在的4步。LangGraph的StateGraph模型让扩展很自然------加个节点，加条边，改个路由函数，完事。

SmartInspector项目地址：GitHub 系列上一篇：意图路由：用5个Token决定该找谁