A Practical Guide to Multi-Agent Architecture

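1. Why Multi-Agent Systems

The ceiling of a single agent:

Limitation	Cause	Consequence
Limited context window	Intermediate state from a complex task overflows the context	Long tasks start "forgetting" things halfway through
Role conflict	One agent both writes and reviews the code	It reviews its own work and misses problems
Serial bottleneck	Every step runs in sequence	Tasks that could run in parallel are forced to queue
Shallow expertise	One prompt cannot hold the knowledge of every domain	Capable at everything, expert at nothing
Error propagation	One bad step affects the whole chain	No isolation, no independent verification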

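The core value of multi-agent systems: specialized agents handle specialized work, run in parallel, and check one another, so the system as a whole can do more than any single agent on its own.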

2. Three Basic Multi-Agent Topologies

2.1 Pipeline

The simplest pattern: each agent handles one stage and passes its result to the next.

```text
User input
    ↓
[Agent 1: Intent understanding] → structured task description
    ↓
[Agent 2: Information gathering] → raw data
    ↓
[Agent 3: Analysis] → analytical conclusions
    ↓
[Agent 4: Report generation] → final output
```

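Good for: tasks with well-defined steps in a fixed order (content production lines, data processing pipelines).

Drawback: an error in one stage contaminates every later stage, and the stages cannot run in parallel.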
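
The pipeline pattern can be sketched in a few lines. This is a minimal illustration, assuming each agent is reduced to a plain function over a dict; the stage functions are hypothetical stubs and make no LLM calls.

```python
from typing import Callable

# Each stage stands in for one agent (hypothetical stubs, no LLM calls):
# it receives the previous stage's output and returns its own.
Stage = Callable[[dict], dict]

def understand_intent(payload: dict) -> dict:
    return {**payload, "task": f"research: {payload['user_input']}"}

def collect_information(payload: dict) -> dict:
    return {**payload, "raw_data": ["doc-1", "doc-2"]}

def analyze(payload: dict) -> dict:
    return {**payload, "conclusion": f"{len(payload['raw_data'])} sources reviewed"}

def write_report(payload: dict) -> dict:
    return {**payload, "report": f"{payload['task']} -> {payload['conclusion']}"}

def run_pipeline(stages: list[Stage], user_input: str) -> dict:
    payload = {"user_input": user_input}
    for stage in stages:
        # Strictly sequential: a bad output here flows into every later stage
        payload = stage(payload)
    return payload

PIPELINE = [understand_intent, collect_information, analyze, write_report]
result = run_pipeline(PIPELINE, "agent frameworks")
print(result["report"])
```

Because each stage only sees what the previous one produced, validating a stage's output before handing it on is the natural place to contain errors.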

2.2 Orchestrator-Worker

An orchestrator agent decomposes the task and assigns the pieces; multiple worker agents execute them in parallel.

```text
User goal
    ↓
[Orchestrator Agent]  ← responsible for: decomposition / assignment / aggregation / decisions
    ↙      ↓      ↘
[Worker A] [Worker B] [Worker C]   ← run subtasks in parallel
  Search    Code analysis  Doc writing
    ↘      ↓      ↙
[Orchestrator Agent]  ← aggregates results, decides whether to continue
    ↓
Final output
```

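Good for: tasks that decompose into mutually independent subtasks (large research projects, complex code refactoring).

Key design point: the orchestrator only schedules and does no hands-on work itself, which keeps responsibilities separated.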
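
A minimal sketch of the fan-out and fan-in, assuming workers are async functions and the plan is fixed. `plan`, `WORKERS` and the worker functions are illustrative names; a real orchestrator would typically ask an LLM for the plan.

```python
import asyncio

async def search_worker(subtask: str) -> str:
    await asyncio.sleep(0.01)  # stands in for an LLM or tool call
    return f"search results for '{subtask}'"

async def code_worker(subtask: str) -> str:
    await asyncio.sleep(0.01)
    return f"code analysis of '{subtask}'"

WORKERS = {"search": search_worker, "code": code_worker}

def plan(goal: str) -> list[tuple[str, str]]:
    # Hypothetical fixed plan: (worker type, subtask description)
    return [("search", f"{goal}: prior art"), ("code", f"{goal}: current module")]

async def orchestrate(goal: str) -> str:
    assignments = plan(goal)
    # Dispatch only: the orchestrator never does a worker's job itself
    outputs = await asyncio.gather(
        *(WORKERS[kind](subtask) for kind, subtask in assignments)
    )
    # gather returns results in the order the subtasks were submitted
    return "\n".join(outputs)

summary = asyncio.run(orchestrate("refactor cache"))
print(summary)
```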

2.3 Debate / Critic

Several agents offer different perspectives on the same problem or challenge one another, converging on a higher-quality answer.

```text
User question
    ↓
[Agent A: Proposer]  → gives an initial proposal
    ↓
[Agent B: Critic]    → finds flaws and risks
    ↓
[Agent C: Judge]     → weighs both sides and gives the final verdict
```

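Variant: self-debate (a single agent plays multiple roles):

```text
Round 1: give an answer
Round 2: "Now act as a picky reviewer and find every problem with the answer above"
Round 3: "Taking all of the above into account, give a revised version"
```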

Good for: high-stakes decisions, code review, and proposal evaluation (anywhere a single point of view leaves blind spots).
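
The propose, critique, judge sequence can be sketched as one function. This assumes each role is a callable that maps a prompt to a completion; the prompts and the stub models are illustrative only.

```python
from typing import Callable

LLM = Callable[[str], str]  # prompt in, completion out

def debate(question: str, proposer: LLM, critic: LLM, judge: LLM) -> dict:
    proposal = proposer(f"Propose a solution.\nQuestion: {question}")
    # The critic sees the proposal, not the original reasoning behind it
    critique = critic(f"List flaws and risks in this proposal.\nProposal: {proposal}")
    # The judge sees both sides before deciding
    verdict = judge(
        "Weigh the proposal against the critique and give a final answer.\n"
        f"Proposal: {proposal}\nCritique: {critique}"
    )
    return {"proposal": proposal, "critique": critique, "verdict": verdict}

# Stub models for illustration; replace with real LLM calls.
def fake_proposer(prompt: str) -> str:
    return "Cache everything in memory."

def fake_critic(prompt: str) -> str:
    return "Unbounded memory growth."

def fake_judge(prompt: str) -> str:
    return "Use a size-bounded LRU cache."

outcome = debate("How should we cache API responses?",
                 fake_proposer, fake_critic, fake_judge)
print(outcome["verdict"])
```

Using different models or system prompts for the three roles is what separates this from the self-debate variant, where one model plays all of them.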

2.4 Hybrid (common in production)

A combination of the three topologies above, switched dynamically according to the task:

```text
[Orchestrator]
    ↓ decompose task
  ┌────────────────────────────────────┐
  │  Pipeline subtask 1                │
  │  [Collect] → [Process] → [Verify]  │
  └────────────────────────────────────┘
  ┌────────────────────────────────────┐
  │  Parallel subtask 2                │
  │  [Worker A] [Worker B]             │
  └────────────────────────────────────┘
    ↓ aggregate
  [Critic Agent]  ← quality review of the results
    ↓
  Final output
```

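3. Core Design Patterns

3.1 Task Decomposition

How the orchestrator breaks a large task into small ones is the first critical decision in a multi-agent system.

Method 1: static decomposition (predefined subtasks)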

```python
# The task structure is known, so the decomposition is hard-coded
RESEARCH_PIPELINE = [
    {"agent": "searcher", "task": "Collect relevant information"},
    {"agent": "analyst",  "task": "Analyze the data"},
    {"agent": "writer",   "task": "Write the report"},
    {"agent": "reviewer", "task": "Review the report"},
]
```

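Good for: stable task structures with a fixed flow; in essence this is a custom workflow.

Method 2: dynamic decomposition (the LLM reasons out the split)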

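```python
import json
from dataclasses import dataclass, field

# Literal braces are doubled so that str.format only substitutes {task}
DECOMPOSE_PROMPT = """You are a task-planning expert. Break the complex task below into a list of subtasks that can each be executed independently.

Task: {task}

Requirements:
- Each subtask can be executed on its own
- Mark the dependencies between subtasks
- Mark which subtasks can run in parallel

Return JSON:
{{
  "subtasks": [
    {{
      "id": "t1",
      "description": "what the subtask does",
      "agent_type": "the type of agent required",
      "depends_on": [],
      "can_parallel": true
    }}
  ]
}}"""


@dataclass
class Subtask:
    id: str
    description: str
    agent_type: str
    depends_on: list[str] = field(default_factory=list)
    can_parallel: bool = True


def parse_subtasks(response: str) -> list[Subtask]:
    # Assumes the model returned bare JSON with no surrounding prose
    return [Subtask(**item) for item in json.loads(response)["subtasks"]]


def decompose_task(orchestrator_llm, task: str) -> list[Subtask]:
    # orchestrator_llm is any client object exposing complete(prompt) -> str
    response = orchestrator_llm.complete(DECOMPOSE_PROMPT.format(task=task))
    return parse_subtasks(response)
```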

Good for: tasks whose structure is not known in advance, where the agent has to plan on its own.
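
Once subtasks carry a `depends_on` list, the orchestrator still has to turn them into an execution order. One possible approach, sketched here with the standard library's `graphlib`, groups subtask IDs into "waves" whose members can run in parallel. The `execution_waves` helper and the sample plan are illustrative.

```python
from graphlib import CycleError, TopologicalSorter

def execution_waves(subtasks: list[dict]) -> list[list[str]]:
    """Group subtask ids into waves; every id in a wave has all its dependencies met."""
    sorter = TopologicalSorter({t["id"]: set(t["depends_on"]) for t in subtasks})
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # unlocks the subtasks that depend on them
    return waves

plan = [
    {"id": "t1", "description": "collect usage data", "depends_on": []},
    {"id": "t2", "description": "collect pricing data", "depends_on": []},
    {"id": "t3", "description": "analyze", "depends_on": ["t1", "t2"]},
    {"id": "t4", "description": "write report", "depends_on": ["t3"]},
]
waves = execution_waves(plan)
print(waves)
```

A plan with a circular dependency fails at `prepare()` with `CycleError`, which is a cheap way to reject a malformed decomposition before any agent runs.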

3.2 Inter-Agent Communication

The fields a message needs:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMessage:
    sender: str          # sending agent ID
    receiver: str        # receiving agent ID (or "broadcast")
    content: str         # message body
    message_type: str    # "task" / "result" / "error" / "clarification"
    task_id: str         # for tracing; all messages for one task share this ID
    metadata: dict = field(default_factory=dict)  # extras (priority, deadline, ...)
```

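Communication patterns compared:

Pattern	Implementation	Best for
Direct call	Agent A calls Agent B's function directly	Simple, low-latency scenarios
Message queue	Agents communicate asynchronously through a queue	High-concurrency, decoupled scenarios
Shared state	Multiple agents read and write one state store	When agents need a live view of global state
Event bus	Agents publish events, other agents subscribe	Loose coupling, dynamic scaling

Practical recommendation: prefer a message queue (Redis / RabbitMQ) in production and direct calls during development and debugging.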
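
To show the queue pattern without any infrastructure, here is a sketch that uses `asyncio.Queue` as an in-process stand-in for Redis or RabbitMQ. The `Message` class is a trimmed-down, hypothetical version of the message structure above, and the worker simply upper-cases its input.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    receiver: str
    message_type: str  # "task" or "result"
    content: str
    task_id: str

async def worker(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    while True:
        msg = await inbox.get()
        if msg is None:  # sentinel: shut down
            break
        # Reply to whoever sent the task, keeping the same task_id for tracing
        reply = Message("worker", msg.sender, "result", msg.content.upper(), msg.task_id)
        await outbox.put(reply)

async def main() -> list[Message]:
    to_worker: asyncio.Queue = asyncio.Queue()
    to_orchestrator: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(to_worker, to_orchestrator))

    for i, text in enumerate(["summarize", "translate"]):
        await to_worker.put(Message("orchestrator", "worker", "task", text, f"job-{i}"))
    replies = [await to_orchestrator.get() for _ in range(2)]

    await to_worker.put(None)  # stop the worker
    await worker_task
    return replies

replies = asyncio.run(main())
print([r.content for r in replies])
```

The sender never calls the worker directly, so swapping the in-process queue for a real broker changes the transport but not the agents.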

3.3 Shared State

Agents that collaborate need a shared "workbench":

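```python
from __future__ import annotations  # Subtask, Error and Event are assumed to be defined elsewhere

from dataclasses import dataclass, field

@dataclass
class TaskState:
    task_id: str
    status: str              # pending / running / completed / failed
    original_goal: str       # the user's original goal
    subtasks: list[Subtask] = field(default_factory=list)  # subtasks and their status
    context: dict = field(default_factory=dict)            # context shared by all agents
    artifacts: dict = field(default_factory=dict)          # intermediate outputs (docs, code, data)
    errors: list[Error] = field(default_factory=list)      # error records
    history: list[Event] = field(default_factory=list)     # full execution history (for debugging)

# All agents read and write the same TaskState.
# Key point: writes must be locked to prevent concurrent conflicts.
```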

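Choosing a state store:

```text
Lightweight:  in-memory dict (single machine, no persistence needed)
Production:   Redis (low latency, optional persistence, atomic operations)
Complex:      PostgreSQL / MongoDB (when you need history queries and auditing)
```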
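
For the in-memory case, "lock your writes" can look like the sketch below. It assumes all agents run in one asyncio event loop; `SharedState` and `record` are hypothetical names, and the `sleep(0)` simulates I/O that would otherwise let two updates interleave.

```python
import asyncio

class SharedState:
    """In-memory stand-in for the shared workspace; a lock serializes writes."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.artifacts: dict[str, str] = {}
        self.history: list[str] = []

    async def record(self, agent: str, key: str, value: str) -> None:
        async with self._lock:
            # Read-modify-write happens under the lock, so the await in the
            # middle cannot interleave two agents' updates.
            snapshot = list(self.history)
            await asyncio.sleep(0)  # simulated I/O; yields to other tasks
            self.artifacts[key] = value
            self.history = snapshot + [f"{agent} wrote {key}"]

async def main() -> SharedState:
    state = SharedState()
    await asyncio.gather(
        *(state.record(f"agent-{i}", f"k{i}", "v") for i in range(5))
    )
    return state

state = asyncio.run(main())
print(len(state.history))
```

Without the lock, every task would take its snapshot before any of them wrote back, and all but one history entry would be lost. With a store such as Redis, the same guarantee would come from its atomic operations rather than an in-process lock.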

3.4 Error Handling and Fault Tolerance

Errors propagate in more complicated ways in a multi-agent system than in a single agent:

```python
import asyncio

# Subtask, Result, AgentTimeoutError and AgentFailedError are assumed to be defined elsewhere.

class ResilientOrchestrator:

    async def execute_subtask(self, subtask: Subtask) -> Result:
        for attempt in range(3):
            try:
                return await self.workers[subtask.agent_type].run(subtask)

            except AgentTimeoutError:
                if attempt == 2:
                    # Timed out 3 times: try to finish with a different type of agent
                    return await self.fallback_agent.run(subtask)

            except AgentFailedError as e:
                # Retry with exponential backoff only while the error is retryable and attempts remain
                if e.is_retryable and attempt < 2:
                    await asyncio.sleep(2 ** attempt)
                    continue
                # Otherwise mark the subtask failed and let the orchestrator decide
                return Result(status="failed", error=e, subtask=subtask)

    def handle_subtask_failure(self, failed_subtask: Subtask):
        """The orchestrator's three strategies when a subtask fails."""
        if failed_subtask.is_critical:
            self.abort_entire_task()          # critical step failed → abort the whole task
        elif failed_subtask.has_fallback:
            self.use_degraded_result()        # fallback available → continue with a second-best result
        else:
            self.skip_and_continue()          # non-critical → skip and carry on
```
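
The retry-then-fallback idea can be tried in isolation with a self-contained miniature. This sketch assumes a single retryable error type and uses hypothetical names throughout (`run_with_retry`, `flaky_worker`, `fallback_worker`); delays are shortened so it runs instantly.

```python
import asyncio

class RetryableError(Exception):
    pass

async def run_with_retry(call, fallback, attempts: int = 3, base_delay: float = 0.01):
    """Try `call` up to `attempts` times with exponential backoff, then use `fallback`."""
    for attempt in range(attempts):
        try:
            return await call()
        except RetryableError:
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x ...
    return await fallback()

calls = {"flaky": 0}

async def flaky_worker() -> str:
    # Fails twice, then succeeds: a stand-in for a transient outage
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise RetryableError("transient failure")
    return "primary result"

async def fallback_worker() -> str:
    return "degraded result"

outcome = asyncio.run(run_with_retry(flaky_worker, fallback_worker))
print(outcome, calls["flaky"])
```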

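4. Comparing the Mainstream Frameworks

4.1 AutoGen (Microsoft)

Core idea: conversation as collaboration. Agents complete tasks by talking to one another in natural language.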

```python
import autogen  # AutoGen 0.2-style API

# Define the agents
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"model": "gpt-4o"},
    system_message="You are a coding expert responsible for writing and optimizing code.",
)

critic = autogen.AssistantAgent(
    name="critic",
    llm_config={"model": "gpt-4o"},
    system_message="You are a strict code reviewer. Find every potential problem.",
)

user_proxy = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",     # automatic mode, no human in the loop
    code_execution_config={"work_dir": "workspace"},
)

# A group chat is needed for the critic to take part; a two-party
# initiate_chat(assistant, ...) would never reach it.
# max_round=12 is an illustrative cap, not a recommended value.
group_chat = autogen.GroupChat(
    agents=[user_proxy, assistant, critic], messages=[], max_round=12
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config={"model": "gpt-4o"})

# Start the multi-turn conversation
user_proxy.initiate_chat(manager, message="Write a high-performance LRU cache implementation")
# Intended flow: assistant writes code → user_proxy executes it → critic reviews
# → assistant revises → repeat until the round limit or a termination message
```

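Pros: conversation-driven, flexible and natural, with built-in code execution. Cons: conversations are hard to control precisely and can slide into pointless back-and-forth.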

4.2 CrewAI

Core idea: role-driven. Each agent is given an explicit role, goal, and backstory.

```python
from crewai import Agent, Task, Crew, Process

# web_search_tool and data_analysis_tool are assumed to be tool objects defined elsewhere

# Define the roles
researcher = Agent(
    role="Market Researcher",
    goal="Collect detailed information about the target market",
    backstory="You have 10 years of market research experience and excel at distilling insights from large volumes of information",
    tools=[web_search_tool, data_analysis_tool],
    llm="claude-opus-4-6",
)

writer = Agent(
    role="Business Analyst",
    goal="Turn research findings into actionable business recommendations",
    backstory="You are a former McKinsey consultant who excels at turning complex data into clear recommendations",
    llm="claude-sonnet-4-6",
)

# Define the tasks
research_task = Task(
    description="Research the size and main competitors of China's AI coding-assistant market",
    expected_output="A structured report covering market size, competitive landscape, and user pain points",
    agent=researcher,
)

analysis_task = Task(
    description="Based on the research, give the top 3 strategies for entering the market",
    expected_output="3 concrete, actionable market-entry strategies, each with a risk assessment",
    agent=writer,
    context=[research_task],   # depends on the output of the research task
)

# Start the crew
crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, analysis_task],
    process=Process.sequential,   # or Process.hierarchical
)

result = crew.kickoff()
```

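Pros: clear role definitions, a good fit for modeling business scenarios, quick to pick up. Cons: less flexible, with weaker support for dynamic tasks.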

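4.3 LangGraph

Core idea: the graph is the workflow. A directed graph, which may contain cycles, precisely defines agent execution logic and state transitions.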

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

# web_search and llm are assumed helpers defined elsewhere

# Define the shared state
class ResearchState(TypedDict):
    query: str
    search_results: list
    analysis: str
    final_report: str
    iteration_count: int

# Define the nodes (each node is an agent or a processing function)
def search_agent(state: ResearchState) -> ResearchState:
    results = web_search(state["query"])
    return {**state, "search_results": results}

def analyze_agent(state: ResearchState) -> ResearchState:
    analysis = llm.analyze(state["search_results"])
    return {**state, "analysis": analysis, "iteration_count": state["iteration_count"] + 1}

def write_agent(state: ResearchState) -> ResearchState:
    report = llm.write_report(state["analysis"])
    return {**state, "final_report": report}

# Conditional routing: decide where to go next
def should_continue(state: ResearchState) -> str:
    if state["iteration_count"] < 3 and len(state["search_results"]) < 5:
        return "search"    # not enough results, search again
    return "write"         # enough results, move on to writing

# Build the graph
graph = StateGraph(ResearchState)
graph.add_node("search", search_agent)
graph.add_node("analyze", analyze_agent)
graph.add_node("write", write_agent)

graph.set_entry_point("search")
graph.add_edge("search", "analyze")
graph.add_conditional_edges("analyze", should_continue, {
    "search": "search",
    "write": "write",
})
graph.add_edge("write", END)

app = graph.compile()
result = app.invoke({"query": "Latest progress in AI agents", "iteration_count": 0})
```

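Pros: precisely controllable execution flow, support for loops and conditional branches, debugging-friendly, well suited to production. Cons: a steeper learning curve; the graph structure has to be designed explicitly.

4.4 Framework Selection Advice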

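```text
Quick prototype / simple task structure       →  CrewAI (fastest to pick up)
Conversation-driven / needs code execution    →  AutoGen
Production / needs precise flow control       →  LangGraph (recommended)
Already on the Claude Code / MCP stack        →  native Sub-agent + Skill (integrates naturally)
```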

5. Case Study: A Multi-Agent Code Review System

Goal: automatically run a comprehensive code review on a pull request.

```text
PR submitted
    ↓
[Orchestrator]
    ├── dispatches to the workers in parallel:
    │   ├── [Security Agent]      → checks SQL injection, XSS, permission issues
    │   ├── [Performance Agent]   → checks N+1 queries, memory leaks, algorithmic complexity
    │   ├── [Style Agent]         → checks code style, naming conventions, comment completeness
    │   └── [Test Agent]          → checks test coverage, edge cases
    ↓
[Critic Agent]               → prioritizes and deduplicates the agents' findings
    ↓
[Summary Agent]              → generates a structured review report
    ↓
PR comment posted automatically
```

Key implementation details:

```python
import asyncio

# The *_agent objects and ReviewReport are assumed to be defined elsewhere.

async def review_pr(pr_diff: str) -> ReviewReport:

    # 1. Run all the specialized checks in parallel
    results = await asyncio.gather(
        security_agent.review(pr_diff),
        performance_agent.review(pr_diff),
        style_agent.review(pr_diff),
        test_agent.review(pr_diff),
        return_exceptions=True,   # one agent failing does not affect the others
    )

    # 2. Filter out the failed results
    valid_results = [r for r in results if not isinstance(r, Exception)]

    # 3. The critic agent prioritizes the findings
    prioritized = await critic_agent.prioritize(valid_results)

    # 4. Generate the final report
    report = await summary_agent.generate(prioritized)
    return report

# Comparison (rough figures):
# Single agent: serial execution, ~60s, prone to missing a whole category of problem
# Multi-agent:  parallel execution, ~15s, each specialized agent focuses on its own domain
```
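
The critic's "prioritize and deduplicate" step does not have to be an LLM call. A deterministic version might look like the sketch below, which assumes findings are structured records and that two findings are duplicates when they share a file, a line, and a normalized message. The `Finding` fields and the severity scale are illustrative.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}  # assumed scale

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: str
    message: str
    source: str  # which agent reported it

def merge_findings(per_agent: list[list[Finding]]) -> list[Finding]:
    """Deduplicate findings across agents, keep the most severe copy, sort by severity."""
    best: dict[tuple[str, int, str], Finding] = {}
    for findings in per_agent:
        for f in findings:
            key = (f.file, f.line, f.message.strip().lower())
            kept = best.get(key)
            if kept is None or SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[kept.severity]:
                best[key] = f
    return sorted(best.values(),
                  key=lambda f: (SEVERITY_ORDER[f.severity], f.file, f.line))

security = [Finding("api.py", 42, "critical", "SQL built by string concatenation", "security")]
style = [
    Finding("api.py", 42, "major", "sql built by string concatenation ", "style"),
    Finding("util.py", 7, "minor", "unused import", "style"),
]
merged = merge_findings([security, style])
for f in merged:
    print(f.severity, f.file, f.line, f.message)
```

Exact-match deduplication is deliberately conservative. Findings that describe the same problem in different words would still need a fuzzier comparison, which is where an LLM-based critic earns its cost.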

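6. Key Challenges in Production

6.1 Preventing "Agent Loops"

One of the most common failures in multi-agent systems: agents get stuck waiting on one another or looping forever.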

```python
import time

class MaxIterationsExceeded(RuntimeError):
    """Raised when the agent loop runs more rounds than allowed."""

class LoopGuard:
    def __init__(self, max_iterations: int = 20, max_time_seconds: int = 300):
        self.max_iterations = max_iterations
        self.max_time = max_time_seconds
        self.start_time = time.time()
        self.iteration = 0

    def check(self):
        # Call once per round, before the agents do any work
        self.iteration += 1
        if self.iteration > self.max_iterations:
            raise MaxIterationsExceeded(f"Ran {self.iteration} rounds, forcing termination")
        if time.time() - self.start_time > self.max_time:
            raise TimeoutError(f"Ran longer than {self.max_time}s, forcing termination")
```
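
A guard on the loop bounds the number of rounds, but a single hung agent call can still block forever. The sketch below pairs a compact guard with `asyncio.wait_for` so both are bounded. `Guard`, `BudgetExceeded`, `refine` and `add_one` are illustrative names.

```python
import asyncio
import time

class BudgetExceeded(RuntimeError):
    """Raised when a loop uses up its iteration or wall-clock budget."""

class Guard:
    def __init__(self, max_iterations: int, max_seconds: float) -> None:
        self.max_iterations = max_iterations
        self.deadline = time.monotonic() + max_seconds
        self.iteration = 0

    def check(self) -> None:
        self.iteration += 1
        if self.iteration > self.max_iterations:
            raise BudgetExceeded(f"stopped after {self.max_iterations} rounds")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("stopped: wall-clock budget used up")

async def refine(step, is_done, guard: Guard, step_timeout: float = 1.0):
    draft = None
    while True:
        guard.check()  # bounds the loop as a whole
        # Bounds a single agent call; raises asyncio.TimeoutError if it hangs
        draft = await asyncio.wait_for(step(draft), timeout=step_timeout)
        if is_done(draft):
            return draft

async def add_one(draft):
    # Stand-in for one round of agent work
    return (draft or 0) + 1

result = asyncio.run(refine(add_one, lambda d: d >= 3, Guard(10, 5.0)))
print(result)
```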

6.2 Cost Control

With multiple agents calling models in parallel, token consumption is several times that of a single agent:

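```text
Cost optimization strategies:
├── Use a cheap model (Haiku) for the orchestrator; pick worker models by task complexity
├── Cache subtask results (identical inputs are not re-run)
├── Set a token budget cap per task
└── Degradation strategy: reduce parallelism when the budget runs low
```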
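
Two of these strategies, result caching and a per-task token budget, fit in one small wrapper. This is a sketch, assuming the model call returns the text together with the number of tokens it used; `BudgetedClient` and `fake_model` are hypothetical.

```python
import hashlib

class BudgetExhausted(RuntimeError):
    pass

class BudgetedClient:
    def __init__(self, call_model, token_budget: int) -> None:
        self._call_model = call_model   # callable: prompt -> (text, tokens_used)
        self.remaining = token_budget
        self._cache: dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]     # cache hit costs no tokens
        # Checked before the call, so the final call may overshoot the budget
        if self.remaining <= 0:
            raise BudgetExhausted("token budget for this task is used up")
        text, tokens_used = self._call_model(prompt)
        self.remaining -= tokens_used
        self._cache[key] = text
        return text

def fake_model(prompt: str) -> tuple[str, int]:
    # Stand-in for an LLM: reverses the prompt and "charges" one token per word
    return prompt[::-1], len(prompt.split())

client = BudgetedClient(fake_model, token_budget=5)
first = client.complete("a b c")
second = client.complete("a b c")   # served from cache
print(first, client.remaining)
```

Catching `BudgetExhausted` in the orchestrator is one place to hang the degradation strategy, for example by dropping optional workers.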

6.3 Observability

A single agent is easy to troubleshoot; with multiple agents you have to trace the entire call tree:

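```text
What must be recorded:
├── Each agent's input and output (in full)
├── The complete chain of messages passed between agents
├── Duration and token consumption of each subtask
├── Failed nodes and the cause of each error
└── The orchestrator's decision path (why tasks were assigned the way they were)
```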

Recommended tools: LangSmith (native support for multi-agent tracing), Langfuse (open-source alternative).
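
Before adopting a tracing product, it helps to see what a span record contains. The sketch below is a hand-rolled illustration, not the LangSmith or Langfuse API: a context manager records who ran, under which parent, for how long, and whether it failed.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Optional

TRACE: list[dict] = []  # in a real system this would go to a tracing backend

@contextmanager
def span(agent: str, task_id: str, parent_id: Optional[str] = None):
    record = {
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": parent_id,   # links the span into the call tree
        "agent": agent,
        "task_id": task_id,
        "status": "ok",
        "error": None,
    }
    start = time.perf_counter()
    try:
        yield record              # the caller attaches input, output, tokens, decisions
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        TRACE.append(record)

with span("orchestrator", "pr-review") as root:
    root["decision"] = "fan out to the security agent"
    with span("security", "pr-review", parent_id=root["span_id"]) as child:
        child["input"] = "diff text"
        child["output"] = "no findings"
        child["tokens"] = 120

for record in TRACE:
    print(record["agent"], record["status"], record["duration_ms"])
```

Spans are appended when they close, so children appear in `TRACE` before their parents; the `parent_id` links are what let a viewer rebuild the tree.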

7. Summary of Design Principles

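1. Single responsibility
   Each agent does one thing and does it well.
   Once an agent starts "moonlighting", it is time to split it.

2. Minimal communication
   The less information passed between agents, the better.
   Pass structured data (JSON) rather than unstructured text.

3. Fault isolation
   One agent failing should not bring down the whole system.
   The orchestrator must have an explicit degradation strategy.

4. Human intervention points
   For high-risk operations (deleting data, sending email, financial transactions),
   design in a "human confirmation" node; do not let the agents act fully autonomously.

5. Observability first
   Debugging a multi-agent system costs far more than debugging a single agent.
   Wire in tracing from day one rather than waiting for something to break.

6. Start simple
   Do not begin with an elaborate multi-agent topology.
   Validate the task with a single agent first, then identify which subtasks are worth splitting out.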
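
Principle 4 is the easiest to skip and among the simplest to enforce in code. A minimal sketch of a confirmation gate follows, assuming actions are identified by name and that `approve` is whatever mechanism asks a human (a CLI prompt, a chat message, a ticket). All names are illustrative.

```python
from typing import Callable

HIGH_RISK_ACTIONS = {"delete_data", "send_email", "transfer_funds"}  # assumed list

class ApprovalDenied(RuntimeError):
    pass

def execute_action(
    action: str,
    payload: dict,
    run: Callable[[str, dict], str],
    approve: Callable[[str, dict], bool],
) -> str:
    # High-risk actions never run unless a human has explicitly said yes
    if action in HIGH_RISK_ACTIONS and not approve(action, payload):
        raise ApprovalDenied(f"{action} was not approved")
    return run(action, payload)

def run_tool(action: str, payload: dict) -> str:
    return f"executed {action}"

def deny_all(action: str, payload: dict) -> bool:
    return False  # stand-in for a human who clicks "reject"

print(execute_action("summarize", {}, run_tool, deny_all))  # low risk: no approval needed
```

Keeping the gate in the execution path, rather than in a prompt, means an agent cannot talk its way past it.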

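Appendix: Technology Selection Cheat Sheet

Need	Recommended option
Quickly validating a multi-agent idea	CrewAI
Production-grade flow control	LangGraph
Conversation-driven collaboration	AutoGen
Inter-agent communication queue	Redis Streams
Shared state store	Redis / PostgreSQL
Multi-agent tracing	LangSmith / Langfuse
Loop prevention / timeout control	A hand-rolled LoopGuard + asyncio.wait_for
Cost monitoring	Helicone / LLMonitor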