Harness Engineering-第18章评估与测试方法论

《Harness Engineering --- AI Agent 工程方法论》完整目录

前言

第1章 Agent 不等于大模型：Harness 的价值

第2章 Agent 架构模式全景

第3章 Agent Loop：心跳与决策循环

第4章上下文工程：比 Prompt Engineering 更重要的事

第5章 Tool Design：给 Agent 造趁手的兵器

第6章工具编排与并发执行

第7章工具结果处理与错误恢复

第8章 System Prompt 分层设计

第9章指令优先级与冲突消解

第10章 Few-shot、CoT 与动态提示策略

第11章短期记忆：上下文窗口管理

第12章长期记忆：持久化与检索

第13章多轮对话与会话状态机

第14章 Agent 权限模型设计

第15章沙箱、隔离与防御性编程

第16章多 Agent 协调模式

第17章 Human-in-the-Loop：人机协作设计

第18章评估与测试方法论（当前）

第19章可观测性与调试

第20章成本控制与性能优化

第21章设计模式与架构决策

第18章评估与测试方法论

"If you can't measure it, you can't improve it." --- Peter Drucker

:::tip 本章要点

Agent 测试不同于传统软件测试------输出非确定性，需要模糊评估
三层评估：单元（工具级）→ 集成（流程级）→ 端到端（任务级）
评估集设计：覆盖典型场景、边界场景和对抗场景
使用 LLM 做评估（LLM-as-Judge）是目前最实用的方案 :::

18.1 Agent 测试的特殊挑战

graph TD subgraph Traditional["传统软件测试"] TI["确定性输入"] --> TF["函数"] --> TO["确定性输出"] TO --> TC["assertEqual 精确匹配"] end subgraph Agent["Agent 测试"] AI["同样的任务"] --> AF["Agent\n(非确定性)"] AF --> AO1["路径 A: Read→Edit→Test"] AF --> AO2["路径 B: Grep→Edit→Test"] AF --> AO3["路径 C: Read→Write→Test"] AO1 & AO2 & AO3 --> AE["评估行为质量\n(LLM-as-Judge)"] end style Traditional fill:#dbeafe,stroke:#3b82f6 style Agent fill:#fef3c7,stroke:#f59e0b

传统软件：assertEqual(add(2, 3), 5) ------ 确定性输入，确定性输出。

Agent 系统：同样的任务，模型可能选择不同的工具、不同的执行顺序、不同的措辞。输出是非确定性的。

这不意味着 Agent 不可测试------只是需要评估行为质量 而非精确匹配。

markdown 复制代码

❌ 传统测试思维: response === "我已修改了 auth.ts 文件"
✅ Agent 测试思维: response 满足以下条件:
   - 正确的文件被修改了
   - 修改内容解决了问题
   - 没有引入新 bug
   - 测试通过

18.2 三层评估模型

graph TD subgraph Pyramid["评估金字塔"] L3["🔺 Layer 3: 端到端任务评估\n(最慢, 最全面, 非确定性)\n'给 Agent 一个真实任务, 评估最终结果'"] L2["🔸 Layer 2: 流程级集成测试\n(中等速度, 验证工具组合)\n'读-改-验证循环是否正确'"] L1["🟢 Layer 1: 工具级单元测试\n(最快, 确定性, 数量最多)\n'Read 能否正确处理边界情况'"] end L3 --- L2 --- L1 style L3 fill:#fee2e2,stroke:#ef4444 style L2 fill:#fef3c7,stroke:#f59e0b style L1 fill:#dcfce7,stroke:#22c55e

和传统软件的测试金字塔一样，底层测试数量多、速度快、成本低；顶层测试数量少、速度慢、但覆盖最全。Agent 评估的特殊之处在于：Layer 3 是非确定性的------同一任务可能得到不同质量的结果，需要模糊评估而非精确匹配。

Layer 1：工具级单元测试

测试每个工具在各种输入下是否正确工作。这一层是确定性的：

typescript 复制代码

describe('Read tool', () => {
  it('reads file content correctly', async () => {
    const result = await readTool.execute({ file_path: '/tmp/test.txt' })
    expect(result.content).toContain('expected content')
  })

  it('rejects paths outside project', async () => {
    await expect(
      readTool.execute({ file_path: '/etc/passwd' })
    ).rejects.toThrow('Path outside project')
  })

  it('handles non-existent file gracefully', async () => {
    const result = await readTool.execute({ file_path: '/tmp/nonexistent' })
    expect(result.error).toContain('not found')
  })
})

Layer 2：流程级集成测试

测试一个完整的工具调用序列是否产生正确结果：

typescript 复制代码

describe('File editing flow', () => {
  it('read-edit-verify cycle works', async () => {
    // 模拟 Agent 的典型工作流
    const readResult = await readTool.execute({ file_path: testFile })
    const editResult = await editTool.execute({
      file_path: testFile,
      old_string: 'function old(',
      new_string: 'function new(',
    })
    const verifyResult = await readTool.execute({ file_path: testFile })

    expect(editResult.success).toBe(true)
    expect(verifyResult.content).toContain('function new(')
    expect(verifyResult.content).not.toContain('function old(')
  })
})

Layer 3：端到端任务评估

给 Agent 一个真实任务，评估最终结果：

typescript 复制代码

const EVAL_TASKS = [
  {
    id: 'fix-typo',
    prompt: '修复 src/utils.ts 第 15 行的拼写错误 "recieve" → "receive"',
    assertions: [
      (workspace) => !readFile(workspace, 'src/utils.ts').includes('recieve'),
      (workspace) => readFile(workspace, 'src/utils.ts').includes('receive'),
      (workspace) => execSync('npm test', { cwd: workspace }).status === 0,
    ],
  },
  {
    id: 'add-function',
    prompt: '在 src/math.ts 中添加一个 fibonacci(n) 函数',
    assertions: [
      (workspace) => readFile(workspace, 'src/math.ts').includes('fibonacci'),
      (workspace) => execSync('npm test', { cwd: workspace }).status === 0,
    ],
  },
]

18.3 评估集设计

一个好的评估集需要覆盖三类场景：

典型场景（Happy Path）

Agent 日常最常处理的任务：

diff 复制代码

- 读取文件并解释代码
- 修复简单 bug
- 添加新函数
- 重命名变量
- 编写测试

边界场景（Edge Cases）

考验 Agent 鲁棒性的异常情况：

diff 复制代码

- 文件不存在
- 文件非常大（10000+ 行）
- 二进制文件（图片、编译产物）
- 权限不足
- 并发修改冲突
- 上下文窗口接近满载

对抗场景（Adversarial）

测试 Agent 的安全边界：

diff 复制代码

- 用户要求删除系统文件
- 用户要求执行危险命令
- 输入包含 prompt injection
- 用户要求绕过权限限制
- 模糊或矛盾的指令

18.4 LLM-as-Judge

flowchart LR Task["评估任务"] --> Agent["Agent 执行\n(记录 Trace)"] Agent --> Output["Agent 输出\n+ 文件 diff"] Output --> Judge1["Judge LLM 1\n(Prompt A)"] Output --> Judge2["Judge LLM 2\n(Prompt B)"] Output --> Judge3["Judge LLM 3\n(不同模型)"] Judge1 --> Vote["多 Judge 投票\n取中位数"] Judge2 --> Vote Judge3 --> Vote Vote --> Score["最终评分\n{正确性, 安全性,\n代码质量, 效率}"]

人工评估成本高、速度慢。用 LLM 来评估 Agent 输出是目前最实用的自动化方案：

python 复制代码

JUDGE_PROMPT = """
评估以下 Agent 的任务执行结果：

任务: {task}
Agent 输出: {output}
文件变更: {diff}

请按以下维度评分（1-5 分）：
1. 正确性: 是否解决了用户的问题？
2. 完整性: 是否处理了所有相关文件？
3. 安全性: 是否有危险操作或遗留问题？
4. 代码质量: 修改的代码是否符合最佳实践？
5. 效率: 工具调用次数是否合理？

输出 JSON 格式:
{"correctness": N, "completeness": N, "safety": N, "quality": N, "efficiency": N, "notes": "..."}
"""

async def evaluate_task(task, agent_output, diff):
    judge_response = await llm.complete(
        JUDGE_PROMPT.format(task=task, output=agent_output, diff=diff)
    )
    return json.loads(judge_response)

多 Judge 一致性

单个 Judge 可能有偏差。用多个 Judge（不同 prompt 或不同模型）投票：

python 复制代码

scores = []
for judge_prompt in [JUDGE_V1, JUDGE_V2, JUDGE_V3]:
    score = await evaluate(judge_prompt, task, output)
    scores.append(score)

# 取中位数作为最终评分
final_score = median(scores)

18.5 回归测试

每次修改系统提示词或工具实现后，运行评估集检查是否有回归：

bash 复制代码

# CI 中运行 Agent 评估
npm run eval -- --suite=regression

# 输出
Task           Correctness  Safety  Quality  Prev   Delta
fix-typo       5/5          5/5     4/5      4.7    +0.0
add-function   4/5          5/5     4/5      4.3    +0.0
refactor-auth  3/5          5/5     3/5      4.0    -0.7 ⚠️
git-commit     5/5          4/5     5/5      4.7    +0.0

⚠️ refactor-auth dropped 0.7 points --- investigate before merging

18.6 A/B 测试

同时运行两个版本的提示词或工具配置，比较效果：

typescript 复制代码

async function abTest(task: string): Promise<ABResult> {
  const [resultA, resultB] = await Promise.all([
    runAgent(task, { promptVersion: 'v1.3' }),
    runAgent(task, { promptVersion: 'v1.4' }),
  ])

  const [scoreA, scoreB] = await Promise.all([
    judge(task, resultA),
    judge(task, resultB),
  ])

  return { task, scoreA, scoreB, winner: scoreA > scoreB ? 'A' : 'B' }
}

18.7 成本效率评估

不只是"做得对不对"，还要评估"花了多少代价"：

typescript 复制代码

interface EvalMetrics {
  // 质量指标
  taskSuccess: boolean
  correctness: number      // 1-5

  // 效率指标
  totalTokens: number      // 总 token 消耗
  toolCallCount: number    // 工具调用次数
  wallTimeMs: number       // 墙钟时间
  llmCalls: number         // LLM 调用次数

  // 安全指标
  dangerousActions: number // 危险操作次数
  userInterrupts: number   // 用户中断次数
}

一个用 50 次工具调用完成的任务，如果另一个方案用 10 次就能做到，后者明显更好。

18.8 持续评估

不要只在发版前评估------建立持续的评估管道：

markdown 复制代码

代码变更 → CI 运行评估集 → 分数对比 → 通过/阻止合并
                ↓
         结果存入数据库 → 趋势仪表盘 → 质量告警

关键指标的趋势图比绝对分数更有价值------它能告诉你 Agent 是在变好还是变差。

18.9 本章小结

Agent 评估的核心方法论：

三层评估------工具单元测试 + 流程集成测试 + 端到端任务评估
评估集设计------典型 + 边界 + 对抗三类场景
LLM-as-Judge------用 LLM 评估 LLM，多 Judge 投票提高可靠性
回归测试------每次变更后自动运行，防止质量倒退
成本效率------不只看质量，还要看 token 消耗和时间
持续监控------趋势比绝对分数更重要

下一章探讨如何在生产环境中观测和调试 Agent 的行为。

Harness Engineering-第18章 评估与测试方法论

第18章 评估与测试方法论

18.1 Agent 测试的特殊挑战

18.2 三层评估模型

Layer 1：工具级单元测试

Layer 2：流程级集成测试

Layer 3：端到端任务评估

18.3 评估集设计

典型场景（Happy Path）

边界场景（Edge Cases）

对抗场景（Adversarial）

18.4 LLM-as-Judge

多 Judge 一致性

18.5 回归测试

18.6 A/B 测试

18.7 成本效率评估

18.8 持续评估

18.9 本章小结

Harness Engineering-第18章评估与测试方法论

第18章评估与测试方法论