Agent Series (22): Context Engineering Deep Dive, a Quantitative Comparison of Three Context Management Strategies

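The linear cost of context

An agent is not a stateless API call: it has to remember the conversation history. Every turn accumulates in the context window, and each call resends all of it: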

Turn 1:   1K tokens   ← cheap
Turn 10:  5K tokens   ← still fine
Turn 50:  25K tokens  ← starting to get expensive
Turn 100: 50K tokens  ← every call re-feeds the entire history

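This is not a theoretical problem. In a 30-turn project discussion, the full history is ~2,500 tokens; after 100 turns the figure is ~8,000 tokens, and it keeps growing linearly.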
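
The extrapolation above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every turn adds the same number of tokens (a simplification, not something the benchmark measured) and that no prompt caching is in play. Under those assumptions, per-call context grows linearly while the total input sent over a whole conversation grows quadratically, because each call resends the history.

```python
# Figure from the article: a 30-turn history is about 2,500 tokens.
TOKENS_AT_30_TURNS = 2_500
# Assumption: every turn contributes the same number of tokens.
TOKENS_PER_TURN = TOKENS_AT_30_TURNS / 30


def context_tokens(turn: int) -> float:
    """Tokens sent on call number `turn` when the full history is replayed."""
    return TOKENS_PER_TURN * turn


def cumulative_tokens(turns: int) -> float:
    """Total input tokens sent over `turns` calls (the sum over 1..turns)."""
    return TOKENS_PER_TURN * turns * (turns + 1) / 2


for n in (30, 100):
    print(
        f"turn {n}: ~{context_tokens(n):,.0f} tokens per call, "
        f"~{cumulative_tokens(n):,.0f} tokens cumulative"
    )
```

With these assumptions, turn 100 comes out slightly above 8,000 tokens per call, in line with the figure quoted above.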

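Three common strategies for dealing with this:

Strategy	Approach	Intuitive cost
Naive	Pass the full history	Expensive, but accurate
Sliding Window	Keep only the last N messages	Cheap, but may lose information
Rolling Summary	LLM compresses old messages + keeps recent ones	Balanced?

This article uses a real benchmark to check whether those intuitive costs hold up.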

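Demo design

Building the conversation

A 30-turn project discussion covering 30 technical decisions: database selection, cache configuration, migration ownership, deployment platform, CI/CD, authentication, and so on. The key design choice: the important decisions are deliberately placed in turns 1-4 (the very beginning), so everything after them counts as "recent content". This forces context loss to show up.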

The three strategy implementations

Strategy 1: Naive (baseline)

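import time
from langchain_core.messages import HumanMessage, SystemMessage

# llm, SYSTEM_PROMPT, StrategyResult, count_messages_tokens, recall_score: defined elsewhere in the demo.
def run_naive(history: list, query: str, keywords: list[str]) -> StrategyResult:
    # Send the system prompt, the entire history, and the new query.
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + history + [HumanMessage(content=query)]
    tokens = count_messages_tokens(msgs)
    t0 = time.time()  # start timing just before the LLM call
    text = str(llm.invoke(msgs).content)
    return StrategyResult(text, tokens, time.time() - t0, recall_score(text, keywords))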
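
The snippet above relies on a StrategyResult container and a count_messages_tokens helper that the article does not show. Since the benchmark output reports "estimated tokens", a character-based heuristic is one plausible shape for the helper. The sketch below is illustrative only: the field names, the estimate_tokens name, and the 4-characters-per-token ratio are all assumptions, and a real tokenizer would be more accurate.

```python
from dataclasses import dataclass


@dataclass
class StrategyResult:
    """Hypothetical result container; field order mirrors the call in run_naive."""
    text: str
    tokens: int
    latency_s: float
    recall: float


def estimate_tokens(texts: list[str], chars_per_token: float = 4.0) -> int:
    """Rough token estimate: total characters divided by an assumed ratio."""
    total_chars = sum(len(t) for t in texts)
    return round(total_chars / chars_per_token)


# Usage: two messages of 400 and 200 characters.
result = StrategyResult(
    text="We chose PostgreSQL.",
    tokens=estimate_tokens(["a" * 400, "b" * 200]),
    latency_s=0.5,
    recall=1.0,
)
print(result.tokens)
```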

Strategy 2: Sliding Window (truncation)

def run_sliding_window(
    history: list, query: str, keywords: list[str], window: int = 12
) -> StrategyResult:
    recent = history[-window:]   # keep only the last `window` messages (12 by default)
    msgs = [SystemMessage(content=SYSTEM_PROMPT)] + recent + [HumanMessage(content=query)]
    ...

Strategy 3: Rolling Summary

def summarize(messages: list) -> str:
    """Compress a block of old messages into a bullet-point list."""
    text = "\n".join(
        f"{'User' if isinstance(m, HumanMessage) else 'Assistant'}: {m.content}"
        for m in messages
    )
    prompt = (
        "Compress the following project discussion into concise bullet points.\n"
        "Preserve: every decision made, owner names, technical choices, exact numbers.\n"
        "Remove: conversational filler, redundancy.\n\n"
        f"Conversation:\n{text}\n\n"
        "Bullet-point summary:"
    )
    return str(llm.invoke([HumanMessage(content=prompt)]).content)


def run_rolling_summary(
    history: list, query: str, keywords: list[str],
    recent_window: int = 8, cached_summary: str | None = None,
) -> tuple[StrategyResult, str]:
    old = history[:-recent_window]
    recent = history[-recent_window:]
    summary = cached_summary if cached_summary is not None else summarize(old)

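    # Inject the summary into the system prompt (named system_prompt to avoid shadowing the sys module)
    system_prompt = SYSTEM_PROMPT + f"\n\n## Earlier Meeting Notes (Summary)\n{summary}"
    msgs = [SystemMessage(content=system_prompt)] + recent + [HumanMessage(content=query)]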
    ...

Key design choice: the cached_summary parameter lets the summary be built once and reused by the 4 test queries, so the 38.2s build cost is paid only once.

Test queries (all targeting the earliest decisions)

Query 1: What database did we choose? Who owns it, and why?
         Keywords: postgresql / timescaledb / david / acid / time-series

Query 2: What's our caching technology and TTL configuration?
         Keywords: redis / cluster / 1 hour / 5 minute / 16

Query 3: Who is responsible for database migrations, and what approvals are needed?
         Keywords: sarah / backend lead / 2 / senior / flyway

Query 4: What deployment platform and cluster configuration did we decide on?
         Keywords: kubernetes / eks / helm / argocd / 3-node

Recall = keywords hit / total keywords (simple but deterministic, with no dependence on an extra LLM grader).
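
The demo's recall_score is not shown in the article, so the following is only a minimal sketch of the formula above. It assumes case-insensitive substring matching. One caveat of that choice is that very short keywords such as "2" or "16" can match incidentally inside unrelated numbers.

```python
def recall_score(text: str, keywords: list[str]) -> float:
    """Fraction of keywords that occur in text (case-insensitive substring match)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits / len(keywords)


# Hypothetical answer, scored against the Query 1 keywords.
answer = "We picked PostgreSQL with the TimescaleDB extension; David owns it."
keywords = ["postgresql", "timescaledb", "david", "acid", "time-series"]
print(recall_score(answer, keywords))  # 3 of 5 keywords hit -> 0.6
```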

Results

History: 30 turns  |  Full context: ~2,485 estimated tokens
Rolling summary build time: 38.2s (one-time, cached for all queries)

Per-query recall

Query                         Naive   Sliding   Rolling
──────────────────────────────────────────────────────
DB decision (turn 1)           100%        0%        0%
Cache config (turn 2)           60%       40%       80%
Migration ownership (turn 3)    80%       20%       60%
Deployment platform (turn 4)    80%       20%       60%

Aggregate metrics

Strategy                  Avg Tokens   Avg Latency   Avg Recall
────────────────────────────────────────────────────────────────
Naive (full history)           2,513          9.6s        80%
Sliding Window (last 12)         604         17.4s        20%
Rolling Summary                1,289          8.5s        50%

Token reduction vs Naive:
  Sliding Window: -76%
  Rolling Summary: -49%

Key insights:
  Highest recall:       Naive (full history)
  Most token-efficient: Sliding Window (last 12)
  Best quality/cost:    Rolling Summary

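Three counterintuitive findings

Finding 1: Truncation costs far more than expected

Sliding Window saved 76% of the tokens, but average recall fell from 80% to 20%.

The direction is no surprise, since after 30 turns the decisions from turns 1-4 fall completely outside the truncation window, but the magnitude is sobering. Query 1 (the database decision) scored 0%: not one of the 5 keywords was hit. The model does not "vaguely remember" the decision; it has no idea the decision exists.

Takeaway: Sliding Window suits short-lived tasks that do not depend on earlier context; for scenarios that need to trace back to early decisions, it is a disaster.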
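
The mechanics behind that 0% are easy to reproduce without an LLM. The sketch below uses a hypothetical stand-in for the demo history, assuming one message per turn, and applies the same history[-window:] slice to show which turns survive.

```python
# Hypothetical stand-in for the 30-message demo history; index 0 is the earliest message.
history = [{"turn": i + 1, "content": f"message {i + 1}"} for i in range(30)]

window = 12
recent = history[-window:]  # the same slice run_sliding_window uses

kept_turns = [m["turn"] for m in recent]
early_decisions = {1, 2, 3, 4}  # turns where the key decisions were made

surviving = early_decisions & set(kept_turns)
print(f"kept turns {kept_turns[0]}-{kept_turns[-1]}, early decisions surviving: {len(surviving)}")
```

With a window of 12 over 30 messages, only turns 19-30 remain, so none of the early decisions reach the model at all.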

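Finding 2: The summary is occasionally more accurate than the original

Query 2 (cache configuration): Rolling Summary 80% > Naive 60%.

Why would a summary beat the full history? A likely explanation: in the raw conversation the cache discussion is spread across several turns, and the model has to "find" the relevant parts in 2,500 tokens; the summary compresses every decision into one structured list, so the key information is denser and easier for the model to extract.

This exposes a hidden problem with Naive: the longer the context, the sparser the signal and the greater the noise, and the model may fail to "find" an answer that is right in front of it.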

Finding 3: Compression loss is a real bug

Rolling Summary scored 0% recall on Query 1, even though the summary clearly contains:

- Database: PostgreSQL with TimescaleDB extension (David, DB Lead)

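The keywords postgresql, timescaledb, and david are all in the summary, yet they did not appear in the model's answer. Reproducing and investigating showed that the model did answer the "database selection" question but never explicitly mentioned "ACID compliance" or "time-series". Those two terms are the technical rationale from the original discussion; the summary kept only the final decision, not the reasons for it.

This is the inherent cost of compression: the summary keeps "what was done" and loses "why it was done". For queries that need the reasoning (not just the facts), the loss can be large.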
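
One way to act on this is to ask the summarizer to keep the rationale explicitly. The sketch below is a variant of the summarize prompt shown earlier; build_summary_prompt and its keep_rationale flag are hypothetical names, and whether the extra instruction actually raises recall would need to be confirmed by re-running the benchmark.

```python
def build_summary_prompt(conversation: str, keep_rationale: bool = True) -> str:
    """Build the compression prompt, optionally asking to keep the reason for each decision."""
    preserve = "every decision made, owner names, technical choices, exact numbers"
    if keep_rationale:
        # Assumption: naming the rationale explicitly makes it survive compression.
        preserve += ", and the stated reason for each decision"
    return (
        "Compress the following project discussion into concise bullet points.\n"
        f"Preserve: {preserve}.\n"
        "Remove: conversational filler, redundancy.\n\n"
        f"Conversation:\n{conversation}\n\n"
        "Bullet-point summary:"
    )


prompt = build_summary_prompt("User: Which database?\nAssistant: PostgreSQL, for ACID compliance.")
print(prompt)
```

The trade-off is a longer summary: keeping reasons lowers the compression ratio, so the token savings shrink somewhat.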

When to choose which strategy

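Task type                                                Recommended strategy
──────────────────────────────────────────────────────────────────────────────────────────────
Short-lived, stateless, independent tasks                Naive (the history is short anyway)
Long conversations where only recent turns matter        Sliding Window (saves tokens)
Long conversations that trace back to early decisions    Rolling Summary (balanced)
Exact recovery of the reasons behind early decisions     Naive (or Rolling + a rationale field)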

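Production tuning notes for Rolling Summary:

Summary granularity: trigger once every 20-30 turns rather than compressing often (compressing on every turn essentially costs as much as Naive, plus extra latency)
The summary prompt must preserve numbers and names: the line "Preserve: every decision, owner names, exact numbers" is the key
Two-layer structure: summary (old) + recent; do not mix the summary and the recent messages together and re-compress them
Lazy construction: build the summary only when it is first needed, then cache it; do not re-trigger it on every turn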
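
The points above can be combined into a small memory class. This is a sketch of one possible design, not the demo's implementation: RollingSummaryMemory, its parameters, and the fake summarizer are hypothetical. Each compression summarizes only the raw messages being evicted and stores the result as a separate chunk, so an existing summary is never re-compressed.

```python
from typing import Callable


class RollingSummaryMemory:
    """Two-layer memory: cached summary chunks for old messages plus recent raw messages."""

    def __init__(
        self,
        summarize: Callable[[list[str]], str],
        trigger: int = 24,
        recent_window: int = 8,
    ) -> None:
        if not 0 < recent_window < trigger:
            raise ValueError("need 0 < recent_window < trigger")
        self._summarize = summarize
        self._trigger = trigger
        self._recent_window = recent_window
        self._chunks: list[str] = []  # one summary per compression, never re-summarized
        self.recent: list[str] = []   # raw messages not yet compressed

    def add(self, message: str) -> None:
        self.recent.append(message)
        # Compress only once the raw buffer exceeds the trigger threshold.
        if len(self.recent) > self._trigger:
            old = self.recent[: -self._recent_window]
            self.recent = self.recent[-self._recent_window :]
            self._chunks.append(self._summarize(old))

    @property
    def summary(self) -> str:
        return "\n".join(self._chunks)


calls: list[int] = []


def fake_summarize(messages: list[str]) -> str:
    """Stand-in for the LLM call; records how many messages it was given."""
    calls.append(len(messages))
    return f"- summary of {len(messages)} messages"


memory = RollingSummaryMemory(fake_summarize)
for i in range(30):
    memory.add(f"message {i + 1}")

print(calls, len(memory.recent))
```

In this run the summarizer is invoked once over 30 messages, when the buffer first exceeds 24, rather than on every turn. Passing the summarizer in as a callable keeps the class testable without a live model.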

Actual Rolling Summary output

This is the summary the demo built from 22 turns of conversation, showing the compression ratio and how much information survives:

- Database: PostgreSQL with TimescaleDB extension (David, DB Lead)
- Caching: Redis Cluster, TTL: 1h for sessions, 5m for dashboards, 16GB max memory
- Database Migrations: Sarah (Backend Lead), Flyway, 2 senior-engineer approvals
- Deployment: Kubernetes on AWS EKS, Helm charts, 3-node prod, 1-node staging, ArgoCD
- API Versioning: URL path versioning, 2 major versions, 6m deprecation notice
- Authentication: JWT tokens, 24h TTL for users, 1h for admins, 30-day Redis refresh TTL
- Logging: Structured JSON, Fluentd → ELK, 30-day hot, 1-year cold in S3 Glacier
- Rate Limiting: Token bucket, 100 req/min standard, 1000 req/min premium, Redis
- CI/CD: GitHub Actions + ArgoCD, blue-green, 5m health check
- Internal Services: REST external, gRPC + Protocol Buffers internal
  ... (22 items in total; original ~1,800 tokens → ~600 tokens after compression, a 3:1 ratio)

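Design checklist

Strategy selection

Determine how far back the task needs to trace context
Sliding Window fits scenarios where recent information is enough
Rolling Summary fits scenarios that need to trace back but do not need the full original text
Naive fits scenarios where the history is short anyway or the complete reasoning chain is required

Rolling Summary implementation

The prompt explicitly asks to preserve decisions, owners, numbers, and technical choices
Keep the summary separate from the recent messages (do not mix it into the messages list)
Cache the summary once it is built; do not call the summarizer again
Trigger threshold: compress only once there are more than N messages (N of 20-30 suggested)

Recall validation

Include test queries that target early turns (do not test recent turns only)
Keywords should include the technical rationale, not just the conclusion ("acid", not only "postgresql")
Before shipping the chosen strategy, validate it quantitatively on a small sample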

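Summary

Three conclusions:

Truncation is a double-edged sword: Sliding Window saves 76% of the tokens, but average recall on early decisions drops to 20%, and to 0% on the database query; without a benchmark it is hard to find where that tipping point lies
Compression loses the "why", not the "what": the facts of a decision are preserved, but the technical rationale disappears during compression; when reasons need to be traced, either use Naive or explicitly ask the summary prompt to keep them
The Rolling Summary build is a one-time cost: the 38s build time looks scary, but the summary is built offline and then cached, and at run time it adds only ~600 tokens per call instead of the ~2,500-token full history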

References

LangChain Message History documentation
Anthropic Context Caching documentation
Full demo code for this series: agent-21-context-engineering-deep
