Prompt Caching in Practice: Cutting LLM Call Costs by 80%

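From a 7% cache hit rate to 84%: that is the real number ProjectDiscovery reached over three months. Each task on their Neo security testing platform runs through 26 steps and 40 tool calls, and the system prompt packs in 2,500 lines of YAML, more than 20K tokens. Before they took caching seriously, the daily LLM bill was a steadily climbing curve.

After the optimization, they had served a cumulative 9.8 billion tokens from cache and cut costs by 59%.

This article does not cover "what is Prompt Caching"; five minutes of searching will answer that. What I want to cover is why your cache hit rate is stuck at around 20%, and how to push it past 70%.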

I. The Logic Behind the KV Cache

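Before starting, let's pin down why the prefix must be exactly identical to get a hit; otherwise every engineering decision later on will look arbitrary.

When a Transformer computes attention, it produces a pair of key and value tensors for each token. That computation is heavy: it accounts for the bulk of LLM inference latency and cost. The core idea of the KV Cache is that if the prefix of the next request is exactly the same as the previous one, the KV tensors already computed can be reused directly instead of being recomputed.

Reuse = less time + less money.

The key phrase is "exactly the same". Not semantically similar, but an exact match at the byte level. One extra space or one different timestamp, and everything from that point on is a miss.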
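To make "exact match" concrete, here is a minimal sketch that measures how many leading bytes two prompts share. Providers actually match on token sequences and do not publish their lookup internals, so comparing raw bytes is only a conservative proxy, and shared_prefix_bytes is a name made up for this illustration.

```python
def shared_prefix_bytes(a: str, b: str) -> int:
    """Return how many leading UTF-8 bytes the two prompts have in common."""
    raw_a, raw_b = a.encode("utf-8"), b.encode("utf-8")
    count = 0
    for x, y in zip(raw_a, raw_b):
        if x != y:
            break
        count += 1
    return count

prompt_1 = "You are a code review expert. Check style, bugs, and tests."
prompt_2 = "You are a code review expert.  Check style, bugs, and tests."  # one extra space

shared = shared_prefix_bytes(prompt_1, prompt_2)
# Everything after the first differing byte can no longer be reused.
print(f"reusable: {shared} of {len(prompt_1.encode('utf-8'))} bytes")
```

Running the same check on two consecutive production prompts is a quick way to find the exact position where a prefix stops matching.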

II. How the Two Major Providers Implement It

Anthropic Claude: manual marking

Claude uses explicit markers. You add a cache_control field to a content block:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code review expert...",
            # Marks the end of the prefix that should be cached
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "Please review this code"}
    ]
)

# The usage object reports cache writes and cache reads separately
usage = response.usage
print(f"Written to cache: {usage.cache_creation_input_tokens}")
print(f"Read from cache: {usage.cache_read_input_tokens}")

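Hard constraints: a minimum of 1,024 tokens, at most 4 breakpoints, and a default TTL of 5 minutes that is refreshed on every hit (a 1-hour TTL is available as an explicit option at a higher write price).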

Pricing (Claude Sonnet 4.5):

Type	Price (per million tokens)
Standard input	$3.00
Cache write (first request)	$3.75 (+25%)
Cache read (hit)	$0.30 (-90%)

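Break-even comes at about 0.28 cache hits: the write premium is $0.75 per million tokens, and every hit saves $2.70, so the first hit already repays it. As long as a prompt prefix is used at least twice within the TTL, caching pays off.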
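The break-even can be checked directly from the prices in the table. This is a small worked calculation, assuming every request after the first one hits the cache; the function names are illustrative.

```python
STANDARD = 3.00     # $ per million input tokens, uncached
CACHE_WRITE = 3.75  # $ per million tokens on the request that writes the cache
CACHE_READ = 0.30   # $ per million tokens on each later hit

def cost_without_cache(uses: int) -> float:
    """Cost of sending the same 1M-token prefix `uses` times with no caching."""
    return STANDARD * uses

def cost_with_cache(uses: int) -> float:
    """One cache write followed by (uses - 1) cache reads."""
    if uses <= 0:
        return 0.0
    return CACHE_WRITE + CACHE_READ * (uses - 1)

def break_even_hits() -> float:
    """Write premium divided by the saving per hit."""
    return (CACHE_WRITE - STANDARD) / (STANDARD - CACHE_READ)

for uses in (1, 2, 10):
    print(uses, cost_without_cache(uses), round(cost_with_cache(uses), 2))
print(f"break-even after {break_even_hits():.2f} hits")
```

A prefix that is sent only once costs more with caching than without, which is why one-off prompts should not carry a breakpoint.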

OpenAI: automatic mode

OpenAI requires no code changes; it automatically caches the prefix of any prompt that is 1,024 tokens or longer:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a code review expert... (a very long system prompt)"},
        {"role": "user", "content": "Review this code..."}
    ]
)

# cached_tokens reports how many prompt tokens were served from the cache
usage = response.usage
details = usage.prompt_tokens_details
cached = (details.cached_tokens or 0) if details else 0  # guard against missing details
total_prompt = usage.prompt_tokens
hit_rate = cached / total_prompt if total_prompt > 0 else 0
print(f"Cache hit rate: {hit_rate:.1%}")

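Cache hits come in 128-token increments, and the TTL is roughly 5 to 10 minutes of inactivity. Pricing: on gpt-4o, hits get a 50% discount, and there is no write premium.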
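As a rough sketch of what the 128-token granularity means for the cached_tokens figure: assume a shared prefix must reach the 1,024-token minimum and that hits are rounded down to a multiple of 128. The function name and constants are illustrative, and the result is only an upper bound, since real hits also depend on routing and eviction.

```python
MIN_CACHEABLE = 1024  # prompts shorter than this are not cached
INCREMENT = 128       # cached token counts grow in steps of this size

def max_cached_tokens(shared_prefix_tokens: int) -> int:
    """Upper bound on cached tokens for a given shared prefix length."""
    if shared_prefix_tokens < MIN_CACHEABLE:
        return 0
    return (shared_prefix_tokens // INCREMENT) * INCREMENT

for prefix_len in (1000, 1024, 1100, 1300, 2500):
    print(prefix_len, max_cached_tokens(prefix_len))
```

In practice this means padding a 1,000-token system prompt past the threshold can matter more than trimming a 3,000-token one.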

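III. Prompt Structure Is Cache Architecture

There is only one core principle: the more static the content, the earlier it goes; the more dynamic the content, the later it goes.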

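┌────────────────────────────────────────────┐
│  System prompt (shared by all requests)    │  ← most stable, goes first, cache breakpoint
├────────────────────────────────────────────┤
│  Tool definitions (when the set is fixed)  │  ← next most stable, cache breakpoint
├────────────────────────────────────────────┤
│  Retrieved documents / context material    │  ← shared within a conversation
├────────────────────────────────────────────┤
│  Conversation history (grows over time)    │  ← grows dynamically
├────────────────────────────────────────────┤
│  Current user message                      │  ← different every time, goes last
└────────────────────────────────────────────┘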

The five most expensive habits

1. Injecting a timestamp into the system prompt

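from datetime import datetime

question = "..."  # placeholder for the user's question

# ❌ The prefix differs on every call, so it always misses
system = f"Current time: {datetime.now()}. You are an assistant..."

# ✅ Put the time in the user message
system = "You are an assistant..."
user = f"[Current time: {datetime.now()}]\nUser question: {question}"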

2. Injecting a user ID or request ID

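user_id, user_name, question = "u-123", "Alice", "..."  # placeholder values

# ❌ Every user gets a different prefix
system = f"User ID: {user_id}. You are a dedicated assistant..."

# ✅ Move user information to the back
system = "You are a dedicated assistant..."
user = f"[User: {user_name}]\n{question}"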

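3. Randomizing the order of few-shot examples: fix the order instead; sort by quality once and then leave it unchanged.

4. Regenerating tool definitions on every request: serialize them with sort_keys=True, cache the resulting string, and reuse it directly.

5. Putting user configuration at the top of the system prompt: static parts go first, user configuration goes after them.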
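Habit 4 is the easiest one to fix mechanically. The sketch below shows one way to get a byte-stable string out of tool definitions, using json.dumps with sort_keys=True as suggested above. The tool entries are hypothetical, and sorting the list by tool name is an extra assumption of this sketch rather than something the original recommends.

```python
import json

def canonical_tools_json(tool_definitions: list[dict]) -> str:
    """Serialize tool definitions so equal content always yields identical text."""
    # Sorting the list by name is an assumption; drop it if tool order matters to you.
    ordered = sorted(tool_definitions, key=lambda tool: tool["name"])
    return json.dumps(ordered, ensure_ascii=False, sort_keys=True)

# The same two hypothetical tools, built in a different order with different key order.
tools_a = [
    {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object"}},
    {"name": "list_dir", "description": "List a directory", "input_schema": {"type": "object"}},
]
tools_b = [
    {"input_schema": {"type": "object"}, "description": "List a directory", "name": "list_dir"},
    {"input_schema": {"type": "object"}, "name": "read_file", "description": "Read a file"},
]

# Build the string once at startup and reuse it on every request.
TOOLS_JSON = canonical_tools_json(tools_a)
print(TOOLS_JSON == canonical_tools_json(tools_b))   # canonical form is identical
print(json.dumps(tools_a) == json.dumps(tools_b))    # naive serialization is not
```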

IV. A Three-Breakpoint Architecture for Agent Systems

This is the core technique ProjectDiscovery used to push the hit rate from under 20% to 84%.

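import json

def build_agent_messages_v2(system_prompt, tool_definitions,
                             conversation_history, working_memory, user_message):
    """Three-breakpoint layout: maximize the cache hit rate."""

    # Breakpoint 1: static system prompt (most stable; default TTL is 5 minutes, refreshed on each hit)
    system = [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ]

    # Breakpoint 2: tool definitions (very stable as long as the tool set does not change)
    tool_defs_text = json.dumps(tool_definitions, ensure_ascii=False, sort_keys=True)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<tools>\n{tool_defs_text}\n</tools>",
                    "cache_control": {"type": "ephemeral"}  # breakpoint 2
                },
                {"type": "text", "text": "[conversation start]"}
            ]
        },
        {"role": "assistant", "content": "OK, I'm ready."}
    ]

    # Conversation history: breakpoint 3 sits on the most recent history message
    if conversation_history:
        messages.extend(conversation_history[:-1])
        last_hist = conversation_history[-1].copy()
        content = last_hist.get("content")
        if isinstance(content, str):
            last_hist["content"] = [{
                "type": "text",
                "text": content,
                "cache_control": {"type": "ephemeral"}  # breakpoint 3
            }]
        elif isinstance(content, list) and content and isinstance(content[-1], dict):
            # Content is already a list of blocks (e.g. tool results): mark a copy of the last block
            marked = {**content[-1], "cache_control": {"type": "ephemeral"}}  # breakpoint 3
            last_hist["content"] = content[:-1] + [marked]
        messages.append(last_hist)

    # Working memory + current user message (dynamic, no breakpoint, placed last)
    current_content = ""
    if working_memory:
        current_content += f"<working_memory>\n{working_memory}\n</working_memory>\n\n"
    current_content += user_message
    messages.append({"role": "user", "content": current_content})

    return system, messages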

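Relocation trick: move the working memory out of the tail of the system prompt and into the last user message.

Working memory changes at every step. If it sits at the tail of the system prompt, it invalidates the cache for the entire 20K-token system prompt on every step. Once it moves into the user message, the system prompt cache hits reliably.

That single change took the hit rate from under 20% to about 74%.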
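The effect of the relocation can be illustrated with a toy model. The sketch assumes a breakpoint is looked up by hashing the exact text that ends at it; this is a simplification for illustration, not the provider's actual mechanism, and all names and sample strings here are made up.

```python
import hashlib

def block_key(text: str) -> str:
    """Toy cache key: a hash of the exact text that ends at a breakpoint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def count_system_hits(system_texts: list[str]) -> int:
    """Count the steps whose system block was already seen in an earlier step."""
    seen: set[str] = set()
    hits = 0
    for text in system_texts:
        key = block_key(text)
        if key in seen:
            hits += 1
        seen.add(key)
    return hits

STATIC_PROMPT = "You are a security testing agent. Follow the playbook.\n" * 100
memories = [f"step {i}: {i} hosts scanned" for i in range(5)]

# Before: working memory is appended to the system prompt, so the block changes every step.
before = [f"{STATIC_PROMPT}<working_memory>{m}</working_memory>" for m in memories]
# After: working memory lives in the user message, so the system block never changes.
after = [STATIC_PROMPT for _ in memories]

print(count_system_hits(before))  # 0 hits over 5 steps
print(count_system_hits(after))   # 4 hits over 5 steps
```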

V. Concurrency Pitfalls and Cold Starts

Concurrent race condition

T=0ms: request A → cache miss, write begins
T=2ms: request B → cache is still being written, miss
T=5ms: request C → hit!

Warming the cache at service startup:

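# client is expected to be an anthropic.AsyncAnthropic() instance, since the call is awaited
async def warm_up_cache(client, system_prompt, tool_definitions):
    """Send a 1-token request at startup so the system prompt prefix is already cached."""
    # tool_definitions is not sent here, so only the system prompt breakpoint is warmed
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1,
        system=[{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": "ping"}]
    )
    print(f"Warm-up complete, wrote {response.usage.cache_creation_input_tokens} tokens to cache")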
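The same idea applies to a burst of parallel requests at runtime: let one request finish first so it can write the cache, then fan out the rest. The sketch below shows that pattern with asyncio. The send callable is a hypothetical stand-in for whatever coroutine performs the real API call, and fake_send exists only so the example runs without network access.

```python
import asyncio

async def warm_then_fan_out(send, prompts):
    """Run the first request alone so it can write the cache, then run the rest concurrently."""
    if not prompts:
        return []
    first = await send(prompts[0])
    rest = await asyncio.gather(*(send(p) for p in prompts[1:]))
    return [first, *rest]

# Hypothetical stand-in for a real API call; it records the order in which requests start.
started = []

async def fake_send(prompt):
    started.append(prompt)
    await asyncio.sleep(0.01)
    return f"done:{prompt}"

results = asyncio.run(warm_then_fan_out(fake_send, ["a", "b", "c"]))
print(results)
```

The trade-off is one request's worth of extra latency at the start of each burst in exchange for the remaining requests reading from cache instead of each paying the write price.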

VI. Monitoring Metrics

from dataclasses import dataclass

@dataclass
class CacheMetrics:
    cache_creation_tokens: int = 0
    cache_read_tokens: int = 0
    regular_input_tokens: int = 0
    request_count: int = 0

    def record(self, usage):
        """Accumulate the token counters from one response's usage object."""
        self.request_count += 1
        # "or 0" covers fields that exist on the usage object but are None
        self.cache_creation_tokens += getattr(usage, 'cache_creation_input_tokens', 0) or 0
        self.cache_read_tokens += getattr(usage, 'cache_read_input_tokens', 0) or 0
        self.regular_input_tokens += getattr(usage, 'input_tokens', 0) or 0

    @property
    def hit_rate(self) -> float:
        """Share of all input tokens that were served from the cache."""
        total = self.cache_creation_tokens + self.cache_read_tokens + self.regular_input_tokens
        return self.cache_read_tokens / total if total > 0 else 0

    def report(self):
        print(f"Hit rate: {self.hit_rate:.1%}")
        if self.request_count > 100 and self.hit_rate < 0.5:
            print("⚠️  Hit rate is below 50%; check the prompt structure")
        elif self.hit_rate >= 0.7:
            print("✅ Hit rate is healthy (>70%)")

Target baseline: a hit rate above 70% for production systems.
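A hit rate is easier to argue for when it is translated into money. The following sketch converts the three token counters into an input cost and a saving relative to sending everything uncached, using the Claude Sonnet 4.5 prices quoted earlier. The function names and the sample counts are illustrative.

```python
PRICE_PER_MTOK = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}

def input_cost_usd(regular: int, cache_write: int, cache_read: int) -> float:
    """Input-side cost in dollars for the given token counts."""
    return (
        regular * PRICE_PER_MTOK["input"]
        + cache_write * PRICE_PER_MTOK["cache_write"]
        + cache_read * PRICE_PER_MTOK["cache_read"]
    ) / 1_000_000

def savings_vs_uncached(regular: int, cache_write: int, cache_read: int) -> float:
    """Fraction of input cost saved compared with paying the standard price for every token."""
    total = regular + cache_write + cache_read
    if total == 0:
        return 0.0
    baseline = total * PRICE_PER_MTOK["input"] / 1_000_000
    return 1 - input_cost_usd(regular, cache_write, cache_read) / baseline

# Illustrative counts: 80% of input tokens read from cache.
print(round(input_cost_usd(100_000, 100_000, 800_000), 3))
print(f"{savings_vs_uncached(100_000, 100_000, 800_000):.1%}")
```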

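Summary

Three core principles:

Static content first: the more stable it is, the earlier it goes
Dynamic content last: the more volatile it is, the later it goes
Monitor the hit rate continuously: treat it as a product metric

5-minute self-check:

Does the system prompt contain a timestamp or a user ID?
Are tool definitions regenerated on every request?
Is the system prompt longer than 1,024 tokens?
Are you monitoring cache-hit tokens?
In your agent system, does working memory sit in the middle of the prefix or at the end?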

References

How We Cut LLM Costs by 59% With Prompt Caching. ProjectDiscovery, 2026.

Prompt Caching Infrastructure. Introl, 2026.

Prompt Caching 201. OpenAI, 2026.