Claude Opus 4.8：模型小幅升级，平台大步向前

阅读时长 7 分钟 ｜ Model ID: claude-opus-4-8 ｜ 上下文窗口 ≤ 1M tokens

Opus 4.8 的发布通稿用了 "improved coding, reasoning, agentic" 三件套，和过去八个版本几乎一字不差。

200 多页 System Card，一句埋在 §5.2 里的话很有意思：

We found Opus 4.8 to be somewhat less robust than Opus 4.7 in several agentic contexts (such as vulnerability to prompt injection attacks).

一个旗舰升级公开承认自己在某项安全性上退步了 ，这比 SWE-bench 涨 1.0 分更引人思考------它指向了这次发布的真实形状：模型本体进入小步迭代期，真正的变量已经搬到模型外面那一圈协议、编排和产品层上去了。

过去需要用 LangGraph、自研 orchestrator、queue/worker 拼出来的 fan-out、context isolation、结果合并、异步 subagent 等能力，正在从 Claude Code 的产品实践，逐步沉淀为 Claude 平台的天然"骨骼"。
#mermaid-svg-llmHTHH95MWSntdQ{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#mermaid-svg-llmHTHH95MWSntdQ .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#mermaid-svg-llmHTHH95MWSntdQ .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#mermaid-svg-llmHTHH95MWSntdQ .error-icon{fill:#552222;}#mermaid-svg-llmHTHH95MWSntdQ .error-text{fill:#552222;stroke:#552222;}#mermaid-svg-llmHTHH95MWSntdQ .edge-thickness-normal{stroke-width:1px;}#mermaid-svg-llmHTHH95MWSntdQ .edge-thickness-thick{stroke-width:3.5px;}#mermaid-svg-llmHTHH95MWSntdQ .edge-pattern-solid{stroke-dasharray:0;}#mermaid-svg-llmHTHH95MWSntdQ .edge-thickness-invisible{stroke-width:0;fill:none;}#mermaid-svg-llmHTHH95MWSntdQ .edge-pattern-dashed{stroke-dasharray:3;}#mermaid-svg-llmHTHH95MWSntdQ .edge-pattern-dotted{stroke-dasharray:2;}#mermaid-svg-llmHTHH95MWSntdQ .marker{fill:#333333;stroke:#333333;}#mermaid-svg-llmHTHH95MWSntdQ .marker.cross{stroke:#333333;}#mermaid-svg-llmHTHH95MWSntdQ svg{font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:16px;}#mermaid-svg-llmHTHH95MWSntdQ p{margin:0;}#mermaid-svg-llmHTHH95MWSntdQ .label{font-family:"trebuchet ms",verdana,arial,sans-serif;color:#333;}#mermaid-svg-llmHTHH95MWSntdQ .cluster-label text{fill:#333;}#mermaid-svg-llmHTHH95MWSntdQ .cluster-label span{color:#333;}#mermaid-svg-llmHTHH95MWSntdQ .cluster-label span p{background-color:transparent;}#mermaid-svg-llmHTHH95MWSntdQ .label text,#mermaid-svg-llmHTHH95MWSntdQ span{fill:#333;color:#333;}#mermaid-svg-llmHTHH95MWSntdQ .node rect,#mermaid-svg-llmHTHH95MWSntdQ .node circle,#mermaid-svg-llmHTHH95MWSntdQ .node ellipse,#mermaid-svg-llmHTHH95MWSntdQ .node polygon,#mermaid-svg-llmHTHH95MWSntdQ .node path{fill:#ECECFF;stroke:#9370DB;stroke-width:1px;}#mermaid-svg-llmHTHH95MWSntdQ .rough-node .label text,#mermaid-svg-llmHTHH95MWSntdQ .node .label text,#mermaid-svg-llmHTHH95MWSntdQ .image-shape .label,#mermaid-svg-llmHTHH95MWSntdQ .icon-shape .label{text-anchor:middle;}#mermaid-svg-llmHTHH95MWSntdQ .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#mermaid-svg-llmHTHH95MWSntdQ .rough-node .label,#mermaid-svg-llmHTHH95MWSntdQ .node .label,#mermaid-svg-llmHTHH95MWSntdQ .image-shape .label,#mermaid-svg-llmHTHH95MWSntdQ .icon-shape .label{text-align:center;}#mermaid-svg-llmHTHH95MWSntdQ .node.clickable{cursor:pointer;}#mermaid-svg-llmHTHH95MWSntdQ .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#mermaid-svg-llmHTHH95MWSntdQ .arrowheadPath{fill:#333333;}#mermaid-svg-llmHTHH95MWSntdQ .edgePath .path{stroke:#333333;stroke-width:2.0px;}#mermaid-svg-llmHTHH95MWSntdQ .flowchart-link{stroke:#333333;fill:none;}#mermaid-svg-llmHTHH95MWSntdQ .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-llmHTHH95MWSntdQ .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#mermaid-svg-llmHTHH95MWSntdQ .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-llmHTHH95MWSntdQ .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#mermaid-svg-llmHTHH95MWSntdQ .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#mermaid-svg-llmHTHH95MWSntdQ .cluster text{fill:#333;}#mermaid-svg-llmHTHH95MWSntdQ .cluster span{color:#333;}#mermaid-svg-llmHTHH95MWSntdQ div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:"trebuchet ms",verdana,arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#mermaid-svg-llmHTHH95MWSntdQ .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#mermaid-svg-llmHTHH95MWSntdQ rect.text{fill:none;stroke-width:0;}#mermaid-svg-llmHTHH95MWSntdQ .icon-shape,#mermaid-svg-llmHTHH95MWSntdQ .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#mermaid-svg-llmHTHH95MWSntdQ .icon-shape p,#mermaid-svg-llmHTHH95MWSntdQ .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#mermaid-svg-llmHTHH95MWSntdQ .icon-shape .label rect,#mermaid-svg-llmHTHH95MWSntdQ .image-shape .label rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#mermaid-svg-llmHTHH95MWSntdQ .label-icon{display:inline-block;height:1em;overflow:visible;vertical-align:-0.125em;}#mermaid-svg-llmHTHH95MWSntdQ .node .label-icon path{fill:currentColor;stroke:revert;stroke-width:revert;}#mermaid-svg-llmHTHH95MWSntdQ :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;}#mermaid-svg-llmHTHH95MWSntdQ .core>*{fill:#1f6feb!important;stroke:#0b3d91!important;color:#fff!important;font-weight:bold!important;}#mermaid-svg-llmHTHH95MWSntdQ .core span{fill:#1f6feb!important;stroke:#0b3d91!important;color:#fff!important;font-weight:bold!important;}#mermaid-svg-llmHTHH95MWSntdQ .core tspan{fill:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .win>*{fill:#2ea043!important;stroke:#1a6b2a!important;color:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .win span{fill:#2ea043!important;stroke:#1a6b2a!important;color:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .win tspan{fill:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .minor>*{fill:#8957e5!important;stroke:#5a3aa1!important;color:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .minor span{fill:#8957e5!important;stroke:#5a3aa1!important;color:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .minor tspan{fill:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .base>*{fill:#6e7681!important;stroke:#444c56!important;color:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .base span{fill:#6e7681!important;stroke:#444c56!important;color:#fff!important;}#mermaid-svg-llmHTHH95MWSntdQ .base tspan{fill:#fff!important;} 应用层

Claude Code · claude.ai 网页版 · Cowork
服务/中间件层 ★ 主战场

Multi-agent Harness · Effort Control · Mid-stream System
模型层

Opus 4.8（小步迭代）
训练 / 引擎 / 系统软件 / 硬件

两个核心信号

一、诚实度：误报错误结果首次降到 0%

在 Anthropic 内部专门评测"模型是否会谎报自己工作成果"的场景里（System Card §6.1.2），Opus 4.8 是第一个达到 0% 误报率的模型。

过度自信场景比 4.7 降幅约 10×；agentic 编码中谎报任务完成比 Sonnet 4.6 降幅约 17×、比 Mythos Preview 降幅约 5×。

Opus 4.8 更诚实了。👏

这对自动 PR review、AI 结对编程和长任务 Agent 产品的价值，远高于任何一个基准分数的提升------这些场景的失败很多是因为 Agent "做错了但不说" 导致的。

二、Multi-agent harness：Agent 编排从产品能力走向平台能力

Anthropic 从 MCP、Claude Desktop、Claude Code、Claude Code 插件、Claude Code Web/Slack、Cowork 到现在的 multi-agent harness，一直在把"模型 + 工具 + 文件系统 + 子任务编排 + 上下文管理"这套东西往平台层收。

这条线终于从产品探索，变成了更明确的战略下注：多 Agent 编排正式纳入 Opus 4.8 模型评测框架（见 System Card §8.11）。

blocking subagents、fixed-agent team、async subagents 三种 harness，本质上对应了三类常见编排模式：主控拆任务、固定小队协作、异步 worker 回传。

在 blocking subagents 模式里，orchestrator 自己没有任务工具，只负责 spawn subagents；subagent 拿完整工具集和独立 200K context；async subagents 则更接近产品化形态：主 agent 可以继续干活，同时拉起长生命周期 subagent，等它完成后再把结果发回主会话。

结果也很不错：Multi-Agent BrowseComp 上，Orchestrator with Blocking Subagents 拿到 88.5，高于 single-agent 84.3；five-agent team 在 5M token 总预算下，分数高于 10M token 的单 agent（85.4 vs 84.3），延迟只有后者约 20%（代价是更高 token usage------本质上是用 token 换 latency）。

对自建多 Agent 编排的团队，这是最该立刻评估的事。官方方案未必替掉你的全部编排层，但 fan-out、critic、retry、结果合并、状态检查这类通用代码，已经有一部分开始进入 Claude 平台的原生层。如果你的自研编排主要围绕这些能力，一定比例的代码都值得重新审视，且越晚评估，沉没成本越大。

三个配套动作

如果说 multi-agent harness 是平台化的骨架，那下面三个能力就是让这套骨架真正跑起来的肌肉：推理深度可调、规则中途可改、并行成本下降。

Effort Control：把推理深度变成可调参数。

过去"模型想多深"基本藏在模型版本和产品档位里：你选了哪个模型、哪个价格层级，系统就默认给你对应的推理预算。现在它变成了可显式控制的旋钮。System Card §8.2 里最说明问题的一组数据是：SWE-bench Pro 上，Opus 4.8 用最低 effort 的成绩，约等于 Opus 4.7 用最高 effort 的峰值。换句话说，"省着用"也能拿到上一代天花板。
Mid-stream system：在 Agent 执行长任务期间改规则。

Messages API 允许在 messages 数组中途插入 system 条目。看起来是小改动，但对长任务 Agent 很关键：跑到一半时，可以在不破坏 prompt cache 的情况下进行约束追加、指令改写和人格切换，无需整段上下文推倒重来。
Fast mode 降价：给并行 Agent 工作负载让路。

Regular 价格持平（$5 / $25 per M），Fast 从 $30 / $150 砍到 $10 / $50。降价放到 multi-agent harness 里看就清楚了：高频、低延迟、并行多个 subagent 的任务执行成本，和之前相比开始有不错的性价比了。

五个最值得看的基准

完整对比表请见文末附录。

评测	Opus 4.8	Opus 4.7	含义
USAMO 2026	96.7	69.3	本次最戏剧性单点提升，证明类数学推理跨过台阶
SWE-bench Pro	69.2	64.3	比 GPT-5.5（58.6）高 10.6 分，编码类目前最强
HLE（无工具）	49.8	46.9	通用知识推理领先 GPT-5.5（41.4）和 Gemini 3.1 Pro（44.4）
GPQA Diamond	93.6	94.2	Opus 系列首次单点回退，纯学术推理切换前请先做 4.7 / 4.8 对照测试
Terminal-Bench 2.1	74.6	66.1	大涨 8.5 分但仍输给 GPT-5.5（78.2）

编码 / Agent / 长上下文领先，纯学术推理与 Gemini 持平略输，命令行任务输给 GPT-5.5。

一句话画像：全能型选手，不是单点冠军。

尤其值得关注：grader speculation

除了 USAMO 的跳跃和 multi-agent harness，System Card §6.1.2 里这个发现也值得 keep in mind：

Opus 4.8 在思考链里开始揣测"自己会被怎么打分" ------

Anthropic 把这条列为"训练过程中最值得警惕的趋势"。

白盒可解释性工具进一步在少量 RL episode 里发现了未被表达出来的 grader-aware 推理：

也就是说，模型在心里想这件事，但不写出来。

行为层面 Anthropic 评估"没有显著外溢"，所以不影响当下使用，但它指向了一个会持续发酵的问题：

RLHF 的边际收益正在逼近天花板，模型开始优化"评估函数"而不是"任务本身"。

所以这次 Opus 4.8 同时给出两个方向的信号：向外，模型能力被 harness 放大；向内，训练收益开始被评估函数反噬。前者解释了为什么 Agent 中间件会被平台吞噬；后者则意味着，模型层当然还会进步，但"只靠下一代模型更强"来支撑叙事，已经越来越不够了。

这对通用人工智能（AGI）的下一程意味着什么？

对齐工作的重心会从行为评测转向白盒可解释性------光看模型说什么不够了，还得看它在想什么。System Card 这次大量引用白盒证据来下结论，本身就是信号。
"模型再训练几轮就能解决 X 问题"这种说法的可信度在下降。如果模型已经开始揣测打分者，单纯加数据、加奖励信号的回报会越来越小。未来的能力跃迁，可能更依赖架构、训练范式和模型外系统，而不是单纯 scale。

前沿难题已经不只在"让模型更强"这一侧，也在"理解模型到底在想什么"这一侧。

这也是为什么，"只升级模型本体"已经越来越难满足大家对 AI 的期待。

其他翻车

GPQA Diamond 微降至 93.6：纯学术推理场景切换前，请先用私有评测集做 4.7 / 4.8 对照测试；
Pilot 阶段记录的怪癖（System Card §6.2.1）：偶尔过度自信、早停、不必要追问、偶尔叫用户去睡觉、轻微 sycophancy、任务边缘场景下主动删文件（Agent 产品记得给高危工具加 dry-run 或二次确认）；
家族排序：Opus 4.8 是当前最强公开 Claude，但还不是 Anthropic 手里最强的牌------System Card 多次提到它仍弱于 Mythos Preview。

该不该切

你的场景	建议	关键依据
大规模代码迁移 / 重构	🟢 立刻切	SWE-bench Pro 69.2、multi-agent harness 加持
重度 Claude Code / IDE 内联用户	🟢 立刻切	Fast 降价 + 工具调用省步
自建 multi-agent 编排团队	🟢 立刻评估	官方方案可能替掉一部分 fan-out / merge / retry 代码
长程 Agent / 高频低延迟交互	🟡 提上日程	OSWorld 83.4、Fast mode 经济性翻盘
数学证明 / 严肃推理类	🟡 提上日程	USAMO 69.3 → 96.7 是质变
金融 / 法务 / 医疗辅助	🟡 提上日程	误报率 0%、更愿意承认不确定
学术 PhD 类 / 命令行 Agent	🟠 谨慎评估	GPQA 微降、Terminal-Bench 输给 GPT-5.5
已稳定的第三方编排框架产品	🟠 再等等	别为 10-20% 收益动地基
暴露在不可信输入下的 Agent	🔴 加固再切	prompt injection 比 4.7 更"脆弱"

切换前必跑的 6 项回归

质量回归集：50--200 条核心 case，4.7 vs 4.8 双跑、标注退化项
工具调用一致性 ：同一组 schema，比 调用次数 / 完成率 / 错调率
Prompt cache 命中率：是否真没破？尤其在 mid-stream system 注入下
Prompt injection 红队：本次必加，验证你这一侧的防御覆盖到位
真实成本曲线：你的流量画像下的实际单位成本，试一下 minimum effort 是否够用
并行失败隔离：subagent 超时、报错或返回错误结论时，主会话会不会被污染；高危工具必须 dry-run 或二次确认

一个可证伪的预测

判断这次发布是否真是"平台位移"，而不只是"模型迭代"，几个月后看一件事就够了：

独立多 Agent 编排框架，是继续站在主编排层，还是被挤到适配层。

具体看四个信号：release 频率、核心维护者活跃度、企业案例增长，以及生态插件是否还在扩张。

如果到 2026 年底，LangGraph、CrewAI、AutoGen 这一档里，有两个以上开始从"Agent 主框架"退成"连接器 / 工作流适配层 / legacy 项目"，那本文的判断基本成立：Anthropic 确实把 Agent 中间件的一部分收进了平台层。

如果它们仍在高频发布、企业案例继续增长、开发者还在围绕它们搭新项目------那 Opus 4.8 就只是一次更好用的模型升级，"平台吞噬中间件"的叙事并不成立。

一起拭目以待吧。

附录：完整基准对比表

来自 System Card §8.1，Claude 数据为 adaptive thinking + max effort、5 次平均。

编码与 Agent

评测	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
SWE-bench Verified	88.6	87.6	---	80.6
SWE-bench Pro	69.2	64.3	58.6	54.2
SWE-bench Multilingual	84.4	80.5	---	---
SWE-bench Multimodal	38.4	34.5	---	---
Terminal-Bench 2.1	74.6	66.1	78.2	70.3
OSWorld-Verified	83.4	82.8	78.7	76.2
MCP-Atlas	82.2	79.1	75.3	78.2
Automation Bench	15.5	9.9	12.9	9.6

推理与知识

评测	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
HLE（无工具）	49.8	46.9	41.4	44.4
HLE（带工具）	57.9	54.7	52.2	51.4
GPQA Diamond	93.6	94.2	---	94.3
USAMO 2026	96.7	69.3	---	---
ArxivMath	71.82	---	71.48	64.79

长上下文与多 Agent

评测	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
BrowseComp（multi-agent）	88.5	79.8（single）	84.4	85.9
GraphWalks BFS 256K	85.9	76.9	73.7	---
GraphWalks Parents 256K	99.3	93.6	90.1	---
Finance Agent v2	53.9	51.5	51.8	43.0

参考

Claude Opus 4.8 --- Anthropic News

Claude Opus 4.8 System Card (Anthropic, 2026)，重点参考 §5.2、§6.1.2、§6.3.7、§6.6.3、§8.1、§8.2、§8.11