When max_tokens=1 Meets a Reasoning Model: What a Failed "Test Connection" Button in Xagent Teaches Us

A connectivity check that locked reasoning models out

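If you have used an open-source AI platform, you have probably seen this design: when you configure a new model, the UI offers a "Test connection" button. You fill in base_url, api_key, and the model name, click it, and the backend sends the cheapest possible request to verify that the model responds.

The request usually looks like this: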

```python
await llm.chat([{"role": "user", "content": "Hello"}], max_tokens=1)
```

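Why max_tokens=1? Because the goal is only to verify connectivity: if the API returns 200 and one token comes back, the model counts as connected. It is cheap, fast, and easy on the model provider.

This design worked for years with traditional chat models. Then reasoning models arrived.

Xagent recently fixed the problem in two consecutively merged PRs:

PR #625: fix(xinference): handle reasoning models in chat response and test-connection (d8794f2, merged)
PR #626: fix(openai): fall back to reasoning_content when content is empty (3ca54bb, merged)

The two PRs look like they fix the same thing, but read together they tell a subtle compatibility story, and they are a good example of PR review done well.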

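How it started: a user could not connect to qwen3.6_27b

To reproduce: a user configures qwen3.6_27b, a reasoning model served by Xinference, in the Xagent UI and clicks "Test connection". The frontend reports: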

```
Invalid Xinference response: {
  'id': 'chat...',
  'choices': [{
    'message': {
      'role': 'assistant',
      'content': '',
      'reasoning_content': 'Here'
    },
    'finish_reason': 'length'
  }],
  'usage': {'prompt_tokens': 11, 'completion_tokens': 1}
}
```

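The response looks healthy: HTTP 200, with choices and usage present. The problem is content='' together with reasoning_content='Here'.

The "inner monologue" of reasoning models

Before producing a final answer, reasoning models (qwen3-thinking, deepseek-r1, qwen3.x_*, OpenAI o1/o3, and so on) first "think". On servers that expose this phase, it arrives in the reasoning_content field: the model's scratch work, reasoning chain, and self-talk. Only the final answer goes into content.

A response that completes normally looks like this: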

```json
{"content": "答案是 42", "reasoning_content": "需要计算 6×7..."}
```

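With max_tokens=1, however, the model is cut off just as it starts reasoning: reasoning_content="Here" (the first thinking token), content="" (the final answer has not started), and finish_reason="length" (truncated).

For a reasoning model, max_tokens=1 can never yield content, because the only token in the budget is spent on reasoning. That is why the "Test connection" button locked every reasoning model out.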
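The truncation can be illustrated with a toy simulation. This is only a sketch: `simulate_completion` and the token lists are invented for illustration and assume the simplest possible behavior, namely that every reasoning token is emitted before any answer token.

```python
def simulate_completion(reasoning_tokens, answer_tokens, max_tokens):
    """Toy model of a reasoning model: reasoning tokens are emitted first,
    answer tokens only after the reasoning is finished."""
    stream = [("reasoning", t) for t in reasoning_tokens]
    stream += [("answer", t) for t in answer_tokens]
    emitted = stream[:max_tokens]  # the budget cuts the stream off
    return {
        "content": "".join(t for kind, t in emitted if kind == "answer"),
        "reasoning_content": "".join(t for kind, t in emitted if kind == "reasoning"),
        # "length" means the budget ran out before the stream ended
        "finish_reason": "length" if len(emitted) < len(stream) else "stop",
    }


# Hypothetical token sequences, chosen to match the failing response above.
reasoning = ["Here", " I", " compute", " 6×7", "."]
answer = ["The", " answer", " is", " 42"]

truncated = simulate_completion(reasoning, answer, max_tokens=1)
complete = simulate_completion(reasoning, answer, max_tokens=64)
```

With a budget of one token, `truncated` has an empty `content` and a `finish_reason` of `"length"`, the same shape as the response that Xagent rejected.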

The adapter layer makes it worse

Raising max_tokens does not end the problem. When parsing a response, both the Xinference and the OpenAI chat adapters contained this check:

```python
content = message.content
if not content or not content.strip():
    raise RuntimeError("LLM returned empty content and no tool calls")
```

Neither adapter looks at reasoning_content: if content is empty, it raises. So even when reasoning_content holds useful partial output, the adapter discards it and reports an invalid response.
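A minimal reproduction of that behavior, under stated assumptions: `parse_strict` is a hypothetical stand-in for the original guard, and `SimpleNamespace` substitutes for whatever message object the real adapters receive.

```python
from types import SimpleNamespace


def parse_strict(message):
    """Mirror of the original guard: only `content` is inspected."""
    content = message.content
    if not content or not content.strip():
        raise RuntimeError("LLM returned empty content and no tool calls")
    return content


# The truncated message from the failing test-connection call.
truncated = SimpleNamespace(content="", reasoning_content="Here")

try:
    parse_strict(truncated)
    outcome = "accepted"
except RuntimeError as exc:
    # reasoning_content="Here" is never read, so the call is rejected.
    outcome = f"rejected: {exc}"
```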

The fix: two layers of defense

Layer 1: raise max_tokens (the front line)

In api/model.py, the /test-connection and /test endpoints raise max_tokens from 1 to 16:

```python
# Test chat connection with a small but non-trivial token budget.
# max_tokens=1 is unsafe for reasoning models that aren't caught
# by the name-based heuristic above (e.g. qwen3-thinking variants
# advertised as qwen3.x_*): they would consume the single token
# in reasoning_content and return an empty content...
chat_kwargs: dict[str, Any] = {"max_tokens": 16}
```

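Sixteen tokens are still cheap for an ordinary chat model, and they give a reasoning model some room to start writing its final answer. The choice deserves a note: why 16 rather than 100 or 4?

For plain chat models, the request should stay as cheap as possible.
For reasoning models, the budget should be the smallest one that can still produce the beginning of an answer.
16 is an engineering compromise: more room for a reasoning model to move from thinking to answering, without wasting tokens on ordinary models.

A comment that explains why a parameter has its value, instead of leaving a magic number behind, is a sign of good engineering in a PR.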
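The budget choice can be sketched as a named constant forwarded to the chat call. Everything except the value 16 and the `llm.chat(...)` call shape is an assumption here: `TEST_CONNECTION_MAX_TOKENS`, `RecordingLLM`, and `check_connection` are hypothetical names, not Xagent's API.

```python
import asyncio
from typing import Any

# Hypothetical constant name; the value 16 comes from the PR.
# Small enough to stay cheap, larger than 1 so that a reasoning model
# has room beyond its first thinking token.
TEST_CONNECTION_MAX_TOKENS = 16


class RecordingLLM:
    """Stand-in client (assumption) that records the kwargs it receives."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    async def chat(self, messages: list[dict[str, str]], **kwargs: Any) -> dict[str, Any]:
        self.calls.append(kwargs)
        return {"type": "text", "content": "Hi"}


async def check_connection(llm: RecordingLLM) -> bool:
    """Send the minimal test request and report whether content came back."""
    chat_kwargs: dict[str, Any] = {"max_tokens": TEST_CONNECTION_MAX_TOKENS}
    response = await llm.chat([{"role": "user", "content": "Hello"}], **chat_kwargs)
    return bool(response.get("content"))


llm = RecordingLLM()
connected = asyncio.run(check_connection(llm))
```

Keeping the number in one named place means the rationale lives next to the value and both test endpoints can share it.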

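Layer 2: adapter fallback (the back line)

Raising max_tokens alone is not enough, because:

Users may still pass a small max_tokens explicitly in their own code.
A reasoning model can be truncated at any point.
Even 16 tokens may not be enough for some very large reasoning models.

So both adapters gained a fallback: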

```python
# xinference.py
reasoning_content = message.get("reasoning_content") or ""
finish_reason = choice.get("finish_reason")

# When content is empty, fall back to reasoning_content only if all three conditions hold
if (
    finish_reason == "length"
    and reasoning_content
    and reasoning_content.strip()
):
    return {
        "type": "text",
        "content": reasoning_content,
        "reasoning_content": reasoning_content,
        "reasoning": reasoning_content,
        "raw": response_dict,
    }
```

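None of the three conditions is decorative:

Condition 1: `finish_reason == "length"`

Why not fall back whenever content is empty?

The most interesting detail of this PR came up during review. The initial implementation looked like this: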

```python
# ❌ Early version
if reasoning_content:
    return {"content": reasoning_content, ...}
```

Reviewer qinxuye pointed out:

The PR description and test case scope this fallback to finish_reason="length" truncation, but the implementation promotes reasoning_content whenever content is empty. ... A provider response with finish_reason="stop", empty final content, and a populated reasoning trace would now be treated as a successful answer instead of surfacing that the model never produced final content.

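In plain terms: if the model returns finish_reason='stop' (it claims to be done) but content is empty and only reasoning is present, something has genuinely gone wrong: the model says it finished without writing an answer. In that case the adapter should raise so the user finds out, rather than quietly returning the reasoning (a draft) as the final answer.

This is the "false success" trap: the code appears to work, but semantically it hides a real failure downstream. The reviewer recognized the overly broad fallback immediately.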
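The false success is easy to reproduce. The sketch below assumes a dict-shaped message and uses a hypothetical function name, `promote_any_reasoning`, to stand in for the early, overly broad version.

```python
def promote_any_reasoning(message):
    """Early, overly broad fallback: any non-empty reasoning is promoted."""
    content = message.get("content") or ""
    reasoning_content = message.get("reasoning_content") or ""
    if content.strip():
        return {"type": "text", "content": content}
    if reasoning_content:
        return {"type": "text", "content": reasoning_content}
    raise RuntimeError("LLM returned empty content and no tool calls")


# The model claims it finished ("stop") but never wrote a final answer.
# The early version never consults finish_reason at all.
stopped_without_answer = {
    "message": {"content": "", "reasoning_content": "Let me think about 6×7..."},
    "finish_reason": "stop",
}

result = promote_any_reasoning(stopped_without_answer["message"])
```

No exception is raised, and the caller receives the draft as if it were the answer. Nothing downstream can tell that the model never produced final content.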

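Condition 2: `reasoning_content` (truthy check)

The basic emptiness check. Because the value is normalized with `or ""` a few lines earlier, it is always a string at this point, so the check short-circuits on an empty string; without that normalization it would also be what keeps .strip() from being called on None.

Condition 3: `reasoning_content.strip()` (non-whitespace check)

This one came from the gemini-code-assist review bot: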

If reasoning_content contains only whitespace (e.g., " "), if reasoning_content: will evaluate to True. This would bypass the empty content guard and return whitespace as the main content...

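`if reasoning_content:` is truthy for a whitespace-only string such as " ", yet the content the user receives would still be meaningless whitespace. That is asymmetric with the `not content.strip()` guard applied to content. After the fix both sides check .strip(), so the contracts line up.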
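The difference between the two checks takes only a few lines to show. The candidate strings are arbitrary examples.

```python
candidates = ["", " ", "\n\t", "Here"]

# Truthiness alone lets whitespace-only strings through.
passes_truthy = [c for c in candidates if c]

# Adding .strip() rejects them, matching the guard applied to content.
passes_strip = [c for c in candidates if c and c.strip()]
```

`passes_truthy` keeps `" "` and `"\n\t"`, while `passes_strip` keeps only `"Here"`.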

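What does this PR teach us?

Lesson 1: API compatibility is always the future testing the present

max_tokens=1 was elegant in the chat-model era: the cheapest way to verify connectivity. Reasoning models changed the semantics of how a model spends its tokens, and one token is no longer enough to reach the answer.

API design should leave headroom for the next generation of models. By using max_tokens=16 instead of max_tokens=1, Xagent's fix acknowledges a new fact: a test request has to give the model a little breathing room.

Lesson 2: a normalization layer is the moat against provider heterogeneity

Xinference and OpenAI are separate adapters, but both normalize at the response-parsing layer, translating a provider-specific field (reasoning_content) into the framework's common concept (content). Xagent's two PRs apply the same fix to each adapter, so downstream code never needs to care which provider produced the response.

That is the value of a good normalization layer: downstream code only ever sees the normalized shape.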
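One way to picture the normalization is a single rule fed by two thin extractors, one for dict-shaped responses and one for attribute-style objects. This is an illustrative alternative, not Xagent's code: the PRs apply the fix inside each adapter separately, and `normalize`, `from_dict_choice`, and `from_object_choice` are hypothetical names.

```python
from types import SimpleNamespace
from typing import Any, Optional


def normalize(content: Optional[str], reasoning_content: Optional[str],
              finish_reason: Optional[str]) -> dict[str, Any]:
    """Shared rule: prefer content; promote reasoning only on truncation."""
    content = content or ""
    reasoning_content = reasoning_content or ""
    if content.strip():
        return {"type": "text", "content": content,
                "reasoning_content": reasoning_content}
    if finish_reason == "length" and reasoning_content.strip():
        return {"type": "text", "content": reasoning_content,
                "reasoning_content": reasoning_content}
    raise RuntimeError("LLM returned empty content and no tool calls")


def from_dict_choice(choice: dict[str, Any]) -> dict[str, Any]:
    """Extractor for a provider that returns plain dicts."""
    message = choice.get("message") or {}
    return normalize(message.get("content"), message.get("reasoning_content"),
                     choice.get("finish_reason"))


def from_object_choice(choice: Any) -> dict[str, Any]:
    """Extractor for an SDK that returns attribute-style objects."""
    message = choice.message
    return normalize(message.content,
                     getattr(message, "reasoning_content", None),
                     choice.finish_reason)


as_dict = {"message": {"content": "", "reasoning_content": "Here"},
           "finish_reason": "length"}
as_object = SimpleNamespace(
    message=SimpleNamespace(content="", reasoning_content="Here"),
    finish_reason="length",
)

# Both provider shapes collapse into the same normalized result.
same_shape = from_dict_choice(as_dict) == from_object_choice(as_object)
```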

Lesson 3: a fallback needs boundaries

The deepest lesson exposed during review: "appears to work" is not the same as "correct".

```python
# ❌ "Appears to work"
if reasoning_content:
    return {"content": reasoning_content}

# ✅ With boundaries
if (
    finish_reason == "length"      # only on truncation
    and reasoning_content           # not empty
    and reasoning_content.strip()   # not whitespace-only
):
    return {"content": reasoning_content}
```

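Each condition corresponds to a counterexample that would otherwise produce a false success. That is why the test suite adds test_finish_reason_stop_with_only_reasoning_still_raises and test_whitespace_only_reasoning_content_still_raises: a fallback path has to be tested against counterexamples, not only the happy path.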
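A sketch of what such counterexample tests might look like. Only the two test names come from the PR; the test bodies, the third happy-path test, and the `parse_choice` helper are assumptions written for illustration.

```python
def parse_choice(choice):
    """Bounded fallback: promote reasoning only for truncated, non-blank output."""
    message = choice.get("message") or {}
    content = message.get("content") or ""
    reasoning_content = message.get("reasoning_content") or ""
    if content.strip():
        return {"type": "text", "content": content}
    if (
        choice.get("finish_reason") == "length"
        and reasoning_content
        and reasoning_content.strip()
    ):
        return {"type": "text", "content": reasoning_content}
    raise RuntimeError("LLM returned empty content and no tool calls")


def raises_runtime_error(choice):
    """Return True if parsing the choice raises RuntimeError."""
    try:
        parse_choice(choice)
    except RuntimeError:
        return True
    return False


def test_finish_reason_stop_with_only_reasoning_still_raises():
    assert raises_runtime_error(
        {"message": {"content": "", "reasoning_content": "draft"},
         "finish_reason": "stop"}
    )


def test_whitespace_only_reasoning_content_still_raises():
    assert raises_runtime_error(
        {"message": {"content": "", "reasoning_content": "   "},
         "finish_reason": "length"}
    )


def test_truncated_reasoning_is_promoted():
    result = parse_choice(
        {"message": {"content": "", "reasoning_content": "Here"},
         "finish_reason": "length"}
    )
    assert result["content"] == "Here"
```

The functions follow the pytest naming convention, so a test runner would collect them, and they can also be called directly.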

Lesson 4: the PR description must match the implementation's scope

qinxuye's review put it precisely:

The PR description and test case scope this fallback to finish_reason="length" truncation, but the implementation promotes reasoning_content whenever content is empty.

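The description says the scope is X; the implementation does X+Y. That is scope creep. Once the reviewer named it in one sentence, the author fixed it by adding finish_reason == "length" to a single if.

After writing a PR, ask yourself from the reviewer's seat: "If I read only the description and the tests, could I derive this scope from the implementation?" If the answer is "the implementation is broader than the description", a bug is probably waiting.

Lesson 5: fix sibling adapters together

One more detail: PR #625 was submitted before PR #626. #625 fixed the Xinference adapter, but users who connected to a Xinference service through the OpenAI provider (over the OpenAI-compatible protocol) still hit the error, because the same problem existed in the OpenAI adapter. #626 was opened afterwards to cover it.

The lesson: when you fix one adapter, grep the other adapters in the same directory for the same problem. Chat adapters are highly isomorphic in their logic, and so are their bugs.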
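The grep step can also be scripted. A minimal sketch, assuming the adapters are `.py` files sitting directly in one directory; `find_adapters_with_pattern` is a hypothetical helper and the directory layout is not taken from the Xagent repository.

```python
from pathlib import Path


def find_adapters_with_pattern(adapter_dir: Path, pattern: str) -> list[str]:
    """Return the names of .py files directly under adapter_dir that contain pattern."""
    return sorted(
        path.name
        for path in adapter_dir.glob("*.py")
        if pattern in path.read_text(encoding="utf-8")
    )
```

Calling it with the adapter directory and a distinctive fragment of the shared guard, such as "LLM returned empty content", would list every adapter that still carries the same check.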
