LLM JSON Output Blew Up Again? Three Structured Output Approaches Tested, with Code and Pitfalls

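You have surely run into this: you ask GPT for JSON and it replies with a markdown code block, thoughtfully prefixed with "Here is the JSON data you requested:".

So you write a regex to strip the code fence. Then it occasionally returns JSONL. Then a username contains a quote and the whole parse blows up.

This problem tormented me in production for the better part of a year. It is 2026 and the tooling for LLM structured output is mature, yet plenty of people still hand-roll a regex plus json.loads. This article lays out how the three approaches actually behave, focusing on how to use PydanticAI and the pitfalls I hit.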

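Where the problem actually comes from

An LLM is fundamentally a text generator; your code needs a data structure. The gap between the two is where bugs live.

When you call json.loads on a raw LLM reply, you are assuming six things: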

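The output is always valid JSON (not guaranteed)
The field names match what you asked for (not guaranteed)
The field types are right (string or number? not guaranteed)
The values fall in a sensible range (not guaranteed)
There are no extra fields (not guaranteed)
The format stays consistent across different inputs (not guaranteed)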

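None of them is guaranteed. GPT-5.5 has just been released with another jump in capability, but left to itself the model still does not control its output format, and that has not fundamentally changed since GPT-3.5.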
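
To make those six assumptions concrete, here is a minimal sketch that feeds a few reply shapes straight into json.loads. The sample replies are modeled on the failure modes described in this article, not captured from a real API, and `naive_parse` is a hypothetical helper name.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this listing readable

# Illustrative replies: clean, fenced, chatty, and wrongly typed
RAW_REPLIES = [
    '{"sentiment": "negative", "score": 0.9}',
    FENCE + 'json\n{"sentiment": "negative", "score": 0.9}\n' + FENCE,
    'Sure! Here is the analysis: {"sentiment": "negative", "score": 0.9}',
    '{"sentiment": "negative", "score": "0.9"}',
]

def naive_parse(raw: str) -> str:
    """Report what happens when a raw reply goes straight into json.loads."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(data, dict):
        return "not an object"
    if not isinstance(data.get("score"), (int, float)):
        return "wrong type for score"
    return "ok"

outcomes = [naive_parse(raw) for raw in RAW_REPLIES]
print(outcomes)
```

Only the first reply survives. Two fail at the parsing step and one parses but carries the wrong type, which is the kind of failure that tends to surface much later in the pipeline.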

Three approaches, very different results

Level 1: ask for JSON in the prompt

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{
        "role": "user",
        "content": 'Analyze the sentiment of this text and return JSON with the fields sentiment and score. Text: This product is garbage'
    }]
)
result = response.choices[0].message.content
# May return: ```json\n{"sentiment": "negative", "score": 0.9}\n```
# Or: {"sentiment": "negative", "score": "0.9"}  <- score became a string
# Or: Sure! Here is the analysis: {"sentiment": "negative", "score": 0.9}

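By my count, the prompt-only approach succeeds roughly 85-92% of the time on GPT-4o. That sounds fine until you run 100,000 requests in production: that is somewhere between 8,000 and 15,000 parse failures, more than enough to ruin your week.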
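
For reference, the usual workaround at this level looks something like the sketch below. It uses only the standard library, and `extract_first_json_object` is a hypothetical helper, not part of any SDK. It rescues the fenced and chatty replies, but it cannot know that score was supposed to be a number.

```python
import json

def extract_first_json_object(raw: str) -> dict:
    """Return the first JSON object embedded anywhere in `raw`."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(raw):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores whatever text follows it
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in reply")

chatty = 'Sure! Here is the analysis: {"sentiment": "negative", "score": "0.9"}'
parsed = extract_first_json_object(chatty)
print(parsed["sentiment"], type(parsed["score"]).__name__)
```

The object comes out, but score is still a string. Extraction fixes the wrapping, not the contract.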

Level 2: function calling / tool use

tools = [{
    "type": "function",
    "function": {
        "name": "analyze_sentiment",
        "parameters": {
            "type": "object",
            "properties": {
                "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                "score": {"type": "number"}
            },
            "required": ["sentiment", "score"]
        }
    }
}]

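response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": "Analyze: This product is garbage"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "analyze_sentiment"}}
)
# The arguments come back as a JSON string in
# response.choices[0].message.tool_calls[0].function.arguments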

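The success rate climbs to 95-99%. The JSON is almost always well-formed and the fields are there. But to the model the schema is still a hint, not a constraint: ask for a score that is a float between 0 and 1 and it will occasionally hand back 85 or -0.3.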
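
Since the schema is only a hint at this level, the arguments string still deserves a local check before anything downstream trusts it. A minimal sketch, assuming Pydantic v2; `SentimentArgs` and `parse_tool_arguments` are names made up for this illustration.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError

class SentimentArgs(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: float = Field(ge=0.0, le=1.0)  # the range the tool schema could not enforce

def parse_tool_arguments(arguments_json: str) -> Optional[SentimentArgs]:
    """Validate the JSON string found in tool_call.function.arguments."""
    try:
        return SentimentArgs.model_validate_json(arguments_json)
    except ValidationError:
        return None  # caller decides whether to retry or fall back

ok = parse_tool_arguments('{"sentiment": "negative", "score": 0.9}')
too_big = parse_tool_arguments('{"sentiment": "negative", "score": 85}')
negative = parse_tool_arguments('{"sentiment": "negative", "score": -0.3}')
print(ok, too_big, negative)
```

Returning None keeps the sketch short. In a real pipeline you would probably keep the ValidationError so the message can be fed back to the model on retry.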

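Level 3: native structured output (constrained decoding)

Model vendors have been rolling this out since 2024 (OpenAI shipped Structured Outputs in August 2024). The idea is to mask out invalid tokens at generation time with a finite-state machine (FSM) or a grammar, so the model can only emit content that conforms to the schema.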
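
The masking idea fits in a few lines. The toy below is a sketch under heavy assumptions: a hand-written state machine for one tiny schema, an eight-token vocabulary, and a fake scoring function standing in for model logits. Real implementations compile the JSON Schema into the automaton and mask logits over the full vocabulary.

```python
import json

# State machine that accepts exactly {"ok":true} or {"ok":false}.
# Each state maps an allowed token to the next state.
TRANSITIONS = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}

VOCAB = ["{", "}", '"ok"', ":", "true", "false", "Sure!", "Here"]

def toy_score(token: str) -> float:
    """Stand-in for model logits: this fake model loves chatty tokens."""
    return {"Sure!": 5.0, "Here": 4.0, "false": 2.0, "true": 1.0}.get(token, 0.0)

def constrained_decode() -> str:
    state, output = "start", []
    while state != "done":
        allowed = TRANSITIONS[state]         # the mask: every other token is excluded
        token = max(allowed, key=toy_score)  # greedy pick among allowed tokens only
        output.append(token)
        state = allowed[token]
    return "".join(output)

unconstrained_first = max(VOCAB, key=toy_score)  # what the model "wants" to say first
constrained = constrained_decode()
print(unconstrained_first, constrained, json.loads(constrained))
```

Without the mask the first token would be the chatty "Sure!". With it, the only reachable strings are the two valid documents, whatever the scores say.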

from pydantic import BaseModel, Field

class SentimentResult(BaseModel):
    sentiment: str
    score: float = Field(ge=0.0, le=1.0)

# `client` is an openai.OpenAI() instance
response = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "Analyze the sentiment of the text"},
        {"role": "user", "content": "This product is garbage"}
    ],
    response_format=SentimentResult,
)
result = response.choices[0].message.parsed
print(result.sentiment)  # "negative"
print(result.score)      # 0.92

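The output is schema-compliant by construction. No retry loop, no regex, no try/except around json.loads. You still have to handle refusals and replies cut off by the token limit, and schema-compliant does not mean the values are correct.

There is a catch: every API implements structured output differently. OpenAI uses response_format, Anthropic has traditionally gone through tool_use, and Gemini has its own response_schema parameter. If you support multiple models you end up writing three code paths.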
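
Whatever the wrapper looks like, the common denominator each provider receives is a JSON Schema, and a single Pydantic model can produce it. A minimal sketch, assuming Pydantic v2; how the schema is attached to each provider's request is deliberately left out because that is the part that differs.

```python
from pydantic import BaseModel, Field

class SentimentResult(BaseModel):
    sentiment: str
    score: float = Field(ge=0.0, le=1.0)

# One model, one schema; the provider-specific envelope goes around this dict
schema = SentimentResult.model_json_schema()

print(schema["required"])
print(schema["properties"]["score"])
```

The ge/le bounds show up as minimum and maximum in the schema. Whether a given provider honors those keywords during decoding is a separate question, which is why local validation still matters.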

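Use a library to paper over the differences

There are five main libraries worth knowing for structured output, four of them in the Python ecosystem. I have tried them all: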

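Instructor (12.3k stars) is the simplest: a layer of Pydantic validation plus automatic retry, with support for all mainstream models. The downside is that it has no agent/tool support and only does data extraction.
PydanticAI (14.5k stars) comes from the official Pydantic team. Besides structured output it brings tool calling and dependency injection, so you can build a complete agent, and the code style feels a lot like FastAPI.
Outlines (13.3k stars) takes the constrained decoding route and gives 100% schema compliance on local models, but it does not support cloud APIs.
Guidance (19k stars) is the most powerful and has the steepest learning curve. It can branch during generation and pick a different schema based on intermediate results. It also supports local models only.
On the TypeScript side there is Zod + zodResponseFormat; Python developers can ignore it.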

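My selection logic: for pure data extraction against a cloud API, pick Instructor; to build an agent with tools, pick PydanticAI; to guarantee compliance on local models, pick Outlines.

PydanticAI in practice

The reason I picked PydanticAI is straightforward: my project needs structured output, tool calling, and multi-model support, and it covers all three.

Installation and basic usage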

pip install "pydantic-ai-slim[openai,google,anthropic]"

The simplest structured output:

from pydantic import BaseModel
from pydantic_ai import Agent

class CityInfo(BaseModel):
    name: str
    country: str
    population: int

agent = Agent("openai:gpt-5-mini", output_type=CityInfo)
result = agent.run_sync("Basic facts about Beijing")
print(result.output)
# name='Beijing' country='China' population=21893095
print(type(result.output))
# <class '__main__.CityInfo'>

What comes back is not a string but a type-safe Python object. The IDE can autocomplete result.output.name.
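
The type safety comes from Pydantic itself and can be seen without calling any model. A minimal sketch, assuming Pydantic v2 with its default (lax) validation mode; the payloads are made up for illustration.

```python
from pydantic import BaseModel, ValidationError

class CityInfo(BaseModel):
    name: str
    country: str
    population: int

# A numeric string is coerced to int in the default validation mode
city = CityInfo.model_validate(
    {"name": "Beijing", "country": "China", "population": "21893095"}
)

# Free text where a number belongs is rejected instead of slipping through
try:
    CityInfo.model_validate(
        {"name": "Beijing", "country": "China", "population": "about 21 million"}
    )
    rejected = False
except ValidationError:
    rejected = True

print(city.population + 1, rejected)
```

This is the same check that sits between the model's reply and result.output, and a failure here is what triggers a validation retry.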

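tool calling: let the agent call external functions

An LLM cannot access a database or call an API on its own. PydanticAI registers functions with the @agent.tool decorator (or @agent.tool_plain when the function does not need the run context), and the LLM decides whether to call them based on the user's question and the function's docstring.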

import httpx
from pydantic import BaseModel
from pydantic_ai import Agent

class WeatherReport(BaseModel):
    city: str
    temperature: float
    condition: str
    humidity: int

agent = Agent(
    "openai:gpt-5-mini",
    output_type=WeatherReport,
    system_prompt="You are a weather assistant. Use the tool to fetch live weather data, then return a structured result."
)

@agent.tool_plain
async def get_weather(city: str) -> str:
    """Look up the current weather for the given city."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://api.weatherapi.com/v1/current.json",
            params={"key": "YOUR_KEY", "q": city}
        )
        resp.raise_for_status()  # fail loudly on a bad key or an unknown city
        data = resp.json()
        current = data["current"]
        return f"Temperature {current['temp_c']}°C, {current['condition']['text']}, humidity {current['humidity']}%"

result = agent.run_sync("What's the weather like in Shanghai right now?")
print(result.output)
# city='Shanghai' temperature=24.5 condition='Cloudy' humidity=68

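The flow goes like this: the user asks about the weather in Shanghai → the LLM sees the docstring of the get_weather tool → it decides to call it with city="Shanghai" → it receives the return value → it generates output that conforms to the WeatherReport schema.

Key point: write the docstring clearly. The LLM relies heavily on it, together with the function name and parameter types, to decide when to call the function. When I tried leaving the docstring out, the call rate dropped below 30%.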
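
To see why the docstring carries so much weight, it helps to look at what a function offers for introspection. The sketch below uses only the standard library to assemble a tool description; `describe_tool` is a hypothetical helper and is not how PydanticAI builds its tool schema internally, only an illustration of the raw material available.

```python
import inspect

def describe_tool(func) -> dict:
    """Collect the pieces of a function that a model could be shown."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
    }

def get_weather(city: str) -> str:
    """Look up the current weather for the given city."""
    return "stub"

def query(q: str) -> str:
    return "stub"

documented = describe_tool(get_weather)
undocumented = describe_tool(query)
print(documented)
print(undocumented)
```

With no docstring, all that is left is a function called query taking a string called q, which says nothing about when it should be used.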

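Dependency injection: no global variables for the database connection

This is where PydanticAI pulls ahead of Instructor. Anyone who has used FastAPI will recognize the pattern.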

from dataclasses import dataclass
from typing import Any

from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

@dataclass
class AppDeps:
    db_pool: Any  # database connection pool
    api_key: str  # external API key

class UserProfile(BaseModel):
    name: str
    email: str
    order_count: int

agent = Agent(
    "openai:gpt-5-mini",
    deps_type=AppDeps,
    output_type=UserProfile,
    system_prompt="Look up user information by user ID."
)

@agent.tool
async def lookup_user(ctx: RunContext[AppDeps], user_id: int) -> str:
    """Look up a user's information in the database by user ID."""
    # ctx.deps holds the injected dependencies
    row = await ctx.deps.db_pool.fetchrow(
        "SELECT name, email, order_count FROM users WHERE id=$1", user_id
    )
    if row:
        return f"User: {row['name']}, email: {row['email']}, orders: {row['order_count']}"
    return "User not found"

# Inject dependencies at run time. `my_pool` is assumed to be an existing
# asyncpg-style pool; inside an async app, prefer `await agent.run(...)`.
result = agent.run_sync(
    "Look up the information for user 12345",
    deps=AppDeps(db_pool=my_pool, api_key="xxx")
)

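The first parameter of the tool function is a RunContext, and its deps attribute is the dependency object you injected. No global variables, and you can pass mock objects in tests, which is much cleaner.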
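
The testing benefit is easy to demonstrate without PydanticAI or a database. In this sketch `FakePool` is a hypothetical test double exposing the one method the tool uses, and `lookup_user_text` mirrors the tool body with deps passed in directly instead of through RunContext.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class AppDeps:
    db_pool: Any
    api_key: str

class FakePool:
    """In-memory stand-in for a connection pool (illustration only)."""

    def __init__(self, rows: Dict[int, dict]):
        self._rows = rows

    async def fetchrow(self, query: str, user_id: int) -> Optional[dict]:
        return self._rows.get(user_id)

async def lookup_user_text(deps: AppDeps, user_id: int) -> str:
    """Same logic as the tool, with the dependencies handed in explicitly."""
    row = await deps.db_pool.fetchrow(
        "SELECT name, email, order_count FROM users WHERE id=$1", user_id
    )
    if row:
        return f"User: {row['name']}, email: {row['email']}, orders: {row['order_count']}"
    return "User not found"

fake_rows = {12345: {"name": "Ada", "email": "ada@example.com", "order_count": 3}}
deps = AppDeps(db_pool=FakePool(fake_rows), api_key="test")

found = asyncio.run(lookup_user_text(deps, 12345))
missing = asyncio.run(lookup_user_text(deps, 1))
print(found)
print(missing)
```

Because the function only touches deps.db_pool, swapping the real pool for the fake one needs no patching and no globals.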

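Multiple output types: let the LLM choose what to return

Sometimes you do not know in advance which structure the LLM should return. PydanticAI accepts several output types at once: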

from pydantic import BaseModel
from pydantic_ai import Agent

class SuccessResult(BaseModel):
    data: dict
    message: str

class ErrorResult(BaseModel):
    error_code: int
    error_message: str

agent = Agent(
    "openai:gpt-5-mini",
    output_type=[SuccessResult, ErrorResult],  # either one is acceptable
    system_prompt="Handle the user's request. Return a success result if the request is reasonable, otherwise return an error."
)

result = agent.run_sync("Look up order #99999999")
if isinstance(result.output, ErrorResult):
    print(f"Error {result.output.error_code}: {result.output.error_message}")
else:
    print(f"Success: {result.output.message}")

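PydanticAI registers each type as a separate output tool and the LLM picks which one to call. In my tests this is noticeably more accurate than cramming optional fields into one big schema.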
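
One plausible reason separate types do better is that they are simply stricter. The sketch below compares the two designs locally with Pydantic v2; `CombinedResult` is a hypothetical "one big schema" written for this comparison, and the payloads are made up.

```python
from typing import Optional, Union

from pydantic import BaseModel, TypeAdapter, ValidationError

class SuccessResult(BaseModel):
    data: dict
    message: str

class ErrorResult(BaseModel):
    error_code: int
    error_message: str

class CombinedResult(BaseModel):
    """The alternative design: everything optional in a single model."""
    data: Optional[dict] = None
    message: Optional[str] = None
    error_code: Optional[int] = None
    error_message: Optional[str] = None

either = TypeAdapter(Union[SuccessResult, ErrorResult])

mixed_up = {"message": "done", "error_code": 500}  # half success, half error

combined = CombinedResult.model_validate(mixed_up)  # accepted without complaint

try:
    either.validate_python(mixed_up)
    union_rejected = False
except ValidationError:
    union_rejected = True

clean = either.validate_python({"error_code": 404, "error_message": "order not found"})
print(combined, union_rejected, type(clean).__name__)
```

The all-optional model happily accepts an answer that is half success and half error. The two separate models reject it, so an ambiguous reply becomes a retry instead of a silent bug.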

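Five pitfalls, all paid for with real money

Pitfall 1: retries eat your token budget

PydanticAI's validation retry is automatic: if the returned data does not match the schema, it sends another request. The default retry count is 1, and you can raise it: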

agent = Agent("openai:gpt-5-mini", output_type=MyModel, retries=3)  # MyModel: your own Pydantic model

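But every retry is a full API call. I had a schema nested four levels deep with more than 30 fields; GPT-4o's retry rate on it was about 12%, which cost me an extra 400-plus dollars in one month. I later split the schema into two calls with no more than 10 fields each, and the retry rate dropped below 2%.

Rule of thumb: once a schema passes 15 fields, consider splitting it. Keep nesting to three levels or fewer.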
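
A rough way to budget for this is to estimate the expected number of calls per request. The sketch assumes every attempt fails validation independently with the same probability, which is a simplification, and `expected_calls` is a name made up for this illustration.

```python
import math

def expected_calls(retry_rate: float, max_retries: int) -> float:
    """Expected API calls per request.

    Attempt k (0-based) only happens if the k attempts before it all failed,
    which has probability retry_rate ** k under the independence assumption.
    """
    return sum(retry_rate ** attempt for attempt in range(max_retries + 1))

before = expected_calls(0.12, max_retries=1)    # the four-level, 30-field schema
after = expected_calls(0.02, max_retries=1)     # after splitting into two steps
generous = expected_calls(0.12, max_retries=3)  # same schema with retries=3
print(round(before, 4), round(after, 4), round(generous, 4))
```

Treat the result as a lower bound on cost: a retry resends the earlier messages plus the validation error, so each retry is usually more expensive than the first call.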

Pitfall 2: schema support varies a lot between models

I tested the same schema on OpenAI, Anthropic, and Gemini:

from typing import Literal
from pydantic import BaseModel, Field

class ArticleAnalysis(BaseModel):
    title: str
    topics: list[str] = Field(min_length=1, max_length=5)
    sentiment: Literal["positive", "negative", "neutral"]
    word_count: int = Field(ge=100, le=10000)
    has_code: bool

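GPT-5-mini: 100% pass, zero retries
Claude Opus 4.7: occasionally ignores the ge/le constraints on Field, retry rate about 5%
Gemini 3 Flash: gets the Literal type right almost every time, but often ignores the max_length constraint on the list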

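Anthropic's structured output has traditionally been emulated through tool_use rather than true constrained decoding, so numeric range constraints are followed less reliably. Note that Pydantic itself enforces Field(ge=..., le=...) when the reply is parsed, and that rejection is what triggers the retry, so an out-of-range value never reaches your code. If your business logic needs checks beyond what Field can express, add a Pydantic validator as a backstop: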

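from pydantic import BaseModel, Field, field_validator

class ArticleAnalysis(BaseModel):
    word_count: int = Field(ge=100, le=10000)

    @field_validator("word_count")
    @classmethod
    def check_word_count(cls, v):
        # Runs only after the Field bounds above have passed, so as a pure
        # range check it is redundant; use this hook for rules Field cannot express.
        if v < 100 or v > 10000:
            raise ValueError(f"word_count {v} is out of range")
        return v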
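
A quick local check shows what Pydantic enforces on its own for the test schema above, with no provider involved. A minimal sketch, assuming Pydantic v2; the payloads are made up and `failing_fields` is a hypothetical helper.

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError

class ArticleAnalysis(BaseModel):
    title: str
    topics: List[str] = Field(min_length=1, max_length=5)
    sentiment: Literal["positive", "negative", "neutral"]
    word_count: int = Field(ge=100, le=10000)
    has_code: bool

def failing_fields(payload: dict) -> list:
    """Names of the fields that fail validation; empty if the payload is fine."""
    try:
        ArticleAnalysis.model_validate(payload)
    except ValidationError as exc:
        return sorted({str(error["loc"][0]) for error in exc.errors()})
    return []

good = {"title": "t", "topics": ["llm"], "sentiment": "neutral",
        "word_count": 1200, "has_code": True}
too_many_topics = dict(good, topics=["a", "b", "c", "d", "e", "f"])
out_of_range = dict(good, word_count=50)

print(failing_fields(good), failing_fields(too_many_topics), failing_fields(out_of_range))
```

Both kinds of violation reported in the comparison are caught client-side. The provider differences show up as retry rate and cost, not as bad data reaching your code.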

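Pitfall 3: with a bad docstring, the tool might as well not be registered

As mentioned earlier, the LLM uses the docstring to decide whether to call a tool. A few details:

In Chinese-language scenarios, Chinese docstrings performed 10-15% better than English ones (measured on GPT-4o)
Put a functional summary on the first line of the docstring and the parameter descriptions after it
The more specific the parameter type hints, the better. user_id: int beats id: str because the LLM can more easily tell what to pass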

# Poor docstring
@agent.tool_plain
def query(q: str) -> str:
    """Query"""
    ...

# Good docstring
@agent.tool_plain
def search_products(keyword: str, max_price: float = 0, category: str = "") -> str:
    """Search for products in the product database.

    Args:
        keyword: Search keyword, e.g. "wireless earbuds"
        max_price: Maximum price; 0 means no limit
        category: Product category, e.g. "electronics" or "home goods"
    """
    ...

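Pitfall 4: mixing async and sync causes trouble

PydanticAI offers two methods, run_sync() and run(). run() is async; run_sync() is a synchronous wrapper.

If your tool function is async (for example it uses httpx.AsyncClient), run_sync() can fail with an event loop conflict in some settings, especially in Jupyter Notebook or anywhere an asyncio loop is already running.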

# Raises an error in Jupyter
result = agent.run_sync("Check the weather")

# Work around it with nest_asyncio
import nest_asyncio
nest_asyncio.apply()
result = agent.run_sync("Check the weather")  # now it works
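
If the same code has to run both as a script and inside a notebook, you can check whether a loop is already running before choosing an entry point. A minimal standard-library sketch; `has_running_loop` is a hypothetical helper name.

```python
import asyncio

def has_running_loop() -> bool:
    """True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True

async def check_inside_loop() -> bool:
    return has_running_loop()

outside = has_running_loop()                    # plain script: no loop yet
inside = asyncio.run(check_inside_loop())       # inside a coroutine: loop is running
print(outside, inside)
```

In Jupyter the first check would already come back True, because the notebook runs its own loop. In that situation awaiting agent.run(...) directly in a cell avoids the conflict without patching anything.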

In production, I recommend going async throughout:

import asyncio

async def main():
    result = await agent.run("Check the weather")
    print(result.output)

asyncio.run(main())

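Pitfall 5: a mistyped model string does not produce a clear error

PydanticAI's model parameter uses the format "provider:model_name", for example "openai:gpt-5-mini".

I once fat-fingered it as "openai:gpt5-mini" (missing a hyphen). The error was not "invalid model name" but an HTTP 404 buried deep in the stack. It took me half an hour to realize the model name was misspelled.

I suggest pulling model names out into constants: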

MODEL_GPT5_MINI = "openai:gpt-5-mini"
MODEL_CLAUDE_OPUS = "anthropic:claude-opus-4-7"
MODEL_GEMINI_FLASH = "google-gla:gemini-3-flash-preview"

agent = Agent(MODEL_GPT5_MINI, output_type=MyModel)
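
Constants help, and a small guard can turn the remaining typos into an immediate, readable error instead of a 404 at request time. This is a sketch: `KNOWN_MODELS` and `checked_model` are hypothetical names, and the allowlist just repeats the three strings above rather than anything the library exposes.

```python
KNOWN_MODELS = {
    "openai:gpt-5-mini",
    "anthropic:claude-opus-4-7",
    "google-gla:gemini-3-flash-preview",
}

def checked_model(name: str) -> str:
    """Return `name` unchanged if it is well-formed and on the allowlist."""
    provider, separator, model = name.partition(":")
    if not separator or not provider or not model:
        raise ValueError(f"expected 'provider:model_name', got {name!r}")
    if name not in KNOWN_MODELS:
        raise ValueError(f"unknown model {name!r}; known: {sorted(KNOWN_MODELS)}")
    return name

valid = checked_model("openai:gpt-5-mini")

try:
    checked_model("openai:gpt5-mini")  # the missing-hyphen typo
    typo_caught = False
except ValueError as error:
    typo_caught = True
    typo_message = str(error)

print(valid, typo_caught)
```

Wrapping the string at the point where the agent is constructed, for example Agent(checked_model(...), ...), makes the failure happen at import time rather than on the first request.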

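When not to use structured output

There are two cases where you should skip it:

The first is when you need the model to improvise. Writing articles, chat conversations, brainstorming: forcing a schema in these cases only restricts what the model can express. Structured output suits data extraction, classification, and labeling, where you know exactly which structure you want.

The second is when the schema is extremely complex (50+ fields, more than 5 levels of nesting). The more complex the schema, the lower the model's fill accuracy and the higher the retry cost. You are better off splitting it into several small schemas and calling them step by step.

In my project roughly 60% of LLM calls use structured output and the remaining 40% are free-text generation. Mixing the two works best.

The code in this article runs as written once you supply API keys and your own placeholder objects (MyModel, my_pool). The official PydanticAI documentation is at ai.pydantic.dev, and the API reference is fairly complete. If you have hit other pitfalls, let's talk in the comments.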