Structured Output for Large Language Models: Constraining LLM Output with JSON Schema

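Part 1: Introduction, JSON Schema Basics, and Constraining LLM Output

I. Introduction

1.1 The Pain of Uncontrollable LLM Output

As large language models (LLMs) become part of everyday software, developers face a core challenge: how do you make a model's output predictable, processable, and easy to integrate?

Consider a typical exchange:

User: Extract the key information from this article.
AI: This article mainly discusses... (followed by 500 words of free-form text)

The trouble starts when you try to parse that text into structured data:

The model may return raw JSON, or JSON wrapped in a Markdown code block
Field names may be inconsistent ("key points" vs. "core ideas" vs. "main conclusions")
Data types are unstable ("10" may come back as a number or as a string)
Nesting is arbitrary (sometimes an array, sometimes an object)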

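This lack of control shows up along several dimensions:

Dimension	Symptom	Impact
Unstable format	JSON, Markdown, and plain text appear unpredictably	Parsing logic fills up with if-else branches
Inconsistent structure	The number and nesting of fields change between calls	The data model is hard to design
Type confusion	Numbers and strings are mixed; booleans are expressed inconsistently	Backend handling becomes more complex
Hard to validate	Data cannot be checked for validity at the point of receipt	Errors surface at runtime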

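1.2 Why Structured Output Matters

Structured output reconciles the model's creative freedom with the determinism that programs need. In practice it means:

Predictable downstream processing: code can rely on the shape of the returned data without elaborate parsing
Earlier error detection: format errors are caught before the data reaches business logic
Easier integration: structured data maps naturally onto databases, APIs, and frontend components
Testable output: responses can be covered by unit and integration tests like any other software

As a concrete example, suppose you are building a customer-service system that needs to extract the following from user feedback: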

{
  "issue_type": "退货申请",
  "product_id": "SKU-12345",
  "urgency": "high",
  "reason": "商品损坏"
}

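Without structured output you might write a pile of regular expressions and parsing logic. With it, the process becomes the following (with_structured_output is the LangChain chat-model method; the model setup and the sample message are illustrative):

from typing import Literal

from langchain_openai import ChatOpenAI
from pydantic import BaseModel

# Define the expected structure
class CustomerFeedback(BaseModel):
    issue_type: str
    product_id: str
    urgency: Literal["low", "medium", "high"]
    reason: str

# Any LangChain chat model that implements with_structured_output works here
model = ChatOpenAI(model="gpt-4o")
user_message = "The item SKU-12345 arrived damaged and I need to return it urgently."

# The model output is parsed into this structure
feedback = model.with_structured_output(CustomerFeedback)
result = feedback.invoke(user_message)
# result is a validated CustomerFeedback instance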

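1.3 Why JSON Schema Is the Default Choice

Among the many ways to describe structure, why does JSON Schema stand out?

First, JSON Schema is a widely adopted open standard. Its drafts have been published as IETF Internet-Drafts, and the specification is maintained by the JSON Schema community, with thorough documentation and a broad ecosystem. Unlike a custom DSL, it has validator implementations in dozens of programming languages.

Second, JSON Schema is expressive. It supports:

Basic types: string, number, integer, boolean, null, array, object
Composition: anyOf, oneOf, allOf, not
Conditionals: if-then-else
Format annotations: email, uri, date-time, uuid, and others
Value constraints: const, enum, pattern, minimum, maxLength, and others

Third, mainstream LLM platforms support it natively. OpenAI, Anthropic, and Google all accept JSON Schema (or a subset of it) to describe structured output or tool inputs, so developers do not have to build their own parsing layer.

Fourth, it connects cleanly to type systems. Libraries such as Pydantic and Zod generate JSON Schema from type definitions, and code generators can go the other way to produce TypeScript or Python types from a schema, which makes "define once, use everywhere" practical.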

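II. JSON Schema Basics

2.1 What JSON Schema Is

JSON Schema is a declarative language for describing the structure and validation rules of JSON data. Put simply, it answers two questions:

What does this JSON data look like? (descriptive)
Is this JSON data valid? (prescriptive)

The simplest useful JSON Schema looks like this: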

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object"
}

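This schema says "the value being validated must be an object". It is very permissive: any JSON object passes, whatever properties it contains, while non-object values such as strings or arrays are rejected.

2.2 Core Concepts

Type

JSON Schema defines seven basic types, each of which is a complete schema on its own: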

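// String
{ "type": "string" }

// Any number (integer or floating point)
{ "type": "number" }

// Integer
{ "type": "integer" }

// Boolean
{ "type": "boolean" }

// Null
{ "type": "null" }

// Array
{ "type": "array" }

// Object
{ "type": "object" }

Note: each snippet above is a complete schema (the // comments are for illustration only; JSON itself has no comment syntax). A JSON object cannot repeat the "type" key, but types can be combined, as covered later.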

Properties

properties declares the schema for each named property of an object:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" },
    "email": { "type": "string" }
  }
}

Required

required lists the properties that must be present:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  },
  "required": ["name"]
}

In the example above, name is required and age is optional.
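
To make the semantics of type, properties, and required concrete, here is a minimal sketch of a checker for just those three keywords, using only the standard library. The names check and TYPE_MAP are illustrative and not part of any library; a real project would normally rely on an existing validator rather than this simplification.

```python
# Map JSON Schema type names to Python types.
TYPE_MAP = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "null": type(None), "array": list, "object": dict,
}

def check(instance, schema, path="$"):
    """Return a list of error strings for a tiny subset of JSON Schema."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        # In Python, bool is a subclass of int, so reject it for numeric types.
        bool_as_number = isinstance(instance, bool) and expected in ("integer", "number")
        if not isinstance(instance, TYPE_MAP[expected]) or bool_as_number:
            errors.append(f"{path}: expected {expected}")
            return errors
    if isinstance(instance, dict):
        # required only checks presence; properties only checks values that are present.
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}: missing required property '{name}'")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(check(instance[name], subschema, f"{path}.{name}"))
    return errors

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name"],
}

print(check({"name": "Alice"}, schema))  # [] because age is optional
print(check({"age": "28"}, schema))      # reports the missing name and the wrong type of age
```

Note how the two keywords are independent: a property can be declared without being required, and a missing optional property is never type-checked.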

Enum

enum restricts a value to a predefined list:

{
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "enum": ["pending", "processing", "completed", "failed"]
    },
    "priority": {
      "type": "integer",
      "enum": [1, 2, 3, 4, 5]
    }
  }
}

Format

format names a predefined string format:

{
  "type": "object",
  "properties": {
    "email": { "type": "string", "format": "email" },
    "url": { "type": "string", "format": "uri" },
    "date": { "type": "string", "format": "date" },
    "datetime": { "type": "string", "format": "date-time" },
    "uuid": { "type": "string", "format": "uuid" },
    "ipv4": { "type": "string", "format": "ipv4" }
  }
}

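Note: support for format varies between implementations. Since Draft 2019-09 the specification treats format as an annotation by default, so many validators ignore it unless format assertion is explicitly enabled, and those that do check it differ in strictness.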

2.3 Common Keywords in Detail

anyOf / oneOf / allOf

These three keywords express "or" and "and" logic:

anyOf: at least one subschema must match

{
  "anyOf": [
    { "type": "string", "maxLength": 5 },
    { "type": "integer", "minimum": 0 }
  ]
}

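The value can be a string of at most five characters or a non-negative integer.

oneOf: exactly one subschema must match

{
  "oneOf": [
    { "type": "object", "properties": { "type": { "const": "cat" } }, "required": ["type"] },
    { "type": "object", "properties": { "type": { "const": "dog" } }, "required": ["type"] }
  ]
}

The value must be a cat or a dog, never both. The required entries matter: without them, an object that omits type would satisfy both branches and therefore fail oneOf.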

allOf: every subschema must match

{
  "allOf": [
    { "type": "object", "properties": { "name": { "type": "string" } } },
    { "type": "object", "properties": { "age": { "type": "integer" } } }
  ]
}

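The value must be an object, and if name or age is present it must be a string or an integer respectively. Note that properties alone does not make a field mandatory; add required to a subschema for that.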
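
The difference between the three combinators comes down to counting how many subschemas match. The sketch below illustrates this with a deliberately tiny matcher (the matches helper is hypothetical and understands only const, type: object, properties, and required). It also shows a common oneOf pitfall: branches that do not list their tag field in required.

```python
def matches(instance, schema):
    """Tiny matcher: supports const, type=object, properties, required."""
    if "const" in schema and instance != schema["const"]:
        return False
    if schema.get("type") == "object":
        if not isinstance(instance, dict):
            return False
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not matches(instance[key], subschema):
                return False
    return True

def count_matches(instance, branches):
    return sum(matches(instance, branch) for branch in branches)

def any_of(instance, branches):
    return count_matches(instance, branches) >= 1

def one_of(instance, branches):
    return count_matches(instance, branches) == 1

def all_of(instance, branches):
    return count_matches(instance, branches) == len(branches)

# Two branches that do NOT list "type" in required.
loose_branches = [
    {"type": "object", "properties": {"type": {"const": "cat"}}},
    {"type": "object", "properties": {"type": {"const": "dog"}}},
]

print(one_of({"type": "cat"}, loose_branches))  # True: only the cat branch matches
print(one_of({}, loose_branches))               # False: an empty object matches both branches
print(any_of({}, loose_branches))               # True
```

Adding "required": ["type"] to each branch makes the empty object match neither branch, which is usually the intended behaviour.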

additionalProperties

additionalProperties controls whether properties not listed in properties are allowed:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  },
  "additionalProperties": false
}

When it is set to false, any property not declared in properties causes validation to fail. This is the usual setting for strict schemas.

{
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  },
  "additionalProperties": { "type": "string" }
}

When it is set to a schema, extra properties are allowed, but each of their values must satisfy that schema.

Array Constraints

{
  "type": "array",
  "items": { "type": "string" },
  "minItems": 1,
  "maxItems": 10,
  "uniqueItems": true
}

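items: the schema every element must satisfy
minItems / maxItems: bounds on the number of elements
uniqueItems: whether elements must be unique

String Constraints

{
  "type": "string",
  "minLength": 1,
  "maxLength": 100,
  "pattern": "^[A-Z].*",      // regular expression
  "format": "email"
}

Numeric Constraints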

{
  "type": "number",
  "minimum": 0,
  "maximum": 100,
  "exclusiveMinimum": 0,     // > 0
  "exclusiveMaximum": 100,   // < 100
  "multipleOf": 5            // must be a multiple of 5
}

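2.4 Version Differences: Draft 7 vs. Draft 2019-09 vs. Draft 2020-12

JSON Schema has several published drafts. The main differences are:

Feature	Draft 7	Draft 2019-09	Draft 2020-12
`$schema` URI	`http://json-schema.org/draft-07/schema#`	`https://json-schema.org/draft/2019-09/schema`	`https://json-schema.org/draft/2020-12/schema`
if/then/else	Supported	Supported	Supported
unevaluatedProperties	Not supported	Supported	Supported
dependentSchemas	Not supported	Supported	Supported
nullable	Not a JSON Schema keyword	Not a JSON Schema keyword	Not a JSON Schema keyword
format as a separate vocabulary (annotation by default)	Not supported	Supported	Supported

Practical advice:

New projects should prefer Draft 2020-12, the latest published draft
LLM platforms such as OpenAI accept only a subset of JSON Schema keywords, so check each platform's documentation for compatibility
nullable is an OpenAPI keyword, not part of the JSON Schema standard
- The JSON Schema way is a type array: {"type": ["string", "null"]} (supported in all of the drafts above)
- OpenAPI 3.0 uses type: string, nullable: true; OpenAPI 3.1 aligns with JSON Schema 2020-12 and drops nullable in favour of type arrays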
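
When a schema written in the OpenAPI 3.0 style has to be passed to a tool that expects standard JSON Schema, the nullable flag can be rewritten mechanically. The following is a minimal sketch under that assumption; nullable_to_type_array is a hypothetical helper name, and it only handles the simple case where type is a single string.

```python
import copy

def nullable_to_type_array(schema):
    """Return a copy of schema with OpenAPI 3.0 'nullable' flags rewritten as type arrays."""
    converted = copy.deepcopy(schema)

    def convert(node):
        if isinstance(node, dict):
            # Only a boolean value is the OpenAPI keyword; a dict value would be
            # a property that merely happens to be named "nullable".
            if isinstance(node.get("nullable"), bool):
                is_nullable = node.pop("nullable")
                if is_nullable and isinstance(node.get("type"), str):
                    node["type"] = [node["type"], "null"]
            for value in node.values():
                convert(value)
        elif isinstance(node, list):
            for item in node:
                convert(item)

    convert(converted)
    return converted

openapi_style = {
    "type": "object",
    "properties": {
        "nickname": {"type": "string", "nullable": True},
        "age": {"type": "integer"},
    },
}

print(nullable_to_type_array(openapi_style)["properties"]["nickname"])
# {'type': ['string', 'null']}
```

The input is deep-copied first, so the original OpenAPI document is left untouched.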

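III. Constraining LLM Output with JSON Schema

3.1 How It Works: Why LLMs Can Follow a JSON Schema

How does a model understand and follow a JSON Schema? There are two mechanisms.

The first is learned behaviour. LLMs have seen large amounts of JSON and JSON Schema during training, so when a schema appears in the prompt the model, which is still just completing text, tends to produce JSON that follows the schema's rules. The second is constrained decoding: some platforms compile the schema into a grammar and, at each generation step, only allow tokens that keep the output valid. The diagram below illustrates the prompt-based path: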

┌─────────────────────────────────────────────────────────────┐
│              How an LLM follows a JSON Schema               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Input:                                                    │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ You are a data-extraction assistant. Extract the    │   │
│   │ information according to this schema:               │   │
│   │                                                     │   │
│   │ {                                                   │   │
│   │   "type": "object",                                 │   │
│   │   "properties": {                                   │   │
│   │     "name": {"type": "string"},                     │   │
│   │     "age": {"type": "integer"}                      │   │
│   │   },                                                │   │
│   │   "required": ["name"]                              │   │
│   │ }                                                   │   │
│   └─────────────────────────────────────────────────────┘   │
│                          ↓                                  │
│   From JSON patterns seen in training, the model infers:    │
│   • type: object → the output should be a JSON object       │
│   • properties defines name (string) and age (integer)      │
│   • required marks name as mandatory                        │
│                          ↓                                  │
│   Output:                                                   │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ {                                                   │   │
│   │   "name": "Zhang San",                              │   │
│   │   "age": 28                                         │   │
│   │ }                                                   │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

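Note that prompt-based schema following is probabilistic, not a deterministic guarantee: the model can occasionally emit output that violates the schema. Constrained decoding (for example OpenAI's strict mode) enforces the structure, but responses can still be truncated or refused, and a schema cannot express every business rule. A validation layer is therefore still needed.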
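
As a minimal sketch of such a validation layer for prompt-based output, the snippet below parses a reply that may or may not be wrapped in a Markdown code fence and checks that the required keys are present. The function name parse_llm_json and the sample reply are illustrative assumptions; a production system would follow this with full schema validation.

```python
import json
import re

# Build the fence marker programmatically so this snippet can itself sit inside a fenced block.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def parse_llm_json(text, required_keys=()):
    """Parse model output that may or may not be wrapped in a Markdown code fence."""
    match = FENCED_JSON.search(text)
    candidate = match.group(1) if match else text
    data = json.loads(candidate.strip())  # raises json.JSONDecodeError on malformed JSON
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data

reply = "Here is the result:\n" + FENCE + 'json\n{"name": "张三", "age": 28}\n' + FENCE
print(parse_llm_json(reply, required_keys=["name"]))
```

Both failure modes (malformed JSON and missing keys) surface as exceptions, which makes them easy to route into a retry or repair step.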

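3.2 JSON Schema Support on Mainstream Platforms

OpenAI

The response_format parameter has accepted a json_schema type since August 2024 (Structured Outputs), in addition to the older JSON mode: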

from openai import OpenAI
import json

client = OpenAI()

# 方法一：使用 chat.completions.create + response_format
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "提取用户反馈信息：商品质量很好，但发货太慢"}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "feedback_extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "sentiment": {
                        "type": "string",
                        "enum": ["positive", "negative", "neutral"]
                    },
                    "aspect": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "summary": {"type": "string"}
                },
                "required": ["sentiment", "aspect", "summary"],
                "additionalProperties": False
            }
        }
    }
)

result = json.loads(response.choices[0].message.content)
# Example of a possible result (actual model output will vary):
# {"sentiment": "positive", "aspect": ["质量", "发货"], "summary": "商品质量很好，但发货太慢"}

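Key parameters:

response_format.type: set to json_schema
json_schema.name: a name that identifies the schema
json_schema.schema: the JSON Schema definition itself
json_schema.strict: enables strict mode, in which the output is constrained to match the schema

Strict mode requirements:

Every object must set additionalProperties: false
required must list every property (express an optional field as a union with null, for example via anyOf)
Nesting depth and the total number of properties are limited (recursive schemas via $ref are allowed)
Every property must declare a concrete type (type cannot be omitted)
Only a subset of JSON Schema keywords is supported, so check the platform documentation before relying on a constraint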
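
The first two strict-mode requirements are mechanical, so an ordinary schema can be adapted automatically. The sketch below is one way to do that under stated assumptions: to_strict is a hypothetical helper, it treats any property missing from required as optional and wraps it in anyOf with null, and it overwrites any existing additionalProperties value. It does not check keyword support or nesting limits.

```python
import copy

def to_strict(schema):
    """Return a copy of schema adjusted to the strict-mode rules listed above."""
    strict = copy.deepcopy(schema)

    def visit(node):
        if isinstance(node, dict):
            if node.get("type") == "object" and isinstance(node.get("properties"), dict):
                already_required = set(node.get("required", []))
                for name, prop in list(node["properties"].items()):
                    if name not in already_required:
                        # Optional field: keep it required but allow null.
                        node["properties"][name] = {"anyOf": [prop, {"type": "null"}]}
                node["required"] = list(node["properties"])
                node["additionalProperties"] = False
            for value in list(node.values()):
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(strict)
    return strict

feedback_schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string"},
        "aspect": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["sentiment"],
}

strict_schema = to_strict(feedback_schema)
print(strict_schema["required"])              # ['sentiment', 'aspect']
print(strict_schema["additionalProperties"])  # False
```

With this shape the model must always emit aspect, but it may set it to null when nothing applies, which preserves the "optional" meaning for downstream code.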

The second method uses the SDK's parse helper, which accepts a Pydantic model directly:

# 方法二：使用 beta.chat.completions.parse（支持自动 Pydantic 解析）
from pydantic import BaseModel
from typing import Literal

class FeedbackExtraction(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    summary: str

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "提取用户反馈信息：商品质量很好，但发货太慢"}
    ],
    response_format=FeedbackExtraction
)

# 直接得到 Pydantic 模型对象
result = response.choices[0].message.parsed
print(result.sentiment)  # "positive"

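Anthropic

With Anthropic, structured output is commonly obtained through the tool use mechanism: define a tool whose input_schema is the desired structure and have the model call it: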

from anthropic import Anthropic

client = Anthropic()

# 定义结构化输出作为 tool
tools = [
    {
        "name": "extract_feedback",
        "description": "提取用户反馈信息",
        "input_schema": {
            "type": "object",
            "properties": {
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"]
                },
                "aspect": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "summary": {"type": "string"}
            },
            "required": ["sentiment", "summary"]
        }
    }
]

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_feedback"},  # force the model to call this tool
    messages=[
        {"role": "user", "content": "提取用户反馈信息：商品质量很好，但发货太慢"}
    ]
)

# 解析 tool use 结果
for block in message.content:
    if block.type == "tool_use":
        result = block.input
        # result = {"sentiment": "positive", "aspect": ["质量", "发货"], ...}

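Note: Anthropic's input_schema is a JSON Schema object.

Google (Gemini)

Gemini supports structured output through the response_schema setting (responseSchema in the REST API), which accepts a subset of the OpenAPI 3.0 schema object rather than full JSON Schema: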

import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    generation_config={
        "response_mime_type": "application/json",
        "response_schema": {
            "type": "object",
            "properties": {
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"]
                },
                "aspect": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            },
            "required": ["sentiment", "summary"]
        }
    }
)

response = model.generate_content("提取用户反馈信息：商品质量很好，但发货太慢")

result = json.loads(response.text)

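3.3 Structured Output vs. Free Text

Dimension	Structured output	Free text
Predictability	Fixed output structure	Arbitrary format
Parsing cost	Direct deserialization	Needs NLP-style parsing
Error handling	Validation failures can be caught	Format errors are hard to detect
Model capability	Bounded by schema complexity	Model is fully unconstrained
Token efficiency	The schema consumes some tokens	No extra overhead
Flexibility	Structural changes require a schema update	Can be adjusted at any time
Typical uses	Data extraction, API responses, state machines	Creative writing, conversation, explanation

Practical advice:

When you need a precise structure (data extraction, program integration) → use structured output
When you need creative freedom (copywriting, analysis and explanation) → use free text
Hybrid approach: generate free text first, then run a structured extraction pass over it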

3.4 In Practice: Building a Complex JSON Schema Template

Let's build a more complex schema for analysing e-commerce reviews:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "ecommerce_review_analysis",
  "type": "object",
  "properties": {
    "review_id": {
      "type": "string",
      "description": "评论唯一标识"
    },
    "overall_sentiment": {
      "type": "string",
      "enum": ["positive", "neutral", "negative"]
    },
    "sentiment_score": {
      "type": "number",
      "minimum": -1.0,
      "maximum": 1.0,
      "description": "情感倾向分数，-1 为完全负面，1 为完全正面"
    },
    "aspects": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "aspect_name": {
            "type": "string",
            "enum": ["质量", "价格", "物流", "服务", "外观", "功能"]
          },
          "sentiment": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"]
          },
          "mentions": {
            "type": "array",
            "items": {"type": "string"},
            "description": "用户提到的具体表述"
          }
        },
        "required": ["aspect_name", "sentiment"]
      },
      "minItems": 1
    },
    "keywords": {
      "type": "array",
      "items": {"type": "string"},
      "maxItems": 10
    },
    "verification": {
      "type": "object",
      "properties": {
        "is_verified_purchase": {"type": "boolean"},
        "reviewer_type": {
          "type": "string",
          "enum": ["first_time_buyer", "repeat_buyer", "frequent_reviewer"]
        }
      }
    },
    "metadata": {
      "type": "object",
      "additionalProperties": {"type": "string"}
    }
  },
  "required": ["review_id", "overall_sentiment", "sentiment_score", "aspects"],
  "additionalProperties": false
}

3.5 Techniques for Nested Structures, Arrays, and Conditional Branches

Constraining nested objects

{
  "type": "object",
  "properties": {
    "user": {
      "type": "object",
      "properties": {
        "profile": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "avatar": {"type": "string", "format": "uri"}
          },
          "required": ["name"]
        },
        "settings": {
          "type": "object",
          "properties": {
            "theme": {"type": "string", "enum": ["light", "dark"]},
            "language": {"type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$"}
          }
        }
      },
      "required": ["profile"]
    }
  },
  "required": ["user"]
}

Constraining objects inside arrays

{
  "type": "object",
  "properties": {
    "order_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product_id": {"type": "string"},
          "quantity": {"type": "integer", "minimum": 1},
          "price": {"type": "number", "minimum": 0}
        },
        "required": ["product_id", "quantity", "price"]
      },
      "minItems": 1
    }
  },
  "required": ["order_items"]
}

Conditional branches (if-then-else)

Conditional logic is available from Draft 7 onwards:

{
  "type": "object",
  "properties": {
    "membership_type": {
      "type": "string",
      "enum": ["free", "basic", "premium"]
    },
    "premium_features": {
      "type": "object",
      "properties": {
        "priority_support": {"type": "boolean"},
        "max_storage_gb": {"type": "integer"}
      }
    }
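  },
  "if": {
    "properties": {
      "membership_type": {"const": "premium"}
    },
    "required": ["membership_type"]
  },
  "then": {
    "required": ["premium_features"]
  }
}

When membership_type is "premium", premium_features becomes required. The required entry inside if matters: without it, an object that omits membership_type would also satisfy the if schema, and premium_features would be demanded there too.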

Polymorphism with anyOf

{
  "type": "object",
  "properties": {
    "event_type": {"type": "string"},
    "event_data": {
      "anyOf": [
        {
          "type": "object",
          "properties": {
            "event_type": {"const": "click"},
            "element_id": {"type": "string"},
            "coordinates": {
              "type": "object",
              "properties": {
                "x": {"type": "number"},
                "y": {"type": "number"}
              }
            }
          },
          "required": ["element_id"]
        },
        {
          "type": "object",
          "properties": {
            "event_type": {"const": "submit"},
            "form_id": {"type": "string"},
            "fields": {
              "type": "object",
              "additionalProperties": {"type": "string"}
            }
          },
          "required": ["form_id"]
        }
      ]
    }
  },
  "required": ["event_type", "event_data"]
}

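Summary

This part covered:

The pain of uncontrollable LLM output and the value of structured output
The core concepts and common keywords of JSON Schema
The differences between JSON Schema drafts
The structured output capabilities of the main LLM platforms (OpenAI, Anthropic, Google)
Techniques for building complex JSON Schemas

The next part looks at Pydantic integration and type safety, and at how Python's type system simplifies structured output development.

Part 2: Pydantic Integration and Type Safety

IV. Pydantic Integration

4.1 Core Features of Pydantic v2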

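Pydantic is a Python data validation library. Version 2 was a major architectural rewrite; the Pydantic team reports validation that is roughly 5 to 50 times faster than v1, depending on the workload. Its core features include:

1. A Rust-based validation core (pydantic-core)

# Pydantic v2 delegates validation to a core written in Rust
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str

# Validation runs in pydantic-core rather than in pure Python
user = User(name="张三", age=28, email="zhangsan@example.com")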

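2. A new configuration system

from pydantic import BaseModel, ConfigDict

class User(BaseModel):
    model_config = ConfigDict(
        str_strip_whitespace=True,    # strip leading and trailing whitespace from strings
        validate_default=True,        # validate default values too
        frozen=True,                  # make instances immutable
        use_enum_values=True,         # store enum values rather than enum members on the model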
    )
    
    name: str
    role: str = "user"

3. Improved validation errors

from pydantic import BaseModel, Field, ValidationError

class User(BaseModel):
    name: str = Field(min_length=1)
    age: int
    email: str = Field(pattern=r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

try:
    User(name="", age="not_a_number", email="invalid")
except ValidationError as e:
    print(e.error_count())  # 3 errors
    for error in e.errors():
        print(f"{error['loc']}: {error['msg']}")

Output (exact wording may vary between Pydantic versions):

3
('name',): String should have at least 1 character
('age',): Input should be a valid integer, unable to parse string as an integer
('email',): String should match pattern '^[^@\s]+@[^@\s]+\.[^@\s]+$'

4.2 Generating JSON Schema from a Pydantic Model

Pydantic v2 has built-in JSON Schema generation, which is the key to integrating with LLMs:

import json
from typing import Literal

from pydantic import BaseModel, Field

class FeedbackAnalysis(BaseModel):
    """用户反馈分析结果"""
    
    sentiment: Literal["positive", "negative", "neutral"]
    sentiment_score: float = Field(
        ge=-1.0, le=1.0,
        description="情感倾向分数，-1 为完全负面，1 为完全正面"
    )
    aspects: list[str] = Field(min_length=1, description="涉及的方面")
    keywords: list[str] = Field(max_length=10, description="关键词")
    metadata: dict[str, str] | None = Field(default=None, description="附加信息")

# Generate the JSON Schema
schema = FeedbackAnalysis.model_json_schema()
print(json.dumps(schema, indent=2, ensure_ascii=False))

Output (key order may differ; model_json_schema() does not emit a $schema key, and the class docstring becomes the description):

{
  "title": "FeedbackAnalysis",
  "description": "用户反馈分析结果",
  "type": "object",
  "properties": {
    "sentiment": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "title": "Sentiment"
    },
    "sentiment_score": {
      "type": "number",
      "description": "情感倾向分数，-1 为完全负面，1 为完全正面",
      "minimum": -1.0,
      "maximum": 1.0,
      "title": "Sentiment Score"
    },
    "aspects": {
      "type": "array",
      "items": {"type": "string"},
      "description": "涉及的方面",
      "minItems": 1,
      "title": "Aspects"
    },
    "keywords": {
      "type": "array",
      "items": {"type": "string"},
      "description": "关键词",
      "maxItems": 10,
      "title": "Keywords"
    },
    "metadata": {
      "anyOf": [
        {
          "type": "object",
          "additionalProperties": {"type": "string"}
        },
        {"type": "null"}
      ],
      "description": "附加信息",
      "default": null,
      "title": "Metadata"
    }
  },
  "required": ["sentiment", "sentiment_score", "aspects", "keywords"]
}

Pydantic generates schemas that follow Draft 2020-12. The mode parameter does not change the draft; it selects whether the schema describes validation input or serialized output:

import json

from pydantic import BaseModel, ConfigDict

class User(BaseModel):
    model_config = ConfigDict(
        json_schema_extra={
            "examples": [{"name": "张三", "age": 28}]
        }
    )

    name: str
    age: int = 0

# mode="validation": schema for data accepted as input (the default)
# mode="serialization": schema for data produced by model_dump()
schema = User.model_json_schema(mode="validation")
print(json.dumps(schema, indent=2, ensure_ascii=False))

4.3 In Practice: OpenAI Function Calling + Pydantic

Combining Pydantic with OpenAI function calling gives an integration that is validated end to end:

from pydantic import BaseModel, Field
from typing import Literal
from openai import OpenAI

# 1. 定义 Pydantic 模型
class ExtractFeedback(BaseModel):
    """用户反馈信息提取"""
    
    sentiment: Literal["positive", "negative", "neutral"]
    sentiment_score: float = Field(
        ge=-1.0, le=1.0,
        description="情感倾向分数"
    )
    aspects: list[str] = Field(
        min_length=1,
        description="用户提到的方面（如：质量、价格、物流）"
    )
    keywords: list[str] = Field(
        max_length=10,
        description="关键词列表"
    )

# 2. 生成 OpenAI 工具定义
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_feedback",
            "description": "从用户反馈文本中提取结构化信息",
            "parameters": ExtractFeedback.model_json_schema()
        }
    }
]

# 3. 调用 OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "这款产品太棒了！质量非常好，物流也很快，就是价格稍微有点贵。非常推荐购买！"
        }
    ],
    tools=tools,
    tool_choice={
        "type": "function",
        "function": {"name": "extract_feedback"}
    }
)

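# 4. Parse and validate the result
tool_call = response.choices[0].message.tool_calls[0]
feedback_data = ExtractFeedback.model_validate_json(tool_call.function.arguments)

# feedback_data is now a validated, typed object
print(feedback_data.sentiment)        # e.g. positive
print(feedback_data.sentiment_score)  # e.g. 0.75
print(feedback_data.aspects)          # e.g. ['质量', '价格', '物流']

A more concise route: a higher-level library such as openai-agents or LangChain

Some libraries wrap these steps for you:

# Using the OpenAI Agents SDK (simplified sketch; pip install openai-agents)
from agents import Agent, Runner

agent = Agent(
    name="Feedback extractor",
    instructions="Extract structured feedback from the user's message.",
    model="gpt-4o",
    output_type=ExtractFeedback,
)

result = Runner.run_sync(agent, "这款产品太棒了！质量非常好...")
feedback = result.final_output  # an ExtractFeedback instance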

4.4 Data Validation and Error Handling

Even with a JSON Schema constraint, an LLM can still return data that does not match the schema. Pydantic validation is the last line of defence:

from pydantic import BaseModel, Field, ValidationError
from typing import Literal

class FeedbackAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: float = Field(ge=0, le=100)
    tags: list[str]

# 模拟 LLM 返回的数据（可能有问题）
llm_output = '''
{
  "sentiment": "positive",
  "score": "not_a_number",
  "tags": [1, 2, 3]
}
'''

try:
    result = FeedbackAnalysis.model_validate_json(llm_output)
except ValidationError as e:
    print("验证失败：")
    for error in e.errors():
        loc = ".".join(str(l) for l in error["loc"])
        print(f"  • {loc}: {error['msg']} (输入: {error['input']})")

Output:

验证失败：
  • score: Input should be a valid number, unable to parse string as a number (输入: not_a_number)
  • tags.0: Input should be a valid string (输入: 1)
  • tags.1: Input should be a valid string (输入: 2)
  • tags.2: Input should be a valid string (输入: 3)

A graceful error-handling strategy:

import json
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, ValidationError

class FeedbackAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: float

def extract_with_retry(client: OpenAI, prompt: str, max_retries: int = 3) -> FeedbackAnalysis:
    """Extract feedback, retrying when the reply is not valid."""

    for attempt in range(max_retries):
        try:
            # Call the LLM (arguments elided)...
            response = client.chat.completions.create(...)
            data = json.loads(response.choices[0].message.content)

            # Validate with Pydantic
            return FeedbackAnalysis.model_validate(data)

        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_retries - 1:
                raise RuntimeError(f"Extraction failed after {max_retries} attempts: {e}") from e
            # Repair logic or prompt adjustments can go here
            print(f"Attempt {attempt + 1} failed, retrying...")
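
One way to fill in the "repair logic" mentioned in the comment above is to feed the validation errors back to the model on the next attempt. The following is a minimal, library-free sketch of that idea: call_llm stands in for whatever function sends a prompt and returns the raw reply, and validate is any function that returns a list of error strings. Both are hypothetical placeholders, not real SDK calls.

```python
import json

def extract_with_feedback(call_llm, prompt, validate, max_retries=3):
    """Retry extraction, appending the previous errors to the prompt each time."""
    current_prompt = prompt
    errors = []
    for _ in range(max_retries):
        raw = call_llm(current_prompt)
        try:
            data = json.loads(raw)
            errors = validate(data)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        # Tell the model exactly what was wrong with its last reply.
        bullet_list = "\n".join(f"- {error}" for error in errors)
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected:\n{bullet_list}\n"
            "Return corrected JSON only."
        )
    raise RuntimeError(f"extraction failed after {max_retries} attempts: {errors}")

# Demo with a fake model that fails once and then succeeds.
replies = iter(["not json", '{"sentiment": "positive", "score": 0.8}'])
seen_prompts = []

def fake_llm(prompt_text):
    seen_prompts.append(prompt_text)
    return next(replies)

def require_sentiment(data):
    return [] if "sentiment" in data else ["missing field: sentiment"]

result = extract_with_feedback(fake_llm, "Analyse this review.", require_sentiment)
print(result)
```

In the demo the second prompt contains the "invalid JSON" message from the first attempt, which gives the model a concrete correction target instead of a blind retry.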

4.5 Custom Validators and Field Constraints

Pydantic offers a rich set of validators for complex business rules:

Field-level constraints

from typing import Annotated

from pydantic import BaseModel, EmailStr, Field, field_validator

class UserRegistration(BaseModel):
    username: str = Field(
        min_length=3,
        max_length=20,
        pattern=r"^[a-zA-Z][a-zA-Z0-9_]*$",
        description="用户名，以字母开头"
    )
    email: EmailStr  # requires the email-validator package (pip install "pydantic[email]")
    password: str = Field(min_length=8)
    age: int = Field(ge=18, le=120)
    phone: str = Field(pattern=r"^\+?1?\d{9,15}$")
    
    @field_validator("password")
    @classmethod
    def password_strength(cls, v: str) -> str:
        """验证密码强度"""
        if not any(c.isupper() for c in v):
            raise ValueError("密码必须包含大写字母")
        if not any(c.islower() for c in v):
            raise ValueError("密码必须包含小写字母")
        if not any(c.isdigit() for c in v):
            raise ValueError("密码必须包含数字")
        return v

# 使用 Annotated 简化复杂验证
Password = Annotated[str, Field(min_length=8)]

class LoginForm(BaseModel):
    username: str
    password: Password

Cross-field validation

from pydantic import BaseModel, model_validator

class Order(BaseModel):
    items: list[dict]
    discount: float = 0
    total: float
    
    @model_validator(mode="after")
    def check_totals(self) -> "Order":
        """验证折扣和总价的一致性"""
        calculated_total = sum(item["price"] * item["quantity"] for item in self.items)
        expected_total = calculated_total - self.discount
        
        if abs(expected_total - self.total) > 0.01:
            raise ValueError(
                f"总价 {self.total} 与计算值 {expected_total:.2f} 不符"
            )
        return self

Custom types

from typing import Any

from pydantic import BaseModel, GetCoreSchemaHandler
from pydantic_core import core_schema

class PhoneNumber(str):
    """自定义手机号类型（Pydantic v2 方式）"""
    
    @classmethod
    def validate(cls, v: Any) -> "PhoneNumber":
        if isinstance(v, str):
            # 清理格式
            cleaned = v.replace(" ", "").replace("-", "")
            if cleaned.startswith("+86"):
                cleaned = cleaned[3:]
            if len(cleaned) == 11 and cleaned.isdigit():
                return cls(cleaned)
        raise ValueError("无效的手机号格式")
    
    @classmethod
    def __get_pydantic_core_schema__(
        cls, source: type[Any], handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        return core_schema.no_info_after_validator_function(
            cls.validate,
            core_schema.str_schema(),
            serialization=core_schema.to_string_ser_schema()
        )

class Contact(BaseModel):
    name: str
    phone: PhoneNumber

# 现在 phone 会自动验证格式
contact = Contact(name="张三", phone="13812345678")
print(contact.phone)  # 13812345678 (自动转换为 PhoneNumber)

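V. Type Safety

5.1 Mapping Between Python Types and JSON Schema

A core strength of Pydantic is that it turns Python type annotations into JSON Schema while also providing runtime validation:

Python type	JSON Schema keyword	Example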
`str`	`string`	`"type": "string"`
`int`	`integer`	`"type": "integer"`
`float`	`number`	`"type": "number"`
`bool`	`boolean`	`"type": "boolean"`
`None`	`null`	`"type": "null"`
`list[T]`	`array`	`"type": "array", "items": {...}`
`dict[K, V]`	`object`	`"type": "object", "additionalProperties": {...}`
`Literal["a", "b"]`	`enum`	`"enum": ["a", "b"]`
`Union[A, B]`	`anyOf`	`"anyOf": [{...}, {...}]`
`Optional[T]`	`anyOf`	`"anyOf": [{"type": "T"}, {"type": "null"}]`
`Enum`	`enum`	`"enum": ["A", "B", "C"]`
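
The mapping in the table can be reproduced in a few lines with the standard typing introspection helpers. The sketch below is only an illustration of the idea, not how Pydantic implements it; to_schema is a hypothetical function that covers the primitive, list, dict, Literal, and Union/Optional rows and nothing else.

```python
from typing import Literal, Optional, Union, get_args, get_origin

PRIMITIVES = {
    str: "string", int: "integer", float: "number",
    bool: "boolean", type(None): "null",
}

def to_schema(tp):
    """Translate a small set of Python type hints into JSON Schema fragments."""
    if tp in PRIMITIVES:
        return {"type": PRIMITIVES[tp]}
    origin, args = get_origin(tp), get_args(tp)
    if origin is list:
        return {"type": "array", "items": to_schema(args[0])}
    if origin is dict:
        # JSON object keys are always strings, so only the value type is described.
        return {"type": "object", "additionalProperties": to_schema(args[1])}
    if origin is Literal:
        return {"enum": list(args)}
    if origin is Union:  # Optional[T] is Union[T, None]
        return {"anyOf": [to_schema(arg) for arg in args]}
    raise TypeError(f"unsupported type: {tp!r}")

print(to_schema(Optional[str]))
# {'anyOf': [{'type': 'string'}, {'type': 'null'}]}
print(to_schema(dict[str, list[int]]))
# {'type': 'object', 'additionalProperties': {'type': 'array', 'items': {'type': 'integer'}}}
```

Because the function is recursive, nested annotations such as dict[str, list[int]] fall out of the same handful of rules.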

Complex type mapping example

from typing import Union, Optional, Literal
from pydantic import BaseModel

class ComplexTypes(BaseModel):
    # Union → anyOf
    string_or_number: Union[str, int]
    
    # Optional → anyOf + null
    optional_string: Optional[str]
    
    # Literal → enum
    status: Literal["pending", "active", "completed"]
    
    # list[T] → array with items
    tags: list[str]
    
    # dict → object with additionalProperties
    metadata: dict[str, str]
    
    # 嵌套模型
    nested: "NestedModel"
    
    # 递归类型（需用 typing 延迟引用）
    tree: Optional["TreeNode"] = None

class NestedModel(BaseModel):
    value: int

class TreeNode(BaseModel):
    value: str
    children: list["TreeNode"] = []

ComplexTypes.model_rebuild()  # 解析延迟引用

Generated schema (simplified: title and default entries are omitted, and the top-level required list is not shown):

{
  "type": "object",
  "properties": {
    "string_or_number": {
      "anyOf": [{"type": "string"}, {"type": "integer"}]
    },
    "optional_string": {
      "anyOf": [{"type": "string"}, {"type": "null"}]
    },
    "status": {
      "type": "string",
      "enum": ["pending", "active", "completed"]
    },
    "tags": {
      "type": "array",
      "items": {"type": "string"}
    },
    "metadata": {
      "type": "object",
      "additionalProperties": {"type": "string"}
    },
    "nested": {"$ref": "#/$defs/NestedModel"},
    "tree": {
      "anyOf": [{"$ref": "#/$defs/TreeNode"}, {"type": "null"}]
    }
  },
  "$defs": {
    "NestedModel": {
      "type": "object",
      "properties": {"value": {"type": "integer"}},
      "required": ["value"]
    },
    "TreeNode": {
      "type": "object",
      "properties": {
        "value": {"type": "string"},
        "children": {
          "type": "array",
          "items": {"$ref": "#/$defs/TreeNode"}
        }
      },
      "required": ["value"]
    }
  }
}

5.2 Handling Generics and Union Types

Generic models

import json
from typing import Generic, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class ApiResponse(BaseModel, Generic[T]):
    """通用 API 响应包装器"""
    code: int
    message: str
    data: T

class User(BaseModel):
    name: str
    email: str

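# Parametrize the generic model
ResponseUser = ApiResponse[User]

# The generated schema includes the concrete type of data
schema = ResponseUser.model_json_schema()
print(json.dumps(schema, indent=2))

Output (simplified: title and description entries are omitted; the nested model is emitted under $defs and referenced with $ref):

{
  "$defs": {
    "User": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"}
      },
      "required": ["name", "email"]
    }
  },
  "type": "object",
  "properties": {
    "code": {"type": "integer"},
    "message": {"type": "string"},
    "data": {"$ref": "#/$defs/User"}
  },
  "required": ["code", "message", "data"]
}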

Validating union types

from pydantic import BaseModel
from typing import Union

class Cat(BaseModel):
    pet_type: str = "cat"
    meow_volume: int

class Dog(BaseModel):
    pet_type: str = "dog"
    bark_volume: int

class Zoo(BaseModel):
    # Union 会生成 anyOf，支持多种类型
    pet: Union[Cat, Dog]

# Cat
zoo1 = Zoo(pet={"pet_type": "cat", "meow_volume": 5})

# Dog
zoo2 = Zoo(pet={"pet_type": "dog", "bark_volume": 10})

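# Validation error: the input has neither meow_volume nor bark_volume.
# Because pet_type is a plain str here, the value "bird" alone would not be rejected.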
try:
    Zoo(pet={"pet_type": "bird", "wing_span": 20})
except Exception as e:
    print(e)

Discriminated unions

Pydantic v2 supports discriminated unions, which select the member type from a tag field and produce more precise errors:

from pydantic import BaseModel, Field
from typing import Union, Annotated, Literal
import json

class Cat(BaseModel):
    pet_type: Literal["cat"] = "cat"
    meow_volume: int

class Dog(BaseModel):
    pet_type: Literal["dog"] = "dog"
    bark_volume: int

# Pydantic v2 使用 Literal 判别字段自动实现 discriminated union
Zoo = Annotated[
    Union[Cat, Dog],
    Field(discriminator="pet_type")
]

class ZooContainer(BaseModel):
    pet: Zoo

# 解析时会根据 pet_type 自动选择正确的类型
data = json.loads('{"pet": {"pet_type": "cat", "meow_volume": 5}}')
zoo = ZooContainer.model_validate(data)
print(type(zoo.pet).__name__)  # Cat
print(zoo.pet.meow_volume)     # 5
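
Conceptually, a discriminator replaces "try every member of the union" with a single lookup on the tag field. The sketch below illustrates that idea with plain dataclasses; it is not Pydantic's implementation, and parse_pet and PET_TYPES are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Cat:
    meow_volume: int

@dataclass
class Dog:
    bark_volume: int

# The discriminator value maps directly to the class to construct.
PET_TYPES = {"cat": Cat, "dog": Dog}

def parse_pet(data):
    """Pick the target class from pet_type, then build it from the remaining fields."""
    payload = dict(data)  # copy so the caller's dict is not modified
    tag = payload.pop("pet_type", None)
    if tag not in PET_TYPES:
        raise ValueError(f"unknown pet_type {tag!r}; expected one of {sorted(PET_TYPES)}")
    return PET_TYPES[tag](**payload)

print(parse_pet({"pet_type": "cat", "meow_volume": 5}))  # Cat(meow_volume=5)

try:
    parse_pet({"pet_type": "bird", "wing_span": 20})
except ValueError as exc:
    print(exc)
```

The error for an unknown tag names the tag itself, rather than listing every field missing from every candidate type, which is the main practical benefit of a discriminator.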

5.3 Integrating Static Type Checkers

Checking Pydantic models with mypy or pyright catches type errors during development:

mypy configuration

# demo.py
from pydantic import BaseModel
from typing import Literal

class Feedback(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: float

def process_feedback(feedback: Feedback) -> str:
    return f"{feedback.sentiment}: {feedback.score}"

# 错误的调用 - mypy 会报错
result = process_feedback("not a feedback")  # Error!

toml

# pyproject.toml
[tool.mypy]
python_version = "3.11"
strict = true
plugins = ["pydantic.mypy"]

Run mypy:

bash

$ mypy demo.py
demo.py:13: error: Argument 1 to "process_feedback" has incompatible type "str"; expected "Feedback"  [arg-type]
Found 1 error in 1 file (checked 1 source file)

pyright configuration

pyright understands Pydantic models out of the box through the dataclass_transform mechanism (PEP 681), so no plugin or Pydantic-specific option is needed:

json

// pyrightconfig.json
{
  "include": ["src"],
  "typeCheckingMode": "strict"
}

The complete type-safety flow

┌─────────────────────────────────────────────────────────────────────┐
│                    端到端类型安全流程                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. 定义模型 ─────────────────────────────────────────────────▶    │
│     ┌──────────────────────────────────────────┐                    │
│     │ class Feedback(BaseModel):               │                    │
│     │     sentiment: Literal["pos", "neg"]    │                    │
│     │     score: float                         │                    │
│     └──────────────────────────────────────────┘                    │
│                          ↓                                          │
│  2. 类型检查 ─────────────────────────────────────────────────▶    │
│     $ pyright / mypy → 开发时发现类型错误                             │
│                          ↓                                          │
│  3. 生成 Schema ──────────────────────────────────────────────▶    │
│     schema = Feedback.model_json_schema()                          │
│                          ↓                                          │
│  4. LLM 调用 ──────────────────────────────────────────────────▶    │
│     OpenAI(response_format=schema)                                 │
│                          ↓                                          │
│  5. 运行时验证 ─────────────────────────────────────────────────▶   │
│     result = Feedback.model_validate_json(llm_output)              │
│                          ↓                                          │
│  6. 类型安全的业务逻辑 ──────────────────────────────────────────▶  │
│     # result 是 Feedback 类型，IDE 提供完整补全                      │
│     if result.sentiment == "positive": ...                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

5.4 端到端类型安全示例

python

"""
完整的端到端类型安全示例
"""
from pydantic import BaseModel, Field
from typing import Literal, TypeAlias
from openai import OpenAI
import json

# ============ 1. 定义领域模型 ============

class SentimentAnalysis(BaseModel):
    """情感分析结果"""
    overall: Literal["positive", "negative", "neutral"]
    score: float = Field(description="-1 到 1 的情感分数", ge=-1.0, le=1.0)

class AspectSentiment(BaseModel):
    """方面的情感分析"""
    aspect: str
    sentiment: Literal["positive", "negative", "neutral"]
    evidence: list[str] = Field(description="支撑该情感的原文片段")

class FeedbackAnalysis(BaseModel):
    """完整的反馈分析结果"""
    summary: str = Field(max_length=200)
    sentiment: SentimentAnalysis
    aspects: list[AspectSentiment] = Field(min_length=1)
    keywords: list[str] = Field(max_length=10)

# ============ 2. 生成 OpenAI Schema ============

def get_analysis_tool():
    return {
        "type": "function",
        "function": {
            "name": "analyze_feedback",
            "description": "分析用户反馈文本，提取结构化信息",
            "parameters": FeedbackAnalysis.model_json_schema()
        }
    }

# ============ 3. LLM 调用函数 ============

def analyze_feedback(client: OpenAI, text: str) -> FeedbackAnalysis:
    """分析用户反馈，返回类型安全的对象"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system", 
                "content": "你是一个专业的用户反馈分析助手。请分析以下反馈并提取结构化信息。"
            },
            {"role": "user", "content": text}
        ],
        tools=[get_analysis_tool()],
        tool_choice={"type": "function", "function": {"name": "analyze_feedback"}}
    )
    
    # 解析并验证
    tool_call = response.choices[0].message.tool_calls[0]
    return FeedbackAnalysis.model_validate_json(tool_call.function.arguments)

# ============ 4. 使用类型安全的结果 ============

def generate_report(analysis: FeedbackAnalysis) -> str:
    """生成分析报告 - 享受完整的类型提示"""
    
    # IDE 会提示 sentiment.overall 的字面量类型
    sentiment_emoji = {
        "positive": "😊",
        "negative": "😞",
        "neutral": "😐"
    }
    
    report = [
        f"## 反馈摘要",
        f"{sentiment_emoji[analysis.sentiment.overall]} {analysis.summary}",
        f"",
        f"**情感得分**: {analysis.sentiment.score:.2f}",
        f"",
        f"## 方面分析"
    ]
    
    for aspect in analysis.aspects:
        # IDE 知道 aspect.sentiment 只能是三个值之一
        report.append(f"- **{aspect.aspect}**: {aspect.sentiment}")
        for evidence in aspect.evidence:
            report.append(f"  - \"{evidence}\"")
    
    report.append(f"")
    report.append(f"**关键词**: {', '.join(analysis.keywords)}")
    
    return "\n".join(report)

# ============ 5. 主流程 ============

if __name__ == "__main__":
    client = OpenAI()
    
    feedback_text = """
    这款产品太让人失望了！质量很差，用了三天就坏了。
    客服态度也不好，打了三次电话都没解决。
    不过价格确实很便宜，如果只是临时用用可以考虑。
    """
    
    # 完整的类型安全流程
    result = analyze_feedback(client, feedback_text)
    report = generate_report(result)
    
    print(report)
    # result 是 FeedbackAnalysis 类型
    # result.sentiment.overall 是 Literal["positive", "negative", "neutral"]
    # IDE 提供完整的自动补全和类型检查
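The validation step of this flow can be exercised without any network call. The sketch below feeds two hand-written argument strings (hypothetical stand-ins for `tool_call.function.arguments`) into a trimmed copy of the SentimentAnalysis model to show what passes and what is rejected:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class SentimentAnalysis(BaseModel):
    overall: Literal["positive", "negative", "neutral"]
    score: float = Field(ge=-1.0, le=1.0)


# Hypothetical tool-call argument strings, written by hand for illustration.
good_args = '{"overall": "negative", "score": -0.7}'
bad_args = '{"overall": "angry", "score": 3}'

parsed = SentimentAnalysis.model_validate_json(good_args)
print(parsed)

try:
    SentimentAnalysis.model_validate_json(bad_args)
    failed_fields = []
except ValidationError as exc:
    # Collect the top-level field name of every reported error.
    failed_fields = sorted(str(err["loc"][0]) for err in exc.errors())
print(failed_fields)
```

Keeping a few canned payloads like these in a test suite makes it possible to check the model's constraints independently of the LLM.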

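5.5 Type-safe JSON Schema workflows in TypeScript

Front-end developers need type safety too. The TypeScript ecosystem offers several tools for generating types from JSON Schema:

Option 1: quicktype (recommended)

bash

# Install
npm install -g quicktype

# Generate TypeScript types from a JSON Schema file
quicktype --src-lang schema --lang typescript --just-types --prefer-unions \
  --top-level FeedbackAnalysis -o FeedbackAnalysis.ts schema.json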

Input schema.json:

json

{
  "type": "object",
  "properties": {
    "sentiment": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"]
    },
    "score": {"type": "number"}
  },
  "required": ["sentiment", "score"]
}

Output (approximate; details vary with the quicktype version and options):

typescript

export type Sentiment = "positive" | "negative" | "neutral";

export interface FeedbackAnalysis {
  sentiment: Sentiment;
  score: number;
}
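The schema.json consumed by these generators does not have to be written by hand. As a minimal sketch, assuming the Python side already owns the model, the file content can be produced from Pydantic and then passed to any of the TypeScript tools (the output file name is only an example):

```python
import json
from typing import Literal

from pydantic import BaseModel


class FeedbackAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: float


schema = FeedbackAnalysis.model_json_schema()
schema_text = json.dumps(schema, indent=2, ensure_ascii=False)
print(schema_text)

# To hand the schema to a code generator, write it to disk, for example:
# from pathlib import Path
# Path("schema.json").write_text(schema_text, encoding="utf-8")
```

Generating the file from one source model keeps the Python and TypeScript types from drifting apart.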

Option 2: json-schema-to-typescript

bash

npm install json-schema-to-typescript

typescript

import { compile } from 'json-schema-to-typescript';

const schema = {
  type: 'object',
  properties: {
    sentiment: { type: 'string', enum: ['positive', 'negative', 'neutral'] },
    score: { type: 'number' }
  },
  required: ['sentiment', 'score']
};

const ts = await compile(schema, 'Feedback');
console.log(ts);

Option 3: Zod (runtime validation + type inference)

typescript

import { z } from 'zod';

// 定义 Schema
const SentimentSchema = z.enum(['positive', 'negative', 'neutral']);

const FeedbackSchema = z.object({
  sentiment: SentimentSchema,
  score: z.number().min(-1).max(1),
  aspects: z.array(z.object({
    aspect: z.string(),
    sentiment: SentimentSchema
  })).min(1)
});

// 类型推断
type Feedback = z.infer<typeof FeedbackSchema>;

// 验证
const parseFeedback = (data: unknown) => {
  return FeedbackSchema.parse(data);
};

// 与 LLM 响应集成
const llmResponse = await llm.analyze(text);
const feedback = parseFeedback(JSON.parse(llmResponse));
// feedback 现在是类型安全的 Feedback

Front-end usage example

typescript

// api/feedback.ts
import { z } from 'zod';

export const FeedbackSchema = z.object({
  summary: z.string().max(200),
  sentiment: z.object({
    overall: z.enum(['positive', 'negative', 'neutral']),
    score: z.number().min(-1).max(1)
  }),
  aspects: z.array(z.object({
    aspect: z.string(),
    sentiment: z.enum(['positive', 'negative', 'neutral']),
    evidence: z.array(z.string())
  })).min(1),
  keywords: z.array(z.string()).max(10)
});

export type Feedback = z.infer<typeof FeedbackSchema>;

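// API call
async function analyzeFeedback(text: string): Promise<Feedback> {
  const response = await fetch('/api/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  });

  // Fail early on HTTP errors instead of trying to parse an error page
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const data: unknown = await response.json();

  // Validate the response payload (defensive programming)
  return FeedbackSchema.parse(data);
}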

// 组件使用
function FeedbackCard({ feedback }: { feedback: Feedback }) {
  // TypeScript 知道 feedback.sentiment.overall 只能是三个值
  // IDE 提供完整的类型提示
  return (
    <div className={`sentiment-${feedback.sentiment.overall}`}>
      <h3>{feedback.summary}</h3>
      <span>Score: {feedback.sentiment.score}</span>
      {feedback.aspects.map(aspect => (
        <div key={aspect.aspect}>
          {aspect.aspect}: {aspect.sentiment}
        </div>
      ))}
    </div>
  );
}

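Summary

This part covered:

Pydantic v2's core features and performance improvements
Generating JSON Schema automatically from Pydantic models
Integrating OpenAI function calling with Pydantic in practice
Data validation and error-handling strategies
Custom validators and Field constraints
How Python's type system maps to JSON Schema
Handling generics and union types
Integrating static type checkers (mypy, pyright)
The complete end-to-end type-safety flow
Type-safety options in TypeScript

The next part covers common pitfalls and debugging, followed by best practices.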

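Part 3: Common pitfalls and debugging, best practices

6. Common pitfalls and debugging

Pitfall 1: an overly complex schema

Problem: trying to cover every possible edge case in a single schema leads to deep nesting and a long list of constraints.

Symptoms:

LLM output frequently fails to conform to the schema
The validation error rate stays high
The schema is hard to maintain and understand

Example:

json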

{
  "type": "object",
  "properties": {
    "data": {
      "type": "object",
      "properties": {
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "nested": {
                "type": "object",
                "properties": {
                  "deep": {
                    "type": "object",
                    "properties": {
                      "value": {"type": "string"}
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

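Solutions:

Flatten the structure: keep nesting levels to a minimum

python

from pydantic import BaseModel

# Not recommended: an untyped dict hides whatever nesting sits inside it,
# so the schema gives the model no guidance and nothing in it is validated
class DeepNested(BaseModel):
    data: dict

# Recommended: small, explicitly typed models with shallow nesting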
class Item(BaseModel):
    value: str

class ItemsData(BaseModel):
    items: list[Item]

class Root(BaseModel):
    data: ItemsData

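Extract in steps: split a complex task into several simple steps

python

from pydantic import BaseModel

# Not recommended: one schema that performs every extraction at once
class AllInOne(BaseModel):
    entities: list[dict]
    relations: list[dict]
    sentiment: dict
    summary: str
    topics: list[str]

# Recommended: extract in steps
class EntityExtract(BaseModel):
    entities: list[dict]

class RelationExtract(BaseModel):
    relations: list[dict]

# One call per step. `agent` is a hypothetical LLM client wrapper, not a specific library API.
entities = agent.run(text, result_type=EntityExtract)
relations = agent.run(text, result_type=RelationExtract)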

Use composition: break a complex schema into reusable building blocks

python

from pydantic import BaseModel

# Reusable components
class TimestampMixin(BaseModel):
    created_at: str
    updated_at: str

class PaginationMixin(BaseModel):
    page: int = 1
    page_size: int = 20
    total: int

# Pydantic models support multiple inheritance, so the mixin fields
# are inherited directly instead of being copied by hand
class Response(TimestampMixin, PaginationMixin):
    data: list[dict]

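Pitfall 2: type mismatches

Problem: a Python annotation that is too loose or simply wrong maps to a JSON Schema that does not describe the intended data, which leads to validation failures.

Common mistakes:

python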

from pydantic import BaseModel

# 错误：使用 list 而不是 list[str]
class WrongType(BaseModel):
    tags: list  # 缺少元素类型

# 正确：明确指定元素类型
class CorrectType(BaseModel):
    tags: list[str]

# 错误：dict 没有指定键值类型
class WrongDict(BaseModel):
    metadata: dict  # 不够具体

# 正确：指定键值类型
class CorrectDict(BaseModel):
    metadata: dict[str, str]  # 键和值都是字符串

# 错误：使用 int 而不是 float
class WrongNumber(BaseModel):
    score: int  # 分数可能是小数

# 正确：使用 float
class CorrectNumber(BaseModel):
    score: float  # 可以是 0.5, 0.75 等

Type mapping cheat sheet:

┌─────────────────────────────────────────────────────────────────────┐
│                    Python → JSON Schema 类型映射                     │
├──────────────────────┬──────────────────────────────────────────────┤
│ Python 类型          │ JSON Schema 类型                              │
├──────────────────────┼──────────────────────────────────────────────┤
│ str                  │ "type": "string"                             │
│ int                  │ "type": "integer"                            │
│ float                │ "type": "number"                             │
│ bool                 │ "type": "boolean"                            │
│ list[str]            │ "type": "array", "items": {"type": "string"}│
│ dict[str, str]       │ "type": "object", "additionalProperties":    │
│                      │     {"type": "string"}                       │
│ Literal["a", "b"]    │ "type": "string", "enum": ["a", "b"]        │
│ Optional[str]        │ "anyOf": [{"type": "string"}, {"type":       │
│                      │     "null"}]                                 │
│ Union[str, int]      │ "anyOf": [{"type": "string"}, {"type":       │
│                      │     "integer"}]                              │
└──────────────────────┴──────────────────────────────────────────────┘
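The table can be checked directly against the installed Pydantic version. The sketch below uses `TypeAdapter(...).json_schema()` to print the schema generated for each annotation; the dictionary keys are just labels chosen for this example:

```python
from typing import Literal, Optional, Union

from pydantic import TypeAdapter

# Labels on the left are only for display; the types on the right are what gets converted.
types_to_check = {
    "str": str,
    "int": int,
    "float": float,
    "bool": bool,
    "list[str]": list[str],
    "dict[str, str]": dict[str, str],
    'Literal["a", "b"]': Literal["a", "b"],
    "Optional[str]": Optional[str],
    "Union[str, int]": Union[str, int],
}

schemas = {
    label: TypeAdapter(tp).json_schema() for label, tp in types_to_check.items()
}

for label, generated in schemas.items():
    print(f"{label:20} -> {generated}")
```

Running this after a Pydantic upgrade is a cheap way to notice changes in schema generation before they reach a prompt or tool definition.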

Pitfall 3: the additionalProperties trap

Problem: a badly chosen additionalProperties setting makes validation either fail unexpectedly or become too permissive.

Scenario analysis:

python

from pydantic import BaseModel, ConfigDict, ValidationError

# 场景1：设置为 false，但 LLM 添加了额外字段
class StrictModel(BaseModel):
    model_config = ConfigDict(extra="forbid")
    
    name: str
    age: int

# LLM 输出包含了 "nickname" 字段
llm_output = '{"name": "张三", "age": 28, "nickname": "小张"}'

try:
    StrictModel.model_validate_json(llm_output)
except ValidationError as e:
    print(e)

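Output (the trailing documentation-link line is omitted):

1 validation error for StrictModel
nickname
  Extra inputs are not permitted [type=extra_forbidden, input_value='小张', input_type=str]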

Solutions:

State the extra-field behavior explicitly when designing the schema

python

# Pydantic v2 configuration
from pydantic import BaseModel, ConfigDict

class ForbidExtra(BaseModel):
    """严格模式：不允许额外字段"""
    model_config = ConfigDict(extra="forbid")
    name: str

class AllowExtra(BaseModel):
    """宽松模式：允许额外字段"""
    model_config = ConfigDict(extra="allow")
    name: str

class IgnoreExtra(BaseModel):
    """忽略模式：丢弃额外字段"""
    model_config = ConfigDict(extra="ignore")
    name: str
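The difference between the three modes is easiest to see by running the same payload through each of them. This is a minimal sketch; the payload values are made up for illustration:

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class ForbidExtra(BaseModel):
    model_config = ConfigDict(extra="forbid")
    name: str


class AllowExtra(BaseModel):
    model_config = ConfigDict(extra="allow")
    name: str


class IgnoreExtra(BaseModel):
    model_config = ConfigDict(extra="ignore")
    name: str


payload = {"name": "Zhang San", "nickname": "Xiao Zhang"}

allowed = AllowExtra.model_validate(payload).model_dump()   # extra field is kept
ignored = IgnoreExtra.model_validate(payload).model_dump()  # extra field is dropped

try:
    ForbidExtra.model_validate(payload)
    forbid_error = None
except ValidationError as exc:
    forbid_error = exc.errors()[0]["type"]

# extra="forbid" is also reflected in the generated JSON Schema.
forbid_schema_flag = ForbidExtra.model_json_schema().get("additionalProperties")

print(allowed, ignored, forbid_error, forbid_schema_flag)
```

For LLM output, "ignore" is often the forgiving choice at parse time, while "forbid" is useful when the schema is also sent to a provider that enforces additionalProperties.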

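Control the generated schema with json_schema_extra

python

from pydantic import BaseModel, ConfigDict

class Feedback(BaseModel):
    sentiment: str

    # json_schema_extra only changes the generated schema; runtime validation
    # still follows `extra` (default "ignore"). With extra="forbid", Pydantic
    # emits "additionalProperties": false on its own.
    model_config = ConfigDict(
        json_schema_extra={
            "additionalProperties": False
        }
    )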

Repair strategy: pre-process the LLM output

python

import json

def sanitize_json(output: str, allowed_fields: list[str]) -> str:
    """移除不允许的字段"""
    data = json.loads(output)
    sanitized = {k: v for k, v in data.items() if k in allowed_fields}
    return json.dumps(sanitized, ensure_ascii=False)

# 使用
clean_output = sanitize_json(
    llm_output, 
    allowed_fields=["sentiment", "score", "aspects"]
)
result = Feedback.model_validate_json(clean_output)

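Pitfall 4: nesting too deep

Problem: once a schema nests beyond roughly 4-5 levels, LLMs tend to follow it noticeably less reliably.

Symptoms:

Deeply nested fields are frequently missing
Validation errors cluster in the deep structures
The model "forgets" to return some deep fields

Solutions:

Limit nesting depth

Suggested maximum nesting levels:
- Key business fields: ≤ 3 levels
- Auxiliary information: ≤ 4 levels
- Avoid going beyond 5 levels

Flatten with arrays

python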

# 不推荐：深层嵌套
class DeepNested(BaseModel):
    user: dict = {
        "profile": {
            "settings": {
                "privacy": {...}
            }
        }
    }

# 推荐：使用数组扁平化
class FlatModel(BaseModel):
    user_id: str
    profile_settings: list[dict]  # 展平为数组

Use named references ($ref)

json

{
  "type": "object",
  "properties": {
    "user": { "$ref": "#/$defs/UserProfile" }
  },
  "$defs": {
    "UserProfile": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "settings": { "$ref": "#/$defs/Settings" }
      }
    },
    "Settings": {...}
  }
}
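When the schema comes from Pydantic, this $defs/$ref layout is produced automatically for nested models. The sketch below mirrors the structure above; the `language` field inside Settings is a hypothetical placeholder, since the original leaves that definition elided:

```python
import json

from pydantic import BaseModel


class Settings(BaseModel):
    language: str  # hypothetical field, only here so the model is not empty


class UserProfile(BaseModel):
    name: str
    settings: Settings


class Account(BaseModel):
    user: UserProfile


schema = Account.model_json_schema()
print(json.dumps(schema, indent=2))

# Each nested model is emitted once under $defs and referenced by name.
defs = sorted(schema["$defs"])
user_property = json.dumps(schema["properties"]["user"])
```

Because each definition appears once, the visible depth of the top-level object stays shallow even when models reference one another.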

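Pitfall 5: enum values that do not match the training data

Problem: the enum values defined in the schema differ from how the model usually phrases the same concept, so it struggles to produce them exactly.

Example:

python

from enum import Enum
from pydantic import BaseModel

# Enum defined by the schema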
class StatusEnum(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"

class Task(BaseModel):
    status: StatusEnum

# 模型可能输出：
# - "Pending" (首字母大写)
# - "pending " (多余空格)
# - "进行中" (中文)
# - "processing..." (多余字符)

Solutions:

Use case-insensitive validation

python

from enum import Enum
from pydantic import BaseModel, field_validator

class StatusEnum(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"

class Task(BaseModel):
    status: StatusEnum
    
    @field_validator("status", mode="before")
    @classmethod
    def normalize_status(cls, v):
        if isinstance(v, str):
            # 标准化输入
            return v.strip().lower()
        return v
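A short usage check makes the limits of this normalization visible. The sketch repeats the model so it runs on its own, then validates one recoverable value and one that stripping and lowercasing cannot fix:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError, field_validator


class StatusEnum(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"


class Task(BaseModel):
    status: StatusEnum

    @field_validator("status", mode="before")
    @classmethod
    def normalize_status(cls, v):
        # Runs before enum validation, so it sees the raw input.
        if isinstance(v, str):
            return v.strip().lower()
        return v


# Extra whitespace and capitalization are normalized away.
task = Task.model_validate({"status": "  Pending "})
print(task.status)

# Trailing characters are not handled by strip/lower, so this is still rejected.
try:
    Task.model_validate({"status": "processing..."})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Values such as translated labels or decorated strings need an explicit alias mapping rather than generic normalization.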

Provide examples in the prompt

python

prompt = """
提取任务状态，使用以下枚举值之一：
- pending: 任务等待处理
- processing: 任务正在处理中
- completed: 任务已完成

示例输出：{"status": "pending", ...}
"""

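Accept known variants and map them to canonical values

python

from pydantic import BaseModel, field_validator

# Known variants (including Chinese labels) mapped to canonical values
STATUS_ALIASES = {
    "pending": "pending",
    "processing": "processing",
    "completed": "completed",
    "进行中": "processing",
    "完成": "completed",
}

class Task(BaseModel):
    status: str  # accept a plain string first

    @field_validator("status")
    @classmethod
    def validate_status(cls, v: str) -> str:
        key = v.strip().lower()
        if key not in STATUS_ALIASES:
            raise ValueError(f"invalid status: {v}")
        # Always return the canonical value, never the raw input
        return STATUS_ALIASES[key]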

Debugging toolchain

The jsonschema library

python

import json
from jsonschema import validate, ValidationError

# 定义 Schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
    },
    "required": ["name"]
}

# 待验证的数据
data = {"name": "张三", "age": "二十"}

# 验证
try:
    validate(instance=data, schema=schema)
except ValidationError as e:
    print(f"验证失败：{e.message}")
    print(f"路径：{'.'.join(str(p) for p in e.path)}")
    print(f"方案：{e.validator}")
    print(f"输入：{e.instance}")

Output:

验证失败：'二十' is not of type 'integer'
路径：age
方案：type
输入：二十

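Analyzing Pydantic's ValidationError

python

from pydantic import BaseModel, Field, ValidationError

class User(BaseModel):
    # A bare `str` would accept "" and "not-an-email"; constraints are needed to reject them
    name: str = Field(min_length=1)
    age: int
    email: str = Field(pattern=r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Simulated bad input
bad_data = {
    "name": "",  # empty string, violates min_length
    "age": "invalid",  # not a number
    "email": "not-an-email"  # does not match the pattern
}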

try:
    User.model_validate(bad_data)
except ValidationError as e:
    print(f"错误数量: {e.error_count()}")
    print("\n详细错误：")
    for error in e.errors():
        loc = ".".join(str(l) for l in error["loc"])
        msg = error["msg"]
        input_val = error["input"]
        ctx = error.get("ctx", {})
        
        print(f"  📍 {loc}")
        print(f"     错误: {msg}")
        print(f"     输入: {input_val}")
        if ctx:
            print(f"     上下文: {ctx}")
        print()

Debug logging

python

import logging
import json
from pydantic import BaseModel, ValidationError
from typing import Literal

# 配置日志
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

class AnalysisResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float

def debug_analysis(llm_output: str):
    """带调试信息的分析函数"""
    
    logger.debug(f"LLM 原始输出：{llm_output[:200]}...")
    
    # 尝试解析 JSON
    try:
        data = json.loads(llm_output)
        logger.debug(f"解析后的数据：{json.dumps(data, ensure_ascii=False)}")
    except json.JSONDecodeError as e:
        logger.error(f"JSON 解析失败：{e}")
        return None
    
    # 尝试验证
    try:
        result = AnalysisResult.model_validate(data)
        logger.info(f"验证成功：{result}")
        return result
    except ValidationError as e:
        logger.warning(f"验证失败：{e.error_count()} 个错误")
        for error in e.errors():
            logger.warning(f"  - {error['loc']}: {error['msg']}")
        
        # 尝试修复策略
        logger.info("尝试修复策略...")
        # ... 修复逻辑

Logging and problem-localization strategy

┌─────────────────────────────────────────────────────────────────────┐
│                    结构化输出调试流程                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Step 1: 记录输入                                                   │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ logger.debug(f"原始 prompt: {prompt[:100]}...")              │  │
│  │ logger.debug(f"LLM 原始输出: {raw_output[:200]}...")         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              ↓                                      │
│  Step 2: 记录解析                                                   │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ try:                                                          │  │
│  │     parsed = json.loads(raw_output)                          │  │
│  │     logger.debug(f"解析成功: {parsed}")                       │  │
│  │ except JSONDecodeError as e:                                 │  │
│  │     logger.error(f"JSON 解析失败: {e}")                       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              ↓                                      │
│  Step 3: 记录验证                                                   │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ try:                                                          │  │
│  │     result = Model.model_validate(parsed)                    │  │
│  │     logger.info(f"验证成功: {result}")                        │  │
│  │ except ValidationError as e:                                 │  │
│  │     for error in e.errors():                                 │  │
│  │         logger.warning(f"字段 {error['loc']}: {error['msg']}")│  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              ↓                                      │
│  Step 4: 分析问题模式                                               │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ # 统计常见错误类型                                            │  │
│  │ error_patterns = analyze_errors(validation_errors)           │  │
│  │ logger.info(f"错误模式: {error_patterns}")                    │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
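Step 4 refers to an `analyze_errors` helper that is not defined in the flow above. One possible shape for it, offered as an assumption rather than the intended implementation, is a counter over (field path, error type) pairs collected from many failed validations:

```python
from collections import Counter
from typing import Literal

from pydantic import BaseModel, ValidationError


class AnalysisResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float


def analyze_errors(validation_errors: list[ValidationError]) -> Counter:
    """Count (field path, error type) pairs across many failures."""
    counts: Counter = Counter()
    for exc in validation_errors:
        for err in exc.errors():
            loc = ".".join(str(part) for part in err["loc"])
            counts[(loc, err["type"])] += 1
    return counts


# Hand-written sample outputs standing in for logged LLM responses.
samples = [
    {"sentiment": "Positive", "confidence": 0.9},
    {"sentiment": "good", "confidence": "high"},
    {"confidence": 0.5},
]

collected = []
for sample in samples:
    try:
        AnalysisResult.model_validate(sample)
    except ValidationError as exc:
        collected.append(exc)

error_patterns = analyze_errors(collected)
for (loc, err_type), count in error_patterns.most_common():
    print(f"{loc}: {err_type} x{count}")
```

The most frequent pairs point at what to fix first: a recurring literal error on one field usually calls for a prompt example or a normalizing validator.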

Performance optimization

1. Reduce schema complexity

python

# 优化前：复杂 Schema
complex_schema = {
    "type": "object",
    "properties": {
        "deep": {
            "type": "object",
            "properties": {
                "nested": {
                    "type": "object",
                    "properties": {...}  # 5+ 层
                }
            }
        }
    }
}

# 优化后：简化 Schema
simple_schema = {
    "type": "object",
    "properties": {
        "flat_key": {"type": "string"}
    }
}

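2. Keep only the necessary fields to save tokens

python

from pydantic import BaseModel, ConfigDict

# Extract only the required fields
class MinimalResult(BaseModel):
    """Minimal result model"""
    # Fewer fields mean a smaller schema and shorter output, so fewer tokens
    status: str
    id: str

    # frozen=True makes instances immutable and hashable; it is not a speed optimization
    model_config = ConfigDict(frozen=True)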

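3. Cache the serialized schema

python

from functools import lru_cache
import json

from pydantic import BaseModel

class FeedbackModel(BaseModel):
    sentiment: str
    score: float

# Cache the schema string so it is generated and serialized only once
@lru_cache(maxsize=1)
def get_schema_str() -> str:
    # model_json_schema() returns a dict, so serialize it to match the return type
    return json.dumps(FeedbackModel.model_json_schema(), ensure_ascii=False)

# Use the cached schema
schema_str = get_schema_str()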

4. Validate multiple results in one batch

python

from pydantic import BaseModel, TypeAdapter

class Result(BaseModel):
    id: str
    value: float

# Build the adapter once and reuse it; constructing it is the expensive part
_results_adapter = TypeAdapter(list[Result])

def batch_validate(results: list[dict]) -> list[Result]:
    """Validate the whole list in a single call.

    Validation is CPU-bound, so a thread pool usually adds overhead
    rather than throughput; a single TypeAdapter call is used instead.
    """
    return _results_adapter.validate_python(results)

7. Best practices summary

Schema design principles

1. Minimalism

Include only the fields you need; do not try to cover every edge case:

python

# 推荐：最小化设计
class Feedback(BaseModel):
    sentiment: str
    summary: str

# 避免：过度设计
class OverEngineeredFeedback(BaseModel):
    sentiment: str
    summary: str
    sentiment_score: float
    confidence: float
    possible_intents: list[str]
    topics: list[str]
    entities: list[dict]
    # ... 30+ 字段

2. Explicitness

Use precise types and constraints:

python

# 推荐：明确类型
class PreciseFeedback(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: float = Field(ge=-1.0, le=1.0)
    keywords: list[str] = Field(max_length=5)

# 避免：模糊类型
class VagueFeedback(BaseModel):
    sentiment: str  # 不知道可取值
    score: float  # 范围未知
    keywords: list  # 元素类型未知

3. Consistency

Keep field naming and structure consistent:

python

# 统一的命名风格（蛇形）
class SnakeCaseModel(BaseModel):
    user_name: str
    created_at: str
    item_count: int

# 避免：混用风格
class MixedCaseModel(BaseModel):
    userName: str  # 驼峰
    created_at: str  # 蛇形
    itemCount: int  # 驼峰

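4. Testability

Make sure the schema can be tested on its own:

python

import pytest
from pydantic import BaseModel, Field, ValidationError, field_validator

# Named without a "Test" prefix so that pytest does not try to collect the model as a test class
class BoundedValue(BaseModel):
    value: int = Field(ge=0, le=100)

    @field_validator("value")
    @classmethod
    def value_must_be_sensible(cls, v: int) -> int:
        # Business rules go here. Raise ValueError instead of using assert,
        # because assert statements are skipped under `python -O`.
        if v < 0:
            raise ValueError("value must not be negative")
        return v

# Standalone unit test
def test_schema_validation():
    # Normal case
    assert BoundedValue(value=50)

    # Boundary cases
    assert BoundedValue(value=0)
    assert BoundedValue(value=100)

    # Invalid cases
    with pytest.raises(ValidationError):
        BoundedValue(value=-1)
    with pytest.raises(ValidationError):
        BoundedValue(value=101)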

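Recommended development workflow

Stage 1: define the model

1. Analyze requirements → decide on the output structure
   |
   v
2. Design the Pydantic models
   |
   v
3. Generate the JSON Schema → check that it is correct
   |
   v
4. Write unit tests → verify field constraints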

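python

import pytest
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

# The model under test needs real constraints for these checks to be meaningful
class FeedbackModel(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    score: float = Field(ge=-1.0, le=1.0)

# Development flow example: test the schema first
def test_schema_constraints():
    """Verify that the schema constraints behave as expected"""

    # Required fields
    with pytest.raises(ValidationError):
        FeedbackModel()

    # Type constraint (score is valid here, so only sentiment can fail)
    with pytest.raises(ValidationError):
        FeedbackModel(sentiment="invalid", score=0.0)

    # Range constraint
    with pytest.raises(ValidationError):
        FeedbackModel(sentiment="positive", score=999)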

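Stage 2: integrate the LLM

1. Generate the tool definition (from the Pydantic model)
   |
   v
2. Basic integration tests → verify the call flow
   |
   v
3. End-to-end tests → verify the whole pipeline
   |
   v
4. Error-handling tests → verify fault tolerance

python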

def test_llm_integration():
    """集成测试：验证 LLM 输出符合 Schema"""
    
    # 调用 LLM
    result = analyze_with_llm("产品很好用")
    
    # 验证类型
    assert isinstance(result, FeedbackModel)
    
    # 验证值
    assert result.sentiment == "positive"

Stage 3: iterate and optimize

1. Collect errors from production
   |
   v
2. Analyze error patterns
   |
   v
3. Improve the schema or the prompt
   |
   v
4. Regression tests → verify that the fix works

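Production deployment notes

1. A validation layer is indispensable

Even when the LLM provider claims to support structured output, always keep a validation layer:

python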

def safe_analyze(text: str) -> FeedbackResult:
    """生产环境使用的安全分析函数"""
    
    # 1. LLM 调用
    llm_result = call_llm(text)
    
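    # 2. Parse the JSON (may fail)
    try:
        data = json.loads(llm_result)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parsing failed: {e}")
        # Chain the original exception so the root cause stays in the traceback
        raise AnalysisError("LLM output could not be parsed") from e

    # 3. Validate against the schema (may fail)
    try:
        return FeedbackResult.model_validate(data)
    except ValidationError as e:
        logger.error(f"Schema validation failed: {e}")
        raise AnalysisError("LLM output does not match the expected format") from e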
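A validation layer can also drive a repair attempt instead of failing immediately. The sketch below is one possible pattern, not part of the original function: `call_llm` is a hypothetical callable injected by the caller, the FeedbackResult fields are assumed, and the repair prompt wording is only an example. A fake LLM stands in for the real client so the flow can be run offline:

```python
from typing import Callable, Literal

from pydantic import BaseModel, ValidationError


class FeedbackResult(BaseModel):
    # Assumed fields; the original does not define FeedbackResult.
    sentiment: Literal["positive", "negative", "neutral"]
    summary: str


def analyze_with_repair(
    text: str,
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns raw JSON text
    max_attempts: int = 2,
) -> FeedbackResult:
    """Validate the reply; on failure, re-prompt once with the error details."""
    prompt = text
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            # model_validate_json reports malformed JSON as a ValidationError too.
            return FeedbackResult.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            prompt = (
                f"{text}\n\nYour previous reply was rejected:\n{exc}\n"
                "Return corrected JSON only."
            )
    raise ValueError("LLM output stayed invalid after retries") from last_error


# Fake LLM for illustration: first reply is invalid, second is valid.
replies = iter([
    '{"sentiment": "great", "summary": "ok"}',
    '{"sentiment": "positive", "summary": "ok"}',
])
prompts_seen = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


result = analyze_with_repair("产品很好用", fake_llm)
print(result)
```

Capping the number of attempts keeps cost bounded; whether a repair round is worth it depends on how often the first reply fails in practice.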

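2. Set timeouts and retries

python

from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    reraise=True
)
def call_with_retry(client: OpenAI, prompt: str) -> str:
    """LLM call with retries"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        timeout=30  # 30-second timeout per request
    )
    # Return the text to match the declared return type; content can be None
    return response.choices[0].message.content or ""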

3. Monitoring and alerting

python

import json
import time
from dataclasses import dataclass

from pydantic import ValidationError

@dataclass
class AnalysisMetrics:
    total_calls: int = 0
    success_calls: int = 0
    parse_errors: int = 0
    validation_errors: int = 0
    avg_latency_ms: float = 0

metrics = AnalysisMetrics()

def monitored_analyze(text: str) -> FeedbackResult:
    """带监控的分析函数"""
    start = time.time()
    metrics.total_calls += 1
    
    try:
        result = analyze_with_llm(text)
        metrics.success_calls += 1
        return result
    except json.JSONDecodeError:
        metrics.parse_errors += 1
        raise
    except ValidationError:
        metrics.validation_errors += 1
        raise
    finally:
        latency = (time.time() - start) * 1000
        metrics.avg_latency_ms = (
            metrics.avg_latency_ms * (metrics.total_calls - 1) + latency
        ) / metrics.total_calls
        
        # 超过阈值时告警
        if metrics.validation_errors / metrics.total_calls > 0.1:
            alert_team(f"验证错误率超过 10%: {metrics}")

4. Fallback strategy

python

from json import JSONDecodeError

from pydantic import ValidationError

def analyze_with_fallback(text: str) -> dict:
    """带降级策略的分析函数"""
    
    try:
        # 优先尝试结构化输出
        result = structured_analyze(text)
        return result.model_dump()
        
    except (ValidationError, JSONDecodeError) as e:
        logger.warning(f"结构化输出失败，使用降级方案: {e}")
        
        try:
            # 降级：使用自由文本 + 后处理解析
            raw = free_text_analyze(text)
            return parse_fallback(raw)  # 使用正则/NLP 提取
            
        except Exception as e2:
            logger.error(f"降级方案也失败: {e2}")
            # 最终降级：返回默认值
            return {"sentiment": "unknown", "error": str(e2)}

5. Schema versioning

python

from pydantic import BaseModel
from enum import IntEnum

class SchemaVersion(IntEnum):
    V1 = 1
    V2 = 2
    CURRENT = V2

class FeedbackV1(BaseModel):
    """v1 版本"""
    sentiment: str
    score: float

class FeedbackV2(BaseModel):
    """v2 版本 - 增加 keywords"""
    sentiment: str
    score: float
    keywords: list[str]

# 版本兼容性处理
def parse_feedback(data: dict, version: SchemaVersion = SchemaVersion.CURRENT):
    if version == SchemaVersion.V1:
        return FeedbackV1.model_validate(data)
    return FeedbackV2.model_validate(data)
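Selecting a model by version is only half of the story; callers usually want a single current type back. The sketch below adds an upgrade step under two assumptions that are not in the original: payloads carry an integer `schema_version` key (missing means 1), and an empty keyword list is an acceptable default for old records:

```python
from pydantic import BaseModel


class FeedbackV1(BaseModel):
    sentiment: str
    score: float


class FeedbackV2(BaseModel):
    sentiment: str
    score: float
    keywords: list[str]


def upgrade_v1_to_v2(old: FeedbackV1) -> FeedbackV2:
    # Assumption: v1 records have no keywords, so start with an empty list.
    return FeedbackV2(**old.model_dump(), keywords=[])


def parse_any_version(data: dict) -> FeedbackV2:
    """Parse a v1 or v2 payload and always return the current model."""
    payload = dict(data)
    # Hypothetical convention: the version travels inside the payload.
    version = payload.pop("schema_version", 1)
    if version == 1:
        return upgrade_v1_to_v2(FeedbackV1.model_validate(payload))
    return FeedbackV2.model_validate(payload)


old_record = parse_any_version({"sentiment": "positive", "score": 0.8})
new_record = parse_any_version(
    {"schema_version": 2, "sentiment": "neutral", "score": 0.0, "keywords": ["price"]}
)
print(old_record, new_record)
```

With this shape, downstream code only ever handles FeedbackV2, and each future version needs just one more upgrade function.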

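Summary

This part covered:

Five common pitfalls and their solutions (overly complex schemas, type mismatches, the additionalProperties trap, nesting too deep, mismatched enum values)
The debugging toolchain (the jsonschema library, Pydantic's ValidationError, logging)
Performance optimization techniques
Schema design principles (minimalism, explicitness, consistency, testability)
A recommended development workflow
Production deployment notes (validation layer, retries, monitoring, fallback, versioning)

The final part is an appendix with complete example code and reference resources.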

8. Appendix

8.1 Complete example code

Example 1: a complete end-to-end structured-output workflow

python

"""
完整的端到端结构化输出示例
文件：examples/complete_workflow.py
"""
from pydantic import BaseModel, Field, ValidationError, ConfigDict
from typing import Literal, Optional
from openai import OpenAI
import json
import logging

# 配置日志
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


# ============ 1. 定义领域模型 ============

class SentimentAnalysis(BaseModel):
    """情感分析结果"""
    overall: Literal["positive", "negative", "neutral"]
    score: float = Field(description="-1 到 1 的情感分数", ge=-1.0, le=1.0)


class AspectSentiment(BaseModel):
    """方面的情感分析"""
    aspect: str
    sentiment: Literal["positive", "negative", "neutral"]
    evidence: list[str] = Field(
        description="支撑该情感的原文片段",
        max_length=3
    )


class FeedbackAnalysis(BaseModel):
    """完整的反馈分析结果"""
    summary: str = Field(max_length=200, description="不超过200字的摘要")
    sentiment: SentimentAnalysis
    aspects: list[AspectSentiment] = Field(min_length=1, description="至少一个方面")
    keywords: list[str] = Field(max_length=10, description="最多10个关键词")
    metadata: Optional[dict[str, str]] = Field(
        default=None, 
        description="附加元信息"
    )

    # `required` is derived automatically from the fields without defaults
    # (summary, sentiment, aspects, keywords), so no json_schema_extra override is needed


# ============ 2. 生成 OpenAI 工具定义 ============

def create_tool_definition() -> dict:
    """从 Pydantic 模型创建 OpenAI 工具定义"""
    return {
        "type": "function",
        "function": {
            "name": "analyze_feedback",
            "description": "分析用户反馈文本，提取结构化信息",
            "parameters": FeedbackAnalysis.model_json_schema()
        }
    }


# ============ 3. LLM call wrapper ============

class LLMAnalyzer:
    """Thin wrapper around the LLM call"""

    def __init__(self, client: OpenAI, model: str = "gpt-4o"):
        self.client = client
        self.model = model
        self.tools = [create_tool_definition()]

    def analyze(self, text: str) -> FeedbackAnalysis:
        """Analyze user feedback and return a validated, typed object"""

        logger.info(f"Starting analysis, text length: {len(text)}")

        # Build the messages
        messages = [
            {
                "role": "system",
                "content": "You are a professional user-feedback analysis assistant. Extract structured information from the text."
            },
            {"role": "user", "content": text}
        ]

        # Call the LLM and force it to call the analyze_feedback tool
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            tools=self.tools,
            tool_choice={
                "type": "function",
                "function": {"name": "analyze_feedback"}
            }
        )

        # Extract the tool call result; tool_calls can be None if the model
        # answered in plain text or the response was cut off
        tool_calls = response.choices[0].message.tool_calls
        if not tool_calls:
            raise ValueError("The model response contains no tool call")
        raw_result = tool_calls[0].function.arguments

        logger.debug(f"Raw LLM output: {raw_result[:200]}...")

        # Parse and validate
        try:
            data = json.loads(raw_result)
            result = FeedbackAnalysis.model_validate(data)
            logger.info(f"Analysis succeeded: sentiment={result.sentiment.overall}")
            return result

        except (json.JSONDecodeError, ValidationError) as e:
            logger.error(f"Validation failed: {e}")
            raise


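# ============ 4. Fallback handling ============

def analyze_with_fallback(text: str, analyzer: LLMAnalyzer) -> dict:
    """Analysis function with a fallback path"""

    try:
        # Prefer structured output
        result = analyzer.analyze(text)
        return result.model_dump()

    except Exception as e:
        logger.warning(f"Structured output failed: {e}; using the fallback")

        try:
            # Fallback: a simpler free-text sentiment request
            fallback_response = analyzer.client.chat.completions.create(
                model=analyzer.model,
                messages=[
                    {"role": "system", "content": "Briefly analyze the sentiment and return JSON: {\"sentiment\": \"positive/neutral/negative\"}"},
                    {"role": "user", "content": text}
                ]
            )
            # content can be None, so substitute "" and let json.loads fail cleanly
            return json.loads(fallback_response.choices[0].message.content or "")
        except Exception as fallback_error:
            # Last resort: the fallback request or its JSON parsing failed as well
            logger.error(f"Fallback failed: {fallback_error}")
            return {"sentiment": "unknown", "error": str(e)}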


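# ============ 5. Main flow ============

def main():
    """Example main flow"""

    # Initialize the client
    client = OpenAI()
    analyzer = LLMAnalyzer(client)

    # Feedback text to analyze
    feedback_text = """
    This product is such a letdown! The quality is poor and it broke after three days.
    Customer service was unhelpful too; I called three times and nothing got resolved.
    It is really cheap, though, so it might be worth considering for short-term use.
    """

    # Analyze
    try:
        result = analyzer.analyze(feedback_text)

        # Use the typed result
        print(f"Overall sentiment: {result.sentiment.overall}")
        print(f"Sentiment score: {result.sentiment.score}")
        print(f"Summary: {result.summary}")
        print("Aspects:")
        for aspect in result.aspects:
            print(f"  - {aspect.aspect}: {aspect.sentiment}")
        print(f"Keywords: {result.keywords}")

    except ValidationError as e:
        print(f"Analysis failed: {e.error_count()} validation error(s)")
        for error in e.errors():
            print(f"  - {error['loc']}: {error['msg']}")
    except json.JSONDecodeError as e:
        # analyze() re-raises malformed JSON as well as schema violations
        print(f"Analysis failed: tool arguments are not valid JSON ({e})")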

if __name__ == "__main__":
    main()

Example 2: Complete TypeScript / Zod Example

typescript

/**
 * Complete TypeScript example
 * File: examples/feedback-typescript.tsx (.tsx because the file contains JSX)
 */

import { z } from 'zod';

// ============ 1. Define the schemas ============

const SentimentSchema = z.object({
  overall: z.enum(['positive', 'negative', 'neutral']),
  score: z.number().min(-1).max(1)
});

const AspectSentimentSchema = z.object({
  aspect: z.string(),
  sentiment: z.enum(['positive', 'negative', 'neutral']),
  evidence: z.array(z.string()).max(3)
});

const FeedbackSchema = z.object({
  summary: z.string().max(200),
  sentiment: SentimentSchema,
  aspects: z.array(AspectSentimentSchema).min(1),
  keywords: z.array(z.string()).max(10),
  // Pass key and value schemas explicitly; Zod 4 no longer accepts the one-argument form
  metadata: z.record(z.string(), z.string()).optional()
});

// Type inference
type Feedback = z.infer<typeof FeedbackSchema>;


// ============ 2. API client ============

interface AnalyzeRequest {
  text: string;
}

interface AnalyzeResponse {
  success: boolean;
  data?: Feedback;
  error?: string;
}

async function analyzeFeedback(text: string): Promise<AnalyzeResponse> {
  try {
    const payload: AnalyzeRequest = { text };
    const response = await fetch('/api/analyze', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload)
    });

    // fetch only rejects on network failure, so check the HTTP status explicitly
    if (!response.ok) {
      return { success: false, error: `Request failed: HTTP ${response.status}` };
    }

    const raw: unknown = await response.json();

    // Validate the response data
    const result = FeedbackSchema.parse(raw);

    return { success: true, data: result };

  } catch (error) {
    if (error instanceof z.ZodError) {
      return {
        success: false,
        // `issues` is available in both Zod 3 and Zod 4
        error: `Validation failed: ${error.issues.map(e => e.message).join(', ')}`
      };
    }
    return { success: false, error: `Unknown error: ${error}` };
  }
}


// ============ 3. Component example ============

import React from 'react';

interface FeedbackCardProps {
  feedback: Feedback;
}

export function FeedbackCard({ feedback }: FeedbackCardProps) {
  const sentimentEmoji = {
    positive: '😊',
    negative: '😞',
    neutral: '😐'
  };
  
  return (
    <div className="feedback-card">
      <div className="header">
        <span className="emoji">
          {sentimentEmoji[feedback.sentiment.overall]}
        </span>
        <span className="score">
          {feedback.sentiment.score.toFixed(2)}
        </span>
      </div>
      
      <p className="summary">{feedback.summary}</p>
      
      <div className="aspects">
        {feedback.aspects.map(aspect => (
          <div key={aspect.aspect} className={`aspect ${aspect.sentiment}`}>
            <span>{aspect.aspect}</span>
            <span>{sentimentEmoji[aspect.sentiment]}</span>
          </div>
        ))}
      </div>
      
      <div className="keywords">
        {feedback.keywords.map(kw => (
          <span key={kw} className="keyword">{kw}</span>
        ))}
      </div>
    </div>
  );
}


// ============ 4. Usage example ============

function App() {
  const [feedback, setFeedback] = React.useState<Feedback | null>(null);
  const [loading, setLoading] = React.useState(false);
  const [error, setError] = React.useState<string | null>(null);
  
  const handleAnalyze = async (text: string) => {
    setLoading(true);
    setError(null);
    
    const result = await analyzeFeedback(text);
    
    if (result.success && result.data) {
      setFeedback(result.data);
    } else {
      setError(result.error || 'Analysis failed');
    }
    
    setLoading(false);
  };
  
  return (
    <div>
      {loading && <div>Loading...</div>}
      {error && <div className="error">{error}</div>}
      {feedback && <FeedbackCard feedback={feedback} />}
    </div>
  );
}

Example 3: Complete Pydantic Custom Validator Example

python

"""
Complete custom validator example
File: examples/validators.py
"""
from pydantic import (
    BaseModel, 
    Field, 
    field_validator, 
    model_validator,
    ValidationError,
    GetCoreSchemaHandler,
    ConfigDict
)
from pydantic_core import core_schema
from typing import Optional, Any
import re


class Email(str):
    """Custom email type (Pydantic v2 style)"""

    EMAIL_PATTERN = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

    @classmethod
    def validate(cls, v: Any) -> "Email":
        # Accept only strings that match the pattern, and normalize to lowercase
        if isinstance(v, str) and cls.EMAIL_PATTERN.match(v):
            return cls(v.lower())
        raise ValueError("Invalid email format")
    
    @classmethod
    def __get_pydantic_core_schema__(
        cls, source: type[Any], handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        return core_schema.no_info_after_validator_function(
            cls.validate,
            core_schema.str_schema(),
            serialization=core_schema.to_string_ser_schema()
        )


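class PhoneNumber(str):
    """Custom mobile phone number type (Pydantic v2 style)"""

    # 11 digits starting with 1 and a second digit of 3-9 (mainland China mobile numbers)
    PHONE_PATTERN = re.compile(r'^1[3-9]\d{9}$')

    @classmethod
    def validate(cls, v: Any) -> "PhoneNumber":
        # Strip spaces and hyphens before matching
        cleaned = str(v).replace(" ", "").replace("-", "")
        if cls.PHONE_PATTERN.match(cleaned):
            return cls(cleaned)
        raise ValueError("Invalid phone number format")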
    
    @classmethod
    def __get_pydantic_core_schema__(
        cls, source: type[Any], handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        return core_schema.no_info_after_validator_function(
            cls.validate,
            core_schema.str_schema(),
            serialization=core_schema.to_string_ser_schema()
        )


class UserRegistration(BaseModel):
    """User registration model demonstrating layered validation"""

    username: str = Field(
        min_length=3,
        max_length=20,
        pattern=r'^[a-zA-Z][a-zA-Z0-9_]*$',
        description="Username; must start with a letter"
    )
    # Both contact fields are optional; check_phone_or_email requires at least one
    email: Optional[Email] = None  # custom type
    phone: Optional[PhoneNumber] = None  # custom type
    password: str = Field(min_length=8)
    age: int = Field(ge=18, le=120)

    @field_validator("password")
    @classmethod
    def password_strength(cls, v: str) -> str:
        """Validate password strength"""
        errors = []

        if not any(c.isupper() for c in v):
            errors.append("uppercase letter")
        if not any(c.islower() for c in v):
            errors.append("lowercase letter")
        if not any(c.isdigit() for c in v):
            errors.append("digit")
        if not any(c in "!@#$%^&*()_+-=[]{}|;:,.<>?" for c in v):
            errors.append("special character")

        if errors:
            raise ValueError(f"Password must contain: {', '.join(errors)}")

        return v

    @field_validator("username")
    @classmethod
    def username_no_reserved(cls, v: str) -> str:
        """Reject reserved usernames"""
        reserved = {"admin", "root", "system", "user", "test"}
        if v.lower() in reserved:
            raise ValueError(f"Username '{v}' is not available")
        return v

    @model_validator(mode="after")
    def check_phone_or_email(self) -> "UserRegistration":
        """Require at least one contact method"""
        if self.email is None and self.phone is None:
            raise ValueError("Either an email address or a phone number is required")
        return self


# Test
if __name__ == "__main__":
    # Valid input
    user = UserRegistration(
        username="ZhangSan",
        email="zhangsan@example.com",
        phone="13812345678",
        password="SecurePass123!",
        age=25
    )
    print(f"User created: {user.username}")

    # Validation errors
    try:
        UserRegistration(
            username="admin",  # reserved name
            email="invalid",   # invalid email
            password="weak",   # too short: fails min_length=8
            age=15             # under 18
        )
    except ValidationError as e:
        print(f"Validation failed ({e.error_count()} error(s)):")
        for error in e.errors():
            loc = ".".join(str(part) for part in error["loc"])
            print(f"  - {loc}: {error['msg']}")
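
In the failing case above, `password="weak"` is rejected by the `min_length=8` constraint. Field validators run in "after" mode by default, so `password_strength` is not expected to be reached for that input, and its character-class message never shows up in the output. The sketch below reproduces the same character-class check as a plain function so its result can be inspected without Pydantic. The names `SPECIAL_CHARS` and `missing_character_classes` are hypothetical and not part of the original example.

```python
# Hypothetical standalone version of the checks inside password_strength.
SPECIAL_CHARS = "!@#$%^&*()_+-=[]{}|;:,.<>?"


def missing_character_classes(password: str) -> list[str]:
    """Return the character classes that the password does not contain."""
    checks = [
        ("uppercase letter", str.isupper),
        ("lowercase letter", str.islower),
        ("digit", str.isdigit),
        ("special character", lambda c: c in SPECIAL_CHARS),
    ]
    # A class is missing when no character of the password satisfies its test.
    return [label for label, test in checks if not any(test(c) for c in password)]


if __name__ == "__main__":
    print(missing_character_classes("SecurePass123!"))  # → []
    print(missing_character_classes("weakpassword"))
    # → ['uppercase letter', 'digit', 'special character']
```

To see the custom message from the model itself, the password has to be at least 8 characters long, for example `"weakpassword"`.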

8.2 Reference Links

Official Documentation

Resource	Link
JSON Schema official documentation	https://json-schema.org/
JSON Schema Draft 2020-12	https://json-schema.org/draft/2020-12/json-schema-core.html
Pydantic official documentation	https://docs.pydantic.dev/
Pydantic v2 changelog	https://docs.pydantic.dev/changelog/
OpenAI Structured Outputs	https://platform.openai.com/docs/guides/structured-outputs
Anthropic Tool Use	https://docs.anthropic.com/en/docs/build-with-claude/tool-use
Google Gemini API	https://ai.google.dev/docs

Open Source Libraries

Library	Description	Link
pydantic	Python data validation	https://github.com/pydantic/pydantic
zod	TypeScript schema validation	https://github.com/colinhacks/zod
jsonschema	JSON Schema validation	https://github.com/python-jsonschema/jsonschema
quicktype	Type generation tool	https://github.com/glideapps/quicktype
json-schema-to-typescript	Schema to TS type conversion	https://github.com/bcherny/json-schema-to-typescript