Table of contents
- 1. Every LLM provider is an island
- 2. A common abstraction for LLM APIs: every model does the same thing
  - 2.1 Unified interface design
  - 2.2 OpenAI client implementation
  - 2.3 DeepSeek compatibility layer
  - 2.4 Zhipu GLM adapter
  - 2.5 API differences across vendors
- 3. Factory pattern + strategy pattern: unified creation and routing
  - 3.1 Creating clients with the factory pattern
  - 3.2 Full architecture of the unified calling layer
- 4. Engineering enhancements: retries, fallback, and cost tracking
  - 4.1 Automatic retries (tenacity)
  - 4.2 Fallback routing flow
  - 4.3 Cost tracker
- 5. Concurrent calls: querying several models at once for comparison
  - 5.1 Asynchronous concurrent calls
  - 5.2 Concurrent handling of streaming output
- 6. Hands-on: a model arena (Chatbot Arena)
- 7. Streaming output and JSON mode
  - 7.1 Handling streaming output end to end
  - 7.2 A unified wrapper for Function Calling
- 8. Summary
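1. Every LLM provider is an island
Team A uses OpenAI's client.chat.completions.create(). Team B uses DeepSeek's compatible interface, but with different model names. Team C uses Zhipu GLM's zhipuai package. Each team maintains its own calling code: parameter names differ, error handling is inconsistent, and cost figures are scattered across three Excel sheets. When a model service goes down, switching to a backup model means changing code, retesting, and updating documentation, which can take several hours.
The core problem of LLM API engineering is not "how do I call one model's API" but "how do I build a maintainable, observable, fault-tolerant unified calling layer across multiple vendors, models, and scenarios". This article designs a complete unified calling scheme along four dimensions: abstract interfaces, multi-vendor adaptation, engineering enhancements, and a production-oriented hands-on project.
2. A common abstraction for LLM APIs: every model does the same thing
2.1 Unified interface design
Whether it is OpenAI, DeepSeek, Zhipu, or Tongyi Qianwen, every LLM chat API does the same thing: it accepts a list of messages and returns a completion. This minimal abstraction is the foundation of the unified layer.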
[Figure: flowchart. Messages (System + User + Assistant) → LLM Client → Completion (Content + Usage + Meta). Caption: a unified interface, whatever the underlying vendor.]
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Message:
    role: str  # system / user / assistant / tool
    content: str


@dataclass
class CompletionResult:
    content: str
    model: str
    usage: Dict[str, int]  # e.g. {"prompt_tokens": 100, "completion_tokens": 50}
    finish_reason: Optional[str] = None
    raw_response: Optional[Any] = None


class BaseLLMClient(ABC):
    """Abstract base class for LLM clients."""

    @abstractmethod
    def chat(self, messages: List[Message], **kwargs) -> CompletionResult:
        """Synchronous chat."""

    @abstractmethod
    async def achat(self, messages: List[Message], **kwargs) -> CompletionResult:
        """Asynchronous chat."""

    @abstractmethod
    def stream_chat(self, messages: List[Message], **kwargs) -> Iterator[str]:
        """Streaming output: yields text chunks as they arrive."""

    @property
    @abstractmethod
    def model_name(self) -> str:
        """Name of the underlying model."""

    @property
    @abstractmethod
    def pricing(self) -> Dict[str, float]:
        """Pricing in USD per 1K tokens, e.g. {'input': 0.0015, 'output': 0.002}."""
```
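To make the contract concrete without any network access, here is a minimal sketch of an in-memory client. `EchoClient`, its made-up model name, and the word-count token estimate are hypothetical and exist only for illustration; trimmed copies of `Message`, `CompletionResult`, and `BaseLLMClient` are repeated so the block runs on its own.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List, Optional


# Trimmed copies of the article's types so this sketch is self-contained.
@dataclass
class Message:
    role: str
    content: str


@dataclass
class CompletionResult:
    content: str
    model: str
    usage: Dict[str, int]
    finish_reason: Optional[str] = None
    raw_response: Optional[Any] = None


class BaseLLMClient(ABC):
    @abstractmethod
    def chat(self, messages: List[Message], **kwargs) -> CompletionResult: ...

    @abstractmethod
    async def achat(self, messages: List[Message], **kwargs) -> CompletionResult: ...

    @abstractmethod
    def stream_chat(self, messages: List[Message], **kwargs) -> Iterator[str]: ...

    @property
    @abstractmethod
    def model_name(self) -> str: ...

    @property
    @abstractmethod
    def pricing(self) -> Dict[str, float]: ...


class EchoClient(BaseLLMClient):
    """Hypothetical offline client: echoes the last user message back."""

    @property
    def model_name(self) -> str:
        return "echo-1"  # made-up model name

    @property
    def pricing(self) -> Dict[str, float]:
        return {"input": 0.0, "output": 0.0}

    def chat(self, messages: List[Message], **kwargs) -> CompletionResult:
        last_user = next(
            (m.content for m in reversed(messages) if m.role == "user"), "")
        reply = f"echo: {last_user}"
        return CompletionResult(
            content=reply,
            model=self.model_name,
            # Whitespace-separated word counts stand in for real token counts.
            usage={
                "prompt_tokens": sum(len(m.content.split()) for m in messages),
                "completion_tokens": len(reply.split()),
            },
            finish_reason="stop",
        )

    async def achat(self, messages: List[Message], **kwargs) -> CompletionResult:
        return self.chat(messages, **kwargs)

    def stream_chat(self, messages: List[Message], **kwargs) -> Iterator[str]:
        # Yield the reply word by word to imitate streaming.
        for word in self.chat(messages, **kwargs).content.split():
            yield word + " "


client = EchoClient()
result = client.chat([Message(role="user", content="hello world")])
print(result.content, result.usage)
```

A stand-in like this is handy in unit tests: any code written against `BaseLLMClient` can be exercised without API keys or network calls.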
2.2 OpenAI client implementation
```python
from openai import OpenAI, AsyncOpenAI


class OpenAIClient(BaseLLMClient):
    def __init__(self, api_key: str, model: str = "gpt-4o",
                 base_url: Optional[str] = None):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.async_client = AsyncOpenAI(api_key=api_key, base_url=base_url)
        self._model = model

    @property
    def model_name(self) -> str:
        return self._model

    @property
    def pricing(self) -> Dict[str, float]:
        # USD per 1K tokens; check the vendor's pricing page before relying on these.
        prices = {
            "gpt-4o": {"input": 0.0025, "output": 0.01},
            "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
            "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
        }
        return prices.get(self._model, {"input": 0.0, "output": 0.0})

    def _payload(self, messages: List[Message]) -> List[Dict[str, str]]:
        """Convert Message objects into the dict format the SDK expects."""
        return [{"role": m.role, "content": m.content} for m in messages]

    def _to_result(self, response) -> CompletionResult:
        """Map an SDK response onto the unified CompletionResult."""
        choice = response.choices[0]
        return CompletionResult(
            # content is None when the model returns only tool calls
            content=choice.message.content or "",
            model=self._model,
            usage={
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            },
            finish_reason=choice.finish_reason,
            raw_response=response,
        )

    def chat(self, messages: List[Message], **kwargs) -> CompletionResult:
        response = self.client.chat.completions.create(
            model=self._model, messages=self._payload(messages), **kwargs)
        return self._to_result(response)

    async def achat(self, messages: List[Message], **kwargs) -> CompletionResult:
        response = await self.async_client.chat.completions.create(
            model=self._model, messages=self._payload(messages), **kwargs)
        return self._to_result(response)

    def stream_chat(self, messages: List[Message], **kwargs) -> Iterator[str]:
        stream = self.client.chat.completions.create(
            model=self._model, messages=self._payload(messages),
            stream=True, **kwargs)
        for chunk in stream:
            # Some chunks carry no choices (for example a trailing usage chunk).
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
```
2.3 DeepSeek compatibility layer
DeepSeek's API uses an OpenAI-compatible format, so the OpenAI SDK works once base_url and api_key are changed. The same request format is also exposed by open-source serving frameworks such as vLLM and SGLang, which is why it has become a de facto standard.
```python
class DeepSeekClient(OpenAIClient):
    """DeepSeek compatibility layer: reuses the OpenAI client with a different base URL."""

    def __init__(self, api_key: str, model: str = "deepseek-chat"):
        super().__init__(
            api_key=api_key,
            model=model,
            base_url="https://api.deepseek.com/v1",
        )

    @property
    def pricing(self) -> Dict[str, float]:
        # USD per 1K tokens; verify against the vendor's current pricing.
        prices = {
            "deepseek-chat": {"input": 0.00014, "output": 0.00028},
            "deepseek-reasoner": {"input": 0.00055, "output": 0.00219},
        }
        return prices.get(self._model, {"input": 0.0, "output": 0.0})
```
2.4 Zhipu GLM adapter
Zhipu GLM ships its own SDK, so it needs an extra wrapper to fit the unified interface.
```python
import asyncio

from zhipuai import ZhipuAI


class ZhipuClient(BaseLLMClient):
    def __init__(self, api_key: str, model: str = "glm-4"):
        self.client = ZhipuAI(api_key=api_key)
        self._model = model

    @property
    def model_name(self) -> str:
        return self._model

    @property
    def pricing(self) -> Dict[str, float]:
        return {"input": 0.001, "output": 0.001}  # placeholder; use the actual pricing

    def _payload(self, messages: List[Message]) -> List[Dict[str, str]]:
        return [{"role": m.role, "content": m.content} for m in messages]

    def chat(self, messages: List[Message], **kwargs) -> CompletionResult:
        response = self.client.chat.completions.create(
            model=self._model, messages=self._payload(messages), **kwargs)
        return CompletionResult(
            content=response.choices[0].message.content or "",
            model=self._model,
            usage={
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            },
            finish_reason=response.choices[0].finish_reason,
            raw_response=response,
        )

    async def achat(self, messages: List[Message], **kwargs) -> CompletionResult:
        # The Zhipu SDK is treated here as having no official async client, so the
        # sync call runs in a worker thread (asyncio.to_thread needs Python 3.9+).
        # kwargs are forwarded so async and sync calls behave the same.
        return await asyncio.to_thread(self.chat, messages, **kwargs)

    def stream_chat(self, messages: List[Message], **kwargs) -> Iterator[str]:
        response = self.client.chat.completions.create(
            model=self._model, messages=self._payload(messages),
            stream=True, **kwargs)
        for chunk in response:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
```
2.5 API differences across vendors
| Dimension | OpenAI | DeepSeek | Zhipu GLM | Alibaba Tongyi Qianwen |
|---|---|---|---|---|
| SDK | openai | openai (compatible) | zhipuai | dashscope |
| Authentication | api_key | api_key | api_key | api_key |
| Base URL | api.openai.com | api.deepseek.com | Built into the SDK | dashscope.aliyuncs.com |
| Streaming | stream=True | stream=True | stream=True | stream=True |
| JSON mode | response_format={"type": "json_object"} | Supported | Partially supported | Supported |
| Function Calling | Native support | Supported | Supported | Supported |
| Temperature parameter | temperature | temperature | temperature | temperature |
| System message | system role | system role | system role | system role |
3. Factory pattern + strategy pattern: unified creation and routing
3.1 Creating clients with the factory pattern
```python
from enum import Enum
from typing import Type


class LLMProvider(Enum):
    OPENAI = "openai"
    DEEPSEEK = "deepseek"
    ZHIPU = "zhipu"
    QWEN = "qwen"


class LLMFactory:
    """Factory pattern: creates the LLM client that matches the configuration."""

    _registry: Dict[LLMProvider, Type[BaseLLMClient]] = {
        LLMProvider.OPENAI: OpenAIClient,
        LLMProvider.DEEPSEEK: DeepSeekClient,
        LLMProvider.ZHIPU: ZhipuClient,
        # LLMProvider.QWEN: QwenClient,
    }

    @classmethod
    def create(cls, provider: LLMProvider, api_key: str,
               model: Optional[str] = None, **kwargs) -> BaseLLMClient:
        client_class = cls._registry.get(provider)
        if client_class is None:
            raise ValueError(f"Unknown LLM provider: {provider}")
        if model is not None:
            # Pass model only when given; otherwise each client keeps its own
            # default instead of receiving model=None.
            kwargs["model"] = model
        return client_class(api_key=api_key, **kwargs)

    @classmethod
    def register(cls, provider: LLMProvider, client_class: Type[BaseLLMClient]):
        """Register a new client type."""
        cls._registry[provider] = client_class


# Usage example
client = LLMFactory.create(
    provider=LLMProvider.DEEPSEEK,
    api_key="sk-xxx",
    model="deepseek-chat",
)
result = client.chat([
    Message(role="system", content="You are a helpful assistant"),
    Message(role="user", content="Explain Python's GIL"),
])
print(result.content)
```
3.2 Full architecture of the unified calling layer
[Figure: architecture flowchart. Business code → LLMManager (unified entry point) → routing strategy (direct routing / fallback / load balancing) → OpenAIClient, DeepSeekClient, ZhipuClient → Retry + Timeout → Cost Tracker → returned result. A configuration center (YAML/JSON) is also shown.]
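The architecture names an LLMManager entry point and several routing strategies, but the article does not show its code. As one possible reading, here is a minimal sketch of direct routing and round-robin load balancing driven by a config dict. The class body, the config keys (`strategy`, `default`, `pool`), and the plain callables standing in for real clients are all assumptions made for illustration.

```python
import itertools
from typing import Callable, Dict, Optional

# Stand-in for a real client: any callable that maps a prompt to a reply.
ChatFn = Callable[[str], str]


class LLMManager:
    """Hypothetical unified entry point with two routing strategies."""

    def __init__(self, clients: Dict[str, ChatFn], config: Dict):
        self.clients = clients
        self.strategy = config.get("strategy", "direct")
        self.default = config["default"]
        # Models that take part in load balancing (all clients by default).
        pool = config.get("pool", list(clients))
        self._round_robin = itertools.cycle(pool)

    def _pick(self) -> str:
        """Choose a model name according to the configured strategy."""
        if self.strategy == "round_robin":
            return next(self._round_robin)
        return self.default  # direct routing

    def chat(self, prompt: str, model: Optional[str] = None) -> str:
        # An explicit model always wins over the routing strategy.
        name = model or self._pick()
        if name not in self.clients:
            raise KeyError(f"no client registered for model {name!r}")
        return self.clients[name](prompt)


clients = {
    "gpt-4o": lambda p: f"[gpt-4o] {p}",
    "deepseek-chat": lambda p: f"[deepseek-chat] {p}",
}
config = {
    "strategy": "round_robin",
    "default": "gpt-4o",
    "pool": ["gpt-4o", "deepseek-chat"],
}
manager = LLMManager(clients, config)
replies = [manager.chat("hi") for _ in range(3)]
print(replies)
```

Because the strategy lives in the config rather than in the calling code, switching from round-robin to direct routing is a configuration change only.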
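4. Engineering enhancements: retries, fallback, and cost tracking
4.1 Automatic retries (tenacity)
LLM API calls fail in three main ways: network timeouts, rate limiting (HTTP 429), and server errors (5xx). The tenacity library provides a clean retry mechanism. The wrapper below retries the primary model on transient errors only, then switches to the fallback once the retries are exhausted.
```python
from tenacity import (Retrying, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)
from openai import (APIConnectionError, APITimeoutError,
                    InternalServerError, RateLimitError)

# Transient failures worth retrying. These are openai SDK exceptions, so they
# cover OpenAIClient and DeepSeekClient; other SDKs raise their own types.
RETRYABLE = (RateLimitError, APIConnectionError, APITimeoutError,
             InternalServerError, TimeoutError)


class ResilientLLMClient:
    """LLM client wrapper with retries and fallback."""

    def __init__(self, primary: BaseLLMClient,
                 fallback: Optional[BaseLLMClient] = None,
                 max_retries: int = 3):
        self.primary = primary
        self.fallback = fallback
        self.max_retries = max_retries  # total attempts on the primary model

    def _chat_with_retry(self, messages: List[Message], **kwargs) -> CompletionResult:
        retryer = Retrying(
            stop=stop_after_attempt(self.max_retries),
            wait=wait_exponential(multiplier=1, min=2, max=30),
            retry=retry_if_exception_type(RETRYABLE),
            reraise=True,
        )
        return retryer(self.primary.chat, messages, **kwargs)

    def chat(self, messages: List[Message], **kwargs) -> CompletionResult:
        try:
            return self._chat_with_retry(messages, **kwargs)
        except Exception as e:
            if self.fallback is None:
                raise
            print(f"Primary model {self.primary.model_name} failed: {e}")
            print(f"Switching to fallback model {self.fallback.model_name}")
            return self.fallback.chat(messages, **kwargs)
```
wait_exponential(multiplier=1, min=2, max=30) implements exponential backoff: the delay grows exponentially with the attempt number and is clamped to between 2 and 30 seconds. With the default max_retries=3 there are three attempts in total, so at most two waits occur before the wrapper gives up on the primary model. Exponential backoff is the standard strategy for handling rate limits.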
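To see how clamping shapes an exponential schedule, the sketch below computes the delay before each retry with a plain function. It illustrates the idea only and is not tenacity's internal code; the function name and the exponent starting at 0 are assumptions, and a given tenacity version may offset the exponent differently.

```python
from typing import List


def clamped_backoff(retries: int, multiplier: float = 1.0,
                    min_wait: float = 2.0, max_wait: float = 30.0) -> List[float]:
    """Delay before each retry: multiplier * 2**n, clamped to [min_wait, max_wait]."""
    delays = []
    for n in range(retries):
        raw = multiplier * (2 ** n)
        delays.append(max(min_wait, min(raw, max_wait)))
    return delays


print(clamped_backoff(6))  # [2.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

The lower bound flattens the first delays and the upper bound caps the last one, so only the middle of the schedule is truly exponential.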
4.2 Fallback routing flow
[Figure: fallback routing flowchart. Request → GPT-4o; on timeout/error, wait 2s → DeepSeek; on timeout/error, wait 4s → GLM-4; if all fail, raise an exception. A success at any step returns the result.]
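The flow can be sketched as an ordered chain of clients with a pause between attempts. This is a minimal illustration, not the article's implementation: `chat_with_fallback`, `AllProvidersFailed`, the injectable `sleep` parameter, and the callables standing in for clients are hypothetical names introduced here.

```python
import time
from typing import Callable, List, Sequence, Tuple

# Stand-in for a real client: any callable that maps a prompt to a reply.
ChatFn = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when every client in the chain has failed."""


def chat_with_fallback(chain: Sequence[Tuple[str, ChatFn]], prompt: str,
                       waits: Sequence[float] = (2.0, 4.0),
                       sleep: Callable[[float], None] = time.sleep) -> Tuple[str, str]:
    """Try each (name, client) in order and return (name, reply) on first success."""
    errors: List[str] = []
    for i, (name, client) in enumerate(chain):
        try:
            return name, client(prompt)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
            if i < len(chain) - 1:
                # Pause before the next provider; reuse the last wait if the
                # chain is longer than the waits sequence.
                sleep(waits[min(i, len(waits) - 1)])
    raise AllProvidersFailed("; ".join(errors))


def broken(prompt: str) -> str:
    raise TimeoutError("timed out")


def healthy(prompt: str) -> str:
    return f"answer to {prompt!r}"


slept: List[float] = []
name, reply = chat_with_fallback(
    [("gpt-4o", broken), ("deepseek-chat", broken), ("glm-4", healthy)],
    "ping",
    sleep=slept.append,  # record the waits instead of actually sleeping
)
print(name, reply, slept)
```

Injecting `sleep` keeps the example fast and makes the wait schedule easy to assert in tests.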
4.3 Cost tracker
```python
import time
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class CostRecord:
    timestamp: str
    model: str
    provider: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    latency_ms: float


class CostTracker:
    """Tracks the cost of LLM calls."""

    def __init__(self):
        self.records: List[CostRecord] = []

    def track(self, client: BaseLLMClient, result: CompletionResult,
              latency_ms: float) -> CostRecord:
        pricing = client.pricing  # USD per 1K tokens
        prompt_cost = result.usage["prompt_tokens"] * pricing["input"] / 1000
        completion_cost = result.usage["completion_tokens"] * pricing["output"] / 1000
        record = CostRecord(
            timestamp=datetime.now().isoformat(),
            model=result.model,
            provider=client.__class__.__name__,
            prompt_tokens=result.usage["prompt_tokens"],
            completion_tokens=result.usage["completion_tokens"],
            cost_usd=prompt_cost + completion_cost,
            latency_ms=latency_ms,
        )
        self.records.append(record)
        return record

    def summary(self) -> Dict:
        if not self.records:
            return {}
        total_cost = sum(r.cost_usd for r in self.records)
        total_tokens = sum(r.prompt_tokens + r.completion_tokens for r in self.records)
        avg_latency = sum(r.latency_ms for r in self.records) / len(self.records)
        # Per-model breakdown of calls, cost, and tokens
        by_model: Dict[str, Dict] = {}
        for r in self.records:
            stats = by_model.setdefault(r.model, {"calls": 0, "cost": 0.0, "tokens": 0})
            stats["calls"] += 1
            stats["cost"] += r.cost_usd
            stats["tokens"] += r.prompt_tokens + r.completion_tokens
        return {
            "total_calls": len(self.records),
            "total_cost_usd": round(total_cost, 4),
            "total_tokens": total_tokens,
            "avg_latency_ms": round(avg_latency, 2),
            "by_model": by_model,
        }


# Usage example
tracker = CostTracker()
start = time.perf_counter()  # monotonic clock, better suited to measuring latency
result = client.chat([Message(role="user", content="Hello")])
latency = (time.perf_counter() - start) * 1000
tracker.track(client, result, latency)
print(tracker.summary())
```
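As a worked example of the per-call arithmetic, the sketch below prices a single request with 1,200 prompt tokens and 300 completion tokens at the gpt-4o rates listed earlier (0.0025 and 0.01 USD per 1K tokens). The helper name `estimate_cost` and the token counts are made up for illustration.

```python
from typing import Dict


def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  pricing: Dict[str, float]) -> float:
    """Cost in USD, with pricing expressed per 1K tokens."""
    return (prompt_tokens * pricing["input"]
            + completion_tokens * pricing["output"]) / 1000


gpt4o = {"input": 0.0025, "output": 0.01}  # per-1K rates from the OpenAIClient table
cost = estimate_cost(1200, 300, gpt4o)
# 1200 * 0.0025 / 1000 = 0.003 and 300 * 0.01 / 1000 = 0.003
print(f"${cost:.4f}")  # $0.0060
```

Output tokens are four times more expensive here, so 300 completion tokens cost as much as 1,200 prompt tokens.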
5. Concurrent calls: querying several models at once for comparison
5.1 Asynchronous concurrent calls
```python
import asyncio


async def multi_model_chat(question: str, clients: List[BaseLLMClient]) -> Dict[str, str]:
    """Ask several models the same question at once and collect their answers."""
    messages = [Message(role="user", content=question)]
    tasks = [client.achat(messages) for client in clients]
    # return_exceptions=True keeps one failing model from cancelling the rest
    results = await asyncio.gather(*tasks, return_exceptions=True)
    responses = {}
    for client, result in zip(clients, results):
        if isinstance(result, Exception):
            responses[client.model_name] = f"Error: {result}"
        else:
            responses[client.model_name] = result.content
    return responses


# Usage example
async def main():
    clients = [
        LLMFactory.create(LLMProvider.OPENAI, "sk-openai", "gpt-4o-mini"),
        LLMFactory.create(LLMProvider.DEEPSEEK, "sk-deepseek", "deepseek-chat"),
        LLMFactory.create(LLMProvider.ZHIPU, "sk-zhipu", "glm-4"),
    ]
    responses = await multi_model_chat(
        "Explain Python's asynchronous programming model",
        clients,
    )
    for model, answer in responses.items():
        print(f"\n=== {model} ===")
        print(answer[:500] + "...")


asyncio.run(main())
```
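When many models or many questions are involved, it often helps to cap concurrency and put a timeout on each call so that one slow provider cannot stall the comparison. A minimal sketch follows; `bounded_multi_chat`, its parameters, and the async callables standing in for `achat` are hypothetical names introduced for this example.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Stand-in for client.achat: an async callable that maps a question to a reply.
AsyncChatFn = Callable[[str], Awaitable[str]]


async def bounded_multi_chat(question: str, clients: Dict[str, AsyncChatFn],
                             max_concurrency: int = 2,
                             timeout_s: float = 1.0) -> Dict[str, str]:
    """Query all clients, at most max_concurrency at a time, each with a timeout."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def ask(name: str, chat: AsyncChatFn) -> str:
        async with semaphore:
            try:
                return await asyncio.wait_for(chat(question), timeout=timeout_s)
            except asyncio.TimeoutError:
                return f"error: {name} timed out after {timeout_s}s"
            except Exception as exc:
                return f"error: {exc}"

    answers = await asyncio.gather(*(ask(n, c) for n, c in clients.items()))
    return dict(zip(clients, answers))


async def fast(question: str) -> str:
    await asyncio.sleep(0.01)
    return f"fast answer to {question}"


async def slow(question: str) -> str:
    await asyncio.sleep(5)  # longer than the timeout used below
    return "never returned"


responses = asyncio.run(
    bounded_multi_chat("ping", {"fast-model": fast, "slow-model": slow},
                       timeout_s=0.1))
print(responses)
```

Each failure is turned into an error string inside `ask`, so the gather call never raises and every model gets an entry in the result.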
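5.2 Concurrent handling of streaming output
stream_chat is a blocking generator, so iterating it directly inside a coroutine would stall the event loop and the models would stream one after another. The version below runs each stream in a worker thread and hands chunks to the event loop through a shared queue.
```python
async def stream_multi_model(question: str, clients: List[BaseLLMClient]):
    """Stream from several models at once for a live comparison."""
    messages = [Message(role="user", content=question)]
    loop = asyncio.get_running_loop()
    # One shared queue of (model_name, chunk); chunk is None when a stream ends
    queue: asyncio.Queue = asyncio.Queue()

    def produce(client: BaseLLMClient):
        # Runs in a worker thread; call_soon_threadsafe hands items to the loop.
        name = client.model_name
        try:
            for chunk in client.stream_chat(messages):
                loop.call_soon_threadsafe(queue.put_nowait, (name, chunk))
        except Exception as e:
            loop.call_soon_threadsafe(queue.put_nowait, (name, f"[error] {e}"))
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, (name, None))  # end marker

    # Start all streaming tasks
    tasks = [asyncio.create_task(asyncio.to_thread(produce, c)) for c in clients]
    # Consume and print chunks as they arrive
    remaining = len(clients)
    while remaining:
        model_name, chunk = await queue.get()
        if chunk is None:
            remaining -= 1
        else:
            print(f"[{model_name}] {chunk}", end="", flush=True)
    await asyncio.gather(*tasks)
```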
6. Hands-on: a model arena (Chatbot Arena)
The model arena is a production-oriented tool: it sends the same question to several models at once, shows the answers side by side, lets users score them, and tracks each model's win rate.
```python
import asyncio
from collections import defaultdict

import gradio as gr


class ChatbotArena:
    """Model arena: compare and score answers from several models."""

    def __init__(self):
        self.clients = {}
        self.scores = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})
        self.history = []

    def register_client(self, name: str, client: BaseLLMClient):
        self.clients[name] = client

    async def battle(self, question: str) -> Dict[str, str]:
        """Run one comparison round."""
        messages = [Message(role="user", content=question)]
        tasks = {name: client.achat(messages)
                 for name, client in self.clients.items()}
        results = await asyncio.gather(*tasks.values(), return_exceptions=True)
        answers = {}
        for name, result in zip(tasks.keys(), results):
            if isinstance(result, Exception):
                answers[name] = f"[Error] {result}"
            else:
                answers[name] = result.content
        self.history.append({"question": question, "answers": answers})
        return answers

    def vote(self, battle_idx: int, winner: str):
        """Record a vote for one round."""
        models = list(self.history[battle_idx]["answers"].keys())
        if winner not in models:
            raise ValueError(f"Unknown model: {winner}")
        for model in models:
            if model == winner:
                self.scores[model]["wins"] += 1
            else:
                self.scores[model]["losses"] += 1

    def leaderboard(self) -> List[Dict]:
        """Build the leaderboard, sorted by number of wins."""
        leaderboard = []
        for model, score in self.scores.items():
            total = score["wins"] + score["losses"] + score["ties"]
            if total > 0:
                win_rate = score["wins"] / total
                leaderboard.append({
                    "model": model,
                    "wins": score["wins"],
                    "losses": score["losses"],
                    "win_rate": f"{win_rate:.1%}",
                })
        return sorted(leaderboard, key=lambda x: x["wins"], reverse=True)


arena = ChatbotArena()
# Register models
arena.register_client("GPT-4o-mini",
                      LLMFactory.create(LLMProvider.OPENAI, "sk-xxx", "gpt-4o-mini"))
arena.register_client("DeepSeek-V3",
                      LLMFactory.create(LLMProvider.DEEPSEEK, "sk-xxx", "deepseek-chat"))
arena.register_client("GLM-4",
                      LLMFactory.create(LLMProvider.ZHIPU, "sk-xxx", "glm-4"))


async def battle_fn(question):
    answers = await arena.battle(question)
    output = "\n\n".join(
        f"## {name}\n{content[:2000]}" for name, content in answers.items()
    )
    return output, arena.leaderboard()


# Gradio UI (simplified; voting is not wired into the interface here)
with gr.Blocks() as demo:
    gr.Markdown("# LLM Model Arena")
    question = gr.Textbox(label="Question", placeholder="Enter a question...")
    battle_btn = gr.Button("Compare")
    output = gr.Markdown(label="Comparison results")
    leaderboard = gr.JSON(label="Leaderboard")
    battle_btn.click(battle_fn, inputs=question, outputs=[output, leaderboard])

demo.launch()
```
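The score table reserves a "ties" counter that the vote method never updates. One way to use it, sketched below as standalone functions, is to treat a missing winner as a tie and count each tie as half a win when computing win rates. The function names and the half-win convention are assumptions for illustration, not part of the arena class.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def record_vote(scores: Dict[str, Dict[str, int]], models: List[str],
                winner: Optional[str]) -> None:
    """Update scores for one round; winner=None records a tie for every model."""
    if winner is not None and winner not in models:
        raise ValueError(f"unknown model: {winner}")
    for model in models:
        if winner is None:
            scores[model]["ties"] += 1
        elif model == winner:
            scores[model]["wins"] += 1
        else:
            scores[model]["losses"] += 1


def win_rates(scores: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Win rate per model, counting each tie as half a win (one common convention)."""
    rates = {}
    for model, s in scores.items():
        total = s["wins"] + s["losses"] + s["ties"]
        if total:
            rates[model] = (s["wins"] + 0.5 * s["ties"]) / total
    return rates


scores = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})
models = ["GPT-4o-mini", "DeepSeek-V3"]
record_vote(scores, models, "DeepSeek-V3")  # round 1: DeepSeek-V3 wins
record_vote(scores, models, None)           # round 2: tie
print(win_rates(scores))  # {'GPT-4o-mini': 0.25, 'DeepSeek-V3': 0.75}
```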
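7. Streaming output and JSON mode
7.1 Handling streaming output end to end
Streaming directly affects user experience in chat scenarios: instead of waiting for the whole answer to be generated, users watch the text appear incrementally.
```python
import json


def stream_with_stats(client: BaseLLMClient, messages: List[Message]) -> str:
    """Stream the output and report chunk and character counts."""
    full_content = []
    for chunk in client.stream_chat(messages):
        print(chunk, end="", flush=True)
        full_content.append(chunk)
    text = "".join(full_content)
    # A chunk is not necessarily one token, so these counts are approximate.
    print(f"\n\n{len(full_content)} chunks, about {len(text)} characters in total")
    return text


# JSON mode: force structured output
result = client.chat([
    Message(role="system",
            content='Return the result as a JSON object with an "issues" array'),
    Message(role="user", content="Analyze the problems in this code"),
], response_format={"type": "json_object"})

analysis = json.loads(result.content)
# Structured access; .get avoids a KeyError if the model omits the key
print(analysis.get("issues", []))
```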
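Even with JSON mode, and especially with vendors that support it only partially, a reply may arrive wrapped in a Markdown code fence or lack an expected key. A defensive parser can be sketched as follows; `parse_json_reply` and its `required_keys` parameter are hypothetical helpers introduced for this example.

```python
import json
from typing import Any, Dict, Iterable

FENCE = "`" * 3  # three backticks, built this way to keep the example fence-safe


def parse_json_reply(text: str, required_keys: Iterable[str] = ()) -> Dict[str, Any]:
    """Parse a model reply as a JSON object, tolerating a Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with its optional language tag) ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ... and everything from the closing fence onward.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


reply = f'{FENCE}json\n{{"issues": ["unused import"]}}\n{FENCE}'
analysis = parse_json_reply(reply, required_keys=["issues"])
print(analysis["issues"])  # ['unused import']
```

Raising on missing keys surfaces schema drift at the boundary instead of deep inside business code.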
7.2 A unified wrapper for Function Calling
Function Calling lets an LLM decide to call external tools. The unified layer needs to hide the differences between vendors' Function Calling formats.
```python
import json
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Central registry of tools."""

    def __init__(self):
        self.tools: Dict[str, Callable] = {}
        self.schemas: List[Dict] = []

    def register(self, name: str, description: str,
                 parameters: Dict, func: Callable):
        self.tools[name] = func
        self.schemas.append({
            "type": "function",
            "function": {
                "name": name,
                "description": description,
                "parameters": parameters,
            },
        })

    def call(self, name: str, arguments: Dict) -> Any:
        if name not in self.tools:
            raise ValueError(f"Unknown tool: {name}")
        return self.tools[name](**arguments)


# Register a tool
tools = ToolRegistry()
tools.register(
    name="get_weather",
    description="Get the weather for a given city",
    parameters={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
        },
        "required": ["city"],
    },
    # Stub implementation that returns fixed data
    func=lambda city, date=None: {"temperature": 25, "condition": "sunny"},
)

# A conversation with tool calling
result = client.chat([
    Message(role="user", content="What's the weather like in Beijing today?")
], tools=tools.schemas)

# If the model decided to call a tool (raw_response is the OpenAI-style
# response object here, so this part is still vendor-specific)
tool_calls = result.raw_response.choices[0].message.tool_calls
if tool_calls:
    tool_call = tool_calls[0]
    tool_result = tools.call(
        tool_call.function.name,
        json.loads(tool_call.function.arguments),
    )
    print(f"Tool returned: {tool_result}")
```
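After a tool runs, its output is normally sent back to the model so it can write the final answer. The sketch below executes every requested call and builds the follow-up messages in the OpenAI-style shape (role "tool" plus a tool_call_id). Representing tool calls as plain dicts and the helper name `run_tool_calls` are assumptions for illustration; the SDK returns objects, and the article's Message dataclass would need a tool_call_id field to carry these messages.

```python
import json
from typing import Any, Callable, Dict, List


def run_tool_calls(tool_calls: List[Dict[str, Any]],
                   tools: Dict[str, Callable[..., Any]]) -> List[Dict[str, str]]:
    """Execute each requested tool and build the 'tool' messages to send back."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            args = json.loads(call["function"]["arguments"] or "{}")
            output = tools[name](**args)
        except Exception as exc:
            # Report the failure to the model instead of crashing the conversation.
            output = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output, ensure_ascii=False),
        })
    return messages


tools = {
    "get_weather": lambda city, date=None: {"temperature": 25, "condition": "sunny"},
}
calls = [
    {"id": "call_1",
     "function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'}},
    {"id": "call_2",
     "function": {"name": "get_stock", "arguments": "{}"}},  # not registered
]
tool_messages = run_tool_calls(calls, tools)
for message in tool_messages:
    print(message)
```

Returning errors as tool output lets the model recover, for example by apologizing or trying a different tool, rather than ending the turn with an exception.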
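8. Summary
The core of LLM API engineering is not calling any one API but building an extensible, observable, fault-tolerant unified calling layer.
The BaseLLMClient abstract interface defines the lowest common denominator across LLMs: chat(), achat(), and stream_chat(). OpenAI and DeepSeek share one SDK; Zhipu is integrated through an adapter, and Tongyi Qianwen can be added the same way. The LLMFactory.create() factory decouples business code from specific models: switching models means changing configuration, not code.
Three enhancements matter in production: automatic retries (tenacity with exponential backoff), fallback routing (an automatic switch to a backup when the primary model fails), and cost tracking (token counts and cost recorded for every call). Together they take LLM calls from "it runs" to "it can be operated".
Concurrent calls with asyncio.gather() make multi-model comparison simple, and the model arena is a genuinely useful tool: not a Hello World, but a base that can be deployed for a team's everyday use.
Earlier articles in this column on FastAPI engineering, asynchronous programming, Docker containerized deployment, and end-to-end microservices provide the upstream background for this one, from API design to service deployment.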