Connecting AI Applications to LLMs: Engineering Trade-offs of Direct Connection, Proxy, and Gateway

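When you call an LLM API, there are three routes: connect directly, go through a proxy, or run your own gateway. Each has its trade-offs. This article does not pick a winner. It covers where each option is reasonable, where it bites, and what extra engineering each one requires.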

I. What actually differs among the three paths

First, the concepts. Suppose you want to call GPT-4o:

Option A: app → api.openai.com                         (direct)
Option B: app → self-hosted proxy → api.openai.com     (proxy forwarding)
Option C: app → self-hosted gateway → many upstreams   (gateway routing)

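The core difference is not how many hops sit in between, but where you put the complexity:

Dimension	Direct	Proxy	Gateway
Where complexity lives	In application code	In the proxy service	In the gateway service
Multi-model support	Switch in your own code	Proxy can translate protocols	Gateway routes and translates protocols
Failover	Retries in the application	Proxy can switch upstreams	Gateway load-balances across channels
Usage accounting	Build it yourself	Proxy can aggregate it	Built-in billing module
Operations cost	Lowest	Medium	Highest

There is no silver bullet. The sections below take each option apart.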

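II. Option A: Direct connection, the simplest path with the lowest ceiling

When it fits

Your servers can reach the target API reliably (for example, they are deployed overseas)
You use models from a single vendor
Call volume is small and you do not need elaborate billing or monitoring

Code example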

python

from openai import OpenAI

client = OpenAI(
    api_key="your-key",
    # base_url is omitted, so the SDK defaults to api.openai.com
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

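Engineering problems you handle yourself with a direct connection

1. Unstable network

When servers in mainland China call an overseas API directly, latency fluctuates widely. The application has to handle timeouts and retries itself: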

python

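import asyncio
from openai import AsyncOpenAI, APITimeoutError, APIConnectionError

# 30-second timeout per request. The SDK also retries on its own by default;
# pass max_retries=0 here if this loop should be the only retry layer.
client = AsyncOpenAI(api_key="your-key", timeout=30.0)

async def robust_chat(model, messages, max_retries=3):
    """Retry timeouts and connection errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return await client.chat.completions.create(
                model=model, messages=messages
            )
        except (APITimeoutError, APIConnectionError):
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            wait = 2 ** attempt  # 1s, then 2s with the default max_retries=3
            await asyncio.sleep(wait)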
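
Fixed exponential waits mean that every client that failed at the same moment also retries at the same moment. A common refinement, which the snippet above does not use, is to randomize each wait ("full jitter") and cap it. The sketch below only computes the wait schedule; `backoff_delays`, `base`, and `cap` are illustrative names, not part of any SDK.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=30.0, rng=None):
    """Return the waits (in seconds) between attempts, using full jitter.

    With max_retries attempts there are max_retries - 1 waits. Each wait is
    drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries - 1):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Example: 4 attempts -> 3 waits, bounded by 1s, 2s and 4s respectively.
schedule = backoff_delays(4, rng=random.Random(0))
```

To use it in a retry loop, sleep for `schedule[attempt]` instead of `2 ** attempt`.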

2. Switching between models

If you use both GPT and Claude, you have to maintain two clients:

python

from openai import AsyncOpenAI
import anthropic

openai_client = AsyncOpenAI(api_key="openai-key")
anthropic_client = anthropic.AsyncAnthropic(api_key="anthropic-key")

async def chat(model, messages):
    if model.startswith("gpt"):
        return await openai_client.chat.completions.create(
            model=model, messages=messages
        )
    elif model.startswith("claude"):
        # Anthropic takes the system prompt as a separate field, so pull
        # system messages out of the list
        system = ""
        user_messages = []
        for msg in messages:
            if msg["role"] == "system":
                system += msg["content"]
            else:
                user_messages.append(msg)
        return await anthropic_client.messages.create(
            model=model,
            system=system,
            messages=user_messages,
            max_tokens=2000  # required by the Anthropic API
        )
    # Fail loudly instead of silently returning None
    raise ValueError(f"Unsupported model: {model}")

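Note the differences between the Anthropic and OpenAI request formats (a separate system field, a required max_tokens). With a direct connection you write this adaptation logic yourself. The two SDKs also return differently shaped responses, so callers of chat() have to handle both.

3. Usage accounting

A direct connection gives you no unified usage dashboard across vendors, so you have to record usage yourself: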

python

from collections import defaultdict

# In-memory counters per model; they reset when the process restarts
usage_log = defaultdict(lambda: {"input": 0, "output": 0})

async def tracked_chat(model, messages):
    # `client` is the AsyncOpenAI instance created in the retry example
    response = await client.chat.completions.create(
        model=model, messages=messages
    )
    usage_log[model]["input"] += response.usage.prompt_tokens
    usage_log[model]["output"] += response.usage.completion_tokens
    return response

def get_usage_report():
    return {m: dict(u) for m, u in usage_log.items()}
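
Token counts only become a bill once they are multiplied by a price. Here is a minimal sketch of that step, assuming a report shaped like the output of get_usage_report(). The model names and prices are placeholders invented for illustration, not real rates, and `estimate_cost` is a hypothetical helper.

```python
# Placeholder prices in USD per 1M tokens. These numbers are made up for
# illustration; replace them with the rates from your provider's price list.
PRICES_PER_MILLION = {
    "model-a": {"input": 2.00, "output": 8.00},
    "model-b": {"input": 0.50, "output": 1.50},
}


def estimate_cost(usage_report, prices=PRICES_PER_MILLION):
    """Convert {model: {"input": n, "output": n}} token counts into USD."""
    costs = {}
    for model, tokens in usage_report.items():
        price = prices.get(model)
        if price is None:
            costs[model] = None  # unknown model: surface it, do not guess
            continue
        costs[model] = (
            tokens["input"] * price["input"]
            + tokens["output"] * price["output"]
        ) / 1_000_000
    return costs


report = {"model-a": {"input": 1_000_000, "output": 500_000},
          "model-x": {"input": 10, "output": 10}}
costs = estimate_cost(report)
```

Returning None for a model with no price entry is a deliberate choice: a missing price is easier to notice than a silently wrong total.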

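Limits of a direct connection

Network problems can only be papered over with retries; you cannot switch to another channel
Multi-model adaptation code leaks into business logic
There is no single entry point for billing and monitoring

As call volume grows, these limits push you toward a proxy or a gateway.

III. Option B: Self-hosted proxy, a middle layer for protocol adaptation

When it fits

You need to reach APIs from several vendors
You want network handling and protocol translation out of the business code
You do not need elaborate multi-channel load balancing

Architecture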

app → self-hosted proxy → OpenAI / Anthropic / Google
         │
         ├─ protocol translation (unified to the OpenAI format)
         ├─ timeouts and retries
         └─ usage logging

Example proxy service

A lightweight proxy written with FastAPI:

python

import logging
import time

import anthropic
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from openai import AsyncOpenAI

app = FastAPI()
logger = logging.getLogger(__name__)

# One client per vendor
clients = {
    "openai": AsyncOpenAI(api_key="openai-key"),
    "anthropic": anthropic.AsyncAnthropic(api_key="anthropic-key"),
}

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request):
    """Single entry point; routes to a vendor based on the model field."""
    body = await request.json()
    model = body["model"]
    messages = body["messages"]

    start = time.time()

    if model.startswith("gpt"):
        # Already in OpenAI format: forward as is
        response = await clients["openai"].chat.completions.create(
            model=model, messages=messages
        )
        result = response.model_dump()

    elif model.startswith("claude"):
        # Convert to the Anthropic format
        system = ""
        user_msgs = []
        for msg in messages:
            if msg["role"] == "system":
                system += msg["content"]
            else:
                user_msgs.append(msg)

        response = await clients["anthropic"].messages.create(
            model=model,
            system=system,
            messages=user_msgs,
            max_tokens=body.get("max_tokens", 2000)
        )
        # Convert back to the OpenAI format. convert_to_openai_format is a
        # helper you must supply; it is not defined in this snippet.
        result = convert_to_openai_format(response)

    else:
        # Unknown model: reject instead of hitting an undefined `result`
        return JSONResponse(
            status_code=400,
            content={"error": {"message": f"Unsupported model: {model}",
                               "type": "invalid_request_error"}}
        )

    # Log latency and usage
    latency = time.time() - start
    logger.info(f"model={model} latency={latency:.2f}s tokens={result.get('usage')}")

    return JSONResponse(result)
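
The proxy calls convert_to_openai_format without showing it. One possible shape for that helper is sketched below. It assumes the Anthropic response exposes `id`, `model`, `content` (a list of blocks), `stop_reason`, and `usage.input_tokens` / `usage.output_tokens`, and it handles text blocks only. The stop-reason mapping is this sketch's own choice, and the demo uses a stand-in object rather than a real API response.

```python
import time
from types import SimpleNamespace

# Anthropic stop_reason -> OpenAI finish_reason (mapping chosen for this sketch)
_FINISH_REASON = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def convert_to_openai_format(response):
    """Map an Anthropic Messages response onto the chat.completion shape."""
    # Concatenate text blocks; other block types (e.g. tool_use) are ignored.
    text = "".join(
        block.text for block in response.content
        if getattr(block, "type", None) == "text"
    )
    prompt = response.usage.input_tokens
    completion = response.usage.output_tokens
    return {
        "id": response.id,
        "object": "chat.completion",
        "created": int(time.time()),
        "model": response.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": _FINISH_REASON.get(response.stop_reason, "stop"),
        }],
        "usage": {
            "prompt_tokens": prompt,
            "completion_tokens": completion,
            "total_tokens": prompt + completion,
        },
    }


# Stand-in object with the same attributes as an Anthropic response
fake = SimpleNamespace(
    id="msg_123", model="claude-example", stop_reason="max_tokens",
    content=[SimpleNamespace(type="text", text="Hello")],
    usage=SimpleNamespace(input_tokens=12, output_tokens=3),
)
converted = convert_to_openai_format(fake)
```

Tool calls and streamed responses would need their own conversion paths; this sketch leaves both out.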

Advantages of the proxy approach

Business code speaks one OpenAI-style format and no longer cares which vendor sits behind it:

python

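# Application code: does not care whether the backend is GPT or Claude
import httpx

async def chat(model, messages):
    # httpx defaults to a 5-second timeout, which LLM calls often exceed
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(
            "http://your-proxy:8000/v1/chat/completions",
            json={"model": model, "messages": messages}
        )
        resp.raise_for_status()  # surface proxy errors instead of returning them as results
        return resp.json()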

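Pitfalls of the proxy approach

Pitfall 1: Forwarding streaming responses

When the proxy forwards an SSE stream, nothing along the path may buffer it. FastAPI's StreamingResponse does not buffer by default, but if Nginx sits in front you must turn off proxy_buffering: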

nginx

location /v1/chat/completions {
    proxy_pass http://proxy:8000;
    proxy_buffering off;        # critical: pass chunks through as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;
    chunked_transfer_encoding on;
}
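
On the proxy side, "not buffering" means yielding each event as soon as it arrives rather than collecting the whole answer first. A minimal sketch of such a relay follows, assuming the upstream is an async iterator of text lines in the OpenAI streaming format, where each event is a `data:` line and the stream ends with `data: [DONE]`. `relay_sse` is a hypothetical name and the upstream here is faked.

```python
import asyncio


async def relay_sse(upstream):
    """Forward SSE lines one event at a time instead of collecting them.

    `upstream` is any async iterator of text lines, e.g. the lines of an
    upstream HTTP response. Stops after the `data: [DONE]` sentinel.
    """
    async for line in upstream:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        yield line + "\n\n"  # an SSE event ends with a blank line
        if line[len("data:"):].strip() == "[DONE]":
            break


async def _demo():
    async def fake_upstream():
        for line in ['data: {"delta": "Hel"}', "", 'data: {"delta": "lo"}',
                     "data: [DONE]", "data: never sent"]:
            yield line
    return [event async for event in relay_sse(fake_upstream())]


events = asyncio.run(_demo())
```

In FastAPI, a generator like this would typically be passed to StreamingResponse with media_type="text/event-stream".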

Pitfall 2: Inconsistent error formats

OpenAI and Anthropic return errors in different formats. The proxy layer needs to normalize error responses:

python

@app.exception_handler(Exception)
async def error_handler(request, exc):
    if isinstance(exc, anthropic.APIStatusError):
        return JSONResponse(
            status_code=exc.status_code,
            content={"error": {"message": exc.message, "type": "api_error"}}
        )
    return JSONResponse(
        status_code=500,
        content={"error": {"message": str(exc), "type": "internal_error"}}
    )

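IV. Option C: Self-hosted gateway, a full API management platform

When it fits

Several teams and applications share AI capabilities
You need multi-channel load balancing and automatic failover
You need billing per application or per user
Call volume is high and availability requirements are strict

The full set of gateway modules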

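                        ┌─ Auth (API key validation + permissions)
                        ├─ Rate limiting (token bucket / sliding window)
request → gateway ──────├─ Routing (model → upstream mapping)
                        ├─ Circuit breaking (cut off when the error rate exceeds a threshold)
                        ├─ Load balancing (weighted round-robin across channels)
                        ├─ Billing (tokens per application/user)
                        ├─ Moderation (input/output content safety)
                        ├─ Caching (reuse results for identical requests)
                        └─ Monitoring (latency / error rate / usage dashboards)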
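
Of the modules listed, rate limiting is the easiest to show in isolation. Below is a minimal token-bucket sketch: `TokenBucket`, `rate`, `capacity`, and the injectable `clock` are names chosen for this illustration. It is single-process and not thread-safe, so a real gateway would more likely keep the bucket state in a shared store.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock: 2 requests/s, burst of 2
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]   # third call exceeds the burst
fake_now[0] = 0.5                            # half a second refills one token
after_wait = bucket.allow()
```

One bucket per API key gives per-client limits. Passing the request's token estimate as `cost` would limit by tokens instead of by request count.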

Core module code

Routing + load balancing:

python

import random
from collections import defaultdict

class GatewayRouter:
    """Gateway routing: maps a model to several upstream channels."""

    def __init__(self):
        # model → list of channels
        self.routes = defaultdict(list)

    def add_route(self, model, channel_name, client, weight=1):
        self.routes[model].append({
            "name": channel_name,
            "client": client,
            "weight": weight,
            "health": 1.0  # health score in [0, 1]
        })

    async def route(self, model, messages, **kwargs):
        """Pick a channel for this model; the caller sends the request.

        messages and kwargs are accepted but not used by the selection.
        """
        channels = self.routes.get(model, [])
        if not channels:
            raise ValueError(f"No channel available for model {model}")

        # Drop channels whose health score is too low
        healthy = [c for c in channels if c["health"] > 0.3]
        if not healthy:
            healthy = channels  # all unhealthy: still have to try something

        # Weighted random choice, with each weight scaled by health
        total_weight = sum(c["weight"] * c["health"] for c in healthy)
        r = random.uniform(0, total_weight)
        cumulative = 0
        for channel in healthy:
            cumulative += channel["weight"] * channel["health"]
            if r <= cumulative:
                return channel

        return healthy[0]  # fallback for floating-point edge cases
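
The router reads a health score for each channel, but nothing in it updates that score. One common approach, assumed here rather than taken from the original, is an exponential moving average over request outcomes. `update_health`, `alpha`, and `floor` are illustrative names.

```python
def update_health(channel, success, alpha=0.2, floor=0.05):
    """Move channel["health"] toward 1.0 on success and toward 0.0 on failure.

    alpha controls how fast old results are forgotten. floor keeps the score
    above zero so the weighted pick still has something to work with when
    every channel is unhealthy.
    """
    target = 1.0 if success else 0.0
    health = (1 - alpha) * channel["health"] + alpha * target
    channel["health"] = max(floor, min(1.0, health))
    return channel["health"]


channel = {"name": "primary", "weight": 1, "health": 1.0}
for _ in range(6):
    update_health(channel, success=False)
after_failures = channel["health"]   # 0.8 ** 6, about 0.26
update_health(channel, success=True)
```

Because the router skips channels whose health is at or below 0.3, a sidelined channel receives no traffic and cannot earn its score back on its own. Something else has to raise it, for example a periodic probe request that calls update_health with the result.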

Circuit breaker:

python

import time
from collections import deque

class CircuitBreaker:
    """Circuit breaker driven by the error rate over a sliding time window."""

    def __init__(self, threshold=0.3, window=60, min_calls=20, cooldown=10):
        self.threshold = threshold  # error rate that trips the breaker
        self.window = window        # sliding-window length in seconds
        self.min_calls = min_calls  # do not judge on fewer samples than this
        self.cooldown = cooldown    # seconds to stay open before probing
        self.records = deque()      # (timestamp, success) pairs
        self.state = "closed"       # closed / open / half_open
        self.opened_at = 0.0

    def record(self, success):
        now = time.time()
        if self.state == "half_open":
            # The probe decides. Success closes the breaker with a clean
            # window, so stale errors cannot trip it again at once.
            # Failure re-opens it and restarts the cooldown.
            if success:
                self.state = "closed"
                self.records.clear()
            else:
                self.state = "open"
                self.opened_at = now
            return
        self.records.append((now, success))
        # Drop records that have fallen out of the window
        while self.records and self.records[0][0] < now - self.window:
            self.records.popleft()
        self._evaluate(now)

    def _evaluate(self, now):
        if len(self.records) < self.min_calls:
            return
        errors = sum(1 for _, ok in self.records if not ok)
        if errors / len(self.records) > self.threshold:
            self.state = "open"
            self.opened_at = now

    def allow(self):
        if self.state == "open":
            if time.time() - self.opened_at > self.cooldown:
                self.state = "half_open"  # let a probe request through
                return True
            return False
        return True  # closed or half_open
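
A breaker only helps if the request path consults it. The sketch below shows one way to wire a breaker into failover, assuming each channel carries an object with the same allow() / record() interface as the breaker above. `call_with_failover`, `send`, and `_StubBreaker` are hypothetical names, and the stub stands in for a real breaker so the example runs on its own.

```python
import asyncio


async def call_with_failover(channels, send):
    """Try channels in order, skipping any whose breaker is open.

    channels: list of {"name": str, "breaker": obj with allow()/record()}
    send:     async callable taking a channel and returning the response
    """
    last_error = None
    for channel in channels:
        breaker = channel["breaker"]
        if not breaker.allow():
            continue  # breaker open: do not even try this channel
        try:
            result = await send(channel)
        except Exception as exc:  # narrow this to network/5xx errors in practice
            breaker.record(False)
            last_error = exc
            continue
        breaker.record(True)
        return channel["name"], result
    raise RuntimeError("all channels failed or are open") from last_error


class _StubBreaker:
    """Minimal stand-in exposing the same allow()/record() interface."""

    def __init__(self, is_open=False):
        self.is_open, self.results = is_open, []

    def allow(self):
        return not self.is_open

    def record(self, success):
        self.results.append(success)


async def _demo():
    async def send(channel):
        if channel["name"] == "flaky":
            raise ConnectionError("upstream reset")
        return "ok"
    chans = [{"name": "tripped", "breaker": _StubBreaker(is_open=True)},
             {"name": "flaky", "breaker": _StubBreaker()},
             {"name": "backup", "breaker": _StubBreaker()}]
    return await call_with_failover(chans, send), chans


(winner, reply), demo_channels = asyncio.run(_demo())
```

The order of `channels` is the failover order, so a weighted selection step would naturally run first to decide it.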

Billing module:

python

from collections import defaultdict

class BillingTracker:
    """Tracks token usage per application and per model."""

    def __init__(self):
        # usage[app_id][model] -> counters
        self.usage = defaultdict(lambda: defaultdict(lambda: {
            "input_tokens": 0, "output_tokens": 0, "requests": 0
        }))

    def record(self, app_id, model, input_tokens, output_tokens):
        stats = self.usage[app_id][model]
        stats["input_tokens"] += input_tokens
        stats["output_tokens"] += output_tokens
        stats["requests"] += 1

    def get_report(self, app_id=None):
        if app_id:
            return dict(self.usage.get(app_id, {}))
        return {app: dict(models) for app, models in self.usage.items()}

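Open-source options

You do not have to write a gateway from scratch; mature open-source projects exist:

Project	Language	Highlights
one-api	Go	Multi-channel management, load balancing, billing
new-api	Go	Enhanced fork of one-api with a better UI
LiteLLM	Python	Lightweight, supports 100+ models
FastGPT	TypeScript	Includes knowledge-base and Agent features

Building on one of these is typically far faster than starting from zero, by a rough estimate as much as 10x.

Costs of the gateway approach

Operations cost: the gateway itself needs monitoring, backups, and upgrades
Added latency: one more network hop, roughly 10-50 ms
Single point of failure: if the gateway goes down every application is affected, so the gateway itself must be highly available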

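V. Side-by-side summary

Dimension	Direct	Proxy	Gateway
Development cost	Low	Medium	High (open-source projects can reduce it)
Operations cost	Low	Medium	High
Multi-model support	Adapt it yourself	Adapted in the proxy	Adapted in the gateway
Failover	Retries in the application	Proxy can switch upstreams	Automatic switching across channels
Usage accounting	Build it yourself	Proxy logs	Built-in billing module
Latency	Lowest	+5-20 ms	+10-50 ms
Best suited to	Prototypes / small scale	Medium scale	Large scale / multiple teams

VI. Migration path

Most teams evolve along the same route: direct → proxy → gateway.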

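Stage 1 (up to 10k calls/day): direct connection, validate the business quickly
    ↓ network instability and multi-model needs appear
Stage 2 (10k-100k calls/day): add a proxy, unify protocols and retries
    ↓ multi-team sharing, billing, and high-availability needs appear
Stage 3 (100k+ calls/day): move to a gateway, a full management platform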

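Do not start with a gateway: premature architectural complexity is more dangerous than technical debt. Get the business working over a direct connection first and migrate once the pain points actually show up. The migration can be gradual: send new requests through the gateway, keep existing ones on the direct connection, and shift traffic over step by step.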
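
One way to make a gradual switch concrete is a percentage rollout keyed on a stable identifier. This is an assumption about how the split could be done, not something the migration advice above prescribes. `use_gateway` and `request_key` are hypothetical names; the key could be a user ID or an application ID.

```python
import hashlib


def use_gateway(request_key, percent):
    """Return True if this key falls inside the rollout percentage.

    Hashing keeps the decision sticky: the same key always lands in the same
    bucket, so raising `percent` only ever moves traffic onto the gateway.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent


keys = [f"user-{i}" for i in range(1000)]
at_10 = {k for k in keys if use_gateway(k, 10)}
at_50 = {k for k in keys if use_gateway(k, 50)}
```

Setting `percent` back to 0 is the rollback: every key returns to the direct path without a deploy.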

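VII. Selection checklist

Can your servers reach the target API reliably?
How many vendors' models do you use?
What is your daily call volume?
Do you need billing per application or per user?
Do you have dedicated operations staff?
How high is your availability target? (99% / 99.9% / 99.99%)
Do multiple teams need to share the service?

The first two questions determine whether a direct connection is enough; the rest determine whether you need a proxy or a gateway.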
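
Read as a rule of thumb, the checklist can be written down as a small function. The version below is one possible reading, not a definitive rule: it reuses the 10k and 100k calls/day boundaries from the migration stages, ignores the staffing and availability questions, and `recommend` and its parameters are names invented for the sketch.

```python
def recommend(reachable, vendors, calls_per_day,
              per_app_billing=False, multi_team=False):
    """Rule-of-thumb mapping from checklist answers to a starting point."""
    # Sharing, per-app billing, or very high volume point to a gateway
    if multi_team or per_app_billing or calls_per_day >= 100_000:
        return "gateway"
    # Unreliable access, several vendors, or medium volume point to a proxy
    if not reachable or vendors > 1 or calls_per_day >= 10_000:
        return "proxy"
    return "direct"


small_app = recommend(reachable=True, vendors=1, calls_per_day=500)
two_vendors = recommend(reachable=True, vendors=2, calls_per_day=50_000)
shared = recommend(reachable=True, vendors=1, calls_per_day=100, multi_team=True)
```

Treat the result as a starting point for discussion; a team without dedicated operations staff may reasonably stay one step below what the function returns.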

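VIII. Summary

The three options are not ranked against each other; they fit different stages:

Direct connection suits the early stage: simple and straightforward, with the complexity living in application code
A proxy suits medium scale: it pulls protocol adaptation and network handling out of the business code
A gateway suits large-scale and multi-team settings: it provides full management capabilities

The core principle of engineering selection: use the simplest option that fits the current stage and keep migration possible. Do not introduce a complex architecture early just because you "might need it later".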