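AI API Call Optimization in Practice

Preface

💡 Pain points: Do your AI API calls keep failing? Do you hit rate limits all the time? Are costs staying high? Are slow responses hurting the user experience?

🎯 Solution: learn to optimize AI API calls, from retry mechanisms and rate limiting through caching strategies and cost optimization.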

Overview of AI API call optimization:

AI API call optimization
    Retry mechanisms: exponential backoff, circuit breaker, jitter
    Rate limiting: token bucket, leaky bucket, sliding window
    Caching strategies: result cache, semantic cache, distributed cache
    Cost optimization: model selection, Batch API, prompt optimization
    Monitoring and alerting: token statistics, cost tracking, performance monitoring

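Common problems and solutions:

Problem	Cause	Solution
Failed calls	Network errors / server errors	Retry mechanisms
Rate-limit errors	RPM/TPM limits exceeded	Client-side rate limiting
High cost	Expensive models / heavy token usage	Cost optimization
Slow responses	Network latency / slow model inference	Caching / streaming responses
No monitoring	Lack of observability	Monitoring and alerting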

1. Retry Mechanisms

1.1 Exponential Backoff Retry

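# ===== Exponential backoff retry =====

import functools
import random
import time

# Import exceptions from the public package, not the private openai._exceptions module
from openai import (
    APIConnectionError,
    InternalServerError,
    OpenAI,
    RateLimitError,
)

class RetryConfig:
    """Retry configuration."""
    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_factor: float = 2.0,
        jitter: bool = True
    ):
        self.max_retries = max_retries        # total attempts, including the first call
        self.base_delay = base_delay          # delay before the first retry (seconds)
        self.max_delay = max_delay            # cap applied before jitter (seconds)
        self.backoff_factor = backoff_factor  # multiplier applied per attempt
        self.jitter = jitter                  # whether to randomize the delay

# Transient errors worth retrying; any other exception propagates immediately
RETRYABLE_ERRORS = (RateLimitError, APIConnectionError, InternalServerError)

def exponential_backoff_retry(func=None, *, retry_config: RetryConfig = None):
    """Exponential backoff retry decorator.

    Works bare (@exponential_backoff_retry) or with a config
    (@exponential_backoff_retry(retry_config=RetryConfig(max_retries=5))).
    """
    config = retry_config or RetryConfig()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(config.max_retries):
                try:
                    return fn(*args, **kwargs)
                except RETRYABLE_ERRORS as e:
                    if attempt == config.max_retries - 1:
                        raise  # last attempt failed; re-raise the original error

                    # Compute the delay: base_delay * backoff_factor ** attempt, capped
                    delay = config.base_delay * (config.backoff_factor ** attempt)
                    delay = min(delay, config.max_delay)

                    # Add 0-100% jitter so concurrent clients do not retry in lockstep
                    if config.jitter:
                        delay = delay * (1 + random.random())

                    print(f"Error: {e}")
                    print(f"Retrying in {delay:.2f} s (retry {attempt + 1})...")
                    time.sleep(delay)

            # Only reachable when max_retries <= 0
            raise RuntimeError("Maximum number of retries reached")
        return wrapper

    # Bare usage passes the function directly; parameterized usage passes func=None
    return decorator(func) if func is not None else decorator

# Use the decorator
# Note: the OpenAI SDK client also retries some errors on its own (max_retries
# client option); pass OpenAI(max_retries=0) if you do not want retries to stack.
@exponential_backoff_retry
def call_openai_api(messages: list):
    """Call the OpenAI API."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=1024
    )
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    result = call_openai_api(
        [{"role": "user", "content": "Hello!"}]
    )
    print(result)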
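
To make the delay formula concrete, the following sketch prints a backoff schedule without calling any API. The helper name `backoff_delays` and its `rng` parameter are illustrative assumptions, not part of the code above. When `jitter=True` it uses "full jitter" (a uniform draw between 0 and the capped delay), a common variant that, unlike the additive 0-100% jitter above, never exceeds `max_delay`.

```python
import random

def backoff_delays(retries, base_delay=1.0, backoff_factor=2.0,
                   max_delay=60.0, jitter=False, rng=random.random):
    """Return the sleep time before each of `retries` retries (hypothetical helper)."""
    delays = []
    for attempt in range(retries):
        # Exponential growth, capped at max_delay
        capped = min(base_delay * backoff_factor ** attempt, max_delay)
        # Full jitter: pick a point in [0, capped) so the cap is never exceeded
        delays.append(capped * rng() if jitter else capped)
    return delays

if __name__ == "__main__":
    print(backoff_delays(8))               # deterministic schedule
    print(backoff_delays(8, jitter=True))  # randomized schedule
```

With the defaults, the deterministic schedule is 1, 2, 4, 8, 16, 32, 60, 60 seconds: the cap takes over from the seventh retry onward.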

1.2 Circuit Breaker Pattern

# ===== Circuit breaker pattern =====

import time
from dataclasses import dataclass
from enum import Enum

from openai import OpenAI

class CircuitState(Enum):
    """Circuit breaker state."""
    CLOSED = "closed"        # closed: requests pass through normally
    OPEN = "open"            # open: requests are rejected (tripped)
    HALF_OPEN = "half_open"  # half-open: trial requests test for recovery

@dataclass
class CircuitBreakerConfig:
    """Circuit breaker configuration."""
    failure_threshold: int = 5          # consecutive failures before the breaker opens
    success_threshold: int = 2          # successes required in half-open state to close
    timeout: float = 60.0               # call timeout in seconds (not enforced in this example)
    reset_timeout: float = 30.0         # seconds to stay open before trying half-open

class CircuitBreaker:
    """Circuit breaker (not thread-safe; add a lock for concurrent use)."""

    def __init__(self, config: CircuitBreakerConfig = None):
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        """Run func under circuit breaker protection."""
        if self.state == CircuitState.OPEN:
            # Check whether it is time to attempt recovery
            if time.monotonic() - self.last_failure_time > self.config.reset_timeout:
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                print("Circuit breaker is now half-open")
            else:
                raise RuntimeError("Circuit breaker is open; request rejected")

        try:
            result = func(*args, **kwargs)
        except Exception:
            # Every exception counts as a failure here; in practice you may want
            # to count only transient errors (timeouts, 429, 5xx)
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        """Success callback."""
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.config.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                print("Circuit breaker closed (recovered)")
        else:
            self.failure_count = 0

    def _on_failure(self):
        """Failure callback."""
        self.failure_count += 1
        self.last_failure_time = time.monotonic()

        # A failed trial call in half-open state reopens the breaker immediately
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.config.failure_threshold):
            self.state = CircuitState.OPEN
            print(f"Circuit breaker opened (failures: {self.failure_count})")

# Use the circuit breaker
breaker = CircuitBreaker()

def call_api_with_circuit_breaker(messages: list):
    """API call protected by the circuit breaker."""
    def _call():
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return response.choices[0].message.content

    return breaker.call(_call)

# Test
if __name__ == '__main__':
    for i in range(10):
        try:
            result = call_api_with_circuit_breaker(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
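
The state transitions are easier to follow with a simulated clock instead of real waiting. The sketch below is a deliberately compact breaker, not the class above: `MiniBreaker`, `run_demo`, and the injectable `clock` parameter are illustrative assumptions, and it closes after a single successful trial call (that is, a success threshold of 1).

```python
import time

class MiniBreaker:
    """Compact breaker with an injectable clock (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open")
            self.state = "half_open"  # reset window elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, opens the circuit
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result

def run_demo():
    now = [0.0]        # fake clock, so the demo needs no real sleeping
    healthy = [False]  # whether the simulated upstream is working
    breaker = MiniBreaker(failure_threshold=3, reset_timeout=30.0,
                          clock=lambda: now[0])

    def flaky():
        if not healthy[0]:
            raise ConnectionError("upstream down")
        return "ok"

    states = []
    for _ in range(4):  # three failures open the circuit; the fourth call is rejected
        try:
            breaker.call(flaky)
        except Exception:
            pass
        states.append(breaker.state)

    now[0] += 31        # move past reset_timeout
    healthy[0] = True   # the upstream has recovered
    breaker.call(flaky) # the trial call succeeds and closes the circuit
    states.append(breaker.state)
    return states

if __name__ == "__main__":
    print(run_demo())
```

Injecting the clock is a design choice for testability: the same transitions can be checked in a unit test without waiting for the real reset timeout.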

1.3 Retry Best Practices

# ===== Retry best practices =====

import functools
import random
import time

from openai import APIConnectionError, InternalServerError, OpenAI, RateLimitError

# Only transient errors are retried; everything else propagates immediately
RETRYABLE_ERRORS = (RateLimitError, APIConnectionError, InternalServerError)

class RetryStrategy:
    """Retry strategies."""

    @staticmethod
    def exponential_backoff(
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_factor: float = 2.0,
        jitter: bool = True
    ):
        """Exponential backoff."""
        def decorator(func):
            @functools.wraps(func)  # keep the wrapped function's name and docstring
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except RETRYABLE_ERRORS:
                        if attempt == max_retries - 1:
                            raise

                        delay = min(base_delay * (backoff_factor ** attempt), max_delay)
                        if jitter:
                            delay = delay * (1 + random.random())

                        print(f"Retry {attempt + 1}/{max_retries} in {delay:.2f} s...")
                        time.sleep(delay)

                # Only reachable when max_retries <= 0
                raise RuntimeError("Maximum number of retries reached")
            return wrapper
        return decorator

    @staticmethod
    def fixed_delay(
        max_retries: int = 3,
        delay: float = 1.0
    ):
        """Fixed delay."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except RETRYABLE_ERRORS:
                        if attempt == max_retries - 1:
                            raise

                        print(f"Retry {attempt + 1}/{max_retries} in {delay} s...")
                        time.sleep(delay)

                # Only reachable when max_retries <= 0
                raise RuntimeError("Maximum number of retries reached")
            return wrapper
        return decorator

# Usage
class OpenAIService:
    """OpenAI service (with retries)."""

    @RetryStrategy.exponential_backoff(max_retries=3, base_delay=1.0, jitter=True)
    def chat(self, messages: list):
        """Chat (with exponential backoff retry)."""
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return response.choices[0].message.content

    @RetryStrategy.fixed_delay(max_retries=3, delay=2.0)
    def chat_fixed(self, messages: list):
        """Chat (with fixed-delay retry)."""
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return response.choices[0].message.content

2. Rate Limiting

2.1 Token Bucket Algorithm

# ===== Token bucket algorithm =====

import time
from threading import Lock

from openai import OpenAI

class TokenBucket:
    """Token bucket."""

    def __init__(self, capacity: int, refill_rate: float):
        """
        Initialize the token bucket.

        Args:
            capacity: bucket capacity (maximum number of tokens)
            refill_rate: refill rate (tokens per second)
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill_time = time.monotonic()
        self.lock = Lock()

    def _refill(self):
        """Refill tokens (call with the lock held)."""
        now = time.monotonic()
        elapsed = now - self.last_refill_time

        # Tokens earned since the last refill, capped at capacity
        refill_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + refill_tokens)
        self.last_refill_time = now

    def consume(self, tokens: int = 1) -> bool:
        """
        Consume tokens without blocking.

        Args:
            tokens: number of tokens required

        Returns:
            bool: whether the tokens were obtained
        """
        with self.lock:
            self._refill()

            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_for_tokens(self, tokens: int = 1, timeout: float = None) -> bool:
        """
        Wait for tokens (blocking).

        Args:
            tokens: number of tokens required
            timeout: timeout in seconds (None waits indefinitely)

        Returns:
            bool: whether the tokens were obtained
        """
        if tokens > self.capacity:
            # Such a request could never succeed and would loop forever
            raise ValueError("tokens exceeds bucket capacity")

        start_time = time.monotonic()

        while True:
            if self.consume(tokens):
                return True

            if timeout is not None and (time.monotonic() - start_time) > timeout:
                return False

            # Estimate how long until enough tokens have been refilled
            with self.lock:
                deficit = tokens - self.tokens
                wait_time = deficit / self.refill_rate

            # Sleep at most 0.1 s per iteration; never pass a negative value
            time.sleep(max(0.0, min(wait_time, 0.1)))

# Rate limiting with the token bucket
bucket = TokenBucket(capacity=10, refill_rate=2.0)  # capacity 10, rate 2 tokens/s

def rate_limited_api_call(messages: list):
    """Rate-limited API call."""
    # Wait for a token
    if not bucket.wait_for_tokens(tokens=1, timeout=5.0):
        raise TimeoutError("Timed out waiting for a token")

    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
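
The table at the top of this article mentions both RPM (requests per minute) and TPM (tokens per minute) limits, whereas the bucket above meters requests only. One possible extension, sketched here under assumptions, pairs two buckets and admits a request only when both can pay for it. `SimpleBucket`, `DualLimiter`, and the caller-supplied `estimated_tokens` value are hypothetical names; the sketch is single-threaded and takes an injectable clock so it can be exercised without sleeping.

```python
import time

class SimpleBucket:
    """Minimal token bucket with an injectable clock (illustrative sketch)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def available(self):
        """Refill according to elapsed time and return the current balance."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        return self.tokens

    def take(self, n):
        self.tokens -= n

class DualLimiter:
    """Admit a request only if both the RPM and the TPM bucket can cover it."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.requests = SimpleBucket(rpm, rpm / 60.0, clock)
        self.tokens = SimpleBucket(tpm, tpm / 60.0, clock)

    def try_acquire(self, estimated_tokens):
        # Check both buckets before taking from either, so a rejection costs nothing
        if (self.requests.available() >= 1
                and self.tokens.available() >= estimated_tokens):
            self.requests.take(1)
            self.tokens.take(estimated_tokens)
            return True
        return False

if __name__ == "__main__":
    now = [0.0]
    limiter = DualLimiter(rpm=60, tpm=1000, clock=lambda: now[0])
    print(limiter.try_acquire(600))  # fits within the initial 1000-token budget
    print(limiter.try_acquire(600))  # only 400 tokens are left
    now[0] += 13                     # about 216 tokens are refilled in 13 s
    print(limiter.try_acquire(600))
```

The token estimate has to come from the caller (for example, a tokenizer count of the prompt plus `max_tokens`), since the real usage is only known after the response arrives.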

2.2 Leaky Bucket Algorithm

# ===== Leaky bucket algorithm =====

import time
from queue import Empty, Full, Queue
from threading import Thread

class LeakyBucket:
    """Leaky bucket."""

    def __init__(self, capacity: int, leak_rate: float):
        """
        Initialize the leaky bucket.

        Args:
            capacity: bucket capacity (maximum queued requests)
            leak_rate: leak rate (requests per second)
        """
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = Queue(maxsize=capacity)  # Queue is already thread-safe
        self.running = True

        # Start the leak thread
        self.leak_thread = Thread(target=self._leak, daemon=True)
        self.leak_thread.start()

    def _leak(self):
        """Drain queued requests at the leak rate."""
        while self.running:
            try:
                request = self.queue.get_nowait()
            except Empty:
                pass
            else:
                # Process the request
                self._process_request(request)

            # Wait according to the leak rate
            time.sleep(1.0 / self.leak_rate)

    def _process_request(self, request):
        """Process a request."""
        # A real implementation would call the API here
        print(f"Processing request: {request}")

    def add_request(self, request) -> bool:
        """
        Add a request.

        Returns:
            bool: whether the request was queued
        """
        try:
            self.queue.put(request, block=False)
            return True
        except Full:
            return False

    def stop(self):
        """Stop the leaky bucket."""
        self.running = False

# Rate limiting with the leaky bucket
bucket = LeakyBucket(capacity=10, leak_rate=2.0)  # capacity 10, rate 2 requests/s

def rate_limited_api_call_leaky(messages: list):
    """Rate-limited API call (leaky bucket)."""
    request_id = f"req_{time.time()}"

    if not bucket.add_request(request_id):
        raise RuntimeError("Leaky bucket is full; request dropped")

    # The request is queued and waiting to be processed
    print(f"Request {request_id} queued")

    # A real implementation would wait for the result;
    # this simplified version returns immediately
    return f"Request {request_id} queued"

# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call_leaky(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: {result}")
        except Exception as e:
            print(f"Request {i}: failed - {e}")

    time.sleep(5)  # wait for processing to finish
    bucket.stop()
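
The leaky bucket example returns as soon as a request is queued and leaves "wait for the result" as an exercise. One way to close that gap, sketched here as an assumption rather than the author's design, is to queue a callable together with a `concurrent.futures.Future` and let the drain thread fulfil it. `LeakyBucketExecutor` and its method names are hypothetical.

```python
import queue
import threading
from concurrent.futures import Future

class LeakyBucketExecutor:
    """Leaky bucket that runs queued callables at a fixed rate and returns Futures."""

    def __init__(self, capacity, leak_rate):
        self._queue = queue.Queue(maxsize=capacity)
        self._interval = 1.0 / leak_rate
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, func, *args, **kwargs):
        """Queue a call; raises queue.Full when the bucket overflows."""
        future = Future()
        self._queue.put_nowait((future, func, args, kwargs))
        return future

    def _drain(self):
        while not self._stop.is_set():
            try:
                future, func, args, kwargs = self._queue.get(timeout=0.05)
            except queue.Empty:
                continue
            try:
                future.set_result(func(*args, **kwargs))
            except Exception as exc:
                # Hand the error to whoever is waiting on the future
                future.set_exception(exc)
            # Pace the next request; wakes early if stop() is called
            self._stop.wait(self._interval)

    def stop(self):
        self._stop.set()
        self._worker.join()

if __name__ == "__main__":
    executor = LeakyBucketExecutor(capacity=10, leak_rate=50.0)
    futures = [executor.submit(lambda x=i: x * 2) for i in range(3)]
    print([f.result(timeout=5) for f in futures])
    executor.stop()
```

In a real service the submitted callable would wrap the API call, and the caller would block on `future.result(timeout=...)` or attach a callback with `future.add_done_callback`.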

2.3 Sliding Window Rate Limiting

# ===== Sliding window rate limiting =====

import time
from collections import deque
from threading import Lock

from openai import OpenAI

class SlidingWindowRateLimiter:
    """Sliding window rate limiter."""

    def __init__(self, max_requests: int, window_size: float):
        """
        Initialize the sliding window rate limiter.

        Args:
            max_requests: maximum number of requests within the window
            window_size: window size (seconds)
        """
        self.max_requests = max_requests
        self.window_size = window_size
        self.requests = deque()  # timestamps of accepted requests, oldest first
        self.lock = Lock()

    def _drop_expired(self, now: float):
        """Remove timestamps that have left the window (call with the lock held)."""
        while self.requests and self.requests[0] < now - self.window_size:
            self.requests.popleft()

    def allow_request(self) -> bool:
        """
        Check whether a request is allowed.

        Returns:
            bool: whether the request is allowed
        """
        with self.lock:
            now = time.monotonic()

            # Remove expired requests
            self._drop_expired(now)

            # Check whether the limit has been reached
            if len(self.requests) >= self.max_requests:
                return False

            # Record the new request
            self.requests.append(now)
            return True

    def get_wait_time(self) -> float:
        """
        Get the time to wait before the next request is allowed.

        Returns:
            float: wait time in seconds; 0 means no wait is needed
        """
        with self.lock:
            now = time.monotonic()
            self._drop_expired(now)

            # Below the limit: no need to wait
            if len(self.requests) < self.max_requests:
                return 0.0

            # Wait until the oldest request leaves the window
            wait_time = (self.requests[0] + self.window_size) - now
            return max(0.0, wait_time)

# Rate limiting with the sliding window
limiter = SlidingWindowRateLimiter(max_requests=10, window_size=60.0)  # at most 10 requests per 60 s

def rate_limited_api_call_sliding(messages: list):
    """Rate-limited API call (sliding window)."""
    if not limiter.allow_request():
        wait_time = limiter.get_wait_time()
        raise RuntimeError(f"Rate limited; retry in {wait_time:.2f} s")

    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call_sliding(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
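
The limiter above rejects a request when the window is full and leaves the waiting to the caller. A blocking variant can sleep for exactly the computed wait time and then proceed. The sketch below is one possible shape, with the clock and the sleep function injected so the behavior can be checked without real delays; `BlockingSlidingWindow` and `acquire` are hypothetical names, and the class is single-threaded for brevity.

```python
import time
from collections import deque

class BlockingSlidingWindow:
    """Sliding window limiter whose acquire() waits for a free slot (sketch)."""

    def __init__(self, max_requests, window_size,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_size = window_size
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # timestamps of accepted requests, oldest first

    def _wait_time(self):
        now = self.clock()
        # Drop timestamps that have left the window
        while self.stamps and self.stamps[0] <= now - self.window_size:
            self.stamps.popleft()
        if len(self.stamps) < self.max_requests:
            return 0.0
        # Time until the oldest request leaves the window
        return self.stamps[0] + self.window_size - now

    def acquire(self):
        """Block until a slot is free, record the request, return seconds waited."""
        waited = 0.0
        while True:
            wait = self._wait_time()
            if wait <= 0:
                self.stamps.append(self.clock())
                return waited
            self.sleep(wait)
            waited += wait

if __name__ == "__main__":
    limiter = BlockingSlidingWindow(max_requests=2, window_size=1.0)
    for i in range(4):
        print(f"request {i} waited {limiter.acquire():.2f} s")
```

Whether to block or to fail fast is a design choice: blocking suits background jobs, while an interactive endpoint usually prefers an immediate error carrying the wait time.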

3. Caching Strategies

3.1 Result Cache

# ===== Result cache =====

import hashlib
import json
import time
from threading import Lock
from typing import Any, Dict, Optional

from openai import OpenAI

class ResultCache:
    """Result cache."""

    def __init__(self, ttl: int = 3600):
        """
        Initialize the cache.

        Args:
            ttl: time to live for cache entries (seconds)
        """
        self.cache: Dict[str, Dict[str, Any]] = {}
        self.ttl = ttl
        self.lock = Lock()

    def _generate_key(self, messages: list, model: str, **kwargs) -> str:
        """Generate the cache key."""
        content = {
            "messages": messages,
            "model": model,
            **kwargs
        }
        # sort_keys makes the key independent of dict ordering
        content_str = json.dumps(content, sort_keys=True)
        # MD5 serves only as a fast fingerprint here, not as a security measure
        return hashlib.md5(content_str.encode()).hexdigest()

    def get(self, messages: list, model: str, **kwargs) -> Optional[str]:
        """
        Get a cached result.

        Returns:
            The cached result, or None if there is none
        """
        key = self._generate_key(messages, model, **kwargs)

        with self.lock:
            if key not in self.cache:
                return None

            entry = self.cache[key]

            # Check whether the entry has expired
            if time.time() - entry["timestamp"] > self.ttl:
                del self.cache[key]
                return None

            return entry["result"]

    def set(self, messages: list, model: str, result: str, **kwargs):
        """Store a result in the cache."""
        key = self._generate_key(messages, model, **kwargs)

        with self.lock:
            self.cache[key] = {
                "result": result,
                "timestamp": time.time()
            }

    def clear(self):
        """Clear the cache."""
        with self.lock:
            self.cache.clear()

    def remove_expired(self):
        """Remove expired entries."""
        now = time.time()
        with self.lock:
            expired_keys = [
                k for k, v in self.cache.items()
                if now - v["timestamp"] > self.ttl
            ]
            for k in expired_keys:
                del self.cache[k]

# Use the result cache
cache = ResultCache(ttl=3600)  # cache for 1 hour

def cached_api_call(messages: list, model: str = "gpt-4o-mini"):
    """API call with caching."""
    # Check the cache ("is not None" so that an empty-string result still counts as a hit)
    cached = cache.get(messages, model)
    if cached is not None:
        print("Using cache")
        return cached

    # Call the API
    print("Calling API")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    result = response.choices[0].message.content

    # Write to the cache
    cache.set(messages, model, result)

    return result

# Test
if __name__ == '__main__':
    messages = [{"role": "user", "content": "Hello!"}]

    # First call (calls the API)
    result1 = cached_api_call(messages)
    print(f"Result 1: {result1[:50]}...")

    # Second call (served from the cache)
    result2 = cached_api_call(messages)
    print(f"Result 2: {result2[:50]}...")
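
The in-memory cache above grows without bound until entries expire. A common complement, sketched here as an assumption rather than something the original code does, is to cap the number of entries and evict the least recently used one. `BoundedTTLCache` is a hypothetical name; it is keyed by an already-computed string key (such as the hash produced by `_generate_key`) and is not thread-safe.

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """TTL cache with LRU eviction once max_entries is exceeded (sketch)."""

    def __init__(self, max_entries=1000, ttl=3600.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value), least recent first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]      # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

if __name__ == "__main__":
    cache = BoundedTTLCache(max_entries=2, ttl=60.0)
    cache.set("a", "answer A")
    cache.set("b", "answer B")
    cache.get("a")              # touch "a" so that "b" becomes the eviction candidate
    cache.set("c", "answer C")  # evicts "b"
    print(cache.get("a"), cache.get("b"), cache.get("c"))
```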

3.2 Semantic Cache

# ===== Semantic cache =====

import time
from threading import Lock
from typing import Any, Dict, List, Optional

import numpy as np
from openai import OpenAI

class SemanticCache:
    """Semantic cache (based on vector similarity)."""

    def __init__(self, similarity_threshold: float = 0.95, ttl: int = 3600):
        """
        Initialize the semantic cache.

        Args:
            similarity_threshold: similarity threshold (0-1)
            ttl: time to live for cache entries (seconds)
        """
        self.cache: List[Dict[str, Any]] = []
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.lock = Lock()

    def _get_embedding(self, text: str) -> List[float]:
        """
        Get the text vector (simplified version).

        A real implementation should use the OpenAI/Azure Embeddings API.
        """
        # Simplification: use character frequencies as the vector.
        # This captures spelling overlap, not meaning.
        vector = [0.0] * 128
        for char in text:
            idx = ord(char) % 128
            vector[idx] += 1.0

        # L2-normalize
        norm = np.linalg.norm(vector)
        if norm > 0:
            vector = [v / norm for v in vector]

        return vector

    def _cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """Compute cosine similarity.

        Both vectors are already L2-normalized by _get_embedding,
        so the dot product equals the cosine similarity.
        """
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        return dot_product

    def get(self, messages: list) -> Optional[str]:
        """
        Get a cached result (based on semantic similarity).

        Returns:
            The cached result, or None if there is none
        """
        # Extract the user messages
        user_messages = [m["content"] for m in messages if m["role"] == "user"]
        query = " ".join(user_messages)

        # Get the query vector
        query_embedding = self._get_embedding(query)

        with self.lock:
            # Remove expired entries
            now = time.time()
            self.cache = [
                c for c in self.cache
                if now - c["timestamp"] < self.ttl
            ]

            # Linear scan: return the first entry at or above the threshold
            for entry in self.cache:
                similarity = self._cosine_similarity(query_embedding, entry["embedding"])

                if similarity >= self.similarity_threshold:
                    print(f"Semantic cache hit (similarity: {similarity:.4f})")
                    return entry["result"]

        return None

    def set(self, messages: list, result: str):
        """Store a result in the cache."""
        # Extract the user messages
        user_messages = [m["content"] for m in messages if m["role"] == "user"]
        query = " ".join(user_messages)

        # Get the vector
        embedding = self._get_embedding(query)

        with self.lock:
            self.cache.append({
                "query": query,
                "embedding": embedding,
                "result": result,
                "timestamp": time.time()
            })

    def clear(self):
        """Clear the cache."""
        with self.lock:
            self.cache.clear()

# Use the semantic cache
semantic_cache = SemanticCache(similarity_threshold=0.95, ttl=3600)

def semantic_cached_api_call(messages: list):
    """API call with semantic caching."""
    # Check the cache
    cached = semantic_cache.get(messages)
    if cached is not None:
        return cached

    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    result = response.choices[0].message.content

    # Write to the cache
    semantic_cache.set(messages, result)

    return result

# Test
if __name__ == '__main__':
    # Similar questions: these should hit the cache with a real embedding model;
    # the toy character-frequency embedding may not reach the 0.95 threshold
    messages1 = [{"role": "user", "content": "What is machine learning?"}]
    messages2 = [{"role": "user", "content": "Please explain machine learning"}]

    result1 = semantic_cached_api_call(messages1)
    print(f"Result 1: {result1[:50]}...")

    result2 = semantic_cached_api_call(messages2)
    print(f"Result 2: {result2[:50]}...")
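
The lookup above returns the first cached entry whose similarity clears the threshold, which is not necessarily the closest one. The sketch below shows a best-match variant using only the standard library; `cosine_similarity` and `best_match` are illustrative helpers, and the two-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query_vec, entries, threshold=0.95):
    """Return the result of the most similar entry at or above threshold, else None.

    entries is a list of (vector, result) pairs.
    """
    best_score, best_result = threshold, None
    for vec, result in entries:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_score, best_result = score, result
    return best_result

if __name__ == "__main__":
    entries = [([1.0, 0.0], "answer A"), ([0.9, 0.1], "answer B")]
    # Both entries clear the threshold, but the second is closer to the query
    print(best_match([1.0, 0.1], entries))
    # Nothing is similar enough to this query
    print(best_match([0.0, 1.0], entries))
```

For more than a few thousand entries, a linear scan like this becomes the bottleneck, and a vector index is the usual replacement.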

3.3 Distributed Cache (Redis)

# ===== Distributed cache (Redis) =====

import hashlib
import json
from typing import Optional

import redis
from openai import OpenAI

class RedisCache:
    """Redis cache."""

    def __init__(self, redis_url: str = "redis://localhost:6379", ttl: int = 3600):
        """
        Initialize the Redis cache.

        Args:
            redis_url: Redis connection URL
            ttl: time to live for cache entries (seconds)
        """
        self.redis = redis.from_url(redis_url)
        self.ttl = ttl

    def _generate_key(self, messages: list, model: str, **kwargs) -> str:
        """Generate the cache key."""
        content = {
            "messages": messages,
            "model": model,
            **kwargs
        }
        content_str = json.dumps(content, sort_keys=True)
        return f"ai_cache:{hashlib.md5(content_str.encode()).hexdigest()}"

    def get(self, messages: list, model: str, **kwargs) -> Optional[str]:
        """Get a cached result."""
        key = self._generate_key(messages, model, **kwargs)

        cached = self.redis.get(key)
        if cached is not None:
            return json.loads(cached)
        return None

    def set(self, messages: list, model: str, result: str, **kwargs):
        """Store a result; Redis expires the key after ttl seconds."""
        key = self._generate_key(messages, model, **kwargs)

        self.redis.setex(
            key,
            self.ttl,
            json.dumps(result)
        )

    def clear(self):
        """Clear the cache."""
        # SCAN iterates incrementally; KEYS would block Redis on a large keyspace
        keys = list(self.redis.scan_iter(match="ai_cache:*"))
        if keys:
            self.redis.delete(*keys)

# Use the Redis cache
redis_cache = RedisCache(redis_url="redis://localhost:6379", ttl=3600)

def redis_cached_api_call(messages: list, model: str = "gpt-4o-mini"):
    """API call with Redis caching."""
    # Check the cache
    cached = redis_cache.get(messages, model)
    if cached is not None:
        print("Using Redis cache")
        return cached

    # Call the API
    print("Calling API")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    result = response.choices[0].message.content

    # Write to the cache
    redis_cache.set(messages, model, result)

    return result

4. Cost Optimization

4.1 Model Selection Strategy

# ===== Model selection strategy =====

from dataclasses import dataclass
from enum import Enum

class TaskComplexity(Enum):
    """Task complexity."""
    LOW = "low"          # simple tasks
    MEDIUM = "medium"    # moderate tasks
    HIGH = "high"        # complex tasks

@dataclass
class ModelConfig:
    """Model configuration."""
    name: str
    input_price: float   # $/1M tokens
    output_price: float  # $/1M tokens
    speed: float         # tokens/second (approx)
    capability: float    # 0-1, capability score

# Model configuration.
# Prices, speeds, and capability scores are illustrative snapshot values;
# check the provider's current pricing page before relying on them.
MODELS = {
    "gpt-4o": ModelConfig(
        name="gpt-4o",
        input_price=5.00,
        output_price=15.00,
        speed=60.0,
        capability=0.95
    ),
    "gpt-4o-mini": ModelConfig(
        name="gpt-4o-mini",
        input_price=0.15,
        output_price=0.60,
        speed=100.0,
        capability=0.80
    ),
    "gpt-3.5-turbo": ModelConfig(
        name="gpt-3.5-turbo",
        input_price=0.50,
        output_price=1.50,
        speed=150.0,
        capability=0.70
    )
}

def select_model(
    task_complexity: TaskComplexity,
    budget_sensitive: bool = False,
    speed_sensitive: bool = False
) -> str:
    """
    Select a model.

    Args:
        task_complexity: task complexity
        budget_sensitive: whether cost is the main concern
        speed_sensitive: whether speed is the main concern

    Returns:
        str: model name
    """
    # Cost takes priority over speed, and both take priority over complexity
    if budget_sensitive:
        return "gpt-4o-mini"   # cheapest entry in MODELS

    if speed_sensitive:
        return "gpt-3.5-turbo"  # highest speed value in MODELS

    if task_complexity == TaskComplexity.HIGH:
        return "gpt-4o"
    elif task_complexity == TaskComplexity.MEDIUM:
        return "gpt-4o-mini"
    else:
        # Note: by the table above, gpt-4o-mini is cheaper than gpt-3.5-turbo
        return "gpt-3.5-turbo"

def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int
) -> float:
    """
    Estimate the cost.

    Args:
        model: model name
        input_tokens: number of input tokens
        output_tokens: number of output tokens

    Returns:
        float: cost (USD)
    """
    if model not in MODELS:
        raise ValueError(f"Unknown model: {model}")

    config = MODELS[model]

    input_cost = (input_tokens / 1_000_000) * config.input_price
    output_cost = (output_tokens / 1_000_000) * config.output_price

    return input_cost + output_cost

# Test
if __name__ == '__main__':
    # Select a model
    model = select_model(
        task_complexity=TaskComplexity.MEDIUM,
        budget_sensitive=True
    )
    print(f"Recommended model: {model}")

    # Estimate the cost
    cost = estimate_cost(
        model="gpt-4o-mini",
        input_tokens=1000,
        output_tokens=500
    )
    print(f"Estimated cost: ${cost:.4f}")
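
Per-request costs look negligible until they are multiplied by traffic. The sketch below projects a monthly bill for a hypothetical workload of 10,000 requests per day, each with 1,000 input and 500 output tokens. The prices are copied from the `MODELS` table above and are therefore snapshot values, not current pricing; `PRICES` and `monthly_cost` are illustrative names.

```python
# USD per 1M tokens as (input, output), copied from the MODELS table above
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-3.5-turbo": (0.50, 1.50),
}

def monthly_cost(model, requests_per_day, input_tokens, output_tokens, days=30):
    """Projected cost in USD for a steady workload (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    per_request = (input_tokens * input_price
                   + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_day * days

if __name__ == "__main__":
    for name in PRICES:
        cost = monthly_cost(name, requests_per_day=10_000,
                            input_tokens=1000, output_tokens=500)
        print(f"{name}: ${cost:,.2f} per month")
```

With these table prices the workload comes to roughly $3,750 per month on gpt-4o, $375 on gpt-3.5-turbo, and $135 on gpt-4o-mini, which is why routing simple tasks to a smaller model tends to be the largest single saving.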

4.2 Prompt Optimization

# ===== Prompt 优化 =====

from openai import OpenAI
import tiktoken

client = OpenAI()

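def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count the tokens in `text` using the tokenizer for `model`"""
    try:
        # Use the encoding that tiktoken maps to this model
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a general encoding (result is an estimate)
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def optimize_prompt(prompt: str, max_tokens: int = 1000, model: str = "gpt-4o-mini") -> str:
    """
    Optimize (compress) a prompt

    Args:
        prompt: the original prompt
        max_tokens: maximum number of tokens to keep
        model: model whose tokenizer is used for counting

    Returns:
        str: the optimized prompt
    """
    # 1. Collapse redundant whitespace (note: this also flattens line breaks)
    optimized = " ".join(prompt.split())

    # 2. Encode once and check the token count
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    token_ids = encoding.encode(optimized)

    if len(token_ids) <= max_tokens:
        return optimized

    # 3. Truncate, keeping the first max_tokens tokens
    return encoding.decode(token_ids[:max_tokens])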

def use_few_shot_examples():
    """使用少量示例（减少解释）"""
    prompt = """分类以下文本的情感（正面/负面/中性）：

示例 1: "这个产品太棒了！" -> 正面
示例 2: "糟糕的体验，不推荐。" -> 负面
示例 3: "这个产品还行吧。" -> 中性

现在分类: "性价比很高，推荐购买。"

只输出分类结果，不要解释。"""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10  # 限制输出
    )
    
    return response.choices[0].message.content

def use_system_prompt():
    """使用系统提示（减少重复）"""
    messages = [
        {"role": "system", "content": "你是一个情感分析助手。只输出：正面/负面/中性"},
        {"role": "user", "content": "这个产品太棒了！"},
        {"role": "assistant", "content": "正面"},
        {"role": "user", "content": "糟糕的体验，不推荐。"},
        {"role": "assistant", "content": "负面"},
        {"role": "user", "content": "性价比很高，推荐购买。"}
    ]
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=10
    )
    
    return response.choices[0].message.content
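One side effect of the whitespace step in `optimize_prompt` above is that it also removes line breaks, which flattens a few-shot prompt into a single line. The following is a minimal sketch of a gentler alternative that keeps single line breaks; `compress_whitespace` is a hypothetical helper, not part of the article's code, and it uses only the standard library.

```python
import re


def compress_whitespace(prompt: str) -> str:
    """Collapse runs of spaces/tabs and extra blank lines, keeping line breaks."""
    # Collapse horizontal whitespace inside each line and trim both ends
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in prompt.splitlines()]
    compact = "\n".join(lines)
    # Three or more newlines (two or more blank lines) become one blank line
    compact = re.sub(r"\n{3,}", "\n\n", compact)
    return compact.strip()


if __name__ == "__main__":
    raw = 'Classify the sentiment:   \n\n\n\nExample 1:\t"great"  -> positive\n'
    print(compress_whitespace(raw))
```

Whether the flattened or the line-preserving form costs fewer tokens depends on the tokenizer, so it is worth measuring both with `count_tokens` on real prompts.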

4.3 Using the Batch API

python

# ===== Using the Batch API =====

from openai import OpenAI
import json  # needed at module level: get_batch_results() calls json.loads
import time

client = OpenAI()

def create_batch():
    """创建批量任务"""
    requests = [
        {
            "custom_id": "request-1",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "1+1=?"}],
                "max_tokens": 1024
            }
        },
        {
            "custom_id": "request-2",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "2+2=?"}],
                "max_tokens": 1024
            }
        }
    ]
    
    batch = client.batches.create(
        input_file_id=upload_requests(requests),
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    
    print(f"Batch ID: {batch.id}")
    print(f"状态: {batch.status}")
    
    return batch.id

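def upload_requests(requests: list) -> str:
    """Upload the request file and return its file ID"""
    import json
    import os
    import tempfile

    # Write the requests to a temporary JSONL file, one request per line
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.jsonl', encoding='utf-8') as f:
        for req in requests:
            f.write(json.dumps(req) + '\n')
        temp_path = f.name

    try:
        # Upload the file
        with open(temp_path, 'rb') as f:
            file = client.files.create(
                file=f,
                purpose="batch"
            )
    finally:
        # Remove the temporary file so it does not pile up on disk
        os.unlink(temp_path)

    return file.id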

def check_batch_status(batch_id: str):
    """查询批量任务状态"""
    batch = client.batches.retrieve(batch_id)
    
    print(f"状态: {batch.status}")
    print(f"总请求数: {batch.request_counts.total}")
    print(f"完成: {batch.request_counts.completed}")
    print(f"失败: {batch.request_counts.failed}")
    
    return batch.status

def wait_for_batch(batch_id: str, check_interval: int = 60) -> str:
    """Wait until the batch job reaches a terminal state and return that status"""
    while True:
        status = check_batch_status(batch_id)

        if status == "completed":
            print("Batch job completed!")
            return status
        elif status in ["failed", "cancelled", "expired"]:
            print(f"Batch job {status}")
            return status

        print(f"Checking again in {check_interval} seconds...")
        time.sleep(check_interval)

def get_batch_results(batch_id: str):
    """获取批量任务结果"""
    batch = client.batches.retrieve(batch_id)
    
    # 下载结果文件
    result_file_id = batch.output_file_id
    if not result_file_id:
        print("没有结果文件")
        return []
    
    result_content = client.files.content(result_file_id)
    
    # 解析结果
    results = []
    for line in result_content.text.split('\n'):
        if line.strip():
            results.append(json.loads(line))
    
    return results

# 完整流程
def batch_processing_example():
    """批量处理示例"""
    # 1. 创建批量任务
    batch_id = create_batch()
    
    # 2. 等待完成
    wait_for_batch(batch_id)
    
    # 3. 获取结果
    results = get_batch_results(batch_id)
    
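    # 4. Process the results; a failed request has no usable response body
    for r in results:
        custom_id = r["custom_id"]
        response = r.get("response") or {}
        if r.get("error") or response.get("status_code") != 200:
            print(f"{custom_id}: failed ({r.get('error')})")
            continue
        content = response["body"]["choices"][0]["message"]["content"]
        print(f"{custom_id}: {content}")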

# 成本对比
def compare_cost_normal_vs_batch(num_requests: int):
    """对比普通 API 和 Batch API 的成本"""
    # 假设每个请求平均 1000 input tokens + 500 output tokens
    input_tokens = 1000 * num_requests
    output_tokens = 500 * num_requests
    
    # 普通 API 成本
    normal_cost = estimate_cost("gpt-4o-mini", input_tokens, output_tokens)
    
    # Batch API 成本（便宜 50%）
    batch_cost = normal_cost * 0.5
    
    print(f"普通 API 成本: ${normal_cost:.2f}")
    print(f"Batch API 成本: ${batch_cost:.2f}")
    print(f"节省: ${normal_cost - batch_cost:.2f}")
    
    return batch_cost
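Results in a batch output file do not necessarily come back in request order, and individual requests can fail while the batch as a whole completes. The sketch below separates successes from failures by `custom_id`. It assumes each output line carries `custom_id`, `response.status_code`, `response.body`, and `error` fields; `split_batch_output` and the `SAMPLE` lines are made up for illustration and can be run without an API key.

```python
import json


def split_batch_output(jsonl_text: str):
    """Return (answers, failures), both keyed by custom_id."""
    answers, failures = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        row = json.loads(line)
        custom_id = row.get("custom_id", "<missing custom_id>")
        response = row.get("response") or {}
        body = response.get("body") or {}
        if row.get("error") or response.get("status_code") != 200:
            # Keep whatever error detail is available for later inspection
            failures[custom_id] = row.get("error") or body.get("error") or "unknown error"
            continue
        answers[custom_id] = body["choices"][0]["message"]["content"]
    return answers, failures


# Made-up sample output: one success and one failed request
SAMPLE = "\n".join([
    json.dumps({"custom_id": "request-1", "error": None,
                "response": {"status_code": 200,
                             "body": {"choices": [{"message": {"content": "2"}}]}}}),
    json.dumps({"custom_id": "request-2", "error": None,
                "response": {"status_code": 400,
                             "body": {"error": {"message": "bad request"}}}}),
])

if __name__ == "__main__":
    ok, failed = split_batch_output(SAMPLE)
    print("answers:", ok)
    print("failures:", failed)
```

Keying by `custom_id` also makes it straightforward to resubmit only the failed requests in a follow-up batch.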

5. Monitoring and Alerting

5.1 Token Statistics

python

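# ===== Token Statistics =====

import time
from dataclasses import dataclass, field
from threading import Lock  # TokenTracker guards its state with a Lock
from typing import Dict, List
from collections import defaultdict

from openai import OpenAI  # used by tracked_api_call below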

@dataclass
class TokenUsage:
    """Token 使用量"""
    input_tokens: int
    output_tokens: int
    timestamp: float
    model: str
    user_id: str = "unknown"

class TokenTracker:
    """Token 跟踪器"""
    
    def __init__(self):
        self.usage_history: List[TokenUsage] = []
        self.user_usage: Dict[str, int] = defaultdict(int)
        self.model_usage: Dict[str, int] = defaultdict(int)
        self.lock = Lock()
    
    def track(self, usage: TokenUsage):
        """记录使用量"""
        with self.lock:
            self.usage_history.append(usage)
            self.user_usage[usage.user_id] += usage.input_tokens + usage.output_tokens
            self.model_usage[usage.model] += usage.input_tokens + usage.output_tokens
    
    def get_user_usage(self, user_id: str, time_window: float = 3600) -> int:
        """
        获取用户使用量
        
        Args:
            user_id: 用户 ID
            time_window: 时间窗口（秒）
            
        Returns:
            int: token 使用量
        """
        now = time.time()
        total = 0
        
        with self.lock:
            for usage in self.usage_history:
                if usage.user_id == user_id and now - usage.timestamp <= time_window:
                    total += usage.input_tokens + usage.output_tokens
        
        return total
    
    def get_model_usage(self, model: str, time_window: float = 3600) -> int:
        """获取模型使用量"""
        now = time.time()
        total = 0
        
        with self.lock:
            for usage in self.usage_history:
                if usage.model == model and now - usage.timestamp <= time_window:
                    total += usage.input_tokens + usage.output_tokens
        
        return total
    
    def get_top_users(self, limit: int = 10) -> List[tuple]:
        """获取使用量最高的用户"""
        sorted_users = sorted(
            self.user_usage.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return sorted_users[:limit]
    
    def get_report(self, time_window: float = 3600) -> Dict:
        """生成报告"""
        now = time.time()
        
        total_tokens = 0
        total_cost = 0.0
        
        with self.lock:
            for usage in self.usage_history:
                if now - usage.timestamp <= time_window:
                    total_tokens += usage.input_tokens + usage.output_tokens
                    total_cost += estimate_cost(
                        usage.model,
                        usage.input_tokens,
                        usage.output_tokens
                    )
        
        return {
            "total_tokens": total_tokens,
            "total_cost_usd": total_cost,
            "unique_users": len(self.user_usage),
            "models_used": list(self.model_usage.keys()),
            "top_users": self.get_top_users(10)
        }

# 使用 Token 跟踪器
tracker = TokenTracker()

def tracked_api_call(messages: list, model: str = "gpt-4o-mini", user_id: str = "unknown"):
    """带 Token 跟踪的 API 调用"""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    
    # 跟踪使用量
    usage = TokenUsage(
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        timestamp=time.time(),
        model=model,
        user_id=user_id
    )
    tracker.track(usage)
    
    return response.choices[0].message.content

# 测试
if __name__ == '__main__':
    result = tracked_api_call(
        messages=[{"role": "user", "content": "你好！"}],
        model="gpt-4o-mini",
        user_id="user_001"
    )
    
    # 获取报告
    report = tracker.get_report()
    print(f"报告: {report}")
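`TokenTracker` keeps every record forever and rescans the whole history on each query, which is fine for a demo but grows without bound in a long-running service. One possible alternative, sketched below under the assumption that only a rolling total is needed, prunes expired entries from a `deque` as it goes. `WindowedTokenCounter` and its injectable `clock` parameter are hypothetical names introduced here for illustration and testability.

```python
import time
from collections import deque
from threading import Lock


class WindowedTokenCounter:
    """Rolling token total over the last `window_seconds`, with bounded memory."""

    def __init__(self, window_seconds: float = 3600.0, clock=time.time):
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so tests can control time
        self._events = deque()       # (timestamp, tokens), oldest first
        self._total = 0
        self._lock = Lock()

    def _prune(self, now: float) -> None:
        # Drop events that have fallen out of the window and subtract their tokens
        while self._events and now - self._events[0][0] > self.window_seconds:
            _, tokens = self._events.popleft()
            self._total -= tokens

    def add(self, tokens: int) -> None:
        now = self._clock()
        with self._lock:
            self._prune(now)
            self._events.append((now, tokens))
            self._total += tokens

    def total(self) -> int:
        now = self._clock()
        with self._lock:
            self._prune(now)
            return self._total


if __name__ == "__main__":
    counter = WindowedTokenCounter(window_seconds=3600)
    counter.add(1500)
    print(counter.total())
```

The trade-off is that per-user or per-model breakdowns would need one counter each, whereas the article's tracker can answer any query from the raw history.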

5.2 Cost Tracking

python

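# ===== Cost Tracking =====

from dataclasses import dataclass
from threading import Lock  # CostTracker guards its records with a Lock
from typing import Dict, List
import time

from openai import OpenAI  # used by tracked_api_call_with_cost below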

@dataclass
class CostRecord:
    """成本记录"""
    timestamp: float
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    user_id: str
    request_id: str

class CostTracker:
    """成本跟踪器"""
    
    def __init__(self):
        self.records: List[CostRecord] = []
        self.lock = Lock()
    
    def track_cost(self, record: CostRecord):
        """记录成本"""
        with self.lock:
            self.records.append(record)
    
    def get_total_cost(self, time_window: float = 86400) -> float:
        """
        获取总成本
        
        Args:
            time_window: 时间窗口（秒），默认 1 天
            
        Returns:
            float: 总成本（美元）
        """
        now = time.time()
        total = 0.0
        
        with self.lock:
            for record in self.records:
                if now - record.timestamp <= time_window:
                    total += record.cost_usd
        
        return total
    
    def get_user_cost(self, user_id: str, time_window: float = 86400) -> float:
        """获取用户成本"""
        now = time.time()
        total = 0.0
        
        with self.lock:
            for record in self.records:
                if record.user_id == user_id and now - record.timestamp <= time_window:
                    total += record.cost_usd
        
        return total
    
    def get_model_cost(self, model: str, time_window: float = 86400) -> float:
        """获取模型成本"""
        now = time.time()
        total = 0.0
        
        with self.lock:
            for record in self.records:
                if record.model == model and now - record.timestamp <= time_window:
                    total += record.cost_usd
        
        return total
    
    def get_daily_cost_breakdown(self, days: int = 7) -> Dict[str, float]:
        """获取每日成本分解"""
        now = time.time()
        daily_costs = {}
        
        with self.lock:
            for record in self.records:
                if now - record.timestamp <= days * 86400:
                    day = time.strftime("%Y-%m-%d", time.localtime(record.timestamp))
                    daily_costs[day] = daily_costs.get(day, 0.0) + record.cost_usd
        
        return daily_costs
    
    def check_budget(self, budget: float, time_window: float = 86400) -> bool:
        """
        检查预算
        
        Args:
            budget: 预算（美元）
            time_window: 时间窗口（秒）
            
        Returns:
            bool: 是否超出预算
        """
        total_cost = self.get_total_cost(time_window)
        return total_cost > budget

# 使用成本跟踪器
cost_tracker = CostTracker()

def tracked_api_call_with_cost(messages: list, model: str = "gpt-4o-mini", user_id: str = "unknown"):
    """带成本跟踪的 API 调用"""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    
    # 计算成本
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    cost = estimate_cost(model, input_tokens, output_tokens)
    
    # Record the cost
    record = CostRecord(
        timestamp=time.time(),
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=cost,
        user_id=user_id,
        # Use the API's own response ID; a second-resolution timestamp collides under load
        request_id=response.id
    )
    cost_tracker.track_cost(record)
    
    return response.choices[0].message.content

# 测试
if __name__ == '__main__':
    result = tracked_api_call_with_cost(
        messages=[{"role": "user", "content": "你好！"}],
        model="gpt-4o-mini",
        user_id="user_001"
    )
    
    # 获取成本报告
    total_cost = cost_tracker.get_total_cost()
    print(f"总成本: ${total_cost:.4f}")
    
    daily_breakdown = cost_tracker.get_daily_cost_breakdown()
    print(f"每日成本: {daily_breakdown}")
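`check_budget` only reports whether the budget is already exceeded, which is too late to act on. A common refinement is to add a warning level below the limit. The sketch below is one way to do that; `BudgetStatus`, `budget_status`, and the 80% default for `warn_ratio` are assumptions made for illustration, not part of the article's `CostTracker`.

```python
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    WARNING = "warning"
    EXCEEDED = "exceeded"


def budget_status(spent_usd: float, budget_usd: float,
                  warn_ratio: float = 0.8) -> BudgetStatus:
    """Classify spend against a budget, with an early-warning threshold."""
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    if not 0.0 < warn_ratio <= 1.0:
        raise ValueError("warn_ratio must be in (0, 1]")
    if spent_usd > budget_usd:
        # Same comparison as check_budget: strictly over the limit
        return BudgetStatus.EXCEEDED
    if spent_usd >= budget_usd * warn_ratio:
        return BudgetStatus.WARNING
    return BudgetStatus.OK


if __name__ == "__main__":
    for spent in (2.0, 9.0, 12.0):
        print(spent, budget_status(spent, budget_usd=10.0).value)
```

In the article's setup this could be fed with `cost_tracker.get_total_cost()`, with the `WARNING` level triggering a notification and `EXCEEDED` blocking further calls.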

5.3 Performance Monitoring

python

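# ===== Performance Monitoring =====

import time
from dataclasses import dataclass
from threading import Lock  # PerformanceMonitor guards its metrics with a Lock
from typing import List, Dict, Optional
from collections import defaultdict

from openai import OpenAI  # used by monitored_api_call below

@dataclass
class PerformanceMetric:
    """A single performance measurement"""
    timestamp: float
    latency_ms: float
    model: str
    success: bool
    error_type: Optional[str] = None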

class PerformanceMonitor:
    """性能监控器"""
    
    def __init__(self):
        self.metrics: List[PerformanceMetric] = []
        self.model_latency: Dict[str, List[float]] = defaultdict(list)
        self.lock = Lock()
    
    def record(self, metric: PerformanceMetric):
        """记录性能指标"""
        with self.lock:
            self.metrics.append(metric)
            self.model_latency[metric.model].append(metric.latency_ms)
    
    def get_average_latency(self, model: str = None, time_window: float = 3600) -> float:
        """
        获取平均延迟
        
        Args:
            model: 模型名称（None 表示所有模型）
            time_window: 时间窗口（秒）
            
        Returns:
            float: 平均延迟（毫秒）
        """
        now = time.time()
        latencies = []
        
        with self.lock:
            for metric in self.metrics:
                if now - metric.timestamp <= time_window:
                    if model is None or metric.model == model:
                        latencies.append(metric.latency_ms)
        
        return sum(latencies) / len(latencies) if latencies else 0.0
    
    def get_success_rate(self, model: str = None, time_window: float = 3600) -> float:
        """获取成功率"""
        now = time.time()
        total = 0
        success = 0
        
        with self.lock:
            for metric in self.metrics:
                if now - metric.timestamp <= time_window:
                    if model is None or metric.model == model:
                        total += 1
                        if metric.success:
                            success += 1
        
        return success / total if total > 0 else 0.0
    
    def get_error_distribution(self, time_window: float = 3600) -> Dict[str, int]:
        """获取错误分布"""
        now = time.time()
        errors = defaultdict(int)
        
        with self.lock:
            for metric in self.metrics:
                if not metric.success and now - metric.timestamp <= time_window:
                    errors[metric.error_type] += 1
        
        return dict(errors)
    
    def get_percentile_latency(self, percentile: float, model: str = None) -> float:
        """
        Get a latency percentile

        Args:
            percentile: percentile as a fraction (0-1)
            model: model name (None means all models)

        Returns:
            float: latency (milliseconds)
        """
        latencies = []

        with self.lock:
            for metric in self.metrics:
                if model is None or metric.model == model:
                    latencies.append(metric.latency_ms)

        if not latencies:
            return 0.0

        latencies.sort()
        # Clamp the index so percentile=1.0 returns the maximum instead of raising IndexError
        idx = min(int(len(latencies) * percentile), len(latencies) - 1)
        return latencies[idx]

# 使用性能监控器
monitor = PerformanceMonitor()

def monitored_api_call(messages: list, model: str = "gpt-4o-mini"):
    """带性能监控的 API 调用"""
    start_time = time.time()
    success = True
    error_type = None
    
    try:
        client = OpenAI()
        response = client.chat.completions.create(
            model=model,
            messages=messages
        )
        result = response.choices[0].message.content
    
    except Exception as e:
        success = False
        error_type = type(e).__name__
        result = None
    
    finally:
        end_time = time.time()
        latency_ms = (end_time - start_time) * 1000
        
        metric = PerformanceMetric(
            timestamp=start_time,
            latency_ms=latency_ms,
            model=model,
            success=success,
            error_type=error_type
        )
        monitor.record(metric)
    
    return result

# 测试
if __name__ == '__main__':
    result = monitored_api_call(
        messages=[{"role": "user", "content": "你好！"}],
        model="gpt-4o-mini"
    )
    
    # 获取性能报告
    avg_latency = monitor.get_average_latency()
    success_rate = monitor.get_success_rate()
    
    print(f"平均延迟: {avg_latency:.2f} ms")
    print(f"成功率: {success_rate:.2%}")
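Averages hide slow outliers, so tail percentiles such as p95 are usually the numbers worth alerting on. The following standalone sketch shows the nearest-rank definition of a percentile using only the standard library. `nearest_rank_percentile` is a hypothetical helper, separate from `PerformanceMonitor.get_percentile_latency`, and the sample latencies are made up.

```python
import math


def nearest_rank_percentile(values, percentile: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p*N values <= it."""
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0 and 1")
    if not values:
        return 0.0
    ordered = sorted(values)
    # round() guards against float noise such as 0.7 * 10 == 7.000000000000001
    rank = max(1, math.ceil(round(percentile * len(ordered), 9)))
    return ordered[rank - 1]


if __name__ == "__main__":
    sample_ms = [100.0, 200.0, 300.0, 400.0, 500.0]  # made-up latencies
    for p in (0.5, 0.95, 1.0):
        print(f"p{int(p * 100)}: {nearest_rank_percentile(sample_ms, p)} ms")
```

With very few samples the high percentiles simply collapse to the maximum, so it helps to report the sample count alongside any percentile.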

6. Production Case Study

6.1 Case Study: An Optimized AI API Call System

python

# ===== Case Study: An Optimized AI API Call System =====

# Import the exception classes from the public package, not the private openai._exceptions module
from openai import OpenAI, RateLimitError, APIConnectionError, InternalServerError
import time
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class APIConfig:
    """API 配置"""
    model: str = "gpt-4o-mini"
    max_retries: int = 3
    timeout: float = 30.0
    cache_ttl: int = 3600

class OptimizedAIService:
    """优化的 AI 服务"""
    
    def __init__(self, config: APIConfig = None):
        self.config = config or APIConfig()
        self.client = OpenAI(timeout=self.config.timeout)
        self.cache = ResultCache(ttl=self.config.cache_ttl)
        self.tracker = TokenTracker()
        self.monitor = PerformanceMonitor()
        self.limiter = SlidingWindowRateLimiter(max_requests=50, window_size=60.0)
    
    @exponential_backoff_retry(max_retries=3)
    def chat(self, messages: list, user_id: str = "unknown") -> str:
        """
        优化的对话 API
        
        特性：
        1. 缓存
        2. 重试
        3. 限流
        4. Token 跟踪
        5. 性能监控
        """
        # 1. Check the cache (compare with None so an empty-string result still counts as a hit)
        cached = self.cache.get(messages, self.config.model)
        if cached is not None:
            print("Cache hit")
            return cached
        
        # 2. 限流检查
        if not self.limiter.allow_request():
            raise Exception("速率限制，请稍后重试")
        
        # 3. 记录开始时间
        start_time = time.time()
        success = True
        error_type = None
        
        try:
            # 4. 调用 API
            response = self.client.chat.completions.create(
                model=self.config.model,
                messages=messages,
                max_tokens=1024
            )
            
            result = response.choices[0].message.content
            
            # 5. 写入缓存
            self.cache.set(messages, self.config.model, result)
            
            # 6. 跟踪 Token
            usage = TokenUsage(
                input_tokens=response.usage.prompt_tokens,
                output_tokens=response.usage.completion_tokens,
                timestamp=time.time(),
                model=self.config.model,
                user_id=user_id
            )
            self.tracker.track(usage)
            
            return result
        
        except Exception as e:
            success = False
            error_type = type(e).__name__
            raise
        
        finally:
            # 7. 记录性能
            end_time = time.time()
            latency_ms = (end_time - start_time) * 1000
            
            metric = PerformanceMetric(
                timestamp=start_time,
                latency_ms=latency_ms,
                model=self.config.model,
                success=success,
                error_type=error_type
            )
            self.monitor.record(metric)
    
    def get_stats(self) -> dict:
        """获取统计信息"""
        return {
            "token_usage": self.tracker.get_report(),
            "performance": {
                "avg_latency": self.monitor.get_average_latency(),
                "success_rate": self.monitor.get_success_rate()
            }
        }

# 使用优化的 AI 服务
service = OptimizedAIService()

if __name__ == '__main__':
    # 第一次调用
    result1 = service.chat(
        messages=[{"role": "user", "content": "你好！"}],
        user_id="user_001"
    )
    print(f"结果1: {result1[:50]}...")
    
    # 第二次调用（会使用缓存）
    result2 = service.chat(
        messages=[{"role": "user", "content": "你好！"}],
        user_id="user_001"
    )
    print(f"结果2: {result2[:50]}...")
    
    # 获取统计
    stats = service.get_stats()
    print(f"统计: {stats}")
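Because `OptimizedAIService` depends on a live API key and on classes from earlier sections, it is hard to check the cache-then-call order in isolation. The sketch below reproduces just that part of the flow against a fake backend. `CachedChat`, `cache_key`, and `fake_backend` are hypothetical stand-ins written for this illustration; they are not the article's `ResultCache` or the OpenAI client.

```python
import hashlib
import json
from typing import Callable, Dict, List

Message = Dict[str, str]


def cache_key(messages: List[Message], model: str) -> str:
    """Stable key derived from the model name and the full message list."""
    payload = json.dumps({"model": model, "messages": messages},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedChat:
    """Check the cache first; call the backend only on a miss."""

    def __init__(self, backend: Callable[[List[Message], str], str],
                 model: str = "demo-model"):
        self.backend = backend
        self.model = model
        self.cache: Dict[str, str] = {}
        self.backend_calls = 0
        self.cache_hits = 0

    def chat(self, messages: List[Message]) -> str:
        key = cache_key(messages, self.model)
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]
        self.backend_calls += 1
        result = self.backend(messages, self.model)
        self.cache[key] = result
        return result


def fake_backend(messages: List[Message], model: str) -> str:
    """Stand-in for a real API call: echoes the last user message."""
    return f"echo: {messages[-1]['content']}"


if __name__ == "__main__":
    svc = CachedChat(fake_backend)
    question = [{"role": "user", "content": "hello"}]
    svc.chat(question)
    svc.chat(question)
    print(f"backend calls: {svc.backend_calls}, cache hits: {svc.cache_hits}")
```

Swapping `fake_backend` for a function that wraps the real client keeps the same structure while making the caching behaviour testable offline.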

7. Summary

7.1 Key Points

[Diagram: AI API call optimization]
Retry mechanism: exponential backoff, circuit breaker
Rate limiting: token bucket, leaky bucket, sliding window
Caching strategy: result cache, semantic cache, distributed cache
Cost optimization: model selection, prompt optimization, Batch API
Monitoring and alerting: token statistics, cost tracking, performance monitoring

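7.2 Best Practices

Practice	Description
Retry strategy	Exponential backoff with jitter, to avoid retry storms
Rate-limiting algorithm	Token bucket (allows bursts) or leaky bucket (smooths the rate)
Caching strategy	Result cache plus semantic cache, to cut repeated calls
Cost optimization	Model selection, prompt optimization, and the Batch API
Monitoring and alerting	Token statistics, cost tracking, and performance monitoring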
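As a reminder of what "exponential backoff with jitter" means in practice, here is a minimal sketch of the delay schedule using the "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing cap. `backoff_delays` and its default parameters are assumptions for illustration; the article's own `exponential_backoff_retry` decorator may compute delays differently.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base_delay: float = 1.0,
                   max_delay: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized wait (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        # Full jitter: pick anywhere in [0, ceiling] so clients do not retry in lockstep
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(5), start=1):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

The randomness is the point: without it, many clients that failed at the same moment would all retry at the same moment too, which is the retry storm the table warns about.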

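7.3 Optimization Results

Optimization	Effect
Retry mechanism	Reduces the failure rate by 50%+
Rate limiting	Avoids hitting provider rate limits
Result cache	Reduces API calls by 30%+
Semantic cache	Reduces API calls by 50%+
Model selection	Lowers cost by 50%+
Batch API	Lowers cost by 50%
Prompt optimization	Reduces token consumption by 20%+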
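The percentages in this table do not simply add up when several optimizations are combined; to a first approximation they multiply. The sketch below works through that arithmetic under deliberately simple assumptions: cache hits cost nothing, every request costs the same, and a share `batch_share` of the remaining requests goes through a batch channel at a `batch_discount`. The function name and the input values are illustrative, not measurements.

```python
def combined_cost_ratio(cache_hit_rate: float, batch_share: float,
                        batch_discount: float = 0.5) -> float:
    """Fraction of the original cost that remains after caching and batching."""
    for name, value in (("cache_hit_rate", cache_hit_rate),
                        ("batch_share", batch_share),
                        ("batch_discount", batch_discount)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    paid_fraction = 1.0 - cache_hit_rate                 # requests that still reach the API
    price_factor = 1.0 - batch_share * batch_discount    # average price of those requests
    return paid_fraction * price_factor


if __name__ == "__main__":
    ratio = combined_cost_ratio(cache_hit_rate=0.3, batch_share=1.0)
    print(f"remaining cost: {ratio:.0%} of the original")
```

For example, a 30% cache hit rate combined with sending everything else through a 50%-discounted batch channel leaves 0.7 × 0.5 = 35% of the original cost, a 65% saving rather than 80%.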

This article is based on OpenAI API best practices. If you have questions, feel free to discuss them in the comments!