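AI API Call Optimization in Practice

Preface

💡 Pain points: Do your AI API calls keep failing? Do you hit rate limits all the time? Are costs stubbornly high? Are slow responses hurting the user experience?

🎯 Solution: optimize AI API calls step by step, from retry mechanisms to rate limiting, caching strategies, and cost optimization.

AI API call optimization at a glance: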
AI API call optimization
    Retry mechanisms: exponential backoff, circuit breaker, jitter
    Rate limiting: token bucket, leaky bucket, sliding window
    Caching strategies: result cache, semantic cache, distributed cache
    Cost optimization: model selection, Batch API, prompt optimization
    Monitoring and alerting: token statistics, cost tracking, performance monitoring

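Common problems and solutions:

| Problem | Cause | Solution |
| --- | --- | --- |
| Call failures | Network errors or server errors | Retry mechanisms |
| Rate limiting | RPM/TPM limits triggered | Client-side rate limiting |
| High cost | Expensive models, heavy token consumption | Cost optimization |
| Slow responses | Network latency, slow model inference | Caching, streaming responses |
| No monitoring | Lack of observability | Monitoring and alerting |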

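1. Retry Mechanisms

1.1 Exponential Backoff Retry

python
# ===== Exponential backoff retry =====

import functools
import random
import time

from openai import (
    OpenAI,
    RateLimitError,
    APIConnectionError,
    InternalServerError,
)

# Transient errors that are worth retrying; anything else propagates immediately
RETRYABLE_ERRORS = (RateLimitError, APIConnectionError, InternalServerError)

class RetryConfig:
    """Retry configuration."""
    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_factor: float = 2.0,
        jitter: bool = True
    ):
        self.max_retries = max_retries        # total attempts, including the first call
        self.base_delay = base_delay          # delay before the first retry (seconds)
        self.max_delay = max_delay            # cap applied before jitter (seconds)
        self.backoff_factor = backoff_factor  # multiplier applied per attempt
        self.jitter = jitter                  # add random jitter to each delay

def exponential_backoff_retry(func=None, *, retry_config: RetryConfig = None):
    """Decorator that retries transient errors with exponential backoff.

    Usable as @exponential_backoff_retry or as
    @exponential_backoff_retry(retry_config=RetryConfig(...)).
    """
    config = retry_config or RetryConfig()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(config.max_retries):
                try:
                    return fn(*args, **kwargs)
                except RETRYABLE_ERRORS as e:
                    if attempt == config.max_retries - 1:
                        raise  # the last attempt failed, re-raise

                    # Compute the delay, capped at max_delay
                    delay = config.base_delay * (config.backoff_factor ** attempt)
                    delay = min(delay, config.max_delay)

                    # Add 0-100% jitter so that clients do not retry in lockstep
                    if config.jitter:
                        delay = delay * (1 + random.random())

                    print(f"Error: {e}")
                    print(f"Retrying in {delay:.2f} s (retry {attempt + 1})...")
                    time.sleep(delay)

            raise RuntimeError("max_retries must be at least 1")
        return wrapper

    # Support the bare @exponential_backoff_retry form (no parentheses)
    if func is not None:
        return decorator(func)
    return decorator

# Use the decorator
# Note: the openai client also retries internally (max_retries, default 2);
# pass OpenAI(max_retries=0) if the two layers should not stack.
@exponential_backoff_retry
def call_openai_api(messages: list):
    """Call the OpenAI API."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=1024
    )
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    result = call_openai_api(
        [{"role": "user", "content": "Hello!"}]
    )
    print(result)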
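
To make the delay formula concrete, the sketch below computes a retry schedule without sleeping or calling any API. `backoff_delays` is a hypothetical helper that is not part of the code above. It also applies the cap after the jitter, so `max_delay` is a hard upper bound; that ordering is an assumption about the desired behaviour, since the retry loop above caps before adding jitter.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    backoff_factor: float = 2.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the delay (in seconds) that would precede each retry."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth: base_delay * backoff_factor ** attempt
        delay = base_delay * (backoff_factor ** attempt)
        if jitter:
            # Stretch the delay by a random 0-100%
            delay *= 1 + rng.random()
        # Cap last, so that max_delay is never exceeded
        delays.append(min(delay, max_delay))
    return delays


if __name__ == "__main__":
    print(backoff_delays(jitter=False))  # [1.0, 2.0, 4.0, 8.0, 16.0]
    print(backoff_delays(rng=random.Random(42)))  # jittered, reproducible with a seed
```

Passing a seeded `random.Random` makes the jittered schedule reproducible in tests.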

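1.2 Circuit Breaker Pattern

python
# ===== Circuit breaker pattern =====

import time
from enum import Enum
from dataclasses import dataclass

from openai import OpenAI

class CircuitState(Enum):
    """Circuit breaker states."""
    CLOSED = "closed"        # closed (normal operation)
    OPEN = "open"            # open (requests are rejected)
    HALF_OPEN = "half_open"  # half-open (probing for recovery)

@dataclass
class CircuitBreakerConfig:
    """Circuit breaker configuration."""
    failure_threshold: int = 5     # consecutive failures before the breaker opens
    success_threshold: int = 2     # successes needed in the half-open state to close
    timeout: float = 60.0          # timeout in seconds (not used by the code below)
    reset_timeout: float = 30.0    # seconds to stay open before probing again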

class CircuitBreaker:
    """Circuit breaker (not thread-safe; add a lock for concurrent use)."""

    def __init__(self, config: CircuitBreakerConfig = None):
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        """Run func under circuit breaker protection."""
        if self.state == CircuitState.OPEN:
            # Check whether it is time to probe for recovery
            if time.time() - self.last_failure_time > self.config.reset_timeout:
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                print("Circuit breaker is now half-open")
            else:
                raise RuntimeError("Circuit breaker is open, request rejected")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result

        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        """Success callback."""
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.config.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                print("Circuit breaker closed (recovered)")
        else:
            self.failure_count = 0

    def _on_failure(self):
        """Failure callback."""
        self.failure_count += 1
        self.last_failure_time = time.time()

        # A failed probe in the half-open state reopens the breaker immediately
        if (
            self.state == CircuitState.HALF_OPEN
            or self.failure_count >= self.config.failure_threshold
        ):
            self.state = CircuitState.OPEN
            print(f"Circuit breaker opened (failures: {self.failure_count})")

# Use the circuit breaker
breaker = CircuitBreaker()

def call_api_with_circuit_breaker(messages: list):
    """API call protected by the circuit breaker."""
    def _call():
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return response.choices[0].message.content

    return breaker.call(_call)

# Test
if __name__ == '__main__':
    for i in range(10):
        try:
            result = call_api_with_circuit_breaker(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
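
The state transitions can be exercised without a real API. The following is a condensed, self-contained variant of the breaker, offered as a sketch rather than a replacement: `MiniBreaker`, `CircuitOpenError`, and the injectable `clock` argument are hypothetical names introduced here so that the reset timeout can be simulated instead of waited for.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without running it."""


class MiniBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        success_threshold: int = 2,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.reset_timeout:
                # Enough time has passed: let probe requests through
                self.state = State.HALF_OPEN
                self.successes = 0
            else:
                raise CircuitOpenError("circuit open, request rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens at once; otherwise open at the threshold
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()


if __name__ == "__main__":
    now = [0.0]
    breaker = MiniBreaker(failure_threshold=2, success_threshold=1,
                          reset_timeout=10.0, clock=lambda: now[0])

    def failing():
        raise ValueError("simulated upstream error")

    for _ in range(2):
        try:
            breaker.call(failing)
        except ValueError:
            pass
    print(breaker.state)                 # State.OPEN
    now[0] = 11.0                        # simulate waiting past reset_timeout
    print(breaker.call(lambda: "ok"))    # ok
    print(breaker.state)                 # State.CLOSED
```

A dedicated exception type lets callers tell "rejected by the breaker" apart from a real upstream failure, which the generic exception in the original example does not allow.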

1.3 Retry Best Practices

python
# ===== Retry best practices =====

import functools
import random
import time

from openai import OpenAI, RateLimitError, APIConnectionError, InternalServerError

class RetryStrategy:
    """Retry strategies packaged as decorator factories."""

    @staticmethod
    def exponential_backoff(
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_factor: float = 2.0,
        jitter: bool = True
    ):
        """Exponential backoff."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except (RateLimitError, APIConnectionError, InternalServerError):
                        if attempt == max_retries - 1:
                            raise

                        delay = min(base_delay * (backoff_factor ** attempt), max_delay)
                        if jitter:
                            delay = delay * (1 + random.random())

                        print(f"Retry {attempt + 1}/{max_retries}, retrying in {delay:.2f} s...")
                        time.sleep(delay)

                raise RuntimeError("max_retries must be at least 1")
            return wrapper
        return decorator

    @staticmethod
    def fixed_delay(
        max_retries: int = 3,
        delay: float = 1.0
    ):
        """Fixed delay."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except (RateLimitError, APIConnectionError, InternalServerError):
                        if attempt == max_retries - 1:
                            raise

                        print(f"Retry {attempt + 1}/{max_retries}, retrying in {delay} s...")
                        time.sleep(delay)

                raise RuntimeError("max_retries must be at least 1")
            return wrapper
        return decorator

# Usage
class OpenAIService:
    """OpenAI service (with retries)."""

    @RetryStrategy.exponential_backoff(max_retries=3, base_delay=1.0, jitter=True)
    def chat(self, messages: list):
        """Chat (retried with exponential backoff)."""
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return response.choices[0].message.content

    @RetryStrategy.fixed_delay(max_retries=3, delay=2.0)
    def chat_fixed(self, messages: list):
        """Chat (retried with a fixed delay)."""
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return response.choices[0].message.content

2. Rate Limiting

2.1 Token Bucket Algorithm

python
# ===== Token bucket algorithm =====

import time
from threading import Lock

from openai import OpenAI

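class TokenBucket:
    """Token bucket."""

    def __init__(self, capacity: int, refill_rate: float):
        """
        Initialize the token bucket.

        Args:
            capacity: bucket capacity (maximum number of tokens)
            refill_rate: refill rate (tokens per second)
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill_time = time.monotonic()
        self.lock = Lock()

    def _refill(self):
        """Refill tokens (call with the lock held)."""
        now = time.monotonic()
        elapsed = now - self.last_refill_time

        # Number of tokens earned since the last refill
        refill_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + refill_tokens)
        self.last_refill_time = now

    def consume(self, tokens: int = 1) -> bool:
        """
        Consume tokens without blocking.

        Args:
            tokens: number of tokens required

        Returns:
            bool: whether the tokens were acquired
        """
        with self.lock:
            self._refill()

            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_for_tokens(self, tokens: int = 1, timeout: float = None) -> bool:
        """
        Wait for tokens (blocking).

        Args:
            tokens: number of tokens required
            timeout: timeout in seconds (None waits indefinitely)

        Returns:
            bool: whether the tokens were acquired
        """
        if tokens > self.capacity:
            # The bucket can never hold this many tokens; waiting would never end
            raise ValueError("requested tokens exceed the bucket capacity")

        start_time = time.monotonic()

        while True:
            if self.consume(tokens):
                return True

            if timeout is not None and (time.monotonic() - start_time) >= timeout:
                return False

            # Estimate how long the missing tokens take to refill
            with self.lock:
                deficit = tokens - self.tokens
            # Another thread may have refilled in the meantime, so clamp at zero
            wait_time = max(0.0, deficit / self.refill_rate)

            time.sleep(min(wait_time, 0.1))  # sleep at most 0.1 s per poll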

# Rate limiting with the token bucket
bucket = TokenBucket(capacity=10, refill_rate=2.0)  # capacity 10, rate 2 tokens/s

def rate_limited_api_call(messages: list):
    """Rate-limited API call."""
    # Wait for a token
    if not bucket.wait_for_tokens(tokens=1, timeout=5.0):
        raise TimeoutError("Timed out waiting for a token")

    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")

2.2 Leaky Bucket Algorithm

python
# ===== Leaky bucket algorithm =====

import time
from queue import Queue, Empty, Full
from threading import Thread

class LeakyBucket:
    """Leaky bucket."""

    def __init__(self, capacity: int, leak_rate: float):
        """
        Initialize the leaky bucket.

        Args:
            capacity: bucket capacity
            leak_rate: leak rate (requests per second)
        """
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = Queue(maxsize=capacity)  # Queue is already thread-safe
        self.running = True

        # Start the background thread that drains the bucket
        self.leak_thread = Thread(target=self._leak, daemon=True)
        self.leak_thread.start()

    def _leak(self):
        """Drain requests at a constant rate."""
        while self.running:
            try:
                request = self.queue.get_nowait()
            except Empty:
                pass
            else:
                # Process the request
                self._process_request(request)

            # Wait according to the leak rate
            time.sleep(1.0 / self.leak_rate)

    def _process_request(self, request):
        """Process a request."""
        # A real implementation would call the API here
        print(f"Processing request: {request}")

    def add_request(self, request) -> bool:
        """
        Add a request.

        Returns:
            bool: whether the request was queued
        """
        try:
            self.queue.put(request, block=False)
            return True
        except Full:
            # The bucket is full, so the request is dropped
            return False

    def stop(self):
        """Stop the leaky bucket."""
        self.running = False

# Rate limiting with the leaky bucket
bucket = LeakyBucket(capacity=10, leak_rate=2.0)  # capacity 10, rate 2 requests/s

def rate_limited_api_call_leaky(messages: list):
    """Rate-limited API call (leaky bucket)."""
    request_id = f"req_{time.time()}"

    if not bucket.add_request(request_id):
        raise RuntimeError("Leaky bucket is full, request dropped")

    # The request is queued and waits to be processed
    print(f"Request {request_id} queued")

    # A real implementation would wait for the result;
    # this simplified version returns immediately
    return f"Request {request_id} queued"

# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call_leaky(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: {result}")
        except Exception as e:
            print(f"Request {i}: failed - {e}")

    time.sleep(5)  # wait for processing to finish
    bucket.stop()
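
The example above notes that a real implementation should wait for the result rather than return right after queueing. One possible way to do that, sketched here under assumptions, is to hand the caller a `concurrent.futures.Future` that the drain thread completes. `FutureLeakyBucket` and its `submit` method are hypothetical names, not part of the original code.

```python
import queue
import threading
from concurrent.futures import Future


class FutureLeakyBucket:
    """Leaky bucket whose callers can wait for the result of their request."""

    def __init__(self, capacity: int, leak_rate: float):
        self._queue = queue.Queue(maxsize=capacity)
        self._interval = 1.0 / leak_rate  # seconds between two drained requests
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._leak, daemon=True)
        self._worker.start()

    def submit(self, func, *args, **kwargs) -> Future:
        """Queue func(*args, **kwargs); the returned Future carries its outcome."""
        future = Future()
        try:
            self._queue.put_nowait((future, func, args, kwargs))
        except queue.Full:
            # Overflow: fail the future right away instead of blocking the caller
            future.set_exception(RuntimeError("leaky bucket full, request dropped"))
        return future

    def _leak(self):
        while not self._stop.is_set():
            try:
                future, func, args, kwargs = self._queue.get(timeout=0.05)
            except queue.Empty:
                continue
            # Skip requests that the caller cancelled while they were queued
            if future.set_running_or_notify_cancel():
                try:
                    future.set_result(func(*args, **kwargs))
                except Exception as exc:
                    future.set_exception(exc)
            # Constant outflow: pause before draining the next request
            self._stop.wait(self._interval)

    def stop(self):
        self._stop.set()
        self._worker.join(timeout=1.0)


if __name__ == "__main__":
    bucket = FutureLeakyBucket(capacity=5, leak_rate=50.0)
    futures = [bucket.submit(lambda i=i: i * i) for i in range(3)]
    print([f.result(timeout=2.0) for f in futures])  # [0, 1, 4]
    bucket.stop()
```

In an API client, `func` would be the function that performs the actual request, and the caller would block on `future.result(timeout=...)`.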

2.3 Sliding Window Rate Limiting

python
# ===== Sliding window rate limiting =====

import time
from collections import deque
from threading import Lock

from openai import OpenAI

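class SlidingWindowRateLimiter:
    """Sliding window rate limiter."""

    def __init__(self, max_requests: int, window_size: float):
        """
        Initialize the sliding window rate limiter.

        Args:
            max_requests: maximum number of requests within the window
            window_size: window size in seconds
        """
        self.max_requests = max_requests
        self.window_size = window_size
        self.requests = deque()  # timestamps of accepted requests, oldest first
        self.lock = Lock()

    def allow_request(self) -> bool:
        """
        Check whether a request is allowed.

        Returns:
            bool: whether the request is allowed
        """
        with self.lock:
            now = time.time()

            # Remove requests that have left the window
            while self.requests and self.requests[0] < now - self.window_size:
                self.requests.popleft()

            # Check whether the limit has been reached
            if len(self.requests) >= self.max_requests:
                return False

            # Record the new request
            self.requests.append(now)
            return True

    def get_wait_time(self) -> float:
        """
        Get the time to wait before the next request can be accepted.

        Returns:
            float: wait time in seconds; 0 means no wait is needed
        """
        with self.lock:
            now = time.time()

            # Remove requests that have left the window
            while self.requests and self.requests[0] < now - self.window_size:
                self.requests.popleft()

            # A slot is still free, so there is nothing to wait for
            if len(self.requests) < self.max_requests:
                return 0.0

            # Wait until the oldest request leaves the window
            wait_time = (self.requests[0] + self.window_size) - now
            return max(0.0, wait_time)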

# Rate limiting with the sliding window
limiter = SlidingWindowRateLimiter(max_requests=10, window_size=60.0)  # at most 10 requests per 60 s

def rate_limited_api_call_sliding(messages: list):
    """Rate-limited API call (sliding window)."""
    if not limiter.allow_request():
        wait_time = limiter.get_wait_time()
        raise RuntimeError(f"Rate limited, retry in {wait_time:.2f} s")

    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call_sliding(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
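
The limiter above rejects a request when the window is full. Where blocking is acceptable, the caller can wait for the next free slot instead. A minimal sketch, assuming a hypothetical `BlockingSlidingWindow` class (not part of the original code) whose `clock` and `sleep` functions are injectable so that it can be tested without real delays:

```python
import time
from collections import deque
from threading import Lock
from typing import Callable, Optional


class BlockingSlidingWindow:
    def __init__(
        self,
        max_requests: int,
        window_size: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self._max = max_requests
        self._window = window_size
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # timestamps of accepted requests, oldest first
        self._lock = Lock()

    def _time_until_slot(self) -> float:
        """Record the request and return 0.0 if a slot is free, else the wait in seconds."""
        with self._lock:
            now = self._clock()
            # Drop timestamps that have left the window
            while self._stamps and self._stamps[0] <= now - self._window:
                self._stamps.popleft()
            if len(self._stamps) < self._max:
                self._stamps.append(now)
                return 0.0
            # Time until the oldest request leaves the window (always > 0 here)
            return self._stamps[0] + self._window - now

    def acquire(self, timeout: Optional[float] = None) -> bool:
        """Block until a slot is free; return False if that would exceed timeout."""
        deadline = None if timeout is None else self._clock() + timeout
        while True:
            wait = self._time_until_slot()
            if wait == 0.0:
                return True
            if deadline is not None and self._clock() + wait > deadline:
                return False  # give up early instead of sleeping past the deadline
            self._sleep(wait)


if __name__ == "__main__":
    limiter = BlockingSlidingWindow(max_requests=2, window_size=1.0)
    for i in range(4):
        limiter.acquire()
        print(f"request {i} admitted")
```

Sleeping outside the lock keeps other threads free to check the window while one caller waits.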

3. Caching Strategies

3.1 Result Cache

python
# ===== Result cache =====

import hashlib
import json
import time
from threading import Lock
from typing import Dict, Any, Optional

from openai import OpenAI

class ResultCache:
    """In-memory result cache keyed by the exact request."""

    def __init__(self, ttl: int = 3600):
        """
        Initialize the cache.

        Args:
            ttl: time to live of a cache entry, in seconds
        """
        self.cache: Dict[str, Dict[str, Any]] = {}
        self.ttl = ttl
        self.lock = Lock()

    def _generate_key(self, messages: list, model: str, **kwargs) -> str:
        """Generate the cache key."""
        content = {
            "messages": messages,
            "model": model,
            **kwargs
        }
        # sort_keys makes the key independent of dict ordering
        content_str = json.dumps(content, sort_keys=True)
        # MD5 serves as a fingerprint here, not as a security measure
        return hashlib.md5(content_str.encode()).hexdigest()

    def get(self, messages: list, model: str, **kwargs) -> Optional[str]:
        """
        Look up a cached result.

        Returns:
            The cached result, or None if there is no valid entry
        """
        key = self._generate_key(messages, model, **kwargs)

        with self.lock:
            if key not in self.cache:
                return None

            entry = self.cache[key]

            # Check whether the entry has expired
            if time.time() - entry["timestamp"] > self.ttl:
                del self.cache[key]
                return None

            return entry["result"]

    def set(self, messages: list, model: str, result: str, **kwargs):
        """Store a result."""
        key = self._generate_key(messages, model, **kwargs)

        with self.lock:
            self.cache[key] = {
                "result": result,
                "timestamp": time.time()
            }

    def clear(self):
        """Clear the cache."""
        with self.lock:
            self.cache.clear()

    def remove_expired(self):
        """Remove expired entries."""
        now = time.time()
        with self.lock:
            expired_keys = [
                k for k, v in self.cache.items()
                if now - v["timestamp"] > self.ttl
            ]
            for k in expired_keys:
                del self.cache[k]

# Use the result cache
cache = ResultCache(ttl=3600)  # cache for 1 hour

def cached_api_call(messages: list, model: str = "gpt-4o-mini"):
    """API call with caching."""
    # Check the cache ("is not None" so that an empty string still counts as a hit)
    cached = cache.get(messages, model)
    if cached is not None:
        print("Using cache")
        return cached

    # Call the API
    print("Calling API")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    result = response.choices[0].message.content

    # Write to the cache
    cache.set(messages, model, result)

    return result

# Test
if __name__ == '__main__':
    messages = [{"role": "user", "content": "Hello!"}]

    # First call (hits the API)
    result1 = cached_api_call(messages)
    print(f"Result 1: {result1[:50]}...")

    # Second call (served from the cache)
    result2 = cached_api_call(messages)
    print(f"Result 2: {result2[:50]}...")

3.2 Semantic Cache

python
# ===== Semantic cache =====

import time
from threading import Lock
from typing import List, Dict, Any, Optional

import numpy as np
from openai import OpenAI

class SemanticCache:
    """Semantic cache (based on vector similarity)."""

    def __init__(self, similarity_threshold: float = 0.95, ttl: int = 3600):
        """
        Initialize the semantic cache.

        Args:
            similarity_threshold: similarity threshold (0-1)
            ttl: time to live of a cache entry, in seconds
        """
        self.cache: List[Dict[str, Any]] = []
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.lock = Lock()

    def _get_embedding(self, text: str) -> List[float]:
        """
        Get the text vector (simplified).

        A real implementation should use the OpenAI/Azure Embeddings API.
        """
        # Simplification: use character frequencies as the vector.
        # This captures shared characters, not meaning.
        vector = [0.0] * 128
        for char in text:
            idx = ord(char) % 128
            vector[idx] += 1.0

        # Normalize to unit length
        norm = np.linalg.norm(vector)
        if norm > 0:
            vector = [v / norm for v in vector]

        return vector

    def _cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """Cosine similarity; equals the dot product because both vectors are unit length."""
        dot_product = sum(a * b for a, b in zip(vec1, vec2))
        return dot_product

    def get(self, messages: list) -> Optional[str]:
        """
        Look up a cached result by semantic similarity.

        Returns:
            The cached result, or None if nothing is similar enough
        """
        # Extract the user messages
        user_messages = [m["content"] for m in messages if m["role"] == "user"]
        query = " ".join(user_messages)

        # Get the query vector
        query_embedding = self._get_embedding(query)

        with self.lock:
            # Remove expired entries
            now = time.time()
            self.cache = [
                c for c in self.cache
                if now - c["timestamp"] < self.ttl
            ]

            # Linear scan for the first sufficiently similar entry
            for entry in self.cache:
                similarity = self._cosine_similarity(query_embedding, entry["embedding"])

                if similarity >= self.similarity_threshold:
                    print(f"Semantic cache hit (similarity: {similarity:.4f})")
                    return entry["result"]

        return None

    def set(self, messages: list, result: str):
        """Store a result."""
        # Extract the user messages
        user_messages = [m["content"] for m in messages if m["role"] == "user"]
        query = " ".join(user_messages)

        # Get the vector
        embedding = self._get_embedding(query)

        with self.lock:
            self.cache.append({
                "query": query,
                "embedding": embedding,
                "result": result,
                "timestamp": time.time()
            })

    def clear(self):
        """Clear the cache."""
        with self.lock:
            self.cache.clear()

# Use the semantic cache
semantic_cache = SemanticCache(similarity_threshold=0.95, ttl=3600)

def semantic_cached_api_call(messages: list):
    """API call with semantic caching."""
    # Check the cache
    cached = semantic_cache.get(messages)
    if cached is not None:
        return cached

    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    result = response.choices[0].message.content

    # Write to the cache
    semantic_cache.set(messages, result)

    return result

# Test
if __name__ == '__main__':
    # Two similar questions. With real embeddings they would be expected to
    # hit the cache; the toy character-frequency vector above will most
    # likely miss at a 0.95 threshold.
    messages1 = [{"role": "user", "content": "What is machine learning?"}]
    messages2 = [{"role": "user", "content": "Please explain machine learning"}]

    result1 = semantic_cached_api_call(messages1)
    print(f"Result 1: {result1[:50]}...")

    result2 = semantic_cached_api_call(messages2)
    print(f"Result 2: {result2[:50]}...")
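
The docstring of `_get_embedding` says a real implementation should use an Embeddings API. The sketch below shows one way that could look. `openai_embed`, `cosine_similarity`, and `best_match` are hypothetical helper names, and the model name `text-embedding-3-small` is an assumption that can be swapped for any embedding model available to you. The matching logic does not touch the network, so it can be tested with hand-made vectors.

```python
import math
from typing import List, Optional, Sequence, Tuple


def openai_embed(text: str, model: str = "text-embedding-3-small") -> List[float]:
    """Fetch an embedding from the OpenAI Embeddings API (needs OPENAI_API_KEY)."""
    from openai import OpenAI  # imported lazily so the rest of the file runs offline

    client = OpenAI()
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that does not assume the vectors are normalized."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    query_vec: Sequence[float],
    entries: Sequence[Tuple[Sequence[float], str]],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the cached result with the highest similarity at or above threshold."""
    best_score, best_result = threshold, None
    for vec, result in entries:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_score, best_result = score, result
    return best_result


if __name__ == "__main__":
    # Hand-made 2-D vectors stand in for real embeddings
    entries = [([1.0, 0.0], "answer A"), ([0.6, 0.8], "answer B")]
    print(best_match([1.0, 0.05], entries, threshold=0.9))  # answer A
    print(best_match([0.0, 1.0], entries, threshold=0.9))   # None
```

Unlike the first-match scan in `SemanticCache.get`, `best_match` returns the most similar entry, which matters once several cached questions clear the threshold. A suitable threshold depends on the embedding model and should be tuned on real queries.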

3.3 Distributed Cache (Redis)

python
# ===== Distributed cache (Redis) =====

import json
import hashlib
from typing import Optional

import redis
from openai import OpenAI

class RedisCache:
    """Redis cache."""

    def __init__(self, redis_url: str = "redis://localhost:6379", ttl: int = 3600):
        """
        Initialize the Redis cache.

        Args:
            redis_url: Redis connection URL
            ttl: time to live of a cache entry, in seconds
        """
        self.redis = redis.from_url(redis_url)
        self.ttl = ttl

    def _generate_key(self, messages: list, model: str, **kwargs) -> str:
        """Generate the cache key."""
        content = {
            "messages": messages,
            "model": model,
            **kwargs
        }
        content_str = json.dumps(content, sort_keys=True)
        return f"ai_cache:{hashlib.md5(content_str.encode()).hexdigest()}"

    def get(self, messages: list, model: str, **kwargs) -> Optional[str]:
        """Look up a cached result."""
        key = self._generate_key(messages, model, **kwargs)

        cached = self.redis.get(key)
        if cached is not None:
            return json.loads(cached)
        return None

    def set(self, messages: list, model: str, result: str, **kwargs):
        """Store a result; Redis expires the key after ttl seconds."""
        key = self._generate_key(messages, model, **kwargs)

        self.redis.setex(
            key,
            self.ttl,
            json.dumps(result)
        )

    def clear(self):
        """Clear the cache."""
        # SCAN iterates incrementally; KEYS would block the server on large datasets
        for key in self.redis.scan_iter(match="ai_cache:*"):
            self.redis.delete(key)

# Use the Redis cache
redis_cache = RedisCache(redis_url="redis://localhost:6379", ttl=3600)

def redis_cached_api_call(messages: list, model: str = "gpt-4o-mini"):
    """API call with Redis caching."""
    # Check the cache
    cached = redis_cache.get(messages, model)
    if cached is not None:
        print("Using Redis cache")
        return cached

    # Call the API
    print("Calling API")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    result = response.choices[0].message.content

    # Write to the cache
    redis_cache.set(messages, model, result)

    return result

4. Cost Optimization

4.1 Model Selection Strategy

python
# ===== Model selection strategy =====

from enum import Enum
from dataclasses import dataclass
from typing import Optional

class TaskComplexity(Enum):
    """Task complexity."""
    LOW = "low"          # simple tasks
    MEDIUM = "medium"    # medium tasks
    HIGH = "high"        # complex tasks

@dataclass
class ModelConfig:
    """Model configuration."""
    name: str
    input_price: float   # $/1M tokens
    output_price: float  # $/1M tokens
    speed: float         # tokens/second (approx)
    capability: float    # 0-1, capability score

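# Model configuration. Prices, speeds, and capability scores are illustrative
# snapshots and change over time; check the provider's current pricing page.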
MODELS = {
    "gpt-4o": ModelConfig(
        name="gpt-4o",
        input_price=5.00,
        output_price=15.00,
        speed=60.0,
        capability=0.95
    ),
    "gpt-4o-mini": ModelConfig(
        name="gpt-4o-mini",
        input_price=0.15,
        output_price=0.60,
        speed=100.0,
        capability=0.80
    ),
    "gpt-3.5-turbo": ModelConfig(
        name="gpt-3.5-turbo",
        input_price=0.50,
        output_price=1.50,
        speed=150.0,
        capability=0.70
    )
}

def select_model(
    task_complexity: TaskComplexity,
    budget_sensitive: bool = False,
    speed_sensitive: bool = False
) -> str:
    """
    Select a model.

    Args:
        task_complexity: task complexity
        budget_sensitive: whether cost is the main concern (takes priority)
        speed_sensitive: whether latency is the main concern

    Returns:
        str: model name
    """
    if budget_sensitive:
        return "gpt-4o-mini"

    if speed_sensitive:
        return "gpt-3.5-turbo"

    if task_complexity == TaskComplexity.HIGH:
        return "gpt-4o"
    elif task_complexity == TaskComplexity.MEDIUM:
        return "gpt-4o-mini"
    else:
        # Per the MODELS table, gpt-3.5-turbo costs more than gpt-4o-mini,
        # so it is picked here for speed rather than price
        return "gpt-3.5-turbo"

def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int
) -> float:
    """
    Estimate the cost of a request.

    Args:
        model: model name
        input_tokens: number of input tokens
        output_tokens: number of output tokens

    Returns:
        float: cost in US dollars
    """
    if model not in MODELS:
        raise ValueError(f"Unknown model: {model}")

    config = MODELS[model]

    input_cost = (input_tokens / 1_000_000) * config.input_price
    output_cost = (output_tokens / 1_000_000) * config.output_price

    return input_cost + output_cost

# Test
if __name__ == '__main__':
    # Select a model
    model = select_model(
        task_complexity=TaskComplexity.MEDIUM,
        budget_sensitive=True
    )
    print(f"Recommended model: {model}")

    # Estimate the cost
    cost = estimate_cost(
        model="gpt-4o-mini",
        input_tokens=1000,
        output_tokens=500
    )
    # Six decimals: a single small request costs a fraction of a cent
    print(f"Estimated cost: ${cost:.6f}")
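
A single request costs a fraction of a cent, so the difference between models only becomes visible at volume. The sketch below projects a monthly bill from the per-million-token prices used in this section. `monthly_cost` and the traffic figures (10,000 requests per day, 1,000 input and 500 output tokens each) are hypothetical, and the prices are the article's snapshot values rather than current list prices.

```python
# (input, output) price in USD per 1M tokens, copied from the MODELS table above
PRICES_PER_1M = {
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-3.5-turbo": (0.50, 1.50),
}


def monthly_cost(
    model: str,
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    days: int = 30,
) -> float:
    """Projected cost in USD for a steady workload."""
    input_price, output_price = PRICES_PER_1M[model]
    per_request = (
        input_tokens * input_price + output_tokens * output_price
    ) / 1_000_000
    return per_request * requests_per_day * days


if __name__ == "__main__":
    for name in PRICES_PER_1M:
        cost = monthly_cost(name, requests_per_day=10_000,
                            input_tokens=1000, output_tokens=500)
        print(f"{name}: ${cost:.2f}/month")
    # gpt-4o: $3750.00/month
    # gpt-4o-mini: $135.00/month
    # gpt-3.5-turbo: $375.00/month
```

With these figures, routing a task from `gpt-4o` to `gpt-4o-mini` cuts the projected bill by a factor of roughly 28, which is why model selection comes first in this section.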

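4.2 Prompt Optimization

python
# ===== Prompt optimization =====

from openai import OpenAI
import tiktoken

client = OpenAI()

def _get_encoding(model: str):
    """Pick the tokenizer for a model (the gpt-4o family uses o200k_base, not cl100k_base)."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to the gpt-4o family encoding
        return tiktoken.get_encoding("o200k_base")

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens (an estimate; chat requests add a few tokens of message overhead)."""
    encoding = _get_encoding(model)
    return len(encoding.encode(text))

def optimize_prompt(prompt: str, max_tokens: int = 1000, model: str = "gpt-4o-mini") -> str:
    """
    Optimize (compress) a prompt.

    Args:
        prompt: original prompt
        max_tokens: maximum number of tokens
        model: model whose tokenizer is used for counting

    Returns:
        str: optimized prompt
    """
    # 1. Collapse redundant whitespace (this also flattens line breaks)
    optimized = " ".join(prompt.split())

    # 2. Check the token count
    encoding = _get_encoding(model)
    token_ids = encoding.encode(optimized)

    if len(token_ids) <= max_tokens:
        return optimized

    # 3. Truncate (keep the first max_tokens tokens)
    truncated = encoding.decode(token_ids[:max_tokens])

    return truncated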

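def use_few_shot_examples():
    """Use a few examples instead of long explanations."""
    prompt = """Classify the sentiment of the following text (positive/negative/neutral):

Example 1: "This product is fantastic!" -> positive
Example 2: "Terrible experience, would not recommend." -> negative
Example 3: "This product is okay, I guess." -> neutral

Now classify: "Great value for money, recommended."

Output only the label, with no explanation."""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10  # Cap the output length
    )
    
    return response.choices[0].message.content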

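def use_system_prompt():
    """Use a system prompt so the instructions are not repeated in every turn."""
    messages = [
        {"role": "system", "content": "You are a sentiment analysis assistant. Output only: positive/negative/neutral"},
        {"role": "user", "content": "This product is fantastic!"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Terrible experience, would not recommend."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Great value for money, recommended."}
    ]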
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=10
    )
    
    return response.choices[0].message.content
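
The system-prompt pattern above can be wrapped in a small helper so that the labelled examples live in one place. The following is a minimal sketch, assuming a hypothetical helper named `build_few_shot_messages` (it is not part of the OpenAI SDK); it only builds the `messages` list and makes no API call.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    system_prompt: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> List[Dict[str, str]]:
    """Build a chat `messages` list: system prompt, example turns, then the query.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, label in examples:
        # Each example becomes a user turn followed by the expected assistant reply
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": label})
    # The real input goes last so the model answers it in the same format
    messages.append({"role": "user", "content": query})
    return messages


messages = build_few_shot_messages(
    system_prompt="You are a sentiment analysis assistant. Output only: positive/negative/neutral",
    examples=[
        ("This product is fantastic!", "positive"),
        ("Terrible experience, would not recommend.", "negative"),
    ],
    query="Great value for money, recommended.",
)
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`, exactly as in `use_system_prompt`.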

4.3 Using the Batch API

python
# ===== Using the Batch API =====

import json
import tempfile
import time

from openai import OpenAI

client = OpenAI()

def create_batch():
    """Create a batch job."""
    requests = [
        {
            "custom_id": "request-1",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "1+1=?"}],
                "max_tokens": 1024
            }
        },
        {
            "custom_id": "request-2",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "2+2=?"}],
                "max_tokens": 1024
            }
        }
    ]
    
    batch = client.batches.create(
        input_file_id=upload_requests(requests),
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    
    print(f"Batch ID: {batch.id}")
    print(f"Status: {batch.status}")
    
    return batch.id

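def upload_requests(requests: list) -> str:
    """Write the requests to a JSONL file and upload it; returns the file ID."""
    import json
    import os
    import tempfile
    
    # Write one JSON request per line to a temporary .jsonl file
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.jsonl') as f:
        for req in requests:
            f.write(json.dumps(req) + '\n')
        temp_path = f.name
    
    try:
        # Upload the file for batch processing
        with open(temp_path, 'rb') as f:
            file = client.files.create(
                file=f,
                purpose="batch"
            )
    finally:
        # Remove the temporary file once the upload has finished or failed
        os.remove(temp_path)
    
    return file.id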

def check_batch_status(batch_id: str):
    """Query the status of a batch job."""
    batch = client.batches.retrieve(batch_id)
    
    print(f"Status: {batch.status}")
    print(f"Total requests: {batch.request_counts.total}")
    print(f"Completed: {batch.request_counts.completed}")
    print(f"Failed: {batch.request_counts.failed}")
    
    return batch.status

def wait_for_batch(batch_id: str, check_interval: int = 60) -> str:
    """Poll until the batch job reaches a terminal state; returns the final status."""
    while True:
        status = check_batch_status(batch_id)
        
        if status == "completed":
            print("Batch job completed!")
            return status
        elif status in ["failed", "cancelled", "expired"]:
            print(f"Batch job ended with status: {status}")
            return status
        
        print(f"Checking again in {check_interval} seconds...")
        time.sleep(check_interval)

def get_batch_results(batch_id: str):
    """Fetch the results of a batch job."""
    batch = client.batches.retrieve(batch_id)
    
    # Download the output file
    result_file_id = batch.output_file_id
    if not result_file_id:
        print("No output file available")
        return []
    
    result_content = client.files.content(result_file_id)
    
    # Parse the JSONL output, one result per line
    results = []
    for line in result_content.text.split('\n'):
        if line.strip():
            results.append(json.loads(line))
    
    return results

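# End-to-end flow
def batch_processing_example():
    """Batch processing example."""
    # 1. Create the batch job
    batch_id = create_batch()
    
    # 2. Wait for it to finish
    wait_for_batch(batch_id)
    
    # 3. Fetch the results
    results = get_batch_results(batch_id)
    
    # 4. Process the results
    for r in results:
        custom_id = r["custom_id"]
        if r.get("error") or not r.get("response"):
            # Defensive check: skip entries that carry no response body
            print(f"{custom_id}: failed ({r.get('error')})")
            continue
        content = r["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{custom_id}: {content}")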

# Cost comparison
def compare_cost_normal_vs_batch(num_requests: int):
    """Compare the cost of the regular API and the Batch API."""
    # Assume each request averages 1000 input tokens + 500 output tokens
    input_tokens = 1000 * num_requests
    output_tokens = 500 * num_requests
    
    # Regular API cost (estimate_cost comes from section 4.1)
    normal_cost = estimate_cost("gpt-4o-mini", input_tokens, output_tokens)
    
    # Batch API cost (billed at a 50% discount)
    batch_cost = normal_cost * 0.5
    
    print(f"Regular API cost: ${normal_cost:.2f}")
    print(f"Batch API cost: ${batch_cost:.2f}")
    print(f"Savings: ${normal_cost - batch_cost:.2f}")
    
    return batch_cost
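
Each line of the Batch API output file is a JSON object that echoes the request's `custom_id` alongside a `response` and an `error` field. The parsing step can be tried offline, without calling the API. The sketch below assumes a hypothetical helper named `parse_batch_output`; the sample lines are made up and trimmed to the fields used here.

```python
import json
from typing import Dict, Optional


def parse_batch_output(jsonl_text: str) -> Dict[str, Optional[str]]:
    """Map each custom_id to its reply text, or None when the request failed.

    Hypothetical helper: assumes chat-completion bodies like the ones requested above.
    """
    results: Dict[str, Optional[str]] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines (e.g. a trailing newline)
        record = json.loads(line)
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        results[record["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return results


# Made-up sample output: one success and one failure
sample = "\n".join([
    json.dumps({
        "custom_id": "request-1",
        "response": {"status_code": 200,
                     "body": {"choices": [{"message": {"role": "assistant", "content": "2"}}]}},
        "error": None,
    }),
    json.dumps({
        "custom_id": "request-2",
        "response": None,
        "error": {"code": "example_error", "message": "made-up failure"},
    }),
])

parsed = parse_batch_output(sample)
```

Keying the results by `custom_id` matters because the output order is not guaranteed to match the input order.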

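5. Monitoring and Alerting

5.1 Token Statistics

python
# ===== Token statistics =====

import time
from dataclasses import dataclass, field
from threading import Lock
from typing import Dict, List
from collections import defaultdict

from openai import OpenAI

# estimate_cost comes from section 4.1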

@dataclass
class TokenUsage:
    """Token usage for a single request."""
    input_tokens: int
    output_tokens: int
    timestamp: float
    model: str
    user_id: str = "unknown"

class TokenTracker:
    """Tracks token usage per user and per model."""
    
    def __init__(self):
        self.usage_history: List[TokenUsage] = []
        self.user_usage: Dict[str, int] = defaultdict(int)
        self.model_usage: Dict[str, int] = defaultdict(int)
        self.lock = Lock()
    
    def track(self, usage: TokenUsage):
        """Record one usage entry."""
        with self.lock:
            self.usage_history.append(usage)
            self.user_usage[usage.user_id] += usage.input_tokens + usage.output_tokens
            self.model_usage[usage.model] += usage.input_tokens + usage.output_tokens
    
    def get_user_usage(self, user_id: str, time_window: float = 3600) -> int:
        """
        Return a user's token usage within a time window.
        
        Args:
            user_id: User ID
            time_window: Time window in seconds
            
        Returns:
            int: Number of tokens used
        """
        now = time.time()
        total = 0
        
        with self.lock:
            for usage in self.usage_history:
                if usage.user_id == user_id and now - usage.timestamp <= time_window:
                    total += usage.input_tokens + usage.output_tokens
        
        return total
    
    def get_model_usage(self, model: str, time_window: float = 3600) -> int:
        """Return a model's token usage within a time window (seconds)."""
        now = time.time()
        total = 0
        
        with self.lock:
            for usage in self.usage_history:
                if usage.model == model and now - usage.timestamp <= time_window:
                    total += usage.input_tokens + usage.output_tokens
        
        return total
    
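    def get_top_users(self, limit: int = 10) -> List[tuple]:
        """Return the users with the highest all-time token usage."""
        # Take the lock so the dict is not mutated by track() while sorting
        with self.lock:
            sorted_users = sorted(
                self.user_usage.items(),
                key=lambda x: x[1],
                reverse=True
            )
        return sorted_users[:limit]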
    
    def get_report(self, time_window: float = 3600) -> Dict:
        """Build a usage report (token and cost totals cover the time window; user stats are all-time)."""
        now = time.time()
        
        total_tokens = 0
        total_cost = 0.0
        
        with self.lock:
            for usage in self.usage_history:
                if now - usage.timestamp <= time_window:
                    total_tokens += usage.input_tokens + usage.output_tokens
                    total_cost += estimate_cost(
                        usage.model,
                        usage.input_tokens,
                        usage.output_tokens
                    )
        
        return {
            "total_tokens": total_tokens,
            "total_cost_usd": total_cost,
            "unique_users": len(self.user_usage),
            "models_used": list(self.model_usage.keys()),
            "top_users": self.get_top_users(10)
        }

# Using the token tracker
tracker = TokenTracker()

def tracked_api_call(messages: list, model: str = "gpt-4o-mini", user_id: str = "unknown"):
    """API call with token tracking."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    
    # Record the usage reported by the API
    usage = TokenUsage(
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        timestamp=time.time(),
        model=model,
        user_id=user_id
    )
    tracker.track(usage)
    
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    result = tracked_api_call(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt-4o-mini",
        user_id="user_001"
    )
    
    # Fetch the report
    report = tracker.get_report()
    print(f"Report: {report}")
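
`TokenTracker.usage_history` above grows without bound, and every windowed query scans the whole list. One possible refinement is sketched below, under the assumption that records arrive in timestamp order: keep them in a `deque` and drop entries older than the longest window of interest. `WindowedUsage` and its methods are hypothetical names, not part of the original tracker.

```python
from collections import deque
from typing import Deque, Tuple


class WindowedUsage:
    """Keep (timestamp, tokens) pairs for a bounded time window (illustrative sketch)."""

    def __init__(self, max_age: float = 3600.0):
        self.max_age = max_age
        self._records: Deque[Tuple[float, int]] = deque()

    def add(self, timestamp: float, tokens: int) -> None:
        # Assumes timestamps are non-decreasing, so the oldest record is always on the left
        self._records.append((timestamp, tokens))
        self._prune(timestamp)

    def total(self, now: float) -> int:
        self._prune(now)
        return sum(tokens for _, tokens in self._records)

    def _prune(self, now: float) -> None:
        # Drop records that have fallen out of the window
        while self._records and now - self._records[0][0] > self.max_age:
            self._records.popleft()


usage = WindowedUsage(max_age=60.0)
usage.add(timestamp=0.0, tokens=100)
usage.add(timestamp=30.0, tokens=200)
usage.add(timestamp=90.0, tokens=50)  # the record at t=0 is now 90 s old and is dropped
```

The sketch is not thread-safe on its own; in a multi-threaded service it would still need a lock like the one in `TokenTracker`.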

5.2 Cost Tracking

python
# ===== Cost tracking =====

from dataclasses import dataclass
from threading import Lock
from typing import Dict, List
import time

from openai import OpenAI

@dataclass
class CostRecord:
    """Cost record for a single request."""
    timestamp: float
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    user_id: str
    request_id: str

class CostTracker:
    """Tracks API spending."""
    
    def __init__(self):
        self.records: List[CostRecord] = []
        self.lock = Lock()
    
    def track_cost(self, record: CostRecord):
        """Record one cost entry."""
        with self.lock:
            self.records.append(record)
    
    def get_total_cost(self, time_window: float = 86400) -> float:
        """
        Return the total cost within a time window.
        
        Args:
            time_window: Time window in seconds (default: 1 day)
            
        Returns:
            float: Total cost in US dollars
        """
        now = time.time()
        total = 0.0
        
        with self.lock:
            for record in self.records:
                if now - record.timestamp <= time_window:
                    total += record.cost_usd
        
        return total
    
    def get_user_cost(self, user_id: str, time_window: float = 86400) -> float:
        """Return one user's cost within a time window."""
        now = time.time()
        total = 0.0
        
        with self.lock:
            for record in self.records:
                if record.user_id == user_id and now - record.timestamp <= time_window:
                    total += record.cost_usd
        
        return total
    
    def get_model_cost(self, model: str, time_window: float = 86400) -> float:
        """Return one model's cost within a time window."""
        now = time.time()
        total = 0.0
        
        with self.lock:
            for record in self.records:
                if record.model == model and now - record.timestamp <= time_window:
                    total += record.cost_usd
        
        return total
    
    def get_daily_cost_breakdown(self, days: int = 7) -> Dict[str, float]:
        """Return the cost per calendar day (local time) for the last `days` days."""
        now = time.time()
        daily_costs = {}
        
        with self.lock:
            for record in self.records:
                if now - record.timestamp <= days * 86400:
                    day = time.strftime("%Y-%m-%d", time.localtime(record.timestamp))
                    daily_costs[day] = daily_costs.get(day, 0.0) + record.cost_usd
        
        return daily_costs
    
    def check_budget(self, budget: float, time_window: float = 86400) -> bool:
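        """
        Check spending against a budget.
        
        Args:
            budget: Budget in US dollars
            time_window: Time window in seconds
            
        Returns:
            bool: True if spending in the window exceeds the budget
        """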
        total_cost = self.get_total_cost(time_window)
        return total_cost > budget

# Using the cost tracker
cost_tracker = CostTracker()

def tracked_api_call_with_cost(messages: list, model: str = "gpt-4o-mini", user_id: str = "unknown"):
    """API call with cost tracking."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    
    # Compute the cost (estimate_cost comes from section 4.1)
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    cost = estimate_cost(model, input_tokens, output_tokens)
    
    # Record the cost; the API's own response ID is unique,
    # unlike a second-resolution timestamp
    record = CostRecord(
        timestamp=time.time(),
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=cost,
        user_id=user_id,
        request_id=response.id
    )
    cost_tracker.track_cost(record)
    
    return response.choices[0].message.content

# Test
if __name__ == '__main__':
    result = tracked_api_call_with_cost(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt-4o-mini",
        user_id="user_001"
    )
    
    # Fetch the cost report
    total_cost = cost_tracker.get_total_cost()
    print(f"Total cost: ${total_cost:.4f}")
    
    daily_breakdown = cost_tracker.get_daily_cost_breakdown()
    print(f"Daily cost: {daily_breakdown}")
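
`check_budget` above only reports whether spending has already exceeded the budget. For alerting it can help to warn before that point. The following is a minimal sketch, assuming a hypothetical `budget_status` helper and an 80% warning threshold chosen purely for illustration.

```python
def budget_status(spent_usd: float, budget_usd: float, warn_ratio: float = 0.8) -> str:
    """Classify spending against a budget as 'ok', 'warning', or 'exceeded'.

    Hypothetical helper; the 0.8 default threshold is an assumption, not a recommendation.
    """
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    if spent_usd > budget_usd:
        return "exceeded"
    if spent_usd >= warn_ratio * budget_usd:
        return "warning"
    return "ok"


# Example with made-up numbers: a 10 USD daily budget
for spent in (2.0, 8.5, 12.0):
    print(f"spent ${spent:.2f}: {budget_status(spent, budget_usd=10.0)}")
```

With the tracker above, `spent_usd` could come from `cost_tracker.get_total_cost()`, and a "warning" or "exceeded" result could trigger a notification or a switch to a cheaper model.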

5.3 Performance Monitoring

python
# ===== Performance monitoring =====

import time
from dataclasses import dataclass
from threading import Lock
from typing import Dict, List, Optional
from collections import defaultdict

from openai import OpenAI

@dataclass
class PerformanceMetric:
    """A single performance measurement."""
    timestamp: float
    latency_ms: float
    model: str
    success: bool
    error_type: Optional[str] = None

class PerformanceMonitor:
    """Collects latency and success metrics."""
    
    def __init__(self):
        self.metrics: List[PerformanceMetric] = []
        self.model_latency: Dict[str, List[float]] = defaultdict(list)
        self.lock = Lock()
    
    def record(self, metric: PerformanceMetric):
        """Record a performance metric."""
        with self.lock:
            self.metrics.append(metric)
            self.model_latency[metric.model].append(metric.latency_ms)
    
    def get_average_latency(self, model: str = None, time_window: float = 3600) -> float:
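        """
        Return the average latency within a time window.
        
        Args:
            model: Model name (None means all models)
            time_window: Time window in seconds
            
        Returns:
            float: Average latency in milliseconds (0.0 if there is no data)
        """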
        now = time.time()
        latencies = []
        
        with self.lock:
            for metric in self.metrics:
                if now - metric.timestamp <= time_window:
                    if model is None or metric.model == model:
                        latencies.append(metric.latency_ms)
        
        return sum(latencies) / len(latencies) if latencies else 0.0
    
    def get_success_rate(self, model: str = None, time_window: float = 3600) -> float:
        """Return the success rate (0-1) within a time window."""
        now = time.time()
        total = 0
        success = 0
        
        with self.lock:
            for metric in self.metrics:
                if now - metric.timestamp <= time_window:
                    if model is None or metric.model == model:
                        total += 1
                        if metric.success:
                            success += 1
        
        return success / total if total > 0 else 0.0
    
    def get_error_distribution(self, time_window: float = 3600) -> Dict[str, int]:
        """Return the number of failures per error type within a time window."""
        now = time.time()
        errors = defaultdict(int)
        
        with self.lock:
            for metric in self.metrics:
                if not metric.success and now - metric.timestamp <= time_window:
                    errors[metric.error_type] += 1
        
        return dict(errors)
    
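    def get_percentile_latency(self, percentile: float, model: str = None) -> float:
        """
        Return the latency at the given percentile.
        
        Args:
            percentile: Percentile as a fraction (0-1)
            model: Model name (None means all models)
            
        Returns:
            float: Latency in milliseconds
        """
        latencies = []
        
        with self.lock:
            for metric in self.metrics:
                if model is None or metric.model == model:
                    latencies.append(metric.latency_ms)
        
        if not latencies:
            return 0.0
        
        latencies.sort()
        # Clamp the index so that percentile=1.0 returns the maximum
        # instead of raising IndexError
        idx = min(int(len(latencies) * percentile), len(latencies) - 1)
        return latencies[idx]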

# Using the performance monitor
monitor = PerformanceMonitor()

def monitored_api_call(messages: list, model: str = "gpt-4o-mini"):
    """API call with performance monitoring; returns None if the call fails."""
    start_time = time.time()
    success = True
    error_type = None
    
    try:
        client = OpenAI()
        response = client.chat.completions.create(
            model=model,
            messages=messages
        )
        result = response.choices[0].message.content
    
    except Exception as e:
        success = False
        error_type = type(e).__name__
        result = None
    
    finally:
        end_time = time.time()
        latency_ms = (end_time - start_time) * 1000
        
        metric = PerformanceMetric(
            timestamp=start_time,
            latency_ms=latency_ms,
            model=model,
            success=success,
            error_type=error_type
        )
        monitor.record(metric)
    
    return result

# Test
if __name__ == '__main__':
    result = monitored_api_call(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt-4o-mini"
    )
    
    # Fetch the performance report
    avg_latency = monitor.get_average_latency()
    success_rate = monitor.get_success_rate()
    
    print(f"Average latency: {avg_latency:.2f} ms")
    print(f"Success rate: {success_rate:.2%}")
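
Averages hide tail latency, so dashboards usually report percentiles such as p50 and p95 as well. The sketch below shows one common definition, the nearest-rank percentile, on made-up sample latencies. The function name `nearest_rank_percentile` is hypothetical, and other percentile definitions (for example with interpolation) give slightly different values.

```python
import math
from typing import List


def nearest_rank_percentile(values: List[float], percentile: float) -> float:
    """Return the nearest-rank percentile of values; percentile is a fraction in [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= percentile <= 1.0:
        raise ValueError("percentile must be between 0 and 1")
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least `percentile` of the data at or below it
    rank = math.ceil(percentile * len(ordered))
    index = max(rank - 1, 0)  # percentile=0 maps to the minimum
    return ordered[index]


latencies_ms = [120.0, 95.0, 310.0, 150.0, 101.0]  # made-up sample latencies
p50 = nearest_rank_percentile(latencies_ms, 0.50)
p95 = nearest_rank_percentile(latencies_ms, 0.95)
```

With only five samples, p95 is simply the slowest request, which is a reminder that tail percentiles need a reasonable sample size to be meaningful.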

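6. Production Case Study

6.1 Case Study: An Optimized AI API Call System

python
# ===== Case study: an optimized AI API call system =====

# ResultCache, SlidingWindowRateLimiter, exponential_backoff_retry, TokenTracker,
# TokenUsage, PerformanceMonitor and PerformanceMetric are assumed to be in scope
# from the earlier listings in this article.
from openai import OpenAI, RateLimitError, APIConnectionError, InternalServerError
import time
import json
from dataclasses import dataclass
from typing import Optional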

@dataclass
class APIConfig:
    """API configuration."""
    model: str = "gpt-4o-mini"
    max_retries: int = 3
    timeout: float = 30.0
    cache_ttl: int = 3600

class OptimizedAIService:
    """Optimized AI service."""
    
    def __init__(self, config: Optional[APIConfig] = None):
        self.config = config or APIConfig()
        self.client = OpenAI(timeout=self.config.timeout)
        self.cache = ResultCache(ttl=self.config.cache_ttl)
        self.tracker = TokenTracker()
        self.monitor = PerformanceMonitor()
        self.limiter = SlidingWindowRateLimiter(max_requests=50, window_size=60.0)
    
    @exponential_backoff_retry(max_retries=3)
    def chat(self, messages: list, user_id: str = "unknown") -> str:
        """
        Optimized chat API.
        
        Features:
        1. Caching
        2. Retries
        3. Rate limiting
        4. Token tracking
        5. Performance monitoring
        """
        # 1. Check the cache
        cached = self.cache.get(messages, self.config.model)
        if cached:
            print("Cache hit")
            return cached
        
        # 2. Rate limit check
        if not self.limiter.allow_request():
            raise Exception("Rate limit reached, please retry later")
        
        # 3. Record the start time
        start_time = time.time()
        success = True
        error_type = None
        
        try:
            # 4. Call the API
            response = self.client.chat.completions.create(
                model=self.config.model,
                messages=messages,
                max_tokens=1024
            )
            
            result = response.choices[0].message.content
            
            # 5. Write to the cache
            self.cache.set(messages, self.config.model, result)
            
            # 6. Track token usage
            usage = TokenUsage(
                input_tokens=response.usage.prompt_tokens,
                output_tokens=response.usage.completion_tokens,
                timestamp=time.time(),
                model=self.config.model,
                user_id=user_id
            )
            self.tracker.track(usage)
            
            return result
        
        except Exception as e:
            success = False
            error_type = type(e).__name__
            raise
        
        finally:
            # 7. Record performance (runs on both success and failure)
            end_time = time.time()
            latency_ms = (end_time - start_time) * 1000
            
            metric = PerformanceMetric(
                timestamp=start_time,
                latency_ms=latency_ms,
                model=self.config.model,
                success=success,
                error_type=error_type
            )
            self.monitor.record(metric)
    
    def get_stats(self) -> dict:
        """Return usage and performance statistics."""
        return {
            "token_usage": self.tracker.get_report(),
            "performance": {
                "avg_latency": self.monitor.get_average_latency(),
                "success_rate": self.monitor.get_success_rate()
            }
        }

# Using the optimized AI service
service = OptimizedAIService()

if __name__ == '__main__':
    # First call
    result1 = service.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        user_id="user_001"
    )
    print(f"Result 1: {result1[:50]}...")
    
    # Second call (served from the cache)
    result2 = service.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        user_id="user_001"
    )
    print(f"Result 2: {result2[:50]}...")
    
    # Fetch the statistics
    stats = service.get_stats()
    print(f"Stats: {stats}")

7. Summary

7.1 Key Points

AI API call optimization
  Retry mechanisms: exponential backoff, circuit breaker
  Rate limiting: token bucket, leaky bucket, sliding window
  Caching strategies: result cache, semantic cache, distributed cache
  Cost optimization: model selection, prompt optimization, Batch API
  Monitoring and alerting: token statistics, cost tracking, performance monitoring

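7.2 Best Practices

Practice | Description
Retry strategy | Exponential backoff with jitter, to avoid retry storms
Rate limiting algorithm | Token bucket (allows bursts) or leaky bucket (smooths the rate)
Caching strategy | Result cache plus semantic cache, to cut repeated calls
Cost optimization | Model selection, prompt optimization, and the Batch API
Monitoring and alerting | Token statistics, cost tracking, and performance monitoring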
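
To make the first row concrete: with plain exponential backoff, clients that fail together also retry together. Adding jitter randomizes each wait so the retries spread out. The sketch below uses the "full jitter" variant, in which each wait is drawn uniformly between zero and the capped exponential delay. The function name `backoff_delays` and its parameters are assumptions for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized delay (seconds) per retry attempt, using full jitter.

    Attempt n waits a random time in [0, min(cap, base * 2**n)].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        delays.append(rng.uniform(0.0, ceiling))   # jitter spreads clients apart
    return delays


# A seeded generator makes the schedule reproducible for testing
delays = backoff_delays(max_retries=5, base=1.0, cap=8.0, rng=random.Random(42))
```

A retry loop would sleep for `delays[n]` before attempt `n + 1` and give up once the list is exhausted.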

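7.3 Optimization Results

Optimization | Approximate effect
Retry mechanism | Failure rate reduced by 50%+
Rate limiting | Avoids triggering provider rate limits
Result cache | API calls reduced by 30%+
Semantic cache | API calls reduced by 50%+
Model selection | Cost reduced by 50%+
Batch API | Cost reduced by 50%
Prompt optimization | Token consumption reduced by 20%+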
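
These effects compound rather than add. As a rough worked example with made-up inputs (the baseline spend, hit rate, and batch share below are hypothetical, not measurements), caching removes a share of the calls, and the Batch API discount then applies only to the remaining calls that can tolerate asynchronous processing. The function name `projected_cost` is an assumption for illustration.

```python
def projected_cost(
    baseline_usd: float,
    cache_hit_rate: float,
    batch_share: float,
    batch_discount: float = 0.5,
) -> float:
    """Estimate spend after caching and batching (illustrative arithmetic only).

    baseline_usd:   spend with no optimization
    cache_hit_rate: fraction of calls answered from the cache (0-1)
    batch_share:    fraction of the remaining calls sent through the Batch API (0-1)
    batch_discount: price reduction for batched calls (0.5 = 50% off)
    """
    after_cache = baseline_usd * (1.0 - cache_hit_rate)
    batched = after_cache * batch_share * (1.0 - batch_discount)
    realtime = after_cache * (1.0 - batch_share)
    return batched + realtime


# Hypothetical inputs: 1000 USD baseline, 30% cache hits, 40% of the rest batched
cost = projected_cost(1000.0, cache_hit_rate=0.3, batch_share=0.4)
print(f"Projected cost: ${cost:.2f}")
```

With these inputs the projection comes to about 560 USD, a 44% reduction, which is less than the sum of the individual percentages would suggest.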

This article is based on OpenAI API best practices. Questions are welcome in the comments!
