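Preface
💡 Pain points: Do your AI API calls keep failing? Do you hit rate limits all the time? Are costs stubbornly high? Are slow responses hurting the user experience?
🎯 Solution: master AI API call optimization, from retry mechanisms and rate limiting to caching strategies and cost optimization.
AI API call optimization at a glance: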
[Diagram: AI API call optimization]
- Retry mechanisms: exponential backoff, circuit breaker, jitter
- Rate limiting: token bucket, leaky bucket, sliding window
- Caching strategies: result cache, semantic cache, distributed cache
- Cost optimization: model selection, Batch API, prompt optimization
- Monitoring and alerting: token statistics, cost tracking, performance monitoring
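Common problems and solutions:
| Problem | Cause | Solution |
|---|---|---|
| Call failures | Network errors / server errors | Retry mechanisms |
| Rate limiting | RPM/TPM limits triggered | Client-side rate limiting |
| High cost | Expensive models / heavy token usage | Cost optimization |
| Slow responses | Network latency / slow model inference | Caching / streaming responses |
| No monitoring | Lack of observability | Monitoring and alerting |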
1. Retry Mechanisms
1.1 Exponential Backoff Retry
python
# ===== Exponential backoff retry =====
import functools
import random
import time

from openai import (
    APIConnectionError,
    InternalServerError,
    OpenAI,
    RateLimitError,
)

# Transient errors worth retrying; anything else propagates immediately.
RETRYABLE_ERRORS = (RateLimitError, APIConnectionError, InternalServerError)


class RetryConfig:
    """Retry configuration."""

    def __init__(
        self,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_factor: float = 2.0,
        jitter: bool = True,
    ):
        self.max_retries = max_retries  # total attempts, including the first
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.backoff_factor = backoff_factor
        self.jitter = jitter


def exponential_backoff_retry(func=None, *, retry_config: RetryConfig = None):
    """Exponential backoff retry decorator.

    Works bare (@exponential_backoff_retry) or with arguments
    (@exponential_backoff_retry(retry_config=RetryConfig(max_retries=5))).
    """
    config = retry_config or RetryConfig()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(config.max_retries):
                try:
                    return fn(*args, **kwargs)
                except RETRYABLE_ERRORS as e:
                    if attempt == config.max_retries - 1:
                        raise  # last attempt failed: re-raise the error
                    # Compute the delay
                    delay = config.base_delay * (config.backoff_factor ** attempt)
                    # Add jitter (0-100% on top), then cap the result
                    if config.jitter:
                        delay *= 1 + random.random()
                    delay = min(delay, config.max_delay)
                    print(f"Error: {e}")
                    print(f"Retrying in {delay:.2f}s (retry {attempt + 1})...")
                    time.sleep(delay)

        return wrapper

    return decorator(func) if func is not None else decorator


# Using the decorator
@exponential_backoff_retry
def call_openai_api(messages: list):
    """Call the OpenAI API."""
    # The SDK also retries on its own (max_retries, 2 by default);
    # it is disabled here so the two retry layers do not multiply.
    client = OpenAI(max_retries=0)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=1024,
    )
    return response.choices[0].message.content


# Test
if __name__ == '__main__':
    result = call_openai_api(
        [{"role": "user", "content": "Hello!"}]
    )
    print(result)
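To make the delay schedule concrete, the sketch below computes the waits that a loop like the one above would use, first without jitter and then with a "full jitter" variant (a uniform draw between zero and the capped delay). The helper name `backoff_delays` and the full-jitter option are illustrative additions, not part of the original code.

```python
import random


def backoff_delays(attempts, base_delay=1.0, backoff_factor=2.0,
                   max_delay=60.0, full_jitter=False, rng=None):
    """Return the wait (in seconds) before each retry."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential growth, capped at max_delay.
        ceiling = min(base_delay * backoff_factor ** attempt, max_delay)
        # Full jitter picks a uniform value in [0, ceiling].
        delays.append(rng.uniform(0.0, ceiling) if full_jitter else ceiling)
    return delays


if __name__ == "__main__":
    print(backoff_delays(8))
    # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
    print(backoff_delays(8, full_jitter=True, rng=random.Random(0)))
```

Without jitter, clients that fail together retry together; randomizing the wait spreads them out, at the cost of a less predictable delay.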
1.2 Circuit Breaker Pattern
python
# ===== Circuit breaker pattern =====
import time
from dataclasses import dataclass
from enum import Enum
from threading import Lock

from openai import OpenAI


class CircuitState(Enum):
    """Circuit breaker states."""
    CLOSED = "closed"        # closed (normal operation)
    OPEN = "open"            # open (requests are rejected)
    HALF_OPEN = "half_open"  # half-open (probing for recovery)


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


@dataclass
class CircuitBreakerConfig:
    """Circuit breaker configuration."""
    failure_threshold: int = 5    # failures before the breaker opens
    success_threshold: int = 2    # successes needed in half-open state to close
    reset_timeout: float = 30.0   # seconds to stay open before probing again


class CircuitBreaker:
    """Circuit breaker."""

    def __init__(self, config: CircuitBreakerConfig = None):
        self.config = config or CircuitBreakerConfig()
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0.0
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        """Run func under circuit breaker protection."""
        with self.lock:
            if self.state == CircuitState.OPEN:
                # Check whether it is time to probe for recovery
                elapsed = time.monotonic() - self.last_failure_time
                if elapsed > self.config.reset_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                    print("Circuit breaker is now half-open")
                else:
                    raise CircuitOpenError("Circuit breaker is open; request rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        """Success callback."""
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.config.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                    print("Circuit breaker closed (recovered)")
            else:
                self.failure_count = 0

    def _on_failure(self):
        """Failure callback."""
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # Any failure while half-open reopens the breaker immediately.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.config.failure_threshold):
                self.state = CircuitState.OPEN
                print(f"Circuit breaker opened (failures: {self.failure_count})")


# Using the circuit breaker
breaker = CircuitBreaker()


def call_api_with_circuit_breaker(messages: list):
    """API call protected by the circuit breaker."""
    def _call():
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return response.choices[0].message.content

    return breaker.call(_call)


# Test
if __name__ == '__main__':
    for i in range(10):
        try:
            result = call_api_with_circuit_breaker(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
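The state transitions are easier to follow when they can be replayed without a live API or real waiting. The sketch below is a deliberately reduced breaker with an injectable clock; `TinyBreaker`, `run_demo`, and the single-success recovery rule are simplifications made up for this illustration, not the class defined above.

```python
import time


class TinyBreaker:
    """Reduced circuit breaker: one success in half-open state closes it."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=None):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock or time.monotonic
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open")
            self.state = "half_open"  # timeout elapsed: allow one probe
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.state = "closed"
        self.failures = 0
        return result


def run_demo():
    """Replay: three failures, one rejected call, then recovery after 31 s."""
    now = [0.0]  # fake clock, advanced by hand
    breaker = TinyBreaker(clock=lambda: now[0])

    def failing():
        raise ConnectionError("upstream down")

    trace = []
    for _ in range(3):
        try:
            breaker.call(failing)
        except ConnectionError:
            pass
        trace.append(breaker.state)
    try:
        breaker.call(lambda: "ok")
    except RuntimeError:
        trace.append("rejected")
    now[0] = 31.0
    breaker.call(lambda: "ok")
    trace.append(breaker.state)
    return trace


if __name__ == "__main__":
    print(run_demo())
    # → ['closed', 'closed', 'open', 'rejected', 'closed']
```

Injecting the clock keeps the demo deterministic; production code would normally leave the default `time.monotonic`.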
1.3 Retry Best Practices
python
# ===== Retry best practices =====
import functools
import random
import time

from openai import APIConnectionError, InternalServerError, OpenAI, RateLimitError

RETRYABLE_ERRORS = (RateLimitError, APIConnectionError, InternalServerError)


class RetryStrategy:
    """Retry strategies packaged as decorator factories."""

    @staticmethod
    def exponential_backoff(
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_factor: float = 2.0,
        jitter: bool = True,
    ):
        """Exponential backoff."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except RETRYABLE_ERRORS:
                        if attempt == max_retries - 1:
                            raise
                        delay = base_delay * (backoff_factor ** attempt)
                        if jitter:
                            delay *= 1 + random.random()
                        delay = min(delay, max_delay)  # cap after jitter
                        print(f"Retry {attempt + 1}/{max_retries} in {delay:.2f}s...")
                        time.sleep(delay)
            return wrapper
        return decorator

    @staticmethod
    def fixed_delay(
        max_retries: int = 3,
        delay: float = 1.0,
    ):
        """Fixed delay."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries):
                    try:
                        return func(*args, **kwargs)
                    except RETRYABLE_ERRORS:
                        if attempt == max_retries - 1:
                            raise
                        print(f"Retry {attempt + 1}/{max_retries} in {delay}s...")
                        time.sleep(delay)
            return wrapper
        return decorator


# Usage
class OpenAIService:
    """OpenAI service (with retries)."""

    def __init__(self):
        # One client per service; SDK-level retries are disabled because
        # the decorators below already retry.
        self.client = OpenAI(max_retries=0)

    @RetryStrategy.exponential_backoff(max_retries=3, base_delay=1.0, jitter=True)
    def chat(self, messages: list):
        """Chat (retried with exponential backoff)."""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return response.choices[0].message.content

    @RetryStrategy.fixed_delay(max_retries=3, delay=2.0)
    def chat_fixed(self, messages: list):
        """Chat (retried with a fixed delay)."""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return response.choices[0].message.content
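One practice the strategies above do not cover is honoring a server-provided `Retry-After` hint when a rate-limit response carries one. The sketch below shows only the parsing and the choice of delay; `parse_retry_after` and `choose_delay` are hypothetical helper names. With the OpenAI SDK the header would presumably be read from the exception's `response.headers`, which is an assumption worth checking against the SDK version in use.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Convert a Retry-After header value to seconds, or None if unusable.

    The header holds either a number of seconds or an HTTP date.
    """
    if not value:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def choose_delay(backoff_delay, retry_after_header, max_delay=60.0):
    """Wait at least as long as the server asks, but never beyond max_delay."""
    hinted = parse_retry_after(retry_after_header)
    delay = backoff_delay if hinted is None else max(backoff_delay, hinted)
    return min(delay, max_delay)


if __name__ == "__main__":
    print(choose_delay(2.0, "10"))   # → 10.0
    print(choose_delay(2.0, None))   # → 2.0
    print(choose_delay(2.0, "600"))  # → 60.0
```

Taking the larger of the two values keeps the backoff growing on repeated failures while never retrying earlier than the server requested.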
2. Rate Limiting
2.1 Token Bucket Algorithm
python
# ===== Token bucket algorithm =====
import time
from threading import Lock

from openai import OpenAI


class TokenBucket:
    """Token bucket. "Tokens" here are request permits, not LLM tokens."""

    def __init__(self, capacity: int, refill_rate: float):
        """
        Initialize the token bucket.
        Args:
            capacity: bucket capacity (maximum number of tokens)
            refill_rate: refill rate (tokens per second)
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill_time = time.monotonic()
        self.lock = Lock()

    def _refill(self):
        """Add the tokens accrued since the last refill (caller holds the lock)."""
        now = time.monotonic()
        elapsed = now - self.last_refill_time
        # Tokens to add, capped at the bucket capacity
        refill_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + refill_tokens)
        self.last_refill_time = now

    def consume(self, tokens: int = 1) -> bool:
        """
        Consume tokens without blocking.
        Args:
            tokens: number of tokens needed
        Returns:
            bool: whether the tokens were obtained
        """
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_for_tokens(self, tokens: int = 1, timeout: float = None) -> bool:
        """
        Wait for tokens (blocking).
        Args:
            tokens: number of tokens needed
            timeout: timeout in seconds (None waits indefinitely)
        Returns:
            bool: whether the tokens were obtained
        """
        if tokens > self.capacity:
            raise ValueError("requested more tokens than the bucket can hold")
        start_time = time.monotonic()
        while True:
            if self.consume(tokens):
                return True
            if timeout is not None and time.monotonic() - start_time > timeout:
                return False
            # Estimate how long the missing tokens take to accrue
            with self.lock:
                deficit = tokens - self.tokens
            wait_time = max(deficit, 0.0) / self.refill_rate
            time.sleep(min(wait_time, 0.1))  # sleep at most 0.1 s per poll


# Rate limiting with a token bucket
bucket = TokenBucket(capacity=10, refill_rate=2.0)  # capacity 10, 2 tokens/s


def rate_limited_api_call(messages: list):
    """Rate-limited API call."""
    # Wait for a token
    if not bucket.wait_for_tokens(tokens=1, timeout=5.0):
        raise TimeoutError("Timed out waiting for a token")
    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content


# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
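Providers usually enforce requests per minute (RPM) and tokens per minute (TPM) together, so a single permit bucket covers only half of the picture. The sketch below pairs two buckets and admits a call only when both budgets allow it. The class names, the explicit `now` argument (used to keep the example deterministic), and the limits of 60 RPM and 1000 TPM are placeholders for illustration, not real quotas.

```python
class SimpleBucket:
    """Token bucket driven by an explicit timestamp instead of a real clock."""

    def __init__(self, capacity, refill_rate, now=0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last = now

    def peek(self, now):
        """Tokens that would be available at time `now`."""
        accrued = (now - self.last) * self.refill_rate
        return min(self.capacity, self.tokens + accrued)

    def take(self, amount, now):
        self.tokens = self.peek(now) - amount
        self.last = now


class DualLimiter:
    """Admit a call only if both the RPM and the TPM budget allow it."""

    def __init__(self, rpm, tpm):
        # A per-minute limit refills at limit / 60 units per second.
        self.requests = SimpleBucket(rpm, rpm / 60.0)
        self.tokens = SimpleBucket(tpm, tpm / 60.0)

    def try_acquire(self, estimated_tokens, now):
        if (self.requests.peek(now) >= 1
                and self.tokens.peek(now) >= estimated_tokens):
            self.requests.take(1, now)
            self.tokens.take(estimated_tokens, now)
            return True
        return False


if __name__ == "__main__":
    limiter = DualLimiter(rpm=60, tpm=1000)
    print(limiter.try_acquire(600, now=0.0))   # → True  (400 tokens left)
    print(limiter.try_acquire(600, now=0.0))   # → False (token budget short)
    print(limiter.try_acquire(600, now=6.0))   # → False (about 500 available)
    print(limiter.try_acquire(600, now=15.0))  # → True  (about 650 available)
```

The token cost of a call is only known after the response arrives, so `estimated_tokens` has to be a guess, for example prompt tokens plus the `max_tokens` setting.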
2.2 Leaky Bucket Algorithm
python
# ===== Leaky bucket algorithm =====
import time
from queue import Empty, Full, Queue
from threading import Thread


class LeakyBucket:
    """Leaky bucket."""

    def __init__(self, capacity: int, leak_rate: float):
        """
        Initialize the leaky bucket.
        Args:
            capacity: bucket capacity
            leak_rate: leak rate (requests per second)
        """
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = Queue(maxsize=capacity)  # Queue is thread-safe on its own
        self.running = True
        # Start the leak thread
        self.leak_thread = Thread(target=self._leak, daemon=True)
        self.leak_thread.start()

    def _leak(self):
        """Drain queued requests at the leak rate."""
        interval = 1.0 / self.leak_rate
        while self.running:
            try:
                request = self.queue.get(timeout=interval)
            except Empty:
                continue
            # Process the request
            self._process_request(request)
            # Wait according to the leak rate
            time.sleep(interval)

    def _process_request(self, request):
        """Process a request."""
        # A real implementation would call the API here
        print(f"Processing request: {request}")

    def add_request(self, request) -> bool:
        """
        Add a request.
        Returns:
            bool: whether the request was accepted
        """
        try:
            self.queue.put(request, block=False)
            return True
        except Full:
            return False

    def stop(self):
        """Stop the leaky bucket."""
        self.running = False


# Rate limiting with a leaky bucket
leaky_bucket = LeakyBucket(capacity=10, leak_rate=2.0)  # capacity 10, 2 requests/s


def rate_limited_api_call_leaky(messages: list):
    """Rate-limited API call (leaky bucket)."""
    request_id = f"req_{time.time()}"
    if not leaky_bucket.add_request(request_id):
        raise Exception("Leaky bucket is full; request dropped")
    # The request is queued and waits to be processed
    print(f"Request {request_id} queued")
    # A real implementation would wait for the result;
    # this simplified version returns immediately.
    return f"Request {request_id} queued"


# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call_leaky(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: {result}")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
    time.sleep(5)  # wait for the queue to drain
    leaky_bucket.stop()
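The simplified version above returns as soon as a request is queued, so the caller never sees the result. One way to close that gap, sketched here under stated assumptions, is to hand back a `concurrent.futures.Future` that the drain thread completes. `FutureLeakyBucket` and its `handler` callback are hypothetical names; in practice the handler would wrap the actual API call.

```python
import queue
import threading
from concurrent.futures import Future


class FutureLeakyBucket:
    """Leaky bucket whose submit() returns a Future for the eventual result."""

    def __init__(self, capacity, leak_rate, handler):
        self._queue = queue.Queue(maxsize=capacity)
        self._interval = 1.0 / leak_rate
        self._handler = handler
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, payload):
        """Queue a payload; raises queue.Full when the bucket overflows."""
        future = Future()
        self._queue.put_nowait((payload, future))
        return future

    def _drain(self):
        while not self._stop.is_set():
            try:
                payload, future = self._queue.get(timeout=0.05)
            except queue.Empty:
                continue
            try:
                future.set_result(self._handler(payload))
            except Exception as exc:  # hand the error to the waiting caller
                future.set_exception(exc)
            self._stop.wait(self._interval)  # pace the outflow

    def stop(self):
        self._stop.set()
        self._worker.join(timeout=1.0)


if __name__ == "__main__":
    bucket = FutureLeakyBucket(capacity=5, leak_rate=200.0, handler=str.upper)
    futures = [bucket.submit(text) for text in ["a", "b", "c"]]
    print([f.result(timeout=2.0) for f in futures])  # → ['A', 'B', 'C']
    bucket.stop()
```

Callers block in `future.result()` only when they need the answer, and an exception raised by the handler resurfaces there instead of being lost in the worker thread.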
2.3 Sliding Window Rate Limiting
python
# ===== Sliding window rate limiting =====
import time
from collections import deque
from threading import Lock

from openai import OpenAI


class SlidingWindowRateLimiter:
    """Sliding window rate limiter."""

    def __init__(self, max_requests: int, window_size: float):
        """
        Initialize the sliding window rate limiter.
        Args:
            max_requests: maximum number of requests within the window
            window_size: window size in seconds
        """
        self.max_requests = max_requests
        self.window_size = window_size
        self.requests = deque()
        self.lock = Lock()

    def _evict_expired(self, now: float):
        """Drop timestamps that fell out of the window (caller holds the lock)."""
        while self.requests and self.requests[0] <= now - self.window_size:
            self.requests.popleft()

    def allow_request(self) -> bool:
        """
        Check whether a request is allowed.
        Returns:
            bool: whether the request is allowed
        """
        with self.lock:
            now = time.monotonic()
            # Remove expired requests
            self._evict_expired(now)
            # Check whether the limit is reached
            if len(self.requests) >= self.max_requests:
                return False
            # Record the new request
            self.requests.append(now)
            return True

    def get_wait_time(self) -> float:
        """
        Get the time to wait before the next request is allowed.
        Returns:
            float: wait time in seconds; 0 means no wait is needed
        """
        with self.lock:
            now = time.monotonic()
            self._evict_expired(now)
            # Below the limit: a request can go through right away
            if len(self.requests) < self.max_requests:
                return 0.0
            # Otherwise wait until the earliest request leaves the window
            earliest = self.requests[0]
            return max(0.0, (earliest + self.window_size) - now)


# Rate limiting with a sliding window
limiter = SlidingWindowRateLimiter(max_requests=10, window_size=60.0)  # at most 10 requests per 60 s


def rate_limited_api_call_sliding(messages: list):
    """Rate-limited API call (sliding window)."""
    if not limiter.allow_request():
        wait_time = limiter.get_wait_time()
        raise Exception(f"Rate limited; retry in {wait_time:.2f}s")
    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content


# Test
if __name__ == '__main__':
    for i in range(20):
        try:
            result = rate_limited_api_call_sliding(
                [{"role": "user", "content": f"Test {i}"}]
            )
            print(f"Request {i}: succeeded")
        except Exception as e:
            print(f"Request {i}: failed - {e}")
3. Caching Strategies
3.1 Result Cache
python
# ===== Result cache =====
import hashlib
import json
import time
from threading import Lock
from typing import Any, Dict, Optional

from openai import OpenAI


class ResultCache:
    """Result cache."""

    def __init__(self, ttl: int = 3600):
        """
        Initialize the cache.
        Args:
            ttl: cache time-to-live in seconds
        """
        self.cache: Dict[str, Dict[str, Any]] = {}
        self.ttl = ttl
        self.lock = Lock()

    def _generate_key(self, messages: list, model: str, **kwargs) -> str:
        """Generate the cache key from everything that affects the response."""
        content = {
            "messages": messages,
            "model": model,
            **kwargs,
        }
        content_str = json.dumps(content, sort_keys=True)
        return hashlib.sha256(content_str.encode()).hexdigest()

    def get(self, messages: list, model: str, **kwargs) -> Optional[str]:
        """
        Look up the cache.
        Returns:
            the cached result, or None if there is none
        """
        key = self._generate_key(messages, model, **kwargs)
        with self.lock:
            if key not in self.cache:
                return None
            entry = self.cache[key]
            # Check for expiry
            if time.time() - entry["timestamp"] > self.ttl:
                del self.cache[key]
                return None
            return entry["result"]

    def set(self, messages: list, model: str, result: str, **kwargs):
        """Store a result."""
        key = self._generate_key(messages, model, **kwargs)
        with self.lock:
            self.cache[key] = {
                "result": result,
                "timestamp": time.time(),
            }

    def clear(self):
        """Clear the cache."""
        with self.lock:
            self.cache.clear()

    def remove_expired(self):
        """Remove expired entries."""
        now = time.time()
        with self.lock:
            expired_keys = [
                k for k, v in self.cache.items()
                if now - v["timestamp"] > self.ttl
            ]
            for k in expired_keys:
                del self.cache[k]


# Using the result cache
cache = ResultCache(ttl=3600)  # cache for 1 hour


def cached_api_call(messages: list, model: str = "gpt-4o-mini"):
    """API call with caching."""
    # Check the cache ("is not None" so an empty string still counts as a hit)
    cached = cache.get(messages, model)
    if cached is not None:
        print("Cache hit")
        return cached
    # Call the API
    print("Calling API")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    result = response.choices[0].message.content
    # Write to the cache
    cache.set(messages, model, result)
    return result


# Test
if __name__ == '__main__':
    messages = [{"role": "user", "content": "Hello!"}]
    # First call (hits the API)
    result1 = cached_api_call(messages)
    print(f"Result 1: {result1[:50]}...")
    # Second call (served from the cache)
    result2 = cached_api_call(messages)
    print(f"Result 2: {result2[:50]}...")
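The cache above only sheds entries when they expire, so a long-running process with many distinct prompts can grow without bound. A minimal sketch of one remedy, assuming least-recently-used eviction is acceptable, combines the TTL with a size cap. `BoundedTTLCache` and its `max_entries` parameter are hypothetical and not part of the original class; locking is left out for brevity.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """TTL cache that also evicts the least recently used entry when full."""

    def __init__(self, max_entries=1000, ttl=3600.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (value, stored_at)
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._data[key]  # expired
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock())
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)


if __name__ == "__main__":
    cache = BoundedTTLCache(max_entries=2, ttl=60.0)
    cache.set("a", "answer A")
    cache.set("b", "answer B")
    cache.get("a")              # "a" becomes the most recently used
    cache.set("c", "answer C")  # evicts "b"
    print(cache.get("b"))       # → None
    print(cache.get("a"))       # → answer A
```

The key passed in would be the hash produced by a function like `_generate_key`; wrapping `get` and `set` in a lock makes the class safe to share between threads.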
3.2 Semantic Cache
python
# ===== Semantic cache =====
import time
from threading import Lock
from typing import Any, Dict, List, Optional

import numpy as np
from openai import OpenAI


class SemanticCache:
    """Semantic cache (based on vector similarity)."""

    def __init__(self, similarity_threshold: float = 0.95, ttl: int = 3600):
        """
        Initialize the semantic cache.
        Args:
            similarity_threshold: similarity threshold (0-1)
            ttl: cache time-to-live in seconds
        """
        self.cache: List[Dict[str, Any]] = []
        self.similarity_threshold = similarity_threshold
        self.ttl = ttl
        self.lock = Lock()

    def _get_embedding(self, text: str) -> List[float]:
        """
        Get a vector for the text (simplified placeholder).
        In practice, use an embeddings API such as OpenAI's or Azure's.
        """
        # Simplification: character frequencies stand in for an embedding,
        # so this measures spelling overlap rather than meaning.
        vector = [0.0] * 128
        for char in text:
            idx = ord(char) % 128
            vector[idx] += 1.0
        # Normalize
        norm = np.linalg.norm(vector)
        if norm > 0:
            vector = [v / norm for v in vector]
        return vector

    def _cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """Compute the cosine similarity."""
        a, b = np.asarray(vec1), np.asarray(vec2)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    def get(self, messages: list) -> Optional[str]:
        """
        Look up the cache (by semantic similarity).
        Returns:
            the cached result, or None if there is none
        """
        # Extract the user messages
        user_messages = [m["content"] for m in messages if m["role"] == "user"]
        query = " ".join(user_messages)
        # Embed the query
        query_embedding = self._get_embedding(query)
        with self.lock:
            # Drop expired entries
            now = time.time()
            self.cache = [
                c for c in self.cache
                if now - c["timestamp"] < self.ttl
            ]
            # Look for a similar entry (linear scan)
            for entry in self.cache:
                similarity = self._cosine_similarity(query_embedding, entry["embedding"])
                if similarity >= self.similarity_threshold:
                    print(f"Semantic cache hit (similarity: {similarity:.4f})")
                    return entry["result"]
        return None

    def set(self, messages: list, result: str):
        """Store a result."""
        # Extract the user messages
        user_messages = [m["content"] for m in messages if m["role"] == "user"]
        query = " ".join(user_messages)
        # Embed the query
        embedding = self._get_embedding(query)
        with self.lock:
            self.cache.append({
                "query": query,
                "embedding": embedding,
                "result": result,
                "timestamp": time.time(),
            })

    def clear(self):
        """Clear the cache."""
        with self.lock:
            self.cache.clear()


# Using the semantic cache
semantic_cache = SemanticCache(similarity_threshold=0.95, ttl=3600)


def semantic_cached_api_call(messages: list):
    """API call with semantic caching."""
    # Check the cache
    cached = semantic_cache.get(messages)
    if cached is not None:
        return cached
    # Call the API
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    result = response.choices[0].message.content
    # Write to the cache
    semantic_cache.set(messages, result)
    return result


# Test
if __name__ == '__main__':
    # Paraphrased questions. A real embedding model would be expected to score
    # these close together; the character-frequency placeholder stays below the
    # 0.95 threshold for this pair, so the second call still reaches the API.
    messages1 = [{"role": "user", "content": "What is machine learning?"}]
    messages2 = [{"role": "user", "content": "Please explain machine learning"}]
    result1 = semantic_cached_api_call(messages1)
    print(f"Result 1: {result1[:50]}...")
    result2 = semantic_cached_api_call(messages2)
    print(f"Result 2: {result2[:50]}...")
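It helps to see what a character-frequency placeholder actually measures before relying on a threshold such as 0.95. The sketch below scores a paraphrase pair and an anagram pair with plain character counts, which for ASCII text matches the `ord(char) % 128` bucketing because no two ASCII characters share a bucket. The function name is hypothetical.

```python
import math
from collections import Counter


def char_frequency_similarity(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    counts_a, counts_b = Counter(a), Counter(b)
    dot = sum(counts_a[ch] * counts_b[ch] for ch in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    paraphrase = char_frequency_similarity(
        "What is machine learning?", "Please explain machine learning"
    )
    reversed_meaning = char_frequency_similarity(
        "dog bites man", "man bites dog"
    )
    print(f"paraphrase:       {paraphrase:.3f}")        # about 0.883
    print(f"reversed meaning: {reversed_meaning:.3f}")  # 1.000
```

The paraphrase pair misses a 0.95 threshold while two sentences with opposite meanings score a perfect match, which is why a real embedding model is needed before a semantic cache can be trusted.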
3.3 Distributed Cache (Redis)
python
# ===== Distributed cache (Redis) =====
import hashlib
import json
from typing import Optional

import redis
from openai import OpenAI


class RedisCache:
    """Redis cache."""

    def __init__(self, redis_url: str = "redis://localhost:6379", ttl: int = 3600):
        """
        Initialize the Redis cache.
        Args:
            redis_url: Redis connection URL
            ttl: cache time-to-live in seconds
        """
        self.redis = redis.from_url(redis_url)
        self.ttl = ttl

    def _generate_key(self, messages: list, model: str, **kwargs) -> str:
        """Generate the cache key."""
        content = {
            "messages": messages,
            "model": model,
            **kwargs,
        }
        content_str = json.dumps(content, sort_keys=True)
        return f"ai_cache:{hashlib.sha256(content_str.encode()).hexdigest()}"

    def get(self, messages: list, model: str, **kwargs) -> Optional[str]:
        """Look up the cache."""
        key = self._generate_key(messages, model, **kwargs)
        cached = self.redis.get(key)
        if cached is not None:
            return json.loads(cached)
        return None

    def set(self, messages: list, model: str, result: str, **kwargs):
        """Store a result; Redis expires it after ttl seconds."""
        key = self._generate_key(messages, model, **kwargs)
        self.redis.setex(
            key,
            self.ttl,
            json.dumps(result),
        )

    def clear(self):
        """Clear the cache."""
        # SCAN iterates incrementally; KEYS would block the server on large datasets.
        keys = list(self.redis.scan_iter(match="ai_cache:*"))
        if keys:
            self.redis.delete(*keys)


# Using the Redis cache
redis_cache = RedisCache(redis_url="redis://localhost:6379", ttl=3600)


def redis_cached_api_call(messages: list, model: str = "gpt-4o-mini"):
    """API call with Redis caching."""
    # Check the cache
    cached = redis_cache.get(messages, model)
    if cached is not None:
        print("Redis cache hit")
        return cached
    # Call the API
    print("Calling API")
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    result = response.choices[0].message.content
    # Write to the cache
    redis_cache.set(messages, model, result)
    return result
4. Cost Optimization
4.1 Model Selection Strategy
python
# ===== Model selection strategy =====
from dataclasses import dataclass
from enum import Enum


class TaskComplexity(Enum):
    """Task complexity."""
    LOW = "low"        # simple tasks
    MEDIUM = "medium"  # medium tasks
    HIGH = "high"      # complex tasks


@dataclass
class ModelConfig:
    """Model configuration."""
    name: str
    input_price: float   # $/1M tokens
    output_price: float  # $/1M tokens
    speed: float         # tokens/second (approx)
    capability: float    # 0-1, capability score


# Model configuration. Prices, speeds, and capability scores are the
# article's illustrative figures; providers change pricing often, so check
# the current price list before relying on them.
MODELS = {
    "gpt-4o": ModelConfig(
        name="gpt-4o",
        input_price=5.00,
        output_price=15.00,
        speed=60.0,
        capability=0.95,
    ),
    "gpt-4o-mini": ModelConfig(
        name="gpt-4o-mini",
        input_price=0.15,
        output_price=0.60,
        speed=100.0,
        capability=0.80,
    ),
    "gpt-3.5-turbo": ModelConfig(
        name="gpt-3.5-turbo",
        input_price=0.50,
        output_price=1.50,
        speed=150.0,
        capability=0.70,
    ),
}


def select_model(
    task_complexity: TaskComplexity,
    budget_sensitive: bool = False,
    speed_sensitive: bool = False,
) -> str:
    """
    Select a model.
    Args:
        task_complexity: task complexity
        budget_sensitive: whether cost is the main concern
        speed_sensitive: whether speed is the main concern
    Returns:
        str: model name
    """
    if budget_sensitive:
        return "gpt-4o-mini"  # cheapest entry in MODELS
    if speed_sensitive:
        return "gpt-3.5-turbo"  # fastest entry in MODELS
    if task_complexity == TaskComplexity.HIGH:
        return "gpt-4o"
    # In the table above gpt-4o-mini is both cheaper and more capable than
    # gpt-3.5-turbo, so it covers medium and simple tasks alike.
    return "gpt-4o-mini"


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> float:
    """
    Estimate the cost.
    Args:
        model: model name
        input_tokens: number of input tokens
        output_tokens: number of output tokens
    Returns:
        float: cost in USD
    """
    if model not in MODELS:
        raise ValueError(f"Unknown model: {model}")
    config = MODELS[model]
    input_cost = (input_tokens / 1_000_000) * config.input_price
    output_cost = (output_tokens / 1_000_000) * config.output_price
    return input_cost + output_cost


# Test
if __name__ == '__main__':
    # Select a model
    model = select_model(
        task_complexity=TaskComplexity.MEDIUM,
        budget_sensitive=True,
    )
    print(f"Recommended model: {model}")
    # Estimate the cost (six decimals, since a single call costs well under a cent)
    cost = estimate_cost(
        model="gpt-4o-mini",
        input_tokens=1000,
        output_tokens=500,
    )
    print(f"Estimated cost: ${cost:.6f}")
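Per-call costs look negligible until they are multiplied by traffic. The sketch below reuses the article's per-million-token prices to project a month of usage for each model. The workload (10,000 requests per day, 1,000 input and 500 output tokens each) and the `monthly_cost` helper are assumptions chosen for illustration, and the prices are the snapshot quoted in the table above rather than current list prices.

```python
# (input $/1M tokens, output $/1M tokens), copied from the MODELS table above.
PRICES_PER_MILLION = {
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-3.5-turbo": (0.50, 1.50),
}


def monthly_cost(model, requests_per_day, input_tokens, output_tokens, days=30):
    """Projected spend in USD for a steady workload."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total_input = requests_per_day * input_tokens * days
    total_output = requests_per_day * output_tokens * days
    return (total_input / 1_000_000) * input_price \
        + (total_output / 1_000_000) * output_price


if __name__ == "__main__":
    for name in PRICES_PER_MILLION:
        cost = monthly_cost(name, requests_per_day=10_000,
                            input_tokens=1_000, output_tokens=500)
        print(f"{name:15s} ${cost:,.2f}")
    # gpt-4o          $3,750.00
    # gpt-4o-mini     $135.00
    # gpt-3.5-turbo   $375.00
```

At these figures the same workload costs roughly 28 times more on the largest model than on the smallest, which is the arithmetic behind routing only hard tasks to it.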
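4.2 Prompt Optimization
python
# ===== Prompt optimization =====
import tiktoken
from openai import OpenAI

client = OpenAI()


def _get_encoding(model: str):
    """Pick the tokenizer for the model, falling back to cl100k_base."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens (an estimate; chat messages add a few tokens of overhead)."""
    return len(_get_encoding(model).encode(text))


def optimize_prompt(prompt: str, max_tokens: int = 1000, model: str = "gpt-4o-mini") -> str:
    """
    Prompt optimization (compression).
    Args:
        prompt: original prompt
        max_tokens: maximum number of tokens
        model: model whose tokenizer is used for counting
    Returns:
        str: optimized prompt
    """
    # 1. Collapse redundant whitespace (this also flattens line breaks)
    optimized = " ".join(prompt.split())
    # 2. Check the token count
    encoding = _get_encoding(model)
    token_ids = encoding.encode(optimized)
    if len(token_ids) <= max_tokens:
        return optimized
    # 3. Truncate (keep the first max_tokens tokens; the tail is lost)
    return encoding.decode(token_ids[:max_tokens])


def use_few_shot_examples():
    """Use a few examples (instead of long explanations)."""
    prompt = """Classify the sentiment of the following text (positive/negative/neutral):
Example 1: "This product is fantastic!" -> positive
Example 2: "Terrible experience, would not recommend." -> negative
Example 3: "This product is okay, I guess." -> neutral
Now classify: "Great value for money, recommended."
Output only the label, with no explanation."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10,  # limit the output
    )
    return response.choices[0].message.content


def use_system_prompt():
    """Use a system prompt (to avoid repeating instructions)."""
    messages = [
        {"role": "system", "content": "You are a sentiment analysis assistant. Output only: positive/negative/neutral"},
        {"role": "user", "content": "This product is fantastic!"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Terrible experience, would not recommend."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Great value for money, recommended."},
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=10,
    )
    return response.choices[0].message.content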
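In multi-turn conversations the prompt grows with every turn, and truncating it from the end would cut off the newest messages. A minimal sketch of an alternative, assuming the oldest turns matter least, keeps the system messages and as many of the most recent turns as fit a token budget. `trim_history` and `rough_token_count` are hypothetical helpers; the default counter is a crude characters-per-token guess that a real tokenizer should replace.

```python
def rough_token_count(text):
    """Crude stand-in for a tokenizer (assumes about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(messages, budget, count=rough_token_count):
    """Keep system messages plus the newest turns that fit within `budget`."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count(m["content"]) for m in system)
    kept = []
    # Walk backwards so the most recent turns are kept first.
    for message in reversed(rest):
        cost = count(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "be brief"},
        {"role": "user", "content": "one two three"},
        {"role": "assistant", "content": "four five"},
        {"role": "user", "content": "six"},
    ]
    # Count words instead of tokens so the arithmetic is easy to follow.
    trimmed = trim_history(history, budget=5, count=lambda t: len(t.split()))
    print([m["content"] for m in trimmed])
    # → ['be brief', 'four five', 'six']
```

Dropping whole messages keeps each remaining turn intact, unlike token-level truncation, though it may leave an assistant reply without the question that prompted it.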
4.3 Using the Batch API
python
# ===== Using the Batch API =====
import json
import os
import tempfile
import time

from openai import OpenAI

client = OpenAI()


def create_batch():
    """Create a batch job."""
    requests = [
        {
            "custom_id": "request-1",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "1+1=?"}],
                "max_tokens": 1024,
            },
        },
        {
            "custom_id": "request-2",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": "2+2=?"}],
                "max_tokens": 1024,
            },
        },
    ]
    batch = client.batches.create(
        input_file_id=upload_requests(requests),
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(f"Batch ID: {batch.id}")
    print(f"Status: {batch.status}")
    return batch.id


def upload_requests(requests: list) -> str:
    """Upload the request file."""
    # Write the requests to a temporary JSONL file (one request per line)
    with tempfile.NamedTemporaryFile(
        mode='w', delete=False, suffix='.jsonl', encoding='utf-8'
    ) as f:
        for req in requests:
            f.write(json.dumps(req, ensure_ascii=False) + '\n')
        temp_path = f.name
    # Upload the file, then remove the temporary copy
    try:
        with open(temp_path, 'rb') as f:
            file = client.files.create(
                file=f,
                purpose="batch",
            )
    finally:
        os.remove(temp_path)
    return file.id


def check_batch_status(batch_id: str):
    """Query the batch job status."""
    batch = client.batches.retrieve(batch_id)
    print(f"Status: {batch.status}")
    counts = batch.request_counts
    if counts:  # may not be populated yet
        print(f"Total requests: {counts.total}")
        print(f"Completed: {counts.completed}")
        print(f"Failed: {counts.failed}")
    return batch.status


def wait_for_batch(batch_id: str, check_interval: int = 60) -> str:
    """Wait for the batch job to finish and return its final status."""
    while True:
        status = check_batch_status(batch_id)
        if status == "completed":
            print("Batch job completed!")
            return status
        elif status in ["failed", "cancelled", "expired"]:
            print(f"Batch job {status}")
            return status
        print(f"Checking again in {check_interval} seconds...")
        time.sleep(check_interval)


def get_batch_results(batch_id: str):
    """Fetch the batch job results."""
    batch = client.batches.retrieve(batch_id)
    # Download the result file
    result_file_id = batch.output_file_id
    if not result_file_id:
        print("No result file")
        return []
    result_content = client.files.content(result_file_id)
    # Parse the results (JSONL: one JSON object per line)
    results = []
    for line in result_content.text.splitlines():
        if line.strip():
            results.append(json.loads(line))
    return results


# Full workflow
def batch_processing_example():
    """Batch processing example."""
    # 1. Create the batch job
    batch_id = create_batch()
    # 2. Wait for completion
    if wait_for_batch(batch_id) != "completed":
        return
    # 3. Fetch the results
    results = get_batch_results(batch_id)
    # 4. Process the results (match them by custom_id, not by position)
    for r in results:
        custom_id = r["custom_id"]
        if r.get("error") or not r.get("response"):
            print(f"{custom_id}: failed - {r.get('error')}")
            continue
        content = r["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{custom_id}: {content}")


# Cost comparison
def compare_cost_normal_vs_batch(num_requests: int):
    """Compare the cost of the regular API and the Batch API."""
    # estimate_cost is the helper defined in section 4.1.
    # Assume an average of 1000 input tokens + 500 output tokens per request
    input_tokens = 1000 * num_requests
    output_tokens = 500 * num_requests
    # Regular API cost
    normal_cost = estimate_cost("gpt-4o-mini", input_tokens, output_tokens)
    # Batch API cost (50% cheaper)
    batch_cost = normal_cost * 0.5
    print(f"Regular API cost: ${normal_cost:.2f}")
    print(f"Batch API cost: ${batch_cost:.2f}")
    print(f"Savings: ${normal_cost - batch_cost:.2f}")
    return batch_cost
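The file handling in a batch workflow can be exercised without touching the API: build the JSONL input from a list of prompts, then turn an output file back into a lookup keyed by `custom_id`. The sketch below does both. The helper names are hypothetical, and the `error` and `status_code` fields are assumptions about the output format that should be checked against the Batch API documentation; the rest mirrors the record shape used in the example above.

```python
import json


def build_batch_lines(prompts, model="gpt-4o-mini", max_tokens=256):
    """Serialize prompts as JSONL, one chat completion request per line."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        lines.append(json.dumps({
            "custom_id": f"request-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def parse_batch_output(text):
    """Map custom_id to ("ok", content) or ("error", detail)."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            detail = record.get("error") or response.get("body")
            results[record["custom_id"]] = ("error", detail)
        else:
            content = response["body"]["choices"][0]["message"]["content"]
            results[record["custom_id"]] = ("ok", content)
    return results


if __name__ == "__main__":
    print(build_batch_lines(["1+1=?", "2+2=?"]), end="")
    sample_output = "\n".join([
        json.dumps({"custom_id": "request-1", "error": None, "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"content": "2"}}]}}}),
        json.dumps({"custom_id": "request-2", "response": None,
                    "error": {"message": "request expired"}}),
    ])
    print(parse_batch_output(sample_output))
```

Keying results by `custom_id` matters because the output order is not guaranteed to match the input order, and a batch can complete with some requests failed.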
5. Monitoring and Alerting
5.1 Token Statistics
python
# ===== Token statistics =====
import time
from collections import defaultdict
from dataclasses import dataclass
from threading import Lock
from typing import Dict, List

from openai import OpenAI

# estimate_cost is the helper defined in section 4.1.


@dataclass
class TokenUsage:
    """Token usage."""
    input_tokens: int
    output_tokens: int
    timestamp: float
    model: str
    user_id: str = "unknown"


class TokenTracker:
    """Token tracker."""

    def __init__(self):
        self.usage_history: List[TokenUsage] = []  # grows unbounded; prune in production
        self.user_usage: Dict[str, int] = defaultdict(int)
        self.model_usage: Dict[str, int] = defaultdict(int)
        self.lock = Lock()

    def track(self, usage: TokenUsage):
        """Record usage."""
        with self.lock:
            self.usage_history.append(usage)
            self.user_usage[usage.user_id] += usage.input_tokens + usage.output_tokens
            self.model_usage[usage.model] += usage.input_tokens + usage.output_tokens

    def get_user_usage(self, user_id: str, time_window: float = 3600) -> int:
        """
        Get a user's usage.
        Args:
            user_id: user ID
            time_window: time window in seconds
        Returns:
            int: tokens used
        """
        now = time.time()
        total = 0
        with self.lock:
            for usage in self.usage_history:
                if usage.user_id == user_id and now - usage.timestamp <= time_window:
                    total += usage.input_tokens + usage.output_tokens
        return total

    def get_model_usage(self, model: str, time_window: float = 3600) -> int:
        """Get a model's usage."""
        now = time.time()
        total = 0
        with self.lock:
            for usage in self.usage_history:
                if usage.model == model and now - usage.timestamp <= time_window:
                    total += usage.input_tokens + usage.output_tokens
        return total

    def get_top_users(self, limit: int = 10) -> List[tuple]:
        """Get the users with the highest all-time usage."""
        with self.lock:
            items = list(self.user_usage.items())
        return sorted(items, key=lambda x: x[1], reverse=True)[:limit]

    def get_report(self, time_window: float = 3600) -> Dict:
        """Generate a report (totals are windowed; user/model lists are all-time)."""
        now = time.time()
        total_tokens = 0
        total_cost = 0.0
        with self.lock:
            for usage in self.usage_history:
                if now - usage.timestamp <= time_window:
                    total_tokens += usage.input_tokens + usage.output_tokens
                    total_cost += estimate_cost(
                        usage.model,
                        usage.input_tokens,
                        usage.output_tokens,
                    )
            unique_users = len(self.user_usage)
            models_used = list(self.model_usage.keys())
        # Called after the lock is released, because get_top_users locks too
        return {
            "total_tokens": total_tokens,
            "total_cost_usd": total_cost,
            "unique_users": unique_users,
            "models_used": models_used,
            "top_users": self.get_top_users(10),
        }


# Using the token tracker
tracker = TokenTracker()


def tracked_api_call(messages: list, model: str = "gpt-4o-mini", user_id: str = "unknown"):
    """API call with token tracking."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    # Track usage
    usage = TokenUsage(
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        timestamp=time.time(),
        model=model,
        user_id=user_id,
    )
    tracker.track(usage)
    return response.choices[0].message.content


# Test
if __name__ == '__main__':
    result = tracked_api_call(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt-4o-mini",
        user_id="user_001",
    )
    # Get the report
    report = tracker.get_report()
    print(f"Report: {report}")
5.2 成本追踪
python
# ===== 成本追踪 =====
from dataclasses import dataclass
from typing import Dict, List
import time
@dataclass
class CostRecord:
"""成本记录"""
timestamp: float
model: str
input_tokens: int
output_tokens: int
cost_usd: float
user_id: str
request_id: str
class CostTracker:
    """Cost tracker."""

    def __init__(self):
        self.records: List[CostRecord] = []
        self.lock = Lock()

    def track_cost(self, record: CostRecord):
        """Store one cost record."""
        with self.lock:
            self.records.append(record)

    def get_total_cost(self, time_window: float = 86400) -> float:
        """
        Return the total cost inside the time window.

        Args:
            time_window: window length in seconds (default: 1 day)
        Returns:
            float: total cost in USD
        """
        now = time.time()
        total = 0.0
        with self.lock:
            for record in self.records:
                if now - record.timestamp <= time_window:
                    total += record.cost_usd
        return total

    def get_user_cost(self, user_id: str, time_window: float = 86400) -> float:
        """Return the cost attributed to one user inside the time window."""
        now = time.time()
        total = 0.0
        with self.lock:
            for record in self.records:
                if record.user_id == user_id and now - record.timestamp <= time_window:
                    total += record.cost_usd
        return total

    def get_model_cost(self, model: str, time_window: float = 86400) -> float:
        """Return the cost attributed to one model inside the time window."""
        now = time.time()
        total = 0.0
        with self.lock:
            for record in self.records:
                if record.model == model and now - record.timestamp <= time_window:
                    total += record.cost_usd
        return total

    def get_daily_cost_breakdown(self, days: int = 7) -> Dict[str, float]:
        """Return the cost per local calendar day for the last `days` days."""
        now = time.time()
        daily_costs = {}
        with self.lock:
            for record in self.records:
                if now - record.timestamp <= days * 86400:
                    day = time.strftime("%Y-%m-%d", time.localtime(record.timestamp))
                    daily_costs[day] = daily_costs.get(day, 0.0) + record.cost_usd
        return daily_costs

    def check_budget(self, budget: float, time_window: float = 86400) -> bool:
        """
        Check spending against a budget.

        Args:
            budget: budget in USD
            time_window: window length in seconds
        Returns:
            bool: True if spending in the window exceeds the budget
        """
        total_cost = self.get_total_cost(time_window)
        return total_cost > budget
# Using the cost tracker
cost_tracker = CostTracker()


def tracked_api_call_with_cost(messages: list, model: str = "gpt-4o-mini", user_id: str = "unknown"):
    """Call the API and record the cost of the request."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages
    )
    # Compute the cost
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    cost = estimate_cost(model, input_tokens, output_tokens)
    # Record the cost
    record = CostRecord(
        timestamp=time.time(),
        model=model,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        cost_usd=cost,
        user_id=user_id,
        # Use the completion ID returned by the API; a second-resolution
        # timestamp would collide for requests made within the same second.
        request_id=response.id
    )
    cost_tracker.track_cost(record)
    return response.choices[0].message.content


# Test
if __name__ == '__main__':
    result = tracked_api_call_with_cost(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt-4o-mini",
        user_id="user_001"
    )
    # Fetch the cost report
    total_cost = cost_tracker.get_total_cost()
    print(f"Total cost: ${total_cost:.4f}")
    daily_breakdown = cost_tracker.get_daily_cost_breakdown()
    print(f"Daily cost: {daily_breakdown}")
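A budget check that only answers "over or not" leaves no room to react before the limit is hit. The sketch below shows one possible tiered check that could sit on top of a spending total such as the one returned by `get_total_cost()`. The names `BudgetStatus` and `budget_status`, and the 80% warning ratio, are illustrative assumptions.

```python
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    WARNING = "warning"
    EXCEEDED = "exceeded"


def budget_status(spent_usd: float, budget_usd: float,
                  warn_ratio: float = 0.8) -> BudgetStatus:
    """Classify spending against a budget.

    warn_ratio is the fraction of the budget at which a warning starts
    (0.8 is an arbitrary example value, not a recommendation).
    """
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    if spent_usd > budget_usd:
        return BudgetStatus.EXCEEDED
    if spent_usd >= budget_usd * warn_ratio:
        return BudgetStatus.WARNING
    return BudgetStatus.OK


if __name__ == "__main__":
    # In the article's setup, spent_usd could come from cost_tracker.get_total_cost().
    for spent in (1.0, 8.5, 10.5):
        print(spent, budget_status(spent, budget_usd=10.0).value)
```

A caller might log on `WARNING` and reject or downgrade requests on `EXCEEDED`; which reaction is appropriate depends on the application.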
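5.3 Performance Monitoring
python
# ===== Performance monitoring =====
import time
from collections import defaultdict
from dataclasses import dataclass
from threading import Lock
from typing import Dict, List, Optional

from openai import OpenAI


@dataclass
class PerformanceMetric:
    """One latency and outcome sample for a single API call."""
    timestamp: float
    latency_ms: float
    model: str
    success: bool
    error_type: Optional[str] = None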
class PerformanceMonitor:
    """Performance monitor."""

    def __init__(self):
        self.metrics: List[PerformanceMetric] = []
        self.model_latency: Dict[str, List[float]] = defaultdict(list)
        self.lock = Lock()

    def record(self, metric: PerformanceMetric):
        """Store one performance sample."""
        with self.lock:
            self.metrics.append(metric)
            self.model_latency[metric.model].append(metric.latency_ms)

    def get_average_latency(self, model: str = None, time_window: float = 3600) -> float:
        """
        Return the average latency.

        Args:
            model: model name (None means all models)
            time_window: window length in seconds
        Returns:
            float: average latency in milliseconds (0.0 if there are no samples)
        """
        now = time.time()
        latencies = []
        with self.lock:
            for metric in self.metrics:
                if now - metric.timestamp <= time_window:
                    if model is None or metric.model == model:
                        latencies.append(metric.latency_ms)
        return sum(latencies) / len(latencies) if latencies else 0.0

    def get_success_rate(self, model: str = None, time_window: float = 3600) -> float:
        """Return the fraction of successful calls (0.0 if there are no samples)."""
        now = time.time()
        total = 0
        success = 0
        with self.lock:
            for metric in self.metrics:
                if now - metric.timestamp <= time_window:
                    if model is None or metric.model == model:
                        total += 1
                        if metric.success:
                            success += 1
        return success / total if total > 0 else 0.0

    def get_error_distribution(self, time_window: float = 3600) -> Dict[str, int]:
        """Return the number of failures per error type."""
        now = time.time()
        errors = defaultdict(int)
        with self.lock:
            for metric in self.metrics:
                if not metric.success and now - metric.timestamp <= time_window:
                    errors[metric.error_type] += 1
        return dict(errors)

    def get_percentile_latency(self, percentile: float, model: str = None) -> float:
        """
        Return a latency percentile over all recorded samples.

        Args:
            percentile: percentile as a fraction (0-1)
            model: model name (None means all models)
        Returns:
            float: latency in milliseconds (0.0 if there are no samples)
        """
        latencies = []
        with self.lock:
            for metric in self.metrics:
                if model is None or metric.model == model:
                    latencies.append(metric.latency_ms)
        if not latencies:
            return 0.0
        latencies.sort()
        # Clamp the index so that percentile=1.0 does not run past the end.
        idx = min(int(len(latencies) * percentile), len(latencies) - 1)
        return latencies[idx]
# Using the performance monitor
monitor = PerformanceMonitor()


def monitored_api_call(messages: list, model: str = "gpt-4o-mini"):
    """Call the API and record latency and outcome."""
    start_time = time.time()
    success = True
    error_type = None
    result = None
    try:
        client = OpenAI()
        response = client.chat.completions.create(
            model=model,
            messages=messages
        )
        result = response.choices[0].message.content
    except Exception as e:
        # The error is recorded in the metric and the function returns None.
        success = False
        error_type = type(e).__name__
    finally:
        end_time = time.time()
        latency_ms = (end_time - start_time) * 1000
        metric = PerformanceMetric(
            timestamp=start_time,
            latency_ms=latency_ms,
            model=model,
            success=success,
            error_type=error_type
        )
        monitor.record(metric)
    return result


# Test
if __name__ == '__main__':
    result = monitored_api_call(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt-4o-mini"
    )
    # Fetch the performance report
    avg_latency = monitor.get_average_latency()
    success_rate = monitor.get_success_rate()
    print(f"Average latency: {avg_latency:.2f} ms")
    print(f"Success rate: {success_rate:.2%}")
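Averages hide tail latency, so a report usually pairs them with a few percentiles. The sketch below uses the nearest-rank definition, which always returns one of the observed samples and handles a percentile of 1.0 without indexing past the end of the list. The function names `nearest_rank_percentile` and `latency_summary` are hypothetical helpers, not part of the monitor above.

```python
import math
from typing import Dict, List


def nearest_rank_percentile(values: List[float], percentile: float) -> float:
    """Nearest-rank percentile; `percentile` is a fraction in (0, 1]."""
    if not 0.0 < percentile <= 1.0:
        raise ValueError("percentile must be in (0, 1]")
    if not values:
        return 0.0
    ordered = sorted(values)
    # 1-based rank; rounding first guards against float noise such as 28.999999.
    rank = math.ceil(round(percentile * len(ordered), 9))
    return ordered[rank - 1]


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Return p50, p95 and p99 for a list of latencies in milliseconds."""
    return {
        "p50": nearest_rank_percentile(latencies_ms, 0.50),
        "p95": nearest_rank_percentile(latencies_ms, 0.95),
        "p99": nearest_rank_percentile(latencies_ms, 0.99),
    }


if __name__ == "__main__":
    samples = [120.0, 95.0, 110.0, 400.0, 105.0, 98.0, 101.0, 2500.0, 99.0, 102.0]
    print(latency_summary(samples))
```

With few samples the high percentiles collapse onto the maximum, so p99 only becomes informative once the window holds at least a few hundred requests.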
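6. Production Case Study
6.1 Case: An Optimized AI API Call System
python
# ===== Case: an optimized AI API call system =====
# ResultCache, TokenTracker, TokenUsage, PerformanceMonitor, PerformanceMetric,
# SlidingWindowRateLimiter and exponential_backoff_retry come from the earlier sections.
from openai import OpenAI, RateLimitError, APIConnectionError, InternalServerError
import time
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class APIConfig:
    """API configuration."""
    model: str = "gpt-4o-mini"
    max_retries: int = 3
    timeout: float = 30.0  # seconds
    cache_ttl: int = 3600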
class OptimizedAIService:
    """Optimized AI service."""

    def __init__(self, config: Optional[APIConfig] = None):
        self.config = config or APIConfig()
        self.client = OpenAI(timeout=self.config.timeout)
        self.cache = ResultCache(ttl=self.config.cache_ttl)
        self.tracker = TokenTracker()
        self.monitor = PerformanceMonitor()
        self.limiter = SlidingWindowRateLimiter(max_requests=50, window_size=60.0)

    # The decorator argument is fixed when the class is defined,
    # so it does not pick up config.max_retries.
    @exponential_backoff_retry(max_retries=3)
    def chat(self, messages: list, user_id: str = "unknown") -> str:
        """
        Optimized chat API.

        Features:
        1. Caching
        2. Retries
        3. Rate limiting
        4. Token tracking
        5. Performance monitoring
        """
        # 1. Check the cache
        cached = self.cache.get(messages, self.config.model)
        if cached:
            print("Cache hit")
            return cached
        # 2. Check the local rate limiter
        if not self.limiter.allow_request():
            raise RuntimeError("Rate limit reached, please retry later")
        # 3. Record the start time
        start_time = time.time()
        success = True
        error_type = None
        try:
            # 4. Call the API
            response = self.client.chat.completions.create(
                model=self.config.model,
                messages=messages,
                max_tokens=1024
            )
            result = response.choices[0].message.content
            # 5. Write to the cache
            self.cache.set(messages, self.config.model, result)
            # 6. Track tokens
            usage = TokenUsage(
                input_tokens=response.usage.prompt_tokens,
                output_tokens=response.usage.completion_tokens,
                timestamp=time.time(),
                model=self.config.model,
                user_id=user_id
            )
            self.tracker.track(usage)
            return result
        except Exception as e:
            success = False
            error_type = type(e).__name__
            raise
        finally:
            # 7. Record performance (runs on success and on failure)
            end_time = time.time()
            latency_ms = (end_time - start_time) * 1000
            metric = PerformanceMetric(
                timestamp=start_time,
                latency_ms=latency_ms,
                model=self.config.model,
                success=success,
                error_type=error_type
            )
            self.monitor.record(metric)

    def get_stats(self) -> dict:
        """Return usage and performance statistics."""
        return {
            "token_usage": self.tracker.get_report(),
            "performance": {
                "avg_latency": self.monitor.get_average_latency(),
                "success_rate": self.monitor.get_success_rate()
            }
        }
# Using the optimized AI service
service = OptimizedAIService()

if __name__ == '__main__':
    # First call
    result1 = service.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        user_id="user_001"
    )
    print(f"Result 1: {result1[:50]}...")
    # Second call (served from the cache)
    result2 = service.chat(
        messages=[{"role": "user", "content": "Hello!"}],
        user_id="user_001"
    )
    print(f"Result 2: {result2[:50]}...")
    # Fetch statistics
    stats = service.get_stats()
    print(f"Stats: {stats}")
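In the service above, a request that the local limiter rejects fails immediately. One alternative is to wait briefly for a slot before giving up. The sketch below polls any `allow_request`-style callable until it succeeds or a timeout expires. The helper name `wait_for_slot` and its parameters are hypothetical, and it assumes the limiter exposes a non-blocking boolean check like the `allow_request()` call used above.

```python
import time
from typing import Callable


def wait_for_slot(allow_request: Callable[[], bool],
                  timeout: float = 5.0,
                  poll_interval: float = 0.1,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `allow_request` until it returns True or `timeout` seconds pass.

    Returns True if a slot was obtained, False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if allow_request():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))


if __name__ == "__main__":
    # Stand-in limiter for the demo: rejects the first two checks, then allows.
    attempts = {"n": 0}

    def fake_allow() -> bool:
        attempts["n"] += 1
        return attempts["n"] >= 3

    print(wait_for_slot(fake_allow, timeout=1.0, poll_interval=0.05))
```

In the service this could wrap the limiter check, for example `wait_for_slot(self.limiter.allow_request, timeout=2.0)`, trading a short delay for fewer rejected requests.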
7. Summary
7.1 Key Points
Overview diagram (AI API call optimization):
Retry mechanisms: exponential backoff, circuit breaker
Rate limiting: token bucket, leaky bucket, sliding window
Caching strategies: result cache, semantic cache, distributed cache
Cost optimization: model selection, prompt optimization, Batch API
Monitoring and alerting: token statistics, cost tracking, performance monitoring
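7.2 Best Practices
| Practice | Description |
|---|---|
| Retry strategy | Exponential backoff plus jitter, to avoid retry storms |
| Rate-limiting algorithm | Token bucket (allows bursts) / leaky bucket (smooths the rate) |
| Caching strategy | Result cache plus semantic cache, to reduce repeated calls |
| Cost optimization | Model selection, prompt optimization, and the Batch API |
| Monitoring and alerting | Token statistics, cost tracking, and performance monitoring |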
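As a compact reminder of what "exponential backoff plus jitter" means in the table, here is a minimal sketch of the delay calculation using the common "full jitter" variant: the wait is drawn uniformly between zero and a capped exponential ceiling. The function name `backoff_delay` and the default `base` and `cap` values are illustrative assumptions.

```python
import random
from typing import Callable


def backoff_delay(attempt: int,
                  base: float = 1.0,
                  cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay in seconds for a zero-based retry attempt.

    The ceiling doubles with every attempt and is capped at `cap`;
    the actual delay is a random fraction of that ceiling, so clients
    that fail at the same moment do not all retry at the same moment.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


if __name__ == "__main__":
    for attempt in range(5):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt):.2f}s")
```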
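7.3 Optimization Effects
| Optimization | Typical effect (workload-dependent) |
|---|---|
| Retry mechanism | Failure rate reduced by 50%+ |
| Rate limiting | Avoids triggering provider rate limits |
| Result cache | API calls reduced by 30%+ |
| Semantic cache | API calls reduced by 50%+ |
| Model selection | Cost reduced by 50%+ |
| Batch API | Cost reduced by 50% |
| Prompt optimization | Token consumption reduced by 20%+ |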
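The percentages in the table do not simply add up when several optimizations are combined. The sketch below works through the arithmetic under a deliberately simple, hypothetical model: cache hits cost nothing, and a share of the remaining requests goes through the Batch API at a discount. The function name `projected_cost` and every input value are illustrative, not measurements.

```python
def projected_cost(baseline_usd: float,
                   cache_hit_rate: float,
                   batch_share: float,
                   batch_discount: float = 0.5) -> float:
    """Estimate cost after caching and batching (simplified model).

    baseline_usd:   cost with no optimization
    cache_hit_rate: fraction of requests served from the cache (assumed free)
    batch_share:    fraction of the remaining requests sent through the Batch API
    batch_discount: price reduction on batched requests
    """
    for name, value in (("cache_hit_rate", cache_hit_rate),
                        ("batch_share", batch_share),
                        ("batch_discount", batch_discount)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    after_cache = baseline_usd * (1.0 - cache_hit_rate)
    return after_cache * (1.0 - batch_share * batch_discount)


if __name__ == "__main__":
    cost = projected_cost(baseline_usd=100.0, cache_hit_rate=0.3, batch_share=0.5)
    print(f"Projected cost: ${cost:.2f}")
```

With a 30% cache hit rate and half of the remaining traffic batched at a 50% discount, the factors multiply (0.7 × 0.75), so the combined saving is smaller than the sum of the individual percentages.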
This article is based on OpenAI API best practices. Questions and discussion are welcome in the comments.