Table of Contents
- [API Gateway Advanced Features: Rate Limiting, Authentication, and Monitoring in Depth](#api-gateway-advanced-features-rate-limiting-authentication-and-monitoring-in-depth)
- [Introduction](#introduction)
- [1. Distributed Rate Limiting System Design](#1-distributed-rate-limiting-system-design)
  - [1.1 Comparison of Rate Limiting Algorithms](#11-comparison-of-rate-limiting-algorithms)
  - [1.2 Distributed Token Bucket Implementation](#12-distributed-token-bucket-implementation)
- [2. Unified Authentication and Authorization System Design](#2-unified-authentication-and-authorization-system-design)
  - [2.1 Multi-Layer Authorization Architecture](#21-multi-layer-authorization-architecture)
  - [2.2 Complete Authorization Implementation](#22-complete-authorization-implementation)
- [3. Real-Time Monitoring System Design](#3-real-time-monitoring-system-design)
  - [3.1 Monitoring Architecture](#31-monitoring-architecture)
  - [3.2 Monitoring System Implementation](#32-monitoring-system-implementation)
- [4. System Integration and Best Practices](#4-system-integration-and-best-practices)
  - [4.1 Complete Integration Example](#41-complete-integration-example)
  - [4.2 Performance Optimization Tips](#42-performance-optimization-tips)
  - [4.3 Security Best Practices](#43-security-best-practices)
- [5. Summary and Outlook](#5-summary-and-outlook)
  - [5.1 Key Implementation Points](#51-key-implementation-points)
  - [5.2 Production Environment Considerations](#52-production-environment-considerations)
  - [5.3 Future Directions](#53-future-directions)
API Gateway Advanced Features: Rate Limiting, Authentication, and Monitoring in Depth
Introduction
In a modern microservice architecture, the API gateway is the system's entry point and carries three core responsibilities: traffic control, security protection, and system monitoring. This article takes an in-depth look at implementing the gateway's advanced features, including distributed rate limiting, unified authentication/authorization, and real-time monitoring, with a complete Python implementation.
1. Distributed Rate Limiting System Design
1.1 Comparison of Rate Limiting Algorithms
The rate limiting algorithms commonly used in distributed environments are:
$$
\begin{aligned}
\text{Token bucket:} \quad & R(t) = \min\bigl(C,\; R(t-1) + r \cdot \Delta t\bigr) \\
\text{Leaky bucket:} \quad & R_{\text{out}}(t) = \min\bigl(R_{\text{max}},\; R_{\text{in}}(t)\bigr) \\
\text{Sliding window:} \quad & W(t) = \sum_{i=t-n}^{t} r_i
\end{aligned}
$$
where:
- $C$: bucket capacity
- $r$: token generation rate
- $R_{\text{max}}$: maximum outflow rate
- $W(t)$: total number of requests within the time window
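As a quick illustration, one refill-and-consume step of the token-bucket formula above can be computed directly. This is a standalone sketch; the function name is ours, not part of the gateway code:

```python
def token_bucket_step(tokens: float, capacity: float, rate: float,
                      dt: float, requested: float = 1.0):
    """One step of R(t) = min(C, R(t-1) + r*dt), then try to consume."""
    tokens = min(capacity, tokens + rate * dt)  # refill for the elapsed time
    if tokens >= requested:
        return True, tokens - requested  # allowed: consume the tokens
    return False, tokens                 # rejected: tokens unchanged

# Bucket with capacity 10, refilling 5 tokens/s, starting empty:
allowed, tokens = token_bucket_step(0.0, 10.0, 5.0, dt=1.0)  # -> True, 4.0
# A burst of 10 immediately afterwards exceeds what is left:
burst, tokens = token_bucket_step(tokens, 10.0, 5.0, dt=0.0, requested=10.0)
```

Note that rejecting a request leaves the token count untouched, which is exactly why the Lua script later in this section only writes the bucket state back to Redis on the allowed path.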
1.2 Distributed Token Bucket Implementation
(Diagram: a client request first passes a local rate-limit check on the gateway node, then a distributed count against Redis Cluster, which stores the rate limit counters and time-window state; a failure at either step returns 429, success forwards the request, and counter state is synchronized to the other gateway nodes.)
```python
"""
分布式限流系统实现
支持:令牌桶算法、滑动窗口、多维度限流
"""
import time
import asyncio
import hashlib
import json
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass, field
from enum import Enum
from collections import defaultdict
import redis.asyncio as redis
from datetime import datetime, timedelta
import logging
logger = logging.getLogger(__name__)
class RateLimitAlgorithm(Enum):
    """Rate limiting algorithm enumeration."""
    TOKEN_BUCKET = "token_bucket"
    SLIDING_WINDOW = "sliding_window"
    FIXED_WINDOW = "fixed_window"
    LEAKY_BUCKET = "leaky_bucket"

@dataclass
class RateLimitConfig:
    """Rate limit configuration."""
    algorithm: RateLimitAlgorithm = RateLimitAlgorithm.TOKEN_BUCKET
    capacity: int = 100        # bucket capacity / max requests per window
    refill_rate: float = 10.0  # tokens added per second
    window_size: int = 60      # window size in seconds
    # Multi-dimensional rate limit settings
    dimensions: List[str] = field(default_factory=lambda: ["ip", "user", "endpoint"])
    group_by: Optional[str] = None  # group by user group or tenant
    # Dynamic adjustment parameters
    adaptive: bool = False     # enable adaptive rate limiting
    min_capacity: int = 10     # minimum capacity
    max_capacity: int = 1000   # maximum capacity
class DistributedRateLimiter:
    """Distributed rate limiter."""

    def __init__(self, redis_client: redis.Redis, namespace: str = "ratelimit"):
        """
        Initialize the distributed rate limiter.

        Args:
            redis_client: Redis client
            namespace: Redis key prefix
        """
        self.redis = redis_client
        self.namespace = namespace
        # Local cache to reduce Redis round trips
        self.local_cache = {}
        self.cache_ttl = 1.0  # local cache TTL in seconds
        self.last_cache_clean = time.time()

    def _get_cache_key(self, key: str) -> str:
        """Build the namespaced Redis key."""
        return f"{self.namespace}:{key}"

    async def _clean_local_cache(self):
        """Evict expired entries from the local cache."""
        now = time.time()
        if now - self.last_cache_clean > 60:  # clean at most once per 60 seconds
            expired_keys = []
            for key, (value, timestamp) in self.local_cache.items():
                if now - timestamp > self.cache_ttl:
                    expired_keys.append(key)
            for key in expired_keys:
                del self.local_cache[key]
            self.last_cache_clean = now
    async def token_bucket_acquire(self, key: str, config: RateLimitConfig) -> Tuple[bool, Dict[str, Any]]:
        """
        Token bucket algorithm.

        Returns:
            Tuple[bool, Dict]: (allowed, rate limit info)
        """
        cache_key = self._get_cache_key(f"token:{key}")
        # Check the local cache first
        if cache_key in self.local_cache:
            data, timestamp = self.local_cache[cache_key]
            if time.time() - timestamp < self.cache_ttl:
                return data['allowed'], data
        # Lua script for an atomic check-and-decrement in Redis
        lua_script = """
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local requested = tonumber(ARGV[4])
        local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = capacity
        local last_refill = now
        if bucket[1] then
            tokens = tonumber(bucket[1])
            last_refill = tonumber(bucket[2])
            -- Refill tokens for the elapsed time
            local time_passed = now - last_refill
            local new_tokens = time_passed * refill_rate
            if new_tokens > 0 then
                tokens = math.min(capacity, tokens + new_tokens)
                last_refill = now
            end
        end
        local allowed = 0
        if tokens >= requested then
            tokens = tokens - requested
            allowed = 1
            redis.call('HMSET', key,
                'tokens', tokens,
                'last_refill', last_refill)
            redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
        end
        return {allowed, tokens, last_refill}
        """
        try:
            result = await self.redis.eval(
                lua_script,
                1,
                cache_key,
                config.capacity,
                config.refill_rate,
                time.time(),
                1
            )
            allowed = bool(result[0])
            remaining = result[1]
            reset_time = result[2]
            limit_info = {
                'allowed': allowed,
                'remaining': remaining,
                'reset_time': reset_time,
                'limit': config.capacity,
                'algorithm': config.algorithm.value,
                'key': key
            }
            # Update the local cache
            self.local_cache[cache_key] = (limit_info, time.time())
            await self._clean_local_cache()
            return allowed, limit_info
        except Exception as e:
            logger.error(f"Token bucket error: {str(e)}")
            # Fail open when Redis is unavailable
            return True, {'error': 'degraded', 'allowed': True}
    async def sliding_window_acquire(self, key: str, config: RateLimitConfig) -> Tuple[bool, Dict[str, Any]]:
        """Sliding window algorithm."""
        cache_key = self._get_cache_key(f"window:{key}")
        current_time = time.time()
        window_start = current_time - config.window_size
        # Atomically drop timestamps outside the window, then count and add
        cleanup_script = """
        local key = KEYS[1]
        local window_start = tonumber(ARGV[1])
        local window_size = tonumber(ARGV[2])
        redis.call('ZREMRANGEBYSCORE', key, 0, window_start)
        local current_count = redis.call('ZCARD', key)
        local limit = tonumber(ARGV[3])
        if current_count < limit then
            redis.call('ZADD', key, tonumber(ARGV[4]), ARGV[5])
            redis.call('EXPIRE', key, window_size * 2)
            return {1, current_count + 1}
        else
            return {0, current_count}
        end
        """
        try:
            # Generate a unique request ID
            request_id = hashlib.md5(f"{key}:{current_time}".encode()).hexdigest()
            result = await self.redis.eval(
                cleanup_script,
                1,
                cache_key,
                window_start,
                config.window_size,
                config.capacity,
                current_time,
                request_id
            )
            allowed = bool(result[0])
            current_count = result[1]
            limit_info = {
                'allowed': allowed,
                'current_count': current_count,
                'limit': config.capacity,
                'window_size': config.window_size,
                'algorithm': config.algorithm.value,
                'key': key
            }
            return allowed, limit_info
        except Exception as e:
            logger.error(f"Sliding window error: {str(e)}")
            return True, {'error': 'degraded', 'allowed': True}
    def _generate_limit_key(self, config: RateLimitConfig, **dimensions) -> str:
        """Build the rate limit key from the configured dimensions."""
        key_parts = []
        for dim in config.dimensions:
            if dim in dimensions:
                key_parts.append(f"{dim}:{dimensions[dim]}")
        if config.group_by and config.group_by in dimensions:
            key_parts.insert(0, f"group:{dimensions[config.group_by]}")
        return ":".join(key_parts)

    async def acquire(self, config: RateLimitConfig, **dimensions) -> Tuple[bool, Dict[str, Any]]:
        """
        Acquire a rate limit permit.

        Args:
            config: rate limit configuration
            dimensions: rate limit dimension values

        Returns:
            Tuple[bool, Dict]: (allowed, rate limit details)
        """
        if not config.dimensions:
            return True, {'allowed': True, 'reason': 'no_limit'}
        # Build the rate limit key
        limit_key = self._generate_limit_key(config, **dimensions)
        # Dispatch to the configured algorithm
        if config.algorithm == RateLimitAlgorithm.TOKEN_BUCKET:
            return await self.token_bucket_acquire(limit_key, config)
        elif config.algorithm == RateLimitAlgorithm.SLIDING_WINDOW:
            return await self.sliding_window_acquire(limit_key, config)
        elif config.algorithm == RateLimitAlgorithm.FIXED_WINDOW:
            # Fixed window is implemented like the sliding window here
            return await self.sliding_window_acquire(limit_key, config)
        else:
            # Fall back to the token bucket
            return await self.token_bucket_acquire(limit_key, config)
    async def get_rate_limit_status(self, key: str) -> Dict[str, Any]:
        """Return the current rate limit state for a key."""
        token_key = self._get_cache_key(f"token:{key}")
        window_key = self._get_cache_key(f"window:{key}")
        try:
            # Token bucket state
            token_data = await self.redis.hgetall(token_key)
            # Window count
            window_count = await self.redis.zcard(window_key)
            return {
                'token_bucket': dict(token_data) if token_data else {},
                'window_count': window_count,
                'key': key
            }
        except Exception as e:
            logger.error(f"Get rate limit status error: {str(e)}")
            return {'error': str(e)}
class AdaptiveRateLimiter:
    """Adaptive rate limiter."""

    def __init__(self, base_config: RateLimitConfig):
        """
        Initialize the adaptive rate limiter.

        Args:
            base_config: base rate limit configuration
        """
        self.base_config = base_config
        self.metrics_window = 300  # metrics collection window in seconds
        self.metrics = {
            'request_count': [],
            'error_rate': [],
            'response_time': []
        }

    def calculate_optimal_capacity(self) -> int:
        """
        Compute the optimal capacity,
        adjusted dynamically from system load and error rate.
        """
        if not self.metrics['request_count']:
            return self.base_config.capacity
        # Average the most recent metrics
        recent_window = 60  # last 60 seconds
        recent_metrics = {
            'requests': self._get_recent_avg('request_count', recent_window),
            'error_rate': self._get_recent_avg('error_rate', recent_window),
            'response_time': self._get_recent_avg('response_time', recent_window)
        }
        # Adjust capacity based on response time and error rate
        base_capacity = self.base_config.capacity
        # Shrink capacity when response time exceeds the thresholds
        if recent_metrics['response_time'] > 1000:   # 1 second
            adjustment = 0.5
        elif recent_metrics['response_time'] > 500:  # 500 ms
            adjustment = 0.7
        elif recent_metrics['response_time'] < 100:  # 100 ms
            adjustment = 1.3  # system is responding fast, grow capacity
        else:
            adjustment = 1.0
        # Shrink capacity when the error rate is high
        if recent_metrics['error_rate'] > 0.1:  # 10% error rate
            adjustment *= 0.8
        new_capacity = int(base_capacity * adjustment)
        # Clamp between the minimum and maximum capacity
        new_capacity = max(self.base_config.min_capacity,
                           min(new_capacity, self.base_config.max_capacity))
        logger.info(f"Adaptive rate limit: {base_capacity} -> {new_capacity}, "
                    f"response_time={recent_metrics['response_time']:.1f}ms, "
                    f"error_rate={recent_metrics['error_rate']:.2%}")
        return new_capacity

    def _get_recent_avg(self, metric_name: str, window_seconds: int) -> float:
        """Average of a metric over the recent time window."""
        now = time.time()
        cutoff = now - window_seconds
        metrics = [(t, v) for t, v in self.metrics[metric_name] if t >= cutoff]
        if not metrics:
            return 0.0
        total = sum(v for _, v in metrics)
        return total / len(metrics)

    def record_metric(self, metric_name: str, value: float):
        """Record a metric sample."""
        self.metrics[metric_name].append((time.time(), value))
        # Drop samples older than the collection window
        cutoff = time.time() - self.metrics_window
        self.metrics[metric_name] = [
            (t, v) for t, v in self.metrics[metric_name]
            if t >= cutoff
        ]

    def get_config(self) -> RateLimitConfig:
        """Return the current (possibly adjusted) configuration."""
        if self.base_config.adaptive:
            current_capacity = self.calculate_optimal_capacity()
            updated_config = RateLimitConfig(
                algorithm=self.base_config.algorithm,
                capacity=current_capacity,
                refill_rate=self.base_config.refill_rate,
                window_size=self.base_config.window_size,
                dimensions=self.base_config.dimensions.copy(),
                group_by=self.base_config.group_by,
                adaptive=True,
                min_capacity=self.base_config.min_capacity,
                max_capacity=self.base_config.max_capacity
            )
            return updated_config
        return self.base_config
```
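The sliding-window accounting performed by the Lua script can be mirrored in-process for local testing without a Redis server. `SimpleSlidingWindow` below is an illustrative sketch of the same eviction-then-count logic, not part of the gateway code above:

```python
import time
from collections import deque

class SimpleSlidingWindow:
    """In-process sliding-window counter mirroring the Lua script's logic."""

    def __init__(self, capacity: int, window_size: float):
        self.capacity = capacity
        self.window_size = window_size
        self.timestamps = deque()

    def acquire(self, now: float = None) -> bool:
        now = time.time() if now is None else now
        # Evict timestamps that fell out of the window (ZREMRANGEBYSCORE)
        while self.timestamps and self.timestamps[0] <= now - self.window_size:
            self.timestamps.popleft()
        # Count and conditionally add (ZCARD + ZADD)
        if len(self.timestamps) < self.capacity:
            self.timestamps.append(now)
            return True
        return False

limiter = SimpleSlidingWindow(capacity=2, window_size=1.0)
results = [limiter.acquire(now=t) for t in (0.0, 0.1, 0.2)]  # third call exceeds the limit
late = limiter.acquire(now=1.5)  # the earlier entries have expired by now
```

The distributed version differs only in where the timestamps live: a Redis sorted set shared by all gateway nodes instead of a per-process deque.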
2. Unified Authentication and Authorization System Design
2.1 Multi-Layer Authorization Architecture
(Diagram: an incoming API request passes through JWT validation (401 on failure), user-context parsing, an RBAC permission check (403 on failure), and a data permission check (403 on failure) before being forwarded; the authorization components are the JWT parser, the RBAC engine, the data permission filter, and the audit log.)
2.2 Complete Authorization Implementation
```python
"""
统一鉴权系统
支持:JWT、RBAC、数据权限、审计日志
"""
import jwt
import time
import uuid
from typing import Dict, List, Optional, Set, Any, Callable
from dataclasses import dataclass, field
from enum import Enum
from datetime import datetime, timedelta
from functools import wraps
import hashlib
import json
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.backends import default_backend
class PermissionAction(Enum):
    """Permission action enumeration."""
    CREATE = "create"
    READ = "read"
    UPDATE = "update"
    DELETE = "delete"
    EXECUTE = "execute"

@dataclass
class Permission:
    """Permission definition."""
    resource: str             # resource identifier
    action: PermissionAction  # action type
    conditions: Optional[Dict[str, Any]] = None  # optional condition constraints

@dataclass
class Role:
    """Role definition."""
    name: str
    permissions: List[Permission]
    inherited_roles: List[str] = field(default_factory=list)

@dataclass
class UserContext:
    """User context."""
    user_id: str
    username: str
    roles: List[str]
    permissions: Set[str] = field(default_factory=set)
    attributes: Dict[str, Any] = field(default_factory=dict)
    tenant_id: Optional[str] = None  # tenant ID (multi-tenancy support)
    session_id: Optional[str] = None
class JWTAuthenticator:
    """JWT authenticator."""

    def __init__(self,
                 secret_key: Optional[str] = None,
                 public_key: Optional[str] = None,
                 private_key: Optional[str] = None,
                 algorithm: str = "RS256",
                 token_ttl: int = 3600):
        """
        Initialize the JWT authenticator.

        Args:
            secret_key: symmetric key (for HS256-style algorithms)
            public_key: public key (for RS256-style algorithms)
            private_key: private key (for RS256-style algorithms)
            algorithm: signing algorithm
            token_ttl: token lifetime in seconds
        """
        self.algorithm = algorithm
        self.token_ttl = token_ttl
        if algorithm.startswith("RS"):
            # RSA algorithms
            if private_key:
                self.private_key = serialization.load_pem_private_key(
                    private_key.encode(),
                    password=None,
                    backend=default_backend()
                )
            else:
                # Generate a key pair
                private_key = rsa.generate_private_key(
                    public_exponent=65537,
                    key_size=2048,
                    backend=default_backend()
                )
                self.private_key = private_key
            if public_key:
                self.public_key = serialization.load_pem_public_key(
                    public_key.encode(),
                    backend=default_backend()
                )
            else:
                self.public_key = self.private_key.public_key()
        else:
            # Symmetric algorithms
            if not secret_key:
                secret_key = self._generate_secret_key()
            self.secret_key = secret_key

    def _generate_secret_key(self) -> str:
        """Generate a random secret key."""
        return hashlib.sha256(str(uuid.uuid4()).encode()).hexdigest()
    def create_token(self, user_context: UserContext,
                     additional_claims: Optional[Dict[str, Any]] = None) -> str:
        """
        Create a JWT token.

        Args:
            user_context: user context
            additional_claims: extra claims

        Returns:
            str: JWT token
        """
        now = datetime.utcnow()
        # Base claims
        claims = {
            'sub': user_context.user_id,
            'username': user_context.username,
            'roles': user_context.roles,
            'permissions': list(user_context.permissions),
            'iat': now,
            'exp': now + timedelta(seconds=self.token_ttl),
            'jti': str(uuid.uuid4()),  # JWT ID
            'tenant_id': user_context.tenant_id,
            'session_id': user_context.session_id
        }
        # Merge extra claims
        if additional_claims:
            claims.update(additional_claims)
        # Pick the signing key for the algorithm
        if self.algorithm.startswith("RS"):
            token = jwt.encode(
                claims,
                self.private_key,
                algorithm=self.algorithm
            )
        else:
            token = jwt.encode(
                claims,
                self.secret_key,
                algorithm=self.algorithm
            )
        return token
    def verify_token(self, token: str) -> Optional[Dict[str, Any]]:
        """
        Verify a JWT token.

        Args:
            token: JWT token

        Returns:
            Optional[Dict]: the decoded claims, or None if invalid
        """
        try:
            if self.algorithm.startswith("RS"):
                payload = jwt.decode(
                    token,
                    self.public_key,
                    algorithms=[self.algorithm]
                )
            else:
                payload = jwt.decode(
                    token,
                    self.secret_key,
                    algorithms=[self.algorithm]
                )
            # Reject tokens on the revocation blacklist
            if self._is_token_revoked(payload.get('jti')):
                return None
            return payload
        except jwt.ExpiredSignatureError:
            logger.warning("JWT token expired")
            return None
        except jwt.InvalidTokenError as e:
            logger.warning(f"Invalid JWT token: {str(e)}")
            return None

    def _is_token_revoked(self, jti: str) -> bool:
        """Check whether a token has been revoked."""
        # A real implementation would check a blacklist in Redis or a database;
        # simplified here to always return False.
        return False
    def refresh_token(self, token: str) -> Optional[str]:
        """
        Refresh a token.

        Args:
            token: the original token

        Returns:
            Optional[str]: a new token, or None if the original is invalid
        """
        payload = self.verify_token(token)
        if not payload:
            return None
        # Rebuild the user context
        user_context = UserContext(
            user_id=payload['sub'],
            username=payload['username'],
            roles=payload['roles'],
            permissions=set(payload.get('permissions', [])),
            tenant_id=payload.get('tenant_id'),
            session_id=payload.get('session_id')
        )
        # Issue a new token
        return self.create_token(user_context)
class RBACAuthorizer:
    """RBAC authorization engine."""

    def __init__(self):
        """Initialize the RBAC engine."""
        self.roles: Dict[str, Role] = {}
        self.role_hierarchy: Dict[str, Set[str]] = defaultdict(set)

    def add_role(self, role: Role):
        """Register a role."""
        self.roles[role.name] = role
        # Record the role inheritance edges
        for inherited_role in role.inherited_roles:
            self.role_hierarchy[role.name].add(inherited_role)

    def get_role_permissions(self, role_name: str,
                             include_inherited: bool = True) -> Set[str]:
        """
        Get a role's permissions.

        Args:
            role_name: role name
            include_inherited: whether to include inherited permissions

        Returns:
            Set[str]: permission set
        """
        permissions = set()
        if role_name not in self.roles:
            return permissions
        # Direct permissions
        role = self.roles[role_name]
        for perm in role.permissions:
            perm_key = f"{perm.resource}:{perm.action.value}"
            permissions.add(perm_key)
        # Inherited permissions
        if include_inherited:
            inherited_roles = self._get_all_inherited_roles(role_name)
            for inherited_role in inherited_roles:
                if inherited_role in self.roles:
                    inherited_perms = self.get_role_permissions(
                        inherited_role,
                        include_inherited=False
                    )
                    permissions.update(inherited_perms)
        return permissions

    def _get_all_inherited_roles(self, role_name: str) -> Set[str]:
        """Collect all transitively inherited roles (iterative traversal)."""
        all_roles = set()
        stack = [role_name]
        while stack:
            current_role = stack.pop()
            if current_role in self.role_hierarchy:
                for inherited_role in self.role_hierarchy[current_role]:
                    if inherited_role not in all_roles:
                        all_roles.add(inherited_role)
                        stack.append(inherited_role)
        return all_roles
    def has_permission(self, user_context: UserContext,
                       resource: str, action: PermissionAction,
                       conditions: Optional[Dict[str, Any]] = None) -> bool:
        """
        Check whether a user has a permission.

        Args:
            user_context: user context
            resource: resource identifier
            action: action type
            conditions: optional condition constraints

        Returns:
            bool: whether the user has the permission
        """
        # Check the user's direct permissions
        perm_key = f"{resource}:{action.value}"
        if perm_key in user_context.permissions:
            return self._check_conditions(conditions, user_context)
        # Check role permissions
        for role_name in user_context.roles:
            role_permissions = self.get_role_permissions(role_name)
            if perm_key in role_permissions:
                return self._check_conditions(conditions, user_context)
        return False

    def _check_conditions(self, conditions: Optional[Dict[str, Any]],
                          user_context: UserContext) -> bool:
        """Evaluate condition constraints."""
        if not conditions:
            return True
        # Simplified; a real project needs a richer condition engine
        for key, expected_value in conditions.items():
            if key.startswith("user."):
                # Compare against user attributes
                attr_key = key[5:]  # strip the "user." prefix
                if attr_key not in user_context.attributes:
                    return False
                if user_context.attributes[attr_key] != expected_value:
                    return False
        return True
class DataPermissionFilter:
    """Data permission filter."""

    def __init__(self):
        """Initialize the data permission filter."""
        self.data_permission_rules: Dict[str, List[Callable]] = defaultdict(list)

    def add_rule(self, resource_type: str, rule_func: Callable):
        """
        Add a data permission rule.

        Args:
            resource_type: resource type
            rule_func: rule function taking (user_context, resource_data) and returning bool
        """
        self.data_permission_rules[resource_type].append(rule_func)

    def filter_resources(self, user_context: UserContext,
                         resources: List[Dict[str, Any]],
                         resource_type: str) -> List[Dict[str, Any]]:
        """
        Filter a list of resources.

        Args:
            user_context: user context
            resources: resource list
            resource_type: resource type

        Returns:
            List[Dict]: the filtered resource list
        """
        if resource_type not in self.data_permission_rules:
            return resources
        filtered_resources = []
        for resource in resources:
            allowed = True
            for rule_func in self.data_permission_rules[resource_type]:
                if not rule_func(user_context, resource):
                    allowed = False
                    break
            if allowed:
                filtered_resources.append(resource)
        return filtered_resources

    def can_access_resource(self, user_context: UserContext,
                            resource: Dict[str, Any],
                            resource_type: str) -> bool:
        """
        Check access to a single resource.

        Args:
            user_context: user context
            resource: resource data
            resource_type: resource type

        Returns:
            bool: whether access is allowed
        """
        if resource_type not in self.data_permission_rules:
            return True
        for rule_func in self.data_permission_rules[resource_type]:
            if not rule_func(user_context, resource):
                return False
        return True
class AuditLogger:
    """Audit log recorder."""

    def __init__(self, storage_backend: Optional[Any] = None):
        """
        Initialize the audit logger.

        Args:
            storage_backend: storage backend (Redis, database, etc.)
        """
        self.storage_backend = storage_backend
        self.buffer = []
        self.buffer_size = 100
        self.last_flush = time.time()

    def log_access(self, user_context: UserContext,
                   resource: str, action: str,
                   status: str, details: Optional[Dict[str, Any]] = None):
        """
        Record an access log entry.

        Args:
            user_context: user context
            resource: resource accessed
            action: action performed
            status: access status (allow/deny)
            details: extra details
        """
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'user_id': user_context.user_id,
            'username': user_context.username,
            'tenant_id': user_context.tenant_id,
            'session_id': user_context.session_id,
            'resource': resource,
            'action': action,
            'status': status,
            'ip_address': user_context.attributes.get('ip_address'),
            'user_agent': user_context.attributes.get('user_agent'),
            'details': details or {}
        }
        self.buffer.append(log_entry)
        # Flush when the buffer is full or the flush interval has passed
        if (len(self.buffer) >= self.buffer_size or
                time.time() - self.last_flush > 30):
            self.flush()

    def flush(self):
        """Flush the buffer to the storage backend."""
        if not self.buffer or not self.storage_backend:
            return
        try:
            # A real implementation would write to the storage backend;
            # simplified here to emit the entries as log lines.
            for entry in self.buffer:
                logger.info(f"Audit log: {json.dumps(entry)}")
            self.buffer.clear()
            self.last_flush = time.time()
        except Exception as e:
            logger.error(f"Failed to flush audit logs: {str(e)}")
class UnifiedAuthManager:
    """Unified authentication and authorization manager."""

    def __init__(self,
                 jwt_authenticator: JWTAuthenticator,
                 rbac_authorizer: Optional[RBACAuthorizer] = None,
                 data_filter: Optional[DataPermissionFilter] = None,
                 audit_logger: Optional[AuditLogger] = None):
        """
        Initialize the unified auth manager.

        Args:
            jwt_authenticator: JWT authenticator
            rbac_authorizer: RBAC engine
            data_filter: data permission filter
            audit_logger: audit logger
        """
        self.jwt_authenticator = jwt_authenticator
        self.rbac_authorizer = rbac_authorizer or RBACAuthorizer()
        self.data_filter = data_filter or DataPermissionFilter()
        self.audit_logger = audit_logger
        # Per-user permission cache
        self.permission_cache = {}
        self.cache_ttl = 300  # cache for 5 minutes

    def authenticate_request(self, headers: Dict[str, str]) -> Optional[UserContext]:
        """
        Authenticate a request.

        Args:
            headers: request headers

        Returns:
            Optional[UserContext]: the user context, or None if unauthenticated
        """
        # Extract the token
        auth_header = headers.get('Authorization')
        if not auth_header or not auth_header.startswith('Bearer '):
            return None
        token = auth_header[7:]
        # Verify the token
        payload = self.jwt_authenticator.verify_token(token)
        if not payload:
            return None
        # Build the user context
        user_context = UserContext(
            user_id=payload['sub'],
            username=payload['username'],
            roles=payload['roles'],
            tenant_id=payload.get('tenant_id'),
            session_id=payload.get('session_id')
        )
        # Look up cached permissions first
        cache_key = f"perms:{user_context.user_id}"
        if cache_key in self.permission_cache:
            cached_data = self.permission_cache[cache_key]
            if time.time() - cached_data['timestamp'] < self.cache_ttl:
                user_context.permissions = cached_data['permissions']
                return user_context
        # Compute the user's full permission set
        all_permissions = set()
        for role_name in user_context.roles:
            role_permissions = self.rbac_authorizer.get_role_permissions(role_name)
            all_permissions.update(role_permissions)
        user_context.permissions = all_permissions
        # Update the cache
        self.permission_cache[cache_key] = {
            'permissions': all_permissions,
            'timestamp': time.time()
        }
        return user_context
    def authorize_request(self, user_context: UserContext,
                          resource: str, action: PermissionAction,
                          resource_data: Optional[Dict[str, Any]] = None) -> bool:
        """
        Authorize a request.

        Args:
            user_context: user context
            resource: resource identifier
            action: action type
            resource_data: resource data (for data permission checks)

        Returns:
            bool: whether the request is authorized
        """
        # RBAC permission check
        has_rbac_permission = self.rbac_authorizer.has_permission(
            user_context, resource, action
        )
        if not has_rbac_permission:
            # Audit the denial
            if self.audit_logger:
                self.audit_logger.log_access(
                    user_context, resource, action.value,
                    status="deny", details={'reason': 'rbac_denied'}
                )
            return False
        # Data permission check (when resource data is present)
        if resource_data and self.data_filter:
            resource_type = resource.split(':')[0] if ':' in resource else resource
            has_data_permission = self.data_filter.can_access_resource(
                user_context, resource_data, resource_type
            )
            if not has_data_permission:
                # Audit the denial
                if self.audit_logger:
                    self.audit_logger.log_access(
                        user_context, resource, action.value,
                        status="deny", details={'reason': 'data_denied'}
                    )
                return False
        # Audit the successful access
        if self.audit_logger:
            self.audit_logger.log_access(
                user_context, resource, action.value,
                status="allow"
            )
        return True
    def create_auth_decorator(self, resource: str, action: PermissionAction):
        """
        Create an authentication/authorization decorator.

        Args:
            resource: resource identifier
            action: action type

        Returns:
            Callable: the decorator
        """
        def decorator(func):
            @wraps(func)
            async def wrapper(*args, **kwargs):
                # Extract the request from the call context;
                # adapt this to the web framework in use.
                request = kwargs.get('request')
                if not request:
                    raise ValueError("Request object not found in arguments")
                # Authenticate
                user_context = self.authenticate_request(request.headers)
                if not user_context:
                    return {'error': 'Unauthorized'}, 401
                # Authorize
                resource_data = None
                # Resource data could be extracted from the request here if needed
                if not self.authorize_request(user_context, resource, action, resource_data):
                    return {'error': 'Forbidden'}, 403
                # Attach the user context to the request
                request.user_context = user_context
                # Invoke the wrapped handler
                return await func(*args, **kwargs)
            return wrapper
        return decorator
```
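To make the role-inheritance resolution inside `RBACAuthorizer` concrete, here is a minimal standalone sketch of the same iterative traversal; the role names and permission keys are invented for the example:

```python
# role -> directly inherited roles
hierarchy = {
    "admin": {"editor"},
    "editor": {"viewer"},
}
# role -> direct permissions, keyed "resource:action" as in the RBAC engine
direct_perms = {
    "viewer": {"article:read"},
    "editor": {"article:update"},
    "admin": {"article:delete"},
}

def effective_permissions(role: str) -> set:
    """Union of a role's own permissions and all transitively inherited ones."""
    perms, stack, seen = set(), [role], set()
    while stack:
        current = stack.pop()
        if current in seen:  # guard against inheritance cycles
            continue
        seen.add(current)
        perms |= direct_perms.get(current, set())
        stack.extend(hierarchy.get(current, ()))
    return perms

admin_perms = effective_permissions("admin")
# admin inherits editor, which inherits viewer, so all three permission sets apply
```

The iterative stack (rather than recursion) matters: role graphs in real systems can be deep or even accidentally cyclic, and the `seen` set makes the traversal terminate regardless.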
3. Real-Time Monitoring System Design
3.1 Monitoring Architecture
(Diagram: each API gateway node runs a metrics collector that publishes to a message queue; a stream processor performs real-time aggregation and anomaly detection, feeding the alerting system, the monitoring dashboard, and notification channels; the storage layer comprises a time-series database, log storage, and distributed tracing.)
3.2 Monitoring System Implementation
```python
"""
实时监控系统
支持:指标收集、实时聚合、异常检测、告警通知
"""
import time
import asyncio
import json
from typing import Dict, List, Optional, Any, Callable
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from collections import defaultdict, deque
from enum import Enum
import statistics
import numpy as np
from threading import Lock
import logging
logger = logging.getLogger(__name__)
class MetricType(Enum):
    """Metric types."""
    COUNTER = "counter"
    GAUGE = "gauge"
    HISTOGRAM = "histogram"
    SUMMARY = "summary"

@dataclass
class Metric:
    """Metric definition."""
    name: str
    type: MetricType
    value: Any
    tags: Dict[str, str] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

@dataclass
class AlertRule:
    """Alert rule."""
    name: str
    metric_name: str
    condition: str  # condition expression, e.g. "> 100"
    duration: int = 60  # duration in seconds
    severity: str = "warning"
    message_template: str = "{metric_name} exceeded threshold: {value} > {threshold}"
    tags: Dict[str, str] = field(default_factory=dict)
class MetricsCollector:
    """Metrics collector."""

    def __init__(self, flush_interval: int = 10):
        """
        Initialize the metrics collector.

        Args:
            flush_interval: flush interval in seconds
        """
        self.flush_interval = flush_interval
        self.metrics_buffer: List[Metric] = []
        self.lock = Lock()
        self.last_flush = time.time()
        # Built-in state
        self.counters = defaultdict(int)
        self.gauges = {}
        self.histograms = defaultdict(list)
        # Register the built-in metrics
        self._register_default_metrics()

    def _register_default_metrics(self):
        """Register the default metrics."""
        # Request metrics
        self.register_counter("http_requests_total",
                              description="Total HTTP requests")
        self.register_counter("http_requests_by_status",
                              description="HTTP requests by status code",
                              tag_keys=["status"])
        self.register_histogram("http_request_duration_seconds",
                                buckets=[0.1, 0.5, 1, 2, 5],
                                description="HTTP request duration")
        # System metrics
        self.register_gauge("system_cpu_usage",
                            description="System CPU usage")
        self.register_gauge("system_memory_usage",
                            description="System memory usage")
        # Gateway-specific metrics
        self.register_counter("rate_limit_requests",
                              description="Rate limited requests",
                              tag_keys=["reason"])
        self.register_counter("auth_failed_requests",
                              description="Authentication failed requests")

    def register_counter(self, name: str, description: str = "",
                         tag_keys: Optional[List[str]] = None):
        """Register a counter (a real implementation would store the metadata)."""
        pass

    def register_gauge(self, name: str, description: str = ""):
        """Register a gauge."""
        pass

    def register_histogram(self, name: str, buckets: List[float],
                           description: str = ""):
        """Register a histogram."""
        pass
    def record_metric(self, metric: Metric):
        """Record a metric sample."""
        with self.lock:
            self.metrics_buffer.append(metric)
            # Update the built-in state by metric type
            if metric.type == MetricType.COUNTER:
                key = f"{metric.name}:{json.dumps(metric.tags)}"
                self.counters[key] += int(metric.value)
            elif metric.type == MetricType.GAUGE:
                key = f"{metric.name}:{json.dumps(metric.tags)}"
                self.gauges[key] = metric.value
            elif metric.type == MetricType.HISTOGRAM:
                self.histograms[metric.name].append(metric.value)
        # Flush outside the lock: flush() takes the lock itself,
        # and threading.Lock is not re-entrant
        if time.time() - self.last_flush >= self.flush_interval:
            self.flush()

    def increment_counter(self, name: str, value: int = 1,
                          tags: Optional[Dict[str, str]] = None):
        """Increment a counter."""
        self.record_metric(
            Metric(name=name, type=MetricType.COUNTER,
                   value=value, tags=tags or {})
        )

    def set_gauge(self, name: str, value: float,
                  tags: Optional[Dict[str, str]] = None):
        """Set a gauge value."""
        self.record_metric(
            Metric(name=name, type=MetricType.GAUGE,
                   value=value, tags=tags or {})
        )

    def record_histogram(self, name: str, value: float,
                         tags: Optional[Dict[str, str]] = None):
        """Record a histogram observation."""
        self.record_metric(
            Metric(name=name, type=MetricType.HISTOGRAM,
                   value=value, tags=tags or {})
        )

    def flush(self) -> List[Metric]:
        """Flush the metrics buffer."""
        with self.lock:
            if not self.metrics_buffer:
                return []
            flushed_metrics = self.metrics_buffer.copy()
            self.metrics_buffer.clear()
            self.last_flush = time.time()
            logger.info(f"Flushed {len(flushed_metrics)} metrics")
            return flushed_metrics

    def get_current_metrics(self) -> Dict[str, Any]:
        """Get the current metric state."""
        with self.lock:
            return {
                'counters': dict(self.counters),
                'gauges': dict(self.gauges),
                'histograms': {k: len(v) for k, v in self.histograms.items()},
                'buffer_size': len(self.metrics_buffer)
            }
class TimeSeriesAggregator:
    """Time-series aggregator."""

    def __init__(self, window_sizes: Optional[List[int]] = None):
        """
        Initialize the time-series aggregator.

        Args:
            window_sizes: aggregation window sizes in seconds (default 60/300/900/3600)
        """
        self.window_sizes = sorted(window_sizes or [60, 300, 900, 3600])
        self.data = defaultdict(lambda: defaultdict(deque))
        self.lock = Lock()

    def add_metric(self, metric: Metric):
        """Add a metric sample to every window."""
        with self.lock:
            metric_key = self._get_metric_key(metric)
            for window_size in self.window_sizes:
                window_key = f"{metric_key}:{window_size}"
                window_data = self.data[window_key]
                # Append the new data point
                window_data['timestamps'].append(metric.timestamp)
                window_data['values'].append(metric.value)
                # Evict points that fell out of the window
                cutoff = metric.timestamp - window_size
                while (window_data['timestamps'] and
                       window_data['timestamps'][0] < cutoff):
                    window_data['timestamps'].popleft()
                    window_data['values'].popleft()

    def _get_metric_key(self, metric: Metric) -> str:
        """Build the metric key."""
        tags_str = ":".join(f"{k}={v}" for k, v in sorted(metric.tags.items()))
        return f"{metric.name}:{metric.type.value}:{tags_str}"
    def get_aggregated_stats(self, metric_name: str,
                             metric_type: MetricType,
                             window_size: int,
                             tags: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
        """Get aggregated statistics for one metric and window."""
        tags = tags or {}
        metric_key = self._get_metric_key(
            Metric(name=metric_name, type=metric_type, value=0, tags=tags)
        )
        window_key = f"{metric_key}:{window_size}"
        with self.lock:
            window_data = self.data.get(window_key)
            if not window_data or not window_data['values']:
                return {'count': 0}
            values = list(window_data['values'])
        if metric_type == MetricType.COUNTER:
            stats = {
                'count': len(values),
                'sum': sum(values),
                'rate': sum(values) / window_size
            }
        elif metric_type == MetricType.GAUGE:
            stats = {
                'count': len(values),
                'avg': statistics.mean(values),
                'min': min(values),
                'max': max(values),
                'latest': values[-1]
            }
        elif metric_type in (MetricType.HISTOGRAM, MetricType.SUMMARY):
            stats = {
                'count': len(values),
                'avg': statistics.mean(values),
                'min': min(values),
                'max': max(values),
                'p50': np.percentile(values, 50),
                'p90': np.percentile(values, 90),
                'p95': np.percentile(values, 95),
                'p99': np.percentile(values, 99)
            }
        else:
            stats = {'count': len(values)}
        stats['window_size'] = window_size
        stats['data_points'] = len(values)
        return stats

    def get_all_metrics_summary(self) -> Dict[str, Any]:
        """Summarize all tracked metrics."""
        with self.lock:
            summary = {}
            for window_key, window_data in self.data.items():
                if not window_data['values']:
                    continue
                parts = window_key.split(':')
                if len(parts) < 3:
                    continue
                metric_name = parts[0]
                metric_type = parts[1]
                window_size = int(parts[-1])
                if metric_name not in summary:
                    summary[metric_name] = {}
                if window_size not in summary[metric_name]:
                    summary[metric_name][window_size] = {
                        'count': len(window_data['values']),
                        'type': metric_type
                    }
            return summary
class AnomalyDetector:
    """Anomaly detector."""

    def __init__(self, sensitivity: float = 2.0):
        """
        Initialize the anomaly detector.

        Args:
            sensitivity: detection sensitivity; larger values are more permissive
        """
        self.sensitivity = sensitivity
        self.metric_history = defaultdict(lambda: deque(maxlen=1000))
        self.baselines = {}
        self.lock = Lock()

    def detect_anomaly(self, metric: Metric) -> Optional[Dict[str, Any]]:
        """
        Detect an anomalous metric value.

        Args:
            metric: metric sample

        Returns:
            Optional[Dict]: anomaly details, or None if the value is normal
        """
        metric_key = self._get_metric_key(metric)
        with self.lock:
            # Update the history
            history = self.metric_history[metric_key]
            history.append((metric.timestamp, metric.value))
            # Establish a baseline once there is enough history
            if len(history) >= 30:  # at least 30 data points
                baseline = self._calculate_baseline(history)
                self.baselines[metric_key] = baseline
                # Check for an anomaly
                is_anomaly, anomaly_score = self._check_anomaly(
                    metric.value, baseline
                )
                if is_anomaly:
                    return {
                        'metric': metric.name,
                        'value': metric.value,
                        'baseline': baseline,
                        'score': anomaly_score,
                        'timestamp': metric.timestamp,
                        'tags': metric.tags
                    }
        return None

    def _get_metric_key(self, metric: Metric) -> str:
        """Build the metric key."""
        tags_str = ":".join(f"{k}={v}" for k, v in sorted(metric.tags.items()))
        return f"{metric.name}:{metric.type.value}:{tags_str}"

    def _calculate_baseline(self, history: deque) -> Dict[str, float]:
        """Compute the baseline from the history."""
        timestamps, values = zip(*history)
        values = list(values)
        # Moving average and standard deviation
        window = len(values) if len(values) < 10 else 10
        moving_avg = []
        moving_std = []
        for i in range(len(values)):
            start = max(0, i - window + 1)
            window_values = values[start:i + 1]
            if window_values:
                moving_avg.append(statistics.mean(window_values))
                if len(window_values) > 1:
                    moving_std.append(statistics.stdev(window_values))
                else:
                    moving_std.append(0)
        baseline = {
            'mean': moving_avg[-1] if moving_avg else 0,
            'std': moving_std[-1] if moving_std else 0,
            'min': min(values),
            'max': max(values),
            'median': statistics.median(values),
            'history_size': len(values)
        }
        return baseline

    def _check_anomaly(self, value: float, baseline: Dict[str, float]) -> Tuple[bool, float]:
        """Check whether a value is anomalous."""
        if baseline['std'] == 0:
            # Zero standard deviation: fall back to a range check
            margin = baseline['mean'] * 0.1  # 10% tolerance
            lower_bound = baseline['mean'] - margin
            upper_bound = baseline['mean'] + margin
            if value < lower_bound or value > upper_bound:
                anomaly_score = abs(value - baseline['mean']) / margin
                return True, anomaly_score
            return False, 0
        # Z-score based detection
        z_score = abs(value - baseline['mean']) / baseline['std']
        # Dynamic threshold that loosens as history accumulates
        threshold = self.sensitivity * (1 + np.log1p(baseline['history_size'] / 100))
        if z_score > threshold:
            return True, z_score / threshold
        return False, 0

    def get_detection_status(self) -> Dict[str, Any]:
        """Get the detector's status."""
        with self.lock:
            return {
                'monitored_metrics': len(self.metric_history),
                'established_baselines': len(self.baselines),
                'sensitivity': self.sensitivity
            }
class AlertManager:
    """Alert manager"""

    def __init__(self):
        """Initialize the alert manager."""
        self.rules: Dict[str, AlertRule] = {}
        self.active_alerts: Dict[str, Dict[str, Any]] = {}
        self.alert_history = deque(maxlen=1000)
        self.notification_channels = []
        self.lock = Lock()

    def add_rule(self, rule: AlertRule):
        """Register an alert rule."""
        with self.lock:
            self.rules[rule.name] = rule

    def check_metric(self, metric: Metric):
        """Check whether a metric sample triggers or resolves any alert."""
        with self.lock:
            for rule_name, rule in self.rules.items():
                if rule.metric_name != metric.name:
                    continue
                # The metric's tags must match the rule's tag selector
                if not self._check_tags_match(metric.tags, rule.tags):
                    continue
                # Evaluate the rule condition
                if self._evaluate_condition(metric.value, rule.condition):
                    self._trigger_alert(rule, metric)
                else:
                    self._resolve_alert(rule_name, metric)

    def _check_tags_match(self, metric_tags: Dict[str, str],
                          rule_tags: Dict[str, str]) -> bool:
        """Check whether all rule tags are present with matching values."""
        for key, value in rule_tags.items():
            if key not in metric_tags or metric_tags[key] != value:
                return False
        return True

    def _evaluate_condition(self, value: float, condition: str) -> bool:
        """Evaluate a condition expression such as "> 0.1" or "<= 5"."""
        try:
            # Two-character operators must be tested before ">" and "<",
            # otherwise ">= 5" would match ">" and fail on float("= 5")
            if condition.startswith(">="):
                return value >= float(condition[2:].strip())
            elif condition.startswith("<="):
                return value <= float(condition[2:].strip())
            elif condition.startswith("=="):
                return abs(value - float(condition[2:].strip())) < 0.001
            elif condition.startswith("!="):
                return abs(value - float(condition[2:].strip())) >= 0.001
            elif condition.startswith(">"):
                return value > float(condition[1:].strip())
            elif condition.startswith("<"):
                return value < float(condition[1:].strip())
            else:
                logger.warning(f"Unsupported condition: {condition}")
                return False
        except ValueError:
            logger.error(f"Invalid condition format: {condition}")
            return False

    def _trigger_alert(self, rule: AlertRule, metric: Metric):
        """Fire (or refresh) an alert. Caller (check_metric) already holds
        self.lock, so no further locking here -- re-acquiring a plain Lock
        would deadlock."""
        # sort_keys keeps the key stable regardless of tag insertion order
        alert_key = f"{rule.name}:{json.dumps(metric.tags, sort_keys=True)}"
        if alert_key in self.active_alerts:
            # Refresh the existing alert
            alert = self.active_alerts[alert_key]
            alert['last_triggered'] = metric.timestamp
            alert['trigger_count'] += 1
            alert['latest_value'] = metric.value
        else:
            # Create a new alert
            alert = {
                'rule_name': rule.name,
                'metric_name': metric.name,
                'tags': metric.tags,
                'condition': rule.condition,
                'severity': rule.severity,
                'first_triggered': metric.timestamp,
                'last_triggered': metric.timestamp,
                'trigger_count': 1,
                'latest_value': metric.value,
                'message': rule.message_template.format(
                    metric_name=metric.name,
                    value=metric.value,
                    threshold=rule.condition
                )
            }
            self.active_alerts[alert_key] = alert
        # Append to the history
        self.alert_history.append({
            **alert,
            'timestamp': metric.timestamp,
            'type': 'triggered'
        })
        # Send notifications
        self._send_notification(alert)

    def _resolve_alert(self, rule_name: str, metric: Metric):
        """Resolve an active alert. Caller must hold self.lock."""
        alert_key = f"{rule_name}:{json.dumps(metric.tags, sort_keys=True)}"
        if alert_key in self.active_alerts:
            alert = self.active_alerts.pop(alert_key)
            # Record the resolution event
            self.alert_history.append({
                'rule_name': rule_name,
                'metric_name': metric.name,
                'tags': metric.tags,
                'timestamp': metric.timestamp,
                'type': 'resolved',
                'duration': metric.timestamp - alert['first_triggered']
            })

    def _send_notification(self, alert: Dict[str, Any]):
        """Dispatch the alert to all notification channels."""
        # Email, Slack, webhooks etc. can be plugged in here
        logger.warning(f"ALERT: {alert['message']}")
        for channel in self.notification_channels:
            try:
                channel.send(alert)
            except Exception as e:
                logger.error(f"Failed to send notification: {str(e)}")

    def add_notification_channel(self, channel):
        """Register a notification channel."""
        self.notification_channels.append(channel)

    def get_active_alerts(self) -> List[Dict[str, Any]]:
        """Return the currently active alerts."""
        with self.lock:
            return list(self.active_alerts.values())

    def get_alert_history(self, limit: int = 100) -> List[Dict[str, Any]]:
        """Return the most recent alert events."""
        with self.lock:
            return list(self.alert_history)[-limit:]
class APIMonitoringDashboard:
    """API monitoring dashboard"""

    def __init__(self,
                 metrics_collector: MetricsCollector,
                 aggregator: TimeSeriesAggregator,
                 anomaly_detector: AnomalyDetector,
                 alert_manager: AlertManager):
        """
        Initialize the monitoring dashboard.

        Args:
            metrics_collector: metrics collector
            aggregator: time-series aggregator
            anomaly_detector: anomaly detector
            alert_manager: alert manager
        """
        self.metrics_collector = metrics_collector
        self.aggregator = aggregator
        self.anomaly_detector = anomaly_detector
        self.alert_manager = alert_manager
        # Dashboard state
        self.start_time = time.time()
        self.request_count = 0
        self.error_count = 0

    def get_system_health(self) -> Dict[str, Any]:
        """Return the overall system health snapshot."""
        uptime = time.time() - self.start_time
        # Fetch the key metrics over the last 5 minutes
        request_stats = self.aggregator.get_aggregated_stats(
            "http_requests_total", MetricType.COUNTER, 300
        )
        error_stats = self.aggregator.get_aggregated_stats(
            "http_requests_by_status", MetricType.COUNTER, 300,
            {"status": "5xx"}
        )
        latency_stats = self.aggregator.get_aggregated_stats(
            "http_request_duration_seconds", MetricType.HISTOGRAM, 300
        )
        # Derive the error rate
        if request_stats.get('rate', 0) > 0:
            error_rate = error_stats.get('rate', 0) / request_stats.get('rate', 1)
        else:
            error_rate = 0
        health_score = self._calculate_health_score(
            request_stats.get('rate', 0),
            error_rate,
            latency_stats.get('p95', 0)
        )
        return {
            'uptime': uptime,
            'health_score': health_score,
            'request_rate': request_stats.get('rate', 0),
            'error_rate': error_rate,
            'latency_p95': latency_stats.get('p95', 0),
            'active_alerts': len(self.alert_manager.get_active_alerts()),
            'metric_count': len(self.aggregator.get_all_metrics_summary())
        }

    def _calculate_health_score(self, request_rate: float,
                                error_rate: float, latency: float) -> float:
        """Compute a weighted 0-100 health score."""
        # Simple weighted scoring
        max_request_rate = 1000  # assume a ceiling of 1000 req/s
        request_score = min(request_rate / max_request_rate, 1.0)
        error_score = 1.0 - min(error_rate * 10, 1.0)  # error rate >= 10% scores 0
        max_latency = 2.0  # assume a 2-second ceiling
        latency_score = 1.0 - min(latency / max_latency, 1.0)
        # Weighted average
        weights = {'request': 0.2, 'error': 0.5, 'latency': 0.3}
        total_score = (
            request_score * weights['request'] +
            error_score * weights['error'] +
            latency_score * weights['latency']
        )
        return round(total_score * 100, 2)  # as a percentage

    def get_api_performance(self, timeframe: int = 300) -> Dict[str, Any]:
        """Return API performance data for the given timeframe (seconds)."""
        # Request totals
        total_requests = self.aggregator.get_aggregated_stats(
            "http_requests_total", MetricType.COUNTER, timeframe
        )
        # Break down by status-code class
        status_codes = ["2xx", "3xx", "4xx", "5xx"]
        requests_by_status = {}
        for status in status_codes:
            stats = self.aggregator.get_aggregated_stats(
                "http_requests_by_status", MetricType.COUNTER, timeframe,
                {"status": status}
            )
            requests_by_status[status] = stats.get('sum', 0)
        # Latency distribution
        latency_stats = self.aggregator.get_aggregated_stats(
            "http_request_duration_seconds", MetricType.HISTOGRAM, timeframe
        )
        # Rate-limit statistics
        rate_limit_stats = self.aggregator.get_aggregated_stats(
            "rate_limit_requests", MetricType.COUNTER, timeframe
        )
        return {
            'timeframe': timeframe,
            'total_requests': total_requests.get('sum', 0),
            'request_rate': total_requests.get('rate', 0),
            'requests_by_status': requests_by_status,
            'success_rate': requests_by_status.get('2xx', 0) / max(total_requests.get('sum', 1), 1),
            'latency': {
                'avg': latency_stats.get('avg', 0),
                'p50': latency_stats.get('p50', 0),
                'p95': latency_stats.get('p95', 0),
                'p99': latency_stats.get('p99', 0),
                'max': latency_stats.get('max', 0)
            },
            'rate_limited': rate_limit_stats.get('sum', 0),
            'rate_limit_rate': rate_limit_stats.get('rate', 0)
        }

    def get_security_metrics(self) -> Dict[str, Any]:
        """Return security-related metrics."""
        # Authentication failures over the last 5 minutes
        auth_failures = self.aggregator.get_aggregated_stats(
            "auth_failed_requests", MetricType.COUNTER, 300
        )
        # Collect active security-related alerts
        security_alerts = []
        for alert in self.alert_manager.get_active_alerts():
            if any(keyword in alert['metric_name'] for keyword in
                   ['auth', 'security', 'attack', 'brute']):
                security_alerts.append(alert)
        return {
            'auth_failures': auth_failures.get('sum', 0),
            'auth_failure_rate': auth_failures.get('rate', 0),
            'active_security_alerts': len(security_alerts),
            'security_alerts': security_alerts[:10]  # most recent 10
        }

    def get_detailed_metrics(self) -> Dict[str, Any]:
        """Return detailed statistics for every tracked metric."""
        summary = self.aggregator.get_all_metrics_summary()
        detailed_metrics = {}
        for metric_name, windows in summary.items():
            detailed_metrics[metric_name] = {}
            for window_size, info in windows.items():
                stats = self.aggregator.get_aggregated_stats(
                    metric_name, MetricType(info['type']), window_size
                )
                detailed_metrics[metric_name][window_size] = stats
        return {
            'metrics_summary': summary,
            'detailed_metrics': detailed_metrics,
            'collector_status': self.metrics_collector.get_current_metrics(),
            'anomaly_detection': self.anomaly_detector.get_detection_status(),
            'active_alerts': self.alert_manager.get_active_alerts()
        }

    def generate_report(self) -> Dict[str, Any]:
        """Generate a full monitoring report."""
        return {
            'timestamp': datetime.utcnow().isoformat(),
            'system_health': self.get_system_health(),
            'api_performance': self.get_api_performance(),
            'security_metrics': self.get_security_metrics(),
            'alerts': {
                'active': self.alert_manager.get_active_alerts(),
                'recent': self.alert_manager.get_alert_history(20)
            }
        }
# Putting it together
class APIMonitoringSystem:
    """Integrated API monitoring system"""

    def __init__(self, redis_client=None):
        """
        Initialize the monitoring system.

        Note: must be constructed inside a running asyncio event loop,
        because the background flush task is created here.

        Args:
            redis_client: Redis client (for distributed storage)
        """
        # Wire up the components
        self.metrics_collector = MetricsCollector()
        self.aggregator = TimeSeriesAggregator()
        self.anomaly_detector = AnomalyDetector()
        self.alert_manager = AlertManager()
        self.dashboard = APIMonitoringDashboard(
            self.metrics_collector,
            self.aggregator,
            self.anomaly_detector,
            self.alert_manager
        )
        # Start the background tasks
        self._start_background_tasks()
        # Install the default alert rules
        self._setup_default_alerts()

    def _start_background_tasks(self):
        """Start the periodic metrics-flush task."""
        async def flush_metrics():
            while True:
                await asyncio.sleep(10)
                metrics = self.metrics_collector.flush()
                for metric in metrics:
                    self.aggregator.add_metric(metric)
                    # Anomaly detection
                    anomaly = self.anomaly_detector.detect_anomaly(metric)
                    if anomaly:
                        logger.warning(f"Anomaly detected: {anomaly}")
                    # Alert evaluation
                    self.alert_manager.check_metric(metric)
        self.background_task = asyncio.create_task(flush_metrics())

    def _setup_default_alerts(self):
        """Register the default alert rules."""
        # High error rate
        self.alert_manager.add_rule(AlertRule(
            name="high_error_rate",
            metric_name="http_requests_by_status",
            condition="> 0.1",  # error rate above 10%
            duration=60,
            severity="critical",
            message_template="Error rate exceeded threshold: {value} > 10%",
            tags={"status": "5xx"}
        ))
        # High latency
        self.alert_manager.add_rule(AlertRule(
            name="high_latency",
            metric_name="http_request_duration_seconds",
            condition="> 2.0",  # latency above 2 seconds
            duration=30,
            severity="warning",
            message_template="High latency detected: {value} > 2s"
        ))
        # Authentication failure spike
        self.alert_manager.add_rule(AlertRule(
            name="auth_failure_spike",
            metric_name="auth_failed_requests",
            condition="> 10",  # more than 10 per second
            duration=10,
            severity="warning",
            message_template="Authentication failure spike: {value} > 10/s"
        ))

    def record_request(self, method: str, path: str,
                       status_code: int, duration: float):
        """Record an HTTP request."""
        # Total request counter
        self.metrics_collector.increment_counter("http_requests_total")
        # Counter per status-code class (2xx, 3xx, 4xx, 5xx)
        status_group = f"{status_code // 100}xx"
        self.metrics_collector.increment_counter(
            "http_requests_by_status",
            tags={"status": status_group}
        )
        # Latency histogram
        self.metrics_collector.record_histogram(
            "http_request_duration_seconds",
            duration
        )
        # Update the dashboard counters
        self.dashboard.request_count += 1
        if status_code >= 500:
            self.dashboard.error_count += 1

    def record_auth_failure(self, reason: str):
        """Record an authentication failure."""
        self.metrics_collector.increment_counter(
            "auth_failed_requests",
            tags={"reason": reason}
        )

    def record_rate_limit(self, reason: str):
        """Record a rate-limited request."""
        self.metrics_collector.increment_counter(
            "rate_limit_requests",
            tags={"reason": reason}
        )

    def get_monitoring_data(self) -> Dict[str, Any]:
        """Return the full monitoring report."""
        return self.dashboard.generate_report()

    async def stop(self):
        """Stop the monitoring system."""
        if self.background_task:
            self.background_task.cancel()
            try:
                await self.background_task
            except asyncio.CancelledError:
                pass
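To make the detection logic concrete, here is a self-contained sketch of the same z-score test at the heart of `AnomalyDetector._check_anomaly`; the `sensitivity` value and the sample history are illustrative:

```python
import statistics

def is_anomaly(value, history, sensitivity=2.0):
    """Flag `value` as anomalous if it deviates from the history
    by more than `sensitivity` standard deviations (z-score test)."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        # Flat history: fall back to a 10% tolerance band
        return abs(value - mean) > abs(mean) * 0.1
    return abs(value - mean) / std > sensitivity

history = [100, 102, 98, 101, 99, 100, 103, 97, 101, 100]
print(is_anomaly(101, history))  # a normal reading
print(is_anomaly(150, history))  # a clear spike
```

The production detector adds the dynamic threshold (`np.log1p` term) so that well-established baselines tolerate slightly larger excursions.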
IV. System Integration and Best Practices
4.1 Complete Integration Example
python
"""
API网关完整集成示例
"""
import asyncio
from typing import Dict, Any
import uvicorn
from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import redis.asyncio as redis
app = FastAPI(title="API Gateway with Advanced Features")
security = HTTPBearer()
# 初始化组件
redis_client = redis.from_url("redis://localhost:6379", decode_responses=True)
# 初始化限流系统
rate_limiter = DistributedRateLimiter(redis_client)
# 初始化鉴权系统
jwt_auth = JWTAuthenticator(
secret_key="your-secret-key-here",
algorithm="HS256"
)
auth_manager = UnifiedAuthManager(jwt_auth)
# 初始化监控系统
monitoring_system = APIMonitoringSystem(redis_client)
# Dependency-injection helpers
async def get_rate_limiter():
    return rate_limiter

async def get_auth_manager():
    return auth_manager

async def get_monitoring_system():
    return monitoring_system

# Middleware
@app.middleware("http")
async def monitoring_middleware(request: Request, call_next):
    start_time = time.time()
    try:
        response = await call_next(request)
        # Record the request metrics
        duration = time.time() - start_time
        monitoring_system.record_request(
            request.method,
            str(request.url.path),
            response.status_code,
            duration
        )
        return response
    except Exception as e:
        duration = time.time() - start_time
        status_code = getattr(e, 'status_code', 500)
        monitoring_system.record_request(
            request.method,
            str(request.url.path),
            status_code,
            duration
        )
        raise
# Routes
@app.post("/api/login")
async def login(username: str, password: str):
    # Validate the user's credentials (simplified for the example)
    if username == "admin" and password == "password":
        user_context = UserContext(
            user_id="1",
            username="admin",
            roles=["admin"],
            permissions={"users:read", "users:write"}
        )
        token = jwt_auth.create_token(user_context)
        return {"access_token": token, "token_type": "bearer"}
    monitoring_system.record_auth_failure("invalid_credentials")
    raise HTTPException(status_code=401, detail="Invalid credentials")

@app.get("/api/protected")
async def protected_route(
    credentials: HTTPAuthorizationCredentials = Depends(security),
    auth_manager: UnifiedAuthManager = Depends(get_auth_manager)
):
    # Authentication
    headers = {"Authorization": f"Bearer {credentials.credentials}"}
    user_context = auth_manager.authenticate_request(headers)
    if not user_context:
        raise HTTPException(status_code=401, detail="Unauthorized")
    # Authorization
    if not auth_manager.authorize_request(
        user_context,
        "protected:resource",
        PermissionAction.READ
    ):
        raise HTTPException(status_code=403, detail="Forbidden")
    return {"message": "Access granted", "user": user_context.username}

@app.get("/api/metrics")
async def get_metrics(
    monitoring_system: APIMonitoringSystem = Depends(get_monitoring_system)
):
    """Expose the monitoring metrics."""
    return monitoring_system.get_monitoring_data()

@app.get("/api/health")
async def health_check():
    """Health-check endpoint."""
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "version": "1.0.0"
    }

# Run the application
if __name__ == "__main__":
    uvicorn.run(
        app,
        host="0.0.0.0",
        port=8000,
        log_level="info"
    )
4.2 Performance Optimization Suggestions
-
Rate limiting:
- Use a local cache to cut down on Redis round trips
- Batch operations to reduce network hops
- Implement the leaky-bucket algorithm to cope with bursty traffic
-
Authentication:
- Cache authorization decisions
- Back the JWT token blacklist with a Bloom filter
- Run permission checks asynchronously
-
Monitoring:
- Sample metrics to reduce storage volume
- Use columnar storage to speed up queries
- Implement predictive alerting
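As a sketch of the caching suggestions above (keeping hot rate-limit or authorization results in process to avoid a network hop on every request), a minimal TTL cache might look like the following; the class and parameter names are illustrative, not part of the gateway code in this article:

```python
import time

class TTLCache:
    """Tiny in-process cache with per-entry expiry, e.g. for caching
    (user, resource) -> allowed decisions for a few seconds."""

    def __init__(self, ttl_seconds=5.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict the stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

cache = TTLCache(ttl_seconds=0.1)
cache.set(("user:1", "users:read"), True)
print(cache.get(("user:1", "users:read")))  # True while fresh
time.sleep(0.2)
print(cache.get(("user:1", "users:read")))  # None after expiry
```

A short TTL keeps the window of stale authorization decisions bounded while still absorbing most of the repeated lookups.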
4.3 安全最佳实践
安全评分 = ∑ i = 1 n w i ⋅ S i \text{安全评分} = \sum_{i=1}^{n} w_i \cdot S_i 安全评分=i=1∑nwi⋅Si
其中 S i S_i Si 为各项安全措施得分, w i w_i wi 为权重:
- 多因素认证
- 定期密钥轮换
- 权限最小化原则
- 完整的审计日志
- 实时入侵检测
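Plugging numbers into the formula is straightforward; the per-measure scores and weights below are made-up values for illustration only:

```python
# Per-measure scores S_i (0..1) and weights w_i summing to 1 -- all illustrative.
scores  = {'mfa': 1.0, 'key_rotation': 0.8, 'least_privilege': 0.9,
           'audit_log': 1.0, 'intrusion_detection': 0.6}
weights = {'mfa': 0.3, 'key_rotation': 0.2, 'least_privilege': 0.2,
           'audit_log': 0.15, 'intrusion_detection': 0.15}

# Weighted sum over the security measures
security_score = sum(weights[k] * scores[k] for k in scores)
print(round(security_score, 2))  # 0.88
```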
V. Summary and Outlook
This article walked through the implementation of the three core functions of an API gateway:
5.1 Key Implementation Points
-
Distributed rate limiting:
- Multiple algorithms (token bucket, sliding window)
- Multi-dimensional rate-limit policies
- Adaptive rate limiting
-
Unified authentication:
- JWT authentication and refresh
- RBAC permission control
- Data-level permission filtering
- Complete audit logging
-
Real-time monitoring:
- Metric collection and aggregation
- Automatic anomaly detection
- Intelligent alerting
- Visual dashboard
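As a worked example of the dashboard's weighted health score, here is the scoring logic from `_calculate_health_score` re-implemented standalone; the weights mirror the dashboard code above, while the traffic numbers are illustrative:

```python
def health_score(request_rate, error_rate, latency_p95,
                 max_request_rate=1000.0, max_latency=2.0):
    """Weighted 0-100 health score: errors weigh most, then latency."""
    request_score = min(request_rate / max_request_rate, 1.0)
    error_score = 1.0 - min(error_rate * 10, 1.0)    # >=10% errors -> 0
    latency_score = 1.0 - min(latency_p95 / max_latency, 1.0)
    total = 0.2 * request_score + 0.5 * error_score + 0.3 * latency_score
    return round(total * 100, 2)

# 500 req/s, 1% errors, 200 ms p95 latency
print(health_score(request_rate=500, error_rate=0.01, latency_p95=0.2))  # 82.0
```

Because the error term carries half the weight, a modest rise in the error rate drags the score down much faster than the same relative change in latency.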
5.2 Production Considerations
-
High availability:
(Architecture: clients reach a load balancer that fronts a cluster of gateway nodes sharing common storage, in front of the backend services.)
-
Enhanced observability:
- Distributed tracing integration
- Business metric monitoring
- User behavior analytics
-
Extensibility:
- Plugin architecture
- Custom filters
- Hot configuration reload
5.3 Future Directions
-
AI enhancements:
- Machine-learning-based anomaly detection
- Intelligent rate-limit policies
- Predictive scaling
-
Edge computing:
- Edge gateway deployment
- Local rate limiting and authentication
- Edge cache optimization
-
Security hardening:
- Zero-trust architecture
- Threat detection via behavior analysis
- Automated security response
The implementation presented here is a complete starting point; a real production deployment will need tuning for specific business requirements. Each component is designed to be independently extensible and replaceable, keeping the system flexible and maintainable.