1. Overview
1.1 Why AI-Driven Workflow Optimization?
Traditional workflow optimization relies on manual analysis of execution logs and performance metrics, an approach with several problems:
- Inefficient: analyzing huge volumes of data takes a great deal of time
- Expertise-bound: results depend on expert experience and are hard to standardize
- Lagging: problems are only discovered and fixed after they occur
- Locally scoped: global performance bottlenecks are hard to spot
An AI-driven optimization system can instead:
- ✅ Detect automatically: monitor in real time and identify performance problems automatically
- ✅ Analyze intelligently: recognize complex patterns with machine learning
- ✅ Optimize proactively: predict and optimize before problems occur
- ✅ Improve continuously: validate optimization effects through A/B testing
1.2 Technology Stack
# requirements.txt
fastapi==0.104.1
sqlalchemy==2.0.23
pandas==2.1.3
numpy==1.26.2
scikit-learn==1.3.2
xgboost==2.0.2
tensorflow==2.15.0  # optional, for deep learning
prometheus-client==0.19.0
redis==5.0.1
celery==5.3.4
plotly==5.18.0
statsmodels==0.14.0
2. System Architecture Design
2.1 Overall Architecture
# architecture/ai_optimizer_architecture.py
class AIOptimizerArchitecture:
"""
AI优化器架构设计
组件:
1. 数据采集层:收集工作流执行数据
2. 存储层:时序数据库 + 特征存储
3. 分析层:瓶颈检测 + 根因分析
4. 优化层:建议生成 + A/B测试
5. 学习层:模型训练 + 知识更新
"""
def __init__(self):
self.components = {
"data_collection": {
"collectors": [
"ExecutionMetricsCollector",
"ResourceMetricsCollector",
"DependencyMetricsCollector"
],
"storage": "TimescaleDB",
"streaming": "Kafka"
},
"bottleneck_detection": {
"algorithms": [
"StatisticalDetector",
"AnomalyDetector",
"PatternMatcher",
"MLClassifier"
],
"models": [
"IsolationForest",
"LSTM-Autoencoder",
"XGBoost"
]
},
"optimization_engine": {
"strategies": [
"ParameterTuning",
"ResourceAllocation",
"ParallelizationOptimization",
"DependencyOptimization"
],
"ab_testing": "ExperimentFramework"
},
"ml_training": {
"features": "FeatureStore",
"training": "TrainingPipeline",
"serving": "ModelServing",
"monitoring": "ModelMonitoring"
}
}
def get_data_flow(self):
"""数据流设计"""
return """
工作流执行
↓
[数据采集层]
├─ 执行指标(耗时、成功率等)
├─ 资源指标(CPU、内存、IO等)
└─ 依赖指标(DAG结构、并发度等)
↓
[时序数据库 + 特征存储]
↓
[瓶颈检测]
├─ 统计分析(均值、方差、分位数)
├─ 异常检测(孤立森林、LOF)
├─ 模式识别(聚类、关联规则)
└─ ML分类(XGBoost、神经网络)
↓
[根因分析]
├─ 相关性分析
├─ 因果推断
└─ 影响力评估
↓
[优化建议生成]
├─ 参数调优建议
├─ 资源配置建议
├─ 架构优化建议
└─ 业务逻辑建议
↓
[A/B测试验证]
├─ 实验设计
├─ 流量分配
├─ 效果评估
└─ 自动上线/回滚
↓
[模型更新]
└─ 持续学习
"""
# Print the architecture
if __name__ == "__main__":
arch = AIOptimizerArchitecture()
print(arch.get_data_flow())
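Of the stages above, root cause analysis is the only one not fleshed out later in this article, so here is a minimal sketch of its correlation-analysis step. It assumes metric history is available as equal-length per-execution series; the function `rank_root_cause_candidates` and its inputs are illustrative, not part of the system's API.

# Minimal sketch of the correlation step in root cause analysis.
# Assumption: `target` is the degraded metric (e.g. duration per execution) and
# `candidates` maps candidate-cause names to their per-execution series.
import numpy as np

def rank_root_cause_candidates(target, candidates):
    """Rank candidate causes by absolute Pearson correlation with the target metric."""
    scores = []
    for name, series in candidates.items():
        if len(series) != len(target) or np.std(series) == 0 or np.std(target) == 0:
            continue  # skip misaligned or constant series
        r = float(np.corrcoef(target, series)[0, 1])
        scores.append((name, abs(r)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Example: queue time tracks duration much more closely than CPU does
durations = [100, 110, 150, 160, 200]
print(rank_root_cause_candidates(durations, {
    "queue_time": [5, 8, 20, 25, 40],
    "cpu_percent": [50, 52, 51, 50, 53],
}))

Correlation only shortlists candidates; the causal-inference step in the diagram would still have to rule out confounders.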
2.2 Database Design
# models/ai_optimizer_models.py
from sqlalchemy import Column, Integer, String, Float, DateTime, JSON, Boolean, Text, ForeignKey, Index
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from datetime import datetime
Base = declarative_base()
class WorkflowExecutionMetrics(Base):
"""工作流执行指标"""
__tablename__ = "workflow_execution_metrics"
id = Column(Integer, primary_key=True)
execution_id = Column(Integer, ForeignKey("workflow_executions.id"), nullable=False)
workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
# Timing metrics
start_time = Column(DateTime, nullable=False)
end_time = Column(DateTime)
duration_seconds = Column(Float)
# Task metrics
total_tasks = Column(Integer)
completed_tasks = Column(Integer)
failed_tasks = Column(Integer)
retried_tasks = Column(Integer)
# Resource metrics
avg_cpu_percent = Column(Float)
max_cpu_percent = Column(Float)
avg_memory_mb = Column(Float)
max_memory_mb = Column(Float)
total_io_read_mb = Column(Float)
total_io_write_mb = Column(Float)
# Concurrency metrics
max_parallel_tasks = Column(Integer)
avg_parallel_tasks = Column(Float)
# Data metrics
input_data_size_mb = Column(Float)
output_data_size_mb = Column(Float)
# Quality metrics
success_rate = Column(Float)
error_rate = Column(Float)
# Cost metrics
estimated_cost = Column(Float)
# Extended metrics (JSON)
custom_metrics = Column(JSON)
created_at = Column(DateTime, default=datetime.utcnow)
# Relationships
execution = relationship("WorkflowExecution", back_populates="metrics")
# Indexes
__table_args__ = (
Index('idx_metrics_workflow_time', 'workflow_id', 'start_time'),
Index('idx_metrics_execution', 'execution_id'),
Index('idx_metrics_duration', 'duration_seconds'),
)
class TaskExecutionMetrics(Base):
"""任务执行指标"""
__tablename__ = "task_execution_metrics"
id = Column(Integer, primary_key=True)
task_execution_id = Column(Integer, ForeignKey("task_executions.id"), nullable=False)
workflow_execution_id = Column(Integer, ForeignKey("workflow_executions.id"), nullable=False)
task_id = Column(Integer, ForeignKey("workflow_tasks.id"), nullable=False)
# Timing metrics
start_time = Column(DateTime, nullable=False)
end_time = Column(DateTime)
duration_seconds = Column(Float)
queue_time_seconds = Column(Float)  # time spent waiting in the queue
# Resource metrics
cpu_percent = Column(Float)
memory_mb = Column(Float)
io_read_mb = Column(Float)
io_write_mb = Column(Float)
network_in_mb = Column(Float)
network_out_mb = Column(Float)
# Data metrics
input_records = Column(Integer)
output_records = Column(Integer)
input_size_mb = Column(Float)
output_size_mb = Column(Float)
# Retry metrics
retry_count = Column(Integer, default=0)
retry_delay_seconds = Column(Float)
# Dependency metrics
upstream_tasks_count = Column(Integer)
upstream_wait_seconds = Column(Float)
# Extended metrics
custom_metrics = Column(JSON)
created_at = Column(DateTime, default=datetime.utcnow)
# Indexes
__table_args__ = (
Index('idx_task_metrics_workflow_exec', 'workflow_execution_id'),
Index('idx_task_metrics_task', 'task_id', 'start_time'),
Index('idx_task_metrics_duration', 'duration_seconds'),
)
class PerformanceBottleneck(Base):
"""性能瓶颈记录"""
__tablename__ = "performance_bottlenecks"
id = Column(Integer, primary_key=True)
workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
execution_id = Column(Integer, ForeignKey("workflow_executions.id"))
task_id = Column(Integer, ForeignKey("workflow_tasks.id"))
# Bottleneck type
bottleneck_type = Column(String(50), nullable=False) # cpu, memory, io, dependency, logic
severity = Column(String(20), nullable=False) # low, medium, high, critical
# Detection info
detected_at = Column(DateTime, default=datetime.utcnow)
detection_method = Column(String(50)) # statistical, anomaly, ml, pattern
confidence_score = Column(Float) # 0-1
# Bottleneck description
description = Column(Text)
impact_analysis = Column(JSON)  # impact analysis
root_cause = Column(JSON)  # root cause analysis
# Metrics
baseline_metrics = Column(JSON)  # baseline metrics
current_metrics = Column(JSON)  # current metrics
deviation_percent = Column(Float)  # deviation percentage
# Status
status = Column(String(20), default="open") # open, investigating, resolved, ignored
resolved_at = Column(DateTime)
resolution_notes = Column(Text)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
# Indexes
__table_args__ = (
Index('idx_bottleneck_workflow', 'workflow_id', 'status'),
Index('idx_bottleneck_severity', 'severity', 'detected_at'),
Index('idx_bottleneck_type', 'bottleneck_type'),
)
class OptimizationRecommendation(Base):
"""优化建议"""
__tablename__ = "optimization_recommendations"
id = Column(Integer, primary_key=True)
bottleneck_id = Column(Integer, ForeignKey("performance_bottlenecks.id"), nullable=False)
workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
task_id = Column(Integer, ForeignKey("workflow_tasks.id"))
# Recommendation type
recommendation_type = Column(String(50), nullable=False) # parameter, resource, architecture, logic
priority = Column(String(20), nullable=False) # low, medium, high, urgent
# Recommendation content
title = Column(String(200), nullable=False)
description = Column(Text)
rationale = Column(Text)  # rationale
# Optimization parameters
current_config = Column(JSON)
recommended_config = Column(JSON)
expected_improvement = Column(JSON)  # expected improvement
# Implementation info
implementation_difficulty = Column(String(20)) # easy, medium, hard
estimated_effort_hours = Column(Float)
implementation_steps = Column(JSON)
# Risk assessment
risk_level = Column(String(20)) # low, medium, high
potential_issues = Column(JSON)
rollback_plan = Column(JSON)
# Status
status = Column(String(20), default="pending") # pending, approved, testing, implemented, rejected
approved_by = Column(Integer, ForeignKey("users.id"))
approved_at = Column(DateTime)
# Outcome tracking
ab_test_id = Column(Integer, ForeignKey("ab_experiments.id"))
actual_improvement = Column(JSON)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
# Indexes
__table_args__ = (
Index('idx_recommendation_workflow', 'workflow_id', 'status'),
Index('idx_recommendation_priority', 'priority', 'status'),
)
class ABExperiment(Base):
"""A/B测试实验"""
__tablename__ = "ab_experiments"
id = Column(Integer, primary_key=True)
workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
recommendation_id = Column(Integer, ForeignKey("optimization_recommendations.id"))
# Experiment info
name = Column(String(200), nullable=False)
description = Column(Text)
hypothesis = Column(Text)  # the hypothesis under test
# Experiment configuration
control_config = Column(JSON, nullable=False)  # control group configuration
treatment_config = Column(JSON, nullable=False)  # treatment group configuration
traffic_split = Column(Float, default=0.5)  # share of traffic routed to the treatment group
# Evaluation metrics
primary_metric = Column(String(100), nullable=False)  # primary metric
secondary_metrics = Column(JSON)  # secondary metrics
success_criteria = Column(JSON)  # success criteria
# Experiment controls
min_sample_size = Column(Integer, default=100)  # minimum sample size
max_duration_days = Column(Integer, default=7)  # maximum duration in days
early_stopping_enabled = Column(Boolean, default=True)
# Experiment status
status = Column(String(20), default="draft") # draft, running, paused, completed, cancelled
started_at = Column(DateTime)
ended_at = Column(DateTime)
# Experiment results
control_group_size = Column(Integer)
treatment_group_size = Column(Integer)
control_metrics = Column(JSON)
treatment_metrics = Column(JSON)
statistical_significance = Column(Float) # p-value
confidence_interval = Column(JSON)
# Decision
decision = Column(String(20))  # winner_control, winner_treatment, no_difference, inconclusive
decision_reason = Column(Text)
auto_rollout = Column(Boolean, default=False)  # whether to roll out automatically
created_by = Column(Integer, ForeignKey("users.id"))
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
# Indexes
__table_args__ = (
Index('idx_experiment_workflow', 'workflow_id', 'status'),
Index('idx_experiment_status', 'status', 'started_at'),
)
class MLModel(Base):
"""机器学习模型"""
__tablename__ = "ml_models"
id = Column(Integer, primary_key=True)
# Model info
name = Column(String(200), nullable=False)
model_type = Column(String(50), nullable=False)  # classifier, regressor, anomaly_detector, recommender
algorithm = Column(String(100), nullable=False)  # xgboost, random_forest, lstm, etc.
purpose = Column(String(200))  # intended purpose
# Training info
training_data_size = Column(Integer)
feature_count = Column(Integer)
training_duration_seconds = Column(Float)
trained_at = Column(DateTime)
# Model configuration
hyperparameters = Column(JSON)
feature_config = Column(JSON)
preprocessing_config = Column(JSON)
# Performance metrics
training_metrics = Column(JSON) # accuracy, precision, recall, f1, rmse, etc.
validation_metrics = Column(JSON)
test_metrics = Column(JSON)
# Model artifacts
model_path = Column(String(500))  # path to the model file
model_version = Column(String(50))
framework = Column(String(50)) # sklearn, tensorflow, pytorch, xgboost
# Deployment status
status = Column(String(20), default="trained") # trained, deployed, archived, deprecated
deployed_at = Column(DateTime)
# Performance monitoring
prediction_count = Column(Integer, default=0)
avg_prediction_time_ms = Column(Float)
error_count = Column(Integer, default=0)
last_prediction_at = Column(DateTime)
# Version management
parent_model_id = Column(Integer, ForeignKey("ml_models.id"))
is_active = Column(Boolean, default=False)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
# Indexes
__table_args__ = (
Index('idx_model_type_status', 'model_type', 'status'),
Index('idx_model_active', 'is_active', 'model_type'),
)
class FeatureStore(Base):
"""特征存储"""
__tablename__ = "feature_store"
id = Column(Integer, primary_key=True)
# Entity identification
entity_type = Column(String(50), nullable=False) # workflow, task, execution
entity_id = Column(Integer, nullable=False)
# Timestamp
timestamp = Column(DateTime, nullable=False)
# Features
features = Column(JSON, nullable=False)  # dictionary of feature values
# Metadata
feature_version = Column(String(50))
feature_schema = Column(JSON)
created_at = Column(DateTime, default=datetime.utcnow)
# Indexes
__table_args__ = (
Index('idx_feature_entity', 'entity_type', 'entity_id', 'timestamp'),
Index('idx_feature_timestamp', 'timestamp'),
)
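Because the feature store is just a JSON-blob-per-row table, reading and writing features is plain SQLAlchemy. A minimal sketch, assuming the same `SessionLocal` factory used in the examples below; the feature names are illustrative:

# Sketch: write one feature vector and read the latest one back
from datetime import datetime
from database import SessionLocal
from models.ai_optimizer_models import FeatureStore

db = SessionLocal()
db.add(FeatureStore(
    entity_type="workflow",
    entity_id=1,
    timestamp=datetime.utcnow(),
    features={"avg_duration": 120.5, "p95_duration": 240.0, "failure_rate": 0.02},
    feature_version="v1",
))
db.commit()

# Most recent features for the entity
row = (
    db.query(FeatureStore)
    .filter(FeatureStore.entity_type == "workflow", FeatureStore.entity_id == 1)
    .order_by(FeatureStore.timestamp.desc())
    .first()
)
print(row.features)
db.close()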
3. Execution Data Collection System
3.1 Metrics Collector
# services/metrics_collector.py
from typing import Dict, Any, List, Optional
from datetime import datetime
import psutil
import time
from sqlalchemy.orm import Session
from models.ai_optimizer_models import WorkflowExecutionMetrics, TaskExecutionMetrics
import logging
logger = logging.getLogger(__name__)
class MetricsCollector:
"""
Metrics collector
Collects execution metrics for workflows and tasks
"""
def __init__(self, db: Session):
self.db = db
self._active_measurements = {}  # in-progress measurements
def start_workflow_measurement(
self,
execution_id: int,
workflow_id: int
) -> str:
"""
Begin a workflow metrics measurement
Returns:
measurement_id: the measurement ID
"""
measurement_id = f"wf_{execution_id}_{int(time.time())}"
self._active_measurements[measurement_id] = {
"type": "workflow",
"execution_id": execution_id,
"workflow_id": workflow_id,
"start_time": datetime.utcnow(),
"start_cpu": psutil.cpu_percent(interval=0.1),
"start_memory": psutil.virtual_memory().used / (1024 * 1024),  # MB, to match the *_mb columns
"start_io": psutil.disk_io_counters(),
"task_metrics": []
}
logger.info(f"Started workflow measurement: {measurement_id}")
return measurement_id
def end_workflow_measurement(
self,
measurement_id: str,
additional_metrics: Optional[Dict[str, Any]] = None
) -> WorkflowExecutionMetrics:
"""
End the workflow measurement and persist the metrics
"""
if measurement_id not in self._active_measurements:
raise ValueError(f"Measurement not found: {measurement_id}")
measurement = self._active_measurements[measurement_id]
end_time = datetime.utcnow()
# Sample CPU and memory again at the end of the run
end_cpu = psutil.cpu_percent(interval=0.1)
end_memory = psutil.virtual_memory().used / (1024 * 1024)  # MB
end_io = psutil.disk_io_counters()
# Compute IO deltas
io_read_mb = (end_io.read_bytes - measurement["start_io"].read_bytes) / (1024 * 1024)
io_write_mb = (end_io.write_bytes - measurement["start_io"].write_bytes) / (1024 * 1024)
# Aggregate task statistics
task_metrics = measurement.get("task_metrics", [])
total_tasks = len(task_metrics)
completed_tasks = sum(1 for tm in task_metrics if tm.get("status") == "completed")
failed_tasks = sum(1 for tm in task_metrics if tm.get("status") == "failed")
retried_tasks = sum(1 for tm in task_metrics if tm.get("retry_count", 0) > 0)
# Create the metrics record
metrics = WorkflowExecutionMetrics(
execution_id=measurement["execution_id"],
workflow_id=measurement["workflow_id"],
start_time=measurement["start_time"],
end_time=end_time,
duration_seconds=(end_time - measurement["start_time"]).total_seconds(),
total_tasks=total_tasks,
completed_tasks=completed_tasks,
failed_tasks=failed_tasks,
retried_tasks=retried_tasks,
avg_cpu_percent=(measurement["start_cpu"] + end_cpu) / 2,
max_cpu_percent=max(measurement["start_cpu"], end_cpu),
avg_memory_mb=(measurement["start_memory"] + end_memory) / 2,
max_memory_mb=max(measurement["start_memory"], end_memory),
total_io_read_mb=io_read_mb,
total_io_write_mb=io_write_mb,
success_rate=completed_tasks / total_tasks if total_tasks > 0 else 0,
error_rate=failed_tasks / total_tasks if total_tasks > 0 else 0,
custom_metrics=additional_metrics or {}
)
self.db.add(metrics)
self.db.commit()
# Clean up the measurement state
del self._active_measurements[measurement_id]
logger.info(f"Ended workflow measurement: {measurement_id}")
return metrics
def collect_task_metrics(
self,
task_execution_id: int,
workflow_execution_id: int,
task_id: int,
start_time: datetime,
end_time: datetime,
resource_usage: Dict[str, Any],
additional_metrics: Optional[Dict[str, Any]] = None
) -> TaskExecutionMetrics:
"""
Collect execution metrics for a single task
"""
duration = (end_time - start_time).total_seconds()
metrics = TaskExecutionMetrics(
task_execution_id=task_execution_id,
workflow_execution_id=workflow_execution_id,
task_id=task_id,
start_time=start_time,
end_time=end_time,
duration_seconds=duration,
cpu_percent=resource_usage.get("cpu_percent", 0),
memory_mb=resource_usage.get("memory_mb", 0),
io_read_mb=resource_usage.get("io_read_mb", 0),
io_write_mb=resource_usage.get("io_write_mb", 0),
input_records=resource_usage.get("input_records", 0),
output_records=resource_usage.get("output_records", 0),
retry_count=resource_usage.get("retry_count", 0),
custom_metrics=additional_metrics or {}
)
self.db.add(metrics)
self.db.commit()
return metrics
def get_historical_metrics(
self,
workflow_id: int,
days: int = 30,
task_id: Optional[int] = None
) -> Dict[str, Any]:
"""
Get aggregated historical metrics
"""
from datetime import timedelta
from sqlalchemy import func
cutoff_time = datetime.utcnow() - timedelta(days=days)
# Workflow-level metrics
wf_metrics = self.db.query(
func.avg(WorkflowExecutionMetrics.duration_seconds).label("avg_duration"),
func.max(WorkflowExecutionMetrics.duration_seconds).label("max_duration"),
func.min(WorkflowExecutionMetrics.duration_seconds).label("min_duration"),
func.stddev(WorkflowExecutionMetrics.duration_seconds).label("stddev_duration"),
func.avg(WorkflowExecutionMetrics.success_rate).label("avg_success_rate"),
func.count(WorkflowExecutionMetrics.id).label("execution_count")
).filter(
WorkflowExecutionMetrics.workflow_id == workflow_id,
WorkflowExecutionMetrics.start_time >= cutoff_time
).first()
result = {
"workflow": {
"avg_duration": float(wf_metrics.avg_duration or 0),
"max_duration": float(wf_metrics.max_duration or 0),
"min_duration": float(wf_metrics.min_duration or 0),
"stddev_duration": float(wf_metrics.stddev_duration or 0),
"avg_success_rate": float(wf_metrics.avg_success_rate or 0),
"execution_count": int(wf_metrics.execution_count or 0)
}
}
# Task-level metrics (if task_id was given)
if task_id:
task_metrics = self.db.query(
func.avg(TaskExecutionMetrics.duration_seconds).label("avg_duration"),
func.max(TaskExecutionMetrics.duration_seconds).label("max_duration"),
func.min(TaskExecutionMetrics.duration_seconds).label("min_duration"),
func.stddev(TaskExecutionMetrics.duration_seconds).label("stddev_duration"),
func.avg(TaskExecutionMetrics.cpu_percent).label("avg_cpu"),
func.avg(TaskExecutionMetrics.memory_mb).label("avg_memory"),
func.sum(TaskExecutionMetrics.retry_count).label("total_retries"),
func.count(TaskExecutionMetrics.id).label("execution_count")
).filter(
TaskExecutionMetrics.task_id == task_id,
TaskExecutionMetrics.start_time >= cutoff_time
).first()
result["task"] = {
"avg_duration": float(task_metrics.avg_duration or 0),
"max_duration": float(task_metrics.max_duration or 0),
"min_duration": float(task_metrics.min_duration or 0),
"stddev_duration": float(task_metrics.stddev_duration or 0),
"avg_cpu": float(task_metrics.avg_cpu or 0),
"avg_memory": float(task_metrics.avg_memory or 0),
"total_retries": int(task_metrics.total_retries or 0),
"execution_count": int(task_metrics.execution_count or 0)
}
return result
# Usage example
if __name__ == "__main__":
from database import SessionLocal
db = SessionLocal()
collector = MetricsCollector(db)
# Start a workflow measurement
measurement_id = collector.start_workflow_measurement(
execution_id=1,
workflow_id=1
)
# Simulate workflow execution
time.sleep(2)
# End the measurement
metrics = collector.end_workflow_measurement(
measurement_id,
additional_metrics={"custom_field": "value"}
)
print(f"Collected metrics:")
print(f" Duration: {metrics.duration_seconds}s")
print(f" CPU: {metrics.avg_cpu_percent}%")
print(f" Memory: {metrics.avg_memory_mb}MB")
# Fetch historical metrics
historical = collector.get_historical_metrics(workflow_id=1, days=7)
print(f"\nHistorical metrics:")
print(f" Avg duration: {historical['workflow']['avg_duration']}s")
print(f" Success rate: {historical['workflow']['avg_success_rate'] * 100}%")
db.close()
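The start/end pairing above is easy to leak if the workflow body raises. A small context-manager wrapper keeps the two calls matched; this is a sketch on top of the MetricsCollector above, not part of its API:

# Sketch: context manager that guarantees end_workflow_measurement is called
from contextlib import contextmanager

@contextmanager
def workflow_measurement(collector, execution_id, workflow_id, **extra_metrics):
    measurement_id = collector.start_workflow_measurement(execution_id, workflow_id)
    try:
        yield measurement_id
    finally:
        # Persist metrics even if the workflow body raised
        collector.end_workflow_measurement(measurement_id, additional_metrics=extra_metrics or None)

# Usage:
# with workflow_measurement(collector, execution_id=1, workflow_id=1):
#     run_workflow()  # hypothetical workflow body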
3.2 Real-Time Data Stream Processing
# services/streaming_metrics.py
from typing import Dict, Any, Callable
import json
import redis
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
class StreamingMetricsProcessor:
"""
Real-time metrics stream processor
Uses Redis Streams to process live metric data
"""
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
self.stream_name = "workflow_metrics_stream"
self.consumer_group = "metrics_processors"
# Create the consumer group (if it does not exist)
try:
self.redis.xgroup_create(
self.stream_name,
self.consumer_group,
id='0',
mkstream=True
)
except redis.exceptions.ResponseError as e:
if "BUSYGROUP" not in str(e):
raise
def publish_metric(
self,
metric_type: str,
entity_id: int,
data: Dict[str, Any]
):
"""
Publish a metric to the stream
"""
message = {
"metric_type": metric_type,
"entity_id": str(entity_id),
"timestamp": datetime.utcnow().isoformat(),
"data": json.dumps(data)
}
message_id = self.redis.xadd(self.stream_name, message)
logger.debug(f"Published metric {message_id}: {metric_type}")
return message_id
def consume_metrics(
self,
consumer_name: str,
handler: Callable[[Dict[str, Any]], None],
block_ms: int = 5000,
count: int = 10
):
"""
Consume the metrics stream
Args:
consumer_name: name of this consumer
handler: callback invoked for each message
block_ms: blocking wait time in milliseconds
count: number of messages to read per call
"""
while True:
try:
# Read messages
messages = self.redis.xreadgroup(
self.consumer_group,
consumer_name,
{self.stream_name: '>'},
count=count,
block=block_ms
)
if not messages:
continue
for stream_name, stream_messages in messages:
for message_id, message_data in stream_messages:
try:
# Parse the message
metric = {
"id": message_id.decode(),
"metric_type": message_data[b"metric_type"].decode(),
"entity_id": int(message_data[b"entity_id"].decode()),
"timestamp": message_data[b"timestamp"].decode(),
"data": json.loads(message_data[b"data"].decode())
}
# Handle the message
handler(metric)
# Acknowledge the message
self.redis.xack(
self.stream_name,
self.consumer_group,
message_id
)
except Exception as e:
logger.error(f"Error processing message {message_id}: {e}")
# retry logic or a dead-letter queue could be added here (see the sketch after this example)
except KeyboardInterrupt:
logger.info("Stopping consumer...")
break
except Exception as e:
logger.error(f"Consumer error: {e}")
import time
time.sleep(1)
def get_stream_info(self) -> Dict[str, Any]:
"""获取流信息"""
info = self.redis.xinfo_stream(self.stream_name)
return {
"length": info[b"length"],
"groups": info[b"groups"],
"first_entry": info.get(b"first-entry"),
"last_entry": info.get(b"last-entry")
}
# Usage example
if __name__ == "__main__":
import redis
import threading
# Connect to Redis
r = redis.Redis(host='localhost', port=6379, db=0)
processor = StreamingMetricsProcessor(r)
# Publish metrics
def publish_test_metrics():
for i in range(10):
processor.publish_metric(
metric_type="task_duration",
entity_id=i,
data={
"duration_seconds": 5.5 + i,
"cpu_percent": 50 + i,
"memory_mb": 100 + i * 10
}
)
import time
time.sleep(0.5)
# Consume metrics
def consume_test_metrics():
def handler(metric):
print(f"Received metric: {metric['metric_type']} "
f"for entity {metric['entity_id']}")
print(f" Data: {metric['data']}")
processor.consume_metrics(
consumer_name="test_consumer",
handler=handler
)
# Start the producer and consumer
producer_thread = threading.Thread(target=publish_test_metrics)
consumer_thread = threading.Thread(target=consume_test_metrics)
consumer_thread.start()
producer_thread.start()
producer_thread.join()
# consumer_thread runs indefinitely and must be stopped manually
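The error branch in `consume_metrics` only logs failures, and the comment there leaves retries and dead letters open. One hedged way to complete it: consult XPENDING for the delivery count and, past a limit, copy the message to a side stream and acknowledge it so the group keeps moving. The `workflow_metrics_dlq` stream name is illustrative:

# Sketch: dead-letter handling for repeatedly failing messages
def send_to_dead_letter(redis_client, stream_name, group, message_id, message_data,
                        dlq_name="workflow_metrics_dlq", max_deliveries=3):
    # XPENDING reports how many times this message has been delivered
    pending = redis_client.xpending_range(stream_name, group,
                                          min=message_id, max=message_id, count=1)
    deliveries = pending[0]["times_delivered"] if pending else 0
    if deliveries >= max_deliveries:
        # Copy to the dead-letter stream, then ack so the group moves on
        redis_client.xadd(dlq_name, message_data)
        redis_client.xack(stream_name, group, message_id)
        return True
    return False  # leave it pending for redelivery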
4. Performance Bottleneck Detection Algorithms
4.1 Statistical Detector
# services/bottleneck_detectors/statistical_detector.py
from typing import Dict, Any, List, Optional
import numpy as np
from scipy import stats
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
WorkflowExecutionMetrics,
TaskExecutionMetrics,
PerformanceBottleneck
)
from datetime import datetime, timedelta
import logging
logger = logging.getLogger(__name__)
class StatisticalBottleneckDetector:
"""
Statistical bottleneck detector
Detects performance anomalies with statistical methods
"""
def __init__(self, db: Session):
self.db = db
self.confidence_level = 0.95  # confidence level
self.z_score_threshold = 3.0  # Z-score threshold
def detect_workflow_bottlenecks(
self,
workflow_id: int,
lookback_days: int = 30
) -> List[PerformanceBottleneck]:
"""
Detect workflow-level bottlenecks
"""
bottlenecks = []
# Fetch historical data
cutoff_time = datetime.utcnow() - timedelta(days=lookback_days)
metrics = self.db.query(WorkflowExecutionMetrics).filter(
WorkflowExecutionMetrics.workflow_id == workflow_id,
WorkflowExecutionMetrics.start_time >= cutoff_time
).all()
if len(metrics) < 10:
logger.warning(f"Not enough data for workflow {workflow_id}")
return bottlenecks
# Extract metric series
durations = [m.duration_seconds for m in metrics if m.duration_seconds]
cpu_usages = [m.avg_cpu_percent for m in metrics if m.avg_cpu_percent]
memory_usages = [m.avg_memory_mb for m in metrics if m.avg_memory_mb]
success_rates = [m.success_rate for m in metrics if m.success_rate is not None]
# Detect duration outliers
duration_bottlenecks = self._detect_outliers(
data=durations,
metric_name="duration",
workflow_id=workflow_id,
baseline_metrics={"mean": np.mean(durations), "std": np.std(durations)}
)
bottlenecks.extend(duration_bottlenecks)
# Detect CPU usage outliers
cpu_bottlenecks = self._detect_outliers(
data=cpu_usages,
metric_name="cpu",
workflow_id=workflow_id,
baseline_metrics={"mean": np.mean(cpu_usages), "std": np.std(cpu_usages)}
)
bottlenecks.extend(cpu_bottlenecks)
# Detect success-rate degradation
if success_rates:
success_rate_bottlenecks = self._detect_degradation(
data=success_rates,
metric_name="success_rate",
workflow_id=workflow_id,
threshold=0.9,
direction="lower"
)
bottlenecks.extend(success_rate_bottlenecks)
return bottlenecks
def _detect_outliers(
self,
data: List[float],
metric_name: str,
workflow_id: int,
task_id: Optional[int] = None,
baseline_metrics: Optional[Dict[str, float]] = None
) -> List[PerformanceBottleneck]:
"""
Detect outliers using Z-scores
"""
if len(data) < 3:
return []
bottlenecks = []
# Compute summary statistics
mean = np.mean(data)
std = np.std(data)
if std == 0:
return []
# Compute Z-scores
z_scores = [(x - mean) / std for x in data]
# Detect anomalies
recent_values = data[-5:]  # the five most recent values
recent_z_scores = z_scores[-5:]
# If recent values are persistently anomalous
outlier_count = sum(1 for z in recent_z_scores if abs(z) > self.z_score_threshold)
if outlier_count >= 3:  # at least 3 outliers
# Determine the bottleneck type
if metric_name == "duration":
bottleneck_type = "performance"
elif metric_name == "cpu":
bottleneck_type = "cpu"
elif metric_name == "memory":
bottleneck_type = "memory"
else:
bottleneck_type = "unknown"
# Compute the deviation
latest_value = recent_values[-1]
deviation = ((latest_value - mean) / mean) * 100 if mean != 0 else 0
# Determine the severity
if abs(deviation) > 100:
severity = "critical"
elif abs(deviation) > 50:
severity = "high"
elif abs(deviation) > 20:
severity = "medium"
else:
severity = "low"
bottleneck = PerformanceBottleneck(
workflow_id=workflow_id,
task_id=task_id,
bottleneck_type=bottleneck_type,
severity=severity,
detection_method="statistical",
confidence_score=min(outlier_count / 5.0, 1.0),
description=f"{metric_name} shows statistical outliers",
baseline_metrics=baseline_metrics or {"mean": mean, "std": std},
current_metrics={"latest": latest_value, "recent_avg": np.mean(recent_values)},
deviation_percent=abs(deviation),
impact_analysis={
"affected_metric": metric_name,
"z_score": max(abs(z) for z in recent_z_scores),
"outlier_count": outlier_count
}
)
bottlenecks.append(bottleneck)
return bottlenecks
def _detect_degradation(
self,
data: List[float],
metric_name: str,
workflow_id: int,
threshold: float,
direction: str = "lower" # "lower" or "upper"
) -> List[PerformanceBottleneck]:
"""
Detect performance degradation
"""
if len(data) < 10:
return []
bottlenecks = []
# Split into two segments: historical baseline and recent data
split_point = len(data) * 2 // 3
baseline_data = data[:split_point]
recent_data = data[split_point:]
baseline_mean = np.mean(baseline_data)
recent_mean = np.mean(recent_data)
# Two-sample t-test
t_stat, p_value = stats.ttest_ind(baseline_data, recent_data)
# Check for statistically significant degradation
is_significant = p_value < (1 - self.confidence_level)
if direction == "lower":
is_degraded = recent_mean < threshold and is_significant
else:
is_degraded = recent_mean > threshold and is_significant
if is_degraded:
degradation_percent = abs((recent_mean - baseline_mean) / baseline_mean) * 100
if degradation_percent > 30:
severity = "critical"
elif degradation_percent > 15:
severity = "high"
else:
severity = "medium"
bottleneck = PerformanceBottleneck(
workflow_id=workflow_id,
bottleneck_type="degradation",
severity=severity,
detection_method="statistical",
confidence_score=1 - p_value,
description=f"{metric_name} has degraded significantly",
baseline_metrics={"mean": baseline_mean, "threshold": threshold},
current_metrics={"mean": recent_mean},
deviation_percent=degradation_percent,
impact_analysis={
"t_statistic": t_stat,
"p_value": p_value,
"degradation_percent": degradation_percent
}
)
bottlenecks.append(bottleneck)
return bottlenecks
def detect_task_bottlenecks(
self,
workflow_id: int,
lookback_days: int = 30
) -> List[PerformanceBottleneck]:
"""
Detect task-level bottlenecks
Identifies the slowest tasks within a workflow
"""
cutoff_time = datetime.utcnow() - timedelta(days=lookback_days)
# Aggregate metrics per task
from sqlalchemy import func
task_stats = self.db.query(
TaskExecutionMetrics.task_id,
func.avg(TaskExecutionMetrics.duration_seconds).label("avg_duration"),
func.max(TaskExecutionMetrics.duration_seconds).label("max_duration"),
func.stddev(TaskExecutionMetrics.duration_seconds).label("std_duration"),
func.count(TaskExecutionMetrics.id).label("execution_count")
).join(
WorkflowExecutionMetrics,
TaskExecutionMetrics.workflow_execution_id == WorkflowExecutionMetrics.id
).filter(
WorkflowExecutionMetrics.workflow_id == workflow_id,
TaskExecutionMetrics.start_time >= cutoff_time
).group_by(
TaskExecutionMetrics.task_id
).all()
if not task_stats:
return []
# Find the relatively slowest tasks
durations = [s.avg_duration for s in task_stats if s.avg_duration]
if len(durations) < 2:
return []
bottlenecks = []
# Use a relative threshold (e.g. more than 2x the median)
median_duration = np.median(durations)
threshold = median_duration * 2
for stat in task_stats:
if stat.avg_duration and stat.avg_duration > threshold:
# Compute how much slower the task is relative to the median
relative_slowness = (stat.avg_duration / median_duration - 1) * 100
if relative_slowness > 200:
severity = "critical"
elif relative_slowness > 100:
severity = "high"
else:
severity = "medium"
bottleneck = PerformanceBottleneck(
workflow_id=workflow_id,
task_id=stat.task_id,
bottleneck_type="slow_task",
severity=severity,
detection_method="statistical",
confidence_score=0.9,
description=f"Task is significantly slower than others",
baseline_metrics={
"median_duration": median_duration,
"threshold": threshold
},
current_metrics={
"avg_duration": stat.avg_duration,
"max_duration": stat.max_duration,
"std_duration": float(stat.std_duration or 0)
},
deviation_percent=relative_slowness,
impact_analysis={
"execution_count": stat.execution_count,
"relative_slowness_percent": relative_slowness
}
)
bottlenecks.append(bottleneck)
return bottlenecks
# Usage example
if __name__ == "__main__":
from database import SessionLocal
db = SessionLocal()
detector = StatisticalBottleneckDetector(db)
# Detect workflow-level bottlenecks
bottlenecks = detector.detect_workflow_bottlenecks(workflow_id=1, lookback_days=7)
print(f"Found {len(bottlenecks)} workflow-level bottlenecks:")
for b in bottlenecks:
print(f" - {b.bottleneck_type} ({b.severity}): {b.description}")
print(f" Deviation: {b.deviation_percent:.1f}%")
print(f" Confidence: {b.confidence_score:.2f}")
# Detect task-level bottlenecks
task_bottlenecks = detector.detect_task_bottlenecks(workflow_id=1, lookback_days=7)
print(f"\nFound {len(task_bottlenecks)} task-level bottlenecks:")
for b in task_bottlenecks:
print(f" - Task {b.task_id}: {b.description}")
print(f" Severity: {b.severity}")
print(f" Relative slowness: {b.deviation_percent:.1f}%")
db.close()
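One caveat on the detector above: the mean and standard deviation it relies on are themselves dragged around by the very outliers being hunted. A median/MAD variant is more robust; a sketch, not wired into the class:

# Sketch: robust Z-scores based on median and MAD instead of mean/std
import numpy as np

def robust_z_scores(data):
    values = np.asarray(data, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return np.zeros_like(values)
    # 0.6745 makes MAD comparable to a standard deviation under normality
    return 0.6745 * (values - median) / mad

print(robust_z_scores([10, 11, 9, 10, 12, 50]))  # only the 50 stands out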
4.2 Anomaly Detector
# services/bottleneck_detectors/anomaly_detector.py
from typing import List, Dict, Any
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
WorkflowExecutionMetrics,
TaskExecutionMetrics,
PerformanceBottleneck
)
from datetime import datetime, timedelta
import logging
logger = logging.getLogger(__name__)
class AnomalyBottleneckDetector:
"""
Anomaly-based bottleneck detector
Detects anomalous patterns with machine learning
"""
def __init__(self, db: Session):
self.db = db
self.contamination = 0.1  # expected proportion of anomalies
self.scaler = StandardScaler()
def detect_anomalies(
self,
workflow_id: int,
lookback_days: int = 30
) -> List[PerformanceBottleneck]:
"""
Detect anomalies with Isolation Forest
"""
# Fetch historical data
cutoff_time = datetime.utcnow() - timedelta(days=lookback_days)
metrics = self.db.query(WorkflowExecutionMetrics).filter(
WorkflowExecutionMetrics.workflow_id == workflow_id,
WorkflowExecutionMetrics.start_time >= cutoff_time
).all()
if len(metrics) < 20:
logger.warning(f"Not enough data for anomaly detection: {len(metrics)}")
return []
# Build the feature matrix (keeping only rows with complete core metrics)
features = []
metric_ids = []
valid_metrics = []  # metric rows aligned with `features` (fixes an index mismatch below)
for m in metrics:
if all([
m.duration_seconds is not None,
m.avg_cpu_percent is not None,
m.avg_memory_mb is not None
]):
features.append([
m.duration_seconds,
m.avg_cpu_percent,
m.avg_memory_mb,
m.total_io_read_mb or 0,
m.total_io_write_mb or 0,
m.success_rate or 0,
m.error_rate or 0
])
metric_ids.append((m.id, m.execution_id))
valid_metrics.append(m)
if len(features) < 20:
return []
X = np.array(features)
# Standardize
X_scaled = self.scaler.fit_transform(X)
# Fit the Isolation Forest
clf = IsolationForest(
contamination=self.contamination,
random_state=42,
n_estimators=100
)
predictions = clf.fit_predict(X_scaled)
scores = clf.score_samples(X_scaled)
# Identify anomalies
bottlenecks = []
for idx, (pred, score) in enumerate(zip(predictions, scores)):
if pred == -1:  # anomaly
metric_id, execution_id = metric_ids[idx]
metric = valid_metrics[idx]  # row-aligned with the feature matrix
# Quantify the anomaly
anomaly_score = abs(score)
confidence = min(anomaly_score / 2.0, 1.0)  # map to the 0-1 range
# Determine the severity
if anomaly_score > 0.5:
severity = "critical"
elif anomaly_score > 0.3:
severity = "high"
else:
severity = "medium"
# Analyze which metrics are anomalous
feature_names = [
"duration", "cpu", "memory",
"io_read", "io_write", "success_rate", "error_rate"
]
# Compute a Z-score for each feature
feature_scores = []
for i, (value, name) in enumerate(zip(features[idx], feature_names)):
col_mean = np.mean(X[:, i])
col_std = np.std(X[:, i])
if col_std > 0:
z_score = abs((value - col_mean) / col_std)
if z_score > 2:  # significant deviation
feature_scores.append({
"feature": name,
"value": value,
"z_score": z_score
})
# Sort to find the most anomalous features
feature_scores.sort(key=lambda x: x["z_score"], reverse=True)
primary_feature = feature_scores[0] if feature_scores else None
# Determine the bottleneck type
if primary_feature:
if primary_feature["feature"] in ["cpu", "memory"]:
bottleneck_type = primary_feature["feature"]
elif primary_feature["feature"] in ["io_read", "io_write"]:
bottleneck_type = "io"
elif primary_feature["feature"] == "duration":
bottleneck_type = "performance"
else:
bottleneck_type = "quality"
else:
bottleneck_type = "unknown"
bottleneck = PerformanceBottleneck(
workflow_id=workflow_id,
execution_id=execution_id,
bottleneck_type=bottleneck_type,
severity=severity,
detection_method="anomaly",
confidence_score=confidence,
description=f"Anomalous execution detected",
impact_analysis={
"anomaly_score": anomaly_score,
"abnormal_features": feature_scores[:3], # Top 3
"primary_feature": primary_feature["feature"] if primary_feature else None
},
current_metrics={
"duration": metric.duration_seconds,
"cpu": metric.avg_cpu_percent,
"memory": metric.avg_memory_mb
}
)
bottlenecks.append(bottleneck)
# Persist to the database
for b in bottlenecks:
self.db.add(b)
self.db.commit()
logger.info(f"Detected {len(bottlenecks)} anomalies for workflow {workflow_id}")
return bottlenecks
# Usage example
if __name__ == "__main__":
from database import SessionLocal
db = SessionLocal()
detector = AnomalyBottleneckDetector(db)
bottlenecks = detector.detect_anomalies(workflow_id=1, lookback_days=7)
print(f"Found {len(bottlenecks)} anomalies:")
for b in bottlenecks:
print(f"\n{b.bottleneck_type} ({b.severity}):")
print(f" Confidence: {b.confidence_score:.2f}")
print(f" Description: {b.description}")
if b.impact_analysis and "abnormal_features" in b.impact_analysis:
print(" Abnormal features:")
for feature in b.impact_analysis["abnormal_features"]:
print(f" - {feature['feature']}: {feature['value']:.2f} "
f"(Z-score: {feature['z_score']:.2f})")
db.close()
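The architecture in section 2.1 lists LOF next to Isolation Forest. For completeness, a hedged sketch of dropping scikit-learn's LocalOutlierFactor onto the same standardized feature matrix; it is not part of the detector above:

# Sketch: Local Outlier Factor as an alternative detector on X_scaled
from sklearn.neighbors import LocalOutlierFactor

def lof_anomalies(X_scaled, contamination=0.1):
    # LOF flags points whose local density is much lower than their neighbors';
    # fit_predict returns -1 for outliers and 1 for inliers
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    predictions = lof.fit_predict(X_scaled)
    scores = -lof.negative_outlier_factor_  # larger means more anomalous
    return [(i, float(scores[i])) for i, p in enumerate(predictions) if p == -1]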
5. Optimization Recommendation Engine
5.1 Recommendation Generator
# services/optimization/recommendation_engine.py
from typing import List, Dict, Any, Optional
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
PerformanceBottleneck,
OptimizationRecommendation,
WorkflowTask
)
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
class RecommendationEngine:
"""
Optimization recommendation engine
Generates concrete optimization suggestions from detected bottlenecks
"""
def __init__(self, db: Session):
self.db = db
self.recommendation_rules = self._load_recommendation_rules()
def generate_recommendations(
self,
bottleneck: PerformanceBottleneck
) -> List[OptimizationRecommendation]:
"""
Generate optimization recommendations for a bottleneck
"""
recommendations = []
# Pick a generation strategy based on the bottleneck type
if bottleneck.bottleneck_type == "cpu":
recommendations.extend(self._generate_cpu_recommendations(bottleneck))
elif bottleneck.bottleneck_type == "memory":
recommendations.extend(self._generate_memory_recommendations(bottleneck))
elif bottleneck.bottleneck_type == "io":
recommendations.extend(self._generate_io_recommendations(bottleneck))
elif bottleneck.bottleneck_type == "slow_task":
recommendations.extend(self._generate_task_optimization_recommendations(bottleneck))
elif bottleneck.bottleneck_type == "performance":
recommendations.extend(self._generate_performance_recommendations(bottleneck))
elif bottleneck.bottleneck_type == "degradation":
recommendations.extend(self._generate_degradation_recommendations(bottleneck))
# Persist the recommendations to the database
for rec in recommendations:
self.db.add(rec)
self.db.commit()
logger.info(f"Generated {len(recommendations)} recommendations for bottleneck {bottleneck.id}")
return recommendations
def _generate_cpu_recommendations(
self,
bottleneck: PerformanceBottleneck
) -> List[OptimizationRecommendation]:
"""生成CPU优化建议"""
recommendations = []
current_cpu = bottleneck.current_metrics.get("cpu", 0)
# Recommendation 1: increase parallelism
if bottleneck.task_id:
task = self.db.query(WorkflowTask).get(bottleneck.task_id)
if task and task.config:
current_workers = task.config.get("max_workers", 1)
if current_workers < 4:
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="parameter",
priority="high",
title="增加并行Worker数量",
description=f"当前CPU使用率{current_cpu:.1f}%,增加并行度可以更好地利用CPU资源",
rationale="通过增加Worker数量,可以并行处理更多任务,提高CPU利用率",
current_config={"max_workers": current_workers},
recommended_config={"max_workers": min(current_workers * 2, 8)},
expected_improvement={
"cpu_utilization": "+20-40%",
"throughput": "+30-50%",
"duration_reduction": "20-30%"
},
implementation_difficulty="easy",
estimated_effort_hours=0.5,
implementation_steps=[
{"step": 1, "action": "修改任务配置max_workers参数"},
{"step": 2, "action": "运行A/B测试验证效果"},
{"step": 3, "action": "监控CPU和内存使用情况"}
],
risk_level="low",
potential_issues=[
"May increase memory consumption",
"Requires enough available CPU cores"
],
rollback_plan={
"action": "Restore the original max_workers setting",
"estimated_time": "1 minute"
}
))
# Recommendation 2: reduce algorithmic complexity
if current_cpu > 80:
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="logic",
priority="high",
title="优化计算密集型代码",
description=f"CPU使用率达到{current_cpu:.1f}%,建议优化算法或使用更高效的实现",
rationale="高CPU使用率表明存在计算密集型操作,优化算法可以显著减少CPU时间",
current_config={},
recommended_config={},
expected_improvement={
"cpu_time_reduction": "30-50%",
"duration_reduction": "30-50%"
},
implementation_difficulty="hard",
estimated_effort_hours=8.0,
implementation_steps=[
{"step": 1, "action": "使用profiler分析CPU热点"},
{"step": 2, "action": "识别可优化的算法和数据结构"},
{"step": 3, "action": "实现优化版本"},
{"step": 4, "action": "编写单元测试确保正确性"},
{"step": 5, "action": "进行性能测试对比"}
],
risk_level="medium",
potential_issues=[
"May introduce new bugs",
"Requires thorough testing"
]
))
return recommendations
def _generate_memory_recommendations(
self,
bottleneck: PerformanceBottleneck
) -> List[OptimizationRecommendation]:
"""生成内存优化建议"""
recommendations = []
current_memory = bottleneck.current_metrics.get("memory", 0)
# Recommendation 1: reduce the batch size
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="parameter",
priority="high",
title="减小批处理大小",
description=f"当前内存使用{current_memory:.1f}MB,建议减小批处理大小以降低内存峰值",
rationale="较小的批处理大小可以减少内存占用,避免内存溢出",
current_config={"batch_size": "未知"},
recommended_config={"batch_size": 1000},
expected_improvement={
"memory_reduction": "30-50%",
"stability": "improved"
},
implementation_difficulty="easy",
estimated_effort_hours=1.0,
implementation_steps=[
{"step": 1, "action": "调整batch_size配置参数"},
{"step": 2, "action": "监控内存使用情况"},
{"step": 3, "action": "评估对性能的影响"}
],
risk_level="low",
potential_issues=[
"May slightly increase total execution time",
"Memory savings must be balanced against throughput"
]
))
# Recommendation 2: enable streaming
if current_memory > 1000:  # more than 1GB
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="architecture",
priority="high",
title="启用流式处理",
description="内存占用较高,建议改用流式处理避免一次性加载所有数据",
rationale="流式处理可以逐块处理数据,显著降低内存占用",
current_config={"processing_mode": "batch"},
recommended_config={"processing_mode": "streaming"},
expected_improvement={
"memory_reduction": "60-80%",
"scalability": "improved"
},
implementation_difficulty="medium",
estimated_effort_hours=4.0,
implementation_steps=[
{"step": 1, "action": "重构代码支持流式处理"},
{"step": 2, "action": "使用生成器或迭代器"},
{"step": 3, "action": "测试不同数据量下的表现"}
],
risk_level="medium",
potential_issues=[
"Requires refactoring existing code",
"May affect operations that need a global view of the data"
]
))
return recommendations
def _generate_io_recommendations(
self,
bottleneck: PerformanceBottleneck
) -> List[OptimizationRecommendation]:
"""生成IO优化建议"""
recommendations = []
# Recommendation 1: enable caching
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="architecture",
priority="high",
title="启用数据缓存",
description="IO操作频繁,建议启用缓存减少磁盘/网络访问",
rationale="缓存可以避免重复的IO操作,显著提高性能",
current_config={"cache_enabled": False},
recommended_config={
"cache_enabled": True,
"cache_ttl": 3600,
"cache_size_mb": 100
},
expected_improvement={
"io_reduction": "50-70%",
"duration_reduction": "30-50%"
},
implementation_difficulty="medium",
estimated_effort_hours=2.0,
implementation_steps=[
{"step": 1, "action": "选择合适的缓存策略(LRU/LFU)"},
{"step": 2, "action": "集成缓存中间件(Redis/Memcached)"},
{"step": 3, "action": "设置合理的过期时间"},
{"step": 4, "action": "监控缓存命中率"}
],
risk_level="low",
potential_issues=[
"Needs extra cache server resources",
"Cache consistency issues may arise"
],
rollback_plan={
"action": "Disable the cache configuration",
"estimated_time": "2 minutes"
}
))
# Recommendation 2: batch IO operations
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="logic",
priority="medium",
title="批量化IO操作",
description="将多个小的IO操作合并为少量大的批量操作",
rationale="批量操作可以减少IO次数,提高吞吐量",
expected_improvement={
"io_operations_reduction": "70-90%",
"duration_reduction": "20-40%"
},
implementation_difficulty="medium",
estimated_effort_hours=3.0,
risk_level="low"
))
return recommendations
def _generate_task_optimization_recommendations(
self,
bottleneck: PerformanceBottleneck
) -> List[OptimizationRecommendation]:
"""生成慢任务优化建议"""
recommendations = []
impact = bottleneck.impact_analysis or {}
relative_slowness = impact.get("relative_slowness_percent", 0)
# Recommendation 1: split the large task
if relative_slowness > 200:  # more than 3x slower than the other tasks
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="architecture",
priority="high",
title="拆分大任务为多个小任务",
description=f"该任务比其他任务慢{relative_slowness:.0f}%,建议拆分为多个并行任务",
rationale="大任务拆分后可以并行执行,提高整体效率",
expected_improvement={
"duration_reduction": "40-60%",
"parallelism": "improved"
},
implementation_difficulty="hard",
estimated_effort_hours=8.0,
implementation_steps=[
{"step": 1, "action": "分析任务逻辑,识别可拆分点"},
{"step": 2, "action": "设计拆分方案"},
{"step": 3, "action": "实现子任务"},
{"step": 4, "action": "配置任务依赖关系"},
{"step": 5, "action": "测试并验证结果一致性"}
],
risk_level="medium",
potential_issues=[
"Requires redesigning the workflow DAG",
"May add coordination overhead"
]
))
# Recommendation 2: add task timeouts and retries
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
task_id=bottleneck.task_id,
recommendation_type="parameter",
priority="medium",
title="优化超时和重试策略",
description="设置合理的超时时间,避免长时间等待失败任务",
rationale="合理的超时可以快速失败并重试,避免资源浪费",
current_config={"timeout": "未设置"},
recommended_config={
"timeout": 300,  # 5 minutes
"retry_count": 3,
"retry_delay": 60
},
expected_improvement={
"failure_handling": "improved",
"reliability": "improved"
},
implementation_difficulty="easy",
estimated_effort_hours=0.5,
risk_level="low"
))
return recommendations
def _generate_performance_recommendations(
self,
bottleneck: PerformanceBottleneck
) -> List[OptimizationRecommendation]:
"""生成通用性能优化建议"""
recommendations = []
# Inspect the bottleneck's impact analysis
impact = bottleneck.impact_analysis or {}
abnormal_features = impact.get("abnormal_features", [])
# Generate recommendations from the anomalous features
for feature in abnormal_features[:2]:  # handle the two most anomalous features
feature_name = feature.get("feature")
if feature_name == "duration":
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
recommendation_type="performance",
priority="high",
title="优化执行时间",
description="执行时间异常,建议进行性能分析和优化",
implementation_steps=[
{"step": 1, "action": "使用性能分析工具定位热点"},
{"step": 2, "action": "优化关键路径"},
{"step": 3, "action": "考虑算法优化"}
],
implementation_difficulty="medium",
estimated_effort_hours=4.0
))
return recommendations
def _generate_degradation_recommendations(
self,
bottleneck: PerformanceBottleneck
) -> List[OptimizationRecommendation]:
"""生成性能退化建议"""
recommendations = []
recommendations.append(OptimizationRecommendation(
bottleneck_id=bottleneck.id,
workflow_id=bottleneck.workflow_id,
recommendation_type="investigation",
priority="urgent",
title="调查性能退化原因",
description="性能出现显著退化,需要紧急调查",
rationale="性能退化可能导致严重的业务影响,需要尽快定位原因",
implementation_steps=[
{"step": 1, "action": "对比近期代码变更"},
{"step": 2, "action": "检查数据量增长情况"},
{"step": 3, "action": "分析系统资源变化"},
{"step": 4, "action": "查看依赖服务状态"},
{"step": 5, "action": "制定恢复方案"}
],
implementation_difficulty="medium",
estimated_effort_hours=4.0,
risk_level="high"
))
return recommendations
def _load_recommendation_rules(self) -> Dict[str, Any]:
"""
Load the recommendation rule base
Could be loaded from a config file or a database
"""
return {
"cpu_threshold": 80,
"memory_threshold_mb": 1000,
"io_threshold_mb": 500,
"duration_multiplier": 2.0
}
# Usage example
if __name__ == "__main__":
from database import SessionLocal
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
db = SessionLocal()
# Detect bottlenecks
detector = StatisticalBottleneckDetector(db)
bottlenecks = detector.detect_workflow_bottlenecks(workflow_id=1, lookback_days=7)
# Generate recommendations
engine = RecommendationEngine(db)
for bottleneck in bottlenecks:
recommendations = engine.generate_recommendations(bottleneck)
print(f"\n瓶颈: {bottleneck.description}")
print(f"建议数: {len(recommendations)}")
for rec in recommendations:
print(f"\n 建议: {rec.title}")
print(f" 优先级: {rec.priority}")
print(f" 难度: {rec.implementation_difficulty}")
print(f" 预期改进: {rec.expected_improvement}")
db.close()
5.2 Recommendation Prioritization
# services/optimization/recommendation_prioritizer.py
from typing import List, Dict, Any
from sqlalchemy.orm import Session
from models.ai_optimizer_models import OptimizationRecommendation
import logging
logger = logging.getLogger(__name__)
class RecommendationPrioritizer:
"""
Recommendation prioritizer
Ranks optimization recommendations by multiple factors
"""
def __init__(self, db: Session):
self.db = db
# Priority weights
self.priority_weights = {
"urgent": 10,
"high": 7,
"medium": 4,
"low": 1
}
# Difficulty weights (negative: the harder, the lower the priority)
self.difficulty_weights = {
"easy": 0,
"medium": -2,
"hard": -5
}
# Risk weights (negative)
self.risk_weights = {
"low": 0,
"medium": -1,
"high": -3
}
def prioritize(
self,
recommendations: List[OptimizationRecommendation]
) -> List[OptimizationRecommendation]:
"""
Rank the recommendations by priority
"""
# Compute a composite score for each recommendation
scored_recommendations = []
for rec in recommendations:
score = self._calculate_score(rec)
scored_recommendations.append((rec, score))
# Sort by score
scored_recommendations.sort(key=lambda x: x[1], reverse=True)
# Return the sorted recommendations
return [rec for rec, score in scored_recommendations]
def _calculate_score(self, rec: OptimizationRecommendation) -> float:
"""
Compute the composite score of a recommendation
"""
score = 0.0
# 1. Priority score
priority_score = self.priority_weights.get(rec.priority, 0)
score += priority_score
# 2. Difficulty score
difficulty_score = self.difficulty_weights.get(rec.implementation_difficulty, 0)
score += difficulty_score
# 3. Risk score
risk_score = self.risk_weights.get(rec.risk_level, 0)
score += risk_score
# 4. Expected-improvement score
expected_improvement = rec.expected_improvement or {}
improvement_score = self._parse_improvement_score(expected_improvement)
score += improvement_score
# 5. Effort score (the faster, the better)
effort_score = max(0, 10 - (rec.estimated_effort_hours or 0))
score += effort_score * 0.5
return score
def _parse_improvement_score(self, improvement: Dict[str, Any]) -> float:
"""
Parse the expected improvement and compute an improvement score
"""
score = 0.0
for key, value in improvement.items():
if isinstance(value, str):
# Extract the percentage
if "%" in value:
try:
# Extract the numeric part
percent_str = value.split("%")[0].split("-")[-1]
percent = float(percent_str.replace("+", ""))
score += percent / 10  # normalize
except (ValueError, IndexError):
pass
return min(score, 10)  # cap at 10
def group_by_category(
self,
recommendations: List[OptimizationRecommendation]
) -> Dict[str, List[OptimizationRecommendation]]:
"""
Group recommendations by category
"""
groups = {
"quick_wins": [],  # quick wins
"high_impact": [],  # high impact
"long_term": [],  # long-term optimization
"low_priority": []  # low priority
}
for rec in recommendations:
# Quick wins: easy to implement and high priority
if rec.implementation_difficulty == "easy" and rec.priority in ["high", "urgent"]:
groups["quick_wins"].append(rec)
# High impact: high priority with a large expected improvement
elif rec.priority in ["high", "urgent"]:
groups["high_impact"].append(rec)
# Long term: hard to implement but high value
elif rec.implementation_difficulty == "hard" and rec.priority != "low":
groups["long_term"].append(rec)
# Low priority
else:
groups["low_priority"].append(rec)
return groups
# Usage example
if __name__ == "__main__":
from database import SessionLocal
db = SessionLocal()
# Fetch all pending recommendations
recommendations = db.query(OptimizationRecommendation).filter(
OptimizationRecommendation.status == "pending"
).all()
# Rank
prioritizer = RecommendationPrioritizer(db)
sorted_recs = prioritizer.prioritize(recommendations)
print("排序后的建议:")
for i, rec in enumerate(sorted_recs[:10], 1):
print(f"{i}. {rec.title}")
print(f" 优先级: {rec.priority}, 难度: {rec.implementation_difficulty}")
print(f" 风险: {rec.risk_level}, 工时: {rec.estimated_effort_hours}h")
print()
# Group
groups = prioritizer.group_by_category(recommendations)
print("\n建议分组:")
for category, recs in groups.items():
print(f"{category}: {len(recs)}个建议")
db.close()
6. Automated Optimization Experiments (A/B Testing)
6.1 Experiment Framework
# services/ab_testing/experiment_framework.py
from typing import Dict, Any, List, Optional, Callable
from sqlalchemy.orm import Session
from models.ai_optimizer_models import ABExperiment, OptimizationRecommendation, WorkflowExecution
from datetime import datetime, timedelta
import random
import numpy as np
from scipy import stats
import logging
logger = logging.getLogger(__name__)
class ExperimentFramework:
"""
A/B testing framework
Validates the effect of optimization recommendations
"""
def __init__(self, db: Session):
self.db = db
self.alpha = 0.05  # significance level
self.min_detectable_effect = 0.1  # minimum detectable effect (10%)
def create_experiment(
self,
workflow_id: int,
recommendation_id: int,
name: str,
hypothesis: str,
control_config: Dict[str, Any],
treatment_config: Dict[str, Any],
primary_metric: str,
traffic_split: float = 0.5,
**kwargs
) -> ABExperiment:
"""
Create an A/B test experiment
Args:
workflow_id: workflow ID
recommendation_id: recommendation ID
name: experiment name
hypothesis: the hypothesis under test
control_config: control group configuration
treatment_config: treatment group configuration
primary_metric: primary evaluation metric
traffic_split: share of traffic routed to the treatment group
"""
experiment = ABExperiment(
workflow_id=workflow_id,
recommendation_id=recommendation_id,
name=name,
hypothesis=hypothesis,
control_config=control_config,
treatment_config=treatment_config,
primary_metric=primary_metric,
traffic_split=traffic_split,
status="draft",
**kwargs
)
self.db.add(experiment)
self.db.commit()
logger.info(f"Created experiment: {name} (ID: {experiment.id})")
return experiment
def start_experiment(self, experiment_id: int) -> ABExperiment:
"""
Start the experiment
"""
experiment = self.db.query(ABExperiment).get(experiment_id)
if not experiment:
raise ValueError(f"Experiment {experiment_id} not found")
if experiment.status != "draft":
raise ValueError(f"Experiment must be in draft status to start")
# Validate the configuration
self._validate_experiment(experiment)
# Start the experiment
experiment.status = "running"
experiment.started_at = datetime.utcnow()
self.db.commit()
logger.info(f"Started experiment: {experiment.name}")
return experiment
def assign_variant(
self,
experiment_id: int,
execution_id: int
) -> str:
"""
Assign a variant (control or treatment) to an execution
Returns:
"control" or "treatment"
"""
experiment = self.db.query(ABExperiment).get(experiment_id)
if not experiment or experiment.status != "running":
return "control" # 默认使用对照组
# 随机分配
if random.random() < experiment.traffic_split:
variant = "treatment"
else:
variant = "control"
# Record the assignment (could be stored in Redis or the database)
# Simplified here; in practice it should be persisted, or made deterministic as sketched below
logger.debug(f"Assigned variant {variant} for execution {execution_id}")
return variant
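# A hedged alternative to the random assignment above: hash the execution ID so
# the same execution always maps to the same variant, with no extra storage.
# The helper name `_deterministic_variant` is illustrative, not framework API.
def _deterministic_variant(self, experiment_id: int, execution_id: int, traffic_split: float) -> str:
    import hashlib
    # Map (experiment, execution) into [0, 1) and compare against the split
    digest = hashlib.sha256(f"{experiment_id}:{execution_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # 8 hex chars -> 32 bits
    return "treatment" if bucket < traffic_split else "control"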
def record_result(
self,
experiment_id: int,
variant: str,
metrics: Dict[str, float]
):
"""
Record an experiment result
"""
experiment = self.db.query(ABExperiment).get(experiment_id)
if not experiment:
return
# Update the sample size and per-variant metric lists
if variant == "control":
experiment.control_group_size = (experiment.control_group_size or 0) + 1
# Update control group metrics; rebuild the dict so SQLAlchemy tracks the JSON change
metrics_store = dict(experiment.control_metrics or {})
for metric_name, value in metrics.items():
metrics_store.setdefault(metric_name, []).append(value)
experiment.control_metrics = metrics_store
else:  # treatment
experiment.treatment_group_size = (experiment.treatment_group_size or 0) + 1
# Update treatment group metrics; same reassignment trick as above
metrics_store = dict(experiment.treatment_metrics or {})
for metric_name, value in metrics.items():
metrics_store.setdefault(metric_name, []).append(value)
experiment.treatment_metrics = metrics_store
self.db.commit()
# Check whether the analysis conditions are met
self._check_for_analysis(experiment)
def _check_for_analysis(self, experiment: ABExperiment):
"""
Check whether the analysis conditions are met
"""
control_size = experiment.control_group_size or 0
treatment_size = experiment.treatment_group_size or 0
# Check the minimum sample size
if control_size < experiment.min_sample_size or treatment_size < experiment.min_sample_size:
return
# Run the statistical analysis
result = self.analyze_experiment(experiment.id)
# If early stopping is enabled
if experiment.early_stopping_enabled:
# Check whether we can stop early
if self._should_stop_early(result):
self.stop_experiment(experiment.id, "Early stopping triggered")
def analyze_experiment(self, experiment_id: int) -> Dict[str, Any]:
"""
Analyze the experiment results
"""
experiment = self.db.query(ABExperiment).get(experiment_id)
if not experiment:
raise ValueError(f"Experiment {experiment_id} not found")
primary_metric = experiment.primary_metric
# Get the primary metric's data
control_data = experiment.control_metrics.get(primary_metric, [])
treatment_data = experiment.treatment_metrics.get(primary_metric, [])
if not control_data or not treatment_data:
return {"status": "insufficient_data"}
# Compute summary statistics
control_mean = np.mean(control_data)
treatment_mean = np.mean(treatment_data)
# Run a two-sample t-test
t_stat, p_value = stats.ttest_ind(treatment_data, control_data)
# Compute the confidence interval
pooled_std = np.sqrt(
((len(control_data) - 1) * np.var(control_data) +
(len(treatment_data) - 1) * np.var(treatment_data)) /
(len(control_data) + len(treatment_data) - 2)
)
margin_of_error = stats.t.ppf(1 - self.alpha / 2, len(control_data) + len(treatment_data) - 2) * \
pooled_std * np.sqrt(1 / len(control_data) + 1 / len(treatment_data))
mean_diff = treatment_mean - control_mean
ci_lower = mean_diff - margin_of_error
ci_upper = mean_diff + margin_of_error
# Compute the effect size
effect_size = (treatment_mean - control_mean) / control_mean if control_mean != 0 else 0
# Decide the outcome
is_significant = p_value < self.alpha
if is_significant:
if treatment_mean > control_mean:
decision = "winner_treatment"
else:
decision = "winner_control"
else:
decision = "no_difference"
result = {
"status": "analyzed",
"control_mean": control_mean,
"treatment_mean": treatment_mean,
"mean_difference": mean_diff,
"effect_size": effect_size,
"p_value": p_value,
"is_significant": is_significant,
"confidence_interval": {
"lower": ci_lower,
"upper": ci_upper
},
"decision": decision,
"sample_sizes": {
"control": len(control_data),
"treatment": len(treatment_data)
}
}
# Update the experiment record
experiment.statistical_significance = p_value
experiment.confidence_interval = result["confidence_interval"]
experiment.decision = decision
self.db.commit()
logger.info(f"Analyzed experiment {experiment.name}: {decision}")
return result
def _should_stop_early(self, analysis_result: Dict[str, Any]) -> bool:
"""
Decide whether the experiment should stop early
"""
if analysis_result.get("status") != "analyzed":
return False
# Stop if the result is significant and the effect size is large enough
is_significant = analysis_result.get("is_significant", False)
effect_size = abs(analysis_result.get("effect_size", 0))
return is_significant and effect_size >= self.min_detectable_effect
def stop_experiment(self, experiment_id: int, reason: str = "Manual stop"):
"""
Stop the experiment
"""
experiment = self.db.query(ABExperiment).get(experiment_id)
if not experiment:
return
if experiment.status == "completed":
return
experiment.status = "completed"
experiment.ended_at = datetime.utcnow()
experiment.decision_reason = reason
# Run the final analysis
if experiment.control_group_size and experiment.treatment_group_size:
self.analyze_experiment(experiment_id)
self.db.commit()
logger.info(f"Stopped experiment {experiment.name}: {reason}")
def rollout_winner(self, experiment_id: int):
"""
Roll out the winning configuration
"""
experiment = self.db.query(ABExperiment).get(experiment_id)
if not experiment:
raise ValueError(f"Experiment {experiment_id} not found")
if experiment.decision not in ["winner_control", "winner_treatment"]:
raise ValueError(f"No clear winner to rollout")
# 获取获胜配置
if experiment.decision == "winner_treatment":
winning_config = experiment.treatment_config
logger.info(f"Rolling out treatment config for experiment {experiment.name}")
else:
winning_config = experiment.control_config
logger.info(f"Keeping control config for experiment {experiment.name}")
# 更新工作流配置
# 这里需要根据实际情况实现
# workflow = self.db.query(Workflow).get(experiment.workflow_id)
# workflow.config.update(winning_config)
# self.db.commit()
# 更新建议状态
if experiment.recommendation_id:
recommendation = self.db.query(OptimizationRecommendation).get(
experiment.recommendation_id
)
if recommendation:
recommendation.status = "implemented"
recommendation.actual_improvement = self.analyze_experiment(experiment_id)
self.db.commit()
def _validate_experiment(self, experiment: ABExperiment):
"""
验证实验配置
"""
if not experiment.control_config or not experiment.treatment_config:
raise ValueError("Both control and treatment configs are required")
if not 0 < experiment.traffic_split < 1:
raise ValueError("Traffic split must be between 0 and 1")
if not experiment.primary_metric:
raise ValueError("Primary metric is required")
# 使用示例
if __name__ == "__main__":
from database import SessionLocal
db = SessionLocal()
framework = ExperimentFramework(db)
# 创建实验
experiment = framework.create_experiment(
workflow_id=1,
recommendation_id=1,
name="Test parallel workers optimization",
hypothesis="Increasing parallel workers from 2 to 4 will reduce duration by 30%",
control_config={"max_workers": 2},
treatment_config={"max_workers": 4},
primary_metric="duration_seconds",
secondary_metrics=["cpu_percent", "memory_mb"],
traffic_split=0.5,
min_sample_size=50,
max_duration_days=7
)
# 启动实验
framework.start_experiment(experiment.id)
# 模拟记录结果
for i in range(100):
variant = framework.assign_variant(experiment.id, i)
# 模拟指标数据
if variant == "control":
duration = np.random.normal(100, 10)
else:
duration = np.random.normal(70, 8) # 实验组更快
framework.record_result(
experiment.id,
variant,
{"duration_seconds": duration}
)
# 分析结果
result = framework.analyze_experiment(experiment.id)
print(f"\n实验结果:")
print(f" 对照组均值: {result['control_mean']:.2f}s")
print(f" 实验组均值: {result['treatment_mean']:.2f}s")
print(f" 改进: {result['effect_size'] * 100:.1f}%")
print(f" P值: {result['p_value']:.4f}")
print(f" 决策: {result['decision']}")
if result['is_significant']:
# 推广获胜配置
framework.rollout_winner(experiment.id)
db.close()
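补充说明:上面的 analyze_experiment 使用等方差 t 检验(stats.ttest_ind 默认 equal_var=True)。当两组方差差异较大或样本量不均衡时,Welch t 检验通常更稳健。下面是一个示意性片段,沿用上文的 control_data / treatment_data 变量,仅作替换参考,并非框架的正式实现:
```python
from scipy import stats

# Welch t 检验:不假设两组方差相等
t_stat, p_value = stats.ttest_ind(treatment_data, control_data, equal_var=False)
print(f"Welch t = {t_stat:.3f}, p = {p_value:.4f}")
```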
6.2 自动化实验调度器
# services/ab_testing/experiment_scheduler.py
from typing import List, Dict, Any, Optional
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
ABExperiment,
OptimizationRecommendation,
WorkflowExecution
)
from services.ab_testing.experiment_framework import ExperimentFramework
from datetime import datetime, timedelta
import logging
logger = logging.getLogger(__name__)
class ExperimentScheduler:
"""
自动化实验调度器
管理多个实验的生命周期,避免冲突
"""
def __init__(self, db: Session):
self.db = db
self.framework = ExperimentFramework(db)
self.max_concurrent_experiments = 3 # 最多同时运行3个实验
def schedule_experiments_for_recommendations(
self,
recommendation_ids: List[int]
) -> List[ABExperiment]:
"""
为建议自动调度实验
"""
experiments = []
for rec_id in recommendation_ids:
recommendation = self.db.query(OptimizationRecommendation).get(rec_id)
if not recommendation:
continue
# 检查是否已有实验
existing = self.db.query(ABExperiment).filter(
ABExperiment.recommendation_id == rec_id,
ABExperiment.status.in_(["draft", "running"])
).first()
if existing:
logger.info(f"Experiment already exists for recommendation {rec_id}")
continue
# 创建实验
experiment = self._create_experiment_from_recommendation(recommendation)
if experiment:
experiments.append(experiment)
# 按优先级调度
self._schedule_by_priority(experiments)
return experiments
def _create_experiment_from_recommendation(
self,
recommendation: OptimizationRecommendation
) -> Optional[ABExperiment]:
"""
从建议创建实验
"""
# 构建实验名称
name = f"AB_{recommendation.workflow_id}_{recommendation.title[:50]}"
# 构建假设
hypothesis = f"Implementing '{recommendation.title}' will improve performance"
# 确定主要指标
primary_metric = self._determine_primary_metric(recommendation)
# 获取当前配置和推荐配置
control_config = recommendation.current_config or {}
treatment_config = recommendation.recommended_config or {}
# 计算最小样本量
min_sample_size = self._calculate_min_sample_size(
recommendation.expected_improvement
)
# 创建实验
try:
experiment = self.framework.create_experiment(
workflow_id=recommendation.workflow_id,
recommendation_id=recommendation.id,
name=name,
hypothesis=hypothesis,
control_config=control_config,
treatment_config=treatment_config,
primary_metric=primary_metric,
secondary_metrics=self._get_secondary_metrics(recommendation),
traffic_split=0.5,
min_sample_size=min_sample_size,
max_duration_days=7,
early_stopping_enabled=True
)
logger.info(f"Created experiment for recommendation {recommendation.id}")
return experiment
except Exception as e:
logger.error(f"Failed to create experiment: {e}")
return None
def _determine_primary_metric(
self,
recommendation: OptimizationRecommendation
) -> str:
"""
根据建议类型确定主要评估指标
"""
rec_type = recommendation.recommendation_type
metric_mapping = {
"parameter": "duration_seconds",
"logic": "duration_seconds",
"architecture": "throughput",
"performance": "duration_seconds",
"investigation": "error_rate"
}
return metric_mapping.get(rec_type, "duration_seconds")
def _get_secondary_metrics(
self,
recommendation: OptimizationRecommendation
) -> List[str]:
"""
获取次要评估指标
"""
# 基础指标
secondary = ["cpu_percent", "memory_mb", "error_rate"]
# 根据建议类型添加特定指标
if recommendation.recommendation_type == "architecture":
secondary.extend(["io_operations", "cache_hit_rate"])
return secondary
def _calculate_min_sample_size(
self,
expected_improvement: Dict[str, Any]
) -> int:
"""
基于预期改进计算最小样本量
使用功效分析 (Power Analysis)
"""
# 简化计算,实际应使用统计功效分析
# 预期改进越小,需要的样本量越大
# 提取预期改进百分比
improvement_pct = 0.0
for key, value in (expected_improvement or {}).items():
if isinstance(value, str) and "%" in value:
try:
pct_str = value.split("%")[0].split("-")[-1]
improvement_pct = max(improvement_pct, float(pct_str.replace("+", "")))
except (ValueError, IndexError):
pass  # 无法解析百分比字符串时跳过该项
# 根据改进幅度确定样本量
if improvement_pct >= 50:
return 30 # 大改进,少量样本即可
elif improvement_pct >= 30:
return 50
elif improvement_pct >= 10:
return 100
else:
return 200 # 小改进,需要更多样本
def _schedule_by_priority(self, experiments: List[ABExperiment]):
"""
按优先级调度实验
"""
# 检查当前运行的实验数
running_count = self.db.query(ABExperiment).filter(
ABExperiment.status == "running"
).count()
# 可启动的数量
can_start = max(0, self.max_concurrent_experiments - running_count)
if can_start == 0:
logger.info("Maximum concurrent experiments reached, queuing new experiments")
return
# 按建议优先级排序
experiments_with_priority = []
for exp in experiments:
if exp.recommendation_id:
rec = self.db.query(OptimizationRecommendation).get(exp.recommendation_id)
priority_score = {
"urgent": 4,
"high": 3,
"medium": 2,
"low": 1
}.get(rec.priority if rec else None, 0)
experiments_with_priority.append((exp, priority_score))
experiments_with_priority.sort(key=lambda x: x[1], reverse=True)
# 启动优先级最高的实验
for exp, _ in experiments_with_priority[:can_start]:
try:
self.framework.start_experiment(exp.id)
logger.info(f"Started experiment: {exp.name}")
except Exception as e:
logger.error(f"Failed to start experiment {exp.id}: {e}")
def check_and_rotate_experiments(self):
"""
检查运行中的实验,必要时轮换
"""
# 检查超时的实验
running_experiments = self.db.query(ABExperiment).filter(
ABExperiment.status == "running"
).all()
for exp in running_experiments:
# 检查是否超过最大运行时间
if exp.started_at:
running_days = (datetime.utcnow() - exp.started_at).days
if running_days >= (exp.max_duration_days or 7):
logger.info(f"Experiment {exp.name} exceeded max duration, stopping")
self.framework.stop_experiment(exp.id, "Exceeded max duration")
# 如果有足够数据,分析并决策
if exp.control_group_size >= exp.min_sample_size:
result = self.framework.analyze_experiment(exp.id)
if result.get("is_significant"):
self.framework.rollout_winner(exp.id)
# 启动队列中的实验
draft_experiments = self.db.query(ABExperiment).filter(
ABExperiment.status == "draft"
).all()
if draft_experiments:
self._schedule_by_priority(draft_experiments)
def generate_experiment_report(
self,
experiment_id: int
) -> Dict[str, Any]:
"""
生成实验报告
"""
experiment = self.db.query(ABExperiment).get(experiment_id)
if not experiment:
raise ValueError(f"Experiment {experiment_id} not found")
# 分析结果
analysis = self.framework.analyze_experiment(experiment_id)
# 构建报告
report = {
"experiment_id": experiment.id,
"name": experiment.name,
"status": experiment.status,
"hypothesis": experiment.hypothesis,
"duration_days": (
(experiment.ended_at or datetime.utcnow()) - experiment.started_at
).days if experiment.started_at else 0,
"sample_sizes": {
"control": experiment.control_group_size or 0,
"treatment": experiment.treatment_group_size or 0
},
"analysis": analysis,
"recommendation": None,
"timeline": []
}
# 添加建议信息
if experiment.recommendation_id:
rec = self.db.query(OptimizationRecommendation).get(experiment.recommendation_id)
if rec:
report["recommendation"] = {
"title": rec.title,
"description": rec.description,
"expected_improvement": rec.expected_improvement,
"implementation_difficulty": rec.implementation_difficulty
}
# 添加时间线
if experiment.started_at:
report["timeline"].append({
"timestamp": experiment.started_at,
"event": "Experiment started"
})
if experiment.ended_at:
report["timeline"].append({
"timestamp": experiment.ended_at,
"event": f"Experiment ended: {experiment.decision_reason or 'Completed'}"
})
return report
# 使用示例
if __name__ == "__main__":
from database import SessionLocal
db = SessionLocal()
scheduler = ExperimentScheduler(db)
# 为待处理的建议调度实验
pending_recommendations = db.query(OptimizationRecommendation).filter(
OptimizationRecommendation.status == "pending"
).limit(5).all()
recommendation_ids = [rec.id for rec in pending_recommendations]
experiments = scheduler.schedule_experiments_for_recommendations(recommendation_ids)
print(f"\n已调度 {len(experiments)} 个实验")
# 检查并轮换实验
scheduler.check_and_rotate_experiments()
# 生成报告
for exp in experiments[:1]:
report = scheduler.generate_experiment_report(exp.id)
print(f"\n实验报告: {report['name']}")
print(f" 状态: {report['status']}")
print(f" 样本量: {report['sample_sizes']}")
db.close()
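补充说明:上面的 _calculate_min_sample_size 使用分段经验值。若需要严格的统计功效分析,可以借助技术栈中已有的 statsmodels。以下是一个以 Cohen's d 为效应量的示意实现(效应量与业务指标之间的换算需结合实际数据确定):
```python
# 示意:用功效分析计算每组最小样本量
from statsmodels.stats.power import TTestIndPower

def min_sample_size_by_power(
    effect_size_d: float,
    alpha: float = 0.05,
    power: float = 0.8
) -> int:
    """effect_size_d 为 Cohen's d;返回每组所需样本量(向上取整)"""
    analysis = TTestIndPower()
    n = analysis.solve_power(
        effect_size=effect_size_d,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    )
    return int(n) + 1

# 中等效应 d=0.5 时,约需每组 64 个样本
print(min_sample_size_by_power(0.5))
```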
7. API接口层
7.1 瓶颈检测API
# api/endpoints/bottleneck_detection.py
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.orm import Session
from typing import List, Optional
from datetime import datetime, timedelta
from pydantic import BaseModel, Field
from database import get_db
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from services.bottleneck_detectors.ml_detector import MLBottleneckDetector
from models.ai_optimizer_models import PerformanceBottleneck
router = APIRouter(prefix="/api/v1/bottlenecks", tags=["bottleneck-detection"])
# Request/Response模型
class BottleneckDetectionRequest(BaseModel):
workflow_id: int = Field(..., description="工作流ID")
lookback_days: int = Field(7, ge=1, le=90, description="回溯天数")
detector_type: str = Field("statistical", description="检测器类型: statistical, ml")
threshold: Optional[float] = Field(None, ge=0, le=1, description="检测阈值")
class BottleneckResponse(BaseModel):
id: int
workflow_id: int
task_id: Optional[int]
bottleneck_type: str
severity: str
confidence_score: float
description: str
detected_at: datetime
current_metrics: dict
baseline_metrics: Optional[dict]
impact_analysis: Optional[dict]
class Config:
orm_mode = True
class BottleneckListResponse(BaseModel):
total: int
bottlenecks: List[BottleneckResponse]
detection_metadata: dict
@router.post("/detect", response_model=BottleneckListResponse)
async def detect_bottlenecks(
request: BottleneckDetectionRequest,
db: Session = Depends(get_db)
):
"""
检测工作流性能瓶颈
"""
try:
# 选择检测器
if request.detector_type == "statistical":
detector = StatisticalBottleneckDetector(db)
elif request.detector_type == "ml":
detector = MLBottleneckDetector(db)
else:
raise HTTPException(status_code=400, detail="Invalid detector type")
# 设置阈值
if request.threshold is not None:
detector.threshold = request.threshold
# 执行检测
bottlenecks = detector.detect_workflow_bottlenecks(
workflow_id=request.workflow_id,
lookback_days=request.lookback_days
)
# 构建响应
return BottleneckListResponse(
total=len(bottlenecks),
bottlenecks=[BottleneckResponse.from_orm(b) for b in bottlenecks],
detection_metadata={
"detector_type": request.detector_type,
"lookback_days": request.lookback_days,
"threshold": request.threshold or detector.threshold,
"detection_time": datetime.utcnow()
}
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.get("/{bottleneck_id}", response_model=BottleneckResponse)
async def get_bottleneck(
bottleneck_id: int,
db: Session = Depends(get_db)
):
"""
获取单个瓶颈详情
"""
bottleneck = db.query(PerformanceBottleneck).get(bottleneck_id)
if not bottleneck:
raise HTTPException(status_code=404, detail="Bottleneck not found")
return BottleneckResponse.from_orm(bottleneck)
@router.get("/workflow/{workflow_id}", response_model=BottleneckListResponse)
async def get_workflow_bottlenecks(
workflow_id: int,
severity: Optional[str] = Query(None, description="过滤严重程度"),
status: Optional[str] = Query(None, description="过滤状态"),
limit: int = Query(50, ge=1, le=500),
offset: int = Query(0, ge=0),
db: Session = Depends(get_db)
):
"""
获取工作流的所有瓶颈
"""
query = db.query(PerformanceBottleneck).filter(
PerformanceBottleneck.workflow_id == workflow_id
)
# 过滤
if severity:
query = query.filter(PerformanceBottleneck.severity == severity)
if status:
query = query.filter(PerformanceBottleneck.status == status)
# 排序
query = query.order_by(PerformanceBottleneck.detected_at.desc())
# 分页
total = query.count()
bottlenecks = query.offset(offset).limit(limit).all()
return BottleneckListResponse(
total=total,
bottlenecks=[BottleneckResponse.from_orm(b) for b in bottlenecks],
detection_metadata={
"filters": {
"severity": severity,
"status": status
},
"pagination": {
"limit": limit,
"offset": offset,
"total": total
}
}
)
@router.post("/{bottleneck_id}/resolve")
async def resolve_bottleneck(
bottleneck_id: int,
resolution_notes: Optional[str] = None,
db: Session = Depends(get_db)
):
"""
标记瓶颈为已解决
"""
bottleneck = db.query(PerformanceBottleneck).get(bottleneck_id)
if not bottleneck:
raise HTTPException(status_code=404, detail="Bottleneck not found")
bottleneck.status = "resolved"
bottleneck.resolved_at = datetime.utcnow()
if resolution_notes:
if not bottleneck.resolution_info:
bottleneck.resolution_info = {}
bottleneck.resolution_info["notes"] = resolution_notes
db.commit()
return {"message": "Bottleneck resolved", "bottleneck_id": bottleneck_id}
@router.delete("/{bottleneck_id}")
async def delete_bottleneck(
bottleneck_id: int,
db: Session = Depends(get_db)
):
"""
删除瓶颈记录
"""
bottleneck = db.query(PerformanceBottleneck).get(bottleneck_id)
if not bottleneck:
raise HTTPException(status_code=404, detail="Bottleneck not found")
db.delete(bottleneck)
db.commit()
return {"message": "Bottleneck deleted", "bottleneck_id": bottleneck_id}
7.2 优化建议API
# api/endpoints/optimization_recommendations.py
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.orm import Session
from typing import List, Optional
from pydantic import BaseModel, Field
from datetime import datetime
from database import get_db
from services.optimization.recommendation_engine import RecommendationEngine
from services.optimization.recommendation_prioritizer import RecommendationPrioritizer
from models.ai_optimizer_models import (
OptimizationRecommendation,
PerformanceBottleneck
)
router = APIRouter(prefix="/api/v1/recommendations", tags=["optimization-recommendations"])
# Request/Response模型
class RecommendationResponse(BaseModel):
id: int
bottleneck_id: Optional[int]
workflow_id: int
task_id: Optional[int]
recommendation_type: str
priority: str
title: str
description: str
rationale: Optional[str]
current_config: Optional[dict]
recommended_config: Optional[dict]
expected_improvement: Optional[dict]
implementation_difficulty: str
estimated_effort_hours: Optional[float]
implementation_steps: Optional[List[dict]]
risk_level: str
potential_issues: Optional[List[str]]
status: str
created_at: datetime
class Config:
orm_mode = True
class RecommendationListResponse(BaseModel):
total: int
recommendations: List[RecommendationResponse]
metadata: dict
class GenerateRecommendationsRequest(BaseModel):
bottleneck_id: int
@router.post("/generate", response_model=RecommendationListResponse)
async def generate_recommendations(
request: GenerateRecommendationsRequest,
db: Session = Depends(get_db)
):
"""
为瓶颈生成优化建议
"""
bottleneck = db.query(PerformanceBottleneck).get(request.bottleneck_id)
if not bottleneck:
raise HTTPException(status_code=404, detail="Bottleneck not found")
# 生成建议
engine = RecommendationEngine(db)
recommendations = engine.generate_recommendations(bottleneck)
# 排序
prioritizer = RecommendationPrioritizer(db)
sorted_recs = prioritizer.prioritize(recommendations)
return RecommendationListResponse(
total=len(sorted_recs),
recommendations=[RecommendationResponse.from_orm(r) for r in sorted_recs],
metadata={
"bottleneck_id": request.bottleneck_id,
"generated_at": datetime.utcnow()
}
)
@router.get("/{recommendation_id}", response_model=RecommendationResponse)
async def get_recommendation(
recommendation_id: int,
db: Session = Depends(get_db)
):
"""
获取单个建议详情
"""
recommendation = db.query(OptimizationRecommendation).get(recommendation_id)
if not recommendation:
raise HTTPException(status_code=404, detail="Recommendation not found")
return RecommendationResponse.from_orm(recommendation)
@router.get("/workflow/{workflow_id}", response_model=RecommendationListResponse)
async def get_workflow_recommendations(
workflow_id: int,
status: Optional[str] = Query(None),
priority: Optional[str] = Query(None),
limit: int = Query(50, ge=1, le=500),
offset: int = Query(0, ge=0),
db: Session = Depends(get_db)
):
"""
获取工作流的所有建议
"""
query = db.query(OptimizationRecommendation).filter(
OptimizationRecommendation.workflow_id == workflow_id
)
if status:
query = query.filter(OptimizationRecommendation.status == status)
if priority:
query = query.filter(OptimizationRecommendation.priority == priority)
query = query.order_by(OptimizationRecommendation.created_at.desc())
total = query.count()
recommendations = query.offset(offset).limit(limit).all()
# 排序
prioritizer = RecommendationPrioritizer(db)
sorted_recs = prioritizer.prioritize(recommendations)
return RecommendationListResponse(
total=total,
recommendations=[RecommendationResponse.from_orm(r) for r in sorted_recs],
metadata={
"filters": {"status": status, "priority": priority},
"pagination": {"limit": limit, "offset": offset, "total": total}
}
)
@router.post("/{recommendation_id}/approve")
async def approve_recommendation(
recommendation_id: int,
db: Session = Depends(get_db)
):
"""
批准建议并创建A/B测试
"""
from services.ab_testing.experiment_scheduler import ExperimentScheduler
recommendation = db.query(OptimizationRecommendation).get(recommendation_id)
if not recommendation:
raise HTTPException(status_code=404, detail="Recommendation not found")
# 更新状态
recommendation.status = "approved"
db.commit()
# 创建实验
scheduler = ExperimentScheduler(db)
experiments = scheduler.schedule_experiments_for_recommendations([recommendation_id])
return {
"message": "Recommendation approved and experiment scheduled",
"recommendation_id": recommendation_id,
"experiment_id": experiments[0].id if experiments else None
}
@router.post("/{recommendation_id}/reject")
async def reject_recommendation(
recommendation_id: int,
reason: Optional[str] = None,
db: Session = Depends(get_db)
):
"""
拒绝建议
"""
recommendation = db.query(OptimizationRecommendation).get(recommendation_id)
if not recommendation:
raise HTTPException(status_code=404, detail="Recommendation not found")
recommendation.status = "rejected"
if reason:
if not recommendation.feedback:
recommendation.feedback = {}
recommendation.feedback["rejection_reason"] = reason
db.commit()
return {"message": "Recommendation rejected", "recommendation_id": recommendation_id}
7.3 A/B测试API
# api/endpoints/ab_testing.py
from fastapi import APIRouter, Depends, HTTPException, Query, BackgroundTasks
from sqlalchemy.orm import Session
from typing import List, Optional
from pydantic import BaseModel, Field
from datetime import datetime
from database import get_db
from services.ab_testing.experiment_framework import ExperimentFramework
from services.ab_testing.experiment_scheduler import ExperimentScheduler
from models.ai_optimizer_models import ABExperiment
router = APIRouter(prefix="/api/v1/experiments", tags=["ab-testing"])
# Request/Response模型
class CreateExperimentRequest(BaseModel):
workflow_id: int
recommendation_id: Optional[int] = None
name: str
hypothesis: str
control_config: dict
treatment_config: dict
primary_metric: str
secondary_metrics: List[str] = []
traffic_split: float = Field(0.5, ge=0.1, le=0.9)
min_sample_size: int = Field(100, ge=10)
max_duration_days: int = Field(7, ge=1, le=30)
early_stopping_enabled: bool = True
class ExperimentResponse(BaseModel):
id: int
workflow_id: int
recommendation_id: Optional[int]
name: str
hypothesis: str
status: str
control_config: dict
treatment_config: dict
primary_metric: str
secondary_metrics: List[str]
traffic_split: float
started_at: Optional[datetime]
ended_at: Optional[datetime]
control_group_size: int
treatment_group_size: int
winner: Optional[str]
decision_reason: Optional[str]
class Config:
orm_mode = True
class ExperimentAnalysisResponse(BaseModel):
experiment_id: int
status: str
sample_sizes: dict
primary_metric_analysis: dict
secondary_metrics_analysis: dict
is_significant: bool
recommended_action: str
confidence_level: float
class ExperimentListResponse(BaseModel):
total: int
experiments: List[ExperimentResponse]
metadata: dict
@router.post("/", response_model=ExperimentResponse)
async def create_experiment(
request: CreateExperimentRequest,
db: Session = Depends(get_db)
):
"""
创建新的A/B测试实验
"""
framework = ExperimentFramework(db)
experiment = framework.create_experiment(
workflow_id=request.workflow_id,
recommendation_id=request.recommendation_id,
name=request.name,
hypothesis=request.hypothesis,
control_config=request.control_config,
treatment_config=request.treatment_config,
primary_metric=request.primary_metric,
secondary_metrics=request.secondary_metrics,
traffic_split=request.traffic_split,
min_sample_size=request.min_sample_size,
max_duration_days=request.max_duration_days,
early_stopping_enabled=request.early_stopping_enabled
)
return ExperimentResponse.from_orm(experiment)
@router.get("/{experiment_id}", response_model=ExperimentResponse)
async def get_experiment(
experiment_id: int,
db: Session = Depends(get_db)
):
"""
获取实验详情
"""
experiment = db.query(ABExperiment).get(experiment_id)
if not experiment:
raise HTTPException(status_code=404, detail="Experiment not found")
return ExperimentResponse.from_orm(experiment)
@router.get("/", response_model=ExperimentListResponse)
async def list_experiments(
workflow_id: Optional[int] = Query(None),
status: Optional[str] = Query(None),
limit: int = Query(50, ge=1, le=500),
offset: int = Query(0, ge=0),
db: Session = Depends(get_db)
):
"""
列出所有实验
"""
query = db.query(ABExperiment)
if workflow_id:
query = query.filter(ABExperiment.workflow_id == workflow_id)
if status:
query = query.filter(ABExperiment.status == status)
query = query.order_by(ABExperiment.created_at.desc())
total = query.count()
experiments = query.offset(offset).limit(limit).all()
return ExperimentListResponse(
total=total,
experiments=[ExperimentResponse.from_orm(e) for e in experiments],
metadata={
"filters": {"workflow_id": workflow_id, "status": status},
"pagination": {"limit": limit, "offset": offset, "total": total}
}
)
@router.post("/{experiment_id}/start")
async def start_experiment(
experiment_id: int,
db: Session = Depends(get_db)
):
"""
启动实验
"""
framework = ExperimentFramework(db)
framework.start_experiment(experiment_id)
return {"message": "Experiment started", "experiment_id": experiment_id}
@router.post("/{experiment_id}/stop")
async def stop_experiment(
experiment_id: int,
reason: Optional[str] = None,
db: Session = Depends(get_db)
):
"""
停止实验
"""
framework = ExperimentFramework(db)
framework.stop_experiment(experiment_id, reason)
return {"message": "Experiment stopped", "experiment_id": experiment_id}
@router.get("/{experiment_id}/analysis", response_model=ExperimentAnalysisResponse)
async def analyze_experiment(
experiment_id: int,
db: Session = Depends(get_db)
):
"""
分析实验结果
"""
framework = ExperimentFramework(db)
analysis = framework.analyze_experiment(experiment_id)
experiment = db.query(ABExperiment).get(experiment_id)
return ExperimentAnalysisResponse(
experiment_id=experiment_id,
status=experiment.status,
sample_sizes={
"control": experiment.control_group_size,
"treatment": experiment.treatment_group_size
},
primary_metric_analysis=analysis,  # analyze_experiment 返回的即为主要指标的整体分析
secondary_metrics_analysis={},  # 次要指标分析此处尚未实现
is_significant=analysis.get("is_significant", False),
recommended_action=analysis.get("decision", "continue"),
confidence_level=1 - framework.alpha
)
@router.post("/{experiment_id}/rollout")
async def rollout_winner(
experiment_id: int,
background_tasks: BackgroundTasks,
db: Session = Depends(get_db)
):
"""
推广获胜配置
"""
framework = ExperimentFramework(db)
# 在后台执行推广
background_tasks.add_task(framework.rollout_winner, experiment_id)
return {
"message": "Rollout initiated",
"experiment_id": experiment_id
}
@router.get("/{experiment_id}/report")
async def get_experiment_report(
experiment_id: int,
db: Session = Depends(get_db)
):
"""
生成实验详细报告
"""
scheduler = ExperimentScheduler(db)
report = scheduler.generate_experiment_report(experiment_id)
return report
@router.post("/schedule")
async def schedule_experiments(
recommendation_ids: List[int],
db: Session = Depends(get_db)
):
"""
批量调度实验
"""
scheduler = ExperimentScheduler(db)
experiments = scheduler.schedule_experiments_for_recommendations(recommendation_ids)
return {
"message": f"Scheduled {len(experiments)} experiments",
"experiment_ids": [e.id for e in experiments]
}
@router.post("/rotate")
async def rotate_experiments(
db: Session = Depends(get_db)
):
"""
检查并轮换实验
"""
scheduler = ExperimentScheduler(db)
scheduler.check_and_rotate_experiments()
return {"message": "Experiment rotation completed"}
8. 后台任务与调度
8.1 Celery任务定义
# tasks/ai_optimizer_tasks.py
from celery import Celery
from celery.schedules import crontab
from database import SessionLocal
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from services.bottleneck_detectors.ml_detector import MLBottleneckDetector
from services.optimization.recommendation_engine import RecommendationEngine
from services.ab_testing.experiment_scheduler import ExperimentScheduler
from models.workflow_models import Workflow
import logging
logger = logging.getLogger(__name__)
# 初始化Celery
celery_app = Celery(
'ai_optimizer',
broker='redis://localhost:6379/0',
backend='redis://localhost:6379/0'
)
celery_app.conf.update(
task_serializer='json',
accept_content=['json'],
result_serializer='json',
timezone='UTC',
enable_utc=True,
)
@celery_app.task(name='detect_bottlenecks_all_workflows')
def detect_bottlenecks_all_workflows():
"""
定期检测所有活跃工作流的瓶颈
"""
db = SessionLocal()
try:
# 获取活跃工作流
active_workflows = db.query(Workflow).filter(
Workflow.is_active == True
).all()
logger.info(f"Detecting bottlenecks for {len(active_workflows)} workflows")
# 使用统计检测器
stat_detector = StatisticalBottleneckDetector(db)
total_bottlenecks = 0
for workflow in active_workflows:
try:
bottlenecks = stat_detector.detect_workflow_bottlenecks(
workflow_id=workflow.id,
lookback_days=7
)
total_bottlenecks += len(bottlenecks)
logger.info(
f"Workflow {workflow.id}: Found {len(bottlenecks)} bottlenecks"
)
except Exception as e:
logger.error(
f"Error detecting bottlenecks for workflow {workflow.id}: {e}"
)
logger.info(f"Total bottlenecks detected: {total_bottlenecks}")
return {
"workflows_processed": len(active_workflows),
"total_bottlenecks": total_bottlenecks
}
finally:
db.close()
@celery_app.task(name='ml_bottleneck_detection')
def ml_bottleneck_detection(workflow_id: int):
"""
使用ML检测器进行深度分析
"""
db = SessionLocal()
try:
ml_detector = MLBottleneckDetector(db)
bottlenecks = ml_detector.detect_workflow_bottlenecks(
workflow_id=workflow_id,
lookback_days=30
)
logger.info(
f"ML detection for workflow {workflow_id}: {len(bottlenecks)} bottlenecks"
)
return {
"workflow_id": workflow_id,
"bottlenecks_found": len(bottlenecks)
}
finally:
db.close()
@celery_app.task(name='generate_recommendations')
def generate_recommendations_task():
"""
为未处理的瓶颈生成优化建议
"""
from models.ai_optimizer_models import PerformanceBottleneck
db = SessionLocal()
try:
# 获取待处理的瓶颈
pending_bottlenecks = db.query(PerformanceBottleneck).filter(
PerformanceBottleneck.status == "detected"
).all()
logger.info(f"Generating recommendations for {len(pending_bottlenecks)} bottlenecks")
engine = RecommendationEngine(db)
total_recommendations = 0
for bottleneck in pending_bottlenecks:
try:
recommendations = engine.generate_recommendations(bottleneck)
total_recommendations += len(recommendations)
# 更新瓶颈状态
bottleneck.status = "analyzed"
except Exception as e:
logger.error(
f"Error generating recommendations for bottleneck {bottleneck.id}: {e}"
)
db.commit()
logger.info(f"Total recommendations generated: {total_recommendations}")
return {
"bottlenecks_processed": len(pending_bottlenecks),
"recommendations_generated": total_recommendations
}
finally:
db.close()
@celery_app.task(name='schedule_ab_experiments')
def schedule_ab_experiments_task():
"""
调度A/B测试实验
"""
from models.ai_optimizer_models import OptimizationRecommendation
db = SessionLocal()
try:
# 获取已批准的建议
approved_recommendations = db.query(OptimizationRecommendation).filter(
OptimizationRecommendation.status == "approved"
).limit(10).all()
if not approved_recommendations:
logger.info("No approved recommendations to schedule")
return {"experiments_scheduled": 0}
scheduler = ExperimentScheduler(db)
recommendation_ids = [rec.id for rec in approved_recommendations]
experiments = scheduler.schedule_experiments_for_recommendations(
recommendation_ids
)
logger.info(f"Scheduled {len(experiments)} experiments")
return {"experiments_scheduled": len(experiments)}
finally:
db.close()
@celery_app.task(name='rotate_experiments')
def rotate_experiments_task():
"""
检查并轮换实验
"""
db = SessionLocal()
try:
scheduler = ExperimentScheduler(db)
scheduler.check_and_rotate_experiments()
logger.info("Experiment rotation completed")
return {"status": "completed"}
finally:
db.close()
@celery_app.task(name='cleanup_old_data')
def cleanup_old_data():
"""
清理旧数据
"""
from models.ai_optimizer_models import (
PerformanceBottleneck,
OptimizationRecommendation,
ABExperiment
)
from datetime import datetime, timedelta
db = SessionLocal()
try:
cutoff_date = datetime.utcnow() - timedelta(days=90)
# 删除旧的已解决瓶颈
deleted_bottlenecks = db.query(PerformanceBottleneck).filter(
PerformanceBottleneck.status == "resolved",
PerformanceBottleneck.resolved_at < cutoff_date
).delete()
# 删除旧的已拒绝建议
deleted_recommendations = db.query(OptimizationRecommendation).filter(
OptimizationRecommendation.status == "rejected",
OptimizationRecommendation.created_at < cutoff_date
).delete()
# 删除旧的完成实验
deleted_experiments = db.query(ABExperiment).filter(
ABExperiment.status == "completed",
ABExperiment.ended_at < cutoff_date
).delete()
db.commit()
logger.info(
f"Cleanup completed: {deleted_bottlenecks} bottlenecks, "
f"{deleted_recommendations} recommendations, "
f"{deleted_experiments} experiments deleted"
)
return {
"bottlenecks_deleted": deleted_bottlenecks,
"recommendations_deleted": deleted_recommendations,
"experiments_deleted": deleted_experiments
}
finally:
db.close()
@celery_app.task(name='train_ml_models')
def train_ml_models_task():
"""
定期重新训练ML模型
"""
db = SessionLocal()
try:
ml_detector = MLBottleneckDetector(db)
# 训练异常检测模型
ml_detector._train_anomaly_detector()
logger.info("ML models retrained successfully")
return {"status": "completed"}
except Exception as e:
logger.error(f"Error training ML models: {e}")
return {"status": "failed", "error": str(e)}
finally:
db.close()
# 定时任务配置
celery_app.conf.beat_schedule = {
'detect-bottlenecks-every-hour': {
'task': 'detect_bottlenecks_all_workflows',
'schedule': crontab(minute=0), # 每小时
},
'generate-recommendations-every-6-hours': {
'task': 'generate_recommendations',
'schedule': crontab(minute=0, hour='*/6'), # 每6小时
},
'schedule-experiments-daily': {
'task': 'schedule_ab_experiments',
'schedule': crontab(minute=0, hour=9), # 每天9点
},
'rotate-experiments-every-2-hours': {
'task': 'rotate_experiments',
'schedule': crontab(minute=0, hour='*/2'), # 每2小时
},
'cleanup-old-data-weekly': {
'task': 'cleanup_old_data',
'schedule': crontab(minute=0, hour=2, day_of_week=0), # 每周日凌晨2点
},
'train-ml-models-weekly': {
'task': 'train_ml_models',
'schedule': crontab(minute=0, hour=3, day_of_week=0), # 每周日凌晨3点
},
}
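除定时触发外,也可以在代码中手动触发任务并同步等待结果,便于调试。以下是一个最小示例(假设 worker 已启动并连接到同一个 Redis broker):
```python
# 手动触发一次全量瓶颈检测
from tasks.ai_optimizer_tasks import detect_bottlenecks_all_workflows

async_result = detect_bottlenecks_all_workflows.delay()
print(async_result.get(timeout=300))
# 例如:{'workflows_processed': 12, 'total_bottlenecks': 3}
```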
8.2 任务监控
# tasks/task_monitor.py
from celery import Celery
from celery.events import EventReceiver
from kombu import Connection
import logging
from datetime import datetime
from typing import Dict, List
from collections import defaultdict
logger = logging.getLogger(__name__)
class TaskMonitor:
"""
Celery任务监控器
"""
def __init__(self, broker_url: str):
self.broker_url = broker_url
self.task_stats = defaultdict(lambda: {
"total": 0,
"succeeded": 0,
"failed": 0,
"retried": 0,
"avg_runtime": 0.0,
"last_run": None
})
def start_monitoring(self):
"""
启动监控
"""
with Connection(self.broker_url) as conn:
recv = EventReceiver(
conn,
handlers={
'task-sent': self.on_task_sent,
'task-received': self.on_task_received,
'task-started': self.on_task_started,
'task-succeeded': self.on_task_succeeded,
'task-failed': self.on_task_failed,
'task-retried': self.on_task_retried,
}
)
logger.info("Task monitor started")
recv.capture(limit=None, timeout=None, wakeup=True)
def on_task_sent(self, event):
"""任务发送"""
task_name = event['name']
logger.debug(f"Task sent: {task_name}")
def on_task_received(self, event):
"""任务接收"""
task_name = event['name']
self.task_stats[task_name]["total"] += 1
def on_task_started(self, event):
"""任务开始"""
task_name = event['name']
logger.info(f"Task started: {task_name}")
def on_task_succeeded(self, event):
"""任务成功"""
task_name = event['name']
runtime = event['runtime']
stats = self.task_stats[task_name]
stats["succeeded"] += 1
stats["last_run"] = datetime.utcnow()
# 更新平均运行时间
n = stats["succeeded"]
stats["avg_runtime"] = (
(stats["avg_runtime"] * (n - 1) + runtime) / n
)
logger.info(
f"Task succeeded: {task_name} (runtime: {runtime:.2f}s)"
)
def on_task_failed(self, event):
"""任务失败"""
task_name = event['name']
exception = event.get('exception')
self.task_stats[task_name]["failed"] += 1
logger.error(
f"Task failed: {task_name} - {exception}"
)
def on_task_retried(self, event):
"""任务重试"""
task_name = event['name']
self.task_stats[task_name]["retried"] += 1
logger.warning(f"Task retried: {task_name}")
def get_statistics(self) -> Dict:
"""
获取统计信息
"""
return dict(self.task_stats)
def get_health_status(self) -> Dict:
"""
获取健康状态
"""
total_tasks = sum(s["total"] for s in self.task_stats.values())
total_failed = sum(s["failed"] for s in self.task_stats.values())
failure_rate = total_failed / total_tasks if total_tasks > 0 else 0
health = "healthy"
if failure_rate > 0.1:
health = "degraded"
if failure_rate > 0.3:
health = "unhealthy"
return {
"status": health,
"total_tasks": total_tasks,
"total_failed": total_failed,
"failure_rate": failure_rate,
"task_stats": self.get_statistics()
}
# 启动监控器
if __name__ == "__main__":
monitor = TaskMonitor('redis://localhost:6379/0')
monitor.start_monitoring()
9. 前端集成
9.1 React组件 - 瓶颈检测面板
// frontend/src/components/AIOptimizer/BottleneckDetectionPanel.tsx
import React, { useState, useEffect } from 'react';
import {
Card,
CardContent,
CardHeader,
Typography,
Button,
Table,
TableBody,
TableCell,
TableHead,
TableRow,
Chip,
CircularProgress,
Alert,
Dialog,
DialogTitle,
DialogContent,
DialogActions
} from '@mui/material';
import { Warning, CheckCircle, Error as ErrorIcon } from '@mui/icons-material';
import axios from 'axios';
interface Bottleneck {
id: number;
workflow_id: number;
task_id?: number;
bottleneck_type: string;
severity: string;
confidence_score: number;
description: string;
detected_at: string;
current_metrics: any;
impact_analysis?: any;
}
interface BottleneckDetectionPanelProps {
workflowId: number;
}
const BottleneckDetectionPanel: React.FC<BottleneckDetectionPanelProps> = ({ workflowId }) => {
const [bottlenecks, setBottlenecks] = useState<Bottleneck[]>([]);
const [loading, setLoading] = useState(false);
const [error, setError] = useState<string | null>(null);
const [selectedBottleneck, setSelectedBottleneck] = useState<Bottleneck | null>(null);
const [detailsOpen, setDetailsOpen] = useState(false);
const fetchBottlenecks = async () => {
setLoading(true);
setError(null);
try {
const response = await axios.get(
`/api/v1/bottlenecks/workflow/${workflowId}`
);
setBottlenecks(response.data.bottlenecks);
} catch (err: any) {
setError(err.response?.data?.detail || 'Failed to fetch bottlenecks');
} finally {
setLoading(false);
}
};
const detectBottlenecks = async () => {
setLoading(true);
setError(null);
try {
const response = await axios.post('/api/v1/bottlenecks/detect', {
workflow_id: workflowId,
lookback_days: 7,
detector_type: 'statistical'
});
setBottlenecks(response.data.bottlenecks);
} catch (err: any) {
setError(err.response?.data?.detail || 'Failed to detect bottlenecks');
} finally {
setLoading(false);
}
};
const resolveBottleneck = async (bottleneckId: number) => {
try {
await axios.post(`/api/v1/bottlenecks/${bottleneckId}/resolve`);
fetchBottlenecks();
} catch (err: any) {
setError(err.response?.data?.detail || 'Failed to resolve bottleneck');
}
};
useEffect(() => {
fetchBottlenecks();
}, [workflowId]);
const getSeverityIcon = (severity: string) => {
switch (severity) {
case 'critical':
return <ErrorIcon color="error" />;
case 'high':
return <Warning color="warning" />;
case 'medium':
return <Warning color="info" />;
default:
return <CheckCircle color="success" />;
}
};
const getSeverityColor = (severity: string): "error" | "warning" | "info" | "success" => {
switch (severity) {
case 'critical':
return 'error';
case 'high':
return 'warning';
case 'medium':
return 'info';
default:
return 'success';
}
};
return (
<Card>
<CardHeader
title="Performance Bottlenecks"
action={
<Button
variant="contained"
onClick={detectBottlenecks}
disabled={loading}
>
{loading ? <CircularProgress size={24} /> : 'Detect Bottlenecks'}
</Button>
}
/>
<CardContent>
{error && (
<Alert severity="error" sx={{ mb: 2 }}>
{error}
</Alert>
)}
{loading ? (
<CircularProgress />
) : bottlenecks.length === 0 ? (
<Typography>No bottlenecks detected</Typography>
) : (
<Table>
<TableHead>
<TableRow>
<TableCell>Severity</TableCell>
<TableCell>Type</TableCell>
<TableCell>Description</TableCell>
<TableCell>Confidence</TableCell>
<TableCell>Detected</TableCell>
<TableCell>Actions</TableCell>
</TableRow>
</TableHead>
<TableBody>
{bottlenecks.map((bottleneck) => (
<TableRow key={bottleneck.id}>
<TableCell>
{getSeverityIcon(bottleneck.severity)}
<Chip
label={bottleneck.severity}
color={getSeverityColor(bottleneck.severity)}
size="small"
sx={{ ml: 1 }}
/>
</TableCell>
<TableCell>{bottleneck.bottleneck_type}</TableCell>
<TableCell>{bottleneck.description}</TableCell>
<TableCell>{(bottleneck.confidence_score * 100).toFixed(1)}%</TableCell>
<TableCell>
{new Date(bottleneck.detected_at).toLocaleString()}
</TableCell>
<TableCell>
<Button
size="small"
onClick={() => {
setSelectedBottleneck(bottleneck);
setDetailsOpen(true);
}}
>
Details
</Button>
<Button
size="small"
color="success"
onClick={() => resolveBottleneck(bottleneck.id)}
>
Resolve
</Button>
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
)}
{/* Bottleneck Details Dialog */}
<Dialog
open={detailsOpen}
onClose={() => setDetailsOpen(false)}
maxWidth="md"
fullWidth
>
<DialogTitle>Bottleneck Details</DialogTitle>
<DialogContent>
{selectedBottleneck && (
<>
<Typography variant="h6">{selectedBottleneck.description}</Typography>
<Typography variant="body2" color="textSecondary" sx={{ mt: 1 }}>
Type: {selectedBottleneck.bottleneck_type}
</Typography>
<Typography variant="body2" color="textSecondary">
Confidence: {(selectedBottleneck.confidence_score * 100).toFixed(1)}%
</Typography>
<Typography variant="subtitle1" sx={{ mt: 2 }}>
Current Metrics:
</Typography>
<pre>{JSON.stringify(selectedBottleneck.current_metrics, null, 2)}</pre>
{selectedBottleneck.impact_analysis && (
<>
<Typography variant="subtitle1" sx={{ mt: 2 }}>
Impact Analysis:
</Typography>
<pre>{JSON.stringify(selectedBottleneck.impact_analysis, null, 2)}</pre>
</>
)}
</>
)}
</DialogContent>
<DialogActions>
<Button onClick={() => setDetailsOpen(false)}>Close</Button>
</DialogActions>
</Dialog>
</CardContent>
</Card>
);
};
export default BottleneckDetectionPanel;
9.2 React组件 - 优化建议面板
// frontend/src/components/AIOptimizer/RecommendationPanel.tsx
import React, { useState, useEffect } from 'react';
import {
Card,
CardContent,
CardHeader,
Typography,
Button,
List,
ListItem,
ListItemText,
Chip,
Accordion,
AccordionSummary,
AccordionDetails,
CircularProgress,
Alert,
Dialog,
DialogTitle,
DialogContent,
DialogActions,
Stepper,
Step,
StepLabel
} from '@mui/material';
import {
ExpandMore,
ThumbUp,
ThumbDown,
Info
} from '@mui/icons-material';
import axios from 'axios';
interface Recommendation {
id: number;
title: string;
description: string;
priority: string;
recommendation_type: string;
expected_improvement: any;
implementation_difficulty: string;
estimated_effort_hours?: number;
implementation_steps?: any[];
risk_level: string;
status: string;
}
interface RecommendationPanelProps {
workflowId: number;
}
const RecommendationPanel: React.FC<RecommendationPanelProps> = ({ workflowId }) => {
const [recommendations, setRecommendations] = useState<Recommendation[]>([]);
const [loading, setLoading] = useState(false);
const [error, setError] = useState<string | null>(null);
const [selectedRec, setSelectedRec] = useState<Recommendation | null>(null);
const [implementationOpen, setImplementationOpen] = useState(false);
const fetchRecommendations = async () => {
setLoading(true);
setError(null);
try {
const response = await axios.get(
`/api/v1/recommendations/workflow/${workflowId}`
);
setRecommendations(response.data.recommendations);
} catch (err: any) {
setError(err.response?.data?.detail || 'Failed to fetch recommendations');
} finally {
setLoading(false);
}
};
const approveRecommendation = async (recId: number) => {
try {
await axios.post(`/api/v1/recommendations/${recId}/approve`);
fetchRecommendations();
} catch (err: any) {
setError(err.response?.data?.detail || 'Failed to approve recommendation');
}
};
const rejectRecommendation = async (recId: number, reason: string) => {
try {
await axios.post(`/api/v1/recommendations/${recId}/reject`, { reason });
fetchRecommendations();
} catch (err: any) {
setError(err.response?.data?.detail || 'Failed to reject recommendation');
}
};
useEffect(() => {
fetchRecommendations();
}, [workflowId]);
const getPriorityColor = (priority: string): "error" | "warning" | "info" | "default" => {
switch (priority) {
case 'urgent':
return 'error';
case 'high':
return 'warning';
case 'medium':
return 'info';
default:
return 'default';
}
};
return (
<Card>
<CardHeader title="Optimization Recommendations" />
<CardContent>
{error && (
<Alert severity="error" sx={{ mb: 2 }}>
{error}
</Alert>
)}
{loading ? (
<CircularProgress />
) : recommendations.length === 0 ? (
<Typography>No recommendations available</Typography>
) : (
<List>
{recommendations.map((rec) => (
<Accordion key={rec.id}>
<AccordionSummary expandIcon={<ExpandMore />}>
<div style={{ display: 'flex', alignItems: 'center', gap: '8px', width: '100%' }}>
<Chip
label={rec.priority}
color={getPriorityColor(rec.priority)}
size="small"
/>
<Chip
label={rec.recommendation_type}
variant="outlined"
size="small"
/>
<Typography sx={{ flexGrow: 1 }}>{rec.title}</Typography>
<Chip
label={rec.status}
size="small"
color={rec.status === 'approved' ? 'success' : 'default'}
/>
</div>
</AccordionSummary>
<AccordionDetails>
<Typography variant="body2" paragraph>
{rec.description}
</Typography>
{rec.expected_improvement && (
<>
<Typography variant="subtitle2">Expected Improvement:</Typography>
<pre style={{ fontSize: '12px' }}>
{JSON.stringify(rec.expected_improvement, null, 2)}
</pre>
</>
)}
<div style={{ marginTop: '16px' }}>
<Chip label={`Difficulty: ${rec.implementation_difficulty}`} sx={{ mr: 1 }} />
<Chip label={`Risk: ${rec.risk_level}`} sx={{ mr: 1 }} />
{rec.estimated_effort_hours && (
<Chip label={`Effort: ${rec.estimated_effort_hours}h`} />
)}
</div>
<div style={{ marginTop: '16px', display: 'flex', gap: '8px' }}>
{rec.implementation_steps && (
<Button
startIcon={<Info />}
onClick={() => {
setSelectedRec(rec);
setImplementationOpen(true);
}}
>
Implementation Steps
</Button>
)}
{rec.status === 'pending' && (
<>
<Button
variant="contained"
color="success"
startIcon={<ThumbUp />}
onClick={() => approveRecommendation(rec.id)}
>
Approve & Test
</Button>
<Button
variant="outlined"
color="error"
startIcon={<ThumbDown />}
onClick={() => rejectRecommendation(rec.id, 'Not applicable')}
>
Reject
</Button>
</>
)}
</div>
</AccordionDetails>
</Accordion>
))}
</List>
)}
{/* Implementation Steps Dialog */}
<Dialog
open={implementationOpen}
onClose={() => setImplementationOpen(false)}
maxWidth="md"
fullWidth
>
<DialogTitle>Implementation Steps</DialogTitle>
<DialogContent>
{selectedRec?.implementation_steps && (
<Stepper orientation="vertical">
{selectedRec.implementation_steps.map((step: any, index: number) => (
<Step key={index} active>
<StepLabel>{step.title || `Step ${index + 1}`}</StepLabel>
<Typography variant="body2" sx={{ mt: 1, mb: 2 }}>
{step.description}
</Typography>
{step.code && (
<pre style={{ background: '#f5f5f5', padding: '8px', borderRadius: '4px' }}>
{step.code}
</pre>
)}
</Step>
))}
</Stepper>
)}
</DialogContent>
<DialogActions>
<Button onClick={() => setImplementationOpen(false)}>Close</Button>
</DialogActions>
</Dialog>
</CardContent>
</Card>
);
};
export default RecommendationPanel;
9.3 React组件 - A/B测试监控面板
// frontend/src/components/AIOptimizer/ExperimentMonitorPanel.tsx
import React, { useState, useEffect } from 'react';
import {
Card,
CardContent,
CardHeader,
Typography,
Button,
LinearProgress,
Chip,
Grid,
Alert
} from '@mui/material';
import { PlayArrow, Stop, CheckCircle } from '@mui/icons-material';
import axios from 'axios';
interface Experiment {
id: number;
name: string;
status: string;
hypothesis: string;
primary_metric: string;
control_group_size: number;
treatment_group_size: number;
started_at?: string;
winner?: string;
}
interface ExperimentMonitorPanelProps {
workflowId: number;
}
const ExperimentMonitorPanel: React.FC<ExperimentMonitorPanelProps> = ({ workflowId }) => {
const [experiments, setExperiments] = useState<Experiment[]>([]);
const [loading, setLoading] = useState(false);
const fetchExperiments = async () => {
setLoading(true);
try {
const response = await axios.get('/api/v1/experiments', {
params: { workflow_id: workflowId }
});
setExperiments(response.data.experiments);
} catch (err) {
console.error('Failed to fetch experiments', err);
} finally {
setLoading(false);
}
};
const startExperiment = async (expId: number) => {
try {
await axios.post(`/api/v1/experiments/${expId}/start`);
fetchExperiments();
} catch (err) {
console.error('Failed to start experiment', err);
}
};
const stopExperiment = async (expId: number) => {
try {
await axios.post(`/api/v1/experiments/${expId}/stop`);
fetchExperiments();
} catch (err) {
console.error('Failed to stop experiment', err);
}
};
const rolloutWinner = async (expId: number) => {
try {
await axios.post(`/api/v1/experiments/${expId}/rollout`);
fetchExperiments();
} catch (err) {
console.error('Failed to rollout winner', err);
}
};
useEffect(() => {
fetchExperiments();
const interval = setInterval(fetchExperiments, 30000); // Refresh every 30s
return () => clearInterval(interval);
}, [workflowId]);
const getStatusColor = (status: string): "default" | "primary" | "success" | "error" => {
switch (status) {
case 'running':
return 'primary';
case 'completed':
return 'success';
case 'failed':
return 'error';
default:
return 'default';
}
};
return (
<Card>
<CardHeader title="A/B Test Experiments" />
<CardContent>
{loading && <LinearProgress />}
<Grid container spacing={2}>
{experiments.map((exp) => (
<Grid item xs={12} md={6} key={exp.id}>
<Card variant="outlined">
<CardContent>
<div style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center' }}>
<Typography variant="h6">{exp.name}</Typography>
<Chip label={exp.status} color={getStatusColor(exp.status)} size="small" />
</div>
<Typography variant="body2" color="textSecondary" sx={{ mt: 1 }}>
{exp.hypothesis}
</Typography>
<Typography variant="caption" display="block" sx={{ mt: 1 }}>
Primary Metric: {exp.primary_metric}
</Typography>
<div style={{ marginTop: '16px' }}>
<Typography variant="caption">Sample Sizes:</Typography>
<div style={{ display: 'flex', gap: '8px', marginTop: '4px' }}>
<Chip label={`Control: ${exp.control_group_size}`} size="small" />
<Chip label={`Treatment: ${exp.treatment_group_size}`} size="small" />
</div>
</div>
{exp.winner && (
<Alert severity="success" sx={{ mt: 2 }}>
Winner: {exp.winner}
</Alert>
)}
<div style={{ marginTop: '16px', display: 'flex', gap: '8px' }}>
{exp.status === 'draft' && (
<Button
size="small"
variant="contained"
startIcon={<PlayArrow />}
onClick={() => startExperiment(exp.id)}
>
Start
</Button>
)}
{exp.status === 'running' && (
<Button
size="small"
variant="outlined"
color="error"
startIcon={<Stop />}
onClick={() => stopExperiment(exp.id)}
>
Stop
</Button>
)}
{exp.status === 'completed' && exp.winner && (
<Button
size="small"
variant="contained"
color="success"
startIcon={<CheckCircle />}
onClick={() => rolloutWinner(exp.id)}
>
Rollout Winner
</Button>
)}
</div>
</CardContent>
</Card>
</Grid>
))}
</Grid>
{experiments.length === 0 && !loading && (
<Typography color="textSecondary">No experiments running</Typography>
)}
</CardContent>
</Card>
);
};
export default ExperimentMonitorPanel;
10. 部署与配置
10.1 Docker配置
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
# 安装系统依赖
RUN apt-get update && apt-get install -y \
gcc \
postgresql-client \
&& rm -rf /var/lib/apt/lists/*
# 安装Python依赖
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 复制应用代码
COPY . .
# 环境变量
ENV PYTHONUNBUFFERED=1
ENV ENVIRONMENT=production
# 暴露端口
EXPOSE 8000
# 启动命令
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# docker-compose.yml
version: '3.8'
services:
postgres:
image: postgres:14
environment:
POSTGRES_DB: workflow_db
POSTGRES_USER: workflow_user
POSTGRES_PASSWORD: workflow_pass
volumes:
- postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
redis:
image: redis:7-alpine
ports:
- "6379:6379"
api:
build: .
depends_on:
- postgres
- redis
environment:
DATABASE_URL: postgresql://workflow_user:workflow_pass@postgres:5432/workflow_db
REDIS_URL: redis://redis:6379/0
ports:
- "8000:8000"
volumes:
- ./:/app
celery_worker:
build: .
command: celery -A tasks.ai_optimizer_tasks worker --loglevel=info
depends_on:
- postgres
- redis
environment:
DATABASE_URL: postgresql://workflow_user:workflow_pass@postgres:5432/workflow_db
REDIS_URL: redis://redis:6379/0
celery_beat:
build: .
command: celery -A tasks.ai_optimizer_tasks beat --loglevel=info
depends_on:
- postgres
- redis
environment:
DATABASE_URL: postgresql://workflow_user:workflow_pass@postgres:5432/workflow_db
REDIS_URL: redis://redis:6379/0
flower:
build: .
# Flower 未包含在 requirements.txt 中,需要额外安装
command: celery -A tasks.ai_optimizer_tasks flower --port=5555
depends_on:
- redis
ports:
- "5555:5555"
environment:
REDIS_URL: redis://redis:6379/0
volumes:
postgres_data:
10.2 配置文件
# config.py
# pydantic v2 中 BaseSettings 已拆分到 pydantic-settings 包(需加入依赖)
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
# 数据库配置
DATABASE_URL: str = "postgresql://localhost/workflow_db"
# Redis配置
REDIS_URL: str = "redis://localhost:6379/0"
# AI优化器配置
BOTTLENECK_DETECTION_THRESHOLD: float = 0.7
ML_MODEL_UPDATE_INTERVAL_DAYS: int = 7
MAX_CONCURRENT_EXPERIMENTS: int = 3
# 性能阈值
SLOW_TASK_THRESHOLD_SECONDS: float = 60.0
HIGH_CPU_THRESHOLD_PERCENT: float = 80.0
HIGH_MEMORY_THRESHOLD_MB: float = 1000.0
ERROR_RATE_THRESHOLD: float = 0.05
# A/B测试配置
MIN_SAMPLE_SIZE: int = 30
SIGNIFICANCE_LEVEL: float = 0.05
POWER: float = 0.8
# 日志配置
LOG_LEVEL: str = "INFO"
# 环境
ENVIRONMENT: str = "development"
class Config:
env_file = ".env"
settings = Settings()
# .env
DATABASE_URL=postgresql://workflow_user:workflow_pass@localhost:5432/workflow_db
REDIS_URL=redis://localhost:6379/0
BOTTLENECK_DETECTION_THRESHOLD=0.7
ML_MODEL_UPDATE_INTERVAL_DAYS=7
MAX_CONCURRENT_EXPERIMENTS=3
SLOW_TASK_THRESHOLD_SECONDS=60.0
HIGH_CPU_THRESHOLD_PERCENT=80.0
HIGH_MEMORY_THRESHOLD_MB=1000.0
ERROR_RATE_THRESHOLD=0.05
MIN_SAMPLE_SIZE=30
SIGNIFICANCE_LEVEL=0.05
POWER=0.8
LOG_LEVEL=INFO
ENVIRONMENT=production
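配置加载遵循 pydantic 设置源的优先级:进程环境变量会覆盖 .env 文件中的同名项,便于在容器环境中按部署覆盖配置。使用示意:
```python
from config import settings

# 若环境变量中存在 MAX_CONCURRENT_EXPERIMENTS,将覆盖 .env 中的取值
print(settings.DATABASE_URL)
print(settings.MAX_CONCURRENT_EXPERIMENTS)
```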
11. 测试
11.1 单元测试
# tests/test_bottleneck_detector.py
import pytest
from sqlalchemy.orm import Session
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from models.ai_optimizer_models import PerformanceBottleneck
@pytest.fixture
def detector(db_session: Session):
return StatisticalBottleneckDetector(db_session)
def test_detect_cpu_bottleneck(detector, sample_workflow_data):
"""测试CPU瓶颈检测"""
bottlenecks = detector.detect_workflow_bottlenecks(
workflow_id=1,
lookback_days=7
)
cpu_bottlenecks = [
b for b in bottlenecks
if b.bottleneck_type == 'cpu'
]
assert len(cpu_bottlenecks) > 0
assert cpu_bottlenecks[0].severity in ['low', 'medium', 'high', 'critical']
def test_confidence_score_range(detector, sample_workflow_data):
"""测试置信度分数范围"""
bottlenecks = detector.detect_workflow_bottlenecks(
workflow_id=1,
lookback_days=7
)
for bottleneck in bottlenecks:
assert 0.0 <= bottleneck.confidence_score <= 1.0
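上述测试引用了 db_session、sample_workflow_data 等 fixture,但未在此处给出。以下是一个基于 SQLite 内存库的 conftest.py 草图(假设性实现;示例数据 fixture 需按实际模型字段自行填充):
```python
# tests/conftest.py(示意)
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from models.ai_optimizer_models import Base

@pytest.fixture
def db_session():
    """基于 SQLite 内存库的会话,测试结束后自动清理"""
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    TestingSession = sessionmaker(bind=engine)
    session = TestingSession()
    try:
        yield session
    finally:
        session.rollback()
        session.close()
        Base.metadata.drop_all(engine)
        engine.dispose()
```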
11.2 集成测试
# tests/test_recommendation_flow.py
import pytest
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from services.optimization.recommendation_engine import RecommendationEngine
from services.ab_testing.experiment_scheduler import ExperimentScheduler
def test_full_optimization_flow(db_session, sample_workflow):
"""测试完整的优化流程"""
# 1. 检测瓶颈
detector = StatisticalBottleneckDetector(db_session)
bottlenecks = detector.detect_workflow_bottlenecks(
workflow_id=sample_workflow.id,
lookback_days=7
)
assert len(bottlenecks) > 0
# 2. 生成建议
engine = RecommendationEngine(db_session)
recommendations = []
for bottleneck in bottlenecks:
recs = engine.generate_recommendations(bottleneck)
recommendations.extend(recs)
assert len(recommendations) > 0
# 3. 调度实验
recommendation_ids = [rec.id for rec in recommendations[:2]]
scheduler = ExperimentScheduler(db_session)
experiments = scheduler.schedule_experiments_for_recommendations(
recommendation_ids
)
assert len(experiments) > 0
assert all(exp.status == 'draft' for exp in experiments)
12. 监控与告警
12.1 Prometheus指标
# monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge
# 瓶颈检测指标
bottlenecks_detected = Counter(
'bottlenecks_detected_total',
'Total number of bottlenecks detected',
['workflow_id', 'severity', 'type']
)
bottleneck_detection_duration = Histogram(
'bottleneck_detection_duration_seconds',
'Time spent detecting bottlenecks',
['workflow_id']
)
# 优化建议指标
recommendations_generated = Counter(
'recommendations_generated_total',
'Total number of recommendations generated',
['workflow_id', 'type', 'priority']
)
# A/B测试指标
experiments_running = Gauge(
'experiments_running',
'Number of currently running experiments'
)
experiment_success_rate = Gauge(
'experiment_success_rate',
'Fraction of successful experiments (0-1)',
['workflow_id']
)
# 使用示例
def record_bottleneck_detected(workflow_id: int, severity: str, bottle_type: str):
bottlenecks_detected.labels(
workflow_id=str(workflow_id),
severity=severity,
type=bottle_type
).inc()
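要让 Prometheus 抓取这些指标,应用需要暴露 /metrics 端点。prometheus_client 提供了 ASGI 适配,可以直接挂载到 FastAPI 应用上(示意片段,假设应用入口为 main.py 中的 app):
```python
# main.py 片段(示意)
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()
# 在 /metrics 暴露 Prometheus 指标
app.mount("/metrics", make_asgi_app())
```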
12.2 告警规则
# prometheus/alerts.yml
groups:
- name: ai_optimizer_alerts
rules:
- alert: HighBottleneckDetectionRate
expr: rate(bottlenecks_detected_total[5m]) * 60 > 10
for: 5m
labels:
severity: warning
annotations:
summary: "High bottleneck detection rate"
description: "Detecting more than 10 bottlenecks per minute"
- alert: ExperimentFailureRate
expr: (1 - experiment_success_rate) > 0.3
for: 10m
labels:
severity: critical
annotations:
summary: "High experiment failure rate"
description: "More than 30% of experiments are failing"
- alert: TooManyRunningExperiments
expr: experiments_running > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Too many concurrent experiments"
description: "More than 5 experiments running simultaneously"
AI优化器模块完整实现完成!
该模块提供了:
- ✅ 智能瓶颈检测 - 统计与ML双引擎
- ✅ 自动优化建议 - 多维度分析与推荐
- ✅ A/B测试框架 - 安全验证优化效果
- ✅ 自动化调度 - Celery后台任务
- ✅ 完整API - RESTful接口
- ✅ 前端集成 - React组件
- ✅ 监控告警 - Prometheus指标
- ✅ 部署配置 - Docker容器化
AI优化器模块总结
核心功能
这是一个自动化工作流性能优化系统,通过AI技术实现:
1. 智能瓶颈检测 🔍
- 统计检测器:分析CPU、内存、执行时间、错误率等指标
- ML检测器:使用孤立森林算法检测异常模式
- 自动评估严重程度(低/中/高/危急)和置信度
2. 自动优化建议 💡
- 针对检测到的瓶颈生成可执行的优化方案
- 包括:并行化、资源调整、批处理、缓存、算法优化等
- 提供实施步骤、预期收益、风险评估和工作量估算
3. A/B测试验证 🧪
- 自动创建对照组/实验组
- 统计显著性检验(t-test、卡方检验)
- 智能流量分配和早停机制
- 安全推广获胜配置
4. 自动化运维 ⚙️
- Celery定时任务:每小时检测、每6小时生成建议
- 实验自动调度和轮换
- 数据清理和ML模型重训练
5. 完整技术栈 🛠️
- 后端: FastAPI + SQLAlchemy + Celery
- 前端: React + Material-UI
- 数据库: PostgreSQL
- 监控: Prometheus + Grafana
- 部署: Docker Compose
工作流程
瓶颈检测 → 生成建议 → 人工审批 → A/B测试 → 验证结果 → 自动推广
价值
- 🚀 自动发现性能问题
- 🎯 数据驱动的优化决策
- ✅ 风险可控的渐进式改进
- 📊 持续监控和迭代优化
这是一个生产级的智能运维系统,让工作流性能优化从人工经验驱动转向AI自动化驱动。