Part 22: AI-Driven Workflow Optimization: Automatic Performance Bottleneck Detection

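📋 Contents

  1. Overview
  2. System architecture design
  3. Execution data collection system
  4. Performance bottleneck detection algorithms
  5. Optimization recommendation engine
  6. Automated optimization experiments (A/B testing)
  7. Machine learning model training
  8. Case studies
  9. Best practices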

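1. Overview

1.1 Why AI-driven workflow optimization?

Traditional workflow optimization relies on people manually analyzing execution logs and performance metrics. This approach has several problems:

  • Inefficient: analyzing large volumes of data takes a lot of time
  • Expertise-dependent: it relies on expert experience and is hard to standardize
  • Reactive: problems are found and fixed only after they occur
  • Locally focused: bottlenecks that span the whole workflow are hard to spot

An AI-driven optimization system can provide:

  • Automatic detection: monitor in real time and identify performance problems automatically
  • Intelligent analysis: use machine learning to recognize complex patterns
  • Proactive optimization: predict problems and optimize before they occur
  • Continuous improvement: validate each optimization with A/B testing

1.2 Technology stack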

# requirements.txt
fastapi==0.104.1
sqlalchemy==2.0.23
pandas==2.1.3
numpy==1.26.2
scikit-learn==1.3.2
xgboost==2.0.2
tensorflow==2.15.0  # optional, for deep learning
prometheus-client==0.19.0
redis==5.0.1
celery==5.3.4
plotly==5.18.0
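statsmodels==0.14.0
scipy    # imported directly by the statistical detector (section 4.1)
psutil   # imported by the metrics collector (section 3.1)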

2. System architecture design

2.1 Overall architecture

# architecture/ai_optimizer_architecture.py

class AIOptimizerArchitecture:
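    """
    AI optimizer architecture
    
    Components:
    1. Data collection layer: collects workflow execution data
    2. Storage layer: time-series database + feature store
    3. Analysis layer: bottleneck detection + root cause analysis
    4. Optimization layer: recommendation generation + A/B testing
    5. Learning layer: model training + knowledge updates
    """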
    
    def __init__(self):
        self.components = {
            "data_collection": {
                "collectors": [
                    "ExecutionMetricsCollector",
                    "ResourceMetricsCollector",
                    "DependencyMetricsCollector"
                ],
                "storage": "TimescaleDB",
                "streaming": "Kafka"
            },
            
            "bottleneck_detection": {
                "algorithms": [
                    "StatisticalDetector",
                    "AnomalyDetector",
                    "PatternMatcher",
                    "MLClassifier"
                ],
                "models": [
                    "IsolationForest",
                    "LSTM-Autoencoder",
                    "XGBoost"
                ]
            },
            
            "optimization_engine": {
                "strategies": [
                    "ParameterTuning",
                    "ResourceAllocation",
                    "ParallelizationOptimization",
                    "DependencyOptimization"
                ],
                "ab_testing": "ExperimentFramework"
            },
            
            "ml_training": {
                "features": "FeatureStore",
                "training": "TrainingPipeline",
                "serving": "ModelServing",
                "monitoring": "ModelMonitoring"
            }
        }
    
    def get_data_flow(self):
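        """Data flow design"""
        return """
        Workflow execution
            ↓
        [Data collection layer]
            ├─ Execution metrics (duration, success rate, etc.)
            ├─ Resource metrics (CPU, memory, I/O, etc.)
            └─ Dependency metrics (DAG structure, concurrency, etc.)
            ↓
        [Time-series database + feature store]
            ↓
        [Bottleneck detection]
            ├─ Statistical analysis (mean, variance, quantiles)
            ├─ Anomaly detection (Isolation Forest, LOF)
            ├─ Pattern recognition (clustering, association rules)
            └─ ML classification (XGBoost, neural networks)
            ↓
        [Root cause analysis]
            ├─ Correlation analysis
            ├─ Causal inference
            └─ Impact assessment
            ↓
        [Recommendation generation]
            ├─ Parameter tuning recommendations
            ├─ Resource configuration recommendations
            ├─ Architecture recommendations
            └─ Business logic recommendations
            ↓
        [A/B test validation]
            ├─ Experiment design
            ├─ Traffic allocation
            ├─ Effect evaluation
            └─ Automatic rollout/rollback
            ↓
        [Model update]
            └─ Continuous learning
        """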


# Print the architecture
if __name__ == "__main__":
    arch = AIOptimizerArchitecture()
    print(arch.get_data_flow())

2.2 Database design

# models/ai_optimizer_models.py
from sqlalchemy import Column, Integer, String, Float, DateTime, JSON, Boolean, Text, ForeignKey, Index
# declarative_base lives in sqlalchemy.orm in SQLAlchemy 2.0; the old
# sqlalchemy.ext.declarative location is deprecated
from sqlalchemy.orm import declarative_base, relationship
from datetime import datetime

Base = declarative_base()

class WorkflowExecutionMetrics(Base):
    """Workflow execution metrics"""
    __tablename__ = "workflow_execution_metrics"
    
    id = Column(Integer, primary_key=True)
    execution_id = Column(Integer, ForeignKey("workflow_executions.id"), nullable=False)
    workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
    
    # Timing metrics
    start_time = Column(DateTime, nullable=False)
    end_time = Column(DateTime)
    duration_seconds = Column(Float)
    
    # Task metrics
    total_tasks = Column(Integer)
    completed_tasks = Column(Integer)
    failed_tasks = Column(Integer)
    retried_tasks = Column(Integer)
    
    # Resource metrics
    avg_cpu_percent = Column(Float)
    max_cpu_percent = Column(Float)
    avg_memory_mb = Column(Float)
    max_memory_mb = Column(Float)
    total_io_read_mb = Column(Float)
    total_io_write_mb = Column(Float)
    
    # Concurrency metrics
    max_parallel_tasks = Column(Integer)
    avg_parallel_tasks = Column(Float)
    
    # Data metrics
    input_data_size_mb = Column(Float)
    output_data_size_mb = Column(Float)
    
    # Quality metrics
    success_rate = Column(Float)
    error_rate = Column(Float)
    
    # Cost metrics
    estimated_cost = Column(Float)
    
    # Extension metrics (JSON)
    custom_metrics = Column(JSON)
    
    created_at = Column(DateTime, default=datetime.utcnow)
    
    # Relationships
    execution = relationship("WorkflowExecution", back_populates="metrics")
    
    # Indexes
    __table_args__ = (
        Index('idx_metrics_workflow_time', 'workflow_id', 'start_time'),
        Index('idx_metrics_execution', 'execution_id'),
        Index('idx_metrics_duration', 'duration_seconds'),
    )


class TaskExecutionMetrics(Base):
    """Task execution metrics"""
    __tablename__ = "task_execution_metrics"
    
    id = Column(Integer, primary_key=True)
    task_execution_id = Column(Integer, ForeignKey("task_executions.id"), nullable=False)
    workflow_execution_id = Column(Integer, ForeignKey("workflow_executions.id"), nullable=False)
    task_id = Column(Integer, ForeignKey("workflow_tasks.id"), nullable=False)
    
    # Timing metrics
    start_time = Column(DateTime, nullable=False)
    end_time = Column(DateTime)
    duration_seconds = Column(Float)
    queue_time_seconds = Column(Float)  # time spent waiting in the queue
    
    # Resource metrics
    cpu_percent = Column(Float)
    memory_mb = Column(Float)
    io_read_mb = Column(Float)
    io_write_mb = Column(Float)
    network_in_mb = Column(Float)
    network_out_mb = Column(Float)
    
    # Data metrics
    input_records = Column(Integer)
    output_records = Column(Integer)
    input_size_mb = Column(Float)
    output_size_mb = Column(Float)
    
    # Retry metrics
    retry_count = Column(Integer, default=0)
    retry_delay_seconds = Column(Float)
    
    # Dependency metrics
    upstream_tasks_count = Column(Integer)
    upstream_wait_seconds = Column(Float)
    
    # Extension metrics
    custom_metrics = Column(JSON)
    
    created_at = Column(DateTime, default=datetime.utcnow)
    
    # Indexes
    __table_args__ = (
        Index('idx_task_metrics_workflow_exec', 'workflow_execution_id'),
        Index('idx_task_metrics_task', 'task_id', 'start_time'),
        Index('idx_task_metrics_duration', 'duration_seconds'),
    )


class PerformanceBottleneck(Base):
    """Performance bottleneck record"""
    __tablename__ = "performance_bottlenecks"
    
    id = Column(Integer, primary_key=True)
    workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
    execution_id = Column(Integer, ForeignKey("workflow_executions.id"))
    task_id = Column(Integer, ForeignKey("workflow_tasks.id"))
    
    # Bottleneck type
    bottleneck_type = Column(String(50), nullable=False)  # cpu, memory, io, dependency, logic
    severity = Column(String(20), nullable=False)  # low, medium, high, critical
    
    # Detection info
    detected_at = Column(DateTime, default=datetime.utcnow)
    detection_method = Column(String(50))  # statistical, anomaly, ml, pattern
    confidence_score = Column(Float)  # 0-1
    
    # Bottleneck description
    description = Column(Text)
    impact_analysis = Column(JSON)  # impact analysis
    root_cause = Column(JSON)  # root cause analysis
    
    # Metrics
    baseline_metrics = Column(JSON)  # baseline metrics
    current_metrics = Column(JSON)  # current metrics
    deviation_percent = Column(Float)  # deviation from baseline, in percent
    
    # Status
    status = Column(String(20), default="open")  # open, investigating, resolved, ignored
    resolved_at = Column(DateTime)
    resolution_notes = Column(Text)
    
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    
    # Indexes
    __table_args__ = (
        Index('idx_bottleneck_workflow', 'workflow_id', 'status'),
        Index('idx_bottleneck_severity', 'severity', 'detected_at'),
        Index('idx_bottleneck_type', 'bottleneck_type'),
    )


class OptimizationRecommendation(Base):
    """Optimization recommendation"""
    __tablename__ = "optimization_recommendations"
    
    id = Column(Integer, primary_key=True)
    bottleneck_id = Column(Integer, ForeignKey("performance_bottlenecks.id"), nullable=False)
    workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
    task_id = Column(Integer, ForeignKey("workflow_tasks.id"))
    
    # Recommendation type
    recommendation_type = Column(String(50), nullable=False)  # parameter, resource, architecture, logic
    priority = Column(String(20), nullable=False)  # low, medium, high, urgent
    
    # Recommendation content
    title = Column(String(200), nullable=False)
    description = Column(Text)
    rationale = Column(Text)  # why this change is recommended
    
    # Optimization parameters
    current_config = Column(JSON)
    recommended_config = Column(JSON)
    expected_improvement = Column(JSON)  # expected improvement
    
    # Implementation info
    implementation_difficulty = Column(String(20))  # easy, medium, hard
    estimated_effort_hours = Column(Float)
    implementation_steps = Column(JSON)
    
    # Risk assessment
    risk_level = Column(String(20))  # low, medium, high
    potential_issues = Column(JSON)
    rollback_plan = Column(JSON)
    
    # Status
    status = Column(String(20), default="pending")  # pending, approved, testing, implemented, rejected
    approved_by = Column(Integer, ForeignKey("users.id"))
    approved_at = Column(DateTime)
    
    # Outcome tracking
    ab_test_id = Column(Integer, ForeignKey("ab_experiments.id"))
    actual_improvement = Column(JSON)
    
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    
    # Indexes
    __table_args__ = (
        Index('idx_recommendation_workflow', 'workflow_id', 'status'),
        Index('idx_recommendation_priority', 'priority', 'status'),
    )


class ABExperiment(Base):
    """A/B test experiment"""
    __tablename__ = "ab_experiments"
    
    id = Column(Integer, primary_key=True)
    workflow_id = Column(Integer, ForeignKey("workflows.id"), nullable=False)
    recommendation_id = Column(Integer, ForeignKey("optimization_recommendations.id"))
    
    # Experiment info
    name = Column(String(200), nullable=False)
    description = Column(Text)
    hypothesis = Column(Text)  # hypothesis under test
    
    # Experiment configuration
    control_config = Column(JSON, nullable=False)  # control group configuration
    treatment_config = Column(JSON, nullable=False)  # treatment group configuration
    traffic_split = Column(Float, default=0.5)  # fraction of traffic sent to the treatment group
    
    # Evaluation metrics
    primary_metric = Column(String(100), nullable=False)  # primary metric
    secondary_metrics = Column(JSON)  # secondary metrics
    success_criteria = Column(JSON)  # success criteria
    
    # Experiment controls
    min_sample_size = Column(Integer, default=100)  # minimum sample size
    max_duration_days = Column(Integer, default=7)  # maximum duration in days
    early_stopping_enabled = Column(Boolean, default=True)
    
    # Experiment status
    status = Column(String(20), default="draft")  # draft, running, paused, completed, cancelled
    started_at = Column(DateTime)
    ended_at = Column(DateTime)
    
    # Experiment results
    control_group_size = Column(Integer)
    treatment_group_size = Column(Integer)
    control_metrics = Column(JSON)
    treatment_metrics = Column(JSON)
    statistical_significance = Column(Float)  # p-value
    confidence_interval = Column(JSON)
    
    # Decision
    decision = Column(String(20))  # winner_control, winner_treatment, no_difference, inconclusive
    decision_reason = Column(Text)
    auto_rollout = Column(Boolean, default=False)  # whether to roll out automatically
    
    created_by = Column(Integer, ForeignKey("users.id"))
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    
    # Indexes
    __table_args__ = (
        Index('idx_experiment_workflow', 'workflow_id', 'status'),
        Index('idx_experiment_status', 'status', 'started_at'),
    )


class MLModel(Base):
    """Machine learning model"""
    __tablename__ = "ml_models"
    
    id = Column(Integer, primary_key=True)
    
    # Model info
    name = Column(String(200), nullable=False)
    model_type = Column(String(50), nullable=False)  # classifier, regressor, anomaly_detector, recommender
    algorithm = Column(String(100), nullable=False)  # xgboost, random_forest, lstm, etc.
    purpose = Column(String(200))  # what the model is used for
    
    # Training info
    training_data_size = Column(Integer)
    feature_count = Column(Integer)
    training_duration_seconds = Column(Float)
    trained_at = Column(DateTime)
    
    # Model configuration
    hyperparameters = Column(JSON)
    feature_config = Column(JSON)
    preprocessing_config = Column(JSON)
    
    # Performance metrics
    training_metrics = Column(JSON)  # accuracy, precision, recall, f1, rmse, etc.
    validation_metrics = Column(JSON)
    test_metrics = Column(JSON)
    
    # Model artifact
    model_path = Column(String(500))  # path to the model file
    model_version = Column(String(50))
    framework = Column(String(50))  # sklearn, tensorflow, pytorch, xgboost
    
    # Deployment status
    status = Column(String(20), default="trained")  # trained, deployed, archived, deprecated
    deployed_at = Column(DateTime)
    
    # Serving statistics
    prediction_count = Column(Integer, default=0)
    avg_prediction_time_ms = Column(Float)
    error_count = Column(Integer, default=0)
    last_prediction_at = Column(DateTime)
    
    # Version management
    parent_model_id = Column(Integer, ForeignKey("ml_models.id"))
    is_active = Column(Boolean, default=False)
    
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
    
    # Indexes
    __table_args__ = (
        Index('idx_model_type_status', 'model_type', 'status'),
        Index('idx_model_active', 'is_active', 'model_type'),
    )


class FeatureStore(Base):
    """Feature store"""
    __tablename__ = "feature_store"
    
    id = Column(Integer, primary_key=True)
    
    # Entity identifier
    entity_type = Column(String(50), nullable=False)  # workflow, task, execution
    entity_id = Column(Integer, nullable=False)
    
    # Timestamp
    timestamp = Column(DateTime, nullable=False)
    
    # Features
    features = Column(JSON, nullable=False)  # dict of feature values
    
    # Metadata
    feature_version = Column(String(50))
    feature_schema = Column(JSON)
    
    created_at = Column(DateTime, default=datetime.utcnow)
    
    # Indexes
    __table_args__ = (
        Index('idx_feature_entity', 'entity_type', 'entity_id', 'timestamp'),
        Index('idx_feature_timestamp', 'timestamp'),
    )
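
The traffic_split column in ABExperiment stores the fraction of executions routed to the treatment group. The following is a minimal sketch of one way to honor that fraction, assuming a hash-based assignment that this article does not prescribe; assign_group and its parameters are hypothetical names introduced only for illustration.

```python
import hashlib


def assign_group(execution_id: int, experiment_id: int, traffic_split: float = 0.5) -> str:
    """Deterministically assign an execution to 'treatment' or 'control'.

    Hypothetical helper: hashing (experiment_id, execution_id) keeps the
    assignment stable across retries without storing extra state.
    """
    if not 0.0 <= traffic_split <= 1.0:
        raise ValueError("traffic_split must be between 0 and 1")
    key = f"{experiment_id}:{execution_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < traffic_split * 10_000 else "control"


# Share of executions that land in the treatment group for a 50/50 split
groups = [assign_group(execution_id=i, experiment_id=1, traffic_split=0.5) for i in range(10_000)]
treatment_share = groups.count("treatment") / len(groups)
print(f"treatment share: {treatment_share:.3f}")
```

Because the assignment depends only on the two IDs, a retried execution stays in the same group, and including experiment_id in the key keeps assignments in different experiments independent of each other.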

3. Execution data collection system

3.1 Metrics collector

# services/metrics_collector.py
from typing import Dict, Any, List, Optional
from datetime import datetime
import psutil
import time
from sqlalchemy.orm import Session
from models.ai_optimizer_models import WorkflowExecutionMetrics, TaskExecutionMetrics
import logging

logger = logging.getLogger(__name__)


class MetricsCollector:
    """
    Metrics collector
    
    Collects execution metrics for workflows and tasks
    """
    
    def __init__(self, db: Session):
        self.db = db
        self._active_measurements = {}  # measurements currently in progress
    
    def start_workflow_measurement(
        self,
        execution_id: int,
        workflow_id: int
    ) -> str:
        """
        Start measuring a workflow execution
        
        Returns:
            measurement_id: ID of the measurement
        """
        measurement_id = f"wf_{execution_id}_{int(time.time())}"
        
        self._active_measurements[measurement_id] = {
            "type": "workflow",
            "execution_id": execution_id,
            "workflow_id": workflow_id,
            "start_time": datetime.utcnow(),
            "start_cpu": psutil.cpu_percent(interval=0.1),
            "start_memory": psutil.virtual_memory().percent,
            "start_io": psutil.disk_io_counters(),
            "task_metrics": []
        }
        
        logger.info(f"Started workflow measurement: {measurement_id}")
        return measurement_id
    
    def end_workflow_measurement(
        self,
        measurement_id: str,
        additional_metrics: Optional[Dict[str, Any]] = None
    ) -> WorkflowExecutionMetrics:
        """
        Finish a workflow measurement and persist the metrics
        """
        if measurement_id not in self._active_measurements:
            raise ValueError(f"Measurement not found: {measurement_id}")
        
        measurement = self._active_measurements[measurement_id]
        end_time = datetime.utcnow()
        
        # Sample CPU, memory, and disk I/O at the end of the run
        end_cpu = psutil.cpu_percent(interval=0.1)
        end_memory = psutil.virtual_memory().percent
        end_io = psutil.disk_io_counters()
        
        # virtual_memory().percent is a percentage of total RAM, while the
        # *_memory_mb columns store megabytes, so convert before saving
        total_memory_mb = psutil.virtual_memory().total / (1024 * 1024)
        start_memory_mb = measurement["start_memory"] / 100 * total_memory_mb
        end_memory_mb = end_memory / 100 * total_memory_mb
        
        # Disk I/O during the run, in MB
        io_read_mb = (end_io.read_bytes - measurement["start_io"].read_bytes) / (1024 * 1024)
        io_write_mb = (end_io.write_bytes - measurement["start_io"].write_bytes) / (1024 * 1024)
        
        # Task statistics (empty unless callers append entries to
        # measurement["task_metrics"] while the workflow runs)
        task_metrics = measurement.get("task_metrics", [])
        total_tasks = len(task_metrics)
        completed_tasks = sum(1 for tm in task_metrics if tm.get("status") == "completed")
        failed_tasks = sum(1 for tm in task_metrics if tm.get("status") == "failed")
        retried_tasks = sum(1 for tm in task_metrics if tm.get("retry_count", 0) > 0)
        
        # Build the metrics record; CPU and memory are two-point estimates
        # based on the start and end samples only
        metrics = WorkflowExecutionMetrics(
            execution_id=measurement["execution_id"],
            workflow_id=measurement["workflow_id"],
            start_time=measurement["start_time"],
            end_time=end_time,
            duration_seconds=(end_time - measurement["start_time"]).total_seconds(),
            
            total_tasks=total_tasks,
            completed_tasks=completed_tasks,
            failed_tasks=failed_tasks,
            retried_tasks=retried_tasks,
            
            avg_cpu_percent=(measurement["start_cpu"] + end_cpu) / 2,
            max_cpu_percent=max(measurement["start_cpu"], end_cpu),
            avg_memory_mb=(start_memory_mb + end_memory_mb) / 2,
            max_memory_mb=max(start_memory_mb, end_memory_mb),
            total_io_read_mb=io_read_mb,
            total_io_write_mb=io_write_mb,
            
            success_rate=completed_tasks / total_tasks if total_tasks > 0 else 0,
            error_rate=failed_tasks / total_tasks if total_tasks > 0 else 0,
            
            custom_metrics=additional_metrics or {}
        )
        
        self.db.add(metrics)
        self.db.commit()
        
        # Discard the in-progress measurement
        del self._active_measurements[measurement_id]
        
        logger.info(f"Ended workflow measurement: {measurement_id}")
        return metrics
    
    def collect_task_metrics(
        self,
        task_execution_id: int,
        workflow_execution_id: int,
        task_id: int,
        start_time: datetime,
        end_time: datetime,
        resource_usage: Dict[str, Any],
        additional_metrics: Optional[Dict[str, Any]] = None
    ) -> TaskExecutionMetrics:
        """
        Record execution metrics for a single task
        """
        duration = (end_time - start_time).total_seconds()
        
        metrics = TaskExecutionMetrics(
            task_execution_id=task_execution_id,
            workflow_execution_id=workflow_execution_id,
            task_id=task_id,
            start_time=start_time,
            end_time=end_time,
            duration_seconds=duration,
            
            cpu_percent=resource_usage.get("cpu_percent", 0),
            memory_mb=resource_usage.get("memory_mb", 0),
            io_read_mb=resource_usage.get("io_read_mb", 0),
            io_write_mb=resource_usage.get("io_write_mb", 0),
            
            input_records=resource_usage.get("input_records", 0),
            output_records=resource_usage.get("output_records", 0),
            
            retry_count=resource_usage.get("retry_count", 0),
            
            custom_metrics=additional_metrics or {}
        )
        
        self.db.add(metrics)
        self.db.commit()
        
        return metrics
    
    def get_historical_metrics(
        self,
        workflow_id: int,
        days: int = 30,
        task_id: Optional[int] = None
    ) -> Dict[str, Any]:
        """
        Return aggregate statistics over historical metrics
        """
        from datetime import timedelta
        from sqlalchemy import func
        
        cutoff_time = datetime.utcnow() - timedelta(days=days)
        
        # Workflow-level metrics
        wf_metrics = self.db.query(
            func.avg(WorkflowExecutionMetrics.duration_seconds).label("avg_duration"),
            func.max(WorkflowExecutionMetrics.duration_seconds).label("max_duration"),
            func.min(WorkflowExecutionMetrics.duration_seconds).label("min_duration"),
            func.stddev(WorkflowExecutionMetrics.duration_seconds).label("stddev_duration"),
            func.avg(WorkflowExecutionMetrics.success_rate).label("avg_success_rate"),
            func.count(WorkflowExecutionMetrics.id).label("execution_count")
        ).filter(
            WorkflowExecutionMetrics.workflow_id == workflow_id,
            WorkflowExecutionMetrics.start_time >= cutoff_time
        ).first()
        
        result = {
            "workflow": {
                "avg_duration": float(wf_metrics.avg_duration or 0),
                "max_duration": float(wf_metrics.max_duration or 0),
                "min_duration": float(wf_metrics.min_duration or 0),
                "stddev_duration": float(wf_metrics.stddev_duration or 0),
                "avg_success_rate": float(wf_metrics.avg_success_rate or 0),
                "execution_count": int(wf_metrics.execution_count or 0)
            }
        }
        
        # Task-level metrics (only when task_id is given)
        if task_id is not None:
            task_metrics = self.db.query(
                func.avg(TaskExecutionMetrics.duration_seconds).label("avg_duration"),
                func.max(TaskExecutionMetrics.duration_seconds).label("max_duration"),
                func.min(TaskExecutionMetrics.duration_seconds).label("min_duration"),
                func.stddev(TaskExecutionMetrics.duration_seconds).label("stddev_duration"),
                func.avg(TaskExecutionMetrics.cpu_percent).label("avg_cpu"),
                func.avg(TaskExecutionMetrics.memory_mb).label("avg_memory"),
                func.sum(TaskExecutionMetrics.retry_count).label("total_retries"),
                func.count(TaskExecutionMetrics.id).label("execution_count")
            ).filter(
                TaskExecutionMetrics.task_id == task_id,
                TaskExecutionMetrics.start_time >= cutoff_time
            ).first()
            
            result["task"] = {
                "avg_duration": float(task_metrics.avg_duration or 0),
                "max_duration": float(task_metrics.max_duration or 0),
                "min_duration": float(task_metrics.min_duration or 0),
                "stddev_duration": float(task_metrics.stddev_duration or 0),
                "avg_cpu": float(task_metrics.avg_cpu or 0),
                "avg_memory": float(task_metrics.avg_memory or 0),
                "total_retries": int(task_metrics.total_retries or 0),
                "execution_count": int(task_metrics.execution_count or 0)
            }
        
        return result


# Usage example
if __name__ == "__main__":
    from database import SessionLocal
    
    db = SessionLocal()
    collector = MetricsCollector(db)
    
    # Start measuring the workflow
    measurement_id = collector.start_workflow_measurement(
        execution_id=1,
        workflow_id=1
    )
    
    # Simulate a workflow run
    time.sleep(2)
    
    # End the measurement
    metrics = collector.end_workflow_measurement(
        measurement_id,
        additional_metrics={"custom_field": "value"}
    )
    
    print("Collected metrics:")
    print(f"  Duration: {metrics.duration_seconds}s")
    print(f"  CPU: {metrics.avg_cpu_percent}%")
    print(f"  Memory: {metrics.avg_memory_mb}MB")
    
    # Fetch historical metrics
    historical = collector.get_historical_metrics(workflow_id=1, days=7)
    print("\nHistorical metrics:")
    print(f"  Avg duration: {historical['workflow']['avg_duration']}s")
    print(f"  Success rate: {historical['workflow']['avg_success_rate'] * 100}%")
    
    db.close()
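
The collector above estimates average and peak usage from only two samples, one at the start and one at the end of a run, so spikes in between go unseen. A minimal sketch of an alternative that aggregates periodic samples is shown here. RunningStats and ResourceSampler are hypothetical names, not part of the article's code, and the sample source is passed in as a callable so that the sketch runs without psutil; in practice that callable could wrap the psutil readings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class RunningStats:
    """Running mean and maximum over a stream of samples."""
    count: int = 0
    total: float = 0.0
    maximum: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        # The first sample initializes the maximum
        self.maximum = value if self.count == 1 else max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


class ResourceSampler:
    """Aggregates (cpu_percent, memory_mb) samples taken during a run."""

    def __init__(self, read_sample: Callable[[], Tuple[float, float]]):
        self._read_sample = read_sample
        self.cpu = RunningStats()
        self.memory = RunningStats()

    def sample(self) -> None:
        """Take one reading; call this periodically while the workflow runs."""
        cpu_percent, memory_mb = self._read_sample()
        self.cpu.add(cpu_percent)
        self.memory.add(memory_mb)

    def summary(self) -> Dict[str, float]:
        """Return values shaped like the WorkflowExecutionMetrics columns."""
        return {
            "avg_cpu_percent": self.cpu.mean,
            "max_cpu_percent": self.cpu.maximum,
            "avg_memory_mb": self.memory.mean,
            "max_memory_mb": self.memory.maximum,
        }


# Canned readings stand in for real psutil samples
readings = iter([(10.0, 100.0), (50.0, 300.0), (30.0, 200.0)])
sampler = ResourceSampler(lambda: next(readings))
for _ in range(3):
    sampler.sample()
summary = sampler.summary()
print(summary)
```

Keeping only a count, a sum, and a maximum means memory use stays constant however long the workflow runs.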

3.2 Real-time stream processing

# services/streaming_metrics.py
from typing import Dict, Any, Callable
import json
import redis
from datetime import datetime
import logging

logger = logging.getLogger(__name__)


class StreamingMetricsProcessor:
    """
    Real-time metrics stream processor
    
    Uses Redis Streams to process metrics in real time
    """
    
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.stream_name = "workflow_metrics_stream"
        self.consumer_group = "metrics_processors"
        
        # Create the consumer group (if it does not already exist)
        try:
            self.redis.xgroup_create(
                self.stream_name,
                self.consumer_group,
                id='0',
                mkstream=True
            )
        except redis.exceptions.ResponseError as e:
            if "BUSYGROUP" not in str(e):
                raise
    
    def publish_metric(
        self,
        metric_type: str,
        entity_id: int,
        data: Dict[str, Any]
    ):
        """
        Publish a metric to the stream
        """
        message = {
            "metric_type": metric_type,
            "entity_id": str(entity_id),
            "timestamp": datetime.utcnow().isoformat(),
            "data": json.dumps(data)
        }
        
        message_id = self.redis.xadd(self.stream_name, message)
        logger.debug(f"Published metric {message_id}: {metric_type}")
        return message_id
    
    def consume_metrics(
        self,
        consumer_name: str,
        handler: Callable[[Dict[str, Any]], None],
        block_ms: int = 5000,
        count: int = 10
    ):
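        """
        Consume the metrics stream
        
        Args:
            consumer_name: name of this consumer
            handler: callback invoked for each metric
            block_ms: how long to block waiting for messages (milliseconds)
            count: maximum number of messages to read per call
        """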
        while True:
            try:
                # Read messages
                messages = self.redis.xreadgroup(
                    self.consumer_group,
                    consumer_name,
                    {self.stream_name: '>'},
                    count=count,
                    block=block_ms
                )
                
                if not messages:
                    continue
                
                for stream_name, stream_messages in messages:
                    for message_id, message_data in stream_messages:
                        try:
                            # Parse the message
                            metric = {
                                "id": message_id.decode(),
                                "metric_type": message_data[b"metric_type"].decode(),
                                "entity_id": int(message_data[b"entity_id"].decode()),
                                "timestamp": message_data[b"timestamp"].decode(),
                                "data": json.loads(message_data[b"data"].decode())
                            }
                            
                            # Handle the message
                            handler(metric)
                            
                            # Acknowledge the message
                            self.redis.xack(
                                self.stream_name,
                                self.consumer_group,
                                message_id
                            )
                            
                        except Exception as e:
                            logger.error(f"Error processing message {message_id}: {e}")
                            # The message is not acknowledged and stays pending;
                            # retry logic or a dead-letter queue could be added here
            
            except KeyboardInterrupt:
                logger.info("Stopping consumer...")
                break
            except Exception as e:
                logger.error(f"Consumer error: {e}")
                import time
                time.sleep(1)
    
    def get_stream_info(self) -> Dict[str, Any]:
        """Return basic information about the stream"""
        info = self.redis.xinfo_stream(self.stream_name)
        
        # redis-py returns the XINFO STREAM reply as a dict keyed by str
        return {
            "length": info["length"],
            "groups": info["groups"],
            "first_entry": info.get("first-entry"),
            "last_entry": info.get("last-entry")
        }


# Usage example
if __name__ == "__main__":
    import redis
    import threading
    
    # Connect to Redis
    r = redis.Redis(host='localhost', port=6379, db=0)
    
    processor = StreamingMetricsProcessor(r)
    
    # Publish metrics
    def publish_test_metrics():
        for i in range(10):
            processor.publish_metric(
                metric_type="task_duration",
                entity_id=i,
                data={
                    "duration_seconds": 5.5 + i,
                    "cpu_percent": 50 + i,
                    "memory_mb": 100 + i * 10
                }
            )
            import time
            time.sleep(0.5)
    
    # Consume metrics
    def consume_test_metrics():
        def handler(metric):
            print(f"Received metric: {metric['metric_type']} "
                  f"for entity {metric['entity_id']}")
            print(f"  Data: {metric['data']}")
        
        processor.consume_metrics(
            consumer_name="test_consumer",
            handler=handler
        )
    
    # Run the producer in a background thread and the consumer in the main
    # thread, so that Ctrl+C (KeyboardInterrupt) reaches consume_metrics
    producer_thread = threading.Thread(target=publish_test_metrics, daemon=True)
    producer_thread.start()
    
    # Blocks until interrupted
    consume_test_metrics()
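
The consumer loop notes that retry logic or a dead-letter queue could be added when a handler fails. A minimal sketch of that idea follows, kept independent of Redis so it can run on its own. process_with_retry and the in-memory dead_letters list are hypothetical; with Redis Streams the list could be replaced by a second stream written with xadd, followed by xack on the original message.

```python
from typing import Any, Callable, Dict, List, Optional


def process_with_retry(
    metric: Dict[str, Any],
    handler: Callable[[Dict[str, Any]], None],
    dead_letters: List[Dict[str, Any]],
    max_attempts: int = 3,
) -> bool:
    """Call handler up to max_attempts times; park the metric on final failure.

    Returns True when the handler succeeded, False when the metric was
    appended to dead_letters.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            handler(metric)
            return True
        except Exception as exc:  # keep the consumer alive on any handler error
            last_error = exc
    dead_letters.append(
        {"metric": metric, "error": repr(last_error), "attempts": max_attempts}
    )
    return False


# A handler that fails twice and then succeeds
attempts = {"count": 0}

def flaky_handler(metric: Dict[str, Any]) -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")

# A handler that never succeeds
def broken_handler(metric: Dict[str, Any]) -> None:
    raise ValueError("cannot parse metric")

dead_letters: List[Dict[str, Any]] = []
ok = process_with_retry({"id": "1-0"}, flaky_handler, dead_letters)
failed = process_with_retry({"id": "2-0"}, broken_handler, dead_letters, max_attempts=2)
print(ok, failed, len(dead_letters))
```

Returning a boolean lets the caller decide whether to acknowledge the original message: acknowledging after either outcome avoids reprocessing, since a failed metric is already preserved in the dead-letter store.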

4. Performance bottleneck detection algorithms

4.1 Statistical detector
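
The detector in this section is built around the Z-score, z = (x - mean) / std, and flags a bottleneck when several of the most recent values exceed a threshold. A minimal standard-library sketch of that test follows. The helper names and the sample durations are made up for illustration. One choice in the sketch is an assumption rather than something the article prescribes: the mean and standard deviation come from the values before the recent window, so a sustained shift cannot inflate the baseline it is compared against.

```python
import statistics
from typing import List, Sequence


def recent_z_scores(values: Sequence[float], window: int = 5) -> List[float]:
    """Z-scores of the last `window` values against the earlier baseline."""
    baseline, recent = values[:-window], values[-window:]
    if len(baseline) < 2:
        return []
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline)  # population std, as np.std computes by default
    if std == 0:
        return []
    return [(x - mean) / std for x in recent]


def is_sustained_shift(
    values: Sequence[float],
    window: int = 5,
    threshold: float = 3.0,
    min_hits: int = 3,
) -> bool:
    """True when at least min_hits of the recent values exceed the threshold."""
    scores = recent_z_scores(values, window)
    return sum(1 for z in scores if abs(z) > threshold) >= min_hits


# Ten normal runs of about 10 seconds each
history = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]

slow_runs = history + [10, 30, 31, 29, 32]    # four recent runs take about 3x longer
steady_runs = history + [10, 11, 9, 10, 12]   # recent runs look like the history

print(is_sustained_shift(slow_runs))    # True
print(is_sustained_shift(steady_runs))  # False
```

Requiring several hits inside the window, instead of reacting to a single extreme value, trades a slower alert for fewer false alarms from one-off slow runs.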

# services/bottleneck_detectors/statistical_detector.py
from typing import Dict, Any, List, Optional
import numpy as np
from scipy import stats
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
    WorkflowExecutionMetrics,
    TaskExecutionMetrics,
    PerformanceBottleneck
)
from datetime import datetime, timedelta
import logging

logger = logging.getLogger(__name__)


class StatisticalBottleneckDetector:
    """
    Statistical bottleneck detector
    
    Uses statistical methods to detect performance anomalies
    """
    
    def __init__(self, db: Session):
        self.db = db
        self.confidence_level = 0.95  # confidence level
        self.z_score_threshold = 3.0  # Z-score threshold
    
    def detect_workflow_bottlenecks(
        self,
        workflow_id: int,
        lookback_days: int = 30
    ) -> List[PerformanceBottleneck]:
        """
        Detect workflow-level bottlenecks
        """
        bottlenecks = []
        
        # Fetch historical data, oldest first, so that the "recent" values
        # are the last elements of each series
        cutoff_time = datetime.utcnow() - timedelta(days=lookback_days)
        
        metrics = self.db.query(WorkflowExecutionMetrics).filter(
            WorkflowExecutionMetrics.workflow_id == workflow_id,
            WorkflowExecutionMetrics.start_time >= cutoff_time
        ).order_by(WorkflowExecutionMetrics.start_time).all()
        
        if len(metrics) < 10:
            logger.warning(f"Not enough data for workflow {workflow_id}")
            return bottlenecks
        
        # Extract metric series (compare with None so legitimate zeros are kept)
        durations = [m.duration_seconds for m in metrics if m.duration_seconds is not None]
        cpu_usages = [m.avg_cpu_percent for m in metrics if m.avg_cpu_percent is not None]
        memory_usages = [m.avg_memory_mb for m in metrics if m.avg_memory_mb is not None]
        success_rates = [m.success_rate for m in metrics if m.success_rate is not None]
        
        # Detect duration anomalies
        duration_bottlenecks = self._detect_outliers(
            data=durations,
            metric_name="duration",
            workflow_id=workflow_id,
            baseline_metrics={"mean": np.mean(durations), "std": np.std(durations)}
        )
        bottlenecks.extend(duration_bottlenecks)
        
        # Detect CPU usage anomalies
        cpu_bottlenecks = self._detect_outliers(
            data=cpu_usages,
            metric_name="cpu",
            workflow_id=workflow_id,
            baseline_metrics={"mean": np.mean(cpu_usages), "std": np.std(cpu_usages)}
        )
        bottlenecks.extend(cpu_bottlenecks)
        
        # Detect a drop in success rate
        if success_rates:
            success_rate_bottlenecks = self._detect_degradation(
                data=success_rates,
                metric_name="success_rate",
                workflow_id=workflow_id,
                threshold=0.9,
                direction="lower"
            )
            bottlenecks.extend(success_rate_bottlenecks)
        
        return bottlenecks
    
    def _detect_outliers(
        self,
        data: List[float],
        metric_name: str,
        workflow_id: int,
        task_id: Optional[int] = None,
        baseline_metrics: Optional[Dict[str, float]] = None
    ) -> List[PerformanceBottleneck]:
        """
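        Detect outliers using Z-scores.
        """
        if len(data) < 3:
            return []
        
        bottlenecks = []
        
        # Compute summary statistics. Note: they cover the whole series,
        # including the recent window checked below, so a sustained shift
        # inflates std and lowers its own Z-scores
        mean = np.mean(data)
        std = np.std(data)
        
        if std == 0:
            return []
        
        # Compute Z-scores
        z_scores = [(x - mean) / std for x in data]
        
        # Look at the 5 most recent values
        recent_values = data[-5:]
        recent_z_scores = z_scores[-5:]
        
        # Flag only if the recent values are persistently abnormal
        outlier_count = sum(1 for z in recent_z_scores if abs(z) > self.z_score_threshold)
        
        if outlier_count >= 3:  # at least 3 outliers
            # Determine the bottleneck type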
            if metric_name == "duration":
                bottleneck_type = "performance"
            elif metric_name == "cpu":
                bottleneck_type = "cpu"
            elif metric_name == "memory":
                bottleneck_type = "memory"
            else:
                bottleneck_type = "unknown"
            
            # Deviation of the latest value from the mean, in percent
            latest_value = recent_values[-1]
            deviation = ((latest_value - mean) / mean) * 100 if mean != 0 else 0
            
            # Determine severity
            if abs(deviation) > 100:
                severity = "critical"
            elif abs(deviation) > 50:
                severity = "high"
            elif abs(deviation) > 20:
                severity = "medium"
            else:
                severity = "low"
            
            bottleneck = PerformanceBottleneck(
                workflow_id=workflow_id,
                task_id=task_id,
                bottleneck_type=bottleneck_type,
                severity=severity,
                detection_method="statistical",
                confidence_score=min(outlier_count / 5.0, 1.0),
                description=f"{metric_name} shows statistical outliers",
                baseline_metrics=baseline_metrics or {"mean": mean, "std": std},
                current_metrics={"latest": latest_value, "recent_avg": np.mean(recent_values)},
                deviation_percent=abs(deviation),
                impact_analysis={
                    "affected_metric": metric_name,
                    "z_score": max(abs(z) for z in recent_z_scores),
                    "outlier_count": outlier_count
                }
            )
            
            bottlenecks.append(bottleneck)
        
        return bottlenecks
    
    def _detect_degradation(
        self,
        data: List[float],
        metric_name: str,
        workflow_id: int,
        threshold: float,
        direction: str = "lower"  # "lower" or "upper"
    ) -> List[PerformanceBottleneck]:
        """
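        Detect performance degradation.
        """
        if len(data) < 10:
            return []
        
        bottlenecks = []
        
        # Split into two segments: historical baseline and recent data
        split_point = len(data) * 2 // 3
        baseline_data = data[:split_point]
        recent_data = data[split_point:]
        
        baseline_mean = np.mean(baseline_data)
        recent_mean = np.mean(recent_data)
        
        # Two-sample t-test (two-sided)
        t_stat, p_value = stats.ttest_ind(baseline_data, recent_data)
        
        # Significant if p < alpha, where alpha = 1 - confidence level
        is_significant = p_value < (1 - self.confidence_level)
        
        # The test is two-sided, so also require the mean to have moved in the
        # bad direction; otherwise a significant improvement that is still
        # beyond the threshold would be reported as a degradation
        if direction == "lower":
            is_degraded = recent_mean < threshold and recent_mean < baseline_mean and is_significant
        else:
            is_degraded = recent_mean > threshold and recent_mean > baseline_mean and is_significant
        
        if is_degraded:
            # Guard against a zero baseline
            degradation_percent = (
                abs((recent_mean - baseline_mean) / baseline_mean) * 100
                if baseline_mean != 0 else 0.0
            )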
            
            if degradation_percent > 30:
                severity = "critical"
            elif degradation_percent > 15:
                severity = "high"
            else:
                severity = "medium"
            
            bottleneck = PerformanceBottleneck(
                workflow_id=workflow_id,
                bottleneck_type="degradation",
                severity=severity,
                detection_method="statistical",
                confidence_score=1 - p_value,
                description=f"{metric_name} has degraded significantly",
                baseline_metrics={"mean": baseline_mean, "threshold": threshold},
                current_metrics={"mean": recent_mean},
                deviation_percent=degradation_percent,
                impact_analysis={
                    "t_statistic": t_stat,
                    "p_value": p_value,
                    "degradation_percent": degradation_percent
                }
            )
            
            bottlenecks.append(bottleneck)
        
        return bottlenecks
    
    def detect_task_bottlenecks(
        self,
        workflow_id: int,
        lookback_days: int = 30
    ) -> List[PerformanceBottleneck]:
        """
        Detect task-level bottlenecks.
        
        Identifies the slowest tasks in a workflow.
        """
        cutoff_time = datetime.utcnow() - timedelta(days=lookback_days)
        
        # Aggregate metrics per task
        from sqlalchemy import func
        
        task_stats = self.db.query(
            TaskExecutionMetrics.task_id,
            func.avg(TaskExecutionMetrics.duration_seconds).label("avg_duration"),
            func.max(TaskExecutionMetrics.duration_seconds).label("max_duration"),
            func.stddev(TaskExecutionMetrics.duration_seconds).label("std_duration"),
            func.count(TaskExecutionMetrics.id).label("execution_count")
        ).join(
            WorkflowExecutionMetrics,
            TaskExecutionMetrics.workflow_execution_id == WorkflowExecutionMetrics.id
        ).filter(
            WorkflowExecutionMetrics.workflow_id == workflow_id,
            TaskExecutionMetrics.start_time >= cutoff_time
        ).group_by(
            TaskExecutionMetrics.task_id
        ).all()
        
        if not task_stats:
            return []
        
        # Find the relatively slowest tasks. Cast to float: depending on the
        # column type and backend, aggregates may come back as Decimal
        durations = [float(s.avg_duration) for s in task_stats if s.avg_duration]
        
        if len(durations) < 2:
            return []
        
        bottlenecks = []
        
        # Relative threshold: more than 2x the median
        median_duration = float(np.median(durations))
        threshold = median_duration * 2
        
        for stat in task_stats:
            avg_duration = float(stat.avg_duration or 0)
            if avg_duration > threshold:
                # Relative slowness versus the median, in percent
                relative_slowness = (avg_duration / median_duration - 1) * 100
                
                # avg_duration > 2 * median implies relative_slowness > 100,
                # so with this threshold flagged tasks are at least "high"
                if relative_slowness > 200:
                    severity = "critical"
                elif relative_slowness > 100:
                    severity = "high"
                else:
                    severity = "medium"
                
                bottleneck = PerformanceBottleneck(
                    workflow_id=workflow_id,
                    task_id=stat.task_id,
                    bottleneck_type="slow_task",
                    severity=severity,
                    detection_method="statistical",
                    confidence_score=0.9,
                    description="Task is significantly slower than others",
                    baseline_metrics={
                        "median_duration": median_duration,
                        "threshold": threshold
                    },
                    current_metrics={
                        "avg_duration": avg_duration,
                        "max_duration": float(stat.max_duration or 0),
                        "std_duration": float(stat.std_duration or 0)
                    },
                    deviation_percent=relative_slowness,
                    impact_analysis={
                        "execution_count": stat.execution_count,
                        "relative_slowness_percent": relative_slowness
                    }
                )
                
                bottlenecks.append(bottleneck)
        
        return bottlenecks


# Usage example
if __name__ == "__main__":
    from database import SessionLocal
    
    db = SessionLocal()
    detector = StatisticalBottleneckDetector(db)
    
    # Detect workflow-level bottlenecks
    bottlenecks = detector.detect_workflow_bottlenecks(workflow_id=1, lookback_days=7)
    
    print(f"Found {len(bottlenecks)} workflow-level bottlenecks:")
    for b in bottlenecks:
        print(f"  - {b.bottleneck_type} ({b.severity}): {b.description}")
        print(f"    Deviation: {b.deviation_percent:.1f}%")
        print(f"    Confidence: {b.confidence_score:.2f}")
    
    # Detect task-level bottlenecks
    task_bottlenecks = detector.detect_task_bottlenecks(workflow_id=1, lookback_days=7)
    
    print(f"\nFound {len(task_bottlenecks)} task-level bottlenecks:")
    for b in task_bottlenecks:
        print(f"  - Task {b.task_id}: {b.description}")
        print(f"    Severity: {b.severity}")
        print(f"    Relative slowness: {b.deviation_percent:.1f}%")
    
    db.close()
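
One property of the Z-score check above is worth illustrating. Because the mean and standard deviation are computed over the whole series, the recent window contributes to its own baseline. With a population standard deviation the squared Z-scores sum to n, so at most n/9 points can have |Z| > 3, and three flagged points need at least 28 samples. The sketch below compares that behavior with a variant that derives the baseline only from the points before the recent window. The function name `recent_outlier_count`, the `exclude_recent` flag, and the sample durations are assumptions made for illustration; they are not part of the detector above.

```python
import numpy as np


def recent_outlier_count(data, window=5, z_threshold=3.0, exclude_recent=False):
    """Count values in the last `window` points whose |Z| exceeds the threshold.

    exclude_recent=False mirrors the detector above (statistics over the whole
    series); exclude_recent=True is a hypothetical variant that derives the
    baseline only from the points before the recent window.
    """
    values = np.asarray(data, dtype=float)
    recent = values[-window:]
    baseline = values[:-window] if exclude_recent else values
    if baseline.size < 2:
        return 0
    std = baseline.std()
    if std == 0:
        return 0
    z_scores = (recent - baseline.mean()) / std
    return int(np.sum(np.abs(z_scores) > z_threshold))


# 15 stable runs around 100 s, then 5 runs that jump to 160 s (made-up data)
durations = [100, 102, 98, 101, 99] * 3 + [160] * 5

whole_series = recent_outlier_count(durations, exclude_recent=False)
history_only = recent_outlier_count(durations, exclude_recent=True)
print(whole_series, history_only)
```

On this sample the whole-series variant flags none of the five shifted runs (their Z-score is about 1.73), while the history-only baseline flags all five. Whether to exclude the recent window is a design choice: it makes sustained shifts visible on short series, at the cost of needing enough history before the window to estimate a stable baseline.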

4.2 Anomaly Detector

# services/bottleneck_detectors/anomaly_detector.py
from typing import List, Dict, Any
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
    WorkflowExecutionMetrics,
    TaskExecutionMetrics,
    PerformanceBottleneck
)
from datetime import datetime, timedelta
import logging

logger = logging.getLogger(__name__)


class AnomalyBottleneckDetector:
    """
    Anomaly-based bottleneck detector.
    
    Uses machine learning to detect anomalous patterns.
    """
    
    def __init__(self, db: Session):
        self.db = db
        self.contamination = 0.1  # expected proportion of anomalies
        self.scaler = StandardScaler()
    
    def detect_anomalies(
        self,
        workflow_id: int,
        lookback_days: int = 30
    ) -> List[PerformanceBottleneck]:
        """
        Detect anomalies using Isolation Forest.
        """
        # Fetch historical data
        cutoff_time = datetime.utcnow() - timedelta(days=lookback_days)
        
        metrics = self.db.query(WorkflowExecutionMetrics).filter(
            WorkflowExecutionMetrics.workflow_id == workflow_id,
            WorkflowExecutionMetrics.start_time >= cutoff_time
        ).all()
        
        if len(metrics) < 20:
            logger.warning(f"Not enough data for anomaly detection: {len(metrics)}")
            return []
        
        # Build the feature matrix, keeping the source rows aligned with it
        features = []
        valid_metrics = []
        
        for m in metrics:
            if all([
                m.duration_seconds is not None,
                m.avg_cpu_percent is not None,
                m.avg_memory_mb is not None
            ]):
                features.append([
                    m.duration_seconds,
                    m.avg_cpu_percent,
                    m.avg_memory_mb,
                    m.total_io_read_mb or 0,
                    m.total_io_write_mb or 0,
                    m.success_rate or 0,
                    m.error_rate or 0
                ])
                valid_metrics.append(m)
        
        if len(features) < 20:
            return []
        
        X = np.array(features)
        
        # Standardize
        X_scaled = self.scaler.fit_transform(X)
        
        # Fit the Isolation Forest
        clf = IsolationForest(
            contamination=self.contamination,
            random_state=42,
            n_estimators=100
        )
        
        predictions = clf.fit_predict(X_scaled)
        scores = clf.score_samples(X_scaled)
        
        # Collect anomalies
        bottlenecks = []
        
        for idx, (pred, score) in enumerate(zip(predictions, scores)):
            if pred == -1:  # anomaly
                # Index into valid_metrics, not metrics: rows with missing
                # values were skipped, so positions in `metrics` do not line up
                metric = valid_metrics[idx]
                execution_id = metric.execution_id
                
                # score_samples returns the negated anomaly score (lower means
                # more abnormal), so its absolute value already lies in [0, 1]
                anomaly_score = float(abs(score))
                confidence = min(anomaly_score, 1.0)
                
                # Determine severity
                if anomaly_score > 0.5:
                    severity = "critical"
                elif anomaly_score > 0.3:
                    severity = "high"
                else:
                    severity = "medium"
                
                # Work out which metrics are abnormal
                feature_names = [
                    "duration", "cpu", "memory",
                    "io_read", "io_write", "success_rate", "error_rate"
                ]
                
                # Compute a Z-score for each feature
                feature_scores = []
                for i, (value, name) in enumerate(zip(features[idx], feature_names)):
                    col_mean = np.mean(X[:, i])
                    col_std = np.std(X[:, i])
                    
                    if col_std > 0:
                        z_score = abs((value - col_mean) / col_std)
                        if z_score > 2:  # significant deviation
                            feature_scores.append({
                                "feature": name,
                                "value": value,
                                "z_score": z_score
                            })
                
                # Sort to find the most abnormal feature
                feature_scores.sort(key=lambda x: x["z_score"], reverse=True)
                primary_feature = feature_scores[0] if feature_scores else None
                
                # Determine the bottleneck type
                if primary_feature:
                    if primary_feature["feature"] in ["cpu", "memory"]:
                        bottleneck_type = primary_feature["feature"]
                    elif primary_feature["feature"] in ["io_read", "io_write"]:
                        bottleneck_type = "io"
                    elif primary_feature["feature"] == "duration":
                        bottleneck_type = "performance"
                    else:
                        bottleneck_type = "quality"
                else:
                    bottleneck_type = "unknown"
                
                bottleneck = PerformanceBottleneck(
                    workflow_id=workflow_id,
                    execution_id=execution_id,
                    bottleneck_type=bottleneck_type,
                    severity=severity,
                    detection_method="anomaly",
                    confidence_score=confidence,
                    description="Anomalous execution detected",
                    impact_analysis={
                        "anomaly_score": anomaly_score,
                        "abnormal_features": feature_scores[:3],  # Top 3
                        "primary_feature": primary_feature["feature"] if primary_feature else None
                    },
                    current_metrics={
                        "duration": metric.duration_seconds,
                        "cpu": metric.avg_cpu_percent,
                        "memory": metric.avg_memory_mb
                    }
                )
                
                bottlenecks.append(bottleneck)
        
        # Persist to the database
        self.db.add_all(bottlenecks)
        self.db.commit()
        
        logger.info(f"Detected {len(bottlenecks)} anomalies for workflow {workflow_id}")
        
        return bottlenecks


# Usage example
if __name__ == "__main__":
    from database import SessionLocal
    
    db = SessionLocal()
    detector = AnomalyBottleneckDetector(db)
    
    bottlenecks = detector.detect_anomalies(workflow_id=1, lookback_days=7)
    
    print(f"Found {len(bottlenecks)} anomalies:")
    for b in bottlenecks:
        print(f"\n{b.bottleneck_type} ({b.severity}):")
        print(f"  Confidence: {b.confidence_score:.2f}")
        print(f"  Description: {b.description}")
        
        if b.impact_analysis and "abnormal_features" in b.impact_analysis:
            print("  Abnormal features:")
            for feature in b.impact_analysis["abnormal_features"]:
                print(f"    - {feature['feature']}: {feature['value']:.2f} "
                      f"(Z-score: {feature['z_score']:.2f})")
    
    db.close()
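
The detector above explains each anomaly by computing a per-feature Z-score in a Python loop and keeping the features that deviate by more than 2 standard deviations. The same attribution step can be sketched in vectorized NumPy, which makes the logic easier to test in isolation. The function name `abnormal_features`, its parameters, and the sample matrix are assumptions for illustration only.

```python
import numpy as np


def abnormal_features(X, row_index, names, z_threshold=2.0, top_k=3):
    """Return up to `top_k` features of one row whose |Z| exceeds the threshold,
    most abnormal first. Z-scores are computed column-wise over all rows."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns carry no signal; avoid dividing by zero for them
    safe_std = np.where(std > 0, std, 1.0)
    z = np.abs((X[row_index] - mean) / safe_std)
    z[std == 0] = 0.0
    order = np.argsort(-z)  # indices sorted by descending Z-score
    return [
        {"feature": names[i], "value": float(X[row_index, i]), "z_score": float(z[i])}
        for i in order[:top_k]
        if z[i] > z_threshold
    ]


# Made-up data: 10 executions with duration (s), CPU (%), memory (MB)
names = ["duration", "cpu", "memory"]
rows = [[60.0, 40.0 + 4.0 * (i % 2), 512.0] for i in range(10)]
rows[9][0] = 300.0  # the last execution takes five times longer

result = abnormal_features(rows, row_index=9, names=names)
print(result)
```

Here only `duration` is reported for the last row: CPU alternates within its normal range and memory is constant, so neither crosses the threshold.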

5. Optimization Recommendation Engine

5.1 Recommendation Generator

# services/optimization/recommendation_engine.py
from typing import List, Dict, Any, Optional
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
    PerformanceBottleneck,
    OptimizationRecommendation,
    WorkflowTask
)
from datetime import datetime
import logging

logger = logging.getLogger(__name__)


class RecommendationEngine:
    """
    Optimization recommendation engine.
    
    Generates concrete optimization recommendations from detected bottlenecks.
    """
    
    def __init__(self, db: Session):
        self.db = db
        self.recommendation_rules = self._load_recommendation_rules()
    
    def generate_recommendations(
        self,
        bottleneck: PerformanceBottleneck
    ) -> List[OptimizationRecommendation]:
        """
        Generate optimization recommendations for a bottleneck.
        """
        recommendations = []
        
        # Choose the generation strategy by bottleneck type
        if bottleneck.bottleneck_type == "cpu":
            recommendations.extend(self._generate_cpu_recommendations(bottleneck))
        
        elif bottleneck.bottleneck_type == "memory":
            recommendations.extend(self._generate_memory_recommendations(bottleneck))
        
        elif bottleneck.bottleneck_type == "io":
            recommendations.extend(self._generate_io_recommendations(bottleneck))
        
        elif bottleneck.bottleneck_type == "slow_task":
            recommendations.extend(self._generate_task_optimization_recommendations(bottleneck))
        
        elif bottleneck.bottleneck_type == "performance":
            recommendations.extend(self._generate_performance_recommendations(bottleneck))
        
        elif bottleneck.bottleneck_type == "degradation":
            recommendations.extend(self._generate_degradation_recommendations(bottleneck))
        
        # Persist the recommendations
        for rec in recommendations:
            self.db.add(rec)
        
        self.db.commit()
        
        logger.info(f"Generated {len(recommendations)} recommendations for bottleneck {bottleneck.id}")
        
        return recommendations
    
    def _generate_cpu_recommendations(
        self,
        bottleneck: PerformanceBottleneck
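    ) -> List[OptimizationRecommendation]:
        """Generate CPU optimization recommendations."""
        recommendations = []
        
        # Statistical bottlenecks store the value under "latest", anomaly
        # bottlenecks under "cpu"; current_metrics may also be unset
        metrics = bottleneck.current_metrics or {}
        current_cpu = metrics.get("cpu", metrics.get("latest", 0))
        
        # Recommendation 1: increase parallelism
        if bottleneck.task_id:
            # Session.get replaces the legacy Query.get in SQLAlchemy 2.0
            task = self.db.get(WorkflowTask, bottleneck.task_id)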
            
            if task and task.config:
                current_workers = task.config.get("max_workers", 1)
                
                if current_workers < 4:
                    recommendations.append(OptimizationRecommendation(
                        bottleneck_id=bottleneck.id,
                        workflow_id=bottleneck.workflow_id,
                        task_id=bottleneck.task_id,
                        recommendation_type="parameter",
                        priority="high",
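                        title="Increase the number of parallel workers",
                        description=f"Current CPU usage is {current_cpu:.1f}%; increasing parallelism can make better use of CPU resources",
                        rationale="More workers can process more tasks in parallel, raising CPU utilization",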
                        current_config={"max_workers": current_workers},
                        recommended_config={"max_workers": min(current_workers * 2, 8)},
                        expected_improvement={
                            "cpu_utilization": "+20-40%",
                            "throughput": "+30-50%",
                            "duration_reduction": "20-30%"
                        },
                        implementation_difficulty="easy",
                        estimated_effort_hours=0.5,
                        implementation_steps=[
                            {"step": 1, "action": "Change the max_workers parameter in the task config"},
                            {"step": 2, "action": "Run an A/B test to verify the effect"},
                            {"step": 3, "action": "Monitor CPU and memory usage"}
                        ],
                        risk_level="low",
                        potential_issues=[
                            "May increase memory consumption",
                            "Requires enough CPU cores"
                        ],
                        rollback_plan={
                            "action": "Restore the original max_workers setting",
                            "estimated_time": "1 minute"
                        }
                    ))
        
        # Recommendation 2: optimize algorithmic complexity
        if current_cpu > self.recommendation_rules["cpu_threshold"]:
            recommendations.append(OptimizationRecommendation(
                bottleneck_id=bottleneck.id,
                workflow_id=bottleneck.workflow_id,
                task_id=bottleneck.task_id,
                recommendation_type="logic",
                priority="high",
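                title="Optimize compute-intensive code",
                description=f"CPU usage has reached {current_cpu:.1f}%; consider optimizing the algorithm or using a more efficient implementation",
                rationale="High CPU usage points to compute-intensive operations; optimizing the algorithm can significantly cut CPU time",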
                current_config={},
                recommended_config={},
                expected_improvement={
                    "cpu_time_reduction": "30-50%",
                    "duration_reduction": "30-50%"
                },
                implementation_difficulty="hard",
                estimated_effort_hours=8.0,
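                implementation_steps=[
                    {"step": 1, "action": "Use a profiler to analyze CPU hotspots"},
                    {"step": 2, "action": "Identify algorithms and data structures that can be optimized"},
                    {"step": 3, "action": "Implement the optimized version"},
                    {"step": 4, "action": "Write unit tests to ensure correctness"},
                    {"step": 5, "action": "Run a performance comparison"}
                ],
                risk_level="medium",
                potential_issues=[
                    "May introduce new bugs",
                    "Requires thorough testing"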
                ]
            ))
        
        return recommendations
    
    def _generate_memory_recommendations(
        self,
        bottleneck: PerformanceBottleneck
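    ) -> List[OptimizationRecommendation]:
        """Generate memory optimization recommendations."""
        recommendations = []
        
        # Statistical bottlenecks store the value under "latest", anomaly
        # bottlenecks under "memory"; current_metrics may also be unset
        metrics = bottleneck.current_metrics or {}
        current_memory = metrics.get("memory", metrics.get("latest", 0))
        
        # Recommendation 1: tune batch size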
        recommendations.append(OptimizationRecommendation(
            bottleneck_id=bottleneck.id,
            workflow_id=bottleneck.workflow_id,
            task_id=bottleneck.task_id,
            recommendation_type="parameter",
            priority="high",
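            title="Reduce batch size",
            description=f"Current memory usage is {current_memory:.1f} MB; a smaller batch size lowers peak memory",
            rationale="Smaller batches reduce the memory footprint and help avoid out-of-memory errors",
            current_config={"batch_size": "unknown"},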
            recommended_config={"batch_size": 1000},
            expected_improvement={
                "memory_reduction": "30-50%",
                "stability": "improved"
            },
            implementation_difficulty="easy",
            estimated_effort_hours=1.0,
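            implementation_steps=[
                {"step": 1, "action": "Adjust the batch_size config parameter"},
                {"step": 2, "action": "Monitor memory usage"},
                {"step": 3, "action": "Evaluate the impact on performance"}
            ],
            risk_level="low",
            potential_issues=[
                "May slightly increase total execution time",
                "Memory and performance need to be balanced"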
            ]
        ))
        
        # Recommendation 2: enable streaming
        if current_memory > self.recommendation_rules["memory_threshold_mb"]:  # above 1 GB by default
            recommendations.append(OptimizationRecommendation(
                bottleneck_id=bottleneck.id,
                workflow_id=bottleneck.workflow_id,
                task_id=bottleneck.task_id,
                recommendation_type="architecture",
                priority="high",
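                title="Enable streaming processing",
                description="Memory usage is high; switch to streaming to avoid loading all data at once",
                rationale="Streaming processes data chunk by chunk, which significantly lowers memory usage",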
                current_config={"processing_mode": "batch"},
                recommended_config={"processing_mode": "streaming"},
                expected_improvement={
                    "memory_reduction": "60-80%",
                    "scalability": "improved"
                },
                implementation_difficulty="medium",
                estimated_effort_hours=4.0,
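                implementation_steps=[
                    {"step": 1, "action": "Refactor the code to support streaming"},
                    {"step": 2, "action": "Use generators or iterators"},
                    {"step": 3, "action": "Test behavior at different data volumes"}
                ],
                risk_level="medium",
                potential_issues=[
                    "Requires refactoring existing code",
                    "May affect operations that need a global view of the data"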
                ]
            ))
        
        return recommendations
    
    def _generate_io_recommendations(
        self,
        bottleneck: PerformanceBottleneck
    ) -> List[OptimizationRecommendation]:
        """Generate I/O optimization recommendations."""
        recommendations = []
        
        # Recommendation 1: enable caching
        recommendations.append(OptimizationRecommendation(
            bottleneck_id=bottleneck.id,
            workflow_id=bottleneck.workflow_id,
            task_id=bottleneck.task_id,
            recommendation_type="architecture",
            priority="high",
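            title="Enable data caching",
            description="I/O operations are frequent; enable caching to reduce disk/network access",
            rationale="Caching avoids repeated I/O operations and can significantly improve performance",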
            current_config={"cache_enabled": False},
            recommended_config={
                "cache_enabled": True,
                "cache_ttl": 3600,
                "cache_size_mb": 100
            },
            expected_improvement={
                "io_reduction": "50-70%",
                "duration_reduction": "30-50%"
            },
            implementation_difficulty="medium",
            estimated_effort_hours=2.0,
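            implementation_steps=[
                {"step": 1, "action": "Choose a suitable cache eviction policy (LRU/LFU)"},
                {"step": 2, "action": "Integrate a cache middleware (Redis/Memcached)"},
                {"step": 3, "action": "Set a reasonable expiry time"},
                {"step": 4, "action": "Monitor the cache hit rate"}
            ],
            risk_level="low",
            potential_issues=[
                "Requires additional cache server resources",
                "Cache consistency issues may arise"
            ],
            rollback_plan={
                "action": "Disable the cache configuration",
                "estimated_time": "2 minutes"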
            }
        ))
        
        # Recommendation 2: batch I/O operations
        recommendations.append(OptimizationRecommendation(
            bottleneck_id=bottleneck.id,
            workflow_id=bottleneck.workflow_id,
            task_id=bottleneck.task_id,
            recommendation_type="logic",
            priority="medium",
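            title="Batch I/O operations",
            description="Merge many small I/O operations into a few large batched ones",
            rationale="Batching reduces the number of I/O operations and improves throughput",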
            expected_improvement={
                "io_operations_reduction": "70-90%",
                "duration_reduction": "20-40%"
            },
            implementation_difficulty="medium",
            estimated_effort_hours=3.0,
            risk_level="low"
        ))
        
        return recommendations
    
    def _generate_task_optimization_recommendations(
        self,
        bottleneck: PerformanceBottleneck
    ) -> List[OptimizationRecommendation]:
        """Generate recommendations for slow tasks."""
        recommendations = []
        
        impact = bottleneck.impact_analysis or {}
        relative_slowness = impact.get("relative_slowness_percent", 0)
        
        # Recommendation 1: split large tasks
        if relative_slowness > 200:  # more than 3x the median task duration
            recommendations.append(OptimizationRecommendation(
                bottleneck_id=bottleneck.id,
                workflow_id=bottleneck.workflow_id,
                task_id=bottleneck.task_id,
                recommendation_type="architecture",
                priority="high",
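                title="Split the large task into several smaller tasks",
                description=f"This task is {relative_slowness:.0f}% slower than the median task; consider splitting it into several parallel tasks",
                rationale="Once split, the pieces can run in parallel, improving overall efficiency",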
                expected_improvement={
                    "duration_reduction": "40-60%",
                    "parallelism": "improved"
                },
                implementation_difficulty="hard",
                estimated_effort_hours=8.0,
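                implementation_steps=[
                    {"step": 1, "action": "Analyze the task logic and identify split points"},
                    {"step": 2, "action": "Design the split"},
                    {"step": 3, "action": "Implement the subtasks"},
                    {"step": 4, "action": "Configure task dependencies"},
                    {"step": 5, "action": "Test and verify that results are consistent"}
                ],
                risk_level="medium",
                potential_issues=[
                    "Requires redesigning the workflow DAG",
                    "May add coordination overhead"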
                ]
            ))
        
        # Recommendation 2: add task timeout and retries
        recommendations.append(OptimizationRecommendation(
            bottleneck_id=bottleneck.id,
            workflow_id=bottleneck.workflow_id,
            task_id=bottleneck.task_id,
            recommendation_type="parameter",
            priority="medium",
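            title="Tune timeout and retry strategy",
            description="Set a reasonable timeout to avoid waiting a long time on failing tasks",
            rationale="A reasonable timeout lets tasks fail fast and retry, avoiding wasted resources",
            current_config={"timeout": "not set"},
            recommended_config={
                "timeout": 300,  # 5 minutes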
                "retry_count": 3,
                "retry_delay": 60
            },
            expected_improvement={
                "failure_handling": "improved",
                "reliability": "improved"
            },
            implementation_difficulty="easy",
            estimated_effort_hours=0.5,
            risk_level="low"
        ))
        
        return recommendations
    
    def _generate_performance_recommendations(
        self,
        bottleneck: PerformanceBottleneck
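    ) -> List[OptimizationRecommendation]:
        """Generate general performance recommendations."""
        recommendations = []
        
        # Read the bottleneck's impact analysis
        # (abnormal_features is only set by the anomaly detector)
        impact = bottleneck.impact_analysis or {}
        abnormal_features = impact.get("abnormal_features", [])
        
        # Generate recommendations from the abnormal features
        for feature in abnormal_features[:2]:  # the 2 most abnormal features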
            feature_name = feature.get("feature")
            
            if feature_name == "duration":
                recommendations.append(OptimizationRecommendation(
                    bottleneck_id=bottleneck.id,
                    workflow_id=bottleneck.workflow_id,
                    recommendation_type="performance",
                    priority="high",
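                    title="Optimize execution time",
                    description="Execution time is abnormal; profile and optimize",
                    implementation_steps=[
                        {"step": 1, "action": "Use a profiler to locate hotspots"},
                        {"step": 2, "action": "Optimize the critical path"},
                        {"step": 3, "action": "Consider algorithmic improvements"}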
                    ],
                    implementation_difficulty="medium",
                    estimated_effort_hours=4.0
                ))
        
        return recommendations
    
    def _generate_degradation_recommendations(
        self,
        bottleneck: PerformanceBottleneck
    ) -> List[OptimizationRecommendation]:
        """Generate recommendations for performance degradation."""
        recommendations = []
        
        recommendations.append(OptimizationRecommendation(
            bottleneck_id=bottleneck.id,
            workflow_id=bottleneck.workflow_id,
            recommendation_type="investigation",
            priority="urgent",
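            title="Investigate the cause of the performance degradation",
            description="Performance has degraded significantly and needs urgent investigation",
            rationale="Performance degradation can have serious business impact, so the cause should be found quickly",
            implementation_steps=[
                {"step": 1, "action": "Compare against recent code changes"},
                {"step": 2, "action": "Check data volume growth"},
                {"step": 3, "action": "Analyze changes in system resources"},
                {"step": 4, "action": "Check the status of dependent services"},
                {"step": 5, "action": "Draw up a recovery plan"}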
            ],
            implementation_difficulty="medium",
            estimated_effort_hours=4.0,
            risk_level="high"
        ))
        
        return recommendations
    
    def _load_recommendation_rules(self) -> Dict[str, Any]:
        """
        Load the recommendation rule base.
        
        Could be loaded from a config file or the database.
        """
        return {
            "cpu_threshold": 80,
            "memory_threshold_mb": 1000,
            "io_threshold_mb": 500,
            "duration_multiplier": 2.0
        }


# 使用示例
if __name__ == "__main__":
    from database import SessionLocal
    from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
    
    db = SessionLocal()
    
    # 检测瓶颈
    detector = StatisticalBottleneckDetector(db)
    bottlenecks = detector.detect_workflow_bottlenecks(workflow_id=1, lookback_days=7)
    
    # 生成建议
    engine = RecommendationEngine(db)
    
    for bottleneck in bottlenecks:
        recommendations = engine.generate_recommendations(bottleneck)
        
        print(f"\n瓶颈: {bottleneck.description}")
        print(f"建议数: {len(recommendations)}")
        
        for rec in recommendations:
            print(f"\n  建议: {rec.title}")
            print(f"  优先级: {rec.priority}")
            print(f"  难度: {rec.implementation_difficulty}")
            print(f"  预期改进: {rec.expected_improvement}")
    
    db.close()
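
The docstring of `_load_recommendation_rules` notes that the rules can come from a configuration file or a database, but the method itself returns a hard-coded dict. Below is a minimal sketch of the file-based variant, assuming the rules are stored as a flat JSON object. The function name `load_recommendation_rules`, the constant `DEFAULT_RULES`, and the file layout are illustrative assumptions, not part of the original engine.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Defaults mirror the values returned by _load_recommendation_rules above.
DEFAULT_RULES: Dict[str, Any] = {
    "cpu_threshold": 80,
    "memory_threshold_mb": 1000,
    "io_threshold_mb": 500,
    "duration_multiplier": 2.0,
}


def load_recommendation_rules(path: Optional[str] = None) -> Dict[str, Any]:
    """Return the default rules, overridden by a JSON file if one exists.

    Hypothetical helper: the JSON file is assumed to hold a flat object
    whose keys are a subset of DEFAULT_RULES.
    """
    rules = dict(DEFAULT_RULES)  # copy so the defaults are never mutated
    if path is None or not Path(path).is_file():
        return rules

    with open(path, encoding="utf-8") as fh:
        overrides = json.load(fh)

    if not isinstance(overrides, dict):
        raise ValueError("rules file must contain a JSON object")

    # Reject unknown keys so a typo in the file fails loudly.
    unknown = set(overrides) - set(DEFAULT_RULES)
    if unknown:
        raise ValueError(f"unknown rule keys: {sorted(unknown)}")

    rules.update(overrides)
    return rules


if __name__ == "__main__":
    # With no file, the defaults are returned unchanged.
    print(load_recommendation_rules())
```

Falling back to the defaults when the file is missing keeps the engine usable in a fresh environment, while the unknown-key check catches misspelled thresholds early.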

5.2 Recommendation Prioritization

# services/optimization/recommendation_prioritizer.py
from typing import List, Dict, Any
from sqlalchemy.orm import Session
from models.ai_optimizer_models import OptimizationRecommendation
import logging

logger = logging.getLogger(__name__)


class RecommendationPrioritizer:
    """
    建议优先级排序器
    
    基于多个因素对优化建议进行排序
    """
    
    def __init__(self, db: Session):
        self.db = db
        
        # 优先级权重
        self.priority_weights = {
            "urgent": 10,
            "high": 7,
            "medium": 4,
            "low": 1
        }
        
        # 难度权重(负权重,越难优先级越低)
        self.difficulty_weights = {
            "easy": 0,
            "medium": -2,
            "hard": -5
        }
        
        # 风险权重(负权重)
        self.risk_weights = {
            "low": 0,
            "medium": -1,
            "high": -3
        }
    
    def prioritize(
        self,
        recommendations: List[OptimizationRecommendation]
    ) -> List[OptimizationRecommendation]:
        """
        对建议进行优先级排序
        """
        # 计算每个建议的综合分数
        scored_recommendations = []
        
        for rec in recommendations:
            score = self._calculate_score(rec)
            scored_recommendations.append((rec, score))
        
        # 按分数排序
        scored_recommendations.sort(key=lambda x: x[1], reverse=True)
        
        # 返回排序后的建议
        return [rec for rec, score in scored_recommendations]
    
    def _calculate_score(self, rec: OptimizationRecommendation) -> float:
        """
        计算建议的综合分数
        """
        score = 0.0
        
        # 1. 优先级分数
        priority_score = self.priority_weights.get(rec.priority, 0)
        score += priority_score
        
        # 2. 难度分数
        difficulty_score = self.difficulty_weights.get(rec.implementation_difficulty, 0)
        score += difficulty_score
        
        # 3. 风险分数
        risk_score = self.risk_weights.get(rec.risk_level, 0)
        score += risk_score
        
        # 4. 预期改进分数
        expected_improvement = rec.expected_improvement or {}
        improvement_score = self._parse_improvement_score(expected_improvement)
        score += improvement_score
        
        # 5. 实施时间分数(越快越好)
        effort_score = max(0, 10 - (rec.estimated_effort_hours or 0))
        score += effort_score * 0.5
        
        return score
    
    def _parse_improvement_score(self, improvement: Dict[str, Any]) -> float:
        """
        解析预期改进,计算改进分数
        """
        score = 0.0
        
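        for key, value in improvement.items():
            if isinstance(value, str):
                # Extract the percentage, e.g. "20-30%" -> 30, "+15%" -> 15
                if "%" in value:
                    try:
                        # Take the text before "%" and, for a range, its upper bound
                        percent_str = value.split("%")[0].split("-")[-1]
                        percent = float(percent_str.replace("+", ""))
                        score += percent / 10  # normalize: 10% improvement = 1 point
                    except ValueError:
                        # Skip values whose numeric part cannot be parsed
                        pass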
        
        return min(score, 10)  # 最多10分
    
    def group_by_category(
        self,
        recommendations: List[OptimizationRecommendation]
    ) -> Dict[str, List[OptimizationRecommendation]]:
        """
        按类别分组建议
        """
        groups = {
            "quick_wins": [],  # 快速见效
            "high_impact": [],  # 高影响
            "long_term": [],    # 长期优化
            "low_priority": []  # 低优先级
        }
        
        for rec in recommendations:
            # Quick wins: easy to implement and high/urgent priority
            if rec.implementation_difficulty == "easy" and rec.priority in ["high", "urgent"]:
                groups["quick_wins"].append(rec)

            # High impact: every other high/urgent recommendation
            # (expected improvement is not checked here)
            elif rec.priority in ["high", "urgent"]:
                groups["high_impact"].append(rec)

            # Long term: hard to implement but not low priority
            elif rec.implementation_difficulty == "hard" and rec.priority != "low":
                groups["long_term"].append(rec)

            # Low priority: everything else
            else:
                groups["low_priority"].append(rec)
        
        return groups


# 使用示例
if __name__ == "__main__":
    from database import SessionLocal
    
    db = SessionLocal()
    
    # 获取所有待处理的建议
    recommendations = db.query(OptimizationRecommendation).filter(
        OptimizationRecommendation.status == "pending"
    ).all()
    
    # 排序
    prioritizer = RecommendationPrioritizer(db)
    sorted_recs = prioritizer.prioritize(recommendations)
    
    print("排序后的建议:")
    for i, rec in enumerate(sorted_recs[:10], 1):
        print(f"{i}. {rec.title}")
        print(f"   优先级: {rec.priority}, 难度: {rec.implementation_difficulty}")
        print(f"   风险: {rec.risk_level}, 工时: {rec.estimated_effort_hours}h")
        print()
    
    # 分组
    groups = prioritizer.group_by_category(recommendations)
    
    print("\n建议分组:")
    for category, recs in groups.items():
        print(f"{category}: {len(recs)}个建议")
    
    db.close()
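
To make the scoring in `_calculate_score` concrete, the sketch below recomputes the composite score for two recommendations by hand. It reuses the weights from `RecommendationPrioritizer`, but the standalone function `composite_score` and its flat arguments are assumptions made for illustration; the real method reads these values from an `OptimizationRecommendation` row and sums the parsed percentages over all entries of `expected_improvement`.

```python
# Weights copied from RecommendationPrioritizer.__init__ above.
PRIORITY_WEIGHTS = {"urgent": 10, "high": 7, "medium": 4, "low": 1}
DIFFICULTY_WEIGHTS = {"easy": 0, "medium": -2, "hard": -5}
RISK_WEIGHTS = {"low": 0, "medium": -1, "high": -3}


def composite_score(
    priority: str,
    difficulty: str,
    risk: str,
    improvement_pct: float,
    effort_hours: float,
) -> float:
    """Hypothetical flat version of _calculate_score for a single improvement value."""
    score = float(PRIORITY_WEIGHTS.get(priority, 0))
    score += DIFFICULTY_WEIGHTS.get(difficulty, 0)
    score += RISK_WEIGHTS.get(risk, 0)
    # Improvement: 10% is worth 1 point, capped at 10 points.
    score += min(improvement_pct / 10, 10)
    # Effort: up to 10 points for zero effort, weighted by 0.5.
    score += max(0.0, 10 - effort_hours) * 0.5
    return score


if __name__ == "__main__":
    # Urgent investigation, medium difficulty, high risk, no stated improvement, 4 h.
    print(composite_score("urgent", "medium", "high", 0, 4.0))  # 8.0
    # High-priority quick win: easy, low risk, 30% expected improvement, 1 h.
    print(composite_score("high", "easy", "low", 30, 1.0))      # 14.5
```

With these weights, a cheap, low-risk change with a stated improvement can outrank an urgent but risky investigation, so it is worth checking whether that ordering matches the intended policy.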

6. Automated Optimization Experiments (A/B Testing)

6.1 Experiment Framework

# services/ab_testing/experiment_framework.py
from typing import Dict, Any, List, Optional, Callable
from sqlalchemy.orm import Session
from models.ai_optimizer_models import ABExperiment, OptimizationRecommendation, WorkflowExecution
from datetime import datetime, timedelta
import random
import numpy as np
from scipy import stats
import logging

logger = logging.getLogger(__name__)


class ExperimentFramework:
    """
    A/B测试实验框架
    
    用于验证优化建议的效果
    """
    
    def __init__(self, db: Session):
        self.db = db
        self.alpha = 0.05  # 显著性水平
        self.min_detectable_effect = 0.1  # 最小可检测效应(10%)
    
    def create_experiment(
        self,
        workflow_id: int,
        recommendation_id: int,
        name: str,
        hypothesis: str,
        control_config: Dict[str, Any],
        treatment_config: Dict[str, Any],
        primary_metric: str,
        traffic_split: float = 0.5,
        **kwargs
    ) -> ABExperiment:
        """
        创建A/B测试实验
        
        Args:
            workflow_id: 工作流ID
            recommendation_id: 建议ID
            name: 实验名称
            hypothesis: 假设
            control_config: 对照组配置
            treatment_config: 实验组配置
            primary_metric: 主要评估指标
            traffic_split: 流量分配比例(实验组占比)
        """
        experiment = ABExperiment(
            workflow_id=workflow_id,
            recommendation_id=recommendation_id,
            name=name,
            hypothesis=hypothesis,
            control_config=control_config,
            treatment_config=treatment_config,
            primary_metric=primary_metric,
            traffic_split=traffic_split,
            status="draft",
            **kwargs
        )
        
        self.db.add(experiment)
        self.db.commit()
        
        logger.info(f"Created experiment: {name} (ID: {experiment.id})")
        
        return experiment
    
    def start_experiment(self, experiment_id: int) -> ABExperiment:
        """
        启动实验
        """
        experiment = self.db.query(ABExperiment).get(experiment_id)
        
        if not experiment:
            raise ValueError(f"Experiment {experiment_id} not found")
        
        if experiment.status != "draft":
            raise ValueError(f"Experiment must be in draft status to start")
        
        # 验证配置
        self._validate_experiment(experiment)
        
        # 启动实验
        experiment.status = "running"
        experiment.started_at = datetime.utcnow()
        
        self.db.commit()
        
        logger.info(f"Started experiment: {experiment.name}")
        
        return experiment
    
    def assign_variant(
        self,
        experiment_id: int,
        execution_id: int
    ) -> str:
        """
        为执行分配变体(对照组或实验组)
        
        Returns:
            "control" 或 "treatment"
        """
        experiment = self.db.query(ABExperiment).get(experiment_id)
        
        if not experiment or experiment.status != "running":
            return "control"  # 默认使用对照组
        
        # 随机分配
        if random.random() < experiment.traffic_split:
            variant = "treatment"
        else:
            variant = "control"
        
        # 记录分配(可以存储在Redis或数据库中)
        # 这里简化处理,实际应该持久化
        
        logger.debug(f"Assigned variant {variant} for execution {execution_id}")
        
        return variant
    
    def record_result(
        self,
        experiment_id: int,
        variant: str,
        metrics: Dict[str, float]
    ):
        """
        Record one execution's metrics for the given variant.
        """
        if variant not in ("control", "treatment"):
            raise ValueError(f"Unknown variant: {variant}")

        experiment = self.db.query(ABExperiment).get(experiment_id)

        if not experiment:
            return

        size_attr = f"{variant}_group_size"
        metrics_attr = f"{variant}_metrics"

        # Update the sample count for this variant
        setattr(experiment, size_attr, (getattr(experiment, size_attr) or 0) + 1)

        # Build a new dict and reassign it instead of appending in place.
        # Assumption: the *_metrics columns are plain JSON columns. SQLAlchemy
        # does not detect in-place mutation of such values, so an in-place
        # append could be silently dropped at commit time.
        current = getattr(experiment, metrics_attr) or {}
        updated = {name: list(values) for name, values in current.items()}
        for metric_name, value in metrics.items():
            updated.setdefault(metric_name, []).append(float(value))
        setattr(experiment, metrics_attr, updated)

        self.db.commit()

        # Check whether the experiment is ready for analysis
        self._check_for_analysis(experiment)
        
        # 检查是否可以进行分析
        self._check_for_analysis(experiment)
    
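    def _check_for_analysis(self, experiment: ABExperiment):
        """
        Run the analysis once both groups have reached the minimum sample size.
        """
        control_size = experiment.control_group_size or 0
        treatment_size = experiment.treatment_group_size or 0

        # Check the minimum sample size
        if control_size < experiment.min_sample_size or treatment_size < experiment.min_sample_size:
            return

        # Run the statistical analysis
        result = self.analyze_experiment(experiment.id)

        # Note: this re-tests after every new result. Repeated looks at the
        # data inflate the false-positive rate above alpha unless a sequential
        # correction is applied, so treat an early stop as provisional.
        if experiment.early_stopping_enabled:
            if self._should_stop_early(result):
                self.stop_experiment(experiment.id, "Early stopping triggered")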
    
    def analyze_experiment(self, experiment_id: int) -> Dict[str, Any]:
        """
        分析实验结果
        """
        experiment = self.db.query(ABExperiment).get(experiment_id)
        
        if not experiment:
            raise ValueError(f"Experiment {experiment_id} not found")
        
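        primary_metric = experiment.primary_metric

        # Fetch the samples for the primary metric (the columns may still be None)
        control_data = (experiment.control_metrics or {}).get(primary_metric, [])
        treatment_data = (experiment.treatment_metrics or {}).get(primary_metric, [])

        # The sample variance needs at least two observations per group
        if len(control_data) < 2 or len(treatment_data) < 2:
            return {"status": "insufficient_data"}

        n_control, n_treatment = len(control_data), len(treatment_data)
        dof = n_control + n_treatment - 2

        # Cast to built-in float/bool so the result stays JSON-serializable
        control_mean = float(np.mean(control_data))
        treatment_mean = float(np.mean(treatment_data))

        # Two-sided Student's t-test (equal variances assumed, the scipy default)
        t_stat, p_value = stats.ttest_ind(treatment_data, control_data)
        p_value = float(p_value)

        # Confidence interval for the mean difference, using the pooled
        # sample standard deviation (ddof=1 gives the unbiased sample variance)
        pooled_std = np.sqrt(
            ((n_control - 1) * np.var(control_data, ddof=1) +
             (n_treatment - 1) * np.var(treatment_data, ddof=1)) / dof
        )

        margin_of_error = float(
            stats.t.ppf(1 - self.alpha / 2, dof) *
            pooled_std * np.sqrt(1 / n_control + 1 / n_treatment)
        )

        mean_diff = treatment_mean - control_mean
        ci_lower = mean_diff - margin_of_error
        ci_upper = mean_diff + margin_of_error

        # Effect size, expressed as the relative change versus control
        effect_size = mean_diff / control_mean if control_mean != 0 else 0.0

        # Decide the winner. For cost-like metrics a lower mean is better.
        # Assumption: the metric names below are treated as lower-is-better;
        # extend the set to match the metrics you actually track.
        lower_is_better = primary_metric in {
            "duration_seconds", "error_rate", "cpu_percent", "memory_mb"
        }
        is_significant = bool(p_value < self.alpha)

        if is_significant:
            treatment_is_better = (
                treatment_mean < control_mean if lower_is_better
                else treatment_mean > control_mean
            )
            decision = "winner_treatment" if treatment_is_better else "winner_control"
        else:
            decision = "no_difference"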
        
        result = {
            "status": "analyzed",
            "control_mean": control_mean,
            "treatment_mean": treatment_mean,
            "mean_difference": mean_diff,
            "effect_size": effect_size,
            "p_value": p_value,
            "is_significant": is_significant,
            "confidence_interval": {
                "lower": ci_lower,
                "upper": ci_upper
            },
            "decision": decision,
            "sample_sizes": {
                "control": len(control_data),
                "treatment": len(treatment_data)
            }
        }
        
        # 更新实验记录
        experiment.statistical_significance = p_value
        experiment.confidence_interval = result["confidence_interval"]
        experiment.decision = decision
        
        self.db.commit()
        
        logger.info(f"Analyzed experiment {experiment.name}: {decision}")
        
        return result
    
    def _should_stop_early(self, analysis_result: Dict[str, Any]) -> bool:
        """
        判断是否应该早停
        """
        if analysis_result.get("status") != "analyzed":
            return False
        
        # 如果结果显著且效应量足够大
        is_significant = analysis_result.get("is_significant", False)
        effect_size = abs(analysis_result.get("effect_size", 0))
        
        return is_significant and effect_size >= self.min_detectable_effect
    
    def stop_experiment(self, experiment_id: int, reason: str = "Manual stop"):
        """
        停止实验
        """
        experiment = self.db.query(ABExperiment).get(experiment_id)
        
        if not experiment:
            return
        
        if experiment.status == "completed":
            return
        
        experiment.status = "completed"
        experiment.ended_at = datetime.utcnow()
        experiment.decision_reason = reason
        
        # 进行最终分析
        if experiment.control_group_size and experiment.treatment_group_size:
            self.analyze_experiment(experiment_id)
        
        self.db.commit()
        
        logger.info(f"Stopped experiment {experiment.name}: {reason}")
    
    def rollout_winner(self, experiment_id: int):
        """
        推广获胜配置
        """
        experiment = self.db.query(ABExperiment).get(experiment_id)
        
        if not experiment:
            raise ValueError(f"Experiment {experiment_id} not found")
        
        if experiment.decision not in ["winner_control", "winner_treatment"]:
            raise ValueError(f"No clear winner to rollout")
        
        # 获取获胜配置
        if experiment.decision == "winner_treatment":
            winning_config = experiment.treatment_config
            logger.info(f"Rolling out treatment config for experiment {experiment.name}")
        else:
            winning_config = experiment.control_config
            logger.info(f"Keeping control config for experiment {experiment.name}")
        
        # 更新工作流配置
        # 这里需要根据实际情况实现
        # workflow = self.db.query(Workflow).get(experiment.workflow_id)
        # workflow.config.update(winning_config)
        # self.db.commit()
        
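        # Update the recommendation status
        if experiment.recommendation_id:
            recommendation = self.db.query(OptimizationRecommendation).get(
                experiment.recommendation_id
            )
            if recommendation:
                # Only mark the recommendation as implemented when the treatment
                # (the recommended change) won. "rejected" is an assumed status
                # value; use whatever your status field actually defines.
                if experiment.decision == "winner_treatment":
                    recommendation.status = "implemented"
                else:
                    recommendation.status = "rejected"
                recommendation.actual_improvement = self.analyze_experiment(experiment_id)
                self.db.commit()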
    
    def _validate_experiment(self, experiment: ABExperiment):
        """
        验证实验配置
        """
        if not experiment.control_config or not experiment.treatment_config:
            raise ValueError("Both control and treatment configs are required")
        
        if not 0 < experiment.traffic_split < 1:
            raise ValueError("Traffic split must be between 0 and 1")
        
        if not experiment.primary_metric:
            raise ValueError("Primary metric is required")


# 使用示例
if __name__ == "__main__":
    from database import SessionLocal
    
    db = SessionLocal()
    framework = ExperimentFramework(db)
    
    # 创建实验
    experiment = framework.create_experiment(
        workflow_id=1,
        recommendation_id=1,
        name="Test parallel workers optimization",
        hypothesis="Increasing parallel workers from 2 to 4 will reduce duration by 30%",
        control_config={"max_workers": 2},
        treatment_config={"max_workers": 4},
        primary_metric="duration_seconds",
        secondary_metrics=["cpu_percent", "memory_mb"],
        traffic_split=0.5,
        min_sample_size=50,
        max_duration_days=7
    )
    
    # 启动实验
    framework.start_experiment(experiment.id)
    
    # 模拟记录结果
    for i in range(100):
        variant = framework.assign_variant(experiment.id, i)
        
        # 模拟指标数据
        if variant == "control":
            duration = np.random.normal(100, 10)
        else:
            duration = np.random.normal(70, 8)  # 实验组更快
        
        framework.record_result(
            experiment.id,
            variant,
            {"duration_seconds": duration}
        )
    
    # 分析结果
    result = framework.analyze_experiment(experiment.id)
    
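    print("\nExperiment result:")
    print(f"  Control mean: {result['control_mean']:.2f}s")
    print(f"  Treatment mean: {result['treatment_mean']:.2f}s")
    # effect_size is the relative change vs. control; negative means a shorter duration
    print(f"  Relative change: {result['effect_size'] * 100:+.1f}%")
    print(f"  p-value: {result['p_value']:.4f}")
    print(f"  Decision: {result['decision']}")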
    
    if result['is_significant']:
        # 推广获胜配置
        framework.rollout_winner(experiment.id)
    
    db.close()
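
`assign_variant` above draws a fresh random number on every call and does not persist the result, so the same execution could land in different groups if it is asked twice. One common alternative is to derive the variant from a hash of the identifiers, which needs no storage at all. The sketch below illustrates that idea under the assumption that `(experiment_id, execution_id)` uniquely identifies a unit; the function name `assign_variant_deterministic` is hypothetical and not part of `ExperimentFramework`.

```python
import hashlib


def assign_variant_deterministic(
    experiment_id: int,
    execution_id: int,
    traffic_split: float = 0.5,
) -> str:
    """Map an execution to "control" or "treatment" reproducibly.

    traffic_split is the share of executions sent to the treatment group,
    matching the meaning of ABExperiment.traffic_split.
    """
    key = f"{experiment_id}:{execution_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < traffic_split else "control"


if __name__ == "__main__":
    # The same identifiers always produce the same variant.
    first = assign_variant_deterministic(1, 42)
    second = assign_variant_deterministic(1, 42)
    print(first == second)

    # Over many executions the observed share approaches traffic_split.
    n = 10_000
    treated = sum(
        assign_variant_deterministic(1, i, 0.5) == "treatment" for i in range(n)
    )
    print(f"treatment share: {treated / n:.3f}")
```

Including the experiment ID in the hash key keeps assignments independent across experiments, so an execution that falls into the treatment group of one experiment is not systematically in the treatment group of another.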

6.2 Automated Experiment Scheduler

# services/ab_testing/experiment_scheduler.py
from typing import List, Dict, Any, Optional
from sqlalchemy.orm import Session
from models.ai_optimizer_models import (
    ABExperiment, 
    OptimizationRecommendation,
    WorkflowExecution
)
from services.ab_testing.experiment_framework import ExperimentFramework
from datetime import datetime, timedelta
import logging

logger = logging.getLogger(__name__)


class ExperimentScheduler:
    """
    自动化实验调度器
    
    管理多个实验的生命周期,避免冲突
    """
    
    def __init__(self, db: Session):
        self.db = db
        self.framework = ExperimentFramework(db)
        self.max_concurrent_experiments = 3  # 最多同时运行3个实验
    
    def schedule_experiments_for_recommendations(
        self,
        recommendation_ids: List[int]
    ) -> List[ABExperiment]:
        """
        为建议自动调度实验
        """
        experiments = []
        
        for rec_id in recommendation_ids:
            recommendation = self.db.query(OptimizationRecommendation).get(rec_id)
            
            if not recommendation:
                continue
            
            # 检查是否已有实验
            existing = self.db.query(ABExperiment).filter(
                ABExperiment.recommendation_id == rec_id,
                ABExperiment.status.in_(["draft", "running"])
            ).first()
            
            if existing:
                logger.info(f"Experiment already exists for recommendation {rec_id}")
                continue
            
            # 创建实验
            experiment = self._create_experiment_from_recommendation(recommendation)
            
            if experiment:
                experiments.append(experiment)
        
        # 按优先级调度
        self._schedule_by_priority(experiments)
        
        return experiments
    
    def _create_experiment_from_recommendation(
        self,
        recommendation: OptimizationRecommendation
    ) -> Optional[ABExperiment]:
        """
        从建议创建实验
        """
        # 构建实验名称
        name = f"AB_{recommendation.workflow_id}_{recommendation.title[:50]}"
        
        # 构建假设
        hypothesis = f"Implementing '{recommendation.title}' will improve performance"
        
        # 确定主要指标
        primary_metric = self._determine_primary_metric(recommendation)
        
        # 获取当前配置和推荐配置
        control_config = recommendation.current_config or {}
        treatment_config = recommendation.recommended_config or {}
        
        # 计算最小样本量
        min_sample_size = self._calculate_min_sample_size(
            recommendation.expected_improvement
        )
        
        # 创建实验
        try:
            experiment = self.framework.create_experiment(
                workflow_id=recommendation.workflow_id,
                recommendation_id=recommendation.id,
                name=name,
                hypothesis=hypothesis,
                control_config=control_config,
                treatment_config=treatment_config,
                primary_metric=primary_metric,
                secondary_metrics=self._get_secondary_metrics(recommendation),
                traffic_split=0.5,
                min_sample_size=min_sample_size,
                max_duration_days=7,
                early_stopping_enabled=True
            )
            
            logger.info(f"Created experiment for recommendation {recommendation.id}")
            return experiment
            
        except Exception as e:
            logger.error(f"Failed to create experiment: {e}")
            return None
    
    def _determine_primary_metric(
        self,
        recommendation: OptimizationRecommendation
    ) -> str:
        """
        根据建议类型确定主要评估指标
        """
        rec_type = recommendation.recommendation_type
        
        metric_mapping = {
            "parameter": "duration_seconds",
            "logic": "duration_seconds",
            "architecture": "throughput",
            "performance": "duration_seconds",
            "investigation": "error_rate"
        }
        
        return metric_mapping.get(rec_type, "duration_seconds")
    
    def _get_secondary_metrics(
        self,
        recommendation: OptimizationRecommendation
    ) -> List[str]:
        """
        获取次要评估指标
        """
        # 基础指标
        secondary = ["cpu_percent", "memory_mb", "error_rate"]
        
        # 根据建议类型添加特定指标
        if recommendation.recommendation_type == "architecture":
            secondary.extend(["io_operations", "cache_hit_rate"])
        
        return secondary
    
    def _calculate_min_sample_size(
        self,
        expected_improvement: Dict[str, Any]
    ) -> int:
        """
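        Pick a minimum sample size from the expected improvement.

        This is a coarse lookup table, not a real power analysis: the smaller
        the expected improvement, the more samples are required.
        """
        # Extract the largest expected improvement percentage
        improvement_pct = 0.0

        for key, value in (expected_improvement or {}).items():
            if isinstance(value, str) and "%" in value:
                try:
                    pct_str = value.split("%")[0].split("-")[-1]
                    improvement_pct = max(improvement_pct, float(pct_str.replace("+", "")))
                except ValueError:
                    # Skip values whose numeric part cannot be parsed
                    pass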
        
        # 根据改进幅度确定样本量
        if improvement_pct >= 50:
            return 30  # 大改进,少量样本即可
        elif improvement_pct >= 30:
            return 50
        elif improvement_pct >= 10:
            return 100
        else:
            return 200  # 小改进,需要更多样本
    
    def _schedule_by_priority(self, experiments: List[ABExperiment]):
        """
        按优先级调度实验
        """
        # 检查当前运行的实验数
        running_count = self.db.query(ABExperiment).filter(
            ABExperiment.status == "running"
        ).count()
        
        # 可启动的数量
        can_start = max(0, self.max_concurrent_experiments - running_count)
        
        if can_start == 0:
            logger.info("Maximum concurrent experiments reached, queuing new experiments")
            return
        
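        # Sort by the priority of the linked recommendation
        priority_scores = {"urgent": 4, "high": 3, "medium": 2, "low": 1}
        experiments_with_priority = []
        for exp in experiments:
            rec = None
            if exp.recommendation_id:
                rec = self.db.query(OptimizationRecommendation).get(exp.recommendation_id)

            # Experiments without a resolvable recommendation get the lowest
            # score instead of being dropped or raising AttributeError
            priority_score = priority_scores.get(rec.priority, 0) if rec else 0
            experiments_with_priority.append((exp, priority_score))

        experiments_with_priority.sort(key=lambda x: x[1], reverse=True)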
        
        # 启动优先级最高的实验
        for exp, _ in experiments_with_priority[:can_start]:
            try:
                self.framework.start_experiment(exp.id)
                logger.info(f"Started experiment: {exp.name}")
            except Exception as e:
                logger.error(f"Failed to start experiment {exp.id}: {e}")
    
    def check_and_rotate_experiments(self):
        """
        检查运行中的实验,必要时轮换
        """
        # 检查超时的实验
        running_experiments = self.db.query(ABExperiment).filter(
            ABExperiment.status == "running"
        ).all()
        
        for exp in running_experiments:
            # Check whether the experiment has exceeded its maximum run time
            if exp.started_at:
                running_days = (datetime.utcnow() - exp.started_at).days

                if running_days >= (exp.max_duration_days or 7):
                    logger.info(f"Experiment {exp.name} exceeded max duration, stopping")
                    self.framework.stop_experiment(exp.id, "Exceeded max duration")

                    # Only act on the result when both groups have enough data
                    min_size = exp.min_sample_size or 0
                    enough_data = (
                        (exp.control_group_size or 0) >= min_size and
                        (exp.treatment_group_size or 0) >= min_size
                    )
                    if enough_data:
                        result = self.framework.analyze_experiment(exp.id)

                        if result.get("is_significant"):
                            self.framework.rollout_winner(exp.id)
        
        # 启动队列中的实验
        draft_experiments = self.db.query(ABExperiment).filter(
            ABExperiment.status == "draft"
        ).all()
        
        if draft_experiments:
            self._schedule_by_priority(draft_experiments)
    
    def generate_experiment_report(
        self,
        experiment_id: int
    ) -> Dict[str, Any]:
        """
        生成实验报告
        """
        experiment = self.db.query(ABExperiment).get(experiment_id)
        
        if not experiment:
            raise ValueError(f"Experiment {experiment_id} not found")
        
        # 分析结果
        analysis = self.framework.analyze_experiment(experiment_id)
        
        # 构建报告
        report = {
            "experiment_id": experiment.id,
            "name": experiment.name,
            "status": experiment.status,
            "hypothesis": experiment.hypothesis,
            "duration_days": (
                (experiment.ended_at or datetime.utcnow()) - experiment.started_at
            ).days if experiment.started_at else 0,
            "sample_sizes": {
                "control": experiment.control_group_size or 0,
                "treatment": experiment.treatment_group_size or 0
            },
            "analysis": analysis,
            "recommendation": None,
            "timeline": []
        }
        
        # 添加建议信息
        if experiment.recommendation_id:
            rec = self.db.query(OptimizationRecommendation).get(experiment.recommendation_id)
            if rec:
                report["recommendation"] = {
                    "title": rec.title,
                    "description": rec.description,
                    "expected_improvement": rec.expected_improvement,
                    "implementation_difficulty": rec.implementation_difficulty
                }
        
        # 添加时间线
        if experiment.started_at:
            report["timeline"].append({
                "timestamp": experiment.started_at,
                "event": "Experiment started"
            })
        
        if experiment.ended_at:
            report["timeline"].append({
                "timestamp": experiment.ended_at,
                "event": f"Experiment ended: {experiment.decision_reason or 'Completed'}"
            })
        
        return report


# 使用示例
if __name__ == "__main__":
    from database import SessionLocal
    
    db = SessionLocal()
    scheduler = ExperimentScheduler(db)
    
    # 为待处理的建议调度实验
    pending_recommendations = db.query(OptimizationRecommendation).filter(
        OptimizationRecommendation.status == "pending"
    ).limit(5).all()
    
    recommendation_ids = [rec.id for rec in pending_recommendations]
    
    experiments = scheduler.schedule_experiments_for_recommendations(recommendation_ids)
    
    print(f"\n已调度 {len(experiments)} 个实验")
    
    # 检查并轮换实验
    scheduler.check_and_rotate_experiments()
    
    # 生成报告
    for exp in experiments[:1]:
        report = scheduler.generate_experiment_report(exp.id)
        print(f"\n实验报告: {report['name']}")
        print(f"  状态: {report['status']}")
        print(f"  样本量: {report['sample_sizes']}")
    
    db.close()
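
`_calculate_min_sample_size` picks its value from fixed thresholds. For comparison, here is a minimal sketch of an actual power calculation for a two-sided, two-sample comparison of means, using the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2 per group, where d is the standardized effect size (mean difference divided by the standard deviation). The function names and the idea of deriving d from a baseline mean and standard deviation are assumptions for illustration; the approximation gives slightly smaller values than an exact t-based calculation.

```python
import math
from statistics import NormalDist


def standardized_effect(
    baseline_mean: float,
    baseline_std: float,
    relative_improvement: float,
) -> float:
    """Convert a relative improvement (0.1 = 10%) into a standardized effect d."""
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    return abs(relative_improvement * baseline_mean) / baseline_std


def min_sample_size_per_group(
    effect_size_d: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Samples needed per group (normal approximation, two-sided test)."""
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) / effect_size_d) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    print(min_sample_size_per_group(0.5))  # 63
    print(min_sample_size_per_group(0.2))  # 393

    # Illustrative baseline: mean 100 s, std 10 s, 10% expected improvement.
    d = standardized_effect(100.0, 10.0, 0.10)
    print(min_sample_size_per_group(d))    # 16
```

Unlike the lookup table, the result depends on how noisy the metric is: the same 10% improvement needs far more samples when the baseline standard deviation is large relative to the mean.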

7. API Layer

7.1 Bottleneck Detection API

# api/endpoints/bottleneck_detection.py
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.orm import Session
from typing import List, Optional
from datetime import datetime, timedelta
from pydantic import BaseModel, Field

from database import get_db
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from services.bottleneck_detectors.ml_detector import MLBottleneckDetector
from models.ai_optimizer_models import PerformanceBottleneck

router = APIRouter(prefix="/api/v1/bottlenecks", tags=["bottleneck-detection"])


# Request/response models
class BottleneckDetectionRequest(BaseModel):
    workflow_id: int = Field(..., description="Workflow ID")
    lookback_days: int = Field(7, ge=1, le=90, description="Number of days to look back")
    # Only "statistical" and "ml" are handled by the endpoint below
    detector_type: str = Field("statistical", description="Detector type: statistical or ml")
    threshold: Optional[float] = Field(None, ge=0, le=1, description="Detection threshold")


class BottleneckResponse(BaseModel):
    id: int
    workflow_id: int
    task_id: Optional[int]
    bottleneck_type: str
    severity: str
    confidence_score: float
    description: str
    detected_at: datetime
    current_metrics: dict
    baseline_metrics: Optional[dict]
    impact_analysis: Optional[dict]
    
    class Config:
        # Pydantic v1 name; Pydantic v2 renames this option to from_attributes
        orm_mode = True


class BottleneckListResponse(BaseModel):
    total: int
    bottlenecks: List[BottleneckResponse]
    detection_metadata: dict


@router.post("/detect", response_model=BottleneckListResponse)
async def detect_bottlenecks(
    request: BottleneckDetectionRequest,
    db: Session = Depends(get_db)
):
    """
    Detect performance bottlenecks for a workflow.
    """
    try:
        # Select the detector
        if request.detector_type == "statistical":
            detector = StatisticalBottleneckDetector(db)
        elif request.detector_type == "ml":
            detector = MLBottleneckDetector(db)
        else:
            raise HTTPException(status_code=400, detail="Invalid detector type")
        
        # Override the detector's default threshold when one is supplied
        if request.threshold is not None:
            detector.threshold = request.threshold
        
        # Run detection
        bottlenecks = detector.detect_workflow_bottlenecks(
            workflow_id=request.workflow_id,
            lookback_days=request.lookback_days
        )
        
        # Build the response
        return BottleneckListResponse(
            total=len(bottlenecks),
            bottlenecks=[BottleneckResponse.from_orm(b) for b in bottlenecks],
            detection_metadata={
                "detector_type": request.detector_type,
                "lookback_days": request.lookback_days,
                "threshold": detector.threshold,
                "detection_time": datetime.utcnow()
            }
        )
        
    except HTTPException:
        # Re-raise deliberate HTTP errors (such as the 400 above) instead of turning them into 500s
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@router.get("/{bottleneck_id}", response_model=BottleneckResponse)
async def get_bottleneck(
    bottleneck_id: int,
    db: Session = Depends(get_db)
):
    """
    Get the details of a single bottleneck.
    """
    bottleneck = db.get(PerformanceBottleneck, bottleneck_id)
    
    if not bottleneck:
        raise HTTPException(status_code=404, detail="Bottleneck not found")
    
    return BottleneckResponse.from_orm(bottleneck)


@router.get("/workflow/{workflow_id}", response_model=BottleneckListResponse)
async def get_workflow_bottlenecks(
    workflow_id: int,
    severity: Optional[str] = Query(None, description="Filter by severity"),
    status: Optional[str] = Query(None, description="Filter by status"),
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
    db: Session = Depends(get_db)
):
    """
    Get all bottlenecks for a workflow.
    """
    query = db.query(PerformanceBottleneck).filter(
        PerformanceBottleneck.workflow_id == workflow_id
    )
    
    # Optional filters
    if severity:
        query = query.filter(PerformanceBottleneck.severity == severity)
    
    if status:
        query = query.filter(PerformanceBottleneck.status == status)
    
    # Newest first
    query = query.order_by(PerformanceBottleneck.detected_at.desc())
    
    # Pagination: count the full result set before applying offset/limit
    total = query.count()
    bottlenecks = query.offset(offset).limit(limit).all()
    
    return BottleneckListResponse(
        total=total,
        bottlenecks=[BottleneckResponse.from_orm(b) for b in bottlenecks],
        detection_metadata={
            "filters": {
                "severity": severity,
                "status": status
            },
            "pagination": {
                "limit": limit,
                "offset": offset,
                "total": total
            }
        }
    )


@router.post("/{bottleneck_id}/resolve")
async def resolve_bottleneck(
    bottleneck_id: int,
    resolution_notes: Optional[str] = None,
    db: Session = Depends(get_db)
):
    """
    Mark a bottleneck as resolved.
    """
    bottleneck = db.get(PerformanceBottleneck, bottleneck_id)
    
    if not bottleneck:
        raise HTTPException(status_code=404, detail="Bottleneck not found")
    
    bottleneck.status = "resolved"
    bottleneck.resolved_at = datetime.utcnow()
    
    if resolution_notes:
        # Assign a new dict rather than mutating in place: if resolution_info is a
        # plain JSON column, SQLAlchemy does not track in-place changes to it
        bottleneck.resolution_info = {
            **(bottleneck.resolution_info or {}),
            "notes": resolution_notes,
        }
    
    db.commit()
    
    return {"message": "Bottleneck resolved", "bottleneck_id": bottleneck_id}


@router.delete("/{bottleneck_id}")
async def delete_bottleneck(
    bottleneck_id: int,
    db: Session = Depends(get_db)
):
    """
    Delete a bottleneck record.
    """
    bottleneck = db.get(PerformanceBottleneck, bottleneck_id)
    
    if not bottleneck:
        raise HTTPException(status_code=404, detail="Bottleneck not found")
    
    db.delete(bottleneck)
    db.commit()
    
    return {"message": "Bottleneck deleted", "bottleneck_id": bottleneck_id}
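
One detail worth care in handlers shaped like `detect_bottlenecks`: `HTTPException` is an ordinary `Exception` subclass, so a broad `except Exception` also intercepts deliberate 4xx errors unless they are re-raised first. The sketch below illustrates the difference with the standard library only. `HTTPError`, `pick_detector`, `handle_broad`, and `handle_reraise` are hypothetical stand-ins introduced for illustration, not part of FastAPI or of this project.

```python
class HTTPError(Exception):
    """Hypothetical stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def pick_detector(detector_type: str) -> str:
    # Mirrors the if/elif chain in the endpoint; the returned string
    # stands in for a detector object.
    if detector_type in ("statistical", "ml"):
        return detector_type
    raise HTTPError(400, "Invalid detector type")


def handle_broad(detector_type: str) -> int:
    """Broad handler: every failure, including the deliberate 400, becomes a 500."""
    try:
        pick_detector(detector_type)
        return 200
    except Exception as exc:
        return HTTPError(500, str(exc)).status_code


def handle_reraise(detector_type: str) -> int:
    """Client errors pass through; only unexpected failures map to 500."""
    try:
        pick_detector(detector_type)
        return 200
    except HTTPError as exc:
        return exc.status_code
    except Exception as exc:
        return HTTPError(500, str(exc)).status_code
```

With an unsupported type such as `"hybrid"`, the broad handler reports a server error while the second variant preserves the client error, which is usually what API consumers expect.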

7.2 Optimization Recommendations API

# api/endpoints/optimization_recommendations.py
from datetime import datetime  # needed for created_at and generated_at below
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.orm import Session
from typing import List, Optional
from pydantic import BaseModel, Field

from database import get_db
from services.optimization.recommendation_engine import RecommendationEngine
from services.optimization.recommendation_prioritizer import RecommendationPrioritizer
from models.ai_optimizer_models import (
    OptimizationRecommendation,
    PerformanceBottleneck
)

router = APIRouter(prefix="/api/v1/recommendations", tags=["optimization-recommendations"])


# Request/response models
class RecommendationResponse(BaseModel):
    id: int
    bottleneck_id: Optional[int]
    workflow_id: int
    task_id: Optional[int]
    recommendation_type: str
    priority: str
    title: str
    description: str
    rationale: Optional[str]
    current_config: Optional[dict]
    recommended_config: Optional[dict]
    expected_improvement: Optional[dict]
    implementation_difficulty: str
    estimated_effort_hours: Optional[float]
    implementation_steps: Optional[List[dict]]
    risk_level: str
    potential_issues: Optional[List[str]]
    status: str
    created_at: datetime
    
    class Config:
        orm_mode = True


class RecommendationListResponse(BaseModel):
    total: int
    recommendations: List[RecommendationResponse]
    metadata: dict


class GenerateRecommendationsRequest(BaseModel):
    bottleneck_id: int


@router.post("/generate", response_model=RecommendationListResponse)
async def generate_recommendations(
    request: GenerateRecommendationsRequest,
    db: Session = Depends(get_db)
):
    """
    Generate optimization recommendations for a bottleneck.
    """
    bottleneck = db.get(PerformanceBottleneck, request.bottleneck_id)
    
    if not bottleneck:
        raise HTTPException(status_code=404, detail="Bottleneck not found")
    
    # Generate recommendations
    engine = RecommendationEngine(db)
    recommendations = engine.generate_recommendations(bottleneck)
    
    # Rank them by priority
    prioritizer = RecommendationPrioritizer(db)
    sorted_recs = prioritizer.prioritize(recommendations)
    
    return RecommendationListResponse(
        total=len(sorted_recs),
        recommendations=[RecommendationResponse.from_orm(r) for r in sorted_recs],
        metadata={
            "bottleneck_id": request.bottleneck_id,
            "generated_at": datetime.utcnow()
        }
    )


@router.get("/{recommendation_id}", response_model=RecommendationResponse)
async def get_recommendation(
    recommendation_id: int,
    db: Session = Depends(get_db)
):
    """
    Get the details of a single recommendation.
    """
    recommendation = db.get(OptimizationRecommendation, recommendation_id)
    
    if not recommendation:
        raise HTTPException(status_code=404, detail="Recommendation not found")
    
    return RecommendationResponse.from_orm(recommendation)


@router.get("/workflow/{workflow_id}", response_model=RecommendationListResponse)
async def get_workflow_recommendations(
    workflow_id: int,
    status: Optional[str] = Query(None),
    priority: Optional[str] = Query(None),
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
    db: Session = Depends(get_db)
):
    """
    Get all recommendations for a workflow.
    """
    query = db.query(OptimizationRecommendation).filter(
        OptimizationRecommendation.workflow_id == workflow_id
    )
    
    if status:
        query = query.filter(OptimizationRecommendation.status == status)
    
    if priority:
        query = query.filter(OptimizationRecommendation.priority == priority)
    
    query = query.order_by(OptimizationRecommendation.created_at.desc())
    
    total = query.count()
    recommendations = query.offset(offset).limit(limit).all()
    
    # Re-rank the current page by priority (applies within the page only)
    prioritizer = RecommendationPrioritizer(db)
    sorted_recs = prioritizer.prioritize(recommendations)
    
    return RecommendationListResponse(
        total=total,
        recommendations=[RecommendationResponse.from_orm(r) for r in sorted_recs],
        metadata={
            "filters": {"status": status, "priority": priority},
            "pagination": {"limit": limit, "offset": offset, "total": total}
        }
    )


@router.post("/{recommendation_id}/approve")
async def approve_recommendation(
    recommendation_id: int,
    db: Session = Depends(get_db)
):
    """
    Approve a recommendation and create an A/B test for it.
    """
    from services.ab_testing.experiment_scheduler import ExperimentScheduler
    
    recommendation = db.get(OptimizationRecommendation, recommendation_id)
    
    if not recommendation:
        raise HTTPException(status_code=404, detail="Recommendation not found")
    
    # Update the status
    recommendation.status = "approved"
    db.commit()
    
    # Create the experiment
    scheduler = ExperimentScheduler(db)
    experiments = scheduler.schedule_experiments_for_recommendations([recommendation_id])
    
    return {
        "message": "Recommendation approved and experiment scheduled",
        "recommendation_id": recommendation_id,
        "experiment_id": experiments[0].id if experiments else None
    }


@router.post("/{recommendation_id}/reject")
async def reject_recommendation(
    recommendation_id: int,
    reason: Optional[str] = None,
    db: Session = Depends(get_db)
):
    """
    Reject a recommendation.
    """
    recommendation = db.get(OptimizationRecommendation, recommendation_id)
    
    if not recommendation:
        raise HTTPException(status_code=404, detail="Recommendation not found")
    
    recommendation.status = "rejected"
    
    if reason:
        # Assign a new dict rather than mutating in place: if feedback is a
        # plain JSON column, SQLAlchemy does not track in-place changes to it
        recommendation.feedback = {
            **(recommendation.feedback or {}),
            "rejection_reason": reason,
        }
    
    db.commit()
    
    return {"message": "Recommendation rejected", "recommendation_id": recommendation_id}
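
The list endpoint above orders by creation time, slices one page, and only then hands that page to the prioritizer, so prioritization reorders items within a page rather than across the whole result set. The following sketch shows how the two orders of operation can diverge. The `Rec` type, the `PRIORITY_RANK` labels, and both helper functions are assumptions made for illustration; the real `RecommendationPrioritizer` is defined elsewhere and may rank differently.

```python
from dataclasses import dataclass
from typing import List

# Assumed priority labels; the actual values stored in `priority` may differ.
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Rec:
    id: int
    priority: str
    created_order: int  # larger means newer


def page_then_prioritize(recs: List[Rec], limit: int, offset: int) -> List[Rec]:
    """What the endpoint does: newest first, slice a page, then rank that page."""
    newest_first = sorted(recs, key=lambda r: r.created_order, reverse=True)
    page = newest_first[offset:offset + limit]
    return sorted(page, key=lambda r: PRIORITY_RANK[r.priority])


def prioritize_then_page(recs: List[Rec], limit: int, offset: int) -> List[Rec]:
    """Alternative: rank the full set (newest first within a rank), then slice."""
    ranked = sorted(recs, key=lambda r: (PRIORITY_RANK[r.priority], -r.created_order))
    return ranked[offset:offset + limit]


recs = [
    Rec(id=1, priority="critical", created_order=1),
    Rec(id=2, priority="low", created_order=2),
    Rec(id=3, priority="medium", created_order=3),
    Rec(id=4, priority="high", created_order=4),
]
first_page_a = [r.id for r in page_then_prioritize(recs, limit=2, offset=0)]
first_page_b = [r.id for r in prioritize_then_page(recs, limit=2, offset=0)]
```

In this toy data the oldest recommendation is the critical one, so it is missing from the first page under the endpoint's approach. If global priority order matters to callers, the ranking would need to happen in the query itself before `offset` and `limit` are applied.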

7.3 A/B Testing API

# api/endpoints/ab_testing.py
from fastapi import APIRouter, Depends, HTTPException, Query, BackgroundTasks
from sqlalchemy.orm import Session
from typing import List, Optional
from pydantic import BaseModel, Field
from datetime import datetime

from database import get_db
from services.ab_testing.experiment_framework import ExperimentFramework
from services.ab_testing.experiment_scheduler import ExperimentScheduler
from models.ai_optimizer_models import ABExperiment

router = APIRouter(prefix="/api/v1/experiments", tags=["ab-testing"])


# Request/response models
class CreateExperimentRequest(BaseModel):
    workflow_id: int
    recommendation_id: Optional[int] = None
    name: str
    hypothesis: str
    control_config: dict
    treatment_config: dict
    primary_metric: str
    secondary_metrics: List[str] = []
    traffic_split: float = Field(0.5, ge=0.1, le=0.9)
    min_sample_size: int = Field(100, ge=10)
    max_duration_days: int = Field(7, ge=1, le=30)
    early_stopping_enabled: bool = True


class ExperimentResponse(BaseModel):
    id: int
    workflow_id: int
    recommendation_id: Optional[int]
    name: str
    hypothesis: str
    status: str
    control_config: dict
    treatment_config: dict
    primary_metric: str
    secondary_metrics: List[str]
    traffic_split: float
    started_at: Optional[datetime]
    ended_at: Optional[datetime]
    control_group_size: int
    treatment_group_size: int
    winner: Optional[str]
    decision_reason: Optional[str]
    
    class Config:
        orm_mode = True


class ExperimentAnalysisResponse(BaseModel):
    experiment_id: int
    status: str
    sample_sizes: dict
    primary_metric_analysis: dict
    secondary_metrics_analysis: dict
    is_significant: bool
    recommended_action: str
    confidence_level: float


class ExperimentListResponse(BaseModel):
    total: int
    experiments: List[ExperimentResponse]
    metadata: dict


@router.post("/", response_model=ExperimentResponse)
async def create_experiment(
    request: CreateExperimentRequest,
    db: Session = Depends(get_db)
):
    """
    Create a new A/B test experiment.
    """
    framework = ExperimentFramework(db)
    
    experiment = framework.create_experiment(
        workflow_id=request.workflow_id,
        recommendation_id=request.recommendation_id,
        name=request.name,
        hypothesis=request.hypothesis,
        control_config=request.control_config,
        treatment_config=request.treatment_config,
        primary_metric=request.primary_metric,
        secondary_metrics=request.secondary_metrics,
        traffic_split=request.traffic_split,
        min_sample_size=request.min_sample_size,
        max_duration_days=request.max_duration_days,
        early_stopping_enabled=request.early_stopping_enabled
    )
    
    return ExperimentResponse.from_orm(experiment)


@router.get("/{experiment_id}", response_model=ExperimentResponse)
async def get_experiment(
    experiment_id: int,
    db: Session = Depends(get_db)
):
    """
    Get experiment details.
    """
    experiment = db.get(ABExperiment, experiment_id)
    
    if not experiment:
        raise HTTPException(status_code=404, detail="Experiment not found")
    
    return ExperimentResponse.from_orm(experiment)


@router.get("/", response_model=ExperimentListResponse)
async def list_experiments(
    workflow_id: Optional[int] = Query(None),
    status: Optional[str] = Query(None),
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
    db: Session = Depends(get_db)
):
    """
    List experiments, optionally filtered by workflow and status.
    """
    query = db.query(ABExperiment)
    
    if workflow_id is not None:
        query = query.filter(ABExperiment.workflow_id == workflow_id)
    
    if status:
        query = query.filter(ABExperiment.status == status)
    
    query = query.order_by(ABExperiment.created_at.desc())
    
    total = query.count()
    experiments = query.offset(offset).limit(limit).all()
    
    return ExperimentListResponse(
        total=total,
        experiments=[ExperimentResponse.from_orm(e) for e in experiments],
        metadata={
            "filters": {"workflow_id": workflow_id, "status": status},
            "pagination": {"limit": limit, "offset": offset, "total": total}
        }
    )


@router.post("/{experiment_id}/start")
async def start_experiment(
    experiment_id: int,
    db: Session = Depends(get_db)
):
    """
    Start an experiment.
    """
    framework = ExperimentFramework(db)
    framework.start_experiment(experiment_id)
    
    return {"message": "Experiment started", "experiment_id": experiment_id}


@router.post("/{experiment_id}/stop")
async def stop_experiment(
    experiment_id: int,
    reason: Optional[str] = None,
    db: Session = Depends(get_db)
):
    """
    Stop an experiment.
    """
    framework = ExperimentFramework(db)
    framework.stop_experiment(experiment_id, reason)
    
    return {"message": "Experiment stopped", "experiment_id": experiment_id}


@router.get("/{experiment_id}/analysis", response_model=ExperimentAnalysisResponse)
async def analyze_experiment(
    experiment_id: int,
    db: Session = Depends(get_db)
):
    """
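    Analyze experiment results.
    """
    experiment = db.get(ABExperiment, experiment_id)
    
    # Return 404 instead of failing with AttributeError on a missing experiment
    if not experiment:
        raise HTTPException(status_code=404, detail="Experiment not found")
    
    framework = ExperimentFramework(db)
    analysis = framework.analyze_experiment(experiment_id)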
    
    return ExperimentAnalysisResponse(
        experiment_id=experiment_id,
        status=experiment.status,
        sample_sizes={
            "control": experiment.control_group_size,
            "treatment": experiment.treatment_group_size
        },
        primary_metric_analysis=analysis.get("primary_metric", {}),
        secondary_metrics_analysis=analysis.get("secondary_metrics", {}),
        is_significant=analysis.get("is_significant", False),
        recommended_action=analysis.get("recommendation", "continue"),
        confidence_level=analysis.get("confidence_level", 0.0)
    )


@router.post("/{experiment_id}/rollout")
async def rollout_winner(
    experiment_id: int,
    background_tasks: BackgroundTasks,
    db: Session = Depends(get_db)
):
    """
    Roll out the winning configuration.
    """
    framework = ExperimentFramework(db)
    
    # Run the rollout in the background
    background_tasks.add_task(framework.rollout_winner, experiment_id)
    
    return {
        "message": "Rollout initiated",
        "experiment_id": experiment_id
    }


@router.get("/{experiment_id}/report")
async def get_experiment_report(
    experiment_id: int,
    db: Session = Depends(get_db)
):
    """
    Generate a detailed experiment report.
    """
    scheduler = ExperimentScheduler(db)
    report = scheduler.generate_experiment_report(experiment_id)
    
    return report


@router.post("/schedule")
async def schedule_experiments(
    recommendation_ids: List[int],
    db: Session = Depends(get_db)
):
    """
    Schedule experiments in bulk.
    """
    scheduler = ExperimentScheduler(db)
    experiments = scheduler.schedule_experiments_for_recommendations(recommendation_ids)
    
    return {
        "message": f"Scheduled {len(experiments)} experiments",
        "experiment_ids": [e.id for e in experiments]
    }


@router.post("/rotate")
async def rotate_experiments(
    db: Session = Depends(get_db)
):
    """
    Check and rotate experiments.
    """
    scheduler = ExperimentScheduler(db)
    scheduler.check_and_rotate_experiments()
    
    return {"message": "Experiment rotation completed"}
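
The analysis endpoint returns `is_significant` and a primary-metric analysis, but the test behind them lives in `ExperimentFramework.analyze_experiment`, which is not shown in this section. As a rough illustration of what such a check can look like, here is a two-sided z-test on the difference in means using only the standard library. This is a sketch under the assumption of reasonably large samples (the request model defaults `min_sample_size` to 100); the function name, the `alpha` default, and the returned keys are illustrative, and the real framework may use a different test.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Dict, List


def z_test_difference(control: List[float], treatment: List[float],
                      alpha: float = 0.05) -> Dict[str, object]:
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, using unpooled sample variances
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    if se == 0:
        # Degenerate case: no variation in either group
        p_value = 0.0 if diff != 0 else 1.0
        return {"difference": diff, "p_value": p_value, "is_significant": diff != 0}
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"difference": diff, "z": z, "p_value": p_value,
            "is_significant": p_value < alpha}


# Synthetic durations in seconds: treatment is 2 s faster on average
control = [10.0 + (i % 5) for i in range(100)]
treatment = [8.0 + (i % 5) for i in range(100)]
result = z_test_difference(control, treatment)
same = z_test_difference(control, control)
```

For small samples or heavily skewed metrics, a t-test or a non-parametric test would be a safer choice than this normal approximation.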

8. Background Tasks and Scheduling

8.1 Celery Task Definitions

# tasks/ai_optimizer_tasks.py
from celery import Celery
from celery.schedules import crontab
from database import SessionLocal
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from services.bottleneck_detectors.ml_detector import MLBottleneckDetector
from services.optimization.recommendation_engine import RecommendationEngine
from services.ab_testing.experiment_scheduler import ExperimentScheduler
from models.workflow_models import Workflow
import logging

logger = logging.getLogger(__name__)

# Initialize Celery
celery_app = Celery(
    'ai_optimizer',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/0'
)

celery_app.conf.update(
    task_serializer='json',
    accept_content=['json'],
    result_serializer='json',
    timezone='UTC',
    enable_utc=True,
)


@celery_app.task(name='detect_bottlenecks_all_workflows')
def detect_bottlenecks_all_workflows():
    """
    Periodically detect bottlenecks for all active workflows.
    """
    db = SessionLocal()
    
    try:
        # Fetch active workflows
        active_workflows = db.query(Workflow).filter(
            Workflow.is_active == True
        ).all()
        
        logger.info(f"Detecting bottlenecks for {len(active_workflows)} workflows")
        
        # Use the statistical detector
        stat_detector = StatisticalBottleneckDetector(db)
        
        total_bottlenecks = 0
        
        for workflow in active_workflows:
            try:
                bottlenecks = stat_detector.detect_workflow_bottlenecks(
                    workflow_id=workflow.id,
                    lookback_days=7
                )
                total_bottlenecks += len(bottlenecks)
                
                logger.info(
                    f"Workflow {workflow.id}: Found {len(bottlenecks)} bottlenecks"
                )
                
            except Exception as e:
                logger.error(
                    f"Error detecting bottlenecks for workflow {workflow.id}: {e}"
                )
        
        logger.info(f"Total bottlenecks detected: {total_bottlenecks}")
        
        return {
            "workflows_processed": len(active_workflows),
            "total_bottlenecks": total_bottlenecks
        }
        
    finally:
        db.close()


@celery_app.task(name='ml_bottleneck_detection')
def ml_bottleneck_detection(workflow_id: int):
    """
    Run a deeper analysis with the ML detector.
    """
    db = SessionLocal()
    
    try:
        ml_detector = MLBottleneckDetector(db)
        bottlenecks = ml_detector.detect_workflow_bottlenecks(
            workflow_id=workflow_id,
            lookback_days=30
        )
        
        logger.info(
            f"ML detection for workflow {workflow_id}: {len(bottlenecks)} bottlenecks"
        )
        
        return {
            "workflow_id": workflow_id,
            "bottlenecks_found": len(bottlenecks)
        }
        
    finally:
        db.close()


@celery_app.task(name='generate_recommendations')
def generate_recommendations_task():
    """
    Generate optimization recommendations for unprocessed bottlenecks.
    """
    from models.ai_optimizer_models import PerformanceBottleneck
    
    db = SessionLocal()
    
    try:
        # Fetch bottlenecks that are still pending
        pending_bottlenecks = db.query(PerformanceBottleneck).filter(
            PerformanceBottleneck.status == "detected"
        ).all()
        
        logger.info(f"Generating recommendations for {len(pending_bottlenecks)} bottlenecks")
        
        engine = RecommendationEngine(db)
        total_recommendations = 0
        
        for bottleneck in pending_bottlenecks:
            try:
                recommendations = engine.generate_recommendations(bottleneck)
                total_recommendations += len(recommendations)
                
                # Mark the bottleneck as analyzed
                bottleneck.status = "analyzed"
                
            except Exception as e:
                logger.error(
                    f"Error generating recommendations for bottleneck {bottleneck.id}: {e}"
                )
        
        db.commit()
        
        logger.info(f"Total recommendations generated: {total_recommendations}")
        
        return {
            "bottlenecks_processed": len(pending_bottlenecks),
            "recommendations_generated": total_recommendations
        }
        
    finally:
        db.close()


@celery_app.task(name='schedule_ab_experiments')
def schedule_ab_experiments_task():
    """
    Schedule A/B test experiments.
    """
    from models.ai_optimizer_models import OptimizationRecommendation
    
    db = SessionLocal()
    
    try:
        # Fetch approved recommendations
        approved_recommendations = db.query(OptimizationRecommendation).filter(
            OptimizationRecommendation.status == "approved"
        ).limit(10).all()
        
        if not approved_recommendations:
            logger.info("No approved recommendations to schedule")
            return {"experiments_scheduled": 0}
        
        scheduler = ExperimentScheduler(db)
        recommendation_ids = [rec.id for rec in approved_recommendations]
        
        experiments = scheduler.schedule_experiments_for_recommendations(
            recommendation_ids
        )
        
        logger.info(f"Scheduled {len(experiments)} experiments")
        
        return {"experiments_scheduled": len(experiments)}
        
    finally:
        db.close()


@celery_app.task(name='rotate_experiments')
def rotate_experiments_task():
    """
    Check and rotate experiments.
    """
    db = SessionLocal()
    
    try:
        scheduler = ExperimentScheduler(db)
        scheduler.check_and_rotate_experiments()
        
        logger.info("Experiment rotation completed")
        
        return {"status": "completed"}
        
    finally:
        db.close()


@celery_app.task(name='cleanup_old_data')
def cleanup_old_data():
    """
    Clean up old data.
    """
    from models.ai_optimizer_models import (
        PerformanceBottleneck,
        OptimizationRecommendation,
        ABExperiment
    )
    # datetime is used below and is not imported at module level
    from datetime import datetime, timedelta
    
    db = SessionLocal()
    
    try:
        cutoff_date = datetime.utcnow() - timedelta(days=90)
        
        # Delete resolved bottlenecks older than the cutoff
        deleted_bottlenecks = db.query(PerformanceBottleneck).filter(
            PerformanceBottleneck.status == "resolved",
            PerformanceBottleneck.resolved_at < cutoff_date
        ).delete()
        
        # Delete rejected recommendations older than the cutoff
        deleted_recommendations = db.query(OptimizationRecommendation).filter(
            OptimizationRecommendation.status == "rejected",
            OptimizationRecommendation.created_at < cutoff_date
        ).delete()
        
        # Delete completed experiments older than the cutoff
        deleted_experiments = db.query(ABExperiment).filter(
            ABExperiment.status == "completed",
            ABExperiment.ended_at < cutoff_date
        ).delete()
        
        db.commit()
        
        logger.info(
            f"Cleanup completed: {deleted_bottlenecks} bottlenecks, "
            f"{deleted_recommendations} recommendations, "
            f"{deleted_experiments} experiments deleted"
        )
        
        return {
            "bottlenecks_deleted": deleted_bottlenecks,
            "recommendations_deleted": deleted_recommendations,
            "experiments_deleted": deleted_experiments
        }
        
    finally:
        db.close()


@celery_app.task(name='train_ml_models')
def train_ml_models_task():
    """
    Periodically retrain the ML models.
    """
    db = SessionLocal()
    
    try:
        ml_detector = MLBottleneckDetector(db)
        
        # Train the anomaly detection model
        ml_detector._train_anomaly_detector()
        
        logger.info("ML models retrained successfully")
        
        return {"status": "completed"}
        
    except Exception as e:
        logger.error(f"Error training ML models: {e}")
        return {"status": "failed", "error": str(e)}
        
    finally:
        db.close()


# Periodic task schedule (all times are UTC, per the timezone setting above)
celery_app.conf.beat_schedule = {
    'detect-bottlenecks-every-hour': {
        'task': 'detect_bottlenecks_all_workflows',
        'schedule': crontab(minute=0),  # every hour, on the hour
    },
    'generate-recommendations-every-6-hours': {
        'task': 'generate_recommendations',
        'schedule': crontab(minute=0, hour='*/6'),  # every 6 hours
    },
    'schedule-experiments-daily': {
        'task': 'schedule_ab_experiments',
        'schedule': crontab(minute=0, hour=9),  # daily at 09:00
    },
    'rotate-experiments-every-2-hours': {
        'task': 'rotate_experiments',
        'schedule': crontab(minute=0, hour='*/2'),  # every 2 hours
    },
    'cleanup-old-data-weekly': {
        'task': 'cleanup_old_data',
        'schedule': crontab(minute=0, hour=2, day_of_week=0),  # Sundays at 02:00
    },
    'train-ml-models-weekly': {
        'task': 'train_ml_models',
        'schedule': crontab(minute=0, hour=3, day_of_week=0),  # Sundays at 03:00
    },
}
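
To make the schedule semantics concrete, the sketch below reimplements a small subset of cron matching with the standard library: each field is either a wildcard (`None`) or a set of allowed values, and day 0 means Sunday as in Celery's `crontab`. The helper `fires_at` and the constant `EVERY_6_HOURS` are illustrative names, not Celery APIs, and the sketch ignores fields such as day of month.

```python
from datetime import datetime
from typing import Optional, Set


def fires_at(dt: datetime,
             minute: Optional[Set[int]] = None,
             hour: Optional[Set[int]] = None,
             day_of_week: Optional[Set[int]] = None) -> bool:
    """Return True if dt matches the given cron-style fields (None = wildcard)."""
    # Python's weekday() uses Monday=0; cron uses Sunday=0
    cron_dow = (dt.weekday() + 1) % 7
    return (
        (minute is None or dt.minute in minute)
        and (hour is None or dt.hour in hour)
        and (day_of_week is None or cron_dow in day_of_week)
    )


# What hour='*/6' expands to: 0, 6, 12, 18
EVERY_6_HOURS = set(range(0, 24, 6))

sunday_2am = datetime(2024, 1, 7, 2, 0)   # 2024-01-07 is a Sunday
monday_2am = datetime(2024, 1, 8, 2, 0)

# cleanup-old-data-weekly: minute=0, hour=2, day_of_week=0
cleanup_on_sunday = fires_at(sunday_2am, minute={0}, hour={2}, day_of_week={0})
cleanup_on_monday = fires_at(monday_2am, minute={0}, hour={2}, day_of_week={0})

# generate-recommendations-every-6-hours: minute=0, hour='*/6'
recs_at_18 = fires_at(datetime(2024, 1, 8, 18, 0), minute={0}, hour=EVERY_6_HOURS)
recs_at_17 = fires_at(datetime(2024, 1, 8, 17, 0), minute={0}, hour=EVERY_6_HOURS)
```

Note that the weekly cleanup and retraining jobs are staggered by one hour, which keeps the two heavier jobs from starting at the same moment.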

8.2 Task Monitoring

# tasks/task_monitor.py
from celery import Celery
from celery.events import EventReceiver
from kombu import Connection
import logging
from datetime import datetime
from typing import Dict, List
from collections import defaultdict

logger = logging.getLogger(__name__)


class TaskMonitor:
    """
    Celery task monitor.

    Workers must run with events enabled (for example `celery worker -E`);
    task-sent events additionally require the `task_send_sent_event` setting.
    """
    
    def __init__(self, broker_url: str):
        self.broker_url = broker_url
        # Only task-sent and task-received events carry the task name;
        # later events are matched back to it through the task uuid
        self._task_names: Dict[str, str] = {}
        self.task_stats = defaultdict(lambda: {
            "total": 0,
            "succeeded": 0,
            "failed": 0,
            "retried": 0,
            "avg_runtime": 0.0,
            "last_run": None
        })
    
    def _task_name(self, event) -> str:
        """Resolve the task name for an event, falling back to the uuid map."""
        return event.get('name') or self._task_names.get(event['uuid'], 'unknown')
    
    def start_monitoring(self):
        """
        Start monitoring.
        """
        with Connection(self.broker_url) as conn:
            recv = EventReceiver(
                conn,
                handlers={
                    'task-sent': self.on_task_sent,
                    'task-received': self.on_task_received,
                    'task-started': self.on_task_started,
                    'task-succeeded': self.on_task_succeeded,
                    'task-failed': self.on_task_failed,
                    'task-retried': self.on_task_retried,
                }
            )
            
            logger.info("Task monitor started")
            recv.capture(limit=None, timeout=None, wakeup=True)
    
    def on_task_sent(self, event):
        """Task sent."""
        task_name = event['name']
        logger.debug(f"Task sent: {task_name}")
    
    def on_task_received(self, event):
        """Task received by a worker."""
        task_name = event['name']
        self._task_names[event['uuid']] = task_name
        self.task_stats[task_name]["total"] += 1
    
    def on_task_started(self, event):
        """Task started."""
        task_name = self._task_name(event)
        logger.info(f"Task started: {task_name}")
    
    def on_task_succeeded(self, event):
        """Task succeeded."""
        task_name = self._task_name(event)
        self._task_names.pop(event['uuid'], None)
        runtime = event['runtime']
        
        stats = self.task_stats[task_name]
        stats["succeeded"] += 1
        stats["last_run"] = datetime.utcnow()
        
        # Update the running average of the runtime
        n = stats["succeeded"]
        stats["avg_runtime"] = (
            (stats["avg_runtime"] * (n - 1) + runtime) / n
        )
        
        logger.info(
            f"Task succeeded: {task_name} (runtime: {runtime:.2f}s)"
        )
    
    def on_task_failed(self, event):
        """Task failed."""
        task_name = self._task_name(event)
        self._task_names.pop(event['uuid'], None)
        exception = event.get('exception')
        
        self.task_stats[task_name]["failed"] += 1
        
        logger.error(
            f"Task failed: {task_name} - {exception}"
        )
    
    def on_task_retried(self, event):
        """Task retried."""
        task_name = self._task_name(event)
        self.task_stats[task_name]["retried"] += 1
        
        logger.warning(f"Task retried: {task_name}")
    
    def get_statistics(self) -> Dict:
        """
        Get per-task statistics.
        """
        return dict(self.task_stats)
    
    def get_health_status(self) -> Dict:
        """
        Get the overall health status based on the failure rate.
        """
        total_tasks = sum(s["total"] for s in self.task_stats.values())
        total_failed = sum(s["failed"] for s in self.task_stats.values())
        
        failure_rate = total_failed / total_tasks if total_tasks > 0 else 0
        
        health = "healthy"
        if failure_rate > 0.1:
            health = "degraded"
        if failure_rate > 0.3:
            health = "unhealthy"
        
        return {
            "status": health,
            "total_tasks": total_tasks,
            "total_failed": total_failed,
            "failure_rate": failure_rate,
            "task_stats": self.get_statistics()
        }


# Start the monitor
if __name__ == "__main__":
    monitor = TaskMonitor('redis://localhost:6379/0')
    monitor.start_monitoring()
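
The monitor keeps a running average of runtimes without storing individual samples, and maps the failure rate to a health label with the 10% and 30% thresholds from `get_health_status`. The standalone sketch below isolates both pieces so they can be checked independently. The function names `update_running_mean` and `classify_health` are illustrative and do not appear in the monitor itself.

```python
from statistics import mean
from typing import List


def update_running_mean(prev_mean: float, n: int, new_value: float) -> float:
    """Incremental mean: n is the sample count including new_value."""
    return (prev_mean * (n - 1) + new_value) / n


def classify_health(failure_rate: float) -> str:
    """Same thresholds as get_health_status: >30% unhealthy, >10% degraded."""
    if failure_rate > 0.3:
        return "unhealthy"
    if failure_rate > 0.1:
        return "degraded"
    return "healthy"


# Feed runtimes one at a time, as on_task_succeeded would
runtimes: List[float] = [1.2, 0.8, 2.5, 1.5]
avg = 0.0
for count, runtime in enumerate(runtimes, start=1):
    avg = update_running_mean(avg, count, runtime)

# The incremental result agrees with the batch mean up to rounding error
matches_batch_mean = abs(avg - mean(runtimes)) < 1e-9
```

Because the statistics live in process memory, they reset whenever the monitor restarts. Exporting them to a metrics backend would be one way to keep history across restarts.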

9. Frontend Integration

9.1 React Component: Bottleneck Detection Panel

// frontend/src/components/AIOptimizer/BottleneckDetectionPanel.tsx
import React, { useState, useEffect } from 'react';
import {
  Card,
  CardContent,
  CardHeader,
  Typography,
  Button,
  Table,
  TableBody,
  TableCell,
  TableHead,
  TableRow,
  Chip,
  CircularProgress,
  Alert,
  Dialog,
  DialogTitle,
  DialogContent,
  DialogActions
} from '@mui/material';
import { Warning, CheckCircle, Error as ErrorIcon } from '@mui/icons-material';
import axios from 'axios';

interface Bottleneck {
  id: number;
  workflow_id: number;
  task_id?: number;
  bottleneck_type: string;
  severity: string;
  confidence_score: number;
  description: string;
  detected_at: string;
  current_metrics: any;
  impact_analysis?: any;
}

interface BottleneckDetectionPanelProps {
  workflowId: number;
}

const BottleneckDetectionPanel: React.FC<BottleneckDetectionPanelProps> = ({ workflowId }) => {
  const [bottlenecks, setBottlenecks] = useState<Bottleneck[]>([]);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);
  const [selectedBottleneck, setSelectedBottleneck] = useState<Bottleneck | null>(null);
  const [detailsOpen, setDetailsOpen] = useState(false);

  const fetchBottlenecks = async () => {
    setLoading(true);
    setError(null);

    try {
      const response = await axios.get(
        `/api/v1/bottlenecks/workflow/${workflowId}`
      );
      setBottlenecks(response.data.bottlenecks);
    } catch (err: any) {
      setError(err.response?.data?.detail || 'Failed to fetch bottlenecks');
    } finally {
      setLoading(false);
    }
  };

  const detectBottlenecks = async () => {
    setLoading(true);
    setError(null);

    try {
      const response = await axios.post('/api/v1/bottlenecks/detect', {
        workflow_id: workflowId,
        lookback_days: 7,
        detector_type: 'statistical'
      });
      setBottlenecks(response.data.bottlenecks);
    } catch (err: any) {
      setError(err.response?.data?.detail || 'Failed to detect bottlenecks');
    } finally {
      setLoading(false);
    }
  };

  const resolveBottleneck = async (bottleneckId: number) => {
    try {
      await axios.post(`/api/v1/bottlenecks/${bottleneckId}/resolve`);
      fetchBottlenecks();
    } catch (err: any) {
      setError(err.response?.data?.detail || 'Failed to resolve bottleneck');
    }
  };

  useEffect(() => {
    fetchBottlenecks();
  }, [workflowId]);

  const getSeverityIcon = (severity: string) => {
    switch (severity) {
      case 'critical':
        return <ErrorIcon color="error" />;
      case 'high':
        return <Warning color="warning" />;
      case 'medium':
        return <Warning color="info" />;
      default:
        return <CheckCircle color="success" />;
    }
  };

  const getSeverityColor = (severity: string): "error" | "warning" | "info" | "success" => {
    switch (severity) {
      case 'critical':
        return 'error';
      case 'high':
        return 'warning';
      case 'medium':
        return 'info';
      default:
        return 'success';
    }
  };

  return (
    <Card>
      <CardHeader
        title="Performance Bottlenecks"
        action={
          <Button
            variant="contained"
            onClick={detectBottlenecks}
            disabled={loading}
          >
            {loading ? <CircularProgress size={24} /> : 'Detect Bottlenecks'}
          </Button>
        }
      />
      <CardContent>
        {error && (
          <Alert severity="error" sx={{ mb: 2 }}>
            {error}
          </Alert>
        )}

        {loading ? (
          <CircularProgress />
        ) : bottlenecks.length === 0 ? (
          <Typography>No bottlenecks detected</Typography>
        ) : (
          <Table>
            <TableHead>
              <TableRow>
                <TableCell>Severity</TableCell>
                <TableCell>Type</TableCell>
                <TableCell>Description</TableCell>
                <TableCell>Confidence</TableCell>
                <TableCell>Detected</TableCell>
                <TableCell>Actions</TableCell>
              </TableRow>
            </TableHead>
            <TableBody>
              {bottlenecks.map((bottleneck) => (
                <TableRow key={bottleneck.id}>
                  <TableCell>
                    {getSeverityIcon(bottleneck.severity)}
                    <Chip
                      label={bottleneck.severity}
                      color={getSeverityColor(bottleneck.severity)}
                      size="small"
                      sx={{ ml: 1 }}
                    />
                  </TableCell>
                  <TableCell>{bottleneck.bottleneck_type}</TableCell>
                  <TableCell>{bottleneck.description}</TableCell>
                  <TableCell>{(bottleneck.confidence_score * 100).toFixed(1)}%</TableCell>
                  <TableCell>
                    {new Date(bottleneck.detected_at).toLocaleString()}
                  </TableCell>
                  <TableCell>
                    <Button
                      size="small"
                      onClick={() => {
                        setSelectedBottleneck(bottleneck);
                        setDetailsOpen(true);
                      }}
                    >
                      Details
                    </Button>
                    <Button
                      size="small"
                      color="success"
                      onClick={() => resolveBottleneck(bottleneck.id)}
                    >
                      Resolve
                    </Button>
                  </TableCell>
                </TableRow>
              ))}
            </TableBody>
          </Table>
        )}

        {/* Bottleneck Details Dialog */}
        <Dialog
          open={detailsOpen}
          onClose={() => setDetailsOpen(false)}
          maxWidth="md"
          fullWidth
        >
          <DialogTitle>Bottleneck Details</DialogTitle>
          <DialogContent>
            {selectedBottleneck && (
              <>
                <Typography variant="h6">{selectedBottleneck.description}</Typography>
                <Typography variant="body2" color="textSecondary" sx={{ mt: 1 }}>
                  Type: {selectedBottleneck.bottleneck_type}
                </Typography>
                <Typography variant="body2" color="textSecondary">
                  Confidence: {(selectedBottleneck.confidence_score * 100).toFixed(1)}%
                </Typography>

                <Typography variant="subtitle1" sx={{ mt: 2 }}>
                  Current Metrics:
                </Typography>
                <pre>{JSON.stringify(selectedBottleneck.current_metrics, null, 2)}</pre>

                {selectedBottleneck.impact_analysis && (
                  <>
                    <Typography variant="subtitle1" sx={{ mt: 2 }}>
                      Impact Analysis:
                    </Typography>
                    <pre>{JSON.stringify(selectedBottleneck.impact_analysis, null, 2)}</pre>
                  </>
                )}
              </>
            )}
          </DialogContent>
          <DialogActions>
            <Button onClick={() => setDetailsOpen(false)}>Close</Button>
          </DialogActions>
        </Dialog>
      </CardContent>
    </Card>
  );
};

export default BottleneckDetectionPanel;

9.2 React Component - Optimization Recommendation Panel

// frontend/src/components/AIOptimizer/RecommendationPanel.tsx
import React, { useState, useEffect } from 'react';
import {
  Card,
  CardContent,
  CardHeader,
  Typography,
  Button,
  List,
  Chip,
  Accordion,
  AccordionSummary,
  AccordionDetails,
  CircularProgress,
  Alert,
  Dialog,
  DialogTitle,
  DialogContent,
  DialogActions,
  Stepper,
  Step,
  StepLabel
} from '@mui/material';
import {
  ExpandMore,
  ThumbUp,
  ThumbDown,
  Info
} from '@mui/icons-material';
import axios from 'axios';

interface Recommendation {
  id: number;
  title: string;
  description: string;
  priority: string;
  recommendation_type: string;
  expected_improvement: any;
  implementation_difficulty: string;
  estimated_effort_hours?: number;
  implementation_steps?: any[];
  risk_level: string;
  status: string;
}

interface RecommendationPanelProps {
  workflowId: number;
}

const RecommendationPanel: React.FC<RecommendationPanelProps> = ({ workflowId }) => {
  const [recommendations, setRecommendations] = useState<Recommendation[]>([]);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);
  const [selectedRec, setSelectedRec] = useState<Recommendation | null>(null);
  const [implementationOpen, setImplementationOpen] = useState(false);

  const fetchRecommendations = async () => {
    setLoading(true);
    setError(null);

    try {
      const response = await axios.get(
        `/api/v1/recommendations/workflow/${workflowId}`
      );
      setRecommendations(response.data.recommendations);
    } catch (err: any) {
      setError(err.response?.data?.detail || 'Failed to fetch recommendations');
    } finally {
      setLoading(false);
    }
  };

  const approveRecommendation = async (recId: number) => {
    try {
      await axios.post(`/api/v1/recommendations/${recId}/approve`);
      fetchRecommendations();
    } catch (err: any) {
      setError(err.response?.data?.detail || 'Failed to approve recommendation');
    }
  };

  const rejectRecommendation = async (recId: number, reason: string) => {
    try {
      await axios.post(`/api/v1/recommendations/${recId}/reject`, { reason });
      fetchRecommendations();
    } catch (err: any) {
      setError(err.response?.data?.detail || 'Failed to reject recommendation');
    }
  };

  useEffect(() => {
    fetchRecommendations();
  }, [workflowId]);

  const getPriorityColor = (priority: string): "error" | "warning" | "info" | "default" => {
    switch (priority) {
      case 'urgent':
        return 'error';
      case 'high':
        return 'warning';
      case 'medium':
        return 'info';
      default:
        return 'default';
    }
  };

  return (
    <Card>
      <CardHeader title="Optimization Recommendations" />
      <CardContent>
        {error && (
          <Alert severity="error" sx={{ mb: 2 }}>
            {error}
          </Alert>
        )}

        {loading ? (
          <CircularProgress />
        ) : recommendations.length === 0 ? (
          <Typography>No recommendations available</Typography>
        ) : (
          <List>
            {recommendations.map((rec) => (
              <Accordion key={rec.id}>
                <AccordionSummary expandIcon={<ExpandMore />}>
                  <div style={{ display: 'flex', alignItems: 'center', gap: '8px', width: '100%' }}>
                    <Chip
                      label={rec.priority}
                      color={getPriorityColor(rec.priority)}
                      size="small"
                    />
                    <Chip
                      label={rec.recommendation_type}
                      variant="outlined"
                      size="small"
                    />
                    <Typography sx={{ flexGrow: 1 }}>{rec.title}</Typography>
                    <Chip
                      label={rec.status}
                      size="small"
                      color={rec.status === 'approved' ? 'success' : 'default'}
                    />
                  </div>
                </AccordionSummary>
                <AccordionDetails>
                  <Typography variant="body2" paragraph>
                    {rec.description}
                  </Typography>

                  {rec.expected_improvement && (
                    <>
                      <Typography variant="subtitle2">Expected Improvement:</Typography>
                      <pre style={{ fontSize: '12px' }}>
                        {JSON.stringify(rec.expected_improvement, null, 2)}
                      </pre>
                    </>
                  )}

                  <div style={{ marginTop: '16px' }}>
                    <Chip label={`Difficulty: ${rec.implementation_difficulty}`} sx={{ mr: 1 }} />
                    <Chip label={`Risk: ${rec.risk_level}`} sx={{ mr: 1 }} />
                    {rec.estimated_effort_hours != null && (
                      <Chip label={`Effort: ${rec.estimated_effort_hours}h`} />
                    )}
                  </div>

                  <div style={{ marginTop: '16px', display: 'flex', gap: '8px' }}>
                    {rec.implementation_steps && (
                      <Button
                        startIcon={<Info />}
                        onClick={() => {
                          setSelectedRec(rec);
                          setImplementationOpen(true);
                        }}
                      >
                        Implementation Steps
                      </Button>
                    )}
                    {rec.status === 'pending' && (
                      <>
                        <Button
                          variant="contained"
                          color="success"
                          startIcon={<ThumbUp />}
                          onClick={() => approveRecommendation(rec.id)}
                        >
                          Approve & Test
                        </Button>
                        <Button
                          variant="outlined"
                          color="error"
                          startIcon={<ThumbDown />}
                          onClick={() => rejectRecommendation(rec.id, 'Not applicable')}
                        >
                          Reject
                        </Button>
                      </>
                    )}
                  </div>
                </AccordionDetails>
              </Accordion>
            ))}
          </List>
        )}

        {/* Implementation Steps Dialog */}
        <Dialog
          open={implementationOpen}
          onClose={() => setImplementationOpen(false)}
          maxWidth="md"
          fullWidth
        >
          <DialogTitle>Implementation Steps</DialogTitle>
          <DialogContent>
            {selectedRec?.implementation_steps && (
              <Stepper orientation="vertical">
                {selectedRec.implementation_steps.map((step: any, index: number) => (
                  <Step key={index} active>
                    <StepLabel>{step.title || `Step ${index + 1}`}</StepLabel>
                    <Typography variant="body2" sx={{ mt: 1, mb: 2 }}>
                      {step.description}
                    </Typography>
                    {step.code && (
                      <pre style={{ background: '#f5f5f5', padding: '8px', borderRadius: '4px' }}>
                        {step.code}
                      </pre>
                    )}
                  </Step>
                ))}
              </Stepper>
            )}
          </DialogContent>
          <DialogActions>
            <Button onClick={() => setImplementationOpen(false)}>Close</Button>
          </DialogActions>
        </Dialog>
      </CardContent>
    </Card>
  );
};

export default RecommendationPanel;

9.3 React Component - A/B Test Monitoring Panel

// frontend/src/components/AIOptimizer/ExperimentMonitorPanel.tsx
import React, { useState, useEffect } from 'react';
import {
  Card,
  CardContent,
  CardHeader,
  Typography,
  Button,
  LinearProgress,
  Chip,
  Grid,
  Alert
} from '@mui/material';
import { PlayArrow, Stop, CheckCircle } from '@mui/icons-material';
import axios from 'axios';

interface Experiment {
  id: number;
  name: string;
  status: string;
  hypothesis: string;
  primary_metric: string;
  control_group_size: number;
  treatment_group_size: number;
  started_at?: string;
  winner?: string;
}

interface ExperimentMonitorPanelProps {
  workflowId: number;
}

const ExperimentMonitorPanel: React.FC<ExperimentMonitorPanelProps> = ({ workflowId }) => {
  const [experiments, setExperiments] = useState<Experiment[]>([]);
  const [loading, setLoading] = useState(false);

  const fetchExperiments = async () => {
    setLoading(true);
    try {
      const response = await axios.get('/api/v1/experiments', {
        params: { workflow_id: workflowId }
      });
      setExperiments(response.data.experiments);
    } catch (err) {
      console.error('Failed to fetch experiments', err);
    } finally {
      setLoading(false);
    }
  };

  const startExperiment = async (expId: number) => {
    try {
      await axios.post(`/api/v1/experiments/${expId}/start`);
      fetchExperiments();
    } catch (err) {
      console.error('Failed to start experiment', err);
    }
  };

  const stopExperiment = async (expId: number) => {
    try {
      await axios.post(`/api/v1/experiments/${expId}/stop`);
      fetchExperiments();
    } catch (err) {
      console.error('Failed to stop experiment', err);
    }
  };

  const rolloutWinner = async (expId: number) => {
    try {
      await axios.post(`/api/v1/experiments/${expId}/rollout`);
      fetchExperiments();
    } catch (err) {
      console.error('Failed to rollout winner', err);
    }
  };

  useEffect(() => {
    fetchExperiments();
    const interval = setInterval(fetchExperiments, 30000); // Refresh every 30s
    return () => clearInterval(interval);
  }, [workflowId]);

  const getStatusColor = (status: string): "default" | "primary" | "success" | "error" => {
    switch (status) {
      case 'running':
        return 'primary';
      case 'completed':
        return 'success';
      case 'failed':
        return 'error';
      default:
        return 'default';
    }
  };

  return (
    <Card>
      <CardHeader title="A/B Test Experiments" />
      <CardContent>
        {loading && <LinearProgress />}

        <Grid container spacing={2}>
          {experiments.map((exp) => (
            <Grid item xs={12} md={6} key={exp.id}>
              <Card variant="outlined">
                <CardContent>
                  <div style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center' }}>
                    <Typography variant="h6">{exp.name}</Typography>
                    <Chip label={exp.status} color={getStatusColor(exp.status)} size="small" />
                  </div>

                  <Typography variant="body2" color="textSecondary" sx={{ mt: 1 }}>
                    {exp.hypothesis}
                  </Typography>

                  <Typography variant="caption" display="block" sx={{ mt: 1 }}>
                    Primary Metric: {exp.primary_metric}
                  </Typography>

                  <div style={{ marginTop: '16px' }}>
                    <Typography variant="caption">Sample Sizes:</Typography>
                    <div style={{ display: 'flex', gap: '8px', marginTop: '4px' }}>
                      <Chip label={`Control: ${exp.control_group_size}`} size="small" />
                      <Chip label={`Treatment: ${exp.treatment_group_size}`} size="small" />
                    </div>
                  </div>

                  {exp.winner && (
                    <Alert severity="success" sx={{ mt: 2 }}>
                      Winner: {exp.winner}
                    </Alert>
                  )}

                  <div style={{ marginTop: '16px', display: 'flex', gap: '8px' }}>
                    {exp.status === 'draft' && (
                      <Button
                        size="small"
                        variant="contained"
                        startIcon={<PlayArrow />}
                        onClick={() => startExperiment(exp.id)}
                      >
                        Start
                      </Button>
                    )}
                    {exp.status === 'running' && (
                      <Button
                        size="small"
                        variant="outlined"
                        color="error"
                        startIcon={<Stop />}
                        onClick={() => stopExperiment(exp.id)}
                      >
                        Stop
                      </Button>
                    )}
                    {exp.status === 'completed' && exp.winner && (
                      <Button
                        size="small"
                        variant="contained"
                        color="success"
                        startIcon={<CheckCircle />}
                        onClick={() => rolloutWinner(exp.id)}
                      >
                        Rollout Winner
                      </Button>
                    )}
                  </div>
                </CardContent>
              </Card>
            </Grid>
          ))}
        </Grid>

        {experiments.length === 0 && !loading && (
          <Typography color="textSecondary">No experiments found</Typography>
        )}
      </CardContent>
    </Card>
  );
};

export default ExperimentMonitorPanel;

10. Deployment and Configuration

10.1 Docker Configuration

# Dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    postgresql-client \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV ENVIRONMENT=production

# Expose the API port
EXPOSE 8000

# Startup command
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# docker-compose.yml
version: '3.8'

services:
  postgres:
    image: postgres:14
    environment:
      POSTGRES_DB: workflow_db
      POSTGRES_USER: workflow_user
      POSTGRES_PASSWORD: workflow_pass
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  api:
    build: .
    depends_on:
      - postgres
      - redis
    environment:
      DATABASE_URL: postgresql://workflow_user:workflow_pass@postgres:5432/workflow_db
      REDIS_URL: redis://redis:6379/0
    ports:
      - "8000:8000"
    volumes:
      - ./:/app

  celery_worker:
    build: .
    command: celery -A tasks.ai_optimizer_tasks worker --loglevel=info
    depends_on:
      - postgres
      - redis
    environment:
      DATABASE_URL: postgresql://workflow_user:workflow_pass@postgres:5432/workflow_db
      REDIS_URL: redis://redis:6379/0

  celery_beat:
    build: .
    command: celery -A tasks.ai_optimizer_tasks beat --loglevel=info
    depends_on:
      - postgres
      - redis
    environment:
      DATABASE_URL: postgresql://workflow_user:workflow_pass@postgres:5432/workflow_db
      REDIS_URL: redis://redis:6379/0

  flower:
    build: .
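    # Needs the flower package installed in the image; the requirements.txt shown earlier does not list it
    command: celery -A tasks.ai_optimizer_tasks flower --port=5555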
    depends_on:
      - redis
    ports:
      - "5555:5555"
    environment:
      REDIS_URL: redis://redis:6379/0

volumes:
  postgres_data:

10.2 Configuration Files

# config.py
# Pydantic v2 moved BaseSettings into the separate pydantic-settings package
# (pip install pydantic-settings). On Pydantic v1, use
# `from pydantic import BaseSettings` instead.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Database configuration
    DATABASE_URL: str = "postgresql://localhost/workflow_db"
    
    # Redis configuration
    REDIS_URL: str = "redis://localhost:6379/0"
    
    # AI optimizer configuration
    BOTTLENECK_DETECTION_THRESHOLD: float = 0.7
    ML_MODEL_UPDATE_INTERVAL_DAYS: int = 7
    MAX_CONCURRENT_EXPERIMENTS: int = 3
    
    # Performance thresholds
    SLOW_TASK_THRESHOLD_SECONDS: float = 60.0
    HIGH_CPU_THRESHOLD_PERCENT: float = 80.0
    HIGH_MEMORY_THRESHOLD_MB: float = 1000.0
    ERROR_RATE_THRESHOLD: float = 0.05
    
    # A/B testing configuration
    MIN_SAMPLE_SIZE: int = 30
    SIGNIFICANCE_LEVEL: float = 0.05
    POWER: float = 0.8
    
    # Logging configuration
    LOG_LEVEL: str = "INFO"
    
    # Environment
    ENVIRONMENT: str = "development"
    
    class Config:
        env_file = ".env"

settings = Settings()

# .env
DATABASE_URL=postgresql://workflow_user:workflow_pass@localhost:5432/workflow_db
REDIS_URL=redis://localhost:6379/0

BOTTLENECK_DETECTION_THRESHOLD=0.7
ML_MODEL_UPDATE_INTERVAL_DAYS=7
MAX_CONCURRENT_EXPERIMENTS=3

SLOW_TASK_THRESHOLD_SECONDS=60.0
HIGH_CPU_THRESHOLD_PERCENT=80.0
HIGH_MEMORY_THRESHOLD_MB=1000.0
ERROR_RATE_THRESHOLD=0.05

MIN_SAMPLE_SIZE=30
SIGNIFICANCE_LEVEL=0.05
POWER=0.8

LOG_LEVEL=INFO
ENVIRONMENT=production
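
The three A/B testing settings above are linked: for a two-sided comparison of two group means, SIGNIFICANCE_LEVEL and POWER, together with the smallest effect worth detecting, determine how many runs each group needs. The sketch below uses the textbook normal-approximation formula n = 2 * ((z_alpha + z_power) / d)^2. The function name `required_sample_size` and the effect-size argument are illustrative assumptions, not part of the module.

```python
import math
from statistics import NormalDist


def required_sample_size(effect_size: float,
                         significance_level: float = 0.05,
                         power: float = 0.8) -> int:
    """Per-group sample size for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2, where d is
    the standardized effect size (difference in means / standard deviation).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - significance_level / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):
        print(f"effect size {d}: {required_sample_size(d)} runs per group")
```

Under this approximation a medium effect (d = 0.5) needs 63 runs per group, so a floor of MIN_SAMPLE_SIZE=30 is only adequately powered for fairly large effects (roughly d = 0.7 and above). Treating MIN_SAMPLE_SIZE as a lower bound and computing the actual requirement per experiment may be the safer design.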

11. Testing

11.1 Unit Tests

# tests/test_bottleneck_detector.py
import pytest
from sqlalchemy.orm import Session
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from models.ai_optimizer_models import PerformanceBottleneck

@pytest.fixture
def detector(db_session: Session):
    return StatisticalBottleneckDetector(db_session)

def test_detect_cpu_bottleneck(detector, sample_workflow_data):
    """Test CPU bottleneck detection."""
    bottlenecks = detector.detect_workflow_bottlenecks(
        workflow_id=1,
        lookback_days=7
    )
    
    cpu_bottlenecks = [
        b for b in bottlenecks
        if b.bottleneck_type == 'cpu'
    ]
    
    assert len(cpu_bottlenecks) > 0
    assert cpu_bottlenecks[0].severity in ['low', 'medium', 'high', 'critical']

def test_confidence_score_range(detector, sample_workflow_data):
    """Test that confidence scores fall within [0, 1]."""
    bottlenecks = detector.detect_workflow_bottlenecks(
        workflow_id=1,
        lookback_days=7
    )
    
    for bottleneck in bottlenecks:
        assert 0.0 <= bottleneck.confidence_score <= 1.0

11.2 Integration Tests

# tests/test_recommendation_flow.py
import pytest
from services.bottleneck_detectors.statistical_detector import StatisticalBottleneckDetector
from services.optimization.recommendation_engine import RecommendationEngine
from services.ab_testing.experiment_scheduler import ExperimentScheduler

def test_full_optimization_flow(db_session, sample_workflow):
    """Test the full optimization flow."""
    
    # 1. Detect bottlenecks
    detector = StatisticalBottleneckDetector(db_session)
    bottlenecks = detector.detect_workflow_bottlenecks(
        workflow_id=sample_workflow.id,
        lookback_days=7
    )
    
    assert len(bottlenecks) > 0
    
    # 2. Generate recommendations
    engine = RecommendationEngine(db_session)
    recommendations = []
    
    for bottleneck in bottlenecks:
        recs = engine.generate_recommendations(bottleneck)
        recommendations.extend(recs)
    
    assert len(recommendations) > 0
    
    # 3. Schedule experiments
    recommendation_ids = [rec.id for rec in recommendations[:2]]
    
    scheduler = ExperimentScheduler(db_session)
    experiments = scheduler.schedule_experiments_for_recommendations(
        recommendation_ids
    )
    
    assert len(experiments) > 0
    assert all(exp.status == 'draft' for exp in experiments)

12. Monitoring and Alerting

12.1 Prometheus Metrics

# monitoring/metrics.py
from prometheus_client import Counter, Histogram, Gauge

# Bottleneck detection metrics
bottlenecks_detected = Counter(
    'bottlenecks_detected_total',
    'Total number of bottlenecks detected',
    ['workflow_id', 'severity', 'type']
)

bottleneck_detection_duration = Histogram(
    'bottleneck_detection_duration_seconds',
    'Time spent detecting bottlenecks',
    ['workflow_id']
)

# Optimization recommendation metrics
recommendations_generated = Counter(
    'recommendations_generated_total',
    'Total number of recommendations generated',
    ['workflow_id', 'type', 'priority']
)

# A/B testing metrics
experiments_running = Gauge(
    'experiments_running',
    'Number of currently running experiments'
)

experiment_success_rate = Gauge(
    'experiment_success_rate',
    'Fraction of successful experiments (0.0-1.0)',
    ['workflow_id']
)

# Usage example
def record_bottleneck_detected(workflow_id: int, severity: str, bottle_type: str):
    bottlenecks_detected.labels(
        workflow_id=str(workflow_id),
        severity=severity,
        type=bottle_type
    ).inc()

12.2 Alerting Rules

# prometheus/alerts.yml
groups:
  - name: ai_optimizer_alerts
    rules:
      - alert: HighBottleneckDetectionRate
        # rate() yields a per-second value; multiply by 60 to compare per minute
        expr: rate(bottlenecks_detected_total[5m]) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High bottleneck detection rate"
          description: "Detecting more than 10 bottlenecks per minute"

      - alert: ExperimentFailureRate
        expr: (1 - experiment_success_rate) > 0.3
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High experiment failure rate"
          description: "More than 30% of experiments are failing"

      - alert: TooManyRunningExperiments
        expr: experiments_running > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many concurrent experiments"
          description: "More than 5 experiments running simultaneously"

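This completes the implementation of the AI optimizer module.

The module provides:

  1. Intelligent bottleneck detection - dual statistical and ML engines
  2. Automated optimization recommendations - multi-dimensional analysis and suggestions
  3. A/B testing framework - safe validation of optimization effects
  4. Automated scheduling - Celery background tasks
  5. Complete API - RESTful endpoints
  6. Frontend integration - React components
  7. Monitoring and alerting - Prometheus metrics
  8. Deployment configuration - Docker containerization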

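AI Optimizer Module Summary

Core Features

This is an automated workflow performance optimization system that uses AI techniques to provide:

1. Intelligent Bottleneck Detection 🔍

  • Statistical detector: analyzes metrics such as CPU, memory, execution time, and error rate
  • ML detector: uses the Isolation Forest algorithm to detect anomalous patterns
  • Automatically assesses severity (low/medium/high/critical) and confidence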
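
To make the severity and confidence idea concrete, here is a minimal sketch of a statistical check on execution time. It swaps in a plain z-score against the historical baseline; it is not the module's `StatisticalBottleneckDetector` and not an Isolation Forest. The function name `classify_duration`, the cut-offs at 2, 3, and 4 standard deviations, and the confidence mapping are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def classify_duration(history: list[float], latest: float) -> tuple[str, float]:
    """Return (severity, confidence) for the latest run against its baseline.

    The cut-offs and the confidence mapping are illustrative assumptions.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical runs for a baseline")
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any slowdown is treated as extreme.
        z = 0.0 if latest <= baseline else float("inf")
    else:
        z = (latest - baseline) / spread
    if z >= 4:
        severity = "critical"
    elif z >= 3:
        severity = "high"
    elif z >= 2:
        severity = "medium"
    else:
        severity = "low"
    # Squash the z-score into [0, 1]; faster-than-baseline runs score 0.
    confidence = min(max(z, 0.0) / 4.0, 1.0)
    return severity, confidence


if __name__ == "__main__":
    past_runs = [10.0, 11.0, 9.0, 10.0, 10.0]  # seconds
    print(classify_duration(past_runs, 12.0))
```

Keeping confidence clamped to [0, 1] matches the range the unit tests in section 11.1 assert on `confidence_score`.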

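2. Automated Optimization Recommendations 💡

  • Generates actionable optimization plans for each detected bottleneck
  • Covers parallelization, resource adjustment, batching, caching, algorithm optimization, and more
  • Provides implementation steps, expected benefit, risk assessment, and effort estimates

3. A/B Test Validation 🧪

  • Automatically creates control and treatment groups
  • Statistical significance tests (t-test, chi-square test)
  • Intelligent traffic allocation and early stopping
  • Safe rollout of the winning configuration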
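
As an illustration of the significance check for a rate metric such as error rate, the sketch below runs a pooled two-proportion z-test, which for a 2x2 table is equivalent to the chi-square test without continuity correction. It uses only the standard library; the function name `error_rate_p_value` and the sample counts are hypothetical and do not come from the module's experiment analyzer.

```python
import math
from statistics import NormalDist


def error_rate_p_value(control_errors: int, control_runs: int,
                       treatment_errors: int, treatment_runs: int) -> float:
    """Two-sided p-value for a difference in error rates (pooled z-test)."""
    if control_runs <= 0 or treatment_runs <= 0:
        raise ValueError("both groups need at least one run")
    p_control = control_errors / control_runs
    p_treatment = treatment_errors / treatment_runs
    pooled = (control_errors + treatment_errors) / (control_runs + treatment_runs)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / control_runs + 1 / treatment_runs)
    )
    if std_err == 0:
        # Both groups are all-success or all-failure: no evidence of a difference.
        return 1.0
    z = (p_treatment - p_control) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical counts: 20 failures in 200 control runs, 8 in 200 treatment runs.
    p = error_rate_p_value(20, 200, 8, 200)
    print(f"p-value: {p:.4f}, significant at 0.05: {p < 0.05}")
```

The resulting p-value would be compared against SIGNIFICANCE_LEVEL from the configuration. Note that checking it repeatedly while an experiment is still running inflates the false-positive rate, so an early-stopping rule needs a correction for repeated looks.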

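4. Automated Operations ⚙️

  • Celery scheduled tasks: detection every hour, recommendation generation every 6 hours
  • Automatic experiment scheduling and rotation
  • Data cleanup and ML model retraining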
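
The hourly and six-hourly cadence could be expressed as a Celery beat schedule along the following lines. This is a configuration sketch only: the module path `tasks.ai_optimizer_tasks` comes from the docker-compose commands, but the two task function names are hypothetical placeholders. Celery accepts a `timedelta` as the `schedule` value of a beat entry, so the dictionary itself needs nothing beyond the standard library.

```python
from datetime import timedelta

# Task names below are hypothetical placeholders, not the module's real task paths.
beat_schedule = {
    "detect-bottlenecks-hourly": {
        "task": "tasks.ai_optimizer_tasks.detect_bottlenecks",
        "schedule": timedelta(hours=1),
    },
    "generate-recommendations-every-6h": {
        "task": "tasks.ai_optimizer_tasks.generate_recommendations",
        "schedule": timedelta(hours=6),
    },
}

# With a Celery app instance this would be applied as:
#   app.conf.beat_schedule = beat_schedule
```

Only the `celery_beat` service should load this schedule; running more than one beat process would enqueue each periodic task multiple times.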

5. Complete Tech Stack 🛠️

  • Backend: FastAPI + SQLAlchemy + Celery
  • Frontend: React + Material-UI
  • Database: PostgreSQL
  • Monitoring: Prometheus + Grafana
  • Deployment: Docker Compose

Workflow

Bottleneck detection → Recommendation generation → Manual approval → A/B test → Result verification → Automated rollout
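
One way to keep this pipeline honest in code is to model it as an explicit state machine, so that a recommendation cannot reach rollout without passing approval and testing. The sketch below is an assumption-laden illustration: the `Stage` names, the added `REJECTED` state, and the `advance` helper are hypothetical and do not mirror the module's actual status fields.

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "bottleneck_detected"
    RECOMMENDED = "recommendation_generated"
    APPROVED = "approved"
    TESTING = "ab_testing"
    VERIFIED = "verified"
    ROLLED_OUT = "rolled_out"
    REJECTED = "rejected"


# Allowed transitions; rejection is possible at approval time and after a failed test.
TRANSITIONS = {
    Stage.DETECTED: {Stage.RECOMMENDED},
    Stage.RECOMMENDED: {Stage.APPROVED, Stage.REJECTED},
    Stage.APPROVED: {Stage.TESTING},
    Stage.TESTING: {Stage.VERIFIED, Stage.REJECTED},
    Stage.VERIFIED: {Stage.ROLLED_OUT},
    Stage.ROLLED_OUT: set(),
    Stage.REJECTED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move to the target stage, refusing any step the pipeline does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


if __name__ == "__main__":
    stage = Stage.DETECTED
    for nxt in (Stage.RECOMMENDED, Stage.APPROVED, Stage.TESTING,
                Stage.VERIFIED, Stage.ROLLED_OUT):
        stage = advance(stage, nxt)
    print(stage.name)
```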

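Value

  • 🚀 Automatic discovery of performance problems
  • 🎯 Data-driven optimization decisions
  • ✅ Risk-controlled, incremental improvement
  • 📊 Continuous monitoring and iterative optimization

The result is an intelligent operations system designed for production use, moving workflow performance optimization from manual, experience-driven analysis to AI-driven automation.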
