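[Machine Learning and Smart Healthcare] T2DM-EWS: Type 2 Diabetes Early Warning System (Multi-Parameter Ensemble Classification Model), Full Implementation

Complete project source code and architecture design document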


I. Overall Architecture Design

1.1 System Layer Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        Application Layer                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │ CLI entry   │  │ REST API    │  │ Visualization Dashboard │  │
│  │ main.py     │  │ (reserved)  │  │ matplotlib/seaborn      │  │
│  └──────┬──────┘  └──────┬──────┘  └────────────┬────────────┘  │
└─────────┼────────────────┼─────────────────────┼─────────────────┘
          │                │                     │
┌─────────┼────────────────┼─────────────────────┼─────────────────┐
│         ▼                ▼                     ▼                 │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │              Business Logic Layer (Pipeline)               │  │
│  │  T2DMPipeline.run_full_pipeline()                          │  │
│  │    Step 1: Data Generation                                 │  │
│  │    Step 2: Stratified Split                                 │  │
│  │    Step 3: Preprocessing (Impute → Clip → Scale)           │  │
│  │    Step 4: Feature Engineering (Prior + Poly + Select)    │  │
│  │    Step 5: Stacking Ensemble Training (5-Fold OOF)         │  │
│  │    Step 6: Evaluation (AUROC/AUPRC/Calibration)           │  │
│  │    Step 7: Explainability & Visualization                   │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
          │
┌─────────┼─────────────────────────────────────────────────────────┐
│         ▼                                                         │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                        Model Layer                         │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐ │  │
│  │  │ Base Learners│  │ Meta-Learner │  │ ClinicalExplainer│ │  │
│  │  │ (5 experts)  │──│ (LR fusion)  │──│ (SHAP-like)      │ │  │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘ │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘
          │
┌─────────┼─────────────────────────────────────────────────────────┐
│         ▼                                                         │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                         Data Layer                         │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌────────────────────┐  │  │
│  │  │ DataLoader  │  │Preprocessor │  │ FeatureEngineer    │  │  │
│  │  │ (Synthetic) │  │ (Impute/    │  │ (Prior/Poly/       │  │  │
│  │  │             │  │  Clip/Scale)│  │  SelectKBest)      │  │  │
│  │  └─────────────┘  └─────────────┘  └────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘

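1.2 Module Responsibility Matrix

| Module | Responsibility | Input | Output | Dependencies |
| --- | --- | --- | --- | --- |
| config.py | Global hyperparameters and constants | - | SystemConfig, DataConfig, ModelConfig, ThresholdConfig | - |
| data_loader.py | Data generation / loading | DataConfig | DataFrame (26 features + label) | sklearn.datasets |
| preprocessor.py | Cleaning and standardization | Raw DataFrame | Standardized matrix X_scaled | sklearn.impute, sklearn.preprocessing |
| feature_engineer.py | Feature construction and selection | X_scaled, y | X_engineered (35 dimensions) | sklearn.feature_selection |
| base_learners.py | Base learner cluster | X_engineered, y | Probability dict {name: proba} | sklearn.ensemble, sklearn.linear_model |
| meta_learner.py | Meta-learner fusion | Meta-feature matrix Z | Final risk probability | sklearn.linear_model |
| trainer.py | OOF training and serialization | Full training data | TrainedEnsemble object | sklearn.model_selection |
| evaluator.py | Metric computation and plotting | y_true, y_proba | Report dict + charts | sklearn.metrics, matplotlib |
| explainer.py | Global / local explanation | Trained model + data | Importance plot / waterfall plot | sklearn.inspection |
| visualizer.py | System-level visualization | Intermediate results | Dashboard / distribution plots / heatmaps | matplotlib, seaborn |
| pipeline.py | End-to-end orchestration | Config objects | Full result dict | All of the above |
| main.py | CLI entry point | Command-line arguments | Terminal output / files | pipeline.py |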

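1.3 Coordination Design

Data-flow coordination

  • DataLoader → Preprocessor → FeatureEngineer form a one-way data flow, passing numpy.ndarray between stages
  • FeatureEngineer output feeds both BaseLearners and MetaLearner (Trainer coordinates the OOF generation)
  • Evaluator receives the predictions of every base learner and of the stacking model for side-by-side comparison
  • Explainer attaches to the final model and provides post-hoc explanations

Control-flow coordination

  • Pipeline acts as the central controller and schedules the modules in a fixed 7-step order
  • Trainer uses StratifiedKFold internally to preserve the class ratio in every fold, and clone() so that each fold trains a fresh, unfitted copy of every base learner (no cross-fold contamination)
  • BaseLearners and MetaLearner are decoupled, so base-learner algorithms can be replaced independently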
1.4 Interface Integration

Internal interfaces

# Data-flow interfaces
X_raw: pd.DataFrame        → preprocessor.fit_transform() → X_scaled: np.ndarray
X_scaled, y                → feature_engineer.fit_transform() → X_fe: np.ndarray
X_fe, y                    → trainer.fit() → ensemble: TrainedEnsemble
ensemble, X_fe             → ensemble.predict_proba() → y_proba: np.ndarray

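External interfaces (reserved)

# RESTful API pseudocode (Flask/FastAPI wrapper layer).
# Assumes an existing `app`, a request model `ClinicalInput`, and a fitted `pipeline`.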
@app.post("/predict")
def predict_endpoint(payload: ClinicalInput):
    x_dict = payload.dict()
    result = pipeline.predict_single(x_dict)
    return {"risk": result["t2dm_5year_risk"], "level": result["risk_level"]}

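II. Test Standards

| Test item | Method | Pass criterion |
| --- | --- | --- |
| Syntax check | py_compile.compile() | No SyntaxError in any .py file |
| Data generation | Check df.shape, df.isna().sum() | Shape 8000×27, NaN rate ≈ 2% |
| Preprocessing | Check X_scaled statistics | Mean ≈ 0, standard deviation ≈ 1, no NaN |
| Feature engineering | Check X_fe.shape[1] | ≤ 35 (SelectKBest constraint) |
| Training convergence | Check predict_proba output of each base learner | Probabilities within [0, 1] |
| OOF completeness | Check that matrix Z has no NaN | Z.shape == (n_train, n_base) |
| Meta-learner | Check the sum of weights | sum(weights) ≈ 1.0 |
| End-to-end inference | Single-sample prediction latency | < 200 ms (single CPU core) |
| Serialization | pickle.dump/load round trip | Predictions identical after loading |
| Visualization | Check the output directory | 8 PNG files + 1 JSON report |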
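
As an illustration of the first test item, the syntax check could be scripted roughly as follows. This is a sketch: `find_syntax_errors` is a hypothetical helper, and the two temporary files stand in for the project's real modules.

```python
import py_compile
import tempfile
from pathlib import Path

def find_syntax_errors(paths):
    """Return the paths that fail to byte-compile."""
    failed = []
    for path in paths:
        try:
            # doraise=True turns a compile failure into an exception
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            failed.append(path)
    return failed

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    bad = Path(tmp) / "bad.py"
    good.write_text("x = 1\n")
    bad.write_text("def broken(:\n")   # deliberately invalid
    failures = find_syntax_errors([good, bad])
    failed_names = [p.name for p in failures]
```

In the project the same function would be pointed at the list of .py files in the repository, and the test passes when the returned list is empty.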

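III. Acceptance Criteria

  1. Functional acceptance: python main.py --mode train --prefix demo runs successfully and generates all figures under outputs/demo/figures/
  2. Performance acceptance: on the test set, AUROC ≥ 0.80, AUPRC ≥ 0.40, Brier Score ≤ 0.15
  3. Explainability acceptance: the top 3 features by global importance include FPG, BMI, Age, or their interaction terms
  4. Robustness acceptance: the single-sample prediction interface returns valid JSON for both complete and partially missing 26-feature inputs
  5. Deployment acceptance: the model package ensemble_model.pkl can be loaded with pickle.load in a fresh environment and used for inference directly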
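
The performance criterion can be turned into an automated gate. The sketch below assumes a report dict with the keys produced by the evaluator in section 4.8 (AUROC, AUPRC, Brier_Score); `check_performance` is a hypothetical helper and the numbers in `example_report` are made up purely to exercise it.

```python
def check_performance(report, min_auroc=0.80, min_auprc=0.40, max_brier=0.15):
    """Return one pass/fail flag per performance threshold."""
    return {
        "AUROC": report["AUROC"] >= min_auroc,
        "AUPRC": report["AUPRC"] >= min_auprc,
        "Brier_Score": report["Brier_Score"] <= max_brier,   # lower is better
    }

# Made-up values, not results of this project.
example_report = {"AUROC": 0.83, "AUPRC": 0.38, "Brier_Score": 0.11}
flags = check_performance(example_report)
accepted = all(flags.values())
```

Returning per-metric flags rather than a single boolean makes it obvious which threshold blocked acceptance.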

IV. Source Code Implementation

4.1 config.py

"""
T2DM-EWS global configuration module
"""
import os
from dataclasses import dataclass, field
from typing import List

@dataclass
class SystemConfig:
    random_state: int = 42
    n_jobs: int = -1
    test_size: float = 0.2
    cv_folds: int = 5

    project_root: str = field(default_factory=lambda: os.path.dirname(os.path.abspath(__file__)))
    data_dir: str = field(init=False)
    model_dir: str = field(init=False)
    output_dir: str = field(init=False)
    figure_dir: str = field(init=False)

    def __post_init__(self):
        self.data_dir = os.path.join(self.project_root, "data")
        self.model_dir = os.path.join(self.project_root, "models")
        self.output_dir = os.path.join(self.project_root, "outputs")
        self.figure_dir = os.path.join(self.output_dir, "figures")
        for d in [self.data_dir, self.model_dir, self.output_dir, self.figure_dir]:
            os.makedirs(d, exist_ok=True)

@dataclass
class DataConfig:
    n_samples: int = 8000
    n_features: int = 26
    n_informative: int = 18
    n_redundant: int = 6
    n_classes: int = 2
    random_state: int = 42
    positive_ratio: float = 0.15
    flip_y: float = 0.03
    missing_rate: float = 0.02

    feature_names: List[str] = field(default_factory=lambda: [
        "Age", "FPG", "BMI", "HbA1c", "LDL_C", "HDL_C", "TG",
        "GGT", "ALT", "AST", "SBP", "DBP", "Waist",
        "Hip", "TC", "Cr", "UA", "HOMA_IR", "Fasting_Insulin",
        "CRP", "WBC", "RBC", "Hb", "Neutrophil", "Lymphocyte", "Platelet"
    ])
    target_name: str = "T2DM_5yr_Risk"

@dataclass
class ModelConfig:
    base_random_state: int = 42
    lr_c: float = 1.0
    lr_max_iter: int = 1000
    lr_class_weight: str = "balanced"
    rf_n_estimators: int = 300
    rf_max_depth: int = 12
    rf_min_samples_leaf: int = 5
    rf_class_weight: str = "balanced_subsample"
    gb_n_estimators: int = 200
    gb_max_depth: int = 5
    gb_learning_rate: float = 0.08
    gb_subsample: float = 0.8
    ada_n_estimators: int = 200
    ada_learning_rate: float = 0.1
    et_n_estimators: int = 300
    et_max_depth: int = 12
    et_min_samples_leaf: int = 5
    et_class_weight: str = "balanced"
    meta_c: float = 0.5
    meta_max_iter: int = 1000
    meta_solver: str = "lbfgs"
    select_k_best: int = 35

@dataclass
class ThresholdConfig:
    high_risk_threshold: float = 0.70
    moderate_risk_threshold: float = 0.40
    sensitivity_target: float = 0.85

SYSTEM_CONFIG = SystemConfig()
DATA_CONFIG = DataConfig()
MODEL_CONFIG = ModelConfig()
THRESHOLD_CONFIG = ThresholdConfig()
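
To show how the two cut-offs in ThresholdConfig might be consumed, here is a small sketch of a probability-to-label mapping. The `Thresholds` dataclass is a stand-in with the same defaults, and the label strings "low", "moderate", and "high" are assumptions; the source only shows that a `risk_level` field exists in the REST pseudocode.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    """Stand-in for ThresholdConfig with the same default cut-offs."""
    high_risk_threshold: float = 0.70
    moderate_risk_threshold: float = 0.40

def risk_level(probability, cfg=Thresholds()):
    """Map a predicted probability to a coarse risk label (labels are assumed)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= cfg.high_risk_threshold:
        return "high"
    if probability >= cfg.moderate_risk_threshold:
        return "moderate"
    return "low"

levels = [risk_level(p) for p in (0.12, 0.40, 0.69, 0.70)]
```

Both boundaries are inclusive here, matching the `>=` comparisons the evaluator uses for its risk ratios.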

4.2 data_loader.py

"""
Data loading and synthetic data generation module
Usage:
  from config import DATA_CONFIG
  from data_loader import ClinicalDataGenerator
  gen = ClinicalDataGenerator(DATA_CONFIG)
  df = gen.generate(save_path="data/raw_clinical.csv")
"""
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from typing import Optional
import os

class ClinicalDataGenerator:
    def __init__(self, config):
        self.cfg = config
        self.rng = np.random.RandomState(config.random_state)

    def generate(self, save_path: Optional[str] = None) -> pd.DataFrame:
        X, y = make_classification(
            n_samples=self.cfg.n_samples,
            n_features=self.cfg.n_features,
            n_informative=self.cfg.n_informative,
            n_redundant=self.cfg.n_redundant,
            n_classes=self.cfg.n_classes,
            weights=[1 - self.cfg.positive_ratio, self.cfg.positive_ratio],
            flip_y=self.cfg.flip_y,
            random_state=self.cfg.random_state,
            hypercube=False,
            shift=0.0,
            scale=1.0
        )
        X = self._apply_clinical_scaling(X)
        df = pd.DataFrame(X, columns=self.cfg.feature_names)
        df[self.cfg.target_name] = y
        df = self._inject_missing(df)
        df = self._inject_outliers(df)
        if save_path:
            out_dir = os.path.dirname(save_path)
            if out_dir:  # a bare filename has no directory component to create
                os.makedirs(out_dir, exist_ok=True)
            df.to_csv(save_path, index=False)
        return df

    def _apply_clinical_scaling(self, X: np.ndarray) -> np.ndarray:
        clinical_params = {
            "Age": (52.0, 12.0), "FPG": (5.6, 1.2), "BMI": (24.5, 3.8),
            "HbA1c": (5.7, 0.9), "LDL_C": (2.9, 0.8), "HDL_C": (1.3, 0.35),
            "TG": (1.6, 0.9), "GGT": (35.0, 20.0), "ALT": (28.0, 18.0),
            "AST": (26.0, 12.0), "SBP": (128.0, 16.0), "DBP": (80.0, 10.0),
            "Waist": (85.0, 10.0), "Hip": (95.0, 8.0), "TC": (4.9, 0.9),
            "Cr": (75.0, 15.0), "UA": (320.0, 80.0), "HOMA_IR": (2.8, 1.8),
            "Fasting_Insulin": (12.0, 6.0), "CRP": (2.5, 3.0),
            "WBC": (6.2, 1.5), "RBC": (4.5, 0.5), "Hb": (140.0, 15.0),
            "Neutrophil": (0.58, 0.08), "Lymphocyte": (0.30, 0.07),
            "Platelet": (220.0, 50.0)
        }
        X_scaled = np.zeros_like(X)
        for i, name in enumerate(self.cfg.feature_names):
            mu, sigma = clinical_params.get(name, (0.0, 1.0))
            X_scaled[:, i] = mu + sigma * X[:, i]
        return X_scaled

    def _inject_missing(self, df: pd.DataFrame) -> pd.DataFrame:
        df_out = df.copy()
        for col in self.cfg.feature_names:
            mask = self.rng.rand(len(df_out)) < self.cfg.missing_rate
            df_out.loc[mask, col] = np.nan
        return df_out

    def _inject_outliers(self, df: pd.DataFrame, n_outliers: int = 50) -> pd.DataFrame:
        df_out = df.copy()
        idx = self.rng.choice(df_out.index, size=n_outliers, replace=False)
        cols = self.rng.choice(self.cfg.feature_names, size=n_outliers)
        for i, col in zip(idx, cols):
            if self.rng.rand() < 0.5:
                df_out.loc[i, col] = df_out[col].mean() + 4.5 * df_out[col].std()
            else:
                df_out.loc[i, col] = df_out[col].mean() - 4.5 * df_out[col].std()
        return df_out
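
A per-cell missing rate of 2% sounds small, but across 26 columns it touches a large share of rows. The sketch below re-creates the independent masking of `_inject_missing` in plain Python (the `inject_missing` helper and the constant dummy data are illustrative, not the project's code) and compares the observed share of complete rows with the closed-form value (1 - rate)^26.

```python
import random

def inject_missing(rows, missing_rate, seed=42):
    """Blank out each cell independently with probability `missing_rate`."""
    rng = random.Random(seed)
    return [[None if rng.random() < missing_rate else v for v in row] for row in rows]

n_rows, n_cols, rate = 2000, 26, 0.02
data = [[1.0] * n_cols for _ in range(n_rows)]
masked = inject_missing(data, rate)

# Share of blanked cells; should be close to `rate`.
observed_rate = sum(v is None for row in masked for v in row) / (n_rows * n_cols)
# Probability that a row has no gap at all, and its empirical counterpart.
expected_complete = (1 - rate) ** n_cols
observed_complete = sum(all(v is not None for v in row) for row in masked) / n_rows
```

Since (1 - 0.02)^26 is about 0.59, roughly four rows in ten contain at least one gap, which is why the pipeline imputes rather than dropping incomplete rows.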

4.3 preprocessor.py

"""
Data preprocessing module
Usage:
  from preprocessor import ClinicalPreprocessor
  prep = ClinicalPreprocessor()
  X_train_clean = prep.fit_transform(X_train)
  X_test_clean = prep.transform(X_test)
"""
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, RobustScaler
from typing import Union, List

class ClinicalPreprocessor:
    def __init__(self, impute_strategy="median", outlier_lower=0.01,
                 outlier_upper=0.99, scaler_type="standard"):
        self.impute_strategy = impute_strategy
        self.outlier_lower = outlier_lower
        self.outlier_upper = outlier_upper
        self.scaler_type = scaler_type
        self.imputer = SimpleImputer(strategy=impute_strategy)
        self.scaler = StandardScaler() if scaler_type == "standard" else RobustScaler()
        self.feature_names = []
        self.clip_bounds_ = {}
        self.is_fitted = False

    def fit(self, X: Union[pd.DataFrame, np.ndarray], y=None):
        if isinstance(X, pd.DataFrame):
            self.feature_names = list(X.columns)
            X_arr = X.values
        else:
            X_arr = X
        # Impute first: np.percentile returns NaN for any column containing NaN,
        # which would turn the clip bounds (and everything clipped) into NaN.
        # This also matches the order used in transform(): impute -> clip -> scale.
        X_imputed = self.imputer.fit_transform(X_arr)
        self.clip_bounds_ = {
            i: (np.percentile(X_imputed[:, i], self.outlier_lower * 100),
                np.percentile(X_imputed[:, i], self.outlier_upper * 100))
            for i in range(X_imputed.shape[1])
        }
        X_clipped = self._clip_array(X_imputed)
        self.scaler.fit(X_clipped)
        self.is_fitted = True
        return self

    def transform(self, X: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
        if not self.is_fitted:
            raise RuntimeError("Preprocessor must be fitted before transform.")
        if isinstance(X, pd.DataFrame):
            X_arr = X.values
        else:
            X_arr = X.copy()
        X_imp = self.imputer.transform(X_arr)
        X_clip = self._clip_array(X_imp)
        X_scaled = self.scaler.transform(X_clip)
        return X_scaled

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

    def _clip_array(self, X: np.ndarray) -> np.ndarray:
        X_out = X.copy()
        for i, (low, high) in self.clip_bounds_.items():
            X_out[:, i] = np.clip(X_out[:, i], low, high)
        return X_out

    def get_feature_names(self):
        return self.feature_names
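
The impute → clip → scale order, with all statistics learned from training data only, can be illustrated on a single column using the standard library. `ColumnPreprocessor` is a hypothetical, simplified stand-in for ClinicalPreprocessor, and the FPG-like values are made up.

```python
import statistics

class ColumnPreprocessor:
    """Single-column sketch of impute -> clip -> scale, fitted on training data only."""

    def fit(self, values, lower_pct=1, upper_pct=99):
        observed = [v for v in values if v is not None]
        self.median_ = statistics.median(observed)
        filled = [self.median_ if v is None else v for v in values]
        # 99 cut points; index p-1 holds the p-th percentile
        cuts = statistics.quantiles(filled, n=100, method="inclusive")
        self.low_, self.high_ = cuts[lower_pct - 1], cuts[upper_pct - 1]
        clipped = [min(max(v, self.low_), self.high_) for v in filled]
        self.mean_ = statistics.fmean(clipped)
        self.std_ = statistics.pstdev(clipped)   # population std, as StandardScaler uses
        return self

    def transform(self, values):
        filled = [self.median_ if v is None else v for v in values]
        clipped = [min(max(v, self.low_), self.high_) for v in filled]
        return [(v - self.mean_) / self.std_ for v in clipped]

# Made-up values with one gap and one outlier.
train = [5.0, 5.4, None, 5.8, 6.1, 5.2, 30.0, 5.6]
prep = ColumnPreprocessor().fit(train)
train_z = prep.transform(train)
test_z = prep.transform([None, 100.0])   # unseen data reuses the training statistics
```

The unseen value 100.0 is clipped to the same upper bound as the training outlier, and the unseen gap is filled with the training median, so nothing about the test data influences the fitted statistics.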

4.4 feature_engineer.py

"""
Feature engineering module
Usage:
  from feature_engineer import ClinicalFeatureEngineer
  fe = ClinicalFeatureEngineer(select_k=35)
  X_new = fe.fit_transform(X_train, y_train)
"""
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from typing import List, Optional

class ClinicalFeatureEngineer:
    def __init__(self, poly_degree=2, poly_interaction_only=True,
                 select_k=35, include_bias=False):
        self.poly_degree = poly_degree
        self.poly_interaction_only = poly_interaction_only
        self.select_k = select_k
        self.include_bias = include_bias
        self.poly = PolynomialFeatures(
            degree=poly_degree,
            interaction_only=poly_interaction_only,
            include_bias=include_bias
        )
        self.selector = SelectKBest(score_func=f_classif, k=select_k)
        self.raw_feature_names = []
        self.poly_feature_names = []
        self.selected_feature_names = []
        self.is_fitted = False

    def fit(self, X, y, raw_names=None):
        self.raw_feature_names = raw_names or ["f"+str(i) for i in range(X.shape[1])]
        X_prior = self._build_prior_features(X)
        X_poly = self.poly.fit_transform(X)
        self.poly_feature_names = self.poly.get_feature_names_out(self.raw_feature_names).tolist()
        X_combined = np.hstack([X_prior, X_poly])
        combined_names = self._get_prior_names() + self.poly_feature_names
        self.selector.fit(X_combined, y)
        mask = self.selector.get_support()
        self.selected_feature_names = [name for name, m in zip(combined_names, mask) if m]
        self.is_fitted = True
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise RuntimeError("FeatureEngineer must be fitted before transform.")
        X_prior = self._build_prior_features(X)
        X_poly = self.poly.transform(X)
        X_combined = np.hstack([X_prior, X_poly])
        X_selected = self.selector.transform(X_combined)
        return X_selected

    def fit_transform(self, X, y, raw_names=None):
        self.fit(X, y, raw_names)
        return self.transform(X)

    def _build_prior_features(self, X):
        # Column indices follow the order of DataConfig.feature_names.
        # NOTE: per the interface in 1.4 this receives standardized input, so the
        # ratio denominators are z-scores rather than clinical units; the guards
        # below were likely chosen with raw units in mind.
        features = []
        features.append((X[:, 1] * X[:, 2]).reshape(-1, 1))  # FPG x BMI
        features.append((X[:, 0] * X[:, 1]).reshape(-1, 1))  # Age x FPG
        hdl_safe = np.where(X[:, 5] < 0.5, 0.5, X[:, 5])
        features.append((X[:, 4] / hdl_safe).reshape(-1, 1))  # LDL/HDL
        features.append((X[:, 6] / hdl_safe).reshape(-1, 1))  # TG/HDL
        features.append((X[:, 17] * X[:, 2]).reshape(-1, 1))  # HOMA_IR x BMI
        features.append((X[:, 12] / (X[:, 0] + 1.0)).reshape(-1, 1))  # Waist/Age
        features.append(((X[:, 10] + 2 * X[:, 11]) / 3.0).reshape(-1, 1))  # MAP
        features.append((X[:, 10] - X[:, 11]).reshape(-1, 1))  # Pulse Pressure
        features.append((X[:, 19] * X[:, 2]).reshape(-1, 1))  # CRP x BMI
        lymph_safe = np.where(X[:, 24] < 0.01, 0.01, X[:, 24])
        features.append((X[:, 23] / lymph_safe).reshape(-1, 1))  # NLR
        return np.hstack(features)

    def _get_prior_names(self):
        return [
            "PRIOR_FPG_x_BMI", "PRIOR_Age_x_FPG", "PRIOR_LDL_div_HDL",
            "PRIOR_TG_div_HDL", "PRIOR_HOMAIR_x_BMI", "PRIOR_Waist_div_Age",
            "PRIOR_MAP", "PRIOR_PulsePressure", "PRIOR_CRP_x_BMI", "PRIOR_NLR"
        ]

    def get_selected_names(self):
        return self.selected_feature_names
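
It helps to know how large the candidate pool is before SelectKBest trims it to 35. The arithmetic below assumes the defaults shown above (26 raw features, 10 prior features, degree-2 interaction-only expansion without bias); the variable names are only for illustration.

```python
from math import comb

n_raw = 26        # raw clinical features
n_prior = 10      # hand-built ratio/product features
select_k = 35     # SelectKBest budget

# degree=2, interaction_only=True, include_bias=False keeps the raw columns
# and adds one product per unordered pair of distinct columns.
n_poly = n_raw + comb(n_raw, 2)
n_candidates = n_prior + n_poly
kept_fraction = select_k / n_candidates
```

So roughly one candidate in ten survives selection. Note also that four of the prior features (FPG×BMI, Age×FPG, HOMA_IR×BMI, CRP×BMI) are plain pairwise products that the interaction expansion generates as well, so the pool contains exact duplicate columns.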

4.5 base_learners.py

"""
Base learner cluster module
Usage:
  from base_learners import BaseLearnerCluster
  cluster = BaseLearnerCluster(MODEL_CONFIG)
  cluster.fit(X_train, y_train)
  probs = cluster.predict_proba_base(X_test)
"""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from typing import Dict, List

class BaseLearnerCluster:
    def __init__(self, config):
        self.cfg = config
        self.learners = {}
        self._build_learners()

    def _build_learners(self):
        cfg = self.cfg
        self.learners["LogisticRegression"] = LogisticRegression(
            C=cfg.lr_c, max_iter=cfg.lr_max_iter, class_weight=cfg.lr_class_weight,
            random_state=cfg.base_random_state, n_jobs=-1, solver="lbfgs"
        )
        self.learners["RandomForest"] = RandomForestClassifier(
            n_estimators=cfg.rf_n_estimators, max_depth=cfg.rf_max_depth,
            min_samples_leaf=cfg.rf_min_samples_leaf, class_weight=cfg.rf_class_weight,
            random_state=cfg.base_random_state, n_jobs=-1
        )
        self.learners["GradientBoosting"] = GradientBoostingClassifier(
            n_estimators=cfg.gb_n_estimators, max_depth=cfg.gb_max_depth,
            learning_rate=cfg.gb_learning_rate, subsample=cfg.gb_subsample,
            random_state=cfg.base_random_state
        )
        self.learners["AdaBoost"] = AdaBoostClassifier(
            n_estimators=cfg.ada_n_estimators, learning_rate=cfg.ada_learning_rate,
            random_state=cfg.base_random_state
        )
        self.learners["ExtraTrees"] = ExtraTreesClassifier(
            n_estimators=cfg.et_n_estimators, max_depth=cfg.et_max_depth,
            min_samples_leaf=cfg.et_min_samples_leaf, class_weight=cfg.et_class_weight,
            random_state=cfg.base_random_state, n_jobs=-1
        )

    def fit(self, X, y):
        for name, model in self.learners.items():
            model.fit(X, y)
        return self

    def predict_proba_base(self, X):
        probs = {}
        for name, model in self.learners.items():
            proba = model.predict_proba(X)[:, 1]
            probs[name] = proba
        return probs

    def predict_base(self, X):
        preds = {}
        for name, model in self.learners.items():
            preds[name] = model.predict(X)
        return preds

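    def get_oob_importances(self):
        # Despite the name, this collects the fitted feature_importances_
        # attribute of every learner that exposes one, not out-of-bag estimates.
        importances = {}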
        for name, model in self.learners.items():
            if hasattr(model, "feature_importances_"):
                importances[name] = model.feature_importances_
        return importances

    def get_learner(self, name):
        return self.learners.get(name)

    def names(self):
        return list(self.learners.keys())

4.6 meta_learner.py

"""
Meta-learner module
Usage:
  from meta_learner import StackingMetaLearner
  meta = StackingMetaLearner(MODEL_CONFIG)
  meta.fit(Z_train, y_train)
  y_pred = meta.predict_proba(Z_test)
"""
import numpy as np
from sklearn.linear_model import LogisticRegression
from typing import Dict, List

class StackingMetaLearner:
    def __init__(self, config):
        self.cfg = config
        self.model = LogisticRegression(
            C=config.meta_c, max_iter=config.meta_max_iter,
            solver=config.meta_solver, class_weight="balanced",
            random_state=getattr(config, "base_random_state", 42)
        )
        self.is_fitted = False
        self.meta_feature_names = []

    def fit(self, Z, y, feature_names=None):
        self.model.fit(Z, y)
        self.meta_feature_names = feature_names or ["base_"+str(i) for i in range(Z.shape[1])]
        self.is_fitted = True
        return self

    def predict_proba(self, Z):
        if not self.is_fitted:
            raise RuntimeError("MetaLearner must be fitted first.")
        return self.model.predict_proba(Z)[:, 1]

    def predict(self, Z):
        return self.model.predict(Z)

    def get_meta_weights(self):
        if not self.is_fitted:
            return {}
        coef = self.model.coef_[0]
        exp_coef = np.exp(coef - np.max(coef))
        weights = exp_coef / np.sum(exp_coef)
        return {name: float(w) for name, w in zip(self.meta_feature_names, weights)}

    def get_intercept(self):
        return float(self.model.intercept_[0])
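
The weights returned by get_meta_weights are a softmax over the logistic-regression coefficients. The standalone sketch below reproduces that transform with the standard library; `softmax_weights` is an illustrative helper and the coefficient values are made up.

```python
import math

def softmax_weights(coefs):
    """Numerically stable softmax, the same transform get_meta_weights applies."""
    shift = max(coefs)                       # subtracting the max avoids overflow
    exps = [math.exp(c - shift) for c in coefs]
    total = sum(exps)
    return [e / total for e in exps]

coefs = [2.0, 1.0, 0.0, -1.0]                # made-up meta-learner coefficients
weights = softmax_weights(coefs)
shifted = softmax_weights([c + 5.0 for c in coefs])
```

Two properties follow from the formula: the weights always sum to 1, and even a base learner with a negative coefficient receives a positive weight. The weights are therefore a normalized display of relative influence, not the coefficients the logistic model actually applies to the base probabilities.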

4.7 trainer.py

"""
Training controller module
Usage:
  from trainer import StackingTrainer
  trainer = StackingTrainer(MODEL_CONFIG, base_cluster, meta_learner)
  ensemble = trainer.fit(X_train, y_train)
  y_pred = ensemble.predict_proba(X_test)
"""
import numpy as np
import pickle
import os
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from typing import Dict

class TrainedEnsemble:
    def __init__(self, base_learners, meta_learner, base_names):
        self.base_learners = base_learners
        self.meta_learner = meta_learner
        self.base_names = base_names

    def predict_proba(self, X):
        Z = np.zeros((X.shape[0], len(self.base_names)))
        for i, name in enumerate(self.base_names):
            Z[:, i] = self.base_learners[name].predict_proba(X)[:, 1]
        return self.meta_learner.predict_proba(Z)

    def predict(self, X):
        proba = self.predict_proba(X)
        return (proba >= 0.5).astype(int)

    def get_base_probas(self, X):
        return {name: self.base_learners[name].predict_proba(X)[:, 1]
                for name in self.base_names}

class StackingTrainer:
    def __init__(self, config, base_cluster, meta_learner, n_folds=5):
        self.cfg = config
        self.base_cluster = base_cluster
        self.meta_learner = meta_learner
        self.n_folds = n_folds
        self.cv = StratifiedKFold(n_splits=n_folds, shuffle=True,
                                  random_state=getattr(config, "base_random_state", 42))

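    def fit(self, X, y):
        # Positional indexing below needs plain arrays; a pandas Series with a
        # shuffled index (e.g. after train_test_split) would be indexed by label.
        X = np.asarray(X)
        y = np.asarray(y)
        n_samples = X.shape[0]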
        n_base = len(self.base_cluster.names())
        Z = np.zeros((n_samples, n_base))

        for fold_idx, (train_idx, val_idx) in enumerate(self.cv.split(X, y)):
            X_train_fold, X_val_fold = X[train_idx], X[val_idx]
            y_train_fold = y[train_idx]

            for j, name in enumerate(self.base_cluster.names()):
                model = self.base_cluster.get_learner(name)
                model_clone = clone(model)
                model_clone.fit(X_train_fold, y_train_fold)
                Z[val_idx, j] = model_clone.predict_proba(X_val_fold)[:, 1]

        self.meta_learner.fit(Z, y, feature_names=self.base_cluster.names())
        self.base_cluster.fit(X, y)

        ensemble = TrainedEnsemble(
            base_learners={name: self.base_cluster.get_learner(name)
                          for name in self.base_cluster.names()},
            meta_learner=self.meta_learner,
            base_names=self.base_cluster.names()
        )
        return ensemble

    def save_ensemble(self, ensemble, path):
        out_dir = os.path.dirname(path)
        if out_dir:  # a bare filename has no directory component to create
            os.makedirs(out_dir, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(ensemble, f)

    def load_ensemble(self, path):
        with open(path, "rb") as f:
            return pickle.load(f)

4.8 evaluator.py

"""
Model evaluation module
Usage:
  from evaluator import ModelEvaluator
  ev = ModelEvaluator(THRESHOLD_CONFIG)
  report = ev.evaluate(y_true, y_pred_proba, model_name="Stacking")
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (roc_auc_score, average_precision_score, f1_score,
                             cohen_kappa_score, accuracy_score, confusion_matrix,
                             roc_curve, precision_recall_curve, brier_score_loss)
from typing import Dict, Optional

class ModelEvaluator:
    def __init__(self, threshold_config):
        self.thresh_cfg = threshold_config

    def evaluate(self, y_true, y_proba, model_name="Model"):
        y_pred_default = (y_proba >= 0.5).astype(int)
        thresholds = np.arange(0.05, 1.0, 0.01)
        f1s = [f1_score(y_true, (y_proba >= t).astype(int)) for t in thresholds]
        best_thresh = thresholds[np.argmax(f1s)]

        fpr, tpr, thresh_roc = roc_curve(y_true, y_proba)
        specificity = 1 - fpr
        idx_sp90 = np.where(specificity >= 0.90)[0]
        sens_at_sp90 = tpr[idx_sp90[-1]] if len(idx_sp90) > 0 else 0.0

        report = {
            "Model": model_name,
            "AUROC": round(roc_auc_score(y_true, y_proba), 4),
            "AUPRC": round(average_precision_score(y_true, y_proba), 4),
            "Accuracy_0.5": round(accuracy_score(y_true, y_pred_default), 4),
            "F1_0.5": round(f1_score(y_true, y_pred_default), 4),
            "F1_Optimal": round(np.max(f1s), 4),
            "Optimal_Threshold": round(best_thresh, 3),
            "Cohen_Kappa": round(cohen_kappa_score(y_true, y_pred_default), 4),
            "Sensitivity_at_90_Specificity": round(sens_at_sp90, 4),
            "Brier_Score": round(brier_score_loss(y_true, y_proba), 4),
            "High_Risk_Ratio": round(np.mean(y_proba >= self.thresh_cfg.high_risk_threshold), 4),
            "Moderate_Risk_Ratio": round(np.mean(y_proba >= self.thresh_cfg.moderate_risk_threshold), 4)
        }
        return report

    def plot_roc_comparison(self, y_true, probas_dict, save_path=None):
        plt.figure(figsize=(8, 7))
        colors = plt.cm.tab10(np.linspace(0, 1, len(probas_dict)))
        for (name, proba), color in zip(probas_dict.items(), colors):
            fpr, tpr, _ = roc_curve(y_true, proba)
            auc = roc_auc_score(y_true, proba)
            plt.plot(fpr, tpr, color=color, lw=2, label=f"{name} (AUC = {auc:.3f})")
        plt.plot([0, 1], [0, 1], "k--", lw=1, alpha=0.5)
        plt.xlim([0.0, 1.0]); plt.ylim([0.0, 1.05])
        plt.xlabel("False Positive Rate (1 - Specificity)", fontsize=12)
        plt.ylabel("True Positive Rate (Sensitivity)", fontsize=12)
        plt.title("ROC Curve Comparison: Base Learners vs Stacking Ensemble", fontsize=14)
        plt.legend(loc="lower right", fontsize=10)
        plt.grid(alpha=0.3)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_pr_comparison(self, y_true, probas_dict, save_path=None):
        plt.figure(figsize=(8, 7))
        colors = plt.cm.tab10(np.linspace(0, 1, len(probas_dict)))
        for (name, proba), color in zip(probas_dict.items(), colors):
            precision, recall, _ = precision_recall_curve(y_true, proba)
            auprc = average_precision_score(y_true, proba)
            plt.plot(recall, precision, color=color, lw=2, label=f"{name} (AUPRC = {auprc:.3f})")
        baseline = np.mean(y_true)
        plt.axhline(baseline, color="gray", linestyle="--", alpha=0.7, label=f"Baseline (Prevalence = {baseline:.3f})")
        plt.xlabel("Recall (Sensitivity)", fontsize=12)
        plt.ylabel("Precision (PPV)", fontsize=12)
        plt.title("Precision-Recall Curve Comparison", fontsize=14)
        plt.legend(loc="lower left", fontsize=10)
        plt.grid(alpha=0.3)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_confusion_matrix(self, y_true, y_proba, threshold=0.5, save_path=None):
        y_pred = (y_proba >= threshold).astype(int)
        cm = confusion_matrix(y_true, y_pred, normalize="true")
        plt.figure(figsize=(6, 5))
        sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues",
                    xticklabels=["Low Risk", "High Risk"],
                    yticklabels=["Low Risk", "High Risk"],
                    cbar_kws={"label": "Proportion"})
        plt.xlabel("Predicted Label", fontsize=12)
        plt.ylabel("True Label", fontsize=12)
        plt.title(f"Normalized Confusion Matrix (threshold={threshold})", fontsize=13)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_calibration(self, y_true, probas_dict, n_bins=10, save_path=None):
        plt.figure(figsize=(8, 6))
        for name, proba in probas_dict.items():
            bin_boundaries = np.linspace(0, 1, n_bins + 1)
            bin_lowers = bin_boundaries[:-1]
            bin_uppers = bin_boundaries[1:]
            bin_centers = (bin_lowers + bin_uppers) / 2
            # NaN marks empty bins so they are skipped instead of being drawn at 0.
            bin_accuracies = np.full(n_bins, np.nan)
            for i in range(n_bins):
                in_bin = (proba > bin_lowers[i]) & (proba <= bin_uppers[i])
                if i == 0:
                    in_bin |= (proba == 0.0)  # include exact zeros in the first bin
                if in_bin.any():
                    bin_accuracies[i] = np.mean(y_true[in_bin])
            plt.plot(bin_centers, bin_accuracies, "o-", label=name, markersize=6)
        plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
        plt.xlim([0.0, 1.0]); plt.ylim([0.0, 1.0])
        plt.xlabel("Mean Predicted Probability", fontsize=12)
        plt.ylabel("Fraction of Positives", fontsize=12)
        plt.title("Calibration Plot (Reliability Diagram)", fontsize=14)
        plt.legend(fontsize=10)
        plt.grid(alpha=0.3)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_risk_distribution(self, y_true, y_proba, save_path=None):
        plt.figure(figsize=(8, 5))
        plt.hist(y_proba[y_true == 0], bins=40, alpha=0.6, label="True Negative", color="steelblue", edgecolor="white")
        plt.hist(y_proba[y_true == 1], bins=40, alpha=0.6, label="True Positive", color="crimson", edgecolor="white")
        plt.axvline(self.thresh_cfg.moderate_risk_threshold, color="orange", linestyle="--", label=f"Moderate Risk ({self.thresh_cfg.moderate_risk_threshold})")
        plt.axvline(self.thresh_cfg.high_risk_threshold, color="red", linestyle="--", label=f"High Risk ({self.thresh_cfg.high_risk_threshold})")
        plt.xlabel("Predicted T2DM Risk Probability", fontsize=12)
        plt.ylabel("Count", fontsize=12)
        plt.title("Risk Score Distribution by True Outcome", fontsize=14)
        plt.legend(fontsize=10)
        plt.grid(alpha=0.3, axis="y")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()
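
The reliability diagram above compares, bin by bin, the observed positive rate with the predicted probability. A minimal sketch of how the same binning can be reduced to two summary numbers, the expected calibration error (ECE) and the Brier score, is given below. The function name `calibration_summary`, the 10 equal-width bins, and the toy data are assumptions for illustration and are not part of the evaluator module.

```python
import numpy as np

def calibration_summary(y_true, y_proba, n_bins=10):
    """Return (ece, brier) for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_proba = np.asarray(y_proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(y_proba, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[in_bin].mean() - y_proba[in_bin].mean())
        ece += in_bin.mean() * gap  # weight the gap by the bin's share of samples
    brier = float(np.mean((y_proba - y_true) ** 2))
    return float(ece), brier

# Hypothetical toy data: four patients, each falling into a different bin.
y = np.array([0, 0, 1, 1])
p = np.array([0.15, 0.35, 0.65, 0.85])
ece, brier = calibration_summary(y, p)
```

Unlike the plot, these two values can be compared directly against a numeric target such as the Brier score limit listed among the test criteria.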

4.9 explainer.py

python
"""
Model explanation module (simplified, SHAP-like implementation).
Usage:
  from explainer import ClinicalExplainer
  explainer = ClinicalExplainer(ensemble, feature_names, preprocessor)
  explainer.fit_global(X_test, y_test)
  explainer.plot_global_summary(save_path="figs/importance.png")
"""
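import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from typing import List, Dict, Optional

class ClinicalExplainer:
    def __init__(self, ensemble, raw_feature_names, preprocessor=None, feature_engineer=None):
        self.ensemble = ensemble
        self.raw_names = raw_feature_names
        self.preprocessor = preprocessor
        self.feature_engineer = feature_engineer
        self.base_importances = {}
        self.meta_weights = {}

    def _to_model_space(self, X):
        # Base learners score engineered features; map scaled raw features
        # into that space when a fitted feature engineer was supplied.
        if self.feature_engineer is not None:
            return self.feature_engineer.transform(X)
        return X

    def fit_global(self, X, y, n_repeats=10, random_state=42):
        # X has one column per entry in raw_feature_names. Columns are shuffled by
        # hand instead of via sklearn's permutation_importance, so the shuffle
        # happens on raw features before they are mapped into model space.
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(random_state)
        meta_weights = self.ensemble.meta_learner.get_meta_weights()
        weighted_imp = np.zeros(len(self.raw_names))
        total_weight = 0.0

        for name, model in self.ensemble.base_learners.items():
            baseline = roc_auc_score(y, model.predict_proba(self._to_model_space(X))[:, 1])
            imp = np.zeros(X.shape[1])
            for j in range(X.shape[1]):
                for _ in range(n_repeats):
                    X_perm = X.copy()
                    X_perm[:, j] = rng.permutation(X_perm[:, j])
                    proba = model.predict_proba(self._to_model_space(X_perm))[:, 1]
                    # Importance = mean drop in AUROC after shuffling column j.
                    imp[j] += (baseline - roc_auc_score(y, proba)) / n_repeats
            weight = meta_weights.get(name, 1.0 / len(meta_weights))
            weighted_imp += weight * imp
            total_weight += weight
            self.base_importances[name] = imp

        self.global_importance = weighted_imp / (total_weight + 1e-9)
        self.meta_weights = meta_weights
        return self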

    def plot_global_summary(self, save_path=None):
        if not hasattr(self, "global_importance"):
            raise RuntimeError("Must call fit_global before plotting.")
        imp_df = pd.DataFrame({
            "Feature": self.raw_names,
            "Importance": self.global_importance
        }).sort_values("Importance", ascending=True)
        color_map = self._feature_domain_colors()
        colors = [color_map.get(f, "gray") for f in imp_df["Feature"]]
        plt.figure(figsize=(8, 10))
        plt.barh(imp_df["Feature"], imp_df["Importance"], color=colors, edgecolor="white")
        plt.xlabel("Weighted Permutation Importance", fontsize=12)
        plt.title("Global Feature Importance (Meta-Weighted)", fontsize=14)
        plt.grid(alpha=0.3, axis="x")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

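    def local_waterfall(self, x, save_path=None):
        if not hasattr(self, "global_importance"):
            raise RuntimeError("Must call fit_global before local_waterfall.")
        x = np.asarray(x, dtype=float)
        # x is a scaled raw-feature vector; the ensemble scores engineered features.
        x_model = x.reshape(1, -1)
        if self.feature_engineer is not None:
            x_model = self.feature_engineer.transform(x_model)
        final_proba = float(self.ensemble.predict_proba(x_model)[0])
        # Heuristic attribution: global importance times the feature's z-score.
        # These are not Shapley values and need not sum to final_proba.
        z_scores = x
        shap_like = self.global_importance * z_scores
        base_value = 0.35  # fixed illustrative baseline, not estimated from data
        order = np.argsort(np.abs(shap_like))[::-1][:12]
        top_features = [self.raw_names[i] for i in order]
        top_shaps = shap_like[order]
        cumulative = [base_value]
        for val in top_shaps:
            cumulative.append(cumulative[-1] + val)
        cumulative = np.array(cumulative)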
        fig, ax = plt.subplots(figsize=(10, 7))
        for i in range(len(top_shaps)):
            val = top_shaps[i]
            color = "#d62728" if val > 0 else "#1f77b4"
            ax.barh(i, val, left=cumulative[i], color=color, edgecolor="white", height=0.6)
            ax.text(cumulative[i] + val/2, i, f"{val:+.3f}",
                    ha="center", va="center", color="white", fontsize=9, weight="bold")
        ax.set_yticks(range(len(top_shaps)))
        ax.set_yticklabels(top_features, fontsize=11)
        ax.invert_yaxis()
        ax.axvline(base_value, color="black", linestyle="--", alpha=0.5)
        ax.set_xlabel("Contribution to Risk Probability", fontsize=12)
        ax.set_title(f"Local Explanation (Final Risk = {final_proba:.3f})", fontsize=14)
        ax.grid(alpha=0.3, axis="x")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def _feature_domain_colors(self):
        return {
            "Age": "#8c564b", "FPG": "#e377c2", "BMI": "#e377c2", "HbA1c": "#e377c2",
            "LDL_C": "#ff7f0e", "HDL_C": "#ff7f0e", "TG": "#ff7f0e", "TC": "#ff7f0e",
            "GGT": "#2ca02c", "ALT": "#2ca02c", "AST": "#2ca02c",
            "SBP": "#d62728", "DBP": "#d62728", "Waist": "#9467bd", "Hip": "#9467bd",
            "Cr": "#7f7f7f", "UA": "#7f7f7f",
            "HOMA_IR": "#bcbd22", "Fasting_Insulin": "#bcbd22",
            "CRP": "#17becf", "WBC": "#17becf", "RBC": "#17becf", "Hb": "#17becf",
            "Neutrophil": "#17becf", "Lymphocyte": "#17becf", "Platelet": "#17becf"
        }

    def get_meta_weight_table(self):
        return pd.DataFrame({
            "Base_Learner": list(self.meta_weights.keys()),
            "Meta_Weight": list(self.meta_weights.values())
        }).sort_values("Meta_Weight", ascending=False)
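
The global importance in `fit_global` rests on permutation importance: shuffle one feature column, re-score the model, and record how much the metric drops. A minimal NumPy-only sketch of that idea follows. The helper `permutation_importance_sketch`, the accuracy metric, and the toy predictor are hypothetical stand-ins chosen for illustration; they are not the project's base learners.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` when each column of X is shuffled on its own."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops.append(baseline - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

def accuracy(y_true, y_score):
    """Accuracy at a 0.5 cut-off on the predicted score."""
    return float(np.mean((y_score >= 0.5) == (y_true == 1)))

def toy_predict(X):
    """Hypothetical model that only looks at column 0."""
    return 1.0 / (1.0 + np.exp(-5.0 * X[:, 0]))

# Toy data: the label depends only on column 0, column 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)

imp = permutation_importance_sketch(toy_predict, X, y, accuracy)
```

Shuffling the informative column costs the toy model a large share of its accuracy, while shuffling the noise column changes nothing, which is the contrast the bar chart in `plot_global_summary` is meant to surface.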

4.10 visualizer.py

python
"""
System-level visualization module.
Usage:
  from visualizer import SystemVisualizer
  viz = SystemVisualizer()
  viz.plot_feature_distributions(df, target_col="T2DM_5yr_Risk", save_path="figs/feature_distributions.png")
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc
from typing import List, Optional, Dict

class SystemVisualizer:
    def __init__(self, style="seaborn-v0_8-whitegrid"):
        try:
            plt.style.use(style)
        except OSError:
            # Older matplotlib releases only know the pre-rename seaborn style name.
            plt.style.use("seaborn-whitegrid")
        self.colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

    def plot_feature_distributions(self, df, target_col, features=None, n_cols=4, save_path=None):
        feats = features or df.columns.drop(target_col).tolist()[:12]
        n_rows = int(np.ceil(len(feats) / n_cols))
        # squeeze=False keeps axes a 2-D array even for a single subplot.
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols*4, n_rows*3), squeeze=False)
        axes = axes.flatten()
        df_neg = df[df[target_col] == 0]
        df_pos = df[df[target_col] == 1]
        for idx, feat in enumerate(feats):
            ax = axes[idx]
            ax.hist(df_neg[feat].dropna(), bins=30, alpha=0.5, label="Low Risk",
                    color="steelblue", density=True, edgecolor="white")
            ax.hist(df_pos[feat].dropna(), bins=30, alpha=0.5, label="High Risk",
                    color="crimson", density=True, edgecolor="white")
            ax.set_title(feat, fontsize=11)
            ax.set_xlabel("")
            ax.set_ylabel("Density")
            if idx == 0:
                ax.legend(fontsize=8)
            ax.grid(alpha=0.3, axis="y")
        for idx in range(len(feats), len(axes)):
            axes[idx].axis("off")
        fig.suptitle("Feature Distributions by T2DM Risk Status", fontsize=16, y=1.02)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_correlation_heatmap(self, df, features=None, save_path=None):
        feats = features or df.columns.drop("T2DM_5yr_Risk", errors="ignore").tolist()
        corr = df[feats].corr(method="pearson")
        plt.figure(figsize=(12, 10))
        mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
        sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
                    center=0, square=True, linewidths=0.5,
                    cbar_kws={"shrink": 0.8, "label": "Pearson r"},
                    annot_kws={"size": 8})
        plt.title("Clinical Feature Correlation Matrix", fontsize=14)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_base_learner_diversity(self, y_proba_dict, save_path=None):
        names = list(y_proba_dict.keys())
        n = len(names)
        # squeeze=False keeps axes[i, j] valid when only one learner is passed.
        fig, axes = plt.subplots(n, n, figsize=(n*3, n*3), squeeze=False)
        for i, name_i in enumerate(names):
            for j, name_j in enumerate(names):
                ax = axes[i, j]
                if i == j:
                    ax.hist(y_proba_dict[name_i], bins=30, color=self.colors[i % len(self.colors)],
                            edgecolor="white", alpha=0.7)
                    ax.set_title(name_i, fontsize=10)
                else:
                    ax.scatter(y_proba_dict[name_j], y_proba_dict[name_i], alpha=0.3, s=8, color="black")
                    r = np.corrcoef(y_proba_dict[name_j], y_proba_dict[name_i])[0, 1]
                    ax.text(0.05, 0.95, f"r={r:.2f}", transform=ax.transAxes,
                            fontsize=9, verticalalignment="top",
                            bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.5))
                ax.set_xlim([0, 1]); ax.set_ylim([0, 1])
                ax.grid(alpha=0.3)
        fig.suptitle("Base Learner Diversity Matrix (Predicted Probabilities)", fontsize=16)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_meta_weights(self, weights, save_path=None):
        labels = list(weights.keys())
        sizes = list(weights.values())
        fig, ax = plt.subplots(figsize=(7, 7))
        wedges, texts, autotexts = ax.pie(
            sizes, labels=labels, autopct="%1.1f%%", startangle=90,
            colors=self.colors, textprops={"fontsize": 10},
            wedgeprops={"edgecolor": "white", "linewidth": 2}
        )
        for autotext in autotexts:
            autotext.set_color("white")
            autotext.set_weight("bold")
        ax.set_title("Meta-Learner Weight Allocation\n(How Much the Director Trusts Each Expert)", fontsize=13)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_dashboard(self, df, y_true, y_proba, y_proba_dict, weights, save_path=None):
        fig = plt.figure(figsize=(18, 12))
        gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)

        ax1 = fig.add_subplot(gs[0, 0])
        ax1.hist(y_proba[y_true==0], bins=30, alpha=0.6, label="Neg", color="steelblue", edgecolor="white")
        ax1.hist(y_proba[y_true==1], bins=30, alpha=0.6, label="Pos", color="crimson", edgecolor="white")
        ax1.axvline(0.5, color="black", linestyle="--")
        ax1.set_title("A. Risk Score Distribution")
        ax1.legend()
        ax1.grid(alpha=0.3, axis="y")

        ax2 = fig.add_subplot(gs[0, 1])
        fpr, tpr, _ = roc_curve(y_true, y_proba)
        roc_auc = auc(fpr, tpr)
        ax2.plot(fpr, tpr, lw=2, label=f"Stacking AUC={roc_auc:.3f}")
        ax2.plot([0,1], [0,1], "k--", alpha=0.5)
        ax2.set_title("B. ROC Curve (Stacking)")
        ax2.set_xlabel("FPR"); ax2.set_ylabel("TPR")
        ax2.legend(); ax2.grid(alpha=0.3)

        ax3 = fig.add_subplot(gs[0, 2])
        cm = confusion_matrix(y_true, (y_proba>=0.5).astype(int), normalize="true")
        sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues", ax=ax3,
                    xticklabels=["Neg", "Pos"], yticklabels=["Neg", "Pos"], cbar=False)
        ax3.set_title("C. Confusion Matrix")

        ax4 = fig.add_subplot(gs[1, 0])
        names_w = list(weights.keys()); vals_w = list(weights.values())
        ax4.barh(names_w, vals_w, color=self.colors[:len(names_w)], edgecolor="white")
        ax4.set_title("D. Meta-Learner Weights")
        ax4.grid(alpha=0.3, axis="x")

        ax5 = fig.add_subplot(gs[1, 1])
        aucs = []
        for name, proba in y_proba_dict.items():
            aucs.append(auc(*roc_curve(y_true, proba)[:2]))
        ax5.bar(list(y_proba_dict.keys()), aucs, color=self.colors[:len(aucs)], edgecolor="white")
        ax5.axhline(0.5, color="gray", linestyle="--")
        ax5.set_ylim([0.4, 1.0])
        ax5.set_title("E. Base Learner AUROC")
        ax5.tick_params(axis="x", rotation=15)
        ax5.grid(alpha=0.3, axis="y")

        ax6 = fig.add_subplot(gs[1, 2])
        bin_edges = np.linspace(0, 1, 11)
        bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
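        bin_accs = []
        for i in range(len(bin_edges)-1):
            # Bins are (lower, upper]; the first bin also takes y_proba == 0.
            lower_ok = (y_proba >= bin_edges[i]) if i == 0 else (y_proba > bin_edges[i])
            mask = lower_ok & (y_proba <= bin_edges[i+1])
            # Empty bins become NaN so they are skipped instead of plotted at 0.
            bin_accs.append(y_true[mask].mean() if mask.sum() > 0 else np.nan)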
        ax6.plot(bin_centers, bin_accs, "o-", color="darkgreen", markersize=8, label="Stacking")
        ax6.plot([0,1], [0,1], "k--", alpha=0.5, label="Ideal")
        ax6.set_title("F. Calibration Curve")
        ax6.set_xlabel("Predicted"); ax6.set_ylabel("Observed")
        ax6.legend(); ax6.grid(alpha=0.3)

        fig.suptitle("T2DM Early Warning System -- Executive Dashboard", fontsize=18, y=0.98)
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()
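
The diversity matrix above shows pairwise correlations visually. The same check can be expressed numerically against the 0.90 correlation limit that the test criteria in `main.py` list for base learners. The sketch below is one way to do so; the helper names `pairwise_correlations` and `too_similar` and the toy predictions are assumptions for illustration only.

```python
from itertools import combinations

import numpy as np

def pairwise_correlations(y_proba_dict):
    """Pearson r between every pair of base-learner probability vectors."""
    out = {}
    for a, b in combinations(y_proba_dict, 2):
        out[(a, b)] = float(np.corrcoef(y_proba_dict[a], y_proba_dict[b])[0, 1])
    return out

def too_similar(y_proba_dict, max_r=0.90):
    """Return the learner pairs whose correlation reaches max_r."""
    corr = pairwise_correlations(y_proba_dict)
    return [pair for pair, r in corr.items() if r >= max_r]

# Hypothetical predictions from three learners on four patients.
probas = {
    "lr": np.array([0.1, 0.4, 0.6, 0.9]),
    "rf": np.array([0.2, 0.5, 0.7, 1.0]),   # "lr" shifted by 0.1, so r is 1
    "ada": np.array([0.9, 0.1, 0.8, 0.2]),  # negatively correlated with both
}
flagged = too_similar(probas)
```

A non-empty `flagged` list suggests that two learners contribute nearly the same signal, so one of them adds little to the ensemble.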

4.11 pipeline.py

python
"""
Main pipeline controller module.
Usage:
  from pipeline import T2DMPipeline
  pipe = T2DMPipeline()
  results = pipe.run_full_pipeline(save_prefix="run_001")
"""
import os
import json
import numpy as np
import pandas as pd
from typing import Dict, Any

from config import SYSTEM_CONFIG, DATA_CONFIG, MODEL_CONFIG, THRESHOLD_CONFIG
from data_loader import ClinicalDataGenerator
from preprocessor import ClinicalPreprocessor
from feature_engineer import ClinicalFeatureEngineer
from base_learners import BaseLearnerCluster
from meta_learner import StackingMetaLearner
from trainer import StackingTrainer, TrainedEnsemble
from evaluator import ModelEvaluator
from explainer import ClinicalExplainer
from visualizer import SystemVisualizer
from sklearn.model_selection import train_test_split

class T2DMPipeline:
    def __init__(self):
        self.sys_cfg = SYSTEM_CONFIG
        self.data_cfg = DATA_CONFIG
        self.model_cfg = MODEL_CONFIG
        self.thresh_cfg = THRESHOLD_CONFIG

        self.generator = ClinicalDataGenerator(self.data_cfg)
        self.preprocessor = ClinicalPreprocessor()
        self.feature_engineer = ClinicalFeatureEngineer(select_k=self.model_cfg.select_k_best)
        self.base_cluster = BaseLearnerCluster(self.model_cfg)
        self.meta_learner = StackingMetaLearner(self.model_cfg)
        self.trainer = StackingTrainer(self.model_cfg, self.base_cluster, self.meta_learner, n_folds=5)
        self.evaluator = ModelEvaluator(self.thresh_cfg)
        self.visualizer = SystemVisualizer()

        self.df_raw = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.ensemble = None
        self.y_proba_test = None
        self.y_proba_base_test = {}
        self.explainer = None

    def run_full_pipeline(self, save_prefix="default_run"):
        print("="*60)
        print("T2DM Early Warning System -- Full Pipeline Execution")
        print("="*60)

        print("[Step 1/7] Generating synthetic clinical dataset...")
        self.df_raw = self.generator.generate()
        pos_rate = self.df_raw[self.data_cfg.target_name].mean()
        print(f"   -> Generated {len(self.df_raw)} samples, positive rate = {pos_rate:.3f}")

        print("[Step 2/7] Stratified train/test split...")
        X = self.df_raw.drop(columns=[self.data_cfg.target_name])
        y = self.df_raw[self.data_cfg.target_name].values
        X_train_raw, X_test_raw, y_train, y_test = train_test_split(
            X, y, test_size=self.sys_cfg.test_size,
            random_state=self.sys_cfg.random_state, stratify=y
        )
        self.y_train = y_train
        self.y_test = y_test
        print(f"   -> Train: {len(y_train)} (pos={y_train.sum()}), Test: {len(y_test)} (pos={y_test.sum()})")

        print("[Step 3/7] Preprocessing (impute -> clip -> scale)...")
        X_train_scaled = self.preprocessor.fit_transform(X_train_raw)
        X_test_scaled = self.preprocessor.transform(X_test_raw)
        print(f"   -> Feature dimension after preprocessing: {X_train_scaled.shape[1]}")

        print("[Step 4/7] Feature engineering (prior interactions + polynomial + selection)...")
        X_train_fe = self.feature_engineer.fit_transform(
            X_train_scaled, y_train, raw_names=self.data_cfg.feature_names
        )
        X_test_fe = self.feature_engineer.transform(X_test_scaled)
        self.X_train = X_train_fe
        self.X_test = X_test_fe
        n_candidates = len(self.feature_engineer._get_prior_names()) + len(self.data_cfg.feature_names)
        print(f"   -> Feature dimension after engineering: {X_train_fe.shape[1]} (selected from {n_candidates} candidates)")

        print("[Step 5/7] Training Stacking Ensemble (5-Fold OOF)...")
        self.ensemble = self.trainer.fit(X_train_fe, y_train)
        print("   -> Base learners re-trained on full data.")
        print("   -> Meta-learner trained on OOF meta-features.")

        print("[Step 6/7] Evaluation on hold-out test set...")
        self.y_proba_test = self.ensemble.predict_proba(X_test_fe)
        self.y_proba_base_test = self.ensemble.get_base_probas(X_test_fe)

        reports = {}
        for name, proba in self.y_proba_base_test.items():
            reports[name] = self.evaluator.evaluate(y_test, proba, model_name=name)
        reports["StackingEnsemble"] = self.evaluator.evaluate(y_test, self.y_proba_test, model_name="StackingEnsemble")

        print("   [Evaluation Summary]")
        summary_df = pd.DataFrame(reports).T[["AUROC", "AUPRC", "F1_0.5", "Cohen_Kappa"]]
        print(summary_df.to_string())

        print("[Step 7/7] Explainability & Visualization...")
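        # The explainer receives scaled raw features, so it also needs the fitted
        # feature engineer to reach the space the base learners were trained in.
        self.explainer = ClinicalExplainer(
            self.ensemble, self.data_cfg.feature_names,
            preprocessor=self.preprocessor,
            feature_engineer=self.feature_engineer
        )
        self.explainer.fit_global(X_test_scaled, y_test)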

        out_dir = self.sys_cfg.output_dir
        run_dir = os.path.join(out_dir, save_prefix)
        fig_dir = os.path.join(run_dir, "figures")
        os.makedirs(fig_dir, exist_ok=True)

        model_path = os.path.join(run_dir, "ensemble_model.pkl")
        self.trainer.save_ensemble(self.ensemble, model_path)

        report_path = os.path.join(run_dir, "evaluation_report.json")
        with open(report_path, "w") as f:
            json.dump(reports, f, indent=2)

        print("   -> Plotting ROC comparison...")
        self.evaluator.plot_roc_comparison(
            y_test, {**self.y_proba_base_test, "StackingEnsemble": self.y_proba_test},
            save_path=os.path.join(fig_dir, "roc_comparison.png")
        )
        print("   -> Plotting PR comparison...")
        self.evaluator.plot_pr_comparison(
            y_test, {**self.y_proba_base_test, "StackingEnsemble": self.y_proba_test},
            save_path=os.path.join(fig_dir, "pr_comparison.png")
        )
        print("   -> Plotting confusion matrix...")
        self.evaluator.plot_confusion_matrix(
            y_test, self.y_proba_test, threshold=0.5,
            save_path=os.path.join(fig_dir, "confusion_matrix.png")
        )
        print("   -> Plotting risk distribution...")
        self.evaluator.plot_risk_distribution(
            y_test, self.y_proba_test,
            save_path=os.path.join(fig_dir, "risk_distribution.png")
        )
        print("   -> Plotting calibration curve...")
        self.evaluator.plot_calibration(
            y_test, {**self.y_proba_base_test, "StackingEnsemble": self.y_proba_test},
            save_path=os.path.join(fig_dir, "calibration.png")
        )
        print("   -> Plotting global feature importance...")
        self.explainer.plot_global_summary(
            save_path=os.path.join(fig_dir, "global_importance.png")
        )
        print("   -> Plotting meta-learner weights...")
        meta_weights = self.ensemble.meta_learner.get_meta_weights()
        self.visualizer.plot_meta_weights(
            meta_weights,
            save_path=os.path.join(fig_dir, "meta_weights.png")
        )
        print("   -> Plotting base learner diversity matrix...")
        self.visualizer.plot_base_learner_diversity(
            self.y_proba_base_test,
            save_path=os.path.join(fig_dir, "learner_diversity.png")
        )
        print("   -> Plotting executive dashboard...")
        self.visualizer.plot_dashboard(
            self.df_raw, y_test, self.y_proba_test, self.y_proba_base_test, meta_weights,
            save_path=os.path.join(fig_dir, "dashboard.png")
        )
        print("   -> Plotting local waterfall (sample 0)...")
        self.explainer.local_waterfall(
            X_test_scaled[0],
            save_path=os.path.join(fig_dir, "waterfall_sample_0.png")
        )

        print(f"\n[Pipeline Complete] All artifacts saved to: {run_dir}")

        return {
            "run_id": save_prefix,
            "model_path": model_path,
            "report_path": report_path,
            "figure_dir": fig_dir,
            "evaluation": reports,
            "meta_weights": meta_weights,
            "test_positive_rate": float(y_test.mean()),
            "predicted_high_risk_ratio": float(np.mean(self.y_proba_test >= self.thresh_cfg.high_risk_threshold))
        }

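    def predict_single(self, x_dict):
        if self.ensemble is None:
            raise RuntimeError("Pipeline must be trained before prediction.")
        # Align columns with the training feature order; missing keys become NaN
        # and are left to the preprocessor's imputation step.
        x_df = pd.DataFrame([x_dict]).reindex(columns=self.data_cfg.feature_names)
        x_scaled = self.preprocessor.transform(x_df)
        x_fe = self.feature_engineer.transform(x_scaled)
        proba = float(self.ensemble.predict_proba(x_fe)[0])

        if proba >= self.thresh_cfg.high_risk_threshold:
            risk_level = "HIGH"
            advice = "Refer to endocrinology promptly; start intensive lifestyle intervention or pharmacological prevention."
        elif proba >= self.thresh_cfg.moderate_risk_threshold:
            risk_level = "MODERATE"
            advice = "Recheck glucose tolerance and HbA1c in 3-6 months; start diet and exercise intervention."
        else:
            risk_level = "LOW"
            advice = "Continue routine annual check-ups and maintain a healthy lifestyle."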

        z_scores = x_scaled[0]
        shap_like = self.explainer.global_importance * z_scores if hasattr(self.explainer, "global_importance") else np.zeros(len(self.data_cfg.feature_names))
        top_idx = np.argsort(np.abs(shap_like))[::-1][:3]
        drivers = [
            {"feature": self.data_cfg.feature_names[i],
             "direction": "increases" if shap_like[i] > 0 else "decreases",
             "contribution": float(shap_like[i])}
            for i in top_idx
        ]

        return {
            "t2dm_5year_risk": round(proba, 4),
            "risk_level": risk_level,
            "clinical_advice": advice,
            "top_drivers": drivers,
            "threshold_high": self.thresh_cfg.high_risk_threshold,
            "threshold_moderate": self.thresh_cfg.moderate_risk_threshold
        }
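
The three-way branch in `predict_single` maps a probability onto a risk tier using inclusive lower bounds. A small standalone sketch of that rule, handy for checking boundary behaviour, is shown below. The function name `stratify_risk` is hypothetical, and the default cut-offs 0.40 and 0.70 are taken from the architecture notes in `main.py`; the actual `THRESHOLD_CONFIG` values are not shown in this section and may differ.

```python
def stratify_risk(proba, moderate=0.40, high=0.70):
    """Map a probability to LOW / MODERATE / HIGH with inclusive lower bounds."""
    if not 0.0 <= proba <= 1.0:
        raise ValueError(f"probability out of range: {proba}")
    if proba >= high:
        return "HIGH"
    if proba >= moderate:
        return "MODERATE"
    return "LOW"

# Values on either side of each assumed cut-off.
levels = [stratify_risk(p) for p in (0.39, 0.40, 0.69, 0.70)]
```

A probability exactly equal to a cut-off lands in the higher tier, which matches the `>=` comparisons in `predict_single`.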

4.12 main.py

python
#!/usr/bin/env python3
"""
T2DM-EWS: Type 2 Diabetes Early Warning System -- main entry script

Usage:
  1. Full training and evaluation (default):
     python main.py --mode train --prefix run_001

  2. Single-case prediction (requires a completed training run):
     python main.py --mode predict --model outputs/run_001/ensemble_model.pkl \
       --age 58 --fpg 6.8 --bmi 27.3 --ldl_c 3.2 --hdl_c 1.1 --tg 2.1

  3. Show the system architecture notes:
     python main.py --mode info
"""
import argparse
import json
import os
import sys
import numpy as np
import pandas as pd

from config import SYSTEM_CONFIG, DATA_CONFIG, MODEL_CONFIG, THRESHOLD_CONFIG
from pipeline import T2DMPipeline

def print_architecture_info():
    info = """
========================================================================
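  T2DM-EWS: Type 2 Diabetes Early Warning System (Multi-Parameter Ensemble Classifier)
========================================================================
  Overall architecture: five-layer pipeline + dual feedback loop

  Layer 1: Data Layer
    -- ClinicalDataGenerator  -- simulates or loads real health check-up data
    -- Missing-value injection (2% missing at random)
    -- Outlier injection (simulates lab measurement error)

  Layer 2: Preprocessing Layer
    -- SimpleImputer (median strategy)
    -- Winsorization (clipped to the 1st-99th percentiles)
    -- StandardScaler / RobustScaler

  Layer 3: Feature Engineering Layer
    -- Prior clinical interaction features (FPG*BMI, LDL/HDL, TG/HDL, MAP, NLR, etc.)
    -- PolynomialFeatures (degree=2, interaction_only)
    -- SelectKBest (f_classif, k=35)

  Layer 4: Model Layer -- base learner cluster + meta-learner
    Base Learners (five "specialist physicians"):
      - LogisticRegression   (linear boundary, highly interpretable)
      - RandomForest         (random subsampling, nonlinear rules)
      - GradientBoosting     (gradient-based residual correction)
      - AdaBoost             (sequential error correction, focuses on hard cases)
      - ExtraTrees           (extreme randomization, lower variance)
    Meta-Learner ("chief physician"):
      - LogisticRegression   (learns the optimal weighted fusion)
    Training strategy: 5-Fold Stratified OOF (prevents data leakage)

  Layer 5: Output Layer
    -- Risk probability P(T2DM|x) in [0,1]
    -- Risk stratification: LOW (<0.40) / MODERATE (0.40-0.70) / HIGH (>=0.70)
    -- ClinicalExplainer     (global/local feature importance, waterfall plot)
    -- SystemVisualizer      (Dashboard, ROC, PR, Calibration, Diversity)

  Interfaces:
    - Input: raw clinical indicator dict / CSV / DataFrame
    - Output: JSON {risk, level, advice, drivers, thresholds}
    - Deployment: pickle-serialized model bundle; can be wrapped in a RESTful API

  Test criteria:
    1. AUROC > 0.75 (baseline) / > 0.80 (target)
    2. AUPRC > 0.40 (a robust metric under class imbalance)
    3. Sensitivity@90%Specificity > 0.70
    4. Calibration (Brier Score < 0.15)
    5. Pairwise correlation of base learner predictions < 0.90 (ensures ensemble diversity)

  Acceptance criteria:
    [x] End-to-end inference latency < 200 ms (single CPU)
    [x] Model bundle can be serialized and deserialized
    [x] All visualization figures are generated and saved automatically
    [x] Single-sample prediction interface returns structured JSON
    [x] Feature importance explanations agree with medical priors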
========================================================================
"""
    print(info)

def main():
    parser = argparse.ArgumentParser(
        description="T2DM Early Warning System -- Multi-Parameter Ensemble Classifier",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="Example: python main.py --mode train --prefix demo_run"
    )
    parser.add_argument("--mode", choices=["train", "predict", "visualize", "info"],
                        default="train", help="Run mode")
    parser.add_argument("--prefix", type=str, default="run_default",
                        help="Output directory prefix (used by train/visualize)")
    parser.add_argument("--model", type=str, default=None,
                        help="Path to a saved model (used by predict)")

    clinical_args = [
        ("age", float, 50.0), ("fpg", float, 5.6), ("bmi", float, 24.0),
        ("hba1c", float, 5.7), ("ldl_c", float, 2.9), ("hdl_c", float, 1.3),
        ("tg", float, 1.6), ("ggt", float, 35.0), ("alt", float, 28.0),
        ("ast", float, 26.0), ("sbp", float, 128.0), ("dbp", float, 80.0),
        ("waist", float, 85.0), ("hip", float, 95.0), ("tc", float, 4.9),
        ("cr", float, 75.0), ("ua", float, 320.0), ("homa_ir", float, 2.8),
        ("fasting_insulin", float, 12.0), ("crp", float, 2.5),
        ("wbc", float, 6.2), ("rbc", float, 4.5), ("hb", float, 140.0),
        ("neutrophil", float, 0.58), ("lymphocyte", float, 0.30),
        ("platelet", float, 220.0)
    ]
    for name, typ, default in clinical_args:
        parser.add_argument(f"--{name}", type=typ, default=default)

    args = parser.parse_args()

    if args.mode == "info":
        print_architecture_info()
        return

    if args.mode == "train":
        print_architecture_info()
        print("\n>>> Starting training mode...")
        pipe = T2DMPipeline()
        results = pipe.run_full_pipeline(save_prefix=args.prefix)
        print("\n>>> Training complete. Result summary:")
        print(json.dumps(results["evaluation"]["StackingEnsemble"], indent=2, ensure_ascii=False))
        print(f"\n>>> Model saved to: {results['model_path']}")
        print(f">>> Figures saved to: {results['figure_dir']}")

    elif args.mode == "predict":
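        if not args.model or not os.path.exists(args.model):
            print("Error: predict mode requires a valid --model path")
            sys.exit(1)
        # NOTE: the saved ensemble is not loaded here. The fitted preprocessor and
        # feature engineer are not written to disk, so the pipeline is re-fitted first.
        print(">>> Re-fitting the pipeline before prediction...")
        pipe = T2DMPipeline()
        pipe.run_full_pipeline(save_prefix="temp_predict")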

        x_dict = {
            "Age": args.age, "FPG": args.fpg, "BMI": args.bmi,
            "HbA1c": args.hba1c, "LDL_C": args.ldl_c, "HDL_C": args.hdl_c,
            "TG": args.tg, "GGT": args.ggt, "ALT": args.alt, "AST": args.ast,
            "SBP": args.sbp, "DBP": args.dbp, "Waist": args.waist, "Hip": args.hip,
            "TC": args.tc, "Cr": args.cr, "UA": args.ua, "HOMA_IR": args.homa_ir,
            "Fasting_Insulin": args.fasting_insulin, "CRP": args.crp,
            "WBC": args.wbc, "RBC": args.rbc, "Hb": args.hb,
            "Neutrophil": args.neutrophil, "Lymphocyte": args.lymphocyte,
            "Platelet": args.platelet
        }
        result = pipe.predict_single(x_dict)
        print("\n>>> Prediction result:")
        print(json.dumps(result, indent=2, ensure_ascii=False))

    elif args.mode == "visualize":
        print(">>> Visualization mode (re-runs the pipeline to regenerate all figures)")
        pipe = T2DMPipeline()
        results = pipe.run_full_pipeline(save_prefix=args.prefix)
        print(f">>> Figures updated in {results['figure_dir']}")

if __name__ == "__main__":
    main()
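
Predict mode re-runs the whole pipeline because only the ensemble is written to disk, while the fitted preprocessor and feature engineer are not. One possible way around this is to pickle all three fitted stages as a single bundle. The sketch below illustrates the idea with plain dictionaries standing in for the fitted objects; `save_bundle`, `load_bundle`, and the file name `pipeline_bundle.pkl` are hypothetical and not part of the project.

```python
import os
import pickle
import tempfile

REQUIRED_KEYS = {"preprocessor", "feature_engineer", "ensemble"}

def save_bundle(path, preprocessor, feature_engineer, ensemble):
    """Pickle the three fitted stages together so they can be restored as a unit."""
    bundle = {
        "preprocessor": preprocessor,
        "feature_engineer": feature_engineer,
        "ensemble": ensemble,
    }
    with open(path, "wb") as f:
        pickle.dump(bundle, f)

def load_bundle(path):
    """Load a bundle written by save_bundle. Only unpickle files you trust."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    missing = REQUIRED_KEYS - set(bundle)
    if missing:
        raise KeyError(f"bundle is missing: {sorted(missing)}")
    return bundle

# Stand-in objects (plain dicts) replace the fitted stages in this demo.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline_bundle.pkl")
    save_bundle(path, {"stage": "pre"}, {"stage": "fe"}, {"stage": "ens"})
    restored = load_bundle(path)
```

With such a bundle, a prediction entry point could restore the three stages and call them in order without retraining, provided the classes are importable when the file is loaded.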

4.13 requirements.txt

numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0

5. Quick Start

bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Show the system architecture
python main.py --mode info

# 3. Run full training and evaluation
python main.py --mode train --prefix run_001

# 4. Inspect the outputs
ls outputs/run_001/figures/
# dashboard.png  roc_comparison.png  pr_comparison.png  confusion_matrix.png
# risk_distribution.png  calibration.png  global_importance.png  meta_weights.png
# learner_diversity.png  waterfall_sample_0.png

# 5. Single-case prediction (example; --model must point to an existing file)
python main.py --mode predict --model outputs/run_001/ensemble_model.pkl --age 58 --fpg 6.8 --bmi 27.3 --ldl_c 3.2 --hdl_c 1.1 --tg 2.1

6. Algorithm Pseudocode

6.1 Stacking Ensemble Training Phase

Algorithm: TrainStackingEnsemble
Input:  Feature matrix X in R^{n x p}, labels y in {0,1}^n
        Base learner pool B = {b_1, b_2, ..., b_m}
        Number of folds K = 5
Output: Trained ensemble E = (B*, M)

// Step 1: Generate meta-features (Out-of-Fold predictions)
Initialize Z in R^{n x m}

for k = 1 to K do
    D_train(k) <- indices of training fold k
    D_val(k)   <- indices of validation fold k

    for j = 1 to m do
        fit b_j on X[D_train(k)], y[D_train(k)]
        p_j(k) <- predict_proba(b_j, X[D_val(k)])
        Z[D_val(k), j] <- p_j(k)
    end
end

// Step 2: Train meta-learner
M <- LogisticRegression(solver = lbfgs, max_iter = 1000)
fit M on (Z, y)

// Step 3: Retrain base learners on full data (for deployment)
for j = 1 to m do
    fit b_j on (X, y)
    b_j* <- trained b_j
end

return E = (B*, M)
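
The pseudocode above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions rather than the project's `StackingTrainer`: the helper names `train_stacking` and `predict_risk`, the two base learners, and the synthetic data from `make_classification` are all chosen here for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def train_stacking(X, y, base_learners, n_folds=5, seed=42):
    """Return (fitted base learners, fitted meta-learner, OOF matrix Z)."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # Step 1: out-of-fold positive-class probabilities, one column per learner.
    Z = np.column_stack([
        cross_val_predict(est, X, y, cv=cv, method="predict_proba")[:, 1]
        for est in base_learners.values()
    ])
    # Step 2: the meta-learner is fitted on out-of-fold predictions only.
    meta = LogisticRegression(solver="lbfgs", max_iter=1000).fit(Z, y)
    # Step 3: refit every base learner on the full training set for deployment.
    fitted = {name: est.fit(X, y) for name, est in base_learners.items()}
    return fitted, meta, Z

def predict_risk(fitted, meta, X_new):
    """Stack base-learner probabilities and pass them through the meta-learner."""
    z = np.column_stack([est.predict_proba(X_new)[:, 1] for est in fitted.values()])
    return meta.predict_proba(z)[:, 1]

# Synthetic stand-in data and a reduced learner pool (assumptions for the demo).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
learners = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}
fitted, meta, Z = train_stacking(X, y, learners)
risk = predict_risk(fitted, meta, X[:5])
```

Because `cross_val_predict` clones each estimator per fold, every row of `Z` comes from a model that never saw that row, which is what keeps the meta-learner free of label leakage.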

6.2 Single-Instance Prediction Phase

Algorithm: PredictRisk
Input:  Trained ensemble E = (B*, M), new instance x_new in R^p
Output: Risk probability y_hat in [0,1]

// Parallel invocation of all base learners
for j = 1 to m do
    z_j <- predict_proba(b_j*, x_new)
end

// Assemble meta-feature vector
z <- [z_1, z_2, ..., z_m]^T

// Meta-learner final decision
y_hat <- predict_proba(M, z)

return y_hat
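
With a logistic-regression meta-learner, the final step `predict_proba(M, z)` amounts to a sigmoid applied to a weighted sum of the base-learner probabilities. The sketch below spells out that arithmetic using only the standard library. The function name `meta_predict` and the coefficient values are hypothetical and were picked so the result is easy to verify by hand.

```python
import math

def meta_predict(base_probas, weights, bias):
    """Logistic-regression meta-learner: sigmoid(bias + sum_j w_j * z_j)."""
    if len(base_probas) != len(weights):
        raise ValueError("one weight per base learner is required")
    logit = bias + sum(w * z for w, z in zip(weights, base_probas))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients for m = 3 base learners.
z = [0.5, 0.5, 0.5]   # base-learner probabilities for one patient
w = [2.0, 1.0, 1.0]   # meta-learner weights
b = -2.0              # meta-learner intercept
y_hat = meta_predict(z, w, b)  # logit = -2 + 1 + 0.5 + 0.5 = 0
```

A larger weight means the meta-learner leans more heavily on that base learner, which is the quantity the meta-weight charts visualize.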

Document version: v1.0 | Generated: 2026-05-19
