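T2DM-EWS: Type 2 Diabetes Early Warning System (Multi-Parameter Ensemble Classification Model)
Complete Project Source Code and Architecture Design Document
I. Overall Architecture Design
1.1 System Layer Diagram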
┌─────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ CLI entry │ │ REST API │ │ Visualization Dashboard │ │
│ │ main.py │ │ (reserved) │ │ matplotlib/seaborn │ │
│ └──────┬──────┘ └──────┬──────┘ └────────────┬────────────┘ │
└─────────┼────────────────┼─────────────────────┼─────────────────┘
│ │ │
┌─────────┼────────────────┼─────────────────────┼─────────────────┐
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Business Logic Layer (Pipeline) │ │
│ │ T2DMPipeline.run_full_pipeline() │ │
│ │ Step 1: Data Generation │ │
│ │ Step 2: Stratified Split │ │
│ │ Step 3: Preprocessing (Impute → Clip → Scale) │ │
│ │ Step 4: Feature Engineering (Prior + Poly + Select) │ │
│ │ Step 5: Stacking Ensemble Training (5-Fold OOF) │ │
│ │ Step 6: Evaluation (AUROC/AUPRC/Calibration) │ │
│ │ Step 7: Explainability & Visualization │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
│
┌─────────┼─────────────────────────────────────────────────────────┐
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Model Layer │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │ │
│ │ │ Base Learners│ │ Meta-Learner │ │ ClinicalExplainer│ │ │
│ │ │ (5 experts) │──│ (LR fusion) │──│ (SHAP-like) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
│
┌─────────┼─────────────────────────────────────────────────────────┐
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Data Layer │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌────────────────────┐ │ │
│ │ │ DataLoader │ │Preprocessor │ │ FeatureEngineer │ │ │
│ │ │ (Synthetic) │ │ (Impute/ │ │ (Prior/Poly/ │ │ │
│ │ │ │ │ Clip/Scale)│ │ SelectKBest) │ │ │
│ │ └─────────────┘ └─────────────┘ └────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
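1.2 Module Responsibility Matrix
| Module | Responsibility | Input | Output | Dependencies |
|---|---|---|---|---|
| config.py | Global hyperparameters and constants | None | SystemConfig, DataConfig, ModelConfig, ThresholdConfig | None |
| data_loader.py | Data generation/loading | DataConfig | DataFrame (26 features + label) | sklearn.datasets |
| preprocessor.py | Cleaning and standardization | Raw DataFrame | Standardized matrix X_scaled | sklearn.impute, sklearn.preprocessing |
| feature_engineer.py | Feature construction and selection | X_scaled, y | X_engineered (35 dimensions) | sklearn.preprocessing, sklearn.feature_selection |
| base_learners.py | Base learner cluster | X_engineered, y | Probability dict {name: proba} | sklearn.ensemble, sklearn.linear_model |
| meta_learner.py | Meta-learner fusion | Meta-feature matrix Z | Final risk probability | sklearn.linear_model |
| trainer.py | OOF training and serialization | Full training data | TrainedEnsemble object | sklearn.model_selection |
| evaluator.py | Metric computation and plotting | y_true, y_proba | Report dict + charts | sklearn.metrics, matplotlib |
| explainer.py | Global/local explanation | Trained model + data | Importance plot / waterfall plot | sklearn.inspection |
| visualizer.py | System-level visualization | Intermediate results | Dashboard / distribution plots / heatmaps | matplotlib, seaborn |
| pipeline.py | End-to-end orchestration | Config objects | Full results dict | All of the above |
| main.py | CLI entry point | Command-line arguments | Terminal output / files | pipeline.py |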
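1.3 Coordination Design
Data-flow coordination:
- DataLoader → Preprocessor → FeatureEngineer form a one-way data flow, with data passed as numpy.ndarray
- The FeatureEngineer output feeds both BaseLearners and MetaLearner (Trainer coordinates the OOF generation)
- Evaluator receives the predictions of every base learner and of the stacking model and compares them side by side
- Explainer attaches to the final model and provides post-hoc explanations
Control-flow coordination:
- Pipeline acts as the central controller and schedules the modules in a fixed 7-step order
- Trainer uses StratifiedKFold internally to keep the class ratio stable across folds, and clone() so that no fitted state leaks between folds
- BaseLearners and MetaLearner are decoupled, so base learner algorithms can be swapped independently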
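The out-of-fold (OOF) coordination described above can be sketched in a few lines. This is a minimal illustration, not the project's Trainer: the helper name `oof_meta_features`, the toy data, and the two stand-in models are assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def oof_meta_features(models, X, y, n_splits=5, seed=42):
    """Return Z where Z[i, j] is model j's positive-class probability for
    sample i, predicted by a clone that never saw sample i while fitting."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    Z = np.full((X.shape[0], len(models)), np.nan)
    for train_idx, val_idx in cv.split(X, y):
        for j, model in enumerate(models):
            fold_model = clone(model)  # fresh, unfitted copy for this fold
            fold_model.fit(X[train_idx], y[train_idx])
            Z[val_idx, j] = fold_model.predict_proba(X[val_idx])[:, 1]
    return Z


# Toy data standing in for the engineered training matrix (hypothetical).
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=0)
models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=25, random_state=0)]
Z = oof_meta_features(models, X, y)

# The meta-learner is fitted on OOF predictions only.
meta = LogisticRegression().fit(Z, y)
```

Every row of Z is filled exactly once because each sample lands in exactly one validation fold, which is why the meta-learner never trains on a prediction made by a model that saw that sample's label.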
1.4 Interface Integration
Internal interfaces:
python
# Data-flow interfaces (arrow notation, not executable Python)
X_raw: pd.DataFrame → preprocessor.fit_transform() → X_scaled: np.ndarray
X_scaled, y → feature_engineer.fit_transform() → X_fe: np.ndarray
X_fe, y → trainer.fit() → ensemble: TrainedEnsemble
ensemble, X_fe → ensemble.predict_proba() → y_proba: np.ndarray
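For readers who want to see the first two arrows run, the sketch below chains scikit-learn's stock transformers in the same order. It is an approximation under stated assumptions: the random matrix and label rule are made up, and the percentile clipping and clinical prior features of the project's own classes are left out.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(0)
X_raw = rng.normal(size=(200, 26))             # hypothetical 26-feature input
y = (X_raw[:, 1] + X_raw[:, 2] > 0).astype(int)
X_raw[rng.rand(200, 26) < 0.02] = np.nan       # roughly 2% missing cells

flow = make_pipeline(
    SimpleImputer(strategy="median"),          # X_raw    -> imputed
    StandardScaler(),                          # imputed  -> X_scaled
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    SelectKBest(f_classif, k=35),              # X_scaled -> X_fe (35 columns)
)
X_fe = flow.fit_transform(X_raw, y)
```

Fitting the whole chain on training data and calling `flow.transform` on test data keeps test statistics out of the imputer, scaler, and selector.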
External interfaces (reserved):
python
# RESTful API pseudocode (Flask/FastAPI wrapper layer)
@app.post("/predict")
def predict_endpoint(payload: ClinicalInput):
    x_dict = payload.dict()
    result = pipeline.predict_single(x_dict)
    return {"risk": result["t2dm_5year_risk"], "level": result["risk_level"]}
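II. Testing Standards
| Test item | Method | Pass criterion |
|---|---|---|
| Syntax check | py_compile.compile() | No SyntaxError in any .py file |
| Data generation | Check df.shape, df.isna().sum() | 8000×27, NaN rate ≈ 2% |
| Preprocessing | Check X_scaled statistics | Mean ≈ 0, standard deviation ≈ 1, no NaN |
| Feature engineering | Check X_fe.shape[1] | ≤ 35 (SelectKBest constraint) |
| Training convergence | Check each base learner's predict_proba output | Probabilities in [0, 1] |
| OOF completeness | Check that the Z matrix has no NaN | Z.shape == (n_train, n_base) |
| Meta-learner | Check the weight sum | sum(weights) ≈ 1.0 |
| End-to-end inference | Single-sample prediction latency | < 200 ms (single CPU core) |
| Serialization | pickle.dump/load round trip | Predictions identical after loading |
| Visualization | Check the output directory | 8 PNG files + 1 JSON report |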
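Two of the rows above (OOF completeness and the meta-learner weight sum) reduce to simple array checks. The helpers below are a hedged sketch of how such checks might look; the function names `check_oof_matrix` and `check_meta_weights` and the tolerance are assumptions, not part of the project.

```python
import numpy as np


def check_oof_matrix(Z, n_train, n_base):
    """True if Z has the expected shape, no NaN, and only valid probabilities."""
    Z = np.asarray(Z, dtype=float)
    if Z.shape != (n_train, n_base):
        return False
    if np.isnan(Z).any():
        return False
    return bool(((Z >= 0.0) & (Z <= 1.0)).all())


def check_meta_weights(weights, tol=1e-6):
    """True if the reported meta-weights sum to 1 within tol."""
    return abs(sum(weights.values()) - 1.0) <= tol


good_Z = np.array([[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]])
bad_Z = good_Z.copy()
bad_Z[1, 0] = np.nan   # simulates a validation row that was never filled
```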
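III. Acceptance Criteria
- Functional acceptance: python main.py --mode train --prefix demo runs successfully and generates all charts under outputs/demo/figures/
- Performance acceptance: test-set AUROC ≥ 0.80, AUPRC ≥ 0.40, Brier Score ≤ 0.15
- Explainability acceptance: the top-3 features by global importance include FPG, BMI, Age, or their interaction terms
- Robustness acceptance: the single-sample prediction interface returns valid JSON for both complete and partially missing 26-dimensional input
- Deployment acceptance: the model package ensemble_model.pkl can be loaded with pickle.load in a fresh environment and used for inference directly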
IV. Source Code Implementation
4.1 config.py
python
"""
T2DM-EWS global configuration module
"""
import os
from dataclasses import dataclass, field
from typing import List

@dataclass
class SystemConfig:
    random_state: int = 42
    n_jobs: int = -1
    test_size: float = 0.2
    cv_folds: int = 5
    project_root: str = field(default_factory=lambda: os.path.dirname(os.path.abspath(__file__)))
    # Derived paths are filled in by __post_init__.
    data_dir: str = field(init=False)
    model_dir: str = field(init=False)
    output_dir: str = field(init=False)
    figure_dir: str = field(init=False)

    def __post_init__(self):
        self.data_dir = os.path.join(self.project_root, "data")
        self.model_dir = os.path.join(self.project_root, "models")
        self.output_dir = os.path.join(self.project_root, "outputs")
        self.figure_dir = os.path.join(self.output_dir, "figures")
        # Create the working directories if they do not exist yet.
        for d in [self.data_dir, self.model_dir, self.output_dir, self.figure_dir]:
            os.makedirs(d, exist_ok=True)

@dataclass
class DataConfig:
    n_samples: int = 8000
    n_features: int = 26
    n_informative: int = 18
    n_redundant: int = 6
    n_classes: int = 2
    random_state: int = 42
    positive_ratio: float = 0.15
    flip_y: float = 0.03
    missing_rate: float = 0.02
    feature_names: List[str] = field(default_factory=lambda: [
        "Age", "FPG", "BMI", "HbA1c", "LDL_C", "HDL_C", "TG",
        "GGT", "ALT", "AST", "SBP", "DBP", "Waist",
        "Hip", "TC", "Cr", "UA", "HOMA_IR", "Fasting_Insulin",
        "CRP", "WBC", "RBC", "Hb", "Neutrophil", "Lymphocyte", "Platelet"
    ])
    target_name: str = "T2DM_5yr_Risk"

@dataclass
class ModelConfig:
    base_random_state: int = 42
    # Logistic regression
    lr_c: float = 1.0
    lr_max_iter: int = 1000
    lr_class_weight: str = "balanced"
    # Random forest
    rf_n_estimators: int = 300
    rf_max_depth: int = 12
    rf_min_samples_leaf: int = 5
    rf_class_weight: str = "balanced_subsample"
    # Gradient boosting
    gb_n_estimators: int = 200
    gb_max_depth: int = 5
    gb_learning_rate: float = 0.08
    gb_subsample: float = 0.8
    # AdaBoost
    ada_n_estimators: int = 200
    ada_learning_rate: float = 0.1
    # Extra trees
    et_n_estimators: int = 300
    et_max_depth: int = 12
    et_min_samples_leaf: int = 5
    et_class_weight: str = "balanced"
    # Meta-learner
    meta_c: float = 0.5
    meta_max_iter: int = 1000
    meta_solver: str = "lbfgs"
    # Feature selection
    select_k_best: int = 35

@dataclass
class ThresholdConfig:
    high_risk_threshold: float = 0.70
    moderate_risk_threshold: float = 0.40
    sensitivity_target: float = 0.85

SYSTEM_CONFIG = SystemConfig()
DATA_CONFIG = DataConfig()
MODEL_CONFIG = ModelConfig()
THRESHOLD_CONFIG = ThresholdConfig()
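As an illustration of how the two ThresholdConfig cut-offs might be turned into a risk level, here is a minimal sketch. The function name `risk_level` and the labels "low", "moderate", and "high" are assumptions for the example; the source does not show how the level is derived.

```python
def risk_level(proba: float, moderate: float = 0.40, high: float = 0.70) -> str:
    """Map a predicted probability to a coarse risk label.

    Defaults mirror moderate_risk_threshold and high_risk_threshold;
    the label strings are hypothetical.
    """
    if not 0.0 <= proba <= 1.0:
        raise ValueError(f"probability out of range: {proba}")
    if proba >= high:
        return "high"
    if proba >= moderate:
        return "moderate"
    return "low"


levels = [risk_level(p) for p in (0.12, 0.40, 0.69, 0.70, 0.95)]
```

Both boundaries are inclusive on the upper band here, so a score of exactly 0.40 is "moderate" and exactly 0.70 is "high"; whichever convention is chosen should match the evaluator's `>=` comparisons.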
4.2 data_loader.py
python
"""
Data loading and synthetic data generation module
Usage:
    from data_loader import ClinicalDataGenerator
    gen = ClinicalDataGenerator(DATA_CONFIG)
    df = gen.generate(save_path="data/raw_clinical.csv")
"""
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from typing import Optional
import os

class ClinicalDataGenerator:
    def __init__(self, config):
        self.cfg = config
        self.rng = np.random.RandomState(config.random_state)

    def generate(self, save_path: Optional[str] = None) -> pd.DataFrame:
        # make_classification shuffles feature columns by default, so the
        # clinical names assigned below are labels only: which columns end up
        # informative is not tied to any particular name.
        X, y = make_classification(
            n_samples=self.cfg.n_samples,
            n_features=self.cfg.n_features,
            n_informative=self.cfg.n_informative,
            n_redundant=self.cfg.n_redundant,
            n_classes=self.cfg.n_classes,
            weights=[1 - self.cfg.positive_ratio, self.cfg.positive_ratio],
            flip_y=self.cfg.flip_y,
            random_state=self.cfg.random_state,
            hypercube=False,
            shift=0.0,
            scale=1.0
        )
        X = self._apply_clinical_scaling(X)
        df = pd.DataFrame(X, columns=self.cfg.feature_names)
        df[self.cfg.target_name] = y
        df = self._inject_missing(df)
        df = self._inject_outliers(df)
        if save_path:
            # dirname is "" for a bare filename, and os.makedirs("") raises.
            save_dir = os.path.dirname(save_path)
            if save_dir:
                os.makedirs(save_dir, exist_ok=True)
            df.to_csv(save_path, index=False)
        return df

    def _apply_clinical_scaling(self, X: np.ndarray) -> np.ndarray:
        # (mean, standard deviation) used to map each column to a clinical-looking range.
        clinical_params = {
            "Age": (52.0, 12.0), "FPG": (5.6, 1.2), "BMI": (24.5, 3.8),
            "HbA1c": (5.7, 0.9), "LDL_C": (2.9, 0.8), "HDL_C": (1.3, 0.35),
            "TG": (1.6, 0.9), "GGT": (35.0, 20.0), "ALT": (28.0, 18.0),
            "AST": (26.0, 12.0), "SBP": (128.0, 16.0), "DBP": (80.0, 10.0),
            "Waist": (85.0, 10.0), "Hip": (95.0, 8.0), "TC": (4.9, 0.9),
            "Cr": (75.0, 15.0), "UA": (320.0, 80.0), "HOMA_IR": (2.8, 1.8),
            "Fasting_Insulin": (12.0, 6.0), "CRP": (2.5, 3.0),
            "WBC": (6.2, 1.5), "RBC": (4.5, 0.5), "Hb": (140.0, 15.0),
            "Neutrophil": (0.58, 0.08), "Lymphocyte": (0.30, 0.07),
            "Platelet": (220.0, 50.0)
        }
        X_scaled = np.zeros_like(X)
        for i, name in enumerate(self.cfg.feature_names):
            mu, sigma = clinical_params.get(name, (0.0, 1.0))
            X_scaled[:, i] = mu + sigma * X[:, i]
        return X_scaled

    def _inject_missing(self, df: pd.DataFrame) -> pd.DataFrame:
        # Blank out each feature cell independently with probability missing_rate.
        df_out = df.copy()
        for col in self.cfg.feature_names:
            mask = self.rng.rand(len(df_out)) < self.cfg.missing_rate
            df_out.loc[mask, col] = np.nan
        return df_out

    def _inject_outliers(self, df: pd.DataFrame, n_outliers: int = 50) -> pd.DataFrame:
        # Push n_outliers randomly chosen cells to 4.5 standard deviations from the column mean.
        df_out = df.copy()
        idx = self.rng.choice(df_out.index, size=n_outliers, replace=False)
        cols = self.rng.choice(self.cfg.feature_names, size=n_outliers)
        for i, col in zip(idx, cols):
            if self.rng.rand() < 0.5:
                df_out.loc[i, col] = df_out[col].mean() + 4.5 * df_out[col].std()
            else:
                df_out.loc[i, col] = df_out[col].mean() - 4.5 * df_out[col].std()
        return df_out
4.3 preprocessor.py
python
"""
Data preprocessing module
Usage:
    from preprocessor import ClinicalPreprocessor
    prep = ClinicalPreprocessor()
    X_train_clean = prep.fit_transform(X_train)
    X_test_clean = prep.transform(X_test)
"""
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, RobustScaler
from typing import Union, List

class ClinicalPreprocessor:
    def __init__(self, impute_strategy="median", outlier_lower=0.01,
                 outlier_upper=0.99, scaler_type="standard"):
        self.impute_strategy = impute_strategy
        self.outlier_lower = outlier_lower
        self.outlier_upper = outlier_upper
        self.scaler_type = scaler_type
        self.imputer = SimpleImputer(strategy=impute_strategy)
        self.scaler = StandardScaler() if scaler_type == "standard" else RobustScaler()
        self.feature_names = []
        self.clip_bounds_ = {}
        self.is_fitted = False

    def fit(self, X: Union[pd.DataFrame, np.ndarray], y=None):
        if isinstance(X, pd.DataFrame):
            self.feature_names = list(X.columns)
        X_arr = np.asarray(X, dtype=float)
        # Same order as transform(): impute -> clip -> scale.
        X_imputed = self.imputer.fit_transform(X_arr)
        # Bounds are taken from the imputed data. np.percentile returns NaN for
        # a column that still contains NaN, which would turn every clipped
        # value into NaN.
        self.clip_bounds_ = {
            i: (np.percentile(X_imputed[:, i], self.outlier_lower * 100),
                np.percentile(X_imputed[:, i], self.outlier_upper * 100))
            for i in range(X_imputed.shape[1])
        }
        X_clipped = self._clip_array(X_imputed)
        self.scaler.fit(X_clipped)
        self.is_fitted = True
        return self

    def transform(self, X: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
        if not self.is_fitted:
            raise RuntimeError("Preprocessor must be fitted before transform.")
        X_arr = np.asarray(X, dtype=float)
        X_imp = self.imputer.transform(X_arr)
        X_clip = self._clip_array(X_imp)
        X_scaled = self.scaler.transform(X_clip)
        return X_scaled

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

    def _clip_array(self, X: np.ndarray) -> np.ndarray:
        # Clip each column to the percentile bounds learned in fit().
        X_out = X.copy()
        for i, (low, high) in self.clip_bounds_.items():
            X_out[:, i] = np.clip(X_out[:, i], low, high)
        return X_out

    def get_feature_names(self):
        return self.feature_names
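Because the generated data contains injected NaN cells, percentile-based clip bounds depend on how missing values are treated when the bounds are computed. The short NumPy sketch below shows the difference on a made-up column; the values are illustrative only.

```python
import numpy as np

col = np.array([1.0, 2.0, np.nan, 4.0, 100.0])   # hypothetical feature column

# np.percentile propagates NaN: a single missing value poisons the bound.
naive_low = np.percentile(col, 1)

# np.nanpercentile ignores NaN and interpolates over [1, 2, 4, 100].
safe_low, safe_high = np.nanpercentile(col, [1, 99])

# Clipping leaves the missing cell as NaN for a later imputation step.
clipped = np.clip(col, safe_low, safe_high)
```

Either computing bounds with `np.nanpercentile` or imputing before computing them avoids the NaN bound; what matters is that fit and transform apply the steps in the same order.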
4.4 feature_engineer.py
python
"""
Feature engineering module
Usage:
    from feature_engineer import ClinicalFeatureEngineer
    fe = ClinicalFeatureEngineer(select_k=35)
    X_new = fe.fit_transform(X_train, y_train)
"""
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from typing import List, Optional

class ClinicalFeatureEngineer:
    def __init__(self, poly_degree=2, poly_interaction_only=True,
                 select_k=35, include_bias=False):
        self.poly_degree = poly_degree
        self.poly_interaction_only = poly_interaction_only
        self.select_k = select_k
        self.include_bias = include_bias
        self.poly = PolynomialFeatures(
            degree=poly_degree,
            interaction_only=poly_interaction_only,
            include_bias=include_bias
        )
        self.selector = SelectKBest(score_func=f_classif, k=select_k)
        self.raw_feature_names = []
        self.poly_feature_names = []
        self.selected_feature_names = []
        self.is_fitted = False

    def fit(self, X, y, raw_names=None):
        self.raw_feature_names = raw_names or ["f"+str(i) for i in range(X.shape[1])]
        X_prior = self._build_prior_features(X)
        X_poly = self.poly.fit_transform(X)
        self.poly_feature_names = self.poly.get_feature_names_out(self.raw_feature_names).tolist()
        # Candidate pool = prior features followed by polynomial features.
        X_combined = np.hstack([X_prior, X_poly])
        combined_names = self._get_prior_names() + self.poly_feature_names
        self.selector.fit(X_combined, y)
        mask = self.selector.get_support()
        self.selected_feature_names = [name for name, m in zip(combined_names, mask) if m]
        self.is_fitted = True
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise RuntimeError("FeatureEngineer must be fitted before transform.")
        X_prior = self._build_prior_features(X)
        X_poly = self.poly.transform(X)
        X_combined = np.hstack([X_prior, X_poly])
        X_selected = self.selector.transform(X_combined)
        return X_selected

    def fit_transform(self, X, y, raw_names=None):
        self.fit(X, y, raw_names)
        return self.transform(X)

    def _build_prior_features(self, X):
        # Column indices follow DataConfig.feature_names (0=Age, 1=FPG, 2=BMI, ...).
        # In this pipeline X is the standardized output of the preprocessor,
        # so the products and ratios below act on z-scores, not clinical units.
        features = []
        features.append((X[:, 1] * X[:, 2]).reshape(-1, 1))  # FPG x BMI
        features.append((X[:, 0] * X[:, 1]).reshape(-1, 1))  # Age x FPG
        hdl_safe = np.where(X[:, 5] < 0.5, 0.5, X[:, 5])
        features.append((X[:, 4] / hdl_safe).reshape(-1, 1))  # LDL/HDL
        features.append((X[:, 6] / hdl_safe).reshape(-1, 1))  # TG/HDL
        features.append((X[:, 17] * X[:, 2]).reshape(-1, 1))  # HOMA_IR x BMI
        features.append((X[:, 12] / (X[:, 0] + 1.0)).reshape(-1, 1))  # Waist/Age
        features.append(((X[:, 10] + 2 * X[:, 11]) / 3.0).reshape(-1, 1))  # MAP
        features.append((X[:, 10] - X[:, 11]).reshape(-1, 1))  # Pulse Pressure
        features.append((X[:, 19] * X[:, 2]).reshape(-1, 1))  # CRP x BMI
        lymph_safe = np.where(X[:, 24] < 0.01, 0.01, X[:, 24])
        features.append((X[:, 23] / lymph_safe).reshape(-1, 1))  # NLR
        return np.hstack(features)

    def _get_prior_names(self):
        return [
            "PRIOR_FPG_x_BMI", "PRIOR_Age_x_FPG", "PRIOR_LDL_div_HDL",
            "PRIOR_TG_div_HDL", "PRIOR_HOMAIR_x_BMI", "PRIOR_Waist_div_Age",
            "PRIOR_MAP", "PRIOR_PulsePressure", "PRIOR_CRP_x_BMI", "PRIOR_NLR"
        ]

    def get_selected_names(self):
        return self.selected_feature_names
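Some of the arithmetic in this module is easy to check by hand. The sketch below, using only the standard library, counts the candidate columns that reach SelectKBest with the default settings and evaluates the MAP and pulse-pressure formulas on one reading in raw units (SBP 128, DBP 80, the means used by the data generator). The helper names are assumptions for illustration.

```python
import math


def candidate_feature_count(n_raw: int, n_prior: int = 10) -> int:
    """Columns entering SelectKBest: priors + raw features + pairwise products.

    Matches PolynomialFeatures(degree=2, interaction_only=True,
    include_bias=False), which yields n_raw + C(n_raw, 2) columns.
    """
    return n_prior + n_raw + math.comb(n_raw, 2)


def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    return (sbp + 2 * dbp) / 3.0


def pulse_pressure(sbp: float, dbp: float) -> float:
    return sbp - dbp


n_candidates = candidate_feature_count(26)   # 10 + 26 + 325
map_value = mean_arterial_pressure(128.0, 80.0)
pp_value = pulse_pressure(128.0, 80.0)
```

With 26 raw features the selector therefore keeps 35 of 361 candidates, and the hand-built FPG x BMI prior duplicates one of the 325 pairwise products.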
4.5 base_learners.py
python
"""
Base learner cluster module
Usage:
    from base_learners import BaseLearnerCluster
    cluster = BaseLearnerCluster(MODEL_CONFIG)
    cluster.fit(X_train, y_train)
    probs = cluster.predict_proba_base(X_test)
"""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from typing import Dict, List

class BaseLearnerCluster:
    def __init__(self, config):
        self.cfg = config
        self.learners = {}
        self._build_learners()

    def _build_learners(self):
        cfg = self.cfg
        # n_jobs is omitted here: it does not parallelize a binary lbfgs fit.
        self.learners["LogisticRegression"] = LogisticRegression(
            C=cfg.lr_c, max_iter=cfg.lr_max_iter, class_weight=cfg.lr_class_weight,
            random_state=cfg.base_random_state, solver="lbfgs"
        )
        self.learners["RandomForest"] = RandomForestClassifier(
            n_estimators=cfg.rf_n_estimators, max_depth=cfg.rf_max_depth,
            min_samples_leaf=cfg.rf_min_samples_leaf, class_weight=cfg.rf_class_weight,
            random_state=cfg.base_random_state, n_jobs=-1
        )
        self.learners["GradientBoosting"] = GradientBoostingClassifier(
            n_estimators=cfg.gb_n_estimators, max_depth=cfg.gb_max_depth,
            learning_rate=cfg.gb_learning_rate, subsample=cfg.gb_subsample,
            random_state=cfg.base_random_state
        )
        self.learners["AdaBoost"] = AdaBoostClassifier(
            n_estimators=cfg.ada_n_estimators, learning_rate=cfg.ada_learning_rate,
            random_state=cfg.base_random_state
        )
        self.learners["ExtraTrees"] = ExtraTreesClassifier(
            n_estimators=cfg.et_n_estimators, max_depth=cfg.et_max_depth,
            min_samples_leaf=cfg.et_min_samples_leaf, class_weight=cfg.et_class_weight,
            random_state=cfg.base_random_state, n_jobs=-1
        )

    def fit(self, X, y):
        for name, model in self.learners.items():
            model.fit(X, y)
        return self

    def predict_proba_base(self, X):
        # Positive-class probability per learner.
        probs = {}
        for name, model in self.learners.items():
            proba = model.predict_proba(X)[:, 1]
            probs[name] = proba
        return probs

    def predict_base(self, X):
        preds = {}
        for name, model in self.learners.items():
            preds[name] = model.predict(X)
        return preds

    def get_oob_importances(self):
        # Despite the name, this returns the impurity-based feature_importances_
        # of the tree ensembles; no out-of-bag estimate is computed.
        importances = {}
        for name, model in self.learners.items():
            if hasattr(model, "feature_importances_"):
                importances[name] = model.feature_importances_
        return importances

    def get_learner(self, name):
        return self.learners.get(name)

    def names(self):
        return list(self.learners.keys())
4.6 meta_learner.py
python
"""
Meta-learner module
Usage:
    from meta_learner import StackingMetaLearner
    meta = StackingMetaLearner(MODEL_CONFIG)
    meta.fit(Z_train, y_train)
    y_pred = meta.predict_proba(Z_test)
"""
import numpy as np
from sklearn.linear_model import LogisticRegression
from typing import Dict, List

class StackingMetaLearner:
    def __init__(self, config):
        self.cfg = config
        self.model = LogisticRegression(
            C=config.meta_c, max_iter=config.meta_max_iter,
            solver=config.meta_solver, class_weight="balanced",
            random_state=getattr(config, "base_random_state", 42)
        )
        self.is_fitted = False
        self.meta_feature_names = []

    def fit(self, Z, y, feature_names=None):
        self.model.fit(Z, y)
        self.meta_feature_names = feature_names or ["base_"+str(i) for i in range(Z.shape[1])]
        self.is_fitted = True
        return self

    def predict_proba(self, Z):
        if not self.is_fitted:
            raise RuntimeError("MetaLearner must be fitted first.")
        return self.model.predict_proba(Z)[:, 1]

    def predict(self, Z):
        if not self.is_fitted:
            raise RuntimeError("MetaLearner must be fitted first.")
        return self.model.predict(Z)

    def get_meta_weights(self):
        if not self.is_fitted:
            return {}
        # Softmax over the coefficients (shifted by the max for numerical
        # stability) gives positive weights that sum to 1.
        coef = self.model.coef_[0]
        exp_coef = np.exp(coef - np.max(coef))
        weights = exp_coef / np.sum(exp_coef)
        return {name: float(w) for name, w in zip(self.meta_feature_names, weights)}

    def get_intercept(self):
        return float(self.model.intercept_[0])
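The weight normalization in get_meta_weights is a softmax over the logistic-regression coefficients. A standard-library sketch of the same computation, with made-up coefficients, shows two properties worth keeping in mind; the function name `softmax_weights` is an assumption for the example.

```python
import math


def softmax_weights(coefs):
    """Softmax with the usual max-shift for numerical stability."""
    m = max(coefs)
    exps = [math.exp(c - m) for c in coefs]
    total = sum(exps)
    return [e / total for e in exps]


coefs = [2.0, 1.0, -0.5]                             # hypothetical meta-coefficients
weights = softmax_weights(coefs)
shifted = softmax_weights([c + 10.0 for c in coefs])  # same weights after a shift
```

The weights always sum to 1 and preserve the ordering of the coefficients, but a learner with a negative coefficient still receives a positive weight. They are best read as a display-friendly ranking, not as the linear contribution each base learner makes inside the meta-model.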
4.7 trainer.py
python
"""
Training controller module
Usage:
    from trainer import StackingTrainer
    trainer = StackingTrainer(MODEL_CONFIG, base_cluster, meta_learner)
    ensemble = trainer.fit(X_train, y_train)
    y_pred = ensemble.predict_proba(X_test)
"""
import numpy as np
import pickle
import os
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
from typing import Dict

class TrainedEnsemble:
    def __init__(self, base_learners, meta_learner, base_names):
        self.base_learners = base_learners
        self.meta_learner = meta_learner
        self.base_names = base_names

    def predict_proba(self, X):
        # Build the meta-feature matrix in the same column order used in training.
        Z = np.zeros((X.shape[0], len(self.base_names)))
        for i, name in enumerate(self.base_names):
            Z[:, i] = self.base_learners[name].predict_proba(X)[:, 1]
        return self.meta_learner.predict_proba(Z)

    def predict(self, X):
        proba = self.predict_proba(X)
        return (proba >= 0.5).astype(int)

    def get_base_probas(self, X):
        return {name: self.base_learners[name].predict_proba(X)[:, 1]
                for name in self.base_names}

class StackingTrainer:
    def __init__(self, config, base_cluster, meta_learner, n_folds=5):
        self.cfg = config
        self.base_cluster = base_cluster
        self.meta_learner = meta_learner
        self.n_folds = n_folds
        self.cv = StratifiedKFold(n_splits=n_folds, shuffle=True,
                                  random_state=getattr(config, "base_random_state", 42))

    def fit(self, X, y):
        # The fold indices are positional, so work on plain arrays; a pandas
        # Series would be indexed by label instead.
        X = np.asarray(X)
        y = np.asarray(y)
        n_samples = X.shape[0]
        n_base = len(self.base_cluster.names())
        Z = np.zeros((n_samples, n_base))
        # Step 1: out-of-fold predictions from clones fitted on the other folds.
        for fold_idx, (train_idx, val_idx) in enumerate(self.cv.split(X, y)):
            X_train_fold, X_val_fold = X[train_idx], X[val_idx]
            y_train_fold = y[train_idx]
            for j, name in enumerate(self.base_cluster.names()):
                model = self.base_cluster.get_learner(name)
                model_clone = clone(model)
                model_clone.fit(X_train_fold, y_train_fold)
                Z[val_idx, j] = model_clone.predict_proba(X_val_fold)[:, 1]
        # Step 2: fit the meta-learner on the OOF matrix.
        self.meta_learner.fit(Z, y, feature_names=self.base_cluster.names())
        # Step 3: refit every base learner on the full training set for inference.
        self.base_cluster.fit(X, y)
        ensemble = TrainedEnsemble(
            base_learners={name: self.base_cluster.get_learner(name)
                           for name in self.base_cluster.names()},
            meta_learner=self.meta_learner,
            base_names=self.base_cluster.names()
        )
        return ensemble

    def save_ensemble(self, ensemble, path):
        # dirname is "" for a bare filename, and os.makedirs("") raises.
        save_dir = os.path.dirname(path)
        if save_dir:
            os.makedirs(save_dir, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(ensemble, f)

    def load_ensemble(self, path):
        with open(path, "rb") as f:
            return pickle.load(f)
4.8 evaluator.py
python
"""
Model evaluation module
Usage:
    from evaluator import ModelEvaluator
    ev = ModelEvaluator(THRESHOLD_CONFIG)
    report = ev.evaluate(y_true, y_pred_proba, model_name="Stacking")
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (roc_auc_score, average_precision_score, f1_score,
                             cohen_kappa_score, accuracy_score, confusion_matrix,
                             roc_curve, precision_recall_curve, brier_score_loss)
from typing import Dict, Optional

class ModelEvaluator:
    def __init__(self, threshold_config):
        self.thresh_cfg = threshold_config

    def evaluate(self, y_true, y_proba, model_name="Model"):
        y_true = np.asarray(y_true)
        y_proba = np.asarray(y_proba)
        y_pred_default = (y_proba >= 0.5).astype(int)
        # Scan thresholds for the best F1.
        thresholds = np.arange(0.05, 1.0, 0.01)
        f1s = [f1_score(y_true, (y_proba >= t).astype(int)) for t in thresholds]
        best_thresh = thresholds[np.argmax(f1s)]
        # Sensitivity at the last ROC point whose specificity is still >= 0.90.
        fpr, tpr, thresh_roc = roc_curve(y_true, y_proba)
        specificity = 1 - fpr
        idx_sp90 = np.where(specificity >= 0.90)[0]
        sens_at_sp90 = tpr[idx_sp90[-1]] if len(idx_sp90) > 0 else 0.0
        report = {
            "Model": model_name,
            "AUROC": round(roc_auc_score(y_true, y_proba), 4),
            "AUPRC": round(average_precision_score(y_true, y_proba), 4),
            "Accuracy_0.5": round(accuracy_score(y_true, y_pred_default), 4),
            "F1_0.5": round(f1_score(y_true, y_pred_default), 4),
            "F1_Optimal": round(np.max(f1s), 4),
            "Optimal_Threshold": round(best_thresh, 3),
            "Cohen_Kappa": round(cohen_kappa_score(y_true, y_pred_default), 4),
            "Sensitivity_at_90_Specificity": round(sens_at_sp90, 4),
            "Brier_Score": round(brier_score_loss(y_true, y_proba), 4),
            "High_Risk_Ratio": round(np.mean(y_proba >= self.thresh_cfg.high_risk_threshold), 4),
            # Cumulative share at or above the moderate threshold (includes the high-risk group).
            "Moderate_Risk_Ratio": round(np.mean(y_proba >= self.thresh_cfg.moderate_risk_threshold), 4)
        }
        return report

    def plot_roc_comparison(self, y_true, probas_dict, save_path=None):
        plt.figure(figsize=(8, 7))
        colors = plt.cm.tab10(np.linspace(0, 1, len(probas_dict)))
        for (name, proba), color in zip(probas_dict.items(), colors):
            fpr, tpr, _ = roc_curve(y_true, proba)
            auc = roc_auc_score(y_true, proba)
            plt.plot(fpr, tpr, color=color, lw=2, label=f"{name} (AUC = {auc:.3f})")
        plt.plot([0, 1], [0, 1], "k--", lw=1, alpha=0.5)
        plt.xlim([0.0, 1.0]); plt.ylim([0.0, 1.05])
        plt.xlabel("False Positive Rate (1 - Specificity)", fontsize=12)
        plt.ylabel("True Positive Rate (Sensitivity)", fontsize=12)
        plt.title("ROC Curve Comparison: Base Learners vs Stacking Ensemble", fontsize=14)
        plt.legend(loc="lower right", fontsize=10)
        plt.grid(alpha=0.3)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_pr_comparison(self, y_true, probas_dict, save_path=None):
        plt.figure(figsize=(8, 7))
        colors = plt.cm.tab10(np.linspace(0, 1, len(probas_dict)))
        for (name, proba), color in zip(probas_dict.items(), colors):
            precision, recall, _ = precision_recall_curve(y_true, proba)
            auprc = average_precision_score(y_true, proba)
            plt.plot(recall, precision, color=color, lw=2, label=f"{name} (AUPRC = {auprc:.3f})")
        # A no-skill classifier has precision equal to the prevalence.
        baseline = np.mean(y_true)
        plt.axhline(baseline, color="gray", linestyle="--", alpha=0.7, label=f"Baseline (Prevalence = {baseline:.3f})")
        plt.xlabel("Recall (Sensitivity)", fontsize=12)
        plt.ylabel("Precision (PPV)", fontsize=12)
        plt.title("Precision-Recall Curve Comparison", fontsize=14)
        plt.legend(loc="lower left", fontsize=10)
        plt.grid(alpha=0.3)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_confusion_matrix(self, y_true, y_proba, threshold=0.5, save_path=None):
        y_pred = (y_proba >= threshold).astype(int)
        # normalize="true" makes each row sum to 1 (per-class recall).
        cm = confusion_matrix(y_true, y_pred, normalize="true")
        plt.figure(figsize=(6, 5))
        sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues",
                    xticklabels=["Low Risk", "High Risk"],
                    yticklabels=["Low Risk", "High Risk"],
                    cbar_kws={"label": "Proportion"})
        plt.xlabel("Predicted Label", fontsize=12)
        plt.ylabel("True Label", fontsize=12)
        plt.title(f"Normalized Confusion Matrix (threshold={threshold})", fontsize=13)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_calibration(self, y_true, probas_dict, n_bins=10, save_path=None):
        y_true = np.asarray(y_true)
        plt.figure(figsize=(8, 6))
        for name, proba in probas_dict.items():
            proba = np.asarray(proba)
            bin_boundaries = np.linspace(0, 1, n_bins + 1)
            bin_lowers = bin_boundaries[:-1]
            bin_uppers = bin_boundaries[1:]
            bin_centers = (bin_lowers + bin_uppers) / 2
            # Empty bins stay NaN so they are not drawn as 0% positives.
            bin_accuracies = np.full(n_bins, np.nan)
            for i in range(n_bins):
                # The first bin is closed on the left so that proba == 0 is counted.
                if i == 0:
                    in_bin = (proba >= bin_lowers[i]) & (proba <= bin_uppers[i])
                else:
                    in_bin = (proba > bin_lowers[i]) & (proba <= bin_uppers[i])
                if in_bin.any():
                    bin_accuracies[i] = np.mean(y_true[in_bin])
            plt.plot(bin_centers, bin_accuracies, "o-", label=name, markersize=6)
        plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
        plt.xlim([0.0, 1.0]); plt.ylim([0.0, 1.0])
        plt.xlabel("Predicted Probability (bin center)", fontsize=12)
        plt.ylabel("Fraction of Positives", fontsize=12)
        plt.title("Calibration Plot (Reliability Diagram)", fontsize=14)
        plt.legend(fontsize=10)
        plt.grid(alpha=0.3)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_risk_distribution(self, y_true, y_proba, save_path=None):
        plt.figure(figsize=(8, 5))
        plt.hist(y_proba[y_true == 0], bins=40, alpha=0.6, label="True Negative", color="steelblue", edgecolor="white")
        plt.hist(y_proba[y_true == 1], bins=40, alpha=0.6, label="True Positive", color="crimson", edgecolor="white")
        plt.axvline(self.thresh_cfg.moderate_risk_threshold, color="orange", linestyle="--", label=f"Moderate Risk ({self.thresh_cfg.moderate_risk_threshold})")
        plt.axvline(self.thresh_cfg.high_risk_threshold, color="red", linestyle="--", label=f"High Risk ({self.thresh_cfg.high_risk_threshold})")
        plt.xlabel("Predicted T2DM Risk Probability", fontsize=12)
        plt.ylabel("Count", fontsize=12)
        plt.title("Risk Score Distribution by True Outcome", fontsize=14)
        plt.legend(fontsize=10)
        plt.grid(alpha=0.3, axis="y")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()
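The "sensitivity at 90% specificity" figure reported by evaluate() can be worked through on a small example. The sketch below uses scikit-learn's roc_curve on made-up scores; the helper name `sensitivity_at_specificity` and the small floating-point tolerance are assumptions for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve


def sensitivity_at_specificity(y_true, y_score, min_specificity=0.90):
    """Highest sensitivity among ROC points with specificity >= min_specificity."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    specificity = 1.0 - fpr
    ok = specificity >= min_specificity - 1e-12   # tolerance for float rounding
    return float(tpr[ok].max())                   # fpr == 0 is always present


# 10 negatives and 5 positives (hypothetical scores).
y_true = np.array([0] * 10 + [1] * 5)
y_score = np.array([0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.80,
                    0.60, 0.70, 0.75, 0.90, 0.30])
sens = sensitivity_at_specificity(y_true, y_score)
```

Here 90% specificity allows one false positive out of ten negatives (the negative scored 0.80). Lowering the threshold to 0.60 captures four of the five positives before a second false positive appears, so the sensitivity at that operating point is 0.8.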
4.9 explainer.py
python
"""
Model explanation module (simplified SHAP-like implementation)
Usage:
    from explainer import ClinicalExplainer
    explainer = ClinicalExplainer(ensemble, feature_names, preprocessor)
    explainer.fit_global(X_test, y_test)
    explainer.plot_global_summary(save_path="figs/importance.png")
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance
from typing import List, Dict, Optional

class ClinicalExplainer:
    def __init__(self, ensemble, raw_feature_names, preprocessor=None, feature_engineer=None):
        self.ensemble = ensemble
        # These names must label the columns of the matrix the ensemble was
        # trained on (one name per column of X passed to fit_global).
        self.raw_names = raw_feature_names
        self.preprocessor = preprocessor
        self.feature_engineer = feature_engineer
        self.base_importances = {}
        self.meta_weights = {}

    def fit_global(self, X, y):
        if X.shape[1] != len(self.raw_names):
            raise ValueError("Number of feature names must match the number of columns in X.")
        meta_weights = self.ensemble.meta_learner.get_meta_weights()
        n_learners = len(self.ensemble.base_learners)
        weighted_imp = np.zeros(X.shape[1])
        total_weight = 0.0
        # Permutation importance per base learner, averaged with the meta-weights.
        for name, model in self.ensemble.base_learners.items():
            r = permutation_importance(model, X, y, n_repeats=10,
                                       random_state=42, scoring="roc_auc", n_jobs=-1)
            imp = r.importances_mean
            weight = meta_weights.get(name, 1.0 / n_learners)
            weighted_imp += weight * imp
            total_weight += weight
            self.base_importances[name] = imp
        self.global_importance = weighted_imp / (total_weight + 1e-9)
        self.meta_weights = meta_weights
        return self

    def plot_global_summary(self, save_path=None):
        if not hasattr(self, "global_importance"):
            raise RuntimeError("Must call fit_global before plotting.")
        imp_df = pd.DataFrame({
            "Feature": self.raw_names,
            "Importance": self.global_importance
        }).sort_values("Importance", ascending=True)
        color_map = self._feature_domain_colors()
        colors = [color_map.get(f, "gray") for f in imp_df["Feature"]]
        plt.figure(figsize=(8, 10))
        plt.barh(imp_df["Feature"], imp_df["Importance"], color=colors, edgecolor="white")
        plt.xlabel("Weighted Permutation Importance", fontsize=12)
        plt.title("Global Feature Importance (Meta-Weighted)", fontsize=14)
        plt.grid(alpha=0.3, axis="x")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def local_waterfall(self, x, save_path=None):
        if not hasattr(self, "global_importance"):
            raise RuntimeError("Must call fit_global before local_waterfall.")
        base_probas = self.ensemble.get_base_probas(x.reshape(1, -1))
        Z = np.array([[base_probas[name][0] for name in self.ensemble.base_names]])
        final_proba = self.ensemble.meta_learner.predict_proba(Z)[0]
        # Heuristic attribution: global importance times the sample's feature
        # value. These are not true SHAP values and do not sum to final_proba.
        z_scores = x
        shap_like = self.global_importance * z_scores
        base_value = 0.35  # fixed reference line for the plot, not a fitted expected value
        # Keep the 12 largest contributions by absolute value.
        order = np.argsort(np.abs(shap_like))[::-1][:12]
        top_features = [self.raw_names[i] for i in order]
        top_shaps = shap_like[order]
        cumulative = [base_value]
        for val in top_shaps:
            cumulative.append(cumulative[-1] + val)
        cumulative = np.array(cumulative)
        fig, ax = plt.subplots(figsize=(10, 7))
        for i in range(len(top_shaps)):
            val = top_shaps[i]
            color = "#d62728" if val > 0 else "#1f77b4"
            ax.barh(i, val, left=cumulative[i], color=color, edgecolor="white", height=0.6)
            ax.text(cumulative[i] + val/2, i, f"{val:+.3f}",
                    ha="center", va="center", color="white", fontsize=9, weight="bold")
        ax.set_yticks(range(len(top_shaps)))
        ax.set_yticklabels(top_features, fontsize=11)
        ax.invert_yaxis()
        ax.axvline(base_value, color="black", linestyle="--", alpha=0.5)
        ax.set_xlabel("Contribution to Risk Probability", fontsize=12)
        ax.set_title(f"Local Explanation (Final Risk = {final_proba:.3f})", fontsize=14)
        ax.grid(alpha=0.3, axis="x")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def _feature_domain_colors(self):
        # One color per clinical domain; names not listed here are drawn in gray.
        return {
            "Age": "#8c564b", "FPG": "#e377c2", "BMI": "#e377c2", "HbA1c": "#e377c2",
            "LDL_C": "#ff7f0e", "HDL_C": "#ff7f0e", "TG": "#ff7f0e", "TC": "#ff7f0e",
            "GGT": "#2ca02c", "ALT": "#2ca02c", "AST": "#2ca02c",
            "SBP": "#d62728", "DBP": "#d62728", "Waist": "#9467bd", "Hip": "#9467bd",
            "Cr": "#7f7f7f", "UA": "#7f7f7f",
            "HOMA_IR": "#bcbd22", "Fasting_Insulin": "#bcbd22",
            "CRP": "#17becf", "WBC": "#17becf", "RBC": "#17becf", "Hb": "#17becf",
            "Neutrophil": "#17becf", "Lymphocyte": "#17becf", "Platelet": "#17becf"
        }

    def get_meta_weight_table(self):
        return pd.DataFrame({
            "Base_Learner": list(self.meta_weights.keys()),
            "Meta_Weight": list(self.meta_weights.values())
        }).sort_values("Meta_Weight", ascending=False)
4.10 visualizer.py
python
"""
系统级可视化模块
使用方式:
from visualizer import SystemVisualizer
viz = SystemVisualizer()
viz.plot_feature_distributions(df, target_col="T2DM_5yr_Risk", save_dir="figs/")
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc
from typing import List, Optional, Dict
class SystemVisualizer:
def __init__(self, style="seaborn-v0_8-whitegrid"):
try:
plt.style.use(style)
except:
plt.style.use("seaborn-whitegrid")
self.colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

    def plot_feature_distributions(self, df, target_col, features=None, n_cols=4, save_path=None):
        # Default to the first 12 non-target columns.
        feats = features or df.columns.drop(target_col).tolist()[:12]
        n_rows = int(np.ceil(len(feats) / n_cols))
        # squeeze=False keeps axes 2-D even for a 1x1 grid, so flatten() always works.
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3), squeeze=False)
        axes = axes.flatten()
        df_neg = df[df[target_col] == 0]
        df_pos = df[df[target_col] == 1]
        for idx, feat in enumerate(feats):
            ax = axes[idx]
            ax.hist(df_neg[feat].dropna(), bins=30, alpha=0.5, label="Low Risk",
                    color="steelblue", density=True, edgecolor="white")
            ax.hist(df_pos[feat].dropna(), bins=30, alpha=0.5, label="High Risk",
                    color="crimson", density=True, edgecolor="white")
            ax.set_title(feat, fontsize=11)
            ax.set_xlabel("")
            ax.set_ylabel("Density")
            if idx == 0:
                ax.legend(fontsize=8)
            ax.grid(alpha=0.3, axis="y")
        # Hide unused panels in the last row.
        for idx in range(len(feats), len(axes)):
            axes[idx].axis("off")
        fig.suptitle("Feature Distributions by T2DM Risk Status", fontsize=16, y=1.02)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_correlation_heatmap(self, df, features=None, save_path=None):
        feats = features or df.columns.drop("T2DM_5yr_Risk", errors="ignore").tolist()
        corr = df[feats].corr(method="pearson")
        plt.figure(figsize=(12, 10))
        # Mask the upper triangle (above the diagonal) to avoid duplicated cells.
        mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
        sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
                    center=0, square=True, linewidths=0.5,
                    cbar_kws={"shrink": 0.8, "label": "Pearson r"},
                    annot_kws={"size": 8})
        plt.title("Clinical Feature Correlation Matrix", fontsize=14)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_base_learner_diversity(self, y_proba_dict, save_path=None):
        names = list(y_proba_dict.keys())
        n = len(names)
        # squeeze=False keeps axes indexable as axes[i, j] even when n == 1.
        fig, axes = plt.subplots(n, n, figsize=(n * 3, n * 3), squeeze=False)
        for i, name_i in enumerate(names):
            for j, name_j in enumerate(names):
                ax = axes[i, j]
                if i == j:
                    # Diagonal: distribution of one learner's predicted probabilities.
                    ax.hist(y_proba_dict[name_i], bins=30, color=self.colors[i % len(self.colors)],
                            edgecolor="white", alpha=0.7)
                    ax.set_title(name_i, fontsize=10)
                else:
                    # Off-diagonal: pairwise scatter with the Pearson correlation annotated.
                    ax.scatter(y_proba_dict[name_j], y_proba_dict[name_i], alpha=0.3, s=8, color="black")
                    r = np.corrcoef(y_proba_dict[name_j], y_proba_dict[name_i])[0, 1]
                    ax.text(0.05, 0.95, f"r={r:.2f}", transform=ax.transAxes,
                            fontsize=9, verticalalignment="top",
                            bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.5))
                    ax.set_xlim([0, 1])
                    ax.set_ylim([0, 1])
                ax.grid(alpha=0.3)
        fig.suptitle("Base Learner Diversity Matrix (Predicted Probabilities)", fontsize=16)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_meta_weights(self, weights, save_path=None):
        # Note: ax.pie requires non-negative values.
        labels = list(weights.keys())
        sizes = list(weights.values())
        fig, ax = plt.subplots(figsize=(7, 7))
        wedges, texts, autotexts = ax.pie(
            sizes, labels=labels, autopct="%1.1f%%", startangle=90,
            colors=self.colors, textprops={"fontsize": 10},
            wedgeprops={"edgecolor": "white", "linewidth": 2}
        )
        for autotext in autotexts:
            autotext.set_color("white")
            autotext.set_weight("bold")
        ax.set_title("Meta-Learner Weight Allocation\n"
                     "(How Much the Director Trusts Each Expert)", fontsize=13)
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()

    def plot_dashboard(self, df, y_true, y_proba, y_proba_dict, weights, save_path=None):
        # df is accepted for interface compatibility but is not used below.
        y_true = np.asarray(y_true)
        y_proba = np.asarray(y_proba)
        fig = plt.figure(figsize=(18, 12))
        gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)

        # Panel A: risk score distribution by true class
        ax1 = fig.add_subplot(gs[0, 0])
        ax1.hist(y_proba[y_true == 0], bins=30, alpha=0.6, label="Neg", color="steelblue", edgecolor="white")
        ax1.hist(y_proba[y_true == 1], bins=30, alpha=0.6, label="Pos", color="crimson", edgecolor="white")
        ax1.axvline(0.5, color="black", linestyle="--")
        ax1.set_title("A. Risk Score Distribution")
        ax1.legend()
        ax1.grid(alpha=0.3, axis="y")

        # Panel B: ROC curve of the stacking ensemble
        ax2 = fig.add_subplot(gs[0, 1])
        fpr, tpr, _ = roc_curve(y_true, y_proba)
        roc_auc = auc(fpr, tpr)
        ax2.plot(fpr, tpr, lw=2, label=f"Stacking AUC={roc_auc:.3f}")
        ax2.plot([0, 1], [0, 1], "k--", alpha=0.5)
        ax2.set_title("B. ROC Curve (Stacking)")
        ax2.set_xlabel("FPR")
        ax2.set_ylabel("TPR")
        ax2.legend()
        ax2.grid(alpha=0.3)

        # Panel C: row-normalized confusion matrix at threshold 0.5
        ax3 = fig.add_subplot(gs[0, 2])
        cm = confusion_matrix(y_true, (y_proba >= 0.5).astype(int), normalize="true")
        sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues", ax=ax3,
                    xticklabels=["Neg", "Pos"], yticklabels=["Neg", "Pos"], cbar=False)
        ax3.set_title("C. Confusion Matrix")

        # Panel D: meta-learner weights
        ax4 = fig.add_subplot(gs[1, 0])
        names_w = list(weights.keys())
        vals_w = list(weights.values())
        ax4.barh(names_w, vals_w, color=self.colors[:len(names_w)], edgecolor="white")
        ax4.set_title("D. Meta-Learner Weights")
        ax4.grid(alpha=0.3, axis="x")

        # Panel E: AUROC of each base learner
        ax5 = fig.add_subplot(gs[1, 1])
        aucs = []
        for name, proba in y_proba_dict.items():
            aucs.append(auc(*roc_curve(y_true, proba)[:2]))
        ax5.bar(list(y_proba_dict.keys()), aucs, color=self.colors[:len(aucs)], edgecolor="white")
        ax5.axhline(0.5, color="gray", linestyle="--")
        ax5.set_ylim([0.4, 1.0])
        ax5.set_title("E. Base Learner AUROC")
        ax5.tick_params(axis="x", rotation=15)
        ax5.grid(alpha=0.3, axis="y")

        # Panel F: calibration curve over 10 equal-width bins
        ax6 = fig.add_subplot(gs[1, 2])
        bin_edges = np.linspace(0, 1, 11)
        bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
        bin_accs = []
        for i in range(len(bin_edges) - 1):
            # Close the first bin on the left so scores of exactly 0 are counted.
            lower_ok = (y_proba >= bin_edges[i]) if i == 0 else (y_proba > bin_edges[i])
            mask = lower_ok & (y_proba <= bin_edges[i + 1])
            # Empty bins become NaN so they are skipped rather than drawn at 0.
            bin_accs.append(y_true[mask].mean() if mask.sum() > 0 else np.nan)
        ax6.plot(bin_centers, bin_accs, "o-", color="darkgreen", markersize=8, label="Stacking")
        ax6.plot([0, 1], [0, 1], "k--", alpha=0.5, label="Ideal")
        ax6.set_title("F. Calibration Curve")
        ax6.set_xlabel("Predicted")
        ax6.set_ylabel("Observed")
        ax6.legend()
        ax6.grid(alpha=0.3)

        fig.suptitle("T2DM Early Warning System -- Executive Dashboard", fontsize=18, y=0.98)
        if save_path:
            plt.savefig(save_path, dpi=300, bbox_inches="tight")
        plt.show()
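The calibration panel of the dashboard bins predicted probabilities into ten equal-width bins and compares each bin with the observed positive rate. The standalone sketch below shows that binning without any plotting, and adds the Brier score that the test criteria in main.py mention. The function names `reliability_bins` and `brier_score` and the toy data are illustrative assumptions, not part of the project code.

```python
import numpy as np


def reliability_bins(y_true, y_proba, n_bins=10):
    """Return (mean predicted, observed rate, count) per equal-width bin.

    Bins are (edge_i, edge_{i+1}], except that the first bin also includes 0.
    Empty bins are reported as NaN rather than 0.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_proba = np.asarray(y_proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges yields bin indices 0 .. n_bins - 1.
    bin_idx = np.digitize(y_proba, edges[1:-1], right=True)
    mean_pred = np.full(n_bins, np.nan)
    obs_rate = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = bin_idx == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            mean_pred[b] = y_proba[mask].mean()
            obs_rate[b] = y_true[mask].mean()
    return mean_pred, obs_rate, counts


def brier_score(y_true, y_proba):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_proba = np.asarray(y_proba, dtype=float)
    return float(np.mean((y_proba - y_true) ** 2))


# Toy data, made up for illustration.
y_true = [0, 0, 1, 1, 1, 1]
y_proba = [0.05, 0.15, 0.15, 0.85, 0.95, 0.95]

mean_pred, obs_rate, counts = reliability_bins(y_true, y_proba)
brier = brier_score(y_true, y_proba)
print(counts.tolist())
print(round(brier, 4))
```

Plotting `mean_pred` rather than the bin centers on the x-axis is a common variant; it places each point at the average score actually observed in that bin.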
4.11 pipeline.py
python
"""
Main orchestration pipeline module.

Usage:
    from pipeline import T2DMPipeline
    pipe = T2DMPipeline()
    results = pipe.run_full_pipeline(save_prefix="run_001")
"""
import os
import json
import numpy as np
import pandas as pd
from typing import Dict, Any
from config import SYSTEM_CONFIG, DATA_CONFIG, MODEL_CONFIG, THRESHOLD_CONFIG
from data_loader import ClinicalDataGenerator
from preprocessor import ClinicalPreprocessor
from feature_engineer import ClinicalFeatureEngineer
from base_learners import BaseLearnerCluster
from meta_learner import StackingMetaLearner
from trainer import StackingTrainer, TrainedEnsemble
from evaluator import ModelEvaluator
from explainer import ClinicalExplainer
from visualizer import SystemVisualizer
from sklearn.model_selection import train_test_split
class T2DMPipeline:
    def __init__(self):
        # Configuration objects
        self.sys_cfg = SYSTEM_CONFIG
        self.data_cfg = DATA_CONFIG
        self.model_cfg = MODEL_CONFIG
        self.thresh_cfg = THRESHOLD_CONFIG
        # Pipeline components
        self.generator = ClinicalDataGenerator(self.data_cfg)
        self.preprocessor = ClinicalPreprocessor()
        self.feature_engineer = ClinicalFeatureEngineer(select_k=self.model_cfg.select_k_best)
        self.base_cluster = BaseLearnerCluster(self.model_cfg)
        self.meta_learner = StackingMetaLearner(self.model_cfg)
        self.trainer = StackingTrainer(self.model_cfg, self.base_cluster, self.meta_learner, n_folds=5)
        self.evaluator = ModelEvaluator(self.thresh_cfg)
        self.visualizer = SystemVisualizer()
        # State populated by run_full_pipeline()
        self.df_raw = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.ensemble = None
        self.y_proba_test = None
        self.y_proba_base_test = {}
        self.explainer = None

    def run_full_pipeline(self, save_prefix="default_run"):
        print("=" * 60)
        print("T2DM Early Warning System -- Full Pipeline Execution")
        print("=" * 60)

        print("[Step 1/7] Generating synthetic clinical dataset...")
        self.df_raw = self.generator.generate()
        pos_rate = self.df_raw[self.data_cfg.target_name].mean()
        print(f" -> Generated {len(self.df_raw)} samples, positive rate = {pos_rate:.3f}")

        print("[Step 2/7] Stratified train/test split...")
        X = self.df_raw.drop(columns=[self.data_cfg.target_name])
        y = self.df_raw[self.data_cfg.target_name].values
        X_train_raw, X_test_raw, y_train, y_test = train_test_split(
            X, y, test_size=self.sys_cfg.test_size,
            random_state=self.sys_cfg.random_state, stratify=y
        )
        self.y_train = y_train
        self.y_test = y_test
        print(f" -> Train: {len(y_train)} (pos={y_train.sum()}), Test: {len(y_test)} (pos={y_test.sum()})")

        print("[Step 3/7] Preprocessing (impute -> clip -> scale)...")
        # Fit on the training split only, then apply to the test split.
        X_train_scaled = self.preprocessor.fit_transform(X_train_raw)
        X_test_scaled = self.preprocessor.transform(X_test_raw)
        print(f" -> Feature dimension after preprocessing: {X_train_scaled.shape[1]}")

        print("[Step 4/7] Feature engineering (prior interactions + polynomial + selection)...")
        X_train_fe = self.feature_engineer.fit_transform(
            X_train_scaled, y_train, raw_names=self.data_cfg.feature_names
        )
        X_test_fe = self.feature_engineer.transform(X_test_scaled)
        self.X_train = X_train_fe
        self.X_test = X_test_fe
        n_candidates = len(self.feature_engineer._get_prior_names()) + len(self.data_cfg.feature_names)
        print(f" -> Feature dimension after engineering: {X_train_fe.shape[1]} (selected from {n_candidates} candidates)")

        print("[Step 5/7] Training Stacking Ensemble (5-Fold OOF)...")
        self.ensemble = self.trainer.fit(X_train_fe, y_train)
        print(" -> Base learners re-trained on full data.")
        print(" -> Meta-learner trained on OOF meta-features.")

        print("[Step 6/7] Evaluation on hold-out test set...")
        self.y_proba_test = self.ensemble.predict_proba(X_test_fe)
        self.y_proba_base_test = self.ensemble.get_base_probas(X_test_fe)
        reports = {}
        for name, proba in self.y_proba_base_test.items():
            reports[name] = self.evaluator.evaluate(y_test, proba, model_name=name)
        reports["StackingEnsemble"] = self.evaluator.evaluate(y_test, self.y_proba_test, model_name="StackingEnsemble")
        print(" [Evaluation Summary]")
        summary_df = pd.DataFrame(reports).T[["AUROC", "AUPRC", "F1_0.5", "Cohen_Kappa"]]
        print(summary_df.to_string())

        print("[Step 7/7] Explainability & Visualization...")
        self.explainer = ClinicalExplainer(
            self.ensemble, self.data_cfg.feature_names,
            preprocessor=self.preprocessor
        )
        self.explainer.fit_global(X_test_scaled, y_test)

        # Output layout: <output_dir>/<save_prefix>/{ensemble_model.pkl, evaluation_report.json, figures/}
        out_dir = self.sys_cfg.output_dir
        run_dir = os.path.join(out_dir, save_prefix)
        fig_dir = os.path.join(run_dir, "figures")
        os.makedirs(fig_dir, exist_ok=True)
        model_path = os.path.join(run_dir, "ensemble_model.pkl")
        self.trainer.save_ensemble(self.ensemble, model_path)
        report_path = os.path.join(run_dir, "evaluation_report.json")
        with open(report_path, "w") as f:
            json.dump(reports, f, indent=2)

        # Base-learner and ensemble probabilities, shared by the comparison plots.
        all_probas = {**self.y_proba_base_test, "StackingEnsemble": self.y_proba_test}

        print(" -> Plotting ROC comparison...")
        self.evaluator.plot_roc_comparison(
            y_test, all_probas,
            save_path=os.path.join(fig_dir, "roc_comparison.png")
        )
        print(" -> Plotting PR comparison...")
        self.evaluator.plot_pr_comparison(
            y_test, all_probas,
            save_path=os.path.join(fig_dir, "pr_comparison.png")
        )
        print(" -> Plotting confusion matrix...")
        self.evaluator.plot_confusion_matrix(
            y_test, self.y_proba_test, threshold=0.5,
            save_path=os.path.join(fig_dir, "confusion_matrix.png")
        )
        print(" -> Plotting risk distribution...")
        self.evaluator.plot_risk_distribution(
            y_test, self.y_proba_test,
            save_path=os.path.join(fig_dir, "risk_distribution.png")
        )
        print(" -> Plotting calibration curve...")
        self.evaluator.plot_calibration(
            y_test, all_probas,
            save_path=os.path.join(fig_dir, "calibration.png")
        )
        print(" -> Plotting global feature importance...")
        self.explainer.plot_global_summary(
            save_path=os.path.join(fig_dir, "global_importance.png")
        )
        print(" -> Plotting meta-learner weights...")
        meta_weights = self.ensemble.meta_learner.get_meta_weights()
        self.visualizer.plot_meta_weights(
            meta_weights,
            save_path=os.path.join(fig_dir, "meta_weights.png")
        )
        print(" -> Plotting base learner diversity matrix...")
        self.visualizer.plot_base_learner_diversity(
            self.y_proba_base_test,
            save_path=os.path.join(fig_dir, "learner_diversity.png")
        )
        print(" -> Plotting executive dashboard...")
        self.visualizer.plot_dashboard(
            self.df_raw, y_test, self.y_proba_test, self.y_proba_base_test, meta_weights,
            save_path=os.path.join(fig_dir, "dashboard.png")
        )
        print(" -> Plotting local waterfall (sample 0)...")
        self.explainer.local_waterfall(
            X_test_scaled[0],
            save_path=os.path.join(fig_dir, "waterfall_sample_0.png")
        )
        print(f"\n[Pipeline Complete] All artifacts saved to: {run_dir}")
        return {
            "run_id": save_prefix,
            "model_path": model_path,
            "report_path": report_path,
            "figure_dir": fig_dir,
            "evaluation": reports,
            "meta_weights": meta_weights,
            "test_positive_rate": float(y_test.mean()),
            "predicted_high_risk_ratio": float(np.mean(self.y_proba_test >= self.thresh_cfg.high_risk_threshold))
        }
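
    def predict_single(self, x_dict):
        if self.ensemble is None:
            raise RuntimeError("Pipeline must be trained before prediction.")
        # Align columns with the training feature order; keys missing from x_dict
        # become NaN and are left to the preprocessor's imputation step.
        x_df = pd.DataFrame([x_dict]).reindex(columns=self.data_cfg.feature_names)
        x_scaled = self.preprocessor.transform(x_df)
        x_fe = self.feature_engineer.transform(x_scaled)
        proba = float(self.ensemble.predict_proba(x_fe)[0])

        # Map the probability to a risk tier and the matching clinical advice.
        if proba >= self.thresh_cfg.high_risk_threshold:
            risk_level = "HIGH"
            advice = ("Recommend immediate referral to endocrinology and initiation of "
                      "intensive lifestyle intervention or preventive medication.")
        elif proba >= self.thresh_cfg.moderate_risk_threshold:
            risk_level = "MODERATE"
            advice = ("Recommend rechecking glucose tolerance and HbA1c in 3-6 months "
                      "and starting diet and exercise intervention.")
        else:
            risk_level = "LOW"
            advice = "Continue routine annual check-ups and maintain a healthy lifestyle."

        # SHAP-like local contribution: global importance times the standardized value.
        z_scores = np.asarray(x_scaled)[0]
        importance = getattr(self.explainer, "global_importance", None)
        if importance is not None:
            shap_like = importance * z_scores
        else:
            shap_like = np.zeros(len(self.data_cfg.feature_names))
        top_idx = np.argsort(np.abs(shap_like))[::-1][:3]
        drivers = [
            {"feature": self.data_cfg.feature_names[i],
             "direction": "increases" if shap_like[i] > 0 else "decreases",
             "contribution": float(shap_like[i])}
            for i in top_idx
        ]
        return {
            "t2dm_5year_risk": round(proba, 4),
            "risk_level": risk_level,
            "clinical_advice": advice,
            "top_drivers": drivers,
            "threshold_high": self.thresh_cfg.high_risk_threshold,
            "threshold_moderate": self.thresh_cfg.moderate_risk_threshold
        }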
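The two decisions inside `predict_single` (risk tiering and the "top drivers" ranking) can be tried in isolation. The sketch below is a minimal standalone version under stated assumptions: the thresholds 0.40 and 0.70 are taken from the architecture text in main.py, whereas the real values live in `THRESHOLD_CONFIG`; the function names `stratify_risk` and `top_drivers`, the importance values, and the z-scores are all made up for illustration.

```python
import numpy as np


def stratify_risk(proba, moderate=0.40, high=0.70):
    """Map a probability to a risk tier (both thresholds are inclusive lower bounds)."""
    if proba >= high:
        return "HIGH"
    if proba >= moderate:
        return "MODERATE"
    return "LOW"


def top_drivers(feature_names, global_importance, z_scores, k=3):
    """Rank features by |global importance * standardized value|.

    This mirrors the SHAP-like heuristic in predict_single; it is not a true
    SHAP value computation.
    """
    contrib = np.asarray(global_importance, dtype=float) * np.asarray(z_scores, dtype=float)
    order = np.argsort(np.abs(contrib))[::-1][:k]
    return [
        {"feature": feature_names[i],
         "direction": "increases" if contrib[i] > 0 else "decreases",
         "contribution": float(contrib[i])}
        for i in order
    ]


# Hypothetical inputs for a single patient.
names = ["Age", "FPG", "BMI", "HDL_C"]
importance = [0.10, 0.40, 0.20, 0.30]
z = [0.5, 2.0, 1.0, -1.5]

drivers = top_drivers(names, importance, z, k=2)
print(stratify_risk(0.72), [d["feature"] for d in drivers])
```

Because the contribution is a product of a non-negative importance and a signed z-score, a below-average HDL_C shows up as "decreases" here even though low HDL is clinically a risk factor; the direction label describes the sign of the heuristic score, not the clinical effect.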
4.12 main.py
python
#!/usr/bin/env python3
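"""
T2DM-EWS: Type 2 Diabetes Early Warning System -- main entry script.

Usage:
1. Full training and evaluation (default):
    python main.py --mode train --prefix run_001
2. Single-case prediction (requires a completed training run):
    python main.py --mode predict --model outputs/run_001/ensemble_model.pkl \
        --age 58 --fpg 6.8 --bmi 27.3 --ldl_c 3.2 --hdl_c 1.1 --tg 2.1
3. Show the system architecture description:
    python main.py --mode info
"""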
import argparse
import json
import os
import sys
import numpy as np
import pandas as pd
from config import SYSTEM_CONFIG, DATA_CONFIG, MODEL_CONFIG, THRESHOLD_CONFIG
from pipeline import T2DMPipeline
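def print_architecture_info():
    info = """
========================================================================
T2DM-EWS: Type 2 Diabetes Early Warning System (Multi-Parameter Ensemble Classifier)
========================================================================
Overall architecture: five-layer pipeline + dual-loop feedback

Layer 1: Data Layer
  -- ClinicalDataGenerator: simulates or loads real health-check data
  -- Missing-value injection (2% missing at random)
  -- Outlier injection (simulates laboratory measurement error)
Layer 2: Preprocessing Layer
  -- SimpleImputer (median strategy)
  -- Winsorize clipping (1st-99th percentiles)
  -- StandardScaler / RobustScaler
Layer 3: Feature Engineering Layer
  -- Prior clinical interaction features (FPG*BMI, LDL/HDL, TG/HDL, MAP, NLR, etc.)
  -- PolynomialFeatures (degree=2, interaction_only)
  -- SelectKBest (f_classif, k=35)
Layer 4: Model Layer -- base learner cluster + meta-learner
  Base Learners (5 specialist physicians):
    - LogisticRegression (linear boundary, highly interpretable)
    - RandomForest (random subsampling, nonlinear rules)
    - GradientBoosting (gradient-based residual correction)
    - AdaBoost (sequential error correction, focuses on hard-to-classify cases)
    - ExtraTrees (extreme randomization, lower variance)
  Meta-Learner (chief physician):
    - LogisticRegression (learns the optimal weighted fusion)
  Training strategy: 5-Fold Stratified OOF (prevents data leakage)
Layer 5: Output Layer
  -- Risk probability P(T2DM|x) in [0,1]
  -- Risk tiers: LOW (<0.40) / MODERATE (0.40-0.70) / HIGH (>=0.70)
  -- ClinicalExplainer (global/local feature importance, waterfall plot)
  -- SystemVisualizer (Dashboard, ROC, PR, Calibration, Diversity)

Interfaces:
  - Input: dict of raw clinical indicators / CSV / DataFrame
  - Output: JSON {risk, level, advice, drivers, thresholds}
  - Deployment: pickle-serialized model package; can be wrapped in a RESTful API

Test criteria:
  1. AUROC > 0.75 (baseline) / > 0.80 (target)
  2. AUPRC > 0.40 (a robust metric under class imbalance)
  3. Sensitivity@90%Specificity > 0.70
  4. Calibration (Brier Score < 0.15)
  5. Pairwise correlation of base-learner predictions < 0.90 (ensures ensemble diversity)

Acceptance criteria:
  ✓ End-to-end inference latency < 200 ms (single CPU)
  ✓ Model package can be serialized and deserialized
  ✓ All visualizations are generated and saved automatically
  ✓ Single-sample prediction interface returns structured JSON
  ✓ Feature-importance explanations are consistent with medical priors
========================================================================
"""
    print(info)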


def main():
    parser = argparse.ArgumentParser(
        description="T2DM Early Warning System -- Multi-Parameter Ensemble Classifier",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="Example: python main.py --mode train --prefix demo_run"
    )
    parser.add_argument("--mode", choices=["train", "predict", "visualize", "info"],
                        default="train", help="Run mode")
    parser.add_argument("--prefix", type=str, default="run_default",
                        help="Output directory prefix (used by train/visualize)")
    parser.add_argument("--model", type=str, default=None,
                        help="Path to a saved model (used by predict)")
    # Clinical inputs for predict mode: (flag name, type, default value)
    clinical_args = [
        ("age", float, 50.0), ("fpg", float, 5.6), ("bmi", float, 24.0),
        ("hba1c", float, 5.7), ("ldl_c", float, 2.9), ("hdl_c", float, 1.3),
        ("tg", float, 1.6), ("ggt", float, 35.0), ("alt", float, 28.0),
        ("ast", float, 26.0), ("sbp", float, 128.0), ("dbp", float, 80.0),
        ("waist", float, 85.0), ("hip", float, 95.0), ("tc", float, 4.9),
        ("cr", float, 75.0), ("ua", float, 320.0), ("homa_ir", float, 2.8),
        ("fasting_insulin", float, 12.0), ("crp", float, 2.5),
        ("wbc", float, 6.2), ("rbc", float, 4.5), ("hb", float, 140.0),
        ("neutrophil", float, 0.58), ("lymphocyte", float, 0.30),
        ("platelet", float, 220.0)
    ]
    for name, typ, default in clinical_args:
        parser.add_argument(f"--{name}", type=typ, default=default)
    args = parser.parse_args()

    if args.mode == "info":
        print_architecture_info()
        return

    if args.mode == "train":
        print_architecture_info()
        print("\n>>> Starting training mode...")
        pipe = T2DMPipeline()
        results = pipe.run_full_pipeline(save_prefix=args.prefix)
        print("\n>>> Training complete. Result summary:")
        print(json.dumps(results["evaluation"]["StackingEnsemble"], indent=2, ensure_ascii=False))
        print(f"\n>>> Model saved to: {results['model_path']}")
        print(f">>> Figures saved to: {results['figure_dir']}")
    elif args.mode == "predict":
        if not args.model or not os.path.exists(args.model):
            print("Error: predict mode requires a valid --model path")
            sys.exit(1)
        # NOTE: the file at --model is only checked for existence; it is not loaded.
        # save_ensemble() is given the ensemble only, not the fitted preprocessor or
        # feature engineer, so the pipeline is retrained here before predicting.
        print(">>> Retraining the pipeline before prediction...")
        pipe = T2DMPipeline()
        pipe.run_full_pipeline(save_prefix="temp_predict")
        x_dict = {
            "Age": args.age, "FPG": args.fpg, "BMI": args.bmi,
            "HbA1c": args.hba1c, "LDL_C": args.ldl_c, "HDL_C": args.hdl_c,
            "TG": args.tg, "GGT": args.ggt, "ALT": args.alt, "AST": args.ast,
            "SBP": args.sbp, "DBP": args.dbp, "Waist": args.waist, "Hip": args.hip,
            "TC": args.tc, "Cr": args.cr, "UA": args.ua, "HOMA_IR": args.homa_ir,
            "Fasting_Insulin": args.fasting_insulin, "CRP": args.crp,
            "WBC": args.wbc, "RBC": args.rbc, "Hb": args.hb,
            "Neutrophil": args.neutrophil, "Lymphocyte": args.lymphocyte,
            "Platelet": args.platelet
        }
        result = pipe.predict_single(x_dict)
        print("\n>>> Prediction result:")
        print(json.dumps(result, indent=2, ensure_ascii=False))
    elif args.mode == "visualize":
        print(">>> Visualization mode (reruns the full pipeline to regenerate the figures)")
        pipe = T2DMPipeline()
        results = pipe.run_full_pipeline(save_prefix=args.prefix)
        print(f">>> Figures updated in {results['figure_dir']}")


if __name__ == "__main__":
    main()
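Test criterion 3 in the architecture text, Sensitivity@90%Specificity, is stated but not computed anywhere in this section. A minimal sketch of one way to compute it from the ROC curve follows; the function name `sensitivity_at_specificity` and the toy scores are assumptions made for illustration, and the project's `ModelEvaluator` may implement the metric differently.

```python
import numpy as np
from sklearn.metrics import roc_curve


def sensitivity_at_specificity(y_true, y_proba, target_specificity=0.90):
    """Highest sensitivity (TPR) among ROC points with specificity >= target."""
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    # Specificity = 1 - FPR; the small tolerance guards against float rounding.
    allowed = fpr <= (1.0 - target_specificity) + 1e-12
    # roc_curve always includes the point FPR = 0, so `allowed` is never empty.
    return float(tpr[allowed].max())


# Toy scores: 10 negatives and 5 positives, made up for illustration.
neg_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
pos_scores = [0.95, 0.80, 0.70, 0.60, 0.42]
y_true = np.array([0] * len(neg_scores) + [1] * len(pos_scores))
y_proba = np.array(neg_scores + pos_scores)

sens90 = sensitivity_at_specificity(y_true, y_proba, 0.90)
print(f"Sensitivity at 90% specificity: {sens90:.2f}")
```

In this toy set only one negative (score 0.90) may be misclassified at 90% specificity, and four of the five positives score above the next negative, so the result is 0.80.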
4.13 requirements.txt
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
5. Quick Start
bash
# 1. Install dependencies
pip install -r requirements.txt
# 2. Show the system architecture
python main.py --mode info
# 3. Run full training and evaluation
python main.py --mode train --prefix run_001
# 4. Inspect the outputs
ls outputs/run_001/figures/
# dashboard.png roc_comparison.png pr_comparison.png confusion_matrix.png
# risk_distribution.png calibration.png global_importance.png meta_weights.png
# learner_diversity.png waterfall_sample_0.png
# 5. Single-case prediction (example; predict mode exits with an error unless --model points to an existing file)
python main.py --mode predict --model outputs/run_001/ensemble_model.pkl --age 58 --fpg 6.8 --bmi 27.3 --ldl_c 3.2 --hdl_c 1.1 --tg 2.1
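After a training run, the saved `evaluation_report.json` can be checked against the baseline test criteria quoted in main.py (AUROC > 0.75, AUPRC > 0.40). The sketch below assumes the report is a dict keyed by model name with "AUROC" and "AUPRC" entries, as the summary table in pipeline.py suggests; the helper `check_acceptance` and the sample numbers are hypothetical.

```python
import json

# Baseline thresholds quoted from the test criteria in main.py.
CRITERIA = {"AUROC": 0.75, "AUPRC": 0.40}


def check_acceptance(report, model_name="StackingEnsemble", criteria=CRITERIA):
    """Return {metric: passed} for one model in an evaluation report dict."""
    metrics = report[model_name]
    return {name: metrics[name] > minimum for name, minimum in criteria.items()}


# Made-up report content standing in for outputs/run_001/evaluation_report.json.
sample_report = json.loads('{"StackingEnsemble": {"AUROC": 0.83, "AUPRC": 0.38}}')

status = check_acceptance(sample_report)
print(status)
```

To check a real run, the dict would instead be loaded with `json.load(open("outputs/run_001/evaluation_report.json"))`, assuming the default output directory.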
6. Algorithm Pseudocode
6.1 Stacking Ensemble: Training Phase
Algorithm: TrainStackingEnsemble
Input:  Feature matrix X in R^{n x p}, labels y in {0,1}^n
        Base learner pool B = {b_1, b_2, ..., b_m}
        Number of folds K = 5
Output: Trained ensemble E = (B*, M)

// Step 1: Generate meta-features (Out-of-Fold predictions)
Initialize Z in R^{n x m}
for k = 1 to K do
    D_train(k) <- indices of training fold k
    D_val(k)   <- indices of validation fold k
    for j = 1 to m do
        fit a fresh copy of b_j on X[D_train(k)], y[D_train(k)]
        p_j(k) <- predict_proba(b_j, X[D_val(k)])
        Z[D_val(k), j] <- p_j(k)
    end
end

// Step 2: Train meta-learner
M <- LogisticRegression(solver = lbfgs, max_iter = 1000)
fit M on (Z, y)

// Step 3: Retrain base learners on full data (for deployment)
for j = 1 to m do
    fit b_j on (X, y)
    b_j* <- trained b_j
end
B* <- {b_1*, b_2*, ..., b_m*}
return E = (B*, M)
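The training procedure above can be sketched in scikit-learn as follows. This is a minimal illustration on synthetic data, not the project's `StackingTrainer`: the function name `train_stacking`, the choice of three base learners, and the dataset are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def train_stacking(X, y, base_learners, n_folds=5, seed=42):
    """Return (fitted base learners, meta-learner, OOF meta-feature matrix Z)."""
    n, m = X.shape[0], len(base_learners)
    Z = np.zeros((n, m))
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)

    # Step 1: out-of-fold predictions. Each row of Z is predicted by models
    # that never saw that row during fitting.
    for train_idx, val_idx in skf.split(X, y):
        for j, learner in enumerate(base_learners):
            fold_model = clone(learner).fit(X[train_idx], y[train_idx])
            Z[val_idx, j] = fold_model.predict_proba(X[val_idx])[:, 1]

    # Step 2: meta-learner trained on the OOF meta-features.
    meta = LogisticRegression(solver="lbfgs", max_iter=1000).fit(Z, y)

    # Step 3: refit every base learner on the full training data for deployment.
    fitted = [clone(learner).fit(X, y) for learner in base_learners]
    return fitted, meta, Z


# Synthetic stand-in for the engineered training features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
base = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
]

fitted, meta, Z = train_stacking(X, y, base)
print(Z.shape, meta.coef_.round(2))
```

Using `clone` inside the fold loop ensures that each fold starts from an unfitted copy, so no state from one fold carries over to the next.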
6.2 Single-Instance Prediction Phase
Algorithm: PredictRisk
Input:  Trained ensemble E = (B*, M), new instance x_new in R^p
Output: Risk probability y_hat in [0,1]

// Parallel invocation of all base learners
for j = 1 to m do
    z_j <- predict_proba(b_j*, x_new)
end

// Assemble meta-feature vector
z <- [z_1, z_2, ..., z_m]^T

// Meta-learner final decision
y_hat <- predict_proba(M, z)
return y_hat
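A matching sketch of the prediction phase is given below, again on synthetic data. The helper `predict_risk` is a hypothetical name; to keep the block self-contained, the small ensemble is trained inline, with `cross_val_predict` providing the out-of-fold meta-features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def predict_risk(base_models, meta_model, x_new):
    """Return the positive-class probability for one instance or a batch."""
    x_new = np.atleast_2d(x_new)
    # One meta-feature column per base learner: its positive-class probability.
    z = np.column_stack([b.predict_proba(x_new)[:, 1] for b in base_models])
    return meta_model.predict_proba(z)[:, 1]


# Small inline ensemble so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
base = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=30, random_state=0),
]
# cross_val_predict fits clones internally, giving out-of-fold probabilities.
Z = np.column_stack([
    cross_val_predict(b, X, y, cv=5, method="predict_proba")[:, 1] for b in base
])
meta = LogisticRegression(max_iter=1000).fit(Z, y)
fitted = [b.fit(X, y) for b in base]

risk = predict_risk(fitted, meta, X[0])
print(f"Predicted risk for the first instance: {risk[0]:.3f}")
```

The same function handles batches: passing `X[:5]` returns five probabilities, since the only per-instance state is the row of meta-features.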
Document version: v1.0 | Generated: 2026-05-19