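I. Hardware and Software Environment
Hardware
- CPU: Intel Core i5 or better
- Memory: 8 GB DDR4 or more
- Storage: 256 GB SSD
Software
- Operating system: Windows 10
- Python 3.8, with the following libraries installed: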
pandas==1.3.5 # data processing
scikit-learn==1.0.2 # machine learning tools
matplotlib==3.5.1 # visualization
numpy==1.21.6 # numerical computing
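One way to confirm that an environment matches the pinned versions is to compare them with the installed distributions. The sketch below is illustrative only: the helper name `check_versions` is hypothetical, and it relies on the standard-library `importlib.metadata` module, which is available from Python 3.8.

```python
from importlib import metadata

# Pinned versions listed in this section
REQUIRED = {
    "pandas": "1.3.5",
    "scikit-learn": "1.0.2",
    "matplotlib": "3.5.1",
    "numpy": "1.21.6",
}


def check_versions(required):
    """Return {package: (required, installed, matches)}; installed is None if missing."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_versions(REQUIRED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: required {wanted}, installed {installed} [{status}]")
```

A mismatch does not necessarily mean the code will fail; it only signals that results may differ from those reported here.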
II. Class Balancing
Original class distribution
| Class | Samples | Share |
|--------|-------|----|
| No diabetes (0) | 15696 | 74.7% |
| Diabetes (1) | 5312 | 25.3% |
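The class shares and the imbalance ratio follow directly from the two sample counts. A small worked calculation, using only the counts from the table:

```python
# Sample counts from the class distribution table
counts = {"No diabetes (0)": 15696, "Diabetes (1)": 5312}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} samples, {shares[label]:.1%} of {total}")

# Majority-to-minority ratio, roughly 3:1
imbalance_ratio = counts["No diabetes (0)"] / counts["Diabetes (1)"]
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
```

The majority class makes up about three quarters of the data, which is why the minority class size (5312) becomes the target size for downsampling.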
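Downsampling procedure
- Separate the classes:
Use the df.target field to split the majority class (no diabetes, 0) from the minority class (diabetes, 1).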
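```python
# Derive a binary label: 1 for Type 1 / Type 2, 0 otherwise
df['target'] = df['Diabetes_Type'].apply(lambda x: 1 if x in ['Type 1', 'Type 2'] else 0)
# Separate the majority (0) and minority (1) classes
df_majority = df[df.target == 0]
df_minority = df[df.target == 1]
```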
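- Balance the classes:
```python
from sklearn.utils import resample

df_majority_down = resample(df_majority,
                            replace=False,               # without replacement (resample defaults to replace=True)
                            n_samples=len(df_minority),  # draw 5312 samples
                            random_state=42)             # for reproducibility
```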
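- Rebuild the dataset:
Concatenate the downsampled majority class with the original minority class.
New dataset size: 10624 samples (5312:5312, a 1:1 balance)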
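```python
# Combine into the new, balanced dataset
df_downsampled = pd.concat([df_majority_down, df_minority])
# One-hot encode the categorical columns (the resulting dummy columns are not part of `features` below)
df = pd.get_dummies(df, columns=['Gender','Region','Dietary_Habits'])
features = ['Age','BMI','Fast_Food_Intake','HbA1c','Cholesterol_Level','Stress_Level','Sleep_Hours']
# Build the train and test sets
X_train = df_downsampled[features]
y_train = df_downsampled['target']
# The test set is the full original dataset, so it also contains every training row
X_test = df[features]
y_test = df['target']
```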
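Because the original dataset serves as the test set here, every training row is also scored at evaluation time, which tends to inflate the metrics. A common alternative, sketched below under stated assumptions, is to hold out a stratified test set first and downsample only the training portion. This is not the procedure used in this write-up. The data is synthetic, and the 80/20 split and the names `train_df`, `test_df`, and `train_bal` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data (assumption: about 25% positives)
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "Age": rng.integers(15, 26, size=n),
    "BMI": rng.normal(25, 5, size=n),
    "target": (rng.random(n) < 0.25).astype(int),
})

# 1) Hold out a stratified test set that keeps the original class distribution
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# 2) Downsample the majority class in the training portion only
majority = train_df[train_df["target"] == 0]
minority = train_df[train_df["target"] == 1]
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=42
)
train_bal = pd.concat([majority_down, minority])

features = ["Age", "BMI"]
X_train, y_train = train_bal[features], train_bal["target"]
X_test, y_test = test_df[features], test_df["target"]

print("train class counts:", y_train.value_counts().to_dict())
print("test positive share:", round(float(y_test.mean()), 3))
```

With this ordering the test set still reflects the real class distribution, and no row appears in both the training and the test data.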
III. Model Construction and Evaluation
Random forest configuration
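```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,        # number of trees (more trees give more stable predictions)
    max_depth=8,             # cap tree depth to limit overfitting
    class_weight='balanced'  # meant to offset information lost in downsampling; on a 1:1 training set both weights equal 1
)
```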
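To see what `class_weight='balanced'` contributes, the weights can be computed directly with scikit-learn's `compute_class_weight`, which applies the formula `n_samples / (n_classes * count_per_class)`. The sketch below rebuilds label vectors from the class counts given earlier. It is a check on the weighting only and does not train a model.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.array([0, 1])

# Labels with the 1:1 balance of the downsampled training set
y_balanced = np.array([0] * 5312 + [1] * 5312)
# Labels with the original imbalance
y_original = np.array([0] * 15696 + [1] * 5312)

w_balanced = compute_class_weight("balanced", classes=classes, y=y_balanced)
w_original = compute_class_weight("balanced", classes=classes, y=y_original)

print("weights on the downsampled set:", w_balanced)
print("weights on the original set:", w_original)
```

On the downsampled set both weights come out as 1, so the option only changes training if the model is fitted on imbalanced data.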
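Reading the ROC curve
- How it is generated:
The original dataset is used as the test set (to preserve the real class distribution).
TPR and FPR are computed as the decision threshold on the predicted probability of diabetes sweeps from 0 to 1.
- Key metrics:
AUC: 0.740
Best threshold: 0.48
Metrics at that threshold:
- Sensitivity: 88.6%
- Specificity: 85.3%
These values are given as reported. A ROC point with 88.6% sensitivity and 85.3% specificity bounds the AUC from below by 0.886 × 0.853 ≈ 0.756, so the reported AUC of 0.740 and the threshold metrics cannot describe the same curve and should be rechecked.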
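The write-up does not state how the best threshold was selected. One common criterion, assumed here for illustration, is Youden's J statistic (J = TPR − FPR), which picks the ROC point farthest above the chance diagonal. The sketch below uses a tiny hand-made example. The function name `best_threshold_youden` is hypothetical, and in practice `y_score` would be the predicted probability of the positive class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def best_threshold_youden(y_true, y_score):
    """Return (threshold, sensitivity, specificity) maximizing J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = int(np.argmax(tpr - fpr))
    return thresholds[best], tpr[best], 1.0 - fpr[best]


# Toy labels and scores (not the project's data)
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.9])

auc = roc_auc_score(y_true, y_score)
thr, sens, spec = best_threshold_youden(y_true, y_score)

print(f"AUC = {auc:.3f}")
print(f"threshold = {thr:.2f}, sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

Other criteria, such as a minimum required sensitivity for screening, would generally yield a different threshold.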
IV. Visualization
- ROC curve plot
- Data distribution comparison plot
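Both figures can be produced with matplotlib along the following lines. This is a sketch rather than the original plotting code. The ROC panel uses toy scores in place of the model's predicted probabilities, the bar chart uses the class counts reported earlier, and the output filename is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores standing in for y_test and the predicted probabilities
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(200), 0.0, 1.0)

fpr, tpr, _ = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

fig, (ax_roc, ax_dist) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: ROC curve with the chance diagonal for reference
ax_roc.plot(fpr, tpr, label=f"ROC (AUC = {auc:.3f})")
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
ax_roc.set_xlabel("False positive rate")
ax_roc.set_ylabel("True positive rate")
ax_roc.set_title("ROC curve")
ax_roc.legend(loc="lower right")

# Right panel: class counts before and after downsampling
labels = ["No diabetes (0)", "Diabetes (1)"]
before = [15696, 5312]
after = [5312, 5312]
x = np.arange(len(labels))
width = 0.35
ax_dist.bar(x - width / 2, before, width, label="Original")
ax_dist.bar(x + width / 2, after, width, label="Downsampled")
ax_dist.set_xticks(x)
ax_dist.set_xticklabels(labels)
ax_dist.set_ylabel("Samples")
ax_dist.set_title("Class distribution")
ax_dist.legend()

fig.tight_layout()
fig.savefig("roc_and_distribution.png", dpi=150)
```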

Test data
| ID | Age | Gender | Region | Family_Income | Family_History_Diabetes | Parent_Diabetes_Type | Genetic_Risk_Score | BMI | Physical_Activity_Level | Dietary_Habits | Fast_Food_Intake | Smoking | Alcohol_Consumption | Fasting_Blood_Sugar | HbA1c | Cholesterol_Level | Prediabetes | Diabetes_Type | Sleep_Hours | Stress_Level | Screen_Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 21 | Male | North | 2209393 | No | None | 6 | 31.4 | Sedentary | Moderate | 1 | Yes | No | 95.6 | 9.5 | 163.3 | Yes | None | 7.7 | 7 | 6.8 |
| 2 | 18 | Female | Central | 387650 | No | None | 5 | 24.4 | Active | Unhealthy | 5 | No | No | 164.9 | 5 | 169.1 | Yes | None | 7.9 | 8 | 6 |
| 3 | 25 | Male | North | 383333 | No | None | 6 | 20 | Moderate | Moderate | 2 | No | No | 110.5 | 8.3 | 296.3 | Yes | Type 1 | 7.6 | 8 | 4.6 |
| 4 | 22 | Male | Northeast | 2443733 | No | None | 4 | 39.8 | Moderate | Unhealthy | 4 | No | Yes | 160.7 | 4.6 | 252.8 | No | None | 9.5 | 2 | 10.9 |
| 5 | 19 | Male | Central | 1449463 | No | None | 4 | 19.2 | Moderate | Moderate | 0 | No | Yes | 73.7 | 5.3 | 252.3 | No | None | 6.4 | 2 | 1.3 |
| 6 | 21 | Female | Central | 2142229 | Yes | None | 3 | 36.4 | Active | Unhealthy | 4 | No | No | 78.9 | 4.7 | 249 | Yes | Type 2 | 4 | 3 | 5.2 |
| 7 | 24 | Female | North | 2357529 | No | Type 2 | 7 | 39 | Moderate | Unhealthy | 9 | No | No | 76.1 | 6.1 | 198.4 | Yes | Type 2 | 4 | 9 | 4.9 |
| 8 | 17 | Male | East | 2395436 | No | None | 1 | 24.5 | Moderate | Moderate | 5 | No | Yes | 95.6 | 7.9 | 133.6 | Yes | None | 5 | 6 | 2.1 |
| 9 | 21 | Female | East | 1288764 | No | Type 2 | 3 | 18.1 | Sedentary | Healthy | 0 | No | No | 177.2 | 5.1 | 235.9 | No | None | 9 | 2 | 6.8 |
| 10 | 25 | Female | North | 127666 | Yes | None | 3 | 17.7 | Sedentary | Unhealthy | 4 | No | No | 102.3 | 5.1 | 289.1 | Yes | None | 4.8 | 8 | 10.4 |
| 11 | 25 | Male | Central | 1117881 | No | Type 1 | 5 | 25 | Sedentary | Unhealthy | 4 | No | Yes | 107.7 | 4.4 | 169.4 | No | None | 6.3 | 10 | 8.6 |
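As a concrete check of the label derivation, the `Diabetes_Type` values of the eleven sample rows above can be mapped to the binary target. The sketch uses the vectorized `isin` as an equivalent of the `apply`/`lambda` form and assumes that "None" is stored as a literal string. If a CSV reader parses it as a missing value instead, the result is the same, since a missing value is not in the list and also maps to 0.

```python
import pandas as pd

# Diabetes_Type values of sample rows 1-11 from the table
sample = pd.DataFrame({
    "ID": list(range(1, 12)),
    "Diabetes_Type": ["None", "None", "Type 1", "None", "None", "Type 2",
                      "Type 2", "None", "None", "None", "None"],
})

# 1 for Type 1 / Type 2, 0 otherwise
sample["target"] = sample["Diabetes_Type"].isin(["Type 1", "Type 2"]).astype(int)

print(sample["target"].tolist())
print(sample["target"].value_counts().to_dict())
```

Rows 3, 6, and 7 are labeled positive. Note that `Prediabetes` is "Yes" for several rows labeled 0, so prediabetes alone does not count as diabetes under this definition.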
The project files are available from the author on request.