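Contents
- Introduction: 99.9% accuracy, zero value
- 1. Severity levels of imbalance and the matching strategies
- 2. The accuracy trap: why 99.9% means nothing
- 3. Sampling strategies: from random to synthetic
  - 3.1 Random oversampling and undersampling
  - 3.2 SMOTE: synthetic minority oversampling
  - 3.3 ADASYN: adaptive synthetic sampling
  - 3.4 Combined strategies: undersampling + oversampling
- 4. Cost-sensitive learning: change the priorities, not the data
  - 4.1 The class_weight parameter
  - 4.2 Focal Loss: focusing on hard-to-classify samples
- 5. Threshold tuning: from the mathematical optimum to the business optimum
  - 5.1 The precision-recall curve and the optimal threshold
  - 5.2 Threshold selection driven by a business cost matrix
- 6. Choosing evaluation metrics: the scenario decides
- 7. Cross-validating imbalanced data correctly
  - Wrong: resample first, then cross-validate
  - Right: resample inside the Pipeline during CV
- 8. Case study: comparing complete approaches to credit card fraud detection
- Summary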
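Introduction: 99.9% accuracy, zero value
A fraud detection team receives a model report: 99.9% accuracy. The business side is delighted for about three seconds, until someone asks: "What is the recall?"
The answer is 0%.
The model's strategy is to predict "normal" no matter what. On a dataset where only 0.1% of transactions are fraudulent, that strategy is flawless on accuracy and useless to the business.
Imbalanced data is one of the most common hidden traps in machine learning. Medical diagnosis (3% prevalence), industrial defect detection (0.5% defect rate), credit card fraud (0.17% fraud rate): in all of these, the minority class is exactly the class the business cares about most. A model optimized for accuracy will actively learn to ignore it.
Starting from a grading of imbalance severity, this article walks through sampling strategies, cost-sensitive learning, metric selection, and business-driven threshold tuning, and assembles them into one framework. The message is not "SMOTE fixes everything" but "pick the right tool for the scenario".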
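1. Severity levels of imbalance and the matching strategies
Different degrees of imbalance call for different levels of intervention. Treating every imbalanced problem with the same recipe is an engineering shortcut that rarely pays off.
| Severity | Ratio range | Typical scenario | Core strategy |
|---|---|---|---|
| Mild | 1:5 ~ 1:10 | Churn prediction | class_weight, threshold tuning |
| Moderate | 1:10 ~ 1:100 | Disease diagnosis | SMOTE + class_weight |
| Severe | 1:100 ~ 1:1000 | Fraud detection | Undersampling + SMOTE + Focal Loss |
| Extreme | > 1:1000 | Very rare events | Anomaly detection framework instead of classification |
The first question to answer: is this a classification problem or an anomaly detection problem? Once the positive rate drops below 0.1% (1:1000), a standard classification framework is close to its limit, and it is worth switching to one-class or anomaly detection methods (One-Class SVM, Isolation Forest).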
Flowchart: class imbalance detected, then branch on the positive-class ratio
- > 10% (up to 1:10): mild. Strategy: class_weight + threshold tuning. Metrics: F1 / ROC-AUC
- 1% ~ 10% (1:10 ~ 1:100): moderate. Strategy: SMOTE + class_weight. Metrics: PR-AUC / F1
- 0.1% ~ 1% (1:100 ~ 1:1000): severe. Strategy: undersampling + SMOTE + Focal Loss. Metrics: PR-AUC + business cost matrix
- < 0.1% (beyond 1:1000): extreme. Strategy: switch to an anomaly detection framework. Metrics: Precision@K / AUPR
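The grading above can be turned into a small lookup helper. The following is a minimal sketch, assuming the percentage bands from the flowchart (10%, 1%, 0.1%); the function name `imbalance_level` and the `STRATEGY` dictionary are illustrative and not part of any library.

```python
def imbalance_level(n_positive: int, n_total: int) -> str:
    """Map a positive-class ratio to the severity bands used in this article."""
    if n_total <= 0 or not 0 < n_positive <= n_total:
        raise ValueError("need 0 < n_positive <= n_total")
    ratio = n_positive / n_total
    if ratio > 0.10:
        return "mild"
    if ratio >= 0.01:
        return "moderate"
    if ratio >= 0.001:
        return "severe"
    return "extreme"


# Suggested starting points, copied from the table above
STRATEGY = {
    "mild": "class_weight + threshold tuning",
    "moderate": "SMOTE + class_weight",
    "severe": "undersampling + SMOTE + Focal Loss",
    "extreme": "anomaly detection framework",
}

level = imbalance_level(n_positive=17, n_total=10_000)  # a 0.17% positive rate
print(level, "->", STRATEGY[level])
```

The bands are a rule of thumb for choosing where to start, not hard boundaries; a dataset that sits near a cut-off may reasonably be handled with either neighboring strategy.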
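2. The accuracy trap: why 99.9% means nothing
This is the first lesson in handling imbalanced data, and the one most often overlooked.
The definition of accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN)
In a scenario where positives and negatives occur at a ratio of 1:999: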
python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.metrics import precision_recall_curve, average_precision_score

# Simulate a 1:999 imbalanced dataset (10,000 samples, 10 positives)
n_total = 10000
n_positive = 10
n_negative = n_total - n_positive
y_true = np.array([1] * n_positive + [0] * n_negative)

# Strategy 1: predict everything as negative (the "dumb" model)
y_pred_all_negative = np.zeros(n_total, dtype=int)

# Strategy 2: random guessing at the true class ratio
np.random.seed(42)
y_pred_random = np.random.choice([0, 1], size=n_total,
                                 p=[n_negative/n_total, n_positive/n_total])

print("=== Dumb model (always predicts negative) ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_all_negative):.4f}")
print(f"Recall:   {recall_score(y_true, y_pred_all_negative):.4f}")
# zero_division=0 avoids the warning raised when no positives are predicted
print(f"F1:       {f1_score(y_true, y_pred_all_negative, zero_division=0):.4f}")
# Output: accuracy 0.9990, recall 0.0000, F1 0.0000

print("=== Random guessing at the true class ratio ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_random):.4f}")
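An accuracy of 99.9% coexists with a recall of 0%. Under imbalance, accuracy is a misleading metric: a very high value can be reached purely by ignoring the minority class, so it says nothing about whether the model finds the cases that matter.
Metrics that should be used for imbalanced problems:
| Metric | Formula | When to use | Limitation |
|---|---|---|---|
| Recall | TP / (TP + FN) | Missing a positive is very costly (e.g. cancer screening) | Reaches 100% by predicting everything as positive |
| Precision | TP / (TP + FP) | False alarms are costly (e.g. spam filtering) | Can reach 100% by flagging only the most certain samples |
| F1 Score | 2×P×R / (P+R) | Precision and recall must be balanced | Ignores TN; less suitable under extreme imbalance |
| PR-AUC | Area under the precision-recall curve | Extreme imbalance | Its baseline equals the positive rate, so values are hard to compare across datasets |
| ROC-AUC | Area under the ROC curve | Mild to moderate imbalance | Can look overly optimistic under extreme imbalance |
Why ROC-AUC looks inflated under extreme imbalance: the x-axis of the ROC curve is the false positive rate, FPR = FP / (FP + TN). When negatives vastly outnumber positives, TN is huge, so FPR stays low even when there are many false positives, and ROC-AUC stays high. The PR curve uses Precision = TP / (TP + FP), which compares FP directly against TP and therefore reacts much more strongly to false positives. That makes PR-AUC the preferred metric for extremely imbalanced problems.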
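A quick numeric check makes the difference concrete. The counts below are hypothetical (they do not come from any real model): 10 positives among 10,000 samples, with the model catching 8 of them at the price of 100 false alarms.

```python
# Hypothetical confusion-matrix counts for a 1:999 dataset
n_pos, n_neg = 10, 9_990
tp, fp = 8, 100
fn = n_pos - tp      # 2 missed positives
tn = n_neg - fp      # 9,890 correctly rejected negatives

tpr = tp / (tp + fn)         # recall, the y-axis of the ROC curve
fpr = fp / (fp + tn)         # the x-axis of the ROC curve
precision = tp / (tp + fp)   # the y-axis of the PR curve

print(f"TPR (recall): {tpr:.3f}")
print(f"FPR:          {fpr:.4f}")
print(f"Precision:    {precision:.4f}")
```

In ROC terms this operating point looks excellent (high recall at an FPR of about 1%), while the precision shows that more than nine out of ten alerts are false alarms. Both views describe the same predictions; only the PR view exposes the alert quality.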
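3. Sampling strategies: from random to synthetic
Resampling is the most intuitive way to handle imbalance: change the class ratio of the training set. Each sampling strategy has its own side effects, though.
3.1 Random oversampling and undersampling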
python
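from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Build an imbalanced dataset. flip_y=0 disables random label flipping,
# so the class counts follow `weights` instead of being blurred by label noise.
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.99, 0.01], flip_y=0, random_state=42)
print(f"Original distribution: {Counter(y)}")  # roughly {0: 9900, 1: 100}

# Random oversampling: duplicate minority samples
# sampling_strategy=0.1 means minority / majority = 0.1 after resampling
ros = RandomOverSampler(sampling_strategy=0.1, random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
print(f"After oversampling: {Counter(y_ros)}")   # minority grows to about 990

# Random undersampling: drop majority samples
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print(f"After undersampling: {Counter(y_rus)}")  # majority shrinks to about 1000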
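The core problem with random oversampling: duplicated samples add no new information, so the model tends to overfit them. Minority samples near the decision boundary get memorized, and generalization suffers.
The core problem with random undersampling: discarding a large share of the majority class throws away real information about the data distribution, which is especially damaging on small datasets.
3.2 SMOTE: synthetic minority oversampling
SMOTE (Synthetic Minority Over-sampling Technique) is one of the most widely used oversampling methods. Its core idea is to create new samples by interpolating between existing minority samples rather than copying them.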
Flowchart: the SMOTE procedure
- Pick a minority-class sample x_i
- Find its K nearest neighbors in feature space
- Randomly choose one neighbor x_j
- Interpolate at a random position between x_i and x_j
- Emit the new synthetic sample: x_new = x_i + λ × (x_j - x_i), with λ drawn from U(0, 1)
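The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the formula above, not the imbalanced-learn implementation; the helper name `smote_one` and the toy data are assumptions made for the example, and neighbors are searched among minority samples only.

```python
import numpy as np


def smote_one(X_min, i, k=5, rng=None):
    """Create one synthetic sample from minority point i (simplified SMOTE step)."""
    rng = np.random.default_rng() if rng is None else rng
    x_i = X_min[i]
    # Euclidean distance from x_i to every minority sample
    dists = np.linalg.norm(X_min - x_i, axis=1)
    dists[i] = np.inf                        # never pick x_i as its own neighbor
    neighbor_idx = np.argsort(dists)[:k]     # indices of the k nearest neighbors
    j = rng.choice(neighbor_idx)             # random neighbor x_j
    lam = rng.uniform(0.0, 1.0)              # lambda ~ U(0, 1)
    return x_i + lam * (X_min[j] - x_i)      # point on the segment x_i -> x_j


rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))             # toy minority class, 20 points in 2-D
x_new = smote_one(X_min, i=0, k=5, rng=rng)
print("synthetic sample:", x_new)
```

Because the new point is a convex combination of two existing minority points, it always lies on the line segment between them. That is also the source of SMOTE's main risk: if x_i or x_j is a noisy point inside majority territory, the synthetic sample lands there too.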
python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

# Standard SMOTE
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_smote)}")

# Borderline-SMOTE: synthesize only around minority samples near the boundary.
# A borderline sample is a minority sample for which at least half, but not all,
# of its neighbors belong to the majority class (all-majority means noise).
bsmote = BorderlineSMOTE(sampling_strategy=0.5, random_state=42, kind='borderline-1')
X_bsmote, y_bsmote = bsmote.fit_resample(X, y)

# SVM-SMOTE: synthesize around the support vectors near an SVM decision boundary
svmsmote = SVMSMOTE(sampling_strategy=0.5, random_state=42)
X_svmsmote, y_svmsmote = svmsmote.fit_resample(X, y)
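3.3 ADASYN: adaptive synthetic sampling
ADASYN (Adaptive Synthetic Sampling) extends SMOTE with a "difficulty weight": minority samples that have more majority-class neighbors (and are therefore harder to classify) receive more synthetic samples, which shifts attention toward the boundary region.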
python
# ADASYN: generate more samples in hard-to-classify regions
adasyn = ADASYN(sampling_strategy=0.5, n_neighbors=5, random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X, y)
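To see what the difficulty weight does, the sketch below computes, for every minority sample, the share of majority-class points among its K nearest neighbors and splits a budget of synthetic samples in proportion to it. It illustrates the weighting idea only and is not the imbalanced-learn code; `adasyn_allocation` and the toy data are assumptions for the example.

```python
import numpy as np


def adasyn_allocation(X, y, n_synthetic, k=5):
    """Split n_synthetic samples across minority points by local difficulty."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    # Distances from each minority point to every point in the dataset
    d = np.linalg.norm(X[minority_idx, None, :] - X[None, :, :], axis=2)
    d[np.arange(len(minority_idx)), minority_idx] = np.inf   # exclude self
    nn = np.argsort(d, axis=1)[:, :k]                        # k nearest neighbors
    r = (y[nn] == 0).mean(axis=1)        # share of majority-class neighbors
    if r.sum() == 0:                     # no hard points: spread evenly
        return minority_idx, np.full(len(minority_idx), n_synthetic // len(minority_idx))
    r_hat = r / r.sum()                  # normalise to a distribution
    g = np.rint(r_hat * n_synthetic).astype(int)   # rounding may shift the total slightly
    return minority_idx, g


rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(50, 2))        # majority cluster around the origin
X_min_easy = rng.normal(10.0, 0.5, size=(9, 2))   # minority cluster far away
X_min_hard = np.array([[0.0, 0.0]])               # one minority point inside the majority
X_toy = np.vstack([X_maj, X_min_easy, X_min_hard])
y_toy = np.array([0] * 50 + [1] * 10)

idx, g = adasyn_allocation(X_toy, y_toy, n_synthetic=20, k=5)
print(dict(zip(idx.tolist(), g.tolist())))
```

In this toy setup the whole budget goes to the single minority point that is surrounded by majority samples. That is the intended behavior, and also the risk listed in the table below: if that point is label noise, ADASYN amplifies it.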
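Choosing within the SMOTE family:
| Method | Core mechanism | Best suited for | Risk |
|---|---|---|---|
| Standard SMOTE | Interpolation with a random neighbor | General use, fairly regular data distributions | May generate samples in noisy regions |
| Borderline-SMOTE | Interpolation in the boundary region | Minority information is scarce near the boundary | Sensitive to how the boundary is defined |
| SVM-SMOTE | Synthesis around SVM support vectors | Data with reasonably good linear separability | Higher computational cost |
| ADASYN | Adaptive density-based weights | Unevenly distributed minority class | May focus too much on noisy samples |
3.4 Combined strategies: undersampling + oversampling
Under severe or extreme imbalance, a single strategy often has limited effect. Combining oversampling with a cleaning or undersampling step is a common choice in practice: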
python
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline

# SMOTE + Tomek links (removes sample pairs that blur the class boundary)
smotetomek = SMOTETomek(sampling_strategy=0.5, random_state=42)
X_st, y_st = smotetomek.fit_resample(X, y)

# SMOTE + ENN (Edited Nearest Neighbours: removes samples that their
# neighbors would misclassify)
smoteenn = SMOTEENN(sampling_strategy=0.5, random_state=42)
X_se, y_se = smoteenn.fit_resample(X, y)

print(f"SMOTE+Tomek: {Counter(y_st)}")
print(f"SMOTE+ENN: {Counter(y_se)}")
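4. Cost-sensitive learning: change the priorities, not the data
Sampling strategies influence the model indirectly by changing the training distribution. Cost-sensitive learning acts directly on the loss function and gives the minority class a higher weight: "missing one fraud costs far more than flagging one legitimate transaction".
4.1 The class_weight parameter
Many scikit-learn classifiers (for example LogisticRegression, SVC, and RandomForestClassifier) accept a class_weight parameter; GradientBoostingClassifier is a notable exception. In 'balanced' mode the weights are computed automatically from the class frequencies: $w_c = \frac{n\_samples}{n\_classes \times n\_samples_c}$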
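The formula is easy to verify by hand. The sketch below applies it to a 9,900 : 100 label vector using only the standard library; `balanced_class_weights` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(y):
    """w_c = n_samples / (n_classes * n_samples_c), as in the formula above."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}


y_example = [0] * 9900 + [1] * 100
weights = balanced_class_weights(y_example)
print(weights)
print("weight ratio minority / majority:", weights[1] / weights[0])
```

The absolute values (about 0.505 and 50) matter less than their ratio of 99: each minority sample counts as much in the loss as 99 majority samples, which exactly offsets the 1:99 class ratio.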
python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

# Automatically balanced weights
clf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # for 1:99 data: about {0: 0.505, 1: 50}, a 1:99 weight ratio
    random_state=42
)

# Custom weights based on business cost
# Assumption: missing a fraud costs 10 times as much as flagging a legitimate transaction
fraud_cost_ratio = 10
clf_custom = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: fraud_cost_ratio},
    random_state=42
)

# Logistic regression supports the same parameter
lr_balanced = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)
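4.2 Focal Loss: focusing on hard-to-classify samples
Focal Loss was proposed by Facebook AI Research, originally to deal with class imbalance in object detection. Its core idea is to down-weight samples that are already easy to classify so that training concentrates on the hard ones.
Standard cross-entropy: $CE(p, y) = -y\log(p) - (1-y)\log(1-p)$
Focal Loss: $FL(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t)$, where $p_t = p$ if $y = 1$ and $p_t = 1 - p$ otherwise, i.e. the probability the model assigns to the true class.
- $(1-p_t)^\gamma$: the modulating factor; it shrinks the loss of samples that are already predicted correctly with high confidence ($p_t \to 1$)
- $\gamma$ (focusing parameter): usually set to 2; the larger it is, the stronger the focus on hard samples
- $\alpha_t$: the class weight that balances positives and negatives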
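A few numbers show how strongly the modulating factor acts. The sketch below evaluates the formula for three values of $p_t$ with $\gamma = 2$ and $\alpha_t = 1$ (the class weight is switched off here so that only the focusing effect is visible); `focal_term` is an illustrative helper.

```python
import math


def focal_term(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for a single sample."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


for p_t in (0.95, 0.5, 0.1):
    ce = -math.log(p_t)        # plain cross-entropy for the same sample
    fl = focal_term(p_t)
    print(f"p_t={p_t:.2f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.4f}")
```

The ratio FL/CE equals $(1-p_t)^\gamma$: an easy sample at $p_t = 0.95$ keeps only 0.25% of its cross-entropy loss, while a hard sample at $p_t = 0.1$ keeps 81%. With $\gamma = 0$ the focal term reduces to ordinary (weighted) cross-entropy.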
python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Binary focal loss computed on raw logits."""
    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, inputs, targets):
        targets = targets.float()
        bce_loss = F.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none'
        )
        pt = torch.exp(-bce_loss)  # probability assigned to the true class
        # alpha_t from the formula: alpha for positives, 1 - alpha for negatives
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * bce_loss
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

# Usage during neural network training
# criterion = FocalLoss(alpha=0.25, gamma=2.0)
# loss = criterion(logits, labels)
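Outside deep learning (for example with scikit-learn estimators), Focal Loss is not directly available, but its effect can be approximated through the sample_weight parameter: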
python
from sklearn.linear_model import LogisticRegression
import numpy as np

def compute_focal_sample_weights(y_pred_proba, y_true, gamma=2.0):
    """Compute sample weights from predicted probabilities to mimic Focal Loss."""
    # Predicted probability of the true class
    pt = np.where(y_true == 1, y_pred_proba, 1 - y_pred_proba)
    # Harder samples (low pt) get larger weights
    weights = (1 - pt) ** gamma
    return weights

# Usage: train a base model to obtain pt, then retrain with the focal weights
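The two-stage procedure mentioned in the last comment could look like the sketch below. It is one possible reading of that comment, not a standard scikit-learn recipe: the data (`X_demo`, `y_demo`), the choice of logistic regression, the use of out-of-fold probabilities, and the normalisation of the weights are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy imbalanced data (about 5% positives), for illustration only
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], flip_y=0,
                                     random_state=0)

# Stage 1: out-of-fold probabilities from a base model, so that p_t is not
# computed on samples the model has already been fitted to
base = LogisticRegression(max_iter=1000)
proba = cross_val_predict(base, X_demo, y_demo, cv=5,
                          method="predict_proba")[:, 1]

# Focal-style weights: (1 - p_t)^gamma, larger for hard samples
gamma = 2.0
p_t = np.where(y_demo == 1, proba, 1 - proba)
focal_w = (1 - p_t) ** gamma
focal_w = focal_w / focal_w.mean()   # keep the average weight at 1

# Stage 2: retrain with the weights
reweighted = LogisticRegression(max_iter=1000)
reweighted.fit(X_demo, y_demo, sample_weight=focal_w)
print("largest weight:", focal_w.max().round(2),
      "| smallest weight:", focal_w.min().round(4))
```

Using out-of-fold predictions is a design choice: weights derived from in-sample predictions of a flexible model (a random forest, say) would be close to zero almost everywhere, because the model has memorized its training points.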
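5. Threshold tuning: from the mathematical optimum to the business optimum
The default classification threshold of 0.5 rests on an implicit assumption: misclassifying a positive costs the same as misclassifying a negative. In imbalanced problems that assumption almost never holds.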
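The link between costs and threshold can be made explicit. If the model outputs a calibrated probability p and correct decisions cost nothing, predicting "positive" is cheaper whenever the expected cost of a miss, p × cost_fn, exceeds the expected cost of a false alarm, (1 - p) × cost_fp. Solving for p gives the threshold computed below; `cost_optimal_threshold` is an illustrative helper under exactly these assumptions.

```python
def cost_optimal_threshold(cost_fn: float, cost_fp: float) -> float:
    """Threshold t* = cost_fp / (cost_fp + cost_fn) for calibrated probabilities."""
    if cost_fn <= 0 or cost_fp <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


print(cost_optimal_threshold(cost_fn=1, cost_fp=1))             # equal costs
print(round(cost_optimal_threshold(cost_fn=10, cost_fp=1), 4))  # misses cost 10x more
```

Equal costs reproduce the familiar 0.5; a miss that is ten times as expensive pushes the threshold down to 1/11. Models trained with class_weight or on resampled data generally do not output calibrated probabilities, which is why the empirical threshold searches in 5.1 and 5.2 are the more reliable route in practice.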
5.1 The precision-recall curve and the optimal threshold
python
from sklearn.metrics import precision_recall_curve, f1_score, fbeta_score
import matplotlib.pyplot as plt
import numpy as np

def find_optimal_threshold(y_true, y_pred_proba, beta=1.0):
    """
    Find the classification threshold that maximizes F-beta.
    beta < 1: favors precision (false alarms are costly, e.g. spam filtering)
    beta = 1: F1, balances precision and recall
    beta > 1: favors recall (misses are costly, e.g. cancer screening)
    """
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_pred_proba)
    # F-beta score for each threshold
    # Note: precisions and recalls have one more element than thresholds
    fbeta_scores = []
    for p, r in zip(precisions[:-1], recalls[:-1]):
        if p + r > 0:
            fb = (1 + beta**2) * p * r / (beta**2 * p + r)
        else:
            fb = 0
        fbeta_scores.append(fb)
    best_idx = np.argmax(fbeta_scores)
    best_threshold = thresholds[best_idx]
    best_f_beta = fbeta_scores[best_idx]
    return best_threshold, best_f_beta, precisions, recalls, thresholds

# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

# The threshold is searched on the test split here for brevity; in practice,
# pick it on a separate validation split and report results on the test split.
# F1-optimal threshold, and F2-optimal threshold (fraud detection favors recall)
thresh_f1, f1_val, prec, rec, thresh = find_optimal_threshold(y_test, y_proba, beta=1.0)
thresh_f2, f2_val, _, _, _ = find_optimal_threshold(y_test, y_proba, beta=2.0)
print(f"F1-optimal threshold: {thresh_f1:.3f}, F1={f1_val:.3f}")
print(f"F2-optimal threshold (favors recall): {thresh_f2:.3f}, F2={f2_val:.3f}")
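5.2 Threshold selection driven by a business cost matrix
In real business settings the threshold is not chosen for a mathematical optimum but for the lowest business cost.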
python
def business_cost_optimal_threshold(y_true, y_pred_proba,
                                    cost_fn=10, cost_fp=1):
    """
    Find the threshold that minimizes total cost under a business cost matrix.
    cost_fn: cost of a miss (false negative = an undetected fraud)
    cost_fp: cost of a false alarm (false positive = a flagged legitimate transaction)
    Fraud detection example:
    - Missed fraud (FN): the real amount is lost, e.g. 1000 yuan
    - False alarm (FP): manual review cost, e.g. 100 yuan
    -> cost_fn=10, cost_fp=1
    """
    # precision_recall_curve is used only to enumerate the candidate thresholds
    _, _, thresholds = precision_recall_curve(y_true, y_pred_proba)
    # Total cost from the confusion-matrix counts at each threshold
    total_costs = []
    for thresh in thresholds:
        y_pred = (y_pred_proba >= thresh).astype(int)
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        total_cost = fn * cost_fn + fp * cost_fp
        total_costs.append(total_cost)
    best_idx = np.argmin(total_costs)
    return thresholds[best_idx], total_costs[best_idx]

# Fraud scenario: a miss costs 10 times as much as a false alarm
optimal_thresh, min_cost = business_cost_optimal_threshold(
    y_test, y_proba, cost_fn=10, cost_fp=1
)
print(f"Cost-optimal threshold: {optimal_thresh:.3f}, total cost: {min_cost}")
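Threshold tuning requires no retraining; it only changes the decision rule applied to the predicted scores. That makes it one of the cheapest and most direct techniques for handling imbalance.
6. Choosing evaluation metrics: the scenario decides
Using the wrong metric for the scenario can lead to a completely wrong choice between models.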
Flowchart: choosing evaluation metrics by degree of class imbalance
- Mild (1:5 ~ 1:10): F1 Score + ROC-AUC
- Moderate (1:10 ~ 1:100): F1 Score + PR-AUC
- Severe (1:100 ~ 1:1000): PR-AUC as the primary metric, then refine by business scenario:
  - Misses are costly: Recall + PR-AUC + F2
  - False alarms are costly: Precision + PR-AUC + F0.5
  - Business costs are known: custom cost matrix
- Extreme (> 1:1000): switch to anomaly detection and use AUPR / Precision@K
python
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score,
    f1_score, fbeta_score, recall_score, precision_score
)

def comprehensive_eval(y_true, y_pred, y_proba, imbalance_ratio, beta=1.0):
    """Evaluate an imbalanced classifier from several angles."""
    print("=" * 50)
    print(f"Imbalance ratio is about 1:{imbalance_ratio:.0f}")
    print("=" * 50)
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall: {recall_score(y_true, y_pred):.4f}")
    print(f"F1: {f1_score(y_true, y_pred):.4f}")
    print(f"F{beta}: {fbeta_score(y_true, y_pred, beta=beta):.4f}")
    if y_proba is not None:
        roc = roc_auc_score(y_true, y_proba)
        pr_auc = average_precision_score(y_true, y_proba)
        print(f"ROC-AUC: {roc:.4f}")
        print(f"PR-AUC: {pr_auc:.4f}")
        if imbalance_ratio > 50:
            print(f"\n⚠️ Heavily imbalanced: PR-AUC ({pr_auc:.4f}) is the key metric")
            print(f" ROC-AUC ({roc:.4f}) may be inflated and is of limited use")
    print("\nConfusion matrix:")
    cm = confusion_matrix(y_true, y_pred)
    print(f" TN={cm[0,0]}, FP={cm[0,1]}")
    print(f" FN={cm[1,0]}, TP={cm[1,1]}")

# Compute the imbalance ratio (negatives per positive)
pos_ratio = y_test.mean()
imbalance = (1 - pos_ratio) / pos_ratio
y_pred_default = (y_proba >= 0.5).astype(int)
y_pred_optimal = (y_proba >= thresh_f1).astype(int)

print("\n--- Default threshold 0.5 ---")
comprehensive_eval(y_test, y_pred_default, y_proba, imbalance, beta=2)
print("\n--- F1-optimal threshold ---")
comprehensive_eval(y_test, y_pred_optimal, y_proba, imbalance, beta=2)
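7. Cross-validating imbalanced data correctly
This is the easiest mistake to make when handling imbalance: resampling the whole dataset before cross-validation, instead of resampling only the training fold of each split.
Wrong: resample first, then cross-validate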
python
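# ❌ Wrong: apply SMOTE to the full dataset, then cross-validate.
# Validation folds end up containing synthetic samples interpolated from
# points in the training folds: data leakage!
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
cv = StratifiedKFold(n_splits=5)
scores = cross_val_score(clf, X_resampled, y_resampled, cv=cv, scoring='f1')
# Scores are inflated: the validation folds contain synthetic samples
# that the model has effectively already "seen"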
Right: resample inside the Pipeline during CV
python
from imblearn.pipeline import Pipeline  # note: use imblearn's Pipeline
from sklearn.model_selection import StratifiedKFold, cross_validate

# ✅ Right: the sampler is a Pipeline step, so in every CV split it is
# fitted and applied on that split's training fold only
pipeline = Pipeline([
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
    ('clf', RandomForestClassifier(n_estimators=100,
                                   class_weight='balanced',
                                   random_state=42))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    pipeline, X, y, cv=cv,
    scoring=['f1', 'average_precision', 'recall'],
    return_train_score=False
)
print(f"F1: {results['test_f1'].mean():.4f} ± {results['test_f1'].std():.4f}")
print(f"PR-AUC: {results['test_average_precision'].mean():.4f} ± {results['test_average_precision'].std():.4f}")
print(f"Recall: {results['test_recall'].mean():.4f} ± {results['test_recall'].std():.4f}")
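Why imblearn's Pipeline is required: scikit-learn's standard Pipeline does not support the fit_resample interface. Only imblearn's Pipeline calls SMOTE's fit_resample during the fit stage of each CV split and skips the sampler at prediction time, so resampling happens on the training fold only and the validation fold keeps its original distribution.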
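What the Pipeline does inside cross-validation can be spelled out as an explicit loop. The sketch below uses only scikit-learn and NumPy, with a hand-written duplication sampler standing in for SMOTE; `oversample_minority`, the toy data, and the choice of logistic regression are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X_tr, y_tr, ratio, rng):
    """Stand-in sampler: duplicate minority rows until minority/majority ~= ratio."""
    pos = np.flatnonzero(y_tr == 1)
    neg = np.flatnonzero(y_tr == 0)
    n_extra = max(int(ratio * len(neg)) - len(pos), 0)
    extra = rng.choice(pos, size=n_extra, replace=True)
    keep = np.concatenate([neg, pos, extra])
    return X_tr[keep], y_tr[keep]


X_cv, y_cv = make_classification(n_samples=3000, n_features=10,
                                 weights=[0.97, 0.03], flip_y=0, random_state=0)
rng = np.random.default_rng(0)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores, test_pos_rates = [], []
for train_idx, test_idx in splitter.split(X_cv, y_cv):
    # Resample the training fold only
    X_res, y_res = oversample_minority(X_cv[train_idx], y_cv[train_idx], 0.5, rng)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    # Score on the untouched validation fold
    proba = model.predict_proba(X_cv[test_idx])[:, 1]
    scores.append(average_precision_score(y_cv[test_idx], proba))
    test_pos_rates.append(y_cv[test_idx].mean())

print(f"PR-AUC per fold: {np.round(scores, 3)}")
print(f"positive rate in validation folds: {np.round(test_pos_rates, 4)}")
```

Every validation fold keeps the original positive rate of about 3%, which is what makes the resulting scores representative of production data. The Pipeline version is preferable in real code because it also works unchanged with `GridSearchCV` and `cross_validate`.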
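8. Case study: comparing complete approaches to credit card fraud detection
Credit card fraud detection (a fraud rate of 0.17%, roughly 1:590, which is severe by the grading above and close to extreme) serves as the example for a systematic comparison of the strategies.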
python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_validate
import numpy as np

# ---- Data preparation ----
# Simulated fraud data from sklearn (the Kaggle creditcard dataset can be used instead)
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=284807, n_features=30, n_informative=15,
    weights=[0.9983, 0.0017],  # close to the real fraud rate
    flip_y=0,                  # no label noise; the default 0.01 would inflate the positive rate
    random_state=42
)
print(f"Dataset shape: {X.shape}")
print(f"Fraud rate: {y.mean():.4f} ({y.sum()} frauds / {len(y)} transactions)")

# ---- Strategy comparison ----
strategies = {
    "Baseline (no handling)": RandomForestClassifier(
        n_estimators=100, random_state=42),
    "class_weight='balanced'": RandomForestClassifier(
        n_estimators=100, class_weight='balanced', random_state=42),
    "SMOTE + RF": Pipeline([
        ('smote', SMOTE(sampling_strategy=0.1, random_state=42)),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),
    "SMOTE + class_weight (combined)": Pipeline([
        ('smote', SMOTE(sampling_strategy=0.1, random_state=42)),
        ('clf', RandomForestClassifier(
            n_estimators=100, class_weight='balanced', random_state=42))
    ]),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
print("\n{:<34} {:>8} {:>8} {:>8}".format("Strategy", "PR-AUC", "Recall", "F1"))
print("-" * 62)
for name, model in strategies.items():
    results = cross_validate(
        model, X, y, cv=cv,
        scoring=['average_precision', 'recall', 'f1'],
    )
    print("{:<34} {:>8.4f} {:>8.4f} {:>8.4f}".format(
        name,
        results['test_average_precision'].mean(),
        results['test_recall'].mean(),
        results['test_f1'].mean()
    ))
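Practical rules of thumb for choosing a strategy:
- Try class_weight='balanced' first: it costs almost nothing extra and usually beats doing nothing
- SMOTE brings limited gains for tree models: their built-in class_weight mechanism is more efficient, and SMOTE tends to help linear models more
- Threshold tuning is often the most effective lever: with the same model, moving the threshold from 0.5 to the business optimum can raise Recall by 20-40%, depending on the data
- PR-AUC is the core benchmark under heavy imbalance: two models with the same ROC-AUC can differ in PR-AUC by a factor of 2-3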
Flowchart: recommended workflow for an extremely imbalanced classification task
- Step 1: class_weight='balanced' as a zero-cost baseline
- Is PR-AUC < 0.3? If yes, continue with step 2; if no, go straight to step 3
- Step 2: SMOTE Pipeline (moderate intervention)
- Step 3: threshold tuning (find the optimum on the precision-recall curve)
- Are the business costs known? If yes, use a cost-matrix-driven threshold (business optimum); if no, use the F1-optimal or F-beta-optimal threshold (mathematical optimum)
- Evaluate with PR-AUC + Recall + business cost
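Summary
Handling imbalanced data is not a single decision about which sampling method to pick. It is an end-to-end engineering effort that spans data processing, model training, evaluation metrics, and threshold tuning.
Core principles:
- Accuracy is misleading under imbalance; use PR-AUC and Recall instead
- The severity of the imbalance determines the level of intervention: class_weight is enough for mild cases, and resampling such as SMOTE only becomes necessary as the imbalance grows
- Resampling must happen inside the Pipeline; resampling outside CV causes data leakage
- Threshold tuning is the most efficient optimization lever because it needs no retraining
- Let the business cost matrix drive the threshold: aim for the business optimum, not the mathematical one
Module 2 of this series starts with imbalanced data on purpose: real-world data is almost never perfectly balanced, and mastering this foundation is a prerequisite for handling real business scenarios.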