Handling Imbalanced Data in Practice: Sampling Strategies, Cost-Sensitive Learning, Evaluation Metrics, and Business Scenarios

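Introduction: 99.9% accuracy, and no value at all

A fraud-detection team receives a model report: accuracy 99.9%. The business side is happy for about three seconds, until someone asks: "What's the recall?"

The answer is 0%.

The model's strategy is to predict "normal" no matter what. On a dataset where only 0.1% of records are fraud, that strategy scores almost perfectly on accuracy and is useless to the business.

Class imbalance is one of the most common hidden traps in machine learning. Medical diagnosis (3% prevalence), industrial defect detection (0.5% defect rate), credit card fraud (0.17% fraud rate): in all of these, the minority class is exactly the class the business cares about. A model trained to maximize accuracy can learn to ignore it.

This article starts from a grading of imbalance severity, then works through sampling strategies, cost-sensitive learning, metric selection, and business-driven threshold tuning. The goal is a complete framework: not "SMOTE fixes everything", but "pick the right tool for the scenario".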


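1. Grading imbalance severity and matching strategies

Different degrees of imbalance call for different levels of intervention. Treating every imbalanced problem with the same recipe is an engineering shortcut.

| Severity | Ratio range | Typical scenario | Core strategy |
|---|---|---|---|
| Mild | 1:5 to 1:10 | Churn prediction | class_weight, threshold adjustment |
| Moderate | 1:10 to 1:100 | Disease diagnosis | SMOTE + class_weight |
| Severe | 1:100 to 1:1000 | Fraud detection | Undersampling + SMOTE + Focal Loss |
| Extreme | > 1:1000 | Very rare events | Anomaly-detection framing rather than classification |

The first question to ask: is this a classification problem or an anomaly-detection problem? Once the positive share drops below 0.1% (1:1000), standard classification is near its limits, and it is worth considering one-class or anomaly-detection methods such as One-Class SVM or Isolation Forest.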
[Flowchart: choosing a strategy from the positive-class share]
Class imbalance detected → check the positive-class share:
  > 10% (within 1:10) → Mild: class_weight + threshold adjustment → metrics: F1 / ROC-AUC
  1% to 10% (1:10 to 1:100) → Moderate: SMOTE + class_weight → metrics: PR-AUC / F1
  0.1% to 1% (1:100 to 1:1000) → Severe: undersampling + SMOTE + Focal Loss → metrics: PR-AUC + business cost matrix
  < 0.1% (beyond 1:1000) → Extreme: switch to an anomaly-detection framing → metrics: Precision@K / AUPR
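The grading can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name `imbalance_tier` is hypothetical, and how the exact boundary values (10%, 1%, 0.1%) are assigned is a choice made here; the tiers and recommendations themselves are the ones listed in this section.

```python
def imbalance_tier(pos_ratio: float) -> tuple[str, str, str]:
    """Map the positive-class share to (tier, strategy, metrics)."""
    if not 0.0 < pos_ratio <= 0.5:
        raise ValueError("pos_ratio must be in (0, 0.5]")
    if pos_ratio > 0.10:
        return ("mild", "class_weight + threshold adjustment", "F1 / ROC-AUC")
    if pos_ratio >= 0.01:
        return ("moderate", "SMOTE + class_weight", "PR-AUC / F1")
    if pos_ratio >= 0.001:
        return ("severe", "undersampling + SMOTE + Focal Loss",
                "PR-AUC + business cost matrix")
    return ("extreme", "switch to an anomaly-detection framing",
            "Precision@K / AUPR")


# Example: the four prevalence levels mentioned in this article's scenarios
for ratio in (0.2, 0.03, 0.0017, 0.0005):
    tier, strategy, metrics = imbalance_tier(ratio)
    print(f"{ratio:.4f} -> {tier}: {strategy} | {metrics}")
```

In practice the positive share would come from the training labels, for example `y_train.mean()` for 0/1 labels.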


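2. The accuracy trap: why 99.9% means nothing

This is the first lesson in handling imbalanced data, and the one most often skipped.

What accuracy actually measures: Accuracy = (TP + TN) / (TP + TN + FP + FN)

With a positive-to-negative ratio of 1:999: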

python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Simulate a 1:999 imbalanced dataset (10,000 samples, 10 positives)
n_total = 10000
n_positive = 10
n_negative = n_total - n_positive

y_true = np.array([1] * n_positive + [0] * n_negative)

# Strategy 1: predict negative for everything (the trivial model)
y_pred_all_negative = np.zeros(n_total, dtype=int)

# Strategy 2: random guessing at the true class rates
np.random.seed(42)
y_pred_random = np.random.choice([0, 1], size=n_total,
                                  p=[n_negative/n_total, n_positive/n_total])

print("=== Trivial model (always predicts negative) ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_all_negative):.4f}")
print(f"Recall:   {recall_score(y_true, y_pred_all_negative):.4f}")
# zero_division=0 silences the warning raised when no positives are predicted
print(f"F1:       {f1_score(y_true, y_pred_all_negative, zero_division=0):.4f}")

# Output: accuracy 0.9990, recall 0.0000, F1 0.0000

print("=== Random guessing at the true class rates ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_random):.4f}")
print(f"Recall:   {recall_score(y_true, y_pred_random):.4f}")

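99.9% accuracy alongside 0% recall: on imbalanced data, accuracy is dominated by the majority class, so a high value says nothing about whether the model finds the minority class, and it can hide a model that ignores that class entirely.

Metrics better suited to imbalanced data:

| Metric | Definition | When to use | Limitation |
|---|---|---|---|
| Recall | TP / (TP + FN) | Misses are very costly (e.g. cancer screening) | Reaches 100% by predicting everything as positive |
| Precision | TP / (TP + FP) | False alarms are costly (e.g. spam filtering) | Can approach 100% by flagging only the most confident samples |
| F1 Score | 2×P×R / (P+R) | Precision and recall both matter | Ignores TN; less informative under extreme imbalance |
| PR-AUC | Area under the precision-recall curve | Severe to extreme imbalance | More sensitive to false positives than ROC-AUC; its baseline equals the positive-class share, so values are not comparable across datasets |
| ROC-AUC | Area under the ROC curve | Mild to moderate imbalance | Can look overly optimistic under extreme imbalance |

Why ROC-AUC can look optimistic under extreme imbalance: the ROC curve uses FPR (false positive rate = FP / (FP + TN)) as its x-axis. When negatives vastly outnumber positives, TN is huge, so FPR stays small even when false positives far outnumber true positives, and the curve still looks good. PR-AUC looks at precision directly, which drops as soon as false positives pile up, so it is usually the better headline metric when positives are rare.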
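A quick arithmetic check makes the contrast concrete. The confusion-matrix counts below are invented for illustration (1,000 positives among 1,000,000 samples), and `rates` is a hypothetical helper rather than a library function.

```python
def rates(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float, float]:
    """Return (FPR, recall, precision) from confusion-matrix counts."""
    fpr = fp / (fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return fpr, recall, precision


# Hypothetical classifier on 1,000 positives and 999,000 negatives
fpr, recall, precision = rates(tp=800, fp=9_990, fn=200, tn=989_010)
print(f"FPR={fpr:.4f}  recall={recall:.4f}  precision={precision:.4f}")
# FPR=0.0100  recall=0.8000  precision=0.0741
```

An operating point at 80% recall and 1% FPR looks excellent on a ROC plot, yet more than 92% of the alerts are false alarms. Precision exposes that; FPR does not.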


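3. Sampling strategies: from random to synthetic

Resampling is the most intuitive fix for imbalance: change the class ratio of the training set. Each strategy has its own side effects.

3.1 Random oversampling and undersampling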

python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Build an imbalanced dataset
# flip_y=0 turns off the default 1% random label noise, so the class counts follow `weights`
X, y = make_classification(n_samples=10000, n_features=20,
                            weights=[0.99, 0.01], flip_y=0, random_state=42)
print(f"Original distribution: {Counter(y)}")  # about {0: 9900, 1: 100}

# Random oversampling: duplicate minority samples
ros = RandomOverSampler(sampling_strategy=0.1, random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
print(f"After oversampling: {Counter(y_ros)}")  # minority grows to about 990

# Random undersampling: drop majority samples
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print(f"After undersampling: {Counter(y_rus)}")  # majority shrinks to about 1000

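The core problem with random oversampling: duplicated samples add no new information, and the model overfits them easily. Minority samples near the boundary get memorized and generalization suffers.

The core problem with random undersampling: discarding many majority samples throws away real information about the data distribution, which is especially damaging on small datasets.

3.2 SMOTE: synthetic minority oversampling

SMOTE (Synthetic Minority Over-sampling Technique) is one of the most widely used oversampling methods. Its core idea is to generate new samples by interpolating between minority-class samples rather than copying them.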
[Flowchart: how SMOTE generates a sample]
Pick a minority sample xi → find its K nearest neighbours (among minority samples) in feature space → randomly choose one neighbour xj → interpolate at random between xi and xj → new synthetic sample
New sample = xi + λ × (xj - xi), λ ~ U(0, 1)
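The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration, not imblearn's implementation: `smote_sketch` is a hypothetical name, distances are computed by brute force, and there is no `sampling_strategy` handling.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise distances among minority samples; a point is not its own neighbour
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]          # shape (n, k)

    base = rng.integers(0, n, size=n_new)                  # index of each xi
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # index of each xj
    lam = rng.random((n_new, 1))                           # lambda ~ U(0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])


# Example: 30 minority points in 2-D, 100 synthetic points
X_minority = np.random.default_rng(1).normal(size=(30, 2))
X_synth = smote_sketch(X_minority, n_new=100)
print(X_synth.shape)
```

Because every synthetic point lies on a segment between two real minority points, SMOTE never extrapolates beyond the minority class's bounding box. That is also why it can fill noisy regions when the minority points themselves are noisy.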

python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

# Standard SMOTE
smote = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_smote)}")

# Borderline-SMOTE: synthesize only around minority samples in the boundary region
# Boundary ("danger") sample = minority sample with at least half, but not all, of its
# neighbours in the majority class; an all-majority neighbourhood is treated as noise
bsmote = BorderlineSMOTE(sampling_strategy=0.5, random_state=42, kind='borderline-1')
X_bsmote, y_bsmote = bsmote.fit_resample(X, y)

# SVM-SMOTE: synthesize around the support vectors near an SVM decision boundary
svmsmote = SVMSMOTE(sampling_strategy=0.5, random_state=42)
X_svmsmote, y_svmsmote = svmsmote.fit_resample(X, y)

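3.3 ADASYN: adaptive synthetic sampling

ADASYN (Adaptive Synthetic Sampling) adds a difficulty weight on top of SMOTE: minority samples with more majority-class neighbours (the harder ones to classify) get more synthetic samples generated around them, so the boundary region receives more attention.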

python
from imblearn.over_sampling import ADASYN

# ADASYN: generate more samples in hard-to-classify regions
adasyn = ADASYN(sampling_strategy=0.5, n_neighbors=5, random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X, y)
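To show what the difficulty weight does, here is a minimal sketch of the allocation step only: each minority sample's share of the new samples is proportional to the fraction of majority points among its k nearest neighbours. `adasyn_allocation` is a hypothetical helper and the toy data is constructed by hand; imblearn's ADASYN handles more cases than this.

```python
import numpy as np


def adasyn_allocation(X, y, n_new: int, k: int = 5) -> np.ndarray:
    """Number of synthetic samples to generate around each minority sample (label 1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == 1)

    # Distances from each minority sample to every sample, excluding itself
    dist = np.linalg.norm(X[min_idx][:, None, :] - X[None, :, :], axis=2)
    dist[np.arange(len(min_idx)), min_idx] = np.inf
    nn = np.argsort(dist, axis=1)[:, :k]

    r = (y[nn] == 0).mean(axis=1)          # share of majority neighbours
    if r.sum() == 0:
        return np.zeros(len(min_idx), dtype=int)
    return np.rint(r / r.sum() * n_new).astype(int)


# Toy data: a majority grid, 4 isolated minority points, 2 minority points inside the grid
majority = np.array([[i, j] for i in range(5) for j in range(4)], dtype=float)
far_minority = np.array([[100, 100], [100, 101], [101, 100], [101, 101]], dtype=float)
near_minority = np.array([[1.5, 1.5], [2.5, 1.5]])
X_demo = np.vstack([majority, far_minority, near_minority])
y_demo = np.array([0] * len(majority) + [1] * 6)

alloc = adasyn_allocation(X_demo, y_demo, n_new=10, k=3)
print(alloc)  # [0 0 0 0 5 5]
```

The four isolated minority points have only minority neighbours, so they receive nothing; all ten new samples go to the two points surrounded by the majority class. The same mechanism explains the risk noted for ADASYN: a mislabeled point deep inside the majority class attracts many synthetic samples.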

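Choosing within the SMOTE family:

| Method | Core mechanism | Best suited to | Risk |
|---|---|---|---|
| Standard SMOTE | Interpolation toward random neighbours | General use, fairly regular data distribution | May synthesize samples in noisy regions |
| Borderline-SMOTE | Interpolation in the boundary region | Sparse minority information near the boundary | Sensitive to how the boundary is defined |
| SVM-SMOTE | Interpolation around SVM support vectors | Data with reasonably good linear separability | Higher computational cost |
| ADASYN | Adaptive, density-based weighting | Unevenly distributed minority class | May over-focus on noisy samples |

3.4 Combined strategies: undersampling + oversampling

Under severe imbalance, a single strategy often has limited effect. Combined strategies are a common choice in practice: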

python
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline

# SMOTE + Tomek links (then remove Tomek links: cross-class nearest-neighbour pairs on a blurry boundary)
smotetomek = SMOTETomek(sampling_strategy=0.5, random_state=42)
X_st, y_st = smotetomek.fit_resample(X, y)

# SMOTE + ENN (Edited Nearest Neighbours: removes samples misclassified by their neighbours)
smoteenn = SMOTEENN(sampling_strategy=0.5, random_state=42)
X_se, y_se = smoteenn.fit_resample(X, y)

print(f"SMOTE+Tomek: {Counter(y_st)}")
print(f"SMOTE+ENN: {Counter(y_se)}")
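For readers unfamiliar with Tomek links, the definition fits in a few lines: two samples form a Tomek link when they are each other's nearest neighbour and carry different labels. The sketch below is a brute-force illustration with a hypothetical `tomek_links` helper, not imblearn's implementation.

```python
import numpy as np


def tomek_links(X, y) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return [
        (i, int(j))
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]


# 1-D toy data: samples 1 and 2 sit close together but belong to different classes
X_toy = np.array([[0.0], [1.0], [1.2], [3.0], [3.1]])
y_toy = np.array([0, 0, 1, 1, 1])
print(tomek_links(X_toy, y_toy))  # [(1, 2)]
```

Samples 3 and 4 are also mutual nearest neighbours, but they share a label, so they are not a Tomek link. Removing one or both members of each link is what cleans up the boundary after SMOTE.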

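4. Cost-sensitive learning: change the priorities, not the data

Sampling influences the model indirectly by changing the training distribution. Cost-sensitive learning instead gives the minority class more weight directly in the loss function: "missing one fraud costs far more than flagging one legitimate transaction."

4.1 The class_weight parameter

Many sklearn classifiers (for example LogisticRegression, SVC, and RandomForestClassifier) accept a class_weight parameter; some, such as GradientBoostingClassifier, do not and take sample_weight in fit instead. The 'balanced' mode computes weights from class frequencies: $w_c = \frac{n\_samples}{n\_classes \times n\_samples_c}$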

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

# Automatically balanced weights
clf_balanced = RandomForestClassifier(
    n_estimators=100,
    # On 1:99 data this is proportional to {0: 1, 1: 99} (actual values ≈ {0: 0.505, 1: 50})
    class_weight='balanced',
    random_state=42
)

# Custom weights derived from business cost
# Assumption: missing a fraud costs 10x as much as flagging a legitimate transaction
fraud_cost_ratio = 10
clf_custom = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: fraud_cost_ratio},
    random_state=42
)

# Logistic regression supports it as well
lr_balanced = LogisticRegression(
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)
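To make the 'balanced' formula concrete, the weights for a 9,900 : 100 split can be computed by hand. `balanced_weights` is a hypothetical helper written for this illustration; sklearn exposes the same calculation as `sklearn.utils.class_weight.compute_class_weight`.

```python
import numpy as np


def balanced_weights(y) -> dict:
    """w_c = n_samples / (n_classes * n_samples_c) for each class c."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


y_demo = np.array([0] * 9900 + [1] * 100)
w = balanced_weights(y_demo)
print(w)                      # roughly {0: 0.505, 1: 50.0}
print(round(w[1] / w[0], 6))  # 99.0
```

The ratio between the two weights equals the class ratio, so each class contributes the same total weight to the loss.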

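4.2 Focal Loss: focusing on hard examples

Focal Loss was proposed by Facebook AI Research (Lin et al., 2017), originally to handle class imbalance in object detection. Its core idea is to down-weight easy examples so that training concentrates on hard ones.

Standard cross-entropy: $CE(p, y) = -y\log(p) - (1-y)\log(1-p)$

Focal Loss: $FL(p_t) = -\alpha_t(1-p_t)^\gamma \log(p_t)$

Here $p_t$ is the predicted probability of the true class: $p_t = p$ when $y = 1$ and $p_t = 1 - p$ when $y = 0$.

  • $(1-p_t)^\gamma$: the modulating factor, which shrinks the loss of samples already predicted correctly with high confidence ($p_t \to 1$)
  • $\gamma$ (focusing parameter): usually 2; larger values focus more strongly on hard samples
  • $\alpha_t$: the class weight that balances positives and negatives ($\alpha$ for the positive class, $1-\alpha$ for the negative class)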
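python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Binary focal loss computed on raw logits."""
    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, inputs, targets):
        targets = targets.float()
        BCE_loss = F.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none'
        )
        pt = torch.exp(-BCE_loss)  # probability assigned to the true class
        # alpha_t: alpha for positives, 1 - alpha for negatives.
        # With alpha=0.25 positives get the smaller weight (the paper's default for gamma=2);
        # raise alpha to up-weight a rare positive class.
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_loss = alpha_t * (1 - pt) ** self.gamma * BCE_loss

        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

# Usage in neural-network training
# criterion = FocalLoss(alpha=0.25, gamma=2.0)
# loss = criterion(logits, labels)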
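A small numeric check shows how strongly the modulating factor reshapes the loss. The sketch below uses NumPy only, sets $\alpha_t = 1$ for simplicity, and `focal_terms` is a hypothetical helper name.

```python
import numpy as np


def focal_terms(p_t, gamma: float = 2.0):
    """Return (cross-entropy, modulating factor, focal loss) for true-class probabilities p_t."""
    p_t = np.asarray(p_t, dtype=float)
    ce = -np.log(p_t)
    factor = (1.0 - p_t) ** gamma
    return ce, factor, factor * ce


p_t = np.array([0.9, 0.5, 0.1])   # easy, ambiguous, hard
ce, factor, fl = focal_terms(p_t, gamma=2.0)
for p, c, m, f in zip(p_t, ce, factor, fl):
    print(f"p_t={p:.1f}  CE={c:.4f}  factor={m:.2f}  FL={f:.4f}")
```

With gamma = 2, the easy sample (p_t = 0.9) keeps only 1% of its cross-entropy loss, while the hard sample (p_t = 0.1) keeps 81%. The many easy negatives in an imbalanced dataset therefore stop dominating the gradient.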

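When deep learning is not in use (for example with sklearn tree models), Focal Loss is not directly available, but its effect can be roughly approximated through the sample_weight parameter: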

python
import numpy as np

def compute_focal_sample_weights(y_pred_proba, y_true, gamma=2.0):
    """Compute sample weights from predicted probabilities to approximate Focal Loss."""
    # Predicted probability of the true class
    pt = np.where(y_true == 1, y_pred_proba, 1 - y_pred_proba)
    # Hard samples (low pt) get larger weights
    weights = (1 - pt) ** gamma
    return weights

# Usage: train a base model to obtain pt, then retrain with the focal weights
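The two-stage usage described in that last comment might look like the sketch below. It is an assumption-laden illustration, not an established recipe: the data is synthetic, the weights are normalized to mean 1 (a choice made here so the regularization strength stays comparable), and the base model's in-sample probabilities are used for brevity, although out-of-fold predictions would give a less optimistic estimate of sample difficulty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 95:5 data (flip_y=0 disables random label noise)
X_fw, y_fw = make_classification(n_samples=2000, n_features=10,
                                 weights=[0.95, 0.05], flip_y=0, random_state=0)

# Stage 1: base model provides the predicted probability of the true class
base = LogisticRegression(max_iter=1000).fit(X_fw, y_fw)
proba = base.predict_proba(X_fw)[:, 1]
p_t = np.where(y_fw == 1, proba, 1 - proba)

# Stage 2: focal-style weights, normalized to mean 1, drive the refit
gamma = 2.0
weights = (1 - p_t) ** gamma
weights = weights / weights.mean()
refit = LogisticRegression(max_iter=1000).fit(X_fw, y_fw, sample_weight=weights)

print(f"mean weight, negatives: {weights[y_fw == 0].mean():.3f}")
print(f"mean weight, positives: {weights[y_fw == 1].mean():.3f}")
```

Whether the refit actually improves PR-AUC depends on the data; it should be checked with cross-validation rather than assumed.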

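5. Threshold tuning: from the mathematical optimum to the business optimum

The default classification threshold of 0.5 carries an implicit assumption: a false positive and a false negative cost the same. On imbalanced problems that assumption almost never holds.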
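The link between costs and the threshold can be made explicit. If the model outputs a calibrated probability p and correct decisions cost nothing, predicting positive has expected cost (1 - p) × cost_fp and predicting negative has expected cost p × cost_fn, so predicting positive is cheaper once p exceeds cost_fp / (cost_fp + cost_fn). The sketch below encodes that rule; `cost_optimal_threshold` is a hypothetical helper, and the calibration assumption matters, since resampling and class weighting both distort predicted probabilities.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold on a calibrated probability that minimizes expected misclassification cost."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


print(cost_optimal_threshold(1, 1))              # 0.5: equal costs recover the default
print(round(cost_optimal_threshold(1, 10), 4))   # 0.0909: misses cost 10x, so flag far more
```

When probabilities are not calibrated, this closed form is only a starting point, and an empirical search over thresholds on held-out data is the safer route.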

5.1 The precision-recall curve and the optimal threshold

python
from sklearn.metrics import precision_recall_curve
import numpy as np

def find_optimal_threshold(y_true, y_pred_proba, beta=1.0):
    """
    Find the classification threshold that maximizes F-beta.
    beta < 1: favors precision (false alarms are costly, e.g. spam filtering)
    beta = 1: F1, balances precision and recall
    beta > 1: favors recall (misses are costly, e.g. cancer screening)
    """
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_pred_proba)

    # F-beta score for each threshold
    # Note: precisions and recalls have one more element than thresholds,
    # so the final point (precision=1, recall=0) is dropped
    fbeta_scores = []
    for p, r in zip(precisions[:-1], recalls[:-1]):
        if p + r > 0:
            fb = (1 + beta**2) * p * r / (beta**2 * p + r)
        else:
            fb = 0
        fbeta_scores.append(fb)

    best_idx = np.argmax(fbeta_scores)
    best_threshold = thresholds[best_idx]
    best_f_beta = fbeta_scores[best_idx]

    return best_threshold, best_f_beta, precisions, recalls, thresholds

# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

# Caveat: the threshold is tuned on the test split here for brevity;
# in practice, tune it on a separate validation split to avoid an optimistic estimate.
# F1-optimal threshold, then F2-optimal threshold (fraud detection: recall matters more, beta=2)
thresh_f1, f1_val, prec, rec, thresh = find_optimal_threshold(y_test, y_proba, beta=1.0)
thresh_f2, f2_val, _, _, _ = find_optimal_threshold(y_test, y_proba, beta=2.0)

print(f"F1-optimal threshold: {thresh_f1:.3f}, F1={f1_val:.3f}")
print(f"F2-optimal threshold (favors recall): {thresh_f2:.3f}, F2={f2_val:.3f}")

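5.2 Choosing the threshold from a business cost matrix

In real business settings, the threshold that matters is not the one that maximizes a metric but the one that minimizes business cost.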

python
def business_cost_optimal_threshold(y_true, y_pred_proba,
                                     cost_fn=10, cost_fp=1):
    """
    Find the threshold that minimizes total cost under a business cost matrix.
    cost_fn: cost of a miss (False Negative = an undetected fraud)
    cost_fp: cost of a false alarm (False Positive = a legitimate transaction flagged)

    Fraud-detection example:
    - Missed fraud (FN): the actual amount is lost, e.g. 1000 yuan
    - False alarm (FP): cost of a manual review, e.g. 100 yuan
    → cost_fn=10, cost_fp=1
    """
    # Only the candidate thresholds are needed from the PR curve
    _, _, thresholds = precision_recall_curve(y_true, y_pred_proba)

    # Total cost from the confusion-matrix counts at each threshold
    total_costs = []
    for thresh in thresholds:
        y_pred = (y_pred_proba >= thresh).astype(int)
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))

        total_cost = fn * cost_fn + fp * cost_fp
        total_costs.append(total_cost)

    best_idx = np.argmin(total_costs)
    return thresholds[best_idx], total_costs[best_idx]

# Fraud scenario: a miss costs 10x as much as a false alarm
optimal_thresh, min_cost = business_cost_optimal_threshold(
    y_test, y_proba, cost_fn=10, cost_fp=1
)
print(f"Cost-optimal threshold: {optimal_thresh:.3f}, total cost: {min_cost}")

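Threshold tuning needs no retraining: only the decision rule applied to the predicted scores changes. That makes it one of the cheapest and most direct levers for imbalanced problems.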


6. Choosing evaluation metrics: the scenario decides

Using the wrong metric for the scenario can lead to the wrong model being selected.
[Flowchart: choosing evaluation metrics]
Choose evaluation metrics → check the degree of class imbalance:
  Mild (1:5 to 1:10) → F1 Score + ROC-AUC
  Moderate (1:10 to 1:100) → F1 Score + PR-AUC
  Severe (1:100 to 1:1000) → PR-AUC as the primary metric → then by business scenario:
    Misses are costly → Recall + PR-AUC + F2
    False alarms are costly → Precision + PR-AUC + F0.5
    Business costs are known → custom cost matrix
  Extreme (> 1:1000) → switch to anomaly detection: AUPR / Precision@K

python
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score,
    f1_score, fbeta_score, recall_score, precision_score
)

def comprehensive_eval(y_true, y_pred, y_proba, imbalance_ratio, beta=1.0):
    """Evaluate an imbalanced classifier on several metrics at once."""
    print("=" * 50)
    print(f"Imbalance ratio is about 1:{imbalance_ratio:.0f}")
    print("=" * 50)

    print(f"Precision:    {precision_score(y_true, y_pred):.4f}")
    print(f"Recall:       {recall_score(y_true, y_pred):.4f}")
    print(f"F1:           {f1_score(y_true, y_pred):.4f}")
    print(f"F{beta}:          {fbeta_score(y_true, y_pred, beta=beta):.4f}")

    if y_proba is not None:
        roc = roc_auc_score(y_true, y_proba)
        pr_auc = average_precision_score(y_true, y_proba)
        print(f"ROC-AUC:      {roc:.4f}")
        print(f"PR-AUC:       {pr_auc:.4f}")

        if imbalance_ratio > 50:
            print(f"\n⚠️  Heavily imbalanced data: PR-AUC ({pr_auc:.4f}) is the core metric")
            print(f"    ROC-AUC ({roc:.4f}) may look optimistic; treat it as secondary")

    print("\nConfusion matrix:")
    cm = confusion_matrix(y_true, y_pred)
    print(f"  TN={cm[0,0]}, FP={cm[0,1]}")
    print(f"  FN={cm[1,0]}, TP={cm[1,1]}")

# Compute the imbalance ratio (negatives per positive)
pos_ratio = y_test.mean()
imbalance = (1 - pos_ratio) / pos_ratio

y_pred_default = (y_proba >= 0.5).astype(int)
y_pred_optimal = (y_proba >= thresh_f1).astype(int)

print("\n--- Default threshold 0.5 ---")
comprehensive_eval(y_test, y_pred_default, y_proba, imbalance, beta=2)

print("\n--- F1-optimal threshold ---")
comprehensive_eval(y_test, y_pred_optimal, y_proba, imbalance, beta=2)
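Precision@K is recommended above for the extreme tier but is not part of the evaluation function. It answers a capacity question: if the review team can only inspect the K highest-scoring cases, what fraction of them are true positives? A minimal sketch, where `precision_at_k` is a hypothetical helper and the labels and scores are invented for illustration:

```python
import numpy as np


def precision_at_k(y_true, scores, k: int) -> float:
    """Fraction of true positives among the k highest-scoring samples."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of samples")
    top_k = np.argsort(-scores, kind="stable")[:k]
    return float(y_true[top_k].mean())


y_demo = [0, 1, 0, 1, 0, 0]
s_demo = [0.1, 0.9, 0.8, 0.7, 0.2, 0.3]
print(precision_at_k(y_demo, s_demo, k=1))            # 1.0
print(round(precision_at_k(y_demo, s_demo, k=3), 4))  # 0.6667
```

Unlike F1, this metric needs no threshold; K comes directly from operational capacity, such as the number of alerts analysts can review per day.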

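7. Cross-validation done right on imbalanced data

The easiest trap to fall into: resampling the full dataset before cross-validation splits it, instead of resampling inside each training fold.

Wrong: resample first, then cross-validate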

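python
# ❌ Wrong: SMOTE the full dataset first, then cross-validate
# Validation folds then contain synthetic points interpolated from training-fold samples
# (and vice versa): data leakage!
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

cv = StratifiedKFold(n_splits=5)
scores = cross_val_score(clf, X_resampled, y_resampled, cv=cv, scoring='f1')
# Scores are inflated: validation folds hold synthetic near-copies of training data
# and no longer reflect the original class distribution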

Right: resample inside a Pipeline during CV

python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # note: imblearn's Pipeline, not sklearn's
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# ✅ Right: the sampler lives inside the Pipeline, so each CV split resamples its training part only
pipeline = Pipeline([
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
    ('clf', RandomForestClassifier(n_estimators=100,
                                    class_weight='balanced',
                                    random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    pipeline, X, y, cv=cv,
    scoring=['f1', 'average_precision', 'recall'],
    return_train_score=False
)

print(f"F1:     {results['test_f1'].mean():.4f} ± {results['test_f1'].std():.4f}")
print(f"PR-AUC: {results['test_average_precision'].mean():.4f} ± {results['test_average_precision'].std():.4f}")
print(f"Recall: {results['test_recall'].mean():.4f} ± {results['test_recall'].std():.4f}")

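Why imblearn's Pipeline is required: sklearn's standard Pipeline does not support the fit_resample interface. imblearn's Pipeline calls SMOTE's fit_resample only during fit, so within CV each split is resampled on its training part only and the validation part keeps the original distribution.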
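The leakage is easiest to see with plain duplication instead of SMOTE, because exact copies can be counted. The NumPy-only sketch below (hypothetical helper names, synthetic data) compares the two orders of operations; SMOTE leaks in the same way, except that the leaked points are interpolated near-copies rather than exact ones.

```python
import numpy as np


def oversample_minority(X, y, times):
    """Random oversampling: append (times - 1) extra copies of each minority row."""
    extra = np.repeat(X[y == 1], times - 1, axis=0)
    return np.vstack([X, extra]), np.concatenate([y, np.ones(len(extra), dtype=int)])


def random_split(X, y, rng, test_frac=0.2):
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]


def count_leaks(X_train, X_test):
    """Number of test rows that also appear byte-for-byte in the training set."""
    seen = {row.tobytes() for row in X_train}
    return sum(row.tobytes() in seen for row in X_test)


rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(200, 3)), rng.normal(loc=2.0, size=(20, 3))])
y_demo = np.array([0] * 200 + [1] * 20)

# Wrong order: oversample everything, then split
X_over, y_over = oversample_minority(X_demo, y_demo, times=10)
X_tr, y_tr, X_te, y_te = random_split(X_over, y_over, rng)
leaks_bad = count_leaks(X_tr, X_te)

# Right order: split first, then oversample the training part only
X_tr, y_tr, X_te, y_te = random_split(X_demo, y_demo, rng)
X_tr, y_tr = oversample_minority(X_tr, y_tr, times=10)
leaks_ok = count_leaks(X_tr, X_te)

print(f"leaked test rows, oversample-then-split: {leaks_bad}")
print(f"leaked test rows, split-then-oversample: {leaks_ok}")
```

In the wrong order, most minority rows in the test split have identical twins in the training split, so the model is graded on data it has already seen.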


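8. Case study: comparing complete approaches for credit card fraud detection

Using credit card fraud detection as the example (0.17% fraud rate, roughly 1:590, which is severe by the grading in Section 1), this section compares the handling strategies side by side.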

python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_validate
import numpy as np

# ---- Data preparation ----
# Simulated fraud data from sklearn (the Kaggle creditcard dataset can be used instead)
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=284807, n_features=30, n_informative=15,
    weights=[0.9983, 0.0017],  # close to the real fraud rate
    flip_y=0,  # disable the default 1% label noise, which would push the positive rate to about 0.7%
    random_state=42
)

print(f"Dataset shape: {X.shape}")
print(f"Fraud rate: {y.mean():.4f} ({y.sum()} frauds / {len(y)} transactions)")

# ---- Strategy comparison ----
strategies = {
    "Baseline (no handling)": RandomForestClassifier(
        n_estimators=100, random_state=42),

    "class_weight='balanced'": RandomForestClassifier(
        n_estimators=100, class_weight='balanced', random_state=42),

    "SMOTE + RF": Pipeline([
        ('smote', SMOTE(sampling_strategy=0.1, random_state=42)),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),

    "SMOTE + class_weight (combined)": Pipeline([
        ('smote', SMOTE(sampling_strategy=0.1, random_state=42)),
        ('clf', RandomForestClassifier(
            n_estimators=100, class_weight='balanced', random_state=42))
    ]),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

print("\n{:<32} {:>8} {:>8} {:>8}".format("Strategy", "PR-AUC", "Recall", "F1"))
print("-" * 60)

# Note: 4 strategies x 3 folds of a 100-tree forest on ~285k rows takes a while to run
for name, model in strategies.items():
    results = cross_validate(
        model, X, y, cv=cv,
        scoring=['average_precision', 'recall', 'f1'],
    )
    print("{:<32} {:>8.4f} {:>8.4f} {:>8.4f}".format(
        name,
        results['test_average_precision'].mean(),
        results['test_recall'].mean(),
        results['test_f1'].mean()
    ))

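Rules of thumb from practice:

  1. Try class_weight='balanced' first: it costs almost nothing extra and usually beats doing nothing
  2. SMOTE often helps tree models only a little: their built-in class_weight mechanism is more efficient, and SMOTE tends to help linear models more
  3. Threshold tuning is often the most effective lever: with the same model, moving the threshold from 0.5 to the business optimum can raise recall noticeably, on the order of 20-40% in favorable cases
  4. PR-AUC is the core benchmark when positives are very rare: two models with the same ROC-AUC can differ in PR-AUC by a large factor, sometimes 2-3x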

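Flowchart: decision flow for an extremely imbalanced classification task

  • Step 1: class_weight='balanced' (zero-cost baseline)
  • Check: PR-AUC < 0.3? If yes, continue with Step 2
  • Step 2: SMOTE Pipeline (moderate intervention)
  • Step 3: threshold tuning (find the optimum on the precision-recall curve)
  • Check: are the business costs clearly defined?
      If yes: cost-matrix-driven threshold (business optimum)
      Otherwise: F1-optimal or F-beta threshold (mathematical optimum)
  • Evaluation: PR-AUC + Recall + business cost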
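
The steps listed above can be sketched in code. This is a minimal illustration under stated assumptions, not the article's reference implementation: the synthetic dataset, the LogisticRegression model, and beta = 2 are all choices made for the example. Only scikit-learn is used, so the SMOTE step is marked with a comment rather than implemented, and the branch shown at the end is the F-beta ("mathematical optimum") one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 1% positives (illustrative only).
X, y = make_classification(n_samples=20000, n_features=20, n_informative=5,
                           weights=[0.99, 0.01], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 1: zero-cost baseline with class weights.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# Decision node: is the baseline PR-AUC below 0.3?
pr_auc = average_precision_score(y_val, proba)
needs_resampling = pr_auc < 0.3
# Step 2 would go here when needs_resampling is True:
# a SMOTE step inside a Pipeline (requires the imbalanced-learn package).

# Step 3: pick the threshold that maximizes F-beta on the PR curve.
beta = 2.0  # assumption: recall counts more than precision
precision, recall, thresholds = precision_recall_curve(y_val, proba)
p, r = precision[:-1], recall[:-1]  # the final curve point has no threshold
denom = beta**2 * p + r
fbeta = np.divide((1 + beta**2) * p * r, denom,
                  out=np.zeros_like(denom), where=denom > 0)
best = int(np.argmax(fbeta))
best_threshold = float(thresholds[best])
y_pred = (proba >= best_threshold).astype(int)

print(f"PR-AUC={pr_auc:.3f}  resample? {needs_resampling}")
print(f"threshold={best_threshold:.3f}  F{beta:g}={fbeta[best]:.3f}")
```

The threshold here is tuned and reported on the same validation split, which is optimistic. A final evaluation would need a separate held-out test set.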


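Summary

Handling imbalanced data is not a one-off decision about "which sampling method to pick". It is an end-to-end engineering effort that runs through data processing, model training, evaluation metrics, and threshold adjustment.

Core principles, revisited: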

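  • Accuracy is a misleading metric on imbalanced data; use PR-AUC and Recall instead
  • The severity of the imbalance determines how hard to intervene: class_weight is enough for mild cases, and SMOTE is only needed for severe ones
  • Resampling must happen inside the Pipeline; resampling outside cross-validation causes data leakage
  • Threshold adjustment is the most cost-effective optimization, because it does not require retraining the model
  • Let the business cost matrix drive threshold selection: the goal is the business optimum, not the mathematical optimum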
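
As a small illustration of the last two principles, the sketch below picks a threshold by minimizing total business cost over a grid, without touching the model. The helper name `min_cost_threshold`, the toy labels and scores, and the cost values are all hypothetical, chosen only to show how the cost matrix moves the threshold.

```python
import numpy as np


def min_cost_threshold(y_true, proba, cost_fn, cost_fp, n_grid=101):
    """Return (threshold, total_cost) minimizing cost_fn*FN + cost_fp*FP.

    Hypothetical helper: scans an evenly spaced grid of thresholds in [0, 1]
    and assumes correct predictions cost nothing.
    """
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    grid = np.linspace(0.0, 1.0, n_grid)
    costs = []
    for t in grid:
        pred = proba >= t
        fn = np.sum(y_true & ~pred)   # missed positives
        fp = np.sum(~y_true & pred)   # false alarms
        costs.append(cost_fn * fn + cost_fp * fp)
    best = int(np.argmin(costs))      # first minimum on the grid
    return float(grid[best]), float(costs[best])


# Toy validation set: 3 positives among 10 samples (made-up numbers).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]

# Missing a positive is 100x worse than a false alarm: threshold moves down.
t_miss, cost_miss = min_cost_threshold(y_true, proba, cost_fn=100, cost_fp=1)
# A false alarm is 10x worse than a miss: threshold moves up.
t_alarm, cost_alarm = min_cost_threshold(y_true, proba, cost_fn=1, cost_fp=10)

print(f"expensive misses:       threshold={t_miss:.2f}, cost={cost_miss:.0f}")
print(f"expensive false alarms: threshold={t_alarm:.2f}, cost={cost_alarm:.0f}")
```

The same predicted probabilities lead to two different operating points. Only the cost assumptions changed, which is why the threshold can be revisited whenever the business costs change, with no retraining.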

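Module 2 opens with imbalanced data on purpose: real-world data is almost never perfectly balanced, and mastering this foundation is a prerequisite for handling real business scenarios.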


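If this article helped you understand imbalanced data, feel free to like, bookmark, and follow. Upcoming posts will keep digging into more advanced machine learning scenarios.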

Published articles in this series: