What Is Class Imbalance in Machine Learning?

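1. What Are Balanced and Imbalanced Data?

The short version

Imagine you are training a model to detect spam email:

Balanced data: of 1,000 emails, 500 are spam and 500 are legitimate (ratio 1:1)
Imbalanced data: of 1,000 emails, 50 are spam and 950 are legitimate (ratio 1:19)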

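In the real world, most classification datasets are imbalanced to some degree:

Fraud detection: legitimate vs. fraudulent transactions (possibly 10000:1)
Disease diagnosis: healthy people vs. patients (possibly 100:1)
Customer churn: retained vs. churned customers (possibly 10:1)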

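2. Why Is Imbalanced Data a Serious Problem?

2.1 An intuitive example

Suppose you have an extremely imbalanced dataset:

Class A (majority): 990 samples
Class B (minority): 10 samples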

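If you do nothing, the model may "take a shortcut": predict class A for every sample and still reach 99% accuracy. Such a model is clearly useless, because it never finds a single class B sample.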
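The arithmetic behind that 99% figure can be checked with a minimal sketch. The arrays below are made up to match the 990/10 split described above; no real model is involved.

```python
import numpy as np

# 990 majority samples (label 0 = class A) and 10 minority samples (label 1 = class B)
y_true = np.array([0] * 990 + [1] * 10)

# A "lazy" classifier that always predicts class A
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on class B: fraction of true class B samples that were predicted as B
recall_b = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy = {accuracy:.2%}")
print(f"recall on class B = {recall_b:.2%}")
```

Accuracy comes out at 99% while recall on class B is 0%, which is why accuracy alone says little here.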

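2.2 Core problems

Problem	Explanation
Bias toward the majority class	To minimize overall loss, the model tends to predict the class with more samples
Under-learning of the minority class	The model sees few minority samples and cannot learn their characteristics
Misleading evaluation metrics	High accuracy can mask poor performance on the minority class
Large business losses	In fraud detection, missing one fraud can cost more than wrongly flagging 100 legitimate transactions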

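3. Theoretical Analysis: Why Does the Model "Take Shortcuts"?

3.1 From the loss function's perspective

Most classification models are trained with the cross-entropy loss (binary form):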

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \right]$$

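Where the problem lies:

When class A makes up 99% of the samples, the loss from misclassifying class B is "diluted": class B contributes only about 1% of the terms in the average
The gain from classifying class A correctly is amplified, because it applies to 99% of the terms
The model therefore tends to push the decision boundary into the minority-class region, sacrificing minority-class recall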
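To make the dilution concrete, the following sketch compares the average cross-entropy of two constant predictors on the 990/10 dataset from section 2.1. The two probabilities (0.01 and 0.5) are illustrative choices, not the output of any trained model.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Mean binary cross-entropy over all samples."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0] * 990 + [1] * 10)  # label 1 = minority class B

# Predictor 1: always outputs P(class B) = 0.01, so it never predicts B at a 0.5 threshold
loss_ignore_b = binary_cross_entropy(y, np.full(y.shape, 0.01))
# Predictor 2: maximally uncertain, outputs 0.5 for every sample
loss_uncertain = binary_cross_entropy(y, np.full(y.shape, 0.5))

print(f"always P(B)=0.01: loss = {loss_ignore_b:.4f}")
print(f"always P(B)=0.50: loss = {loss_uncertain:.4f}")
```

The predictor that ignores class B reaches a loss of about 0.056, against about 0.693 for the uncertain one, so minimizing the average loss rewards ignoring the minority class.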

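3.2 From the gradient descent perspective

Gradient updates are dominated by the contributions of majority-class samples
The gradients from minority-class samples are "drowned out"
The direction of parameter updates is therefore determined mostly by the majority class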
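A rough way to see this is to look at the bias gradient of a logistic model at initialization. The sketch below assumes a sigmoid output trained with cross-entropy, for which the per-sample gradient with respect to the bias is p_i - y_i. The 990/10 data and the starting probability of 0.5 are illustrative assumptions.

```python
import numpy as np

y = np.array([0] * 990 + [1] * 10)   # label 1 = minority class
p = np.full(y.shape, 0.5)            # untrained model: P(y=1) = 0.5 for every sample

# For sigmoid + cross-entropy, d(loss_i)/d(bias) = p_i - y_i
per_sample_grad = p - y

grad_majority = per_sample_grad[y == 0].sum() / len(y)   # positive: the update lowers the bias
grad_minority = per_sample_grad[y == 1].sum() / len(y)   # negative: the update raises the bias
share_majority = abs(grad_majority) / (abs(grad_majority) + abs(grad_minority))

print(f"majority: {grad_majority:+.3f}, minority: {grad_minority:+.3f}")
print(f"share of gradient magnitude from the majority class: {share_majority:.0%}")
```

The majority class contributes +0.495 and the minority class -0.005, so 99% of the gradient magnitude on the first step comes from the majority class.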

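4. How to Tell Whether Your Data Is Imbalanced

4.1 Degrees of imbalance

Ratio (minority:majority)	Degree	Handling
1:1 to 1:2	Roughly balanced	No special handling needed
1:2 to 1:10	Mild imbalance	Simple handling is enough
1:10 to 1:100	Moderate imbalance	Dedicated techniques needed
Beyond 1:100	Extreme imbalance	Advanced techniques required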
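The bands in the table above can be turned into a small helper. This is a sketch only: the function name `imbalance_level` is hypothetical, and the choice to put a boundary value such as exactly 1:10 in the lower band is an assumption, since the table's ranges overlap at their endpoints.

```python
from collections import Counter

def imbalance_level(labels):
    """Return (majority/minority ratio, level) using the rule-of-thumb bands above."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    if ratio <= 2:
        level = "roughly balanced"
    elif ratio <= 10:
        level = "mild"
    elif ratio <= 100:
        level = "moderate"
    else:
        level = "extreme"
    return ratio, level

# The spam example from section 1: 50 spam vs. 950 legitimate emails
spam_labels = ["spam"] * 50 + ["ham"] * 950
ratio, level = imbalance_level(spam_labels)
print(f"1:{ratio:.0f} -> {level}")
```

For the spam example this reports a 1:19 ratio, which falls in the moderate band.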

4.2 Visual check (hands-on Python)

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.datasets import make_classification

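# Generate an imbalanced dataset
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # class 0 is 90%, class 1 is 10%
    flip_y=0,
    random_state=42
)

# Count samples per class (tolist() gives plain ints, which keeps the printout readable)
counter = Counter(y.tolist())
print(f"Class distribution: {counter}")

# Sort by class label so that bars, colors and text labels line up
classes = sorted(counter)
counts = [counter[c] for c in classes]

# Visualize
plt.figure(figsize=(12, 5))

# Subplot 1: bar chart of the class distribution
plt.subplot(1, 2, 1)
plt.bar(classes, counts, color=['skyblue', 'salmon'])
plt.xticks(classes)
plt.xlabel('Class')
plt.ylabel('Number of samples')
plt.title('Class distribution (imbalanced)')
for c, v in zip(classes, counts):
    plt.text(c, v + 10, str(v), ha='center', fontsize=12)

# Subplot 2: scatter plot of the two features
plt.subplot(1, 2, 2)
plt.scatter(X[y==0, 0], X[y==0, 1], label='Class 0 (majority)', alpha=0.6, s=50)
plt.scatter(X[y==1, 0], X[y==1, 1], label='Class 1 (minority)', alpha=0.6, s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data distribution')
plt.legend()

plt.tight_layout()
plt.show()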

Output:

Class distribution: Counter({0: 900, 1: 100})  # a 9:1 imbalance

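5. Solutions: From Simple to Advanced

5.1 Data level: resampling techniques

5.1.1 Random Over-Sampling

How it works: duplicate minority-class samples until the minority class matches the majority class in size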

from imblearn.over_sampling import RandomOverSampler
from collections import Counter

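# Original data
print(f"Original distribution: {Counter(y.tolist())}")

# Random over-sampling
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

print(f"After over-sampling: {Counter(y_resampled.tolist())}")
# Output: Original distribution: Counter({0: 900, 1: 100})
#         After over-sampling: Counter({0: 900, 1: 900})

Pros: simple and direct
Cons: prone to overfitting, because the added samples are exact duplicates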
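To show what "exact duplicates" means, here is a hand-rolled NumPy sketch of random over-sampling on a toy array. It is not how imbalanced-learn implements `RandomOverSampler`; the toy data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_toy == 1)
n_extra = (y_toy == 0).sum() - minority_idx.size   # copies needed to reach 1:1

# Draw minority rows with replacement; every added row is an exact duplicate
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_over = np.vstack([X_toy, X_toy[extra_idx]])
y_over = np.concatenate([y_toy, y_toy[extra_idx]])

n_distinct_minority = len(np.unique(X_over[y_over == 1], axis=0))
print(np.bincount(y_over))      # class counts after over-sampling
print(n_distinct_minority)      # number of distinct minority rows
```

The class counts become 8 and 8, but only 2 distinct minority rows exist, so the model sees the same two points over and over.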

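5.1.2 SMOTE (Synthetic Minority Over-sampling Technique)

How it works: generate new samples by interpolating between minority-class samples instead of simply duplicating them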

from collections import Counter
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE, RandomOverSampler

# SMOTE over-sampling
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

# Random over-sampling, for comparison
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)

# Plot the three datasets side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
panels = [
    ('Original data', X, y),
    ('Random over-sampling', X_ros, y_ros),
    ('SMOTE', X_smote, y_smote),
]
for ax, (title, X_p, y_p) in zip(axes, panels):
    ax.scatter(X_p[y_p==0, 0], X_p[y_p==0, 1], label='Class 0', alpha=0.6)
    ax.scatter(X_p[y_p==1, 0], X_p[y_p==1, 1], label='Class 1', alpha=0.6)
    ax.set_title(f'{title}\n{dict(Counter(y_p.tolist()))}')
    ax.legend()

plt.tight_layout()
plt.show()

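How SMOTE works:

For each minority-class sample, find its k nearest minority-class neighbors (k=5 by default)
Randomly pick one of those neighbors
Take a random point on the line segment between the two samples and use it as the new sample

Pros: generates new samples rather than copies, which reduces overfitting
Cons: can generate noisy samples (for example when the minority class contains outliers)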
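The three steps above can be sketched in plain NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the function name `smote_sketch`, the brute-force distance matrix and the toy points are all assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority samples X_min (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)                    # cannot have more neighbors than other samples
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # step 1: k nearest neighbors of each sample

    base = rng.integers(0, len(X_min), size=n_new)             # sample to start from
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # step 2: one random neighbor
    gap = rng.random((n_new, 1))                               # step 3: position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)
print(X_new.round(2))
```

Every synthetic point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies. That is also why an outlier in the minority class can drag synthetic points into majority territory.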

5.1.3 Random Under-Sampling

How it works: randomly remove majority-class samples

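from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print(f"After under-sampling: {Counter(y_rus.tolist())}")
# Output: After under-sampling: Counter({0: 100, 1: 100})

Pros: shorter training time
Cons: may discard important information from the majority class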

5.1.4 Combined approach: SMOTE + Tomek Links

from imblearn.combine import SMOTETomek

# Over-sample with SMOTE first, then remove Tomek links to clean up noise near the class boundary
smote_tomek = SMOTETomek(random_state=42)
X_combined, y_combined = smote_tomek.fit_resample(X, y)
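A Tomek link is a pair of samples from different classes that are each other's nearest neighbor; such pairs usually sit on the class boundary or are noise. The sketch below finds them by brute force on toy data. The function name `tomek_links` and the data are illustrative assumptions, and the code is not the `SMOTETomek` implementation.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)          # nearest neighbor of every sample
    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbors with different labels
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# 1-D toy data: the class-1 point at 2.1 sits right next to the class-0 point at 2.0
X_toy = np.array([[0.0], [1.0], [2.0], [2.1], [5.0], [6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(tomek_links(X_toy, y_toy))
```

Only the pair (2, 3) is reported. Cleaning would then drop one or both members of each link, depending on the chosen strategy.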

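5.2 Algorithm level: adjusting the learning process

5.2.1 Class weights

How it works: assign a higher loss weight to the minority class so that the model pays more attention to it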

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

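# Split the data (stratify keeps the class ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Option 1: set the weights manually
# Weights inversely proportional to class frequency:
#   weight_c = n_samples / (n_classes * n_samples_in_class_c)
weights = {0: 1, 1: 9}  # class 1 weighs 9 times as much as class 0

clf_weighted = RandomForestClassifier(class_weight=weights, random_state=42)
clf_weighted.fit(X_train, y_train)

# Option 2: let scikit-learn compute the weights (recommended)
clf_auto = RandomForestClassifier(class_weight='balanced', random_state=42)
clf_auto.fit(X_train, y_train)

# Baseline for comparison: no class weights
clf_normal = RandomForestClassifier(random_state=42)
clf_normal.fit(X_train, y_train)

# Evaluate
print("=== No weights ===")
print(classification_report(y_test, clf_normal.predict(X_test)))

print("=== Manual weights ===")
print(classification_report(y_test, clf_weighted.predict(X_test)))

print("=== Automatic weights ===")
print(classification_report(y_test, clf_auto.predict(X_test)))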
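The inverse-frequency formula quoted in the code comment above is the one scikit-learn documents for `class_weight='balanced'`. A small sketch shows what it yields for a 900/100 split; the helper name `balanced_class_weights` is hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """weight_c = n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

y_demo = np.array([0] * 900 + [1] * 100)
w = balanced_class_weights(y_demo)
print(w)
print(f"minority/majority weight ratio: {w[1] / w[0]:.1f}")
```

Class 0 gets a weight of about 0.56 and class 1 a weight of 5.0. Their ratio is 9, the same relative weighting as the manual `{0: 1, 1: 9}` setting.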

5.2.2 Cost-Sensitive Learning

In more extreme scenarios you can define a custom cost matrix:

from sklearn.svm import SVC

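# Cost matrix: cost_matrix[i][j] is the cost of predicting class j when the true class is i
# Example: missing a fraud (true 1, predicted 0) costs 100 times as much as
# flagging a legitimate transaction (true 0, predicted 1); correct predictions cost 0
cost_matrix = [[0, 1], [100, 0]]  # simplified example

# SVC takes no cost matrix directly; class_weight approximates the same asymmetry
svm_cost = SVC(class_weight={0: 1, 1: 100}, kernel='rbf')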
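One way to use a cost matrix directly is to score a classifier by total cost instead of accuracy. The sketch below does this for two made-up confusion matrices; the counts and the 1-versus-100 costs are illustrative assumptions. It also computes the cost-minimizing probability threshold, which holds when correct predictions cost nothing and the predicted probabilities are well calibrated.

```python
import numpy as np

# cost[i][j]: cost of predicting class j when the true class is i (illustrative values)
cost = np.array([[0, 1],
                 [100, 0]])

# Hypothetical confusion matrices (rows: true class, columns: predicted class)
cm_lazy = np.array([[990, 0],
                    [10, 0]])      # predicts "legitimate" for everything
cm_cautious = np.array([[900, 90],
                        [2, 8]])   # raises 90 false alarms but misses only 2 frauds

def total_cost(cm, cost):
    """Sum of count * cost over all four cells."""
    return int((cm * cost).sum())

print(total_cost(cm_lazy, cost))      # 10 missed frauds * 100
print(total_cost(cm_cautious, cost))  # 90 false alarms * 1 + 2 missed frauds * 100

# Predict "fraud" whenever P(fraud) exceeds C_FP / (C_FP + C_FN)
threshold = cost[0, 1] / (cost[0, 1] + cost[1, 0])
print(f"cost-minimizing threshold: {threshold:.4f}")
```

The lazy classifier is 99% accurate but costs 1000, while the cautious one is 90.8% accurate and costs 290. With these costs the threshold drops to about 0.0099, far below the default 0.5.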

5.3 Evaluation level: choosing the right metrics

Never rely on accuracy alone!

5.3.1 Confusion matrix and derived metrics

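                  Predicted
                  0      1
Actual    0      TN     FP
          1      FN     TP

- True positive (TP): actual 1, predicted 1
- False positive (FP): actual 0, predicted 1
- True negative (TN): actual 0, predicted 0
- False negative (FN): actual 1, predicted 0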

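5.3.2 Key metrics

Metric	Formula	Meaning	When to use
Precision	TP/(TP+FP)	Of the samples predicted positive, the fraction that are actually positive	Reducing false alarms (e.g. spam filtering)
Recall	TP/(TP+FN)	Of the actually positive samples, the fraction predicted correctly	Reducing misses (e.g. disease diagnosis)
F1 score	2×(P×R)/(P+R)	Harmonic mean of precision and recall	Overall assessment
AUC-ROC	-	Area under the ROC curve	Assessing ranking ability
AUC-PR	-	Area under the precision-recall curve	Preferred for imbalanced data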
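The formulas in the table can be worked through by hand. The sketch below uses made-up confusion-matrix counts (TN=900, FP=90, FN=2, TP=8) and a hypothetical helper, `precision_recall_f1`, that returns 0 when a denominator is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for a 990/10 test set
tn, fp, fn, tp = 900, 90, 2, 8
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"F1        = {f1:.3f}")
```

Accuracy is 0.908, yet precision is only about 0.082 against a recall of 0.8 and an F1 of about 0.148. The per-class metrics expose a trade-off that accuracy hides.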


from sklearn.metrics import (classification_report, confusion_matrix, 
                            roc_auc_score, average_precision_score,
                            precision_recall_curve, roc_curve)

# Train a model and predict
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

# Detailed report
print("Classification report:")
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

# Confusion matrix
print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred))

# AUC scores
roc_auc = roc_auc_score(y_test, y_prob)
pr_auc = average_precision_score(y_test, y_prob)  # more informative on imbalanced data
print(f"\nROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC: {pr_auc:.3f}")

# Plot the ROC and PR curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[0].plot(fpr, tpr, label=f'ROC (AUC = {roc_auc:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', label='Random guessing')
axes[0].set_xlabel('False positive rate (FPR)')
axes[0].set_ylabel('True positive rate (TPR)')
axes[0].set_title('ROC curve')
axes[0].legend()

# PR curve (the random baseline is the positive-class rate)
precision, recall, _ = precision_recall_curve(y_test, y_prob)
axes[1].plot(recall, precision, label=f'PR (AP = {pr_auc:.3f})')
axes[1].axhline(y=y_test.mean(), color='r', linestyle='--', label='Random baseline')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('PR curve (more informative on imbalanced data)')
axes[1].legend()

plt.tight_layout()
plt.show()

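6. Complete Worked Example: Credit Card Fraud Detection

Credit card fraud is a classic extremely imbalanced problem: fraudulent transactions are typically well under 1% of the total. The code below simulates such a dataset with a 0.5% fraud rate.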

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, average_precision_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
import warnings
warnings.filterwarnings('ignore')

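# Simulated data (in practice, load a real credit card dataset instead)
np.random.seed(42)
n_samples = 10000
n_features = 10

# Generate features
X = np.random.randn(n_samples, n_features)

# Generate extremely imbalanced labels (0.5% fraud rate)
y = np.zeros(n_samples, dtype=int)
fraud_indices = np.random.choice(n_samples, size=int(n_samples*0.005), replace=False)
y[fraud_indices] = 1

# Assumption for illustration: shift three features of the fraud rows so that the
# labels carry some signal; with pure noise every strategy would score at chance level
X[fraud_indices, :3] += 1.5

print(f"Dataset size: {n_samples}")
print(f"Number of features: {n_features}")
print(f"Fraudulent transactions: {int(y.sum())} ({y.mean()*100:.2f}%)")
print(f"Legitimate transactions: {int((1-y).sum())} ({(1-y.mean())*100:.2f}%)")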

# Split into training and test sets (stratified to preserve the fraud rate)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=42, stratify=y)

print(f"\nTraining set distribution: {np.bincount(y_train.astype(int))}")
print(f"Test set distribution: {np.bincount(y_test.astype(int))}")

# Evaluation helper
def evaluate_model(model, X_test, y_test, name):
    """Print the classification report and PR-AUC of a fitted model."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\n{'='*50}")
    print(f"Model: {name}")
    print(f"{'='*50}")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud'],
                                zero_division=0))
    print(f"PR-AUC: {average_precision_score(y_test, y_prob):.4f}")

# 1. Baseline model (no handling of the imbalance)
print("\n" + "="*60)
print("Strategy comparison")
print("="*60)

lr_baseline = LogisticRegression(max_iter=1000)
lr_baseline.fit(X_train, y_train)
evaluate_model(lr_baseline, X_test, y_test, "Baseline (no handling)")

# 2. Class weights
lr_weighted = LogisticRegression(class_weight='balanced', max_iter=1000)
lr_weighted.fit(X_train, y_train)
evaluate_model(lr_weighted, X_test, y_test, "Class weights")

# 3. SMOTE over-sampling (training set only; the test set stays untouched)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print(f"\nTraining set distribution after SMOTE: {np.bincount(y_train_smote.astype(int))}")

lr_smote = LogisticRegression(max_iter=1000)
lr_smote.fit(X_train_smote, y_train_smote)
evaluate_model(lr_smote, X_test, y_test, "SMOTE")

# 4. Combined strategy: SMOTE + class weights
# Note: after SMOTE the classes are already 1:1, so the 'balanced' weights are all 1
# and this model should behave like strategy 3
lr_combined = LogisticRegression(class_weight='balanced', max_iter=1000)
lr_combined.fit(X_train_smote, y_train_smote)
evaluate_model(lr_combined, X_test, y_test, "SMOTE + class weights")

# 5. Pipeline (recommended): resampling runs only during fit, never at predict time
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000))
])
pipeline.fit(X_train, y_train)
evaluate_model(pipeline, X_test, y_test, "Pipeline (SMOTE + weights)")

# 6. Random forest for comparison
rf_smote = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_smote.fit(X_train_smote, y_train_smote)
evaluate_model(rf_smote, X_test, y_test, "Random forest (SMOTE + weights)")

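7. A Decision Tree for Choosing a Strategy

Start
  │
  ▼
How imbalanced is the data?
  ├── Mild (below 1:10)
  │     └── Use class_weight='balanced'
  │
  ├── Moderate (1:10 to 1:100)
  │     ├── Enough minority samples? ── yes ──► SMOTE + class weights
  │     └── Enough minority samples? ── no ───► class weights + ensemble methods
  │
  └── Extreme (beyond 1:100)
        ├── Anomaly detection methods (One-Class SVM, Isolation Forest)
        ├── Cost-sensitive learning (custom high costs)
        └── Or: recast the problem as ranking (rather than classification)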

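8. Common Pitfalls and Best Practices

❌ Common mistakes

Over-sampling before cross-validation → data leakage (copies or synthetic neighbors of training samples end up in the validation folds)
Looking only at accuracy → misled by a high number
Over-sampling the test set → distorted evaluation
Using SMOTE for every problem → sometimes a simpler method works better

✅ Best practices

Use StratifiedKFold to keep the class ratio the same in every fold
Do the resampling inside a Pipeline to prevent data leakage
Prefer PR-AUC over ROC-AUC (under extreme imbalance)
Be business-driven: tune the decision threshold to business costs rather than using the default 0.5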

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Correct cross-validation: stratified folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# SMOTE sits inside the pipeline, so it is fitted on the training folds only
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', RandomForestClassifier(class_weight='balanced', random_state=42))
])

# Evaluate with cross-validation
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='average_precision')
print(f"PR-AUC: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
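The last best practice, tuning the threshold to business cost, can be sketched as a simple sweep over candidate thresholds. The helper names, the 1-versus-100 costs and the six validation scores below are illustrative assumptions. In practice `p_val` would come from `predict_proba` on a held-out validation set.

```python
import numpy as np

def total_cost(y_true, y_prob, threshold, cost_fp=1.0, cost_fn=100.0):
    """Cost of classifying as positive every sample with y_prob >= threshold."""
    y_pred = y_prob >= threshold
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return float(cost_fp * fp + cost_fn * fn)

def best_threshold(y_true, y_prob, **costs):
    """Return (threshold, cost) for the cheapest of 99 evenly spaced thresholds."""
    candidates = np.linspace(0.01, 0.99, 99)
    all_costs = [total_cost(y_true, y_prob, t, **costs) for t in candidates]
    best = int(np.argmin(all_costs))
    return float(candidates[best]), all_costs[best]

# Hypothetical validation labels and predicted fraud probabilities
y_val = np.array([0, 0, 0, 0, 1, 1])
p_val = np.array([0.05, 0.10, 0.205, 0.60, 0.305, 0.80])

threshold, cost = best_threshold(y_val, p_val)
cost_default = total_cost(y_val, p_val, 0.5)
print(f"best threshold = {threshold:.2f}, cost = {cost}")
print(f"cost at the default 0.5 threshold = {cost_default}")
```

On this toy data the sweep settles on 0.21 with a cost of 1.0, against 101.0 at the default threshold of 0.5, because lowering the threshold catches the fraud scored at 0.305.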

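9. Summary

Concept	Key point
Data imbalance	Large differences in class sizes; the norm in machine learning practice
Core harm	The model favors the majority class and neglects the minority class (which is often the more important one)
Solution approaches	Data level (resampling) + algorithm level (weights/costs) + evaluation level (the right metrics)
Recommended starting point	SMOTE + class_weight='balanced' + PR-AUC evaluation
Guiding principle	Be driven by business cost rather than chasing accuracy alone

Understanding data imbalance is a key step toward becoming a competent machine learning engineer. Remember: on imbalanced data a model with 99% accuracy can be completely useless, while a model with 80% accuracy but high recall on the minority class can be highly valuable.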