Machine Learning Data Preprocessing Workflow

Common preprocessing steps (step → option name)

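Remove constant columns → remove_constants
Remove low-variance columns → ignore_low_variance
Remove irrelevant features → remove_irrelevant
Impute missing numeric values → numeric_imputation
Impute missing categorical values → categorical_imputation
Advanced imputation → imputation_type
One-hot encoding → max_encoding_ohe
Label encoding → encoding_method
Normalization / standardization → normalize (used together with the next option)
Normalization method → normalize_method
Normality transformation → transformation (used together with the next option)
Transformation method → transformation_method
Remove outliers → remove_outliers (used together with the next option)
Outlier detection method → outliers_method
Remove collinear features → remove_multicollinearity
Collinearity threshold → multicollinearity_threshold
Feature selection → feature_selection (used together with the next option)
Feature selection method → feature_selection_method
PCA dimensionality reduction → pca
Handle class imbalance → fix_imbalance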
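
Several options in the list come in pairs: a switch that turns a step on, and a companion option that picks the method. The sketch below shows one way such a pair might be interpreted for normalization. The function name apply_normalize and the accepted method strings are assumptions made for illustration, not the API of any particular library.

```python
import numpy as np


def apply_normalize(values, normalize=False, normalize_method="zscore"):
    """Scale a 1-D array only when the switch is on; the method picks the formula.

    Hypothetical helper: the name and the accepted method strings are
    illustrative assumptions.
    """
    values = np.asarray(values, dtype=float)
    if not normalize:
        # Switch is off, so the method option is ignored entirely.
        return values
    if normalize_method == "zscore":
        # Zero mean, unit variance (assumes the values are not all identical).
        return (values - values.mean()) / values.std()
    if normalize_method == "minmax":
        # Rescale to the [0, 1] range (assumes max > min).
        return (values - values.min()) / (values.max() - values.min())
    raise ValueError(f"unknown normalize_method: {normalize_method!r}")


raw = [1.0, 2.0, 3.0]
unchanged = apply_normalize(raw)                       # switch off
scaled = apply_normalize(raw, True, "minmax")          # switch on, min-max
standardized = apply_normalize(raw, True, "zscore")    # switch on, z-score
```

The same pattern applies to the other pairs in the list: the method option has no effect unless its switch is enabled.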

Ordering principle: clean first → then encode → then reduce dimensionality → finally train the model.
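
The ordering can be made explicit with a scikit-learn Pipeline, where each stage only sees the output of the previous one. This is a minimal sketch under assumptions: the toy DataFrame, the column names, and the choice of LogisticRegression are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in numeric and categorical columns.
df = pd.DataFrame({
    "age": [22, 35, np.nan, 41, 29, 53, 38, 47],
    "income": [30.0, 52.0, 41.0, np.nan, 38.0, 75.0, 60.0, 66.0],
    "city": ["A", "B", "A", np.nan, "B", "C", "C", "A"],
    "label": [0, 0, 0, 1, 0, 1, 1, 1],
})
X, y = df.drop(columns="label"), df["label"]

# Clean: impute (and scale) numeric columns.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Clean, then encode: impute categorical columns before one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric, ["age", "income"]), ("cat", categorical, ["city"])],
    sparse_threshold=0,  # force a dense array so PCA can consume it
)

# Clean/encode -> reduce dimensionality -> model, in that order.
model = Pipeline([
    ("clean_encode", preprocess),
    ("reduce", PCA(n_components=2)),
    ("model", LogisticRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
```

Because every stage is fitted inside the pipeline, calling fit on the training data alone keeps test-set statistics out of the imputers, scaler, and PCA.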

Example for a classification model:

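import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def full_preprocessing_pipeline(df, target, test_size=0.3, random_state=42):
    """Split df, preprocess it step by step, and return X_train, X_test, y_train, y_test.

    The helper functions called below (remove_outliers, remove_constants, ...)
    are not defined in this snippet and must be provided separately.
    """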

    # ==================== 1. Train/test split ====================
    X = df.drop(target, axis=1)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    # ==================== 2. Remove outliers (done first) ====================
    # IQR bounds depend on the raw data distribution, so this step comes before any other preprocessing
    X_train, y_train, X_test, y_test = remove_outliers(
        X_train, y_train, X_test, y_test, method='iqr', threshold=1.5
    )

    # ==================== 3. Remove constant columns ====================
    X_train, X_test = remove_constants(X_train, X_test)

    # ==================== 4. Remove low-variance columns ====================
    X_train, X_test = ignore_low_variance(X_train, X_test, threshold=0.01)

    # ==================== 5. Missing-value handling (choose one) ====================
    num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

    # Option A: simple imputation (recommended, fast)
    X_train, X_test = numeric_imputation(X_train, X_test, num_cols, strategy='median')
    X_train, X_test = categorical_imputation(X_train, X_test, cat_cols, strategy='most_frequent')

    # Option B: advanced imputation, for more complex missingness patterns.
    # It only covers num_cols, so uncomment the line below and comment out just the
    # numeric_imputation line above; keep categorical_imputation.
    # X_train, X_test = advanced_imputation(X_train, X_test, num_cols, method='iterative')

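    # ==================== 6. Categorical encoding (choose one) ====================
    # Option A: one-hot encoding (recommended when the number of categories is small)
    X_train, X_test = onehot_encoding(X_train, X_test, cat_cols, max_categories=20)

    # Option B: label encoding (suited to ordinal categorical variables;
    # uncomment the line below and comment out the one above)
    # X_train, X_test = label_encoding(X_train, X_test, cat_cols)

    # Align columns so the training and test sets share the same columns.
    # fill_value=0 zero-fills columns that are missing from the test set instead of leaving NaN.
    X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)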

    # ==================== 7. Remove irrelevant features ====================
    # Drop features whose random-forest importance is 0
    X_train, X_test = remove_irrelevant_rf(X_train, y_train, X_test)

    # ==================== 8. Re-identify numeric columns ====================
    num_cols_updated = X_train.select_dtypes(include=[np.number]).columns.tolist()

    # ==================== 9. Remove collinear features ====================
    X_train, X_test = remove_multicollinearity(X_train, X_test, threshold=5)

    # ==================== 10. Re-identify numeric columns again ====================
    num_cols_final = X_train.select_dtypes(include=[np.number]).columns.tolist()

    # ==================== 11. Data transformation (toward normality) ====================
    # Transform first, then normalize (recommended order)
    # Yeo-Johnson accepts negative values; Box-Cox requires strictly positive values
    X_train, X_test = transformation(X_train, X_test, num_cols_final, method='yeo-johnson')

    # ==================== 12. Normalization / standardization ====================
    X_train, X_test = normalize(X_train, X_test, num_cols_final, method='zscore')

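    # ==================== 13. Feature selection (choose this or PCA) ====================
    # With many features, try feature selection first; to retain as much information as possible, use PCA
    # Feature selection is the default here (comment out the line below to skip it)
    X_train, X_test = feature_selection(X_train, y_train, X_test, method='rf', threshold='median')

    # ==================== 14. PCA dimensionality reduction (choose this or feature selection) ====================
    # PCA is usually unnecessary after feature selection; uncomment if there are still too many features
    # X_train, X_test = apply_pca(X_train, X_test, n_components=None, explained_variance=0.95)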

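    # ==================== 15. Handle class imbalance (last step, right before training) ====================
    # Resample the training set only; the test set keeps its original class distribution
    X_train, y_train = fix_imbalance_classification(X_train, y_train, method='smote')

    print(f"Final training set shape: {X_train.shape}")
    print(f"Final test set shape: {X_test.shape}")
    print(f"Training set class distribution:\n{pd.Series(y_train).value_counts()}")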

    return X_train, X_test, y_train, y_test
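
The function above calls helpers that this post does not define. Below is a minimal sketch of four of them, assuming pandas DataFrames as input and matching the call signatures used above. The bodies are illustrative guesses at the intended behavior, not the original implementations; the key design choice is that every statistic is learned from the training set and then applied to the test set.

```python
import numpy as np
import pandas as pd


def remove_constants(X_train, X_test):
    """Drop columns that hold a single value in the training set."""
    keep = [c for c in X_train.columns if X_train[c].nunique(dropna=False) > 1]
    return X_train[keep], X_test[keep]


def ignore_low_variance(X_train, X_test, threshold=0.01):
    """Drop numeric columns whose training-set variance is at or below threshold.

    Variance is scale-dependent, so the threshold is only meaningful
    relative to each column's units.
    """
    variances = X_train.select_dtypes(include=[np.number]).var()
    low = variances[variances <= threshold].index
    return X_train.drop(columns=low), X_test.drop(columns=low)


def numeric_imputation(X_train, X_test, num_cols, strategy="median"):
    """Fill numeric gaps in both sets with a statistic learned from the training set."""
    if strategy not in ("median", "mean"):
        raise ValueError(f"unsupported strategy: {strategy!r}")
    X_train, X_test = X_train.copy(), X_test.copy()
    if num_cols:
        stats = X_train[num_cols]
        fill = stats.median() if strategy == "median" else stats.mean()
        X_train[num_cols] = X_train[num_cols].fillna(fill)
        X_test[num_cols] = X_test[num_cols].fillna(fill)
    return X_train, X_test


def categorical_imputation(X_train, X_test, cat_cols, strategy="most_frequent"):
    """Fill categorical gaps in both sets with the most frequent training value."""
    if strategy != "most_frequent":
        raise ValueError(f"unsupported strategy: {strategy!r}")
    X_train, X_test = X_train.copy(), X_test.copy()
    if cat_cols:
        # mode() can return several rows on ties; the first row is used.
        fill = X_train[cat_cols].mode().iloc[0]
        X_train[cat_cols] = X_train[cat_cols].fillna(fill)
        X_test[cat_cols] = X_test[cat_cols].fillna(fill)
    return X_train, X_test
```

The remaining helpers (remove_outliers, onehot_encoding, remove_irrelevant_rf, and so on) would follow the same fit-on-train, apply-to-both pattern.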