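Common data preprocessing steps
Remove constant columns → remove_constants
Remove low-variance columns → ignore_low_variance
Remove irrelevant features → remove_irrelevant
Impute missing numeric values → numeric_imputation
Impute missing categorical values → categorical_imputation
Advanced missing-value imputation → imputation_type
One-hot encoding → max_encoding_ohe
Label encoding → encoding_method
Normalization / standardization → normalize (used together with the option below)
Normalization method → normalize_method
Transform toward a normal distribution → transformation (used together with the option below)
Transformation method → transformation_method
Remove outliers → remove_outliers (used together with the option below)
Outlier detection method → outliers_method
Remove collinear features → remove_multicollinearity
Collinearity threshold → multicollinearity_threshold
Feature selection → feature_selection (used together with the option below)
Feature selection method → feature_selection_method
PCA dimensionality reduction → pca
Fix class imbalance → fix_imbalance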
Ordering principle: clean first → then encode → then reduce dimensionality → model last
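Several options in the list come in pairs: a boolean switch turns a step on, and a companion option picks the technique. As a minimal sketch of that pattern, the pairs can be kept in a plain dictionary and checked before use. The option names come from the list; the values are borrowed from the example function in these notes and are not the defaults of any particular library. `check_paired_options` and `active_steps` are hypothetical helpers introduced only for illustration.

```python
# Illustrative values only: option names come from the list,
# the values are assumptions chosen for this example.
preprocessing_options = {
    "normalize": True,
    "normalize_method": "zscore",
    "transformation": True,
    "transformation_method": "yeo-johnson",
    "remove_outliers": False,
    "outliers_method": "iqr",
    "feature_selection": True,
    "feature_selection_method": "rf",
}

# Each switch and the method option it is used together with.
PAIRED_OPTIONS = {
    "normalize": "normalize_method",
    "transformation": "transformation_method",
    "remove_outliers": "outliers_method",
    "feature_selection": "feature_selection_method",
}


def check_paired_options(options):
    """Return the method options missing for switches that are enabled (hypothetical helper)."""
    missing = []
    for switch, method_key in PAIRED_OPTIONS.items():
        if options.get(switch) and not options.get(method_key):
            missing.append(method_key)
    return missing


def active_steps(options):
    """Return {switch: method} for every switch that is turned on (hypothetical helper)."""
    return {s: options[m] for s, m in PAIRED_OPTIONS.items() if options.get(s)}
```

Calling `check_paired_options(preprocessing_options)` before running a pipeline catches a switch that was enabled without its method, which is the easiest mistake to make with paired options.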
Example for a classification model:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def full_preprocessing_pipeline(df, target, test_size=0.3, random_state=42):
    """Split df, preprocess both parts, and return X_train, X_test, y_train, y_test."""
    # NOTE: the helpers called below (remove_outliers, remove_constants, ...)
    # are user-defined and must be in scope before this function is called.

    # ==================== 1. Split the data ====================
    X = df.drop(target, axis=1)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    # ==================== 2. Remove outliers (done first) ====================
    # IQR depends on the raw data distribution, so this runs before any other step
    X_train, y_train, X_test, y_test = remove_outliers(
        X_train, y_train, X_test, y_test, method='iqr', threshold=1.5
    )

    # ==================== 3. Remove constant columns ====================
    X_train, X_test = remove_constants(X_train, X_test)

    # ==================== 4. Remove low-variance columns ====================
    X_train, X_test = ignore_low_variance(X_train, X_test, threshold=0.01)

    # ==================== 5. Handle missing values (pick one) ====================
    num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
    # Option A: simple imputation (recommended, fast)
    X_train, X_test = numeric_imputation(X_train, X_test, num_cols, strategy='median')
    X_train, X_test = categorical_imputation(X_train, X_test, cat_cols, strategy='most_frequent')
    # Option B: advanced imputation (for complex missingness: uncomment the line
    # below and comment out the two lines above)
    # X_train, X_test = advanced_imputation(X_train, X_test, num_cols, method='iterative')

    # ==================== 6. Encode categorical variables (pick one) ====================
    # Option A: one-hot encoding (recommended when the number of categories is small)
    X_train, X_test = onehot_encoding(X_train, X_test, cat_cols, max_categories=20)
    # Option B: label encoding (suited to ordinal categorical variables: uncomment
    # the line below and comment out the line above)
    # X_train, X_test = label_encoding(X_train, X_test, cat_cols)
    # Align columns so the test set has exactly the training columns;
    # fill_value=0 makes one-hot columns absent from the test set 0 instead of NaN
    X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

    # ==================== 7. Remove irrelevant features ====================
    # Drop features whose random-forest importance is 0
    X_train, X_test = remove_irrelevant_rf(X_train, y_train, X_test)

    # ==================== 8. Re-identify numeric columns ====================
    num_cols_updated = X_train.select_dtypes(include=[np.number]).columns.tolist()

    # ==================== 9. Remove collinear features ====================
    # threshold=5 suggests a VIF cutoff (a correlation cutoff would lie between 0 and 1)
    X_train, X_test = remove_multicollinearity(X_train, X_test, threshold=5)

    # ==================== 10. Re-identify numeric columns again ====================
    num_cols_final = X_train.select_dtypes(include=[np.number]).columns.tolist()

    # ==================== 11. Transform the data (toward normality) ====================
    # Transform first, then normalize (recommended order)
    # Yeo-Johnson supports negative values; Box-Cox supports only positive values
    X_train, X_test = transformation(X_train, X_test, num_cols_final, method='yeo-johnson')

    # ==================== 12. Normalize / standardize ====================
    X_train, X_test = normalize(X_train, X_test, num_cols_final, method='zscore')

    # ==================== 13. Feature selection (pick this or PCA) ====================
    # With many features, try feature selection first; if preserving as much
    # information as possible is the priority, use PCA instead
    # Feature selection is the default here (comment the line out to skip it)
    X_train, X_test = feature_selection(X_train, y_train, X_test, method='rf', threshold='median')

    # ==================== 14. PCA (pick this or feature selection) ====================
    # PCA is usually unnecessary after feature selection; uncomment if many features remain
    # X_train, X_test = apply_pca(X_train, X_test, n_components=None, explained_variance=0.95)

    # ==================== 15. Fix class imbalance (last step, right before training) ====================
    # Resample the training set only; the test set keeps its original distribution
    X_train, y_train = fix_imbalance_classification(X_train, y_train, method='smote')

    print(f"Final training set shape: {X_train.shape}")
    print(f"Final test set shape: {X_test.shape}")
    print(f"Training set class distribution:\n{pd.Series(y_train).value_counts()}")
    return X_train, X_test, y_train, y_test
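The pipeline calls helpers whose bodies do not appear in these notes. The sketch below shows one way four of them might look. The names and signatures are taken from the calls above, but the bodies are assumptions rather than the original implementations. Each helper follows the fit-on-train, apply-to-test rule: every statistic is computed on the training set only. Returning the test set unchanged from `remove_outliers` is likewise an assumed design choice.

```python
import numpy as np
import pandas as pd


def remove_outliers(X_train, y_train, X_test, y_test, method="iqr", threshold=1.5):
    """Drop training rows outside the IQR fences; the test set is returned unchanged (assumption)."""
    if method != "iqr":
        raise ValueError(f"unsupported method: {method}")
    num = X_train.select_dtypes(include=[np.number])
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    # NaN compares False on both sides, so missing values never flag a row.
    outlier = (num.lt(lower, axis=1) | num.gt(upper, axis=1)).any(axis=1)
    return X_train[~outlier], y_train[~outlier], X_test, y_test


def remove_constants(X_train, X_test):
    """Drop columns that hold a single distinct value in the training set."""
    keep = [c for c in X_train.columns if X_train[c].nunique(dropna=False) > 1]
    return X_train[keep], X_test[keep]


def ignore_low_variance(X_train, X_test, threshold=0.01):
    """Drop numeric columns whose training-set variance is below `threshold`."""
    variances = X_train.select_dtypes(include=[np.number]).var()
    drop = variances[variances < threshold].index.tolist()
    return X_train.drop(columns=drop), X_test.drop(columns=drop)


def numeric_imputation(X_train, X_test, num_cols, strategy="median"):
    """Fill missing numeric values with statistics computed on the training set."""
    if not num_cols:
        return X_train, X_test
    X_train, X_test = X_train.copy(), X_test.copy()
    if strategy == "median":
        fill = X_train[num_cols].median()
    elif strategy == "mean":
        fill = X_train[num_cols].mean()
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    # `fill` is a Series indexed by column name, so each column gets its own value.
    X_train[num_cols] = X_train[num_cols].fillna(fill)
    X_test[num_cols] = X_test[num_cols].fillna(fill)
    return X_train, X_test
```

Computing the fill values and IQR fences from the training set alone keeps information from the test set out of the fitted statistics, which is why every helper takes both frames but reads only `X_train` when deciding what to drop or fill.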