Feature Engineering Techniques and Best Practices

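Feature engineering is a core part of any machine learning project: it uses domain knowledge to derive new features from raw data so that learning algorithms can pick up the patterns in that data more easily. In practice, better features often improve performance more than switching to a more complex model does. As a well-known saying puts it: "Data and features determine the upper bound of a model; algorithms merely approach that bound."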

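1. Overview of Feature Engineering

1.1 What Is Feature Engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem, with the goal of improving model performance. Good feature engineering can:

  • Improve predictive accuracy
  • Reduce training time
  • Lower the risk of overfitting
  • Make the model easier to interpret
  • Help the model converge faster

1.2 Why Feature Engineering Matters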

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Build a synthetic dataset: predicting house prices
np.random.seed(42)
n_samples = 1000

# Raw features
data = {
    'total_area': np.random.uniform(50, 200, n_samples),
    'num_bedrooms': np.random.randint(1, 6, n_samples),
    'num_bathrooms': np.random.randint(1, 4, n_samples),
    'age': np.random.randint(0, 50, n_samples),
    'distance_to_city_center': np.random.uniform(1, 30, n_samples),
    'has_garden': np.random.choice([0, 1], n_samples),
    'has_parking': np.random.choice([0, 1], n_samples),
    'floor': np.random.randint(1, 30, n_samples)
}

df = pd.DataFrame(data)

# Generate the label (price) from the raw features
df['price'] = (
    df['total_area'] * 5000 +
    df['num_bedrooms'] * 20000 +
    df['num_bathrooms'] * 15000 +
    df['age'] * -500 +
    df['distance_to_city_center'] * -1000 +
    df['has_garden'] * 50000 +
    df['has_parking'] * 30000 +
    np.random.normal(0, 50000, n_samples)  # add noise
)

# Bucket the price into low / medium / high
df['price_category'] = pd.qcut(df['price'], q=3, labels=['low', 'medium', 'high'])

# Separate features and label
X = df.drop(['price', 'price_category'], axis=1)
y = df['price_category']

# 1. Train on the raw features
model_original = RandomForestClassifier(n_estimators=100, random_state=42)
scores_original = cross_val_score(model_original, X, y, cv=5, scoring='accuracy')
print(f"Accuracy with raw features: {scores_original.mean():.4f} (+/- {scores_original.std() * 2:.4f})")

# 2. Feature engineering: create new features
X_engineered = X.copy()

# Ratio features: area per bedroom and per bathroom
# (both counts are at least 1 here, so there is no division by zero)
X_engineered['area_per_bedroom'] = X_engineered['total_area'] / X_engineered['num_bedrooms']
X_engineered['area_per_bathroom'] = X_engineered['total_area'] / X_engineered['num_bathrooms']

# Combined room count and area per room
X_engineered['total_rooms'] = X_engineered['num_bedrooms'] + X_engineered['num_bathrooms']
X_engineered['area_per_room'] = X_engineered['total_area'] / X_engineered['total_rooms']

# Bucket the house age; include_lowest=True keeps age == 0 in the first bin
# (without it, brand-new houses would become NaN)
X_engineered['age_category'] = pd.cut(X_engineered['age'],
                                      bins=[0, 5, 15, 30, float('inf')],
                                      labels=['new', 'young', 'middle', 'old'],
                                      include_lowest=True)

# Bucket the distance to the city center
X_engineered['distance_category'] = pd.cut(X_engineered['distance_to_city_center'],
                                          bins=[0, 5, 10, 20, float('inf')],
                                          labels=['very_close', 'close', 'medium', 'far'])

# High-rise flag
X_engineered['is_high_rise'] = (X_engineered['floor'] >= 10).astype(int)

# Has both a garden and a parking space
X_engineered['has_both_garden_parking'] = (X_engineered['has_garden'] & X_engineered['has_parking']).astype(int)

# One-hot encode the categorical features
X_engineered = pd.get_dummies(X_engineered, columns=['age_category', 'distance_category'], drop_first=True)

# 3. Train on the engineered features
model_engineered = RandomForestClassifier(n_estimators=100, random_state=42)
scores_engineered = cross_val_score(model_engineered, X_engineered, y, cv=5, scoring='accuracy')
print(f"Accuracy with engineered features: {scores_engineered.mean():.4f} (+/- {scores_engineered.std() * 2:.4f})")

# The difference can be small or even negative on this synthetic data,
# because the label is a linear function of the raw features
print(f"\nChange from feature engineering: {(scores_engineered.mean() - scores_original.mean()) * 100:+.2f} percentage points")

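2. Numerical Feature Processing

2.1 Feature Scaling

Feature scaling brings features with different ranges onto comparable scales. This matters most for algorithms that rely on distances or gradient-based optimization, such as k-nearest neighbors, SVMs, and linear models trained with gradient descent; tree-based models are largely insensitive to it.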

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import matplotlib.pyplot as plt

# Create features on very different scales
np.random.seed(42)
data = {
    'height_cm': np.random.normal(170, 10, 1000),  # height: around 170 cm
    'weight_kg': np.random.normal(65, 8, 1000),    # weight: around 65 kg
    'salary': np.random.lognormal(10, 0.5, 1000),  # salary: log-normal distribution
    'age': np.random.normal(30, 8, 1000)           # age: around 30 years
}

df = pd.DataFrame(data)

print("Raw data statistics:")
print(df.describe())

# 1. Standardization (z-score): zero mean, unit variance
scaler_standard = StandardScaler()
df_standard = pd.DataFrame(scaler_standard.fit_transform(df), columns=df.columns)

print("\nStatistics after standardization:")
print(df_standard.describe())

# 2. Min-max scaling (normalization): maps each feature to [0, 1]
scaler_minmax = MinMaxScaler()
df_minmax = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)

print("\nStatistics after min-max scaling:")
print(df_minmax.describe())

# 3. Robust scaling: uses the median and IQR, so it is less sensitive to outliers
scaler_robust = RobustScaler()
df_robust = pd.DataFrame(scaler_robust.fit_transform(df), columns=df.columns)

print("\nStatistics after robust scaling:")
print(df_robust.describe())

# Visual comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Raw data
axes[0, 0].boxplot([df[col] for col in df.columns])
axes[0, 0].set_xticklabels(df.columns, rotation=45)
axes[0, 0].set_title('Raw data')

# Standardization
axes[0, 1].boxplot([df_standard[col] for col in df_standard.columns])
axes[0, 1].set_xticklabels(df_standard.columns, rotation=45)
axes[0, 1].set_title('Standardization')

# Min-max scaling
axes[1, 0].boxplot([df_minmax[col] for col in df_minmax.columns])
axes[1, 0].set_xticklabels(df_minmax.columns, rotation=45)
axes[1, 0].set_title('Min-max scaling')

# Robust scaling
axes[1, 1].boxplot([df_robust[col] for col in df_robust.columns])
axes[1, 1].set_xticklabels(df_robust.columns, rotation=45)
axes[1, 1].set_title('Robust scaling')

plt.tight_layout()
plt.show()
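
The scaling example fits each scaler on the full dataset, which is fine for inspecting distributions. When a model is later evaluated on held-out data, a common precaution is to learn the scaling statistics from the training split only and reuse them on the test split. The following is a minimal sketch of that pattern; `X_demo` and the split size are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column data: height (cm) and weight (kg)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[170.0, 65.0], scale=[10.0, 8.0], size=(200, 2))

X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std come from training rows only
X_test_scaled = scaler.transform(X_test)        # the same training statistics are reused

print("Train column means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Test column means after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The test-set means are close to zero but not exactly zero, which is expected: the test rows were never used to estimate the statistics.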
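
2.2 Nonlinear Transformations

Some models, for example linear models and Gaussian naive Bayes, tend to work better when features are roughly symmetric or normally distributed. Nonlinear transformations can reduce skewness and bring a feature closer to that shape.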

from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Create skewed data
np.random.seed(42)
skewed_data = np.random.exponential(scale=1.0, size=1000)
df_skewed = pd.DataFrame({'value': skewed_data})

# 1. Log transform (for right-skewed data)
df_skewed['log_transform'] = np.log1p(df_skewed['value'])  # log1p avoids log(0)

# 2. Square-root transform (for right-skewed, non-negative data)
df_skewed['sqrt_transform'] = np.sqrt(df_skewed['value'])

# 3. Box-Cox transform (estimates the power parameter; requires strictly positive data)
df_skewed['boxcox_transform'], _ = stats.boxcox(df_skewed['value'])

# 4. Yeo-Johnson transform (an extension of Box-Cox that also handles zero and negative values)
yeo_johnson = PowerTransformer(method='yeo-johnson')
# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning
df_skewed['yeojohnson_transform'] = yeo_johnson.fit_transform(df_skewed[['value']]).ravel()

# Visualize the effect of each transform
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Raw data
axes[0, 0].hist(df_skewed['value'], bins=50, edgecolor='black')
axes[0, 0].set_title(f'Raw data (skewness: {df_skewed["value"].skew():.2f})')

# Log transform
axes[0, 1].hist(df_skewed['log_transform'], bins=50, edgecolor='black')
axes[0, 1].set_title(f'Log transform (skewness: {df_skewed["log_transform"].skew():.2f})')

# Box-Cox transform
axes[1, 0].hist(df_skewed['boxcox_transform'], bins=50, edgecolor='black')
axes[1, 0].set_title(f'Box-Cox transform (skewness: {df_skewed["boxcox_transform"].skew():.2f})')

# Yeo-Johnson transform
axes[1, 1].hist(df_skewed['yeojohnson_transform'], bins=50, edgecolor='black')
axes[1, 1].set_title(f'Yeo-Johnson transform (skewness: {df_skewed["yeojohnson_transform"].skew():.2f})')

plt.tight_layout()
plt.show()

# Choose a transform
def select_transformation(data):
    """
    Pick a transform based on the skewness of the data (a rule of thumb, not a guarantee)
    """
    skewness = stats.skew(data)

    if abs(skewness) < 0.5:
        print(f"skewness={skewness:.4f}: close to symmetric, no transform needed")
        return data
    elif skewness > 0:
        print(f"skewness={skewness:.4f}: right-skewed, trying Box-Cox or a log transform")
        if np.all(data > 0):
            transformed, _ = stats.boxcox(data)
            return transformed
        else:
            # Shift so the minimum becomes 0, then apply log1p
            return np.log1p(data - data.min())
    else:
        print(f"skewness={skewness:.4f}: left-skewed, trying a square transform")
        # Squaring is only order-preserving for non-negative data
        return data ** 2

selected_transform = select_transformation(skewed_data)
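
One practical reason to prefer scikit-learn's `PowerTransformer` over calling `scipy.stats.boxcox` directly is that the fitted object stores the estimated power parameter and can map values back to the original scale. The sketch below illustrates this under stated assumptions: the data is synthetic, and the small positive offset is added only so that Box-Cox's strict-positivity requirement holds.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Right-skewed, strictly positive sample (the offset is an assumption for this demo)
x = rng.exponential(scale=1.0, size=(500, 1)) + 0.01

pt = PowerTransformer(method="box-cox")  # standardizes the output by default
x_t = pt.fit_transform(x)                # estimate lambda and transform
x_back = pt.inverse_transform(x_t)       # map back to the original scale

skew_before = stats.skew(x.ravel())
skew_after = stats.skew(x_t.ravel())
print(f"estimated lambda: {pt.lambdas_[0]:.3f}")
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the transformer is a fitted object, the same lambda can be applied to new data with `pt.transform`, rather than re-estimating it on every batch.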
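
2.3 Binning (Discretization)

Binning converts a continuous variable into a discrete one. It can dampen noise, limit the influence of outliers, and sometimes capture nonlinear relationships.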

from sklearn.preprocessing import KBinsDiscretizer, Binarizer

# Create age data; clip at 0 because a normal sample can produce negative ages
np.random.seed(42)
ages = np.clip(np.random.normal(35, 12, 1000), 0, None)
df_ages = pd.DataFrame({'age': ages})

# 1. Equal-width binning
df_ages['age_bin_equal_width'] = pd.cut(df_ages['age'], bins=5, labels=False)

# 2. Equal-frequency binning
df_ages['age_bin_equal_freq'] = pd.qcut(df_ages['age'], q=5, labels=False)

# 3. Binning with custom edges; include_lowest=True keeps age == 0 in the first bin
bins = [0, 18, 30, 45, 60, float('inf')]
labels = ['minor', 'young', 'middle_aged', 'late_middle_aged', 'senior']
df_ages['age_category'] = pd.cut(df_ages['age'], bins=bins, labels=labels, include_lowest=True)

# 4. KBinsDiscretizer (returns a 2-D array, so flatten it before assigning)
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
df_ages['age_bin_kbd'] = kbd.fit_transform(df_ages[['age']]).ravel().astype(int)

# 5. Binarization: 1 if age > 18, else 0
binarizer = Binarizer(threshold=18)
df_ages['is_adult'] = binarizer.fit_transform(df_ages[['age']]).ravel().astype(int)

# Inspect the binning result
print("Binning result:")
print(df_ages[['age', 'age_category']].head(10))

print("\nDistribution across age groups:")
print(df_ages['age_category'].value_counts().sort_index())

# Visualize the binning
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original distribution with the bin edges
axes[0].hist(df_ages['age'], bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(18, color='red', linestyle='--', label='Age 18')
axes[0].axvline(30, color='orange', linestyle='--', label='Age 30')
axes[0].axvline(45, color='yellow', linestyle='--', label='Age 45')
axes[0].axvline(60, color='green', linestyle='--', label='Age 60')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Age distribution')
axes[0].legend()

# Distribution after binning (sort_index keeps the bins in age order)
df_ages['age_category'].value_counts().sort_index().plot(kind='bar', ax=axes[1], edgecolor='black')
axes[1].set_xlabel('Age group')
axes[1].set_ylabel('Count')
axes[1].set_title('Age group distribution')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
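
Data-driven bin edges (equal-frequency bins in particular) are themselves learned from data, so new rows should be assigned to the edges computed on the training data rather than re-binned. A minimal sketch, assuming two small hypothetical series `train_age` and `new_age`:

```python
import numpy as np
import pandas as pd

train_age = pd.Series([18, 22, 25, 31, 38, 44, 52, 67])
new_age = pd.Series([20, 40, 90])  # 90 lies outside the training range

# Learn quartile edges on the training data only
_, edges = pd.qcut(train_age, q=4, retbins=True)

# Open the outer edges so unseen extremes still fall into a bin
edges[0], edges[-1] = -np.inf, np.inf

train_bins = pd.cut(train_age, bins=edges, labels=False)
new_bins = pd.cut(new_age, bins=edges, labels=False)

print("edges:", edges)
print("new_bins:", new_bins.tolist())
```

Without the open outer edges, the value 90 would fall outside every interval and come back as NaN.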

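3. Categorical Feature Processing

3.1 Label Encoding

Label encoding maps each category to an integer. It suits ordinal categories, provided the integer order matches the category order. scikit-learn's LabelEncoder assigns codes in sorted order of the labels, so an explicit mapping is needed whenever the order carries meaning.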

from sklearn.preprocessing import LabelEncoder

# Create sample data (every column must have the same length: 600 rows)
data = {
    'education': ['primary_school', 'junior_high', 'senior_high', 'bachelor', 'master', 'doctorate'] * 100,
    'city': ['Beijing', 'Shanghai', 'Shenzhen', 'Guangzhou', 'Hangzhou'] * 120,
    'grade': ['A', 'B', 'C', 'D', 'F'] * 120
}
df = pd.DataFrame(data)

# 1. LabelEncoder: codes follow the sorted order of the labels,
#    which here does NOT match the order of education levels
le_education = LabelEncoder()
df['education_encoded'] = le_education.fit_transform(df['education'])

print("Education level code mapping:")
for i, label in enumerate(le_education.classes_):
    print(f"{label}: {i}")

# 2. Custom ordinal encoding that preserves the real order
education_order = ['primary_school', 'junior_high', 'senior_high', 'bachelor', 'master', 'doctorate']
education_map = {edu: i for i, edu in enumerate(education_order)}
df['education_ordered'] = df['education'].map(education_map)

print("\nOrdered education level encoding:")
print(df[['education', 'education_ordered']].head())

# 3. Grade encoding
grade_map = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'F': 1}
df['grade_score'] = df['grade'].map(grade_map)

print("\nGrade encoding:")
print(df[['grade', 'grade_score']].head())
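
For ordinal features, scikit-learn's `OrdinalEncoder` accepts an explicit category order and can be told how to code values it has never seen, which the dictionary-based mapping does not handle by itself. The sketch below assumes a hypothetical list `levels` and a tiny demo frame; the choice of `-1` for unknown values is also an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered levels, lowest to highest
levels = ["primary", "secondary", "bachelor", "master", "doctorate"]
df_demo = pd.DataFrame({"education": ["master", "primary", "doctorate", "bachelor"]})

enc = OrdinalEncoder(
    categories=[levels],                 # one list per encoded column
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # code used for them instead
)
codes = enc.fit_transform(df_demo[["education"]]).ravel()

# A category that is not in `levels`
unseen = enc.transform(pd.DataFrame({"education": ["postdoc"]})).ravel()

print(codes.tolist())
print(unseen.tolist())
```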

3.2 One-Hot Encoding

One-hot encoding creates one binary feature per category. It suits nominal (unordered) categories.

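from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

# 1. Using pandas get_dummies
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city')

print("Data after one-hot encoding (first 5 rows):")
print(df_onehot.head())

# 2. Using scikit-learn's OneHotEncoder
# sparse_output replaces the old `sparse` argument (scikit-learn >= 1.2);
# drop='first' avoids perfect multicollinearity
encoder = OneHotEncoder(sparse_output=False, drop='first')
city_encoded = encoder.fit_transform(df[['city']])

print("\nOneHotEncoder result:")
print(city_encoded[:5])

# 3. Handling high-cardinality categories (hashing trick)
# Ten cities stand in for a much larger set of categories
cities = ['Beijing', 'Shanghai', 'Shenzhen', 'Guangzhou', 'Hangzhou',
          'Nanjing', 'Wuhan', 'Chengdu', 'Chongqing', 'Tianjin'] * 100
df_cities = pd.DataFrame({'city': cities})

# With input_type='string', each sample must be an iterable of strings,
# so wrap every city name in a one-element list
fh = FeatureHasher(n_features=8, input_type='string')
hashed_features = fh.transform([[city] for city in df_cities['city']])

print("\nHashing trick result (first 5 rows):")
print(hashed_features[:5].toarray())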
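
3.3 Target Encoding

Target encoding replaces each category with a statistic of the target variable for that category, such as the mean. It is particularly useful for high-cardinality features, but the statistic must be computed without using the rows being encoded; otherwise the target leaks into the features.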

from sklearn.model_selection import KFold

# Create sample data
np.random.seed(42)
n_samples = 10000

data = {
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_samples),
    'city': np.random.choice(['Beijing', 'Shanghai', 'Shenzhen', 'Guangzhou', 'Hangzhou', 'Nanjing'], n_samples),
    'target': np.random.randint(0, 2, n_samples)
}

df_target = pd.DataFrame(data)

# 1. Naive target encoding
# (each row's own target contributes to its encoding, so this leaks the target)
category_means = df_target.groupby('category')['target'].mean()
df_target['category_target_encoded'] = df_target['category'].map(category_means)

print("Target mean per category:")
print(category_means)

print("\nTarget encoding result:")
print(df_target[['category', 'category_target_encoded']].head())

# 2. Smoothed target encoding (reduces overfitting on rare categories)
def smooth_target_encoding(df, cat_col, target_col, alpha=10):
    """
    Smoothed target encoding
    alpha: smoothing strength; larger values pull the encoding toward the global mean
    """
    # Global mean
    global_mean = df[target_col].mean()

    # Per-category statistics
    category_stats = df.groupby(cat_col)[target_col].agg(['mean', 'count'])

    # Weighted average of the category mean and the global mean
    smoothed = (category_stats['mean'] * category_stats['count'] +
                global_mean * alpha) / (category_stats['count'] + alpha)

    return smoothed

smoothed_encoding = smooth_target_encoding(df_target, 'category', 'target')
df_target['category_smoothed_encoded'] = df_target['category'].map(smoothed_encoding)

print("\nSmoothed target encoding:")
print(smoothed_encoding)

# 3. K-fold target encoding (avoids leakage within the training data)
def kfold_target_encoding(df, cat_col, target_col, n_folds=5, alpha=10):
    """
    K-fold target encoding: each row is encoded with statistics
    computed on the other folds only
    """
    df_encoded = df.copy()
    encoded_col = f'{cat_col}_encoded'
    df_encoded[encoded_col] = np.nan
    col_pos = df_encoded.columns.get_loc(encoded_col)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for train_idx, val_idx in kf.split(df):
        train_df = df.iloc[train_idx]  # folds used to compute the statistics
        val_df = df.iloc[val_idx]      # fold being encoded

        # Smoothed encoding computed on the training folds
        global_mean = train_df[target_col].mean()
        category_stats = train_df.groupby(cat_col)[target_col].agg(['mean', 'count'])
        smoothed = (category_stats['mean'] * category_stats['count'] +
                    global_mean * alpha) / (category_stats['count'] + alpha)

        # KFold yields positions, so assign with iloc (works for any index);
        # categories absent from the training folds fall back to the global mean
        df_encoded.iloc[val_idx, col_pos] = (
            val_df[cat_col].map(smoothed).fillna(global_mean).to_numpy()
        )

    return df_encoded

df_kfold_encoded = kfold_target_encoding(df_target, 'city', 'target')
print("\nK-fold target encoding result:")
print(df_kfold_encoded[['city', 'city_encoded']].head())
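
The smoothing formula is a weighted average of the category mean and the global mean, with weights `n` (category count) and `alpha`. A small worked example makes the effect of the count visible; the helper name `smoothed_mean` and the numbers are illustrative only.

```python
def smoothed_mean(cat_mean, n, global_mean, alpha):
    """Weighted average of the category mean (weight n) and the global mean (weight alpha)."""
    return (cat_mean * n + global_mean * alpha) / (n + alpha)

# Two categories with the same raw mean of 1.0 but very different support
rare = smoothed_mean(cat_mean=1.0, n=2, global_mean=0.5, alpha=10)       # (2 + 5) / 12
common = smoothed_mean(cat_mean=1.0, n=2000, global_mean=0.5, alpha=10)  # (2000 + 5) / 2010

print(f"rare category:   {rare:.4f}")
print(f"common category: {common:.4f}")
```

The rare category is pulled most of the way toward the global mean of 0.5, while the well-supported category stays close to its own mean.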

4. Time Series Features

Time series data calls for dedicated feature engineering methods that capture time-dependent patterns.

# Create time series data
dates = pd.date_range(start='2020-01-01', end='2023-12-31', freq='D')
np.random.seed(42)

t = np.arange(len(dates))
values = (
    100 +                              # base level
    t * 0.01 +                         # trend
    10 * np.sin(2 * np.pi * t / 365) + # yearly seasonality
    5 * np.sin(2 * np.pi * t / 7) +    # weekly seasonality
    np.random.normal(0, 2, len(dates)) # random noise
)

df_time = pd.DataFrame({'date': dates, 'value': values})

# 1. Basic calendar features
df_time['year'] = df_time['date'].dt.year
df_time['month'] = df_time['date'].dt.month
df_time['day'] = df_time['date'].dt.day
df_time['dayofweek'] = df_time['date'].dt.dayofweek
df_time['dayofyear'] = df_time['date'].dt.dayofyear
df_time['weekofyear'] = df_time['date'].dt.isocalendar().week.astype(int)
df_time['quarter'] = df_time['date'].dt.quarter

# 2. Weekend flag (public holidays would need a separate calendar)
df_time['is_weekend'] = (df_time['dayofweek'] >= 5).astype(int)

# 3. Cyclical features (sine/cosine encoding)
df_time['month_sin'] = np.sin(2 * np.pi * df_time['month'] / 12)
df_time['month_cos'] = np.cos(2 * np.pi * df_time['month'] / 12)
df_time['dayofweek_sin'] = np.sin(2 * np.pi * df_time['dayofweek'] / 7)
df_time['dayofweek_cos'] = np.cos(2 * np.pi * df_time['dayofweek'] / 7)

# 4. Lag features
for lag in [1, 7, 30]:
    df_time[f'lag_{lag}'] = df_time['value'].shift(lag)

# The features below are meant to predict `value`, so they are computed on the
# series shifted by one day; otherwise each window would include the target itself
past = df_time['value'].shift(1)

# 5. Rolling-window features
df_time['rolling_mean_7'] = past.rolling(window=7).mean()
df_time['rolling_std_7'] = past.rolling(window=7).std()
df_time['rolling_max_7'] = past.rolling(window=7).max()
df_time['rolling_min_7'] = past.rolling(window=7).min()

df_time['rolling_mean_30'] = past.rolling(window=30).mean()
df_time['rolling_std_30'] = past.rolling(window=30).std()

# 6. Expanding-window features
df_time['expanding_mean'] = past.expanding().mean()
df_time['expanding_max'] = past.expanding().max()

# 7. Difference features
df_time['diff_1'] = past.diff(1)
df_time['diff_7'] = past.diff(7)
df_time['pct_change_1'] = past.pct_change(1)
df_time['pct_change_7'] = past.pct_change(7)

# 8. Time-gap feature (useful when observations are irregularly spaced)
df_time['time_diff'] = df_time['date'].diff().dt.days

# Inspect the features
print("Sample time series features:")
print(df_time[['date', 'value', 'month', 'dayofweek', 'is_weekend',
               'lag_1', 'rolling_mean_7', 'diff_1']].head(10))

# Visualize how the time features relate to the target
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Monthly pattern
monthly_avg = df_time.groupby('month')['value'].mean()
axes[0, 0].plot(monthly_avg.index, monthly_avg.values, marker='o')
axes[0, 0].set_xlabel('Month')
axes[0, 0].set_ylabel('Mean value')
axes[0, 0].set_title('Monthly pattern')
axes[0, 0].grid(True)

# Weekly pattern
weekly_avg = df_time.groupby('dayofweek')['value'].mean()
axes[0, 1].plot(weekly_avg.index, weekly_avg.values, marker='o')
axes[0, 1].set_xlabel('Day of week')
axes[0, 1].set_ylabel('Mean value')
axes[0, 1].set_title('Weekly pattern')
axes[0, 1].set_xticks(range(7))
axes[0, 1].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
axes[0, 1].grid(True)

# Weekend vs. weekday
weekend_avg = df_time.groupby('is_weekend')['value'].mean()
axes[1, 0].bar(['Weekday', 'Weekend'], weekend_avg.values)
axes[1, 0].set_ylabel('Mean value')
axes[1, 0].set_title('Weekend vs. weekday')

# Lag relationship
axes[1, 1].scatter(df_time['lag_1'], df_time['value'], alpha=0.3)
axes[1, 1].set_xlabel('Previous day value')
axes[1, 1].set_ylabel('Current day value')
axes[1, 1].set_title('Lag relationship')
axes[1, 1].grid(True)

plt.tight_layout()
plt.show()
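
Time-dependent features should also be validated in time order: a shuffled K-fold split would let the model train on days that come after the ones it is tested on. scikit-learn's `TimeSeriesSplit` produces expanding training windows followed by later test windows. A minimal sketch on a hypothetical ten-row array:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(10).reshape(-1, 1)  # ten consecutive time steps

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X_demo)]

for train, test in splits:
    print("train:", train, "-> test:", test)
```

In every split, all training positions come before all test positions, which mirrors how the model would be used for forecasting.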

5. Text Features

Text data has to be converted into numerical features with dedicated methods.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re

# Create text data
# Note: the word-level vectorizers below split on word boundaries, so text written
# without spaces between words (e.g. Chinese) must be segmented into words first
texts = [
    "Machine learning is a branch of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Neural networks are the foundation of deep learning",
    "Natural language processing is an important application area of AI",
    "Computer vision is another important application of AI",
    "Python is the most popular programming language for machine learning",
    "Data science includes machine learning and data mining",
    "Artificial intelligence is changing the way we live"
]

df_text = pd.DataFrame({'text': texts})

# 1. Text cleaning
def clean_text(text):
    """Clean a text string"""
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df_text['cleaned_text'] = df_text['text'].apply(clean_text)

# 2. Bag-of-words model
count_vectorizer = CountVectorizer(max_features=20)
bow_features = count_vectorizer.fit_transform(df_text['cleaned_text'])
df_bow = pd.DataFrame(bow_features.toarray(),
                       columns=count_vectorizer.get_feature_names_out())

print("Bag-of-words features:")
print(df_bow.head())

# 3. TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=20)
tfidf_features = tfidf_vectorizer.fit_transform(df_text['cleaned_text'])
df_tfidf = pd.DataFrame(tfidf_features.toarray(),
                        columns=tfidf_vectorizer.get_feature_names_out())

print("\nTF-IDF features:")
print(df_tfidf.head())

# 4. N-gram features (capture some word-order information)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=20)
ngram_features = ngram_vectorizer.fit_transform(df_text['cleaned_text'])
df_ngram = pd.DataFrame(ngram_features.toarray(),
                        columns=ngram_vectorizer.get_feature_names_out())

print("\nN-gram features:")
print(df_ngram.head())

# 5. Text statistics
def extract_text_features(text):
    """Extract simple statistics from a text string"""
    words = text.split()
    return {
        'length': len(text),
        'word_count': len(words),
        'avg_word_length': np.mean([len(word) for word in words]) if words else 0,
        'unique_word_ratio': len(set(words)) / len(words) if words else 0
    }

text_stats = df_text['cleaned_text'].apply(extract_text_features)
df_stats = pd.DataFrame(text_stats.tolist())

print("\nText statistics:")
print(df_stats.head())

# 6. Word-frequency features (custom)
def get_word_frequency(texts, top_n=10):
    """Count occurrences of the corpus-wide top_n words in each text"""
    all_words = ' '.join(texts).split()
    word_freq = pd.Series(all_words).value_counts()
    top_words = word_freq.head(top_n).index.tolist()

    word_features = []
    for text in texts:
        words = text.split()
        features = [words.count(word) for word in top_words]
        word_features.append(features)

    return pd.DataFrame(word_features, columns=[f'freq_{word}' for word in top_words])

df_word_freq = get_word_frequency(df_text['cleaned_text'].tolist())
print("\nWord-frequency features:")
print(df_word_freq.head())
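
Word-level vectorizers depend on how the text is tokenized. `CountVectorizer`'s default tokenizer looks for word boundaries, so a sentence written without spaces, as in Chinese, ends up as a single token. One dependency-free workaround, sketched below, is character n-grams; a dedicated word segmenter is the usual alternative and is not shown here. The two sample sentences are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two unsegmented Chinese sentences
docs = ["机器学习是人工智能的一个分支", "深度学习是机器学习的子集"]

# Default word-level analysis: each whole sentence becomes one "word"
word_vec = CountVectorizer()
word_vec.fit(docs)
print("word-level vocabulary size:", len(word_vec.vocabulary_))

# Character bigrams recover sub-sentence units without a segmenter
char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_counts = char_vec.fit_transform(docs)

idx = char_vec.vocabulary_["学习"]
print("count of '学习' in the second sentence:", char_counts[1, idx])
```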

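6. Feature Selection

Feature selection picks the most informative subset of the available features, which can reduce overfitting, improve model performance, and speed up training.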

from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest, f_classif, chi2, mutual_info_classif,
    RFE, SelectFromModel, VarianceThreshold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Create sample data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                          n_redundant=5, n_clusters_per_class=1, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df_feature = pd.DataFrame(X, columns=feature_names)
df_feature['target'] = y

# Feature matrix and target, reused by every selector below
X_df = df_feature.drop('target', axis=1)
y_ser = df_feature['target']

# 1. Filter methods
# Variance threshold (removes low-variance features)
variance_threshold = VarianceThreshold(threshold=0.01)
high_variance_features = variance_threshold.fit_transform(X_df)
print(f"Original features: {X.shape[1]}, features kept: {high_variance_features.shape[1]}")

# Univariate selection (ANOVA F-value)
selector_f = SelectKBest(f_classif, k=10)
X_new_f = selector_f.fit_transform(X_df, y_ser)

selected_features_f = [feature_names[i] for i in selector_f.get_support(indices=True)]
print("\nFeatures selected by ANOVA F-value:")
print(selected_features_f)

# Chi-squared test: requires non-negative inputs and is designed for counts or
# frequencies; shifting continuous features only makes the call valid,
# so treat this ranking with caution
selector_chi2 = SelectKBest(chi2, k=10)
X_new_chi2 = selector_chi2.fit_transform(X_df - X_df.min() + 1, y_ser)

selected_features_chi2 = [feature_names[i] for i in selector_chi2.get_support(indices=True)]
print("\nFeatures selected by the chi-squared test:")
print(selected_features_chi2)

# Mutual information
selector_mi = SelectKBest(mutual_info_classif, k=10)
X_new_mi = selector_mi.fit_transform(X_df, y_ser)

selected_features_mi = [feature_names[i] for i in selector_mi.get_support(indices=True)]
print("\nFeatures selected by mutual information:")
print(selected_features_mi)

# 2. Wrapper methods
# Recursive feature elimination (RFE)
estimator = LogisticRegression(max_iter=1000, random_state=42)
rfe_selector = RFE(estimator, n_features_to_select=10, step=1)
X_new_rfe = rfe_selector.fit_transform(X_df, y_ser)

selected_features_rfe = [feature_names[i] for i in rfe_selector.get_support(indices=True)]
print("\nFeatures selected by RFE:")
print(selected_features_rfe)

# 3. Embedded methods
# Random forest feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_df, y_ser)

feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nRandom forest feature importance (top 10):")
print(feature_importance.head(10))

# Select features with SelectFromModel (reusing the already fitted forest)
selector_model = SelectFromModel(rf, threshold='median', prefit=True)
X_new_model = selector_model.transform(X_df)

selected_features_model = [feature_names[i] for i in selector_model.get_support(indices=True)]
print(f"\nNumber of features selected by the model: {len(selected_features_model)}")
print(selected_features_model)

# Visualize the feature importance
plt.figure(figsize=(12, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Feature importance')
plt.ylabel('Feature')
plt.title('Random forest feature importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# 4. Correlation analysis (find highly correlated features to remove)
correlation_matrix = X_df.corr()

# Collect highly correlated feature pairs
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            high_corr_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

print("\nHighly correlated feature pairs (|correlation| > 0.8):")
for pair in high_corr_pairs:
    print(f"{pair[0]} - {pair[1]}: {pair[2]:.4f}")
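
Listing correlated pairs is only half of the step; one feature from each pair still has to be dropped. A common recipe, shown here as a sketch on a hypothetical three-column frame, is to scan the upper triangle of the absolute correlation matrix and drop every column that correlates strongly with an earlier one. The 0.8 threshold mirrors the value used in the correlation analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
df_demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=300),  # nearly duplicates "a"
    "b": b,
})

corr = df_demo.corr().abs()
# Keep only entries above the diagonal so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df_reduced = df_demo.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(df_reduced.columns))
```

Which member of a pair is kept depends on column order here; in a real project the choice is better guided by domain meaning or by each feature's relationship with the target.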

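7. Feature Interactions

Interaction features combine two or more existing features to capture relationships between variables that the individual features cannot express alone.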

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Create sample data
np.random.seed(42)
n_samples = 1000

data = {
    'x1': np.random.normal(0, 1, n_samples),
    'x2': np.random.normal(0, 1, n_samples),
    'x3': np.random.normal(0, 1, n_samples)
}
df_interaction = pd.DataFrame(data)

# Target variable: includes an interaction term
df_interaction['y'] = (
    2 * df_interaction['x1'] +
    3 * df_interaction['x2'] +
    4 * df_interaction['x3'] +
    5 * df_interaction['x1'] * df_interaction['x2'] +  # interaction term
    np.random.normal(0, 0.5, n_samples)
)

# 1. Multiplicative interactions
df_interaction['x1_x2'] = df_interaction['x1'] * df_interaction['x2']
df_interaction['x1_x3'] = df_interaction['x1'] * df_interaction['x3']
df_interaction['x2_x3'] = df_interaction['x2'] * df_interaction['x3']

# 2. Ratio interactions
# The small constant avoids division by exactly zero; denominators close to zero
# still produce extreme values, so ratios fit best when the denominator is bounded away from 0
df_interaction['x1_div_x2'] = df_interaction['x1'] / (df_interaction['x2'] + 1e-8)
df_interaction['x2_div_x3'] = df_interaction['x2'] / (df_interaction['x3'] + 1e-8)

# 3. Additive interactions
df_interaction['x1_plus_x2'] = df_interaction['x1'] + df_interaction['x2']
df_interaction['x1_plus_x3'] = df_interaction['x1'] + df_interaction['x3']

# 4. Polynomial features (squares)
df_interaction['x1_squared'] = df_interaction['x1'] ** 2
df_interaction['x2_squared'] = df_interaction['x2'] ** 2
df_interaction['x3_squared'] = df_interaction['x3'] ** 2

# 5. Generate polynomial features automatically with PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
poly_features = poly.fit_transform(df_interaction[['x1', 'x2', 'x3']])
poly_feature_names = poly.get_feature_names_out(['x1', 'x2', 'x3'])

df_poly = pd.DataFrame(poly_features, columns=poly_feature_names)

print("Polynomial features:")
print(df_poly.head())

# 6. Assess the value of the interaction features
# (R² is measured on the training data here, so it only shows goodness of fit)

# Model 1: raw features only
model1 = LinearRegression()
model1.fit(df_interaction[['x1', 'x2', 'x3']], df_interaction['y'])
y_pred1 = model1.predict(df_interaction[['x1', 'x2', 'x3']])
r2_1 = r2_score(df_interaction['y'], y_pred1)

# Model 2: with interaction features
model2 = LinearRegression()
interaction_features = ['x1', 'x2', 'x3', 'x1_x2', 'x1_x3', 'x2_x3',
                        'x1_squared', 'x2_squared', 'x3_squared']
model2.fit(df_interaction[interaction_features], df_interaction['y'])
y_pred2 = model2.predict(df_interaction[interaction_features])
r2_2 = r2_score(df_interaction['y'], y_pred2)

print(f"\nR² with raw features only: {r2_1:.4f}")
print(f"R² with interaction features: {r2_2:.4f}")
print(f"Improvement: {(r2_2 - r2_1) * 100:.2f} percentage points")

# Visualize the interaction effect
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Interaction between x1 and x2
scatter = axes[0].scatter(df_interaction['x1'], df_interaction['x2'],
                         c=df_interaction['y'], cmap='viridis', alpha=0.6)
axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_title('Interaction between x1 and x2')
plt.colorbar(scatter, ax=axes[0])

# Predicted vs. actual
axes[1].scatter(df_interaction['y'], y_pred1, alpha=0.5, label='Raw features')
axes[1].scatter(df_interaction['y'], y_pred2, alpha=0.5, label='Interaction features')
axes[1].plot([df_interaction['y'].min(), df_interaction['y'].max()],
             [df_interaction['y'].min(), df_interaction['y'].max()],
             'r--', label='Perfect prediction')
axes[1].set_xlabel('Actual value')
axes[1].set_ylabel('Predicted value')
axes[1].set_title('Prediction comparison')
axes[1].legend()

plt.tight_layout()
plt.show()
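
Adding features can only raise in-sample R², so the comparison is more convincing on held-out data. The sketch below repeats the idea with a train/test split and generates the pairwise products inside a scikit-learn pipeline; the data is regenerated with a different random generator, so the exact numbers will differ from the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (2 * X_demo[:, 0] + 3 * X_demo[:, 1] + 4 * X_demo[:, 2]
          + 5 * X_demo[:, 0] * X_demo[:, 1]          # interaction term
          + rng.normal(scale=0.5, size=1000))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Baseline: raw features only
plain = LinearRegression().fit(X_tr, y_tr)

# interaction_only=True adds pairwise products but no squared terms
with_inter = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X_tr, y_tr)

r2_plain = plain.score(X_te, y_te)
r2_inter = with_inter.score(X_te, y_te)
print(f"held-out R², raw features:        {r2_plain:.4f}")
print(f"held-out R², with interactions:   {r2_inter:.4f}")
```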

8. Best Practices for the Feature Engineering Workflow

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, RobustScaler, StandardScaler


class FeatureEngineeringPipeline:
    """
    Feature engineering pipeline
    Wraps common feature engineering operations and records each step
    """

    def __init__(self, df):
        self.df = df.copy()
        self.transformations = []

    def add_feature(self, name, feature_series, description=''):
        """Add a new feature"""
        self.df[name] = feature_series
        self.transformations.append({
            'type': 'add_feature',
            'name': name,
            'description': description
        })
        return self

    def drop_features(self, features):
        """Drop features"""
        self.df = self.df.drop(columns=features)
        self.transformations.append({
            'type': 'drop_features',
            'features': features
        })
        return self

    def transform_feature(self, name, func, description=''):
        """Transform a feature in place"""
        self.df[name] = func(self.df[name])
        self.transformations.append({
            'type': 'transform_feature',
            'name': name,
            'description': description
        })
        return self

    def encode_categorical(self, columns, method='onehot'):
        """Encode categorical features"""
        if method == 'onehot':
            self.df = pd.get_dummies(self.df, columns=columns, drop_first=True)
        elif method == 'label':
            for col in columns:
                le = LabelEncoder()
                self.df[col] = le.fit_transform(self.df[col])
        else:
            raise ValueError(f"Unknown encoding method: {method}")

        self.transformations.append({
            'type': 'encode_categorical',
            'columns': columns,
            'method': method
        })
        return self

    def scale_features(self, columns, method='standard'):
        """Scale numerical features"""
        if method == 'standard':
            scaler = StandardScaler()
        elif method == 'minmax':
            scaler = MinMaxScaler()
        elif method == 'robust':
            scaler = RobustScaler()
        else:
            raise ValueError(f"Unknown scaling method: {method}")

        self.df[columns] = scaler.fit_transform(self.df[columns])

        self.transformations.append({
            'type': 'scale_features',
            'columns': columns,
            'method': method
        })
        return self

    def handle_missing(self, strategy='mean'):
        """Handle missing values"""
        for col in self.df.columns:
            if self.df[col].isnull().sum() > 0:
                is_numeric = pd.api.types.is_numeric_dtype(self.df[col])
                # Assign the result back instead of using inplace=True on a column,
                # which is unreliable with pandas copy-on-write
                if strategy == 'mean' and is_numeric:
                    self.df[col] = self.df[col].fillna(self.df[col].mean())
                elif strategy == 'median' and is_numeric:
                    self.df[col] = self.df[col].fillna(self.df[col].median())
                elif strategy == 'mode':
                    self.df[col] = self.df[col].fillna(self.df[col].mode()[0])
                elif strategy == 'drop':
                    self.df = self.df.dropna(subset=[col])

        self.transformations.append({
            'type': 'handle_missing',
            'strategy': strategy
        })
        return self

    def get_transformed_data(self):
        """Return the transformed data"""
        return self.df.copy()

    def get_transformation_report(self):
        """Return the list of recorded steps"""
        return self.transformations


# Usage example on a small synthetic employee table
np.random.seed(42)
df_staff = pd.DataFrame({
    'age': np.random.randint(22, 60, 200).astype(float),
    'salary': np.random.lognormal(10, 0.5, 200),
    'experience_years': np.random.randint(0, 30, 200).astype(float),
    'gender': np.random.choice(['F', 'M'], 200),
    'department': np.random.choice(['sales', 'tech', 'hr'], 200)
})
df_staff.loc[::25, 'age'] = np.nan  # inject a few missing values

pipeline = FeatureEngineeringPipeline(df_staff)

# Handle missing values
pipeline.handle_missing(strategy='mean')

# Add a new feature (computed before the log transform, so it is the square of the raw age)
pipeline.add_feature('age_squared', pipeline.df['age'] ** 2, 'square of age')

# Transform a feature
pipeline.transform_feature('age', lambda x: np.log1p(x), 'log transform')

# Encode categorical features
pipeline.encode_categorical(['gender', 'department'], method='onehot')

# Scale numerical features
numerical_cols = ['age', 'salary', 'experience_years']
pipeline.scale_features(numerical_cols, method='standard')

# Get the processed data
transformed_df = pipeline.get_transformed_data()

# Get the transformation report
report = pipeline.get_transformation_report()
print("Feature engineering pipeline log:")
for step in report:
    print(f"- {step}")
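
The hand-written class applies every step to a single DataFrame, so statistics such as means and scaling factors are computed on whatever data it is given. When the same preprocessing must be fitted on training data and replayed on new data, scikit-learn's `Pipeline` and `ColumnTransformer` are a commonly used alternative. The sketch below is one possible arrangement, not a drop-in replacement for the class; the column names and the `train`/`new` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "salary": [5000.0, 7200.0, 6100.0, 9000.0],
    "department": ["sales", "tech", "tech", "hr"],
})
# New row with a missing age and a department unseen in training
new = pd.DataFrame({"age": [np.nan], "salary": [8000.0], "department": ["legal"]})

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill gaps with the training mean
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["age", "salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["department"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_matrix = preprocess.fit_transform(train)  # learn all statistics from training data
new_matrix = preprocess.transform(new)          # replay them on new data

print(train_matrix.shape, new_matrix.shape)
print(new_matrix.round(3))
```

The missing age in the new row is filled with the training mean and therefore scales to zero, and the unseen department becomes an all-zero one-hot block.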

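9. Summary

Feature engineering is one of the most important stages of a machine learning project. Done well, it can:

  1. Improve model performance: good features make the patterns in the data easier to learn.
  2. Reduce model complexity: a simpler model can reach better results.
  3. Improve interpretability: meaningful features are easier to explain and debug.
  4. Speed up training: removing irrelevant and redundant features makes training more efficient.

Key Takeaways

  • Understanding the business domain is the foundation of good feature engineering
  • Feature engineering is iterative and needs repeated experimentation and validation
  • Always evaluate the effect of feature engineering on a validation set
  • Avoid data leakage, especially with time series and target encoding
  • Record every feature engineering step to keep the work reproducible
  • Go from simple to complex: try basic transformations first, then consider more elaborate interaction features

There is no universal formula for feature engineering. The best approaches usually come from a deep understanding of the data and insight into the business, and continued exploration and experimentation are what make it succeed.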
