Table of Contents
- 1. Overview of Feature Engineering
  - 1.1 What Is Feature Engineering
  - 1.2 Why Feature Engineering Matters
- 2. Numerical Feature Processing
  - 2.1 Feature Scaling
  - 2.2 Nonlinear Transformations
  - 2.3 Binning (Discretization)
- 3. Categorical Feature Processing
  - 3.1 Label Encoding
  - 3.2 One-Hot Encoding
  - 3.3 Target Encoding
- 4. Time-Series Features
- 5. Text Features
- 6. Feature Selection
- 7. Feature Interactions
- 8. Feature Engineering Pipeline Best Practices
- 9. Summary
Feature Engineering Techniques and Best Practices
Feature engineering is a core part of any machine learning project: it is the practice of using domain knowledge to create new features from raw data so that learning algorithms can capture the underlying patterns more easily. In real projects, good feature engineering often yields larger gains than switching to a more complex model. As the saying goes: "Data and features determine the upper bound of model performance; algorithms merely approximate that bound."
1. Overview of Feature Engineering
1.1 What Is Feature Engineering
Feature engineering is the process of transforming raw data into features that better represent the underlying problem, with the goal of improving model performance. Good feature engineering can:
- Improve predictive accuracy
- Reduce training time
- Lower the risk of overfitting
- Make models easier to interpret
- Help models converge faster
1.2 Why Feature Engineering Matters
```python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Example data: predicting house prices
np.random.seed(42)
n_samples = 1000
# Raw features
data = {
'total_area': np.random.uniform(50, 200, n_samples),
'num_bedrooms': np.random.randint(1, 6, n_samples),
'num_bathrooms': np.random.randint(1, 4, n_samples),
'age': np.random.randint(0, 50, n_samples),
'distance_to_city_center': np.random.uniform(1, 30, n_samples),
'has_garden': np.random.choice([0, 1], n_samples),
'has_parking': np.random.choice([0, 1], n_samples),
'floor': np.random.randint(1, 30, n_samples)
}
df = pd.DataFrame(data)
# Generate the label (house price) from the raw features
df['price'] = (
    df['total_area'] * 5000 +
    df['num_bedrooms'] * 20000 +
    df['num_bathrooms'] * 15000 +
    df['age'] * -500 +
    df['distance_to_city_center'] * -1000 +
    df['has_garden'] * 50000 +
    df['has_parking'] * 30000 +
    np.random.normal(0, 50000, n_samples)  # add noise
)

# Bucket prices into low / medium / high categories
df['price_category'] = pd.qcut(df['price'], q=3, labels=['low', 'medium', 'high'])

# Separate features and label
X = df.drop(['price', 'price_category'], axis=1)
y = df['price_category']

# 1. Train a model on the raw features
model_original = RandomForestClassifier(n_estimators=100, random_state=42)
scores_original = cross_val_score(model_original, X, y, cv=5, scoring='accuracy')
print(f"Accuracy with raw features: {scores_original.mean():.4f} (+/- {scores_original.std() * 2:.4f})")

# 2. Feature engineering: create new features
X_engineered = X.copy()

# Ratio features: area per bedroom / per bathroom
X_engineered['area_per_bedroom'] = X_engineered['total_area'] / X_engineered['num_bedrooms']
X_engineered['area_per_bathroom'] = X_engineered['total_area'] / X_engineered['num_bathrooms']

# Interaction features
X_engineered['total_rooms'] = X_engineered['num_bedrooms'] + X_engineered['num_bathrooms']
X_engineered['area_per_room'] = X_engineered['total_area'] / X_engineered['total_rooms']

# Bucket house age (include_lowest=True so that age 0 falls into the first bin)
X_engineered['age_category'] = pd.cut(X_engineered['age'],
                                      bins=[0, 5, 15, 30, float('inf')],
                                      labels=['new', 'young', 'middle', 'old'],
                                      include_lowest=True)

# Bucket distance to the city center
X_engineered['distance_category'] = pd.cut(X_engineered['distance_to_city_center'],
                                           bins=[0, 5, 10, 20, float('inf')],
                                           labels=['very_close', 'close', 'medium', 'far'])

# High-rise indicator
X_engineered['is_high_rise'] = (X_engineered['floor'] >= 10).astype(int)

# Combination of garden and parking
X_engineered['has_both_garden_parking'] = (X_engineered['has_garden'] & X_engineered['has_parking']).astype(int)

# One-hot encode the categorical features
X_engineered = pd.get_dummies(X_engineered, columns=['age_category', 'distance_category'], drop_first=True)

# 3. Train a model on the engineered features
model_engineered = RandomForestClassifier(n_estimators=100, random_state=42)
scores_engineered = cross_val_score(model_engineered, X_engineered, y, cv=5, scoring='accuracy')
print(f"Accuracy with engineered features: {scores_engineered.mean():.4f} (+/- {scores_engineered.std() * 2:.4f})")
print(f"\nGain from feature engineering: {(scores_engineered.mean() - scores_original.mean()) * 100:.2f} percentage points")
```
2. Numerical Feature Processing
2.1 Feature Scaling
Feature scaling brings features with very different ranges onto comparable scales. This is important for distance-based and gradient-descent-trained algorithms such as KNN, SVM, linear models, and neural networks; tree-based models are largely insensitive to it.
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import matplotlib.pyplot as plt
# Data on very different scales
data = {
    'height_cm': np.random.normal(170, 10, 1000),   # height: around 170 cm
    'weight_kg': np.random.normal(65, 8, 1000),     # weight: around 65 kg
    'salary': np.random.lognormal(10, 0.5, 1000),   # salary: log-normal distribution
    'age': np.random.normal(30, 8, 1000)            # age: around 30 years
}
df = pd.DataFrame(data)
print("Raw data statistics:")
print(df.describe())

# 1. Standardization (z-score scaling)
scaler_standard = StandardScaler()
df_standard = pd.DataFrame(scaler_standard.fit_transform(df), columns=df.columns)
print("\nStatistics after standardization:")
print(df_standard.describe())

# 2. Min-max scaling (normalization)
scaler_minmax = MinMaxScaler()
df_minmax = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)
print("\nStatistics after min-max scaling:")
print(df_minmax.describe())

# 3. Robust scaling (insensitive to outliers)
scaler_robust = RobustScaler()
df_robust = pd.DataFrame(scaler_robust.fit_transform(df), columns=df.columns)
print("\nStatistics after robust scaling:")
print(df_robust.describe())

# Visual comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Raw data
axes[0, 0].boxplot([df[col] for col in df.columns])
axes[0, 0].set_xticklabels(df.columns, rotation=45)
axes[0, 0].set_title('Raw data')

# Standardized
axes[0, 1].boxplot([df_standard[col] for col in df_standard.columns])
axes[0, 1].set_xticklabels(df_standard.columns, rotation=45)
axes[0, 1].set_title('Standardized')

# Min-max scaled
axes[1, 0].boxplot([df_minmax[col] for col in df_minmax.columns])
axes[1, 0].set_xticklabels(df_minmax.columns, rotation=45)
axes[1, 0].set_title('Min-max scaled')

# Robust scaled
axes[1, 1].boxplot([df_robust[col] for col in df_robust.columns])
axes[1, 1].set_xticklabels(df_robust.columns, rotation=45)
axes[1, 1].set_title('Robust scaled')

plt.tight_layout()
plt.show()
```
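One caveat about the snippets above: for brevity they fit each scaler on the full dataset. In a real project the scaler must be fit on the training split only and then applied unchanged to the test split, otherwise test-set statistics leak into training. A minimal sketch of the leakage-safe pattern:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then fit the scaler on the training portion only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df)  # learns mean/std from the training data
test_scaled = scaler.transform(test_df)        # reuses the training statistics
```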
2.2 Nonlinear Transformations
Some models and statistical techniques assume (or benefit from) approximately normally distributed features; nonlinear transformations can bring skewed data closer to that assumption.
```python
from scipy import stats
# Create right-skewed data
np.random.seed(42)
skewed_data = np.random.exponential(scale=1.0, size=1000)
df_skewed = pd.DataFrame({'value': skewed_data})

# 1. Log transform (for right-skewed data)
df_skewed['log_transform'] = np.log1p(df_skewed['value'])  # log1p avoids log(0)

# 2. Square-root transform (for right-skewed data)
df_skewed['sqrt_transform'] = np.sqrt(df_skewed['value'])

# 3. Box-Cox transform (finds the optimal power automatically; requires positive data)
df_skewed['boxcox_transform'], _ = stats.boxcox(df_skewed['value'])

# 4. Yeo-Johnson transform (an extension of Box-Cox that handles non-positive values)
from sklearn.preprocessing import PowerTransformer
yeo_johnson = PowerTransformer(method='yeo-johnson')
df_skewed['yeojohnson_transform'] = yeo_johnson.fit_transform(df_skewed[['value']])

# Visualize the effect of each transform
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Raw data
axes[0, 0].hist(df_skewed['value'], bins=50, edgecolor='black')
axes[0, 0].set_title(f'Raw data (skewness: {df_skewed["value"].skew():.2f})')

# Log transform
axes[0, 1].hist(df_skewed['log_transform'], bins=50, edgecolor='black')
axes[0, 1].set_title(f'Log transform (skewness: {df_skewed["log_transform"].skew():.2f})')

# Box-Cox transform
axes[1, 0].hist(df_skewed['boxcox_transform'], bins=50, edgecolor='black')
axes[1, 0].set_title(f'Box-Cox transform (skewness: {df_skewed["boxcox_transform"].skew():.2f})')

# Yeo-Johnson transform
axes[1, 1].hist(df_skewed['yeojohnson_transform'], bins=50, edgecolor='black')
axes[1, 1].set_title(f'Yeo-Johnson transform (skewness: {df_skewed["yeojohnson_transform"].skew():.2f})')

plt.tight_layout()
plt.show()

# Choosing an appropriate transform
def select_transformation(data):
    """Choose a transformation based on the skewness of the data."""
    skewness = stats.skew(data)
    if abs(skewness) < 0.5:
        print(f"Skewness = {skewness:.4f}; data is close to normal, no transform needed")
        return data
    elif skewness > 0.5:
        print(f"Skewness = {skewness:.4f}; right-skewed, try a log or Box-Cox transform")
        if np.all(data > 0):
            transformed, _ = stats.boxcox(data)
            return transformed
        else:
            return np.log1p(data - data.min() + 1)
    else:
        print(f"Skewness = {skewness:.4f}; left-skewed, try a square or exponential transform")
        return data ** 2

selected_transform = select_transformation(skewed_data)
```
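When a transform like this needs to travel with the model (for example, to be applied identically at prediction time), it can be wrapped in sklearn's FunctionTransformer. A small sketch using log1p together with its inverse:

```python
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
transformed = log_transformer.fit_transform(df_skewed[['value']])
restored = log_transformer.inverse_transform(transformed)  # back to the original scale
```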
2.3 Binning (Discretization)
Binning converts a continuous variable into a discrete one. It can dampen noise, tame outliers, and sometimes capture nonlinear relationships.
```python
from sklearn.preprocessing import KBinsDiscretizer, Binarizer
# Create age data (clipped to a non-negative range so every value falls into a bin)
np.random.seed(42)
ages = np.clip(np.random.normal(35, 12, 1000), 0, None)
df_ages = pd.DataFrame({'age': ages})

# 1. Equal-width binning
df_ages['age_bin_equal_width'] = pd.cut(df_ages['age'], bins=5, labels=False)

# 2. Equal-frequency binning
df_ages['age_bin_equal_freq'] = pd.qcut(df_ages['age'], q=5, labels=False)

# 3. Custom bin boundaries
bins = [0, 18, 30, 45, 60, float('inf')]
labels = ['teen', 'young', 'middle_aged', 'older', 'senior']
df_ages['age_category'] = pd.cut(df_ages['age'], bins=bins, labels=labels, include_lowest=True)

# 4. KBinsDiscretizer
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
df_ages['age_bin_kbd'] = kbd.fit_transform(df_ages[['age']])

# 5. Binarization (flag ages above 30)
binarizer = Binarizer(threshold=30)
df_ages['is_over_30'] = binarizer.fit_transform(df_ages[['age']])

# Inspect the binning results
print("Binning results:")
print(df_ages[['age', 'age_category']].head(10))
print("\nDistribution across age groups:")
print(df_ages['age_category'].value_counts())

# Visualize the binning
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original distribution
axes[0].hist(df_ages['age'], bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(18, color='red', linestyle='--', label='18')
axes[0].axvline(30, color='orange', linestyle='--', label='30')
axes[0].axvline(45, color='yellow', linestyle='--', label='45')
axes[0].axvline(60, color='green', linestyle='--', label='60')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Age distribution')
axes[0].legend()

# Distribution after binning
df_ages['age_category'].value_counts().plot(kind='bar', ax=axes[1], edgecolor='black')
axes[1].set_xlabel('Age group')
axes[1].set_ylabel('Count')
axes[1].set_title('Age group distribution')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
```
3. Categorical Feature Processing
3.1 Label Encoding
Label encoding maps each category to an integer. It is appropriate for ordinal categories, where the integer order carries meaning.
```python
from sklearn.preprocessing import LabelEncoder
# Example data (all columns must have the same length: 600 rows)
data = {
    'education': ['primary', 'junior_high', 'senior_high', 'bachelor', 'master', 'phd'] * 100,
    'city': ['北京', '上海', '深圳', '广州', '杭州'] * 120,
    'grade': ['A', 'B', 'C', 'D', 'F'] * 120
}
df = pd.DataFrame(data)

# 1. Label encoding (note: LabelEncoder assigns codes in sorted order,
#    which usually does NOT match the semantic order of the categories)
le_education = LabelEncoder()
df['education_encoded'] = le_education.fit_transform(df['education'])
print("LabelEncoder mapping for education:")
for i, label in enumerate(le_education.classes_):
    print(f"{label}: {i}")

# 2. Custom ordinal encoding (preferred for ordered categories)
education_order = ['primary', 'junior_high', 'senior_high', 'bachelor', 'master', 'phd']
education_map = {edu: i for i, edu in enumerate(education_order)}
df['education_ordered'] = df['education'].map(education_map)
print("\nOrdinal education encoding:")
print(df[['education', 'education_ordered']].head())

# 3. Grade encoding
grade_map = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'F': 1}
df['grade_score'] = df['grade'].map(grade_map)
print("\nGrade encoding:")
print(df[['grade', 'grade_score']].head())
```
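For ordinal feature columns, sklearn's OrdinalEncoder is usually a better fit than LabelEncoder (which is intended for encoding targets and always assigns codes in sorted order): it accepts an explicit category order and works on 2-D feature matrices. A brief sketch:

```python
from sklearn.preprocessing import OrdinalEncoder

# Pass the categories in their semantic low-to-high order
ordinal_encoder = OrdinalEncoder(categories=[education_order])
df['education_sk_encoded'] = ordinal_encoder.fit_transform(df[['education']]).ravel()
print(df[['education', 'education_sk_encoded']].head())
```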
3.2 One-Hot Encoding
One-hot encoding creates one binary feature per category. It is appropriate for nominal (unordered) categories.
```python
from sklearn.preprocessing import OneHotEncoder
# 1. pandas get_dummies
df_onehot = pd.get_dummies(df, columns=['city'], prefix='city')
print("Data after one-hot encoding:")
print(df_onehot.head())

# 2. sklearn OneHotEncoder (sparse_output requires sklearn >= 1.2; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity
city_encoded = encoder.fit_transform(df[['city']])
print("\nOneHotEncoder result:")
print(city_encoded[:5])

# 3. Handling high-cardinality categories (hashing trick)
from sklearn.feature_extraction import FeatureHasher

# Suppose there are many city categories
cities = ['北京', '上海', '深圳', '广州', '杭州', '南京', '武汉', '成都', '重庆', '天津'] * 100
df_cities = pd.DataFrame({'city': cities})

# FeatureHasher with input_type='string' expects each sample to be an iterable of tokens,
# so wrap each city name in a list (a bare string would be hashed character by character)
fh = FeatureHasher(n_features=8, input_type='string')
hashed_features = fh.transform([[city] for city in df_cities['city']])
print("\nHashing-trick result (first 5 rows):")
print(hashed_features[:5].toarray())
```
3.3 Target Encoding
Target encoding replaces each category with a statistic of the target variable (typically its mean). It is especially effective for high-cardinality categorical features, but it is also prone to leakage and overfitting, so smoothing and out-of-fold estimation matter.
```python
# Example data
np.random.seed(42)
n_samples = 10000
data = {
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_samples),
    'city': np.random.choice(['北京', '上海', '深圳', '广州', '杭州', '南京'], n_samples),
    'target': np.random.randint(0, 2, n_samples)
}
df_target = pd.DataFrame(data)

# 1. Plain target encoding
category_means = df_target.groupby('category')['target'].mean()
df_target['category_target_encoded'] = df_target['category'].map(category_means)
print("Per-category target means:")
print(category_means)
print("\nTarget-encoded values:")
print(df_target[['category', 'category_target_encoded']].head())
# 2. Smoothed target encoding (guards against overfitting on rare categories)
def smooth_target_encoding(df, cat_col, target_col, alpha=10):
    """
    Smoothed target encoding.
    alpha: smoothing strength; larger values pull estimates toward the global mean.
    """
    # Global mean
    global_mean = df[target_col].mean()
    # Per-category statistics
    category_stats = df.groupby(cat_col).agg({
        target_col: ['mean', 'count']
    })
    category_stats.columns = ['mean', 'count']
    # Weighted blend of the category mean and the global mean
    smoothed = (category_stats['mean'] * category_stats['count'] +
                global_mean * alpha) / (category_stats['count'] + alpha)
    return smoothed

smoothed_encoding = smooth_target_encoding(df_target, 'category', 'target')
df_target['category_smoothed_encoded'] = df_target['category'].map(smoothed_encoding)
print("\nSmoothed target encoding:")
print(smoothed_encoding)
# 3. K-fold target encoding (avoids data leakage)
from sklearn.model_selection import KFold

def kfold_target_encoding(df, cat_col, target_col, n_folds=5, alpha=10):
    """
    K-fold target encoding: each row is encoded using statistics
    computed only from the other folds, which avoids leakage.
    """
    df_encoded = df.copy()
    df_encoded[f'{cat_col}_encoded'] = np.nan
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Training folds / validation fold
        train_df = df.iloc[train_idx]
        val_df = df.iloc[val_idx]
        # Smoothed encoding computed on the training folds only
        global_mean = train_df[target_col].mean()
        category_stats = train_df.groupby(cat_col).agg({
            target_col: ['mean', 'count']
        })
        category_stats.columns = ['mean', 'count']
        smoothed = (category_stats['mean'] * category_stats['count'] +
                    global_mean * alpha) / (category_stats['count'] + alpha)
        # Apply to the validation fold; unseen categories fall back to the global mean
        df_encoded.loc[val_idx, f'{cat_col}_encoded'] = (
            val_df[cat_col].map(smoothed).fillna(global_mean)
        )
    return df_encoded

df_kfold_encoded = kfold_target_encoding(df_target, 'city', 'target')
print("\nK-fold target encoding result:")
print(df_kfold_encoded[['city', 'city_encoded']].head())
```
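The same idea is also packaged in the third-party category_encoders library. Assuming it is installed (pip install category-encoders), a typical usage looks roughly like this sketch:

```python
import category_encoders as ce

# Smoothed target encoding of the city column
encoder = ce.TargetEncoder(cols=['city'], smoothing=10)
df_target['city_ce_encoded'] = encoder.fit_transform(df_target[['city']], df_target['target'])['city']
print(df_target[['city', 'city_ce_encoded']].head())
```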
4. Time-Series Features
Time-series data calls for feature engineering techniques that capture temporal patterns such as trend, seasonality, and autocorrelation.
```python
# Create a daily time series
dates = pd.date_range(start='2020-01-01', end='2023-12-31', freq='D')
np.random.seed(42)
values = []
for i in range(len(dates)):
    # Combine a base level, trend, seasonality, and noise
    value = (
        100 +                                  # base level
        i * 0.01 +                             # trend
        10 * np.sin(2 * np.pi * i / 365) +     # yearly seasonality
        5 * np.sin(2 * np.pi * i / 7) +        # weekly seasonality
        np.random.normal(0, 2)                 # random noise
    )
    values.append(value)
df_time = pd.DataFrame({'date': dates, 'value': values})
# 1. Basic calendar features
df_time['year'] = df_time['date'].dt.year
df_time['month'] = df_time['date'].dt.month
df_time['day'] = df_time['date'].dt.day
df_time['dayofweek'] = df_time['date'].dt.dayofweek
df_time['dayofyear'] = df_time['date'].dt.dayofyear
df_time['weekofyear'] = df_time['date'].dt.isocalendar().week
df_time['quarter'] = df_time['date'].dt.quarter

# 2. Weekend indicator (holiday flags can be built the same way)
df_time['is_weekend'] = (df_time['dayofweek'] >= 5).astype(int)

# 3. Cyclical features (sine/cosine encoding preserves the wrap-around)
df_time['month_sin'] = np.sin(2 * np.pi * df_time['month'] / 12)
df_time['month_cos'] = np.cos(2 * np.pi * df_time['month'] / 12)
df_time['dayofweek_sin'] = np.sin(2 * np.pi * df_time['dayofweek'] / 7)
df_time['dayofweek_cos'] = np.cos(2 * np.pi * df_time['dayofweek'] / 7)

# 4. Lag features
for lag in [1, 7, 30]:
    df_time[f'lag_{lag}'] = df_time['value'].shift(lag)

# 5. Rolling-window features
df_time['rolling_mean_7'] = df_time['value'].rolling(window=7).mean()
df_time['rolling_std_7'] = df_time['value'].rolling(window=7).std()
df_time['rolling_max_7'] = df_time['value'].rolling(window=7).max()
df_time['rolling_min_7'] = df_time['value'].rolling(window=7).min()
df_time['rolling_mean_30'] = df_time['value'].rolling(window=30).mean()
df_time['rolling_std_30'] = df_time['value'].rolling(window=30).std()

# 6. Expanding-window features
df_time['expanding_mean'] = df_time['value'].expanding().mean()
df_time['expanding_max'] = df_time['value'].expanding().max()

# 7. Difference features
df_time['diff_1'] = df_time['value'].diff(1)
df_time['diff_7'] = df_time['value'].diff(7)
df_time['pct_change_1'] = df_time['value'].pct_change(1)
df_time['pct_change_7'] = df_time['value'].pct_change(7)

# 8. Time-gap feature (useful when observations are irregularly spaced)
df_time['time_diff'] = df_time['date'].diff().dt.days

# Inspect the features
print("Sample time-series features:")
print(df_time[['date', 'value', 'month', 'dayofweek', 'is_weekend',
               'lag_1', 'rolling_mean_7', 'diff_1']].head(10))
# Visualize how the time features relate to the target
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Monthly pattern
monthly_avg = df_time.groupby('month')['value'].mean()
axes[0, 0].plot(monthly_avg.index, monthly_avg.values, marker='o')
axes[0, 0].set_xlabel('Month')
axes[0, 0].set_ylabel('Mean value')
axes[0, 0].set_title('Monthly pattern')
axes[0, 0].grid(True)

# Weekly pattern
weekly_avg = df_time.groupby('dayofweek')['value'].mean()
axes[0, 1].plot(weekly_avg.index, weekly_avg.values, marker='o')
axes[0, 1].set_xlabel('Day of week')
axes[0, 1].set_ylabel('Mean value')
axes[0, 1].set_title('Weekly pattern')
axes[0, 1].set_xticks(range(7))
axes[0, 1].set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
axes[0, 1].grid(True)

# Weekend vs weekday
weekend_avg = df_time.groupby('is_weekend')['value'].mean()
axes[1, 0].bar(['Weekday', 'Weekend'], weekend_avg.values)
axes[1, 0].set_ylabel('Mean value')
axes[1, 0].set_title('Weekend vs weekday')

# Lag relationship
axes[1, 1].scatter(df_time['lag_1'], df_time['value'], alpha=0.3)
axes[1, 1].set_xlabel("Previous day's value")
axes[1, 1].set_ylabel("Current day's value")
axes[1, 1].set_title('Lag-1 relationship')
axes[1, 1].grid(True)

plt.tight_layout()
plt.show()
```
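Holiday indicators follow the same pattern as the weekend flag. With a hand-maintained list of holiday dates (the three dates below are purely illustrative), a sketch could be:

```python
# A real project would use a complete holiday calendar for the target market
holidays = pd.to_datetime(['2023-01-01', '2023-05-01', '2023-10-01'])
df_time['is_holiday'] = df_time['date'].isin(holidays).astype(int)
```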
5. Text Features
Text data must be converted into numerical features before it can be fed to most models.
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
# Example documents (English translations of the original Chinese sentences;
# unsegmented Chinese text would first need a word segmenter such as jieba,
# because the whitespace tokenizers below assume space-separated words)
texts = [
    "Machine learning is a branch of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Neural networks are the foundation of deep learning",
    "Natural language processing is a key application area of AI",
    "Computer vision is another important AI application",
    "Python is the most popular programming language for machine learning",
    "Data science covers machine learning and data mining",
    "Artificial intelligence is changing the way we live"
]
df_text = pd.DataFrame({'text': texts})

# 1. Text cleaning
def clean_text(text):
    """Basic text cleanup."""
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df_text['cleaned_text'] = df_text['text'].apply(clean_text)
# 2. Bag-of-words model
count_vectorizer = CountVectorizer(max_features=20)
bow_features = count_vectorizer.fit_transform(df_text['cleaned_text'])
df_bow = pd.DataFrame(bow_features.toarray(),
                      columns=count_vectorizer.get_feature_names_out())
print("Bag-of-words features:")
print(df_bow.head())

# 3. TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=20)
tfidf_features = tfidf_vectorizer.fit_transform(df_text['cleaned_text'])
df_tfidf = pd.DataFrame(tfidf_features.toarray(),
                        columns=tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF features:")
print(df_tfidf.head())

# 4. N-gram features (capture local word order)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=20)
ngram_features = ngram_vectorizer.fit_transform(df_text['cleaned_text'])
df_ngram = pd.DataFrame(ngram_features.toarray(),
                        columns=ngram_vectorizer.get_feature_names_out())
print("\nN-gram features:")
print(df_ngram.head())
# 5. Text statistics features
def extract_text_features(text):
    """Extract simple statistical features from a document."""
    return {
        'length': len(text),
        'word_count': len(text.split()),
        'avg_word_length': np.mean([len(word) for word in text.split()]) if text.split() else 0,
        'unique_word_ratio': len(set(text.split())) / len(text.split()) if text.split() else 0
    }

text_stats = df_text['cleaned_text'].apply(extract_text_features)
df_stats = pd.DataFrame(text_stats.tolist())
print("\nText statistics features:")
print(df_stats.head())

# 6. Word-frequency features (hand-rolled)
def get_word_frequency(texts, top_n=10):
    """Count occurrences of the overall top-n words in each document."""
    all_words = ' '.join(texts).split()
    word_freq = pd.Series(all_words).value_counts()
    top_words = word_freq.head(top_n).index.tolist()
    word_features = []
    for text in texts:
        words = text.split()
        features = [words.count(word) for word in top_words]
        word_features.append(features)
    return pd.DataFrame(word_features, columns=[f'freq_{word}' for word in top_words])

df_word_freq = get_word_frequency(df_text['cleaned_text'].tolist())
print("\nWord-frequency features:")
print(df_word_freq.head())
```
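For Chinese text such as the original example sentences, whitespace tokenization does not apply; the text must be segmented into words first. A minimal sketch assuming the jieba package is installed:

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["机器学习是人工智能的一个分支", "深度学习是机器学习的子集"]
# Segment each document and re-join with spaces so the default tokenizer works
segmented = [' '.join(jieba.lcut(doc)) for doc in docs]
tfidf_zh = TfidfVectorizer()
tfidf_matrix = tfidf_zh.fit_transform(segmented)
print(tfidf_zh.get_feature_names_out())
```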
6. Feature Selection
Feature selection picks the most informative subset of the available features. It can reduce overfitting, improve model performance, and speed up training.
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
SelectKBest, f_classif, chi2, mutual_info_classif,
RFE, SelectFromModel
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Example data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_clusters_per_class=1, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df_feature = pd.DataFrame(X, columns=feature_names)
df_feature['target'] = y

# 1. Filter methods
# Variance threshold (drop near-constant features)
from sklearn.feature_selection import VarianceThreshold
variance_threshold = VarianceThreshold(threshold=0.01)
high_variance_features = variance_threshold.fit_transform(df_feature.drop('target', axis=1))
print(f"Original feature count: {X.shape[1]}, retained: {high_variance_features.shape[1]}")

# Univariate selection (ANOVA F-statistic)
selector_f = SelectKBest(f_classif, k=10)
X_new_f = selector_f.fit_transform(df_feature.drop('target', axis=1), df_feature['target'])
selected_features_f = [feature_names[i] for i in selector_f.get_support(indices=True)]
print("\nFeatures selected by ANOVA F-statistic:")
print(selected_features_f)

# Chi-squared test (requires non-negative features, hence the shift)
selector_chi2 = SelectKBest(chi2, k=10)
X_new_chi2 = selector_chi2.fit_transform(
    df_feature.drop('target', axis=1) - df_feature.drop('target', axis=1).min() + 1,
    df_feature['target']
)
selected_features_chi2 = [feature_names[i] for i in selector_chi2.get_support(indices=True)]
print("\nFeatures selected by the chi-squared test:")
print(selected_features_chi2)

# Mutual information
selector_mi = SelectKBest(mutual_info_classif, k=10)
X_new_mi = selector_mi.fit_transform(df_feature.drop('target', axis=1), df_feature['target'])
selected_features_mi = [feature_names[i] for i in selector_mi.get_support(indices=True)]
print("\nFeatures selected by mutual information:")
print(selected_features_mi)
# 2. Wrapper methods
# Recursive feature elimination (RFE)
estimator = LogisticRegression(max_iter=1000, random_state=42)
rfe_selector = RFE(estimator, n_features_to_select=10, step=1)
X_new_rfe = rfe_selector.fit_transform(df_feature.drop('target', axis=1), df_feature['target'])
selected_features_rfe = [feature_names[i] for i in rfe_selector.get_support(indices=True)]
print("\nFeatures selected by RFE:")
print(selected_features_rfe)

# 3. Embedded methods
# Random-forest feature importances
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(df_feature.drop('target', axis=1), df_feature['target'])
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nRandom-forest feature importances (top 10):")
print(feature_importance.head(10))

# Select features with SelectFromModel
selector_model = SelectFromModel(rf, threshold='median', prefit=True)
X_new_model = selector_model.transform(df_feature.drop('target', axis=1))
selected_features_model = [feature_names[i] for i in selector_model.get_support(indices=True)]
print(f"\nNumber of features kept by the model-based selector: {len(selected_features_model)}")
print(selected_features_model)
# Visualize feature importances
plt.figure(figsize=(12, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Feature importance')
plt.ylabel('Feature')
plt.title('Random-forest feature importances')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# 4. Correlation analysis (drop one of each highly correlated pair)
correlation_matrix = df_feature.drop('target', axis=1).corr()

# Find highly correlated feature pairs
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            high_corr_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

print("\nHighly correlated feature pairs (|r| > 0.8):")
for pair in high_corr_pairs:
    print(f"{pair[0]} - {pair[1]}: {pair[2]:.4f}")
```
7. Feature Interactions
Interaction features combine existing variables to capture relationships that no single feature expresses on its own.
```python
# Example data
np.random.seed(42)
n_samples = 1000
data = {
'x1': np.random.normal(0, 1, n_samples),
'x2': np.random.normal(0, 1, n_samples),
'x3': np.random.normal(0, 1, n_samples)
}
df_interaction = pd.DataFrame(data)
# Target variable with a genuine interaction term
df_interaction['y'] = (
    2 * df_interaction['x1'] +
    3 * df_interaction['x2'] +
    4 * df_interaction['x3'] +
    5 * df_interaction['x1'] * df_interaction['x2'] +  # interaction term
    np.random.normal(0, 0.5, n_samples)
)

# 1. Multiplicative interactions
df_interaction['x1_x2'] = df_interaction['x1'] * df_interaction['x2']
df_interaction['x1_x3'] = df_interaction['x1'] * df_interaction['x3']
df_interaction['x2_x3'] = df_interaction['x2'] * df_interaction['x3']

# 2. Ratio interactions (small epsilon guards against division by zero)
df_interaction['x1_div_x2'] = df_interaction['x1'] / (df_interaction['x2'] + 1e-8)
df_interaction['x2_div_x3'] = df_interaction['x2'] / (df_interaction['x3'] + 1e-8)

# 3. Additive interactions
df_interaction['x1_plus_x2'] = df_interaction['x1'] + df_interaction['x2']
df_interaction['x1_plus_x3'] = df_interaction['x1'] + df_interaction['x3']

# 4. Polynomial features (quadratic)
df_interaction['x1_squared'] = df_interaction['x1'] ** 2
df_interaction['x2_squared'] = df_interaction['x2'] ** 2
df_interaction['x3_squared'] = df_interaction['x3'] ** 2

# 5. PolynomialFeatures generates polynomial terms automatically
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
poly_features = poly.fit_transform(df_interaction[['x1', 'x2', 'x3']])
poly_feature_names = poly.get_feature_names_out(['x1', 'x2', 'x3'])
df_poly = pd.DataFrame(poly_features, columns=poly_feature_names)
print("Polynomial features:")
print(df_poly.head())
# 6. Evaluate how much the interaction features help
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Model 1: original features only
model1 = LinearRegression()
model1.fit(df_interaction[['x1', 'x2', 'x3']], df_interaction['y'])
y_pred1 = model1.predict(df_interaction[['x1', 'x2', 'x3']])
r2_1 = r2_score(df_interaction['y'], y_pred1)

# Model 2: with interaction features
model2 = LinearRegression()
interaction_features = ['x1', 'x2', 'x3', 'x1_x2', 'x1_x3', 'x2_x3',
                        'x1_squared', 'x2_squared', 'x3_squared']
model2.fit(df_interaction[interaction_features], df_interaction['y'])
y_pred2 = model2.predict(df_interaction[interaction_features])
r2_2 = r2_score(df_interaction['y'], y_pred2)

print(f"\nR² with original features only: {r2_1:.4f}")
print(f"R² with interaction features: {r2_2:.4f}")
print(f"Improvement: {(r2_2 - r2_1) * 100:.2f} percentage points")

# Visualize the interaction effect
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Interaction between x1 and x2
scatter = axes[0].scatter(df_interaction['x1'], df_interaction['x2'],
                          c=df_interaction['y'], cmap='viridis', alpha=0.6)
axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_title('Interaction effect of x1 and x2')
plt.colorbar(scatter, ax=axes[0])

# Predicted vs actual
axes[1].scatter(df_interaction['y'], y_pred1, alpha=0.5, label='Original features')
axes[1].scatter(df_interaction['y'], y_pred2, alpha=0.5, label='With interactions')
axes[1].plot([df_interaction['y'].min(), df_interaction['y'].max()],
             [df_interaction['y'].min(), df_interaction['y'].max()],
             'r--', label='Perfect prediction')
axes[1].set_xlabel('Actual')
axes[1].set_ylabel('Predicted')
axes[1].set_title('Prediction quality comparison')
axes[1].legend()

plt.tight_layout()
plt.show()
```
8. Feature Engineering Pipeline Best Practices
```python
class FeatureEngineeringPipeline:
    """
    A feature-engineering pipeline that wraps common operations
    and records every transformation for reproducibility.
    """

    def __init__(self, df):
        self.df = df.copy()
        self.transformations = []

    def add_feature(self, name, feature_series, description=''):
        """Add a new feature column."""
        self.df[name] = feature_series
        self.transformations.append({
            'type': 'add_feature',
            'name': name,
            'description': description
        })
        return self

    def drop_features(self, features):
        """Drop feature columns."""
        self.df = self.df.drop(columns=features)
        self.transformations.append({
            'type': 'drop_features',
            'features': features
        })
        return self

    def transform_feature(self, name, func, description=''):
        """Transform a feature in place."""
        self.df[name] = func(self.df[name])
        self.transformations.append({
            'type': 'transform_feature',
            'name': name,
            'description': description
        })
        return self
    def encode_categorical(self, columns, method='onehot'):
        """Encode categorical features."""
        if method == 'onehot':
            self.df = pd.get_dummies(self.df, columns=columns, drop_first=True)
        elif method == 'label':
            for col in columns:
                le = LabelEncoder()
                self.df[col] = le.fit_transform(self.df[col])
        self.transformations.append({
            'type': 'encode_categorical',
            'columns': columns,
            'method': method
        })
        return self

    def scale_features(self, columns, method='standard'):
        """Scale numerical features."""
        if method == 'standard':
            scaler = StandardScaler()
        elif method == 'minmax':
            scaler = MinMaxScaler()
        elif method == 'robust':
            scaler = RobustScaler()
        else:
            raise ValueError(f"Unknown scaling method: {method}")
        self.df[columns] = scaler.fit_transform(self.df[columns])
        self.transformations.append({
            'type': 'scale_features',
            'columns': columns,
            'method': method
        })
        return self
    def handle_missing(self, strategy='mean'):
        """Handle missing values column by column."""
        for col in self.df.columns:
            if self.df[col].isnull().sum() > 0:
                # Note: dtype comparisons against np.number are always False,
                # so use pandas' type-inspection helper instead
                is_numeric = pd.api.types.is_numeric_dtype(self.df[col])
                if strategy == 'mean' and is_numeric:
                    self.df[col] = self.df[col].fillna(self.df[col].mean())
                elif strategy == 'median' and is_numeric:
                    self.df[col] = self.df[col].fillna(self.df[col].median())
                elif strategy == 'mode':
                    self.df[col] = self.df[col].fillna(self.df[col].mode()[0])
                elif strategy == 'drop':
                    self.df = self.df.dropna(subset=[col])
        self.transformations.append({
            'type': 'handle_missing',
            'strategy': strategy
        })
        return self

    def get_transformed_data(self):
        """Return the transformed data."""
        return self.df.copy()

    def get_transformation_report(self):
        """Return the list of recorded transformations."""
        return self.transformations
# Usage example (a small synthetic HR dataset, since the pipeline needs
# columns such as age, salary, gender, and department)
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(22, 60, 200),
    'salary': np.random.normal(10000, 3000, 200),
    'experience_years': np.random.randint(0, 30, 200),
    'gender': np.random.choice(['male', 'female'], 200),
    'department': np.random.choice(['engineering', 'sales', 'hr'], 200)
})

pipeline = FeatureEngineeringPipeline(df)

# Handle missing values
pipeline.handle_missing(strategy='mean')

# Transform a feature
pipeline.transform_feature('age', lambda x: np.log1p(x), 'log transform')

# Add a new feature (computed from the raw data)
pipeline.add_feature('age_squared', df['age'] ** 2, 'age squared')

# Encode categorical features
pipeline.encode_categorical(['gender', 'department'], method='onehot')

# Scale numerical features
numerical_cols = ['age', 'salary', 'experience_years']
pipeline.scale_features(numerical_cols, method='standard')

# Retrieve the processed data and the transformation log
transformed_df = pipeline.get_transformed_data()
report = pipeline.get_transformation_report()
print("Feature engineering pipeline log:")
for step in report:
    print(f"- {step}")
```
9. Summary
Feature engineering is one of the most important parts of a machine learning project. Done well, it lets us:
- Improve model performance: good features make the patterns in the data easier to learn.
- Reduce model complexity: achieve better results with simpler models.
- Enhance interpretability: meaningful features are easier to explain and debug.
- Speed up training: removing irrelevant and redundant features improves efficiency.

Key takeaways:
- Understanding the business domain is the foundation of good feature engineering
- Feature engineering is an iterative process of repeated experimentation and validation
- Always evaluate engineered features on a validation set
- Guard against data leakage, especially with time series and target encoding
- Record every feature engineering step so that results are reproducible
- Go from simple to complex: try basic transformations before elaborate interaction features

Remember: there is no universal formula for feature engineering. The best features usually come from a deep understanding of the data and of the business. Continuous exploration and experimentation are the keys to success!