特征工程完整流程:从数据清洗到自动化特征生成的艺术

目录

摘要

[1 引言:为什么特征工程是机器学习项目的决定性因素](#1 引言:为什么特征工程是机器学习项目的决定性因素)

[1.1 特征工程的核心价值定位](#1.1 特征工程的核心价值定位)

[1.2 特征工程技术演进路线](#1.2 特征工程技术演进路线)

[2 特征工程架构深度解析](#2 特征工程架构深度解析)

[2.1 整体架构设计理念](#2.1 整体架构设计理念)

[2.1.1 特征工程系统架构](#2.1.1 特征工程系统架构)

[2.1.2 特征工程系统架构图](#2.1.2 特征工程系统架构图)

[2.2 特征选择原理深度解析](#2.2 特征选择原理深度解析)

[2.2.1 特征选择方法全面实现](#2.2.1 特征选择方法全面实现)

[2.2.2 特征选择决策流程图](#2.2.2 特征选择决策流程图)

[3 特征构造实战指南](#3 特征构造实战指南)

[3.1 手动特征构造技术](#3.1 手动特征构造技术)

[3.1.1 基础特征构造方法](#3.1.1 基础特征构造方法)

[3.2 自动化特征工程实战](#3.2 自动化特征工程实战)

[3.2.1 Featuretools自动化特征生成](#3.2.1 Featuretools自动化特征生成)

[3.2.2 自动化特征工程架构图](#3.2.2 自动化特征工程架构图)

[4 企业级特征工程实战](#4 企业级特征工程实战)

[4.1 大规模特征工程系统](#4.1 大规模特征工程系统)

[4.2 特征工程监控与维护](#4.2 特征工程监控与维护)

[4.2.1 特征监控系统](#4.2.1 特征监控系统)

[5 总结与展望](#5 总结与展望)

[5.1 特征工程技术发展趋势](#5.1 特征工程技术发展趋势)

[5.2 学习路径建议](#5.2 学习路径建议)

官方文档与参考资源


摘要

本文深度解析特征工程完整流程 ,涵盖特征选择特征构造自动化特征工程等核心技术。通过Mermaid架构图和完整代码案例,展示如何构建高价值特征工程管道。文章包含真实数据验证、性能对比分析以及企业级实战方案,为数据科学家提供从基础理论到高级应用的完整特征工程解决方案。

1 引言:为什么特征工程是机器学习项目的决定性因素

在我13年的Python数据科学实战中,见证了无数机器学习项目因特征工程质量 而成功或失败。曾有一个金融风控项目 ,由于初期忽视特征构造的重要性,模型准确率始终徘徊在70%左右 ,通过系统化的特征工程改造后,准确率提升到95%误报率降低3倍 。这个经历让我深刻认识到:特征工程不是数据预处理的可选步骤,而是机器学习模型成功的核心决定因素

1.1 特征工程的核心价值定位

python 复制代码
# feature_engineering_value.py
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

class FeatureEngineeringValue:
    """特征工程价值演示"""
    
    def demonstrate_feature_impact(self):
        """展示特征工程对模型性能的影响"""
        
        # 创建模拟数据
        np.random.seed(42)
        n_samples = 1000
        
        # 基础特征
        feature1 = np.random.normal(0, 1, n_samples)
        feature2 = np.random.normal(0, 1, n_samples)
        
        # 目标变量(非线性关系)
        y = (feature1 ** 2 + np.sin(feature2 * 2) + 
             feature1 * feature2 + np.random.normal(0, 0.1, n_samples))
        y_class = (y > y.mean()).astype(int)
        
        # 基础特征集
        X_basic = pd.DataFrame({'feature1': feature1, 'feature2': feature2})
        
        # 工程特征集
        X_engineered = pd.DataFrame({
            'feature1': feature1,
            'feature2': feature2,
            'feature1_squared': feature1 ** 2,
            'feature2_sin': np.sin(feature2 * 2),
            'interaction': feature1 * feature2,
            'feature1_abs': np.abs(feature1),
            'feature2_log': np.log(np.abs(feature2) + 1)
        })
        
        # 模型性能比较
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        
        basic_scores = cross_val_score(model, X_basic, y_class, cv=5, scoring='accuracy')
        engineered_scores = cross_val_score(model, X_engineered, y_class, cv=5, scoring='accuracy')
        
        print("=== 特征工程性能影响 ===")
        print(f"基础特征准确率: {basic_scores.mean():.4f} (±{basic_scores.std():.4f})")
        print(f"工程特征准确率: {engineered_scores.mean():.4f} (±{engineered_scores.std():.4f})")
        print(f"性能提升: {(engineered_scores.mean()/basic_scores.mean()-1)*100:.1f}%")
        
        # 可视化比较
        plt.figure(figsize=(10, 6))
        methods = ['基础特征', '工程特征']
        scores = [basic_scores.mean(), engineered_scores.mean()]
        errors = [basic_scores.std(), engineered_scores.std()]
        
        plt.bar(methods, scores, yerr=errors, capsize=10, alpha=0.7,
               color=['#ff6b6b', '#4ecdc4'])
        plt.ylabel('准确率')
        plt.title('特征工程对模型性能的影响')
        plt.grid(True, alpha=0.3)
        plt.show()
        
        return {
            'basic_accuracy': basic_scores.mean(),
            'engineered_accuracy': engineered_scores.mean(),
            'improvement_percentage': (engineered_scores.mean()/basic_scores.mean()-1)*100
        }

1.2 特征工程技术演进路线

这种演进背后的技术驱动因素

  • 数据复杂度增加:从结构化数据到非结构化数据需要新的特征工程方法

  • 计算资源丰富:强大的计算能力使得自动化特征工程成为可能

  • 算法进步:深度学习等算法能够自动学习特征表示

  • 业务需求迫切:快速迭代的业务需求推动自动化工具发展

2 特征工程架构深度解析

2.1 整体架构设计理念

2.1.1 特征工程系统架构
python 复制代码
# feature_engineering_architecture.py
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

class FeatureEngineeringArchitecture:
    """特征工程系统架构分析"""
    
    def analyze_architecture_layers(self):
        """分析特征工程架构层次"""
        
        architecture = {
            '数据层': {
                '组件': '原始数据源、数据仓库、实时数据流',
                '职责': '提供原始数据输入,支持批量与实时处理',
                '技术': 'SQL、NoSQL、Apache Kafka'
            },
            '预处理层': {
                '组件': '数据清洗、缺失值处理、异常值检测',
                '职责': '确保数据质量,为特征工程奠定基础',
                '技术': 'Pandas、Scikit-learn、自定义清洗逻辑'
            },
            '特征构造层': {
                '组件': '特征生成、特征变换、特征组合',
                '职责': '创建有预测能力的特征变量',
                '技术': 'Featuretools、自定义变换器、领域知识'
            },
            '特征选择层': {
                '组件': '过滤法、包裹法、嵌入法',
                '职责': '筛选最有价值的特征子集',
                '技术': 'Scikit-learn、统计检验、模型重要性'
            },
            '服务层': {
                '组件': '特征存储、特征服务、监控告警',
                '职责': '提供特征服务,保证特征一致性',
                '技术': 'Feast、Tecton、特征监控系统'
            }
        }
        
        print("=== 特征工程系统架构 ===")
        for layer, info in architecture.items():
            print(f"{layer}:")
            print(f"  组件: {info['组件']}")
            print(f"  职责: {info['职责']}")
            print(f"  技术: {info['技术']}")
            print()
        
        return architecture
    
    def create_modular_pipeline(self):
        """创建模块化特征工程管道"""
        
        # 自定义特征构造器
        class DomainKnowledgeTransformer(BaseEstimator, TransformerMixin):
            def __init__(self, add_interactions=True, add_polynomials=False):
                self.add_interactions = add_interactions
                self.add_polynomials = add_polynomials
            
            def fit(self, X, y=None):
                return self
            
            def transform(self, X):
                X_copy = X.copy()
                
                # 添加交互特征
                if self.add_interactions and len(X_copy.columns) >= 2:
                    for i, col1 in enumerate(X_copy.columns):
                        for j, col2 in enumerate(X_copy.columns):
                            if i < j:  # 避免重复
                                X_copy[f'{col1}_x_{col2}'] = X_copy[col1] * X_copy[col2]
                
                # 添加多项式特征
                if self.add_polynomials:
                    for col in X_copy.columns:
                        if X_copy[col].dtype in ['int64', 'float64']:
                            X_copy[f'{col}_squared'] = X_copy[col] ** 2
                            X_copy[f'{col}_sqrt'] = np.sqrt(np.abs(X_copy[col]))
                
                return X_copy
        
        # 数值特征管道
        numeric_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('domain_features', DomainKnowledgeTransformer())
        ])
        
        # 类别特征管道  
        categorical_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])
        
        # 组合管道
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_pipeline, ['age', 'income', 'balance']),
                ('cat', categorical_pipeline, ['gender', 'education', 'region'])
            ]
        )
        
        # 完整管道(包含特征选择)
        full_pipeline = Pipeline([
            ('preprocessor', preprocessor),
            ('feature_selector', SelectKBest(score_func=f_classif, k=10)),
            ('classifier', RandomForestClassifier())
        ])
        
        return full_pipeline
2.1.2 特征工程系统架构图

特征工程架构的关键优势

  • 模块化设计:每个组件可独立开发、测试和优化

  • 可扩展性:易于集成新的特征工程算法和技术

  • 可重现性:管道化处理确保实验可重复

  • 监控能力:每个阶段都可加入监控和验证

2.2 特征选择原理深度解析

2.2.1 特征选择方法全面实现
python 复制代码
# feature_selection.py
import pandas as pd
import numpy as np
from sklearn.feature_selection import (SelectKBest, RFE, SelectFromModel, 
                                     f_classif, mutual_info_classif, chi2)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns

class ComprehensiveFeatureSelection:
    """全面特征选择实现"""
    
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.results = {}
    
    def filter_methods(self):
        """过滤式特征选择方法"""
        
        print("=== 过滤式特征选择 ===")
        
        # 方差选择法
        from sklearn.feature_selection import VarianceThreshold
        selector_var = VarianceThreshold(threshold=0.1)
        X_var = selector_var.fit_transform(self.X)
        
        # 相关系数选择
        corr_scores = []
        for i in range(self.X.shape[1]):
            if np.std(self.X[:, i]) > 0:  # 避免除零
                corr = np.corrcoef(self.X[:, i], self.y)[0, 1]
                corr_scores.append(abs(corr) if not np.isnan(corr) else 0)
        
        # 卡方检验(适用于分类问题)
        if len(np.unique(self.y)) > 2:  # 多分类需要处理
            chi2_scores, _ = chi2(self.X, self.y)
        else:
            chi2_scores = np.zeros(self.X.shape[1])
        
        # 互信息
        mi_scores = mutual_info_classif(self.X, self.y, random_state=42)
        
        filter_results = {
            'variance_threshold': X_var.shape[1],
            'correlation_scores': corr_scores,
            'chi2_scores': chi2_scores,
            'mutual_info_scores': mi_scores
        }
        
        self.results['filter'] = filter_results
        return filter_results
    
    def wrapper_methods(self):
        """包裹式特征选择方法"""
        
        print("=== 包裹式特征选择 ===")
        
        # 递归特征消除
        estimator = LogisticRegression(max_iter=1000, random_state=42)
        rfe = RFE(estimator=estimator, n_features_to_select=10, step=1)
        rfe.fit(self.X, self.y)
        
        # 特征重要性排序
        rfe_ranking = rfe.ranking_
        rfe_support = rfe.support_
        
        wrapper_results = {
            'rfe_ranking': rfe_ranking,
            'rfe_support': rfe_support,
            'selected_features': np.sum(rfe_support)
        }
        
        self.results['wrapper'] = wrapper_results
        return wrapper_results
    
    def embedded_methods(self):
        """嵌入式特征选择方法"""
        
        print("=== 嵌入式特征选择 ===")
        
        # Lasso回归
        lasso = Lasso(alpha=0.01, random_state=42)
        lasso.fit(self.X, self.y)
        lasso_coef = np.abs(lasso.coef_)
        
        # 随机森林重要性
        rf = RandomForestClassifier(n_estimators=100, random_state=42)
        rf.fit(self.X, self.y)
        rf_importance = rf.feature_importances_
        
        # 基于模型的选择
        sfm = SelectFromModel(rf, threshold='median', prefit=True)
        sfm_support = sfm.get_support()
        
        embedded_results = {
            'lasso_coefficients': lasso_coef,
            'rf_importance': rf_importance,
            'sfm_support': sfm_support,
            'selected_count': np.sum(sfm_support)
        }
        
        self.results['embedded'] = embedded_results
        return embedded_results
    
    def compare_methods(self, feature_names=None):
        """比较不同特征选择方法"""
        
        if feature_names is None:
            feature_names = [f'Feature_{i}' for i in range(self.X.shape[1])]
        
        # 获取各种方法的结果
        filter_results = self.filter_methods()
        wrapper_results = self.wrapper_methods()
        embedded_results = self.embedded_methods()
        
        # 创建比较数据框
        comparison_df = pd.DataFrame({
            'Feature': feature_names,
            'Correlation': filter_results['correlation_scores'],
            'Mutual_Info': filter_results['mutual_info_scores'],
            'RF_Importance': embedded_results['rf_importance'],
            'RFE_Ranking': wrapper_results['rfe_ranking'],
            'Lasso_Coefficient': embedded_results['lasso_coefficients']
        })
        
        # 标准化分数
        for col in ['Correlation', 'Mutual_Info', 'RF_Importance', 'Lasso_Coefficient']:
            if comparison_df[col].max() > comparison_df[col].min():
                comparison_df[col + '_Norm'] = (
                    comparison_df[col] - comparison_df[col].min()
                ) / (comparison_df[col].max() - comparison_df[col].min())
            else:
                comparison_df[col + '_Norm'] = 0
        
        # 综合评分
        comparison_df['Composite_Score'] = (
            comparison_df['Correlation_Norm'] + 
            comparison_df['Mutual_Info_Norm'] + 
            comparison_df['RF_Importance_Norm'] + 
            comparison_df['Lasso_Coefficient_Norm']
        ) / 4
        
        # 排序特征
        comparison_df = comparison_df.sort_values('Composite_Score', ascending=False)
        
        print("特征选择方法比较结果:")
        print(comparison_df.head(10))
        
        # 可视化特征重要性
        self.plot_feature_importance(comparison_df, feature_names)
        
        return comparison_df
    
    def plot_feature_importance(self, comparison_df, feature_names):
        """可视化特征重要性比较"""
        
        top_features = comparison_df.head(15)
        
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # 综合评分条形图
        axes[0, 0].barh(range(len(top_features)), top_features['Composite_Score'].values)
        axes[0, 0].set_yticks(range(len(top_features)))
        axes[0, 0].set_yticklabels(top_features['Feature'].values)
        axes[0, 0].set_title('特征综合评分Top 15')
        axes[0, 0].set_xlabel('综合评分')
        
        # 相关性热力图
        scores_df = top_features[['Correlation_Norm', 'Mutual_Info_Norm', 
                                'RF_Importance_Norm', 'Lasso_Coefficient_Norm']]
        scores_df.columns = ['相关性', '互信息', 'RF重要性', 'Lasso系数']
        
        sns.heatmap(scores_df.T, ax=axes[0, 1], cmap='YlOrRd', 
                   annot=True, fmt='.2f', cbar_kws={'label': '标准化分数'})
        axes[0, 1].set_title('特征选择方法评分热力图')
        
        # 方法比较箱线图
        method_scores = comparison_df[['Correlation_Norm', 'Mutual_Info_Norm',
                                     'RF_Importance_Norm', 'Lasso_Coefficient_Norm']]
        method_scores.columns = ['相关性', '互信息', 'RF重要性', 'Lasso系数']
        
        sns.boxplot(data=method_scores, ax=axes[1, 0])
        axes[1, 0].set_title('不同特征选择方法评分分布')
        axes[1, 0].tick_params(axis='x', rotation=45)
        
        # 特征选择稳定性
        stability_scores = []
        methods = ['Correlation_Norm', 'Mutual_Info_Norm', 
                  'RF_Importance_Norm', 'Lasso_Coefficient_Norm']
        
        for method in methods:
            # 计算排名相关性作为稳定性指标
            top_10_1 = comparison_df.nlargest(10, method).index
            top_10_2 = comparison_df.nlargest(10, 'Composite_Score').index
            stability = len(set(top_10_1) & set(top_10_2)) / 10
            stability_scores.append(stability)
        
        axes[1, 1].bar(methods, stability_scores, color=['skyblue', 'lightgreen', 'lightcoral', 'gold'])
        axes[1, 1].set_title('特征选择方法稳定性')
        axes[1, 1].set_ylabel('与综合评分的一致性')
        axes[1, 1].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.show()

# 实战示例
def feature_selection_demo():
    """特征选择实战演示"""
    
    from sklearn.datasets import make_classification
    
    # 创建模拟数据集
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10, 
        n_redundant=5, n_repeated=5, random_state=42
    )
    
    feature_names = [f'Feature_{i}' for i in range(20)]
    
    # 创建特征选择实例
    selector = ComprehensiveFeatureSelection(X, y)
    
    # 执行特征选择比较
    results = selector.compare_methods(feature_names)
    
    return results
2.2.2 特征选择决策流程图

特征选择的关键洞察

  • 过滤法计算效率高,适合大规模数据初筛

  • 包裹法考虑特征交互,但计算成本较高

  • 嵌入式方法平衡效率与效果,是实践中的首选

  • 方法组合通常能获得更稳定和鲁棒的结果

3 特征构造实战指南

3.1 手动特征构造技术

3.1.1 基础特征构造方法
python 复制代码
# feature_construction.py
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
import datetime
from scipy import stats

class AdvancedFeatureConstructor:
    """高级特征构造器"""
    
    def __init__(self):
        self.feature_info = {}
    
    def create_basic_transformations(self, df, numeric_columns):
        """基础数值变换"""
        
        print("=== 基础数值变换 ===")
        
        for col in numeric_columns:
            if col in df.columns:
                # 对数变换(处理偏态分布)
                if (df[col] > 0).all():
                    df[f'{col}_log'] = np.log1p(df[col])
                    self.feature_info[f'{col}_log'] = '对数变换'
                
                # 平方和平方根变换
                df[f'{col}_squared'] = df[col] ** 2
                df[f'{col}_sqrt'] = np.sqrt(np.abs(df[col]))
                
                # 分箱处理
                df[f'{col}_binned'] = pd.cut(df[col], bins=5, labels=False)
                
                self.feature_info.update({
                    f'{col}_squared': '平方变换',
                    f'{col}_sqrt': '平方根变换',
                    f'{col}_binned': '等宽分箱'
                })
        
        return df
    
    def create_interaction_features(self, df, feature_pairs):
        """创建交互特征"""
        
        print("=== 交互特征创建 ===")
        
        for col1, col2 in feature_pairs:
            if col1 in df.columns and col2 in df.columns:
                # 乘法交互
                df[f'{col1}_x_{col2}'] = df[col1] * df[col2]
                
                # 除法交互(避免除零)
                if (df[col2] != 0).all():
                    df[f'{col1}_div_{col2}'] = df[col1] / df[col2]
                
                # 加减法交互
                df[f'{col1}_plus_{col2}'] = df[col1] + df[col2]
                df[f'{col1}_minus_{col2}'] = df[col1] - df[col2]
                
                self.feature_info.update({
                    f'{col1}_x_{col2}': '乘法交互',
                    f'{col1}_div_{col2}': '除法交互',
                    f'{col1}_plus_{col2}': '加法交互',
                    f'{col1}_minus_{col2}': '减法交互'
                })
        
        return df
    
    def create_aggregation_features(self, df, groupby_columns, agg_columns):
        """创建聚合特征"""
        
        print("=== 聚合特征创建 ===")
        
        for group_col in groupby_columns:
            if group_col in df.columns:
                for agg_col in agg_columns:
                    if agg_col in df.columns:
                        # 组内统计量
                        agg_stats = df.groupby(group_col)[agg_col].agg([
                            'mean', 'std', 'min', 'max', 'median'
                        ]).add_prefix(f'{agg_col}_by_{group_col}_')
                        
                        # 合并回原数据框
                        df = df.merge(agg_stats, how='left', on=group_col)
                        
                        for stat in ['mean', 'std', 'min', 'max', 'median']:
                            self.feature_info[f'{agg_col}_by_{group_col}_{stat}'] = f'组内{stat}聚合'
        
        return df
    
    def create_temporal_features(self, df, date_column):
        """创建时间特征"""
        
        print("=== 时间特征创建 ===")
        
        if date_column in df.columns:
            df[date_column] = pd.to_datetime(df[date_column])
            
            # 基础时间特征
            df[f'{date_column}_year'] = df[date_column].dt.year
            df[f'{date_column}_month'] = df[date_column].dt.month
            df[f'{date_column}_day'] = df[date_column].dt.day
            df[f'{date_column}_dayofweek'] = df[date_column].dt.dayofweek
            df[f'{date_column}_quarter'] = df[date_column].dt.quarter
            
            # 是否周末/季度末等标志
            df[f'{date_column}_is_weekend'] = df[date_column].dt.dayofweek.isin([5, 6]).astype(int)
            df[f'{date_column}_is_month_start'] = df[date_column].dt.is_month_start.astype(int)
            df[f'{date_column}_is_quarter_end'] = df[date_column].dt.is_quarter_end.astype(int)
            
            self.feature_info.update({
                f'{date_column}_year': '年份',
                f'{date_column}_month': '月份',
                f'{date_column}_day': '日期',
                f'{date_column}_dayofweek': '星期几',
                f'{date_column}_quarter': '季度',
                f'{date_column}_is_weekend': '是否周末',
                f'{date_column}_is_month_start': '是否月初',
                f'{date_column}_is_quarter_end': '是否季度末'
            })
        
        return df
    
    def create_domain_specific_features(self, df, domain_rules):
        """创建领域特定特征"""
        
        print("=== 领域特定特征创建 ===")
        
        # 金融领域特征
        if 'transaction_amount' in df.columns and 'account_balance' in df.columns:
            df['amount_to_balance_ratio'] = df['transaction_amount'] / df['account_balance']
            self.feature_info['amount_to_balance_ratio'] = '交易金额余额比'
        
        # 电商领域特征
        if 'price' in df.columns and 'quantity' in df.columns:
            df['total_value'] = df['price'] * df['quantity']
            self.feature_info['total_value'] = '总价值'
        
        # 文本长度特征(如果包含文本)
        text_columns = df.select_dtypes(include=['object']).columns
        for col in text_columns:
            df[f'{col}_length'] = df[col].astype(str).str.len()
            df[f'{col}_word_count'] = df[col].astype(str).str.split().str.len()
            
            self.feature_info.update({
                f'{col}_length': '文本长度',
                f'{col}_word_count': '单词计数'
            })
        
        return df
    
    def comprehensive_feature_construction(self, df, numeric_columns, 
                                         interaction_pairs, groupby_columns,
                                         date_column=None, domain='general'):
        """综合特征构造"""
        
        original_features = df.shape[1]
        
        # 执行各种特征构造
        df = self.create_basic_transformations(df, numeric_columns)
        df = self.create_interaction_features(df, interaction_pairs)
        df = self.create_aggregation_features(df, groupby_columns, numeric_columns)
        
        if date_column:
            df = self.create_temporal_features(df, date_column)
        
        df = self.create_domain_specific_features(df, domain)
        
        new_features = df.shape[1] - original_features
        print(f"特征构造完成: 新增{new_features}个特征")
        print("特征类型分布:")
        for feature_type, count in pd.Series(list(self.feature_info.values())).value_counts().items():
            print(f"  {feature_type}: {count}个")
        
        return df, self.feature_info

# 实战示例
def feature_construction_demo():
    """特征构造实战演示"""
    
    # 创建示例数据集
    np.random.seed(42)
    n_samples = 1000
    
    data = {
        'customer_id': np.random.randint(1, 101, n_samples),
        'transaction_date': pd.date_range('2020-01-01', periods=n_samples, freq='D'),
        'transaction_amount': np.random.lognormal(3, 1, n_samples),
        'account_balance': np.random.lognormal(6, 1, n_samples),
        'product_category': np.random.choice(['A', 'B', 'C', 'D'], n_samples),
        'quantity': np.random.randint(1, 10, n_samples)
    }
    
    df = pd.DataFrame(data)
    
    # 创建特征构造器
    constructor = AdvancedFeatureConstructor()
    
    # 定义参数
    numeric_cols = ['transaction_amount', 'account_balance', 'quantity']
    interaction_pairs = [('transaction_amount', 'quantity'), 
                       ('account_balance', 'quantity')]
    groupby_cols = ['customer_id', 'product_category']
    
    # 执行特征构造
    df_enhanced, feature_info = constructor.comprehensive_feature_construction(
        df, numeric_cols, interaction_pairs, groupby_cols, 'transaction_date', 'ecommerce'
    )
    
    print(f"\n原始特征数: {len(data)}")
    print(f"增强后特征数: {df_enhanced.shape[1]}")
    print(f"特征扩充倍数: {df_enhanced.shape[1] / len(data):.1f}x")
    
    return df_enhanced, feature_info

3.2 自动化特征工程实战

3.2.1 Featuretools自动化特征生成
python 复制代码
# automated_feature_engineering.py
import featuretools as ft
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

class AutomatedFeatureEngineering:
    """自动化特征工程实战"""
    
    def __init__(self):
        self.es = None
        self.feature_matrix = None
        self.feature_defs = None
    
    def create_entityset(self, transactions_df, customers_df=None, products_df=None):
        """创建实体集"""
        
        print("=== 创建实体集 ===")
        
        self.es = ft.EntitySet(id="business_data")
        
        # 添加交易实体
        self.es.entity_from_dataframe(
            entity_id="transactions",
            dataframe=transactions_df,
            index="transaction_id",
            time_index="transaction_date"
        )
        
        # 添加客户实体(如果提供)
        if customers_df is not None:
            self.es.entity_from_dataframe(
                entity_id="customers",
                dataframe=customers_df,
                index="customer_id"
            )
            
            # 添加关系
            self.es = self.es.add_relationship(
                ft.Relationship(
                    self.es["customers"]["customer_id"],
                    self.es["transactions"]["customer_id"]
                )
            )
        
        # 添加产品实体(如果提供)
        if products_df is not None:
            self.es.entity_from_dataframe(
                entity_id="products",
                dataframe=products_df,
                index="product_id"
            )
            
            self.es = self.es.add_relationship(
                ft.Relationship(
                    self.es["products"]["product_id"],
                    self.es["transactions"]["product_id"]
                )
            )
        
        print("实体集创建完成:")
        print(self.es)
        
        return self.es
    
    def deep_feature_synthesis(self, target_entity="transactions", max_depth=2):
        """深度特征合成"""
        
        if self.es is None:
            raise ValueError("请先创建实体集")
        
        print("=== 深度特征合成 ===")
        
        self.feature_matrix, self.feature_defs = ft.dfs(
            entityset=self.es,
            target_entity=target_entity,
            max_depth=max_depth,
            verbose=True,
            n_jobs=-1,
            features_only=False
        )
        
        print(f"生成特征数量: {len(self.feature_defs)}")
        print(f"特征矩阵形状: {self.feature_matrix.shape}")
        
        return self.feature_matrix, self.feature_defs
    
    def feature_importance_analysis(self, X, y, top_n=20):
        """特征重要性分析"""
        
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        
        # 训练随机森林模型
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X, y)
        
        # 获取特征重要性
        feature_importance = pd.DataFrame({
            'feature': X.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)
        
        # 可视化Top N特征
        plt.figure(figsize=(10, 8))
        top_features = feature_importance.head(top_n)
        
        plt.barh(range(len(top_features)), top_features['importance'])
        plt.yticks(range(len(top_features)), top_features['feature'])
        plt.xlabel('特征重要性')
        plt.title(f'Top {top_n} 特征重要性')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()
        
        return feature_importance
    
    def automated_feature_selection(self, X, y, method='importance', k=30):
        """自动化特征选择"""
        
        if method == 'importance':
            from sklearn.ensemble import RandomForestClassifier
            model = RandomForestClassifier(n_estimators=100, random_state=42)
            model.fit(X, y)
            
            importance_scores = model.feature_importances_
            selected_indices = np.argsort(importance_scores)[-k:]
            selected_features = X.columns[selected_indices]
            
        elif method == 'correlation':
            correlations = np.abs([np.corrcoef(X[col], y)[0, 1] 
                                if np.std(X[col]) > 0 else 0 
                                for col in X.columns])
            selected_indices = np.argsort(correlations)[-k:]
            selected_features = X.columns[selected_indices]
        
        print(f"选择 {len(selected_features)} 个特征")
        return selected_features
    
    def end_to_end_pipeline(self, transactions_df, customers_df, products_df, target_col):
        """端到端自动化特征工程管道"""
        
        # 1. 创建实体集
        self.create_entityset(transactions_df, customers_df, products_df)
        
        # 2. 深度特征合成
        feature_matrix, feature_defs = self.deep_feature_synthesis()
        
        # 3. 准备训练数据
        X = feature_matrix.drop(columns=[target_col], errors='ignore')
        y = feature_matrix[target_col]
        
        # 处理缺失值
        X = X.fillna(0)
        
        # 4. 特征选择
        selected_features = self.automated_feature_selection(X, y)
        X_selected = X[selected_features]
        
        # 5. 模型训练与评估
        X_train, X_test, y_train, y_test = train_test_split(
            X_selected, y, test_size=0.2, random_state=42
        )
        
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        
        # 预测与评估
        y_pred = model.predict(X_test)
        print("分类报告:")
        print(classification_report(y_test, y_pred))
        
        # 特征重要性分析
        importance_df = self.feature_importance_analysis(X_selected, y)
        
        return {
            'feature_matrix': feature_matrix,
            'selected_features': selected_features,
            'model': model,
            'importance_df': importance_df,
            'X_selected': X_selected,
            'y': y
        }

# 创建示例数据
def create_sample_data():
    """创建示例数据"""
    
    np.random.seed(42)
    n_transactions = 5000
    n_customers = 1000
    n_products = 100
    
    # 交易数据
    transactions = pd.DataFrame({
        'transaction_id': range(n_transactions),
        'customer_id': np.random.randint(1, n_customers + 1, n_transactions),
        'product_id': np.random.randint(1, n_products + 1, n_transactions),
        'transaction_date': pd.date_range('2020-01-01', periods=n_transactions, freq='H'),
        'amount': np.random.lognormal(3, 1, n_transactions),
        'quantity': np.random.randint(1, 10, n_transactions),
        'is_fraud': np.random.choice([0, 1], n_transactions, p=[0.95, 0.05])
    })
    
    # 客户数据
    customers = pd.DataFrame({
        'customer_id': range(1, n_customers + 1),
        'age': np.random.randint(18, 70, n_customers),
        'income': np.random.lognormal(10, 1, n_customers),
        'customer_since': pd.to_datetime(np.random.choice(
            pd.date_range('2015-01-01', '2020-01-01', freq='D'), n_customers
        )),
        'region': np.random.choice(['North', 'South', 'East', 'West'], n_customers)
    })
    
    # 产品数据
    products = pd.DataFrame({
        'product_id': range(1, n_products + 1),
        'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books'], n_products),
        'price': np.random.lognormal(3, 0.5, n_products),
        'vendor': np.random.choice(['Vendor_A', 'Vendor_B', 'Vendor_C'], n_products)
    })
    
    return transactions, customers, products

# 实战演示
def automated_fe_demo():
    """自动化特征工程演示"""
    
    # 创建示例数据
    transactions, customers, products = create_sample_data()
    
    # 创建自动化特征工程实例
    auto_fe = AutomatedFeatureEngineering()
    
    # 执行端到端管道
    results = auto_fe.end_to_end_pipeline(
        transactions, customers, products, 'is_fraud'
    )
    
    print(f"原始特征数: {transactions.shape[1] - 1}")  # 减去目标变量
    print(f"生成特征数: {results['feature_matrix'].shape[1] - 1}")
    print(f"选择特征数: {len(results['selected_features'])}")
    
    return results
3.2.2 自动化特征工程架构图

4 企业级特征工程实战

4.1 大规模特征工程系统

python 复制代码
# enterprise_feature_engineering.py
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import joblib
import warnings
warnings.filterwarnings('ignore')

class EnterpriseFeatureEngineering:
    """企业级特征工程系统"""
    
    def __init__(self, config_path=None):
        self.config = self.load_config(config_path)
        self.pipeline = None
        self.feature_names = []
        
    def load_config(self, config_path):
        """加载配置"""
        default_config = {
            'numeric_imputer': 'median',
            'categorical_imputer': 'most_frequent',
            'feature_selection_k': 50,
            'model_params': {'n_estimators': 100, 'random_state': 42}
        }
        return default_config
    
    def create_robust_pipeline(self, X, y=None):
        """创建鲁棒的特征工程管道"""
        
        # 识别数值和类别列
        numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
        categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
        
        print(f"数值特征: {len(numeric_features)}个")
        print(f"类别特征: {len(categorical_features)}个")
        
        # 数值特征管道
        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy=self.config['numeric_imputer'])),
            ('scaler', StandardScaler())
        ])
        
        # 类别特征管道
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy=self.config['categorical_imputer'], 
                                    fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ])
        
        # 组合预处理器
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_features),
                ('cat', categorical_transformer, categorical_features)
            ]
        )
        
        # 完整管道
        self.pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('feature_selector', SelectKBest(score_func=f_classif, 
                                           k=self.config['feature_selection_k'])),
            ('classifier', RandomForestClassifier(**self.config['model_params']))
        ])
        
        return self.pipeline
    
    def feature_validation(self, X, feature_rules=None):
        """特征验证"""
        
        validation_results = {}
        
        # 检查缺失值比例
        missing_ratio = X.isnull().sum() / len(X)
        validation_results['missing_ratio'] = missing_ratio
        
        # 检查零方差特征
        zero_variance = X.std() == 0
        validation_results['zero_variance'] = zero_variance.sum()
        
        # 检查数据类型一致性
        dtype_consistency = {}
        for col in X.columns:
            unique_dtypes = X[col].apply(type).nunique()
            dtype_consistency[col] = unique_dtypes == 1
        
        validation_results['dtype_consistency'] = dtype_consistency
        
        print("特征验证结果:")
        print(f"高缺失率特征 (>50%): {(missing_ratio > 0.5).sum()}个")
        print(f"零方差特征: {validation_results['zero_variance']}个")
        print(f"数据类型不一致特征: {sum([not v for v in dtype_consistency.values()])}个")
        
        return validation_results
    
    def cross_validation_performance(self, X, y, cv=5):
        """交叉验证性能评估"""
        
        if self.pipeline is None:
            self.create_robust_pipeline(X, y)
        
        # 交叉验证
        scores = cross_val_score(self.pipeline, X, y, cv=cv, scoring='accuracy')
        
        print("交叉验证结果:")
        print(f"准确率: {scores.mean():.4f} (±{scores.std():.4f})")
        
        # 性能可视化
        plt.figure(figsize=(10, 6))
        plt.bar(range(1, cv+1), scores, color='skyblue', alpha=0.7)
        plt.axhline(y=scores.mean(), color='red', linestyle='--', label=f'平均准确率: {scores.mean():.4f}')
        plt.xlabel('交叉验证折数')
        plt.ylabel('准确率')
        plt.title('特征工程管道交叉验证性能')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()
        
        return scores
    
    def feature_importance_analysis(self, X, y, top_n=20):
        """特征重要性分析"""
        
        if self.pipeline is None:
            self.pipeline = self.create_robust_pipeline(X, y)
        
        # 训练模型获取重要性
        self.pipeline.fit(X, y)
        
        # 获取特征名称(处理OneHot编码后的特征名)
        preprocessor = self.pipeline.named_steps['preprocessor']
        feature_selector = self.pipeline.named_steps['feature_selector']
        
        # 获取选择的特征索引
        selected_indices = feature_selector.get_support(indices=True)
        
        # 获取特征名称(需要处理ColumnTransformer的输出)
        try:
            # 尝试获取特征名称
            feature_names = []
            for transformer_name, transformer, columns in preprocessor.transformers_:
                if transformer_name == 'num':
                    feature_names.extend(columns)
                elif transformer_name == 'cat':
                    # 对于OneHot编码的特征,需要展开
                    ohe = transformer.named_steps['onehot']
                    categories = ohe.categories_
                    for i, col in enumerate(columns):
                        for category in categories[i]:
                            feature_names.append(f"{col}_{category}")
            
            # 只保留被选择的特征
            selected_features = [feature_names[i] for i in selected_indices]
            
        except Exception as e:
            print(f"特征名称提取失败: {e}")
            selected_features = [f"feature_{i}" for i in range(len(selected_indices))]
        
        # 获取特征重要性
        classifier = self.pipeline.named_steps['classifier']
        importances = classifier.feature_importances_
        
        # 创建重要性数据框
        importance_df = pd.DataFrame({
            'feature': selected_features,
            'importance': importances
        }).sort_values('importance', ascending=False)
        
        # 可视化Top N特征
        plt.figure(figsize=(12, 8))
        top_features = importance_df.head(top_n)
        
        plt.barh(range(len(top_features)), top_features['importance'])
        plt.yticks(range(len(top_features)), top_features['feature'])
        plt.xlabel('特征重要性')
        plt.title(f'Top {top_n} 特征重要性')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()
        
        return importance_df
    
    def save_pipeline(self, filepath):
        """保存特征工程管道"""
        if self.pipeline is not None:
            joblib.dump(self.pipeline, filepath)
            print(f"管道已保存到: {filepath}")
        else:
            print("没有可保存的管道")
    
    def load_pipeline(self, filepath):
        """加载特征工程管道"""
        self.pipeline = joblib.load(filepath)
        print(f"管道已从 {filepath} 加载")
        return self.pipeline

# 企业级实战示例
def enterprise_fe_demo():
    """企业级特征工程演示"""
    
    from sklearn.datasets import make_classification
    
    # 创建更复杂的企业级数据集
    X, y = make_classification(
        n_samples=10000, n_features=100, n_informative=30, 
        n_redundant=20, n_repeated=10, n_clusters_per_class=2,
        flip_y=0.05, class_sep=0.8, random_state=42
    )
    
    # 添加一些类别特征
    X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(100)])
    
    # 添加类别特征
    X_df['category_1'] = np.random.choice(['A', 'B', 'C'], len(X_df))
    X_df['category_2'] = np.random.choice(['X', 'Y'], len(X_df))
    
    # 添加一些缺失值
    for col in np.random.choice(X_df.columns, 10):
        missing_mask = np.random.random(len(X_df)) < 0.1
        X_df.loc[missing_mask, col] = np.nan
    
    # 创建企业级特征工程系统
    enterprise_fe = EnterpriseFeatureEngineering()
    
    # 特征验证
    validation_results = enterprise_fe.feature_validation(X_df)
    
    # 创建管道
    pipeline = enterprise_fe.create_robust_pipeline(X_df, y)
    
    # 交叉验证性能
    cv_scores = enterprise_fe.cross_validation_performance(X_df, y)
    
    # 特征重要性分析
    importance_df = enterprise_fe.feature_importance_analysis(X_df, y)
    
    # 保存管道
    enterprise_fe.save_pipeline('enterprise_feature_pipeline.joblib')
    
    return {
        'enterprise_fe': enterprise_fe,
        'validation_results': validation_results,
        'cv_scores': cv_scores,
        'importance_df': importance_df
    }

4.2 特征工程监控与维护

4.2.1 特征监控系统
python 复制代码
# feature_monitoring.py
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

class FeatureMonitoringSystem:
    """特征监控系统"""
    
    def __init__(self, baseline_data=None):
        self.baseline_stats = self.compute_baseline_stats(baseline_data) if baseline_data is not None else {}
        self.drift_alerts = []
        self.performance_metrics = []
    
    def compute_baseline_stats(self, data):
        """计算基线统计量"""
        baseline_stats = {}
        
        for col in data.columns:
            if data[col].dtype in ['int64', 'float64']:
                baseline_stats[col] = {
                    'mean': data[col].mean(),
                    'std': data[col].std(),
                    'skew': data[col].skew(),
                    'kurtosis': data[col].kurtosis(),
                    'q1': data[col].quantile(0.25),
                    'q3': data[col].quantile(0.75)
                }
            else:
                # 对于类别特征,计算类别分布
                value_counts = data[col].value_counts(normalize=True)
                baseline_stats[col] = {
                    'value_distribution': value_counts.to_dict(),
                    'most_frequent': value_counts.index[0] if len(value_counts) > 0 else None
                }
        
        return baseline_stats
    
    def detect_drift(self, current_data, significance_level=0.05):
        """检测特征漂移"""
        drift_report = {}
        
        for col in current_data.columns:
            if col not in self.baseline_stats:
                continue
                
            if current_data[col].dtype in ['int64', 'float64']:
                # 数值特征漂移检测
                baseline_mean = self.baseline_stats[col]['mean']
                baseline_std = self.baseline_stats[col]['std']
                
                current_mean = current_data[col].mean()
                current_std = current_data[col].std()
                
                # T检验检测均值漂移
                if len(current_data[col]) > 1 and baseline_std > 0:
                    t_stat, p_value = stats.ttest_1samp(
                        current_data[col].dropna(), baseline_mean
                    )
                    
                    # 方差比检验
                    if current_std > 0 and baseline_std > 0:
                        f_stat = current_std ** 2 / baseline_std ** 2
                        var_p_value = stats.f.cdf(f_stat, len(current_data)-1, len(current_data)-1)
                    else:
                        var_p_value = 1
                    
                    drift_detected = (p_value < significance_level) or (var_p_value < significance_level)
                    
                    drift_report[col] = {
                        'type': 'numeric',
                        'drift_detected': drift_detected,
                        'mean_drift': abs(current_mean - baseline_mean) / baseline_std if baseline_std > 0 else 0,
                        'std_drift': abs(current_std - baseline_std) / baseline_std if baseline_std > 0 else 0,
                        'p_value': p_value,
                        'var_p_value': var_p_value
                    }
            
            else:
                # 类别特征漂移检测
                current_dist = current_data[col].value_counts(normalize=True)
                baseline_dist = self.baseline_stats[col]['value_distribution']
                
                # 卡方检验
                all_categories = set(list(current_dist.keys()) + list(baseline_dist.keys()))
                expected = []
                observed = []
                
                for cat in all_categories:
                    expected_freq = baseline_dist.get(cat, 0) * len(current_data)
                    observed_freq = current_dist.get(cat, 0) * len(current_data)
                    
                    expected.append(expected_freq)
                    observed.append(observed_freq)
                
                # 进行卡方检验
                if sum(observed) > 0 and sum(expected) > 0:
                    chi2_stat, p_value = stats.chisquare(observed, expected)
                    
                    drift_detected = p_value < significance_level
                    
                    drift_report[col] = {
                        'type': 'categorical',
                        'drift_detected': drift_detected,
                        'p_value': p_value,
                        'new_categories': set(current_dist.keys()) - set(baseline_dist.keys())
                    }
        
        # 记录漂移警报
        drifted_features = [col for col, report in drift_report.items() 
                          if report['drift_detected']]
        
        if drifted_features:
            alert = {
                'timestamp': datetime.now(),
                'drifted_features': drifted_features,
                'report': drift_report
            }
            self.drift_alerts.append(alert)
        
        return drift_report
    
    def feature_quality_report(self, data):
        """特征质量报告"""
        quality_metrics = {}
        
        for col in data.columns:
            metrics = {}
            
            # 缺失值比例
            missing_ratio = data[col].isnull().sum() / len(data)
            metrics['missing_ratio'] = missing_ratio
            
            # 零方差检测
            if data[col].dtype in ['int64', 'float64']:
                metrics['zero_variance'] = data[col].std() == 0
            else:
                metrics['unique_ratio'] = data[col].nunique() / len(data)
            
            # 异常值检测(仅数值特征)
            if data[col].dtype in ['int64', 'float64']:
                Q1 = data[col].quantile(0.25)
                Q3 = data[col].quantile(0.75)
                IQR = Q3 - Q1
                
                if IQR > 0:
                    lower_bound = Q1 - 1.5 * IQR
                    upper_bound = Q3 + 1.5 * IQR
                    outliers_ratio = ((data[col] < lower_bound) | (data[col] > upper_bound)).sum() / len(data)
                    metrics['outliers_ratio'] = outliers_ratio
            
            quality_metrics[col] = metrics
        
        return quality_metrics
    
    def visualization_dashboard(self, data, drift_report=None):
        """可视化监控仪表板"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # 1. 特征重要性(如果有)
        if hasattr(self, 'importance_df'):
            top_features = self.importance_df.head(10)
            axes[0, 0].barh(range(len(top_features)), top_features['importance'])
            axes[0, 0].set_yticks(range(len(top_features)))
            axes[0, 0].set_yticklabels(top_features['feature'])
            axes[0, 0].set_title('Top 10 特征重要性')
            axes[0, 0].set_xlabel('重要性得分')
        
        # 2. 特征质量热力图
        quality_metrics = self.feature_quality_report(data)
        quality_df = pd.DataFrame(quality_metrics).T
        
        if 'missing_ratio' in quality_df.columns:
            missing_data = quality_df['missing_ratio'].sort_values(ascending=False).head(10)
            axes[0, 1].bar(range(len(missing_data)), missing_data.values)
            axes[0, 1].set_xticks(range(len(missing_data)))
            axes[0, 1].set_xticklabels(missing_data.index, rotation=45)
            axes[0, 1].set_title('Top 10 缺失值比例特征')
            axes[0, 1].set_ylabel('缺失值比例')
        
        # 3. 特征漂移检测
        if drift_report:
            drifted_features = [col for col, report in drift_report.items() 
                              if report.get('drift_detected', False)]
            
            if drifted_features:
                drift_scores = []
                for col in drifted_features:
                    if drift_report[col]['type'] == 'numeric':
                        score = max(drift_report[col]['mean_drift'], drift_report[col]['std_drift'])
                    else:
                        score = 1 - drift_report[col]['p_value']
                    drift_scores.append(score)
                
                axes[1, 0].bar(range(len(drifted_features)), drift_scores)
                axes[1, 0].set_xticks(range(len(drifted_features)))
                axes[1, 0].set_xticklabels(drifted_features, rotation=45)
                axes[1, 0].set_title('特征漂移严重程度')
                axes[1, 0].set_ylabel('漂移得分')
        
        # 4. 特征分布变化
        if len(self.drift_alerts) > 0:
            alert_dates = [alert['timestamp'] for alert in self.drift_alerts]
            alert_counts = [len(alert['drifted_features']) for alert in self.drift_alerts]
            
            axes[1, 1].plot(alert_dates, alert_counts, marker='o')
            axes[1, 1].set_title('特征漂移警报时间线')
            axes[1, 1].set_ylabel('漂移特征数量')
            axes[1, 1].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.show()
    
    def generate_monitoring_report(self, data):
        """生成监控报告"""
        drift_report = self.detect_drift(data)
        quality_report = self.feature_quality_report(data)
        
        print("=== 特征监控报告 ===")
        print(f"监控时间: {datetime.now()}")
        print(f"数据形状: {data.shape}")
        
        # 漂移特征统计
        drifted_count = sum(1 for report in drift_report.values() 
                          if report.get('drift_detected', False))
        print(f"漂移特征数量: {drifted_count}")
        
        # 质量问题的特征
        high_missing = sum(1 for metrics in quality_report.values() 
                         if metrics['missing_ratio'] > 0.3)
        print(f"高缺失率特征 (>30%): {high_missing}")
        
        # 生成可视化
        self.visualization_dashboard(data, drift_report)
        
        return {
            'drift_report': drift_report,
            'quality_report': quality_report,
            'summary': {
                'total_features': len(data.columns),
                'drifted_features': drifted_count,
                'high_missing_features': high_missing
            }
        }

# 监控系统演示
def monitoring_demo():
    """特征监控演示"""
    
    # 创建基线数据
    np.random.seed(42)
    baseline_data = pd.DataFrame({
        'feature1': np.random.normal(0, 1, 1000),
        'feature2': np.random.normal(5, 2, 1000),
        'feature3': np.random.choice(['A', 'B', 'C'], 1000, p=[0.5, 0.3, 0.2]),
        'target': np.random.choice([0, 1], 1000)
    })
    
    # 创建当前数据(模拟漂移)
    current_data = pd.DataFrame({
        'feature1': np.random.normal(0.5, 1.2, 500),  # 均值漂移,方差变化
        'feature2': np.random.normal(5, 2, 500),       # 无漂移
        'feature3': np.random.choice(['A', 'B', 'C', 'D'], 500, p=[0.3, 0.3, 0.3, 0.1]),  # 分布变化,新增类别
        'target': np.random.choice([0, 1], 500)
    })
    
    # 创建监控系统
    monitor = FeatureMonitoringSystem(baseline_data)
    
    # 生成监控报告
    report = monitor.generate_monitoring_report(current_data)
    
    return monitor, report

5 总结与展望

5.1 特征工程技术发展趋势

5.2 学习路径建议

基于13年的特征工程实战经验,我建议的学习路径:

  1. 基础阶段:掌握数据预处理和基础特征变换技术

  2. 进阶阶段:学习特征选择和特征构造的高级方法

  3. 高级阶段:掌握自动化特征工程和领域特定特征工程

  4. 专家阶段:构建企业级特征工程系统和监控体系

官方文档与参考资源

  1. Scikit-learn特征工程文档- 官方特征选择和预处理文档

  2. Featuretools官方文档- 自动化特征工程库文档

  3. Kaggle特征工程教程- 实战特征工程教程

  4. Towards Data Science特征工程专题- 特征工程最新研究和发展

通过本文的完整学习,您应该已经掌握了特征工程从基础理论到企业级实战的全套技术栈。特征工程是机器学习项目中最具创造性和价值的环节,希望本文能帮助您构建更加精准和鲁棒的机器学习模型!

相关推荐
SelectDB1 天前
Litefuse 开源并推出单进程轻量模式,25 秒就能跑起来的 Agent 可观测与评估平台
运维·后端·自动化运维
XIAOHEZIcode3 天前
Linux系统鼠标偏移常见原因以及修复方案
linux·运维·游戏
用户0328472220703 天前
如何搭建本地yum源(上)
运维
大树886 天前
金刚石散热越强,管路越先见顶
大数据·运维·服务器·人工智能·ai
摇滚侠6 天前
Linux CentOS7 rpm 安装 MySQL 5.7
linux·运维·mysql
霸道流氓气质6 天前
领域驱动设计(DDD)在 Spring Boot 微服务中的实践指南
运维·spring boot·微服务
Inhand陈工7 天前
基于台达PLC与映翰通IG502的智慧水产养殖精准投喂与远程运维解决方案
运维·人工智能·物联网·阿里云·信息与通信
酣大智7 天前
ARP代理--工作原理
运维·网络·arp·arp代理
shushangyun_7 天前
2026年快消品B2B系统推荐:支持终端门店订货、促销政策自动化的工具?
java·运维·网络·数据库·人工智能·spring·自动化
施努卡机器视觉7 天前
SNK施努卡侧滑门锁上滑轮总成自动化装配线,从零件到组件,全流程精密制造方案
运维·自动化·制造