┌─────────────────────────────────────────────────────────────┐
│ Data Analysis Tech Stack                                    │
├─────────────────────────────────────────────────────────────┤
│ Application layer: BI / visualization / reporting           │
│ ├─ Tableau / Power BI / Looker                              │
│ ├─ Streamlit / Dash (interactive analysis apps)             │
│ └─ Jupyter Notebook (exploratory analysis)                  │
├─────────────────────────────────────────────────────────────┤
│ Analysis layer: Python/R analysis frameworks                │
│ ├─ Data wrangling: pandas, numpy                            │
│ ├─ Statistics: scipy, statsmodels                           │
│ ├─ Machine learning: scikit-learn, XGBoost                  │
│ ├─ Deep learning: TensorFlow, PyTorch                       │
│ └─ Time series: prophet, darts                              │
├─────────────────────────────────────────────────────────────┤
│ Data layer: SQL / storage / ETL                             │
│ ├─ Query engines: PostgreSQL, MySQL, BigQuery, Snowflake    │
│ ├─ Data warehouses: Redshift, Databricks                    │
│ ├─ ETL: Apache Airflow, dbt                                 │
│ └─ Streaming: Kafka, Flink                                  │
├─────────────────────────────────────────────────────────────┤
│ Infrastructure layer: compute / containers / orchestration  │
│ ├─ Compute: Docker, Kubernetes                              │
│ ├─ Cloud: AWS, GCP, Azure                                   │
│ └─ Version control: Git, GitHub Actions                     │
└─────────────────────────────────────────────────────────────┘
## Table of Contents
[1. DataFrame (df) Usage in Pandas](#1. DataFrame (df) Usage in Pandas)
[📊 Creating a Table](#📊 Creating a Table)
[2. NumPy Array Usage](#2. NumPy Array Usage)
[3. Key Topic: Converting Between NumPy and Pandas (df)](#3. Key Topic: Converting Between NumPy and Pandas (df))
[🔁 Conversion Cheat Sheet](#🔁 Conversion Cheat Sheet)
[💻 Code Walkthrough](#💻 Code Walkthrough)
[📌 Recommendations](#📌 Recommendations)
[2.1 First Principles → Code Implementation](#2.1 First Principles → Code Implementation)
[2.2 5 Whys → Root Cause Analysis Stack](#2.2 5 Whys → Root Cause Analysis Stack)
[2.3 Systems Thinking → Graph Databases and Network Analysis](#2.3 Systems Thinking → Graph Databases and Network Analysis)
[2.4 MECE → Data Partitioning and Classification Algorithms](#2.4 MECE → Data Partitioning and Classification Algorithms)
[3.1 Data Cleaning Framework](#3.1 Data Cleaning Framework)
[3.2 Feature Engineering Techniques](#3.2 Feature Engineering Techniques)
[4.1 End-to-End ML Pipeline](#4.1 End-to-End ML Pipeline)
[5.1 Data Version Control](#5.1 Data Version Control)
## I. Data Analysis Fundamentals
NumPy and Pandas are the "golden pair" of Python data science, but they have different jobs:
- NumPy: excels at pure numerical computation (matrices, array operations). It is fast, but has no concept of column names or row indexes.
- Pandas (DataFrame): built on top of NumPy, it is essentially a labeled, Excel-style table. It adds "column names" and a "row index" to the data, which makes it better suited to structured data (CSV and Excel files, and the like).
The sections below break down how to use NumPy arrays and Pandas DataFrames (df), and how to convert between them.
### 1. DataFrame (df) Usage in Pandas
`df` is simply the conventional variable name we give a DataFrame object; think of it as a powerful in-memory table.
#### 📊 Creating a Table
You can create a table with named columns from a dict or a list (the ages below are assumed sample values):
```python
import pandas as pd
import numpy as np

# Method 1: from a dict (most common; keys become column names)
data = {'Name': ['Tom', 'Nick', 'Alice'], 'Age': [28, 19, 33]}
df = pd.DataFrame(data)

# Method 2: directly from a NumPy array
arr = np.array([[1, 2], [3, 4]])
df_from_np = pd.DataFrame(arr, columns=['A', 'B'])  # column names can be specified
```
#### 🔍 Selecting and Slicing Data
This is where `df` shines; you can look data up the way you look up a dictionary (a short demo follows below):
* **Select a column**: `df['Name']` (the Name column).
* **Filter by condition**: `df[df['Age'] > 20]` (rows where age is greater than 20).
* **Select by position or label**: `df.iloc[0]` (first row, by integer position), `df.loc[0]` (by index label).
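A minimal sketch of those selections, reusing the `df` built above (with the assumed sample ages):
```python
print(df['Name'])            # select one column -> a Series
print(df[df['Age'] > 20])    # boolean filtering keeps only matching rows
print(df.iloc[0])            # first row, by integer position
print(df.loc[0, 'Name'])     # row label 0, column 'Name'
```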
#### 🧹 Data Cleaning
Handling missing and duplicated values is straightforward (a quick demo follows below):
* **Drop missing values**: `df.dropna()`
* **Fill missing values**: `df.fillna(0)`
* **Drop duplicates**: `df.drop_duplicates()`
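A quick sketch on a toy frame with one duplicate row and one missing value (values invented for illustration):
```python
dirty = pd.DataFrame({'Name': ['Tom', 'Tom', 'Alice'], 'Age': [28, 28, None]})
print(dirty.dropna())             # drops the row containing NaN
print(dirty.fillna(0))            # or fill NaN with a default instead
print(dirty.drop_duplicates())    # drops the fully duplicated row
```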
---
### 2. NumPy Array Usage
The core of NumPy is the `ndarray`. It has no notion of "column names"; it exists purely for **speed** and **mathematical operations**.
#### 🔢 Creating Arrays
```python
import numpy as np

# 1-D array
arr_1d = np.array([1, 2, 3])

# 2-D array (matrix)
arr_2d = np.array([[1, 2], [3, 4]])
```
#### 🧮 Math Operations
NumPy supports "broadcasting": you can apply math to a whole array at once, with no explicit loops:
```python
# add 10 to every element
result = arr_1d + 10

# mean of all elements
mean_val = np.mean(arr_2d)
```
### 3. Key Topic: Converting Between NumPy and Pandas (df)
In real work you constantly switch between the two: read and clean data with Pandas, convert to NumPy for computation, then convert back to Pandas for presentation.
#### 🔁 Conversion Cheat Sheet
| Direction | Method / code | Notes |
|---|---|---|
| NumPy → Pandas | `pd.DataFrame(array, columns=['A', 'B'])` | Turns an array into a table. Remember to pass column names, otherwise they default to 0, 1, ... |
| Pandas → NumPy | `df.values` or `df.to_numpy()` | Extracts the pure values as an array; column names and the index are lost. |
| Series → NumPy | `df['Name'].values` | Extracts one column as a 1-D array. |
#### 💻 Code Walkthrough
```python
# 1. suppose we have a NumPy array
np_data = np.array([[10, 20], [30, 40]])

# 2. convert to a Pandas DataFrame (assigning column names)
df = pd.DataFrame(np_data, columns=['X', 'Y'])
print(df)
# output:
#     X   Y
# 0  10  20
# 1  30  40

# 3. work on the df (e.g. add 5 to column X)
df['X'] = df['X'] + 5

# 4. convert back to NumPy for the next round of heavy computation
final_array = df.values
```
#### 📌 Recommendations
- If you are working with Excel/CSV files, or need column names and a row index, use Pandas (`df`).
- If you are doing heavy matrix math or image processing, or need maximum speed, use NumPy.
- Best practice: the usual flow is read data (Pandas) → clean data (Pandas) → numerical computation (NumPy) → present results (Pandas).
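A minimal sketch of that flow (the file name `sales.csv` and the `amount` column are assumptions for illustration):
```python
df = pd.read_csv('sales.csv')             # read (Pandas)
df = df.dropna(subset=['amount'])         # clean (Pandas)
arr = df['amount'].to_numpy()             # compute (NumPy)
df['amount_zscore'] = (arr - arr.mean()) / arr.std()
print(df.head())                          # present (Pandas)
```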
## II. Technical Implementation of Analysis Frameworks
### 2.1 First Principles → Code Implementation
Framework core: decompose the problem down to its most basic variables, then rebuild the logic from zero.
Implementation pattern:
```python
# Example: a first-principles decomposition of user retention analysis

# Layer 1: define the core metric
#   retention_rate = retained_users / total_users
# Layer 2: decompose the variables
#   retained_users = users with at least one "positive action"
#   total_users    = users present at the start of the observation window

# Layer 3: define "positive action" from business fundamentals
def is_positive_action(user_behavior):
    """
    Base definition: the user completed at least one of the
    following actions during the observation window.
    """
    actions = {
        'login': user_behavior.login_count > 0,
        'purchase': user_behavior.purchase_count > 0,
        'engagement': user_behavior.time_spent > 300  # seconds
    }
    return any(actions.values())

# Layer 4: build behavior features from the event log
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

def build_user_behavior_features(events_df):
    """
    Aggregate per-user behavior features from raw event logs.
    """
    return events_df.groupBy('user_id').agg(
        F.count('*').alias('total_events'),
        F.sum(F.when(F.col('event_type') == 'login', 1).otherwise(0))
         .alias('login_count'),
        F.sum(F.when(F.col('event_type') == 'purchase', 1).otherwise(0))
         .alias('purchase_count'),
        F.sum('duration').alias('time_spent')
    )

# Layer 5: end-to-end analysis pipeline
def first_principles_retention_analysis(spark, event_logs_path, observation_start, observation_end):
    """
    Full pipeline: from logs to a retention rate.
    """
    # 1. read the raw data
    events_df = spark.read.parquet(event_logs_path)
    # 2. clean: keep only events inside the observation window
    events_df = events_df.filter(
        (F.col('timestamp') >= observation_start) &
        (F.col('timestamp') <= observation_end)
    )
    # 3. feature engineering (the heart of the first-principles decomposition)
    behavior_df = build_user_behavior_features(events_df)
    # 4. apply the base definition (the UDF must declare a boolean return type)
    is_retained_udf = F.udf(is_positive_action, BooleanType())
    behavior_df = behavior_df.withColumn(
        'is_retained',
        is_retained_udf(F.struct(
            F.col('login_count'),
            F.col('purchase_count'),
            F.col('time_spent')
        ))
    )
    # 5. compute the metric
    retention_rate = behavior_df.agg(
        F.avg(F.when(F.col('is_retained'), 1).otherwise(0))
    ).collect()[0][0]
    return {
        'retention_rate': retention_rate,
        'retained_users': behavior_df.filter(F.col('is_retained')).count(),
        'total_users': behavior_df.count()
    }
```
Key insights:
- First principles show up in code as functional decomposition: each function does exactly one thing.
- Avoid "magic numbers" and folklore formulas; write the logic outward from the business definition.
- Testability: every base unit can be verified independently (see the sketch below).
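A minimal sketch of such a unit test, assuming pytest and using a plain namedtuple to stand in for the Spark struct Row:
```python
from collections import namedtuple

UserBehavior = namedtuple('UserBehavior', ['login_count', 'purchase_count', 'time_spent'])

def test_is_positive_action():
    # a single login is enough to count as retained
    assert is_positive_action(UserBehavior(1, 0, 0))
    # long engagement alone also counts (threshold is 300 seconds)
    assert is_positive_action(UserBehavior(0, 0, 301))
    # no login, no purchase, short visit -> not retained
    assert not is_positive_action(UserBehavior(0, 0, 10))
```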
### 2.2 5 Whys → Root Cause Analysis Stack
Framework core: ask "why" recursively until you reach the root cause.
Implementation: causal graphs + data lineage.
```python
# Example: a 5 Whys analysis of an e-commerce order decline
class RootCauseAnalyzer:
    """
    Automated root cause analysis built on the 5 Whys.
    Assumes `db_connection` is a PostgreSQL connection whose cursor
    returns dict-like rows (e.g. a dict cursor).
    """
    def __init__(self, db_connection):
        self.db = db_connection

    def why_1_order_decline(self, period_start, period_end):
        """
        Why 1: did order volume actually decline?
        Compares the second half of the window against the first half.
        """
        sql = """
            WITH daily_orders AS (
                SELECT
                    date_trunc('day', order_date) AS day,
                    COUNT(*) AS order_count
                FROM orders
                WHERE order_date BETWEEN %s AND %s
                GROUP BY 1
            ),
            split AS (
                SELECT order_count,
                       NTILE(2) OVER (ORDER BY day) AS half
                FROM daily_orders
            )
            SELECT
                AVG(order_count) AS avg_daily_orders,
                STDDEV(order_count) AS std_daily_orders,
                (AVG(order_count) FILTER (WHERE half = 2)
                 - AVG(order_count) FILTER (WHERE half = 1))
                / NULLIF(AVG(order_count) FILTER (WHERE half = 1), 0) AS period_change
            FROM split
        """
        return self.db.execute(sql, (period_start, period_end)).fetchone()
    def why_2_traffic_or_conversion(self, period_start, period_end):
        """
        Why 2: is it traffic that dropped, or conversion?
        """
        # traffic analysis
        traffic_sql = """
            SELECT COUNT(DISTINCT session_id) AS sessions,
                   AVG(page_views) AS avg_pages_per_session
            FROM web_analytics
            WHERE event_date BETWEEN %s AND %s
        """
        # conversion funnel analysis
        conversion_sql = """
            WITH funnel AS (
                SELECT
                    COUNT(DISTINCT CASE WHEN event_type = 'product_view' THEN session_id END) AS views,
                    COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart' THEN session_id END) AS adds,
                    COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN session_id END) AS purchases
                FROM web_analytics
                WHERE event_date BETWEEN %s AND %s
            )
            SELECT
                adds * 1.0 / NULLIF(views, 0) AS view_to_add_rate,
                purchases * 1.0 / NULLIF(adds, 0) AS add_to_purchase_rate,
                purchases * 1.0 / NULLIF(views, 0) AS view_to_purchase_rate
            FROM funnel
        """
        return {
            'traffic': self.db.execute(traffic_sql, (period_start, period_end)).fetchone(),
            'conversion': self.db.execute(conversion_sql, (period_start, period_end)).fetchone()
        }
    def why_3_traffic_sources(self, period_start, period_end):
        """
        Why 3: if traffic dropped, which channel is the problem?
        """
        sql = """
            SELECT
                traffic_source,
                COUNT(DISTINCT session_id) AS sessions,
                AVG(session_duration) AS avg_duration,
                SUM(revenue) AS total_revenue
            FROM web_analytics
            WHERE event_date BETWEEN %s AND %s
            GROUP BY 1
            ORDER BY 2 DESC
        """
        return self.db.execute(sql, (period_start, period_end)).fetchall()
    def why_4_external_factors(self, period_start, period_end):
        """
        Why 4: are external factors at play? (seasonality, competitors, events)
        """
        # time-series decomposition
        from statsmodels.tsa.seasonal import seasonal_decompose
        import pandas as pd

        orders_df = pd.read_sql("""
            SELECT DATE(order_date) AS date, COUNT(*) AS orders
            FROM orders
            WHERE order_date >= %s::date - INTERVAL '90 days'
            GROUP BY 1
            ORDER BY 1
        """, self.db, params=(period_start,))
        orders_df['date'] = pd.to_datetime(orders_df['date'])

        result = seasonal_decompose(orders_df.set_index('date')['orders'],
                                    model='multiplicative', period=30)
        return {
            'trend': result.trend,
            'seasonal': result.seasonal,
            'residual': result.resid
        }
    def why_5_root_cause(self, analysis_results):
        """
        Why 5: summarize the root causes.
        """
        causes = []
        # check the long-term trend (drop the NaN edges the decomposition leaves)
        trend = analysis_results['external']['trend'].dropna()
        if trend.iloc[-1] < trend.iloc[-30]:
            causes.append("Long-term trend: market demand is declining")
        # check seasonality (multiplicative model: factors hover around 1)
        if abs(analysis_results['external']['seasonal'].iloc[-1] - 1) > 0.2:
            causes.append("Seasonality: currently in the off-season")
        # check channels
        traffic_sources = analysis_results['traffic_sources']
        if len(traffic_sources) > 0:
            worst_source = min(traffic_sources, key=lambda x: x[1])
            if worst_source[1] < sum(x[1] for x in traffic_sources) / len(traffic_sources) * 0.5:
                causes.append(f"Channel issue: traffic from {worst_source[0]} dropped sharply")
        return causes

# usage example (assumes `db_connection` is already open, see class docstring)
analyzer = RootCauseAnalyzer(db_connection)
results = dict(analyzer.why_1_order_decline('2026-01-01', '2026-03-01'))
if results['period_change'] < -0.1:  # more than a 10% drop
    results['conversion'] = analyzer.why_2_traffic_or_conversion('2026-01-01', '2026-03-01')
    results['traffic_sources'] = analyzer.why_3_traffic_sources('2026-01-01', '2026-03-01')
    results['external'] = analyzer.why_4_external_factors('2026-01-01', '2026-03-01')
    root_causes = analyzer.why_5_root_cause(results)
    print("Root cause analysis:")
    for cause in root_causes:
        print(f" - {cause}")
```
### 2.3 Systems Thinking → Graph Databases and Network Analysis
Framework core: understand the feedback loops and dynamic behavior inside a system.
Implementation: a graph database (Neo4j) plus network analysis (NetworkX); a sketch follows below.
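The original notes name the stack but give no code. A minimal NetworkX sketch (the metric nodes and edges are invented for illustration) that models metrics as a directed graph and surfaces feedback loops as directed cycles:
```python
import networkx as nx

# model the metric system as a directed graph: an edge A -> B means "A drives B"
G = nx.DiGraph()
G.add_edges_from([
    ('ad_spend', 'traffic'),
    ('traffic', 'orders'),
    ('orders', 'revenue'),
    ('revenue', 'ad_spend'),   # reinvestment closes a feedback loop
    ('orders', 'reviews'),
    ('reviews', 'traffic'),    # word of mouth: a second loop
])

# feedback loops = directed cycles
for cycle in nx.simple_cycles(G):
    print('feedback loop:', ' -> '.join(cycle))

# centrality hints at which lever touches the most of the system
print(nx.pagerank(G))
```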
### 2.4 MECE → Data Partitioning and Classification Algorithms
Framework core: classification that is mutually exclusive and collectively exhaustive.
Implementation: data modeling, partition-key design, clustering algorithms.
```python
# Example: applying the MECE principle to data partitioning and user segmentation
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

class MECEDataClassifier:
    """
    Data engineering methods that enforce MECE classification.
    """
    @staticmethod
    def validate_mece(partitions, total_records):
        """
        Verify that a set of partitions satisfies MECE.
        """
        # 1. mutual exclusivity: does any record appear in more than one partition?
        all_ids = []
        for name, df in partitions.items():
            all_ids.extend(df['id'].tolist())
        duplicates = len(all_ids) - len(set(all_ids))
        # 2. exhaustiveness: is every record covered?
        total_in_partitions = sum(len(df) for df in partitions.values())
        validation = {
            'total_records': total_records,
            'total_in_partitions': total_in_partitions,
            'exhaustive': total_in_partitions == total_records,
            'mutually_exclusive': duplicates == 0,
            'duplicates': duplicates
        }
        return validation
    @staticmethod
    def mece_time_partitioning(df, date_column='date'):
        """
        MECE partitioning by time (no overlap).
        """
        df = df.copy()
        df[date_column] = pd.to_datetime(df[date_column])
        partitions = {}
        # mutually exclusive monthly partitions
        df['month'] = df[date_column].dt.to_period('M')
        for month in df['month'].unique():
            partitions[str(month)] = df[df['month'] == month].copy()
        return partitions, MECEDataClassifier.validate_mece(partitions, len(df))
    @staticmethod
    def mece_geographic_partitioning(df, region_column='region'):
        """
        MECE partitioning by geography.
        """
        # define the full geographic hierarchy (guarantees exhaustiveness)
        region_mapping = {
            '华东': ['上海', '江苏', '浙江'],
            '华北': ['北京', '天津', '河北'],
            '华南': ['广东', '广西', '海南'],
            '西部': ['四川', '重庆', '云南', '贵州'],
            '其他': []  # catch-all for every other region
        }
        # assign each record to a macro region
        def assign_region(city):
            for region, cities in region_mapping.items():
                if city in cities:
                    return region
            return '其他'
        df = df.copy()
        df['macro_region'] = df[region_column].apply(assign_region)
        partitions = {
            region: df[df['macro_region'] == region].copy()
            for region in region_mapping.keys()
        }
        return partitions, MECEDataClassifier.validate_mece(partitions, len(df))
    @staticmethod
    def mece_user_segmentation(df, features=('recency', 'frequency', 'monetary'), n_clusters=4):
        """
        MECE user segmentation via K-means.
        """
        features = list(features)
        # prepare the data
        X = df[features].values
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        # clustering (K-means is mutually exclusive by construction)
        kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        clusters = kmeans.fit_predict(X_scaled)
        # build the partitions (each sample belongs to exactly one cluster = exclusive)
        partitions = {}
        for cluster_id in range(n_clusters):
            partitions[f'Cluster_{cluster_id}'] = df[clusters == cluster_id].copy()
        validation = MECEDataClassifier.validate_mece(partitions, len(df))
        # profile each cluster
        cluster_profiles = {}
        for cluster_id in range(n_clusters):
            cluster_data = df[clusters == cluster_id]
            profile = {
                'size': len(cluster_data),
                'mean_features': cluster_data[features].mean().to_dict(),
                'std_features': cluster_data[features].std().to_dict()
            }
            cluster_profiles[f'Cluster_{cluster_id}'] = profile
        return partitions, validation, cluster_profiles
    @staticmethod
    def mece_business_logic_partitioning(df):
        """
        MECE partitioning by business logic (e.g. user lifecycle).
        """
        def lifecycle_stage(row):
            """
            Define mutually exclusive, collectively exhaustive lifecycle stages.
            The check order is what guarantees mutual exclusivity.
            """
            if row['days_since_first_order'] == 0:
                return 'New user (first day)'
            elif row['days_since_first_order'] <= 7:
                return 'New user (first week)'
            elif row['days_since_last_order'] > 90:
                return 'Churned user'
            elif row['lifetime_orders'] >= 10:
                return 'High-value user'
            elif row['lifetime_orders'] >= 3:
                return 'Loyal user'
            else:
                return 'Active user'
        df = df.copy()
        df['lifecycle_stage'] = df.apply(lifecycle_stage, axis=1)
        partitions = {
            stage: df[df['lifecycle_stage'] == stage].copy()
            for stage in df['lifecycle_stage'].unique()
        }
        validation = MECEDataClassifier.validate_mece(partitions, len(df))
        return partitions, validation
# In practice: a MECE partitioning strategy for the data warehouse
class DataWarehousePartitioner:
    """
    MECE partition design inside a data warehouse.
    """
    def design_table_partitioning(self):
        """
        Design the table's partition keys so that query-time partitions
        are mutually exclusive (Oracle-style composite RANGE-LIST syntax).
        """
        # Anti-pattern: non-exclusive
        # PARTITION BY RANGE (date) SUBPARTITION BY HASH (user_id)
        # partitioning by date and user hash at once can force some queries to scan many partitions

        # Better: mutually exclusive hierarchical partitions
        partition_design = """
        CREATE TABLE orders_partitioned (
            order_id BIGINT,
            user_id BIGINT,
            order_date DATE,
            amount DECIMAL(10,2),
            status VARCHAR(20),
            region VARCHAR(50),
            PRIMARY KEY (order_id, order_date)
        )
        PARTITION BY RANGE (order_date)
        SUBPARTITION BY LIST (region); -- second-level partitions are also exclusive

        -- create subpartitions (exhaustive: every region under every date)
        ALTER TABLE orders_partitioned
        ADD PARTITION p2026_01 VALUES LESS THAN ('2026-02-01') (
            SUBPARTITION p2026_01_east VALUES IN ('上海', '江苏', '浙江'),
            SUBPARTITION p2026_01_north VALUES IN ('北京', '天津', '河北'),
            SUBPARTITION p2026_01_south VALUES IN ('广东', '广西', '海南'),
            SUBPARTITION p2026_01_other VALUES IN ('其他')
        );
        -- add remaining monthly partitions...
        """
        return partition_design
    def mece_data_quality_checks(self, df):
        """
        MECE data quality checks (every dimension must be covered).
        """
        checks = []
        # check 1: exhaustiveness of status values
        expected_statuses = ['pending', 'paid', 'shipped', 'delivered', 'cancelled']
        actual_statuses = df['status'].unique()
        missing = set(expected_statuses) - set(actual_statuses)
        unexpected = set(actual_statuses) - set(expected_statuses)
        checks.append({
            'dimension': 'order_status',
            'expected': expected_statuses,
            'actual': list(actual_statuses),
            'missing': list(missing),
            'unexpected': list(unexpected),
            'is_mece': len(missing) == 0 and len(unexpected) == 0
        })
        # check 2: temporal continuity (no missing dates)
        date_range = pd.date_range(df['order_date'].min(), df['order_date'].max())
        missing_dates = set(date_range) - set(df['order_date'].unique())
        checks.append({
            'dimension': 'order_date',
            'expected_range': (df['order_date'].min(), df['order_date'].max()),
            'missing_dates': len(missing_dates),
            'is_continuous': len(missing_dates) == 0
        })
        return checks
# usage example: generate synthetic data
np.random.seed(42)
n_users = 10000
data = {
    'id': range(n_users),
    'recency': np.random.randint(1, 365, n_users),
    'frequency': np.random.randint(1, 50, n_users),
    'monetary': np.random.exponential(1000, n_users),
    'days_since_first_order': np.random.randint(0, 730, n_users),
    'days_since_last_order': np.random.randint(0, 365, n_users),
    'lifetime_orders': np.random.randint(0, 100, n_users),
    'date': pd.date_range('2026-01-01', periods=n_users),
    'region': np.random.choice(['上海', '北京', '广东', '四川'], n_users),
    'status': np.random.choice(['paid', 'shipped', 'delivered'], n_users)
}
df = pd.DataFrame(data)

# MECE user segmentation
classifier = MECEDataClassifier()
partitions, validation, profiles = classifier.mece_user_segmentation(df)

print("MECE validation:")
print(f"  mutually exclusive: {validation['mutually_exclusive']}")
print(f"  exhaustive: {validation['exhaustive']}")
print(f"  total records: {validation['total_records']}")
print("\nSegment profiles:")
for cluster_name, profile in profiles.items():
    print(f"  {cluster_name}: {profile['size']} users")
    print(f"    mean RFM: R={profile['mean_features']['recency']:.1f}, "
          f"F={profile['mean_features']['frequency']:.1f}, "
          f"M={profile['mean_features']['monetary']:.1f}")
```
## III. Data Processing Engineering
### 3.1 Data Cleaning Framework
```python
import pandas as pd
import numpy as np
from typing import Callable

class DataCleaningPipeline:
    """
    A composable data cleaning pipeline.
    """
    def __init__(self):
        self.steps = []
        self.logs = []

    def add_step(self, name: str, func: Callable, **kwargs):
        """
        Register a cleaning step.
        """
        self.steps.append({
            'name': name,
            'func': func,
            'kwargs': kwargs
        })
        return self

    def execute(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Run the whole pipeline.
        """
        result_df = df.copy()
        for step in self.steps:
            original_rows = len(result_df)
            result_df = step['func'](result_df, **step['kwargs'])
            new_rows = len(result_df)
            rows_removed = original_rows - new_rows
            self.logs.append({
                'step': step['name'],
                'rows_before': original_rows,
                'rows_after': new_rows,
                'rows_removed': rows_removed
            })
        return result_df

    def get_report(self) -> pd.DataFrame:
        """
        Return the cleaning report.
        """
        return pd.DataFrame(self.logs)
# library of cleaning functions
def remove_duplicates(df, subset=None):
    """Drop duplicate rows."""
    return df.drop_duplicates(subset=subset, keep='first')

def handle_missing_values(df, strategy='drop', threshold=0.5):
    """
    Handle missing values.
    strategy: 'drop', 'fill_mean', 'fill_median', 'fill_forward', 'fill_zero'
    threshold: columns whose missing ratio exceeds this value are dropped
    """
    df = df.copy()
    # drop columns with too many missing values
    missing_ratio = df.isnull().mean()
    cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
    if cols_to_drop:
        df = df.drop(columns=cols_to_drop)
    if strategy == 'drop':
        df = df.dropna()
    elif strategy == 'fill_mean':
        df = df.fillna(df.mean(numeric_only=True))
    elif strategy == 'fill_median':
        df = df.fillna(df.median(numeric_only=True))
    elif strategy == 'fill_forward':
        df = df.ffill()
    elif strategy == 'fill_zero':
        df = df.fillna(0)
    return df
def remove_outliers(df, columns, method='iqr', n_std=3):
    """
    Remove outliers.
    method: 'iqr' (interquartile range) or 'zscore' (standard deviations)
    """
    df = df.copy()
    for col in columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            if method == 'iqr':
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
            elif method == 'zscore':
                mean = df[col].mean()
                std = df[col].std()
                df = df[abs(df[col] - mean) <= n_std * std]
    return df
def standardize_formats(df, column_mappings):
    """
    Standardize column formats.
    column_mappings: {
        'column_name': {
            'type': 'datetime' | 'numeric' | 'categorical',
            'format': str  # optional
        }
    }
    """
    df = df.copy()
    for col, config in column_mappings.items():
        if col not in df.columns:
            continue
        data_type = config.get('type')
        if data_type == 'datetime':
            fmt = config.get('format')
            if fmt:
                df[col] = pd.to_datetime(df[col], format=fmt)
            else:
                df[col] = pd.to_datetime(df[col])
        elif data_type == 'numeric':
            # strip non-numeric characters first
            if df[col].dtype == 'object':
                df[col] = df[col].str.replace('[^0-9.-]', '', regex=True)
            df[col] = pd.to_numeric(df[col], errors='coerce')
        elif data_type == 'categorical':
            df[col] = df[col].astype('category')
    return df
# usage example (assumes a raw DataFrame `raw_df` is already loaded)
pipeline = DataCleaningPipeline()
pipeline.add_step('remove_duplicates', remove_duplicates, subset='user_id')
pipeline.add_step('handle_missing', handle_missing_values, strategy='fill_median')
pipeline.add_step('remove_outliers', remove_outliers, columns=['amount', 'age'], method='iqr')
pipeline.add_step('standardize', standardize_formats,
                  column_mappings={
                      'order_date': {'type': 'datetime'},
                      'amount': {'type': 'numeric'},
                      'user_segment': {'type': 'categorical'}
                  })

# run the pipeline
cleaned_df = pipeline.execute(raw_df)
report = pipeline.get_report()
```
### 3.2 Feature Engineering Techniques
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
import pandas as pd
import numpy as np

class FeatureEngineering:
    """
    A toolbox of feature engineering helpers.
    """
    @staticmethod
    def create_time_features(df, date_column):
        """
        Derive calendar features from a date column.
        """
        df = df.copy()
        df[date_column] = pd.to_datetime(df[date_column])
        df[f'{date_column}_year'] = df[date_column].dt.year
        df[f'{date_column}_month'] = df[date_column].dt.month
        df[f'{date_column}_day'] = df[date_column].dt.day
        df[f'{date_column}_dayofweek'] = df[date_column].dt.dayofweek
        df[f'{date_column}_is_weekend'] = df[date_column].dt.dayofweek.isin([5, 6]).astype(int)
        df[f'{date_column}_is_month_start'] = df[date_column].dt.is_month_start.astype(int)
        df[f'{date_column}_is_month_end'] = df[date_column].dt.is_month_end.astype(int)
        # season (1-4)
        df[f'{date_column}_season'] = df[date_column].dt.month % 12 // 3 + 1
        return df
    @staticmethod
    def create_lag_features(df, group_column, value_column, lags=(1, 7, 30)):
        """
        Create lag features per group.
        """
        df = df.copy()
        for lag in lags:
            df[f'{value_column}_lag_{lag}'] = df.groupby(group_column)[value_column].shift(lag)
        return df

    @staticmethod
    def create_rolling_features(df, group_column, value_column, windows=(7, 14, 30)):
        """
        Create rolling-window features per group.
        """
        df = df.copy()
        for window in windows:
            df[f'{value_column}_rolling_mean_{window}'] = (
                df.groupby(group_column)[value_column]
                  .transform(lambda x: x.rolling(window).mean())
            )
            df[f'{value_column}_rolling_std_{window}'] = (
                df.groupby(group_column)[value_column]
                  .transform(lambda x: x.rolling(window).std())
            )
            df[f'{value_column}_rolling_max_{window}'] = (
                df.groupby(group_column)[value_column]
                  .transform(lambda x: x.rolling(window).max())
            )
        return df
    @staticmethod
    def create_ratio_features(df, numerators, denominators, epsilon=1e-6):
        """
        Create ratio features; epsilon guards against division by zero.
        """
        df = df.copy()
        for num, den in zip(numerators, denominators):
            feature_name = f'{num}_per_{den}'
            df[feature_name] = df[num] / (df[den] + epsilon)
        return df

    @staticmethod
    def encode_categorical(df, columns, method='onehot', drop_first=True):
        """
        Encode categorical features ('onehot' or 'label').
        """
        df = df.copy()
        if method == 'onehot':
            for col in columns:
                if col in df.columns:
                    dummies = pd.get_dummies(df[col], prefix=col, drop_first=drop_first)
                    df = pd.concat([df, dummies], axis=1)
                    df = df.drop(columns=[col])
        elif method == 'label':
            for col in columns:
                if col in df.columns:
                    le = LabelEncoder()
                    df[col] = le.fit_transform(df[col].astype(str))
        return df
    @staticmethod
    def scale_features(df, columns, method='standard'):
        """
        Scale numeric features.
        """
        df = df.copy()
        if method == 'standard':
            scaler = StandardScaler()
        elif method == 'minmax':
            scaler = MinMaxScaler()
        else:
            raise ValueError(f"Unknown scaling method: {method}")
        df[columns] = scaler.fit_transform(df[columns])
        return df

    @staticmethod
    def select_features(X, y, method='k_best', k=10):
        """
        Select the k most informative features.
        """
        if method == 'k_best':
            selector = SelectKBest(score_func=f_classif, k=k)
        elif method == 'mutual_info':
            selector = SelectKBest(score_func=mutual_info_classif, k=k)
        else:
            raise ValueError(f"Unknown selection method: {method}")
        X_selected = selector.fit_transform(X, y)
        selected_features = X.columns[selector.get_support()]
        return pd.DataFrame(X_selected, columns=selected_features), selector
# In practice: feature engineering on user behavior
def create_user_behavior_features(events_df):
    """
    Build per-user features from an event log.
    """
    # base aggregations
    user_features = events_df.groupby('user_id').agg({
        'event_type': 'count',            # total events
        'session_id': pd.Series.nunique,  # number of sessions
        'page_views': 'sum',              # total page views
        'time_on_page': 'mean',           # average dwell time
        'is_mobile': 'mean'               # share of mobile events
    }).reset_index()
    user_features.columns = ['user_id', 'total_events', 'n_sessions',
                             'total_page_views', 'avg_time_on_page', 'mobile_ratio']
    # ratio features
    user_features['events_per_session'] = (
        user_features['total_events'] / user_features['n_sessions']
    )
    user_features['page_views_per_event'] = (
        user_features['total_page_views'] / user_features['total_events']
    )
    # distribution over event types
    event_type_dist = pd.crosstab(events_df['user_id'], events_df['event_type'],
                                  normalize='index')
    event_type_dist.columns = [f'event_type_{col}_ratio' for col in event_type_dist.columns]
    user_features = user_features.merge(
        event_type_dist,
        left_on='user_id',
        right_index=True,
        how='left'
    )
    return user_features
```
## IV. Machine Learning Engineering
### 4.1 End-to-End ML Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import joblib
import pandas as pd
import numpy as np

class MLExperiment:
    """
    A machine learning experiment harness.
    """
    def __init__(self, X, y, random_state=42):
        self.X = X
        self.y = y
        self.random_state = random_state
        self.pipeline = None
        self.model = None
        self.results = {}
    def build_pipeline(self, numeric_features, categorical_features):
        """
        Build the preprocessing + model pipeline.
        """
        # numeric features
        numeric_transformer = Pipeline(steps=[
            ('scaler', StandardScaler())
        ])
        # categorical features
        categorical_transformer = Pipeline(steps=[
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ])
        # combine both branches
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numeric_features),
                ('cat', categorical_transformer, categorical_features)
            ]
        )
        # full pipeline
        self.pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('classifier', RandomForestClassifier(random_state=self.random_state))
        ])
        return self
    def train(self, test_size=0.2):
        """
        Fit the model and record train/test scores.
        """
        X_train, X_test, y_train, y_test = train_test_split(
            self.X, self.y, test_size=test_size, random_state=self.random_state
        )
        self.pipeline.fit(X_train, y_train)
        # evaluate
        train_score = self.pipeline.score(X_train, y_train)
        test_score = self.pipeline.score(X_test, y_test)
        self.results = {
            'train_score': train_score,
            'test_score': test_score,
            'overfitting': train_score - test_score
        }
        return self

    def cross_validate(self, cv=5):
        """
        K-fold cross-validation.
        """
        cv_scores = cross_val_score(self.pipeline, self.X, self.y, cv=cv)
        self.results['cv_mean'] = cv_scores.mean()
        self.results['cv_std'] = cv_scores.std()
        return self
    def save_model(self, filepath):
        """
        Persist the fitted pipeline.
        """
        joblib.dump(self.pipeline, filepath)

    def load_model(self, filepath):
        """
        Load a persisted pipeline.
        """
        self.pipeline = joblib.load(filepath)
        return self
    def get_feature_importance(self):
        """
        Feature importances, mapped back to feature names.
        """
        if not hasattr(self.pipeline, 'named_steps'):
            return None
        preprocessor = self.pipeline.named_steps['preprocessor']
        classifier = self.pipeline.named_steps['classifier']
        # numeric feature names pass through unchanged
        numeric_features = preprocessor.transformers_[0][2]
        # categorical names after one-hot encoding
        cat_transformer = preprocessor.transformers_[1][1]
        cat_features = cat_transformer.named_steps['onehot'].get_feature_names_out()
        all_features = np.concatenate([numeric_features, cat_features])
        importance = pd.DataFrame({
            'feature': all_features,
            'importance': classifier.feature_importances_
        }).sort_values('importance', ascending=False)
        return importance
```
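The class above is defined but never exercised; a minimal usage sketch on synthetic data (the feature names and sizes are invented for illustration):
```python
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'age': rng.integers(18, 70, 500),
    'income': rng.normal(50000, 15000, 500),
    'channel': rng.choice(['web', 'mobile', 'store'], 500),
})
y = pd.Series(rng.integers(0, 2, 500))

exp = (MLExperiment(X, y)
       .build_pipeline(numeric_features=['age', 'income'],
                       categorical_features=['channel'])
       .train(test_size=0.2)
       .cross_validate(cv=5))
print(exp.results)
print(exp.get_feature_importance().head())
```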
## V. Data Engineering Infrastructure
### 5.1 Data Version Control
```python
import hashlib
import json
import pandas as pd
from datetime import datetime
from pathlib import Path

class DataVersioning:
    """
    A lightweight data version control system.
    """
    def __init__(self, base_path='./data_versions'):
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)

    def get_data_hash(self, df):
        """
        Compute a content hash for a DataFrame.
        """
        df_json = df.to_json().encode()
        return hashlib.sha256(df_json).hexdigest()
    def save_version(self, df, version_name, metadata=None):
        """
        Save one version of the data.
        """
        version_hash = self.get_data_hash(df)
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        # create the version directory
        version_dir = self.base_path / f"{timestamp}_{version_name}"
        version_dir.mkdir(exist_ok=True)
        # save the data itself
        filepath = version_dir / 'data.parquet'
        df.to_parquet(filepath)
        # save the metadata
        version_info = {
            'version_name': version_name,
            'timestamp': timestamp,
            'hash': version_hash,
            'shape': df.shape,
            'columns': df.columns.tolist(),
            'metadata': metadata or {}
        }
        with open(version_dir / 'version_info.json', 'w') as f:
            json.dump(version_info, f, indent=2)
        return version_dir
    def load_version(self, version_dir):
        """
        Load a saved version.
        """
        version_path = Path(version_dir)
        filepath = version_path / 'data.parquet'
        df = pd.read_parquet(filepath)
        # load the metadata
        with open(version_path / 'version_info.json', 'r') as f:
            version_info = json.load(f)
        return df, version_info

    def list_versions(self):
        """
        List every saved version.
        """
        versions = []
        for version_dir in sorted(self.base_path.iterdir()):
            if version_dir.is_dir():
                info_file = version_dir / 'version_info.json'
                if info_file.exists():
                    with open(info_file, 'r') as f:
                        info = json.load(f)
                    versions.append(info)
        return versions
```
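A minimal usage sketch (toy frame; directories are created under ./data_versions, and Parquet support via pyarrow or fastparquet is assumed):
```python
dv = DataVersioning()
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
version_dir = dv.save_version(df, 'raw_snapshot', metadata={'source': 'demo'})

restored, info = dv.load_version(version_dir)
print(info['hash'] == dv.get_data_hash(restored))  # True for this toy frame: content unchanged
for v in dv.list_versions():
    print(v['timestamp'], v['version_name'], v['shape'])
```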