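Architecture Evolution: From the Data Warehouse to the Lakehouse
How Data Architecture Evolved
Modern data architecture has gone through four generations, each one addressing the core pain points of the one before:
Generation 1: traditional data warehouse (Teradata / Oracle)
→ Problems: expensive, hard to scale, structured data only
Generation 2: big data platform (Hadoop + Hive)
→ Problems: complex Lambda architecture, separate real-time and batch stacks, consistency between them hard to guarantee
Generation 3: data lake (S3 + Parquet + Spark)
→ Problems: no transactions, schema drift, uncontrolled data quality, the "data swamp"
Generation 4: lakehouse (data lake + data warehouse + AI-native)
→ Solves: ACID transactions, schema enforcement, unified streaming and batch, shared access for AI and BI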
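┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Architecture generations compared                                                                  │
├──────────────┬────────────────┬───────────────────┬────────────────┬───────────────────────────────┤
│              │ Traditional DW │ Big data platform │ Data lake      │ Lakehouse                     │
├──────────────┼────────────────┼───────────────────┼────────────────┼───────────────────────────────┤
│ Storage base │ Proprietary    │ HDFS              │ Object storage │ Object storage + table format │
│ Data model   │ Strict schema  │ Hive Metastore    │ Unconstrained  │ Schema + transactions         │
│ Transactions │ ACID           │ Weak              │ None           │ ACID                          │
│ Real-time    │ Weak           │ Lambda            │ Weak           │ Unified stream and batch      │
│ AI support   │ Weak           │ Moderate          │ Good           │ Native                        │
│ Cost         │ Very high      │ High              │ Low            │ Low                           │
│ Data quality │ High           │ Medium            │ Low (swamp)    │ High                          │
└──────────────┴────────────────┴───────────────────┴────────────────┴───────────────────────────────┘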
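Why Does a Data Lake Turn into a "Data Swamp"?
The core promise of a data lake is "store first, analyze later", but without governance that promise leads to:
- The schema-on-read trap: nothing is enforced at write time, so missing fields and inconsistent types surface only when the data is read
- No transactional guarantees: concurrent writes cause dirty and phantom reads, and an ETL job that fails midway leaves half-written data behind
- Missing metadata: data lands in the lake but cannot be found or understood, and lineage cannot be traced
- No time travel: the history of data changes cannot be replayed, which makes compliance audits difficult
- Split streaming and batch: real-time and batch data follow different paths and produce inconsistent results
The lakehouse emerged to keep the openness and low cost of the data lake while adding the transactional and governance capabilities of the data warehouse.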
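To make the schema-on-read trap concrete, the following is a minimal sketch of the opposite policy, schema-on-write, in plain Python. The schema, the `validate_record` and `write_batch` helpers, and the column names are all hypothetical and chosen only for illustration; real table formats enforce schemas inside the write path of the engine.

```python
from typing import Any

# Hypothetical schema for illustration: column name -> required Python type.
EVENT_SCHEMA: dict[str, type] = {
    "user_id": str,
    "action": str,
    "price": float,
}


def validate_record(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for column, expected in schema.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            errors.append(
                f"{column}: expected {expected.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    for column in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected column: {column}")
    return errors


def write_batch(records: list[dict[str, Any]], schema: dict[str, type]) -> list[dict[str, Any]]:
    """Schema-on-write: reject the whole batch if any record is invalid."""
    problems = {}
    for index, record in enumerate(records):
        errors = validate_record(record, schema)
        if errors:
            problems[index] = errors
    if problems:
        raise ValueError(f"batch rejected: {problems}")
    return list(records)  # stand-in for the actual write to storage
```

With this policy a bad record fails loudly at ingestion time instead of silently corrupting downstream reads, which is the behavior the table formats discussed next provide natively.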
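The Big Three Table Formats: Delta Lake / Apache Iceberg / Apache Hudi
Delta Lake, Apache Iceberg, and Apache Hudi are the three major open-source table formats used to build a lakehouse. They provide ACID transactions, schema evolution, and time travel on top of object storage, and they are core infrastructure for modern data architectures.
Positioning and Origins
| Dimension | Delta Lake | Apache Iceberg | Apache Hudi |
| --- | --- | --- | --- |
| Originator | Databricks | Netflix → Apache | Uber → Apache |
| Core use case | Unified streaming and batch + deep Spark integration | Cross-engine interoperability + very large tables | Incremental processing + CDC/UPSERT |
| Design philosophy | Simple to use, Spark-native | Open standard, engine-agnostic | Incremental first, near-real-time updates |
| Community | Led by Databricks, hosted by the Linux Foundation | Very active, backed by multiple vendors | Apache community, strong in incremental workloads |
| Metadata management | Transaction log (Delta Log) | Hierarchy of manifest files | Timeline + metadata table |
| Concurrency model | Optimistic concurrency + MVCC | Optimistic concurrency + MVCC | MVCC + Timeline |
| Query engines | Mainly Spark, gradually expanding | Spark/Flink/Trino/Doris | Spark/Flink/Trino/Presto |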
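Core Architectures Compared in Depth
Delta Lake: Transaction Log Architecture
/delta/user-events/
├── _delta_log/
│ ├── 00000000000000000000.json ← version 0: initial commit
│ ├── 00000000000000000001.json ← version 1: appended data
│ ├── 00000000000000000002.json ← version 2: update operation
│ ├── 00000000000000000002.checkpoint.parquet ← checkpoint
│ └── ...
├── part-00000-xxx.snappy.parquet ← data file
├── part-00001-xxx.snappy.parquet
└── ...
Core mechanisms:
- Every write produces one JSON transaction log file, numbered sequentially
- The log records actions such as AddFile / RemoveFile / Metadata / Protocol
- A reader rebuilds table state from the latest checkpoint plus the log entries after it, which gives snapshot isolation
- OPTIMIZE (small-file compaction) and Z-ORDER (data-skipping optimization) are supported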
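The log-replay mechanism described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Delta Lake's implementation: real log entries carry many more fields (sizes, statistics, partition values), and checkpoints, metadata, and protocol actions are ignored here. The `replay_log` function and the toy commits are hypothetical.

```python
import json


def replay_log(commits: list[str]) -> set[str]:
    """Rebuild the set of live data files by replaying commits in version order.

    Each commit is newline-delimited JSON; only `add` and `remove` actions
    change which files belong to the table snapshot.
    """
    live_files: set[str] = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live_files.add(action["add"]["path"])
            elif "remove" in action:
                live_files.discard(action["remove"]["path"])
    return live_files


# Three toy commits: initial write, append, then a rewrite of part-00000.
commits = [
    '{"add": {"path": "part-00000.parquet"}}',
    '{"add": {"path": "part-00001.parquet"}}',
    '{"remove": {"path": "part-00000.parquet"}}\n'
    '{"add": {"path": "part-00002.parquet"}}',
]

snapshot_v1 = replay_log(commits[:2])  # table state as of version 1
snapshot_v2 = replay_log(commits)      # table state as of version 2
```

Replaying only a prefix of the log is also what makes time travel cheap: an older version is simply the state after fewer commits, and the removed files remain on storage until they are vacuumed.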
# Advanced Delta Lake usage
from delta import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("DeltaAdvanced") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
# === CDF (Change Data Feed): capture row-level changes ===
spark.sql("""
    CREATE TABLE user_events (
        user_id STRING, action STRING, item_id STRING,
        price DOUBLE, event_time TIMESTAMP
    ) USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
# Consume the change feed starting from table version 10
changes = spark.read.format("delta") \
    .option("readChangeDataFeed", "true") \
    .option("startingVersion", 10) \
    .table("user_events")
# _change_type: insert / update_preimage / update_postimage / delete
# === Z-ORDER: co-locate data to improve data skipping ===
DeltaTable.forPath(spark, "/delta/user-events") \
    .optimize() \
    .executeZOrderBy("user_id", "event_time")
# === Time travel + data auditing ===
df_v5 = spark.read.format("delta") \
    .option("versionAsOf", 5).load("/delta/user-events")
df_ts = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01 00:00:00").load("/delta/user-events")
# === Schema evolution ===
spark.sql("""
    ALTER TABLE user_events ADD COLUMNS (
        device_type STRING,
        session_id STRING
    )
""")
# === Liquid Clustering (Delta 3.0+) ===
# Replaces Z-ORDER and supports incremental clustering
spark.sql("""
    CREATE TABLE events_clustered (
        event_id STRING, user_id STRING, event_time TIMESTAMP, data MAP<STRING, STRING>
    ) USING DELTA
    CLUSTER BY (user_id, event_time)
""")
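Apache Iceberg: Layered Metadata Architecture
/warehouse/user-events/
├── metadata/
│ ├── v1.metadata.json ← table metadata, version 1
│ ├── v2.metadata.json ← table metadata, version 2
│ ├── snap-xxx.avro ← manifest list of a snapshot
│ ├── xxx-m0.avro ← manifest file
│ └── ...
├── data/
│ ├── event_time=user_id=xxx/
│ │ ├── part-00000.parquet
│ │ └── ...
│ └── ...
Core mechanisms:
- Three metadata layers above the data files: metadata file → manifest list → manifest file → data file
- Snapshot isolation: each snapshot points to one manifest list, so reads are unaffected by concurrent writes
- Partition evolution: the partitioning scheme can change without rewriting existing data
- Hidden partitioning: partitioning is transparent to users, which avoids common partitioning pitfalls
- Multi-engine agreement: Spark, Flink, Trino, Doris, and others share the same metadata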
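The layered lookup described above can be sketched with a few dataclasses. This is a toy model under stated assumptions, not Iceberg's file layout: the `DataFile`, `Manifest`, and `Snapshot` classes and the `plan_scan` function are hypothetical, and partition values are plain date strings. It illustrates two ideas from the list: a snapshot is just a pointer to a list of manifests, and per-manifest partition ranges let a planner skip whole manifests.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    partition: str  # e.g. "2024-01-01"


@dataclass
class Manifest:
    files: list[DataFile]
    # Partition range summary, so a whole manifest can be skipped without reading it.
    min_partition: str
    max_partition: str


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list[Manifest] = field(default_factory=list)


def plan_scan(snapshot: Snapshot, partition: str) -> list[str]:
    """Walk snapshot -> manifest list -> manifests -> data files, pruning by partition."""
    selected: list[str] = []
    for manifest in snapshot.manifest_list:
        if not (manifest.min_partition <= partition <= manifest.max_partition):
            continue  # pruned using the summary alone
        selected.extend(f.path for f in manifest.files if f.partition == partition)
    return selected


m1 = Manifest(
    [DataFile("a.parquet", "2024-01-01"), DataFile("b.parquet", "2024-01-02")],
    min_partition="2024-01-01",
    max_partition="2024-01-02",
)
m2 = Manifest([DataFile("c.parquet", "2024-02-01")], "2024-02-01", "2024-02-01")

s1 = Snapshot(1, [m1])
s2 = Snapshot(2, [m1, m2])  # the new snapshot reuses m1 and only adds m2
```

A reader holding `s1` keeps seeing the same files even after `s2` is committed, which is the snapshot-isolation property in miniature.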
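# Advanced Iceberg usage (PyIceberg)
from pyiceberg.catalog import load_catalog
from pyiceberg.transforms import HourTransform
catalog = load_catalog("production", **{
    "type": "rest",
    "uri": "http://iceberg-rest:8181",
    "warehouse": "s3://warehouse"
})
table = catalog.load_table("analytics.user_events")
# === Partition evolution: no rewrite of historical data ===
with table.update_spec() as update:
    update.add_field("event_time", HourTransform(), "event_time_hour")
# Old data keeps the old partition layout, new data uses the new one, and queries span both
# === Snapshot management ===
for snapshot in table.snapshots():
    # File and record counts live in the snapshot summary (for example "total-data-files", "total-records")
    print(f"Snapshot {snapshot.snapshot_id}: {snapshot.timestamp_ms}, summary={snapshot.summary}")
# === Branches === Iceberg 1.4+
table.manage_snapshots().create_branch(snapshot_id=12345, branch_name="audit-branch").commit()
# NOTE: the calls below are kept from the original sketch as illustrative pseudo-API.
# Rollback, incremental scans, snapshot expiry, and orphan-file cleanup are not
# guaranteed to exist under these names in PyIceberg; check the release you run,
# or use the Spark procedures (rollback_to_snapshot, expire_snapshots, remove_orphan_files).
# Roll back to a given snapshot
table.manage_snapshots().rollback_to(snapshot_id=1234567890).commit()
# === Incremental read (incremental scan) ===
incremental_scan = table.scan_incremental(
    from_snapshot_id=12345,
    to_snapshot_id=12350
)
# === Metadata cleanup ===
table.expire_snapshots().older_than(7 * 24 * 3600 * 1000).commit()
table.remove_orphan_files().older_than(3 * 24 * 3600 * 1000).commit()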
-- Iceberg SQL operations (Spark SQL syntax; Trino and Flink expose the same features with different syntax)
-- Show snapshot history
SELECT * FROM analytics.user_events.snapshots;
-- Show data files
SELECT * FROM analytics.user_events.files;
-- Show partitions
SELECT * FROM analytics.user_events.partitions;
-- Time travel
SELECT * FROM analytics.user_events FOR SYSTEM_TIME AS OF TIMESTAMP '2024-06-01 00:00:00';
SELECT * FROM analytics.user_events VERSION AS OF 1234567890;
-- Rollback (a Spark procedure; prefix it with the catalog name if `system` is not resolved in the current catalog)
CALL system.rollback_to_snapshot('analytics.user_events', 1234567890);
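Apache Hudi: Timeline Architecture
/warehouse/user-events/
├── .hoodie/
│ ├── 20240101120000.commit ← completed commit
│ ├── 20240101130000.deltacommit ← delta commit (Merge on Read)
│ ├── 20240101130000.commit.requested
│ ├── 20240101130000.inflight
│ ├── .aux/ ← auxiliary metadata
│ └── timeline_metadata/
├── user_id=xxx/
│ ├── basefile_1.parquet ← base file (columnar)
│ └── log_file_1.log ← log file (row-based increments)
└── ...
Core mechanisms:
- Timeline: every operation is placed on a timeline, and each instant moves through three states: requested → inflight → completed
- Two table types:
- Copy on Write (CoW): files are rewritten on each write; suited to batch workloads
- Merge on Read (MoR): changes are appended to log files and merged at read time; suited to near-real-time updates
- Primary-key indexing: Bloom filter / HBase / Simple indexes enable efficient UPSERT
- Asynchronous compaction: log files of a MoR table are merged into base files in the background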
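The Merge on Read behavior described above can be illustrated in plain Python. This is a sketch under simplifying assumptions: records are dictionaries, the record key and precombine field are passed by name, and deletes are ignored. The `merge_on_read` and `compact` functions are hypothetical helpers, not Hudi APIs.

```python
def merge_on_read(base_records, log_records, key="user_id", precombine="event_time"):
    """Merge a base file with log-file updates, keeping the latest record per key."""
    merged = {rec[key]: rec for rec in base_records}
    for rec in log_records:
        current = merged.get(rec[key])
        # The precombine field decides which version wins when keys collide.
        if current is None or rec[precombine] >= current[precombine]:
            merged[rec[key]] = rec
    return sorted(merged.values(), key=lambda r: r[key])


def compact(base_records, log_records, **kwargs):
    """Compaction: fold the log into a new base file and start over with an empty log."""
    return merge_on_read(base_records, log_records, **kwargs), []


base = [
    {"user_id": "u1", "event_time": 100, "action": "click"},
    {"user_id": "u2", "event_time": 100, "action": "view"},
]
log = [
    {"user_id": "u1", "event_time": 120, "action": "purchase"},  # update
    {"user_id": "u3", "event_time": 110, "action": "click"},     # insert
    {"user_id": "u2", "event_time": 90, "action": "stale"},      # older than base, ignored
]

snapshot = merge_on_read(base, log)     # what a snapshot query would see
new_base, new_log = compact(base, log)  # what compaction would leave behind
```

The trade-off is visible even at this scale: writes only append to `log`, so they are cheap, while every read pays for the merge until compaction folds the log back into the base file.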
# Advanced Hudi usage
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("HudiAdvanced") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()
# === MoR table: near-real-time UPSERT ===
hudi_options = {
    "hoodie.table.name": "user_events",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "event_time",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
    "hoodie.upsert.shuffle.parallelism": 200,
    "hoodie.compact.inline": "false",
    "hoodie.compact.schedule.inline": "true",
    "hoodie.compact.inline.max.delta.commits": 5,
}
# df: a DataFrame of events with the columns referenced above, assumed to be loaded earlier
df.write.format("hudi").options(**hudi_options).mode("append").save("/warehouse/user-events")
# === Incremental query ===
incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101120000") \
    .load("/warehouse/user-events")
# === Timeline operations ===
spark.sql("""
    CALL show_commits_metadata(
        table => 'analytics.user_events',
        limit => 10
    )
""")
# Schedule a compaction manually
spark.sql("""
    CALL run_compaction(
        table => 'analytics.user_events',
        op => 'schedule'
    )
""")
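Choosing Among the Three
                 What is your core requirement?
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
Cross-engine interop Deep Spark integration  Incremental updates/CDC
Multi-tenant sharing   Stream/batch first     Near-real-time UPSERT
        │                       │                       │
        ▼                       ▼                       ▼
     Iceberg               Delta Lake                 Hudi
        │                       │                       │
   ┌────┴────┐             ┌────┴─────┐            ┌────┴────┐
   │Trino    │             │Databricks│            │Flink CDC│
   │Doris    │             │Synapse   │            │Debezium │
   │StarRocks│             │Spark     │            │Kafka    │
   │Spark    │             │          │            │Spark    │
   │Flink    │             │          │            │         │
   └─────────┘             └──────────┘            └─────────┘
| Decision factor | Choose Delta Lake | Choose Iceberg | Choose Hudi |
| --- | --- | --- | --- |
| Primary engine | Spark | Mix of engines | Spark + Flink |
| Update pattern | Batch + micro-batch | Batch appends + batch overwrites | High-frequency UPSERT + CDC |
| Latency requirement | Minutes | Minutes | Seconds to minutes |
| Table size | Medium to large | Very large (proven at PB scale) | Medium to large |
| Community orientation | Databricks ecosystem | Open and vendor-neutral | Incremental workloads |
| Governance needs | Moderate | Strong (audit/compliance) | Moderate |
| Team skills | Proficient in Spark | Multi-engine team | Experience with incremental processing |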
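Data Ingestion
| Tool | Type | Use case | Characteristics |
| --- | --- | --- | --- |
| Apache Flume | Log collection | Collecting high-volume log data | Highly reliable, scalable |
| Apache Sqoop | Data migration | Moving data between relational databases and HDFS | Bulk import and export |
| Apache Kafka | Message queue | Real-time data stream pipelines | High throughput, low latency |
| Apache NiFi | Data flow management | Visual data routing and transformation | Drag-and-drop configuration |
| Logstash | Log processing | Data collection for the ELK stack | Rich plugin ecosystem |
| Canal | Data synchronization | Real-time MySQL binlog replication | Incremental change capture |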
Kafka Core Example
from kafka import KafkaProducer, KafkaConsumer
import json
# Producer: send data
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
# Send a user behavior event
event = {
    "user_id": "u_10086",
    "action": "click",
    "item_id": "item_2048",
    "timestamp": "2024-01-15T10:30:00Z",
    "platform": "mobile"
}
producer.send('user-events', value=event)
producer.flush()
# Consumer: process data in real time
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='ai-processor',
    auto_offset_reset='latest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
for message in consumer:
    event = message.value
    print(f"Received event: {event['action']} - user: {event['user_id']}")
    # Feed the event to an AI model for real-time inference
    # prediction = model.predict(extract_features(event))
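Data Storage
How Storage Architecture Evolved
Traditional data warehouse → data lake → lakehouse
| Storage option | Representative technology | Use case | Characteristics |
| --- | --- | --- | --- |
| Distributed file system | HDFS | Batch processing of large files | High fault tolerance, high throughput |
| Object storage | AWS S3 / MinIO | Cloud-native data storage | Elastic scaling, low cost |
| NoSQL database | HBase / Cassandra | Massive random reads and writes | High concurrency, low latency |
| Data warehouse | Hive / ClickHouse | OLAP analytics | SQL support, columnar storage |
| Data lake | Delta Lake / Iceberg | Unified streaming and batch storage | ACID transactions, schema evolution |
| Lakehouse | Databricks / StarRocks | Combined AI + BI | Unified storage, multi-engine access |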
Delta Lake Example (Data Lake Option)
from delta import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("BigData-AI-Pipeline") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()
# Write data to Delta Lake
df = spark.read.json("s3a://data-lake/raw/user-events/")
df.write.format("delta").mode("append").save("/delta/user-events")
# ACID transactional update
delta_table = DeltaTable.forPath(spark, "/delta/user-events")
delta_table.update(
    condition="status = 'pending'",
    set={"status": "'processed'", "updated_at": "current_timestamp()"}
)
# Time travel: query a historical version
df_version_5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/delta/user-events")
# Unified streaming and batch: read the same table as a stream
streaming_df = spark.readStream.format("delta") \
    .load("/delta/user-events") \
    .groupBy("action") \
    .count()
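Data Processing
Batch vs Stream Processing
| Dimension | Batch processing (Spark) | Stream processing (Flink) |
| --- | --- | --- |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high | High |
| Result completeness | Exact over the full dataset | Approximate (exact within each window) |
| Typical use cases | ETL, reporting, model training | Real-time monitoring, real-time recommendation |
| Fault tolerance | Re-execution | Checkpoints + savepoints |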
Spark Batch Processing Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, countDistinct, avg, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
spark = SparkSession.builder \
    .appName("UserBehaviorAnalytics") \
    .getOrCreate()
# Read the data
schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("action", StringType(), True),
    StructField("item_id", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("timestamp", TimestampType(), True)
])
df = spark.read.schema(schema).parquet("s3a://data-lake/processed/events/")
# Per-user behavior statistics
user_stats = df.groupBy("user_id").agg(
    count("*").alias("total_actions"),
    countDistinct("item_id").alias("unique_items"),
    avg("price").alias("avg_price")
).filter(col("total_actions") > 10)
# Time-window analysis
window_stats = df.groupBy(
    window(col("timestamp"), "1 hour"),
    col("action")
).count().orderBy("window", "count")
# Write the results for use by AI models
user_stats.write.mode("overwrite").parquet("s3a://data-lake/ml-features/user-stats/")
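The hourly `window(col("timestamp"), "1 hour")` grouping above is a tumbling window: fixed-size, non-overlapping, and aligned to the epoch. As a rough illustration of what the engine computes, here is a plain-Python sketch. The `window_start` and `count_by_window` helpers and the millisecond timestamps are assumptions made for the example; it ignores time zones, watermarks, and late data.

```python
from collections import Counter

HOUR_MS = 3_600_000  # one-hour tumbling window, in milliseconds


def window_start(ts_ms: int, size_ms: int = HOUR_MS) -> int:
    """Each timestamp falls into exactly one window: [start, start + size)."""
    return ts_ms - (ts_ms % size_ms)


def count_by_window(events, size_ms: int = HOUR_MS) -> dict:
    """Count events per (window start, action), like groupBy(window, action).count()."""
    counts = Counter()
    for event in events:
        counts[(window_start(event["timestamp"], size_ms), event["action"])] += 1
    return dict(counts)


events = [
    {"timestamp": 1_000, "action": "click"},
    {"timestamp": 3_599_999, "action": "purchase"},  # last millisecond of the first hour
    {"timestamp": 3_600_000, "action": "purchase"},  # first millisecond of the second hour
    {"timestamp": 3_700_000, "action": "purchase"},
]

result = count_by_window(events)
```

The same assignment rule underlies the one-minute `TUMBLE` windows in the streaming example; the difference in a stream is that a window's result can only be emitted once the engine decides no more events for that window will arrive.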
Flink Real-Time Stream Processing Example
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream import CheckpointingMode
from pyflink.table import StreamTableEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60000, CheckpointingMode.EXACTLY_ONCE)
t_env = StreamTableEnvironment.create(env)
# Define the Kafka source table
t_env.execute_sql("""
    CREATE TABLE user_events (
        user_id STRING,
        action STRING,
        item_id STRING,
        price DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'user-events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-processor',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")
# Define the JDBC sink table for the real-time window aggregates
t_env.execute_sql("""
    CREATE TABLE realtime_stats (
        window_start TIMESTAMP(3),
        window_end TIMESTAMP(3),
        action STRING,
        event_count BIGINT,
        total_revenue DOUBLE
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://localhost:3306/analytics',
        'table-name' = 'realtime_stats',
        'username' = 'root',
        'password' = 'password'
    )
""").wait()
# One-minute tumbling-window aggregation, written continuously to the sink
t_env.execute_sql("""
    INSERT INTO realtime_stats
    SELECT
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end,
        action,
        COUNT(*) AS event_count,
        SUM(price) AS total_revenue
    FROM user_events
    GROUP BY
        TUMBLE(event_time, INTERVAL '1' MINUTE),
        action
""").wait()
Core AI Techniques
Machine Learning Fundamentals
Scikit-learn: Classical Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
# Load feature data from big data storage
df = pd.read_parquet("s3://data-lake/ml-features/user-churn/")
# Feature selection
features = ['login_freq', 'avg_session_duration', 'purchase_count',
            'avg_order_value', 'days_since_last_login', 'complaint_count']
X = df[features]
y = df['churn_label']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Build the model pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    ))
])
# Cross-validated evaluation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Cross-validation AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"Test AUC: {roc_auc_score(y_test, y_prob):.4f}")
# Feature importance analysis
importances = pipeline.named_steps['classifier'].feature_importances_
for feat, imp in sorted(zip(features, importances), key=lambda x: -x[1]):
    print(f"  {feat}: {imp:.4f}")
Spark MLlib: Distributed Machine Learning
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Read large-scale feature data (spark: an existing SparkSession)
df = spark.read.parquet("s3a://data-lake/ml-features/user-churn/")
# Assemble the features into a vector
assembler = VectorAssembler(
    inputCols=['login_freq', 'avg_session_duration', 'purchase_count',
               'avg_order_value', 'days_since_last_login', 'complaint_count'],
    outputCol='features_vec'
)
# Feature scaling
scaler = StandardScaler(
    inputCol='features_vec',
    outputCol='scaled_features'
)
# Label encoding
indexer = StringIndexer(inputCol='churn_label', outputCol='label')
# GBT classifier
gbt = GBTClassifier(
    featuresCol='scaled_features',
    labelCol='label',
    maxIter=100,
    maxDepth=5,
    stepSize=0.1
)
# Build the pipeline
pipeline = Pipeline(stages=[assembler, scaler, indexer, gbt])
# Train
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)
# Evaluate
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')
auc = evaluator.evaluate(predictions)
print(f"Test AUC: {auc:.4f}")
# Batch-score all users and write the results back to the data lake
all_predictions = model.transform(df)
all_predictions.select("user_id", "prediction", "probability") \
    .write.mode("overwrite") \
    .parquet("s3a://data-lake/predictions/churn/")
Deep Learning
PyTorch: User Behavior Sequence Modeling
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
# Custom dataset: user behavior sequences
class UserBehaviorDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.LongTensor(sequences)
        self.labels = torch.FloatTensor(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]
# LSTM behavior prediction model
class BehaviorPredictor(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3,
            bidirectional=True
        )
        self.attention = nn.Linear(hidden_dim * 2, 1)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )
    def forward(self, x):
        embeds = self.embedding(x)                                     # (B, L, E)
        lstm_out, _ = self.lstm(embeds)                                # (B, L, H*2)
        attn_weights = torch.softmax(self.attention(lstm_out), dim=1)  # (B, L, 1)
        context = torch.sum(attn_weights * lstm_out, dim=1)            # (B, H*2)
        output = self.classifier(context)                              # (B, 1)
        return output.squeeze(-1)
# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BehaviorPredictor(vocab_size=50000, embed_dim=128, hidden_dim=256).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
# train_sequences / train_labels: padded item-ID sequences and 0/1 labels, assumed to be prepared earlier
train_dataset = UserBehaviorDataset(train_sequences, train_labels)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=4)
for epoch in range(10):
    model.train()
    total_loss = 0
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad()
        preds = model(batch_x)
        loss = criterion(preds, batch_y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
    scheduler.step()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/10 - Loss: {avg_loss:.4f}")
# Transformer encoder for time-series forecasting
import torch
import torch.nn as nn
import math
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
class TimeSeriesTransformer(nn.Module):
    def __init__(self, input_dim, d_model=128, nhead=8, num_layers=4, pred_len=24):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, d_model)
        self.pos_enc = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=512,
            dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, pred_len)
    def forward(self, x):
        # x: (B, L, input_dim)
        x = self.input_proj(x)
        x = self.pos_enc(x)
        x = self.transformer(x)
        x = x[:, -1, :]            # keep the last time step
        out = self.output_proj(x)  # (B, pred_len)
        return out
# Usage example: forecast traffic for the next 24 hours
model = TimeSeriesTransformer(input_dim=10, d_model=128, nhead=8, num_layers=4, pred_len=24)
dummy_input = torch.randn(32, 168, 10)  # 32 samples, 168 hours of history, 10 features
output = model(dummy_input)
print(f"Prediction output shape: {output.shape}")  # (32, 24)
Large Language Model (LLM) Integration
RAG: Retrieval-Augmented Generation
# These are the legacy LangChain import paths; newer releases move them to
# the langchain_community and langchain_openai packages.
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader, TextLoader
# 1. Load and split the documents
documents = []
for pdf_path in ["data/report_q1.pdf", "data/report_q2.pdf", "data/report_q3.pdf"]:
    loader = PyPDFLoader(pdf_path)
    documents.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""]
)
chunks = text_splitter.split_documents(documents)
# 2. Embed the chunks and build the index
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh",
    model_kwargs={"device": "cuda"}
)
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index")
# 3. Build the RAG chain (gpt-4 is a chat model, so it needs the chat wrapper rather than the completion-style OpenAI class)
llm = ChatOpenAI(model_name="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)
# 4. Query
result = qa_chain({"query": "What was the user growth trend in Q3, and what were the main drivers?"})
print(f"Answer: {result['result']}")
print(f"Source documents: {[doc.metadata for doc in result['source_documents']]}")
LLM Fine-Tuning
from transformers import (AutoTokenizer, AutoModelForCausalLM, TrainingArguments,
                          Trainer, DataCollatorForLanguageModeling)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
# Load the base model
model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
# LoRA configuration (parameter-efficient fine-tuning)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable and total parameter counts; with LoRA only a small fraction of the weights is trained
# Data preprocessing
dataset = load_dataset("json", data_files="data/finetune_data.jsonl")
def tokenize_function(examples):
    prompts = [f"### Instruction:\n{inst}\n\n### Response:\n{out}"
               for inst, out in zip(examples["instruction"], examples["output"])]
    return tokenizer(prompts, truncation=True, max_length=2048, padding="max_length")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# The collator copies input_ids into labels so the Trainer can compute the causal LM loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
# Training configuration
training_args = TrainingArguments(
    output_dir="./qwen-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    report_to="wandb"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
    tokenizer=tokenizer
)
trainer.train()
model.save_pretrained("./qwen-finetuned")
tokenizer.save_pretrained("./qwen-finetuned")
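Combined Big Data + AI Architecture
Lambda Architecture vs Kappa Architecture
| Dimension | Lambda architecture | Kappa architecture |
| --- | --- | --- |
| Core idea | Batch layer + speed layer + serving layer | A single stream-processing path |
| Latency | The batch layer adds delay | Low latency end to end |
| Complexity | Two codebases to maintain | One stream-processing codebase |
| Consistency | Eventually consistent across the two layers | Consistent by construction (one computation path) |
| Use cases | Complex reporting plus real-time needs | Strict real-time requirements |
| Representative technology | Hadoop + Storm + HBase | Flink + Kafka |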
The Modern Lakehouse Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                        Lakehouse architecture                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │ BI reports  │  │ Data science│  │    AI/ML    │  │Real-time app│  │
│  │    (SQL)    │  │  (Python)   │  │(train/infer)│  │    (API)    │  │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  │
│         │                │                │                │         │
│  ┌──────┴────────────────┴────────────────┴────────────────┴──────┐  │
│  │              Unified metadata & governance layer               │  │
│  │                (Unity Catalog / Hive Metastore)                │  │
│  └───────────────────────────────┬────────────────────────────────┘  │
│                                  │                                   │
│  ┌───────────────────────────────┴────────────────────────────────┐  │
│  │          Open table formats (Delta / Iceberg / Hudi)           │  │
│  │ ACID | Schema evolution | Time travel | Incremental processing │  │
│  └───────────────────────────────┬────────────────────────────────┘  │
│                                  │                                   │
│  ┌───────────────────────────────┴────────────────────────────────┐  │
│  │             Cloud storage (S3 / OSS / ADLS / HDFS)             │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
End-to-End AI Data Pipeline
"""
A complete big data + AI pipeline example
Ingestion → feature engineering → model training → model deployment → online inference
"""
# ========== Step 1: Ingest data into the lake ==========
from pyspark.sql import SparkSession
# Explicit imports; max and sum are aliased so they do not shadow the Python built-ins
from pyspark.sql.functions import avg, col, count, countDistinct, from_json, when, window
from pyspark.sql.functions import max as spark_max, sum as spark_sum
spark = SparkSession.builder.appName("AI-Pipeline").getOrCreate()
# Read real-time data from Kafka
streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "user-events") \
    .load()
# Parse and write to Delta Lake
# schema: the event StructType defined in the Spark batch example (user_id, action, item_id, price, timestamp)
parsed_df = streaming_df.select(
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")
query = parsed_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoint/events") \
    .outputMode("append") \
    .start("/delta/user-events")
# ========== Step 2: Feature engineering ==========
events_df = spark.read.format("delta").load("/delta/user-events")
# Per-user aggregate features
user_features = events_df.groupBy("user_id").agg(
    count("*").alias("total_events"),
    countDistinct("item_id").alias("unique_items"),
    avg("price").alias("avg_price"),
    spark_max("timestamp").alias("last_active"),
    count(when(col("action") == "purchase", True)).alias("purchase_count"),
    count(when(col("action") == "click", True)).alias("click_count"),
    (count(when(col("action") == "purchase", True)) /
     count(when(col("action") == "click", True))).alias("conversion_rate")
)
# Time-window features
window_features = events_df.groupBy(
    "user_id",
    window("timestamp", "7 days")
).agg(
    count("*").alias("weekly_events"),
    spark_sum("price").alias("weekly_spending")
)
# Save the features to the feature store
user_features.write.mode("overwrite") \
    .format("delta") \
    .save("/delta/ml-features/user-churn/")
# ========== Step 3: Model training ==========
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
import pandas as pd
# Read from the feature store
features_df = spark.read.format("delta").load("/delta/ml-features/user-churn/")
features_pd = features_df.toPandas()
# Assumes a churn_label column has been joined onto the feature table; the aggregation in Step 2 does not create it.
# last_active is a timestamp and is dropped; a null conversion_rate (no clicks) is filled with 0.
X = features_pd.drop(columns=['user_id', 'churn_label', 'last_active']).fillna(0)
y = features_pd['churn_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# MLflow experiment tracking
mlflow.set_experiment("user-churn-prediction")
with mlflow.start_run():
    model = GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=5
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_prob)
    f1 = f1_score(y_test, y_pred)
    # Log parameters and metrics
    mlflow.log_params({"n_estimators": 200, "learning_rate": 0.1, "max_depth": 5})
    mlflow.log_metrics({"auc": auc, "f1": f1})
    mlflow.sklearn.log_model(model, "model")
    print(f"AUC: {auc:.4f}, F1: {f1:.4f}")
# ========== Step 4: Model serving ==========
# model_serve.py (FastAPI service)
from fastapi import FastAPI
import mlflow.sklearn
import numpy as np
app = FastAPI(title="Churn Prediction API")
# Assumes the logged model was registered as "churn-model" and a version was promoted to the Production stage
model = mlflow.sklearn.load_model("models:/churn-model/Production")
@app.post("/predict")
async def predict(features: dict):
    # The values must arrive in the same column order that was used for training
    feature_values = np.array([list(features.values())]).reshape(1, -1)
    probability = model.predict_proba(feature_values)[0][1]
    return {
        "churn_probability": float(probability),
        "prediction": "churn" if probability > 0.5 else "retain",
        "risk_level": "high" if probability > 0.8 else "medium" if probability > 0.5 else "low"
    }
@app.post("/batch_predict")
async def batch_predict(users: list[dict]):
    features = np.array([list(u.values()) for u in users])
    probabilities = model.predict_proba(features)[:, 1]
    return [{"user_index": i, "churn_probability": float(p)} for i, p in enumerate(probabilities)]
Real-Time Intelligent Applications
Real-Time Recommendation System
"""
A real-time recommendation system based on Flink + AI
"""
# ========== Real-time feature computation (Flink SQL) ==========
# Run in the Flink SQL Client
"""
-- Real-time user behavior features
CREATE VIEW user_realtime_features AS
SELECT
    user_id,
    COUNT(*) AS event_count_1h,
    COUNT(DISTINCT item_id) AS unique_items_1h,
    SUM(CASE WHEN action = 'purchase' THEN price ELSE 0 END) AS spending_1h,
    TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start
FROM user_events
GROUP BY
    user_id,
    TUMBLE(event_time, INTERVAL '1' HOUR);
"""
# ========== Recommendation model inference service ==========
import torch
from fastapi import FastAPI
from pydantic import BaseModel
import redis
import json
app = FastAPI(title="Real-time Recommendation API")
redis_client = redis.Redis(host='localhost', port=6379, db=0)
# Load the recommendation model (a fully pickled module; recent PyTorch releases
# need weights_only=False for this, so only load files from a trusted source)
rec_model = torch.load("models/rec_model.pt", weights_only=False)
rec_model.eval()
class UserFeatures(BaseModel):
    user_id: str
    event_count_1h: int
    unique_items_1h: int
    spending_1h: float
    recent_categories: list[str]
# encode_features, get_item_id, and fetch_item_details are application-specific helpers that are not shown here
@app.post("/recommend")
async def recommend(features: UserFeatures, top_k: int = 10):
    # 1. Check the cache
    cache_key = f"rec:{features.user_id}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # 2. Model inference (scores is assumed to be a 1-D tensor with one score per item)
    with torch.no_grad():
        user_tensor = encode_features(features)
        scores = rec_model(user_tensor)
        top_indices = scores.argsort(descending=True)[:top_k]
    # 3. Fetch the details of the recommended items
    item_ids = [get_item_id(idx) for idx in top_indices]
    recommendations = fetch_item_details(item_ids)
    # 4. Cache the result (expires after 5 minutes)
    redis_client.setex(cache_key, 300, json.dumps(recommendations))
    return recommendations
Intelligent Monitoring and Anomaly Detection
"""
An AI-based real-time anomaly detection system
"""
import numpy as np
from sklearn.ensemble import IsolationForest
from datetime import datetime
class AnomalyDetector:
    def __init__(self):
        self.model = IsolationForest(
            n_estimators=100,
            contamination=0.01,
            random_state=42
        )
        self.is_fitted = False
        self.feature_buffer = []
    def update_model(self, features: np.ndarray):
        """Buffer samples and refit the model on every 1,000 of them (a periodic refit, not true incremental learning)."""
        self.feature_buffer.append(features)
        if len(self.feature_buffer) >= 1000:
            training_data = np.vstack(self.feature_buffer)
            self.model.fit(training_data)
            self.is_fitted = True
            self.feature_buffer = []
    def detect(self, features: np.ndarray) -> dict:
        """Score a single sample in real time."""
        if not self.is_fitted:
            return {"is_anomaly": False, "status": "model_not_ready"}
        score = self.model.decision_function(features.reshape(1, -1))[0]
        is_anomaly = self.model.predict(features.reshape(1, -1))[0] == -1
        result = {
            "is_anomaly": bool(is_anomaly),
            "anomaly_score": float(score),
            "severity": "high" if score < -0.3 else "medium" if score < -0.1 else "low",
            "timestamp": datetime.now().isoformat()
        }
        if is_anomaly:
            self._trigger_alert(result)
        return result
    def _trigger_alert(self, result: dict):
        """Raise an alert."""
        print(f"[ALERT] Anomaly detected: {result}")
        # Forward to an alerting system (DingTalk/Feishu/Slack/PagerDuty)
        # alert_client.send(channel="data-anomaly", message=result)
# Integration with a Flink real-time stream
detector = AnomalyDetector()
def process_event(event):
    """Flink processing function. extract_features is an application-specific helper that is not shown here."""
    features = extract_features(event)
    detector.update_model(features)
    result = detector.detect(features)
    if result["is_anomaly"]:
        # Emit a row for the anomaly events table
        return {
            **event,
            "anomaly_score": result["anomaly_score"],
            "severity": result["severity"],
            "detected_at": result["timestamp"]
        }
    return None
智能数据质量监控
"""
AI 驱动的数据质量监控
"""
from dataclasses import dataclass
from typing import Optional
import pandas as pd
import numpy as np
@dataclass
class QualityRule:
name: str
column: str
rule_type: str # 'range', 'null_check', 'unique', 'distribution', 'ai_anomaly'
params: dict
severity: str = "warning" # 'warning', 'critical'

class DataQualityMonitor:
    def __init__(self):
        self.rules: list[QualityRule] = []
        self.baseline_stats: dict = {}

    def add_rule(self, rule: QualityRule):
        self.rules.append(rule)

    def compute_baseline(self, df: pd.DataFrame):
        """Build baseline statistics for every numeric column from historical data."""
        for col in df.select_dtypes(include=[np.number]).columns:
            self.baseline_stats[col] = {
                "mean": df[col].mean(),
                "std": df[col].std(),
                "min": df[col].min(),
                "max": df[col].max(),
                "null_rate": df[col].isnull().mean(),
                # Relative frequency of the 20 most common values
                "distribution": df[col].value_counts(normalize=True).head(20).to_dict()
            }

    def check_quality(self, df: pd.DataFrame) -> list[dict]:
        """Run every registered rule against the given batch.

        Rule types without a branch below ('unique', 'distribution') are skipped.
        """
        results = []
        for rule in self.rules:
            if rule.rule_type == "null_check":
                # Share of null values must stay at or below the threshold
                null_rate = df[rule.column].isnull().mean()
                threshold = rule.params.get("max_null_rate", 0.05)
                passed = null_rate <= threshold
                results.append({
                    "rule": rule.name, "passed": passed,
                    "metric": null_rate, "threshold": threshold,
                    "severity": rule.severity
                })
            elif rule.rule_type == "range":
                # Share of values outside [min, max] must stay at or below the threshold
                out_of_range = ((df[rule.column] < rule.params["min"]) |
                                (df[rule.column] > rule.params["max"])).mean()
                threshold = rule.params.get("max_outlier_rate", 0.01)
                passed = out_of_range <= threshold
                results.append({
                    "rule": rule.name, "passed": passed,
                    "metric": out_of_range, "threshold": threshold,
                    "severity": rule.severity
                })
            elif rule.rule_type == "ai_anomaly":
                # Drift check: z-score of the batch mean against the baseline mean and std.
                # This is a simple statistical test standing in for a learned detector.
                current_mean = df[rule.column].mean()
                baseline = self.baseline_stats.get(rule.column, {})
                expected_mean = baseline.get("mean", current_mean)
                expected_std = baseline.get("std", 1)
                z_score = abs(current_mean - expected_mean) / max(expected_std, 1e-8)
                threshold = rule.params.get("z_threshold", 3.0)
                passed = z_score <= threshold
                results.append({
                    "rule": rule.name, "passed": passed,
                    "metric": z_score, "threshold": threshold,
                    "detail": f"Drift check: mean moved from {expected_mean:.2f} to {current_mean:.2f}",
                    "severity": rule.severity
                })
        return results

# Usage example
monitor = DataQualityMonitor()

# Register quality rules
monitor.add_rule(QualityRule("user_id not null", "user_id", "null_check", {"max_null_rate": 0.01}, "critical"))
monitor.add_rule(QualityRule("price range", "price", "range", {"min": 0, "max": 100000, "max_outlier_rate": 0.01}, "warning"))
monitor.add_rule(QualityRule("amount distribution drift", "amount", "ai_anomaly", {"z_threshold": 3.0}, "critical"))

# Build the baseline, then check the current batch (reading s3:// paths with pandas requires s3fs)
baseline_df = pd.read_parquet("s3://data-lake/processed/transactions/history/")
monitor.compute_baseline(baseline_df)
current_df = pd.read_parquet("s3://data-lake/processed/transactions/2024-01-15/")
results = monitor.check_quality(current_df)

for r in results:
    status = "✅ PASS" if r["passed"] else f"❌ FAIL [{r['severity']}]"
    # Build the detail string separately: reusing the outer quote type inside an
    # f-string expression is a syntax error before Python 3.12
    detail = r.get("detail") or f"{r['metric']:.4f} vs {r['threshold']}"
    print(f"{status} - {r['rule']}: {detail}")
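
The `rule_type` comment on `QualityRule` lists `unique` and `distribution`, but `check_quality` only handles three rule types. The following standard-library sketch shows what those two checks might compute. The function names (`duplicate_rate`, `frequencies`, `total_variation_distance`) and the choice of total variation distance as the distribution metric are assumptions for illustration, not part of the original monitor.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping


def duplicate_rate(values: Iterable[Hashable]) -> float:
    """Fraction of rows that repeat an earlier value (0.0 means all unique)."""
    values = list(values)
    if not values:
        return 0.0
    return 1.0 - len(set(values)) / len(values)


def frequencies(values: Iterable[Hashable]) -> dict:
    """Normalized value frequencies, similar to value_counts(normalize=True)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def total_variation_distance(baseline: Mapping, current: Mapping) -> float:
    """Half the L1 distance between two frequency maps (0 = identical, 1 = disjoint)."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    ids = ["u1", "u2", "u2", "u3"]
    print(f"duplicate rate: {duplicate_rate(ids):.2f}")

    base = frequencies(["card", "card", "cash", "wallet"])
    cur = frequencies(["card", "cash", "cash", "cash"])
    print(f"distribution shift: {total_variation_distance(base, cur):.2f}")
```

Either value could be compared against a threshold taken from `rule.params`, in the same way the `null_check` and `range` branches do. Note that the baseline built by `compute_baseline` keeps only the top 20 values, so its frequencies may not sum to 1.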
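MLOps and Model Management
MLOps End-to-End Flow
┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    MLOps End-to-End Flow                                     │
├──────────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                              │
│  Data           Feature        Experiment     Model          Model          Monitoring       │
│  versioning   → engineering  → tracking     → registry     → deployment   →                  │
│  (DVC)          (Feast,        (MLflow,       (MLflow,       (Seldon,       (Prometheus      │
│                  Tecton)        W&B)           Vertex AI)     BentoML)       + Grafana)      │
│                                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────────────────────┐  │
│  │                               CI/CD automation pipeline                                │  │
│  │ Code commit → Data validation → Model training → Evaluation → Deployment → Monitoring  │  │
│  └────────────────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                              │
└──────────────────────────────────────────────────────────────────────────────────────────────┘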
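
The "Evaluation → Deployment" step in the pipeline above is usually guarded by an automated check. The sketch below illustrates one way such a gate might look. The names `GateResult` and `evaluation_gate`, the metric keys, and the default thresholds are all hypothetical and would need to be tuned per project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    promote: bool
    reasons: tuple[str, ...]


def evaluation_gate(candidate: dict, production: dict,
                    min_auc: float = 0.80,
                    max_regression: float = 0.01) -> GateResult:
    """Decide whether a candidate model may move from evaluation to deployment.

    The candidate must clear an absolute AUC floor and must not fall more than
    `max_regression` below the production model on any shared metric.
    """
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append(f"auc {candidate['auc']:.3f} below floor {min_auc:.3f}")
    for name, prod_value in production.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            reasons.append(f"metric '{name}' missing from candidate")
        elif cand_value < prod_value - max_regression:
            reasons.append(f"{name} regressed: {cand_value:.3f} < {prod_value:.3f}")
    return GateResult(promote=not reasons, reasons=tuple(reasons))


if __name__ == "__main__":
    result = evaluation_gate({"auc": 0.91, "f1": 0.72}, {"auc": 0.89, "f1": 0.74})
    print(result.promote, list(result.reasons))
```

In a CI/CD job, a `promote=False` result would typically fail the pipeline stage and surface `reasons` in the build log.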
MLflow Experiment Management
import os

import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("bigdata-ai-churn-prediction")

def train_and_log_model(params, X_train, X_test, y_train, y_test):
    with mlflow.start_run(run_name=f"gbt_{params['n_estimators']}_{params['max_depth']}"):
        # Train the model
        model = GradientBoostingClassifier(**params)
        model.fit(X_train, y_train)

        # Evaluate on the held-out set
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        f1 = f1_score(y_test, model.predict(X_test))

        # Log parameters and metrics
        mlflow.log_params(params)
        mlflow.log_metrics({"auc": auc, "f1": f1})
        # Log the feature-importance plot only if an earlier step produced it
        if os.path.exists("feature_importance.png"):
            mlflow.log_artifact("feature_importance.png")

        # Log and register the model; each run adds a new version of "churn-predictor"
        signature = infer_signature(X_train, model.predict(X_train))
        mlflow.sklearn.log_model(
            model, "model",
            signature=signature,
            registered_model_name="churn-predictor"
        )
    return auc

# Hyperparameter search (X_train, X_test, y_train, y_test are assumed to come from an earlier split)
param_grid = [
    {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3},
    {"n_estimators": 200, "learning_rate": 0.1, "max_depth": 5},
    {"n_estimators": 300, "learning_rate": 0.05, "max_depth": 7},
]
best_auc = 0
for params in param_grid:
    auc = train_and_log_model(params, X_train, X_test, y_train, y_test)
    best_auc = max(best_auc, auc)
print(f"Best AUC: {best_auc:.4f}")
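Model Deployment Strategies
| Strategy | Description | Typical use |
|---|---|---|
| Blue-green deployment | Old and new versions run side by side; traffic is switched in one step | Critical services that need zero downtime |
| Canary release | A small share of traffic validates the new model | Risk-sensitive scenarios |
| A/B testing | Traffic is split to compare model outcomes | Validating model effectiveness |
| Shadow deployment | The new model receives mirrored traffic, but its predictions are not returned to callers | Safe validation before exposure |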
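
Canary release and A/B testing both rely on splitting traffic in a stable way, so that the same caller keeps hitting the same model version. The sketch below shows one common approach, hashing a request or user ID into a bucket. The function names and the ID format are hypothetical; in practice this logic often lives in the service mesh or gateway rather than in application code.

```python
import hashlib


def canary_bucket(request_id: str) -> float:
    """Map an ID to a stable value in [0, 1) using the first 48 bits of SHA-256."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def choose_model(request_id: str, canary_fraction: float) -> str:
    """Return 'canary' for roughly canary_fraction of IDs, otherwise 'stable'."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    return "canary" if canary_bucket(request_id) < canary_fraction else "stable"


if __name__ == "__main__":
    routed = [choose_model(f"user-{i}", 0.05) for i in range(10_000)]
    share = routed.count("canary") / len(routed)
    print(f"canary share: {share:.3f}")
```

Because the bucket depends only on the ID, raising `canary_fraction` from 5% to 20% keeps every caller who was already on the canary there and only adds new ones.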
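# BentoML model serving
# Uses the BentoML 1.0/1.1 Runner API; later releases replace it with @bentoml.service
import bentoml
import numpy as np
from bentoml.io import JSON

# Save the model; predict_proba must be declared so the runner exposes it
bentoml.sklearn.save_model(
    "churn_predictor", model,
    signatures={"predict_proba": {"batchable": True}},
)

# Service definition (service.py)
runner = bentoml.sklearn.get("churn_predictor:latest").to_runner()
svc = bentoml.Service("churn_prediction_service", runners=[runner])

@svc.api(input=JSON(), output=JSON())
async def predict(input_data: dict) -> dict:
    # Feature order follows the JSON key order, which must match the training columns
    features = np.array([list(input_data.values())]).reshape(1, -1)
    probability = await runner.predict_proba.async_run(features)
    return {
        "churn_probability": float(probability[0][1]),
        "prediction": "churn" if probability[0][1] > 0.5 else "retain"
    }

# Deployment commands
# bentoml serve service:svc --production
# bentoml build    (needs a bentofile.yaml; packages the service as a Bento)
# bentoml containerize churn_prediction_service:latest -t churn-predictor:latest
# docker run -p 3000:3000 churn-predictor:latest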
Deployment and Operations
Deploying the Big Data + AI Platform on Kubernetes
# spark-cluster.yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: ai-feature-pipeline
  namespace: bigdata
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "registry.example.com/spark-ai:3.5.0"
  imagePullPolicy: Always
  mainApplicationFile: "s3a://scripts/feature_pipeline.py"
  arguments:
    - "--input"
    - "s3a://data-lake/raw/events/"
    - "--output"
    - "s3a://data-lake/ml-features/"
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
  driver:
    cores: 2
    memory: "4g"
    serviceAccount: spark-operator-spark
  executor:
    cores: 4
    memory: "8g"
    instances: 5
  deps:
    pyFiles:
      - "s3a://scripts/utils.py"
    jars:
      # Delta Lake 3.x is published as delta-spark (renamed from delta-core);
      # the matching delta-storage jar is needed alongside it
      - "s3a://jars/delta-spark_2.12-3.0.0.jar"
  hadoopConf:
    # MinIO is usually reached over plain HTTP with path-style addressing
    fs.s3a.endpoint: "http://minio:9000"
    fs.s3a.path.style.access: "true"
    # Demo credentials only; in production, inject them from a Kubernetes Secret
    fs.s3a.access.key: "admin"
    fs.s3a.secret.key: "password"

# model-serving.yaml - Triton Inference Server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          # The image does not start the server by itself; the model
          # repository is passed as a command-line flag
          command: ["tritonserver"]
          args: ["--model-repository=/models"]
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Metrics
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-repository
              mountPath: /models
      volumes:
        - name: model-repository
          persistentVolumeClaim:
            claimName: model-repo-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
  namespace: ai-serving
spec:
  selector:
    app: triton
  ports:
    - name: http
      port: 8000
    - name: grpc
      port: 8001
    - name: metrics
      port: 8002
  type: ClusterIP
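Monitoring
# prometheus-rules.yaml - alerting rules for AI model monitoring
# Metric names depend on what your exporters publish. Triton's built-in latency
# metrics are cumulative counters (for example nv_inference_request_duration_us),
# so the histogram below is assumed to come from a custom or gateway exporter.
# model_accuracy_score and data_drift_score are application-defined metrics, and
# the Kafka lag metric name varies by exporter.
groups:
  - name: model_monitoring
    rules:
      - alert: ModelPredictionLatencyHigh
        expr: histogram_quantile(0.95, rate(triton_inference_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model inference latency is high"
          description: "P95 latency above 500 ms; current value: {{ $value }}s"
      - alert: ModelAccuracyDegraded
        expr: model_accuracy_score < 0.85
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy has degraded"
          description: "Model accuracy is below the 0.85 threshold; current value: {{ $value }}"
      - alert: DataDriftDetected
        expr: data_drift_score > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Data drift detected"
          description: "Feature distribution drift score exceeds the threshold; the model may need retraining"
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_group_lag > 100000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag is high"
          description: "Consumer group lag above 100K messages; current value: {{ $value }}"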
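
The `DataDriftDetected` rule presumes that some job publishes a `data_drift_score` metric, but the rules file does not say how that score is computed. One common choice is the Population Stability Index (PSI), where values around 0.1 are often read as the start of a noticeable shift. The sketch below is a standard-library illustration under that assumption; the function name `psi`, the equal-width binning, and the `1e-6` floor are illustrative choices.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample.

    Bin edges are equal-width over the range of `expected`; values of
    `actual` outside that range are clamped into the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant baselines

    def shares(sample: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps log() finite when a bin is empty
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    baseline = [float(x) for x in range(100)]
    shifted = [x + 50.0 for x in baseline]
    print(f"PSI (no shift): {psi(baseline, baseline):.3f}")
    print(f"PSI (shifted):  {psi(baseline, shifted):.3f}")
```

A batch job could compute this per feature and export the maximum as a gauge, which the alert rule would then compare against its threshold.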
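Technology Stack Overview
| Layer | Technology | Notes |
|---|---|---|
| Data ingestion | Kafka + Flume + Debezium | Real-time + batch + CDC |
| Data storage | Delta Lake + MinIO + ClickHouse | Lakehouse + object storage + OLAP |
| Batch processing | Spark on Kubernetes | Large-scale offline computation |
| Stream processing | Flink on Kubernetes | Low-latency real-time computation |
| Feature store | Feast / Tecton | Consistent offline and online features |
| Model training | PyTorch + Spark MLlib | Deep learning + distributed ML |
| Experiment management | MLflow + Weights & Biases | Experiment tracking and model registry |
| Model serving | Triton + BentoML + FastAPI | GPU inference + CPU inference + API |
| LLM platform | vLLM + LangChain + Milvus | LLM inference + RAG + vector database |
| Scheduling and orchestration | Airflow + Dagster | Job scheduling and data orchestration |
| Container orchestration | Kubernetes + ArgoCD | Containerized deployment and GitOps |
| Monitoring and alerting | Prometheus + Grafana + EFK | Metrics + logs + alerting |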
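Summary and Learning Path
Big Data + AI Engineer Competency Model
                           ┌────────────────────────┐
                           │ Business understanding │
                           │ (domain knowledge,     │
                           │  requirements analysis)│
                           └───────────┬────────────┘
                                       │
            ┌──────────────────────────┼──────────────────────────┐
            │                          │                          │
┌───────────▼────────────┐ ┌───────────▼────────────┐ ┌───────────▼────────────┐
│ Big data engineering   │ │ AI/ML                  │ │ Production engineering │
│ (ingestion, storage,   │ │ (modeling, training,   │ │ (serving, deployment,  │
│  compute)              │ │  inference, evaluation)│ │  monitoring, ops)      │
└───────────┬────────────┘ └───────────┬────────────┘ └───────────┬────────────┘
            │                          │                          │
            └──────────────────────────┼──────────────────────────┘
                                       │
                           ┌───────────▼────────────┐
                           │ Converged architecture │
                           │ (end-to-end pipeline   │
                           │  design)               │
                           └────────────────────────┘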
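Recommended Learning Path
| Stage | Duration | Topics | Practice project |
|---|---|---|---|
| Beginner | 1-2 months | Python, SQL, Linux basics, Hadoop ecosystem overview | Set up a single-node Hadoop cluster |
| Intermediate | 2-3 months | Spark/Flink, data warehousing, machine learning basics | User behavior analytics platform |
| Advanced | 3-4 months | Deep learning, feature engineering, model deployment | Real-time recommendation system |
| Integration | 2-3 months | MLOps, LLM integration, lakehouse architecture | End-to-end intelligent data platform |
| Mastery | Ongoing | System design, performance optimization, emerging technologies | Enterprise-grade production system |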
Core Reading List
- 📖 Big data: 《大数据之路》, "Spark: The Definitive Guide", "Stream Processing with Apache Flink"
- 📖 Machine learning: "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow", 《统计学习方法》
- 📖 Deep learning: "Deep Learning", 《动手学深度学习》 ("Dive into Deep Learning", d2l.ai)
- 📖 LLM: "Build a Large Language Model (From Scratch)", "Natural Language Processing with Transformers"
- 📖 MLOps: "Designing Machine Learning Systems", "Machine Learning Engineering"
- 📖 System design: "Designing Data-Intensive Applications", "Fundamentals of Data Engineering"
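Industry Use Cases
| Industry | Big data applications | AI applications | Combined value |
|---|---|---|---|
| E-commerce | User behavior analytics, product profiling | Recommendation systems, search ranking | Real-time personalized recommendations |
| Finance | Transaction data warehouse, risk-control data lake | Fraud detection, credit scoring | Real-time risk decisions |
| Healthcare | Electronic medical records, imaging data management | Assisted diagnosis, drug discovery | Intelligent diagnosis and treatment systems |
| Manufacturing | IoT sensor data collection | Predictive maintenance, quality inspection | Smart factory |
| Transportation | Trajectory data, traffic monitoring | Route planning, demand forecasting | Smart mobility |
| Content | Content profiling, user segmentation | Content generation, automated moderation | AIGC content platform |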
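The convergence of big data and AI is reshaping every industry. Command of the full chain, from data ingestion to intelligent decision-making, will be a core competitive advantage for engineers.
This guide is updated continuously; contributions and feedback are welcome!
This guide covers the core technology stack for combined big data and AI development, spanning the full path from data ingestion, storage, and computation to model training, deployment, and operations. As the technology keeps evolving, the following trends are worth watching: deeper lakehouse adoption, real-time AI inference, LLM and big data integration, Data + AI Mesh architectures, and green computing with cost optimization.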