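Issue 15: Integrating Machine Learning with Big Data - The Algorithmic Engine of Industrial Intelligence
Introduction: The integration of machine learning with big data platforms is a core driver of industrial intelligence. This issue examines mainstream machine learning frameworks such as Spark MLlib, FlinkML, and TensorFlow on Spark, walks through the full workflow of feature engineering, model training, and online inference, and ties it to typical industrial applications: anomaly detection, predictive maintenance, and quality prediction.
15.1 The Industrial Machine Learning Technology Stack
15.1.1 Technology Selection Matrix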
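┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│                           Industrial ML Technology Selection Matrix                           │
├─────────────────────┬─────────────────────┬───────────────────┬────────────────────────────────┤
│ Scenario            │ Framework           │ Data scale        │ Characteristics                │
├─────────────────────┼─────────────────────┼───────────────────┼────────────────────────────────┤
│ Feature engineering │ Spark MLlib         │ TB to PB          │ Rich DataFrame API             │
│ Offline training    │ Spark MLlib         │ GB to TB          │ Distributed training           │
│ Streaming training  │ FlinkML             │ Real-time streams │ Online learning                │
│ Deep learning       │ TensorFlow on Spark │ GPU cluster       │ Distributed training framework │
│ Model serving       │ Seldon/MLeap        │ -                 │ Low-latency inference          │
│ AutoML              │ Flyte/Athena        │ Automated search  │ Hyperparameter tuning          │
└─────────────────────┴─────────────────────┴───────────────────┴────────────────────────────────┘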
[Figure: layered architecture of the industrial ML stack]
Application layer: prediction service, recommender system, anomaly detection
Serving layer: MLflow, Seldon Core, Triton
Training layer: Spark MLlib, TensorFlow, XGBoost
Feature engineering layer: Spark SQL, Flink
Data layer: HDFS, Kafka, HBase
15.1.2 Industrial ML Workflow Architecture
[Figure: industrial ML workflow, with edges labeled "feature feedback" and "trigger retraining"]
Data ingestion: raw data, streaming data
Feature engineering: data cleaning, feature transformation, feature joining, aggregate statistics
Model training: data splitting, distributed training, model evaluation, model registration
Model deployment: batch inference, real-time inference, A/B testing
Monitoring and feedback: drift detection, automated retraining
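The feedback loop in this workflow, where drift detection feeds a "trigger retraining" edge back into training, can be sketched as a small decision rule. This is a minimal illustration rather than the article's implementation: the `FeatureDrift` record, the `should_retrain` function, and both thresholds (`min_drifted_fraction`, `cooldown_hours`) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FeatureDrift:
    """Hypothetical per-feature drift result produced by a monitoring job."""
    feature_name: str
    is_drifted: bool


def should_retrain(
    drifts: List[FeatureDrift],
    last_trained_at: Optional[datetime],
    now: datetime,
    min_drifted_fraction: float = 0.3,  # assumed threshold
    cooldown_hours: int = 24,           # assumed minimum gap between retrains
) -> bool:
    """Return True when enough features drifted and the cooldown has passed."""
    if not drifts:
        return False
    # The cooldown guards against retraining storms while drift persists.
    if last_trained_at is not None:
        if now - last_trained_at < timedelta(hours=cooldown_hours):
            return False
    drifted = sum(1 for d in drifts if d.is_drifted)
    return drifted / len(drifts) >= min_drifted_fraction


reports = [
    FeatureDrift("temperature", True),
    FeatureDrift("pressure", False),
    FeatureDrift("vibration", True),
]
now = datetime(2024, 1, 10, 12, 0)
# Two of three features drifted and the last training run is nine days old.
print(should_retrain(reports, datetime(2024, 1, 1), now))
# Same drift, but the model was retrained six hours ago.
print(should_retrain(reports, datetime(2024, 1, 10, 6, 0), now))
```

A fraction-based rule with a cooldown is only one possible policy; a production setup might instead weight features by importance or require drift to persist across several monitoring windows.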
15.2 Spark MLlib in Practice
15.2.1 Core Feature Engineering Code
python
# industrial_feature_engineering.py
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    Imputer, OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler
)
from pyspark.sql import Window
from pyspark.sql import functions as F

NUMERIC_FEATURES = [
    "temperature", "pressure", "vibration",
    "current", "voltage", "speed"
]
ROLLING_FEATURES = ["temp_mean_100", "temp_std_100", "temp_trend"]
CATEGORY_FEATURES = ["machine_type", "production_line", "shift"]


def add_sensor_features(df):
    """
    Add rolling-statistic and calendar features to the raw sensor DataFrame.
    These are plain DataFrame transformations, so apply them before fitting
    the Pipeline returned by build_sensor_feature_pipeline().
    """
    # 1. Rolling statistics over the latest 100 records per machine
    #    (the current row plus the 99 rows before it)
    window_spec = (
        Window.partitionBy("machine_id")
        .orderBy("timestamp")
        .rowsBetween(-99, 0)
    )
    sensor_stats = (
        df
        .withColumn("temp_mean_100", F.avg("temperature").over(window_spec))
        .withColumn("temp_std_100", F.stddev("temperature").over(window_spec))
        .withColumn("temp_max_100", F.max("temperature").over(window_spec))
        .withColumn("temp_min_100", F.min("temperature").over(window_spec))
        # z-score of the current reading; avoid dividing by a zero deviation
        .withColumn(
            "temp_trend",
            F.when(
                F.col("temp_std_100") > 0,
                (F.col("temperature") - F.col("temp_mean_100")) / F.col("temp_std_100")
            ).otherwise(F.lit(0.0))
        )
    )
    # 2. Calendar features
    return (
        sensor_stats
        .withColumn("hour", F.hour("timestamp"))
        .withColumn("day_of_week", F.dayofweek("timestamp"))
        .withColumn("is_night",
                    F.when((F.col("hour") < 6) | (F.col("hour") > 22), 1).otherwise(0))
        # Spark's dayofweek returns 1 for Sunday and 7 for Saturday
        .withColumn("is_weekend",
                    F.when(F.col("day_of_week").isin(1, 7), 1).otherwise(0))
        .withColumn("month", F.month("timestamp"))
        .withColumn("quarter", F.quarter("timestamp"))
    )


def build_sensor_feature_pipeline():
    """
    Build the feature engineering pipeline for industrial sensor data.
    Expects the columns produced by add_sensor_features().
    """
    # 1. Missing-value imputation; the rolling statistics are included because
    #    they are null where the window holds too few readings
    impute_cols = NUMERIC_FEATURES + ROLLING_FEATURES
    imputer = Imputer(
        strategy="median",  # the median is the recommended choice for industrial data
        inputCols=impute_cols,
        outputCols=[f"{c}_imputed" for c in impute_cols]
    )
    # 2. Categorical encoding; "keep" puts unseen categories in an extra bucket
    indexers = [
        StringIndexer(inputCol=c, outputCol=f"{c}_index", handleInvalid="keep")
        for c in CATEGORY_FEATURES
    ]
    encoders = [
        OneHotEncoder(inputCol=f"{c}_index", outputCol=f"{c}_vec")
        for c in CATEGORY_FEATURES
    ]
    # 3. Assemble the feature vector
    assembler = VectorAssembler(
        inputCols=[f"{c}_imputed" for c in impute_cols]
        + [f"{c}_vec" for c in CATEGORY_FEATURES],
        outputCol="features"
    )
    # 4. Standardization
    scaler = StandardScaler(
        inputCol="features",
        outputCol="scaled_features",
        withMean=True,
        withStd=True
    )
    return Pipeline(stages=[imputer, *indexers, *encoders, assembler, scaler])


def compute_time_series_features(df, group_col, value_col, windows):
    """
    Compute rolling statistics of value_col for several window sizes.
    The columns for every window size are added to the same DataFrame.
    """
    result_df = df
    for w in windows:
        window_spec = (
            Window.partitionBy(group_col)
            .orderBy("timestamp")
            .rowsBetween(-(w - 1), 0)  # the latest w records, current row included
        )
        result_df = (
            result_df
            .withColumn(f"rolling_{w}_mean", F.avg(value_col).over(window_spec))
            .withColumn(f"rolling_{w}_std", F.stddev(value_col).over(window_spec))
            .withColumn(f"rolling_{w}_max", F.max(value_col).over(window_spec))
            .withColumn(f"rolling_{w}_min", F.min(value_col).over(window_spec))
        )
    return result_df
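To make the row-based window concrete, the sketch below reproduces the rolling mean, sample standard deviation, and the `temp_trend` z-score in plain Python for a single machine. It is an illustration only: `rolling_zscores` is a hypothetical helper, the readings are made up, and the window is taken to be the current reading plus the `window - 1` readings before it.

```python
from statistics import mean, stdev
from typing import List


def rolling_zscores(values: List[float], window: int) -> List[float]:
    """Z-score of each value against its trailing window (current value included)."""
    scores = []
    for i, value in enumerate(values):
        start = max(0, i - window + 1)
        recent = values[start:i + 1]
        # The sample standard deviation needs at least two points.
        std = stdev(recent) if len(recent) > 1 else 0.0
        scores.append((value - mean(recent)) / std if std > 0 else 0.0)
    return scores


temps = [70.0, 70.5, 69.5, 70.0, 85.0]  # the last reading is a spike
scores = rolling_zscores(temps, window=5)
print([round(s, 2) for s in scores])
```

The first reading scores 0 because a one-element window has no spread, which is the same edge case that has to be guarded in the Spark version before dividing by the rolling standard deviation.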
15.2.2 Model Training and Tuning
python
# industrial_model_training.py
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator, RegressionEvaluator
)
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql.functions import col


class IndustrialModelTrainer:
    """Model trainer for industrial scenarios."""

    def __init__(self, spark):
        self.spark = spark

    def train_anomaly_detection_model(self, train_df, test_df):
        """
        Train a supervised anomaly detection model with a GBT classifier.
        Spark MLlib ships no IsolationForest, so the unsupervised half of an
        IsolationForest + GBT hybrid would need a third-party library.
        """
        # Feature and label columns
        feature_col = "scaled_features"
        label_col = "is_anomaly"
        # 1. GBT classifier
        gbt = GBTClassifier(
            featuresCol=feature_col,
            labelCol=label_col,
            maxIter=100,
            maxDepth=6,
            stepSize=0.1,
            subsamplingRate=0.8,
            minInstancesPerNode=10,
            featureSubsetStrategy="sqrt"
        )
        # 2. Parameter grid: 3 x 3 x 3 = 27 combinations
        param_grid = ParamGridBuilder() \
            .addGrid(gbt.maxDepth, [4, 6, 8]) \
            .addGrid(gbt.maxIter, [50, 100, 150]) \
            .addGrid(gbt.stepSize, [0.05, 0.1, 0.2]) \
            .build()
        # 3. Evaluator
        evaluator = BinaryClassificationEvaluator(
            labelCol=label_col,
            rawPredictionCol="rawPrediction",
            metricName="areaUnderPR"  # the PR curve suits rare anomalies better than ROC
        )
        # 4. Cross-validation: 27 combinations x 5 folds = 135 model fits
        cv = CrossValidator(
            estimator=gbt,
            estimatorParamMaps=param_grid,
            evaluator=evaluator,
            numFolds=5,
            parallelism=4,
            seed=42
        )
        # 5. Train
        cv_model = cv.fit(train_df)
        # 6. Best model
        best_model = cv_model.bestModel
        print(f"Best maxDepth: {best_model.getOrDefault('maxDepth')}")
        print(f"Best maxIter: {best_model.getOrDefault('maxIter')}")
        # 7. Evaluate on the test set
        predictions = best_model.transform(test_df)
        # Compute several metrics; the evaluator reports area under the PR curve
        metrics = {
            "AUC-PR": evaluator.evaluate(predictions),
            "Precision": self._precision(predictions, label_col),
            "Recall": self._recall(predictions, label_col),
            "F1": self._f1_score(predictions, label_col)
        }
        return best_model, predictions, metrics

    def train_predictive_maintenance_model(self, train_df, test_df):
        """
        Train a predictive maintenance model that predicts the
        remaining useful life (RUL) of equipment.
        """
        feature_col = "scaled_features"
        label_col = "rul"  # Remaining Useful Life
        # GBT regressor; GBTRegressor accepts lossType "squared" or "absolute"
        gbt_reg = GBTRegressor(
            featuresCol=feature_col,
            labelCol=label_col,
            maxIter=100,
            maxDepth=5,
            lossType="squared"
        )
        # Parameter grid
        param_grid = ParamGridBuilder() \
            .addGrid(gbt_reg.maxDepth, [4, 6]) \
            .addGrid(gbt_reg.maxIter, [80, 100, 120]) \
            .addGrid(gbt_reg.stepSize, [0.1]) \
            .build()
        # RMSE evaluator
        evaluator = RegressionEvaluator(
            labelCol=label_col,
            predictionCol="prediction",
            metricName="rmse"
        )
        cv = CrossValidator(
            estimator=gbt_reg,
            estimatorParamMaps=param_grid,
            evaluator=evaluator,
            numFolds=5
        )
        cv_model = cv.fit(train_df)
        predictions = cv_model.bestModel.transform(test_df)
        return cv_model.bestModel, predictions

    def _precision(self, predictions_df, label_col):
        # Precision = TP / (TP + FP)
        tp = predictions_df.filter(
            (col("prediction") == 1) & (col(label_col) == 1)
        ).count()
        fp = predictions_df.filter(
            (col("prediction") == 1) & (col(label_col) == 0)
        ).count()
        return tp / (tp + fp) if (tp + fp) > 0 else 0

    def _recall(self, predictions_df, label_col):
        # Recall = TP / (TP + FN)
        tp = predictions_df.filter(
            (col("prediction") == 1) & (col(label_col) == 1)
        ).count()
        fn = predictions_df.filter(
            (col("prediction") == 0) & (col(label_col) == 1)
        ).count()
        return tp / (tp + fn) if (tp + fn) > 0 else 0

    def _f1_score(self, predictions_df, label_col):
        # F1 is the harmonic mean of precision and recall
        p = self._precision(predictions_df, label_col)
        r = self._recall(predictions_df, label_col)
        return 2 * p * r / (p + r) if (p + r) > 0 else 0
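The `_precision`, `_recall`, and `_f1_score` helpers reduce to arithmetic on confusion-matrix counts. The standalone sketch below shows why accuracy is misleading when anomalies are rare, which is the usual argument for PR-based metrics such as `areaUnderPR`. The `classification_metrics` function and the counts are made up for illustration.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    accuracy = (tp + tn) / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical test set: 10,000 readings, 100 of them true anomalies.
m = classification_metrics(tp=60, fp=40, fn=40, tn=9860)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case the model misses 40% of the anomalies, yet accuracy still exceeds 99% because normal readings dominate the count.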
15.2.3 Model Management and Deployment
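python
# model_serving.py - model serving
import mlflow
import mlflow.pyfunc
import mlflow.spark
import requests
from mlflow.tracking import MlflowClient


class ModelServingManager:
    """Model serving manager."""

    def __init__(self, tracking_uri):
        mlflow.set_tracking_uri(tracking_uri)
        self.client = MlflowClient()

    def register_model(self, model, model_name, metrics, feature_names=None):
        """
        Log a Spark ML model and register it in the MLflow Model Registry.
        feature_names optionally lists the input features in vector order.
        """
        with mlflow.start_run():
            # Log evaluation metrics
            for key, value in metrics.items():
                mlflow.log_metric(key, value)
            # Log feature importances for tree-based models
            if feature_names and hasattr(model, "featureImportances"):
                feature_importance = {
                    name: float(score)
                    for name, score in zip(
                        feature_names, model.featureImportances.toArray()
                    )
                }
                mlflow.log_dict(feature_importance, "feature_importance.json")
            # The models in this article are Spark ML models, so use the
            # mlflow.spark flavor rather than mlflow.sklearn
            model_info = mlflow.spark.log_model(
                spark_model=model,
                artifact_path=model_name,
                registered_model_name=model_name
            )
        return model_info.model_uri

    def transition_model_stage(self, model_name, version, stage):
        """
        Move a model version between stages: None -> Staging -> Production.
        Note: registry stages are deprecated since MLflow 2.9 in favor of aliases.
        """
        self.client.transition_model_version_stage(
            name=model_name,
            version=version,
            stage=stage
        )

    def load_production_model(self, model_name):
        """
        Load the model currently in the Production stage.
        """
        model_uri = f"models:/{model_name}/Production"
        return mlflow.pyfunc.load_model(model_uri)

    def batch_predict(self, model_name, batch_df):
        """
        Batch inference; batch_df is a pandas DataFrame.
        """
        model = self.load_production_model(model_name)
        return model.predict(batch_df)

    def real_time_predict(self, model_name, features_dict, timeout=5):
        """
        Real-time inference through a REST API.
        The /v2/models/{name}/infer path follows the Open Inference (V2)
        protocol; host, port and the tensor name are deployment-specific.
        """
        # Make sure a Production version is registered before calling the server
        versions = self.client.get_latest_versions(
            model_name, stages=["Production"]
        )
        if not versions:
            raise ValueError(f"No Production version registered for {model_name}")
        # Build the request: one FP64 tensor holding a single feature row.
        # The value order must match the feature order used at training time.
        values = [float(v) for v in features_dict.values()]
        payload = {
            "inputs": [{
                "name": "features",
                "shape": [1, len(values)],
                "datatype": "FP64",
                "data": values
            }]
        }
        # Call the inference service
        response = requests.post(
            f"http://inference-server:8080/v2/models/{model_name}/infer",
            json=payload,
            timeout=timeout
        )
        response.raise_for_status()
        return response.json()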
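A REST call to a `/v2/models/.../infer` endpoint expects named tensors rather than a bare feature dictionary, and the values have to arrive in the order the model was trained on. The sketch below shows one way to enforce that before sending a request. `build_v2_request`, the tensor name `features`, and the sample reading are assumptions for illustration; the real input name and datatype depend on how the model is served.

```python
import json
from typing import Dict, List, Sequence


def build_v2_request(
    features: Dict[str, float],
    feature_order: Sequence[str],
    tensor_name: str = "features",  # assumed input name; depends on the served model
) -> dict:
    """Build an Open Inference Protocol (V2) request body for one feature row."""
    missing = [name for name in feature_order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Emit values in the training-time order, not in dict insertion order.
    row: List[float] = [float(features[name]) for name in feature_order]
    return {
        "inputs": [{
            "name": tensor_name,
            "shape": [1, len(row)],
            "datatype": "FP64",
            "data": row,
        }]
    }


order = ["temperature", "pressure", "vibration"]
reading = {"vibration": 0.8, "temperature": 71.5, "pressure": 2.1}
body = build_v2_request(reading, order)
print(json.dumps(body, indent=2))
```

Failing fast on a missing feature is a deliberate choice here: silently sending a shorter vector would shift every later value into the wrong position.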
15.3 Streaming Machine Learning with Flink
15.3.1 Real-Time Feature Computation with FlinkML
java
// Streaming feature computation with the Flink DataStream API (Flink 1.x).
// SensorRecord, FeatureVector, AnomalyResult, the Kafka (de)serializers and
// AnomalyDetectionProcessFunction are project classes that are not shown here.
package com.industrial.flink.ml;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class FlinkStreamingFeatureEngine {

    public DataStream<FeatureVector> processSensorStream(StreamExecutionEnvironment env) {
        // 1. Kafka source
        KafkaSource<SensorRecord> source = KafkaSource.<SensorRecord>builder()
            .setBootstrapServers("kafka:9092")
            .setTopics("industrial-sensors")
            .setGroupId("feature-engine")
            .setValueOnlyDeserializer(new SensorRecordDeserializer())
            .build();

        // 2. Event time and watermarks: tolerate 10 s of out-of-order data
        WatermarkStrategy<SensorRecord> watermarkStrategy =
            WatermarkStrategy
                .<SensorRecord>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                .withTimestampAssigner((event, timestamp) -> event.getTimestamp())
                .withIdleness(Duration.ofMinutes(1));

        // keyBy returns a KeyedStream, which is what window() requires
        KeyedStream<SensorRecord, String> sensorStream = env
            .fromSource(source, watermarkStrategy, "Kafka Source")
            .keyBy(SensorRecord::getMachineId);

        // 3. Sliding-window feature computation
        DataStream<FeatureVector> features = sensorStream
            .window(SlidingEventTimeWindows.of(
                Time.minutes(5),   // window size
                Time.minutes(1)    // slide interval
            ))
            .process(new FeatureProcessFunction())
            .name("FeatureEngineering");

        // 4. Real-time anomaly detection
        DataStream<AnomalyResult> anomalies = features
            .keyBy(f -> f.getMachineId())
            .process(new AnomalyDetectionProcessFunction())
            .name("AnomalyDetection");

        // 5. Write to Kafka; KafkaSink implements the unified Sink API, so use sinkTo
        anomalies.sinkTo(KafkaSink.<AnomalyResult>builder()
            .setBootstrapServers("kafka:9092")
            .setRecordSerializer(new AnomalyResultSerializer("anomaly-output"))
            .build());

        return features;
    }

    /**
     * Feature computation: temperature statistics over the records of one
     * machine in one window. A windowed stream needs a ProcessWindowFunction;
     * the window replaces the hand-managed "latest 100 records" state.
     */
    public static class FeatureProcessFunction
            extends ProcessWindowFunction<SensorRecord, FeatureVector, String, TimeWindow> {

        @Override
        public void process(
                String machineId,
                Context ctx,
                Iterable<SensorRecord> records,
                Collector<FeatureVector> out) {
            // Collect the window's temperatures and track the newest record
            List<Double> history = new ArrayList<>();
            SensorRecord latest = null;
            for (SensorRecord record : records) {
                history.add(record.getTemperature());
                if (latest == null || record.getTimestamp() > latest.getTimestamp()) {
                    latest = record;
                }
            }
            if (latest == null) {
                return;
            }

            // Compute features
            double mean = history.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double std = Math.sqrt(
                history.stream()
                    .mapToDouble(v -> Math.pow(v - mean, 2))
                    .average().orElse(0)
            );
            double max = history.stream().mapToDouble(Double::doubleValue).max().orElse(0);
            double min = history.stream().mapToDouble(Double::doubleValue).min().orElse(0);
            // Change over the last 10 readings in units of std; guard against std == 0
            double trend = (history.size() >= 10 && std > 0) ?
                (history.get(history.size() - 1) - history.get(history.size() - 10)) / std : 0;

            FeatureVector feature = FeatureVector.newBuilder()
                .setMachineId(machineId)
                .setTimestamp(latest.getTimestamp())
                .setTempMean(mean)
                .setTempStd(std)
                .setTempMax(max)
                .setTempMin(min)
                .setTempTrend(trend)
                .setTemperature(latest.getTemperature())
                .setPressure(latest.getPressure())
                .setVibration(latest.getVibration())
                .build();
            out.collect(feature);
        }
    }
}
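The sliding window (5 minutes wide, sliding every minute) and the 10-second bounded out-of-orderness interact in a way that is easier to see with numbers. The following Python sketch is a simplified model of those event-time semantics, not Flink code: it assumes non-negative millisecond timestamps and ignores window offsets and allowed lateness. The function names are hypothetical.

```python
from typing import List, Tuple

MINUTE_MS = 60_000


def assign_sliding_windows(ts_ms: int, size_ms: int, slide_ms: int) -> List[Tuple[int, int]]:
    """Return the [start, end) windows that contain an event timestamp."""
    last_start = ts_ms - (ts_ms % slide_ms)  # latest window start at or before ts
    windows = []
    start = last_start
    while start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows


def window_can_fire(window_end_ms: int, max_seen_ts_ms: int, max_delay_ms: int) -> bool:
    """A window fires once the watermark reaches its last timestamp (end - 1)."""
    watermark = max_seen_ts_ms - max_delay_ms - 1
    return watermark >= window_end_ms - 1


event_ts = 7 * MINUTE_MS + 30_000  # an event 7 min 30 s into the stream
assigned = assign_sliding_windows(event_ts, 5 * MINUTE_MS, 1 * MINUTE_MS)
for start, end in assigned:
    print(f"[{start // MINUTE_MS} min, {end // MINUTE_MS} min)")

# The earliest window ends at minute 8; with a 10 s delay it fires only after
# an event with timestamp >= 8 min 10 s has been seen.
print(window_can_fire(8 * MINUTE_MS, 8 * MINUTE_MS + 9_999, 10_000))
print(window_can_fire(8 * MINUTE_MS, 8 * MINUTE_MS + 10_000, 10_000))
```

Each event lands in size / slide = 5 overlapping windows, so the feature function runs five times per record; that cost is worth keeping in mind when choosing the slide interval.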
15.4 Model Drift Monitoring and Automated Retraining
15.4.1 Drift Detection Architecture
python
# drift_detection.py
from dataclasses import dataclass
from typing import Dict

import numpy as np
from scipy import stats


@dataclass
class DriftReport:
    """Drift detection report."""
    feature_name: str
    drift_type: str  # 'feature_drift' | 'prediction_drift' | 'label_drift' | 'insufficient_data'
    drift_score: float
    p_value: float
    is_drifted: bool
    recommendation: str


class DriftDetector:
    """Model drift detector."""

    def __init__(self, reference_window_size: int = 10000):
        self.reference_window_size = reference_window_size
        self.reference_data: Dict[str, np.ndarray] = {}
        self.current_window: Dict[str, np.ndarray] = {}

    def update_reference(self, feature_name: str, data: np.ndarray):
        """Update the reference distribution, keeping the most recent samples."""
        if feature_name not in self.reference_data:
            self.reference_data[feature_name] = data
        else:
            self.reference_data[feature_name] = np.concatenate([
                self.reference_data[feature_name],
                data
            ])[-self.reference_window_size:]

    def update_current(self, feature_name: str, data: np.ndarray):
        """Update the current data window, keeping the most recent samples."""
        if feature_name not in self.current_window:
            self.current_window[feature_name] = data
        else:
            self.current_window[feature_name] = np.concatenate([
                self.current_window[feature_name],
                data
            ])[-self.reference_window_size:]

    def detect_drift(self, feature_name: str) -> DriftReport:
        """Run drift detection for one feature."""
        ref = self.reference_data.get(feature_name, np.array([]))
        curr = self.current_window.get(feature_name, np.array([]))
        if len(ref) < 100 or len(curr) < 100:
            return DriftReport(
                feature_name=feature_name,
                drift_type="insufficient_data",
                drift_score=0.0,
                p_value=1.0,
                is_drifted=False,
                recommendation="Collect more data"
            )
        # Two-sample Kolmogorov-Smirnov test
        ks_stat, p_value = stats.ks_2samp(ref, curr)
        # Population Stability Index (PSI)
        psi = self._calculate_psi(ref, curr)
        # Combined drift score: weighted KS statistic and PSI capped at 1
        drift_score = float(ks_stat * 0.6 + min(psi, 1) * 0.4)
        # Flag drift when either test signals it
        is_drifted = bool(p_value < 0.05 or psi > 0.2)
        return DriftReport(
            feature_name=feature_name,
            drift_type="feature_drift",
            drift_score=drift_score,
            p_value=float(p_value),
            is_drifted=is_drifted,
            recommendation=self._get_recommendation(drift_score, is_drifted)
        )

    def _calculate_psi(self, expected: np.ndarray, actual: np.ndarray) -> float:
        """Compute the PSI with decile bins taken from the reference data."""
        # Bin edges at the reference deciles, open-ended at both extremes;
        # np.unique drops duplicate edges caused by repeated values
        bins = np.percentile(expected, np.linspace(0, 100, 11))
        bins[0] = -np.inf
        bins[-1] = np.inf
        bins = np.unique(bins)
        # Share of samples in each bin
        expected_pct = np.histogram(expected, bins)[0] / len(expected)
        actual_pct = np.histogram(actual, bins)[0] / len(actual)
        # Avoid division by zero and log(0)
        expected_pct = np.where(expected_pct == 0, 0.0001, expected_pct)
        actual_pct = np.where(actual_pct == 0, 0.0001, actual_pct)
        # PSI = sum((actual - expected) * ln(actual / expected))
        psi = np.sum((actual_pct - expected_pct) *
                     np.log(actual_pct / expected_pct))
        return float(psi)

    def _get_recommendation(self, drift_score: float, is_drifted: bool) -> str:
        # Keep the advice consistent with the is_drifted flag: a feature that
        # is flagged as drifted is never reported as normal
        if drift_score >= 0.2:
            return "Severe drift, trigger retraining"
        if is_drifted or drift_score >= 0.1:
            return "Mild drift, increase monitoring"
        return "Normal, no action needed"
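The PSI in `_calculate_psi` is the sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the reference and current shares of samples in bin i. A small worked example in plain Python makes the 0.2 threshold used in `detect_drift` tangible. The bin proportions below are made up, and `psi_from_proportions` is a hypothetical helper that takes already-binned shares instead of raw samples.

```python
import math
from typing import Sequence


def psi_from_proportions(expected: Sequence[float], actual: Sequence[float],
                         floor: float = 1e-4) -> float:
    """PSI = sum((a_i - e_i) * ln(a_i / e_i)) over matching bins."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) and division by zero
        total += (a - e) * math.log(a / e)
    return total


reference = [0.25, 0.25, 0.25, 0.25]  # share of readings per quartile bin
same = [0.25, 0.25, 0.25, 0.25]
shifted = [0.10, 0.20, 0.30, 0.40]    # mass moved toward higher readings

print(f"PSI (unchanged): {psi_from_proportions(reference, same):.3f}")
print(f"PSI (shifted):   {psi_from_proportions(reference, shifted):.3f}")
```

The shifted case comes out at roughly 0.23, just above the 0.2 cut-off, so it would be flagged as drift; every term of the sum is non-negative, which is why the PSI is zero only when the two distributions match bin for bin.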
15.5 Knowledge Map Summary
[Figure: mind map of ML and big data integration]
Feature engineering: feature extraction, feature selection, feature storage
Model training: distributed training, hyperparameter tuning, model selection
Model serving: batch inference, real-time inference, A/B testing
Monitoring and operations: drift detection, automated retraining, performance monitoring
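| Module | Components | Industrial use cases |
|---|---|---|
| Feature engineering | Spark/Flink | Sensor feature extraction, time-series statistics |
| Model training | Spark MLlib, TensorFlow | Anomaly detection, failure prediction |
| Model serving | MLflow, Seldon | Real-time inference, batch scoring |
| Drift monitoring | In-house/Evidently | Data drift detection, model monitoring |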
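Next Issue
Issue 16 takes a close look at "Real-Time Stream Processing Architecture", covering how Kafka, Flink, and Kafka Connect combine into an end-to-end real-time data pipeline. Stay tuned!
Author: a researcher in intelligent blast furnace ironmaking technology, focused on the intersection of iron and steel metallurgy and artificial intelligence.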
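Copyright belongs to the author. Do not plagiarize, repurpose, or use commercially (or for any other profit-making purpose) without permission.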