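Issue 20: Fault Diagnosis and Root Cause Analysis - Intelligent Reasoning from Symptom to Essence
Introduction: Fault diagnosis and root cause analysis are where an industrial big data platform delivers its core value. This issue covers rule-based, statistical, machine-learning-based, and knowledge-graph-based fault diagnosis methods, examines industrial fault propagation models and causal inference techniques, and walks through code for a complete fault diagnosis and root cause analysis system.
20.1 Fault Diagnosis Technology Landscape
20.1.1 Fault Diagnosis Methodology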
┌────────────────────────────────────────────────────────────────────┐
│                Taxonomy of fault diagnosis methods                 │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Rule-based diagnosis                                        │  │
│  │  - Encodes expert knowledge                                  │  │
│  │  - Threshold checks, state machines                          │  │
│  │  - Pro: highly interpretable; con: limited coverage          │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Statistical diagnosis                                       │  │
│  │  - Hypothesis testing, analysis of variance (ANOVA)          │  │
│  │  - Control charts (SPC)                                      │  │
│  │  - Pro: sound theory; con: assumptions may not hold          │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Machine-learning-based diagnosis                            │  │
│  │  - Classifiers: SVM, random forest, neural networks          │  │
│  │  - Pro: learns features; con: needs much labeled data        │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Knowledge-graph-based diagnosis                             │  │
│  │  - Entity-relation modeling, graph reasoning                 │  │
│  │  - Pro: knowledge reuse; con: hard to acquire knowledge      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  Causal-inference-based diagnosis                            │  │
│  │  - Structural causal models, do-calculus                     │  │
│  │  - Pro: finds true causes; con: high compute cost            │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
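The taxonomy lists control charts (SPC) under statistical diagnosis. As a minimal sketch, assuming limits are set at the mean plus or minus three sample standard deviations of an in-control baseline (the function names `control_limits` and `out_of_control` and the sample values are illustrative, not from the original):

```python
from statistics import mean, stdev
from typing import List, Tuple


def control_limits(baseline: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) limits as mean -/+ k sample standard deviations."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center + k * sigma


def out_of_control(samples: List[float], limits: Tuple[float, float]) -> List[int]:
    """Return the indices of samples that fall outside the control limits."""
    lower, upper = limits
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Baseline readings taken while the equipment was known to be healthy (made-up values)
baseline = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 70.1, 69.7]
limits = control_limits(baseline)
flagged = out_of_control([70.0, 70.4, 72.5, 69.9], limits)
print(limits, flagged)
```

Production control charts often estimate sigma from moving ranges and add run rules; this sketch only shows the basic limit check.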
20.1.2 Fault Diagnosis System Architecture
Architecture diagram (four layers):
- Data collection: sensor data, system logs, monitoring metrics
- Analysis engines: rule engine, statistical analysis, machine learning, knowledge graph
- Diagnosis layer: fault detection, fault classification, root cause analysis
- Output: alert notifications, diagnostic reports, remediation recommendations
20.2 Rule-Based Fault Diagnosis
20.2.1 Designing a Rule Base for Industrial Equipment
python
# fault_rules.py
import operator
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


@dataclass
class FaultRule:
    """A fault diagnosis rule."""
    rule_id: str
    name: str
    description: str
    severity: Severity
    conditions: List[Dict]  # trigger conditions (all must hold)
    causes: List[str]       # possible causes
    actions: List[Dict]     # recommended actions
    enabled: bool = True


class IndustrialFaultRuleEngine:
    """Rule engine for industrial equipment faults."""

    _OPS = {
        ">": operator.gt,
        "<": operator.lt,
        ">=": operator.ge,
        "<=": operator.le,
        "==": operator.eq,
    }

    def __init__(self):
        self.rules = self._load_rules()
        # State for duration and rate-of-change conditions. It is not keyed by
        # equipment, so use one engine instance per piece of equipment.
        self.state = {}

    def _load_rules(self) -> List[FaultRule]:
        return [
            # ========== Motor fault rules ==========
            FaultRule(
                rule_id="MOTOR_001",
                name="Motor bearing overtemperature",
                description="Motor bearing temperature exceeds the threshold",
                severity=Severity.WARNING,
                conditions=[
                    {"type": "threshold", "field": "bearing_temp", "op": ">", "value": 80},
                    {"type": "duration", "field": "bearing_temp", "op": ">", "value": 75, "duration_seconds": 300}
                ],
                causes=["Insufficient lubrication", "Bearing wear", "Overload", "Improper installation"],
                actions=[
                    {"type": "alert", "level": "warning"},
                    {"type": "reduce_load", "percentage": 20},
                    {"type": "schedule_maintenance", "priority": "high"}
                ]
            ),
            FaultRule(
                rule_id="MOTOR_002",
                name="Abnormal motor vibration",
                description="Motor vibration exceeds the normal range",
                severity=Severity.ERROR,
                conditions=[
                    {"type": "threshold", "field": "vibration_rms", "op": ">", "value": 4.5},
                    {"type": "trend", "field": "vibration_rms", "op": "increasing", "rate": 0.1}
                ],
                causes=["Imbalance", "Misalignment", "Bearing fault", "Resonance"],
                actions=[
                    {"type": "alert", "level": "error"},
                    {"type": "analyze_frequency"},
                    {"type": "schedule_maintenance", "priority": "critical"}
                ]
            ),
            FaultRule(
                rule_id="MOTOR_003",
                name="Motor current unbalance",
                description="Three-phase current unbalance rate exceeds 10%",
                severity=Severity.WARNING,
                conditions=[
                    {"type": "formula", "expression": "current_unbalance_rate", "op": ">", "value": 10}
                ],
                causes=["Power supply problem", "Winding fault", "Poor contact"],
                actions=[
                    {"type": "alert", "level": "warning"},
                    {"type": "check_power_supply"},
                    {"type": "schedule_inspection"}
                ]
            ),
            # ========== Pump fault rules ==========
            FaultRule(
                rule_id="PUMP_001",
                name="Pump flow drop",
                description="Pump flow rate is below 80% of the normal value",
                severity=Severity.WARNING,
                conditions=[
                    {"type": "threshold", "field": "flow_rate", "op": "<", "value": 0.8, "baseline": "flow_baseline"},
                    {"type": "duration", "field": "flow_rate", "op": "<", "value": 0.85,
                     "baseline": "flow_baseline", "duration_seconds": 600}
                ],
                causes=["Impeller wear", "Inlet blockage", "Seal leakage", "Insufficient motor power"],
                actions=[
                    {"type": "alert", "level": "warning"},
                    {"type": "check_inlet"},
                    {"type": "measure_seal"},
                    {"type": "schedule_inspection"}
                ]
            ),
            FaultRule(
                rule_id="PUMP_002",
                name="Pump cavitation",
                description="Pump inlet pressure is too low, causing cavitation",
                severity=Severity.CRITICAL,
                conditions=[
                    {"type": "threshold", "field": "inlet_pressure", "op": "<", "value": -0.05},
                    {"type": "pattern", "field": "vibration_spectrum", "pattern": "cavitation"}
                ],
                causes=["Inlet pressure too low", "Liquid temperature too high", "Excessive flow rate"],
                actions=[
                    {"type": "alert", "level": "critical"},
                    {"type": "reduce_flow"},
                    {"type": "check_npsh"},
                    {"type": "emergency_shutdown", "if": "cavitation_severe"}
                ]
            ),
            # ========== Temperature anomaly rules ==========
            FaultRule(
                rule_id="TEMP_001",
                name="Rapid temperature rise",
                description="Temperature rises rapidly within a short time",
                severity=Severity.ERROR,
                conditions=[
                    {"type": "rate_of_change", "field": "temperature", "op": ">", "value": 5, "window_seconds": 60},
                    {"type": "trend", "field": "temperature", "op": "increasing", "consecutive": 5}
                ],
                causes=["Cooling system failure", "Overload operation", "Insufficient lubrication"],
                actions=[
                    {"type": "alert", "level": "error"},
                    {"type": "check_cooling"},
                    {"type": "reduce_load"},
                    {"type": "prepare_shutdown"}
                ]
            ),
        ]

    def evaluate_rule(self, rule: FaultRule, data: Dict) -> Optional[Dict]:
        """Return a diagnosis dict if every condition of the rule holds, else None."""
        # Evaluate all conditions without short-circuiting so that stateful
        # conditions (duration, rate_of_change) see every sample.
        outcomes = [self._evaluate_condition(c, data) for c in rule.conditions]
        if not all(outcomes):
            return None
        return {
            "rule_id": rule.rule_id,
            "name": rule.name,
            "severity": rule.severity,
            "timestamp": data.get("timestamp"),
            "triggered_data": data,
            "causes": rule.causes,
            "actions": rule.actions
        }

    def diagnose(self, sensor_data: Dict) -> List[Dict]:
        """Run all enabled rules against one sample, most severe result first."""
        results = []
        for rule in self.rules:
            if not rule.enabled:
                continue
            result = self.evaluate_rule(rule, sensor_data)
            if result:
                results.append(result)
        # Sort by severity
        severity_order = {
            Severity.CRITICAL: 0,
            Severity.ERROR: 1,
            Severity.WARNING: 2,
            Severity.INFO: 3
        }
        results.sort(key=lambda x: severity_order[x["severity"]])
        return results

    def _evaluate_condition(self, condition: Dict, data: Dict) -> bool:
        """Evaluate a single condition against one sample."""
        cond_type = condition["type"]

        if cond_type == "threshold":
            value = data.get(condition["field"])
            if value is None:
                return False
            if "baseline" in condition:
                # Compare the ratio to the baseline, e.g. flow_rate / flow_baseline
                baseline = data.get(condition["baseline"])
                if not baseline:
                    return False
                value = value / baseline
            return self._compare(value, condition["op"], condition["value"])

        if cond_type == "duration":
            # Count consecutive samples that satisfy the threshold. This assumes
            # one sample per second, so the count equals the elapsed seconds.
            key = f"{condition['field']}_{condition['op']}_{condition['value']}"
            # "type" is set last so the nested call is evaluated as a threshold check.
            holds = self._evaluate_condition({**condition, "type": "threshold"}, data)
            self.state[key] = self.state.get(key, 0) + 1 if holds else 0
            return holds and self.state[key] >= condition.get("duration_seconds", 0)

        if cond_type == "rate_of_change":
            current = data.get(condition["field"])
            if current is None:
                return False
            history = self.state.setdefault(f"{condition['field']}_history", [])
            history.append(current)
            if len(history) > 2:
                history.pop(0)
            if len(history) == 2:
                # Absolute change between two consecutive samples;
                # "window_seconds" is not applied here.
                rate = abs(history[1] - history[0])
                return self._compare(rate, condition["op"], condition["value"])
            return False

        # "trend", "formula" and "pattern" conditions are not implemented in this
        # listing; they evaluate to False, so rules that use them never trigger.
        return False

    def _compare(self, value, op, threshold) -> bool:
        func = self._OPS.get(op)
        return func(value, threshold) if func else False
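The duration condition above counts consecutive samples, which matches elapsed seconds only at a 1 Hz sampling rate. A minimal sketch of a timestamp-based alternative, assuming each sample carries a numeric timestamp in epoch seconds (the `DurationTracker` class is hypothetical and not part of the original engine):

```python
from typing import Optional


class DurationTracker:
    """Tracks how long a boolean condition has held without interruption."""

    def __init__(self, required_seconds: float):
        self.required_seconds = required_seconds
        self.since: Optional[float] = None  # timestamp at which the condition started to hold

    def update(self, holds: bool, timestamp: float) -> bool:
        """Feed one sample; return True once the condition has held long enough."""
        if not holds:
            self.since = None  # any interruption restarts the clock
            return False
        if self.since is None:
            self.since = timestamp
        return timestamp - self.since >= self.required_seconds


tracker = DurationTracker(required_seconds=300)
# Samples arrive every 100 s; the bearing temperature stays above 75 throughout.
readings = [(0, 78.0), (100, 79.5), (200, 77.2), (300, 80.1)]
fired = [tracker.update(temp > 75, ts) for ts, temp in readings]
print(fired)
```

Because the tracker compares timestamps rather than counting samples, it behaves the same way when the sampling interval changes or samples are dropped.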
20.3 Machine-Learning-Based Fault Diagnosis
20.3.1 Fault Classification Model
python
# fault_classification.py
from typing import Dict

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StandardScaler, StringIndexer, VectorAssembler
from pyspark.sql import functions as F

FEATURE_COLS = [
    "avg_temp", "std_temp", "avg_vibration", "std_vibration",
    "avg_pressure", "std_pressure", "max_current", "min_current",
]


class FaultClassificationSystem:
    """Fault classification system."""

    def __init__(self, spark):
        self.spark = spark
        self.models = {}
        self.labels = []  # fault type names, indexed by numeric label

    def prepare_training_data(self):
        """Build one feature row per (equipment, fault type) from the sensor
        data recorded within one hour of each fault."""
        # Historical fault records and the matching sensor time series
        fault_df = self.spark.table("maintenance.fault_records").alias("f")
        sensor_df = self.spark.table("sensors.time_series_data").alias("s")

        # Join sensor samples that fall within +/- 1 hour of the fault time
        joined = fault_df.join(
            sensor_df,
            (F.col("f.equipment_id") == F.col("s.equipment_id"))
            & F.col("s.timestamp").between(
                F.col("f.fault_time") - F.expr("INTERVAL 1 HOUR"),
                F.col("f.fault_time") + F.expr("INTERVAL 1 HOUR"),
            ),
        )
        # Qualify equipment_id, which exists on both sides of the join
        return joined.groupBy(
            F.col("f.equipment_id").alias("equipment_id"), "fault_type"
        ).agg(
            F.avg("temperature").alias("avg_temp"),
            F.stddev("temperature").alias("std_temp"),
            F.avg("vibration").alias("avg_vibration"),
            F.stddev("vibration").alias("std_vibration"),
            F.avg("pressure").alias("avg_pressure"),
            F.stddev("pressure").alias("std_pressure"),
            F.max("current").alias("max_current"),
            F.min("current").alias("min_current"),
        )

    def train_fault_classifier(self, train_df):
        """Train a random forest classifier and evaluate it on a held-out split."""
        # Encode the string fault type as a numeric label
        indexer_model = StringIndexer(inputCol="fault_type", outputCol="label").fit(train_df)
        self.labels = list(indexer_model.labels)
        indexed_df = indexer_model.transform(train_df)
        train_split, test_split = indexed_df.randomSplit([0.8, 0.2], seed=42)

        # Assemble the feature vector; rows with null features are skipped
        assembler = VectorAssembler(
            inputCols=FEATURE_COLS,
            outputCol="features",
            handleInvalid="skip"
        )
        # Standardize
        scaler = StandardScaler(
            inputCol="features",
            outputCol="scaled_features",
            withMean=True,
            withStd=True
        )
        # Classifier
        rf = RandomForestClassifier(
            featuresCol="scaled_features",
            labelCol="label",
            numTrees=100,
            maxDepth=10,
            impurity="gini"
        )
        # Pipeline
        model = Pipeline(stages=[assembler, scaler, rf]).fit(train_split)
        self.models["fault_classifier"] = model

        # Evaluate on data the model has not seen during training
        predictions = model.transform(test_split)
        evaluator = MulticlassClassificationEvaluator(
            labelCol="label",
            predictionCol="prediction",
            metricName="accuracy"
        )
        accuracy = evaluator.evaluate(predictions)

        # Feature importances, in the order of FEATURE_COLS
        feature_importance = model.stages[-1].featureImportances.toArray()
        return model, accuracy, feature_importance

    def predict_fault_type(self, sensor_data: Dict, model):
        """Predict the fault type. sensor_data must contain every column in FEATURE_COLS."""
        df = self.spark.createDataFrame(pd.DataFrame([sensor_data]))
        result = model.transform(df).select("prediction", "probability").collect()[0]

        # Top-3 most likely fault types; probabilities are ordered by label index
        probs = result.probability.toArray()
        top3 = sorted(zip(self.labels, probs), key=lambda x: x[1], reverse=True)[:3]
        return {
            "predicted_fault": self.labels[int(result.prediction)],
            "top3_predictions": [{"fault": f, "probability": float(p)} for f, p in top3],
            "confidence": float(probs.max())
        }
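The top-3 step above pairs each class probability with a fault name. A small sketch of that mapping in plain Python, assuming the probability vector is ordered by numeric label index and `labels` lists the fault names in the same order (the `top_k_faults` helper and the sample values are illustrative):

```python
from typing import Dict, List, Sequence


def top_k_faults(labels: Sequence[str], probs: Sequence[float], k: int = 3) -> List[Dict]:
    """Return the k most probable fault types, highest probability first."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [{"fault": name, "probability": p} for name, p in ranked[:k]]


# Hypothetical label order and class probabilities for one sample
labels = ["bearing_wear", "imbalance", "misalignment", "normal"]
probs = [0.15, 0.55, 0.25, 0.05]
top3 = top_k_faults(labels, probs)
print(top3)
```

Keeping the label list next to the model matters: if the two get out of sync, the probabilities are silently attributed to the wrong fault types.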
20.3.2 Anomaly Detection Model
python
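# anomaly_detection.py
# PySpark has no built-in pyspark.ml.anomaly module, so this listing uses
# scikit-learn's IsolationForest on data collected to the driver. It suits
# data volumes that fit in driver memory.
from sklearn.ensemble import IsolationForest

FEATURE_COLS = ["temperature", "vibration", "pressure", "current", "voltage"]


class AnomalyDetectionSystem:
    """Anomaly detection system."""

    def __init__(self, spark):
        self.spark = spark
        self.model = None

    def train_isolation_forest(self, normal_df):
        """Train an Isolation Forest on (mostly) normal operating data."""
        train_pdf = normal_df.select(*FEATURE_COLS).dropna().toPandas()
        self.model = IsolationForest(
            n_estimators=100,
            contamination=0.01,  # assume about 1% of samples are anomalous
            random_state=42
        ).fit(train_pdf)
        return self.model

    def detect_anomalies(self, test_df):
        """Return the rows of test_df flagged as anomalous, as a pandas DataFrame."""
        if self.model is None:
            raise RuntimeError("call train_isolation_forest() first")
        test_pdf = test_df.dropna(subset=FEATURE_COLS).toPandas()
        features = test_pdf[FEATURE_COLS]
        # score_samples() is lower for abnormal points; negate so higher = more anomalous
        test_pdf["anomaly_score"] = -self.model.score_samples(features)
        # predict() returns -1 for anomalies and 1 for normal points
        test_pdf["prediction"] = self.model.predict(features)
        return test_pdf[test_pdf["prediction"] == -1]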
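The `contamination` setting above fixes the expected share of anomalies, which in turn determines a cutoff on the anomaly score. A simplified sketch of that idea in plain Python, assuming higher scores mean more anomalous (the `threshold_from_contamination` helper and the scores are illustrative, and real libraries may compute the quantile differently):

```python
import math
from typing import List


def threshold_from_contamination(scores: List[float], contamination: float) -> float:
    """Pick the score at or above which roughly `contamination` of the samples fall."""
    if not 0 < contamination < 1:
        raise ValueError("contamination must be in (0, 1)")
    ranked = sorted(scores, reverse=True)
    n_anomalies = max(1, math.floor(contamination * len(ranked)))
    return ranked[n_anomalies - 1]


# Made-up anomaly scores for ten samples; the last one is clearly unusual
scores = [0.41, 0.39, 0.44, 0.40, 0.43, 0.38, 0.42, 0.45, 0.40, 0.91]
threshold = threshold_from_contamination(scores, contamination=0.1)
anomalies = [i for i, s in enumerate(scores) if s >= threshold]
print(threshold, anomalies)
```

If the assumed contamination is too high, normal samples are flagged to fill the quota; if it is too low, real anomalies are missed, so the value is worth validating against maintenance records.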
20.4 Root Cause Analysis Methods
20.4.1 Root Cause Analysis Based on Fault Propagation Chains
Flowchart: equipment fault alarm → query the fault propagation chain (from the fault graph) → locate the root cause. A single-point fault is repaired directly; a multi-point fault goes through propagation path analysis to determine the intervention points before the repair is executed.
python
# root_cause_analysis.py
from typing import Dict, List

import networkx as nx


class RootCauseAnalyzer:
    """Root cause analysis based on fault propagation chains."""

    def __init__(self):
        self.fault_graph = nx.DiGraph()

    def build_fault_graph(self, historical_faults):
        """Build the fault propagation graph from historical fault data.

        Nodes are fault types. Edges point from cause to effect and are
        weighted by how often that propagation was observed.
        """
        for fault_pair in historical_faults:
            cause = fault_pair["cause"]
            effect = fault_pair["effect"]
            weight = fault_pair.get("frequency", 1)
            if self.fault_graph.has_edge(cause, effect):
                self.fault_graph[cause][effect]["weight"] += weight
            else:
                self.fault_graph.add_edge(cause, effect, weight=weight)

    def find_root_cause(self, observed_faults: List[str]) -> Dict:
        """Rank root cause candidates with a graph heuristic.

        This is frequency-based reasoning over the propagation graph,
        not a full Bayesian network inference.
        """
        observed = set(observed_faults)
        root_causes = []
        for fault in observed_faults:
            if fault in self.fault_graph:
                predecessors = set(self.fault_graph.predecessors(fault))
            else:
                predecessors = set()  # fault unknown to the graph
            external_causes = predecessors - observed
            if external_causes:
                # Unobserved upstream faults: score each one by its share of
                # the historical propagation frequency into this fault
                total = sum(self.fault_graph[c][fault]["weight"] for c in predecessors)
                for cause in external_causes:
                    weight = self.fault_graph[cause][fault]["weight"]
                    root_causes.append({
                        "cause": cause,
                        "effect": fault,
                        "propagation_probability": weight / total if total else 0.0
                    })
            elif not predecessors:
                # No known upstream fault: this fault is itself a root candidate
                root_causes.append({
                    "cause": fault,
                    "effect": None,
                    "propagation_probability": 1.0
                })
            # Otherwise every predecessor was observed, so this fault is
            # explained by another observed fault and is not a root.

        # Sort and return the results
        root_causes.sort(key=lambda x: x["propagation_probability"], reverse=True)
        return {
            "observed_faults": observed_faults,
            "root_causes": root_causes[:5],
            "propagation_chain": self._find_propagation_chain(observed_faults),
            "intervention_points": self._find_intervention_points(observed_faults)
        }

    def _find_propagation_chain(self, faults: List[str]) -> List[List[str]]:
        """Return the longest weighted path in each connected group of observed faults."""
        subgraph = self.fault_graph.subgraph([f for f in faults if f in self.fault_graph])
        chains = []
        for component in nx.weakly_connected_components(subgraph):
            try:
                path = nx.dag_longest_path(subgraph.subgraph(component), weight="weight")
            except nx.NetworkXUnfeasible:
                continue  # the component contains a cycle, so it is not a DAG
            if len(path) > 1:
                chains.append(path)
        return chains

    def _find_intervention_points(self, faults: List[str]) -> List[str]:
        """Find the best intervention points.

        An intervention point is an observed fault whose removal cuts off
        the largest number of other observed faults downstream.
        """
        observed = set(faults)
        intervention_scores = {}
        for fault in faults:
            if fault not in self.fault_graph:
                continue
            # Count how many other observed faults this node can reach
            affected = nx.descendants(self.fault_graph, fault) & observed
            if affected:
                intervention_scores[fault] = len(affected)
        # Return the highest-scoring intervention points
        sorted_points = sorted(
            intervention_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return [point for point, score in sorted_points[:3]]
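The analyzer above ranks upstream causes by historical propagation frequency. One way to make those frequencies comparable across faults is to normalize them per effect. The sketch below does this without a graph library, assuming counts are stored as `(cause, effect) -> frequency`; the `cause_likelihoods` helper and the fault names are illustrative:

```python
from collections import defaultdict
from typing import Dict, Tuple


def cause_likelihoods(
    frequencies: Dict[Tuple[str, str], int], effect: str
) -> Dict[str, float]:
    """Share of historical propagations into `effect` attributed to each direct cause."""
    incoming: Dict[str, int] = defaultdict(int)
    for (cause, target), count in frequencies.items():
        if target == effect:
            incoming[cause] += count
    total = sum(incoming.values())
    if total == 0:
        return {}  # no recorded cause for this effect
    return {cause: count / total for cause, count in incoming.items()}


# Made-up propagation history: (cause, effect) -> number of observed occurrences
history = {
    ("cooling_failure", "bearing_overheat"): 6,
    ("lubrication_loss", "bearing_overheat"): 3,
    ("overload", "bearing_overheat"): 1,
    ("bearing_overheat", "motor_trip"): 8,
}
likelihoods = cause_likelihoods(history, "bearing_overheat")
print(likelihoods)
```

The resulting shares are relative frequencies conditioned on the effect having occurred. They ignore causes that never appear in the history, so a rare but real cause gets no score at all.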
20.4.2 Root Cause Analysis Based on Time-Series Causal Inference
python
# causal_inference.py
from typing import Dict, List

import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests


class CausalInferenceAnalyzer:
    """Root cause analysis based on Granger causality tests."""

    def __init__(self):
        self.causal_relations = {}

    def granger_causality_test(self, time_series_dict: Dict[str, pd.Series],
                               max_lag: int = 5) -> Dict:
        """Run pairwise Granger causality tests in both directions."""
        results = {}
        variables = list(time_series_dict.keys())
        for i, var1 in enumerate(variables):
            for var2 in variables[i + 1:]:
                # Does var1 Granger-cause var2?
                p_value_1_to_2 = self._granger_test(
                    time_series_dict[var2],
                    time_series_dict[var1],
                    max_lag
                )
                # Does var2 Granger-cause var1?
                p_value_2_to_1 = self._granger_test(
                    time_series_dict[var1],
                    time_series_dict[var2],
                    max_lag
                )
                results[(var1, var2)] = {
                    f"{var1}->{var2}": p_value_1_to_2,
                    f"{var2}->{var1}": p_value_2_to_1,
                    f"{var1} causes {var2}": p_value_1_to_2 < 0.05,
                    f"{var2} causes {var1}": p_value_2_to_1 < 0.05
                }
        return results

    def _granger_test(self, target: pd.Series, predictor: pd.Series,
                      max_lag: int) -> float:
        """Return the p-value for "predictor Granger-causes target"."""
        # grangercausalitytests checks whether the second column helps
        # predict the first, so the target goes first.
        data = pd.concat([target, predictor], axis=1).dropna()
        try:
            result = grangercausalitytests(
                data, maxlag=max_lag, verbose=False
            )
        except Exception:
            # Too few observations or a degenerate series: treat as "no evidence"
            return 1.0
        # Smallest p-value across lags. No multiple-testing correction is
        # applied, so this is optimistic when max_lag is large.
        return min(
            result[lag][0]["ssr_ftest"][1]
            for lag in range(1, max_lag + 1)
        )

    def find_causal_roots(self, event_series: Dict[str, pd.Series],
                          target_event: str) -> List[Dict]:
        """Find the series that Granger-cause the target event."""
        causal_relations = self.granger_causality_test(event_series)
        # Keep the relations that point at the target, whichever side of
        # the pair key the target is on
        causes = []
        for (var1, var2), result in causal_relations.items():
            if target_event not in (var1, var2):
                continue
            cause = var1 if var2 == target_event else var2
            if result[f"{cause} causes {target_event}"]:
                p_value = result[f"{cause}->{target_event}"]
                causes.append({
                    "cause": cause,
                    "effect": target_event,
                    "p_value": p_value,
                    # Heuristic ranking score, not a probability of causation
                    "confidence": 1 - p_value
                })
        # Sort by score
        causes.sort(key=lambda x: x["confidence"], reverse=True)
        return causes[:10]
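For reference, the `ssr_ftest` p-value used above comes from comparing a restricted autoregression of the target on its own lags with an unrestricted one that also includes lags of the predictor. A sketch of the standard form, with symbols introduced here for illustration ($y$ the target, $x$ the predictor, $p$ the lag order, $T$ the number of usable observations):

```latex
\begin{aligned}
\text{restricted:}\quad   & y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, y_{t-i} + \varepsilon_t \\
\text{unrestricted:}\quad & y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, y_{t-i}
                            + \sum_{i=1}^{p} \beta_i\, x_{t-i} + u_t \\
H_0:\quad                 & \beta_1 = \dots = \beta_p = 0 \\
F                         &= \frac{(\mathrm{RSS}_r - \mathrm{RSS}_u)/p}
                                  {\mathrm{RSS}_u/(T - 2p - 1)}
\end{aligned}
```

Rejecting $H_0$ means that past values of $x$ improve the prediction of $y$. That is predictive precedence, not proof of a physical cause, so the result is best treated as a shortlist for engineering review.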
20.5 Fault Diagnosis Case Study
20.5.1 Production Line Fault Diagnosis System
python
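# production_fault_diagnosis.py
from typing import Dict, List

from fault_classification import FaultClassificationSystem
from fault_rules import IndustrialFaultRuleEngine, Severity
from root_cause_analysis import RootCauseAnalyzer


class ProductionFaultDiagnosisSystem:
    """Fault diagnosis system for a production line."""

    def __init__(self, spark):
        self.spark = spark
        self.rule_engine = IndustrialFaultRuleEngine()
        self.ml_classifier = FaultClassificationSystem(spark)
        self.rca_analyzer = RootCauseAnalyzer()

    def diagnose_equipment(self, equipment_id: str, time_window_minutes: int = 30):
        """Diagnose faults on one piece of equipment."""
        # 1. Collect recent sensor data
        sensor_data = self._collect_sensor_data(equipment_id, time_window_minutes)

        # 2. Rule-based diagnosis
        rule_results = self.rule_engine.diagnose(sensor_data)

        # 3. ML-based diagnosis, skipped until a classifier has been trained.
        #    sensor_data must also carry the aggregate features the model expects.
        model = self.ml_classifier.models.get("fault_classifier")
        ml_result = (
            self.ml_classifier.predict_fault_type(sensor_data, model)
            if model is not None else None
        )

        # 4. Combine: rule hits take precedence over the ML prediction
        if rule_results:
            primary_fault = rule_results[0]
            causes = primary_fault["causes"]
        elif ml_result:
            primary_fault = {
                "rule_id": "ML_DETECTED",
                "name": ml_result["predicted_fault"],
                "severity": Severity.WARNING
            }
            causes = [p["fault"] for p in ml_result["top3_predictions"]]
        else:
            primary_fault = None
            causes = []

        # 5. Root cause analysis when several faults fire together
        if len(rule_results) > 1:
            observed_faults = [r["name"] for r in rule_results]
            rca_result = self.rca_analyzer.find_root_cause(observed_faults)
        else:
            rca_result = None

        # 6. Build the diagnostic report
        return {
            "equipment_id": equipment_id,
            "timestamp": sensor_data.get("timestamp"),
            "primary_fault": primary_fault,
            "all_faults": rule_results,
            "ml_predictions": ml_result,
            "root_cause_analysis": rca_result,
            "recommended_actions": (
                self._generate_actions(primary_fault, causes) if primary_fault else []
            ),
            "maintenance_window": (
                self._estimate_maintenance_window(primary_fault) if primary_fault else None
            )
        }

    def _collect_sensor_data(self, equipment_id: str, time_window_minutes: int) -> Dict:
        """Not shown in the original listing; depends on the data platform."""
        raise NotImplementedError

    def _estimate_maintenance_window(self, fault: Dict):
        """Not shown in the original listing; depends on the maintenance policy."""
        raise NotImplementedError

    def _generate_actions(self, fault: Dict, causes: List[str]) -> List[Dict]:
        """Generate recommended actions for the top three causes."""
        actions = []
        for cause in causes[:3]:
            actions.append({
                "action": f"Inspect: {cause}",
                "priority": "high" if fault["severity"] in [Severity.CRITICAL, Severity.ERROR] else "medium",
                "estimated_time": self._estimate_action_time(cause)
            })
        return actions

    def _estimate_action_time(self, cause: str) -> str:
        """Estimate the handling time for a cause."""
        time_mapping = {
            "Insufficient lubrication": "30 minutes",
            "Bearing wear": "2 hours",
            "Seal leakage": "1 hour",
            "Cooling system failure": "1 hour"
        }
        return time_mapping.get(cause, "30 minutes")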
20.6 Knowledge Summary
Fault diagnosis knowledge map:
- Rule-based diagnosis: threshold rules, state machines, pattern matching
- Machine learning: classification models, anomaly detection, predictive models
- Root cause analysis: fault propagation, causal inference, Bayesian networks
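| Method | Applicable scenario | Advantages | Disadvantages |
|---|---|---|---|
| Rule-based diagnosis | Known failure modes | Interpretable, good real-time performance | Limited coverage |
| Machine learning | Complex failure modes | Learns automatically | Requires labeled data |
| Knowledge graph | Fault propagation analysis | Knowledge reuse | Knowledge is hard to acquire |
| Causal inference | Root cause localization | Identifies true causes | High computational cost |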
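Next Issue
Issue 21 takes a close look at "Hadoop Enterprise Best Practices", covering architecture design, security governance, and operations management for enterprise Hadoop deployments. Stay tuned!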
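Author: a researcher in intelligent blast-furnace ironmaking technology, focusing on the intersection of iron and steel metallurgy and artificial intelligence.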
Copyright belongs to the author. Do not copy, adapt, or use commercially (or for any other profit-making purpose) without permission.