Diagnosing and Observing the Acquisition Chain: When the Data Looks "Wrong", Which Layer Do You Check First?

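The previous article went deep into ring buffers, compression algorithms, and feature extraction for high-speed acquisition, which solved the problem of collecting data fast enough. One more troublesome assumption was never challenged, though: how do you know the data you collected is correct?

Let me start with a case I went through myself.

At 2:17 a.m. the on-call phone rang. The operator said: "For the past 30 minutes the reactor temperature curve has been jumping every 5 seconds, from 85°C to 120°C, staying there for 2 seconds and then dropping back." He had checked the display on the PLC itself: 86.3°C, stable. So he concluded that the gateway or the acquisition program was at fault. Half an hour later he sent me the gateway log, and the logged values were indeed jumping.

When I got on site the next day, though, I found that the gateway log had faithfully recorded exactly what the PLC returned. The fault was in the PLC's analog input (AI) module: the common-mode voltage on one analog input channel was out of range. The operator had been looking at the PLC's HMI value, which goes through internal filtering and averaging, while the gateway read the raw register value with no filtering at all.

That incident made me realize that in an acquisition chain with five or more layers, any layer can lie, and every layer has a convincing story for why the problem is not its own.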

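1. A layered fault model for the acquisition chain

Before getting into diagnostic tools, we need a shared vocabulary: split the acquisition chain into layers and define the failure modes of each.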
[Figure: the five layers of the acquisition chain, bottom to top]
L1 Physical layer: PLC / sensors / I/O modules / wiring
L2 Protocol layer: Modbus / OPC UA / MQTT
L3 Network layer: Ethernet / serial / 4G / Wi-Fi
L4 Acquisition gateway: acquisition program / gateway
L5 Application layer: web UI / database / SCADA

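The "data" each layer sees is whatever the layer below hands up, so verify layer by layer from the bottom up:

Layer	Typical faults	What you observe	Most often mistaken for
L1 Physical	Loose sensor wiring, damaged I/O channel, common-mode interference	Value jumps, frozen values, periodic glitches	A gateway bug
L2 Protocol	Mixed-up Transaction IDs, duplicates caused by timeout retransmission	Duplicate data, timestamp jumps	Duplicate processing in the application layer
L3 Network	Switch packet loss, TCP retransmission, saturated bandwidth	Missing data, latency spikes	A hung acquisition program
L4 Acquisition	Buffer overflow, GC pauses, thread starvation	Gaps in the data, period drift	A hardware fault
L5 Application	Write throttling in the time-series database, cached chart queries	Missing curve segments, delayed display	Acquisition has stopped

First rule of troubleshooting: always start at L1 and work upward one layer at a time. Don't spend a whole day analyzing L5 only to find that a wire was loose at L1.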

2. The three pillars of observability: Metrics, Logs, Traces

2.1 Metrics every stage must expose

A healthy acquisition node should expose the following metrics. I define them in Prometheus format:

"""
Acquisition gateway metrics (Prometheus format)

Deployment:
1. pip install prometheus_client
2. The acquisition process starts an HTTP server that exposes /metrics
3. Prometheus Server scrapes it periodically and Grafana displays it
"""
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server


class CollectorMetrics:
    """
    Observability metrics for the acquisition gateway

    All metrics fall into three types:
    - Gauge: instantaneous value (current cache backlog, connection state)
    - Counter: cumulative value (total points collected, error count)
    - Histogram: distribution (read latency, publish latency)
    """

    def __init__(self, port: int = 8000):
        # ===== Connection state =====
        self.mqtt_connected = Gauge(
            "plc_mqtt_connected", "MQTT connection state", ["gateway_id"])
        self.modbus_connected = Gauge(
            "plc_modbus_connected", "Modbus connection state",
            ["gateway_id", "plc_id"])

        # ===== Acquisition counters =====
        # Note: the tag_name label creates one time series per tag,
        # so watch label cardinality when there are thousands of tags.
        self.points_collected = Counter(
            "plc_points_collected_total", "Total points collected",
            ["gateway_id", "plc_id", "tag_name"])
        self.points_published = Counter(
            "plc_points_published_total", "Total points published",
            ["gateway_id", "topic"])
        self.read_errors = Counter(
            "plc_read_errors_total", "Read error count",
            ["gateway_id", "plc_id", "error_type"])

        # ===== Latency distributions =====
        # Read latency: time from sending the request to receiving the response
        self.read_latency = Histogram(
            "plc_read_latency_seconds",
            "Modbus/OPC UA read latency (seconds)",
            ["gateway_id", "plc_id"],
            buckets=(.001, .005, .01, .025, .05, .1, .25, .5, 1.0))
        # Publish latency: time spent in MQTT publish()
        self.publish_latency = Histogram(
            "plc_publish_latency_seconds",
            "MQTT publish latency (seconds)",
            ["gateway_id"],
            buckets=(.001, .005, .01, .025, .05, .1, .25, .5, 1.0))

        # ===== Cache state =====
        self.cache_usage = Gauge(
            "plc_cache_usage_bytes", "Cache usage (bytes)", ["gateway_id"])
        self.cache_queue_depth = Gauge(
            "plc_cache_queue_depth", "Depth of the publish queue",
            ["gateway_id"])

        # ===== Heartbeat =====
        self.last_read_timestamp = Gauge(
            "plc_last_read_timestamp_seconds",
            "Timestamp of the last successful read (Unix epoch)",
            ["gateway_id", "plc_id"])

        # Start the HTTP server
        start_http_server(port)
        print(f"Metrics HTTP server started on :{port}/metrics")

    def record_read(self, gateway_id: str, plc_id: str,
                    tag: str, latency: float, success: bool,
                    error_type: str = "timeout"):
        """Record one read operation (latency in seconds)"""
        self.read_latency.labels(
            gateway_id=gateway_id, plc_id=plc_id).observe(latency)
        if not success:
            # A failed read only counts as an error. It must not refresh
            # the "last successful read" heartbeat or the point counter.
            self.read_errors.labels(
                gateway_id=gateway_id, plc_id=plc_id,
                error_type=error_type).inc()
            return
        self.points_collected.labels(
            gateway_id=gateway_id, plc_id=plc_id,
            tag_name=tag).inc()
        self.last_read_timestamp.labels(
            gateway_id=gateway_id, plc_id=plc_id).set(time.time())


# ===== Usage example =====
if __name__ == "__main__":
    metrics = CollectorMetrics(port=8000)
    print("Check with: curl http://localhost:8000/metrics")

    # Simulated acquisition loop; the endpoint goes away when the process exits
    for i in range(100):
        latency = random.uniform(0.001, 0.050)
        metrics.record_read("gw_01", "s7_1200", "temperature",
                            latency, success=True)
        time.sleep(1)

While it is running, open http://localhost:8000/metrics and you will see output similar to this (abridged, values are illustrative):

# HELP plc_read_latency_seconds Modbus/OPC UA read latency (seconds)
# TYPE plc_read_latency_seconds histogram
plc_read_latency_seconds_bucket{gateway_id="gw_01",plc_id="s7_1200",le="0.001"} 5
plc_read_latency_seconds_bucket{...le="0.005"} 23
plc_read_latency_seconds_bucket{...le="0.01"} 67
plc_read_latency_seconds_bucket{...le="+Inf"} 100
plc_read_latency_seconds_count 100
# HELP plc_read_errors_total Read error count
# TYPE plc_read_errors_total counter
plc_read_errors_total{gateway_id="gw_01",plc_id="s7_1200",error_type="timeout"} 2
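The cumulative `_bucket` counters are easier to act on once they are turned into percentiles. The sketch below estimates a quantile from the four illustrative bucket values shown above, using linear interpolation inside a bucket in the same spirit as PromQL's `histogram_quantile()`. The function name and the sample list are my own, introduced only for illustration; in a real deployment you would normally let Prometheus or Grafana do this calculation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper bound,
    ending with (math.inf, total_count), like the *_bucket series.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bucket, so the
                # best we can report is that bucket's upper bound.
                return prev_bound
            if count == prev_count:
                return upper
            # Linear interpolation inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Illustrative values copied from the sample output (not real measurements)
sample = [(0.001, 5), (0.005, 23), (0.01, 67), (math.inf, 100)]
p50 = histogram_quantile(0.50, sample)
p95 = histogram_quantile(0.95, sample)
print(f"p50 ~ {p50 * 1000:.2f} ms, p95 >= {p95 * 1000:.0f} ms")
```

With these numbers the 95th percentile falls into the `+Inf` bucket, so the estimate can only say "at least 10 ms". That is a hint that the sample shows too few buckets to resolve the tail, which is why the bucket boundaries deserve as much thought as the metric itself.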

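2.2 Structured logs: don't spend your time on grep

Most people write acquisition logs like this:

# Anti-pattern
print(f"read temperature: {temp}")      # no timestamp
logging.info("data published")           # no context
logging.error("connection failed")       # no error code, no context

Why this fails: with 50 gateways each collecting 1,000 points per second, logs in this format are practically unsearchable.

The right approach is structured logging: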

"""
structured_logging.py: structured logging for the acquisition chain

Every log record contains:
- timestamp: ISO 8601 time in UTC (with the "Z" suffix)
- level: log level
- gateway_id / plc_id: device identifiers
- session: tracking ID of one acquisition session
- event: event type
- business fields
"""
import uuid
from typing import Optional

import structlog


# Configure structlog
structlog.configure(
    processors=[
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),  # JSON output
    ],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()


class CollectorLogger:
    """
    Structured logging for the acquisition gateway

    Every record is written as one JSON line that Filebeat, Logstash,
    and similar shippers can pick up directly.
    """

    def __init__(self, gateway_id: str):
        self.gateway_id = gateway_id
        self._session_id: Optional[str] = None

    def new_session(self):
        """Start a new acquisition session (call once per acquisition cycle)"""
        self._session_id = uuid.uuid4().hex[:12]
        return self._session_id

    def read_ok(self, plc_id: str, tag: str, value: float, latency_ms: float):
        """Log a successful read"""
        logger.info("read_ok",
                    gateway_id=self.gateway_id,
                    plc_id=plc_id,
                    tag=tag,
                    value=value,
                    latency_ms=round(latency_ms, 1),
                    session=self._session_id)

    def read_error(self, plc_id: str, tag: str, error: str, code: int):
        """Log a failed read"""
        logger.error("read_error",
                     gateway_id=self.gateway_id,
                     plc_id=plc_id,
                     tag=tag,
                     error=error,
                     error_code=code,
                     session=self._session_id)

    def connection_lost(self, plc_id: str, reason: str):
        """Log a dropped connection"""
        logger.warning("connection_lost",
                       gateway_id=self.gateway_id,
                       plc_id=plc_id,
                       reason=reason,
                       session=self._session_id)

    def cache_high_watermark(self, depth: int, max_depth: int):
        """Log a cache high-watermark warning"""
        # Guard against a zero or unset max_depth
        usage_pct = round(depth / max_depth * 100, 1) if max_depth > 0 else None
        logger.warning("cache_watermark",
                       gateway_id=self.gateway_id,
                       depth=depth,
                       max_depth=max_depth,
                       usage_pct=usage_pct)

Sample output (one JSON object per line; the key order in real output may differ):

{"timestamp": "2026-06-11T02:17:30.123Z", "level": "info", "event": "read_ok", "gateway_id": "gw_plant1", "plc_id": "s7_1200", "tag": "temperature", "value": 86.3, "latency_ms": 4.2, "session": "a3f8c91e2b0d"}
{"timestamp": "2026-06-11T02:17:30.987Z", "level": "error", "event": "read_error", "gateway_id": "gw_plant1", "plc_id": "s7_1200", "tag": "pressure", "error": "timeout", "error_code": 11, "session": "a3f8c91e2b0d"}

Logs like this can be indexed directly by Elasticsearch or Loki and filtered in Grafana by gateway_id, plc_id, or event, without writing a single grep.
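Even without a log backend, JSON lines are easy to query from a script. The following is a minimal sketch that parses the two sample records above and counts errors per tag; the helper names `load_events` and `filter_events` are hypothetical and exist only for this illustration.

```python
import json
from collections import Counter


def load_events(lines):
    """Parse JSON Lines, skipping blank or malformed lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate partial lines, e.g. during log rotation
    return events


def filter_events(events, **criteria):
    """Keep events whose fields equal every given key=value pair."""
    return [e for e in events
            if all(e.get(k) == v for k, v in criteria.items())]


# The two sample records from the article
sample_lines = [
    '{"timestamp": "2026-06-11T02:17:30.123Z", "level": "info", '
    '"event": "read_ok", "gateway_id": "gw_plant1", "plc_id": "s7_1200", '
    '"tag": "temperature", "value": 86.3, "latency_ms": 4.2, '
    '"session": "a3f8c91e2b0d"}',
    '{"timestamp": "2026-06-11T02:17:30.987Z", "level": "error", '
    '"event": "read_error", "gateway_id": "gw_plant1", "plc_id": "s7_1200", '
    '"tag": "pressure", "error": "timeout", "error_code": 11, '
    '"session": "a3f8c91e2b0d"}',
]

events = load_events(sample_lines)
errors = filter_events(events, level="error", plc_id="s7_1200")
errors_by_tag = Counter(e["tag"] for e in errors)
print(errors_by_tag)
```

Because both records share the same `session` value, the same filter can pull up everything that happened in one acquisition cycle, which free-text logs cannot offer.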

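2.3 Distributed tracing: the full path of one request from PLC to cloud

Metrics tell you that something went wrong, logs tell you where it went wrong, and a trace tells you what this particular request actually went through.

A complete trace of the acquisition chain contains: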
[Sequence diagram; participants: acquisition gateway, PLC, MQTT Broker, time-series database]
Trace ID: abc123, Span ID: root
Acquisition gateway -> PLC: Modbus Read (FC=03)
PLC -> acquisition gateway: response
Span modbus_read: start 02:17:30.000, end 02:17:30.004 (4 ms)
Span parse_response: start 02:17:30.004, end 02:17:30.005 (1 ms)
Acquisition gateway -> MQTT Broker: publish(temperature=86.3)
MQTT Broker -> acquisition gateway: PubAck
Span mqtt_publish: start 02:17:30.005, end 02:17:30.008 (3 ms)

Implemented with the OpenTelemetry Python SDK:

"""
otel_tracing.py: OpenTelemetry tracing for the acquisition chain

How it works:
1. Every Modbus read creates a span
2. The span carries the PLC address, function code, and latency
3. All spans belong to one trace (one acquisition cycle)
"""
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import SpanKind

# Initialize the tracer
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(
    endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


class TracedCollector:
    """
    Collector with tracing

    Every acquisition cycle creates a root span with three children:
    1. modbus_read: read the PLC registers
    2. process: data conversion / compression
    3. mqtt_publish: publish to the broker
    """

    def read_and_publish(self, plc_ip: str, tag: str):
        """One complete read-and-publish cycle (traced)"""
        # The root span is local work; the outbound calls are its children
        with tracer.start_as_current_span(
            "collect_cycle",
            kind=SpanKind.INTERNAL,
            attributes={"plc.ip": plc_ip, "tag": tag}
        ):

            # 1. Modbus read (outbound request to the PLC)
            with tracer.start_as_current_span(
                "modbus_read",
                kind=SpanKind.CLIENT,
                attributes={"function_code": 3, "address": "0x0000"}
            ) as read_span:
                value, latency = self._do_modbus_read(plc_ip)
                read_span.set_attribute("value", value)
                read_span.set_attribute("latency_ms", latency)

            # 2. Data processing
            with tracer.start_as_current_span("process"):
                processed = self._process(value)

            # 3. MQTT publish (message handed to the broker)
            with tracer.start_as_current_span(
                "mqtt_publish",
                kind=SpanKind.PRODUCER,
                attributes={"topic": f"plant/data/{tag}", "qos": 1}
            ):
                self._do_publish(tag, processed)

    def _do_modbus_read(self, plc_ip: str):
        # Placeholder: returns (value, latency in ms)
        return 86.3, 4.2

    def _process(self, value: float):
        return round(value, 1)

    def _do_publish(self, tag: str, value: float):
        # Placeholder: a real implementation would call the MQTT client here
        pass

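When the acquisition chain shows a latency spike, the trace tells you directly whether the time went into the Modbus read, the data processing, or the MQTT publish. No guessing required.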
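To make that concrete, here is a small sketch that takes the three span timings from the example trace (modbus_read 02:17:30.000 to .004, parse_response .004 to .005, mqtt_publish .005 to .008) and reports where the cycle time went. It works on plain tuples rather than on real OpenTelemetry span objects, and the function name is an assumption of mine; a tracing backend such as a trace viewer would normally show this breakdown for you.

```python
from datetime import datetime, timedelta


def span_durations_ms(spans):
    """Return {span name: duration in ms} for (name, start, end) tuples."""
    fmt = "%H:%M:%S.%f"
    one_ms = timedelta(milliseconds=1)
    return {
        name: (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)) / one_ms
        for name, start, end in spans
    }


# Span timings copied from the example trace
spans = [
    ("modbus_read", "02:17:30.000", "02:17:30.004"),
    ("parse_response", "02:17:30.004", "02:17:30.005"),
    ("mqtt_publish", "02:17:30.005", "02:17:30.008"),
]

durations = span_durations_ms(spans)
total = sum(durations.values())
slowest = max(durations, key=durations.get)

for name, ms in durations.items():
    print(f"{name}: {ms:.0f} ms ({ms / total:.0%} of the cycle)")
print("slowest stage:", slowest)
```

In this example half of the 8 ms cycle is the Modbus read, so a latency spike would be investigated at the PLC link first, not at the broker.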

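3. Nine anomaly patterns and a root-cause decision tree

3.1 Anomaly quick-reference table

The table below summarizes the nine most common data anomalies in an acquisition chain, with the likely causes ranked from most to least probable and the place to start checking.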

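#	Symptom	Most likely causes (most probable first)	Where to start
1	Frozen value (unchanged for a long time)	L1: PLC scan halted (CPU in STOP/HALT) > L4: gateway process hung > L3: TCP connection dropped but not detected	L1: check the PLC RUN/STOP LED; L4: check gateway process liveness and the last-read timestamp
2	Value jump (sudden large change, then recovery)	L1: interference on the analog channel > L2: byte-order parsing error > L1: broken sensor wire (4-20 mA open circuit)	L1: measure channel voltage/current with a multimeter; L2: check the byte-order configuration
3	Whole segment of data missing	L3: network down > L4: cache overflow dropping old data > L5: database write throttling	L3: ping + traceroute; L4: check the cache backlog metric
4	Timestamp goes backwards	L4: gateway clock drift > L4: time corrected after NTP sync	L4: check NTP sync status and offset
5	Periodic glitches	L1: VFD/motor interference > L1: common-mode voltage on the acquisition card > L2: polling period resonance	L1: check signal quality with an oscilloscope; L2: check whether the acquisition period is an integer ratio of the interference frequency
6	Duplicate data (identical data appears twice)	L2: Modbus timeout retransmission plus late arrival of the original response > L4: no deduplication in the application layer	L2: check Transaction ID handling; L4: check the SEQ deduplication logic
7	Value misplaced (register A's value shows up in B's position)	L2: wrong address offset configuration > L2: byte-alignment padding handled incorrectly	L2: read the raw frames and compare
8	Protocol errors (exception function codes, CRC errors)	L3: serial line interference (RTU) > L2: gateway protocol-stack bug	L3: capture and analyze the raw frames
9	Acquisition period drift (interval grows gradually)	L4: rising CPU load > L4: GC or logging blocking the loop > L4: memory leak	L4: check CPU/memory metrics; L4: check the actual sleep time of the acquisition thread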
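Several of these symptoms can be detected automatically before anyone looks at a chart. The sketch below shows simple detectors for rows 1, 4, and 6 (frozen value, timestamp regression, exact duplicates) over a list of `(timestamp, value)` samples. The function names, the sample data, and the choice of "four identical readings" as the freeze threshold are all assumptions for illustration; a real analog signal needs a threshold tuned to its noise and update rate.

```python
def frozen_runs(samples, min_len):
    """Return (start_index, length) for runs of identical values >= min_len."""
    runs = []
    start = 0
    for i in range(1, len(samples) + 1):
        # Close the current run at the end of the data or on a value change
        if i == len(samples) or samples[i][1] != samples[start][1]:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = i
    return runs


def timestamp_regressions(samples):
    """Indexes whose timestamp is earlier than the previous sample's."""
    return [i for i in range(1, len(samples))
            if samples[i][0] < samples[i - 1][0]]


def exact_duplicates(samples):
    """Indexes of samples identical (timestamp and value) to an earlier one."""
    seen = set()
    duplicates = []
    for i, sample in enumerate(samples):
        if sample in seen:
            duplicates.append(i)
        else:
            seen.add(sample)
    return duplicates


# Hypothetical samples: (unix timestamp, value)
samples = [
    (100.0, 86.3),
    (101.0, 86.4), (102.0, 86.4), (103.0, 86.4), (104.0, 86.4),  # frozen
    (103.5, 86.5),                                               # clock stepped back
    (105.0, 86.6), (105.0, 86.6),                                # delivered twice
]

print("frozen runs:", frozen_runs(samples, min_len=4))
print("timestamp regressions at:", timestamp_regressions(samples))
print("duplicates at:", exact_duplicates(samples))
```

A detector only tells you which row of the table you are in. The layer that caused it still has to be found by following the "where to start" column.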

3.2 Root-cause decision tree

[Flowchart: root-cause decision tree, shown here as an outline]
Data anomaly: what is the symptom?
- Frozen value: check L1, the PLC RUN/STOP LED. Is the PLC running normally?
  - NO: the PLC scan has stopped; check the PLC program and CPU faults.
  - YES: check L4, the gateway's last-read timestamp. Updated recently?
    - NO: check L3, network connectivity. Is the network up?
      - NO: repair the network connection.
      - YES: check L4, gateway process state and GC pauses.
- Value jump: check L1, measure the sensor signal with a multimeter. Is the signal stable?
  - NO: sensor or wiring fault; check shield grounding.
  - YES: check L2, byte-order and data-type configuration. Is the configuration correct?
    - NO: correct the byte-order/alignment settings.
    - YES: check L1, common-mode voltage and ground loops.
- Missing data: check L3, ping the PLC and the gateway. Is the network up?
  - NO: find out why the network dropped.
  - YES: check L4, the cache backlog metric. Is the queue depth close to its limit?
    - YES: cache overflow is dropping data; enlarge the cache or lower the acquisition rate.
    - NO: check L5, database write rate limits.
- Periodic glitches: check L1, is there near-field interference from a VFD or motor? Is there an interference source?
  - YES: add shielding, move away from the source, or use an isolator.
  - NO: check L2, whether the acquisition period resonates with the device scan period. Are the periods in an integer ratio?
    - YES: change the acquisition period to avoid the resonance point.
- Duplicate data: check L2, the Modbus timeout retransmission count. Retransmissions > 0?
  - YES: tune the timeout settings and add application-layer SEQ deduplication.
  - NO: check L4, whether the consumer thread processes records twice.
- Acquisition period drift: check L4, CPU usage. CPU > 80%?
  - YES: check for a memory leak or frequent GC.
  - NO: check L4, the actual sleep time of the acquisition thread. Sleep error > 10%?
    - YES: switch to a high-resolution timer or an RT kernel.
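Before reaching for an RT kernel, it is worth checking how the acquisition loop schedules itself. A loop written as "do the work, then sleep(interval)" adds the work time to every cycle, which by itself produces a growing offset. The sketch below contrasts that with scheduling against fixed deadlines on a monotonic clock. The function names and the injectable `clock`/`sleep` parameters are my own choices so that the behavior can be demonstrated with a simulated clock; this illustrates the scheduling idea only and is not a replacement for a real-time timer.

```python
import time


def next_deadline(start, interval, now):
    """Next tick on the fixed grid start + k * interval that lies after now."""
    ticks_done = int((now - start) // interval) + 1
    return start + ticks_done * interval


def run_periodic(task, interval, cycles,
                 clock=time.monotonic, sleep=time.sleep):
    """Run task on a fixed grid; the time spent in task does not accumulate."""
    start = clock()
    for _ in range(cycles):
        task()
        deadline = next_deadline(start, interval, clock())
        sleep(max(0.0, deadline - clock()))


def run_naive(task, interval, cycles, sleep=time.sleep):
    """Anti-pattern: every cycle lasts interval + task duration."""
    for _ in range(cycles):
        task()
        sleep(interval)


class FakeClock:
    """Simulated clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def simulate(runner, **kwargs):
    fake = FakeClock()
    starts = []

    def slow_read():
        starts.append(fake.now)
        fake.now += 0.25  # pretend each read takes 250 ms

    runner(slow_read, interval=1.0, cycles=4, sleep=fake.sleep, **kwargs)
    return starts


naive_starts = simulate(run_naive)
grid_starts = simulate(lambda task, **kw: run_periodic(task, clock=kw.pop("clock_obj", None) or task.__closure__[0].cell_contents, **kw)) if False else None

fake = FakeClock()
grid_starts = []


def slow_read():
    grid_starts.append(fake.now)
    fake.now += 0.25


run_periodic(slow_read, interval=1.0, cycles=4, clock=fake, sleep=fake.sleep)

print("naive start times:", naive_starts)
print("grid start times: ", grid_starts)
```

In the simulation the naive loop starts its reads at 0, 1.25, 2.5, and 3.75 s, while the deadline-based loop stays on 0, 1, 2, and 3 s. If a read overruns a whole interval, `next_deadline` skips the missed tick instead of firing a burst of catch-up reads, which is usually what you want when polling a PLC.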

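3.3 How to use the decision tree

The tree is not something to memorize; it is a route map for working through the layers when a problem occurs. I encoded the decision-tree logic as a diagnostic function you can call interactively:

"""
diagnosis_decision_tree.py: decision tree for acquisition anomalies

Usage: pass in a symptom and get back a checklist of troubleshooting steps.
"""
from typing import List, Dict


class DiagnosisEngine:
    """
    Diagnosis engine for acquisition anomalies

    A decision tree built from field experience that guides you
    through the layers one at a time.
    """

    TREES = {
        "frozen value": [
            ("L1", "Check the RUN/STOP LED on the PLC", [
                ("PLC is in STOP", "Find the CPU fault cause; upload the diagnostic buffer"),
                ("PLC is in RUN", "Check the gateway's last successful read timestamp"),
            ]),
            ("L4", "Check whether the gateway process is alive", [
                ("Process is dead", "Check the OOM killer log / restart the acquisition process"),
                ("Process is running", "Check whether the acquisition thread is blocked (GC / deadlock)"),
            ]),
            ("L3", "Check network connectivity", [
                ("ping fails", "Check the physical link / switch port / IP configuration"),
                ("ping works but the port is unreachable", "Check whether port 502 is open on the PLC"),
            ]),
        ],
        "value jump": [
            ("L1", "Measure the analog channel signal with a multimeter", [
                ("Signal is unstable", "Check shield grounding / signal isolator"),
                ("Signal is stable", "Check the byte-order configuration (big endian vs little endian)"),
            ]),
            ("L2", "Check the byte order of the register data type", [
                ("32-bit value split into 16-bit values", "Check that the data model definition matches"),
                ("Byte order reversed", "Check the byte-order settings on the PLC and the gateway"),
            ]),
            ("L1", "Rule out common-mode voltage interference", [
                ("Common-mode voltage > 10 V", "Install a signal isolator / check grounding"),
            ]),
        ],
        "missing data": [
            ("L3", "Test network connectivity with ping", [
                ("Unreachable", "Ask the network team to repair the link"),
                ("Reachable but high latency", "Check the TCP retransmission rate / link bandwidth"),
            ]),
            ("L4", "Check the cache queue backlog metric", [
                ("Queue is full", "Cache overflow is dropping data; enlarge the buffer"),
                ("Queue is empty", "The acquisition thread itself is not producing data"),
            ]),
            ("L5", "Check the time-series database write rate", [
                ("Writes are throttled", "Tune database write concurrency / lower the acquisition rate"),
            ]),
        ],
    }

    def __init__(self, gateway_id: str):
        self.gateway_id = gateway_id

    def diagnose(self, symptom: str) -> List[Dict]:
        """
        Return the diagnostic steps for a symptom

        Args:
            symptom: one of the keys of TREES
                     (frozen value / value jump / missing data)

        Returns: [
            {"layer": "L1", "check": "Check ...", "branches": [
                {"condition": "...", "action": "..."}, ...]},
            ...
        ]
        For an unknown symptom, a single {"error": "..."} entry is returned.
        """
        tree = self.TREES.get(symptom)
        if not tree:
            return [{"error": f"Unknown symptom: {symptom}, "
                     f"supported: {list(self.TREES.keys())}"}]

        result = []
        for layer, check, branches in tree:
            step = {
                "layer": layer,
                "check": check,
                "branches": [
                    {"condition": cond, "action": action}
                    for cond, action in branches
                ]
            }
            result.append(step)

        return result

    def print_diagnosis(self, symptom: str):
        """Print the diagnostic steps in readable form"""
        steps = self.diagnose(symptom)
        print(f"\n{'='*60}")
        print(f"Diagnosis: {symptom}")
        print(f"Gateway: {self.gateway_id}")
        print(f"{'='*60}")

        for step in steps:
            if "error" in step:
                print(f"\n[!] {step['error']}")
                continue

            print(f"\n[{step['layer']}] {step['check']}")
            for branch in step["branches"]:
                print(f"   ├─ if {branch['condition']}")
                print(f"   └─ → {branch['action']}")

        print("\nHint: go layer by layer, starting from L1; do not skip layers.")


# ===== Usage example =====
if __name__ == "__main__":
    engine = DiagnosisEngine("gw_plant1")
    engine.print_diagnosis("frozen value")

Output:

============================================================
Diagnosis: frozen value
Gateway: gw_plant1
============================================================

[L1] Check the RUN/STOP LED on the PLC
   ├─ if PLC is in STOP
   └─ → Find the CPU fault cause; upload the diagnostic buffer
   ├─ if PLC is in RUN
   └─ → Check the gateway's last successful read timestamp

[L4] Check whether the gateway process is alive
   ├─ if Process is dead
   └─ → Check the OOM killer log / restart the acquisition process
   ├─ if Process is running
   └─ → Check whether the acquisition thread is blocked (GC / deadlock)

[L3] Check network connectivity
   ├─ if ping fails
   └─ → Check the physical link / switch port / IP configuration
   ├─ if ping works but the port is unreachable
   └─ → Check whether port 502 is open on the PLC

Hint: go layer by layer, starting from L1; do not skip layers.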
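One branch of the value-jump tree deserves a worked example: the byte-order check. A 32-bit float occupies two 16-bit Modbus registers, and if the gateway assembles the two words in the wrong order, the result is a perfectly valid float that has nothing to do with the real reading. The sketch below encodes 86.3 the way a device sending the high word first would, then decodes it both ways. The helper names are hypothetical, and which word order a given PLC uses is something you have to confirm from its documentation or from a raw frame.

```python
import struct


def float_to_registers(value):
    """Encode a float32 as two 16-bit registers, high word first."""
    high, low = struct.unpack(">HH", struct.pack(">f", value))
    return [high, low]


def registers_to_float(registers, word_swap=False):
    """Decode two 16-bit registers into a float32.

    word_swap=True treats the first register as the low word, which is
    what a gateway does when its word-order setting is the other way round.
    """
    first, second = registers
    high, low = (second, first) if word_swap else (first, second)
    return struct.unpack(">f", struct.pack(">HH", high, low))[0]


registers = float_to_registers(86.3)
correct = registers_to_float(registers)
swapped = registers_to_float(registers, word_swap=True)

print("registers:", [f"0x{r:04X}" for r in registers])
print("decoded with matching word order:", round(correct, 1))
print("decoded with swapped word order: ", swapped)
```

The swapped decode is nowhere near 86.3, and it stays wrong on every sample, which distinguishes a configuration error (constant nonsense) from L1 interference (occasional spikes on an otherwise plausible curve).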

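4. Field troubleshooting toolbox

Tools 1 to 4 cover the common troubleshooting scenarios from L1 to L4. Each tool is a Python script you can run directly, and its only dependency is the Python standard library.

4.1 Tool 1: `modbus_ping.py`, a Modbus connectivity check

It works like ping, but probes at the Modbus application layer instead of with ICMP, because some PLCs still answer ping after their Modbus port has stopped responding.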
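For orientation, this is what such a probe puts on the wire. A Modbus TCP frame is a 7-byte MBAP header (transaction ID, protocol ID, length, unit ID) followed by the PDU, and the length field counts the unit ID plus the PDU. The sketch below builds a Read Holding Registers request and parses two hand-written responses. The helper names and the two response frames are made up for illustration (the register value 0x035F is arbitrary), so treat it as a frame-layout reference rather than as captured traffic.

```python
import struct


def build_read_request(tx_id, unit_id, address, quantity):
    """Build a Modbus TCP Read Holding Registers (FC=03) request."""
    pdu = struct.pack(">BHH", 0x03, address, quantity)  # FC, address, quantity
    # MBAP length = unit ID (1 byte) + PDU
    mbap = struct.pack(">HHHB", tx_id, 0x0000, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_response(frame):
    """Parse a complete FC=03 response or exception frame."""
    tx_id, _protocol, _length, _unit_id, fc = struct.unpack(">HHHBB", frame[:8])
    if fc & 0x80:
        # Exception response: function code with the high bit set + 1 code byte
        return {"tx_id": tx_id, "ok": False, "exception_code": frame[8]}
    byte_count = frame[8]
    registers = struct.unpack(f">{byte_count // 2}H", frame[9:9 + byte_count])
    return {"tx_id": tx_id, "ok": True, "registers": list(registers)}


request = build_read_request(tx_id=1, unit_id=1, address=0x0000, quantity=1)
print("request:", request.hex(" "))

# Hand-written example frames (hypothetical, not captured from a device)
normal = bytes.fromhex("0001 0000 0005 01 03 02 035f")
exception = bytes.fromhex("0001 0000 0003 01 83 02")

print(len(normal), parse_response(normal))
print(len(exception), parse_response(exception))
```

Note the two response sizes: 11 bytes for a normal one-register reply and 9 bytes for an exception. A probe therefore should not wait for a fixed byte count; it should read the 7-byte header first and then exactly as many bytes as the length field announces.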

python 复制代码

"""
modbus_ping.py --- Modbus TCP 连通性检查工具

用法：python modbus_ping.py 192.168.1.10 [--port 502] [--count 10]

输出：每个探测包的 RTT（类似 ping 的格式）
      统计：最小/平均/最大/丢包率
"""
import socket
import struct
import time
import argparse
import sys


class ModbusPing:
    """
    Modbus 应用层 Ping

    发送一个 Modbus TCP Read Holding Registers 请求（FC=03），
    测量从请求到收到正确响应的时间。

    比 ICMP ping 更准确的采集链路探测方式：
    - ICMP 通 ≠ Modbus 通（PLC 可能 502 端口没开）
    - Modbus 通 ≠ 数据正确（可能单元 ID 不对）
    """

    def __init__(self, host: str, port: int = 502, unit_id: int = 1,
                 timeout: float = 2.0):
        self.host = host
        self.port = port
        self.unit_id = unit_id
        self.timeout = timeout

    def ping_once(self) -> dict:
        """
        Send a single Modbus request.

        Returns: {
            "success": bool,
            "rtt_ms": float or None,      # round-trip time in ms (includes TCP connect)
            "error": str or None,
            "response_fc": int or None,   # function code of the response
        }
        """
        # Build the Modbus TCP request: FC=03, address 0x0000, read 1 register.
        tx_id = 0x0001
        protocol = 0x0000
        # unit_id(1) + function code(1) + start address(2) + quantity(2)
        body = struct.pack(">BBHH", self.unit_id, 0x03, 0x0000, 0x0001)
        length = len(body)  # MBAP length counts every byte after the field (6 here)
        request = struct.pack(">HHH", tx_id, protocol, length) + body

        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)

        def recv_exact(n: int) -> bytes:
            """Read exactly n bytes, or raise if the peer closes early."""
            buf = b""
            while len(buf) < n:
                chunk = sock.recv(n - len(buf))
                if not chunk:
                    raise ConnectionError("connection closed early")
                buf += chunk
            return buf

        try:
            t_start = time.perf_counter()

            sock.connect((self.host, self.port))
            sock.sendall(request)

            # Read the 7-byte MBAP header, then the PDU whose size it announces.
            # An exception response is only 9 bytes in total, so waiting for a
            # fixed 10 bytes would stall until the timeout.
            header = recv_exact(7)
            resp_tx_id, _resp_proto, resp_len = struct.unpack(">HHH", header[:6])
            if resp_len < 3 or resp_len > 254:
                raise ValueError(f"invalid MBAP length field: {resp_len}")
            pdu = recv_exact(resp_len - 1)  # the length field includes the unit ID

            rtt = round((time.perf_counter() - t_start) * 1000, 2)
            resp_fc = pdu[0]

            # Check the transaction ID
            if resp_tx_id != tx_id:
                return {"success": False, "rtt_ms": rtt,
                        "error": f"Transaction ID mismatch: "
                                 f"expected {tx_id}, got {resp_tx_id}",
                        "response_fc": resp_fc}

            # Check for an exception response (high bit of the function code set)
            if resp_fc & 0x80:
                exception_code = pdu[1]
                return {"success": False, "rtt_ms": rtt,
                        "error": f"Modbus exception: FC=0x{resp_fc:02x}, "
                                 f"exception code={exception_code}",
                        "response_fc": resp_fc}

            return {"success": True, "rtt_ms": rtt,
                    "error": None, "response_fc": resp_fc}

        except socket.timeout:
            return {"success": False, "rtt_ms": None,
                    "error": "timeout", "response_fc": None}
        except Exception as e:
            return {"success": False, "rtt_ms": None,
                    "error": str(e), "response_fc": None}
        finally:
            sock.close()


def main():
    parser = argparse.ArgumentParser(description="Modbus TCP Ping")
    parser.add_argument("host", help="PLC IP address")
    parser.add_argument("--port", type=int, default=502)
    parser.add_argument("--count", type=int, default=10,
                        help="number of probes (default 10)")
    parser.add_argument("--unit", type=int, default=1,
                        help="Unit ID (default 1)")
    parser.add_argument("--interval", type=float, default=1.0,
                        help="seconds between probes (default 1)")
    args = parser.parse_args()

    print(f"Modbus Ping → {args.host}:{args.port} (Unit ID={args.unit})")
    print(f"Sending {args.count} probes, interval {args.interval:g}s")
    print("-" * 60)

    pinger = ModbusPing(args.host, args.port, args.unit)
    rtts = []
    lost = 0

    for i in range(args.count):
        result = pinger.ping_once()

        if result["success"]:
            rtts.append(result["rtt_ms"])
            status = f"ok rtt={result['rtt_ms']:.2f}ms"
        else:
            lost += 1
            status = f"failed ({result['error']})"

        print(f"probe {i+1:>3}/{args.count}: {status}")
        if i < args.count - 1:  # no need to wait after the last probe
            time.sleep(args.interval)

    # Statistics
    sent = args.count
    loss_rate = lost / sent * 100 if sent else 0.0
    print("-" * 60)
    print(f"--- {args.host}:{args.port} statistics ---")
    print(f"sent={sent}, ok={sent-lost}, lost={lost} ({loss_rate:.1f}% loss)")

    if rtts:
        print(f"RTT(ms): min={min(rtts):.2f}, "
              f"avg={sum(rtts)/len(rtts):.2f}, "
              f"max={max(rtts):.2f}")
    else:
        print("RTT: no valid samples")


if __name__ == "__main__":
    main()

# Example
$ python modbus_ping.py 192.168.1.10 --count 5

Modbus Ping → 192.168.1.10:502 (Unit ID=1)
Sending 5 probes, interval 1s
------------------------------------------------------------
probe   1/5: ok rtt=3.45ms
probe   2/5: ok rtt=2.98ms
probe   3/5: failed (timeout)
probe   4/5: ok rtt=4.12ms
probe   5/5: ok rtt=3.01ms
------------------------------------------------------------
--- 192.168.1.10:502 statistics ---
sent=5, ok=4, lost=1 (20.0% loss)
RTT(ms): min=2.98, avg=3.39, max=4.12
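
To make the frame layout behind the probe concrete, here is a minimal standalone sketch of how an FC=03 request is assembled and how a response is decoded. The helper names `build_read_request` and `parse_response` are illustrative assumptions, not part of `modbus_ping.py`; the exception names cover only the four most common codes from the Modbus application protocol.

```python
import struct

# The four most common Modbus exception codes (illustrative subset).
EXCEPTION_NAMES = {
    0x01: "Illegal Function",
    0x02: "Illegal Data Address",
    0x03: "Illegal Data Value",
    0x04: "Server Device Failure",
}


def build_read_request(tx_id: int, unit_id: int, address: int, quantity: int) -> bytes:
    """Build a Read Holding Registers request: 7-byte MBAP header + 5-byte PDU."""
    pdu = struct.pack(">BHH", 0x03, address, quantity)
    # The MBAP length field counts the unit ID plus the PDU.
    mbap = struct.pack(">HHHB", tx_id, 0x0000, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_response(frame: bytes) -> dict:
    """Decode a complete Modbus TCP response frame (hypothetical helper)."""
    tx_id, _proto, length, unit_id = struct.unpack(">HHHB", frame[:7])
    pdu = frame[7:7 + length - 1]
    fc = pdu[0]
    if fc & 0x80:  # high bit set: exception response, next byte is the code
        code = pdu[1]
        return {"tx_id": tx_id, "unit_id": unit_id, "fc": fc & 0x7F,
                "exception": EXCEPTION_NAMES.get(code, f"code {code}")}
    byte_count = pdu[1]
    registers = list(struct.unpack(f">{byte_count // 2}H", pdu[2:2 + byte_count]))
    return {"tx_id": tx_id, "unit_id": unit_id, "fc": fc, "registers": registers}


if __name__ == "__main__":
    print(build_read_request(1, 1, 0, 1).hex())
```

A normal one-register reply is 11 bytes and an exception reply is 9 bytes, which is why a reader should rely on the MBAP length field rather than on a fixed byte count.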

4.2 Tool 2: `opcua_scan.py`, OPC UA endpoint scan

"""
opcua_scan.py: OPC UA server endpoint scan

Checks, for one OPC UA server:
1. Which SecurityModes accept a connection (None / Sign / SignAndEncrypt)
2. Connection latency in each mode
3. Number of variables reachable below the Objects folder

Usage: python opcua_scan.py opc.tcp://192.168.1.10:4840 [client_cert client_key]
"""
import asyncio
import sys
import time
from typing import Optional

from asyncua import Client, ua


async def scan_endpoint(url: str, timeout: int = 5,
                        cert: Optional[str] = None, key: Optional[str] = None):
    """
    Scan an OPC UA server endpoint.

    Args:
        url: OPC UA server address (opc.tcp://host:port)
        timeout: connection timeout in seconds
        cert, key: paths to a client certificate and private key; without
            them the Sign and SignAndEncrypt modes cannot be tested

    Returns: {
        "url": str,
        "available_policies": [str],
        "connect_latency_ms": {policy: float or None},
        "errors": {policy: str},
        "root_variables": int,
        "server_info": dict,
    }
    """
    result = {"url": url, "available_policies": [],
              "connect_latency_ms": {}, "errors": {},
              "root_variables": 0, "server_info": {}}

    for name in ("None", "Sign", "SignAndEncrypt"):
        client = Client(url=url, timeout=timeout)
        connected = False
        try:
            if name != "None":
                if not (cert and key):
                    raise ValueError("client certificate and key required")
                # asyncua takes a security string "Policy,Mode,cert,key"
                await client.set_security_string(
                    f"Basic256Sha256,{name},{cert},{key}")

            t0 = time.perf_counter()
            await client.connect()
            connected = True
            latency = (time.perf_counter() - t0) * 1000

            result["available_policies"].append(name)
            result["connect_latency_ms"][name] = round(latency, 1)

            # Count variables only in None mode
            if name == "None":
                children = await client.nodes.objects.get_children()
                var_count = 0
                for child in children:
                    vars_in = await child.get_variables()
                    var_count += len(vars_in)
                result["root_variables"] = var_count

                # Server state: node i=2259 is Server_ServerStatus_State
                try:
                    state = await client.get_node("i=2259").read_value()
                    result["server_info"] = {
                        "state": ua.ServerState(state).name,
                    }
                except Exception:
                    pass

        except Exception as e:
            # Record the failure so it can be reported instead of vanishing
            result["connect_latency_ms"][name] = None
            result["errors"][name] = str(e) or type(e).__name__
        finally:
            if connected:
                await client.disconnect()

    return result


async def main():
    url = sys.argv[1] if len(sys.argv) > 1 else "opc.tcp://localhost:4840"
    cert = sys.argv[2] if len(sys.argv) > 3 else None
    key = sys.argv[3] if len(sys.argv) > 3 else None
    print(f"Scanning OPC UA endpoint: {url}")
    print("=" * 60)

    result = await scan_endpoint(url, cert=cert, key=key)

    print("\nSecurity modes:")
    for mode, latency in result["connect_latency_ms"].items():
        if latency is not None:
            print(f"  [{mode}] connect latency: {latency}ms")
        else:
            print(f"  [{mode}] connection failed: {result['errors'].get(mode)}")

    if result["root_variables"]:
        print(f"\nVariables reachable below Objects: {result['root_variables']}")
    if result.get("server_info"):
        info = result["server_info"]
        print(f"Server state: {info.get('state', 'unknown')}")


if __name__ == "__main__":
    asyncio.run(main())

4.3 Tool 3: `mqtt_sniff.py`, MQTT topic sniffing

"""
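mqtt_sniff.py: MQTT topic sniffer

Subscribes to the given topics (wildcards supported) and prints, per message:
- timestamp, topic, QoS, payload size
- inter-arrival interval (to detect an interrupted publisher)

Usage: python mqtt_sniff.py localhost plant/temperature

Useful for answering: is data being published at all? Is the publish
interval stable? Is the payload format correct?
"""
import argparse
import json
import os
import sys
import time

import paho.mqtt.client as mqtt


class MQTTSniffer:
    """
    MQTT topic sniffer.

    Subscribes to one or more topics and prints message metadata and
    inter-arrival intervals in real time. Warns when no message has arrived
    for more than silence_threshold seconds.
    """

    def __init__(self, host: str, port: int, topics: list,
                 silence_threshold: float = 30.0):
        self.host = host
        self.port = port
        self.topics = topics
        self.silence_threshold = silence_threshold

        self._started = time.monotonic()
        self._last_msg_time = None   # monotonic time of the latest message
        self._silence_reported = False
        self._msg_count = 0
        self._intervals = []

        # A per-process client ID keeps two sniffers from evicting each other.
        client_id = f"mqtt_sniff_{os.getpid()}"
        # paho-mqtt 2.x requires an explicit callback API version; 1.x has no
        # CallbackAPIVersion, so fall back to the old constructor there.
        if hasattr(mqtt, "CallbackAPIVersion"):
            self.client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION1,
                                      client_id=client_id)
        else:
            self.client = mqtt.Client(client_id=client_id)
        self.client.on_connect = self._on_connect
        self.client.on_message = self._on_message

    def _on_connect(self, client, userdata, flags, rc):
        print(f"Connected to {self.host}:{self.port} (rc={rc})")
        for topic in self.topics:
            client.subscribe(topic, qos=0)
            print(f"  subscribed: {topic}")

    def _on_message(self, client, userdata, msg):
        now = time.monotonic()
        # The first message has no predecessor, so it has no interval.
        interval = None if self._last_msg_time is None else now - self._last_msg_time
        self._last_msg_time = now
        self._silence_reported = False
        self._msg_count += 1
        if interval is not None:
            self._intervals.append(interval)

        # Parse the payload
        try:
            payload = json.loads(msg.payload)
            payload_preview = json.dumps(payload, ensure_ascii=False)[:120]
        except (json.JSONDecodeError, UnicodeDecodeError):
            payload_preview = f"<binary: {len(msg.payload)} bytes>"

        ts = time.strftime("%H:%M:%S", time.localtime())
        interval_str = "-" if interval is None else f"{interval:.2f}s"
        print(f"[{ts}] #{self._msg_count:>6} | "
              f"topic: {msg.topic:<40} | "
              f"QoS: {msg.qos} | "
              f"size: {len(msg.payload):>5}B | "
              f"interval: {interval_str:<8} | "
              f"payload: {payload_preview}")

    def run(self):
        print(f"MQTT Sniffer → {self.host}:{self.port}")
        print(f"Topics: {', '.join(self.topics)}")
        print(f"Silence alert threshold: {self.silence_threshold}s")
        print("=" * 80)

        self.client.connect(self.host, self.port, keepalive=60)
        self.client.loop_start()  # network loop runs in a background thread
        try:
            # The main thread acts as the silence watchdog.
            while True:
                time.sleep(1)
                last = self._last_msg_time or self._started
                quiet = time.monotonic() - last
                if quiet > self.silence_threshold and not self._silence_reported:
                    print(f"WARNING: no message for {quiet:.0f}s")
                    self._silence_reported = True
        finally:
            self.client.loop_stop()
            self.client.disconnect()


def main():
    parser = argparse.ArgumentParser(description="MQTT topic sniffer")
    parser.add_argument("host", help="MQTT broker address")
    parser.add_argument("topics", nargs="+",
                        help="topics to subscribe to (+ and # wildcards supported)")
    parser.add_argument("--port", type=int, default=1883)
    parser.add_argument("--silence", type=float, default=30.0,
                        help="silence alert threshold in seconds")
    args = parser.parse_args()

    sniffer = MQTTSniffer(args.host, args.port, args.topics, args.silence)
    try:
        sniffer.run()
    except KeyboardInterrupt:
        print("\nStopped")
        sys.exit(0)


if __name__ == "__main__":
    main()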
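
The inter-arrival intervals the sniffer prints can also be checked offline. The following is a small sketch, under the assumption that you have collected message arrival times as a list of seconds; the function name `find_gaps` and the default factor of 3 are hypothetical choices, not something the sniffer itself provides.

```python
from statistics import median


def find_gaps(arrival_times, factor=3.0):
    """Return (message_index, gap_seconds) for every inter-arrival interval
    larger than `factor` times the median interval."""
    intervals = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    if not intervals:
        return []
    typical = median(intervals)
    # Index i + 1 is the message that arrived late.
    return [(i + 1, gap) for i, gap in enumerate(intervals)
            if gap > factor * typical]


if __name__ == "__main__":
    # A publisher sending once per second that went quiet for 6 seconds.
    times = [0.0, 1.0, 2.0, 3.0, 9.0, 10.0]
    for index, gap in find_gaps(times):
        print(f"message {index} arrived after a {gap:.1f}s gap")
```

Comparing against the median rather than a fixed threshold means the same check works for a 100 ms publisher and a 60 s publisher without retuning.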

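4.4 Tool 4: `jitter_measure.py`, acquisition cycle jitter measurement

This tool answers the question left open at the end of Part 9: how far does the actual interval of your acquisition loop deviate from the intended one?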

"""
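jitter_measure.py: acquisition cycle jitter measurement

Measures the actual interval between acquisition loop iterations and prints
statistics and a histogram.

Usage: python jitter_measure.py [--interval 100] [--count 1000] [--load]

Output:
- P50/P95/P99 and maximum of the actual interval
- Distribution of the deviation from the target interval
- Detection of "cycle drift" (a trend of steadily growing intervals)
"""
import argparse
import random
import statistics
import time


class JitterMeter:
    """
    Acquisition cycle jitter meter.

    Simulates the timing of an acquisition loop, measures the actual
    duration of every cycle, and judges the stability of the loop.
    """

    def __init__(self, target_interval_ms: float = 100.0):
        """
        Args:
            target_interval_ms: target acquisition interval in milliseconds
        """
        self.target = target_interval_ms / 1000  # convert to seconds
        self.actual_intervals = []
        self.drift = []  # per-cycle deviation from the target (ms)

    def measure(self, count: int, simulate_load: bool = False):
        """
        Run count acquisition cycles and measure the actual interval of each.

        Args:
            count: number of cycles to measure
            simulate_load: whether to simulate random acquisition delay
        """
        last = time.perf_counter()
        expected_next = last + self.target

        for _ in range(count):
            # Simulate the time spent on the acquisition itself
            if simulate_load:
                time.sleep(random.uniform(0.001, 0.005))

            # Wait until the scheduled point in time
            sleep_needed = expected_next - time.perf_counter()
            if sleep_needed > 0:
                time.sleep(sleep_needed)

            # Take one timestamp and use it both as the end of this cycle
            # and as the start of the next, so no time goes unmeasured.
            now = time.perf_counter()
            actual_interval = now - last
            self.actual_intervals.append(actual_interval * 1000)  # to ms
            self.drift.append((actual_interval - self.target) * 1000)
            last = now

            expected_next += self.target
            if expected_next < now:
                # More than one full period behind: skip the missed cycles and
                # re-anchor the schedule instead of firing back-to-back.
                expected_next = now + self.target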

    def report(self):
        """Print the measurement report."""
        if not self.actual_intervals:
            return

        intervals = sorted(self.actual_intervals)
        n = len(intervals)
        target_ms = self.target * 1000
        # statistics.stdev needs at least two samples
        stdev = statistics.stdev(intervals) if n > 1 else 0.0

        print("\nAcquisition cycle jitter report")
        print("=" * 60)
        print(f"Target interval: {target_ms:.1f}ms")
        print(f"Samples: {n}")
        print("-" * 60)
        print(f"  Min:  {intervals[0]:8.3f}ms")
        print(f"  Max:  {intervals[-1]:8.3f}ms")
        print(f"  Mean: {sum(intervals)/n:8.3f}ms")
        print(f"  P50:  {intervals[int(n*0.50)]:8.3f}ms")
        print(f"  P95:  {intervals[int(n*0.95)]:8.3f}ms")
        print(f"  P99:  {intervals[int(n*0.99)]:8.3f}ms")
        print(f"  Stdev: {stdev:.3f}ms")
        print("-" * 60)

        # Deviation analysis
        max_drift = max(abs(d) for d in self.drift)
        avg_drift = sum(self.drift) / len(self.drift)
        print("Deviation analysis:")
        print(f"  Mean deviation: {avg_drift:.3f}ms "
              f"({'early' if avg_drift < 0 else 'late'})")
        print(f"  Max deviation: {max_drift:.3f}ms")

        # Drift trend: do the intervals in the last 10% keep growing?
        # This is a strict test (every interval >= the one before it); it
        # needs at least 3 samples so an empty comparison cannot pass.
        tail = self.actual_intervals[-max(n // 10, 1):]
        tail_increasing = len(tail) >= 3 and all(
            tail[i] <= tail[i + 1] for i in range(len(tail) - 1)
        )
        if tail_increasing:
            print("  WARNING: cycle drift detected "
                  "(intervals in the last 10% keep growing)")
            print("    Possible causes: rising CPU load, memory leak, "
                  "thread starvation")

        # Share of cycles over budget
        over_budget = sum(1 for v in intervals if v > target_ms * 1.1)
        over_pct = over_budget / n * 100
        print(f"  Over budget (>110% of target): {over_pct:.1f}%")

        # Histogram
        print("\nInterval distribution histogram:")

        min_v = min(intervals)
        max_v = max(intervals)
        if max_v > min_v:
            bucket_size = (max_v - min_v) / 10
            hist = [0] * 10
            for v in intervals:
                idx = min(int((v - min_v) / bucket_size), 9)
                hist[idx] += 1

            for i, bucket_count in enumerate(hist):
                pct = bucket_count / n * 100
                bar = "#" * int(pct)
                lo = min_v + i * bucket_size
                hi = min_v + (i + 1) * bucket_size
                print(f"  {lo:8.3f}-{hi:8.3f}ms: {bar} {pct:.1f}%")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Acquisition cycle jitter measurement")
    parser.add_argument("--interval", type=float, default=100.0,
                        help="target acquisition interval in milliseconds")
    parser.add_argument("--count", type=int, default=1000,
                        help="number of cycles to measure")
    parser.add_argument("--load", action="store_true",
                        help="simulate random acquisition load")
    args = parser.parse_args()

    meter = JitterMeter(target_interval_ms=args.interval)
    print("Measuring acquisition cycle jitter...")
    print(f"Target interval: {args.interval}ms, {args.count} cycles")

    t0 = time.perf_counter()
    meter.measure(args.count, simulate_load=args.load)
    elapsed = (time.perf_counter() - t0) * 1000
    print(f"Measurement took: {elapsed:.0f}ms")

    meter.report()

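Result (on a Windows machine under typical load):

$ python jitter_measure.py --interval 100 --count 500

Acquisition cycle jitter report
============================================================
Target interval: 100.0ms
Samples: 500
------------------------------------------------------------
  Min:    99.821ms
  Max:   116.434ms
  Mean:  100.013ms
  P50:    99.928ms
  P95:   100.237ms
  P99:   103.445ms
  Stdev: 0.761ms
------------------------------------------------------------
Deviation analysis:
  Mean deviation: 0.013ms (late)
  Max deviation: 16.434ms
  Over budget (>110% of target): 0.4%

(histogram output omitted)

Reading the numbers: the mean of 100.013 ms is nearly perfect, but the tail tells a different story. P99 = 103.445 ms means that roughly 1% of cycles ran at least 3.4 ms over the 100 ms target, and the worst cycle was 16.4 ms late, which corresponds to a momentary GC pause or scheduling delay.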
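
The drift check in the report looks for a strictly growing run of intervals, which a single noisy sample can break. As a hedged alternative, the sketch below fits a least-squares line through the intervals and reports its slope; this is a swapped-in technique, not what `jitter_measure.py` does, and the function name and any alert threshold you apply to the slope are assumptions.

```python
def drift_slope_ms_per_cycle(intervals_ms):
    """Least-squares slope of interval length versus cycle index.

    A slope near zero means the loop is stable; a persistently positive
    slope means each cycle takes slightly longer than the one before.
    """
    n = len(intervals_ms)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(intervals_ms) / n
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(intervals_ms))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


if __name__ == "__main__":
    stable = [100.0] * 50
    drifting = [100.0 + 0.1 * i for i in range(50)]  # grows 0.1 ms per cycle
    print(f"stable:   {drift_slope_ms_per_cycle(stable):.3f} ms/cycle")
    print(f"drifting: {drift_slope_ms_per_cycle(drifting):.3f} ms/cycle")
```

Because the slope averages over all samples, one 16 ms outlier moves it only slightly, whereas sustained growth shows up clearly.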

5. A complete health check endpoint

All of the components above can be aggregated behind one unified health check interface:

"""
health_check.py: unified health check endpoint for the acquisition gateway

Exposes two endpoints:
- GET /health: terse health check (for K8s / load balancers)
- GET /health/detail: detailed health report (for manual troubleshooting)
"""
import json
import time
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler


class HealthStatus:
    """
    Health state of the acquisition gateway.

    Check items:
    1. PLC connection state
    2. MQTT connection state
    3. Cache backlog level
    4. Time of the last successful read
    5. Acquisition cycle deviation
    6. System resources (CPU, memory, disk)
    """

    def __init__(self, gateway_id: str,
                 plc_host: str, plc_port: int = 502,
                 mqtt_host: str = "localhost", mqtt_port: int = 1883):
        self.gateway_id = gateway_id
        self.plc_host = plc_host
        self.plc_port = plc_port
        self.mqtt_host = mqtt_host
        self.mqtt_port = mqtt_port

        # Cached state (updated by a background thread)
        self._cache = {
            "status": "starting",
            "timestamp": time.time(),
            "checks": {}
        }
        self._lock = threading.Lock()

    def update_cache(self, checks: dict):
        """Update the cached health check results."""
        with self._lock:
            self._cache["checks"] = checks
            self._cache["timestamp"] = time.time()
            # An empty result set must not count as healthy: all() of an
            # empty sequence is True.
            all_ok = bool(checks) and all(
                c.get("ok", False) for c in checks.values()
            )
            self._cache["status"] = "healthy" if all_ok else "degraded"

    def get_status(self, detail: bool = False) -> dict:
        """Return the current health state."""
        with self._lock:
            result = {
                "gateway_id": self.gateway_id,
                "status": self._cache["status"],
                "updated_at": time.strftime(
                    "%Y-%m-%dT%H:%M:%SZ", time.gmtime(self._cache["timestamp"])),
            }
            if detail:
                result["checks"] = self._cache["checks"]
            return result


class HealthHTTPHandler(BaseHTTPRequestHandler):
    """HTTP health check endpoints."""

    def do_GET(self):
        path = self.path.split("?", 1)[0]  # ignore any query string
        if path not in ("/health", "/health/detail"):
            self.send_response(404)
            self.end_headers()
            return

        detail = path == "/health/detail"
        status = self.server.health_status.get_status(detail=detail)
        body = json.dumps(status, indent=2 if detail else None).encode()
        # Anything other than "healthy" returns 503 so that K8s or a load
        # balancer treats the gateway as unavailable.
        self.send_response(200 if status["status"] == "healthy" else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # do not write health check requests to the access log


def start_health_server(health_status: HealthStatus, port: int = 8080):
    """Start the health check HTTP server in a daemon thread."""
    server = HTTPServer(("0.0.0.0", port), HealthHTTPHandler)
    server.health_status = health_status
    t = threading.Thread(target=server.serve_forever, daemon=True)
    t.start()
    print(f"Health check endpoint started at :{port}/health")
    return server


# ===== Usage example =====
if __name__ == "__main__":
    hs = HealthStatus("gw_plant1", plc_host="192.168.1.10")
    start_health_server(hs, port=8080)

    print("\nHealth check endpoint is running. Test it with curl:")
    print("  curl http://localhost:8080/health")
    print("  curl http://localhost:8080/health/detail")

    # Simulated health check updates, one per second
    for i in range(10):
        checks = {
            "plc_connection": {
                "ok": i % 5 != 3,
                "latency_ms": round(4.2 + i * 0.1, 1),
            },
            "mqtt_connection": {
                "ok": True,
                "latency_ms": 1.8,
            },
            "cache_depth": {
                "ok": i < 8,
                "depth": i * 100,
                "max": 10000,
                "usage_pct": i * 1.0,
            },
            "last_read": {
                "ok": True,
                "age_seconds": i,
            },
        }
        hs.update_cache(checks)
        time.sleep(1)

    # The server runs in a daemon thread, so keep the main thread alive;
    # otherwise the process exits and the endpoint disappears with it.
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("\nStopped")

Sample output, taken while the simulated loop is at i = 3:

$ curl -s http://localhost:8080/health/detail | python -m json.tool

{
    "gateway_id": "gw_plant1",
    "status": "degraded",
    "updated_at": "2026-06-11T02:30:00Z",
    "checks": {
        "plc_connection": {
            "ok": false,
            "latency_ms": 4.5
        },
        "mqtt_connection": {
            "ok": true,
            "latency_ms": 1.8
        },
        "cache_depth": {
            "ok": true,
            "depth": 300,
            "max": 10000,
            "usage_pct": 3.0
        },
        "last_read": {
            "ok": true,
            "age_seconds": 3
        }
    }
}
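
The usage example feeds hand-written check results into the cache. In a real gateway those dictionaries would be computed from live state. The sketch below shows one way to derive the `cache_depth` and `last_read` entries; the function name `build_checks` and both thresholds (80% usage, 10 s age) are hypothetical values chosen for illustration, not recommendations from this article.

```python
import time


def build_checks(cache_depth, cache_max, last_read_ts, now=None,
                 max_usage_pct=80.0, max_age_s=10.0):
    """Turn raw gateway state into check entries with an "ok" flag each.

    cache_depth / cache_max: current and maximum backlog size
    last_read_ts: epoch seconds of the last successful PLC read
    now: injectable clock value, which makes the function easy to test
    """
    now = time.time() if now is None else now
    usage_pct = round(cache_depth / cache_max * 100, 1)
    age_seconds = round(now - last_read_ts, 1)
    return {
        "cache_depth": {
            "ok": usage_pct < max_usage_pct,
            "depth": cache_depth,
            "max": cache_max,
            "usage_pct": usage_pct,
        },
        "last_read": {
            "ok": age_seconds <= max_age_s,
            "age_seconds": age_seconds,
        },
    }


if __name__ == "__main__":
    print(build_checks(300, 10000, last_read_ts=100.0, now=103.0))
```

The returned dictionary has the same shape as the entries in the usage example, so a background thread could merge it with the connection checks and pass the result to `update_cache`.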

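6. Case review: the complete investigation of the reactor temperature jumps

To finish, let us return to the case from the beginning of the article and walk through the full troubleshooting process.

Background

Symptom: the reactor temperature jumps once every 5 seconds (85°C → 120°C → 85°C) and recovers after 2 seconds
Chain: PT100 RTD → analog input module (Siemens SM1231 AI) → S7-1200 PLC → Modbus TCP → acquisition gateway → MQTT → cloud platform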

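Troubleshooting process

| Step | Action | Finding | Conclusion |
|------|--------|---------|------------|
| 1 | Check the HMI on the PLC itself | Temperature reads 86.3°C, stable | ✗ L4-L5 are not the root cause |
| 2 | Connect to the PLC directly from the gateway side and read the raw registers | The gateway log shows values jumping between 85 and 120 | ✓ The gateway reads exactly the jumping data |
| 3 | Test the PLC's Modbus response with `modbus_ping.py` | RTT stable at 3-5 ms, no loss | ✗ L3 network layer is fine |
| 4 | Measure the signal at the SM1231 AI channel terminals with a multimeter | Signal stable, no jumps | ✗ L1 sensor → AI module is fine |
| 5 | Measure the voltage from the AI channel's COM terminal to ground with a multimeter | Common-mode voltage of 18 V AC | ✓ L1 grounding fault |
| 6 | Inspect the wiring | The RTD cable shield is left floating at the PLC side, and the cable runs parallel to the VFD power cable for 20 m | ✓ Root cause |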

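Root cause

The PT100 RTD is wired in a 3-wire configuration, in which one wire is used for lead compensation. But with the shield not grounded at one end, and the signal cable laid parallel to the VFD power cable for 20 meters, high-frequency common-mode interference from the power cable coupled onto the compensation wire, so the AI module's measured value jumped at the interference peaks. The HMI on the PLC shows a value that has passed through internal digital filtering (an average of 5 samples), so it looks stable; Modbus, in contrast, reads the AI module's raw process value, which is unfiltered.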

Fix

[Diagram: signal path before and after the fix]
Before the fix: PT100 RTD → (3-wire, shield floating) → SM1231 AI, with electromagnetic interference from the VFD coupling into the signal path.
After the fix: PT100 RTD → (3-wire, shield grounded at one end) → SM1231 AI → (filtered, stable value) → S7-1200 → (Modbus) → acquisition gateway.

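Fix steps:

1. Ground the PT100 shield at one end, on the grounding bar in the PLC cabinet.
2. Move the signal cable out of the power cable tray (from a 20 m parallel run to a 90° crossing).
3. (Optional) Add a median filter on the gateway side: read 3 times in a row and take the median.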
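
The optional median filter from the last step can be sketched in a few lines. `read_once` is a hypothetical callable standing in for whatever function performs a single register read on your gateway; it is not defined anywhere in this article.

```python
from statistics import median


def read_median(read_once, samples=3):
    """Call read_once() `samples` times and return the median reading.

    With 3 samples, a single outlier is discarded no matter how large it is.
    """
    return median(read_once() for _ in range(samples))


if __name__ == "__main__":
    # Simulated readings: one interference spike between two good values.
    readings = iter([85.2, 120.0, 85.4])
    print(read_median(lambda: next(readings)))
```

A median of 3 only helps when the disturbance is shorter than the time spanned by the three reads. In this case the spike lasted about 2 seconds, so three back-to-back reads could all land inside it; the reads would need to be spaced out, which is one more reason to treat the filter as a supplement to the wiring fix rather than a replacement.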

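Lesson: if you see abnormal data at L5, do not rush to suspect the acquisition code. Go and look at the grounding and wiring at L1 first; that is faster than digging through any log.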

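7. Summary

The core of acquisition-chain diagnosis is not which tool you use, but at which layer you find the problem.

| Stage | Tool / method | Layers it can rule out | Time |
|-------|---------------|------------------------|------|
| Step 1 | Check the PLC HMI or measure the sensor with a multimeter | L1 | 5 minutes |
| Step 2 | `modbus_ping.py` / `opcua_scan.py` | L2 | 2 minutes |
| Step 3 | `jitter_measure.py` to measure the acquisition cycle | L4 | 2 minutes |
| Step 4 | `mqtt_sniff.py` to subscribe and watch the message flow | L3-L4 | 3 minutes |
| Step 5 | `/metrics` to inspect the latency histogram | L3-L4 | 1 minute |
| Step 6 | `/health/detail` to inspect component-level health | L1-L5 | 10 seconds |
| Step 7 | Structured logs + trace analysis | L1-L5 | 5 minutes |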

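One last thing, a mnemonic to stick on your desk:

Frozen values: check the PLC; jumping values: measure the grounding;

Missing data: look at the network; duplicate data: check retransmission;

Cycle drift: look at the CPU; odd timestamps: look at NTP;

Trust the multimeter, not the logs; L1 is the first checkpoint.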

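👉 Next up: [PLC Data Acquisition Series 11] Large-scale acquisition architecture: from a single gateway to a thousand-point cluster. The first 10 articles all assumed one gateway collecting from a few PLCs and tens to hundreds of points. But once a plant has 5,000 measurement points, 100 PLCs, and multiple sites, the single-gateway model breaks down completely. The next article digs into horizontal scaling of acquisition nodes, sharding strategies (by area / by protocol / by frequency), two-level aggregation, Leader Election and Failover in an Edge cluster, global timestamp alignment, and centralized configuration management (etcd / Consul). It is the final leap from "works" to "works at scale".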
