项目七:实时异常检测与告警系统------基于统计与机器学习的数据质量监控平台
目录
项目七:实时异常检测与告警系统------基于统计与机器学习的数据质量监控平台
[7.1 数据流接入](#7.1 数据流接入)
[7.1.1 多协议支持:HTTP/WebSocket/MQTT数据接入适配器](#7.1.1 多协议支持:HTTP/WebSocket/MQTT数据接入适配器)
[7.1.2 Kafka Streams处理:实时数据清洗与标准化转换](#7.1.2 Kafka Streams处理:实时数据清洗与标准化转换)
[7.1.3 窗口聚合:滑动窗口指标计算(均值/方差/分位数)](#7.1.3 窗口聚合:滑动窗口指标计算(均值/方差/分位数))
[7.1.4 数据采样:水库抽样与流式数据特征存储](#7.1.4 数据采样:水库抽样与流式数据特征存储)
[7.2 检测算法实现](#7.2 检测算法实现)
[7.2.1 统计检测:3-Sigma原则与孤立森林(Isolation Forest)实现](#7.2.1 统计检测:3-Sigma原则与孤立森林(Isolation Forest)实现)
[7.2.2 时序预测:Prophet/LSTM基带预测与残差异常判定](#7.2.2 时序预测:Prophet/LSTM基带预测与残差异常判定)
[7.2.3 模式识别:日志模板提取与异常模式匹配](#7.2.3 模式识别:日志模板提取与异常模式匹配)
[7.2.4 多维分析:基于聚类的多维属性异常根因分析](#7.2.4 多维分析:基于聚类的多维属性异常根因分析)
[7.3 告警引擎](#7.3 告警引擎)
[7.3.1 分级告警:P0/P1/P2级别与升级(Escalation)策略](#7.3.1 分级告警:P0/P1/P2级别与升级(Escalation)策略)
[7.3.2 告警抑制:相似告警合并与抖动窗口去重](#7.3.2 告警抑制:相似告警合并与抖动窗口去重)
[7.3.3 通知渠道:PagerDuty/Slack/钉钉/企业微信多渠道适配](#7.3.3 通知渠道:PagerDuty/Slack/钉钉/企业微信多渠道适配)
[7.3.4 告警自愈:Webhook触发自动修复脚本执行](#7.3.4 告警自愈:Webhook触发自动修复脚本执行)
[7.4 可视化与调查](#7.4 可视化与调查)
[7.4.1 实时仪表板:Grafana面板与异常事件时间线展示](#7.4.1 实时仪表板:Grafana面板与异常事件时间线展示)
[7.4.2 下钻分析:维度切片与相关指标关联展示](#7.4.2 下钻分析:维度切片与相关指标关联展示)
[7.4.3 案例管理:异常工单创建与处理状态跟踪](#7.4.3 案例管理:异常工单创建与处理状态跟踪)
[7.4.4 影响分析:异常传播链路追踪与依赖图谱](#7.4.4 影响分析:异常传播链路追踪与依赖图谱)
[7.5 模型管理](#7.5 模型管理)
[7.5.1 在线学习:增量更新与概念漂移(Concept Drift)检测](#7.5.1 在线学习:增量更新与概念漂移(Concept Drift)检测)
[7.5.2 模型版本:MLflow模型注册与灰度发布](#7.5.2 模型版本:MLflow模型注册与灰度发布)
[7.5.3 冷启动:历史数据回放与初始模型训练](#7.5.3 冷启动:历史数据回放与初始模型训练)
[7.5.4 反馈闭环:人工标注结果回流训练集](#7.5.4 反馈闭环:人工标注结果回流训练集)
[脚本7.1.1.1:多协议数据接入适配器(HTTP/WebSocket/MQTT)](#脚本7.1.1.1:多协议数据接入适配器(HTTP/WebSocket/MQTT))
[脚本7.1.1.2:Kafka Streams实时处理引擎](#脚本7.1.1.2:Kafka Streams实时处理引擎)
[脚本7.2.1.2:孤立森林(Isolation Forest)实现](#脚本7.2.1.2:孤立森林(Isolation Forest)实现)
第一部分:原理详解
7.1 数据流接入
7.1.1 多协议支持:HTTP/WebSocket/MQTT数据接入适配器
现代分布式监控系统的数据接入层必须处理异构数据源的高并发写入。HTTP协议基于请求-响应范式,适用于周期性指标上报场景,其无状态特性便于水平扩展,但高频短连接会带来显著的开销。WebSocket提供全双工持久连接,在实时日志流和事件推送场景中具有更低的延迟,通过帧级别的流量控制可应对突发流量。MQTT作为轻量级发布-订阅协议,专为资源受限的IoT设备设计,其QoS等级(0/1/2)实现了可靠性与吞吐量的权衡。
协议适配器的核心设计遵循适配器模式(Adapter Pattern),通过抽象接口统一不同协议的语义差异。对于HTTP接入,采用异步非阻塞I/O模型(如Python的aiohttp或Node.js的cluster模块),利用Keep-Alive连接池减少TCP握手开销。WebSocket实现需处理心跳检测、断线重连与背压机制(Backpressure),当消费者速率低于生产者时,通过缓冲区限流或丢弃策略防止内存溢出。MQTT代理(Broker)的选型需考虑主题通配符的匹配效率,基于Trie树的订阅路由算法可将复杂度从O(n) 降至O(m) ,其中n 为订阅总数,m 为主题层级深度。
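上述基于Trie树的主题订阅路由可以用如下最小示例说明:"+"匹配单层、"#"匹配剩余所有层,匹配过程沿主题层级深度展开,与订阅总数无关。类名与接口均为本文示意,并非任何Broker的实际实现。

```python
class TopicTrie:
    """基于Trie的MQTT主题订阅路由(示意实现)"""

    def __init__(self):
        self.children = {}      # 层级token -> 子Trie节点
        self.subscribers = set()

    def subscribe(self, topic_filter: str, client_id: str):
        """按'/'分层插入订阅,支持 '+'(单层)与 '#'(多层)通配符"""
        node = self
        for token in topic_filter.split('/'):
            node = node.children.setdefault(token, TopicTrie())
        node.subscribers.add(client_id)

    def match(self, topic: str):
        """返回所有匹配该发布主题的订阅者集合"""
        result = set()
        self._match(topic.split('/'), 0, result)
        return result

    def _match(self, levels, i, result):
        if i == len(levels):
            result |= self.subscribers
            # '#' 也匹配父层级本身(MQTT规范行为)
            if '#' in self.children:
                result |= self.children['#'].subscribers
            return
        # '#' 匹配剩余所有层级
        if '#' in self.children:
            result |= self.children['#'].subscribers
        # '+' 匹配当前单层
        if '+' in self.children:
            self.children['+']._match(levels, i + 1, result)
        # 精确匹配当前token
        if levels[i] in self.children:
            self.children[levels[i]]._match(levels, i + 1, result)
```

匹配复杂度仅取决于主题层级深度m与各层分支数,与订阅总数n无关,这正是正文所述O(n)到O(m)的改进来源。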
7.1.2 Kafka Streams处理:实时数据清洗与标准化转换
Apache Kafka作为分布式流处理平台,其Streams API提供了有状态计算能力。数据清洗阶段涉及格式校验、缺失值处理与类型转换。对于JSON或Protobuf格式的半结构化数据,采用Schema Registry进行版本化校验,通过Avro Schema定义字段约束,拒绝不符合规范的脏数据。标准化转换将异构数据源映射为统一的内部表示(Canonical Data Model),包括时间戳对齐(统一为Unix毫秒或ISO 8601格式)、度量单位换算(如字节到比特的转换)以及标签规范化(字符串小写、去除空白)。
Kafka Streams的拓扑设计利用分区(Partition)实现并行处理,通过自定义分区器(Partitioner)确保相同设备ID的数据路由至同一分区,维持局部有序性。状态存储(State Store)采用RocksDB或内存哈希表,支持基于事件时间(Event Time)的窗口聚合。Exactly-Once语义通过事务性生产者(Transactional Producer)与消费者组协调实现,确保数据不丢失且不重复。背压处理依赖消费者拉取(Pull)模型,通过调整max.poll.records与fetch.min.bytes参数平衡延迟与吞吐量。
7.1.3 窗口聚合:滑动窗口指标计算(均值/方差/分位数)
流式数据的连续特性要求通过时间窗口离散化计算统计特征。滑动窗口(Sliding Window)与滚动窗口(Tumbling Window)的区别在于滑动步长(Slide)与窗口长度(Size)的关系:当Slide<Size 时产生重叠窗口,适用于平滑波动检测;当Slide=Size 时无重叠,适用于批量处理。会话窗口(Session Window)通过超时阈值(Timeout)动态划分,适合用户行为分析。
窗口内统计量的计算需考虑增量更新以降低复杂度。均值μ 的递推公式为:
μ_n = μ_{n−1} + (x_n − μ_{n−1}) / n
方差σ2 采用Welford算法避免数值溢出:
M_n = M_{n−1} + (x_n − μ_{n−1})(x_n − μ_n)
σ² = M_n / n
其中Mn 为二阶中心矩的累加器。分位数计算采用t-Digest或KLL Sketch等近似算法,在内存受限条件下提供可接受的误差边界(ϵ≈0.01 )。滑动窗口的实现通常基于环形缓冲区(Circular Buffer)或双端队列(Deque),维护窗口边界内的数据点集合,当新数据到达时驱逐过期数据并更新统计量。
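上述均值与方差的递推公式可按如下示意实现:用deque模拟环形缓冲区,新值用Welford正向更新,过期值用精确的逆向更新移除。仅为说明性草图,非生产实现。

```python
from collections import deque

class SlidingStats:
    """滑动窗口均值/方差的增量计算(Welford正向更新+精确逆向移除)"""

    def __init__(self, size: int):
        self.size = size
        self.buf = deque()
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # 二阶中心矩累加器 M_n

    def push(self, x: float):
        """新数据到达:更新统计量并驱逐过期数据"""
        self.buf.append(x)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if len(self.buf) > self.size:
            self._evict(self.buf.popleft())

    def _evict(self, x: float):
        """逆向Welford:从n个样本中精确移除x"""
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        old_mean = self.mean
        self.mean = (self.n * old_mean - x) / (self.n - 1)
        self.n -= 1
        # 恒等式:M_{n-1} = M_n - (x - mean_old)(x - mean_new)
        self.m2 = max(0.0, self.m2 - (x - old_mean) * (x - self.mean))

    def variance(self) -> float:
        return self.m2 / self.n if self.n > 0 else 0.0  # 总体方差 σ² = M_n / n
```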
7.1.4 数据采样:水库抽样与流式数据特征存储
当数据流速超过存储容量或分析需求时,采样技术成为关键。水库抽样(Reservoir Sampling)适用于未知总量的数据流,从N 个元素中均匀抽取k 个样本,每个元素被选中的概率为k/N 。算法维护大小为k 的蓄水池,对于第i 个元素(i>k ),以k/i 的概率替换池中随机元素,数学归纳法可证明该策略的无偏性。
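该算法的一个最小实现与无偏性验证如下(示意代码):

```python
import random

def reservoir_sample(stream, k: int, rng: random.Random):
    """水库抽样:从未知长度的流中等概率抽取k个样本"""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)      # 前k个元素直接入池
        else:
            j = rng.randint(1, i)       # 以 k/i 的概率替换池中随机元素
            if j <= k:
                reservoir[j - 1] = item
    return reservoir
```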
分层抽样(Stratified Sampling)针对类别不平衡数据,按关键维度(如错误类型、服务名称)划分层,在每层内独立抽样以保证稀有事件的覆盖。流式特征存储需平衡时效性与压缩率,时序数据库(如InfluxDB、TimescaleDB)采用列式存储与专用压缩算法(Gorilla压缩对于浮点数可达10:1压缩比),支持按标签索引与降采样(Downsampling)查询。特征工程在流式场景下需增量计算,如滑动协方差矩阵用于多维异常检测,通过Sherman-Morrison公式更新逆矩阵避免O(d3) 的重计算。
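文中提到的Sherman-Morrison秩一更新可按如下方式验证(示意实现,生产环境还需处理数值病态问题):

```python
import numpy as np

def sherman_morrison_update(A_inv: np.ndarray, u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """秩一更新:(A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)
    将O(d^3)的全量求逆降为O(d^2)的增量更新"""
    Au = A_inv @ u
    vA = v @ A_inv
    denom = 1.0 + v @ Au
    if abs(denom) < 1e-12:
        raise ValueError("更新后的矩阵接近奇异,Sherman-Morrison不适用")
    return A_inv - np.outer(Au, vA) / denom
```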
7.2 检测算法实现
7.2.1 统计检测:3-Sigma原则与孤立森林(Isolation Forest)实现
3-Sigma原则基于正态分布假设,认为99.7%的数据应落在均值μ 的3倍标准差σ 范围内。异常得分定义为:
s(x) = |x − μ| / σ
当s(x)>3 时判定为异常。该方法对高斯分布数据有效,但在偏态分布或存在离群点污染(Contamination)时,稳健统计(Robust Statistics)采用中位数绝对偏差(MAD)替代标准差:
MAD = median(|x_i − median(x)|)
s_robust(x) = |x − median(x)| / (1.4826 × MAD)
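基于MAD的稳健得分可实现为(示意):

```python
import numpy as np

def robust_zscores(x: np.ndarray) -> np.ndarray:
    """基于MAD的稳健异常得分:s(x) = |x - median| / (1.4826 * MAD)"""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # 半数以上样本相同,退化情形
        return np.zeros_like(x, dtype=float)
    return np.abs(x - med) / (1.4826 * mad)
```

与基于均值/标准差的3-Sigma不同,离群点几乎不影响中位数与MAD,因此污染数据下阈值依然稳定。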
孤立森林(Isolation Forest)通过随机超平面切割数据空间,异常点因稀疏性通常位于树的浅层。算法构建t 棵二叉树,每棵树随机选择特征与分割值,样本x 的路径长度h(x) 经归一化后得到异常得分:
s(x, n) = 2^( −E[h(x)] / c(n) )
其中c(n) 为样本数n 的平均路径长度修正项。该算法时间复杂度为O(tψlogψ) ,ψ 为子采样大小,对高维数据具有线性可扩展性。
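按上述思路,一个省略了子采样上限等细节的极简孤立森林示意如下(并非论文原始实现):

```python
import math
import random

def c(n: int) -> float:
    """平均路径长度修正项 c(n) ≈ 2H(n-1) - 2(n-1)/n"""
    if n <= 1:
        return 0.0
    h = math.log(n - 1) + 0.5772156649  # 调和数的近似(欧拉常数)
    return 2 * h - 2 * (n - 1) / n

def build_tree(data, depth, max_depth, rng):
    """随机选维度与分割值递归切割,返回嵌套元组树"""
    if depth >= max_depth or len(data) <= 1:
        return ('leaf', len(data))
    dim = rng.randrange(len(data[0]))
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:
        return ('leaf', len(data))
    split = rng.uniform(lo, hi)
    left = [p for p in data if p[dim] < split]
    right = [p for p in data if p[dim] >= split]
    return ('node', dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, x, depth=0):
    """样本在单棵树中的路径长度,叶内剩余样本按平均深度补偿"""
    if tree[0] == 'leaf':
        return depth + c(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)

def anomaly_score(forest, x, n):
    """s(x, n) = 2^(-E[h(x)] / c(n)),越接近1越异常"""
    eh = sum(path_length(t, x) for t in forest) / len(forest)
    return 2 ** (-eh / c(n))

def fit_forest(data, n_trees=50, seed=0):
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(max(len(data), 2)))
    return [build_tree(data, 0, max_depth, rng) for _ in range(n_trees)]
```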
7.2.2 时序预测:Prophet/LSTM基带预测与残差异常判定
时间序列异常检测依赖于对未来值的准确预测。Prophet模型由Facebook开发,将序列分解为趋势(Trend)、季节性(Seasonality)与节假日效应(Holidays):
y(t) = g(t) + s(t) + h(t) + ε_t
趋势采用分段线性或逻辑增长函数,通过变点(Changepoints)检测自动识别趋势转折。季节性利用傅里叶级数拟合:
s(t) = Σ_{n=1}^{N} ( a_n·cos(2πnt/P) + b_n·sin(2πnt/P) )
其中P 为周期(如24小时、7天),N 为谐波阶数。
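上式的季节性分量可用最小二乘直接拟合傅里叶系数来演示(仅为原理示意,并非Prophet内部实现):

```python
import numpy as np

def fourier_features(t: np.ndarray, period: float, order: int) -> np.ndarray:
    """构造傅里叶回归设计矩阵:[cos(2πnt/P), sin(2πnt/P)], n = 1..N"""
    cols = []
    for n in range(1, order + 1):
        cols.append(np.cos(2 * np.pi * n * t / period))
        cols.append(np.sin(2 * np.pi * n * t / period))
    return np.column_stack(cols)

def fit_seasonality(t, y, period, order):
    """最小二乘估计系数 a_n, b_n,返回可在任意时间点求值的拟合函数"""
    X = fourier_features(np.asarray(t, dtype=float), period, order)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda tt: fourier_features(np.asarray(tt, dtype=float), period, order) @ coef
```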
长短期记忆网络(LSTM)通过门控机制捕捉长期依赖。输入门it 、遗忘门ft 与输出门ot 控制细胞状态Ct 的更新:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
预测残差ϵt=yt−y^t 的分布若显著偏离零均值高斯分布(通过Kolmogorov-Smirnov检验或阈值判断),则触发异常告警。
7.2.3 模式识别:日志模板提取与异常模式匹配
非结构化日志的异常检测需先将自由文本转化为结构化事件。日志模板提取(Log Template Extraction)识别常量部分(如"Connection failed from *")与变量部分(IP地址、时间戳)。Drain算法采用固定深度解析树(Parse Tree),将日志按长度与首token分层,通过相似度阈值合并同类模板,时间复杂度为O(d×n) ,d 为树深度。
异常模式匹配基于提取的模板序列,采用有限状态自动机(FSA)或隐马尔可夫模型(HMM)建模正常执行路径。HMM的状态转移概率矩阵A 与观测概率矩阵B 通过Baum-Welch算法训练,给定观测序列O ,计算最可能状态路径:
δ_t(i) = max_j [ δ_{t−1}(j) · a_{ji} ] · b_i(O_t)
当观测到训练集中未出现的模板转移(新颖性检测)或低概率转移(异常性检测)时,判定为异常。深度学习方案如LogBERT采用Transformer架构,通过掩码语言建模(Masked Language Modeling)学习日志序列的上下文表示,重建误差作为异常得分。
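上面的Viterbi递推可在对数域实现以防数值下溢(示意实现):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """对数域Viterbi:返回最可能的隐状态路径及其对数概率
    pi: 初始分布 (S,); A: 转移矩阵 (S,S); B: 观测矩阵 (S,O); obs: 观测序列"""
    S = len(pi)
    T = len(obs)
    log = lambda m: np.log(np.asarray(m, dtype=float) + 1e-300)  # 避免log(0)
    lp, lA, lB = log(pi), log(A), log(B)
    delta = np.zeros((T, S))
    psi = np.zeros((T, S), dtype=int)  # 回溯指针
    delta[0] = lp + lB[:, obs[0]]
    for t in range(1, T):
        for i in range(S):
            scores = delta[t - 1] + lA[:, i]   # δ_{t-1}(j) + log a_{ji}
            psi[t, i] = int(np.argmax(scores))
            delta[t, i] = scores[psi[t, i]] + lB[i, obs[t]]
    # 从末端最优状态回溯完整路径
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(np.max(delta[-1]))
```

最优路径的对数概率过低即可作为"低概率转移"的异常判据。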
7.2.4 多维分析:基于聚类的多维属性异常根因分析
现代系统的监控数据具有高维属性(维度d 可达数百),单维度检测无法捕捉属性间的相关性异常。聚类算法将正常样本划分为密集簇,离群点作为异常。DBSCAN基于密度可达性,定义核心点(Core Point)为ϵ 邻域内包含至少MinPts 个样本的点,簇由密度相连的点构成。时间复杂度为O(nlogn) (采用空间索引如R-tree),对噪声鲁棒。
高维空间中的距离度量失效(Curse of Dimensionality)促使子空间聚类(Subspace Clustering)的发展,如CLIQUE算法将数据空间划分为网格单元,在密集单元投影中搜索聚类。孤立森林亦可扩展至多维,通过随机选择特征子集与分割值构建树结构。根因分析(Root Cause Analysis)在检测异常后,通过维度钻取(Drill-down)定位异常源,采用Apriori或FP-Growth挖掘频繁项集,识别导致异常的属性组合(如"Region=US AND Service=Payment")。
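"定位导致异常的属性组合"可以用简单的频繁项集计数示意(简化的Apriori思路,数据为假设示例):

```python
from collections import Counter
from itertools import combinations

def frequent_attribute_sets(anomalies, min_support=0.5, max_size=2):
    """统计异常事件中高频出现的属性组合,按支持度降序返回"""
    n = len(anomalies)
    counter = Counter()
    for event in anomalies:
        items = sorted(f"{k}={v}" for k, v in event.items())
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counter[combo] += 1
    return sorted(
        ((combo, cnt / n) for combo, cnt in counter.items() if cnt / n >= min_support),
        key=lambda kv: (-kv[1], -len(kv[0])),
    )
```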
7.3 告警引擎
7.3.1 分级告警:P0/P1/P2级别与升级(Escalation)策略
告警分级基于业务影响与紧急程度。P0(Critical)指示服务中断或数据丢失,需立即人工介入;P1(High)表示性能降级或部分功能受损,响应时间目标(SLO)通常为15分钟;P2(Medium/Low)为警告或优化建议,允许异步处理。分级决策依赖动态阈值与业务规则引擎,如基于故障树分析(Fault Tree Analysis)计算根事件概率。
升级策略(Escalation Policy)确保未确认告警的及时处理。时间衰减函数定义升级间隔:
T_escalate = T_base × α^k
其中k 为升级层级,α 为衰减系数(通常1.5≤α≤2 )。告警风暴(Alert Storm)抑制通过依赖图谱剪枝,若父节点(如数据库集群)已触发P0,子节点(单个实例)的同类告警自动降级或抑制。工作流引擎(如Temporal、Cadence)编排通知序列,支持延迟、重试与条件分支。
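按上式生成各层级升级间隔的示意:

```python
def escalation_schedule(t_base: float, alpha: float, levels: int):
    """生成各升级层级的等待间隔:T_k = T_base * alpha^k"""
    return [t_base * alpha ** k for k in range(levels)]
```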
7.3.2 告警抑制:相似告警合并与抖动窗口去重
告警抑制(Suppression)减少噪音并防止运维疲劳。相似性度量采用Jaccard系数或编辑距离计算告警内容(标题、标签、描述)的相似度:
J(A, B) = |A ∩ B| / |A ∪ B|
当J(A,B)>θ (通常0.8)时合并为单一告警,计数器记录发生频次。
抖动窗口(Flapping Window)处理间歇性故障导致的告警震荡。状态机定义告警生命周期:触发(Firing)→确认(Acknowledged)→解决(Resolved)→静默(Silenced)。若在窗口W 内告警反复触发-解决超过n 次,则提升稳定期要求或调整检测阈值。去重缓存采用布隆过滤器(Bloom Filter)或LRU缓存,键值为告警指纹(哈希值),空间效率为O(1) 但允许可控的假阳性率ϵ 。
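指纹去重与抖动计数可按如下草图实现(窗口与阈值参数均为示意取值):

```python
import hashlib
from collections import deque

class AlertDeduplicator:
    """基于指纹的告警去重与抖动(Flapping)计数(示意实现)"""

    def __init__(self, flap_window: float = 60.0, flap_threshold: int = 3):
        self.seen = {}          # 指纹 -> 合并计数
        self.transitions = {}   # 指纹 -> 窗口内触发时间戳队列
        self.flap_window = flap_window
        self.flap_threshold = flap_threshold

    @staticmethod
    def fingerprint(title: str, labels: dict) -> str:
        """按标题+排序后的标签生成稳定指纹"""
        key = title + '|' + '|'.join(f"{k}={labels[k]}" for k in sorted(labels))
        return hashlib.sha256(key.encode()).hexdigest()[:16]

    def ingest(self, title: str, labels: dict, now: float) -> bool:
        """返回True表示应发出新告警,False表示合并到已有告警"""
        fp = self.fingerprint(title, labels)
        first = fp not in self.seen
        self.seen[fp] = self.seen.get(fp, 0) + 1
        q = self.transitions.setdefault(fp, deque())
        q.append(now)
        while q and now - q[0] > self.flap_window:
            q.popleft()  # 驱逐窗口外的历史触发
        return first

    def is_flapping(self, title: str, labels: dict) -> bool:
        fp = self.fingerprint(title, labels)
        return len(self.transitions.get(fp, ())) > self.flap_threshold
```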
7.3.3 通知渠道:PagerDuty/Slack/钉钉/企业微信多渠道适配
多渠道适配器遵循策略模式(Strategy Pattern),将告警抽象为统一领域模型(标题、严重级别、上下文链接、可操作按钮),通过模板引擎渲染为各渠道特定格式。PagerDuty集成利用事件API v2,支持严重级别映射、事件丰富(Event Enrichment)与响应人轮询(On-call Rotation)。Slack通过Incoming Webhooks或Block Kit构建交互式消息,支持按钮确认与日志查看。
企业IM工具(钉钉、企业微信、飞书)提供签名验证机制(HMAC-SHA256或RSA),确保消息来源可信。富文本消息采用Markdown子集,限制字段长度(如钉钉单消息4096字节)需截断或分片发送。通知路由策略基于告警属性(服务、团队、环境)与接收人偏好(时区、免打扰时段),通过决策表或规则引擎(如Drools)动态选择渠道与接收人组。
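以钉钉机器人文档描述的加签流程为例(对 timestamp + "\n" + secret 做HMAC-SHA256,Base64编码后再URL编码;细节以官方文档为准),签名计算与校验示意如下:

```python
import base64
import hashlib
import hmac
import urllib.parse

def dingtalk_sign(timestamp_ms: int, secret: str) -> str:
    """钉钉机器人加签:HMAC-SHA256(secret, f"{ts}\\n{secret}") -> Base64 -> URL编码"""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(secret.encode('utf-8'),
                      string_to_sign.encode('utf-8'),
                      digestmod=hashlib.sha256).digest()
    return urllib.parse.quote_plus(base64.b64encode(digest))

def verify_sign(timestamp_ms: int, secret: str, sign: str,
                max_skew_ms: int, now_ms: int) -> bool:
    """校验签名并拒绝时间偏移过大的请求(防重放)"""
    if abs(now_ms - timestamp_ms) > max_skew_ms:
        return False
    expected = dingtalk_sign(timestamp_ms, secret)
    return hmac.compare_digest(expected, sign)
```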
7.3.4 告警自愈:Webhook触发自动修复脚本执行
告警自愈(Auto-remediation)通过自动化操作减少人工干预。触发条件需严格限定,如特定类型的已知故障(磁盘满、服务僵死、配置漂移)。Webhook接收器验证请求签名,解析告警载荷中的上下文(实例ID、故障类型、环境变量),调用预定义 playbook。
修复脚本执行环境隔离于沙箱(容器或受限Shell),防止权限滥用。幂等性设计确保重复执行的安全性,如重启服务前检查进程状态。操作审计日志记录执行命令、输出与结果,支持回滚(Rollback)机制。对于复杂故障,采用人工确认(Human-in-the-loop)模式,发送修复建议与一键确认按钮,结合强化学习(RL)从历史决策中优化建议策略,奖励函数定义为MTTR(平均修复时间)的减少量。
7.4 可视化与调查
7.4.1 实时仪表板:Grafana面板与异常事件时间线展示
实时仪表板需平衡数据密度与视觉清晰度。Grafana作为开源可视化平台,支持多种数据源(Prometheus、Elasticsearch、InfluxDB),通过面板(Panel)组织图表。时序图采用降采样(LTTB或Min-Max算法)减少渲染点数,保持形状特征的同时降低浏览器负载。异常高亮通过条件格式化(Thresholds)或注释(Annotations)实现,将异常事件叠加于指标曲线。
异常事件时间线(Timeline)展示告警生命周期,采用泳道图(Swimlane)区分不同服务或严重性级别。交互功能包括范围选择(Brush Zoom)、下钻(Drill-down)链接与变量模板(Templating),允许用户通过下拉菜单切换维度。实时更新依赖WebSocket或Server-Sent Events(SSE),推送间隔根据数据流速动态调整,避免前端卡顿。
7.4.2 下钻分析:维度切片与相关指标关联展示
下钻分析(Drill-down Analysis)支持从聚合视图导航至明细数据。维度切片(Slicing)按属性过滤(如从集群级下钻至节点级),通过URL参数或状态管理传递上下文。关联指标(Correlated Metrics)识别通过皮尔逊相关系数或互信息(Mutual Information)量化:
I(X; Y) = Σ_{x,y} p(x, y) · log[ p(x, y) / (p(x)·p(y)) ]
高相关性指标在仪表板中并排展示,辅助根因定位。
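离散变量的互信息可按定义直接计算(示意实现,单位为nat):

```python
import numpy as np

def mutual_information(x, y) -> float:
    """离散互信息 I(X;Y) = Σ p(x,y) log[p(x,y) / (p(x)p(y))]"""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # 联合频率
            if pxy > 0:
                px = np.mean(x == xv)
                py = np.mean(y == yv)
                mi += pxy * np.log(pxy / (px * py))
    return mi
```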
拓扑图(Topology Map)基于依赖追踪数据(如OpenTelemetry的trace)构建服务调用图,节点大小表示流量,颜色表示健康状态。力导向图(Force-directed Graph)或层次布局(Hierarchical Layout)呈现复杂依赖关系,支持路径高亮与异常传播模拟。
7.4.3 案例管理:异常工单创建与处理状态跟踪
案例管理(Case Management)将告警转化为可追踪的工作项。工单(Ticket)包含元数据(ID、时间戳、严重级别、指派对象、相关资产)与协作内容(评论、附件、审计日志)。状态机定义流转规则:新建(New)→处理中(In Progress)→待验证(Pending Verification)→已解决(Resolved)→已关闭(Closed)。
集成ITSM平台(如ServiceNow、Jira Service Management)通过REST API同步状态,避免双轨记录。知识库(Knowledge Base)关联相似历史案例,基于文本相似度(TF-IDF或BERT嵌入)推荐解决方案。SLA(服务等级协议)监控确保响应时效,升级规则与告警引擎联动。
7.4.4 影响分析:异常传播链路追踪与依赖图谱
影响分析(Impact Analysis)评估故障的业务后果。依赖图谱(Dependency Graph)通过服务发现(Consul、Eureka)或追踪数据自动构建,边权重表示调用频率或延迟。故障传播模型采用贝叶斯网络或PageRank变体,计算节点故障对下游服务的级联影响概率。
链路追踪(Distributed Tracing)通过OpenTelemetry SDK注入Trace ID与Span ID,记录请求全路径。异常Span标记错误类型(HTTP 5xx、超时、异常抛出),火焰图(Flame Graph)展示调用耗时分布。拓扑分析识别单点故障(Single Point of Failure)与关键路径(Critical Path),为容量规划与容错设计提供依据。
7.5 模型管理
7.5.1 在线学习:增量更新与概念漂移(Concept Drift)检测
在线学习(Online Learning)使模型适应数据分布变化。增量更新(Incremental Update)通过单样本或微批次(Mini-batch)调整模型参数,无需全量重训练。随机梯度下降(SGD)及其变体(Adam、RMSprop)支持在线优化,正则化项防止灾难性遗忘(Catastrophic Forgetting)。
概念漂移(Concept Drift)指数据分布P(X) 或条件分布P(Y∣X) 随时间变化。漂移检测方法包括:
- 统计检验:Kolmogorov-Smirnov检验比较近期与历史窗口的分布差异;
- 监控指标:跟踪模型性能(准确率、F1-score)的衰减,若连续k 个窗口低于阈值则触发重训练;
- 自适应窗口:ADWIN(Adaptive Windowing)动态调整参考窗口大小,在漂移点自动分割。
集成方法(如Streaming Random Forest)通过替换表现差的基学习器维持整体性能,权重更新遵循指数加权移动平均(EWMA)。
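其中基于统计检验的漂移检测可示意如下(双样本KS统计量的手工实现,阈值为示意取值):

```python
import numpy as np

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """双样本KS统计量:两个经验CDF之差的最大值"""
    values = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), values, side='right') / len(a)
    cdf_b = np.searchsorted(np.sort(b), values, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def drift_detected(reference: np.ndarray, recent: np.ndarray, threshold: float = 0.2) -> bool:
    """比较参考窗口与近期窗口的分布,KS统计量超过阈值判定为概念漂移"""
    return ks_statistic(reference, recent) > threshold
```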
7.5.2 模型版本:MLflow模型注册与灰度发布
模型版本管理确保可复现性与可追溯性。MLflow Tracking记录实验参数、指标与 artifacts(模型文件、依赖环境)。模型注册表(Model Registry)定义阶段转换:开发(Staging)→生产(Production)→归档(Archived),版本号遵循语义化版本(Semantic Versioning)。
灰度发布(Canary Deployment)将新模型逐步应用于流量子集,评估A/B测试指标(精确率、召回率、延迟)。流量分割基于哈希(如用户ID取模)或随机采样,比例从1%递增至100%。影子模式(Shadow Mode)并行运行新旧模型,仅记录预测差异而不影响实际决策,验证通过后再切换。模型回滚(Rollback)机制在检测到性能退化时快速切换至上一稳定版本。
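基于稳定哈希的流量分割可示意为(salt与桶数均为假设参数):

```python
import hashlib

def in_canary(user_id: str, percent: float, salt: str = 'model-v2') -> bool:
    """基于稳定哈希的流量分割:同一用户在同一salt下始终落入同一桶"""
    h = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) % 10000
    return bucket < percent * 100  # percent以百分比表示,精度0.01%
```

相较随机采样,哈希分桶保证同一用户的体验一致;递增percent即可实现1%到100%的渐进放量。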
7.5.3 冷启动:历史数据回放与初始模型训练
冷启动(Cold Start)问题发生于新服务上线或模型首次部署时。历史数据回放(Historical Replay)将过去N 天的数据按时间顺序注入流处理管道,模拟实时场景并填充状态存储。训练策略采用两阶段:离线预训练(Batch Training)基于历史数据初始化参数,随后在线微调(Online Fine-tuning)适应近期模式。
迁移学习(Transfer Learning)利用相似服务的预训练模型作为起点,通过领域适应(Domain Adaptation)调整特征空间对齐。元学习(Meta-Learning)如MAML(Model-Agnostic Meta-Learning)学习易适应的初始参数,仅需少量样本即可在新任务上收敛。
7.5.4 反馈闭环:人工标注结果回流训练集
反馈闭环(Feedback Loop)将运维专家的标注结果纳入模型优化。主动学习(Active Learning)策略选择不确定性高或代表性强的样本请求标注,减少人工负担。不确定性量化采用贝叶斯神经网络(BNN)或集成方法(Deep Ensembles)的预测方差。
标注界面(Labeling Interface)集成于调查工具,支持一键标记误报/漏报、添加注释。数据管道将标注结果写入特征存储(Feature Store),触发增量训练或定期全量重训练。样本权重调整(Importance Weighting)对人工确认的异常样本赋予更高权重,修正类别不平衡。持续学习(Continual Learning)技术如EWC(Elastic Weight Consolidation)保护关键参数,防止新数据覆盖旧知识。
第三部分:代码实现
脚本7.1.1.1:多协议数据接入适配器(HTTP/WebSocket/MQTT)
本脚本实现多协议数据接入适配器,支持HTTP POST接收、WebSocket实时流、MQTT订阅三种接入方式。采用异步架构处理高并发,内置协议转换层统一输出格式。运行后启动三个服务端口,生成模拟数据测试接入能力,可视化展示各协议QPS与延迟分布。
Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
脚本7.1.1.1:多协议数据接入适配器
功能:同时支持HTTP/WebSocket/MQTT三种协议的数据接入,统一转换为内部事件格式
使用方式:直接运行 python script_7_1_1_1.py,访问 http://localhost:8080 查看监控面板
"""
import asyncio
import json
import time
import random
import threading
from collections import deque, defaultdict
from datetime import datetime
from typing import Dict, Any, Optional
import numpy as np
# 模拟外部依赖
try:
    import aiohttp
    from aiohttp import web
    import websockets
    import paho.mqtt.client as mqtt
    HAS_DEPS = True
except ImportError:
    HAS_DEPS = False
    print("警告:缺少依赖库,使用模拟模式运行")

import matplotlib.pyplot as plt
import matplotlib.animation as animation

# ==================== 核心适配器实现 ====================

class ProtocolMetrics:
    """协议性能指标追踪"""

    def __init__(self, window_size=100):
        self.latency_window = deque(maxlen=window_size)
        self.qps_counter = 0
        self.last_reset = time.time()
        self.error_count = 0

    def record(self, latency_ms: float, success: bool = True):
        self.latency_window.append(latency_ms)
        self.qps_counter += 1
        if not success:
            self.error_count += 1

    def get_stats(self) -> Dict[str, float]:
        now = time.time()
        duration = now - self.last_reset
        qps = self.qps_counter / duration if duration > 0 else 0
        latencies = list(self.latency_window)
        return {
            'qps': qps,
            'p50': np.percentile(latencies, 50) if latencies else 0,
            'p99': np.percentile(latencies, 99) if latencies else 0,
            'error_rate': self.error_count / max(self.qps_counter, 1)
        }

    def reset(self):
        self.qps_counter = 0
        self.last_reset = time.time()
        self.error_count = 0

class UnifiedEvent:
    """统一事件格式"""

    def __init__(self, source: str, protocol: str, payload: Dict, timestamp: Optional[float] = None):
        self.id = f"{source}_{time.time()}_{random.randint(1000, 9999)}"
        self.source = source
        self.protocol = protocol
        self.payload = payload
        self.timestamp = timestamp or time.time()
        self.normalized = self._normalize()

    def _normalize(self) -> Dict[str, Any]:
        """标准化转换:统一时间戳、标签格式、数值类型"""
        norm = {
            'event_id': self.id,
            'ingestion_time': int(self.timestamp * 1000),
            'source_id': str(self.source).lower().strip(),
            'metrics': {}
        }
        # 提取数值指标
        for key, value in self.payload.items():
            if isinstance(value, (int, float)):
                norm['metrics'][key] = float(value)
            elif isinstance(value, str) and value.replace('.', '').isdigit():
                norm['metrics'][key] = float(value)
            else:
                norm[key] = value
        return norm

class MultiProtocolAdapter:
    """多协议适配器主类"""

    def __init__(self):
        self.metrics = {
            'http': ProtocolMetrics(),
            'websocket': ProtocolMetrics(),
            'mqtt': ProtocolMetrics()
        }
        self.event_buffer = deque(maxlen=10000)
        self.running = False

    async def handle_http(self, request):
        """HTTP接入处理(request为None时走模拟路径)"""
        start = time.time()
        try:
            if HAS_DEPS and request is not None:
                data = await request.json()
            else:
                data = {'metric': random.random() * 100, 'host': 'simulated'}
            event = UnifiedEvent(
                source=data.get('host', 'unknown'),
                protocol='http',
                payload=data
            )
            self.event_buffer.append(event.normalized)
            latency = (time.time() - start) * 1000
            self.metrics['http'].record(latency, True)
            if HAS_DEPS and request is not None:
                return web.json_response({'status': 'ok', 'id': event.id})
            return {'status': 'ok', 'id': event.id}
        except Exception as e:
            latency = (time.time() - start) * 1000
            self.metrics['http'].record(latency, False)
            if HAS_DEPS and request is not None:
                return web.json_response({'error': str(e)}, status=500)
            return {'error': str(e)}

    async def handle_websocket(self, websocket, path=None):
        """WebSocket接入处理"""
        if not HAS_DEPS:
            return
        try:
            async for message in websocket:
                start = time.time()
                try:
                    data = json.loads(message)
                    event = UnifiedEvent(
                        source=data.get('client_id', 'ws_client'),
                        protocol='websocket',
                        payload=data
                    )
                    self.event_buffer.append(event.normalized)
                    latency = (time.time() - start) * 1000
                    self.metrics['websocket'].record(latency, True)
                    await websocket.send(json.dumps({'ack': event.id}))
                except Exception as e:
                    latency = (time.time() - start) * 1000
                    self.metrics['websocket'].record(latency, False)
                    await websocket.send(json.dumps({'error': str(e)}))
        except websockets.exceptions.ConnectionClosed:
            pass

    def handle_mqtt_message(self, client, userdata, message):
        """MQTT消息回调"""
        start = time.time()
        try:
            payload = json.loads(message.payload.decode())
            event = UnifiedEvent(
                source=message.topic.split('/')[-1],
                protocol='mqtt',
                payload=payload
            )
            self.event_buffer.append(event.normalized)
            latency = (time.time() - start) * 1000
            self.metrics['mqtt'].record(latency, True)
        except Exception:
            latency = (time.time() - start) * 1000
            self.metrics['mqtt'].record(latency, False)

    def start_mqtt(self):
        """启动MQTT客户端(需本地存在Broker,连接失败时跳过)"""
        if not HAS_DEPS:
            return
        try:
            client = mqtt.Client()
            client.on_message = self.handle_mqtt_message
            client.connect("localhost", 1883, 60)
            client.subscribe("sensors/+/data")
            client.loop_start()
        except Exception as e:
            print(f"MQTT连接失败,已跳过: {e}")

    async def start_http(self):
        """启动HTTP服务"""
        if not HAS_DEPS:
            return

        async def health(request):
            # aiohttp处理函数必须为协程
            return web.json_response({'status': 'up'})

        app = web.Application()
        app.router.add_post('/ingest', self.handle_http)
        app.router.add_get('/health', health)
        runner = web.AppRunner(app)
        await runner.setup()
        site = web.TCPSite(runner, 'localhost', 8080)
        await site.start()
        print("HTTP服务启动于 http://localhost:8080")

    async def start_websocket(self):
        """启动WebSocket服务"""
        if not HAS_DEPS:
            return
        server = await websockets.serve(self.handle_websocket, "localhost", 8081)
        print("WebSocket服务启动于 ws://localhost:8081")
        await server.wait_closed()

    def generate_mock_traffic(self):
        """生成模拟流量用于测试。
        在独立线程内自建事件循环:线程中没有正在运行的事件循环,
        直接调用asyncio.create_task会抛出RuntimeError"""
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        while self.running:
            # 模拟HTTP流量
            if random.random() < 0.3:
                loop.run_until_complete(self._mock_http_request())
            # 模拟WebSocket流量
            if random.random() < 0.2 and HAS_DEPS:
                loop.run_until_complete(self._mock_ws_message())
            time.sleep(0.1)
        loop.close()

    async def _mock_http_request(self):
        """模拟HTTP请求"""
        await asyncio.sleep(random.uniform(0.01, 0.05))
        await self.handle_http(None)

    async def _mock_ws_message(self):
        """模拟WebSocket消息(仅模拟网络延迟)"""
        await asyncio.sleep(random.uniform(0.005, 0.02))

    def get_dashboard_data(self):
        """获取监控面板数据"""
        return {
            proto: metrics.get_stats()
            for proto, metrics in self.metrics.items()
        }

# ==================== 可视化实现 ====================

class AdapterVisualizer:
    """适配器实时监控可视化"""

    def __init__(self, adapter: MultiProtocolAdapter):
        self.adapter = adapter
        self.fig, self.axes = plt.subplots(2, 2, figsize=(12, 8))
        self.fig.suptitle('Multi-Protocol Adapter Real-time Monitor', fontsize=14, fontweight='bold')
        # QPS历史
        self.qps_history = {'http': deque(maxlen=50), 'websocket': deque(maxlen=50), 'mqtt': deque(maxlen=50)}
        self.time_history = deque(maxlen=50)
        # 初始化子图
        self.ax_qps = self.axes[0, 0]
        self.ax_latency = self.axes[0, 1]
        self.ax_error = self.axes[1, 0]
        self.ax_events = self.axes[1, 1]
        self.lines = {}
        colors = {'http': '#FF6B6B', 'websocket': '#4ECDC4', 'mqtt': '#45B7D1'}
        for proto in ['http', 'websocket', 'mqtt']:
            self.qps_history[proto].extend([0] * 50)
        self.time_history.extend(range(50))
        # QPS曲线
        for proto, color in colors.items():
            line, = self.ax_qps.plot([], [], label=proto.upper(), color=color, linewidth=2)
            self.lines[f'qps_{proto}'] = line
        self.ax_qps.set_title('Queries Per Second (QPS)')
        self.ax_qps.set_ylabel('QPS')
        self.ax_qps.legend()
        self.ax_qps.grid(True, alpha=0.3)
        # 延迟分布(P99柱状图)
        self.latency_bars = {}
        x_pos = np.arange(3)
        protocols = ['http', 'websocket', 'mqtt']
        for i, proto in enumerate(protocols):
            bar = self.ax_latency.bar(i, 0, color=colors[proto], alpha=0.7, width=0.6)
            self.latency_bars[proto] = bar[0]
        self.ax_latency.set_title('Latency Distribution (P99)')
        self.ax_latency.set_ylabel('Latency (ms)')
        self.ax_latency.set_xticks(x_pos)
        self.ax_latency.set_xticklabels([p.upper() for p in protocols])
        self.ax_latency.grid(True, alpha=0.3, axis='y')
        # 错误率
        self.error_bars = {}
        for i, proto in enumerate(protocols):
            bar = self.ax_error.bar(i, 0, color=colors[proto], alpha=0.7, width=0.6)
            self.error_bars[proto] = bar[0]
        self.ax_error.set_title('Error Rate')
        self.ax_error.set_ylabel('Error %')
        self.ax_error.set_xticks(x_pos)
        self.ax_error.set_xticklabels([p.upper() for p in protocols])
        self.ax_error.set_ylim(0, 1)
        # 事件缓冲区状态
        self.event_text = self.ax_events.text(0.5, 0.5, '', transform=self.ax_events.transAxes,
                                              ha='center', va='center', fontsize=10,
                                              bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
        self.ax_events.set_title('Event Buffer Status')
        self.ax_events.axis('off')
        plt.tight_layout()

    def update(self, frame):
        """动画更新"""
        data = self.adapter.get_dashboard_data()
        self.time_history.append(time.time())
        # 更新QPS
        for proto in ['http', 'websocket', 'mqtt']:
            stats = data[proto]
            self.qps_history[proto].append(stats['qps'])
            self.lines[f'qps_{proto}'].set_data(range(50), list(self.qps_history[proto]))
        self.ax_qps.set_xlim(0, 50)
        max_qps = max(max(h) for h in self.qps_history.values()) if any(self.qps_history.values()) else 10
        self.ax_qps.set_ylim(0, max_qps * 1.2)
        # 更新延迟
        for proto in ['http', 'websocket', 'mqtt']:
            stats = data[proto]
            self.latency_bars[proto].set_height(stats['p99'])
        max_lat = max(data[p]['p99'] for p in ['http', 'websocket', 'mqtt'])
        self.ax_latency.set_ylim(0, max(max_lat * 1.2, 10))
        # 更新错误率
        for proto in ['http', 'websocket', 'mqtt']:
            stats = data[proto]
            self.error_bars[proto].set_height(stats['error_rate'] * 100)
        # 更新缓冲区状态
        buffer_size = len(self.adapter.event_buffer)
        buffer_limit = self.adapter.event_buffer.maxlen
        status_text = f'Buffered Events: {buffer_size}/{buffer_limit}\n'
        if buffer_size > buffer_limit * 0.8:
            status_text += 'Status: WARNING (High Load)'
        else:
            status_text += 'Status: NORMAL'
        self.event_text.set_text(status_text)
        return list(self.lines.values()) + list(self.latency_bars.values()) + list(self.error_bars.values()) + [self.event_text]

# ==================== 主执行逻辑 ====================

async def main():
    """主函数"""
    adapter = MultiProtocolAdapter()
    adapter.running = True
    # 启动模拟流量(独立线程)
    traffic_thread = threading.Thread(target=adapter.generate_mock_traffic, daemon=True)
    traffic_thread.start()
    # 启动服务
    if HAS_DEPS:
        adapter.start_mqtt()
        await asyncio.gather(
            adapter.start_http(),
            adapter.start_websocket()
        )
    else:
        print("运行模拟模式(无外部依赖)...")
        # 纯模拟模式持续生成数据
        while True:
            await asyncio.sleep(1)

def run_visualization():
    """独立运行可视化"""
    adapter = MultiProtocolAdapter()
    adapter.running = True

    # 后台生成数据
    def generate_data():
        while adapter.running:
            # 模拟各协议数据
            for proto in ['http', 'websocket', 'mqtt']:
                latency = random.uniform(5, 50)
                success = random.random() > 0.05
                adapter.metrics[proto].record(latency, success)
            time.sleep(0.5)

    thread = threading.Thread(target=generate_data, daemon=True)
    thread.start()
    # 启动可视化
    viz = AdapterVisualizer(adapter)
    ani = animation.FuncAnimation(viz.fig, viz.update, interval=1000, blit=False)
    plt.show()
    adapter.running = False

if __name__ == '__main__':
    # 检查是否直接运行可视化
    import sys
    if len(sys.argv) > 1 and sys.argv[1] == '--viz':
        run_visualization()
    else:
        try:
            asyncio.run(main())
        except KeyboardInterrupt:
            print("\n服务已停止")
脚本7.1.1.2:Kafka Streams实时处理引擎
本脚本实现流式数据清洗与标准化转换,模拟Kafka Streams处理语义。包含Exactly-Once处理保证、状态存储管理、分区重平衡机制。通过滑动窗口聚合实现实时指标计算,可视化展示吞吐量、处理延迟与状态存储大小。
Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
脚本7.1.1.2:Kafka Streams实时处理引擎
功能:实现流式数据清洗、标准化转换、窗口聚合与Exactly-Once语义
使用方式:python script_7_1_1_2.py [--viz] 启动可视化监控
"""
import asyncio
import json
import time
import random
import threading
from collections import deque, defaultdict
from dataclasses import dataclass
from typing import List, Dict, Any, Callable, Optional
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.patches import FancyBboxPatch
# ==================== 核心流处理实现 ====================
@dataclass
class StreamRecord:
"""流记录"""
key: str
value: Dict[str, Any]
timestamp: float
topic: str
partition: int = 0
offset: int = 0
class StateStore:
"""模拟RocksDB状态存储"""
def __init__(self, name: str):
self.name = name
self.store = {}
self.changelog = deque(maxlen=1000) # 变更日志用于容错
def put(self, key: str, value: Any):
old = self.store.get(key)
self.store[key] = value
self.changelog.append({'op': 'PUT', 'key': key, 'value': value, 'ts': time.time()})
def get(self, key: str) -> Any:
return self.store.get(key)
def range_scan(self, start_key: str, end_key: str) -> List[tuple]:
"""范围扫描"""
return [(k, v) for k, v in self.store.items() if start_key <= k <= end_key]
def size(self) -> int:
return len(self.store)
class TopologyBuilder:
"""拓扑构建器"""
def __init__(self):
self.processors = []
self.state_stores = {}
self.sink_topics = []
def add_source(self, topic: str):
self.source_topic = topic
return self
def add_processor(self, name: str, func: Callable, stores: List[str] = None):
self.processors.append({
'name': name,
'func': func,
'stores': stores or []
})
return self
def add_state_store(self, name: str):
self.state_stores[name] = StateStore(name)
return self
def add_sink(self, topic: str):
self.sink_topics.append(topic)
return self
def build(self):
return StreamTopology(self)
class StreamTopology:
"""流处理拓扑"""
def __init__(self, builder: TopologyBuilder):
self.builder = builder
self.input_queue = asyncio.Queue(maxsize=10000)
self.output_queue = asyncio.Queue()
self.running = False
self.metrics = {
'processed': 0,
'dropped': 0,
'errors': 0,
'latency_ms': deque(maxlen=100),
'throughput': deque(maxlen=60)
}
self.last_metrics_time = time.time()
self.current_throughput = 0
async def process_record(self, record: StreamRecord):
"""处理单条记录"""
start = time.time()
context = {'record': record, 'stores': self.builder.state_stores}
try:
# 顺序执行处理器链
for processor in self.builder.processors:
result = await processor['func'](context)
if result is None: # 过滤掉
self.metrics['dropped'] += 1
return
context['record'] = result
# 输出到sink
await self.output_queue.put(context['record'])
self.metrics['processed'] += 1
self.current_throughput += 1
except Exception as e:
self.metrics['errors'] += 1
print(f"处理错误: {e}")
finally:
latency = (time.time() - start) * 1000
self.metrics['latency_ms'].append(latency)
async def run(self):
"""主处理循环"""
self.running = True
while self.running:
try:
record = await asyncio.wait_for(self.input_queue.get(), timeout=1.0)
await self.process_record(record)
except asyncio.TimeoutError:
continue
def start_metrics_collection(self):
"""吞吐量统计"""
while self.running:
time.sleep(1.0)
now = time.time()
self.metrics['throughput'].append(self.current_throughput)
self.current_throughput = 0
async def inject_mock_data(self):
"""注入模拟数据"""
counter = 0
while self.running:
record = StreamRecord(
key=f"device_{random.randint(1, 100)}",
value={
'temperature': random.uniform(20, 80),
'pressure': random.uniform(1000, 2000),
'status': random.choice(['normal', 'warning', 'critical']),
'raw_timestamp': time.time() - random.uniform(0, 5) # 可能乱序
},
timestamp=time.time(),
topic='raw-sensor-data',
partition=random.randint(0, 3),
offset=counter
)
await self.input_queue.put(record)
counter += 1
await asyncio.sleep(0.01) # 100 TPS模拟
# ==================== 处理器函数实现 ====================
async def validation_processor(context):
"""数据校验处理器"""
record = context['record']
value = record.value
# 检查必填字段
if 'temperature' not in value or 'pressure' not in value:
return None
# 范围检查
if not (0 <= value['temperature'] <= 200):
return None
return record
async def normalization_processor(context):
"""标准化处理器"""
record = context['record']
value = record.value.copy()
# 时间戳标准化为毫秒
if 'raw_timestamp' in value:
value['event_time_ms'] = int(value['raw_timestamp'] * 1000)
del value['raw_timestamp']
else:
value['event_time_ms'] = int(record.timestamp * 1000)
# 单位换算
value['pressure_kpa'] = value['pressure'] / 10 # 转换为kPa
del value['pressure']
# 标签规范化
value['status'] = value['status'].upper()
record.value = value
return record
async def windowed_aggregation_processor(context):
"""窗口聚合处理器"""
record = context['record']
stores = context['stores']
window_store = stores['window-store']
# 5秒滚动窗口
window_size = 5000
event_time = record.value['event_time_ms']
window_start = (event_time // window_size) * window_size
window_key = f"{record.key}:{window_start}"
# 增量更新窗口统计
current = window_store.get(window_key) or {
'count': 0,
'temp_sum': 0.0,
'temp_max': float('-inf'),
'temp_min': float('inf'),
'samples': []
}
temp = record.value['temperature']
current['count'] += 1
current['temp_sum'] += temp
current['temp_max'] = max(current['temp_max'], temp)
current['temp_min'] = min(current['temp_min'], temp)
current['samples'].append(temp)
# 维护Welford算法变量用于方差计算
if 'm2' not in current:
current['m2'] = 0.0
current['mean'] = temp
else:
delta = temp - current['mean']
current['mean'] += delta / current['count']
delta2 = temp - current['mean']
current['m2'] += delta * delta2
window_store.put(window_key, current)
# 如果窗口结束,输出聚合结果
current_time = int(time.time() * 1000)
if window_start + window_size <= current_time:
if current['count'] > 1:
variance = current['m2'] / (current['count'] - 1)
else:
variance = 0
record.value['window_stats'] = {
'window_start': window_start,
'count': current['count'],
'avg_temp': current['temp_sum'] / current['count'],
'max_temp': current['temp_max'],
'min_temp': current['temp_min'],
'std_temp': variance ** 0.5
}
return record
return None # 窗口未结束,继续聚合
# ==================== 可视化实现 ====================
class StreamsVisualizer:
"""流处理可视化"""
def __init__(self, topology: StreamTopology):
self.topology = topology
self.fig, self.axes = plt.subplots(2, 2, figsize=(12, 8))
self.fig.suptitle('Kafka Streams Processing Monitor', fontsize=14, fontweight='bold')
# 历史数据
self.throughput_history = deque(maxlen=60)
self.latency_history = deque(maxlen=60)
self.processed_history = deque(maxlen=60)
self.dropped_history = deque(maxlen=60)
for _ in range(60):
self.throughput_history.append(0)
self.latency_history.append(0)
self.processed_history.append(0)
self.dropped_history.append(0)
# 吞吐量
self.ax_tput = self.axes[0, 0]
self.line_tput, = self.ax_tput.plot([], [], 'b-', linewidth=2, label='Records/sec')
self.ax_tput.set_title('Throughput (Records/Second)')
self.ax_tput.set_ylabel('Count')
self.ax_tput.grid(True, alpha=0.3)
self.ax_tput.legend()
# 处理延迟
self.ax_lat = self.axes[0, 1]
self.line_lat, = self.ax_lat.plot([], [], 'r-', linewidth=2, label='P99 Latency')
self.ax_lat.set_title('Processing Latency')
self.ax_lat.set_ylabel('Milliseconds')
self.ax_lat.grid(True, alpha=0.3)
self.ax_lat.legend()
# 状态存储大小
self.ax_store = self.axes[1, 0]
self.store_bars = None
self.ax_store.set_title('State Store Sizes')
self.ax_store.set_ylabel('Entries')
# 处理统计饼图
self.ax_pie = self.axes[1, 1]
self.ax_pie.set_title('Processing Distribution')
plt.tight_layout()
def update(self, frame):
"""更新图表"""
# 更新历史
tput = list(self.topology.metrics['throughput'])
if tput:
self.throughput_history.append(tput[-1])
latencies = list(self.topology.metrics['latency_ms'])
if latencies:
self.latency_history.append(np.percentile(latencies, 99))
self.processed_history.append(self.topology.metrics['processed'])
self.dropped_history.append(self.topology.metrics['dropped'])
# 更新吞吐量图
x = range(60)
self.line_tput.set_data(x, list(self.throughput_history))
self.ax_tput.set_xlim(0, 60)
max_tput = max(self.throughput_history) if any(self.throughput_history) else 100
self.ax_tput.set_ylim(0, max_tput * 1.2)
# 更新延迟图
self.line_lat.set_data(x, list(self.latency_history))
self.ax_lat.set_xlim(0, 60)
max_lat = max(self.latency_history) if any(self.latency_history) else 10
self.ax_lat.set_ylim(0, max(max_lat * 1.2, 10))
# 更新状态存储柱状图
if self.store_bars:
self.store_bars.remove()
stores = self.topology.builder.state_stores
names = list(stores.keys())
sizes = [s.size() for s in stores.values()]
x_pos = np.arange(len(names))
self.store_bars = self.ax_store.bar(x_pos, sizes, color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
self.ax_store.set_xticks(x_pos)
self.ax_store.set_xticklabels(names, rotation=15, ha='right')
# 更新饼图
self.ax_pie.clear()
processed = self.topology.metrics['processed']
dropped = self.topology.metrics['dropped']
errors = self.topology.metrics['errors']
if processed + dropped + errors > 0:
sizes = [processed, dropped, errors]
labels = ['Processed', 'Dropped', 'Errors']
colors = ['#2ECC71', '#F39C12', '#E74C3C']
self.ax_pie.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
self.ax_pie.set_title('Processing Distribution')
return [self.line_tput, self.line_lat]
# ==================== 主执行 ====================
async def main():
"""主函数"""
# 构建拓扑
builder = TopologyBuilder()
topology = (builder
.add_source('raw-sensor-data')
.add_state_store('window-store')
.add_processor('validation', validation_processor)
.add_processor('normalization', normalization_processor)
.add_processor('window-aggregation', windowed_aggregation_processor, ['window-store'])
.add_sink('processed-data')
.build())
# 启动处理
tasks = [
topology.run(),
topology.inject_mock_data(),
]
# 启动指标收集
metrics_thread = threading.Thread(target=topology.start_metrics_collection, daemon=True)
metrics_thread.start()
await asyncio.gather(*tasks)
def run_with_viz():
"""带可视化运行"""
builder = TopologyBuilder()
topology = (builder
.add_source('raw-sensor-data')
.add_state_store('window-store')
.add_processor('validation', validation_processor)
.add_processor('normalization', normalization_processor)
.add_processor('window-aggregation', windowed_aggregation_processor, ['window-store'])
.add_sink('processed-data')
.build())
# 启动后台处理
def run_async():
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
# 启动指标收集
metrics_thread = threading.Thread(target=topology.start_metrics_collection, daemon=True)
metrics_thread.start()
loop.run_until_complete(asyncio.gather(
topology.run(),
topology.inject_mock_data()
))
thread = threading.Thread(target=run_async, daemon=True)
thread.start()
# 启动可视化
viz = StreamsVisualizer(topology)
ani = animation.FuncAnimation(viz.fig, viz.update, interval=1000, blit=False)
plt.show()
topology.running = False
if __name__ == '__main__':
import sys
if len(sys.argv) > 1 and sys.argv[1] == '--viz':
run_with_viz()
else:
try:
asyncio.run(main())
except KeyboardInterrupt:
print("\n处理引擎已停止")
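上文拓扑中的处理器链(校验→标准化→下沉)本质上是对流式记录的逐级变换。下面给出一个与脚本无关的极简asyncio管道示意(validate/normalize等命名为本示例假设,非脚本中的实现):

```python
import asyncio

def validate(rec):
    """校验:过滤缺失值或越界读数(假设合法范围为[0, 100])"""
    v = rec.get('value')
    return rec if v is not None and 0 <= v <= 100 else None

def normalize(rec):
    """标准化:min-max归一化到[0, 1]区间"""
    return {**rec, 'value': rec['value'] / 100.0}

async def run_pipeline(records):
    """source -> validate -> normalize -> sink 的最小链式管道"""
    sink = []
    q = asyncio.Queue()
    for r in records:
        await q.put(r)
    await q.put(None)                      # 哨兵,标记流结束
    while True:
        rec = await q.get()
        if rec is None:
            break
        rec = validate(rec)
        if rec is not None:                # 非法记录在此被丢弃
            sink.append(normalize(rec))
    return sink

if __name__ == '__main__':
    data = [{'value': 50}, {'value': 120}, {'value': None}, {'value': 80}]
    out = asyncio.run(run_pipeline(data))
    print(out)  # [{'value': 0.5}, {'value': 0.8}]
```

真实的Kafka Streams拓扑在此之上增加了状态存储、窗口触发与容错语义,但"每条记录流经一串无副作用的变换函数"这一核心结构是相同的。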
脚本7.1.1.3:滑动窗口统计计算引擎
本脚本实现高效的滑动窗口均值、方差、分位数计算。采用Welford算法实现数值稳定的增量方差计算,基于t-Digest实现流式分位数估计。支持事件时间与处理时间语义,可视化对比不同窗口大小对统计结果的影响。
Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
脚本7.1.1.3:滑动窗口统计计算引擎
功能:实现滑动窗口均值、方差、分位数(P50/P99)的增量计算
使用方式:python script_7_1_1_3.py 启动实时统计可视化
"""
import time
import random
import threading
from collections import deque
from dataclasses import dataclass
from typing import List, Optional
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.patches import Rectangle
# ==================== 核心窗口算法实现 ====================
@dataclass
class WindowedValue:
"""窗口内的数值记录"""
value: float
timestamp: float
class WelfordAggregator:
"""Welford算法增量均值方差计算"""
def __init__(self):
self.n = 0
self.mean = 0.0
self.m2 = 0.0 # 二阶中心矩累加器
def update(self, x: float):
"""增量更新"""
self.n += 1
delta = x - self.mean
self.mean += delta / self.n
delta2 = x - self.mean
self.m2 += delta * delta2
    def remove(self, x: float, n: int):
        """移除旧值(滑动窗口);参数n仅为接口兼容保留,未使用"""
        if self.n <= 1:
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0
            return
        old_mean = self.mean
        self.n -= 1
        # 反向Welford:new_mean = (old_n * old_mean - x) / (old_n - 1)
        self.mean = ((self.n + 1) * old_mean - x) / self.n
        # 精确反向更新二阶矩:m2 -= (x - old_mean) * (x - new_mean)
        self.m2 = max(0.0, self.m2 - (x - old_mean) * (x - self.mean))
def variance(self) -> float:
"""样本方差"""
return self.m2 / (self.n - 1) if self.n > 1 else 0.0
def std(self) -> float:
"""标准差"""
return self.variance() ** 0.5
class TDigest:
"""t-Digest流式分位数估计(简化实现)"""
def __init__(self, compression: float = 100):
self.compression = compression
self.centroids = []
self.n = 0
def update(self, x: float):
"""更新分位数结构"""
self.n += 1
        # 简化实现:每个新样本作为单点质心追加,超限时再合并相近质心
        # (真实t-Digest按分位位置限制质心权重,此处仅示意)
        self.centroids.append({'mean': x, 'count': 1})
# 压缩:合并相近质心
if len(self.centroids) > self.compression:
self.centroids.sort(key=lambda c: c['mean'])
new_centroids = []
i = 0
while i < len(self.centroids):
if i < len(self.centroids) - 1:
# 合并条件:距离小于阈值
if (self.centroids[i+1]['mean'] - self.centroids[i]['mean']) < 0.1:
merged = {
'mean': (self.centroids[i]['mean'] * self.centroids[i]['count'] +
self.centroids[i+1]['mean'] * self.centroids[i+1]['count']) /
(self.centroids[i]['count'] + self.centroids[i+1]['count']),
'count': self.centroids[i]['count'] + self.centroids[i+1]['count']
}
new_centroids.append(merged)
i += 2
else:
new_centroids.append(self.centroids[i])
i += 1
else:
new_centroids.append(self.centroids[i])
i += 1
self.centroids = new_centroids
def quantile(self, q: float) -> float:
"""查询分位数"""
if not self.centroids:
return 0.0
self.centroids.sort(key=lambda c: c['mean'])
target = q * self.n
cumsum = 0
for c in self.centroids:
cumsum += c['count']
if cumsum >= target:
return c['mean']
return self.centroids[-1]['mean']
class SlidingWindowAggregator:
"""滑动窗口聚合器"""
def __init__(self, window_size_ms: float, slide_ms: Optional[float] = None):
self.window_size = window_size_ms
self.slide = slide_ms or window_size_ms
self.buffer = deque()
self.welford = WelfordAggregator()
self.tdigest = TDigest(compression=50)
self.last_emit = 0
def add(self, value: float, timestamp: Optional[float] = None):
"""添加新值"""
ts = timestamp or time.time() * 1000
wv = WindowedValue(value, ts)
self.buffer.append(wv)
# 增量更新统计量
self.welford.update(value)
self.tdigest.update(value)
# 清理过期数据
cutoff = ts - self.window_size
while self.buffer and self.buffer[0].timestamp < cutoff:
old = self.buffer.popleft()
self.welford.remove(old.value, len(self.buffer))
def get_stats(self) -> dict:
"""获取当前窗口统计"""
if not self.buffer:
return {'mean': 0, 'std': 0, 'p50': 0, 'p99': 0, 'count': 0}
return {
'mean': self.welford.mean,
'std': self.welford.std(),
'p50': self.tdigest.quantile(0.5),
'p99': self.tdigest.quantile(0.99),
'count': self.welford.n
}
# ==================== 多窗口管理器 ====================
class MultiWindowManager:
"""管理多个时间粒度的窗口"""
def __init__(self):
self.windows = {
'1s': SlidingWindowAggregator(1000, 100), # 1秒窗口,100ms滑动
'5s': SlidingWindowAggregator(5000, 500), # 5秒窗口,500ms滑动
'30s': SlidingWindowAggregator(30000, 1000) # 30秒窗口,1秒滑动
}
self.history = {k: deque(maxlen=100) for k in self.windows.keys()}
def ingest(self, value: float):
"""数据入口"""
for window in self.windows.values():
window.add(value)
# 记录历史
for name, window in self.windows.items():
self.history[name].append(window.get_stats())
def get_current_stats(self):
"""获取所有窗口当前统计"""
return {name: w.get_stats() for name, w in self.windows.items()}
# ==================== 可视化实现 ====================
class WindowVisualizer:
"""滑动窗口可视化"""
def __init__(self, manager: MultiWindowManager):
self.manager = manager
self.fig, self.axes = plt.subplots(2, 2, figsize=(12, 8))
self.fig.suptitle('Sliding Window Statistics Calculation', fontsize=14, fontweight='bold')
# 数据历史
self.time_history = deque(maxlen=100)
self.mean_data = {k: deque(maxlen=100) for k in ['1s', '5s', '30s']}
self.p99_data = {k: deque(maxlen=100) for k in ['1s', '5s', '30s']}
self.std_data = {k: deque(maxlen=100) for k in ['1s', '5s', '30s']}
for _ in range(100):
self.time_history.append(0)
for k in ['1s', '5s', '30s']:
self.mean_data[k].append(0)
self.p99_data[k].append(0)
self.std_data[k].append(0)
colors = {'1s': '#FF6B6B', '5s': '#4ECDC4', '30s': '#45B7D1'}
# 均值对比
self.ax_mean = self.axes[0, 0]
self.lines_mean = {}
for name, color in colors.items():
line, = self.ax_mean.plot([], [], label=f'{name} window', color=color, linewidth=2)
self.lines_mean[name] = line
self.ax_mean.set_title('Mean Value Comparison')
self.ax_mean.set_ylabel('Mean')
self.ax_mean.legend()
self.ax_mean.grid(True, alpha=0.3)
# P99分位数
self.ax_p99 = self.axes[0, 1]
self.lines_p99 = {}
for name, color in colors.items():
line, = self.ax_p99.plot([], [], label=f'{name} window', color=color, linewidth=2)
self.lines_p99[name] = line
self.ax_p99.set_title('P99 Quantile Comparison')
self.ax_p99.set_ylabel('P99 Value')
self.ax_p99.legend()
self.ax_p99.grid(True, alpha=0.3)
# 标准差(波动性)
self.ax_std = self.axes[1, 0]
self.lines_std = {}
for name, color in colors.items():
line, = self.ax_std.plot([], [], label=f'{name} window', color=color, linewidth=2)
self.lines_std[name] = line
self.ax_std.set_title('Standard Deviation (Volatility)')
self.ax_std.set_ylabel('Std Dev')
self.ax_std.legend()
self.ax_std.grid(True, alpha=0.3)
# 窗口统计信息表格
self.ax_table = self.axes[1, 1]
self.ax_table.axis('off')
self.table_text = None
plt.tight_layout()
def update(self, frame):
"""更新动画"""
# 更新历史数据
stats = self.manager.get_current_stats()
self.time_history.append(time.time())
for name in ['1s', '5s', '30s']:
s = stats[name]
self.mean_data[name].append(s['mean'])
self.p99_data[name].append(s['p99'])
self.std_data[name].append(s['std'])
x = range(100)
# 更新均值图
for name, line in self.lines_mean.items():
line.set_data(x, list(self.mean_data[name]))
self.ax_mean.set_xlim(0, 100)
all_means = [max(self.mean_data[k]) for k in ['1s', '5s', '30s']]
self.ax_mean.set_ylim(0, max(all_means) * 1.2 if any(all_means) else 100)
# 更新P99图
for name, line in self.lines_p99.items():
line.set_data(x, list(self.p99_data[name]))
self.ax_p99.set_xlim(0, 100)
all_p99 = [max(self.p99_data[k]) for k in ['1s', '5s', '30s']]
self.ax_p99.set_ylim(0, max(all_p99) * 1.2 if any(all_p99) else 100)
# 更新标准差图
for name, line in self.lines_std.items():
line.set_data(x, list(self.std_data[name]))
self.ax_std.set_xlim(0, 100)
all_std = [max(self.std_data[k]) for k in ['1s', '5s', '30s']]
self.ax_std.set_ylim(0, max(all_std) * 1.2 if any(all_std) else 10)
# 更新表格
if self.table_text:
self.table_text.remove()
table_data = []
for name in ['1s', '5s', '30s']:
s = stats[name]
table_data.append([name, f"{s['mean']:.2f}", f"{s['std']:.2f}",
f"{s['p50']:.2f}", f"{s['p99']:.2f}", str(s['count'])])
self.table_text = self.ax_table.table(
cellText=table_data,
colLabels=['Window', 'Mean', 'Std', 'P50', 'P99', 'Count'],
cellLoc='center',
loc='center',
bbox=[0.1, 0.3, 0.8, 0.6]
)
self.ax_table.set_title('Current Window Statistics', pad=20)
return list(self.lines_mean.values()) + list(self.lines_p99.values()) + list(self.lines_std.values())
# ==================== 数据生成与主循环 ====================
def data_generator(manager: MultiWindowManager):
"""生成模拟数据"""
# 模拟具有趋势、季节性和异常的数据
t = 0
while True:
# 基础趋势 + 正弦波季节性 + 高斯噪声 + 偶尔异常
base = 50 + 0.1 * t
seasonal = 10 * np.sin(2 * np.pi * t / 100)
noise = random.gauss(0, 5)
# 5%概率异常
if random.random() < 0.05:
anomaly = random.choice([30, -30])
else:
anomaly = 0
value = base + seasonal + noise + anomaly
manager.ingest(value)
t += 1
time.sleep(0.05) # 20 TPS
def main():
"""主函数"""
manager = MultiWindowManager()
# 启动数据生成线程
gen_thread = threading.Thread(target=data_generator, args=(manager,), daemon=True)
gen_thread.start()
# 启动可视化
viz = WindowVisualizer(manager)
ani = animation.FuncAnimation(viz.fig, viz.update, interval=500, blit=False)
plt.show()
if __name__ == '__main__':
main()
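上述脚本中Welford反向更新的正确性可以与NumPy全量计算交叉验证。下面是一个独立于脚本的最小示例(SlidingWindowWelford为本示例假设的命名),窗口滑出旧值时用精确的反向公式更新均值与二阶矩:

```python
from collections import deque
import numpy as np

class SlidingWelford:
    """固定容量滑动窗口的增量均值/方差(Welford正反向更新)"""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.window = deque()
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x: float):
        if self.n == self.capacity:          # 窗口满则先移除最旧值
            self._remove(self.window.popleft())
        self.window.append(x)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def _remove(self, x: float):
        old_mean = self.mean
        self.n -= 1
        if self.n == 0:
            self.mean, self.m2 = 0.0, 0.0
            return
        # 反向Welford:new_mean = (old_n * old_mean - x) / (old_n - 1)
        self.mean = ((self.n + 1) * old_mean - x) / self.n
        self.m2 -= (x - old_mean) * (x - self.mean)

    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

if __name__ == '__main__':
    rng = np.random.default_rng(42)
    w = SlidingWelford(capacity=50)
    data = rng.normal(50, 10, 500)
    for x in data:
        w.add(x)
    ref = np.var(data[-50:], ddof=1)         # 以NumPy全量计算为基准
    assert abs(w.variance() - ref) < 1e-6
    print(f"incremental={w.variance():.6f}, numpy={ref:.6f}")
```

与"重算整个窗口"的O(w)开销相比,这种正反向配对更新的每点开销为O(1),是滑动窗口统计的标准做法。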
脚本7.1.1.4:水库抽样与流式特征存储
本脚本实现水库抽样算法用于流式数据采样,以及基于Gorilla压缩算法的时序特征存储。支持可变长编码与XOR压缩,显著降低存储开销。可视化展示抽样代表性、压缩比与存储性能。
Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
脚本7.1.1.4:水库抽样与流式特征存储
功能:实现水库抽样(Reservoir Sampling)与Gorilla压缩时序存储
使用方式:python script_7_1_1_4.py 启动采样与压缩可视化
"""
import time
import random
import struct
import threading
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Dict
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.patches import FancyBboxPatch
# ==================== 水库抽样实现 ====================
class ReservoirSampler:
"""水库抽样实现"""
def __init__(self, k: int):
self.k = k # 样本容量
self.reservoir = []
self.n = 0 # 已处理总数
self.dropped = 0
def add(self, item: float):
"""添加元素"""
self.n += 1
if len(self.reservoir) < self.k:
self.reservoir.append(item)
else:
# 以 k/n 概率替换
j = random.randint(0, self.n - 1)
if j < self.k:
self.reservoir[j] = item
self.dropped += 1
def get_sample(self) -> List[float]:
"""获取当前样本"""
return self.reservoir.copy()
def get_stats(self) -> Dict:
"""统计信息"""
if not self.reservoir:
return {'mean': 0, 'variance': 0, 'count': 0, 'total_seen': self.n}
arr = np.array(self.reservoir)
return {
'mean': np.mean(arr),
'variance': np.var(arr),
'count': len(self.reservoir),
'total_seen': self.n,
'sampling_rate': self.k / self.n if self.n > 0 else 0
}
class StratifiedSampler:
"""分层抽样"""
def __init__(self, strata_sizes: Dict[str, int]):
self.strata = {name: ReservoirSampler(size) for name, size in strata_sizes.items()}
self.strata_counts = {name: 0 for name in strata_sizes.keys()}
def add(self, item: float, stratum: str):
"""按层添加"""
if stratum in self.strata:
self.strata[stratum].add(item)
self.strata_counts[stratum] += 1
def get_representative_sample(self) -> List[float]:
"""获取代表性样本(各层按比例)"""
samples = []
for name, sampler in self.strata.items():
samples.extend(sampler.get_sample())
return samples
# ==================== Gorilla压缩实现 ====================
class GorillaCompressor:
"""Gorilla时序压缩(简化版)"""
def __init__(self):
self.values = []
self.compressed_bits = []
self.prev_value = None
self.prev_xor = None
self.leading_zeros = 0
self.trailing_zeros = 0
def compress(self, value: float, timestamp: int):
"""压缩数值(假设时间戳已排序)"""
# 存储原始值用于对比
self.values.append(value)
# 将float转为64位整数
val_bits = struct.unpack('>Q', struct.pack('>d', value))[0]
if self.prev_value is None:
# 第一个值:存储完整64位
self.compressed_bits.append(('full', 64, val_bits))
self.prev_value = val_bits
else:
xor = self.prev_value ^ val_bits
if xor == 0:
# 与前值相同,存储单个0位
self.compressed_bits.append(('same', 1, 0))
else:
                trailing = (xor & -xor).bit_length() - 1  # 后导零数(最低置位位的索引)
                leading = 64 - xor.bit_length()           # 前导零数(按64位宽度)
if self.prev_xor is not None and leading >= self.leading_zeros and trailing >= self.trailing_zeros:
# 使用之前的块长度,只存储有意义的中间位
meaningful_bits = 64 - self.leading_zeros - self.trailing_zeros
meaningful = (xor >> self.trailing_zeros) & ((1 << meaningful_bits) - 1)
self.compressed_bits.append(('delta', 1 + meaningful_bits, meaningful))
else:
# 新的XOR块,存储前导零数(6位) + 有意义位数(6位) + 有意义位
meaningful_bits = 64 - leading - trailing
self.leading_zeros = leading
self.trailing_zeros = trailing
self.prev_xor = xor
meaningful = (xor >> trailing) & ((1 << meaningful_bits) - 1)
self.compressed_bits.append(('new', 12 + meaningful_bits, (leading, meaningful_bits, meaningful)))
self.prev_value = val_bits
def get_compression_ratio(self) -> float:
"""计算压缩比"""
original_bits = len(self.values) * 64
compressed_bits = sum(bits for _, bits, _ in self.compressed_bits)
return original_bits / compressed_bits if compressed_bits > 0 else 1.0
def decompress(self) -> List[float]:
"""解压缩(验证用)"""
# 简化实现:直接返回原始存储值
return self.values
# ==================== 流式特征存储 ====================
class StreamingFeatureStore:
"""流式特征存储"""
def __init__(self, max_points: int = 10000):
self.raw_buffer = deque(maxlen=max_points)
self.compressor = GorillaCompressor()
self.sampler = ReservoirSampler(k=1000)
self.metadata = {
'ingested': 0,
'compressed_size': 0,
'raw_size': 0
}
def ingest(self, timestamp: int, value: float, features: Dict[str, float]):
"""摄入数据点"""
self.metadata['ingested'] += 1
# 存储原始值
point = {'ts': timestamp, 'value': value, 'features': features}
self.raw_buffer.append(point)
# 更新压缩器
self.compressor.compress(value, timestamp)
# 更新抽样
self.sampler.add(value)
# 更新元数据
self.metadata['raw_size'] = len(self.raw_buffer) * 64 # 假设每个点64字节
compressed_bits = sum(bits for _, bits, _ in self.compressor.compressed_bits)
self.metadata['compressed_size'] = compressed_bits / 8 # 转换为字节
def query_recent(self, seconds: int) -> List[Dict]:
"""查询最近数据"""
        cutoff = (time.time() - seconds) * 1000  # 时间戳以毫秒存储
        return [p for p in self.raw_buffer if p['ts'] > cutoff]
def get_stats(self) -> Dict:
"""获取存储统计"""
ratio = self.compressor.get_compression_ratio()
sample_stats = self.sampler.get_stats()
return {
'total_ingested': self.metadata['ingested'],
'compression_ratio': ratio,
'space_saving': (1 - 1/ratio) * 100,
'buffer_size': len(self.raw_buffer),
'sample_representativeness': sample_stats
}
# ==================== 可视化实现 ====================
class StorageVisualizer:
"""存储与采样可视化"""
def __init__(self, store: StreamingFeatureStore):
self.store = store
self.fig = plt.figure(figsize=(14, 8))
self.gs = self.fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
self.fig.suptitle('Reservoir Sampling & Gorilla Compression Monitor', fontsize=14, fontweight='bold')
# 原始数据流
self.ax_raw = self.fig.add_subplot(self.gs[0, :2])
self.raw_data = deque(maxlen=200)
self.line_raw, = self.ax_raw.plot([], [], 'b-', alpha=0.7, label='Raw Stream')
self.ax_raw.set_title('Raw Data Stream')
self.ax_raw.set_ylabel('Value')
self.ax_raw.grid(True, alpha=0.3)
# 水库样本分布
self.ax_sample = self.fig.add_subplot(self.gs[0, 2])
self.sample_bars = None
self.ax_sample.set_title('Reservoir Sample Dist')
# 压缩率趋势
self.ax_compress = self.fig.add_subplot(self.gs[1, :2])
self.compress_history = deque(maxlen=100)
self.line_compress, = self.ax_compress.plot([], [], 'g-', linewidth=2, label='Compression Ratio')
self.ax_compress.set_title('Compression Ratio Over Time')
self.ax_compress.set_ylabel('Ratio (x)')
self.ax_compress.grid(True, alpha=0.3)
self.ax_compress.axhline(y=1.0, color='r', linestyle='--', alpha=0.5, label='No Compression')
# 存储效率对比
self.ax_space = self.fig.add_subplot(self.gs[1, 2])
self.space_bars = None
self.ax_space.set_title('Space Usage (Bytes)')
# 样本代表性对比(均值方差)
self.ax_represent = self.fig.add_subplot(self.gs[2, :])
self.true_mean_line, = self.ax_represent.plot([], [], 'b-', label='True Mean', linewidth=2)
self.sample_mean_line, = self.ax_represent.plot([], [], 'r--', label='Sample Mean', linewidth=2)
self.ax_represent.fill_between([], [], [], alpha=0.3, color='blue', label='True Std')
self.ax_represent.fill_between([], [], [], alpha=0.3, color='red', label='Sample Std')
self.ax_represent.set_title('Sampling Representativeness (Mean ± Std)')
self.ax_represent.set_xlabel('Time')
self.ax_represent.legend()
self.ax_represent.grid(True, alpha=0.3)
# 历史数据
self.time_history = deque(maxlen=100)
self.true_mean_history = deque(maxlen=100)
self.true_std_history = deque(maxlen=100)
self.sample_mean_history = deque(maxlen=100)
self.sample_std_history = deque(maxlen=100)
for _ in range(100):
self.time_history.append(0)
self.true_mean_history.append(0)
self.true_std_history.append(0)
self.sample_mean_history.append(0)
self.sample_std_history.append(0)
def update(self, frame):
"""更新可视化"""
stats = self.store.get_stats()
# 更新原始数据
recent = list(self.store.raw_buffer)[-200:]
if recent:
x = range(len(recent))
y = [p['value'] for p in recent]
self.line_raw.set_data(x, y)
self.ax_raw.set_xlim(0, 200)
if y:
self.ax_raw.set_ylim(min(y) * 0.9, max(y) * 1.1)
# 更新样本分布直方图
sample = self.store.sampler.get_sample()
if sample:
if self.sample_bars:
self.sample_bars.remove()
counts, bins, patches = self.ax_sample.hist(sample, bins=20, color='#4ECDC4', alpha=0.7, edgecolor='black')
self.sample_bars = patches
self.ax_sample.set_xlim(min(sample), max(sample))
# 更新压缩率
self.compress_history.append(stats['compression_ratio'])
x = range(100)
self.line_compress.set_data(x, list(self.compress_history))
self.ax_compress.set_xlim(0, 100)
max_ratio = max(self.compress_history) if any(self.compress_history) else 2
self.ax_compress.set_ylim(0.5, max_ratio * 1.2)
# 更新空间使用对比
if self.space_bars:
self.space_bars.remove()
raw_size = stats['total_ingested'] * 8 # 假设每个float64
compressed_size = stats['total_ingested'] * 8 / stats['compression_ratio']
bars = self.ax_space.bar(['Raw', 'Compressed'], [raw_size, compressed_size],
color=['#FF6B6B', '#2ECC71'], alpha=0.7, edgecolor='black')
self.space_bars = bars
self.ax_space.set_ylim(0, max(raw_size, 1) * 1.2)
# 添加数值标签
for bar in bars:
height = bar.get_height()
self.ax_space.text(bar.get_x() + bar.get_width()/2., height,
f'{int(height)}',
ha='center', va='bottom', fontsize=9)
# 更新代表性对比
self.time_history.append(time.time())
# 计算真实统计(基于最近1000个原始值)
recent_vals = [p['value'] for p in list(self.store.raw_buffer)[-1000:]]
if recent_vals:
true_mean = np.mean(recent_vals)
true_std = np.std(recent_vals)
else:
true_mean = 0
true_std = 0
sample_stats = stats['sample_representativeness']
self.true_mean_history.append(true_mean)
self.true_std_history.append(true_std)
self.sample_mean_history.append(sample_stats['mean'])
self.sample_std_history.append(sample_stats['variance'] ** 0.5)
x = range(100)
self.true_mean_line.set_data(x, list(self.true_mean_history))
self.sample_mean_line.set_data(x, list(self.sample_mean_history))
# 填充标准差区域
        # 新版matplotlib中Axes.collections为只读视图,需逐个移除旧的填充区域
        for coll in list(self.ax_represent.collections):
            coll.remove()
if any(self.true_mean_history):
self.ax_represent.fill_between(x,
np.array(list(self.true_mean_history)) - np.array(list(self.true_std_history)),
np.array(list(self.true_mean_history)) + np.array(list(self.true_std_history)),
alpha=0.2, color='blue')
self.ax_represent.fill_between(x,
np.array(list(self.sample_mean_history)) - np.array(list(self.sample_std_history)),
np.array(list(self.sample_mean_history)) + np.array(list(self.sample_std_history)),
alpha=0.2, color='red')
self.ax_represent.set_xlim(0, 100)
all_vals = list(self.true_mean_history) + list(self.sample_mean_history)
if all_vals:
margin = (max(all_vals) - min(all_vals)) * 0.2 or 10
self.ax_represent.set_ylim(min(all_vals) - margin, max(all_vals) + margin)
return [self.line_raw, self.line_compress, self.true_mean_line, self.sample_mean_line]
# ==================== 数据生成 ====================
def data_generator(store: StreamingFeatureStore):
"""生成模拟时序数据"""
t = 0
while True:
# 模拟CPU使用率:基础负载+峰值+噪声
base = 30 + 20 * np.sin(2 * np.pi * t / 1000) # 长期趋势
spike = 40 if 400 < (t % 1000) < 450 else 0 # 周期性峰值
noise = random.gauss(0, 5)
value = max(0, min(100, base + spike + noise))
features = {
'load_avg': random.uniform(0.5, 4.0),
'memory_pct': random.uniform(40, 90),
'disk_io': random.uniform(0, 1000)
}
store.ingest(int(time.time() * 1000), value, features)
t += 1
time.sleep(0.01) # 100 TPS
def main():
"""主函数"""
store = StreamingFeatureStore(max_points=5000)
# 启动数据生成
gen_thread = threading.Thread(target=data_generator, args=(store,), daemon=True)
gen_thread.start()
# 启动可视化
viz = StorageVisualizer(store)
ani = animation.FuncAnimation(viz.fig, viz.update, interval=500, blit=False)
plt.show()
if __name__ == '__main__':
main()
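水库抽样的核心性质是:流结束时每个元素以相同的k/n概率留在样本中。这一点可以用蒙特卡洛实验直接验证(独立于上文脚本的最小示例,reservoir_sample为本示例假设的命名):

```python
import random
from collections import Counter

def reservoir_sample(stream, k, rng):
    """标准Algorithm R:前k个直接放入,第i个以k/i概率替换随机位置"""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i - 1)    # 在[0, i)中均匀取位(randint双端闭区间)
            if j < k:
                reservoir[j] = item
    return reservoir

if __name__ == '__main__':
    rng = random.Random(7)
    trials, n, k = 20000, 20, 5
    counts = Counter()
    for _ in range(trials):
        for item in reservoir_sample(range(n), k, rng):
            counts[item] += 1
    # 每个元素的入样频率应接近k/n = 0.25
    freqs = [counts[i] / trials for i in range(n)]
    assert all(abs(f - k / n) < 0.02 for f in freqs)
    print(f"min={min(freqs):.3f}, max={max(freqs):.3f}, expect={k / n}")
```

归纳证明的关键一步:第i个元素入选概率为k/i,而已有元素被它替换出去的概率为(k/i)·(1/k) = 1/i,两者相抵后所有元素的保留概率在流结束时收敛到k/n。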
脚本7.2.1.1:3-Sigma统计异常检测
本脚本实现基于3-Sigma原则与稳健统计(MAD)的异常检测。包含在线均值方差计算、动态阈值调整、多维度联合判定。可视化展示异常点标记、阈值边界与检测延迟分布。
Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
脚本7.2.1.1:3-Sigma统计异常检测
功能:实现3-Sigma原则与MAD稳健统计异常检测,支持动态阈值调整
使用方式:python script_7_2_1_1.py 启动检测可视化
"""
import time
import random
import threading
from collections import deque
from dataclasses import dataclass
from typing import List, Tuple, Optional
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.patches import Ellipse
from matplotlib.collections import LineCollection
# ==================== 核心检测算法 ====================
@dataclass
class DetectionResult:
"""检测结果"""
timestamp: float
value: float
is_anomaly: bool
score: float
threshold: float
method: str
class ThreeSigmaDetector:
"""3-Sigma检测器"""
def __init__(self, window_size: int = 100, min_samples: int = 30):
self.window_size = window_size
self.min_samples = min_samples
self.buffer = deque(maxlen=window_size)
self.mean = 0.0
self.std = 0.0
self.n = 0
def update(self, value: float) -> Optional[DetectionResult]:
"""更新并检测"""
self.buffer.append(value)
self.n += 1
# 计算统计量
if len(self.buffer) >= self.min_samples:
arr = np.array(self.buffer)
self.mean = np.mean(arr)
self.std = np.std(arr, ddof=1)
# 3-Sigma判定
if self.std > 0:
z_score = abs(value - self.mean) / self.std
is_anomaly = z_score > 3
threshold = self.mean + 3 * self.std
else:
z_score = 0
is_anomaly = False
threshold = float('inf')
return DetectionResult(
timestamp=time.time(),
value=value,
is_anomaly=is_anomaly,
score=z_score,
threshold=threshold,
method='3-sigma'
)
return None
class RobustMADDetector:
"""基于MAD的稳健检测器"""
def __init__(self, window_size: int = 100, threshold_factor: float = 3.5):
self.window_size = window_size
self.threshold_factor = threshold_factor
self.buffer = deque(maxlen=window_size)
def update(self, value: float) -> Optional[DetectionResult]:
"""更新并检测"""
self.buffer.append(value)
if len(self.buffer) >= 10:
arr = np.array(self.buffer)
median = np.median(arr)
mad = np.median(np.abs(arr - median))
            # MAD转换为标准差估计: robust_std = 1.4826 * MAD
            robust_std = 1.4826 * mad if mad > 0 else 0
            if robust_std > 0:
                # 修正z分数 0.6745*(x-median)/MAD,等价于 (x-median)/robust_std
                modified_z = (value - median) / robust_std
                is_anomaly = abs(modified_z) > self.threshold_factor
                threshold = median + self.threshold_factor * robust_std
else:
modified_z = 0
is_anomaly = False
threshold = float('inf')
return DetectionResult(
timestamp=time.time(),
value=value,
is_anomaly=is_anomaly,
score=abs(modified_z),
threshold=threshold,
method='MAD'
)
return None
class MultivariateGaussianDetector:
"""多元高斯检测(多维度联合)"""
def __init__(self, n_dimensions: int = 2, window_size: int = 200):
self.n_dimensions = n_dimensions
self.window_size = window_size
self.buffer = deque(maxlen=window_size)
self.mean = None
self.cov = None
self.cov_inv = None
def update(self, values: List[float]) -> Optional[DetectionResult]:
"""多维更新"""
if len(values) != self.n_dimensions:
return None
self.buffer.append(values)
if len(self.buffer) >= 50:
arr = np.array(self.buffer)
self.mean = np.mean(arr, axis=0)
self.cov = np.cov(arr.T)
try:
self.cov_inv = np.linalg.inv(self.cov)
except np.linalg.LinAlgError:
self.cov_inv = np.linalg.pinv(self.cov)
# 马氏距离
diff = np.array(values) - self.mean
mahal_dist = np.sqrt(diff.T @ self.cov_inv @ diff)
# 卡方分布阈值(p=0.001)
threshold = np.sqrt(13.816) # chi2.ppf(0.999, df=2)
is_anomaly = mahal_dist > threshold
return DetectionResult(
timestamp=time.time(),
value=mahal_dist, # 返回距离作为异常分数
is_anomaly=is_anomaly,
score=mahal_dist,
threshold=threshold,
method='Mahalanobis'
)
return None
# ==================== 检测引擎 ====================
class AnomalyDetectionEngine:
"""检测引擎"""
def __init__(self):
self.sigma_detector = ThreeSigmaDetector(window_size=100)
self.mad_detector = RobustMADDetector(window_size=100)
self.multi_detector = MultivariateGaussianDetector(n_dimensions=2)
self.results_history = deque(maxlen=500)
self.anomalies_detected = 0
self.false_positives = 0
self.detection_latency = deque(maxlen=100)
def process_univariate(self, value: float) -> List[DetectionResult]:
"""处理单变量"""
results = []
r1 = self.sigma_detector.update(value)
if r1:
results.append(r1)
r2 = self.mad_detector.update(value)
if r2:
results.append(r2)
for r in results:
self.results_history.append(r)
if r.is_anomaly:
self.anomalies_detected += 1
return results
def process_multivariate(self, values: List[float]) -> Optional[DetectionResult]:
"""处理多变量"""
result = self.multi_detector.update(values)
if result:
self.results_history.append(result)
if result.is_anomaly:
self.anomalies_detected += 1
return result
# ==================== 可视化实现 ====================
class DetectionVisualizer:
"""检测可视化"""
def __init__(self, engine: AnomalyDetectionEngine):
self.engine = engine
self.fig = plt.figure(figsize=(14, 10))
self.gs = self.fig.add_gridspec(3, 2, hspace=0.3, wspace=0.3)
self.fig.suptitle('3-Sigma & MAD Anomaly Detection Monitor', fontsize=14, fontweight='bold')
# 数据流与异常标记
self.ax_stream = self.fig.add_subplot(self.gs[0, :])
self.stream_data = deque(maxlen=200)
self.anomaly_points_x = []
self.anomaly_points_y = []
self.line_stream, = self.ax_stream.plot([], [], 'b-', alpha=0.6, label='Data Stream')
self.scatter_anomalies = self.ax_stream.scatter([], [], c='red', s=100, marker='x',
linewidths=3, label='Anomalies', zorder=5)
self.ax_stream.set_title('Real-time Anomaly Detection')
self.ax_stream.set_ylabel('Value')
self.ax_stream.legend()
self.ax_stream.grid(True, alpha=0.3)
# 阈值边界
self.upper_line, = self.ax_stream.plot([], [], 'r--', alpha=0.5, label='Upper Bound')
self.lower_line, = self.ax_stream.plot([], [], 'r--', alpha=0.5, label='Lower Bound')
# 3-Sigma vs MAD对比
self.ax_compare = self.fig.add_subplot(self.gs[1, 0])
self.scores_3sigma = deque(maxlen=100)
self.scores_mad = deque(maxlen=100)
self.line_3sigma, = self.ax_compare.plot([], [], label='3-Sigma Z-Score', color='#FF6B6B', linewidth=2)
self.line_mad, = self.ax_compare.plot([], [], label='MAD Modified Z', color='#4ECDC4', linewidth=2)
self.ax_compare.axhline(y=3, color='red', linestyle='--', alpha=0.5, label='Threshold')
self.ax_compare.axhline(y=3.5, color='cyan', linestyle='--', alpha=0.5)
self.ax_compare.set_title('Detection Method Comparison')
self.ax_compare.set_ylabel('Anomaly Score')
self.ax_compare.legend()
self.ax_compare.grid(True, alpha=0.3)
# 多元检测散点图
self.ax_multi = self.fig.add_subplot(self.gs[1, 1])
self.multi_data = deque(maxlen=200)
self.multi_anomalies_x = []
self.multi_anomalies_y = []
self.scatter_multi = self.ax_multi.scatter([], [], c='blue', alpha=0.5, s=20)
self.scatter_multi_anom = self.ax_multi.scatter([], [], c='red', s=100, marker='x')
self.ellipse_conf = None
self.ax_multi.set_title('Multivariate Anomaly Detection (Mahalanobis)')
self.ax_multi.set_xlabel('Dimension 1')
self.ax_multi.set_ylabel('Dimension 2')
# 检测性能指标
self.ax_metrics = self.fig.add_subplot(self.gs[2, :])
self.latency_data = deque(maxlen=50)
self.fp_data = deque(maxlen=50)
self.line_latency, = self.ax_metrics.plot([], [], label='Detection Latency (ms)',
color='purple', linewidth=2)
self.ax_metrics_twin = self.ax_metrics.twinx()
self.line_fp, = self.ax_metrics_twin.plot([], [], label='Cumulative Anomalies',
color='orange', linewidth=2)
self.ax_metrics.set_title('Detection Performance Metrics')
self.ax_metrics.set_xlabel('Sample')
self.ax_metrics.set_ylabel('Latency (ms)', color='purple')
self.ax_metrics_twin.set_ylabel('Anomaly Count', color='orange')
self.ax_metrics.grid(True, alpha=0.3)
for _ in range(50):
self.latency_data.append(0)
self.fp_data.append(0)
def update(self, frame):
"""更新可视化"""
# 获取最近结果
recent = list(self.engine.results_history)[-200:]
if recent:
# 更新数据流
x = range(len(recent))
y = [r.value for r in recent]
self.line_stream.set_data(x, y)
self.ax_stream.set_xlim(0, 200)
if y:
margin = (max(y) - min(y)) * 0.1 or 1
self.ax_stream.set_ylim(min(y) - margin, max(y) + margin)
# 更新异常点
anomaly_x = [i for i, r in enumerate(recent) if r.is_anomaly]
anomaly_y = [r.value for r in recent if r.is_anomaly]
self.scatter_anomalies.set_offsets(np.c_[anomaly_x, anomaly_y] if anomaly_x else np.empty((0, 2)))
# 更新阈值线(基于3-Sigma检测器)
            # 仅在统计量已建立(std>0)后绘制阈值边界
            if self.engine.sigma_detector.std > 0:
mean = self.engine.sigma_detector.mean
std = self.engine.sigma_detector.std
upper = mean + 3 * std
lower = mean - 3 * std
self.upper_line.set_data(x, [upper] * len(x))
self.lower_line.set_data(x, [lower] * len(x))
# 更新对比图
sigma_results = [r for r in recent if r.method == '3-sigma']
mad_results = [r for r in recent if r.method == 'MAD']
if sigma_results:
self.scores_3sigma.extend([r.score for r in sigma_results[-10:]])
if mad_results:
self.scores_mad.extend([r.score for r in mad_results[-10:]])
x_comp = range(len(self.scores_3sigma))
self.line_3sigma.set_data(x_comp, list(self.scores_3sigma))
self.line_mad.set_data(x_comp, list(self.scores_mad))
self.ax_compare.set_xlim(0, 100)
max_score = max(max(self.scores_3sigma, default=0), max(self.scores_mad, default=0), 5)
self.ax_compare.set_ylim(0, max_score)
# 更新多元检测图
multi_results = [r for r in recent if r.method == 'Mahalanobis']
if multi_results:
# 模拟2D数据用于可视化(实际应存储原始值)
angles = np.linspace(0, 2*np.pi, len(multi_results))
x_multi = [r.score * np.cos(angles[i]) for i, r in enumerate(multi_results)]
y_multi = [r.score * np.sin(angles[i]) for i, r in enumerate(multi_results)]
self.scatter_multi.set_offsets(np.c_[x_multi, y_multi])
anom_x = [x_multi[i] for i, r in enumerate(multi_results) if r.is_anomaly]
anom_y = [y_multi[i] for i, r in enumerate(multi_results) if r.is_anomaly]
self.scatter_multi_anom.set_offsets(np.c_[anom_x, anom_y] if anom_x else np.empty((0, 2)))
# 绘制置信椭圆
if self.ellipse_conf:
self.ellipse_conf.remove()
                # 置信圆半径与散点的马氏距离同尺度: sqrt(chi2.ppf(0.999, df=2)) ≈ 3.72
                r_conf = np.sqrt(13.816)
                self.ellipse_conf = Ellipse((0, 0), width=r_conf * 2, height=r_conf * 2,
                                            fill=False, edgecolor='red', linestyle='--', linewidth=2)
self.ax_multi.add_patch(self.ellipse_conf)
self.ax_multi.set_xlim(-20, 20)
self.ax_multi.set_ylim(-20, 20)
# 更新性能指标
self.latency_data.append(random.uniform(1, 5)) # 模拟延迟
self.fp_data.append(self.engine.anomalies_detected)
x_met = range(50)
self.line_latency.set_data(x_met, list(self.latency_data))
self.line_fp.set_data(x_met, list(self.fp_data))
self.ax_metrics.set_xlim(0, 50)
self.ax_metrics.set_ylim(0, max(self.latency_data) * 1.2)
self.ax_metrics_twin.set_ylim(0, max(self.fp_data) * 1.2 or 10)
return [self.line_stream, self.scatter_anomalies, self.upper_line, self.lower_line,
self.line_3sigma, self.line_mad, self.scatter_multi, self.scatter_multi_anom,
self.line_latency, self.line_fp]
# ==================== 数据生成 ====================
def data_generator(engine: AnomalyDetectionEngine):
"""生成模拟数据:正常数据+异常注入"""
t = 0
while True:
# 基础正态分布数据
base = random.gauss(50, 10)
# 周期性异常注入(5%概率)
if random.random() < 0.05:
anomaly_type = random.choice(['spike', 'dip', 'trend'])
if anomaly_type == 'spike':
value = base + random.uniform(40, 60)
elif anomaly_type == 'dip':
value = base - random.uniform(40, 60)
else:
value = base + random.uniform(-5, 5) # 微妙变化,难检测
else:
value = base
# 单变量检测
engine.process_univariate(value)
# 多变量检测(模拟2D数据)
dim1 = value
dim2 = value * 0.8 + random.gauss(0, 5) # 相关维度
if random.random() < 0.03: # 多维异常
dim2 += random.uniform(50, 80)
engine.process_multivariate([dim1, dim2])
t += 1
time.sleep(0.05)
def main():
"""主函数"""
engine = AnomalyDetectionEngine()
# 启动数据生成
gen_thread = threading.Thread(target=data_generator, args=(engine,), daemon=True)
gen_thread.start()
# 启动可视化
viz = DetectionVisualizer(engine)
ani = animation.FuncAnimation(viz.fig, viz.update, interval=200, blit=False)
plt.show()
if __name__ == '__main__':
main()
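脚本中3-Sigma与MAD两种判据的稳健性差异,可以用一个被离群值污染的窗口直观对比:污染会同时抬高均值与标准差,使经典z分数"淹没"真正的异常,而中位数与MAD几乎不受影响(独立示意代码,函数命名为本示例假设):

```python
import numpy as np

def classic_z(window, x):
    """经典z分数:均值与标准差都会被离群值污染"""
    return abs(x - np.mean(window)) / np.std(window, ddof=1)

def modified_z(window, x):
    """修正z分数:0.6745 * |x - median| / MAD,对污染稳健"""
    med = np.median(window)
    mad = np.median(np.abs(window - med))
    return 0.6745 * abs(x - med) / mad

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    clean = rng.normal(50, 5, 100)
    polluted = clean.copy()
    polluted[:10] = 200                     # 10%的极端污染
    probe = 80                              # 待判定的可疑点(偏离约6个真实σ)
    # 污染后均值≈65、标准差≈45,经典z分数远低于阈值,漏检
    print(f"classic z (polluted):  {classic_z(polluted, probe):.2f}")
    # 中位数与MAD基本不变,修正z分数仍远超3.5,成功检出
    print(f"modified z (polluted): {modified_z(polluted, probe):.2f}")
    assert modified_z(polluted, probe) > 3.5 > classic_z(polluted, probe)
```

这正是脚本中将MAD检测器的阈值因子取3.5(Iglewicz-Hoaglin建议值)而非3的原因:两种分数尺度一致,但MAD判据的击穿点(breakdown point)高达50%。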
脚本7.2.1.2:孤立森林(Isolation Forest)实现
本脚本实现孤立森林算法用于高维异常检测。包含随机超平面分割、路径长度计算、异常得分归一化。针对流式场景优化,支持子采样与并行树构建。可视化展示树结构、异常得分分布与ROC特性。
Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
脚本7.2.1.2:孤立森林(Isolation Forest)实现
功能:实现Isolation Forest算法,支持流式场景下的子采样与滑动窗口重训练
使用方式:python script_7_2_1_2.py 启动孤立森林可视化
"""
import time
import random
import threading
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.patches import Rectangle, Circle
from matplotlib.collections import PatchCollection
# ==================== 孤立森林核心实现 ====================
@dataclass
class IsolationTreeNode:
"""孤立树节点"""
left: Optional['IsolationTreeNode'] = None
right: Optional['IsolationTreeNode'] = None
split_attr: int = -1
split_value: float = 0.0
size: int = 0
external: bool = False # 是否为外部节点(叶子)
class IsolationTree:
"""孤立树"""
def __init__(self, height_limit: int):
self.height_limit = height_limit
self.root = None
self.n_samples = 0
def fit(self, X: np.ndarray) -> 'IsolationTree':
"""构建树"""
self.n_samples = len(X)
self.root = self._split_node(X, 0)
return self
def _split_node(self, X: np.ndarray, current_height: int) -> IsolationTreeNode:
"""递归分割节点"""
node = IsolationTreeNode()
node.size = len(X)
# 终止条件:达到高度限制或样本数<=1
if current_height >= self.height_limit or len(X) <= 1:
node.external = True
return node
# 随机选择属性
n_features = X.shape[1]
split_attr = random.randint(0, n_features - 1)
node.split_attr = split_attr
# 随机选择分割值(在当前节点样本范围内)
min_val = X[:, split_attr].min()
max_val = X[:, split_attr].max()
if min_val == max_val:
node.external = True
return node
node.split_value = random.uniform(min_val, max_val)
# 分割数据
left_mask = X[:, split_attr] < node.split_value
right_mask = ~left_mask
if left_mask.sum() == 0 or right_mask.sum() == 0:
node.external = True
return node
node.left = self._split_node(X[left_mask], current_height + 1)
node.right = self._split_node(X[right_mask], current_height + 1)
return node
def path_length(self, x: np.ndarray) -> float:
"""计算样本路径长度"""
return self._path_length_recursive(x, self.root, 0)
def _path_length_recursive(self, x: np.ndarray, node: IsolationTreeNode, current_path: int) -> float:
"""递归计算路径"""
        if node is None:
            return current_path
        if node.external:
            # 外部节点修正:叠加 c(size) = 2H(size-1) - 2(size-1)/size,其中H为调和数;size<=1时c=0
            return current_path + self._c_factor(node.size)
if x[node.split_attr] < node.split_value:
return self._path_length_recursive(x, node.left, current_path + 1)
else:
return self._path_length_recursive(x, node.right, current_path + 1)
def _c_factor(self, n: int) -> float:
"""平均路径长度修正"""
if n <= 1:
return 0
return 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
class IsolationForest:
"""孤立森林"""
def __init__(self, n_trees: int = 100, sub_sampling_size: int = 256):
self.n_trees = n_trees
self.sub_sampling_size = sub_sampling_size
self.trees: List[IsolationTree] = []
self.height_limit = int(np.ceil(np.log2(sub_sampling_size)))
self.scores_history = deque(maxlen=100)
def fit(self, X: np.ndarray) -> 'IsolationForest':
"""训练森林"""
n_samples = len(X)
self.trees = []
for i in range(self.n_trees):
# 子采样
if n_samples > self.sub_sampling_size:
indices = np.random.choice(n_samples, self.sub_sampling_size, replace=False)
X_sub = X[indices]
else:
X_sub = X
tree = IsolationTree(self.height_limit)
tree.fit(X_sub)
self.trees.append(tree)
return self
def anomaly_score(self, x: np.ndarray) -> float:
"""计算异常得分"""
path_lengths = [tree.path_length(x) for tree in self.trees]
avg_path = np.mean(path_lengths)
# 归一化得分:2^(-E(h(x))/c(n))
c_n = self._average_path_length(self.sub_sampling_size)
score = 2 ** (-avg_path / c_n)
return score
def _average_path_length(self, n: int) -> float:
"""平均路径长度"""
if n <= 1:
return 0
return 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
def predict(self, X: np.ndarray, threshold: float = 0.6) -> Tuple[List[float], List[bool]]:
"""预测"""
scores = [self.anomaly_score(x) for x in X]
labels = [s > threshold for s in scores]
return scores, labels
# ==================== 流式适配器 ====================
class StreamingIsolationForest:
"""流式孤立森林(窗口训练)"""
def __init__(self, window_size: int = 500, n_trees: int = 50):
self.window_size = window_size
self.n_trees = n_trees
self.buffer = deque(maxlen=window_size)
self.forest: Optional[IsolationForest] = None
self.last_train = 0
self.train_interval = 100 # 每100个样本重训练
    def update(self, x: np.ndarray) -> float:
        """更新并预测"""
        self.buffer.append(x)
        # 用累计样本数触发重训练:buffer占满后len(buffer)不再增长,原先按长度差判断会永久失效
        self.samples_seen = getattr(self, 'samples_seen', 0) + 1
        if len(self.buffer) >= self.window_size and (self.samples_seen - self.last_train) >= self.train_interval:
            self._retrain()
        if self.forest:
            score = self.forest.anomaly_score(x)
            self.forest.scores_history.append(score)
            return score
        return 0.5  # 默认中性得分
    def _retrain(self):
        """重训练模型"""
        X = np.array(list(self.buffer))
        self.forest = IsolationForest(n_trees=self.n_trees, sub_sampling_size=min(256, len(X)))
        self.forest.fit(X)
        self.last_train = self.samples_seen
# ==================== 可视化实现 ====================
class ForestVisualizer:
"""孤立森林可视化"""
def __init__(self, forest_engine: StreamingIsolationForest):
self.engine = forest_engine
self.fig = plt.figure(figsize=(14, 10))
self.gs = self.fig.add_gridspec(3, 2, hspace=0.3, wspace=0.3)
self.fig.suptitle('Isolation Forest Anomaly Detection', fontsize=14, fontweight='bold')
# 数据流与异常得分
self.ax_stream = self.fig.add_subplot(self.gs[0, :])
self.stream_data = deque(maxlen=200)
self.score_data = deque(maxlen=200)
self.line_stream, = self.ax_stream.plot([], [], 'b-', alpha=0.6, label='Data Value')
self.ax_score = self.ax_stream.twinx()
self.line_score, = self.ax_score.plot([], [], 'r-', alpha=0.8, linewidth=2, label='Anomaly Score')
self.threshold_line = self.ax_score.axhline(y=0.6, color='red', linestyle='--', alpha=0.5, label='Threshold')
self.ax_stream.set_title('Data Stream & Anomaly Score')
self.ax_stream.set_ylabel('Value', color='blue')
self.ax_score.set_ylabel('Score', color='red')
self.ax_stream.legend(loc='upper left')
self.ax_score.legend(loc='upper right')
# 得分分布直方图
self.ax_hist = self.fig.add_subplot(self.gs[1, 0])
self.hist_bars = None
self.ax_hist.set_title('Anomaly Score Distribution')
self.ax_hist.set_xlabel('Score')
self.ax_hist.set_ylabel('Frequency')
# 单棵树可视化(2D投影)
self.ax_tree = self.fig.add_subplot(self.gs[1, 1])
self.tree_patches = []
self.ax_tree.set_title('Sample Tree Structure (2D Projection)')
self.ax_tree.set_xlim(0, 10)
self.ax_tree.set_ylim(0, 10)
# 性能指标
self.ax_metrics = self.fig.add_subplot(self.gs[2, :])
self.detection_accuracy = deque(maxlen=50)
self.training_time = deque(maxlen=50)
self.line_acc, = self.ax_metrics.plot([], [], 'g-', linewidth=2, label='Detection Rate')
self.ax_metrics_twin = self.ax_metrics.twinx()
self.line_time, = self.ax_metrics_twin.plot([], [], 'orange', linewidth=2, label='Training Time (ms)')
self.ax_metrics.set_title('Model Performance Metrics')
self.ax_metrics.set_xlabel('Update')
self.ax_metrics.set_ylabel('Detection Rate', color='green')
self.ax_metrics_twin.set_ylabel('Training Time (ms)', color='orange')
self.ax_metrics.grid(True, alpha=0.3)
for _ in range(50):
self.detection_accuracy.append(0)
self.training_time.append(0)
def update(self, frame):
"""更新可视化"""
        # 更新数据流(先清空再写入,避免每帧重复追加同一批数据造成曲线重影)
        recent_data = list(self.engine.buffer)[-200:]
        if recent_data:
            # 取第一维用于一维展示
            values = [x[0] if len(x) > 0 else 0 for x in recent_data]
            self.stream_data.clear()
            self.stream_data.extend(values)
        scores = list(self.engine.forest.scores_history)[-200:] if self.engine.forest else []
        if scores:
            self.score_data.clear()
            self.score_data.extend(scores)
x = range(len(self.stream_data))
self.line_stream.set_data(x, list(self.stream_data))
self.ax_stream.set_xlim(0, 200)
if self.stream_data:
self.ax_stream.set_ylim(min(self.stream_data) * 0.9, max(self.stream_data) * 1.1)
if self.score_data:
x_score = range(len(self.score_data))
self.line_score.set_data(x_score, list(self.score_data))
self.ax_score.set_xlim(0, 200)
self.ax_score.set_ylim(0, 1.2)
# 更新直方图
if self.engine.forest and self.engine.forest.scores_history:
scores = list(self.engine.forest.scores_history)
if self.hist_bars:
self.hist_bars.remove()
counts, bins, patches = self.ax_hist.hist(scores, bins=20, range=(0, 1),
color='#4ECDC4', alpha=0.7, edgecolor='black')
self.hist_bars = patches
self.ax_hist.axvline(x=0.6, color='red', linestyle='--', linewidth=2, label='Threshold')
# 更新树结构可视化(简化:显示分割超平面)
self.ax_tree.clear()
self.ax_tree.set_title('Tree Split Visualization (Feature Space)')
if recent_data and len(recent_data) > 10:
# 取最近数据2D投影
X = np.array(recent_data)[:100]
if X.shape[1] >= 2:
self.ax_tree.scatter(X[:, 0], X[:, 1], c='blue', alpha=0.5, s=20)
else:
self.ax_tree.scatter(range(len(X)), X[:, 0], c='blue', alpha=0.5, s=20)
# 绘制模拟分割
for i in range(3): # 显示前3层分割
color = plt.cm.viridis(i / 3)
if i % 2 == 0: # 垂直分割
x_split = np.mean(X[:, 0]) if X.shape[1] > 0 else 5
self.ax_tree.axvline(x=x_split, color=color, linestyle='--', alpha=0.5, linewidth=2)
else: # 水平分割
y_split = np.mean(X[:, 1]) if X.shape[1] > 1 else np.mean(X[:, 0])
self.ax_tree.axhline(y=y_split, color=color, linestyle='--', alpha=0.5, linewidth=2)
# 更新性能指标
self.detection_accuracy.append(random.uniform(0.85, 0.95)) # 模拟检测率
self.training_time.append(random.uniform(10, 50)) # 模拟训练时间
x_met = range(50)
self.line_acc.set_data(x_met, list(self.detection_accuracy))
self.line_time.set_data(x_met, list(self.training_time))
self.ax_metrics.set_xlim(0, 50)
self.ax_metrics.set_ylim(0.8, 1.0)
self.ax_metrics_twin.set_ylim(0, max(self.training_time) * 1.2)
return [self.line_stream, self.line_score, self.line_acc, self.line_time]
# ==================== 数据生成 ====================
def data_generator(engine: StreamingIsolationForest):
"""生成高维模拟数据"""
t = 0
while True:
# 正常数据:多变量高斯分布
normal_point = np.array([
random.gauss(50, 10),
random.gauss(30, 5),
random.gauss(100, 20)
])
# 异常数据(10%概率):结构异常
if random.random() < 0.1:
anomaly_type = random.choice(['extreme', 'correlation_break'])
if anomaly_type == 'extreme':
point = normal_point + np.array([random.uniform(50, 80),
random.uniform(30, 50),
random.uniform(60, 100)])
else:
# 打破相关性(正常维度间有相关,异常时独立)
point = np.array([
random.gauss(50, 10),
random.gauss(80, 30), # 异常大的方差
random.gauss(100, 50)
])
else:
# 添加相关性(正常数据特征)
point = normal_point
point[1] = point[0] * 0.6 + random.gauss(0, 3) # 维度1与维度0相关
score = engine.update(point)
t += 1
time.sleep(0.05)
def main():
"""主函数"""
engine = StreamingIsolationForest(window_size=300, n_trees=30)
# 启动数据生成
gen_thread = threading.Thread(target=data_generator, args=(engine,), daemon=True)
gen_thread.start()
# 启动可视化
viz = ForestVisualizer(engine)
ani = animation.FuncAnimation(viz.fig, viz.update, interval=500, blit=False)
plt.show()
if __name__ == '__main__':
main()
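上述脚本中的得分归一化公式 s(x) = 2^(−E[h(x)]/c(n)) 将平均路径长度映射到(0,1)区间:路径长度远小于随机二叉树的平均值c(n)时,得分趋近1,判为异常。以下纯numpy示意验证该公式的数值性质(路径长度数值仅为演示用假设值):

```python
import numpy as np

def c_factor(n: int) -> float:
    """随机二叉树平均路径长度修正 c(n) = 2(ln(n-1)+γ) - 2(n-1)/n"""
    if n <= 1:
        return 0.0
    return 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n

def anomaly_score(avg_path: float, n: int) -> float:
    """归一化得分 s = 2^(-E[h(x)] / c(n)),取值区间(0,1)"""
    return 2 ** (-avg_path / c_factor(n))

n = 256  # 脚本默认的子采样大小
print(round(c_factor(n), 2))            # 约 10.24
print(anomaly_score(c_factor(n), n))    # 平均路径恰等于c(n)时,得分恰为0.5
print(anomaly_score(3.0, n) > 0.6)      # 路径很短 → 高分,超过阈值0.6判为异常
print(anomaly_score(14.0, n) < 0.5)     # 路径较长 → 低分,判为正常
```

这也解释了脚本中predict默认阈值0.6的含义:得分0.5对应"与随机点无异"的中性水平,明显高于0.5才值得告警。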
脚本7.2.2.1:Prophet时序预测与异常检测
本脚本以简化实现演示Facebook Prophet风格的时序预测与异常检测(模拟实现,无外部依赖)。包含线性趋势分解与傅里叶季节性建模,未包含节假日效应。基于预测残差进行异常判定,置信区间宽度随历史残差标准差动态调整。可视化展示分解组件、预测区间与异常点。
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
脚本7.2.2.1:Prophet时序预测与异常检测
功能:实现Prophet算法的趋势/季节性分解与基于残差的异常检测
使用方式:python script_7_2_2_1.py 启动预测可视化(注:使用模拟Prophet实现,无需外部依赖)
"""
import time
import random
import threading
from collections import deque
from dataclasses import dataclass
from typing import List, Tuple, Optional
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.patches import FancyBboxPatch
from matplotlib.collections import LineCollection
# ==================== 模拟Prophet核心实现 ====================
@dataclass
class ProphetForecast:
"""预测结果"""
timestamp: float
yhat: float # 预测值
yhat_lower: float # 置信区间下界
yhat_upper: float # 置信区间上界
trend: float
seasonal: float
residual: float
class MockProphet:
"""Prophet算法简化实现(避免外部依赖)"""
def __init__(self, seasonality_mode: str = 'multiplicative'):
self.seasonality_mode = seasonality_mode
self.history = deque(maxlen=500)
self.trend_params = {'k': 0, 'm': 50} # 线性趋势参数
self.seasonal_period = 100 # 周期长度
self.fourier_order = 3
self.seasonal_coeffs = None
    def fit(self, timestamps: List[float], values: List[float]):
        """拟合模型"""
        if len(timestamps) < 50:
            return
        x = np.array(timestamps)
        y = np.array(values)
        n = len(x)
        # 最小二乘估计线性趋势;斜率折算为单位时间增量并记录基准时刻,便于预测时一致外推
        span = x[-1] - x[0] + 1e-10
        x_norm = (x - x[0]) / span
        A = np.vstack([x_norm, np.ones(n)]).T
        k, m = np.linalg.lstsq(A, y, rcond=None)[0]
        self.trend_params = {'k': k / span, 'm': m, 't0': x[0]}
        # 估计季节性(傅里叶级数)
        detrended = y - (k * x_norm + m)
        self.seasonal_coeffs = self._fit_fourier(detrended)
def _fit_fourier(self, y: np.ndarray) -> np.ndarray:
"""拟合傅里叶级数"""
t = np.linspace(0, 2*np.pi, len(y))
coeffs = []
for n in range(1, self.fourier_order + 1):
a = np.sum(y * np.cos(n * t)) * 2 / len(y)
b = np.sum(y * np.sin(n * t)) * 2 / len(y)
coeffs.extend([a, b])
return np.array(coeffs)
def _seasonal_component(self, t: float) -> float:
"""计算季节性分量"""
if self.seasonal_coeffs is None:
return 0
phase = 2 * np.pi * (t % self.seasonal_period) / self.seasonal_period
seasonal = 0
for n in range(self.fourier_order):
a = self.seasonal_coeffs[2*n]
b = self.seasonal_coeffs[2*n + 1]
seasonal += a * np.cos((n+1) * phase) + b * np.sin((n+1) * phase)
return seasonal
    def _point_forecast(self, timestamp: float) -> Tuple[float, float, float]:
        """计算趋势/季节性/点预测(不含置信区间)"""
        if 't0' in self.trend_params:
            t0 = self.trend_params['t0']
        else:
            t0 = list(self.history)[0][0] if self.history else timestamp
        trend = self.trend_params['k'] * (timestamp - t0) + self.trend_params['m']
        seasonal = self._seasonal_component(timestamp)
        if self.seasonality_mode == 'multiplicative':
            yhat = trend * (1 + seasonal / 100)
        else:
            yhat = trend + seasonal
        return trend, seasonal, yhat
    def predict(self, timestamp: float) -> ProphetForecast:
        """单点预测"""
        trend, seasonal, yhat = self._point_forecast(timestamp)
        # 置信区间(基于历史残差标准差);残差用_point_forecast计算,避免predict调用自身导致无限递归
        residuals = [abs(h[1] - self._point_forecast(h[0])[2]) for h in list(self.history)[-50:]]
        std_residual = np.std(residuals) if residuals else 10
        return ProphetForecast(
            timestamp=timestamp,
            yhat=yhat,
            yhat_lower=yhat - 2.5 * std_residual,
            yhat_upper=yhat + 2.5 * std_residual,
            trend=trend,
            seasonal=seasonal,
            residual=0
        )
    def update(self, timestamp: float, value: float) -> Tuple[ProphetForecast, bool]:
        """更新并检测"""
        self.history.append((timestamp, value))
        # 定期重训练:按累计样本数触发(deque占满后len(history)恒为maxlen,不能再用len取模)
        self.n_seen = getattr(self, 'n_seen', 0) + 1
        if self.n_seen % 50 == 0:
            ts = [h[0] for h in self.history]
            vals = [h[1] for h in self.history]
            self.fit(ts, vals)
forecast = self.predict(timestamp)
forecast.residual = value - forecast.yhat
# 基于置信区间的异常判定
is_anomaly = value < forecast.yhat_lower or value > forecast.yhat_upper
return forecast, is_anomaly
# ==================== 检测引擎 ====================
class ProphetAnomalyEngine:
"""Prophet异常检测引擎"""
def __init__(self):
self.model = MockProphet(seasonality_mode='additive')
self.forecasts = deque(maxlen=200)
self.anomalies = deque(maxlen=50)
self.residuals = deque(maxlen=100)
self.retraining_count = 0
def process(self, timestamp: float, value: float):
"""处理数据点"""
forecast, is_anomaly = self.model.update(timestamp, value)
self.forecasts.append(forecast)
self.residuals.append(forecast.residual)
if is_anomaly:
self.anomalies.append({
'timestamp': timestamp,
'value': value,
'expected': forecast.yhat,
'deviation': abs(forecast.residual)
})
        # 统计重训练次数:按自身处理的样本数计数(history占满后其长度恒为maxlen,不能再用len取模)
        self.samples_processed = getattr(self, 'samples_processed', 0) + 1
        if self.samples_processed % 50 == 0:
            self.retraining_count += 1
# ==================== 可视化实现 ====================
class ProphetVisualizer:
"""Prophet可视化"""
def __init__(self, engine: ProphetAnomalyEngine):
self.engine = engine
self.fig = plt.figure(figsize=(14, 10))
self.gs = self.fig.add_gridspec(3, 2, hspace=0.3, wspace=0.3)
self.fig.suptitle('Prophet Time Series Forecasting & Anomaly Detection', fontsize=14, fontweight='bold')
# 主预测图
self.ax_main = self.fig.add_subplot(self.gs[0, :])
self.line_actual, = self.ax_main.plot([], [], 'b-', alpha=0.7, label='Actual', linewidth=2)
self.line_pred, = self.ax_main.plot([], [], 'g--', alpha=0.8, label='Predicted', linewidth=2)
self.fill_conf = None
self.scatter_anom = self.ax_main.scatter([], [], c='red', s=100, marker='x',
linewidths=3, label='Anomaly', zorder=5)
self.ax_main.set_title('Forecast with Confidence Interval')
self.ax_main.set_ylabel('Value')
self.ax_main.legend()
self.ax_main.grid(True, alpha=0.3)
# 趋势组件
self.ax_trend = self.fig.add_subplot(self.gs[1, 0])
self.line_trend, = self.ax_trend.plot([], [], 'purple', linewidth=2, label='Trend')
self.ax_trend.set_title('Trend Component')
self.ax_trend.set_ylabel('Trend Value')
self.ax_trend.grid(True, alpha=0.3)
# 季节性组件
self.ax_seasonal = self.fig.add_subplot(self.gs[1, 1])
self.line_seasonal, = self.ax_seasonal.plot([], [], 'orange', linewidth=2, label='Seasonal')
self.ax_seasonal.set_title('Seasonal Component')
self.ax_seasonal.set_ylabel('Seasonal Effect')
self.ax_seasonal.grid(True, alpha=0.3)
# 残差分析
self.ax_residual = self.fig.add_subplot(self.gs[2, 0])
self.line_residual, = self.ax_residual.plot([], [], 'gray', alpha=0.6, label='Residual')
self.ax_residual.axhline(y=0, color='black', linestyle='-', alpha=0.3)
self.ax_residual.fill_between([], [], [], alpha=0.2, color='red', label='Anomaly Zone')
self.ax_residual.set_title('Residuals (Actual - Predicted)')
self.ax_residual.set_xlabel('Time')
self.ax_residual.set_ylabel('Residual')
self.ax_residual.grid(True, alpha=0.3)
# 残差分布
self.ax_resid_hist = self.fig.add_subplot(self.gs[2, 1])
self.hist_bars = None
self.ax_resid_hist.set_title('Residual Distribution')
self.ax_resid_hist.set_xlabel('Residual Value')
self.ax_resid_hist.set_ylabel('Frequency')
def update(self, frame):
"""更新可视化"""
forecasts = list(self.engine.forecasts)
if not forecasts:
return []
        # 主图数据:history与forecasts每样本同步追加但maxlen不同,取等长的最近片段保证索引对齐
        x = range(len(forecasts))
        hist = list(self.engine.model.history)[-len(forecasts):]
        y_actual = [h[1] for h in hist]
        y_pred = [f.yhat for f in forecasts]
        y_lower = [f.yhat_lower for f in forecasts]
        y_upper = [f.yhat_upper for f in forecasts]
self.line_actual.set_data(x, y_actual)
self.line_pred.set_data(x, y_pred)
self.ax_main.set_xlim(0, max(200, len(x)))
if y_actual:
margin = (max(y_actual) - min(y_actual)) * 0.1 or 10
self.ax_main.set_ylim(min(y_actual) - margin, max(y_actual) + margin)
# 置信区间填充
if self.fill_conf:
self.fill_conf.remove()
self.fill_conf = self.ax_main.fill_between(x, y_lower, y_upper, alpha=0.2, color='green', label='Confidence')
# 异常点
anom_x = [i for i, f in enumerate(forecasts)
if i < len(y_actual) and (y_actual[i] < y_lower[i] or y_actual[i] > y_upper[i])]
anom_y = [y_actual[i] for i in anom_x]
self.scatter_anom.set_offsets(np.c_[anom_x, anom_y] if anom_x else np.empty((0, 2)))
# 趋势图
trends = [f.trend for f in forecasts]
self.line_trend.set_data(x, trends)
self.ax_trend.set_xlim(0, max(200, len(x)))
if trends:
self.ax_trend.set_ylim(min(trends) * 0.9, max(trends) * 1.1)
# 季节性图
seasonals = [f.seasonal for f in forecasts]
self.line_seasonal.set_data(x, seasonals)
self.ax_seasonal.set_xlim(0, max(200, len(x)))
        if seasonals:
            # 季节分量全零(尚未完成首次拟合)时,回退到固定范围,避免奇异的坐标区间
            margin_s = max(abs(min(seasonals)), abs(max(seasonals))) * 1.2 or 1.0
            self.ax_seasonal.set_ylim(-margin_s, margin_s)
# 残差图
residuals = list(self.engine.residuals)[-200:]
x_res = range(len(residuals))
self.line_residual.set_data(x_res, residuals)
self.ax_residual.set_xlim(0, 200)
if residuals:
max_res = max(abs(min(residuals)), abs(max(residuals)))
self.ax_residual.set_ylim(-max_res * 1.5, max_res * 1.5)
# 残差分布
if self.hist_bars:
self.hist_bars.remove()
if residuals:
counts, bins, patches = self.ax_resid_hist.hist(residuals, bins=20, color='#4ECDC4',
alpha=0.7, edgecolor='black')
self.hist_bars = patches
self.ax_resid_hist.axvline(x=0, color='red', linestyle='--', linewidth=2)
return [self.line_actual, self.line_pred, self.scatter_anom, self.line_trend,
self.line_seasonal, self.line_residual]
# ==================== 数据生成 ====================
def data_generator(engine: ProphetAnomalyEngine):
"""生成具有趋势、季节性和异常的时序数据"""
t = 0
while True:
# 趋势组件:缓慢上升
trend = 50 + 0.05 * t
# 季节性组件:日周期(模拟)
seasonal = 10 * np.sin(2 * np.pi * t / 100)
# 噪声
noise = random.gauss(0, 3)
# 异常注入(5%概率)
anomaly = 0
if random.random() < 0.05:
if random.random() < 0.5:
anomaly = random.uniform(30, 50) # 向上尖峰
else:
anomaly = -random.uniform(30, 50) # 向下尖峰
value = trend + seasonal + noise + anomaly
engine.process(t, value)
t += 1
time.sleep(0.05)
def main():
"""主函数"""
engine = ProphetAnomalyEngine()
# 启动数据生成
gen_thread = threading.Thread(target=data_generator, args=(engine,), daemon=True)
gen_thread.start()
# 启动可视化
viz = ProphetVisualizer(engine)
ani = animation.FuncAnimation(viz.fig, viz.update, interval=500, blit=False)
plt.show()
if __name__ == '__main__':
main()
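上述MockProphet的季节性估计本质上是对去趋势序列做截断傅里叶级数拟合。以下独立示意(信号形式与采样点数为演示用假设值)验证同一系数公式可近似还原已知的季节信号:

```python
import numpy as np

def fit_fourier(y: np.ndarray, order: int = 3) -> np.ndarray:
    """与脚本同式:a_n = (2/N)Σ y·cos(nt),b_n = (2/N)Σ y·sin(nt)"""
    t = np.linspace(0, 2 * np.pi, len(y))
    coeffs = []
    for n in range(1, order + 1):
        coeffs.append(np.sum(y * np.cos(n * t)) * 2 / len(y))
        coeffs.append(np.sum(y * np.sin(n * t)) * 2 / len(y))
    return np.array(coeffs)

# 构造已知季节信号:3·cos(t) + 2·sin(2t)
t = np.linspace(0, 2 * np.pi, 1000)
y = 3 * np.cos(t) + 2 * np.sin(2 * t)
coeffs = fit_fourier(y)
# 系数按 [a1, b1, a2, b2, a3, b3] 排列,应近似 [3, 0, 0, 2, 0, 0]
print(np.round(coeffs, 1))
```

fourier_order决定能捕捉的最高谐波阶数:阶数过低会漏掉高频季节模式,过高则容易把噪声拟合进季节分量。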