NVSentinel: Main Functions and Their Components

NVSentinel's Five Core Functions and Their Components

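I. Full-Coverage Hardware and Cluster Monitoring

Detects faults at their source.

Function

The collectors, most of which run on every GPU node, connect to DCGM, system logs, NICs, cloud provider APIs, and Kubernetes resources. They collect GPU, NIC, and scheduling anomalies and normalize them into HealthEvent records. This layer is the data input side of the whole system.

Components

  1. gpu-health-monitor (DaemonSet): uses DCGM to collect GPU temperature, ECC, XID, and NVLink faults
  2. syslog-health-monitor (DaemonSet): watches the system journal for GPU driver crashes and PCI errors
  3. nic-health-monitor (DaemonSet): monitors hardware faults on NICs, RDMA, and high-speed interconnect links
  4. metadata-collector (DaemonSet): collects hardware topology metadata for GPUs, NICs, and NVSwitch
  5. csp-health-monitor (Deployment): polls public cloud APIs for planned VM maintenance events
  6. kubernetes-object-monitor (Deployment): watches changes to K8s Node, Pod, and CRD resources and generates cluster resource events
  7. preflight (init container template): runs a hardware self-check before a GPU Pod starts and reports an event on failure
  8. platform-connectors (DaemonSet, ingestion gateway): receives gRPC events from all collectors, validates and normalizes them, writes them to the database, and keeps the K8s Node Condition in sync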
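The normalization step described above can be sketched as follows. This is a minimal illustration, not the real NVSentinel schema: the `HealthEvent` field names, the `FATAL_XIDS` set, and the helper functions `from_xid` and `from_journal_line` are all assumptions made for the example. It only shows how two different sources (a DCGM-style XID reading and a journal line) can end up in one common event shape.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class HealthEvent:
    """Hypothetical, simplified event record (not the real schema)."""
    node_name: str
    agent: str        # which monitor produced the event
    component: str    # e.g. "GPU" or "NIC"
    check_name: str   # e.g. "XID", "ECC", "Temperature"
    is_fatal: bool
    message: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Assumed set of XID codes treated as fatal, for illustration only.
FATAL_XIDS = {48, 79}


def from_xid(node_name: str, gpu_index: int, xid: int) -> HealthEvent:
    """Normalize a raw XID observation into the common event shape."""
    return HealthEvent(
        node_name=node_name,
        agent="gpu-health-monitor",
        component="GPU",
        check_name="XID",
        is_fatal=xid in FATAL_XIDS,
        message=f"GPU {gpu_index} reported XID {xid}",
    )


def from_journal_line(node_name: str, line: str) -> Optional[HealthEvent]:
    """Return an event if the journal line looks like a driver XID report."""
    if "NVRM: Xid" not in line:
        return None
    return HealthEvent(
        node_name=node_name,
        agent="syslog-health-monitor",
        component="GPU",
        check_name="SyslogXID",
        is_fatal=False,  # severity would need parsing; kept simple here
        message=line.strip(),
    )
```

For example, `from_xid("gpu-node-1", 0, 79)` yields an event with `is_fatal=True`, while a journal line without the `NVRM: Xid` marker is ignored.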

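II. Storage and Event Bus

The data coordination foundation for the whole system.

Function

Persists all HealthEvent records and exposes Change Streams so that control-plane components can cooperate asynchronously while staying decoupled. It also stores fault state across the full lifecycle.

Components

  1. mongodb-store (StatefulSet, production default): primary event store; provides Change Streams subscriptions
  2. k8s-datastore (StatefulSet, lightweight alternative for testing): stores events in K8s CRDs; no Change Streams
  3. postgresql (StatefulSet, auxiliary storage): stores audit logs and analytics data
  4. incluster-file-server (Deployment): global configuration center that distributes monitoring thresholds, CEL quarantine rules, and fault-mapping templates (supports all monitoring and self-healing modules)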
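The decoupling that a change feed provides can be illustrated with a small in-memory sketch. It does not use MongoDB; with a real deployment the subscription would go through something like pymongo's `Collection.watch()`. The class `InMemoryEventStore`, the status values (`New`, `Quarantined`, `Drained`), and the two subscriber functions are hypothetical names chosen for the example. The point is that each component only reacts to stored state and never calls another component directly.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Handler = Callable[[dict], None]


class InMemoryEventStore:
    """Toy stand-in for an event store with change notifications."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}
        self._watchers: List[Handler] = []
        self._pending: Deque[dict] = deque()
        self._dispatching = False

    def watch(self, handler: Handler) -> None:
        """Register a subscriber that is called once per change."""
        self._watchers.append(handler)

    def get(self, event_id: str) -> dict:
        return dict(self._docs[event_id])

    def upsert(self, event_id: str, fields: dict) -> None:
        """Merge fields into the stored document and queue a change."""
        doc = {**self._docs.get(event_id, {}), **fields, "_id": event_id}
        self._docs[event_id] = doc
        self._pending.append(dict(doc))
        if self._dispatching:
            return  # the outer call delivers queued changes in order
        self._dispatching = True
        try:
            while self._pending:
                change = self._pending.popleft()
                for handler in self._watchers:
                    handler(change)
        finally:
            self._dispatching = False


def run_demo() -> List[str]:
    """Wire two subscribers together through the store only."""
    store = InMemoryEventStore()
    seen: List[str] = []

    def quarantine(change: dict) -> None:
        if change["status"] == "New":
            store.upsert(change["_id"], {"status": "Quarantined"})

    def drainer(change: dict) -> None:
        if change["status"] == "Quarantined":
            store.upsert(change["_id"], {"status": "Drained"})

    store.watch(lambda change: seen.append(change["status"]))
    store.watch(quarantine)
    store.watch(drainer)
    store.upsert("evt-1", {"node": "gpu-node-1", "status": "New"})
    return seen
```

Calling `run_demo()` returns the sequence of statuses each subscriber observed. Because the drainer only reacts to the stored status, it needs no knowledge of the quarantine step that produced it.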

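III. Fully Automated Fault Self-Healing Loop

Summary: the core function, a fault-handling pipeline.

Function

Reads fault events from the database and runs the complete automated flow: quarantine the node → gracefully evict workload Pods → create a hardware repair ticket → reboot or replace the server. Custom CEL rules and hybrid Slurm clusters are supported.

Components

  1. fault-quarantine (Deployment): CEL rule engine that decides whether to cordon the node and add a fault taint, and updates the event's quarantine status
  2. node-drainer (Deployment): gracefully evicts all workload Pods from the faulty node according to per-namespace policies; the companion slinky-drainer adapts it to hybrid K8s + Slurm clusters
  3. slurm-drain-monitor (Deployment): two-way sync between K8s and Slurm scheduling; automatically drains Slurm jobs from faulty nodes
  4. fault-remediation (Deployment): once eviction completes, creates a RebootNode or TerminateNode maintenance CR (the repair ticket) according to the fault's action
  5. janitor (Deployment): CR controller that watches repair tickets and issues hardware operations
  6. janitor-provider (Deployment): low-level hardware driver abstraction that calls IPMI, public cloud, and data-center hardware APIs to reboot nodes or replace cards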
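The quarantine → drain → remediate flow can be sketched in a few functions. This is an in-memory model under stated assumptions: rules are plain Python predicates standing in for CEL expressions, the taint key `hypothetical.example/gpu-fault` and the CR `spec` layout are invented for illustration, and nothing here talks to a real Kubernetes API. Only the CR kinds `RebootNode` and `TerminateNode` come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Set


@dataclass
class Node:
    name: str
    cordoned: bool = False
    taints: List[str] = field(default_factory=list)
    pods: Dict[str, str] = field(default_factory=dict)  # pod name -> namespace


# A rule is a plain predicate here; it stands in for a CEL expression.
Rule = Callable[[dict], bool]


def quarantine(node: Node, event: dict, rules: Iterable[Rule]) -> bool:
    """Cordon and taint the node if any rule matches the event."""
    if not any(rule(event) for rule in rules):
        return False
    node.cordoned = True
    node.taints.append("hypothetical.example/gpu-fault:NoSchedule")
    return True


def drain(node: Node, skip_namespaces: Set[str]) -> List[str]:
    """Evict every pod whose namespace is not exempt; return evicted names."""
    evicted = [p for p, ns in node.pods.items() if ns not in skip_namespaces]
    for pod in evicted:
        del node.pods[pod]
    return evicted


def remediate(node: Node, event: dict) -> dict:
    """Build a maintenance CR as a dict (field layout is hypothetical)."""
    kind = "TerminateNode" if event.get("action") == "replace" else "RebootNode"
    return {"kind": kind, "spec": {"nodeName": node.name}}


def handle(
    node: Node,
    event: dict,
    rules: Iterable[Rule],
    skip_namespaces: Iterable[str] = ("kube-system",),
) -> Optional[dict]:
    """Run the three stages in order; stop early if no rule matches."""
    if not quarantine(node, event, rules):
        return None
    drain(node, set(skip_namespaces))
    return remediate(node, event)
```

A usage example: with `rules = [lambda e: e.get("is_fatal", False)]`, a fatal event on a node running a training Pod cordons the node, evicts that Pod while leaving `kube-system` Pods in place, and returns a `RebootNode` CR. A non-fatal event leaves the node untouched.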

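IV. Event Analysis, Dashboards, and Alert Output

Summary: observability, post-incident review, and early warning of fleet-wide risk.

Function

Consumes all fault events read-only, aggregates them along multiple dimensions, identifies cascading failures, emits time-series metrics, and can push individual or aggregated alerts to external operations platforms.

Components

  1. health-events-analyzer (Deployment): fault aggregation and analysis, detection of batch and cascading failures, Prometheus metric generation, composite cluster-risk alerts, historical audit statistics
  2. event-exporter (Deployment): forwards individual raw fault events to webhooks or third-party alerting systems
  3. labeler (Deployment): automatically labels K8s nodes with GPU driver, DCGM, and hardware model information to help filtering during analysis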
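One simple form of the batch-failure detection mentioned above is to flag a check that fires on several distinct nodes within a short time window. The sketch below is an assumption about how such a rule could look, not the analyzer's actual algorithm; the function name `correlated_checks` and the tuple layout of `Event` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[float, str, str]  # (unix timestamp, node name, check name)


def correlated_checks(
    events: Iterable[Event], window_s: float, min_nodes: int
) -> Dict[str, Set[str]]:
    """Find checks seen on at least `min_nodes` distinct nodes in one window.

    For each check, the nodes of the first qualifying window are returned.
    """
    by_check: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for ts, node, check in events:
        by_check[check].append((ts, node))

    result: Dict[str, Set[str]] = {}
    for check, items in by_check.items():
        items.sort()
        start = 0
        for end in range(len(items)):
            # Shrink the window until it spans at most `window_s` seconds.
            while items[end][0] - items[start][0] > window_s:
                start += 1
            nodes = {node for _, node in items[start:end + 1]}
            if len(nodes) >= min_nodes:
                result[check] = nodes
                break
    return result
```

With a 60-second window and `min_nodes=3`, three XID events on three different nodes within 20 seconds are reported together, while two ECC events on different nodes more than an hour apart are not.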

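V. External Integration and Extension

Summary: integrates with third-party schedulers, operations platforms, and monitoring platforms.

Function

Connects heterogeneous scheduling clusters, external operations systems, and monitoring platforms to provide cross-platform fault synchronization and two-way coordination.

Components

  1. slurm-drain-monitor: two-way K8s ↔ Slurm scheduling sync
  2. csp-health-monitor: public cloud maintenance notifications
  3. event-exporter: external alerting platforms (DingTalk, WeCom, monitoring systems)
  4. janitor + janitor-provider: hardware operations through data-center DCIM, IPMI, and public cloud APIs
  5. preflight admission webhook: hardware validation before container scheduling
  6. slinky-drainer: eviction for hybrid clusters running the Slinky Slurm Operator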
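As an illustration of the alert-push integration, the sketch below builds (but does not send) an HTTP POST for a chat-robot style webhook. The event keys (`node`, `check`, `severity`), the function name `build_alert_request`, and the message layout are assumptions for the example. The `msgtype`/`text` body follows the text-message shape commonly accepted by chat-robot webhooks, so the target platform's documentation should be checked before relying on it.

```python
import json
import urllib.request


def build_alert_request(webhook_url: str, event: dict) -> urllib.request.Request:
    """Build a POST request carrying one fault event as a text message."""
    content = "[NVSentinel] {node}: {check} ({severity})".format(
        node=event.get("node", "unknown-node"),
        check=event.get("check", "unknown-check"),
        severity=event.get("severity", "unknown"),
    )
    body = json.dumps(
        {"msgtype": "text", "text": {"content": content}}
    ).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to deliver the alert. Keeping construction separate from sending makes the payload easy to inspect in tests.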

VI. Overall Component Architecture

Architecture diagram (layers and their components, with workload type and source path):

Health Detection Layer

  1. gpu-health-monitor (DaemonSet): health-monitors/gpu-health-monitor/
  2. syslog-health-monitor (DaemonSet): health-monitors/syslog-health-monitor/
  3. csp-health-monitor (Deployment): health-monitors/csp-health-monitor/
  4. kubernetes-object-monitor (Deployment): health-monitors/kubernetes-object-monitor/

Event Processing Core

  1. platform-connectors (DaemonSet): modules/platform-connectors/
  2. mongodb-store / postgresql (StatefulSet): distros/kubernetes/nvsentinel/charts/mongodb-store/

Analysis Layer

  1. health-events-analyzer (Deployment): modules/health-events-analyzer/

Response Automation Layer

  1. fault-quarantine (Deployment): modules/fault-quarantine/
  2. node-drainer (Deployment): modules/node-drainer/
  3. fault-remediation (Deployment): modules/fault-remediation/
  4. janitor (Deployment): modules/janitor/

Support Services

  1. labeler (Deployment): modules/labeler/
  2. metadata-collector (Deployment): modules/metadata-collector/
  3. log-collector (Job): modules/log-collector/

Kubernetes API

  1. Nodes (Cordon, Taint, Labels)
  2. Pods (Eviction)
  3. Custom Resources (RebootNode, TerminateNode)

Summary

  1. Collection layer: the various *-health-monitor components + platform-connectors
  2. Storage and configuration layer: mongodb-store / postgresql / incluster-file-server
  3. Self-healing layer: fault-quarantine → node-drainer → fault-remediation → janitor suite
  4. Observability and analysis layer: health-events-analyzer, event-exporter, labeler
  5. Heterogeneous integration layer: slurm-drain-monitor, csp-health-monitor, janitor-provider
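
| Component | Type | Language | Description | Default | Source |
|---|---|---|---|---|---|
| gpu-health-monitor | DaemonSet | Python | Monitors GPU hardware state through DCGM (XID errors, temperature, ECC, etc.) | Enabled | distros/kubernetes/nvsentinel/values.yaml:144 |
| syslog-health-monitor | DaemonSet | Go | Parses system logs through journalctl to detect hardware faults | Enabled | distros/kubernetes/nvsentinel/values.yaml:160 |
| csp-health-monitor | Deployment | Go | Polls cloud provider APIs for maintenance events | Disabled | distros/kubernetes/nvsentinel/values.yaml:158 |
| kubernetes-object-monitor | Deployment | Go | Monitors Kubernetes resources based on CEL policies | Disabled | distros/kubernetes/nvsentinel/values.yaml:172 |
| platform-connectors | DaemonSet | Go | gRPC service that receives health events and persists them to the data store | Enabled | distros/kubernetes/nvsentinel/values-tilt.yaml:131 |
| mongodb-store | StatefulSet | - | Persists health events and exposes them through change streams | Disabled (internal) | distros/kubernetes/nvsentinel/values.yaml:170 |
| health-events-analyzer | Deployment | Go | Detects event patterns using aggregation pipelines | Disabled | distros/kubernetes/nvsentinel/values.yaml:146 |
| fault-quarantine | Deployment | Go | Quarantines nodes using a CEL rule engine | Disabled | distros/kubernetes/nvsentinel/values.yaml:148 |
| node-drainer | Deployment | Go | Evicts Pods according to per-namespace policies | Disabled | distros/kubernetes/nvsentinel/values.yaml:150 |
| fault-remediation | Deployment | Go | Creates maintenance CRs from Go templates | Disabled | distros/kubernetes/nvsentinel/values.yaml:152 |
| janitor | Deployment | Go | Reboots or terminates nodes through cloud provider APIs | Disabled | distros/kubernetes/nvsentinel/values.yaml:154 |
| labeler | Deployment | Go | Automatically labels nodes with DCGM and driver version information | Enabled | distros/kubernetes/nvsentinel/values.yaml:162 |
| metadata-collector | Deployment | Go | Collects GPU topology information (PCI, UUID, etc.) | Enabled | distros/kubernetes/nvsentinel/values.yaml:164 |
| log-collector | Job | Python | Collects diagnostic logs when a fault occurs | Disabled | distros/kubernetes/nvsentinel/values-full.yaml:226 |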