NVSentinel: Five Core Capabilities and Their Components
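1. Full-coverage hardware and cluster monitoring
Detects faults at the source.
Description
Collectors run on every GPU node and integrate with DCGM, system logs, NICs, cloud provider APIs, and Kubernetes resources. They capture GPU, NIC, and scheduling anomalies and normalize them into a common HealthEvent format, which makes this layer the data entry point for the whole system.
Components
- gpu-health-monitor (DaemonSet): uses DCGM to collect GPU temperature, ECC, XID, and NVLink faults
- syslog-health-monitor (DaemonSet): watches the system journal for GPU driver crashes and PCI errors
- nic-health-monitor (DaemonSet): monitors hardware faults on NICs, RDMA, and high-speed interconnect links
- metadata-collector (DaemonSet): collects hardware topology metadata for GPUs, NICs, and NVSwitch
- csp-health-monitor (Deployment): polls public cloud APIs for scheduled VM maintenance events
- kubernetes-object-monitor (Deployment): watches changes to Kubernetes Node, Pod, and CRD resources and generates cluster resource events
- preflight (init container template): runs a hardware self-check before a GPU Pod starts and reports an event on failure
- platform-connectors (DaemonSet, ingestion gateway): receives gRPC events from all collectors, validates and normalizes them, writes them to the database, and updates the Kubernetes Node Condition to match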
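The validate-and-normalize step described for platform-connectors can be sketched as follows. This is a minimal illustration, not NVSentinel's actual schema: the `HealthEvent` field names, the `REQUIRED_KEYS` tuple, and the `normalize` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class HealthEvent:
    """Hypothetical normalized event; field names are illustrative only."""
    node: str
    source: str      # reporting monitor, e.g. "gpu-health-monitor"
    component: str   # e.g. "GPU" or "NIC"
    check: str       # e.g. "XID" or "ECC"
    is_fatal: bool
    message: str = ""
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Fields every raw payload must carry (an assumption for this sketch).
REQUIRED_KEYS = ("node", "source", "component", "check")


def normalize(raw: dict) -> HealthEvent:
    """Validate a raw monitor payload and convert it to a HealthEvent."""
    missing = [key for key in REQUIRED_KEYS if not raw.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return HealthEvent(
        node=raw["node"].strip().lower(),      # canonical node name
        source=raw["source"],
        component=raw["component"].upper(),    # canonical component tag
        check=raw["check"],
        is_fatal=bool(raw.get("is_fatal", False)),
        message=raw.get("message", ""),
    )
```

Rejecting incomplete payloads at the gateway means downstream consumers can rely on every stored event having the same shape, whichever monitor produced it.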
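2. Storage and event bus
The data coordination foundation for the whole system.
Description
Persists all HealthEvents in one place and exposes Change Streams so that control-plane components can cooperate asynchronously without direct coupling. It also stores the state of each fault across its full lifecycle.
Components
- mongodb-store (StatefulSet, production default): primary event store; provides Change Streams subscriptions
- k8s-datastore (StatefulSet, lightweight alternative for testing): stores events as Kubernetes CRDs; no Change Streams
- postgresql (StatefulSet, auxiliary store): stores audit logs and analytics data
- incluster-file-server (Deployment): global configuration center; distributes monitoring thresholds, CEL quarantine rules, and fault-mapping templates (used by all monitoring and self-healing modules)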
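To illustrate how a change feed lets components cooperate without calling each other, here is a small in-memory sketch. It is not MongoDB and not NVSentinel code: `EventStore`, its `subscribe` and `upsert` methods, and the `quarantine` status field are hypothetical, and callbacks run synchronously for brevity. Against a real MongoDB deployment, a consumer would typically iterate over pymongo's `Collection.watch()` instead.

```python
class EventStore:
    """In-memory stand-in for a store with a change feed (hypothetical)."""

    def __init__(self):
        self._events = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callback that receives every change notification."""
        self._subscribers.append(callback)

    def upsert(self, event_id, fields):
        """Insert or update a document, then notify all subscribers."""
        doc = self._events.setdefault(event_id, {"_id": event_id})
        doc.update(fields)
        change = {
            "_id": event_id,
            "updated": dict(fields),    # only the fields touched by this write
            "document": dict(doc),      # snapshot of the full document
        }
        for callback in self._subscribers:
            callback(change)
        return doc


store = EventStore()
actions = []


def quarantine_consumer(change):
    # React only to newly reported faults, then record the decision.
    if "check" in change["updated"]:
        store.upsert(change["_id"], {"quarantine": "Quarantined"})


def drain_consumer(change):
    # React only once the quarantine status has been written.
    if change["updated"].get("quarantine") == "Quarantined":
        actions.append(("drain", change["document"]["node"]))


store.subscribe(quarantine_consumer)
store.subscribe(drain_consumer)
store.upsert("e1", {"node": "node-1", "check": "XID"})
print(actions)
```

Neither consumer knows the other exists; each one only watches for the status change it cares about and writes its own result back to the shared document.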
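3. Fully automated fault remediation loop
Summary: the central capability, a fault-handling pipeline.
Description
Reads fault events from the database and runs the complete automated flow "quarantine the node → gracefully evict workload Pods → create a hardware repair request → reboot or replace the server". Supports custom CEL rules and hybrid Slurm clusters.
Components
- fault-quarantine (Deployment): CEL rule engine; decides whether to cordon the node and add a fault taint, and updates the event's quarantine status
- node-drainer (Deployment): gracefully evicts all workload Pods from the faulty node according to per-namespace policies; the companion slinky-drainer adapts this to hybrid K8s + Slurm clusters
- slurm-drain-monitor (Deployment): two-way sync between K8s and Slurm scheduling; automatically drains Slurm jobs from faulty nodes
- fault-remediation (Deployment): once draining completes, creates a RebootNode or TerminateNode maintenance CR according to the fault's action
- janitor (Deployment): CR controller; watches the maintenance CRs and issues the hardware operations
- janitor-provider (Deployment): low-level hardware driver abstraction; calls IPMI, public cloud, and data-center hardware APIs to reboot nodes or replace cards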
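The quarantine → drain → remediate flow can be read as a linear state machine with a gate at the first step. The sketch below assumes that reading; the `Stage` names, the `action` field, and its `"replace"` value are invented for illustration, and a plain Python predicate stands in for the CEL rule that fault-quarantine would evaluate.

```python
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    DETECTED = "detected"
    QUARANTINED = "quarantined"   # fault-quarantine: cordon and taint
    DRAINED = "drained"           # node-drainer: workload Pods evicted
    CR_CREATED = "cr_created"     # fault-remediation: maintenance CR written
    REMEDIATED = "remediated"     # janitor: reboot or terminate carried out


# Each stage has exactly one successor; REMEDIATED is terminal.
NEXT_STAGE: Dict[Stage, Stage] = {
    Stage.DETECTED: Stage.QUARANTINED,
    Stage.QUARANTINED: Stage.DRAINED,
    Stage.DRAINED: Stage.CR_CREATED,
    Stage.CR_CREATED: Stage.REMEDIATED,
}


def run_pipeline(
    event: dict, should_quarantine: Callable[[dict], bool]
) -> List[Stage]:
    """Return the stages an event passes through.

    `should_quarantine` is a stand-in for the quarantine rule; if it
    rejects the event, nothing beyond detection happens.
    """
    stages = [Stage.DETECTED]
    if not should_quarantine(event):
        return stages
    while stages[-1] in NEXT_STAGE:
        stages.append(NEXT_STAGE[stages[-1]])
    return stages


def maintenance_kind(event: dict) -> str:
    """Pick the CR kind; the 'action' field and its values are assumptions."""
    return "TerminateNode" if event.get("action") == "replace" else "RebootNode"


def fatal_only(event: dict) -> bool:
    return bool(event.get("is_fatal", False))


print([s.value for s in run_pipeline({"is_fatal": True}, fatal_only)])
print(maintenance_kind({"action": "replace"}))
```

Keeping the gate separate from the transitions mirrors the division of labor in the list above: one component decides whether to act, and the later ones only advance events that have already passed that decision.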
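4. Event analysis, dashboards, and alert output
Summary: observation, post-incident review, and early warning of fleet-wide risk.
Description
Consumes all fault events read-only, performs multi-dimensional aggregation, identifies cascading failures, and emits time-series metrics. It can also push individual or aggregated alerts to external operations platforms.
Components
- health-events-analyzer (Deployment): fault aggregation and analysis, detection of batch and cascading failures, Prometheus metric generation, composite cluster-risk alerts, and historical audit statistics
- event-exporter (Deployment): forwards individual raw fault events unchanged to webhooks or third-party alerting systems
- labeler (Deployment): automatically labels Kubernetes nodes with GPU driver, DCGM, and hardware model information to support filtering during analysis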
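One way to picture batch-failure detection is a sliding time window that counts how many distinct nodes report the same check. The sketch below is an assumption about the general technique, not the analyzer's actual rules: `find_bursts`, the `(timestamp, node, check)` tuple layout, and the default thresholds are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def find_bursts(
    events: Iterable[Tuple[float, str, str]],
    window_s: float = 300,
    min_nodes: int = 3,
) -> Dict[str, int]:
    """Find checks reported by many distinct nodes within a short window.

    `events` holds (timestamp_seconds, node, check) tuples. The result maps
    each check that reached `min_nodes` distinct nodes inside any window of
    `window_s` seconds to the largest such node count.
    """
    by_check = defaultdict(list)
    for ts, node, check in events:
        by_check[check].append((ts, node))

    bursts: Dict[str, int] = {}
    for check, items in by_check.items():
        items.sort()
        start = 0
        for end in range(len(items)):
            # Shrink the window from the left until it spans <= window_s.
            while items[end][0] - items[start][0] > window_s:
                start += 1
            nodes = {node for _, node in items[start:end + 1]}
            if len(nodes) >= min_nodes:
                bursts[check] = max(bursts.get(check, 0), len(nodes))
    return bursts


sample = [
    (0, "n1", "XID"), (60, "n2", "XID"), (120, "n3", "XID"),
    (1000, "n4", "XID"),
    (10, "n1", "ECC"), (20, "n1", "ECC"),
]
print(find_bursts(sample))
```

Counting distinct nodes rather than raw events keeps one noisy node from looking like a fleet-wide problem: the repeated ECC reports from `n1` in the sample do not qualify as a burst.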
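5. External integration and extensibility
Summary: connects to third-party scheduling, operations, and monitoring platforms.
Description
Links heterogeneous scheduling clusters, external operations systems, and monitoring platforms so that faults are synchronized across platforms and handled in both directions.
Components
- slurm-drain-monitor: two-way K8s ↔ Slurm scheduling sync
- csp-health-monitor: public cloud maintenance notifications
- event-exporter: external alerting platforms (DingTalk, WeCom, monitoring systems)
- janitor + janitor-provider: hardware operations through data-center DCIM, IPMI, and public cloud APIs
- preflight admission webhook: hardware validation before a container is scheduled
- slinky-drainer: eviction in hybrid clusters managed by the Slinky Slurm Operator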
6. Overall component architecture
[Architecture diagram. Layers: Health Detection Layer, Event Processing Core, Analysis Layer, Response Automation Layer, Support Services, and the Kubernetes API.]
Kubernetes API objects acted on:
- Nodes (Cordon, Taint, Labels)
- Pods (Eviction)
- Custom Resources (RebootNode, TerminateNode)
Components shown, with workload type and source path:
- labeler (Deployment): modules/labeler/
- metadata-collector (Deployment): modules/metadata-collector/
- log-collector (Job): modules/log-collector/
- fault-quarantine (Deployment): modules/fault-quarantine/
- node-drainer (Deployment): modules/node-drainer/
- fault-remediation (Deployment): modules/fault-remediation/
- janitor (Deployment): modules/janitor/
- health-events-analyzer (Deployment): modules/health-events-analyzer/
- platform-connectors (DaemonSet): modules/platform-connectors/
- mongodb-store / postgresql (StatefulSet): distros/kubernetes/nvsentinel/charts/mongodb-store/
- gpu-health-monitor (DaemonSet): health-monitors/gpu-health-monitor/
- syslog-health-monitor (DaemonSet): health-monitors/syslog-health-monitor/
- csp-health-monitor (Deployment): health-monitors/csp-health-monitor/
- kubernetes-object-monitor (Deployment): health-monitors/kubernetes-object-monitor/
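Summary
- Collection layer: the various *-health-monitor components plus platform-connectors
- Storage and configuration layer: mongodb-store / postgresql / incluster-file-server
- Remediation layer: fault-quarantine → node-drainer → fault-remediation → the janitor suite
- Observation and analysis layer: health-events-analyzer, event-exporter, labeler
- Heterogeneous integration layer: slurm-drain-monitor, csp-health-monitor, janitor-provider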
| Component | Type | Language | Description | Default state |
|---|---|---|---|---|
| gpu-health-monitor | DaemonSet | Python | Monitors GPU hardware state through DCGM (XID errors, temperature, ECC, etc.) | Enabled (distros/kubernetes/nvsentinel/values.yaml:144) |
| syslog-health-monitor | DaemonSet | Go | Parses system logs through journalctl to detect hardware faults | Enabled (distros/kubernetes/nvsentinel/values.yaml:160) |
| csp-health-monitor | Deployment | Go | Polls cloud provider APIs for maintenance events | Disabled (distros/kubernetes/nvsentinel/values.yaml:158) |
| kubernetes-object-monitor | Deployment | Go | Monitors Kubernetes resources using CEL policies | Disabled (distros/kubernetes/nvsentinel/values.yaml:172) |
| platform-connectors | DaemonSet | Go | gRPC service that receives health events and persists them to the data store | Enabled (distros/kubernetes/nvsentinel/values-tilt.yaml:131) |
| mongodb-store | StatefulSet | --- | Persists health events and exposes them through change streams | Disabled (internal) (distros/kubernetes/nvsentinel/values.yaml:170) |
| health-events-analyzer | Deployment | Go | Detects event patterns using aggregation pipelines | Disabled (distros/kubernetes/nvsentinel/values.yaml:146) |
| fault-quarantine | Deployment | Go | Quarantines nodes using a CEL rule engine | Disabled (distros/kubernetes/nvsentinel/values.yaml:148) |
| node-drainer | Deployment | Go | Evicts Pods according to per-namespace policies | Disabled (distros/kubernetes/nvsentinel/values.yaml:150) |
| fault-remediation | Deployment | Go | Creates maintenance CRs from Go templates | Disabled (distros/kubernetes/nvsentinel/values.yaml:152) |
| janitor | Deployment | Go | Reboots or terminates nodes through cloud provider APIs | Disabled (distros/kubernetes/nvsentinel/values.yaml:154) |
| labeler | Deployment | Go | Automatically labels nodes with DCGM and driver version information | Enabled (distros/kubernetes/nvsentinel/values.yaml:162) |
| metadata-collector | Deployment | Go | Collects GPU topology information (PCI, UUID, etc.) | Enabled (distros/kubernetes/nvsentinel/values.yaml:164) |
| log-collector | Job | Python | Collects diagnostic logs when a fault occurs | Disabled (distros/kubernetes/nvsentinel/values-full.yaml:226) |