
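Judging from the command output, these are the components that manage NVIDIA GPUs in the Kubernetes cluster. They are deployed as part of the NVIDIA GPU Operator and together automate GPU provisioning, scheduling, and monitoring. Each component is described below: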
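- gpu-feature-discovery
  - Role: GPU feature discovery. It detects the properties of the GPUs on a node (model, memory size, supported features, and so on) and publishes them as node labels such as nvidia.com/gpu.product and nvidia.com/gpu.memory, so that workloads can be steered to suitable GPUs with node selectors or affinity rules.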
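- gpu-operator-node-feature-discovery
  - Consists of master and worker components. These are Node Feature Discovery (NFD) component names, not node roles.
  - Role: node feature discovery. The worker runs on each node and detects its hardware and software features (not only GPUs); the master publishes them as feature.node.kubernetes.io/... node labels that Pod scheduling can rely on.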
- gpu-operator
  - Role: the core controller of the GPU Operator. It deploys, updates, and manages the lifecycle of all the other GPU components, keeping the whole GPU management stack running.
- nvidia-container-toolkit-daemonset
  - Role: installs the NVIDIA Container Toolkit on the node and configures the container runtime so that containers can access the host's GPU devices and driver libraries.
- nvidia-cuda-validator
  - Role: CUDA validation. It checks that the node's CUDA environment works and that the GPU can run CUDA applications, then exits, which is why its status is Completed.
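- nvidia-dcgm-exporter
  - Role: the exporter for NVIDIA Data Center GPU Manager (DCGM). It collects GPU metrics (utilization, temperature, memory usage, and so on) and exposes them to monitoring systems such as Prometheus.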
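- nvidia-device-plugin-daemonset
  - Role: the NVIDIA device plugin, an implementation of the Kubernetes device plugin framework. It registers the node's GPUs with the kubelet, which then advertises them as the nvidia.com/gpu resource in the node's capacity and allocatable fields so that Kubernetes can track and allocate GPUs.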
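- nvidia-operator-validator
  - Role: GPU Operator validation. It checks that the GPU components on the node (driver, container toolkit, CUDA, device plugin) are correctly configured and working. Its pod stays in the Running state once the checks have passed.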
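A quick way to check the components listed above is to scan the pod list for anything that is neither Running nor Completed. The following is a minimal sketch, not part of the GPU Operator itself: `unhealthy_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` table layout in which STATUS is the third column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods` table output on stdin and
# prints the name and status of every pod that is neither Running nor
# Completed. Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  # NR > 1 skips the header row; $3 is the STATUS column.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

It could be used as `kubectl get pods -n gpu-operator | unhealthy_pods`, where the `gpu-operator` namespace is an assumption about how the operator was installed. Empty output means every pod is in one of the two expected states.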
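Together these components form a complete GPU management solution for Kubernetes: GPUs can be scheduled and managed automatically, much like CPU and memory, which simplifies deploying and running GPU applications in containers.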
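To illustrate what scheduling GPUs like CPU and memory looks like in practice, the sketch below generates a Pod manifest that requests GPUs through the nvidia.com/gpu resource. The function name `gpu_pod_manifest`, the pod name, and the image argument are hypothetical placeholders. The nodeSelector on nvidia.com/gpu.present is optional and assumes GPU nodes carry that label.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a Pod manifest that requests GPUs.
# Arguments: pod name, container image (placeholder, supply your own CUDA
# image), and the number of GPUs to request (defaults to 1).
gpu_pod_manifest() {
  local name="$1" image="$2" gpus="${3:-1}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
  - name: cuda
    image: ${image}
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: ${gpus}
EOF
}
```

For example, `gpu_pod_manifest gpu-smoke-test "$CUDA_IMAGE" 1 | kubectl apply -f -` would submit the Pod, with `CUDA_IMAGE` set to an image that contains `nvidia-smi`. GPUs are requested under `limits` only, and always in whole numbers.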
Reference: the full node object (k is presumably a shell alias for kubectl)
bb@su2070:~$ k get node su2070 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    hami.io/node-handshake: Deleted_2025-08-30 12:56:11
    hami.io/node-handshake-dcu: Deleted_2025-08-30 12:56:11
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
    nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CMPXCHG8,cpu-cpuid.FLUSH_L1D,cpu-cpuid.FMA3,cpu-cpuid.FXSR,cpu-cpuid.FXSROPT,cpu-cpuid.IA32_ARCH_CAP,cpu-cpuid.IBPB,cpu-cpuid.LAHF,cpu-cpuid.MD_CLEAR,cpu-cpuid.MOVBE,cpu-cpuid.MPX,cpu-cpuid.OSXSAVE,cpu-cpuid.SPEC_CTRL_SSBD,cpu-cpuid.SRBDS_CTRL,cpu-cpuid.STIBP,cpu-cpuid.SYSCALL,cpu-cpuid.SYSEE,cpu-cpuid.VMX,cpu-cpuid.X87,cpu-cpuid.XGETBV1,cpu-cpuid.XSAVE,cpu-cpuid.XSAVEC,cpu-cpuid.XSAVEOPT,cpu-cpuid.XSAVES,cpu-cstate.enabled,cpu-hardware_multithreading,cpu-model.family,cpu-model.id,cpu-model.vendor_id,cpu-pstate.scaling_governor,cpu-pstate.status,cpu-pstate.turbo,kernel-config.NO_HZ,kernel-config.NO_HZ_FULL,kernel-version.full,kernel-version.major,kernel-version.minor,kernel-version.revision,nvidia.com/cuda.driver-version.full,nvidia.com/cuda.driver-version.major,nvidia.com/cuda.driver-version.minor,nvidia.com/cuda.driver-version.revision,nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.driver.rev,nvidia.com/cuda.runtime-version.full,nvidia.com/cuda.runtime-version.major,nvidia.com/cuda.runtime-version.minor,nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor,nvidia.com/gfd.timestamp,nvidia.com/gpu.compute.major,nvidia.com/gpu.compute.minor,nvidia.com/gpu.count,nvidia.com/gpu.family,nvidia.com/gpu.machine,nvidia.com/gpu.memory,nvidia.com/gpu.mode,nvidia.com/gpu.product,nvidia.com/gpu.replicas,nvidia.com/gpu.sharing-strategy,nvidia.com/mig.capable,nvidia.com/mig.strategy,nvidia.com/mps.capable,nvidia.com/vgpu.present,pci-10de.present,pci-10ec.present,pci-8086.present,storage-nonrotationaldisk,system-os_release.ID,system-os_release.VERSION_ID,system-os_release.VERSION_ID.major,system-os_release.VERSION_ID.minor
    node.alpha.kubernetes.io/ttl: "0"
    nvidia.com/gpu-driver-upgrade-enabled: "true"
    ovn.kubernetes.io/allocated: "true"
    ovn.kubernetes.io/chassis: 0f78ab1d-e291-45a1-83f6-094b855af693
    ovn.kubernetes.io/cidr: 100.64.0.0/16
    ovn.kubernetes.io/gateway: 100.64.0.1
    ovn.kubernetes.io/ip_address: 100.64.0.2
    ovn.kubernetes.io/logical_switch: join
    ovn.kubernetes.io/mac_address: 16:03:a6:54:88:e6
    ovn.kubernetes.io/port_name: node-su2070
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2025-08-30T11:34:09Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    feature.node.kubernetes.io/cpu-cpuid.ADX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: "true"
    feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D: "true"
    feature.node.kubernetes.io/cpu-cpuid.FMA3: "true"
    feature.node.kubernetes.io/cpu-cpuid.FXSR: "true"
    feature.node.kubernetes.io/cpu-cpuid.FXSROPT: "true"
    feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP: "true"
    feature.node.kubernetes.io/cpu-cpuid.IBPB: "true"
    feature.node.kubernetes.io/cpu-cpuid.LAHF: "true"
    feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR: "true"
    feature.node.kubernetes.io/cpu-cpuid.MOVBE: "true"
    feature.node.kubernetes.io/cpu-cpuid.MPX: "true"
    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: "true"
    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD: "true"
    feature.node.kubernetes.io/cpu-cpuid.SRBDS_CTRL: "true"
    feature.node.kubernetes.io/cpu-cpuid.STIBP: "true"
    feature.node.kubernetes.io/cpu-cpuid.SYSCALL: "true"
    feature.node.kubernetes.io/cpu-cpuid.SYSEE: "true"
    feature.node.kubernetes.io/cpu-cpuid.VMX: "true"
    feature.node.kubernetes.io/cpu-cpuid.X87: "true"
    feature.node.kubernetes.io/cpu-cpuid.XGETBV1: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVE: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVEC: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVES: "true"
    feature.node.kubernetes.io/cpu-cstate.enabled: "true"
    feature.node.kubernetes.io/cpu-hardware_multithreading: "true"
    feature.node.kubernetes.io/cpu-model.family: "6"
    feature.node.kubernetes.io/cpu-model.id: "165"
    feature.node.kubernetes.io/cpu-model.vendor_id: Intel
    feature.node.kubernetes.io/cpu-pstate.scaling_governor: powersave
    feature.node.kubernetes.io/cpu-pstate.status: active
    feature.node.kubernetes.io/cpu-pstate.turbo: "true"
    feature.node.kubernetes.io/kernel-config.NO_HZ: "true"
    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL: "true"
    feature.node.kubernetes.io/kernel-version.full: 6.14.0-29-generic
    feature.node.kubernetes.io/kernel-version.major: "6"
    feature.node.kubernetes.io/kernel-version.minor: "14"
    feature.node.kubernetes.io/kernel-version.revision: "0"
    feature.node.kubernetes.io/pci-10de.present: "true"
    feature.node.kubernetes.io/pci-10ec.present: "true"
    feature.node.kubernetes.io/pci-8086.present: "true"
    feature.node.kubernetes.io/storage-nonrotationaldisk: "true"
    feature.node.kubernetes.io/system-os_release.ID: ubuntu
    feature.node.kubernetes.io/system-os_release.VERSION_ID: "24.04"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "24"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "04"
    gpu: "on"
    kube-ovn/role: master
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: su2070
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/exclude-from-external-load-balancers: ""
    nvidia.com/cuda.driver-version.full: 575.64.03
    nvidia.com/cuda.driver-version.major: "575"
    nvidia.com/cuda.driver-version.minor: "64"
    nvidia.com/cuda.driver-version.revision: "03"
    nvidia.com/cuda.driver.major: "575"
    nvidia.com/cuda.driver.minor: "64"
    nvidia.com/cuda.driver.rev: "03"
    nvidia.com/cuda.runtime-version.full: "12.9"
    nvidia.com/cuda.runtime-version.major: "12"
    nvidia.com/cuda.runtime-version.minor: "9"
    nvidia.com/cuda.runtime.major: "12"
    nvidia.com/cuda.runtime.minor: "9"
    nvidia.com/gfd.timestamp: "1756568541"
    nvidia.com/gpu-driver-upgrade-state: upgrade-done
    nvidia.com/gpu.compute.major: "7"
    nvidia.com/gpu.compute.minor: "5"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
    nvidia.com/gpu.deploy.device-plugin: "true"
    nvidia.com/gpu.deploy.driver: pre-installed
    nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
    nvidia.com/gpu.deploy.node-status-exporter: "true"
    nvidia.com/gpu.deploy.operator-validator: "true"
    nvidia.com/gpu.family: turing
    nvidia.com/gpu.machine: OMEN-25L-Desktop-GT11-0xxx
    nvidia.com/gpu.memory: "8192"
    nvidia.com/gpu.mode: graphics
    nvidia.com/gpu.present: "true"
    nvidia.com/gpu.product: NVIDIA-GeForce-RTX-2070-SUPER
    nvidia.com/gpu.replicas: "1"
    nvidia.com/gpu.sharing-strategy: none
    nvidia.com/mig.capable: "false"
    nvidia.com/mig.strategy: single
    nvidia.com/mps.capable: "false"
    nvidia.com/vgpu.present: "false"
  name: su2070
  resourceVersion: "139183"
  uid: 182f27a0-da15-432a-84d1-d39703c4fb85
spec:
  podCIDR: 10.16.0.0/24
  podCIDRs:
  - 10.16.0.0/24
status:
  addresses:
  - address: 192.168.0.101
    type: InternalIP
  - address: su2070
    type: Hostname
  allocatable:
    cpu: 15600m
    ephemeral-storage: 490048472Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: "31308977128"
    nvidia.com/gpu: "1"
    pods: "110"
  capacity:
    cpu: "16"
    ephemeral-storage: 490048472Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32723340Ki
    nvidia.com/gpu: "1"
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2025-08-31T01:02:44Z"
    lastTransitionTime: "2025-08-31T01:02:44Z"
    message: ping check to gateway ip 100.64.0.1 succeeded
    reason: JoinSubnetGatewayReachable
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T11:34:08Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T11:34:08Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T11:34:08Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T12:37:29Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:c525320fd1e771b911b68f8e760b83e8fccf1beea43bf9b009c4f0c591e193ea
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.8.0
    sizeBytes: 238542288
  - names:
    - docker.io/kubeovn/kube-ovn:v1.15.0
    sizeBytes: 227388842
  - names:
    - nvcr.io/nvidia/gpu-operator@sha256:f628afdfca94c50730960b21dda2567a53b61a161bac232bf8dd0b72fec9213f
    - nvcr.io/nvidia/gpu-operator:v25.3.1
    sizeBytes: 225897475
  - names:
    - nvcr.io/nvidia/k8s-device-plugin@sha256:037160a36de0f060fc21cc0cb2f795d980282ff1471b55530433ca4350b24c4f
    - nvcr.io/nvidia/k8s-device-plugin:v0.17.2
    sizeBytes: 203083899
  - names:
    - nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:0b6f1944b05254ce50a08d44ca0d23a40f254fb448255a9234c43dec44e6929c
    - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1
    sizeBytes: 188844071
  - names:
    - nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
    - nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
    sizeBytes: 153613868
  - names:
    - nvcr.io/nvidia/k8s/dcgm-exporter@sha256:76281f16b132e0f6ddb16a1b8adfe523ea3d5ce290f623f4734d50906c8770a7
    - nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04
    sizeBytes: 142361073
  - names:
    - docker.io/projecthami/hami:v2.6.1
    sizeBytes: 134779787
  - names:
    - registry.k8s.io/nfd/node-feature-discovery:v0.17.3
    sizeBytes: 80718055
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-apiserver@sha256:bc5c88d316db8b27ff203273d67ec8a962ef346278d9bbae400473efc1674725
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-apiserver:v1.29.5
    sizeBytes: 35233975
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-controller-manager@sha256:a9a64e67b66ea6fb43f976f65d8a0cadd68b0ed5ed2311d2fc4bf887403ecf8a
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-controller-manager:v1.29.5
    sizeBytes: 33591210
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/k8s-dns-node-cache@sha256:d2504aceb7db88ce24779a750522b22eeda061868376fcf3dea19e3d1cff911e
    - registry.cn-beijing.aliyuncs.com/kubesphereio/k8s-dns-node-cache:1.22.20
    sizeBytes: 30467856
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-proxy@sha256:4c9681a68b0f068f66e6c4120be71a4416621cad1427802deaaa79d01fdffb85
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-proxy:v1.29.5
    sizeBytes: 28408608
  - names:
    - docker.io/liangjw/kube-webhook-certgen:v1.1.1
    sizeBytes: 18899182
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-scheduler@sha256:5e729dc015466f486fdeed22200d86108ffac26ea6e5abf3258c0502a637d3a7
    - registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler@sha256:3ae97633e22c6f9319423018500ae6348548d8c0b1575ba1bd4090cb50e8c5b7
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-scheduler:v1.29.5
    - registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.29.5
    sizeBytes: 18673620
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/coredns@sha256:8e352a029d304ca7431c6507b56800636c321cb52289686a581ab70aaa8a2e2a
    - registry.cn-beijing.aliyuncs.com/kubesphereio/coredns:1.9.3
    sizeBytes: 14837849
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/pause@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
    - registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.9
    sizeBytes: 321520
  nodeInfo:
    architecture: amd64
    bootID: 6d4581fd-56e9-44fe-b7e1-db26694163d3
    containerRuntimeVersion: containerd://1.7.13
    kernelVersion: 6.14.0-29-generic
    kubeProxyVersion: v1.29.5
    kubeletVersion: v1.29.5
    machineID: abc2ec7c46ef482cb42e45dce12f12b8
    operatingSystem: linux
    osImage: Ubuntu 24.04.3 LTS
    systemUUID: 6557d507-a023-11ea-a942-3c18a010fee8
bb@su2070:~$
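The node object is long, and most of the GPU facts sit in a handful of nvidia.com/ labels. The sketch below filters those out of the YAML. `gpu_summary` is a hypothetical helper name, and the choice of labels (product, memory, count, family, driver and CUDA runtime versions) is an assumption about what is most useful to see at a glance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get node <name> -o yaml` on stdin and
# prints only the headline GPU labels, with indentation and quotes removed.
gpu_summary() {
  grep -E '^[[:space:]]*nvidia\.com/(gpu\.(product|memory|count|family)|cuda\.(driver-version|runtime-version)\.full):' \
    | sed -E 's/^[[:space:]]+//; s/"//g'
}
```

For the node above this would be run as `kubectl get node su2070 -o yaml | gpu_summary`. A single value can also be read directly with a JSONPath expression, escaping the dots in the key, for example `kubectl get node su2070 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'`.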