Analyzing the gpu-operator components

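Judging from the command output, these are the components in the Kubernetes cluster that manage NVIDIA GPUs, and all of them belong to the NVIDIA GPU Operator. Together they automate the management, scheduling, and monitoring of GPU resources in Kubernetes. Each component is described below: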

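  1. gpu-feature-discovery
    • Role: GPU feature discovery. It detects the details of the GPUs on a node (model, memory size, supported features, and so on) and exposes them to Kubernetes as node labels, so the scheduler can place GPU workloads more precisely.
  2. gpu-operator-node-feature-discovery
    • Consists of a master component and a worker component.
    • Role: node feature discovery. It detects the hardware and software features of each node (not only GPUs) and publishes them to Kubernetes as node labels, which Pod scheduling can then rely on.
  3. gpu-operator
    • Role: the core controller of the GPU Operator. It deploys, updates, and manages the lifecycle of all the other GPU components, keeping the whole GPU management stack running.
  4. nvidia-container-toolkit-daemonset
    • Role: the NVIDIA Container Toolkit. It provides GPU support inside containers, including the runtime integration and the mapping of driver files, so containers can access the host's GPUs correctly.
  5. nvidia-cuda-validator
    • Role: CUDA validation. It checks that the CUDA environment on the node works and exits when done (status Completed), confirming that the GPU can run CUDA applications.
  6. nvidia-dcgm-exporter
    • Role: the exporter for NVIDIA Data Center GPU Manager (DCGM). It collects GPU metrics (utilization, temperature, memory usage, and so on) and exposes them to monitoring systems such as Prometheus.
  7. nvidia-device-plugin-daemonset
    • Role: the NVIDIA device plugin, an implementation of the Kubernetes device plugin framework. It registers the node's GPUs with the kubelet, which advertises them to the API server as the nvidia.com/gpu resource, so Kubernetes can recognize and allocate GPUs.
  8. nvidia-operator-validator
    • Role: the GPU Operator validator. It checks in sequence that the driver, container toolkit, CUDA, and device plugin on the node are working, confirming that the GPU stack is configured correctly and functional.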

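Together these components form a complete GPU management solution for Kubernetes. GPU resources can be scheduled and managed automatically, just like CPU and memory, which simplifies deploying and running GPU applications in containers.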
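
To make the "scheduled like CPU and memory" point concrete, here is a minimal sketch of a Pod that asks for a GPU through the `nvidia.com/gpu` resource advertised by the device plugin, optionally pinned to a GPU model through the `nvidia.com/gpu.product` label set by gpu-feature-discovery. The helper name `build_gpu_pod`, the Pod name, and the image `example.com/cuda-app:latest` are hypothetical placeholders, not taken from the original post, and the `nvidia-smi` command assumes that binary is available inside the container.

```python
import json
from typing import Optional


def build_gpu_pod(name: str, image: str, gpu_count: int = 1,
                  gpu_product: Optional[str] = None) -> dict:
    """Build a Pod manifest (as a dict) that requests GPUs via nvidia.com/gpu."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,  # hypothetical image, replace with a real one
                # Assumes nvidia-smi is available inside the container.
                "command": ["nvidia-smi"],
                # GPUs are requested under limits, like any extended resource.
                "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
            }],
        },
    }
    if gpu_product is not None:
        # Only schedule onto nodes labeled with this GPU model.
        pod["spec"]["nodeSelector"] = {"nvidia.com/gpu.product": gpu_product}
    return pod


if __name__ == "__main__":
    manifest = build_gpu_pod(
        "gpu-test",
        "example.com/cuda-app:latest",
        gpu_count=1,
        gpu_product="NVIDIA-GeForce-RTX-2070-SUPER",
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.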

Reference: node information at a glance

bb@su2070:~$ k get node su2070 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    hami.io/node-handshake: Deleted_2025-08-30 12:56:11
    hami.io/node-handshake-dcu: Deleted_2025-08-30 12:56:11
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
    nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CMPXCHG8,cpu-cpuid.FLUSH_L1D,cpu-cpuid.FMA3,cpu-cpuid.FXSR,cpu-cpuid.FXSROPT,cpu-cpuid.IA32_ARCH_CAP,cpu-cpuid.IBPB,cpu-cpuid.LAHF,cpu-cpuid.MD_CLEAR,cpu-cpuid.MOVBE,cpu-cpuid.MPX,cpu-cpuid.OSXSAVE,cpu-cpuid.SPEC_CTRL_SSBD,cpu-cpuid.SRBDS_CTRL,cpu-cpuid.STIBP,cpu-cpuid.SYSCALL,cpu-cpuid.SYSEE,cpu-cpuid.VMX,cpu-cpuid.X87,cpu-cpuid.XGETBV1,cpu-cpuid.XSAVE,cpu-cpuid.XSAVEC,cpu-cpuid.XSAVEOPT,cpu-cpuid.XSAVES,cpu-cstate.enabled,cpu-hardware_multithreading,cpu-model.family,cpu-model.id,cpu-model.vendor_id,cpu-pstate.scaling_governor,cpu-pstate.status,cpu-pstate.turbo,kernel-config.NO_HZ,kernel-config.NO_HZ_FULL,kernel-version.full,kernel-version.major,kernel-version.minor,kernel-version.revision,nvidia.com/cuda.driver-version.full,nvidia.com/cuda.driver-version.major,nvidia.com/cuda.driver-version.minor,nvidia.com/cuda.driver-version.revision,nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.driver.rev,nvidia.com/cuda.runtime-version.full,nvidia.com/cuda.runtime-version.major,nvidia.com/cuda.runtime-version.minor,nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor,nvidia.com/gfd.timestamp,nvidia.com/gpu.compute.major,nvidia.com/gpu.compute.minor,nvidia.com/gpu.count,nvidia.com/gpu.family,nvidia.com/gpu.machine,nvidia.com/gpu.memory,nvidia.com/gpu.mode,nvidia.com/gpu.product,nvidia.com/gpu.replicas,nvidia.com/gpu.sharing-strategy,nvidia.com/mig.capable,nvidia.com/mig.strategy,nvidia.com/mps.capable,nvidia.com/vgpu.present,pci-10de.present,pci-10ec.present,pci-8086.present,storage-nonrotationaldisk,system-os_release.ID,system-os_release.VERSION_ID,system-os_release.VERSION_ID.major,system-os_release.VERSION_ID.minor
    node.alpha.kubernetes.io/ttl: "0"
    nvidia.com/gpu-driver-upgrade-enabled: "true"
    ovn.kubernetes.io/allocated: "true"
    ovn.kubernetes.io/chassis: 0f78ab1d-e291-45a1-83f6-094b855af693
    ovn.kubernetes.io/cidr: 100.64.0.0/16
    ovn.kubernetes.io/gateway: 100.64.0.1
    ovn.kubernetes.io/ip_address: 100.64.0.2
    ovn.kubernetes.io/logical_switch: join
    ovn.kubernetes.io/mac_address: 16:03:a6:54:88:e6
    ovn.kubernetes.io/port_name: node-su2070
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2025-08-30T11:34:09Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    feature.node.kubernetes.io/cpu-cpuid.ADX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: "true"
    feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D: "true"
    feature.node.kubernetes.io/cpu-cpuid.FMA3: "true"
    feature.node.kubernetes.io/cpu-cpuid.FXSR: "true"
    feature.node.kubernetes.io/cpu-cpuid.FXSROPT: "true"
    feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP: "true"
    feature.node.kubernetes.io/cpu-cpuid.IBPB: "true"
    feature.node.kubernetes.io/cpu-cpuid.LAHF: "true"
    feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR: "true"
    feature.node.kubernetes.io/cpu-cpuid.MOVBE: "true"
    feature.node.kubernetes.io/cpu-cpuid.MPX: "true"
    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: "true"
    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD: "true"
    feature.node.kubernetes.io/cpu-cpuid.SRBDS_CTRL: "true"
    feature.node.kubernetes.io/cpu-cpuid.STIBP: "true"
    feature.node.kubernetes.io/cpu-cpuid.SYSCALL: "true"
    feature.node.kubernetes.io/cpu-cpuid.SYSEE: "true"
    feature.node.kubernetes.io/cpu-cpuid.VMX: "true"
    feature.node.kubernetes.io/cpu-cpuid.X87: "true"
    feature.node.kubernetes.io/cpu-cpuid.XGETBV1: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVE: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVEC: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVES: "true"
    feature.node.kubernetes.io/cpu-cstate.enabled: "true"
    feature.node.kubernetes.io/cpu-hardware_multithreading: "true"
    feature.node.kubernetes.io/cpu-model.family: "6"
    feature.node.kubernetes.io/cpu-model.id: "165"
    feature.node.kubernetes.io/cpu-model.vendor_id: Intel
    feature.node.kubernetes.io/cpu-pstate.scaling_governor: powersave
    feature.node.kubernetes.io/cpu-pstate.status: active
    feature.node.kubernetes.io/cpu-pstate.turbo: "true"
    feature.node.kubernetes.io/kernel-config.NO_HZ: "true"
    feature.node.kubernetes.io/kernel-config.NO_HZ_FULL: "true"
    feature.node.kubernetes.io/kernel-version.full: 6.14.0-29-generic
    feature.node.kubernetes.io/kernel-version.major: "6"
    feature.node.kubernetes.io/kernel-version.minor: "14"
    feature.node.kubernetes.io/kernel-version.revision: "0"
    feature.node.kubernetes.io/pci-10de.present: "true"
    feature.node.kubernetes.io/pci-10ec.present: "true"
    feature.node.kubernetes.io/pci-8086.present: "true"
    feature.node.kubernetes.io/storage-nonrotationaldisk: "true"
    feature.node.kubernetes.io/system-os_release.ID: ubuntu
    feature.node.kubernetes.io/system-os_release.VERSION_ID: "24.04"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "24"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "04"
    gpu: "on"
    kube-ovn/role: master
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: su2070
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/exclude-from-external-load-balancers: ""
    nvidia.com/cuda.driver-version.full: 575.64.03
    nvidia.com/cuda.driver-version.major: "575"
    nvidia.com/cuda.driver-version.minor: "64"
    nvidia.com/cuda.driver-version.revision: "03"
    nvidia.com/cuda.driver.major: "575"
    nvidia.com/cuda.driver.minor: "64"
    nvidia.com/cuda.driver.rev: "03"
    nvidia.com/cuda.runtime-version.full: "12.9"
    nvidia.com/cuda.runtime-version.major: "12"
    nvidia.com/cuda.runtime-version.minor: "9"
    nvidia.com/cuda.runtime.major: "12"
    nvidia.com/cuda.runtime.minor: "9"
    nvidia.com/gfd.timestamp: "1756568541"
    nvidia.com/gpu-driver-upgrade-state: upgrade-done
    nvidia.com/gpu.compute.major: "7"
    nvidia.com/gpu.compute.minor: "5"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
    nvidia.com/gpu.deploy.device-plugin: "true"
    nvidia.com/gpu.deploy.driver: pre-installed
    nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
    nvidia.com/gpu.deploy.node-status-exporter: "true"
    nvidia.com/gpu.deploy.operator-validator: "true"
    nvidia.com/gpu.family: turing
    nvidia.com/gpu.machine: OMEN-25L-Desktop-GT11-0xxx
    nvidia.com/gpu.memory: "8192"
    nvidia.com/gpu.mode: graphics
    nvidia.com/gpu.present: "true"
    nvidia.com/gpu.product: NVIDIA-GeForce-RTX-2070-SUPER
    nvidia.com/gpu.replicas: "1"
    nvidia.com/gpu.sharing-strategy: none
    nvidia.com/mig.capable: "false"
    nvidia.com/mig.strategy: single
    nvidia.com/mps.capable: "false"
    nvidia.com/vgpu.present: "false"
  name: su2070
  resourceVersion: "139183"
  uid: 182f27a0-da15-432a-84d1-d39703c4fb85
spec:
  podCIDR: 10.16.0.0/24
  podCIDRs:
  - 10.16.0.0/24
status:
  addresses:
  - address: 192.168.0.101
    type: InternalIP
  - address: su2070
    type: Hostname
  allocatable:
    cpu: 15600m
    ephemeral-storage: 490048472Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: "31308977128"
    nvidia.com/gpu: "1"
    pods: "110"
  capacity:
    cpu: "16"
    ephemeral-storage: 490048472Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32723340Ki
    nvidia.com/gpu: "1"
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2025-08-31T01:02:44Z"
    lastTransitionTime: "2025-08-31T01:02:44Z"
    message: ping check to gateway ip 100.64.0.1 succeeded
    reason: JoinSubnetGatewayReachable
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T11:34:08Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T11:34:08Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T11:34:08Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2025-08-31T00:58:33Z"
    lastTransitionTime: "2025-08-30T12:37:29Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:c525320fd1e771b911b68f8e760b83e8fccf1beea43bf9b009c4f0c591e193ea
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.8.0
    sizeBytes: 238542288
  - names:
    - docker.io/kubeovn/kube-ovn:v1.15.0
    sizeBytes: 227388842
  - names:
    - nvcr.io/nvidia/gpu-operator@sha256:f628afdfca94c50730960b21dda2567a53b61a161bac232bf8dd0b72fec9213f
    - nvcr.io/nvidia/gpu-operator:v25.3.1
    sizeBytes: 225897475
  - names:
    - nvcr.io/nvidia/k8s-device-plugin@sha256:037160a36de0f060fc21cc0cb2f795d980282ff1471b55530433ca4350b24c4f
    - nvcr.io/nvidia/k8s-device-plugin:v0.17.2
    sizeBytes: 203083899
  - names:
    - nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:0b6f1944b05254ce50a08d44ca0d23a40f254fb448255a9234c43dec44e6929c
    - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.1
    sizeBytes: 188844071
  - names:
    - nvcr.io/nvidia/k8s/container-toolkit@sha256:d90dd628828082d61ea2334dc5dbfe7104a160ddea5ff4e0d44e12dee24c10f6
    - nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04
    sizeBytes: 153613868
  - names:
    - nvcr.io/nvidia/k8s/dcgm-exporter@sha256:76281f16b132e0f6ddb16a1b8adfe523ea3d5ce290f623f4734d50906c8770a7
    - nvcr.io/nvidia/k8s/dcgm-exporter:4.2.3-4.1.3-ubuntu22.04
    sizeBytes: 142361073
  - names:
    - docker.io/projecthami/hami:v2.6.1
    sizeBytes: 134779787
  - names:
    - registry.k8s.io/nfd/node-feature-discovery:v0.17.3
    sizeBytes: 80718055
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-apiserver@sha256:bc5c88d316db8b27ff203273d67ec8a962ef346278d9bbae400473efc1674725
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-apiserver:v1.29.5
    sizeBytes: 35233975
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-controller-manager@sha256:a9a64e67b66ea6fb43f976f65d8a0cadd68b0ed5ed2311d2fc4bf887403ecf8a
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-controller-manager:v1.29.5
    sizeBytes: 33591210
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/k8s-dns-node-cache@sha256:d2504aceb7db88ce24779a750522b22eeda061868376fcf3dea19e3d1cff911e
    - registry.cn-beijing.aliyuncs.com/kubesphereio/k8s-dns-node-cache:1.22.20
    sizeBytes: 30467856
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-proxy@sha256:4c9681a68b0f068f66e6c4120be71a4416621cad1427802deaaa79d01fdffb85
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-proxy:v1.29.5
    sizeBytes: 28408608
  - names:
    - docker.io/liangjw/kube-webhook-certgen:v1.1.1
    sizeBytes: 18899182
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-scheduler@sha256:5e729dc015466f486fdeed22200d86108ffac26ea6e5abf3258c0502a637d3a7
    - registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler@sha256:3ae97633e22c6f9319423018500ae6348548d8c0b1575ba1bd4090cb50e8c5b7
    - registry.cn-beijing.aliyuncs.com/kubesphereio/kube-scheduler:v1.29.5
    - registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.29.5
    sizeBytes: 18673620
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/coredns@sha256:8e352a029d304ca7431c6507b56800636c321cb52289686a581ab70aaa8a2e2a
    - registry.cn-beijing.aliyuncs.com/kubesphereio/coredns:1.9.3
    sizeBytes: 14837849
  - names:
    - registry.cn-beijing.aliyuncs.com/kubesphereio/pause@sha256:7031c1b283388d2c2e09b57badb803c05ebed362dc88d84b480cc47f72a21097
    - registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.9
    sizeBytes: 321520
  nodeInfo:
    architecture: amd64
    bootID: 6d4581fd-56e9-44fe-b7e1-db26694163d3
    containerRuntimeVersion: containerd://1.7.13
    kernelVersion: 6.14.0-29-generic
    kubeProxyVersion: v1.29.5
    kubeletVersion: v1.29.5
    machineID: abc2ec7c46ef482cb42e45dce12f12b8
    operatingSystem: linux
    osImage: Ubuntu 24.04.3 LTS
    systemUUID: 6557d507-a023-11ea-a942-3c18a010fee8
bb@su2070:~$ 
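
The node dump above is long, so a small script can pull out the GPU-relevant parts. The sketch below is only an illustration: the function name `summarize_gpu_node` and the three group names are made up here, and grouping labels by prefix is a rough heuristic (`feature.node.kubernetes.io/` for node-feature-discovery, `nvidia.com/gpu.deploy.` for the operator's per-component deployment switches, other `nvidia.com/` labels mostly from gpu-feature-discovery). It expects the JSON form of the node object, i.e. `-o json` instead of `-o yaml`.

```python
def summarize_gpu_node(node: dict) -> dict:
    """Summarize GPU-related labels and allocatable GPUs of a Node object."""
    labels = node.get("metadata", {}).get("labels", {})
    allocatable = node.get("status", {}).get("allocatable", {})

    # Rough grouping of labels by prefix (a heuristic, not an official mapping).
    groups = {"nfd": [], "operator_deploy": [], "nvidia_other": []}
    for key in sorted(labels):
        if key.startswith("feature.node.kubernetes.io/"):
            groups["nfd"].append(key)
        elif key.startswith("nvidia.com/gpu.deploy."):
            groups["operator_deploy"].append(key)
        elif key.startswith("nvidia.com/"):
            groups["nvidia_other"].append(key)

    memory = labels.get("nvidia.com/gpu.memory")
    major = labels.get("nvidia.com/gpu.compute.major")
    minor = labels.get("nvidia.com/gpu.compute.minor")
    has_cc = major is not None and minor is not None

    return {
        "product": labels.get("nvidia.com/gpu.product"),
        "memory_mib": int(memory) if memory is not None else None,
        "compute_capability": f"{major}.{minor}" if has_cc else None,
        "allocatable_gpus": int(allocatable.get("nvidia.com/gpu", "0")),
        "label_counts": {name: len(keys) for name, keys in groups.items()},
    }


if __name__ == "__main__":
    # A trimmed-down sample using values from the node dump above.
    sample = {
        "metadata": {"labels": {
            "feature.node.kubernetes.io/pci-10de.present": "true",
            "nvidia.com/gpu.deploy.device-plugin": "true",
            "nvidia.com/gpu.product": "NVIDIA-GeForce-RTX-2070-SUPER",
            "nvidia.com/gpu.memory": "8192",
            "nvidia.com/gpu.compute.major": "7",
            "nvidia.com/gpu.compute.minor": "5",
        }},
        "status": {"allocatable": {"nvidia.com/gpu": "1"}},
    }
    print(summarize_gpu_node(sample))
```

To run it against a live cluster, one option is to load the output of `kubectl get node su2070 -o json` with `json.load` and pass the resulting dict to the function.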