HAMi + prometheus-k8s + grafana: vGPU Virtualization Monitoring

I spent more than half a month in Changsha recently going over project metrics with the client, so it has been a while since my last update.

After getting back I continued working out how to monitor HAMi vGPU virtualization with Grafana; the contract requires us to demonstrate GPU resource limits, compute sharing, and monitoring of shared accelerator-card resources.

First, why HAMi? One important reason is that someone at my company introduced me to the project's author, so I can ask the author questions directly.

HAMi is an open-source project for virtualizing GPUs and domestic accelerator cards in Kubernetes-based container environments (the supported GPU and accelerator models and their specific features are listed on the project site: https://github.com/Project-HAMi/HAMi/). HAMi was originally named "k8s-vGPU-scheduler".

It was first open-sourced by my company and has since gained traction both in China and internationally as middleware for managing heterogeneous devices in Kubernetes. It can manage different kinds of heterogeneous devices (GPU, NPU, and so on), share them among Pods, and make better scheduling decisions based on device topology and scheduling policy. For brevity, this article presents only one workable approach, with the end goal of scraping the monitoring metrics with Prometheus as the data source and displaying them in Grafana.

This article assumes a Kubernetes cluster and HAMi are already deployed. All components mentioned below are installed inside the Kubernetes cluster; the component and software versions are as follows:

Component or software / Version / Notes
- Kubernetes cluster: v1.23.1, running on AMD64 servers
- HAMi: according to the author, the release mechanism is not yet mature, so the value of the scheduler.kubeScheduler.imageTag parameter used at install time stands in for its version; this value should track the Kubernetes version. Project: https://github.com/Project-HAMi/HAMi/
- kube-prometheus stack: prom/prometheus:v2.27.1; for installing the monitoring stack see the "prometheus+grafana monitoring deployment" post (CSDN blog)
- dcgm-exporter: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04

HAMi installs by default via Helm. Add the Helm repository:

helm repo add hami-charts https://project-hami.github.io/HAMi/

Check the Kubernetes version and install HAMi (the cluster here is v1.23.1):

helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system

Verify that HAMi installed successfully:

kubectl get pods -n kube-system

If hami-device-plugin and hami-scheduler are both in the Running state, the installation succeeded.
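To narrow the output down to just the HAMi pods, you can filter by the chart's standard label (a small sketch; the label comes from the manifest rendered later in this post):

kubectl get pods -n kube-system -l app.kubernetes.io/name=hami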

To render the Helm install into a single hami-install.yaml:

helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system > hami-install.yaml

Deploy using that rendered manifest; the full file is reproduced below.
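Applying it is a plain kubectl apply (a minimal sketch, assuming the hami-install.yaml generated above):

kubectl apply -f hami-install.yaml -n kube-system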

---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-device-plugin
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-scheduler
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 1.8,
                "devicesplitcount": 10,
                "migstrategy":"none",
                "filterdevices": {
                  "uuid": [],
                  "index": []
                }
            }
        ]
    }
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
        "kind": "Policy",
        "apiVersion": "v1",
        "extenders": [
            {
                "urlPrefix": "https://127.0.0.1:443",
                "filterVerb": "filter",
                "bindVerb": "bind",
                "enableHttps": true,
                "weight": 1,
                "nodeCacheCapable": true,
                "httpTimeout": 30000000000,
                "tlsConfig": {
                    "insecure": true
                },
                "managedResources": [
                    {
                        "name": "nvidia.com/gpu",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpumem",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpucores",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpumem-percentage",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/priority",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "cambricon.com/vmlu",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "hygon.com/dcunum",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "hygon.com/dcumem",
                        "ignoredByScheduler": true 
                    },
                    {
                        "name": "hygon.com/dcucores",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "iluvatar.ai/vgpu",
                        "ignoredByScheduler": true
                    }
                ],
                "ignoreable": false
            }
        ]
    }
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-newversion
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: nvidia.com/gpumem
        ignoredByScheduler: true
      - name: nvidia.com/gpucores
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: nvidia.com/gpu
      resourceMemoryName: nvidia.com/gpumem
      resourceMemoryPercentageName: nvidia.com/gpumem-percentage
      resourceCoreName: nvidia.com/gpucores
      resourcePriorityName: nvidia.com/priority
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 1
      deviceCoreScaling: 1
    cambricon:
      resourceCountName: cambricon.com/vmlu
      resourceMemoryName: cambricon.com/mlu.smlu.vmemory
      resourceCoreName: cambricon.com/mlu.smlu.vcore
    hygon:
      resourceCountName: hygon.com/dcunum
      resourceMemoryName: hygon.com/dcumem
      resourceCoreName: hygon.com/dcucores
    metax:
      resourceCountName: "metax-tech.com/gpu"
    mthreads:
      resourceCountName: "mthreads.com/vgpu"
      resourceMemoryName: "mthreads.com/sgpu-memory"
      resourceCoreName: "mthreads.com/sgpu-core"
    iluvatar: 
      resourceCountName: iluvatar.ai/vgpu
      resourceMemoryName: iluvatar.ai/vcuda-memory
      resourceCoreName: iluvatar.ai/vcuda-core
    vnpus:
    - chipName: 910B
      commonWord: Ascend910A
      resourceName: huawei.com/Ascend910A
      resourceMemoryName: huawei.com/Ascend910A-memory
      memoryAllocatable: 32768
      memoryCapacity: 32768
      aiCore: 30
      templates:
        - name: vir02
          memory: 2184
          aiCore: 2
        - name: vir04
          memory: 4369
          aiCore: 4
        - name: vir08
          memory: 8738
          aiCore: 8
        - name: vir16
          memory: 17476
          aiCore: 16
    - chipName: 910B3
      commonWord: Ascend910B
      resourceName: huawei.com/Ascend910B
      resourceMemoryName: huawei.com/Ascend910B-memory
      memoryAllocatable: 65536
      memoryCapacity: 65536
      aiCore: 20
      aiCPU: 7
      templates:
        - name: vir05_1c_16g
          memory: 16384
          aiCore: 5
          aiCPU: 1
        - name: vir10_3c_32g
          memory: 32768
          aiCore: 10
          aiCPU: 3
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P
      resourceMemoryName: huawei.com/Ascend310P-memory
      memoryAllocatable: 21527
      memoryCapacity: 24576
      aiCore: 8
      aiCPU: 7
      templates:
        - name: vir01
          memory: 3072
          aiCore: 1
          aiCPU: 1
        - name: vir02
          memory: 6144
          aiCore: 2
          aiCPU: 2
        - name: vir04
          memory: 12288
          aiCore: 4
          aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name:  hami-device-plugin-monitor
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - create
      - watch
      - list
      - update
      - patch
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - update
      - list
      - patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  #name: cluster-admin
  name: hami-device-plugin-monitor
subjects:
  - kind: ServiceAccount
    name: hami-device-plugin
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: hami-scheduler
    namespace: "kube-system"
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-device-plugin-monitor
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/component: hami-device-plugin
  type: NodePort
  ports:
    - name: monitorport
      port: 31992
      targetPort: 9394
      nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: NodePort
  ports:
    - name: http
      port: 443
      targetPort: 443
      nodePort: 31998
      protocol: TCP
    - name: monitor
      port: 31993
      targetPort: 9395
      nodePort: 31993
      protocol: TCP
  selector:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-device-plugin
        hami.io/webhook: ignore
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
    spec:
      imagePullSecrets: 
        []
      serviceAccountName: hami-device-plugin
      priorityClassName: system-node-critical
      hostPID: true
      hostNetwork: true
      containers:
        - name: device-plugin
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh","-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
          command:
            - nvidia-device-plugin
            - --config-file=/device-config.yaml
            - --mig-strategy=none
            - --disable-core-limit=false
            - -v=false
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: lib
              mountPath: /usr/local/vgpu
            - name: usrbin
              mountPath: /usrbin
            - name: deviceconfig
              mountPath: /config
            - name: hosttmp
              mountPath: /tmp
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
        - name: vgpu-monitor
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          command: ["vGPUmonitor"]
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local/vgpu              
          volumeMounts:
            - name: ctrs
              mountPath: /usr/local/vgpu/containers
            - name: dockers
              mountPath: /run/docker
            - name: containerds
              mountPath: /run/containerd
            - name: sysinfo
              mountPath: /sysinfo
            - name: hostvar
              mountPath: /hostvar
      volumes:
        - name: ctrs
          hostPath:
            path: /usr/local/vgpu/containers
        - name: hosttmp
          hostPath:
            path: /tmp
        - name: dockers
          hostPath:
            path: /run/docker
        - name: containerds
          hostPath:
            path: /run/containerd
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: lib
          hostPath:
            path: /usr/local/vgpu
        - name: usrbin
          hostPath:
            path: /usr/bin
        - name: sysinfo
          hostPath:
            path: /sys
        - name: hostvar
          hostPath:
            path: /var
        - name: deviceconfig
          configMap:
            name: hami-device-plugin
        - name: device-config
          configMap:
            name: hami-scheduler-device
      nodeSelector: 
        gpu: "on"
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-scheduler
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      serviceAccountName: hami-scheduler
      priorityClassName: system-node-critical
      containers:
        - name: kube-scheduler
          image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0
          imagePullPolicy: "IfNotPresent"
          command:
            - kube-scheduler
            - --config=/config/config.yaml
            - -v=4
            - --leader-elect=true
            - --leader-elect-resource-name=hami-scheduler
            - --leader-elect-resource-namespace=kube-system
          volumeMounts:
            - name: scheduler-config
              mountPath: /config
        - name: vgpu-scheduler-extender
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          env:
          command:
            - scheduler
            - --http_bind=0.0.0.0:443
            - --cert_file=/tls/tls.crt
            - --key_file=/tls/tls.key
            - --scheduler-name=hami-scheduler
            - --metrics-bind-address=:9395
            - --node-scheduler-policy=binpack
            - --gpu-scheduler-policy=spread
            - --device-config-file=/device-config.yaml
            - --debug
            - -v=4
          ports:
            - name: http
              containerPort: 443
              protocol: TCP
          volumeMounts:
            - name: tls-config
              mountPath: /tls
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
      volumes:
        - name: tls-config
          secret:
            secretName: hami-scheduler-tls
        - name: scheduler-config
          configMap:
            name: hami-scheduler-newversion
        - name: device-config
          configMap:
            name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: hami-webhook
webhooks:
  - admissionReviewVersions:
    - v1beta1
    clientConfig:
      service:
        name: hami-scheduler
        namespace: kube-system
        path: /webhook
        port: 443
    failurePolicy: Ignore
    matchPolicy: Equivalent
    name: vgpu.hami.io
    namespaceSelector:
      matchExpressions:
      - key: hami.io/webhook
        operator: NotIn
        values:
        - ignore
    objectSelector:
      matchExpressions:
      - key: hami.io/webhook
        operator: NotIn
        values:
        - ignore
    reinvocationPolicy: Never
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        operations:
          - CREATE
        resources:
          - pods
        scope: '*'
    sideEffects: None
    timeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      #- validatingwebhookconfigurations
      - mutatingwebhookconfigurations
    verbs:
      - get
      - update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name:  hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name:  hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-create
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-create
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      containers:
        - name: create
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - create
            - --cert-name=tls.crt
            - --key-name=tls.key
            - --host=hami-scheduler.kube-system.svc,127.0.0.1
            - --namespace=kube-system
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-patch
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-patch
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      containers:
        - name: patch
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - patch
            - --webhook-name=hami-webhook
            - --namespace=kube-system
            - --patch-validating=false
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
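Note the nodeSelector gpu: "on" at the end of the hami-device-plugin DaemonSet: the device plugin is only scheduled onto nodes that carry this label, so label your GPU nodes first (a sketch; substitute the node name):

kubectl label node <gpu-node-name> gpu=on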

Deploy dcgm-exporter:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "3.6.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.6.1"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"

---

kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
  ports:
  - name: "metrics"
    port: 9400

With that, dcgm-exporter is installed.
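A quick way to check that it is actually serving metrics (a sketch; the label and port come from the manifests above, and DCGM_FI_DEV_GPU_UTIL is one of the exporter's default counters):

kubectl get pods -l app.kubernetes.io/name=dcgm-exporter
kubectl port-forward svc/dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL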

Download the dashboard JSON for the hami-vgpu dashboard:

hami-vgpu-dashboard | Grafana Labs. After importing it, Grafana will have a dashboard named "hami-vgpu-dashboard", but some panels on that page, such as vGPUCorePercentage, will have no data yet.

A ServiceMonitor is a custom resource from the Prometheus Operator, used to monitor services running in Kubernetes. It does two main things:

1. Automatic discovery

A ServiceMonitor lets Prometheus automatically discover and monitor services in Kubernetes. By defining one, you tell Prometheus which service endpoints to scrape.

2. Scrape configuration

In a ServiceMonitor you can set the scraping parameters, for example:

  • scrape interval: how often Prometheus scrapes the target (e.g. every 30 seconds);
  • timeout: how long a scrape request may take before it is abandoned;
  • label selector: which services to monitor, so that Prometheus only scrapes the relevant ones.

Two ServiceMonitors need to be configured here, one for the HAMi device plugin and one for the HAMi scheduler:

hami-device-plugin-svc-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

hami-scheduler-svc-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitor
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

Confirm that the ServiceMonitors were created.
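For example (assuming the Prometheus Operator CRDs are installed):

kubectl get servicemonitor -n kube-system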

Start a GPU pod to test it:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
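Save the spec and create the pod (a minimal sketch; the file name gpu-pod-1.yaml is just an example):

kubectl apply -f gpu-pod-1.yaml
kubectl get pod gpu-pod-1 -o wide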

If the pod stays in the Pending state, check the node to see whether nvidia.com/gpu shows up as 0.
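A quick way to check (a sketch; substitute your node name):

kubectl describe node <gpu-node-name> | grep nvidia.com/gpu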

If it is 0, the NVIDIA container runtime and device plugin still need to be set up:

Docker:
1. Download and install the nvidia-docker2 package.
2. Edit /etc/docker/daemon.json to add the nvidia runtime (then restart docker):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Kubernetes:
1. Pull the k8s-device-plugin image.
2. Write nvidia-device-plugin.yml and create the device-plugin pods.

Use this YAML to create it:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

After the GPU pod starts, exec into it and check: the GPU memory visible inside matches the configured limit, so the restriction took effect.
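For example, with the pod created above:

kubectl exec -it gpu-pod-1 -- nvidia-smi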

Now access {scheduler node ip}:31993/metrics.
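For example (a sketch; substitute the scheduler node's IP):

curl -s http://<scheduler-node-ip>:31993/metrics | grep vGPUPodsDeviceAllocated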

The output includes two lines like these:

vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10

You can see that the GPU with the same deviceuuid is being shared by different pods.

Exec into the hami-device-plugin DaemonSet pod and run nvidia-smi -L to list all the GPUs on the machine:
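A sketch of the commands (the actual pod name will differ on your cluster):

kubectl -n kube-system get pods -l app.kubernetes.io/component=hami-device-plugin
kubectl -n kube-system exec -it <hami-device-plugin-pod> -c device-plugin -- nvidia-smi -L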

root@node126:/# nvidia-smi -L

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)

GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)

root@node126:/#

The two ServiceMonitors created earlier scrape the /metrics endpoints of the services labeled app.kubernetes.io/component: hami-scheduler and app.kubernetes.io/component: hami-device-plugin.

Once the GPU pods are running, open the hami-vgpu-metrics-dashboard and take a look.
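If a panel is still empty, you can sanity-check the underlying data directly in Prometheus; for example, this query (a sketch built on the metric shown above) sums the vGPU memory allocated on each physical GPU:

sum by (deviceuuid) (vGPUPodsDeviceAllocated)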
