HAMi + prometheus-k8s + grafana: vGPU virtualization monitoring

I just got back from more than half a month in Changsha going over project acceptance metrics with the customer, so this blog has not been updated for a while.

Back in the office, I picked up the work of monitoring HAMi vGPU virtualization with Grafana. The contract requires us to demonstrate GPU resource limits, compute sharing, and monitoring of shared accelerator cards, so this has to be done.

First, why HAMi? One important reason is that a colleague introduced me to the author of the project, so I can ask the author questions directly whenever I get stuck.

HAMi is an open-source project from China for virtualizing GPUs and domestic accelerator cards (see the project site for the supported models and features: https://github.com/Project-HAMi/HAMi/). It provides GPU/accelerator virtualization for container workloads on Kubernetes, and was originally named "k8s-vGPU-scheduler".

First open-sourced by my company, it has since gained popularity both in China and abroad as a middleware for managing heterogeneous devices in Kubernetes. It can manage different types of heterogeneous devices (GPUs, NPUs, and so on), share them between Pods, and make better scheduling decisions based on device topology and scheduling policies. For brevity, this article presents just one workable approach: Prometheus scrapes the monitoring metrics and acts as the data source, and Grafana displays the monitoring information.

This article assumes a Kubernetes cluster is already deployed; the HAMi installation is recapped below for completeness. All components involved are installed inside the Kubernetes cluster, with the following versions:

| Component / software | Version | Notes |
| --- | --- | --- |
| Kubernetes cluster | v1.23.1 | AMD64 servers |
| HAMi | see notes | According to the author, HAMi's release process is not yet mature, so the scheduler.kubeScheduler.imageTag install parameter is treated as its version here; this value should match the Kubernetes version. Project: https://github.com/Project-HAMi/HAMi/ |
| kube-prometheus stack | prom/prometheus:v2.27.1 | For installing the monitoring stack, see the separate CSDN post on deploying prometheus + grafana |
| dcgm-exporter | nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04 | |

HAMi is installed via Helm by default. Add the Helm repository:

helm repo add hami-charts https://project-hami.github.io/HAMi/

Check the Kubernetes version and install HAMi (this cluster is v1.23.1):

helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system

Verify that HAMi installed successfully:

kubectl get pods -n kube-system

If both hami-device-plugin and hami-scheduler are in the Running state, the installation succeeded.
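A quick way to check is to filter by the chart's labels (a minimal sketch; the chart applies the label app.kubernetes.io/name=hami to its pods, as the rendered manifest later in this post shows):

kubectl get pods -n kube-system -l app.kubernetes.io/name=hami
# expect one hami-device-plugin pod per GPU node plus a hami-scheduler pod, all Running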

To convert the Helm installation into a plain manifest, render it into hami-install.yaml:

helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system > hami-install.yaml

The rendered manifest can then be used to deploy HAMi directly instead of managing it through Helm (the apply command is shown after the manifest). Its full content is:

---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-device-plugin
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-scheduler
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "m5-cloudinfra-online02",
                "devicememoryscaling": 1.8,
                "devicesplitcount": 10,
                "migstrategy":"none",
                "filterdevices": {
                  "uuid": [],
                  "index": []
                }
            }
        ]
    }
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
        "kind": "Policy",
        "apiVersion": "v1",
        "extenders": [
            {
                "urlPrefix": "https://127.0.0.1:443",
                "filterVerb": "filter",
                "bindVerb": "bind",
                "enableHttps": true,
                "weight": 1,
                "nodeCacheCapable": true,
                "httpTimeout": 30000000000,
                "tlsConfig": {
                    "insecure": true
                },
                "managedResources": [
                    {
                        "name": "nvidia.com/gpu",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpumem",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpucores",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/gpumem-percentage",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "nvidia.com/priority",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "cambricon.com/vmlu",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "hygon.com/dcunum",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "hygon.com/dcumem",
                        "ignoredByScheduler": true 
                    },
                    {
                        "name": "hygon.com/dcucores",
                        "ignoredByScheduler": true
                    },
                    {
                        "name": "iluvatar.ai/vgpu",
                        "ignoredByScheduler": true
                    }
                ],
                "ignoreable": false
            }
        ]
    }
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-newversion
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: nvidia.com/gpumem
        ignoredByScheduler: true
      - name: nvidia.com/gpucores
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: nvidia.com/gpu
      resourceMemoryName: nvidia.com/gpumem
      resourceMemoryPercentageName: nvidia.com/gpumem-percentage
      resourceCoreName: nvidia.com/gpucores
      resourcePriorityName: nvidia.com/priority
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 1
      deviceCoreScaling: 1
    cambricon:
      resourceCountName: cambricon.com/vmlu
      resourceMemoryName: cambricon.com/mlu.smlu.vmemory
      resourceCoreName: cambricon.com/mlu.smlu.vcore
    hygon:
      resourceCountName: hygon.com/dcunum
      resourceMemoryName: hygon.com/dcumem
      resourceCoreName: hygon.com/dcucores
    metax:
      resourceCountName: "metax-tech.com/gpu"
    mthreads:
      resourceCountName: "mthreads.com/vgpu"
      resourceMemoryName: "mthreads.com/sgpu-memory"
      resourceCoreName: "mthreads.com/sgpu-core"
    iluvatar: 
      resourceCountName: iluvatar.ai/vgpu
      resourceMemoryName: iluvatar.ai/vcuda-memory
      resourceCoreName: iluvatar.ai/vcuda-core
    vnpus:
    - chipName: 910B
      commonWord: Ascend910A
      resourceName: huawei.com/Ascend910A
      resourceMemoryName: huawei.com/Ascend910A-memory
      memoryAllocatable: 32768
      memoryCapacity: 32768
      aiCore: 30
      templates:
        - name: vir02
          memory: 2184
          aiCore: 2
        - name: vir04
          memory: 4369
          aiCore: 4
        - name: vir08
          memory: 8738
          aiCore: 8
        - name: vir16
          memory: 17476
          aiCore: 16
    - chipName: 910B3
      commonWord: Ascend910B
      resourceName: huawei.com/Ascend910B
      resourceMemoryName: huawei.com/Ascend910B-memory
      memoryAllocatable: 65536
      memoryCapacity: 65536
      aiCore: 20
      aiCPU: 7
      templates:
        - name: vir05_1c_16g
          memory: 16384
          aiCore: 5
          aiCPU: 1
        - name: vir10_3c_32g
          memory: 32768
          aiCore: 10
          aiCPU: 3
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P
      resourceMemoryName: huawei.com/Ascend310P-memory
      memoryAllocatable: 21527
      memoryCapacity: 24576
      aiCore: 8
      aiCPU: 7
      templates:
        - name: vir01
          memory: 3072
          aiCore: 1
          aiCPU: 1
        - name: vir02
          memory: 6144
          aiCore: 2
          aiCPU: 2
        - name: vir04
          memory: 12288
          aiCore: 4
          aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name:  hami-device-plugin-monitor
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - create
      - watch
      - list
      - update
      - patch
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - update
      - list
      - patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  #name: cluster-admin
  name: hami-device-plugin-monitor
subjects:
  - kind: ServiceAccount
    name: hami-device-plugin
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: hami-scheduler
    namespace: "kube-system"
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-device-plugin-monitor
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/component: hami-device-plugin
  type: NodePort
  ports:
    - name: monitorport
      port: 31992
      targetPort: 9394
      nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: NodePort
  ports:
    - name: http
      port: 443
      targetPort: 443
      nodePort: 31998
      protocol: TCP
    - name: monitor
      port: 31993
      targetPort: 9395
      nodePort: 31993
      protocol: TCP
  selector:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-device-plugin
        hami.io/webhook: ignore
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
    spec:
      imagePullSecrets: 
        []
      serviceAccountName: hami-device-plugin
      priorityClassName: system-node-critical
      hostPID: true
      hostNetwork: true
      containers:
        - name: device-plugin
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh","-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
          command:
            - nvidia-device-plugin
            - --config-file=/device-config.yaml
            - --mig-strategy=none
            - --disable-core-limit=false
            - -v=false
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: lib
              mountPath: /usr/local/vgpu
            - name: usrbin
              mountPath: /usrbin
            - name: deviceconfig
              mountPath: /config
            - name: hosttmp
              mountPath: /tmp
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
        - name: vgpu-monitor
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          command: ["vGPUmonitor"]
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local/vgpu              
          volumeMounts:
            - name: ctrs
              mountPath: /usr/local/vgpu/containers
            - name: dockers
              mountPath: /run/docker
            - name: containerds
              mountPath: /run/containerd
            - name: sysinfo
              mountPath: /sysinfo
            - name: hostvar
              mountPath: /hostvar
      volumes:
        - name: ctrs
          hostPath:
            path: /usr/local/vgpu/containers
        - name: hosttmp
          hostPath:
            path: /tmp
        - name: dockers
          hostPath:
            path: /run/docker
        - name: containerds
          hostPath:
            path: /run/containerd
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: lib
          hostPath:
            path: /usr/local/vgpu
        - name: usrbin
          hostPath:
            path: /usr/bin
        - name: sysinfo
          hostPath:
            path: /sys
        - name: hostvar
          hostPath:
            path: /var
        - name: deviceconfig
          configMap:
            name: hami-device-plugin
        - name: device-config
          configMap:
            name: hami-scheduler-device
      nodeSelector: 
        gpu: "on"
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-scheduler
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      serviceAccountName: hami-scheduler
      priorityClassName: system-node-critical
      containers:
        - name: kube-scheduler
          image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0
          imagePullPolicy: "IfNotPresent"
          command:
            - kube-scheduler
            - --config=/config/config.yaml
            - -v=4
            - --leader-elect=true
            - --leader-elect-resource-name=hami-scheduler
            - --leader-elect-resource-namespace=kube-system
          volumeMounts:
            - name: scheduler-config
              mountPath: /config
        - name: vgpu-scheduler-extender
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          env:
          command:
            - scheduler
            - --http_bind=0.0.0.0:443
            - --cert_file=/tls/tls.crt
            - --key_file=/tls/tls.key
            - --scheduler-name=hami-scheduler
            - --metrics-bind-address=:9395
            - --node-scheduler-policy=binpack
            - --gpu-scheduler-policy=spread
            - --device-config-file=/device-config.yaml
            - --debug
            - -v=4
          ports:
            - name: http
              containerPort: 443
              protocol: TCP
          volumeMounts:
            - name: tls-config
              mountPath: /tls
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
      volumes:
        - name: tls-config
          secret:
            secretName: hami-scheduler-tls
        - name: scheduler-config
          configMap:
            name: hami-scheduler-newversion
        - name: device-config
          configMap:
            name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: hami-webhook
webhooks:
  - admissionReviewVersions:
    - v1beta1
    clientConfig:
      service:
        name: hami-scheduler
        namespace: kube-system
        path: /webhook
        port: 443
    failurePolicy: Ignore
    matchPolicy: Equivalent
    name: vgpu.hami.io
    namespaceSelector:
      matchExpressions:
      - key: hami.io/webhook
        operator: NotIn
        values:
        - ignore
    objectSelector:
      matchExpressions:
      - key: hami.io/webhook
        operator: NotIn
        values:
        - ignore
    reinvocationPolicy: Never
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        operations:
          - CREATE
        resources:
          - pods
        scope: '*'
    sideEffects: None
    timeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      #- validatingwebhookconfigurations
      - mutatingwebhookconfigurations
    verbs:
      - get
      - update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name:  hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name:  hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-create
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-create
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      containers:
        - name: create
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - create
            - --cert-name=tls.crt
            - --key-name=tls.key
            - --host=hami-scheduler.kube-system.svc,127.0.0.1
            - --namespace=kube-system
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-patch
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-patch
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets: 
        []
      containers:
        - name: patch
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - patch
            - --webhook-name=hami-webhook
            - --namespace=kube-system
            - --patch-validating=false
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
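
If you deploy from the rendered manifest rather than the Helm release, apply it and wait for the pods to come up (a sketch, assuming the output of the helm template command above was saved as hami-install.yaml):

kubectl apply -f hami-install.yaml -n kube-system
kubectl get pods -n kube-system -l app.kubernetes.io/name=hami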

Next, deploy dcgm-exporter:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "3.6.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.6.1"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"

---

kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
  ports:
  - name: "metrics"
    port: 9400

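Apply the manifest and confirm the exporter is up and serving metrics on port 9400 (a sketch; the file name dcgm-exporter.yaml and the target namespace are assumptions):

kubectl apply -f dcgm-exporter.yaml
kubectl get pods -l app.kubernetes.io/name=dcgm-exporter
# scrape a sample of the metrics through the Service
kubectl port-forward svc/dcgm-exporter 9400:9400 &
curl -s http://127.0.0.1:9400/metrics | head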
Once the pods are Running, dcgm-exporter is installed successfully.

Download the dashboard JSON for hami-vgpu from Grafana Labs (hami-vgpu-dashboard | Grafana Labs). Importing it creates a dashboard named "hami-vgpu-dashboard" in Grafana, but some panels, such as vGPUCorePercentage, will not have any data yet.

ServiceMonitor is a custom resource provided by the Prometheus Operator for monitoring services in Kubernetes. It serves two main purposes:

1. Automatic discovery

A ServiceMonitor lets Prometheus automatically discover and monitor services in the cluster. By defining one, you tell Prometheus which service endpoints to scrape.

2. Scrape configuration

A ServiceMonitor also carries the scrape settings, for example:

  • Scrape interval: how often Prometheus scrapes the target (e.g. every 30 seconds).
  • Timeout: how long a scrape request may take before it is abandoned.
  • Label selector: which services to match, so Prometheus only scrapes the relevant endpoints.

Two ServiceMonitors need to be configured for HAMi: one for the device plugin and one for the scheduler.

hami-device-plugin-svc-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

hami-scheduler-svc-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitor
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace

Confirm that the two ServiceMonitors were created:
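The ServiceMonitor resource is namespaced, so list it in kube-system, where both monitors above were created:

kubectl get servicemonitors -n kube-system
# expect hami-device-plugin-svc-monitor and hami-scheduler-svc-monitor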

Start a GPU pod to test things out:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10

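Apply the spec and watch the pod come up (a sketch, assuming the spec above is saved as gpu-pod-1.yaml):

kubectl apply -f gpu-pod-1.yaml
kubectl get pod gpu-pod-1 -o wide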
If the pod stays in Pending, check the node: if nvidia.com/gpu shows as 0 in its capacity/allocatable resources, the node is not exposing its GPUs to Kubernetes.
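The node's resources can be inspected like this (replace <node-name> with your GPU node):

kubectl describe node <node-name> | grep -A 10 Allocatable
# nvidia.com/gpu should be non-zero on a healthy GPU node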

In that case you need to do the following:

For Docker:
    1. Download and install the nvidia-docker2 package.
    2. Add the nvidia runtime to /etc/docker/daemon.json (make sure the JSON stays valid, with no trailing comma):
        {
            "default-runtime": "nvidia",
            "runtimes": {
                "nvidia": {
                    "path": "/usr/bin/nvidia-container-runtime",
                    "runtimeArgs": []
                }
            }
        }
For Kubernetes:
    1. Pull the k8s-device-plugin image.
    2. Create the device-plugin pods with an nvidia-device-plugin.yml DaemonSet (shown below).

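After editing /etc/docker/daemon.json, restart Docker so the nvidia runtime becomes the default (standard systemd commands, assuming Docker is managed by systemd):

sudo systemctl daemon-reload
sudo systemctl restart docker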
Create the device-plugin DaemonSet with this YAML:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

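Apply it and check that the plugin is running on every GPU node (a sketch, assuming the file is saved as nvidia-device-plugin.yml):

kubectl apply -f nvidia-device-plugin.yml
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds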
After the GPU pod starts, exec into it and have a look: the GPU memory visible inside matches the configured limit, so the restriction took effect.
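For example (a sketch; the memory total reported by nvidia-smi inside the container should match the nvidia.com/gpumem limit set above):

kubectl exec -it gpu-pod-1 -- nvidia-smi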

Now access http://{scheduler node ip}:31993/metrics.
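For example, with curl against the NodePort exposed by the hami-scheduler Service (the node IP below is the one used in this article; substitute your own):

curl -s http://192.168.110.126:31993/metrics | grep vGPUPodsDeviceAllocated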

Near the end of the output there are two lines like these:

vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10

The same deviceuuid appears for two different pods, which shows that the GPU is being shared.

Exec into the hami-device-plugin DaemonSet pod and run nvidia-smi -L to list all the GPUs on that node:
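For example (a sketch; substitute the actual pod name, and device-plugin is the container name from the DaemonSet rendered earlier):

kubectl exec -it -n kube-system <hami-device-plugin-pod> -c device-plugin -- nvidia-smi -L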

root@node126:/# nvidia-smi -L

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)

GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)

root@node126:/#

The two ServiceMonitors created earlier scrape the /metrics endpoints of the services labeled app.kubernetes.io/component: hami-scheduler and app.kubernetes.io/component: hami-device-plugin.

Once the gpu-pod is running, open the hami-vgpu-metrics-dashboard in Grafana to see the populated panels.
