GPU Memory HPA: Autoscaling in Practice on Kubernetes

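To autoscale a GPU agent on GPU memory utilization, Kubernetes has to collect that utilization as a custom metric and feed it to the Horizontal Pod Autoscaler (HPA). The configuration and the steps to apply it follow.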

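I. Architecture and Components

An HPA driven by GPU memory utilization needs a complete chain from monitoring to metric conversion:

Data collection: nvidia-dcgm-exporter or nvidia-gpu-exporter, deployed as a DaemonSet on every GPU node, collects GPU metrics, including memory usage.
Metric storage and aggregation: the Prometheus server periodically scrapes the exporter and stores the samples.
Metric conversion and adaptation: prometheus-adapter serves the raw Prometheus series through the custom.metrics.k8s.io API, which Kubernetes can query.
Scaling decision: the HPA controller queries custom.metrics.k8s.io for the memory utilization metric and computes the required number of Pod replicas from the configured target.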

II. Detailed Configuration Steps

Step 1: Deploy the GPU metrics exporter

The example uses nvidia-dcgm-exporter, the GPU monitoring component NVIDIA recommends.

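```yaml
# nvidia-dcgm-exporter-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring # the monitoring namespace is a sensible home
spec:
  selector:
    matchLabels:
      name: nvidia-dcgm-exporter
  template:
    metadata:
      labels:
        name: nvidia-dcgm-exporter
    spec:
      tolerations: # tolerate the GPU taint so the Pod can be scheduled on GPU nodes
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-dcgm-exporter
        # Official images are published under nvcr.io/nvidia/k8s/dcgm-exporter; pin a version tag in production.
        image: nvidia/dcgm-exporter:latest
        # No nvidia.com/gpu request: requesting one would take a schedulable GPU away from
        # workloads and limit the exporter to that single device. This assumes the NVIDIA
        # container runtime is the default runtime on GPU nodes.
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        ports:
        - name: metrics
          containerPort: 9400
          protocol: TCP
        env:
        - name: DCGM_EXPORTER_INTERVAL # collection interval in milliseconds
          value: "30000"
        - name: DCGM_EXPORTER_KUBERNETES # attach pod/namespace labels to GPU metrics
          value: "true"
        volumeMounts:
        - name: pod-gpu-resources # kubelet pod-resources socket, used to map GPUs to Pods
          mountPath: /var/lib/kubelet/pod-resources
          readOnly: true
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
```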

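After deployment, check that metrics are exposed at http://<node-ip>:9400/metrics. The node IP answers only if the exporter uses hostPort or hostNetwork; otherwise query the Pod IP or use kubectl port-forward. The key metrics are:

DCGM_FI_DEV_MEMORY_USED: GPU memory in use (bytes)
DCGM_FI_DEV_MEMORY_TOTAL: total GPU memory (bytes)
DCGM_FI_DEV_GPU_UTIL: GPU compute utilization (percent)

Metric names and units depend on the exporter and its counter configuration. dcgm-exporter's default counters report frame-buffer memory as DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE, in MiB, so confirm the names in the /metrics output and substitute them in the later steps if they differ from the ones listed here.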

Step 2: Configure Prometheus scraping

Make sure the Prometheus server has a scrape job for nvidia-dcgm-exporter, defined either in prometheus.yml or, with Prometheus Operator, through a ServiceMonitor.

Example: a ServiceMonitor (when using Prometheus Operator)

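```yaml
# service-monitor-gpu.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus-operator # must match the selector of the Prometheus instance
spec:
  selector:
    matchLabels:
      name: nvidia-dcgm-exporter # matches labels on the exporter's Service (a ServiceMonitor selects Services, not Pods)
  endpoints:
  - port: metrics
    interval: 30s
    honorLabels: true # keep the exporter's pod/namespace labels instead of renaming them to exported_*
  namespaceSelector:
    matchNames:
    - monitoring
```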
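A ServiceMonitor discovers scrape targets through a Service, and Step 1 defines only a DaemonSet. A minimal sketch of such a Service follows, assuming the Pod label and port name from Step 1. The file name and the headless setting are choices made for this illustration, not part of the original setup.

```yaml
# nvidia-dcgm-exporter-service.yaml (hypothetical file name)
apiVersion: v1
kind: Service
metadata:
  name: nvidia-dcgm-exporter
  namespace: monitoring
  labels:
    name: nvidia-dcgm-exporter # the label the ServiceMonitor selector matches
spec:
  clusterIP: None # headless, used only for target discovery (an assumption, not a requirement)
  selector:
    name: nvidia-dcgm-exporter # selects the exporter Pods created by the DaemonSet
  ports:
  - name: metrics # must equal the port name referenced by the ServiceMonitor endpoint
    port: 9400
    targetPort: 9400
```

Once applied, the exporter Pods should appear as endpoints of this Service and, shortly after, as targets in Prometheus.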

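Step 3: Deploy and configure Prometheus Adapter

prometheus-adapter serves Prometheus metrics through the Kubernetes custom metrics API. Its main configuration file, config.yaml, defines the metric discovery and conversion rules.

1. Create the adapter configuration:

The point of interest is how DCGM_FI_DEV_MEMORY_USED and DCGM_FI_DEV_MEMORY_TOTAL are turned into a memory utilization aggregated per Pod.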

```yaml
# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: custom-metrics
data:
  config.yaml: |
    # Metric names must match the series the exporter actually exposes.
    # "rules" is a plain list here; the nested "custom:" key exists only in the Helm chart's values.
    rules:
    - seriesQuery: 'DCGM_FI_DEV_MEMORY_USED{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_MEMORY_USED"
        as: "gpu_memory_used_bytes" # name exposed through the custom metrics API
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    - seriesQuery: 'DCGM_FI_DEV_MEMORY_TOTAL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_MEMORY_TOTAL"
        as: "gpu_memory_total_bytes"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    # Key rule: a utilization metric (percent) that the HPA targets
    - seriesQuery: 'DCGM_FI_DEV_MEMORY_USED{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "DCGM_FI_DEV_MEMORY_USED"
        as: "gpu_memory_utilization" # the metric name used by the HPA
      metricsQuery: |
        100 * sum(DCGM_FI_DEV_MEMORY_USED{<<.LabelMatchers>>}) by (<<.GroupBy>>)
          / sum(DCGM_FI_DEV_MEMORY_TOTAL{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```

2. Deploy Prometheus Adapter:

Deploy it with the Helm chart, or directly as a Deployment.

```yaml
# prometheus-adapter-deployment.yaml (excerpt; ServiceAccount and RBAC not shown)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter
  namespace: custom-metrics
spec:
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter # must match the Service selector below
    spec:
      containers:
      - name: prometheus-adapter
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.2
        args:
        - --secure-port=6443
        - --tls-cert-file=/var/run/serving-cert/tls.crt
        - --tls-private-key-file=/var/run/serving-cert/tls.key
        - --logtostderr=true
        - --prometheus-url=http://prometheus-server.monitoring.svc.cluster.local:9090 # Prometheus service address
        - --metrics-relist-interval=1m
        - --v=6 # very verbose; useful while debugging rules
        - --config=/etc/adapter/config.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/adapter
        - name: serving-cert
          mountPath: /var/run/serving-cert
      volumes:
      - name: config
        configMap:
          name: prometheus-adapter-config
      - name: serving-cert
        secret:
          secretName: prometheus-adapter-tls
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-adapter
  namespace: custom-metrics
spec:
  ports:
  - name: https
    port: 443
    targetPort: 6443
  selector:
    app: prometheus-adapter
```
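The HPA controller reaches the adapter through the Kubernetes aggregation layer, which needs an APIService object pointing at the adapter's Service. The Helm chart creates one; a hand-written Deployment does not. The sketch below is one possible registration under stated assumptions: the group version is v1beta2 to match the verification command that follows (many installations register v1beta1 instead, so check which versions the adapter build serves), and the priority values and the skipped TLS verification are illustrative choices.

```yaml
# custom-metrics-apiservice.yaml (hypothetical file name)
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta2.custom.metrics.k8s.io # must be <version>.<group>
spec:
  group: custom.metrics.k8s.io
  version: v1beta2 # assumption: the adapter build serves this version
  service:
    name: prometheus-adapter # the Service defined above
    namespace: custom-metrics
    port: 443
  groupPriorityMinimum: 100 # illustrative values
  versionPriority: 100
  insecureSkipTLSVerify: true # illustration only; in production set caBundle to the CA that signed prometheus-adapter-tls
```

`kubectl get apiservice v1beta2.custom.metrics.k8s.io` should then report the service as available once the adapter Pod is ready.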

After deployment, verify that the custom metrics API responds:

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/*/gpu_memory_utilization" | jq .
```

Step 4: Create the HPA on GPU memory utilization

Assume the GPU agent runs in the default namespace as a Deployment named gpu-agent-deployment.

```yaml
# hpa-gpu-memory.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-agent-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-agent-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_memory_utilization # must equal the metric name defined in the adapter config
      target:
        type: AverageValue
        averageValue: "75" # target: keep the average memory utilization across Pods at 75%
  behavior: # scaling behavior, to prevent flapping
    scaleDown:
      stabilizationWindowSeconds: 300 # scale-down stabilization window: 300 s
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60 # scale-up stabilization window: 60 s
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
```

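III. Key Settings Explained and Best Practices

Metric choice and calculation: a utilization computed as memory used / memory total is a better HPA metric than raw memory used, because it removes the differences between GPU models (for example A100-40GB vs V100-16GB) and lets one target value (such as averageValue: "75") apply everywhere.
Target value: averageValue: "75" means the HPA tries to keep the average GPU memory utilization across all Pod instances at 75%. Determine this value by load testing against the actual workload and model characteristics. Too high (say 90%) risks frequent OOM errors and scale-up that arrives too late; too low (say 50%) wastes resources.
Cooldown and scaling policies: the behavior field matters a great deal. GPU applications are slow to start and to load models, so:
- scaleUp.stabilizationWindowSeconds can be short (such as 60 seconds) so the HPA reacts quickly to rising load.
- scaleDown.stabilizationWindowSeconds should be longer (such as 300 seconds) so a brief dip in load does not reclaim Pods too early and cause service flapping.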
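The effect of the 75% target can be read off the scaling rule given in the Kubernetes HPA documentation. In the worked instance below, the 4 current replicas and the 90% observed average are made-up numbers chosen for illustration.

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 4 \times \frac{90}{75} \right\rceil
  = \lceil 4.8 \rceil
  = 5
\]
```

With the controller's default tolerance of 0.1, a ratio between 0.9 and 1.1 triggers no change, so an observed average between 67.5% and 82.5% leaves the replica count as it is.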

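Combining metrics: in production, combine GPU memory utilization with QPS (queries per second), request latency, or GPU compute utilization in a multi-metric HPA for more complete elasticity. The HPA scales to the largest replica count any single metric asks for, so sustained high QPS triggers a scale-up even when memory utilization is below its target.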

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: gpu_memory_utilization
    target:
      type: AverageValue
      averageValue: "75"
- type: Object
  object:
    metric:
      name: requests_per_second # assumed to be a QPS metric exposed by the application
    describedObject:
      apiVersion: v1
      kind: Service
      name: gpu-agent-service
    target:
      type: Value
      value: "100"
```

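Accurate Pod resource requests: the HPA adjusts the replica count, but whether new Pods can be scheduled depends on their resources.requests. Set limits.nvidia.com/gpu: 1 and requests.memory accurately on the GPU agent Pod so the Kubernetes scheduler can find a node with enough resources.
Monitoring and alerting: configure Prometheus alert rules on HPA state (such as kube_horizontalpodautoscaler_status_current_replicas), on events where scaling fails, and on abnormal GPU memory utilization, so that you can step in promptly.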
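As one way to express such alerts with Prometheus Operator, the sketch below defines two rules. The alert names, thresholds, durations, and the release label are assumptions for illustration. The HPA series come from kube-state-metrics, and the GPU series use the metric names from Step 1, which must be adjusted if the exporter exposes different ones.

```yaml
# gpu-agent-hpa-alerts.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-agent-hpa-alerts
  namespace: monitoring
  labels:
    release: prometheus-operator # assumption: must match the Prometheus instance's rule selector
spec:
  groups:
  - name: gpu-agent-hpa
    rules:
    - alert: GpuAgentHpaAtMaxReplicas # hypothetical alert name
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas{namespace="default",horizontalpodautoscaler="gpu-agent-hpa"}
          >= on (namespace, horizontalpodautoscaler)
        kube_horizontalpodautoscaler_spec_max_replicas{namespace="default",horizontalpodautoscaler="gpu-agent-hpa"}
      for: 10m # illustrative duration
      labels:
        severity: warning
      annotations:
        summary: gpu-agent-hpa has stayed at maxReplicas, so further scale-up is blocked
    - alert: GpuAgentMemoryNearLimit # hypothetical alert name
      expr: |
        100 * sum by (namespace, pod) (DCGM_FI_DEV_MEMORY_USED{namespace="default"})
          / sum by (namespace, pod) (DCGM_FI_DEV_MEMORY_TOTAL{namespace="default"}) > 90
      for: 5m # illustrative duration
      labels:
        severity: critical
      annotations:
        summary: GPU memory utilization of a gpu-agent Pod has stayed above 90%
```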

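With these steps in place, Kubernetes has an autoscaling mechanism driven by GPU memory utilization that responds quickly yet stays stable, which improves both the resource utilization of the GPU cluster and the reliability of the service.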
