05 - Production Architecture for LLM Inference: Hybrid Deployment and Kubernetes in Practice

This is the fifth installment in the "Deep Dive into LLM Inference Frameworks" series, covering production-grade architecture design, Kubernetes deployment, and intelligent routing strategies.

Before We Begin

Deploying an LLM inference service to production raises a number of challenges:

  • How do we handle GPU failures gracefully?
  • How do we stay stable under high concurrency?
  • How do we roll out model updates with zero downtime?

This article presents a proven production architecture: a hybrid deployment with vLLM as the primary serving path and llama.cpp as the fallback path.


1. Why a Hybrid Architecture

1.1 The Risks of a Single Architecture

Problems with a vLLM-only architecture

  • A GPU driver crash takes the whole service down
  • CUDA OOM causes batched requests to fail together
  • A failed model load leaves no fallback

Problems with a llama.cpp-only architecture

  • CPU inference is slow (roughly 29x slower than GPU)
  • Cannot meet high-concurrency demand
  • No distributed scale-out capability

1.2 The Value of a Hybrid Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     API Gateway                                  │
│                     (Kong/AWS ALB)                               │
└───────────────────────────┬─────────────────────────────────────┘
                            │
              ┌─────────────┴─────────────┐
              │                           │
    ┌─────────▼──────────┐    ┌──────────▼──────────┐
    │ vLLM primary path  │    │ llama.cpp fallback  │
    │ (high-throughput   │    │ (CPU safety net /   │
    │  GPU cluster)      │    │  edge nodes)        │
    ├────────────────────┤    ├─────────────────────┤
    │ • 99.9% SLA        │    │ • 99% SLA           │
    │ • 35+ tok/s        │    │ • 1-5 tok/s         │
    │ • high concurrency │    │ • near-unlimited    │
    │                    │    │   CPU scale-out     │
    └────────────────────┘    └─────────────────────┘

Fallback scenarios

  Failure type          vLLM behavior            Value of the llama.cpp fallback
  GPU driver crash      Service outage           CPU instances take over requests
  CUDA OOM              Batched requests fail    Large-context requests routed to CPU
  Network partition     Distributed TP fails     Single-node CPU keeps serving
  Model corruption      Model fails to load      Backup quantized model spins up
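
Most of these scenarios can also be absorbed at the request level with a try-the-primary-then-fall-back wrapper in the calling service. The sketch below is illustrative only: the backend URLs, timeouts, and model name are assumptions, and it presumes both backends expose an OpenAI-compatible /v1/completions endpoint.

python
import requests

# Illustrative in-cluster endpoints; adjust to your actual Service names/ports.
PRIMARY = "http://vllm-service:8000/v1/completions"        # vLLM primary path
FALLBACK = "http://llama-cpp-service:8080/v1/completions"  # llama.cpp fallback path

def complete(prompt: str, max_tokens: int = 256) -> dict:
    """Try the vLLM primary first; on timeout or error, retry once on the CPU fallback."""
    payload = {"model": "llama-3.1-70b", "prompt": prompt, "max_tokens": max_tokens}
    try:
        resp = requests.post(PRIMARY, json=payload, timeout=30)
        resp.raise_for_status()
        return resp.json()
    except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
        # Degraded path: much slower, but the request still completes.
        resp = requests.post(FALLBACK, json=payload, timeout=300)
        resp.raise_for_status()
        return resp.json()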

2. Intelligent Routing Gateway Design

2.1 Health Scoring Algorithm

python
class HealthScorer:
    """
    Backend health scoring.
    Score range: 0-1; higher means healthier.
    """
    def calculate_score(self, backend: Backend) -> float:
        metrics = backend.get_metrics()

        # Weighted penalty terms (weights sum to 1.0)
        gpu_util = metrics.gpu_utilization * 0.25
        mem_pressure = (1 - metrics.gpu_memory_free / metrics.gpu_memory_total) * 0.25
        queue_depth = min(metrics.pending_requests / 100, 1.0) * 0.20
        p99_latency = self._normalize_latency(metrics.p99_latency) * 0.20
        error_rate = metrics.error_rate * 0.10

        # Composite score: 1.0 minus the weighted penalties
        score = 1.0 - (gpu_util + mem_pressure + queue_depth + p99_latency + error_rate)

        # State transitions based on thresholds
        if score < 0.6:
            backend.mark_degraded()
        if score < 0.3:
            backend.mark_unhealthy()

        return max(score, 0.0)

2.2 Progressive Traffic Shifting

yaml
# Traffic-split rules
traffic_split:
  # Normal state: all traffic goes to vLLM
  healthy:
    vllm: 100%
    llama_cpp: 0%

  # vLLM under heavy load: shift simple requests to CPU
  high_load:
    vllm: 80%
    llama_cpp: 20%

  # vLLM partially failed: split traffic 50/50
  degraded:
    vllm: 50%
    llama_cpp: 50%

  # vLLM completely down: full fallback
  emergency:
    vllm: 0%
    llama_cpp: 100%
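
Tying the two pieces together: the router can derive the split state from the primary backend's health score and make a weighted per-request choice. A minimal sketch; the thresholds and the score input are illustrative and mirror the HealthScorer and traffic_split examples above rather than any particular library.

python
import random

# Weights mirroring the traffic_split configuration above.
TRAFFIC_SPLIT = {
    "healthy":   {"vllm": 1.00, "llama_cpp": 0.00},
    "high_load": {"vllm": 0.80, "llama_cpp": 0.20},
    "degraded":  {"vllm": 0.50, "llama_cpp": 0.50},
    "emergency": {"vllm": 0.00, "llama_cpp": 1.00},
}

def split_state(vllm_score: float) -> str:
    """Map the vLLM backend's health score (0-1) to a traffic-split state."""
    if vllm_score >= 0.8:
        return "healthy"
    if vllm_score >= 0.6:
        return "high_load"
    if vllm_score >= 0.3:
        return "degraded"
    return "emergency"

def pick_backend(vllm_score: float) -> str:
    """Weighted random backend selection for one request."""
    weights = TRAFFIC_SPLIT[split_state(vllm_score)]
    return random.choices(list(weights), weights=list(weights.values()))[0]

# Example: a score of 0.55 falls into the "degraded" state,
# so roughly half of the requests go to the llama.cpp fallback.
print(pick_backend(0.55))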

2.3 Kong Gateway Configuration Example

yaml
# kong.yml
_format_version: "3.0"

services:
  - name: llm-router
    url: http://internal-router:8080
    routes:
      - name: completions
        paths: ["/v1/completions"]
        methods: ["POST"]
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
      - name: prometheus

upstreams:
  # vLLM upstream with active health checks
  - name: vllm-upstream
    targets:
      - target: vllm-service:8000
    healthchecks:
      active:
        http_path: /health
        healthy:
          interval: 5
          successes: 2
        unhealthy:
          interval: 5
          timeouts: 3
          http_failures: 2

  # llama.cpp upstream
  - name: llama-cpp-upstream
    targets:
      - target: llama-cpp-service:8080
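
Once active health checks are configured, Kong's Admin API reports per-target health, which the routing layer can read instead of probing backends itself. A minimal sketch, assuming the Admin API is reachable at kong:8001 and the upstream names match the config above; the exact response shape may vary by Kong version.

python
import requests

KONG_ADMIN = "http://kong:8001"  # illustrative Admin API address

def upstream_healthy(name: str) -> bool:
    """Return True if at least one target of the Kong upstream reports HEALTHY."""
    resp = requests.get(f"{KONG_ADMIN}/upstreams/{name}/health", timeout=5)
    resp.raise_for_status()
    targets = resp.json().get("data", [])
    return any(t.get("health") == "HEALTHY" for t in targets)

if not upstream_healthy("vllm-upstream"):
    print("vLLM upstream unhealthy - shift traffic toward llama-cpp-upstream")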

3. Kubernetes Deployment

3.1 Deploying the vLLM Primary Path

yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      # Pin to GPU nodes
      nodeSelector:
        node-type: gpu-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.14.0
          args:
            - --model
            - /models/llama-3.1-70b-awq
            - --quantization
            - awq
            - --tensor-parallel-size
            - "4"
            - --gpu-memory-utilization
            - "0.85"
            - --max-model-len
            - "32768"
            - --enable-prefix-caching
            - --swap-space
            - "4"
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: "256Gi"
              cpu: "32"
            requests:
              nvidia.com/gpu: "4"  # must equal limits; GPUs cannot be overcommitted
              memory: "256Gi"
              cpu: "16"
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: model-volume
              mountPath: /models
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300  # allow time for model loading
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 5
          # Graceful shutdown
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: llama-70b-model-pvc
      # Pod anti-affinity: avoid placing multiple replicas on the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - vllm-llama-70b
                topologyKey: kubernetes.io/hostname
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: inference
  labels:
    app: vllm-llama-70b  # matched by the ServiceMonitor in section 4.1
spec:
  selector:
    app: vllm-llama-70b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
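
Once the Deployment reports ready, a quick smoke test against the OpenAI-compatible endpoint confirms the model is actually serving. A minimal sketch; the in-cluster URL is illustrative (or use kubectl port-forward svc/vllm-service 8000:8000 -n inference and point it at localhost), and the model name must match the --model argument above since no --served-model-name is set.

python
import requests

# Cluster-internal Service address; with port-forward use http://localhost:8000 instead.
BASE_URL = "http://vllm-service.inference.svc.cluster.local:8000"

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "/models/llama-3.1-70b-awq",  # matches the --model path in the Deployment
        "prompt": "Kubernetes is",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])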

3.2 Autoscaling with an HPA

yaml
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Scale up based on GPU utilization
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    # Scale up based on queue length
    - type: Pods
      pods:
        metric:
          name: vllm_queue_length
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    # Scale up fast
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    # Scale down slowly
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
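
The Pods-type metrics above are custom metrics: they only exist if a custom-metrics adapter (for example prometheus-adapter) is installed and configured to expose the vLLM series under the names vllm_gpu_utilization and vllm_queue_length; the names themselves are whatever the adapter rules define. Before relying on the HPA, it is worth checking the raw numbers directly in Prometheus. A minimal sketch using the prometheus-api-client package; the Prometheus URL is illustrative, and vllm:num_requests_waiting is the queue-depth gauge recent vLLM versions export.

python
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus.monitoring:9090")  # illustrative URL

# Average queue depth across vLLM pods in the inference namespace -
# the quantity the HPA's vllm_queue_length target is meant to track.
query = 'avg(vllm:num_requests_waiting{namespace="inference"})'
result = prom.custom_query(query=query)

if result:
    avg_queue = float(result[0]["value"][1])
    print(f"average queue depth per pod: {avg_queue:.1f} (HPA target: 10)")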

3.3 Deploying the llama.cpp Fallback Path

yaml
# llama-cpp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-fallback
  namespace: inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-cpp-fallback
  template:
    metadata:
      labels:
        app: llama-cpp-fallback
    spec:
      # High-memory CPU nodes
      nodeSelector:
        node-type: cpu-highmem
      containers:
        - name: llama-server
          image: ghcr.io/ggerganov/llama.cpp:server
          args:
            - --model
            - /models/llama-3.1-70b-q4_k_m.gguf
            - --ctx-size
            - "65536"
            - --threads
            - "32"
            - --batch-size
            - "512"
            - --port
            - "8080"
          resources:
            limits:
              memory: "192Gi"
              cpu: "32"
            requests:
              memory: "128Gi"
              cpu: "16"
          ports:
            - containerPort: 8080
              name: http
          volumeMounts:
            - name: model-volume
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: gguf-models-pvc

3.4 PodDisruptionBudget

yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: inference
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-llama-70b

4. Monitoring and Alerting

4.1 Prometheus Monitoring Configuration

yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - inference  # the target Service lives in the inference namespace
  selector:
    matchLabels:
      app: vllm-llama-70b
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

4.2 Key Monitoring Metrics

yaml
# Core performance metrics
performance_metrics:
  - vllm:time_to_first_token_seconds      # TTFT
  - vllm:time_per_output_token_seconds    # ITL/TPOT
  - vllm:throughput_tokens_per_second     # throughput
  - vllm:request_queue_length             # queue length

# Resource utilization
resource_metrics:
  - nvidia_gpu_utilization                # GPU utilization
  - nvidia_gpu_memory_used_bytes          # GPU memory usage
  - nvidia_gpu_temperature                # GPU temperature
  - container_cpu_usage_seconds_total     # CPU usage
  - container_memory_working_set_bytes    # memory usage

# Cache efficiency
cache_metrics:
  - vllm:gpu_prefix_cache_hit_rate        # GPU prefix-cache hit rate
  - vllm:cpu_prefix_cache_hit_rate        # CPU prefix-cache hit rate
  - vllm:kv_cache_usage_percent           # KV cache usage

# Business metrics
business_metrics:
  - vllm:request_success_total            # successful requests
  - vllm:request_failure_total            # failed requests
  - vllm:request_duration_seconds         # request latency
  - vllm:prompt_tokens_total              # input tokens
  - vllm:generation_tokens_total          # output tokens
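
A quick way to see which of these series your build of vLLM actually exposes is to scrape a pod's /metrics endpoint and filter by prefix (exact metric names vary between vLLM versions). A minimal sketch, assuming a port-forward of one vLLM pod to localhost:8000.

python
import requests

# e.g. kubectl -n inference port-forward deploy/vllm-llama-70b 8000:8000
metrics_text = requests.get("http://localhost:8000/metrics", timeout=10).text

# Print every distinct vLLM metric name exposed by this server.
names = sorted({
    line.split("{")[0].split(" ")[0]
    for line in metrics_text.splitlines()
    if line.startswith("vllm:")
})
print("\n".join(names))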

4.3 Alerting Rules

yaml
# alerts.yaml
groups:
  - name: vllm-alerts
    rules:
      # TTFT too high
      - alert: HighTTFT
        expr: histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM TTFT is high"
          description: "P99 TTFT is {{ $value }}s"

      # GPU OOM risk
      - alert: GPUOOMRisk
        expr: vllm:kv_cache_usage_percent > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU OOM risk detected"
          description: "KV Cache usage is {{ $value }}%"

      # Request queue buildup
      - alert: RequestQueueBuildup
        expr: vllm:request_queue_length > 50
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Request queue is building up"
          description: "Queue length is {{ $value }}"

      # Model loading failure
      - alert: ModelLoadingFailure
        expr: rate(vllm:request_failure_total{reason="model_load"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Model loading failed"

      # GPU temperature too high
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature is high"
          description: "Temperature is {{ $value }}°C"

5. Automated Failure Recovery

5.1 Failure Detection and Recovery Flow

┌──────────────┐
│ Monitoring   │
│ detects an   │
│ anomaly      │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Classify the │
│ anomaly type │
└──────┬───────┘
       │
    ┌──┴───────┬──────────┬──────────┬──────────┐
    ▼          ▼          ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ GPU    │ │ GPU    │ │ Model  │ │ Network│ │ Other  │
│ OOM    │ │ crash  │ │ corrupt│ │ split  │ │ errors │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
    │          │          │          │          │
    ▼          ▼          ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ HPA    │ │ Rebuild│ │ Switch │ │ Fall-  │ │ Manual │
│ scale- │ │ Pod    │ │ backup │ │ back   │ │ action │
│ up     │ │        │ │ model  │ │ path   │ │        │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
5.2 Automated Recovery Script

python
#!/usr/bin/env python3
"""
Automated failure-recovery loop.
"""
import subprocess
import time
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus:9090")

def check_gpu_oom():
    """Check for GPU OOM risk (KV cache nearly full)."""
    result = prom.custom_query(
        query='vllm:kv_cache_usage_percent > 0.95'
    )
    return len(result) > 0

def trigger_hpa_scale_up():
    """Force a scale-up of the vLLM deployment."""
    subprocess.run([
        'kubectl', 'scale', 'deployment', 'vllm-llama-70b',
        '--replicas=4', '-n', 'inference'
    ])

def switch_to_fallback():
    """Switch traffic to the fallback path."""
    # Update the Kong routing configuration (example: patch an Ingress annotation)
    subprocess.run([
        'kubectl', 'patch', 'ingress', 'llm-ingress', '-n', 'inference',
        '-p', '{"metadata":{"annotations":{"konghq.com/strip-path":"false"}}}'
    ])

def main():
    while True:
        if check_gpu_oom():
            print("GPU OOM risk detected, triggering recovery...")
            trigger_hpa_scale_up()
            time.sleep(60)

            # Check whether the scale-up relieved the pressure
            if check_gpu_oom():
                print("Scale up not effective, switching to fallback...")
                switch_to_fallback()

        time.sleep(30)

if __name__ == "__main__":
    main()
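
In practice a loop like this would run inside the cluster as its own small Deployment, using a ServiceAccount whose RBAC role allows scaling and patching the target resources; shelling out to kubectl works for a sketch, but the official kubernetes Python client is the more robust choice for anything long-lived.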

6. Model Update Strategies

6.1 Rolling Updates

yaml
# Rolling-update strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # start at most one extra Pod during the update
      maxUnavailable: 0  # never drop below the desired replica count

Note that with maxSurge: 1 the surge Pod still needs a full GPU allocation (4 x A100 in the Deployment above) before it can schedule, so keep at least one spare GPU node available during rolling updates.

6.2 Blue-Green Deployment

bash
# Deploy the new version (green)
kubectl apply -f vllm-deployment-v2.yaml

# Verify the new version is healthy
kubectl rollout status deployment/vllm-llama-70b-v2

# Switch traffic over via the Kong Admin API
curl -X PATCH http://kong:8001/services/llm-router \
  -d "host=vllm-service-v2"

# Keep the old version around for a while; delete it once the new one is confirmed stable
kubectl delete deployment/vllm-llama-70b-v1

Tags

Kubernetes · Production Deployment · GPU Scheduling · HPA · Hybrid Architecture · Intelligent Routing · Failure Recovery · Monitoring & Alerting
