# 05 - Production Architecture for LLM Inference: Hybrid Deployment and Kubernetes in Practice

This is the fifth installment of the *Deep Dive into LLM Inference Frameworks* series. It covers production-grade architecture design, Kubernetes deployment, and intelligent routing strategies.
## Preface

Taking an LLM inference service to production raises a number of hard questions:

- How do you handle GPU failures gracefully?
- How do you stay stable under high concurrency?
- How do you update models with zero downtime?

This article walks through a production architecture that has been validated in practice: a hybrid deployment with vLLM as the primary serving path and llama.cpp as the fallback path.
## 1. Why a Hybrid Architecture

### 1.1 The Risk of a Single Stack

Problems with a vLLM-only deployment:

- A GPU driver crash takes the service down
- CUDA OOM errors knock out whole batches of requests
- A failed model load leaves no fallback

Problems with a llama.cpp-only deployment:

- CPU inference is slow (roughly 29x slower than the GPU path)
- It cannot sustain high-concurrency traffic
- It has no distributed scale-out capability
### 1.2 What the Hybrid Architecture Buys You
```
┌──────────────────────────────────────────────────────────────────┐
│                           API Gateway                            │
│                         (Kong / AWS ALB)                         │
└───────────────────────────────┬──────────────────────────────────┘
                                │
               ┌────────────────┴─────────────────┐
               │                                  │
  ┌────────────▼─────────────┐      ┌─────────────▼──────────────┐
  │    vLLM primary path     │      │  llama.cpp fallback path   │
  │ (high throughput / GPUs) │      │ (CPU backstop / edge nodes)│
  ├──────────────────────────┤      ├────────────────────────────┤
  │ • 99.9% SLA              │      │ • 99% SLA                  │
  │ • 35+ tok/s              │      │ • 1-5 tok/s                │
  │ • high concurrency       │      │ • cheap horizontal scaling │
  └──────────────────────────┘      └────────────────────────────┘
```
Fallback scenarios:

| Failure type | vLLM behavior | Value of the llama.cpp backstop |
|---|---|---|
| GPU driver crash | Service outage | CPU instances take over the requests |
| CUDA OOM | Batch of requests fails | Large-context requests shift to CPU |
| Network partition | Distributed TP breaks down | A single CPU node keeps serving |
| Corrupted model | Load failure | A backup quantized model is started |
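To make the fallback concrete, here is a minimal per-request failover sketch in Python (an illustration, not the gateway's actual implementation): it sends an OpenAI-style completion request to the vLLM endpoint first and retries against the llama.cpp endpoint when the primary path fails. The service URLs, timeouts, and the assumption that both servers expose `/v1/completions` are placeholders for your environment.

```python
import requests

# Placeholder in-cluster endpoints; adjust to your Service names/ports.
VLLM_URL = "http://vllm-service:8000/v1/completions"            # primary (GPU)
LLAMA_CPP_URL = "http://llama-cpp-service:8080/v1/completions"  # fallback (CPU)


def complete_with_fallback(payload: dict, timeout: float = 30.0) -> dict:
    """Try the vLLM primary path; degrade to llama.cpp if it fails."""
    try:
        resp = requests.post(VLLM_URL, json=payload, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # GPU path unavailable (driver crash, OOM, network partition, ...):
        # retry on the CPU path with a more generous timeout.
        resp = requests.post(LLAMA_CPP_URL, json=payload, timeout=timeout * 4)
        resp.raise_for_status()
        return resp.json()
```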
## 2. Smart Routing Gateway Design

### 2.1 Health-Scoring Algorithm
```python
class HealthScorer:
    """
    Health scoring for a backend.
    Score range: 0-1; higher is healthier.
    """

    def calculate_score(self, backend: Backend) -> float:
        metrics = backend.get_metrics()

        # Weighted penalty terms (weights sum to 1.0); each term is normalized to 0-1
        gpu_util = metrics.gpu_utilization * 0.25
        mem_pressure = (1 - metrics.gpu_memory_free / metrics.gpu_memory_total) * 0.25
        queue_depth = min(metrics.pending_requests / 100, 1.0) * 0.20
        p99_latency = self._normalize_latency(metrics.p99_latency) * 0.20
        error_rate = metrics.error_rate * 0.10

        # Composite score: start from 1.0 and subtract the weighted penalties
        score = 1.0 - (gpu_util + mem_pressure + queue_depth + p99_latency + error_rate)

        # State transitions by threshold
        if score < 0.3:
            backend.mark_unhealthy()
        elif score < 0.6:
            backend.mark_degraded()

        return max(score, 0.0)
```
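For illustration, one way the router could consume these scores is to rank backends and route to the healthiest one. This is a minimal sketch that assumes `Backend` objects expose the `get_metrics()` and `mark_*()` methods used above (names taken from the snippet, not from any particular library):

```python
# Hypothetical usage: score all backends and pick the healthiest.
scorer = HealthScorer()


def pick_backend(backends: list) -> "Backend":
    scored = [(scorer.calculate_score(b), b) for b in backends]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]
```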
### 2.2 Progressive Traffic Shifting
```yaml
# Traffic-shifting rules
traffic_split:
  # Normal operation: all traffic goes to vLLM
  healthy:
    vllm: 100%
    llama_cpp: 0%
  # vLLM under heavy load: offload simple requests to CPU
  high_load:
    vllm: 80%
    llama_cpp: 20%
  # vLLM partially failed: split traffic evenly
  degraded:
    vllm: 50%
    llama_cpp: 50%
  # vLLM completely down: degrade everything
  emergency:
    vllm: 0%
    llama_cpp: 100%
```
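The YAML above only declares the target ratios; applying them can be as simple as a weighted random choice per request. A minimal sketch, assuming the configuration has already been loaded into a dict keyed by gateway state (the dict below mirrors the config by hand; it is not a real config loader):

```python
import random

# Mirrors the traffic_split configuration above (values are percentages).
TRAFFIC_SPLIT = {
    "healthy":   {"vllm": 100, "llama_cpp": 0},
    "high_load": {"vllm": 80,  "llama_cpp": 20},
    "degraded":  {"vllm": 50,  "llama_cpp": 50},
    "emergency": {"vllm": 0,   "llama_cpp": 100},
}


def choose_backend(state: str) -> str:
    """Pick a backend name according to the split for the current gateway state."""
    weights = TRAFFIC_SPLIT[state]
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]


# Example: under high load, roughly 20% of requests land on the CPU path.
print(choose_backend("high_load"))
```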
### 2.3 Example Kong Gateway Configuration
```yaml
# kong.yml (declarative config, simplified)
_format_version: "3.0"

services:
  - name: llm-router
    url: http://internal-router:8080
    routes:
      - name: completions
        paths: ["/v1/completions"]
        methods: ["POST"]
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
      - name: prometheus

upstreams:
  # vLLM upstream with active health checks
  - name: vllm-upstream
    healthchecks:
      active:
        http_path: /health
        healthy:
          interval: 5
          successes: 2
        unhealthy:
          interval: 5
          timeouts: 3
          http_failures: 2
    targets:
      - target: vllm-service:8000

  # llama.cpp upstream
  - name: llama-cpp-upstream
    targets:
      - target: llama-cpp-service:8080
```
## 3. Kubernetes Deployment

### 3.1 Deploying the vLLM Primary Path
```yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      # Pin to GPU nodes
      nodeSelector:
        node-type: gpu-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Pod anti-affinity: avoid placing multiple replicas on the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - vllm-llama-70b
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.14.0
          args:
            - --model
            - /models/llama-3.1-70b-awq
            - --quantization
            - awq
            - --tensor-parallel-size
            - "4"
            - --gpu-memory-utilization
            - "0.85"
            - --max-model-len
            - "32768"
            - --enable-prefix-caching
            - --swap-space
            - "4"
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: "256Gi"
              cpu: "32"
            requests:
              nvidia.com/gpu: "4"  # must equal limits; GPUs cannot be overcommitted
              memory: "256Gi"
              cpu: "16"
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: model-volume
              mountPath: /models
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300  # allow time for model loading
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 5
          # Graceful shutdown: let in-flight requests drain
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: llama-70b-model-pvc
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: inference
  labels:
    app: vllm-llama-70b   # matched by the ServiceMonitor in section 4.1
spec:
  selector:
    app: vllm-llama-70b
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
```
### 3.2 Autoscaling with HPA

The GPU-utilization and queue-length metrics used below are custom per-Pod metrics, so the cluster needs a custom-metrics pipeline (for example prometheus-adapter or KEDA) to make them visible to the HPA.
```yaml
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Scale out on GPU utilization
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    # Scale out on queue length
    - type: Pods
      pods:
        metric:
          name: vllm_queue_length
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    # Scale up quickly
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    # Scale down slowly
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```
### 3.3 Deploying the llama.cpp Fallback Path
```yaml
# llama-cpp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-fallback
  namespace: inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-cpp-fallback
  template:
    metadata:
      labels:
        app: llama-cpp-fallback
    spec:
      # High-memory CPU nodes
      nodeSelector:
        node-type: cpu-highmem
      containers:
        - name: llama-server
          image: ghcr.io/ggerganov/llama.cpp:server
          args:
            - --model
            - /models/llama-3.1-70b-q4_k_m.gguf
            - --ctx-size
            - "65536"
            - --threads
            - "32"
            - --batch-size
            - "512"
            - --port
            - "8080"
          resources:
            limits:
              memory: "192Gi"
              cpu: "32"
            requests:
              memory: "128Gi"
              cpu: "16"
          ports:
            - containerPort: 8080
              name: http
          volumeMounts:
            - name: model-volume
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: gguf-models-pvc
```
### 3.4 PodDisruptionBudget
```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: inference
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
```
## 4. Monitoring and Alerting

### 4.1 Prometheus Monitoring Configuration
```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: monitoring
spec:
  # Select Services labeled app=vllm-llama-70b in the inference namespace
  namespaceSelector:
    matchNames:
      - inference
  selector:
    matchLabels:
      app: vllm-llama-70b
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
### 4.2 Key Metrics
```yaml
# Core performance metrics
performance_metrics:
  - vllm:time_to_first_token_seconds      # TTFT
  - vllm:time_per_output_token_seconds    # ITL / TPOT
  - vllm:throughput_tokens_per_second     # throughput
  - vllm:request_queue_length             # queue length

# Resource utilization
resource_metrics:
  - nvidia_gpu_utilization                # GPU utilization
  - nvidia_gpu_memory_used_bytes          # GPU memory usage
  - nvidia_gpu_temperature                # GPU temperature
  - container_cpu_usage_seconds_total     # CPU usage
  - container_memory_working_set_bytes    # memory usage

# Cache efficiency
cache_metrics:
  - vllm:gpu_prefix_cache_hit_rate        # GPU prefix-cache hit rate
  - vllm:cpu_prefix_cache_hit_rate        # CPU prefix-cache hit rate
  - vllm:kv_cache_usage_percent           # KV cache usage

# Business metrics
business_metrics:
  - vllm:request_success_total            # successful requests
  - vllm:request_failure_total            # failed requests
  - vllm:request_duration_seconds         # request latency
  - vllm:prompt_tokens_total              # input tokens
  - vllm:generation_tokens_total          # output tokens
```
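As a quick sanity check, these metrics can also be queried programmatically with the same prometheus-api-client library used in section 5.2. A sketch (the metric names are the ones listed above and may differ between vLLM versions; the `_bucket` suffix assumes TTFT is exported as a Prometheus histogram):

```python
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus:9090")

# P99 time-to-first-token over the last 5 minutes.
p99_ttft = prom.custom_query(
    query='histogram_quantile(0.99, '
          'rate(vllm:time_to_first_token_seconds_bucket[5m]))'
)

# Current request queue length per backend.
queue_len = prom.custom_query(query='vllm:request_queue_length')

print(p99_ttft, queue_len)
```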
### 4.3 Alerting Rules
```yaml
# alerts.yaml
groups:
  - name: vllm-alerts
    rules:
      # TTFT too high
      - alert: HighTTFT
        expr: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM TTFT is high"
          description: "P99 TTFT is {{ $value }}s"

      # GPU OOM risk
      - alert: GPUOOMRisk
        expr: vllm:kv_cache_usage_percent > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU OOM risk detected"
          description: "KV cache usage is {{ $value }} (fraction of capacity)"

      # Request queue building up
      - alert: RequestQueueBuildup
        expr: vllm:request_queue_length > 50
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Request queue is building up"
          description: "Queue length is {{ $value }}"

      # Model loading failure
      - alert: ModelLoadingFailure
        expr: rate(vllm:request_failure_total{reason="model_load"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Model loading failed"

      # GPU temperature too high
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature is high"
          description: "Temperature is {{ $value }}°C"
```
## 5. Automated Failure Recovery

### 5.1 Failure Detection and Recovery Flow
```
                            ┌─────────────────────┐
                            │ Monitoring detects  │
                            │ an anomaly          │
                            └──────────┬──────────┘
                                       │
                                       ▼
                            ┌─────────────────────┐
                            │ Classify the fault  │
                            └──────────┬──────────┘
                                       │
       ┌───────────────┬───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼               ▼
┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│  GPU OOM   │  │ GPU crash  │  │   Model    │  │  Network   │  │   Other    │
│            │  │            │  │ corrupted  │  │ partition  │  │  failure   │
└──────┬─────┘  └──────┬─────┘  └──────┬─────┘  └──────┬─────┘  └──────┬─────┘
       ▼               ▼               ▼               ▼               ▼
┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│    HPA     │  │  Recreate  │  │ Switch to  │  │ Degrade to │  │   Manual   │
│ scale-out  │  │  the Pod   │  │backup model│  │  CPU path  │  │intervention│
└────────────┘  └────────────┘  └────────────┘  └────────────┘  └────────────┘
```
### 5.2 Auto-Recovery Script
```python
#!/usr/bin/env python3
"""
Automated failure-recovery script.

Polls Prometheus for OOM risk, scales the vLLM deployment first,
and shifts traffic to the fallback path if scaling does not help.
"""
import subprocess
import time

from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus:9090")


def check_gpu_oom() -> bool:
    """Return True if any backend is at risk of GPU OOM."""
    result = prom.custom_query(
        query='vllm:kv_cache_usage_percent > 0.95'
    )
    return len(result) > 0


def trigger_hpa_scale_up():
    """Force a scale-up of the vLLM deployment."""
    subprocess.run([
        'kubectl', 'scale', 'deployment', 'vllm-llama-70b',
        '--replicas=4', '-n', 'inference'
    ], check=True)


def switch_to_fallback():
    """Shift traffic to the llama.cpp fallback path.

    Illustrative only: this patches a Kong ingress annotation; in practice,
    call whatever mechanism your gateway uses to change the traffic split.
    """
    subprocess.run([
        'kubectl', 'patch', 'ingress', 'llm-ingress',
        '-p', '{"metadata":{"annotations":{"konghq.com/strip-path":"false"}}}'
    ], check=True)


def main():
    while True:
        if check_gpu_oom():
            print("GPU OOM risk detected, triggering recovery...")
            trigger_hpa_scale_up()
            time.sleep(60)
            # Check whether scaling out relieved the pressure
            if check_gpu_oom():
                print("Scale-up not effective, switching to fallback...")
                switch_to_fallback()
        time.sleep(30)


if __name__ == "__main__":
    main()
```
## 6. Model Update Strategies

### 6.1 Rolling Updates
```yaml
# Rolling-update strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra Pod during rollout (needs spare GPU capacity)
      maxUnavailable: 0  # never drop below the desired replica count
```
### 6.2 Blue-Green Deployment
```bash
# Deploy the new version (green)
kubectl apply -f vllm-deployment-v2.yaml

# Wait until the new version is healthy
kubectl rollout status deployment/vllm-llama-70b-v2 -n inference

# Switch traffic at the gateway
curl -X PATCH http://kong:8001/services/llm-router \
  -d "host=vllm-service-v2"

# Keep the old version around for a while; delete it once the new one is stable
kubectl delete deployment/vllm-llama-70b-v1 -n inference
```
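Before flipping traffic at the Kong step, it is worth hitting the green deployment with a real completion request rather than relying on the readiness probe alone. A minimal smoke-test sketch; the v2 Service URL and the model path are placeholders for your environment:

```python
import requests

GREEN_URL = "http://vllm-service-v2:8000/v1/completions"  # placeholder v2 Service


def smoke_test() -> bool:
    """Send one short completion request and check that a choice comes back."""
    payload = {
        "model": "/models/llama-3.1-70b-awq",
        "prompt": "Say 'ok' and nothing else.",
        "max_tokens": 5,
    }
    try:
        resp = requests.post(GREEN_URL, json=payload, timeout=60)
        resp.raise_for_status()
        return bool(resp.json().get("choices"))
    except requests.RequestException:
        return False


if __name__ == "__main__":
    raise SystemExit(0 if smoke_test() else 1)
```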
Tags: Kubernetes · Production deployment · GPU scheduling · HPA · Hybrid architecture · Smart routing · Failure recovery · Monitoring & alerting