# 05 - Production Architecture for LLM Inference: Hybrid Deployment and Kubernetes in Practice

This is the fifth installment of the *Deep Dive into LLM Inference Frameworks* series. It covers production-grade architecture design, Kubernetes deployment, and intelligent routing strategies.
## Preface

Taking an LLM inference service to production raises a number of hard questions:

- How do you handle GPU failures gracefully?
- How do you stay stable under high concurrency?
- How do you update models with zero downtime?

This article walks through a production architecture that has been validated in practice: a hybrid deployment with vLLM as the primary serving path and llama.cpp as the fallback path.
## 1. Why a Hybrid Architecture

### 1.1 The Risk of a Single Stack

Problems with a vLLM-only deployment:

- A GPU driver crash takes the service down
- CUDA OOM errors knock out whole batches of requests
- A failed model load leaves no fallback

Problems with a llama.cpp-only deployment:

- CPU inference is slow (roughly 29x slower than the GPU path)
- It cannot sustain high-concurrency traffic
- It has no distributed scale-out capability
### 1.2 What the Hybrid Architecture Buys You
```
┌──────────────────────────────────────────────────────────────────┐
│                           API Gateway                            │
│                         (Kong / AWS ALB)                         │
└───────────────────────────────┬──────────────────────────────────┘
                                │
               ┌────────────────┴─────────────────┐
               │                                  │
  ┌────────────▼─────────────┐      ┌─────────────▼──────────────┐
  │    vLLM primary path     │      │  llama.cpp fallback path   │
  │ (high throughput / GPUs) │      │ (CPU backstop / edge nodes)│
  ├──────────────────────────┤      ├────────────────────────────┤
  │ • 99.9% SLA              │      │ • 99% SLA                  │
  │ • 35+ tok/s              │      │ • 1-5 tok/s                │
  │ • high concurrency       │      │ • cheap horizontal scaling │
  └──────────────────────────┘      └────────────────────────────┘
```
Fallback scenarios:

| Failure type | vLLM behavior | Value of the llama.cpp backstop |
|---|---|---|
| GPU driver crash | Service outage | CPU instances take over the requests |
| CUDA OOM | Batch of requests fails | Large-context requests shift to CPU |
| Network partition | Distributed TP breaks down | A single CPU node keeps serving |
| Corrupted model | Load failure | A backup quantized model is started |
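To make the fallback concrete, here is a minimal per-request failover sketch in Python (an illustration, not the gateway's actual implementation): it sends an OpenAI-style completion request to the vLLM endpoint first and retries against the llama.cpp endpoint when the primary path fails. The service URLs, timeouts, and the assumption that both servers expose `/v1/completions` are placeholders for your environment.

```python
import requests

# Placeholder in-cluster endpoints; adjust to your Service names/ports.
VLLM_URL = "http://vllm-service:8000/v1/completions"            # primary (GPU)
LLAMA_CPP_URL = "http://llama-cpp-service:8080/v1/completions"  # fallback (CPU)


def complete_with_fallback(payload: dict, timeout: float = 30.0) -> dict:
    """Try the vLLM primary path; degrade to llama.cpp if it fails."""
    try:
        resp = requests.post(VLLM_URL, json=payload, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # GPU path unavailable (driver crash, OOM, network partition, ...):
        # retry on the CPU path with a more generous timeout.
        resp = requests.post(LLAMA_CPP_URL, json=payload, timeout=timeout * 4)
        resp.raise_for_status()
        return resp.json()
```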
## 2. Smart Routing Gateway Design

### 2.1 Health-Scoring Algorithm
```python
class HealthScorer:
    """
    Health scoring for a backend.
    Score range: 0-1; higher is healthier.
    """

    def calculate_score(self, backend: Backend) -> float:
        metrics = backend.get_metrics()

        # Weighted penalty terms (weights sum to 1.0); each term is normalized to 0-1
        gpu_util = metrics.gpu_utilization * 0.25
        mem_pressure = (1 - metrics.gpu_memory_free / metrics.gpu_memory_total) * 0.25
        queue_depth = min(metrics.pending_requests / 100, 1.0) * 0.20
        p99_latency = self._normalize_latency(metrics.p99_latency) * 0.20
        error_rate = metrics.error_rate * 0.10

        # Composite score: start from 1.0 and subtract the weighted penalties
        score = 1.0 - (gpu_util + mem_pressure + queue_depth + p99_latency + error_rate)

        # State transitions by threshold
        if score < 0.3:
            backend.mark_unhealthy()
        elif score < 0.6:
            backend.mark_degraded()

        return max(score, 0.0)
```
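For illustration, one way the router could consume these scores is to rank backends and route to the healthiest one. This is a minimal sketch that assumes `Backend` objects expose the `get_metrics()` and `mark_*()` methods used above (names taken from the snippet, not from any particular library):

```python
# Hypothetical usage: score all backends and pick the healthiest.
scorer = HealthScorer()


def pick_backend(backends: list) -> "Backend":
    scored = [(scorer.calculate_score(b), b) for b in backends]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]
```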
### 2.2 Progressive Traffic Shifting
```yaml
# Traffic-shifting rules
traffic_split:
  # Normal operation: all traffic goes to vLLM
  healthy:
    vllm: 100%
    llama_cpp: 0%
  # vLLM under heavy load: offload simple requests to CPU
  high_load:
    vllm: 80%
    llama_cpp: 20%
  # vLLM partially failed: split traffic evenly
  degraded:
    vllm: 50%
    llama_cpp: 50%
  # vLLM completely down: degrade everything
  emergency:
    vllm: 0%
    llama_cpp: 100%
```
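The YAML above only declares the target ratios; applying them can be as simple as a weighted random choice per request. A minimal sketch, assuming the configuration has already been loaded into a dict keyed by gateway state (the dict below mirrors the config by hand; it is not a real config loader):

```python
import random

# Mirrors the traffic_split configuration above (values are percentages).
TRAFFIC_SPLIT = {
    "healthy":   {"vllm": 100, "llama_cpp": 0},
    "high_load": {"vllm": 80,  "llama_cpp": 20},
    "degraded":  {"vllm": 50,  "llama_cpp": 50},
    "emergency": {"vllm": 0,   "llama_cpp": 100},
}


def choose_backend(state: str) -> str:
    """Pick a backend name according to the split for the current gateway state."""
    weights = TRAFFIC_SPLIT[state]
    return random.choices(list(weights), weights=list(weights.values()), k=1)[0]


# Example: under high load, roughly 20% of requests land on the CPU path.
print(choose_backend("high_load"))
```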
### 2.3 Example Kong Gateway Configuration
```yaml
# kong.yml (declarative config, simplified)
_format_version: "3.0"

services:
  - name: llm-router
    url: http://internal-router:8080
    routes:
      - name: completions
        paths: ["/v1/completions"]
        methods: ["POST"]
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
      - name: prometheus

upstreams:
  # vLLM upstream with active health checks
  - name: vllm-upstream
    healthchecks:
      active:
        http_path: /health
        healthy:
          interval: 5
          successes: 2
        unhealthy:
          interval: 5
          timeouts: 3
          http_failures: 2
    targets:
      - target: vllm-service:8000

  # llama.cpp upstream
  - name: llama-cpp-upstream
    targets:
      - target: llama-cpp-service:8080
```
## 3. Kubernetes Deployment

### 3.1 Deploying the vLLM Primary Path
```yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      # Pin to GPU nodes
      nodeSelector:
        node-type: gpu-a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Pod anti-affinity: avoid placing multiple replicas on the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - vllm-llama-70b
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.14.0
          args:
            - --model
            - /models/llama-3.1-70b-awq
            - --quantization
            - awq
            - --tensor-parallel-size
            - "4"
            - --gpu-memory-utilization
            - "0.85"
            - --max-model-len
            - "32768"
            - --enable-prefix-caching
            - --swap-space
            - "4"
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: "256Gi"
              cpu: "32"
            requests:
              nvidia.com/gpu: "4"  # must equal limits; GPUs cannot be overcommitted
              memory: "256Gi"
              cpu: "16"
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: model-volume
              mountPath: /models
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300  # allow time for model loading
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 5
          # Graceful shutdown: let in-flight requests drain
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 30"]
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: llama-70b-model-pvc
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: inference
  labels:
    app: vllm-llama-70b   # matched by the ServiceMonitor in section 4.1
spec:
  selector:
    app: vllm-llama-70b
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  type: ClusterIP
```
### 3.2 Autoscaling with HPA

The GPU-utilization and queue-length metrics used below are custom per-Pod metrics, so the cluster needs a custom-metrics pipeline (for example prometheus-adapter or KEDA) to make them visible to the HPA.
```yaml
# vllm-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Scale out on GPU utilization
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    # Scale out on queue length
    - type: Pods
      pods:
        metric:
          name: vllm_queue_length
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    # Scale up quickly
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    # Scale down slowly
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```
### 3.3 Deploying the llama.cpp Fallback Path
```yaml
# llama-cpp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp-fallback
  namespace: inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-cpp-fallback
  template:
    metadata:
      labels:
        app: llama-cpp-fallback
    spec:
      # High-memory CPU nodes
      nodeSelector:
        node-type: cpu-highmem
      containers:
        - name: llama-server
          image: ghcr.io/ggerganov/llama.cpp:server
          args:
            - --model
            - /models/llama-3.1-70b-q4_k_m.gguf
            - --ctx-size
            - "65536"
            - --threads
            - "32"
            - --batch-size
            - "512"
            - --port
            - "8080"
          resources:
            limits:
              memory: "192Gi"
              cpu: "32"
            requests:
              memory: "128Gi"
              cpu: "16"
          ports:
            - containerPort: 8080
              name: http
          volumeMounts:
            - name: model-volume
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: gguf-models-pvc
```
### 3.4 PodDisruptionBudget
```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: inference
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
```
## 4. Monitoring and Alerting

### 4.1 Prometheus Monitoring Configuration
```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: monitoring
spec:
  # Select Services labeled app=vllm-llama-70b in the inference namespace
  namespaceSelector:
    matchNames:
      - inference
  selector:
    matchLabels:
      app: vllm-llama-70b
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
### 4.2 Key Metrics
```yaml
# Core performance metrics
performance_metrics:
  - vllm:time_to_first_token_seconds      # TTFT
  - vllm:time_per_output_token_seconds    # ITL / TPOT
  - vllm:throughput_tokens_per_second     # throughput
  - vllm:request_queue_length             # queue length

# Resource utilization
resource_metrics:
  - nvidia_gpu_utilization                # GPU utilization
  - nvidia_gpu_memory_used_bytes          # GPU memory usage
  - nvidia_gpu_temperature                # GPU temperature
  - container_cpu_usage_seconds_total     # CPU usage
  - container_memory_working_set_bytes    # memory usage

# Cache efficiency
cache_metrics:
  - vllm:gpu_prefix_cache_hit_rate        # GPU prefix-cache hit rate
  - vllm:cpu_prefix_cache_hit_rate        # CPU prefix-cache hit rate
  - vllm:kv_cache_usage_percent           # KV cache usage

# Business metrics
business_metrics:
  - vllm:request_success_total            # successful requests
  - vllm:request_failure_total            # failed requests
  - vllm:request_duration_seconds         # request latency
  - vllm:prompt_tokens_total              # input tokens
  - vllm:generation_tokens_total          # output tokens
```
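As a quick sanity check, these metrics can also be queried programmatically with the same prometheus-api-client library used in section 5.2. A sketch (the metric names are the ones listed above and may differ between vLLM versions; the `_bucket` suffix assumes TTFT is exported as a Prometheus histogram):

```python
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus:9090")

# P99 time-to-first-token over the last 5 minutes.
p99_ttft = prom.custom_query(
    query='histogram_quantile(0.99, '
          'rate(vllm:time_to_first_token_seconds_bucket[5m]))'
)

# Current request queue length per backend.
queue_len = prom.custom_query(query='vllm:request_queue_length')

print(p99_ttft, queue_len)
```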
### 4.3 Alerting Rules
```yaml
# alerts.yaml
groups:
  - name: vllm-alerts
    rules:
      # TTFT too high
      - alert: HighTTFT
        expr: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM TTFT is high"
          description: "P99 TTFT is {{ $value }}s"

      # GPU OOM risk
      - alert: GPUOOMRisk
        expr: vllm:kv_cache_usage_percent > 0.95
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU OOM risk detected"
          description: "KV cache usage is {{ $value }} (fraction of capacity)"

      # Request queue building up
      - alert: RequestQueueBuildup
        expr: vllm:request_queue_length > 50
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Request queue is building up"
          description: "Queue length is {{ $value }}"

      # Model loading failure
      - alert: ModelLoadingFailure
        expr: rate(vllm:request_failure_total{reason="model_load"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Model loading failed"

      # GPU temperature too high
      - alert: GPUTemperatureHigh
        expr: nvidia_gpu_temperature > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU temperature is high"
          description: "Temperature is {{ $value }}°C"
```
## 5. Automated Failure Recovery

### 5.1 Failure Detection and Recovery Flow
```
                            ┌─────────────────────┐
                            │ Monitoring detects  │
                            │ an anomaly          │
                            └──────────┬──────────┘
                                       │
                                       ▼
                            ┌─────────────────────┐
                            │ Classify the fault  │
                            └──────────┬──────────┘
                                       │
       ┌───────────────┬───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼               ▼
┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│  GPU OOM   │  │ GPU crash  │  │   Model    │  │  Network   │  │   Other    │
│            │  │            │  │ corrupted  │  │ partition  │  │  failure   │
└──────┬─────┘  └──────┬─────┘  └──────┬─────┘  └──────┬─────┘  └──────┬─────┘
       ▼               ▼               ▼               ▼               ▼
┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│    HPA     │  │  Recreate  │  │ Switch to  │  │ Degrade to │  │   Manual   │
│ scale-out  │  │  the Pod   │  │backup model│  │  CPU path  │  │intervention│
└────────────┘  └────────────┘  └────────────┘  └────────────┘  └────────────┘
```
### 5.2 Auto-Recovery Script
```python
#!/usr/bin/env python3
"""
Automated failure-recovery script.

Polls Prometheus for OOM risk, scales the vLLM deployment first,
and shifts traffic to the fallback path if scaling does not help.
"""
import subprocess
import time

from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus:9090")


def check_gpu_oom() -> bool:
    """Return True if any backend is at risk of GPU OOM."""
    result = prom.custom_query(
        query='vllm:kv_cache_usage_percent > 0.95'
    )
    return len(result) > 0


def trigger_hpa_scale_up():
    """Force a scale-up of the vLLM deployment."""
    subprocess.run([
        'kubectl', 'scale', 'deployment', 'vllm-llama-70b',
        '--replicas=4', '-n', 'inference'
    ], check=True)


def switch_to_fallback():
    """Shift traffic to the llama.cpp fallback path.

    Illustrative only: this patches a Kong ingress annotation; in practice,
    call whatever mechanism your gateway uses to change the traffic split.
    """
    subprocess.run([
        'kubectl', 'patch', 'ingress', 'llm-ingress',
        '-p', '{"metadata":{"annotations":{"konghq.com/strip-path":"false"}}}'
    ], check=True)


def main():
    while True:
        if check_gpu_oom():
            print("GPU OOM risk detected, triggering recovery...")
            trigger_hpa_scale_up()
            time.sleep(60)
            # Check whether scaling out relieved the pressure
            if check_gpu_oom():
                print("Scale-up not effective, switching to fallback...")
                switch_to_fallback()
        time.sleep(30)


if __name__ == "__main__":
    main()
```
## 6. Model Update Strategies

### 6.1 Rolling Updates
```yaml
# Rolling-update strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra Pod during rollout (needs spare GPU capacity)
      maxUnavailable: 0  # never drop below the desired replica count
```
### 6.2 Blue-Green Deployment
```bash
# Deploy the new version (green)
kubectl apply -f vllm-deployment-v2.yaml

# Wait until the new version is healthy
kubectl rollout status deployment/vllm-llama-70b-v2 -n inference

# Switch traffic at the gateway
curl -X PATCH http://kong:8001/services/llm-router \
  -d "host=vllm-service-v2"

# Keep the old version around for a while; delete it once the new one is stable
kubectl delete deployment/vllm-llama-70b-v1 -n inference
```
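Before flipping traffic at the Kong step, it is worth hitting the green deployment with a real completion request rather than relying on the readiness probe alone. A minimal smoke-test sketch; the v2 Service URL and the model path are placeholders for your environment:

```python
import requests

GREEN_URL = "http://vllm-service-v2:8000/v1/completions"  # placeholder v2 Service


def smoke_test() -> bool:
    """Send one short completion request and check that a choice comes back."""
    payload = {
        "model": "/models/llama-3.1-70b-awq",
        "prompt": "Say 'ok' and nothing else.",
        "max_tokens": 5,
    }
    try:
        resp = requests.post(GREEN_URL, json=payload, timeout=60)
        resp.raise_for_status()
        return bool(resp.json().get("choices"))
    except requests.RequestException:
        return False


if __name__ == "__main__":
    raise SystemExit(0 if smoke_test() else 1)
```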
Tags: Kubernetes · Production deployment · GPU scheduling · HPA · Hybrid architecture · Smart routing · Failure recovery · Monitoring & alerting