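This post summarizes the traffic-level failover mechanisms in Envoy and Istio, for later reference.
1 Envoy HTTP failover: core mechanisms
As the data-plane proxy, Envoy implements HTTP failover through several cooperating layers: retries, outlier detection, and priority-aware load balancing.
1.1 Retry policy: the first line of defense
Retries are Envoy's primary mechanism for handling transient failures. The following configuration shows a complete HTTP retry policy: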
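route_config:
  name: local_route
  virtual_hosts:
  - name: backend_service
    domains: ["*"]
    routes:
    - match: { prefix: "/" }
      route:
        cluster: backend_cluster
        retry_policy:
          retry_on: "connect-failure,refused-stream,5xx,gateway-error,retriable-4xx"
          num_retries: 3
          per_try_timeout: 2s
          retry_back_off:
            base_interval: 0.1s
            max_interval: 10s
          # Avoid retrying on a host that already failed this request.
          retry_host_predicate:
          - name: envoy.retry_host_predicates.previous_hosts
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.retry.host.previous_hosts.v3.PreviousHostsPredicate
          # Prefer priority levels that have not been tried yet.
          retry_priority:
            name: envoy.retry_priorities.previous_priorities
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.retry.priority.previous_priorities.v3.PreviousPrioritiesConfig
              update_frequency: 2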
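Key parameters:
- retry_on: the conditions that trigger a retry
  - "connect-failure": the connection to the upstream failed (TCP level)
  - "5xx": the upstream returned a server error (HTTP 500-599) or did not respond at all
  - "retriable-4xx": a retriable client error (currently only HTTP 409 Conflict)
- num_retries: the maximum number of retries; bounding it keeps retry storms from cascading into an avalanche
- per_try_timeout: the timeout for each individual attempt, so that a slow attempt fails fast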
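To make the interaction between num_retries and retry_back_off more concrete, here is a minimal Python sketch of a fully jittered exponential backoff, the scheme Envoy's documentation describes for retry delays. The function names are hypothetical and the formula is a simplified model under that assumption, not Envoy's actual code.

```python
import random
from typing import Optional


def backoff_upper_bound(attempt: int, base: float, max_interval: float) -> float:
    """Upper bound (seconds) of the delay window before retry `attempt` (1-based)."""
    # The window grows as (2^attempt - 1) * base and is capped at max_interval.
    return min((2 ** attempt - 1) * base, max_interval)


def backoff_delay(attempt: int, base: float = 0.1, max_interval: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Draw a jittered delay between 0 and the upper bound for this retry."""
    rng = rng or random.Random()
    return rng.uniform(0.0, backoff_upper_bound(attempt, base, max_interval))


if __name__ == "__main__":
    # Mirrors num_retries: 3, base_interval: 0.1s, max_interval: 10s.
    for attempt in range(1, 4):
        bound = backoff_upper_bound(attempt, 0.1, 10.0)
        print(f"retry {attempt}: delay drawn from a window of {bound:.1f}s")
```

With these settings the delay windows stay far below max_interval, so the cap only matters for larger retry counts or base intervals.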
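1.2 Outlier detection: passive health checking
Outlier detection is Envoy's passive health-checking mechanism: it watches real traffic for consecutive errors and automatically ejects failing endpoints.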
clusters:
- name: backend_cluster
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  outlier_detection:
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 50
    enforcing_consecutive_5xx: 100
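How it works:
- Error counting: Envoy tracks consecutive 5xx responses for each endpoint
- Threshold check: when the count reaches "consecutive_5xx", the endpoint is ejected
- Temporary ejection: the endpoint is removed from the load-balancing pool for "base_ejection_time" multiplied by the number of times it has been ejected, and at most "max_ejection_percent" of the cluster is ejected at once
- Automatic recovery: once the ejection time elapses, the endpoint is returned to the pool and receives traffic again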
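The four steps above can be sketched as a small state machine. This is a simplified Python model under stated assumptions, not Envoy's implementation: the class and method names are hypothetical, time is passed in explicitly, and details such as the periodic "interval" sweep and "enforcing_consecutive_5xx" are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    name: str
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0  # timestamp; 0.0 means never ejected


@dataclass
class OutlierDetector:
    threshold: int = 5                 # consecutive_5xx
    base_ejection_time: float = 30.0   # base_ejection_time, in seconds
    max_ejection_percent: int = 50     # max_ejection_percent
    hosts: Dict[str, Host] = field(default_factory=dict)

    def add_host(self, name: str) -> None:
        self.hosts[name] = Host(name)

    def is_ejected(self, name: str, now: float) -> bool:
        return now < self.hosts[name].ejected_until

    def record(self, name: str, status: int, now: float) -> None:
        """Record one response and eject the host if its streak hits the threshold."""
        host = self.hosts[name]
        if status < 500:
            host.consecutive_5xx = 0  # any non-5xx response resets the streak
            return
        host.consecutive_5xx += 1
        if host.consecutive_5xx >= self.threshold and not self.is_ejected(name, now):
            self._try_eject(host, now)

    def _try_eject(self, host: Host, now: float) -> None:
        ejected = sum(1 for h in self.hosts.values() if now < h.ejected_until)
        # Respect max_ejection_percent so the pool is never drained completely.
        if (ejected + 1) * 100 > self.max_ejection_percent * len(self.hosts):
            return
        host.times_ejected += 1
        host.consecutive_5xx = 0
        # The ejection time grows with the number of past ejections.
        host.ejected_until = now + self.base_ejection_time * host.times_ejected

    def healthy_hosts(self, now: float) -> List[str]:
        return [n for n in self.hosts if not self.is_ejected(n, now)]
```

In this model a host that keeps failing is ejected for 30s, then 60s, and so on, while the 50% cap keeps at least half of the pool in rotation.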
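1.3 Load balancing and priority routing
Envoy supports several load-balancing algorithms and combines them with endpoint priorities. Priorities are evaluated within a single cluster: traffic goes to priority 0 endpoints while they are healthy and spills over to priority 1 as they become unhealthy.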
clusters:
- name: primary_cluster
  lb_policy: LEAST_REQUEST
  load_assignment:
    cluster_name: primary_cluster
    endpoints:
    # Priority 0: preferred endpoints.
    - priority: 0
      lb_endpoints:
      - endpoint:
          address: { socket_address: { address: 10.0.1.1, port_value: 80 } }
      - endpoint:
          address: { socket_address: { address: 10.0.1.2, port_value: 80 } }
    # Priority 1: backup endpoints, used as priority 0 loses healthy hosts.
    - priority: 1
      lb_endpoints:
      - endpoint:
          address: { socket_address: { address: 10.0.2.1, port_value: 80 } }
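To illustrate how traffic shifts between priority levels inside one cluster, the following Python sketch follows the priority-load formula given in Envoy's load-balancing documentation, using the default overprovisioning factor of 1.4. The function name is hypothetical, rounding uses integer division, and panic handling is omitted, so treat it as an approximation.

```python
from typing import List, Tuple

OVERPROVISIONING_FACTOR = 140  # percent; assumes Envoy's documented default of 1.4


def priority_loads(levels: List[Tuple[int, int]]) -> List[int]:
    """Return the percentage of traffic sent to each priority level.

    `levels` holds (healthy_hosts, total_hosts) for priority 0, 1, ...
    """
    health = [
        min(100, OVERPROVISIONING_FACTOR * healthy // total) if total else 0
        for healthy, total in levels
    ]
    normalized_total = min(100, sum(health))
    if normalized_total == 0:
        # Nothing is healthy; Envoy's panic handling is out of scope here.
        return [0] * len(levels)
    loads, remaining = [], 100
    for h in health:
        # Each level takes its share, limited by what earlier levels left over.
        load = min(remaining, h * 100 // normalized_total)
        loads.append(load)
        remaining -= load
    return loads


if __name__ == "__main__":
    # Two priority 0 hosts and one priority 1 host.
    print(priority_loads([(2, 2), (1, 1)]))  # all priority 0 hosts healthy
    print(priority_loads([(1, 2), (1, 1)]))  # one priority 0 host unhealthy
```

Because of the overprovisioning factor, priority 0 keeps all traffic until fewer than roughly 72% of its hosts are healthy; only then does priority 1 start receiving a share.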
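2 Istio HTTP traffic management and failover
Istio builds on Envoy with declarative traffic management, exposing fine-grained control through API resources.
2.1 VirtualService: HTTP routing rules
A VirtualService defines how HTTP requests are routed and is the central piece of failover configuration.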
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: http-service
spec:
  hosts:
  - http-service  # assumed value, matching metadata.name
  http:
  # Requests from mobile user agents go to mobile-service.
  - match:
    - headers:
        user-agent:
          regex: ".*Mobile.*"
    route:
    - destination:
        host: mobile-service
        port:
          number: 80
      weight: 100
  # All other requests are split 80/20 and retried on failure.
  - route:
    - destination:
        host: primary-service
        port:
          number: 80
      weight: 80
    - destination:
        host: backup-service
        port:
          number: 80
      weight: 20
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,5xx,gateway-error
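Advanced routing features:
- Conditional matching: route on URI, headers, method, and other request attributes
- Weighted distribution: control precisely how traffic is split across backends
- Retry policy: retries configured per route for the service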
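The weighted distribution described above can be pictured as mapping a random number onto cumulative weights. The Python sketch below is only an illustration of that idea with a hypothetical function name; it says nothing about how Envoy implements weighted clusters internally.

```python
import random
from typing import List, Tuple

# Destinations and weights taken from the 80/20 route in the VirtualService.
ROUTES: List[Tuple[str, int]] = [("primary-service", 80), ("backup-service", 20)]


def pick_destination(routes: List[Tuple[str, int]], roll: int) -> str:
    """Map a roll in [0, total weight) onto the weighted destinations."""
    total = sum(weight for _, weight in routes)
    if not 0 <= roll < total:
        raise ValueError(f"roll must be in [0, {total})")
    cumulative = 0
    for host, weight in routes:
        cumulative += weight
        if roll < cumulative:
            return host
    raise AssertionError("unreachable: roll is below the total weight")


if __name__ == "__main__":
    # Each request draws a fresh roll, so about 80% land on primary-service.
    print(pick_destination(ROUTES, random.randrange(100)))
```

Passing the roll in explicitly keeps the selection deterministic and easy to test; a caller supplies the randomness.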
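2.2 DestinationRule: failover policy
A DestinationRule defines the policies applied when calling a service, including load balancing, connection pooling, and outlier detection.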
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: http-service-dr
spec:
  host: http-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST  # LEAST_CONN is deprecated in favor of LEAST_REQUEST
    connectionPool:
      http:
        http1MaxPendingRequests: 1024
        maxRequestsPerConnection: 1024
        maxRetries: 3
        idleTimeout: 3600s
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
  - name: v1
    labels:
      version: v1
    # Subset-level policy for v1.
    trafficPolicy:
      outlierDetection:
        consecutiveGatewayErrors: 3
        consecutive5xxErrors: 5
2.3 Locality-aware failover
Istio supports locality-aware failover, preferring to keep traffic in the same region or a nearby one.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: locality-aware-dr
spec:
  host: global-service
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # distribute and failover are mutually exclusive; set only one per rule.
        # Weighted alternative to the failover list below:
        # distribute:
        # - from: "us-west1/*"
        #   to:
        #     "us-west1/*": 70
        #     "us-west2/*": 20
        #     "us-central1/*": 10
        failover:
        - from: us-west1
          to: us-west2
        - from: us-west2
          to: us-central1
    # Outlier detection is required for locality failover to take effect.
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 60s
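As a rough mental model of the failover list, the Python sketch below follows the configured "from" and "to" pairs until it finds a region with healthy endpoints. This is an assumption made for illustration: Istio actually translates these rules into Envoy endpoint priorities, and the function name and the chain-following behaviour are hypothetical simplifications.

```python
from typing import Dict, Optional, Set

# The failover pairs from the DestinationRule: from -> to.
FAILOVER: Dict[str, str] = {"us-west1": "us-west2", "us-west2": "us-central1"}


def resolve_region(source: str, failover: Dict[str, str],
                   healthy: Set[str]) -> Optional[str]:
    """Follow the failover chain from `source` until a healthy region is found."""
    region: Optional[str] = source
    visited: Set[str] = set()
    while region is not None and region not in visited:
        if region in healthy:
            return region
        visited.add(region)  # guards against cycles in the failover map
        region = failover.get(region)
    return None  # no healthy region is reachable through the chain


if __name__ == "__main__":
    # us-west1 has no healthy endpoints, so traffic moves to us-west2.
    print(resolve_region("us-west1", FAILOVER, {"us-west2", "us-central1"}))
```

Which regions count as healthy is decided by outlier detection, which is why the DestinationRule needs an outlierDetection block for failover to happen at all.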
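3 Hands-on: a complete HTTP failover example
3.1 Scenario and architecture
Consider an e-commerce application with a user service (user-service) and an order service (order-service), deployed across several regions (us-west1, us-west2, us-central1). The failover requirements are:
- Normal operation: 90% of traffic stays in the local region and 10% goes to the backup region
- Local region failure: traffic switches automatically to the backup region
- Service-level failure: endpoints are ejected based on HTTP status codes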
3.2 Complete configuration
Locality failover policy
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service-locality-dr
spec:
  host: user-service
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: us-west1
          to: us-west2
        - from: us-west2
          to: us-central1
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      consecutiveGatewayErrors: 5
      interval: 10s
      baseEjectionTime: 30s
HTTP routing rules
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service-vs
spec:
  hosts:
  - user-service
  http:
  # Requests tagged for us-west1: 90% local, 10% to the backup region.
  - match:
    - headers:
        x-region:
          exact: "us-west1"
    route:
    - destination:
        host: user-service
        subset: us-west1
      weight: 90
    - destination:
        host: user-service
        subset: us-west2
      weight: 10
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,5xx
  # Default route for all other requests.
  - route:
    - destination:
        host: user-service
        subset: us-west1
    timeout: 10s
Service subset definitions
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service-subsets
spec:
  host: user-service
  subsets:
  - name: us-west1
    labels:
      region: us-west1
      version: v1
  - name: us-west2
    labels:
      region: us-west2
      version: v1
  - name: us-central1
    labels:
      region: us-central1
      version: v1
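3.3 Verifying the failover flow
Inject a failure to test the failover behaviour:
1. Send a request under normal conditions
curl -H "x-region: us-west1" http://user-service/api/v1/users/123
2. Simulate a failure in the us-west1 region
kubectl --context=us-west1 scale deployment user-service --replicas=0
3. Verify the failover
curl -H "x-region: us-west1" http://user-service/api/v1/users/123
Expected: the request is automatically routed to the us-west2 region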
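4 Monitoring
4.1 Key metrics
Effective failover depends on a solid monitoring setup:
- Request success rate: success ratio computed from "istio_requests_total" < 99% (overall service availability)
- Response latency: "istio_request_duration_milliseconds" > 500ms (response latency monitoring)
- Failover count: "envoy_cluster_upstream_rq_503" > 10/min (how often requests fail over)
- Endpoint health: "envoy_cluster_membership_healthy" < 50% of cluster members (share of healthy endpoints)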
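The success-rate threshold is a ratio rather than a raw counter value, so it has to be derived from request counts. The Python sketch below shows the arithmetic, assuming counts grouped by the response_code label of istio_requests_total have already been fetched; the function names and input shape are hypothetical.

```python
from typing import Dict


def success_rate(requests_by_code: Dict[str, float]) -> float:
    """Share of non-5xx responses, given request counts keyed by response code."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 1.0  # no traffic: report healthy rather than divide by zero
    errors = sum(n for code, n in requests_by_code.items() if code.startswith("5"))
    return (total - errors) / total


def breaches_threshold(requests_by_code: Dict[str, float],
                       threshold: float = 0.99) -> bool:
    """True when the success rate drops below the alert threshold."""
    return success_rate(requests_by_code) < threshold


if __name__ == "__main__":
    window = {"200": 980, "503": 20}  # counts over one evaluation window
    print(success_rate(window), breaches_threshold(window))
```

In a real deployment the same ratio would normally be expressed as an alerting rule in the monitoring system rather than computed in application code.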
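4.2 Envoy access log analysis
Envoy's access logs carry detailed information about failover events:
[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" status=%RESPONSE_CODE% flags=%RESPONSE_FLAGS% upstream_host=%UPSTREAM_HOST% upstream_cluster=%UPSTREAM_CLUSTER%
Key response flags:
- UF: upstream connection failure (triggers failover)
- URX: the request was rejected because the upstream retry limit was reached
- DC: downstream connection termination
Summary:
With the configuration and policies above, Envoy and Istio provide robust failover for HTTP traffic and keep services highly available in complex distributed environments.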