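An Introduction to HTTP Traffic Failover Mechanisms in Envoy and Istio

This post collects the traffic-level failover mechanisms that Envoy and Istio provide, as a reference for later use.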

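1 Core HTTP Failover Mechanisms in Envoy

As a data-plane proxy, Envoy implements HTTP failover through several layered mechanisms: retries, outlier detection, and priority-aware load balancing.

1.1 Retry Policy: The First Line of Defense

Retries are Envoy's first response to transient failures. The following example shows a complete HTTP retry policy: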

route_config:
  name: local_route
  virtual_hosts:
  - name: backend_service
    domains: ["*"]
    routes:
    - match: { prefix: "/" }
      route:
        cluster: backend_cluster
        retry_policy:
          retry_on: "connect-failure,refused-stream,5xx,gateway-error,retriable-4xx"
          num_retries: 3
          per_try_timeout: 2s
          retry_back_off:
            base_interval: 0.1s
            max_interval: 10s
          # Newer Envoy releases also expect a typed_config next to each extension name below.
          retry_host_predicate:
          - name: envoy.retry_host_predicates.previous_hosts
          retry_priority:
            name: envoy.retry_priorities.previous_priorities

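Key parameters:

  • retry_on: the conditions that trigger a retry
      • "connect-failure": the connection to the upstream failed (TCP level)
      • "5xx": the upstream returned a server error (HTTP 500-599)
      • "retriable-4xx": a retriable client error (for example HTTP 409 Conflict)
  • num_retries: the maximum number of retries; capping it keeps retries from amplifying load into a cascading failure
  • per_try_timeout: the timeout for each individual attempt, so that a hung attempt fails fast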
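To make the effect of "base_interval" and "max_interval" concrete, here is a minimal Python sketch of the fully jittered exponential back-off that Envoy's documentation describes: for retry number N the delay is drawn from [0, (2^N - 1) x base_interval), capped at max_interval. The function names are hypothetical and the code is an illustration under those assumptions, not Envoy's implementation.

```python
import random


def backoff_upper_bound(retry_number, base_interval, max_interval):
    """Upper bound in seconds of the back-off window for the N-th retry."""
    window = (2 ** retry_number - 1) * base_interval
    return min(window, max_interval)


def sample_backoff(retry_number, base_interval=0.1, max_interval=10.0, rng=random):
    """Draw a jittered delay between 0 and the upper bound for this retry."""
    bound = backoff_upper_bound(retry_number, base_interval, max_interval)
    return rng.uniform(0.0, bound)


# Print the windows for the three retries allowed by num_retries: 3.
for n in range(1, 4):
    print(f"retry {n}: delay drawn from 0 to {backoff_upper_bound(n, 0.1, 10.0):.1f}s")
```

With the values from the example configuration, the three windows grow roughly as 0.1s, 0.3s, and 0.7s, so the 10s cap only matters for much larger retry counts.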

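1.2 Outlier Detection: Automatic Health Management

Outlier detection is Envoy's passive health-checking mechanism: it watches the responses to live traffic and automatically ejects endpoints that fail repeatedly.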

clusters:
- name: backend_cluster
  connect_timeout: 0.25s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  outlier_detection:
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 50
    enforcing_consecutive_5xx: 100

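How it works:

  1. Error tracking: Envoy counts consecutive 5xx responses for each endpoint.

  2. Threshold check: when the count reaches the "consecutive_5xx" threshold, the endpoint is ejected.

  3. Temporary ejection: the endpoint is removed from the load-balancing pool for "base_ejection_time" multiplied by the number of times it has been ejected, provided that no more than "max_ejection_percent" of the cluster is ejected.

  4. Automatic recovery: once the ejection time elapses, the endpoint returns to the pool and receives traffic again.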
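The four steps can be sketched as a small Python model. This is a deliberately simplified illustration, not Envoy's code: the class and method names are hypothetical, time is passed in explicitly as seconds, and details such as the periodic "interval" sweep and the enforcement percentage are left out.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0  # timestamp in seconds; 0 means "in the pool"


class OutlierDetector:
    def __init__(self, addresses, consecutive_5xx=5,
                 base_ejection_time=30.0, max_ejection_percent=50):
        self.endpoints = {a: Endpoint(a) for a in addresses}
        self.threshold = consecutive_5xx
        self.base_ejection_time = base_ejection_time
        self.max_ejection_percent = max_ejection_percent

    def healthy(self, now):
        """Addresses currently eligible for load balancing."""
        return [e.address for e in self.endpoints.values() if e.ejected_until <= now]

    def record(self, address, status, now):
        """Record one response; return True if it caused an ejection."""
        ep = self.endpoints[address]
        if status < 500:
            ep.consecutive_5xx = 0  # any non-5xx response resets the streak
            return False
        ep.consecutive_5xx += 1
        if ep.consecutive_5xx < self.threshold:
            return False
        ep.consecutive_5xx = 0
        ejected_now = len(self.endpoints) - len(self.healthy(now))
        # Refuse the ejection if it would exceed max_ejection_percent.
        if (ejected_now + 1) * 100 > self.max_ejection_percent * len(self.endpoints):
            return False
        ep.times_ejected += 1
        # Ejection time grows with the number of ejections of this endpoint.
        ep.ejected_until = now + self.base_ejection_time * ep.times_ejected
        return True
```

Feeding five consecutive 503 responses for one address into record() ejects it for 30 seconds; a second ejection of the same address lasts 60 seconds.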

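1.3 Load Balancing and Priority Routing

Envoy supports several load-balancing algorithms and combines them with endpoint priorities. Priority levels take effect within a single cluster: Envoy sends traffic to priority 0 endpoints and spills over to higher-numbered priorities only as priority 0 becomes unhealthy.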

clusters:
- name: primary_cluster
  lb_policy: LEAST_REQUEST
  load_assignment:
    cluster_name: primary_cluster
    endpoints:
    - priority: 0
      lb_endpoints:
      - endpoint:
          address: { socket_address: { address: 10.0.1.1, port_value: 80 } }
      - endpoint:
          address: { socket_address: { address: 10.0.1.2, port_value: 80 } }
- name: secondary_cluster
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: secondary_cluster
    endpoints:
    - priority: 1
      lb_endpoints:
      - endpoint:
          address: { socket_address: { address: 10.0.2.1, port_value: 80 } }
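How much traffic each priority level receives depends on how healthy the lower-numbered levels are. The Python sketch below follows the formula given in Envoy's load-balancing documentation, assuming the default overprovisioning factor of 1.4; the function name and input shape are hypothetical, and panic-mode handling is reduced to returning zeros.

```python
def priority_loads(levels, overprovisioning_factor=1.4):
    """Return the percentage of traffic sent to each priority level.

    levels: list of (healthy_endpoints, total_endpoints), index 0 = priority 0.
    """
    # Effective health of a level, capped at 100%.
    health = [
        min(100.0, overprovisioning_factor * 100.0 * healthy / total) if total else 0.0
        for healthy, total in levels
    ]
    normalized_total = min(100.0, sum(health))
    if normalized_total == 0:
        return [0.0] * len(levels)  # nothing is healthy; Envoy would enter panic mode

    loads, assigned = [], 0.0
    for h in health:
        # A level takes what its health allows, out of what is still unassigned.
        load = min(100.0 - assigned, h * 100.0 / normalized_total)
        loads.append(load)
        assigned += load
    return loads


# Priority 0 has 1 of 2 endpoints healthy, priority 1 is fully healthy.
print(priority_loads([(1, 2), (1, 1)]))
```

In this model priority 0 keeps all traffic until fewer than about 72% of its endpoints are healthy, which is what the 1.4 factor buys.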

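2 HTTP Traffic Management and Failover in Istio

Istio builds on Envoy and adds declarative traffic management, with fine-grained control expressed through API resources.

2.1 VirtualService: HTTP Routing Rules

A VirtualService defines the routing rules for HTTP requests and is the central piece of failover configuration.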

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: http-service
spec:
  hosts:
  - http-service   # assumed from metadata.name; the original lists no host entry
  http:
  - match:
    - headers:
        user-agent:
          regex: ".*Mobile.*"
    route:
    - destination:
        host: mobile-service
        port:
          number: 80
      weight: 100
  - route:
    - destination:
        host: primary-service
        port:
          number: 80
      weight: 80
    - destination:
        host: backup-service
        port:
          number: 80
      weight: 20
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,5xx,gateway-error

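Advanced routing features:

  • Conditional matching: route on URI, headers, method, and other request attributes

  • Weighted distribution: control precisely how traffic is split across backends

  • Retry policy: retries configured per route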
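As an illustration of weighted distribution, the following Python sketch picks a destination per request in proportion to the configured weights, using the 80/20 split between primary-service and backup-service from the example above. It is a conceptual model with a hypothetical function name, not the proxy's actual selection code.

```python
import random


def pick_destination(routes, rng=random):
    """Choose one host at random, proportionally to its weight.

    routes: list of (host, weight) pairs, e.g. taken from a VirtualService route.
    """
    hosts = [host for host, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(hosts, weights=weights, k=1)[0]


routes = [("primary-service", 80), ("backup-service", 20)]
rng = random.Random(42)  # seeded so the run is repeatable
picks = [pick_destination(routes, rng) for _ in range(10_000)]
print("share sent to primary-service:", picks.count("primary-service") / len(picks))
```

Because the choice is random per request, the observed split only approaches 80/20 over many requests; small samples can deviate noticeably.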

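2.2 DestinationRule: Defining the Failover Policy

A DestinationRule defines the policies applied when a service is accessed, including load balancing, connection pooling, and outlier detection.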

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: http-service-dr
spec:
  host: http-service
  trafficPolicy:
    loadBalancer:
      simple: LEAST_CONN   # deprecated in newer Istio releases in favor of LEAST_REQUEST
    connectionPool:
      http:
        http1MaxPendingRequests: 1024
        maxRequestsPerConnection: 1024
        maxRetries: 3
        idleTimeout: 3600s
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
  - name: v1
    labels:
      version: v1
    trafficPolicy:
      outlierDetection:
        consecutiveGatewayErrors: 3
        consecutive5xxErrors: 5

2.3 Locality-Aware Failover

Istio supports locality-based failover, which keeps traffic in the same or a nearby locality whenever possible.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: locality-aware-dr
spec:
  host: global-service
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Istio accepts only one of distribute, failover, or failoverPriority;
        # both are shown here for reference, keep one in a real deployment.
        distribute:
        - from: "us-west1/*"
          to:
            "us-west1/*": 70
            "us-west2/*": 20
            "us-central1/*": 10
        failover:
        - from: us-west1
          to: us-west2
        - from: us-west2
          to: us-central1
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 60s
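One way to reason about the "failover" list is as a preference order per client region: the client's own region first, its configured failover target second, and every other region last. The Python sketch below models only that region-level ordering. It is a simplified assumption about the behavior (zones and sub-zones are ignored), and the function names are hypothetical.

```python
def region_priority(client_region, endpoint_region, failover):
    """Lower numbers are preferred. failover maps a 'from' region to its 'to' region."""
    if endpoint_region == client_region:
        return 0  # stay local whenever possible
    if failover.get(client_region) == endpoint_region:
        return 1  # the explicitly configured failover target
    return 2      # any remaining region


def failover_order(client_region, regions, failover):
    """Regions sorted by preference for a given client (ties broken by name)."""
    return sorted(regions, key=lambda r: (region_priority(client_region, r, failover), r))


failover = {"us-west1": "us-west2", "us-west2": "us-central1"}
regions = ["us-central1", "us-west1", "us-west2"]
print(failover_order("us-west1", regions, failover))
```

Traffic only moves down this order once outlier detection has marked enough endpoints in the preferred region as unhealthy, which is why the rule also configures outlierDetection.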

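3 Hands-On: A Complete HTTP Failover Example

3.1 Scenario and Architecture

Consider an e-commerce application with a user service (user-service) and an order service (order-service), deployed across several regions (us-west1, us-west2, us-central1). The failover requirements are:

  • Normal operation: 90% of traffic goes to the local region and 10% to the backup region

  • Local region failure: traffic switches automatically to the backup region

  • Service-level failure: unhealthy endpoints are isolated based on HTTP status codes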

3.2 Complete Configuration

Locality failover policy

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service-locality-dr
spec:
  host: user-service
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: us-west1
          to: us-west2
        - from: us-west2
          to: us-central1
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      consecutiveGatewayErrors: 5
      interval: 10s
      baseEjectionTime: 30s


HTTP routing rules

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service-vs
spec:
  hosts:
  - user-service
  http:
  - match:
    - headers:
        x-region:
          exact: "us-west1"
    route:
    - destination:
        host: user-service
        subset: us-west1
      weight: 90
    - destination:
        host: user-service
        subset: us-west2
      weight: 10
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,5xx
  - route:
    - destination:
        host: user-service
        subset: us-west1
    timeout: 10s


Service subset definitions

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service-subsets
spec:
  host: user-service
  subsets:
  - name: us-west1
    labels:
      region: us-west1
      version: v1
  - name: us-west2
    labels:
      region: us-west2
      version: v1
  - name: us-central1
    labels:
      region: us-central1
      version: v1

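3.3 Verifying the Failover Flow

Test the failover behavior by injecting a failure:

1. Send a request under normal conditions

curl -H "x-region: us-west1" http://user-service/api/v1/users/123

2. Simulate a failure of the us-west1 region

kubectl --context=us-west1 scale deployment user-service --replicas=0

3. Verify the failover

curl -H "x-region: us-west1" http://user-service/api/v1/users/123

Expected: the request is automatically routed to the us-west2 region.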

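4 Monitoring

4.1 Key Metrics

Effective failover depends on solid monitoring. Suggested alert conditions:

  • Request success rate: share of successful requests in "istio_requests_total" < 99%. Tracks overall service availability.

  • Response latency: "istio_request_duration_milliseconds" > 500ms. Tracks response latency.

  • Failover frequency: "envoy_cluster_upstream_rq_503" > 10/min. Tracks how often requests fail upstream and trigger failover.

  • Endpoint health: "envoy_cluster_membership_healthy" relative to "envoy_cluster_membership_total" < 50%. Tracks the share of healthy endpoints.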
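The success-rate condition can be made concrete with a small Python sketch. It assumes that request counts have already been read from "istio_requests_total" and grouped by the response_code label, and that only 5xx responses count as failures; both the input shape and the function names are hypothetical.

```python
def success_rate(requests_by_code):
    """Fraction of requests that did not end in a 5xx response.

    requests_by_code: mapping of response code (as a string) to request count.
    """
    total = sum(requests_by_code.values())
    if total == 0:
        return 1.0  # no traffic, so nothing has failed
    failed = sum(count for code, count in requests_by_code.items()
                 if code.startswith("5"))
    return (total - failed) / total


def availability_alert(requests_by_code, threshold=0.99):
    """True when the success rate drops below the alert threshold."""
    return success_rate(requests_by_code) < threshold


sample = {"200": 9700, "404": 100, "503": 200}
print(success_rate(sample), availability_alert(sample))
```

Whether 4xx responses should count against availability depends on the service; here they are treated as successful deliveries of a client error.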

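4.2 Analyzing Envoy Access Logs

Envoy's access logs record detailed information about failover events. A format such as the following captures the relevant fields:

[%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%" status=%RESPONSE_CODE% flags=%RESPONSE_FLAGS% upstream_host=%UPSTREAM_HOST% upstream_cluster=%UPSTREAM_CLUSTER%

Key response flags:

  • UF: upstream connection failure (a trigger for failover)

  • URX: the upstream retry limit was exceeded

  • DC: downstream connection termination

Summary: With the configurations and policies above, Envoy and Istio combine retries, outlier detection, and locality-aware routing to keep HTTP traffic highly available in complex distributed environments.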
