I. Why We Need Health Checks
In a distributed system, a service instance can become unavailable for many reasons while its callers remain unaware, keep sending requests to the faulty instance, and rack up large numbers of failures.
Common unavailability scenarios:
- Hung process: the Java process exists but cannot respond to requests (long Full GC pauses, deadlock)
- Resource exhaustion: CPU at 100%, OutOfMemoryError, exhausted connection pool
- Network faults: network partitions, firewall rule changes
- Dependency failures: database, cache, or other dependencies become unavailable
- Code defects: uncaught exceptions degrade the service
What health checks buy you:
- Unhealthy instances are detected promptly
- Faulty nodes are automatically removed from load balancing
- Recovered instances automatically rejoin the pool
- A basis for failover decisions
II. Types of Health Checks
1. Active vs. Passive Health Checks
| Type | Description | Pros | Cons |
|---|---|---|---|
| Active check | Periodically probe the service | Detects failures promptly | Adds extra requests |
| Passive check | Judge from the results of real requests | No extra overhead | Slower to detect failures |
Best practice: use both together.
2. Check Layers
L1: process check (does the process exist?)
L2: port check (can the port accept a TCP connection?)
L3: HTTP check (does the endpoint respond normally?)
L4: business check (does the core business logic work?)
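To make the layers concrete, here is a minimal sketch of an L2-style port check in plain Java (the class and method names are mine, not from any framework). It only proves the port is reachable, which is exactly why the higher layers exist:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // L2 check: try to open a TCP connection within the timeout.
    // Reachable port != healthy application; that is what L3/L4 add.
    public static boolean portAlive(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```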
III. Health Checks in Spring Boot
1. The Actuator Health Endpoint
```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
```
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics
  endpoint:
    health:
      show-details: always
      show-components: always
  health:
    db:
      enabled: true
    redis:
      enabled: true
    diskspace:
      enabled: true
```
Query the health endpoint:
```bash
curl http://localhost:8080/actuator/health
```
Example response:
```json
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "redis": {
      "status": "UP",
      "details": {
        "version": "7.0.5"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 107374182400,
        "free": 53687091200,
        "threshold": 10485760
      }
    }
  }
}
```
2. Custom Health Checks
```java
@Component
public class DatabaseHealthIndicator implements HealthIndicator {

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Override
    public Health health() {
        try {
            // Verify the database connection with a trivial query
            Integer result = jdbcTemplate.queryForObject("SELECT 1", Integer.class);
            if (result != null && result == 1) {
                return Health.up()
                        .withDetail("database", "MySQL")
                        .withDetail("status", "connected")
                        .build();
            }
            return Health.down()
                    .withDetail("database", "MySQL")
                    .withDetail("error", "Query returned unexpected result")
                    .build();
        } catch (Exception e) {
            return Health.down()
                    .withDetail("database", "MySQL")
                    .withDetail("error", e.getMessage())
                    .build();
        }
    }
}

@Component
public class RedisHealthIndicator implements HealthIndicator {

    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    @Override
    public Health health() {
        // try-with-resources returns the connection to the pool when done
        try (RedisConnection connection =
                 redisTemplate.getRequiredConnectionFactory().getConnection()) {
            String pong = connection.ping();
            if ("PONG".equals(pong)) {
                return Health.up()
                        .withDetail("redis", "connected")
                        .build();
            }
            return Health.down()
                    .withDetail("redis", "unexpected response: " + pong)
                    .build();
        } catch (Exception e) {
            return Health.down()
                    .withDetail("redis", "connection failed")
                    .withDetail("error", e.getMessage())
                    .build();
        }
    }
}

@Component
public class BusinessHealthIndicator implements HealthIndicator {

    @Autowired
    private OrderService orderService;

    @Override
    public Health health() {
        try {
            // Check whether the core business capability works
            boolean canCreateOrder = orderService.checkCreateOrderCapability();
            if (canCreateOrder) {
                return Health.up()
                        .withDetail("order-service", "normal")
                        .build();
            }
            return Health.down()
                    .withDetail("order-service", "degraded")
                    .build();
        } catch (Exception e) {
            return Health.down()
                    .withDetail("order-service", "error")
                    .withDetail("message", e.getMessage())
                    .build();
        }
    }
}
```
3. Health Check Groups
```yaml
management:
  endpoint:
    health:
      group:
        # Liveness probe: is the process alive?
        liveness:
          include: livenessState
        # Readiness probe: can the instance accept traffic?
        readiness:
          include: readinessState,db,redis
```
```bash
# Liveness probe
curl http://localhost:8080/actuator/health/liveness
# Readiness probe
curl http://localhost:8080/actuator/health/readiness
```
IV. Health Checks in Nginx
1. Passive Health Checks
```nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;
}
server {
    listen 80;
    location / {
        proxy_pass http://backend;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;
        # Retry on failure
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_next_upstream_tries 3;
    }
}
```
Parameter notes:
- max_fails=3: mark the server unavailable after 3 failures within the fail_timeout window
- fail_timeout=30s: the server stays marked unavailable for 30 seconds
- backup: backup server, used only when all primary servers are unavailable
2. Active Health Checks (nginx_upstream_check_module)
```nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
server {
    listen 80;
    location /status {
        check_status;
        access_log off;
    }
}
```
Parameter notes:
- interval=3000: probe every 3 seconds
- rise=2: mark healthy after 2 consecutive successes
- fall=3: mark unhealthy after 3 consecutive failures
- timeout=1000: probe timeout of 1 second
3. Active Health Checks in Nginx Plus
```nginx
upstream backend {
    zone backend 64k;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
}
server {
    listen 80;
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=3 passes=2 uri=/health;
    }
}
```
V. Health Checks in Kubernetes
1. Liveness Probe
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:latest
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1
```
When the liveness probe fails, Kubernetes restarts the container.
2. Readiness Probe
```yaml
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1
```
When the readiness probe fails, Kubernetes removes the Pod from the Service's Endpoints so it stops receiving traffic.
3. Startup Probe
```yaml
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
Purpose of the startup probe: give the application enough time to start (here up to 30 × 10 s = 300 s) so the liveness probe does not kill it during startup.
4. The Three Probes Compared
| Probe | Consequence of failure | Use case |
|---|---|---|
| Liveness | Container is restarted | Detect whether the process is alive |
| Readiness | Pod removed from the Service | Detect whether it can accept traffic |
| Startup | Container is restarted | Protect the app while it starts up |
VI. Failover Mechanisms
1. Client-Side Failover (Ribbon)
```java
@Configuration
public class RibbonConfig {

    @Bean
    public IRule ribbonRule() {
        // Retry rule: keep retrying (switching instances) for up to 500 ms.
        // Note: RetryRule's second argument is a time window in milliseconds,
        // not a retry count.
        return new RetryRule(new RoundRobinRule(), 500);
    }

    @Bean
    public IPing ribbonPing() {
        // Use an HTTP health check as the ping
        return new PingUrl(false, "/health");
    }

    @Bean
    public ILoadBalancer ribbonLoadBalancer() {
        return new ZoneAwareLoadBalancer<>();
    }
}
```
2. Feign Retry Configuration
```java
@Configuration
public class FeignConfig {

    @Bean
    public Retryer feignRetryer() {
        // Initial interval 100 ms, max interval 1 s, at most 3 attempts in total
        return new Retryer.Default(100, 1000, 3);
    }

    @Bean
    public Request.Options feignOptions() {
        // Connect timeout 5 s, read timeout 30 s, follow redirects
        return new Request.Options(5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS, true);
    }
}
```
3. Circuit Breaking and Degradation with Sentinel
```java
@Service
public class OrderService {

    @SentinelResource(
        value = "getUserInfo",
        fallback = "getUserInfoFallback",
        blockHandler = "getUserInfoBlock"
    )
    public UserInfo getUserInfo(Long userId) {
        return userClient.getUser(userId);
    }

    // Fallback: invoked on business exceptions
    public UserInfo getUserInfoFallback(Long userId, Throwable ex) {
        log.error("Failed to fetch user info, userId={}", userId, ex);
        return UserInfo.defaultUser(userId);
    }

    // Block handler: invoked when the request is rate-limited or the circuit is open
    public UserInfo getUserInfoBlock(Long userId, BlockException ex) {
        log.warn("Fetching user info was blocked, userId={}", userId);
        return UserInfo.defaultUser(userId);
    }
}
```
4. Failover Flow
Request arrives
↓
Load balancer picks an instance
↓
Send the request
↓
Request failed?
├── Yes → record the failure
│        ├── Over the threshold?
│        │   ├── Yes → mark the instance unhealthy → switch to another instance
│        │   └── No → retry
└── No → return the result
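The flow above can be sketched as a plain-Java loop. This is illustrative only (names such as `callWithFailover` are mine); Ribbon and Feign implement the production version:

```java
import java.util.List;
import java.util.function.Function;

public class FailoverClient {
    // Walk the instance list in order; a failed call "marks" the instance
    // by moving on to the next one, up to maxAttempts calls in total.
    public static <T> T callWithFailover(List<String> instances,
                                         Function<String, T> call,
                                         int maxAttempts) {
        RuntimeException lastFailure = null;
        int attempts = 0;
        for (String instance : instances) {
            if (attempts++ >= maxAttempts) {
                break;
            }
            try {
                return call.apply(instance); // success: return immediately
            } catch (RuntimeException e) {
                lastFailure = e; // record the failure, switch to the next instance
            }
        }
        throw lastFailure != null ? lastFailure
                : new IllegalStateException("no instances available");
    }
}
```

A real client would also remember which instances recently failed instead of always starting from the head of the list.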
VII. Monitoring and Alerting
1. Monitoring the Health Check Itself
```java
@Component
public class HealthCheckMonitor {

    @Autowired
    private HealthEndpoint healthEndpoint;

    @Autowired
    private AlertService alertService;

    @Scheduled(fixedRate = 30000)
    public void monitorHealth() {
        HealthComponent health = healthEndpoint.health();
        if (health.getStatus() != Status.UP) {
            // Fire an alert
            alertService.alert(
                "Service health check failed",
                "Current status: " + health.getStatus(),
                AlertLevel.CRITICAL
            );
        }
    }
}
```
2. Key Metrics
```yaml
# Prometheus metrics
- name: health_check_status
  help: "Service health status (1 = healthy, 0 = unhealthy)"
  type: gauge
- name: health_check_duration_seconds
  help: "Health check duration"
  type: histogram
- name: failover_count_total
  help: "Number of failovers"
  type: counter
```
VIII. Best Practices
1. Health Check Design Principles
- Fast: a health check endpoint should return within 100 ms
- Lightweight: no complex logic inside the check
- Layered: distinguish liveness checks from readiness checks
- No cascading: a health check must not trigger health checks of other services
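One way to enforce the "fast" principle is to put a hard deadline around each dependency check, so a slow dependency can never hang the endpoint. A minimal sketch (class and method names are mine, not a framework API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class BoundedCheck {
    // Run the check on a worker thread and wait at most timeoutMs;
    // a timeout or exception both count as unhealthy.
    public static boolean withDeadline(Supplier<Boolean> check, long timeoutMs) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            return worker.submit(check::get).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            return false;
        } finally {
            worker.shutdownNow();
        }
    }
}
```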
2. Failover Strategy
Fail fast → retry → circuit break → degrade → alert
- Fail fast: set sensible timeouts
- Retry: idempotent operations can be retried; retry non-idempotent ones with great care
- Circuit break: stop failures from spreading
- Degrade: keep core functionality available
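To show what "circuit break" means mechanically, here is a deliberately tiny counter-based breaker (illustrative only; this is not how Sentinel is implemented, and it ignores thread safety):

```java
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Closed, or open long enough to allow a half-open trial request
    public boolean allowRequest(long nowMillis) {
        return openedAt < 0 || nowMillis - openedAt >= openMillis;
    }

    public void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // close the circuit again
    }

    public void recordFailure(long nowMillis) {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis; // open the circuit
        }
    }
}
```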
3. Common Pitfalls
❌ Pitfall 1: the health check endpoint is too heavy
```java
// Wrong: running an expensive query inside the health check
@GetMapping("/health")
public Health health() {
    // This can make the health check time out
    List<Order> orders = orderRepository.findAll();
    return Health.up().build();
}
```
✅ The right way:
```java
@GetMapping("/health")
public Health health() {
    // Only verify that the connection works
    jdbcTemplate.queryForObject("SELECT 1", Integer.class);
    return Health.up().build();
}
```
❌ Pitfall 2: conflating the liveness and readiness probes
- Liveness failure → container restart (expensive)
- Readiness failure → stop receiving traffic (cheap)
- Pick the probe that matches the scenario
IX. Summary
Health checks and failover are core mechanisms of a highly available architecture:
- Layered checks: process → port → HTTP → business
- Active plus passive: active probing combined with passive observation of real traffic
- Fast response: detect failures promptly and switch quickly
- Automatic recovery: instances rejoin automatically once they recover
Implementation advice:
- Every service should expose a health check endpoint
- Kubernetes deployments should configure all three probes
- Set sensible timeout and retry policies
- Build solid monitoring and alerting around it
Food for thought: do your services implement layered health checks? Do you distinguish the liveness probe from the readiness probe?
Personal opinions, for reference only.