[Architecture in Practice] Health Checks and Failover Mechanisms

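I. Why Health Checks Are Needed

In a distributed system, a service instance can become unavailable for many reasons. If callers have no way of knowing, they keep sending requests to the failed instance, and a large share of those requests fail.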

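Common scenarios in which a service becomes unavailable:

Hung process: the Java process is still running but cannot respond to requests (Full GC, deadlock)
Resource exhaustion: CPU at 100%, out-of-memory (OOM) errors, connection pool exhausted
Network failure: network partition, changed firewall rules
Dependency failure: a database, cache, or other dependency is unavailable
Code errors: uncaught exceptions that degrade the service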

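What health checks provide:

Prompt detection of unhealthy instances
Automatic removal of failed nodes from load balancing
Automatic re-admission once a node recovers
A basis for failover decisions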

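II. Types of Health Checks

1. Active vs. passive health checks

Type	Description	Advantage	Disadvantage
Active check	Probes the service periodically	Detects failures promptly	Adds extra requests
Passive check	Judges health from the results of real requests	No extra overhead	Slower to detect failures

Best practice: combine the two.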

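2. Check levels

L1: Process check (is the process running?)
L2: Port check (does the port accept connections?)
L3: HTTP check (does the endpoint respond normally?)
L4: Business check (is the core business logic working?)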
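
The four levels can be run from cheapest to most expensive, stopping at the first one that fails. The following is a minimal sketch in plain Java under that assumption; `LayeredHealthCheck`, `addLayer`, `firstFailure`, and `portOpen` are hypothetical names introduced here for illustration, not part of any framework.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Runs health checks from the cheapest level to the most expensive one. */
final class LayeredHealthCheck {
    // Insertion order is preserved, so checks run in the order they were added.
    private final Map<String, BooleanSupplier> layers = new LinkedHashMap<>();

    LayeredHealthCheck addLayer(String name, BooleanSupplier check) {
        layers.put(name, check);
        return this;
    }

    /** Returns the name of the first failing level, or "OK" if every level passes. */
    String firstFailure() {
        for (Map.Entry<String, BooleanSupplier> layer : layers.entrySet()) {
            boolean passed;
            try {
                passed = layer.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                passed = false; // a check that throws counts as a failure
            }
            if (!passed) {
                return layer.getKey();
            }
        }
        return "OK";
    }

    /** L2-style check: can a TCP connection be opened within the timeout? */
    static boolean portOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A caller might register `() -> LayeredHealthCheck.portOpen("127.0.0.1", 8080, 500)` as the L2 level and a business-level lambda as L4. Reporting the first failing level tells the operator how deep the problem sits.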

III. Spring Boot Health Checks

1. Actuator health endpoint

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>


management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics
  endpoint:
    health:
      show-details: always
      show-components: always
  health:
    db:
      enabled: true
    redis:
      enabled: true
    diskspace:
      enabled: true

Query the health endpoint:

curl http://localhost:8080/actuator/health

Example response:

{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "redis": {
      "status": "UP",
      "details": {
        "version": "7.0.5"
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 107374182400,
        "free": 53687091200,
        "threshold": 10485760
      }
    }
  }
}
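
The top-level "status" in a response like this is derived from the component statuses: the most severe one wins. The sketch below illustrates that idea in plain Java. The severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN) is what I understand Spring Boot's default to be, so treat it as an assumption; `StatusAggregation` and `aggregate` are hypothetical names, not Actuator APIs.

```java
import java.util.List;
import java.util.Map;

/** Illustrative "worst status wins" aggregation over component statuses. */
final class StatusAggregation {
    // Assumed severity order, most severe first.
    private static final List<String> ORDER =
        List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    static String aggregate(Map<String, String> componentStatuses) {
        String worst = "UNKNOWN"; // result when there are no known statuses
        for (String status : componentStatuses.values()) {
            int rank = ORDER.indexOf(status);
            // Unrecognized status codes (rank -1) are ignored in this sketch.
            if (rank >= 0 && rank < ORDER.indexOf(worst)) {
                worst = status;
            }
        }
        return worst;
    }
}
```

With the components from the example response all reporting UP, the aggregate is UP. If any single component reported DOWN, the aggregate would be DOWN.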

2. Custom health checks

@Component
public class DatabaseHealthIndicator implements HealthIndicator {
    
    @Autowired
    private JdbcTemplate jdbcTemplate;
    
    @Override
    public Health health() {
        try {
            // Check the database connection with a trivial query
            Integer result = jdbcTemplate.queryForObject("SELECT 1", Integer.class);
            
            if (result != null && result == 1) {
                return Health.up()
                    .withDetail("database", "MySQL")
                    .withDetail("status", "connected")
                    .build();
            }
            
            return Health.down()
                .withDetail("database", "MySQL")
                .withDetail("error", "Query returned unexpected result")
                .build();
                
        } catch (Exception e) {
            return Health.down()
                .withDetail("database", "MySQL")
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

@Component
public class RedisHealthIndicator implements HealthIndicator {
    
    @Autowired
    private RedisTemplate<String, String> redisTemplate;
    
    @Override
    public Health health() {
        try {
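            // Run PING through execute(...) so the connection is released afterwards;
            // calling getConnectionFactory().getConnection() directly would leak it.
            // Needs imports: org.springframework.data.redis.core.RedisCallback and
            // org.springframework.data.redis.connection.RedisConnection
            String pong = redisTemplate.execute((RedisCallback<String>) RedisConnection::ping);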
            
            if ("PONG".equals(pong)) {
                return Health.up()
                    .withDetail("redis", "connected")
                    .build();
            }
            
            return Health.down()
                .withDetail("redis", "unexpected response: " + pong)
                .build();
                
        } catch (Exception e) {
            return Health.down()
                .withDetail("redis", "connection failed")
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

@Component
public class BusinessHealthIndicator implements HealthIndicator {
    
    @Autowired
    private OrderService orderService;
    
    @Override
    public Health health() {
        try {
            // Check whether the core business path is working
            boolean canCreateOrder = orderService.checkCreateOrderCapability();
            
            if (canCreateOrder) {
                return Health.up()
                    .withDetail("order-service", "normal")
                    .build();
            }
            
            return Health.down()
                .withDetail("order-service", "degraded")
                .build();
                
        } catch (Exception e) {
            return Health.down()
                .withDetail("order-service", "error")
                .withDetail("message", e.getMessage())
                .build();
        }
    }
}

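3. Health check groups

management:
  endpoint:
    health:
      # Exposes livenessState/readinessState outside Kubernetes
      # (on Kubernetes they are enabled automatically)
      probes:
        enabled: true
      group:
        # Liveness probe: is the process alive?
        liveness:
          include: livenessState
        # Readiness probe: can the instance accept traffic?
        readiness:
          include: readinessState,db,redis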

# Liveness probe
curl http://localhost:8080/actuator/health/liveness

# Readiness probe
curl http://localhost:8080/actuator/health/readiness

IV. Nginx Health Checks

1. Passive health checks

upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://backend;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;
        
        # On failure, pass the request to the next upstream server
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_next_upstream_tries 3;
    }
}

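Parameter notes:

max_fails=3: the server is marked unavailable after 3 failed attempts within the fail_timeout window (30 seconds here)
fail_timeout=30s: both the window in which failures are counted and how long the server then stays marked unavailable
backup: a backup server that receives traffic only when all primary servers are unavailable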
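
To make the interplay of max_fails and fail_timeout concrete, here is a simplified model in plain Java. It is a sketch of the semantics described above, not Nginx's actual implementation; `PassiveFailureTracker` and its methods are hypothetical names, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of passive checking with max_fails / fail_timeout. */
final class PassiveFailureTracker {
    private final int maxFails;
    private final long failTimeoutMillis;
    private final Deque<Long> failureTimes = new ArrayDeque<>();
    private long unavailableUntil = Long.MIN_VALUE;

    PassiveFailureTracker(int maxFails, long failTimeoutMillis) {
        this.maxFails = maxFails;
        this.failTimeoutMillis = failTimeoutMillis;
    }

    /** Records a failed request observed at the given time. */
    void recordFailure(long nowMillis) {
        // Forget failures that fell out of the counting window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() >= failTimeoutMillis) {
            failureTimes.removeFirst();
        }
        failureTimes.addLast(nowMillis);
        if (failureTimes.size() >= maxFails) {
            // Too many failures inside the window: take the server out
            // of rotation for one fail_timeout period.
            unavailableUntil = nowMillis + failTimeoutMillis;
            failureTimes.clear();
        }
    }

    /** True if the server may receive traffic at the given time. */
    boolean isAvailable(long nowMillis) {
        return nowMillis >= unavailableUntil;
    }
}
```

With `new PassiveFailureTracker(3, 30_000)`, three failures within 30 seconds take the server out of rotation for the next 30 seconds, while three failures spread 20 seconds apart never trip the threshold.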

2. Active health checks (nginx_upstream_check_module)

upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    listen 80;
    
    location /status {
        check_status;
        access_log off;
    }
}

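Parameter notes:

interval=3000: probe every 3 seconds (the value is in milliseconds)
rise=2: mark the server healthy after 2 consecutive successes
fall=3: mark the server unhealthy after 3 consecutive failures
timeout=1000: each probe times out after 1 second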
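
The rise and fall thresholds amount to a small state machine: a server changes state only after enough consecutive probe results contradict its current state. A minimal sketch, with `RiseFallState` and `onProbe` as hypothetical names chosen for illustration:

```java
/** Tracks active-probe results with rise/fall hysteresis. */
final class RiseFallState {
    private final int rise;
    private final int fall;
    private boolean healthy;
    private int streak; // consecutive results that contradict the current state

    RiseFallState(int rise, int fall, boolean initiallyHealthy) {
        this.rise = rise;
        this.fall = fall;
        this.healthy = initiallyHealthy;
    }

    /** Feeds one probe result and returns the resulting health state. */
    boolean onProbe(boolean success) {
        if (success == healthy) {
            streak = 0; // the result agrees with the current state
        } else if (++streak >= (healthy ? fall : rise)) {
            healthy = !healthy; // enough contradicting results: flip the state
            streak = 0;
        }
        return healthy;
    }

    boolean isHealthy() {
        return healthy;
    }
}
```

With rise=2 and fall=3, a single failed probe does not evict a healthy server, and a single successful probe does not readmit an unhealthy one. This damping keeps a flaky network from making servers flap in and out of rotation.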

3. Nginx Plus active health checks

upstream backend {
    zone backend 64k;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://backend;
        health_check interval=5s fails=3 passes=2 uri=/health;
    }
}

V. Kubernetes Health Checks

1. Liveness probe

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
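spec:
  # apps/v1 requires a selector that matches the Pod template labels;
  # the label app: order-service is illustrative
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec: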
      containers:
        - name: order-service
          image: order-service:latest
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1

When the liveness probe fails: Kubernetes restarts the container.

2. Readiness probe

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1

When the readiness probe fails: Kubernetes removes the Pod from the Service's Endpoints, so it stops receiving traffic.

3. Startup probe

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

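What the startup probe does: it gives the application enough time to start and keeps the liveness probe from misjudging it during startup. Liveness and readiness probes do not run until the startup probe has succeeded.

4. Comparing the three probes

Probe	On failure	Use case
Liveness	Container is restarted	Detect whether the process is alive
Readiness	Pod is removed from the Service	Detect whether the Pod can accept traffic
Startup	Container is restarted	Protect the application while it starts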
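
The probe settings above imply rough time budgets: a probe's action fires after failureThreshold consecutive failures, one per periodSeconds. The arithmetic below is an approximation that ignores initialDelaySeconds and timeoutSeconds; `ProbeBudget` is a hypothetical helper written only to show the calculation.

```java
/** Back-of-the-envelope timing for the probe settings used in this section. */
final class ProbeBudget {
    /** Approximate seconds of consecutive failures tolerated before the probe's action fires. */
    static int failureWindowSeconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    public static void main(String[] args) {
        // Startup probe: failureThreshold=30, periodSeconds=10
        System.out.println("startup budget:   " + failureWindowSeconds(30, 10) + " s"); // 300 s
        // Liveness probe: failureThreshold=3, periodSeconds=10
        System.out.println("liveness window:  " + failureWindowSeconds(3, 10) + " s");  // 30 s
        // Readiness probe: failureThreshold=3, periodSeconds=5
        System.out.println("readiness window: " + failureWindowSeconds(3, 5) + " s");   // 15 s
    }
}
```

So the application gets about 5 minutes to start, a hung container is restarted after roughly 30 seconds, and an unready Pod is taken out of the Service after roughly 15 seconds.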

VI. Failover Mechanisms

1. Client-side failover (Ribbon)

@Configuration
public class RibbonConfig {
    
    @Bean
    public IRule ribbonRule() {
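        // Retry rule: keep asking the round-robin rule for another instance
        // when the chosen one is unavailable. The second argument is the
        // maximum retry time in milliseconds, not a retry count.
        return new RetryRule(new RoundRobinRule(), 500);
    }
    
    @Bean
    public IPing ribbonPing() {
        // Use an HTTP health check (plain HTTP, path /health)
        return new PingUrl(false, "/health");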
    }
    
    @Bean
    public ILoadBalancer ribbonLoadBalancer() {
        return new ZoneAwareLoadBalancer<>();
    }
}

2. Feign retry configuration

@Configuration
public class FeignConfig {
    
    @Bean
    public Retryer feignRetryer() {
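        // Initial interval 100 ms, maximum interval 1 s, at most 3 attempts
        // in total (the first call plus up to 2 retries)
        return new Retryer.Default(100, 1000, 3);
    }
    
    @Bean
    public Request.Options feignOptions() {
        // Connect timeout 5 s, read timeout 30 s, follow redirects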
        return new Request.Options(5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS, true);
    }
}
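
A retry policy with an initial interval, a maximum interval, and an attempt limit usually means a capped, growing backoff. The sketch below computes such a schedule in plain Java with a growth factor of 1.5. It is a generic illustration, and Feign's exact schedule may differ; `BackoffSchedule` and `delaysMillis` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Generic capped exponential backoff schedule (growth factor 1.5). */
final class BackoffSchedule {
    /** Returns the wait before each retry; maxAttempts includes the first call. */
    static List<Long> delaysMillis(long period, long maxPeriod, int maxAttempts) {
        List<Long> delays = new ArrayList<>();
        // maxAttempts counts the initial call, so there are maxAttempts - 1 waits.
        for (int retry = 1; retry < maxAttempts; retry++) {
            long delay = (long) (period * Math.pow(1.5, retry - 1));
            delays.add(Math.min(delay, maxPeriod)); // never wait longer than maxPeriod
        }
        return delays;
    }
}
```

For example, `BackoffSchedule.delaysMillis(100, 1000, 3)` yields two waits, 100 ms and 150 ms, for three attempts in total. The cap matters only for longer schedules, where the uncapped delay would exceed 1 second.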

3. Sentinel circuit breaking and degradation

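@Service
public class OrderService {

    // SLF4J logger (assumed; the original snippet does not declare `log`)
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    // Remote client for the user service, for example a Feign client
    // (UserClient is an assumed type name; the original does not declare it)
    @Autowired
    private UserClient userClient;

    @SentinelResource(
        value = "getUserInfo",
        fallback = "getUserInfoFallback",
        blockHandler = "getUserInfoBlock"
    )
    public UserInfo getUserInfo(Long userId) {
        return userClient.getUser(userId);
    }

    // Fallback: invoked when getUserInfo throws an exception
    public UserInfo getUserInfoFallback(Long userId, Throwable ex) {
        log.error("Failed to get user info, userId={}", userId, ex);
        return UserInfo.defaultUser(userId);
    }

    // Block handler: invoked when Sentinel rejects the call (BlockException),
    // for example because of a flow-control or circuit-breaking rule
    public UserInfo getUserInfoBlock(Long userId, BlockException ex) {
        log.warn("Request for user info was blocked, userId={}", userId);
        return UserInfo.defaultUser(userId);
    }
}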
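
For readers new to circuit breaking, the core idea is a three-state machine: CLOSED (calls pass), OPEN (calls are rejected), and HALF_OPEN (trial calls are allowed after a cool-down). The sketch below shows that idea in plain Java. It is not Sentinel's implementation, it is not thread-safe, and `SimpleCircuitBreaker` with its methods is a hypothetical class written for illustration.

```java
/** Minimal circuit breaker state machine (single-threaded sketch). */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Returns true if a call may proceed at the given time. */
    boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN; // cool-down elapsed: allow trial calls again
        }
        return state != State.OPEN;
    }

    void onSuccess() {
        state = State.CLOSED;
        consecutiveFailures = 0;
    }

    void onFailure(long nowMillis) {
        consecutiveFailures++;
        // A failed trial call, or too many failures in a row, opens the circuit.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
        }
    }

    State state() {
        return state;
    }
}
```

A caller checks `allowRequest(now)` before each remote call and returns a fallback value (such as the default user above) when it returns false. After each call it reports the outcome through `onSuccess()` or `onFailure(now)`.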

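4. Failover flow

Request arrives
    ↓
Load balancer selects an instance
    ↓
Send the request
    ↓
Did the request fail?
    ├── Yes → record the failure
    │         ├── Threshold exceeded?
    │         │   ├── Yes → mark the instance unhealthy → switch to another instance
    │         │   └── No → retry
    └── No → return the result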
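
The flow above can be sketched as a small client-side loop: retry an instance until its failure count reaches the threshold, then treat it as unhealthy and move to the next one. This is a simplified, single-threaded illustration; `FailoverInvoker` is a hypothetical class, instances are identified by plain strings, and the remote call is assumed to return a non-null value or throw a RuntimeException.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Retries an instance up to a failure threshold, then fails over to the next. */
final class FailoverInvoker {
    private final int failureThreshold;
    private final Map<String, Integer> failures = new HashMap<>();

    FailoverInvoker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** An instance is healthy while its failure count is below the threshold. */
    boolean isHealthy(String instance) {
        return failures.getOrDefault(instance, 0) < failureThreshold;
    }

    /** Returns the first successful result, or empty if every instance is unhealthy. */
    <T> Optional<T> invoke(List<String> instances, Function<String, T> call) {
        for (String instance : instances) {
            while (isHealthy(instance)) {
                T result;
                try {
                    result = call.apply(instance);
                } catch (RuntimeException e) {
                    // Record the failure; the loop condition decides whether
                    // to retry this instance or switch to the next one.
                    failures.merge(instance, 1, Integer::sum);
                    continue;
                }
                failures.put(instance, 0); // a success resets the counter
                return Optional.of(result);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch an unhealthy instance stays excluded until its counter is reset; a real system would pair it with an active check or a timeout so that recovered instances rejoin.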

VII. Monitoring and Alerting

1. Health check monitoring

@Component
public class HealthCheckMonitor {
    
    @Autowired
    private HealthEndpoint healthEndpoint;
    
    @Autowired
    private AlertService alertService;
    
    @Scheduled(fixedRate = 30000)
    public void monitorHealth() {
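        HealthComponent health = healthEndpoint.health();
        
        // Compare with equals(): a Status is identified by its code, not by object identity
        if (!Status.UP.equals(health.getStatus())) {
            // Send an alert
            alertService.alert(
                "Service health check failed",
                "Current status: " + health.getStatus(),
                AlertLevel.CRITICAL
            );
        }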
    }
}
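
A monitor that alerts on every failed poll sends a new alert every 30 seconds for as long as the service stays down. One common refinement is to alert only when the status changes. A minimal sketch of that idea, with `TransitionAlerter` as a hypothetical class and statuses modeled as plain strings:

```java
import java.util.function.Consumer;

/** Emits an alert only when the observed status changes. */
final class TransitionAlerter {
    private final Consumer<String> alertSink;
    private String lastStatus = "UP"; // assume the service starts healthy

    TransitionAlerter(Consumer<String> alertSink) {
        this.alertSink = alertSink;
    }

    /** Feeds one polled status; alerts on a change, stays silent otherwise. */
    void onStatus(String status) {
        if (!status.equals(lastStatus)) {
            alertSink.accept(lastStatus + " -> " + status);
            lastStatus = status;
        }
    }
}
```

Polling UP, DOWN, DOWN, DOWN, UP then produces two alerts, one for the outage and one for the recovery, instead of three identical outage alerts.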

2. Key metrics

# Prometheus metrics
- name: health_check_status
  help: "Service health status (1 = healthy, 0 = unhealthy)"
  type: gauge

- name: health_check_duration_seconds
  help: "Time taken by a health check"
  type: histogram

- name: failover_count_total
  help: "Number of failovers"
  type: counter

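VIII. Best Practices

1. Health check design principles

Respond quickly: the health endpoint should return within 100 ms
Stay lightweight: do not run complex logic inside a health check
Check in layers: separate liveness checks from readiness checks
Avoid cascades: a health check should not trigger health checks in other services

2. Failover strategy

Fail fast → retry → circuit break → degrade → alert

Fail fast: set reasonable timeouts
Retry: idempotent operations can be retried; retry non-idempotent operations only with care
Circuit break: keep a failure from spreading
Degrade: keep core functionality available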
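
The first and last steps of this chain, failing fast and degrading, can be combined in a few lines with the JDK's `CompletableFuture` (Java 9 or later for `orTimeout`). The sketch below is one possible shape; `FailFast` and `callWithTimeout` are hypothetical names. Note that a timed-out task is not cancelled here and keeps running in the background.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Fail fast with a timeout, then degrade to a fallback value. */
final class FailFast {
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(call)
            // Fail fast: give up waiting once the timeout elapses.
            .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
            // Degrade: any error, including the timeout, yields the fallback.
            .exceptionally(error -> fallback)
            .join();
    }
}
```

A call such as `FailFast.callWithTimeout(() -> remoteLookup(id), 200, defaultValue)` (where `remoteLookup` stands for any slow dependency) returns within about 200 ms either way. Retry and circuit breaking would sit between these two steps.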

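3. Common mistakes

❌ Mistake 1: a health check endpoint that does too much

// Wrong: running an expensive query inside the health check
@GetMapping("/health")
public Health health() {
    // Loading every order can make the health check time out
    List<Order> orders = orderRepository.findAll();
    return Health.up().build();
}

✅ Better:

@GetMapping("/health")
public Health health() {
    // Only verify that the connection works
    jdbcTemplate.queryForObject("SELECT 1", Integer.class);
    return Health.up().build();
}

❌ Mistake 2: treating liveness and readiness probes as interchangeable

Liveness probe fails → the container is restarted (expensive)
Readiness probe fails → the Pod stops receiving traffic (cheap)
Choose the probe that fits the situation: a failure that a restart cannot fix, such as an unavailable dependency, belongs in the readiness check, not the liveness check.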
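
One way to keep the two probes apart is to feed them from different signals: liveness reflects only the process's own state, while readiness also depends on external dependencies. The sketch below models that split in plain Java; `ProbeState` and its fields are hypothetical names used for illustration.

```java
/** Separates the signals behind the liveness and readiness probes. */
final class ProbeState {
    private volatile boolean processResponsive = true;     // internal state only
    private volatile boolean dependenciesReachable = true; // e.g. database, cache

    void setProcessResponsive(boolean value) {
        processResponsive = value;
    }

    void setDependenciesReachable(boolean value) {
        dependenciesReachable = value;
    }

    /** Liveness: would a restart help? Only internal problems count. */
    boolean liveness() {
        return processResponsive;
    }

    /** Readiness: can traffic be served right now? Dependencies count too. */
    boolean readiness() {
        return processResponsive && dependenciesReachable;
    }
}
```

If the database goes down, readiness turns false and traffic is withheld, but liveness stays true, so the container is not restarted for a problem that a restart cannot fix.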

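IX. Summary

Health checks and failover are core mechanisms of a highly available architecture:

Layered checks: process → port → HTTP → business
Active plus passive: active probing combined with passive detection
Fast response: detect failures promptly and switch quickly
Automatic recovery: instances rejoin automatically once they recover

Recommendations:

Implement a health check endpoint in every service
Configure all three probes for Kubernetes deployments
Set reasonable timeouts and retry policies
Build thorough monitoring and alerting

Question for readers: do your services implement layered health checks? Do you distinguish between liveness and readiness probes?

Personal opinion, for reference only.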