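Disaster Recovery Drills for a Taoke (Taobao Affiliate) System: Service Degradation and Fast Failover Under Chaos Monkey-Style Node Failures in Java
Hi everyone, I'm 省赚客, a developer of 微赚淘客系统3.0!
High availability is a core requirement for a Taoke system. To verify that the system can recover on its own when a database goes down, a cache fails, or a service instance crashes, we adopted the Chaos Monkey approach. Using Spring Boot Actuator together with an in-house fault-injection module, we deliberately trigger failures in the pre-production environment, then rely on Sentinel and a multi-active architecture for service degradation and fast failover.
Designing the Chaos Monkey Fault-Injection Module
We built a lightweight Chaos Agent on Spring Boot that can inject latency, exceptions, or service interruptions at runtime: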
java
package juwatech.cn.chaos;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

import java.util.Random;

@Component
public class ChaosInjector {

    // Master switch; stays off unless chaos.enabled=true is set.
    @Value("${chaos.enabled:false}")
    private boolean chaosEnabled;

    // Probability (0.0 to 1.0) that a given call is hit by a fault.
    @Value("${chaos.failure.rate:0.1}")
    private double failureRate;

    private final Random random = new Random();

    // Throws a simulated failure for roughly failureRate of the calls.
    public void maybeThrowException(String serviceName) {
        if (!chaosEnabled) return;
        if (random.nextDouble() < failureRate) {
            juwatech.cn.util.AsyncLogger.logAsync("Chaos injected: killing " + serviceName);
            throw new RuntimeException("Simulated failure in " + serviceName);
        }
    }

    // Sleeps for a random time in [0, maxDelayMs) for roughly failureRate of the calls.
    public void maybeDelay(long maxDelayMs) {
        // Guard against non-positive bounds, which would otherwise be invalid.
        if (!chaosEnabled || maxDelayMs <= 0) return;
        if (random.nextDouble() < failureRate) {
            long delay = (long) (random.nextDouble() * maxDelayMs);
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still see it.
                Thread.currentThread().interrupt();
            }
        }
    }
}
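To make a drill repeatable, it can help to drive the injector from a seeded random source so that the same sequence of failures replays on every run. The sketch below is a plain-JDK illustration rather than the production agent: `SeededChaos`, its constructor arguments, and its methods are hypothetical names, and the Spring wiring is left out.

```java
import java.util.Random;

// Hypothetical, Spring-free variant of the injector for reproducible drills.
final class SeededChaos {
    private final double failureRate; // probability in [0, 1]
    private final Random random;      // seeded so a drill can be replayed

    SeededChaos(double failureRate, long seed) {
        if (failureRate < 0.0 || failureRate > 1.0) {
            throw new IllegalArgumentException("failureRate must be within [0, 1]");
        }
        this.failureRate = failureRate;
        this.random = new Random(seed);
    }

    // Returns true when the current call should be failed.
    synchronized boolean shouldFail() {
        return random.nextDouble() < failureRate;
    }

    // Throws a simulated failure for roughly failureRate of the calls.
    void maybeThrow(String serviceName) {
        if (shouldFail()) {
            throw new RuntimeException("Simulated failure in " + serviceName);
        }
    }

    // Counts how many of n calls would be failed; useful for sanity-checking a rate.
    static int countFailures(SeededChaos chaos, int n) {
        int failures = 0;
        for (int i = 0; i < n; i++) {
            if (chaos.shouldFail()) {
                failures++;
            }
        }
        return failures;
    }
}
```

Two instances built with the same rate and seed fail exactly the same calls, which makes it easier to compare metrics between two runs of a drill.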
Embed fault points in key services:
java
package juwatech.cn.service;

import juwatech.cn.chaos.ChaosInjector;
import org.springframework.stereotype.Service;

@Service
public class CouponRemoteService {

    private final ChaosInjector chaosInjector;

    public CouponRemoteService(ChaosInjector chaosInjector) {
        this.chaosInjector = chaosInjector;
    }

    public String fetchCouponFromRemote(String itemId) {
        // Inject faults before the simulated call to the third-party API
        chaosInjector.maybeThrowException("coupon-remote-api");
        chaosInjector.maybeDelay(2000);
        // Actual call logic (omitted)
        return "CPN_" + itemId;
    }
}
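Sentinel Degradation Rule Configuration
When the remote service is unavailable, a fallback is triggered automatically. The handlers below take a BlockException, so they run when Sentinel blocks a call (for example while the circuit breaker is open), not for each individual exception thrown by the remote call: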
java
package juwatech.cn.fallback;

import com.alibaba.csp.sentinel.slots.block.BlockException;

public class CouponFallback {

    public static String remoteFallback(String itemId, BlockException ex) {
        juwatech.cn.util.AsyncLogger.logAsync("Fallback triggered for item: " + itemId);
        return "DEFAULT_COUPON_5YUAN"; // Return a default fallback coupon
    }

    public static String dbFallback(String userId, BlockException ex) {
        return "[]"; // Return an empty list
    }
}
Bind the resource to its degradation strategy with the Sentinel annotation:
java
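package juwatech.cn.service;

import com.alibaba.csp.sentinel.annotation.SentinelResource;
import juwatech.cn.fallback.CouponFallback;
import org.springframework.stereotype.Service;

@Service
public class CouponQueryService {

    private final CouponRemoteService couponRemoteService;

    // Inject the Spring-managed bean: a ChaosInjector created with `new`
    // never receives its @Value settings, so chaos would stay disabled.
    public CouponQueryService(CouponRemoteService couponRemoteService) {
        this.couponRemoteService = couponRemoteService;
    }

    @SentinelResource(
        value = "fetchCouponFromRemote",
        blockHandler = "remoteFallback",
        blockHandlerClass = CouponFallback.class
    )
    public String queryCoupon(String itemId) {
        return couponRemoteService.fetchCouponFromRemote(itemId);
    }
}
Load the degradation rules programmatically (initialized at startup):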
java
package juwatech.cn.config;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import org.springframework.context.annotation.Configuration;

import javax.annotation.PostConstruct;
import java.util.Collections;

@Configuration
public class SentinelDegradeConfig {

    @PostConstruct
    public void initDegradeRules() {
        DegradeRule rule = new DegradeRule("fetchCouponFromRemote")
            .setGrade(RuleConstant.DEGRADE_GRADE_RT) // Based on response time (slow-call ratio)
            .setCount(200)                // Calls slower than 200 ms count as slow
            .setTimeWindow(10)            // Keep the breaker open for 10 seconds
            .setMinRequestAmount(5)       // Minimum number of requests before the rule applies
            .setSlowRatioThreshold(0.5);  // Slow-call ratio threshold
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }
}
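To make the rule's parameters concrete, here is a deliberately simplified model of a slow-call-ratio breaker. It is not Sentinel's implementation (it has no sliding window and no half-open probing), and `SlowRatioBreaker` and its methods are hypothetical names. It only illustrates how the 200 ms threshold, the minimum request count, the 0.5 ratio, and the 10-second window interact.

```java
// Simplified model of a slow-call-ratio circuit breaker (illustration only).
final class SlowRatioBreaker {
    private final long slowRtMs;         // calls slower than this count as slow
    private final int minRequests;       // do not judge before this many calls
    private final double ratioThreshold; // open when slow / total exceeds this
    private final long openMs;           // how long the breaker stays open

    private int total;
    private int slow;
    private long openUntilMs = Long.MIN_VALUE;

    SlowRatioBreaker(long slowRtMs, int minRequests, double ratioThreshold, long openMs) {
        this.slowRtMs = slowRtMs;
        this.minRequests = minRequests;
        this.ratioThreshold = ratioThreshold;
        this.openMs = openMs;
    }

    // True if a call may proceed at time nowMs; false means "serve the fallback".
    synchronized boolean allow(long nowMs) {
        return nowMs >= openUntilMs;
    }

    // Records a finished call and opens the breaker if the slow ratio is too high.
    synchronized void record(long rtMs, long nowMs) {
        total++;
        if (rtMs > slowRtMs) {
            slow++;
        }
        if (total >= minRequests && (double) slow / total > ratioThreshold) {
            openUntilMs = nowMs + openMs;
            total = 0; // start fresh statistics after tripping
            slow = 0;
        }
    }
}
```

One consequence worth checking before a drill: with the default `chaos.failure.rate` of 0.1, only about one call in ten is delayed at all, so the slow-call ratio stays well below 0.5. A drill that aims to trip this rule presumably needs a higher failure rate.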
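Fast Failover for the Multi-Active Database
When the master fails, traffic switches automatically to a read-only replica. We encapsulate the data source routing logic: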
java
package juwatech.cn.datasource;

import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

public class RoutingDataSource extends AbstractRoutingDataSource {

    // Returns the lookup key ("master" or "slave") for the current thread.
    @Override
    protected Object determineCurrentLookupKey() {
        return DataSourceContext.getDataSourceType();
    }
}
Context management:
java
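package juwatech.cn.datasource;

public class DataSourceContext {

    private static final ThreadLocal<String> contextHolder = new ThreadLocal<>();

    public static void setMaster() {
        contextHolder.set("master");
    }

    public static void setSlave() {
        contextHolder.set("slave");
    }

    // Defaults to "master" when no key has been set on this thread.
    public static String getDataSourceType() {
        String type = contextHolder.get();
        return type != null ? type : "master";
    }

    // Removes the key so a pooled thread does not carry it into the next request.
    public static void clear() {
        contextHolder.remove();
    }
}
Check the master's health before database operations: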
java
package juwatech.cn.service;

import juwatech.cn.datasource.DataSourceContext;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final juwatech.cn.db.HealthChecker healthChecker;

    public OrderService(juwatech.cn.db.HealthChecker healthChecker) {
        this.healthChecker = healthChecker;
    }

    public void createOrder(String orderId) {
        if (!healthChecker.isMasterDbHealthy()) {
            // Temporarily switch to the replica (queries only) and reject the write.
            // The key is thread-local, so on pooled threads it should be reset
            // when the request ends.
            DataSourceContext.setSlave();
            throw new IllegalStateException("Master DB down, write operation blocked");
        }
        DataSourceContext.setMaster();
        // Perform the write
    }
}
Health check implementation:
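One caveat with a thread-local routing key is that on pooled threads it outlives the request unless it is reset. A minimal sketch of one way to scope the switch, assuming plain JDK only: `ScopedDataSourceContext`, `Scope`, `use`, and `onReplica` are hypothetical names, not part of the system described here.

```java
import java.util.function.Supplier;

// Hypothetical scoped variant of the routing context (illustration only).
final class ScopedDataSourceContext {
    private static final ThreadLocal<String> HOLDER = new ThreadLocal<>();

    private ScopedDataSourceContext() {
    }

    // AutoCloseable without a checked exception, so try-with-resources stays terse.
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    // Current routing key; falls back to "master" when nothing is set.
    static String current() {
        String key = HOLDER.get();
        return key != null ? key : "master";
    }

    // Switches the current thread to the given key until the returned scope is closed.
    static Scope use(String key) {
        String previous = HOLDER.get();
        HOLDER.set(key);
        return () -> {
            if (previous == null) {
                HOLDER.remove(); // leave no stale value behind on a pooled thread
            } else {
                HOLDER.set(previous);
            }
        };
    }

    // Runs a read-only query against the replica and always restores the previous key.
    static <T> T onReplica(Supplier<T> query) {
        try (Scope ignored = use("slave")) {
            return query.get();
        }
    }
}
```

Because the scope is closed by try-with-resources, the previous key is restored even when the query throws, so a failed request cannot leave a worker thread pinned to the replica.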
java
package juwatech.cn.db;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Component
public class HealthChecker {

    private final JdbcTemplate masterJdbcTemplate;

    public HealthChecker(JdbcTemplate masterJdbcTemplate) {
        this.masterJdbcTemplate = masterJdbcTemplate;
    }

    public boolean isMasterDbHealthy() {
        try {
            masterJdbcTemplate.queryForObject("SELECT 1", Integer.class);
            return true;
        } catch (Exception e) {
            juwatech.cn.util.AsyncLogger.logAsync("Master DB health check failed: " + e.getMessage());
            return false;
        }
    }
}
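Probing the master with `SELECT 1` before every write adds a round trip per request. One possible refinement, sketched here under the assumption that a slightly stale answer is acceptable, is to cache the probe result for a short time. `CachedHealthCheck`, the injected clock, and the TTL parameter are hypothetical and not part of the original system.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical wrapper that caches a health probe result for a short TTL.
final class CachedHealthCheck {
    private final BooleanSupplier probe; // e.g. a "SELECT 1" round trip
    private final LongSupplier clockMs;  // injectable clock, which keeps tests deterministic
    private final long ttlMs;            // how long a probe result is reused

    private boolean checkedOnce;
    private boolean lastResult;
    private long lastCheckedMs;

    CachedHealthCheck(BooleanSupplier probe, LongSupplier clockMs, long ttlMs) {
        this.probe = probe;
        this.clockMs = clockMs;
        this.ttlMs = ttlMs;
    }

    // Returns the cached result, re-probing once the TTL has elapsed.
    synchronized boolean isHealthy() {
        long now = clockMs.getAsLong();
        if (!checkedOnce || now - lastCheckedMs >= ttlMs) {
            lastResult = probe.getAsBoolean();
            lastCheckedMs = now;
            checkedOnce = true;
        }
        return lastResult;
    }
}
```

The trade-off is detection latency: with a TTL of one second, an outage can go unnoticed for up to one second, so the value should be chosen against the failover target for the drill.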
Drill Procedure and Metric Verification
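- Start Chaos Monkey (chaos.enabled=true)
- Trigger high-frequency coupon query requests
- Monitor Prometheus metrics:
  - sentinel_block_requests_total (number of degraded requests)
  - http_server_requests_seconds_count{status="500"} (count of HTTP 500 responses, from which the error rate is derived)
- Verify in Kibana that fallback logs are recorded correctly
- Simulate a master database outage and observe the data source switch logs and the scope of business impact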
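The request-generation and tallying steps above can be sketched as a small load driver. This is an illustrative harness under stated assumptions, not the tooling used in the drill: `DrillRunner`, `Tally`, and the parameters are hypothetical, and the call under test is passed in as a function so the sketch stays free of Spring and Sentinel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

// Hypothetical load driver that fires requests and tallies the outcomes.
final class DrillRunner {

    // Immutable outcome of one drill run.
    static final class Tally {
        final int ok;
        final int fallback;
        final int error;

        Tally(int ok, int fallback, int error) {
            this.ok = ok;
            this.fallback = fallback;
            this.error = error;
        }

        // Share of requests that ended in an exception.
        double errorRate() {
            int total = ok + fallback + error;
            return total == 0 ? 0.0 : (double) error / total;
        }
    }

    // Calls call.apply(i) for i in [0, requests) on a fixed pool and counts results.
    static Tally run(IntFunction<String> call, int requests, int threads, String fallbackValue) {
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger fallback = new AtomicInteger();
        AtomicInteger error = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < requests; i++) {
            final int requestNo = i;
            pool.execute(() -> {
                try {
                    String result = call.apply(requestNo);
                    if (fallbackValue.equals(result)) {
                        fallback.incrementAndGet();
                    } else {
                        ok.incrementAndGet();
                    }
                } catch (RuntimeException e) {
                    error.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("drill did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("drill interrupted", e);
        }
        return new Tally(ok.get(), fallback.get(), error.get());
    }
}
```

In a real drill the function would wrap the coupon query, and the resulting tally could be compared against the Prometheus counters as a cross-check.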
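With these mechanisms, the system can still provide basic service when a node fails, keeping the core path available.
Copyright of this article belongs to the 微赚淘客系统3.0 R&D team. Please credit the source when reposting!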