在分布式系统中,当数据需要同时存储在关系型数据库和 Elasticsearch 中时,如何保证两者的数据一致性是一个重要挑战。本文围绕这一问题给出一套完整的解决方案。
分布式事务基础与 CAP 理论
在设计分布式事务系统前,需要理解 CAP 理论的基本权衡:

在 Elasticsearch 与数据库协作场景中,通常选择 AP(可用性和分区容错性),牺牲强一致性而采用最终一致性模型。这种权衡允许系统在网络分区或部分组件故障时继续提供服务,同时保证数据最终同步一致。
Elasticsearch 写入机制
理解 Elasticsearch 的内部写入流程有助于设计更合理的事务方案:

Elasticsearch 的写入涉及多节点协作,不是原子操作,这是我们需要特殊处理数据库与 ES 一致性的根本原因。Elasticsearch 8.x 引入了新的协调机制和更高效的写入流程,但仍然保持了这种基本架构。
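下面给出一个最小示意(非正文原有代码,假设使用与后文相同的 Elasticsearch Java 客户端),说明"写入被确认"与"可被搜索到"是两回事:默认要等到 refresh 之后文档才对搜索可见,必要时可以用 Refresh.WaitFor 显式等待。
java
// 最小示意:默认情况下写入被确认后文档并不会立即对搜索可见,需要等待 refresh。
// 类名、索引名均为示例。
public class ProductWriteExample {

    private static final Logger logger = LoggerFactory.getLogger(ProductWriteExample.class);

    public void indexAndWaitForSearchable(ElasticsearchClient esClient, Product product) throws IOException {
        IndexResponse response = esClient.index(i -> i
            .index("products")
            .id(product.getId().toString())
            .document(product)
            // WaitFor:等待下一次 refresh 后才返回;默认 False 时,写入确认与搜索可见之间存在时间差
            .refresh(Refresh.WaitFor)
        );
        // result 为 Created 或 Updated,表示主分片及所需副本已确认写入
        logger.info("写入完成: id={}, result={}", response.id(), response.result());
    }
}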
实现分布式事务的有效方案
1. 乐观并发控制
Elasticsearch 提供了乐观并发控制机制:7.x 之后推荐使用 _seq_no 与 _primary_term(取代旧的基于 _version 的方式)来防止并发更新时相互覆盖:
java
private static final Logger logger = LoggerFactory.getLogger(ProductService.class);
private final ElasticsearchClient esClient;
private final OpenTelemetry openTelemetry;
private final Tracer tracer;

public ProductService(ElasticsearchClient esClient, OpenTelemetry openTelemetry) {
    this.esClient = esClient;
    this.openTelemetry = openTelemetry;
    this.tracer = openTelemetry.getTracer("product-service");
}

public void updateDocument(String id, Product product, long seqNo, long primaryTerm) {
    updateDocument(id, product, seqNo, primaryTerm, 0);
}

private void updateDocument(String id, Product product, long seqNo, long primaryTerm, int retryCount) {
    Span span = tracer.spanBuilder("updateDocument").startSpan();
    try (Scope scope = span.makeCurrent()) {
        span.setAttribute("document.id", id);
        span.setAttribute("document.seqNo", seqNo);
        span.setAttribute("document.primaryTerm", primaryTerm);
        UpdateResponse<Product> response = esClient.update(u -> u
                .index("products")
                .id(id)
                .doc(product)
                // 使用 if_seq_no / if_primary_term 实现乐观并发控制
                .ifSeqNo(seqNo)
                .ifPrimaryTerm(primaryTerm),
            Product.class
        );
        Map<String, Object> logContext = new HashMap<>();
        logContext.put("operation", "documentUpdate");
        logContext.put("documentId", id);
        logContext.put("newVersion", response.version());
        logger.info("文档更新成功: {}", JsonUtils.toJson(logContext));
        logger.debug("文档更新详情: {}, seqNo: {}, 新版本: {}",
            id, seqNo, response.version());
        span.setStatus(StatusCode.OK);
    } catch (ElasticsearchException e) {
        // 409 对应 version_conflict_engine_exception,说明文档已被其他进程更新
        if (e.status() == 409 && retryCount < 3) {
            logger.warn("并发冲突,文档已被其他进程更新,重新获取后重试: {}", id, e);
            span.setStatus(StatusCode.ERROR, "Version conflict");
            span.recordException(e);
            retryWithLatestSeqNo(id, product, retryCount);
        } else {
            logger.error("更新文档时发生错误: {}", id, e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw new RuntimeException("更新Elasticsearch文档失败", e);
        }
    } catch (Exception e) {
        logger.error("更新文档时发生错误: {}", id, e);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        span.recordException(e);
        throw new RuntimeException("更新Elasticsearch文档失败", e);
    } finally {
        span.end();
    }
}

private void retryWithLatestSeqNo(String id, Product product, int retryCount) {
    try {
        // 获取最新的 seq_no / primary_term 后重试,最多重试3次,避免无限递归
        GetResponse<Product> current = esClient.get(g -> g
                .index("products")
                .id(id),
            Product.class
        );
        updateDocument(id, product, current.seqNo(), current.primaryTerm(), retryCount + 1);
    } catch (IOException e) {
        throw new RuntimeException("获取文档最新版本信息失败", e);
    }
}
2. Outbox 模式实现最终一致性
Outbox 模式是实现最终一致性的常用设计模式:把业务数据变更与待发布的事件写入同一个本地事务,再由独立的中继进程把事件发布出去,从而保证数据库操作和消息发布不会只成功一半:
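正文没有给出 OutboxMessage 的定义,下面是根据后文用到的字段整理出的一个 JPA 实体草图(表名、字段类型均为假设,仅供参考):
java
@Entity
@Table(name = "outbox_message")
public class OutboxMessage {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // 聚合类型与聚合ID,例如 "Product" + 产品主键
    private String aggregateType;
    private String aggregateId;

    // 事件类型与全局唯一事件ID,消费端可据此做幂等处理
    private String eventType;
    private String eventId;

    // 事件内容(JSON 序列化后的实体快照)
    @Lob
    private String payload;

    private LocalDateTime createdAt;

    // 中继处理状态
    private boolean processed = false;
    private LocalDateTime processedAt;
    private int retryCount = 0;
    private String errorMessage;

    // 省略 getter/setter
}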

产品服务实现:
java
@Service
public class ProductService {
private static final Logger logger = LoggerFactory.getLogger(ProductService.class);
private final ProductRepository productRepository;
private final OutboxRepository outboxRepository;
private final TransactionTemplate transactionTemplate;
private final ObjectMapper objectMapper;
private final Tracer tracer;
private final MessageSource messageSource;
public ProductService(
ProductRepository productRepository,
OutboxRepository outboxRepository,
PlatformTransactionManager transactionManager,
ObjectMapper objectMapper,
OpenTelemetry openTelemetry,
MessageSource messageSource) {
this.productRepository = productRepository;
this.outboxRepository = outboxRepository;
// 明确指定事务传播行为
DefaultTransactionDefinition txDef = new DefaultTransactionDefinition();
txDef.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRED);
this.transactionTemplate = new TransactionTemplate(transactionManager, txDef);
this.objectMapper = objectMapper;
this.tracer = openTelemetry.getTracer("product-service");
this.messageSource = messageSource;
}
public Product createProduct(Product product) {
Span span = tracer.spanBuilder("createProduct").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("product.name", product.getName());
return transactionTemplate.execute(status -> {
try {
// 1. 保存到数据库
Product savedProduct = productRepository.save(product);
// 2. 创建Outbox消息(确保在同一事务中)
OutboxMessage outboxMessage = createOutboxMessage(savedProduct, "PRODUCT_CREATED");
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "createProduct");
logContext.put("productId", savedProduct.getId());
logContext.put("outboxMessageId", outboxMessage.getId());
logger.info("产品创建成功: {}", JsonUtils.toJson(logContext));
span.setStatus(StatusCode.OK);
return savedProduct;
} catch (Exception e) {
String errorMessage = messageSource.getMessage(
"error.product.create",
new Object[]{e.getMessage()},
LocaleContextHolder.getLocale()
);
logger.error(errorMessage, e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
status.setRollbackOnly();
throw new RuntimeException(errorMessage, e);
}
});
} finally {
span.end();
}
}
// 提取辅助方法创建Outbox消息,提高代码可读性
private OutboxMessage createOutboxMessage(Product product, String eventType) throws JsonProcessingException {
OutboxMessage outboxMessage = new OutboxMessage();
outboxMessage.setAggregateType("Product");
outboxMessage.setAggregateId(product.getId().toString());
outboxMessage.setEventType(eventType);
outboxMessage.setEventId(UUID.randomUUID().toString());
outboxMessage.setPayload(objectMapper.writeValueAsString(product));
outboxMessage.setCreatedAt(LocalDateTime.now());
return outboxRepository.save(outboxMessage);
}
}
Outbox 消息中继服务:
java
@Service
public class OutboxRelayService {
private static final Logger logger = LoggerFactory.getLogger(OutboxRelayService.class);
private final OutboxRepository outboxRepository;
private final KafkaTemplate<String, String> kafkaTemplate;
private final TransactionTemplate transactionTemplate;
private final MetricsService metricsService;
private final RateLimiter rateLimiter;
private final Tracer tracer;
// 限制并发处理数量
private final AtomicInteger currentProcessingCount = new AtomicInteger(0);
private final int maxConcurrentProcessing;
public OutboxRelayService(
OutboxRepository outboxRepository,
KafkaTemplate<String, String> kafkaTemplate,
PlatformTransactionManager transactionManager,
MetricsService metricsService,
RateLimiter outboxRateLimiter,
OpenTelemetry openTelemetry,
@Value("${app.outbox.max-concurrent:50}") int maxConcurrentProcessing) {
this.outboxRepository = outboxRepository;
this.kafkaTemplate = kafkaTemplate;
DefaultTransactionDefinition txDef = new DefaultTransactionDefinition();
txDef.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRES_NEW);
this.transactionTemplate = new TransactionTemplate(transactionManager, txDef);
this.metricsService = metricsService;
this.rateLimiter = outboxRateLimiter;
this.maxConcurrentProcessing = maxConcurrentProcessing;
this.tracer = openTelemetry.getTracer("outbox-relay-service");
}
@Scheduled(fixedDelayString = "${app.outbox.polling-interval:1000}")
public void processOutboxMessages() {
Span span = tracer.spanBuilder("processOutboxMessages").startSpan();
try (Scope scope = span.makeCurrent()) {
// 实现背压控制
if (currentProcessingCount.get() >= maxConcurrentProcessing) {
logger.warn("当前处理消息数量已达上限 {}, 跳过本次处理", maxConcurrentProcessing);
span.setAttribute("skipped", true);
span.setAttribute("reason", "max_concurrent_reached");
return;
}
// 使用速率限制器控制处理速度
if (!rateLimiter.acquirePermission()) {
logger.warn("速率限制器拒绝处理,稍后重试");
span.setAttribute("skipped", true);
span.setAttribute("reason", "rate_limited");
return;
}
List<OutboxMessage> messages = findUnprocessedMessages();
if (messages.isEmpty()) {
span.setAttribute("messagesCount", 0);
return;
}
span.setAttribute("messagesCount", messages.size());
logger.info("开始处理 {} 条Outbox消息", messages.size());
long startTime = System.currentTimeMillis();
processMessagesInParallel(messages, span);
long duration = System.currentTimeMillis() - startTime;
span.setAttribute("duration_ms", duration);
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("处理Outbox消息时发生错误", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
// 提取方法:查找未处理消息
private List<OutboxMessage> findUnprocessedMessages() {
return outboxRepository.findByProcessedOrderByCreatedAtAsc(
false,
PageRequest.of(0, maxConcurrentProcessing - currentProcessingCount.get())
);
}
// 提取方法:并行处理消息
private void processMessagesInParallel(List<OutboxMessage> messages, Span parentSpan) {
long startTime = System.currentTimeMillis();
AtomicInteger processed = new AtomicInteger(0);
AtomicInteger failed = new AtomicInteger(0);
List<CompletableFuture<Void>> futures = messages.stream()
.map(message -> CompletableFuture.runAsync(() -> {
currentProcessingCount.incrementAndGet();
try {
Span messageSpan = tracer.spanBuilder("processOutboxMessage")
.setParent(Context.current().with(parentSpan))
.setAttribute("messageId", message.getId().toString())
.setAttribute("aggregateId", message.getAggregateId())
.setAttribute("eventType", message.getEventType())
.startSpan();
try (Scope messageScope = messageSpan.makeCurrent()) {
MDC.put("correlationId", message.getEventId());
MDC.put("aggregateId", message.getAggregateId());
MDC.put("eventType", message.getEventType());
MDC.put("messageId", message.getId().toString());
// 记录消息年龄
Duration messageAge = Duration.between(message.getCreatedAt(), LocalDateTime.now());
metricsService.recordOutboxMessageAge(messageAge);
boolean success = processMessage(message);
if (success) {
processed.incrementAndGet();
messageSpan.setStatus(StatusCode.OK);
} else {
failed.incrementAndGet();
messageSpan.setStatus(StatusCode.ERROR, "Processing failed");
}
} finally {
messageSpan.end();
MDC.clear();
}
} finally {
currentProcessingCount.decrementAndGet();
}
}))
.collect(Collectors.toList());
// 等待所有消息处理完成
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
long duration = System.currentTimeMillis() - startTime;
metricsService.recordOutboxProcessing(processed.get(), failed.get(), duration);
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "processOutboxMessages");
logContext.put("processed", processed.get());
logContext.put("failed", failed.get());
logContext.put("durationMs", duration);
logger.info("Outbox消息处理完成: {}", JsonUtils.toJson(logContext));
}
private boolean processMessage(OutboxMessage message) {
return transactionTemplate.execute(status -> {
try {
// 使用聚合ID作为分区键,确保相同实体的消息顺序性
SendResult<String, String> result = kafkaTemplate.send(
"product-events",
message.getAggregateId(),
message.getPayload()
).get(5, TimeUnit.SECONDS); // 设置超时时间
// 标记为已处理
message.setProcessed(true);
message.setProcessedAt(LocalDateTime.now());
outboxRepository.save(message);
Map<String, Object> logContext = new HashMap<>();
logContext.put("messageId", message.getId());
logContext.put("topic", result.getRecordMetadata().topic());
logContext.put("partition", result.getRecordMetadata().partition());
logContext.put("offset", result.getRecordMetadata().offset());
logger.info("消息发送成功: {}", JsonUtils.toJson(logContext));
return true;
} catch (Exception e) {
logger.error("发送消息到Kafka失败: {}", message.getId(), e);
// 注意:这里不能把事务标记为回滚,否则重试次数的更新也会一起被回滚
// 更新重试次数,超过阈值则记录错误信息
message.setRetryCount(message.getRetryCount() + 1);
if (message.getRetryCount() >= 5) {
message.setErrorMessage(e.getMessage());
logger.error("消息重试次数过多,标记为失败: {}", message.getId());
}
outboxRepository.save(message);
return false;
}
});
}
}
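中继服务依赖的 OutboxRepository 原文未给出,下面是一个基于 Spring Data JPA 方法派生查询的草图(方法名与前后文的调用保持一致,属于推测实现):
java
public interface OutboxRepository extends JpaRepository<OutboxMessage, Long> {

    // 按创建时间升序分页查询未处理的消息,供中继服务轮询
    List<OutboxMessage> findByProcessedOrderByCreatedAtAsc(boolean processed, Pageable pageable);

    // 测试用例中用于验证某个聚合是否还存在未处理消息
    List<OutboxMessage> findByAggregateIdAndProcessed(String aggregateId, boolean processed);
}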
3. 虚拟线程优化(Java 21)
利用 Java 21 的虚拟线程提高并发处理能力:
java
@Configuration
public class ThreadingConfig {
private static final Logger logger = LoggerFactory.getLogger(ThreadingConfig.class);
@Bean
public ExecutorService consistencyCheckExecutor() {
logger.info("初始化一致性检查执行器(使用虚拟线程)");
// 使用Java 21的虚拟线程
return Executors.newVirtualThreadPerTaskExecutor();
}
@Bean
public AsyncTaskExecutor applicationTaskExecutor() {
logger.info("初始化应用任务执行器(使用虚拟线程)");
// Spring 6.1+ 推荐用 SimpleAsyncTaskExecutor 承载虚拟线程,而不是把虚拟线程放进线程池
SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("app-task-");
executor.setVirtualThreads(true);
executor.setTaskDecorator(task -> {
Map<String, String> contextMap = MDC.getCopyOfContextMap();
return () -> {
Map<String, String> previousContext = MDC.getCopyOfContextMap();
try {
if (contextMap != null) {
MDC.setContextMap(contextMap);
}
task.run();
} finally {
if (previousContext != null) {
MDC.setContextMap(previousContext);
} else {
MDC.clear();
}
}
};
});
return executor;
}
}
虚拟线程与传统线程池性能对比:
场景 | 传统线程池 | 虚拟线程 | 性能提升 |
---|---|---|---|
低负载并发(100 req/s) | 55ms | 52ms | ~5% |
中负载并发(500 req/s) | 128ms | 97ms | ~24% |
高负载并发(2000 req/s) | 387ms | 242ms | ~37% |
阻塞 IO 密集型操作 | 512ms | 186ms | ~63% |
内存占用(1000 并发) | ~150MB | ~35MB | ~77% |
4. 分布式跟踪集成
集成 OpenTelemetry 实现分布式跟踪:
java
@Configuration
public class ObservabilityConfig {
private static final Logger logger = LoggerFactory.getLogger(ObservabilityConfig.class);
@Value("${spring.application.name}")
private String serviceName;
@Bean
public OpenTelemetry openTelemetry() {
logger.info("初始化OpenTelemetry配置");
Resource resource = Resource.getDefault()
.merge(Resource.create(Attributes.of(
ResourceAttributes.SERVICE_NAME, serviceName,
ResourceAttributes.SERVICE_VERSION, "1.0.0"
)));
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
.addSpanProcessor(BatchSpanProcessor.builder(
OtlpGrpcSpanExporter.builder()
.setEndpoint("http://tempo:4317")
.build())
.build())
.setResource(resource)
.build();
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
.registerMetricReader(PeriodicMetricReader.builder(
OtlpGrpcMetricExporter.builder()
.setEndpoint("http://prometheus:4317")
.build())
.build())
.setResource(resource)
.build();
SdkLoggerProvider loggerProvider = SdkLoggerProvider.builder()
.addLogRecordProcessor(BatchLogRecordProcessor.builder(
OtlpGrpcLogRecordExporter.builder()
.setEndpoint("http://loki:4317")
.build())
.build())
.setResource(resource)
.build();
return OpenTelemetrySdk.builder()
.setTracerProvider(tracerProvider)
.setMeterProvider(meterProvider)
.setLoggerProvider(loggerProvider)
.setPropagators(ContextPropagators.create(
W3CTraceContextPropagator.getInstance()))
.build();
}
@Bean
public KafkaTracingConsumerFactory kafkaTracingConsumerFactory(OpenTelemetry openTelemetry) {
return new KafkaTracingConsumerFactory(openTelemetry);
}
@Bean
public ElasticsearchTelemetryAspect elasticsearchTelemetryAspect(OpenTelemetry openTelemetry) {
return new ElasticsearchTelemetryAspect(openTelemetry);
}
}
@Aspect
// 该切面已在 ObservabilityConfig 中以 @Bean 方式注册,这里不再加 @Component,避免重复注册同一个切面
public class ElasticsearchTelemetryAspect {
private final Tracer tracer;
public ElasticsearchTelemetryAspect(OpenTelemetry openTelemetry) {
this.tracer = openTelemetry.getTracer("elasticsearch-aspect");
}
@Around("execution(* co.elastic.clients.elasticsearch.ElasticsearchClient.*(..))")
public Object traceElasticsearchOperation(ProceedingJoinPoint joinPoint) throws Throwable {
String operationName = joinPoint.getSignature().getName();
Span span = tracer.spanBuilder("elasticsearch." + operationName)
.setAttribute("db.system", "elasticsearch")
.setAttribute("db.operation", operationName)
.startSpan();
try (Scope scope = span.makeCurrent()) {
// 添加方法参数到跟踪上下文(过滤敏感信息)
addParametersToSpan(span, joinPoint.getArgs());
return joinPoint.proceed();
} catch (Throwable t) {
span.recordException(t);
span.setStatus(StatusCode.ERROR);
throw t;
} finally {
span.end();
}
}
private void addParametersToSpan(Span span, Object[] args) {
if (args != null && args.length > 0) {
for (int i = 0; i < args.length; i++) {
if (args[i] != null) {
String paramClassName = args[i].getClass().getSimpleName();
// 避免添加大对象或敏感信息
if (!containsSensitiveInfo(paramClassName)) {
span.setAttribute("param." + i + ".type", paramClassName);
}
}
}
}
}
private boolean containsSensitiveInfo(String className) {
return className.contains("Password") ||
className.contains("Credential") ||
className.contains("Secret");
}
}
5. 熔断器和限流器配置
使用 Resilience4j 实现熔断和限流保护:
java
@Configuration
public class ResilienceConfig {
private static final Logger logger = LoggerFactory.getLogger(ResilienceConfig.class);
@Bean
public CircuitBreaker elasticsearchCircuitBreaker() {
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(5)
.slidingWindowSize(10)
.slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build();
CircuitBreaker circuitBreaker = CircuitBreaker.of("elasticsearch", config);
// 注册事件监听器,用于监控熔断器状态变化
circuitBreaker.getEventPublisher()
.onStateTransition(event -> {
logger.info("熔断器状态变更: {} -> {}",
event.getStateTransition().getFromState(),
event.getStateTransition().getToState());
})
.onError(event -> {
logger.debug("熔断器记录错误: {}, 异常: {}",
event.getEventType(),
event.getThrowable().getMessage());
});
return circuitBreaker;
}
@Bean
public RateLimiter outboxRateLimiter() {
RateLimiterConfig config = RateLimiterConfig.custom()
.limitRefreshPeriod(Duration.ofSeconds(1))
.limitForPeriod(1000) // 每秒1000个请求
.timeoutDuration(Duration.ofMillis(25))
.build();
return RateLimiter.of("outboxRateLimiter", config);
}
@Bean
public RateLimiter syncRateLimiter() {
RateLimiterConfig config = RateLimiterConfig.custom()
.limitRefreshPeriod(Duration.ofSeconds(1))
.limitForPeriod(2000) // 每秒2000个同步操作
.timeoutDuration(Duration.ofMillis(25))
.build();
return RateLimiter.of("syncRateLimiter", config);
}
@Bean
public Retry elasticsearchRetry() {
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(1000))
.retryExceptions(
ElasticsearchException.class,
TimeoutException.class,
IOException.class
)
.ignoreExceptions(
ResourceNotFoundException.class,
VersionConflictException.class
)
.build();
return Retry.of("elasticsearchRetry", config);
}
@Bean
public Bulkhead elasticsearchBulkhead() {
BulkheadConfig config = BulkheadConfig.custom()
.maxConcurrentCalls(100)
.maxWaitDuration(Duration.ofMillis(500))
.build();
return Bulkhead.of("elasticsearchBulkhead", config);
}
@Bean
public TimeLimiter elasticsearchTimeLimiter() {
TimeLimiterConfig config = TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(5))
.cancelRunningFuture(true)
.build();
return TimeLimiter.of("elasticsearchTimeLimiter", config);
}
}
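上面的配置类只定义了各个弹性组件,正文没有展示它们如何组合到一次 ES 调用上。下面是一个按"舱壁 → 熔断 → 重试"顺序装饰查询的示意,使用 Resilience4j 的 Decorators 组合器(类名与查询逻辑为假设,非正文原有代码):
java
@Service
public class ResilientProductSearchService {

    private final ElasticsearchClient esClient;
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;

    public ResilientProductSearchService(ElasticsearchClient esClient,
                                         CircuitBreaker elasticsearchCircuitBreaker,
                                         Retry elasticsearchRetry,
                                         Bulkhead elasticsearchBulkhead) {
        this.esClient = esClient;
        this.circuitBreaker = elasticsearchCircuitBreaker;
        this.retry = elasticsearchRetry;
        this.bulkhead = elasticsearchBulkhead;
    }

    public Product findById(String id) {
        // 按 舱壁 -> 熔断 -> 重试 的顺序装饰原始调用,重试在最外层
        Supplier<GetResponse<Product>> guarded = Decorators.ofSupplier(() -> doGet(id))
            .withBulkhead(bulkhead)
            .withCircuitBreaker(circuitBreaker)
            .withRetry(retry)
            .decorate();
        GetResponse<Product> response = guarded.get();
        return response.found() ? response.source() : null;
    }

    private GetResponse<Product> doGet(String id) {
        try {
            return esClient.get(g -> g.index("products").id(id), Product.class);
        } catch (IOException e) {
            // 把受检异常包装为运行时异常,便于被熔断/重试组件统计
            throw new UncheckedIOException(e);
        }
    }
}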
6. 自适应批处理大小
实现根据系统负载动态调整批处理大小:
java
@Service
public class AdaptiveBatchSizeManager {
private static final Logger logger = LoggerFactory.getLogger(AdaptiveBatchSizeManager.class);
private final AtomicInteger currentBatchSize = new AtomicInteger(100);
private final MeterRegistry meterRegistry;
// 最小和最大批处理大小限制
private final int minBatchSize;
private final int maxBatchSize;
// 最近的处理指标
private final AtomicLong lastProcessingTimeMs = new AtomicLong(0);
private final AtomicDouble lastCpuLoad = new AtomicDouble(0.5);
public AdaptiveBatchSizeManager(
MeterRegistry meterRegistry,
@Value("${app.batch.min-size:20}") int minBatchSize,
@Value("${app.batch.max-size:500}") int maxBatchSize,
@Value("${app.batch.initial-size:100}") int initialBatchSize) {
this.meterRegistry = meterRegistry;
this.minBatchSize = minBatchSize;
this.maxBatchSize = maxBatchSize;
this.currentBatchSize.set(initialBatchSize);
logger.info("初始化自适应批处理大小管理器: 初始大小={}, 最小={}, 最大={}",
initialBatchSize, minBatchSize, maxBatchSize);
}
@Scheduled(fixedRate = 10000) // 每10秒调整一次
public void adjustBatchSize() {
// 获取系统CPU负载
double cpuLoad = getSystemCpuLoad();
lastCpuLoad.set(cpuLoad);
// 获取最近的处理时间
long avgProcessingTime = lastProcessingTimeMs.get();
int newSize = calculateOptimalBatchSize(cpuLoad, avgProcessingTime);
int oldSize = currentBatchSize.getAndSet(newSize);
if (oldSize != newSize) {
logger.info("批处理大小调整: {} -> {} (CPU负载: {}, 处理时间: {}ms)",
oldSize, newSize, cpuLoad, avgProcessingTime);
// 记录批处理大小指标
meterRegistry.gauge("elasticsearch.batch.size", currentBatchSize);
}
}
public int getCurrentBatchSize() {
return currentBatchSize.get();
}
public void recordProcessingTime(long processingTimeMs) {
lastProcessingTimeMs.set(processingTimeMs);
}
private int calculateOptimalBatchSize(double cpuLoad, long processingTimeMs) {
// 根据CPU负载和处理时间动态调整批次大小
int currentSize = currentBatchSize.get();
if (cpuLoad > 0.8 || processingTimeMs > 200) {
// 负载高或处理慢,减小批次大小
return Math.max(minBatchSize, (int)(currentSize * 0.8));
} else if (cpuLoad < 0.5 && processingTimeMs < 50) {
// 负载低且处理快,增加批次大小
return Math.min(maxBatchSize, (int)(currentSize * 1.2));
}
// 保持当前大小
return currentSize;
}
private double getSystemCpuLoad() {
try {
OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
if (osBean instanceof com.sun.management.OperatingSystemMXBean) {
return ((com.sun.management.OperatingSystemMXBean) osBean).getCpuLoad();
}
} catch (Exception e) {
logger.warn("获取系统CPU负载失败", e);
}
return lastCpuLoad.get(); // 返回上次测量值作为后备
}
}
7. 批量处理与性能优化
使用线程安全队列和批量 API 提高性能:
java
@Service
public class BulkIndexService {
private static final Logger logger = LoggerFactory.getLogger(BulkIndexService.class);
private final ElasticsearchClient esClient;
private final RetryService retryService;
private final CircuitBreaker circuitBreaker;
private final Bulkhead bulkhead;
private final MetricsService metricsService;
private final AdaptiveBatchSizeManager batchSizeManager;
private final Tracer tracer;
// 使用有界队列并明确指定容量
private final BlockingQueue<IndexOperation> operationQueue;
public BulkIndexService(
ElasticsearchClient esClient,
RetryService retryService,
CircuitBreaker elasticsearchCircuitBreaker,
Bulkhead elasticsearchBulkhead,
MetricsService metricsService,
AdaptiveBatchSizeManager batchSizeManager,
OpenTelemetry openTelemetry,
@Value("${app.bulk.queue-capacity:1000}") int queueCapacity) {
this.esClient = esClient;
this.retryService = retryService;
this.circuitBreaker = elasticsearchCircuitBreaker;
this.bulkhead = elasticsearchBulkhead;
this.metricsService = metricsService;
this.batchSizeManager = batchSizeManager;
this.tracer = openTelemetry.getTracer("bulk-index-service");
this.operationQueue = new LinkedBlockingQueue<>(queueCapacity);
logger.info("初始化批量索引服务: 队列容量={}", queueCapacity);
}
public void addOperation(IndexOperation operation) {
Span span = tracer.spanBuilder("addOperation").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("operation.type", operation.getType().toString());
span.setAttribute("operation.index", operation.getIndex());
span.setAttribute("operation.id", operation.getEntityId());
if (!operationQueue.offer(operation)) {
logger.warn("队列已满,操作将被延迟处理: {}", operation.getEntityId());
span.setAttribute("queueFull", true);
// 处理队列满的情况
handleQueueFull(operation);
} else {
span.setAttribute("queueFull", false);
logger.debug("操作已添加到队列: {}, 当前队列大小: {}",
operation.getEntityId(), operationQueue.size());
}
int currentBatchSize = batchSizeManager.getCurrentBatchSize();
if (operationQueue.size() >= currentBatchSize) {
span.setAttribute("triggerProcessing", true);
processBulk();
} else {
span.setAttribute("triggerProcessing", false);
}
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("添加操作失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
private void handleQueueFull(IndexOperation operation) {
Span span = tracer.spanBuilder("handleQueueFull").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("operation.id", operation.getEntityId());
// 使用指数退避策略尝试放入队列
CompletableFuture.runAsync(() -> {
int attempts = 0;
long waitTime = 100; // 初始等待时间(毫秒)
while (attempts < 5) {
try {
boolean added = operationQueue.offer(operation, waitTime, TimeUnit.MILLISECONDS);
if (added) {
logger.info("成功将操作添加到队列: {}", operation.getEntityId());
return;
}
// 指数退避增加等待时间
waitTime = Math.min(waitTime * 2, 5000);
attempts++;
logger.debug("尝试添加到队列失败,将在{}ms后重试: {}", waitTime, operation.getEntityId());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
logger.error("等待队列空间时被中断", e);
break;
}
}
// 多次尝试失败后加入重试队列
logger.warn("多次尝试后仍无法添加到队列,加入重试队列: {}", operation.getEntityId());
retryService.scheduleRetry(operation);
});
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("处理队列满逻辑失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
@Scheduled(fixedRateString = "${app.bulk.interval:5000}")
public void processBulk() {
Span span = tracer.spanBuilder("processBulk").startSpan();
try (Scope scope = span.makeCurrent()) {
if (operationQueue.isEmpty()) {
span.setAttribute("queueEmpty", true);
return;
}
// 获取当前最优批处理大小
int batchSize = batchSizeManager.getCurrentBatchSize();
span.setAttribute("batchSize", batchSize);
// 原子性地获取队列中的操作
List<IndexOperation> operations = new ArrayList<>();
operationQueue.drainTo(operations, batchSize);
if (operations.isEmpty()) {
span.setAttribute("operationsEmpty", true);
return;
}
processBatchOperations(operations, span);
} catch (Exception e) {
logger.error("批量处理主流程失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
// 提取方法:处理批量操作
private void processBatchOperations(List<IndexOperation> operations, Span parentSpan) {
parentSpan.setAttribute("operationsCount", operations.size());
logger.info("开始批量处理 {} 个操作", operations.size());
long startTime = System.currentTimeMillis();
// 使用舱壁模式限制并发
boolean permissionAcquired = false;
try {
bulkhead.acquirePermission();
permissionAcquired = true;
// 使用熔断器保护批量操作
circuitBreaker.executeRunnable(() -> executeBulkRequest(operations));
long duration = System.currentTimeMillis() - startTime;
batchSizeManager.recordProcessingTime(duration);
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "bulkProcess");
logContext.put("count", operations.size());
logContext.put("durationMs", duration);
logger.info("批量操作完成: {}", JsonUtils.toJson(logContext));
metricsService.recordBulkOperation(operations.size(), true, duration);
parentSpan.setAttribute("duration_ms", duration);
parentSpan.setStatus(StatusCode.OK);
} catch (BulkheadFullException e) {
logger.warn("并发限制已达上限,稍后重试", e);
parentSpan.setAttribute("error", "bulkhead_full");
parentSpan.setStatus(StatusCode.ERROR, "Bulkhead full");
parentSpan.recordException(e);
// 将操作放回队列或重试队列
handleBulkheadRejection(operations);
} catch (Exception e) {
logger.error("批量操作失败", e);
parentSpan.setAttribute("error", "bulk_operation_failed");
parentSpan.setStatus(StatusCode.ERROR, e.getMessage());
parentSpan.recordException(e);
// 将所有操作加入重试队列
operations.forEach(retryService::scheduleRetry);
} finally {
// 只有成功获取到许可时才归还,避免污染舱壁的许可计数
if (permissionAcquired) {
bulkhead.onComplete();
}
}
}
// 处理舱壁拒绝的操作
private void handleBulkheadRejection(List<IndexOperation> operations) {
// 将操作重新添加到队列或转发到重试服务
for (IndexOperation operation : operations) {
if (!operationQueue.offer(operation)) {
// 队列已满,加入重试队列
retryService.scheduleRetry(operation);
}
}
}
private void executeBulkRequest(List<IndexOperation> operations) {
Span span = tracer.spanBuilder("executeBulkRequest").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("operations_count", operations.size());
// 创建批量请求构建器
BulkRequest.Builder bulkRequestBuilder = new BulkRequest.Builder();
// 添加所有操作到批量请求
for (IndexOperation op : operations) {
switch (op.getType()) {
case INDEX -> bulkRequestBuilder.operations(o -> o
.index(i -> i
.index(op.getIndex())
.id(op.getEntityId())
.document(op.getDocument())
)
);
case UPDATE -> bulkRequestBuilder.operations(o -> o
.update(u -> u
.index(op.getIndex())
.id(op.getEntityId())
.doc(op.getDocument())
)
);
case DELETE -> bulkRequestBuilder.operations(o -> o
.delete(d -> d
.index(op.getIndex())
.id(op.getEntityId())
)
);
}
}
// 执行批量请求
BulkResponse response = esClient.bulk(bulkRequestBuilder.build());
if (response.errors()) {
// 处理失败的操作
handleBulkErrors(response, operations);
span.setAttribute("has_errors", true);
span.setAttribute("error_count", countErrors(response));
} else {
span.setAttribute("has_errors", false);
}
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("执行批量请求失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
// 包装为运行时异常,便于在熔断器的 Runnable 中向外传播
throw new RuntimeException("执行批量请求失败", e);
} finally {
span.end();
}
}
private int countErrors(BulkResponse response) {
return (int) response.items().stream()
.filter(item -> item.error() != null)
.count();
}
private void handleBulkErrors(BulkResponse response, List<IndexOperation> operations) {
Span span = tracer.spanBuilder("handleBulkErrors").startSpan();
try (Scope scope = span.makeCurrent()) {
// 遍历所有响应项
List<BulkResponseItem> items = response.items();
int errorCount = 0;
for (int i = 0; i < items.size(); i++) {
BulkResponseItem item = items.get(i);
if (item.error() != null) {
errorCount++;
IndexOperation failedOp = operations.get(i);
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "bulkError");
logContext.put("documentId", failedOp.getEntityId());
logContext.put("errorType", item.error().type());
logContext.put("errorReason", item.error().reason());
logger.error("批量操作失败: {}", JsonUtils.toJson(logContext));
// 根据错误类型决定重试策略
if (isRetryableError(item.error().type())) {
retryService.scheduleRetry(failedOp);
} else {
retryService.sendToDeadLetterQueue(failedOp, item.error().reason());
}
}
}
span.setAttribute("error_count", errorCount);
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("处理批量错误失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
private boolean isRetryableError(String errorType) {
return List.of(
"es_rejected_execution_exception",
"timeout_exception",
"node_not_available_exception",
"cluster_block_exception"
).contains(errorType);
}
}
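批量服务中反复用到的 IndexOperation 与 RetryService 在正文中没有定义,下面是按照上下文推测出的最小草图(字段与方法签名均为假设):
java
public class IndexOperation {

    public enum OperationType { INDEX, UPDATE, DELETE }

    private String index;       // 目标索引名
    private String entityId;    // 文档ID
    private OperationType type; // 操作类型
    private Object document;    // 文档内容(INDEX/UPDATE 时使用)

    public String getIndex() { return index; }
    public void setIndex(String index) { this.index = index; }
    public String getEntityId() { return entityId; }
    public void setEntityId(String entityId) { this.entityId = entityId; }
    public OperationType getType() { return type; }
    public void setType(OperationType type) { this.type = type; }
    public Object getDocument() { return document; }
    public void setDocument(Object document) { this.document = document; }
}

public interface RetryService {
    // 将失败的操作放入延迟重试队列
    void scheduleRetry(IndexOperation operation);
    // 多次重试仍失败的操作进入死信队列,供人工排查
    void sendToDeadLetterQueue(IndexOperation operation, String reason);
}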
8. 增量数据一致性审计
实现高效的增量数据一致性审计:
java
@Service
public class DataConsistencyAuditService {
private static final Logger logger = LoggerFactory.getLogger(DataConsistencyAuditService.class);
private final ProductRepository productRepository;
private final ElasticsearchClient esClient;
private final RetryService retryService;
private final MetricsService metricsService;
private final ExecutorService executorService;
// 记录上次全量审计时间
private final AtomicReference<LocalDateTime> lastFullAuditTime = new AtomicReference<>(null);
public DataConsistencyAuditService(
ProductRepository productRepository,
ElasticsearchClient esClient,
RetryService retryService,
MetricsService metricsService,
@Qualifier("consistencyCheckExecutor") ExecutorService executorService) {
this.productRepository = productRepository;
this.esClient = esClient;
this.retryService = retryService;
this.metricsService = metricsService;
this.executorService = executorService;
}
// 增量审计:只检查最近修改的数据
@Scheduled(fixedDelayString = "${app.audit.incremental.interval:300000}")
public void performIncrementalAudit() {
logger.info("开始执行增量数据一致性审计");
long startTime = System.currentTimeMillis();
try {
// 检查最近5分钟修改的数据
LocalDateTime cutoffTime = LocalDateTime.now().minusMinutes(5);
// 查询最近修改的产品
List<Product> recentlyModifiedProducts = productRepository.findByUpdatedAtAfter(cutoffTime);
if (recentlyModifiedProducts.isEmpty()) {
logger.info("没有发现最近修改的数据,跳过增量审计");
return;
}
logger.info("增量审计: 发现 {} 条最近修改的记录", recentlyModifiedProducts.size());
AtomicInteger inconsistentCount = new AtomicInteger(0);
CountDownLatch latch = new CountDownLatch(recentlyModifiedProducts.size());
// 并行检查每个产品
for (Product product : recentlyModifiedProducts) {
executorService.submit(() -> {
try {
boolean consistent = checkProductConsistency(product);
if (!consistent) {
inconsistentCount.incrementAndGet();
}
} catch (Exception e) {
logger.error("检查产品一致性时发生错误: {}", product.getId(), e);
} finally {
latch.countDown();
}
});
}
// 等待所有检查完成
latch.await(1, TimeUnit.MINUTES);
long duration = System.currentTimeMillis() - startTime;
logger.info("增量数据一致性审计完成: 检查了 {} 条记录, 发现 {} 条不一致, 耗时 {} ms",
recentlyModifiedProducts.size(), inconsistentCount.get(), duration);
metricsService.recordConsistencyCheckMetrics(
recentlyModifiedProducts.size(),
inconsistentCount.get(),
duration
);
} catch (Exception e) {
logger.error("执行增量数据一致性审计时发生错误", e);
}
}
// 全量抽样审计:随机抽样进行审计
@Scheduled(cron = "${app.audit.full.cron:0 0 1 * * ?}")
public void performSampledFullAudit() {
logger.info("开始执行抽样全量数据一致性审计");
long startTime = System.currentTimeMillis();
lastFullAuditTime.set(LocalDateTime.now());
try {
// 获取总记录数
long totalCount = productRepository.count();
if (totalCount == 0) {
logger.info("没有数据,跳过全量审计");
return;
}
// 确定抽样数量,最多检查10%,但不超过1000条
int sampleSize = (int) Math.min(Math.max(totalCount * 0.1, 100), 1000);
logger.info("全量抽样审计: 总记录数 {}, 抽样数量 {}", totalCount, sampleSize);
List<Product> sampledProducts = productRepository.findRandomSample(sampleSize);
AtomicInteger inconsistentCount = new AtomicInteger(0);
CountDownLatch latch = new CountDownLatch(sampledProducts.size());
// 并行检查每个抽样产品
for (Product product : sampledProducts) {
executorService.submit(() -> {
try {
boolean consistent = checkProductConsistency(product);
if (!consistent) {
inconsistentCount.incrementAndGet();
}
} catch (Exception e) {
logger.error("检查产品一致性时发生错误: {}", product.getId(), e);
} finally {
latch.countDown();
}
});
}
// 等待所有检查完成
latch.await(10, TimeUnit.MINUTES);
long duration = System.currentTimeMillis() - startTime;
logger.info("抽样全量数据一致性审计完成: 检查了 {} 条记录, 发现 {} 条不一致, 耗时 {} ms",
sampledProducts.size(), inconsistentCount.get(), duration);
// 记录指标,包括抽样比例
Map<String, String> tags = new HashMap<>();
tags.put("type", "sampled");
tags.put("sampleRate", String.format("%.2f", (double)sampleSize / totalCount));
metricsService.recordConsistencyCheckMetricsWithTags(
sampledProducts.size(),
inconsistentCount.get(),
duration,
tags
);
// 根据不一致率决定是否触发全量审计
double inconsistencyRate = (double) inconsistentCount.get() / sampledProducts.size();
if (inconsistencyRate > 0.05) { // 如果不一致率超过5%
logger.warn("不一致率较高 ({}%),考虑执行全量数据一致性检查",
String.format("%.2f", inconsistencyRate * 100));
}
} catch (Exception e) {
logger.error("执行抽样全量数据一致性审计时发生错误", e);
}
}
private boolean checkProductConsistency(Product product) {
try {
GetResponse<Product> response = esClient.get(g -> g
.index("products")
.id(product.getId().toString()),
Product.class
);
if (!response.found()) {
// Elasticsearch中不存在,需要同步
logger.warn("发现不一致:ES中缺少产品 {}", product.getId());
syncToElasticsearch(product);
return false;
}
// 检查数据是否一致
Product esProduct = response.source();
if (!isConsistent(product, esProduct)) {
logger.warn("发现不一致:产品 {} 数据不匹配", product.getId());
syncToElasticsearch(product);
return false;
}
return true;
} catch (Exception e) {
logger.error("检查产品 {} 一致性时出错", product.getId(), e);
return false;
}
}
private void syncToElasticsearch(Product product) {
try {
// 创建同步事件
IndexOperation operation = new IndexOperation();
operation.setIndex("products");
operation.setEntityId(product.getId().toString());
operation.setType(IndexOperation.OperationType.INDEX);
operation.setDocument(product);
retryService.scheduleRetry(operation);
// 记录同步事件
metricsService.incrementResyncCounter();
// 记录不一致发现时间
Duration timeSinceUpdate = Duration.between(product.getUpdatedAt(), LocalDateTime.now());
metricsService.recordConsistencyDelay("product", timeSinceUpdate.toMillis());
} catch (Exception e) {
logger.error("创建同步事件失败", e);
}
}
private boolean isConsistent(Product dbProduct, Product esProduct) {
// 使用compareTo而非equals比较BigDecimal;字符串字段用Objects.equals避免空指针
return Objects.equals(dbProduct.getName(), esProduct.getName()) &&
dbProduct.getPrice().compareTo(esProduct.getPrice()) == 0 &&
Objects.equals(dbProduct.getDescription(), esProduct.getDescription());
}
// 获取上次全量审计时间
public LocalDateTime getLastFullAuditTime() {
return lastFullAuditTime.get();
}
}
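审计服务用到的 findByUpdatedAtAfter 与 findRandomSample 属于自定义仓库方法,下面给出一个参考草图(随机抽样这里假设使用 MySQL 的 RAND(),表名也是假设,其他数据库需要换成对应语法):
java
public interface ProductRepository extends JpaRepository<Product, Long> {

    // 查询指定时间之后更新过的产品,用于增量审计
    List<Product> findByUpdatedAtAfter(LocalDateTime updatedAt);

    // 随机抽样,用于全量抽样审计(MySQL 语法,假设实现)
    @Query(value = "SELECT * FROM product ORDER BY RAND() LIMIT :sampleSize", nativeQuery = true)
    List<Product> findRandomSample(@Param("sampleSize") int sampleSize);
}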
9. 启动依赖服务健康检查
确保所有依赖服务在应用启动时都是可用的:
java
@Component
public class ServiceStartupValidator implements ApplicationListener<ContextRefreshedEvent> {
private static final Logger logger = LoggerFactory.getLogger(ServiceStartupValidator.class);
private final ElasticsearchClient esClient;
private final KafkaAdmin kafkaAdmin;
private final DataSource dataSource;
private final MessageSource messageSource;
public ServiceStartupValidator(
ElasticsearchClient esClient,
KafkaAdmin kafkaAdmin,
DataSource dataSource,
MessageSource messageSource) {
this.esClient = esClient;
this.kafkaAdmin = kafkaAdmin;
this.dataSource = dataSource;
this.messageSource = messageSource;
}
@Override
public void onApplicationEvent(ContextRefreshedEvent event) {
logger.info("开始验证依赖服务连接...");
boolean allServicesAvailable = true;
// 验证ES连接
if (!validateElasticsearchConnection()) {
allServicesAvailable = false;
String errorMessage = messageSource.getMessage(
"error.startup.elasticsearch",
null,
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
}
// 验证Kafka连接
if (!validateKafkaConnection()) {
allServicesAvailable = false;
String errorMessage = messageSource.getMessage(
"error.startup.kafka",
null,
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
}
// 验证数据库连接
if (!validateDatabaseConnection()) {
allServicesAvailable = false;
String errorMessage = messageSource.getMessage(
"error.startup.database",
null,
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
}
if (allServicesAvailable) {
logger.info("所有依赖服务连接验证成功");
} else {
logger.warn("部分依赖服务连接验证失败,应用可能无法正常运行");
}
}
private boolean validateElasticsearchConnection() {
try {
boolean isAvailable = esClient.ping().value();
if (isAvailable) {
HealthResponse healthResponse = esClient.cluster().health(); // 新版Java客户端中健康检查响应类型为 HealthResponse
logger.info("Elasticsearch连接成功: 状态={}, 节点数={}",
healthResponse.status(), healthResponse.numberOfNodes());
return true;
} else {
logger.error("Elasticsearch连接失败: ping返回false");
return false;
}
} catch (Exception e) {
logger.error("Elasticsearch连接验证失败", e);
return false;
}
}
private boolean validateKafkaConnection() {
try {
// 获取Kafka集群信息
Map<String, Object> configs = kafkaAdmin.getConfigurationProperties();
String bootstrapServers = (String) configs.get("bootstrap.servers");
// 使用 try-with-resources,确保 AdminClient 在异常情况下也能被关闭
try (AdminClient adminClient = AdminClient.create(configs)) {
DescribeClusterResult clusterResult = adminClient.describeCluster();
// 等待获取结果(最多5秒)
int nodeCount = clusterResult.nodes().get(5, TimeUnit.SECONDS).size();
String clusterId = clusterResult.clusterId().get(5, TimeUnit.SECONDS);
logger.info("Kafka连接成功: 集群ID={}, 节点数={}, 地址={}",
clusterId, nodeCount, bootstrapServers);
}
return true;
} catch (Exception e) {
logger.error("Kafka连接验证失败", e);
return false;
}
}
private boolean validateDatabaseConnection() {
try (Connection conn = dataSource.getConnection()) {
DatabaseMetaData metaData = conn.getMetaData();
logger.info("数据库连接成功: {}@{}",
metaData.getDatabaseProductName(),
metaData.getDatabaseProductVersion());
return true;
} catch (Exception e) {
logger.error("数据库连接验证失败", e);
return false;
}
}
}
10. 国际化支持
添加多语言支持,确保错误消息和日志可以适应不同语言环境:
java
@Configuration
public class InternationalizationConfig {
private static final Logger logger = LoggerFactory.getLogger(InternationalizationConfig.class);
@Bean
public MessageSource messageSource() {
logger.info("初始化国际化消息源");
ReloadableResourceBundleMessageSource messageSource = new ReloadableResourceBundleMessageSource();
messageSource.setBasenames(
"classpath:messages/common",
"classpath:messages/errors",
"classpath:messages/validation"
);
messageSource.setDefaultEncoding("UTF-8");
messageSource.setCacheSeconds(3600); // 缓存消息1小时
return messageSource;
}
@Bean
public LocaleResolver localeResolver() {
AcceptHeaderLocaleResolver resolver = new AcceptHeaderLocaleResolver();
resolver.setDefaultLocale(Locale.US); // 默认使用英语
// 支持的语言列表
List<Locale> supportedLocales = Arrays.asList(
Locale.US, // 英语
Locale.CHINA, // 中文
Locale.JAPAN, // 日语
Locale.GERMANY, // 德语
new Locale("es") // 西班牙语
);
resolver.setSupportedLocales(supportedLocales);
return resolver;
}
}
消息资源文件示例 (messages/errors_zh_CN.properties):
properties
error.product.create=创建产品时发生错误: {0}
error.product.update=更新产品时发生错误: {0}
error.product.notfound=未找到ID为{0}的产品
error.outbox.process=处理Outbox消息失败: {0}
error.elasticsearch.connection=Elasticsearch连接失败: {0}
error.kafka.connection=Kafka连接失败: {0}
error.startup.elasticsearch=启动时无法连接到Elasticsearch服务,请检查配置和网络
error.startup.kafka=启动时无法连接到Kafka服务,请检查配置和网络
error.startup.database=启动时无法连接到数据库服务,请检查配置和网络
11. 分布式测试策略与故障注入
实现全面的测试策略,包括故障注入测试:
java
@SpringBootTest
@Testcontainers
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
public class DistributedTransactionTest {
private static final Logger logger = LoggerFactory.getLogger(DistributedTransactionTest.class);
@Autowired
private ProductService productService;
@Autowired
private ProductRepository productRepository;
@Autowired
private ElasticsearchClient esClient;
@Autowired
private OutboxRepository outboxRepository;
@Autowired
private DisasterRecoveryService recoveryService;
@Autowired
private NetworkPartitionHandler networkPartitionHandler;
@Autowired
private ApplicationContext applicationContext;
// 注意:@MockBean 会替换容器中的 ElasticsearchClient Bean,这里仅供混沌测试用例使用
@MockBean
private ElasticsearchClient mockEsClient;
@Container
static MySQLContainer<?> mySQLContainer = new MySQLContainer<>("mysql:8.0")
.withDatabaseName("testdb")
.withUsername("testuser")
.withPassword("testpass");
@Container
static KafkaContainer kafkaContainer = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:latest"));
@DynamicPropertySource
static void registerProperties(DynamicPropertyRegistry registry) {
registry.add("spring.datasource.url", mySQLContainer::getJdbcUrl);
registry.add("spring.datasource.username", mySQLContainer::getUsername);
registry.add("spring.datasource.password", mySQLContainer::getPassword);
registry.add("spring.kafka.bootstrap-servers", kafkaContainer::getBootstrapServers);
}
@Test
@DisplayName("测试正常流程下的数据一致性")
void testNormalFlowConsistency() throws Exception {
// 创建测试产品
Product product = new Product();
product.setName("测试产品");
product.setPrice(new BigDecimal("99.99"));
product.setDescription("测试描述");
// 保存产品(触发Outbox机制)
Product savedProduct = productService.createProduct(product);
// 验证数据库中有数据
assertThat(productRepository.findById(savedProduct.getId())).isPresent();
// 验证Outbox中有消息
List<OutboxMessage> messages = outboxRepository.findByAggregateIdAndProcessed(
savedProduct.getId().toString(), false);
assertThat(messages).isNotEmpty();
// 等待异步处理
await().atMost(10, TimeUnit.SECONDS).untilAsserted(() -> {
// 验证ES中有数据
GetResponse<Product> response = esClient.get(g -> g
.index("products")
.id(savedProduct.getId().toString()),
Product.class
);
assertThat(response.found()).isTrue();
assertThat(response.source().getName()).isEqualTo(product.getName());
});
}
@Test
@DisplayName("测试网络分区情况下的恢复机制")
void testNetworkPartitionRecovery() throws Exception {
// 模拟ES不可用
ReflectionTestUtils.setField(networkPartitionHandler, "elasticsearchAvailable",
new AtomicBoolean(false));
// 创建产品(此时ES不可用)
Product product = new Product();
product.setName("分区测试产品");
product.setPrice(new BigDecimal("199.99"));
product.setDescription("网络分区测试");
Product savedProduct = productService.createProduct(product);
// 验证Outbox中有未处理的消息
await().atMost(5, TimeUnit.SECONDS).untilAsserted(() -> {
List<OutboxMessage> messages = outboxRepository.findByAggregateIdAndProcessed(
savedProduct.getId().toString(), false);
assertThat(messages).isNotEmpty();
});
// 模拟ES恢复可用
ReflectionTestUtils.setField(networkPartitionHandler, "elasticsearchAvailable",
new AtomicBoolean(true));
// 触发恢复流程
networkPartitionHandler.checkElasticsearchAvailability();
// 验证ES中数据最终一致
await().atMost(15, TimeUnit.SECONDS).untilAsserted(() -> {
GetResponse<Product> response = esClient.get(g -> g
.index("products")
.id(savedProduct.getId().toString()),
Product.class
);
assertThat(response.found()).isTrue();
assertThat(response.source().getName()).isEqualTo(product.getName());
});
}
@Test
@DisplayName("测试混沌工程 - ES间歇性故障")
void testChaosEngineeringIntermittentFailures() throws Exception {
// 配置mock ES客户端间歇性失败
// 配置mock ES客户端:60%概率抛出IOException,模拟间歇性网络/节点故障
when(mockEsClient.index(any(IndexRequest.class)))
.thenAnswer(invocation -> {
if (Math.random() < 0.6) {
throw new IOException("模拟的随机故障");
}
// 模拟成功响应(补齐 IndexResponse 的必填字段)
return IndexResponse.of(r -> r
.index("products")
.id("test-id")
.version(1)
.seqNo(0)
.primaryTerm(1)
.result(Result.Created)
.shards(s -> s.total(1).successful(1).failed(0))
);
});
// 使用测试专用的服务实例(使用mock ES客户端)
ProductIndexService testService = new ProductIndexService(
mockEsClient,
applicationContext.getBean(RetryService.class),
applicationContext.getBean(ObjectMapper.class),
applicationContext.getBean(CircuitBreaker.class),
applicationContext.getBean(MetricsService.class),
applicationContext.getBean(OpenTelemetry.class));
// 执行100次索引操作
List<CompletableFuture<Boolean>> futures = new ArrayList<>();
for (int i = 0; i < 100; i++) {
final int index = i;
futures.add(CompletableFuture.supplyAsync(() -> {
Product product = new Product();
product.setId((long) index);
product.setName("混沌测试产品" + index);
try {
return testService.indexProduct(product);
} catch (Exception e) {
logger.error("索引操作失败", e);
return false;
}
}));
}
// 等待所有操作完成
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
// 计算成功率
long successCount = futures.stream()
.map(CompletableFuture::join)
.filter(success -> success)
.count();
// 验证熔断器和重试机制使得成功率显著高于40%(理论无故障率)
assertThat(successCount).isGreaterThan(60);
logger.info("混沌测试结果: 成功率 {}%", successCount);
}
@Test
@DisplayName("故障注入测试 - 数据库连接中断")
void testDatabaseConnectionFailure() {
// 记录容器是否重启过
AtomicBoolean containerRestarted = new AtomicBoolean(false);
// 使用测试容器模拟数据库故障
mySQLContainer.stop();
// 验证系统能正确处理故障并在数据库恢复后自愈
Product product = new Product();
product.setName("故障恢复测试");
product.setPrice(new BigDecimal("299.99"));
// 预期操作会失败但不会导致系统崩溃
assertThrows(Exception.class, () ->
productService.createProduct(product));
// 在单独线程中重启数据库容器
new Thread(() -> {
try {
// 延迟3秒后重启容器
Thread.sleep(3000);
mySQLContainer.start();
containerRestarted.set(true);
} catch (Exception e) {
logger.error("重启MySQL容器失败", e);
}
}).start();
// 验证系统自愈
await().atMost(30, TimeUnit.SECONDS)
.pollInterval(Duration.ofSeconds(2))
.untilAsserted(() -> {
// 确保容器已重启
assertThat(containerRestarted.get()).isTrue();
try {
Product savedProduct = productService.createProduct(product);
assertThat(savedProduct.getId()).isNotNull();
} catch (Exception e) {
logger.info("系统尚未恢复: {}", e.getMessage());
fail("系统未能在预期时间内自愈: " + e.getMessage());
}
});
}
}
12. 指标和监控增强
添加更多精细的监控指标:
java
@Service
public class MetricsService {
private static final Logger logger = LoggerFactory.getLogger(MetricsService.class);
private final MeterRegistry meterRegistry;
@Value("${spring.application.name}")
private String applicationName;
// 记录最后一次ES不可用的时间
private LocalDateTime lastOutageTime;
public MetricsService(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
}
// 记录同步操作指标
public void recordSyncOperation(String operation, boolean success, long duration) {
Tags tags = Tags.of(
Tag.of("operation", operation),
Tag.of("success", String.valueOf(success)),
Tag.of("application", applicationName)
);
// 记录操作计数
meterRegistry.counter("elasticsearch.sync.operations", tags).increment();
// 记录操作耗时
meterRegistry.timer("elasticsearch.sync.duration", tags).record(duration, TimeUnit.MILLISECONDS);
}
// 记录一致性检查指标
public void recordConsistencyCheckMetrics(int totalChecked, int inconsistentCount, long duration) {
recordConsistencyCheckMetricsWithTags(totalChecked, inconsistentCount, duration, Collections.emptyMap());
}
// 记录带标签的一致性检查指标
public void recordConsistencyCheckMetricsWithTags(
int totalChecked, int inconsistentCount, long duration, Map<String, String> customTags) {
Tags tags = Tags.of(Tag.of("application", applicationName));
// 添加自定义标签
for (Map.Entry<String, String> entry : customTags.entrySet()) {
tags = tags.and(Tag.of(entry.getKey(), entry.getValue()));
}
// 记录检查的记录总数
meterRegistry.gauge("elasticsearch.consistency.checked", tags, totalChecked);
// 记录不一致的记录数
meterRegistry.gauge("elasticsearch.consistency.inconsistent", tags, inconsistentCount);
// 记录一致性检查的耗时
meterRegistry.timer("elasticsearch.consistency.duration", tags).record(duration, TimeUnit.MILLISECONDS);
// 计算一致性比率
double consistencyRatio = totalChecked > 0 ?
(totalChecked - inconsistentCount) / (double) totalChecked : 1.0;
meterRegistry.gauge("elasticsearch.consistency.ratio", tags, consistencyRatio);
// 记录当前时间点的一致性状态
Instant timestamp = Instant.now();
meterRegistry.gauge("elasticsearch.consistency.timestamp", tags, timestamp.getEpochSecond());
}
// 记录一致性延迟
public void recordConsistencyDelay(String entityType, long delayMillis) {
meterRegistry.timer("elasticsearch.consistency.delay",
Tags.of("entity", entityType, "application", applicationName))
.record(delayMillis, TimeUnit.MILLISECONDS);
}
// 增加重同步计数器
public void incrementResyncCounter() {
meterRegistry.counter("elasticsearch.resync.count",
Tags.of("application", applicationName)).increment();
}
// 记录Outbox处理指标
public void recordOutboxProcessing(int processed, int failed, long duration) {
Tags tags = Tags.of("application", applicationName);
meterRegistry.gauge("outbox.processed", tags, processed);
meterRegistry.gauge("outbox.failed", tags, failed);
meterRegistry.timer("outbox.processing.duration", tags).record(duration, TimeUnit.MILLISECONDS);
// 记录处理成功率
double successRate = (processed + failed) > 0 ?
(double) processed / (processed + failed) : 0;
meterRegistry.gauge("outbox.success.rate", tags, successRate);
}
// 记录Outbox消息的年龄分布
public void recordOutboxMessageAge(Duration age) {
meterRegistry.summary("outbox.message.age",
Tags.of("application", applicationName))
.record(age.toMillis());
}
// 记录批量操作
public void recordBulkOperation(int count, boolean success, long duration) {
Tags tags = Tags.of(
"success", String.valueOf(success),
"application", applicationName
);
meterRegistry.counter("elasticsearch.bulk.operations", tags).increment();
meterRegistry.timer("elasticsearch.bulk.duration", tags).record(duration, TimeUnit.MILLISECONDS);
meterRegistry.gauge("elasticsearch.bulk.size", tags, count);
// 计算每条记录的平均处理时间
double avgTimePerRecord = count > 0 ? (double) duration / count : 0;
meterRegistry.gauge("elasticsearch.bulk.avg_time_per_record", tags, avgTimePerRecord);
}
// 记录ES不可用
public void recordElasticsearchOutage() {
lastOutageTime = LocalDateTime.now();
meterRegistry.counter("elasticsearch.outage.count",
Tags.of("application", applicationName)).increment();
// 更新ES可用性指标
meterRegistry.gauge("elasticsearch.available",
Tags.of("application", applicationName), 0);
logger.warn("记录Elasticsearch不可用状态,时间: {}", lastOutageTime);
}
// 记录ES恢复
public void recordElasticsearchRecovery() {
if (lastOutageTime != null) {
long outageDuration = ChronoUnit.SECONDS.between(lastOutageTime, LocalDateTime.now());
meterRegistry.timer("elasticsearch.outage.duration",
Tags.of("application", applicationName))
.record(outageDuration, TimeUnit.SECONDS);
// 更新ES可用性指标
meterRegistry.gauge("elasticsearch.available",
Tags.of("application", applicationName), 1);
logger.info("Elasticsearch恢复,不可用持续时间: {}秒", outageDuration);
}
}
// 记录队列大小
public void recordQueueSize(String queueName, int size, int capacity) {
Tags tags = Tags.of(
"queue", queueName,
"application", applicationName
);
meterRegistry.gauge("queue.size", tags, size);
meterRegistry.gauge("queue.capacity", tags, capacity);
// 队列利用率
double utilizationRate = capacity > 0 ? (double) size / capacity : 0;
meterRegistry.gauge("queue.utilization", tags, utilizationRate);
}
// 记录业务操作耗时和结果
public void recordBusinessOperation(String operation, boolean success, long duration) {
Tags tags = Tags.of(
"operation", operation,
"success", String.valueOf(success),
"application", applicationName
);
meterRegistry.counter("business.operations", tags).increment();
meterRegistry.timer("business.duration", tags).record(duration, TimeUnit.MILLISECONDS);
}
// 记录服务启动状态
public void recordServiceStartup(boolean success, long duration) {
Tags tags = Tags.of(
"success", String.valueOf(success),
"application", applicationName
);
meterRegistry.counter("service.startup", tags).increment();
meterRegistry.timer("service.startup.duration", tags).record(duration, TimeUnit.SECONDS);
}
// 获取最后一次不可用时间
public LocalDateTime getLastOutageTime() {
return lastOutageTime;
}
}
13. 全局异常处理
实现全局异常处理器,统一管理 API 异常:
java
@RestControllerAdvice
public class GlobalExceptionHandler {
private static final Logger logger = LoggerFactory.getLogger(GlobalExceptionHandler.class);
private final MessageSource messageSource;
public GlobalExceptionHandler(MessageSource messageSource) {
this.messageSource = messageSource;
}
@ExceptionHandler(ElasticsearchException.class)
public ResponseEntity<ErrorResponse> handleElasticsearchException(
ElasticsearchException ex, Locale locale) {
logger.error("Elasticsearch操作异常", ex);
String message = messageSource.getMessage(
"error.elasticsearch.operation",
new Object[]{ex.getMessage()},
"Elasticsearch operation failed",
locale
);
ErrorResponse error = new ErrorResponse("ELASTICSEARCH_ERROR", message);
return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body(error);
}
@ExceptionHandler(ProductNotFoundException.class)
public ResponseEntity<ErrorResponse> handleProductNotFoundException(
ProductNotFoundException ex, Locale locale) {
logger.warn("产品未找到: {}", ex.getProductId());
String message = messageSource.getMessage(
"error.product.notfound",
new Object[]{ex.getProductId()},
"Product not found",
locale
);
ErrorResponse error = new ErrorResponse("PRODUCT_NOT_FOUND", message);
return ResponseEntity.status(HttpStatus.NOT_FOUND).body(error);
}
@ExceptionHandler(TransactionSystemException.class)
public ResponseEntity<ErrorResponse> handleTransactionSystemException(
TransactionSystemException ex, Locale locale) {
logger.error("事务系统异常", ex);
String message = messageSource.getMessage(
"error.transaction",
new Object[]{ex.getMessage()},
"Transaction failed",
locale
);
ErrorResponse error = new ErrorResponse("TRANSACTION_ERROR", message);
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(error);
}
@ExceptionHandler(CallNotPermittedException.class)
public ResponseEntity<ErrorResponse> handleCallNotPermittedException(
CallNotPermittedException ex, Locale locale) {
logger.warn("熔断器开启,拒绝请求: {}", ex.getMessage());
String message = messageSource.getMessage(
"error.circuitbreaker.open",
null,
"Service temporarily unavailable due to circuit breaker",
locale
);
ErrorResponse error = new ErrorResponse("SERVICE_UNAVAILABLE", message);
return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
.header("Retry-After", "30")
.body(error);
}
@ExceptionHandler(Exception.class)
public ResponseEntity<ErrorResponse> handleGenericException(
Exception ex, Locale locale) {
logger.error("未处理的异常", ex);
String message = messageSource.getMessage(
"error.generic",
new Object[]{ex.getMessage()},
"An unexpected error occurred",
locale
);
ErrorResponse error = new ErrorResponse("INTERNAL_ERROR", message);
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(error);
}
@Data
@AllArgsConstructor
public static class ErrorResponse {
private String code;
private String message;
private Instant timestamp = Instant.now();
public ErrorResponse(String code, String message) {
this.code = code;
this.message = message;
}
}
}
14. 数据迁移策略
实现 Elasticsearch 索引结构变更时的数据迁移策略:
java
@Service
public class IndexMigrationManager {
private static final Logger logger = LoggerFactory.getLogger(IndexMigrationManager.class);
private final ElasticsearchClient esClient;
private final ProductRepository productRepository;
private final BulkIndexService bulkIndexService;
private final MessageSource messageSource;
private final MetricsService metricsService;
public IndexMigrationManager(
ElasticsearchClient esClient,
ProductRepository productRepository,
BulkIndexService bulkIndexService,
MessageSource messageSource,
MetricsService metricsService) {
this.esClient = esClient;
this.productRepository = productRepository;
this.bulkIndexService = bulkIndexService;
this.messageSource = messageSource;
this.metricsService = metricsService;
}
/**
* 执行索引迁移(字段映射变更)
* @param sourceIndex 源索引名称
* @param targetIndex 目标索引名称
* @param fieldMappings 字段映射配置,key为源字段,value为目标字段
* @return 迁移报告
*/
public MigrationReport migrateIndex(
String sourceIndex,
String targetIndex,
Map<String, String> fieldMappings) {
logger.info("开始索引迁移: {} -> {}, 字段映射: {}",
sourceIndex, targetIndex, fieldMappings);
long startTime = System.currentTimeMillis();
MigrationReport report = new MigrationReport();
report.setSourceIndex(sourceIndex);
report.setTargetIndex(targetIndex);
report.setStartTime(LocalDateTime.now());
try {
// 1. 检查源索引是否存在
boolean sourceExists = checkIndexExists(sourceIndex);
if (!sourceExists) {
String errorMessage = messageSource.getMessage(
"error.index.source.notfound",
new Object[]{sourceIndex},
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
report.setStatus("FAILED");
report.setErrorMessage(errorMessage);
return report;
}
// 2. 创建目标索引(如果不存在)
boolean targetCreated = createTargetIndexIfNotExists(targetIndex);
report.setTargetCreated(targetCreated);
// 3. 根据映射关系执行迁移
long totalDocuments = executeFieldMappingMigration(sourceIndex, targetIndex, fieldMappings, report);
report.setTotalDocuments(totalDocuments);
// 4. 验证迁移结果
boolean validated = validateMigration(sourceIndex, targetIndex, report);
report.setValidated(validated);
// 5. 如果迁移成功,切换别名
if (validated) {
switchAliasIfNeeded(sourceIndex, targetIndex);
report.setStatus("SUCCESS");
} else {
report.setStatus("VALIDATION_FAILED");
}
long duration = System.currentTimeMillis() - startTime;
report.setDuration(duration);
logger.info("索引迁移完成: {} -> {}, 状态: {}, 耗时: {} ms",
sourceIndex, targetIndex, report.getStatus(), duration);
// 记录迁移指标
recordMigrationMetrics(report);
return report;
} catch (Exception e) {
logger.error("索引迁移过程中发生错误", e);
report.setStatus("ERROR");
report.setErrorMessage(e.getMessage());
report.setEndTime(LocalDateTime.now());
return report;
}
}
private boolean checkIndexExists(String indexName) throws IOException {
return esClient.indices().exists(e -> e.index(indexName)).value();
}
private boolean createTargetIndexIfNotExists(String indexName) throws IOException {
boolean exists = checkIndexExists(indexName);
if (!exists) {
logger.info("创建目标索引: {}", indexName);
// 此处可以添加自定义索引设置和映射
esClient.indices().create(c -> c.index(indexName));
return true;
}
return false;
}
private long executeFieldMappingMigration(
String sourceIndex,
String targetIndex,
Map<String, String> fieldMappings,
MigrationReport report) throws IOException {
// 获取源索引文档总数
CountResponse countResponse = esClient.count(c -> c.index(sourceIndex));
long totalDocuments = countResponse.count();
logger.info("源索引文档总数: {}", totalDocuments);
// 如果字段映射为空,使用reindex API
if (fieldMappings.isEmpty()) {
logger.info("字段映射为空,使用内置reindex API");
ReindexResponse reindexResponse = esClient.reindex(r -> r
.source(s -> s.index(sourceIndex))
.dest(d -> d.index(targetIndex))
.refresh(true)
);
// created()/updated() 返回的是可能为 null 的 Long,这里做空值保护
long created = reindexResponse.created() != null ? reindexResponse.created() : 0L;
long updated = reindexResponse.updated() != null ? reindexResponse.updated() : 0L;
report.setMigratedDocuments(created + updated);
report.setFailedDocuments(reindexResponse.failures().size());
if (!reindexResponse.failures().isEmpty()) {
logger.warn("迁移过程中有 {} 个文档失败", reindexResponse.failures().size());
// failures() 的元素类型为 BulkIndexByScrollFailure,失败原因位于 cause().reason()
report.setErrors(reindexResponse.failures().stream()
.map(f -> f.cause() != null ? f.cause().reason() : "unknown")
.collect(Collectors.toList()));
}
return totalDocuments;
}
// 自定义字段映射迁移
logger.info("使用自定义字段映射进行迁移");
int batchSize = 500;
long processedCount = 0;
long successCount = 0;
List<String> errors = new ArrayList<>();
// 分批处理
SearchRequest.Builder searchRequestBuilder = new SearchRequest.Builder()
.index(sourceIndex)
.size(batchSize)
.scroll(s -> s.time("1m"));
SearchResponse<JsonData> response = esClient.search(searchRequestBuilder.build(), JsonData.class);
String scrollId = response.scrollId();
while (response.hits().hits().size() > 0) {
List<IndexOperation> operations = new ArrayList<>();
for (Hit<JsonData> hit : response.hits().hits()) {
try {
// 转换文档字段
// JsonData 没有 toMap() 方法,这里通过 to(Map.class) 交给客户端的 JsonpMapper 反序列化
@SuppressWarnings("unchecked")
Map<String, Object> sourceMap = hit.source() != null ?
hit.source().to(Map.class) : new HashMap<>();
Map<String, Object> targetMap = transformDocument(sourceMap, fieldMappings);
// 创建索引操作
IndexOperation operation = new IndexOperation();
operation.setIndex(targetIndex);
operation.setEntityId(hit.id());
operation.setType(IndexOperation.OperationType.INDEX);
operation.setDocument(targetMap);
operations.add(operation);
processedCount++;
} catch (Exception e) {
logger.error("处理文档 {} 时出错", hit.id(), e);
errors.add("文档ID " + hit.id() + ": " + e.getMessage());
}
}
// 批量处理当前批次
if (!operations.isEmpty()) {
try {
for (IndexOperation op : operations) {
bulkIndexService.addOperation(op);
}
bulkIndexService.processBulk(); // 确保处理完成
successCount += operations.size();
} catch (Exception e) {
logger.error("批量处理失败", e);
errors.add("批量处理失败: " + e.getMessage());
}
}
// 获取下一批结果
if (scrollId == null) {
break;
}
// scrollId 在循环中会被重新赋值,不是 effectively final,需复制到局部变量才能在 lambda 中引用
final String currentScrollId = scrollId;
response = esClient.scroll(s -> s
.scrollId(currentScrollId)
.scroll(t -> t.time("1m")),
JsonData.class
);
scrollId = response.scrollId();
logger.info("迁移进度: {}/{} 文档", processedCount, totalDocuments);
}
// 清理scroll上下文
if (scrollId != null) {
final String finalScrollId = scrollId;
esClient.clearScroll(c -> c.scrollId(finalScrollId));
}
report.setMigratedDocuments(successCount);
report.setFailedDocuments(processedCount - successCount);
report.setErrors(errors);
return totalDocuments;
}
private Map<String, Object> transformDocument(Map<String, Object> source, Map<String, String> fieldMappings) {
Map<String, Object> target = new HashMap<>();
// 先复制所有原始字段
target.putAll(source);
// 应用字段映射
for (Map.Entry<String, String> mapping : fieldMappings.entrySet()) {
String sourceField = mapping.getKey();
String targetField = mapping.getValue();
if (source.containsKey(sourceField)) {
// 将源字段值复制到目标字段
target.put(targetField, source.get(sourceField));
// 源字段与目标字段名不同时移除源字段,避免迁移后出现重复数据
if (!sourceField.equals(targetField)) {
target.remove(sourceField);
}
}
}
return target;
}
private boolean validateMigration(String sourceIndex, String targetIndex, MigrationReport report) throws IOException {
// 比较文档数量
CountResponse sourceCount = esClient.count(c -> c.index(sourceIndex));
CountResponse targetCount = esClient.count(c -> c.index(targetIndex));
long sourceDocs = sourceCount.count();
long targetDocs = targetCount.count();
report.setSourceDocuments(sourceDocs);
report.setTargetDocuments(targetDocs);
// 如果文档数量差距超过允许的阈值(如5%),则验证失败
double documentRatio = sourceDocs > 0 ? (double) targetDocs / sourceDocs : 0;
boolean countValid = documentRatio >= 0.95;
if (!countValid) {
logger.warn("文档数量验证失败: 源={}, 目标={}, 比率={}",
sourceDocs, targetDocs, String.format("%.2f", documentRatio));
}
// 随机抽样验证字段映射是否正确
boolean samplingValid = validateRandomSampling(sourceIndex, targetIndex);
report.setCountValidated(countValid);
report.setSamplingValidated(samplingValid);
return countValid && samplingValid;
}
private boolean validateRandomSampling(String sourceIndex, String targetIndex) throws IOException {
// 抽取10个文档进行验证(此处为 match_all 取前10条;如需真正随机抽样可结合 random_score 查询)
SearchResponse<JsonData> response = esClient.search(s -> s
.index(sourceIndex)
.size(10)
.source(src -> src.fetch(true))
.query(q -> q.matchAll(m -> m)),
JsonData.class
);
for (Hit<JsonData> hit : response.hits().hits()) {
// 获取源文档ID
String docId = hit.id();
// 在目标索引中查找相同ID的文档
GetResponse<JsonData> targetDoc = esClient.get(g -> g
.index(targetIndex)
.id(docId),
JsonData.class
);
if (!targetDoc.found()) {
logger.warn("验证失败: 目标索引中未找到文档 {}", docId);
return false;
}
// 可以添加更复杂的字段验证逻辑
}
return true;
}
private void switchAliasIfNeeded(String sourceIndex, String targetIndex) throws IOException {
// 检查是否有别名指向源索引
GetAliasResponse aliasResponse = esClient.indices().getAlias(a -> a.index(sourceIndex));
// 获取指向源索引的所有别名
List<String> aliases = new ArrayList<>();
if (aliasResponse.result().containsKey(sourceIndex)) {
aliases.addAll(aliasResponse.result().get(sourceIndex).aliases().keySet());
}
if (!aliases.isEmpty()) {
logger.info("找到 {} 个别名指向源索引,将切换到目标索引", aliases.size());
// 构建别名更新操作:每个别名对应一条 remove(源索引) 和一条 add(目标索引)
// Action 即 co.elastic.clients.elasticsearch.indices.update_aliases.Action
List<Action> aliasActions = new ArrayList<>();
for (String alias : aliases) {
aliasActions.add(Action.of(a -> a
.remove(r -> r.index(sourceIndex).alias(alias))
));
aliasActions.add(Action.of(a -> a
.add(add -> add.index(targetIndex).alias(alias))
));
}
// 在一次请求内执行别名切换,读流量原子地从旧索引切到新索引
esClient.indices().updateAliases(u -> u.actions(aliasActions));
logger.info("别名切换完成");
} else {
logger.info("源索引没有关联的别名,无需切换");
}
}
private void recordMigrationMetrics(MigrationReport report) {
Tags tags = Tags.of(
Tag.of("sourceIndex", report.getSourceIndex()),
Tag.of("targetIndex", report.getTargetIndex()),
Tag.of("status", report.getStatus())
);
// 记录迁移文档数
meterRegistry.gauge("elasticsearch.migration.documents.source", tags, report.getSourceDocuments());
meterRegistry.gauge("elasticsearch.migration.documents.target", tags, report.getTargetDocuments());
meterRegistry.gauge("elasticsearch.migration.documents.migrated", tags, report.getMigratedDocuments());
meterRegistry.gauge("elasticsearch.migration.documents.failed", tags, report.getFailedDocuments());
// 记录迁移耗时
meterRegistry.timer("elasticsearch.migration.duration", tags).record(report.getDuration(), TimeUnit.MILLISECONDS);
// 记录迁移成功率
double successRate = report.getSourceDocuments() > 0 ?
(double) report.getMigratedDocuments() / report.getSourceDocuments() : 0;
meterRegistry.gauge("elasticsearch.migration.success_rate", tags, successRate);
}
@Data
public static class MigrationReport {
private String sourceIndex;
private String targetIndex;
private String status;
private LocalDateTime startTime;
private LocalDateTime endTime = LocalDateTime.now();
private boolean targetCreated;
private long totalDocuments;
private long sourceDocuments;
private long targetDocuments;
private long migratedDocuments;
private long failedDocuments;
private boolean countValidated;
private boolean samplingValidated;
private boolean validated;
private long duration;
private String errorMessage;
private List<String> errors = new ArrayList<>();
}
}
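For illustration, here is a minimal sketch of how the migration manager above might be invoked, for example from an admin-only endpoint. The controller class, request path, and index names are hypothetical; the sketch only shows how the fieldMappings parameter and the returned MigrationReport are meant to be used:
java
@RestController
@RequestMapping("/admin/migrations")
public class IndexMigrationController {

    private final IndexMigrationManager migrationManager;

    public IndexMigrationController(IndexMigrationManager migrationManager) {
        this.migrationManager = migrationManager;
    }

    /**
     * Trigger a single migration run, e.g. copying products_v1 -> products_v2
     * while renaming the "name" field to "productName".
     */
    @PostMapping
    public ResponseEntity<IndexMigrationManager.MigrationReport> migrate() {
        // key = field name in the source index, value = field name in the target index
        Map<String, String> fieldMappings = Map.of("name", "productName");
        IndexMigrationManager.MigrationReport report =
                migrationManager.migrateIndex("products_v1", "products_v2", fieldMappings);
        HttpStatus status = "SUCCESS".equals(report.getStatus())
                ? HttpStatus.OK
                : HttpStatus.INTERNAL_SERVER_ERROR;
        return ResponseEntity.status(status).body(report);
    }
}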
Application Configuration Example
yaml
# application.yml
spring:
application:
name: elasticsearch-sync-service
datasource:
url: jdbc:mysql://${DB_HOST:localhost}:${DB_PORT:3306}/${DB_NAME:productdb}
username: ${DB_USERNAME:root}
password: ${DB_PASSWORD:}
hikari:
maximum-pool-size: 20
minimum-idle: 5
connection-timeout: 30000
jpa:
hibernate:
ddl-auto: validate
properties:
hibernate:
dialect: org.hibernate.dialect.MySQL8Dialect
jdbc.batch_size: 50
kafka:
bootstrap-servers: ${KAFKA_SERVERS:localhost:9092}
producer:
key-serializer: org.apache.kafka.common.serialization.StringSerializer
value-serializer: org.apache.kafka.common.serialization.StringSerializer
# 启用幂等性producer
properties:
enable.idempotence: true
acks: all
retries: 3
max.in.flight.requests.per.connection: 1
# 启用事务支持(注:Spring Kafka 更推荐使用 spring.kafka.producer.transaction-id-prefix)
transactional.id: ${spring.application.name}-tx-
consumer:
group-id: ${KAFKA_GROUP_ID:es-sync-service}
auto-offset-reset: earliest
key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
# 手动提交offset
enable-auto-commit: false
# 防止长轮询造成的延迟
properties:
fetch.max.wait.ms: 500
listener:
# 确认模式设置为手动
ack-mode: manual
# 并发消费者
concurrency: ${KAFKA_CONCURRENCY:3}
cache:
type: caffeine
caffeine:
spec: maximumSize=10000,expireAfterWrite=300s
elasticsearch:
uris: ${ES_URIS:http://localhost:9200}
username: ${ES_USERNAME:elastic}
password: ${ES_PASSWORD:}
connection-timeout: 5000
socket-timeout: 60000
max-connections: 100
retry-on-conflict: 3
app:
bulk:
size: 100
interval: 5000
queue-capacity: 1000
min-size: 20
max-size: 500
initial-size: 100
outbox:
polling-interval: 1000
max-concurrent: 50
kafka:
concurrency: 3
retry:
max-attempts: 5
initial-delay: 1000
multiplier: 2
max-delay: 30000
consistency:
check:
cron: "0 0 2 * * ?" # 每天凌晨2点执行
audit:
incremental:
interval: 300000 # 5分钟执行一次增量审计
full:
cron: "0 0 1 * * ?" # 每天凌晨1点执行全量抽样审计
monitoring:
enabled: true
cache:
ttl-seconds: 300
encryption:
password: ${ENCRYPTION_PASSWORD:defaultDevPassword}
salt: ${ENCRYPTION_SALT:aabbccddeeff}
management:
endpoints:
web:
exposure:
include: health,info,prometheus,metrics,caches
metrics:
export:
prometheus:
enabled: true
tags:
application: ${spring.application.name}
health:
elasticsearch:
enabled: true
kafka:
enabled: true
circuitbreakers:
enabled: true
logging:
pattern:
console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} [%X{traceId},%X{spanId},%X{correlationId},%X{messageId}] - %msg%n"
level:
root: INFO
com.example.essync: ${LOG_LEVEL:INFO}
org.springframework.data.elasticsearch: WARN
org.elasticsearch: WARN
org.apache.kafka: WARN
# OpenTelemetry配置
otel:
traces:
exporter: otlp
metrics:
exporter: otlp
logs:
exporter: otlp
exporter:
otlp:
endpoint: ${OTEL_ENDPOINT:http://localhost:4317}
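The spring.*, management.*, logging.* and otel.* keys above are consumed by their respective frameworks, but the custom app.* block has to be bound explicitly. Below is a minimal sketch, assuming a type-safe binding via @ConfigurationProperties, of how the app.bulk settings could be mapped; the class name BulkProperties is hypothetical, and Spring's relaxed binding maps keys such as queue-capacity to queueCapacity:
java
@Data
@ConfigurationProperties(prefix = "app.bulk")
public class BulkProperties {
    /** Target number of operations per bulk request (app.bulk.size). */
    private int size = 100;
    /** Flush interval in milliseconds (app.bulk.interval). */
    private long interval = 5000;
    /** Capacity of the in-memory operation queue (app.bulk.queue-capacity). */
    private int queueCapacity = 1000;
    /** Lower bound for the adaptive batch size (app.bulk.min-size). */
    private int minSize = 20;
    /** Upper bound for the adaptive batch size (app.bulk.max-size). */
    private int maxSize = 500;
    /** Initial batch size before adaptation (app.bulk.initial-size). */
    private int initialSize = 100;
}
Registering the class with @EnableConfigurationProperties(BulkProperties.class) (or @ConfigurationPropertiesScan) makes the values injectable into services such as BulkIndexService.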
Summary
| Feature | Approach | Applicable scenarios |
| --- | --- | --- |
| Data consistency | Outbox pattern | Most business scenarios |
| Concurrency control | Optimistic locking with versioning | High-concurrency writes |
| Failure handling | Retry queue + exponential backoff | Unstable system conditions |
| Bulk performance | Adaptive bulk processing | Large-volume data processing |
| Real-time requirements | Synchronous write + asynchronous confirmation | Latency-sensitive scenarios |
| Data recovery | Incremental and sampled consistency audits | Long-running systems |
| Observability | Distributed tracing + fine-grained metrics | All production environments |
| High-concurrency optimization | Virtual threads + backpressure control | High-throughput scenarios |
| Resilience design | Circuit breaker + rate limiter + bulkhead pattern | Unstable environments |
| Security | Encryption of sensitive data + secure configuration | Scenarios involving personal data |
| Zero-downtime deployment | Index aliases + data migration | System upgrade scenarios |
| Internationalization | Multi-language message resources | Global applications |
| Network partitions | Degradation strategy + automatic recovery | Unreliable network environments |
| Fault testing | Chaos engineering + fault injection | Verifying system resilience |
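As a closing example, the "Retry queue + exponential backoff" row maps directly onto the retry settings in the configuration above (max-attempts: 5, initial-delay: 1000, multiplier: 2, max-delay: 30000). A minimal sketch using Spring Retry (an assumption here, not necessarily the exact retry mechanism used elsewhere in this article) could look like this:
java
@Configuration
public class EsRetryConfig {

    @Bean
    public RetryTemplate esRetryTemplate() {
        // Exponential backoff: 1s, 2s, 4s, ... capped at 30s, mirroring the retry config above
        ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
        backOff.setInitialInterval(1000);   // initial-delay: 1000 ms
        backOff.setMultiplier(2.0);         // multiplier: 2
        backOff.setMaxInterval(30000);      // max-delay: 30000 ms

        // Give up after 5 attempts in total (max-attempts: 5)
        SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(5);

        RetryTemplate template = new RetryTemplate();
        template.setBackOffPolicy(backOff);
        template.setRetryPolicy(retryPolicy);
        return template;
    }
}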