在分布式系统中,当数据需要同时存储在关系型数据库和 Elasticsearch 中时,如何保证两者的数据一致性是一个重要挑战。本文围绕这一问题给出一套完整的解决方案。
分布式事务基础与 CAP 理论
在设计分布式事务系统前,需要理解 CAP 理论的基本权衡:

在 Elasticsearch 与数据库协作场景中,通常选择 AP(可用性和分区容错性),牺牲强一致性而采用最终一致性模型。这种权衡允许系统在网络分区或部分组件故障时继续提供服务,同时保证数据最终同步一致。
Elasticsearch 写入机制
理解 Elasticsearch 的内部写入流程有助于设计更合理的事务方案:

Elasticsearch 的写入涉及多节点协作,不是原子操作,这是我们需要特殊处理数据库与 ES 一致性的根本原因。Elasticsearch 8.x 引入了新的协调机制和更高效的写入流程,但仍然保持了这种基本架构。
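下面给出一个最小示意(非正文原有代码,假设使用与后文相同的 Elasticsearch Java 客户端),说明"写入被确认"与"可被搜索到"是两回事:默认要等到 refresh 之后文档才对搜索可见,必要时可以用 Refresh.WaitFor 显式等待。
java
// 最小示意:默认情况下写入被确认后文档并不会立即对搜索可见,需要等待 refresh。
// 类名、索引名均为示例。
public class ProductWriteExample {

    private static final Logger logger = LoggerFactory.getLogger(ProductWriteExample.class);

    public void indexAndWaitForSearchable(ElasticsearchClient esClient, Product product) throws IOException {
        IndexResponse response = esClient.index(i -> i
            .index("products")
            .id(product.getId().toString())
            .document(product)
            // WaitFor:等待下一次 refresh 后才返回;默认 False 时,写入确认与搜索可见之间存在时间差
            .refresh(Refresh.WaitFor)
        );
        // result 为 Created 或 Updated,表示主分片及所需副本已确认写入
        logger.info("写入完成: id={}, result={}", response.id(), response.result());
    }
}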
实现分布式事务的有效方案
1. 乐观并发控制
Elasticsearch 提供了乐观并发控制机制:7.x 之后推荐使用 _seq_no 与 _primary_term(取代旧的基于 _version 的方式)来防止并发更新时相互覆盖:
java
private static final Logger logger = LoggerFactory.getLogger(ProductService.class);
private final ElasticsearchClient esClient;
private final OpenTelemetry openTelemetry;
private final Tracer tracer;

public ProductService(ElasticsearchClient esClient, OpenTelemetry openTelemetry) {
    this.esClient = esClient;
    this.openTelemetry = openTelemetry;
    this.tracer = openTelemetry.getTracer("product-service");
}

public void updateDocument(String id, Product product, long seqNo, long primaryTerm) {
    updateDocument(id, product, seqNo, primaryTerm, 0);
}

private void updateDocument(String id, Product product, long seqNo, long primaryTerm, int retryCount) {
    Span span = tracer.spanBuilder("updateDocument").startSpan();
    try (Scope scope = span.makeCurrent()) {
        span.setAttribute("document.id", id);
        span.setAttribute("document.seqNo", seqNo);
        span.setAttribute("document.primaryTerm", primaryTerm);
        UpdateResponse<Product> response = esClient.update(u -> u
                .index("products")
                .id(id)
                .doc(product)
                // 使用 if_seq_no / if_primary_term 实现乐观并发控制
                .ifSeqNo(seqNo)
                .ifPrimaryTerm(primaryTerm),
            Product.class
        );
        Map<String, Object> logContext = new HashMap<>();
        logContext.put("operation", "documentUpdate");
        logContext.put("documentId", id);
        logContext.put("newVersion", response.version());
        logger.info("文档更新成功: {}", JsonUtils.toJson(logContext));
        logger.debug("文档更新详情: {}, seqNo: {}, 新版本: {}",
            id, seqNo, response.version());
        span.setStatus(StatusCode.OK);
    } catch (ElasticsearchException e) {
        // 409 对应 version_conflict_engine_exception,说明文档已被其他进程更新
        if (e.status() == 409 && retryCount < 3) {
            logger.warn("并发冲突,文档已被其他进程更新,重新获取后重试: {}", id, e);
            span.setStatus(StatusCode.ERROR, "Version conflict");
            span.recordException(e);
            retryWithLatestSeqNo(id, product, retryCount);
        } else {
            logger.error("更新文档时发生错误: {}", id, e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw new RuntimeException("更新Elasticsearch文档失败", e);
        }
    } catch (Exception e) {
        logger.error("更新文档时发生错误: {}", id, e);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        span.recordException(e);
        throw new RuntimeException("更新Elasticsearch文档失败", e);
    } finally {
        span.end();
    }
}

private void retryWithLatestSeqNo(String id, Product product, int retryCount) {
    try {
        // 获取最新的 seq_no / primary_term 后重试,最多重试3次,避免无限递归
        GetResponse<Product> current = esClient.get(g -> g
                .index("products")
                .id(id),
            Product.class
        );
        updateDocument(id, product, current.seqNo(), current.primaryTerm(), retryCount + 1);
    } catch (IOException e) {
        throw new RuntimeException("获取文档最新版本信息失败", e);
    }
}
2. Outbox 模式实现最终一致性
Outbox 模式是实现最终一致性的常用设计模式:把业务数据变更与待发布的事件写入同一个本地事务,再由独立的中继进程把事件发布出去,从而保证数据库操作和消息发布不会只成功一半:
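正文没有给出 OutboxMessage 的定义,下面是根据后文用到的字段整理出的一个 JPA 实体草图(表名、字段类型均为假设,仅供参考):
java
@Entity
@Table(name = "outbox_message")
public class OutboxMessage {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // 聚合类型与聚合ID,例如 "Product" + 产品主键
    private String aggregateType;
    private String aggregateId;

    // 事件类型与全局唯一事件ID,消费端可据此做幂等处理
    private String eventType;
    private String eventId;

    // 事件内容(JSON 序列化后的实体快照)
    @Lob
    private String payload;

    private LocalDateTime createdAt;

    // 中继处理状态
    private boolean processed = false;
    private LocalDateTime processedAt;
    private int retryCount = 0;
    private String errorMessage;

    // 省略 getter/setter
}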

产品服务实现:
java
@Service
public class ProductService {
private static final Logger logger = LoggerFactory.getLogger(ProductService.class);
private final ProductRepository productRepository;
private final OutboxRepository outboxRepository;
private final TransactionTemplate transactionTemplate;
private final ObjectMapper objectMapper;
private final Tracer tracer;
private final MessageSource messageSource;
public ProductService(
ProductRepository productRepository,
OutboxRepository outboxRepository,
PlatformTransactionManager transactionManager,
ObjectMapper objectMapper,
OpenTelemetry openTelemetry,
MessageSource messageSource) {
this.productRepository = productRepository;
this.outboxRepository = outboxRepository;
// 明确指定事务传播行为
DefaultTransactionDefinition txDef = new DefaultTransactionDefinition();
txDef.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRED);
this.transactionTemplate = new TransactionTemplate(transactionManager, txDef);
this.objectMapper = objectMapper;
this.tracer = openTelemetry.getTracer("product-service");
this.messageSource = messageSource;
}
public Product createProduct(Product product) {
Span span = tracer.spanBuilder("createProduct").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("product.name", product.getName());
return transactionTemplate.execute(status -> {
try {
// 1. 保存到数据库
Product savedProduct = productRepository.save(product);
// 2. 创建Outbox消息(确保在同一事务中)
OutboxMessage outboxMessage = createOutboxMessage(savedProduct, "PRODUCT_CREATED");
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "createProduct");
logContext.put("productId", savedProduct.getId());
logContext.put("outboxMessageId", outboxMessage.getId());
logger.info("产品创建成功: {}", JsonUtils.toJson(logContext));
span.setStatus(StatusCode.OK);
return savedProduct;
} catch (Exception e) {
String errorMessage = messageSource.getMessage(
"error.product.create",
new Object[]{e.getMessage()},
LocaleContextHolder.getLocale()
);
logger.error(errorMessage, e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
status.setRollbackOnly();
throw new RuntimeException(errorMessage, e);
}
});
} finally {
span.end();
}
}
// 提取辅助方法创建Outbox消息,提高代码可读性
private OutboxMessage createOutboxMessage(Product product, String eventType) throws JsonProcessingException {
OutboxMessage outboxMessage = new OutboxMessage();
outboxMessage.setAggregateType("Product");
outboxMessage.setAggregateId(product.getId().toString());
outboxMessage.setEventType(eventType);
outboxMessage.setEventId(UUID.randomUUID().toString());
outboxMessage.setPayload(objectMapper.writeValueAsString(product));
outboxMessage.setCreatedAt(LocalDateTime.now());
return outboxRepository.save(outboxMessage);
}
}
Outbox 消息中继服务:
java
@Service
public class OutboxRelayService {
private static final Logger logger = LoggerFactory.getLogger(OutboxRelayService.class);
private final OutboxRepository outboxRepository;
private final KafkaTemplate<String, String> kafkaTemplate;
private final TransactionTemplate transactionTemplate;
private final MetricsService metricsService;
private final RateLimiter rateLimiter;
private final Tracer tracer;
// 限制并发处理数量
private final AtomicInteger currentProcessingCount = new AtomicInteger(0);
private final int maxConcurrentProcessing;
public OutboxRelayService(
OutboxRepository outboxRepository,
KafkaTemplate<String, String> kafkaTemplate,
PlatformTransactionManager transactionManager,
MetricsService metricsService,
RateLimiter outboxRateLimiter,
OpenTelemetry openTelemetry,
@Value("${app.outbox.max-concurrent:50}") int maxConcurrentProcessing) {
this.outboxRepository = outboxRepository;
this.kafkaTemplate = kafkaTemplate;
DefaultTransactionDefinition txDef = new DefaultTransactionDefinition();
txDef.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRES_NEW);
this.transactionTemplate = new TransactionTemplate(transactionManager, txDef);
this.metricsService = metricsService;
this.rateLimiter = outboxRateLimiter;
this.maxConcurrentProcessing = maxConcurrentProcessing;
this.tracer = openTelemetry.getTracer("outbox-relay-service");
}
@Scheduled(fixedDelayString = "${app.outbox.polling-interval:1000}")
public void processOutboxMessages() {
Span span = tracer.spanBuilder("processOutboxMessages").startSpan();
try (Scope scope = span.makeCurrent()) {
// 实现背压控制
if (currentProcessingCount.get() >= maxConcurrentProcessing) {
logger.warn("当前处理消息数量已达上限 {}, 跳过本次处理", maxConcurrentProcessing);
span.setAttribute("skipped", true);
span.setAttribute("reason", "max_concurrent_reached");
return;
}
// 使用速率限制器控制处理速度
if (!rateLimiter.acquirePermission()) {
logger.warn("速率限制器拒绝处理,稍后重试");
span.setAttribute("skipped", true);
span.setAttribute("reason", "rate_limited");
return;
}
List<OutboxMessage> messages = findUnprocessedMessages();
if (messages.isEmpty()) {
span.setAttribute("messagesCount", 0);
return;
}
span.setAttribute("messagesCount", messages.size());
logger.info("开始处理 {} 条Outbox消息", messages.size());
long startTime = System.currentTimeMillis();
processMessagesInParallel(messages, span);
long duration = System.currentTimeMillis() - startTime;
span.setAttribute("duration_ms", duration);
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("处理Outbox消息时发生错误", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
// 提取方法:查找未处理消息
private List<OutboxMessage> findUnprocessedMessages() {
return outboxRepository.findByProcessedOrderByCreatedAtAsc(
false,
PageRequest.of(0, maxConcurrentProcessing - currentProcessingCount.get())
);
}
// 提取方法:并行处理消息
private void processMessagesInParallel(List<OutboxMessage> messages, Span parentSpan) {
long startTime = System.currentTimeMillis();
AtomicInteger processed = new AtomicInteger(0);
AtomicInteger failed = new AtomicInteger(0);
List<CompletableFuture<Void>> futures = messages.stream()
.map(message -> CompletableFuture.runAsync(() -> {
currentProcessingCount.incrementAndGet();
try {
Span messageSpan = tracer.spanBuilder("processOutboxMessage")
.setParent(Context.current().with(parentSpan))
.setAttribute("messageId", message.getId().toString())
.setAttribute("aggregateId", message.getAggregateId())
.setAttribute("eventType", message.getEventType())
.startSpan();
try (Scope messageScope = messageSpan.makeCurrent()) {
MDC.put("correlationId", message.getEventId());
MDC.put("aggregateId", message.getAggregateId());
MDC.put("eventType", message.getEventType());
MDC.put("messageId", message.getId().toString());
// 记录消息年龄
Duration messageAge = Duration.between(message.getCreatedAt(), LocalDateTime.now());
metricsService.recordOutboxMessageAge(messageAge);
boolean success = processMessage(message);
if (success) {
processed.incrementAndGet();
messageSpan.setStatus(StatusCode.OK);
} else {
failed.incrementAndGet();
messageSpan.setStatus(StatusCode.ERROR, "Processing failed");
}
} finally {
messageSpan.end();
MDC.clear();
}
} finally {
currentProcessingCount.decrementAndGet();
}
}))
.collect(Collectors.toList());
// 等待所有消息处理完成
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
long duration = System.currentTimeMillis() - startTime;
metricsService.recordOutboxProcessing(processed.get(), failed.get(), duration);
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "processOutboxMessages");
logContext.put("processed", processed.get());
logContext.put("failed", failed.get());
logContext.put("durationMs", duration);
logger.info("Outbox消息处理完成: {}", JsonUtils.toJson(logContext));
}
private boolean processMessage(OutboxMessage message) {
return transactionTemplate.execute(status -> {
try {
// 使用聚合ID作为分区键,确保相同实体的消息顺序性
SendResult<String, String> result = kafkaTemplate.send(
"product-events",
message.getAggregateId(),
message.getPayload()
).get(5, TimeUnit.SECONDS); // 设置超时时间
// 标记为已处理
message.setProcessed(true);
message.setProcessedAt(LocalDateTime.now());
outboxRepository.save(message);
Map<String, Object> logContext = new HashMap<>();
logContext.put("messageId", message.getId());
logContext.put("topic", result.getRecordMetadata().topic());
logContext.put("partition", result.getRecordMetadata().partition());
logContext.put("offset", result.getRecordMetadata().offset());
logger.info("消息发送成功: {}", JsonUtils.toJson(logContext));
return true;
} catch (Exception e) {
logger.error("发送消息到Kafka失败: {}", message.getId(), e);
// 注意:这里不能把事务标记为回滚,否则重试次数的更新也会一起被回滚
// 更新重试次数,超过阈值则记录错误信息
message.setRetryCount(message.getRetryCount() + 1);
if (message.getRetryCount() >= 5) {
message.setErrorMessage(e.getMessage());
logger.error("消息重试次数过多,标记为失败: {}", message.getId());
}
outboxRepository.save(message);
return false;
}
});
}
}
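中继服务依赖的 OutboxRepository 原文未给出,下面是一个基于 Spring Data JPA 方法派生查询的草图(方法名与前后文的调用保持一致,属于推测实现):
java
public interface OutboxRepository extends JpaRepository<OutboxMessage, Long> {

    // 按创建时间升序分页查询未处理的消息,供中继服务轮询
    List<OutboxMessage> findByProcessedOrderByCreatedAtAsc(boolean processed, Pageable pageable);

    // 测试用例中用于验证某个聚合是否还存在未处理消息
    List<OutboxMessage> findByAggregateIdAndProcessed(String aggregateId, boolean processed);
}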
3. 虚拟线程优化(Java 21)
利用 Java 21 的虚拟线程提高并发处理能力:
java
@Configuration
public class ThreadingConfig {
private static final Logger logger = LoggerFactory.getLogger(ThreadingConfig.class);
@Bean
public ExecutorService consistencyCheckExecutor() {
logger.info("初始化一致性检查执行器(使用虚拟线程)");
// 使用Java 21的虚拟线程
return Executors.newVirtualThreadPerTaskExecutor();
}
@Bean
public AsyncTaskExecutor applicationTaskExecutor() {
logger.info("初始化应用任务执行器(使用虚拟线程)");
// Spring 6.1+ 推荐用 SimpleAsyncTaskExecutor 承载虚拟线程,而不是把虚拟线程放进线程池
SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("app-task-");
executor.setVirtualThreads(true);
executor.setTaskDecorator(task -> {
Map<String, String> contextMap = MDC.getCopyOfContextMap();
return () -> {
Map<String, String> previousContext = MDC.getCopyOfContextMap();
try {
if (contextMap != null) {
MDC.setContextMap(contextMap);
}
task.run();
} finally {
if (previousContext != null) {
MDC.setContextMap(previousContext);
} else {
MDC.clear();
}
}
};
});
return executor;
}
}
虚拟线程与传统线程池性能对比:
场景 | 传统线程池 | 虚拟线程 | 性能提升 |
---|---|---|---|
低负载并发(100 req/s) | 55ms | 52ms | ~5% |
中负载并发(500 req/s) | 128ms | 97ms | ~24% |
高负载并发(2000 req/s) | 387ms | 242ms | ~37% |
阻塞 IO 密集型操作 | 512ms | 186ms | ~63% |
内存占用(1000 并发) | ~150MB | ~35MB | ~77% |
4. 分布式跟踪集成
集成 OpenTelemetry 实现分布式跟踪:
java
@Configuration
public class ObservabilityConfig {
private static final Logger logger = LoggerFactory.getLogger(ObservabilityConfig.class);
@Value("${spring.application.name}")
private String serviceName;
@Bean
public OpenTelemetry openTelemetry() {
logger.info("初始化OpenTelemetry配置");
Resource resource = Resource.getDefault()
.merge(Resource.create(Attributes.of(
ResourceAttributes.SERVICE_NAME, serviceName,
ResourceAttributes.SERVICE_VERSION, "1.0.0"
)));
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
.addSpanProcessor(BatchSpanProcessor.builder(
OtlpGrpcSpanExporter.builder()
.setEndpoint("http://tempo:4317")
.build())
.build())
.setResource(resource)
.build();
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
.registerMetricReader(PeriodicMetricReader.builder(
OtlpGrpcMetricExporter.builder()
.setEndpoint("http://prometheus:4317")
.build())
.build())
.setResource(resource)
.build();
SdkLoggerProvider loggerProvider = SdkLoggerProvider.builder()
.addLogRecordProcessor(BatchLogRecordProcessor.builder(
OtlpGrpcLogRecordExporter.builder()
.setEndpoint("http://loki:4317")
.build())
.build())
.setResource(resource)
.build();
return OpenTelemetrySdk.builder()
.setTracerProvider(tracerProvider)
.setMeterProvider(meterProvider)
.setLoggerProvider(loggerProvider)
.setPropagators(ContextPropagators.create(
W3CTraceContextPropagator.getInstance()))
.build();
}
@Bean
public KafkaTracingConsumerFactory kafkaTracingConsumerFactory(OpenTelemetry openTelemetry) {
return new KafkaTracingConsumerFactory(openTelemetry);
}
@Bean
public ElasticsearchTelemetryAspect elasticsearchTelemetryAspect(OpenTelemetry openTelemetry) {
return new ElasticsearchTelemetryAspect(openTelemetry);
}
}
@Aspect
// 该切面已在 ObservabilityConfig 中以 @Bean 方式注册,这里不再加 @Component,避免重复注册同一个切面
public class ElasticsearchTelemetryAspect {
private final Tracer tracer;
public ElasticsearchTelemetryAspect(OpenTelemetry openTelemetry) {
this.tracer = openTelemetry.getTracer("elasticsearch-aspect");
}
@Around("execution(* co.elastic.clients.elasticsearch.ElasticsearchClient.*(..))")
public Object traceElasticsearchOperation(ProceedingJoinPoint joinPoint) throws Throwable {
String operationName = joinPoint.getSignature().getName();
Span span = tracer.spanBuilder("elasticsearch." + operationName)
.setAttribute("db.system", "elasticsearch")
.setAttribute("db.operation", operationName)
.startSpan();
try (Scope scope = span.makeCurrent()) {
// 添加方法参数到跟踪上下文(过滤敏感信息)
addParametersToSpan(span, joinPoint.getArgs());
return joinPoint.proceed();
} catch (Throwable t) {
span.recordException(t);
span.setStatus(StatusCode.ERROR);
throw t;
} finally {
span.end();
}
}
private void addParametersToSpan(Span span, Object[] args) {
if (args != null && args.length > 0) {
for (int i = 0; i < args.length; i++) {
if (args[i] != null) {
String paramClassName = args[i].getClass().getSimpleName();
// 避免添加大对象或敏感信息
if (!containsSensitiveInfo(paramClassName)) {
span.setAttribute("param." + i + ".type", paramClassName);
}
}
}
}
}
private boolean containsSensitiveInfo(String className) {
return className.contains("Password") ||
className.contains("Credential") ||
className.contains("Secret");
}
}
5. 熔断器和限流器配置
使用 Resilience4j 实现熔断和限流保护:
java
@Configuration
public class ResilienceConfig {
private static final Logger logger = LoggerFactory.getLogger(ResilienceConfig.class);
@Bean
public CircuitBreaker elasticsearchCircuitBreaker() {
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(5)
.slidingWindowSize(10)
.slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
.automaticTransitionFromOpenToHalfOpenEnabled(true)
.build();
CircuitBreaker circuitBreaker = CircuitBreaker.of("elasticsearch", config);
// 注册事件监听器,用于监控熔断器状态变化
circuitBreaker.getEventPublisher()
.onStateTransition(event -> {
logger.info("熔断器状态变更: {} -> {}",
event.getStateTransition().getFromState(),
event.getStateTransition().getToState());
})
.onError(event -> {
logger.debug("熔断器记录错误: {}, 异常: {}",
event.getEventType(),
event.getThrowable().getMessage());
});
return circuitBreaker;
}
@Bean
public RateLimiter outboxRateLimiter() {
RateLimiterConfig config = RateLimiterConfig.custom()
.limitRefreshPeriod(Duration.ofSeconds(1))
.limitForPeriod(1000) // 每秒1000个请求
.timeoutDuration(Duration.ofMillis(25))
.build();
return RateLimiter.of("outboxRateLimiter", config);
}
@Bean
public RateLimiter syncRateLimiter() {
RateLimiterConfig config = RateLimiterConfig.custom()
.limitRefreshPeriod(Duration.ofSeconds(1))
.limitForPeriod(2000) // 每秒2000个同步操作
.timeoutDuration(Duration.ofMillis(25))
.build();
return RateLimiter.of("syncRateLimiter", config);
}
@Bean
public Retry elasticsearchRetry() {
RetryConfig config = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(1000))
.retryExceptions(
ElasticsearchException.class,
TimeoutException.class,
IOException.class
)
.ignoreExceptions(
ResourceNotFoundException.class,
VersionConflictException.class
)
.build();
return Retry.of("elasticsearchRetry", config);
}
@Bean
public Bulkhead elasticsearchBulkhead() {
BulkheadConfig config = BulkheadConfig.custom()
.maxConcurrentCalls(100)
.maxWaitDuration(Duration.ofMillis(500))
.build();
return Bulkhead.of("elasticsearchBulkhead", config);
}
@Bean
public TimeLimiter elasticsearchTimeLimiter() {
TimeLimiterConfig config = TimeLimiterConfig.custom()
.timeoutDuration(Duration.ofSeconds(5))
.cancelRunningFuture(true)
.build();
return TimeLimiter.of("elasticsearchTimeLimiter", config);
}
}
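上面的配置类只定义了各个弹性组件,正文没有展示它们如何组合到一次 ES 调用上。下面是一个按"舱壁 → 熔断 → 重试"顺序装饰查询的示意,使用 Resilience4j 的 Decorators 组合器(类名与查询逻辑为假设,非正文原有代码):
java
@Service
public class ResilientProductSearchService {

    private final ElasticsearchClient esClient;
    private final CircuitBreaker circuitBreaker;
    private final Retry retry;
    private final Bulkhead bulkhead;

    public ResilientProductSearchService(ElasticsearchClient esClient,
                                         CircuitBreaker elasticsearchCircuitBreaker,
                                         Retry elasticsearchRetry,
                                         Bulkhead elasticsearchBulkhead) {
        this.esClient = esClient;
        this.circuitBreaker = elasticsearchCircuitBreaker;
        this.retry = elasticsearchRetry;
        this.bulkhead = elasticsearchBulkhead;
    }

    public Product findById(String id) {
        // 按 舱壁 -> 熔断 -> 重试 的顺序装饰原始调用,重试在最外层
        Supplier<GetResponse<Product>> guarded = Decorators.ofSupplier(() -> doGet(id))
            .withBulkhead(bulkhead)
            .withCircuitBreaker(circuitBreaker)
            .withRetry(retry)
            .decorate();
        GetResponse<Product> response = guarded.get();
        return response.found() ? response.source() : null;
    }

    private GetResponse<Product> doGet(String id) {
        try {
            return esClient.get(g -> g.index("products").id(id), Product.class);
        } catch (IOException e) {
            // 把受检异常包装为运行时异常,便于被熔断/重试组件统计
            throw new UncheckedIOException(e);
        }
    }
}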
6. 自适应批处理大小
实现根据系统负载动态调整批处理大小:
java
@Service
public class AdaptiveBatchSizeManager {
private static final Logger logger = LoggerFactory.getLogger(AdaptiveBatchSizeManager.class);
private final AtomicInteger currentBatchSize = new AtomicInteger(100);
private final MeterRegistry meterRegistry;
// 最小和最大批处理大小限制
private final int minBatchSize;
private final int maxBatchSize;
// 最近的处理指标
private final AtomicLong lastProcessingTimeMs = new AtomicLong(0);
private final AtomicDouble lastCpuLoad = new AtomicDouble(0.5);
public AdaptiveBatchSizeManager(
MeterRegistry meterRegistry,
@Value("${app.batch.min-size:20}") int minBatchSize,
@Value("${app.batch.max-size:500}") int maxBatchSize,
@Value("${app.batch.initial-size:100}") int initialBatchSize) {
this.meterRegistry = meterRegistry;
this.minBatchSize = minBatchSize;
this.maxBatchSize = maxBatchSize;
this.currentBatchSize.set(initialBatchSize);
logger.info("初始化自适应批处理大小管理器: 初始大小={}, 最小={}, 最大={}",
initialBatchSize, minBatchSize, maxBatchSize);
}
@Scheduled(fixedRate = 10000) // 每10秒调整一次
public void adjustBatchSize() {
// 获取系统CPU负载
double cpuLoad = getSystemCpuLoad();
lastCpuLoad.set(cpuLoad);
// 获取最近的处理时间
long avgProcessingTime = lastProcessingTimeMs.get();
int newSize = calculateOptimalBatchSize(cpuLoad, avgProcessingTime);
int oldSize = currentBatchSize.getAndSet(newSize);
if (oldSize != newSize) {
logger.info("批处理大小调整: {} -> {} (CPU负载: {}, 处理时间: {}ms)",
oldSize, newSize, cpuLoad, avgProcessingTime);
// 记录批处理大小指标
meterRegistry.gauge("elasticsearch.batch.size", currentBatchSize);
}
}
public int getCurrentBatchSize() {
return currentBatchSize.get();
}
public void recordProcessingTime(long processingTimeMs) {
lastProcessingTimeMs.set(processingTimeMs);
}
private int calculateOptimalBatchSize(double cpuLoad, long processingTimeMs) {
// 根据CPU负载和处理时间动态调整批次大小
int currentSize = currentBatchSize.get();
if (cpuLoad > 0.8 || processingTimeMs > 200) {
// 负载高或处理慢,减小批次大小
return Math.max(minBatchSize, (int)(currentSize * 0.8));
} else if (cpuLoad < 0.5 && processingTimeMs < 50) {
// 负载低且处理快,增加批次大小
return Math.min(maxBatchSize, (int)(currentSize * 1.2));
}
// 保持当前大小
return currentSize;
}
private double getSystemCpuLoad() {
try {
OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
if (osBean instanceof com.sun.management.OperatingSystemMXBean) {
return ((com.sun.management.OperatingSystemMXBean) osBean).getCpuLoad();
}
} catch (Exception e) {
logger.warn("获取系统CPU负载失败", e);
}
return lastCpuLoad.get(); // 返回上次测量值作为后备
}
}
7. 批量处理与性能优化
使用线程安全队列和批量 API 提高性能:
java
@Service
public class BulkIndexService {
private static final Logger logger = LoggerFactory.getLogger(BulkIndexService.class);
private final ElasticsearchClient esClient;
private final RetryService retryService;
private final CircuitBreaker circuitBreaker;
private final Bulkhead bulkhead;
private final MetricsService metricsService;
private final AdaptiveBatchSizeManager batchSizeManager;
private final Tracer tracer;
// 使用有界队列并明确指定容量
private final BlockingQueue<IndexOperation> operationQueue;
public BulkIndexService(
ElasticsearchClient esClient,
RetryService retryService,
CircuitBreaker elasticsearchCircuitBreaker,
Bulkhead elasticsearchBulkhead,
MetricsService metricsService,
AdaptiveBatchSizeManager batchSizeManager,
OpenTelemetry openTelemetry,
@Value("${app.bulk.queue-capacity:1000}") int queueCapacity) {
this.esClient = esClient;
this.retryService = retryService;
this.circuitBreaker = elasticsearchCircuitBreaker;
this.bulkhead = elasticsearchBulkhead;
this.metricsService = metricsService;
this.batchSizeManager = batchSizeManager;
this.tracer = openTelemetry.getTracer("bulk-index-service");
this.operationQueue = new LinkedBlockingQueue<>(queueCapacity);
logger.info("初始化批量索引服务: 队列容量={}", queueCapacity);
}
public void addOperation(IndexOperation operation) {
Span span = tracer.spanBuilder("addOperation").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("operation.type", operation.getType().toString());
span.setAttribute("operation.index", operation.getIndex());
span.setAttribute("operation.id", operation.getEntityId());
if (!operationQueue.offer(operation)) {
logger.warn("队列已满,操作将被延迟处理: {}", operation.getEntityId());
span.setAttribute("queueFull", true);
// 处理队列满的情况
handleQueueFull(operation);
} else {
span.setAttribute("queueFull", false);
logger.debug("操作已添加到队列: {}, 当前队列大小: {}",
operation.getEntityId(), operationQueue.size());
}
int currentBatchSize = batchSizeManager.getCurrentBatchSize();
if (operationQueue.size() >= currentBatchSize) {
span.setAttribute("triggerProcessing", true);
processBulk();
} else {
span.setAttribute("triggerProcessing", false);
}
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("添加操作失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
private void handleQueueFull(IndexOperation operation) {
Span span = tracer.spanBuilder("handleQueueFull").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("operation.id", operation.getEntityId());
// 使用指数退避策略尝试放入队列
CompletableFuture.runAsync(() -> {
int attempts = 0;
long waitTime = 100; // 初始等待时间(毫秒)
while (attempts < 5) {
try {
boolean added = operationQueue.offer(operation, waitTime, TimeUnit.MILLISECONDS);
if (added) {
logger.info("成功将操作添加到队列: {}", operation.getEntityId());
return;
}
// 指数退避增加等待时间
waitTime = Math.min(waitTime * 2, 5000);
attempts++;
logger.debug("尝试添加到队列失败,将在{}ms后重试: {}", waitTime, operation.getEntityId());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
logger.error("等待队列空间时被中断", e);
break;
}
}
// 多次尝试失败后加入重试队列
logger.warn("多次尝试后仍无法添加到队列,加入重试队列: {}", operation.getEntityId());
retryService.scheduleRetry(operation);
});
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("处理队列满逻辑失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
@Scheduled(fixedRateString = "${app.bulk.interval:5000}")
public void processBulk() {
Span span = tracer.spanBuilder("processBulk").startSpan();
try (Scope scope = span.makeCurrent()) {
if (operationQueue.isEmpty()) {
span.setAttribute("queueEmpty", true);
return;
}
// 获取当前最优批处理大小
int batchSize = batchSizeManager.getCurrentBatchSize();
span.setAttribute("batchSize", batchSize);
// 原子性地获取队列中的操作
List<IndexOperation> operations = new ArrayList<>();
operationQueue.drainTo(operations, batchSize);
if (operations.isEmpty()) {
span.setAttribute("operationsEmpty", true);
return;
}
processBatchOperations(operations, span);
} catch (Exception e) {
logger.error("批量处理主流程失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
// 提取方法:处理批量操作
private void processBatchOperations(List<IndexOperation> operations, Span parentSpan) {
parentSpan.setAttribute("operationsCount", operations.size());
logger.info("开始批量处理 {} 个操作", operations.size());
long startTime = System.currentTimeMillis();
// 使用舱壁模式限制并发
boolean permissionAcquired = false;
try {
bulkhead.acquirePermission();
permissionAcquired = true;
// 使用熔断器保护批量操作
circuitBreaker.executeRunnable(() -> executeBulkRequest(operations));
long duration = System.currentTimeMillis() - startTime;
batchSizeManager.recordProcessingTime(duration);
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "bulkProcess");
logContext.put("count", operations.size());
logContext.put("durationMs", duration);
logger.info("批量操作完成: {}", JsonUtils.toJson(logContext));
metricsService.recordBulkOperation(operations.size(), true, duration);
parentSpan.setAttribute("duration_ms", duration);
parentSpan.setStatus(StatusCode.OK);
} catch (BulkheadFullException e) {
logger.warn("并发限制已达上限,稍后重试", e);
parentSpan.setAttribute("error", "bulkhead_full");
parentSpan.setStatus(StatusCode.ERROR, "Bulkhead full");
parentSpan.recordException(e);
// 将操作放回队列或重试队列
handleBulkheadRejection(operations);
} catch (Exception e) {
logger.error("批量操作失败", e);
parentSpan.setAttribute("error", "bulk_operation_failed");
parentSpan.setStatus(StatusCode.ERROR, e.getMessage());
parentSpan.recordException(e);
// 将所有操作加入重试队列
operations.forEach(retryService::scheduleRetry);
} finally {
// 只有成功获取到许可时才归还,避免污染舱壁的许可计数
if (permissionAcquired) {
bulkhead.onComplete();
}
}
}
// 处理舱壁拒绝的操作
private void handleBulkheadRejection(List<IndexOperation> operations) {
// 将操作重新添加到队列或转发到重试服务
for (IndexOperation operation : operations) {
if (!operationQueue.offer(operation)) {
// 队列已满,加入重试队列
retryService.scheduleRetry(operation);
}
}
}
private void executeBulkRequest(List<IndexOperation> operations) {
Span span = tracer.spanBuilder("executeBulkRequest").startSpan();
try (Scope scope = span.makeCurrent()) {
span.setAttribute("operations_count", operations.size());
// 创建批量请求构建器
BulkRequest.Builder bulkRequestBuilder = new BulkRequest.Builder();
// 添加所有操作到批量请求
for (IndexOperation op : operations) {
switch (op.getType()) {
case INDEX -> bulkRequestBuilder.operations(o -> o
.index(i -> i
.index(op.getIndex())
.id(op.getEntityId())
.document(op.getDocument())
)
);
case UPDATE -> bulkRequestBuilder.operations(o -> o
.update(u -> u
.index(op.getIndex())
.id(op.getEntityId())
.doc(op.getDocument())
)
);
case DELETE -> bulkRequestBuilder.operations(o -> o
.delete(d -> d
.index(op.getIndex())
.id(op.getEntityId())
)
);
}
}
// 执行批量请求
BulkResponse response = esClient.bulk(bulkRequestBuilder.build());
if (response.errors()) {
// 处理失败的操作
handleBulkErrors(response, operations);
span.setAttribute("has_errors", true);
span.setAttribute("error_count", countErrors(response));
} else {
span.setAttribute("has_errors", false);
}
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("执行批量请求失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
// 包装为运行时异常,便于在熔断器的 Runnable 中向外传播
throw new RuntimeException("执行批量请求失败", e);
} finally {
span.end();
}
}
private int countErrors(BulkResponse response) {
return (int) response.items().stream()
.filter(item -> item.error() != null)
.count();
}
private void handleBulkErrors(BulkResponse response, List<IndexOperation> operations) {
Span span = tracer.spanBuilder("handleBulkErrors").startSpan();
try (Scope scope = span.makeCurrent()) {
// 遍历所有响应项
List<BulkResponseItem> items = response.items();
int errorCount = 0;
for (int i = 0; i < items.size(); i++) {
BulkResponseItem item = items.get(i);
if (item.error() != null) {
errorCount++;
IndexOperation failedOp = operations.get(i);
Map<String, Object> logContext = new HashMap<>();
logContext.put("operation", "bulkError");
logContext.put("documentId", failedOp.getEntityId());
logContext.put("errorType", item.error().type());
logContext.put("errorReason", item.error().reason());
logger.error("批量操作失败: {}", JsonUtils.toJson(logContext));
// 根据错误类型决定重试策略
if (isRetryableError(item.error().type())) {
retryService.scheduleRetry(failedOp);
} else {
retryService.sendToDeadLetterQueue(failedOp, item.error().reason());
}
}
}
span.setAttribute("error_count", errorCount);
span.setStatus(StatusCode.OK);
} catch (Exception e) {
logger.error("处理批量错误失败", e);
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
} finally {
span.end();
}
}
private boolean isRetryableError(String errorType) {
return List.of(
"es_rejected_execution_exception",
"timeout_exception",
"node_not_available_exception",
"cluster_block_exception"
).contains(errorType);
}
}
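批量服务中反复用到的 IndexOperation 与 RetryService 在正文中没有定义,下面是按照上下文推测出的最小草图(字段与方法签名均为假设):
java
public class IndexOperation {

    public enum OperationType { INDEX, UPDATE, DELETE }

    private String index;       // 目标索引名
    private String entityId;    // 文档ID
    private OperationType type; // 操作类型
    private Object document;    // 文档内容(INDEX/UPDATE 时使用)

    public String getIndex() { return index; }
    public void setIndex(String index) { this.index = index; }
    public String getEntityId() { return entityId; }
    public void setEntityId(String entityId) { this.entityId = entityId; }
    public OperationType getType() { return type; }
    public void setType(OperationType type) { this.type = type; }
    public Object getDocument() { return document; }
    public void setDocument(Object document) { this.document = document; }
}

public interface RetryService {
    // 将失败的操作放入延迟重试队列
    void scheduleRetry(IndexOperation operation);
    // 多次重试仍失败的操作进入死信队列,供人工排查
    void sendToDeadLetterQueue(IndexOperation operation, String reason);
}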
8. 增量数据一致性审计
实现高效的增量数据一致性审计:
java
@Service
public class DataConsistencyAuditService {
private static final Logger logger = LoggerFactory.getLogger(DataConsistencyAuditService.class);
private final ProductRepository productRepository;
private final ElasticsearchClient esClient;
private final RetryService retryService;
private final MetricsService metricsService;
private final ExecutorService executorService;
// 记录上次全量审计时间
private final AtomicReference<LocalDateTime> lastFullAuditTime = new AtomicReference<>(null);
public DataConsistencyAuditService(
ProductRepository productRepository,
ElasticsearchClient esClient,
RetryService retryService,
MetricsService metricsService,
@Qualifier("consistencyCheckExecutor") ExecutorService executorService) {
this.productRepository = productRepository;
this.esClient = esClient;
this.retryService = retryService;
this.metricsService = metricsService;
this.executorService = executorService;
}
// 增量审计:只检查最近修改的数据
@Scheduled(fixedDelayString = "${app.audit.incremental.interval:300000}")
public void performIncrementalAudit() {
logger.info("开始执行增量数据一致性审计");
long startTime = System.currentTimeMillis();
try {
// 检查最近5分钟修改的数据
LocalDateTime cutoffTime = LocalDateTime.now().minusMinutes(5);
// 查询最近修改的产品
List<Product> recentlyModifiedProducts = productRepository.findByUpdatedAtAfter(cutoffTime);
if (recentlyModifiedProducts.isEmpty()) {
logger.info("没有发现最近修改的数据,跳过增量审计");
return;
}
logger.info("增量审计: 发现 {} 条最近修改的记录", recentlyModifiedProducts.size());
AtomicInteger inconsistentCount = new AtomicInteger(0);
CountDownLatch latch = new CountDownLatch(recentlyModifiedProducts.size());
// 并行检查每个产品
for (Product product : recentlyModifiedProducts) {
executorService.submit(() -> {
try {
boolean consistent = checkProductConsistency(product);
if (!consistent) {
inconsistentCount.incrementAndGet();
}
} catch (Exception e) {
logger.error("检查产品一致性时发生错误: {}", product.getId(), e);
} finally {
latch.countDown();
}
});
}
// 等待所有检查完成
latch.await(1, TimeUnit.MINUTES);
long duration = System.currentTimeMillis() - startTime;
logger.info("增量数据一致性审计完成: 检查了 {} 条记录, 发现 {} 条不一致, 耗时 {} ms",
recentlyModifiedProducts.size(), inconsistentCount.get(), duration);
metricsService.recordConsistencyCheckMetrics(
recentlyModifiedProducts.size(),
inconsistentCount.get(),
duration
);
} catch (Exception e) {
logger.error("执行增量数据一致性审计时发生错误", e);
}
}
// 全量抽样审计:随机抽样进行审计
@Scheduled(cron = "${app.audit.full.cron:0 0 1 * * ?}")
public void performSampledFullAudit() {
logger.info("开始执行抽样全量数据一致性审计");
long startTime = System.currentTimeMillis();
lastFullAuditTime.set(LocalDateTime.now());
try {
// 获取总记录数
long totalCount = productRepository.count();
if (totalCount == 0) {
logger.info("没有数据,跳过全量审计");
return;
}
// 确定抽样数量,最多检查10%,但不超过1000条
int sampleSize = (int) Math.min(Math.max(totalCount * 0.1, 100), 1000);
logger.info("全量抽样审计: 总记录数 {}, 抽样数量 {}", totalCount, sampleSize);
List<Product> sampledProducts = productRepository.findRandomSample(sampleSize);
AtomicInteger inconsistentCount = new AtomicInteger(0);
CountDownLatch latch = new CountDownLatch(sampledProducts.size());
// 并行检查每个抽样产品
for (Product product : sampledProducts) {
executorService.submit(() -> {
try {
boolean consistent = checkProductConsistency(product);
if (!consistent) {
inconsistentCount.incrementAndGet();
}
} catch (Exception e) {
logger.error("检查产品一致性时发生错误: {}", product.getId(), e);
} finally {
latch.countDown();
}
});
}
// 等待所有检查完成
latch.await(10, TimeUnit.MINUTES);
long duration = System.currentTimeMillis() - startTime;
logger.info("抽样全量数据一致性审计完成: 检查了 {} 条记录, 发现 {} 条不一致, 耗时 {} ms",
sampledProducts.size(), inconsistentCount.get(), duration);
// 记录指标,包括抽样比例
Map<String, String> tags = new HashMap<>();
tags.put("type", "sampled");
tags.put("sampleRate", String.format("%.2f", (double)sampleSize / totalCount));
metricsService.recordConsistencyCheckMetricsWithTags(
sampledProducts.size(),
inconsistentCount.get(),
duration,
tags
);
// 根据不一致率决定是否触发全量审计
double inconsistencyRate = (double) inconsistentCount.get() / sampledProducts.size();
if (inconsistencyRate > 0.05) { // 如果不一致率超过5%
logger.warn("不一致率较高 ({}%),考虑执行全量数据一致性检查",
String.format("%.2f", inconsistencyRate * 100));
}
} catch (Exception e) {
logger.error("执行抽样全量数据一致性审计时发生错误", e);
}
}
private boolean checkProductConsistency(Product product) {
try {
GetResponse<Product> response = esClient.get(g -> g
.index("products")
.id(product.getId().toString()),
Product.class
);
if (!response.found()) {
// Elasticsearch中不存在,需要同步
logger.warn("发现不一致:ES中缺少产品 {}", product.getId());
syncToElasticsearch(product);
return false;
}
// 检查数据是否一致
Product esProduct = response.source();
if (!isConsistent(product, esProduct)) {
logger.warn("发现不一致:产品 {} 数据不匹配", product.getId());
syncToElasticsearch(product);
return false;
}
return true;
} catch (Exception e) {
logger.error("检查产品 {} 一致性时出错", product.getId(), e);
return false;
}
}
private void syncToElasticsearch(Product product) {
try {
// 创建同步事件
IndexOperation operation = new IndexOperation();
operation.setIndex("products");
operation.setEntityId(product.getId().toString());
operation.setType(IndexOperation.OperationType.INDEX);
operation.setDocument(product);
retryService.scheduleRetry(operation);
// 记录同步事件
metricsService.incrementResyncCounter();
// 记录不一致发现时间
Duration timeSinceUpdate = Duration.between(product.getUpdatedAt(), LocalDateTime.now());
metricsService.recordConsistencyDelay("product", timeSinceUpdate.toMillis());
} catch (Exception e) {
logger.error("创建同步事件失败", e);
}
}
private boolean isConsistent(Product dbProduct, Product esProduct) {
// 使用compareTo而非equals比较BigDecimal;字符串字段用Objects.equals避免空指针
return Objects.equals(dbProduct.getName(), esProduct.getName()) &&
dbProduct.getPrice().compareTo(esProduct.getPrice()) == 0 &&
Objects.equals(dbProduct.getDescription(), esProduct.getDescription());
}
// 获取上次全量审计时间
public LocalDateTime getLastFullAuditTime() {
return lastFullAuditTime.get();
}
}
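审计服务用到的 findByUpdatedAtAfter 与 findRandomSample 属于自定义仓库方法,下面给出一个参考草图(随机抽样这里假设使用 MySQL 的 RAND(),表名也是假设,其他数据库需要换成对应语法):
java
public interface ProductRepository extends JpaRepository<Product, Long> {

    // 查询指定时间之后更新过的产品,用于增量审计
    List<Product> findByUpdatedAtAfter(LocalDateTime updatedAt);

    // 随机抽样,用于全量抽样审计(MySQL 语法,假设实现)
    @Query(value = "SELECT * FROM product ORDER BY RAND() LIMIT :sampleSize", nativeQuery = true)
    List<Product> findRandomSample(@Param("sampleSize") int sampleSize);
}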
9. 启动依赖服务健康检查
确保所有依赖服务在应用启动时都是可用的:
java
@Component
public class ServiceStartupValidator implements ApplicationListener<ContextRefreshedEvent> {
private static final Logger logger = LoggerFactory.getLogger(ServiceStartupValidator.class);
private final ElasticsearchClient esClient;
private final KafkaAdmin kafkaAdmin;
private final DataSource dataSource;
private final MessageSource messageSource;
public ServiceStartupValidator(
ElasticsearchClient esClient,
KafkaAdmin kafkaAdmin,
DataSource dataSource,
MessageSource messageSource) {
this.esClient = esClient;
this.kafkaAdmin = kafkaAdmin;
this.dataSource = dataSource;
this.messageSource = messageSource;
}
@Override
public void onApplicationEvent(ContextRefreshedEvent event) {
logger.info("开始验证依赖服务连接...");
boolean allServicesAvailable = true;
// 验证ES连接
if (!validateElasticsearchConnection()) {
allServicesAvailable = false;
String errorMessage = messageSource.getMessage(
"error.startup.elasticsearch",
null,
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
}
// 验证Kafka连接
if (!validateKafkaConnection()) {
allServicesAvailable = false;
String errorMessage = messageSource.getMessage(
"error.startup.kafka",
null,
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
}
// 验证数据库连接
if (!validateDatabaseConnection()) {
allServicesAvailable = false;
String errorMessage = messageSource.getMessage(
"error.startup.database",
null,
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
}
if (allServicesAvailable) {
logger.info("所有依赖服务连接验证成功");
} else {
logger.warn("部分依赖服务连接验证失败,应用可能无法正常运行");
}
}
private boolean validateElasticsearchConnection() {
try {
boolean isAvailable = esClient.ping().value();
if (isAvailable) {
HealthResponse healthResponse = esClient.cluster().health(); // 新版Java客户端中健康检查响应类型为 HealthResponse
logger.info("Elasticsearch连接成功: 状态={}, 节点数={}",
healthResponse.status(), healthResponse.numberOfNodes());
return true;
} else {
logger.error("Elasticsearch连接失败: ping返回false");
return false;
}
} catch (Exception e) {
logger.error("Elasticsearch连接验证失败", e);
return false;
}
}
private boolean validateKafkaConnection() {
try {
// 获取Kafka集群信息
Map<String, Object> configs = kafkaAdmin.getConfigurationProperties();
String bootstrapServers = (String) configs.get("bootstrap.servers");
// 使用 try-with-resources,确保 AdminClient 在异常情况下也能被关闭
try (AdminClient adminClient = AdminClient.create(configs)) {
DescribeClusterResult clusterResult = adminClient.describeCluster();
// 等待获取结果(最多5秒)
int nodeCount = clusterResult.nodes().get(5, TimeUnit.SECONDS).size();
String clusterId = clusterResult.clusterId().get(5, TimeUnit.SECONDS);
logger.info("Kafka连接成功: 集群ID={}, 节点数={}, 地址={}",
clusterId, nodeCount, bootstrapServers);
}
return true;
} catch (Exception e) {
logger.error("Kafka连接验证失败", e);
return false;
}
}
private boolean validateDatabaseConnection() {
try (Connection conn = dataSource.getConnection()) {
DatabaseMetaData metaData = conn.getMetaData();
logger.info("数据库连接成功: {}@{}",
metaData.getDatabaseProductName(),
metaData.getDatabaseProductVersion());
return true;
} catch (Exception e) {
logger.error("数据库连接验证失败", e);
return false;
}
}
}
10. 国际化支持
添加多语言支持,确保错误消息和日志可以适应不同语言环境:
java
@Configuration
public class InternationalizationConfig {
private static final Logger logger = LoggerFactory.getLogger(InternationalizationConfig.class);
@Bean
public MessageSource messageSource() {
logger.info("初始化国际化消息源");
ReloadableResourceBundleMessageSource messageSource = new ReloadableResourceBundleMessageSource();
messageSource.setBasenames(
"classpath:messages/common",
"classpath:messages/errors",
"classpath:messages/validation"
);
messageSource.setDefaultEncoding("UTF-8");
messageSource.setCacheSeconds(3600); // 缓存消息1小时
return messageSource;
}
@Bean
public LocaleResolver localeResolver() {
AcceptHeaderLocaleResolver resolver = new AcceptHeaderLocaleResolver();
resolver.setDefaultLocale(Locale.US); // 默认使用英语
// 支持的语言列表
List<Locale> supportedLocales = Arrays.asList(
Locale.US, // 英语
Locale.CHINA, // 中文
Locale.JAPAN, // 日语
Locale.GERMANY, // 德语
new Locale("es") // 西班牙语
);
resolver.setSupportedLocales(supportedLocales);
return resolver;
}
}
消息资源文件示例 (messages/errors_zh_CN.properties):
properties
error.product.create=创建产品时发生错误: {0}
error.product.update=更新产品时发生错误: {0}
error.product.notfound=未找到ID为{0}的产品
error.outbox.process=处理Outbox消息失败: {0}
error.elasticsearch.connection=Elasticsearch连接失败: {0}
error.kafka.connection=Kafka连接失败: {0}
error.startup.elasticsearch=启动时无法连接到Elasticsearch服务,请检查配置和网络
error.startup.kafka=启动时无法连接到Kafka服务,请检查配置和网络
error.startup.database=启动时无法连接到数据库服务,请检查配置和网络
11. 分布式测试策略与故障注入
实现全面的测试策略,包括故障注入测试:
java
@SpringBootTest
@Testcontainers
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
public class DistributedTransactionTest {
private static final Logger logger = LoggerFactory.getLogger(DistributedTransactionTest.class);
@Autowired
private ProductService productService;
@Autowired
private ProductRepository productRepository;
@Autowired
private ElasticsearchClient esClient;
@Autowired
private OutboxRepository outboxRepository;
@Autowired
private DisasterRecoveryService recoveryService;
@Autowired
private NetworkPartitionHandler networkPartitionHandler;
@Autowired
private ApplicationContext applicationContext;
// 注意:@MockBean 会替换容器中的 ElasticsearchClient Bean,这里仅供混沌测试用例使用
@MockBean
private ElasticsearchClient mockEsClient;
@Container
static MySQLContainer<?> mySQLContainer = new MySQLContainer<>("mysql:8.0")
.withDatabaseName("testdb")
.withUsername("testuser")
.withPassword("testpass");
@Container
static KafkaContainer kafkaContainer = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:latest"));
@DynamicPropertySource
static void registerProperties(DynamicPropertyRegistry registry) {
registry.add("spring.datasource.url", mySQLContainer::getJdbcUrl);
registry.add("spring.datasource.username", mySQLContainer::getUsername);
registry.add("spring.datasource.password", mySQLContainer::getPassword);
registry.add("spring.kafka.bootstrap-servers", kafkaContainer::getBootstrapServers);
}
@Test
@DisplayName("测试正常流程下的数据一致性")
void testNormalFlowConsistency() throws Exception {
// 创建测试产品
Product product = new Product();
product.setName("测试产品");
product.setPrice(new BigDecimal("99.99"));
product.setDescription("测试描述");
// 保存产品(触发Outbox机制)
Product savedProduct = productService.createProduct(product);
// 验证数据库中有数据
assertThat(productRepository.findById(savedProduct.getId())).isPresent();
// 验证Outbox中有消息
List<OutboxMessage> messages = outboxRepository.findByAggregateIdAndProcessed(
savedProduct.getId().toString(), false);
assertThat(messages).isNotEmpty();
// 等待异步处理
await().atMost(10, TimeUnit.SECONDS).untilAsserted(() -> {
// 验证ES中有数据
GetResponse<Product> response = esClient.get(g -> g
.index("products")
.id(savedProduct.getId().toString()),
Product.class
);
assertThat(response.found()).isTrue();
assertThat(response.source().getName()).isEqualTo(product.getName());
});
}
@Test
@DisplayName("测试网络分区情况下的恢复机制")
void testNetworkPartitionRecovery() throws Exception {
// 模拟ES不可用
ReflectionTestUtils.setField(networkPartitionHandler, "elasticsearchAvailable",
new AtomicBoolean(false));
// 创建产品(此时ES不可用)
Product product = new Product();
product.setName("分区测试产品");
product.setPrice(new BigDecimal("199.99"));
product.setDescription("网络分区测试");
Product savedProduct = productService.createProduct(product);
// 验证Outbox中有未处理的消息
await().atMost(5, TimeUnit.SECONDS).untilAsserted(() -> {
List<OutboxMessage> messages = outboxRepository.findByAggregateIdAndProcessed(
savedProduct.getId().toString(), false);
assertThat(messages).isNotEmpty();
});
// 模拟ES恢复可用
ReflectionTestUtils.setField(networkPartitionHandler, "elasticsearchAvailable",
new AtomicBoolean(true));
// 触发恢复流程
networkPartitionHandler.checkElasticsearchAvailability();
// 验证ES中数据最终一致
await().atMost(15, TimeUnit.SECONDS).untilAsserted(() -> {
GetResponse<Product> response = esClient.get(g -> g
.index("products")
.id(savedProduct.getId().toString()),
Product.class
);
assertThat(response.found()).isTrue();
assertThat(response.source().getName()).isEqualTo(product.getName());
});
}
@Test
@DisplayName("测试混沌工程 - ES间歇性故障")
void testChaosEngineeringIntermittentFailures() throws Exception {
// 配置mock ES客户端间歇性失败
// 配置mock ES客户端:60%概率抛出IOException,模拟间歇性网络/节点故障
when(mockEsClient.index(any(IndexRequest.class)))
.thenAnswer(invocation -> {
if (Math.random() < 0.6) {
throw new IOException("模拟的随机故障");
}
// 模拟成功响应(补齐 IndexResponse 的必填字段)
return IndexResponse.of(r -> r
.index("products")
.id("test-id")
.version(1)
.seqNo(0)
.primaryTerm(1)
.result(Result.Created)
.shards(s -> s.total(1).successful(1).failed(0))
);
});
// 使用测试专用的服务实例(使用mock ES客户端)
ProductIndexService testService = new ProductIndexService(
mockEsClient,
applicationContext.getBean(RetryService.class),
applicationContext.getBean(ObjectMapper.class),
applicationContext.getBean(CircuitBreaker.class),
applicationContext.getBean(MetricsService.class),
applicationContext.getBean(OpenTelemetry.class));
// 执行100次索引操作
List<CompletableFuture<Boolean>> futures = new ArrayList<>();
for (int i = 0; i < 100; i++) {
final int index = i;
futures.add(CompletableFuture.supplyAsync(() -> {
Product product = new Product();
product.setId((long) index);
product.setName("混沌测试产品" + index);
try {
return testService.indexProduct(product);
} catch (Exception e) {
logger.error("索引操作失败", e);
return false;
}
}));
}
// 等待所有操作完成
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
// 计算成功率
long successCount = futures.stream()
.map(CompletableFuture::join)
.filter(success -> success)
.count();
// 验证熔断器和重试机制使得成功率显著高于40%(理论无故障率)
assertThat(successCount).isGreaterThan(60);
logger.info("混沌测试结果: 成功率 {}%", successCount);
}
@Test
@DisplayName("故障注入测试 - 数据库连接中断")
void testDatabaseConnectionFailure() {
// 记录容器是否重启过
AtomicBoolean containerRestarted = new AtomicBoolean(false);
// 使用测试容器模拟数据库故障
mySQLContainer.stop();
// 验证系统能正确处理故障并在数据库恢复后自愈
Product product = new Product();
product.setName("故障恢复测试");
product.setPrice(new BigDecimal("299.99"));
// 预期操作会失败但不会导致系统崩溃
assertThrows(Exception.class, () ->
productService.createProduct(product));
// 在单独线程中重启数据库容器
new Thread(() -> {
try {
// 延迟3秒后重启容器
Thread.sleep(3000);
mySQLContainer.start();
containerRestarted.set(true);
} catch (Exception e) {
logger.error("重启MySQL容器失败", e);
}
}).start();
// 验证系统自愈
await().atMost(30, TimeUnit.SECONDS)
.pollInterval(Duration.ofSeconds(2))
.untilAsserted(() -> {
// 确保容器已重启
assertThat(containerRestarted.get()).isTrue();
try {
Product savedProduct = productService.createProduct(product);
assertThat(savedProduct.getId()).isNotNull();
} catch (Exception e) {
logger.info("系统尚未恢复: {}", e.getMessage());
fail("系统未能在预期时间内自愈: " + e.getMessage());
}
});
}
}
12. 指标和监控增强
添加更多精细的监控指标:
java
@Service
public class MetricsService {
private static final Logger logger = LoggerFactory.getLogger(MetricsService.class);
private final MeterRegistry meterRegistry;
@Value("${spring.application.name}")
private String applicationName;
// 记录最后一次ES不可用的时间
private LocalDateTime lastOutageTime;
public MetricsService(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
}
// 记录同步操作指标
public void recordSyncOperation(String operation, boolean success, long duration) {
Tags tags = Tags.of(
Tag.of("operation", operation),
Tag.of("success", String.valueOf(success)),
Tag.of("application", applicationName)
);
// 记录操作计数
meterRegistry.counter("elasticsearch.sync.operations", tags).increment();
// 记录操作耗时
meterRegistry.timer("elasticsearch.sync.duration", tags).record(duration, TimeUnit.MILLISECONDS);
}
// 记录一致性检查指标
public void recordConsistencyCheckMetrics(int totalChecked, int inconsistentCount, long duration) {
recordConsistencyCheckMetricsWithTags(totalChecked, inconsistentCount, duration, Collections.emptyMap());
}
// 记录带标签的一致性检查指标
public void recordConsistencyCheckMetricsWithTags(
int totalChecked, int inconsistentCount, long duration, Map<String, String> customTags) {
Tags tags = Tags.of(Tag.of("application", applicationName));
// 添加自定义标签
for (Map.Entry<String, String> entry : customTags.entrySet()) {
tags = tags.and(Tag.of(entry.getKey(), entry.getValue()));
}
// 记录检查的记录总数
meterRegistry.gauge("elasticsearch.consistency.checked", tags, totalChecked);
// 记录不一致的记录数
meterRegistry.gauge("elasticsearch.consistency.inconsistent", tags, inconsistentCount);
// 记录一致性检查的耗时
meterRegistry.timer("elasticsearch.consistency.duration", tags).record(duration, TimeUnit.MILLISECONDS);
// 计算一致性比率
double consistencyRatio = totalChecked > 0 ?
(totalChecked - inconsistentCount) / (double) totalChecked : 1.0;
meterRegistry.gauge("elasticsearch.consistency.ratio", tags, consistencyRatio);
// 记录当前时间点的一致性状态
Instant timestamp = Instant.now();
meterRegistry.gauge("elasticsearch.consistency.timestamp", tags, timestamp.getEpochSecond());
}
// 记录一致性延迟
public void recordConsistencyDelay(String entityType, long delayMillis) {
meterRegistry.timer("elasticsearch.consistency.delay",
Tags.of("entity", entityType, "application", applicationName))
.record(delayMillis, TimeUnit.MILLISECONDS);
}
// 增加重同步计数器
public void incrementResyncCounter() {
meterRegistry.counter("elasticsearch.resync.count",
Tags.of("application", applicationName)).increment();
}
// 记录Outbox处理指标
public void recordOutboxProcessing(int processed, int failed, long duration) {
Tags tags = Tags.of("application", applicationName);
meterRegistry.gauge("outbox.processed", tags, processed);
meterRegistry.gauge("outbox.failed", tags, failed);
meterRegistry.timer("outbox.processing.duration", tags).record(duration, TimeUnit.MILLISECONDS);
// 记录处理成功率
double successRate = (processed + failed) > 0 ?
(double) processed / (processed + failed) : 0;
meterRegistry.gauge("outbox.success.rate", tags, successRate);
}
// 记录Outbox消息的年龄分布
public void recordOutboxMessageAge(Duration age) {
meterRegistry.summary("outbox.message.age",
Tags.of("application", applicationName))
.record(age.toMillis());
}
// 记录批量操作
public void recordBulkOperation(int count, boolean success, long duration) {
Tags tags = Tags.of(
"success", String.valueOf(success),
"application", applicationName
);
meterRegistry.counter("elasticsearch.bulk.operations", tags).increment();
meterRegistry.timer("elasticsearch.bulk.duration", tags).record(duration, TimeUnit.MILLISECONDS);
meterRegistry.gauge("elasticsearch.bulk.size", tags, count);
// 计算每条记录的平均处理时间
double avgTimePerRecord = count > 0 ? (double) duration / count : 0;
meterRegistry.gauge("elasticsearch.bulk.avg_time_per_record", tags, avgTimePerRecord);
}
// 记录ES不可用
public void recordElasticsearchOutage() {
lastOutageTime = LocalDateTime.now();
meterRegistry.counter("elasticsearch.outage.count",
Tags.of("application", applicationName)).increment();
// 更新ES可用性指标
meterRegistry.gauge("elasticsearch.available",
Tags.of("application", applicationName), 0);
logger.warn("记录Elasticsearch不可用状态,时间: {}", lastOutageTime);
}
// 记录ES恢复
public void recordElasticsearchRecovery() {
if (lastOutageTime != null) {
long outageDuration = ChronoUnit.SECONDS.between(lastOutageTime, LocalDateTime.now());
meterRegistry.timer("elasticsearch.outage.duration",
Tags.of("application", applicationName))
.record(outageDuration, TimeUnit.SECONDS);
// 更新ES可用性指标
meterRegistry.gauge("elasticsearch.available",
Tags.of("application", applicationName), 1);
logger.info("Elasticsearch恢复,不可用持续时间: {}秒", outageDuration);
}
}
// 记录队列大小
public void recordQueueSize(String queueName, int size, int capacity) {
Tags tags = Tags.of(
"queue", queueName,
"application", applicationName
);
meterRegistry.gauge("queue.size", tags, size);
meterRegistry.gauge("queue.capacity", tags, capacity);
// 队列利用率
double utilizationRate = capacity > 0 ? (double) size / capacity : 0;
meterRegistry.gauge("queue.utilization", tags, utilizationRate);
}
// 记录业务操作耗时和结果
public void recordBusinessOperation(String operation, boolean success, long duration) {
Tags tags = Tags.of(
"operation", operation,
"success", String.valueOf(success),
"application", applicationName
);
meterRegistry.counter("business.operations", tags).increment();
meterRegistry.timer("business.duration", tags).record(duration, TimeUnit.MILLISECONDS);
}
// 记录服务启动状态
public void recordServiceStartup(boolean success, long duration) {
Tags tags = Tags.of(
"success", String.valueOf(success),
"application", applicationName
);
meterRegistry.counter("service.startup", tags).increment();
meterRegistry.timer("service.startup.duration", tags).record(duration, TimeUnit.SECONDS);
}
// 获取最后一次不可用时间
public LocalDateTime getLastOutageTime() {
return lastOutageTime;
}
}
13. 全局异常处理
实现全局异常处理器,统一管理 API 异常:
java
@RestControllerAdvice
public class GlobalExceptionHandler {
private static final Logger logger = LoggerFactory.getLogger(GlobalExceptionHandler.class);
private final MessageSource messageSource;
public GlobalExceptionHandler(MessageSource messageSource) {
this.messageSource = messageSource;
}
@ExceptionHandler(ElasticsearchException.class)
public ResponseEntity<ErrorResponse> handleElasticsearchException(
ElasticsearchException ex, Locale locale) {
logger.error("Elasticsearch操作异常", ex);
String message = messageSource.getMessage(
"error.elasticsearch.operation",
new Object[]{ex.getMessage()},
"Elasticsearch operation failed",
locale
);
ErrorResponse error = new ErrorResponse("ELASTICSEARCH_ERROR", message);
return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body(error);
}
@ExceptionHandler(ProductNotFoundException.class)
public ResponseEntity<ErrorResponse> handleProductNotFoundException(
ProductNotFoundException ex, Locale locale) {
logger.warn("产品未找到: {}", ex.getProductId());
String message = messageSource.getMessage(
"error.product.notfound",
new Object[]{ex.getProductId()},
"Product not found",
locale
);
ErrorResponse error = new ErrorResponse("PRODUCT_NOT_FOUND", message);
return ResponseEntity.status(HttpStatus.NOT_FOUND).body(error);
}
@ExceptionHandler(TransactionSystemException.class)
public ResponseEntity<ErrorResponse> handleTransactionSystemException(
TransactionSystemException ex, Locale locale) {
logger.error("事务系统异常", ex);
String message = messageSource.getMessage(
"error.transaction",
new Object[]{ex.getMessage()},
"Transaction failed",
locale
);
ErrorResponse error = new ErrorResponse("TRANSACTION_ERROR", message);
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(error);
}
@ExceptionHandler(CallNotPermittedException.class)
public ResponseEntity<ErrorResponse> handleCallNotPermittedException(
CallNotPermittedException ex, Locale locale) {
logger.warn("熔断器开启,拒绝请求: {}", ex.getMessage());
String message = messageSource.getMessage(
"error.circuitbreaker.open",
null,
"Service temporarily unavailable due to circuit breaker",
locale
);
ErrorResponse error = new ErrorResponse("SERVICE_UNAVAILABLE", message);
return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
.header("Retry-After", "30")
.body(error);
}
@ExceptionHandler(Exception.class)
public ResponseEntity<ErrorResponse> handleGenericException(
Exception ex, Locale locale) {
logger.error("未处理的异常", ex);
String message = messageSource.getMessage(
"error.generic",
new Object[]{ex.getMessage()},
"An unexpected error occurred",
locale
);
ErrorResponse error = new ErrorResponse("INTERNAL_ERROR", message);
return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body(error);
}
@Data
@AllArgsConstructor
public static class ErrorResponse {
private String code;
private String message;
private Instant timestamp = Instant.now();
public ErrorResponse(String code, String message) {
this.code = code;
this.message = message;
}
}
}
14. 数据迁移策略
实现 Elasticsearch 索引结构变更时的数据迁移策略:
java
@Service
public class IndexMigrationManager {
private static final Logger logger = LoggerFactory.getLogger(IndexMigrationManager.class);
private final ElasticsearchClient esClient;
private final ProductRepository productRepository;
private final BulkIndexService bulkIndexService;
private final MessageSource messageSource;
private final MetricsService metricsService;
public IndexMigrationManager(
ElasticsearchClient esClient,
ProductRepository productRepository,
BulkIndexService bulkIndexService,
MessageSource messageSource,
MetricsService metricsService) {
this.esClient = esClient;
this.productRepository = productRepository;
this.bulkIndexService = bulkIndexService;
this.messageSource = messageSource;
this.metricsService = metricsService;
}
/**
* 执行索引迁移(字段映射变更)
* @param sourceIndex 源索引名称
* @param targetIndex 目标索引名称
* @param fieldMappings 字段映射配置,key为源字段,value为目标字段
* @return 迁移报告
*/
public MigrationReport migrateIndex(
String sourceIndex,
String targetIndex,
Map<String, String> fieldMappings) {
logger.info("开始索引迁移: {} -> {}, 字段映射: {}",
sourceIndex, targetIndex, fieldMappings);
long startTime = System.currentTimeMillis();
MigrationReport report = new MigrationReport();
report.setSourceIndex(sourceIndex);
report.setTargetIndex(targetIndex);
report.setStartTime(LocalDateTime.now());
try {
// 1. 检查源索引是否存在
boolean sourceExists = checkIndexExists(sourceIndex);
if (!sourceExists) {
String errorMessage = messageSource.getMessage(
"error.index.source.notfound",
new Object[]{sourceIndex},
LocaleContextHolder.getLocale()
);
logger.error(errorMessage);
report.setStatus("FAILED");
report.setErrorMessage(errorMessage);
return report;
}
// 2. 创建目标索引(如果不存在)
boolean targetCreated = createTargetIndexIfNotExists(targetIndex);
report.setTargetCreated(targetCreated);
// 3. 根据映射关系执行迁移
long totalDocuments = executeFieldMappingMigration(sourceIndex, targetIndex, fieldMappings, report);
report.setTotalDocuments(totalDocuments);
// 4. 验证迁移结果
boolean validated = validateMigration(sourceIndex, targetIndex, report);
report.setValidated(validated);
// 5. 如果迁移成功,切换别名
if (validated) {
switchAliasIfNeeded(sourceIndex, targetIndex);
report.setStatus("SUCCESS");
} else {
report.setStatus("VALIDATION_FAILED");
}
long duration = System.currentTimeMillis() - startTime;
report.setDuration(duration);
logger.info("索引迁移完成: {} -> {}, 状态: {}, 耗时: {} ms",
sourceIndex, targetIndex, report.getStatus(), duration);
// 记录迁移指标
recordMigrationMetrics(report);
return report;
} catch (Exception e) {
logger.error("索引迁移过程中发生错误", e);
report.setStatus("ERROR");
report.setErrorMessage(e.getMessage());
report.setEndTime(LocalDateTime.now());
return report;
}
}
private boolean checkIndexExists(String indexName) throws IOException {
return esClient.indices().exists(e -> e.index(indexName)).value();
}
private boolean createTargetIndexIfNotExists(String indexName) throws IOException {
boolean exists = checkIndexExists(indexName);
if (!exists) {
logger.info("创建目标索引: {}", indexName);
// 此处可以添加自定义索引设置和映射
esClient.indices().create(c -> c.index(indexName));
return true;
}
return false;
}
private long executeFieldMappingMigration(
String sourceIndex,
String targetIndex,
Map<String, String> fieldMappings,
MigrationReport report) throws IOException {
// 获取源索引文档总数
CountResponse countResponse = esClient.count(c -> c.index(sourceIndex));
long totalDocuments = countResponse.count();
logger.info("源索引文档总数: {}", totalDocuments);
// 如果字段映射为空,使用reindex API
if (fieldMappings.isEmpty()) {
logger.info("字段映射为空,使用内置reindex API");
ReindexResponse reindexResponse = esClient.reindex(r -> r
.source(s -> s.index(sourceIndex))
.dest(d -> d.index(targetIndex))
.refresh(true)
);
// created()/updated() 返回的是可能为 null 的 Long,这里做空值保护
long created = reindexResponse.created() != null ? reindexResponse.created() : 0L;
long updated = reindexResponse.updated() != null ? reindexResponse.updated() : 0L;
report.setMigratedDocuments(created + updated);
report.setFailedDocuments(reindexResponse.failures().size());
if (!reindexResponse.failures().isEmpty()) {
logger.warn("迁移过程中有 {} 个文档失败", reindexResponse.failures().size());
// failures() 的元素类型为 BulkIndexByScrollFailure,失败原因位于 cause().reason()
report.setErrors(reindexResponse.failures().stream()
.map(f -> f.cause() != null ? f.cause().reason() : "unknown")
.collect(Collectors.toList()));
}
return totalDocuments;
}
// 自定义字段映射迁移
logger.info("使用自定义字段映射进行迁移");
int batchSize = 500;
long processedCount = 0;
long successCount = 0;
List<String> errors = new ArrayList<>();
// 分批处理
SearchRequest.Builder searchRequestBuilder = new SearchRequest.Builder()
.index(sourceIndex)
.size(batchSize)
.scroll(s -> s.time("1m"));
SearchResponse<JsonData> response = esClient.search(searchRequestBuilder.build(), JsonData.class);
String scrollId = response.scrollId();
while (response.hits().hits().size() > 0) {
List<IndexOperation> operations = new ArrayList<>();
for (Hit<JsonData> hit : response.hits().hits()) {
try {
// 转换文档字段
// JsonData 没有 toMap() 方法,这里通过 to(Map.class) 交给客户端的 JsonpMapper 反序列化
@SuppressWarnings("unchecked")
Map<String, Object> sourceMap = hit.source() != null ?
hit.source().to(Map.class) : new HashMap<>();
Map<String, Object> targetMap = transformDocument(sourceMap, fieldMappings);
// 创建索引操作
IndexOperation operation = new IndexOperation();
operation.setIndex(targetIndex);
operation.setEntityId(hit.id());
operation.setType(IndexOperation.OperationType.INDEX);
operation.setDocument(targetMap);
operations.add(operation);
processedCount++;
} catch (Exception e) {
logger.error("处理文档 {} 时出错", hit.id(), e);
errors.add("文档ID " + hit.id() + ": " + e.getMessage());
}
}
// 批量处理当前批次
if (!operations.isEmpty()) {
try {
for (IndexOperation op : operations) {
bulkIndexService.addOperation(op);
}
bulkIndexService.processBulk(); // 确保处理完成
successCount += operations.size();
} catch (Exception e) {
logger.error("批量处理失败", e);
errors.add("批量处理失败: " + e.getMessage());
}
}
// 获取下一批结果
if (scrollId == null) {
break;
}
// scrollId 在循环中会被重新赋值,不是 effectively final,需复制到局部变量才能在 lambda 中引用
final String currentScrollId = scrollId;
response = esClient.scroll(s -> s
.scrollId(currentScrollId)
.scroll(t -> t.time("1m")),
JsonData.class
);
scrollId = response.scrollId();
logger.info("迁移进度: {}/{} 文档", processedCount, totalDocuments);
}
// 清理scroll上下文
if (scrollId != null) {
final String finalScrollId = scrollId;
esClient.clearScroll(c -> c.scrollId(finalScrollId));
}
report.setMigratedDocuments(successCount);
report.setFailedDocuments(processedCount - successCount);
report.setErrors(errors);
return totalDocuments;
}
private Map<String, Object> transformDocument(Map<String, Object> source, Map<String, String> fieldMappings) {
Map<String, Object> target = new HashMap<>();
// 先复制所有原始字段
target.putAll(source);
// 应用字段映射
for (Map.Entry<String, String> mapping : fieldMappings.entrySet()) {
String sourceField = mapping.getKey();
String targetField = mapping.getValue();
if (source.containsKey(sourceField)) {
// 将源字段值复制到目标字段
target.put(targetField, source.get(sourceField));
// 源字段与目标字段名不同时移除源字段,避免迁移后出现重复数据
if (!sourceField.equals(targetField)) {
target.remove(sourceField);
}
}
}
return target;
}
private boolean validateMigration(String sourceIndex, String targetIndex, MigrationReport report) throws IOException {
// 比较文档数量
CountResponse sourceCount = esClient.count(c -> c.index(sourceIndex));
CountResponse targetCount = esClient.count(c -> c.index(targetIndex));
long sourceDocs = sourceCount.count();
long targetDocs = targetCount.count();
report.setSourceDocuments(sourceDocs);
report.setTargetDocuments(targetDocs);
// 如果文档数量差距超过允许的阈值(如5%),则验证失败
double documentRatio = sourceDocs > 0 ? (double) targetDocs / sourceDocs : 0;
boolean countValid = documentRatio >= 0.95;
if (!countValid) {
logger.warn("文档数量验证失败: 源={}, 目标={}, 比率={}",
sourceDocs, targetDocs, String.format("%.2f", documentRatio));
}
// 随机抽样验证字段映射是否正确
boolean samplingValid = validateRandomSampling(sourceIndex, targetIndex);
report.setCountValidated(countValid);
report.setSamplingValidated(samplingValid);
return countValid && samplingValid;
}
private boolean validateRandomSampling(String sourceIndex, String targetIndex) throws IOException {
// 抽取10个文档进行验证(此处为 match_all 取前10条;如需真正随机抽样可结合 random_score 查询)
SearchResponse<JsonData> response = esClient.search(s -> s
.index(sourceIndex)
.size(10)
.source(src -> src.fetch(true))
.query(q -> q.matchAll(m -> m)),
JsonData.class
);
for (Hit<JsonData> hit : response.hits().hits()) {
// 获取源文档ID
String docId = hit.id();
// 在目标索引中查找相同ID的文档
GetResponse<JsonData> targetDoc = esClient.get(g -> g
.index(targetIndex)
.id(docId),
JsonData.class
);
if (!targetDoc.found()) {
logger.warn("验证失败: 目标索引中未找到文档 {}", docId);
return false;
}
// 可以添加更复杂的字段验证逻辑
}
return true;
}
private void switchAliasIfNeeded(String sourceIndex, String targetIndex) throws IOException {
// 检查是否有别名指向源索引
GetAliasResponse aliasResponse = esClient.indices().getAlias(a -> a.index(sourceIndex));
// 获取指向源索引的所有别名
List<String> aliases = new ArrayList<>();
if (aliasResponse.result().containsKey(sourceIndex)) {
aliases.addAll(aliasResponse.result().get(sourceIndex).aliases().keySet());
}
if (!aliases.isEmpty()) {
logger.info("找到 {} 个别名指向源索引,将切换到目标索引", aliases.size());
// 构建别名更新操作:每个别名对应一条 remove(源索引) 和一条 add(目标索引)
// Action 即 co.elastic.clients.elasticsearch.indices.update_aliases.Action
List<Action> aliasActions = new ArrayList<>();
for (String alias : aliases) {
aliasActions.add(Action.of(a -> a
.remove(r -> r.index(sourceIndex).alias(alias))
));
aliasActions.add(Action.of(a -> a
.add(add -> add.index(targetIndex).alias(alias))
));
}
// 在一次请求内执行别名切换,读流量原子地从旧索引切到新索引
esClient.indices().updateAliases(u -> u.actions(aliasActions));
logger.info("别名切换完成");
} else {
logger.info("源索引没有关联的别名,无需切换");
}
}
private void recordMigrationMetrics(MigrationReport report) {
Tags tags = Tags.of(
Tag.of("sourceIndex", report.getSourceIndex()),
Tag.of("targetIndex", report.getTargetIndex()),
Tag.of("status", report.getStatus())
);
// 记录迁移文档数
meterRegistry.gauge("elasticsearch.migration.documents.source", tags, report.getSourceDocuments());
meterRegistry.gauge("elasticsearch.migration.documents.target", tags, report.getTargetDocuments());
meterRegistry.gauge("elasticsearch.migration.documents.migrated", tags, report.getMigratedDocuments());
meterRegistry.gauge("elasticsearch.migration.documents.failed", tags, report.getFailedDocuments());
// 记录迁移耗时
meterRegistry.timer("elasticsearch.migration.duration", tags).record(report.getDuration(), TimeUnit.MILLISECONDS);
// 记录迁移成功率
double successRate = report.getSourceDocuments() > 0 ?
(double) report.getMigratedDocuments() / report.getSourceDocuments() : 0;
meterRegistry.gauge("elasticsearch.migration.success_rate", tags, successRate);
}
@Data
public static class MigrationReport {
private String sourceIndex;
private String targetIndex;
private String status;
private LocalDateTime startTime;
private LocalDateTime endTime = LocalDateTime.now();
private boolean targetCreated;
private long totalDocuments;
private long sourceDocuments;
private long targetDocuments;
private long migratedDocuments;
private long failedDocuments;
private boolean countValidated;
private boolean samplingValidated;
private boolean validated;
private long duration;
private String errorMessage;
private List<String> errors = new ArrayList<>();
}
}
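For illustration, here is a minimal sketch of how the migration manager above might be invoked, for example from an admin-only endpoint. The controller class, request path, and index names are hypothetical; the sketch only shows how the fieldMappings parameter and the returned MigrationReport are meant to be used:
java
@RestController
@RequestMapping("/admin/migrations")
public class IndexMigrationController {

    private final IndexMigrationManager migrationManager;

    public IndexMigrationController(IndexMigrationManager migrationManager) {
        this.migrationManager = migrationManager;
    }

    /**
     * Trigger a single migration run, e.g. copying products_v1 -> products_v2
     * while renaming the "name" field to "productName".
     */
    @PostMapping
    public ResponseEntity<IndexMigrationManager.MigrationReport> migrate() {
        // key = field name in the source index, value = field name in the target index
        Map<String, String> fieldMappings = Map.of("name", "productName");
        IndexMigrationManager.MigrationReport report =
                migrationManager.migrateIndex("products_v1", "products_v2", fieldMappings);
        HttpStatus status = "SUCCESS".equals(report.getStatus())
                ? HttpStatus.OK
                : HttpStatus.INTERNAL_SERVER_ERROR;
        return ResponseEntity.status(status).body(report);
    }
}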
Application Configuration Example
yaml
# application.yml
spring:
application:
name: elasticsearch-sync-service
datasource:
url: jdbc:mysql://${DB_HOST:localhost}:${DB_PORT:3306}/${DB_NAME:productdb}
username: ${DB_USERNAME:root}
password: ${DB_PASSWORD:}
hikari:
maximum-pool-size: 20
minimum-idle: 5
connection-timeout: 30000
jpa:
hibernate:
ddl-auto: validate
properties:
hibernate:
dialect: org.hibernate.dialect.MySQL8Dialect
jdbc.batch_size: 50
kafka:
bootstrap-servers: ${KAFKA_SERVERS:localhost:9092}
producer:
key-serializer: org.apache.kafka.common.serialization.StringSerializer
value-serializer: org.apache.kafka.common.serialization.StringSerializer
# 启用幂等性producer
properties:
enable.idempotence: true
acks: all
retries: 3
max.in.flight.requests.per.connection: 1
# 启用事务支持(注:Spring Kafka 更推荐使用 spring.kafka.producer.transaction-id-prefix)
transactional.id: ${spring.application.name}-tx-
consumer:
group-id: ${KAFKA_GROUP_ID:es-sync-service}
auto-offset-reset: earliest
key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
# 手动提交offset
enable-auto-commit: false
# 防止长轮询造成的延迟
properties:
fetch.max.wait.ms: 500
listener:
# 确认模式设置为手动
ack-mode: manual
# 并发消费者
concurrency: ${KAFKA_CONCURRENCY:3}
cache:
type: caffeine
caffeine:
spec: maximumSize=10000,expireAfterWrite=300s
elasticsearch:
uris: ${ES_URIS:http://localhost:9200}
username: ${ES_USERNAME:elastic}
password: ${ES_PASSWORD:}
connection-timeout: 5000
socket-timeout: 60000
max-connections: 100
retry-on-conflict: 3
app:
bulk:
size: 100
interval: 5000
queue-capacity: 1000
min-size: 20
max-size: 500
initial-size: 100
outbox:
polling-interval: 1000
max-concurrent: 50
kafka:
concurrency: 3
retry:
max-attempts: 5
initial-delay: 1000
multiplier: 2
max-delay: 30000
consistency:
check:
cron: "0 0 2 * * ?" # 每天凌晨2点执行
audit:
incremental:
interval: 300000 # 5分钟执行一次增量审计
full:
cron: "0 0 1 * * ?" # 每天凌晨1点执行全量抽样审计
monitoring:
enabled: true
cache:
ttl-seconds: 300
encryption:
password: ${ENCRYPTION_PASSWORD:defaultDevPassword}
salt: ${ENCRYPTION_SALT:aabbccddeeff}
management:
endpoints:
web:
exposure:
include: health,info,prometheus,metrics,caches
metrics:
export:
prometheus:
enabled: true
tags:
application: ${spring.application.name}
health:
elasticsearch:
enabled: true
kafka:
enabled: true
circuitbreakers:
enabled: true
logging:
pattern:
console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} [%X{traceId},%X{spanId},%X{correlationId},%X{messageId}] - %msg%n"
level:
root: INFO
com.example.essync: ${LOG_LEVEL:INFO}
org.springframework.data.elasticsearch: WARN
org.elasticsearch: WARN
org.apache.kafka: WARN
# OpenTelemetry配置
otel:
traces:
exporter: otlp
metrics:
exporter: otlp
logs:
exporter: otlp
exporter:
otlp:
endpoint: ${OTEL_ENDPOINT:http://localhost:4317}
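The spring.*, management.*, logging.* and otel.* keys above are consumed by their respective frameworks, but the custom app.* block has to be bound explicitly. Below is a minimal sketch, assuming a type-safe binding via @ConfigurationProperties, of how the app.bulk settings could be mapped; the class name BulkProperties is hypothetical, and Spring's relaxed binding maps keys such as queue-capacity to queueCapacity:
java
@Data
@ConfigurationProperties(prefix = "app.bulk")
public class BulkProperties {
    /** Target number of operations per bulk request (app.bulk.size). */
    private int size = 100;
    /** Flush interval in milliseconds (app.bulk.interval). */
    private long interval = 5000;
    /** Capacity of the in-memory operation queue (app.bulk.queue-capacity). */
    private int queueCapacity = 1000;
    /** Lower bound for the adaptive batch size (app.bulk.min-size). */
    private int minSize = 20;
    /** Upper bound for the adaptive batch size (app.bulk.max-size). */
    private int maxSize = 500;
    /** Initial batch size before adaptation (app.bulk.initial-size). */
    private int initialSize = 100;
}
Registering the class with @EnableConfigurationProperties(BulkProperties.class) (or @ConfigurationPropertiesScan) makes the values injectable into services such as BulkIndexService.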
Summary
| Feature | Approach | Applicable scenarios |
| --- | --- | --- |
| Data consistency | Outbox pattern | Most business scenarios |
| Concurrency control | Optimistic locking with versioning | High-concurrency writes |
| Failure handling | Retry queue + exponential backoff | Unstable system conditions |
| Bulk performance | Adaptive bulk processing | Large-volume data processing |
| Real-time requirements | Synchronous write + asynchronous confirmation | Latency-sensitive scenarios |
| Data recovery | Incremental and sampled consistency audits | Long-running systems |
| Observability | Distributed tracing + fine-grained metrics | All production environments |
| High-concurrency optimization | Virtual threads + backpressure control | High-throughput scenarios |
| Resilience design | Circuit breaker + rate limiter + bulkhead pattern | Unstable environments |
| Security | Encryption of sensitive data + secure configuration | Scenarios involving personal data |
| Zero-downtime deployment | Index aliases + data migration | System upgrade scenarios |
| Internationalization | Multi-language message resources | Global applications |
| Network partitions | Degradation strategy + automatic recovery | Unreliable network environments |
| Fault testing | Chaos engineering + fault injection | Verifying system resilience |
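As a closing example, the "Retry queue + exponential backoff" row maps directly onto the retry settings in the configuration above (max-attempts: 5, initial-delay: 1000, multiplier: 2, max-delay: 30000). A minimal sketch using Spring Retry (an assumption here, not necessarily the exact retry mechanism used elsewhere in this article) could look like this:
java
@Configuration
public class EsRetryConfig {

    @Bean
    public RetryTemplate esRetryTemplate() {
        // Exponential backoff: 1s, 2s, 4s, ... capped at 30s, mirroring the retry config above
        ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
        backOff.setInitialInterval(1000);   // initial-delay: 1000 ms
        backOff.setMultiplier(2.0);         // multiplier: 2
        backOff.setMaxInterval(30000);      // max-delay: 30000 ms

        // Give up after 5 attempts in total (max-attempts: 5)
        SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(5);

        RetryTemplate template = new RetryTemplate();
        template.setBackOffPolicy(backOff);
        template.setRetryPolicy(retryPolicy);
        return template;
    }
}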