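Get Java Retry Wrong and Production Will Bite You: A Production-Grade Approach

Preface

Isn't retrying the frontend's job? Many developers who are new to the field ask exactly this: why would the backend need a retry mechanism as well?

Think about eating at a restaurant:
Frontend retry: you order a dish, the waiter doesn't hear you clearly, and you say it again.
Backend retry: the chef finds that an ingredient is temporarily out of stock, and the kitchen coordinates internally to find a way around it.

Both serve the same end goal, getting the customer fed, but they deal with problems at completely different layers.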

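I. What Is Backend Retry?

Frontend retry: the browser or app automatically resends a request to the backend.
Backend retry: when a backend service's call to another backend service (or to a database, cache, third-party API, and so on) fails, the call is automatically attempted again.

Why is backend retry needed?

Suppose you are building an e-commerce system: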

graph LR
    A[User places order] --> B[Order service]
    B --> C[Inventory service]
    B --> D[Payment service]
    D --> E[Bank API]

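If the bank's API hits a network blip just as the user pays, you have two options:

Without retry: tell the user right away "Payment failed, please try again", which is a poor experience.
With backend retry: the system retries a few times automatically and notifies the user once the call succeeds, so the user never notices the blip.

Frontend vs. backend retry at a glance

Scenario	Frontend retry	Backend retry
User login	The user submits the password, the request times out, and the frontend sends it again	The auth service's call to the user database fails and is retried automatically between services
Order lookup	The page fails to load and the user refreshes manually, or the frontend retries automatically	The order service's database query fails and is retried automatically
Payment	The payment page gets no response after submission and the frontend resubmits	The payment service's call to the bank API fails and is retried automatically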

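II. Where Backend Retry Applies

Scenario 1: Service calls in a microservice architecture

Modern applications are rarely a single monolith; they are usually composed of multiple microservices: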

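User request → API gateway → Order service → Payment service → Inventory service

The following example shows a Spring Boot service that retries a call with the @Retryable annotation from Spring Retry. The annotation only takes effect when spring-retry and Spring AOP are on the classpath and @EnableRetry is declared on a configuration class.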

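import lombok.extern.slf4j.Slf4j;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Recover;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class OrderService {

    private final PaymentService paymentService;

    public OrderService(PaymentService paymentService) {
        this.paymentService = paymentService;
    }

    // Retry automatically via @Retryable
    @Retryable(
        value = {RemoteServiceException.class}, // which exceptions trigger a retry
        maxAttempts = 3,                       // at most 3 attempts in total, including the first call
        backoff = @Backoff(delay = 1000)       // wait 1 second between attempts
    )
    public PaymentResult processPayment(Order order) {
        // Call the payment service; a failure is retried automatically
        return paymentService.charge(order);
    }

    // Invoked once every attempt has failed
    @Recover
    public PaymentResult paymentFallback(RemoteServiceException e, Order order) {
        // Log, raise an alert, run compensation
        log.error("Payment service call failed, order ID: {}", order.getId(), e);
        return PaymentResult.failed("System busy, please try again later");
    }
}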
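
To make the annotation less magical, here is a minimal plain-Java sketch of the same idea: a bounded number of attempts, a fixed wait between them, and a recovery function once every attempt has failed. The class and method names (SimpleRetry, execute) are made up for illustration, and this is not how Spring Retry is implemented internally.

```java
import java.util.concurrent.Callable;
import java.util.function.Function;

/** Minimal fixed-delay retry helper (illustrative only, not Spring Retry's implementation). */
final class SimpleRetry {

    private SimpleRetry() {}

    /**
     * Runs the task up to maxAttempts times in total, sleeping delayMillis between attempts.
     * If every attempt fails, the last exception is handed to the recovery function.
     */
    public static <T> T execute(Callable<T> task, int maxAttempts, long delayMillis,
                                Function<Exception, T> recover) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        return recover.apply(last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = execute(() -> {
            // Fails twice, succeeds on the third attempt
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "paid";
        }, 3, 10, e -> "fallback");
        System.out.println(result + " after " + calls[0] + " calls"); // paid after 3 calls
    }
}
```

The recovery function plays the role of the @Recover method: it only runs after the last attempt has failed.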

Scenario 2: Retrying database operations

Database access occasionally runs into transient problems:

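@Repository
public class UserRepository {

    private final JdbcTemplate jdbcTemplate;

    public UserRepository(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Retry on transient database failures (3 attempts by default)
    @Retryable(value = {TransientDataAccessException.class})
    public User saveUser(User user) {
        // Failures that Spring classifies as transient (for example query timeouts
        // and lock conflicts) are retried automatically
        jdbcTemplate.update(
            "INSERT INTO users (name, email) VALUES (?, ?)",
            user.getName(), user.getEmail()
        );
        // JdbcTemplate.update returns the affected row count, so return the entity itself
        return user;
    }
}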

Scenario 3: Calling third-party APIs

Calls to external services are often unstable:

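@Service
public class WeatherService {

    private final WeatherApiClient weatherApiClient;

    public WeatherService(WeatherApiClient weatherApiClient) {
        this.weatherApiClient = weatherApiClient;
    }

    @Retryable(
        value = {HttpServerErrorException.class}, // 5xx server errors
        maxAttempts = 2,                          // 2 attempts in total
        backoff = @Backoff(delay = 2000)          // wait 2 seconds before the second attempt
    )
    public WeatherData getWeather(String city) {
        // Call the weather API; retry when the server reports an error
        return weatherApiClient.getWeather(city);
    }
}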

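III. Why Is Backend Retry More Complex?

1. Idempotency

What is idempotency? Executing the same operation multiple times produces the same result as executing it once.

Frontend retry: mostly repeats query operations (GET), which are naturally idempotent, so repeating them doesn't change any data.
Backend retry: operations such as creating an order or taking a payment must be made idempotent, otherwise a retry creates duplicates.

Solution: an idempotency key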

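@Service
public class PaymentService {

    private final PaymentGateway paymentGateway;

    public PaymentService(PaymentGateway paymentGateway) {
        this.paymentGateway = paymentGateway;
    }

    // @Retryable sits on the public entry point: Spring applies it through a proxy,
    // so a private method called via `this` would never be retried
    @Retryable(maxAttempts = 3)
    public PaymentResult processPayment(PaymentRequest request) {
        // The same order always maps to the same idempotency key
        String idempotentKey = generateIdempotentKey(request.getOrderId());

        // The gateway uses the key to reject duplicate charges
        return paymentGateway.charge(
            request.getAmount(),
            request.getCardToken(),
            idempotentKey
        );
    }

    private String generateIdempotentKey(String orderId) {
        // Derived only from the order ID; mixing in a timestamp would produce a new key
        // for every incoming request and defeat deduplication
        return "payment_" + orderId;
    }
}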
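
The key only helps if the receiving side remembers it. The sketch below shows that half under simplifying assumptions: a hypothetical IdempotentGateway keeps results in an in-memory map, whereas a real gateway would more likely rely on a database unique constraint or a shared cache. All names here are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory sketch of idempotency-key handling on the receiving side (hypothetical gateway). */
final class IdempotentGateway {

    // Result of the first charge for each idempotency key
    private final Map<String, String> results = new ConcurrentHashMap<>();
    private final AtomicInteger realCharges = new AtomicInteger();

    /** Charges at most once per key; a repeated key returns the stored result. */
    public String charge(String idempotencyKey, long amountCents) {
        // computeIfAbsent is atomic per key, so concurrent retries cannot both charge
        return results.computeIfAbsent(idempotencyKey, key -> {
            int n = realCharges.incrementAndGet(); // stands in for the real debit
            return "txn-" + n + ":" + amountCents;
        });
    }

    public int realChargeCount() {
        return realCharges.get();
    }

    public static void main(String[] args) {
        IdempotentGateway gateway = new IdempotentGateway();
        String first = gateway.charge("payment_1001", 4999);
        String retried = gateway.charge("payment_1001", 4999); // retry with the same key
        System.out.println(first.equals(retried));     // true
        System.out.println(gateway.realChargeCount()); // 1
    }
}
```

Because the retry carries the same key, the second call returns the stored result instead of debiting the card again.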

2. Deciding when to retry

Not every failure should be retried:

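// Attribute names from Spring Retry 1.x; 2.x renames them to retryFor / noRetryFor
@Retryable(value = {
    // Exceptions worth retrying (transient failures)
    SocketTimeoutException.class,      // network timeout
    ConnectException.class,            // connection failure
    HttpServerErrorException.class,    // 5xx server error
    TooManyRequestsException.class     // 429 rate limited (retry later)
}, exclude = {
    // Exceptions not worth retrying (permanent failures)
    IllegalArgumentException.class,    // bad argument (retrying won't help)
    AuthenticationException.class,     // authentication failed (user must log in again)
    InsufficientBalanceException.class // insufficient balance (user must top up)
})
public BusinessResult callExternalService(BusinessRequest request) {
    return externalService.process(request);
}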
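
The same decision can be written as an ordinary predicate, which is handy when the retry loop is hand-rolled or when the real cause is wrapped in another exception. This is a sketch using only JDK exception types; RetryClassifier and its two lists are illustrative choices, not a complete classification.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.List;

/** Decides whether a failure looks transient (retry) or permanent (fail fast). */
final class RetryClassifier {

    // Illustrative lists; a real service would add its own exception types
    private static final List<Class<? extends Throwable>> TRANSIENT =
            List.of(SocketTimeoutException.class, ConnectException.class);

    private static final List<Class<? extends Throwable>> PERMANENT =
            List.of(IllegalArgumentException.class, SecurityException.class);

    private RetryClassifier() {}

    /** Walks the cause chain; anything not recognized is treated as not retryable. */
    public static boolean isRetryable(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            for (Class<? extends Throwable> type : PERMANENT) {
                if (type.isInstance(t)) {
                    return false;
                }
            }
            for (Class<? extends Throwable> type : TRANSIENT) {
                if (type.isInstance(t)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("call failed", new ConnectException("refused"));
        System.out.println(isRetryable(wrapped));                             // true
        System.out.println(isRetryable(new IllegalArgumentException("bad"))); // false
    }
}
```

Defaulting to "do not retry" for unknown failures is the conservative choice: an unnecessary retry of a non-idempotent call usually costs more than a missed one.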

3. More sophisticated retry strategies

Frontend retry is usually simple; the backend has to weigh several strategies:

@Retryable(
    maxAttempts = 4,
    backoff = @Backoff(
        delay = 1000,        // initial delay: 1 second
        multiplier = 2,      // double the delay each time
        maxDelay = 10000,    // cap each delay at 10 seconds
        random = true        // add randomness to avoid the "thundering herd" effect
    )
)
public ServiceResponse callWithBackoff() {
    return remoteService.call();
}
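
The arithmetic behind these parameters is short enough to write out. The sketch below computes the capped exponential delay and then applies "full jitter" (a uniform random delay between 0 and the capped value). Full jitter is a commonly used variant chosen here for illustration; it is not claimed to be the exact randomization Spring Retry applies for random = true. BackoffSchedule and its method names are made up.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Capped exponential backoff with optional full jitter (illustrative sketch). */
final class BackoffSchedule {

    private BackoffSchedule() {}

    /** Delay before retry number retryIndex (0-based), without jitter. */
    public static long baseDelayMillis(long initialMillis, double multiplier,
                                       long maxMillis, int retryIndex) {
        double delay = initialMillis * Math.pow(multiplier, retryIndex);
        return (long) Math.min(delay, (double) maxMillis);
    }

    /** Full jitter: a uniformly random delay in [0, baseDelay]. */
    public static long jitteredDelayMillis(long initialMillis, double multiplier,
                                           long maxMillis, int retryIndex) {
        long base = baseDelayMillis(initialMillis, multiplier, maxMillis, retryIndex);
        return ThreadLocalRandom.current().nextLong(base + 1);
    }

    public static void main(String[] args) {
        // Same parameters as the annotation: 1s initial delay, x2 multiplier, 10s cap
        for (int i = 0; i < 5; i++) {
            System.out.println(baseDelayMillis(1000, 2.0, 10000, i));
        }
        // Prints 1000, 2000, 4000, 8000, 10000 (one value per line)
    }
}
```

With maxAttempts = 4 there are only three waits, so only the first three values (1s, 2s, 4s) are ever used; the 10-second cap starts to matter once more attempts are allowed.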

IV. Designing Retry into a Project

Case study: a complete retry design for the e-commerce order flow

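@Service
@Slf4j
public class OrderCreationService {

    // The retryable steps live in a separate bean: @Retryable and @Async are applied by a
    // Spring proxy, so they are ignored on private methods and on calls made through `this`
    @Resource
    private OrderSteps orderSteps;

    // No @Transactional here: it would keep a database transaction open across remote calls
    // and retry waits. The order write (saveOrder, omitted) should use its own short transaction.
    public OrderResult createOrder(CreateOrderRequest request) {
        try {
            // 1. Deduct inventory (retryable)
            orderSteps.deductInventory(request);

            // 2. Take payment (retryable + idempotent)
            orderSteps.processPayment(request);

            // 3. Save the order
            Order order = saveOrder(request);

            // 4. Send the notification (retryable, does not block the main flow)
            orderSteps.sendNotificationAsync(order);

            return OrderResult.success(order);

        } catch (Exception e) {
            // Compensate when the flow fails as a whole
            log.error("Order creation failed, order ID: {}", request.getOrderId(), e);
            compensateOrderCreation(request);
            return OrderResult.failed("Failed to create order");
        }
    }

    // saveOrder and compensateOrderCreation are omitted
}

@Component
class OrderSteps {

    @Resource
    private InventoryService inventoryService;

    @Resource
    private PaymentService paymentService;

    @Resource
    private NotificationService notificationService;

    // The inventory side must treat a repeated deduction for the same order as a no-op
    @Retryable(maxAttempts = 3, backoff = @Backoff(1000))
    public void deductInventory(CreateOrderRequest request) {
        inventoryService.deduct(request.getItems());
    }

    @Retryable(maxAttempts = 3, backoff = @Backoff(2000))
    public void processPayment(CreateOrderRequest request) {
        String idempotentKey = "order_" + request.getOrderId();
        paymentService.charge(request.getAmount(), idempotentKey);
    }

    // Requires @EnableAsync; runs on an async executor thread
    @Async
    @Retryable(maxAttempts = 2)
    public void sendNotificationAsync(Order order) {
        notificationService.sendOrderConfirmation(order);
    }
}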
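
The compensation step in the catch block deserves a closer look, because it has to undo exactly the steps that already succeeded. One way to sketch this in plain Java is to pair every step with its own undo action and roll back in reverse order on failure. CompensatingFlow and Step are hypothetical names, and the step descriptions in main are stand-ins for the real service calls.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Runs steps in order and undoes the completed ones if a later step fails (sketch). */
final class CompensatingFlow {

    /** One step of the flow: an action plus the compensation that undoes it. */
    static final class Step {
        final Runnable action;
        final Runnable compensation;

        public Step(Runnable action, Runnable compensation) {
            this.action = action;
            this.compensation = compensation;
        }
    }

    private CompensatingFlow() {}

    /** Returns true if every step succeeded, false if the flow was rolled back. */
    public static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException e) {
                // Undo in reverse order; the failed step itself never completed
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = run(List.of(
            new Step(() -> log.add("deduct inventory"), () -> log.add("restore inventory")),
            new Step(() -> { throw new IllegalStateException("payment failed"); },
                     () -> log.add("refund payment"))
        ));
        System.out.println(ok + " " + log); // false [deduct inventory, restore inventory]
    }
}
```

In a real system each action would already include its own retries, and the compensations themselves need to be idempotent, since a step that timed out may in fact have taken effect.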

Best practices for retry strategy configuration

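# application.yml - retry configs per scenario (Resilience4j Spring Boot starter, not Spring Retry)
resilience4j:
  retry:
    configs:
      default:
        max-attempts: 3
        wait-duration: 1s
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2

      fast-retry:    # fast retry: momentary faults such as network blips
        max-attempts: 2
        wait-duration: 100ms

      slow-retry:    # slow retry: the external service needs time to recover
        max-attempts: 5
        wait-duration: 5s
        enable-exponential-backoff: true   # required for the multiplier below to apply
        exponential-backoff-multiplier: 2

      no-retry:      # no retry: business logic errors
        max-attempts: 1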

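V. Pitfalls of Retry and How to Handle Them

Problem 1: Retry storm

When many services retry at the same moment, the extra load can cascade into an avalanche:

Solution: add random jitter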

@Backoff(
    delay = 1000,
    maxDelay = 10000,
    multiplier = 2,
    random = true  // add randomness so that retries don't all fire at the same time
)

Problem 2: Blocking the user request for a long time

Solution: retry asynchronously

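// Requires @EnableAsync on a configuration class
@Async
@Retryable(maxAttempts = 3)
public CompletableFuture<Void> asyncRetryOperation() {
    // Runs on an async executor thread, so the caller's thread is not blocked
    heavyOperation();
    return CompletableFuture.completedFuture(null);
}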
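
Without Spring, the same effect can be sketched with a ScheduledExecutorService: each failed attempt schedules the next one instead of sleeping, so no thread sits blocked during the wait. AsyncRetry and retryAsync are illustrative names, and the sketch retries only RuntimeException.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Non-blocking retry: failed attempts are rescheduled rather than slept on (sketch). */
final class AsyncRetry {

    private AsyncRetry() {}

    /** Returns immediately; the future completes with the result or the last failure. */
    public static <T> CompletableFuture<T> retryAsync(Supplier<T> task, int maxAttempts,
                                                      long delayMillis,
                                                      ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(task, 1, maxAttempts, delayMillis, scheduler, result);
        return result;
    }

    private static <T> void attempt(Supplier<T> task, int attemptNo, int maxAttempts,
                                    long delayMillis, ScheduledExecutorService scheduler,
                                    CompletableFuture<T> result) {
        scheduler.execute(() -> {
            try {
                result.complete(task.get());
            } catch (Throwable t) {
                if (t instanceof RuntimeException && attemptNo < maxAttempts) {
                    // Schedule the next attempt after the delay
                    scheduler.schedule(
                        () -> attempt(task, attemptNo + 1, maxAttempts, delayMillis, scheduler, result),
                        delayMillis, TimeUnit.MILLISECONDS);
                } else {
                    result.completeExceptionally(t);
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        CompletableFuture<String> future = retryAsync(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "done";
        }, 3, 20, scheduler);
        System.out.println(future.get(5, TimeUnit.SECONDS)); // done
        scheduler.shutdown();
    }
}
```

The caller gets a CompletableFuture back right away and can attach callbacks to it, which is what keeps the user-facing request from being held up by the retries.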

Problem 3: Retrying in an endless loop

Solution: sensible retry limits and timeout control

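@Retryable(
    maxAttempts = 3,  // cap the number of attempts
    // maxDelay belongs to @Backoff and caps a single wait between attempts, not the total retry time
    backoff = @Backoff(delay = 1000, multiplier = 2, maxDelay = 5000)
)
public String callWithLimits() {
    // The per-call timeout (for example 10 seconds) is configured on the client behind
    // externalService, such as its connect and read timeouts. Worst case is then roughly
    // 3 x 10s of calls plus 1s + 2s of backoff.
    return externalService.call();
}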
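
When the total time really has to be bounded, one option is to give the retry loop an explicit deadline in addition to the attempt limit. The sketch below stops as soon as the next wait would overrun the budget. DeadlineRetry is a made-up name, and note the limitation: it does not interrupt a single call that hangs, so the per-call timeout on the client is still needed.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Retry bounded by both an attempt limit and a total time budget (sketch). */
final class DeadlineRetry {

    private DeadlineRetry() {}

    /**
     * Retries until the task succeeds, maxAttempts is reached, or the next wait would
     * overrun totalBudget. Rethrows the last failure when it gives up.
     */
    public static <T> T call(Callable<T> task, int maxAttempts, Duration delay,
                             Duration totalBudget) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long deadline = System.nanoTime() + totalBudget.toNanos();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
            }
            boolean attemptsLeft = attempt < maxAttempts;
            boolean timeLeft = deadline - System.nanoTime() > delay.toNanos();
            if (!attemptsLeft || !timeLeft) {
                break;
            }
            Thread.sleep(delay.toMillis());
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = call(() -> {
            // Fails once, then succeeds
            if (++calls[0] < 2) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 3, Duration.ofMillis(50), Duration.ofSeconds(30));
        System.out.println(result + " after " + calls[0] + " calls"); // ok after 2 calls
    }
}
```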

VI. Monitoring and Alerting

Retrying without monitoring leaves you feeling your way in the dark:

A simple retry monitor:

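import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.retry.support.RetrySynchronizationManager;
import org.springframework.stereotype.Component;

@Component
public class RetryMonitor {

    private final MeterRegistry meterRegistry;

    public RetryMonitor(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    // Record retry metrics
    public void recordRetry(String service, String method, boolean success, int retryCount) {
        // Count attempts by outcome
        Counter.builder("api.retry.count")
            .tag("service", service)
            .tag("method", method)
            .tag("success", String.valueOf(success))
            .register(meterRegistry)
            .increment();

        // Record the distribution of retry counts
        DistributionSummary.builder("api.retry.attempts")
            .tag("service", service)
            .register(meterRegistry)
            .record(retryCount);
    }
}

// Using it inside retry logic
@Retryable(value = Exception.class, maxAttempts = 3)
public void someMethod() {
    try {
        // business logic
    } catch (Exception e) {
        // Retries already performed, read from the active Spring Retry context (0 on the first attempt)
        int retryCount = RetrySynchronizationManager.getContext().getRetryCount();
        retryMonitor.recordRetry("UserService", "getUser", false, retryCount);
        throw e;
    }
}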

Making retry parameters configurable:

# application.yml - a different retry strategy per service
retry:
  configs:
    user-service:
      max-attempts: 3
      initial-delay: 1000ms
      multiplier: 2
    payment-service:
      max-attempts: 5    # payments matter, so try a few more times
      initial-delay: 2000ms
      multiplier: 2
    sms-service:
      max-attempts: 2    # fewer attempts for SMS to avoid spamming users
      initial-delay: 3000ms
      multiplier: 1.5

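Summary

Division of responsibility between frontend and backend retry

Layer	Type of retry it owns	Examples
Frontend retry	Momentary faults at the user-interaction level	Network blips, gateway timeouts
API gateway retry	Faults at the routing level	A backend service instance is briefly unavailable
Backend service retry	Transient faults at the business-logic level	Database connection timeouts, third-party API rate limiting

The complete retry architecture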

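User request → Frontend retry (network layer) → API gateway retry (routing layer) → Backend service retry (business layer)

Practical advice

For developers who are new to backend retry:

Start small: add retry to the most important service calls first
Keep retry counts conservative: 2 to 3 attempts is usually enough, and more only adds load to a system that is already struggling
Always enforce timeouts: otherwise retries can leave a request hanging for a long time
Handle the case where every retry fails: have a fallback plan ready

Which scenarios in your current project would benefit from backend retry, and how would you design a suitable retry strategy for them?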

This article was first published on the WeChat official account 程序员刘大华, which shares hands-on notes on frontend and backend development.
