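Fixing the Connection Reset by Peer Error in the AI Service

Problem Description

Symptom: When the scheduled task calls the DashScope AI API, the request fails immediately after it is sent with a Connection reset by peer exception, so the weekly report is not generated.

Error log: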

[reactor-http-epoll-1] WARN  r.n.http.client.HttpClientConnect - [bb53674b-2, L:/172.19.0.6:51476 - R:dashscope.aliyuncs.com/39.96.213.166:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

[scheduling-1] ERROR c.j.service.component.AIComponent - AI 调用失败: recvAddress(..) failed: Connection reset by peer
org.springframework.web.reactive.function.client.WebClientRequestException: recvAddress(..) failed: Connection reset by peer

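Root cause analysis:

Reactor Netty reuses TCP connections through a connection pool by default. The scheduled task runs only once a week, so a connection left idle in the pool between runs has long since been closed unilaterally by the server (DashScope) once its idle (keep-alive) timeout expired, and the client never noticed. When the task fires, the client tries to reuse this "zombie connection" to send the request, the server answers with a TCP RST, and the client sees Connection reset by peer. Because the headers and body had already been written, Reactor Netty does not retry the request on its own, as the WARN line states.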

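Solution

1. Connection pool tuning (root fix)

Configure a ConnectionProvider that limits how long idle connections may live, so that stale connections are never reused: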

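import java.time.Duration;
import io.netty.channel.ChannelOption;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

ConnectionProvider connectionProvider = ConnectionProvider.builder("ai-pool")
        .maxConnections(16)
        .maxIdleTime(Duration.ofSeconds(30))       // close connections idle for more than 30 seconds
        .maxLifeTime(Duration.ofMinutes(5))        // a connection lives for at most 5 minutes
        .evictInBackground(Duration.ofSeconds(30)) // sweep expired connections in the background every 30 seconds
        .build();

// connectTimeout and readTimeout are millisecond values defined elsewhere in the class
HttpClient httpClient = HttpClient.create(connectionProvider)
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeout)
        .responseTimeout(Duration.ofMillis(readTimeout));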
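
To make the effect of maxIdleTime and maxLifeTime concrete, here is a minimal sketch of the kind of check a pool can apply before handing out a connection. It is an illustrative model only, not Reactor Netty's internal implementation; the class name IdleEvictionSketch, the method shouldEvict, and the timestamps are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative model only: not Reactor Netty's internal implementation.
final class IdleEvictionSketch {

    /** Returns true when a pooled connection should be discarded instead of reused. */
    static boolean shouldEvict(Instant createdAt, Instant lastUsedAt, Instant now,
                               Duration maxIdleTime, Duration maxLifeTime) {
        Duration idle = Duration.between(lastUsedAt, now);
        Duration age = Duration.between(createdAt, now);
        return idle.compareTo(maxIdleTime) > 0 || age.compareTo(maxLifeTime) > 0;
    }

    public static void main(String[] args) {
        Duration maxIdle = Duration.ofSeconds(30);
        Duration maxLife = Duration.ofMinutes(5);
        Instant lastRun = Instant.parse("2024-01-01T02:00:00Z"); // hypothetical timestamp
        Instant nextRun = lastRun.plus(Duration.ofDays(7));      // the weekly schedule

        // A connection left over from last week's run is far past both limits.
        System.out.println(shouldEvict(lastRun, lastRun, nextRun, maxIdle, maxLife));
        // A connection used 10 seconds ago is still within both limits.
        System.out.println(shouldEvict(lastRun, lastRun, lastRun.plusSeconds(10), maxIdle, maxLife));
    }
}
```

With a one-week gap between runs, any connection from the previous run fails this check, so the next run opens a fresh connection instead of reusing one the server has already dropped.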

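2. Retry mechanism (fault tolerance)

Transient network failures (WebClientRequestException, IOException) are retried automatically with exponential backoff, up to 3 times starting from a 2-second delay: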

.retryWhen(Retry.backoff(3, Duration.ofSeconds(2))   // up to 3 retries, backoff starting at 2 s
        .filter(this::isRetryableError)              // retry only transient network errors
        .doBeforeRetry(signal -> log.warn("AI call hit a network error, retry #{}: {}",
                signal.totalRetries() + 1, signal.failure().getMessage())))
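
As a rough illustration of what exponential backoff means for these parameters, the sketch below computes the nominal delay before each retry as minBackoff * 2^i. It deliberately ignores the random jitter that Retry.backoff applies by default (to my understanding a factor of 0.5), so real delays vary around these values. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Illustrative arithmetic only: jitter and any maximum backoff cap are ignored.
final class BackoffScheduleSketch {

    /** Nominal delay before each retry: minBackoff * 2^i for i = 0 .. maxRetries - 1. */
    static List<Duration> nominalDelays(int maxRetries, Duration minBackoff) {
        List<Duration> delays = new ArrayList<>();
        for (int i = 0; i < maxRetries; i++) {
            delays.add(minBackoff.multipliedBy(1L << i));
        }
        return delays;
    }

    public static void main(String[] args) {
        // Same parameters as Retry.backoff(3, Duration.ofSeconds(2))
        System.out.println(nominalDelays(3, Duration.ofSeconds(2))); // prints [PT2S, PT4S, PT8S]
    }
}
```

So the three retries wait roughly 2, 4, and 8 seconds, about 14 seconds of added delay in total before the call gives up.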

Retry predicate:

private boolean isRetryableError(Throwable throwable) {
    if (throwable instanceof WebClientRequestException) return true;
    if (throwable instanceof IOException) return true;
    // Walk the cause chain
    Throwable cause = throwable.getCause();
    while (cause != null) {
        if (cause instanceof IOException) return true;
        cause = cause.getCause();
    }
    return false;
}
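
The cause-chain walk can be tried in isolation with a standard-library-only sketch. It drops the WebClientRequestException check (which needs Spring on the classpath) and keeps only the IOException part; CauseChainSketch and hasCause are hypothetical names, and the exceptions are constructed by hand for demonstration.

```java
import java.io.IOException;
import java.net.SocketException;

// Stand-alone variant of the cause-chain check, using only JDK types.
final class CauseChainSketch {

    /** Returns true if the throwable or any of its causes is an instance of the given type. */
    static boolean hasCause(Throwable throwable, Class<? extends Throwable> type) {
        for (Throwable t = throwable; t != null; t = t.getCause()) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A network error wrapped in an application-level exception is found via the chain.
        Throwable wrapped = new RuntimeException("AI call failed",
                new SocketException("Connection reset by peer"));
        System.out.println(hasCause(wrapped, IOException.class)); // prints true

        // An unrelated error has no IOException anywhere in its chain.
        System.out.println(hasCause(new IllegalArgumentException("bad prompt"), IOException.class)); // prints false
    }
}
```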

3. Timeout adjustment

An AI request takes 2-3 minutes to respond (web search plus generation of a large amount of data), so read-timeout is raised from 120 s to 240 s:

# application-online.yml
dashscope:
  read-timeout: 240000  # 4 minutes, leaving headroom over the 2-3 minute response time
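
The longer timeout interacts with the retry policy, so it may be worth estimating how long one scheduled run can take in the worst case. The sketch below is back-of-the-envelope arithmetic under two assumptions that are not confirmed by this note: that a response timeout surfaces as an exception the retry filter accepts, and that backoff delays take their nominal values (jitter ignored). RetryBudgetSketch and worstCase are hypothetical names.

```java
import java.time.Duration;

// Back-of-the-envelope estimate; assumes timeouts are retried and ignores backoff jitter.
final class RetryBudgetSketch {

    /** Upper bound: every attempt runs into the read timeout, plus nominal backoff between attempts. */
    static Duration worstCase(Duration readTimeout, int maxRetries, Duration minBackoff) {
        Duration total = readTimeout.multipliedBy(maxRetries + 1L); // first attempt + retries
        for (int i = 0; i < maxRetries; i++) {
            total = total.plus(minBackoff.multipliedBy(1L << i));   // 2 s, 4 s, 8 s, ...
        }
        return total;
    }

    public static void main(String[] args) {
        Duration budget = worstCase(Duration.ofMillis(240_000), 3, Duration.ofSeconds(2));
        System.out.println(budget.getSeconds() + " s"); // prints 974 s
    }
}
```

Under these assumptions a single run could occupy the scheduler thread for a little over 16 minutes, which is worth keeping in mind if other scheduled jobs share that thread.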