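Problem description
Symptom: When the scheduled task calls the DashScope AI API, the request fails immediately after it is sent with a Connection reset by peer exception, so the weekly report is not generated.
Error log: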
[reactor-http-epoll-1] WARN r.n.http.client.HttpClientConnect - [bb53674b-2, L:/172.19.0.6:51476 - R:dashscope.aliyuncs.com/39.96.213.166:443] The connection observed an error, the request cannot be retried as the headers/body were sent
io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
[scheduling-1] ERROR c.j.service.component.AIComponent - AI 调用失败: recvAddress(..) failed: Connection reset by peer
org.springframework.web.reactive.function.client.WebClientRequestException: recvAddress(..) failed: Connection reset by peer
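Root cause analysis:
Reactor Netty reuses TCP connections through a connection pool by default. The scheduled task runs only once a week, so by the time it fires, the idle connection left in the pool has already been closed unilaterally by the server side (DashScope) once its idle (keep-alive) timeout expired, and the client never noticed. When the task triggers, the client tries to reuse this "zombie" connection to send the request, the server answers with an RST packet, and the result is Connection reset by peer. Because the request headers/body had already been sent, Reactor Netty does not retry the request on its own (see the WARN line in the log), so the error propagates to the caller.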
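Solution
1. Connection pool tuning (root-cause fix)
Configure a ConnectionProvider that limits how long idle connections may live, so that stale connections are not reused: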
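import io.netty.channel.ChannelOption;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;
import java.time.Duration;

ConnectionProvider connectionProvider = ConnectionProvider.builder("ai-pool")
        .maxConnections(16)
        .maxIdleTime(Duration.ofSeconds(30))        // close connections that have been idle for more than 30 seconds
        .maxLifeTime(Duration.ofMinutes(5))         // a connection lives for at most 5 minutes
        .evictInBackground(Duration.ofSeconds(30))  // sweep expired connections in the background every 30 seconds
        .build();

// connectTimeout and readTimeout are millisecond values defined elsewhere
HttpClient httpClient = HttpClient.create(connectionProvider)
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectTimeout)
        .responseTimeout(Duration.ofMillis(readTimeout));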
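To make the effect of maxIdleTime and maxLifeTime concrete, here is a simplified, standalone model of the eviction rule. It is a sketch for illustration only, not Reactor Netty's actual implementation: the class name IdleEvictionSketch and the method shouldEvict are hypothetical, and the exact boundary behavior of the real pool is not modeled.

```java
import java.time.Duration;

/** Simplified model of pool eviction; NOT Reactor Netty's real implementation. */
class IdleEvictionSketch {

    /** A pooled connection is dropped once it exceeds either the idle limit or the lifetime limit. */
    public static boolean shouldEvict(Duration age, Duration idleFor,
                                      Duration maxIdleTime, Duration maxLifeTime) {
        return idleFor.compareTo(maxIdleTime) >= 0
            || age.compareTo(maxLifeTime) >= 0;
    }

    public static void main(String[] args) {
        Duration maxIdle = Duration.ofSeconds(30);
        Duration maxLife = Duration.ofMinutes(5);

        // Connection left over from last week's run: evicted, so the job opens a fresh one.
        System.out.println(shouldEvict(Duration.ofDays(7), Duration.ofDays(7), maxIdle, maxLife));

        // Connection opened 1 minute ago and used 10 seconds ago: kept for reuse.
        System.out.println(shouldEvict(Duration.ofMinutes(1), Duration.ofSeconds(10), maxIdle, maxLife));
    }
}
```

With a weekly job, every run falls into the first case, so each run effectively starts on a new connection instead of one the server may already have closed.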
2. Retry mechanism (fault tolerance)
Automatically retry transient network failures (WebClientRequestException, IOException) with an exponential backoff strategy:
// Requires reactor.util.retry.Retry and java.time.Duration
.retryWhen(Retry.backoff(3, Duration.ofSeconds(2))
        .filter(this::isRetryableError)
        .doBeforeRetry(signal -> log.warn("AI call hit a network error, retry #{}: {}",
                signal.totalRetries() + 1, signal.failure().getMessage())))
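As a rough guide to what Retry.backoff(3, Duration.ofSeconds(2)) means for timing: the nominal delay doubles with each retry, starting from the minimum backoff. Reactor also applies random jitter by default, so the real delays vary around these values. The sketch below computes only the nominal schedule and does not call Reactor; the class BackoffScheduleSketch and the method nominalDelays are hypothetical names.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Computes the nominal (jitter-free) exponential backoff schedule. */
class BackoffScheduleSketch {

    /** Delay before retry i is minBackoff * 2^i, ignoring jitter and any maximum cap. */
    public static List<Duration> nominalDelays(int maxAttempts, Duration minBackoff) {
        List<Duration> delays = new ArrayList<>();
        for (int i = 0; i < maxAttempts; i++) {
            delays.add(minBackoff.multipliedBy(1L << i));
        }
        return delays;
    }

    public static void main(String[] args) {
        // Same parameters as the retry spec above: 3 attempts, 2-second minimum backoff.
        System.out.println(nominalDelays(3, Duration.ofSeconds(2))); // [PT2S, PT4S, PT8S]
    }
}
```

The three retries therefore add roughly 14 seconds of nominal waiting in total, which is small next to a request that takes minutes.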
Retry predicate:
private boolean isRetryableError(Throwable throwable) {
    if (throwable instanceof WebClientRequestException) return true;
    if (throwable instanceof IOException) return true;
    // Walk the cause chain looking for an underlying I/O failure
    Throwable cause = throwable.getCause();
    while (cause != null) {
        if (cause instanceof IOException) return true;
        cause = cause.getCause();
    }
    return false;
}
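The cause-chain walk can be exercised without Spring. The sketch below keeps only the IOException part of the predicate (WebClientRequestException is left out so that it compiles against the JDK alone); the class RetryableCheckSketch, the method hasIoCause, and the sample exceptions are hypothetical.

```java
import java.io.IOException;
import java.net.SocketException;

/** JDK-only version of the cause-chain check used by the retry predicate. */
class RetryableCheckSketch {

    /** Returns true if the throwable itself, or anything in its cause chain, is an IOException. */
    public static boolean hasIoCause(Throwable throwable) {
        for (Throwable t = throwable; t != null; t = t.getCause()) {
            if (t instanceof IOException) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // A connection reset wrapped in an unchecked exception: retryable.
        Throwable wrapped = new RuntimeException(new SocketException("Connection reset by peer"));
        System.out.println(hasIoCause(wrapped)); // true

        // An argument error with no I/O cause: not retryable.
        System.out.println(hasIoCause(new IllegalArgumentException("bad prompt"))); // false
    }
}
```

Restricting retries to I/O failures keeps non-transient errors, such as invalid requests, from being retried pointlessly.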
3. Timeout adjustment
An AI request takes 2-3 minutes to respond (web search plus generation over a large amount of data), so read-timeout is raised from 120s to 240s:
# application-online.yml
dashscope:
  read-timeout: 240000 # 4 minutes, leaving ample headroom for a 2-3 minute response