# Nginx 502 Errors: Scenario Analysis

## Scenario 1: Connection Anomaly

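Some microservice endpoints started returning 502 errors. The failing endpoints had one thing in common: they were slow to respond, and the nginx log showed that every one of them failed after exactly 20s.

The system architecture is:

xxl-job (or other clients) → nginx (keepalive, 60s timeout) → k8s (haproxy → ingress) → microservice gateway → microservice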

Packet capture analysis:

nginx sent the request to haproxy, and 20s later haproxy itself initiated the connection close. The haproxy configuration contained:

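`timeout client 20s`: the maximum time the connection between haproxy and the client may stay inactive, that is, neither sending data nor ACKing received data;

`timeout server 20s`: the maximum time the connection between haproxy and the backend may stay inactive, that is, neither sending data nor ACKing received data.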

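This pointed to the haproxy configuration as the cause. After removing haproxy and connecting directly to the ingress behind it, requests succeeded again. Note: this only fixes the 502 reported by nginx. The underlying problem is the slow-query endpoints that take more than 20s to respond, and they need to be handled separately.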
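The mismatch in this scenario (nginx waits 60s, haproxy cuts the connection at 20s) can be checked mechanically. The following is a minimal sketch, assuming the timeouts in `haproxy.cfg` are written as whole seconds with an `s` suffix. The sample file content and the `short_timeouts` helper are hypothetical and exist only for illustration; in practice `CFG` would point at the real configuration file.

```sh
#!/bin/sh
# Hypothetical sample config; replace with the path to the real haproxy.cfg.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
defaults
    timeout connect 5s
    timeout client  20s
    timeout server  20s
EOF

NGINX_TIMEOUT=60   # seconds, the nginx-side timeout from the architecture line

# short_timeouts LIMIT FILE
# Print each haproxy client/server timeout that is shorter than LIMIT seconds.
# Assumption: values look like "20s" (other units such as "ms" are not handled).
short_timeouts() {
    awk -v limit="$1" '
        $1 == "timeout" && ($2 == "client" || $2 == "server") {
            secs = $3
            sub(/s$/, "", secs)
            if (secs + 0 < limit) print $2, $3
        }' "$2"
}

short_timeouts "$NGINX_TIMEOUT" "$CFG"
# prints:
# client 20s
# server 20s
```

Any line printed here marks a hop that will close the connection before nginx gives up, which is the situation that surfaced as 502.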

## Scenario 2: Upstream Service Failure

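When nginx forwarded requests to another service, a large number of 502 errors appeared, and the error log contained `connect() failed (111: Connection refused) while connecting to upstream` as well as `no live upstreams`.

Cause: the other service was being upgraded and restarted. Connections were refused, every upstream server was unavailable, and nginx returned 502 immediately.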
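To confirm this pattern quickly during an incident, the two messages can be tallied from the error log. This is only a sketch: the sample lines are abbreviated stand-ins built from the messages quoted above (not real log output), and `count_errors` is a hypothetical helper. In practice it would be pointed at the actual nginx error log.

```sh
#!/bin/sh
# Abbreviated, made-up sample lines containing the two messages from this scenario.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[error] connect() failed (111: Connection refused) while connecting to upstream
[error] connect() failed (111: Connection refused) while connecting to upstream
[error] no live upstreams while connecting to upstream
EOF

# count_errors FILE
# Count lines with a refused upstream connection and lines with no live upstreams.
count_errors() {
    printf 'refused=%s no_live=%s\n' \
        "$(grep -c 'Connection refused' "$1")" \
        "$(grep -c 'no live upstreams' "$1")"
}

count_errors "$LOG"
# prints: refused=2 no_live=1
```

A run of refused connections followed by `no live upstreams` is consistent with every upstream server having been restarted at about the same time.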

## Scenario 3: Unable to Create New Connections

When nginx called service A, a large number of 502 errors appeared. The nginx error log showed:

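`[crit] 30930#0: *10894479 connect() to x.x.x.x:80 failed (99: Cannot assign requested address) while connecting to upstream`

![3-1.jpg](https://file.jishuzhan.net/article/1746704041250918402/4a9292aeb254795cb72ab1092cf6753b.webp)

The architecture is as follows:

![3-2.jpg](https://file.jishuzhan.net/article/1746704041250918402/452c5667854b8eaf91b68894d57578ef.webp)

Checking the connection counts revealed a large number of sockets in TIME_WAIT. The configuration showed that nginxA and nginxB were not using a connection pool: every incoming request opened a new connection, which left a large number of connections in TIME_WAIT.

The nginxA configuration was:

```nginx
upstream nginx-b-up {
    server 10.11.13.13:80;
}

location /nginx-b/ {
    proxy_set_header Host 10.11.1.1;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_pass http://nginx-b-up/;
}
```

Hypothesis: is the flood of short-lived connections what causes the `99: Cannot assign requested address` error?

Load testing in the test environment showed that, without a connection pool, the number of connections between nginxA and nginxB peaked at around 28,000 (counting all states, mostly TIME_WAIT and ESTABLISHED). Beyond that point nginx could no longer open new connections for incoming requests, which matches the error log.

A socket connection is identified by the source IP (fixed: the local machine's IP), the source port (randomly assigned), the destination IP (fixed: the peer nginx's address), and the destination port (fixed: port 80 on the peer). Only the source port varies, and source ports are a finite resource, so they cap the number of connections that can exist at once.

```bash
[root@wh ~]# cat /proc/sys/net/ipv4/ip_local_port_range
32768 60999
```

The system default ephemeral port range is 32768-60999, that is, at most 28,232 ports, which matches what was observed.

With the cause confirmed, there are two possible fixes:

1. Widen the ephemeral port range, for example to 10000-65535 (port numbers cannot exceed 65535).
2. Use a connection pool to avoid constantly creating and destroying connections. Changing the configuration as follows is enough:

```nginx
upstream nginx-b-up {
    server 10.11.13.13:80;
    keepalive 128;
    keepalive_requests 10000;
    keepalive_timeout 60s;
}
```

## Scenario 4: A Slow Endpoint Makes nginx Treat the Upstream as Unavailable

A burst of 502 errors appeared within a short period (on endpoints of both service A and service B), and they were returned extremely fast: the nginx log showed the 502 coming back within 1ms. Unlike Scenario 1, the upstream services (the nginx in front of ingress, ingress itself, and the gateway) were all healthy.

The architecture is shown below. Keepalive connections are enabled between the external/internal nginx and the nginx in front of ingress.

![4.png](https://file.jishuzhan.net/article/1746704041250918402/ef821de2fb58d3ab6a4bf1f974475229.webp)

The external/internal nginx configuration:

```nginx
upstream k8s-ingress {
    server 172.16.100.1:20080;
    server 172.16.100.2:20080;
    keepalive 256;
    keepalive_requests 10000;
    keepalive_timeout 60s;
}
```

The configuration of the nginx in front of ingress:

```nginx
http {
    keepalive_timeout 65s;
    proxy_read_timeout 60s;
}
```

The nginx log showed that many requests to service A had returned 504. Investigation found that one endpoint of service A was misbehaving and did not respond within 60s (`proxy_read_timeout` defaults to 60s; if no response data is read for 60 consecutive seconds, the request is considered timed out). After that, nginx logged a large number of errors such as:

```bash
2023/08/16 11:01:21 [error] 17395#17395: *14942410 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 39.144.154.36, server: x.x.x.x, request: "POST /api/zb-a/f-user/common/agreement/sign HTTP/1.1", upstream: "http://172.16.100.1:20080/api/zb-a/f-user/common/agreement/sign?", referrer: "http://127.0.0.1:53862/"
2023/08/16 11:01:21 [error] 17394#17394: *14941951 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 49.93.239.101, server: x.x.x.x, request: "POST /api/zb-a/f-user/user/home HTTP/1.1", upstream: "http://172.16.100.2:20080/api/zb-a/f-user/user/home?", referrer: "http://127.0.0.1:14218/"
2023/08/16 11:01:22 [error] 17394#17394: *14942912 no live upstreams while connecting to upstream, client: 114.223.30.43, server: x.x.x.x, request: "GET /api/zb-a/f/debt/list?type=1 HTTP/1.1", upstream: "http://k8s-ingress/api/zb-a/f/debt/list?type=1", host: "x.x.x.x", referrer: "http://127.0.0.1:19998/"
```

If the nginx log level is set to warn, the following message also appears:

```bash
2023/08/16 11:01:21 [warn] 23517#23517: *455733549 upstream server temporarily disabled while reading response header from upstream, client: 10.20.1.1, server: , request: "POST /api/zb-a/f-user/user/home? HTTP/1.1", upstream: "http://172.100.1:30080/api/zb-a/f-user/user/home", host: "x.x.x.x", referrer: "http://127.0.0.1:64522/"
```

Reading the logs: `upstream timed out` means nginx timed out while waiting for response data from the upstream; `no live upstreams` means that when nginx was about to send a request upstream, it found no live backend; `upstream server temporarily disabled` means the upstream server is temporarily unavailable, that is, nginx has marked that upstream node as temporarily failed.

💡 **How nginx decides that a node has failed**

By default, nginx judges a node as failed based on refused connections and timeouts, not on HTTP error statuses. As long as the node can return an HTTP status at all, it is still reachable, so nginx considers it alive.

The `proxy_next_upstream` directive can be used to make 502, 503, 504, 500 and similar errors count toward node status as well. Once they are added, `fails` is incremented during the next-upstream process, and if the fallback server also fails, the error is returned directly. When `fails` is greater than or equal to `max_fails`, the node is marked as failed.

`max_fails`: the maximum number of failed attempts, default 1. A single failure marks the node as failed, and subsequent requests are forwarded to other nodes.

`fail_timeout`: how long the node stays marked as failed, default 10s. After this time has elapsed, or after all nodes have failed, the node is marked valid again and probed anew.

With the default configuration, a single timed-out request is enough to put an upstream node into the failed state. When many requests time out within a short period, both nodes, 172.16.100.1 and 172.16.100.2, can easily end up failed at the same time, after which nginx reports `no live upstreams`.

TODO: when nginx determines there are no live upstreams (the corresponding node state is NGX_BUSY), it returns 502 and at the same time resets the upstream node state to normal. In theory, the endpoints of service B should not have returned 502.

Resolution: adjust the gateway so that when a response takes longer than 55s, it proactively returns 504 to nginx, which avoids the timeout on the nginx side.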
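The way two timeouts can take out the whole upstream group is easier to see in a toy model. The sketch below is not nginx code: `simulate` is a hypothetical helper that mimics only the rule described in this scenario (a node is disabled for `fail_timeout` seconds once it accumulates `max_fails` failures). It assumes a made-up event format of one event per line and deliberately ignores other details, such as the reset of node state mentioned in the TODO above.

```sh
#!/bin/sh
# simulate MAX_FAILS FAIL_TIMEOUT
# Reads events on stdin, one per line (hypothetical format):
#   "<second> timeout <peer>"  a request to <peer> timed out at <second>
#   "<second> request"         a new request arrives at <second>
# Prints which peer each new request would go to, or a 502 if none is live.
simulate() {
    awk -v max_fails="${1:-1}" -v fail_timeout="${2:-10}" '
        BEGIN { n = split("172.16.100.1 172.16.100.2", peers, " ") }
        $2 == "timeout" {
            # Count the failure; disable the peer once the threshold is reached.
            if (++fails[$3] >= max_fails) {
                down_until[$3] = $1 + fail_timeout
                fails[$3] = 0
            }
            next
        }
        $2 == "request" {
            # Use the first peer whose disabled period is over.
            for (i = 1; i <= n; i++)
                if ($1 >= down_until[peers[i]] + 0) { print $1, "->", peers[i]; next }
            print $1, "-> 502 no live upstreams"
        }'
}

# Both peers time out at second 0; requests arrive at seconds 1 and 11.
simulate 1 10 <<'EOF'
0 timeout 172.16.100.1
0 timeout 172.16.100.2
1 request
11 request
EOF
# prints:
# 1 -> 502 no live upstreams
# 11 -> 172.16.100.1
```

In this model, raising the first argument (for example `simulate 3 10`) keeps both peers live after a single timeout each. Whether tuning `max_fails` is appropriate for a real deployment is a separate decision from the gateway-side fix described above.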