Verifying nginx timeout-related parameters

I. Environment overview

IP address	Role
192.168.1.16	nginx
192.168.1.18	backend

Below, 1.16 and 1.18 are shorthand for these two hosts.

1. nginx configuration

upstream backend {
        server 192.168.1.18;
}

server {
        listen 80;

        location / {
                proxy_pass http://backend;
        }

}

2. Configuration on 1.18

nginx is also installed on 1.18. Its home page contains:

<html>
<head>
	<meta charset="utf-8">
</head>
<body>
	<center><h1>backend server</h1></center>
</body>
</html>

II. 502 errors

1. 502 when the backend is down

1.1 Stop nginx on 1.18

/usr/local/nginx/sbin/nginx -s stop

1.2 Access from the client again

The page immediately shows a 502. This 502 is returned to the user by nginx on 1.16.

502 Bad Gateway
----------------------
nginx/1.18.0

The error log on 1.16 shows:

2024/04/23 02:32:55 [error] 9052#0: *55 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.1, server: , request: "GET / HTTP/1.1", upstream: "http://192.168.1.18:80/", host: "192.168.1.16"
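The "111: Connection refused" in this log line is the operating system's error for a TCP connect to a port where nothing is listening (on Linux, ECONNREFUSED is errno 111). As a minimal sketch, assuming a Linux-like host, the same error can be reproduced locally without nginx. The helper names `connect_errno` and `free_local_port` are made up for this illustration.

```python
import errno
import socket


def connect_errno(host, port, timeout=2.0):
    """Try a TCP connect and return 0 on success or the errno on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns the error code instead of raising an exception
        return sock.connect_ex((host, port))
    finally:
        sock.close()


def free_local_port():
    """Bind to port 0 so the OS picks a free port, then release it again."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()
    return port


if __name__ == "__main__":
    # Nothing listens on the freshly released port, so the connect is refused
    code = connect_errno("127.0.0.1", free_local_port())
    print(code, errno.errorcode.get(code))
```

This is the situation nginx is in after the backend nginx has been stopped: the host is reachable, but it rejects the connection right away, so the 502 appears immediately.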

1.3 Create a page to return for 502

mkdir -p /data/wwwroot/error

# Contents of the error page

vim /data/wwwroot/error/error.html
<head>
	<meta charset="utf-8">
</head>
<body>
	<h1>The server is temporarily unavailable</h1>
</body>

1.4 Configure nginx on 1.16

upstream backend {
	server 192.168.1.18;
}

server {
	listen 80;
	
	location / {
		proxy_pass http://backend;
	}
	
	# Add the following; error_page and the matching location must be configured together
	error_page   500 502 503 504  /error.html;	
	location = /error.html {
		root /data/wwwroot/error;
	}
}

2. 502 caused by a firewall

2.1 Start the firewall on 1.18

systemctl start firewalld

2.2 Error shown in the browser

502 Bad Gateway
nginx/1.18.0

2.3 Error log

2024/04/23 02:29:37 [error] 9052#0: *42 connect() failed (113: No route to host) while connecting to upstream, client: 192.168.1.1, server: , request: "GET / HTTP/1.1", upstream: "http://192.168.1.18:80/", host: "192.168.1.16"

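3. 502 when connecting to the backend fails

3.1 The target address does not exist

Pointing the upstream at a server that does not exist also produces a 502.
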
upstream backend {
        server 192.168.1.118;
}

Error shown in the browser:

502 Bad Gateway
nginx/1.18.0

Error log. The error is of the same type (113: No route to host) as in the firewall case.

2024/04/23 21:30:01 [error] 8385#0: *5 connect() failed (113: No route to host) while connecting to upstream, client: 192.168.1.1, server: , request: "GET / HTTP/1.1", upstream: "http://192.168.1.118:80/", host: "192.168.1.16"
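Since the three 502 cases so far differ only in the errno that nginx logs (111 versus 113), it can help to pull that field out of the error log mechanically. The following is a rough sketch that assumes log lines shaped like the ones shown above. The function name `parse_connect_error` is hypothetical, and the pattern has only been checked against this line format.

```python
import re

# Matches lines such as:
#   ... [error] 8385#0: *5 connect() failed (113: No route to host) while
#   connecting to upstream, ..., upstream: "http://192.168.1.118:80/", ...
UPSTREAM_ERROR = re.compile(
    r'\[error\] .*? connect\(\) failed \((?P<errno>\d+): (?P<reason>[^)]+)\)'
    r'.*?upstream: "(?P<upstream>[^"]+)"'
)


def parse_connect_error(line):
    """Return errno, reason and upstream URL from a 'connect() failed' line, or None."""
    match = UPSTREAM_ERROR.search(line)
    if match is None:
        return None
    return {
        "errno": int(match.group("errno")),
        "reason": match.group("reason"),
        "upstream": match.group("upstream"),
    }
```

Grouping the parsed entries by `errno` and `upstream` would show at a glance whether a burst of 502s comes from a stopped service (111) or from an unreachable or firewalled host (113).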

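4. 502 when the backend closes the connection

If the backend takes too long to respond and closes the connection itself before sending anything, nginx also returns a 502. For this test, stop nginx on 1.18 and use a Python program as the backend instead, which makes the timeout easy to reproduce.

4.1 Write the backend program

[root@node4 test]# vim main.py

from flask import Flask
import json
import time


app = Flask(__name__)

@app.route('/')
def main():
    data = {
        "message": "access success!!",
        "code": "200"
    }

    # Simulate a backend task that takes far too long to process
    time.sleep(2000)
    return json.dumps(data), data["code"]

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0")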

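4.2 Start it with gunicorn

The program is started with gunicorn instead of proxying nginx straight to Flask's built-in server. With the built-in server, every time the Flask code was modified, the next browser request would hang in the pending state for a while.

I did not find out why, but the problem does not occur with gunicorn.
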
gunicorn -w 2  -b 0.0.0.0:16868 main:app

4.3 Modify the nginx configuration

upstream backend {
        # Point at the port the Python program on 1.18 listens on
        server 192.168.1.18:16868;
}

4.4 Error on timeout

502 Bad Gateway
nginx/1.18.0

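4.5 Notes

(1) gunicorn's default timeout (-t / --timeout) is 30 seconds. A worker that stays silent for longer than that is killed and restarted, so the connection is closed from the backend side.

The packet capture on 1.18 also shows that 1.18 sent the first FIN:
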
03:49:35.354454 IP 192.168.1.18.16868 > 192.168.1.16.44912: Flags [F.], seq 150174228, ack 3477334224, win 227, options [nop,nop,TS val 1310959 ecr 1360495], length 0
03:49:35.355370 IP 192.168.1.16.44912 > 192.168.1.18.16868: Flags [F.], seq 3477334224, ack 150174229, win 229, options [nop,nop,TS val 1391453 ecr 1310959], length 0
03:49:35.355421 IP 192.168.1.18.16868 > 192.168.1.16.44912: Flags [.], ack 3477334225, win 227, options [nop,nop,TS val 1310960 ecr 1391453], length 0

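(2) In production, backend programs usually set a timeout of their own as well. If processing exceeds it, the program raises an exception and closes the connection, and nginx returns a 502.

The front end can of course show a friendly message based on the status code.

4.6 Verifying with -t

To check the claim above, start gunicorn with the -t option, which sets the backend timeout.

The timeout here is 5 seconds: if no response has been returned within 5 seconds, gunicorn times out and closes the connection itself.
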
 gunicorn -w 2  -t 5 -b 0.0.0.0:16868 main:app
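The same options can also be kept in a gunicorn configuration file, which is itself a Python module. This is a small sketch, assuming the file is saved next to main.py as gunicorn.conf.py and loaded with `gunicorn -c gunicorn.conf.py main:app`. The file name and location are choices made for this example.

```python
# gunicorn.conf.py (file name and location are assumptions for this example)

# Same listen address as -b 0.0.0.0:16868
bind = "0.0.0.0:16868"

# Same worker count as -w 2
workers = 2

# Same as -t 5: a worker that stays silent for more than 5 seconds
# is killed and restarted, which closes the connection to nginx
timeout = 5
```

Keeping `timeout` in a file makes it easier to compare against nginx's proxy_read_timeout when deciding which side should give up first.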

The packet capture again shows 1.18 closing the connection first.

Capture command:

tcpdump -i any -nn -S host 192.168.1.16 and port 16868

Result:

04:04:30.132786 IP 192.168.1.18.16868 > 192.168.1.16.44920: Flags [F.], seq 471861193, ack 3371017224, win 235, options [nop,nop,TS val 2205737 ecr 2280783], length 0
04:04:30.133665 IP 192.168.1.16.44920 > 192.168.1.18.16868: Flags [.], ack 471861194, win 229, options [nop,nop,TS val 2286228 ecr 2205737], length 0
04:04:30.133914 IP 192.168.1.16.44920 > 192.168.1.18.16868: Flags [F.], seq 3371017224, ack 471861194, win 229, options [nop,nop,TS val 2286228 ecr 2205737], length 0
04:04:30.133938 IP 192.168.1.18.16868 > 192.168.1.16.44920: Flags [.], ack 3371017225, win 235, options [nop,nop,TS val 2205739 ecr 2286228], length 0

The browser also stops loading after 5 seconds and then shows 502 Bad Gateway.
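Reading the capture by eye works for a handful of packets. For longer captures, the question "who closed first?" can be answered with a few lines of code. The sketch below assumes tcpdump text output in the format shown above, in chronological order. `first_fin_sender` is a hypothetical helper name.

```python
import re

# Matches the part of a tcpdump line that looks like:
#   IP 192.168.1.18.16868 > 192.168.1.16.44920: Flags [F.], ...
PACKET = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]+)\]"
)


def first_fin_sender(capture_lines):
    """Return (ip, port) of the side that sent the first FIN, or None if no FIN is seen."""
    for line in capture_lines:
        match = PACKET.search(line)
        # tcpdump prints 'F' inside the flags field for a FIN segment
        if match and "F" in match.group("flags"):
            return match.group("src"), int(match.group("sport"))
    return None
```

Applied to the capture above, the first FIN comes from 192.168.1.18 port 16868, that is, from gunicorn and not from nginx.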

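III. 504 errors

1. 504 from the nginx proxy timeout

In the previous section the backend hit its own timeout and closed the connection without returning any data, so nginx returned a 502 to the browser.

Backends process data in many different scenarios, though, and developers cannot catch every possible timeout and raise an exception for it.

So a backend sometimes hangs unexpectedly without ever closing the connection. The following simulates that case.

1.1 Program timeout

The program keeps the same sleep as above:
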
time.sleep(2000)

1.2 gunicorn timeout

gunicorn -w 2  -t 2000 -b 0.0.0.0:16868 main:app

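These two settings simulate a backend that takes a very long time and will not close the connection on its own within 2000 seconds.

1.3 Access from the browser

After the request has been pending for 60 seconds, the browser receives a 504:
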
504 Gateway Time-out
nginx/1.18.0

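This 60 seconds is nginx's timeout for reading from the proxied backend. If the backend sends nothing to nginx for 60 seconds, nginx closes the connection itself, which the packet capture confirms.

1.4 Set the proxy read timeout

proxy_read_timeout defaults to 60s. Change it to 5 seconds:
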
server {
        listen 80;

        location / {
                proxy_pass http://backend;
                proxy_read_timeout 5;
        }
}

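Refresh the browser again: it stops loading after 5 seconds, and the packet capture shows nginx closing the connection.

1.5 Summary

proxy_read_timeout is how long nginx waits for the backend to send data. It applies between two successive read operations, not to the transmission of the whole response. If the backend sends nothing within this time, nginx closes the connection and returns a 504.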
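The difference between "the backend closed first" (502) and "the proxy stopped waiting" (504) can be reproduced with plain sockets. The sketch below is only an analogy for what nginx does, not its implementation: a local server accepts one connection and stays silent for `hold_seconds`, and a client gives up after `read_timeout`. All names are made up for this example.

```python
import socket
import threading
import time


def serve_silently(listener, hold_seconds):
    """Accept one connection, read the request, then send nothing and close."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)           # consume the request so close() sends a clean FIN
        time.sleep(hold_seconds)  # simulate a backend that is busy and silent


def read_with_timeout(port, read_timeout):
    """Return 'data', 'closed' or 'timeout' for one read from 127.0.0.1:port."""
    with socket.create_connection(("127.0.0.1", port), timeout=read_timeout) as sock:
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            return "timeout"      # the waiting side gave up: the 504 case
        # An empty read means the peer closed first: the 502 case
        return "data" if chunk else "closed"


def demo(hold_seconds, read_timeout):
    """Run one silent server and one client with the given timings."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(
        target=serve_silently, args=(listener, hold_seconds), daemon=True
    )
    worker.start()
    try:
        return read_with_timeout(port, read_timeout)
    finally:
        worker.join()
        listener.close()
```

With `demo(0.6, 0.2)` the client's timer expires first, which corresponds to proxy_read_timeout being shorter than the backend's own timeout. With `demo(0.05, 1.0)` the server closes first, which corresponds to the gunicorn -t case from section II.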

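2. 504 from a connection timeout

This kind of 504 arises while nginx is still establishing the connection to the backend server. The following simulates it.

2.1 Set a firewall rule

On 1.18, add a rule that drops every packet going back to 1.16, to simulate a network problem:
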
iptables -A OUTPUT -d 192.168.1.16 -j DROP

Restore the default nginx configuration:

server {
        listen 80;

        location / {
                proxy_pass http://backend;
        }
}

2.2 Verify

Access the site from the browser. The TCP three-way handshake never completes, and after 60 seconds the browser stops loading and again shows:

504 Gateway Time-out
nginx/1.18.0

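2.3 Set the connection timeout

proxy_connect_timeout sets the timeout for establishing a connection to the backend server. Its default is 60s, which matches the delay observed above.

Modify the nginx configuration to set the connection timeout to 5 seconds:
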
upstream backend {
	server 192.168.1.18:16868;
}

server {
	listen 80;
	
	location / {
		proxy_pass http://backend;
		proxy_connect_timeout 5;
	}
	
}

Access the site from the browser again: it loads for 5 seconds and then stops waiting.
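The outcomes observed in sections II and III can be collected into one lookup table. This is only a summary of the experiments in this article, not a statement of every case nginx handles, and the names `UpstreamEvent` and `observed_status` are invented for the illustration.

```python
from enum import Enum


class UpstreamEvent(Enum):
    CONNECT_REFUSED = "connect() failed (111: Connection refused)"
    NO_ROUTE_TO_HOST = "connect() failed (113: No route to host)"
    CLOSED_BEFORE_RESPONSE = "backend closed the connection before responding"
    CONNECT_TIMEOUT = "handshake not completed within proxy_connect_timeout"
    READ_TIMEOUT = "no data received within proxy_read_timeout"


# Status codes as observed in the experiments above
OBSERVED_STATUS = {
    UpstreamEvent.CONNECT_REFUSED: 502,
    UpstreamEvent.NO_ROUTE_TO_HOST: 502,
    UpstreamEvent.CLOSED_BEFORE_RESPONSE: 502,
    UpstreamEvent.CONNECT_TIMEOUT: 504,
    UpstreamEvent.READ_TIMEOUT: 504,
}


def observed_status(event):
    """Return the status code nginx sent to the browser for this kind of failure."""
    return OBSERVED_STATUS[event]
```

The pattern is that a 502 means the connection failed or was ended by the other side, while a 504 means nginx's own timer ran out first.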

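IV. Backend health checks

Backend health checking involves two parameters of the server directive:

max_fails

fail_timeout

Explanations of max_fails found online all agree: it is the number of failed attempts to communicate with a backend node.

Explanations of fail_timeout vary widely, and I was not sure what it means, so the conclusions below come from the following experiments.

1. Backend timeout

1.1 Backend program

[root@node4 test]# cat main.py
from flask import Flask
import time


app = Flask(__name__)

@app.route('/')
def main():
    # Sleep for 200 seconds to simulate a backend that takes 200 seconds to respond
    time.sleep(200)
    return "<h1>port 16868</h1>"

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0")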

1.2 Start the program

gunicorn -w 1 -t 2000 -b 0.0.0.0:16868 main:app

1.3 Configure nginx

upstream backend {
        server 192.168.1.18:16868 max_fails=2 fail_timeout=5;
}

server {
        listen 80;

        location / {
                proxy_pass http://backend;
        }

}

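1.4 Access from the browser

The browser request is not cut off after 5 seconds, so fail_timeout is not the timeout for a single request to the backend. The request still ends at the default 60s.

1.5 Add a timeout setting

upstream backend {
	# fail_timeout raised to 100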
	server 192.168.1.18:16868 max_fails=2 fail_timeout=100;
}

server {
	listen 80;
	
	location / {
		proxy_pass http://backend;
		# Read timeout lowered to 1 second
		proxy_read_timeout 1;
	}
	
}

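This upstream has a single node. Two requests were sent, and 1.18 still received requests afterwards: nginx kept forwarding requests to 1.18.

1.6 Stop the service on port 16868 of 1.18

(1) With the service on port 16868 stopped, nginx returns 502 immediately, but the packet capture shows 1.18 still receiving connection attempts from nginx.

(2) So with a single node, setting these two parameters makes little difference compared with leaving them out. This matches the nginx documentation, which states that max_fails and fail_timeout are ignored when the group contains only one server.

At this point the meaning of fail_timeout is still not clear.

2. Add a node
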
upstream backend {
	server 192.168.1.18:16868 max_fails=3 fail_timeout=100;
	server 192.168.1.18 max_fails=3 fail_timeout=10;
}

server {
	listen 80;
	
	location / {
		proxy_pass http://backend;
		proxy_read_timeout 1;
	}
	
}

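2.1 Access from the browser

2.1.1 Observed behavior

Requests are distributed round-robin. When a request goes to port 16868 on 1.18, the backend program hangs, and because nginx is configured with proxy_read_timeout 1, nginx gives up on that connection after 1 second.

So although requests are balanced onto port 16868, the browser always ends up showing the page from port 80: with the default proxy_next_upstream setting (error timeout), nginx passes a timed-out request on to the next server.

2.1.2 Result

Once port 16868 on 1.18 has been disconnected 3 times in total (the max_fails value), the packet capture shows:

(1) nginx does not forward any requests to port 16868 on 1.18 for the duration of fail_timeout.

(2) After fail_timeout has passed, the next browser refresh makes nginx send exactly one request (note: just 1) to port 16868 to check whether it is alive.

2.2 nginx configuration

upstream backend {
	# fail_timeout changed from 100 to 10 seconds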
	server 192.168.1.18:16868 max_fails=3 fail_timeout=10;
	server 192.168.1.18 max_fails=3 fail_timeout=10;
}

server {
	listen 80;
	
	location / {
		proxy_pass http://backend;
		# proxy_read_timeout removed
	}
}

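2.3 Remove the program delay

Delete the time.sleep(200) line from main.py and restart gunicorn.

2.4 Verify

Access the site from the browser again: requests now alternate normally between the two backends.

2.5 Stop the service on port 16868 of 1.18

The packet capture leads to the same conclusion as above.

2.5.1 Observed behavior

After the service on port 16868 is stopped, the browser keeps showing the page from port 80 and never shows an error page.

tcpdump shows, however, that requests are still being forwarded to port 16868:

listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
# First connection attempt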
21:59:31.395083 IP 192.168.1.16.50124 > 192.168.1.18.16868: Flags [S], seq 1025926583, win 29200, 
21:59:31.395123 IP 192.168.1.18.16868 > 192.168.1.16.50124: Flags [R.], seq 0, ack 1025926584, win 0, 

# Second connection attempt
21:59:34.016791 IP 192.168.1.16.50130 > 192.168.1.18.16868: Flags [S], seq 3104815423, win 29200, 
21:59:34.016824 IP 192.168.1.18.16868 > 192.168.1.16.50130: Flags [R.], seq 0, ack 3104815424, win 0, 

# Third connection attempt
21:59:36.915576 IP 192.168.1.16.50136 > 192.168.1.18.16868: Flags [S], seq 1161735139, win 29200, 
21:59:36.915605 IP 192.168.1.18.16868 > 192.168.1.16.50136: Flags [R.], seq 0, ack 1161735140, win 0,

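2.5.2 Result

Once connections to port 16868 have failed 3 times in total (the max_fails value):

(1) nginx does not forward any requests to port 16868 on 1.18 for the duration of fail_timeout.

(2) After fail_timeout has passed, the next browser refresh makes nginx send exactly one request (note: just 1) to port 16868 to check whether it is alive.

2.6 Restore port 16868

After port 16868 comes back, nginx does not probe it right away. It still waits until fail_timeout has passed before trying it again.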
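The behavior observed in 2.1.2, 2.5.2 and 2.6 can be written down as a small state machine. The class below is a simplified model of those observations for a single upstream server, not nginx's actual implementation. The class name, method names and the injectable clock are all assumptions made for this sketch.

```python
import time


class PassiveHealthCheck:
    """Simplified model of max_fails / fail_timeout for one upstream server."""

    def __init__(self, max_fails=1, fail_timeout=10.0, clock=time.monotonic):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.clock = clock
        self.fails = 0
        self.last_fail = None  # time of the most recent failure

    def available(self):
        """True if a request may be sent to this server right now."""
        if self.fails < self.max_fails:
            return True
        # Marked as failed: skip it until fail_timeout has passed, then allow a probe
        return self.clock() - self.last_fail >= self.fail_timeout

    def report(self, ok):
        """Record the outcome of one request sent to this server."""
        now = self.clock()
        if ok:
            self.fails = 0  # a successful request marks the server as healthy again
            return
        below_limit = self.fails < self.max_fails
        window_over = (
            self.last_fail is not None
            and now - self.last_fail >= self.fail_timeout
        )
        if below_limit and window_over:
            # Old failures outside the fail_timeout window no longer count
            self.fails = 0
        self.fails += 1
        self.last_fail = now
```

In this model a failed probe restarts the fail_timeout wait without resetting the counter, which is why only one request reaches the dead backend per fail_timeout period. It also shows the double role of fail_timeout: the window in which max_fails failures are counted, and the time the server is then skipped.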

3. Default configuration

If the server entries in upstream use the defaults:

upstream backend {
	server 192.168.1.18:16868;
	server 192.168.1.18;
}

then:

max_fails defaults to 1
fail_timeout defaults to 10 seconds