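1. Base environment
1.1 AWS EKS version information
AWS EKS (cloud) version: 1.29.6
Operating system: Amazon Linux
Kernel version: 5.10.220-209.869.amzn2.x86_64
Nginx image version: nginx stable-alpine3.19
1.2 Self-hosted data center Kubernetes version
Data center Kubernetes version: 1.16.10
Operating system: CentOS 7 Linux
Kernel version: 4.18.16-1.el7.elrepo.x86_64
Nginx image version: nginx stable-alpine3.19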
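2. Problem description
2.1 Symptoms on AWS
1. While migrating the workload to AWS EKS, we found that in the cloud environment the nginx image reports an error
when a proxy_pass reverse proxy targets an internal domain name, while proxying to other domains such as www.baidu.com
or www.google.com works.
Test findings:
1. proxy_pass to external public domains works in every case.
2. Inside the container, dig and nslookup resolve the internal domain correctly (no errors).
3. Resolution tests against the cluster's node-local-dns and CoreDNS both pass.
2.2 nginx -t error log
Exact error reported by the nginx -t syntax check:
# Configuration file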
location /test/{
proxy_pass http://dev-example.K8S.cloud/;
}
# Syntax check fails
nginx -t
Error log: 2024/10/02 08:25:03 67#67: host not found in upstream "dev-example.K8S.cloud" /etc/nginx/conf.d/default.conf:16
[emerg] 1#1: host not found in upstream "dev-example.K8S.cloud" in /etc/nginx/conf.d/default.conf:16
nginx: configuration file /etc/nginx/nginx.conf test failed
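3. Locating the problem
3.1 Process of elimination
1. Check whether the container image can resolve the configured internal domain (dig inside the pod resolves it correctly).
# Domain resolution
dig dev-example.K8S.cloud
Server: 169.254.20.10 # node-local-dns, the per-node DNS cache
Address: 169.254.20.10:53
Non-authoritative answer:
Name: dev-example.K8S.cloud
Address: 10.10.1.1
2. Check whether nginx also fails when proxying to an external domain (forwarding to an external domain works).
# Configuration file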
location /test/{
proxy_pass http://www.baidu.com/;
}
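# Syntax check passes
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
3. With the private domain configured again, the lookup still fails (nginx does not recognize the internal domain when resolving it).
# Configuration file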
location /test/{
proxy_pass http://dev-example.K8S.cloud/;
}
# Syntax check fails
nginx -t
Error log: 2024/10/02 08:25:03 67#67: host not found in upstream "dev-example.K8S.cloud" /etc/nginx/conf.d/default.conf:16
[emerg] 1#1: host not found in upstream "dev-example.K8S.cloud" in /etc/nginx/conf.d/default.conf:16
nginx: configuration file /etc/nginx/nginx.conf test failed
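dig and nslookup send their own DNS queries, so a clean result from them does not prove that the C library resolver path works. The sketch below assumes that nginx resolves proxy_pass hostnames at configuration load through the C library's getaddrinfo (an assumption, not something verified in this write-up). It runs the same kind of lookup twice, once with AF_UNSPEC (A and AAAA) and once with AF_INET (A only). The helper name resolve is hypothetical, and the result is only comparable when Python runs in an image that uses the same libc as the nginx image.

```python
import socket


def resolve(host, family=socket.AF_UNSPEC):
    """Resolve host through the C library's getaddrinfo.

    Returns a dict with the unique addresses found, or the resolver error.
    """
    try:
        infos = socket.getaddrinfo(host, 80, family=family, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return {"ok": False, "error": str(exc), "addresses": []}
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return {"ok": True, "error": None, "addresses": addresses}


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute the internal domain when testing.
    host = "localhost"
    for label, family in (
        ("AF_UNSPEC (A + AAAA)", socket.AF_UNSPEC),
        ("AF_INET (A only)", socket.AF_INET),
    ):
        print(label, resolve(host, family))
```

If the AF_UNSPEC lookup fails while the AF_INET lookup succeeds, the failure is tied to the AAAA query rather than to the A record.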
3.2 Detailed diagnosis (capturing the DNS queries)
1. Open a terminal in the pod
Enter the pod and install tcpdump:
apk add tcpdump
# Capture DNS traffic
tcpdump -i any -n -s 0 port 53
2. Open a second terminal in the pod and run nginx -t
nginx -t
Error log: 2024/10/02 08:25:03 67#67: host not found in upstream "dev-example.K8S.cloud" /etc/nginx/conf.d/default.conf:16
[emerg] 1#1: host not found in upstream "dev-example.K8S.cloud" in /etc/nginx/conf.d/default.conf:16
nginx: configuration file /etc/nginx/nginx.conf test failed
3. tcpdump capture for the failing internal domain (dev-example.K8S.cloud)
tcpdump -i any -n -s 0 port 53
02:16:00.696096 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28640+ A? dev-example.K8S.cloud. (35)
02:16:00.696124 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.696242 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28640* 1/0/0 A 10.147.51.85 (68)
02:16:00.830250 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail 0/0/0 (35)
02:16:00.830287 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830359 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:00.830386 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830432 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:00.830455 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830499 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:00.830521 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:00.830556 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197116 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197228 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197263 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197345 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197373 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197412 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197453 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197502 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
02:16:03.197525 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)
02:16:03.197562 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)
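A long capture like this is easier to read as a per-record-type summary. The following is a minimal sketch, not a general tcpdump parser: it only understands the line shapes shown above (queries such as "28640+ A? name." and replies such as "28785 ServFail* 0/0/0"), and the names QUERY_RE, REPLY_RE and summarize are hypothetical.

```python
import re
from collections import Counter

# Query line:  "... > 169.254.20.10.53: 28640+ A? dev-example.K8S.cloud. (35)"
QUERY_RE = re.compile(r"\.53: (\d+)\+? (AAAA|A)\? (\S+)")
# Reply line:  "... 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail* 0/0/0 (35)"
REPLY_RE = re.compile(r"\.53 > \S+: (\d+)\*? (?:([A-Za-z]+)\*? )?(\d+)/\d+/\d+")


def summarize(lines):
    """Count DNS queries per record type and reply outcomes per (type, status)."""
    queries = {}         # DNS transaction id -> record type
    sent = Counter()     # record type -> number of queries sent
    outcome = Counter()  # (record type, status) -> number of replies
    for line in lines:
        q = QUERY_RE.search(line)
        if q:
            qid, rtype, _name = q.groups()
            queries[qid] = rtype
            sent[rtype] += 1
            continue
        r = REPLY_RE.search(line)
        if r:
            qid, rcode, answers = r.group(1), r.group(2), int(r.group(3))
            rtype = queries.get(qid, "?")
            # An explicit rcode (e.g. ServFail) wins; otherwise look at the answer count.
            status = rcode or ("answered" if answers else "empty")
            outcome[(rtype, status)] += 1
    return sent, outcome


if __name__ == "__main__":
    # First four lines of the capture above.
    capture = [
        "02:16:00.696096 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28640+ A? dev-example.K8S.cloud. (35)",
        "02:16:00.696124 eth0 Out IP 100.66.42.156.33611 > 169.254.20.10.53: 28785+ AAAA? dev-example.K8S.cloud. (35)",
        "02:16:00.696242 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28640* 1/0/0 A 10.147.51.85 (68)",
        "02:16:00.830250 eth0 In IP 169.254.20.10.53 > 100.66.42.156.33611: 28785 ServFail 0/0/0 (35)",
    ]
    sent, outcome = summarize(capture)
    print(dict(sent))
    print(dict(outcome))
```

Fed the whole capture, the summary makes the asymmetry visible at a glance: one answered A query against a run of AAAA queries that only ever receive ServFail.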
4. Capture for a working external domain (www.baidu.com)
location /test/{
proxy_pass http://www.baidu.com/;
}
# Syntax check passes
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
09:14:53.702737 eth0 Out IP 100.66.42.156.45529 > 169.254.20.10.53: 50229+ A? www.baidu.com. (31)
09:14:53.702769 eth0 Out IP 100.66.42.156.45529 > 169.254.20.10.53: 50648+ AAAA? www.baidu.com. (31)
09:14:53.704511 eth0 In IP 169.254.20.10.53 > 100.66.42.156.45529: 50229 4/0/0 CNAME www.a.shifen.com., CNAME www.wshifen.com., A 45.113.192.102, A 45.113.192.101 (181)
09:14:53.771089 eth0 In IP 169.254.20.10.53 > 100.66.42.156.45529: 50648 2/1/0 CNAME www.a.shifen.com., CNAME www.wshifen.com. (207)
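5. Conclusion
From the captures above:
1. For the internal domain, the A? and AAAA? (IPv6) queries are sent together. The A query is answered (10.147.51.85), but the AAAA query never gets a valid DNS response: it returns ServFail and is retried repeatedly.
2. For the external domain, the queries are answered directly with baidu's CNAME chain.
Hypothesis: in the stable-alpine3.19 image, the repeatedly failing AAAA (IPv6) lookup for the internal domain causes the A (IPv4) lookup to be treated as failed too.
Fix: disable IPv6 support by setting container kernel parameters in the Kubernetes Deployment manifest.
3.3 Fixing the image's IPv6 issue in Kubernetes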
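initContainers:
- command:
  - sh
  - -c
  # Disable IPv6 via kernel parameters
  - sysctl -w net.ipv6.conf.all.disable_ipv6=1 && sysctl -w net.ipv6.conf.default.disable_ipv6=1
  image: busybox
  imagePullPolicy: Always
  name: init-update-containers
  # Assumption, not in the original manifest: writing these sysctls from a container normally requires privileged mode
  securityContext:
    privileged: true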
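To confirm that the init container's setting actually took effect in the pod's network namespace, the two flags can be read back from /proc/sys. This is a minimal sketch under the assumption that a Python interpreter is available in some container of the pod (the nginx Alpine image does not ship one by default); the function name ipv6_disable_flags and its proc_sys parameter are hypothetical.

```python
from pathlib import Path


def ipv6_disable_flags(proc_sys="/proc/sys"):
    """Read net.ipv6.conf.{all,default}.disable_ipv6 from the given /proc/sys root.

    Returns True/False per scope, or None if the file is missing or unreadable.
    """
    base = Path(proc_sys) / "net" / "ipv6" / "conf"
    flags = {}
    for scope in ("all", "default"):
        path = base / scope / "disable_ipv6"
        try:
            flags[scope] = path.read_text().strip() == "1"
        except OSError:
            flags[scope] = None
    return flags


if __name__ == "__main__":
    print(ipv6_disable_flags())
```

Both values being True means IPv6 is disabled for the namespace the script runs in.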
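3.4 Alternative tested: switching to the standard official nginx image
nginx:latest (in our own test this resolved the IPv6 issue; the internal domain resolves correctly)
3.5 Using a Kubernetes ExternalName Service and pointing nginx at the Service address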
apiVersion: v1
kind: Service
metadata:
  labels:
  name: dev-example-external
spec:
  externalName: dev-example.K8S.cloud
  type: ExternalName
4. Extension: tracing the system calls
strace -o nginx_trace.log -tt -T -e trace=all nginx -t 2>&1
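The trace file written by this command can be long. As a rough sketch, the script below counts the system calls in nginx_trace.log that reference the DNS port, assuming strace renders socket addresses in its usual form containing sin_port=htons(53). The names SYSCALL_RE and dns_syscalls are hypothetical, and the script simply prints nothing if the log file is absent.

```python
import os
import re
from collections import Counter

# Optional "-tt" timestamp, then the syscall name before the opening parenthesis.
SYSCALL_RE = re.compile(r"^(?:\d{2}:\d{2}:\d{2}\.\d+\s+)?(\w+)\(")


def dns_syscalls(lines, port=53):
    """Count syscalls whose strace line mentions the given port via htons(port)."""
    marker = f"htons({port})"
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    path = "nginx_trace.log"  # file name taken from the strace command above
    if os.path.exists(path):
        with open(path, errors="replace") as fh:
            print(dict(dns_syscalls(fh)))
```

A large number of send calls to port 53 relative to successful receives would be consistent with the retry pattern seen in the tcpdump capture.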