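1 Symptoms
On the host, DNS resolution works. The cluster-internal name returns NXDOMAIN, which is expected because the host queries the public resolver 223.5.5.5 rather than CoreDNS:
bash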
$ nslookup kubernetes.default.svc.cluster.local
Server: 223.5.5.5
Address: 223.5.5.5#53
** server can't find kubernetes.default.svc.cluster.local: NXDOMAIN
$ nslookup google.com
Server: 223.5.5.5
Address: 223.5.5.5#53
Non-authoritative answer:
Name: google.com
Address: 142.250.73.110
Inside a Kubernetes Pod, however, DNS resolution fails for both cluster-internal and external names:
bash
$ nslookup kubernetes.default.svc.cluster.local
;; connection timed out; no servers could be reached
$ nslookup google.com
;; connection timed out; no servers could be reached
The CoreDNS Pod log is full of timeouts against every upstream resolver:
bash
$ kubectl -n kube-system logs coredns-7cb5659999-nshr9 --tail=50
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.6
linux/arm64, go1.17.1, 13a9191
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:58074->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:53367->172.16.240.2:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:56148->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:34444->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:56775->172.16.240.2:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:59791->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:48823->172.16.240.2:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:41791->223.6.6.6:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:33037->223.6.6.6:53: i/o timeout
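To see at a glance whether one upstream is failing or all of them are, the timeouts can be counted per upstream address. The following is a minimal sketch, not part of the original troubleshooting; the function name `count_upstream_timeouts` is hypothetical, and it assumes the log lines have the `->IP:PORT: i/o timeout` shape shown above.

```shell
# Hypothetical helper: read CoreDNS log lines on stdin and print
# "<upstream-ip> <timeout-count>" for every upstream that timed out.
count_upstream_timeouts() {
    # Keep only lines ending in "->IP:PORT: i/o timeout" and extract the IP.
    sed -n 's#.*->\([0-9.]*\):[0-9]*: i/o timeout.*#\1#p' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Usage on a cluster (Pod name taken from the log above):
#   kubectl -n kube-system logs coredns-7cb5659999-nshr9 | count_upstream_timeouts
```

If every configured upstream shows up in the result, the problem is more likely the Pod's outbound path than any single DNS server.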
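2 Cause
Direct cause
After the machine rebooted, the iptables rules were lost and the default policy of the FORWARD chain ended up as DROP instead of ACCEPT, which blocked Pod network traffic, including CoreDNS queries to its upstream resolvers.
Root cause
iptables rules live in kernel memory and are cleared on reboot. When kube-proxy restarts, it only recreates its own Service-related chains and rules (mainly in the nat table); it does not manage the default policy of the FORWARD chain in the filter table. Note that the kernel's built-in default policy for FORWARD is ACCEPT, so a DROP policy after boot is most likely set by another component, for example Docker (which switches the FORWARD policy to DROP when it starts) or a firewall service. Since nothing sets the policy back to ACCEPT, all forwarded Pod traffic is dropped.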
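This diagnosis can be confirmed on the node by reading the FORWARD policy: the first line of `iptables -S FORWARD` has the form `-P FORWARD <policy>`. A minimal sketch follows; the helper name `forward_policy` is hypothetical and only parses the command output.

```shell
# Hypothetical helper: read `iptables -S FORWARD` output on stdin and
# print the chain's default policy (ACCEPT or DROP).
forward_policy() {
    awk '$1 == "-P" && $2 == "FORWARD" { print $3; exit }'
}

# Usage on the affected node:
#   sudo iptables -S FORWARD | forward_policy
```

A result of `DROP` on a node whose Pods cannot reach anything is consistent with the cause described above.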
3 Solution
Persist the network rules with a script that runs at boot:
bash
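sudo tee /usr/local/bin/k8s-network-init.sh > /dev/null <<'EOF'
#!/bin/bash
set -e
echo "Initializing Kubernetes network at $(date)" >> /var/log/k8s-network.log
# 1. Enable IP forwarding
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.all.forwarding=1
# 2. Set the default chain policies
iptables -P FORWARD ACCEPT
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT
# 3. Flush possibly conflicting rules (optional)
# iptables -F FORWARD
# 4. Make sure Pod network traffic is forwarded (10.244.0.0/16 is the Pod CIDR here).
#    -C checks for the rule first, so repeated runs do not pile up duplicates.
iptables -C FORWARD -s 10.244.0.0/16 -j ACCEPT 2>/dev/null || iptables -I FORWARD -s 10.244.0.0/16 -j ACCEPT 2>/dev/null || true
iptables -C FORWARD -d 10.244.0.0/16 -j ACCEPT 2>/dev/null || iptables -I FORWARD -d 10.244.0.0/16 -j ACCEPT 2>/dev/null || true
# 5. Make sure the NAT rule exists (lets Pods reach external networks)
iptables -t nat -C POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE 2>/dev/null || iptables -t nat -I POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE 2>/dev/null || true
# 6. Save the rules
if command -v netfilter-persistent &> /dev/null; then
    netfilter-persistent save >> /var/log/k8s-network.log 2>&1
fi
echo "Network initialization completed at $(date)" >> /var/log/k8s-network.log
EOF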
sudo chmod +x /usr/local/bin/k8s-network-init.sh
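Because this script runs on every boot and then saves the ruleset, a plain `iptables -I` can add the same rule again each time. One way to keep it idempotent is a check-then-insert helper built on `iptables -C`, which exits 0 only when a matching rule already exists. The sketch below is an illustration, not part of the original script; `ensure_rule` and the `IPTABLES` override variable are hypothetical names.

```shell
# Hypothetical helper: insert a rule only if an identical one is not present.
# IPTABLES can be overridden (for example with a wrapper) for a dry run.
IPTABLES="${IPTABLES:-iptables}"

ensure_rule() {
    # usage: ensure_rule TABLE CHAIN rule-spec...
    local table="$1" chain="$2"
    shift 2
    # -C exits 0 when a matching rule exists, non-zero otherwise.
    if ! "$IPTABLES" -t "$table" -C "$chain" "$@" 2>/dev/null; then
        "$IPTABLES" -t "$table" -I "$chain" "$@"
    fi
}

# Usage (as root), with the Pod CIDR from this cluster:
#   ensure_rule filter FORWARD -s 10.244.0.0/16 -j ACCEPT
#   ensure_rule filter FORWARD -d 10.244.0.0/16 -j ACCEPT
#   ensure_rule nat POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
```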
Create a systemd service so the script runs on every boot:
bash
sudo tee /etc/systemd/system/k8s-network-init.service > /dev/null <<'EOF'
[Unit]
Description=Kubernetes Network Initialization
Before=kubelet.service
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/k8s-network-init.sh
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable k8s-network-init.service
sudo systemctl start k8s-network-init.service
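After starting the service (and again after the next reboot), it is worth checking that the FORWARD chain is in the expected state. The following is a minimal sketch under the assumption that the rules appear in `iptables -S FORWARD` exactly as the script adds them; the helper name `forward_rules_ok` is hypothetical.

```shell
# Hypothetical helper: read `iptables -S FORWARD` output on stdin and succeed
# only if the policy is ACCEPT and both Pod CIDR rules are present.
forward_rules_ok() {
    cidr="$1"
    rules="$(cat)"
    for want in \
        "-P FORWARD ACCEPT" \
        "-A FORWARD -s $cidr -j ACCEPT" \
        "-A FORWARD -d $cidr -j ACCEPT"
    do
        # -x: whole-line match, -F: fixed string, --: the pattern starts with a dash
        printf '%s\n' "$rules" | grep -qxF -- "$want" || {
            echo "missing: $want" >&2
            return 1
        }
    done
}

# Usage on the node:
#   sudo iptables -S FORWARD | forward_rules_ok 10.244.0.0/16 && echo "FORWARD chain OK"
```

If the check passes, repeating the `nslookup` commands inside a Pod should confirm whether DNS resolution has recovered.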