Kubernetes networking breaks after a reboot wipes the iptables rules

1 Symptoms

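On the host, DNS resolution works. The cluster-internal name returns NXDOMAIN, which is expected because the host queries the public resolver 223.5.5.5 rather than CoreDNS: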

$ nslookup kubernetes.default.svc.cluster.local
Server:         223.5.5.5
Address:        223.5.5.5#53
** server can't find kubernetes.default.svc.cluster.local: NXDOMAIN

$ nslookup google.com
Server:         223.5.5.5
Address:        223.5.5.5#53
Non-authoritative answer:
Name:   google.com
Address: 142.250.73.110

Inside a k8s Pod, however, DNS resolution fails:

$ nslookup kubernetes.default.svc.cluster.local
;; connection timed out; no servers could be reached

$ nslookup google.com
;; connection timed out; no servers could be reached
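
The two outputs above fail in different ways: on the host a server answers (NXDOMAIN or a real record), while in the Pod no server is reachable at all, which points at the network path rather than at DNS data. A minimal sketch that makes this distinction explicit, assuming `nslookup` output is piped in on stdin; `classify_nslookup` is a hypothetical helper name, not part of any tool:

```shell
# Classify nslookup output read from stdin into one of:
#   unreachable - no DNS server answered (network path problem)
#   nxdomain    - a server answered, but the name does not exist
#   resolved    - a server answered with at least one record
#   unknown     - anything else
classify_nslookup() {
    nslookup_out="$(cat)"
    case "$nslookup_out" in
        *"no servers could be reached"*) echo unreachable ;;
        *NXDOMAIN*)                      echo nxdomain ;;
        *Name:*)                         echo resolved ;;
        *)                               echo unknown ;;
    esac
}
```

For example, `nslookup google.com 2>&1 | classify_nslookup` run inside the Pod would report `unreachable` for the output shown above.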

The CoreDNS Pod logs are full of timeouts:

$ kubectl -n kube-system logs coredns-7cb5659999-nshr9 --tail=50
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.8.6
linux/arm64, go1.17.1, 13a9191
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:58074->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:53367->172.16.240.2:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:56148->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:34444->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:56775->172.16.240.2:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:59791->223.5.5.5:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:48823->172.16.240.2:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:41791->223.6.6.6:53: i/o timeout
[ERROR] plugin/errors: 2 2445800166269057896.3880739144127939746. HINFO: read udp 10.244.0.72:33037->223.6.6.6:53: i/o timeout
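
The log shows timeouts against every upstream resolver (223.5.5.5, 223.6.6.6 and 172.16.240.2), which suggests the Pod cannot reach anything outside rather than one resolver being down. A minimal sketch for summarizing this, assuming the log lines have the format shown above; `count_upstream_timeouts` is a hypothetical helper name:

```shell
# Count CoreDNS "i/o timeout" errors per upstream resolver.
# Reads log text on stdin and prints one "<count> <ip:port>" line per upstream.
count_upstream_timeouts() {
    # Keep only the "->ip:port: i/o timeout" destination of each error line.
    sed -n 's|.*->\([0-9.]*:[0-9]*\): i/o timeout.*|\1|p' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}
```

It could be fed with `kubectl -n kube-system logs coredns-7cb5659999-nshr9 | count_upstream_timeouts`.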

2 Cause

Direct cause

After the machine rebooted, the iptables rules were lost and the default policy of the FORWARD chain changed from ACCEPT to DROP, which blocked Pod network traffic.

Root cause

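iptables rules live in kernel memory and are cleared on reboot. When kube-proxy restarts, it only recreates the Service-related rules (mainly in the nat table); it does not manage the default policy of the FORWARD chain in the filter table. Nothing set that policy back to ACCEPT after the reboot, so it stayed at DROP and all forwarded Pod traffic was discarded. Note that DROP is not the kernel's built-in default (built-in chains start with policy ACCEPT), so it was presumably applied at boot by another component, such as Docker or a firewall service.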
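
One way to confirm this diagnosis on a node is to read the FORWARD chain's default policy. A minimal sketch, assuming the output of `iptables -S FORWARD` (whose first line has the form `-P FORWARD <policy>`) is piped in on stdin; `forward_policy` is a hypothetical helper name:

```shell
# Print the default policy of the FORWARD chain.
# Expects the output of `iptables -S FORWARD` on stdin.
forward_policy() {
    awk '$1 == "-P" && $2 == "FORWARD" { print $3 }'
}

# Example (needs root on the node):
#   iptables -S FORWARD | forward_policy
```

If this prints `DROP` on an affected node, forwarded Pod traffic is only allowed where an explicit ACCEPT rule matches.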

3 Fix

Persist the network rules with an init script:

# Use `sudo tee` instead of `sudo cat >`: the redirection is performed by the
# calling shell, not by sudo, so `sudo cat > file` fails for a non-root user.
sudo tee /usr/local/bin/k8s-network-init.sh > /dev/null <<'EOF'
#!/bin/bash
set -e

echo "Initializing Kubernetes network at $(date)" >> /var/log/k8s-network.log

# 1. Enable IP forwarding
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.conf.all.forwarding=1

# 2. Set the default chain policies
iptables -P FORWARD ACCEPT
iptables -P INPUT ACCEPT
iptables -P OUTPUT ACCEPT

# 3. Flush potentially conflicting rules (optional)
# iptables -F FORWARD

# 4. Make sure Pod traffic is forwarded (10.244.0.0/16 is the Pod CIDR here)
iptables -I FORWARD -s 10.244.0.0/16 -j ACCEPT 2>/dev/null || true
iptables -I FORWARD -d 10.244.0.0/16 -j ACCEPT 2>/dev/null || true

# 5. Make sure the NAT rule exists (lets Pods reach external networks)
iptables -t nat -I POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE 2>/dev/null || true

# 6. Save the rules if netfilter-persistent is installed
if command -v netfilter-persistent &> /dev/null; then
    netfilter-persistent save >> /var/log/k8s-network.log 2>&1
fi

echo "Network initialization completed at $(date)" >> /var/log/k8s-network.log
EOF

sudo chmod +x /usr/local/bin/k8s-network-init.sh
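
The script inserts its rules with `iptables -I` unconditionally. If the saved rules are also restored at boot by netfilter-persistent, each run could add another copy of the same rule. A minimal sketch of an idempotent alternative that checks with `iptables -C` before inserting; `ensure_rule` and the `IPTABLES` override variable are hypothetical names introduced here for illustration:

```shell
# Command used to talk to iptables; overridable for dry runs (assumption).
IPTABLES="${IPTABLES:-iptables}"

# ensure_rule TABLE CHAIN RULE-SPEC...
# Inserts the rule at the top of CHAIN only if an identical rule
# is not already present (checked with `iptables -C`).
ensure_rule() {
    table="$1"
    chain="$2"
    shift 2
    if ! $IPTABLES -t "$table" -C "$chain" "$@" 2>/dev/null; then
        $IPTABLES -t "$table" -I "$chain" "$@"
    fi
}

# Example calls matching the rules in the script (need root):
#   ensure_rule filter FORWARD -s 10.244.0.0/16 -j ACCEPT
#   ensure_rule filter FORWARD -d 10.244.0.0/16 -j ACCEPT
#   ensure_rule nat POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
```

Unlike the `|| true` form in the script, this sketch lets a failing insert surface as an error, which may or may not be what you want under `set -e`.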

Create a systemd service so the script runs on every boot, before kubelet starts:

sudo tee /etc/systemd/system/k8s-network-init.service > /dev/null <<'EOF'
[Unit]
Description=Kubernetes Network Initialization
Before=kubelet.service
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/k8s-network-init.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable k8s-network-init.service
sudo systemctl start k8s-network-init.service