1. Overview
This is a Kubernetes control-plane failure caused by a careless operation; under normal circumstances there should never be an even number of control-plane nodes. The situation arose because I needed to upgrade the server running Node1, and since that server occasionally failed to boot, I brought up an additional control-plane node (Node2) to be safe. The etcd data had been backed up before the upgrade, so I thought crash recovery was fully prepared (or so I believed...).
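For reference, a pre-upgrade etcd backup like this is normally taken with `etcdctl snapshot save`. A minimal sketch, assuming the standard kubeadm certificate layout; the output path is only an example, not the one used in this incident:

```bash
# Take a snapshot of etcd through the local client endpoint.
# The target path /backup/... is illustrative.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key

# Sanity-check the snapshot afterwards.
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-$(date +%F).db -w table
```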
1.1 Cluster Information
| Name  | IP           | Role            |
|-------|--------------|-----------------|
| Node1 | 172.17.1.120 | Control plane 1 |
| Node2 | 172.17.1.121 | Control plane 2 |
| k8s-2 | 172.17.1.131 | Worker node 1   |
| k8s-3 | 172.17.1.132 | Worker node 2   |
1.2 Symptoms
I followed the normal procedure for removing a control-plane node; everything went smoothly, and nothing seemed out of the ordinary at the time.
```bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2
root@node2:~# kubeadm reset --force
```

Root cause: in a Kubernetes cluster with two control-plane nodes (an abnormal state in itself, but outside the scope of this post), removing the Node2 control-plane node left the remaining control plane unable to work. Concretely, etcd failed to start properly, which in turn caused the Kubernetes api-server to fail to start.
```bash
Jul 22 23:17:05 node1 kubelet[3301]: E0722 23:17:05.135471 3301 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://172.17.1.120:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node1?timeout=10s\": dial tcp 172.17.1.120:6443: connect: connection refused" interval="7s"
```

2. Failure Analysis
Checking container status on the surviving control-plane node showed that all containers were running normally, but the etcd container's logs showed that it could not reach the other control-plane node.
2.1 Checking Control-Plane Service Status
```shell
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep -e apiserver -e etcd -e controller-manager -e scheduler
6dd841c1bdcc3 9ea0bd82ed4f6 About an hour ago Running kube-scheduler 53 9477ef18cb630
cda7709fabb7f b0cdcf76ac8e9 About an hour ago Running kube-controller-manager 54 7a3368070af64
78f4ae23ef1e0 a9e7e6b294baf About an hour ago Running etcd 54 583d4b926dc80
526d7fbe05632 f44c6888a2d24 12 hours ago Running kube-apiserver 0 e21825618af02
```

All of the services were running, but the etcd log showed it could not connect to the other control-plane node, and the api-server could not connect to etcd.
```shell
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock logs -f 78f4ae23ef1e0
{"level":"warn","ts":"2025-07-23T05:31:03.896215Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"39011656e166436e","rtt":"0s","error":"dial tcp 172.17.1.121:2380: connect: connection refused"}
{"level":"info","ts":"2025-07-23T05:31:04.416899Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 is starting a new election at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.416978Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 became pre-candidate at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417053Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 received MsgPreVoteResp from e6c9d72c757dea1 at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417147Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 [logterm: 62, index: 234052257] sent MsgPreVote request to 39011656e166436e at term 62"}
```

The logs above make it clear that etcd was stuck in an endless election: with only one of two members alive, the Raft majority requirement cannot be satisfied, so no leader can be elected and client port 2379 never starts serving. As a result etcd cannot be operated at all, and neither restarting kubelet nor restarting the etcd container helps.
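To confirm the quorum loss directly, you can check which ports etcd is actually serving and run a health probe against the local member; a quick sketch using the certificate paths from this cluster:

```bash
# With no leader, client port 2379 is not served; expect to see only the
# peer port 2380 and the metrics port 2381 bound.
ss -lntp | grep -E '2379|2380|2381'

# A health probe against the local member is expected to fail while no leader exists.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint health
```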
3. Recovery
Since etcd cannot elect a leader and we only need a single etcd member anyway, the best approach is to force-start etcd as a new single-member cluster and let the failure recover on its own.
3.1 Finding the etcd Container's Startup Arguments
```bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep etcd | awk '{print $1}' | xargs crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock inspect
# locate the image and args sections in the output
"image": {
"image": "registry.k8s.io/etcd:3.5.16-0"
}
---省略部分输出---
"args": [
"etcd",
"--advertise-client-urls=https://172.17.1.120:2379",
"--cert-file=/etc/kubernetes/pki/etcd/server.crt",
"--client-cert-auth=true",
"--data-dir=/var/lib/etcd",
"--experimental-initial-corrupt-check=true",
"--experimental-watch-progress-notify-interval=5s",
"--initial-advertise-peer-urls=https://172.17.1.120:2380",
"--initial-cluster=node1=https://172.17.1.120:2380",
"--key-file=/etc/kubernetes/pki/etcd/server.key",
"--listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379",
"--listen-metrics-urls=http://127.0.0.1:2381",
"--listen-peer-urls=https://172.17.1.120:2380",
"--name=node1",
"--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
"--peer-client-cert-auth=true",
"--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
"--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
"--snapshot-count=10000",
"--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
],
```
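On a kubeadm cluster the same flags also live in the etcd static Pod manifest, so an alternative (assuming the default manifest path) is to read them straight from disk:

```bash
# The static Pod manifest that kubelet uses to run etcd.
cat /etc/kubernetes/manifests/etcd.yaml

# Or show just the command/args block of the etcd container.
grep -A 25 'command:' /etc/kubernetes/manifests/etcd.yaml
```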
3.2 Force-starting etcd
```bash
docker run --rm --network=host -p 2379:2379 -p 2380:2380 -v /etc/kubernetes/pki/:/etc/kubernetes/pki/ -v /var/lib/etcd:/var/lib/etcd registry.k8s.io/etcd:3.5.16-0 \
etcd \
--advertise-client-urls=https://172.17.1.120:2379 \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--client-cert-auth=true \
--data-dir=/var/lib/etcd \
--experimental-initial-corrupt-check=true \
--experimental-watch-progress-notify-interval=5s \
--initial-advertise-peer-urls=https://172.17.1.120:2380 \
--initial-cluster=node1=https://172.17.1.120:2380 \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379 \
--listen-metrics-urls=http://127.0.0.1:2381 \
--listen-peer-urls=https://172.17.1.120:2380 \
--name=node1 \
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
--peer-client-cert-auth=true \
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--snapshot-count=10000 \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--force-new-cluster
```

A few key points deserve special attention:

- Force new cluster: `--force-new-cluster`. Since only one member is kept and consistency with the removed member is no longer a concern, this flag does not delete the existing data; it only rewrites the cluster membership metadata.
- etcd certificates and data: `-v /etc/kubernetes/pki/:/etc/kubernetes/pki/` mounts the certificates and `-v /var/lib/etcd:/var/lib/etcd` mounts the data directory; these must match the paths used by the containerd container that kubelet normally launches.
The etcd log shows that etcd started normally and that node2 has been removed from the member list.
```bash
{"level":"info","ts":"2025-07-23T05:35:03.300893Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 switched to configuration voters=(1039378730311999137)"}
{"level":"info","ts":"2025-07-23T05:35:03.301000Z","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"4a0015d70b3f3c63","local-member-id":"e6c9d72c757dea1","removed-remote-peer-id":"39011656e166436e","removed-remote-peer-urls":["https://172.17.1.121:2380"]}
```

I recommend using docker for this kind of etcd repair; the experience is much smoother than driving it through crictl.
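For completeness, the same `--force-new-cluster` trick can in principle be applied by temporarily editing the etcd static Pod manifest instead of running a throwaway docker container. A rough sketch of that alternative (not the procedure used above):

```bash
# Stop kubelet so it does not restart the static Pod while we edit it.
systemctl stop kubelet

# Back up the manifest, then add "- --force-new-cluster" to the etcd command list
# (edit by hand; indentation must match the other "- --..." entries).
cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
vi /etc/kubernetes/manifests/etcd.yaml

# Start kubelet and wait for etcd to come up as a single-member cluster.
systemctl start kubelet

# Once etcd is healthy, restore the original manifest so the flag is not
# applied again on every future restart.
cp /root/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml
```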
3.3 Restarting kubelet
Start the service with `systemctl start kubelet`. After roughly five minutes of self-recovery the logs looked normal and the control-plane node was healthy again. Once the control plane was back, I brought the worker nodes up one by one and the cluster returned to normal.
```bash
root@node1:/data# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-2 Ready <none> 139d v1.29.15
k8s-3 Ready <none> 140d v1.29.15
node1 Ready control-plane 728d v1.29.15
```

`kubectl get nodes` confirms that the control plane has recovered.
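Beyond `kubectl get nodes`, it is worth confirming that the control-plane Pods and the now single-member etcd are healthy; a short sketch of the checks, reusing the certificate paths from earlier:

```bash
# All control-plane components in kube-system should be Running.
kubectl get pods -n kube-system -o wide

# etcd should report exactly one member and a healthy endpoint.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list -w table

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint health
```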
4. Post-mortem
4.1 Number of Control-Plane Nodes
An even number of members has always been a weak spot for the Raft protocol; in fact, no distributed consensus system recommends even member counts. For high availability, deploy an odd number of nodes such as 3, 5, or 7.
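To make the majority requirement concrete: Raft needs `floor(n/2) + 1` votes, so a 2-member cluster tolerates zero failures while a 3-member cluster tolerates one. A tiny illustration:

```bash
# Quorum and fault tolerance for small cluster sizes (integer division).
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n  quorum=$quorum  tolerated_failures=$tolerated"
done
```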
4.2 Could this have been avoided?
Yes. If the etcd member had been removed before running `kubeadm reset`, this failure would not have happened.
In other words, the complete procedure for the step described in section 1.2 should have been:
```bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2
# find node2's member ID
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--endpoints=https://127.0.0.1:2379 \
member list
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--endpoints=https://127.0.0.1:2379 \
member remove <member-id>
# reset kubelet on node2
root@node2:~# kubeadm reset --force
```

Follow-up steps such as cleaning up iptables and ipvs rules are omitted here...
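As a general illustration only (these are not the exact steps omitted above), the cleanup on the reset node typically looks something like this when kube-proxy ran in iptables/ipvs mode:

```bash
# Flush iptables rules left behind by kube-proxy and the CNI plugin.
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X

# Clear IPVS virtual servers if kube-proxy used ipvs mode.
ipvsadm --clear

# Remove residual CNI configuration.
rm -rf /etc/cni/net.d
```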