A Kubernetes Cluster Troubleshooting Case: etcd Fails to Elect a Leader, Causing the Kubernetes API Server to Fail to Start

Originally published at notes.icool.io/k8s-problem...

1. Overview

This is a Kubernetes control-plane failure caused by a careless operation; under normal circumstances a cluster should never run an even number of control-plane nodes. The situation arose because I needed to upgrade the server hosting Node1, and since there was a chance the server would fail to boot afterwards, I brought up an additional control-plane node (Node2) to be safe. Before the upgrade I had backed up the etcd data and was (so I thought...) fully prepared for disaster recovery.

1.1. Cluster Information

Name    IP            Role
Node1   172.17.1.120  Control-plane node 1
Node2   172.17.1.121  Control-plane node 2
k8s-2   172.17.1.131  Worker node 1
k8s-3   172.17.1.132  Worker node 2

1.2. Symptoms

I followed the usual steps for removing a control-plane node; everything went smoothly and nothing seemed out of the ordinary at the time.

bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2

root@node2:~# kubeadm reset --force

How the failure arose: in a two-control-plane Kubernetes cluster (an abnormal state in itself, but outside the scope of this post), deleting the Node2 control-plane node left the remaining control plane unable to work. Specifically, etcd failed to start, which in turn caused the Kubernetes api-server to fail to start.

bash
Jul 22 23:17:05 node1 kubelet[3301]: E0722 23:17:05.135471    3301 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://172.17.1.120:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node1?timeout=10s\": dial tcp 172.17.1.120:6443: connect: connection refused" interval="7s"

2. Failure Analysis

On the control-plane node I checked the running containers and found them all in a normal state, but the etcd container log showed it could not connect to the other control-plane node.

2.1. Checking the Control-Plane Service Status

shell
root@node1:~# crictl  --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps  | grep -e  apiserver -e etcd -e controller-manager -e scheduler
6dd841c1bdcc3       9ea0bd82ed4f6       About an hour ago   Running             kube-scheduler            53                  9477ef18cb630
cda7709fabb7f       b0cdcf76ac8e9       About an hour ago   Running             kube-controller-manager   54                  7a3368070af64
78f4ae23ef1e0       a9e7e6b294baf       About an hour ago   Running             etcd                      54                  583d4b926dc80
526d7fbe05632       f44c6888a2d24       12 hours ago        Running             kube-apiserver            0                   e21825618af02

The services were all running, but the etcd log reported that it could not reach the other control-plane node, and the api-server could not connect to etcd.

shell
root@node1:~# crictl  --runtime-endpoint=unix:///var/run/containerd/containerd.sock logs -f 78f4ae23ef1e0

{"level":"warn","ts":"2025-07-23T05:31:03.896215Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"39011656e166436e","rtt":"0s","error":"dial tcp 172.17.1.121:2380: connect: connection refused"}
{"level":"info","ts":"2025-07-23T05:31:04.416899Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 is starting a new election at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.416978Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 became pre-candidate at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417053Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 received MsgPreVoteResp from e6c9d72c757dea1 at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417147Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 [logterm: 62, index: 234052257] sent MsgPreVote request to 39011656e166436e at term 62"}

From the logs above it is clear that etcd was stuck in election: with only 1 of 2 members alive, Raft's majority requirement cannot be satisfied, so no leader can be elected and port 2379 is never opened for listening. That left us unable to operate etcd at all; neither restarting kubelet nor restarting the etcd container made any difference.
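
To confirm this from the shell, you can check whether the client port is listening and ask etcd directly. This is a minimal check of my own (not part of the original troubleshooting transcript), using the same kubeadm certificate paths that appear later in this post:

bash
# With no leader, nothing is listening on the client port
root@node1:~# ss -ltnp | grep 2379

# A direct health check fails as well while quorum is lost
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    endpoint health
# expected: connection refused / context deadline exceeded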

3. Fixing the Failure

Since etcd cannot elect a leader, and we only need a single etcd member anyway, the cleanest fix is to force-start etcd as a new single-member cluster and let the control plane recover on its own.

3.1. Inspecting the etcd Container's Startup Arguments

bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep etcd | awk '{print $1}' | xargs crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock inspect


# Locate the image and args sections
"image": {
      "image": "registry.k8s.io/etcd:3.5.16-0"
    }
---output truncated---
"args": [
          "etcd",
          "--advertise-client-urls=https://172.17.1.120:2379",
          "--cert-file=/etc/kubernetes/pki/etcd/server.crt",
          "--client-cert-auth=true",
          "--data-dir=/var/lib/etcd",
          "--experimental-initial-corrupt-check=true",
          "--experimental-watch-progress-notify-interval=5s",
          "--initial-advertise-peer-urls=https://172.17.1.120:2380",
          "--initial-cluster=node1=https://172.17.1.120:2380",
          "--key-file=/etc/kubernetes/pki/etcd/server.key",
          "--listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379",
          "--listen-metrics-urls=http://127.0.0.1:2381",
          "--listen-peer-urls=https://172.17.1.120:2380",
          "--name=node1",
          "--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
          "--peer-client-cert-auth=true",
          "--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
          "--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
          "--snapshot-count=10000",
          "--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
        ],

3.2. Force-Starting etcd

bash
docker  run --rm --network=host -p 2379:2379 -p 2380:2380 -v /etc/kubernetes/pki/:/etc/kubernetes/pki/ -v /var/lib/etcd:/var/lib/etcd  registry.k8s.io/etcd:3.5.16-0 \
          etcd \
          --advertise-client-urls=https://172.17.1.120:2379 \
          --cert-file=/etc/kubernetes/pki/etcd/server.crt  \
          --client-cert-auth=true   \
          --data-dir=/var/lib/etcd   \
          --experimental-initial-corrupt-check=true   \
          --experimental-watch-progress-notify-interval=5s   \
          --initial-advertise-peer-urls=https://172.17.1.120:2380   \
          --initial-cluster=node1=https://172.17.1.120:2380  \
          --key-file=/etc/kubernetes/pki/etcd/server.key   \
          --listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379   \
          --listen-metrics-urls=http://127.0.0.1:2381  \
          --listen-peer-urls=https://172.17.1.120:2380   \
          --name=node1  \
          --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt   \
          --peer-client-cert-auth=true   \
          --peer-key-file=/etc/kubernetes/pki/etcd/peer.key   \
          --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt   \
          --snapshot-count=10000   \
          --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
          --force-new-cluster 

A few key points here deserve special attention:

  • Force a new cluster: --force-new-cluster. Because only one member is being kept, cross-member data consistency is no longer a concern; the flag does not delete the existing data, it only rewrites the cluster membership metadata.
  • The etcd keys and certificates (-v /etc/kubernetes/pki/:/etc/kubernetes/pki/) and the etcd data directory (-v /var/lib/etcd:/var/lib/etcd) only need to match the mounts of the containerd container that kubelet starts (the manifest check below shows how to confirm these paths).
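
If you are unsure which host paths the kubelet-managed etcd uses, the static pod manifest is the authoritative source. A quick check, assuming the default kubeadm manifest location:

bash
# kubeadm writes the etcd static pod manifest here; the hostPath entries list the
# certificate directory and the data directory that the repair container must mount
root@node1:~# grep -A3 'hostPath:' /etc/kubernetes/manifests/etcd.yaml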

The etcd logs show that etcd has started normally and that node2 has been removed from the membership.

bash
{"level":"info","ts":"2025-07-23T05:35:03.300893Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 switched to configuration voters=(1039378730311999137)"}
{"level":"info","ts":"2025-07-23T05:35:03.301000Z","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"4a0015d70b3f3c63","local-member-id":"e6c9d72c757dea1","removed-remote-peer-id":"39011656e166436e","removed-remote-peer-urls":["https://172.17.1.121:2380"]}

I recommend using docker rather than crictl to run this etcd repair; the experience is noticeably better.
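
Besides reading the logs, you can verify the result with etcdctl while the temporary container is still running. This is a quick check of my own; the command mirrors the etcdctl invocation shown in section 4.2:

bash
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member list
# Only the node1 member should remain; "endpoint health" should now report healthy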

3.3. Restarting kubelet

Stop the temporary docker etcd container first (it was started with --rm, so interrupting it is enough; otherwise it would keep holding the etcd ports), then start the service with systemctl start kubelet. After roughly 5 minutes of self-recovery, once the logs looked normal the control-plane node was healthy again. With the control plane back, bring the worker nodes up one by one and the cluster returns to normal.
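
Concretely, the restart boils down to a couple of commands; this is a sketch of the sequence I mean (the crictl filter is the same one used in section 2.1):

bash
# start kubelet so the static pods (etcd, kube-apiserver, ...) are recreated
root@node1:~# systemctl start kubelet

# watch the control-plane containers come back up
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps \
    | grep -e apiserver -e etcd -e controller-manager -e scheduler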

bash
root@node1:/data# kubectl get nodes
NAME    STATUS   ROLES           AGE    VERSION
k8s-2   Ready    <none>          139d   v1.29.15
k8s-3   Ready    <none>          140d   v1.29.15
node1   Ready    control-plane   728d   v1.29.15

Running kubectl get nodes confirms that the control plane has recovered.
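
For extra assurance beyond node status, a couple of additional checks can be run (illustrative, not from the original session):

bash
# control-plane static pods should all be Running
root@node1:~# kubectl get pods -n kube-system -o wide

# the api-server's readiness endpoint aggregates its etcd and informer checks
root@node1:~# kubectl get --raw='/readyz?verbose'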

4. Post-Mortem

4.1. Number of Control-Plane Nodes

An even number of members has always been a weak spot for Raft; in practice no distributed system recommends an even node count, and `high availability` generally means deploying an odd number of nodes such as 3, 5, or 7.
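
The quorum arithmetic makes the point concrete: a cluster of n members needs floor(n/2)+1 of them alive, so 2 members tolerate exactly as many failures as 1 member, namely zero. A small illustration:

bash
# quorum = floor(n/2) + 1; tolerated failures = n - quorum
for n in 1 2 3 4 5; do
  q=$(( n / 2 + 1 ))
  echo "members=$n quorum=$q tolerated_failures=$(( n - q ))"
done
# members=2 quorum=2 tolerated_failures=0  -> losing either member loses quorum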

4.2. Could This Failure Have Been Avoided?

Deleting the etcd member before running `kubeadm reset` would also have avoided this failure.

In other words, the complete procedure for the steps in section 1.2 should have been:

bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2
# Find node2's member ID
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member list


root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member remove <member-id>
# On node2, reset the node with kubeadm
root@node2:~# kubeadm reset --force
# Follow-up cleanup (iptables, ipvs, etc.) omitted...