A Kubernetes Cluster Failure Case: etcd Unable to Elect a Leader, Causing the Kubernetes API Server to Fail to Start

Originally published at notes.icool.io/k8s-problem...

1. Overview

This was a Kubernetes control-plane failure caused by a careless operation; under normal circumstances a cluster should never run an even number of control-plane nodes. The situation came about because I needed to upgrade the operating system on Node1, and since that server had a history of failing to boot, I brought up an additional control-plane node (Node2) to be safe. Before the upgrade I had backed up the etcd data, so I was (or so I thought...) prepared for disaster recovery.
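For reference, a minimal sketch of how such a backup can be taken on a kubeadm cluster (the certificate paths are the kubeadm defaults; the destination /backup/etcd-snapshot.db is just an example path):

bash:
# Save a snapshot of the running etcd member.
ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    snapshot save /backup/etcd-snapshot.db
# Verify the snapshot metadata (hash, revision, total keys, size).
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table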

1.1 Cluster Information

Name    IP              Role
Node1   172.17.1.120    Control-plane node 1
Node2   172.17.1.121    Control-plane node 2
k8s-2   172.17.1.131    Worker node 1
k8s-3   172.17.1.132    Worker node 2

1.2 Failure Symptoms

I followed the normal procedure for removing a control-plane node; everything went smoothly, and nothing seemed unusual at the time.

bash:
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2

root@node2:~# kubeadm reset --force

Cause of the failure: this was a two-node control plane (an abnormal state in itself, but outside the scope of this post), and after the Node2 control-plane node was deleted, the remaining control plane stopped working. Specifically, etcd failed to start, which in turn caused the Kubernetes api-server to fail to start.

bash:
Jul 22 23:17:05 node1 kubelet[3301]: E0722 23:17:05.135471    3301 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://172.17.1.120:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node1?timeout=10s\": dial tcp 172.17.1.120:6443: connect: connection refused" interval="7s"

2. Failure Analysis

Checking container status on the control-plane node showed that all containers appeared to be running, but the etcd container logs showed that it could not connect to the other control-plane node.

2.1 Checking the Control-Plane Service Status

shell:
root@node1:~# crictl  --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps  | grep -e  apiserver -e etcd -e controller-manager -e scheduler
6dd841c1bdcc3       9ea0bd82ed4f6       About an hour ago   Running             kube-scheduler            53                  9477ef18cb630
cda7709fabb7f       b0cdcf76ac8e9       About an hour ago   Running             kube-controller-manager   54                  7a3368070af64
78f4ae23ef1e0       a9e7e6b294baf       About an hour ago   Running             etcd                      54                  583d4b926dc80
526d7fbe05632       f44c6888a2d24       12 hours ago        Running             kube-apiserver            0                   e21825618af02

So the services were all running, but the etcd logs showed it could not reach the other control-plane node, and the api-server could not connect to etcd.

shell:
root@node1:~# crictl  --runtime-endpoint=unix:///var/run/containerd/containerd.sock logs -f 78f4ae23ef1e0

{"level":"warn","ts":"2025-07-23T05:31:03.896215Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"39011656e166436e","rtt":"0s","error":"dial tcp 172.17.1.121:2380: connect: connection refused"}
{"level":"info","ts":"2025-07-23T05:31:04.416899Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 is starting a new election at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.416978Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 became pre-candidate at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417053Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 received MsgPreVoteResp from e6c9d72c757dea1 at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417147Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 [logterm: 62, index: 234052257] sent MsgPreVote request to 39011656e166436e at term 62"}

From the logs above it is clear that etcd was stuck in a perpetual election: with only 1 of 2 members alive, the Raft majority (quorum) requirement could not be met, so no Leader could be elected and client port 2379 was never opened. As a result we could not operate on etcd at all, and neither restarting kubelet nor restarting the etcd container solved the problem.
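This can be confirmed from the host by checking which etcd ports are actually listening (a quick check, assuming the ss utility is available; 2380 is the peer port, 2379 the client port):

bash:
# The peer port 2380 is listening, but the client port 2379 never opens
# because no leader has been elected.
root@node1:~# ss -lntp | grep -E '2379|2380'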

3. Remediation

Since etcd could not elect a leader, and we only needed a single etcd member anyway, the best option was to force-start etcd as a new single-member cluster and let it recover on its own.

3.1 Finding the etcd Container's Startup Arguments

bash:
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep etcd | awk '{print $1}' | xargs crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock inspect


# Locate the image and args sections
"image": {
      "image": "registry.k8s.io/etcd:3.5.16-0"
    }
---省略部分输出---
"args": [
          "etcd",
          "--advertise-client-urls=https://172.17.1.120:2379",
          "--cert-file=/etc/kubernetes/pki/etcd/server.crt",
          "--client-cert-auth=true",
          "--data-dir=/var/lib/etcd",
          "--experimental-initial-corrupt-check=true",
          "--experimental-watch-progress-notify-interval=5s",
          "--initial-advertise-peer-urls=https://172.17.1.120:2380",
          "--initial-cluster=node1=https://172.17.1.120:2380",
          "--key-file=/etc/kubernetes/pki/etcd/server.key",
          "--listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379",
          "--listen-metrics-urls=http://127.0.0.1:2381",
          "--listen-peer-urls=https://172.17.1.120:2380",
          "--name=node1",
          "--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
          "--peer-client-cert-auth=true",
          "--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
          "--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
          "--snapshot-count=10000",
          "--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
        ],

3.2 Force-Starting etcd

bash:
docker  run --rm --network=host -p 2379:2379 -p 2380:2380 -v /etc/kubernetes/pki/:/etc/kubernetes/pki/ -v /var/lib/etcd:/var/lib/etcd  registry.k8s.io/etcd:3.5.16-0 \
          etcd \
          --advertise-client-urls=https://172.17.1.120:2379 \
          --cert-file=/etc/kubernetes/pki/etcd/server.crt  \
          --client-cert-auth=true   \
          --data-dir=/var/lib/etcd   \
          --experimental-initial-corrupt-check=true   \
          --experimental-watch-progress-notify-interval=5s   \
          --initial-advertise-peer-urls=https://172.17.1.120:2380   \
          --initial-cluster=node1=https://172.17.1.120:2380  \
          --key-file=/etc/kubernetes/pki/etcd/server.key   \
          --listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379   \
          --listen-metrics-urls=http://127.0.0.1:2381  \
          --listen-peer-urls=https://172.17.1.120:2380   \
          --name=node1  \
          --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt   \
          --peer-client-cert-auth=true   \
          --peer-key-file=/etc/kubernetes/pki/etcd/peer.key   \
          --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt   \
          --snapshot-count=10000   \
          --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
          --force-new-cluster 

A few key points here deserve special attention:

  • Force a new cluster: --force-new-cluster. Since only one member is kept and cross-member data consistency is no longer a concern, this flag does not delete the existing data; it only rewrites the cluster membership metadata.
  • The etcd keys and certificates (-v /etc/kubernetes/pki/:/etc/kubernetes/pki/) and the etcd data directory (-v /var/lib/etcd:/var/lib/etcd) just need to match the mounts of the containerd container started by kubelet.

The etcd logs show that etcd has started normally and node2 has been removed from the membership.

bash:
{"level":"info","ts":"2025-07-23T05:35:03.300893Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 switched to configuration voters=(1039378730311999137)"}
{"level":"info","ts":"2025-07-23T05:35:03.301000Z","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"4a0015d70b3f3c63","local-member-id":"e6c9d72c757dea1","removed-remote-peer-id":"39011656e166436e","removed-remote-peer-urls":["https://172.17.1.121:2380"]}

I recommend using docker for this forced etcd repair; the experience is better than doing it through crictl.

3.3 Restarting kubelet

Start the service with systemctl start kubelet. After roughly five minutes of self-recovery, the logs looked healthy and the control-plane node was back to normal. Once the control plane was healthy, the worker nodes were brought back one by one and the cluster returned to normal.
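A rough sketch of this step (with one assumption on my part: the temporary docker container from section 3.2 has to be stopped first, so that the static-pod etcd recreated by kubelet can bind ports 2379/2380 again):

bash:
# Assumption: the temporary etcd from 3.2 is still running; stop it first
# (it was started with --rm, so stopping it also removes it).
root@node1:~# docker ps | grep etcd
root@node1:~# docker stop <temporary-etcd-container-id>
# Bring kubelet back; it recreates the static pods (etcd, kube-apiserver, ...).
root@node1:~# systemctl start kubelet
# Watch the control-plane containers come back up.
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps \
    | grep -e apiserver -e etcd -e controller-manager -e scheduler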

bash:
root@node1:/data# kubectl get nodes
NAME    STATUS   ROLES           AGE    VERSION
k8s-2   Ready    <none>          139d   v1.29.15
k8s-3   Ready    <none>          140d   v1.29.15
node1   Ready    control-plane   728d   v1.29.15

With kubectl get nodes we can see that the control plane has returned to normal.

4. Retrospective

4.1 Number of Control-Plane Nodes

An even number of members has always been a weak point for the Raft protocol; in practice, no distributed consensus system recommends an even member count. For high availability, deploy an odd number of nodes such as 3, 5, or 7.
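The arithmetic behind this is simple: the Raft quorum is floor(n/2)+1, so a 2-member cluster needs both members alive and tolerates no more failures than a single node does. A tiny sketch to illustrate:

bash:
# Quorum = n/2 + 1 (integer division); tolerated failures = n - quorum.
# Note that 2 members tolerate 0 failures, exactly like 1 member.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "$n members -> quorum $quorum, tolerated failures $(( n - quorum ))"
done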

4.2 Could This Have Been Avoided?

If the etcd member for node2 had been removed before running `kubeadm reset`, this failure could have been avoided.

In other words, the complete procedure for the operation in section 1.2 should have been:

bash:
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2
# Find node2's member ID
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member list


root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member remove <member-id>
# Reset node2 with kubeadm
root@node2:~# kubeadm reset --force
Follow-up cleanup of iptables, ipvs, and so on is omitted here...
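If etcdctl is not installed on the host, the same commands can be run from inside the static etcd pod instead (a sketch, assuming the pod follows the usual kubeadm naming convention etcd-<node-name>, i.e. etcd-node1 here):

bash:
# Run etcdctl inside the existing etcd static pod rather than installing it locally.
root@node1:~# kubectl -n kube-system exec etcd-node1 -- etcdctl \
    --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member list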