How do you recover when etcd goes down on a master node of a highly available K8s cluster?

Preface


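  • A very common cluster operations scenario, written up and shared here
  • This post walks through recovering a failed master node in a highly available K8s cluster
  • If I have misunderstood anything, corrections are welcome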

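There is no need to dwell too much on the present, nor to worry too much about the future. Once you have been through certain things, the scenery before you is no longer what it used to be. -- Haruki Murakami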


What went wrong

While running an experiment today, I found that both etcd and apiserver had gone down on one of the cluster's master nodes.

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get nodes
NAME                          STATUS   ROLES           AGE    VERSION
vms100.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
vms101.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
vms102.liruilongs.github.io   Ready    control-plane   415d   v1.25.1
vms103.liruilongs.github.io   Ready    <none>          415d   v1.25.1
vms105.liruilongs.github.io   Ready    <none>          415d   v1.25.1
vms106.liruilongs.github.io   Ready    <none>          415d   v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

The apiserver and etcd Pods on node vms100.liruilongs.github.io are both in CrashLoopBackOff:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms100.liruilongs.github.io            0/1     CrashLoopBackOff   1448 (3m23s ago)   415d   192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running            272 (3h18m ago)    415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running            246 (3h18m ago)    415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms100.liruilongs.github.io                      0/1     CrashLoopBackOff   1244 (3m6s ago)    415d   192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running            167 (3h18m ago)    415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running            173 (3h18m ago)    415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>

The keepalived static Pods are running normally:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep keep
kube-system          keepalived-vms100.liruilongs.github.io                1/1     Running            63 (3h50m ago)    415d   192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          keepalived-vms101.liruilongs.github.io                1/1     Running            54 (3h51m ago)    415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          keepalived-vms102.liruilongs.github.io                1/1     Running            60 (3h51m ago)    415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

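So etcd has probably gone down because its data fell out of sync, or for some other reason. Each master's apiserver talks only to the etcd on the same node (every etcd member forwards write requests to the etcd leader), so when the local etcd goes down the apiserver can no longer serve requests and crashes as well.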
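
To check which etcd endpoint a given apiserver is wired to, one option is to read the --etcd-servers flag from its static Pod manifest. This is a minimal sketch, assuming the kubeadm default manifest path; the helper name apiserver_etcd_servers is made up for illustration.

```shell
# Hypothetical helper: print the value of --etcd-servers from a kube-apiserver manifest.
# The default path below is the kubeadm convention and is an assumption about this cluster.
apiserver_etcd_servers() {
  # Manifest lines look like "    - --etcd-servers=https://127.0.0.1:2379".
  sed -n 's/^[[:space:]]*-[[:space:]]*--etcd-servers=//p' \
    "${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
}
```

If the printed value is only https://127.0.0.1:2379, that apiserver depends entirely on the etcd member running on the same node.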

etcdctl confirms that etcd on vms100.liruilongs.github.io is completely down:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
 member list -w table
Error: dial tcp 127.0.0.1:2379: connect: connection refused
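
The same TLS flags recur in every etcdctl call in this post. A small wrapper can cut the repetition. This is only a sketch: the function name ectl and the variable ETCD_ENDPOINT are made up here, and the certificate paths are the ones used in the command above.

```shell
# Hypothetical wrapper around etcdctl with the certificate flags used in this post.
# ETCD_ENDPOINT is an illustrative variable name; it defaults to the local member.
ectl() {
  ETCDCTL_API=3 etcdctl \
    --endpoints "${ETCD_ENDPOINT:-https://127.0.0.1:2379}" \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    "$@"
}
```

With it, the call above would become ectl member list -w table, and another member can be targeted with ETCD_ENDPOINT=https://192.168.26.101:2379 ectl member list -w table.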

Troubleshooting

Here we switch to another etcd node to run the commands.

List the etcd cluster members:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ssh vms101.liruilongs.github.io
Last login: Sat Mar  2 09:52:01 2024 from 192.168.26.100
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \
 --cert="/etc/kubernetes/pki/etcd/server.crt"  \
 --key="/etc/kubernetes/pki/etcd/server.key" \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
  member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+

Check the endpoint status:

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379 \
  --cert="/etc/kubernetes/pki/etcd/server.crt" \
  --key="/etc/kubernetes/pki/etcd/server.key"  \
  --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
   endpoint status --cluster  -w table
Failed to get the status of endpoint https://192.168.26.100:2379 (context deadline exceeded)
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   88 MB |     false |       603 |   22208417 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   88 MB |      true |       603 |   22208417 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+

Confirm which etcd member has failed:

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  \
 --cert="/etc/kubernetes/pki/etcd/server.crt"  \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
 endpoint  health --cluster  -w table
https://192.168.26.101:2379 is healthy: successfully committed proposal: took = 3.753357ms
https://192.168.26.102:2379 is healthy: successfully committed proposal: took = 2.989943ms
https://192.168.26.100:2379 is unhealthy: failed to connect: dial tcp 192.168.26.100:2379: connect: connection refused
Error: unhealthy cluster
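
When scripting this check, the unhealthy endpoints can be pulled out of the endpoint health output by matching the "is unhealthy" lines shown above. This is a sketch that assumes the plain-text line format seen in this post (other etcdctl versions or output formats may differ); unhealthy_endpoints is an illustrative name.

```shell
# Illustrative filter: read `etcdctl endpoint health` text output on stdin
# and print the URL of every endpoint reported as unhealthy.
unhealthy_endpoints() {
  awk '$2 == "is" && $3 == "unhealthy:" { print $1 }'
}
```

Depending on the etcdctl version, the health lines may be written to stderr, so the command's output may need 2>&1 before it is piped into the filter.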

Look at the etcd container logs on the failed node:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$docker ps -a | grep etcd
0f2f98ebf8c3   a8a176a5d5d6                                        "etcd --advertise-cl..."   4 minutes ago   Exited (2) 4 minutes ago                  k8s_etcd_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_1252
a4b39d16a753   registry.aliyuncs.com/google_containers/pause:3.8   "/pause"                 4 hours ago     Up 4 hours                                k8s_POD_etcd-vms100.liruilongs.github.io_kube-system_e8c17bb99f9bd8119cdd769556041e18_54
┌──[root@vms100.liruilongs.github.io]-[~]
└─$docker logs 0f2f98ebf8c3
{"level":"info","ts":"2024-03-16T14:46:54.644Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.26.100:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--experimental-watch-progress-notify-interval=5s","--initial-advertise-peer-urls=https://192.168.26.100:2380","--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.26.100:2380","--name=vms100.liruilongs.github.io","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.26.100:2380"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"]}
{"level":"info","ts":"2024-03-16T14:46:54.645Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.4","git-sha":"08407ff76","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"vms100.liruilongs.github.io","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.26.100:2380"],"listen-peer-urls":["https://192.168.26.100:2380"],"advertise-client-urls":["https://192.168.26.100:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.26.100:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: freepages: failed to get all reachable pages (page 7744: multiple references)

goroutine 109 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc00009c480)
        /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages
        /go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

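How to fix it

The container log ends with a bbolt panic (freepages: failed to get all reachable pages), which suggests the local etcd database file on this node is corrupted. The fastest fix here is to resync this node's data: remove the failed member from the cluster, clean out its old data, and add it back. The steps are:

  • Clean the data directory and move the static Pod YAML files: stop the services on the failed node, then delete the etcd data directory.
  • Remove the failed member: use the member remove command to drop the bad member; it can be run from a healthy node.
  • Add the member: use the member add command to add the failed node back.
  • Restart: move the failed node's YAML files back so the Pods start.

Note: static Pods are created from YAML files in a designated directory that kubelet scans periodically. Moving or deleting a YAML file stops the corresponding static Pod automatically; likewise, adding a YAML file creates the static Pod automatically.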
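
The steps above can be laid out as a dry-run plan before touching anything. The sketch below only prints commands and executes none of them; print_recovery_plan is an illustrative name, and ectl is a placeholder for etcdctl plus the TLS flags used throughout this post, not a real command.

```shell
# Illustrative dry-run: print the recovery steps in order, without executing anything.
# Arguments: member ID, member name, peer URL of the failed member.
# "ectl" is a placeholder for etcdctl with the --cert/--key/--cacert flags from this post.
print_recovery_plan() {
  member_id="$1"; member_name="$2"; peer_url="$3"
  cat <<EOF
# on the failed node: stop the static Pods, then clear the etcd data
mv /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml} /tmp/
rm -rf /var/lib/etcd/*
# on a healthy node: replace the member entry
ectl member remove ${member_id}
ectl member add ${member_name} --peer-urls=${peer_url}
# on the failed node: start the static Pods again
mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/
EOF
}
```

A more cautious variant, not used in this post, would move /var/lib/etcd aside instead of deleting its contents, so the old files stay available for inspection.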

Move the static Pod YAML files:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/

Delete the etcd data directory:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$rm -rf /var/lib/etcd/*

Confirm that both etcd and apiserver have stopped on the node:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running   272 (4h15m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running   246 (4h15m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running   167 (4h15m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running   173 (4h15m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

Get the failed member's ID. The following operations are run on a healthy etcd node; alternatively, point --endpoints at a healthy member.

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://192.168.26.101:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
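
To pick the ID out in a script rather than by eye, the default comma-separated output of member list (without -w table) is easier to parse. This is a sketch: member_id_by_name is an illustrative name, and it assumes the simple output format of etcdctl 3.5, where each line holds ID, status, name, peer URLs, client URLs and the learner flag.

```shell
# Illustrative filter: read `etcdctl member list` simple output on stdin
# and print the ID of the member whose name equals $1.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Piping the member list output into member_id_by_name vms100.liruilongs.github.io would then give the ID to pass to member remove.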

Remove the failed member:

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"  member remove ee392e5273e89e2
Member  ee392e5273e89e2 removed from cluster 4816f346663d82a7

Add it back:

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"  member add vms100.liruilongs.github.io --peer-urls=https://192.168.26.100:2380
Member 456f71fdc1ad9917 added to cluster 4816f346663d82a7

ETCD_NAME="vms100.liruilongs.github.io"
ETCD_INITIAL_CLUSTER="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.26.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
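
The ETCD_* lines printed by member add describe how the new member is expected to start. With a static Pod manifest they have to be expressed as etcd command-line flags. Here is a sketch of that mapping, assuming the four variable names shown above; env_to_etcd_flags is an illustrative helper.

```shell
# Illustrative filter: read the ETCD_* lines printed by `etcdctl member add` on stdin
# and print the equivalent etcd flags as manifest list items.
env_to_etcd_flags() {
  sed -n \
    -e 's/^ETCD_NAME="\(.*\)"$/- --name=\1/p' \
    -e 's/^ETCD_INITIAL_CLUSTER="\(.*\)"$/- --initial-cluster=\1/p' \
    -e 's/^ETCD_INITIAL_ADVERTISE_PEER_URLS="\(.*\)"$/- --initial-advertise-peer-urls=\1/p' \
    -e 's/^ETCD_INITIAL_CLUSTER_STATE="\(.*\)"$/- --initial-cluster-state=\1/p'
}
```

The printed lines can be compared against the command section of /etc/kubernetes/manifests/etcd.yaml on the node being re-added.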

Back on the vms100 machine, move the YAML files back to restore the node:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/

Check the Pod status:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms100.liruilongs.github.io                      1/1     Running   0                 16s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running   167 (4h32m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running   173 (4h32m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms100.liruilongs.github.io            1/1     Running   0                 24s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running   272 (4h32m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running   246 (4h32m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

Check the etcd cluster state:

┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
|        ID        |  STATUS   |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+
| 54952f3b494c0286 | unstarted |                             | https://192.168.26.100:2380 |                             |
| 70059e836d19883d |   started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 |   started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+-----------+-----------------------------+-----------------------------+-----------------------------+

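Here we find that the newly added member is not in a normal state: it stays unstarted.

Running etcdctl on the failed node shows that it never joined the cluster and is instead running as a standalone single-member cluster:
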
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
|       ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
| ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
+-----------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |       ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 | ee392e5273e89e2 |   3.5.4 |  815 kB |      true |         2 |       2261 |
+-----------------------------+-----------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

It has not synced the current cluster's data either:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide  --server=https://vms100.liruilongs.github.io:6443
No resources found

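In most cases this is caused by a problem in some node's etcd configuration. In my case the failed node's etcd manifest had no cluster-related settings: --initial-cluster listed only the node itself and --initial-cluster-state was not set, so etcd started as a new single-member cluster. The fix is to write the cluster settings into the manifest.

The original manifest:
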
┌──[root@vms100.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.26.100:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.26.100:2380
    - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.100:2380
    - --name=vms100.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: registry.aliyuncs.com/google_containers/etcd:3.5.4-0
................

The cluster information was incomplete. The manifest after adding it:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.100:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.26.100:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.26.100:2380
    - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.100:2380
    - --name=vms100.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

We then repeat the recovery in the same way as above, and this time the node does not come up at all:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep apiserver
kube-system          kube-apiserver-vms100.liruilongs.github.io            0/1     CrashLoopBackOff   1 (18s ago)       39s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms101.liruilongs.github.io            1/1     Running            272 (5h29m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          kube-apiserver-vms102.liruilongs.github.io            1/1     Running            246 (5h29m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get pod -A -o wide | grep etcd
kube-system          etcd-vms100.liruilongs.github.io                      0/1     CrashLoopBackOff   3 (21s ago)       53s    192.168.26.100   vms100.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms101.liruilongs.github.io                      1/1     Running            167 (5h29m ago)   415d   192.168.26.101   vms101.liruilongs.github.io   <none>           <none>
kube-system          etcd-vms102.liruilongs.github.io                      1/1     Running            173 (5h29m ago)   415d   192.168.26.102   vms102.liruilongs.github.io   <none>           <none>

Check the logs:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl logs etcd-vms100.liruilongs.github.io -n kube-system
.............................
{"level":"fatal","ts":"2024-03-16T16:25:19.981Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

The useful part of the log is the end of the error, RemovedMemberIDs:[]}: member count is unequal, meaning the member counts do not match. Looking at the log in more detail:

{
    "level": "info",
    "ts": "2024-03-16T16:25:19.961Z",
    "caller": "etcdmain/etcd.go:73",
    "msg": "Running: ",
    "args": [
        "etcd",
        "--advertise-client-urls=https://192.168.26.100:2379",
        "--cert-file=/etc/kubernetes/pki/etcd/server.crt",
        "--client-cert-auth=true",
        "--data-dir=/var/lib/etcd",
        "--experimental-initial-corrupt-check=true",
        "--experimental-watch-progress-notify-interval=5s",
        "--initial-advertise-peer-urls=https://192.168.26.100:2380",
        "--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380",
        "--initial-cluster-state=existing",
        "--key-file=/etc/kubernetes/pki/etcd/server.key",
        "--listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379",
        "--listen-metrics-urls=http://127.0.0.1:2381",
        "--listen-peer-urls=https://192.168.26.100:2380",
        "--name=vms100.liruilongs.github.io",
        "--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
        "--peer-client-cert-auth=true",
        "--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
        "--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
        "--snapshot-count=10000",
        "--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
    ]
}
..............................................................................
{
    "level": "warn",
    "ts": "2024-03-16T16:25:19.981Z",
    "caller": "etcdmain/etcd.go:146",
    "msg": "failed to start etcd",
    "error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal"
}
{
    "level": "fatal",
    "ts": "2024-03-16T16:25:19.981Z",
    "caller": "etcdmain/etcd.go:204",
    "msg": "discovery failed",
    "error": "error validating peerURLs {ClusterID:4816f346663d82a7 Members:[&{ID:b8cb9f66c2e63b91 RaftAttributes:{PeerURLs:[https://192.168.26.102:2380] IsLearner:false} Attributes:{Name:vms102.liruilongs.github.io ClientURLs:[https://192.168.26.102:2379]}} &{ID:3fbbbed942c51f7b RaftAttributes:{PeerURLs:[https://192.168.26.100:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}} &{ID:70059e836d19883d RaftAttributes:{PeerURLs:[https://192.168.26.101:2380] IsLearner:false} Attributes:{Name:vms101.liruilongs.github.io ClientURLs:[https://192.168.26.101:2379]}}] RemovedMemberIDs:[]}: member count is unequal",
    "stacktrace": "go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"
}
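
One way to read this error is to count the members on each side: the PeerURLs entries the cluster reports in the error text versus the entries passed in the --initial-cluster flag. This is a sketch with illustrative helper names; grep -o is assumed to be available (GNU and BSD grep both provide it).

```shell
# Illustrative helpers for comparing member counts.
# count_cluster_members: count PeerURLs entries in an "error validating peerURLs" message (stdin).
count_cluster_members() {
  grep -o 'PeerURLs:\[[^]]*]' | wc -l | tr -d ' '
}

# count_flag_members: count comma-separated entries in an --initial-cluster value ($1).
count_flag_members() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}
```

If the two numbers differ, the flag value does not describe the same cluster that the running members report.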

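The error lists three members (vms102, vms101, and the not-yet-started member at 192.168.26.100), while the --initial-cluster we passed names only two; the entry it lacks is the one for vms102.liruilongs.github.io.

So let's look at the manifest on vms102.liruilongs.github.io for comparison:
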
┌──[root@vms102.liruilongs.github.io]-[~]
└─$cat /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://192.168.26.102:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.26.102:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://192.168.26.102:2380
    - --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.102:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://192.168.26.102:2380
    - --name=vms102.liruilongs.github.io
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

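Comparing the two manifests shows that the configuration we wrote for the failed node was still wrong: it was missing the vms102.liruilongs.github.io=https://192.168.26.102:2380 entry. The first line below is the flag from the failed node, the second is the one from vms102:
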
"--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380",
"--initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380"

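The same comparison can be done mechanically by checking each entry of one --initial-cluster value against the other. This is a sketch; missing_members is an illustrative name.

```shell
# Illustrative helper: print entries present in the reference --initial-cluster value ($2)
# but missing from the candidate value ($1). The order of entries does not matter.
missing_members() {
  printf '%s\n' "$2" | tr ',' '\n' | while IFS= read -r entry; do
    case ",$1," in
      *",$entry,"*) ;;                 # entry is present in the candidate
      *) printf '%s\n' "$entry" ;;     # entry is missing from the candidate
    esac
  done
}
```

Called with the failed node's value as the first argument and the vms102 value as the second, it prints the entries that still need to be added.
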
After fixing the configuration and repeating the same recovery procedure, the node recovers.

Verify with etcdctl:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| ac5f6045dbe477b3 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   88 MB |     false |       603 |   22227327 |
| https://192.168.26.100:2379 | ac5f6045dbe477b3 |   3.5.4 |   88 MB |     false |       603 |   22227327 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   88 MB |      true |       603 |   22227327 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$
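
For a scripted sanity check after recovery, the table output of endpoint status can be scanned to confirm that exactly one member reports itself as leader. This is a sketch, assuming the table layout shown above with an IS LEADER column; has_single_leader is an illustrative name.

```shell
# Illustrative check: read `etcdctl endpoint status -w table` output on stdin and
# succeed only if exactly one endpoint has IS LEADER = true.
has_single_leader() {
  awk -F'|' '
    /IS LEADER/ {                 # header row: locate the IS LEADER column
      for (i = 1; i <= NF; i++) if ($i ~ /IS LEADER/) col = i
      next
    }
    col && NF > col {             # data rows (border rows have a single field)
      v = $col; gsub(/ /, "", v)
      if (v == "true") leaders++
    }
    END { exit (leaders == 1 ? 0 : 1) }
  '
}
```

The function returns a non-zero status when no leader or more than one leader is listed, so it can gate later steps in a script.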

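The failed node is now recovered. In practice, after adding the member back, make sure the failed node's etcd manifest is correct before starting it: --initial-cluster should list every member of the cluster and --initial-cluster-state should be existing.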


© 2018-2024 liruilonger@gmail.com, All rights reserved. Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)
