How do you recover a highly available K8s cluster when etcd is down on every master node?

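Preface


  • This post walks through a demo of restoring a cluster from a backup file after every etcd member has gone down
  • If I have misunderstood anything, please point it out 😃. Keep going!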

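There is no need to dwell too much on the present or to worry too much about the future; once you have lived through certain things, the scenery before you is no longer what it used to be. ------ Haruki Murakami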


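The prerequisite is an etcd backup file. Without an etcd snapshot or some other form of backup, the cluster is probably unrecoverable.

This post assumes etcdctl is already installed wherever it is needed.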

Backing up etcd

Here is a backup script:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat /usr/lib/systemd/system/etcd_back.sh
#!/bin/bash

#@File    :   etcd_back.sh
#@Time    :   2023/01/27 23:00:27
#@Author  :   Li Ruilong
#@Version :   1.0
#@Desc    :   ETCD backup
#@Contact :   1224965096@qq.com

if [ ! -d /root/back/ ];then
   mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)

ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379"  \
--cert="/etc/kubernetes/pki/etcd/server.crt"  \
--key="/etc/kubernetes/pki/etcd/server.key"  \
--cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
snapshot save /root/back/snap-${STR_DATE}.db

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db

sudo chmod  o-w,u-w,g-w  /root/back/snap-${STR_DATE}.db

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
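
Because the script writes a new snapshot on every run, /root/back keeps growing. The following is a minimal pruning sketch; the function name `prune_snapshots` and the 7-day retention in the usage comment are assumptions for illustration and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): delete snapshots
# older than a given number of days from the backup directory.
prune_snapshots() {
    local dir="$1"    # backup directory, e.g. /root/back
    local days="$2"   # retention threshold in days
    # -mtime +N matches files whose age, rounded down to whole days,
    # is greater than N. Read-only files are still removable because
    # deletion depends on the directory's permissions.
    find "$dir" -maxdepth 1 -type f -name 'snap-*.db' -mtime +"$days" -delete
}

# Example (assumed retention of 7 days):
# prune_snapshots /root/back 7
```

Calling it at the end of the backup script would keep the cleanup in the same systemd journal as the backup itself.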

How to run it:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
Snapshot saved at /root/back/snap-202406051145.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 7b00ddcf | 22243784 |       5999 |      88 MB |
+----------+----------+------------+------------+

This produces the corresponding backup files:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202*
.....
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051144.db
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051145.db

The script can be wrapped in a systemd service unit:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description= "ETCD 备份"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh


[Install]
WantedBy=multi-user.target

The main benefit is easier log inspection and management:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- No entries --
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl start  etcd-backup.service
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- Logs begin at 三 2024-06-05 03:49:25 CST, end at 三 2024-06-05 11:49:08 CST. --
6月 05 11:49:04 vms100.liruilongs.github.io systemd[1]: Starting "ETCD 备份"...
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: Snapshot saved at /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | 1ce12bf7 | 22244346 |       3753 |      88 MB |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io sudo[4344]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/chmod o-w,u-w,g-w /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io systemd[1]: Started "ETCD 备份".

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202406051*
........................
-r--r--r-- 1 root root 87515168 6月   5 11:49 /root/back/snap-202406051149.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Then use a timer unit to run it on a schedule:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target

The timer's logs can be viewed the same way:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u  etcd-backup.timer
-- No entries --
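
An empty journal may simply mean the timer has not been activated yet; the post does not show that step. Below is a sketch of what activation might look like, assuming the unit files are installed as shown above. The wrapper name `activate_backup_timer` is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper (not in the original post): reload unit files,
# enable the timer at boot and start it now, then show its next run time.
activate_backup_timer() {
    systemctl daemon-reload
    systemctl enable --now etcd-backup.timer
    systemctl list-timers etcd-backup.timer
}

# Usage on a master node (requires root):
# activate_backup_timer
```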

Failure handling and recovery

Symptom: the whole cluster is down; etcd and the apiserver have died on every master.

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get  pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?

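Move the static Pod YAML files for etcd and the apiserver out of the manifests directory so that the kubelet stops those Pods. How static Pods work is not covered here; interested readers can consult the official documentation.
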
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m  shell -a "mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Clear the etcd data files and subdirectories on every master:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd/*" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /var/lib/etcd/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Copy the backup file to every etcd node in the cluster:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m  copy -a "src=/root/back/snap-202403270000.db dest=/root/" -i host.yaml
192.168.26.100 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.87-95740-233443993764822/source",
    "state": "file",
    "uid": 0
}
192.168.26.101 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95742-263013169057776/source",
    "state": "file",
    "uid": 0
}
192.168.26.102 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95744-205050494494041/source",
    "state": "file",
    "uid": 0
}
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Confirm that the backup file was copied:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /root/snap-202403270000.db" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.101 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.100 | CHANGED | rc=0 >>
/root/snap-202403270000.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
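
The snapshot name above was typed by hand. Since the backup script embeds a sortable timestamp in each file name, the newest snapshot can also be picked automatically. This is a minimal sketch; `latest_snapshot` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the newest snapshot in a directory.
# Relies on the snap-YYYYmmddHHMM.db naming used by the backup script,
# where lexicographic order equals chronological order.
latest_snapshot() {
    local dir="$1"
    local files=("$dir"/snap-*.db)   # glob results are sorted by name
    local last="${files[${#files[@]}-1]}"
    # If nothing matched, the unexpanded pattern is not an existing file.
    [ -e "$last" ] || return 1
    printf '%s\n' "$last"
}

# Example: latest_snapshot /root/back
```

Whether the newest snapshot is the right one to restore is still a judgment call; the helper only saves typing.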

Run the restore command on one of the nodes:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$vim etcd_break.sh
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
Error: data-dir "/var/lib/etcd" exists

The error says the data directory already exists, so the directory itself has to be removed as well:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
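
An alternative to deleting the data directory is renaming it, so the old etcd data stays available if the restore goes wrong. This is a minimal sketch under that assumption; `stash_dir` and the `.bak-<timestamp>` suffix are hypothetical and are not what the post ran.

```shell
#!/bin/bash
# Hypothetical helper: rename a directory instead of deleting it, so the
# old etcd data can still be inspected if the restore goes wrong.
stash_dir() {
    local dir="$1"
    local backup="${dir}.bak-$(date +%Y%m%d%H%M%S)"
    [ -d "$dir" ] || return 0          # nothing to do if it is already gone
    # Print the new location on success.
    mv -- "$dir" "$backup" && printf '%s\n' "$backup"
}

# Example on each master: stash_dir /var/lib/etcd
```

After the rename the original path no longer exists, which also satisfies the restore command's requirement that the data directory be absent.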

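The restore command is shown below. Note that snapshot restore only reads the local snapshot file and writes a new data directory, so the TLS and --endpoints flags in the script are accepted but not actually used.
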
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat etcd_break.sh
ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db \
 --name vms100.liruilongs.github.io  \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
 --endpoints="https://127.0.0.1:2379" \
 --initial-advertise-peer-urls="https://192.168.26.100:2380"  \
 --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
 --data-dir=/var/lib/etcd

Run it again; this time the restore succeeds:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
2024-06-05 11:19:12.114058 I | mvcc: restore compact to 22239463
2024-06-05 11:19:12.137939 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138023 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138055 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

To restore on the other etcd nodes, two things in the command must change: --name and --initial-advertise-peer-urls.

Run on node 192.168.26.101:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms101.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.101 | CHANGED | rc=0 >>
2024-06-05 11:25:25.557851 I | mvcc: restore compact to 22239463
2024-06-05 11:25:25.614487 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614549 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614574 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Run on node 192.168.26.102:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms102.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.102 | CHANGED | rc=0 >>
2024-06-05 11:30:06.918159 I | mvcc: restore compact to 22239463
2024-06-05 11:30:06.935413 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935460 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935471 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
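
The three restore invocations above differ only in the member name and the advertised peer URL. The dry-run sketch below prints the command for each node instead of executing it. Host names, IPs, and the snapshot path are copied from the commands above; the function and variable names are hypothetical, and the TLS and --endpoints flags are left out on the assumption that the offline restore does not need them.

```shell
#!/bin/bash
# Dry-run sketch: print (do not execute) the restore command for each node.
# Host names, IPs and the snapshot path are taken from the commands above.
SNAPSHOT=/root/snap-202403270000.db
DOMAIN=liruilongs.github.io
NODES="vms100:192.168.26.100 vms101:192.168.26.101 vms102:192.168.26.102"

# Build the --initial-cluster value, which is identical on every node.
initial_cluster() {
    local out="" n
    for n in $NODES; do
        out="${out:+$out,}${n%%:*}.${DOMAIN}=https://${n##*:}:2380"
    done
    printf '%s\n' "$out"
}

# Print the restore command for one "host:ip" entry.
restore_cmd() {
    local host="${1%%:*}" ip="${1##*:}"
    printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --name %s.%s --initial-advertise-peer-urls=https://%s:2380 --initial-cluster=%s --data-dir=/var/lib/etcd\n' \
        "$SNAPSHOT" "$host" "$DOMAIN" "$ip" "$(initial_cluster)"
}

for n in $NODES; do
    restore_cmd "$n"
done
```

Printing first makes it easy to review the per-node differences before running anything against a production data directory.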

Move the static Pod YAML files back to bring up the etcd and apiserver Pods again:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

Confirm that the static Pod manifests are back in place:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Check the status of the etcd cluster members:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
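
For scripted checks, the table above can be reduced to a single number. This is a minimal sketch that counts members reported as started; `count_started_members` is a hypothetical helper. It only inspects the membership status in this table and says nothing about endpoint health, which `etcdctl endpoint health` reports separately.

```shell
#!/bin/bash
# Hypothetical helper: count members reported as "started" in the table
# printed by `etcdctl member list -w table` (read from stdin).
count_started_members() {
    # Split on "|": field 2 is the ID, field 3 the STATUS column.
    # Border and header rows never have "started" in field 3.
    awk -F'|' '$3 ~ /^ *started *$/ { n++ } END { print n + 0 }'
}

# Example: pipe the member list command shown above into the helper
# and compare the result with the expected member count (3 here).
```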

Confirm that the cluster has recovered:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get nodes
NAME                          STATUS   ROLES           AGE    VERSION
vms100.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms101.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms102.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms103.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms105.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms106.liruilongs.github.io   Ready    <none>          495d   v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
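
On a larger cluster it can be easier to list only the nodes that are not yet Ready. The filter below is a sketch that reads the default `kubectl get nodes` output from stdin; `not_ready_nodes` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print the names of nodes whose STATUS column is
# not exactly "Ready". A cordoned node ("Ready,SchedulingDisabled") is
# therefore listed too.
not_ready_nodes() {
    # NR > 1 skips the header row; $1 is NAME, $2 is STATUS.
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Example: kubectl get nodes | not_ready_nodes
```

An empty result means every node reports Ready.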

References

© Copyright of the referenced content belongs to the original authors. Please let me know of any infringement 😃


https://etcd.io/docs/v3.5/


© 2018-2024 liruilonger@gmail.com, Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)
