K8s 集群高可用master节点ETCD全部挂掉如何恢复?

写在前面


  • 博文内容涉及集群 ETCD 全部挂掉,通过备份文件恢复的操作 Demo
  • 理解不足小伙伴帮忙指正 😃,生活加油

不必太纠结于当下,也不必太忧虑未来,当你经历过一些事情的时候,眼前的风景已经和从前不一样了。------村上春树


前提是需要etcd备份文件,如果没有 etcd 备份,或者其他的备份手段,可能 GG 了

这里默认需要使用 etcdctl 的地方已经安装了该工具

备份文件分享

分享一个备份脚本

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat /usr/lib/systemd/system/etcd_back.sh
#!/bin/bash

#@File    :   erct_break.sh
#@Time    :   2023/01/27 23:00:27
#@Author  :   Li Ruilong
#@Version :   1.0
#@Desc    :   ETCD 备份
#@Contact :   1224965096@qq.com

if [ ! -d /root/back/ ];then
   mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)

ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379"  \
--cert="/etc/kubernetes/pki/etcd/server.crt"  \
--key="/etc/kubernetes/pki/etcd/server.key"  \
--cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
snapshot save /root/back/snap-${STR_DATE}.db

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db

sudo chmod  o-w,u-w,g-w  /root/back/snap-${STR_DATE}.db

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

运行方式

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
Snapshot saved at /root/back/snap-202406051145.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 7b00ddcf | 22243784 |       5999 |      88 MB |
+----------+----------+------------+------------+

生成对应的备份数据

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202*
.....
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051144.db
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051145.db

可以使用 systemd 配置成 service unit

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description= "ETCD 备份"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh


[Install]
WantedBy=multi-user.target

主要是方便看日志,方便管理

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- No entries --
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl start  etcd-backup.service
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- Logs begin at 三 2024-06-05 03:49:25 CST, end at 三 2024-06-05 11:49:08 CST. --
6月 05 11:49:04 vms100.liruilongs.github.io systemd[1]: Starting "ETCD 备份"...
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: Snapshot saved at /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | 1ce12bf7 | 22244346 |       3753 |      88 MB |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io sudo[4344]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/chmod o-w,u-w,g-w /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io systemd[1]: Started "ETCD 备份".

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202406051*
........................
-r--r--r-- 1 root root 87515168 6月   5 11:49 /root/back/snap-202406051149.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

然后使用 timer unit 配置为定时启动

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target

同样可以看日志

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u  etcd-backup.timer
-- No entries --

故障处理恢复

故障表象,集群整个崩了,所有 master 上的 etcd 和 apiserver 都死掉了

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get  pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?

移动 etcd 和 apiserver 的对应 静态 podyaml 文件。关于 静态 Pod 运行原理这里不多讲,感兴趣小伙伴可以官网看下

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m  shell -a "mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

清除当前集群的 etcd 的数据文件和对应的目录

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd/*" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /var/lib/etcd/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

拷贝 备份文件到当前集群的每个 etcd 节点

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m  copy -a "src=/root/back/snap-202403270000.db dest=/root/" -i host.yaml
192.168.26.100 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.87-95740-233443993764822/source",
    "state": "file",
    "uid": 0
}
192.168.26.101 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95742-263013169057776/source",
    "state": "file",
    "uid": 0
}
192.168.26.102 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95744-205050494494041/source",
    "state": "file",
    "uid": 0
}
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

确定拷贝文件的备份文件

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /root/snap-202403270000.db" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.101 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.100 | CHANGED | rc=0 >>
/root/snap-202403270000.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

在其中一个节点执行备份恢复命令

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$vim etcd_break.sh
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
Error: data-dir "/var/lib/etcd" exists

提示目录存在,所以需要把目录也同样删除掉

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

备份恢复命令

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat etcd_break.sh
ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db \
 --name vms100.liruilongs.github.io  \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
 --endpoints="https://127.0.0.1:2379" \
 --initial-advertise-peer-urls="https://192.168.26.100:2380"  \
 --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
 --data-dir=/var/lib/etcd

再次执行,备份恢复成功

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
2024-06-05 11:19:12.114058 I | mvcc: restore compact to 22239463
2024-06-05 11:19:12.137939 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138023 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138055 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

其他的etcd节点备份恢复,需要修改脚本两个地方:

192.168.26.101 节点执行

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms101.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.101 | CHANGED | rc=0 >>
2024-06-05 11:25:25.557851 I | mvcc: restore compact to 22239463
2024-06-05 11:25:25.614487 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614549 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614574 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

192.168.26.102 节点执行

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms102.l
iruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert=
"/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.102 | CHANGED | rc=0 >>
2024-06-05 11:30:06.918159 I | mvcc: restore compact to 22239463
2024-06-05 11:30:06.935413 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935460 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935471 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

移动静态 Pod对应的 yaml 文件,恢复 etcd 和apiserver 对应的Pod

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

确认静态pod 恢复

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

查看 etcd 集群节点状态

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+

确认集群是否恢复

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get nodes
NAME                          STATUS   ROLES           AGE    VERSION
vms100.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms101.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms102.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms103.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms105.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms106.liruilongs.github.io   Ready    <none>          495d   v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

博文部分内容参考

© 文中涉及参考链接内容版权归原作者所有,如有侵权请告知 😃


https://etcd.io/docs/v3.5/


© 2018-2024 liruilonger@gmail.com, 保持署名-非商用-相同方式共享(CC BY-NC-SA 4.0)

相关推荐
lichenyang4532 天前
Docker 学习笔记(四):Dockerfile,把项目打成自己的镜像
docker·容器
lichenyang4532 天前
Docker 学习笔记(三):Docker 网络、bridge、子网和容器互通
docker·容器
lichenyang4532 天前
Docker 学习笔记(二):docker run 的参数到底在控制什么?
docker·容器
运维开发故事4 天前
基于 Arthas 的多集群在线诊断系统设计与实现
kubernetes
Patrick_Wilson6 天前
从「改个端口」到 502:Next.js on k8s 的容器端口、Service 映射与 env 覆盖
docker·kubernetes·next.js
探索云原生6 天前
K8s 1.36 这个 GA 特性,把 initContainer 拉模型的 hack 干掉了
ai·云原生·kubernetes
云恒要逆袭7 天前
运行你的第一个Docker容器
后端·docker·容器
Java之美8 天前
一次k8s升级引发的DevicePlugin注册失败
云原生·kubernetes
程序员老赵8 天前
10 分钟部署 OpenCode:Docker 一键安装,浏览器打开就能用 AI 写代码(附完整命令与排错)
docker·容器·ai编程