K8s high-availability cluster: how do you recover when etcd is down on every master node?

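Preface


  • This post is a demo of restoring a cluster from a backup file after etcd has gone down on every node.
  • If I have misunderstood anything, corrections are welcome 😃. Keep going!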

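Don't dwell too much on the present, and don't worry too much about the future. Once you have been through certain things, the scenery in front of you is no longer what it used to be. ------ Haruki Murakami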


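The prerequisite is an etcd backup file. Without an etcd snapshot or some other form of backup, the cluster is most likely unrecoverable.

This post assumes etcdctl is already installed wherever it is used.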

Sharing the backup script

Here is a backup script:

┌──[[email protected]]-[~/ansible]
└─$cat /usr/lib/systemd/system/etcd_back.sh
#!/bin/bash

#@File    :   etcd_back.sh
#@Time    :   2023/01/27 23:00:27
#@Author  :   Li Ruilong
#@Version :   1.0
#@Desc    :   etcd backup
#@Contact :   [email protected]

if [ ! -d /root/back/ ];then
   mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)

ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379"  \
--cert="/etc/kubernetes/pki/etcd/server.crt"  \
--key="/etc/kubernetes/pki/etcd/server.key"  \
--cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
snapshot save /root/back/snap-${STR_DATE}.db

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db

sudo chmod  o-w,u-w,g-w  /root/back/snap-${STR_DATE}.db

┌──[[email protected]]-[~/ansible]
└─$

Run it like this:

┌──[[email protected]]-[~/ansible]
└─$/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
Snapshot saved at /root/back/snap-202406051145.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 7b00ddcf | 22243784 |       5999 |      88 MB |
+----------+----------+------------+------------+

This produces the corresponding backup file:

┌──[[email protected]]-[~/ansible]
└─$ll /root/back/snap-202*
.....
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051144.db
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051145.db
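
When it is time to restore, you usually want the most recent snapshot. A minimal sketch, assuming the snap-YYYYmmddHHMM.db naming used by the backup script, so that a name sort is also a time sort. The function name latest_snapshot is made up for illustration:

```shell
#!/bin/sh
# Print the newest snapshot in a directory (default: /root/back).
# Assumes the snap-YYYYmmddHHMM.db naming from the backup script.
latest_snapshot() (
  dir="${1:-/root/back}"
  newest=""
  for f in "$dir"/snap-*.db; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest="$f"               # globs expand in sorted order, so the last match wins
  done
  # Return non-zero when no snapshot was found
  [ -n "$newest" ] && printf '%s\n' "$newest"
)
```

For example, `latest_snapshot /root/back` prints the path of the newest snapshot, or returns a non-zero status if the directory holds none.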

The script can be wrapped in a systemd service unit:

┌──[[email protected]]-[~/ansible]
└─$systemctl cat etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description= "ETCD 备份"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh


[Install]
WantedBy=multi-user.target

The main benefit is that the logs are easy to view and the job is easy to manage:

┌──[[email protected]]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- No entries --
┌──[[email protected]]-[~/ansible]
└─$systemctl start  etcd-backup.service
┌──[[email protected]]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- Logs begin at 三 2024-06-05 03:49:25 CST, end at 三 2024-06-05 11:49:08 CST. --
6月 05 11:49:04 vms100.liruilongs.github.io systemd[1]: Starting "ETCD 备份"...
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: Snapshot saved at /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | 1ce12bf7 | 22244346 |       3753 |      88 MB |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io sudo[4344]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/chmod o-w,u-w,g-w /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io systemd[1]: Started "ETCD 备份".

┌──[[email protected]]-[~/ansible]
└─$ll /root/back/snap-202406051*
........................
-r--r--r-- 1 root root 87515168 6月   5 11:49 /root/back/snap-202406051149.db
┌──[[email protected]]-[~/ansible]
└─$

Then use a timer unit to run it on a schedule:

┌──[[email protected]]-[~/ansible]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target

Its logs can be viewed the same way:

┌──[[email protected]]-[~/ansible]
└─$journalctl -u  etcd-backup.timer
-- No entries --
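
An empty timer log may simply mean the timer has not been enabled or started yet. A small sketch of how it could be activated and inspected, assuming the unit files above are already in place. The wrapper function name enable_backup_timer is made up for illustration:

```shell
#!/bin/sh
# Reload unit files, enable and start the timer, then show its next run.
# Assumes etcd-backup.service and etcd-backup.timer are already installed.
enable_backup_timer() {
  systemctl daemon-reload
  systemctl enable --now etcd-backup.timer
  systemctl list-timers etcd-backup.timer
}
```

Calling `enable_backup_timer` as root should list the timer together with its next scheduled run.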

Failure handling and recovery

Symptom: the whole cluster is down, and etcd and the apiserver have died on every master.

┌──[[email protected]]-[~]
└─$kubectl get  pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?

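Move the static Pod YAML files for etcd and the apiserver out of the manifests directory so that the kubelet stops those Pods. How static Pods work is not covered here; see the official documentation if you are interested.
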
┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master -m  shell -a "mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[[email protected]]-[~/ansible]
└─$

Clear the etcd data files and subdirectories of the current cluster:

┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd/*" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /var/lib/etcd/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[[email protected]]-[~/ansible]
└─$

Copy the backup file to every etcd node in the cluster:

┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master  -m  copy -a "src=/root/back/snap-202403270000.db dest=/root/" -i host.yaml
192.168.26.100 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.87-95740-233443993764822/source",
    "state": "file",
    "uid": 0
}
192.168.26.101 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95742-263013169057776/source",
    "state": "file",
    "uid": 0
}
192.168.26.102 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95744-205050494494041/source",
    "state": "file",
    "uid": 0
}
┌──[[email protected]]-[~/ansible]
└─$

Confirm that the backup file was copied:

┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /root/snap-202403270000.db" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.101 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.100 | CHANGED | rc=0 >>
/root/snap-202403270000.db
┌──[[email protected]]-[~/ansible]
└─$
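
The ansible copy output already reports the same checksum on all three nodes. If the snapshot is moved by some other means, a quick comparison like the following can serve the same purpose. This is only a sketch: it assumes sha256sum is available, and the function name same_checksum is made up for illustration:

```shell
#!/bin/sh
# Succeed only if both files exist and have the same SHA-256 digest.
same_checksum() (
  a=$(sha256sum "$1" | awk '{print $1}')
  b=$(sha256sum "$2" | awk '{print $1}')
  # An empty digest means the first file could not be read
  [ -n "$a" ] && [ "$a" = "$b" ]
)
```

For example: `same_checksum /root/back/snap-202403270000.db /root/snap-202403270000.db && echo OK`.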

Run the restore command on one of the nodes:

┌──[[email protected]]-[~/ansible]
└─$vim etcd_break.sh
┌──[[email protected]]-[~/ansible]
└─$sh etcd_break.sh
Error: data-dir "/var/lib/etcd" exists

The error says the data directory already exists, so the directory itself has to be removed as well:

┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[[email protected]]-[~/ansible]
└─$
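
If you would rather not delete the old data outright, one alternative is to move the directory aside, which also satisfies the requirement that the data directory must not exist. A minimal sketch; the function name set_aside_data_dir and the .bak-<timestamp> suffix are assumptions for illustration:

```shell
#!/bin/sh
# Rename the etcd data dir to <dir>.bak-<timestamp> and print the new path.
# Does nothing (and succeeds) if the directory does not exist.
set_aside_data_dir() (
  data_dir="${1:-/var/lib/etcd}"
  [ -d "$data_dir" ] || exit 0
  backup="${data_dir}.bak-$(date +%Y%m%d%H%M%S)"
  mv "$data_dir" "$backup" && printf '%s\n' "$backup"
)
```

Once the cluster is confirmed healthy again, the set-aside copy can be removed.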

The restore command:

┌──[[email protected]]-[~/ansible]
└─$cat etcd_break.sh
ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db \
 --name vms100.liruilongs.github.io  \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
 --endpoints="https://127.0.0.1:2379" \
 --initial-advertise-peer-urls="https://192.168.26.100:2380"  \
 --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
 --data-dir=/var/lib/etcd

Run it again, and the restore succeeds:

┌──[[email protected]]-[~/ansible]
└─$sh etcd_break.sh
2024-06-05 11:19:12.114058 I | mvcc: restore compact to 22239463
2024-06-05 11:19:12.137939 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138023 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138055 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

To restore on the other etcd nodes, two things in the script need to change: --name and --initial-advertise-peer-urls.

Run on node 192.168.26.101:

┌──[[email protected]]-[~/ansible]
└─$ansible 192.168.26.101 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms101.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.101 | CHANGED | rc=0 >>
2024-06-05 11:25:25.557851 I | mvcc: restore compact to 22239463
2024-06-05 11:25:25.614487 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614549 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614574 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[[email protected]]-[~/ansible]
└─$

Run on node 192.168.26.102:

┌──[[email protected]]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms102.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.102 | CHANGED | rc=0 >>
2024-06-05 11:30:06.918159 I | mvcc: restore compact to 22239463
2024-06-05 11:30:06.935413 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935460 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935471 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[[email protected]]-[~/ansible]
└─$
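
Since only the member name and its peer URL differ between nodes, the three restore commands can be generated from one member list instead of being edited by hand. A sketch under a few assumptions: the MEMBERS variable and both function names are made up for illustration, and the TLS and endpoint flags are left out on the assumption that snapshot restore only works on the local file:

```shell
#!/bin/sh
# name=peer-ip pairs for the three etcd members used in this post
MEMBERS="vms100.liruilongs.github.io=192.168.26.100 vms101.liruilongs.github.io=192.168.26.101 vms102.liruilongs.github.io=192.168.26.102"

# Build the --initial-cluster value, which is identical on every node.
initial_cluster() (
  out=""
  for m in $MEMBERS; do
    out="${out:+$out,}${m%%=*}=https://${m#*=}:2380"
  done
  printf '%s\n' "$out"
)

# Print the restore command for one member: name, peer IP, snapshot path.
restore_cmd() (
  name="$1"; ip="$2"; snapshot="$3"
  printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --name %s --initial-advertise-peer-urls=https://%s:2380 --initial-cluster=%s --data-dir=/var/lib/etcd\n' \
    "$snapshot" "$name" "$ip" "$(initial_cluster)"
)

# Print one command per member; review each before running it on that node.
for m in $MEMBERS; do
  restore_cmd "${m%%=*}" "${m#*=}" /root/snap-202403270000.db
done
```

The script only prints the commands, so nothing is touched until you run them yourself on the matching node.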

Move the static Pod YAML files back to bring the etcd and apiserver Pods up again:

┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master  -m shell -a "mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

Confirm that the static Pod manifests are back in place:

┌──[[email protected]]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[[email protected]]-[~/ansible]
└─$

Check the status of the etcd cluster members:

┌──[[email protected]]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
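
For a scripted check, the table can be piped through a small filter that counts the members in the started state. This is only a sketch that parses the `-w table` layout shown above; the function name all_members_started is made up for illustration:

```shell
#!/bin/sh
# Read `etcdctl member list -w table` output on stdin and succeed only
# if exactly the expected number of members report "started".
all_members_started() (
  expected="$1"
  # grep -c prints 0 and exits non-zero when nothing matches
  started=$(grep -c '| started |' || true)
  [ "$started" -eq "$expected" ]
)
```

Usage would look like `etcdctl ... member list -w table | all_members_started 3 && echo "etcd OK"`, with the same endpoint and certificate flags as in the command above.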

Confirm that the cluster has recovered:

┌──[[email protected]]-[~/ansible]
└─$kubectl get nodes
NAME                          STATUS   ROLES           AGE    VERSION
vms100.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms101.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms102.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms103.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms105.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms106.liruilongs.github.io   Ready    <none>          495d   v1.25.1
┌──[[email protected]]-[~/ansible]
└─$
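
On a larger cluster it is easier to list only the nodes that are not healthy. A minimal sketch that filters `kubectl get nodes --no-headers` output; the function name not_ready_nodes is made up for illustration. Note that a cordoned node, shown as Ready,SchedulingDisabled, is also reported by this filter:

```shell
#!/bin/sh
# Read `kubectl get nodes --no-headers` output on stdin and print the
# name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}
```

For example, `kubectl get nodes --no-headers | not_ready_nodes` prints nothing when every node is Ready.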

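References for parts of this post

© Copyright of the referenced content belongs to the original authors. Please let me know of any infringement 😃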


https://etcd.io/docs/v3.5/


© 2018-2024 [email protected], Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0)
