K8s 集群高可用master节点ETCD全部挂掉如何恢复?

写在前面


  • 博文内容涉及集群 ETCD 全部挂掉,通过备份文件恢复的操作 Demo
  • 理解不足小伙伴帮忙指正 😃,生活加油

不必太纠结于当下,也不必太忧虑未来,当你经历过一些事情的时候,眼前的风景已经和从前不一样了。------村上春树


前提是需要etcd备份文件,如果没有 etcd 备份,或者其他的备份手段,可能 GG 了

这里默认需要使用 etcdctl 的地方已经安装了该工具

备份文件分享

分享一个备份脚本

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat /usr/lib/systemd/system/etcd_back.sh
#!/bin/bash

#@File    :   erct_break.sh
#@Time    :   2023/01/27 23:00:27
#@Author  :   Li Ruilong
#@Version :   1.0
#@Desc    :   ETCD 备份
#@Contact :   1224965096@qq.com

if [ ! -d /root/back/ ];then
   mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)

ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379"  \
--cert="/etc/kubernetes/pki/etcd/server.crt"  \
--key="/etc/kubernetes/pki/etcd/server.key"  \
--cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
snapshot save /root/back/snap-${STR_DATE}.db

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-${STR_DATE}.db

sudo chmod  o-w,u-w,g-w  /root/back/snap-${STR_DATE}.db

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

运行方式

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh
Snapshot saved at /root/back/snap-202406051145.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 7b00ddcf | 22243784 |       5999 |      88 MB |
+----------+----------+------------+------------+

生成对应的备份数据

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202*
.....
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051144.db
-r--r--r-- 1 root root 87515168 6月   5 11:45 /root/back/snap-202406051145.db

可以使用 systemd 配置成 service unit

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description= "ETCD 备份"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh


[Install]
WantedBy=multi-user.target

主要是方便看日志,方便管理

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- No entries --
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl start  etcd-backup.service
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u etcd-backup.service
-- Logs begin at 三 2024-06-05 03:49:25 CST, end at 三 2024-06-05 11:49:08 CST. --
6月 05 11:49:04 vms100.liruilongs.github.io systemd[1]: Starting "ETCD 备份"...
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: Snapshot saved at /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: | 1ce12bf7 | 22244346 |       3753 |      88 MB |
6月 05 11:49:07 vms100.liruilongs.github.io bash[3957]: +----------+----------+------------+------------+
6月 05 11:49:07 vms100.liruilongs.github.io sudo[4344]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/chmod o-w,u-w,g-w /root/back/snap-202406051149.db
6月 05 11:49:07 vms100.liruilongs.github.io systemd[1]: Started "ETCD 备份".

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ll /root/back/snap-202406051*
........................
-r--r--r-- 1 root root 87515168 6月   5 11:49 /root/back/snap-202406051149.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

然后使用 timer unit 配置为定时启动

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="每天备份一次 ETCD"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target

同样可以看日志

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$journalctl -u  etcd-backup.timer
-- No entries --

故障处理恢复

故障表象,集群整个崩了,所有 master 上的 etcd 和 apiserver 都死掉了

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl get  pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?

移动 etcd 和 apiserver 的对应 静态 podyaml 文件。关于 静态 Pod 运行原理这里不多讲,感兴趣小伙伴可以官网看下

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m  shell -a "mv  /etc/kubernetes/manifests/{etcd.yaml,kube-apiserver.yaml}  /tmp/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

清除当前集群的 etcd 的数据文件和对应的目录

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd/*" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /var/lib/etcd/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

拷贝 备份文件到当前集群的每个 etcd 节点

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m  copy -a "src=/root/back/snap-202403270000.db dest=/root/" -i host.yaml
192.168.26.100 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.87-95740-233443993764822/source",
    "state": "file",
    "uid": 0
}
192.168.26.101 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95742-263013169057776/source",
    "state": "file",
    "uid": 0
}
192.168.26.102 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "d8927d8fa47b1e162cb2326ddd968d8227a0555d",
    "dest": "/root/snap-202403270000.db",
    "gid": 0,
    "group": "root",
    "md5sum": "6489d7243f636086816ac13aa69ceb44",
    "mode": "0644",
    "owner": "root",
    "size": 87515168,
    "src": "/root/.ansible/tmp/ansible-tmp-1717557132.92-95744-205050494494041/source",
    "state": "file",
    "uid": 0
}
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

确定拷贝文件的备份文件

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "ls /root/snap-202403270000.db" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.101 | CHANGED | rc=0 >>
/root/snap-202403270000.db
192.168.26.100 | CHANGED | rc=0 >>
/root/snap-202403270000.db
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

在其中一个节点执行备份恢复命令

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$vim etcd_break.sh
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
Error: data-dir "/var/lib/etcd" exists

提示目录存在,所以需要把目录也同样删除掉

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "rm -rf /var/lib/etcd" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn: false' to this command task or set
'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

备份恢复命令

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$cat etcd_break.sh
ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db \
 --name vms100.liruilongs.github.io  \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
 --endpoints="https://127.0.0.1:2379" \
 --initial-advertise-peer-urls="https://192.168.26.100:2380"  \
 --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
 --data-dir=/var/lib/etcd

再次执行,备份恢复成功

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$sh etcd_break.sh
2024-06-05 11:19:12.114058 I | mvcc: restore compact to 22239463
2024-06-05 11:19:12.137939 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138023 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:19:12.138055 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

其他的etcd节点备份恢复,需要修改脚本两个地方:

192.168.26.101 节点执行

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.101 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms101.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.101 | CHANGED | rc=0 >>
2024-06-05 11:25:25.557851 I | mvcc: restore compact to 22239463
2024-06-05 11:25:25.614487 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614549 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:25:25.614574 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

192.168.26.102 节点执行

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible 192.168.26.102 -m shell -a "ETCDCTL_API=3 etcdctl snapshot restore /root/snap-202403270000.db --name vms102.l
iruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert=
"/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd" -i host.yaml

192.168.26.102 | CHANGED | rc=0 >>
2024-06-05 11:30:06.918159 I | mvcc: restore compact to 22239463
2024-06-05 11:30:06.935413 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935460 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2024-06-05 11:30:06.935471 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

移动静态 Pod对应的 yaml 文件,恢复 etcd 和apiserver 对应的Pod

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m shell -a "mv /tmp/{etcd.yaml,kube-apiserver.yaml} /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

确认静态pod 恢复

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m shell -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

查看 etcd 集群节点状态

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 70059e836d19883d | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
| b8cb9f66c2e63b91 | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+

确认集群是否恢复

bash 复制代码
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get nodes
NAME                          STATUS   ROLES           AGE    VERSION
vms100.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms101.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms102.liruilongs.github.io   Ready    control-plane   495d   v1.25.1
vms103.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms105.liruilongs.github.io   Ready    <none>          495d   v1.25.1
vms106.liruilongs.github.io   Ready    <none>          495d   v1.25.1
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

博文部分内容参考

© 文中涉及参考链接内容版权归原作者所有,如有侵权请告知 😃


https://etcd.io/docs/v3.5/


© 2018-2024 liruilonger@gmail.com, 保持署名-非商用-相同方式共享(CC BY-NC-SA 4.0)

相关推荐
march of Time24 分钟前
docker容器技术、k8s的原理和常见命令、用k8s部署应用步骤
docker·容器·kubernetes
板栗妖怪1 小时前
docker安装以及简单使用
docker·容器
乐之者v1 小时前
k8s 常用的命令
云原生·容器·kubernetes
蝎子莱莱爱打怪3 小时前
docker 重要且常用命令大全
java·spring cloud·docker·容器·eureka
小杰6663 小时前
安装docker版rabbitmq 3.12
docker·容器·rabbitmq
明明跟你说过6 小时前
【云原生】服务网格(Istio)如何简化微服务通信
运维·微服务·云原生·容器·kubernetes·k8s·istio
bestcxx7 小时前
(九)Docker 的网络通信
docker·容器
孤城2867 小时前
10 docker 安装 mysql详解
mysql·docker·微服务·容器
lendq8 小时前
k8s-第一节-minikube
云原生·容器·kubernetes
洛阳泰山9 小时前
Docker 镜像国内加速下载教程
docker·容器