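1. etcd scale-out / scale-in
2. etcd data backup / restore
Data backup
Take the backup from the leader node. A follower may lag behind the leader, so a snapshot taken from it can miss the most recent writes.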
etcdctl --endpoints="https://10.119.48.166:2379" snapshot save /data/snapshot$(date +%Y%m%d).db
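The "back up from the leader" rule can be sketched as a small helper that first asks the cluster who the leader is and then takes the snapshot from that endpoint. This is only a sketch under assumptions: the function names are hypothetical, TLS options are assumed to be supplied through the environment (for example ETCDCTL_CACERT), and the parsing assumes that `-w simple` prints comma-separated fields with IS LEADER in the fifth column, which should be verified against the etcdctl version in use.

```shell
#!/bin/sh
# Sketch: locate the leader's client URL, then snapshot from it.
# ASSUMPTION: "endpoint status -w simple" prints ", "-separated fields
# with the endpoint in column 1 and IS LEADER (true/false) in column 5.

# Print the client endpoint of the current leader (hypothetical helper).
find_leader_endpoint() {
  etcdctl --endpoints="$1" endpoint status -w simple \
    | awk -F', ' '$5 == "true" { print $1; exit }'
}

# Save a dated snapshot from the leader into the given directory
# and print the snapshot path (hypothetical helper).
backup_from_leader() {
  endpoints="$1"
  dir="$2"
  leader="$(find_leader_endpoint "$endpoints")"
  if [ -z "$leader" ]; then
    echo "no leader found among: $endpoints" >&2
    return 1
  fi
  file="$dir/snapshot$(date +%Y%m%d).db"
  # etcdctl's own messages go to stderr so stdout carries only the path.
  etcdctl --endpoints="$leader" snapshot save "$file" >&2 && echo "$file"
}
```

A call such as `backup_from_leader "https://10.119.48.166:2379,https://10.119.48.168:2379,https://10.119.48.169:2379" /data` would then write the snapshot from whichever member currently leads.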
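Data restore
# Run on etcd1: restore the snapshot into a new data directory
etcdctl snapshot restore /tmp/snapshot20230718.db --data-dir=/data/kube/etcd --name=etcd1 --initial-cluster-token=etcd-cluster-0 --initial-cluster=etcd1=https://10.119.48.166:2380,etcd2=https://10.119.48.168:2380,etcd3=https://10.119.48.169:2380 --initial-advertise-peer-urls=https://10.119.48.166:2380
# Repeat the restore on every node (with that node's own --name and --initial-advertise-peer-urls), then update each node's configuration to point to the new data directory, and finally restart all nodes
Check the ownership and permissions of the data directory: when they are insufficient, the error message is not very helpful.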
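Because only --name and --initial-advertise-peer-urls differ between nodes, the per-node restore commands can be generated from the --initial-cluster string. The following is a minimal sketch that only prints the commands; the function names are hypothetical, and the snapshot path, data directory and member list are copied from the example above.

```shell
#!/bin/sh
# Sketch: print the restore command that each member should run.
SNAPSHOT=/tmp/snapshot20230718.db
DATA_DIR=/data/kube/etcd
CLUSTER='etcd1=https://10.119.48.166:2380,etcd2=https://10.119.48.168:2380,etcd3=https://10.119.48.169:2380'

# Print one restore command for the member with the given name and peer URL.
restore_cmd() {
  printf 'etcdctl snapshot restore %s --data-dir=%s --name=%s --initial-cluster-token=etcd-cluster-0 --initial-cluster=%s --initial-advertise-peer-urls=%s\n' \
    "$SNAPSHOT" "$DATA_DIR" "$1" "$CLUSTER" "$2"
}

# Split "name=peer-url" pairs out of CLUSTER and print a command per member.
print_restore_plan() {
  echo "$CLUSTER" | tr ',' '\n' | while IFS='=' read -r name peer_url; do
    restore_cmd "$name" "$peer_url"
  done
}
```

Running `print_restore_plan` prints three lines, one per member, each to be executed on the matching node.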
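3. etcd storage space exhausted
Symptom: the application reports the error "etcdserver: mvcc: database space exceeded"
Cause: the etcd backend storage quota defaults to 2 GB, and 8 GB is the suggested maximum. When the quota is reached, etcd raises a NOSPACE alarm and enters a restricted maintenance mode that only accepts key reads and deletes.
Official explanation: The default storage size limit is 2GB, configurable with --quota-backend-bytes flag. 8GB is a suggested maximum size for normal environments and etcd warns at startup if the configured value exceeds it.
Check current space usage
# check space usage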
$ ETCDCTL_API=3 etcdctl --write-out=table endpoint status
+----------------+------------------+-----------+---------+-----------+-----------+------------+
|    ENDPOINT    |        ID        |  VERSION  | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------+------------------+-----------+---------+-----------+-----------+------------+
| 127.0.0.1:2379 | bf9071f4639c75cc | 2.3.0+git | 18 MB   | true      |         2 |       3332 |
+----------------+------------------+-----------+---------+-----------+-----------+------------+
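To judge how close a member is to the quota, the DB SIZE value can be compared with the configured quota. The sketch below reads the JSON form of the same command and computes an integer percentage; the function names are hypothetical, and it assumes the JSON output carries a "dbSize" field in bytes.

```shell
#!/bin/sh
# Sketch: how much of the backend quota is in use?
# ASSUMPTION: "etcdctl endpoint status -w json" contains "dbSize":<bytes>.

# Read endpoint-status JSON on stdin and print the first dbSize in bytes.
extract_db_size() {
  grep -o '"dbSize":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Print the integer percentage of the quota in use: $1 = db size, $2 = quota.
quota_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Typical use against a live member (2147483648 is the 2 GB default quota):
#   size="$(ETCDCTL_API=3 etcdctl endpoint status -w json | extract_db_size)"
#   quota_usage_pct "$size" 2147483648
```

For the 18 MB database shown in the table, the result against the default quota is 0, since the integer division rounds down from under 1 percent.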
Temporary fix: compact the history and defragment, then disarm the alarm. This restores service temporarily.
# get current revision
$ rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# compact away all old revisions
$ ETCDCTL_API=3 etcdctl compact $rev
compacted revision 1516
# defragment away excessive space
$ ETCDCTL_API=3 etcdctl defrag
Finished defragmenting etcd member[127.0.0.1:2379]
# disarm alarm
$ ETCDCTL_API=3 etcdctl alarm disarm
memberID:13803658152347727308 alarm:NOSPACE
# test puts are allowed again
$ ETCDCTL_API=3 etcdctl put newkey 123
OK
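In a multi-member cluster, compaction is replicated to all members, but defragmentation only acts on the endpoints it is sent to, so each member has to be defragmented in turn. The steps above can be sketched as one function that does this; the function name is hypothetical, the revision is read from the first endpoint only, and ETCDCTL_API=3 plus any TLS settings are assumed to come from the environment.

```shell
#!/bin/sh
# Sketch: compact once, defragment every member, then disarm the alarm.
# $1 is a comma-separated list of client endpoints.
reclaim_space() {
  endpoints="$1"
  first="${endpoints%%,*}"

  # Current revision, taken from the first endpoint's status JSON.
  rev="$(etcdctl --endpoints="$first" endpoint status -w json \
    | grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$')"
  if [ -z "$rev" ]; then
    echo "could not read the current revision" >&2
    return 1
  fi

  # Compaction goes through the cluster, so one call is enough.
  etcdctl --endpoints="$first" compact "$rev" || return 1

  # Defragment member by member so only one is busy at a time.
  for ep in $(echo "$endpoints" | tr ',' ' '); do
    etcdctl --endpoints="$ep" defrag || return 1
  done

  etcdctl --endpoints="$first" alarm disarm
}
```

Defragmenting one member at a time keeps the rest of the cluster responsive, since a member blocks reads and writes while it rebuilds its database file.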
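- Permanent fix
The permanent fix is to adjust the etcd quota settings. The relevant parameters are:
--auto-compaction-retention=1 Auto-compaction retention for the mvcc key-value store, in hours (here, keep one hour of history). 0 disables auto-compaction
--max-request-bytes=10485760 Maximum request size in bytes (10 MiB here). The etcd default is 1.5 MiB; etcd versions earlier than 3.2.10 do not support this parameter
--quota-backend-bytes=4294967296 Backend database size quota in bytes. The default is 2 GB; this raises it to 4 GB. If 4 GB is not enough it can be raised further, up to the suggested maximum of 8 GB
Restart the etcd service for the changes to take effect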
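The byte values for the quota are easy to mistype, so a small helper can derive the flag from a size in GiB. This is a sketch with a hypothetical function name; the 8 GiB threshold reflects the suggested maximum quoted above, not a hard limit.

```shell
#!/bin/sh
# Sketch: build the --quota-backend-bytes flag from a size in GiB.
SUGGESTED_MAX_GIB=8

quota_flag() {
  gib="$1"
  if [ "$gib" -gt "$SUGGESTED_MAX_GIB" ]; then
    # etcd warns at startup when the quota exceeds the suggested maximum.
    echo "warning: ${gib} GiB exceeds the suggested maximum of ${SUGGESTED_MAX_GIB} GiB" >&2
  fi
  echo "--quota-backend-bytes=$(( gib * 1024 * 1024 * 1024 ))"
}
```

`quota_flag 4` prints `--quota-backend-bytes=4294967296`, the value used in the parameter list above.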
4. etcd leader changes frequently
Typical log messages
Log example 1
waiting for ReadIndex response took too long
Log example 2
{"level":"warn","msg":"slow fdatasync","took":"2.14025047s","expected-duration":"1s"}
Log example 3
raft.node: 255a2e4092d561fb changed leader from 255a2e4092d561fb to 1de1eaa8fb268f49 at term 3112
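Leader changes of the kind shown in log example 3 can be pulled out of the etcd log to see how often they occur. The sketch below filters such lines from standard input and prints the term with the old and new leader IDs; the function name is hypothetical, and reading the log via `journalctl -u etcd.service` is an assumption about how etcd is run on the host.

```shell
#!/bin/sh
# Sketch: list leader changes found in etcd log text on stdin.
# Output format per change: "<term> <old-leader>-><new-leader>"
leader_changes() {
  grep -o 'changed leader from [0-9a-f]* to [0-9a-f]* at term [0-9]*' \
    | awk '{ print $9, $4 "->" $6 }'
}

# Typical use (ASSUMPTION: etcd runs as the systemd unit etcd.service):
#   journalctl -u etcd.service --since "1 hour ago" | leader_changes
```

Many entries with rapidly increasing terms within a short window point to an unstable leader rather than a one-off election.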
Run the following command several times to see whether the leader changes often
export ETCDCTL_API=3
etcdctl --endpoints='https://10.119.52.70:2379,https://10.119.52.71:2379,https://10.119.52.72:2379' endpoint status --cacert /etc/kubernetes/cluster1/ssl/ca.pem -w table
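Instead of rerunning the command by hand, the leader can be sampled in a loop and the number of changes counted. This is a rough sketch: the function names are hypothetical, TLS options are assumed to come from the environment (for example ETCDCTL_CACERT), and the parsing assumes that `-w simple` prints comma-separated fields with the member ID in column 2 and IS LEADER in column 5.

```shell
#!/bin/sh
# Sketch: sample the leader several times and count how often it changes.

# Print the member ID of the current leader.
# ASSUMPTION: "-w simple" output is ", "-separated, ID = $2, IS LEADER = $5.
current_leader() {
  etcdctl --endpoints="$1" endpoint status -w simple \
    | awk -F', ' '$5 == "true" { print $2; exit }'
}

# Read one leader ID per line on stdin; print the number of transitions.
count_changes() {
  awk 'NR > 1 && $0 != prev { n++ } { prev = $0 } END { print n + 0 }'
}

# poll_leader <endpoints> <samples> <interval-seconds>
poll_leader() {
  i=0
  while [ "$i" -lt "$2" ]; do
    current_leader "$1"
    i=$((i + 1))
    if [ "$i" -lt "$2" ]; then
      sleep "$3"
    fi
  done | count_changes
}
```

For example, `poll_leader "$ENDPOINTS" 30 10` samples the leader every 10 seconds for about five minutes; a stable cluster should report 0.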
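Common causes: network latency and slow disk I/O
Solution
As a temporary fix, lengthen the heartbeat and election timeouts to avoid frequent leader changes
--election-timeout=5000 (default 1000ms)
--heartbeat-interval=500 (default 100ms)
Edit /etc/systemd/system/etcd.service and add the parameters above (the heartbeat interval and election timeout must be identical on every member of the cluster)
# reload the systemd configuration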
sudo systemctl daemon-reload
# restart etcd
sudo systemctl restart etcd.service
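Before restarting with new timing values, it helps to sanity-check them, since etcd refuses to start with an inconsistent pair. The sketch below encodes two constraints as I understand them, both of which should be confirmed against the etcd version in use: the election timeout must be at least 5 times the heartbeat interval, and it must not exceed 50000 ms. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: validate a heartbeat/election pair and print the flags.
# ASSUMPTION: etcd requires election >= 5 * heartbeat and election <= 50000 ms.
# check_timeouts <heartbeat-ms> <election-ms>
check_timeouts() {
  heartbeat_ms="$1"
  election_ms="$2"
  if [ "$election_ms" -lt $(( heartbeat_ms * 5 )) ]; then
    echo "election timeout must be at least 5x the heartbeat interval" >&2
    return 1
  fi
  if [ "$election_ms" -gt 50000 ]; then
    echo "election timeout is above the assumed 50000 ms upper limit" >&2
    return 1
  fi
  echo "--heartbeat-interval=$heartbeat_ms --election-timeout=$election_ms"
}
```

The values used above pass the check: `check_timeouts 500 5000` prints both flags, while `check_timeouts 500 2000` is rejected.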
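A permanent fix means deploying etcd on dedicated machines, or checking the network and disks and moving etcd to an environment with better performance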