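Environment
System platform: N/A
Version: 4.5.8
Symptoms
Checking the etcd cluster returns the following. The output is from our own test environment. The commands require the v3 API environment variables, which can be set as described in the Solution section.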
# etcdctl endpoint health --user root:Highgo@123
http://192.168.56.12:2379 is unhealthy: failed to commit proposal: Active Alarm(s): NOSPACE
http://192.168.56.10:2379 is unhealthy: failed to commit proposal: Active Alarm(s): NOSPACE
http://192.168.56.11:2379 is unhealthy: failed to commit proposal: Active Alarm(s): NOSPACE
Error: unhealthy cluster
# etcdctl endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://192.168.56.10:2379 | 4dc3ceeac3266e5 | 3.5.9 | 209 MB | false | false | 95 | 37282 | 37282 | memberID:17186684763751740414 |
| | | | | | | | | | alarm:NOSPACE |
| http://192.168.56.11:2379 | d609d7b6f0361765 | 3.5.9 | 209 MB | false | false | 95 | 37282 | 37282 | memberID:17186684763751740414 |
| | | | | | | | | | alarm:NOSPACE |
| http://192.168.56.12:2379 | ee835ebbd20a73fe | 3.5.9 | 209 MB | true | false | 95 | 37282 | 37282 | memberID:17186684763751740414 |
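| | | | | | | | | | alarm:NOSPACE |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+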
-- Check the alarm information
# etcdctl alarm list
memberID:17186684763751740414 alarm:NOSPACE
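For monitoring, the alarm check above can be wrapped in a small helper. This is only a minimal sketch and not part of HGHAC: the function name has_nospace_alarm is hypothetical, and it only inspects text in the format that etcdctl alarm list printed above.

```shell
# Hypothetical helper: succeeds if the text on stdin contains a NOSPACE alarm
# in the format printed by `etcdctl alarm list`.
has_nospace_alarm() {
    grep -q 'alarm:NOSPACE'
}

# Example usage (needs a reachable cluster and the etcdctl environment variables):
#   etcdctl alarm list | has_nospace_alarm && echo "NOSPACE alarm is active"
```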
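Cause
In the etcd configuration file etcd.yml, quota-backend-bytes sets the storage quota, which can be understood as the maximum etcd database size. The default limit is 2 GB and the recommended maximum is 8 GB. When writes exhaust the storage quota, etcd raises a cluster-wide alarm that puts the cluster into maintenance mode. In maintenance mode the cluster accepts only key reads and deletes and rejects writes.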
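Because quota-backend-bytes is given in bytes, it can help to do the unit conversion explicitly. The sketch below assumes binary units (1 GB = 1024 * 1024 * 1024 bytes), which matches the value 8589934592 used for 8 GB in Method 3. The function names are hypothetical.

```shell
# Convert a size in GiB to bytes, the unit used by quota-backend-bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Integer percentage of the quota used by a database of the given size.
# Arguments: database size in bytes, quota in bytes.
quota_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

gib_to_bytes 8    # prints 8589934592
```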
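Solution
All command output below is from our own test environment, where the environment variables are already set. If they are not set in the customer environment, set them according to the actual environment, as follows: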
export ETCDCTL_ENDPOINTS=http://192.168.56.10:2379,http://192.168.56.11:2379,http://192.168.56.12:2379
export ETCDCTL_API=3
export PATH=$PATH:/usr/local/hghac/etcd:/usr/local/hghac/hac/hghactl
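The etcd commands here use the v3 API. If authenticated access is not enabled, omit the --user parameter. If it is enabled, the credentials can be found in hghac.yml.
Check etcd.yml and confirm whether quota-backend-bytes (storage quota), auto-compaction-retention, and auto-compaction-mode are set. If they are not, set all three and raise the storage quota (quota-backend-bytes) to 8 GB. If the storage quota is already 8 GB, use Method 1 or Method 2. If it has not reached 8 GB, use Method 3 to adjust the parameters. Methods 1 and 2 may affect HGHAC, so it is recommended to shut down HGHAC before proceeding, which requires stopping the database and the business workload.
The locations of hghac.yml and etcd.yml mentioned above can be found with the following commands: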
# systemctl status etcd
● etcd.service - Etcd
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2025-02-06 14:47:39 CST; 1h 5min ago
Main PID: 5042 (etcd)
Tasks: 9
Memory: 172.2M
CGroup: /system.slice/etcd.service
└─5042 /usr/local/hghac/etcd/etcd --config-file=/usr/local/hghac/etcd/etcd.yml
......
# systemctl status hghac
● hghac.service - hghac
Loaded: loaded (/etc/systemd/system/hghac.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2025-02-06 10:48:17 CST; 5h 5min ago
Main PID: 1233 (hghac)
Tasks: 19
Memory: 32.8M
CGroup: /system.slice/hghac.service
├─1233 /usr/local/hghac/hac/hghac/hghac /usr/local/hghac/hac/hghac.yml
......
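The status output shows the file locations. If it does not, look in the service unit files hghac.service and etcd.service. Three methods follow. After completing any of them, insert a key-value pair to confirm that etcd works normally.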
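The lookup and the choice of method described above can be sketched in shell. This is an illustration under assumptions, not an HGHAC tool: the function names are hypothetical, the parsing assumes the config path appears as --config-file=... in the process command line (as in the systemctl output above), and quota-backend-bytes is assumed to be written as a plain integer in etcd.yml.

```shell
# Extract the value of --config-file= from a command line read on stdin.
config_path_from_cmdline() {
    sed -n 's/.*--config-file=\([^ ]*\).*/\1/p'
}

# Print the quota-backend-bytes value from the given etcd.yml-style file.
# Prints nothing if the parameter is not set.
quota_from_config() {
    sed -n 's/^[[:space:]]*quota-backend-bytes:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Suggest a method from the configured quota in bytes (empty or missing = unset).
# 8 GB or more: Method 1 or 2; otherwise Method 3 (adjust etcd.yml).
suggest_method() {
    quota=${1:-0}
    limit=$(( 8 * 1024 * 1024 * 1024 ))
    if [ "$quota" -ge "$limit" ]; then
        echo "method 1 or 2"
    else
        echo "method 3"
    fi
}

# Example usage:
#   suggest_method "$(quota_from_config /usr/local/hghac/etcd/etcd.yml)"
```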
Method 1: Compact old data and defragment
Check the etcd database size:
# etcdctl endpoint status -w table --user root:Highgo@123
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://192.168.56.10:2379 | 4dc3ceeac3266e5 | 3.5.9 | 211 MB | false | false | 96 | 38756 | 38756 | memberID:350221866816923365 |
| | | | | | | | | | alarm:NOSPACE |
| http://192.168.56.11:2379 | d609d7b6f0361765 | 3.5.9 | 204 MB | true | false | 96 | 38757 | 38757 | memberID:350221866816923365 |
| | | | | | | | | | alarm:NOSPACE |
| http://192.168.56.12:2379 | ee835ebbd20a73fe | 3.5.9 | 207 MB | false | false | 96 | 38758 | 38758 | memberID:350221866816923365 |
| | | | | | | | | | alarm:NOSPACE |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
Get the current revision:
# etcdctl endpoint status --write-out="json" --user root:Highgo@123 | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
4011
4011
4011
Compact away all old revisions:
# etcdctl compact 4011 --user root:Highgo@123
Defragment to release the freed space:
# etcdctl --command-timeout=30s defrag --user root:Highgo@123
command-timeout defaults to 5s. If the command times out with the error context deadline exceeded, increase the timeout.
Clear the alarm:
# etcdctl alarm disarm
When finished, check the etcd database size again to confirm that the cleanup succeeded.
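The Method 1 steps can also be chained so that each step runs only if the previous one succeeded. The following is a sketch under assumptions rather than a supported script: the function names are hypothetical, the etcdctl environment variables are assumed to be set, and the smallest reported revision is used as a conservative choice in case members report different values (in the output above all three were identical).

```shell
# Print the smallest revision found in `etcdctl endpoint status --write-out=json`
# output read from stdin.
min_revision() {
    grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | sort -n | head -n 1
}

# Run compact, defrag, and alarm disarm in order, stopping at the first failure.
# Extra arguments (for example: --user name:password) are passed to every etcdctl call.
compact_defrag_disarm() {
    rev=$(etcdctl endpoint status --write-out=json "$@" | min_revision)
    [ -n "$rev" ] || { echo "could not determine revision" >&2; return 1; }
    etcdctl compact "$rev" "$@" &&
    etcdctl --command-timeout=30s defrag "$@" &&
    etcdctl alarm disarm "$@"
}

# Example usage:
#   compact_defrag_disarm --user root:Highgo@123
```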
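Method 2: Rebuild the etcd cluster
See support 017213101.
Method 3: Modify the etcd.yml parameters
Edit etcd.yml: modify the parameters that already exist and add any that are missing. The parameters are: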
vi /usr/local/hghac/etcd/etcd.yml
quota-backend-bytes: 8589934592
auto-compaction-retention: '24h'
auto-compaction-mode: 'periodic'
After making the change, restart etcd on each node in turn:
systemctl restart etcd
Check that etcd works normally:
# etcdctl put newkey 123 --user root:Highgo@123
OK
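The put check above leaves a test key in the store. A variant that writes a key, reads it back, and deletes it again could look like the sketch below. The function name and key name are hypothetical, and the etcdctl environment variables are assumed to be set.

```shell
# Write a temporary key, read it back, and delete it again.
# Succeeds only if all three operations work and the value matches.
# Extra arguments (for example: --user name:password) are passed to every etcdctl call.
etcd_write_check() {
    key="hghac-write-check-$$"    # hypothetical key name
    etcdctl put "$key" ok "$@" >/dev/null || return 1
    value=$(etcdctl get "$key" --print-value-only "$@") || return 1
    etcdctl del "$key" "$@" >/dev/null || return 1
    [ "$value" = "ok" ]
}

# Example usage:
#   etcd_write_check --user root:Highgo@123 && echo "etcd accepts writes"
```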
If a node cannot be restarted, it can be re-added using the etcd member remove and add procedure, as follows.
Query the member information:
# etcdctl member list --user root:Highgo@123
4dc3ceeac3266e5, started, etcd_01, http://192.168.56.10:2380, http://192.168.56.10:2379, false
d609d7b6f0361765, started, etcd_02, http://192.168.56.11:2380, http://192.168.56.11:2379, false
ee835ebbd20a73fe, started, etcd_03, http://192.168.56.12:2380, http://192.168.56.12:2379, false
Remove the member that cannot be restarted:
etcdctl member remove 4dc3ceeac3266e5 --user root:Highgo@123
Add the member back:
etcdctl member add etcd_01 --peer-urls=http://192.168.56.10:2380 --user root:Highgo@123
Set the initial-cluster-state parameter in etcd.yml:
initial-cluster-state: existing
Delete the data directory given by the data-dir parameter in etcd.yml (data-dir: /usr/local/hghac/etcd/etcd01):
rm -rf /usr/local/hghac/etcd/etcd01
Then start etcd:
systemctl start etcd
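To avoid copying the wrong ID by hand when removing a member, the ID can be looked up by member name. This is a minimal sketch that assumes the comma-separated format of the etcdctl member list output shown above. The function name is hypothetical.

```shell
# Print the member ID for the given member name.
# Reads `etcdctl member list` output (simple format) from stdin.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Example usage:
#   id=$(etcdctl member list --user root:Highgo@123 | member_id_by_name etcd_01)
#   etcdctl member remove "$id" --user root:Highgo@123
```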
Notes
To simulate etcd running out of space, run the following on any one node:
while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | ETCDCTL_API=3 etcdctl put key --user root:Highgo@123 || break; done