etcd Out of Space (v3 API)

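Environment

Platform: N/A

Version: 4.5.8

Symptoms

Checking the etcd cluster returns the following. (The output is from a test environment. The commands require the v3 API environment variables, which can be set as described in the Solution section.)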

# etcdctl endpoint health --user root:Highgo@123
http://192.168.56.12:2379 is unhealthy: failed to commit proposal: Active Alarm(s): NOSPACE 
http://192.168.56.10:2379 is unhealthy: failed to commit proposal: Active Alarm(s): NOSPACE 
http://192.168.56.11:2379 is unhealthy: failed to commit proposal: Active Alarm(s): NOSPACE 
Error: unhealthy cluster
# etcdctl endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://192.168.56.10:2379 |  4dc3ceeac3266e5 |   3.5.9 |  209 MB |     false |      false |        95 |      37282 |              37282 |  memberID:17186684763751740414 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://192.168.56.11:2379 | d609d7b6f0361765 |   3.5.9 |  209 MB |     false |      false |        95 |      37282 |              37282 |  memberID:17186684763751740414 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://192.168.56.12:2379 | ee835ebbd20a73fe |   3.5.9 |  209 MB |      true |      false |        95 |      37282 |              37282 |  memberID:17186684763751740414 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
-- Check the alarm information
# etcdctl alarm list
memberID:17186684763751740414 alarm:NOSPACE 

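Cause

The quota-backend-bytes parameter in the etcd configuration file etcd.yml sets the storage quota, which can be understood as the maximum size of the etcd database. The default limit is 2 GB, and the recommended maximum is 8 GB. When writes exhaust the quota, etcd raises a cluster-wide NOSPACE alarm that puts the cluster into maintenance mode. In maintenance mode only key reads and deletes are accepted; writes are rejected.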
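
As a small worked example of the quota arithmetic, the sketch below converts a quota given in GiB into the byte value that quota-backend-bytes expects, and computes how full a database is relative to a quota. The function names gib_to_bytes and quota_usage_percent are hypothetical helpers, and the sketch assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Hypothetical helpers; not part of etcd or HGHAC.

# Convert a quota in GiB to the byte value used by quota-backend-bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Integer percentage of a quota consumed by a database.
# $1 = database size in bytes, $2 = quota in bytes.
quota_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}
```

For instance, gib_to_bytes 8 yields 8589934592, the value used for the 8 GB quota.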

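Solution

The command output below is from a test environment in which the environment variables are already set. If they are not set in the customer environment, set them according to the actual environment, for example: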

export ETCDCTL_ENDPOINTS=http://192.168.56.10:2379,http://192.168.56.11:2379,http://192.168.56.12:2379
export ETCDCTL_API=3
export PATH=$PATH:/usr/local/hghac/etcd:/usr/local/hghac/hac/hghactl

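The etcdctl commands use the v3 API. If authentication is not enabled, omit the --user parameter; if it is enabled, the credentials can be found in hghac.yml.

Check etcd.yml to confirm whether quota-backend-bytes (the storage quota), auto-compaction-retention, and auto-compaction-mode are set. If they are not, set all three and raise quota-backend-bytes to 8 GB. If the quota is already 8 GB, use Method 1 or Method 2; if it is below 8 GB, use Method 3 to adjust the parameters. Methods 1 and 2 may affect HGHAC, so it is recommended to shut down HGHAC before proceeding, which requires stopping the database and the application workload.

The locations of hghac.yml and etcd.yml can be found with the following commands: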

# systemctl status etcd
● etcd.service - Etcd
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2025-02-06 14:47:39 CST; 1h 5min ago
 Main PID: 5042 (etcd)
    Tasks: 9
   Memory: 172.2M
   CGroup: /system.slice/etcd.service
           └─5042 /usr/local/hghac/etcd/etcd --config-file=/usr/local/hghac/etcd/etcd.yml
......
# systemctl status hghac
● hghac.service - hghac
   Loaded: loaded (/etc/systemd/system/hghac.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2025-02-06 10:48:17 CST; 5h 5min ago
 Main PID: 1233 (hghac)
    Tasks: 19
   Memory: 32.8M
   CGroup: /system.slice/hghac.service
           ├─1233 /usr/local/hghac/hac/hghac/hghac /usr/local/hghac/hac/hghac.yml
......

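The status output shows the file locations. If it does not, check the service unit files hghac.service and etcd.service. Three methods are described below. After completing any of them, write a key to confirm that etcd is working normally.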

Method 1: Compact old data and defragment

Check the etcd database size:

# etcdctl endpoint status -w table  --user root:Highgo@123
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| http://192.168.56.10:2379 |  4dc3ceeac3266e5 |   3.5.9 |  211 MB |     false |      false |        96 |      38756 |              38756 |    memberID:350221866816923365 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://192.168.56.11:2379 | d609d7b6f0361765 |   3.5.9 |  204 MB |      true |      false |        96 |      38757 |              38757 |    memberID:350221866816923365 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| http://192.168.56.12:2379 | ee835ebbd20a73fe |   3.5.9 |  207 MB |     false |      false |        96 |      38758 |              38758 |    memberID:350221866816923365 |
|                           |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+

Get the current revision:

# etcdctl  endpoint status --write-out="json" --user root:Highgo@123 | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*'
4011
4011
4011
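
The pipeline above prints one revision per endpoint, while compact takes a single number. The following is a minimal sketch, assuming the JSON output contains one "revision" field per endpoint as the pipeline above implies. It picks the lowest reported value, so the chosen revision should not lie in the future for any member. lowest_revision is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `etcdctl endpoint status --write-out=json`
# output on stdin and print the lowest revision found.
lowest_revision() {
    grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | sort -n | head -n 1
}

# Possible usage (requires a reachable cluster):
#   rev=$(etcdctl endpoint status --write-out=json | lowest_revision)
```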

Compact away all old revisions:

# etcdctl  compact 4011 --user root:Highgo@123

Defragment to release the freed space:

# etcdctl --command-timeout=30s defrag  --user root:Highgo@123

The default command-timeout is 5s. If the command times out with the error context deadline exceeded, increase the timeout.

Disarm the alarm:

# etcdctl  alarm disarm

When this is done, check the etcd database size again to confirm that the cleanup succeeded.
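
The Method 1 steps can be chained so that a failure in one step stops the rest. This is only a sketch under stated assumptions: reclaim_etcd_space is a hypothetical wrapper name, the 30s timeout is copied from the defrag command above, and the ETCDCTL_* environment variables are assumed to be set already.

```shell
#!/bin/sh
# Hypothetical wrapper around the Method 1 commands.
# $1 = revision to compact to; any remaining arguments
# (for example --user root:PASSWORD) are passed to every etcdctl call.
reclaim_etcd_space() {
    rev=$1
    shift
    etcdctl compact "$rev" "$@" || return 1                 # drop old revisions
    etcdctl --command-timeout=30s defrag "$@" || return 1   # release freed space
    etcdctl alarm disarm "$@" || return 1                   # clear the NOSPACE alarm
    etcdctl alarm list "$@"                                 # should print nothing now
}
```

With the revision from the earlier step, this could be called as reclaim_etcd_space 4011 --user root:Highgo@123.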

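Method 2: Rebuild the etcd cluster

See support article 017213101.

Method 3: Modify the etcd.yml parameters

Edit etcd.yml: change the parameters that already exist and add those that are missing. The parameters are: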

vi /usr/local/hghac/etcd/etcd.yml
quota-backend-bytes: 8589934592
auto-compaction-retention: '24h' 
auto-compaction-mode: 'periodic'
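
Before editing, it may help to see which of the three parameters are absent and what quota is currently configured. The sketch below assumes etcd.yml uses flat "key: value" lines as in the listing above. missing_etcd_params and configured_quota are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting an etcd.yml file.

# Print each of the three parameters that has no uncommented entry in $1.
missing_etcd_params() {
    conf=$1
    for key in quota-backend-bytes auto-compaction-retention auto-compaction-mode; do
        grep -Eq "^[[:space:]]*${key}[[:space:]]*:" "$conf" || echo "$key"
    done
}

# Print the configured quota in bytes, or nothing if it is not set in $1.
configured_quota() {
    sed -n -E "s/^[[:space:]]*quota-backend-bytes[[:space:]]*:[[:space:]]*'?([0-9]+)'?.*/\1/p" "$1" | head -n 1
}

# Possible usage:
#   missing_etcd_params /usr/local/hghac/etcd/etcd.yml
#   configured_quota /usr/local/hghac/etcd/etcd.yml
```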

After making the changes, restart etcd on each node in turn:

systemctl restart etcd
Check that etcd is working normally:
# etcdctl put newkey 123  --user root:Highgo@123
OK
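
The put check above leaves newkey in the store. A slightly fuller sketch writes a probe key, reads it back, and deletes it again. etcd_write_check and the probe key name are hypothetical, and the ETCDCTL_* environment variables are assumed to be set.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 only if a value can be written and read back.
# Any arguments (for example --user root:PASSWORD) are passed to every etcdctl call.
etcd_write_check() {
    key="hghac_probe_$$"     # hypothetical probe key, unique per shell process
    value="ok"
    etcdctl put "$key" "$value" "$@" >/dev/null || return 1
    got=$(etcdctl get "$key" --print-value-only "$@") || return 1
    etcdctl del "$key" "$@" >/dev/null   # clean up the probe key
    [ "$got" = "$value" ]
}

# Possible usage:
#   etcd_write_check --user root:Highgo@123 && echo "etcd accepts writes"
```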

If a node cannot be restarted, remove it from the cluster and add it back, as follows.

Query the member information:

# etcdctl member list  --user root:Highgo@123
4dc3ceeac3266e5, started, etcd_01, http://192.168.56.10:2380, http://192.168.56.10:2379, false
d609d7b6f0361765, started, etcd_02, http://192.168.56.11:2380, http://192.168.56.11:2379, false
ee835ebbd20a73fe, started, etcd_03, http://192.168.56.12:2380, http://192.168.56.12:2379, false
Remove the member that cannot be restarted:
etcdctl member remove 4dc3ceeac3266e5  --user root:Highgo@123
Add the member back:
etcdctl member add etcd_01 --peer-urls=http://192.168.56.10:2380  --user root:Highgo@123
On that node, set the initial-cluster-state parameter in etcd.yml:
initial-cluster-state: existing
Delete the data directory given by the data-dir parameter in etcd.yml (here data-dir: /usr/local/hghac/etcd/etcd01):
rm -rf  /usr/local/hghac/etcd/etcd01 
Then start etcd:
systemctl start etcd
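
The member ID passed to member remove can be looked up by name instead of being copied by hand. This is a minimal sketch, assuming the comma-separated member list format shown above. member_id_by_name is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `etcdctl member list` output on stdin and
# print the ID (first field) of the member whose name (third field) is $1.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Possible usage:
#   id=$(etcdctl member list --user root:Highgo@123 | member_id_by_name etcd_01)
```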

Notes

To simulate etcd running out of space, run the following on any node:

while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024  | ETCDCTL_API=3 etcdctl put key --user  root:Highgo@123 || break; done