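How to Rebuild an Affected Node in OceanBase After Data Files Are Accidentally Deleted

When data files are deleted by mistake and no replacement node is available, there is no need to rush into drastic measures in OceanBase. Consider using the server_permanent_offline_time parameter to rebuild the affected node.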

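How it works:

server_permanent_offline_time is a key OceanBase parameter that controls how long a node must be unreachable before it is considered permanently offline. When a node in the cluster fails or goes down, the system uses this threshold to decide which handling procedure to trigger.

If the node has been down for less than the configured value, the system takes no action for the time being, which avoids frequent data migration. If the downtime exceeds the configured value, the node is marked as permanently offline: RootService removes the replicas hosted on that OBServer from their Paxos member groups and replenishes them on other available OBServers in the same zone, so that each Paxos member group stays complete. The default value is 3600 seconds; it is usually set fairly large to avoid unnecessary replica copying. In addition, when a permanently offline node is brought back up, all of its data must be pulled again from the other replicas.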
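The threshold rule described above can be sketched as a small shell function. This only illustrates the decision; it is not RootService code. The function name classify_outage and its arguments are made up for the example, and treating a downtime exactly equal to the threshold as permanent is an assumption.

```shell
# Hypothetical helper: classify an outage the way the text describes.
# $1 = seconds the node has been down
# $2 = server_permanent_offline_time, in seconds
classify_outage() {
    downtime=$1
    threshold=$2
    if [ "$downtime" -lt "$threshold" ]; then
        echo "temporary: no action yet, replicas stay in the Paxos member group"
    else
        # Assumption: downtime equal to the threshold counts as exceeded.
        echo "permanent: replicas are removed from the member group and replenished elsewhere"
    fi
}

classify_outage 120 3600    # down for 2 minutes with the default threshold
classify_outage 120 60      # the same outage after lowering the threshold to 60s
```

The two calls show why lowering the parameter matters here: the same two-minute outage is ignored under the default value but is treated as permanent once the threshold is 60 seconds.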

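In this scenario, we lower the parameter so that the failed node quickly goes permanently offline and is then brought back online, which rebuilds its data.

Note that this process consumes some cluster resources and may affect performance, so it is best done during off-peak hours.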

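Official recommendations

The official documentation lists the following scenarios and recommended values for server_permanent_offline_time:

  1. OceanBase database version upgrade: set the parameter to 72h.

  2. OBServer hardware replacement: set the parameter to 4h.

  3. OBServer wiped and brought back online: set the parameter to 10m so that the cluster recovers quickly.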
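To compare these recommendations with the 3600-second default, it can help to convert them to seconds. The helper below is a minimal sketch; the name to_seconds is hypothetical, and it only understands the s, m and h suffixes that appear in this article.

```shell
# Convert a duration such as 72h, 4h, 10m or 60s into seconds.
# Only the s/m/h suffixes used in this article are handled (a limit of this sketch).
to_seconds() {
    value=$1
    number=${value%[smh]}      # strip one trailing unit letter, if present
    unit=${value#"$number"}    # whatever was stripped is the unit
    case $number in
        ''|*[!0-9]*) echo "unsupported duration: $value" >&2; return 1 ;;
    esac
    case $unit in
        h) echo $((number * 3600)) ;;
        m) echo $((number * 60)) ;;
        s) echo "$number" ;;
        *) echo "missing unit in: $value" >&2; return 1 ;;
    esac
}

for setting in 72h 4h 10m 3600s 60s; do
    printf '%-6s = %s seconds\n' "$setting" "$(to_seconds "$setting")"
done
```

The loop prints one line per value, which makes it easy to see that the 60s used later in this article is well below every recommended setting and is meant only as a temporary value.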

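Preparation

Set up an environment

Use the OBD tool to quickly deploy a 3-node OceanBase cluster plus one OBProxy, then create a tenant named sysbench_tenant with primary_zone set to RANDOM.

Note: this article is based on OceanBase 3.1.2. Verify the procedure separately for other versions.

Component   Version   IP
oceanbase   3.1.2     10.186.64.74
oceanbase   3.1.2     10.186.64.75
oceanbase   3.1.2     10.186.64.79
OBProxy     3.2.3     10.186.60.3

Prepare some data

Use sysbench to create a table named sbtest1 and insert 10,000 rows.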

sysbench ./oltp_insert.lua --mysql-host=10.186.60.3 --mysql-port=2883 \
  --mysql-db=sysbenchdb --mysql-user="sysbench@sysbench_tenant" --mysql-password=sysbench \
  --tables=1 --table_size=10000 --threads=1 --time=600 --report-interval=10 \
  --db-driver=mysql --db-ps-mode=disable --skip-trx=on \
  --mysql-ignore-errors=6002,6004,4012,2013,4016,1062,5157,4038 prepare

The sysbench CREATE TABLE statement was modified here so that the table has 3 partitions. Querying the partition replica distribution of sbtest1 gives:

MySQL [oceanbase]> select tenant.tenant_name, zone, svr_ip, svr_port,
    case when role=1 then 'leader' when role=2 then 'follower' else NULL end as role,
    count(1) as partition_cnt
  from __all_virtual_meta_table meta
  inner join __all_tenant tenant on meta.tenant_id=tenant.tenant_id
  inner join __all_virtual_table tab on meta.tenant_id=tab.tenant_id and meta.table_id=tab.table_id
  where tenant.tenant_id=1001 and tab.table_name='sbtest1'
  group by tenant.tenant_name, zone, svr_ip, svr_port, 5
  order by tenant.tenant_name, zone, svr_ip, role desc;
+-----------------+-------+--------------+----------+----------+---------------+
| tenant_name     | zone  | svr_ip       | svr_port | role     | partition_cnt |
+-----------------+-------+--------------+----------+----------+---------------+
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | leader   |             1 |
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | follower |             2 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | leader   |             1 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | follower |             2 |
| sysbench_tenant | zone3 | 10.186.64.79 |     2882 | leader   |             1 |
| sysbench_tenant | zone3 | 10.186.64.79 |     2882 | follower |             2 |
+-----------------+-------+--------------+----------+----------+---------------+

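Running the experiment

Use sysbench to keep writing data and maintain some traffic, so that after the node is rebuilt we can check whether the data on all nodes is consistent.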

sysbench ./oltp_insert.lua --mysql-host=10.186.60.3 --mysql-port=2883 \
  --mysql-db=sysbenchdb --mysql-user="sysbench@sysbench_tenant" --mysql-password=sysbench \
  --tables=1 --table_size=10000 --threads=1 --time=300 --report-interval=10 \
  --db-driver=mysql --db-ps-mode=disable --skip-trx=on \
  --mysql-ignore-errors=6002,6004,4012,2013,4016,1062,5157,4038 run

Delete a node's data file

Pick node 10.186.64.79 in zone3 and delete its data file.

[root@localhost data]# rm -rf 1/sstable/block_file
[root@localhost data]# cd 1/sstable/
[root@localhost sstable]# ll
total 0
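
Take the failed node permanently offline

1. Lower server_permanent_offline_time to shorten the time before the node is declared permanently offline

The default value of server_permanent_offline_time is 3600s.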

MySQL [oceanbase]> alter system set server_permanent_offline_time='60s';
Query OK, 0 rows affected (0.030 sec)

MySQL [oceanbase]> SHOW PARAMETERS LIKE "%server_permanent_offline_time%";
+-------+----------+--------------+----------+-------------------------------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------------------+--------------+---------+---------+-------------------+
| zone  | svr_type | svr_ip       | svr_port | name                          | data_type | value | info                                                                                                                              | section      | scope   | source  | edit_level        |
+-------+----------+--------------+----------+-------------------------------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------------------+--------------+---------+---------+-------------------+
| zone3 | observer | 10.186.64.79 |     2882 | server_permanent_offline_time | NULL      | 60s   | the time interval between any two heartbeats beyond which a server is considered to be \'permanently\' offline. Range: [20s,+∞)   | ROOT_SERVICE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE |
| zone1 | observer | 10.186.64.74 |     2882 | server_permanent_offline_time | NULL      | 60s   | the time interval between any two heartbeats beyond which a server is considered to be \'permanently\' offline. Range: [20s,+∞)   | ROOT_SERVICE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE |
| zone2 | observer | 10.186.64.75 |     2882 | server_permanent_offline_time | NULL      | 60s   | the time interval between any two heartbeats beyond which a server is considered to be \'permanently\' offline. Range: [20s,+∞)   | ROOT_SERVICE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE |
+-------+----------+--------------+----------+-------------------------------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------------------+--------------+---------+---------+-------------------+
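Checking a wide table like this by eye is error-prone, so the comparison can be scripted. The sketch below reads the table text on standard input and succeeds only if every server row shows the expected value. The function name all_nodes_report is hypothetical, the sample rows are abbreviated (the info column is shortened), and the script assumes the column order shown above, where value is the eighth field when a row is split on "|".

```shell
# Read SHOW PARAMETERS table output on stdin and succeed only if every
# server row reports the expected value.
# Assumes the column order shown in this article (field 8 = value, field 4 = svr_ip).
all_nodes_report() {
    expected=$1
    awk -F'|' -v want="$expected" '
        /^[|] *zone / { next }                 # skip the header row
        /^[|]/ {
            gsub(/ /, "", $8); gsub(/ /, "", $4)
            rows++
            if ($8 != want) { printf "%s reports %s\n", $4, $8; bad++ }
        }
        END { if (rows == 0 || bad > 0) exit 1 }
    '
}

# Abbreviated sample input; the info column is shortened for readability.
sample='| zone  | svr_type | svr_ip       | svr_port | name                          | data_type | value | info      |
| zone3 | observer | 10.186.64.79 |     2882 | server_permanent_offline_time | NULL      | 60s   | (omitted) |
| zone1 | observer | 10.186.64.74 |     2882 | server_permanent_offline_time | NULL      | 60s   | (omitted) |
| zone2 | observer | 10.186.64.75 |     2882 | server_permanent_offline_time | NULL      | 60s   | (omitted) |'

if printf '%s\n' "$sample" | all_nodes_report 60s; then
    echo "all nodes report 60s"
fi
```

Against a live cluster the same function could read the client output through a pipe, provided the client is asked to print the bordered table format.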
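
2. Stop the failed node from serving requests

Before killing the observer process, it is advisable to run ISOLATE SERVER or STOP SERVER against the node. This stops requests from being routed to it and moves replica leaders elsewhere. Re-enable traffic once the node has been rebuilt.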

# Stop service on node 79
MySQL [oceanbase]> ALTER SYSTEM STOP SERVER '10.186.64.79:2882' ZONE='zone3';

# Or isolate it instead
ALTER SYSTEM ISOLATE SERVER '10.186.64.79:2882' ZONE='zone3';
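
3. Kill the observer process

Run kill -9 $observer_pid and wait for the duration of server_permanent_offline_time, after which the OBServer enters the "permanently offline" state. To check whether it has gone permanently offline, query the __all_rootservice_event_history table. If there is an event record named "permanent_offline" whose time and IP both match, the OBServer can be considered permanently offline.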

MySQL [oceanbase]> select * from __all_rootservice_event_history where event='permanent_offline' ;
+----------------------------+--------+-------------------+--------+---------------------+-------+--------+-------+--------+-------+--------+-------+--------+-------+--------+------------+--------------+-------------+
| gmt_create                 | module | event             | name1  | value1              | name2 | value2 | name3 | value3 | name4 | value4 | name5 | value5 | name6 | value6 | extra_info | rs_svr_ip    | rs_svr_port |
+----------------------------+--------+-------------------+--------+---------------------+-------+--------+-------+--------+-------+--------+-------+--------+-------+--------+------------+--------------+-------------+
| 2023-03-29 17:34:09.596035 | server | permanent_offline | server | "10.186.64.79:2882" |       |        |       |        |       |        |       |        |       |        |            | 10.186.64.74 |        2882 |
+----------------------------+--------+-------------------+--------+---------------------+-------+--------+-------+--------+-------+--------+-------+--------+-------+--------+------------+--------------+-------------+
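Instead of re-running the query by hand, the wait can be wrapped in a polling loop. The sketch below is one possible shape, not a tool shipped with OceanBase: wait_until and has_offline_event are hypothetical names, OB_HOST, OB_PORT and OB_USER are placeholder variables, and the -N and -e client options are assumed to behave as they do in the mysql client.

```shell
# Hypothetical polling helper: run a check command every INTERVAL seconds
# until it succeeds or TIMEOUT seconds have passed.
# usage: wait_until TIMEOUT INTERVAL command [args...]
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Assumed check: succeeds once the event table holds a permanent_offline row
# for the given server. Connection settings are placeholders, not values from
# this article; -N and -e are assumed to work as in the mysql client.
has_offline_event() {
    obclient -h"$OB_HOST" -P"$OB_PORT" -u"$OB_USER" -Doceanbase -N -e \
        "select count(*) from __all_rootservice_event_history
         where event='permanent_offline' and value1 like '%$1%'" | grep -qv '^0$'
}

# Example call (not executed here): poll every 5 seconds for up to 2 minutes.
#   wait_until 120 5 has_offline_event 10.186.64.79:2882
```

Keeping the timeout separate from the check means the same loop can be reused for later waits in this procedure, such as waiting for the rebuilt node to catch up.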
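
The partition replica distribution below no longer lists any replicas on node 79, which further confirms that the node is permanently offline.

One follower replica on node 75 in zone2 has been promoted to leader, so the cluster can still serve requests.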

MySQL [oceanbase]> select tenant.tenant_name, zone, svr_ip, svr_port,
    case when role=1 then 'leader' when role=2 then 'follower' else NULL end as role,
    count(1) as partition_cnt
  from __all_virtual_meta_table meta
  inner join __all_tenant tenant on meta.tenant_id=tenant.tenant_id
  inner join __all_virtual_table tab on meta.tenant_id=tab.tenant_id and meta.table_id=tab.table_id
  where tenant.tenant_id=1001 and tab.table_name='sbtest1'
  group by tenant.tenant_name, zone, svr_ip, svr_port, 5
  order by tenant.tenant_name, zone, svr_ip, role desc;
+-----------------+-------+--------------+----------+----------+---------------+
| tenant_name     | zone  | svr_ip       | svr_port | role     | partition_cnt |
+-----------------+-------+--------------+----------+----------+---------------+
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | leader   |             1 |
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | follower |             2 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | leader   |             2 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | follower |             1 |
+-----------------+-------+--------------+----------+----------+---------------+
4 rows in set (0.005 sec)
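
Bring the failed node back up to trigger the automatic rebuild

1. Start the observer process on node 79. The rebuild is triggered automatically once the process is up.

Note: to keep the observer from failing to start or hitting other problems, it is advisable to clear both the data file and the transaction logs before starting it.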

[root@localhost data]# rm -rf log1/clog/*
[root@localhost data]# rm -rf log1/ilog/*
[root@localhost data]# rm -rf log1/slog/*
[root@localhost data]# rm -rf 1/sstable/block_file
[root@localhost data]# cd 1/sstable/
[root@localhost sstable]# ll
total 0
[root@localhost sstable]# su admin
bash-4.2$ cd /home/admin/ && ./bin/observer
./bin/observer
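Because these rm -rf commands are destructive, it may be worth wrapping them in a helper that prints what it would delete before doing anything. The sketch below uses the relative paths from this article; the function name clean_observer_dirs, the --apply switch and the sanity check on the directory layout are assumptions added for illustration. It does not check whether an observer process is still running.

```shell
# Print (dry run) or perform the cleanup of an observer data directory.
# The relative paths are the ones used in this article; the refusal rule is an
# extra safety check assumed for this sketch.
# usage: clean_observer_dirs DATA_DIR [--apply]
clean_observer_dirs() {
    data_dir=$1
    mode=${2:-}
    if [ -z "$data_dir" ] || [ ! -d "$data_dir/1/sstable" ]; then
        echo "refusing: '$data_dir' does not look like an observer data directory" >&2
        return 1
    fi
    for target in "$data_dir"/log1/clog/* "$data_dir"/log1/ilog/* \
                  "$data_dir"/log1/slog/* "$data_dir"/1/sstable/block_file; do
        [ -e "$target" ] || continue      # skip unmatched globs and missing files
        if [ "$mode" = "--apply" ]; then
            rm -rf -- "$target"
        else
            echo "would remove: $target"
        fi
    done
}

# Demonstration on a scratch directory, so nothing real is touched.
demo=$(mktemp -d)
mkdir -p "$demo/1/sstable" "$demo/log1/clog" "$demo/log1/ilog" "$demo/log1/slog"
touch "$demo/1/sstable/block_file" "$demo/log1/clog/1"
clean_observer_dirs "$demo"            # dry run: only prints
clean_observer_dirs "$demo" --apply    # deletes inside the scratch directory
rm -rf "$demo"
```

Defaulting to a dry run makes the destructive step an explicit choice, which fits a situation that started with an accidental deletion.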

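After the process starts, confirm that the OBServer heartbeat has recovered and its status is active, then watch the partitions being replenished.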

MySQL [oceanbase]> select svr_ip,zone,with_rootserver,status,stop_time,start_service_time,build_version from __all_server;
+--------------+-------+-----------------+--------+------------------+--------------------+----------------------------------------------------------------------------------------+
| svr_ip       | zone  | with_rootserver | status | stop_time        | start_service_time | build_version                                                                          |
+--------------+-------+-----------------+--------+------------------+--------------------+----------------------------------------------------------------------------------------+
| 10.186.64.74 | zone1 |               1 | active |                0 |   1679984798650860 | 3.1.2_10000392021123010-d4ace121deae5b81d8f0b40afbc4c02705b7fc1d(Dec 30 2021 02:47:29) |
| 10.186.64.75 | zone2 |               0 | active |                0 |   1679984801289281 | 3.1.2_10000392021123010-d4ace121deae5b81d8f0b40afbc4c02705b7fc1d(Dec 30 2021 02:47:29) |
| 10.186.64.79 | zone3 |               0 | active | 1680082329964975 |   1680082511964975 | 3.1.2_10000392021123010-d4ace121deae5b81d8f0b40afbc4c02705b7fc1d(Dec 30 2021 02:47:29) |
+--------------+-------+-----------------+--------+------------------+--------------------+----------------------------------------------------------------------------------------+
3 rows in set (0.002 sec)

MySQL [oceanbase]> select count(*),zone from gv$partition group by zone;
+----------+-------+
| count(*) | zone  |
+----------+-------+
|     1322 | zone1 |
|     1322 | zone2 |
|      152 | zone3 |
+----------+-------+
3 rows in set (0.228 sec)

MySQL [oceanbase]> select count(*),zone from gv$partition group by zone;
+----------+-------+
| count(*) | zone  |
+----------+-------+
|     1322 | zone1 |
|     1322 | zone2 |
|      664 | zone3 |
+----------+-------+
3 rows in set (0.113 sec)

MySQL [oceanbase]> select count(*),zone from gv$partition group by zone;
+----------+-------+
| count(*) | zone  |
+----------+-------+
|     1322 | zone1 |
|     1322 | zone2 |
|     1179 | zone3 |
+----------+-------+
3 rows in set (0.112 sec)

MySQL [oceanbase]> select count(*),zone from gv$partition group by zone;
+----------+-------+
| count(*) | zone  |
+----------+-------+
|     1322 | zone1 |
|     1322 | zone2 |
|     1322 | zone3 |
+----------+-------+
3 rows in set (0.116 sec)

Once the partition counts in all 3 zones are equal and zone3 shows replica information again, the rebuild can be considered complete.
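The "counts are equal" condition is easy to test mechanically. The sketch below reads one "count zone" pair per line, which is what the gv$partition query returns when the client prints plain rows, and succeeds only when every zone reports the same number. The function name zones_in_sync is hypothetical.

```shell
# Read "count zone" pairs on stdin (one per line, space or tab separated)
# and succeed only when every zone reports the same partition count.
zones_in_sync() {
    awk '
        NF == 2 { counts[$2] = $1; if (n++ == 0) first = $1 }
        END {
            if (n == 0) exit 1                 # no input is treated as "not in sync"
            bad = 0
            for (zone in counts) if (counts[zone] != first) bad = 1
            exit bad
        }
    '
}

# Values taken from the first and last samples in this article.
printf '1322 zone1\n1322 zone2\n152 zone3\n'  | zones_in_sync || echo "still rebuilding"
printf '1322 zone1\n1322 zone2\n1322 zone3\n' | zones_in_sync && echo "partition counts match"
```

Equal counts are only the first half of the condition stated above; the replica distribution for zone3 still needs to be checked separately.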

Because node 79 is still isolated from traffic, it does not hold any leader replicas yet.

MySQL [oceanbase]> select tenant.tenant_name, zone, svr_ip, svr_port,
    case when role=1 then 'leader' when role=2 then 'follower' else NULL end as role,
    count(1) as partition_cnt
  from __all_virtual_meta_table meta
  inner join __all_tenant tenant on meta.tenant_id=tenant.tenant_id
  inner join __all_virtual_table tab on meta.tenant_id=tab.tenant_id and meta.table_id=tab.table_id
  where tenant.tenant_id=1001 and tab.table_name='sbtest1'
  group by tenant.tenant_name, zone, svr_ip, svr_port, 5
  order by tenant.tenant_name, zone, svr_ip, role desc;
+-----------------+-------+--------------+----------+----------+---------------+
| tenant_name     | zone  | svr_ip       | svr_port | role     | partition_cnt |
+-----------------+-------+--------------+----------+----------+---------------+
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | leader   |             1 |
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | follower |             2 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | leader   |             2 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | follower |             1 |
| sysbench_tenant | zone3 | 10.186.64.79 |     2882 | follower |             3 |
+-----------------+-------+--------------+----------+----------+---------------+
6 rows in set (0.005 sec)

2. Re-enable service on the failed node

Run the following command to lift the isolation of node 79.

ALTER SYSTEM START SERVER '10.186.64.79:2882' ZONE='zone3';

The partition replica distribution below shows that a leader has moved back to node 79.

MySQL [oceanbase]> select tenant.tenant_name, zone, svr_ip, svr_port,
    case when role=1 then 'leader' when role=2 then 'follower' else NULL end as role,
    count(1) as partition_cnt
  from __all_virtual_meta_table meta
  inner join __all_tenant tenant on meta.tenant_id=tenant.tenant_id
  inner join __all_virtual_table tab on meta.tenant_id=tab.tenant_id and meta.table_id=tab.table_id
  where tenant.tenant_id=1001 and tab.table_name='sbtest1'
  group by tenant.tenant_name, zone, svr_ip, svr_port, 5
  order by tenant.tenant_name, zone, svr_ip, role desc;
+-----------------+-------+--------------+----------+----------+---------------+
| tenant_name     | zone  | svr_ip       | svr_port | role     | partition_cnt |
+-----------------+-------+--------------+----------+----------+---------------+
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | leader   |             1 |
| sysbench_tenant | zone1 | 10.186.64.74 |     2882 | follower |             2 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | leader   |             1 |
| sysbench_tenant | zone2 | 10.186.64.75 |     2882 | follower |             2 |
| sysbench_tenant | zone3 | 10.186.64.79 |     2882 | leader   |             1 |
| sysbench_tenant | zone3 | 10.186.64.79 |     2882 | follower |             2 |
+-----------------+-------+--------------+----------+----------+---------------+

3. Restore server_permanent_offline_time to its default value of 3600s

MySQL [oceanbase]> alter system set server_permanent_offline_time='3600s';
Query OK, 0 rows affected (0.028 sec)

MySQL [oceanbase]> SHOW PARAMETERS LIKE "%server_permanent_offline_time%";
+-------+----------+--------------+----------+-------------------------------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------------------+--------------+---------+---------+-------------------+
| zone  | svr_type | svr_ip       | svr_port | name                          | data_type | value | info                                                                                                                              | section      | scope   | source  | edit_level        |
+-------+----------+--------------+----------+-------------------------------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------------------+--------------+---------+---------+-------------------+
| zone2 | observer | 10.186.64.75 |     2882 | server_permanent_offline_time | NULL      | 3600s | the time interval between any two heartbeats beyond which a server is considered to be \'permanently\' offline. Range: [20s,+∞)   | ROOT_SERVICE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE |
| zone1 | observer | 10.186.64.74 |     2882 | server_permanent_offline_time | NULL      | 3600s | the time interval between any two heartbeats beyond which a server is considered to be \'permanently\' offline. Range: [20s,+∞)   | ROOT_SERVICE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE |
| zone3 | observer | 10.186.64.79 |     2882 | server_permanent_offline_time | NULL      | 3600s | the time interval between any two heartbeats beyond which a server is considered to be \'permanently\' offline. Range: [20s,+∞)   | ROOT_SERVICE | CLUSTER | DEFAULT | DYNAMIC_EFFECTIVE |
+-------+----------+--------------+----------+-------------------------------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------------------+--------------+---------+---------+-------------------+
3 rows in set (0.007 sec)
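
Verify the row count on each OBServer node

sysbench has finished running. Connect directly to each observer and verify that the row counts are identical.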

[root@localhost ~]#  obclient -h10.186.64.74 -P2881 -usysbench@sysbench_tenant -Dsysbenchdb -A -psysbench
Welcome to the OceanBase.  Commands end with ; or \g.
Your MySQL connection id is 3221545401
Server version: 5.7.25 OceanBase 3.1.2 (r10000392021123010-d4ace121deae5b81d8f0b40afbc4c02705b7fc1d) (Built Dec 30 2021 02:47:29)

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [sysbenchdb]> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    53195 |
+----------+
1 row in set (0.036 sec)

MySQL [sysbenchdb]> exit
Bye
[root@localhost ~]#  obclient -h10.186.64.75 -P2881 -usysbench@sysbench_tenant -Dsysbenchdb -A -psysbench
Welcome to the OceanBase.  Commands end with ; or \g.
Your MySQL connection id is 3221823448
Server version: 5.7.25 OceanBase 3.1.2 (r10000392021123010-d4ace121deae5b81d8f0b40afbc4c02705b7fc1d) (Built Dec 30 2021 02:47:29)

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [sysbenchdb]> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    53195 |
+----------+
1 row in set (0.040 sec)

MySQL [sysbenchdb]> exit
Bye
[root@localhost ~]#  obclient -h10.186.64.79 -P2881 -usysbench@sysbench_tenant -Dsysbenchdb -A -psysbench
Welcome to the OceanBase.  Commands end with ; or \g.
Your MySQL connection id is 3222011907
Server version: 5.7.25 OceanBase 3.1.2 (r10000392021123010-d4ace121deae5b81d8f0b40afbc4c02705b7fc1d) (Built Dec 30 2021 02:47:29)

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [sysbenchdb]> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    53195 |
+----------+
1 row in set (0.037 sec)

MySQL [sysbenchdb]>
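The three interactive sessions above can be condensed into one loop. The sketch below is an assumption-laden outline rather than a tested tool: counts_match and count_rows are hypothetical names, and count_rows assumes that obclient accepts -N and -e the way the mysql client does. A stand-in counter is used so that the example runs without a cluster.

```shell
# Assumed per-host query mirroring the sessions in this article.
# -N (no column names) and -e (execute) are assumed to work as in the mysql client.
count_rows() {
    obclient -h"$1" -P2881 -usysbench@sysbench_tenant -Dsysbenchdb -psysbench \
        -N -e 'select count(*) from sbtest1'
}

# Call COUNTER once per host and succeed only if all results are identical.
# usage: counts_match COUNTER host...
counts_match() {
    counter=$1
    shift
    reference=
    for host in "$@"; do
        rows=$("$counter" "$host") || return 1
        echo "$host: $rows"
        if [ -z "$reference" ]; then
            reference=$rows
        elif [ "$rows" != "$reference" ]; then
            return 1
        fi
    done
}

# Stand-in counter so the sketch runs without a cluster; with a real cluster,
# pass count_rows instead of fake_count.
fake_count() { echo 53195; }
counts_match fake_count 10.186.64.74 10.186.64.75 10.186.64.79 && echo "row counts match"
```

As in the manual check, the comparison is only meaningful after sysbench has stopped writing.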

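Summary

When data files are damaged or lost, the affected node can be rebuilt by adjusting the server_permanent_offline_time parameter.

  1. Lower the server_permanent_offline_time threshold.

  2. Stop the failed node from serving requests.

  3. Kill the observer process on that node.

  4. Once the threshold is exceeded, the node is marked as permanently offline. The system automatically removes its replicas and replenishes them on other available nodes in the same zone.

  5. Start the observer process, which automatically triggers the rebuild of the node's data.

  6. Re-enable service on the failed node.

  7. Set server_permanent_offline_time back to its original value.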
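The steps above can be collected into a printable checklist. The sketch below only prints the statements used in this article, filled in for one server; it executes nothing, and the function name rebuild_runbook and its argument order are made up for the example.

```shell
# Print the statements of the procedure for one server, in order.
# Nothing is executed; the output is meant to be reviewed and run by hand.
# usage: rebuild_runbook IP:PORT ZONE LOW_VALUE ORIGINAL_VALUE
rebuild_runbook() {
    server=$1 zone=$2 low=$3 original=$4
    cat <<EOF
-- 1. lower the threshold
alter system set server_permanent_offline_time='$low';
-- 2. stop traffic to the failed node
ALTER SYSTEM STOP SERVER '$server' ZONE='$zone';
-- 3. on the node itself: kill -9 <observer pid>
-- 4. wait until this returns a row for $server
select * from __all_rootservice_event_history where event='permanent_offline';
-- 5. on the node itself: clear data and logs, then start ./bin/observer
-- 6. re-enable the node
ALTER SYSTEM START SERVER '$server' ZONE='$zone';
-- 7. restore the threshold
alter system set server_permanent_offline_time='$original';
EOF
}

rebuild_runbook 10.186.64.79:2882 zone3 60s 3600s
```

Passing the original value explicitly is deliberate: if the cluster was not running with the 3600s default, the last step should restore whatever value was in place before.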
