aiforcast cluster: a single node stuck at 100% CPU

1. Symptoms

Starting at around 05:00 on February 18, CPU usage on node 172.16.24.146 climbed to 100% and stayed there.

be.out log

start time: Sat Jul  6 01:03:04 CST 2024, server uptime:  01:03:04 up 94 days,  8:22,  2 users,  load average: 0.00, 0.03, 0.05
Ignored unknown config: default_rowset_type
start time: 2025年 02月 18日 星期二 09:52:40 CST, server uptime:  09:52:40 up 321 days, 17:12,  2 users,  load average: 70.17, 105.80, 113.74
Ignored unknown config: default_rowset_type

INFO log

W0218 10:05:07.955432 25950 load_channel.cpp:97] Fail to open index 14666 of load 0bf274331c354715-a06716d7da777200: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
E0218 10:05:07.968695 25901 delta_writer.cpp:136] Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:07.968721 25901 load_channel.cpp:97] Fail to open index 14666 of load e1a73d08f14e4f11-b20d9273d3e052af: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
W0218 10:05:07.969110 23536 tablet_sink.cpp:1137] NodeChannel[3338134], tablet open failed, load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, node=172.16.24.146:8060, errmsg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:08.597052 23536 tablet_sink.cpp:1498] close channel failed. channel_name=NodeChannel[3338134], load_info=load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, error_msg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,

W0218 10:05:07.969110 23536 tablet_sink.cpp:1137] NodeChannel[3338134], tablet open failed, load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, node=172.16.24.146:8060, errmsg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size
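
The same failure repeats across many log lines, so it can help to collapse them into one line per tablet. The following is a minimal sketch, assuming the BE log is piped in on stdin and that the message wording matches the excerpts above; `summarize_too_many_versions` is a hypothetical helper name, not something StarRocks ships.

```shell
# Summarize "too many versions" failures from BE log lines read on stdin.
# Prints one line per distinct tablet: "<tablet_id> <current>/<limit> <occurrences>".
# Assumption: the message text is the same as in the log excerpts above.
summarize_too_many_versions() {
  grep -o 'tablet [0-9]*, because of too many versions, current/limit: [0-9]*/[0-9]*' \
    | sed -E 's/^tablet ([0-9]+),.*current\/limit: ([0-9]+\/[0-9]+)$/\1 \2/' \
    | sort | uniq -c \
    | awk '{ print $2, $3, $1 }'
}
```

Fed the log lines above, this reports tablet 14667 at 7498/1000; the trailing count depends on how many lines are piped in.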

Tablet 14667 belongs to table aifcst_us in database ai_forecast:

mysql> show tablet 14667\G
*************************** 1. row ***************************
       DbName: ai_forecast
    TableName: aifcst_us
PartitionName: aifcst_us
    IndexName: aifcst_us
         DbId: 11001
      TableId: 14665
  PartitionId: 14664
      IndexId: 14666
       IsSync: true
    DetailCmd: SHOW PROC '/dbs/11001/14665/partitions/14664/14666/14667';
1 row in set (0.00 sec)
mysql>

Monitoring

FE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...
BE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...

2. Analysis

Connect to StarRocks as root

mysql -uroot -h172.16.24.90 -P9030  -pYumcStarRocks@XfZ!

Dump the table's tablet list with show tablet

mysql -h 127.0.0.1 -P 9030 -uroot -p -e "show tablet from ai_forecast.aifcst_us" > ai_forecast.aifcst_us_172.16.24.146.txt
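
Once the dump exists, the replicas with an excessive version count can be filtered out of it. This is a rough sketch, assuming the dump is the tab-separated batch output of the mysql client with a header row, and that it has `TabletId`, `BackendId` and `VersionCount` columns; column names can differ between StarRocks versions, so the helper (a hypothetical name) looks them up by header and complains if they are missing.

```shell
# List replicas whose VersionCount exceeds a threshold (default 1000), from a
# tab-separated "show tablet from <table>" dump read on stdin.
# Output: "<TabletId> <BackendId> <VersionCount>", highest count first.
# Assumption: the header row contains TabletId, BackendId and VersionCount.
tablets_over_version_limit() {
  threshold="${1:-1000}"
  awk -F'\t' -v limit="$threshold" '
    NR == 1 {
      # Map column names to positions so the column order does not matter.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("TabletId" in col) || !("BackendId" in col) || !("VersionCount" in col)) {
        print "expected columns TabletId, BackendId, VersionCount" > "/dev/stderr"
        exit 2
      }
      next
    }
    $col["VersionCount"] + 0 > limit + 0 {
      print $col["TabletId"], $col["BackendId"], $col["VersionCount"]
    }
  ' | sort -k3,3nr
}

# Usage:
#   tablets_over_version_limit 1000 < ai_forecast.aifcst_us_172.16.24.146.txt
```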

Files the vendor needs for analysis

Step 1: show backends;  -- get the backend_id of the BE node

Step 2: collect a pstack: connect to the leader FE and run this SQL (substitute the backend_id from Step 1 for the ID after "admin execute on")
mysql -h172.16.24.90 -P 9030 -uroot -p'YumcStarRocks@XfZ!' -e "admin execute on 3338134 'System.print(ExecEnv.get_stack_trace_for_all_threads())';" > 146.pstack

Step 3: collect a pprof profile
pprof --svg --seconds=60 http://172.16.24.146:8040/pprof/profile > 146.svg

Step 4: compare the BE configuration parameters of the two nodes
curl http://172.16.24.146:8040/varz > be_146.varz
curl http://172.16.24.91:8040/varz > be_91.varz

Step 5: analyze table ai_forecast.aifcst_us
mysql -h172.16.24.90 -P9030 -uroot -p'YumcStarRocks@XfZ!' -e "show tablet from ai_forecast.aifcst_us" > aifcst_us_tablet.txt
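
For Step 4, the two varz dumps can be compared key by key instead of by eye. The sketch below assumes each dump holds one `key=value` setting per line (any HTML wrapper lines in the response would need to be stripped first) and only reports keys matching a pattern, `compaction` by default; `diff_varz` is a hypothetical helper name.

```shell
# Print settings that differ between two /varz dumps, restricted to keys
# matching a pattern (default: compaction).
# Assumption: each dump contains one "key=value" setting per line.
diff_varz() {
  file_a="$1"; file_b="$2"; pattern="${3:-compaction}"
  awk -v pat="$pattern" '
    {
      eq = index($0, "=")
      if (eq == 0) next                    # skip lines that are not key=value
      key = substr($0, 1, eq - 1)
      val = substr($0, eq + 1)
      if (key !~ pat) next                 # skip keys outside the pattern
    }
    FNR == NR { a[key] = val; next }       # first file: remember each value
    {
      seen[key] = 1
      if (!(key in a))        printf "%s: only in B (%s)\n", key, val
      else if (a[key] != val) printf "%s: A=%s B=%s\n", key, a[key], val
    }
    END { for (k in a) if (!(k in seen)) printf "%s: only in A (%s)\n", k, a[k] }
  ' "$file_a" "$file_b"
}

# Usage:
#   diff_varz be_146.varz be_91.varz
```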

3. Resolution

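Node 172.16.24.146 has only one data disk, whereas each of the other three BE nodes has at least five. The tablets on 146 have accumulated too many versions and need compaction, but the default number of compaction threads is tied to the number of data disks on the node. With a single disk, 146 has too few compaction threads to keep up.

In other words, 146 runs far fewer compaction threads than the other three BE nodes, which is why the problem keeps recurring on 146.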

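Take the BE on 146 offline through decommission.
Once the decommission has fully completed and TabletNum for that backend in show backends has dropped to 0, clear the data under the storage directory on 146, then add the node back to the cluster.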
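
Before clearing the storage directory, it is worth confirming that the tablet count really is 0. The helper below is a small sketch with a hypothetical name; it assumes the tab-separated batch output of `show backends` with a header row, and that the host column is called `IP` (falling back to `Host`), which may differ between versions.

```shell
# Print the TabletNum reported for one backend, given tab-separated
# "show backends" output (header row first) on stdin.
# Assumption: the host column is named IP or Host, and a TabletNum column exists.
backend_tablet_num() {
  awk -F'\t' -v ip="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("TabletNum" in col)) {
        print "no TabletNum column in input" > "/dev/stderr"
        exit 2
      }
      host = ("IP" in col) ? col["IP"] : col["Host"]
      next
    }
    $host == ip { print $col["TabletNum"] }
  '
}

# Usage (prompts for the password):
#   mysql -h172.16.24.90 -P9030 -uroot -p -e "show backends" | backend_tablet_num 172.16.24.146
```

Only when this prints 0 for 172.16.24.146 should the storage directory be cleared.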