aiforcast cluster: single node at 100% CPU usage

I. Symptoms

Starting at around 5:00 AM on February 18, CPU usage on node 172.16.24.146 climbed to 100% and stayed there.

be.out log

start time: Sat Jul  6 01:03:04 CST 2024, server uptime:  01:03:04 up 94 days,  8:22,  2 users,  load average: 0.00, 0.03, 0.05
Ignored unknown config: default_rowset_type
start time: 2025年 02月 18日 星期二 09:52:40 CST, server uptime:  09:52:40 up 321 days, 17:12,  2 users,  load average: 70.17, 105.80, 113.74
Ignored unknown config: default_rowset_type

INFO log

W0218 10:05:07.955432 25950 load_channel.cpp:97] Fail to open index 14666 of load 0bf274331c354715-a06716d7da777200: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
E0218 10:05:07.968695 25901 delta_writer.cpp:136] Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:07.968721 25901 load_channel.cpp:97] Fail to open index 14666 of load e1a73d08f14e4f11-b20d9273d3e052af: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
W0218 10:05:07.969110 23536 tablet_sink.cpp:1137] NodeChannel[3338134], tablet open failed, load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, node=172.16.24.146:8060, errmsg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:08.597052 23536 tablet_sink.cpp:1498] close channel failed. channel_name=NodeChannel[3338134], load_info=load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, error_msg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
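
To see which tablets are hitting the version limit and how often, the BE log can be summarized with standard tools. This is a minimal sketch, assuming the log lines follow the format shown above; the function name `summarize_version_errors` and the log file name in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count "too many versions" failures per tablet in a BE log.
# Prints one line per tablet and current/limit pair: <tablet_id> <current>/<limit> <count>
summarize_version_errors() {
    # Pull out only the "tablet N, because of too many versions, current/limit: A/B" part.
    grep -o 'tablet [0-9]*, because of too many versions, current/limit: [0-9]*/[0-9]*' "$1" \
      | sed 's/^tablet \([0-9]*\),.*current\/limit: \(.*\)$/\1 \2/' \
      | sort | uniq -c \
      | awk '{print $2, $3, $1}'
}

# Usage (log file name is an assumption):
#   summarize_version_errors be.INFO
```

A tablet whose current version count is far above the limit, as with tablet 14667 here, is the one to look up with `show tablet`.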

Table aifcst_us in database ai_forecast

mysql> show tablet 14667\G
*************************** 1. row ***************************
       DbName: ai_forecast
    TableName: aifcst_us
PartitionName: aifcst_us
    IndexName: aifcst_us
         DbId: 11001
      TableId: 14665
  PartitionId: 14664
      IndexId: 14666
       IsSync: true
    DetailCmd: SHOW PROC '/dbs/11001/14665/partitions/14664/14666/14667';
1 row in set (0.00 sec)
mysql> 

Monitoring

FE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...
BE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...

II. Analysis

  1. Connect to StarRocks as root
mysql -uroot -h172.16.24.90 -P9030  -pYumcStarRocks@XfZ!
  2. Dump the tablet list with show tablet
mysql -h 127.0.0.1 -P 9030 -uroot -p -e "show tablet from ai_forecast.aifcst_us" > ai_forecast.aifcst_us_172.16.24.146.txt
  3. Files the vendor needs for analysis
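Step 1: show backends;  -- get the backend_id of the BE node

Step 2: get the pstack. Connect to the leader FE and run the SQL below (3338134 is the backend_id used in this case; replace it with your own $backend_id)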
mysql -h172.16.24.90 -P 9030 -uroot -p'YumcStarRocks@XfZ!' -e "admin execute on 3338134 'System.print(ExecEnv.get_stack_trace_for_all_threads())';" > 146.pstack

Step 3: get the pprof CPU profile
pprof --svg --seconds=60 http://172.16.24.146:8040/pprof/profile > 146.svg

Step 4: compare the BE configuration parameters of 146 against a healthy node
curl http://172.16.24.146:8040/varz > be_146.varz
curl http://172.16.24.91:8040/varz > be_91.varz

Step 5: analyze table ai_forecast.aifcst_us
mysql -h172.16.24.90 -P9030 -uroot -p'YumcStarRocks@XfZ!' -e "show tablet from ai_forecast.aifcst_us" > aifcst_us_tablet.txt
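
Step 4 saves the two varz dumps but does not show the comparison itself. One possible way is sketched here, assuming each dump lists one setting per line and that the settings of interest contain the word "compaction" in their names. Both are assumptions, and the function name `compare_compaction_settings` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show compaction-related settings that differ
# between two saved varz dumps.
# Returns diff's exit status: 0 = identical, 1 = differences found.
compare_compaction_settings() {
    left=$(mktemp)
    right=$(mktemp)
    # Keep only lines that mention compaction; sort so line order does not matter.
    grep -i 'compaction' "$1" | sort > "$left"
    grep -i 'compaction' "$2" | sort > "$right"
    diff "$left" "$right"
    status=$?
    rm -f "$left" "$right"
    return "$status"
}

# Usage (file names from step 4):
#   compare_compaction_settings be_146.varz be_91.varz
```

Lines prefixed with `<` come from the first file and lines prefixed with `>` from the second, so any setting that differs on 146 stands out directly.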

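III. Resolution

  1. Node 172.16.24.146 has only one data disk, whereas each of the other three BE nodes has at least five. The tablets on 146 have accumulated too many versions and need compaction, but the default number of compaction threads is tied to the number of disks on the node, so 146 has too few compaction threads to keep up.

In other words, 146 runs far fewer compaction threads than the other three BE nodes, which is why the problem keeps recurring on 146.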

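  1. Take the BE on 146 offline through decommission.
  2. Once the decommission has fully completed (TabletNum for 146 in show backends drops to 0), clear the data under the storage directory on 146, then add the node back to the cluster.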
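
The wait in step 2 can be checked from the command line. This is a minimal sketch, assuming the vertical output of `show backends\G` contains `IP:` and `TabletNum:` fields; the function name `tablet_num_for` is hypothetical, and the heartbeat port in the comment is a placeholder to fill in from `show backends`.

```shell
#!/bin/sh
# Hypothetical helper: read `show backends\G` output on stdin and print the
# TabletNum of the backend whose IP matches the first argument.
# Assumes the vertical output has "IP:" and "TabletNum:" fields.
tablet_num_for() {
    awk -v ip="$1" '
        $1 == "IP:"                         { current = $2 }
        $1 == "TabletNum:" && current == ip { print $2 }
    '
}

# The decommission itself is started from a MySQL client with a statement like:
#   ALTER SYSTEM DECOMMISSION BACKEND "172.16.24.146:<heartbeat_port>";
#
# Usage sketch for the check (not run here):
#   mysql -h172.16.24.90 -P9030 -uroot -p -e 'show backends\G' | tablet_num_for 172.16.24.146
```

Only when the printed value reaches 0 is it safe to clear the storage directory on 146.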