aiforcast cluster: a single node at 100% CPU usage

I. Symptoms

Starting at around 5:00 AM on February 18, CPU usage on node 172.16.24.146 spiked to 100%.

be.out log

start time: Sat Jul  6 01:03:04 CST 2024, server uptime:  01:03:04 up 94 days,  8:22,  2 users,  load average: 0.00, 0.03, 0.05
Ignored unknown config: default_rowset_type
start time: 2025年 02月 18日 星期二 09:52:40 CST, server uptime:  09:52:40 up 321 days, 17:12,  2 users,  load average: 70.17, 105.80, 113.74
Ignored unknown config: default_rowset_type

INFO log

W0218 10:05:07.955432 25950 load_channel.cpp:97] Fail to open index 14666 of load 0bf274331c354715-a06716d7da777200: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
E0218 10:05:07.968695 25901 delta_writer.cpp:136] Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:07.968721 25901 load_channel.cpp:97] Fail to open index 14666 of load e1a73d08f14e4f11-b20d9273d3e052af: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
W0218 10:05:07.969110 23536 tablet_sink.cpp:1137] NodeChannel[3338134], tablet open failed, load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, node=172.16.24.146:8060, errmsg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:08.597052 23536 tablet_sink.cpp:1498] close channel failed. channel_name=NodeChannel[3338134], load_info=load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, error_msg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
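
To see which tablets are affected and how far over the limit they are, the repeated "too many versions" messages can be condensed into one line per tablet. The following is a minimal sketch that assumes the message wording shown in the log above; the function name `summarize_version_errors` and the log file path passed to it are hypothetical.

```shell
#!/usr/bin/env bash
# Summarize "too many versions" errors per tablet from a BE log file.
# Output: one line per tablet, "<tablet_id> <max_current>/<limit> <hits>".
summarize_version_errors() {
  local log_file="$1"
  # Keep only the fragment that carries the tablet id and the version counters.
  grep -o 'tablet [0-9]*, because of too many versions, current/limit: [0-9]*/[0-9]*' "$log_file" |
    awk '{
      id = $2; sub(/,$/, "", id)     # "14667," -> "14667"
      split($NF, v, "/")             # "7498/1000" -> v[1]=7498, v[2]=1000
      hits[id]++
      if (v[1] + 0 > max[id] + 0) max[id] = v[1]
      limit[id] = v[2]
    }
    END {
      for (id in hits) printf "%s %s/%s %d\n", id, max[id], limit[id], hits[id]
    }' | sort -n
}
```

For example, `summarize_version_errors be.INFO` (assuming the BE log has been copied locally under that name) prints the highest observed version count, the limit, and the number of matching messages for each tablet.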

Database ai_forecast, table aifcst_us

mysql> show tablet 14667\G
*************************** 1. row ***************************
       DbName: ai_forecast
    TableName: aifcst_us
PartitionName: aifcst_us
    IndexName: aifcst_us
         DbId: 11001
      TableId: 14665
  PartitionId: 14664
      IndexId: 14666
       IsSync: true
    DetailCmd: SHOW PROC '/dbs/11001/14665/partitions/14664/14666/14667';
1 row in set (0.00 sec)
mysql> 
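
The DetailCmd value in the output above is itself a statement that can be run to inspect the replicas of tablet 14667. As a small sketch, the helper below pulls that statement out of saved `show tablet <id>\G` output so it can be fed back to the mysql client; the function name `extract_detail_cmd` and the file name used in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Print the DetailCmd statement found in saved `show tablet <id>\G` output.
extract_detail_cmd() {
  local file="$1"
  # Strip the right-aligned "DetailCmd:" label and keep the statement after it.
  sed -n 's/^[[:space:]]*DetailCmd:[[:space:]]*//p' "$file"
}
```

Assuming the output was saved with `mysql -h 127.0.0.1 -P 9030 -uroot -p -e "show tablet 14667\G" > tablet_14667.txt`, the follow-up query would be `mysql -h 127.0.0.1 -P 9030 -uroot -p -e "$(extract_detail_cmd tablet_14667.txt)"`.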

Monitoring

FE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...
BE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...

II. Analysis

  1. Connect to StarRocks as root
mysql -uroot -h172.16.24.90 -P9030 -p
  2. Show tablet
mysql -h 127.0.0.1 -P 9030 -uroot -p -e "show tablet from ai_forecast.aifcst_us" > ai_forecast.aifcst_us_172.16.24.146.txt
  3. Files the vendor needs for analysis
Step 1: show backends;  -- get the backend_id of the BE node

Step 2: Get the pstack: connect to the leader FE and run the SQL below (substitute your own $backend_id; 3338134 is the ID logged for 172.16.24.146 in NodeChannel[3338134])
mysql -h172.16.24.90 -P 9030 -uroot -p -e "admin execute on 3338134 'System.print(ExecEnv.get_stack_trace_for_all_threads())';" > 146.pstack

Step 3: Get the pprof profile
pprof --svg --seconds=60 http://172.16.24.146:8040/pprof/profile > 146.svg

Step 4: Compare the BE configuration parameters
curl http://172.16.24.146:8040/varz > be_146.varz
curl http://172.16.24.91:8040/varz > be_91.varz

Step 5: Analyze table ai_forecast.aifcst_us
mysql -h172.16.24.90 -P9030 -uroot -p -e "show tablet from ai_forecast.aifcst_us" > aifcst_us_tablet.txt
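
The tablet listing saved in the last step can be reduced to a per-backend view of version counts, which makes an outlier BE easy to spot. The sketch below assumes the file is the tab-separated batch output of the mysql client with a header row, and that it contains `BackendId` and `VersionCount` columns; both column names are assumptions to verify against your StarRocks version. The function name `summarize_versions_by_backend` is hypothetical, and the default limit of 1000 is taken from the `current/limit: 7498/1000` log message.

```shell
#!/usr/bin/env bash
# Summarize tablet version counts per backend from saved `show tablet from <db>.<table>` output.
# Columns are located by header name, so their order does not matter.
# Output: "<backend_id> tablets=<n> max_versions=<m> over_limit=<k>"
summarize_versions_by_backend() {
  local file="$1" limit="${2:-1000}"
  awk -F'\t' -v limit="$limit" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("BackendId" in col) || !("VersionCount" in col)) {
        print "missing BackendId or VersionCount column" > "/dev/stderr"
        exit 1
      }
      next
    }
    {
      be = $(col["BackendId"]); vc = $(col["VersionCount"]) + 0
      tablets[be]++
      if (vc > max[be] + 0) max[be] = vc      # track the worst tablet per backend
      if (vc > limit + 0) over[be]++          # count tablets above the version limit
    }
    END {
      for (be in tablets)
        printf "%s tablets=%d max_versions=%d over_limit=%d\n", be, tablets[be], max[be], over[be]
    }' "$file" | sort -n
}
```

Running `summarize_versions_by_backend aifcst_us_tablet.txt` prints one line per backend; a backend whose `max_versions` and `over_limit` stand far above the others is the one where compaction is falling behind.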

III. Resolution

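  1. Node 172.16.24.146 has only one data disk, while each of the other three BE nodes has at least five. The tablets on 146 have accumulated too many versions and need compaction, but the default number of compaction threads is tied to the number of data disks on the node. The main cause is that 146 has too few compaction threads to keep up.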

In other words, 146 has far fewer compaction threads than the other three BE nodes, which is why the problem occurs frequently on 146.
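
The disk-count argument can be made concrete with a rough estimate. The sketch below counts the directories listed in `storage_root_path` of a be.conf file and multiplies by an assumed number of compaction threads per disk. The default of 2 per disk (one base plus one cumulative thread) is an assumption, not a value confirmed by this incident; the relevant settings in some StarRocks versions are `base_compaction_num_threads_per_disk` and `cumulative_compaction_num_threads_per_disk`, so check the configuration reference for your version. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Count the data directories listed in storage_root_path of a be.conf file.
# Entries are separated by ';'; empty entries (e.g. a trailing ';') are ignored.
count_storage_paths() {
  local conf="$1"
  grep -E '^[[:space:]]*storage_root_path[[:space:]]*=' "$conf" | tail -n 1 |
    sed 's/^[^=]*=//' | tr ';' '\n' | grep -c '[^[:space:]]'
}

# Estimate compaction threads as disks * threads-per-disk.
# per_disk defaults to 2, an ASSUMED value (1 base + 1 cumulative thread per disk).
estimate_compaction_threads() {
  local disks="$1" per_disk="${2:-2}"
  echo $(( disks * per_disk ))
}
```

Under that assumption, `estimate_compaction_threads 1` gives 2 threads for a one-disk node such as 146, while `estimate_compaction_threads 5` gives 10 for a five-disk node, a fivefold gap in compaction capacity for a similar load volume.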

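  1. Take the BE on 146 offline through decommission.
  2. Once decommissioning has fully completed, that is, TabletNum for the node in show backends has dropped to 0, clear the data under the storage directory on 146 and then add the node back to the cluster.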
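
The wait for TabletNum to reach 0 can be scripted. The following is a minimal sketch, assuming `show backends` returns a header row containing `IP` and `TabletNum` columns (column names vary between versions, so verify them first). The function names and the `SR_PASSWORD` environment variable are hypothetical.

```shell
#!/usr/bin/env bash
# Print TabletNum for one BE from saved tab-separated `show backends` output.
# Returns non-zero if no row matches the given IP.
tablet_num_for_ip() {
  local file="$1" ip="$2"
  awk -F'\t' -v ip="$ip" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["IP"]) == ip { print $(col["TabletNum"]); found = 1 }
    END { if (!found) exit 1 }
  ' "$file"
}

# Poll the FE until the decommissioned BE holds no tablets (or is no longer listed).
wait_for_empty_backend() {
  local fe_host="$1" ip="$2" out n
  out=$(mktemp)
  while :; do
    mysql -h "$fe_host" -P 9030 -uroot -p"$SR_PASSWORD" -e 'show backends' > "$out" || { rm -f "$out"; return 1; }
    n=$(tablet_num_for_ip "$out" "$ip") || break   # row gone: BE already removed
    [ "$n" -eq 0 ] && break
    echo "TabletNum on $ip: $n"
    sleep 60
  done
  rm -f "$out"
}
```

A possible sequence is to issue `ALTER SYSTEM DECOMMISSION BACKEND "172.16.24.146:9050";` on the FE, run `wait_for_empty_backend 172.16.24.90 172.16.24.146`, clear the storage directory on 146, and finish with `ALTER SYSTEM ADD BACKEND "172.16.24.146:9050";`. Port 9050 is the default heartbeat port and is an assumption here; use the HeartbeatPort value shown by show backends.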