aiforcast cluster: CPU usage at 100% on a single node

I. Symptoms

Starting at around 5 a.m. on February 18, CPU usage on node 172.16.24.146 climbed to 100% and stayed there.

be.out log

start time: Sat Jul  6 01:03:04 CST 2024, server uptime:  01:03:04 up 94 days,  8:22,  2 users,  load average: 0.00, 0.03, 0.05
Ignored unknown config: default_rowset_type
start time: 2025年 02月 18日 星期二 09:52:40 CST, server uptime:  09:52:40 up 321 days, 17:12,  2 users,  load average: 70.17, 105.80, 113.74
Ignored unknown config: default_rowset_type

INFO log

W0218 10:05:07.955432 25950 load_channel.cpp:97] Fail to open index 14666 of load 0bf274331c354715-a06716d7da777200: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
E0218 10:05:07.968695 25901 delta_writer.cpp:136] Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:07.968721 25901 load_channel.cpp:97] Fail to open index 14666 of load e1a73d08f14e4f11-b20d9273d3e052af: Service unavailable: Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
/build/starrocks/be/src/storage/delta_writer.cpp:26 writer->_init()
/build/starrocks/be/src/runtime/local_tablets_channel.cpp:71 _open_all_writers(params)
W0218 10:05:07.969110 23536 tablet_sink.cpp:1137] NodeChannel[3338134], tablet open failed, load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, node=172.16.24.146:8060, errmsg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
W0218 10:05:08.597052 23536 tablet_sink.cpp:1498] close channel failed. channel_name=NodeChannel[3338134], load_info=load_id=e1a73d08-f14e-4f11-b20d-9273d3e052af, txn_id: 235374208, parallel=1, compress_type=2, error_msg=Failed to load data into tablet 14667, because of too many versions, current/limit: 7498/1000. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,
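The same "too many versions" message repeats many times in the log, so it can help to reduce it to one line per tablet. The following is a minimal sketch, assuming the message wording matches the lines above; the function name `too_many_versions` and the log file name in the usage comment are illustrative assumptions, not part of the original write-up.

```shell
# Print "tablet_id current limit" once per distinct combination found in
# BE log lines read from stdin. The pattern is copied from the message
# text shown above; other StarRocks versions may word it differently.
too_many_versions() {
    sed -n 's|.*Failed to load data into tablet \([0-9][0-9]*\), because of too many versions, current/limit: \([0-9][0-9]*\)/\([0-9][0-9]*\).*|\1 \2 \3|p' | sort -u
}

# Usage (the log file name is an assumption):
#   too_many_versions < be.INFO
```

For the excerpt above this would report tablet 14667 with 7498 versions against a limit of 1000.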

Table aifcst_us in database ai_forecast

mysql> show tablet 14667\G
*************************** 1. row ***************************
       DbName: ai_forecast
    TableName: aifcst_us
PartitionName: aifcst_us
    IndexName: aifcst_us
         DbId: 11001
      TableId: 14665
  PartitionId: 14664
      IndexId: 14666
       IsSync: true
    DetailCmd: SHOW PROC '/dbs/11001/14665/partitions/14664/14666/14667';
1 row in set (0.00 sec)
mysql> 

Monitoring

FE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...
BE monitoring: infra-grafana.hwwt2.com/d/1fFiWJ4m1...

II. Analysis

  1. Connect to StarRocks (sr) as root
# 
mysql -uroot -h172.16.24.90 -P9030  -pYumcStarRocks@XfZ!
  2. Show tablet
# 
mysql -h 127.0.0.1 -P 9030 -uroot -p -e "show tablet from ai_forecast.aifcst_us" > ai_forecast.aifcst_us_172.16.24.146.txt
  3. Files the vendor needs for analysis
Step 1: show backends;  -- get the backend_id of the BE node

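Step 2: collect the pstack: connect to the leader FE and run the following SQL (substitute your own backend_id; 3338134 is the value used here for node 146)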
mysql -h172.16.24.90 -P 9030 -uroot -p'YumcStarRocks@XfZ!' -e "admin execute on 3338134 'System.print(ExecEnv.get_stack_trace_for_all_threads())';" > 146.pstack

Step 3: collect the pprof profile
pprof --svg --seconds=60 http://172.16.24.146:8040/pprof/profile > 146.svg

Step 4: compare the BE configuration parameters
curl http://172.16.24.146:8040/varz > be_146.varz
curl http://172.16.24.91:8040/varz > be_91.varz

Step 5: analyze table ai_forecast.aifcst_us
mysql -h172.16.24.90 -P9030 -uroot -p'YumcStarRocks@XfZ!' -e "show tablet from ai_forecast.aifcst_us" > aifcst_us_tablet.txt
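The files collected in steps 4 and 5 are easier to read after a little post-processing. The sketch below is one possible approach, not the vendor's procedure. It assumes that each varz line has the form name=value, and that the show tablet dump is tab-separated with a header row containing TabletId, BackendId and VersionCount columns. The function names `diff_varz` and `version_counts` are made up for illustration.

```shell
# Print settings whose values differ between two varz dumps.
# Assumes one name=value pair per line; other lines are skipped.
diff_varz() {
    awk '
        {
            eq = index($0, "=")
            if (eq == 0) next                 # not a name=value line
            name = substr($0, 1, eq - 1)
            value = substr($0, eq + 1)
        }
        FNR == NR { first[name] = value; next }   # first file: remember values
        (name in first) && first[name] != value {
            printf "%s: %s -> %s\n", name, first[name], value
        }
    ' "$1" "$2"
}

# Print "VersionCount<TAB>TabletId" for one backend, highest count first.
# stdin: tab-separated show tablet output including the header row.
version_counts() {
    awk -F'\t' -v backend="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("TabletId" in col) || !("BackendId" in col) || !("VersionCount" in col)) {
                print "expected columns not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        $col["BackendId"] == backend { print $col["VersionCount"] "\t" $col["TabletId"] }
    ' | sort -rn
}

# Usage with the files produced above:
#   diff_varz be_91.varz be_146.varz
#   version_counts 3338134 < aifcst_us_tablet.txt
```

Looking columns up by header name rather than by position keeps the helper usable if the column order differs between versions.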

III. Resolution

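  1. Node 172.16.24.146 has only one data disk, while the other three BE nodes each have at least five. The tablets on 146 have accumulated too many versions and need compaction. The default number of compaction threads is tied to the number of data disks on the node, so 146 has too few compaction threads and compaction cannot keep up.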

In other words, 146 has far fewer compaction threads than the other three BE nodes, which is why the problem recurs so frequently on 146.
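To make the disk-count argument concrete, here is a rough back-of-the-envelope sketch. It assumes a simple linear model (threads = data disks × threads per disk). The per-disk value of 2 in the usage comment is an illustrative assumption, not a figure from this incident. The real configuration names and defaults depend on the StarRocks version.

```shell
# Hypothetical model: default compaction threads scale linearly with the
# number of data disks. $1 = data disks, $2 = threads per disk (assumed).
compaction_threads() {
    echo $(( $1 * $2 ))
}

# Usage, with an assumed 2 threads per disk:
#   compaction_threads 1 2   # node 146, one data disk
#   compaction_threads 5 2   # another BE node with five data disks
```

Under this model a node with five disks gets five times the compaction capacity of 146 while receiving a comparable share of the load, which matches the backlog seen on 146.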

  1. Take the BE on 146 offline via decommission
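  2. Wait until the decommission has fully completed and tabletnum in show backends has dropped to 0, then clear the data under the storage directory on node 146 and add the node back to the cluster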
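The two steps above could be sketched roughly as follows. The heartbeat port 9050 in the commented SQL is an assumption (the StarRocks default) and should be checked against show backends. The helper name `tablet_num` is made up for illustration. It assumes tab-separated show backends output with a header row that contains IP and TabletNum columns.

```shell
# Print the TabletNum of one BE, read from tab-separated show backends
# output (with header row) on stdin. Exits non-zero if the IP is absent.
tablet_num() {
    awk -F'\t' -v ip="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $col["IP"] == ip { print $col["TabletNum"]; found = 1 }
        END { exit !found }
    '
}

# Usage outline (run the SQL on the leader FE; port 9050 is assumed):
#   ALTER SYSTEM DECOMMISSION BACKEND "172.16.24.146:9050";
#   mysql -h172.16.24.90 -P9030 -uroot -p -e "show backends" | tablet_num 172.16.24.146
#   ... repeat until the printed value is 0, then clear the storage directory ...
#   ALTER SYSTEM ADD BACKEND "172.16.24.146:9050";
```

Waiting for the count to reach 0 before wiping the disk matters because the decommission migrates replicas to the other BE nodes first.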