20260604 SR Timeout Troubleshooting

Incident description:

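Between 2026-06-04 09:40 and 2026-06-04 10:20, user data queries in the operations back-office system became noticeably slow, and some timed out. The cause was excessive load on the big data StarRocks cluster.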

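Root cause analysis (disk, network, implementation, data volume, usage)

Based on the SR audit log

  1. Query volume was higher than usual, which kept memory usage at about 85% without dropping. Queries therefore went pending and eventually timed out.
    1. More queries ran in this time window than in the same window one week earlier:
      1. ods_dc_sbh_plat_order: 336 more
      2. ads_c_order_detail_all_gv_view: 108 more
      3. ads_order_detail_all_gv_view: 79 more
    2. Per-minute query counts in this window were also higher than one week earlier; see the per-minute breakdown of queries on the large ads tables below.
    3. During the window, the largest single query scanned 18 GB of data, and the most rows scanned by a single query was 120 million (optimization in progress).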

Machine status at the time

SR status

Memory held steady at about 85%.

Queries on the large ads tables and the order table

| table_name | Incident window (A) | Same window one week earlier (B) | Difference (A-B) |
|----------------------------------------|-------|--------|----------|
| dmp_ods.ods_dc_sbh_plat_order | 594 | 258 | +336 |
| dmp_ads.ads_c_order_detail_all_gv_view | 115 | 7 | +108 |
| dmp_ads.ads_order_profit_sum_gv | 430 | 533 | -103 |
| dmp_ads.ads_order_detail_all_gv_view | 297 | 218 | +79 |
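The differences in the table above are plain subtraction of the two windows. A minimal sketch that recomputes them and ranks the tables by increase follows; the counts are copied from the table, while the function name `week_over_week` is only illustrative.

```python
# Query counts per table, copied from the table above:
# (incident window, same window one week earlier).
counts = {
    "dmp_ods.ods_dc_sbh_plat_order": (594, 258),
    "dmp_ads.ads_c_order_detail_all_gv_view": (115, 7),
    "dmp_ads.ads_order_profit_sum_gv": (430, 533),
    "dmp_ads.ads_order_detail_all_gv_view": (297, 218),
}


def week_over_week(counts):
    """Return (table, today, week_ago, diff) rows sorted by diff, largest first."""
    rows = [(table, a, b, a - b) for table, (a, b) in counts.items()]
    return sorted(rows, key=lambda row: row[3], reverse=True)


for table, today, week_ago, diff in week_over_week(counts):
    print(f"{table}: {today} vs {week_ago} ({diff:+d})")
```

Sorting by the difference puts the tables with the largest increase first, which matches the three tables called out in the root cause analysis.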

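During this window there were large queries from C-end operations, each scanning between 3 GB and 18 GB of data.

Queries on the large ads tables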

| table_name | time | A_count (today) | B_count (one week earlier) | Difference (A-B) |
|----------------------------------------|-------|-------------|--------------|-------|
| dmp_ads.ads_c_order_detail_all_gv_view | 09:40 | 1 | 0 | 1 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:41 | 2 | 0 | 2 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:43 | 2 | 0 | 2 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:44 | 2 | 0 | 2 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:50 | 2 | 0 | 2 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:53 | 1 | 0 | 1 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:54 | 1 | 0 | 1 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:55 | 2 | 1 | 1 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:56 | 3 | 0 | 3 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:57 | 5 | 0 | 5 |
| dmp_ads.ads_c_order_detail_all_gv_view | 09:59 | 6 | 0 | 6 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:01 | 3 | 0 | 3 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:02 | 4 | 0 | 4 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:03 | 13 | 0 | 13 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:04 | 5 | 0 | 5 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:05 | 5 | 0 | 5 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:06 | 7 | 0 | 7 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:09 | 6 | 2 | 4 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:11 | 14 | 0 | 14 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:13 | 8 | 2 | 6 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:14 | 9 | 1 | 8 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:15 | 6 | 1 | 5 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:16 | 6 | 0 | 6 |
| dmp_ads.ads_c_order_detail_all_gv_view | 10:19 | 2 | 0 | 2 |
| dmp_ads.ads_order_detail_all_gv_view | 09:40 | 6 | 2 | 4 |
| dmp_ads.ads_order_detail_all_gv_view | 09:41 | 14 | 0 | 14 |
| dmp_ads.ads_order_detail_all_gv_view | 09:42 | 7 | 0 | 7 |
| dmp_ads.ads_order_detail_all_gv_view | 09:43 | 2 | 0 | 2 |
| dmp_ads.ads_order_detail_all_gv_view | 09:44 | 2 | 0 | 2 |
| dmp_ads.ads_order_detail_all_gv_view | 09:46 | 0 | 4 | -4 |
| dmp_ads.ads_order_detail_all_gv_view | 09:47 | 1 | 4 | -3 |
| dmp_ads.ads_order_detail_all_gv_view | 09:48 | 2 | 9 | -7 |
| dmp_ads.ads_order_detail_all_gv_view | 09:49 | 0 | 2 | -2 |
| dmp_ads.ads_order_detail_all_gv_view | 09:51 | 5 | 2 | 3 |
| dmp_ads.ads_order_detail_all_gv_view | 09:53 | 0 | 3 | -3 |
| dmp_ads.ads_order_detail_all_gv_view | 09:54 | 2 | 6 | -4 |
| dmp_ads.ads_order_detail_all_gv_view | 09:55 | 23 | 0 | 23 |
| dmp_ads.ads_order_detail_all_gv_view | 09:56 | 3 | 2 | 1 |
| dmp_ads.ads_order_detail_all_gv_view | 09:57 | 0 | 3 | -3 |
| dmp_ads.ads_order_detail_all_gv_view | 09:58 | 0 | 8 | -8 |
| dmp_ads.ads_order_detail_all_gv_view | 09:59 | 24 | 4 | 20 |
| dmp_ads.ads_order_detail_all_gv_view | 10:00 | 5 | 12 | -7 |
| dmp_ads.ads_order_detail_all_gv_view | 10:01 | 12 | 1 | 11 |
| dmp_ads.ads_order_detail_all_gv_view | 10:02 | 19 | 1 | 18 |
| dmp_ads.ads_order_detail_all_gv_view | 10:03 | 11 | 3 | 8 |
| dmp_ads.ads_order_detail_all_gv_view | 10:04 | 17 | 6 | 11 |
| dmp_ads.ads_order_detail_all_gv_view | 10:05 | 23 | 4 | 19 |
| dmp_ads.ads_order_detail_all_gv_view | 10:06 | 11 | 6 | 5 |
| dmp_ads.ads_order_detail_all_gv_view | 10:07 | 4 | 7 | -3 |
| dmp_ads.ads_order_detail_all_gv_view | 10:08 | 3 | 13 | -10 |
| dmp_ads.ads_order_detail_all_gv_view | 10:09 | 10 | 14 | -4 |
| dmp_ads.ads_order_detail_all_gv_view | 10:10 | 2 | 5 | -3 |
| dmp_ads.ads_order_detail_all_gv_view | 10:11 | 14 | 5 | 9 |
| dmp_ads.ads_order_detail_all_gv_view | 10:12 | 9 | 9 | 0 |
| dmp_ads.ads_order_detail_all_gv_view | 10:13 | 17 | 7 | 10 |
| dmp_ads.ads_order_detail_all_gv_view | 10:14 | 3 | 15 | -12 |
| dmp_ads.ads_order_detail_all_gv_view | 10:15 | 11 | 24 | -13 |
| dmp_ads.ads_order_detail_all_gv_view | 10:16 | 3 | 6 | -3 |
| dmp_ads.ads_order_detail_all_gv_view | 10:17 | 2 | 5 | -3 |
| dmp_ads.ads_order_detail_all_gv_view | 10:18 | 4 | 18 | -14 |
| dmp_ads.ads_order_detail_all_gv_view | 10:19 | 26 | 8 | 18 |
| dmp_ads.ads_order_profit_sum_gv | 09:40 | 2 | 7 | -5 |
| dmp_ads.ads_order_profit_sum_gv | 09:41 | 27 | 18 | 9 |
| dmp_ads.ads_order_profit_sum_gv | 09:42 | 12 | 12 | 0 |
| dmp_ads.ads_order_profit_sum_gv | 09:43 | 12 | 7 | 5 |
| dmp_ads.ads_order_profit_sum_gv | 09:44 | 7 | 16 | -9 |
| dmp_ads.ads_order_profit_sum_gv | 09:45 | 3 | 11 | -8 |
| dmp_ads.ads_order_profit_sum_gv | 09:46 | 7 | 2 | 5 |
| dmp_ads.ads_order_profit_sum_gv | 09:47 | 24 | 14 | 10 |
| dmp_ads.ads_order_profit_sum_gv | 09:48 | 30 | 16 | 14 |
| dmp_ads.ads_order_profit_sum_gv | 09:49 | 23 | 10 | 13 |
| dmp_ads.ads_order_profit_sum_gv | 09:50 | 24 | 15 | 9 |
| dmp_ads.ads_order_profit_sum_gv | 09:51 | 4 | 18 | -14 |
| dmp_ads.ads_order_profit_sum_gv | 09:52 | 1 | 30 | -29 |
| dmp_ads.ads_order_profit_sum_gv | 09:53 | 16 | 15 | 1 |
| dmp_ads.ads_order_profit_sum_gv | 09:54 | 13 | 28 | -15 |
| dmp_ads.ads_order_profit_sum_gv | 09:55 | 12 | 15 | -3 |
| dmp_ads.ads_order_profit_sum_gv | 09:56 | 20 | 40 | -20 |
| dmp_ads.ads_order_profit_sum_gv | 09:57 | 9 | 20 | -11 |
| dmp_ads.ads_order_profit_sum_gv | 09:58 | 3 | 10 | -7 |
| dmp_ads.ads_order_profit_sum_gv | 09:59 | 15 | 13 | 2 |
| dmp_ads.ads_order_profit_sum_gv | 10:00 | 4 | 16 | -12 |
| dmp_ads.ads_order_profit_sum_gv | 10:01 | 4 | 5 | -1 |
| dmp_ads.ads_order_profit_sum_gv | 10:02 | 6 | 16 | -10 |
| dmp_ads.ads_order_profit_sum_gv | 10:03 | 7 | 10 | -3 |
| dmp_ads.ads_order_profit_sum_gv | 10:04 | 2 | 8 | -6 |
| dmp_ads.ads_order_profit_sum_gv | 10:05 | 2 | 10 | -8 |
| dmp_ads.ads_order_profit_sum_gv | 10:06 | 16 | 9 | 7 |
| dmp_ads.ads_order_profit_sum_gv | 10:07 | 2 | 11 | -9 |
| dmp_ads.ads_order_profit_sum_gv | 10:08 | 0 | 1 | -1 |
| dmp_ads.ads_order_profit_sum_gv | 10:09 | 3 | 14 | -11 |
| dmp_ads.ads_order_profit_sum_gv | 10:10 | 6 | 5 | 1 |
| dmp_ads.ads_order_profit_sum_gv | 10:11 | 9 | 14 | -5 |
| dmp_ads.ads_order_profit_sum_gv | 10:12 | 12 | 15 | -3 |
| dmp_ads.ads_order_profit_sum_gv | 10:13 | 22 | 13 | 9 |
| dmp_ads.ads_order_profit_sum_gv | 10:14 | 12 | 22 | -10 |
| dmp_ads.ads_order_profit_sum_gv | 10:15 | 20 | 18 | 2 |
| dmp_ads.ads_order_profit_sum_gv | 10:16 | 11 | 6 | 5 |
| dmp_ads.ads_order_profit_sum_gv | 10:17 | 7 | 6 | 1 |
| dmp_ads.ads_order_profit_sum_gv | 10:18 | 12 | 10 | 2 |
| dmp_ads.ads_order_profit_sum_gv | 10:19 | 9 | 7 | 2 |
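A per-minute breakdown like the one above can be produced by bucketing query start times by minute and comparing the two days. The sketch below assumes the audit log has already been exported as timestamp strings in `YYYY-MM-DD HH:MM:SS` form; that layout, the function names, and the sample timestamps are all hypothetical and are not taken from the incident data.

```python
from collections import Counter
from datetime import datetime


def per_minute_counts(timestamps):
    """Count queries per HH:MM bucket from 'YYYY-MM-DD HH:MM:SS' strings."""
    return Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%H:%M")
        for ts in timestamps
    )


def minute_diff(today, week_ago):
    """Return {minute: (A_count, B_count, A - B)} for every minute seen in either day."""
    minutes = sorted(set(today) | set(week_ago))
    # A Counter returns 0 for a missing minute, so absent buckets need no special case.
    return {m: (today[m], week_ago[m], today[m] - week_ago[m]) for m in minutes}


# Made-up sample timestamps, for illustration only.
today_ts = ["2026-06-04 09:55:03", "2026-06-04 09:55:40", "2026-06-04 09:56:10"]
week_ago_ts = ["2026-05-28 09:56:05", "2026-05-28 09:57:30"]

diff = minute_diff(per_minute_counts(today_ts), per_minute_counts(week_ago_ts))
for minute, (a, b, d) in diff.items():
    print(minute, a, b, d)
```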

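Queue pending status

At the peak, 94 queries were pending: 41 in the large-query resource group and 53 in the short-query resource group.

Resource allocation per resource group

Supported QPS

New view ads_order_detail_all_gv_view_new_refund_all: 7-8

Old view dmp_ads.ads_order_detail_all_gv_view: 2-3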

Recovery procedure

Connect to SR

sby:mysql -h dmp02 -P9030 -uapp_admin_user -p'01wwYcMu' --default-character-set=utf8mb4

SHOW PROC '/current_queries'; then kill the large queries

KILL QUERY <ConnectionId>;
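Choosing which queries to kill can be scripted once the output of the SHOW PROC statement above is available as rows. The sketch below only builds the KILL statements and does not execute anything. It assumes each row has been parsed into a dict with a connection id and a scan size in bytes; the field names `connection_id` and `scan_bytes`, the sample rows, and the 3 GiB threshold (taken from the lower end of the large scans seen in this incident) are illustrative assumptions.

```python
GIB = 1024 ** 3


def kill_statements(current_queries, min_scan_bytes):
    """Build KILL QUERY statements for queries scanning at least min_scan_bytes, largest first."""
    big = [q for q in current_queries if q["scan_bytes"] >= min_scan_bytes]
    big.sort(key=lambda q: q["scan_bytes"], reverse=True)
    return [f"KILL QUERY {q['connection_id']};" for q in big]


# Hypothetical rows, as if parsed from the current-queries listing.
rows = [
    {"connection_id": 102, "scan_bytes": 200 * 1024 ** 2},
    {"connection_id": 103, "scan_bytes": 3 * GIB},
    {"connection_id": 101, "scan_bytes": 18 * GIB},
]

for stmt in kill_statements(rows, min_scan_bytes=3 * GIB):
    print(stmt)
```

Printing the statements instead of running them leaves the final decision with the operator, which is safer during an incident.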

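Optimization items

Views that join in memory, for example the C-end operations queries: optimization in progress
Trade space for time
  1. Release the new view ads_order_detail_all_gv_view_new_refund_all: went live on 20260604
    1. Build a real-time DWD table that stores order data + order extension table data + gross profit data
  2. Take ads_order_detail_all_gv_view_asym offline: expected next week
Trade time for resources - in progress
  1. Route all queries to the big data platform through ETL as a single entry point: in progress
  2. Once that consolidation is complete, a priority can be set for each business line
    1. Core business: queries run with full resources and full queue capacity
    2. Secondary business: restricted
    3. Internal large queries, such as order details: limit resource usage, raise the timeout (for example from 30 s to 1 min), and limit query frequency. When the frequency limit is exceeded, or a backend query is still running, the front-end page shows a loading spinner and waits.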
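The frequency limit for internal large queries could be enforced with a small gate in front of the query service. The sketch below is one possible shape, not the planned implementation: it allows a fixed number of queries per user in a sliding window and rejects a new query while that user's previous one is still running. The class name `QueryGate`, keying the limit by user, and the numbers used are all assumptions.

```python
import time


class QueryGate:
    """Allow at most max_per_window queries per user per window, one in flight at a time."""

    def __init__(self, max_per_window, window_seconds, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the gate can be tested without sleeping
        self.history = {}           # user -> start times of recent queries
        self.in_flight = set()      # users whose query is still running

    def try_start(self, user):
        """Return True if the query may run now; False means the page should keep waiting."""
        now = self.clock()
        recent = [t for t in self.history.get(user, []) if now - t < self.window_seconds]
        self.history[user] = recent
        if user in self.in_flight or len(recent) >= self.max_per_window:
            return False
        recent.append(now)
        self.in_flight.add(user)
        return True

    def finish(self, user):
        """Mark the user's running query as done."""
        self.in_flight.discard(user)


# Usage with a fake clock: at most 2 queries per 60 seconds for each user.
now = [0.0]
gate = QueryGate(max_per_window=2, window_seconds=60, clock=lambda: now[0])
print(gate.try_start("ops_user"))   # first query is admitted
print(gate.try_start("ops_user"))   # rejected: previous query still running
gate.finish("ops_user")
```

A rejected call maps to the "spinner and wait" behaviour described above: the front end retries later instead of sending another large query to the cluster.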

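Emergency plan

  1. Notify C-end operations to stop running queries for now
  2. Notify operations back-office users to stop querying order details for now
  3. Find the large queries and kill them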