ClickHouse CPU Troubleshooting Guide

📋 Overview

This guide explains how to diagnose and resolve high CPU usage in ClickHouse. High CPU usage is typically caused by excessive query concurrency, slow queries, background tasks, or heavy data ingestion.


🔍 Step 1: Quick Diagnosis

1.1 Check CPU Usage

```bash
# Method 1: view the ClickHouse process with top
top -p $(pgrep clickhouse-server)

# Method 2: use htop (if installed)
htop -p $(pgrep clickhouse-server)

# Method 3: use ps
ps aux | grep clickhouse-server | grep -v grep

# Method 4: view overall system CPU
mpstat -P ALL 1 5
```

Normal: CPU usage < 70%
Warning: CPU usage 70-90%
Danger: CPU usage > 90%
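
The thresholds above can be applied mechanically. A minimal sketch (the `cpu=85.0` reading is a made-up example; in practice take it from top or ps):

```bash
# Classify a CPU% reading against the thresholds above.
# The value of cpu is hypothetical; read it from top/ps in practice.
cpu=85.0
awk -v v="$cpu" 'BEGIN {
    if (v < 70)       print "normal"
    else if (v <= 90) print "warning"
    else              print "danger"
}'
```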

1.2 Check System Load

```bash
# View system load
uptime

# View detailed load information
cat /proc/loadavg

# Monitor load continuously
watch -n 1 'uptime'
```

Normal: load < number of CPU cores
Warning: load between 1x and 2x the core count
Danger: load > 2x the core count
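
A small sketch that applies the load thresholds above automatically (assumes a Linux host, where `nproc` and `/proc/loadavg` are available):

```bash
# Compare the 1-minute load average against the CPU core count.
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN {
    if (l < c)        print "OK: load " l " < " c " cores"
    else if (l < 2*c) print "WARN: load " l " between 1x and 2x cores"
    else              print "CRIT: load " l " above 2x cores"
}'
```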


🔍 Step 2: Check Currently Running Queries

2.1 View All Executing Queries

```bash
clickhouse-client --query "
SELECT 
    query_id,
    user,
    address,
    query,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes,
    formatReadableSize(memory_usage) as memory,
    formatReadableSize(read_bytes/elapsed) as read_speed
FROM system.processes 
WHERE query != ''
ORDER BY elapsed DESC
FORMAT Vertical
"
```

Key fields

  • query_id: query ID, usable to kill the query
  • elapsed: time the query has been running (seconds)
  • read_rows: rows read so far
  • read_bytes: bytes read so far
  • memory_usage: memory consumption

2.2 View Long-Running Queries

```bash
# Queries running for more than 10 seconds
clickhouse-client --query "
SELECT 
    query_id,
    user,
    query,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes
FROM system.processes 
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
"
```

2.3 View Details of a Specific Query

```bash
# Inspect one query in detail
clickhouse-client --query "
SELECT * 
FROM system.processes 
WHERE query_id = 'your-query-id'
FORMAT Vertical
"
```

2.4 Count Concurrent Queries

```bash
# Count currently concurrent queries
clickhouse-client --query "
SELECT 
    count() as concurrent_queries,
    sum(elapsed) as total_elapsed,
    sum(read_rows) as total_read_rows,
    formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.processes 
WHERE query != ''
"
```

Normal: concurrent queries < number of CPU cores
Warning: concurrent queries between 1x and 2x the core count
Danger: concurrent queries > 2x the core count


🔍 Step 3: Check Query History (Slow Queries)

3.1 View Recent Slow Queries

```bash
# Last 100 queries that took more than 1 second
clickhouse-client --query "
SELECT 
    query_id,
    user,
    query,
    query_start_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes,
    formatReadableSize(memory_usage) as memory,
    result_rows,
    formatReadableSize(result_bytes) as result_size
FROM system.query_log 
WHERE type = 2 
  AND query_duration_ms > 1000
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC 
LIMIT 100
FORMAT Vertical
"
```

Key fields

  • type = 2: query finished (1 = QueryStart, 2 = QueryFinish)
  • query_duration_ms: query execution time (milliseconds)
  • read_rows: rows read
  • result_rows: rows returned in the result

3.2 Slow Query Trends

```bash
# Slow queries aggregated per hour
clickhouse-client --query "
SELECT 
    toStartOfHour(event_time) as hour,
    count() as slow_queries,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms,
    min(query_duration_ms) as min_duration_ms,
    sum(read_rows) as total_read_rows,
    formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.query_log 
WHERE type = 2 
  AND event_time > now() - INTERVAL 24 HOUR
  AND query_duration_ms > 1000
GROUP BY hour
ORDER BY hour DESC
"
```

3.3 Find the Most CPU-Intensive Query Patterns

```bash
# Aggregate by query pattern (literal values stripped)
clickhouse-client --query "
SELECT 
    normalizeQuery(query) as query_pattern,
    count() as query_count,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms,
    sum(query_duration_ms) as total_duration_ms,
    sum(read_rows) as total_read_rows
FROM system.query_log 
WHERE type = 2 
  AND event_time > now() - INTERVAL 24 HOUR
  AND query_duration_ms > 1000
GROUP BY query_pattern
ORDER BY total_duration_ms DESC
LIMIT 20
"
```

🔍 Step 4: Check Background Tasks

4.1 View Merge Tasks

```bash
# All merges currently in progress
# (column names per system.merges: num_parts, rows_read, total_size_bytes_compressed)
clickhouse-client --query "
SELECT 
    database,
    table,
    elapsed,
    progress,
    merge_type,
    merge_algorithm,
    num_parts,
    rows_read,
    formatReadableSize(total_size_bytes_compressed) as total_size,
    formatReadableSize(bytes_read_uncompressed) as bytes_read,
    formatReadableSize(bytes_written_uncompressed) as bytes_written
FROM system.merges 
ORDER BY elapsed DESC
FORMAT Vertical
"
```

Key fields

  • elapsed: how long the merge has been running (seconds)
  • progress: merge progress (0-1)
  • merge_type: merge type (Regular, TTLDelete, etc.)
  • total_size_bytes_compressed: compressed size of the parts being merged

Normal: active merges < 5
Warning: active merges 5-15
Danger: active merges > 15

4.2 View Mutation Tasks

```bash
# All mutations currently in progress
clickhouse-client --query "
SELECT 
    database,
    table,
    mutation_id,
    command,
    create_time,
    is_done,
    latest_failed_part,
    latest_fail_time,
    latest_fail_reason
FROM system.mutations 
WHERE is_done = 0
ORDER BY create_time DESC
FORMAT Vertical
"
```

Key fields

  • mutation_id: mutation ID
  • command: command being executed (e.g. ALTER TABLE ... DELETE)
  • is_done: whether the mutation finished (0 = in progress, 1 = done)
  • latest_failed_part: name of the most recent part that failed to mutate

Normal: active mutations = 0
Warning: active mutations 1-3
Danger: active mutations > 3

4.3 Summarize Background Tasks

```bash
# Count active background tasks
clickhouse-client --query "
SELECT 
    'Merges' as task_type,
    count() as active_count,
    sum(elapsed) as total_elapsed,
    formatReadableSize(sum(total_size_bytes_compressed)) as total_size
FROM system.merges
UNION ALL
SELECT 
    'Mutations' as task_type,
    count() as active_count,
    0 as total_elapsed,
    '' as total_size
FROM system.mutations
WHERE is_done = 0
"
```

🔍 Step 5: Check Tables and Partitions

5.1 View Partition Information

```bash
# Partition sizes for all tables
clickhouse-client --query "
SELECT 
    database,
    table,
    partition,
    count() as parts_count,
    sum(rows) as total_rows,
    formatReadableSize(sum(bytes_on_disk)) as total_size,
    min(modification_time) as oldest_part,
    max(modification_time) as newest_part
FROM system.parts 
WHERE active = 1
GROUP BY database, table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 50
"
```

5.2 Check Table Compression

```bash
# Compression codec and ratio per table
# (system.parts has no `format` column; the codec is in default_compression_codec,
#  and the uncompressed size column is data_uncompressed_bytes)
clickhouse-client --query "
SELECT 
    database,
    table,
    default_compression_codec as codec,
    count() as parts_count,
    sum(rows) as total_rows,
    formatReadableSize(sum(bytes_on_disk)) as compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(bytes_on_disk), 2) as compression_ratio
FROM system.parts 
WHERE active = 1
GROUP BY database, table, codec
ORDER BY sum(bytes_on_disk) DESC
LIMIT 30
"
```

Compression codecs

  • LZ4: fast, low CPU cost, moderate ratio (recommended default)
  • ZSTD: higher compression ratio, but higher CPU cost
  • ZSTD(1-3): low-level ZSTD, a balance between ratio and CPU
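
The compression_ratio column in the query above is simply uncompressed bytes divided by bytes on disk. For example (sizes here are hypothetical):

```bash
# 10 GiB uncompressed stored as 2 GiB on disk -> ratio 5.00
awk 'BEGIN { printf "%.2f\n", 10737418240 / 2147483648 }'
```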

5.3 Find Large Partitions

```bash
# Largest partitions
clickhouse-client --query "
SELECT 
    database,
    table,
    partition,
    count() as parts,
    sum(rows) as rows,
    formatReadableSize(sum(bytes_on_disk)) as size,
    max(modification_time) as last_modified
FROM system.parts 
WHERE active = 1
GROUP BY database, table, partition
HAVING sum(bytes_on_disk) > 10737418240  -- larger than 10 GB
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20
"
```

🔍 Step 6: Check Server Configuration

6.1 View Concurrency-Related Settings

```bash
# All concurrency-related settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.settings 
WHERE name LIKE '%concurrent%' 
   OR name LIKE '%thread%'
   OR name LIKE '%pool%'
ORDER BY name
"
```

Key settings

  • max_concurrent_queries: maximum concurrent queries; a server-level setting in config.xml (suggested: number of CPU cores)
  • max_thread_pool_size: global thread pool size (suggested: 2x CPU cores)
  • max_insert_threads: insert threads per query (suggested: 4-8)
  • background_pool_size: background task threads (suggested: 16)
6.2 View Query Limit Settings

```bash
# Query limit settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.settings 
WHERE name LIKE '%max_%' 
  AND (name LIKE '%query%' 
       OR name LIKE '%memory%' 
       OR name LIKE '%time%'
       OR name LIKE '%rows%'
       OR name LIKE '%bytes%')
ORDER BY name
"
```

Key settings

  • max_execution_time: maximum query execution time (seconds)
  • max_memory_usage: maximum memory per query (bytes)
  • max_rows_to_read: maximum rows to read
  • max_bytes_to_read: maximum bytes to read

6.3 View Merge-Related Settings

```bash
# Merge settings live in system.merge_tree_settings, not system.settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.merge_tree_settings 
WHERE name LIKE '%merge%'
ORDER BY name
"
```

Key settings

  • max_bytes_to_merge_at_max_space_in_pool: maximum size of a single merge (suggested: ~150 GB)
  • max_replicated_merges_in_queue: replicated merge queue size (suggested: 16)

🛠️ Solutions

Solution 1: Limit Query Concurrency

Problem: too many concurrent queries
```bash
# 1. Check the current concurrency
clickhouse-client --query "SELECT count() FROM system.processes WHERE query != ''"

# 2. max_concurrent_queries is a server-level setting: it cannot be changed
#    with SET at the session level. Set it at the top level of config.xml:
# <max_concurrent_queries>8</max_concurrent_queries>

# 3. Or kill long-running queries
clickhouse-client --query "KILL QUERY WHERE query_id='xxx'"
```

Solution 2: Optimize Slow Queries

Problem: slow queries driving CPU high
```bash
# 1. Find the slow queries
clickhouse-client --query "
SELECT query_id, query, query_duration_ms, read_rows
FROM system.query_log 
WHERE type=2 AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC 
LIMIT 10
"

# 2. Analyze the query plan
clickhouse-client --query "EXPLAIN SELECT ..."  # replace with the actual query

# 3. Optimization guidelines:
# - Filter on the ORDER BY (primary key) columns so the index is used
# - Tighten WHERE conditions to reduce the data scanned
# - Use PREWHERE instead of WHERE where applicable
# - Select fewer columns
# - Use LIMIT to cap the result set
# - Avoid full table scans

# 4. Set a query timeout (per session; put it in the user profile to persist)
clickhouse-client --query "SET max_execution_time=60"  # 60-second timeout
```

Solution 3: Tune Background Tasks

Problem: too many merges
```bash
# 1. Check merge tasks
clickhouse-client --query "SELECT * FROM system.merges ORDER BY elapsed DESC"

# 2. Adjust merge settings in config.xml
#    (max_bytes_to_merge_at_max_space_in_pool is a MergeTree setting;
#     background_pool_size is a server-level setting outside <merge_tree>)
# <merge_tree>
#     <max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool>
# </merge_tree>
# <background_pool_size>16</background_pool_size>

# 3. Wait for merges to finish, or trigger one manually (use with care)
clickhouse-client --query "OPTIMIZE TABLE database.table FINAL"
```
Problem: a mutation is stuck
```bash
# 1. Check mutation tasks
clickhouse-client --query "SELECT * FROM system.mutations WHERE is_done=0"

# 2. A stuck mutation can be cancelled (use with care; data may be left partially mutated)
# clickhouse-client --query "KILL MUTATION WHERE database='xxx' AND table='xxx' AND mutation_id='xxx'"

# 3. Avoid running large batches of mutations during peak hours
```

Solution 4: Choose a Cheaper Compression Codec

Problem: the compression codec is CPU-intensive
```bash
# 1. Check which codecs are in use
clickhouse-client --query "
SELECT default_compression_codec as codec, count() as parts, formatReadableSize(sum(bytes_on_disk)) as size
FROM system.parts 
WHERE active=1 
GROUP BY codec
"

# 2. If ZSTD is in use, consider LZ4 for new tables (per-column codec shown)
CREATE TABLE new_table (
    column_name String CODEC(LZ4)
) ENGINE = MergeTree()
ORDER BY ...
SETTINGS index_granularity = 8192;

# 3. Or lower the ZSTD level for new tables
CREATE TABLE new_table (
    column_name String CODEC(ZSTD(1))  -- lower compression level
) ENGINE = MergeTree()
ORDER BY ...;

# 4. Existing tables must be rewritten (be careful with large data volumes)
ALTER TABLE table_name MODIFY COLUMN column_name String CODEC(LZ4);
```

Solution 5: Tune Data Ingestion

Problem: ingestion is driving CPU high
```bash
# 1. Check running INSERT queries
clickhouse-client --query "SELECT query_id, query, elapsed FROM system.processes WHERE query LIKE '%INSERT%'"

# 2. Reduce insert parallelism
clickhouse-client --query "SET max_insert_threads=4"

# 3. Batch inserts
# Bad:  INSERT INTO table VALUES (...); INSERT INTO table VALUES (...);
# Good: INSERT INTO table VALUES (...), (...), (...);

# 4. Use asynchronous inserts (there is no INSERT ... ASYNC syntax;
#    enable the async_insert setting and run a normal INSERT)
# SET async_insert = 1;
# INSERT INTO table VALUES (...);
```
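
The batching advice above can be sketched as assembling one multi-row INSERT from individual rows (the table name `t` and columns `id, s` are made up for illustration):

```bash
# Turn newline-separated value tuples into a single batched INSERT statement.
rows="1,'a'
2,'b'
3,'c'"
values=$(printf '%s\n' "$rows" | sed 's/.*/(&)/' | paste -sd, -)
echo "INSERT INTO t (id, s) VALUES $values"
```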

Solution 6: Improve the Table Schema

Problem: a poorly designed schema makes queries slow
```bash
# 1. Check whether the ORDER BY key makes sense
clickhouse-client --query "SHOW CREATE TABLE database.table_name"

# 2. Guidelines:
# - The ORDER BY key should contain the columns most queries filter on
# - Add data-skipping indexes where appropriate
# - Pre-aggregate with materialized views
# - Pick a suitable engine (MergeTree vs ReplacingMergeTree, etc.)

# 3. Check the partition key
# The partition key should be a column frequently used for filtering (e.g. a date)
```

📊 Monitoring Script

Create a CPU Monitoring Script

```bash
#!/bin/bash
# clickhouse_cpu_monitor.sh

echo "=== ClickHouse CPU Monitor ==="
echo ""

# 1. System CPU usage
#    (field positions in the top output depend on the top version and locale)
echo "1. System CPU usage:"
top -bn1 | grep "Cpu(s)" | awk '{print "user: " $2 "%", "system: " $4 "%", "idle: " $8 "%"}'
echo ""

# 2. ClickHouse process CPU
echo "2. ClickHouse process CPU:"
CH_PID=$(pgrep clickhouse-server)
if [ -n "$CH_PID" ]; then
    top -bn1 -p "$CH_PID" | tail -1 | awk '{print "CPU: " $9 "%", "MEM: " $10 "%"}'
else
    echo "clickhouse-server is not running"
fi
echo ""

# 3. Concurrent queries
echo "3. Concurrent queries:"
clickhouse-client --query "SELECT count() as concurrent_queries FROM system.processes WHERE query != ''"
echo ""

# 4. Long-running queries
echo "4. Long-running queries (>10s):"
clickhouse-client --query "
SELECT 
    query_id,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes
FROM system.processes 
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
LIMIT 5
"
echo ""

# 5. Active merges
echo "5. Active merges:"
clickhouse-client --query "SELECT count() as active_merges FROM system.merges"
echo ""

# 6. Active mutations
echo "6. Active mutations:"
clickhouse-client --query "SELECT count() as active_mutations FROM system.mutations WHERE is_done=0"
echo ""

# 7. Slow queries in the last hour
echo "7. Slow queries in the last hour (>1s):"
clickhouse-client --query "
SELECT 
    count() as slow_queries,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms
FROM system.query_log 
WHERE type=2 
  AND event_time > now() - INTERVAL 1 HOUR
  AND query_duration_ms > 1000
"
```

Save this as clickhouse_cpu_monitor.sh, then:

```bash
chmod +x clickhouse_cpu_monitor.sh
./clickhouse_cpu_monitor.sh
```
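
To run the monitor on a schedule, a cron entry along these lines works; the script and log paths below are placeholders, adjust them to your layout:

```bash
# Hypothetical crontab entry: run every 5 minutes, append output to a log
*/5 * * * * /opt/scripts/clickhouse_cpu_monitor.sh >> /var/log/ch_cpu_monitor.log 2>&1
```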

✅ Troubleshooting Checklist

  • Check system CPU usage (top, htop)
  • Check currently running queries (system.processes)
  • Check slow query history (system.query_log)
  • Check background tasks (system.merges, system.mutations)
  • Check table and partition state (system.parts)
  • Check server settings (system.settings)
  • Analyze query plans (EXPLAIN)
  • Optimize slow SQL
  • Tune concurrency settings
  • Choose an appropriate compression codec
  • Keep monitoring CPU usage

💡 Best Practices

  1. Monitor regularly: schedule a job to track CPU usage
  2. Analyze slow queries: review the slow query log periodically and optimize
  3. Configure sensibly: size concurrency limits to match the hardware
  4. Pick compression wisely: choose the codec per workload (LZ4 vs ZSTD)
  5. Avoid peak hours: do not run heavy background tasks during business peaks
  6. Design schemas carefully: pick a sensible ORDER BY key
  7. Use materialized views: pre-aggregate data to cut query CPU cost

Remember: high CPU usage is usually a query problem. Check the queries first, then the configuration! 🚀
