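ClickHouse CPU Troubleshooting Guide
📋 Overview
This guide explains how to diagnose and resolve high CPU usage in ClickHouse. High CPU usage is usually caused by too many concurrent queries, slow queries, background tasks (merges and mutations), or heavy data ingestion.
🔍 Step 1: Quick Diagnosis
1.1 Check CPU usage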
bash
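# Method 1: watch the ClickHouse processes with top
# (pgrep -f matches the full command line; plain pgrep compares only the first
# 15 characters of the process name, so the 17-character "clickhouse-server" never matches)
top -p "$(pgrep -d, -f clickhouse-server)"
# Method 2: use htop (if installed)
htop -p "$(pgrep -d, -f clickhouse-server)"
# Method 3: use ps
ps aux | grep clickhouse-server | grep -v grep
# Method 4: view overall system CPU, per core
mpstat -P ALL 1 5
Normal: CPU usage < 70%
Warning: CPU usage 70-90%
Critical: CPU usage > 90%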
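The thresholds above can be turned into a small helper for scripts or alerts. This is a minimal sketch: the function name classify_cpu is hypothetical, and it assumes the input is an overall percentage on a 0-100 scale (the per-process figure in top can exceed 100 on multi-core machines and would need to be divided by the core count first).

```shell
# Classify a CPU usage percentage using the thresholds from this section:
# below 70 is normal, 70 to 90 is a warning, above 90 is critical.
classify_cpu() {
    local pct="$1"
    # awk is used so that fractional values such as 85.5 compare correctly
    awk -v p="$pct" 'BEGIN {
        if (p > 90)       print "critical"
        else if (p >= 70) print "warning"
        else              print "normal"
    }'
}

classify_cpu 42.0   # prints: normal
classify_cpu 85.5   # prints: warning
classify_cpu 97     # prints: critical
```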
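1.2 Check system load
bash
# Show the load averages
uptime
# Show the raw load figures
cat /proc/loadavg
# Monitor the load continuously
watch -n 1 'uptime'
Normal: load < number of CPU cores
Warning: load between 1x and 2x the number of CPU cores
Critical: load > 2x the number of CPU cores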
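Because the load thresholds are relative to the core count, a check has to compare the two values. The sketch below shows one way to do that; classify_load is a hypothetical helper, and it falls back to nproc when no core count is given.

```shell
# Classify a load average relative to the number of CPU cores:
# below 1x cores is normal, 1x to 2x is a warning, above 2x is critical.
classify_load() {
    local load="$1"
    local cores="${2:-$(nproc)}"   # default to this machine's core count
    awk -v l="$load" -v c="$cores" 'BEGIN {
        if (l > 2 * c)   print "critical"
        else if (l >= c) print "warning"
        else             print "normal"
    }'
}

classify_load 3.2 8    # prints: normal
classify_load 12.5 8   # prints: warning
classify_load 20 8     # prints: critical
```

On a live host the 1-minute load average is the first field of /proc/loadavg, so a call could look like classify_load "$(cut -d' ' -f1 /proc/loadavg)".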
🔍 Step 2: Check Currently Running Queries
2.1 List all running queries
bash
clickhouse-client --query "
SELECT
query_id,
user,
address,
query,
elapsed,
read_rows,
formatReadableSize(read_bytes) as read_bytes,
formatReadableSize(memory_usage) as memory,
formatReadableSize(read_bytes/elapsed) as read_speed
FROM system.processes
WHERE query != ''
ORDER BY elapsed DESC
FORMAT Vertical
"
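Key fields:
- query_id: query ID, usable with KILL QUERY
- elapsed: time the query has been running (seconds)
- read_rows: rows read so far
- read_bytes: bytes read so far
- memory_usage: memory currently in use
2.2 List long-running queries
bash
# Queries that have been running for more than 10 seconds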
clickhouse-client --query "
SELECT
query_id,
user,
query,
elapsed,
read_rows,
formatReadableSize(read_bytes) as read_bytes
FROM system.processes
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
"
2.3 Inspect a single query
bash
# Show every field for one query (replace your-query-id)
clickhouse-client --query "
SELECT *
FROM system.processes
WHERE query_id = 'your-query-id'
FORMAT Vertical
"
2.4 Count concurrent queries
bash
# Count the queries currently running
clickhouse-client --query "
SELECT
count() as concurrent_queries,
sum(elapsed) as total_elapsed,
sum(read_rows) as total_read_rows,
formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.processes
WHERE query != ''
"
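Normal: concurrent queries < number of CPU cores
Warning: concurrent queries between 1x and 2x the number of CPU cores
Critical: concurrent queries > 2x the number of CPU cores
🔍 Step 3: Check Query History (Slow Queries)
3.1 List recent slow queries
bash
# The 100 slowest queries from the last hour that took longer than 1 second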
clickhouse-client --query "
SELECT
query_id,
user,
query,
query_start_time,
query_duration_ms,
read_rows,
formatReadableSize(read_bytes) as read_bytes,
formatReadableSize(memory_usage) as memory,
result_rows,
formatReadableSize(result_bytes) as result_size
FROM system.query_log
WHERE type = 2
AND query_duration_ms > 1000
AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 100
FORMAT Vertical
"
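Key fields:
- type = 2: the query finished (QueryFinish); type = 1 marks the query start (QueryStart)
- query_duration_ms: query execution time (milliseconds)
- read_rows: number of rows read
- result_rows: number of rows returned
3.2 Track the slow-query trend
bash
# Slow queries per hour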
clickhouse-client --query "
SELECT
toStartOfHour(event_time) as hour,
count() as slow_queries,
avg(query_duration_ms) as avg_duration_ms,
max(query_duration_ms) as max_duration_ms,
min(query_duration_ms) as min_duration_ms,
sum(read_rows) as total_read_rows,
formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.query_log
WHERE type = 2
AND event_time > now() - INTERVAL 24 HOUR
AND query_duration_ms > 1000
GROUP BY hour
ORDER BY hour DESC
"
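3.3 Find the most CPU-expensive query patterns
bash
# Group by query pattern (literal values stripped); total duration serves as a proxy for CPU cost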
clickhouse-client --query "
SELECT
normalizeQuery(query) as query_pattern,
count() as query_count,
avg(query_duration_ms) as avg_duration_ms,
max(query_duration_ms) as max_duration_ms,
sum(query_duration_ms) as total_duration_ms,
sum(read_rows) as total_read_rows
FROM system.query_log
WHERE type = 2
AND event_time > now() - INTERVAL 24 HOUR
AND query_duration_ms > 1000
GROUP BY query_pattern
ORDER BY total_duration_ms DESC
LIMIT 20
"
🔍 Step 4: Check Background Tasks
4.1 List merges
bash
# List all merges currently in progress
clickhouse-client --query "
SELECT
database,
table,
elapsed,
progress,
merge_type,
merge_algorithm,
num_parts,
rows_read,
formatReadableSize(total_size_bytes_compressed) as total_size,
formatReadableSize(bytes_read_uncompressed) as bytes_read,
formatReadableSize(bytes_written_uncompressed) as bytes_written
FROM system.merges
ORDER BY elapsed DESC
FORMAT Vertical
"
Key fields:
- elapsed: time the merge has been running (seconds)
- progress: merge progress (0 to 1)
- merge_type: type of merge (Regular, TTLDelete, etc.)
- total_size_bytes_compressed: total compressed size of the parts being merged
Normal: active merges < 5
Warning: active merges 5-15
Critical: active merges > 15
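The elapsed and progress fields can be combined into a rough estimate of how long a merge still has to run. The sketch below assumes progress advances roughly linearly, which is only an approximation; estimate_merge_remaining is a hypothetical helper name.

```shell
# Estimate the remaining seconds of a merge from its elapsed time and progress (0 to 1).
# Assumption: progress is roughly linear, so remaining = elapsed * (1 - progress) / progress.
estimate_merge_remaining() {
    local elapsed="$1" progress="$2"
    awk -v e="$elapsed" -v p="$progress" 'BEGIN {
        if (p <= 0) { print "unknown"; exit }   # no progress yet, nothing to extrapolate
        printf "%.0f\n", e * (1 - p) / p
    }'
}

estimate_merge_remaining 120 0.4   # prints: 180
```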
4.2 List mutations
bash
# List all unfinished mutations
clickhouse-client --query "
SELECT
database,
table,
mutation_id,
command,
create_time,
is_done,
latest_failed_part,
latest_fail_time,
latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time DESC
FORMAT Vertical
"
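Key fields:
- mutation_id: mutation ID
- command: the command being executed (for example ALTER TABLE ... DELETE)
- is_done: whether the mutation has finished (0 = in progress, 1 = done)
- latest_failed_part: name of the most recent part that failed to mutate
Normal: active mutations = 0
Warning: active mutations 1-3
Critical: active mutations > 3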
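The merge and mutation thresholds follow the same pattern (a warning band with a lower and upper bound), so a single helper can cover both. A minimal sketch, with classify_count as a hypothetical name:

```shell
# Classify a task count against a warning band [warn_from, warn_to]:
# below warn_from is normal, inside the band is a warning, above warn_to is critical.
classify_count() {
    local value="$1" warn_from="$2" warn_to="$3"
    if   (( value > warn_to ));    then echo "critical"
    elif (( value >= warn_from )); then echo "warning"
    else                                echo "normal"
    fi
}

classify_count 7 5 15   # merges, band 5-15: prints warning
classify_count 0 1 3    # mutations, band 1-3: prints normal
```

On a live server the first argument would come from a count() query against system.merges or system.mutations.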
4.3 Summarize background tasks
bash
# Count the active background tasks
clickhouse-client --query "
SELECT
'Merges' as task_type,
count() as active_count,
sum(elapsed) as total_elapsed,
formatReadableSize(sum(total_size_bytes_compressed)) as total_size
FROM system.merges
UNION ALL
SELECT
'Mutations' as task_type,
count() as active_count,
0 as total_elapsed,
'' as total_size
FROM system.mutations
WHERE is_done = 0
"
🔍 Step 5: Check Tables and Partitions
5.1 View partition information
bash
# The 50 largest partitions across all tables
clickhouse-client --query "
SELECT
database,
table,
partition,
count() as parts_count,
sum(rows) as total_rows,
formatReadableSize(sum(bytes_on_disk)) as total_size,
min(modification_time) as oldest_part,
max(modification_time) as newest_part
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 50
"
5.2 View table compression
bash
# Show each table's default compression codec and compression ratio
clickhouse-client --query "
SELECT
database,
table,
default_compression_codec,
count() as parts_count,
sum(rows) as total_rows,
formatReadableSize(sum(bytes_on_disk)) as compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(bytes_on_disk), 2) as compression_ratio
FROM system.parts
WHERE active = 1
GROUP BY database, table, default_compression_codec
ORDER BY sum(bytes_on_disk) DESC
LIMIT 30
"
Compression codecs:
- LZ4: fast, low CPU cost, moderate compression ratio (recommended)
- ZSTD: higher compression ratio, but higher CPU cost
- ZSTD(1-3): low ZSTD levels that balance compression ratio and CPU cost
5.3 Find large partitions
bash
# Find the largest partitions
clickhouse-client --query "
SELECT
database,
table,
partition,
count() as parts,
sum(rows) as rows,
formatReadableSize(sum(bytes_on_disk)) as size,
max(modification_time) as last_modified
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition
HAVING sum(bytes_on_disk) > 10737418240 -- larger than 10 GiB
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20
"
🔍 Step 6: Check System Configuration
6.1 View concurrency-related settings
bash
# List all concurrency-related settings
clickhouse-client --query "
SELECT
name,
value,
description
FROM system.settings
WHERE name LIKE '%concurrent%'
OR name LIKE '%thread%'
OR name LIKE '%pool%'
ORDER BY name
"
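Key settings (system.settings lists only query-level settings; server-level ones such as max_concurrent_queries live in config.xml and, in recent releases, appear in system.server_settings):
- max_concurrent_queries: maximum number of concurrent queries (recommended: number of CPU cores)
- max_thread_pool_size: thread pool size (recommended: 2x the number of CPU cores)
- max_insert_threads: number of insert threads (recommended: 4-8)
- background_pool_size: number of background task threads (recommended: 16)
6.2 View query limit settings
bash
# List query limit settings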
clickhouse-client --query "
SELECT
name,
value,
description
FROM system.settings
WHERE name LIKE '%max_%'
AND (name LIKE '%query%'
OR name LIKE '%memory%'
OR name LIKE '%time%'
OR name LIKE '%rows%'
OR name LIKE '%bytes%')
ORDER BY name
"
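Key settings:
- max_execution_time: maximum query execution time (seconds)
- max_memory_usage: maximum memory for a single query (bytes)
- max_rows_to_read: maximum number of rows a query may read
- max_bytes_to_read: maximum number of bytes a query may read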
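Query-level limits like these can be passed to clickhouse-client as command-line options of the same name, which applies them to that one invocation. The sketch below only assembles the flags; build_limit_flags is a hypothetical helper and the values are illustrative, not recommendations.

```shell
# Build clickhouse-client flags that cap execution time (seconds) and memory (bytes)
# for a single invocation.
build_limit_flags() {
    local seconds="$1" max_mem_bytes="$2"
    printf -- '--max_execution_time=%s --max_memory_usage=%s\n' "$seconds" "$max_mem_bytes"
}

build_limit_flags 60 10000000000
# prints: --max_execution_time=60 --max_memory_usage=10000000000

# Usage against a running server (not executed here):
# clickhouse-client $(build_limit_flags 60 10000000000) --query "SELECT 1"
```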
6.3 View merge-related settings
bash
# List merge-related settings
clickhouse-client --query "
SELECT
name,
value,
description
FROM system.merge_tree_settings
WHERE name LIKE '%merge%'
ORDER BY name
"
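Key settings:
- max_bytes_to_merge_at_max_space_in_pool: maximum total size of parts merged in one operation (recommended: 150 GB)
- max_replicated_merges_in_queue: size of the replicated merge queue (recommended: 16)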
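Size settings such as max_bytes_to_merge_at_max_space_in_pool take a value in bytes, so the recommended 150 GB has to be converted first. A small sketch of that arithmetic, treating GB as GiB (1024^3 bytes); gib_to_bytes is a hypothetical helper name.

```shell
# Convert a whole number of GiB to bytes using shell integer arithmetic.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 150   # prints: 161061273600
gib_to_bytes 10    # prints: 10737418240
```

These are the same byte values used elsewhere in this guide for the 150 GB merge limit and the 10 GB partition filter.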
🛠️ Solutions
Solution 1: Reduce query concurrency
Problem: too many concurrent queries
bash
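# 1. Check the current number of concurrent queries
clickhouse-client --query "SELECT count() FROM system.processes WHERE query != ''"
# 2. max_concurrent_queries is a server-level setting and cannot be changed with SET.
#    Set it at the top level of config.xml (or in a file under config.d/):
# <clickhouse>
# <max_concurrent_queries>8</max_concurrent_queries>
# </clickhouse>
# 3. To cap a single user instead, set max_concurrent_queries_for_user in that user's profile (users.xml):
# <profiles>
# <default>
# <max_concurrent_queries_for_user>8</max_concurrent_queries_for_user>
# </default>
# </profiles>
# 4. Or kill long-running queries
clickhouse-client --query "KILL QUERY WHERE query_id='xxx'"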
Solution 2: Optimize slow queries
Problem: slow queries drive CPU usage up
bash
# 1. Find the slow queries
clickhouse-client --query "
SELECT query_id, query, query_duration_ms, read_rows
FROM system.query_log
WHERE type=2 AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC
LIMIT 10
"
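# 2. Analyze the query execution plan
clickhouse-client --query "EXPLAIN SELECT ..." # replace with the actual query
# 3. Optimization tips:
# - Filter on the table's sorting key (the ORDER BY columns), which acts as the primary index
# - Tighten WHERE conditions to reduce the amount of data scanned
# - Use PREWHERE instead of WHERE where possible
# - Select fewer columns
# - Use LIMIT to cap the result size
# - Avoid full table scans
# 4. Set a query timeout. A SET issued in its own clickhouse-client call does not persist,
#    so pass the limit with the query or put it in the user profile
clickhouse-client --max_execution_time=60 --query "SELECT ..." # 60-second timeout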
Solution 3: Tune background tasks
Problem: too many merges
bash
# 1. List the merges
clickhouse-client --query "SELECT * FROM system.merges ORDER BY elapsed DESC"
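# 2. Adjust the merge settings in config.xml
#    (in recent releases background_pool_size is a server-level setting, not a merge_tree one)
# <background_pool_size>16</background_pool_size>
# <merge_tree>
# <max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool>
# </merge_tree>
# 3. Wait for merges to finish. Forcing a merge adds CPU load, so use it sparingly and off-peak
clickhouse-client --query "OPTIMIZE TABLE database.table FINAL"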
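Problem: stuck mutations
bash
# 1. List the unfinished mutations
clickhouse-client --query "SELECT * FROM system.mutations WHERE is_done=0"
# 2. A stuck mutation can be cancelled (use with care: parts already mutated are not rolled back, which may leave data inconsistent)
# clickhouse-client --query "KILL MUTATION WHERE database='xxx' AND table='xxx' AND mutation_id='xxx'"
# 3. Avoid running heavy mutations during peak hours
Solution 4: Tune compression
Problem: the compression codec is CPU-intensive
bash
# 1. Check which compression codecs are in use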
clickhouse-client --query "
SELECT default_compression_codec, count() as parts, formatReadableSize(sum(bytes_on_disk)) as size
FROM system.parts
WHERE active=1
GROUP BY default_compression_codec
"
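# 2. If ZSTD is in use, consider LZ4 for new tables (SQL, run inside clickhouse-client;
#    LZ4 is the default codec unless the server's compression config says otherwise)
CREATE TABLE new_table (
...
) ENGINE = MergeTree()
ORDER BY ...
SETTINGS index_granularity = 8192;
# 3. Or lower the ZSTD compression level (new tables)
CREATE TABLE new_table (
column_name String CODEC(ZSTD(1)) -- lower compression level
) ENGINE = MergeTree()
ORDER BY ...;
# 4. For existing tables, change the column codec. New parts use it immediately, while existing
#    parts are recompressed only when they are next merged (take care with large tables)
ALTER TABLE table_name MODIFY COLUMN column_name String CODEC(LZ4);
Solution 5: Optimize data ingestion
Problem: data ingestion drives CPU usage up
bash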
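# 1. List the running inserts
clickhouse-client --query "SELECT query_id, query, elapsed FROM system.processes WHERE query LIKE '%INSERT%'"
# 2. Reduce insert parallelism. The setting must travel with the INSERT itself
#    (a SET in a separate clickhouse-client call does not persist):
# clickhouse-client --max_insert_threads=4 --query "INSERT INTO ... SELECT ..."
# 3. Insert in batches
# Wrong: INSERT INTO table VALUES (...); INSERT INTO table VALUES (...);
# Right: INSERT INTO table VALUES (...), (...), (...);
# 4. Use asynchronous inserts (if the server version supports them)
# INSERT INTO table SETTINGS async_insert=1, wait_for_async_insert=1 VALUES (...)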
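The batching advice in step 3 amounts to joining many row tuples into a single statement. The sketch below shows that idea for rows that are already formatted and escaped; build_batch_insert and the table name events are hypothetical.

```shell
# Join pre-formatted row tuples into one multi-row INSERT statement.
# Assumption: each row argument is already a valid, escaped VALUES tuple.
build_batch_insert() {
    local table="$1"; shift
    local joined="" row
    for row in "$@"; do
        # add ", " between rows, but not before the first one
        joined+="${joined:+, }${row}"
    done
    printf 'INSERT INTO %s VALUES %s;\n' "$table" "$joined"
}

build_batch_insert events "(1, 'a')" "(2, 'b')" "(3, 'c')"
# prints: INSERT INTO events VALUES (1, 'a'), (2, 'b'), (3, 'c');
```

One statement with many rows creates one part instead of many, which means fewer merges and less background CPU work.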
Solution 6: Improve table design
Problem: poor table design makes queries slow
bash
# 1. Check whether the ORDER BY key is appropriate
clickhouse-client --query "SHOW CREATE TABLE database.table_name"
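# 2. Recommendations:
# - ORDER BY columns should be the ones most often used in filters
# - Define a suitable primary index (the ORDER BY columns)
# - Use materialized views to pre-aggregate data
# - Choose a suitable table engine (MergeTree vs ReplacingMergeTree, etc.)
# 3. Check the partition key
# The partition key should be a column that is frequently used in filters (such as a date)
📊 Monitoring Script
Create a CPU monitoring script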
bash
#!/bin/bash
# clickhouse_cpu_monitor.sh
echo "=== ClickHouse CPU Monitor ==="
echo ""
# 1. System CPU usage
echo "1. System CPU usage:"
top -bn1 | grep "Cpu(s)" | awk '{print "user: " $2 "%", "system: " $4 "%", "idle: " $8 "%"}'
echo ""
# 2. ClickHouse process CPU
echo "2. ClickHouse process CPU:"
# -f matches the full command line (the name exceeds pgrep's 15-character limit); -n picks the newest match
CH_PID=$(pgrep -n -f clickhouse-server)
if [ -n "$CH_PID" ]; then
top -bn1 -p "$CH_PID" | tail -1 | awk '{print "CPU: " $9 "%", "memory: " $10 "%"}'
else
echo "ClickHouse process is not running"
fi
echo ""
# 3. Concurrent queries
echo "3. Concurrent queries:"
clickhouse-client --query "SELECT count() as concurrent_queries FROM system.processes WHERE query != ''"
echo ""
# 4. Long-running queries
echo "4. Long-running queries (>10 s):"
clickhouse-client --query "
SELECT
query_id,
elapsed,
read_rows,
formatReadableSize(read_bytes) as read_bytes
FROM system.processes
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
LIMIT 5
"
echo ""
# 5. Active merges
echo "5. Active merges:"
clickhouse-client --query "SELECT count() as active_merges FROM system.merges"
echo ""
# 6. Active mutations
echo "6. Active mutations:"
clickhouse-client --query "SELECT count() as active_mutations FROM system.mutations WHERE is_done=0"
echo ""
# 7. Slow-query statistics for the last hour
echo "7. Slow queries in the last hour (>1 s):"
clickhouse-client --query "
SELECT
count() as slow_queries,
avg(query_duration_ms) as avg_duration_ms,
max(query_duration_ms) as max_duration_ms
FROM system.query_log
WHERE type=2
AND event_time > now() - INTERVAL 1 HOUR
AND query_duration_ms > 1000
"
Save it as clickhouse_cpu_monitor.sh, then:
bash
chmod +x clickhouse_cpu_monitor.sh
./clickhouse_cpu_monitor.sh
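✅ Troubleshooting Checklist
- Check system CPU usage (top, htop)
- Check currently running queries (system.processes)
- Check slow-query history (system.query_log)
- Check background tasks (system.merges, system.mutations)
- Check table and partition state (system.parts)
- Check system configuration (system.settings)
- Analyze query execution plans (EXPLAIN)
- Optimize slow SQL queries
- Adjust concurrency settings
- Tune compression codecs
- Monitor CPU usage
💡 Best Practices
- Monitor regularly: schedule a job that tracks CPU usage
- Analyze slow queries: review the slow-query log regularly and optimize the offenders
- Configure sensibly: size concurrency limits to the server's hardware
- Choose compression deliberately: pick the codec that fits the workload (LZ4 vs ZSTD)
- Avoid peak hours: do not run heavy background tasks during business peaks
- Design tables well: choose ORDER BY columns that match the query filters
- Use materialized views: pre-aggregate data to cut query CPU cost
Remember: high CPU usage is usually a query problem, so check the queries first, then the configuration! 🚀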