ClickHouse CPU Troubleshooting Quick Reference
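🚨 Quick Diagnosis Table
| Symptom | Likely cause | Check command | Normal | Abnormal |
| --- | --- | --- | --- | --- |
| High CPU usage | Too many concurrent queries | top -p "$(pgrep -d, clickhouse)" | CPU < 70% | CPU > 90% |
| High CPU usage | Overly complex queries | clickhouse-client --query "SELECT query_id, query, elapsed, read_rows FROM system.processes" | Query time < 1s | Query time > 10s |
| High CPU usage | Too many threads | clickhouse-client --query "SELECT count() FROM system.processes" | Concurrent queries < 10 | Concurrent queries > 50 |
| High CPU usage | CPU-intensive compression codec | clickhouse-client --query "SELECT default_compression_codec, sum(data_compressed_bytes) FROM system.parts WHERE active GROUP BY default_compression_codec" | LZ4 | ZSTD (high CPU) |
| High CPU usage | Background merges | clickhouse-client --query "SELECT * FROM system.merges" | Merges < 5 | Merges > 20 |
| High CPU usage | Index builds | clickhouse-client --query "SELECT * FROM system.mutations" | None in progress | Some in progress |
| High CPU usage | Data ingestion | iostat -x 1 | %util < 70% | %util > 90% |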
📋 Common Troubleshooting Commands
1. View currently running queries
# List all queries that are executing right now
clickhouse-client --query "SELECT query_id, user, address, query, elapsed, read_rows, read_bytes, memory_usage, formatReadableSize(memory_usage) AS memory FROM system.processes ORDER BY elapsed DESC"
# Show per-query details, including read throughput
# (aliases must not reuse the column name read_bytes, or read_bytes/elapsed would resolve to the formatted string)
clickhouse-client --query "SELECT query_id, query, elapsed, read_rows, formatReadableSize(read_bytes) AS read_size, formatReadableSize(memory_usage) AS memory, formatReadableSize(read_bytes/elapsed) AS read_speed FROM system.processes WHERE query != '' ORDER BY elapsed DESC FORMAT Vertical"
# List queries that have been running for more than 10 seconds
clickhouse-client --query "SELECT query_id, user, query, elapsed, read_rows FROM system.processes WHERE elapsed > 10 ORDER BY elapsed DESC"
2. View query history (slow queries)
# Show the 100 slowest finished queries (longer than 1 second)
clickhouse-client --query "SELECT query_id, user, query, query_start_time, query_duration_ms, read_rows, read_bytes, memory_usage, formatReadableSize(memory_usage) AS memory FROM system.query_log WHERE type = 'QueryFinish' AND query_duration_ms > 1000 ORDER BY query_duration_ms DESC LIMIT 100 FORMAT Vertical"
# Summarize slow queries over the last hour
clickhouse-client --query "SELECT toStartOfHour(event_time) AS hour, count() AS slow_queries, avg(query_duration_ms) AS avg_duration_ms, max(query_duration_ms) AS max_duration_ms FROM system.query_log WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR AND query_duration_ms > 1000 GROUP BY hour ORDER BY hour DESC"
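3. View system resource usage
# Show CPU usage of the ClickHouse process (-b runs top in batch mode so the output can be piped;
# -f matches the full command line, because the process name is truncated to 15 characters)
top -b -n 1 -p "$(pgrep -d, -f clickhouse-server)" | grep clickhouse
# Show ClickHouse thread counts (system.processes counts queries, not threads)
clickhouse-client --query "SELECT metric, value FROM system.metrics WHERE metric LIKE 'GlobalThread%'"
# Show the system load average
uptime
# Show per-core CPU usage (5 samples, 1 second apart)
mpstat -P ALL 1 5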
4. View background tasks
# Show merges in progress
clickhouse-client --query "SELECT database, table, elapsed, progress, merge_type, merge_algorithm, num_parts, rows_read, total_size_bytes_compressed, formatReadableSize(total_size_bytes_compressed) AS total_size FROM system.merges ORDER BY elapsed DESC"
# Show mutations in progress
clickhouse-client --query "SELECT database, table, mutation_id, command, create_time, is_done, latest_failed_part, latest_fail_time, latest_fail_reason FROM system.mutations WHERE is_done=0 ORDER BY create_time DESC"
# Count active background tasks
clickhouse-client --query "SELECT count() AS active_merges FROM system.merges"
clickhouse-client --query "SELECT count() AS active_mutations FROM system.mutations WHERE is_done=0"
5. View table and partition status
# Show the 50 largest active parts across all tables
clickhouse-client --query "SELECT database, table, partition, name, rows, bytes_on_disk, formatReadableSize(bytes_on_disk) AS size, modification_time FROM system.parts WHERE active=1 ORDER BY bytes_on_disk DESC LIMIT 50"
# Show compression per table (codec, size, and uncompressed/compressed ratio)
clickhouse-client --query "SELECT database, table, default_compression_codec, count() AS parts_count, sum(rows) AS total_rows, sum(bytes_on_disk) AS total_bytes, formatReadableSize(sum(bytes_on_disk)) AS total_size, sum(data_uncompressed_bytes) / sum(data_compressed_bytes) AS compression_ratio FROM system.parts WHERE active=1 GROUP BY database, table, default_compression_codec ORDER BY total_bytes DESC"
# Show the partition size distribution (to find oversized partitions)
clickhouse-client --query "SELECT database, table, partition, count() AS parts, sum(rows) AS rows, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active=1 GROUP BY database, table, partition ORDER BY sum(bytes_on_disk) DESC LIMIT 20"
6. View system configuration
# Show concurrency limits (system.settings lists query-level settings; the server-level max_concurrent_queries is set in config.xml)
clickhouse-client --query "SELECT name, value FROM system.settings WHERE name LIKE '%max_concurrent%'"
# Show thread and pool settings
clickhouse-client --query "SELECT name, value FROM system.settings WHERE name LIKE '%thread%' OR name LIKE '%pool%'"
# Show query limit settings
clickhouse-client --query "SELECT name, value FROM system.settings WHERE name LIKE '%max_%' AND (name LIKE '%query%' OR name LIKE '%memory%' OR name LIKE '%time%')"
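⚙️ Key Parameter Reference
Query concurrency control
| Parameter | Recommended value | Description | Rationale |
| --- | --- | --- | --- |
| max_concurrent_queries | Number of CPU cores | Maximum number of concurrent queries | Keeps too many queries from competing for CPU |
| max_thread_pool_size | CPU cores * 2 | Thread pool size (the server default is 10000) | Balances concurrency against resource usage |
| max_insert_threads | 4-8 | Number of insert threads | Limits the concurrency of insert operations |
| background_pool_size | 16 | Number of background task threads | Limits background work such as merges and mutations |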
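Query limits
| Parameter | Recommended value | Description | Rationale |
| --- | --- | --- | --- |
| max_execution_time | 300 | Maximum query execution time (seconds) | Stops long-running queries from holding CPU |
| max_memory_usage | 10000000000 | Maximum memory per query (bytes) | Prevents memory exhaustion, which wastes CPU on failed work |
| max_rows_to_read | 0 (unlimited) | Maximum number of rows read | Set according to workload requirements |
| max_bytes_to_read | 0 (unlimited) | Maximum number of bytes read | Set according to workload requirements |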
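Merges and compression
| Parameter | Recommended value | Description | Rationale |
| --- | --- | --- | --- |
| max_bytes_to_merge_at_max_space_in_pool | 161061273600 | Maximum total size of parts in one merge (150 GiB) | Limits the size of individual merge tasks |
| max_replicated_merges_in_queue | 16 | Replicated merge queue size | Limits the concurrency of replicated merges |
| compression_codec | LZ4 | Compression codec (set per column with CODEC(...) or in the <compression> section of config.xml) | LZ4 is fast and uses little CPU |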
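The byte value listed for max_bytes_to_merge_at_max_space_in_pool is 150 GiB written out in bytes. A quick way to check or recompute that number with shell arithmetic is sketched below; the helper name gib_to_bytes is made up for illustration and is not part of ClickHouse.

```shell
# Hypothetical helper: convert a size in GiB to bytes with shell integer arithmetic.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# 150 GiB, the value used in the table above
gib_to_bytes 150   # prints 161061273600
```

The same helper can be used when choosing a different merge size limit, for example gib_to_bytes 50.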
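🔍 High CPU Usage Troubleshooting Flow
Start
↓
1. Check currently running queries (system.processes)
↓
Any long-running queries?
↓ Yes → Analyze the query, then optimize or kill it
↓ No
2. Check query history (system.query_log)
↓
Any slow queries?
↓ Yes → Optimize the slow query SQL
↓ No
3. Check background tasks (system.merges, system.mutations)
↓
Many merges or mutations in progress?
↓ Yes → Adjust the merge strategy or wait for completion
↓ No
4. Check the number of concurrent queries (count of system.processes)
↓
Too many concurrent queries?
↓ Yes → Adjust max_concurrent_queries
↓ No
5. Check system resources (top, iostat)
↓
CPU, disk, or memory problem?
↓ Yes → Tune system parameters or add capacity
↓ No
6. Check table structure and indexes
↓
Poor indexes or table design?
↓ Yes → Optimize the table structure and indexes
↓ No
Troubleshooting complete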
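The first four checks of the flow can be sketched as a small shell function that takes counts you have already collected and prints the first action that applies. This is only an illustration: the function name next_action and its argument order are hypothetical, and the numeric thresholds (more than 15 merges, more than 3 mutations, more than twice the CPU cores) are assumptions borrowed from the danger column of the metrics table in this guide.

```shell
# Hypothetical helper: map observed counts to the first matching step of the flow.
# Arguments: long_running slow_queries active_merges active_mutations concurrent_queries cpu_cores
next_action() {
  long_running=$1; slow=$2; merges=$3; mutations=$4; concurrent=$5; cores=$6
  if [ "$long_running" -gt 0 ]; then
    echo "analyze, optimize, or kill long-running queries"
  elif [ "$slow" -gt 0 ]; then
    echo "optimize slow query SQL"
  elif [ "$merges" -gt 15 ] || [ "$mutations" -gt 3 ]; then
    echo "adjust merge strategy or wait for background tasks"
  elif [ "$concurrent" -gt $(( cores * 2 )) ]; then
    echo "adjust max_concurrent_queries"
  else
    # Steps 5 and 6 need manual inspection (top, iostat, table design)
    echo "check system resources, then table structure and indexes"
  fi
}
```

For example, next_action 0 0 20 0 4 8 reports the merge step, because no long-running or slow queries were found but 20 merges are active.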
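📊 Normal Ranges for Performance Metrics
| Metric | Normal | Warning | Danger |
| --- | --- | --- | --- |
| CPU usage | < 70% | 70-90% | > 90% |
| Concurrent queries | < CPU cores | 1x to 2x CPU cores | > 2x CPU cores |
| Average query execution time | < 1s | 1-5s | > 5s |
| Slow queries (> 1s) | < 10/hour | 10-50/hour | > 50/hour |
| Active merges | < 5 | 5-15 | > 15 |
| Active mutations | 0 | 1-3 | > 3 |
| Threads | < 2x CPU cores | 2x to 4x CPU cores | > 4x CPU cores |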
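The CPU usage row of the table can be turned into a simple check for scripts or alerts. The sketch below assumes an integer percentage as input; the function name classify_cpu is hypothetical, and how the percentage is measured (top, mpstat, a monitoring agent) is left to the caller.

```shell
# Hypothetical helper: classify an integer CPU usage percentage
# using the ranges from the table above (<70 normal, 70-90 warning, >90 danger).
classify_cpu() {
  pct=$1
  if [ "$pct" -lt 70 ]; then
    echo normal
  elif [ "$pct" -le 90 ]; then
    echo warning
  else
    echo danger
  fi
}
```

For example, classify_cpu 85 prints warning, and classify_cpu 95 prints danger.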
🛠️ Quick Fixes for Common Problems
Problem 1: Too many concurrent queries drive CPU usage up
# 1. Count the current concurrent queries
clickhouse-client --query "SELECT count() AS concurrent_queries FROM system.processes WHERE query != ''"
# 2. List the running queries, longest first
clickhouse-client --query "SELECT query_id, user, query, elapsed FROM system.processes WHERE query != '' ORDER BY elapsed DESC"
# 3. Lower the concurrency limit (edit config.xml)
# <max_concurrent_queries>8</max_concurrent_queries>
# 4. Or kill a long-running query (replace xxx with the actual query_id)
clickhouse-client --query "KILL QUERY WHERE query_id='xxx'"
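When many queries are stuck, killing them one query_id at a time is slow. KILL QUERY accepts any filter over system.processes, and its TEST modifier only lists the queries that would be stopped. A minimal sketch of a helper that builds such a statement follows; the name build_kill_sql and the dry-run default are assumptions for illustration, so review the output before running anything without TEST.

```shell
# Hypothetical helper: print a KILL QUERY statement for queries running longer than N seconds.
# The second argument defaults to "dry", which appends TEST so nothing is actually killed.
build_kill_sql() {
  seconds=$1
  mode=${2:-dry}
  sql="KILL QUERY WHERE elapsed > ${seconds}"
  if [ "$mode" = "dry" ]; then
    sql="$sql TEST"
  fi
  echo "$sql"
}

# Example (needs a running server):
# clickhouse-client --query "$(build_kill_sql 300)"
```

Defaulting to the dry run is a deliberate choice here: a filter on elapsed alone also matches long but legitimate inserts and reports.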
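Problem 2: Slow queries drive CPU usage up
# 1. Find the slow queries
clickhouse-client --query "SELECT query_id, query, query_duration_ms, read_rows FROM system.query_log WHERE type = 'QueryFinish' AND query_duration_ms > 5000 ORDER BY query_duration_ms DESC LIMIT 10"
# 2. Analyze a slow query (view its execution plan)
clickhouse-client --query "EXPLAIN SELECT ..." # replace with the actual slow query
# 3. Optimize the query (add indexes, tighten WHERE conditions, read less data)
# 4. Set a query timeout. A SET issued through --query ends with that client session,
#    so pass the setting together with the query (or set it in the user profile)
clickhouse-client --max_execution_time=60 --query "SELECT ..." # 60-second timeout; replace with the actual query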
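Problem 3: Background merges drive CPU usage up
# 1. View the merges in progress
clickhouse-client --query "SELECT database, table, elapsed, progress, num_parts, formatReadableSize(total_size_bytes_compressed) AS total_size FROM system.merges ORDER BY elapsed DESC"
# 2. Adjust the merge settings in config.xml (max_bytes_to_merge_at_max_space_in_pool is a MergeTree setting and goes inside <merge_tree>)
# <merge_tree><max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool></merge_tree>
# <background_pool_size>16</background_pool_size>
# 3. Wait for the merges to finish, or trigger a merge manually.
#    OPTIMIZE ... FINAL is itself CPU-intensive, so run it off-peak
clickhouse-client --query "OPTIMIZE TABLE database.table FINAL"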
Problem 4: Mutations drive CPU usage up
# 1. View the unfinished mutations
clickhouse-client --query "SELECT database, table, mutation_id, command, create_time, is_done FROM system.mutations WHERE is_done=0"
# 2. If a mutation is stuck, it can be cancelled (use with caution)
# clickhouse-client --query "KILL MUTATION WHERE database='xxx' AND table='xxx' AND mutation_id='xxx'"
# 3. Avoid running large numbers of mutations during peak hours
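Problem 5: CPU-intensive compression codec
# 1. View the compression codecs currently in use
clickhouse-client --query "SELECT default_compression_codec, count() AS parts, sum(bytes_on_disk) AS total_bytes FROM system.parts WHERE active=1 GROUP BY default_compression_codec"
# 2. If a CPU-heavy codec such as ZSTD is in use, consider switching to LZ4
# ALTER TABLE table_name MODIFY COLUMN column_name String CODEC(LZ4)
# 3. Or lower the compression level
# ALTER TABLE table_name MODIFY COLUMN column_name String CODEC(ZSTD(1)) # lower compression level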
Problem 6: Data ingestion drives CPU usage up
# 1. View the running insert queries
clickhouse-client --query "SELECT query_id, query, elapsed FROM system.processes WHERE query LIKE '%INSERT%'"
# 2. Lower the insert concurrency
# SET max_insert_threads=4
# 3. Insert in batches instead of row by row
# INSERT INTO table VALUES (...), (...), (...) # batch insert
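To illustrate the batching advice, the sketch below joins value tuples into one multi-row INSERT statement instead of issuing one statement per row. The helper name build_batch_insert and the table name db.events are hypothetical, and the tuples are assumed to be already formatted and escaped, one per line on standard input.

```shell
# Hypothetical helper: read pre-formatted value tuples (one per line) from stdin
# and print a single multi-row INSERT statement for the given table.
build_batch_insert() {
  table=$1
  printf 'INSERT INTO %s VALUES ' "$table"
  # paste -s joins all input lines into one, separated by commas
  paste -sd, -
}

# Two rows become one statement instead of two separate inserts
printf "(1,'a')\n(2,'b')\n" | build_batch_insert db.events
```

For larger loads, streaming a file into a single insert is usually simpler, for example clickhouse-client --query "INSERT INTO db.events FORMAT CSV" < rows.csv, where rows.csv is a placeholder file name.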
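💡 Optimization Principles
- Diagnose first, then optimize: identify why CPU usage is high before making targeted changes
- Control concurrency: set a sensible concurrent query limit to avoid overloading the CPU
- Optimize queries: fix slow queries to reduce CPU consumption
- Tune background tasks: configure merges and mutations sensibly
- Choose the right compression: pick the codec that fits the workload (LZ4 vs ZSTD)
- Monitor continuously: keep watching CPU usage so problems are caught early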
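📚 Related Documents
- Detailed troubleshooting guide: ClickHouse_CPU排查详细指南.md
- Fault detection guide: ClickHouse故障表象、影响与检测指南.md
- Remediation guide: ClickHouse调优修复指南.md
Remember: high CPU usage is usually a query problem. Check the queries first, then the configuration! 🚀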