ClickHouse CPU Troubleshooting Guide

📋 Overview

This guide explains how to diagnose and resolve high CPU usage in ClickHouse. High CPU usage is typically caused by excessive query concurrency, slow queries, background tasks, or heavy data ingestion.


🔍 Step 1: Quick Diagnosis

1.1 Check CPU Usage

```bash
# Method 1: view the ClickHouse process with top
top -p $(pgrep clickhouse-server)

# Method 2: use htop (if installed)
htop -p $(pgrep clickhouse-server)

# Method 3: use ps
ps aux | grep clickhouse-server | grep -v grep

# Method 4: view overall system CPU
mpstat -P ALL 1 5
```

Normal: CPU usage < 70%
Warning: CPU usage 70-90%
Danger: CPU usage > 90%
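
The thresholds above can be applied mechanically. A minimal sketch (the `cpu=85.0` reading is a made-up example; in practice take it from top or ps):

```bash
# Classify a CPU% reading against the thresholds above.
# The value of cpu is hypothetical; read it from top/ps in practice.
cpu=85.0
awk -v v="$cpu" 'BEGIN {
    if (v < 70)       print "normal"
    else if (v <= 90) print "warning"
    else              print "danger"
}'
```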

1.2 Check System Load

```bash
# View system load
uptime

# View detailed load information
cat /proc/loadavg

# Monitor load continuously
watch -n 1 'uptime'
```

Normal: load < number of CPU cores
Warning: load between 1x and 2x the core count
Danger: load > 2x the core count
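
A small sketch that applies the load thresholds above automatically (assumes a Linux host, where `nproc` and `/proc/loadavg` are available):

```bash
# Compare the 1-minute load average against the CPU core count.
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN {
    if (l < c)        print "OK: load " l " < " c " cores"
    else if (l < 2*c) print "WARN: load " l " between 1x and 2x cores"
    else              print "CRIT: load " l " above 2x cores"
}'
```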


🔍 Step 2: Check Currently Running Queries

2.1 View All Executing Queries

```bash
clickhouse-client --query "
SELECT 
    query_id,
    user,
    address,
    query,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes,
    formatReadableSize(memory_usage) as memory,
    formatReadableSize(read_bytes/elapsed) as read_speed
FROM system.processes 
WHERE query != ''
ORDER BY elapsed DESC
FORMAT Vertical
"
```

Key fields

  • query_id: query ID, usable to kill the query
  • elapsed: time the query has been running (seconds)
  • read_rows: rows read so far
  • read_bytes: bytes read so far
  • memory_usage: memory consumption

2.2 View Long-Running Queries

```bash
# Queries running for more than 10 seconds
clickhouse-client --query "
SELECT 
    query_id,
    user,
    query,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes
FROM system.processes 
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
"
```

2.3 View Details of a Specific Query

```bash
# Inspect one query in detail
clickhouse-client --query "
SELECT * 
FROM system.processes 
WHERE query_id = 'your-query-id'
FORMAT Vertical
"
```

2.4 Count Concurrent Queries

```bash
# Count currently concurrent queries
clickhouse-client --query "
SELECT 
    count() as concurrent_queries,
    sum(elapsed) as total_elapsed,
    sum(read_rows) as total_read_rows,
    formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.processes 
WHERE query != ''
"
```

Normal: concurrent queries < number of CPU cores
Warning: concurrent queries between 1x and 2x the core count
Danger: concurrent queries > 2x the core count


🔍 Step 3: Check Query History (Slow Queries)

3.1 View Recent Slow Queries

```bash
# Last 100 queries that took more than 1 second
clickhouse-client --query "
SELECT 
    query_id,
    user,
    query,
    query_start_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes,
    formatReadableSize(memory_usage) as memory,
    result_rows,
    formatReadableSize(result_bytes) as result_size
FROM system.query_log 
WHERE type = 2 
  AND query_duration_ms > 1000
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC 
LIMIT 100
FORMAT Vertical
"
```

Key fields

  • type = 2: query finished (1 = QueryStart, 2 = QueryFinish)
  • query_duration_ms: query execution time (milliseconds)
  • read_rows: rows read
  • result_rows: rows returned in the result

3.2 Slow Query Trends

```bash
# Slow queries aggregated per hour
clickhouse-client --query "
SELECT 
    toStartOfHour(event_time) as hour,
    count() as slow_queries,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms,
    min(query_duration_ms) as min_duration_ms,
    sum(read_rows) as total_read_rows,
    formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.query_log 
WHERE type = 2 
  AND event_time > now() - INTERVAL 24 HOUR
  AND query_duration_ms > 1000
GROUP BY hour
ORDER BY hour DESC
"
```

3.3 Find the Most CPU-Intensive Query Patterns

```bash
# Aggregate by query pattern (literal values stripped)
clickhouse-client --query "
SELECT 
    normalizeQuery(query) as query_pattern,
    count() as query_count,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms,
    sum(query_duration_ms) as total_duration_ms,
    sum(read_rows) as total_read_rows
FROM system.query_log 
WHERE type = 2 
  AND event_time > now() - INTERVAL 24 HOUR
  AND query_duration_ms > 1000
GROUP BY query_pattern
ORDER BY total_duration_ms DESC
LIMIT 20
"
```

🔍 Step 4: Check Background Tasks

4.1 View Merge Tasks

```bash
# All merges currently in progress
# (column names per system.merges: num_parts, rows_read, total_size_bytes_compressed)
clickhouse-client --query "
SELECT 
    database,
    table,
    elapsed,
    progress,
    merge_type,
    merge_algorithm,
    num_parts,
    rows_read,
    formatReadableSize(total_size_bytes_compressed) as total_size,
    formatReadableSize(bytes_read_uncompressed) as bytes_read,
    formatReadableSize(bytes_written_uncompressed) as bytes_written
FROM system.merges 
ORDER BY elapsed DESC
FORMAT Vertical
"
```

Key fields

  • elapsed: how long the merge has been running (seconds)
  • progress: merge progress (0-1)
  • merge_type: merge type (Regular, TTLDelete, etc.)
  • total_size_bytes_compressed: compressed size of the parts being merged

Normal: active merges < 5
Warning: active merges 5-15
Danger: active merges > 15

4.2 View Mutation Tasks

```bash
# All mutations currently in progress
clickhouse-client --query "
SELECT 
    database,
    table,
    mutation_id,
    command,
    create_time,
    is_done,
    latest_failed_part,
    latest_fail_time,
    latest_fail_reason
FROM system.mutations 
WHERE is_done = 0
ORDER BY create_time DESC
FORMAT Vertical
"
```

Key fields

  • mutation_id: mutation ID
  • command: command being executed (e.g. ALTER TABLE ... DELETE)
  • is_done: whether the mutation finished (0 = in progress, 1 = done)
  • latest_failed_part: name of the most recent part that failed to mutate

Normal: active mutations = 0
Warning: active mutations 1-3
Danger: active mutations > 3

4.3 Summarize Background Tasks

```bash
# Count active background tasks
clickhouse-client --query "
SELECT 
    'Merges' as task_type,
    count() as active_count,
    sum(elapsed) as total_elapsed,
    formatReadableSize(sum(total_size_bytes_compressed)) as total_size
FROM system.merges
UNION ALL
SELECT 
    'Mutations' as task_type,
    count() as active_count,
    0 as total_elapsed,
    '' as total_size
FROM system.mutations
WHERE is_done = 0
"
```

🔍 Step 5: Check Tables and Partitions

5.1 View Partition Information

```bash
# Partition sizes for all tables
clickhouse-client --query "
SELECT 
    database,
    table,
    partition,
    count() as parts_count,
    sum(rows) as total_rows,
    formatReadableSize(sum(bytes_on_disk)) as total_size,
    min(modification_time) as oldest_part,
    max(modification_time) as newest_part
FROM system.parts 
WHERE active = 1
GROUP BY database, table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 50
"
```

5.2 Check Table Compression

```bash
# Compression codec and ratio per table
# (system.parts has no `format` column; the codec is in default_compression_codec,
#  and the uncompressed size column is data_uncompressed_bytes)
clickhouse-client --query "
SELECT 
    database,
    table,
    default_compression_codec as codec,
    count() as parts_count,
    sum(rows) as total_rows,
    formatReadableSize(sum(bytes_on_disk)) as compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(bytes_on_disk), 2) as compression_ratio
FROM system.parts 
WHERE active = 1
GROUP BY database, table, codec
ORDER BY sum(bytes_on_disk) DESC
LIMIT 30
"
```

Compression codecs

  • LZ4: fast, low CPU cost, moderate ratio (recommended default)
  • ZSTD: higher compression ratio, but higher CPU cost
  • ZSTD(1-3): low-level ZSTD, a balance between ratio and CPU
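
The compression_ratio column in the query above is simply uncompressed bytes divided by bytes on disk. For example (sizes here are hypothetical):

```bash
# 10 GiB uncompressed stored as 2 GiB on disk -> ratio 5.00
awk 'BEGIN { printf "%.2f\n", 10737418240 / 2147483648 }'
```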

5.3 Find Large Partitions

```bash
# Largest partitions
clickhouse-client --query "
SELECT 
    database,
    table,
    partition,
    count() as parts,
    sum(rows) as rows,
    formatReadableSize(sum(bytes_on_disk)) as size,
    max(modification_time) as last_modified
FROM system.parts 
WHERE active = 1
GROUP BY database, table, partition
HAVING sum(bytes_on_disk) > 10737418240  -- larger than 10 GB
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20
"
```

🔍 Step 6: Check Server Configuration

6.1 View Concurrency-Related Settings

```bash
# All concurrency-related settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.settings 
WHERE name LIKE '%concurrent%' 
   OR name LIKE '%thread%'
   OR name LIKE '%pool%'
ORDER BY name
"
```

Key settings

  • max_concurrent_queries: maximum concurrent queries; a server-level setting in config.xml (suggested: number of CPU cores)
  • max_thread_pool_size: global thread pool size (suggested: 2x CPU cores)
  • max_insert_threads: insert threads per query (suggested: 4-8)
  • background_pool_size: background task threads (suggested: 16)
6.2 View Query Limit Settings

```bash
# Query limit settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.settings 
WHERE name LIKE '%max_%' 
  AND (name LIKE '%query%' 
       OR name LIKE '%memory%' 
       OR name LIKE '%time%'
       OR name LIKE '%rows%'
       OR name LIKE '%bytes%')
ORDER BY name
"
```

Key settings

  • max_execution_time: maximum query execution time (seconds)
  • max_memory_usage: maximum memory per query (bytes)
  • max_rows_to_read: maximum rows to read
  • max_bytes_to_read: maximum bytes to read

6.3 View Merge-Related Settings

```bash
# Merge settings live in system.merge_tree_settings, not system.settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.merge_tree_settings 
WHERE name LIKE '%merge%'
ORDER BY name
"
```

Key settings

  • max_bytes_to_merge_at_max_space_in_pool: maximum size of a single merge (suggested: ~150 GB)
  • max_replicated_merges_in_queue: replicated merge queue size (suggested: 16)

🛠️ Solutions

Solution 1: Limit Query Concurrency

Problem: too many concurrent queries
```bash
# 1. Check the current concurrency
clickhouse-client --query "SELECT count() FROM system.processes WHERE query != ''"

# 2. max_concurrent_queries is a server-level setting: it cannot be changed
#    with SET at the session level. Set it at the top level of config.xml:
# <max_concurrent_queries>8</max_concurrent_queries>

# 3. Or kill long-running queries
clickhouse-client --query "KILL QUERY WHERE query_id='xxx'"
```

Solution 2: Optimize Slow Queries

Problem: slow queries driving CPU high
```bash
# 1. Find the slow queries
clickhouse-client --query "
SELECT query_id, query, query_duration_ms, read_rows
FROM system.query_log 
WHERE type=2 AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC 
LIMIT 10
"

# 2. Analyze the query plan
clickhouse-client --query "EXPLAIN SELECT ..."  # replace with the actual query

# 3. Optimization guidelines:
# - Filter on the ORDER BY (primary key) columns so the index is used
# - Tighten WHERE conditions to reduce the data scanned
# - Use PREWHERE instead of WHERE where applicable
# - Select fewer columns
# - Use LIMIT to cap the result set
# - Avoid full table scans

# 4. Set a query timeout (per session; put it in the user profile to persist)
clickhouse-client --query "SET max_execution_time=60"  # 60-second timeout
```

Solution 3: Tune Background Tasks

Problem: too many merges
```bash
# 1. Check merge tasks
clickhouse-client --query "SELECT * FROM system.merges ORDER BY elapsed DESC"

# 2. Adjust merge settings in config.xml
#    (max_bytes_to_merge_at_max_space_in_pool is a MergeTree setting;
#     background_pool_size is a server-level setting outside <merge_tree>)
# <merge_tree>
#     <max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool>
# </merge_tree>
# <background_pool_size>16</background_pool_size>

# 3. Wait for merges to finish, or trigger one manually (use with care)
clickhouse-client --query "OPTIMIZE TABLE database.table FINAL"
```
Problem: a mutation is stuck
```bash
# 1. Check mutation tasks
clickhouse-client --query "SELECT * FROM system.mutations WHERE is_done=0"

# 2. A stuck mutation can be cancelled (use with care; data may be left partially mutated)
# clickhouse-client --query "KILL MUTATION WHERE database='xxx' AND table='xxx' AND mutation_id='xxx'"

# 3. Avoid running large batches of mutations during peak hours
```

Solution 4: Choose a Cheaper Compression Codec

Problem: the compression codec is CPU-intensive
```bash
# 1. Check which codecs are in use
clickhouse-client --query "
SELECT default_compression_codec as codec, count() as parts, formatReadableSize(sum(bytes_on_disk)) as size
FROM system.parts 
WHERE active=1 
GROUP BY codec
"

# 2. If ZSTD is in use, consider LZ4 for new tables (per-column codec shown)
CREATE TABLE new_table (
    column_name String CODEC(LZ4)
) ENGINE = MergeTree()
ORDER BY ...
SETTINGS index_granularity = 8192;

# 3. Or lower the ZSTD level for new tables
CREATE TABLE new_table (
    column_name String CODEC(ZSTD(1))  -- lower compression level
) ENGINE = MergeTree()
ORDER BY ...;

# 4. Existing tables must be rewritten (be careful with large data volumes)
ALTER TABLE table_name MODIFY COLUMN column_name String CODEC(LZ4);
```

Solution 5: Tune Data Ingestion

Problem: ingestion is driving CPU high
```bash
# 1. Check running INSERT queries
clickhouse-client --query "SELECT query_id, query, elapsed FROM system.processes WHERE query LIKE '%INSERT%'"

# 2. Reduce insert parallelism
clickhouse-client --query "SET max_insert_threads=4"

# 3. Batch inserts
# Bad:  INSERT INTO table VALUES (...); INSERT INTO table VALUES (...);
# Good: INSERT INTO table VALUES (...), (...), (...);

# 4. Use asynchronous inserts (there is no INSERT ... ASYNC syntax;
#    enable the async_insert setting and run a normal INSERT)
# SET async_insert = 1;
# INSERT INTO table VALUES (...);
```
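
The batching advice above can be sketched as assembling one multi-row INSERT from individual rows (the table name `t` and columns `id, s` are made up for illustration):

```bash
# Turn newline-separated value tuples into a single batched INSERT statement.
rows="1,'a'
2,'b'
3,'c'"
values=$(printf '%s\n' "$rows" | sed 's/.*/(&)/' | paste -sd, -)
echo "INSERT INTO t (id, s) VALUES $values"
```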

Solution 6: Improve the Table Schema

Problem: a poorly designed schema makes queries slow
```bash
# 1. Check whether the ORDER BY key makes sense
clickhouse-client --query "SHOW CREATE TABLE database.table_name"

# 2. Guidelines:
# - The ORDER BY key should contain the columns most queries filter on
# - Add data-skipping indexes where appropriate
# - Pre-aggregate with materialized views
# - Pick a suitable engine (MergeTree vs ReplacingMergeTree, etc.)

# 3. Check the partition key
# The partition key should be a column frequently used for filtering (e.g. a date)
```

📊 Monitoring Script

Create a CPU Monitoring Script

```bash
#!/bin/bash
# clickhouse_cpu_monitor.sh

echo "=== ClickHouse CPU Monitor ==="
echo ""

# 1. System CPU usage
#    (field positions in the top output depend on the top version and locale)
echo "1. System CPU usage:"
top -bn1 | grep "Cpu(s)" | awk '{print "user: " $2 "%", "system: " $4 "%", "idle: " $8 "%"}'
echo ""

# 2. ClickHouse process CPU
echo "2. ClickHouse process CPU:"
CH_PID=$(pgrep clickhouse-server)
if [ -n "$CH_PID" ]; then
    top -bn1 -p "$CH_PID" | tail -1 | awk '{print "CPU: " $9 "%", "MEM: " $10 "%"}'
else
    echo "clickhouse-server is not running"
fi
echo ""

# 3. Concurrent queries
echo "3. Concurrent queries:"
clickhouse-client --query "SELECT count() as concurrent_queries FROM system.processes WHERE query != ''"
echo ""

# 4. Long-running queries
echo "4. Long-running queries (>10s):"
clickhouse-client --query "
SELECT 
    query_id,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes
FROM system.processes 
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
LIMIT 5
"
echo ""

# 5. Active merges
echo "5. Active merges:"
clickhouse-client --query "SELECT count() as active_merges FROM system.merges"
echo ""

# 6. Active mutations
echo "6. Active mutations:"
clickhouse-client --query "SELECT count() as active_mutations FROM system.mutations WHERE is_done=0"
echo ""

# 7. Slow queries in the last hour
echo "7. Slow queries in the last hour (>1s):"
clickhouse-client --query "
SELECT 
    count() as slow_queries,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms
FROM system.query_log 
WHERE type=2 
  AND event_time > now() - INTERVAL 1 HOUR
  AND query_duration_ms > 1000
"
```

Save this as clickhouse_cpu_monitor.sh, then:

```bash
chmod +x clickhouse_cpu_monitor.sh
./clickhouse_cpu_monitor.sh
```
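
To run the monitor on a schedule, a cron entry along these lines works; the script and log paths below are placeholders, adjust them to your layout:

```bash
# Hypothetical crontab entry: run every 5 minutes, append output to a log
*/5 * * * * /opt/scripts/clickhouse_cpu_monitor.sh >> /var/log/ch_cpu_monitor.log 2>&1
```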

✅ Troubleshooting Checklist

  • Check system CPU usage (top, htop)
  • Check currently running queries (system.processes)
  • Check slow query history (system.query_log)
  • Check background tasks (system.merges, system.mutations)
  • Check table and partition state (system.parts)
  • Check server settings (system.settings)
  • Analyze query plans (EXPLAIN)
  • Optimize slow SQL
  • Tune concurrency settings
  • Choose an appropriate compression codec
  • Keep monitoring CPU usage

💡 Best Practices

  1. Monitor regularly: schedule a job to track CPU usage
  2. Analyze slow queries: review the slow query log periodically and optimize
  3. Configure sensibly: size concurrency limits to match the hardware
  4. Pick compression wisely: choose the codec per workload (LZ4 vs ZSTD)
  5. Avoid peak hours: do not run heavy background tasks during business peaks
  6. Design schemas carefully: pick a sensible ORDER BY key
  7. Use materialized views: pre-aggregate data to cut query CPU cost

Remember: high CPU usage is usually a query problem. Check the queries first, then the configuration! 🚀
