ClickHouse CPU Troubleshooting Guide

📋 Overview

This guide walks through diagnosing and resolving high CPU usage in ClickHouse. High CPU is usually driven by excessive query concurrency, slow queries, background tasks, or data ingestion.


🔍 Step 1: Quick Diagnosis

1.1 Check CPU Usage

# Method 1: watch the ClickHouse process with top
# (pgrep -d, joins multiple PIDs with commas, e.g. when a watchdog process exists)
top -p $(pgrep -d, clickhouse-server)

# Method 2: use htop (if installed)
htop -p $(pgrep -d, clickhouse-server)

# Method 3: use ps
ps aux | grep clickhouse-server | grep -v grep

# Method 4: overall system CPU
mpstat -P ALL 1 5

Normal: CPU usage < 70%
Warning: CPU usage 70-90%
Critical: CPU usage > 90%

1.2 Check System Load

# Current load averages
uptime

# Detailed load info
cat /proc/loadavg

# Watch load continuously
watch -n 1 'uptime'

Normal: load < CPU core count
Warning: load between 1x and 2x core count
Critical: load > 2x core count
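
The thresholds above are easy to script. A minimal sketch (the `classify_load` helper and its cutoffs just restate the rule of thumb above; nothing here comes from ClickHouse itself):

```shell
#!/bin/sh
# Classify a load average against the CPU core count using the
# rule of thumb above: <1x cores OK, 1-2x warning, >2x critical.
classify_load() {
    awk -v l="$1" -v c="$2" 'BEGIN {
        if (l < c)         print "OK"
        else if (l <= 2*c) print "WARNING"
        else               print "CRITICAL"
    }'
}

# On a live host, feed in real values:
#   classify_load "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
classify_load 3.5 8    # prints OK
classify_load 12 8     # prints WARNING
classify_load 20 8     # prints CRITICAL
```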


🔍 Step 2: Check Currently Running Queries

2.1 List All Executing Queries

clickhouse-client --query "
SELECT 
    query_id,
    user,
    address,
    query,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes,
    formatReadableSize(memory_usage) as memory,
    formatReadableSize(read_bytes/elapsed) as read_speed
FROM system.processes 
WHERE query != ''
ORDER BY elapsed DESC
FORMAT Vertical
"

Key fields

  • query_id: query ID, usable to KILL the query
  • elapsed: how long the query has been running (seconds)
  • read_rows: rows read so far
  • read_bytes: bytes read so far
  • memory_usage: current memory usage

2.2 Find Long-Running Queries

# Queries that have been running for more than 10 seconds
clickhouse-client --query "
SELECT 
    query_id,
    user,
    query,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes
FROM system.processes 
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
"

2.3 Inspect a Specific Query

# Full details for one query
clickhouse-client --query "
SELECT * 
FROM system.processes 
WHERE query_id = 'your-query-id'
FORMAT Vertical
"

2.4 Count Concurrent Queries

# Current concurrent query count
clickhouse-client --query "
SELECT 
    count() as concurrent_queries,
    sum(elapsed) as total_elapsed,
    sum(read_rows) as total_read_rows,
    formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.processes 
WHERE query != ''
"

Normal: concurrent queries < CPU core count
Warning: concurrent queries between 1x and 2x core count
Critical: concurrent queries > 2x core count


🔍 Step 3: Check Query History (Slow Queries)

3.1 Recent Slow Queries

# The 100 slowest queries over 1 second from the last hour
clickhouse-client --query "
SELECT 
    query_id,
    user,
    query,
    query_start_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes,
    formatReadableSize(memory_usage) as memory,
    result_rows,
    formatReadableSize(result_bytes) as result_size
FROM system.query_log 
WHERE type = 2 
  AND query_duration_ms > 1000
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC 
LIMIT 100
FORMAT Vertical
"

Key fields

  • type = 2: query finished (1 = QueryStart, 2 = QueryFinish)
  • query_duration_ms: execution time (milliseconds)
  • read_rows: rows read
  • result_rows: rows in the result

3.2 Slow-Query Trend

# Slow queries bucketed by hour
clickhouse-client --query "
SELECT 
    toStartOfHour(event_time) as hour,
    count() as slow_queries,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms,
    min(query_duration_ms) as min_duration_ms,
    sum(read_rows) as total_read_rows,
    formatReadableSize(sum(read_bytes)) as total_read_bytes
FROM system.query_log 
WHERE type = 2 
  AND event_time > now() - INTERVAL 24 HOUR
  AND query_duration_ms > 1000
GROUP BY hour
ORDER BY hour DESC
"

3.3 Find the Most CPU-Hungry Query Patterns

# Aggregate by query pattern (literal values stripped)
clickhouse-client --query "
SELECT 
    normalizeQuery(query) as query_pattern,
    count() as query_count,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms,
    sum(query_duration_ms) as total_duration_ms,
    sum(read_rows) as total_read_rows
FROM system.query_log 
WHERE type = 2 
  AND event_time > now() - INTERVAL 24 HOUR
  AND query_duration_ms > 1000
GROUP BY query_pattern
ORDER BY total_duration_ms DESC
LIMIT 20
"
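
normalizeQuery works by replacing literal values with placeholders so that queries differing only in constants collapse into one pattern. As a rough illustration of the idea (the two sed rules below are a crude approximation, not ClickHouse's actual implementation):

```shell
#!/bin/sh
# Crude stand-in for normalizeQuery(): replace string and integer
# literals with '?' so only the query shape remains.
normalize_query() {
    printf '%s\n' "$1" | sed -E "s/'[^']*'/?/g; s/\b[0-9]+\b/?/g"
}

normalize_query "SELECT * FROM hits WHERE user_id = 42 AND date = '2024-01-01'"
# prints: SELECT * FROM hits WHERE user_id = ? AND date = ?
```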

🔍 Step 4: Check Background Tasks

4.1 Inspect Merges

# All merges currently in progress
clickhouse-client --query "
SELECT 
    database,
    table,
    elapsed,
    progress,
    merge_type,
    merge_algorithm,
    num_parts_to_merge,
    total_rows_to_merge,
    formatReadableSize(total_bytes_to_merge) as total_size,
    formatReadableSize(bytes_read_uncompressed) as bytes_read,
    formatReadableSize(bytes_written_uncompressed) as bytes_written
FROM system.merges 
ORDER BY elapsed DESC
FORMAT Vertical
"

Key fields

  • elapsed: how long the merge has been running (seconds)
  • progress: merge progress (0-1)
  • merge_type: type of merge (Regular, TTLDelete, etc.)
  • total_bytes_to_merge: total bytes to be merged

Normal: active merges < 5
Warning: active merges 5-15
Critical: active merges > 15

4.2 Inspect Mutations

# All mutations currently in progress
clickhouse-client --query "
SELECT 
    database,
    table,
    mutation_id,
    command,
    create_time,
    is_done,
    latest_failed_part,
    latest_fail_time,
    latest_fail_reason
FROM system.mutations 
WHERE is_done = 0
ORDER BY create_time DESC
FORMAT Vertical
"

Key fields

  • mutation_id: mutation ID
  • command: the command being applied (e.g. ALTER TABLE ... DELETE)
  • is_done: whether finished (0 = in progress, 1 = done)
  • latest_failed_part: the most recently failed part

Normal: active mutations = 0
Warning: active mutations 1-3
Critical: active mutations > 3

4.3 Summarize Background Tasks

# Count active background tasks
clickhouse-client --query "
SELECT 
    'Merges' as task_type,
    count() as active_count,
    sum(elapsed) as total_elapsed,
    formatReadableSize(sum(total_bytes_to_merge)) as total_size
FROM system.merges
UNION ALL
SELECT 
    'Mutations' as task_type,
    count() as active_count,
    0 as total_elapsed,
    '' as total_size
FROM system.mutations
WHERE is_done = 0
"

🔍 Step 5: Check Tables and Partitions

5.1 Partition Sizes

# Partition sizes across all tables
clickhouse-client --query "
SELECT 
    database,
    table,
    partition,
    count() as parts_count,
    sum(rows) as total_rows,
    formatReadableSize(sum(bytes_on_disk)) as total_size,
    min(modification_time) as oldest_part,
    max(modification_time) as newest_part
FROM system.parts 
WHERE active = 1
GROUP BY database, table, partition
ORDER BY sum(bytes_on_disk) DESC
LIMIT 50
"

5.2 Check Compression

# Compression codec and ratio per table
# (system.parts has no `format` column; the codec is default_compression_codec,
#  and the uncompressed size column is data_uncompressed_bytes)
clickhouse-client --query "
SELECT 
    database,
    table,
    default_compression_codec,
    count() as parts_count,
    sum(rows) as total_rows,
    formatReadableSize(sum(bytes_on_disk)) as compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(bytes_on_disk), 2) as compression_ratio
FROM system.parts 
WHERE active = 1
GROUP BY database, table, default_compression_codec
ORDER BY sum(bytes_on_disk) DESC
LIMIT 30
"

Compression codecs

  • LZ4: fast, low CPU cost, moderate ratio (the default; usually the right choice)
  • ZSTD: better ratio, but noticeably more CPU
  • ZSTD(1-3): low ZSTD levels, a balance between ratio and CPU
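
The compression_ratio reported in 5.2 is simply uncompressed bytes divided by bytes on disk; a bigger number means the codec does more work per byte served. The same arithmetic, for reference (the byte counts below are made-up examples):

```shell
#!/bin/sh
# compression_ratio = uncompressed_size / compressed_size,
# matching the round(..., 2) expression in the 5.2 query.
compression_ratio() {
    awk -v u="$1" -v c="$2" 'BEGIN { printf "%.2f\n", u / c }'
}

# e.g. 8 GiB of raw data stored as 1 GiB on disk
compression_ratio 8589934592 1073741824   # prints 8.00
```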

5.3 Find Large Partitions

# The biggest partitions
clickhouse-client --query "
SELECT 
    database,
    table,
    partition,
    count() as parts,
    sum(rows) as rows,
    formatReadableSize(sum(bytes_on_disk)) as size,
    max(modification_time) as last_modified
FROM system.parts 
WHERE active = 1
GROUP BY database, table, partition
HAVING sum(bytes_on_disk) > 10737418240  -- larger than 10GB
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20
"

🔍 Step 6: Check Server Configuration

6.1 Concurrency Settings

# All concurrency-related settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.settings 
WHERE name LIKE '%concurrent%' 
   OR name LIKE '%thread%'
   OR name LIKE '%pool%'
ORDER BY name
"

Key settings

  • max_concurrent_queries: maximum concurrent queries (a common starting point is the CPU core count)
  • max_thread_pool_size: global thread pool size (commonly 2x core count)
  • max_insert_threads: threads per INSERT (typically 4-8)
  • background_pool_size: background merge threads (often 16)

6.2 Query Limit Settings

# Per-query limit settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.settings 
WHERE name LIKE '%max_%' 
  AND (name LIKE '%query%' 
       OR name LIKE '%memory%' 
       OR name LIKE '%time%'
       OR name LIKE '%rows%'
       OR name LIKE '%bytes%')
ORDER BY name
"

Key settings

  • max_execution_time: maximum query execution time (seconds)
  • max_memory_usage: maximum memory per query (bytes)
  • max_rows_to_read: maximum rows a query may read
  • max_bytes_to_read: maximum bytes a query may read
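
These limits are usually pinned per user profile in users.xml rather than set ad hoc. A sketch of such a profile (the values are illustrative starting points, not universal recommendations):

```xml
<!-- users.xml: cap per-query resources in the default profile -->
<profiles>
    <default>
        <max_execution_time>60</max_execution_time>        <!-- seconds -->
        <max_memory_usage>10000000000</max_memory_usage>   <!-- ~10 GB -->
        <max_rows_to_read>1000000000</max_rows_to_read>
    </default>
</profiles>
```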

6.3 Merge Settings

# Merge-related settings
clickhouse-client --query "
SELECT 
    name,
    value,
    description
FROM system.settings 
WHERE name LIKE '%merge%'
ORDER BY name
"

Key settings

  • max_bytes_to_merge_at_max_space_in_pool: maximum bytes per merge (default is around 150GB)
  • max_replicated_merges_in_queue: replicated merge queue size (default 16)

🛠️ Solutions

Solution 1: Tame Query Concurrency

Problem: too many concurrent queries

# 1. Check the current concurrency
clickhouse-client --query "SELECT count() FROM system.processes WHERE query != ''"

# 2. Note: max_concurrent_queries is a server-level setting and cannot be
#    changed with SET in a session. Lower it in config.xml:
# <clickhouse>
#     <max_concurrent_queries>8</max_concurrent_queries>
# </clickhouse>
# (Per-user limits are available via max_concurrent_queries_for_user in a profile.)

# 3. Or kill long-running queries
clickhouse-client --query "KILL QUERY WHERE query_id='xxx'"

Solution 2: Optimize Slow Queries

Problem: slow queries driving CPU up

# 1. Find the slow queries
clickhouse-client --query "
SELECT query_id, query, query_duration_ms, read_rows
FROM system.query_log 
WHERE type=2 AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC 
LIMIT 10
"

# 2. Inspect the execution plan
clickhouse-client --query "EXPLAIN SELECT ..."  # substitute the actual query

# 3. Optimization checklist:
# - Make the ORDER BY (primary) key match common filter columns
# - Tighten WHERE conditions to shrink the scanned range
# - Use PREWHERE instead of WHERE where applicable
# - Select only the columns you need
# - Use LIMIT to cap result size
# - Avoid full table scans

# 4. Cap execution time (SET applies only to the current session; put it in a
#    user profile to make it permanent)
clickhouse-client --query "SET max_execution_time=60"  # 60-second timeout

Solution 3: Rein In Background Tasks

Problem: too many merges

# 1. Inspect merges
clickhouse-client --query "SELECT * FROM system.merges ORDER BY elapsed DESC"

# 2. Tune merge settings in config.xml. Note that in recent versions
#    background_pool_size is a server-level setting, not part of <merge_tree>:
# <merge_tree>
#     <max_bytes_to_merge_at_max_space_in_pool>161061273600</max_bytes_to_merge_at_max_space_in_pool>
# </merge_tree>
# <background_pool_size>16</background_pool_size>

# 3. Wait for merges to finish, or trigger one manually (use with care)
clickhouse-client --query "OPTIMIZE TABLE database.table FINAL"

Problem: stuck mutations

# 1. Inspect mutations
clickhouse-client --query "SELECT * FROM system.mutations WHERE is_done=0"

# 2. A stuck mutation can be cancelled (careful: data may be left partially mutated)
# clickhouse-client --query "KILL MUTATION WHERE database='xxx' AND table='xxx' AND mutation_id='xxx'"

# 3. Avoid running bulk mutations at peak hours

Solution 4: Optimize Compression

Problem: a CPU-heavy compression codec

# 1. Check which codecs are in use
clickhouse-client --query "
SELECT default_compression_codec, count() as parts, formatReadableSize(sum(bytes_on_disk)) as size
FROM system.parts 
WHERE active=1 
GROUP BY default_compression_codec
"

# 2. If you are on ZSTD, consider LZ4 for new tables (LZ4 is the default
#    codec, so omitting CODEC altogether gives you LZ4)
CREATE TABLE new_table (
    ...
) ENGINE = MergeTree()
ORDER BY ...
SETTINGS index_granularity = 8192;

# 3. Or lower the ZSTD level on new tables
CREATE TABLE new_table (
    column_name String CODEC(ZSTD(1))  -- lower compression level
) ENGINE = MergeTree()
ORDER BY ...;

# 4. For an existing table, change the codec per column (parts are rewritten
#    as they merge; be careful with very large tables)
ALTER TABLE table_name MODIFY COLUMN column_name String CODEC(LZ4);

Solution 5: Optimize Data Ingestion

Problem: ingestion driving CPU up

# 1. Find running INSERTs
clickhouse-client --query "SELECT query_id, query, elapsed FROM system.processes WHERE query LIKE '%INSERT%'"

# 2. Lower insert parallelism for the session issuing the INSERT
clickhouse-client --query "SET max_insert_threads=4"

# 3. Batch inserts
# Bad:  INSERT INTO table VALUES (...); INSERT INTO table VALUES (...);
# Good: INSERT INTO table VALUES (...), (...), (...);

# 4. Use asynchronous inserts in recent versions (there is no ASYNC keyword;
#    enable the setting instead)
# SET async_insert = 1
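
The batching advice in step 3 can be scripted: accumulate rows and emit one INSERT instead of many. A minimal sketch that only builds the batched statement (the events table and the row format are hypothetical; on a real system you would pipe the output to clickhouse-client):

```shell
#!/bin/sh
# Turn one input line per row into a single batched INSERT,
# instead of issuing one INSERT per row.
build_batch_insert() {
    awk -v t="$1" '
        { rows = rows (NR > 1 ? ", " : "") "(" $0 ")" }
        END { print "INSERT INTO " t " VALUES " rows }
    '
}

printf '%s\n' "1, 10" "2, 20" | build_batch_insert events
# prints: INSERT INTO events VALUES (1, 10), (2, 20)
```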

Solution 6: Optimize Table Design

Problem: poor table design makes queries slow

# 1. Check whether the ORDER BY key makes sense
clickhouse-client --query "SHOW CREATE TABLE database.table_name"

# 2. Guidelines:
# - The ORDER BY key should lead with columns commonly used in filters
# - Pre-aggregate hot queries with materialized views
# - Pick the right engine (MergeTree vs ReplacingMergeTree, etc.)

# 3. Check the partition key
# The partition key should be a column frequently used for pruning (e.g. a date)
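
To make the guidelines above concrete, a sketch of a table laid out for date-range queries (the events table and its columns are hypothetical):

```sql
-- Partition by month so date filters prune whole partitions,
-- and lead ORDER BY with the most commonly filtered columns.
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    action     LowCardinality(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (user_id, event_date);
```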

📊 Monitoring Script

A CPU Monitoring Script

#!/bin/bash
# clickhouse_cpu_monitor.sh

echo "=== ClickHouse CPU Monitor ==="
echo ""

# 1. System CPU usage
echo "1. System CPU usage:"
top -bn1 | grep "Cpu(s)" | awk '{print "user: " $2 "%", "system: " $4 "%", "idle: " $8 "%"}'
echo ""

# 2. ClickHouse process CPU
echo "2. ClickHouse process CPU:"
CH_PID=$(pgrep -d, clickhouse-server)
if [ -n "$CH_PID" ]; then
    top -bn1 -p "$CH_PID" | tail -1 | awk '{print "CPU: " $9 "%", "MEM: " $10 "%"}'
else
    echo "clickhouse-server is not running"
fi
echo ""

# 3. Concurrent queries
echo "3. Concurrent queries:"
clickhouse-client --query "SELECT count() as concurrent_queries FROM system.processes WHERE query != ''"
echo ""

# 4. Long-running queries
echo "4. Long-running queries (>10s):"
clickhouse-client --query "
SELECT 
    query_id,
    elapsed,
    read_rows,
    formatReadableSize(read_bytes) as read_bytes
FROM system.processes 
WHERE query != '' AND elapsed > 10
ORDER BY elapsed DESC
LIMIT 5
"
echo ""

# 5. Active merges
echo "5. Active merges:"
clickhouse-client --query "SELECT count() as active_merges FROM system.merges"
echo ""

# 6. Active mutations
echo "6. Active mutations:"
clickhouse-client --query "SELECT count() as active_mutations FROM system.mutations WHERE is_done=0"
echo ""

# 7. Slow queries in the last hour
echo "7. Slow queries in the last hour (>1s):"
clickhouse-client --query "
SELECT 
    count() as slow_queries,
    avg(query_duration_ms) as avg_duration_ms,
    max(query_duration_ms) as max_duration_ms
FROM system.query_log 
WHERE type=2 
  AND event_time > now() - INTERVAL 1 HOUR
  AND query_duration_ms > 1000
"

Save it as clickhouse_cpu_monitor.sh, then:

chmod +x clickhouse_cpu_monitor.sh
./clickhouse_cpu_monitor.sh
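
To run the monitor on a schedule, a crontab entry along these lines works (the script and log paths are examples):

```
# crontab -e: run every 5 minutes, appending output to a log
*/5 * * * * /opt/scripts/clickhouse_cpu_monitor.sh >> /var/log/clickhouse_cpu_monitor.log 2>&1
```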

✅ Troubleshooting Checklist

  • Check system CPU usage (top, htop)
  • Check running queries (system.processes)
  • Check slow-query history (system.query_log)
  • Check background tasks (system.merges, system.mutations)
  • Check table and partition state (system.parts)
  • Check server settings (system.settings)
  • Analyze execution plans (EXPLAIN)
  • Optimize slow SQL
  • Tune concurrency settings
  • Optimize compression codecs
  • Keep monitoring CPU usage

💡 Best Practices

  1. Monitor regularly: schedule recurring CPU checks
  2. Analyze slow queries: review the slow-query log periodically and optimize
  3. Configure sensibly: size concurrency limits to the hardware
  4. Choose compression wisely: pick LZ4 vs ZSTD per workload
  5. Avoid peak hours: schedule heavy background work off-peak
  6. Design tables well: choose a good ORDER BY key
  7. Use materialized views: pre-aggregate data to cut query CPU

Remember: high CPU is usually a query problem. Check the queries first, then the configuration! 🚀
