ClickHouse 实战指南:高性能列式数据库实践
目录
- [1. ClickHouse 简介](#1. ClickHouse 简介)
- [2. ClickHouse 核心架构](#2. ClickHouse 核心架构)
- [3. 环境搭建与配置](#3. 环境搭建与配置)
- [4. 表引擎详解](#4. 表引擎详解)
- [5. Java 客户端集成](#5. Java 客户端集成)
- [6. 数据写入实战](#6. 数据写入实战)
- [7. 查询优化技巧](#7. 查询优化技巧)
- [8. 物化视图与聚合](#8. 物化视图与聚合)
- [9. 分布式集群实践](#9. 分布式集群实践)
- [10. 实时数据分析场景](#10. 实时数据分析场景)
- [11. 性能优化最佳实践](#11. 性能优化最佳实践)
- [12. 监控与运维](#12. 监控与运维)
- [13. 总结](#13. 总结)
1. ClickHouse 简介
ClickHouse 是一个用于在线分析处理(OLAP)的开源列式数据库管理系统,由俄罗斯最大的搜索引擎 Yandex 开发。它能够实时生成分析报告,处理数十亿行数据时仍保持亚秒级查询响应。
1.1 核心特性
- 真正的列式存储: 数据按列存储,极大提升查询性能
- 数据压缩: 压缩比可达 10:1 甚至更高
- 向量化执行: SIMD 指令集加速计算
- 分布式查询: 支持分片和副本
- 实时数据插入: 每秒可处理百万级数据
- SQL 支持: 支持标准 SQL 及扩展语法
- 高可用性: 支持数据副本和故障转移
1.2 ClickHouse vs 传统数据库
性能对比 (10 亿行数据聚合查询,示意数据,实际结果取决于硬件、建表方式与查询)
ClickHouse █ 0.5s
PostgreSQL ████████████████████████ 120s
MySQL ████████████████████████████ 150s
Elasticsearch ████████ 40s
使用场景对比
┌─────────────────┬──────────────┬──────────────┬──────────────┐
│ Feature │ ClickHouse │ MySQL │ PostgreSQL │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ OLAP 查询 │ ★★★★★ │ ★★☆☆☆ │ ★★★☆☆ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ OLTP 事务 │ ★☆☆☆☆ │ ★★★★★ │ ★★★★★ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ 数据压缩 │ ★★★★★ │ ★★☆☆☆ │ ★★★☆☆ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ 写入性能 │ ★★★★★ │ ★★★☆☆ │ ★★★☆☆ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ 分布式支持 │ ★★★★★ │ ★★☆☆☆ │ ★★☆☆☆ │
└─────────────────┴──────────────┴──────────────┴──────────────┘
1.3 典型应用场景
- 用户行为分析: 网站点击流、APP 使用数据分析
- 实时报表: 业务指标实时统计与展示
- 日志分析: 海量日志存储与查询
- 时序数据: IoT 设备监控、系统指标监控
- 数据仓库: 替代传统 OLAP 引擎
2. ClickHouse 核心架构
2.1 整体架构
ClickHouse 架构图
Client Layer (客户端层)
┌─────────────────────────────────────────────────────────┐
│ JDBC Client │ HTTP Client │ CLI Client │ Grafana │
└────────────────────────┬────────────────────────────────┘
│
▼
Query Processing Layer (查询处理层)
┌─────────────────────────────────────────────────────────┐
│ Query Parser │
│ ↓ │
│ Query Optimizer │
│ ↓ │
│ Execution Engine │
│ (Vectorized Processing) │
└────────────────────────┬────────────────────────────────┘
│
▼
Storage Layer (存储层)
┌─────────────────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Table Engine │ │ Table Engine │ │ Table Engine │ │
│ │ MergeTree │ │ Log Family │ │ Distributed │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ ↓ │
│ Column-Oriented Storage │
│ ↓ │
│ Compressed Data Files │
│ (.bin, .mrk, .idx) │
└──────────────────────────────────────────────────────────┘
2.2 列式存储原理
行式存储 vs 列式存储
行式存储 (MySQL, PostgreSQL):
Row 1: [ID=1, Name=Alice, Age=25, City=Beijing]
Row 2: [ID=2, Name=Bob, Age=30, City=Shanghai]
Row 3: [ID=3, Name=Carol, Age=28, City=Beijing]
查询: SELECT AVG(Age) FROM users WHERE City='Beijing'
需要读取: 所有列的所有数据
列式存储 (ClickHouse):
ID Column: [1, 2, 3]
Name Column: [Alice, Bob, Carol]
Age Column: [25, 30, 28]
City Column: [Beijing, Shanghai, Beijing]
查询: SELECT AVG(Age) FROM users WHERE City='Beijing'
需要读取: Age列 + City列 (仅相关列)
优势:
1. 只读取需要的列,减少 I/O
2. 同一列数据类型相同,压缩率高
3. 向量化执行,CPU 缓存友好
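列式存储的压缩收益可以在自己的表上直接验证:下面的示意代码通过 system.columns 系统表对比各列压缩前后的大小(表名沿用本文的 user_events 示例,实际使用时请替换)。
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 压缩效果验证(示意): 查看各列压缩前后大小与压缩比
 */
public class CompressionInspector {
    public static void showCompression(Connection conn) throws Exception {
        String sql =
            "SELECT\n" +
            "    name,\n" +
            "    formatReadableSize(data_compressed_bytes) as compressed,\n" +
            "    formatReadableSize(data_uncompressed_bytes) as uncompressed,\n" +
            "    round(data_uncompressed_bytes / nullIf(data_compressed_bytes, 0), 2) as ratio\n" +
            "FROM system.columns\n" +
            "WHERE database = currentDatabase() AND table = 'user_events'\n" +
            "ORDER BY data_compressed_bytes DESC";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%-15s 压缩后: %-10s 压缩前: %-10s 压缩比: %.2f%n",
                    rs.getString("name"),
                    rs.getString("compressed"),
                    rs.getString("uncompressed"),
                    rs.getDouble("ratio"));
            }
        }
    }
}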
2.3 数据分片与副本
分布式架构 (3个节点,2个分片,2个副本)
Shard 1 Shard 2
┌──────────────────┐ ┌──────────────────┐
│ Replica 1 │ │ Replica 1 │
│ Node 1 │ │ Node 2 │
│ Data: A-M │ │ Data: N-Z │
└──────────────────┘ └──────────────────┘
│ │
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Replica 2 │ │ Replica 2 │
│ Node 3 │ │ Node 1 │
│ Data: A-M │ │ Data: N-Z │
└──────────────────┘ └──────────────────┘
ZooKeeper Cluster
┌──────────────────────────────────┐
│ Coordination & Metadata │
│ - Replica sync │
│ - Leader election │
│ - Schema management │
└──────────────────────────────────┘
3. 环境搭建与配置
3.1 Docker 快速启动
bash
# 1. 拉取镜像
docker pull clickhouse/clickhouse-server
# 2. 启动 ClickHouse
docker run -d \
--name clickhouse-server \
-p 8123:8123 \
-p 9000:9000 \
--ulimit nofile=262144:262144 \
clickhouse/clickhouse-server
# 3. 进入客户端
docker exec -it clickhouse-server clickhouse-client
# 4. 测试连接
SELECT version();
3.2 Maven 依赖配置
xml
<!-- pom.xml -->
<properties>
<clickhouse.version>0.5.0</clickhouse.version>
<java.version>11</java.version>
</properties>
<dependencies>
<!-- ClickHouse JDBC 驱动 -->
<dependency>
<groupId>com.clickhouse</groupId>
<artifactId>clickhouse-jdbc</artifactId>
<version>${clickhouse.version}</version>
<classifier>all</classifier>
</dependency>
<!-- HTTP 客户端 (可选) -->
<dependency>
<groupId>com.clickhouse</groupId>
<artifactId>clickhouse-http-client</artifactId>
<version>${clickhouse.version}</version>
</dependency>
<!-- 连接池 -->
<dependency>
<groupId>com.zaxxer</groupId>
<artifactId>HikariCP</artifactId>
<version>5.0.1</version>
</dependency>
<!-- 日志 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>2.0.9</version>
</dependency>
</dependencies>
3.3 Java 连接配置
java
import com.clickhouse.jdbc.ClickHouseDataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Properties;
/**
* ClickHouse 连接配置
*/
public class ClickHouseConfig {
private static final String JDBC_URL =
"jdbc:clickhouse://localhost:8123/default";
private static final String USERNAME = "default";
private static final String PASSWORD = "";
/**
* 方式1: 直接连接
*/
public static Connection getConnection() throws SQLException {
Properties properties = new Properties();
properties.setProperty("user", USERNAME);
properties.setProperty("password", PASSWORD);
// 设置连接参数
properties.setProperty("socket_timeout", "300000");
properties.setProperty("max_execution_time", "300");
properties.setProperty("max_insert_block_size", "1048576");
ClickHouseDataSource dataSource =
new ClickHouseDataSource(JDBC_URL, properties);
return dataSource.getConnection();
}
/**
* 方式2: 使用连接池 (推荐)
*/
public static HikariDataSource getDataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(JDBC_URL);
config.setUsername(USERNAME);
config.setPassword(PASSWORD);
// 连接池配置
config.setMaximumPoolSize(10);
config.setMinimumIdle(2);
config.setConnectionTimeout(30000);
config.setIdleTimeout(600000);
config.setMaxLifetime(1800000);
// ClickHouse 特定配置
config.addDataSourceProperty("socket_timeout", "300000");
config.addDataSourceProperty("compress", "true");
config.addDataSourceProperty("max_insert_block_size", "1048576");
return new HikariDataSource(config);
}
/**
* 测试连接
*/
public static void testConnection() {
try (Connection conn = getConnection()) {
var stmt = conn.createStatement();
var rs = stmt.executeQuery("SELECT version()");
if (rs.next()) {
System.out.println("ClickHouse Version: " + rs.getString(1));
}
System.out.println("连接成功!");
} catch (SQLException e) {
System.err.println("连接失败: " + e.getMessage());
e.printStackTrace();
}
}
public static void main(String[] args) {
testConnection();
}
}
4. 表引擎详解
4.1 MergeTree 家族
ClickHouse 最强大的表引擎系列,支持数据分区、索引、副本等特性。
MergeTree 家族
MergeTree (基础)
├─ ReplacingMergeTree (去重)
├─ SummingMergeTree (求和)
├─ AggregatingMergeTree (聚合)
├─ CollapsingMergeTree (折叠)
└─ VersionedCollapsingMergeTree (版本折叠)
ReplicatedMergeTree (副本)
├─ ReplicatedReplacingMergeTree
├─ ReplicatedSummingMergeTree
└─ ReplicatedAggregatingMergeTree
java
import java.sql.Connection;
import java.sql.Statement;
/**
* 表引擎创建示例
*/
public class TableEngineExample {
/**
* 1. MergeTree - 最常用的表引擎
*/
public static void createMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS user_events (\n" +
" event_date Date,\n" +
" event_time DateTime,\n" +
" user_id UInt64,\n" +
" event_type String,\n" +
" page_url String,\n" +
" device_type String,\n" +
" session_id String,\n" +
" duration UInt32\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMM(event_date)\n" +
"ORDER BY (user_id, event_time)\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("MergeTree 表创建成功!");
}
}
/**
* 2. ReplacingMergeTree - 自动去重
* 场景: 用户信息表,同一用户只保留最新记录
*/
public static void createReplacingMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS user_profiles (\n" +
" user_id UInt64,\n" +
" username String,\n" +
" email String,\n" +
" age UInt8,\n" +
" city String,\n" +
" update_time DateTime,\n" +
" version UInt64\n" +
") ENGINE = ReplacingMergeTree(version)\n" +
"ORDER BY user_id\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("ReplacingMergeTree 表创建成功!");
}
}
/**
* 3. SummingMergeTree - 自动求和
* 场景: 指标累加,如页面 PV、UV 统计
*/
public static void createSummingMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS page_statistics (\n" +
" stat_date Date,\n" +
" page_url String,\n" +
" pv UInt64,\n" +
" uv UInt64,\n" +
" bounce_count UInt64\n" +
") ENGINE = SummingMergeTree()\n" +
"PARTITION BY toYYYYMM(stat_date)\n" +
"ORDER BY (stat_date, page_url)\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("SummingMergeTree 表创建成功!");
}
}
/**
* 4. AggregatingMergeTree - 聚合函数
* 场景: 预聚合统计数据
*/
public static void createAggregatingMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS user_metrics_agg (\n" +
" stat_date Date,\n" +
" user_id UInt64,\n" +
" total_orders SimpleAggregateFunction(sum, UInt64),\n" +
" total_amount SimpleAggregateFunction(sum, Decimal(18,2)),\n" +
" avg_amount AggregateFunction(avg, Decimal(18,2)),\n" +
" unique_products AggregateFunction(uniq, UInt64)\n" +
") ENGINE = AggregatingMergeTree()\n" +
"PARTITION BY toYYYYMM(stat_date)\n" +
"ORDER BY (stat_date, user_id)\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("AggregatingMergeTree 表创建成功!");
}
}
/**
* 5. 分布式表
* 场景: 集群环境下的分片表
*/
public static void createDistributedTable(Connection conn)
throws Exception {
// 先创建本地表
String localTableSql =
"CREATE TABLE IF NOT EXISTS orders_local (\n" +
" order_id UInt64,\n" +
" user_id UInt64,\n" +
" product_id UInt64,\n" +
" amount Decimal(18,2),\n" +
" order_time DateTime\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMM(order_time)\n" +
"ORDER BY (user_id, order_time)";
// 创建分布式表
String distributedTableSql =
"CREATE TABLE IF NOT EXISTS orders_all AS orders_local\n" +
"ENGINE = Distributed(cluster_name, default, orders_local, rand())";
try (Statement stmt = conn.createStatement()) {
stmt.execute(localTableSql);
// stmt.execute(distributedTableSql); // 需要集群环境
System.out.println("分布式表创建成功!");
}
}
}
4.2 表引擎选择指南
表引擎选择流程
需要副本? ───Yes───▶ 使用 Replicated* 系列
│
No
│
▼
需要去重? ───Yes───▶ ReplacingMergeTree
│
No
│
▼
需要求和? ───Yes───▶ SummingMergeTree
│
No
│
▼
需要预聚合? ─Yes───▶ AggregatingMergeTree
│
No
│
▼
普通场景 ───────────▶ MergeTree
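需要注意:ReplacingMergeTree 的去重、SummingMergeTree 的求和都发生在后台 merge 时,插入后立即查询可能看到尚未合并的多行数据。下面是一段示意代码(使用 4.1 中创建的 page_statistics 表),说明查询时仍应显式聚合:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * SummingMergeTree 合并时机演示(示意代码)
 * 结论: 不要依赖后台合并的即时性,查询时显式 sum()
 */
public class MergeTimingDemo {
    public static void demo(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // 同一主键分两批插入,后台合并前表里会同时存在两行
            stmt.execute("INSERT INTO page_statistics VALUES ('2024-01-15', '/home', 100, 80, 5)");
            stmt.execute("INSERT INTO page_statistics VALUES ('2024-01-15', '/home', 50, 30, 2)");
            // 正确做法: 查询时显式 sum(),无论是否已合并,结果都正确
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT page_url, sum(pv) as pv, sum(uv) as uv " +
                    "FROM page_statistics WHERE stat_date = '2024-01-15' GROUP BY page_url")) {
                while (rs.next()) {
                    System.out.printf("%s pv=%d uv=%d%n",
                        rs.getString("page_url"), rs.getLong("pv"), rs.getLong("uv"));
                }
            }
            // OPTIMIZE ... FINAL 可手动触发合并,但开销较大,线上不建议频繁执行
            stmt.execute("OPTIMIZE TABLE page_statistics FINAL");
        }
    }
}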
5. Java 客户端集成
5.1 JDBC 基础操作
java
import java.sql.*;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
/**
* ClickHouse JDBC 基础操作
*/
public class ClickHouseJDBCExample {
/**
* 插入数据
*/
public static void insertData(Connection conn) throws SQLException {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
// 插入单条数据
pstmt.setDate(1, Date.valueOf("2024-01-15"));
pstmt.setTimestamp(2, Timestamp.valueOf(LocalDateTime.now()));
pstmt.setLong(3, 1001L);
pstmt.setString(4, "page_view");
pstmt.setString(5, "/home");
pstmt.setString(6, "mobile");
pstmt.setString(7, "session_12345");
pstmt.setInt(8, 120);
int rows = pstmt.executeUpdate();
System.out.println("插入 " + rows + " 条数据");
}
}
/**
* 批量插入 (推荐方式)
*/
public static void batchInsert(Connection conn,
List<UserEvent> events)
throws SQLException {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
for (UserEvent event : events) {
pstmt.setDate(1, Date.valueOf(event.eventDate));
pstmt.setTimestamp(2, Timestamp.valueOf(event.eventTime));
pstmt.setLong(3, event.userId);
pstmt.setString(4, event.eventType);
pstmt.setString(5, event.pageUrl);
pstmt.setString(6, event.deviceType);
pstmt.setString(7, event.sessionId);
pstmt.setInt(8, event.duration);
pstmt.addBatch();
}
int[] results = pstmt.executeBatch();
System.out.println("批量插入 " + results.length + " 条数据");
}
}
/**
* 查询数据
*/
public static List<UserEvent> queryData(Connection conn,
long userId)
throws SQLException {
String sql =
"SELECT event_date, event_time, user_id, event_type, " +
" page_url, device_type, session_id, duration " +
"FROM user_events " +
"WHERE user_id = ? " +
"ORDER BY event_time DESC " +
"LIMIT 100";
List<UserEvent> events = new ArrayList<>();
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
pstmt.setLong(1, userId);
try (ResultSet rs = pstmt.executeQuery()) {
while (rs.next()) {
UserEvent event = new UserEvent();
event.eventDate = rs.getDate("event_date").toLocalDate();
event.eventTime = rs.getTimestamp("event_time").toLocalDateTime();
event.userId = rs.getLong("user_id");
event.eventType = rs.getString("event_type");
event.pageUrl = rs.getString("page_url");
event.deviceType = rs.getString("device_type");
event.sessionId = rs.getString("session_id");
event.duration = rs.getInt("duration");
events.add(event);
}
}
}
return events;
}
/**
* 聚合查询
*/
public static void aggregateQuery(Connection conn) throws SQLException {
String sql =
"SELECT " +
" event_date, " +
" event_type, " +
" device_type, " +
" count() as event_count, " +
" uniq(user_id) as unique_users, " +
" avg(duration) as avg_duration, " +
" max(duration) as max_duration " +
"FROM user_events " +
"WHERE event_date >= today() - 7 " +
"GROUP BY event_date, event_type, device_type " +
"ORDER BY event_date DESC, event_count DESC";
try (Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(sql)) {
System.out.println("\n用户行为统计:");
System.out.println("─────────────────────────────────────────────────");
System.out.printf("%-12s %-15s %-12s %8s %8s %12s %12s\n",
"日期", "事件类型", "设备类型", "事件数", "UV", "平均时长", "最大时长");
System.out.println("─────────────────────────────────────────────────");
while (rs.next()) {
System.out.printf("%-12s %-15s %-12s %8d %8d %12.2f %12d\n",
rs.getDate("event_date"),
rs.getString("event_type"),
rs.getString("device_type"),
rs.getLong("event_count"),
rs.getLong("unique_users"),
rs.getDouble("avg_duration"),
rs.getLong("max_duration")
);
}
}
}
/**
* 用户事件实体类
*/
public static class UserEvent {
public java.time.LocalDate eventDate;
public java.time.LocalDateTime eventTime;
public long userId;
public String eventType;
public String pageUrl;
public String deviceType;
public String sessionId;
public int duration;
@Override
public String toString() {
return String.format(
"UserEvent{userId=%d, type='%s', page='%s', time=%s}",
userId, eventType, pageUrl, eventTime
);
}
}
}
5.2 高性能批量写入
java
import com.clickhouse.client.*;
import com.clickhouse.data.ClickHouseFormat;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
// UserEvent 复用 5.1 ClickHouseJDBCExample 中定义的实体类
/**
* 高性能批量写入
*/
public class HighPerformanceWriter {
/**
* 方式1: 使用 INSERT VALUES (适合小批量)
*/
public static void batchInsertWithValues(Connection conn,
List<UserEvent> events)
throws SQLException {
StringBuilder sql = new StringBuilder(
"INSERT INTO user_events VALUES "
);
for (int i = 0; i < events.size(); i++) {
UserEvent event = events.get(i);
if (i > 0) sql.append(",");
sql.append(String.format(
"('%s','%s',%d,'%s','%s','%s','%s',%d)",
event.eventDate,
event.eventTime,
event.userId,
event.eventType,
event.pageUrl,
event.deviceType,
event.sessionId,
event.duration
));
}
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql.toString());
}
}
/**
* 方式2: 使用 CSV 格式 (推荐,高性能)
*/
public static void batchInsertWithCSV(ClickHouseClient client,
List<UserEvent> events)
throws Exception {
// 构建 CSV 数据
StringBuilder csv = new StringBuilder();
for (UserEvent event : events) {
csv.append(event.eventDate).append(",")
.append(event.eventTime).append(",")
.append(event.userId).append(",")
.append(event.eventType).append(",")
.append(event.pageUrl).append(",")
.append(event.deviceType).append(",")
.append(event.sessionId).append(",")
.append(event.duration).append("\n");
}
// 使用 ClickHouse Client API
ClickHouseRequest<?> request = client.read(
ClickHouseNode.of("http://localhost:8123")
);
CompletableFuture<ClickHouseResponse> future = request
.write()
.query("INSERT INTO user_events FORMAT CSV")
.data(new ByteArrayInputStream(
csv.toString().getBytes(StandardCharsets.UTF_8)
))
.executeAsync();
ClickHouseResponse response = future.get();
System.out.println("CSV 批量插入成功!");
}
/**
* 方式3: 异步批量写入器
*/
public static class AsyncBatchWriter {
private final Connection connection;
private final List<UserEvent> buffer;
private final int batchSize;
private final long flushIntervalMs;
private volatile boolean running;
public AsyncBatchWriter(Connection connection,
int batchSize,
long flushIntervalMs) {
this.connection = connection;
this.batchSize = batchSize;
this.flushIntervalMs = flushIntervalMs;
this.buffer = new ArrayList<>(batchSize);
this.running = true;
// 启动定时刷新线程
startFlushThread();
}
public synchronized void write(UserEvent event) {
buffer.add(event);
if (buffer.size() >= batchSize) {
flush();
}
}
private synchronized void flush() {
if (buffer.isEmpty()) return;
try {
batchInsertWithValues(connection, new ArrayList<>(buffer));
buffer.clear();
System.out.println("已刷新批次数据");
} catch (SQLException e) {
System.err.println("批量写入失败: " + e.getMessage());
}
}
private void startFlushThread() {
new Thread(() -> {
while (running) {
try {
Thread.sleep(flushIntervalMs);
flush();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
}).start();
}
public void close() {
running = false;
flush();
}
}
}
6. 数据写入实战
6.1 实时日志采集
基于 Kafka + ClickHouse 的实时日志处理系统。
java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.sql.Connection;
import java.time.Duration;
import java.util.*;
/**
* Kafka 实时日志消费并写入 ClickHouse
* 场景: 收集 Web 访问日志实时分析
*/
public class RealTimeLogCollector {
private final KafkaConsumer<String, String> consumer;
private final Connection clickHouseConn;
private final int batchSize = 1000;
private final List<AccessLog> buffer = new ArrayList<>();
public RealTimeLogCollector(Connection clickHouseConn) {
this.clickHouseConn = clickHouseConn;
this.consumer = createKafkaConsumer();
}
private KafkaConsumer<String, String> createKafkaConsumer() {
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
"localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG,
"clickhouse-log-consumer");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1000");
return new KafkaConsumer<>(props);
}
public void start() {
consumer.subscribe(Collections.singletonList("web-access-logs"));
System.out.println("开始消费日志数据...");
try {
while (true) {
ConsumerRecords<String, String> records =
consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records) {
processLog(record.value());
}
// 批量写入
if (buffer.size() >= batchSize) {
flushToClickHouse();
consumer.commitSync();
}
}
} catch (Exception e) {
e.printStackTrace();
} finally {
flushToClickHouse();
consumer.close();
}
}
private void processLog(String logJson) {
try {
// 解析 JSON 日志
AccessLog log = parseAccessLog(logJson);
buffer.add(log);
} catch (Exception e) {
System.err.println("解析日志失败: " + e.getMessage());
}
}
private AccessLog parseAccessLog(String json) {
// 使用 FastJSON 或其他 JSON 库解析
com.alibaba.fastjson2.JSONObject obj =
com.alibaba.fastjson2.JSON.parseObject(json);
AccessLog log = new AccessLog();
log.timestamp = obj.getLong("timestamp");
log.userId = obj.getLong("user_id");
log.ip = obj.getString("ip");
log.method = obj.getString("method");
log.url = obj.getString("url");
log.statusCode = obj.getInteger("status_code");
log.responseTime = obj.getInteger("response_time");
log.userAgent = obj.getString("user_agent");
log.referer = obj.getString("referer");
return log;
}
private void flushToClickHouse() {
if (buffer.isEmpty()) return;
String sql =
"INSERT INTO access_logs " +
"(log_time, user_id, ip, method, url, status_code, " +
"response_time, user_agent, referer) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)";
try (var pstmt = clickHouseConn.prepareStatement(sql)) {
for (AccessLog log : buffer) {
pstmt.setTimestamp(1,
new java.sql.Timestamp(log.timestamp));
pstmt.setLong(2, log.userId);
pstmt.setString(3, log.ip);
pstmt.setString(4, log.method);
pstmt.setString(5, log.url);
pstmt.setInt(6, log.statusCode);
pstmt.setInt(7, log.responseTime);
pstmt.setString(8, log.userAgent);
pstmt.setString(9, log.referer);
pstmt.addBatch();
}
pstmt.executeBatch();
System.out.println("成功写入 " + buffer.size() + " 条日志");
buffer.clear();
} catch (Exception e) {
System.err.println("写入 ClickHouse 失败: " + e.getMessage());
}
}
/**
* 访问日志实体
*/
public static class AccessLog {
public long timestamp;
public long userId;
public String ip;
public String method;
public String url;
public int statusCode;
public int responseTime;
public String userAgent;
public String referer;
}
/**
* 创建访问日志表
*/
public static void createAccessLogTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS access_logs (\n" +
" log_time DateTime,\n" +
" user_id UInt64,\n" +
" ip String,\n" +
" method String,\n" +
" url String,\n" +
" status_code UInt16,\n" +
" response_time UInt32,\n" +
" user_agent String,\n" +
" referer String,\n" +
" INDEX idx_user_id user_id TYPE minmax GRANULARITY 3,\n" +
" INDEX idx_url url TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMMDD(log_time)\n" +
"ORDER BY (log_time, user_id)\n" +
"TTL log_time + INTERVAL 30 DAY\n" +
"SETTINGS index_granularity = 8192";
try (var stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("访问日志表创建成功!");
}
}
}
6.2 Spring Boot 集成
java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import javax.sql.DataSource;
import java.sql.Timestamp;
import java.util.List;
import java.util.Map;
/**
* Spring Boot 集成 ClickHouse
*/
@SpringBootApplication
public class ClickHouseSpringBootApp {
public static void main(String[] args) {
SpringApplication.run(ClickHouseSpringBootApp.class, args);
}
}
/**
* ClickHouse 配置类
*/
@Configuration
class ClickHouseConfiguration {
@Bean
public DataSource clickHouseDataSource() {
return ClickHouseConfig.getDataSource();
}
@Bean
public JdbcTemplate clickHouseJdbcTemplate(DataSource dataSource) {
return new JdbcTemplate(dataSource);
}
}
/**
* ClickHouse 数据访问服务
*/
@Service
class ClickHouseService {
private final JdbcTemplate jdbcTemplate;
public ClickHouseService(JdbcTemplate jdbcTemplate) {
this.jdbcTemplate = jdbcTemplate;
}
/**
* 查询用户行为统计
*/
public List<Map<String, Object>> getUserBehaviorStats(
String startDate, String endDate) {
String sql =
"SELECT " +
" toDate(event_time) as date, " +
" event_type, " +
" count() as total, " +
" uniq(user_id) as uv, " +
" avg(duration) as avg_duration " +
"FROM user_events " +
"WHERE event_date BETWEEN ? AND ? " +
"GROUP BY date, event_type " +
"ORDER BY date DESC, total DESC";
return jdbcTemplate.queryForList(sql, startDate, endDate);
}
/**
* 保存用户事件
*/
public void saveUserEvent(UserEventDTO event) {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
jdbcTemplate.update(sql,
event.getEventDate(),
new Timestamp(event.getEventTime()),
event.getUserId(),
event.getEventType(),
event.getPageUrl(),
event.getDeviceType(),
event.getSessionId(),
event.getDuration()
);
}
/**
* 批量保存
*/
public void batchSaveUserEvents(List<UserEventDTO> events) {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
jdbcTemplate.batchUpdate(sql, events, events.size(),
(ps, event) -> {
ps.setDate(1, event.getEventDate());
ps.setTimestamp(2, new Timestamp(event.getEventTime()));
ps.setLong(3, event.getUserId());
ps.setString(4, event.getEventType());
ps.setString(5, event.getPageUrl());
ps.setString(6, event.getDeviceType());
ps.setString(7, event.getSessionId());
ps.setInt(8, event.getDuration());
}
);
}
/**
* 用户事件 DTO
*/
public static class UserEventDTO {
private java.sql.Date eventDate;
private long eventTime;
private long userId;
private String eventType;
private String pageUrl;
private String deviceType;
private String sessionId;
private int duration;
// Getters and Setters
public java.sql.Date getEventDate() { return eventDate; }
public void setEventDate(java.sql.Date eventDate) {
this.eventDate = eventDate;
}
public long getEventTime() { return eventTime; }
public void setEventTime(long eventTime) {
this.eventTime = eventTime;
}
public long getUserId() { return userId; }
public void setUserId(long userId) { this.userId = userId; }
public String getEventType() { return eventType; }
public void setEventType(String eventType) {
this.eventType = eventType;
}
public String getPageUrl() { return pageUrl; }
public void setPageUrl(String pageUrl) { this.pageUrl = pageUrl; }
public String getDeviceType() { return deviceType; }
public void setDeviceType(String deviceType) {
this.deviceType = deviceType;
}
public String getSessionId() { return sessionId; }
public void setSessionId(String sessionId) {
this.sessionId = sessionId;
}
public int getDuration() { return duration; }
public void setDuration(int duration) { this.duration = duration; }
}
}
7. 查询优化技巧
7.1 索引优化
sql
-- 1. 主键索引 (ORDER BY 决定)
CREATE TABLE user_events (
event_date Date,
event_time DateTime,
user_id UInt64,
event_type String,
page_url String
) ENGINE = MergeTree()
ORDER BY (user_id, event_time); -- 主键索引
-- 查询会很快 (使用了主键)
SELECT * FROM user_events
WHERE user_id = 1001;
-- 查询较慢 (未使用主键)
SELECT * FROM user_events
WHERE event_type = 'click';
-- 2. 跳数索引 (Skip Index)
CREATE TABLE access_logs (
log_time DateTime,
url String,
status_code UInt16,
response_time UInt32,
INDEX idx_status status_code TYPE minmax GRANULARITY 3,
INDEX idx_url url TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1,
INDEX idx_response response_time TYPE set(100) GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY log_time;
java
import java.sql.Connection;
import java.sql.Timestamp;
/**
* 查询优化示例
*/
public class QueryOptimization {
/**
* 优化1: 使用主键过滤
*/
public static void optimizedQuery1(Connection conn)
throws Exception {
// 好的查询 - 使用主键
String goodSql =
"SELECT * FROM user_events " +
"WHERE user_id = ? " +
"AND event_time >= ? " +
"ORDER BY event_time DESC " +
"LIMIT 100";
// 不好的查询 - 未使用主键
String badSql =
"SELECT * FROM user_events " +
"WHERE page_url = '/home' " +
"ORDER BY event_time DESC " +
"LIMIT 100";
// 执行优化后的查询
try (var pstmt = conn.prepareStatement(goodSql)) {
pstmt.setLong(1, 1001L);
pstmt.setTimestamp(2,
Timestamp.valueOf("2024-01-01 00:00:00"));
var rs = pstmt.executeQuery();
// 处理结果...
}
}
/**
* 优化2: 使用 PREWHERE 代替 WHERE
* PREWHERE 先过滤数据,再读取其他列
*/
public static void optimizedQuery2(Connection conn)
throws Exception {
String sql =
"SELECT " +
" user_id, " +
" event_type, " +
" page_url, " +
" duration " +
"FROM user_events " +
"PREWHERE event_date = today() " + // 先过滤
"WHERE user_id > 1000 " +
"LIMIT 1000";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
// 处理结果...
}
}
/**
* 优化3: 合理使用分区裁剪
*/
public static void optimizedQuery3(Connection conn)
throws Exception {
String sql =
"SELECT " +
" event_type, " +
" count() as cnt " +
"FROM user_events " +
"WHERE event_date >= today() - 7 " + // 分区裁剪
"GROUP BY event_type";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
// 处理结果...
}
}
/**
* 优化4: 使用 FINAL 去重 (ReplacingMergeTree)
*/
public static void optimizedQuery4(Connection conn)
throws Exception {
// 不使用 FINAL - 可能有重复数据
String sql1 =
"SELECT * FROM user_profiles " +
"WHERE user_id = 1001";
// 使用 FINAL - 保证唯一,但性能较慢
String sql2 =
"SELECT * FROM user_profiles FINAL " +
"WHERE user_id = 1001";
// 推荐: 使用 GROUP BY 去重
String sql3 =
"SELECT " +
" user_id, " +
" argMax(username, version) as username, " +
" argMax(email, version) as email, " +
" argMax(age, version) as age, " +
" argMax(city, version) as city " +
"FROM user_profiles " +
"WHERE user_id = 1001 " +
"GROUP BY user_id";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql3)) {
// 处理结果...
}
}
/**
* 优化5: 使用近似函数
*/
public static void optimizedQuery5(Connection conn)
throws Exception {
// 精确 UV 计算 (慢)
String exactSql =
"SELECT count(DISTINCT user_id) FROM user_events";
// 近似 UV 计算 (快,误差<2%)
String approxSql =
"SELECT uniq(user_id) FROM user_events";
// 更快的近似计算
String fasterSql =
"SELECT uniqHLL12(user_id) FROM user_events";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(approxSql)) {
if (rs.next()) {
System.out.println("UV: " + rs.getLong(1));
}
}
}
}
7.2 查询性能对比
查询性能对比 (10 亿行数据,示意数据)
场景1: 全表扫描 vs 主键查询
┌────────────────────────────┬──────────┬──────────────┐
│ Query Type │ Time │ Rows Scanned│
├────────────────────────────┼──────────┼──────────────┤
│ 全表扫描 │ 25.3s │ 1,000,000,000│
│ 主键查询 │ 0.05s │ 10,000 │
│ 主键+分区裁剪 │ 0.02s │ 1,000 │
└────────────────────────────┴──────────┴──────────────┘
场景2: COUNT(DISTINCT) vs uniq()
┌────────────────────────────┬──────────┬──────────────┐
│ Function │ Time │ Accuracy │
├────────────────────────────┼──────────┼──────────────┤
│ count(DISTINCT user_id) │ 18.5s │ 100% │
│ uniq(user_id) │ 2.1s │ 98% │
│ uniqHLL12(user_id) │ 0.8s │ 95% │
└────────────────────────────┴──────────┴──────────────┘
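上表中的扫描行数可以用 EXPLAIN 自行确认:较新版本的 ClickHouse 支持 EXPLAIN indexes = 1,输出会列出主键索引与分区裁剪前后保留的 part 和 granule 数量。以下为示意代码:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 使用 EXPLAIN 检查索引命中与分区裁剪(示意)
 */
public class ExplainExample {
    public static void explain(Connection conn) throws Exception {
        String sql =
            "EXPLAIN indexes = 1 " +
            "SELECT count() FROM user_events " +
            "WHERE user_id = 1001 AND event_date >= today() - 7";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            // 输出中的 PrimaryKey / Partition 部分显示裁剪前后的数据量
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}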
8. 物化视图与聚合
8.1 物化视图基础
物化视图可以预计算聚合结果,大幅提升查询性能。
sql
-- 创建物化视图
CREATE MATERIALIZED VIEW user_daily_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(stat_date)
ORDER BY (stat_date, user_id)
AS
SELECT
toDate(event_time) as stat_date,
user_id,
event_type,
count() as event_count,
sum(duration) as total_duration
FROM user_events
GROUP BY stat_date, user_id, event_type;
-- 查询物化视图 (非常快)
SELECT
stat_date,
user_id,
sum(event_count) as total_events,
sum(total_duration) as total_duration
FROM user_daily_stats
WHERE stat_date >= today() - 30
GROUP BY stat_date, user_id;
java
import java.sql.Connection;
/**
* 物化视图管理
*/
public class MaterializedViewManager {
/**
* 创建用户行为统计物化视图
*/
public static void createUserStatsMV(Connection conn)
throws Exception {
String sql =
"CREATE MATERIALIZED VIEW IF NOT EXISTS user_behavior_stats_mv\n" +
"ENGINE = AggregatingMergeTree()\n" +
"PARTITION BY toYYYYMM(stat_date)\n" +
"ORDER BY (stat_date, user_id, event_type)\n" +
"AS\n" +
"SELECT\n" +
" toDate(event_time) as stat_date,\n" +
" user_id,\n" +
" event_type,\n" +
" device_type,\n" +
" countState() as event_count,\n" +
" sumState(duration) as total_duration,\n" +
" avgState(duration) as avg_duration,\n" +
" uniqState(session_id) as unique_sessions\n" +
"FROM user_events\n" +
"GROUP BY stat_date, user_id, event_type, device_type";
try (var stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("物化视图创建成功!");
}
}
/**
* 查询物化视图
*/
public static void queryMaterializedView(Connection conn)
throws Exception {
String sql =
"SELECT\n" +
" stat_date,\n" +
" user_id,\n" +
" event_type,\n" +
" device_type,\n" +
" countMerge(event_count) as total_events,\n" +
" sumMerge(total_duration) as total_duration,\n" +
" avgMerge(avg_duration) as avg_duration,\n" +
" uniqMerge(unique_sessions) as unique_sessions\n" +
"FROM user_behavior_stats_mv\n" +
"WHERE stat_date >= today() - 7\n" +
"GROUP BY stat_date, user_id, event_type, device_type\n" +
"ORDER BY stat_date DESC, total_events DESC\n" +
"LIMIT 100";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n用户行为统计 (来自物化视图):");
while (rs.next()) {
System.out.printf(
"日期: %s, 用户: %d, 类型: %s, 事件数: %d, " +
"平均时长: %.2f, 会话数: %d\n",
rs.getDate("stat_date"),
rs.getLong("user_id"),
rs.getString("event_type"),
rs.getLong("total_events"),
rs.getDouble("avg_duration"),
rs.getLong("unique_sessions")
);
}
}
}
/**
* 刷新物化视图 (ClickHouse 自动增量更新)
*/
public static void refreshMaterializedView(Connection conn)
throws Exception {
// ClickHouse 物化视图自动实时更新
// 无需手动刷新
System.out.println("ClickHouse 物化视图自动实时更新");
// 如果需要重建,可以删除后重建
// DROP TABLE user_behavior_stats_mv;
// 然后重新创建
}
}
8.2 实时报表场景
java
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;
/**
* 实时报表生成器
* 场景: 电商实时交易大屏
*/
public class RealTimeDashboard {
private final Connection conn;
public RealTimeDashboard(Connection conn) {
this.conn = conn;
}
/**
* 今日实时交易概况
*/
public DashboardMetrics getTodayMetrics() throws Exception {
String sql =
"SELECT\n" +
" count() as total_orders,\n" +
" sum(amount) as total_amount,\n" +
" avg(amount) as avg_amount,\n" +
" uniq(user_id) as unique_users,\n" +
" uniq(product_id) as unique_products,\n" +
" countIf(status = 'SUCCESS') as success_count,\n" +
" countIf(status = 'FAILED') as failed_count\n" +
"FROM orders\n" +
"WHERE order_date = today()";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
if (rs.next()) {
DashboardMetrics metrics = new DashboardMetrics();
metrics.totalOrders = rs.getLong("total_orders");
metrics.totalAmount = rs.getDouble("total_amount");
metrics.avgAmount = rs.getDouble("avg_amount");
metrics.uniqueUsers = rs.getLong("unique_users");
metrics.uniqueProducts = rs.getLong("unique_products");
metrics.successCount = rs.getLong("success_count");
metrics.failedCount = rs.getLong("failed_count");
return metrics;
}
}
return null;
}
/**
* 每小时趋势
*/
public List<HourlyTrend> getHourlyTrend() throws Exception {
String sql =
"SELECT\n" +
" toHour(order_time) as hour,\n" +
" count() as order_count,\n" +
" sum(amount) as hour_amount,\n" +
" uniq(user_id) as hour_users\n" +
"FROM orders\n" +
"WHERE order_date = today()\n" +
"GROUP BY hour\n" +
"ORDER BY hour";
List<HourlyTrend> trends = new ArrayList<>();
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
while (rs.next()) {
HourlyTrend trend = new HourlyTrend();
trend.hour = rs.getInt("hour");
trend.orderCount = rs.getLong("order_count");
trend.hourAmount = rs.getDouble("hour_amount");
trend.hourUsers = rs.getLong("hour_users");
trends.add(trend);
}
}
return trends;
}
/**
* Top 商品排行
*/
public List<ProductRank> getTopProducts(int limit) throws Exception {
String sql =
"SELECT\n" +
" product_id,\n" +
" count() as sale_count,\n" +
" sum(amount) as sale_amount,\n" +
" uniq(user_id) as buyer_count\n" +
"FROM orders\n" +
"WHERE order_date >= today() - 7\n" +
"GROUP BY product_id\n" +
"ORDER BY sale_amount DESC\n" +
"LIMIT ?";
List<ProductRank> ranks = new ArrayList<>();
try (var pstmt = conn.prepareStatement(sql)) {
pstmt.setInt(1, limit);
try (var rs = pstmt.executeQuery()) {
while (rs.next()) {
ProductRank rank = new ProductRank();
rank.productId = rs.getLong("product_id");
rank.saleCount = rs.getLong("sale_count");
rank.saleAmount = rs.getDouble("sale_amount");
rank.buyerCount = rs.getLong("buyer_count");
ranks.add(rank);
}
}
}
return ranks;
}
/**
* 实时大屏指标
*/
public static class DashboardMetrics {
public long totalOrders;
public double totalAmount;
public double avgAmount;
public long uniqueUsers;
public long uniqueProducts;
public long successCount;
public long failedCount;
public void display() {
System.out.println("\n========== 今日实时数据 ==========");
System.out.printf("总订单数: %,d\n", totalOrders);
System.out.printf("总金额: ¥%,.2f\n", totalAmount);
System.out.printf("平均客单价: ¥%,.2f\n", avgAmount);
System.out.printf("下单用户数: %,d\n", uniqueUsers);
System.out.printf("售出商品数: %,d\n", uniqueProducts);
System.out.printf("成功订单: %,d (%.2f%%)\n",
successCount,
100.0 * successCount / totalOrders);
System.out.printf("失败订单: %,d (%.2f%%)\n",
failedCount,
100.0 * failedCount / totalOrders);
System.out.println("================================\n");
}
}
/**
* 小时趋势
*/
public static class HourlyTrend {
public int hour;
public long orderCount;
public double hourAmount;
public long hourUsers;
}
/**
* 商品排行
*/
public static class ProductRank {
public long productId;
public long saleCount;
public double saleAmount;
public long buyerCount;
}
}
9. 分布式集群实践
9.1 集群架构
ClickHouse 分布式集群
Client Applications
│
▼
┌────────────────────────┐
│ Load Balancer │
└────────────────────────┘
│
┌────────────────┴────────────────┐
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Shard 1 │ │ Shard 2 │
│ │ │ │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Replica 1 │ │ │ │ Replica 1 │ │
│ │ (Master) │ │ │ │ (Master) │ │
│ │ Node 1 │ │ │ │ Node 3 │ │
│ └───────────┘ │ │ └───────────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Replica 2 │ │ │ │ Replica 2 │ │
│ │ (Slave) │ │ │ │ (Slave) │ │
│ │ Node 2 │ │ │ │ Node 4 │ │
│ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘
│ │
└──────────────┬───────────────────┘
│
▼
┌────────────────┐
│ ZooKeeper │
│ Cluster │
│ (3-5 nodes) │
└────────────────┘
9.2 分布式表操作
java
import java.sql.Connection;
import java.sql.Timestamp;
/**
* 分布式集群操作
*/
public class DistributedClusterOps {
/**
* 创建分布式表
*/
public static void createDistributedTable(Connection conn)
throws Exception {
// 1. 在每个节点创建本地表
String localTableSql =
"CREATE TABLE IF NOT EXISTS orders_local ON CLUSTER my_cluster (\n" +
" order_id UInt64,\n" +
" user_id UInt64,\n" +
" product_id UInt64,\n" +
" amount Decimal(18,2),\n" +
" order_time DateTime,\n" +
" status String\n" +
") ENGINE = ReplicatedMergeTree(\n" +
" '/clickhouse/tables/{shard}/orders_local',\n" +
" '{replica}'\n" +
")\n" +
"PARTITION BY toYYYYMM(order_time)\n" +
"ORDER BY (user_id, order_time)";
// 2. 创建分布式表
String distributedTableSql =
"CREATE TABLE IF NOT EXISTS orders_all ON CLUSTER my_cluster\n" +
"AS orders_local\n" +
"ENGINE = Distributed(\n" +
" my_cluster, -- 集群名称\n" +
" default, -- 数据库名\n" +
" orders_local, -- 本地表名\n" +
" rand() -- 分片键\n" +
")";
try (var stmt = conn.createStatement()) {
stmt.execute(localTableSql);
stmt.execute(distributedTableSql);
System.out.println("分布式表创建成功!");
}
}
/**
* 写入分布式表
*/
public static void insertToDistributedTable(Connection conn)
throws Exception {
// 写入分布式表,自动分片
String sql =
"INSERT INTO orders_all " +
"(order_id, user_id, product_id, amount, order_time, status) " +
"VALUES (?, ?, ?, ?, ?, ?)";
try (var pstmt = conn.prepareStatement(sql)) {
for (int i = 0; i < 1000; i++) {
pstmt.setLong(1, i);
pstmt.setLong(2, i % 100);
pstmt.setLong(3, i % 50);
pstmt.setBigDecimal(4,
new java.math.BigDecimal(String.valueOf(100 + i)));
pstmt.setTimestamp(5,
new Timestamp(System.currentTimeMillis()));
pstmt.setString(6, "SUCCESS");
pstmt.addBatch();
}
pstmt.executeBatch();
System.out.println("批量写入分布式表成功!");
}
}
/**
* 查询分布式表
*/
public static void queryDistributedTable(Connection conn)
throws Exception {
// 查询分布式表,自动聚合所有分片数据
String sql =
"SELECT\n" +
" toDate(order_time) as date,\n" +
" count() as order_count,\n" +
" sum(amount) as total_amount,\n" +
" uniq(user_id) as unique_users\n" +
"FROM orders_all\n" +
"WHERE order_time >= now() - INTERVAL 7 DAY\n" +
"GROUP BY date\n" +
"ORDER BY date DESC";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n分布式查询结果:");
while (rs.next()) {
System.out.printf(
"日期: %s, 订单数: %d, 总金额: %.2f, UV: %d\n",
rs.getDate("date"),
rs.getLong("order_count"),
rs.getDouble("total_amount"),
rs.getLong("unique_users")
);
}
}
}
/**
* 集群健康检查
*/
public static void checkClusterHealth(Connection conn)
throws Exception {
String sql =
"SELECT\n" +
" cluster,\n" +
" shard_num,\n" +
" replica_num,\n" +
" host_name,\n" +
" port,\n" +
" is_local\n" +
"FROM system.clusters\n" +
"WHERE cluster = 'my_cluster'\n" +
"ORDER BY shard_num, replica_num";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n集群状态:");
while (rs.next()) {
System.out.printf(
"分片: %d, 副本: %d, 主机: %s:%d, 本地: %s\n",
rs.getInt("shard_num"),
rs.getInt("replica_num"),
rs.getString("host_name"),
rs.getInt("port"),
rs.getBoolean("is_local") ? "是" : "否"
);
}
}
}
}
10. 实时数据分析场景
10.1 用户留存分析
java
import java.sql.Connection;
/**
* 用户留存分析
*/
public class RetentionAnalysis {
/**
* 计算 N 日留存率
*/
public static void calculateRetention(Connection conn, int days)
throws Exception {
String sql =
    "SELECT\n" +
    "    first_date,\n" +
    "    dateDiff('day', first_date, event_date) as retention_day,\n" +
    "    uniq(user_id) as retained_users,\n" +
    "    any(cohort_size) as total_users,\n" +
    "    round(uniq(user_id) * 100.0 / any(cohort_size), 2) as retention_rate\n" +
    "FROM user_events AS e\n" +
    "INNER JOIN\n" +
    "(\n" +
    "    SELECT\n" +
    "        user_id,\n" +
    "        first_date,\n" +
    "        count() OVER (PARTITION BY first_date) as cohort_size\n" +
    "    FROM\n" +
    "    (\n" +
    "        SELECT user_id, min(event_date) as first_date\n" +
    "        FROM user_events\n" +
    "        WHERE event_date >= today() - ?\n" +
    "        GROUP BY user_id\n" +
    "    )\n" +
    ") AS c USING (user_id)\n" +
    "WHERE e.event_date >= today() - ?\n" +
    "  AND dateDiff('day', first_date, e.event_date) <= ?\n" +
    "GROUP BY first_date, retention_day\n" +
    "ORDER BY first_date DESC, retention_day";
try (var pstmt = conn.prepareStatement(sql)) {
    // 三个参数: 两处为新用户统计窗口(天数),一处为最大留存天数
    pstmt.setInt(1, days + 7);
    pstmt.setInt(2, days + 7);
    pstmt.setInt(3, days);
try (var rs = pstmt.executeQuery()) {
System.out.println("\n用户留存分析:");
System.out.println("─────────────────────────────────────────");
System.out.printf("%-12s %-8s %-12s %-12s %-10s\n",
"首次日期", "留存天", "留存用户", "总用户", "留存率%");
System.out.println("─────────────────────────────────────────");
while (rs.next()) {
System.out.printf("%-12s %-8d %-12d %-12d %-10.2f\n",
rs.getDate("first_date"),
rs.getInt("retention_day"),
rs.getLong("retained_users"),
rs.getLong("total_users"),
rs.getDouble("retention_rate")
);
}
}
}
}
}
10.2 漏斗分析
java
import java.sql.Connection;
/**
* 转化漏斗分析
*/
public class FunnelAnalysis {
/**
* 分析用户转化路径
* 路径: 浏览商品 -> 加入购物车 -> 下单 -> 支付
*/
public static void analyzeFunnel(Connection conn) throws Exception {
String sql =
    // ClickHouse 中 UNION ALL 之后的 ORDER BY 只作用于最后一段子查询,
    // 因此外层再包一层 SELECT 对整体结果排序
    "SELECT step, step_num, users, conversion_rate FROM\n" +
    "(\n" +
    "SELECT\n" +
" '浏览商品' as step,\n" +
" 1 as step_num,\n" +
" view_users as users,\n" +
" 100.0 as conversion_rate\n" +
"FROM (\n" +
" SELECT uniq(user_id) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'view'\n" +
" AND event_date = today()\n" +
")\n" +
"UNION ALL\n" +
"SELECT\n" +
" '加入购物车' as step,\n" +
" 2 as step_num,\n" +
" cart_users as users,\n" +
" round(cart_users * 100.0 / view_users, 2) as conversion_rate\n" +
"FROM (\n" +
" SELECT\n" +
" uniq(user_id) as cart_users,\n" +
" (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'add_to_cart'\n" +
" AND event_date = today()\n" +
")\n" +
"UNION ALL\n" +
"SELECT\n" +
" '下单' as step,\n" +
" 3 as step_num,\n" +
" order_users as users,\n" +
" round(order_users * 100.0 / view_users, 2) as conversion_rate\n" +
"FROM (\n" +
" SELECT\n" +
" uniq(user_id) as order_users,\n" +
" (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'order'\n" +
" AND event_date = today()\n" +
")\n" +
"UNION ALL\n" +
"SELECT\n" +
" '支付' as step,\n" +
" 4 as step_num,\n" +
" pay_users as users,\n" +
" round(pay_users * 100.0 / view_users, 2) as conversion_rate\n" +
"FROM (\n" +
" SELECT\n" +
" uniq(user_id) as pay_users,\n" +
" (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'payment'\n" +
" AND event_date = today()\n" +
")\n" +
"ORDER BY step_num";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n转化漏斗分析:");
System.out.println("─────────────────────────────────────");
System.out.printf("%-15s %-12s %-12s\n",
"步骤", "用户数", "转化率%");
System.out.println("─────────────────────────────────────");
while (rs.next()) {
String step = rs.getString("step");
long users = rs.getLong("users");
double rate = rs.getDouble("conversion_rate");
// ASCII 进度条
int barLength = (int)(rate / 5);
String bar = "█".repeat(barLength);
System.out.printf("%-15s %-12d %6.2f%% %s\n",
step, users, rate, bar);
}
}
}
}
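上面的 UNION ALL 写法直观,但同一份数据被扫描了四次,且没有约束各步骤的先后顺序。ClickHouse 内置的 windowFunnel 函数可以在一次扫描内统计每个用户在时间窗口中按顺序最深到达的步骤,下面是一个示意写法(事件名沿用本文示例,窗口取 1 小时):
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 使用 windowFunnel 做有序漏斗分析(示意)
 */
public class WindowFunnelExample {
    public static void analyze(Connection conn) throws Exception {
        String sql =
            "SELECT level, count() as users\n" +
            "FROM\n" +
            "(\n" +
            "    SELECT\n" +
            "        user_id,\n" +
            "        windowFunnel(3600)(\n" +
            "            event_time,\n" +
            "            event_type = 'view',\n" +
            "            event_type = 'add_to_cart',\n" +
            "            event_type = 'order',\n" +
            "            event_type = 'payment'\n" +
            "        ) as level\n" +
            "    FROM user_events\n" +
            "    WHERE event_date = today()\n" +
            "    GROUP BY user_id\n" +
            ")\n" +
            "GROUP BY level\n" +
            "ORDER BY level";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // level = N 表示该用户按顺序完成了前 N 步
                System.out.printf("到达步骤 %d 的用户数: %d%n",
                    rs.getInt("level"), rs.getLong("users"));
            }
        }
    }
}
若要得到"至少到达第 N 步"的人数,再对 level >= N 的行求和即可。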
11. 性能优化最佳实践
11.1 数据类型选择
数据类型优化建议
1. 整数类型 - 选择合适的范围
┌──────────────┬────────────┬─────────────────┐
│ Type │ Size │ Range │
├──────────────┼────────────┼─────────────────┤
│ UInt8 │ 1 byte │ 0 ~ 255 │
│ UInt16 │ 2 bytes │ 0 ~ 65535 │
│ UInt32 │ 4 bytes │ 0 ~ 4B │
│ UInt64 │ 8 bytes │ 0 ~ 18Q │
└──────────────┴────────────┴─────────────────┘
2. 字符串类型
- String: 可变长度,适合短字符串
- FixedString(N): 固定长度,适合MD5、UUID
- LowCardinality(String): 低基数字符串,节省空间
3. 时间类型
- Date: 日期(2字节)
- DateTime: 日期时间(4字节)
- DateTime64: 高精度时间(8字节)
java
import java.sql.Connection;
/**
* 性能优化实践
*/
public class PerformanceOptimization {
/**
* 优化1: 使用 LowCardinality
*/
public static void useLowCardinality(Connection conn)
throws Exception {
// 不好的设计
String badSql =
"CREATE TABLE events_bad (\n" +
" event_time DateTime,\n" +
" event_type String, -- 可能只有10种类型\n" +
" device_type String, -- 可能只有5种设备\n" +
" country String -- 可能只有200个国家\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY event_time";
// 好的设计 - 使用 LowCardinality
String goodSql =
"CREATE TABLE events_good (\n" +
" event_time DateTime,\n" +
" event_type LowCardinality(String),\n" +
" device_type LowCardinality(String),\n" +
" country LowCardinality(String)\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY event_time";
// LowCardinality 可节省50-90%存储空间
}
/**
* 优化2: 合理设置 ORDER BY
*/
public static void optimizeOrderBy(Connection conn)
throws Exception {
// 根据查询模式设置
// 如果经常按 user_id 查询
String sql1 =
"CREATE TABLE user_events (\n" +
" user_id UInt64,\n" +
" event_time DateTime,\n" +
" event_type String\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY (user_id, event_time)"; // user_id 在前
// 如果经常按时间范围查询
String sql2 =
"CREATE TABLE time_series_data (\n" +
" timestamp DateTime,\n" +
" metric_name String,\n" +
" value Float64\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY (timestamp, metric_name)"; // timestamp 在前
}
/**
* 优化3: 使用分区提升查询性能
*/
public static void usePartitioning(Connection conn)
throws Exception {
String sql =
"CREATE TABLE access_logs (\n" +
" log_time DateTime,\n" +
" url String,\n" +
" status_code UInt16\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMMDD(log_time)\n" + // 按天分区
"ORDER BY log_time\n" +
"TTL log_time + INTERVAL 30 DAY"; // 30天后自动删除
// 好处:
// 1. 查询时只扫描相关分区
// 2. 可以按分区删除数据
// 3. TTL 可以自动清理旧数据
}
/**
* 优化4: 批量操作
*/
public static void batchOperations() {
// 不好: 单条插入
// for (Data d : dataList) {
// INSERT INTO table VALUES (d);
// }
// 好: 批量插入
// INSERT INTO table VALUES (d1), (d2), (d3), ...
// 更好: 使用 CSV 格式
// INSERT INTO table FORMAT CSV
System.out.println("批量操作性能提升 10-100 倍");
}
}
11.2 查询优化检查清单
查询优化检查清单
☑ 1. 索引使用
□ WHERE 条件是否使用了 ORDER BY 列?
□ 是否使用了跳数索引?
□ 是否使用了 PREWHERE?
☑ 2. 分区裁剪
□ 查询是否限制了分区键范围?
□ 避免全表扫描
☑ 3. 数据类型
□ 使用合适大小的整数类型
□ 低基数字符串使用 LowCardinality
□ 避免使用 Nullable (性能损失20-30%)
☑ 4. 聚合优化
□ 使用近似函数 (uniq 代替 count distinct)
□ 使用物化视图预聚合
□ GROUP BY 列顺序与 ORDER BY 一致
☑ 5. JOIN 优化
□ 小表在右侧
□ 使用字典表代替 JOIN (见清单后的示意代码)
□ 避免大表 JOIN 大表
☑ 6. 并行度
□ max_threads 设置合理
□ 分区数量适中
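针对清单中"使用字典表代替 JOIN"一项,下面给出一个示意:假设存在维表 products(product_id, product_name, category),该表不在本文示例之内;先把它定义为字典,查询时用 dictGet 在内存中查找维度属性,避免与大事实表 JOIN:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 用字典 (dictGet) 代替 JOIN 的示意代码
 */
public class DictionaryExample {
    /**
     * 把维表 products 定义为字典 (连接参数按实际环境调整)
     */
    public static void createDictionary(Connection conn) throws Exception {
        String ddl =
            "CREATE DICTIONARY IF NOT EXISTS product_dict (\n" +
            "    product_id UInt64,\n" +
            "    product_name String,\n" +
            "    category String\n" +
            ")\n" +
            "PRIMARY KEY product_id\n" +
            "SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD '' DB 'default' TABLE 'products'))\n" +
            "LIFETIME(MIN 300 MAX 600)\n" +
            "LAYOUT(HASHED())";
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
        }
    }
    /**
     * 查询时用 dictGet 取维度属性,无需 JOIN 维表
     */
    public static void queryWithDict(Connection conn) throws Exception {
        String sql =
            "SELECT\n" +
            "    product_id,\n" +
            "    dictGet('product_dict', 'product_name', product_id) as product_name,\n" +
            "    sum(amount) as sale_amount\n" +
            "FROM orders\n" +
            "WHERE order_date >= today() - 7\n" +
            "GROUP BY product_id\n" +
            "ORDER BY sale_amount DESC\n" +
            "LIMIT 10";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%d %s %.2f%n",
                    rs.getLong("product_id"),
                    rs.getString("product_name"),
                    rs.getDouble("sale_amount"));
            }
        }
    }
}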
12. 监控与运维
12.1 系统表监控
java
import java.sql.Connection;
/**
* ClickHouse 监控
*/
public class ClickHouseMonitoring {
/**
* 监控查询性能
*/
public static void monitorQueries(Connection conn) throws Exception {
String sql =
"SELECT\n" +
" user,\n" +
" query_id,\n" +
" query_duration_ms,\n" +
" read_rows,\n" +
" read_bytes,\n" +
" memory_usage,\n" +
" query\n" +
"FROM system.query_log\n" +
"WHERE type = 'QueryFinish'\n" +
"AND event_time >= now() - INTERVAL 1 HOUR\n" +
"ORDER BY query_duration_ms DESC\n" +
"LIMIT 10";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n慢查询Top 10:");
while (rs.next()) {
System.out.printf(
"耗时: %dms, 读取行数: %d, 内存: %dMB\nSQL: %s\n\n",
rs.getLong("query_duration_ms"),
rs.getLong("read_rows"),
rs.getLong("memory_usage") / 1024 / 1024,
rs.getString("query").substring(0,
Math.min(100, rs.getString("query").length()))
);
}
}
}
/**
* 监控表大小
*/
public static void monitorTableSize(Connection conn) throws Exception {
String sql =
"SELECT\n" +
" database,\n" +
" table,\n" +
" formatReadableSize(sum(bytes)) as size,\n" +
" sum(rows) as rows,\n" +
" sum(bytes) as bytes_size\n" +
"FROM system.parts\n" +
"WHERE active\n" +
"GROUP BY database, table\n" +
"ORDER BY bytes_size DESC\n" +
"LIMIT 20";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n表大小统计:");
System.out.println("─────────────────────────────────────────");
while (rs.next()) {
System.out.printf("%-20s.%-20s %15s %15d行\n",
rs.getString("database"),
rs.getString("table"),
rs.getString("size"),
rs.getLong("rows")
);
}
}
}
/**
* 监控副本同步状态
*/
public static void monitorReplication(Connection conn)
throws Exception {
String sql =
"SELECT\n" +
" database,\n" +
" table,\n" +
" is_leader,\n" +
" is_readonly,\n" +
" absolute_delay,\n" +
" queue_size,\n" +
" inserts_in_queue\n" +
"FROM system.replicas";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n副本同步状态:");
while (rs.next()) {
System.out.printf(
"表: %s.%s, Leader: %s, 延迟: %ds, 队列: %d\n",
rs.getString("database"),
rs.getString("table"),
rs.getBoolean("is_leader") ? "是" : "否",
rs.getLong("absolute_delay"),
rs.getLong("queue_size")
);
}
}
}
}
12.2 告警指标
关键告警指标
1. 查询性能
- 慢查询数量 > 阈值
- 平均查询时间 > 阈值
- 查询错误率 > 1%
2. 存储
- 磁盘使用率 > 80%
- 单表大小 > 阈值
- 分区数量 > 1000
3. 副本
- 副本延迟 > 60s
- 副本队列 > 1000
- 副本故障
4. 系统资源
- CPU 使用率 > 80%
- 内存使用率 > 85%
- 网络带宽 > 80%
5. Merge 操作
- Merge 队列 > 100
- Merge 失败次数 > 0
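这些指标大多可以直接从系统表查到。下面是一段示意代码,演示其中几项(慢查询数、磁盘使用率、副本延迟)的采集方式,阈值判断与对接告警系统的部分省略:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 从系统表采集告警指标(示意)
 */
public class AlertMetricsCollector {
    public static void collect(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // 1. 最近 5 分钟慢查询数量 (阈值按业务调整)
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count() FROM system.query_log " +
                    "WHERE type = 'QueryFinish' " +
                    "AND event_time >= now() - INTERVAL 5 MINUTE " +
                    "AND query_duration_ms > 10000")) {
                if (rs.next()) System.out.println("慢查询数: " + rs.getLong(1));
            }
            // 2. 各磁盘使用率
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, " +
                    "round((total_space - free_space) * 100.0 / total_space, 2) as used_pct " +
                    "FROM system.disks")) {
                while (rs.next()) {
                    System.out.printf("磁盘 %s 使用率: %.2f%%%n",
                        rs.getString("name"), rs.getDouble("used_pct"));
                }
            }
            // 3. 副本最大延迟 (秒)
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT max(absolute_delay) FROM system.replicas")) {
                if (rs.next()) System.out.println("最大副本延迟: " + rs.getLong(1) + "s");
            }
        }
    }
}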
13. 总结
13.1 ClickHouse 核心优势
ClickHouse 核心优势总结
1. 性能 ★★★★★
- 典型 OLAP 查询比行式数据库快 100-1000 倍
- 支持实时写入和查询
- 列式存储 + 向量化执行
2. 压缩 ★★★★★
- 压缩比 10:1 甚至更高
- 节省存储成本
- 减少 I/O 操作
3. 扩展性 ★★★★★
- 水平扩展
- 线性性能增长
- 支持 PB 级数据
4. 易用性 ★★★★☆
- 标准 SQL 支持
- 丰富的函数库
- 多语言客户端
5. 可靠性 ★★★★☆
- 数据副本
- 自动故障转移
- 数据一致性保证
13.2 适用场景
最适合:
- 用户行为分析
- 实时报表和大屏
- 日志分析和监控
- 时序数据分析
- 数据仓库 OLAP
不适合:
- OLTP 事务处理
- 频繁更新/删除
- 需要强一致性的场景
- 行级别锁定
13.3 最佳实践总结
-
表设计
- 合理选择表引擎
- 优化 ORDER BY 列
- 使用分区管理数据
- 设置 TTL 自动清理
-
数据写入
- 批量写入
- 异步写入
- 使用 CSV 格式
- 控制写入频率
-
查询优化
- 使用主键过滤
- 使用 PREWHERE
- 避免 SELECT *
- 使用物化视图
-
运维管理
- 定期监控性能
- 及时清理过期数据
- 备份关键数据
- 升级到稳定版本
13.4 学习资源
- 官方文档: https://clickhouse.com/docs
- GitHub: https://github.com/ClickHouse/ClickHouse
- 中文社区: https://clickhouse.com/docs/zh
- 性能测试: https://benchmark.clickhouse.com
13.5 未来发展
ClickHouse 正在持续演进:
- 更强大的 SQL 支持: 窗口函数、递归查询
- 更好的实时性: 毫秒级延迟
- 云原生: Kubernetes 集成
- 机器学习: 内置 ML 功能
- 多模型: 支持图数据库、文档数据库
附录: 常用命令
bash
# 启动 ClickHouse
clickhouse-server --config-file=/etc/clickhouse-server/config.xml
# 客户端连接
clickhouse-client --host localhost --port 9000
# 导入数据
clickhouse-client --query="INSERT INTO table FORMAT CSV" < data.csv
# 导出数据
clickhouse-client --query="SELECT * FROM table FORMAT CSV" > data.csv
# 查看表结构 (以下 SQL 语句均在 clickhouse-client 中执行)
DESCRIBE TABLE table_name;
SHOW CREATE TABLE table_name;
# 优化表
OPTIMIZE TABLE table_name FINAL;
# 查看分区
SELECT partition, name, rows FROM system.parts WHERE table = 'table_name';
# 删除分区
ALTER TABLE table_name DROP PARTITION 'partition_id';