ClickHouse 实战指南:高性能列式数据库实践
目录
- [1. ClickHouse 简介](#1. ClickHouse 简介)
- [2. ClickHouse 核心架构](#2. ClickHouse 核心架构)
- [3. 环境搭建与配置](#3. 环境搭建与配置)
- [4. 表引擎详解](#4. 表引擎详解)
- [5. Java 客户端集成](#5. Java 客户端集成)
- [6. 数据写入实战](#6. 数据写入实战)
- [7. 查询优化技巧](#7. 查询优化技巧)
- [8. 物化视图与聚合](#8. 物化视图与聚合)
- [9. 分布式集群实践](#9. 分布式集群实践)
- [10. 实时数据分析场景](#10. 实时数据分析场景)
- [11. 性能优化最佳实践](#11. 性能优化最佳实践)
- [12. 监控与运维](#12. 监控与运维)
- [13. 总结](#13. 总结)
1. ClickHouse 简介
ClickHouse 是一个用于在线分析处理(OLAP)的开源列式数据库管理系统,由俄罗斯最大的搜索引擎 Yandex 开发。它能够实时生成分析报告,处理数十亿行数据时仍保持亚秒级查询响应。
1.1 核心特性
- 真正的列式存储: 数据按列存储,极大提升查询性能
- 数据压缩: 压缩比可达 10:1 甚至更高
- 向量化执行: SIMD 指令集加速计算
- 分布式查询: 支持分片和副本
- 实时数据插入: 每秒可处理百万级数据
- SQL 支持: 支持标准 SQL 及扩展语法
- 高可用性: 支持数据副本和故障转移
1.2 ClickHouse vs 传统数据库
性能对比 (10 亿行数据聚合查询,示意数据,实际结果取决于硬件、建表方式与查询)
ClickHouse █ 0.5s
PostgreSQL ████████████████████████ 120s
MySQL ████████████████████████████ 150s
Elasticsearch ████████ 40s
使用场景对比
┌─────────────────┬──────────────┬──────────────┬──────────────┐
│ Feature │ ClickHouse │ MySQL │ PostgreSQL │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ OLAP 查询 │ ★★★★★ │ ★★☆☆☆ │ ★★★☆☆ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ OLTP 事务 │ ★☆☆☆☆ │ ★★★★★ │ ★★★★★ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ 数据压缩 │ ★★★★★ │ ★★☆☆☆ │ ★★★☆☆ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ 写入性能 │ ★★★★★ │ ★★★☆☆ │ ★★★☆☆ │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ 分布式支持 │ ★★★★★ │ ★★☆☆☆ │ ★★☆☆☆ │
└─────────────────┴──────────────┴──────────────┴──────────────┘
1.3 典型应用场景
- 用户行为分析: 网站点击流、APP 使用数据分析
- 实时报表: 业务指标实时统计与展示
- 日志分析: 海量日志存储与查询
- 时序数据: IoT 设备监控、系统指标监控
- 数据仓库: 替代传统 OLAP 引擎
2. ClickHouse 核心架构
2.1 整体架构
ClickHouse 架构图
Client Layer (客户端层)
┌─────────────────────────────────────────────────────────┐
│ JDBC Client │ HTTP Client │ CLI Client │ Grafana │
└────────────────────────┬────────────────────────────────┘
│
▼
Query Processing Layer (查询处理层)
┌─────────────────────────────────────────────────────────┐
│ Query Parser │
│ ↓ │
│ Query Optimizer │
│ ↓ │
│ Execution Engine │
│ (Vectorized Processing) │
└────────────────────────┬────────────────────────────────┘
│
▼
Storage Layer (存储层)
┌─────────────────────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Table Engine │ │ Table Engine │ │ Table Engine │ │
│ │ MergeTree │ │ Log Family │ │ Distributed │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ ↓ │
│ Column-Oriented Storage │
│ ↓ │
│ Compressed Data Files │
│ (.bin, .mrk, .idx) │
└──────────────────────────────────────────────────────────┘
2.2 列式存储原理
行式存储 vs 列式存储
行式存储 (MySQL, PostgreSQL):
Row 1: [ID=1, Name=Alice, Age=25, City=Beijing]
Row 2: [ID=2, Name=Bob, Age=30, City=Shanghai]
Row 3: [ID=3, Name=Carol, Age=28, City=Beijing]
查询: SELECT AVG(Age) FROM users WHERE City='Beijing'
需要读取: 所有列的所有数据
列式存储 (ClickHouse):
ID Column: [1, 2, 3]
Name Column: [Alice, Bob, Carol]
Age Column: [25, 30, 28]
City Column: [Beijing, Shanghai, Beijing]
查询: SELECT AVG(Age) FROM users WHERE City='Beijing'
需要读取: Age列 + City列 (仅相关列)
优势:
1. 只读取需要的列,减少 I/O
2. 同一列数据类型相同,压缩率高
3. 向量化执行,CPU 缓存友好
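列式存储的压缩收益可以在自己的表上直接验证:下面的示意代码通过 system.columns 系统表对比各列压缩前后的大小(表名沿用本文的 user_events 示例,实际使用时请替换)。
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 压缩效果验证(示意): 查看各列压缩前后大小与压缩比
 */
public class CompressionInspector {
    public static void showCompression(Connection conn) throws Exception {
        String sql =
            "SELECT\n" +
            "    name,\n" +
            "    formatReadableSize(data_compressed_bytes) as compressed,\n" +
            "    formatReadableSize(data_uncompressed_bytes) as uncompressed,\n" +
            "    round(data_uncompressed_bytes / nullIf(data_compressed_bytes, 0), 2) as ratio\n" +
            "FROM system.columns\n" +
            "WHERE database = currentDatabase() AND table = 'user_events'\n" +
            "ORDER BY data_compressed_bytes DESC";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%-15s 压缩后: %-10s 压缩前: %-10s 压缩比: %.2f%n",
                    rs.getString("name"),
                    rs.getString("compressed"),
                    rs.getString("uncompressed"),
                    rs.getDouble("ratio"));
            }
        }
    }
}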
2.3 数据分片与副本
分布式架构 (3个节点,2个分片,2个副本)
Shard 1 Shard 2
┌──────────────────┐ ┌──────────────────┐
│ Replica 1 │ │ Replica 1 │
│ Node 1 │ │ Node 2 │
│ Data: A-M │ │ Data: N-Z │
└──────────────────┘ └──────────────────┘
│ │
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Replica 2 │ │ Replica 2 │
│ Node 3 │ │ Node 1 │
│ Data: A-M │ │ Data: N-Z │
└──────────────────┘ └──────────────────┘
ZooKeeper Cluster
┌──────────────────────────────────┐
│ Coordination & Metadata │
│ - Replica sync │
│ - Leader election │
│ - Schema management │
└──────────────────────────────────┘
3. 环境搭建与配置
3.1 Docker 快速启动
bash
# 1. 拉取镜像
docker pull clickhouse/clickhouse-server
# 2. 启动 ClickHouse
docker run -d \
--name clickhouse-server \
-p 8123:8123 \
-p 9000:9000 \
--ulimit nofile=262144:262144 \
clickhouse/clickhouse-server
# 3. 进入客户端
docker exec -it clickhouse-server clickhouse-client
# 4. 测试连接
SELECT version();
3.2 Maven 依赖配置
xml
<!-- pom.xml -->
<properties>
<clickhouse.version>0.5.0</clickhouse.version>
<java.version>11</java.version>
</properties>
<dependencies>
<!-- ClickHouse JDBC 驱动 -->
<dependency>
<groupId>com.clickhouse</groupId>
<artifactId>clickhouse-jdbc</artifactId>
<version>${clickhouse.version}</version>
<classifier>all</classifier>
</dependency>
<!-- HTTP 客户端 (可选) -->
<dependency>
<groupId>com.clickhouse</groupId>
<artifactId>clickhouse-http-client</artifactId>
<version>${clickhouse.version}</version>
</dependency>
<!-- 连接池 -->
<dependency>
<groupId>com.zaxxer</groupId>
<artifactId>HikariCP</artifactId>
<version>5.0.1</version>
</dependency>
<!-- 日志 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>2.0.9</version>
</dependency>
</dependencies>
3.3 Java 连接配置
java
import com.clickhouse.jdbc.ClickHouseDataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Properties;
/**
* ClickHouse 连接配置
*/
public class ClickHouseConfig {
private static final String JDBC_URL =
"jdbc:clickhouse://localhost:8123/default";
private static final String USERNAME = "default";
private static final String PASSWORD = "";
/**
* 方式1: 直接连接
*/
public static Connection getConnection() throws SQLException {
Properties properties = new Properties();
properties.setProperty("user", USERNAME);
properties.setProperty("password", PASSWORD);
// 设置连接参数
properties.setProperty("socket_timeout", "300000");
properties.setProperty("max_execution_time", "300");
properties.setProperty("max_insert_block_size", "1048576");
ClickHouseDataSource dataSource =
new ClickHouseDataSource(JDBC_URL, properties);
return dataSource.getConnection();
}
/**
* 方式2: 使用连接池 (推荐)
*/
public static HikariDataSource getDataSource() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(JDBC_URL);
config.setUsername(USERNAME);
config.setPassword(PASSWORD);
// 连接池配置
config.setMaximumPoolSize(10);
config.setMinimumIdle(2);
config.setConnectionTimeout(30000);
config.setIdleTimeout(600000);
config.setMaxLifetime(1800000);
// ClickHouse 特定配置
config.addDataSourceProperty("socket_timeout", "300000");
config.addDataSourceProperty("compress", "true");
config.addDataSourceProperty("max_insert_block_size", "1048576");
return new HikariDataSource(config);
}
/**
* 测试连接
*/
public static void testConnection() {
try (Connection conn = getConnection()) {
var stmt = conn.createStatement();
var rs = stmt.executeQuery("SELECT version()");
if (rs.next()) {
System.out.println("ClickHouse Version: " + rs.getString(1));
}
System.out.println("连接成功!");
} catch (SQLException e) {
System.err.println("连接失败: " + e.getMessage());
e.printStackTrace();
}
}
public static void main(String[] args) {
testConnection();
}
}
4. 表引擎详解
4.1 MergeTree 家族
ClickHouse 最强大的表引擎系列,支持数据分区、索引、副本等特性。
MergeTree 家族
MergeTree (基础)
├─ ReplacingMergeTree (去重)
├─ SummingMergeTree (求和)
├─ AggregatingMergeTree (聚合)
├─ CollapsingMergeTree (折叠)
└─ VersionedCollapsingMergeTree (版本折叠)
ReplicatedMergeTree (副本)
├─ ReplicatedReplacingMergeTree
├─ ReplicatedSummingMergeTree
└─ ReplicatedAggregatingMergeTree
java
import java.sql.Connection;
import java.sql.Statement;
/**
* 表引擎创建示例
*/
public class TableEngineExample {
/**
* 1. MergeTree - 最常用的表引擎
*/
public static void createMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS user_events (\n" +
" event_date Date,\n" +
" event_time DateTime,\n" +
" user_id UInt64,\n" +
" event_type String,\n" +
" page_url String,\n" +
" device_type String,\n" +
" session_id String,\n" +
" duration UInt32\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMM(event_date)\n" +
"ORDER BY (user_id, event_time)\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("MergeTree 表创建成功!");
}
}
/**
* 2. ReplacingMergeTree - 自动去重
* 场景: 用户信息表,同一用户只保留最新记录
*/
public static void createReplacingMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS user_profiles (\n" +
" user_id UInt64,\n" +
" username String,\n" +
" email String,\n" +
" age UInt8,\n" +
" city String,\n" +
" update_time DateTime,\n" +
" version UInt64\n" +
") ENGINE = ReplacingMergeTree(version)\n" +
"ORDER BY user_id\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("ReplacingMergeTree 表创建成功!");
}
}
/**
* 3. SummingMergeTree - 自动求和
* 场景: 指标累加,如页面 PV、UV 统计
*/
public static void createSummingMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS page_statistics (\n" +
" stat_date Date,\n" +
" page_url String,\n" +
" pv UInt64,\n" +
" uv UInt64,\n" +
" bounce_count UInt64\n" +
") ENGINE = SummingMergeTree()\n" +
"PARTITION BY toYYYYMM(stat_date)\n" +
"ORDER BY (stat_date, page_url)\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("SummingMergeTree 表创建成功!");
}
}
/**
* 4. AggregatingMergeTree - 聚合函数
* 场景: 预聚合统计数据
*/
public static void createAggregatingMergeTreeTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS user_metrics_agg (\n" +
" stat_date Date,\n" +
" user_id UInt64,\n" +
" total_orders SimpleAggregateFunction(sum, UInt64),\n" +
" total_amount SimpleAggregateFunction(sum, Decimal(18,2)),\n" +
" avg_amount AggregateFunction(avg, Decimal(18,2)),\n" +
" unique_products AggregateFunction(uniq, UInt64)\n" +
") ENGINE = AggregatingMergeTree()\n" +
"PARTITION BY toYYYYMM(stat_date)\n" +
"ORDER BY (stat_date, user_id)\n" +
"SETTINGS index_granularity = 8192";
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("AggregatingMergeTree 表创建成功!");
}
}
/**
* 5. 分布式表
* 场景: 集群环境下的分片表
*/
public static void createDistributedTable(Connection conn)
throws Exception {
// 先创建本地表
String localTableSql =
"CREATE TABLE IF NOT EXISTS orders_local (\n" +
" order_id UInt64,\n" +
" user_id UInt64,\n" +
" product_id UInt64,\n" +
" amount Decimal(18,2),\n" +
" order_time DateTime\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMM(order_time)\n" +
"ORDER BY (user_id, order_time)";
// 创建分布式表
String distributedTableSql =
"CREATE TABLE IF NOT EXISTS orders_all AS orders_local\n" +
"ENGINE = Distributed(cluster_name, default, orders_local, rand())";
try (Statement stmt = conn.createStatement()) {
stmt.execute(localTableSql);
// stmt.execute(distributedTableSql); // 需要集群环境
System.out.println("分布式表创建成功!");
}
}
}
4.2 表引擎选择指南
表引擎选择流程
需要副本? ───Yes───▶ 使用 Replicated* 系列
│
No
│
▼
需要去重? ───Yes───▶ ReplacingMergeTree
│
No
│
▼
需要求和? ───Yes───▶ SummingMergeTree
│
No
│
▼
需要预聚合? ─Yes───▶ AggregatingMergeTree
│
No
│
▼
普通场景 ───────────▶ MergeTree
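需要注意:ReplacingMergeTree 的去重、SummingMergeTree 的求和都发生在后台 merge 时,插入后立即查询可能看到尚未合并的多行数据。下面是一段示意代码(使用 4.1 中创建的 page_statistics 表),说明查询时仍应显式聚合:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * SummingMergeTree 合并时机演示(示意代码)
 * 结论: 不要依赖后台合并的即时性,查询时显式 sum()
 */
public class MergeTimingDemo {
    public static void demo(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // 同一主键分两批插入,后台合并前表里会同时存在两行
            stmt.execute("INSERT INTO page_statistics VALUES ('2024-01-15', '/home', 100, 80, 5)");
            stmt.execute("INSERT INTO page_statistics VALUES ('2024-01-15', '/home', 50, 30, 2)");
            // 正确做法: 查询时显式 sum(),无论是否已合并,结果都正确
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT page_url, sum(pv) as pv, sum(uv) as uv " +
                    "FROM page_statistics WHERE stat_date = '2024-01-15' GROUP BY page_url")) {
                while (rs.next()) {
                    System.out.printf("%s pv=%d uv=%d%n",
                        rs.getString("page_url"), rs.getLong("pv"), rs.getLong("uv"));
                }
            }
            // OPTIMIZE ... FINAL 可手动触发合并,但开销较大,线上不建议频繁执行
            stmt.execute("OPTIMIZE TABLE page_statistics FINAL");
        }
    }
}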
5. Java 客户端集成
5.1 JDBC 基础操作
java
import java.sql.*;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
/**
* ClickHouse JDBC 基础操作
*/
public class ClickHouseJDBCExample {
/**
* 插入数据
*/
public static void insertData(Connection conn) throws SQLException {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
// 插入单条数据
pstmt.setDate(1, Date.valueOf("2024-01-15"));
pstmt.setTimestamp(2, Timestamp.valueOf(LocalDateTime.now()));
pstmt.setLong(3, 1001L);
pstmt.setString(4, "page_view");
pstmt.setString(5, "/home");
pstmt.setString(6, "mobile");
pstmt.setString(7, "session_12345");
pstmt.setInt(8, 120);
int rows = pstmt.executeUpdate();
System.out.println("插入 " + rows + " 条数据");
}
}
/**
* 批量插入 (推荐方式)
*/
public static void batchInsert(Connection conn,
List<UserEvent> events)
throws SQLException {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
for (UserEvent event : events) {
pstmt.setDate(1, Date.valueOf(event.eventDate));
pstmt.setTimestamp(2, Timestamp.valueOf(event.eventTime));
pstmt.setLong(3, event.userId);
pstmt.setString(4, event.eventType);
pstmt.setString(5, event.pageUrl);
pstmt.setString(6, event.deviceType);
pstmt.setString(7, event.sessionId);
pstmt.setInt(8, event.duration);
pstmt.addBatch();
}
int[] results = pstmt.executeBatch();
System.out.println("批量插入 " + results.length + " 条数据");
}
}
/**
* 查询数据
*/
public static List<UserEvent> queryData(Connection conn,
long userId)
throws SQLException {
String sql =
"SELECT event_date, event_time, user_id, event_type, " +
" page_url, device_type, session_id, duration " +
"FROM user_events " +
"WHERE user_id = ? " +
"ORDER BY event_time DESC " +
"LIMIT 100";
List<UserEvent> events = new ArrayList<>();
try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
pstmt.setLong(1, userId);
try (ResultSet rs = pstmt.executeQuery()) {
while (rs.next()) {
UserEvent event = new UserEvent();
event.eventDate = rs.getDate("event_date").toLocalDate();
event.eventTime = rs.getTimestamp("event_time").toLocalDateTime();
event.userId = rs.getLong("user_id");
event.eventType = rs.getString("event_type");
event.pageUrl = rs.getString("page_url");
event.deviceType = rs.getString("device_type");
event.sessionId = rs.getString("session_id");
event.duration = rs.getInt("duration");
events.add(event);
}
}
}
return events;
}
/**
* 聚合查询
*/
public static void aggregateQuery(Connection conn) throws SQLException {
String sql =
"SELECT " +
" event_date, " +
" event_type, " +
" device_type, " +
" count() as event_count, " +
" uniq(user_id) as unique_users, " +
" avg(duration) as avg_duration, " +
" max(duration) as max_duration " +
"FROM user_events " +
"WHERE event_date >= today() - 7 " +
"GROUP BY event_date, event_type, device_type " +
"ORDER BY event_date DESC, event_count DESC";
try (Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(sql)) {
System.out.println("\n用户行为统计:");
System.out.println("─────────────────────────────────────────────────");
System.out.printf("%-12s %-15s %-12s %8s %8s %12s %12s\n",
"日期", "事件类型", "设备类型", "事件数", "UV", "平均时长", "最大时长");
System.out.println("─────────────────────────────────────────────────");
while (rs.next()) {
System.out.printf("%-12s %-15s %-12s %8d %8d %12.2f %12d\n",
rs.getDate("event_date"),
rs.getString("event_type"),
rs.getString("device_type"),
rs.getLong("event_count"),
rs.getLong("unique_users"),
rs.getDouble("avg_duration"),
rs.getLong("max_duration")
);
}
}
}
/**
* 用户事件实体类
*/
public static class UserEvent {
public java.time.LocalDate eventDate;
public java.time.LocalDateTime eventTime;
public long userId;
public String eventType;
public String pageUrl;
public String deviceType;
public String sessionId;
public int duration;
@Override
public String toString() {
return String.format(
"UserEvent{userId=%d, type='%s', page='%s', time=%s}",
userId, eventType, pageUrl, eventTime
);
}
}
}
5.2 高性能批量写入
java
import com.clickhouse.client.*;
import com.clickhouse.data.ClickHouseFormat;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
// UserEvent 复用 5.1 ClickHouseJDBCExample 中定义的实体类
/**
* 高性能批量写入
*/
public class HighPerformanceWriter {
/**
* 方式1: 使用 INSERT VALUES (适合小批量)
*/
public static void batchInsertWithValues(Connection conn,
List<UserEvent> events)
throws SQLException {
StringBuilder sql = new StringBuilder(
"INSERT INTO user_events VALUES "
);
for (int i = 0; i < events.size(); i++) {
UserEvent event = events.get(i);
if (i > 0) sql.append(",");
sql.append(String.format(
"('%s','%s',%d,'%s','%s','%s','%s',%d)",
event.eventDate,
event.eventTime,
event.userId,
event.eventType,
event.pageUrl,
event.deviceType,
event.sessionId,
event.duration
));
}
try (Statement stmt = conn.createStatement()) {
stmt.execute(sql.toString());
}
}
/**
* 方式2: 使用 CSV 格式 (推荐,高性能)
*/
public static void batchInsertWithCSV(ClickHouseClient client,
List<UserEvent> events)
throws Exception {
// 构建 CSV 数据
StringBuilder csv = new StringBuilder();
for (UserEvent event : events) {
csv.append(event.eventDate).append(",")
.append(event.eventTime).append(",")
.append(event.userId).append(",")
.append(event.eventType).append(",")
.append(event.pageUrl).append(",")
.append(event.deviceType).append(",")
.append(event.sessionId).append(",")
.append(event.duration).append("\n");
}
// 使用 ClickHouse Client API
ClickHouseRequest<?> request = client.read(
ClickHouseNode.of("http://localhost:8123")
);
CompletableFuture<ClickHouseResponse> future = request
.write()
.query("INSERT INTO user_events FORMAT CSV")
.data(new ByteArrayInputStream(
csv.toString().getBytes(StandardCharsets.UTF_8)
))
.executeAsync();
ClickHouseResponse response = future.get();
System.out.println("CSV 批量插入成功!");
}
/**
* 方式3: 异步批量写入器
*/
public static class AsyncBatchWriter {
private final Connection connection;
private final List<UserEvent> buffer;
private final int batchSize;
private final long flushIntervalMs;
private volatile boolean running;
public AsyncBatchWriter(Connection connection,
int batchSize,
long flushIntervalMs) {
this.connection = connection;
this.batchSize = batchSize;
this.flushIntervalMs = flushIntervalMs;
this.buffer = new ArrayList<>(batchSize);
this.running = true;
// 启动定时刷新线程
startFlushThread();
}
public synchronized void write(UserEvent event) {
buffer.add(event);
if (buffer.size() >= batchSize) {
flush();
}
}
private synchronized void flush() {
if (buffer.isEmpty()) return;
try {
batchInsertWithValues(connection, new ArrayList<>(buffer));
buffer.clear();
System.out.println("已刷新批次数据");
} catch (SQLException e) {
System.err.println("批量写入失败: " + e.getMessage());
}
}
private void startFlushThread() {
new Thread(() -> {
while (running) {
try {
Thread.sleep(flushIntervalMs);
flush();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
}).start();
}
public void close() {
running = false;
flush();
}
}
}
6. 数据写入实战
6.1 实时日志采集
基于 Kafka + ClickHouse 的实时日志处理系统。
java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.sql.Connection;
import java.time.Duration;
import java.util.*;
/**
* Kafka 实时日志消费并写入 ClickHouse
* 场景: 收集 Web 访问日志实时分析
*/
public class RealTimeLogCollector {
private final KafkaConsumer<String, String> consumer;
private final Connection clickHouseConn;
private final int batchSize = 1000;
private final List<AccessLog> buffer = new ArrayList<>();
public RealTimeLogCollector(Connection clickHouseConn) {
this.clickHouseConn = clickHouseConn;
this.consumer = createKafkaConsumer();
}
private KafkaConsumer<String, String> createKafkaConsumer() {
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
"localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG,
"clickhouse-log-consumer");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class.getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1000");
return new KafkaConsumer<>(props);
}
public void start() {
consumer.subscribe(Collections.singletonList("web-access-logs"));
System.out.println("开始消费日志数据...");
try {
while (true) {
ConsumerRecords<String, String> records =
consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records) {
processLog(record.value());
}
// 批量写入
if (buffer.size() >= batchSize) {
flushToClickHouse();
consumer.commitSync();
}
}
} catch (Exception e) {
e.printStackTrace();
} finally {
flushToClickHouse();
consumer.close();
}
}
private void processLog(String logJson) {
try {
// 解析 JSON 日志
AccessLog log = parseAccessLog(logJson);
buffer.add(log);
} catch (Exception e) {
System.err.println("解析日志失败: " + e.getMessage());
}
}
private AccessLog parseAccessLog(String json) {
// 使用 FastJSON 或其他 JSON 库解析
com.alibaba.fastjson2.JSONObject obj =
com.alibaba.fastjson2.JSON.parseObject(json);
AccessLog log = new AccessLog();
log.timestamp = obj.getLong("timestamp");
log.userId = obj.getLong("user_id");
log.ip = obj.getString("ip");
log.method = obj.getString("method");
log.url = obj.getString("url");
log.statusCode = obj.getInteger("status_code");
log.responseTime = obj.getInteger("response_time");
log.userAgent = obj.getString("user_agent");
log.referer = obj.getString("referer");
return log;
}
private void flushToClickHouse() {
if (buffer.isEmpty()) return;
String sql =
"INSERT INTO access_logs " +
"(log_time, user_id, ip, method, url, status_code, " +
"response_time, user_agent, referer) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)";
try (var pstmt = clickHouseConn.prepareStatement(sql)) {
for (AccessLog log : buffer) {
pstmt.setTimestamp(1,
new java.sql.Timestamp(log.timestamp));
pstmt.setLong(2, log.userId);
pstmt.setString(3, log.ip);
pstmt.setString(4, log.method);
pstmt.setString(5, log.url);
pstmt.setInt(6, log.statusCode);
pstmt.setInt(7, log.responseTime);
pstmt.setString(8, log.userAgent);
pstmt.setString(9, log.referer);
pstmt.addBatch();
}
pstmt.executeBatch();
System.out.println("成功写入 " + buffer.size() + " 条日志");
buffer.clear();
} catch (Exception e) {
System.err.println("写入 ClickHouse 失败: " + e.getMessage());
}
}
/**
* 访问日志实体
*/
public static class AccessLog {
public long timestamp;
public long userId;
public String ip;
public String method;
public String url;
public int statusCode;
public int responseTime;
public String userAgent;
public String referer;
}
/**
* 创建访问日志表
*/
public static void createAccessLogTable(Connection conn)
throws Exception {
String sql =
"CREATE TABLE IF NOT EXISTS access_logs (\n" +
" log_time DateTime,\n" +
" user_id UInt64,\n" +
" ip String,\n" +
" method String,\n" +
" url String,\n" +
" status_code UInt16,\n" +
" response_time UInt32,\n" +
" user_agent String,\n" +
" referer String,\n" +
" INDEX idx_user_id user_id TYPE minmax GRANULARITY 3,\n" +
" INDEX idx_url url TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMMDD(log_time)\n" +
"ORDER BY (log_time, user_id)\n" +
"TTL log_time + INTERVAL 30 DAY\n" +
"SETTINGS index_granularity = 8192";
try (var stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("访问日志表创建成功!");
}
}
}
6.2 Spring Boot 集成
java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import javax.sql.DataSource;
import java.sql.Timestamp;
import java.util.List;
import java.util.Map;
/**
* Spring Boot 集成 ClickHouse
*/
@SpringBootApplication
public class ClickHouseSpringBootApp {
public static void main(String[] args) {
SpringApplication.run(ClickHouseSpringBootApp.class, args);
}
}
/**
* ClickHouse 配置类
*/
@Configuration
class ClickHouseConfiguration {
@Bean
public DataSource clickHouseDataSource() {
return ClickHouseConfig.getDataSource();
}
@Bean
public JdbcTemplate clickHouseJdbcTemplate(DataSource dataSource) {
return new JdbcTemplate(dataSource);
}
}
/**
* ClickHouse 数据访问服务
*/
@Service
class ClickHouseService {
private final JdbcTemplate jdbcTemplate;
public ClickHouseService(JdbcTemplate jdbcTemplate) {
this.jdbcTemplate = jdbcTemplate;
}
/**
* 查询用户行为统计
*/
public List<Map<String, Object>> getUserBehaviorStats(
String startDate, String endDate) {
String sql =
"SELECT " +
" toDate(event_time) as date, " +
" event_type, " +
" count() as total, " +
" uniq(user_id) as uv, " +
" avg(duration) as avg_duration " +
"FROM user_events " +
"WHERE event_date BETWEEN ? AND ? " +
"GROUP BY date, event_type " +
"ORDER BY date DESC, total DESC";
return jdbcTemplate.queryForList(sql, startDate, endDate);
}
/**
* 保存用户事件
*/
public void saveUserEvent(UserEventDTO event) {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
jdbcTemplate.update(sql,
event.getEventDate(),
new Timestamp(event.getEventTime()),
event.getUserId(),
event.getEventType(),
event.getPageUrl(),
event.getDeviceType(),
event.getSessionId(),
event.getDuration()
);
}
/**
* 批量保存
*/
public void batchSaveUserEvents(List<UserEventDTO> events) {
String sql =
"INSERT INTO user_events " +
"(event_date, event_time, user_id, event_type, " +
"page_url, device_type, session_id, duration) " +
"VALUES (?, ?, ?, ?, ?, ?, ?, ?)";
jdbcTemplate.batchUpdate(sql, events, events.size(),
(ps, event) -> {
ps.setDate(1, event.getEventDate());
ps.setTimestamp(2, new Timestamp(event.getEventTime()));
ps.setLong(3, event.getUserId());
ps.setString(4, event.getEventType());
ps.setString(5, event.getPageUrl());
ps.setString(6, event.getDeviceType());
ps.setString(7, event.getSessionId());
ps.setInt(8, event.getDuration());
}
);
}
/**
* 用户事件 DTO
*/
public static class UserEventDTO {
private java.sql.Date eventDate;
private long eventTime;
private long userId;
private String eventType;
private String pageUrl;
private String deviceType;
private String sessionId;
private int duration;
// Getters and Setters
public java.sql.Date getEventDate() { return eventDate; }
public void setEventDate(java.sql.Date eventDate) {
this.eventDate = eventDate;
}
public long getEventTime() { return eventTime; }
public void setEventTime(long eventTime) {
this.eventTime = eventTime;
}
public long getUserId() { return userId; }
public void setUserId(long userId) { this.userId = userId; }
public String getEventType() { return eventType; }
public void setEventType(String eventType) {
this.eventType = eventType;
}
public String getPageUrl() { return pageUrl; }
public void setPageUrl(String pageUrl) { this.pageUrl = pageUrl; }
public String getDeviceType() { return deviceType; }
public void setDeviceType(String deviceType) {
this.deviceType = deviceType;
}
public String getSessionId() { return sessionId; }
public void setSessionId(String sessionId) {
this.sessionId = sessionId;
}
public int getDuration() { return duration; }
public void setDuration(int duration) { this.duration = duration; }
}
}
7. 查询优化技巧
7.1 索引优化
sql
-- 1. 主键索引 (ORDER BY 决定)
CREATE TABLE user_events (
event_date Date,
event_time DateTime,
user_id UInt64,
event_type String,
page_url String
) ENGINE = MergeTree()
ORDER BY (user_id, event_time); -- 主键索引
-- 查询会很快 (使用了主键)
SELECT * FROM user_events
WHERE user_id = 1001;
-- 查询较慢 (未使用主键)
SELECT * FROM user_events
WHERE event_type = 'click';
-- 2. 跳数索引 (Skip Index)
CREATE TABLE access_logs (
log_time DateTime,
url String,
status_code UInt16,
response_time UInt32,
INDEX idx_status status_code TYPE minmax GRANULARITY 3,
INDEX idx_url url TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1,
INDEX idx_response response_time TYPE set(100) GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY log_time;
java
import java.sql.Connection;
import java.sql.Timestamp;
/**
* 查询优化示例
*/
public class QueryOptimization {
/**
* 优化1: 使用主键过滤
*/
public static void optimizedQuery1(Connection conn)
throws Exception {
// 好的查询 - 使用主键
String goodSql =
"SELECT * FROM user_events " +
"WHERE user_id = ? " +
"AND event_time >= ? " +
"ORDER BY event_time DESC " +
"LIMIT 100";
// 不好的查询 - 未使用主键
String badSql =
"SELECT * FROM user_events " +
"WHERE page_url = '/home' " +
"ORDER BY event_time DESC " +
"LIMIT 100";
// 执行优化后的查询
try (var pstmt = conn.prepareStatement(goodSql)) {
pstmt.setLong(1, 1001L);
pstmt.setTimestamp(2,
Timestamp.valueOf("2024-01-01 00:00:00"));
var rs = pstmt.executeQuery();
// 处理结果...
}
}
/**
* 优化2: 使用 PREWHERE 代替 WHERE
* PREWHERE 先过滤数据,再读取其他列
*/
public static void optimizedQuery2(Connection conn)
throws Exception {
String sql =
"SELECT " +
" user_id, " +
" event_type, " +
" page_url, " +
" duration " +
"FROM user_events " +
"PREWHERE event_date = today() " + // 先过滤
"WHERE user_id > 1000 " +
"LIMIT 1000";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
// 处理结果...
}
}
/**
* 优化3: 合理使用分区裁剪
*/
public static void optimizedQuery3(Connection conn)
throws Exception {
String sql =
"SELECT " +
" event_type, " +
" count() as cnt " +
"FROM user_events " +
"WHERE event_date >= today() - 7 " + // 分区裁剪
"GROUP BY event_type";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
// 处理结果...
}
}
/**
* 优化4: 使用 FINAL 去重 (ReplacingMergeTree)
*/
public static void optimizedQuery4(Connection conn)
throws Exception {
// 不使用 FINAL - 可能有重复数据
String sql1 =
"SELECT * FROM user_profiles " +
"WHERE user_id = 1001";
// 使用 FINAL - 保证唯一,但性能较慢
String sql2 =
"SELECT * FROM user_profiles FINAL " +
"WHERE user_id = 1001";
// 推荐: 使用 GROUP BY 去重
String sql3 =
"SELECT " +
" user_id, " +
" argMax(username, version) as username, " +
" argMax(email, version) as email, " +
" argMax(age, version) as age, " +
" argMax(city, version) as city " +
"FROM user_profiles " +
"WHERE user_id = 1001 " +
"GROUP BY user_id";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql3)) {
// 处理结果...
}
}
/**
* 优化5: 使用近似函数
*/
public static void optimizedQuery5(Connection conn)
throws Exception {
// 精确 UV 计算 (慢)
String exactSql =
"SELECT count(DISTINCT user_id) FROM user_events";
// 近似 UV 计算 (快,误差<2%)
String approxSql =
"SELECT uniq(user_id) FROM user_events";
// 更快的近似计算
String fasterSql =
"SELECT uniqHLL12(user_id) FROM user_events";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(approxSql)) {
if (rs.next()) {
System.out.println("UV: " + rs.getLong(1));
}
}
}
}
7.2 查询性能对比
查询性能对比 (10 亿行数据,示意数据)
场景1: 全表扫描 vs 主键查询
┌────────────────────────────┬──────────┬──────────────┐
│ Query Type │ Time │ Rows Scanned│
├────────────────────────────┼──────────┼──────────────┤
│ 全表扫描 │ 25.3s │ 1,000,000,000│
│ 主键查询 │ 0.05s │ 10,000 │
│ 主键+分区裁剪 │ 0.02s │ 1,000 │
└────────────────────────────┴──────────┴──────────────┘
场景2: COUNT(DISTINCT) vs uniq()
┌────────────────────────────┬──────────┬──────────────┐
│ Function │ Time │ Accuracy │
├────────────────────────────┼──────────┼──────────────┤
│ count(DISTINCT user_id) │ 18.5s │ 100% │
│ uniq(user_id) │ 2.1s │ 98% │
│ uniqHLL12(user_id) │ 0.8s │ 95% │
└────────────────────────────┴──────────┴──────────────┘
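上表中的扫描行数可以用 EXPLAIN 自行确认:较新版本的 ClickHouse 支持 EXPLAIN indexes = 1,输出会列出主键索引与分区裁剪前后保留的 part 和 granule 数量。以下为示意代码:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 使用 EXPLAIN 检查索引命中与分区裁剪(示意)
 */
public class ExplainExample {
    public static void explain(Connection conn) throws Exception {
        String sql =
            "EXPLAIN indexes = 1 " +
            "SELECT count() FROM user_events " +
            "WHERE user_id = 1001 AND event_date >= today() - 7";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            // 输出中的 PrimaryKey / Partition 部分显示裁剪前后的数据量
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}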
8. 物化视图与聚合
8.1 物化视图基础
物化视图可以预计算聚合结果,大幅提升查询性能。
sql
-- 创建物化视图
CREATE MATERIALIZED VIEW user_daily_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(stat_date)
ORDER BY (stat_date, user_id)
AS
SELECT
toDate(event_time) as stat_date,
user_id,
event_type,
count() as event_count,
sum(duration) as total_duration
FROM user_events
GROUP BY stat_date, user_id, event_type;
-- 查询物化视图 (非常快)
SELECT
stat_date,
user_id,
sum(event_count) as total_events,
sum(total_duration) as total_duration
FROM user_daily_stats
WHERE stat_date >= today() - 30
GROUP BY stat_date, user_id;
java
import java.sql.Connection;
/**
* 物化视图管理
*/
public class MaterializedViewManager {
/**
* 创建用户行为统计物化视图
*/
public static void createUserStatsMV(Connection conn)
throws Exception {
String sql =
"CREATE MATERIALIZED VIEW IF NOT EXISTS user_behavior_stats_mv\n" +
"ENGINE = AggregatingMergeTree()\n" +
"PARTITION BY toYYYYMM(stat_date)\n" +
"ORDER BY (stat_date, user_id, event_type)\n" +
"AS\n" +
"SELECT\n" +
" toDate(event_time) as stat_date,\n" +
" user_id,\n" +
" event_type,\n" +
" device_type,\n" +
" countState() as event_count,\n" +
" sumState(duration) as total_duration,\n" +
" avgState(duration) as avg_duration,\n" +
" uniqState(session_id) as unique_sessions\n" +
"FROM user_events\n" +
"GROUP BY stat_date, user_id, event_type, device_type";
try (var stmt = conn.createStatement()) {
stmt.execute(sql);
System.out.println("物化视图创建成功!");
}
}
/**
* 查询物化视图
*/
public static void queryMaterializedView(Connection conn)
throws Exception {
String sql =
"SELECT\n" +
" stat_date,\n" +
" user_id,\n" +
" event_type,\n" +
" device_type,\n" +
" countMerge(event_count) as total_events,\n" +
" sumMerge(total_duration) as total_duration,\n" +
" avgMerge(avg_duration) as avg_duration,\n" +
" uniqMerge(unique_sessions) as unique_sessions\n" +
"FROM user_behavior_stats_mv\n" +
"WHERE stat_date >= today() - 7\n" +
"GROUP BY stat_date, user_id, event_type, device_type\n" +
"ORDER BY stat_date DESC, total_events DESC\n" +
"LIMIT 100";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n用户行为统计 (来自物化视图):");
while (rs.next()) {
System.out.printf(
"日期: %s, 用户: %d, 类型: %s, 事件数: %d, " +
"平均时长: %.2f, 会话数: %d\n",
rs.getDate("stat_date"),
rs.getLong("user_id"),
rs.getString("event_type"),
rs.getLong("total_events"),
rs.getDouble("avg_duration"),
rs.getLong("unique_sessions")
);
}
}
}
/**
* 刷新物化视图 (ClickHouse 自动增量更新)
*/
public static void refreshMaterializedView(Connection conn)
throws Exception {
// ClickHouse 物化视图自动实时更新
// 无需手动刷新
System.out.println("ClickHouse 物化视图自动实时更新");
// 如果需要重建,可以删除后重建
// DROP TABLE user_behavior_stats_mv;
// 然后重新创建
}
}
8.2 实时报表场景
java
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;
/**
* 实时报表生成器
* 场景: 电商实时交易大屏
*/
public class RealTimeDashboard {
private final Connection conn;
public RealTimeDashboard(Connection conn) {
this.conn = conn;
}
/**
* 今日实时交易概况
*/
public DashboardMetrics getTodayMetrics() throws Exception {
String sql =
"SELECT\n" +
" count() as total_orders,\n" +
" sum(amount) as total_amount,\n" +
" avg(amount) as avg_amount,\n" +
" uniq(user_id) as unique_users,\n" +
" uniq(product_id) as unique_products,\n" +
" countIf(status = 'SUCCESS') as success_count,\n" +
" countIf(status = 'FAILED') as failed_count\n" +
"FROM orders\n" +
"WHERE order_date = today()";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
if (rs.next()) {
DashboardMetrics metrics = new DashboardMetrics();
metrics.totalOrders = rs.getLong("total_orders");
metrics.totalAmount = rs.getDouble("total_amount");
metrics.avgAmount = rs.getDouble("avg_amount");
metrics.uniqueUsers = rs.getLong("unique_users");
metrics.uniqueProducts = rs.getLong("unique_products");
metrics.successCount = rs.getLong("success_count");
metrics.failedCount = rs.getLong("failed_count");
return metrics;
}
}
return null;
}
/**
* 每小时趋势
*/
public List<HourlyTrend> getHourlyTrend() throws Exception {
String sql =
"SELECT\n" +
" toHour(order_time) as hour,\n" +
" count() as order_count,\n" +
" sum(amount) as hour_amount,\n" +
" uniq(user_id) as hour_users\n" +
"FROM orders\n" +
"WHERE order_date = today()\n" +
"GROUP BY hour\n" +
"ORDER BY hour";
List<HourlyTrend> trends = new ArrayList<>();
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
while (rs.next()) {
HourlyTrend trend = new HourlyTrend();
trend.hour = rs.getInt("hour");
trend.orderCount = rs.getLong("order_count");
trend.hourAmount = rs.getDouble("hour_amount");
trend.hourUsers = rs.getLong("hour_users");
trends.add(trend);
}
}
return trends;
}
/**
* Top 商品排行
*/
public List<ProductRank> getTopProducts(int limit) throws Exception {
String sql =
"SELECT\n" +
" product_id,\n" +
" count() as sale_count,\n" +
" sum(amount) as sale_amount,\n" +
" uniq(user_id) as buyer_count\n" +
"FROM orders\n" +
"WHERE order_date >= today() - 7\n" +
"GROUP BY product_id\n" +
"ORDER BY sale_amount DESC\n" +
"LIMIT ?";
List<ProductRank> ranks = new ArrayList<>();
try (var pstmt = conn.prepareStatement(sql)) {
pstmt.setInt(1, limit);
try (var rs = pstmt.executeQuery()) {
while (rs.next()) {
ProductRank rank = new ProductRank();
rank.productId = rs.getLong("product_id");
rank.saleCount = rs.getLong("sale_count");
rank.saleAmount = rs.getDouble("sale_amount");
rank.buyerCount = rs.getLong("buyer_count");
ranks.add(rank);
}
}
}
return ranks;
}
/**
* 实时大屏指标
*/
public static class DashboardMetrics {
public long totalOrders;
public double totalAmount;
public double avgAmount;
public long uniqueUsers;
public long uniqueProducts;
public long successCount;
public long failedCount;
public void display() {
System.out.println("\n========== 今日实时数据 ==========");
System.out.printf("总订单数: %,d\n", totalOrders);
System.out.printf("总金额: ¥%,.2f\n", totalAmount);
System.out.printf("平均客单价: ¥%,.2f\n", avgAmount);
System.out.printf("下单用户数: %,d\n", uniqueUsers);
System.out.printf("售出商品数: %,d\n", uniqueProducts);
System.out.printf("成功订单: %,d (%.2f%%)\n",
successCount,
100.0 * successCount / totalOrders);
System.out.printf("失败订单: %,d (%.2f%%)\n",
failedCount,
100.0 * failedCount / totalOrders);
System.out.println("================================\n");
}
}
/**
* 小时趋势
*/
public static class HourlyTrend {
public int hour;
public long orderCount;
public double hourAmount;
public long hourUsers;
}
/**
* 商品排行
*/
public static class ProductRank {
public long productId;
public long saleCount;
public double saleAmount;
public long buyerCount;
}
}
9. 分布式集群实践
9.1 集群架构
ClickHouse 分布式集群
Client Applications
│
▼
┌────────────────────────┐
│ Load Balancer │
└────────────────────────┘
│
┌────────────────┴────────────────┐
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Shard 1 │ │ Shard 2 │
│ │ │ │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Replica 1 │ │ │ │ Replica 1 │ │
│ │ (Master) │ │ │ │ (Master) │ │
│ │ Node 1 │ │ │ │ Node 3 │ │
│ └───────────┘ │ │ └───────────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Replica 2 │ │ │ │ Replica 2 │ │
│ │ (Slave) │ │ │ │ (Slave) │ │
│ │ Node 2 │ │ │ │ Node 4 │ │
│ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘
│ │
└──────────────┬───────────────────┘
│
▼
┌────────────────┐
│ ZooKeeper │
│ Cluster │
│ (3-5 nodes) │
└────────────────┘
9.2 分布式表操作
java
import java.sql.Connection;
import java.sql.Timestamp;
/**
* 分布式集群操作
*/
public class DistributedClusterOps {
/**
* 创建分布式表
*/
public static void createDistributedTable(Connection conn)
throws Exception {
// 1. 在每个节点创建本地表
String localTableSql =
"CREATE TABLE IF NOT EXISTS orders_local ON CLUSTER my_cluster (\n" +
" order_id UInt64,\n" +
" user_id UInt64,\n" +
" product_id UInt64,\n" +
" amount Decimal(18,2),\n" +
" order_time DateTime,\n" +
" status String\n" +
") ENGINE = ReplicatedMergeTree(\n" +
" '/clickhouse/tables/{shard}/orders_local',\n" +
" '{replica}'\n" +
")\n" +
"PARTITION BY toYYYYMM(order_time)\n" +
"ORDER BY (user_id, order_time)";
// 2. 创建分布式表
String distributedTableSql =
"CREATE TABLE IF NOT EXISTS orders_all ON CLUSTER my_cluster\n" +
"AS orders_local\n" +
"ENGINE = Distributed(\n" +
" my_cluster, -- 集群名称\n" +
" default, -- 数据库名\n" +
" orders_local, -- 本地表名\n" +
" rand() -- 分片键\n" +
")";
try (var stmt = conn.createStatement()) {
stmt.execute(localTableSql);
stmt.execute(distributedTableSql);
System.out.println("分布式表创建成功!");
}
}
/**
* 写入分布式表
*/
public static void insertToDistributedTable(Connection conn)
throws Exception {
// 写入分布式表,自动分片
String sql =
"INSERT INTO orders_all " +
"(order_id, user_id, product_id, amount, order_time, status) " +
"VALUES (?, ?, ?, ?, ?, ?)";
try (var pstmt = conn.prepareStatement(sql)) {
for (int i = 0; i < 1000; i++) {
pstmt.setLong(1, i);
pstmt.setLong(2, i % 100);
pstmt.setLong(3, i % 50);
pstmt.setBigDecimal(4,
new java.math.BigDecimal(String.valueOf(100 + i)));
pstmt.setTimestamp(5,
new Timestamp(System.currentTimeMillis()));
pstmt.setString(6, "SUCCESS");
pstmt.addBatch();
}
pstmt.executeBatch();
System.out.println("批量写入分布式表成功!");
}
}
/**
* 查询分布式表
*/
public static void queryDistributedTable(Connection conn)
throws Exception {
// 查询分布式表,自动聚合所有分片数据
String sql =
"SELECT\n" +
" toDate(order_time) as date,\n" +
" count() as order_count,\n" +
" sum(amount) as total_amount,\n" +
" uniq(user_id) as unique_users\n" +
"FROM orders_all\n" +
"WHERE order_time >= now() - INTERVAL 7 DAY\n" +
"GROUP BY date\n" +
"ORDER BY date DESC";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n分布式查询结果:");
while (rs.next()) {
System.out.printf(
"日期: %s, 订单数: %d, 总金额: %.2f, UV: %d\n",
rs.getDate("date"),
rs.getLong("order_count"),
rs.getDouble("total_amount"),
rs.getLong("unique_users")
);
}
}
}
/**
* 集群健康检查
*/
public static void checkClusterHealth(Connection conn)
throws Exception {
String sql =
"SELECT\n" +
" cluster,\n" +
" shard_num,\n" +
" replica_num,\n" +
" host_name,\n" +
" port,\n" +
" is_local\n" +
"FROM system.clusters\n" +
"WHERE cluster = 'my_cluster'\n" +
"ORDER BY shard_num, replica_num";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n集群状态:");
while (rs.next()) {
System.out.printf(
"分片: %d, 副本: %d, 主机: %s:%d, 本地: %s\n",
rs.getInt("shard_num"),
rs.getInt("replica_num"),
rs.getString("host_name"),
rs.getInt("port"),
rs.getBoolean("is_local") ? "是" : "否"
);
}
}
}
}
10. 实时数据分析场景
10.1 用户留存分析
java
import java.sql.Connection;
/**
* 用户留存分析
*/
public class RetentionAnalysis {
/**
* 计算 N 日留存率
*/
public static void calculateRetention(Connection conn, int days)
throws Exception {
String sql =
    "SELECT\n" +
    "    first_date,\n" +
    "    dateDiff('day', first_date, event_date) as retention_day,\n" +
    "    uniq(user_id) as retained_users,\n" +
    "    any(cohort_size) as total_users,\n" +
    "    round(uniq(user_id) * 100.0 / any(cohort_size), 2) as retention_rate\n" +
    "FROM user_events AS e\n" +
    "INNER JOIN\n" +
    "(\n" +
    "    SELECT\n" +
    "        user_id,\n" +
    "        first_date,\n" +
    "        count() OVER (PARTITION BY first_date) as cohort_size\n" +
    "    FROM\n" +
    "    (\n" +
    "        SELECT user_id, min(event_date) as first_date\n" +
    "        FROM user_events\n" +
    "        WHERE event_date >= today() - ?\n" +
    "        GROUP BY user_id\n" +
    "    )\n" +
    ") AS c USING (user_id)\n" +
    "WHERE e.event_date >= today() - ?\n" +
    "  AND dateDiff('day', first_date, e.event_date) <= ?\n" +
    "GROUP BY first_date, retention_day\n" +
    "ORDER BY first_date DESC, retention_day";
try (var pstmt = conn.prepareStatement(sql)) {
    // 三个参数: 两处为新用户统计窗口(天数),一处为最大留存天数
    pstmt.setInt(1, days + 7);
    pstmt.setInt(2, days + 7);
    pstmt.setInt(3, days);
try (var rs = pstmt.executeQuery()) {
System.out.println("\n用户留存分析:");
System.out.println("─────────────────────────────────────────");
System.out.printf("%-12s %-8s %-12s %-12s %-10s\n",
"首次日期", "留存天", "留存用户", "总用户", "留存率%");
System.out.println("─────────────────────────────────────────");
while (rs.next()) {
System.out.printf("%-12s %-8d %-12d %-12d %-10.2f\n",
rs.getDate("first_date"),
rs.getInt("retention_day"),
rs.getLong("retained_users"),
rs.getLong("total_users"),
rs.getDouble("retention_rate")
);
}
}
}
}
}
10.2 漏斗分析
java
import java.sql.Connection;
/**
* 转化漏斗分析
*/
public class FunnelAnalysis {
/**
* 分析用户转化路径
* 路径: 浏览商品 -> 加入购物车 -> 下单 -> 支付
*/
public static void analyzeFunnel(Connection conn) throws Exception {
String sql =
    // ClickHouse 中 UNION ALL 之后的 ORDER BY 只作用于最后一段子查询,
    // 因此外层再包一层 SELECT 对整体结果排序
    "SELECT step, step_num, users, conversion_rate FROM\n" +
    "(\n" +
    "SELECT\n" +
" '浏览商品' as step,\n" +
" 1 as step_num,\n" +
" view_users as users,\n" +
" 100.0 as conversion_rate\n" +
"FROM (\n" +
" SELECT uniq(user_id) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'view'\n" +
" AND event_date = today()\n" +
")\n" +
"UNION ALL\n" +
"SELECT\n" +
" '加入购物车' as step,\n" +
" 2 as step_num,\n" +
" cart_users as users,\n" +
" round(cart_users * 100.0 / view_users, 2) as conversion_rate\n" +
"FROM (\n" +
" SELECT\n" +
" uniq(user_id) as cart_users,\n" +
" (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'add_to_cart'\n" +
" AND event_date = today()\n" +
")\n" +
"UNION ALL\n" +
"SELECT\n" +
" '下单' as step,\n" +
" 3 as step_num,\n" +
" order_users as users,\n" +
" round(order_users * 100.0 / view_users, 2) as conversion_rate\n" +
"FROM (\n" +
" SELECT\n" +
" uniq(user_id) as order_users,\n" +
" (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'order'\n" +
" AND event_date = today()\n" +
")\n" +
"UNION ALL\n" +
"SELECT\n" +
" '支付' as step,\n" +
" 4 as step_num,\n" +
" pay_users as users,\n" +
" round(pay_users * 100.0 / view_users, 2) as conversion_rate\n" +
"FROM (\n" +
" SELECT\n" +
" uniq(user_id) as pay_users,\n" +
" (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
" FROM user_events\n" +
" WHERE event_type = 'payment'\n" +
" AND event_date = today()\n" +
")\n" +
"ORDER BY step_num";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n转化漏斗分析:");
System.out.println("─────────────────────────────────────");
System.out.printf("%-15s %-12s %-12s\n",
"步骤", "用户数", "转化率%");
System.out.println("─────────────────────────────────────");
while (rs.next()) {
String step = rs.getString("step");
long users = rs.getLong("users");
double rate = rs.getDouble("conversion_rate");
// ASCII 进度条
int barLength = (int)(rate / 5);
String bar = "█".repeat(barLength);
System.out.printf("%-15s %-12d %6.2f%% %s\n",
step, users, rate, bar);
}
}
}
}
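上面的 UNION ALL 写法直观,但同一份数据被扫描了四次,且没有约束各步骤的先后顺序。ClickHouse 内置的 windowFunnel 函数可以在一次扫描内统计每个用户在时间窗口中按顺序最深到达的步骤,下面是一个示意写法(事件名沿用本文示例,窗口取 1 小时):
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 使用 windowFunnel 做有序漏斗分析(示意)
 */
public class WindowFunnelExample {
    public static void analyze(Connection conn) throws Exception {
        String sql =
            "SELECT level, count() as users\n" +
            "FROM\n" +
            "(\n" +
            "    SELECT\n" +
            "        user_id,\n" +
            "        windowFunnel(3600)(\n" +
            "            event_time,\n" +
            "            event_type = 'view',\n" +
            "            event_type = 'add_to_cart',\n" +
            "            event_type = 'order',\n" +
            "            event_type = 'payment'\n" +
            "        ) as level\n" +
            "    FROM user_events\n" +
            "    WHERE event_date = today()\n" +
            "    GROUP BY user_id\n" +
            ")\n" +
            "GROUP BY level\n" +
            "ORDER BY level";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // level = N 表示该用户按顺序完成了前 N 步
                System.out.printf("到达步骤 %d 的用户数: %d%n",
                    rs.getInt("level"), rs.getLong("users"));
            }
        }
    }
}
若要得到"至少到达第 N 步"的人数,再对 level >= N 的行求和即可。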
11. 性能优化最佳实践
11.1 数据类型选择
数据类型优化建议
1. 整数类型 - 选择合适的范围
┌──────────────┬────────────┬─────────────────┐
│ Type │ Size │ Range │
├──────────────┼────────────┼─────────────────┤
│ UInt8 │ 1 byte │ 0 ~ 255 │
│ UInt16 │ 2 bytes │ 0 ~ 65535 │
│ UInt32 │ 4 bytes │ 0 ~ 4B │
│ UInt64 │ 8 bytes │ 0 ~ 18Q │
└──────────────┴────────────┴─────────────────┘
2. 字符串类型
- String: 可变长度,适合短字符串
- FixedString(N): 固定长度,适合MD5、UUID
- LowCardinality(String): 低基数字符串,节省空间
3. 时间类型
- Date: 日期(2字节)
- DateTime: 日期时间(4字节)
- DateTime64: 高精度时间(8字节)
java
import java.sql.Connection;
/**
* 性能优化实践
*/
public class PerformanceOptimization {
/**
* 优化1: 使用 LowCardinality
*/
public static void useLowCardinality(Connection conn)
throws Exception {
// 不好的设计
String badSql =
"CREATE TABLE events_bad (\n" +
" event_time DateTime,\n" +
" event_type String, -- 可能只有10种类型\n" +
" device_type String, -- 可能只有5种设备\n" +
" country String -- 可能只有200个国家\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY event_time";
// 好的设计 - 使用 LowCardinality
String goodSql =
"CREATE TABLE events_good (\n" +
" event_time DateTime,\n" +
" event_type LowCardinality(String),\n" +
" device_type LowCardinality(String),\n" +
" country LowCardinality(String)\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY event_time";
// LowCardinality 可节省50-90%存储空间
}
/**
* 优化2: 合理设置 ORDER BY
*/
public static void optimizeOrderBy(Connection conn)
throws Exception {
// 根据查询模式设置
// 如果经常按 user_id 查询
String sql1 =
"CREATE TABLE user_events (\n" +
" user_id UInt64,\n" +
" event_time DateTime,\n" +
" event_type String\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY (user_id, event_time)"; // user_id 在前
// 如果经常按时间范围查询
String sql2 =
"CREATE TABLE time_series_data (\n" +
" timestamp DateTime,\n" +
" metric_name String,\n" +
" value Float64\n" +
") ENGINE = MergeTree()\n" +
"ORDER BY (timestamp, metric_name)"; // timestamp 在前
}
/**
* 优化3: 使用分区提升查询性能
*/
public static void usePartitioning(Connection conn)
throws Exception {
String sql =
"CREATE TABLE access_logs (\n" +
" log_time DateTime,\n" +
" url String,\n" +
" status_code UInt16\n" +
") ENGINE = MergeTree()\n" +
"PARTITION BY toYYYYMMDD(log_time)\n" + // 按天分区
"ORDER BY log_time\n" +
"TTL log_time + INTERVAL 30 DAY"; // 30天后自动删除
// 好处:
// 1. 查询时只扫描相关分区
// 2. 可以按分区删除数据
// 3. TTL 可以自动清理旧数据
}
/**
* 优化4: 批量操作
*/
public static void batchOperations() {
// 不好: 单条插入
// for (Data d : dataList) {
// INSERT INTO table VALUES (d);
// }
// 好: 批量插入
// INSERT INTO table VALUES (d1), (d2), (d3), ...
// 更好: 使用 CSV 格式
// INSERT INTO table FORMAT CSV
System.out.println("批量操作性能提升 10-100 倍");
}
}
11.2 查询优化检查清单
查询优化检查清单
☑ 1. 索引使用
□ WHERE 条件是否使用了 ORDER BY 列?
□ 是否使用了跳数索引?
□ 是否使用了 PREWHERE?
☑ 2. 分区裁剪
□ 查询是否限制了分区键范围?
□ 避免全表扫描
☑ 3. 数据类型
□ 使用合适大小的整数类型
□ 低基数字符串使用 LowCardinality
□ 避免使用 Nullable (性能损失20-30%)
☑ 4. 聚合优化
□ 使用近似函数 (uniq 代替 count distinct)
□ 使用物化视图预聚合
□ GROUP BY 列顺序与 ORDER BY 一致
☑ 5. JOIN 优化
□ 小表在右侧
□ 使用字典表代替 JOIN (见清单后的示意代码)
□ 避免大表 JOIN 大表
☑ 6. 并行度
□ max_threads 设置合理
□ 分区数量适中
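针对清单中"使用字典表代替 JOIN"一项,下面给出一个示意:假设存在维表 products(product_id, product_name, category),该表不在本文示例之内;先把它定义为字典,查询时用 dictGet 在内存中查找维度属性,避免与大事实表 JOIN:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 用字典 (dictGet) 代替 JOIN 的示意代码
 */
public class DictionaryExample {
    /**
     * 把维表 products 定义为字典 (连接参数按实际环境调整)
     */
    public static void createDictionary(Connection conn) throws Exception {
        String ddl =
            "CREATE DICTIONARY IF NOT EXISTS product_dict (\n" +
            "    product_id UInt64,\n" +
            "    product_name String,\n" +
            "    category String\n" +
            ")\n" +
            "PRIMARY KEY product_id\n" +
            "SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD '' DB 'default' TABLE 'products'))\n" +
            "LIFETIME(MIN 300 MAX 600)\n" +
            "LAYOUT(HASHED())";
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
        }
    }
    /**
     * 查询时用 dictGet 取维度属性,无需 JOIN 维表
     */
    public static void queryWithDict(Connection conn) throws Exception {
        String sql =
            "SELECT\n" +
            "    product_id,\n" +
            "    dictGet('product_dict', 'product_name', product_id) as product_name,\n" +
            "    sum(amount) as sale_amount\n" +
            "FROM orders\n" +
            "WHERE order_date >= today() - 7\n" +
            "GROUP BY product_id\n" +
            "ORDER BY sale_amount DESC\n" +
            "LIMIT 10";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%d %s %.2f%n",
                    rs.getLong("product_id"),
                    rs.getString("product_name"),
                    rs.getDouble("sale_amount"));
            }
        }
    }
}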
12. 监控与运维
12.1 系统表监控
java
import java.sql.Connection;
/**
* ClickHouse 监控
*/
public class ClickHouseMonitoring {
/**
* 监控查询性能
*/
public static void monitorQueries(Connection conn) throws Exception {
String sql =
"SELECT\n" +
" user,\n" +
" query_id,\n" +
" query_duration_ms,\n" +
" read_rows,\n" +
" read_bytes,\n" +
" memory_usage,\n" +
" query\n" +
"FROM system.query_log\n" +
"WHERE type = 'QueryFinish'\n" +
"AND event_time >= now() - INTERVAL 1 HOUR\n" +
"ORDER BY query_duration_ms DESC\n" +
"LIMIT 10";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n慢查询Top 10:");
while (rs.next()) {
System.out.printf(
"耗时: %dms, 读取行数: %d, 内存: %dMB\nSQL: %s\n\n",
rs.getLong("query_duration_ms"),
rs.getLong("read_rows"),
rs.getLong("memory_usage") / 1024 / 1024,
rs.getString("query").substring(0,
Math.min(100, rs.getString("query").length()))
);
}
}
}
/**
* 监控表大小
*/
public static void monitorTableSize(Connection conn) throws Exception {
String sql =
"SELECT\n" +
" database,\n" +
" table,\n" +
" formatReadableSize(sum(bytes)) as size,\n" +
" sum(rows) as rows,\n" +
" sum(bytes) as bytes_size\n" +
"FROM system.parts\n" +
"WHERE active\n" +
"GROUP BY database, table\n" +
"ORDER BY bytes_size DESC\n" +
"LIMIT 20";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n表大小统计:");
System.out.println("─────────────────────────────────────────");
while (rs.next()) {
System.out.printf("%-20s.%-20s %15s %15d行\n",
rs.getString("database"),
rs.getString("table"),
rs.getString("size"),
rs.getLong("rows")
);
}
}
}
/**
* 监控副本同步状态
*/
public static void monitorReplication(Connection conn)
throws Exception {
String sql =
"SELECT\n" +
" database,\n" +
" table,\n" +
" is_leader,\n" +
" is_readonly,\n" +
" absolute_delay,\n" +
" queue_size,\n" +
" inserts_in_queue\n" +
"FROM system.replicas";
try (var stmt = conn.createStatement();
var rs = stmt.executeQuery(sql)) {
System.out.println("\n副本同步状态:");
while (rs.next()) {
System.out.printf(
"表: %s.%s, Leader: %s, 延迟: %ds, 队列: %d\n",
rs.getString("database"),
rs.getString("table"),
rs.getBoolean("is_leader") ? "是" : "否",
rs.getLong("absolute_delay"),
rs.getLong("queue_size")
);
}
}
}
}
12.2 告警指标
关键告警指标
1. 查询性能
- 慢查询数量 > 阈值
- 平均查询时间 > 阈值
- 查询错误率 > 1%
2. 存储
- 磁盘使用率 > 80%
- 单表大小 > 阈值
- 分区数量 > 1000
3. 副本
- 副本延迟 > 60s
- 副本队列 > 1000
- 副本故障
4. 系统资源
- CPU 使用率 > 80%
- 内存使用率 > 85%
- 网络带宽 > 80%
5. Merge 操作
- Merge 队列 > 100
- Merge 失败次数 > 0
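这些指标大多可以直接从系统表查到。下面是一段示意代码,演示其中几项(慢查询数、磁盘使用率、副本延迟)的采集方式,阈值判断与对接告警系统的部分省略:
java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
/**
 * 从系统表采集告警指标(示意)
 */
public class AlertMetricsCollector {
    public static void collect(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // 1. 最近 5 分钟慢查询数量 (阈值按业务调整)
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count() FROM system.query_log " +
                    "WHERE type = 'QueryFinish' " +
                    "AND event_time >= now() - INTERVAL 5 MINUTE " +
                    "AND query_duration_ms > 10000")) {
                if (rs.next()) System.out.println("慢查询数: " + rs.getLong(1));
            }
            // 2. 各磁盘使用率
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, " +
                    "round((total_space - free_space) * 100.0 / total_space, 2) as used_pct " +
                    "FROM system.disks")) {
                while (rs.next()) {
                    System.out.printf("磁盘 %s 使用率: %.2f%%%n",
                        rs.getString("name"), rs.getDouble("used_pct"));
                }
            }
            // 3. 副本最大延迟 (秒)
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT max(absolute_delay) FROM system.replicas")) {
                if (rs.next()) System.out.println("最大副本延迟: " + rs.getLong(1) + "s");
            }
        }
    }
}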
13. 总结
13.1 ClickHouse 核心优势
ClickHouse 核心优势总结
1. 性能 ★★★★★
- 典型 OLAP 查询比行式数据库快 100-1000 倍
- 支持实时写入和查询
- 列式存储 + 向量化执行
2. 压缩 ★★★★★
- 压缩比 10:1 甚至更高
- 节省存储成本
- 减少 I/O 操作
3. 扩展性 ★★★★★
- 水平扩展
- 线性性能增长
- 支持 PB 级数据
4. 易用性 ★★★★☆
- 标准 SQL 支持
- 丰富的函数库
- 多语言客户端
5. 可靠性 ★★★★☆
- 数据副本
- 自动故障转移
- 数据一致性保证
13.2 适用场景
最适合:
- 用户行为分析
- 实时报表和大屏
- 日志分析和监控
- 时序数据分析
- 数据仓库 OLAP
不适合:
- OLTP 事务处理
- 频繁更新/删除
- 需要强一致性的场景
- 行级别锁定
13.3 最佳实践总结
-
表设计
- 合理选择表引擎
- 优化 ORDER BY 列
- 使用分区管理数据
- 设置 TTL 自动清理
-
数据写入
- 批量写入
- 异步写入
- 使用 CSV 格式
- 控制写入频率
-
查询优化
- 使用主键过滤
- 使用 PREWHERE
- 避免 SELECT *
- 使用物化视图
-
运维管理
- 定期监控性能
- 及时清理过期数据
- 备份关键数据
- 升级到稳定版本
13.4 学习资源
- 官方文档: https://clickhouse.com/docs
- GitHub: https://github.com/ClickHouse/ClickHouse
- 中文社区: https://clickhouse.com/docs/zh
- 性能测试: https://benchmark.clickhouse.com
13.5 未来发展
ClickHouse 正在持续演进:
- 更强大的 SQL 支持: 窗口函数、递归查询
- 更好的实时性: 毫秒级延迟
- 云原生: Kubernetes 集成
- 机器学习: 内置 ML 功能
- 多模型: 支持图数据库、文档数据库
附录: 常用命令
bash
# 启动 ClickHouse
clickhouse-server --config-file=/etc/clickhouse-server/config.xml
# 客户端连接
clickhouse-client --host localhost --port 9000
# 导入数据
clickhouse-client --query="INSERT INTO table FORMAT CSV" < data.csv
# 导出数据
clickhouse-client --query="SELECT * FROM table FORMAT CSV" > data.csv
# 查看表结构 (以下 SQL 语句均在 clickhouse-client 中执行)
DESCRIBE TABLE table_name;
SHOW CREATE TABLE table_name;
# 优化表
OPTIMIZE TABLE table_name FINAL;
# 查看分区
SELECT partition, name, rows FROM system.parts WHERE table = 'table_name';
# 删除分区
ALTER TABLE table_name DROP PARTITION 'partition_id';