
ClickHouse in Practice: A Hands-On Guide to a High-Performance Columnar Database

Table of Contents

  • 1. Introduction to ClickHouse
  • 2. ClickHouse Core Architecture
  • 3. Environment Setup and Configuration
  • 4. Table Engines in Depth
  • 5. Java Client Integration
  • 6. Data Ingestion in Practice
  • 7. Query Optimization Techniques
  • 8. Materialized Views and Aggregation
  • 9. Distributed Cluster Practice
  • 10. Real-Time Data Analysis Scenarios
  • 11. Performance Optimization Best Practices
  • 12. Monitoring and Operations
  • 13. Summary

1. Introduction to ClickHouse

ClickHouse is an open-source columnar database management system for online analytical processing (OLAP), originally developed at Yandex, Russia's largest search engine. It generates analytical reports in real time, keeping sub-second query latency even across billions of rows.

1.1 Core Features

  • True columnar storage: data is stored column by column, which dramatically speeds up analytical queries
  • Data compression: ratios of 10:1 or better are common (see the sketch after this list for how to measure your own tables)
  • Vectorized execution: SIMD instructions accelerate computation
  • Distributed queries: built-in sharding and replication
  • Real-time ingestion: capable of ingesting millions of rows per second
  • SQL support: standard SQL plus ClickHouse extensions
  • High availability: data replicas and failover
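
The compression figure is easy to verify against your own tables: system.columns reports compressed and uncompressed byte counts per column. A minimal sketch, assuming only a JDBC Connection (set up in section 3) and an existing table name:

java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Prints per-column compression ratios for a table, read from the
 * system.columns metadata table. greatest(..., 1) guards against
 * division by zero on empty tables.
 */
public class CompressionRatioCheck {
    public static void report(Connection conn, String table) throws Exception {
        String sql =
            "SELECT name, data_compressed_bytes, data_uncompressed_bytes, " +
            "       round(data_uncompressed_bytes " +
            "             / greatest(data_compressed_bytes, 1), 2) AS ratio " +
            "FROM system.columns " +
            // demo only: don't concatenate untrusted input into SQL
            "WHERE database = currentDatabase() AND table = '" + table + "' " +
            "ORDER BY ratio DESC";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%-20s %,12d -> %,12d bytes (%.2fx)%n",
                    rs.getString("name"),
                    rs.getLong("data_uncompressed_bytes"),
                    rs.getLong("data_compressed_bytes"),
                    rs.getDouble("ratio"));
            }
        }
    }
}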

1.2 ClickHouse vs. Traditional Databases

Illustrative performance comparison (aggregate query over 1 billion rows)

ClickHouse    █ 0.5s
PostgreSQL    ████████████████████████ 120s
MySQL         ████████████████████████████ 150s
Elasticsearch ████████ 40s

Use-case comparison
┌─────────────────┬──────────────┬──────────────┬──────────────┐
│   Feature       │ ClickHouse   │   MySQL      │  PostgreSQL  │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ OLAP queries    │   ★★★★★      │   ★★☆☆☆      │   ★★★☆☆      │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ OLTP workloads  │   ★☆☆☆☆      │   ★★★★★      │   ★★★★★      │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Compression     │   ★★★★★      │   ★★☆☆☆      │   ★★★☆☆      │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Write throughput│   ★★★★★      │   ★★★☆☆      │   ★★★☆☆      │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Distributed     │   ★★★★★      │   ★★☆☆☆      │   ★★☆☆☆      │
└─────────────────┴──────────────┴──────────────┴──────────────┘

1.3 Typical Use Cases

  • User behavior analytics: website clickstream and app usage analysis
  • Real-time reporting: live computation and display of business metrics
  • Log analytics: storage and querying of massive log volumes
  • Time-series data: IoT device and system-metrics monitoring
  • Data warehousing: a replacement for traditional OLAP engines

2. ClickHouse Core Architecture

2.1 Overall Architecture

ClickHouse architecture

Client Layer
┌─────────────────────────────────────────────────────────┐
│  JDBC Client  │  HTTP Client  │  CLI Client  │  Grafana │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
Query Processing Layer
┌─────────────────────────────────────────────────────────┐
│                    Query Parser                          │
│                         ↓                                │
│                  Query Optimizer                         │
│                         ↓                                │
│                  Execution Engine                        │
│              (Vectorized Processing)                     │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
Storage Layer
┌─────────────────────────────────────────────────────────┐
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ Table Engine │  │ Table Engine │  │ Table Engine │  │
│  │  MergeTree   │  │  Log Family  │  │ Distributed  │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  │
│         │                  │                  │          │
│         └──────────────────┴──────────────────┘          │
│                         ↓                                │
│              Column-Oriented Storage                     │
│                         ↓                                │
│              Compressed Data Files                       │
│              (.bin, .mrk, .idx)                          │
└──────────────────────────────────────────────────────────┘

2.2 How Columnar Storage Works

Row-oriented vs. column-oriented storage

Row-oriented storage (MySQL, PostgreSQL):
Row 1: [ID=1, Name=Alice, Age=25, City=Beijing]
Row 2: [ID=2, Name=Bob,   Age=30, City=Shanghai]
Row 3: [ID=3, Name=Carol, Age=28, City=Beijing]

Query: SELECT AVG(Age) FROM users WHERE City='Beijing'
Must read: every column of every row

Column-oriented storage (ClickHouse):
ID    Column: [1, 2, 3]
Name  Column: [Alice, Bob, Carol]
Age   Column: [25, 30, 28]
City  Column: [Beijing, Shanghai, Beijing]

Query: SELECT AVG(Age) FROM users WHERE City='Beijing'
Must read: only the Age and City columns

Advantages:
1. Only the columns a query touches are read, reducing I/O
2. Values in a column share one type, so compression ratios are high
3. Vectorized execution is CPU-cache friendly
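
This I/O difference is directly observable: system.query_log records, per query, the bytes read and (in recent ClickHouse versions) the set of columns touched; the log table is flushed asynchronously, hence the explicit SYSTEM FLUSH LOGS. A sketch, assuming the user_events table created in section 4 and a server recent enough to support the log_comment setting:

java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Demonstrates column pruning: runs a wide query and a single-column
 * aggregate, then compares bytes read and columns touched via
 * system.query_log. log_comment tags let us find our own queries.
 */
public class ColumnPruningDemo {

    private static void runAndDiscard(Statement stmt, String sql)
            throws Exception {
        try (ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) { /* discard */ }
        }
    }

    public static void run(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            runAndDiscard(stmt, "SELECT * FROM user_events LIMIT 10 " +
                "SETTINGS log_comment = 'demo_all_columns'");
            runAndDiscard(stmt, "SELECT avg(duration) FROM user_events " +
                "SETTINGS log_comment = 'demo_one_column'");

            // query_log is written asynchronously; force it for the demo.
            stmt.execute("SYSTEM FLUSH LOGS");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT log_comment, read_bytes, " +
                    "       arrayStringConcat(columns, ', ') AS cols " +
                    "FROM system.query_log " +
                    "WHERE type = 'QueryFinish' AND event_date = today() " +
                    "  AND log_comment LIKE 'demo_%' " +
                    "ORDER BY event_time DESC LIMIT 2")) {
                while (rs.next()) {
                    System.out.printf("%s: %,d bytes, columns: %s%n",
                        rs.getString(1), rs.getLong(2), rs.getString(3));
                }
            }
        }
    }
}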

2.3 Data Sharding and Replication

Distributed layout (3 nodes, 2 shards, 2 replicas)

Shard 1                          Shard 2
┌──────────────────┐            ┌──────────────────┐
│   Replica 1      │            │   Replica 1      │
│   Node 1         │            │   Node 2         │
│   Data: A-M      │            │   Data: N-Z      │
└──────────────────┘            └──────────────────┘
        │                                │
        │                                │
        ▼                                ▼
┌──────────────────┐            ┌──────────────────┐
│   Replica 2      │            │   Replica 2      │
│   Node 3         │            │   Node 1         │
│   Data: A-M      │            │   Data: N-Z      │
└──────────────────┘            └──────────────────┘

ZooKeeper Cluster
┌──────────────────────────────────┐
│  Coordination & Metadata         │
│  - Replica sync                  │
│  - Leader election               │
│  - Schema management             │
└──────────────────────────────────┘
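
Replica health is observable from SQL: system.replicas reports, for every Replicated* table, the replication queue size and how far the replica lags behind its peers. A minimal Java sketch (the result set is empty until replicated tables exist, see section 9):

java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Prints basic replication health from system.replicas: queue size and
 * absolute delay indicate how far this replica lags behind its peers.
 */
public class ReplicaHealthCheck {
    public static void report(Connection conn) throws Exception {
        String sql =
            "SELECT database, table, is_leader, is_readonly, " +
            "       queue_size, absolute_delay, " +
            "       active_replicas, total_replicas " +
            "FROM system.replicas " +
            "ORDER BY absolute_delay DESC";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf(
                    "%s.%s leader=%b readonly=%b queue=%d delay=%ds replicas=%d/%d%n",
                    rs.getString("database"), rs.getString("table"),
                    rs.getBoolean("is_leader"), rs.getBoolean("is_readonly"),
                    rs.getLong("queue_size"), rs.getLong("absolute_delay"),
                    rs.getLong("active_replicas"), rs.getLong("total_replicas"));
            }
        }
    }
}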

3. Environment Setup and Configuration

3.1 Quick Start with Docker

bash
# 1. Pull the image
docker pull clickhouse/clickhouse-server

# 2. Start ClickHouse
docker run -d \
  --name clickhouse-server \
  -p 8123:8123 \
  -p 9000:9000 \
  --ulimit nofile=262144:262144 \
  clickhouse/clickhouse-server

# 3. Open the client
docker exec -it clickhouse-server clickhouse-client

# 4. Test the connection
SELECT version();
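
Port 8123 is the HTTP interface and 9000 the native TCP protocol. As a driver-free sanity check, ClickHouse's built-in /ping endpoint (which answers "Ok.") can be hit directly; a minimal sketch using java.net.http (Java 11+, matching the pom below), assuming the container above is running locally:

java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Liveness check against the ClickHouse HTTP interface (port 8123).
 * The /ping endpoint returns the body "Ok." when the server is up.
 */
public class ClickHousePing {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8123/ping"))
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode()
            + ": " + response.body().trim());
    }
}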

3.2 Maven Dependencies

xml
<!-- pom.xml -->
<properties>
    <clickhouse.version>0.5.0</clickhouse.version>
    <java.version>11</java.version>
</properties>

<dependencies>
    <!-- ClickHouse JDBC driver -->
    <dependency>
        <groupId>com.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>${clickhouse.version}</version>
        <classifier>all</classifier>
    </dependency>

    <!-- HTTP client (optional) -->
    <dependency>
        <groupId>com.clickhouse</groupId>
        <artifactId>clickhouse-http-client</artifactId>
        <version>${clickhouse.version}</version>
    </dependency>

    <!-- Connection pool -->
    <dependency>
        <groupId>com.zaxxer</groupId>
        <artifactId>HikariCP</artifactId>
        <version>5.0.1</version>
    </dependency>

    <!-- Logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>2.0.9</version>
    </dependency>
</dependencies>

3.3 Java Connection Setup

java
import com.clickhouse.jdbc.ClickHouseDataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.SQLException;
import java.util.Properties;

/**
 * ClickHouse connection configuration
 */
public class ClickHouseConfig {

    private static final String JDBC_URL =
        "jdbc:clickhouse://localhost:8123/default";
    private static final String USERNAME = "default";
    private static final String PASSWORD = "";

    /**
     * Option 1: direct connection
     */
    public static Connection getConnection() throws SQLException {
        Properties properties = new Properties();
        properties.setProperty("user", USERNAME);
        properties.setProperty("password", PASSWORD);

        // Connection parameters
        properties.setProperty("socket_timeout", "300000");
        properties.setProperty("max_execution_time", "300");
        properties.setProperty("max_insert_block_size", "1048576");

        ClickHouseDataSource dataSource =
            new ClickHouseDataSource(JDBC_URL, properties);

        return dataSource.getConnection();
    }

    /**
     * Option 2: connection pool (recommended)
     */
    public static HikariDataSource getDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(JDBC_URL);
        config.setUsername(USERNAME);
        config.setPassword(PASSWORD);

        // Pool settings
        config.setMaximumPoolSize(10);
        config.setMinimumIdle(2);
        config.setConnectionTimeout(30000);
        config.setIdleTimeout(600000);
        config.setMaxLifetime(1800000);

        // ClickHouse-specific settings
        config.addDataSourceProperty("socket_timeout", "300000");
        config.addDataSourceProperty("compress", "true");
        config.addDataSourceProperty("max_insert_block_size", "1048576");

        return new HikariDataSource(config);
    }

    /**
     * Test the connection
     */
    public static void testConnection() {
        try (Connection conn = getConnection()) {
            var stmt = conn.createStatement();
            var rs = stmt.executeQuery("SELECT version()");

            if (rs.next()) {
                System.out.println("ClickHouse Version: " + rs.getString(1));
            }

            System.out.println("连接成功!");
        } catch (SQLException e) {
            System.err.println("连接失败: " + e.getMessage());
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        testConnection();
    }
}

4. Table Engines in Depth

4.1 The MergeTree Family

The most powerful family of ClickHouse table engines, with support for data partitioning, indexing, replication, and more.

The MergeTree family

MergeTree (base)
    ├─ ReplacingMergeTree (deduplication)
    ├─ SummingMergeTree (summation)
    ├─ AggregatingMergeTree (aggregation)
    ├─ CollapsingMergeTree (collapsing)
    └─ VersionedCollapsingMergeTree (versioned collapsing)

ReplicatedMergeTree (replicated)
    ├─ ReplicatedReplacingMergeTree
    ├─ ReplicatedSummingMergeTree
    └─ ReplicatedAggregatingMergeTree
java
import java.sql.Connection;
import java.sql.Statement;

/**
 * Table engine creation examples
 */
public class TableEngineExample {

    /**
     * 1. MergeTree - the most commonly used engine
     */
    public static void createMergeTreeTable(Connection conn)
            throws Exception {
        String sql =
            "CREATE TABLE IF NOT EXISTS user_events (\n" +
            "    event_date Date,\n" +
            "    event_time DateTime,\n" +
            "    user_id UInt64,\n" +
            "    event_type String,\n" +
            "    page_url String,\n" +
            "    device_type String,\n" +
            "    session_id String,\n" +
            "    duration UInt32\n" +
            ") ENGINE = MergeTree()\n" +
            "PARTITION BY toYYYYMM(event_date)\n" +
            "ORDER BY (user_id, event_time)\n" +
            "SETTINGS index_granularity = 8192";

        try (Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("MergeTree 表创建成功!");
        }
    }

    /**
     * 2. ReplacingMergeTree - automatic deduplication
     * Use case: a user-profile table keeping only the latest record per user
     */
    public static void createReplacingMergeTreeTable(Connection conn)
            throws Exception {
        String sql =
            "CREATE TABLE IF NOT EXISTS user_profiles (\n" +
            "    user_id UInt64,\n" +
            "    username String,\n" +
            "    email String,\n" +
            "    age UInt8,\n" +
            "    city String,\n" +
            "    update_time DateTime,\n" +
            "    version UInt64\n" +
            ") ENGINE = ReplacingMergeTree(version)\n" +
            "ORDER BY user_id\n" +
            "SETTINGS index_granularity = 8192";

        try (Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("ReplacingMergeTree 表创建成功!");
        }
    }

    /**
     * 3. SummingMergeTree - automatic summation
     * Use case: additive metrics such as page PV/UV counters
     */
    public static void createSummingMergeTreeTable(Connection conn)
            throws Exception {
        String sql =
            "CREATE TABLE IF NOT EXISTS page_statistics (\n" +
            "    stat_date Date,\n" +
            "    page_url String,\n" +
            "    pv UInt64,\n" +
            "    uv UInt64,\n" +
            "    bounce_count UInt64\n" +
            ") ENGINE = SummingMergeTree()\n" +
            "PARTITION BY toYYYYMM(stat_date)\n" +
            "ORDER BY (stat_date, page_url)\n" +
            "SETTINGS index_granularity = 8192";

        try (Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("SummingMergeTree 表创建成功!");
        }
    }

    /**
     * 4. AggregatingMergeTree - aggregate-function states
     * Use case: pre-aggregated statistics
     */
    public static void createAggregatingMergeTreeTable(Connection conn)
            throws Exception {
        String sql =
            "CREATE TABLE IF NOT EXISTS user_metrics_agg (\n" +
            "    stat_date Date,\n" +
            "    user_id UInt64,\n" +
            "    total_orders SimpleAggregateFunction(sum, UInt64),\n" +
            "    total_amount SimpleAggregateFunction(sum, Decimal(18,2)),\n" +
            "    avg_amount AggregateFunction(avg, Decimal(18,2)),\n" +
            "    unique_products AggregateFunction(uniq, UInt64)\n" +
            ") ENGINE = AggregatingMergeTree()\n" +
            "PARTITION BY toYYYYMM(stat_date)\n" +
            "ORDER BY (stat_date, user_id)\n" +
            "SETTINGS index_granularity = 8192";

        try (Statement stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("AggregatingMergeTree 表创建成功!");
        }
    }

    /**
     * 5. Distributed table
     * Use case: sharded tables in a cluster environment
     */
    public static void createDistributedTable(Connection conn)
            throws Exception {
        // Create the local table first
        String localTableSql =
            "CREATE TABLE IF NOT EXISTS orders_local (\n" +
            "    order_id UInt64,\n" +
            "    user_id UInt64,\n" +
            "    product_id UInt64,\n" +
            "    amount Decimal(18,2),\n" +
            "    order_time DateTime\n" +
            ") ENGINE = MergeTree()\n" +
            "PARTITION BY toYYYYMM(order_time)\n" +
            "ORDER BY (user_id, order_time)";

        // Create the distributed table
        String distributedTableSql =
            "CREATE TABLE IF NOT EXISTS orders_all AS orders_local\n" +
            "ENGINE = Distributed(cluster_name, default, orders_local, rand())";

        try (Statement stmt = conn.createStatement()) {
            stmt.execute(localTableSql);
            // stmt.execute(distributedTableSql); // requires a cluster
            System.out.println("Local table created (run the distributed DDL on a cluster)");
        }
    }
}

4.2 Choosing a Table Engine

Engine selection flow

Need replicas? ────────Yes───▶ the Replicated* family
    │
    No
    │
    ▼
Need deduplication? ───Yes───▶ ReplacingMergeTree
    │
    No
    │
    ▼
Need summation? ───────Yes───▶ SummingMergeTree
    │
    No
    │
    ▼
Need pre-aggregation? ─Yes───▶ AggregatingMergeTree
    │
    No
    │
    ▼
General case ──────────────▶ MergeTree
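
Deduplication in ReplacingMergeTree happens only when parts merge in the background, so a freshly inserted duplicate may still be visible. A small sketch, assuming the user_profiles table from 4.1, that forces a merge with OPTIMIZE ... FINAL to make the behavior observable (fine for demos, expensive on large production tables):

java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Shows that ReplacingMergeTree deduplicates at merge time, not insert
 * time. OPTIMIZE ... FINAL forces the merge so only the row with the
 * highest version survives.
 */
public class ReplacingMergeTreeDemo {

    public static void run(Connection conn) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Two versions of the same user_id; the higher version wins.
            stmt.execute("INSERT INTO user_profiles VALUES " +
                "(1001, 'alice', 'a@x.com', 25, 'Beijing', now(), 1)");
            stmt.execute("INSERT INTO user_profiles VALUES " +
                "(1001, 'alice', 'a@y.com', 26, 'Shanghai', now(), 2)");

            // Force the background merge that performs deduplication.
            stmt.execute("OPTIMIZE TABLE user_profiles FINAL");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, email, age, city, version " +
                    "FROM user_profiles WHERE user_id = 1001")) {
                while (rs.next()) {
                    System.out.printf("user=%d email=%s age=%d city=%s v=%d%n",
                        rs.getLong(1), rs.getString(2), rs.getInt(3),
                        rs.getString(4), rs.getLong(5));
                }
            }
        }
    }
}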

5. Java Client Integration

5.1 Basic JDBC Operations

java
import java.sql.*;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/**
 * Basic ClickHouse JDBC operations
 */
public class ClickHouseJDBCExample {

    /**
     * Insert data
     */
    public static void insertData(Connection conn) throws SQLException {
        String sql =
            "INSERT INTO user_events " +
            "(event_date, event_time, user_id, event_type, " +
            "page_url, device_type, session_id, duration) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";

        try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
            // Insert a single row
            pstmt.setDate(1, Date.valueOf("2024-01-15"));
            pstmt.setTimestamp(2, Timestamp.valueOf(LocalDateTime.now()));
            pstmt.setLong(3, 1001L);
            pstmt.setString(4, "page_view");
            pstmt.setString(5, "/home");
            pstmt.setString(6, "mobile");
            pstmt.setString(7, "session_12345");
            pstmt.setInt(8, 120);

            int rows = pstmt.executeUpdate();
            System.out.println("插入 " + rows + " 条数据");
        }
    }

    /**
     * Batch insert (recommended)
     */
    public static void batchInsert(Connection conn,
                                   List<UserEvent> events)
            throws SQLException {
        String sql =
            "INSERT INTO user_events " +
            "(event_date, event_time, user_id, event_type, " +
            "page_url, device_type, session_id, duration) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";

        try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
            for (UserEvent event : events) {
                pstmt.setDate(1, Date.valueOf(event.eventDate));
                pstmt.setTimestamp(2, Timestamp.valueOf(event.eventTime));
                pstmt.setLong(3, event.userId);
                pstmt.setString(4, event.eventType);
                pstmt.setString(5, event.pageUrl);
                pstmt.setString(6, event.deviceType);
                pstmt.setString(7, event.sessionId);
                pstmt.setInt(8, event.duration);

                pstmt.addBatch();
            }

            int[] results = pstmt.executeBatch();
            System.out.println("批量插入 " + results.length + " 条数据");
        }
    }

    /**
     * Query data
     */
    public static List<UserEvent> queryData(Connection conn,
                                           long userId)
            throws SQLException {
        String sql =
            "SELECT event_date, event_time, user_id, event_type, " +
            "       page_url, device_type, session_id, duration " +
            "FROM user_events " +
            "WHERE user_id = ? " +
            "ORDER BY event_time DESC " +
            "LIMIT 100";

        List<UserEvent> events = new ArrayList<>();

        try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
            pstmt.setLong(1, userId);

            try (ResultSet rs = pstmt.executeQuery()) {
                while (rs.next()) {
                    UserEvent event = new UserEvent();
                    event.eventDate = rs.getDate("event_date").toLocalDate();
                    event.eventTime = rs.getTimestamp("event_time").toLocalDateTime();
                    event.userId = rs.getLong("user_id");
                    event.eventType = rs.getString("event_type");
                    event.pageUrl = rs.getString("page_url");
                    event.deviceType = rs.getString("device_type");
                    event.sessionId = rs.getString("session_id");
                    event.duration = rs.getInt("duration");

                    events.add(event);
                }
            }
        }

        return events;
    }

    /**
     * Aggregate query
     */
    public static void aggregateQuery(Connection conn) throws SQLException {
        String sql =
            "SELECT " +
            "    event_date, " +
            "    event_type, " +
            "    device_type, " +
            "    count() as event_count, " +
            "    uniq(user_id) as unique_users, " +
            "    avg(duration) as avg_duration, " +
            "    max(duration) as max_duration " +
            "FROM user_events " +
            "WHERE event_date >= today() - 7 " +
            "GROUP BY event_date, event_type, device_type " +
            "ORDER BY event_date DESC, event_count DESC";

        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {

            System.out.println("\n用户行为统计:");
            System.out.println("─────────────────────────────────────────────────");
            System.out.printf("%-12s %-15s %-12s %8s %8s %12s %12s\n",
                "日期", "事件类型", "设备类型", "事件数", "UV", "平均时长", "最大时长");
            System.out.println("─────────────────────────────────────────────────");

            while (rs.next()) {
                System.out.printf("%-12s %-15s %-12s %8d %8d %12.2f %12d\n",
                    rs.getDate("event_date"),
                    rs.getString("event_type"),
                    rs.getString("device_type"),
                    rs.getLong("event_count"),
                    rs.getLong("unique_users"),
                    rs.getDouble("avg_duration"),
                    rs.getLong("max_duration")
                );
            }
        }
    }

    /**
     * User event entity
     */
    public static class UserEvent {
        public java.time.LocalDate eventDate;
        public java.time.LocalDateTime eventTime;
        public long userId;
        public String eventType;
        public String pageUrl;
        public String deviceType;
        public String sessionId;
        public int duration;

        @Override
        public String toString() {
            return String.format(
                "UserEvent{userId=%d, type='%s', page='%s', time=%s}",
                userId, eventType, pageUrl, eventTime
            );
        }
    }
}

5.2 High-Performance Batch Writes

java
import com.clickhouse.client.*;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// UserEvent below refers to the entity class from section 5.1
// (ClickHouseJDBCExample.UserEvent).

/**
 * High-performance batch writes
 */
public class HighPerformanceWriter {

    /**
     * Option 1: INSERT with inlined VALUES (fine for small batches).
     * Values are concatenated into the SQL text, so use this only with
     * trusted data; prefer PreparedStatement batches otherwise.
     */
    public static void batchInsertWithValues(Connection conn,
                                            List<UserEvent> events)
            throws SQLException {
        StringBuilder sql = new StringBuilder(
            "INSERT INTO user_events VALUES "
        );

        for (int i = 0; i < events.size(); i++) {
            UserEvent event = events.get(i);
            if (i > 0) sql.append(",");

            sql.append(String.format(
                "('%s','%s',%d,'%s','%s','%s','%s',%d)",
                event.eventDate,
                event.eventTime,
                event.userId,
                event.eventType,
                event.pageUrl,
                event.deviceType,
                event.sessionId,
                event.duration
            ));
        }

        try (Statement stmt = conn.createStatement()) {
            stmt.execute(sql.toString());
        }
    }

    /**
     * Option 2: stream CSV (recommended, high throughput)
     */
    public static void batchInsertWithCSV(ClickHouseClient client,
                                         List<UserEvent> events)
            throws Exception {
        // Build the CSV payload
        StringBuilder csv = new StringBuilder();
        for (UserEvent event : events) {
            csv.append(event.eventDate).append(",")
               .append(event.eventTime).append(",")
               .append(event.userId).append(",")
               .append(event.eventType).append(",")
               .append(event.pageUrl).append(",")
               .append(event.deviceType).append(",")
               .append(event.sessionId).append(",")
               .append(event.duration).append("\n");
        }

        // Use the low-level ClickHouse client API
        ClickHouseRequest<?> request = client.read(
            ClickHouseNode.of("http://localhost:8123")
        );

        CompletableFuture<ClickHouseResponse> future = request
            .write()
            .query("INSERT INTO user_events FORMAT CSV")
            .data(new ByteArrayInputStream(
                csv.toString().getBytes(StandardCharsets.UTF_8)
            ))
            .executeAsync();

        ClickHouseResponse response = future.get();
        System.out.println("CSV 批量插入成功!");
    }

    /**
     * Option 3: asynchronous batching writer
     */
    public static class AsyncBatchWriter {
        private final Connection connection;
        private final List<UserEvent> buffer;
        private final int batchSize;
        private final long flushIntervalMs;
        private volatile boolean running;

        public AsyncBatchWriter(Connection connection,
                               int batchSize,
                               long flushIntervalMs) {
            this.connection = connection;
            this.batchSize = batchSize;
            this.flushIntervalMs = flushIntervalMs;
            this.buffer = new ArrayList<>(batchSize);
            this.running = true;

            // Start the periodic flush thread
            startFlushThread();
        }

        public synchronized void write(UserEvent event) {
            buffer.add(event);

            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        private synchronized void flush() {
            if (buffer.isEmpty()) return;

            try {
                batchInsertWithValues(connection, new ArrayList<>(buffer));
                buffer.clear();
                System.out.println("已刷新批次数据");
            } catch (SQLException e) {
                System.err.println("批量写入失败: " + e.getMessage());
            }
        }

        private void startFlushThread() {
            new Thread(() -> {
                while (running) {
                    try {
                        Thread.sleep(flushIntervalMs);
                        flush();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }).start();
        }

        public void close() {
            running = false;
            flush();
        }
    }
}

6. Data Ingestion in Practice

6.1 Real-Time Log Collection

A real-time log-processing pipeline built on Kafka + ClickHouse.

java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.sql.Connection;
import java.time.Duration;
import java.util.*;

/**
 * Consumes logs from Kafka and writes them to ClickHouse in real time
 * Use case: real-time analysis of web access logs
 */
public class RealTimeLogCollector {

    private final KafkaConsumer<String, String> consumer;
    private final Connection clickHouseConn;
    private final int batchSize = 1000;
    private final List<AccessLog> buffer = new ArrayList<>();

    public RealTimeLogCollector(Connection clickHouseConn) {
        this.clickHouseConn = clickHouseConn;
        this.consumer = createKafkaConsumer();
    }

    private KafkaConsumer<String, String> createKafkaConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG,
                  "clickhouse-log-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1000");

        return new KafkaConsumer<>(props);
    }

    public void start() {
        consumer.subscribe(Collections.singletonList("web-access-logs"));

        System.out.println("开始消费日志数据...");

        try {
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofSeconds(1));

                for (ConsumerRecord<String, String> record : records) {
                    processLog(record.value());
                }

                // Flush in batches
                if (buffer.size() >= batchSize) {
                    flushToClickHouse();
                    consumer.commitSync();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            flushToClickHouse();
            consumer.close();
        }
    }

    private void processLog(String logJson) {
        try {
            // Parse the JSON log line
            AccessLog log = parseAccessLog(logJson);
            buffer.add(log);
        } catch (Exception e) {
            System.err.println("解析日志失败: " + e.getMessage());
        }
    }

    private AccessLog parseAccessLog(String json) {
        // Parse with FastJSON2 (any JSON library works)
        com.alibaba.fastjson2.JSONObject obj =
            com.alibaba.fastjson2.JSON.parseObject(json);

        AccessLog log = new AccessLog();
        log.timestamp = obj.getLong("timestamp");
        log.userId = obj.getLong("user_id");
        log.ip = obj.getString("ip");
        log.method = obj.getString("method");
        log.url = obj.getString("url");
        log.statusCode = obj.getInteger("status_code");
        log.responseTime = obj.getInteger("response_time");
        log.userAgent = obj.getString("user_agent");
        log.referer = obj.getString("referer");

        return log;
    }

    private void flushToClickHouse() {
        if (buffer.isEmpty()) return;

        String sql =
            "INSERT INTO access_logs " +
            "(log_time, user_id, ip, method, url, status_code, " +
            "response_time, user_agent, referer) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)";

        try (var pstmt = clickHouseConn.prepareStatement(sql)) {
            for (AccessLog log : buffer) {
                pstmt.setTimestamp(1,
                    new java.sql.Timestamp(log.timestamp));
                pstmt.setLong(2, log.userId);
                pstmt.setString(3, log.ip);
                pstmt.setString(4, log.method);
                pstmt.setString(5, log.url);
                pstmt.setInt(6, log.statusCode);
                pstmt.setInt(7, log.responseTime);
                pstmt.setString(8, log.userAgent);
                pstmt.setString(9, log.referer);

                pstmt.addBatch();
            }

            pstmt.executeBatch();
            System.out.println("成功写入 " + buffer.size() + " 条日志");
            buffer.clear();

        } catch (Exception e) {
            System.err.println("写入 ClickHouse 失败: " + e.getMessage());
        }
    }

    /**
     * Access-log entity
     */
    public static class AccessLog {
        public long timestamp;
        public long userId;
        public String ip;
        public String method;
        public String url;
        public int statusCode;
        public int responseTime;
        public String userAgent;
        public String referer;
    }

    /**
     * Create the access-log table
     */
    public static void createAccessLogTable(Connection conn)
            throws Exception {
        String sql =
            "CREATE TABLE IF NOT EXISTS access_logs (\n" +
            "    log_time DateTime,\n" +
            "    user_id UInt64,\n" +
            "    ip String,\n" +
            "    method String,\n" +
            "    url String,\n" +
            "    status_code UInt16,\n" +
            "    response_time UInt32,\n" +
            "    user_agent String,\n" +
            "    referer String,\n" +
            "    INDEX idx_user_id user_id TYPE minmax GRANULARITY 3,\n" +
            "    INDEX idx_url url TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1\n" +
            ") ENGINE = MergeTree()\n" +
            "PARTITION BY toYYYYMMDD(log_time)\n" +
            "ORDER BY (log_time, user_id)\n" +
            "TTL log_time + INTERVAL 30 DAY\n" +
            "SETTINGS index_granularity = 8192";

        try (var stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("访问日志表创建成功!");
        }
    }
}

6.2 Spring Boot Integration

java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

import javax.sql.DataSource;
import java.sql.Timestamp;
import java.util.List;
import java.util.Map;

/**
 * Spring Boot integration with ClickHouse
 */
@SpringBootApplication
public class ClickHouseSpringBootApp {

    public static void main(String[] args) {
        SpringApplication.run(ClickHouseSpringBootApp.class, args);
    }
}

/**
 * ClickHouse configuration
 */
@Configuration
class ClickHouseConfiguration {

    @Bean
    public DataSource clickHouseDataSource() {
        return ClickHouseConfig.getDataSource();
    }

    @Bean
    public JdbcTemplate clickHouseJdbcTemplate(DataSource dataSource) {
        return new JdbcTemplate(dataSource);
    }
}

/**
 * ClickHouse data-access service
 */
@Service
class ClickHouseService {

    private final JdbcTemplate jdbcTemplate;

    public ClickHouseService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    /**
     * Query user-behavior statistics
     */
    public List<Map<String, Object>> getUserBehaviorStats(
            String startDate, String endDate) {
        String sql =
            "SELECT " +
            "    toDate(event_time) as date, " +
            "    event_type, " +
            "    count() as total, " +
            "    uniq(user_id) as uv, " +
            "    avg(duration) as avg_duration " +
            "FROM user_events " +
            "WHERE event_date BETWEEN ? AND ? " +
            "GROUP BY date, event_type " +
            "ORDER BY date DESC, total DESC";

        return jdbcTemplate.queryForList(sql, startDate, endDate);
    }

    /**
     * Save a single user event
     */
    public void saveUserEvent(UserEventDTO event) {
        String sql =
            "INSERT INTO user_events " +
            "(event_date, event_time, user_id, event_type, " +
            "page_url, device_type, session_id, duration) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";

        jdbcTemplate.update(sql,
            event.getEventDate(),
            new Timestamp(event.getEventTime()),
            event.getUserId(),
            event.getEventType(),
            event.getPageUrl(),
            event.getDeviceType(),
            event.getSessionId(),
            event.getDuration()
        );
    }

    /**
     * Batch save
     */
    public void batchSaveUserEvents(List<UserEventDTO> events) {
        String sql =
            "INSERT INTO user_events " +
            "(event_date, event_time, user_id, event_type, " +
            "page_url, device_type, session_id, duration) " +
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?)";

        jdbcTemplate.batchUpdate(sql, events, events.size(),
            (ps, event) -> {
                ps.setDate(1, event.getEventDate());
                ps.setTimestamp(2, new Timestamp(event.getEventTime()));
                ps.setLong(3, event.getUserId());
                ps.setString(4, event.getEventType());
                ps.setString(5, event.getPageUrl());
                ps.setString(6, event.getDeviceType());
                ps.setString(7, event.getSessionId());
                ps.setInt(8, event.getDuration());
            }
        );
    }

    /**
     * User event DTO
     */
    public static class UserEventDTO {
        private java.sql.Date eventDate;
        private long eventTime;
        private long userId;
        private String eventType;
        private String pageUrl;
        private String deviceType;
        private String sessionId;
        private int duration;

        // Getters and Setters
        public java.sql.Date getEventDate() { return eventDate; }
        public void setEventDate(java.sql.Date eventDate) {
            this.eventDate = eventDate;
        }
        public long getEventTime() { return eventTime; }
        public void setEventTime(long eventTime) {
            this.eventTime = eventTime;
        }
        public long getUserId() { return userId; }
        public void setUserId(long userId) { this.userId = userId; }
        public String getEventType() { return eventType; }
        public void setEventType(String eventType) {
            this.eventType = eventType;
        }
        public String getPageUrl() { return pageUrl; }
        public void setPageUrl(String pageUrl) { this.pageUrl = pageUrl; }
        public String getDeviceType() { return deviceType; }
        public void setDeviceType(String deviceType) {
            this.deviceType = deviceType;
        }
        public String getSessionId() { return sessionId; }
        public void setSessionId(String sessionId) {
            this.sessionId = sessionId;
        }
        public int getDuration() { return duration; }
        public void setDuration(int duration) { this.duration = duration; }
    }
}

7. Query Optimization Techniques

7.1 Index Optimization

sql
-- 1. Primary index (defined by ORDER BY)
CREATE TABLE user_events (
    event_date Date,
    event_time DateTime,
    user_id UInt64,
    event_type String,
    page_url String
) ENGINE = MergeTree()
ORDER BY (user_id, event_time); -- primary index

-- Fast: uses the primary key
SELECT * FROM user_events
WHERE user_id = 1001;

-- Slower: cannot use the primary key
SELECT * FROM user_events
WHERE event_type = 'click';

-- 2. Data-skipping indexes
CREATE TABLE access_logs (
    log_time DateTime,
    url String,
    status_code UInt16,
    response_time UInt32,
    INDEX idx_status status_code TYPE minmax GRANULARITY 3,
    INDEX idx_url url TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1,
    INDEX idx_response response_time TYPE set(100) GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY log_time;
java
/**
 * Query optimization examples
 */
public class QueryOptimization {

    /**
     * Optimization 1: filter on the primary key
     */
    public static void optimizedQuery1(Connection conn)
            throws Exception {
        // Good query - uses the primary key
        String goodSql =
            "SELECT * FROM user_events " +
            "WHERE user_id = ? " +
            "AND event_time >= ? " +
            "ORDER BY event_time DESC " +
            "LIMIT 100";

        // Bad query - cannot use the primary key
        String badSql =
            "SELECT * FROM user_events " +
            "WHERE page_url = '/home' " +
            "ORDER BY event_time DESC " +
            "LIMIT 100";

        // Run the optimized query
        try (var pstmt = conn.prepareStatement(goodSql)) {
            pstmt.setLong(1, 1001L);
            pstmt.setTimestamp(2,
                Timestamp.valueOf("2024-01-01 00:00:00"));

            var rs = pstmt.executeQuery();
            // process results...
        }
    }

    /**
     * Optimization 2: PREWHERE instead of WHERE
     * PREWHERE filters first, then reads the remaining columns
     */
    public static void optimizedQuery2(Connection conn)
            throws Exception {
        String sql =
            "SELECT " +
            "    user_id, " +
            "    event_type, " +
            "    page_url, " +
            "    duration " +
            "FROM user_events " +
            "PREWHERE event_date = today() " +  // 先过滤
            "WHERE user_id > 1000 " +
            "LIMIT 1000";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {
            // process results...
        }
    }

    /**
     * Optimization 3: exploit partition pruning
     */
    public static void optimizedQuery3(Connection conn)
            throws Exception {
        String sql =
            "SELECT " +
            "    event_type, " +
            "    count() as cnt " +
            "FROM user_events " +
            "WHERE event_date >= today() - 7 " +  // 分区裁剪
            "GROUP BY event_type";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {
            // process results...
        }
    }

    /**
     * Optimization 4: FINAL for deduplication (ReplacingMergeTree)
     */
    public static void optimizedQuery4(Connection conn)
            throws Exception {
        // Without FINAL - may return duplicate rows
        String sql1 =
            "SELECT * FROM user_profiles " +
            "WHERE user_id = 1001";

        // With FINAL - unique rows, but slower
        String sql2 =
            "SELECT * FROM user_profiles FINAL " +
            "WHERE user_id = 1001";

        // Recommended: deduplicate with GROUP BY + argMax
        String sql3 =
            "SELECT " +
            "    user_id, " +
            "    argMax(username, version) as username, " +
            "    argMax(email, version) as email, " +
            "    argMax(age, version) as age, " +
            "    argMax(city, version) as city " +
            "FROM user_profiles " +
            "WHERE user_id = 1001 " +
            "GROUP BY user_id";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql3)) {
            // process results...
        }
    }

    /**
     * Optimization 5: approximate functions
     */
    public static void optimizedQuery5(Connection conn)
            throws Exception {
        // Exact unique count (slow)
        String exactSql =
            "SELECT count(DISTINCT user_id) FROM user_events";

        // Approximate unique count (fast, error around 2%)
        String approxSql =
            "SELECT uniq(user_id) FROM user_events";

        // Even faster approximation
        String fasterSql =
            "SELECT uniqHLL12(user_id) FROM user_events";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(approxSql)) {
            if (rs.next()) {
                System.out.println("UV: " + rs.getLong(1));
            }
        }
    }
}
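
Whether a query can actually use the primary key or a skip index does not have to be guessed: `EXPLAIN indexes = 1` (available in reasonably recent ClickHouse versions, roughly 21.4+) prints index evaluation per plan step. A minimal sketch reusing the connection helper from section 3.3:

java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Prints the index-usage plan for a query. Each result row is one line
 * of the EXPLAIN output; look for PrimaryKey and skip-index entries to
 * confirm that granule pruning is actually happening.
 */
public class ExplainIndexes {

    public static void explain(Connection conn, String query)
            throws Exception {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("EXPLAIN indexes = 1 " + query)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ClickHouseConfig.getConnection()) {
            explain(conn,
                "SELECT count() FROM user_events WHERE user_id = 1001");
        }
    }
}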

7.2 Query Performance Comparison

Illustrative query performance (1 billion rows)

Scenario 1: full table scan vs. primary-key lookup
┌────────────────────────────┬──────────┬──────────────┐
│ Query Type                 │   Time   │  Rows Scanned│
├────────────────────────────┼──────────┼──────────────┤
│ Full table scan            │  25.3s   │ 1,000,000,000│
│ Primary-key lookup         │  0.05s   │     10,000   │
│ PK + partition pruning     │  0.02s   │      1,000   │
└────────────────────────────┴──────────┴──────────────┘

Scenario 2: COUNT(DISTINCT) vs. uniq()
┌────────────────────────────┬──────────┬──────────────┐
│ Function                   │   Time   │   Accuracy   │
├────────────────────────────┼──────────┼──────────────┤
│ count(DISTINCT user_id)    │  18.5s   │    100%      │
│ uniq(user_id)              │   2.1s   │    ~98%      │
│ uniqHLL12(user_id)         │   0.8s   │    ~95%      │
└────────────────────────────┴──────────┴──────────────┘

8. Materialized Views and Aggregation

8.1 Materialized View Basics

Materialized views precompute aggregation results, which can speed up repeated analytical queries dramatically.

sql
-- Create a materialized view
CREATE MATERIALIZED VIEW user_daily_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(stat_date)
ORDER BY (stat_date, user_id)
AS
SELECT
    toDate(event_time) as stat_date,
    user_id,
    event_type,
    count() as event_count,
    sum(duration) as total_duration
FROM user_events
GROUP BY stat_date, user_id, event_type;

-- Query the materialized view (very fast)
SELECT
    stat_date,
    user_id,
    sum(event_count) as total_events,
    sum(total_duration) as total_duration
FROM user_daily_stats
WHERE stat_date >= today() - 30
GROUP BY stat_date, user_id;
java
/**
 * Materialized view management
 */
public class MaterializedViewManager {

    /**
     * Create the user-behavior statistics materialized view
     */
    public static void createUserStatsMV(Connection conn)
            throws Exception {
        String sql =
            "CREATE MATERIALIZED VIEW IF NOT EXISTS user_behavior_stats_mv\n" +
            "ENGINE = AggregatingMergeTree()\n" +
            "PARTITION BY toYYYYMM(stat_date)\n" +
            "ORDER BY (stat_date, user_id, event_type)\n" +
            "AS\n" +
            "SELECT\n" +
            "    toDate(event_time) as stat_date,\n" +
            "    user_id,\n" +
            "    event_type,\n" +
            "    device_type,\n" +
            "    countState() as event_count,\n" +
            "    sumState(duration) as total_duration,\n" +
            "    avgState(duration) as avg_duration,\n" +
            "    uniqState(session_id) as unique_sessions\n" +
            "FROM user_events\n" +
            "GROUP BY stat_date, user_id, event_type, device_type";

        try (var stmt = conn.createStatement()) {
            stmt.execute(sql);
            System.out.println("物化视图创建成功!");
        }
    }

    /**
     * Query the materialized view
     */
    public static void queryMaterializedView(Connection conn)
            throws Exception {
        String sql =
            "SELECT\n" +
            "    stat_date,\n" +
            "    user_id,\n" +
            "    event_type,\n" +
            "    device_type,\n" +
            "    countMerge(event_count) as total_events,\n" +
            "    sumMerge(total_duration) as total_duration,\n" +
            "    avgMerge(avg_duration) as avg_duration,\n" +
            "    uniqMerge(unique_sessions) as unique_sessions\n" +
            "FROM user_behavior_stats_mv\n" +
            "WHERE stat_date >= today() - 7\n" +
            "GROUP BY stat_date, user_id, event_type, device_type\n" +
            "ORDER BY stat_date DESC, total_events DESC\n" +
            "LIMIT 100";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            System.out.println("\n用户行为统计 (来自物化视图):");
            while (rs.next()) {
                System.out.printf(
                    "日期: %s, 用户: %d, 类型: %s, 事件数: %d, " +
                    "平均时长: %.2f, 会话数: %d\n",
                    rs.getDate("stat_date"),
                    rs.getLong("user_id"),
                    rs.getString("event_type"),
                    rs.getLong("total_events"),
                    rs.getDouble("avg_duration"),
                    rs.getLong("unique_sessions")
                );
            }
        }
    }

    /**
     * "Refreshing" a materialized view (updates are incremental and automatic)
     */
    public static void refreshMaterializedView(Connection conn)
            throws Exception {
        // ClickHouse materialized views are populated automatically as new
        // rows are inserted into the source table; no manual refresh exists.
        // Note: only rows inserted AFTER the view was created are processed.
        System.out.println("ClickHouse materialized views update automatically on insert");

        // To rebuild from scratch, drop and recreate:
        // DROP TABLE user_behavior_stats_mv;
        // then run the CREATE statement again
    }
}
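
One caveat behind refreshMaterializedView() above: a materialized view only transforms rows inserted after its creation, so pre-existing history has to be backfilled by hand. A hedged sketch, assuming the view and source table from this section (run it once, and only for date ranges no longer receiving writes, to avoid double counting):

java
import java.sql.Connection;
import java.sql.PreparedStatement;

/**
 * Backfill helper: inserts pre-existing history into the materialized
 * view, producing the same *State() columns the view definition
 * declares. Inserting into the MV writes to its inner storage table.
 */
public class MaterializedViewBackfill {

    public static void backfill(Connection conn, java.sql.Date cutoff)
            throws Exception {
        String sql =
            "INSERT INTO user_behavior_stats_mv\n" +
            "SELECT\n" +
            "    toDate(event_time) AS stat_date,\n" +
            "    user_id,\n" +
            "    event_type,\n" +
            "    device_type,\n" +
            "    countState() AS event_count,\n" +
            "    sumState(duration) AS total_duration,\n" +
            "    avgState(duration) AS avg_duration,\n" +
            "    uniqState(session_id) AS unique_sessions\n" +
            "FROM user_events\n" +
            "WHERE event_date < ?\n" +   // backfill strictly before the cutoff
            "GROUP BY stat_date, user_id, event_type, device_type";

        try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
            pstmt.setDate(1, cutoff);
            pstmt.execute();
        }
    }
}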

8.2 Real-Time Reporting

java
/**
 * Real-time report generator
 * Use case: a live e-commerce trading dashboard
 */
public class RealTimeDashboard {

    private final Connection conn;

    public RealTimeDashboard(Connection conn) {
        this.conn = conn;
    }

    /**
     * Today's real-time trading overview
     */
    public DashboardMetrics getTodayMetrics() throws Exception {
        String sql =
            "SELECT\n" +
            "    count() as total_orders,\n" +
            "    sum(amount) as total_amount,\n" +
            "    avg(amount) as avg_amount,\n" +
            "    uniq(user_id) as unique_users,\n" +
            "    uniq(product_id) as unique_products,\n" +
            "    countIf(status = 'SUCCESS') as success_count,\n" +
            "    countIf(status = 'FAILED') as failed_count\n" +
            "FROM orders\n" +
            "WHERE order_date = today()";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            if (rs.next()) {
                DashboardMetrics metrics = new DashboardMetrics();
                metrics.totalOrders = rs.getLong("total_orders");
                metrics.totalAmount = rs.getDouble("total_amount");
                metrics.avgAmount = rs.getDouble("avg_amount");
                metrics.uniqueUsers = rs.getLong("unique_users");
                metrics.uniqueProducts = rs.getLong("unique_products");
                metrics.successCount = rs.getLong("success_count");
                metrics.failedCount = rs.getLong("failed_count");

                return metrics;
            }
        }

        return null;
    }

    /**
     * Hourly trend
     */
    public List<HourlyTrend> getHourlyTrend() throws Exception {
        String sql =
            "SELECT\n" +
            "    toHour(order_time) as hour,\n" +
            "    count() as order_count,\n" +
            "    sum(amount) as hour_amount,\n" +
            "    uniq(user_id) as hour_users\n" +
            "FROM orders\n" +
            "WHERE order_date = today()\n" +
            "GROUP BY hour\n" +
            "ORDER BY hour";

        List<HourlyTrend> trends = new ArrayList<>();

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            while (rs.next()) {
                HourlyTrend trend = new HourlyTrend();
                trend.hour = rs.getInt("hour");
                trend.orderCount = rs.getLong("order_count");
                trend.hourAmount = rs.getDouble("hour_amount");
                trend.hourUsers = rs.getLong("hour_users");

                trends.add(trend);
            }
        }

        return trends;
    }

    /**
     * Top products ranking
     */
    public List<ProductRank> getTopProducts(int limit) throws Exception {
        String sql =
            "SELECT\n" +
            "    product_id,\n" +
            "    count() as sale_count,\n" +
            "    sum(amount) as sale_amount,\n" +
            "    uniq(user_id) as buyer_count\n" +
            "FROM orders\n" +
            "WHERE order_date >= today() - 7\n" +
            "GROUP BY product_id\n" +
            "ORDER BY sale_amount DESC\n" +
            "LIMIT ?";

        List<ProductRank> ranks = new ArrayList<>();

        try (var pstmt = conn.prepareStatement(sql)) {
            pstmt.setInt(1, limit);

            try (var rs = pstmt.executeQuery()) {
                while (rs.next()) {
                    ProductRank rank = new ProductRank();
                    rank.productId = rs.getLong("product_id");
                    rank.saleCount = rs.getLong("sale_count");
                    rank.saleAmount = rs.getDouble("sale_amount");
                    rank.buyerCount = rs.getLong("buyer_count");

                    ranks.add(rank);
                }
            }
        }

        return ranks;
    }

    /**
     * Dashboard metrics
     */
    public static class DashboardMetrics {
        public long totalOrders;
        public double totalAmount;
        public double avgAmount;
        public long uniqueUsers;
        public long uniqueProducts;
        public long successCount;
        public long failedCount;

        public void display() {
            System.out.println("\n========== 今日实时数据 ==========");
            System.out.printf("总订单数: %,d\n", totalOrders);
            System.out.printf("总金额: ¥%,.2f\n", totalAmount);
            System.out.printf("平均客单价: ¥%,.2f\n", avgAmount);
            System.out.printf("下单用户数: %,d\n", uniqueUsers);
            System.out.printf("售出商品数: %,d\n", uniqueProducts);
            System.out.printf("成功订单: %,d (%.2f%%)\n",
                successCount,
                100.0 * successCount / totalOrders);
            System.out.printf("失败订单: %,d (%.2f%%)\n",
                failedCount,
                100.0 * failedCount / totalOrders);
            System.out.println("================================\n");
        }
    }

    /**
     * Hourly trend
     */
    public static class HourlyTrend {
        public int hour;
        public long orderCount;
        public double hourAmount;
        public long hourUsers;
    }

    /**
     * Product ranking
     */
    public static class ProductRank {
        public long productId;
        public long saleCount;
        public double saleAmount;
        public long buyerCount;
    }
}

9. Distributed Cluster Practice

9.1 Cluster Architecture

ClickHouse distributed cluster

                  Client Applications
                         │
                         ▼
            ┌────────────────────────┐
            │   Load Balancer        │
            └────────────────────────┘
                         │
        ┌────────────────┴────────────────┐
        │                                  │
        ▼                                  ▼
┌───────────────┐                  ┌───────────────┐
│  Shard 1      │                  │  Shard 2      │
│               │                  │               │
│ ┌───────────┐ │                  │ ┌───────────┐ │
│ │ Replica 1 │ │                  │ │ Replica 1 │ │
│ │ (peer)    │ │                  │ │ (peer)    │ │
│ │ Node 1    │ │                  │ │ Node 3    │ │
│ └───────────┘ │                  │ └───────────┘ │
│       │       │                  │       │       │
│       │       │                  │       │       │
│ ┌───────────┐ │                  │ ┌───────────┐ │
│ │ Replica 2 │ │                  │ │ Replica 2 │ │
│ │ (peer)    │ │                  │ │ (peer)    │ │
│ │ Node 2    │ │                  │ │ Node 4    │ │
│ └───────────┘ │                  │ └───────────┘ │
└───────────────┘                  └───────────────┘
        │                                  │
        └──────────────┬───────────────────┘
                       │
                       ▼
              ┌────────────────┐
              │   ZooKeeper    │
              │   Cluster      │
              │  (3-5 nodes)   │
              └────────────────┘

9.2 Working with Distributed Tables

java
/**
 * Distributed cluster operations
 */
public class DistributedClusterOps {

    /**
     * Create a distributed table
     */
    public static void createDistributedTable(Connection conn)
            throws Exception {
        // 1. Create the local table on every node
        String localTableSql =
            "CREATE TABLE IF NOT EXISTS orders_local ON CLUSTER my_cluster (\n" +
            "    order_id UInt64,\n" +
            "    user_id UInt64,\n" +
            "    product_id UInt64,\n" +
            "    amount Decimal(18,2),\n" +
            "    order_time DateTime,\n" +
            "    status String\n" +
            ") ENGINE = ReplicatedMergeTree(\n" +
            "    '/clickhouse/tables/{shard}/orders_local',\n" +
            "    '{replica}'\n" +
            ")\n" +
            "PARTITION BY toYYYYMM(order_time)\n" +
            "ORDER BY (user_id, order_time)";

        // 2. Create the distributed table on top of it
        String distributedTableSql =
            "CREATE TABLE IF NOT EXISTS orders_all ON CLUSTER my_cluster\n" +
            "AS orders_local\n" +
            "ENGINE = Distributed(\n" +
            "    my_cluster,           -- 集群名称\n" +
            "    default,              -- 数据库名\n" +
            "    orders_local,         -- 本地表名\n" +
            "    rand()                -- 分片键\n" +
            ")";

        try (var stmt = conn.createStatement()) {
            stmt.execute(localTableSql);
            stmt.execute(distributedTableSql);
            System.out.println("分布式表创建成功!");
        }
    }

    /**
     * Write to the distributed table
     */
    public static void insertToDistributedTable(Connection conn)
            throws Exception {
        // Inserts into the distributed table are sharded automatically
        String sql =
            "INSERT INTO orders_all " +
            "(order_id, user_id, product_id, amount, order_time, status) " +
            "VALUES (?, ?, ?, ?, ?, ?)";

        try (var pstmt = conn.prepareStatement(sql)) {
            for (int i = 0; i < 1000; i++) {
                pstmt.setLong(1, i);
                pstmt.setLong(2, i % 100);
                pstmt.setLong(3, i % 50);
                pstmt.setBigDecimal(4,
                    new java.math.BigDecimal(String.valueOf(100 + i)));
                pstmt.setTimestamp(5,
                    new Timestamp(System.currentTimeMillis()));
                pstmt.setString(6, "SUCCESS");

                pstmt.addBatch();
            }

            pstmt.executeBatch();
            System.out.println("批量写入分布式表成功!");
        }
    }

    /**
     * Query the distributed table
     */
    public static void queryDistributedTable(Connection conn)
            throws Exception {
        // Queries fan out to every shard and merge the results
        String sql =
            "SELECT\n" +
            "    toDate(order_time) as date,\n" +
            "    count() as order_count,\n" +
            "    sum(amount) as total_amount,\n" +
            "    uniq(user_id) as unique_users\n" +
            "FROM orders_all\n" +
            "WHERE order_time >= now() - INTERVAL 7 DAY\n" +
            "GROUP BY date\n" +
            "ORDER BY date DESC";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            System.out.println("\n分布式查询结果:");
            while (rs.next()) {
                System.out.printf(
                    "日期: %s, 订单数: %d, 总金额: %.2f, UV: %d\n",
                    rs.getDate("date"),
                    rs.getLong("order_count"),
                    rs.getDouble("total_amount"),
                    rs.getLong("unique_users")
                );
            }
        }
    }

    /**
     * Cluster health check
     */
    public static void checkClusterHealth(Connection conn)
            throws Exception {
        String sql =
            "SELECT\n" +
            "    cluster,\n" +
            "    shard_num,\n" +
            "    replica_num,\n" +
            "    host_name,\n" +
            "    port,\n" +
            "    is_local\n" +
            "FROM system.clusters\n" +
            "WHERE cluster = 'my_cluster'\n" +
            "ORDER BY shard_num, replica_num";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            System.out.println("\n集群状态:");
            while (rs.next()) {
                System.out.printf(
                    "分片: %d, 副本: %d, 主机: %s:%d, 本地: %s\n",
                    rs.getInt("shard_num"),
                    rs.getInt("replica_num"),
                    rs.getString("host_name"),
                    rs.getInt("port"),
                    rs.getBoolean("is_local") ? "是" : "否"
                );
            }
        }
    }
}

10. Real-Time Data Analysis Scenarios

10.1 User Retention Analysis

java
/**
 * User retention analysis
 */
public class RetentionAnalysis {

    /**
     * Compute N-day retention rates
     */
    public static void calculateRetention(Connection conn, int days)
            throws Exception {
        // The original nested-window-function formulation is not valid
        // ClickHouse SQL (window functions cannot be nested), so cohorts
        // are built with explicit joins instead: first_seen assigns each
        // user a cohort date, and cohort sizes come from a second pass.
        String sql =
            "WITH first_seen AS (\n" +
            "    SELECT user_id, min(event_date) AS first_date\n" +
            "    FROM user_events\n" +
            "    WHERE event_date >= today() - ?\n" +
            "    GROUP BY user_id\n" +
            ")\n" +
            "SELECT\n" +
            "    r.first_date AS first_date,\n" +
            "    retention_day,\n" +
            "    retained_users,\n" +
            "    total_users,\n" +
            "    round(retained_users * 100.0 / total_users, 2) AS retention_rate\n" +
            "FROM (\n" +
            "    SELECT\n" +
            "        f.first_date AS first_date,\n" +
            "        dateDiff('day', f.first_date, e.event_date) AS retention_day,\n" +
            "        uniq(e.user_id) AS retained_users\n" +
            "    FROM user_events AS e\n" +
            "    INNER JOIN first_seen AS f ON e.user_id = f.user_id\n" +
            "    WHERE e.event_date >= today() - ?\n" +
            "    GROUP BY first_date, retention_day\n" +
            ") AS r\n" +
            "INNER JOIN (\n" +
            "    SELECT first_date, uniq(user_id) AS total_users\n" +
            "    FROM first_seen\n" +
            "    GROUP BY first_date\n" +
            ") AS t ON r.first_date = t.first_date\n" +
            "WHERE retention_day <= ?\n" +
            "ORDER BY first_date DESC, retention_day";

        try (var pstmt = conn.prepareStatement(sql)) {
            int lookback = days + 7;   // cohort window plus retention horizon
            pstmt.setInt(1, lookback);
            pstmt.setInt(2, lookback);
            pstmt.setInt(3, days);

            try (var rs = pstmt.executeQuery()) {
                System.out.println("\n用户留存分析:");
                System.out.println("─────────────────────────────────────────");
                System.out.printf("%-12s %-8s %-12s %-12s %-10s\n",
                    "首次日期", "留存天", "留存用户", "总用户", "留存率%");
                System.out.println("─────────────────────────────────────────");

                while (rs.next()) {
                    System.out.printf("%-12s %-8d %-12d %-12d %-10.2f\n",
                        rs.getDate("first_date"),
                        rs.getInt("retention_day"),
                        rs.getLong("retained_users"),
                        rs.getLong("total_users"),
                        rs.getDouble("retention_rate")
                    );
                }
            }
        }
    }
}
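
除了上面通用的 JOIN 写法,ClickHouse 还内置了 retention 聚合函数,适合对固定日期做同期群留存统计。下面是一个简化的示意(沿用 user_events 表,具体日期条件需按业务调整):

java 复制代码
/**
 * 使用 ClickHouse 内置 retention 函数的留存示意
 * retention(cond1, cond2, ...) 为每个用户返回 UInt8 数组:
 * r[1] 表示是否满足首日条件,r[N] 表示同时满足首日条件和第 N 个条件
 */
public static void retentionBuiltin(java.sql.Connection conn) throws Exception {
    String sql =
        "SELECT\n" +
        "    sum(r[1]) as day0_users,\n" +
        "    sum(r[2]) as day1_retained,\n" +
        "    sum(r[3]) as day7_retained\n" +
        "FROM (\n" +
        "    SELECT\n" +
        "        user_id,\n" +
        "        retention(\n" +
        "            event_date = today() - 7,\n" +
        "            event_date = today() - 6,\n" +
        "            event_date = today()\n" +
        "        ) as r\n" +
        "    FROM user_events\n" +
        "    WHERE event_date >= today() - 7\n" +
        "    GROUP BY user_id\n" +
        ")";

    try (var stmt = conn.createStatement();
         var rs = stmt.executeQuery(sql)) {
        while (rs.next()) {
            System.out.printf("首日活跃: %d, 次日留存: %d, 第7日留存: %d\n",
                rs.getLong("day0_users"),
                rs.getLong("day1_retained"),
                rs.getLong("day7_retained"));
        }
    }
}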

10.2 漏斗分析

java 复制代码
/**
 * 转化漏斗分析
 */
public class FunnelAnalysis {

    /**
     * 分析用户转化路径
     * 路径: 浏览商品 -> 加入购物车 -> 下单 -> 支付
     */
    public static void analyzeFunnel(Connection conn) throws Exception {
        // 注: 每一步都重复了 view_users 的标量子查询,写法直观但有冗余,
        // 更简洁的 windowFunnel 原生写法见本节末尾。
        // 另外 ClickHouse 中 ORDER BY 只作用于 UNION ALL 的最后一段,
        // 要对整体排序需把整个 UNION 包进子查询
        String sql =
            "SELECT * FROM (\n" +
            "SELECT\n" +
            "    '浏览商品' as step,\n" +
            "    1 as step_num,\n" +
            "    view_users as users,\n" +
            "    100.0 as conversion_rate\n" +
            "FROM (\n" +
            "    SELECT uniq(user_id) as view_users\n" +
            "    FROM user_events\n" +
            "    WHERE event_type = 'view'\n" +
            "    AND event_date = today()\n" +
            ")\n" +
            "UNION ALL\n" +
            "SELECT\n" +
            "    '加入购物车' as step,\n" +
            "    2 as step_num,\n" +
            "    cart_users as users,\n" +
            "    round(cart_users * 100.0 / view_users, 2) as conversion_rate\n" +
            "FROM (\n" +
            "    SELECT\n" +
            "        uniq(user_id) as cart_users,\n" +
            "        (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
            "    FROM user_events\n" +
            "    WHERE event_type = 'add_to_cart'\n" +
            "    AND event_date = today()\n" +
            ")\n" +
            "UNION ALL\n" +
            "SELECT\n" +
            "    '下单' as step,\n" +
            "    3 as step_num,\n" +
            "    order_users as users,\n" +
            "    round(order_users * 100.0 / view_users, 2) as conversion_rate\n" +
            "FROM (\n" +
            "    SELECT\n" +
            "        uniq(user_id) as order_users,\n" +
            "        (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
            "    FROM user_events\n" +
            "    WHERE event_type = 'order'\n" +
            "    AND event_date = today()\n" +
            ")\n" +
            "UNION ALL\n" +
            "SELECT\n" +
            "    '支付' as step,\n" +
            "    4 as step_num,\n" +
            "    pay_users as users,\n" +
            "    round(pay_users * 100.0 / view_users, 2) as conversion_rate\n" +
            "FROM (\n" +
            "    SELECT\n" +
            "        uniq(user_id) as pay_users,\n" +
            "        (SELECT uniq(user_id) FROM user_events WHERE event_type = 'view' AND event_date = today()) as view_users\n" +
            "    FROM user_events\n" +
            "    WHERE event_type = 'payment'\n" +
            "    AND event_date = today()\n" +
            ")\n" +
            "ORDER BY step_num";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            System.out.println("\n转化漏斗分析:");
            System.out.println("─────────────────────────────────────");
            System.out.printf("%-15s %-12s %-12s\n",
                "步骤", "用户数", "转化率%");
            System.out.println("─────────────────────────────────────");

            while (rs.next()) {
                String step = rs.getString("step");
                long users = rs.getLong("users");
                double rate = rs.getDouble("conversion_rate");

                // ASCII 进度条
                int barLength = (int)(rate / 5);
                String bar = "█".repeat(barLength);

                System.out.printf("%-15s %-12d %6.2f%% %s\n",
                    step, users, rate, bar);
            }
        }
    }
}
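
上面的 UNION ALL 写法直观但要多次扫描表。ClickHouse 提供了专门的 windowFunnel 聚合函数,可以一次扫描算出每个用户按顺序到达的最深步骤。下面是一个示意写法(假设表中有 DateTime 类型的 event_time 列,86400 秒的转化窗口为示例值):

java 复制代码
/**
 * 使用 windowFunnel 的漏斗示意
 * windowFunnel(窗口秒数)(时间列, 条件1, 条件2, ...) 返回每个用户
 * 在窗口内按顺序完成到的最大步骤 (0~4)
 */
public static void analyzeFunnelNative(java.sql.Connection conn) throws Exception {
    String sql =
        "SELECT level, count() as users\n" +
        "FROM (\n" +
        "    SELECT\n" +
        "        user_id,\n" +
        "        windowFunnel(86400)(\n" +
        "            event_time,\n" +
        "            event_type = 'view',\n" +
        "            event_type = 'add_to_cart',\n" +
        "            event_type = 'order',\n" +
        "            event_type = 'payment'\n" +
        "        ) as level\n" +
        "    FROM user_events\n" +
        "    WHERE event_date = today()\n" +
        "    GROUP BY user_id\n" +
        ")\n" +
        "GROUP BY level\n" +
        "ORDER BY level";

    try (var stmt = conn.createStatement();
         var rs = stmt.executeQuery(sql)) {
        while (rs.next()) {
            System.out.printf("最深到达步骤 %d 的用户数: %d\n",
                rs.getInt("level"), rs.getLong("users"));
        }
    }
}

注意 level 统计的是用户到达的最深步骤;若要得到"至少到达第 N 步"的人数,再对 level 做一次累计即可(例如 countIf(level >= 2))。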

11. 性能优化最佳实践

11.1 数据类型选择

复制代码
数据类型优化建议

1. 整数类型 - 选择合适的范围
   ┌──────────────┬────────────┬─────────────────┐
   │ Type         │ Size       │ Range           │
   ├──────────────┼────────────┼─────────────────┤
   │ UInt8        │ 1 byte     │ 0 ~ 255         │
   │ UInt16       │ 2 bytes    │ 0 ~ 65535       │
   │ UInt32       │ 4 bytes    │ 0 ~ 2^32-1      │
   │ UInt64       │ 8 bytes    │ 0 ~ 2^64-1      │
   └──────────────┴────────────┴─────────────────┘

2. 字符串类型
   - String: 可变长度,适合短字符串
   - FixedString(N): 固定长度,适合MD5、UUID
   - LowCardinality(String): 低基数字符串,节省空间

3. 时间类型
   - Date: 日期(2字节)
   - DateTime: 日期时间(4字节)
   - DateTime64: 高精度时间(8字节)
java 复制代码
/**
 * 性能优化实践
 */
public class PerformanceOptimization {

    /**
     * 优化1: 使用 LowCardinality
     */
    public static void useLowCardinality(Connection conn)
            throws Exception {
        // 不好的设计(仅作对比,不执行)
        String badSql =
            "CREATE TABLE events_bad (\n" +
            "    event_time DateTime,\n" +
            "    event_type String,        -- 实际可能只有10种取值\n" +
            "    device_type String,       -- 实际可能只有5种取值\n" +
            "    country String            -- 实际可能只有200种取值\n" +
            ") ENGINE = MergeTree()\n" +
            "ORDER BY event_time";

        // 好的设计 - 低基数列使用 LowCardinality
        String goodSql =
            "CREATE TABLE IF NOT EXISTS events_good (\n" +
            "    event_time DateTime,\n" +
            "    event_type LowCardinality(String),\n" +
            "    device_type LowCardinality(String),\n" +
            "    country LowCardinality(String)\n" +
            ") ENGINE = MergeTree()\n" +
            "ORDER BY event_time";

        try (var stmt = conn.createStatement()) {
            stmt.execute(goodSql);
        }
        // 对低基数列,LowCardinality 通常可节省约 50%~90% 的存储空间
    }

    /**
     * 优化2: 合理设置 ORDER BY
     */
    public static void optimizeOrderBy(Connection conn)
            throws Exception {
        // 根据查询模式设置
        // 如果经常按 user_id 查询
        String sql1 =
            "CREATE TABLE user_events (\n" +
            "    user_id UInt64,\n" +
            "    event_time DateTime,\n" +
            "    event_type String\n" +
            ") ENGINE = MergeTree()\n" +
            "ORDER BY (user_id, event_time)";  // user_id 在前

        // 如果经常按时间范围查询
        String sql2 =
            "CREATE TABLE time_series_data (\n" +
            "    timestamp DateTime,\n" +
            "    metric_name String,\n" +
            "    value Float64\n" +
            ") ENGINE = MergeTree()\n" +
            "ORDER BY (timestamp, metric_name)";  // timestamp 在前
    }

    /**
     * 优化3: 使用分区提升查询性能
     */
    public static void usePartitioning(Connection conn)
            throws Exception {
        String sql =
            "CREATE TABLE access_logs (\n" +
            "    log_time DateTime,\n" +
            "    url String,\n" +
            "    status_code UInt16\n" +
            ") ENGINE = MergeTree()\n" +
            "PARTITION BY toYYYYMMDD(log_time)\n" +  // 按天分区
            "ORDER BY log_time\n" +
            "TTL log_time + INTERVAL 30 DAY";  // 30天后自动删除

        // 好处:
        // 1. 查询时只扫描相关分区
        // 2. 可以按分区删除数据
        // 3. TTL 可以自动清理旧数据
    }

    /**
     * 优化4: 批量操作
     */
    public static void batchOperations() {
        // 不好: 单条插入
        // for (Data d : dataList) {
        //     INSERT INTO table VALUES (d);
        // }

        // 好: 批量插入
        // INSERT INTO table VALUES (d1), (d2), (d3), ...

        // 更好: 使用 CSV 格式
        // INSERT INTO table FORMAT CSV

        System.out.println("批量操作性能提升 10-100 倍");
    }
}
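
建表之后,可以用 system.columns 验证 LowCardinality 等优化的实际压缩效果。下面是一个简单的检查方法(表名通过参数传入,例如上例中的 events_good):

java 复制代码
/**
 * 查看指定表各列的压缩情况
 */
public static void checkCompression(java.sql.Connection conn, String table)
        throws Exception {
    String sql =
        "SELECT\n" +
        "    name,\n" +
        "    formatReadableSize(data_compressed_bytes) as compressed,\n" +
        "    formatReadableSize(data_uncompressed_bytes) as uncompressed,\n" +
        "    round(data_uncompressed_bytes / data_compressed_bytes, 2) as ratio\n" +
        "FROM system.columns\n" +
        "WHERE table = ?\n" +
        "ORDER BY data_compressed_bytes DESC";

    try (var pstmt = conn.prepareStatement(sql)) {
        pstmt.setString(1, table);
        try (var rs = pstmt.executeQuery()) {
            while (rs.next()) {
                // 空表时 ratio 为 NaN,无参考意义
                System.out.printf("%-20s 压缩后: %-10s 原始: %-10s 压缩比: %.2f\n",
                    rs.getString("name"),
                    rs.getString("compressed"),
                    rs.getString("uncompressed"),
                    rs.getDouble("ratio"));
            }
        }
    }
}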

11.2 查询优化检查清单

复制代码
查询优化检查清单

☑ 1. 索引使用
   □ WHERE 条件是否使用了 ORDER BY 列?
   □ 是否使用了跳数索引?
   □ 是否使用了 PREWHERE?(跳数索引与 PREWHERE 的示例见清单之后)

☑ 2. 分区裁剪
   □ 查询是否限制了分区键范围?
   □ 避免全表扫描

☑ 3. 数据类型
   □ 使用合适大小的整数类型
   □ 低基数字符串使用 LowCardinality
   □ 避免使用 Nullable (通常有 20%~30% 的性能损失)

☑ 4. 聚合优化
   □ 使用近似函数 (uniq 代替 count distinct)
   □ 使用物化视图预聚合
   □ GROUP BY 列顺序与 ORDER BY 一致

☑ 5. JOIN 优化
   □ 小表在右侧
   □ 使用字典表代替 JOIN
   □ 避免大表 JOIN 大表

☑ 6. 并行度
   □ max_threads 设置合理
   □ 分区数量适中
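
下面用一段示意代码演示清单第 1 项中的跳数索引和 PREWHERE(沿用 11.1 节的 access_logs 表,索引名 idx_status 为假设):

java 复制代码
/**
 * 跳数索引与 PREWHERE 示意
 */
public class QueryTuningExamples {

    public static void addSkipIndex(java.sql.Connection conn) throws Exception {
        try (var stmt = conn.createStatement()) {
            // 为低基数过滤列添加 set 类型跳数索引
            stmt.execute(
                "ALTER TABLE access_logs " +
                "ADD INDEX idx_status status_code TYPE set(100) GRANULARITY 4");
            // 对已有数据生效需要物化索引
            stmt.execute("ALTER TABLE access_logs MATERIALIZE INDEX idx_status");
        }
    }

    public static void usePrewhere(java.sql.Connection conn) throws Exception {
        // PREWHERE 先只读取过滤列,命中后才读取其余列,减少 I/O
        String sql =
            "SELECT url, count() as cnt\n" +
            "FROM access_logs\n" +
            "PREWHERE status_code = 500\n" +
            "WHERE log_time >= now() - INTERVAL 7 DAY\n" +
            "GROUP BY url\n" +
            "ORDER BY cnt DESC\n" +
            "LIMIT 10";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s: %d\n",
                    rs.getString("url"), rs.getLong("cnt"));
            }
        }
    }
}

实际上 MergeTree 查询默认会自动把合适的 WHERE 条件下推到 PREWHERE(由 optimize_move_to_prewhere 控制),显式 PREWHERE 主要用于手动调优。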

12. 监控与运维

12.1 系统表监控

java 复制代码
/**
 * ClickHouse 监控
 */
public class ClickHouseMonitoring {

    /**
     * 监控查询性能
     */
    public static void monitorQueries(Connection conn) throws Exception {
        String sql =
            "SELECT\n" +
            "    user,\n" +
            "    query_id,\n" +
            "    query_duration_ms,\n" +
            "    read_rows,\n" +
            "    read_bytes,\n" +
            "    memory_usage,\n" +
            "    query\n" +
            "FROM system.query_log\n" +
            "WHERE type = 'QueryFinish'\n" +
            "AND event_time >= now() - INTERVAL 1 HOUR\n" +
            "ORDER BY query_duration_ms DESC\n" +
            "LIMIT 10";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            System.out.println("\n慢查询Top 10:");
            while (rs.next()) {
                // 截断过长的 SQL,只展示前 100 个字符
                String query = rs.getString("query");
                System.out.printf(
                    "耗时: %dms, 读取行数: %d, 内存: %dMB\nSQL: %s\n\n",
                    rs.getLong("query_duration_ms"),
                    rs.getLong("read_rows"),
                    rs.getLong("memory_usage") / 1024 / 1024,
                    query.substring(0, Math.min(100, query.length()))
                );
            }
        }
    }

    /**
     * 监控表大小
     */
    public static void monitorTableSize(Connection conn) throws Exception {
        String sql =
            "SELECT\n" +
            "    database,\n" +
            "    table,\n" +
            "    formatReadableSize(sum(bytes)) as size,\n" +
            "    sum(rows) as rows,\n" +
            "    sum(bytes) as bytes_size\n" +
            "FROM system.parts\n" +
            "WHERE active\n" +
            "GROUP BY database, table\n" +
            "ORDER BY bytes_size DESC\n" +
            "LIMIT 20";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            System.out.println("\n表大小统计:");
            System.out.println("─────────────────────────────────────────");
            while (rs.next()) {
                System.out.printf("%-20s.%-20s %15s %15d行\n",
                    rs.getString("database"),
                    rs.getString("table"),
                    rs.getString("size"),
                    rs.getLong("rows")
                );
            }
        }
    }

    /**
     * 监控副本同步状态
     */
    public static void monitorReplication(Connection conn)
            throws Exception {
        String sql =
            "SELECT\n" +
            "    database,\n" +
            "    table,\n" +
            "    is_leader,\n" +
            "    is_readonly,\n" +
            "    absolute_delay,\n" +
            "    queue_size,\n" +
            "    inserts_in_queue\n" +
            "FROM system.replicas";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {

            System.out.println("\n副本同步状态:");
            while (rs.next()) {
                System.out.printf(
                    "表: %s.%s, Leader: %s, 延迟: %ds, 队列: %d\n",
                    rs.getString("database"),
                    rs.getString("table"),
                    rs.getBoolean("is_leader") ? "是" : "否",
                    rs.getLong("absolute_delay"),
                    rs.getLong("queue_size")
                );
            }
        }
    }
}
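
除了查询日志和数据表,system.metrics 还提供服务端实时指标(如并发查询数、内存占用),可以作为下一节告警指标的数据来源。下面是一个简单的采集示意(Query、MemoryTracking 为 ClickHouse 内置指标名):

java 复制代码
/**
 * 读取实时系统指标
 */
public static void monitorSystemMetrics(java.sql.Connection conn)
        throws Exception {
    String sql =
        "SELECT metric, value\n" +
        "FROM system.metrics\n" +
        "WHERE metric IN ('Query', 'MemoryTracking')";

    try (var stmt = conn.createStatement();
         var rs = stmt.executeQuery(sql)) {
        System.out.println("\n实时指标:");
        while (rs.next()) {
            System.out.printf("%-20s %d\n",
                rs.getString("metric"), rs.getLong("value"));
        }
    }
}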

12.2 告警指标

复制代码
关键告警指标

1. 查询性能
   - 慢查询数量 > 阈值
   - 平均查询时间 > 阈值
   - 查询错误率 > 1%

2. 存储
   - 磁盘使用率 > 80%
   - 单表大小 > 阈值
   - 分区数量 > 1000

3. 副本
   - 副本延迟 > 60s
   - 副本队列 > 1000
   - 副本故障

4. 系统资源
   - CPU 使用率 > 80%
   - 内存使用率 > 85%
   - 网络带宽 > 80%

5. Merge 操作
   - Merge 队列 > 100
   - Merge 失败次数 > 0
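
这些指标大多可以直接从系统表计算,下面给出磁盘使用率和 Merge 并发数两个指标的采集示意(80、100 等阈值仅为示例):

java 复制代码
/**
 * 告警指标采集示意: 磁盘使用率与 Merge 并发
 */
public class AlertMetrics {

    public static void checkDiskUsage(java.sql.Connection conn) throws Exception {
        // ClickHouse 中 / 为浮点除法,不会整数截断
        String sql =
            "SELECT\n" +
            "    name,\n" +
            "    round((1 - free_space / total_space) * 100, 2) as used_pct\n" +
            "FROM system.disks";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                double usedPct = rs.getDouble("used_pct");
                System.out.printf("磁盘 %s 使用率: %.2f%%%s\n",
                    rs.getString("name"), usedPct,
                    usedPct > 80 ? "  [告警]" : "");
            }
        }
    }

    public static void checkRunningMerges(java.sql.Connection conn)
            throws Exception {
        // system.merges 记录正在执行的 Merge 任务
        String sql = "SELECT count() as running_merges FROM system.merges";

        try (var stmt = conn.createStatement();
             var rs = stmt.executeQuery(sql)) {
            if (rs.next() && rs.getLong("running_merges") > 100) {
                System.out.println("Merge 任务过多,需要关注!");
            }
        }
    }
}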

13. 总结

13.1 ClickHouse 核心优势

复制代码
ClickHouse 核心优势总结

1. 性能 ★★★★★
   - 典型 OLAP 聚合查询比传统行式数据库快 100~1000 倍
   - 支持实时写入和查询
   - 列式存储 + 向量化执行

2. 压缩 ★★★★★
   - 压缩比 10:1 甚至更高
   - 节省存储成本
   - 减少 I/O 操作

3. 扩展性 ★★★★★
   - 水平扩展
   - 线性性能增长
   - 支持 PB 级数据

4. 易用性 ★★★★☆
   - 标准 SQL 支持
   - 丰富的函数库
   - 多语言客户端

5. 可靠性 ★★★★☆
   - 数据副本
   - 自动故障转移
   - 副本间最终一致性

13.2 适用场景

最适合:

  • 用户行为分析
  • 实时报表和大屏
  • 日志分析和监控
  • 时序数据分析
  • 数据仓库 OLAP

不适合:

  • OLTP 事务处理
  • 频繁更新/删除
  • 需要强一致性的场景
  • 行级别锁定

13.3 最佳实践总结

  1. 表设计
    • 合理选择表引擎
    • 优化 ORDER BY 列
    • 使用分区管理数据
    • 设置 TTL 自动清理

  2. 数据写入
    • 批量写入
    • 异步写入(见本节末尾示例)
    • 使用 CSV 格式
    • 控制写入频率

  3. 查询优化
    • 使用主键过滤
    • 使用 PREWHERE
    • 避免 SELECT *
    • 使用物化视图

  4. 运维管理
    • 定期监控性能
    • 及时清理过期数据
    • 备份关键数据
    • 升级到稳定版本
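
关于第 2 点中的异步写入,较新版本的 ClickHouse(21.11+)支持服务端攒批的 async_insert,可以按会话开启,示意如下(具体参数以所用版本文档为准):

java 复制代码
// 开启会话级异步写入的示意
try (var stmt = conn.createStatement()) {
    // async_insert=1: 由服务端攒批后统一落盘,适合高频小批量写入
    // wait_for_async_insert=1: 等数据真正写入后再返回,更安全
    stmt.execute("SET async_insert = 1, wait_for_async_insert = 1");
}

注意 HTTP 协议下每个请求可能是独立会话,SET 不一定持续生效;也可以把这类设置直接附加在 JDBC URL 的参数里。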

13.4 学习资源

  • 官方文档: https://clickhouse.com/docs
  • GitHub 仓库: https://github.com/ClickHouse/ClickHouse
  • 官方博客: https://clickhouse.com/blog

13.5 未来发展

ClickHouse 正在持续演进:

  1. 更完善的 SQL 支持: 更丰富的窗口函数、递归查询
  2. 更好的实时性: 毫秒级延迟
  3. 云原生: Kubernetes 集成
  4. 机器学习: 内置 ML 功能
  5. 多模型: 支持图数据库、文档数据库

附录: 常用命令

bash 复制代码
# 启动 ClickHouse
clickhouse-server --config-file=/etc/clickhouse-server/config.xml

# 客户端连接
clickhouse-client --host localhost --port 9000

# 导入数据
clickhouse-client --query="INSERT INTO table FORMAT CSV" < data.csv

# 导出数据
clickhouse-client --query="SELECT * FROM table FORMAT CSV" > data.csv

# 以下 SQL 语句在 clickhouse-client 交互式会话中执行

# 查看表结构
DESCRIBE TABLE table_name;
SHOW CREATE TABLE table_name;

# 优化表(强制合并数据片段,代价较高,慎用 FINAL)
OPTIMIZE TABLE table_name FINAL;

# 查看分区
SELECT partition, name, rows FROM system.parts WHERE table = 'table_name';

# 删除分区
ALTER TABLE table_name DROP PARTITION 'partition_id';