FastAPI Series (11): ClickHouse Integration for Large-Scale Analytical Queries

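🎯 Audience: engineers who know MySQL and want to add an OLAP database to a FastAPI project to remove the bottleneck of statistical queries over large data volumes

⏱️ Reading time: about 35 minutes

💬 In one sentence: when MySQL's aggregate queries over hundreds of millions of orders become unacceptably slow, ClickHouse is a pragmatic OLAP solution. This article goes from principles to practice and walks through building a high-performance analytics API with FastAPI, the asynch driver, and ClickHouse.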

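1. Why ClickHouse Is Needed

1.1 MySQL's OLAP Bottleneck

After shop-api has been running for a while, the orders table has accumulated 100 million rows. Consider a seemingly simple statistical query: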

-- Daily revenue per category over the last 90 days
SELECT
    DATE(o.created_at) AS order_date,
    p.category,
    SUM(oi.quantity * oi.price) AS revenue
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at >= DATE_SUB(NOW(), INTERVAL 90 DAY)
  AND o.status = 'paid'
GROUP BY order_date, p.category
ORDER BY order_date DESC;

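On MySQL this takes 47 seconds. Users hit timeouts and the DBA is furious.

Root cause: MySQL is a **row-store** database designed for OLTP (online transaction processing):

Each row is stored contiguously, so reading one field means loading the whole row
Aggregate queries scan a huge number of rows, which drives I/O very high
Cross-table JOINs become extremely expensive at large data volumes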

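1.2 OLTP vs OLAP: Core Differences

Dimension	OLTP (MySQL)	OLAP (ClickHouse)
Storage layout	Row store	Column store
Main operations	CRUD (single row / small batches)	Aggregation and analysis (full-table / range scans)
Index design	B+ tree, suited to equality / range lookups	Sparse index + MergeTree, suited to range scans on the sorting key
Writes	Real-time single-row writes supported	Efficient batch writes; row-by-row writes are discouraged
Compression	Limited	High per-column compression (LZ4/ZSTD, 5-10x)
Concurrency	Many small concurrent queries (1000+ QPS)	Few large concurrent queries (10-100 QPS)
JOIN capability	Strong (foreign keys, transactions)	Weak (wide tables / materialized views recommended)
Consistency	ACID transactions	Eventual consistency (no OLTP-style transactions)
Typical latency	Milliseconds (single-row operations)	Seconds (scans over hundreds of millions of rows)
Typical use	Order creation, inventory deduction	Sales reports, user behavior analysis

The same 90-day revenue query on ClickHouse: 0.3 seconds.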

2. ClickHouse Core Concepts

2.1 How Columnar Storage Works

MySQL row store (reading the price field still loads every column):
+----+--------+--------+-------+------------+
| id | order# | user   | price | created_at |
+----+--------+--------+-------+------------+
| 1  | ORD001 | alice  | 99.9  | 2024-01-01 |
| 2  | ORD002 | bob    | 199.0 | 2024-01-01 |
| 3  | ORD003 | alice  | 49.9  | 2024-01-02 |
+----+--------+--------+-------+------------+
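Rows are stored contiguously → reading the price column loads all fields

ClickHouse column store (reads only the columns it needs):
id:    [1, 2, 3, ...]
price: [99.9, 199.0, 49.9, ...]    ← only this column is read; compresses well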
date:  [2024-01-01, 2024-01-01, 2024-01-02, ...]
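
To make the layout difference concrete, here is a toy Python sketch. It is not how either engine is implemented; the `rows` and `columns` structures and the helper names are illustrative assumptions. It counts how many field values each layout has to touch in order to sum `price` for the three sample rows above.

```python
# Toy model of the two layouts; the data mirrors the three sample rows above.
rows = [
    {"id": 1, "order_no": "ORD001", "user": "alice", "price": 99.9, "created_at": "2024-01-01"},
    {"id": 2, "order_no": "ORD002", "user": "bob", "price": 199.0, "created_at": "2024-01-01"},
    {"id": 3, "order_no": "ORD003", "user": "alice", "price": 49.9, "created_at": "2024-01-02"},
]


def to_columns(records: list[dict]) -> dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    columns: dict[str, list] = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


def fields_touched_row_store(records: list[dict]) -> int:
    """A row store loads every field of every row, even to sum one column."""
    return sum(len(record) for record in records)


def fields_touched_column_store(columns: dict[str, list], name: str) -> int:
    """A column store only reads the values of the requested column."""
    return len(columns[name])


columns = to_columns(rows)
total_price = round(sum(columns["price"]), 2)
```

With five columns and three rows the row layout touches 15 values while the columnar layout touches 3; the gap grows with both table width and row count.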

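2.2 The MergeTree Engine Family

MergeTree is ClickHouse's core storage engine: it stores data sorted by the primary key (sorting key) and merges data parts asynchronously in the background.

Engine	Characteristics	Use cases
`MergeTree`	Base engine, stores data ordered by the sorting key	General-purpose OLAP
`ReplacingMergeTree`	Asynchronous deduplication (keeps the latest version per sorting key)	CDC sync, idempotent writes
`AggregatingMergeTree`	Stores intermediate aggregation states	Pre-aggregation via materialized views
`SummingMergeTree`	Automatically sums numeric columns of rows with the same key	Counters, cumulative metrics
`CollapsingMergeTree`	Logical deletion via a sign column	Mutable data streams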

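3. Driver Selection and Connection Setup

3.1 Driver Comparison

Driver	Type	Protocol	Recommended for
`clickhouse-driver`	Sync	Native TCP	Scripts, synchronous services
`asynch`	Async	Native TCP	FastAPI async routes ✅
`clickhouse-connect`	Sync/async	HTTP	Cloud services, simple queries
`aiochclient`	Async	HTTP	Lightweight queries

Recommendation: asynch. It speaks the native TCP protocol, which generally performs well, and it fully supports async/await.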

pip install asynch

3.2 Connection Pool Wrapper

# app/clickhouse/client.py
import asyncio
from typing import Any, Optional
from asynch.connection import Connection
from asynch.pool import Pool
from app.config import settings

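# -------------------------------------------------------
# ClickHouse connection pool (plays a role similar to LettuceConnectionFactory in Spring Data Redis)
# NOTE: asynch's pool API has changed between releases (for example
# create_pool()/acquire() versus Pool.startup()/connection()); check the
# calls below against the version you have installed.
# -------------------------------------------------------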

_pool: Optional[Pool] = None


async def get_clickhouse_pool() -> Pool:
    """Return the global connection pool (created lazily)."""
    global _pool
    if _pool is None:
        _pool = Pool(
            host=settings.CLICKHOUSE_HOST,
            port=settings.CLICKHOUSE_PORT,        # default 9000 (native TCP)
            database=settings.CLICKHOUSE_DATABASE,
            user=settings.CLICKHOUSE_USER,
            password=settings.CLICKHOUSE_PASSWORD,
            minsize=2,
            maxsize=10,
        )
    return _pool


async def close_clickhouse_pool() -> None:
    """Close the connection pool (call from lifespan)."""
    global _pool
    if _pool:
        await _pool.close()
        _pool = None


# -------------------------------------------------------
# Lifespan integration
# -------------------------------------------------------
# app/main.py (add to the existing lifespan)
#
# from app.clickhouse.client import get_clickhouse_pool, close_clickhouse_pool
#
# @asynccontextmanager
# async def lifespan(app: FastAPI):
#     await get_clickhouse_pool()          # create the pool at startup
#     yield
#     await close_clickhouse_pool()        # release connections at shutdown


# -------------------------------------------------------
# Query helpers
# -------------------------------------------------------
async def ch_execute(
    query: str,
    params: Optional[dict[str, Any]] = None,  # named parameters, e.g. {"start": ...}
) -> list[dict]:
    """
    Run a query and return the rows as a list of dicts.
    Comparable to Spring's JdbcTemplate.queryForList().
    """
    pool = await get_clickhouse_pool()
    async with pool.acquire() as conn:
        async with conn.cursor() as cursor:
            await cursor.execute(query, params)
            columns = [desc[0] for desc in cursor.description]
            rows = await cursor.fetchall()
            return [dict(zip(columns, row)) for row in rows]


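async def ch_insert(table: str, data: list[tuple], columns: list[str]) -> None:
    """
    Insert rows in bulk.
    Note: batch writes to ClickHouse are far more efficient than row-by-row writes.
    `table` and `columns` are interpolated into the SQL text, so they must come
    from trusted code, never from user input.
    """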
    pool = await get_clickhouse_pool()
    async with pool.acquire() as conn:
        async with conn.cursor() as cursor:
            await cursor.execute(
                f"INSERT INTO {table} ({', '.join(columns)}) VALUES",
                data,
            )

3.3 Configuration

# app/config.py (add the ClickHouse settings)
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # ... other settings ...

    CLICKHOUSE_HOST: str = "localhost"
    CLICKHOUSE_PORT: int = 9000
    CLICKHOUSE_DATABASE: str = "shop_analytics"
    CLICKHOUSE_USER: str = "default"
    CLICKHOUSE_PASSWORD: str = ""

settings = Settings()

4. Table Creation and Data Model Design

4.1 Order Fact Table

-- Create the order detail table in ClickHouse
CREATE TABLE shop_analytics.order_events
(
    order_id     UInt64,
    user_id      UInt64,
    product_id   UInt64,
    category     LowCardinality(String),  -- LowCardinality optimizes low-cardinality columns
    quantity     UInt32,
    price        Decimal(10, 2),
    revenue      Decimal(10, 2),          -- quantity * price, computed ahead of time
    order_status LowCardinality(String),
    created_at   DateTime,
    created_date Date MATERIALIZED toDate(created_at)  -- materialized column, avoids recomputing the date
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)      -- monthly partitions, so time-range queries can skip whole partitions
ORDER BY (created_date, category, product_id)  -- sorting key, determines the physical order of the data
SETTINGS index_granularity = 8192;


-- User behavior event table (used for funnel analysis)
CREATE TABLE shop_analytics.user_events
(
    event_id   UUID DEFAULT generateUUIDv4(),
    user_id    UInt64,
    event_type LowCardinality(String),   -- view_product / add_cart / checkout / pay
    product_id Nullable(UInt64),
    session_id String,
    created_at DateTime,
    properties String                    -- extra attributes as JSON
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (user_id, created_at)
SETTINGS index_granularity = 8192;

4.2 Entity-Relationship Diagram (MySQL → ClickHouse Data Flow)

MySQL (OLTP)                    ClickHouse (OLAP)
+------------------+            +----------------------+
| orders           |            | order_events         |
| - id             |  ETL/CDC   | - order_id           |
| - user_id        |  ------→   | - user_id            |
| - status         |            | - product_id         |
| - created_at     |            | - category           |
+------------------+            | - revenue            |
| order_items      |            | - created_at         |
| - order_id       |            +----------------------+
| - product_id     |            | user_events          |
| - quantity       |            | - user_id            |
| - price          |            | - event_type         |
+------------------+            | - session_id         |
| products         |            | - created_at         |
| - id             |            +----------------------+
| - category       |
+------------------+

5. Typical Queries in Practice

5.1 Daily Revenue Aggregation

# app/clickhouse/queries/sales.py
from app.clickhouse.client import ch_execute
from datetime import date


async def get_daily_revenue(
    start_date: date,
    end_date: date,
    category: str | None = None,
) -> list[dict]:
    """
    Daily revenue, with an optional category filter.
    The equivalent MySQL query took about 47 s at the 100-million-row scale; ClickHouse takes about 0.3 s.
    """
    conditions = ["order_status = 'paid'", "created_date BETWEEN %(start)s AND %(end)s"]
    params: dict = {"start": start_date, "end": end_date}

    if category:
        conditions.append("category = %(category)s")
        params["category"] = category

    where_clause = " AND ".join(conditions)

    sql = f"""
    SELECT
        created_date                        AS order_date,
        category,
        count()                             AS order_count,
        sum(quantity)                       AS total_quantity,
        sum(revenue)                        AS total_revenue,
        avg(revenue)                        AS avg_order_value
    FROM shop_analytics.order_events
    WHERE {where_clause}
    GROUP BY order_date, category
    ORDER BY order_date DESC, total_revenue DESC
    """
    return await ch_execute(sql, params)


async def get_top_products(
    start_date: date,
    end_date: date,
    limit: int = 20,
) -> list[dict]:
    """Top-N products by revenue."""
    sql = """
    SELECT
        product_id,
        sum(quantity)  AS total_quantity,
        sum(revenue)   AS total_revenue,
        count()        AS order_count
    FROM shop_analytics.order_events
    WHERE order_status = 'paid'
      AND created_date BETWEEN %(start)s AND %(end)s
    GROUP BY product_id
    ORDER BY total_revenue DESC
    LIMIT %(limit)s
    """
    return await ch_execute(sql, {"start": start_date, "end": end_date, "limit": limit})

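5.2 Funnel Analysis (windowFunnel)

Funnel analysis is a core need in e-commerce analytics: users go from viewing a product → adding to cart → checkout → payment, and we want the conversion rate at each step. ClickHouse ships a built-in windowFunnel function designed for exactly this.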

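# app/clickhouse/queries/sales.py (continued)
async def get_conversion_funnel(
    start_date: date,
    end_date: date,
    window_seconds: int = 3600,  # the funnel must complete within 1 hour to count
) -> dict:
    """
    Purchase funnel analysis:
    view product → add to cart → start checkout → complete payment

    windowFunnel(window)(timestamp, event1, event2, ...) returns the deepest
    step each user reached, so every count below is the number of users whose
    deepest step is exactly that level.
    """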
    sql = f"""
    SELECT
        level,
        count() AS users
    FROM (
        SELECT
            user_id,
            windowFunnel({window_seconds})(
                created_at,
                event_type = 'view_product',   -- step 1
                event_type = 'add_cart',        -- step 2
                event_type = 'checkout',        -- step 3
                event_type = 'pay'              -- step 4
            ) AS level
        FROM shop_analytics.user_events
        WHERE created_at >= %(start)s
          AND created_at < %(end)s
        GROUP BY user_id
    )
    GROUP BY level
    ORDER BY level ASC
    """
    rows = await ch_execute(sql, {"start": start_date, "end": end_date})

    # Reshape into a funnel-style mapping
    level_map = {0: "Not started", 1: "Viewed product", 2: "Added to cart", 3: "Started checkout", 4: "Completed payment"}
    result = {level_map.get(r["level"], str(r["level"])): r["users"] for r in rows}
    return result


async def get_quantile_order_value(
    start_date: date,
    end_date: date,
) -> dict:
    """
    Order value quantile analysis (quantile function).
    Shows the spending distribution through P50/P90/P99.
    """
    sql = """
    SELECT
        quantile(0.5)(revenue)  AS p50,
        quantile(0.9)(revenue)  AS p90,
        quantile(0.99)(revenue) AS p99,
        avg(revenue)            AS avg_revenue,
        max(revenue)            AS max_revenue
    FROM shop_analytics.order_events
    WHERE order_status = 'paid'
      AND created_date BETWEEN %(start)s AND %(end)s
    """
    rows = await ch_execute(sql, {"start": start_date, "end": end_date})
    return rows[0] if rows else {}
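
Because `windowFunnel` reports each user's deepest step, the per-level counts are "stopped exactly here" numbers. A funnel chart usually wants "reached at least this step" instead. The helper below is a minimal sketch of that conversion in plain Python; `to_cumulative_funnel` and its output shape are hypothetical, not part of the project code.

```python
def to_cumulative_funnel(exact_counts: dict[int, int], steps: int = 4) -> list[dict]:
    """Convert "deepest level reached" counts into "reached at least step N" counts."""
    funnel = []
    reached = 0
    # Walk from the deepest step upwards, accumulating users as we go:
    # anyone who reached step 4 also reached steps 3, 2 and 1.
    for level in range(steps, 0, -1):
        reached += exact_counts.get(level, 0)
        funnel.append({"step": level, "users": reached})
    funnel.reverse()

    # Conversion rate of each step relative to the first step.
    top = funnel[0]["users"] if funnel else 0
    for entry in funnel:
        entry["rate"] = round(entry["users"] / top * 100, 2) if top else 0.0
    return funnel


# Level 0 (users who never triggered step 1) is ignored on purpose.
exact = {0: 500, 1: 7000, 2: 2000, 3: 600, 4: 400}
funnel = to_cumulative_funnel(exact)
```

For the sample counts this yields 10000, 3000, 1000, and 400 users for steps 1 to 4, which is the shape a conversion chart expects.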

5.3 User Retention Analysis (arrayJoin)

async def get_user_retention(cohort_date: date) -> list[dict]:
    """
    N-day retention by cohort, where a user's cohort is the day of their
    first 'pay' event on or after cohort_date.
    arrayJoin expands an array into multiple rows, a powerful
    ClickHouse-specific feature.
    """
    sql = """
    SELECT
        cohort_day,
        retention_day,
        uniqExact(user_id)                                  AS retained_users,
        round(retained_users / any(cohort_total) * 100, 2)  AS retention_rate
    FROM (
        SELECT
            user_id,
            cohort_day,
            cohort_total,
            -- one output row per distinct day on which the user was active
            arrayJoin(
                arrayMap(d -> dateDiff('day', cohort_day, d), active_days)
            )                                               AS retention_day
        FROM (
            SELECT
                user_id,
                cohort_day,
                active_days,
                -- each row is one user, so this counts the cohort size
                count() OVER (PARTITION BY cohort_day)      AS cohort_total
            FROM (
                SELECT
                    user_id,
                    min(toDate(created_at))            AS cohort_day,
                    groupUniqArray(toDate(created_at)) AS active_days
                FROM shop_analytics.user_events
                WHERE event_type = 'pay'
                  AND created_at >= %(cohort_date)s
                GROUP BY user_id
            )
        )
    )
    WHERE retention_day BETWEEN 0 AND 30
    GROUP BY cohort_day, retention_day
    ORDER BY cohort_day, retention_day
    """
    return await ch_execute(sql, {"cohort_date": cohort_date})

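6. Batch Writes in Practice

ClickHouse's write strategy is entirely different from MySQL's: do not INSERT row by row; accumulate rows and write them in batches.

6.1 Performance Comparison

Write method	100 rows	10,000 rows	1,000,000 rows
Row-by-row INSERT	~5s	~500s	Unacceptable
Batch INSERT (1,000 rows/batch)	~0.05s	~0.5s	~50s
Batch INSERT (10,000 rows/batch)	N/A	~0.1s	~10s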

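# app/clickhouse/writer.py
import asyncio
import logging
from decimal import Decimal
from app.clickhouse.client import ch_insert

logger = logging.getLogger(__name__)


# -------------------------------------------------------
# Batch write buffer (comparable to batch.size in a Kafka producer)
# -------------------------------------------------------
class ClickHouseBatchWriter:
    """
    Flushes once batch_size rows are buffered, or every flush_interval
    seconds after start() has been called.
    Suited to high-frequency event writes (user behavior tracking, etc.).
    """

    def __init__(
        self,
        table: str,
        columns: list[str],
        batch_size: int = 1000,
        flush_interval: float = 5.0,
    ):
        self.table = table
        self.columns = columns
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self._buffer: list[tuple] = []
        self._lock = asyncio.Lock()
        self._flush_task: asyncio.Task | None = None

    def start(self) -> None:
        """Start the periodic flush loop (call from lifespan startup)."""
        if self._flush_task is None:
            self._flush_task = asyncio.create_task(self._flush_loop())

    async def _flush_loop(self) -> None:
        while True:
            await asyncio.sleep(self.flush_interval)
            try:
                await self.flush()
            except Exception:
                # Rows were put back into the buffer; retry on the next tick
                logger.exception("periodic ClickHouse flush failed")

    async def write(self, row: tuple) -> None:
        async with self._lock:
            self._buffer.append(row)
            if len(self._buffer) >= self.batch_size:
                await self._flush()

    async def _flush(self) -> None:
        """Write the buffered rows to ClickHouse (caller must hold the lock)."""
        if not self._buffer:
            return
        batch = self._buffer[:]
        self._buffer.clear()
        try:
            await ch_insert(self.table, batch, self.columns)
        except Exception:
            # Put the rows back so a failed insert does not drop them
            self._buffer[:0] = batch
            raise

    async def flush(self) -> None:
        """Flush manually (call from lifespan shutdown)."""
        async with self._lock:
            await self._flush()

    async def stop(self) -> None:
        """Cancel the periodic loop and flush what is left."""
        async with self._lock:  # holding the lock means no flush is mid-insert
            if self._flush_task is not None:
                self._flush_task.cancel()
                self._flush_task = None
            await self._flush()


# -------------------------------------------------------
# Direct batch insert (suited to ETL)
# -------------------------------------------------------
async def insert_order_events(events: list[dict]) -> None:
    """
    Bulk-insert order events into ClickHouse.
    The caller is responsible for accumulating a batch (1000+ rows recommended) before calling.
    """
    if not events:
        return

    columns = [
        "order_id", "user_id", "product_id", "category",
        "quantity", "price", "revenue", "order_status", "created_at",
    ]

    rows = []
    for e in events:
        # Decimal(str(...)) keeps exact cents; float arithmetic could introduce rounding error
        price = Decimal(str(e["price"]))
        rows.append((
            e["order_id"],
            e["user_id"],
            e["product_id"],
            e["category"],
            e["quantity"],
            price,
            e["quantity"] * price,  # revenue computed ahead of time
            e["order_status"],
            e["created_at"],
        ))

    await ch_insert("shop_analytics.order_events", rows, columns)


# -------------------------------------------------------
# Batch sync example (called from a Celery task)
# -------------------------------------------------------
# Celery tasks are regular synchronous functions, so the coroutine is
# driven with asyncio.run().
#
# @celery_app.task
# def sync_orders_to_clickhouse(order_ids: list[int]):
#     """Pull orders from MySQL and sync them to ClickHouse."""
#     async def _sync() -> None:
#         try:
#             orders = await fetch_orders_from_mysql(order_ids)
#             await insert_order_events(orders)
#         finally:
#             # the pool is bound to this event loop, so close it before the loop ends
#             await close_clickhouse_pool()
#     asyncio.run(_sync())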

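7. Materialized Views (Pre-aggregation)

Materialized views are one of ClickHouse's standout features: aggregation runs automatically as data is inserted, and queries read the pre-aggregated result directly.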

-- 1. Create the aggregate storage table (AggregatingMergeTree)
CREATE TABLE shop_analytics.daily_sales_mv
(
    order_date   Date,
    category     LowCardinality(String),
    order_count  AggregateFunction(count),        -- aggregate-state type
    total_revenue AggregateFunction(sum, Decimal(10, 2)),
    avg_revenue  AggregateFunction(avg, Decimal(10, 2))
)
ENGINE = AggregatingMergeTree()
ORDER BY (order_date, category);


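-- 2. Create the materialized view (acts as an insert trigger: rows written to order_events after the view exists are aggregated automatically)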
CREATE MATERIALIZED VIEW shop_analytics.daily_sales_mv_trigger
TO shop_analytics.daily_sales_mv
AS
SELECT
    created_date                          AS order_date,
    category,
    countState()                          AS order_count,
    sumState(revenue)                     AS total_revenue,
    avgState(revenue)                     AS avg_revenue
FROM shop_analytics.order_events
WHERE order_status = 'paid'
GROUP BY order_date, category;


-- 3. Query the target table (the -Merge combinators finalize the aggregate states)
SELECT
    order_date,
    category,
    countMerge(order_count)       AS order_count,
    sumMerge(total_revenue)       AS total_revenue,
    avgMerge(avg_revenue)         AS avg_revenue
FROM shop_analytics.daily_sales_mv
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY order_date, category
ORDER BY order_date DESC;

-- Querying through the materialized view drops latency from about 0.3s to about 0.01s (even at the 10-billion-row scale)
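
One reason the target table stores `avgState` values rather than finished averages is that partial averages cannot be combined correctly once the row counts are gone. The toy model below illustrates the idea; the `AvgState` class is an illustrative assumption that represents a state as a (sum, count) pair, not ClickHouse's actual internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Toy stand-in for an avgState() value: a running sum and a row count."""
    total: float
    count: int

    @classmethod
    def from_values(cls, values: list[float]) -> "AvgState":
        return cls(total=sum(values), count=len(values))

    def merge(self, other: "AvgState") -> "AvgState":
        """Combine two partial states, as a background merge would."""
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self) -> float:
        """Roughly what avgMerge() returns for the combined state."""
        return self.total / self.count


# Two insert batches for the same (order_date, category) key.
batch_a = AvgState.from_values([10.0, 20.0, 30.0])   # 3 rows, average 20
batch_b = AvgState.from_values([100.0])              # 1 row, average 100

merged_avg = batch_a.merge(batch_b).finalize()               # (60 + 100) / 4
naive_avg = (batch_a.finalize() + batch_b.finalize()) / 2    # average of averages
```

Merging the states gives 40.0, the true average over all four rows, while averaging the two finished averages gives 60.0. This is why the query in step 3 must use `avgMerge` with `GROUP BY` instead of reading the column directly.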

8. MySQL → ClickHouse Data Synchronization

8.1 Three Synchronization Architectures

Option 1: application-level dual writes (simplest, near real time)
+----------+    INSERT     +---------+
| FastAPI  | ------------> | MySQL   |
|          |               +---------+
|          |    INSERT     +-------------+
|          | ------------> | ClickHouse  |
+----------+               +-------------+
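Pros: real time, simple
Cons: dual-write consistency problems, intrusive to application code


Option 2: decoupling through Kafka (recommended, highly reliable)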
+----------+  Binlog  +-------+  Consume  +-------------+
| MySQL    | -------> | Kafka | --------> | ClickHouse  |
+----------+  CDC     +-------+           | Kafka Engine|
                                          +-------------+
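Pros: decoupled, reliable, supports multiple consumers
Cons: adds Kafka and architectural complexity


Option 3: ClickHouse MySQL engine (read-only access, no copy)
+---------------------------------------------------+
| ClickHouse                                        |
| CREATE TABLE mysql_orders ENGINE = MySQL(...)     |
| Queries MySQL directly (no local copy, live read) |
+---------------------------------------------------+
    ↓ can be joined with local tables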
+----------+
| MySQL    |
+----------+
Pros: no sync lag, nothing to synchronize
Cons: every query goes over the network to MySQL; unsuitable for large aggregations

8.2 Using the ClickHouse MySQL Engine (Quick Integration)

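-- Create a table in ClickHouse that maps to the MySQL orders table
-- (read-only here because it connects as readonly_user; the column
-- types are assumptions based on the diagram in 4.2)
CREATE TABLE shop_analytics.mysql_orders
(
    id         UInt64,
    user_id    UInt64,
    status     String,
    created_at DateTime
)
ENGINE = MySQL(
    'mysql-host:3306',
    'shop_db',
    'orders',
    'readonly_user',
    'password'
);

-- MySQL data can then be JOINed inside ClickHouse (best kept to small, dimension-like lookups)
SELECT
    oe.created_date,
    mo.status,
    sum(oe.revenue) AS revenue
FROM shop_analytics.order_events AS oe
INNER JOIN shop_analytics.mysql_orders AS mo ON oe.order_id = mo.id
GROUP BY oe.created_date, mo.status;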

9. Running Project: Analytics API

# app/routers/analytics.py
from datetime import date, timedelta
from typing import Literal
from fastapi import APIRouter, Depends, HTTPException, Query
from app.auth.dependencies import get_current_user
from app.models.user import User
from app.clickhouse.queries.sales import (
    get_daily_revenue,
    get_top_products,
    get_conversion_funnel,
    get_quantile_order_value,
)
from app.schemas.analytics import (
    DailyRevenueResponse,
    TopProductsResponse,
    FunnelResponse,
    QuantileResponse,
)

router = APIRouter(prefix="/api/v1/analytics", tags=["analytics"])


@router.get("/sales", response_model=list[DailyRevenueResponse])
async def get_sales_stats(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    group_by: Literal["day", "week", "month"] = Query(default="day"),
    category: str | None = Query(default=None, description="Category filter; omit to query all categories"),
    current_user: User = Depends(get_current_user),
):
    """
    Revenue statistics API.
    GET /api/v1/analytics/sales?start=2024-01-01&end=2024-03-31&group_by=day&category=electronics

    Spring Boot equivalent: @GetMapping + @RequestParam + a service call.
    """
    # NOTE: group_by is validated but not yet applied; get_daily_revenue always groups by day.
    if (end - start).days > 365:
        raise HTTPException(status_code=400, detail="The query range must not exceed 365 days")

    return await get_daily_revenue(start, end, category)


@router.get("/top-products", response_model=list[TopProductsResponse])
async def get_top_products_api(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    limit: int = Query(default=20, ge=1, le=100),
    current_user: User = Depends(get_current_user),
):
    """Top-N products by revenue."""
    return await get_top_products(start, end, limit)


@router.get("/funnel", response_model=FunnelResponse)
async def get_funnel_api(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=7)),
    end: date = Query(default_factory=date.today),
    window_hours: int = Query(default=1, ge=1, le=72, description="Funnel time window (hours)"),
    current_user: User = Depends(get_current_user),
):
    """Purchase conversion funnel analysis."""
    funnel_data = await get_conversion_funnel(start, end, window_hours * 3600)
    return {"data": funnel_data, "window_hours": window_hours}


@router.get("/order-value-distribution", response_model=QuantileResponse)
async def get_order_value_distribution(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    current_user: User = Depends(get_current_user),
):
    """Order value quantile distribution."""
    return await get_quantile_order_value(start, end)
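
The `/sales` endpoint accepts `group_by`, but the query helper shown earlier always aggregates by day. One possible way to honor the parameter is to map each allowed value to a ClickHouse date-bucketing expression. This is a sketch under assumptions: `BUCKET_EXPRESSIONS`, `bucket_expression`, and `build_revenue_sql` are hypothetical names, and only the ClickHouse functions `toStartOfWeek` and `toStartOfMonth` are taken as given.

```python
# Whitelisted mapping from the API's group_by value to a ClickHouse expression.
# toStartOfWeek(..., 1) uses mode 1, i.e. weeks starting on Monday.
BUCKET_EXPRESSIONS = {
    "day": "created_date",
    "week": "toStartOfWeek(created_date, 1)",
    "month": "toStartOfMonth(created_date)",
}


def bucket_expression(group_by: str) -> str:
    """Return the SQL expression for a bucket, rejecting unknown values."""
    try:
        return BUCKET_EXPRESSIONS[group_by]
    except KeyError:
        raise ValueError(f"unsupported group_by: {group_by!r}") from None


def build_revenue_sql(group_by: str) -> str:
    """Build the aggregation query; user-supplied values stay in bound parameters."""
    bucket = bucket_expression(group_by)
    return f"""
    SELECT
        {bucket}      AS period,
        category,
        sum(revenue)  AS total_revenue
    FROM shop_analytics.order_events
    WHERE order_status = 'paid'
      AND created_date BETWEEN %(start)s AND %(end)s
    GROUP BY period, category
    ORDER BY period DESC
    """
```

Only the bucket expression is interpolated into the SQL text, and it can only come from the whitelist, so the dates and category still travel as bound parameters.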

# app/schemas/analytics.py
from pydantic import BaseModel
from datetime import date
from decimal import Decimal


class DailyRevenueResponse(BaseModel):
    order_date: date
    category: str
    order_count: int
    total_quantity: int
    total_revenue: Decimal
    avg_order_value: Decimal


class TopProductsResponse(BaseModel):
    product_id: int
    total_quantity: int
    total_revenue: Decimal
    order_count: int


class FunnelResponse(BaseModel):
    data: dict[str, int]   # {"Viewed product": 10000, "Added to cart": 3000, ...}
    window_hours: int


class QuantileResponse(BaseModel):
    p50: Decimal
    p90: Decimal
    p99: Decimal
    avg_revenue: Decimal
    max_revenue: Decimal

10. Common Pitfalls and Best Practices

Pitfall 1: DateTime time zone traps

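-- ❌ Without an explicit time zone, DateTime values are stored as Unix
--    timestamps and rendered in the server time zone (often UTC)
-- Written:  "2024-01-01 10:00:00" Beijing time (UTC+8)
-- Shown as: "2024-01-01 02:00:00" on a UTC server
-- Grouping by day then assigns events near midnight to the wrong date

-- ✅ Specify the time zone when creating the table
CREATE TABLE events (
    created_at DateTime('Asia/Shanghai')   -- explicit time zone
) ENGINE = MergeTree() ORDER BY created_at;

-- ✅ Convert the time zone at query time
SELECT
    toDate(created_at, 'Asia/Shanghai') AS local_date,
    count()
FROM events
GROUP BY local_date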

# On the Python side, pass timezone-aware datetimes as well
from datetime import datetime, timezone
import pytz

# ❌ naive datetime (no time zone information)
naive_dt = datetime(2024, 1, 1, 10, 0, 0)

# ✅ timezone-aware datetime
cst = pytz.timezone('Asia/Shanghai')
aware_dt = cst.localize(datetime(2024, 1, 1, 10, 0, 0))
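
The cross-day grouping error is easy to reproduce with the standard library alone. In this sketch a fixed +08:00 offset stands in for Asia/Shanghai (an assumption that holds because that zone has no daylight saving time), so no time zone database is needed.

```python
from datetime import datetime, timedelta, timezone

# Asia/Shanghai has no DST, so a fixed +08:00 offset is enough for this demo.
CST = timezone(timedelta(hours=8))

# An order placed at 02:30 Beijing time on 2024-01-01.
local_dt = datetime(2024, 1, 1, 2, 30, tzinfo=CST)
utc_dt = local_dt.astimezone(timezone.utc)

local_day = local_dt.date().isoformat()   # the day as the business sees it
utc_day = utc_dt.date().isoformat()       # the day if grouping happens in UTC
```

The same instant falls on 2024-01-01 locally but on 2023-12-31 in UTC, so any order placed before 08:00 Beijing time is counted on the previous day when grouping is done in UTC.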

Pitfall 2: NULL handling differs from MySQL

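-- ❌ Aggregates over an empty set of non-Nullable values do not return NULL
--    (default behavior, with aggregate_functions_null_for_empty = 0)
SELECT NULL + 1;                                -- NULL (same as MySQL)
SELECT sum(number) FROM numbers(10) WHERE 0;    -- 0 (MySQL's SUM over no rows returns NULL)
SELECT avg(number) FROM numbers(10) WHERE 0;    -- nan (MySQL returns NULL)

-- ✅ Nullable columns need explicit handling
SELECT ifNull(nullable_column, 0) AS safe_value FROM table;
SELECT isNull(nullable_column)    AS is_missing FROM table;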

Pitfall 3: Writing to distributed tables vs local tables

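-- ❌ Writing through a Distributed table adds a hop: the receiving node
--    buffers the rows, then forwards them to the shards (asynchronously by default)
INSERT INTO distributed_table VALUES (...);
-- This costs extra I/O on the receiving node, and retrying a timed-out insert
-- can leave duplicate rows on the shards

-- ✅ Write to the local table on the target shard (connect to that shard
--    directly; INSERT has no ON CLUSTER clause) and read through the Distributed table
INSERT INTO local_table VALUES (...);

-- ✅ Use a sharding key that spreads rows evenly
CREATE TABLE distributed_table AS local_table
ENGINE = Distributed('cluster_name', 'database', 'local_table', rand());
--                                                                ^^^^
--                                      sharding key; rand() spreads rows evenly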

Pitfall 4: ReplacingMergeTree deduplication is not immediate

-- ❌ Assuming ReplacingMergeTree deduplicates immediately
CREATE TABLE orders_replica (
    order_id UInt64,
    status   String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)  -- keeps the row with the latest updated_at
ORDER BY order_id;

-- Insert two rows with the same order_id, then query right away
INSERT INTO orders_replica VALUES (1, 'pending', now());
INSERT INTO orders_replica VALUES (1, 'paid', now() + 1);

SELECT * FROM orders_replica WHERE order_id = 1;
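-- ❌ May return 2 rows, because merges run asynchronously in the background

-- ✅ Use the FINAL modifier to force deduplication at query time (at a performance cost)
SELECT * FROM orders_replica FINAL WHERE order_id = 1;
-- Or pick the latest row manually with argMax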
SELECT order_id, argMax(status, updated_at) AS latest_status
FROM orders_replica
GROUP BY order_id;
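
For readers less familiar with `argMax`, the following plain-Python sketch emulates what `argMax(status, updated_at)` with `GROUP BY order_id` computes: for each key, the `status` belonging to the row with the greatest `updated_at`. The function name and sample rows are illustrative assumptions.

```python
from datetime import datetime


def latest_status_per_order(rows: list[tuple[int, str, datetime]]) -> dict[int, str]:
    """Emulate: SELECT order_id, argMax(status, updated_at) ... GROUP BY order_id."""
    latest: dict[int, tuple[datetime, str]] = {}
    for order_id, status, updated_at in rows:
        current = latest.get(order_id)
        # Keep the status that belongs to the greatest updated_at seen so far.
        if current is None or updated_at > current[0]:
            latest[order_id] = (updated_at, status)
    return {order_id: status for order_id, (_, status) in latest.items()}


# Two versions of order 1 that have not been merged yet, plus one other order.
rows = [
    (1, "pending", datetime(2024, 1, 1, 10, 0, 0)),
    (1, "paid", datetime(2024, 1, 1, 10, 0, 1)),
    (2, "pending", datetime(2024, 1, 1, 11, 0, 0)),
]
result = latest_status_per_order(rows)
```

Order 1 resolves to "paid" regardless of whether the background merge has already removed the older row, which is why this pattern is safe to use before merges complete.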

Pitfall 5: Row-by-row writes are a performance disaster

# ❌ Row-by-row inserts perform terribly (about 500 seconds for 10,000 rows)
for order in orders:
    await ch_execute(
        "INSERT INTO order_events VALUES (%(id)s, %(user_id)s, ...)",
        order
    )

# ✅ Batch inserts (accumulate first; 1,000-10,000 rows per batch recommended)
BATCH_SIZE = 1000
for i in range(0, len(orders), BATCH_SIZE):
    batch = orders[i:i + BATCH_SIZE]
    await insert_order_events(batch)
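
The slicing loop works when all orders are already in a list. When rows arrive from a stream, such as a server-side MySQL cursor, a generator-based chunker avoids loading everything into memory first. This is a minimal sketch; `chunked` is a hypothetical helper, not part of the project code.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of at most `size` items without materializing the whole input."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        # islice pulls up to `size` items and leaves the rest in the iterator.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


batches = list(chunked(range(5), 2))
```

In the sync job, each yielded batch could then be passed to `insert_order_events` in turn, keeping memory use bounded by the batch size.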

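11. Summary

Topic	ClickHouse approach	MySQL counterpart
Storage engine	MergeTree (column store)	InnoDB (row store)
Python driver	`asynch` (async, native TCP)	`aiomysql`
Analytical queries	Aggregate functions, including parametric ones such as `windowFunnel`	Standard SQL, slow at large data volumes
Pre-aggregation	Materialized views (AggregatingMergeTree)	Hand-maintained summary tables
Write strategy	Batch writes (1,000+ rows per batch)	Real-time single-row writes
Deduplication	ReplacingMergeTree + FINAL	UNIQUE KEY
Time zones	Specify `DateTime('Asia/Shanghai')` at table creation	Set `time_zone` on the connection
Data sync	Dual writes / Kafka CDC / MySQL engine	---
Funnel analysis	Built-in `windowFunnel` function	Complex CTEs or application-level computation

🎯 Takeaway: ClickHouse is not a replacement for MySQL but its partner: MySQL handles transactions, ClickHouse handles analytics. Good architecture lets each tool do what it does best.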

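Next Up

📝 Part 12: Docker Production Deployment, from Code to Launch

After 11 articles of feature work, we reach the most critical step: how do we package shop-api into Docker, set up the Nginx reverse proxy, Celery worker, and health checks, and bring up the whole stack with a single docker compose up? The next article walks through a multi-stage Dockerfile build, full service orchestration in docker-compose.yml, multi-environment configuration with pydantic-settings, and Gunicorn + UvicornWorker tuning for production. It is the series finale, so don't miss it.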