FastAPI Series (11): ClickHouse Integration and Big-Data Queries in Practice


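🎯 Audience: engineers who know MySQL and want to add an OLAP database to a FastAPI project to remove the bottleneck of statistical queries over large data volumes

⏱️ Reading time: about 35 minutes

💬 In one sentence: when aggregate queries over hundreds of millions of MySQL order rows become unacceptably slow, ClickHouse is a pragmatic OLAP answer. This article goes from principles to practice and shows how to build a high-performance analytics API with FastAPI, the asynch driver, and ClickHouse.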


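1. Why ClickHouse

1.1 MySQL's OLAP Bottleneck

After shop-api has been running for a while, the orders table has accumulated 100 million rows. Consider a seemingly simple reporting query:

sql
-- Daily revenue per category over the last 90 days
SELECT
    DATE(o.created_at) AS order_date,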
    p.category,
    SUM(oi.quantity * oi.price) AS revenue
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at >= DATE_SUB(NOW(), INTERVAL 90 DAY)
  AND o.status = 'paid'
GROUP BY order_date, p.category
ORDER BY order_date DESC;

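On MySQL this takes 47 seconds. The request times out for the user, and the DBA is not happy.

Root cause: MySQL is a row store, designed for OLTP (online transaction processing):

  • Each row is stored contiguously, so reading a single field means loading the whole row
  • Aggregate queries have to scan a huge number of rows, which drives I/O very high
  • Multi-table JOINs become extremely expensive at this data volume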

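1.2 OLTP vs OLAP: Core Differences

| Dimension | OLTP (MySQL) | OLAP (ClickHouse) |
|---|---|---|
| Storage layout | Row store | Column store |
| Main operations | CRUD on single rows or small batches | Aggregation over full-table or range scans |
| Index design | B+ tree, suited to point and range lookups | Sparse primary index on MergeTree, suited to range scans along the sorting key |
| Writes | Real-time single-row writes | Efficient in large batches; row-by-row writes are discouraged |
| Compression | Limited | Strong per-column compression (LZ4/ZSTD, often 5-10x) |
| Concurrency | Many small queries (1000+ QPS) | Fewer, heavier queries (roughly 10-100 QPS) |
| JOINs | Strong (foreign keys, transactions) | Weaker (wide tables or materialized views are preferred) |
| Consistency | ACID transactions | Eventual consistency (no general-purpose ACID transactions) |
| Typical latency | Milliseconds (single-row operations) | Sub-second to seconds (scans over hundreds of millions of rows) |
| Use cases | Order creation, stock deduction | Sales reports, user behavior analytics |

The same 90-day revenue query on ClickHouse: 0.3 seconds.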


2. ClickHouse Core Concepts

2.1 How Columnar Storage Works

MySQL row store (reading the price field means reading every column):
+----+--------+--------+-------+------------+
| id | order# | user   | price | created_at |
+----+--------+--------+-------+------------+
| 1  | ORD001 | alice  | 99.9  | 2024-01-01 |
| 2  | ORD002 | bob    | 199.0 | 2024-01-01 |
| 3  | ORD003 | alice  | 49.9  | 2024-01-02 |
+----+--------+--------+-------+------------+
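Rows are stored contiguously → reading the price column loads every field

ClickHouse column store (reads only the columns it needs):
id:    [1, 2, 3, ...]
price: [99.9, 199.0, 49.9, ...]    ← only this column is read; compresses well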
date:  [2024-01-01, 2024-01-01, 2024-01-02, ...]
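
The difference between the two layouts can be sketched in plain Python. This is only an illustration of the access pattern, not of how either database stores bytes on disk; the function names and the "fields touched" counter are made up for the example.

```python
from array import array

# Three orders, mirroring the table above.
rows = [
    (1, "ORD001", "alice", 99.9, "2024-01-01"),
    (2, "ORD002", "bob", 199.0, "2024-01-01"),
    (3, "ORD003", "alice", 49.9, "2024-01-02"),
]


def total_price_row_store(table):
    """Row layout: every tuple is loaded in full even though only price is needed."""
    fields_touched = 0
    total = 0.0
    for row in table:
        fields_touched += len(row)  # the whole row comes along
        total += row[3]
    return total, fields_touched


# Column layout: one contiguous array per column.
columns = {
    "id": array("Q", [r[0] for r in rows]),
    "price": array("d", [r[3] for r in rows]),
    "date": [r[4] for r in rows],
}


def total_price_column_store(cols):
    """Column layout: only the price array is read."""
    price = cols["price"]
    return sum(price), len(price)


row_total, row_touched = total_price_row_store(rows)
col_total, col_touched = total_price_column_store(columns)
print(row_touched, col_touched)  # 15 fields vs 3 values
```

Both functions return the same total, but the row version touches five fields per order while the column version touches one, which is the gap that grows with table width and row count.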

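2.2 The MergeTree Engine Family

MergeTree is ClickHouse's core storage engine. It stores data sorted by the sorting key (the primary key defaults to the sorting key) and merges data parts asynchronously in the background.

| Engine | Characteristics | Use case |
|---|---|---|
| MergeTree | Base engine, stores data ordered by the sorting key | General-purpose OLAP |
| ReplacingMergeTree | Asynchronous deduplication (keeps the latest version per sorting key) | CDC sync, idempotent writes |
| AggregatingMergeTree | Stores intermediate aggregate states | Pre-aggregation behind materialized views |
| SummingMergeTree | Sums numeric columns for rows with the same key during merges | Counters, cumulative metrics |
| CollapsingMergeTree | Logical deletes through a sign column | Mutable data streams |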

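3. Driver Selection and Connection Setup

3.1 Driver Comparison

| Driver | Type | Protocol | Recommended for |
|---|---|---|---|
| clickhouse-driver | Sync | Native TCP | Scripts, synchronous services |
| asynch | Async | Native TCP | FastAPI async routes ✅ |
| clickhouse-connect | Sync/async | HTTP | Cloud services, simple queries |
| aiochclient | Async | HTTP | Lightweight queries |

Recommendation: asynch. It speaks the native TCP protocol, which generally performs well, and it is built around async/await.

bash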
pip install asynch

3.2 Wrapping the Connection Pool

python
# app/clickhouse/client.py
from typing import Optional

from asynch.pool import Pool

from app.config import settings

# -------------------------------------------------------
# ClickHouse connection pool (comparable to LettuceConnectionFactory in Spring Data Redis)
# Pool method names have changed between asynch releases; verify acquire() and close()
# against the version you have installed.
# -------------------------------------------------------

_pool: Optional[Pool] = None


async def get_clickhouse_pool() -> Pool:
    """Return the global connection pool, creating it lazily on first use."""
    global _pool
    if _pool is None:
        _pool = Pool(
            host=settings.CLICKHOUSE_HOST,
            port=settings.CLICKHOUSE_PORT,        # 9000 by default (native TCP)
            database=settings.CLICKHOUSE_DATABASE,
            user=settings.CLICKHOUSE_USER,
            password=settings.CLICKHOUSE_PASSWORD,
            minsize=2,
            maxsize=10,
        )
    return _pool


async def close_clickhouse_pool() -> None:
    """Close the connection pool (call this from lifespan)."""
    global _pool
    if _pool:
        await _pool.close()
        _pool = None


# -------------------------------------------------------
# Lifespan integration
# -------------------------------------------------------
# app/main.py (add to the existing lifespan)
#
# from app.clickhouse.client import get_clickhouse_pool, close_clickhouse_pool
#
# @asynccontextmanager
# async def lifespan(app: FastAPI):
#     await get_clickhouse_pool()          # warm up the pool at startup
#     yield
#     await close_clickhouse_pool()        # release connections at shutdown


# -------------------------------------------------------
# Query helpers
# -------------------------------------------------------
async def ch_execute(
    query: str,
    params: dict | tuple | None = None,
) -> list[dict]:
    """
    Run a query and return the rows as a list of dicts.
    Comparable to Spring's JdbcTemplate.queryForList().
    """
    pool = await get_clickhouse_pool()
    async with pool.acquire() as conn:
        async with conn.cursor() as cursor:
            await cursor.execute(query, params)
            columns = [desc[0] for desc in cursor.description]
            rows = await cursor.fetchall()
            return [dict(zip(columns, row)) for row in rows]


async def ch_insert(table: str, data: list[tuple], columns: list[str]) -> None:
    """
    Insert rows in one batch.
    ClickHouse handles batched writes far more efficiently than row-by-row writes.
    table and columns are interpolated into the SQL text, so they must come from
    trusted code, never from user input.
    """
    pool = await get_clickhouse_pool()
    async with pool.acquire() as conn:
        async with conn.cursor() as cursor:
            await cursor.execute(
                f"INSERT INTO {table} ({', '.join(columns)}) VALUES",
                data,
            )
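
Because the insert helper builds its statement by string interpolation, one cautious option is to validate identifiers before they reach the SQL text. The sketch below is one way to do that; `quote_identifier` and `build_insert_sql` are hypothetical helper names, not part of asynch, and the rule (letters, digits, underscores, at most one dot in a table name) is an assumption you may need to loosen for your schema.

```python
import re

# Conservative rule: plain identifiers only (assumption for this sketch).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_identifier(name: str, allow_dot: bool = False) -> str:
    """Validate an identifier and wrap each part in backticks."""
    parts = name.split(".") if allow_dot else [name]
    if len(parts) > 2:
        raise ValueError(f"too many dots in identifier: {name!r}")
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {name!r}")
    return ".".join(f"`{part}`" for part in parts)


def build_insert_sql(table: str, columns: list[str]) -> str:
    """Build the 'INSERT INTO ... VALUES' prefix from validated names."""
    if not columns:
        raise ValueError("at least one column is required")
    column_list = ", ".join(quote_identifier(c) for c in columns)
    return f"INSERT INTO {quote_identifier(table, allow_dot=True)} ({column_list}) VALUES"


print(build_insert_sql("shop_analytics.order_events", ["order_id", "revenue"]))
```

A name such as `order_events; DROP TABLE x` fails validation and raises `ValueError` before any SQL is sent.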

3.3 Configuration

python
# app/config.py (ClickHouse settings appended)
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # ... other settings ...

    CLICKHOUSE_HOST: str = "localhost"
    CLICKHOUSE_PORT: int = 9000
    CLICKHOUSE_DATABASE: str = "shop_analytics"
    CLICKHOUSE_USER: str = "default"
    CLICKHOUSE_PASSWORD: str = ""

settings = Settings()

4. Table Creation and Data Model Design

4.1 Order Fact Table

sql
-- Order line-item table in ClickHouse
CREATE TABLE shop_analytics.order_events
(
    order_id     UInt64,
    user_id      UInt64,
    product_id   UInt64,
    category     LowCardinality(String),  -- LowCardinality optimizes low-cardinality columns
    quantity     UInt32,
    price        Decimal(10, 2),
    revenue      Decimal(10, 2),          -- quantity * price, precomputed
    order_status LowCardinality(String),
    created_at   DateTime,
    created_date Date MATERIALIZED toDate(created_at)  -- materialized column, avoids recomputing the date
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)      -- monthly partitions make time-range queries cheaper
ORDER BY (created_date, category, product_id)  -- sorting key, determines physical order on disk
SETTINGS index_granularity = 8192;


-- User behavior event table (for funnel analysis)
CREATE TABLE shop_analytics.user_events
(
    event_id   UUID DEFAULT generateUUIDv4(),
    user_id    UInt64,
    event_type LowCardinality(String),   -- view_product / add_cart / checkout / pay
    product_id Nullable(UInt64),
    session_id String,
    created_at DateTime,
    properties String                    -- extra attributes as JSON
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (user_id, created_at)
SETTINGS index_granularity = 8192;
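
To see why the sorting key and `index_granularity` matter, here is a rough simulation of a sparse index. It is a simplification under stated assumptions: integer keys stand in for the real sorting key, the granularity is 10 instead of 8192, and `build_marks` and `granules_for_range` are names invented for this sketch, not ClickHouse internals.

```python
from bisect import bisect_left, bisect_right


def build_marks(sorted_keys: list[int], granularity: int) -> list[int]:
    """Keep only the first key of every granule (one mark per `granularity` rows)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]


def granules_for_range(marks: list[int], lo: int, hi: int) -> range:
    """Indexes of the granules that may contain keys in [lo, hi]."""
    # The granule that starts just before `lo` may still hold it.
    first = max(bisect_left(marks, lo) - 1, 0)
    # Every granule whose first key is <= hi may hold matching rows.
    last = bisect_right(marks, hi)  # exclusive upper bound
    return range(first, last)


keys = list(range(100))        # 100 rows, already sorted by the key
marks = build_marks(keys, 10)  # 10 marks instead of 100 index entries
hit = granules_for_range(marks, 25, 37)
rows_scanned = len(hit) * 10
print(list(hit), rows_scanned)
```

Only the granules that can overlap the requested range are read; everything else is skipped without being touched. A filter on a column that is not a prefix of the sorting key gets no such help, which is why the `ORDER BY` clause deserves care.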

4.2 Data Flow Diagram (MySQL → ClickHouse)

MySQL (OLTP)                    ClickHouse (OLAP)
+------------------+            +----------------------+
| orders           |            | order_events         |
| - id             |  ETL/CDC   | - order_id           |
| - user_id        |  ------→   | - user_id            |
| - status         |            | - product_id         |
| - created_at     |            | - category           |
+------------------+            | - revenue            |
| order_items      |            | - created_at         |
| - order_id       |            +----------------------+
| - product_id     |            | user_events          |
| - quantity       |            | - user_id            |
| - price          |            | - event_type         |
+------------------+            | - session_id         |
| products         |            | - created_at         |
| - id             |            +----------------------+
| - category       |
+------------------+

5. Typical Queries in Practice

5.1 Daily Revenue Aggregation

python
# app/clickhouse/queries/sales.py
from app.clickhouse.client import ch_execute
from datetime import date


async def get_daily_revenue(
    start_date: date,
    end_date: date,
    category: str | None = None,
) -> list[dict]:
    """
    Daily revenue, with an optional category filter.
    The equivalent query takes about 47 s on MySQL (100 million rows) and about 0.3 s on ClickHouse.
    """
    conditions = ["order_status = 'paid'", "created_date BETWEEN %(start)s AND %(end)s"]
    params: dict = {"start": start_date, "end": end_date}

    if category:
        conditions.append("category = %(category)s")
        params["category"] = category

    where_clause = " AND ".join(conditions)

    sql = f"""
    SELECT
        created_date                        AS order_date,
        category,
        count()                             AS order_count,
        sum(quantity)                       AS total_quantity,
        sum(revenue)                        AS total_revenue,
        avg(revenue)                        AS avg_order_value
    FROM shop_analytics.order_events
    WHERE {where_clause}
    GROUP BY order_date, category
    ORDER BY order_date DESC, total_revenue DESC
    """
    return await ch_execute(sql, params)


async def get_top_products(
    start_date: date,
    end_date: date,
    limit: int = 20,
) -> list[dict]:
    """Top N products by revenue."""
    sql = """
    SELECT
        product_id,
        sum(quantity)  AS total_quantity,
        sum(revenue)   AS total_revenue,
        count()        AS order_count
    FROM shop_analytics.order_events
    WHERE order_status = 'paid'
      AND created_date BETWEEN %(start)s AND %(end)s
    GROUP BY product_id
    ORDER BY total_revenue DESC
    LIMIT %(limit)s
    """
    return await ch_execute(sql, {"start": start_date, "end": end_date, "limit": limit})

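5.2 Funnel Analysis (windowFunnel)

Funnel analysis is a core need in e-commerce analytics: users go from viewing a product → adding to cart → checkout → payment, and you want the conversion rate at each step. ClickHouse ships a built-in windowFunnel function for exactly this.

python
async def get_conversion_funnel(
    start_date: date,
    end_date: date,
    window_seconds: int = 3600,  # the funnel must complete within 1 hour to count
) -> dict:
    """
    Purchase funnel:
    view product → add to cart → start checkout → complete payment

    windowFunnel(window)(timestamp, event1, event2, ...) returns the deepest step each user reached.
    """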
    sql = f"""
    SELECT
        level,
        count() AS users
    FROM (
        SELECT
            user_id,
            windowFunnel({window_seconds})(
                created_at,
                event_type = 'view_product',   -- step 1
                event_type = 'add_cart',        -- step 2
                event_type = 'checkout',        -- step 3
                event_type = 'pay'              -- step 4
            ) AS level
        FROM shop_analytics.user_events
        WHERE created_at >= %(start)s
          AND created_at < %(end)s
        GROUP BY user_id
    )
    GROUP BY level
    ORDER BY level ASC
    """
    rows = await ch_execute(sql, {"start": start_date, "end": end_date})

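    # Shape the rows into a funnel dict.
    # Each count is the number of users whose deepest step is that level,
    # not the number of users who reached it.
    level_map = {0: "not started", 1: "viewed product", 2: "added to cart", 3: "started checkout", 4: "completed payment"}
    result = {level_map.get(r["level"], str(r["level"])): r["users"] for r in rows}
    return result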


async def get_quantile_order_value(
    start_date: date,
    end_date: date,
) -> dict:
    """
    Order value quantiles (quantile function).
    Shows how spending is distributed through the P50/P90/P99 metrics.
    """
    sql = """
    SELECT
        quantile(0.5)(revenue)  AS p50,
        quantile(0.9)(revenue)  AS p90,
        quantile(0.99)(revenue) AS p99,
        avg(revenue)            AS avg_revenue,
        max(revenue)            AS max_revenue
    FROM shop_analytics.order_events
    WHERE order_status = 'paid'
      AND created_date BETWEEN %(start)s AND %(end)s
    """
    rows = await ch_execute(sql, {"start": start_date, "end": end_date})
    return rows[0] if rows else {}
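
Since windowFunnel reports the deepest step each user reached, the per-level counts have to be accumulated from the bottom of the funnel upward before conversion rates make sense. A minimal sketch of that post-processing follows; `funnel_from_levels` and the sample numbers are illustrative assumptions, not output from a real query.

```python
def funnel_from_levels(level_counts: dict[int, int], steps: list[str]) -> list[dict]:
    """Turn 'deepest level reached' counts into 'reached at least this step' counts."""
    reached = []
    running = 0
    # Walk from the deepest step to the first one, accumulating users.
    for level in range(len(steps), 0, -1):
        running += level_counts.get(level, 0)
        reached.append(running)
    reached.reverse()

    result = []
    for i, (name, users) in enumerate(zip(steps, reached)):
        previous = reached[i - 1] if i > 0 else users
        rate = round(users / previous * 100, 2) if previous else 0.0
        result.append({"step": name, "users": users, "conversion_pct": rate})
    return result


# Hypothetical windowFunnel output: level -> number of users who stopped there.
sample = {0: 500, 1: 7000, 2: 2000, 3: 600, 4: 400}
funnel = funnel_from_levels(
    sample, ["view_product", "add_cart", "checkout", "pay"]
)
for row in funnel:
    print(row)
```

Level 0 (users who never entered the funnel) is deliberately left out of the totals. With the sample above, 10,000 users viewed a product and 400 of them paid.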

5.3 User Retention Analysis (arrayJoin)

python
async def get_user_retention(cohort_date: date) -> list[dict]:
    """
    N-day retention per cohort, where a user's cohort is the date of their first payment.
    arrayJoin expands an array into multiple rows, a powerful ClickHouse-specific feature.
    """
    sql = """
    SELECT
        cohort_day,
        retention_day,
        retained_users,
        -- every user is active on day 0, so the per-cohort maximum is the cohort size
        round(retained_users / max(retained_users) OVER (PARTITION BY cohort_day) * 100, 2)
            AS retention_rate
    FROM (
        SELECT
            cohort_day,
            retention_day,
            uniqExact(user_id) AS retained_users
        FROM (
            SELECT
                user_id,
                cohort_day,
                -- one row per (user, day offset on which the user paid)
                arrayJoin(arrayMap(d -> dateDiff('day', cohort_day, d), active_days))
                    AS retention_day
            FROM (
                SELECT
                    user_id,
                    toDate(min(created_at))            AS cohort_day,
                    groupUniqArray(toDate(created_at)) AS active_days
                FROM shop_analytics.user_events
                WHERE event_type = 'pay'
                  AND created_at >= %(cohort_date)s
                GROUP BY user_id
            )
        )
        WHERE retention_day BETWEEN 0 AND 30
        GROUP BY cohort_day, retention_day
    )
    ORDER BY cohort_day, retention_day
    """
    return await ch_execute(sql, {"cohort_date": cohort_date})

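6. Batch Writes in Practice

ClickHouse's write strategy is very different from MySQL's: do not INSERT row by row; accumulate rows and write them in batches.

6.1 Performance Comparison

| Write method | 100 rows | 10,000 rows | 1,000,000 rows |
|---|---|---|---|
| Row-by-row INSERT | ~5s | ~500s | Unacceptable |
| Batch INSERT (1,000 rows/batch) | ~0.05s | ~0.5s | ~50s |
| Batch INSERT (10,000 rows/batch) | N/A | ~0.1s | ~10s |

python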
# app/clickhouse/writer.py
import asyncio
from decimal import Decimal

from app.clickhouse.client import ch_insert


# -------------------------------------------------------
# Batch write buffer (comparable to batch.size in a Kafka producer)
# -------------------------------------------------------
class ClickHouseBatchWriter:
    """
    Buffers rows and writes them once batch_size rows have accumulated.
    Suited to high-frequency event writes such as user behavior tracking.

    Note: flush_interval is stored, but this class does not start a timer itself;
    call flush() periodically and on shutdown so a partly filled buffer is not left unwritten.
    """

    def __init__(
        self,
        table: str,
        columns: list[str],
        batch_size: int = 1000,
        flush_interval: float = 5.0,
    ):
        self.table = table
        self.columns = columns
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self._buffer: list[tuple] = []
        self._lock = asyncio.Lock()

    async def write(self, row: tuple) -> None:
        async with self._lock:
            self._buffer.append(row)
            if len(self._buffer) >= self.batch_size:
                await self._flush()

    async def _flush(self) -> None:
        """Write the buffered rows to ClickHouse. The caller must hold self._lock."""
        if not self._buffer:
            return
        batch = self._buffer[:]
        self._buffer.clear()
        try:
            await ch_insert(self.table, batch, self.columns)
        except Exception:
            # Put the rows back so a failed insert does not silently drop them
            self._buffer[:0] = batch
            raise

    async def flush(self) -> None:
        """Trigger a write manually (call this during lifespan shutdown)."""
        async with self._lock:
            await self._flush()


# -------------------------------------------------------
# Direct batch insert (suited to ETL jobs)
# -------------------------------------------------------
async def insert_order_events(events: list[dict]) -> None:
    """
    Write a batch of order events to ClickHouse.
    The caller is responsible for accumulating a batch (1,000+ rows recommended) before calling.
    """
    if not events:
        return

    columns = [
        "order_id", "user_id", "product_id", "category",
        "quantity", "price", "revenue", "order_status", "created_at",
    ]

    rows = []
    for e in events:
        # Decimal keeps exact cents for the Decimal(10, 2) columns; float would not
        price = Decimal(str(e["price"]))
        rows.append(
            (
                e["order_id"],
                e["user_id"],
                e["product_id"],
                e["category"],
                e["quantity"],
                price,
                e["quantity"] * price,  # revenue, precomputed
                e["order_status"],
                e["created_at"],
            )
        )

    await ch_insert("shop_analytics.order_events", rows, columns)


# -------------------------------------------------------
# Batch write example (called from a Celery task)
# Celery tasks are synchronous by default, so the coroutine is run with asyncio.run().
# The pool is closed afterwards because it is bound to the event loop that created it.
# -------------------------------------------------------
# @celery_app.task
# def sync_orders_to_clickhouse(order_ids: list[int]):
#     """Pull orders from MySQL and sync them to ClickHouse"""
#     async def _run():
#         try:
#             orders = await fetch_orders_from_mysql(order_ids)
#             await insert_order_events(orders)
#         finally:
#             await close_clickhouse_pool()
#     asyncio.run(_run())
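7. Materialized Views (Pre-aggregation)

Materialized views are one of ClickHouse's standout features: the aggregation runs automatically as data is inserted, and queries read the pre-aggregated result directly.

sql
-- 1. Create the aggregate storage table (AggregatingMergeTree)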
CREATE TABLE shop_analytics.daily_sales_mv
(
    order_date   Date,
    category     LowCardinality(String),
    order_count  AggregateFunction(count),        -- aggregate state type
    total_revenue AggregateFunction(sum, Decimal(10, 2)),
    avg_revenue  AggregateFunction(avg, Decimal(10, 2))
)
ENGINE = AggregatingMergeTree()
ORDER BY (order_date, category);


-- 2. Create the materialized view (acts as an insert trigger: rows written to order_events update the target table)
CREATE MATERIALIZED VIEW shop_analytics.daily_sales_mv_trigger
TO shop_analytics.daily_sales_mv
AS
SELECT
    created_date                          AS order_date,
    category,
    countState()                          AS order_count,
    sumState(revenue)                     AS total_revenue,
    avgState(revenue)                     AS avg_revenue
FROM shop_analytics.order_events
WHERE order_status = 'paid'
GROUP BY order_date, category;


-- 3. Query the pre-aggregated table (the -Merge combinators finalize the stored aggregate states)
SELECT
    order_date,
    category,
    countMerge(order_count)       AS order_count,
    sumMerge(total_revenue)       AS total_revenue,
    avgMerge(avg_revenue)         AS avg_revenue
FROM shop_analytics.daily_sales_mv
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY order_date, category
ORDER BY order_date DESC;

-- Queried through the materialized view, latency drops from 0.3s to 0.01s (reportedly even at the 10-billion-row scale)
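
The reason the target table stores aggregate states rather than finished numbers is easiest to see with avg. The sketch below mimics the State/Merge idea in plain Python under a simplifying assumption: an avg state is modeled as a (sum, count) pair. ClickHouse's real state encoding is internal, and `AvgState`, `avg_state`, and `avg_merge` are names invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Partial result for avg: enough information to combine with other partials."""
    total: float
    count: int


def avg_state(values: list[float]) -> AvgState:
    """Rough analogue of avgState(): summarize one inserted block."""
    return AvgState(sum(values), len(values))


def avg_merge(states: list[AvgState]) -> float:
    """Rough analogue of avgMerge(): combine partial states into a final value."""
    total = sum(s.total for s in states)
    count = sum(s.count for s in states)
    return total / count if count else float("nan")


# Two separate inserts into the source table.
block_a = avg_state([10.0, 20.0])
block_b = avg_state([60.0])

correct = avg_merge([block_a, block_b])
naive = (10.0 + 20.0) / 2 / 2 + 60.0 / 2  # average of the two block averages
print(correct, naive)
```

Merging states gives 30.0, the true average of the three values, while averaging the per-block averages gives 37.5. This is why the query in step 3 has to use the -Merge functions with GROUP BY instead of reading the columns directly.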

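8. MySQL → ClickHouse Data Sync Options

8.1 Three Sync Architectures

Option 1: application dual writes (simplest, near real time)
+----------+    INSERT     +---------+
| FastAPI  | ------------> | MySQL   |
|          |               +---------+
|          |    INSERT     +-------------+
|          | ------------> | ClickHouse  |
+----------+               +-------------+
Pros: real time, simple
Cons: the two writes can diverge; intrusive to application code


Option 2: decoupling through Kafka (recommended, highly reliable)
+----------+  Binlog  +-------+  Consume  +-------------+
| MySQL    | -------> | Kafka | --------> | ClickHouse  |
+----------+  CDC     +-------+           | Kafka Engine|
                                          +-------------+
Pros: decoupled, reliable, supports multiple consumers
Cons: adds Kafka and architectural complexity


Option 3: ClickHouse MySQL engine (used read-only)
+---------------------------------------------------+
| ClickHouse                                        |
| CREATE TABLE mysql_orders ENGINE = MySQL(...)     |
| Queries MySQL directly (nothing stored locally)   |
+---------------------------------------------------+
    ↓ queried together with local tables
+----------+
| MySQL    |
+----------+
Pros: no sync lag, no pipeline to maintain
Cons: every query crosses the network; unsuitable for large aggregations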
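8.2 Using the ClickHouse MySQL Engine (Quick Integration)

sql
-- Map a MySQL table into ClickHouse (used read-only here).
-- Column types are assumed to match the MySQL orders schema.
CREATE TABLE shop_analytics.mysql_orders
(
    id         UInt64,
    user_id    UInt64,
    status     String,
    created_at DateTime
)
ENGINE = MySQL(
    'mysql-host:3306',
    'shop_db',
    'orders',
    'readonly_user',
    'password'
);

-- MySQL data can then be joined inside ClickHouse. Keep the MySQL side small,
-- because it is fetched over the network at query time.
SELECT
    oe.created_date,
    oe.category,
    sum(oe.revenue) AS revenue
FROM shop_analytics.order_events AS oe
INNER JOIN (
    SELECT id
    FROM shop_analytics.mysql_orders
    WHERE created_at >= now() - INTERVAL 1 DAY
) AS mo ON oe.order_id = mo.id
GROUP BY oe.created_date, oe.category;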

9. Series Project: the Analytics API

python
# app/routers/analytics.py
from datetime import date, timedelta
from typing import Literal
from fastapi import APIRouter, Depends, HTTPException, Query
from app.auth.dependencies import get_current_user
from app.models.user import User
from app.clickhouse.queries.sales import (
    get_daily_revenue,
    get_top_products,
    get_conversion_funnel,
    get_quantile_order_value,
)
from app.schemas.analytics import (
    DailyRevenueResponse,
    TopProductsResponse,
    FunnelResponse,
    QuantileResponse,
)

router = APIRouter(prefix="/api/v1/analytics", tags=["analytics"])


@router.get("/sales", response_model=list[DailyRevenueResponse])
async def get_sales_stats(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    group_by: Literal["day", "week", "month"] = Query(default="day"),
    category: str | None = Query(default=None, description="Category filter; omit to query all categories"),
    current_user: User = Depends(get_current_user),
):
    """
    Sales statistics API.
    GET /api/v1/analytics/sales?start=2024-01-01&end=2024-03-31&group_by=day&category=electronics

    Spring Boot equivalent: @GetMapping + @RequestParam + a service call
    """
    if start > end:
        raise HTTPException(status_code=400, detail="start must not be later than end")
    if (end - start).days > 365:
        raise HTTPException(status_code=400, detail="The query range must not exceed 365 days")

    # group_by is validated but not applied yet: results are always bucketed by day
    return await get_daily_revenue(start, end, category)


@router.get("/top-products", response_model=list[TopProductsResponse])
async def get_top_products_api(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    limit: int = Query(default=20, ge=1, le=100),
    current_user: User = Depends(get_current_user),
):
    """Top N products by sales"""
    return await get_top_products(start, end, limit)


@router.get("/funnel", response_model=FunnelResponse)
async def get_funnel_api(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=7)),
    end: date = Query(default_factory=date.today),
    window_hours: int = Query(default=1, ge=1, le=72, description="Funnel time window in hours"),
    current_user: User = Depends(get_current_user),
):
    """Purchase conversion funnel analysis"""
    funnel_data = await get_conversion_funnel(start, end, window_hours * 3600)
    return {"data": funnel_data, "window_hours": window_hours}


@router.get("/order-value-distribution", response_model=QuantileResponse)
async def get_order_value_distribution(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    current_user: User = Depends(get_current_user),
):
    """Order value quantile distribution"""
    return await get_quantile_order_value(start, end)
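
The sales endpoint accepts a group_by parameter, so one way to honor it is to map each allowed value to a fixed date-bucketing expression and reject anything else. This is a sketch under assumptions: `bucket_expression` and `build_revenue_sql` are hypothetical helpers, the query is a trimmed version of the daily revenue query, and week buckets start on Monday (mode 1 of toStartOfWeek).

```python
# Whitelist: user input selects an expression, it is never pasted into SQL itself.
_BUCKET_EXPR: dict[str, str] = {
    "day": "created_date",
    "week": "toStartOfWeek(created_date, 1)",  # mode 1: weeks start on Monday
    "month": "toStartOfMonth(created_date)",
}


def bucket_expression(group_by: str) -> str:
    """Return the ClickHouse expression for a supported grouping."""
    try:
        return _BUCKET_EXPR[group_by]
    except KeyError:
        raise ValueError(f"unsupported group_by: {group_by!r}") from None


def build_revenue_sql(group_by: str) -> str:
    """Build the revenue query for the requested time bucket."""
    bucket = bucket_expression(group_by)
    return (
        f"SELECT {bucket} AS period, category, sum(revenue) AS total_revenue "
        "FROM shop_analytics.order_events "
        "WHERE order_status = 'paid' "
        "AND created_date BETWEEN %(start)s AND %(end)s "
        "GROUP BY period, category "
        "ORDER BY period DESC"
    )


print(build_revenue_sql("month"))
```

The date range still travels as query parameters; only the bucket expression, chosen from a fixed table, is interpolated.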

python
# app/schemas/analytics.py
from pydantic import BaseModel
from datetime import date
from decimal import Decimal


class DailyRevenueResponse(BaseModel):
    order_date: date
    category: str
    order_count: int
    total_quantity: int
    total_revenue: Decimal
    avg_order_value: Decimal


class TopProductsResponse(BaseModel):
    product_id: int
    total_quantity: int
    total_revenue: Decimal
    order_count: int


class FunnelResponse(BaseModel):
    data: dict[str, int]   # {"viewed product": 10000, "added to cart": 3000, ...}
    window_hours: int


class QuantileResponse(BaseModel):
    p50: Decimal
    p90: Decimal
    p99: Decimal
    avg_revenue: Decimal
    max_revenue: Decimal

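10. Common Pitfalls and Best Practices

Pitfall 1: DateTime/DateTime64 time zone traps

sql
-- ❌ DateTime stores a Unix timestamp. Without an explicit column time zone, strings are
--    parsed and displayed in the server time zone, which is often UTC.
-- An event at "2024-01-01 10:00:00" Beijing time is the instant "2024-01-01 02:00:00" UTC,
-- so grouping by day in the wrong time zone puts events near midnight on the wrong date.

-- ✅ Declare the time zone when creating the table
CREATE TABLE events (
    created_at DateTime('Asia/Shanghai')   -- explicit time zone
) ENGINE = MergeTree() ORDER BY created_at;

-- ✅ Or convert the time zone at query time
SELECT
    toDate(created_at, 'Asia/Shanghai') AS local_date,
    count()
FROM events
GROUP BY local_date;

python
# On the Python side, pass timezone-aware datetimes as well
from datetime import datetime
from zoneinfo import ZoneInfo

# ❌ naive datetime (no time zone information)
naive_dt = datetime(2024, 1, 1, 10, 0, 0)

# ✅ timezone-aware datetime
aware_dt = datetime(2024, 1, 1, 10, 0, 0, tzinfo=ZoneInfo("Asia/Shanghai"))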
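
The day-boundary problem is easy to reproduce without a database. The sketch below groups three made-up events by UTC date and by Beijing date; it uses a fixed +08:00 offset as a stand-in for Asia/Shanghai, which is an assumption that holds because that zone currently has no daylight saving time.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for Asia/Shanghai (assumption: no DST to account for).
CST = timezone(timedelta(hours=8), "CST")

# Three events stored as UTC instants.
events_utc = [
    datetime(2023, 12, 31, 15, 30, tzinfo=timezone.utc),  # 23:30 on Dec 31 in Beijing
    datetime(2023, 12, 31, 16, 30, tzinfo=timezone.utc),  # 00:30 on Jan 1 in Beijing
    datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),      # 10:00 on Jan 1 in Beijing
]

# Grouping by the UTC calendar date.
by_utc_day = Counter(e.date().isoformat() for e in events_utc)

# Grouping by the local calendar date.
by_local_day = Counter(e.astimezone(CST).date().isoformat() for e in events_utc)

print(dict(by_utc_day))
print(dict(by_local_day))
```

The same three events split 2/1 across the two days in UTC and 1/2 in Beijing time, which is exactly the kind of shift that makes a daily revenue report disagree with the business calendar.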

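Pitfall 2: NULL and empty-set handling differs from MySQL

sql
-- NULL in arithmetic behaves as in MySQL
SELECT NULL + 1;        -- NULL

-- ❌ Aggregating zero rows of a non-Nullable column returns a default value, not NULL
SELECT sum(revenue) FROM shop_analytics.order_events WHERE 1 = 0;   -- 0 (MySQL returns NULL)
SELECT avg(revenue) FROM shop_analytics.order_events WHERE 1 = 0;   -- nan

-- ✅ Nullable columns need explicit handling
SELECT ifNull(nullable_column, 0) AS safe_value FROM some_table;
SELECT isNull(nullable_column)    AS is_missing FROM some_table;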

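Pitfall 3: Writing to distributed tables vs local tables

sql
-- ❌ Inserting through the Distributed table can produce duplicates when the cluster is misconfigured
INSERT INTO distributed_table VALUES (...);
-- If the local tables are Replicated*MergeTree and internal_replication is left at false,
-- the Distributed table writes to every replica and replication then copies the rows again

-- ✅ Write to the local table on the target shard and read through the Distributed table
-- (INSERT has no ON CLUSTER clause; the application chooses which node to write to)
INSERT INTO local_table VALUES (...);

-- ✅ Use a sharding key for even distribution
CREATE TABLE distributed_table AS local_table
ENGINE = Distributed('cluster_name', 'database', 'local_table', rand());
--                                                                ^^^^
--                                              sharding key; rand() spreads rows evenly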

Pitfall 4: ReplacingMergeTree does not deduplicate in real time

sql
-- ❌ Assuming ReplacingMergeTree deduplicates immediately
CREATE TABLE orders_replica (
    order_id UInt64,
    status   String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)  -- keep the row with the largest updated_at
ORDER BY order_id;

-- Insert two rows with the same order_id, then query right away
INSERT INTO orders_replica VALUES (1, 'pending', now());
INSERT INTO orders_replica VALUES (1, 'paid', now() + 1);

SELECT * FROM orders_replica WHERE order_id = 1;
-- ❌ May return 2 rows, because merges run asynchronously in the background

-- ✅ Use FINAL to deduplicate at query time (at a performance cost)
SELECT * FROM orders_replica FINAL WHERE order_id = 1;
-- Or pick the latest row manually with argMax
SELECT order_id, argMax(status, updated_at) AS latest_status
FROM orders_replica
GROUP BY order_id;
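
What the argMax query computes can be spelled out in a few lines of Python. This sketch only mirrors the logic on an in-memory list; the rows are invented, and `latest_status` is a name chosen for the example.

```python
from datetime import datetime

# Unmerged rows as they might sit in the table before a background merge.
rows = [
    (1, "pending", datetime(2024, 1, 1, 10, 0, 0)),
    (1, "paid", datetime(2024, 1, 1, 10, 0, 1)),
    (2, "pending", datetime(2024, 1, 1, 11, 0, 0)),
]


def latest_status(table):
    """For each order_id keep the status with the largest updated_at (argMax)."""
    latest = {}
    for order_id, status, updated_at in table:
        current = latest.get(order_id)
        if current is None or updated_at > current[1]:
            latest[order_id] = (status, updated_at)
    return {order_id: status for order_id, (status, _) in latest.items()}


print(len(rows), latest_status(rows))
```

The table still physically holds three rows, yet the query-time view has one row per order. That is the contract to rely on: deduplicate when reading, and treat the background merge as an optimization that may happen later.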

Pitfall 5: Row-by-row writes are a performance disaster

python
# ❌ Row-by-row inserts are extremely slow (about 500 seconds for 10,000 rows)
for order in orders:
    await ch_execute(
        "INSERT INTO order_events VALUES (%(id)s, %(user_id)s, ...)",
        order
    )

# ✅ Batch inserts (accumulate first; 1,000 to 10,000 rows per batch is recommended)
BATCH_SIZE = 1000
for i in range(0, len(orders), BATCH_SIZE):
    batch = orders[i:i + BATCH_SIZE]
    await insert_order_events(batch)

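11. Summary

| Topic | ClickHouse approach | MySQL counterpart |
|---|---|---|
| Storage engine | MergeTree (columnar) | InnoDB (row-based) |
| Python driver | asynch (async, native TCP) | aiomysql |
| Analytical queries | Aggregate functions, including parametric ones such as windowFunnel | Standard SQL, slow at large volumes |
| Pre-aggregation | Materialized views (AggregatingMergeTree) | Manually maintained summary tables |
| Write strategy | Batch writes (1,000+ rows per batch) | Real-time single-row writes |
| Deduplication | ReplacingMergeTree + FINAL | UNIQUE KEY |
| Time zones | Declare DateTime('Asia/Shanghai') at table creation | Set time_zone on the connection |
| Data sync | Dual writes / Kafka CDC / MySQL engine | N/A |
| Funnel analysis | Built-in windowFunnel function | Complex CTEs or application-side computation |

🎯 Takeaway: ClickHouse is not a replacement for MySQL but its partner: MySQL handles transactions, ClickHouse handles analytics. Good architecture lets each tool do what it does best.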


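Next Up

📝 Part 12: Docker Production Deployment, from Code to Launch

After 11 installments of feature work, we finally reach the most important step: how do you package shop-api into Docker, set up the Nginx reverse proxy, Celery workers, and health checks, and bring the whole stack up with a single docker compose up? The next article walks through a multi-stage Dockerfile build, full-service orchestration in docker-compose.yml, multi-environment configuration with pydantic-settings, and tuning Gunicorn + UvicornWorker for production. It is the series finale; see you there.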
