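FastAPI Series, Part 11: ClickHouse Integration, Big-Data Queries in Practice
🎯 Audience: engineers who know MySQL and want to plug an OLAP database into a FastAPI project to relieve slow statistical queries over large data sets
⏱️ Reading time: about 35 minutes
💬 In one sentence: when aggregate queries over 100 million MySQL orders become unacceptably slow, ClickHouse is the most pragmatic OLAP answer. This article goes from principles to practice and builds a high-performance analytics API with FastAPI, the asynch driver, and ClickHouse.
1. Why ClickHouse
1.1 MySQL's OLAP bottleneck
After shop-api has been running for a while, the orders table holds 100 million rows. Consider a seemingly simple statistical query: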
sql
-- Revenue per category per day over the last 90 days
SELECT
    DATE(o.created_at) AS order_date,
    p.category,
    SUM(oi.quantity * oi.price) AS revenue
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.created_at >= DATE_SUB(NOW(), INTERVAL 90 DAY)
  AND o.status = 'paid'
GROUP BY order_date, p.category
ORDER BY order_date DESC;
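On MySQL this takes 47 seconds. The user's request times out and the DBA is furious.
Root cause: MySQL is a **row store** designed for OLTP (online transaction processing):
- Each row is stored contiguously, so reading one field means loading the whole row
- An aggregate query has to scan a huge number of rows, which drives I/O very high
- Cross-table JOINs become extremely expensive at this data volume
1.2 OLAP vs OLTP: the core differences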
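| Dimension | OLTP (MySQL) | OLAP (ClickHouse) |
|---|---|---|
| Storage layout | Row store | Column store |
| Main operations | CRUD (single row / small batches) | Aggregation (full-table / range scans) |
| Index design | B+ tree, suited to point and range lookups | Sparse index + MergeTree, suited to range scans on the sorting key |
| Write pattern | Real-time single-row writes | Efficient batch writes; row-by-row writes are discouraged |
| Compression | Limited | Strong per-column compression (LZ4/ZSTD, 5-10x) |
| Concurrency | Many small queries (1000+ QPS) | Few large queries (10-100 QPS) |
| JOIN capability | Strong (foreign keys, transactions) | Weak (wide tables / materialized views preferred) |
| Consistency | ACID transactions | Eventual consistency (no transactions in the OLTP sense) |
| Typical latency | Milliseconds (single-row operations) | Seconds (scans over hundreds of millions of rows) |
| Typical use | Order creation, stock deduction | Sales reports, user behavior analysis |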
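The same 90-day revenue query on ClickHouse: 0.3 seconds.
2. ClickHouse core concepts
2.1 How columnar storage works
MySQL row store (reading the price field means reading every column of each row):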
+----+--------+--------+-------+------------+
| id | order# | user | price | created_at |
+----+--------+--------+-------+------------+
| 1 | ORD001 | alice | 99.9 | 2024-01-01 |
| 2 | ORD002 | bob | 199.0 | 2024-01-01 |
| 3 | ORD003 | alice | 49.9 | 2024-01-02 |
+----+--------+--------+-------+------------+
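Rows are stored contiguously → reading the price column loads every field
ClickHouse column store (only the columns a query needs are read):
id:    [1, 2, 3, ...]
price: [99.9, 199.0, 49.9, ...]  ← only this column is read, and it compresses well
date:  [2024-01-01, 2024-01-01, 2024-01-02, ...]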
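To make the difference concrete, here is a toy sketch (not how either engine is implemented) that counts how many values each layout has to touch to sum the price column. The data mirrors the three rows above; the function names and the `order_no` key are hypothetical.

```python
# Toy model of the two layouts; the three rows mirror the table above.
rows = [
    (1, "ORD001", "alice", 99.9, "2024-01-01"),
    (2, "ORD002", "bob", 199.0, "2024-01-01"),
    (3, "ORD003", "alice", 49.9, "2024-01-02"),
]


def sum_price_row_store(table):
    """Row layout: every field of every row is loaded to reach `price`."""
    total, values_touched = 0.0, 0
    for row in table:
        values_touched += len(row)  # the whole row comes along
        total += row[3]             # only this field is needed
    return total, values_touched


# Column layout: one contiguous array per column.
columns = {
    "id": [r[0] for r in rows],
    "order_no": [r[1] for r in rows],
    "user": [r[2] for r in rows],
    "price": [r[3] for r in rows],
    "created_at": [r[4] for r in rows],
}


def sum_price_column_store(cols):
    """Column layout: only the `price` array is read."""
    price = cols["price"]
    return sum(price), len(price)


row_total, row_touched = sum_price_row_store(rows)
col_total, col_touched = sum_price_column_store(columns)
print(row_touched, col_touched)  # values touched by each layout
```

With five columns the row layout touches five times as many values for the same answer; real tables with dozens of columns widen that gap, and compression of same-typed column data widens it further.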
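2.2 The MergeTree engine family
MergeTree is ClickHouse's core storage engine: data is stored sorted by the primary key (the sorting key), and data parts are merged asynchronously in the background.
| Engine | Characteristics | Typical use |
|---|---|---|
| MergeTree | Base engine; data stored in sorting-key order | General OLAP |
| ReplacingMergeTree | Asynchronous deduplication (keeps the latest version per sorting key) | CDC sync, idempotent writes |
| AggregatingMergeTree | Stores intermediate aggregate states | Pre-aggregation via materialized views |
| SummingMergeTree | Sums the numeric columns of rows with the same key during merges | Counters, cumulative metrics |
| CollapsingMergeTree | Logical deletes through a sign column | Mutable data streams |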
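3. Choosing a driver and configuring the connection
3.1 Driver comparison
| Driver | Type | Protocol | Recommended for |
|---|---|---|---|
| clickhouse-driver | Sync | Native TCP | Scripts, synchronous services |
| asynch | Async | Native TCP | FastAPI async routes ✅ |
| clickhouse-connect | Sync/async | HTTP | Cloud services, simple queries |
| aiochclient | Async | HTTP | Lightweight queries |
asynch is the recommendation here: it speaks the native TCP protocol and supports async/await throughout, so it fits FastAPI's async routes without a thread pool.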
bash
pip install asynch
3.2 Connection pool wrapper
python
# app/clickhouse/client.py
from typing import Any, Optional

from asynch.pool import Pool

from app.config import settings

# -------------------------------------------------------
# ClickHouse connection pool (roughly what LettuceConnectionFactory
# is to Spring Data Redis)
# -------------------------------------------------------
# NOTE: Pool lifecycle and acquire method names differ between asynch
# releases. The calls below (acquire / close) follow this article; check
# them against the version you have installed.
_pool: Optional[Pool] = None


async def get_clickhouse_pool() -> Pool:
    """Return the global connection pool, creating it on first use."""
    global _pool
    if _pool is None:
        _pool = Pool(
            host=settings.CLICKHOUSE_HOST,
            port=settings.CLICKHOUSE_PORT,  # 9000 by default (native TCP)
            database=settings.CLICKHOUSE_DATABASE,
            user=settings.CLICKHOUSE_USER,
            password=settings.CLICKHOUSE_PASSWORD,
            minsize=2,
            maxsize=10,
        )
    return _pool


async def close_clickhouse_pool() -> None:
    """Close the connection pool (call this from lifespan)."""
    global _pool
    if _pool:
        await _pool.close()
        _pool = None


# -------------------------------------------------------
# Lifespan integration
# -------------------------------------------------------
# app/main.py (add to the existing lifespan)
#
# from app.clickhouse.client import get_clickhouse_pool, close_clickhouse_pool
#
# @asynccontextmanager
# async def lifespan(app: FastAPI):
#     await get_clickhouse_pool()    # create the pool at startup
#     yield
#     await close_clickhouse_pool()  # release connections at shutdown

# -------------------------------------------------------
# Query helpers
# -------------------------------------------------------
async def ch_execute(
    query: str,
    params: Optional[dict[str, Any]] = None,
) -> list[dict]:
    """
    Run a query and return the rows as a list of dicts.
    Comparable to Spring's JdbcTemplate.queryForList().
    """
    pool = await get_clickhouse_pool()
    async with pool.acquire() as conn:
        async with conn.cursor() as cursor:
            await cursor.execute(query, params)
            columns = [desc[0] for desc in cursor.description]
            rows = await cursor.fetchall()
            return [dict(zip(columns, row)) for row in rows]


async def ch_insert(table: str, data: list[tuple], columns: list[str]) -> None:
    """
    Insert a batch of rows.
    ClickHouse handles batch writes far more efficiently than row-by-row writes.
    `table` and `columns` are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    pool = await get_clickhouse_pool()
    async with pool.acquire() as conn:
        async with conn.cursor() as cursor:
            await cursor.execute(
                f"INSERT INTO {table} ({', '.join(columns)}) VALUES",
                data,
            )
3.3 Settings
python
# app/config.py (add the ClickHouse settings)
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # ... other settings ...
    CLICKHOUSE_HOST: str = "localhost"
    CLICKHOUSE_PORT: int = 9000
    CLICKHOUSE_DATABASE: str = "shop_analytics"
    CLICKHOUSE_USER: str = "default"
    CLICKHOUSE_PASSWORD: str = ""


settings = Settings()
4. Table creation and data model design
4.1 Order fact table
sql
-- Order detail table in ClickHouse (one row per order line item)
CREATE TABLE shop_analytics.order_events
(
    order_id      UInt64,
    user_id       UInt64,
    product_id    UInt64,
    category      LowCardinality(String),  -- LowCardinality suits columns with few distinct values
    quantity      UInt32,
    price         Decimal(10, 2),
    revenue       Decimal(10, 2),          -- quantity * price, computed ahead of time
    order_status  LowCardinality(String),
    created_at    DateTime,
    created_date  Date MATERIALIZED toDate(created_at)  -- materialized column, avoids recomputing per query
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)                -- monthly partitions let time-range queries skip whole months
ORDER BY (created_date, category, product_id)    -- sorting key, decides the physical order of the data
SETTINGS index_granularity = 8192;

-- User behavior event table (used for funnel analysis)
CREATE TABLE shop_analytics.user_events
(
    event_id    UUID DEFAULT generateUUIDv4(),
    user_id     UInt64,
    event_type  LowCardinality(String),  -- view_product / add_cart / checkout / pay
    product_id  Nullable(UInt64),
    session_id  String,
    created_at  DateTime,
    properties  String                   -- extra attributes as JSON
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (user_id, created_at)
SETTINGS index_granularity = 8192;
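The `index_granularity = 8192` setting means the primary index keeps one entry (a mark) per 8192 rows instead of one per row. The sketch below illustrates that idea in miniature with a granularity of 10; it is a simplified model, and `build_sparse_index` and `granules_to_scan` are hypothetical helper names, not ClickHouse APIs.

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_keys, granularity):
    """Keep only the first key of each granule (one mark per granule)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]


def granules_to_scan(marks, lo, hi):
    """Granule numbers that may hold keys in [lo, hi].

    The lower bound steps back one granule on purpose: the sparse index only
    knows each granule's first key, so matching rows may start in the
    granule just before the first mark that is >= lo.
    """
    first = max(bisect_left(marks, lo) - 1, 0)
    last = bisect_right(marks, hi) - 1
    return range(first, last + 1) if last >= first else range(0)


keys = list(range(100))               # data already sorted by the key
marks = build_sparse_index(keys, 10)  # 10 marks instead of 100 index entries
print(list(granules_to_scan(marks, 25, 47)))
```

A range filter on the sorting key therefore reads a few granules rather than the whole table, which is why the columns that queries filter on most belong at the front of ORDER BY.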
4.2 ER diagram (MySQL → ClickHouse data flow)
MySQL (OLTP)                      ClickHouse (OLAP)
+------------------+              +----------------------+
| orders           |              | order_events         |
| - id             |   ETL/CDC    | - order_id           |
| - user_id        |   ------→    | - user_id            |
| - status         |              | - product_id         |
| - created_at     |              | - category           |
+------------------+              | - revenue            |
| order_items      |              | - created_at         |
| - order_id       |              +----------------------+
| - product_id     |              | user_events          |
| - quantity       |              | - user_id            |
| - price          |              | - event_type         |
+------------------+              | - session_id         |
| products         |              | - created_at         |
| - id             |              +----------------------+
| - category       |
+------------------+
5. Typical queries in practice
5.1 Daily revenue aggregation
python
# app/clickhouse/queries/sales.py
from datetime import date

from app.clickhouse.client import ch_execute


async def get_daily_revenue(
    start_date: date,
    end_date: date,
    category: str | None = None,
) -> list[dict]:
    """
    Daily revenue by category, with an optional category filter.
    In this article's benchmark the equivalent query takes about 47 s on
    MySQL (100M rows) and about 0.3 s on ClickHouse.
    """
    # Only fixed SQL fragments are joined into the statement;
    # every user-supplied value is bound as a parameter.
    conditions = ["order_status = 'paid'", "created_date BETWEEN %(start)s AND %(end)s"]
    params: dict = {"start": start_date, "end": end_date}
    if category:
        conditions.append("category = %(category)s")
        params["category"] = category
    where_clause = " AND ".join(conditions)
    # order_events has one row per line item, so orders are counted with
    # uniqExact(order_id) rather than count().
    sql = f"""
        SELECT
            created_date AS order_date,
            category,
            uniqExact(order_id) AS order_count,
            sum(quantity) AS total_quantity,
            sum(revenue) AS total_revenue,
            sum(revenue) / uniqExact(order_id) AS avg_order_value
        FROM shop_analytics.order_events
        WHERE {where_clause}
        GROUP BY order_date, category
        ORDER BY order_date DESC, total_revenue DESC
    """
    return await ch_execute(sql, params)


async def get_top_products(
    start_date: date,
    end_date: date,
    limit: int = 20,
) -> list[dict]:
    """Top N products by revenue."""
    sql = """
        SELECT
            product_id,
            sum(quantity) AS total_quantity,
            sum(revenue) AS total_revenue,
            count() AS order_count
        FROM shop_analytics.order_events
        WHERE order_status = 'paid'
          AND created_date BETWEEN %(start)s AND %(end)s
        GROUP BY product_id
        ORDER BY total_revenue DESC
        LIMIT %(limit)s
    """
    return await ch_execute(sql, {"start": start_date, "end": end_date, "limit": limit})
5.2 Funnel analysis (windowFunnel)
Funnel analysis is a core need in e-commerce analytics: users go from viewing a product → adding to cart → checkout → payment, and each step has a conversion rate. ClickHouse ships a windowFunnel function built for exactly this.
python
async def get_conversion_funnel(
    start_date: date,
    end_date: date,
    window_seconds: int = 3600,  # the funnel must complete within 1 hour to count
) -> dict:
    """
    Purchase funnel:
    view product → add to cart → start checkout → complete payment.
    windowFunnel(window)(timestamp, event1, event2, ...) returns the deepest
    step each user reached. Note that end_date is exclusive in this query.
    """
    # window_seconds is interpolated into the SQL text, so force it to int.
    sql = f"""
        SELECT
            level,
            count() AS users
        FROM (
            SELECT
                user_id,
                windowFunnel({int(window_seconds)})(
                    created_at,
                    event_type = 'view_product',  -- step 1
                    event_type = 'add_cart',      -- step 2
                    event_type = 'checkout',      -- step 3
                    event_type = 'pay'            -- step 4
                ) AS level
            FROM shop_analytics.user_events
            WHERE created_at >= %(start)s
              AND created_at < %(end)s
            GROUP BY user_id
        )
        GROUP BY level
        ORDER BY level ASC
    """
    rows = await ch_execute(sql, {"start": start_date, "end": end_date})
    # Reshape into funnel form: each value is the number of users whose
    # deepest step was that level.
    level_map = {0: "未开始", 1: "浏览商品", 2: "加入购物车", 3: "开始结账", 4: "完成支付"}
    result = {level_map.get(r["level"], str(r["level"])): r["users"] for r in rows}
    return result


async def get_quantile_order_value(
    start_date: date,
    end_date: date,
) -> dict:
    """
    Order value quantiles (quantile function).
    P50/P90/P99 show how customer spending is distributed.
    """
    sql = """
        SELECT
            quantile(0.5)(revenue) AS p50,
            quantile(0.9)(revenue) AS p90,
            quantile(0.99)(revenue) AS p99,
            avg(revenue) AS avg_revenue,
            max(revenue) AS max_revenue
        FROM shop_analytics.order_events
        WHERE order_status = 'paid'
          AND created_date BETWEEN %(start)s AND %(end)s
    """
    rows = await ch_execute(sql, {"start": start_date, "end": end_date})
    return rows[0] if rows else {}
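One detail of the funnel query is easy to miss: windowFunnel reports the deepest step each user reached, so the per-level counts are not cumulative. A user who paid is counted at level 4 only, not at levels 1 to 3. The sketch below converts such counts into the usual "reached at least this step" funnel with conversion rates. `cumulative_funnel` is a hypothetical helper and the sample numbers are made up for illustration.

```python
def cumulative_funnel(level_counts: dict[int, int], steps: list[str]) -> list[dict]:
    """Turn 'deepest level reached' counts into a cumulative funnel.

    level_counts maps a windowFunnel level (0 = never entered the funnel)
    to the number of users whose deepest step was that level.
    """
    result = []
    for i, name in enumerate(steps, start=1):
        # Everyone whose deepest level is >= i has passed step i.
        reached = sum(n for level, n in level_counts.items() if level >= i)
        result.append({"step": name, "users": reached})
    top = result[0]["users"] if result else 0
    for row in result:
        row["rate"] = round(row["users"] / top * 100, 2) if top else 0.0
    return result


# Illustrative numbers only, not measured data.
sample = {0: 500, 1: 7000, 2: 2000, 3: 400, 4: 600}
funnel = cumulative_funnel(sample, ["view_product", "add_cart", "checkout", "pay"])
for row in funnel:
    print(row)
```

Level 0 (users who never entered the funnel) is left out of the result on purpose, since it is not a funnel step.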
5.3 User retention analysis (arrayJoin)
python
async def get_user_retention(cohort_date: date) -> list[dict]:
    """
    N-day retention per cohort, where a user's cohort is the day of their
    first payment on or after cohort_date.
    arrayJoin expands an array into multiple rows, a powerful feature
    specific to ClickHouse.
    """
    # Innermost query: first pay day and the set of active days per user.
    # Next level: arrayJoin turns each active day into a day offset row.
    # Day 0 contains every user of the cohort, so the largest
    # retained_users value per cohort is the cohort size.
    sql = """
        SELECT
            cohort_day,
            retention_day,
            retained_users,
            round(
                retained_users / max(retained_users) OVER (PARTITION BY cohort_day) * 100,
                2
            ) AS retention_rate
        FROM (
            SELECT
                cohort_day,
                retention_day,
                uniqExact(user_id) AS retained_users
            FROM (
                SELECT
                    user_id,
                    cohort_day,
                    arrayJoin(
                        arrayMap(d -> dateDiff('day', cohort_day, d), active_days)
                    ) AS retention_day
                FROM (
                    SELECT
                        user_id,
                        min(toDate(created_at)) AS cohort_day,
                        groupUniqArray(toDate(created_at)) AS active_days
                    FROM shop_analytics.user_events
                    WHERE event_type = 'pay'
                      AND created_at >= %(cohort_date)s
                    GROUP BY user_id
                )
            )
            WHERE retention_day BETWEEN 0 AND 30
            GROUP BY cohort_day, retention_day
        )
        ORDER BY cohort_day, retention_day
    """
    return await ch_execute(sql, {"cohort_date": cohort_date})
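6. Batch writes in practice
ClickHouse's write strategy is the opposite of MySQL's: do not INSERT row by row; accumulate rows and write them in batches.
6.1 Performance comparison
| Write method | 100 rows | 10,000 rows | 1,000,000 rows |
|---|---|---|---|
| Row-by-row INSERT | ~5s | ~500s | Unacceptable |
| Batch INSERT (1,000 rows/batch) | ~0.05s | ~0.5s | ~50s |
| Batch INSERT (10,000 rows/batch) | N/A | ~0.1s | ~10s |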
python
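# app/clickhouse/writer.py
import asyncio
import logging
from decimal import Decimal

from app.clickhouse.client import ch_insert

logger = logging.getLogger(__name__)


# -------------------------------------------------------
# Batch write buffer (comparable to batch.size in a Kafka producer)
# -------------------------------------------------------
class ClickHouseBatchWriter:
    """
    Writes once batch_size rows have accumulated, or every flush_interval
    seconds after start() has been called.
    Suited to high-frequency event writes (user behavior tracking and so on).
    """

    def __init__(
        self,
        table: str,
        columns: list[str],
        batch_size: int = 1000,
        flush_interval: float = 5.0,
    ):
        self.table = table
        self.columns = columns
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self._buffer: list[tuple] = []
        self._lock = asyncio.Lock()
        self._flush_task: asyncio.Task | None = None

    async def start(self) -> None:
        """Start the periodic flush loop (call from lifespan startup)."""
        if self._flush_task is None:
            self._flush_task = asyncio.create_task(self._flush_loop())

    async def _flush_loop(self) -> None:
        while True:
            await asyncio.sleep(self.flush_interval)
            try:
                await self.flush()
            except Exception:
                # The rows are still buffered (see _flush); the next tick retries.
                logger.exception("periodic ClickHouse flush failed")

    async def write(self, row: tuple) -> None:
        async with self._lock:
            self._buffer.append(row)
            if len(self._buffer) >= self.batch_size:
                await self._flush()

    async def _flush(self) -> None:
        """Write the buffer to ClickHouse. The caller must hold the lock."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        try:
            await ch_insert(self.table, batch, self.columns)
        except BaseException:
            self._buffer = batch  # keep the rows so a failed insert is not lost
            raise

    async def flush(self) -> None:
        """Flush manually."""
        async with self._lock:
            await self._flush()

    async def stop(self) -> None:
        """Cancel the timer and flush what is left (call from lifespan shutdown)."""
        if self._flush_task is not None:
            self._flush_task.cancel()
            self._flush_task = None
        await self.flush()


# -------------------------------------------------------
# Direct batch insert (suited to ETL)
# -------------------------------------------------------
async def insert_order_events(events: list[dict]) -> None:
    """
    Write a batch of order events to ClickHouse.
    The caller is responsible for batching (1000+ rows recommended).
    """
    if not events:
        return
    columns = [
        "order_id", "user_id", "product_id", "category",
        "quantity", "price", "revenue", "order_status", "created_at",
    ]
    rows = []
    for e in events:
        # Decimal keeps monetary values exact; float would introduce rounding error.
        price = Decimal(str(e["price"]))
        rows.append((
            e["order_id"],
            e["user_id"],
            e["product_id"],
            e["category"],
            e["quantity"],
            price,
            e["quantity"] * price,  # revenue computed ahead of time
            e["order_status"],
            e["created_at"],
        ))
    await ch_insert("shop_analytics.order_events", rows, columns)


# -------------------------------------------------------
# Batch write example (called from a Celery task)
# -------------------------------------------------------
# Celery tasks are plain synchronous functions, so the coroutine is run
# with asyncio.run() inside the task.
#
# @celery_app.task
# def sync_orders_to_clickhouse(order_ids: list[int]) -> None:
#     """Pull orders from MySQL and sync them to ClickHouse."""
#     async def _run() -> None:
#         orders = await fetch_orders_from_mysql(order_ids)
#         await insert_order_events(orders)
#     asyncio.run(_run())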
7. Materialized views (pre-aggregation)
Materialized views are ClickHouse's killer feature: aggregation runs automatically as data is written, and queries read the pre-aggregated result directly.
sql
-- 1. Create the aggregate storage table (AggregatingMergeTree)
CREATE TABLE shop_analytics.daily_sales_mv
(
    order_date     Date,
    category       LowCardinality(String),
    order_count    AggregateFunction(count),                -- aggregate state type
    total_revenue  AggregateFunction(sum, Decimal(10, 2)),
    avg_revenue    AggregateFunction(avg, Decimal(10, 2))
)
ENGINE = AggregatingMergeTree()
ORDER BY (order_date, category);

-- 2. Create the materialized view (acts as a trigger: it runs whenever new rows are inserted into order_events)
CREATE MATERIALIZED VIEW shop_analytics.daily_sales_mv_trigger
TO shop_analytics.daily_sales_mv
AS
SELECT
    created_date AS order_date,
    category,
    countState() AS order_count,
    sumState(revenue) AS total_revenue,
    avgState(revenue) AS avg_revenue
FROM shop_analytics.order_events
WHERE order_status = 'paid'
GROUP BY order_date, category;

-- 3. Query the pre-aggregated table (the -Merge functions finalize the aggregate states)
SELECT
    order_date,
    category,
    countMerge(order_count) AS order_count,
    sumMerge(total_revenue) AS total_revenue,
    avgMerge(avg_revenue) AS avg_revenue
FROM shop_analytics.daily_sales_mv
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY order_date, category
ORDER BY order_date DESC;
-- Through the materialized view the query time drops from 0.3s to about 0.01s.
-- The cost now tracks the number of (date, category) groups rather than the raw
-- row count, which is why this holds even at the 10-billion-row scale.
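Why does the storage table hold `AggregateFunction` states instead of finished numbers? An average cannot be rebuilt from other averages, but it can be rebuilt from (sum, count) pairs. The sketch below models that idea in plain Python; `AvgState` is a hypothetical stand-in for what avgState / avgMerge do conceptually, not ClickHouse's actual binary format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Conceptual stand-in for an avg aggregate state: a running (sum, count)."""
    total: float = 0.0
    count: int = 0

    @classmethod
    def from_values(cls, values):
        # Analogous to avgState() over one inserted block.
        return cls(sum(values), len(values))

    def merge(self, other):
        # Analogous to combining states during a background merge or avgMerge().
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        return self.total / self.count if self.count else float("nan")


block1 = AvgState.from_values([10.0, 20.0])  # first INSERT
block2 = AvgState.from_values([60.0])        # second INSERT
merged = block1.merge(block2)

correct = merged.finalize()                         # true average of all 3 rows
naive = (block1.finalize() + block2.finalize()) / 2  # average of averages
print(correct, naive)
```

The merged state gives the true average over all three rows, while averaging the two per-block averages does not. That is the reason the query in step 3 has to use the -Merge functions with GROUP BY rather than reading the stored columns directly.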
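8. MySQL → ClickHouse data synchronization
8.1 Three synchronization architectures
Option 1: application dual write (simplest, good freshness)
+----------+    INSERT     +---------+
| FastAPI  | ------------> | MySQL   |
|          |               +---------+
|          |    INSERT     +-------------+
|          | ------------> | ClickHouse  |
+----------+               +-------------+
Pros: real-time, simple
Cons: dual-write consistency problems, intrusive to application code
Option 2: decoupling through Kafka (recommended, highly reliable)
+----------+  Binlog  +-------+  Consume  +--------------+
| MySQL    | -------> | Kafka | --------> | ClickHouse   |
+----------+   CDC    +-------+           | Kafka engine |
                                          +--------------+
Pros: decoupled, reliable, supports multiple consumers
Cons: brings in Kafka, more complex architecture
Option 3: ClickHouse MySQL engine (read-only access)
+---------------------------------------------------+
| ClickHouse                                        |
| CREATE TABLE mysql_orders ENGINE = MySQL(...)     |
| Queries MySQL directly (no local copy, live read) |
+---------------------------------------------------+
     ↓ can be joined with local tables
+----------+
| MySQL    |
+----------+
Pros: zero lag, nothing to synchronize
Cons: every query goes over the network; unsuitable for aggregating large data sets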
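8.2 Using the ClickHouse MySQL engine (quick integration)
sql
-- Create a read-only mapping of a MySQL table inside ClickHouse
CREATE TABLE shop_analytics.mysql_orders
ENGINE = MySQL(
    'mysql-host:3306',
    'shop_db',
    'orders',
    'readonly_user',
    'password'
);

-- MySQL data can be JOINed directly inside ClickHouse
-- (best kept to small auxiliary dimensions; id and status are the
-- orders columns shown in the diagram in section 4.2)
SELECT
    mo.status,
    sum(oe.revenue) AS revenue
FROM shop_analytics.order_events AS oe
INNER JOIN shop_analytics.mysql_orders AS mo ON oe.order_id = mo.id
GROUP BY mo.status;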
9. Running project: the analytics API
python
# app/routers/analytics.py
from datetime import date, timedelta
from typing import Literal

from fastapi import APIRouter, Query, Depends

from app.auth.dependencies import get_current_user
from app.models.user import User
from app.clickhouse.queries.sales import (
    get_daily_revenue,
    get_top_products,
    get_conversion_funnel,
    get_quantile_order_value,
)
from app.schemas.analytics import (
    DailyRevenueResponse,
    TopProductsResponse,
    FunnelResponse,
    QuantileResponse,
)

router = APIRouter(prefix="/api/v1/analytics", tags=["analytics"])


@router.get("/sales", response_model=list[DailyRevenueResponse])
async def get_sales_stats(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    group_by: Literal["day", "week", "month"] = Query(default="day"),
    category: str | None = Query(default=None, description="Category filter; omit to query all"),
    current_user: User = Depends(get_current_user),
):
    """
    Sales statistics API.
    GET /api/v1/analytics/sales?start=2024-01-01&end=2024-03-31&group_by=day&category=electronics
    Spring Boot equivalent: @GetMapping + @RequestParam + a service call.
    Note: group_by is validated but not yet applied; get_daily_revenue
    always aggregates by day.
    """
    if (end - start).days > 365:
        from fastapi import HTTPException
        raise HTTPException(status_code=400, detail="Query range must not exceed 365 days")
    return await get_daily_revenue(start, end, category)


@router.get("/top-products", response_model=list[TopProductsResponse])
async def get_top_products_api(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    limit: int = Query(default=20, ge=1, le=100),
    current_user: User = Depends(get_current_user),
):
    """Top N products by revenue."""
    return await get_top_products(start, end, limit)


@router.get("/funnel", response_model=FunnelResponse)
async def get_funnel_api(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=7)),
    end: date = Query(default_factory=date.today),
    window_hours: int = Query(default=1, ge=1, le=72, description="Funnel time window in hours"),
    current_user: User = Depends(get_current_user),
):
    """Purchase conversion funnel."""
    funnel_data = await get_conversion_funnel(start, end, window_hours * 3600)
    return {"data": funnel_data, "window_hours": window_hours}


@router.get("/order-value-distribution", response_model=QuantileResponse)
async def get_order_value_distribution(
    start: date = Query(default_factory=lambda: date.today() - timedelta(days=30)),
    end: date = Query(default_factory=date.today),
    current_user: User = Depends(get_current_user),
):
    """Order value quantile distribution."""
    return await get_quantile_order_value(start, end)
python
# app/schemas/analytics.py
from datetime import date
from decimal import Decimal

from pydantic import BaseModel


class DailyRevenueResponse(BaseModel):
    order_date: date
    category: str
    order_count: int
    total_quantity: int
    total_revenue: Decimal
    avg_order_value: Decimal


class TopProductsResponse(BaseModel):
    product_id: int
    total_quantity: int
    total_revenue: Decimal
    order_count: int


class FunnelResponse(BaseModel):
    data: dict[str, int]  # {"浏览商品": 10000, "加入购物车": 3000, ...}
    window_hours: int


class QuantileResponse(BaseModel):
    p50: Decimal
    p90: Decimal
    p99: Decimal
    avg_revenue: Decimal
    max_revenue: Decimal
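10. Common pitfalls and best practices
Pitfall 1: the DateTime/DateTime64 time zone trap
sql
-- ❌ No time zone on the column: ClickHouse stores a Unix timestamp and uses the
--    server time zone (commonly UTC) to parse and display it.
--    "2024-01-01 10:00:00 CST (Beijing time)" is the instant "2024-01-01 02:00:00 UTC".
--    Any event before 08:00 Beijing time falls on the previous UTC day,
--    so a daily GROUP BY puts it in the wrong bucket.
-- ✅ Declare the time zone when creating the table
CREATE TABLE events (
    created_at DateTime('Asia/Shanghai')  -- explicit time zone
) ENGINE = MergeTree() ORDER BY created_at;
-- ✅ Or convert the time zone at query time
SELECT
    toDate(created_at, 'Asia/Shanghai') AS local_date,
    count()
FROM events
GROUP BY local_date
python
# On the Python side, pass timezone-aware datetimes as well
from datetime import datetime

import pytz

# ❌ naive datetime (no time zone information)
naive_dt = datetime(2024, 1, 1, 10, 0, 0)
# ✅ timezone-aware datetime
cst = pytz.timezone('Asia/Shanghai')
aware_dt = cst.localize(datetime(2024, 1, 1, 10, 0, 0))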
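The day-boundary effect behind this pitfall is easy to reproduce without a database. The sketch below uses the standard-library zoneinfo module (rather than pytz) and an arbitrary example instant of 02:00 Beijing time; on platforms without a system time zone database, the `tzdata` package may be needed.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

shanghai = ZoneInfo("Asia/Shanghai")

# An order placed at 02:00 on New Year's Day, Beijing time.
local_dt = datetime(2024, 1, 1, 2, 0, 0, tzinfo=shanghai)
utc_dt = local_dt.astimezone(timezone.utc)

# Same instant, different calendar day depending on the time zone used.
same_instant = local_dt.timestamp() == utc_dt.timestamp()
local_day = local_dt.date()
utc_day = utc_dt.date()
print(same_instant, local_day, utc_day)
```

The stored instant is identical in both cases; only the calendar day derived from it changes. That is why the time zone has to be fixed either on the column or in the toDate() call.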
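Pitfall 2: NULL handling differs from MySQL
sql
-- ❌ Aggregates over "nothing" do not behave like MySQL
SELECT NULL + 1;                                               -- NULL (same as MySQL)
SELECT sum(number) FROM numbers(10) WHERE number > 100;        -- 0 (MySQL's SUM over zero rows is NULL)
SELECT avg(number) FROM numbers(10) WHERE number > 100;        -- nan
SELECT sumOrNull(number) FROM numbers(10) WHERE number > 100;  -- NULL, the -OrNull combinator restores MySQL-like behavior
-- ✅ Nullable columns need explicit handling (aggregate functions skip NULL values)
SELECT ifNull(nullable_column, 0) AS safe_value FROM my_table;
SELECT isNull(nullable_column) AS is_missing FROM my_table;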
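Pitfall 3: writing to distributed tables vs local tables
sql
-- ⚠️ Writing through the Distributed table is convenient, but the receiving node has to
--    split every batch by shard key and forward it (asynchronously by default),
--    and a client retry after a timeout can insert the same batch twice.
INSERT INTO distributed_table VALUES (...);
-- ✅ For heavy ingestion, choose the shard in the application and write to its local table;
--    the Distributed table then only fans out reads.
--    (INSERT has no ON CLUSTER clause; ON CLUSTER applies to DDL statements.)
INSERT INTO local_table VALUES (...);
-- ✅ Give the Distributed table a shard key so writes through it spread evenly
CREATE TABLE distributed_table AS local_table
ENGINE = Distributed('cluster_name', 'database', 'local_table', rand());
--                                                              ^^^^^^
--                                              shard key; rand() spreads rows evenly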
Pitfall 4: ReplacingMergeTree does not deduplicate in real time
sql
-- ❌ Assuming ReplacingMergeTree deduplicates immediately
CREATE TABLE orders_replica (
    order_id    UInt64,
    status      String,
    updated_at  DateTime
) ENGINE = ReplacingMergeTree(updated_at)  -- keep the row with the latest updated_at
ORDER BY order_id;
-- Insert two rows with the same order_id, then query right away
INSERT INTO orders_replica VALUES (1, 'pending', now());
INSERT INTO orders_replica VALUES (1, 'paid', now() + 1);
SELECT * FROM orders_replica WHERE order_id = 1;
-- ❌ May return 2 rows, because merges run asynchronously in the background
-- ✅ Use the FINAL keyword to force deduplication at query time (at a performance cost)
SELECT * FROM orders_replica FINAL WHERE order_id = 1;
-- Or pick the latest row manually with argMax
SELECT order_id, argMax(status, updated_at) AS latest_status
FROM orders_replica
GROUP BY order_id;
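The behavior above follows from how parts work: each INSERT creates its own part, and duplicates only disappear when a background merge combines parts. The sketch below is a simplified model of that merge rule, with `replacing_merge` as a hypothetical function and integer versions standing in for updated_at. It is an illustration, not ClickHouse's implementation.

```python
def replacing_merge(rows, key, version):
    """Keep one row per key: the highest version wins, and the later row wins ties."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Two separate INSERTs produce two separate parts.
part_1 = [{"order_id": 1, "status": "pending", "updated_at": 100}]
part_2 = [
    {"order_id": 1, "status": "paid", "updated_at": 101},
    {"order_id": 2, "status": "pending", "updated_at": 100},
]

before_merge = part_1 + part_2                 # what a plain SELECT can see
after_merge = replacing_merge(before_merge, "order_id", "updated_at")
print(len(before_merge), len(after_merge))
```

Until the merge has happened, a plain SELECT reads both parts and sees both versions of order 1. FINAL and argMax apply the same "latest version per key" rule at query time instead of waiting for the merge.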
Pitfall 5: the row-by-row write performance disaster
python
# ❌ Row-by-row writes perform terribly (about 500 seconds for 10,000 rows)
for order in orders:
    await ch_execute(
        "INSERT INTO order_events VALUES (%(id)s, %(user_id)s, ...)",
        order
    )

# ✅ Batch writes (accumulate first; 1,000 to 10,000 rows per batch recommended)
BATCH_SIZE = 1000
for i in range(0, len(orders), BATCH_SIZE):
    batch = orders[i:i + BATCH_SIZE]
    await insert_order_events(batch)
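11. Summary
| Topic | ClickHouse approach | MySQL counterpart |
|---|---|---|
| Storage engine | MergeTree (column store) | InnoDB (row store) |
| Python driver | asynch (async TCP) | aiomysql |
| Analytical queries | Aggregate functions, including parametric ones such as windowFunnel | Standard SQL, slow on large data |
| Pre-aggregation | Materialized views (AggregatingMergeTree) | Summary tables maintained by hand |
| Write strategy | Batch writes (1000+ rows per batch) | Real-time single-row writes |
| Deduplication | ReplacingMergeTree + FINAL | UNIQUE KEY |
| Time zone | Declare DateTime('Asia/Shanghai') in the table | Set time_zone on the connection |
| Data sync | Dual write / Kafka CDC / MySQL engine | N/A |
| Funnel analysis | Built-in windowFunnel function | Complex CTEs or application-side computation |
🎯 Takeaway: ClickHouse is not a replacement for MySQL but its partner: MySQL handles transactions, ClickHouse handles analytics. Good architecture lets each tool do what it does best.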
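Coming next
📝 Part 12: Docker production deployment, from code to launch
After 11 parts of feature work, it is time for the most important step: how do you package shop-api into Docker, set up the Nginx reverse proxy, Celery workers, and health checks, and bring up the whole stack with a single docker compose up? The next part walks through a multi-stage Dockerfile build, full service orchestration in docker-compose.yml, multi-environment configuration with pydantic-settings, and Gunicorn + UvicornWorker tuning for production. It is the series finale, so don't miss it.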