Contents
- 1. Why Do We Need CTEs?
  - 1.1 Pain Points of Traditional Complex Queries
  - 1.2 Core Advantages of CTEs
  - 1.3 CTE vs. Temporary Table vs. View
  - 1.4 The Core Value of CTEs
- 2. Basic CTE Syntax
  - 2.1 Chaining Multiple CTEs
- 3. Practical Scenarios
  - 3.1 Scenario 1: User Activity Metrics (Replacing Nested Subqueries for Readability)
  - 3.2 Scenario 2: Multi-Dimensional Aggregation Without Repeated Computation (Performance)
  - 3.3 Scenario 3: Org-Chart Queries (Tree Structures) with Recursive CTEs
  - 3.4 Scenario 4: Bill of Materials (BOM) Explosion
  - 3.5 Scenario 5: Sessionizing User Event Logs (Data Cleaning and Preprocessing)
- 4. PostgreSQL 12+ Advanced Feature: Controlling CTE Materialization
  - 4.1 Default Behavior (Before PostgreSQL 12)
  - 4.2 New Control Options (PostgreSQL 12+)
  - 4.3 Performance Comparison Example
- 5. Common Pitfalls and Best Practices
  - 5.1 Pitfall 1: DML (Data Modification) Inside a CTE
  - 5.2 Pitfall 2: Infinite Loops in Recursive CTEs
  - 5.3 Best-Practice Summary
- 6. Case Study: E-Commerce Funnel Analysis
  - 6.1 Table Structure
  - 6.2 CTE Implementation
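Applies to: PostgreSQL 9.4+ (12+ recommended, which adds explicit control over CTE materialization)
Audience: data analysts, backend developers, DBAs
Core value: simplify complexity by replacing procedural code with declarative SQL, improving readability and performance
1. Why Do We Need CTEs?
1.1 Pain Points of Traditional Complex Queries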
sql
-- Nested-subquery hell (hard to read, debug, and reuse)
SELECT
u.name,
(SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id AND o.status = 'paid') AS paid_orders,
(SELECT SUM(o.amount) FROM orders o WHERE o.user_id = u.id AND o.status = 'paid') AS total_spent
FROM users u
WHERE u.id IN (
SELECT DISTINCT user_id
FROM orders
WHERE created_at > '2023-01-01'
AND user_id IN (
SELECT id FROM users WHERE country = 'US'
)
);
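Problems:
- `orders` is referenced three times, and the two scalar subqueries are re-evaluated for every row of `users`
- The logic is scattered and hard to maintain
- Intermediate results cannot be reused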
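As a rough sketch of where this article is heading, the same result can be written as named steps that aggregate `orders` once instead of once per user. The sample rows and the CTE names (`us_users`, `recent_buyers`, `paid_totals`) are assumptions made up for illustration; the first two CTEs only stand in for the real `users` and `orders` tables so the statement runs on its own.

```sql
WITH
-- Hypothetical sample data standing in for the real tables
users(id, name, country) AS (
  VALUES (1, 'Ann', 'US'), (2, 'Bob', 'US'), (3, 'Chen', 'CN')
),
orders(id, user_id, status, amount, created_at) AS (
  VALUES
    (1, 1, 'paid',    120, DATE '2023-02-01'),
    (2, 1, 'paid',     80, '2022-12-15'),
    (3, 1, 'pending',  50, '2023-03-01'),
    (4, 2, 'paid',     40, '2022-06-01'),
    (5, 3, 'paid',    300, '2023-05-05')
),
-- Step 1: US users
us_users AS (
  SELECT id, name FROM users WHERE country = 'US'
),
-- Step 2: users with at least one order after 2023-01-01
recent_buyers AS (
  SELECT DISTINCT user_id FROM orders WHERE created_at > '2023-01-01'
),
-- Step 3: paid-order totals, computed in a single pass over orders
paid_totals AS (
  SELECT user_id, COUNT(*) AS paid_orders, SUM(amount) AS total_spent
  FROM orders
  WHERE status = 'paid'
  GROUP BY user_id
)
SELECT
  u.name,
  COALESCE(p.paid_orders, 0) AS paid_orders,  -- COUNT(*) in the original yields 0 when nothing matches
  p.total_spent                               -- SUM() in the original yields NULL when nothing matches
FROM us_users u
JOIN recent_buyers r ON r.user_id = u.id
LEFT JOIN paid_totals p ON p.user_id = u.id;
-- With the sample rows above: Ann | 2 | 200
```

The `LEFT JOIN` keeps users who ordered recently but have no paid orders, which matches what the scalar subqueries return in the nested version.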
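1.2 Core Advantages of CTEs
- Readability: splits complex logic into named steps
- Maintainability: a change to one CTE propagates to every step that uses it
- Performance tuning: PostgreSQL 12+ supports MATERIALIZED / NOT MATERIALIZED to control materialization
- Recursion: handles tree and graph data (org charts, BOMs)
- Reuse: the same CTE can be referenced several times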
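1.3 CTE vs. Temporary Table vs. View
| Property | CTE | Temporary table | View |
|---|---|---|---|
| Lifetime | Single statement | Session | Permanent |
| Storage | Memory / temporary files | Temporary files | None (stored definition only) |
| Indexes | Not supported | Supported | Not supported (base-table indexes still apply) |
| Recursion | Supported | Not supported | Only via a recursive CTE in the view definition |
| Best for | Decomposing a complex query | Large intermediate results | Encapsulating shared logic |
How to choose:
- Complex logic used by a single query → CTE
- Intermediate results reused across several queries → temporary table
- Logic shared globally → view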
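The difference in lifetime is easiest to see side by side. The following is a minimal sketch, assuming a throwaway table `demo_sales` and object names (`paid_totals`, `paid_totals_v`) invented for this example; the view is created as a temporary view only so that the sketch leaves nothing behind.

```sql
-- Hypothetical table used only for this sketch
CREATE TEMP TABLE demo_sales (user_id int, amount numeric, status text);
INSERT INTO demo_sales VALUES (1, 100, 'paid'), (1, 50, 'paid'), (2, 70, 'refunded');

-- 1) CTE: the name "paid" exists only inside this one statement
WITH paid AS (
  SELECT user_id, SUM(amount) AS total
  FROM demo_sales
  WHERE status = 'paid'
  GROUP BY user_id
)
SELECT * FROM paid;

-- 2) Temporary table: the result is stored for the rest of the session,
--    and can be indexed and analyzed like any other table
CREATE TEMP TABLE paid_totals AS
SELECT user_id, SUM(amount) AS total
FROM demo_sales
WHERE status = 'paid'
GROUP BY user_id;
CREATE INDEX ON paid_totals (user_id);
ANALYZE paid_totals;
SELECT * FROM paid_totals WHERE user_id = 1;

-- 3) View: only the definition is stored; the query runs again on every use
CREATE TEMP VIEW paid_totals_v AS
SELECT user_id, SUM(amount) AS total
FROM demo_sales
WHERE status = 'paid'
GROUP BY user_id;
SELECT * FROM paid_totals_v;
```

Rows inserted into `demo_sales` after step 2 show up in the view but not in `paid_totals`, which is a snapshot.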
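1.4 The Core Value of CTEs
| Dimension | Value |
|---|---|
| Readability | Turns "spaghetti SQL" into "Lego-brick logic" |
| Maintainability | Change one place, and every dependent step follows |
| Performance | Avoids repeated computation; PostgreSQL 12+ can choose whether to materialize |
| Capability | Recursive queries unlock tree and graph processing |
| Engineering | Lets SQL be used as a genuinely declarative programming language |
🚀 Action item:
The next time you face a complex query, first ask yourself:
"Can I express this logic clearly in 3 to 5 CTE steps?" If the answer is yes, you have grasped the essence of modern SQL.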
2. Basic CTE Syntax
sql
WITH cte_name AS (
-- subquery
SELECT ...
)
SELECT ... FROM cte_name;
2.1 Chaining Multiple CTEs
sql
WITH
step1 AS (SELECT ...),
step2 AS (SELECT ... FROM step1),
step3 AS (SELECT ... FROM step2)
SELECT * FROM step3;
Note: CTEs are defined in order, and a later CTE can reference any CTE defined before it
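To make the skeleton concrete, here is a small runnable sketch of a three-step chain. The data and the names `raw_scores`, `passed`, and `ranked` are invented for illustration.

```sql
WITH
-- Step 1: inline sample data (hypothetical)
raw_scores(student, score) AS (
  VALUES ('ann', 82), ('bob', 47), ('cy', 91), ('dee', 65)
),
-- Step 2: reads from step 1
passed AS (
  SELECT student, score
  FROM raw_scores
  WHERE score >= 60
),
-- Step 3: reads from step 2
ranked AS (
  SELECT student, score,
         RANK() OVER (ORDER BY score DESC) AS score_rank
  FROM passed
)
SELECT * FROM ranked ORDER BY score_rank;
-- cy  | 91 | 1
-- ann | 82 | 2
-- dee | 65 | 3
```

Replacing the final `SELECT` with `SELECT * FROM passed` is a quick way to inspect an intermediate step while debugging.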
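3. Practical Scenarios
3.1 Scenario 1: User Activity Metrics (Replacing Nested Subqueries for Readability)
Requirements:
- Find US users who registered on or after 2023-01-01
- Count their paid orders and total spend
- Keep only high-value users (spend > $1000)
1. Traditional approach (nested subquery)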
sql
SELECT
u.name,
paid_orders,
total_spent
FROM users u
JOIN (
SELECT
user_id,
COUNT(*) AS paid_orders,
SUM(amount) AS total_spent
FROM orders
WHERE status = 'paid'
GROUP BY user_id
HAVING SUM(amount) > 1000
) o ON u.id = o.user_id
WHERE u.country = 'US'
AND u.created_at >= '2023-01-01';
2. CTE approach (clear, named steps)
sql
WITH
new_us_users AS (
SELECT id, name
FROM users
WHERE country = 'US'
AND created_at >= '2023-01-01'
),
user_spending AS (
SELECT
user_id,
COUNT(*) AS paid_orders,
SUM(amount) AS total_spent
FROM orders
WHERE status = 'paid'
GROUP BY user_id
HAVING SUM(amount) > 1000
)
SELECT
u.name,
s.paid_orders,
s.total_spent
FROM new_us_users u
JOIN user_spending s ON u.id = s.user_id;
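Advantages:
- Layered logic: filter the users first, then compute spending
- Intermediate results have clear names (`new_us_users`, `user_spending`)
- Each CTE is easy to test on its own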
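3.2 Scenario 2: Multi-Dimensional Aggregation Without Repeated Computation (Performance)
Requirements:
- Compute monthly sales for each product
- Also output: rank within the month, cumulative sales, and month-over-month growth
1. Single-query approach (all window logic mixed together)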
sql
SELECT
product_id,
month,
sales,
RANK() OVER (PARTITION BY month ORDER BY sales DESC) AS monthly_rank,
SUM(sales) OVER (PARTITION BY product_id ORDER BY month ROWS UNBOUNDED PRECEDING) AS cum_sales, -- running total per product
(sales - LAG(sales) OVER (PARTITION BY product_id ORDER BY month)) /
NULLIF(LAG(sales) OVER (PARTITION BY product_id ORDER BY month), 0) AS mom_growth
FROM (
SELECT
product_id,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS sales
FROM orders
GROUP BY product_id, DATE_TRUNC('month', order_date)
) t;
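⚠️ This scans `orders` only once, but the window-function logic is tangled together and hard to extend
2. CTE approach (step-by-step computation)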
sql
WITH
monthly_sales AS (
SELECT
product_id,
DATE_TRUNC('month', order_date)::DATE AS month,
SUM(amount) AS sales
FROM orders
GROUP BY product_id, DATE_TRUNC('month', order_date)
),
ranked_sales AS (
SELECT *,
RANK() OVER (PARTITION BY month ORDER BY sales DESC) AS monthly_rank
FROM monthly_sales
),
cumulative_sales AS (
SELECT *,
SUM(sales) OVER (PARTITION BY product_id ORDER BY month ROWS UNBOUNDED PRECEDING) AS cum_sales -- running total per product
FROM ranked_sales
)
SELECT
product_id,
month,
sales,
monthly_rank,
cum_sales,
(sales - LAG(sales) OVER (PARTITION BY product_id ORDER BY month)) /
NULLIF(LAG(sales) OVER (PARTITION BY product_id ORDER BY month), 0) AS mom_growth
FROM cumulative_sales
ORDER BY product_id, month;
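Advantages:
- Each CTE has a single responsibility
- Later steps build directly on earlier results (for example, `cumulative_sales` is based on `ranked_sales`)
- Adding a metric means adding a CTE, without touching the existing logic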
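The last point can be sketched as follows: a three-month moving average is added as one more step on top of `monthly_sales`. The sample rows, the `moving_avg` CTE, and the `avg_3m` column are assumptions for illustration; the `orders` CTE only stands in for the real table.

```sql
WITH
-- Hypothetical sample data standing in for the orders table
orders(product_id, order_date, amount) AS (
  VALUES
    (1, DATE '2023-01-10', 100.0),
    (1, '2023-02-05', 150.0),
    (1, '2023-03-20', 120.0),
    (2, '2023-01-15', 200.0),
    (2, '2023-02-25', 100.0)
),
monthly_sales AS (
  SELECT
    product_id,
    DATE_TRUNC('month', order_date)::date AS month,
    SUM(amount) AS sales
  FROM orders
  GROUP BY 1, 2
),
-- New metric: average of the current month and the two before it, per product
moving_avg AS (
  SELECT *,
         AVG(sales) OVER (
           PARTITION BY product_id
           ORDER BY month
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
         ) AS avg_3m
  FROM monthly_sales
)
SELECT product_id, month, sales, ROUND(avg_3m, 2) AS avg_3m
FROM moving_avg
ORDER BY product_id, month;
```

The existing steps are unchanged; only the new CTE and the final column list differ.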
3.3 Scenario 3: Org-Chart Queries (Tree Structures) with Recursive CTEs
Table structure:
sql
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100),
manager_id INT REFERENCES employees(id)
);
1. Requirement 1: find all subordinates of an employee (recursing downward)
sql
WITH RECURSIVE subordinates AS (
-- Anchor: starting node (the CEO)
SELECT id, name, manager_id, 1 AS level
FROM employees
WHERE id = 1 -- CEO ID
UNION ALL
-- Recursive: expand subordinates level by level
SELECT e.id, e.name, e.manager_id, s.level + 1
FROM employees e
JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;
2. Requirement 2: find an employee's full reporting line (recursing upward)
sql
WITH RECURSIVE reporting_line AS (
-- Anchor: the target employee
SELECT id, name, manager_id, 1 AS depth
FROM employees
WHERE id = 10 -- target employee ID
UNION ALL
-- Recursive: walk up to the manager
SELECT e.id, e.name, e.manager_id, rl.depth + 1
FROM employees e
JOIN reporting_line rl ON e.id = rl.manager_id
WHERE rl.manager_id IS NOT NULL -- explicit stop at the root; the join already matches nothing for NULL
)
SELECT * FROM reporting_line;
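Key points:
- The `RECURSIVE` keyword enables recursion
- `UNION ALL` joins the anchor member and the recursive member
- Recursion ends when the recursive member returns no rows (here, once the root with a NULL `manager_id` is reached); cyclic data needs an explicit guard such as a depth limit, because `manager_id IS NOT NULL` alone does not stop a cycle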
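A common way to make the downward query robust is to carry the visited path along and refuse to revisit a node. The sketch below assumes four made-up employees; the `path` column, the depth cap of 50, and the indented `org_chart` output are illustrative choices, not part of the original queries.

```sql
WITH RECURSIVE
-- Hypothetical sample data standing in for the employees table
employees(id, name, manager_id) AS (
  VALUES
    (1, 'Alice', NULL::int),
    (2, 'Bob',   1),
    (3, 'Carol', 1),
    (4, 'Dan',   2)
),
subordinates AS (
  -- Anchor: the root of the subtree, with a one-element path
  SELECT id, name, 1 AS level, ARRAY[id] AS path
  FROM employees
  WHERE id = 1
  UNION ALL
  -- Recursive: append each direct report to its manager's path
  SELECT e.id, e.name, s.level + 1, s.path || e.id
  FROM employees e
  JOIN subordinates s ON e.manager_id = s.id
  WHERE e.id <> ALL (s.path)  -- cycle guard: skip anyone already on this path
    AND s.level < 50          -- hard depth cap; 50 is an arbitrary choice
)
SELECT repeat('  ', level - 1) || name AS org_chart, level, path
FROM subordinates
ORDER BY path;
-- Rows come back as Alice, Bob, Dan, Carol (depth-first, because arrays sort element by element)
```

Sorting by `path` is what produces the tree-shaped ordering; without it the rows arrive level by level.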
3.4 Scenario 4: Bill of Materials (BOM) Explosion
Table structure:
sql
CREATE TABLE bom (
parent_part VARCHAR(50),
child_part VARCHAR(50),
quantity INT
);
Find every sub-component of the product 'CAR' and the total quantity required:
sql
WITH RECURSIVE exploded_bom AS (
-- Anchor: the finished product
SELECT
parent_part AS top_part,
child_part,
quantity,
1 AS level
FROM bom
WHERE parent_part = 'CAR'
UNION ALL
-- Recursive: expand sub-components
SELECT
eb.top_part,
b.child_part,
eb.quantity * b.quantity, -- cumulative quantity along this path
eb.level + 1
FROM bom b
JOIN exploded_bom eb ON b.parent_part = eb.child_part
)
SELECT
top_part,
child_part,
SUM(quantity) AS total_quantity
FROM exploded_bom
GROUP BY top_part, child_part
ORDER BY total_quantity DESC;
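A worked example shows why the final `GROUP BY` with `SUM` matters: a part reached through several paths appears once per path in `exploded_bom`. The four BOM rows below are invented for illustration, and the `bom` CTE stands in for the real table.

```sql
WITH RECURSIVE
-- Hypothetical sample data: a car has 4 wheels and 1 engine;
-- each wheel needs 5 bolts and the engine needs 10
bom(parent_part, child_part, quantity) AS (
  VALUES
    ('CAR',    'WHEEL',  4),
    ('CAR',    'ENGINE', 1),
    ('WHEEL',  'BOLT',   5),
    ('ENGINE', 'BOLT',  10)
),
exploded_bom AS (
  SELECT parent_part AS top_part, child_part, quantity, 1 AS level
  FROM bom
  WHERE parent_part = 'CAR'
  UNION ALL
  SELECT eb.top_part, b.child_part, eb.quantity * b.quantity, eb.level + 1
  FROM bom b
  JOIN exploded_bom eb ON b.parent_part = eb.child_part
)
SELECT child_part, SUM(quantity) AS total_quantity
FROM exploded_bom
GROUP BY child_part
ORDER BY total_quantity DESC;
-- BOLT   | 30   (4 * 5 via WHEEL, plus 1 * 10 via ENGINE)
-- WHEEL  |  4
-- ENGINE |  1
```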
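3.5 Scenario 5: Sessionizing User Event Logs (Data Cleaning and Preprocessing)
Raw data: user_events (user_id, event_time, event_type)
Requirement: group consecutive events into the same session when the gap between them is at most 30 minutes
1. Step-by-step CTE implementation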
sql
WITH
ordered_events AS (
-- For each event, fetch the same user's previous event time
SELECT
user_id,
event_time,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
FROM user_events
),
session_flags AS (
-- Flag the start of a new session (gap to the previous event > 30 minutes)
SELECT
user_id,
event_time,
CASE
WHEN prev_time IS NULL
OR EXTRACT(EPOCH FROM (event_time - prev_time)) > 1800 -- 30 minutes = 1800 seconds
THEN 1
ELSE 0
END AS is_new_session
FROM ordered_events
),
session_ids AS (
-- A running sum of the flags yields the session ID
SELECT
user_id,
event_time,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS UNBOUNDED PRECEDING
) AS session_id
FROM session_flags
)
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS events_count
FROM session_ids
GROUP BY user_id, session_id
ORDER BY user_id, session_start;
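Advantages:
- The complex logic is broken into: order → flag → accumulate → aggregate
- Each step can be verified on its own (for example, checking that `is_new_session` is correct)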
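Checking one step in isolation might look like the sketch below, which stops at `session_flags` and prints the flag next to the gap that produced it. The four events and the extra `gap` column are assumptions for illustration; the comparison is written with an interval literal, which is equivalent to the 1800-second check.

```sql
WITH
-- Hypothetical sample data standing in for user_events
user_events(user_id, event_time) AS (
  VALUES
    (1, TIMESTAMP '2023-10-01 10:00'),
    (1, TIMESTAMP '2023-10-01 10:10'),
    (1, TIMESTAMP '2023-10-01 11:00'),
    (1, TIMESTAMP '2023-10-01 11:05')
),
ordered_events AS (
  SELECT
    user_id,
    event_time,
    LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
  FROM user_events
),
session_flags AS (
  SELECT
    user_id,
    event_time,
    event_time - prev_time AS gap,  -- shown only to make the flag easy to check
    CASE
      WHEN prev_time IS NULL
        OR event_time - prev_time > INTERVAL '30 minutes'
      THEN 1
      ELSE 0
    END AS is_new_session
  FROM ordered_events
)
SELECT * FROM session_flags ORDER BY user_id, event_time;
-- is_new_session for the four rows: 1, 0, 1, 0
-- (first event, 10-minute gap, 50-minute gap, 5-minute gap)
```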
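4. PostgreSQL 12+ Advanced Feature: Controlling CTE Materialization
4.1 Default Behavior (Before PostgreSQL 12)
- A CTE is always materialized: its result is computed separately and held in a work area (spilling to temporary files when large), and the outer query reads from that result; filters in the outer query are not pushed down into the CTE
- Advantage: the CTE is not recomputed when it is referenced several times
- Drawback: can waste I/O when the outer query needs only a small part of the result
4.2 New Control Options (PostgreSQL 12+)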
sql
-- Force materialization (same as the pre-12 behavior)
WITH cte_name AS MATERIALIZED ( ... )
-- Forbid materialization (the CTE is inlined, like a subquery)
WITH cte_name AS NOT MATERIALIZED ( ... )
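4.3 Performance Comparison Example
sql
-- Scenario: the CTE covers 1 million rows, but the outer query filters down to one
-- (id is assumed to be an indexed column of huge_table)
-- Note: a bare LIMIT would not show the difference, because PostgreSQL
-- evaluates only as many CTE rows as the outer query actually fetches
-- MATERIALIZED: the filter is not pushed down, so the whole table is read (slow)
WITH large_cte AS MATERIALIZED (
SELECT * FROM huge_table
)
SELECT * FROM large_cte WHERE id = 42;
-- NOT MATERIALIZED: the CTE is inlined, so the planner can use the index on id (fast)
WITH large_cte AS NOT MATERIALIZED (
SELECT * FROM huge_table
)
SELECT * FROM large_cte WHERE id = 42;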
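💡 Recommendations:
- Omit the keyword by default: PostgreSQL 12+ inlines a non-recursive, side-effect-free CTE that is referenced once, and materializes one that is referenced more than once
- Use `MATERIALIZED` when the CTE is expensive and should be computed exactly once, or when you want the pre-12 optimization-fence behavior
- Use `NOT MATERIALIZED` when the CTE is large but the outer query applies a selective filter to it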
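One way to see which strategy the planner picked is to compare the plans directly. This is a minimal sketch for PostgreSQL 12+, assuming a throwaway table `cte_demo` created just for the experiment; exact plan shapes and costs depend on version and settings.

```sql
-- Hypothetical demo table: 100,000 rows with an index on id
CREATE TEMP TABLE cte_demo AS
SELECT g AS id, md5(g::text) AS payload
FROM generate_series(1, 100000) AS g;
CREATE INDEX ON cte_demo (id);
ANALYZE cte_demo;

-- Materialized: the plan typically shows a "CTE Scan on w" with the filter
-- applied on top of a full scan of cte_demo
EXPLAIN
WITH w AS MATERIALIZED (SELECT * FROM cte_demo)
SELECT * FROM w WHERE id = 42;

-- Inlined: the filter reaches the base table, so the plan typically shows
-- an index scan on cte_demo and no CTE Scan node
EXPLAIN
WITH w AS NOT MATERIALIZED (SELECT * FROM cte_demo)
SELECT * FROM w WHERE id = 42;
```

Swapping `EXPLAIN` for `EXPLAIN ANALYZE` runs the queries and adds actual timings to each plan node.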
5. Common Pitfalls and Best Practices
5.1 Pitfall 1: DML (Data Modification) Inside a CTE
PostgreSQL allows INSERT/UPDATE/DELETE inside a CTE (often called a "writable CTE"):
sql
-- Delete expired orders and return how many were deleted
WITH deleted_orders AS (
DELETE FROM orders
WHERE created_at < '2020-01-01'
RETURNING id
)
SELECT COUNT(*) FROM deleted_orders;
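Risks:
- The modification is hidden inside what looks like a query and is easy to overlook
- It can cause unintended data changes: the data-modifying statement runs to completion even if the outer query does not read all of its output
Recommendations:
- Use it only when necessary (e.g. batch cleanup)
- Add explicit comments
- Avoid mixing DML into business queries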
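Where a writable CTE is justified, a typical use is moving rows between tables in one statement. The sketch below assumes two throwaway tables, `demo_orders` and `demo_orders_archive`, and wraps everything in a transaction that is rolled back, so it can be run as a dry run without leaving any change behind.

```sql
BEGIN;

-- Hypothetical tables for the demonstration
CREATE TEMP TABLE demo_orders (id int PRIMARY KEY, created_at date);
CREATE TEMP TABLE demo_orders_archive (LIKE demo_orders);
INSERT INTO demo_orders VALUES
  (1, '2019-05-01'),
  (2, '2019-11-30'),
  (3, '2021-02-14');

-- WARNING: this statement modifies data.
-- Rows deleted by "moved" are inserted into the archive in the same statement.
WITH moved AS (
  DELETE FROM demo_orders
  WHERE created_at < '2020-01-01'
  RETURNING *
),
archived AS (
  INSERT INTO demo_orders_archive
  SELECT * FROM moved
  RETURNING id
)
SELECT COUNT(*) AS archived_rows FROM archived;  -- 2 with the sample rows

SELECT * FROM demo_orders;  -- only id = 3 remains inside the transaction

ROLLBACK;  -- undo everything, including the table creation
```

Replacing `ROLLBACK` with `COMMIT` makes the move permanent, so reviewing the counts before committing is a cheap safety check.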
5.2 Pitfall 2: Infinite Loops in Recursive CTEs
sql
-- Wrong: no termination condition
WITH RECURSIVE infinite_loop AS (
SELECT 1 AS n
UNION ALL
SELECT n + 1 FROM infinite_loop
)
SELECT * FROM infinite_loop; -- never terminates on its own!
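Safeguards:
- Cap the statement's run time (PostgreSQL has no setting for maximum recursion depth):
sql
SET statement_timeout = '5s';
- Add termination logic to the recursive member (e.g. `level < 10`)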
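The safeguards can be sketched together as below. The `counter`, `edges`, and `walk` names and the sample graph are assumptions for illustration, and the third statement uses the `CYCLE` clause, which requires PostgreSQL 14 or later.

```sql
-- 1) Bound the recursion inside the recursive member
WITH RECURSIVE counter AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n + 1 FROM counter WHERE n < 10  -- stops once n reaches 10
)
SELECT COUNT(*) FROM counter;  -- 10

-- 2) Safety net: cancel any statement in this transaction that runs over 2 seconds
BEGIN;
SET LOCAL statement_timeout = '2s';  -- reverts automatically at COMMIT or ROLLBACK
WITH RECURSIVE counter AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n + 1 FROM counter WHERE n < 1000
)
SELECT MAX(n) FROM counter;
COMMIT;

-- 3) PostgreSQL 14+: let the CYCLE clause detect loops in graph data
WITH RECURSIVE
-- Hypothetical graph containing the loop 1 -> 2 -> 3 -> 1
edges(src, dst) AS (
  VALUES (1, 2), (2, 3), (3, 1)
),
walk AS (
  SELECT src, dst FROM edges WHERE src = 1
  UNION ALL
  SELECT e.src, e.dst
  FROM edges e
  JOIN walk w ON e.src = w.dst
) CYCLE src SET is_cycle USING path
SELECT src, dst, is_cycle FROM walk;
-- The row that closes the loop is flagged with is_cycle = true and is not expanded further
```

The timeout is a last line of defense rather than a substitute for a termination condition, since a cancelled query still consumes resources until it is stopped.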
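5.3 Best-Practice Summary
- Naming: a CTE name should state what it holds (e.g. `active_users`, `monthly_revenue`)
- Single responsibility: each CTE does one thing
- Avoid long chains: beyond about 5 chained CTEs, consider splitting the work into several queries
- Monitor performance: inspect the CTE's execution plan with `EXPLAIN ANALYZE`
- Recurse carefully: make sure there is an explicit termination condition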
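6. Case Study: E-Commerce Funnel Analysis
Requirement: compute the conversion rates for view → add to cart → order → payment
6.1 Table Structure
- page_views (user_id, product_id, view_time)
- cart_adds (user_id, product_id, add_time)
- orders (user_id, order_id, create_time)
- payments (order_id, pay_time)
6.2 CTE Implementation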
sql
WITH
step1_view AS (
SELECT DISTINCT user_id
FROM page_views
WHERE view_time >= '2023-10-01'
),
step2_cart AS (
SELECT DISTINCT ca.user_id
FROM cart_adds ca
JOIN step1_view sv ON ca.user_id = sv.user_id
WHERE ca.add_time >= '2023-10-01'
),
step3_order AS (
SELECT DISTINCT o.user_id
FROM orders o
JOIN step2_cart sc ON o.user_id = sc.user_id
WHERE o.create_time >= '2023-10-01'
),
step4_pay AS (
SELECT DISTINCT o.user_id
FROM orders o
JOIN payments p ON o.order_id = p.order_id
JOIN step3_order so ON o.user_id = so.user_id
WHERE p.pay_time >= '2023-10-01'
)
SELECT
(SELECT COUNT(*) FROM step1_view) AS view_users,
(SELECT COUNT(*) FROM step2_cart) AS cart_users,
(SELECT COUNT(*) FROM step3_order) AS order_users,
(SELECT COUNT(*) FROM step4_pay) AS pay_users,
ROUND(100.0 * (SELECT COUNT(*) FROM step2_cart) / (SELECT COUNT(*) FROM step1_view), 2) AS view_to_cart_rate,
ROUND(100.0 * (SELECT COUNT(*) FROM step4_pay) / (SELECT COUNT(*) FROM step1_view), 2) AS overall_conversion
;
Output:
view_users | cart_users | order_users | pay_users | view_to_cart_rate | overall_conversion
-----------|------------|-------------|-----------|-------------------|-------------------
10000 | 2500 | 1200 | 900 | 25.00 | 9.00
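The query above reports only the first-step and overall rates. As a sketch of how the step-to-step rates could be added, the statement below feeds the four counts from the sample output into a `counts` CTE (a stand-in for the real step CTEs) and guards each division with `NULLIF` so that an empty step yields NULL instead of a division-by-zero error. The column names `cart_to_order_rate` and `order_to_pay_rate` are invented for this example.

```sql
WITH counts(view_users, cart_users, order_users, pay_users) AS (
  -- Figures copied from the sample output above
  VALUES (10000, 2500, 1200, 900)
)
SELECT
  ROUND(100.0 * cart_users  / NULLIF(view_users, 0), 2)  AS view_to_cart_rate,   -- 25.00
  ROUND(100.0 * order_users / NULLIF(cart_users, 0), 2)  AS cart_to_order_rate,  -- 48.00
  ROUND(100.0 * pay_users   / NULLIF(order_users, 0), 2) AS order_to_pay_rate,   -- 75.00
  ROUND(100.0 * pay_users   / NULLIF(view_users, 0), 2)  AS overall_conversion   --  9.00
FROM counts;
```

Collecting the counts in one row first also avoids repeating the `(SELECT COUNT(*) ...)` subqueries for every rate.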