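I. What is CUME_DIST()? (In one sentence)
"Cumulative distribution: the fraction of rows whose value is less than or equal to the current value."
It is like a teacher saying, "60% of the class scored at or below your score." That fraction is CUME_DIST.
Formula: number of rows with value ≤ current value / total number of rows
Suppose 10 students have these scores: 50, 60, 70, 80, 80, 90, 90, 90, 95, 100
50 → 1/10 = 0.10 (10% of students scored ≤ 50)
60 → 2/10 = 0.20 (20% of students scored ≤ 60)
80 → 5/10 = 0.50 (50% of students scored ≤ 80, counting both students tied at 80)
90 → 8/10 = 0.80 (80% of students scored ≤ 90, counting all three students tied at 90)
100 → 10/10 = 1.00 (100% of students scored ≤ 100)
CUME = Cumulative, DIST = Distribution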
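The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula, not how a database computes it; the helper name `cume_dist` is made up for illustration.

```python
from bisect import bisect_right

def cume_dist(values):
    """Map each value to the fraction of values that are <= it."""
    ordered = sorted(values)
    n = len(ordered)
    # In a sorted list, bisect_right(v) is the number of elements <= v
    return {v: bisect_right(ordered, v) / n for v in ordered}

scores = [50, 60, 70, 80, 80, 90, 90, 90, 95, 100]
cd = cume_dist(scores)
for score in sorted(cd):
    print(f"{score:>3} -> {cd[score]:.2f}")
```

Tied scores share one entry because every row in a tie has the same "less than or equal" count.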
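II. How it differs from PERCENT_RANK
The two functions look alike, but they are computed differently:
| Function | Formula | Minimum | Maximum | Meaning | Example (10 rows, no ties) |
|---|---|---|---|---|---|
| PERCENT_RANK() | (rank - 1) / (n - 1) | 0 | 1 | Relative rank position | First row = 0%, last row = 100% |
| CUME_DIST() | count(≤ current) / n | 1/n | 1 | Cumulative fraction | Lowest score = 10%, highest score = 100% |
Key difference:
- PERCENT_RANK is based on rank position
- CUME_DIST is based on a cumulative row count
Side-by-side comparison: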
sql
SELECT
name,
score,
ROUND(PERCENT_RANK() OVER (ORDER BY score ASC)::DECIMAL * 100, 2) AS pr,
ROUND(CUME_DIST() OVER (ORDER BY score ASC)::DECIMAL * 100, 2) AS cd
FROM students;
-- Result for the 8 rows shown (with ties):
-- name | score | pr | cd
-- ------+-------+--------+-------
-- 张三 | 50 | 0.00 | 12.50 ← only 1 of 8 rows ≤ 50
-- 李四 | 60 | 14.29 | 25.00 ← 2 of 8 rows ≤ 60
-- 王五 | 80 | 28.57 | 50.00
-- 赵六 | 80 | 28.57 | 50.00 ← tie: same PR, same CD
-- 钱七 | 90 | 57.14 | 87.50
-- 孙八 | 90 | 57.14 | 87.50 ← another tie group
-- 周九 | 90 | 57.14 | 87.50
-- 吴十 | 100 | 100.00 | 100.00
-- 💡 Observations:
-- 1. Different minimum: PR starts at 0, CD starts at 1/n (12.5% here)
-- 2. Ties: both functions give tied rows the same value
-- 3. Same maximum: both reach 100%
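To check the two columns for the eight listed scores, the comparison can be run against an in-memory SQLite database from Python. This is a sketch under assumptions: SQLite (3.25 or later, for window functions) stands in for PostgreSQL, and rounding is done in Python rather than with the `::DECIMAL` cast, which SQLite does not support.

```python
import sqlite3

scores = [50, 60, 80, 80, 90, 90, 90, 100]  # the eight scores listed above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?)", [(s,) for s in scores])

query = """
    SELECT score,
           PERCENT_RANK() OVER (ORDER BY score ASC) AS pr,
           CUME_DIST()    OVER (ORDER BY score ASC) AS cd
    FROM students
    ORDER BY score
"""
# Convert both fractions to percentages rounded to 2 decimals
result = [(s, round(pr * 100, 2), round(cd * 100, 2))
          for s, pr, cd in conn.execute(query)]
for row in result:
    print(row)
```

With n = 8, PERCENT_RANK divides by n - 1 = 7 and CUME_DIST divides by n = 8, which is why the two columns only agree on the last row.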
III. 8 practical scenarios
Scenario 1: Cumulative distribution (the classic use)
Requirement: analyze how sales amounts are distributed
sql
SELECT
sales_amount,
ROUND(CUME_DIST() OVER (ORDER BY sales_amount ASC)::DECIMAL * 100, 2) AS cumulative_percent
FROM sales_performance
ORDER BY sales_amount;
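Interpretation:
- cumulative_percent = 50: 50% of employees have sales ≤ this amount (the median)
- cumulative_percent = 90: 90% of employees have sales ≤ this amount (the threshold for the top 10%)
Use: plot a cumulative distribution curve to see how concentrated the data is
Scenario 2: Finding the median
Requirement: find the median sales amount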
sql
SELECT
sales_amount,
CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
FROM sales_performance
ORDER BY cd
LIMIT 1 OFFSET (SELECT COUNT(*) FROM sales_performance) / 2;
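A more concise form (PERCENTILE_CONT interpolates between the two middle values when the row count is even):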
sql
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sales_amount) AS median
FROM sales_performance;
Understanding the median through CUME_DIST:
sql
-- The median is the smallest value whose cd reaches 0.5
SELECT * FROM (
SELECT
sales_amount,
CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
FROM sales_performance
) t
WHERE cd >= 0.5
ORDER BY cd
LIMIT 1;
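The "smallest value with cd ≥ 0.5" rule and the interpolated median can disagree when the row count is even. The sketch below uses made-up amounts and a hypothetical helper `cume_dist_median`; Python's `statistics.median` averages the two middle values, which is what PERCENTILE_CONT(0.5) returns for an even count.

```python
import statistics

def cume_dist_median(values):
    """Smallest value whose cumulative fraction is >= 0.5."""
    ordered = sorted(values)
    n = len(ordered)
    for v in ordered:
        cd = sum(1 for x in ordered if x <= v) / n  # fraction of rows <= v
        if cd >= 0.5:
            return v

even_sales = [100, 200, 300, 400]  # hypothetical amounts
odd_sales = [100, 200, 300]

print(cume_dist_median(even_sales), statistics.median(even_sales))
print(cume_dist_median(odd_sales), statistics.median(odd_sales))
```

For an even count the CUME_DIST rule yields the lower of the two middle values, so it always returns a value that exists in the data.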
Scenario 3: Bucketed statistics (quartiles)
Requirement: split employees into 4 groups by sales (Q1, Q2, Q3, Q4)
sql
-- Window functions cannot be used in GROUP BY, so compute cd in a CTE first
WITH ranked AS (
  SELECT
    sales_amount,
    CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
  FROM sales_performance
)
SELECT
  CASE
    WHEN cd <= 0.25 THEN 'Q1 (bottom 25%)'
    WHEN cd <= 0.50 THEN 'Q2 (25-50%)'
    WHEN cd <= 0.75 THEN 'Q3 (50-75%)'
    ELSE 'Q4 (top 25%)'
  END AS quartile,
  COUNT(*) AS emp_count,
  AVG(sales_amount) AS avg_sales,
  MIN(sales_amount) AS min_sales,
  MAX(sales_amount) AS max_sales
FROM ranked
GROUP BY quartile
ORDER BY quartile;
Example result:
quartile | emp_count | avg_sales | min_sales | max_sales
----------------+-----------+-----------+-----------+----------
Q1 (bottom 25%) | 25 | 30000 | 10000 | 45000
Q2 (25-50%) | 25 | 60000 | 46000 | 75000
Q3 (50-75%) | 25 | 90000 | 76000 | 110000
Q4 (top 25%) | 25 | 150000 | 111000 | 300000
Use: a quick check of whether sales are evenly distributed
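Quartile groups built from CUME_DIST only come out equal in size when there are no ties at the boundaries. The sketch below runs the compute-then-group pattern on SQLite through Python, as a stand-in for PostgreSQL; the eight amounts are invented and include a three-way tie.

```python
import sqlite3

sales = [10, 20, 20, 20, 50, 60, 70, 80]  # hypothetical amounts, three rows tied at 20

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_performance (sales_amount INTEGER)")
conn.executemany("INSERT INTO sales_performance VALUES (?)", [(s,) for s in sales])

query = """
    WITH ranked AS (
        SELECT sales_amount,
               CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
        FROM sales_performance
    )
    SELECT CASE
               WHEN cd <= 0.25 THEN 'Q1'
               WHEN cd <= 0.50 THEN 'Q2'
               WHEN cd <= 0.75 THEN 'Q3'
               ELSE 'Q4'
           END AS quartile,
           COUNT(*) AS emp_count
    FROM ranked
    GROUP BY quartile
    ORDER BY quartile
"""
quartiles = dict(conn.execute(query).fetchall())
print(quartiles)
```

The three rows tied at 20 all get cd = 0.5 and land in Q2 together, leaving only one row in Q1. If equal-sized groups matter more than keeping ties together, NTILE(4) is the usual alternative.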
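Scenario 4: Dynamic thresholds
Requirement: assign grades from the actual distribution rather than from fixed amounts
sql
SELECT
  emp_name,
  sales_amount,
  -- DESC: cd is the share of rows with sales >= this row, so small cd = top performers
  CASE
    WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.1 THEN 'S (top 10%)'
    WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.3 THEN 'A (top 30%)'
    WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.6 THEN 'B (top 60%)'
    WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.9 THEN 'C (top 90%)'
    ELSE 'D (bottom 10%)'
  END AS grade
FROM sales_performance;
How this differs from PERCENT_RANK:
- PERCENT_RANK: based on rank position (which place a row is in)
- CUME_DIST: based on cumulative fraction (what share of rows sort at or before this row)
- Without ties the two give similar results; with ties, CUME_DIST places the whole tie group at its last position, while PERCENT_RANK places it at its first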
Scenario 5: Cumulative distribution within groups
Requirement: compute the cumulative distribution inside each department
sql
SELECT
dept_name,
emp_name,
sales_amount,
ROUND(CUME_DIST() OVER (
PARTITION BY dept_name
ORDER BY sales_amount ASC
)::DECIMAL * 100, 2) AS dept_cume_dist
FROM employees
ORDER BY dept_name, dept_cume_dist;
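Example result:
dept_name | emp_name | sales_amount | dept_cume_dist
------------+----------+--------------+---------------
Engineering | 赵六 | 40000 | 50.00
Engineering | 钱七 | 90000 | 100.00 ← highest in Engineering
Sales | 张三 | 50000 | 33.33 ← bottom third of Sales
Sales | 李四 | 80000 | 66.67
Sales | 王五 | 120000 | 100.00 ← highest in Sales
Advantage: each employee is measured against their own department, which makes cross-department comparison fairer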
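With PARTITION BY, the denominator is the row count of each department rather than of the whole table. The sketch below shows this for five illustrative rows (three in Sales, two in Engineering); it assumes SQLite through Python as a stand-in for PostgreSQL and rounds in Python.

```python
import sqlite3

rows = [
    ("Sales", 50000), ("Sales", 80000), ("Sales", 120000),
    ("Engineering", 40000), ("Engineering", 90000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (dept_name TEXT, sales_amount INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

query = """
    SELECT dept_name,
           sales_amount,
           CUME_DIST() OVER (
               PARTITION BY dept_name
               ORDER BY sales_amount ASC
           ) AS cd
    FROM employees
    ORDER BY dept_name, cd
"""
# Each department restarts its own cumulative count
result = [(d, a, round(cd * 100, 2)) for d, a, cd in conn.execute(query)]
for row in result:
    print(row)
```

The lowest earner in a two-person department gets 50%, while the lowest earner in a three-person department gets 33.33%, so small partitions produce coarse steps.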
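Scenario 6: Outlier detection
Requirement: find extremely high or extremely low orders
sql
-- Window functions are not allowed in WHERE, so filter in an outer query
SELECT
  order_no,
  amount,
  ROUND(cd::DECIMAL * 100, 2) AS cume_pct
FROM (
  SELECT
    order_no,
    amount,
    CUME_DIST() OVER (ORDER BY amount ASC) AS cd
  FROM orders
) t
WHERE cd <= 0.01 -- lowest 1% (cd is never below 1/n, so use <=)
   OR cd > 0.99; -- highest 1%
Use: flagging suspicious transactions in a risk-control system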
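The tail filter can be tried on a small synthetic table. This sketch assumes 200 made-up orders with distinct amounts 1 to 200, runs on SQLite through Python as a stand-in for PostgreSQL, and passes the thresholds as parameters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_no INTEGER, amount INTEGER)")
# 200 hypothetical orders: order i has amount i
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i) for i in range(1, 201)])

query = """
    SELECT amount
    FROM (
        SELECT amount,
               CUME_DIST() OVER (ORDER BY amount ASC) AS cd
        FROM orders
    ) t
    WHERE cd <= ? OR cd > ?
    ORDER BY amount
"""
# Keep the lowest 1% and the highest 1% of rows
outliers = [amount for (amount,) in conn.execute(query, (0.01, 0.99))]
print(outliers)
```

Because the smallest possible cd is 1/n, a strict `cd < 0.01` test can never match in a table of 100 rows or fewer, which is why the low tail uses `<=`.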
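Scenario 7: Generating histogram data
Requirement: produce per-bucket counts and cumulative percentages for a report chart
sql
-- Compute cd per row first, then aggregate per bucket
WITH ranked AS (
  SELECT
    FLOOR(sales_amount / 10000.0) * 10000 AS amount_bucket, -- buckets of 10,000
    CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
  FROM sales_performance
)
SELECT
  amount_bucket,
  COUNT(*) AS bucket_count,
  ROUND(MAX(cd)::DECIMAL * 100, 2) AS cumulative_pct -- share of rows up to the end of this bucket
FROM ranked
GROUP BY amount_bucket
ORDER BY amount_bucket;
Use: drawing a histogram with a cumulative curve in Excel or a BI tool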
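The shape of the histogram data (bucket, row count, cumulative percentage) can be reproduced in plain Python. The amounts below are invented, and the bucket is taken as the lower bound of each 10,000-wide range.

```python
from collections import Counter
from itertools import accumulate

sales = [12000, 18000, 25000, 31000, 34000, 39000, 52000, 58000]  # hypothetical amounts
bucket_size = 10000

# Lower bound of each amount's bucket, e.g. 18000 -> 10000
counts = Counter((s // bucket_size) * bucket_size for s in sales)
buckets = sorted(counts)
# Running row count up to and including each bucket
running = list(accumulate(counts[b] for b in buckets))

histogram = [
    (bucket, counts[bucket], round(total / len(sales) * 100, 2))
    for bucket, total in zip(buckets, running)
]
for row in histogram:
    print(row)
```

A bucket's cumulative percentage equals the largest CUME_DIST among its rows. Empty buckets (40000 here) do not appear at all, so a chart that needs them has to fill the gaps separately.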
Scenario 8: Comparing the distributions of two data sets
Requirement: compare this year's and last year's sales distributions
sql
SELECT
year,
sales_amount,
ROUND(CUME_DIST() OVER (
PARTITION BY year
ORDER BY sales_amount ASC
)::DECIMAL * 100, 2) AS cume_dist
FROM yearly_sales
WHERE year IN (2025, 2026)
ORDER BY year, cume_dist;
Use: plot both cumulative curves on one chart to compare the two years directly
IV. Core syntax
sql
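CUME_DIST() OVER (
  PARTITION BY column1, column2 -- optional: grouping columns
  ORDER BY column3 [ASC | DESC] -- needed in practice: sort order
)
Key points:
- No arguments: the parentheses of CUME_DIST() stay empty
- Must be paired with OVER(): this marks it as a window function
- ORDER BY sets the direction of accumulation; in PostgreSQL, leaving it out makes all rows peers, so every row gets 1.0
- Returns a fraction from 1/n to 1.0 (never 0); usually multiplied by 100 to give a percentage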
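The effect of leaving out ORDER BY can be seen in a tiny experiment. This sketch runs on SQLite through Python, on the assumption that it treats an empty window the same way as PostgreSQL: every row is a peer of every other row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?)", [(s,) for s in (50, 60, 70, 80)])

# No ORDER BY: all rows are peers, so the cumulative fraction is always n/n
no_order = [cd for (cd,) in conn.execute(
    "SELECT CUME_DIST() OVER () FROM students")]

# With ORDER BY: the fraction grows from 1/n to 1.0
with_order = [cd for (cd,) in conn.execute(
    "SELECT CUME_DIST() OVER (ORDER BY score ASC) FROM students ORDER BY score")]

print(no_order)
print(with_order)
```

The query without ORDER BY is accepted, but its result carries no information, so treating ORDER BY as mandatory is the safer habit.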
V. The formula in detail
CUME_DIST = count(rows where value <= current_value) / total_rows
Where:
- Numerator: number of rows with value ≤ the current value (the current row included); with ORDER BY ... DESC the comparison becomes ≥
- Denominator: total number of rows in the window partition
Worked example:
sql
-- Assume 5 rows
SELECT
  name,
  score,
  -- With ORDER BY, the default frame runs through the current row's last peer,
  -- so this counts the rows with score <= the current score
  COUNT(*) OVER (ORDER BY score ASC) AS count_le,
  COUNT(*) OVER () AS total,
  COUNT(*) OVER (ORDER BY score ASC)::DECIMAL / COUNT(*) OVER () AS manual_cd,
  CUME_DIST() OVER (ORDER BY score ASC) AS auto_cd
FROM students;
-- Result:
-- name | score | count_le | total | manual_cd | auto_cd
-- ------+-------+----------+-------+-----------+---------
-- 张三 | 50 | 1 | 5 | 1/5=0.20 | 0.20
-- 李四 | 60 | 2 | 5 | 2/5=0.40 | 0.40
-- 王五 | 70 | 3 | 5 | 3/5=0.60 | 0.60
-- 赵六 | 80 | 4 | 5 | 4/5=0.80 | 0.80
-- 钱七 | 90 | 5 | 5 | 5/5=1.00 | 1.00
Special cases:
- Ties: all tied rows get the same CUME_DIST, because their "less than or equal" count is the same
- A single row: 1/1 = 1.0 (never NULL)
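Both special cases can be checked by computing the formula by hand next to the built-in function. The sketch below uses SQLite through Python as a stand-in for PostgreSQL; the helper `cume_dists` and its correlated-subquery count are illustrative, not the way an engine evaluates the function.

```python
import sqlite3

def cume_dists(scores):
    """Return (score, manual_cd, builtin_cd) rows for a list of scores."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (score INTEGER)")
    conn.executemany("INSERT INTO students VALUES (?)", [(s,) for s in scores])
    query = """
        SELECT s.score,
               -- rows with score <= current score, divided by all rows
               (SELECT COUNT(*) FROM students s2 WHERE s2.score <= s.score) * 1.0
                   / (SELECT COUNT(*) FROM students) AS manual_cd,
               CUME_DIST() OVER (ORDER BY s.score ASC) AS builtin_cd
        FROM students s
        ORDER BY s.score
    """
    return conn.execute(query).fetchall()

tied = cume_dists([50, 60, 80, 80, 80, 90])  # three rows tied at 80
single = cume_dists([70])                    # one-row table

for row in tied + single:
    print(row)
```

The three rows tied at 80 each report 5/6, and the one-row table reports 1.0 rather than NULL.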
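VI. Performance optimization
1. Avoid repeating the window expression
sql
-- ❌ Repetitive (and possibly slower, depending on the engine): CUME_DIST is written twice
SELECT
  emp_name,
  CUME_DIST() OVER (ORDER BY sales DESC) AS cd,
  CASE
    WHEN CUME_DIST() OVER (ORDER BY sales DESC) <= 0.1 THEN 'Excellent'
    ELSE 'Average'
  END AS level
FROM employees;
-- ✅ Cleaner: compute it once in a subquery or CTE
WITH ranked AS (
  SELECT
    emp_name,
    sales,
    -- DESC so that cd <= 0.1 selects the top 10% of sales
    CUME_DIST() OVER (ORDER BY sales DESC) AS cd
  FROM employees
)
SELECT
  emp_name,
  cd,
  CASE WHEN cd <= 0.1 THEN 'Excellent' ELSE 'Average' END AS level
FROM ranked;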
2. Use indexes sensibly
sql
-- Index the ORDER BY column (this may let the planner skip a sort)
CREATE INDEX idx_employees_sales ON employees (sales_amount);
-- Composite index for PARTITION BY + ORDER BY
CREATE INDEX idx_emp_dept_sales ON employees (dept_id, sales_amount);
VII. Common mistakes
Mistake 1: Forgetting to multiply by 100
sql
-- ❌ Hard to read: what does 0.85 mean?
SELECT CUME_DIST() OVER (ORDER BY score) AS cd FROM students;
-- ✅ Clear: 85.00 reads directly as a percentage
SELECT ROUND(CUME_DIST() OVER (ORDER BY score)::DECIMAL * 100, 2) AS cumulative_percent
FROM students;
Mistake 2: Confusing ASC and DESC
sql
-- Scenario: higher scores are better (10 students, no ties)
SELECT
  name,
  score,
  ROUND(CUME_DIST() OVER (ORDER BY score DESC)::DECIMAL * 100, 2) AS cd_desc,
  ROUND(CUME_DIST() OVER (ORDER BY score ASC)::DECIMAL * 100, 2) AS cd_asc
FROM students;
-- Result:
-- name | score | cd_desc | cd_asc
-- ------+-------+---------+--------
-- 张三 | 100 | 10.00 | 100.00 ← highest score
-- 李四 | 60 | 100.00 | 10.00 ← lowest score
-- 💡 How to remember:
-- ORDER BY ... DESC: cd = share of rows ≥ this score, so high scores get a LOW cd ("top x%")
-- ORDER BY ... ASC: cd = share of rows ≤ this score, so low scores get a LOW cd ("bottom x%")
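The direction rule is easy to verify on ten distinct scores. This sketch uses invented scores and runs on SQLite through Python as a stand-in for PostgreSQL, with rounding done in Python.

```python
import sqlite3

scores = [60, 65, 70, 75, 80, 85, 90, 92, 95, 100]  # hypothetical, no ties

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?)", [(s,) for s in scores])

query = """
    SELECT score,
           CUME_DIST() OVER (ORDER BY score DESC) AS cd_desc,
           CUME_DIST() OVER (ORDER BY score ASC)  AS cd_asc
    FROM students
    ORDER BY score DESC
"""
# score -> (cd_desc %, cd_asc %)
result = {score: (round(d * 100, 2), round(a * 100, 2))
          for score, d, a in conn.execute(query)}

print("highest:", result[100])
print("lowest: ", result[60])
```

With DESC the top scorer sits at 10% (only one of ten rows is at or above 100), so a filter such as `cd <= 0.1` selects the best rows under DESC and the worst rows under ASC.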
Mistake 3: Using it directly in WHERE
sql
-- ❌ Wrong: window functions are not allowed in WHERE
SELECT * FROM employees
WHERE CUME_DIST() OVER (ORDER BY salary ASC) <= 0.1;
-- ✅ Correct: wrap it in a subquery
SELECT * FROM (
SELECT *, CUME_DIST() OVER (ORDER BY salary ASC) AS cd
FROM employees
) t
WHERE cd <= 0.1;
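Both forms can be tried side by side. The sketch below uses SQLite through Python as a stand-in for PostgreSQL, with ten made-up employees. The exact error text differs between engines, so the code only records that the direct form is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(f"emp{i}", i * 1000) for i in range(1, 11)])

# Direct use in WHERE: the statement is rejected before it runs
try:
    conn.execute(
        "SELECT * FROM employees "
        "WHERE CUME_DIST() OVER (ORDER BY salary ASC) <= 0.1")
    direct_error = None
except sqlite3.OperationalError as exc:
    direct_error = str(exc)

# Subquery form: compute cd first, then filter on it
wrapped = conn.execute("""
    SELECT emp_name
    FROM (
        SELECT emp_name,
               CUME_DIST() OVER (ORDER BY salary ASC) AS cd
        FROM employees
    ) t
    WHERE cd <= ?
""", (0.1,)).fetchall()

print(direct_error)
print(wrapped)
```

The filter has to live in an outer query because WHERE is evaluated before window functions are computed.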
Mistake 4: Misunderstanding how ties are handled
sql
-- Suppose 3 people are tied at 80
SELECT
  name,
  score,
  CUME_DIST() OVER (ORDER BY score ASC) AS cd
FROM students;
-- Result (6 rows, cd rounded to 2 decimals):
-- name | score | cd
-- ------+-------+------
-- 张三 | 50 | 0.17
-- 李四 | 60 | 0.33
-- 王五 | 80 | 0.83 ← all three get 0.83
-- 赵六 | 80 | 0.83
-- 钱七 | 80 | 0.83
-- 孙八 | 90 | 1.00
-- 💡 Why:
-- 5 rows have score ≤ 80 (including the 3 rows tied at 80)
-- so cd = 5/6 ≈ 0.83
-- every row with score 80 shares that value
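VIII. Choosing between CUME_DIST and PERCENT_RANK
| Scenario | Recommended function | Reason |
|---|---|---|
| Computing a median | ✅ CUME_DIST() | Take the smallest value with cd ≥ 0.5 |
| Cumulative distribution curve | ✅ CUME_DIST() | Closer to the statistical definition |
| Percentile rank | ✅ PERCENT_RANK() | Based on rank position, more intuitive |
| Outlier detection | Either | Similar results |
| Grade assignment | Either | A matter of preference |
| Many tied values | ✅ CUME_DIST() | Smoother results: each tie group is counted in its own cumulative share |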
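IX. Mnemonic
CUME_DIST is the cumulative share: what fraction is less than or equal to you
The median sits at 0.5, and quartiles make easy groups
Tied rows get equal treatment, so the distribution stays smooth
ASC or DESC sets the direction, so do not mix up high and low
X. Summary
Key points
- CUME_DIST() = cumulative distribution (the fraction of rows ≤ the current value)
- Typical uses = medians, quartiles, cumulative distribution curves, outlier detection
- Return value = a fraction from 1/n to 1.0, usually multiplied by 100 to give a percentage
- Formula = rows with value ≤ current value / total rows
- When to use = whenever a "cumulative share" or a distribution analysis is needed
Quick reference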
sql
-- Basic template
SELECT
  column_list,
  ROUND(CUME_DIST() OVER (
    PARTITION BY partition_column -- optional
    ORDER BY sort_column ASC -- sets the direction
  )::DECIMAL * 100, 2) AS cumulative_percent
FROM table_name;
-- Median
SELECT * FROM (
  SELECT *, CUME_DIST() OVER (ORDER BY value ASC) AS cd
  FROM table_name
) t WHERE cd >= 0.5
ORDER BY cd LIMIT 1;
-- Quartile groups (window functions cannot be used in GROUP BY, so compute cd first)
SELECT
  CASE
    WHEN cd <= 0.25 THEN 'Q1'
    WHEN cd <= 0.50 THEN 'Q2'
    WHEN cd <= 0.75 THEN 'Q3'
    ELSE 'Q4'
  END AS quartile,
  COUNT(*), AVG(value)
FROM (
  SELECT value, CUME_DIST() OVER (ORDER BY value ASC) AS cd
  FROM table_name
) t
GROUP BY quartile;
Cheat sheet
sql
-- 1. Cumulative distribution
SELECT amount,
  ROUND(CUME_DIST() OVER (ORDER BY amount ASC)::DECIMAL * 100, 2) || '%' AS cume
FROM orders;
-- 2. Median
SELECT * FROM (
  SELECT amount, CUME_DIST() OVER (ORDER BY amount ASC) AS cd
  FROM orders
) t WHERE cd >= 0.5 ORDER BY cd LIMIT 1;
-- 3. Quartile analysis (compute cd in a subquery, then group)
SELECT
  CASE
    WHEN cd <= 0.25 THEN 'Q1'
    WHEN cd <= 0.50 THEN 'Q2'
    WHEN cd <= 0.75 THEN 'Q3'
    ELSE 'Q4'
  END AS q,
  COUNT(*), AVG(sales)
FROM (
  SELECT sales, CUME_DIST() OVER (ORDER BY sales ASC) AS cd
  FROM employees
) t
GROUP BY q;
-- 4. Outlier detection (cd is never below 1/n, so the low tail uses <=)
SELECT * FROM (
  SELECT *, CUME_DIST() OVER (ORDER BY amount ASC) AS cd
  FROM transactions
) t WHERE cd <= 0.01 OR cd > 0.99;