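PostgreSQL CUME_DIST() Window Function: A Complete Guide

I. What is CUME_DIST()? (In one sentence)

"Cumulative distribution: the fraction of rows whose value is less than or equal to the current value."

It is like a teacher saying, "60% of the class scored at or below your score." That fraction is CUME_DIST.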

Formula: (number of rows less than or equal to the current value) / (total number of rows)

Suppose 10 students scored: 50, 60, 70, 80, 80, 90, 90, 90, 95, 100

50  → 1/10  = 0.10  (10% of students scored ≤ 50)
60  → 2/10  = 0.20  (20% of students scored ≤ 60)
80  → 5/10  = 0.50  (50% scored ≤ 80, counting both tied 80s)
90  → 8/10  = 0.80  (80% scored ≤ 90, counting all three tied 90s)
100 → 10/10 = 1.00  (100% scored ≤ 100)

CUME = Cumulative, DIST = Distribution
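
The arithmetic above can be checked directly in PostgreSQL. The query below is a minimal sketch that feeds the ten scores in as inline VALUES; the column name score and the alias t are assumptions made for illustration, not part of any real schema.

```sql
-- Hypothetical inline data: the ten scores listed above.
SELECT DISTINCT
    score,
    CUME_DIST() OVER (ORDER BY score) AS cd   -- rows with score <= current / 10
FROM (VALUES (50), (60), (70), (80), (80),
             (90), (90), (90), (95), (100)) AS t(score)
ORDER BY score;
-- Expected cd per score: 50 → 0.1, 60 → 0.2, 70 → 0.3, 80 → 0.5,
--                        90 → 0.8, 95 → 0.9, 100 → 1
```

DISTINCT only collapses the tied rows for display; the window function still counts all ten rows.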


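II. How it differs from PERCENT_RANK

The two functions look alike, but they are computed differently:

Function       | Formula              | Min | Max | Meaning               | Example (10 rows, no ties)
---------------+----------------------+-----+-----+-----------------------+-----------------------------------
PERCENT_RANK() | (rank - 1) / (n - 1) | 0   | 1   | relative position     | first row = 0%, last row = 100%
CUME_DIST()    | count(≤ current) / n | 1/n | 1   | cumulative proportion | lowest score = 10%, highest = 100%

Key difference:

  • PERCENT_RANK is based on rank position
  • CUME_DIST is based on the cumulative row count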
A side-by-side comparison:

SELECT 
    name,
    score,
    ROUND(PERCENT_RANK() OVER (ORDER BY score ASC)::DECIMAL * 100, 2) AS pr,
    ROUND(CUME_DIST() OVER (ORDER BY score ASC)::DECIMAL * 100, 2) AS cd
FROM students;

-- Result (8 rows, with ties):
-- name      | score | pr     | cd
-- ----------+-------+--------+--------
-- Zhang San | 50    | 0.00   | 12.50   ← only 1 row ≤ 50
-- Li Si     | 60    | 14.29  | 25.00   ← 2 rows ≤ 60
-- Wang Wu   | 80    | 28.57  | 50.00
-- Zhao Liu  | 80    | 28.57  | 50.00   ← tie: same PR, same CD
-- Qian Qi   | 90    | 57.14  | 87.50
-- Sun Ba    | 90    | 57.14  | 87.50   ← a three-way tie
-- Zhou Jiu  | 90    | 57.14  | 87.50
-- Wu Shi    | 100   | 100.00 | 100.00

-- 💡 Observations:
-- 1. Different minimums: PR starts at 0, CD starts at 1/n (12.5% here)
-- 2. Ties: both functions give tied values the same result
-- 3. Same maximum: both reach 100%

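III. 8 practical scenarios

Scenario 1: Computing a cumulative distribution (the classic use)

Requirement: analyze how sales amounts are distributed

SELECT 
    sales_amount,
    ROUND(CUME_DIST() OVER (ORDER BY sales_amount ASC)::DECIMAL * 100, 2) AS cumulative_percent
FROM sales_performance
ORDER BY sales_amount;

How to read it:

  • cumulative_percent = 50: 50% of employees sold this amount or less (roughly the median)
  • cumulative_percent = 90: 90% of employees sold this amount or less (the threshold for the top 10%)

Use: plot a cumulative distribution curve to see how concentrated the data is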
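
To turn the "threshold for the top 10%" reading into a single number, one option is to take the smallest amount whose cumulative share reaches 0.9. The sketch below uses ten made-up amounts in a CTE that stands in for sales_performance; the data and the alias p90_threshold are assumptions.

```sql
-- Hypothetical inline data standing in for the sales_performance table.
WITH sales_performance(sales_amount) AS (
    VALUES (10000), (20000), (30000), (40000), (50000),
           (60000), (70000), (80000), (90000), (100000)
),
dist AS (
    SELECT sales_amount,
           CUME_DIST() OVER (ORDER BY sales_amount) AS cd
    FROM sales_performance
)
-- Smallest amount at or below which 90% of the rows fall.
SELECT MIN(sales_amount) AS p90_threshold
FROM dist
WHERE cd >= 0.9;
-- With the ten rows above this returns 90000 (cd = 9/10).
```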


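Scenario 2: Finding the median

Requirement: find the median sales amount

-- Skips the first n/2 rows; for an even row count this returns the upper of the two middle values
SELECT 
    sales_amount,
    CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
FROM sales_performance
ORDER BY cd
LIMIT 1 OFFSET (SELECT COUNT(*) FROM sales_performance) / 2;

A more concise way (PERCENTILE_CONT interpolates between the two middle values when the row count is even):

SELECT 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sales_amount) AS median
FROM sales_performance;

Understanding the median through CUME_DIST:

-- The median is the first value whose cd reaches 0.5
SELECT * FROM (
    SELECT 
        sales_amount,
        CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
    FROM sales_performance
) t
WHERE cd >= 0.5
ORDER BY cd
LIMIT 1;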
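
The three approaches do not always agree, so it can help to run them side by side. The sketch below reuses the ten scores from the definition at the top of the article as inline data (the CTE names scores and dist are assumptions). It also includes PERCENTILE_DISC, the discrete counterpart of PERCENTILE_CONT, which picks an actual row value in the same way as the cd >= 0.5 filter.

```sql
-- Hypothetical inline data: 50, 60, 70, 80, 80, 90, 90, 90, 95, 100.
WITH scores(score) AS (
    VALUES (50), (60), (70), (80), (80), (90), (90), (90), (95), (100)
),
dist AS (
    SELECT score, CUME_DIST() OVER (ORDER BY score) AS cd
    FROM scores
)
SELECT
    -- First value whose cumulative share reaches 0.5: 80
    (SELECT MIN(score) FROM dist WHERE cd >= 0.5) AS via_cume_dist,
    -- Discrete percentile, also an existing value: 80
    (SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY score)
     FROM scores) AS via_percentile_disc,
    -- Interpolated between the 5th and 6th values (80 and 90): 85
    (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score)
     FROM scores) AS via_percentile_cont;
```

Which one counts as "the" median depends on whether an interpolated value is acceptable for the report.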

Scenario 3: Bucketed statistics (quartiles)

Requirement: split employees into 4 groups by sales (Q1, Q2, Q3, Q4)

-- Window functions are not allowed in GROUP BY, so cd is computed in a CTE first
WITH ranked AS (
    SELECT 
        sales_amount,
        CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
    FROM sales_performance
)
SELECT 
    CASE 
        WHEN cd <= 0.25 THEN 'Q1 (bottom 25%)'
        WHEN cd <= 0.50 THEN 'Q2 (25-50%)'
        WHEN cd <= 0.75 THEN 'Q3 (50-75%)'
        ELSE 'Q4 (top 25%)'
    END AS quartile,
    COUNT(*) AS emp_count,
    AVG(sales_amount) AS avg_sales,
    MIN(sales_amount) AS min_sales,
    MAX(sales_amount) AS max_sales
FROM ranked
GROUP BY quartile
ORDER BY quartile;

Sample result:

quartile        | emp_count | avg_sales | min_sales | max_sales
----------------+-----------+-----------+-----------+----------
Q1 (bottom 25%) | 25        | 30000     | 10000     | 45000
Q2 (25-50%)     | 25        | 60000     | 46000     | 75000
Q3 (50-75%)     | 25        | 90000     | 76000     | 110000
Q4 (top 25%)    | 25        | 150000    | 111000    | 300000

Use: a quick check of whether sales are evenly distributed


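Scenario 4: Dynamic thresholds

Requirement: assign grades from the actual distribution rather than from fixed amounts

SELECT 
    emp_name,
    sales_amount,
    CASE 
        WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.1 THEN 'S (top 10%)'
        WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.3 THEN 'A (top 30%)'
        WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.6 THEN 'B (top 60%)'
        WHEN CUME_DIST() OVER (ORDER BY sales_amount DESC) <= 0.9 THEN 'C (top 90%)'
        ELSE 'D (bottom 10%)'
    END AS grade
FROM sales_performance;

How this differs from PERCENT_RANK:

  • PERCENT_RANK: based on rank position (which place you finished in)
  • CUME_DIST: based on cumulative proportion (what share of rows sit at or before you in the sort order)
  • Without ties the two differ only slightly, (rank - 1)/(n - 1) versus rank/n. With ties, PERCENT_RANK places a tied group at its first position and CUME_DIST at its last, so the same threshold can select different rows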
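
The tie behaviour is easier to see with the intermediate numbers exposed. The following sketch uses eight made-up scores; the CTE name scores and the aliases rnk, peers and cd_manual are assumptions. It rebuilds CUME_DIST from RANK() and the size of each tied group.

```sql
-- Hypothetical inline data with a two-way and a three-way tie.
WITH scores(score) AS (
    VALUES (50), (60), (80), (80), (90), (90), (90), (100)
)
SELECT
    score,
    RANK() OVER w                      AS rnk,    -- first position of the tied group
    COUNT(*) OVER (PARTITION BY score) AS peers,  -- size of the tied group
    PERCENT_RANK() OVER w              AS pr,     -- (rnk - 1) / (n - 1)
    CUME_DIST() OVER w                 AS cd,
    -- Last position of the tied group divided by n; should equal cd on every row
    (RANK() OVER w + COUNT(*) OVER (PARTITION BY score) - 1)::float8
        / COUNT(*) OVER ()             AS cd_manual
FROM scores
WINDOW w AS (ORDER BY score DESC)
ORDER BY score DESC;
```

For the three rows with score 90, rnk is 2 and peers is 3, so cd is (2 + 3 - 1) / 8 = 0.5 while pr is 1/7.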

Scenario 5: Cumulative distribution within groups

Requirement: compute the cumulative distribution inside each department

SELECT 
    dept_name,
    emp_name,
    sales_amount,
    ROUND(CUME_DIST() OVER (
        PARTITION BY dept_name 
        ORDER BY sales_amount ASC
    )::DECIMAL * 100, 2) AS dept_cume_dist
FROM employees
ORDER BY dept_name, dept_cume_dist;

Sample result:

dept_name | emp_name  | sales_amount | dept_cume_dist
----------+-----------+--------------+---------------
Sales     | Zhang San | 50000        | 33.33          ← lowest in Sales
Sales     | Li Si     | 80000        | 66.67
Sales     | Wang Wu   | 120000       | 100.00         ← highest in Sales
Tech      | Zhao Liu  | 40000        | 50.00
Tech      | Qian Qi   | 90000        | 100.00         ← highest in Tech

Advantage: each employee is measured against their own department, which makes cross-department comparisons fairer


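Scenario 6: Outlier detection

Requirement: find extremely high or extremely low orders

-- Window functions cannot appear in WHERE, so cd is computed in a subquery
SELECT 
    order_no,
    amount,
    ROUND(cd::DECIMAL * 100, 2) AS cume_dist
FROM (
    SELECT 
        order_no,
        amount,
        CUME_DIST() OVER (ORDER BY amount ASC) AS cd
    FROM orders
) t
WHERE cd < 0.01   -- lowest 1%
   OR cd > 0.99;  -- highest 1%

Use: flagging suspicious transactions in a risk-control system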


Scenario 7: Generating histogram data

Requirement: produce cumulative distribution data for a report, to be plotted as a histogram

SELECT 
    amount_bucket,
    COUNT(*) AS bucket_count,
    ROUND(MAX(cd)::DECIMAL * 100, 2) AS cumulative_pct  -- share of rows up to the end of this bucket
FROM (
    SELECT 
        ROUND(sales_amount / 10000) * 10000 AS amount_bucket,  -- round to the nearest 10,000
        CUME_DIST() OVER (ORDER BY sales_amount ASC) AS cd
    FROM sales_performance
) t
GROUP BY amount_bucket
ORDER BY amount_bucket;

Use: drawing a cumulative histogram in Excel or a BI tool


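Scenario 8: Comparing the distributions of two data sets

Requirement: compare this year's sales distribution with last year's

SELECT 
    year,
    sales_amount,
    ROUND(CUME_DIST() OVER (
        PARTITION BY year 
        ORDER BY sales_amount ASC
    )::DECIMAL * 100, 2) AS cume_dist
FROM yearly_sales
WHERE year IN (2025, 2026)
ORDER BY year, cume_dist;

Use: draw both cumulative distribution curves on one chart to compare the two years at a glance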


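IV. Core syntax

CUME_DIST() OVER (
    PARTITION BY column1, column2  -- optional: grouping columns
    ORDER BY column3 [ASC | DESC]  -- needed in practice: sort order
)

Key points:

  1. No arguments: the parentheses after CUME_DIST are empty
  2. Must be used with OVER(): this marks it as a window function
  3. ORDER BY is needed in practice: it sets the direction of accumulation. PostgreSQL accepts OVER () without it, but then all rows are peers and every row gets 1
  4. The return value is a double precision fraction from 1/n to 1.0 (always greater than 0), usually multiplied by 100 to get a percentage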
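
The points about ORDER BY and the return type can be observed with a small query. This is a sketch over four made-up scores; the aliases cd_ordered, cd_unordered and result_type are assumptions.

```sql
-- Hypothetical inline data.
SELECT
    score,
    CUME_DIST() OVER (ORDER BY score)            AS cd_ordered,    -- 0.25, 0.5, 0.75, 1
    CUME_DIST() OVER ()                          AS cd_unordered,  -- 1 on every row: all rows are peers
    pg_typeof(CUME_DIST() OVER (ORDER BY score)) AS result_type    -- double precision
FROM (VALUES (50), (60), (70), (80)) AS t(score)
ORDER BY score;
```

Because the result is double precision, the article's examples cast it to DECIMAL before calling the two-argument ROUND.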

V. The formula in detail

CUME_DIST = count(rows where value <= current_value) / total_rows

Where:
- Numerator: number of rows less than or equal to the current value (including the current row)
- Denominator: total number of rows in the window partition

Worked derivation:

-- Assume students has 5 rows
SELECT 
    name,
    score,
    -- With ORDER BY, the default frame runs from the first row through the current row's peers
    COUNT(*) OVER (ORDER BY score ASC) AS count_le,
    COUNT(*) OVER () AS total,
    COUNT(*) OVER (ORDER BY score ASC)::DECIMAL / COUNT(*) OVER () AS manual_cd,
    CUME_DIST() OVER (ORDER BY score ASC) AS auto_cd
FROM students
ORDER BY score;

-- Result:
-- name      | score | count_le | total | manual_cd | auto_cd
-- ----------+-------+----------+-------+-----------+---------
-- Zhang San | 50    | 1        | 5     | 1/5=0.20  | 0.20
-- Li Si     | 60    | 2        | 5     | 2/5=0.40  | 0.40
-- Wang Wu   | 70    | 3        | 5     | 3/5=0.60  | 0.60
-- Zhao Liu  | 80    | 4        | 5     | 4/5=0.80  | 0.80
-- Qian Qi   | 90    | 5        | 5     | 5/5=1.00  | 1.00

Special cases:

  • Ties: all tied values get the same CUME_DIST (their "less than or equal" counts are identical)
  • A single row: 1/1 = 1.0 (never NULL)
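
Both special cases can be reproduced with a few inline rows. In this sketch the department labels 'A' and 'B' and the scores are made up; department 'B' holds a single row, and PERCENT_RANK is shown next to CUME_DIST for contrast.

```sql
-- Hypothetical inline data: 'A' has a tie, 'B' has a single row.
SELECT
    dept,
    score,
    CUME_DIST()    OVER (PARTITION BY dept ORDER BY score) AS cd,
    PERCENT_RANK() OVER (PARTITION BY dept ORDER BY score) AS pr
FROM (VALUES ('A', 70), ('A', 70), ('A', 90), ('B', 85)) AS t(dept, score)
ORDER BY dept, score;
-- A / 70 (both rows): cd = 2/3, pr = 0
-- A / 90:             cd = 1,   pr = 1
-- B / 85:             cd = 1,   pr = 0   (single-row partition)
```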

VI. Performance

1. Avoid repeating the computation

-- ❌ Avoid: the same CUME_DIST expression is written out more than once
SELECT 
    emp_name,
    CUME_DIST() OVER (ORDER BY sales DESC) AS cd,
    CASE 
        -- DESC, so that cd <= 0.1 selects the top 10% of sales
        WHEN CUME_DIST() OVER (ORDER BY sales DESC) <= 0.1 THEN 'excellent'
        ELSE 'average'
    END AS level
FROM employees;

-- ✅ Better: compute it once in a subquery or CTE, then reuse the column
WITH ranked AS (
    SELECT 
        emp_name,
        sales,
        CUME_DIST() OVER (ORDER BY sales DESC) AS cd
    FROM employees
)
SELECT 
    emp_name,
    cd,
    CASE WHEN cd <= 0.1 THEN 'excellent' ELSE 'average' END AS level
FROM ranked;

2. Use indexes sensibly

-- Index the ORDER BY column so the planner may be able to skip the sort
CREATE INDEX idx_employees_sales ON employees (sales_amount);

-- Composite index covering PARTITION BY + ORDER BY
CREATE INDEX idx_emp_dept_sales ON employees (dept_id, sales_amount);
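
Whether such an index actually helps depends on the planner, the table size and the rest of the query, so it is worth checking with EXPLAIN rather than assuming. The sketch below builds a throwaway table; the name demo_employees, the row count and the random data are all assumptions made for illustration.

```sql
-- Hypothetical demo table with 100,000 rows across 5 departments.
CREATE TEMP TABLE demo_employees AS
SELECT g                        AS emp_id,
       g % 5                    AS dept_id,
       (random() * 100000)::int AS sales_amount
FROM generate_series(1, 100000) AS g;

CREATE INDEX idx_demo_dept_sales ON demo_employees (dept_id, sales_amount);
ANALYZE demo_employees;

-- In the plan, check whether WindowAgg is fed by an index scan
-- or by an explicit Sort node over a sequential scan.
EXPLAIN
SELECT dept_id,
       sales_amount,
       CUME_DIST() OVER (PARTITION BY dept_id ORDER BY sales_amount) AS cd
FROM demo_employees;
```

If the plan still shows a Sort, the index is not being used for this query and only adds write overhead.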

VII. Common mistakes

Mistake 1: Forgetting to multiply by 100

-- ❌ Hard to read: what does 0.85 mean?
SELECT CUME_DIST() OVER (ORDER BY score) AS cd FROM students;

-- ✅ Clear: 85.00 reads directly as a percentage
SELECT ROUND(CUME_DIST() OVER (ORDER BY score)::DECIMAL * 100, 2) AS cumulative_percent 
FROM students;

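Mistake 2: Mixing up ASC and DESC

-- Scenario: a higher score is better
SELECT 
    name,
    score,
    ROUND(CUME_DIST() OVER (ORDER BY score DESC)::DECIMAL * 100, 2) AS cd_desc,
    ROUND(CUME_DIST() OVER (ORDER BY score ASC)::DECIMAL * 100, 2) AS cd_asc
FROM students;

-- Result (10 rows in total, only the highest and lowest shown):
-- name      | score | cd_desc | cd_asc
-- ----------+-------+---------+--------
-- Zhang San | 100   | 10.00   | 100.00 ← highest score
-- Li Si     | 60    | 100.00  | 10.00  ← lowest score

-- 💡 Memory aid:
-- ORDER BY ... DESC: cd = share of rows ≥ the current value, so high scores get a low cd (top 10% means cd ≤ 0.1)
-- ORDER BY ... ASC:  cd = share of rows ≤ the current value, so low scores get a low cd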
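
A self-contained way to see the two directions next to each other is to run both over a handful of inline values. The five scores below are made up and contain no ties.

```sql
-- Hypothetical inline data, no ties.
SELECT
    score,
    CUME_DIST() OVER (ORDER BY score ASC)  AS cd_asc,   -- share of rows with score <= current
    CUME_DIST() OVER (ORDER BY score DESC) AS cd_desc   -- share of rows with score >= current
FROM (VALUES (60), (70), (80), (90), (100)) AS t(score)
ORDER BY score DESC;
-- score 100: cd_asc = 1,   cd_desc = 0.2
-- score  80: cd_asc = 0.6, cd_desc = 0.6
-- score  60: cd_asc = 0.2, cd_desc = 1
```

A "top N%" filter therefore pairs naturally with DESC and a small cd, and a "bottom N%" filter with ASC and a small cd.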

Mistake 3: Using it directly in WHERE

-- ❌ Wrong: window functions are not allowed in WHERE
SELECT * FROM employees
WHERE CUME_DIST() OVER (ORDER BY salary ASC) <= 0.1;

-- ✅ Correct: wrap it in a subquery
SELECT * FROM (
    SELECT *, CUME_DIST() OVER (ORDER BY salary ASC) AS cd
    FROM employees
) t
WHERE cd <= 0.1;

Mistake 4: Misunderstanding how ties are handled

-- Suppose 3 students are tied at 80 (6 rows in total)
SELECT 
    name,
    score,
    CUME_DIST() OVER (ORDER BY score ASC) AS cd
FROM students;

-- Result (cd rounded to 2 decimals):
-- name      | score | cd
-- ----------+-------+------
-- Zhang San | 50    | 0.17
-- Li Si     | 60    | 0.33
-- Wang Wu   | 80    | 0.83  ← all three get 0.83
-- Zhao Liu  | 80    | 0.83
-- Qian Qi   | 80    | 0.83
-- Sun Ba    | 90    | 1.00

-- 💡 Why:
-- 5 rows have score ≤ 80 (including the 3 tied at 80)
-- so cd = 5/6 ≈ 0.83 (6 rows in total)
-- every row with score 80 shares this value

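VIII. CUME_DIST vs PERCENT_RANK: which to choose

Scenario                      | Recommended    | Why
------------------------------+----------------+---------------------------------------------------
Computing the median          | CUME_DIST()    | take the first value whose cd reaches 0.5
Cumulative distribution curve | CUME_DIST()    | matches the statistical definition
Percentile rank               | PERCENT_RANK() | based on rank position, more intuitive
Outlier detection             | either         | similar results
Grading into tiers            | either         | a matter of preference
Many tied values              | CUME_DIST()    | every tied row counts toward the cumulative share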

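IX. A mnemonic

CUME_DIST is cumulative: what share sits at or below you
The median is found at 0.5; quartiles make easy groups
Tied rows are treated alike, and the distribution stays smooth
ASC or DESC sets the direction; don't mix up high and low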

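X. Summary

Key takeaways

  1. CUME_DIST() = cumulative distribution (the fraction of rows ≤ the current value)
  2. Typical uses = median, quartiles, cumulative distribution curves, outlier detection
  3. Return value = a fraction from 1/n to 1.0, usually multiplied by 100 to get a percentage
  4. Formula = rows less than or equal to the current value / total rows
  5. When to use it = whenever you need a "cumulative proportion" or a distribution analysis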

Quick reference

-- Basic template
SELECT 
    column_list,
    ROUND(CUME_DIST() OVER (
        PARTITION BY partition_column  -- optional
        ORDER BY sort_column ASC       -- needed in practice
    )::DECIMAL * 100, 2) AS cumulative_percent
FROM table_name;

-- Find the median
SELECT * FROM (
    SELECT *, CUME_DIST() OVER (ORDER BY value ASC) AS cd
    FROM table_name
) t WHERE cd >= 0.5
ORDER BY cd LIMIT 1;

-- Quartile grouping (window functions are not allowed in GROUP BY, so compute cd first)
SELECT 
    CASE 
        WHEN cd <= 0.25 THEN 'Q1'
        WHEN cd <= 0.50 THEN 'Q2'
        WHEN cd <= 0.75 THEN 'Q3'
        ELSE 'Q4'
    END AS quartile,
    COUNT(*), AVG(value)
FROM (
    SELECT value, CUME_DIST() OVER (ORDER BY value ASC) AS cd
    FROM table_name
) t
GROUP BY quartile;

Cheat sheet

-- 1. Cumulative distribution
SELECT amount, 
       ROUND(CUME_DIST() OVER (ORDER BY amount ASC)::DECIMAL * 100, 2) || '%' AS cume
FROM orders;

-- 2. Median
SELECT * FROM (
    SELECT amount, CUME_DIST() OVER (ORDER BY amount ASC) AS cd
    FROM orders
) t WHERE cd >= 0.5 ORDER BY cd LIMIT 1;

-- 3. Quartile analysis (cd is computed in a subquery because GROUP BY cannot contain window functions)
SELECT 
    CASE 
        WHEN cd <= 0.25 THEN 'Q1'
        WHEN cd <= 0.50 THEN 'Q2'
        WHEN cd <= 0.75 THEN 'Q3'
        ELSE 'Q4'
    END AS q,
    COUNT(*), AVG(sales)
FROM (
    SELECT sales, CUME_DIST() OVER (ORDER BY sales ASC) AS cd
    FROM employees
) t
GROUP BY q;

-- 4. Outlier detection
SELECT * FROM (
    SELECT *, CUME_DIST() OVER (ORDER BY amount ASC) AS cd
    FROM transactions
) t WHERE cd < 0.01 OR cd > 0.99;