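A Complete Guide to Oracle Analytic and Window Functions
Analytic functions are among the most powerful SQL features in Oracle. They compute a value for each row from that row and a set of related rows, without a self-join or a correlated subquery, which greatly simplifies complex data-analysis queries.
1. Core Concepts and Value
1.1 Window Functions vs. Aggregate Functions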
sql
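-- Aggregate function: collapses many rows into one row per group
SELECT department_id, AVG(salary)
FROM employees
GROUP BY department_id;
-- Result: one row per department, holding the department average
-- Window function: keeps every row and adds a computed column
SELECT employee_id, department_id, salary,
AVG(salary) OVER (PARTITION BY department_id) AS dept_avg
FROM employees;
-- Result: one row per employee, with the department average attached
Key differences:
- Aggregate functions: reduce the number of rows and require GROUP BY.
- Window functions: keep every row and use an OVER() clause to define the window.
- Performance: the result is computed in a single pass over the data, with no self-join.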
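As a small illustration of the difference, the sketch below filters on a windowed value: it lists employees who earn more than their department's average. It assumes the `employees` table used in the queries above (Oracle's HR sample schema). The window function sits in an inline view because it cannot be referenced in WHERE at the same query level.

```sql
-- Employees paid above their department average, without a self-join.
SELECT employee_id, department_id, salary, dept_avg
FROM (
  SELECT employee_id, department_id, salary,
         AVG(salary) OVER (PARTITION BY department_id) AS dept_avg
  FROM employees
)
WHERE salary > dept_avg;
```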
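1.2 The Three Core Clauses
sql
FUNCTION() OVER (
[PARTITION BY col1, col2] -- partitioning (similar to GROUP BY, but rows are kept)
[ORDER BY col3 [ASC|DESC]] -- ordering of the rows within each partition
[windowing clause] -- frame of rows (ROWS/RANGE)
)
Sample data set: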
sql
CREATE TABLE sales_data AS
SELECT 1 AS product_id, '2023-01' AS month, 100 AS sales FROM dual UNION ALL
SELECT 1, '2023-02', 150 FROM dual UNION ALL
SELECT 1, '2023-03', 200 FROM dual UNION ALL
SELECT 2, '2023-01', 80 FROM dual UNION ALL
SELECT 2, '2023-02', 120 FROM dual;
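To see the three clauses working together, here is a minimal sketch against the `sales_data` table just created: PARTITION BY restarts the calculation for each product, ORDER BY sequences the months, and the windowing clause accumulates from the first row to the current one. The expected values in the comments follow directly from the five sample rows.

```sql
-- Running sales total per product, in month order.
SELECT product_id, month, sales,
       SUM(sales) OVER (PARTITION BY product_id
                        ORDER BY month
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales_data
ORDER BY product_id, month;
-- product 1: 100, 250, 450
-- product 2: 80, 200
```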
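2. Ranking Functions
2.1 RANK / DENSE_RANK / ROW_NUMBER
| Function | Behavior | Example result (first two salaries tied) |
|---|---|---|
| RANK | Ties share a rank; a gap follows | 1, 1, 3 |
| DENSE_RANK | Ties share a rank; no gap | 1, 1, 2 |
| ROW_NUMBER | Unique sequence number; no ties | 1, 2, 3 |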
sql
SELECT
employee_id,
department_id,
salary,
RANK() OVER (ORDER BY salary DESC) AS rank,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
FROM employees
WHERE department_id = 80;
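The table's example column can be reproduced without any schema. The sketch below uses three made-up rows (the names and salaries are illustrative only) in which the top two salaries tie; `emp_name` is added to the ROW_NUMBER ordering so that the numbering among the tied rows is repeatable.

```sql
WITH t AS (
  SELECT 'A' AS emp_name, 5000 AS salary FROM dual UNION ALL
  SELECT 'B', 5000 FROM dual UNION ALL
  SELECT 'C', 4000 FROM dual
)
SELECT emp_name, salary,
       RANK()       OVER (ORDER BY salary DESC)           AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC)           AS dense_rnk,
       ROW_NUMBER() OVER (ORDER BY salary DESC, emp_name) AS row_num
FROM t
ORDER BY row_num;
-- A  5000  rnk 1  dense_rnk 1  row_num 1
-- B  5000  rnk 1  dense_rnk 1  row_num 2
-- C  4000  rnk 3  dense_rnk 2  row_num 3
```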
2.2 Ranking Within Partitions
sql
-- Rank employees within each department
SELECT
employee_id,
department_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dept_rank
FROM employees;
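2.3 Bucketing with NTILE
NTILE distributes the ordered rows as evenly as possible across the requested number of buckets.
sql
-- Split employees into 4 salary quartiles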
SELECT
employee_id,
salary,
NTILE(4) OVER (ORDER BY salary DESC) AS quartile
FROM employees;
-- Result: 1 = top 25%, 2 = 26-50%, 3 = 51-75%, 4 = bottom 25%
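When the row count is not divisible by the bucket count, the buckets cannot all be the same size; NTILE gives the extra rows to the lowest-numbered buckets. A minimal sketch with ten generated rows (the row generator and column names are illustrative):

```sql
-- 10 rows into 4 buckets: bucket sizes come out as 3, 3, 2, 2.
SELECT bucket, COUNT(*) AS rows_in_bucket
FROM (
  SELECT n, NTILE(4) OVER (ORDER BY n) AS bucket
  FROM (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 10)
)
GROUP BY bucket
ORDER BY bucket;
```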
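2.4 Practical Scenario: Top-N Queries
sql
-- Find the employees with the 3 highest salaries in each department (correlated subquery vs. analytic function)
-- Traditional approach (slow: the subquery is evaluated for every row)
-- COUNT(DISTINCT ...) counts distinct higher salaries, which matches DENSE_RANK in the second query
SELECT * FROM employees e1
WHERE 3 > (SELECT COUNT(DISTINCT e2.salary) FROM employees e2
WHERE e2.department_id = e1.department_id
AND e2.salary > e1.salary);
-- Analytic-function approach (recommended)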
SELECT * FROM (
SELECT
employee_id,
department_id,
salary,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
FROM employees
) WHERE rnk <= 3;
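3. Aggregate Window Functions
3.1 Basic Aggregates with OVER()
The common aggregate functions (SUM, AVG, COUNT, MAX, MIN) can all be used as window functions.
sql
-- Running salary total, average, and count in hire-date order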
SELECT
employee_id,
hire_date,
salary,
SUM(salary) OVER (ORDER BY hire_date) AS running_total,
AVG(salary) OVER (ORDER BY hire_date) AS running_avg,
COUNT(*) OVER (ORDER BY hire_date) AS running_count
FROM employees;
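One detail of the running totals above: with ORDER BY and no windowing clause, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so employees hired on the same date are peers and all receive the same total. The sketch below, assuming the same `employees` table, puts the two behaviors side by side; `employee_id` is used as a tie-breaker so that the ROWS version advances one row at a time in a repeatable order.

```sql
SELECT employee_id, hire_date, salary,
       -- Default RANGE window: rows sharing a hire_date get the same total.
       SUM(salary) OVER (ORDER BY hire_date) AS total_by_range,
       -- Explicit ROWS window with a tie-breaker: strictly row-by-row.
       SUM(salary) OVER (ORDER BY hire_date, employee_id
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total_by_rows
FROM employees
ORDER BY hire_date, employee_id;
```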
3.2 Aggregates over Partitions
sql
-- Each employee's share of the department salary total
SELECT
employee_id,
department_id,
salary,
SUM(salary) OVER (PARTITION BY department_id) AS dept_total,
salary / SUM(salary) OVER (PARTITION BY department_id) AS ratio,
COUNT(*) OVER (PARTITION BY department_id) AS dept_emp_count
FROM employees;
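Oracle also has a dedicated analytic function for this kind of share, RATIO_TO_REPORT. The following sketch should give the same ratio as the division above, here expressed as a percentage; the rounding and the column alias are choices made for this example.

```sql
SELECT employee_id, department_id, salary,
       ROUND(RATIO_TO_REPORT(salary) OVER (PARTITION BY department_id) * 100, 2) AS pct_of_dept
FROM employees;
```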
4. Value Functions
4.1 LAG / LEAD: Values from Preceding and Following Rows
sql
-- Salary difference from the previously hired employee
SELECT
employee_id,
hire_date,
salary,
LAG(salary, 1) OVER (ORDER BY hire_date) AS prev_salary,
salary - LAG(salary, 1) OVER (ORDER BY hire_date) AS increase
FROM employees
WHERE department_id = 80;
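Parameters:
- LAG(column, offset, default): reads the row that is offset rows before the current row; offset defaults to 1 and default to NULL.
- LEAD(column, offset, default): the same, but reads a following row.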
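A short sketch of both functions with the optional arguments spelled out, assuming the same `employees` table. The third argument of LAG replaces the NULL that the first row would otherwise get; the `employee_id` tie-breaker is an addition for this example so that "previous" and "next" are well defined when hire dates repeat.

```sql
SELECT employee_id, hire_date, salary,
       LAG(salary, 1, 0) OVER (ORDER BY hire_date, employee_id) AS prev_salary, -- 0 on the first row
       LEAD(salary, 1)   OVER (ORDER BY hire_date, employee_id) AS next_salary  -- NULL on the last row
FROM employees
WHERE department_id = 80;
```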
4.2 FIRST_VALUE / LAST_VALUE
sql
-- Highest and lowest salary in each department
SELECT
employee_id,
department_id,
salary,
FIRST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS dept_highest,
LAST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS dept_lowest
FROM employees;
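⚠️ Key pitfall: when ORDER BY is present and no windowing clause is given, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE returns the value of the current row (or of its last peer) rather than the last value of the partition. Specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to get the true last value.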
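For the specific case of a partition's highest and lowest value, the pitfall can be sidestepped altogether: MAX and MIN with only PARTITION BY see the whole partition, since no ORDER BY means no implicit "up to the current row" window. A minimal sketch on the same `employees` table:

```sql
SELECT employee_id, department_id, salary,
       MAX(salary) OVER (PARTITION BY department_id) AS dept_highest,
       MIN(salary) OVER (PARTITION BY department_id) AS dept_lowest
FROM employees;
```

FIRST_VALUE and LAST_VALUE remain the right tools when the value wanted comes from a different column than the one being ordered by.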
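4.3 NTH_VALUE
Returns the value from the Nth row of the window.
sql
-- Salary of the second row in each department, ordered by salary descending
-- (this equals the top salary when two employees tie for first place)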
SELECT
employee_id,
department_id,
salary,
NTH_VALUE(salary, 2) OVER (PARTITION BY department_id
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS second_highest
FROM employees;
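If what is wanted is the second-highest distinct salary, so that ties for first place do not count twice, DENSE_RANK is one way to express it. This is a sketch of that variant, not a replacement for NTH_VALUE; departments with only one distinct salary produce no row.

```sql
SELECT DISTINCT department_id, salary AS second_highest_salary
FROM (
  SELECT department_id, salary,
         DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
  FROM employees
)
WHERE rnk = 2;
```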
5. The Windowing Clause
5.1 ROWS vs. RANGE
sql
-- ROWS: a physical count of rows (independent of the ORDER BY values)
SELECT
employee_id,
hire_date,
salary,
SUM(salary) OVER (ORDER BY hire_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS rows_sum
FROM employees;
-- Result: current row + 1 row before + 1 row after
-- RANGE: a logical range based on the values of the ORDER BY column
SELECT
employee_id,
hire_date,
salary,
SUM(salary) OVER (ORDER BY hire_date
RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW) AS range_sum
FROM employees;
-- Result: all rows whose hire_date falls within [current hire_date - 30 days, current hire_date]
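The difference shows up most clearly when the ORDER BY column contains duplicates. In the sketch below (three made-up rows; all names are illustrative), two rows share `sort_key = 10`. ROWS stops at the current physical row, while RANGE includes every row with the same sort value as the current row.

```sql
WITH t AS (
  SELECT 1 AS id, 10 AS sort_key, 100 AS amt FROM dual UNION ALL
  SELECT 2, 10, 200 FROM dual UNION ALL
  SELECT 3, 20, 300 FROM dual
)
SELECT id, sort_key, amt,
       SUM(amt) OVER (ORDER BY sort_key, id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  AS rows_sum,
       SUM(amt) OVER (ORDER BY sort_key
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
FROM t
ORDER BY id;
-- id 1: rows_sum 100, range_sum 300
-- id 2: rows_sum 300, range_sum 300
-- id 3: rows_sum 600, range_sum 600
```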
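5.2 Common Window Definitions
sql
-- From the start of the partition to the current row (running total)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- The 3 preceding rows plus the current row (moving average)
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
-- 1 row before to 1 row after (3-row window)
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
-- The past year up to the current row (RANGE example; needs a date or timestamp ORDER BY key)
RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND CURRENT ROW
-- All rows in the partition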
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
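These fragments are not statements on their own; they go inside OVER(), after ORDER BY. As a sketch, here is a two-month moving average over the `sales_data` sample table from section 1.2. The expected values in the comments follow from the five sample rows.

```sql
-- Average of the current and the previous month, per product.
SELECT product_id, month, sales,
       AVG(sales) OVER (PARTITION BY product_id
                        ORDER BY month
                        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_avg_2m
FROM sales_data
ORDER BY product_id, month;
-- product 1: 100, 125, 175
-- product 2: 80, 100
```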
6. Distribution Functions
6.1 CUME_DIST and PERCENT_RANK
sql
-- Cumulative distribution and percent rank
SELECT
employee_id,
department_id,
salary,
CUME_DIST() OVER (PARTITION BY department_id ORDER BY salary) AS cume_dist,
PERCENT_RANK() OVER (PARTITION BY department_id ORDER BY salary) AS percent_rank
FROM employees;
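-- CUME_DIST: (rows with a value <= the current value) / (total rows in the partition)
-- PERCENT_RANK: (RANK of the current row - 1) / (total rows in the partition - 1)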
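The two formulas can be checked by hand on a tiny data set. The sketch below uses four made-up values, two of them tied; the ROUND is only there to keep the comment readable.

```sql
WITH t AS (
  SELECT 10 AS val FROM dual UNION ALL
  SELECT 20 FROM dual UNION ALL
  SELECT 20 FROM dual UNION ALL
  SELECT 30 FROM dual
)
SELECT val,
       CUME_DIST() OVER (ORDER BY val)              AS cume_dist_val,
       ROUND(PERCENT_RANK() OVER (ORDER BY val), 3) AS percent_rank_val
FROM t
ORDER BY val;
-- val 10: cume_dist 1/4 = 0.25, percent_rank (1-1)/(4-1) = 0
-- val 20: cume_dist 3/4 = 0.75, percent_rank (2-1)/(4-1) = 0.333 (both tied rows)
-- val 30: cume_dist 4/4 = 1,    percent_rank (4-1)/(4-1) = 1
```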
6.2 PERCENTILE_CONT / PERCENTILE_DISC
sql
-- Median salary per department (one output row per employee; the median repeats within a department)
SELECT
department_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department_id) AS median_salary,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department_id) AS median_disc
FROM employees;
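-- CONT: interpolates between neighboring values, so the result may not occur in the data
-- DISC: picks a discrete value, so the result always occurs in the data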
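Both functions also work as plain aggregates, which is usually what is wanted when the goal is one median per department. The sketch below first shows the CONT/DISC difference on four generated values, then the aggregate form on `employees`; MEDIAN is Oracle's shorthand for the continuous 0.5 percentile.

```sql
-- Values 1..4: the continuous median interpolates, the discrete one picks an existing value.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) AS median_cont, -- 2.5
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY val) AS median_disc  -- 2
FROM (SELECT LEVEL AS val FROM dual CONNECT BY LEVEL <= 4);

-- Aggregate form: one row per department.
SELECT department_id,
       MEDIAN(salary)                                      AS median_salary,
       PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY salary) AS median_disc
FROM employees
GROUP BY department_id;
```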
7. Advanced Analytic Features
7.1 LISTAGG: String Aggregation
sql
-- Employee names per department (11gR2+)
SELECT
department_id,
LISTAGG(first_name, ',') WITHIN GROUP (ORDER BY hire_date) AS employees
FROM employees
GROUP BY department_id;
-- Analytic version (does not reduce the number of rows)
SELECT
employee_id,
department_id,
LISTAGG(first_name, ',') WITHIN GROUP (ORDER BY hire_date)
OVER (PARTITION BY department_id) AS dept_employees
FROM employees;
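LISTAGG returns a VARCHAR2, so a large group can overflow the maximum string length and raise ORA-01489. From Oracle 12.2 onward the function accepts an ON OVERFLOW clause; the sketch below truncates the list instead of failing. The `'...'` marker is just the truncation indicator chosen for this example.

```sql
-- Requires Oracle 12.2 or later.
SELECT department_id,
       LISTAGG(first_name, ',' ON OVERFLOW TRUNCATE '...' WITH COUNT)
         WITHIN GROUP (ORDER BY hire_date) AS employees
FROM employees
GROUP BY department_id;
```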
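7.2 PIVOT / UNPIVOT
sql
-- Rows to columns (PIVOT); the aliases give the output columns plain names
SELECT * FROM (
SELECT department_id, job_id, salary FROM employees
)
PIVOT (
AVG(salary) FOR job_id IN ('IT_PROG' AS it_prog, 'SA_MAN' AS sa_man, 'AD_PRES' AS ad_pres)
);
-- Columns to rows (UNPIVOT)
-- Assumes a hypothetical wide table sales_wide(product_id, jan, feb, mar);
-- the sales_data table from section 1.2 has no jan/feb/mar columns
SELECT * FROM sales_wide
UNPIVOT (
amount FOR month IN (jan, feb, mar)
);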
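PIVOT can also be tried directly on the `sales_data` sample table from section 1.2, which is in long format. In this sketch the month strings become columns; the aliases `jan`, `feb`, `mar` are names chosen for the example, and the expected values follow from the five sample rows.

```sql
SELECT *
FROM (SELECT product_id, month, sales FROM sales_data)
PIVOT (
  SUM(sales) FOR month IN ('2023-01' AS jan, '2023-02' AS feb, '2023-03' AS mar)
)
ORDER BY product_id;
-- product_id 1: jan 100, feb 150, mar 200
-- product_id 2: jan 80,  feb 120, mar NULL (no 2023-03 row)
```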
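7.3 The MODEL Clause
The most complex of these features; it supports spreadsheet-style calculations over a query result.
sql
-- Cumulative sales per product
SELECT product_id, month, sales, cum_sales
FROM sales_data
MODEL
RETURN UPDATED ROWS
PARTITION BY (product_id)
-- number the rows within each product so that CV() - 1 addresses the previous month
DIMENSION BY (ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY month) AS rn)
MEASURES (month, sales, 0 AS cum_sales)
RULES (
-- ORDER BY rn makes each cell read an already computed predecessor
cum_sales[ANY] ORDER BY rn = sales[CV()] + NVL(cum_sales[CV() - 1], 0)
)
ORDER BY product_id, rn;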
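8. Performance Tuning and Pitfalls
8.1 Performance Considerations
Advantages:
- The result is computed in a single pass, with no self-join
- Parallel execution is supported
- Less network traffic (the calculation is done on the server)
Drawbacks:
- Sorting is required (ORDER BY), so large result sets consume temporary tablespace
- Complex windows over large partitions can exhaust the available sort memory
Optimization strategies:
sql
-- 1. Keep the PARTITION BY column list as small as possible
-- 2. Use an index to avoid the sort (when the ORDER BY columns are already indexed)
-- 3. Limit the window size (avoid UNBOUNDED where possible)
-- 4. Precompute with a materialized view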
CREATE MATERIALIZED VIEW mv_sales_summary AS
SELECT
product_id,
month,
sales,
SUM(sales) OVER (PARTITION BY product_id ORDER BY month) AS running_total
FROM sales_data;
-- 5. Inspect the execution plan
EXPLAIN PLAN FOR
SELECT ... OVER (PARTITION BY ... ORDER BY ...)
FROM ...;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
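A concrete instance of the template above, using the partitioned ranking query from section 2.2. In the plan output one would typically look for a WINDOW operation (for example WINDOW SORT) and check its estimated temp space; the exact plan depends on the data, indexes, and Oracle version, so no output is shown here.

```sql
EXPLAIN PLAN FOR
SELECT employee_id, department_id, salary,
       RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dept_rank
FROM employees;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```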
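8.2 Common Pitfalls
Pitfall 1: A missing or non-unique ORDER BY gives unreliable results
sql
-- Wrong: ROW_NUMBER without ORDER BY; Oracle rejects this with ORA-30485
SELECT ROW_NUMBER() OVER () FROM employees; -- avoid!
-- Correct: order explicitly, by a unique key, so the numbering is repeatable
SELECT ROW_NUMBER() OVER (ORDER BY employee_id) FROM employees;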
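Pitfall 2: Relying on the default RANGE window
sql
-- Wrong: with ORDER BY, the default window ends at CURRENT ROW
SELECT LAST_VALUE(salary) OVER (ORDER BY hire_date) FROM employees; -- not the true last value
-- Correct: specify the full window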
SELECT LAST_VALUE(salary) OVER (
ORDER BY hire_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) FROM employees;
Pitfall 3: Using a window function in the WHERE clause
sql
-- Wrong: window functions cannot be used directly in WHERE
SELECT * FROM employees
WHERE ROW_NUMBER() OVER (ORDER BY salary DESC) <= 3; -- ORA-30483
-- Correct: wrap it in a subquery
SELECT * FROM (
SELECT e.*, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn FROM employees e
) WHERE rn <= 3;
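For a Top-N over the whole result set (not per partition), Oracle 12c and later also offer the row-limiting clause, which avoids the wrapper query. This is a sketch of that alternative; per-department Top-N still needs the analytic form from section 2.4.

```sql
-- Exactly 3 rows (Oracle 12c+).
SELECT employee_id, salary
FROM employees
ORDER BY salary DESC
FETCH FIRST 3 ROWS ONLY;

-- WITH TIES also returns any rows that tie with the third row.
SELECT employee_id, salary
FROM employees
ORDER BY salary DESC
FETCH FIRST 3 ROWS WITH TIES;
```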
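9. Worked Examples
Example 1: Year-over-Year and Month-over-Month Growth
sql
-- Year-over-year growth of monthly sales (assumes one row per month, with no missing months)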
SELECT
month,
sales,
LAG(sales, 12) OVER (ORDER BY month) AS sales_last_year,
(sales - LAG(sales, 12) OVER (ORDER BY month)) /
LAG(sales, 12) OVER (ORDER BY month) * 100 AS yoy_growth
FROM monthly_sales;
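The month-over-month figure follows the same pattern with an offset of 1. The sketch below assumes the same `monthly_sales(month, sales)` table as the query above; it computes LAG once in an inline view and uses NULLIF so that a previous month with zero sales yields NULL instead of a divide-by-zero error.

```sql
SELECT month, sales, prev_sales,
       ROUND((sales - prev_sales) / NULLIF(prev_sales, 0) * 100, 2) AS mom_growth_pct
FROM (
  SELECT month, sales,
         LAG(sales, 1) OVER (ORDER BY month) AS prev_sales
  FROM monthly_sales
);
```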
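Example 2: Finding Users with Consecutive Logins
sql
-- Users who logged in on 3 consecutive days (assumes one row per user per day and a login_date with no time component)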
SELECT DISTINCT user_id FROM (
SELECT
user_id,
login_date,
LAG(login_date, 2) OVER (PARTITION BY user_id ORDER BY login_date) AS lag_2_date
FROM user_logins
) WHERE login_date - lag_2_date = 2;
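When a user can log in several times a day, or when the streak length itself is of interest, a "gaps and islands" formulation is a common alternative. The sketch below assumes `user_logins.login_date` is a DATE: it reduces logins to distinct days, then subtracts a per-user row number from each day, so that consecutive days collapse to the same anchor date (`streak_anchor`, a name made up for this example).

```sql
SELECT user_id,
       MIN(login_day) AS streak_start,
       MAX(login_day) AS streak_end,
       COUNT(*)       AS streak_days
FROM (
  SELECT user_id, login_day,
         -- Consecutive days share the same (day - row number) value.
         login_day - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) AS streak_anchor
  FROM (SELECT DISTINCT user_id, TRUNC(login_date) AS login_day FROM user_logins)
)
GROUP BY user_id, streak_anchor
HAVING COUNT(*) >= 3;
```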
Example 3: Dynamic Grouping (Bucketing)
sql
-- Split employees into 5 salary grades
SELECT
employee_id,
salary,
NTILE(5) OVER (ORDER BY salary DESC) AS salary_grade,
CASE NTILE(5) OVER (ORDER BY salary DESC)
WHEN 1 THEN 'Top 20%'
WHEN 2 THEN '20%-40%'
...
ELSE 'Bottom 20%'
END AS grade_desc
FROM employees;
Example 4: Customer Lifetime Value (LTV)
sql
-- Cumulative spend per customer and days since the first purchase
SELECT
customer_id,
order_date,
order_amount,
SUM(order_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value,
order_date - FIRST_VALUE(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS days_since_first
FROM orders;
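10. Summary and Best Practices
10.1 Key Points
| Function category | Main uses | Performance impact |
|---|---|---|
| Ranking functions | Top-N, ranking within groups | Medium (requires a sort) |
| Aggregate windows | Running totals, shares, moving averages | Medium (requires a sort) |
| Value functions | Row-to-row comparison, first/last values | Medium (requires a sort) |
| Distribution functions | Percentiles, medians | High (sort plus computation) |
| Advanced features | String aggregation, MODEL calculations | High (complex computation) |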
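10.2 Oracle Version Support
| Version | Key features |
|---|---|
| Oracle 8i | Core analytic functions (ROW_NUMBER, RANK, LAG/LEAD, FIRST_VALUE/LAST_VALUE) |
| Oracle 9i | PERCENTILE_CONT / PERCENTILE_DISC |
| Oracle 10g | MODEL clause |
| Oracle 11g | PIVOT / UNPIVOT (11gR1); LISTAGG, NTH_VALUE (11gR2) |
| Oracle 12c | OFFSET / FETCH row limiting, which replaces some Top-N uses |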
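10.3 Best-Practice Checklist
✅ Prefer window functions to self-joins; the code is shorter and clearer
✅ Always specify ORDER BY explicitly to avoid nondeterministic results
✅ Define the window range in full, especially for LAST_VALUE
✅ Use PARTITION BY sensibly to reduce the amount of data sorted together
✅ Check execution plans to avoid running out of temporary tablespace
✅ Precompute frequently used complex windows in materialized views
✅ Do not use window functions directly in WHERE; wrap them in a subquery
Analytic functions take SQL from plain data retrieval to data analysis. Mastering them substantially extends what you can do in Oracle SQL and lets you handle complex analytical requirements with ease.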