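Window Functions: A Summary
I. Window function concepts
In a relational database management system (RDBMS), window functions are a special class of SQL functions that compute over a set of rows related to the current row (partitioned and, optionally, ordered) without changing the number of rows returned.
A window function keeps every original row and attaches an extra analytic value to each one; it does not "collapse" rows. This is the difference from aggregates such as SUM() or AVG() used with GROUP BY, which reduce each group to a single row, whereas a window function returns a value for every row.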
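A minimal sketch of that difference, run through Python's built-in sqlite3 module so that it is self-contained. It assumes the bundled SQLite is 3.25 or later (the release that added window functions); the `sales` table and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table; names and values are for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, department TEXT, sales_amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "A", 300), (3, "B", 50)],
)

# Aggregate with GROUP BY: one output row per department.
grouped = conn.execute(
    "SELECT department, SUM(sales_amount) FROM sales "
    "GROUP BY department ORDER BY department"
).fetchall()

# The same aggregate as a window function: every input row is kept.
windowed = conn.execute(
    "SELECT id, department, sales_amount, "
    "       SUM(sales_amount) OVER (PARTITION BY department) AS dept_total "
    "FROM sales ORDER BY id"
).fetchall()

print(len(grouped), grouped)    # 2 rows, one per department
print(len(windowed), windowed)  # 3 rows, one per input row
```

The grouped query returns two rows, while the windowed query returns all three rows with the department total repeated on each.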
II. Basic syntax
Syntax structure of a window function
sql
SELECT
column1,
column2,
window_function(column3) OVER (
PARTITION BY column4
ORDER BY column5
ROWS/RANGE BETWEEN start AND end
) AS result
FROM
table_name;
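Syntax components:
- window_function: the function applied over the window (ranking, running totals, and so on). Common choices include RANK(), ROW_NUMBER(), SUM(), and AVG()
- OVER: the keyword that marks the call as a window function and introduces the window definition
- PARTITION BY: optional; splits the rows into partitions by the given columns, and each partition is processed independently. It resembles GROUP BY but does not collapse the rows into one
- ORDER BY: optional; defines the order of rows within each partition, which ranking functions and frame boundaries rely on
- ROWS/RANGE BETWEEN: optional; the frame clause, which limits the rows within the partition that the function is applied to
Frame boundary keywords:
| Keyword | Meaning |
|---|---|
| UNBOUNDED PRECEDING | The frame starts at the first row of the partition |
| UNBOUNDED FOLLOWING | The frame ends at the last row of the partition |
| CURRENT ROW | The current row (with RANGE, the current row and its peers) |
| n PRECEDING | The boundary lies n rows before the current row (ROWS), or at the current value minus n (RANGE) |
| n FOLLOWING | The boundary lies n rows after the current row (ROWS), or at the current value plus n (RANGE) |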
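Examples:
Current row and the 5 rows before it: ROWS BETWEEN 5 PRECEDING AND CURRENT ROW -- up to 6 rows
Current row and the 5 rows after it: ROWS BETWEEN CURRENT ROW AND 5 FOLLOWING -- up to 6 rows
5 rows before and 5 rows after: ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING -- up to 11 rows
Current row and the 6 rows before it: ROWS 6 PRECEDING (shorthand for BETWEEN 6 PRECEDING AND CURRENT ROW) -- up to 7 rows
This day and the 6 days before it: RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW -- a 7-day span (MySQL interval syntax)
This day and the 6 days before it: RANGE INTERVAL 6 DAY PRECEDING (shorthand for BETWEEN ... AND CURRENT ROW) -- a 7-day span
Values from the current value - 100 to the current value + 200: RANGE BETWEEN 100 PRECEDING AND 200 FOLLOWING -- 301 possible values if the column holds integers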
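The difference between ROWS and RANGE shows up as soon as the ORDER BY values have gaps. The sketch below uses Python's sqlite3 module as a stand-in engine; it assumes SQLite 3.28 or later, which is needed for numeric offsets in RANGE frames, and the `readings` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (v INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO readings VALUES (?)", [(10,), (20,), (21,), (40,)])

frame_rows = conn.execute("""
    SELECT v,
           -- ROWS: the previous physical row plus the current row
           SUM(v) OVER (ORDER BY v ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS rows_sum,
           -- RANGE: every row whose v lies in [current v - 10, current v]
           SUM(v) OVER (ORDER BY v RANGE BETWEEN 10 PRECEDING AND CURRENT ROW) AS range_sum
    FROM readings
    ORDER BY v
""").fetchall()

for row in frame_rows:
    print(row)
```

For v = 40 the ROWS frame still picks up the previous row (21), while the RANGE frame covers only values between 30 and 40, so the two sums differ on that row.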
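III. Window function categories
Function overview
Aggregate functions: SUM(), AVG(), COUNT(), MAX(), MIN()
Analytic (value) functions: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()
Ranking functions: ROW_NUMBER(), RANK(), DENSE_RANK()
3.1 Aggregate functions
- SUM/COUNT/MAX/MIN/AVG can all be used as window functions; MEDIAN is available only in some databases (MySQL has no MEDIAN function)
Example: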
sql
-- Monthly sales amount and the running total up to each month
SELECT
MONTH(pay_time) AS pay_month,
SUM(amount) AS amount,
SUM(SUM(amount)) OVER (
ORDER BY MONTH(pay_time) ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS amount_total
FROM order_detail
GROUP BY pay_month
ORDER BY pay_month;
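The nested SUM(SUM(amount)) OVER (...) can be read as two steps: aggregate per month first, then take a running total over the monthly rows. The sketch below spells those two steps out with a CTE and runs them through Python's sqlite3 module. It is an equivalent formulation under assumptions, not the query above verbatim: SQLite has no MONTH(), so strftime is used instead, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (pay_time TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?)",
    [("2024-01-05", 100), ("2024-01-20", 50), ("2024-02-03", 200), ("2024-03-15", 25)],
)

monthly_totals = conn.execute("""
    WITH monthly AS (
        -- Step 1: one row per month (strftime stands in for MySQL's MONTH())
        SELECT CAST(strftime('%m', pay_time) AS INTEGER) AS pay_month,
               SUM(amount) AS month_amount
        FROM order_detail
        GROUP BY pay_month
    )
    -- Step 2: running total over the monthly rows
    SELECT pay_month,
           month_amount,
           SUM(month_amount) OVER (
               ORDER BY pay_month
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS amount_total
    FROM monthly
    ORDER BY pay_month
""").fetchall()

print(monthly_totals)
```

Splitting the work this way is often easier to read than the nested form, at the cost of a slightly longer query.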
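3.2 Ranking functions
- ROW_NUMBER(): assigns a unique sequential number to each row
- RANK(): ties share the same rank, and gaps are left in the ranking after them
- DENSE_RANK(): ties share the same rank, and no gaps are left
- CUME_DIST(): the number of rows ordered at or before the current row (including its peers) divided by the total number of rows in the partition, i.e. a cumulative distribution value
- PERCENT_RANK(): the percentage rank of the current row
- NTILE(n): splits the ordered rows of a partition into n buckets that are as equal in size as possible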
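CUME_DIST and NTILE are the two functions in this list that the later examples do not exercise, so here is a small sketch of both, run through Python's sqlite3 module (SQLite 3.25+ assumed). The `scores` table is hypothetical, and `name` is added to the NTILE ordering only to make the bucket assignment of the tied rows deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 90), ("B", 85), ("C", 85), ("D", 80), ("E", 75)],
)

dist_rows = conn.execute("""
    SELECT name,
           score,
           -- rows with score <= current score, divided by 5 rows in total
           CUME_DIST() OVER (ORDER BY score) AS cume_dist_val,
           -- two buckets over 5 rows: the first bucket takes the extra row
           NTILE(2) OVER (ORDER BY score, name) AS half
    FROM scores
    ORDER BY score, name
""").fetchall()

for row in dist_rows:
    print(row)
```

The two rows with score 85 are peers, so both get a cumulative distribution of 4/5, while NTILE still places them in different buckets because it only counts rows.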
Example:
sql
-- Number the rows within each department, ordered by sales amount (descending)
SELECT
id, department, sales_amount,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY sales_amount DESC) AS rank_num
FROM sales;
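-- Running total of sales amount within each department
-- (under the default frame, rows with equal sales_amount are peers and share the same running total)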
SELECT
department, sales_amount,
SUM(sales_amount) OVER (PARTITION BY department ORDER BY sales_amount DESC) AS cumulative_sales
FROM sales;
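[Purpose] ROW_NUMBER assigns a unique sequence number to each row, starting from 1 within each partition and increasing by one; rows with equal values still receive different numbers.
ROW_NUMBER() OVER (PARTITION BY partition_columns ORDER BY order_columns)
[Notes]
- Unique: every row gets its own sequence number
- Consecutive: numbers start at 1 and increase without gaps
- No duplicates: rows with equal values still get different numbers, and the order among such ties is not deterministic unless ORDER BY breaks the tie
Usage examples
sql
-- Rank employees by salary within each department (unique sequence numbers)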
SELECT
department,
employee_name,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_rank
FROM employees;
-- Deduplicate: select only the first row (lowest id) for each email
SELECT * FROM (
SELECT
id, name, email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users
) t WHERE rn = 1;
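The query above only selects the rows to keep. One way to actually delete the duplicates is to reuse the same ROW_NUMBER subquery as a keep-list. The sketch below tries this with Python's sqlite3 module on a made-up `users` table; whether the same DELETE form is accepted unchanged by another engine is an assumption to verify there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Ann B", "ann@example.com"),  # duplicate email
        (3, "Bob", "bob@example.com"),
    ],
)

# Ids of the first row per email, numbered inside a derived table.
keep_sql = """
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
        FROM users
    ) AS t
    WHERE rn = 1
"""
kept_ids = [r[0] for r in conn.execute(keep_sql + " ORDER BY id")]

# Delete everything that is not in the keep-list.
conn.execute(f"DELETE FROM users WHERE id NOT IN ({keep_sql})")
remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()

print(kept_ids)
print(remaining)
```

Ordering by `id` inside the window decides which duplicate survives; ordering by a timestamp column instead would keep the earliest or latest record.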
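[Purpose] RANK assigns a rank to each row according to the ordering; equal values receive the same rank, and the ranks that follow skip ahead.
RANK() OVER (PARTITION BY partition_columns ORDER BY order_columns)
[Notes]
- Tied ranks: equal values receive the same rank
- Rank jumps: after a tie, the next rank skips ahead by the number of tied rows
- Not consecutive: gaps can appear in the ranking
Usage examples
sql
-- Student score ranking (ties share a rank and the following rank is skipped)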
SELECT
student_name,
score,
RANK() OVER (ORDER BY score DESC) AS rank_score
FROM students;
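-- Sample result:
-- Zhang San 95 rank=1
-- Li Si 90 rank=2
-- Wang Wu 90 rank=2 (tied for second)
-- Zhao Liu 85 rank=4 (jumps to fourth)
[Purpose] DENSE_RANK works like RANK, but after equal values share a rank the following ranks do not skip; the sequence stays consecutive.
DENSE_RANK() OVER (PARTITION BY partition_columns ORDER BY order_columns)
[Notes]
- Tied ranks: equal values receive the same rank
- Consecutive: the ranks after a tie do not skip
- Dense: the rank numbers form an unbroken sequence
Usage examples
sql
-- Student score ranking (ranks keep increasing by one after a tie)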
SELECT
student_name,
score,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank_score
FROM students;
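-- Sample result:
-- Zhang San 95 rank=1
-- Li Si 90 rank=2
-- Wang Wu 90 rank=2 (tied for second)
-- Zhao Liu 85 rank=3 (continues with third)
[Purpose] PERCENT_RANK computes the relative position of the current row within its partition and returns a value between 0 and 1.
PERCENT_RANK() OVER (PARTITION BY partition_columns ORDER BY order_columns)
Formula: PERCENT_RANK = (RANK of the current row - 1) / (number of rows in the partition - 1)
[Notes]
- Range: the result lies between 0 and 1
- Normalized: gives a ranking position that is comparable across partitions of different sizes
- Endpoints: the first row is always 0; the last row is 1 unless it ties with earlier rows, and a single-row partition yields 0
Usage examples
sql
-- Percentile rank of each employee's salary across the company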
SELECT
employee_name,
salary,
PERCENT_RANK() OVER (ORDER BY salary) AS salary_percentile
FROM employees;
-- Percentile rank of salary within each department
SELECT
department,
employee_name,
salary,
PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary) AS dept_salary_percentile
FROM employees;
Function comparison example
sql
-- Assume the following data:
-- name score
-- A 90
-- B 85
-- C 85
-- D 80
-- E 75
SELECT
name,
score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
RANK() OVER (ORDER BY score DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank_val,
PERCENT_RANK() OVER (ORDER BY score DESC) AS percent_rank_val
FROM scores;
-- Result (B and C tie, so which of them gets row_num 2 is not guaranteed):
-- name | score | row_num | rank_val | dense_rank_val | percent_rank_val
-- A | 90 | 1 | 1 | 1 | 0.0
-- B | 85 | 2 | 2 | 2 | 0.25
-- C | 85 | 3 | 2 | 2 | 0.25
-- D | 80 | 4 | 4 | 3 | 0.75
-- E | 75 | 5 | 5 | 4 | 1.0
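Comparison of ranking functions
| Function | Handling of ties | Rank continuity | Example (scores 90, 85, 85, 80) |
|---|---|---|---|
| ROW_NUMBER() | Different numbers | Consecutive | 1,2,3,4 |
| RANK() | Same rank, following ranks skip | Not consecutive | 1,2,2,4 |
| DENSE_RANK() | Same rank, following ranks continue | Consecutive | 1,2,2,3 |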
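The comparison can be reproduced with a short script. This is a sketch using Python's sqlite3 module (SQLite 3.25+ assumed) on the five-row sample data; `name` is added as a tie-breaker for ROW_NUMBER only, an addition made here so that the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 90), ("B", 85), ("C", 85), ("D", 80), ("E", 75)],
)

ranking_rows = conn.execute("""
    SELECT name,
           score,
           ROW_NUMBER()   OVER (ORDER BY score DESC, name) AS row_num,  -- name breaks ties
           RANK()         OVER (ORDER BY score DESC)       AS rank_val,
           DENSE_RANK()   OVER (ORDER BY score DESC)       AS dense_rank_val,
           PERCENT_RANK() OVER (ORDER BY score DESC)       AS percent_rank_val
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for row in ranking_rows:
    print(row)
```

The percent rank of D is (4 - 1) / (5 - 1) = 0.75, which follows from RANK skipping to 4 after the tie.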
1. Top-N queries
sql
-- Top 3 earners in each department
SELECT * FROM (
SELECT
department,
employee_name,
salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) t WHERE rn <= 3;
2. Quantile analysis
sql
-- Assign each salary to a quartile bucket based on its percent rank
SELECT
employee_name,
salary,
CASE
WHEN PERCENT_RANK() OVER (ORDER BY salary) <= 0.25 THEN 'Q1'
WHEN PERCENT_RANK() OVER (ORDER BY salary) <= 0.50 THEN 'Q2'
WHEN PERCENT_RANK() OVER (ORDER BY salary) <= 0.75 THEN 'Q3'
ELSE 'Q4'
END AS quartile
FROM employees;
3. Handling tied awards
sql
-- Sports ranking where ties must be taken into account
SELECT
athlete_name,
score,
RANK() OVER (ORDER BY score DESC) AS overall_rank,
CASE
WHEN RANK() OVER (ORDER BY score DESC) <= 3 THEN 'Podium'
ELSE 'Other'
END AS award_status
FROM competition_results;
4. Sales performance ranking
sql
-- Dense ranking by sales amount within each department
SELECT
salesperson,
department,
sales_amount,
DENSE_RANK() OVER (PARTITION BY department ORDER BY sales_amount DESC) AS performance_rank
FROM sales_performance;
5. Product rating ranking
sql
-- Rank products by average rating within each category
SELECT
product_name,
category,
average_rating,
DENSE_RANK() OVER (PARTITION BY category ORDER BY average_rating DESC) AS rating_rank
FROM product_reviews;
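6. Award selection
sql
-- Pick the top-N award winners, allowing ties
-- A window function cannot appear in WHERE, so rank in a derived table and filter outside
SELECT employee_name, score, award_rank
FROM (
SELECT
employee_name,
score,
DENSE_RANK() OVER (ORDER BY score DESC) AS award_rank
FROM contest_results
) t
WHERE award_rank <= 5; -- the top 5 ranks win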
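Window functions are evaluated after WHERE, so a filter on a window result has to be applied one level up, in a derived table or CTE. The sketch below checks both halves of that statement with Python's sqlite3 module: the direct form is rejected, and the derived-table form returns the expected winners. The sample rows and the cutoff of 2 are invented for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contest_results (employee_name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO contest_results VALUES (?, ?)",
    [("A", 90), ("B", 85), ("C", 85), ("D", 80)],
)

# 1) Using the window function directly in WHERE is rejected by the engine.
try:
    conn.execute(
        "SELECT employee_name FROM contest_results "
        "WHERE DENSE_RANK() OVER (ORDER BY score DESC) <= 2"
    )
    rejected = False
except sqlite3.Error as exc:
    rejected = True
    print("rejected:", exc)

# 2) Ranking in a derived table and filtering outside works.
winners = conn.execute("""
    SELECT employee_name, score, award_rank
    FROM (
        SELECT employee_name,
               score,
               DENSE_RANK() OVER (ORDER BY score DESC) AS award_rank
        FROM contest_results
    ) AS ranked
    WHERE award_rank <= 2
    ORDER BY award_rank, employee_name
""").fetchall()

print(winners)
```

Because DENSE_RANK is used, a cutoff of 2 returns three people here: the tie at 85 shares rank 2.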
7. Performance review
sql
-- Grade employees into performance levels
SELECT
employee_name,
performance_score,
CASE
WHEN DENSE_RANK() OVER (ORDER BY performance_score DESC) <= 10 THEN 'Top Performer'
WHEN DENSE_RANK() OVER (ORDER BY performance_score DESC) <= 30 THEN 'High Performer'
WHEN DENSE_RANK() OVER (ORDER BY performance_score DESC) <= 60 THEN 'Average Performer'
ELSE 'Needs Improvement'
END AS performance_level
FROM employee_performance;
8. Leaderboards
sql
-- Game leaderboard: players with equal scores share a position
SELECT
player_name,
score,
DENSE_RANK() OVER (ORDER BY score DESC) AS leaderboard_position
FROM game_scores;
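Selection guide
- ROW_NUMBER: when each row needs a unique position and ties are not allowed
- RANK: when ties are allowed and gaps in the ranking are acceptable
- DENSE_RANK: when ties are allowed but ranks must stay consecutive
- PERCENT_RANK: when a normalized, percentage-style position is needed
3.3 Analytic functions
- FIRST_VALUE(column): the value of column in the first row of the frame; with ORDER BY and the default frame, that is the first row of the partition
- LAST_VALUE(column): the value of column in the last row of the frame; with ORDER BY and the default frame, the frame ends at the current row (and its peers), so the result is usually the current row's own value unless the frame is extended, for example with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
- LEAD(column, offset, default_value): offset function; returns column from the row that is offset rows after the current row within the partition
- LAG(column, offset, default_value): offset function; returns column from the row that is offset rows before the current row within the partition
Note: FIRST_VALUE/LAST_VALUE look at an absolute position (the first or last row of the frame), while LEAD/LAG look at a position relative to the current row (n rows before or after)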
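LAST_VALUE is a frequent source of surprises because of the default frame. A minimal sketch with Python's sqlite3 module (SQLite 3.25+ assumed, hypothetical `daily` table) shows the default behavior next to an explicitly extended frame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (d INTEGER, v INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO daily VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

value_rows = conn.execute("""
    SELECT d,
           FIRST_VALUE(v) OVER (ORDER BY d) AS first_v,
           -- default frame ends at the current row, so this echoes v
           LAST_VALUE(v)  OVER (ORDER BY d) AS last_default,
           -- explicit frame covering the whole partition
           LAST_VALUE(v)  OVER (
               ORDER BY d
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_full
    FROM daily
    ORDER BY d
""").fetchall()

for row in value_rows:
    print(row)
```

Only the column with the extended frame returns 30 on every row; the default-frame column simply repeats the current row's value.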
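Example:
sql
-- Days between a user's consecutive purchases of the same product
SELECT
user_id, product, amount, pay_time,
LAG(pay_time) OVER (PARTITION BY user_id, product ORDER BY pay_time) AS last_pay_time,
DATEDIFF(pay_time, LAG(pay_time) OVER (PARTITION BY user_id, product ORDER BY pay_time)) AS days_since_last
FROM order_detail
ORDER BY user_id, product, pay_time;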
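[Purpose] Accesses the row n rows after the current row; useful for forecasting and for comparing with following values
LEAD(column, offset, default_value) OVER (PARTITION BY partition_columns ORDER BY order_columns)
[Parameters]
column: the column to read
offset: the offset, 1 by default (the next row)
default_value: optional; the value returned when the offset goes past the partition boundary (NULL if omitted)
[Purpose] Accesses the row n rows before the current row; useful for historical comparison and change analysis
LAG(column, offset, default_value) OVER (PARTITION BY partition_columns ORDER BY order_columns)
[Parameters]
column: the column to read
offset: the offset, 1 by default (the previous row)
default_value: optional; the value returned when the offset goes past the partition boundary (NULL if omitted)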
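The three parameters can be seen side by side in a short sketch, run with Python's sqlite3 module (SQLite 3.25+ assumed); the `daily` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (d INTEGER, v INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO daily VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

offset_rows = conn.execute("""
    SELECT d,
           v,
           LAG(v)       OVER (ORDER BY d) AS prev_v,         -- offset 1, default NULL
           LAG(v, 2, 0) OVER (ORDER BY d) AS prev2_or_zero,  -- offset 2, default 0
           LEAD(v)      OVER (ORDER BY d) AS next_v          -- offset 1, default NULL
    FROM daily
    ORDER BY d
""").fetchall()

for row in offset_rows:
    print(row)
```

NULL comes back as Python's `None`; the explicit default of 0 replaces it only in the column that asks for it.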
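[Performance tips]
An index on the ORDER BY columns may improve window function performance, depending on the engine and the query plan
Use PARTITION BY sensibly; overly fine-grained partitioning can hurt performance
Choose offsets with care; very large offsets may affect query performance, depending on the engine
1. Time series analysis
sql
-- Analyze each user's consecutive purchases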
SELECT
user_id,
purchase_date,
amount,
LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date) AS last_purchase_date,
LEAD(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date) AS next_purchase_date,
DATEDIFF(purchase_date, LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)) AS days_since_last
FROM purchases;
2. Change analysis
sql
-- Month-over-month growth in sales
SELECT
month,
sales_amount,
LAG(sales_amount) OVER (ORDER BY month) AS prev_month_sales,
sales_amount - LAG(sales_amount) OVER (ORDER BY month) AS difference,
(sales_amount - LAG(sales_amount) OVER (ORDER BY month)) / LAG(sales_amount) OVER (ORDER BY month) * 100 AS growth_rate
FROM monthly_sales;
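The growth-rate query repeats the same LAG expression three times and divides by it directly. One possible variant, sketched below with Python's sqlite3 module, computes LAG once in a CTE and guards the division with NULLIF so that a previous value of 0 yields NULL. The sample data is invented, and `100.0` is used because SQLite would otherwise perform integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, sales_amount INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2024-01", 100), ("2024-02", 150), ("2024-03", 0), ("2024-04", 50)],
)

growth_rows = conn.execute("""
    WITH with_prev AS (
        -- compute the previous month's value once
        SELECT month,
               sales_amount,
               LAG(sales_amount) OVER (ORDER BY month) AS prev_sales
        FROM monthly_sales
    )
    SELECT month,
           sales_amount,
           prev_sales,
           -- NULLIF turns a zero denominator into NULL instead of dividing by it
           (sales_amount - prev_sales) * 100.0 / NULLIF(prev_sales, 0) AS growth_rate
    FROM with_prev
    ORDER BY month
""").fetchall()

for row in growth_rows:
    print(row)
```

The first month has no predecessor and April follows a month with zero sales, so both come back with a NULL growth rate rather than an error or a misleading number.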
3. Inventory turnover analysis
sql
-- Analyze inventory changes
SELECT
date,
inventory_count,
LEAD(inventory_count) OVER (ORDER BY date) AS next_day_inventory,
inventory_count - LEAD(inventory_count) OVER (ORDER BY date) AS inventory_change
FROM inventory_log;
4. Price movement analysis
sql
-- Stock price analysis
SELECT
trade_date,
closing_price,
LAG(closing_price, 1) OVER (ORDER BY trade_date) AS prev_close,
LAG(closing_price, 5) OVER (ORDER BY trade_date) AS prev_week_close,
LEAD(closing_price, 1) OVER (ORDER BY trade_date) AS next_close,
(closing_price - LAG(closing_price, 1) OVER (ORDER BY trade_date)) / LAG(closing_price, 1) OVER (ORDER BY trade_date) * 100 AS daily_return
FROM stock_prices;
5. Default values
sql
-- Use 0 as the default when there is no previous row
SELECT
date,
value,
LAG(value, 1, 0) OVER (ORDER BY date) AS prev_value_with_default
FROM data_table;
6. Larger offsets
sql
-- Year-over-year comparison (same month last year); assumes one row per month with no gaps
SELECT
year,
month,
revenue,
LAG(revenue, 12) OVER (ORDER BY year, month) AS last_year_revenue,
CASE
WHEN LAG(revenue, 12) OVER (ORDER BY year, month) IS NOT NULL
THEN (revenue - LAG(revenue, 12) OVER (ORDER BY year, month)) / LAG(revenue, 12) OVER (ORDER BY year, month) * 100
ELSE NULL
END AS yoy_growth_rate
FROM monthly_revenue;
7. Combining with partitions
sql
-- Sales changes within each product category
SELECT
product_category,
sale_date,
sales,
LAG(sales) OVER (PARTITION BY product_category ORDER BY sale_date) AS prev_sales,
LEAD(sales) OVER (PARTITION BY product_category ORDER BY sale_date) AS next_sales
FROM sales_data;
3.4 Proportion analysis
Computes each row's share of its group's total
Example:
sql
-- Each row's sales amount as a percentage of its department's total
SELECT
department, sales_amount,
ROUND(
sales_amount * 1.0 / SUM(sales_amount) OVER (PARTITION BY department) * 100, 2
) AS percentage
FROM sales;
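3.5 Sliding window functions
Sliding calculations over a frame defined by ROWS (a physical count of rows relative to the current row) or RANGE (a logical range of ORDER BY values)
Example:
sql
-- Sum of the current row and the two preceding rows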
SELECT
id, department, sales_amount,
SUM(sales_amount) OVER (
PARTITION BY department
ORDER BY sales_amount
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS sliding_sum
FROM sales;
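IV. Practical application scenarios
1. Ranking analysis
Sales ranking
sql
-- Rank by sales amount with all three ranking functions to compare how ties are handled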
SELECT
employee_name,
sales_amount,
RANK() OVER (ORDER BY sales_amount DESC) AS rank_by_sales,
DENSE_RANK() OVER (ORDER BY sales_amount DESC) AS dense_rank_by_sales,
ROW_NUMBER() OVER (ORDER BY sales_amount DESC) AS row_number_by_sales
FROM sales_records;
Student score ranking
sql
-- Rank students by score within each class
SELECT
class,
student_name,
score,
RANK() OVER (PARTITION BY class ORDER BY score DESC) AS class_rank
FROM student_scores;
2. Cumulative calculations
Monthly cumulative sales
sql
-- Running total of sales by month
SELECT
sale_month,
monthly_sales,
SUM(monthly_sales) OVER (
ORDER BY sale_month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_sales
FROM monthly_sales_data;
Cumulative user count
sql
-- Cumulative number of registered users per day
SELECT
register_date,
new_users,
SUM(new_users) OVER (
ORDER BY register_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS total_users
FROM daily_registrations;
3. Comparative analysis
Comparison with the previous period
sql
-- Difference from the previous month's sales
SELECT
month,
sales,
LAG(sales, 1) OVER (ORDER BY month) AS prev_month_sales,
sales - LAG(sales, 1) OVER (ORDER BY month) AS diff_from_prev
FROM monthly_sales;
Year-over-year analysis
sql
-- Year-over-year growth rate; assumes one row per month with no gaps
SELECT
year,
month,
revenue,
LAG(revenue, 12) OVER (ORDER BY year, month) AS last_year_revenue,
CASE
WHEN LAG(revenue, 12) OVER (ORDER BY year, month) > 0
THEN (revenue - LAG(revenue, 12) OVER (ORDER BY year, month)) / LAG(revenue, 12) OVER (ORDER BY year, month) * 100
ELSE NULL
END AS yoy_growth_rate
FROM monthly_revenue;
4. Grouped statistics
Grouped statistics along different dimensions
sql
-- Sales statistics by region and product category
SELECT
region,
product_category,
salesperson,
sales_amount,
AVG(sales_amount) OVER (PARTITION BY region) AS avg_sales_by_region,
AVG(sales_amount) OVER (PARTITION BY product_category) AS avg_sales_by_category,
COUNT(*) OVER (PARTITION BY region, product_category) AS count_in_group
FROM sales_data;
Multi-level analysis
sql
-- Multi-level grouping: country -> city -> employee
SELECT
country,
city,
employee_name,
sales_amount,
-- Sales rank within each country
ROW_NUMBER() OVER (PARTITION BY country ORDER BY sales_amount DESC) AS country_rank,
-- Sales rank within each city
ROW_NUMBER() OVER (PARTITION BY country, city ORDER BY sales_amount DESC) AS city_rank,
-- Global rank
ROW_NUMBER() OVER (ORDER BY sales_amount DESC) AS global_rank
FROM employee_sales;
Comprehensive example: e-commerce platform analysis
sql
-- Combined analysis of user purchase behavior
SELECT
user_id,
order_date,
order_amount,
-- Running total of each user's spending
SUM(order_amount) OVER (PARTITION BY user_id ORDER BY order_date) AS cumulative_user_spending,
-- Sequence number of the order for this user
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS order_sequence,
-- Amount of the previous order
LAG(order_amount) OVER (PARTITION BY user_id ORDER BY order_date) AS prev_order_amount,
-- Difference from the previous order amount
order_amount - LAG(order_amount) OVER (PARTITION BY user_id ORDER BY order_date) AS amount_diff,
-- Month-to-date running total of order amounts across all users (the partition does not include user_id)
SUM(order_amount) OVER (PARTITION BY YEAR(order_date), MONTH(order_date) ORDER BY order_date) AS monthly_cumulative
FROM orders
ORDER BY user_id, order_date;
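V. Usage notes
- PARTITION BY and ORDER BY are both optional
- Without PARTITION BY, the whole result set is treated as a single partition
- Without ORDER BY, rows in a partition have no defined order: aggregate window functions cover the entire partition, and ranking or offset functions return results that are not deterministic
- Frame clauses (ROWS/RANGE) are normally paired with ORDER BY; RANGE with a numeric or interval offset requires it, and ROWS without it depends on an undefined row order
- Performance: window functions add sorting and computation cost, especially when the PARTITION BY and ORDER BY combination is complex; one option is to save the base query's result to a temporary table and apply the window functions to that result
- MySQL 8.0 supports window functions, but its aggregate window functions do not support DISTINCT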
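The effect of ORDER BY and of the frame type on an aggregate window function can be checked with a small experiment. The sketch below uses Python's sqlite3 module (SQLite 3.25+ assumed) and a hypothetical `daily` table in which two rows share the same `d`; `v` is added to the ROWS ordering only to make that column deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (d INTEGER, v INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO daily VALUES (?, ?)", [(1, 10), (2, 20), (2, 30), (3, 40)])

note_rows = conn.execute("""
    SELECT d,
           v,
           -- no ORDER BY: the frame is the whole partition
           SUM(v) OVER () AS total,
           -- ORDER BY with the default (RANGE) frame: tied rows are summed together
           SUM(v) OVER (ORDER BY d) AS running_range,
           -- explicit ROWS frame: strictly row by row
           SUM(v) OVER (
               ORDER BY d, v
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_rows
    FROM daily
    ORDER BY d, v
""").fetchall()

for row in note_rows:
    print(row)
```

Both rows with d = 2 show 60 under the default frame but 30 and 60 under the ROWS frame, which is why an explicit frame is usually the safer choice for running totals over columns that can contain duplicates.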