1. Window functions in Hive and Spark SQL
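A complete window function call has three parts:
- An aggregate function or a dedicated window function
- The over() clause
- The window specification, defined inside the over() clause. It consists of partition by and order by, plus an optional frame clause; whether order by is present determines the default window frame.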
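Functions supported in a window:
- Sequential numbering
  - ROW_NUMBER(): numbers the rows in the window consecutively, starting at 1.
- Ranking
  - RANK(): assigns a rank by sort key value. Equal values get the same rank, and the following ranks are skipped: 1,2,2,4.
  - DENSE_RANK(): assigns a rank by sort key value. Equal values get the same rank, but ranks stay consecutive: 1,2,2,3.
  - PERCENT_RANK(): assigns a relative rank by sort key value, computed as (rank of current row - 1) / (rows in partition - 1). The result lies in [0,1].
- NTILE(n): splits the rows in the window as evenly as possible into n buckets and returns the bucket number.
- CUME_DIST(): computed as (number of rows whose sort key <= the current row's sort key) / (rows in partition). The result lies in (0,1], and rows in the same peer group get the same value.
- LAG(expr, offset, default) | LEAD(expr, offset, default): reads the row offset rows before/after the current row. offset defaults to 1 and default defaults to null. These functions are not affected by a rows or range frame; offset is a fixed displacement relative to the current row. expr is the column or expression whose value you want.
- FIRST_VALUE(expr) | LAST_VALUE(expr): return the first and the last value of expr in the window. Watch out for the last_value pitfall under the default window frame.
- nth_value(expr, n): returns the value of expr at the n-th row of the window frame, that is, the n-th entry of the ordered window.
- Aggregate functions: almost any aggregate function that works with group by is supported.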
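Notes:
- In Hive, order by can be omitted from the over() clause. When it is omitted, order by defaults to the same columns as partition by.
- In Spark, ranking functions such as row_number() require an explicit order by; otherwise the query fails with: Error in query: Window function row_number() requires window to be ordered, please add ORDER BY clause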
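The window frame defines, for the current row, exactly which rows its window covers.
The basic structure of a window frame is:
ROWS/RANGE BETWEEN <start> AND <end>
The main difference between ROWS and RANGE:
- ROWS: delimits the window by the physical position of the rows.
- RANGE: delimits the window by the value of the rows' sort key.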
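Property | ROWS | RANGE |
---|---|---|
Basis | Physical offset (row count) | Logical offset (value range) |
How it is computed | Counts a specific number of rows before/after the current row | Takes all rows whose sort key lies within a given distance of the current row's sort key |
Performance | Usually faster, only a pointer has to move | Usually slower, each value has to be checked against the range |
Determinism | Deterministic: a fixed row order gives a fixed result | With duplicate sort keys, the result can be unintuitive |
Typical use | Moving averages, comparing neighboring rows (as with LAG/LEAD) | Cumulative totals (up to the current value), grouping by value ranges |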
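In standard SQL, a range frame that ends at CURRENT ROW (with ascending order) includes every row whose sort key value is <= the current row's sort key value. If order by has duplicate values, the rows with the same value are treated as one peer group and are all included in the frame.
A peer group is the set of all rows in a sorted partition that share the same sort key value. range works in units of values: rows with the same value form a peer group that is handled as one indivisible unit.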
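To make the peer-group behavior concrete, here is a minimal pure-Python sketch of a running sum under the two frame types. It is not engine code: the helper names and the four sample salaries are assumptions chosen for illustration, and the input is assumed to be one partition already sorted ascending by the sort key.

```python
from itertools import accumulate

# Sample sort-key values for one partition, already sorted ascending (illustrative data).
salaries = [21000, 23000, 23000, 29000]

def running_sum_rows(values):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the frame ends at the current physical row."""
    return list(accumulate(values))

def running_sum_range(values):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the frame ends at the last peer of the current row."""
    return [sum(v for v in values if v <= current) for current in values]

rows_result = running_sum_rows(salaries)
range_result = running_sum_range(salaries)
print(rows_result)   # [21000, 44000, 67000, 96000]
print(range_result)  # [21000, 67000, 67000, 96000]
```

The two rows with salary 23000 are peers, so under RANGE both see the same frame and the same sum, while under ROWS each row only sees the rows physically up to itself.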
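The start point <start> and the end point <end> can be:
- UNBOUNDED PRECEDING: the first row of the partition
- UNBOUNDED FOLLOWING: the last row of the partition
- CURRENT ROW: the current row
- n PRECEDING: the n-th row before the current row
- n FOLLOWING: the n-th row after the current row
For example:
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: from the start of the partition to the current row. Used for cumulative sums and averages.
- ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING: from 3 rows before the current row to 1 row after it.
- ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the whole partition. Often used to compute a total over the whole partition when the partition is ordered.
Note: when order by is given but no window frame is specified explicitly, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. When neither order by nor a window frame is specified, the default is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.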
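The ROWS frame bounds listed above can be pictured as a slice over the sorted partition. The sketch below is a hypothetical helper (`rows_frame` is not an engine API) that uses `None` to stand for UNBOUNDED and clips the frame at the partition edges.

```python
def rows_frame(values, i, preceding, following):
    """Rows covered by ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING
    for the row at index i. None stands for UNBOUNDED; 0 means CURRENT ROW."""
    start = 0 if preceding is None else max(0, i - preceding)
    end = len(values) if following is None else min(len(values), i + following + 1)
    return values[start:end]

partition = [10, 20, 30, 40, 50, 60]  # illustrative values, already in order by order

frame_a = rows_frame(partition, 4, 3, 1)        # 3 PRECEDING AND 1 FOLLOWING
frame_b = rows_frame(partition, 2, None, 0)     # UNBOUNDED PRECEDING AND CURRENT ROW
frame_c = rows_frame(partition, 2, None, None)  # UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
print(frame_a)  # [20, 30, 40, 50, 60]
print(frame_b)  # [10, 20, 30]
print(frame_c)  # [10, 20, 30, 40, 50, 60]
```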
2. Function demos
Create a test table and insert test data
sql
CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT);
INSERT INTO employees VALUES
("Lisa", "Sales", 10000, 35)
,("Evan", "Sales", 32000, 38)
,("Fred", "Engineering", 21000, 28)
,("Alex", "Sales", 30000, 33)
,("Tom", "Engineering", 23000, 33)
,("Jane", "Marketing", 29000, 28)
,("Jeff", "Marketing", 35000, 38)
,("Paul", "Engineering", 29000, 23)
,("Chloe", "Engineering", 23000, 25);
2.1. Ranking functions
sql
select *,rank() over(partition by dept order by salary) as rn from employees;
-- In the Engineering partition, Tom and Chloe have the same sort key value, so they form one peer group and get the same rank
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name | employees.dept | employees.salary | employees.age | rn |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane | Marketing | 29000 | 28 | 1 |
| Jeff | Marketing | 35000 | 38 | 2 |
| Fred | Engineering | 21000 | 28 | 1 |
| Tom | Engineering | 23000 | 33 | 2 |
| Chloe | Engineering | 23000 | 25 | 2 |
| Paul | Engineering | 29000 | 23 | 4 |
| Lisa | Sales | 10000 | 35 | 1 |
| Alex | Sales | 30000 | 33 | 2 |
| Evan | Sales | 32000 | 38 | 3 |
+-----------------+-----------------+-------------------+----------------+-----+
select *,dense_rank() over(partition by dept order by salary) as rn from employees;
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name | employees.dept | employees.salary | employees.age | rn |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane | Marketing | 29000 | 28 | 1 |
| Jeff | Marketing | 35000 | 38 | 2 |
| Fred | Engineering | 21000 | 28 | 1 |
| Tom | Engineering | 23000 | 33 | 2 |
| Chloe | Engineering | 23000 | 25 | 2 |
| Paul | Engineering | 29000 | 23 | 3 |
| Lisa | Sales | 10000 | 35 | 1 |
| Alex | Sales | 30000 | 33 | 2 |
| Evan | Sales | 32000 | 38 | 3 |
+-----------------+-----------------+-------------------+----------------+-----+
select *,percent_rank() over(partition by dept order by salary) as rn from employees;
-- rn is each row's relative rank within its partition; rows with the same order by value get the same result
+-----------------+-----------------+-------------------+----------------+-------+
| employees.name | employees.dept | employees.salary | employees.age | rn |
+-----------------+-----------------+-------------------+----------------+-------+
| Jane | Marketing | 29000 | 28 | 0.0 | 0/1
| Jeff | Marketing | 35000 | 38 | 1.0 | 1/1
| Fred | Engineering | 21000 | 28 | 0.0 | 0/3
| Tom | Engineering | 23000 | 33 | 0.33 | Tom and Chloe tie for rank 2, (2-1)/(4-1)
| Chloe | Engineering | 23000 | 25 | 0.33 |
| Paul | Engineering | 29000 | 23 | 1.0 | 3/3
| Lisa | Sales | 10000 | 35 | 0.0 | 0/2
| Alex | Sales | 30000 | 33 | 0.5 | 1/2
| Evan | Sales | 32000 | 38 | 1.0 | 2/2
+-----------------+-----------------+-------------------+----------------+-------+
select *,rank() over(partition by dept order by dept) as rn from employees;
select *,dense_rank() over(partition by dept order by dept) as rn from employees;
-- Partitioned by dept and ranked by dept: every row in a partition ties for first place
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name | employees.dept | employees.salary | employees.age | rn |
+-----------------+-----------------+-------------------+----------------+-----+
| Fred | Engineering | 21000 | 28 | 1 |
| Tom | Engineering | 23000 | 33 | 1 |
| Paul | Engineering | 29000 | 23 | 1 |
| Chloe | Engineering | 23000 | 25 | 1 |
| Jane | Marketing | 29000 | 28 | 1 |
| Jeff | Marketing | 35000 | 38 | 1 |
| Lisa | Sales | 10000 | 35 | 1 |
| Evan | Sales | 32000 | 38 | 1 |
| Alex | Sales | 30000 | 33 | 1 |
+-----------------+-----------------+-------------------+----------------+-----+
select *,percent_rank() over(partition by dept order by dept) as rn from employees;
+-----------------+-----------------+-------------------+----------------+------+
| employees.name | employees.dept | employees.salary | employees.age | rn |
+-----------------+-----------------+-------------------+----------------+------+
| Fred | Engineering | 21000 | 28 | 0.0 |
| Tom | Engineering | 23000 | 33 | 0.0 |
| Paul | Engineering | 29000 | 23 | 0.0 |
| Chloe | Engineering | 23000 | 25 | 0.0 |
| Jane | Marketing | 29000 | 28 | 0.0 |
| Jeff | Marketing | 35000 | 38 | 0.0 |
| Lisa | Sales | 10000 | 35 | 0.0 |
| Evan | Sales | 32000 | 38 | 0.0 |
| Alex | Sales | 30000 | 33 | 0.0 |
+-----------------+-----------------+-------------------+----------------+------+
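The three ranking results above can be reproduced by hand. The following is a minimal sketch, assuming the sort keys of one partition are already sorted ascending; `rank_columns` is a hypothetical helper name, not a Hive or Spark API.

```python
def rank_columns(sorted_keys):
    """Return (rank, dense_rank, percent_rank) per row for keys sorted ascending."""
    n = len(sorted_keys)
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real key
    for position, key in enumerate(sorted_keys, start=1):
        if key != prev:
            rank = position  # RANK jumps to the row position, leaving gaps after ties
            dense += 1       # DENSE_RANK advances by one per distinct key
            prev = key
        # PERCENT_RANK = (rank - 1) / (rows in partition - 1); a single-row partition gives 0.0
        pct = 0.0 if n == 1 else (rank - 1) / (n - 1)
        out.append((rank, dense, pct))
    return out

# Engineering salaries from the test table, sorted ascending.
engineering = rank_columns([21000, 23000, 23000, 29000])
for row in engineering:
    print(row)
```

For the Engineering partition this yields ranks 1,2,2,4, dense ranks 1,2,2,3, and percent ranks 0, 1/3, 1/3, 1, matching the tables above.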
2.2. ntile(n)
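ntile splits the rows of a partition into buckets and returns each row's bucket number. If the rows cannot be divided evenly among the buckets, the extra rows are assigned one each to the leading buckets. ntile does not support a range window frame.
Use case: to get the employees in the top third by salary in each department, order by salary descending, split into 3 buckets, and take the rows in the first bucket.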
sql
select *,NTILE(2) over(partition by dept order by salary) as rn from employees;
-- Split into 2 buckets
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name | employees.dept | employees.salary | employees.age | rn |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane | Marketing | 29000 | 28 | 1 |
| Jeff | Marketing | 35000 | 38 | 2 |
| Fred | Engineering | 21000 | 28 | 1 |
| Tom | Engineering | 23000 | 33 | 1 |
| Chloe | Engineering | 23000 | 25 | 2 |
| Paul | Engineering | 29000 | 23 | 2 |
| Lisa | Sales | 10000 | 35 | 1 |
| Alex | Sales | 30000 | 33 | 1 |
| Evan | Sales | 32000 | 38 | 2 |
+-----------------+-----------------+-------------------+----------------+-----+
select *,NTILE(3) over(partition by dept order by salary) as rn from employees;
-- Split into 3 buckets
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name | employees.dept | employees.salary | employees.age | rn |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane | Marketing | 29000 | 28 | 1 |
| Jeff | Marketing | 35000 | 38 | 2 |
| Fred | Engineering | 21000 | 28 | 1 |
| Tom | Engineering | 23000 | 33 | 1 |
| Chloe | Engineering | 23000 | 25 | 2 |
| Paul | Engineering | 29000 | 23 | 3 |
| Lisa | Sales | 10000 | 35 | 1 |
| Alex | Sales | 30000 | 33 | 2 |
| Evan | Sales | 32000 | 38 | 3 |
+-----------------+-----------------+-------------------+----------------+-----+
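The bucket assignment can be sketched in a few lines. This is an illustration of the distribution rule only, assuming the rows are already sorted; the function name `ntile` here is a plain Python helper, not the SQL function itself.

```python
def ntile(row_count, buckets):
    """Bucket number for each of row_count sorted rows split into `buckets` buckets.
    Every bucket gets row_count // buckets rows; the remainder goes one row each
    to the leading buckets."""
    base, extra = divmod(row_count, buckets)
    result = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        result.extend([bucket] * size)
    return result

print(ntile(4, 2))  # Engineering, NTILE(2): [1, 1, 2, 2]
print(ntile(3, 2))  # Sales, NTILE(2):       [1, 1, 2]
print(ntile(4, 3))  # Engineering, NTILE(3): [1, 1, 2, 3]
print(ntile(2, 3))  # Marketing, NTILE(3):   [1, 2]
```

When there are fewer rows than buckets, as for Marketing with NTILE(3), the trailing buckets simply stay empty.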
2.3. cume_dist()
sql
select *,cume_dist() over(partition by dept order by salary) from employees;
-- In the Engineering partition, Tom and Chloe have the same salary, so both get the same cumulative distribution value
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name | employees.dept | employees.salary | employees.age | cume_dist_window_0 |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane | Marketing | 29000 | 28 | 0.5 | 1/2
| Jeff | Marketing | 35000 | 38 | 1.0 | 2/2
| Fred | Engineering | 21000 | 28 | 0.25 | 1/4
| Tom | Engineering | 23000 | 33 | 0.75 | 3/4
| Chloe | Engineering | 23000 | 25 | 0.75 | 3/4
| Paul | Engineering | 29000 | 23 | 1.0 | 4/4
| Lisa | Sales | 10000 | 35 | 0.3333333333333333 | 1/3
| Alex | Sales | 30000 | 33 | 0.6666666666666666 | 2/3
| Evan | Sales | 32000 | 38 | 1.0 | 3/3
+-----------------+-----------------+-------------------+----------------+---------------------+
select *,cume_dist() over(partition by dept order by age) from employees;
-- Ordered by age, Tom and Chloe have different ages, so their cumulative distribution values differ
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name | employees.dept | employees.salary | employees.age | cume_dist_window_0 |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane | Marketing | 29000 | 28 | 0.5 |
| Jeff | Marketing | 35000 | 38 | 1.0 |
| Paul | Engineering | 29000 | 23 | 0.25 |
| Chloe | Engineering | 23000 | 25 | 0.5 |
| Fred | Engineering | 21000 | 28 | 0.75 |
| Tom | Engineering | 23000 | 33 | 1.0 |
| Alex | Sales | 30000 | 33 | 0.3333333333333333 |
| Lisa | Sales | 10000 | 35 | 0.6666666666666666 |
| Evan | Sales | 32000 | 38 | 1.0 |
+-----------------+-----------------+-------------------+----------------+---------------------+
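The cume_dist values in both tables follow directly from the rule "rows with sort key <= current key, divided by rows in the partition". A minimal sketch, assuming ascending order and a hypothetical helper name:

```python
def cume_dist(sorted_keys):
    """CUME_DIST per row: (rows whose key <= current key) / (rows in partition)."""
    n = len(sorted_keys)
    return [sum(1 for k in sorted_keys if k <= key) / n for key in sorted_keys]

by_salary = cume_dist([21000, 23000, 23000, 29000])  # Engineering ordered by salary
by_age = cume_dist([23, 25, 28, 33])                 # Engineering ordered by age
print(by_salary)  # [0.25, 0.75, 0.75, 1.0]
print(by_age)     # [0.25, 0.5, 0.75, 1.0]
```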
2.4. LAG/LEAD(expr,offset,default_value)
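lag means "to fall behind"; here it refers to rows that come before the current row. lead means "to be ahead"; here it refers to rows that come after the current row.
Without a self-join, these functions return the value of a given column from the row that is offset rows before/after the current row in order by order.
Function parameters of LAG/LEAD(expr,offset,default_value): expr is the column (or expression) to read; offset is the number of rows to shift, 1 by default; default_value is the value returned when the shift runs past the start or end of the partition, null by default.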
sql
SELECT *,
LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS lag,
LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) AS lead
FROM employees;
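-- In order by order, lag returns the value from 1 row before the current row and lead returns the value from 1 row after it; for the first or last row the result defaults to null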
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name | employees.dept | employees.salary | employees.age | lag | lead |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane | Marketing | 29000 | 28 | NULL | 35000 |
| Jeff | Marketing | 35000 | 38 | 29000 | NULL |
| Fred | Engineering | 21000 | 28 | NULL | 23000 |
| Tom | Engineering | 23000 | 33 | 21000 | 23000 |
| Chloe | Engineering | 23000 | 25 | 23000 | 29000 |
| Paul | Engineering | 29000 | 23 | 23000 | NULL |
| Lisa | Sales | 10000 | 35 | NULL | 30000 |
| Alex | Sales | 30000 | 33 | 10000 | 32000 |
| Evan | Sales | 32000 | 38 | 30000 | NULL |
+-----------------+-----------------+-------------------+----------------+--------+--------+
SELECT *,
LAG(salary,2,0) OVER (PARTITION BY dept ORDER BY salary) AS lag,
LEAD(salary, 1,0) OVER (PARTITION BY dept ORDER BY salary) AS lead
FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name | employees.dept | employees.salary | employees.age | lag | lead |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane | Marketing | 29000 | 28 | 0 | 35000 |
| Jeff | Marketing | 35000 | 38 | 0 | 0 |
| Fred | Engineering | 21000 | 28 | 0 | 23000 |
| Tom | Engineering | 23000 | 33 | 0 | 23000 |
| Chloe | Engineering | 23000 | 25 | 21000 | 29000 |
| Paul | Engineering | 29000 | 23 | 23000 | 0 |
| Lisa | Sales | 10000 | 35 | 0 | 30000 |
| Alex | Sales | 30000 | 33 | 0 | 32000 |
| Evan | Sales | 32000 | 38 | 10000 | 0 |
+-----------------+-----------------+-------------------+----------------+--------+--------+
-- The case where the ordering column differs from the column being read
SELECT *,
LAG(salary) OVER (PARTITION BY dept ORDER BY age) AS lag,
LEAD(salary) OVER (PARTITION BY dept ORDER BY age) AS lead
FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name | employees.dept | employees.salary | employees.age | lag | lead |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane | Marketing | 29000 | 28 | NULL | 35000 |
| Jeff | Marketing | 35000 | 38 | 29000 | NULL |
| Paul | Engineering | 29000 | 23 | NULL | 23000 |
| Chloe | Engineering | 23000 | 25 | 29000 | 21000 |
| Fred | Engineering | 21000 | 28 | 23000 | 23000 |
| Tom | Engineering | 23000 | 33 | 21000 | NULL |
| Alex | Sales | 30000 | 33 | NULL | 10000 |
| Lisa | Sales | 10000 | 35 | 30000 | 32000 |
| Evan | Sales | 32000 | 38 | 10000 | NULL |
+-----------------+-----------------+-------------------+----------------+--------+--------+
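The shifting behavior of LAG/LEAD can be mimicked with list indexing. The helpers below are a sketch for one partition whose values are already in order by order; they assume a non-negative offset and use Python's `None` in place of SQL null.

```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows before each row, or `default` past the partition start."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    """Value from `offset` rows after each row, or `default` past the partition end."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

engineering = [21000, 23000, 23000, 29000]  # Engineering salaries ordered by salary
print(lag(engineering))        # [None, 21000, 23000, 23000]
print(lead(engineering))       # [23000, 23000, 29000, None]
print(lag(engineering, 2, 0))  # [0, 0, 21000, 23000]
```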
2.5. first_value/last_value(expr)
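Within a partition, these functions return the first or the last value of the given column in order by order.
Note that by default last_value returns the last value in the frame that runs from the first row up to the current row (that is, the current row's value), not the last value of the whole sorted partition.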
sql
SELECT *,
first_value(salary) OVER (PARTITION BY dept ORDER BY age) AS first,
last_value(salary) OVER (PARTITION BY dept ORDER BY age) AS last
FROM employees;
-- Note the values in the last column
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name | employees.dept | employees.salary | employees.age | first | last |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane | Marketing | 29000 | 28 | 29000 | 29000 |
| Jeff | Marketing | 35000 | 38 | 29000 | 35000 |
| Paul | Engineering | 29000 | 23 | 29000 | 29000 |
| Chloe | Engineering | 23000 | 25 | 29000 | 23000 |
| Fred | Engineering | 21000 | 28 | 29000 | 21000 |
| Tom | Engineering | 23000 | 33 | 29000 | 23000 |
| Alex | Sales | 30000 | 33 | 30000 | 30000 |
| Lisa | Sales | 10000 | 35 | 30000 | 10000 |
| Evan | Sales | 32000 | 38 | 30000 | 32000 |
+-----------------+-----------------+-------------------+----------------+--------+--------+
SELECT *,
first_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS first,
last_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS last
FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name | employees.dept | employees.salary | employees.age | first | last |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane | Marketing | 29000 | 28 | 29000 | 29000 |
| Jeff | Marketing | 35000 | 38 | 29000 | 35000 |
| Fred | Engineering | 21000 | 28 | 21000 | 21000 |
| Tom | Engineering | 23000 | 33 | 21000 | 23000 |
| Chloe | Engineering | 23000 | 25 | 21000 | 23000 |
| Paul | Engineering | 29000 | 23 | 21000 | 29000 |
| Lisa | Sales | 10000 | 35 | 10000 | 10000 |
| Alex | Sales | 30000 | 33 | 10000 | 30000 |
| Evan | Sales | 32000 | 38 | 10000 | 32000 |
+-----------------+-----------------+-------------------+----------------+--------+--------+
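Pitfall: under the default window frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, each row's frame ends at the current row (together with its peers), so last returns the current row's salary value. To get the last value of the whole partition, specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING explicitly.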
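The difference between the default frame and a whole-partition frame can be sketched as follows. Both helpers are hypothetical and take one partition as (order_key, value) pairs already sorted by the order key.

```python
def last_value_default_frame(sorted_pairs):
    """last_value under RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    the frame ends at the last peer of the current row."""
    out = []
    for key, _ in sorted_pairs:
        frame = [v for k, v in sorted_pairs if k <= key]
        out.append(frame[-1])
    return out

def last_value_full_partition(sorted_pairs):
    """last_value under ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING."""
    return [sorted_pairs[-1][1]] * len(sorted_pairs)

# Engineering rows as (age, salary), ordered by age.
by_age = [(23, 29000), (25, 23000), (28, 21000), (33, 23000)]
default_result = last_value_default_frame(by_age)
full_result = last_value_full_partition(by_age)
print(default_result)  # [29000, 23000, 21000, 23000]
print(full_result)     # [23000, 23000, 23000, 23000]
```

With the default frame the result just echoes each row's own salary, which matches the last column in the first table of this section.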
2.6. nth_value(expr,n)
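nth means "the n-th".
It returns the value of the given column at the n-th row of the window frame, in order by order.
Note: Hive does not have this function.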
sql
SELECT *,nth_value(salary,2) OVER (PARTITION BY dept ORDER BY salary) AS nth FROM employees;
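No result table is shown for this query, so the sketch below only illustrates what the rule implies under the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. It is a pure-Python approximation with a hypothetical helper name, assuming the value column is also the sort key (as in the query above) and using `None` for SQL null.

```python
def nth_value_default_frame(sorted_values, n):
    """n-th value (1-based) of the frame that runs from the partition start to the
    last peer of the current row; None when the frame has fewer than n rows."""
    out = []
    for current in sorted_values:
        frame = [v for v in sorted_values if v <= current]
        out.append(frame[n - 1] if len(frame) >= n else None)
    return out

engineering = nth_value_default_frame([21000, 23000, 23000, 29000], 2)
print(engineering)  # [None, 23000, 23000, 23000]
```

The first row gets no value because its frame contains only itself; as with last_value, a whole-partition frame would be needed to return the second-lowest salary on every row.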