Hive/Spark Window Functions

Window Functions

Hive documentation link
Spark documentation link

1. Functions Supported by OVER

  • Sequential numbering
    Syntax: ROW_NUMBER
  • Rank-based numbering
    Syntax: RANK | DENSE_RANK | PERCENT_RANK
  • Bucket the rows within each partition and return each row's bucket number
    Syntax: NTILE(n)
  • Analytic functions
    Syntax: CUME_DIST | LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE
  • Aggregate functions
    Syntax: MAX | MIN | COUNT | SUM | AVG | ...

1.1. Setup

Create a test table and insert some test data.

sql
CREATE TABLE employees (name STRING, dept STRING, salary INT, age INT);

INSERT INTO employees VALUES 
("Lisa", "Sales", 10000, 35)
,("Evan", "Sales", 32000, 38)
,("Fred", "Engineering", 21000, 28)
,("Alex", "Sales", 30000, 33)
,("Tom", "Engineering", 23000, 33)
,("Jane", "Marketing", 29000, 28)
,("Jeff", "Marketing", 35000, 38)
,("Paul", "Engineering", 29000, 23)
,("Chloe", "Engineering", 23000, 25);

1.2. row_number()

row_number() over(): probably the most frequently used window function. It assigns a sequential number to each row within its partition according to the ORDER BY result, producing values such as 1 2 3 4 5.

sql
select *,row_number() over(partition by dept order by salary) as rn from employees;
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 2   |
| Chloe           | Engineering     | 23000             | 25             | 3   |
| Paul            | Engineering     | 29000             | 23             | 4   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+
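
A common pattern built on row_number is "top N per group". The sketch below, using the same test table, keeps the two highest-paid employees in each department (the subquery alias t and column alias rn are illustrative); ties are broken arbitrarily unless more ORDER BY columns are added.

sql
-- Top 2 earners per department: number rows by salary descending, then filter
SELECT name, dept, salary
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
) t
WHERE rn <= 2;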

1.3. The rank functions

rank implies ranking: these functions assign a rank within the partition according to the ORDER BY result.
rank() over(): ties share a rank and leave gaps, e.g. 1 2 2 4 5
dense_rank() over(): dense means no gaps; ties share a rank and the sequence stays compact, e.g. 1 2 2 3 4
percent_rank() over(): the rank as a fraction, computed as (rank of the current row - 1) / (total rows in the partition - 1), where ranks start at 1

Note: rows with equal ORDER BY values are ties in the ordering, so they receive the same rank.

sql
select *,rank() over(partition by dept order by salary) as rn from employees;
-- In the result, the Engineering rows with the same salary share a rank, and Paul's rank is 4
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 2   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 4   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+

select *,dense_rank() over(partition by dept order by salary) as rn from employees;
-- In the result, the Engineering rows with the same salary share a rank, and Paul's rank is 3
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 2   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 3   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+

select *,percent_rank() over(partition by dept order by salary) as rn from employees;
-- rn is each row's percent rank within its partition; rows with equal ORDER BY values get the same result
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |         rn          |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane            | Marketing       | 29000             | 28             | 0.0                 |
| Jeff            | Marketing       | 35000             | 38             | 1.0                 |
| Fred            | Engineering     | 21000             | 28             | 0.0                 |
| Tom             | Engineering     | 23000             | 33             | 0.3333333333333333  |
| Chloe           | Engineering     | 23000             | 25             | 0.3333333333333333  |
| Paul            | Engineering     | 29000             | 23             | 1.0                 |
| Lisa            | Sales           | 10000             | 35             | 0.0                 |
| Alex            | Sales           | 30000             | 33             | 0.5                 |
| Evan            | Sales           | 32000             | 38             | 1.0                 |
+-----------------+-----------------+-------------------+----------------+---------------------+

-- The case where all ORDER BY values within a partition are equal: order by dept
select *,rank() over(partition by dept order by dept) as rn from employees;
select *,dense_rank() over(partition by dept order by dept) as rn from employees;
-- Both queries return the same result
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Paul            | Engineering     | 29000             | 23             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 1   |
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 1   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Evan            | Sales           | 32000             | 38             | 1   |
| Alex            | Sales           | 30000             | 33             | 1   |
+-----------------+-----------------+-------------------+----------------+-----+

select *,percent_rank() over(partition by dept order by dept) as rn from employees;
+-----------------+-----------------+-------------------+----------------+------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  rn  |
+-----------------+-----------------+-------------------+----------------+------+
| Fred            | Engineering     | 21000             | 28             | 0.0  |
| Tom             | Engineering     | 23000             | 33             | 0.0  |
| Paul            | Engineering     | 29000             | 23             | 0.0  |
| Chloe           | Engineering     | 23000             | 25             | 0.0  |
| Jane            | Marketing       | 29000             | 28             | 0.0  |
| Jeff            | Marketing       | 35000             | 38             | 0.0  |
| Lisa            | Sales           | 10000             | 35             | 0.0  |
| Evan            | Sales           | 32000             | 38             | 0.0  |
| Alex            | Sales           | 30000             | 33             | 0.0  |
+-----------------+-----------------+-------------------+----------------+------+

1.4. ntile(n)

ntile(n) over(): tile suggests tiling; the function splits the rows of each partition into n buckets and returns each row's bucket number.

Rows are spread across the buckets as evenly as possible in ORDER BY order; when they cannot be divided evenly, the leading buckets each receive one extra row.

Use case: to get the employees in the top third of salaries in each department, sort by salary in descending order, split into 3 buckets, and keep the rows in bucket 1 (a sketch of this query follows the examples below).

sql
select *,NTILE(2) over(partition by dept order by salary) as rn from employees;
-- NTILE(2) splits each partition into 2 buckets
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 2   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 1   |
| Evan            | Sales           | 32000             | 38             | 2   |
+-----------------+-----------------+-------------------+----------------+-----+
select *,NTILE(3) over(partition by dept order by salary) as rn from employees;
-- NTILE(3) splits each partition into 3 buckets
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 2   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 2   |
| Paul            | Engineering     | 29000             | 23             | 3   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Alex            | Sales           | 30000             | 33             | 2   |
| Evan            | Sales           | 32000             | 38             | 3   |
+-----------------+-----------------+-------------------+----------------+-----+
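
As a sketch of the use case mentioned above: the employees in the top third of salaries in each department are the rows that land in bucket 1 of NTILE(3) over salary in descending order (the aliases t and bucket_no are illustrative).

sql
-- Top third of salaries per department
SELECT name, dept, salary
FROM (
    SELECT *, NTILE(3) OVER (PARTITION BY dept ORDER BY salary DESC) AS bucket_no
    FROM employees
) t
WHERE bucket_no = 1;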

1.5. cume_dist()

cume stands for cumulative and dist for distribution.

cume_dist computes the cumulative distribution within the partition: for each row, the number of rows whose ORDER BY value is less than or equal to the current row's, divided by the total number of rows in the partition.

sql
select *,cume_dist() over(partition by dept order by salary) from employees;
-- In the Engineering partition, Tom and Chloe have the same salary, so they get the same cumulative distribution
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | cume_dist_window_0  |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane            | Marketing       | 29000             | 28             | 0.5                 |
| Jeff            | Marketing       | 35000             | 38             | 1.0                 |
| Fred            | Engineering     | 21000             | 28             | 0.25                |
| Tom             | Engineering     | 23000             | 33             | 0.75                |
| Chloe           | Engineering     | 23000             | 25             | 0.75                |
| Paul            | Engineering     | 29000             | 23             | 1.0                 |
| Lisa            | Sales           | 10000             | 35             | 0.3333333333333333  |
| Alex            | Sales           | 30000             | 33             | 0.6666666666666666  |
| Evan            | Sales           | 32000             | 38             | 1.0                 |
+-----------------+-----------------+-------------------+----------------+---------------------+

select *,cume_dist() over(partition by dept order by age) from employees;
-- Ordered by age: Tom and Chloe have different ages, so their values differ
+-----------------+-----------------+-------------------+----------------+---------------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | cume_dist_window_0  |
+-----------------+-----------------+-------------------+----------------+---------------------+
| Jane            | Marketing       | 29000             | 28             | 0.5                 |
| Jeff            | Marketing       | 35000             | 38             | 1.0                 |
| Paul            | Engineering     | 29000             | 23             | 0.25                |
| Chloe           | Engineering     | 23000             | 25             | 0.5                 |
| Fred            | Engineering     | 21000             | 28             | 0.75                |
| Tom             | Engineering     | 23000             | 33             | 1.0                 |
| Alex            | Sales           | 30000             | 33             | 0.3333333333333333  |
| Lisa            | Sales           | 10000             | 35             | 0.6666666666666666  |
| Evan            | Sales           | 32000             | 38             | 1.0                 |
+-----------------+-----------------+-------------------+----------------+---------------------+

1.6. lag and lead

lag means to trail behind; here it refers to rows earlier in the ordering.

lead means to be ahead; here it refers to rows later in the ordering.

They return, for the current row, the value of a given column taken num rows before/after it in ORDER BY order, without requiring a self-join.

Parameters
LAG/LEAD(col, num, default_value): col is the column to read; num is the row offset, defaulting to 1; default_value is returned when the offset runs past the start or end of the partition, defaulting to NULL.

sql
SELECT *,
    LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS lag,
    LEAD(salary) OVER (PARTITION BY dept ORDER BY salary) AS lead
    FROM employees;
-- In ORDER BY order, lag reads the value one row before the current row and lead reads the value one row after; for the first/last row of the partition the result defaults to NULL
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | NULL   | 35000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | NULL   |
| Fred            | Engineering     | 21000             | 28             | NULL   | 23000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 23000  | 29000  |
| Paul            | Engineering     | 29000             | 23             | 23000  | NULL   |
| Lisa            | Sales           | 10000             | 35             | NULL   | 30000  |
| Alex            | Sales           | 30000             | 33             | 10000  | 32000  |
| Evan            | Sales           | 32000             | 38             | 30000  | NULL   |
+-----------------+-----------------+-------------------+----------------+--------+--------+

SELECT *,
    LAG(salary,2,0) OVER (PARTITION BY dept ORDER BY salary) AS lag,
    LEAD(salary, 1, 0) OVER (PARTITION BY dept ORDER BY salary) AS lead
    FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 0      | 35000  |
| Jeff            | Marketing       | 35000             | 38             | 0      | 0      |
| Fred            | Engineering     | 21000             | 28             | 0      | 23000  |
| Tom             | Engineering     | 23000             | 33             | 0      | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 21000  | 29000  |
| Paul            | Engineering     | 29000             | 23             | 23000  | 0      |
| Lisa            | Sales           | 10000             | 35             | 0      | 30000  |
| Alex            | Sales           | 30000             | 33             | 0      | 32000  |
| Evan            | Sales           | 32000             | 38             | 10000  | 0      |
+-----------------+-----------------+-------------------+----------------+--------+--------+

-- The ordering column and the column being read can differ
SELECT *,
    LAG(salary) OVER (PARTITION BY dept ORDER BY age) AS lag,
    LEAD(salary) OVER (PARTITION BY dept ORDER BY age) AS lead
    FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  |  lag   |  lead  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | NULL   | 35000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | NULL   |
| Paul            | Engineering     | 29000             | 23             | NULL   | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 29000  | 21000  |
| Fred            | Engineering     | 21000             | 28             | 23000  | 23000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | NULL   |
| Alex            | Sales           | 30000             | 33             | NULL   | 10000  |
| Lisa            | Sales           | 10000             | 35             | 30000  | 32000  |
| Evan            | Sales           | 32000             | 38             | 10000  | NULL   |
+-----------------+-----------------+-------------------+----------------+--------+--------+
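
A small usage sketch on the same table: lag makes "difference from the previous row" queries possible without a self-join. Here gap_to_prev (an illustrative alias) is the gap between each employee's salary and the next-lower salary in the same department; the lowest-paid row of each department has no previous row, so its gap is NULL.

sql
-- Salary gap to the previous (next-lower) salary within the department
SELECT name, dept, salary,
    salary - LAG(salary) OVER (PARTITION BY dept ORDER BY salary) AS gap_to_prev
FROM employees;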

1.7. first_value and last_value

Return the first or last value of a given column within the partition, following the ORDER BY result.

Note: by default last_value returns the last value of the frame running from the first row up to the current row (and its ORDER BY peers), not the last value of the whole sorted partition.

sql
SELECT *,
    first_value(salary) OVER (PARTITION BY dept ORDER BY age) AS first,
    last_value(salary) OVER (PARTITION BY dept ORDER BY age) AS last
    FROM employees;
-- Note the values in the last column
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first  |  last  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 29000  | 29000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | 35000  |
| Paul            | Engineering     | 29000             | 23             | 29000  | 29000  |
| Chloe           | Engineering     | 23000             | 25             | 29000  | 23000  |
| Fred            | Engineering     | 21000             | 28             | 29000  | 21000  |
| Tom             | Engineering     | 23000             | 33             | 29000  | 23000  |
| Alex            | Sales           | 30000             | 33             | 30000  | 30000  |
| Lisa            | Sales           | 10000             | 35             | 30000  | 10000  |
| Evan            | Sales           | 32000             | 38             | 30000  | 32000  |
+-----------------+-----------------+-------------------+----------------+--------+--------+

SELECT *,
    first_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS first,
    last_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS last
    FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first  |  last  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 29000  | 29000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | 35000  |
| Fred            | Engineering     | 21000             | 28             | 21000  | 21000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 21000  | 23000  |
| Paul            | Engineering     | 29000             | 23             | 21000  | 29000  |
| Lisa            | Sales           | 10000             | 35             | 10000  | 10000  |
| Alex            | Sales           | 30000             | 33             | 10000  | 30000  |
| Evan            | Sales           | 32000             | 38             | 10000  | 32000  |
+-----------------+-----------------+-------------------+----------------+--------+--------+

1.8. nth_value

nth means the N-th.

Within the window, it returns the value of the given column at the N-th row of the ORDER BY result.

Note: Hive does not provide this function.

sql
SELECT *,nth_value(salary,2) OVER (PARTITION BY dept ORDER BY salary) AS nth FROM employees;
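
Since Hive lacks nth_value, a rough substitute is sketched below for "the second-lowest salary in each department". Note the caveat: this reproduces nth_value only with an unbounded frame (whole-partition semantics), not the default running frame; the aliases e, r, s2 and nth are illustrative.

sql
-- Emulate nth_value(salary, 2) over the whole partition:
-- number the rows per department, keep row 2, and join it back
SELECT e.*, s2.salary AS nth
FROM employees e
LEFT JOIN (
    SELECT dept, salary
    FROM (
        SELECT dept, salary,
            ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary) AS rn
        FROM employees
    ) r
    WHERE rn = 2
) s2 ON e.dept = s2.dept;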

2. The OVER Clause

  • In Hive, the OVER clause may contain only a PARTITION BY and no ORDER BY. When ORDER BY is omitted, the ordering defaults to the columns given in PARTITION BY.
  • In Spark, omitting ORDER BY makes the query fail: Error in query: Window function row_number() requires window to be ordered, please add ORDER BY clause
sql
select *, row_number() over(partition by name) as rn from employees;

-- The result shows the output ordered by the lexicographic order of name
+-----------------+-----------------+-------------------+----------------+-----+
| employees.name  | employees.dept  | employees.salary  | employees.age  | rn  |
+-----------------+-----------------+-------------------+----------------+-----+
| Alex            | Sales           | 30000             | 33             | 1   |
| Evan            | Sales           | 32000             | 38             | 1   |
| Fred            | Engineering     | 21000             | 28             | 1   |
| Lisa            | Sales           | 10000             | 35             | 1   |
| Tom             | Engineering     | 23000             | 33             | 1   |
| Chloe           | Engineering     | 23000             | 25             | 1   |
| Jane            | Marketing       | 29000             | 28             | 1   |
| Jeff            | Marketing       | 35000             | 38             | 1   |
| Paul            | Engineering     | 29000             | 23             | 1   |
+-----------------+-----------------+-------------------+----------------+-----+

2.1. window specification

The OVER clause may carry a window specification, in the following forms:

sql
Format:
(ROWS | RANGE) BETWEEN xxx AND xxx
Three concrete forms are supported:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
  • ROWS: within the partition, the frame is built row by row over the ORDER BY result.
  • RANGE: within the partition, the frame is built over ranges of ORDER BY values; e.g. in the ordering 1,2,3,3,4,4,5 the pairs 3,3 and 4,4 are peers with equal sort values, so each pair is treated as one unit and its rows get the same window result.
  • UNBOUNDED PRECEDING: no bound before the current row; the frame starts at the first row of the partition.
  • num PRECEDING: the frame starts num rows before the current row; e.g. for SUM, each row's result is the sum from num rows back up to the current row.
  • CURRENT ROW: the row currently being evaluated (the cursor while processing the data).
  • UNBOUNDED FOLLOWING: no bound after the current row; the frame extends to the last row of the partition.
  • num FOLLOWING: the frame ends num rows after the current row; e.g. for SUM, each row's result includes up to num rows ahead (a moving-window sketch follows the note below).

Note: in Hive the following functions do not accept a window specification:
rank, dense_rank, percent_rank, cume_dist, ntile, lead, lag
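
As referenced in the list above, here is a sketch of a numeric frame: a moving sum over the previous, current and next row (ordered by salary) within each department. The alias moving_sum is illustrative.

sql
-- Moving window of 3 rows: 1 row before the current row up to 1 row after it
SELECT *,
    SUM(salary) OVER (
        PARTITION BY dept ORDER BY salary
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS moving_sum
FROM employees;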

2.1.1. Default frame with an explicit ORDER BY

When ORDER BY is specified but no window specification is given, the default window specification is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

That is, the frame is range-based over the ORDER BY result, running from the first row of the partition up to the current row.

sql
select *,sum(salary) over(partition by dept order by salary) as sum_salary from employees;
select *,sum(salary) over(partition by dept order by salary range between unbounded preceding and current row) as sum_salary from employees;

-- Both queries return the same result
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Jane            | Marketing       | 29000             | 28             | 29000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Fred            | Engineering     | 21000             | 28             | 21000       |
| Tom             | Engineering     | 23000             | 33             | 67000       |
| Chloe           | Engineering     | 23000             | 25             | 67000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Lisa            | Sales           | 10000             | 35             | 10000       |
| Alex            | Sales           | 30000             | 33             | 40000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+

Note: Tom and Chloe in the Engineering department get the same sum_salary. Within the Engineering partition their ORDER BY salary values are equal, so under the RANGE strategy they are treated as one unit (46000 together), and both rows get the running total 67000 (21000 + 46000) rather than 44000 and 67000.

To accumulate row by row within the partition, explicitly specify
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Tom and Chloe in Engineering then get sum_salary values of 44000 and 67000 respectively.

sql
select *,sum(salary) over(partition by dept order by salary rows between unbounded preceding and current row) as sum_salary from employees;

-- Result:
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Jane            | Marketing       | 29000             | 28             | 29000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Fred            | Engineering     | 21000             | 28             | 21000       |
| Tom             | Engineering     | 23000             | 33             | 44000       |
| Chloe           | Engineering     | 23000             | 25             | 67000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Lisa            | Sales           | 10000             | 35             | 10000       |
| Alex            | Sales           | 30000             | 33             | 40000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+

To sum all values within the same partition, explicitly specify
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

sql
select *,sum(salary) over(partition by dept order by salary range between unbounded preceding and unbounded following) as sum_salary from employees;

-- Result:
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Jane            | Marketing       | 29000             | 28             | 64000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Fred            | Engineering     | 21000             | 28             | 96000       |
| Tom             | Engineering     | 23000             | 33             | 96000       |
| Chloe           | Engineering     | 23000             | 25             | 96000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Lisa            | Sales           | 10000             | 35             | 72000       |
| Alex            | Sales           | 30000             | 33             | 72000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+

Conclusions:

  • Case 1: with both PARTITION BY and ORDER BY present, to make MAX | MIN | COUNT | SUM | AVG and similar aggregates cover the whole partition, explicitly specify RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
  • Case 2: to accumulate row by row within the partition, explicitly specify ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
  • Case 3: by default (no explicit window specification), rows with equal ORDER BY values get the same result.

2.1.2. Default frame without ORDER BY

When both ORDER BY and the window specification are missing from the OVER clause, the default window specification is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Conclusion

  • For MAX | MIN | COUNT | SUM | AVG and similar aggregates, when ORDER BY is missing (or orders by the same columns as PARTITION BY), the effect is the same as Case 1 above.
sql
select *,sum(salary) over(partition by dept) as sum_salary from employees;

-- Result:
+-----------------+-----------------+-------------------+----------------+-------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | sum_salary  |
+-----------------+-----------------+-------------------+----------------+-------------+
| Fred            | Engineering     | 21000             | 28             | 96000       |
| Tom             | Engineering     | 23000             | 33             | 96000       |
| Paul            | Engineering     | 29000             | 23             | 96000       |
| Chloe           | Engineering     | 23000             | 25             | 96000       |
| Jane            | Marketing       | 29000             | 28             | 64000       |
| Jeff            | Marketing       | 35000             | 38             | 64000       |
| Lisa            | Sales           | 10000             | 35             | 72000       |
| Evan            | Sales           | 32000             | 38             | 72000       |
| Alex            | Sales           | 30000             | 33             | 72000       |
+-----------------+-----------------+-------------------+----------------+-------------+

In the result, rows within the same dept are not processed row by row; all rows of the dept are computed as one unit. In other words, UNBOUNDED FOLLOWING means the frame is not cut off at the current row; the computation covers the entire partition.

Question: what does ROWS contribute in ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING? With both ends unbounded the frame always covers the whole partition, so ROWS and RANGE behave identically here.

3. Notes on last_value

When last_value was introduced above, we stressed that it does not return the last value of the partition. With the window specification rules in mind, let's look at its result again.

sql
SELECT *,
    first_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS first,
    last_value(salary) OVER (PARTITION BY dept ORDER BY salary) AS last
    FROM employees;
+-----------------+-----------------+-------------------+----------------+--------+--------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first  |  last  |
+-----------------+-----------------+-------------------+----------------+--------+--------+
| Jane            | Marketing       | 29000             | 28             | 29000  | 29000  |
| Jeff            | Marketing       | 35000             | 38             | 29000  | 35000  |
| Fred            | Engineering     | 21000             | 28             | 21000  | 21000  |
| Tom             | Engineering     | 23000             | 33             | 21000  | 23000  |
| Chloe           | Engineering     | 23000             | 25             | 21000  | 23000  |
| Paul            | Engineering     | 29000             | 23             | 21000  | 29000  |
| Lisa            | Sales           | 10000             | 35             | 10000  | 10000  |
| Alex            | Sales           | 30000             | 33             | 10000  | 30000  |
| Evan            | Sales           | 32000             | 38             | 10000  | 32000  |
+-----------------+-----------------+-------------------+----------------+--------+--------+

In the result, last_value returns the current row's value for every row rather than the last value of the partition in ascending salary order. This is caused by the default window specification (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

To get the last value of the whole partition in ascending salary order (here the maximum), explicitly set the window specification to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

sql
select *
,first_value(salary) over(partition by dept order by salary) as first_salary
,last_value(salary) over(partition by dept order by salary range between unbounded preceding and unbounded following) as last_salary from employees;

-- Result:
+-----------------+-----------------+-------------------+----------------+---------------+--------------+
| employees.name  | employees.dept  | employees.salary  | employees.age  | first_salary  | last_salary  |
+-----------------+-----------------+-------------------+----------------+---------------+--------------+
| Jane            | Marketing       | 29000             | 28             | 29000         | 35000        |
| Jeff            | Marketing       | 35000             | 38             | 29000         | 35000        |
| Fred            | Engineering     | 21000             | 28             | 21000         | 29000        |
| Tom             | Engineering     | 23000             | 33             | 21000         | 29000        |
| Chloe           | Engineering     | 23000             | 25             | 21000         | 29000        |
| Paul            | Engineering     | 29000             | 23             | 21000         | 29000        |
| Lisa            | Sales           | 10000             | 35             | 10000         | 32000        |
| Alex            | Sales           | 30000             | 33             | 10000         | 32000        |
| Evan            | Sales           | 32000             | 38             | 10000         | 32000        |
+-----------------+-----------------+-------------------+----------------+---------------+--------------+
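
When the ordering column and the selected column are the same, as in this example, a frame-free alternative sidesteps the pitfall entirely: MIN/MAX over the partition return the same first_salary/last_salary values. This is only a sketch of an equivalent query for this particular case; it does not generalize to ordering by one column while reading another.

sql
-- Partition-wide minimum and maximum salary, no frame needed
SELECT *,
    MIN(salary) OVER (PARTITION BY dept) AS first_salary,
    MAX(salary) OVER (PARTITION BY dept) AS last_salary
FROM employees;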