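Hive Window Functions
Hive window functions are a special class of functions that compute a value over a set of rows related to the current row, rather than over a single row. They support aggregation, ranking, and cumulative calculations inside a SQL query without collapsing the detail rows, which makes them useful for data analysis and reporting.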
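Basic components of a window function
- Function: for example `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`, `SUM()`, `AVG()`.
- `OVER` clause: defines the partitioning and range of the window, that is, which rows the function is applied to.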
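Window frame boundary keywords
- `CURRENT ROW`: the current row. It is typically used as the end of the frame.
- `n PRECEDING`: the row n rows before the current row. For example, `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW` covers the previous row and the current row.
- `n FOLLOWING`: the row n rows after the current row. For example, `ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING` covers the current row and the two rows after it.
- `UNBOUNDED`: no limit. It is combined with `PRECEDING` or `FOLLOWING` to mark the start or end of the frame.
- `UNBOUNDED PRECEDING`: the frame starts at the first row of the partition, however many rows that is.
- `UNBOUNDED FOLLOWING`: the frame ends at the last row of the partition, however many rows that is.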
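The boundary keywords above can be read as index arithmetic over the rows of one ordered partition. The following is a minimal Python sketch of that reading, not Hive code: the helper `frame_rows` is hypothetical, and it assumes `None` stands for `UNBOUNDED` and `0` for `CURRENT ROW`.

```python
from typing import Optional


def frame_rows(i: int, n_rows: int,
               preceding: Optional[int], following: Optional[int]) -> range:
    """Row indices covered by ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    Assumptions of this sketch: None means UNBOUNDED, 0 means CURRENT ROW,
    and rows are indexed 0..n_rows-1 in window order.
    """
    start = 0 if preceding is None else max(0, i - preceding)
    end = n_rows - 1 if following is None else min(n_rows - 1, i + following)
    return range(start, end + 1)


# A partition of five rows, looking at the row with index 2.
one_preceding = list(frame_rows(2, 5, 1, 0))         # ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
two_following = list(frame_rows(2, 5, 0, 2))         # ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
from_start = list(frame_rows(2, 5, None, 0))         # ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
whole_partition = list(frame_rows(2, 5, None, None)) # UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING

print(one_preceding)    # [1, 2]
print(two_following)    # [2, 3, 4]
print(from_start)       # [0, 1, 2]
print(whole_partition)  # [0, 1, 2, 3, 4]
```

The frame is clipped at the partition edges, which is why the first row of a partition has no preceding rows to include.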
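Window offset functions
- `LAG(col, n)`: returns the value of `col` from the row n rows before the current row, or NULL when no such row exists. It is useful for comparing the current row with earlier rows.

```sql
SELECT employee_id,
       salary,
       LAG(salary, 1) OVER (ORDER BY employee_id) AS previous_salary
FROM employees;
```

This example shows each employee's salary next to the salary of the previous employee in `employee_id` order.
- `LEAD(col, n)`: returns the value of `col` from the row n rows after the current row, or NULL when no such row exists. It works like `LAG` but looks forward instead of backward.

```sql
SELECT employee_id,
       salary,
       LEAD(salary, 1) OVER (ORDER BY employee_id) AS next_salary
FROM employees;
```

This example shows each employee's salary next to the salary of the next employee in `employee_id` order.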
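To make the offset semantics concrete, here is a small Python sketch of what `LAG` and `LEAD` compute over one ordered list of values. The functions `lag` and `lead` and the salary figures are hypothetical and only illustrate the behaviour; `None` plays the role of SQL NULL.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def lag(values: List[T], n: int = 1) -> List[Optional[T]]:
    """For each row, the value n rows earlier; None where no such row exists."""
    return [values[i - n] if i - n >= 0 else None for i in range(len(values))]


def lead(values: List[T], n: int = 1) -> List[Optional[T]]:
    """For each row, the value n rows later; None where no such row exists."""
    return [values[i + n] if i + n < len(values) else None
            for i in range(len(values))]


# Hypothetical salaries, already sorted by employee_id.
salaries = [3000, 3500, 4200]

previous_salary = lag(salaries, 1)
next_salary = lead(salaries, 1)

print(previous_salary)  # [None, 3000, 3500]
print(next_salary)      # [3500, 4200, None]
```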
Sample dataset
Assume a table named `business` with the following contents:
name | orderdate | cost |
---|---|---|
Alice | 2017-04-01 | 100 |
Bob | 2017-04-05 | 150 |
Alice | 2017-04-10 | 200 |
Charlie | 2017-05-01 | 300 |
Bob | 2017-05-10 | 100 |
Alice | 2017-05-15 | 250 |
Charlie | 2017-06-01 | 400 |
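SQL queries and results
1. Customers who made a purchase in April 2017, and the total count

```sql
SELECT name, COUNT(*) OVER()
FROM business
WHERE SUBSTRING(orderdate, 1, 7) = '2017-04';
```

`COUNT(*) OVER()` counts every row that passes the `WHERE` filter, so a customer with two April orders appears twice and the count is the number of orders (3), not the number of distinct customers (2).
Result: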
name | COUNT(*) |
---|---|
Alice | 3 |
Bob | 3 |
Alice | 3 |
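The difference between counting rows and counting customers can be checked with a short Python sketch over the sample data. This is an emulation for illustration, not Hive code; the tuple layout `(name, orderdate, cost)` is an assumption of the sketch.

```python
# Sample data from the business table: (name, orderdate, cost).
business = [
    ("Alice", "2017-04-01", 100),
    ("Bob", "2017-04-05", 150),
    ("Alice", "2017-04-10", 200),
    ("Charlie", "2017-05-01", 300),
    ("Bob", "2017-05-10", 100),
    ("Alice", "2017-05-15", 250),
    ("Charlie", "2017-06-01", 400),
]

# WHERE SUBSTRING(orderdate, 1, 7) = '2017-04'
april = [row for row in business if row[1][:7] == "2017-04"]

# COUNT(*) OVER(): one count over all filtered rows, repeated on every row.
order_count = len(april)
result = [(name, order_count) for name, _, _ in april]

# Number of distinct customers, which is what a GROUP BY name would feed into.
distinct_customers = len({name for name, _, _ in april})

print(result)              # [('Alice', 3), ('Bob', 3), ('Alice', 3)]
print(distinct_customers)  # 2
```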
2. Purchase details with monthly purchase totals
Purchase details with the overall purchase total:

```sql
SELECT name, orderdate, cost, SUM(cost) OVER()
FROM business;
```

Result:
name | orderdate | cost | SUM(cost) |
---|---|---|---|
Alice | 2017-04-01 | 100 | 1500 |
Bob | 2017-04-05 | 150 | 1500 |
Alice | 2017-04-10 | 200 | 1500 |
Charlie | 2017-05-01 | 300 | 1500 |
Bob | 2017-05-10 | 100 | 1500 |
Alice | 2017-05-15 | 250 | 1500 |
Charlie | 2017-06-01 | 400 | 1500 |
Purchase details with the monthly purchase total:

```sql
SELECT name, orderdate, cost, SUM(cost) OVER(PARTITION BY MONTH(orderdate))
FROM business;
```

Result:
name | orderdate | cost | SUM(cost) |
---|---|---|---|
Alice | 2017-04-01 | 100 | 450 |
Bob | 2017-04-05 | 150 | 450 |
Alice | 2017-04-10 | 200 | 450 |
Charlie | 2017-05-01 | 300 | 650 |
Bob | 2017-05-10 | 100 | 650 |
Alice | 2017-05-15 | 250 | 650 |
Charlie | 2017-06-01 | 400 | 400 |
Purchase details with each customer's purchase total:

```sql
SELECT name, orderdate, cost, SUM(cost) OVER(PARTITION BY name)
FROM business;
```

Result:
name | orderdate | cost | SUM(cost) |
---|---|---|---|
Alice | 2017-04-01 | 100 | 550 |
Bob | 2017-04-05 | 150 | 250 |
Alice | 2017-04-10 | 200 | 550 |
Charlie | 2017-05-01 | 300 | 700 |
Bob | 2017-05-10 | 100 | 250 |
Alice | 2017-05-15 | 250 | 550 |
Charlie | 2017-06-01 | 400 | 700 |
Purchase details with each customer's monthly purchase total:

```sql
SELECT name, orderdate, cost, SUM(cost) OVER(PARTITION BY name, MONTH(orderdate))
FROM business;
```

Result:
name | orderdate | cost | SUM(cost) |
---|---|---|---|
Alice | 2017-04-01 | 100 | 300 |
Bob | 2017-04-05 | 150 | 150 |
Alice | 2017-04-10 | 200 | 300 |
Charlie | 2017-05-01 | 300 | 300 |
Bob | 2017-05-10 | 100 | 100 |
Alice | 2017-05-15 | 250 | 250 |
Charlie | 2017-06-01 | 400 | 400 |
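The four `SUM(cost) OVER(...)` variants in this section differ only in the partition key. A minimal Python sketch, assuming the sample rows as `(name, orderdate, cost)` tuples and a hypothetical helper `sum_over_partition`, reproduces the totals so they can be checked by hand:

```python
from collections import defaultdict

# Sample data from the business table: (name, orderdate, cost).
business = [
    ("Alice", "2017-04-01", 100),
    ("Bob", "2017-04-05", 150),
    ("Alice", "2017-04-10", 200),
    ("Charlie", "2017-05-01", 300),
    ("Bob", "2017-05-10", 100),
    ("Alice", "2017-05-15", 250),
    ("Charlie", "2017-06-01", 400),
]


def sum_over_partition(rows, key):
    """Emulate SUM(cost) OVER(PARTITION BY key): one total per partition,
    repeated on every row of that partition, in the original row order."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return [totals[key(row)] for row in rows]


month = lambda r: int(r[1][5:7])  # stands in for MONTH(orderdate)

overall = sum_over_partition(business, key=lambda r: None)        # OVER()
by_month = sum_over_partition(business, key=month)                # PARTITION BY MONTH(orderdate)
by_name = sum_over_partition(business, key=lambda r: r[0])        # PARTITION BY name
by_name_month = sum_over_partition(business, key=lambda r: (r[0], month(r)))

print(overall)        # [1500, 1500, 1500, 1500, 1500, 1500, 1500]
print(by_month)       # [450, 450, 450, 650, 650, 650, 400]
print(by_name)        # [550, 250, 550, 700, 250, 550, 700]
print(by_name_month)  # [300, 150, 300, 300, 100, 250, 400]
```

Because `MONTH(orderdate)` yields only the month number, rows from the same month of different years would share a partition; the sample data covers a single year, so this does not affect the results here.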
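3. Running totals by date
Accumulate each customer's purchase total in date order. Method 1 (`ORDER BY` with no explicit frame):

```sql
SELECT name, orderdate, cost,
       SUM(cost) OVER(PARTITION BY name ORDER BY orderdate)
FROM business;
```

Method 2 (explicit frame from the start of the partition to the current row):

```sql
SELECT name, orderdate, cost,
       SUM(cost) OVER(PARTITION BY name ORDER BY orderdate
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sample4
FROM business;
```

When `ORDER BY` is given without a frame, Hive defaults to `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. The two methods return the same values here because no customer has two orders on the same date.
Result: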
name | orderdate | cost | SUM(cost) |
---|---|---|---|
Alice | 2017-04-01 | 100 | 100 |
Alice | 2017-04-10 | 200 | 300 |
Alice | 2017-05-15 | 250 | 550 |
Bob | 2017-04-05 | 150 | 150 |
Bob | 2017-05-10 | 100 | 250 |
Charlie | 2017-05-01 | 300 | 300 |
Charlie | 2017-06-01 | 400 | 700 |
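An `ORDER BY` without an explicit frame uses a `RANGE` frame up to the current row, while method 2 uses a `ROWS` frame. The two differ only when several rows share the same sort key. The Python sketch below emulates both for a single partition; the function names and the `orders_with_tie` rows are hypothetical and exist only to show the difference.

```python
def running_sum_rows(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    rows: (orderdate, cost) pairs of one partition, already sorted by date."""
    out, total = [], 0
    for _, cost in rows:
        total += cost
        out.append(total)
    return out


def running_sum_range(rows):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
    Includes every row whose date is <= the current row's date, so rows
    with the same date (peers) all receive the same total."""
    return [sum(c for d, c in rows if d <= date) for date, _ in rows]


# Alice's rows from the sample table: no duplicate dates.
alice = [("2017-04-01", 100), ("2017-04-10", 200), ("2017-05-15", 250)]
alice_rows = running_sum_rows(alice)
alice_range = running_sum_range(alice)

# Hypothetical partition with two orders on the same date.
orders_with_tie = [("2017-04-01", 100), ("2017-04-10", 200), ("2017-04-10", 50)]
tie_rows = running_sum_rows(orders_with_tie)
tie_range = running_sum_range(orders_with_tie)

print(alice_rows, alice_range)  # [100, 300, 550] [100, 300, 550]
print(tie_rows)                 # [100, 300, 350]
print(tie_range)                # [100, 350, 350]
```

Spelling out the `ROWS` frame is the safer choice when a strictly row-by-row running total is wanted and duplicate sort keys are possible.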
Aggregating the current row and the previous row:

```sql
SELECT name, orderdate, cost,
       SUM(cost) OVER(PARTITION BY name ORDER BY orderdate
                      ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS sample5
FROM business;
```

Result:
name | orderdate | cost | sample5 |
---|---|---|---|
Alice | 2017-04-01 | 100 | 100 |
Alice | 2017-04-10 | 200 | 300 |
Alice | 2017-05-15 | 250 | 450 |
Bob | 2017-04-05 | 150 | 150 |
Bob | 2017-05-10 | 100 | 250 |
Charlie | 2017-05-01 | 300 | 300 |
Charlie | 2017-06-01 | 400 | 700 |
Aggregating the current row with one row before and one row after:

```sql
SELECT name, orderdate, cost,
       SUM(cost) OVER(PARTITION BY name ORDER BY orderdate
                      ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sample6
FROM business;
```

Result:
name | orderdate | cost | sample6 |
---|---|---|---|
Alice | 2017-04-01 | 100 | 300 |
Alice | 2017-04-10 | 200 | 550 |
Alice | 2017-05-15 | 250 | 450 |
Bob | 2017-04-05 | 150 | 250 |
Bob | 2017-05-10 | 100 | 250 |
Charlie | 2017-05-01 | 300 | 700 |
Charlie | 2017-06-01 | 400 | 700 |
Aggregating the current row and all following rows:

```sql
SELECT name, orderdate, cost,
       SUM(cost) OVER(PARTITION BY name ORDER BY orderdate
                      ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sample7
FROM business;
```

Result:
name | orderdate | cost | sample7 |
---|---|---|---|
Alice | 2017-04-01 | 100 | 550 |
Alice | 2017-04-10 | 200 | 450 |
Alice | 2017-05-15 | 250 | 250 |
Bob | 2017-04-05 | 150 | 250 |
Bob | 2017-05-10 | 100 | 100 |
Charlie | 2017-05-01 | 300 | 700 |
Charlie | 2017-06-01 | 400 | 400 |
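The three sliding-frame queries (`sample5`, `sample6`, `sample7`) can be cross-checked with a short Python emulation over the sample data. This is a sketch under stated assumptions rather than Hive code: the helpers `frame_sum` and `windowed` are hypothetical, `None` stands for `UNBOUNDED`, and `0` stands for `CURRENT ROW`.

```python
from itertools import groupby

# Sample data from the business table: (name, orderdate, cost).
business = [
    ("Alice", "2017-04-01", 100),
    ("Bob", "2017-04-05", 150),
    ("Alice", "2017-04-10", 200),
    ("Charlie", "2017-05-01", 300),
    ("Bob", "2017-05-10", 100),
    ("Alice", "2017-05-15", 250),
    ("Charlie", "2017-06-01", 400),
]


def frame_sum(costs, i, preceding, following):
    """Sum of costs inside a ROWS frame around index i, clipped to the partition."""
    start = 0 if preceding is None else max(0, i - preceding)
    end = len(costs) if following is None else min(len(costs), i + following + 1)
    return sum(costs[start:end])


def windowed(rows, preceding, following):
    """Emulate SUM(cost) OVER(PARTITION BY name ORDER BY orderdate ROWS ...)."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    sums = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        costs = [r[2] for r in group]
        sums.extend(frame_sum(costs, i, preceding, following)
                    for i in range(len(costs)))
    return sums


sample5 = windowed(business, 1, 0)     # 1 PRECEDING AND CURRENT ROW
sample6 = windowed(business, 1, 1)     # 1 PRECEDING AND 1 FOLLOWING
sample7 = windowed(business, 0, None)  # CURRENT ROW AND UNBOUNDED FOLLOWING

print(sample5)  # [100, 300, 450, 150, 250, 300, 700]
print(sample6)  # [300, 550, 450, 250, 250, 700, 700]
print(sample7)  # [550, 450, 250, 250, 100, 700, 400]
```

The values are listed in the same order as the result tables: Alice's three rows, then Bob's two, then Charlie's two.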
4. Each customer's previous purchase date
Using the `LAG` function:

```sql
SELECT name, orderdate, cost,
       LAG(orderdate, 1) OVER(PARTITION BY name ORDER BY orderdate) AS last_purchase_date
FROM business;
```

Result:
name | orderdate | cost | last_purchase_date |
---|---|---|---|
Alice | 2017-04-01 | 100 | NULL |
Alice | 2017-04-10 | 200 | 2017-04-01 |
Alice | 2017-05-15 | 250 | 2017-04-10 |
Bob | 2017-04-05 | 150 | NULL |
Bob | 2017-05-10 | 100 | 2017-04-05 |
Charlie | 2017-05-01 | 300 | NULL |
Charlie | 2017-06-01 | 400 | 2017-05-01 |