Flink SQL Over 聚合详解

Over 聚合定义(⽀持 Batch\Streaming):**特殊的滑动窗⼝聚合函数,拿 Over 聚合 与 窗⼝聚合 做对⽐。

窗⼝聚合:不在 group by 中的字段,不能直接在 select 中拿到

Over 聚合:能够保留原始字段

注意: ⽣产环境中,Over 聚合的使⽤场景较少。

**应⽤场景:**计算最近⼀段滑动窗⼝的聚合结果数据。

**实际案例:**查询每个产品最近⼀⼩时订单的⾦额总和:

复制代码
SELECT order_id,
	order_time,
  amount,
 	SUM(amount) OVER (
 		PARTITION BY product
 		ORDER BY order_time
 		RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
 ) AS one_hour_prod_amount_sum
FROM Orders

Over 聚合语法如下:

复制代码
SELECT
 agg_func(agg_col) OVER (
 [PARTITION BY col1[, col2, ...]]
 ORDER BY time_col
 range_definition),
 ...
FROM ...

ORDER BY:必须是时间戳列(事件时间、处理时间);

PARTITION BY:标识了聚合窗⼝的聚合粒度,如上述案例是按照 product 进⾏聚合;

range_definition:标识聚合窗⼝的聚合数据范围,在 Flink 中有两种指定数据范围的⽅式。第⼀种为 按照⾏数聚合 ,第⼆种为 按照时间区间聚合 。

1.时间区间聚合

**案例:**输出一个产品最近⼀⼩时数据的 amount 之和。

结果就是最近⼀⼩时数据的 amount 之和。

复制代码
CREATE TABLE source_table (
 order_id BIGINT,
 product BIGINT,
 amount BIGINT,
 order_time as cast(CURRENT_TIMESTAMP as TIMESTAMP(3)),
 WATERMARK FOR order_time AS order_time - INTERVAL '0.001' SECOND
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.order_id.min' = '1',
 'fields.order_id.max' = '2',
 'fields.amount.min' = '1',
 'fields.amount.max' = '10',
 'fields.product.min' = '1',
 'fields.product.max' = '2'
);

CREATE TABLE sink_table (
 product BIGINT,
 order_time TIMESTAMP(3),
 amount BIGINT,
 one_hour_prod_amount_sum BIGINT
) WITH (
 'connector' = 'print'
);

INSERT INTO sink_table
SELECT product,
	order_time,
  amount,
 SUM(amount) OVER (
 	PARTITION BY product
 	ORDER BY order_time
 	-- 标识统计范围是⼀个 product 的最近 1 ⼩时的数据
 	RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
 ) AS one_hour_prod_amount_sum
FROM source_table

结果如下:

复制代码
+I[2, 2021-12-24T22:08:26.583, 7, 73]
+I[2, 2021-12-24T22:08:27.583, 7, 80]
+I[2, 2021-12-24T22:08:28.583, 4, 84]
+I[2, 2021-12-24T22:08:29.584, 7, 91]
+I[2, 2021-12-24T22:08:30.583, 8, 99]
+I[1, 2021-12-24T22:08:31.583, 9, 138]
+I[2, 2021-12-24T22:08:32.584, 6, 105]
+I[1, 2021-12-24T22:08:33.584, 7, 145]
2.⾏数聚合

**案例:**输出一个产品最近 5 ⾏数据的 amount 之和。

复制代码
CREATE TABLE source_table (
 order_id BIGINT,
 product BIGINT,
 amount BIGINT,
 order_time as cast(CURRENT_TIMESTAMP as TIMESTAMP(3)),
 WATERMARK FOR order_time AS order_time - INTERVAL '0.001' SECOND
) WITH (
 'connector' = 'datagen',
 'rows-per-second' = '1',
 'fields.order_id.min' = '1',
 'fields.order_id.max' = '2',
 'fields.amount.min' = '1',
 'fields.amount.max' = '2',
 'fields.product.min' = '1',
 'fields.product.max' = '2'
);

CREATE TABLE sink_table (
 product BIGINT,
 order_time TIMESTAMP(3),
 amount BIGINT,
 one_hour_prod_amount_sum BIGINT
) WITH (
 'connector' = 'print'
);

INSERT INTO sink_table
SELECT product,
	order_time,
  amount,
 SUM(amount) OVER (
 PARTITION BY product
 ORDER BY order_time
 -- 标识统计范围是⼀个 product 的最近 5 ⾏数据
 ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
 ) AS one_hour_prod_amount_sum
FROM source_table

结果如下:

复制代码
+I[2, 2021-12-24T22:18:19.147, 1, 9]
+I[1, 2021-12-24T22:18:20.147, 2, 11]
+I[1, 2021-12-24T22:18:21.147, 2, 12]
+I[1, 2021-12-24T22:18:22.147, 2, 12]
+I[1, 2021-12-24T22:18:23.148, 2, 12]
+I[1, 2021-12-24T22:18:24.147, 1, 11]
+I[1, 2021-12-24T22:18:25.146, 1, 10]
+I[1, 2021-12-24T22:18:26.147, 1, 9]
+I[2, 2021-12-24T22:18:27.145, 2, 11]
+I[2, 2021-12-24T22:18:28.148, 1, 10]
+I[2, 2021-12-24T22:18:29.145, 2, 10]

在⼀个 SELECT 中有多个聚合窗⼝,简化写法如下:

复制代码
SELECT order_id,
	order_time,
  amount,
 SUM(amount) OVER w AS sum_amount,
 AVG(amount) OVER w AS avg_amount
FROM Orders
-- 使⽤下⾯⼦句,定义 Over Window
WINDOW w AS (
 PARTITION BY product
 ORDER BY order_time
 RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW)
相关推荐
Edingbrugh.南空2 小时前
Flink自定义函数
大数据·flink
gaosushexiangji3 小时前
利用sCMOS科学相机测量激光散射强度
大数据·人工智能·数码相机·计算机视觉
无级程序员5 小时前
大数据平台之ranger与ldap集成,同步用户和组
大数据·hadoop
lifallen6 小时前
Paimon 原子提交实现
java·大数据·数据结构·数据库·后端·算法
TDengine (老段)7 小时前
TDengine 数据库建模最佳实践
大数据·数据库·物联网·时序数据库·tdengine·涛思数据
张先shen7 小时前
Elasticsearch RESTful API入门:全文搜索实战(Java版)
java·大数据·elasticsearch·搜索引擎·全文检索·restful
Elastic 中国社区官方博客7 小时前
Elasticsearch 字符串包含子字符串:高级查询技巧
大数据·数据库·elasticsearch·搜索引擎·全文检索·lucene
张先shen8 小时前
Elasticsearch RESTful API入门:全文搜索实战
java·大数据·elasticsearch·搜索引擎·全文检索·restful
尽兴-8 小时前
如何将多个.sql文件合并成一个:Windows和Linux/Mac详细指南
linux·数据库·windows·sql·macos
expect7g8 小时前
Flink-Checkpoint-2.OperatorChain
后端·flink