This article walks through an example of how to design an order-preserving ("ordering") task in Flink SQL for an out-of-order upstream data source, so that downstream results stay correct. Without further ado, let's use a transaction-data scenario as the example.
The table schemas are:
```sql
-- Order table
create table tbl_order_source(
    order_id int comment 'order ID',
    shop_id int comment 'shop ID',
    user_id int comment 'user ID',
    original_price double comment 'original transaction amount',
    create_time timestamp(3) comment 'creation time: yyyy-MM-dd HH:mm:ss',
    watermark for create_time as create_time - interval '0' second
)with(
    'connector' = 'kafka',
    'topic' = 'tbl_order_source',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'testGroup',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json',
    'json.fail-on-missing-field' = 'false',
    'json.ignore-parse-errors' = 'true'
);

-- Payment table
create table tbl_order_payment_source(
    order_id int comment 'order ID',
    payment_amount double comment 'payment amount',
    create_time timestamp(3) comment 'creation time: yyyy-MM-dd HH:mm:ss',
    watermark for create_time as create_time - interval '0' second
)with(
    'connector' = 'kafka',
    'topic' = 'tbl_order_payment_source',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'testGroup',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json',
    'json.fail-on-missing-field' = 'false',
    'json.ignore-parse-errors' = 'true'
);
```
The out-of-order source data is as follows:
Out-of-order order-table data:
```json
{"order_id":"1","shop_id":"1","user_id":"1","original_price":"1","create_time":"2024-01-01 20:05:00"}
{"order_id":"2","shop_id":"1","user_id":"2","original_price":"2","create_time":"2024-01-01 20:04:00"}
{"order_id":"1","shop_id":"1","user_id":"1","original_price":"3","create_time":"2024-01-01 20:03:00"}
{"order_id":"3","shop_id":"1","user_id":"3","original_price":"4","create_time":"2024-01-01 20:02:00"}
{"order_id":"1","shop_id":"1","user_id":"1","original_price":"5","create_time":"2024-01-01 20:04:00"}
```
Out-of-order payment-table data:
```json
{"order_id":"9","payment_amount":"2","create_time":"2024-01-01 20:04:30"}
{"order_id":"1","payment_amount":"5","create_time":"2024-01-01 20:03:30"}
{"order_id":"3","payment_amount":"4","create_time":"2024-01-01 20:02:30"}
{"order_id":"1","payment_amount":"3","create_time":"2024-01-01 20:03:20"}
```
Test SQL:
```sql
select t1.order_id,
       t1.shop_id,
       t1.user_id,
       t1.original_price,
       t2.payment_amount
from ods_order_source t1
left join ods_order_payment_source t2
       on t1.order_id = t2.order_id;
```
Analyzing the source data, the final join result should be:
```json
{"order_id":"1","shop_id":"1","user_id":"1","original_price":"3","payment_amount":"3","create_time":"2024-01-01 20:03:00"}
{"order_id":"2","shop_id":"1","user_id":"2","original_price":"2","payment_amount":null,"create_time":"2024-01-01 20:04:00"}
{"order_id":"3","shop_id":"1","user_id":"3","original_price":"4","payment_amount":"4","create_time":"2024-01-01 20:02:00"}
```
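This expected result can be checked mechanically: in each stream, keep only the earliest record per order_id, then left join on order_id. The following Python sketch is not part of the pipeline; it merely simulates the intended semantics on the sample data above:

```python
# Simulate: keep the earliest record per order_id in each stream
# (by create_time), then left join on order_id.
orders = [  # (order_id, shop_id, user_id, original_price, create_time)
    (1, 1, 1, 1.0, "2024-01-01 20:05:00"),
    (2, 1, 2, 2.0, "2024-01-01 20:04:00"),
    (1, 1, 1, 3.0, "2024-01-01 20:03:00"),
    (3, 1, 3, 4.0, "2024-01-01 20:02:00"),
    (1, 1, 1, 5.0, "2024-01-01 20:04:00"),
]
payments = [  # (order_id, payment_amount, create_time)
    (9, 2.0, "2024-01-01 20:04:30"),
    (1, 5.0, "2024-01-01 20:03:30"),
    (3, 4.0, "2024-01-01 20:02:30"),
    (1, 3.0, "2024-01-01 20:03:20"),
]

def keep_earliest(rows, key_idx=0, time_idx=-1):
    """row_number() over (partition by key order by create_time asc) = 1"""
    best = {}
    for row in rows:
        k = row[key_idx]
        if k not in best or row[time_idx] < best[k][time_idx]:
            best[k] = row
    return best

first_orders = keep_earliest(orders)
first_payments = keep_earliest(payments)

# Left join: payment_amount is None when no payment exists for the order.
result = {
    oid: (row[3], first_payments[oid][1] if oid in first_payments else None)
    for oid, row in first_orders.items()
}
print(result)  # {1: (3.0, 3.0), 2: (2.0, None), 3: (4.0, 4.0)}
```

The timestamps compare correctly as strings because they share a fixed `yyyy-MM-dd HH:mm:ss` format.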
First, we design an ordering task that consumes the out-of-order sources, as follows:
```sql
-- Ordered intermediate result for the order table
create table ods_order_source(
    order_id int comment 'order ID',
    shop_id int comment 'shop ID',
    user_id int comment 'user ID',
    original_price double comment 'order amount',
    create_time timestamp(3) comment 'creation time: yyyy-MM-dd HH:mm:ss',
    watermark for create_time as create_time - interval '0' second,
    primary key (order_id) not enforced
)with(
    'connector' = 'upsert-kafka',
    'topic' = 'ods_order_source',
    'properties.bootstrap.servers' = 'localhost:9092',
    'key.format' = 'json',
    'key.json.ignore-parse-errors' = 'true',
    'value.format' = 'json',
    'value.json.fail-on-missing-field' = 'false'
);

-- ETL from the raw order source to the ordered result
insert into ods_order_source
select tmp.order_id,
       tmp.shop_id,
       tmp.user_id,
       tmp.original_price,
       tmp.create_time
from (
    select t.order_id,
           t.shop_id,
           t.user_id,
           t.original_price,
           t.create_time,
           row_number() over(partition by t.order_id order by t.create_time asc) as rn
    from tbl_order_source t
) tmp
where tmp.rn = 1;

-- Ordered intermediate result for the payment table
create table ods_order_payment_source(
    order_id int comment 'order ID',
    payment_amount double comment 'payment amount',
    create_time timestamp(3) comment 'creation time: yyyy-MM-dd HH:mm:ss',
    watermark for create_time as create_time - interval '0' second,
    primary key (order_id) not enforced
)with(
    'connector' = 'upsert-kafka',
    'topic' = 'ods_order_payment_source',
    'properties.bootstrap.servers' = 'localhost:9092',
    'key.format' = 'json',
    'key.json.ignore-parse-errors' = 'true',
    'value.format' = 'json',
    'value.json.fail-on-missing-field' = 'false'
);

-- Ordering ETL for the payment table
insert into ods_order_payment_source
select tmp.order_id,
       tmp.payment_amount,
       tmp.create_time
from (
    select t.order_id,
           t.payment_amount,
           t.create_time,
           row_number() over(partition by t.order_id order by t.create_time asc) as rn
    from tbl_order_payment_source t
) tmp
where tmp.rn = 1;
```
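Note that on an unbounded stream, `where rn = 1` is not a simple filter: when a record with an earlier create_time arrives for an order_id that already has a winner, Flink retracts the previous row and emits the new one (the deduplication operator emits update-before/update-after pairs; read back through upsert-kafka, the downstream sees delete/insert pairs). A small Python sketch of that changelog, using a simplified three-field row purely for illustration:

```python
# Sketch of the changelog produced by a keep-first deduplication
# (row_number() over event time, rn = 1) consuming rows in arrival order.
def dedup_keep_first_changelog(rows, key_idx=0, time_idx=-1):
    best, changelog = {}, []
    for row in rows:
        k = row[key_idx]
        if k not in best:
            best[k] = row
            changelog.append(("+I", row))          # first row for this key
        elif row[time_idx] < best[k][time_idx]:
            changelog.append(("-D", best[k]))      # retract the old winner
            best[k] = row
            changelog.append(("+I", row))          # emit the earlier row
        # rows not earlier than the current winner are simply dropped
    return changelog

orders = [  # (order_id, original_price, create_time)
    (1, 1.0, "2024-01-01 20:05:00"),
    (2, 2.0, "2024-01-01 20:04:00"),
    (1, 3.0, "2024-01-01 20:03:00"),
    (3, 4.0, "2024-01-01 20:02:00"),
    (1, 5.0, "2024-01-01 20:04:00"),  # later than 20:03:00 -> dropped
]
for op, row in dedup_keep_first_changelog(orders):
    print(op, row)
```

The retract/re-emit pairs for order_id = 1 are exactly what drives the churn in the join output shown in the test below.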
Next, the aggregation SQL over the ordering task's output:
```sql
select t1.order_id,
       t1.shop_id,
       t1.user_id,
       t1.original_price,
       t2.payment_amount
from ods_order_source t1
left join ods_order_payment_source t2
       on t1.order_id = t2.order_id;
```
Test result:
```
+----+-------------+-------------+----------------+----------------+
| op |    order_id |     user_id | original_price | payment_amount |
+----+-------------+-------------+----------------+----------------+
| +I |           1 |           1 |            1.0 |         <NULL> |
| +I |           2 |           2 |            2.0 |         <NULL> | -- final row
| -D |           1 |           1 |            1.0 |         <NULL> |
| +I |           1 |           1 |            3.0 |         <NULL> |
| +I |           3 |           3 |            4.0 |         <NULL> |
| -D |           1 |           1 |            3.0 |         <NULL> |
| +I |           1 |           1 |            3.0 |            5.0 |
| -D |           3 |           3 |            4.0 |         <NULL> |
| +I |           3 |           3 |            4.0 |            4.0 | -- final row
| -U |           1 |           1 |            3.0 |            5.0 |
| +I |           1 |           1 |            3.0 |         <NULL> |
| -D |           1 |           1 |            3.0 |         <NULL> |
| +I |           1 |           1 |            3.0 |            3.0 | -- final row
```
As this output shows, the three marked rows match the result we analyzed above, so the ordering task is a workable approach for join scenarios: the ordering SQL cleans up out-of-order and dirty data before the real business logic runs, keeping downstream jobs correct.
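The retraction stream above can be materialized into per-key final state, the way an upsert sink keyed on order_id would apply it. A small Python sketch; the changelog literal is transcribed from the test output, keyed by order_id:

```python
# Materialize a retraction stream into per-key state: +I/+U set the
# row for a key, -D/-U remove it. The surviving rows are the final result.
def materialize(changelog):
    state = {}
    for op, key, row in changelog:
        if op in ("+I", "+U"):
            state[key] = row
        else:  # "-D" or "-U"
            state.pop(key, None)
    return state

# (op, order_id, (original_price, payment_amount)) from the test output
changelog = [
    ("+I", 1, (1.0, None)),
    ("+I", 2, (2.0, None)),
    ("-D", 1, (1.0, None)),
    ("+I", 1, (3.0, None)),
    ("+I", 3, (4.0, None)),
    ("-D", 1, (3.0, None)),
    ("+I", 1, (3.0, 5.0)),
    ("-D", 3, (4.0, None)),
    ("+I", 3, (4.0, 4.0)),
    ("-U", 1, (3.0, 5.0)),
    ("+I", 1, (3.0, None)),
    ("-D", 1, (3.0, None)),
    ("+I", 1, (3.0, 3.0)),
]
print(materialize(changelog))
```

The materialized state contains exactly the three analyzed rows: order 1 with (3.0, 3.0), order 2 with (2.0, None), order 3 with (4.0, 4.0).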
Now run the same SQL against the raw, unordered source tables and see what happens:
```sql
select t1.order_id,
       t1.shop_id,
       t1.user_id,
       t1.original_price,
       t2.payment_amount
from tbl_order_source t1
left join tbl_order_payment_source t2
       on t1.order_id = t2.order_id;
```
Result:
```
+----+-------------+-------------+----------------+----------------+
| op |    order_id |     user_id | original_price | payment_amount |
+----+-------------+-------------+----------------+----------------+
| +I |           1 |           1 |            1.0 |         <NULL> |
| +I |           2 |           2 |            2.0 |         <NULL> | -- final row
| -D |           1 |           1 |            1.0 |         <NULL> |
| +I |           1 |           1 |            1.0 |            5.0 |
| +I |           1 |           1 |            3.0 |            5.0 |
| +I |           3 |           3 |            4.0 |         <NULL> |
| -D |           3 |           3 |            4.0 |         <NULL> |
| +I |           3 |           3 |            4.0 |            4.0 | -- final row
| +I |           1 |           1 |            3.0 |            3.0 |
| +I |           1 |           1 |            1.0 |            3.0 |
| +I |           1 |           1 |            5.0 |            5.0 |
| +I |           1 |           1 |            5.0 |            3.0 | -- final row
```
The final state differs from the join result we analyzed at the start: running SQL directly on out-of-order upstream data, without an ordering task, can introduce subtle errors.
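To see concretely why this output is wrong, replay just the order_id = 1 rows from the result above with last-write-wins state per key: the surviving row is whichever arrived last, not the event-time-earliest one. A minimal sketch:

```python
# Rows emitted for order_id = 1 by the unordered pipeline, in arrival order.
rows_for_order_1 = [  # (op, original_price, payment_amount)
    ("+I", 1.0, None),
    ("-D", 1.0, None),
    ("+I", 1.0, 5.0),
    ("+I", 3.0, 5.0),
    ("+I", 3.0, 3.0),
    ("+I", 1.0, 3.0),
    ("+I", 5.0, 5.0),
    ("+I", 5.0, 3.0),
]
state = None
for op, price, pay in rows_for_order_1:
    # last write wins per key: inserts overwrite, deletes clear
    state = (price, pay) if op.startswith("+") else None
print(state)  # (5.0, 3.0) -- not the expected (3.0, 3.0)
```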
Given these test results, setting order_id as the key on the aggregation SQL's sink (e.g. an upsert sink) for the ordered pipeline collapses the intermediate retractions, so that downstream consumers see only the correct final rows.