Flink SQL: How an Order-Preserving Task Affects Join SQL

​ This article uses a worked example to show how Flink SQL can place an order-preserving task in front of an out-of-order upstream source, so that downstream results stay correct. Without further ado, we use a trading-data scenario.

  • Table schemas:

    -- order table schema
    create table tbl_order_source(
        order_id            int             comment 'order ID',
        shop_id             int             comment 'shop ID',
        user_id             int             comment 'user ID',
        original_price      double          comment 'original transaction amount',
        create_time         timestamp(3)    comment 'creation time: yyyy-MM-dd HH:mm:ss',
        watermark for create_time as create_time - interval '0' second
    )with(
        'connector' = 'kafka',
        'topic' = 'tbl_order_source',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'testGroup',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json',
        'json.fail-on-missing-field' = 'false',
        'json.ignore-parse-errors' = 'true'
    );
    
    -- payment table schema
    create table tbl_order_payment_source(
        order_id            int             comment 'order ID',
        payment_amount      double          comment 'payment amount',
        create_time         timestamp(3)    comment 'creation time: yyyy-MM-dd HH:mm:ss',
        watermark for create_time as create_time - interval '0' second
    )with(
         'connector' = 'kafka',
         'topic' = 'tbl_order_payment_source',
         'properties.bootstrap.servers' = 'localhost:9092',
         'properties.group.id' = 'testGroup',
         'scan.startup.mode' = 'latest-offset',
         'format' = 'json',
         'json.fail-on-missing-field' = 'false',
         'json.ignore-parse-errors' = 'true'
     );
  • The out-of-order source data:

    • Order table, out-of-order records

      {"order_id":"1","shop_id":"1","user_id":"1","original_price":"1","create_time":"2024-01-01 20:05:00"}
      {"order_id":"2","shop_id":"1","user_id":"2","original_price":"2","create_time":"2024-01-01 20:04:00"}
      {"order_id":"1","shop_id":"1","user_id":"1","original_price":"3","create_time":"2024-01-01 20:03:00"}
      {"order_id":"3","shop_id":"1","user_id":"3","original_price":"4","create_time":"2024-01-01 20:02:00"}
      {"order_id":"1","shop_id":"1","user_id":"1","original_price":"5","create_time":"2024-01-01 20:04:00"}
    • Payment table, out-of-order records

      {"order_id":"9","payment_amount":"2","create_time":"2024-01-01 20:04:30"}
      {"order_id":"1","payment_amount":"5","create_time":"2024-01-01 20:03:30"}
      {"order_id":"3","payment_amount":"4","create_time":"2024-01-01 20:02:30"}
      {"order_id":"1","payment_amount":"3","create_time":"2024-01-01 20:03:20"}
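To see why these streams are problematic, note that the watermark above is declared with a zero-second delay (`create_time - interval '0' second`), so any record whose timestamp is behind the maximum event time seen so far arrives late. A minimal Python sketch (not Flink code; the event list mirrors the order stream above) makes the lateness visible:

```python
from datetime import datetime

def late_events(events):
    """Return events whose create_time is behind the current watermark."""
    watermark = None
    late = []
    for key, ts in events:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if watermark is not None and t < watermark:
            late.append((key, ts))
        # watermark = max event time seen, minus a 0-second delay
        watermark = t if watermark is None else max(watermark, t)
    return late

# the order stream from above: (order_id, create_time)
order_events = [
    (1, "2024-01-01 20:05:00"),
    (2, "2024-01-01 20:04:00"),
    (1, "2024-01-01 20:03:00"),
    (3, "2024-01-01 20:02:00"),
    (1, "2024-01-01 20:04:00"),
]
print(len(late_events(order_events)))  # 4 -- every event after the first is late
```

The very first record carries the largest timestamp, so all four subsequent records are late with respect to the watermark.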
  • Test SQL

    select
    
        t1.order_id,
        t1.shop_id,
        t1.user_id,
        t1.original_price,
        t2.payment_amount
    
    from ods_order_source t1
    left join ods_order_payment_source t2
         on t1.order_id = t2.order_id
    ;

    From the source data we can work out by hand that the final join result should be:

    {"order_id":"1","shop_id":"1","user_id":"1","original_price":"3","payment_amount":"3","create_time":"2024-01-01 20:03:00"}
    {"order_id":"2","shop_id":"1","user_id":"2","original_price":"2","payment_amount":null,"create_time":"2024-01-01 20:04:00"}
    {"order_id":"3","shop_id":"1","user_id":"3","original_price":"4","payment_amount":"4","create_time":"2024-01-01 20:02:00"}
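This expected result can be reproduced outside Flink: keep the earliest record per order_id from each stream, then do an ordinary left join. A minimal Python sketch (not Flink code) with the sample rows hard-coded:

```python
orders = [  # (order_id, shop_id, user_id, original_price, create_time)
    (1, 1, 1, 1.0, "2024-01-01 20:05:00"),
    (2, 1, 2, 2.0, "2024-01-01 20:04:00"),
    (1, 1, 1, 3.0, "2024-01-01 20:03:00"),
    (3, 1, 3, 4.0, "2024-01-01 20:02:00"),
    (1, 1, 1, 5.0, "2024-01-01 20:04:00"),
]
payments = [  # (order_id, payment_amount, create_time)
    (9, 2.0, "2024-01-01 20:04:30"),
    (1, 5.0, "2024-01-01 20:03:30"),
    (3, 4.0, "2024-01-01 20:02:30"),
    (1, 3.0, "2024-01-01 20:03:20"),
]

def keep_earliest(rows, key_col, time_col):
    """Keep the row with the smallest create_time per key."""
    best = {}
    for row in rows:
        k = row[key_col]
        if k not in best or row[time_col] < best[k][time_col]:
            best[k] = row
    return best

o = keep_earliest(orders, 0, 4)
p = keep_earliest(payments, 0, 2)
# left join on order_id: payment is None when no payment row matches
result = {k: (r[3], p[k][1] if k in p else None) for k, r in o.items()}
print(result)  # {1: (3.0, 3.0), 2: (2.0, None), 3: (4.0, 4.0)}
```

Note that payment order_id 9 has no matching order and drops out of the left join, exactly as in the hand analysis.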
  • First, we design an order-preserving task to consume the out-of-order sources:

    -- intermediate order-preserved result for the order table
    create table ods_order_source(
        order_id            int             comment 'order ID',
        shop_id             int             comment 'shop ID',
        user_id             int             comment 'user ID',
        original_price      double          comment 'order amount',
        create_time         timestamp(3)    comment 'creation time: yyyy-MM-dd HH:mm:ss',
        watermark for create_time as create_time - interval '0' second,
        primary key (order_id) not enforced
    )with(
        'connector' = 'upsert-kafka',
        'topic' = 'ods_order_source',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'key.json.ignore-parse-errors' = 'true',
        'value.format' = 'json',
        'value.json.fail-on-missing-field' = 'false'
    );
    
    -- ETL from the raw order topic into the order-preserved table
    insert into ods_order_source
    select
    	tmp.order_id,
    	tmp.shop_id,
    	tmp.user_id,
    	tmp.original_price,
    	tmp.create_time
    from (
    	select
    	    t.order_id,
    	    t.shop_id,
    	    t.user_id,
    	    t.original_price,
    	    t.create_time,
    		row_number() over (partition by t.order_id order by t.create_time asc) as rn
    	from tbl_order_source t
    ) tmp
    where tmp.rn = 1
    ;
    
    -- intermediate order-preserved result for the payment table
    create table ods_order_payment_source(
        order_id            int             comment 'order ID',
        payment_amount      double          comment 'payment amount',
        create_time         timestamp(3)    comment 'creation time: yyyy-MM-dd HH:mm:ss',
        watermark for create_time as create_time - interval '0' second,
        primary key (order_id) not enforced
    )with(
        'connector' = 'upsert-kafka',
        'topic' = 'ods_order_payment_source',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'key.json.ignore-parse-errors' = 'true',
        'value.format' = 'json',
        'value.json.fail-on-missing-field' = 'false'
    );
    
    -- ETL from the raw payment topic into the order-preserved table
    insert into ods_order_payment_source
    select
    	tmp.order_id,
    	tmp.payment_amount,
    	tmp.create_time
    from (
    	select
    	    t.order_id,
    	    t.payment_amount,
    	    t.create_time,
    		row_number() over (partition by t.order_id order by t.create_time asc) as rn
    	from tbl_order_payment_source t
    ) tmp
    where tmp.rn = 1
    ;
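Conceptually, the `rn = 1` deduplication runs as a stateful streaming operator: per order_id it remembers the current earliest record, and when an even earlier one arrives it retracts (`-D`) the old winner and inserts (`+I`) the new one; later-timestamped records lose `rn = 1` and emit nothing. A rough Python sketch of that behavior (function and variable names are ours, not a Flink API):

```python
def dedup_changelog(rows, key_col, time_col):
    """Emit a +I/-D changelog while keeping the earliest row per key."""
    state, log = {}, []
    for row in rows:
        k = row[key_col]
        if k not in state:
            state[k] = row
            log.append(("+I", row))
        elif row[time_col] < state[k][time_col]:
            log.append(("-D", state[k]))  # retract the old winner
            state[k] = row
            log.append(("+I", row))      # insert the new, earlier one
        # else: later-timestamped record, rn > 1, nothing emitted
    return state, log

# order stream: (order_id, original_price, create_time)
orders = [
    (1, 1.0, "2024-01-01 20:05:00"),
    (2, 2.0, "2024-01-01 20:04:00"),
    (1, 3.0, "2024-01-01 20:03:00"),
    (3, 4.0, "2024-01-01 20:02:00"),
    (1, 5.0, "2024-01-01 20:04:00"),
]
state, log = dedup_changelog(orders, 0, 2)
print(state[1][1])  # 3.0 -- order 1 settles on its earliest record
```

The fifth record (price 5.0 at 20:04) arrives after a 20:03 record is already in state, so it produces no output at all; this is what keeps the downstream join clean.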
  • Now we run the downstream join SQL on top of the order-preserved tables:

    select
    
        t1.order_id,
        t1.shop_id,
        t1.user_id,
        t1.original_price,
        t2.payment_amount
    
    from ods_order_source t1
    left join ods_order_payment_source t2
         on t1.order_id = t2.order_id
    ;

    Test output:

    +----+-------------+-------------+--------------------------------+--------------------------------+
    | op |    order_id |     user_id |                 original_price |                 payment_amount |
    +----+-------------+-------------+--------------------------------+--------------------------------+
    | +I |           1 |           1 |                            1.0 |                         <NULL> |
    | +I |           2 |           2 |                            2.0 |                         <NULL> | -- final row
    | -D |           1 |           1 |                            1.0 |                         <NULL> |
    | +I |           1 |           1 |                            3.0 |                         <NULL> |
    | +I |           3 |           3 |                            4.0 |                         <NULL> |
    | -D |           1 |           1 |                            3.0 |                         <NULL> |
    | +I |           1 |           1 |                            3.0 |                            5.0 |
    | -D |           3 |           3 |                            4.0 |                         <NULL> |
    | +I |           3 |           3 |                            4.0 |                            4.0 | -- final row
    | -U |           1 |           1 |                            3.0 |                            5.0 |
    | +I |           1 |           1 |                            3.0 |                         <NULL> |
    | -D |           1 |           1 |                            3.0 |                         <NULL> |
    | +I |           1 |           1 |                            3.0 |                            3.0 | -- final row

    As the changelog above shows, the three marked rows match our hand analysis, so an order-preserving task is a workable approach for join scenarios: the order-preserving SQL cleans up record order (and dirty data) before the real business logic runs, keeping downstream jobs correct.
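Another way to read that output: folding the changelog into a view keyed by order_id, the way an upsert sink materializes it, leaves exactly the three marked rows. A Python sketch (the changelog list is transcribed from the output above; the fold logic is our illustration, not Flink code):

```python
# (op, order_id, original_price, payment_amount), transcribed from the output
changelog = [
    ("+I", 1, 1.0, None), ("+I", 2, 2.0, None), ("-D", 1, 1.0, None),
    ("+I", 1, 3.0, None), ("+I", 3, 4.0, None), ("-D", 1, 3.0, None),
    ("+I", 1, 3.0, 5.0),  ("-D", 3, 4.0, None), ("+I", 3, 4.0, 4.0),
    ("-U", 1, 3.0, 5.0),  ("+I", 1, 3.0, None), ("-D", 1, 3.0, None),
    ("+I", 1, 3.0, 3.0),
]
view = {}
for op, oid, price, pay in changelog:
    if op in ("+I", "+U"):      # insert / update-after: set the keyed value
        view[oid] = (price, pay)
    else:                        # -D / -U: retract the current value
        view.pop(oid, None)
print(view)  # {2: (2.0, None), 3: (4.0, 4.0), 1: (3.0, 3.0)}
```

Every retraction is eventually followed by a corrected insert, so the materialized view converges to the analyzed result.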

  • For contrast, run the same SQL directly against the raw, un-ordered source data and see what happens:

    select
    
        t1.order_id,
        t1.shop_id,
        t1.user_id,
        t1.original_price,
        t2.payment_amount
    
    from tbl_order_source t1
    left join tbl_order_payment_source t2
         on t1.order_id = t2.order_id
    ;

    Output:

    +----+-------------+-------------+--------------------------------+--------------------------------+
    | op |    order_id |     user_id |                 original_price |                 payment_amount |
    +----+-------------+-------------+--------------------------------+--------------------------------+
    | +I |           1 |           1 |                            1.0 |                         <NULL> |
    | +I |           2 |           2 |                            2.0 |                         <NULL> | -- final row
    | -D |           1 |           1 |                            1.0 |                         <NULL> |
    | +I |           1 |           1 |                            1.0 |                            5.0 |
    | +I |           1 |           1 |                            3.0 |                            5.0 |
    | +I |           3 |           3 |                            4.0 |                         <NULL> |
    | -D |           3 |           3 |                            4.0 |                         <NULL> |
    | +I |           3 |           3 |                            4.0 |                            4.0 | -- final row
    | +I |           1 |           1 |                            3.0 |                            3.0 |
    | +I |           1 |           1 |                            1.0 |                            3.0 |
    | +I |           1 |           1 |                            5.0 |                            5.0 |
    | +I |           1 |           1 |                            5.0 |                            3.0 | -- final row
    

    These final rows differ from the join result we analyzed at the start: running SQL directly on out-of-order upstream data, without an order-preserving step, can introduce subtle errors.
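Folding this changelog per order_id the same way makes the damage concrete: because the raw topics are append-only, the join emits several live rows for order 1, and the last value per key is not the correct one. A Python sketch (changelog transcribed from the output above; fold logic is our illustration):

```python
# (op, order_id, original_price, payment_amount), transcribed from the output
changelog = [
    ("+I", 1, 1.0, None), ("+I", 2, 2.0, None), ("-D", 1, 1.0, None),
    ("+I", 1, 1.0, 5.0),  ("+I", 1, 3.0, 5.0),  ("+I", 3, 4.0, None),
    ("-D", 3, 4.0, None), ("+I", 3, 4.0, 4.0),  ("+I", 1, 3.0, 3.0),
    ("+I", 1, 1.0, 3.0),  ("+I", 1, 5.0, 5.0),  ("+I", 1, 5.0, 3.0),
]
view = {}
for op, oid, price, pay in changelog:
    if op == "+I":
        view[oid] = (price, pay)  # last write per key wins
    else:
        view.pop(oid, None)
print(view[1])  # (5.0, 3.0) -- not the expected (3.0, 3.0)
```

Orders 2 and 3 still come out right by luck of arrival order, but order 1 ends on (5.0, 3.0) because nothing ever retracted the stale join rows.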

​ Based on these test results, giving the join SQL's sink a primary key on order_id (the join key), so the sink materializes the changelog as an upsert, guarantees the correctness of the downstream data.
