FlinkSQL: Analyzing the Impact of an Order-Preserving Job on Aggregate SQL

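This article walks through an example of designing an order-preserving job in FlinkSQL for an out-of-order upstream source, so that downstream results stay accurate. The example uses order-transaction data.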

  • The source table is defined as:

    create table tbl_order_source(
        order_id            int             comment 'order ID',
        shop_id             int             comment 'bookstore ID',
        user_id             int             comment 'user ID',
        original_price      double          comment 'original transaction amount',
        create_time         timestamp(3)    comment 'creation time: yyyy-MM-dd HH:mm:ss',
        watermark for create_time as create_time - interval '0' second
    )with(
        'connector' = 'kafka',
        'topic' = 'tbl_order_source',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'testGroup',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json',
        'json.fail-on-missing-field' = 'false',
        'json.ignore-parse-errors' = 'true'
    );
  • The out-of-order source data is:

    {"order_id":"1","shop_id":"1","user_id":"1","original_price":"1","create_time":"2024-01-01 20:05:00"}
    {"order_id":"2","shop_id":"1","user_id":"2","original_price":"2","create_time":"2024-01-01 20:04:00"}
    {"order_id":"1","shop_id":"1","user_id":"1","original_price":"3","create_time":"2024-01-01 20:03:00"}
    {"order_id":"3","shop_id":"1","user_id":"3","original_price":"4","create_time":"2024-01-01 20:02:00"}
    {"order_id":"1","shop_id":"1","user_id":"1","original_price":"5","create_time":"2024-01-01 20:04:00"}
  • To consume the out-of-order source, we first design an order-preserving job:

    -- Intermediate table holding the order-preserving result
    create table ods_order_source(
        order_id            int             comment 'order ID',
        shop_id             int             comment 'bookstore ID',
        user_id             int             comment 'user ID',
        original_price      double          comment 'order amount',
        create_time         timestamp(3)    comment 'creation time: yyyy-MM-dd HH:mm:ss',
        watermark for create_time as create_time - interval '0' second,
        primary key (order_id) not enforced
    )with(
        'connector' = 'upsert-kafka',
        'topic' = 'ods_order_source',
        'properties.bootstrap.servers' = 'localhost:9092',
        'key.format' = 'json',
        'key.json.ignore-parse-errors' = 'true',
        'value.format' = 'json',
        'value.json.fail-on-missing-field' = 'false'
    );
    
    -- ETL from the source to the order-preserving result.
    -- rn = 1 keeps only the earliest row (by create_time) for each order_id.
    insert into ods_order_source
    select
        tmp.order_id,
        tmp.shop_id,
        tmp.user_id,
        tmp.original_price,
        tmp.create_time
    from (
        select
            t.order_id,
            t.shop_id,
            t.user_id,
            t.original_price,
            t.create_time,
            row_number() over (partition by t.order_id order by t.create_time asc) as rn
        from tbl_order_source t
    ) tmp
    where tmp.rn = 1
    ;

    -- Query the order-preserving intermediate result
    select * from ods_order_source;

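    For the source input above, the order-preserving job emits the changelog below. The fifth source record (order 1 at 20:04:00) emits nothing, because order 1 already holds an earlier row (20:03:00).
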
    +I {"order_id":"1","shop_id":"1","user_id":"1","original_price":"1","create_time":"2024-01-01 20:05:00.000"}
    +I {"order_id":"2","shop_id":"1","user_id":"2","original_price":"2","create_time":"2024-01-01 20:04:00.000"}
    -U {"order_id":"1","shop_id":"1","user_id":"1","original_price":"1","create_time":"2024-01-01 20:05:00.000"}
    +U {"order_id":"1","shop_id":"1","user_id":"1","original_price":"3","create_time":"2024-01-01 20:03:00.000"}
    +I {"order_id":"3","shop_id":"1","user_id":"3","original_price":"4","create_time":"2024-01-01 20:02:00.000"}
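
The changelog above can be reproduced outside Flink with a small simulation. This is only a sketch of the keep-first-row logic implied by `rn = 1`, not Flink's actual operator; the function name `keep_first_row` and the tuple layout are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (order_id, original_price, create_time) in arrival order,
# copied from the sample source data.
Row = Tuple[int, float, str]

SOURCE: List[Row] = [
    (1, 1.0, "2024-01-01 20:05:00"),
    (2, 2.0, "2024-01-01 20:04:00"),
    (1, 3.0, "2024-01-01 20:03:00"),
    (3, 4.0, "2024-01-01 20:02:00"),
    (1, 5.0, "2024-01-01 20:04:00"),
]


def keep_first_row(rows: List[Row]) -> List[Tuple[str, Row]]:
    """Hypothetical helper: keep, per order_id, the row with the smallest create_time."""
    state: Dict[int, Row] = {}
    changelog: List[Tuple[str, Row]] = []
    for row in rows:
        order_id, _, create_time = row
        current = state.get(order_id)
        if current is None:
            # First row seen for this order: plain insert.
            state[order_id] = row
            changelog.append(("+I", row))
        elif create_time < current[2]:
            # An earlier row arrived late: retract the old row, emit the new one.
            # Timestamps share one fixed format, so string comparison is safe here.
            state[order_id] = row
            changelog.append(("-U", current))
            changelog.append(("+U", row))
        # Otherwise the row is not earlier than the kept one and is dropped.
    return changelog


changelog = keep_first_row(SOURCE)
for op, row in changelog:
    print(op, row)
```

The simulation emits five records in the same order as the job output, and the last source record produces nothing.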
  • Next, design the aggregate SQL on top of the order-preserving job's output:

    select
        t.shop_id                                  as shop_id,
        to_date(cast(t.create_time as string))     as create_date,
        sum(t.original_price)                      as original_amt,
        sum(1)                                     as order_num,
        count(distinct t.order_id)                 as order_cnt
    from ods_order_source t
    group by
        t.shop_id,
        to_date(cast(t.create_time as string))
    ;

    Test result:

    +I {"shop_id":1,"create_date":2024-01-01,"original_amt":1,"order_num":1,"order_cnt":1}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":1,"order_num":1,"order_cnt":1}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":3,"order_num":2,"order_cnt":2}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":3,"order_num":2,"order_cnt":2}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":2,"order_num":1,"order_cnt":1}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":2,"order_num":1,"order_cnt":1}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":5,"order_num":2,"order_cnt":2}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":5,"order_num":2,"order_cnt":2}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":9,"order_num":3,"order_cnt":3}

    The aggregate output ends with {"shop_id":1,"create_date":2024-01-01,"original_amt":9,"order_num":3,"order_cnt":3}, which matches the source input: after the order-preserving job, orders 1, 2, and 3 contribute 3 + 2 + 4 = 9 across three orders.
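
To see why the aggregate ends at 9 / 3 / 3, the retractions can be replayed by hand. The sketch below applies the order-preserving job's changelog to running totals; it assumes that a `-U` record subtracts what the matching earlier record added, and the function name `aggregate` is hypothetical.

```python
from collections import Counter

# (op, order_id, original_price) copied from the order-preserving job's output.
CHANGELOG = [
    ("+I", 1, 1.0),
    ("+I", 2, 2.0),
    ("-U", 1, 1.0),
    ("+U", 1, 3.0),
    ("+I", 3, 4.0),
]


def aggregate(changelog):
    """Return (original_amt, order_num, order_cnt) after each changelog record."""
    amt, num = 0.0, 0
    live = Counter()  # order_id -> number of live rows, backing count(distinct)
    snapshots = []
    for op, order_id, price in changelog:
        # Retraction records subtract; insert/update-after records add.
        sign = -1 if op in ("-U", "-D") else 1
        amt += sign * price
        num += sign
        live[order_id] += sign
        if live[order_id] == 0:
            del live[order_id]
        snapshots.append((amt, num, len(live)))
    return snapshots


snapshots = aggregate(CHANGELOG)
for snap in snapshots:
    print(snap)
```

Each snapshot corresponds to one `+I` or `+U` row in the test result; the totals dip from (3, 2, 2) to (2, 1, 1) at the point where order 1's first row is retracted.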

  • Now run the same SQL directly against the source data, without the order-preserving job, and compare the result:

    select
        t.shop_id                                  as shop_id,
        to_date(cast(t.create_time as string))     as create_date,
        sum(t.original_price)                      as original_amt,
        sum(1)                                     as order_num,
        count(distinct t.order_id)                 as order_cnt
    from tbl_order_source t
    group by
        t.shop_id,
        to_date(cast(t.create_time as string))
    ;

    Result:

    +I {"shop_id":1,"create_date":2024-01-01,"original_amt":1,"order_num":1,"order_cnt":1}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":1,"order_num":1,"order_cnt":1}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":3,"order_num":2,"order_cnt":2}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":3,"order_num":2,"order_cnt":2}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":6,"order_num":3,"order_cnt":2}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":6,"order_num":3,"order_cnt":2}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":10,"order_num":4,"order_cnt":3}
    -U {"shop_id":1,"create_date":2024-01-01,"original_amt":10,"order_num":4,"order_cnt":3}
    +U {"shop_id":1,"create_date":2024-01-01,"original_amt":15,"order_num":5,"order_cnt":3}

    The final row {"shop_id":1,"create_date":2024-01-01,"original_amt":15,"order_num":5,"order_cnt":3} is wrong: order 1 is counted three times (1 + 3 + 5), so original_amt and order_num are inflated, and only order_cnt still equals 3.
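
The inflated totals follow directly from the raw topic being append-only: every record is an insert and nothing is ever retracted. A minimal sketch of the running totals, assuming each source record is simply added in arrival order:

```python
# (order_id, original_price) in arrival order, copied from the sample source data.
SOURCE = [(1, 1.0), (2, 2.0), (1, 3.0), (3, 4.0), (1, 5.0)]

running = []
amt = 0.0
order_ids = set()
for order_num, (order_id, price) in enumerate(SOURCE, start=1):
    amt += price             # sum(original_price): duplicates are added again
    order_ids.add(order_id)  # count(distinct order_id): duplicates collapse
    running.append((amt, order_num, len(order_ids)))

for snap in running:
    print(snap)
```

Only the distinct count is immune to the repeated order 1 records; the sum and the row count grow with every duplicate.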

Based on these test results, setting shop_id as the key of the aggregate SQL's sink keeps the downstream results correct.
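
What a keyed sink does with the aggregate changelog can be sketched as follows. This assumes an upsert-style sink (the article does not name the sink connector) in which `+I`/`+U` overwrite the row stored under the key and `-U` can be skipped; `apply_upsert` is a hypothetical helper. It keys on shop_id alone, as stated above, which is sufficient here because the sample covers a single create_date.

```python
# (op, shop_id, create_date, original_amt, order_num, order_cnt)
# copied from the test result of the aggregate over the order-preserving table.
AGG_CHANGELOG = [
    ("+I", 1, "2024-01-01", 1.0, 1, 1),
    ("-U", 1, "2024-01-01", 1.0, 1, 1),
    ("+U", 1, "2024-01-01", 3.0, 2, 2),
    ("-U", 1, "2024-01-01", 3.0, 2, 2),
    ("+U", 1, "2024-01-01", 2.0, 1, 1),
    ("-U", 1, "2024-01-01", 2.0, 1, 1),
    ("+U", 1, "2024-01-01", 5.0, 2, 2),
    ("-U", 1, "2024-01-01", 5.0, 2, 2),
    ("+U", 1, "2024-01-01", 9.0, 3, 3),
]


def apply_upsert(changelog):
    """Materialize a changelog into a table keyed by shop_id."""
    table = {}
    for op, shop_id, *values in changelog:
        if op in ("+I", "+U"):
            table[shop_id] = tuple(values)  # latest row wins for this key
        elif op == "-D":
            table.pop(shop_id, None)        # delete removes the key
        # "-U" is skipped: the following "+U" overwrites the same key.
    return table


table = apply_upsert(AGG_CHANGELOG)
print(table)
```

With the key in place, a downstream reader sees one row per shop holding the latest totals, rather than the intermediate values.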
