279. ClickHouse: Incremental backfill of cleaned DWD-layer data with Kettle

1. Purpose

Rows in the ODS-layer tables can go missing for various reasons, so the missing data needs to be backfilled.

2. Implementation steps

2.1 Decide on a backfill strategy

For example, fill each gap with the historical data from the same time slot one week earlier.
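As a minimal sketch of this strategy, written in Python outside ClickHouse: a record missing at time T is replaced by a copy of the record stamped T minus 7 days. The `history` dictionary, the function name, and the sample values are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory history: (device_no, create_time) -> measured values.
history = {
    ("D001", datetime(2024, 12, 9, 12, 0)): {"volume_sum": 42, "speed_avg": 35.5},
}

def backfill_from_last_week(device_no, miss_time, history):
    """Return a substitute record for miss_time copied from 7 days earlier, or None."""
    source = history.get((device_no, miss_time - timedelta(days=7)))
    if source is None:
        return None  # no data a week earlier either, so there is nothing to copy
    # Keep the measured values, but stamp the record with the missing time.
    return {"device_no": device_no, "create_time": miss_time, **source}

filled = backfill_from_last_week("D001", datetime(2024, 12, 16, 12, 0), history)
print(filled)
```

Returning None when no week-old record exists matches the intent of the final filter in the SQL, which only emits rows that have a source to copy from.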

2.2 SQL statement

-- Build replacement rows for slots that are missing from dwd_turnratio.
select
    generateUUIDv4()  as  id,
    a2.device_no, t4.source_device_type, t4.sn, t4.model, a2.miss_time create_time, t4.cycle,
    t4.volume_sum, t4.speed_avg, t4.volume_left, t4.speed_left, t4.volume_straight,
    t4.speed_straight, t4.volume_right, t4.speed_right, t4.volume_turn, t4.speed_turn,
    cast(a2.day as String) day
from (
    -- a2: slots with no existing row, plus the timestamp exactly 7 days earlier
    select
        a1.device_no, a1.day, a1.all_time miss_time,
        (all_time - interval 7 day) create_time_7
    from (
        -- a1: each device seen that day, paired with the 5-minute slot containing new_time
        select
            t1.device_no, t1.day, t2.all_time
        from hurys_jw.dwd_turnratio as t1
        cross join (
            select
                frequency_rate,
                toDateTime('2024-12-16 12:00:00') new_time,
                toDateTime(concat(toString(toDate('2024-12-16 12:00:00')), ' ', frequency_time)) all_time,
                (toDateTime(concat(toString(toDate('2024-12-16 12:00:00')), ' ', frequency_time)) + interval 5 minute) all_time_5
            from hurys_jw.dwd_frequency_time
        ) as t2
        where t2.frequency_rate = '300' and toDate(t2.all_time) = t1.day
          and all_time <= new_time and all_time_5 > new_time
        group by t1.device_no, t1.day, t2.all_time
    ) as a1
    left join hurys_jw.dwd_turnratio as t3
        on a1.device_no = t3.device_no and a1.all_time = t3.create_time and a1.day = t3.day
    -- Unmatched rows get the DateTime default (1970) when join_use_nulls = 0,
    -- the ClickHouse default, so this keeps only the missing slots.
    where toYear(t3.create_time) = 1970
) as a2
left join hurys_jw.dwd_turnratio as t4
    on a2.device_no = t4.device_no and a2.create_time_7 = t4.create_time
-- Skip slots with no data a week earlier (this assumes t4.cycle is Nullable).
where t4.cycle is not null
;

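The key part is the time-window logic (highlighted in red in the original). Each job runs every 5 minutes, so each run only covers the most recent 5-minute slot, that is, the slot whose start time satisfies all_time <= new_time and all_time_5 > new_time.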
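To make the window condition concrete, here is a small Python sketch of the same test, all_time <= new_time < all_time + 5 minutes. The function name `slot_containing` and the sample timestamps are assumptions made for the illustration; in the SQL the candidate slot starts come from hurys_jw.dwd_frequency_time.

```python
from datetime import datetime, timedelta

def slot_containing(new_time, slot_minutes=5):
    """Return the start of the slot [start, start + slot_minutes) that contains new_time."""
    midnight = new_time.replace(hour=0, minute=0, second=0, microsecond=0)
    slot = timedelta(minutes=slot_minutes)
    # Floor the time elapsed since midnight to a whole number of slots.
    return midnight + ((new_time - midnight) // slot) * slot

# The same selection written the way the SQL filter does it:
# keep the candidate start whose 5-minute window contains new_time.
candidates = [
    datetime(2024, 12, 16, 11, 55),
    datetime(2024, 12, 16, 12, 0),
    datetime(2024, 12, 16, 12, 5),
]
new_time = datetime(2024, 12, 16, 12, 3, 20)
matched = [t for t in candidates if t <= new_time < t + timedelta(minutes=5)]
print(slot_containing(new_time), matched)
```

Because the windows are half-open, exactly one slot matches any given new_time, so a run every 5 minutes checks each slot once.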

2.3 Kettle job

2.3.1 newtime

select (
    select toDateTime(create_time)
    from hurys_jw.dwd_statistics
    order by create_time desc
    limit 1
) as new_time
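The value returned by this step is what later fills the ? placeholder in the table input query. The sketch below shows that two-step pattern in Python, with SQLite from the standard library standing in for ClickHouse purely to demonstrate placeholder binding. The table contents are made up, and the assumption that the Kettle table input takes its parameter from the preceding step (typically via its "Insert data from step" option) is not confirmed by the text, since the step configuration is not shown.

```python
import sqlite3

# SQLite is only a stand-in here; the real job queries ClickHouse through Kettle.
conn = sqlite3.connect(":memory:")
conn.execute("create table dwd_statistics (device_no text, create_time text)")
conn.executemany(
    "insert into dwd_statistics values (?, ?)",
    [("D001", "2024-12-16 11:55:00"), ("D001", "2024-12-16 12:00:00")],
)

# Step 1: the equivalent of the newtime query, fetching the latest create_time.
(new_time,) = conn.execute(
    "select create_time from dwd_statistics order by create_time desc limit 1"
).fetchone()

# Step 2: bind that value to the ? placeholder of the next query.
rows = conn.execute(
    "select device_no, create_time from dwd_statistics where create_time = ?",
    (new_time,),
).fetchall()
print(new_time, rows)
```

Driving the window from the latest timestamp in the table, rather than from the wall clock, means a delayed run still checks the slot that the data has actually reached.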

2.3.2 Replace NULL values

2.3.3 Table input

select
    generateUUIDv4() as id,
    a2.device_no, t4.source_device_type, t4.sn, t4.model, a2.miss_time create_time, t4.cycle,
    a2.lane_no, t4.lane_type, a2.section_no, a2.coil_no, t4.volume_sum, t4.volume_person,
    t4.volume_car_non, t4.volume_car_small, t4.volume_car_middle, t4.volume_car_big, t4.speed_avg,
    t4.speed_85, t4.time_occupancy, t4.average_headway, t4.average_gap, cast(a2.day as String) day
from (
    select
        a1.device_no, a1.day, a1.all_time miss_time, a1.lane_no, a1.section_no, a1.coil_no,
        (all_time - interval 7 day) create_time_7
    from (
        select
            t1.device_no, t1.day, t1.lane_no, t1.section_no, t1.coil_no, t2.all_time
        from hurys_jw.dwd_statistics as t1
        cross join (
            select
                frequency_rate,
                toDateTime(?) new_time,
                toDateTime(concat(toString(toDate(new_time)), ' ', frequency_time)) all_time,
                (all_time + interval 5 minute) all_time_5
            from hurys_jw.dwd_frequency_time
        ) as t2
        where t2.frequency_rate = t1.cycle and toDate(t2.all_time) = t1.day
          and all_time <= new_time and all_time_5 > new_time
        group by t1.device_no, t1.day, t1.lane_no, t1.section_no, t1.coil_no, t2.all_time
    ) as a1
    left join hurys_jw.dwd_statistics as t3
        on a1.device_no = t3.device_no and a1.all_time = t3.create_time and a1.lane_no = t3.lane_no
        and a1.section_no = t3.section_no and a1.coil_no = t3.coil_no and a1.day = t3.day
    where toYear(t3.create_time) = 1970
) as a2
left join hurys_jw.dwd_statistics as t4
    on a2.device_no = t4.device_no and a2.lane_no = t4.lane_no and a2.section_no = t4.section_no
    and a2.coil_no = t4.coil_no and a2.create_time_7 = t4.create_time
where t4.cycle is not null
;

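The key part (highlighted in red in the original) is how the incremental backfill is restricted to a single 5-minute cycle: new_time is passed in through the ? placeholder, and only the slot that contains it is checked for missing rows.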
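The missing-row check inside that cycle is an anti-join: take every key that reported on the day in question, then drop the keys that already have a row at the slot time. A minimal Python sketch of that idea follows. It simplifies the key to device_no only, whereas the query above keys on device_no, lane_no, section_no and coil_no; the function name and sample rows are hypothetical.

```python
from datetime import datetime

def missing_devices(existing, slot_time):
    """Return devices that reported on slot_time's day but have no row at slot_time."""
    # Devices with at least one row that day (the role of the a1 subquery).
    seen_today = {dev for dev, ts in existing if ts.date() == slot_time.date()}
    # Devices that already have a row for this exact slot (the matched side of the join).
    present = {dev for dev, ts in existing if ts == slot_time}
    return sorted(seen_today - present)

# Hypothetical (device_no, create_time) pairs already stored in the table.
existing = {
    ("D001", datetime(2024, 12, 16, 11, 55)),
    ("D002", datetime(2024, 12, 16, 11, 55)),
    ("D002", datetime(2024, 12, 16, 12, 0)),
}
print(missing_devices(existing, datetime(2024, 12, 16, 12, 0)))  # ['D001']
```

One consequence of this design, visible in both the sketch and the query, is that a device with no rows at all on that day is never reported as missing, because it does not appear among the keys seen that day.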

2.3.4 Select values

2.3.5 ClickHouse output

2.3.6 Run the Kettle job

Done!
