HIVE实战处理(二十四)留存用户数

bash 复制代码
留存概念:
次X日活跃留存,次X日新增留存,也就是看今天的新增或活跃用户在后续几天的留存情况

一、留存表的生成逻辑

因为用户活跃日期和留存的日期无法对齐所以搞了2级分区(dt,static_day)

1)首先获得计算日D、根据要出的次X日留存,推算出前面的DT ,整体从活跃表里根据这些日期生成临时活跃表tmp1

2)分别把计算DT和前X日的DT进行匹配,按相差的天数进行匹配,如果匹配一直分别得到对应的次X日留存标识。

3)需要使用1个新的字段存储留存指标的的日期,比如20250701号的留存keep1_num只能等20250702号过完才能计算,那对应也是7.1号算留存日期,是指在DT=20250702的留存时间。

所以根据dt往前推算的日期都是留存日期,不能写到dt这个字段里,因为除了留存指标外还要计算统计日的指标。

如果留存日期=统计日期的,出的当日活跃。留存日期< 统计日期的话,出的是次X日留存指标。

bash 复制代码
--活跃临时表
create table tmp1 as 
select 
    ,t1.uuid
    ,t1.dt                                  as statis_day
    ,case when t1.dt='${DT}' then 'Y' else 'N' end                      as keep_0d_active_flag
	,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-1), '-', '') then 'Y'
          else 'N' end                      as keep_1d_active_flag
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-2), '-', '') then 'Y'
          else 'N' end                      as keep_2d_active_flag     
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-3), '-', '') then 'Y'
          else 'N' end                      as keep_3d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-4), '-', '') then 'Y'
          else 'N' end                      as keep_4d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-5), '-', '') then 'Y'
          else 'N' end                      as keep_5d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-6), '-', '') then 'Y'
          else 'N' end                      as keep_6d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-7), '-', '') then 'Y'
          else 'N' end                      as keep_7d_active_flag        
         
from 活跃表 t1
where t1.dt in ( 
${DT}
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-1), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-2), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-3), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-4), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-5), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-6), '-', '')    
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-7), '-', '')
)


;

--当日活跃以及留存指标
insert overwrite table 留存表 partition(dt='${DT}')
select   
    group_id
   ,statis_day
   ,channel
   ,version
   ,sum(case when keep_0d_active_flag='Y' then 1 else 0 end)  as av
   ,sum(case when keep_1d_active_flag='Y' then 1 else 0 end)  as keep_1d_av
   ,sum(case when keep_2d_active_flag='Y' then 1 else 0 end)  as keep_2d_av
   ,sum(case when keep_3d_active_flag='Y' then 1 else 0 end)  as keep_3d_av
   ,sum(case when keep_4d_active_flag='Y' then 1 else 0 end)  as keep_4d_av
   ,sum(case when keep_5d_active_flag='Y' then 1 else 0 end)  as keep_5d_av
   ,sum(case when keep_6d_active_flag='Y' then 1 else 0 end)  as keep_6d_av
   ,sum(case when keep_7d_active_flag='Y' then 1 else 0 end)  as keep_7d_av
from(
        select 
             cast(grouping__id as bigint)& 7 & 3  as group_id
            ,channel
			,uuid
            ,statis_day
            ,max(keep_1d_active_flag)      as keep_1d_active_flag
            ,max(keep_2d_active_flag)      as keep_2d_active_flag
            ,max(keep_3d_active_flag)      as keep_3d_active_flag
            ,max(keep_4d_active_flag)      as keep_4d_active_flag
            ,max(keep_5d_active_flag)      as keep_5d_active_flag
            ,max(keep_6d_active_flag)      as keep_6d_active_flag
            ,max(keep_7d_active_flag)      as keep_7d_active_flag 
        from tmp1
        group by 
                ,channel  --1
				,version	--2		
				,uuid       -- 4
				,statis_day --8				
                             
        grouping sets(          
             (channel,uuid,statis_day)    
            ,(version,uuid,statis_day)
			,(uuid,statis_day)			
			)
) ta
group by  
    group_id
   ,statis_day
   ,channel
   ,version

二、对于留存的表的查询处理

1)非留存指标的话,直接使用where dt between '20250701' and '20250707'

2)对于留存指标要取static_day,这个static_day是代表留存日期在dt的不同留存指标。

select

dt

,sum(active_num)

,sum(keep1_num)

,sum(keep2_num)

,sum(keep3_num)

,sum(keep4_num)

from

(select

dt,

,active_num

,0 as keep1_num

,0 as keep2_num

,0 as keep3_num

,0 as keep4_num

from 留存表 where dt between '20250701' and '20250704'

union all

select

static_day dt,

,0 as active_num

,keep1_num

,keep2_num

,keep3_num

,keep4_num

from 留存表 where static_day between '20250701' and '20250704'

) t group by dt

相关推荐
奇点爆破XC7 小时前
Hadoop大数据生态(Ambari管理)组件服务详解
大数据·hadoop·ambari
isNotNullX8 小时前
企业数据中台建设,ETL工具选错了会踩哪些坑?
数据仓库·etl·原型模式
SelectDB技术团队10 小时前
预约发布会|核心产品力首发,如何构建面向 Agent 时代的企业级数据引擎
数据库·数据仓库·人工智能·数据分析·可观测·apache doris·selectdb
段一凡-华北理工大学11 小时前
工业领域的Hadoop架构学习~系列文章22:Hadoop生态展望 - 面向未来的技术演进
大数据·人工智能·hadoop·分布式·学习·架构·高炉炼铁
Nefu_lyh11 小时前
【Hive】六、Hive 运算逻辑:数学 / 逻辑 / 条件 / 日期 / 字符串函数
数据仓库·hive·hadoop
ChaITSimpleLove11 小时前
Etl.Net 2.2.0 项目深度分析
数据仓库·.net·etl·大数据处理·数据管道·数据处理引擎
知识分享小能手11 小时前
Hadoop学习教程,从入门到精通, HDFS分布式文件系统 — 完整知识点与案例代码(3)
hadoop·学习·hdfs
陆水A1 天前
【实时数仓·3】Flink多表JOIN状态爆炸——Event Time Temporal JOIN + TTL分层治理
大数据·数据仓库·数据分析·flink·数据库开发·bigdata
段一凡-华北理工大学1 天前
工业领域的Hadoop架构学习~系列文章20:故障诊断与根因分析 - 从表象到本质的智能推理
大数据·人工智能·hadoop·学习·架构·高炉炼铁·工业智能体
Francek Chen1 天前
【大数据处理与分析】MapReduce:05 MapReduce的具体应用
大数据·hadoop·分布式·mapreduce