HIVE实战处理(二十四)留存用户数

bash 复制代码
留存概念:
次X日活跃留存,次X日新增留存,也就是看今天的新增或活跃用户在后续几天的留存情况

一、留存表的生成逻辑

因为用户活跃日期和留存的日期无法对齐所以搞了2级分区(dt,static_day)

1)首先获得计算日D、根据要出的次X日留存,推算出前面的DT ,整体从活跃表里根据这些日期生成临时活跃表tmp1

2)分别把计算DT和前X日的DT进行匹配,按相差的天数进行匹配,如果匹配一直分别得到对应的次X日留存标识。

3)需要使用1个新的字段存储留存指标的的日期,比如20250701号的留存keep1_num只能等20250702号过完才能计算,那对应也是7.1号算留存日期,是指在DT=20250702的留存时间。

所以根据dt往前推算的日期都是留存日期,不能写到dt这个字段里,因为除了留存指标外还要计算统计日的指标。

如果留存日期=统计日期的,出的当日活跃。留存日期< 统计日期的话,出的是次X日留存指标。

bash 复制代码
--活跃临时表
create table tmp1 as 
select 
    ,t1.uuid
    ,t1.dt                                  as statis_day
    ,case when t1.dt='${DT}' then 'Y' else 'N' end                      as keep_0d_active_flag
	,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-1), '-', '') then 'Y'
          else 'N' end                      as keep_1d_active_flag
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-2), '-', '') then 'Y'
          else 'N' end                      as keep_2d_active_flag     
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-3), '-', '') then 'Y'
          else 'N' end                      as keep_3d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-4), '-', '') then 'Y'
          else 'N' end                      as keep_4d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-5), '-', '') then 'Y'
          else 'N' end                      as keep_5d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-6), '-', '') then 'Y'
          else 'N' end                      as keep_6d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-7), '-', '') then 'Y'
          else 'N' end                      as keep_7d_active_flag        
         
from 活跃表 t1
where t1.dt in ( 
${DT}
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-1), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-2), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-3), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-4), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-5), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-6), '-', '')    
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-7), '-', '')
)


;

--当日活跃以及留存指标
insert overwrite table 留存表 partition(dt='${DT}')
select   
    group_id
   ,statis_day
   ,channel
   ,version
   ,sum(case when keep_0d_active_flag='Y' then 1 else 0 end)  as av
   ,sum(case when keep_1d_active_flag='Y' then 1 else 0 end)  as keep_1d_av
   ,sum(case when keep_2d_active_flag='Y' then 1 else 0 end)  as keep_2d_av
   ,sum(case when keep_3d_active_flag='Y' then 1 else 0 end)  as keep_3d_av
   ,sum(case when keep_4d_active_flag='Y' then 1 else 0 end)  as keep_4d_av
   ,sum(case when keep_5d_active_flag='Y' then 1 else 0 end)  as keep_5d_av
   ,sum(case when keep_6d_active_flag='Y' then 1 else 0 end)  as keep_6d_av
   ,sum(case when keep_7d_active_flag='Y' then 1 else 0 end)  as keep_7d_av
from(
        select 
             cast(grouping__id as bigint)& 7 & 3  as group_id
            ,channel
			,uuid
            ,statis_day
            ,max(keep_1d_active_flag)      as keep_1d_active_flag
            ,max(keep_2d_active_flag)      as keep_2d_active_flag
            ,max(keep_3d_active_flag)      as keep_3d_active_flag
            ,max(keep_4d_active_flag)      as keep_4d_active_flag
            ,max(keep_5d_active_flag)      as keep_5d_active_flag
            ,max(keep_6d_active_flag)      as keep_6d_active_flag
            ,max(keep_7d_active_flag)      as keep_7d_active_flag 
        from tmp1
        group by 
                ,channel  --1
				,version	--2		
				,uuid       -- 4
				,statis_day --8				
                             
        grouping sets(          
             (channel,uuid,statis_day)    
            ,(version,uuid,statis_day)
			,(uuid,statis_day)			
			)
) ta
group by  
    group_id
   ,statis_day
   ,channel
   ,version

二、对于留存的表的查询处理

1)非留存指标的话,直接使用where dt between '20250701' and '20250707'

2)对于留存指标要取static_day,这个static_day是代表留存日期在dt的不同留存指标。

select

dt

,sum(active_num)

,sum(keep1_num)

,sum(keep2_num)

,sum(keep3_num)

,sum(keep4_num)

from

(select

dt,

,active_num

,0 as keep1_num

,0 as keep2_num

,0 as keep3_num

,0 as keep4_num

from 留存表 where dt between '20250701' and '20250704'

union all

select

static_day dt,

,0 as active_num

,keep1_num

,keep2_num

,keep3_num

,keep4_num

from 留存表 where static_day between '20250701' and '20250704'

) t group by dt

相关推荐
B站计算机毕业设计超人5 天前
计算机毕业设计Django+Vue.js高考推荐系统 高考可视化 大数据毕业设计(源码+LW文档+PPT+详细讲解)
大数据·vue.js·hadoop·django·毕业设计·课程设计·推荐算法
B站计算机毕业设计超人5 天前
计算机毕业设计Django+Vue.js音乐推荐系统 音乐可视化 大数据毕业设计 (源码+文档+PPT+讲解)
大数据·vue.js·hadoop·python·spark·django·课程设计
十月南城5 天前
数据湖技术对比——Iceberg、Hudi、Delta的表格格式与维护策略
大数据·数据库·数据仓库·hive·hadoop·spark
王九思5 天前
Hive Thrift Server 介绍
数据仓库·hive·hadoop
土拨鼠烧电路5 天前
笔记11:数据中台:不是数据仓库,是业务能力复用的引擎
数据仓库·笔记
Asher05095 天前
Hive核心知识:从基础到实战全解析
数据仓库·hive·hadoop
xhaoDream5 天前
Hive3.1.3 配置 Tez 引擎
大数据·hive·tez
yumgpkpm5 天前
AI视频生成:Wan 2.2(阿里通义万相)在华为昇腾下的部署?
人工智能·hadoop·elasticsearch·zookeeper·flink·kafka·cloudera
Asher05095 天前
Hadoop核心技术与实战指南
大数据·hadoop·分布式