HIVE实战处理(二十四)留存用户数

bash 复制代码
留存概念:
次X日活跃留存,次X日新增留存,也就是看今天的新增或活跃用户在后续几天的留存情况

一、留存表的生成逻辑

因为用户活跃日期和留存的日期无法对齐所以搞了2级分区(dt,static_day)

1)首先获得计算日D、根据要出的次X日留存,推算出前面的DT ,整体从活跃表里根据这些日期生成临时活跃表tmp1

2)分别把计算DT和前X日的DT进行匹配,按相差的天数进行匹配,如果匹配一直分别得到对应的次X日留存标识。

3)需要使用1个新的字段存储留存指标的的日期,比如20250701号的留存keep1_num只能等20250702号过完才能计算,那对应也是7.1号算留存日期,是指在DT=20250702的留存时间。

所以根据dt往前推算的日期都是留存日期,不能写到dt这个字段里,因为除了留存指标外还要计算统计日的指标。

如果留存日期=统计日期的,出的当日活跃。留存日期< 统计日期的话,出的是次X日留存指标。

bash 复制代码
--活跃临时表
create table tmp1 as 
select 
    ,t1.uuid
    ,t1.dt                                  as statis_day
    ,case when t1.dt='${DT}' then 'Y' else 'N' end                      as keep_0d_active_flag
	,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-1), '-', '') then 'Y'
          else 'N' end                      as keep_1d_active_flag
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-2), '-', '') then 'Y'
          else 'N' end                      as keep_2d_active_flag     
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-3), '-', '') then 'Y'
          else 'N' end                      as keep_3d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-4), '-', '') then 'Y'
          else 'N' end                      as keep_4d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-5), '-', '') then 'Y'
          else 'N' end                      as keep_5d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-6), '-', '') then 'Y'
          else 'N' end                      as keep_6d_active_flag      
    ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-7), '-', '') then 'Y'
          else 'N' end                      as keep_7d_active_flag        
         
from 活跃表 t1
where t1.dt in ( 
${DT}
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-1), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-2), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-3), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-4), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-5), '-', '')
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-6), '-', '')    
,regexp_replace(date_add(from_unixtime(to_unix_timestamp('${DT}', 'yyyyMMdd')),-7), '-', '')
)


;

--当日活跃以及留存指标
insert overwrite table 留存表 partition(dt='${DT}')
select   
    group_id
   ,statis_day
   ,channel
   ,version
   ,sum(case when keep_0d_active_flag='Y' then 1 else 0 end)  as av
   ,sum(case when keep_1d_active_flag='Y' then 1 else 0 end)  as keep_1d_av
   ,sum(case when keep_2d_active_flag='Y' then 1 else 0 end)  as keep_2d_av
   ,sum(case when keep_3d_active_flag='Y' then 1 else 0 end)  as keep_3d_av
   ,sum(case when keep_4d_active_flag='Y' then 1 else 0 end)  as keep_4d_av
   ,sum(case when keep_5d_active_flag='Y' then 1 else 0 end)  as keep_5d_av
   ,sum(case when keep_6d_active_flag='Y' then 1 else 0 end)  as keep_6d_av
   ,sum(case when keep_7d_active_flag='Y' then 1 else 0 end)  as keep_7d_av
from(
        select 
             cast(grouping__id as bigint)& 7 & 3  as group_id
            ,channel
			,uuid
            ,statis_day
            ,max(keep_1d_active_flag)      as keep_1d_active_flag
            ,max(keep_2d_active_flag)      as keep_2d_active_flag
            ,max(keep_3d_active_flag)      as keep_3d_active_flag
            ,max(keep_4d_active_flag)      as keep_4d_active_flag
            ,max(keep_5d_active_flag)      as keep_5d_active_flag
            ,max(keep_6d_active_flag)      as keep_6d_active_flag
            ,max(keep_7d_active_flag)      as keep_7d_active_flag 
        from tmp1
        group by 
                ,channel  --1
				,version	--2		
				,uuid       -- 4
				,statis_day --8				
                             
        grouping sets(          
             (channel,uuid,statis_day)    
            ,(version,uuid,statis_day)
			,(uuid,statis_day)			
			)
) ta
group by  
    group_id
   ,statis_day
   ,channel
   ,version

二、对于留存的表的查询处理

1)非留存指标的话,直接使用where dt between '20250701' and '20250707'

2)对于留存指标要取static_day,这个static_day是代表留存日期在dt的不同留存指标。

select

dt

,sum(active_num)

,sum(keep1_num)

,sum(keep2_num)

,sum(keep3_num)

,sum(keep4_num)

from

(select

dt,

,active_num

,0 as keep1_num

,0 as keep2_num

,0 as keep3_num

,0 as keep4_num

from 留存表 where dt between '20250701' and '20250704'

union all

select

static_day dt,

,0 as active_num

,keep1_num

,keep2_num

,keep3_num

,keep4_num

from 留存表 where static_day between '20250701' and '20250704'

) t group by dt

相关推荐
K***658912 小时前
从0到1部署Tomcat和添加servlet(IDEA2024最新版详细教程)
hive·servlet·tomcat
李兆龙的博客13 小时前
从一到无穷大 #61 Hive史诗
数据仓库·hive·hadoop
q***06471 天前
Spring Boot 从 2.7.x 升级到 3.3注意事项
数据库·hive·spring boot
心止水j1 天前
hive分区
数据仓库·hive·hadoop
心止水j1 天前
Hive 桶表的创建、数据导入、查询与导出
数据仓库·hive·hadoop
8***a8152 天前
从0到1部署Tomcat和添加servlet(IDEA2024最新版详细教程)
hive·servlet·tomcat
5***79004 天前
JavaScript生成器函数
hive·tidb·consul
顧棟4 天前
HDFS元数据与auditlog结合Hive元数据统计分析
hive·hadoop·hdfs
z***89714 天前
【分布式】Hadoop完全分布式的搭建(零基础)
大数据·hadoop·分布式
Kevinyu_4 天前
责任链模式
java·hadoop·责任链模式