HiveSQL: Designing a Registration and Active-Retention Table for the Last 180 Days

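0 Problem Description

There are two tables: a user activity table user_active(user_id, active_date) and a user registration table user_regist(user_id, regist_date). Both tables are partitioned by dt (format yyyy-MM-dd), and both identify users by user_id. Design a registration-to-active retention table covering day 1 through day 180 after registration. For each registration date and each day offset, the result should report the number of retained active users and the retention rate.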
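
To make the two inputs concrete, here is a minimal DDL sketch. Only the table, column, and partition names come from the problem statement; the string column types, the column comments, and the ORC storage format are assumptions for illustration.

```sql
-- Hypothetical DDL: types, comments, and storage format are assumptions.
create table if not exists user_regist (
    user_id     string comment 'user identifier',
    regist_date string comment 'registration time, parseable by to_date()'
)
partitioned by (dt string comment 'partition date, yyyy-MM-dd')
stored as orc;

create table if not exists user_active (
    user_id     string comment 'user identifier',
    active_date string comment 'activity time, parseable by to_date()'
)
partitioned by (dt string comment 'partition date, yyyy-MM-dd')
stored as orc;
```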

1 Data Analysis

The complete code is as follows:

```sql
select
    regist_date,
    diff,
    active_user_cnt,
    -- guard against a zero or NULL denominator; such rows get a NULL rate
    case
        when nvl(regis_cnt, 0) != 0 then active_user_cnt / regis_cnt
    end as retention_rate
from (
         select
             t1.regist_date,
             -- regist_count is constant within a regist_date, so max() simply picks it
             max(t1.regist_count)                     as regis_cnt,
             datediff(t2.active_date, t1.regist_date) as diff,
             count(t2.user_id)                        as active_user_cnt
         from (select
                   user_id,
                   to_date(regist_date)                                    as regist_date,
                   count(user_id) over (partition by to_date(regist_date)) as regist_count
               from user_regist
               where dt >= date_sub(current_date(), 180)) t1
                  left join
              -- one row per user per active day
              (select
                   user_id,
                   to_date(active_date) as active_date
               from user_active
               where dt >= date_sub(current_date(), 180)
               group by user_id, to_date(active_date)) t2
              on t1.user_id = t2.user_id
         -- keep day offsets 1..180; this also drops users with no matching activity
         where datediff(t2.active_date, t1.regist_date) >= 1
           and datediff(t2.active_date, t1.regist_date) <= 180
         group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
     ) t3
order by regist_date,
         diff;
```

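Walkthrough of the code above:

Step 1: From the registration table, derive each user's registration date regist_date and the number of users who registered on that date, regist_count.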

```sql
select
    user_id,
    to_date(regist_date)                                    as regist_date,
    -- window count: every user row carries the size of its registration-day cohort
    count(user_id) over (partition by to_date(regist_date)) as regist_count
from user_regist
where dt >= date_sub(current_date(), 180);
```
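
The effect of the window function can be checked on a few made-up rows. The demo_regist CTE and its three users are hypothetical sample data, not part of the original tables.

```sql
-- Hypothetical sample data: two users register on 2024-01-01, one on 2024-01-02.
with demo_regist as (
    select 'u1' as user_id, '2024-01-01 08:00:00' as regist_date
    union all
    select 'u2' as user_id, '2024-01-01 21:30:00' as regist_date
    union all
    select 'u3' as user_id, '2024-01-02 10:15:00' as regist_date
)
select
    user_id,
    to_date(regist_date)                                    as regist_date,
    count(user_id) over (partition by to_date(regist_date)) as regist_count
from demo_regist;
-- Expected rows (order not guaranteed):
--   u1, 2024-01-01, 2
--   u2, 2024-01-01, 2
--   u3, 2024-01-02, 1
```

Because count(...) over (partition by ...) keeps every input row, each user row carries its cohort size, which lets the later join stay keyed on user_id without a separate aggregation step.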

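Step 2: Use the registration table as the driving (left) table and join the activity table on user_id. **The relationship is one-to-many, so each registered user is paired with every one of their active days** (a per-user fan-out rather than a full Cartesian product of the two tables). Note that a user may appear in the activity table several times on the same day, so the activity rows must first be deduplicated to one row per user per day.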

```sql
select
    t1.regist_date,
    t1.user_id                               as regist_user_id,
    t1.regist_count,
    t2.user_id                               as active_user_id,
    t2.active_date,
    datediff(t2.active_date, t1.regist_date) as diff
from (select
          user_id,
          to_date(regist_date)                                    as regist_date,
          count(user_id) over (partition by to_date(regist_date)) as regist_count
      from user_regist
      where dt >= date_sub(current_date(), 180)) t1
         left join
     -- group by deduplicates to one row per user per active day
     (select
          user_id,
          to_date(active_date) as active_date
      from user_active
      where dt >= date_sub(current_date(), 180)
      group by user_id, to_date(active_date)) t2
     on t1.user_id = t2.user_id;
```
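
To see why the deduplication matters, consider a user who is active twice on the same day. The sketch below uses a hypothetical demo_active CTE with made-up rows and compares the raw row count with the number of distinct (user, day) pairs, which is what the group by in t2 produces.

```sql
-- Hypothetical sample data: u1 is active twice on 2024-01-02 and once on 2024-01-03.
with demo_active as (
    select 'u1' as user_id, '2024-01-02 09:00:00' as active_date
    union all
    select 'u1' as user_id, '2024-01-02 18:00:00' as active_date
    union all
    select 'u1' as user_id, '2024-01-03 07:00:00' as active_date
)
select
    count(*)                                      as raw_rows,
    count(distinct user_id, to_date(active_date)) as user_days
from demo_active;
-- raw_rows = 3, user_days = 2
```

Without the deduplication, the join would emit two rows for u1 on 2024-01-02 and the later count(t2.user_id) would count that user twice for the same day offset.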

Step 3: Group by registration date and retention period (in days), and count the active users within each retention period.

```sql
select
    t1.regist_date,
    -- regist_count is constant within a regist_date, so max() simply picks it
    max(t1.regist_count)                     as regis_cnt,
    datediff(t2.active_date, t1.regist_date) as diff,
    count(t2.user_id)                        as active_user_cnt
from (select
          user_id,
          to_date(regist_date)                                    as regist_date,
          count(user_id) over (partition by to_date(regist_date)) as regist_count
      from user_regist
      where dt >= date_sub(current_date(), 180)) t1
         left join
     (select
          user_id,
          to_date(active_date) as active_date
      from user_active
      where dt >= date_sub(current_date(), 180)
      group by user_id, to_date(active_date)) t2
     on t1.user_id = t2.user_id
-- keep day offsets 1..180; this also drops users with no matching activity
where datediff(t2.active_date, t1.regist_date) >= 1
  and datediff(t2.active_date, t1.regist_date) <= 180
group by t1.regist_date, datediff(t2.active_date, t1.regist_date);
```
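
One side effect of filtering on t2 columns in the where clause is that rows with no matching activity are dropped, so the left join behaves like an inner join and any (regist_date, diff) pair with zero active users is absent from the result. If the retention table should contain every day offset from 1 to 180, one possible approach is to build the full grid first and left join the counts onto it. The sketch below is an alternative under that assumption, not the article's method. The CTE names (reg, act, reg_daily, grid, hits) are hypothetical, and posexplode(split(space(179), ' ')) is used as a simple number generator.

```sql
-- Sketch: one row per (regist_date, diff) for diff = 1..180, keeping zero counts.
with reg as (
    select user_id, to_date(regist_date) as regist_date
    from user_regist
    where dt >= date_sub(current_date(), 180)
),
act as (
    -- one row per user per active day
    select user_id, to_date(active_date) as active_date
    from user_active
    where dt >= date_sub(current_date(), 180)
    group by user_id, to_date(active_date)
),
reg_daily as (
    select regist_date, count(user_id) as regis_cnt
    from reg
    group by regist_date
),
grid as (
    -- split(space(179), ' ') yields 180 empty strings, so pos runs 0..179
    select r.regist_date, r.regis_cnt, n.pos + 1 as diff
    from reg_daily r
    lateral view posexplode(split(space(179), ' ')) n as pos, blank
    -- assumption: skip offsets that lie in the future and cannot be observed yet
    where date_add(r.regist_date, n.pos + 1) <= current_date()
),
hits as (
    select
        reg.regist_date,
        datediff(act.active_date, reg.regist_date) as diff,
        count(act.user_id)                         as active_user_cnt
    from reg
    join act on reg.user_id = act.user_id
    where datediff(act.active_date, reg.regist_date) between 1 and 180
    group by reg.regist_date, datediff(act.active_date, reg.regist_date)
)
select
    g.regist_date,
    g.diff,
    nvl(h.active_user_cnt, 0)               as active_user_cnt,
    nvl(h.active_user_cnt, 0) / g.regis_cnt as retention_rate
from grid g
left join hits h
    on g.regist_date = h.regist_date
   and g.diff = h.diff
order by g.regist_date, g.diff;
```

The trade-off is a larger output (up to 180 rows per registration date) in exchange for explicit zeros, which downstream reports often find easier to chart.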

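Step 4: Compute the retention rate retention_rate as the active user count divided by the registration count for that registration date.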

```sql
select
    regist_date,
    diff,
    active_user_cnt,
    -- guard against a zero or NULL denominator; such rows get a NULL rate
    case
        when nvl(regis_cnt, 0) != 0 then active_user_cnt / regis_cnt
    end as retention_rate
from (
         select
             t1.regist_date,
             -- regist_count is constant within a regist_date, so max() simply picks it
             max(t1.regist_count)                     as regis_cnt,
             datediff(t2.active_date, t1.regist_date) as diff,
             count(t2.user_id)                        as active_user_cnt
         from (select
                   user_id,
                   to_date(regist_date)                                    as regist_date,
                   count(user_id) over (partition by to_date(regist_date)) as regist_count
               from user_regist
               where dt >= date_sub(current_date(), 180)) t1
                  left join
              -- one row per user per active day
              (select
                   user_id,
                   to_date(active_date) as active_date
               from user_active
               where dt >= date_sub(current_date(), 180)
               group by user_id, to_date(active_date)) t2
              on t1.user_id = t2.user_id
         -- keep day offsets 1..180; this also drops users with no matching activity
         where datediff(t2.active_date, t1.regist_date) >= 1
           and datediff(t2.active_date, t1.regist_date) <= 180
         group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
     ) t3
order by regist_date,
         diff;
```
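
The division in this step relies on Hive's / operator returning a double even when both operands are integers, so no explicit cast is needed. A small sketch with made-up numbers (3 of 4 users retained, and 2 of 3), plus an optional round() that the original query does not use:

```sql
-- Made-up counts, for illustration only.
-- raw_rate     = 0.75   (the / operator returns a double)
-- rounded_rate = 0.6667 (rounded to four decimal places)
select
    3 / 4           as raw_rate,
    round(2 / 3, 4) as rounded_rate;
```

Rounding is a presentation choice; storing the unrounded double and formatting it in the reporting layer is an equally reasonable option.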

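3 Summary

With the registration table as the left side of a left join, the one-to-many match on user_id expands each registered user into one row per active day. Grouping those rows by registration date and day offset yields the registration-to-active retention table for the last 180 days.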
