HiveSQL: Designing a Registration and Active-Retention Table for the Last 180 Days

0 Problem Description

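There are two source tables: a user activity table user_active(user_id, active_date) and a user registration table user_regist(user_id, regist_date). Both are partitioned by dt (format yyyy-MM-dd), and both identify users by user_id. The task is to design a registration-to-activity retention table covering day 1 through day 180 after registration. Going by the final query in this article, each output row holds regist_date, diff (days since registration), active_user_cnt, and retention_rate.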
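For readers who want to run the queries, the two source tables might be declared roughly as follows. This is only a sketch: the column and partition names come from the problem statement, while the string types and the comments are assumptions.

```sql
-- Assumed types: the problem statement gives names only.
create table if not exists user_regist (
    user_id     string comment 'user identifier',
    regist_date string comment 'registration time, parseable by to_date()'
)
partitioned by (dt string comment 'partition date, yyyy-MM-dd');

create table if not exists user_active (
    user_id     string comment 'user identifier',
    active_date string comment 'activity time, parseable by to_date()'
)
partitioned by (dt string comment 'partition date, yyyy-MM-dd');
```

The queries in this article also assume that user_regist holds exactly one row per user within the scanned partitions. If it were a daily full snapshot, the registration counts would be inflated.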

1 Data Analysis

The complete query is as follows:

```sql
-- Retention per registration date and day offset (diff = 1..180).
select
    regist_date,
    diff,
    active_user_cnt,
    case
        when nvl(regis_cnt, 0) != 0 then active_user_cnt / regis_cnt
        end as retention_rate
from (
         select
             t1.regist_date,
             max(t1.regist_count)                     as regis_cnt,
             datediff(t2.active_date, t1.regist_date) as diff,
             count(t2.user_id)                        as active_user_cnt
         -- t1: one row per registered user, carrying that day's registration count
         from (select
                   user_id,
                   to_date(regist_date)                                    as regist_date,
                   count(user_id) over (partition by to_date(regist_date)) as regist_count
               from user_regist
               where dt >= date_sub(current_date(), 180)) t1
                  left join
              -- t2: activity deduplicated to one row per user per day
              (select
                   user_id,
                   to_date(active_date) as active_date
               from user_active
               where dt >= date_sub(current_date(), 180)
               group by user_id, to_date(active_date)) t2
              on t1.user_id = t2.user_id
         -- this filter on t2 drops unmatched rows, so the left join acts as an inner join
         where datediff(t2.active_date, t1.regist_date) >= 1
           and datediff(t2.active_date, t1.regist_date) <= 180
         group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
     ) t3
order by regist_date,
         diff;
```

The query breaks down as follows:

Step 1: From the registration table, derive each user's registration date regist_date and the number of users who registered on that day, regist_count.

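```sql
-- One row per user registered in the last 180 days.
-- regist_count repeats the day's total on every row of that day.
select
    user_id,
    to_date(regist_date)                                    as regist_date,
    count(user_id) over (partition by to_date(regist_date)) as regist_count
from user_regist
where dt >= date_sub(current_date(), 180);
```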
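To see what the window count produces, the same select can be run against a few inline rows. The CTE name sample_regist and its three rows are hypothetical stand-ins for user_regist, not data from the article.

```sql
-- Hypothetical sample rows standing in for user_regist.
with sample_regist as (
    select 'u1' as user_id, '2024-01-01 08:00:00' as regist_date
    union all
    select 'u2' as user_id, '2024-01-01 21:30:00' as regist_date
    union all
    select 'u3' as user_id, '2024-01-02 10:15:00' as regist_date
)
select
    user_id,
    to_date(regist_date)                                    as regist_date,
    count(user_id) over (partition by to_date(regist_date)) as regist_count
from sample_regist;
```

u1 and u2 both come back with regist_date 2024-01-01 and regist_count 2, while u3 comes back with 2024-01-02 and regist_count 1. Unlike a group by, the window function keeps one row per user, which is what the join in the next step needs.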

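Step 2: Use the registration table as the driving (left) table and join the activity table on user_id. **The relationship is one-to-many: each registered user fans out into one row per day on which that user was active.** This is a keyed join rather than a true Cartesian product. Note that a user may be active several times on the same day, so the activity table must first be deduplicated to one row per user per day.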

```sql
select
    t1.regist_date,
    -- the two user_id columns are aliased so the output has no duplicate column names
    t1.user_id                               as regist_user_id,
    t1.regist_count,
    t2.user_id                               as active_user_id,
    t2.active_date,
    datediff(t2.active_date, t1.regist_date) as diff
from (select
          user_id,
          to_date(regist_date)                                    as regist_date,
          count(user_id) over (partition by to_date(regist_date)) as regist_count
      from user_regist
      where dt >= date_sub(current_date(), 180)) t1
  left join
     -- group by deduplicates activity to one row per user per day
     (select
          user_id,
          to_date(active_date) as active_date
      from user_active
      where dt >= date_sub(current_date(), 180)
      group by user_id, to_date(active_date)) t2
  on t1.user_id = t2.user_id;
```
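The fan-out produced by this join is easier to see on a tiny example. In the sketch below, the CTEs t1 and t2 are hypothetical stand-ins for the two subqueries above, holding already cleaned dates.

```sql
-- Hypothetical data: u1 was active on two later days, u2 was never active.
with t1 as (
    select 'u1' as user_id, '2024-01-01' as regist_date
    union all
    select 'u2' as user_id, '2024-01-01' as regist_date
),
t2 as (
    select 'u1' as user_id, '2024-01-02' as active_date
    union all
    select 'u1' as user_id, '2024-01-04' as active_date
)
select
    t1.user_id,
    t1.regist_date,
    t2.active_date,
    datediff(t2.active_date, t1.regist_date) as diff
from t1
left join t2
    on t1.user_id = t2.user_id;
```

u1 appears twice, with diff 1 and diff 3. u2 appears once, with active_date and diff both NULL, because the left join keeps registered users that have no activity.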

Step 3: Group by registration date and retention period (in days), and count the active users in each period.

```sql
select
    t1.regist_date,
    max(t1.regist_count)                     as regis_cnt,
    datediff(t2.active_date, t1.regist_date) as diff,
    count(t2.user_id)                        as active_user_cnt
from (select
          user_id,
          to_date(regist_date)                                    as regist_date,
          count(user_id) over (partition by to_date(regist_date)) as regist_count
      from user_regist
      where dt >= date_sub(current_date(), 180)) t1
         left join
     (select
          user_id,
          to_date(active_date) as active_date
      from user_active
      where dt >= date_sub(current_date(), 180)
      group by user_id, to_date(active_date)) t2
     on t1.user_id = t2.user_id
-- keep day 1 to day 180 only, which also drops users with no activity (diff is NULL)
where datediff(t2.active_date, t1.regist_date) >= 1
  and datediff(t2.active_date, t1.regist_date) <= 180
group by t1.regist_date, datediff(t2.active_date, t1.regist_date);
```
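As a worked example of this aggregation, the sketch below runs the same grouping on hypothetical inline data. The CTE names regist, active, and t1 and all sample rows are assumptions for illustration.

```sql
-- Hypothetical data: u4 registers but is never active,
-- and u3 is active on the registration day itself (diff = 0).
with regist as (
    select 'u1' as user_id, '2024-01-01' as regist_date
    union all
    select 'u2' as user_id, '2024-01-01' as regist_date
    union all
    select 'u3' as user_id, '2024-01-02' as regist_date
    union all
    select 'u4' as user_id, '2024-01-02' as regist_date
),
active as (
    select 'u1' as user_id, '2024-01-02' as active_date
    union all
    select 'u2' as user_id, '2024-01-02' as active_date
    union all
    select 'u1' as user_id, '2024-01-04' as active_date
    union all
    select 'u3' as user_id, '2024-01-02' as active_date
    union all
    select 'u3' as user_id, '2024-01-03' as active_date
),
t1 as (
    select
        user_id,
        regist_date,
        count(user_id) over (partition by regist_date) as regist_count
    from regist
)
select
    t1.regist_date,
    max(t1.regist_count)                     as regis_cnt,
    datediff(t2.active_date, t1.regist_date) as diff,
    count(t2.user_id)                        as active_user_cnt
from t1
left join active t2
    on t1.user_id = t2.user_id
where datediff(t2.active_date, t1.regist_date) >= 1
  and datediff(t2.active_date, t1.regist_date) <= 180
group by t1.regist_date, datediff(t2.active_date, t1.regist_date);
```

Three rows come back, in no guaranteed order: (2024-01-01, regis_cnt 2, diff 1, active_user_cnt 2), (2024-01-01, 2, 3, 1), and (2024-01-02, 2, 1, 1). The day-0 activity of u3 and the unmatched user u4 are both removed by the where clause. u4 still counts toward regis_cnt for 2024-01-02, because the window count was computed before the filter ran.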

Step 4: Compute the retention rate retention_rate.

```sql
-- Retention per registration date and day offset (diff = 1..180).
select
    regist_date,
    diff,
    active_user_cnt,
    case
        when nvl(regis_cnt, 0) != 0 then active_user_cnt / regis_cnt
        end as retention_rate
from (
         select
             t1.regist_date,
             max(t1.regist_count)                     as regis_cnt,
             datediff(t2.active_date, t1.regist_date) as diff,
             count(t2.user_id)                        as active_user_cnt
         -- t1: one row per registered user, carrying that day's registration count
         from (select
                   user_id,
                   to_date(regist_date)                                    as regist_date,
                   count(user_id) over (partition by to_date(regist_date)) as regist_count
               from user_regist
               where dt >= date_sub(current_date(), 180)) t1
                  left join
              -- t2: activity deduplicated to one row per user per day
              (select
                   user_id,
                   to_date(active_date) as active_date
               from user_active
               where dt >= date_sub(current_date(), 180)
               group by user_id, to_date(active_date)) t2
              on t1.user_id = t2.user_id
         -- this filter on t2 drops unmatched rows, so the left join acts as an inner join
         where datediff(t2.active_date, t1.regist_date) >= 1
           and datediff(t2.active_date, t1.regist_date) <= 180
         group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
     ) t3
order by regist_date,
         diff;
```
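If the result is to be persisted, one possible target table is sketched below. The table name user_retention_180d, the column types, and the dt snapshot partition are assumptions for illustration. Only the four column names are taken from the step-four query.

```sql
-- Hypothetical target table for the step-four result (long format:
-- one row per registration date and day offset).
create table if not exists user_retention_180d (
    regist_date     string comment 'registration date, yyyy-MM-dd',
    diff            int    comment 'days since registration, 1 to 180',
    active_user_cnt bigint comment 'users registered on regist_date and active diff days later',
    retention_rate  double comment 'active_user_cnt divided by registrations on regist_date'
)
partitioned by (dt string comment 'snapshot date, yyyy-MM-dd');
```

The step-four select, minus its order by, could then feed an insert overwrite into one dt partition per run. A long format like this holds at most 180 rows per registration date and avoids a 180-column wide table.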

2 Summary

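A left join from the registration table to the deduplicated activity table, which fans each user out into one row per active day, is enough to build the registration and active-retention table for the last 180 days.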
