HiveSQL: Designing a Registration and Active-Retention Table for the Last 180 Days

0 Problem Description

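We have a user activity table user_active(user_id, active_date) and a user registration table user_regist(user_id, regist_date). Both tables are partitioned by dt (format yyyy-MM-dd), and both identify users by user_id. The task is to design a registration and active-retention table covering day 1 through day 180 after registration. Judging from the final query in section 1, the result has four columns: regist_date, diff (the number of days between registration and activity), active_user_cnt and retention_rate.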
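As a rough sketch of that structure, the result table could be declared as shown here. Only the four column names come from the final query in section 1; the table name user_regist_retention_180d, the column types, and the ORC storage format are assumptions made for illustration.

```sql
-- Hypothetical DDL: table name, column types and storage format are assumptions.
create table if not exists user_regist_retention_180d (
    regist_date     string comment 'registration date, yyyy-MM-dd',
    diff            int    comment 'days between registration and activity, 1 to 180',
    active_user_cnt bigint comment 'users registered on regist_date who were active diff days later',
    retention_rate  double comment 'active_user_cnt divided by the number of users registered on regist_date'
)
comment 'registration and active retention for the last 180 days'
stored as orc;
```

Keeping one row per (regist_date, diff) pair, rather than 180 retention columns, lets the same table serve any retention window by filtering on diff.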

1 Data Analysis

The complete code is as follows:

```sql
select
    regist_date,
    diff,                                   -- days since registration, 1 to 180
    active_user_cnt,
    -- guard against division by zero
    case
        when nvl(regis_cnt, 0) != 0 then active_user_cnt / regis_cnt
        end as retention_rate
from (
         select
             t1.regist_date,
             max(t1.regist_count)                     as regis_cnt,
             datediff(t2.active_date, t1.regist_date) as diff,
             count(t2.user_id)                        as active_user_cnt
         from (-- registrations in the last 180 days, with the daily registration count
               select
                   user_id,
                   to_date(regist_date)                                    as regist_date,
                   count(user_id) over (partition by to_date(regist_date)) as regist_count
               from user_regist
               where dt >= date_sub(current_date(), 180)) t1
                  left join
              (-- activity in the last 180 days, deduplicated to one row per user per day
               select
                   user_id,
                   to_date(active_date) as active_date
               from user_active
               where dt >= date_sub(current_date(), 180)
               group by user_id, to_date(active_date)) t2
              on t1.user_id = t2.user_id
         -- filtering on t2 columns discards unmatched rows,
         -- so the left join effectively behaves like an inner join here
         where datediff(t2.active_date, t1.regist_date) >= 1
           and datediff(t2.active_date, t1.regist_date) <= 180
         group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
     ) t3
order by regist_date,
         diff;
```

The code above breaks down into the following steps:

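Step 1: From the registration table, derive each user's registration date regist_date and the number of users who registered on that date, regist_count.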

```sql
select
    user_id,
    to_date(regist_date)                                    as regist_date,
    -- window count: every row carries the total registrations of its date
    count(user_id) over (partition by to_date(regist_date)) as regist_count
from user_regist
where dt >= date_sub(current_date(), 180);
```

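Step 2: Use the registration table as the left (driving) table and join the activity table to it on user_id. **This is a one-to-many relationship: each registration row is paired with every active-day row of the same user, a per-user fan-out rather than a full Cartesian product.** Note that a user may appear in the activity table several times on the same day, so the activity rows must first be deduplicated to one row per user per day.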

```sql
select
    t1.regist_date,
    t1.user_id,
    t1.regist_count,
    t2.user_id                               as active_user_id,  -- aliased to avoid two columns named user_id
    t2.active_date,
    datediff(t2.active_date, t1.regist_date) as diff
from (select
          user_id,
          to_date(regist_date)                                    as regist_date,
          count(user_id) over (partition by to_date(regist_date)) as regist_count
      from user_regist
      where dt >= date_sub(current_date(), 180)) t1
  left join
     (-- group by deduplicates to one row per user per active day
      select
          user_id,
          to_date(active_date) as active_date
      from user_active
      where dt >= date_sub(current_date(), 180)
      group by user_id, to_date(active_date)) t2
  on t1.user_id = t2.user_id;
```
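The deduplication of the activity table can also be expressed with select distinct, which some readers find easier to scan than a group by with no aggregates. The following is a minimal sketch of that equivalent form for the t2 subquery, using the same tables and columns as above.

```sql
-- Equivalent deduplication of user_active: one row per user per active day.
select distinct
    user_id,
    to_date(active_date) as active_date
from user_active
where dt >= date_sub(current_date(), 180);
```

Either form yields the same set of (user_id, active_date) pairs, so count(t2.user_id) in the next step counts each user at most once per day.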

Step 3: Group by registration date and retention period (measured in days), and count the active users for each retention period.

```sql
select
    t1.regist_date,
    max(t1.regist_count)                     as regis_cnt,       -- constant within a regist_date, so max() just picks it
    datediff(t2.active_date, t1.regist_date) as diff,
    count(t2.user_id)                        as active_user_cnt  -- t2 is deduplicated, so this counts users
from (select
          user_id,
          to_date(regist_date)                                    as regist_date,
          count(user_id) over (partition by to_date(regist_date)) as regist_count
      from user_regist
      where dt >= date_sub(current_date(), 180)) t1
         left join
     (select
          user_id,
          to_date(active_date) as active_date
      from user_active
      where dt >= date_sub(current_date(), 180)
      group by user_id, to_date(active_date)) t2
     on t1.user_id = t2.user_id
-- keep only activity 1 to 180 days after registration; this also drops unmatched rows
where datediff(t2.active_date, t1.regist_date) >= 1
  and datediff(t2.active_date, t1.regist_date) <= 180
group by t1.regist_date, datediff(t2.active_date, t1.regist_date);
```
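Because the where clause filters on t2 columns, registered users with no activity in the 1 to 180 day range are dropped, and a registration date with no retained users produces no rows at all. If those dates should still appear, one possible variant moves the range condition into the join condition. This is a sketch rather than the author's method, and it assumes Hive 2.2.0 or later, where non-equality expressions are allowed in the ON clause.

```sql
-- Variant (assumes Hive >= 2.2.0): the range filter lives in ON,
-- so unmatched registration rows survive the left join.
select
    t1.regist_date,
    max(t1.regist_count)                     as regis_cnt,
    datediff(t2.active_date, t1.regist_date) as diff,            -- null for users with no activity in range
    count(t2.user_id)                        as active_user_cnt  -- count() skips nulls, so those rows add 0
from (select
          user_id,
          to_date(regist_date)                                    as regist_date,
          count(user_id) over (partition by to_date(regist_date)) as regist_count
      from user_regist
      where dt >= date_sub(current_date(), 180)) t1
         left join
     (select
          user_id,
          to_date(active_date) as active_date
      from user_active
      where dt >= date_sub(current_date(), 180)
      group by user_id, to_date(active_date)) t2
     on t1.user_id = t2.user_id
    and datediff(t2.active_date, t1.regist_date) between 1 and 180
group by t1.regist_date, datediff(t2.active_date, t1.regist_date);
```

The trade-off is an extra row with a null diff and a count of 0 for each registration date that has at least one user without in-range activity; downstream consumers would need to filter or interpret that row.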

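Step 4: Compute the retention rate retention_rate by dividing the active user count for each retention period by the number of users registered on that date.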

```sql
select
    regist_date,
    diff,
    active_user_cnt,
    -- retention rate = active users on day diff / users registered on regist_date
    case
        when nvl(regis_cnt, 0) != 0 then active_user_cnt / regis_cnt
        end as retention_rate
from (
         select
             t1.regist_date,
             max(t1.regist_count)                     as regis_cnt,
             datediff(t2.active_date, t1.regist_date) as diff,
             count(t2.user_id)                        as active_user_cnt
         from (select
                   user_id,
                   to_date(regist_date)                                    as regist_date,
                   count(user_id) over (partition by to_date(regist_date)) as regist_count
               from user_regist
               where dt >= date_sub(current_date(), 180)) t1
                  left join
              (select
                   user_id,
                   to_date(active_date) as active_date
               from user_active
               where dt >= date_sub(current_date(), 180)
               group by user_id, to_date(active_date)) t2
              on t1.user_id = t2.user_id
         where datediff(t2.active_date, t1.regist_date) >= 1
           and datediff(t2.active_date, t1.regist_date) <= 180
         group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
     ) t3
order by regist_date,
         diff;
```
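To make the arithmetic concrete, here is a small worked example that runs the same logic over hypothetical inline data instead of the real tables. The CTE names regist and active, the user ids and the dates are all invented for illustration, the dt partition filter is omitted because the data is inline, and the sketch assumes a Hive version that allows select without a from clause.

```sql
-- Hypothetical sample: u1 and u2 both register on 2024-01-01.
with regist as (
    select 'u1' as user_id, '2024-01-01' as regist_date
    union all
    select 'u2' as user_id, '2024-01-01' as regist_date
),
active as (
    select 'u1' as user_id, '2024-01-02' as active_date
    union all
    select 'u1' as user_id, '2024-01-02' as active_date  -- duplicate activity on the same day
    union all
    select 'u1' as user_id, '2024-01-03' as active_date
    union all
    select 'u2' as user_id, '2024-01-03' as active_date
)
select
    regist_date,
    diff,
    active_user_cnt,
    case
        when nvl(regis_cnt, 0) != 0 then active_user_cnt / regis_cnt
        end as retention_rate
from (
         select
             t1.regist_date,
             max(t1.regist_count)                     as regis_cnt,
             datediff(t2.active_date, t1.regist_date) as diff,
             count(t2.user_id)                        as active_user_cnt
         from (select
                   user_id,
                   to_date(regist_date)                                    as regist_date,
                   count(user_id) over (partition by to_date(regist_date)) as regist_count
               from regist) t1
                  left join
              (select
                   user_id,
                   to_date(active_date) as active_date
               from active
               group by user_id, to_date(active_date)) t2
              on t1.user_id = t2.user_id
         where datediff(t2.active_date, t1.regist_date) >= 1
           and datediff(t2.active_date, t1.regist_date) <= 180
         group by t1.regist_date, datediff(t2.active_date, t1.regist_date)
     ) t3
order by regist_date,
         diff;
```

With this data, two users registered on 2024-01-01. On day 1 only u1 is active (the duplicate row is removed by the group by), giving active_user_cnt = 1 and retention_rate = 0.5. On day 2 both users are active, giving active_user_cnt = 2 and retention_rate = 1.0.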

2 Summary

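The 180-day registration and active-retention table is built by left joining the registration table to the deduplicated activity table on user_id, so that each registered user is paired with each of their active days, and then grouping by registration date and day offset to count active users and compute the retention rate.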
