HiveSQL: Counting the Rooms with Guests in Each Time Period

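Note: reference article:

HiveSQL一天一个小技巧:如何统计当前时间点状态情况【辅助变量+累计变换思路】 (CSDN blog). It summarizes an approach for computing the state at the current point in time: build an auxiliary counting variable, then apply a cumulative transformation. Common scenarios include concurrent viewers of a live stream, real-time concurrent connections on a server, passengers on a bus in the current time period, the backlog of goods in a warehouse, and the maximum number of orders being served at the same time within a period. https://blog.csdn.net/godlovedaniel/article/details/129881211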

0 Requirements

1 Data preparation

create table if not exists table23
(
    user_id     int comment 'user id',
    room_num    string comment 'room number',
    in_time     string comment 'check-in time',
    out_time    string comment 'check-out time'
)
    comment 'guest check-in and check-out table';

insert overwrite table table23
values (7, '2004', '2021-03-05','2021-03-07'),
       (23,'2010', '2021-03-05','2021-03-06'),
       (7, '1003', '2021-03-07','2021-03-08'),
       (8, '2014', '2021-03-07','2021-03-08'),
       (14, '3001','2021-03-07','2021-03-10'),
       (18, '3002','2021-03-08','2021-03-10'),
       (23, '3020','2021-03-08','2021-03-09'),
       (25, '2006','2021-03-09','2021-03-12');

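2 Data analysis

Requirement: for each time period, find the number of rooms that have guests staying in them.

If we only consider one guest per room, we can borrow the approach used for counting concurrent viewers of a live-stream room. The related SQL logic is covered in:

HiveSQL题------聚合函数(sum/count/max/min/avg) (CSDN blog) https://blog.csdn.net/SHWAITME/article/details/135918264?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522170804307516800211583058%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=170804307516800211583058&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_ecpm_v1~rank_v31_ecpm-1-135918264-null-null.nonecase&utm_term=%E7%9B%B4%E6%92%AD%E9%97%B4&spm=1018.2226.3001.4450

Tag each check-in time with an auxiliary flag of 1 and each check-out time with -1, sort the events by time, and take the running total. Rows that share the same time get the same running total (the default window frame covers all peers), so grouping by time and acc_cnt leaves one row per time. The SQL is:
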
select
    start_time,
    end_time,
    acc_cnt
from (select
          `time`                               as start_time,
          lead(`time`) over ( order by `time`) as end_time,
          acc_cnt
      from (select
                `time`,
                sum(flag) over (order by `time`) as acc_cnt
            from (
                     select
                         in_time as `time`,
                         1   as flag
                     from table23
                     union all
                     select
                         out_time as `time`,
                         -1   as flag
                     from table23
                 ) t1
           ) t2
      group by `time`, acc_cnt
     ) t
where end_time is not null;
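
To check the query against the sample data, it can help to look at the event stream before deduplication. The block below is a minimal sketch that reuses only table23 and the column names from the query above; the values in the comments were worked out by hand from the eight sample rows.

```sql
-- One row per check-in (+1) and per check-out (-1), with the running total.
select
    `time`,
    flag,
    sum(flag) over (order by `time`) as acc_cnt
from (
         select in_time as `time`, 1 as flag from table23
         union all
         select out_time as `time`, -1 as flag from table23
     ) t1
order by `time`;

-- Running total per date on the sample data:
--   2021-03-05 -> 2, 2021-03-06 -> 1, 2021-03-07 -> 3, 2021-03-08 -> 3,
--   2021-03-09 -> 3, 2021-03-10 -> 1, 2021-03-12 -> 0
-- so the final query returns (start_time, end_time, acc_cnt):
--   (03-05, 03-06, 2), (03-06, 03-07, 1), (03-07, 03-08, 3),
--   (03-08, 03-09, 3), (03-09, 03-10, 3), (03-10, 03-12, 1)
```

The last date (2021-03-12, total 0) has no following event, so its end_time is null and the outer filter drops it.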

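The query above leaves one question open. Suppose several guests share a room: one checks out today, another tomorrow, and the last one the day after. Someone leaves every day, yet the room is occupied the whole time. Should it still count as a room with guests? If so, the accumulated state has to be adjusted: we need to track, for each room, how many guests are in it as of each point in time.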
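
A small test case makes the problem concrete. This is a sketch under assumptions: the table table23_shared, room '5001' and users 31 and 32 are hypothetical and are not part of the original data set.

```sql
-- Hypothetical shared room: two guests check in together and leave on different days.
create table if not exists table23_shared like table23;

insert overwrite table table23_shared
values (31, '5001', '2021-03-05', '2021-03-06'),
       (32, '5001', '2021-03-05', '2021-03-07');

-- Per-guest running total, as in the one-guest-per-room query.
select
    `time`,
    sum(flag) over (order by `time`) as acc_cnt
from (
         select in_time as `time`, 1 as flag from table23_shared
         union all
         select out_time as `time`, -1 as flag from table23_shared
     ) t1;

-- Running total: 2021-03-05 -> 2, 2021-03-06 -> 1, 2021-03-07 -> 0.
-- It counts guests, not rooms: from 03-05 to 03-06 it reports 2,
-- although only one room ('5001') is occupied.
```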

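Step 1: for each room, compute the running total of guests as of each time. This value is the auxiliary condition used to judge the room's state.
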
select
    `time`,
    room_num,
    sum( user_cnt) over (partition by room_num order by `time`) user_cnt
from (
         select
             in_time as  `time`,
             room_num,
             count(user_id) user_cnt
         from table23
         group by in_time, room_num
         union all
         select
             out_time as   `time`,
             room_num,
             -1 * count(user_id) user_cnt
         from table23
         group by out_time, room_num
     ) t1;

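Step 2: flag each row according to how the room's state changes. When a room goes from empty to occupied (previous count 0, current count > 0), flag it 1; when it goes from occupied to empty, flag it -1; otherwise flag it 0. Looking only at user_cnt > 0 is not enough: a shared room would then contribute 1 on every event while it stays occupied, and the running total in the next step would count it several times. The previous count is fetched with lag:

select
    `time`,
    room_num,
    user_cnt,
    case when prev_cnt <= 0 and user_cnt > 0 then 1    -- room becomes occupied
         when prev_cnt > 0 and user_cnt <= 0 then -1   -- room becomes empty
         else 0 end flag                               -- state unchanged
from (select
          `time`,
          room_num,
          user_cnt,
          -- guest count on the room's previous row; 0 before its first event
          coalesce(lag(user_cnt, 1) over (partition by room_num order by `time`), 0) prev_cnt
      from (select
                `time`,
                room_num,
                sum(user_cnt) over (partition by room_num order by `time`) user_cnt
            from (
                     select
                         in_time as `time`,
                         room_num,
                         count(user_id) user_cnt
                     from table23
                     group by in_time, room_num
                     union all
                     select
                         out_time as `time`,
                         room_num,
                         -1 * count(user_id) user_cnt
                     from table23
                     group by out_time, room_num
                 ) t1
           ) t2
     ) t3;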

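Step 3: on top of Step 2, compute acc_cnt, the number of occupied rooms as of the current point in time, as the running total of the flag. The flag is 1 when a room becomes occupied, -1 when it becomes empty, and 0 otherwise. The SQL is:

select
    `time`,
    room_num,
    user_cnt,
    case when prev_cnt <= 0 and user_cnt > 0 then 1
         when prev_cnt > 0 and user_cnt <= 0 then -1
         else 0 end flag,
    sum(case when prev_cnt <= 0 and user_cnt > 0 then 1
             when prev_cnt > 0 and user_cnt <= 0 then -1
             else 0 end) over (order by `time`) acc_cnt
from (select
          `time`,
          room_num,
          user_cnt,
          -- guest count on the room's previous row; 0 before its first event
          coalesce(lag(user_cnt, 1) over (partition by room_num order by `time`), 0) prev_cnt
      from (select
                `time`,
                room_num,
                sum(user_cnt) over (partition by room_num order by `time`) user_cnt
            from (
                     select
                         in_time as `time`,
                         room_num,
                         count(user_id) user_cnt
                     from table23
                     group by in_time, room_num
                     union all
                     select
                         out_time as `time`,
                         room_num,
                         -1 * count(user_id) user_cnt
                     from table23
                     group by out_time, room_num
                 ) t1
           ) t2
     ) t3;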

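Step 4: on top of Step 3, deduplicate on the two fields time and acc_cnt (the number of occupied rooms as of that time). The flag is 1 when a room becomes occupied, -1 when it becomes empty, and 0 otherwise. The SQL is:

select
    `time`,
    acc_cnt
from (
         select
             `time`,
             room_num,
             user_cnt,
             case when prev_cnt <= 0 and user_cnt > 0 then 1
                  when prev_cnt > 0 and user_cnt <= 0 then -1
                  else 0 end flag,
             sum(case when prev_cnt <= 0 and user_cnt > 0 then 1
                      when prev_cnt > 0 and user_cnt <= 0 then -1
                      else 0 end) over (order by `time`) acc_cnt
         from (select
                   `time`,
                   room_num,
                   user_cnt,
                   -- guest count on the room's previous row; 0 before its first event
                   coalesce(lag(user_cnt, 1) over (partition by room_num order by `time`), 0) prev_cnt
               from (select
                         `time`,
                         room_num,
                         sum(user_cnt) over (partition by room_num order by `time`) user_cnt
                     from (
                              select
                                  in_time as `time`,
                                  room_num,
                                  count(user_id) user_cnt
                              from table23
                              group by in_time, room_num
                              union all
                              select
                                  out_time as `time`,
                                  room_num,
                                  -1 * count(user_id) user_cnt
                              from table23
                              group by out_time, room_num
                          ) t1
                    ) t2
              ) t3
     ) t4
group by `time`, acc_cnt;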

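Step 5: on top of Step 4, use the lead function (shifting the time field forward by one row) to get end_time, the end of the period that each row starts. The flag is 1 when a room becomes occupied, -1 when it becomes empty, and 0 otherwise. The SQL is:

select
    `time` as start_time,
    lead(`time`, 1) over (order by `time`) as end_time,
    acc_cnt
from (
         select
             `time`,
             acc_cnt
         from (
                  select
                      `time`,
                      room_num,
                      user_cnt,
                      case when prev_cnt <= 0 and user_cnt > 0 then 1
                           when prev_cnt > 0 and user_cnt <= 0 then -1
                           else 0 end flag,
                      sum(case when prev_cnt <= 0 and user_cnt > 0 then 1
                               when prev_cnt > 0 and user_cnt <= 0 then -1
                               else 0 end) over (order by `time`) acc_cnt
                  from (select
                            `time`,
                            room_num,
                            user_cnt,
                            -- guest count on the room's previous row; 0 before its first event
                            coalesce(lag(user_cnt, 1) over (partition by room_num order by `time`), 0) prev_cnt
                        from (select
                                  `time`,
                                  room_num,
                                  sum(user_cnt) over (partition by room_num order by `time`) user_cnt
                              from (
                                       select
                                           in_time as `time`,
                                           room_num,
                                           count(user_id) user_cnt
                                       from table23
                                       group by in_time, room_num
                                       union all
                                       select
                                           out_time as `time`,
                                           room_num,
                                           -1 * count(user_id) user_cnt
                                       from table23
                                       group by out_time, room_num
                                   ) t1
                             ) t2
                       ) t3
              ) t4
         group by `time`, acc_cnt
     ) t5;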

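Step 6: on top of Step 5, filter out the rows whose end_time is null. The flag is 1 when a room becomes occupied, -1 when it becomes empty, and 0 otherwise. The SQL is:

select
    start_time,
    end_time,
    acc_cnt
from (
         select
             `time` as start_time,
             lead(`time`, 1) over (order by `time`) as end_time,
             acc_cnt
         from (
                  select
                      `time`,
                      acc_cnt
                  from (
                           select
                               `time`,
                               room_num,
                               user_cnt,
                               case when prev_cnt <= 0 and user_cnt > 0 then 1
                                    when prev_cnt > 0 and user_cnt <= 0 then -1
                                    else 0 end as flag,
                               sum(case when prev_cnt <= 0 and user_cnt > 0 then 1
                                        when prev_cnt > 0 and user_cnt <= 0 then -1
                                        else 0 end) over (order by `time`) acc_cnt
                           from (select
                                     `time`,
                                     room_num,
                                     user_cnt,
                                     -- guest count on the room's previous row; 0 before its first event
                                     coalesce(lag(user_cnt, 1) over (partition by room_num order by `time`), 0) prev_cnt
                                 from (select
                                           `time`,
                                           room_num,
                                           sum(user_cnt) over (partition by room_num order by `time`) user_cnt
                                       from (
                                                select
                                                    in_time as `time`,
                                                    room_num,
                                                    count(user_id) user_cnt
                                                from table23
                                                group by in_time, room_num
                                                union all
                                                select
                                                    out_time as `time`,
                                                    room_num,
                                                    -1 * count(user_id) user_cnt
                                                from table23
                                                group by out_time, room_num
                                            ) t1
                                      ) t2
                                ) t3
                       ) t4
                  group by `time`, acc_cnt
              ) t5
     ) t6
where end_time is not null;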
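
The deep nesting makes the pipeline hard to read, so here is a flatter sketch of the same idea written with common table expressions. It assumes a Hive version that supports WITH. The CTE names (events, room_state, room_flag, occupied) and the columns delta, prev_cnt and delta_rooms are my own illustrative names, not from the original queries. The flag marks only the moments when a room switches between empty and occupied, so a shared room is counted once however many guests come and go.

```sql
with events as (
    -- net change in guest count per room and time (+1 per check-in, -1 per check-out)
    select `time`, room_num, sum(delta) as delta
    from (
             select in_time as `time`, room_num, 1 as delta from table23
             union all
             select out_time as `time`, room_num, -1 as delta from table23
         ) e
    group by `time`, room_num
),
room_state as (
    -- guests in each room as of each time
    select `time`, room_num,
           sum(delta) over (partition by room_num order by `time`) as user_cnt
    from events
),
room_flag as (
    -- +1 when a room becomes occupied, -1 when it becomes empty, 0 otherwise
    select `time`,
           case when prev_cnt <= 0 and user_cnt > 0 then 1
                when prev_cnt > 0 and user_cnt <= 0 then -1
                else 0 end as flag
    from (
             select `time`, user_cnt,
                    coalesce(lag(user_cnt, 1) over (partition by room_num order by `time`), 0) as prev_cnt
             from room_state
         ) s
),
occupied as (
    -- net change in the number of occupied rooms at each time
    select `time`, sum(flag) as delta_rooms
    from room_flag
    group by `time`
)
select start_time, end_time, acc_cnt
from (
         select `time` as start_time,
                lead(`time`, 1) over (order by `time`) as end_time,
                sum(delta_rooms) over (order by `time`) as acc_cnt
         from occupied
     ) t
where end_time is not null;
```

On the sample data, where every room has exactly one guest, this should return the same six periods as the one-guest-per-room query: counts 2, 1, 3, 3, 3, 1 for the periods starting 2021-03-05 through 2021-03-10.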

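3 Summary

Problems such as [concurrent live-stream viewers in each time period] and [rooms with guests in each time period] are, in essence, statistics on the state as of the current point in time. The usual approach is to attach a flag to each point in time where the state changes, then run a window calculation (with window functions) or an aggregation over that flag.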
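
As an example of the aggregation variant, the same flags can answer "what was the peak?" instead of "what is the value in each period?". A minimal sketch on table23, assuming one guest per room; peak_cnt is an illustrative column name.

```sql
-- Peak number of guests staying at the same time:
-- take the running total of the flags, then aggregate with max.
select
    max(acc_cnt) as peak_cnt
from (
         select
             sum(flag) over (order by `time`) as acc_cnt
         from (
                  select in_time as `time`, 1 as flag from table23
                  union all
                  select out_time as `time`, -1 as flag from table23
              ) t1
     ) t2;
-- 3 on the sample data (reached from 2021-03-07 until 2021-03-10)
```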
