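Overview
While practicing SQL problems recently I ran into an interesting variant. We have looked at consecutive-value problems several times before; in this variant, values separated by a specified gap still count as one "consecutive" run. For plain consecutive-value problems we used the arithmetic-sequence idea. Does that idea still work after this change? No. The arithmetic-sequence trick relies on the date minus the row number staying constant within a run, which no longer holds once a run may skip days. Consecutive-value problems with an allowed gap should be solved with the grouping idea instead.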
Problem: Consecutive Logins with Gaps
A game company records its users' daily logins.
Table logs
Column | Data type |
---|---|
id | bigint |
dt | date |
The input data is as follows:
id | dt |
---|---|
1001 | 2021/12/12 |
1002 | 2021/12/12 |
1001 | 2021/12/13 |
1001 | 2021/12/14 |
1001 | 2021/12/16 |
1002 | 2021/12/16 |
1001 | 2021/12/19 |
1002 | 2021/12/17 |
1001 | 2021/12/20 |
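Compute each user's longest run of consecutive login days, where a gap of one day is allowed. For example, a user who logs in on days 1, 3, 5 and 6 is treated as having logged in for 6 consecutive days.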
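To try the queries that follow, the sample data can be loaded with a short setup script. This is a minimal sketch rather than part of the original exercise: it assumes an engine that accepts standard `CREATE TABLE` and multi-row `INSERT` syntax (MySQL 8 is assumed here) and ISO-formatted date literals.

```sql
-- Assumption: MySQL 8-style syntax; adjust types and literals for other engines.
CREATE TABLE logs (
    id BIGINT,
    dt DATE
);

-- Same nine rows as the input table, in the same order.
INSERT INTO logs (id, dt) VALUES
    (1001, '2021-12-12'),
    (1002, '2021-12-12'),
    (1001, '2021-12-13'),
    (1001, '2021-12-14'),
    (1001, '2021-12-16'),
    (1002, '2021-12-16'),
    (1001, '2021-12-19'),
    (1002, '2021-12-17'),
    (1001, '2021-12-20');
```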
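Solution
Step 1: Shift the dates down one row
```sql
-- Pair each login date with the same user's previous login date.
-- A user's first row has no predecessor, so it falls back to '1970-01-01'.
WITH temp_001 AS (
    SELECT id
          ,dt
          ,LAG(dt,1,'1970-01-01') OVER (PARTITION BY id ORDER BY dt ASC) AS prev_dt
    FROM logs
)
```
Output: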
id | dt | prev_dt |
---|---|---|
1001 | 2021/12/12 | 1970/01/01 |
1001 | 2021/12/13 | 2021/12/12 |
1001 | 2021/12/14 | 2021/12/13 |
1001 | 2021/12/16 | 2021/12/14 |
1001 | 2021/12/19 | 2021/12/16 |
1001 | 2021/12/20 | 2021/12/19 |
1002 | 2021/12/12 | 1970/01/01 |
1002 | 2021/12/16 | 2021/12/12 |
1002 | 2021/12/17 | 2021/12/16 |
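Step 2: Assign group numbers with a running window sum
Compute the difference between the current date and the previous date. If the difference is at most 2 (that is, at most one missing day in between), the row belongs to the same group and the group number stays unchanged. If the difference is greater than 2, the row starts a new group and the group number increases by 1. Because each user's first row is compared with the default 1970-01-01, its difference is always greater than 2, so group numbers start at 1.
SQL implementation:
```sql
-- Flag every row that starts a new group, then take a running sum of the flags.
temp_002 AS (
    SELECT id
          ,dt
          ,SUM(IF(DATEDIFF(dt,prev_dt)>2,1,0)) OVER(PARTITION BY id ORDER BY dt ASC) AS group_id
    FROM temp_001
)
```
Output: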
id | dt | group_id |
---|---|---|
1001 | 2021/12/12 | 1 |
1001 | 2021/12/13 | 1 |
1001 | 2021/12/14 | 1 |
1001 | 2021/12/16 | 1 |
1001 | 2021/12/19 | 2 |
1001 | 2021/12/20 | 2 |
1002 | 2021/12/12 | 1 |
1002 | 2021/12/16 | 2 |
1002 | 2021/12/17 | 2 |
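To see why the running sum produces group numbers, it can help to look at the per-row flag before it is summed. The query below is only a sketch. It assumes the same `IF` and two-argument `DATEDIFF` functions used in the rest of this post (available in MySQL 8 and Hive), and the column names `gap_days` and `is_new_group` are illustrative, not part of the original solution.

```sql
-- Show the gap to the previous login and the "new group" flag for every row.
WITH temp_001 AS (
    SELECT id
          ,dt
          ,LAG(dt,1,'1970-01-01') OVER (PARTITION BY id ORDER BY dt ASC) AS prev_dt
    FROM logs
)
SELECT id
      ,dt
      ,prev_dt
      ,DATEDIFF(dt,prev_dt) AS gap_days                    -- days since the previous login
      ,IF(DATEDIFF(dt,prev_dt)>2,1,0) AS is_new_group      -- 1 when the gap is too large
FROM temp_001
ORDER BY id, dt;
```

For user 1002 the flags are 1, 1, 0 in date order, so the running sum gives group numbers 1, 2, 2, matching the table above.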
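Step 3: Group by id and group_id to find the earliest and latest date in each group
```sql
-- One row per (id, group_id) with the first and last login date of that group.
temp_003 AS (
    SELECT id
          ,group_id
          ,MIN(dt) AS min_dt
          ,MAX(dt) AS max_dt
    FROM temp_002
    GROUP BY id,group_id
)
```
Output: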
id | group_id | min_dt | max_dt |
---|---|---|---|
1001 | 1 | 2021/12/12 | 2021/12/16 |
1001 | 2 | 2021/12/19 | 2021/12/20 |
1002 | 1 | 2021/12/12 | 2021/12/12 |
1002 | 2 | 2021/12/16 | 2021/12/17 |
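Step 4: Subtract the earliest date from the latest date and add 1 to get the number of consecutive days each group spans
```sql
-- The +1 makes the span inclusive of both the first and the last day.
temp_004 AS (
    SELECT id
          ,group_id
          ,min_dt
          ,max_dt
          ,DATEDIFF(max_dt,min_dt) + 1 AS days
    FROM temp_003
)
```
Output: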
id | group_id | min_dt | max_dt | days |
---|---|---|---|---|
1001 | 1 | 2021/12/12 | 2021/12/16 | 5 |
1001 | 2 | 2021/12/19 | 2021/12/20 | 2 |
1002 | 1 | 2021/12/12 | 2021/12/12 | 1 |
1002 | 2 | 2021/12/16 | 2021/12/17 | 2 |
Step 5: Take the largest days value for each id to get the final result
```sql
SELECT id, MAX(days) AS days FROM temp_004 GROUP BY id;
```
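Complete SQL
```sql
-- Step 1: pair each login date with the same user's previous login date.
WITH temp_001 AS (
    SELECT id
          ,dt
          ,LAG(dt,1,'1970-01-01') OVER (PARTITION BY id ORDER BY dt ASC) AS prev_dt
    FROM logs
)
-- Step 2: a gap of more than 2 days starts a new group; the running sum numbers the groups.
,temp_002 AS (
    SELECT id
          ,dt
          ,SUM(IF(DATEDIFF(dt,prev_dt)>2,1,0)) OVER(PARTITION BY id ORDER BY dt ASC) AS group_id
    FROM temp_001
)
-- Step 3: first and last login date of each group.
,temp_003 AS (
    SELECT id
          ,group_id
          ,MIN(dt) AS min_dt
          ,MAX(dt) AS max_dt
    FROM temp_002
    GROUP BY id,group_id
)
-- Step 4: inclusive number of days each group spans.
,temp_004 AS (
    SELECT id
          ,group_id
          ,min_dt
          ,max_dt
          ,DATEDIFF(max_dt,min_dt) + 1 AS days
    FROM temp_003
)
-- Step 5: longest span per user.
SELECT id, MAX(days) AS days FROM temp_004 GROUP BY id;
```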
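The same grouping idea can be written so that the allowed gap is a parameter instead of the hard-coded threshold 2. The variant below is a sketch under stated assumptions, not the original solution. The `params` CTE and its `max_gap` column are hypothetical names introduced here, a `NULL` check replaces the `'1970-01-01'` default, and the `IF` and two-argument `DATEDIFF` functions are assumed to behave as in MySQL 8 or Hive.

```sql
WITH params AS (
    SELECT 1 AS max_gap    -- hypothetical parameter: number of missing days allowed inside a run
)
,with_prev AS (
    SELECT id
          ,dt
          ,LAG(dt) OVER (PARTITION BY id ORDER BY dt ASC) AS prev_dt   -- NULL on each user's first row
    FROM logs
)
,grouped AS (
    -- A row starts a new group if it is the user's first row
    -- or if more than max_gap days are missing since the previous login.
    SELECT w.id
          ,w.dt
          ,SUM(IF(w.prev_dt IS NULL OR DATEDIFF(w.dt,w.prev_dt) > p.max_gap + 1,1,0))
               OVER (PARTITION BY w.id ORDER BY w.dt ASC) AS group_id
    FROM with_prev w
    CROSS JOIN params p
)
,spans AS (
    -- Inclusive number of days covered by each group.
    SELECT id
          ,DATEDIFF(MAX(dt),MIN(dt)) + 1 AS days
    FROM grouped
    GROUP BY id,group_id
)
SELECT id, MAX(days) AS days
FROM spans
GROUP BY id;
```

With `max_gap` set to 1 this should reproduce the result of the complete SQL on the sample data: 5 days for user 1001 and 2 days for user 1002. Setting `max_gap` to 0 reduces the query to the strict consecutive-days problem.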