
问题分析
- 找出2021年至少有两天作答的用户
- 计算每个用户连续两次作答的最大时间窗
- 基于历史数据预测在这个时间窗内平均会做多少套试卷
版本1
sql
with
-- 功能:筛选2021年至少有两天作答的用户及其作答记录
-- 子查询找出2021年至少有两天作答的用户ID(count(distinct date(start_time)) >= 2)
-- 主查询获取这些用户在2021年的所有作答记录
-- 结果包含三列:uid(用户ID)、exam_id(试卷ID)、start_time(作答日期)
user_with_multiple_days as (
select
uid,
exam_id,
date(start_time) as start_time
from
exam_record
where
uid in (
select
uid
from
exam_record
where
year(start_time) = 2021
group by
uid
having
count(distinct date(start_time)) >= 2
)
and year(start_time) = 2021
),
-- 功能:计算每个用户的日均作答量
-- 通过自连接找出每个用户的所有作答日期组合(u1.start_time < u2.start_time)
-- 计算每个用户的总作答次数(子查询中的count(start_time))
-- 计算最大日期跨度(max(datediff(u2.start_time, u1.start_time) + 1))
-- 用总作答次数除以最大日期跨度得到日均作答量(max_avg)
max_avg_temp as (
select
u1.uid,
(
select
count(start_time)
from
user_with_multiple_days
where
uid = u1.uid
) / max(datediff(u2.start_time, u1.start_time) + 1) as max_avg
from
user_with_multiple_days u1
join user_with_multiple_days u2 on u1.uid = u2.uid
and u1.start_time < u2.start_time
group by
u1.uid
),
-- 功能:计算每个用户连续两次作答的最大时间窗
-- 使用窗口函数lag()获取每个用户的上一次作答日期
-- datediff(start_time, lag(start_time)) + 1计算相邻两次作答的时间窗
-- +1是因为间隔2天实际上是3天的窗口(如1号和3号是3-1=2天,但实际窗口是3天)
-- max(days_window)找出每个用户的最大时间窗
max_windows_temp as (
select
uid,
max(days_window) as days_window
from
(
select
uid,
start_time,
datediff(
start_time,
lag(start_time) over (
partition by
uid
order by
start_time asc
)
) + 1 as days_window
from
user_with_multiple_days
) as t1
group by
uid
)
select
uid,
days_window,
round(days_window * max_avg, 2) as avg_exam_cnt
from
max_avg_temp
join max_windows_temp using (uid)
order by
days_window desc,
avg_exam_cnt desc
1. 第一个CTE:user_with_multiple_days
sql
user_with_multiple_days as (
select
uid,
exam_id,
date(start_time) as start_time
from
exam_record
where
uid in (
select
uid
from
exam_record
where
year(start_time) = 2021
group by
uid
having
count(distinct date(start_time)) >= 2
)
and year(start_time) = 2021
)
功能:筛选2021年至少有两天作答的用户及其作答记录
详细说明:
- 子查询找出2021年至少有两天作答的用户ID(
count(distinct date(start_time)) >= 2
) - 主查询获取这些用户在2021年的所有作答记录
- 结果包含三列:
uid
(用户ID)、exam_id
(试卷ID)、start_time
(作答日期)
2. 第二个CTE:max_avg_temp
sql
max_avg_temp as (
select
u1.uid,
(
select
count(start_time)
from
user_with_multiple_days
where
uid = u1.uid
) / max(datediff(u2.start_time, u1.start_time) + 1) as max_avg
from
user_with_multiple_days u1
join user_with_multiple_days u2 on u1.uid = u2.uid
and u1.start_time < u2.start_time
group by
u1.uid
)
功能:计算每个用户的日均作答量
详细说明:
- 通过自连接找出每个用户的所有作答日期组合(
u1.start_time < u2.start_time
) - 计算每个用户的总作答次数(子查询中的
count(start_time)
) - 计算最大日期跨度(
max(datediff(u2.start_time, u1.start_time) + 1)
) - 用总作答次数除以最大日期跨度得到日均作答量(
max_avg
)
3. 第三个CTE:max_windows_temp
sql
max_windows_temp as (
select
uid,
max(days_window) as days_window
from
(
select
uid,
start_time,
datediff(
start_time,
lag(start_time) over (
partition by
uid
order by
start_time asc
)
) + 1 as days_window
from
user_with_multiple_days
) as t1
group by
uid
)
功能:计算每个用户连续两次作答的最大时间窗
详细说明:
- 使用窗口函数
lag()
获取每个用户的上一次作答日期 datediff(start_time, lag(start_time)) + 1
计算相邻两次作答的时间窗+1
是因为间隔2天实际上是3天的窗口(如1号和3号是3-1=2天,但实际窗口是3天)
max(days_window)
找出每个用户的最大时间窗
最终查询
sql
select
uid,
days_window,
round(days_window * max_avg, 2) as avg_exam_cnt
from
max_avg_temp
join max_windows_temp using (uid)
order by
days_window desc,
avg_exam_cnt desc
逻辑:
- 将最大时间窗(
days_window
)乘以日均作答量(max_avg
) - 预测在该时间窗内平均会做多少套试卷(
avg_exam_cnt
) - 按最大时间窗和预测作答量降序排序
简化1
待补充。。。