Big Data Learning (125): Hive Data Analysis

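🍋🍋Big Data Learning🍋🍋

🔥Series column: 👑Motto: Do what is within your power to change the world.
💖If you find this post helpful, please like👍, bookmark⭐️, and comment📝 to support the author🤞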


1. Consecutive login variant
  • Problem

    Find users who logged in for exactly 3 consecutive days (longer streaks do not count).
    Table schema: user_logs(user_id, login_date)

  • Reference answer

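    WITH ranked_logs AS (
        SELECT 
            user_id,
            login_date,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
        -- Deduplicate first: repeated logins on the same day would break the rn offset
        FROM (SELECT DISTINCT user_id, login_date FROM user_logs) t
    ),
    consecutive_groups AS (
        SELECT 
            user_id,
            -- login_date minus its rank is constant within one streak
            -- (MySQL syntax; the Hive equivalent is DATE_SUB(login_date, rn))
            DATE_SUB(login_date, INTERVAL rn DAY) AS grp,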
            MIN(login_date) AS start_date,
            MAX(login_date) AS end_date,
            COUNT(*) AS days
        FROM ranked_logs
        GROUP BY user_id, grp
    )
    SELECT user_id, start_date, end_date
    FROM consecutive_groups
    WHERE days = 3;
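
The grouping trick in the query above (date minus row number stays constant within a streak) can be cross-checked on mock data. The following is a minimal Python sketch, not part of the original answer; the function name `exact_streaks` and the `(user_id, login_date)` tuple layout are assumptions made for illustration.

```python
from datetime import date, timedelta
from itertools import groupby

def exact_streaks(rows, length=3):
    """Return (user_id, start, end) for every run of exactly `length` consecutive days.

    `rows` is an iterable of (user_id, login_date) tuples (hypothetical layout).
    """
    result = []
    # Deduplicate and sort so each (user, day) pair is seen once, in order.
    unique = sorted(set(rows))
    for user, user_rows in groupby(unique, key=lambda r: r[0]):
        days = [d for _, d in user_rows]
        # Same idea as the SQL: day minus its rank is constant within a run.
        keyed = [(d - timedelta(days=i), d) for i, d in enumerate(days)]
        for _, run in groupby(keyed, key=lambda k: k[0]):
            run_days = [d for _, d in run]
            if len(run_days) == length:
                result.append((user, run_days[0], run_days[-1]))
    return result
```

A streak of 4 days is ignored rather than reported as a 3-day streak, which matches the "exactly 3" requirement.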
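
2. Consecutive absence problem
  • Problem

    Find each user's longest run of consecutive days without a login (the table records login dates only).
    Table schema: user_logs(user_id, login_date)

  • Reference answer
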
    WITH next_logs AS (
        SELECT 
            user_id,
            login_date,
            LEAD(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS next_login
        FROM user_logs
    )
    SELECT 
        user_id,
        MAX(DATEDIFF(next_login, login_date) - 1) AS max_consecutive_missing
    FROM next_logs
    WHERE next_login IS NOT NULL
    GROUP BY user_id;
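
To sanity-check the LEAD-based query on a small sample, the same gap calculation can be written in plain Python. This is a sketch under assumptions: the name `longest_gap` and the tuple layout are hypothetical, and duplicate login rows are collapsed before measuring gaps.

```python
from datetime import date

def longest_gap(rows):
    """Map user_id -> longest run of days without a login between two logins.

    `rows` is an iterable of (user_id, login_date) tuples (hypothetical layout).
    """
    days_by_user = {}
    for user, day in rows:
        days_by_user.setdefault(user, set()).add(day)

    gaps = {}
    for user, days in days_by_user.items():
        ordered = sorted(days)
        if len(ordered) < 2:
            continue  # like the SQL, users with a single login day are skipped
        # Days strictly between two neighbouring logins: difference minus one.
        gaps[user] = max((b - a).days - 1 for a, b in zip(ordered, ordered[1:]))
    return gaps
```

As in the SQL, the period after a user's last login is not counted, because there is no later login to close the gap.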

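II. Advanced Window Function Applications

3. Moving average
  • Problem

    Compute each user's average spending over the most recent 7 days (sliding window).
    Table schema: orders(user_id, order_date, amount)

  • Reference answer
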
    SELECT 
        user_id,
        order_date,
        AVG(amount) OVER (
            PARTITION BY user_id 
            ORDER BY order_date 
            RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
        ) AS rolling_7day_avg
    FROM orders;
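
The RANGE frame above selects rows by date distance, not by row count. A minimal Python sketch of that behaviour follows; the name `rolling_avg` and the `(user_id, order_date, amount)` tuple layout are assumptions for illustration, and the quadratic scan is only meant for small test data.

```python
from datetime import date, timedelta

def rolling_avg(orders, days=7):
    """For each order, average the user's order amounts in the window of
    `days` calendar days ending on that order's date (inclusive).

    `orders` is a list of (user_id, order_date, amount) tuples (hypothetical layout).
    """
    out = []
    for user, day, _amount in orders:
        start = day - timedelta(days=days - 1)
        # Orders on the same date are all inside the window, like RANGE peers.
        window = [a for u, d, a in orders if u == user and start <= d <= day]
        out.append((user, day, sum(window) / len(window)))
    return out
```

Note that this averages per order, as `AVG(amount)` does; a per-day average would need the amounts summed by day first.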

4. Growth rate
  • Problem

    Compute the month-over-month growth rate of each user's monthly spending.
    Table schema: orders(user_id, order_date, amount)

  • Reference answer

    WITH monthly_sales AS (
        SELECT 
            user_id,
            DATE_FORMAT(order_date, '%Y-%m') AS month,
            SUM(amount) AS total_amount
        FROM orders
        GROUP BY user_id, month
    )
    SELECT 
        user_id,
        month,
        total_amount,
        -- LAG reads the previous month that has orders; NULLIF avoids dividing by zero
        (total_amount / NULLIF(LAG(total_amount) OVER (PARTITION BY user_id ORDER BY month), 0) - 1) * 100 AS growth_rate
    FROM monthly_sales;
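
One edge case is worth testing: `LAG` returns the previous row, so if a user has no orders in some month, the query compares against the last month that did have orders. The sketch below shows a stricter variant that only compares against the immediately preceding calendar month. This is an assumption about what "month-over-month" should mean, not the original author's method, and `monthly_growth` is a hypothetical name.

```python
from collections import defaultdict
from datetime import date

def monthly_growth(orders):
    """Return {(user_id, 'YYYY-MM'): growth in percent, or None}.

    `orders` is an iterable of (user_id, order_date, amount) tuples (hypothetical layout).
    Growth is None when the previous calendar month has no orders or sums to zero.
    """
    totals = defaultdict(float)
    for user, day, amount in orders:
        totals[(user, day.year, day.month)] += amount

    result = {}
    for (user, year, month), total in totals.items():
        # Step back exactly one calendar month, wrapping January to December.
        prev_year, prev_month = (year, month - 1) if month > 1 else (year - 1, 12)
        prev = totals.get((user, prev_year, prev_month))
        rate = None if not prev else (total / prev - 1) * 100
        result[(user, f"{year:04d}-{month:02d}")] = rate
    return result
```

Comparing its output with the SQL result on data that skips a month makes the difference between the two definitions visible.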

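III. Time Series Analysis

5. Filling in missing dates
  • Problem

    Generate each user's daily login status (0 = not logged in, 1 = logged in), including the dates missing from the table.
    Table schema: user_logs(user_id, login_date)

  • Reference answer
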
    WITH RECURSIVE date_range AS (
        SELECT 
            user_id,
            MIN(login_date) AS start_date,
            MAX(login_date) AS end_date
        FROM user_logs
        GROUP BY user_id
    ),
    -- Recursive CTE (MySQL 8+): one row per user per day, from first to last login.
    -- Ranges longer than cte_max_recursion_depth (default 1000) need that setting raised.
    all_dates AS (
        SELECT 
            user_id,
            start_date AS calendar_date,
            end_date
        FROM date_range
        UNION ALL
        SELECT 
            user_id,
            calendar_date + INTERVAL 1 DAY,
            end_date
        FROM all_dates
        WHERE calendar_date < end_date
    )
    SELECT 
        ad.user_id,
        ad.calendar_date,
        IF(ul.login_date IS NULL, 0, 1) AS is_logged_in
    FROM all_dates ad
    LEFT JOIN user_logs ul 
    ON ad.user_id = ul.user_id AND ad.calendar_date = ul.login_date;

6. Periodicity detection
  • Problem

    Find users who habitually log in on a fixed day of the week (e.g., every Monday).
    Table schema: user_logs(user_id, login_date)

  • Reference answer

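    WITH user_days AS (
        SELECT DISTINCT
            user_id,
            DAYOFWEEK(login_date) AS dow,  -- 1 = Sunday ... 7 = Saturday
            YEARWEEK(login_date) AS yw     -- year + week, so weeks of different years stay distinct
        FROM user_logs
    ),
    user_weeks AS (
        -- Number of weeks in which the user logged in at least once
        SELECT user_id, COUNT(DISTINCT yw) AS total_weeks
        FROM user_days
        GROUP BY user_id
    )
    SELECT 
        d.user_id,
        d.dow,
        COUNT(*) AS weeks_count,
        w.total_weeks
    FROM user_days d
    JOIN user_weeks w ON d.user_id = w.user_id
    GROUP BY d.user_id, d.dow, w.total_weeks
    HAVING weeks_count = w.total_weeks; -- logged in on this weekday in every week the user was active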

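IV. Complex Business Scenarios

7. Purchase interval analysis
  • Problem

    Compute each user's average interval between purchases, and find the users whose average interval exceeds 30 days.
    Table schema: orders(user_id, order_date)

  • Reference answer
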
    WITH order_intervals AS (
        SELECT 
            user_id,
            order_date,
            DATEDIFF(order_date, LAG(order_date) OVER (PARTITION BY user_id ORDER BY order_date)) AS days_since_last
        FROM orders
    )
    SELECT 
        user_id,
        AVG(days_since_last) AS avg_interval
    FROM order_intervals
    WHERE days_since_last IS NOT NULL
    GROUP BY user_id
    HAVING avg_interval > 30;
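
8. Active / churned user analysis
  • Problem

    Label each user's monthly status (active = logged in during the month; churned = no login for 3 consecutive months).
    Table schema: user_logs(user_id, login_date)

  • Reference answer
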
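    WITH months AS (
        SELECT DISTINCT
            user_id,
            DATE_FORMAT(login_date, '%Y-%m') AS month,
            EXTRACT(YEAR_MONTH FROM login_date) AS period  -- e.g. 202401
        FROM user_logs
    ),
    status AS (
        SELECT 
            m.user_id,
            m.month,
            m.period,
            -- Next month in which the user logged in (NULL if none yet)
            LEAD(m.period) OVER (PARTITION BY m.user_id ORDER BY m.period) AS next_period
        FROM months m
    )
    SELECT 
        user_id,
        month,
        CASE 
            -- A difference above 3 means at least 3 full months without a login follow this month;
            -- with no later login, measure up to the current month instead
            WHEN PERIOD_DIFF(COALESCE(next_period, EXTRACT(YEAR_MONTH FROM CURDATE())), period) > 3 THEN 'churned'
            ELSE 'active'
        END AS status
    FROM status;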

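V. Advanced Challenges

9. Longest consecutive event chain
  • Problem

    Find each user's longest chain of consecutive events of the same type (e.g., consecutive likes or comments).
    Table schema: events(user_id, event_time, event_type)

  • Reference answer
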
    WITH ranked_events AS (
        SELECT 
            user_id,
            event_time,
            event_type,
            -- Overall rank minus rank within the event type stays constant
            -- for as long as the same type repeats without interruption
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time)
              - ROW_NUMBER() OVER (PARTITION BY user_id, event_type ORDER BY event_time) AS grp
        FROM events
    ),
    event_groups AS (
        SELECT 
            user_id,
            event_type,
            grp,
            COUNT(*) AS chain_length
        FROM ranked_events
        GROUP BY user_id, event_type, grp
    )
    SELECT 
        user_id,
        event_type,
        MAX(chain_length) AS max_chain
    FROM event_groups
    GROUP BY user_id, event_type;
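
A chain here means an uninterrupted run of the same event type in time order, regardless of how far apart the events are. The Python sketch below expresses that definition directly so it can be compared with the SQL result on a few rows; `longest_chains` is a hypothetical name, and `event_time` can be any sortable value.

```python
from itertools import groupby

def longest_chains(events):
    """Return {(user_id, event_type): length of the longest uninterrupted run}.

    `events` is an iterable of (user_id, event_time, event_type) tuples
    (hypothetical layout).
    """
    best = {}
    # Order by user, then by time, so runs can be read off sequentially.
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for user, user_events in groupby(ordered, key=lambda e: e[0]):
        # groupby starts a new group every time the event type changes.
        for etype, run in groupby(user_events, key=lambda e: e[2]):
            length = sum(1 for _ in run)
            best[(user, etype)] = max(best.get((user, etype), 0), length)
    return best
```

Events that share the same timestamp have no defined order, so both this sketch and a window-function query can give different answers on such ties.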
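
10. Session identification
  • Problem

    Group user actions into sessions (assume a new session starts after a gap of more than 30 minutes).
    Table schema: actions(user_id, action_time, action_type)

  • Reference answer
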
    WITH time_diff AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            TIMESTAMPDIFF(MINUTE, 
                          LAG(action_time) OVER (PARTITION BY user_id ORDER BY action_time), 
                          action_time) AS minutes_since_last
        FROM actions
    ),
    session_markers AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            IF(minutes_since_last > 30 OR minutes_since_last IS NULL, 1, 0) AS new_session
        FROM time_diff
    ),
    sessions AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            SUM(new_session) OVER (PARTITION BY user_id ORDER BY action_time) AS session_id
        FROM session_markers
    )
    SELECT * FROM sessions;
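
The three steps in the query (time since previous action, new-session flag, running sum of flags) collapse into a single pass in procedural code. The following Python sketch is for checking results on mock data; `assign_sessions` is a hypothetical name, and `action_type` is left out because it does not affect the grouping.

```python
from datetime import datetime, timedelta

def assign_sessions(actions, gap=timedelta(minutes=30)):
    """Return (user_id, action_time, session_id) rows, with session ids
    numbered from 1 within each user.

    `actions` is an iterable of (user_id, action_time) tuples (hypothetical layout).
    """
    out = []
    last_time = {}  # user_id -> time of the previous action
    session = {}    # user_id -> current session number
    for user, ts in sorted(actions):
        prev = last_time.get(user)
        # First action, or a pause longer than the gap, opens a new session.
        if prev is None or ts - prev > gap:
            session[user] = session.get(user, 0) + 1
        last_time[user] = ts
        out.append((user, ts, session[user]))
    return out
```

One difference to keep in mind when comparing: `TIMESTAMPDIFF(MINUTE, ...)` returns whole minutes, so a pause of 30 minutes 59 seconds counts as 30 in the SQL, while this sketch compares exact durations.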
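
  1. Simulate data by hand first: create a test table, insert a small amount of data, and verify that the logic is correct.
  2. Compare different methods: for consecutive-value problems, for example, try several implementations such as LEAD() and DATE_SUB + ROW_NUMBER.
  3. Watch the boundary conditions: handle NULLs, multiple records on the same day, and cross-year / cross-month cases.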
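
As one way to apply these three tips together, the sketch below computes streak lengths with two independent methods and runs both on mock data that contains a duplicate day and a year boundary. The function names are hypothetical, and the two methods only mirror the ideas behind DATE_SUB + ROW_NUMBER and LEAD(); they are not translations of any specific query.

```python
from datetime import date, timedelta

def streaks_by_rank(days):
    """Group days by (day - rank), mirroring the DATE_SUB + ROW_NUMBER idea."""
    ordered = sorted(set(days))  # duplicates on the same day are collapsed
    groups = {}
    for i, d in enumerate(ordered):
        groups.setdefault(d - timedelta(days=i), []).append(d)
    return sorted(len(g) for g in groups.values())

def streaks_by_next(days):
    """End a streak whenever the next day is not exactly one day later,
    mirroring the LEAD() idea."""
    ordered = sorted(set(days))
    lengths, current = [], 0
    for d, nxt in zip(ordered, ordered[1:] + [None]):
        current += 1
        if nxt is None or nxt - d != timedelta(days=1):
            lengths.append(current)
            current = 0
    return sorted(lengths)

# Mock data: a streak crossing the year boundary, a duplicate day, an isolated day.
mock_days = [
    date(2023, 12, 30), date(2023, 12, 31),
    date(2024, 1, 1), date(2024, 1, 1),
    date(2024, 1, 5),
]
```

If the two functions disagree on some input, at least one of them mishandles a boundary case, which is usually faster to spot here than inside a long query.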