Big Data Learning (125): Hive Data Analysis

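🍋🍋Big Data Learning🍋🍋

🔥Series column: 👑Motto: Do what you can to change the world.
💖If you find this post helpful, please like 👍, bookmark ⭐️, and leave a comment 📝 to support the author 🤞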


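1. Consecutive-login variant
  • Problem

    Find users who logged in for exactly 3 consecutive days (longer streaks do not count).
    Table schema: user_logs(user_id, login_date)

  • Reference answer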

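    WITH ranked_logs AS (
        SELECT 
            user_id,
            login_date,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
        -- Deduplicate first so that several logins on one day count as a single day
        FROM (SELECT DISTINCT user_id, login_date FROM user_logs) t
    ),
    consecutive_groups AS (
        SELECT 
            user_id,
            -- Consecutive dates minus their row number yield the same anchor date
            DATE_SUB(login_date, rn) AS grp,
            MIN(login_date) AS start_date,
            MAX(login_date) AS end_date,
            COUNT(*) AS days
        FROM ranked_logs
        -- Hive cannot group by a SELECT alias, so the expression is repeated
        GROUP BY user_id, DATE_SUB(login_date, rn)
    )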
    SELECT user_id, start_date, end_date
    FROM consecutive_groups
    WHERE days = 3;
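The grouping key in the query above works because subtracting the row number from each date gives the same value for every date in an unbroken run. The following is a minimal Python sketch of that idea for a single user; the function name `exact_streaks` and the sample dates are made up for illustration.

```python
from datetime import date, timedelta
from itertools import groupby


def exact_streaks(login_dates, length=3):
    """Return (start, end) pairs for runs of exactly `length` consecutive days."""
    days = sorted(set(login_dates))  # deduplicate same-day logins
    # Same trick as the SQL: date minus row number is constant within a run.
    keyed = [(d - timedelta(days=rn), d) for rn, d in enumerate(days, start=1)]
    streaks = []
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        run_days = [d for _, d in run]
        if len(run_days) == length:
            streaks.append((run_days[0], run_days[-1]))
    return streaks


logins = [
    date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3),  # run of 3
    date(2024, 1, 10), date(2024, 1, 11), date(2024, 1, 12), date(2024, 1, 13),  # run of 4
]
print(exact_streaks(logins))
```

Only the first run is returned; the four-day run is rejected because its length is not exactly 3.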
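2. Consecutive days without login
  • Problem

    Find each user's longest run of consecutive days without a login (assume the table records login dates only).
    Table schema: user_logs(user_id, login_date)

  • Reference answer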

    WITH next_logs AS (
        SELECT 
            user_id,
            login_date,
            LEAD(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS next_login
        FROM user_logs
    )
    SELECT 
        user_id,
        MAX(DATEDIFF(next_login, login_date) - 1) AS max_consecutive_missing
    FROM next_logs
    WHERE next_login IS NOT NULL
    GROUP BY user_id;

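II. Advanced Window Function Applications

3. Moving average
  • Problem

    Compute each user's average spend over the most recent 7 days (sliding window).
    Table schema: orders(user_id, order_date, amount)

  • Reference answer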

    SELECT 
        user_id,
        order_date,
        AVG(amount) OVER (
            PARTITION BY user_id
            -- Order by a day number so that RANGE spans calendar days, not rows
            ORDER BY DATEDIFF(order_date, '1970-01-01')
            RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_7day_avg
    FROM orders;
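A RANGE frame is defined by calendar distance, not by a number of rows, so days without orders simply contribute nothing to the window. The sketch below reproduces that behaviour in Python for one user's orders; `rolling_7day_avg` and the sample data are hypothetical and only meant for checking the window logic by hand.

```python
from datetime import date, timedelta


def rolling_7day_avg(orders):
    """orders: list of (order_date, amount) for one user.

    Returns (order_date, average amount over that date and the 6 days before it).
    """
    ordered = sorted(orders)
    result = []
    for current_date, _ in ordered:
        window_start = current_date - timedelta(days=6)
        # Rows sharing the current date are included, like RANGE ... CURRENT ROW.
        window = [amount for d, amount in ordered if window_start <= d <= current_date]
        result.append((current_date, sum(window) / len(window)))
    return result


orders = [
    (date(2024, 1, 1), 100),
    (date(2024, 1, 3), 200),
    (date(2024, 1, 10), 300),  # more than 6 days after the previous orders
]
for order_date, avg in rolling_7day_avg(orders):
    print(order_date, avg)
```

The last order stands alone in its window because both earlier orders fall outside the preceding six days.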
4. Growth rate
  • Problem

    Compute the month-over-month growth rate of each user's monthly spend.
    Table schema: orders(user_id, order_date, amount)

  • Reference answer

    WITH monthly_sales AS (
        SELECT 
            user_id,
            -- Hive uses Java date patterns: 'yyyy-MM' rather than '%Y-%m'
            DATE_FORMAT(order_date, 'yyyy-MM') AS month,
            SUM(amount) AS total_amount
        FROM orders
        GROUP BY user_id, DATE_FORMAT(order_date, 'yyyy-MM')
    )
    SELECT 
        user_id,
        month,
        total_amount,
        (total_amount / LAG(total_amount) OVER (PARTITION BY user_id ORDER BY month) - 1) * 100 AS growth_rate
    FROM monthly_sales;

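III. Time Series Analysis

5. Filling missing dates
  • Problem

    Generate each user's daily login status (0 = not logged in, 1 = logged in), including the missing dates.
    Table schema: user_logs(user_id, login_date)

  • Reference answer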

    WITH date_range AS (
        SELECT 
            user_id,
            MIN(login_date) AS start_date,
            MAX(login_date) AS end_date
        FROM user_logs
        GROUP BY user_id
    ),
    all_dates AS (
        -- SPLIT(SPACE(n), ' ') returns n + 1 elements, so pos runs from 0 to n
        -- and every date from start_date to end_date is generated
        SELECT 
            dr.user_id,
            DATE_ADD(dr.start_date, d.pos) AS calendar_date
        FROM date_range dr
        LATERAL VIEW POSEXPLODE(SPLIT(SPACE(DATEDIFF(dr.end_date, dr.start_date)), ' ')) d AS pos, val
    )
    SELECT 
        ad.user_id,
        ad.calendar_date,
        IF(ul.login_date IS NULL, 0, 1) AS is_logged_in
    FROM all_dates ad
    LEFT JOIN user_logs ul 
    ON ad.user_id = ul.user_id AND ad.calendar_date = ul.login_date;
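The approach above (build every date in each user's range, then left-join the actual logins) can be sketched in Python for a single user. The helper name `daily_login_status` and the sample dates are assumptions made for this illustration.

```python
from datetime import date, timedelta


def daily_login_status(login_dates):
    """Return (calendar_date, is_logged_in) for every day from first to last login."""
    logged_in = set(login_dates)
    start, end = min(logged_in), max(logged_in)
    span = (end - start).days
    status = []
    for offset in range(span + 1):  # offsets 0..span cover both endpoints
        day = start + timedelta(days=offset)
        status.append((day, 1 if day in logged_in else 0))
    return status


status = daily_login_status([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)])
for day, flag in status:
    print(day, flag)
```

January 3 and 4 appear with a flag of 0 even though they never occur in the input, which is the behaviour the left join is meant to produce.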
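6. Periodicity detection
  • Problem

    Identify users who log in on a fixed day of every week (for example, every Monday).
    Table schema: user_logs(user_id, login_date)

  • Reference answer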

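    WITH day_of_week AS (
        SELECT DISTINCT
            user_id,
            login_date,
            DAYOFWEEK(login_date) AS dow,  -- 1 = Sunday ... 7 = Saturday
            -- Running week number that does not reset at year boundaries
            -- (1970-01-05 is a Monday)
            FLOOR(DATEDIFF(login_date, '1970-01-05') / 7) AS week_idx
        FROM user_logs
    ),
    user_weeks AS (
        -- Assumption: the observation period runs from the user's first to last login week
        SELECT user_id, MAX(week_idx) - MIN(week_idx) + 1 AS total_weeks
        FROM day_of_week
        GROUP BY user_id
    )
    SELECT 
        d.user_id,
        d.dow,
        COUNT(DISTINCT d.week_idx) AS weeks_count,
        w.total_weeks
    FROM day_of_week d
    JOIN user_weeks w ON d.user_id = w.user_id
    GROUP BY d.user_id, d.dow, w.total_weeks
    HAVING COUNT(DISTINCT d.week_idx) = w.total_weeks; -- logged in on this weekday in every week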

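IV. Complex Business Scenarios

7. Purchase interval analysis
  • Problem

    Compute each user's average purchase interval and find the users whose average interval exceeds 30 days.
    Table schema: orders(user_id, order_date)

  • Reference answer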

    WITH order_intervals AS (
        SELECT 
            user_id,
            order_date,
            DATEDIFF(order_date, LAG(order_date) OVER (PARTITION BY user_id ORDER BY order_date)) AS days_since_last
        FROM orders
    )
    SELECT 
        user_id,
        AVG(days_since_last) AS avg_interval
    FROM order_intervals
    WHERE days_since_last IS NOT NULL
    GROUP BY user_id
    HAVING AVG(days_since_last) > 30;
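8. Active / churned user analysis
  • Problem

    Label each user's monthly status (active = logged in during the month, churned = no login for 3 consecutive months).
    Table schema: user_logs(user_id, login_date)

  • Reference answer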

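    WITH months AS (
        SELECT 
            user_id,
            DATE_FORMAT(login_date, 'yyyy-MM') AS month,
            MAX(login_date) AS last_login
        FROM user_logs
        GROUP BY user_id, DATE_FORMAT(login_date, 'yyyy-MM')
    ),
    status AS (
        SELECT 
            m.user_id,
            m.month,
            m.last_login,
            -- Next month with a login; fall back to the current month if there is none
            COALESCE(LEAD(m.month) OVER (PARTITION BY m.user_id ORDER BY m.month),
                     DATE_FORMAT(CURRENT_DATE, 'yyyy-MM')) AS next_month
        FROM months m
    )
    SELECT 
        user_id,
        month,
        CASE 
            -- A gap of 4 or more months means at least 3 full months without a login,
            -- so the user is treated as churned after this month
            WHEN MONTHS_BETWEEN(CONCAT(next_month, '-01'), CONCAT(month, '-01')) >= 4 THEN 'churned'
            ELSE 'active'
        END AS status
    FROM status;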

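V. Advanced Challenges

9. Longest consecutive event chain
  • Problem

    Find each user's longest chain of consecutive events (for example, consecutive likes or comments, where the event type is the same).
    Table schema: events(user_id, event_time, event_type)

  • Reference answer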

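    WITH ranked_events AS (
        SELECT 
            user_id,
            event_time,
            event_type,
            -- Overall position minus position within the same event type stays
            -- constant while the type repeats and changes once another type interrupts
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time)
              - ROW_NUMBER() OVER (PARTITION BY user_id, event_type ORDER BY event_time) AS grp
        FROM events
    ),
    event_groups AS (
        SELECT 
            user_id,
            event_type,
            grp,
            COUNT(*) AS chain_length
        FROM ranked_events
        GROUP BY user_id, event_type, grp
    )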
    SELECT 
        user_id,
        event_type,
        MAX(chain_length) AS max_chain
    FROM event_groups
    GROUP BY user_id, event_type;
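Whatever SQL formulation is used, the expected result can be cross-checked with a simple run-length count in Python: order the events by time and measure each unbroken run of one event type. This is a minimal sketch for a single user; `longest_chains` is a hypothetical helper and the integer timestamps are stand-ins for real event times.

```python
from itertools import groupby


def longest_chains(events):
    """events: list of (event_time, event_type) for one user.

    Returns {event_type: length of the longest uninterrupted run of that type}.
    """
    ordered = sorted(events, key=lambda e: e[0])  # chronological order
    best = {}
    # groupby starts a new group whenever the event type changes.
    for event_type, run in groupby(ordered, key=lambda e: e[1]):
        length = sum(1 for _ in run)
        best[event_type] = max(best.get(event_type, 0), length)
    return best


events = [
    (1, "like"), (2, "like"),
    (3, "comment"),
    (4, "like"), (5, "like"), (6, "like"),
]
print(longest_chains(events))
```

The two "like" runs are counted separately because the comment at time 3 interrupts them, so the longest "like" chain is 3, not 5.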
10. Session identification
  • Problem

    Group user actions into sessions (assume a session timeout of 30 minutes).
    Table schema: actions(user_id, action_time, action_type)

  • Reference answer

    WITH time_diff AS (
        SELECT 
            user_id,
            action_time,
            action_type,
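            -- Hive has no TIMESTAMPDIFF: take the difference in seconds and convert to minutes
            (UNIX_TIMESTAMP(action_time)
             - UNIX_TIMESTAMP(LAG(action_time) OVER (PARTITION BY user_id ORDER BY action_time))) / 60 AS minutes_since_last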
        FROM actions
    ),
    session_markers AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            IF(minutes_since_last > 30 OR minutes_since_last IS NULL, 1, 0) AS new_session
        FROM time_diff
    ),
    sessions AS (
        SELECT 
            user_id,
            action_time,
            action_type,
            SUM(new_session) OVER (PARTITION BY user_id ORDER BY action_time) AS session_id
        FROM session_markers
    )
    SELECT * FROM sessions;
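The three steps of the query (measure the gap to the previous action, flag gaps over 30 minutes as a new session, accumulate the flags into a session number) can be traced in a few lines of Python. This is a sketch for one user's actions; `assign_sessions` and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta


def assign_sessions(action_times, gap=timedelta(minutes=30)):
    """Return (action_time, session_id) pairs; a gap longer than `gap` opens a new session."""
    sessions = []
    session_id = 0
    previous = None
    for current in sorted(action_times):
        # The first action (no previous one) also starts a session,
        # like the IS NULL branch in the SQL.
        if previous is None or current - previous > gap:
            session_id += 1
        sessions.append((current, session_id))
        previous = current
    return sessions


actions = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 10),
    datetime(2024, 1, 1, 10, 50),  # 40 minutes after the previous action
    datetime(2024, 1, 1, 11, 0),
    datetime(2024, 1, 1, 12, 0),   # 60 minutes after the previous action
]
for action_time, session_id in assign_sessions(actions):
    print(action_time, session_id)
```

The running counter plays the role of the cumulative SUM over the new-session flags.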
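  1. Simulate the data by hand first: create a test table, insert a few rows, and verify that the logic is correct.
  2. Compare different approaches: for consecutive-value problems, for example, try LEAD(), DATE_SUB + ROW_NUMBER, and other methods.
  3. Watch the edge cases: NULL values, several records on the same day, and ranges that cross a year or month boundary.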
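One lightweight way to follow these three tips without a Hive cluster is an in-memory SQLite table. The sketch below is only an approximation: SQLite's dialect differs from Hive (it uses `julianday` for date arithmetic here), window functions need SQLite 3.25 or newer, and the sample rows are invented. It loads a few rows containing a same-day duplicate and a year boundary, then checks that a ROW_NUMBER-based and a LAG-based streak query agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logs (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO user_logs VALUES (?, ?)",
    [
        (1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"),
        (1, "2024-01-03"),  # second record on the same day
        (1, "2024-01-07"),
        (2, "2023-12-30"), (2, "2023-12-31"), (2, "2024-01-01"),  # crosses a year boundary
    ],
)

# Method 1: date minus ROW_NUMBER is constant within a streak.
row_number_sql = """
WITH dedup AS (SELECT DISTINCT user_id, login_date FROM user_logs),
ranked AS (
    SELECT user_id, login_date,
           julianday(login_date)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS grp
    FROM dedup
)
SELECT user_id, MIN(login_date), MAX(login_date), COUNT(*)
FROM ranked
GROUP BY user_id, grp
ORDER BY user_id, MIN(login_date)
"""

# Method 2: LAG marks the start of each streak, a running SUM numbers the streaks.
lag_sql = """
WITH dedup AS (SELECT DISTINCT user_id, login_date FROM user_logs),
flagged AS (
    SELECT user_id, login_date,
           CASE WHEN julianday(login_date)
                     - julianday(LAG(login_date) OVER (PARTITION BY user_id ORDER BY login_date)) = 1
                THEN 0 ELSE 1 END AS new_streak
    FROM dedup
),
numbered AS (
    SELECT user_id, login_date,
           SUM(new_streak) OVER (PARTITION BY user_id ORDER BY login_date) AS streak_id
    FROM flagged
)
SELECT user_id, MIN(login_date), MAX(login_date), COUNT(*)
FROM numbered
GROUP BY user_id, streak_id
ORDER BY user_id, MIN(login_date)
"""

streaks_by_row_number = conn.execute(row_number_sql).fetchall()
streaks_by_lag = conn.execute(lag_sql).fetchall()
print(streaks_by_row_number)
print(streaks_by_row_number == streaks_by_lag)
```

Removing the `dedup` step is a quick way to see how the duplicate record on 2024-01-03 breaks the ROW_NUMBER method, which is the kind of edge case the third tip refers to.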