大数据学习(128)-数据分析实例

🍋🍋大数据学习🍋🍋

🔥系列专栏: 👑哲学语录: 用力所能及,改变世界。
💖如果觉得博主的文章还不错的话,请点赞👍+收藏⭐️+留言📝支持一下博主哦🤞


一、基础窗口函数与分组聚合

1. 连续登录问题
  • 题目

    找出连续登录至少 5 天 的用户及其最长连续登录天数。
    表结构user_logs(user_id, login_date)

  • 参考答案

    复制代码
    WITH ranked_logs AS (
        SELECT 
            user_id,
            login_date,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
        FROM user_logs
    ),
    consecutive_groups AS (
        SELECT 
            user_id,
            DATE_SUB(login_date, INTERVAL rn DAY) AS grp,
            COUNT(*) AS consecutive_days
        FROM ranked_logs
        GROUP BY user_id, grp
    )
    SELECT 
        user_id,
        MAX(consecutive_days) AS max_consecutive_days
    FROM consecutive_groups
    WHERE consecutive_days >= 5
    GROUP BY user_id;
  • 优化点

    • user_idlogin_date 建立索引。
    • DISTINCT 去重(若存在同一天多次登录):SELECT DISTINCT user_id, login_date FROM user_logs
2. 移动平均计算
  • 题目

    计算用户近 7 天的滚动平均消费金额 (含当天)。
    表结构orders(user_id, order_date, amount)

  • 参考答案

    复制代码
    SELECT 
        user_id,
        order_date,
        AVG(amount) OVER (
            PARTITION BY user_id 
            ORDER BY UNIX_TIMESTAMP(order_date) 
            RANGE BETWEEN 6 * 24 * 3600 PRECEDING AND CURRENT ROW
        ) AS rolling_7day_avg
    FROM orders;
  • 优化点

    • Hive 中 RANGE 对日期的支持有限,需用 UNIX_TIMESTAMP 转换为秒。
    • 预先过滤近 30 天数据以减少计算量:WHERE order_date >= DATE_SUB(CURRENT_DATE, 30)

二、日期函数与时间序列分析

3. 每月首次 / 最后一次登录
  • 题目

    计算用户每月首次和最后一次登录日期
    表结构user_logs(user_id, login_date)

  • 参考答案

    复制代码
    SELECT 
        user_id,
        month,
        MIN(login_date) AS first_login,
        MAX(login_date) AS last_login
    FROM (
        SELECT 
            user_id,
            login_date,
            DATE_FORMAT(login_date, 'yyyy-MM') AS month
        FROM user_logs
    ) t
    GROUP BY user_id, month;
  • 优化点

    • 使用 DISTRIBUTE BY user_id SORT BY month 优化数据分布。
    • 若数据已按日期分区,直接按分区过滤(如 WHERE dt >= '2023-01-01')。
4. 购买季节性分析
  • 题目

    计算每个商品每周销量占全年总销量的比例
    表结构sales(product_id, sale_date, quantity)

  • 参考答案

    复制代码
    WITH yearly_sales AS (
        SELECT 
            product_id,
            SUM(quantity) AS yearly_quantity
        FROM sales
        GROUP BY product_id
    ),
    weekly_sales AS (
        SELECT 
            product_id,
            YEAR(sale_date) AS sale_year,
            WEEKOFYEAR(sale_date) AS sale_week,
            SUM(quantity) AS weekly_quantity
        FROM sales
        GROUP BY product_id, YEAR(sale_date), WEEKOFYEAR(sale_date)
    )
    SELECT 
        ws.product_id,
        ws.sale_year,
        ws.sale_week,
        ws.weekly_quantity / ys.yearly_quantity AS weekly_ratio
    FROM weekly_sales ws
    JOIN yearly_sales ys ON ws.product_id = ys.product_id;
  • 优化点

    • 使用 CUBEROLLUP 同时计算周、月、年聚合:

      复制代码
      SELECT 
          product_id,
          YEAR(sale_date) AS sale_year,
          WEEKOFYEAR(sale_date) AS sale_week,
          SUM(quantity) AS quantity,
          GROUPING__ID  -- 0=完整分组,1=按年聚合,3=总聚合
      FROM sales
      GROUP BY product_id, YEAR(sale_date), WEEKOFYEAR(sale_date) WITH ROLLUP;

三、多表关联与复杂查询

5. 用户留存分析
  • 题目

    计算新用户次日、7 日留存率 (首日登录后,次日 / 7 日仍登录的用户比例)。
    表结构

    • user_logs(user_id, login_date)
    • user_info(user_id, register_date)
  • 参考答案

    复制代码
    WITH first_login AS (
        SELECT 
            user_id,
            MIN(login_date) AS first_login_date
        FROM user_logs
        GROUP BY user_id
    ),
    retention AS (
        SELECT 
            fl.user_id,
            fl.first_login_date,
            MAX(CASE WHEN DATEDIFF(ul.login_date, fl.first_login_date) = 1 THEN 1 ELSE 0 END) AS is_retained_day2,
            MAX(CASE WHEN DATEDIFF(ul.login_date, fl.first_login_date) = 7 THEN 1 ELSE 0 END) AS is_retained_day7
        FROM first_login fl
        LEFT JOIN user_logs ul ON fl.user_id = ul.user_id
        GROUP BY fl.user_id, fl.first_login_date
    )
    SELECT 
        COUNT(*) AS total_users,
        SUM(is_retained_day2) / COUNT(*) AS retention_rate_day2,
        SUM(is_retained_day7) / COUNT(*) AS retention_rate_day7
    FROM retention;
  • 优化点

    • user_logsuser_id 分桶,user_infouser_id 排序,加速 JOIN

    • 使用 MAPJOIN 优化小表关联:

      复制代码
      SELECT /*+ MAPJOIN(user_info) */ ...
      FROM user_logs JOIN user_info ON ...
6. 商品关联分析
  • 题目

    找出经常与商品 A 一起购买的前 10 个商品 (同订单中同时出现的频率最高)。
    表结构order_items(order_id, product_id)

  • 参考答案

    复制代码
    WITH a_orders AS (
        SELECT DISTINCT order_id
        FROM order_items
        WHERE product_id = 'A'
    ),
    co_products AS (
        SELECT 
            oi.product_id AS co_product_id,
            COUNT(*) AS co_occurrence
        FROM a_orders ao
        JOIN order_items oi ON ao.order_id = oi.order_id
        WHERE oi.product_id != 'A'  -- 排除商品 A 自身
        GROUP BY oi.product_id
    )
    SELECT 
        co_product_id,
        co_occurrence
    FROM co_products
    ORDER BY co_occurrence DESC
    LIMIT 10;
  • 优化点

    • 使用 BROADCAST JOIN 分发小表(a_orders):

      复制代码
      SELECT /*+ BROADCAST(a_orders) */ ...
      FROM a_orders JOIN order_items ...
    • 预先过滤低频商品:WHERE product_id IN (SELECT product_id FROM order_items GROUP BY product_id HAVING COUNT(*) > 100)

四、综合业务场景

7. 订单完成率分析
  • 题目

    计算每月订单从创建到支付的平均耗时 ,并按年同比增长排序。
    表结构

    • orders(order_id, create_time, pay_time, status)
    • order_status(status_id, status_name)
  • 参考答案

    复制代码
    WITH monthly_avg AS (
        SELECT 
            YEAR(o.create_time) AS sale_year,
            MONTH(o.create_time) AS sale_month,
            AVG(UNIX_TIMESTAMP(o.pay_time) - UNIX_TIMESTAMP(o.create_time)) / 3600 AS avg_hours
        FROM orders o
        JOIN order_status os ON o.status = os.status_id
        WHERE os.status_name = '已支付'
        GROUP BY YEAR(o.create_time), MONTH(o.create_time)
    ),
    yoy_growth AS (
        SELECT 
            curr.sale_year,
            curr.sale_month,
            curr.avg_hours,
            prev.avg_hours AS prev_avg_hours,
            (curr.avg_hours - prev.avg_hours) / prev.avg_hours AS growth_rate
        FROM monthly_avg curr
        LEFT JOIN monthly_avg prev 
        ON curr.sale_month = prev.sale_month AND curr.sale_year = prev.sale_year + 1
    )
    SELECT 
        sale_year,
        sale_month,
        avg_hours,
        prev_avg_hours,
        growth_rate
    FROM yoy_growth
    ORDER BY growth_rate DESC;
  • 优化点

    • orders 表按 status 分区,减少数据扫描:WHERE status = 'paid' AND dt >= '2023-01-01'

    • 使用窗口函数替代 JOIN 计算同比:

      复制代码
      SELECT 
          sale_year,
          sale_month,
          avg_hours,
          LAG(avg_hours, 12) OVER (ORDER BY sale_year, sale_month) AS prev_avg_hours,
          (avg_hours - LAG(avg_hours, 12) OVER (ORDER BY sale_year, sale_month)) / LAG(avg_hours, 12) OVER (ORDER BY sale_year, sale_month) AS growth_rate
      FROM monthly_avg;
8. 高价值用户识别
  • 题目

    找出RFM 模型中得分最高的前 10% 用户 (最近购买时间 R、购买频率 F、消费金额 M)。
    表结构orders(user_id, order_date, amount)

  • 参考答案

    复制代码
    WITH user_rfm AS (
        SELECT 
            user_id,
            DATEDIFF(CURRENT_DATE, MAX(order_date)) AS recency,
            COUNT(*) AS frequency,
            SUM(amount) AS monetary
        FROM orders
        GROUP BY user_id
    ),
    rfm_scores AS (
        SELECT 
            user_id,
            recency,
            frequency,
            monetary,
            NTILE(4) OVER (ORDER BY recency) AS r_score,  -- 最近购买:值越小排名越高
            NTILE(4) OVER (ORDER BY frequency DESC) AS f_score,  -- 购买频率:值越大排名越高
            NTILE(4) OVER (ORDER BY monetary DESC) AS m_score  -- 消费金额:值越大排名越高
        FROM user_rfm
    ),
    rfm_combined AS (
        SELECT 
            user_id,
            r_score,
            f_score,
            m_score,
            r_score * 100 + f_score * 10 + m_score AS total_score
        FROM rfm_scores
    )
    SELECT 
        user_id,
        total_score
    FROM rfm_combined
    WHERE total_score >= (SELECT PERCENTILE(total_score, 0.9) FROM rfm_combined)
    ORDER BY total_score DESC;
  • 优化点

    • 使用 BUCKETED BY user_id 分桶存储 orders 表,加速聚合。
    • 预先过滤无效订单:WHERE order_date >= DATE_SUB(CURRENT_DATE, 365)

五、性能优化专项练习

9. 大表 JOIN 优化
  • 题目

    优化以下 SQL 的执行效率:

    sql

    复制代码
    SELECT 
        u.user_id,
        u.username,
        o.total_amount
    FROM users u
    JOIN (
        SELECT 
            user_id,
            SUM(amount) AS total_amount
        FROM orders
        GROUP BY user_id
    ) o ON u.user_id = o.user_id;
  • 优化方案

    复制代码
    SET hive.auto.convert.join=true;  -- 启用自动小表广播
    SET hive.optimize.skewjoin=true;  -- 优化数据倾斜
    
    SELECT /*+ MAPJOIN(o) */  -- 手动指定 MAPJOIN
        u.user_id,
        u.username,
        o.total_amount
    FROM users u
    JOIN (
        SELECT 
            /*+ STREAMTABLE(u) */  -- 指定流式处理表
            user_id,
            SUM(amount) AS total_amount
        FROM orders
        GROUP BY user_id
    ) o ON u.user_id = o.user_id;
  • 优化点

    • 使用 MAPJOIN 将小表(如 users)广播到每个节点。
    • 添加 DISTRIBUTE BY user_id 确保数据按 user_id 均匀分布。
    • orders 表按 user_id 分桶,与 users 表分桶数一致。
10. 子查询转窗口函数
  • 题目

    将以下子查询改写为窗口函数,提升性能:

    复制代码
    SELECT 
        user_id,
        order_date,
        amount,
        (SELECT SUM(amount) FROM orders WHERE user_id = o.user_id) AS total_amount
    FROM orders o;
  • 优化方案

    复制代码
    SELECT 
        user_id,
        order_date,
        amount,
        SUM(amount) OVER (PARTITION BY user_id) AS total_amount
    FROM orders;
  • 优化点

    • 窗口函数避免了重复扫描表,只需一次全量扫描。
    • 添加 CLUSTER BY user_id 优化数据分布,减少 shuffle。
相关推荐
小邹子17 分钟前
抑郁症患者数据分析
python·信息可视化·数据分析
疯狂的沙粒1 小时前
如何通过git命令查看项目连接的仓库地址?
大数据·git·elasticsearch
随缘而动,随遇而安1 小时前
第七十四篇 高并发场景下的Java并发容器:用生活案例讲透技术原理
java·大数据·后端
凯禾瑞华实训室建设1 小时前
老年生活照护实训室建设规划:照护质量评估与持续改进实训体系
大数据·人工智能·物联网·ar·vr
静听山水1 小时前
DTS 数据迁移
大数据
普宁彭于晏1 小时前
CSS3相关知识点
前端·css·笔记·学习·css3
科技资讯快报2 小时前
告别Prompt依赖,容联云让智能体嵌入金融业务流
大数据·人工智能·prompt
weixin_贾2 小时前
不确定性分析在LEAP能源-环境系统建模中的整合与应用
大数据·新能源·储能·碳中和·温室气体·leap模型
fen_fen2 小时前
学习笔记(25):线性代数,矩阵-矩阵乘法原理
笔记·学习·线性代数
qq_463944862 小时前
【Spark征服之路-2.3-Spark运行架构】
大数据·架构·spark