🍋🍋 Big Data Learning 🍋🍋
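I. Advanced Window Functions
1. Cumulative Distribution
- Problem: For each user, compute the cumulative distribution percentile of their spending amount across all users (that is, the share of users whose spending this user matches or exceeds).
  Table: user_transactions(user_id, amount).
- Reference answer:
    SELECT user_id,
           amount,
           CUME_DIST() OVER (ORDER BY amount) AS percentile
    FROM user_transactions;
- Notes: CUME_DIST() returns the fraction of rows whose amount is less than or equal to the current row's, so ties and the row itself are counted. PERCENT_RANK() compares the row only against the other rows and counts strictly lower values. The query treats each row as one user; if the table holds several transactions per user, aggregate first with SUM(amount) ... GROUP BY user_id.
- Extended exercise: Find the "super users" whose spending exceeds that of 90% of users, and compute their share of total spending.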
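The CUME_DIST() semantics and the extended exercise can be checked by hand with a small sketch. This is plain Python rather than Hive SQL, the per-user totals are hypothetical, and "super user" is read here as a cumulative distribution value strictly above 0.9, which is only one possible interpretation of the exercise.

```python
from bisect import bisect_right

def cume_dist(values):
    """Fraction of rows whose value is <= each row's value (SQL CUME_DIST semantics)."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]

def super_user_share(totals, threshold=0.9):
    """Return (super users, their share of total spending) for a {user_id: total} dict."""
    users = list(totals)
    dist = cume_dist([totals[u] for u in users])
    supers = [u for u, d in zip(users, dist) if d > threshold]
    share = sum(totals[u] for u in supers) / sum(totals.values())
    return supers, share

# Hypothetical per-user totals: u1 spent 10, u2 spent 20, ..., u10 spent 100.
totals = {f"u{i}": 10 * i for i in range(1, 11)}
supers, share = super_user_share(totals)
```

Tied amounts receive the same value: for the amounts 10, 20, 20, 50 both rows with 20 get 0.75, because three of the four rows are less than or equal to 20.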
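2. Group Ranking with Gaps
- Problem: Find the employees whose salary ranks in the top three within each department. Ties share a rank and the following ranks are skipped (two employees at rank 1 are followed by rank 3).
  Table: employees(emp_id, dept_id, salary).
- Reference answer:
    WITH ranked_employees AS (
      SELECT emp_id,
             dept_id,
             salary,
             -- RANK() leaves gaps after ties, as the problem requires
             RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS salary_rank
      FROM employees
    )
    SELECT *
    FROM ranked_employees
    WHERE salary_rank <= 3;
- Key difference:
  RANK(): ties share a rank and the following ranks are skipped (1, 1, 3). This is the behavior the problem asks for.
  DENSE_RANK(): ties share a rank and the following ranks stay consecutive (1, 1, 2), so a "top three" filter can return more than three salary levels' worth of rows.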
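The difference between the two ranking functions is easy to verify on a handful of salaries. The sketch below is plain Python, not Hive SQL, and the salary list is made up for illustration; it computes both ranks for one department so the effect of a `<= 3` filter can be compared.

```python
def rank_and_dense_rank(salaries):
    """Return (salary, rank, dense_rank) tuples, ordered by salary descending."""
    rows = []
    rank = 0
    dense = 0
    prev = object()  # sentinel that never equals a real salary
    for position, salary in enumerate(sorted(salaries, reverse=True), start=1):
        if salary != prev:
            rank = position   # RANK jumps to the row position after a tie
            dense += 1        # DENSE_RANK only advances by one
            prev = salary
        rows.append((salary, rank, dense))
    return rows

# Hypothetical salaries for a single department.
rows = rank_and_dense_rank([9000, 9000, 8000, 7000, 7000, 6000])
top3_by_rank = [s for s, r, _ in rows if r <= 3]
top3_by_dense = [s for s, _, d in rows if d <= 3]
```

With these numbers the RANK-based filter keeps three rows (9000, 9000, 8000), while the DENSE_RANK-based filter keeps five, because 7000 is still dense rank 3.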
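II. Dates and Time Series
3. Filling Missing Dates
- Problem: Build a daily activity table per user that also includes the dates with no activity (filled with 0).
  Table: user_activity(user_id, activity_date).
- Reference answer:
    WITH date_range AS (
      SELECT user_id,
             MIN(activity_date) AS start_date,
             MAX(activity_date) AS end_date
      FROM user_activity
      GROUP BY user_id
    ),
    all_dates AS (
      -- one row per day offset 0..DATEDIFF(end_date, start_date)
      SELECT dr.user_id,
             DATE_ADD(dr.start_date, pe.day_offset) AS dt
      FROM date_range dr
      LATERAL VIEW posexplode(split(space(DATEDIFF(dr.end_date, dr.start_date)), ' ')) pe AS day_offset, unused
    ),
    active_days AS (
      -- deduplicate so several events on one day do not multiply rows
      SELECT DISTINCT user_id, activity_date FROM user_activity
    )
    SELECT d.user_id,
           d.dt,
           IF(a.activity_date IS NULL, 0, 1) AS is_active
    FROM all_dates d
    LEFT JOIN active_days a
      ON d.user_id = a.user_id AND d.dt = a.activity_date;
- Hive notes: space(n) yields n spaces, split(..., ' ') turns them into n + 1 elements, and LATERAL VIEW posexplode() emits their positions 0..n, which DATE_ADD() converts into consecutive dates. sequence() is a Spark SQL function rather than a Hive built-in; on Spark the same range can be written as explode(sequence(to_date(start_date), to_date(end_date), interval 1 day)). The column is named dt because date is a reserved word.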
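The gap-filling idea (enumerate every day offset between a user's first and last activity, then mark which of those days actually had activity) can be sketched in plain Python. The function name and the sample dates below are illustrative only.

```python
from datetime import date, timedelta

def fill_daily_activity(active_dates):
    """Return (day, is_active) for every day between the first and last active date."""
    active = set(active_dates)          # deduplicates repeated activity on one day
    start, end = min(active), max(active)
    result = []
    for offset in range((end - start).days + 1):   # offsets 0..datediff
        day = start + timedelta(days=offset)
        result.append((day, 1 if day in active else 0))
    return result

# Hypothetical activity for one user: active on Jan 1, 3 and 4 (Jan 3 twice).
filled = fill_daily_activity(
    [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 3), date(2024, 1, 4)]
)
```

Here January 2 appears in the output with `is_active` equal to 0 even though it has no row in the input.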
4. Sessionization
- Problem: Split each user's events into sessions, starting a new session after more than 30 minutes of inactivity, and compute the duration of each session.
  Table: user_events(user_id, event_time, event_type).
- Reference answer:
    WITH time_diff AS (
      SELECT user_id,
             event_time,
             event_type,
             UNIX_TIMESTAMP(event_time)
               - UNIX_TIMESTAMP(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)) AS seconds_since_last
      FROM user_events
    ),
    session_markers AS (
      -- the first event of a user (NULL gap) and any gap over 1800 s open a new session
      SELECT user_id,
             event_time,
             event_type,
             IF(seconds_since_last > 1800 OR seconds_since_last IS NULL, 1, 0) AS new_session
      FROM time_diff
    ),
    session_ids AS (
      -- the running sum of the markers numbers the sessions 1, 2, 3, ... per user
      SELECT user_id,
             event_time,
             event_type,
             SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
      FROM session_markers
    )
    SELECT user_id,
           session_id,
           MIN(event_time) AS session_start,
           MAX(event_time) AS session_end,
           UNIX_TIMESTAMP(MAX(event_time)) - UNIX_TIMESTAMP(MIN(event_time)) AS session_duration_seconds
    FROM session_ids
    GROUP BY user_id, session_id;
- Core logic: LAG() gives the time since the previous event of the same user; a gap above the threshold marks the start of a new session, and a running SUM() over those markers turns them into session ids.
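The same gap-based rule can be traced in a few lines of plain Python. This is an illustrative sketch for one user's events, with made-up timestamps; like the query, it treats a gap of exactly 30 minutes as still belonging to the current session.

```python
from datetime import datetime, timedelta

def sessionize(event_times, gap=timedelta(minutes=30)):
    """Group one user's event times into (start, end, duration_seconds) sessions."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][1] <= gap:
            sessions[-1][1] = t          # still within the gap: extend the session
        else:
            sessions.append([t, t])      # first event or gap exceeded: new session
    return [(s, e, int((e - s).total_seconds())) for s, e in sessions]

# Hypothetical events for one user on a single day.
day = datetime(2024, 1, 1)
events = [day.replace(hour=h, minute=m) for h, m in
          [(9, 0), (9, 10), (9, 40), (10, 30), (10, 35)]]
sessions = sessionize(events)
```

The events at 09:00, 09:10 and 09:40 form one session of 2400 seconds; the 50-minute pause before 10:30 opens a second session of 300 seconds.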
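III. Multi-Table Joins and Complex Queries
5. Tree Path Queries
- Problem: Return the full path of every node in a product category tree (for example "Electronics > Phones > Smartphones").
  Table: categories(category_id, parent_id, category_name).
- Reference answer:
    WITH RECURSIVE category_paths AS (
      -- anchor: root nodes
      SELECT category_id,
             parent_id,
             category_name,
             CAST(category_name AS STRING) AS path
      FROM categories
      WHERE parent_id IS NULL
      UNION ALL
      -- recursive step: append each child to its parent's path
      SELECT c.category_id,
             c.parent_id,
             c.category_name,
             CONCAT(cp.path, ' > ', c.category_name)
      FROM categories c
      JOIN category_paths cp ON c.parent_id = cp.category_id
    )
    SELECT * FROM category_paths;
- Hive limitation: Hive does not support the standard WITH RECURSIVE, so this query only runs on engines that do. On Hive the same result needs a workaround: a fixed chain of self-joins when the maximum depth is known, a loop driven from outside the query that adds one level per iteration, or a UDF.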
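The level-by-level loop that replaces a recursive CTE can be sketched as follows. This is plain Python over an in-memory list, with hypothetical category rows; each pass of the `while` loop corresponds to one iteration of an externally driven query that joins the previous level to its children.

```python
def build_paths(categories):
    """categories: list of (category_id, parent_id, name). Returns {category_id: path}."""
    # Level 0: root nodes have no parent.
    paths = {cid: name for cid, pid, name in categories if pid is None}
    frontier = set(paths)
    while frontier:
        next_frontier = set()
        for cid, pid, name in categories:
            # Attach children of the previous level; skip nodes already placed,
            # which also stops the loop if the data contains a cycle.
            if pid in frontier and cid not in paths:
                paths[cid] = paths[pid] + " > " + name
                next_frontier.add(cid)
        frontier = next_frontier
    return paths

# Hypothetical category rows.
paths = build_paths([
    (1, None, "Electronics"),
    (2, 1, "Phones"),
    (3, 2, "Smartphones"),
    (4, None, "Books"),
])
```

The loop ends when a pass adds no new nodes, so the number of iterations equals the depth of the tree.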
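6. Cross-Sell Analysis
- Problem: Find pairs of products that the same user bought in different orders (for example, user A bought a phone first and a phone case later).
  Table: orders(order_id, user_id, product_id, order_date).
- Reference answer:
    SELECT o1.user_id,
           o1.product_id AS product_a,
           o2.product_id AS product_b,
           COUNT(DISTINCT o1.order_id) AS a_orders,
           COUNT(DISTINCT o2.order_id) AS b_orders
    FROM orders o1
    JOIN orders o2
      ON o1.user_id = o2.user_id
    WHERE o1.product_id < o2.product_id   -- keep each pair once
      AND o1.order_id != o2.order_id      -- different orders
    GROUP BY o1.user_id, o1.product_id, o2.product_id;
- Notes: The inequality predicates sit in WHERE because older Hive releases accept only equality conditions in ON. The original HAVING COUNT(DISTINCT ...) > 0 clause is dropped because it is always true after an inner join. A pair still qualifies if the two products also appear together in some other order; add a NOT EXISTS filter if such pairs must be excluded.
- Performance: A self-join on user_id shuffles the whole table twice. Bucketing the table on user_id (CLUSTERED BY (user_id) INTO n BUCKETS in the DDL) lets Hive use a bucket map join and reduces data movement; CLUSTER BY user_id in a query only distributes and sorts that query's output.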
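The pairing rule can be checked on a tiny data set. The sketch below is plain Python with made-up rows; a pair (a, b) with a < b qualifies for a user when some order containing a differs from some order containing b, which mirrors the self-join condition.

```python
from collections import defaultdict
from itertools import combinations

def cross_order_pairs(rows):
    """rows: (order_id, user_id, product_id). Returns {user_id: set of (a, b) pairs}."""
    orders_of = defaultdict(set)          # (user, product) -> order ids
    products_of = defaultdict(set)        # user -> products
    for order_id, user_id, product_id in rows:
        orders_of[(user_id, product_id)].add(order_id)
        products_of[user_id].add(product_id)

    result = defaultdict(set)
    for user_id, products in products_of.items():
        for a, b in combinations(sorted(products), 2):   # a < b: each pair once
            a_orders = orders_of[(user_id, a)]
            b_orders = orders_of[(user_id, b)]
            if any(oa != ob for oa in a_orders for ob in b_orders):
                result[user_id].add((a, b))
    return dict(result)

# Hypothetical rows: user A buys phone and case in separate orders,
# user B buys both in the same order.
pairs = cross_order_pairs([
    (1, "A", "phone"),
    (2, "A", "case"),
    (3, "B", "phone"),
    (3, "B", "case"),
])
```

User A yields the pair ("case", "phone"); user B yields nothing because both products share order 3.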
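IV. Aggregation and Statistical Analysis
7. Year-over-Year and Month-over-Month Growth
- Problem: Compute the year-over-year (YoY) and month-over-month (MoM) growth rate of monthly sales.
  Table: sales(sale_date, amount).
- Reference answer:
    WITH monthly_sales AS (
      SELECT YEAR(sale_date) AS sale_year,
             MONTH(sale_date) AS sale_month,
             SUM(amount) AS total_amount
      FROM sales
      GROUP BY YEAR(sale_date), MONTH(sale_date)
    ),
    with_prev AS (
      -- every LAG() needs its own OVER clause
      SELECT sale_year,
             sale_month,
             total_amount,
             LAG(total_amount, 1) OVER (ORDER BY sale_year, sale_month) AS prev_month_amount,
             LAG(total_amount, 12) OVER (ORDER BY sale_year, sale_month) AS prev_year_amount
      FROM monthly_sales
    )
    SELECT sale_year,
           sale_month,
           total_amount,
           ROUND((total_amount - prev_month_amount) / prev_month_amount * 100, 2) AS mom_growth_percent,
           ROUND((total_amount - prev_year_amount) / prev_year_amount * 100, 2) AS yoy_growth_percent
    FROM with_prev;
- Edge cases: The first month has no previous month and the first twelve months have no previous year, so the growth columns are NULL there. COALESCE() can replace the NULL if a number is required, for example COALESCE((total_amount - prev_month_amount) / prev_month_amount, 0) AS mom_growth, but 0 then means "no baseline" as well as "no change". LAG(total_amount, 12) also assumes every month has a row; if months can be missing, join each month to the row for (sale_year - 1, sale_month) instead.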
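The arithmetic behind the two growth rates, and the effect of a missing month, can be sketched in plain Python. The monthly totals below are invented, and the sketch looks up the baseline by calendar key rather than by row offset, which is an alternative to LAG() that stays correct when months are missing.

```python
def growth_rates(monthly):
    """monthly: {(year, month): total}. Returns {(year, month): (mom_pct, yoy_pct)}."""
    def pct(current, previous):
        # No baseline (or a zero baseline) gives None, like NULL in SQL.
        if previous is None or previous == 0:
            return None
        return round((current - previous) / previous * 100, 2)

    result = {}
    for (year, month), total in monthly.items():
        prev_month_key = (year, month - 1) if month > 1 else (year - 1, 12)
        prev_year_key = (year - 1, month)
        result[(year, month)] = (
            pct(total, monthly.get(prev_month_key)),
            pct(total, monthly.get(prev_year_key)),
        )
    return result

# Hypothetical totals with most of 2023 missing.
rates = growth_rates({(2023, 1): 100.0, (2023, 12): 200.0, (2024, 1): 150.0})
```

January 2024 gets a MoM of -25.0 (150 against 200) and a YoY of 50.0 (150 against 100). An offset-based LAG(total_amount, 12) over these three rows would find no baseline at all for the YoY figure.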
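8. Multi-Dimensional Analysis (Cube/Rollup)
- Problem: Compute sales totals across the product, region and time dimensions in a single query, including subtotals and the grand total.
  Table: sales(product_id, region_id, sale_date, amount).
- Reference answer:
    SELECT product_id,
           region_id,
           YEAR(sale_date) AS sale_year,
           SUM(amount) AS total_amount,
           GROUPING__ID   -- 0 = full grouping; other values are a bitmask of the rolled-up columns
    FROM sales
    GROUP BY product_id, region_id, YEAR(sale_date) WITH CUBE;
- Notes: WITH CUBE produces every combination of the three grouping columns (8 grouping sets); rolled-up columns are returned as NULL. The bit layout of GROUPING__ID changed in Hive 2.3.0, so check the mapping against the Hive version in use before hard-coding values.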
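What a cube computes can be made concrete by enumerating the grouping sets explicitly. The sketch below is plain Python over hypothetical rows; a rolled-up dimension is represented as `None`, the way SQL returns NULL for it.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`; rolled-up dimensions become None."""
    totals = defaultdict(float)
    for size in range(len(dims), -1, -1):
        for kept in combinations(dims, size):      # one grouping set per subset
            for row in rows:
                key = tuple(row[d] if d in kept else None for d in dims)
                totals[key] += row[measure]
    return dict(totals)

# Hypothetical sales rows.
rows = [
    {"product": "p1", "region": "r1", "year": 2024, "amount": 10.0},
    {"product": "p1", "region": "r2", "year": 2024, "amount": 5.0},
]
totals = cube(rows, ["product", "region", "year"], "amount")
```

The key `(None, None, None)` holds the grand total, `("p1", None, None)` the subtotal for product p1 across all regions and years, and so on for each of the 2^3 grouping sets.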
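V. Performance Tuning and Advanced Techniques
9. Handling Data Skew
- Problem: Optimize the following SQL to resolve data skew:
    SELECT u.user_id,
           COUNT(o.order_id) AS order_count
    FROM users u
    JOIN orders o ON u.user_id = o.user_id
    GROUP BY u.user_id;
- Optimized version:
    SET hive.optimize.skewjoin=true;
    SET hive.skewjoin.key=100000;   -- a key with more rows than this is treated as skewed

    -- manual salting + two-phase aggregation
    WITH salted AS (
      SELECT user_id,
             order_id,
             CAST(FLOOR(RAND() * 100) AS INT) AS salt   -- random bucket 0..99
      FROM orders
    ),
    partial AS (
      -- phase 1: a hot user_id is spread over up to 100 (user_id, salt) groups
      SELECT user_id, salt, COUNT(order_id) AS order_count
      FROM salted
      GROUP BY user_id, salt
    ),
    per_user AS (
      -- phase 2: combine the partial counts
      SELECT user_id, SUM(order_count) AS total_orders
      FROM partial
      GROUP BY user_id
    )
    SELECT u.user_id, p.total_orders AS order_count
    FROM users u
    JOIN per_user p ON u.user_id = p.user_id;
- Optimization points:
  - The salt is materialized once per row and then used as a grouping key. Putting RAND() directly into the join condition or the GROUP BY, as salting is sometimes written, does not work: the salted key never matches users.user_id, and each RAND() call returns a different value.
  - Two-phase aggregation: aggregate locally per (user_id, salt) first, then globally per user_id.
  - Because the orders are aggregated before the join, the join sees one row per user and is no longer skewed.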
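Salting followed by two-phase aggregation can be simulated in plain Python to see why it spreads a hot key. This is a sketch under stated assumptions: the bucket count of 100, the fixed seed and the user ids are all arbitrary illustrative choices.

```python
import random
from collections import Counter

def two_phase_count(user_ids, buckets=100, seed=0):
    """Count rows per user via partial (user, salt) counts, then a global sum."""
    rng = random.Random(seed)
    # Phase 1: each row gets a random salt; counts are kept per (user, salt).
    partial = Counter()
    for uid in user_ids:
        partial[(uid, rng.randrange(buckets))] += 1
    # Phase 2: drop the salt and add up the partial counts.
    final = Counter()
    for (uid, _salt), n in partial.items():
        final[uid] += n
    return partial, final

# Hypothetical skewed input: one hot user with 1000 orders, one cold user with 3.
partial, final = two_phase_count(["hot"] * 1000 + ["cold"] * 3)
hot_groups = [n for (uid, _), n in partial.items() if uid == "hot"]
```

The final counts match a direct count, while in phase 1 the hot user's rows are split across many smaller groups, so no single worker has to process all of them.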
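10. UDFs and Complex Types
- Problem: Use a UDF to parse a JSON field and compute the average number of tags per user.
  Table: user_profiles(user_id, tags_json), where tags_json is a JSON array such as ["sports", "music"].
- Reference answer:
    -- assumes a registered UDF explode_json_array that returns the JSON array as ARRAY<STRING>
    SELECT user_id,
           AVG(tag_count) AS avg_tags_per_user
    FROM (
      SELECT user_id,
             SIZE(explode_json_array(tags_json)) AS tag_count
      FROM user_profiles
    ) t
    GROUP BY user_id;
- Built-in alternative:
    SELECT user_id,
           AVG(tag_count) AS avg_tags
    FROM (
      SELECT user_id,
             SIZE(SPLIT(REPLACE(REPLACE(tags_json, '[', ''), ']', ''), ',')) AS tag_count
      FROM user_profiles
    ) t
    GROUP BY user_id;
- Notes: GROUP BY user_id averages over the profile rows of each user; drop it to get a single average across all users. The string-splitting version is only an approximation of JSON parsing: an empty array [] is counted as one tag, and a tag that contains a comma is counted twice.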
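The gap between real JSON parsing and the strip-and-split shortcut shows up on a few edge cases. The sketch below is plain Python using the standard `json` module; the function names are illustrative and do not correspond to any Hive UDF.

```python
import json

def tag_count_json(tags_json):
    """Count tags by actually parsing the JSON array."""
    return len(json.loads(tags_json))

def tag_count_split(tags_json):
    """Count tags by stripping the brackets and splitting on commas."""
    stripped = tags_json.replace("[", "").replace("]", "")
    return len(stripped.split(","))

samples = ['["sports", "music"]', "[]", '["rock, pop"]']
comparison = [(s, tag_count_json(s), tag_count_split(s)) for s in samples]
```

Both approaches agree on the ordinary array, but the split version reports 1 for the empty array and 2 for the single tag that contains a comma.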