目录
- 一、整体架构理解
  - 1. 数据分层架构
  - 2. 事务事实表特点
- 二、各表详细解析
  - 1. 工具域优惠券使用事务事实表
  - 2. 互动域收藏商品事务事实表
  - 3. 流量域页面浏览事务事实表
  - 4. 用户域用户注册事务事实表
  - 5. 用户域用户登录事务事实表
- 三、数据治理与质量控制
  - 1. 数据一致性检查
  - 2. 数据血缘分析
  - 3. 性能优化建议
- 四、业务应用场景总结
  - 1. 运营分析
  - 2. 用户分析
  - 3. 产品分析
  - 4. 商业分析
- 五、常见问题与解决方案
  - 1. 数据延迟问题
  - 2. 数据质量监控
  - 3. 数据回溯处理
- 六、完整代码
一、整体架构理解
1. 数据分层架构
原始数据 → ODS层(原始数据层) → **DWD层(数据明细层)** → DWS层(数据服务层) → ADS层(应用数据层)
2. 事务事实表特点
- 粒度:最细粒度的业务事件
- 维度:包含丰富的上下文维度信息
- 度量:主要是计数型指标
- 更新策略:首日全量 + 每日增量
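下面用一个极简的装载骨架示意"首日全量 + 每日增量"两种策略的差异(dwd_xxx_inc、ods_xxx_inc 为假设表名,具体字段与筛选条件以文末完整代码为准):
sql
-- 首日装载示意:开启动态分区,按业务时间把历史数据回填到各自的dt分区
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_xxx_inc partition(dt)
select
    data.id,
    date_format(data.create_time,'yyyy-MM-dd') date_id,
    data.create_time,
    date_format(data.create_time,'yyyy-MM-dd')   -- 动态分区字段:按业务日期分区
from ods_xxx_inc
where dt='2022-06-08'
  and type='bootstrap-insert';                   -- 首日全量同步数据的类型标记
-- 每日装载示意:只写入当日分区,筛选当日新增(或状态变更)的记录
insert overwrite table dwd_xxx_inc partition(dt='2022-06-09')
select
    data.id,
    date_format(data.create_time,'yyyy-MM-dd') date_id,
    data.create_time
from ods_xxx_inc
where dt='2022-06-09'
  and type='insert';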
二、各表详细解析
1. 工具域优惠券使用事务事实表
业务场景深度解析
- 使用时机:用户下单支付时核销优惠券
- 前置状态:优惠券已领取但未使用
- 触发事件:订单支付成功
- 业务价值:
  - 计算优惠券核销率
  - 分析优惠券对订单转化率的影响
  - 评估营销活动ROI
数据流转细节
-- ODS层数据结构示例
ods_coupon_use_inc {
type: 'bootstrap-insert' 或 'update',
data: {
id: '123',
coupon_id: 'coupon_001',
user_id: 'user_1001',
order_id: 'order_20220608001',
used_time: '2022-06-08 10:30:45' -- 关键字段,不为空表示已使用
},
old: {used_time: null} -- 更新前值,用于判断状态变更
}
装载逻辑详解
sql
-- 首日装载:处理历史全量数据
-- 筛选条件:used_time不为空的记录
-- 这表示优惠券已经被使用(核销)
-- 每日增量装载:处理当日状态变更
-- array_contains(map_keys(old),'used_time') 解释:
-- 1. map_keys(old):获取old字典的所有键
-- 2. array_contains:检查数组中是否包含'used_time'
-- 3. 业务含义:记录从未使用状态变更为已使用状态
-- 4. 为什么不用data.used_time is not null?
--    因为已核销的优惠券记录后续若有其他字段更新,也会产生update日志,此时data.used_time同样不为空,会被重复装载
--    只有old中出现used_time键,才说明本次更新是"未使用 → 已使用"的状态变更
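下面给出一个对比两种筛选口径的校验查询示意(表名与日期沿用上文,结果列名为假设命名),可以直观看到仅凭data.used_time is not null会多算多少条非状态变更的更新:
sql
-- 校验示意:统计当日update记录在两种口径下的数量差异
select
    count(if(data.used_time is not null, 1, null))                        as cnt_used_time_not_null, -- 口径1:仅判断used_time非空
    count(if(array_contains(map_keys(old),'used_time'), 1, null))         as cnt_status_changed,     -- 口径2:本次更新确实修改了used_time
    count(if(data.used_time is not null
             and not array_contains(map_keys(old),'used_time'), 1, null)) as cnt_other_field_update  -- 已核销券的其他字段更新,口径1会误算
from ods_coupon_use_inc
where dt='2022-06-09'
  and type='update';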
典型分析场景
sql
-- 1. 每日优惠券核销统计
SELECT
date_id,
COUNT(DISTINCT coupon_id) as used_coupons,
COUNT(DISTINCT user_id) as using_users,
COUNT(DISTINCT order_id) as using_orders
FROM dwd_tool_coupon_used_inc
WHERE dt BETWEEN '2022-06-01' AND '2022-06-30'
GROUP BY date_id
ORDER BY date_id;
-- 2. 优惠券核销时间分布
SELECT
HOUR(payment_time) as hour_of_day,
COUNT(*) as usage_count
FROM dwd_tool_coupon_used_inc
WHERE dt = '2022-06-09'
GROUP BY HOUR(payment_time)
ORDER BY hour_of_day;
2. 互动域收藏商品事务事实表
业务场景深度解析
- 用户行为:主动表达对商品的兴趣
- 业务价值:
  - 用户兴趣画像:收藏行为反映用户偏好
  - 商品热度分析:被收藏次数多的商品更受欢迎
  - 转化预测:收藏到购买的转化率分析
  - 推荐系统:协同过滤算法的重要输入
数据装载技术细节
sql
-- 首日装载:全量历史收藏记录
-- 为什么用'bootstrap-insert'?
-- 这是数据初始化的一种约定,表示首次全量导入
-- 每日装载:只处理'insert'类型
-- 为什么不处理'update'?
-- 添加收藏对应'insert'类型;取消收藏通常对应'delete'或'update'类型
-- 本表只记录添加收藏这一事件,取消收藏不记录(或单独建一张事实表记录)
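装载前可以先用下面这个简单的校验查询查看当日ODS收藏日志中各type的分布,确认"每日只处理insert"的口径没有遗漏(查询为示意,口径以实际同步工具的约定为准):
sql
-- 校验示意:查看当日收藏日志的操作类型分布
select
    type,
    count(*)                     as record_cnt,
    count(distinct data.user_id) as user_cnt,
    count(distinct data.sku_id)  as sku_cnt
from ods_favor_info_inc
where dt='2022-06-09'
group by type
order by record_cnt desc;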
数据分析应用
sql
-- 1. 用户收藏行为分析
SELECT
user_id,
COUNT(*) as total_favors,
COUNT(DISTINCT sku_id) as unique_sku_favors,
MIN(create_time) as first_favor_time,
MAX(create_time) as latest_favor_time
FROM dwd_interaction_favor_add_inc
WHERE dt = '2022-06-09'
GROUP BY user_id
HAVING COUNT(*) > 5; -- 找出重度收藏用户
-- 2. 商品收藏热度排行
SELECT
sku_id,
COUNT(*) as favor_count,
COUNT(DISTINCT user_id) as favored_users,
DATE_FORMAT(MIN(create_time), 'yyyy-MM-dd') as first_favor_date
FROM dwd_interaction_favor_add_inc
WHERE dt BETWEEN '2022-06-01' AND '2022-06-09'
GROUP BY sku_id
ORDER BY favor_count DESC
LIMIT 100;
-- 3. 收藏转化漏斗分析(需要关联订单表)
WITH favor_stats AS (
SELECT
f.user_id,
f.sku_id,
f.create_time as favor_time,
o.order_time,
CASE WHEN o.order_id IS NOT NULL THEN 1 ELSE 0 END as is_converted
FROM dwd_interaction_favor_add_inc f
LEFT JOIN dwd_order_info_inc o
ON f.user_id = o.user_id
AND f.sku_id = o.sku_id
AND o.order_time > f.create_time
AND datediff(to_date(o.order_time), to_date(f.create_time)) <= 7 -- 7天内转化(Hive的date_add不支持三参数写法)
WHERE f.dt = '2022-06-01'
)
SELECT
COUNT(*) as total_favors,
SUM(is_converted) as converted_favors,
ROUND(SUM(is_converted) * 100.0 / COUNT(*), 2) as conversion_rate
FROM favor_stats;
3. 流量域页面浏览事务事实表
业务场景深度解析
- 页面类型分类:
  - 首页、商品详情页、购物车页、订单确认页、支付页
  - 活动页、品牌页、搜索结果页、个人中心页
- 用户路径分析:
  - 页面流:A→B→C
  - 跳出率:只访问一个页面就离开
  - 转化路径:浏览→点击→加购→下单→支付
字段详细解释
sql
-- 设备维度
mid_id: 'device_001' -- 设备唯一标识,用于设备级统计
brand: 'Apple' -- 设备品牌,分析用户设备偏好
model: 'iPhone 13' -- 设备型号
operate_system: 'iOS 15.4' -- 操作系统版本
version_code: '3.2.1' -- APP版本,用于分析版本迭代效果
-- 用户维度
user_id: 'user_1001' -- 登录用户ID(可能为空,未登录用户)
province_id: '440000' -- 省份ID,基于IP解析
-- 页面维度
page_id: 'home' -- 页面标识
last_page_id: 'search' -- 上一个页面,分析流量来源
page_item: 'sku_1001' -- 页面内容(如商品ID)
page_item_type: 'sku' -- 内容类型(sku、activity、brand等)
-- 流量来源
refer_id: 'channel_001' -- 营销渠道ID(广告渠道、社交媒体等)
from_pos_id: 'banner_01' -- 页面内的位置(如banner位)
from_pos_seq: '1' -- 同一位置多个内容的排序
-- 行为度量
during_time: 5000 -- 页面停留时长(毫秒),计算页面吸引力
view_time: '2022-06-08 10:30:45' -- 进入页面时间
session_id: 'session_001' -- 会话ID,同一会话内多次页面浏览
复杂技术点解析
sql
-- 1. 时间转换:为什么要用from_utc_timestamp?
-- 原始日志中的ts通常是UTC时间戳(如1654669845000)
-- 业务分析需要本地时间(GMT+8北京时间)
-- date_format(from_utc_timestamp(ts,'GMT+8'),'yyyy-MM-dd') 作用:
-- a. from_utc_timestamp(ts,'GMT+8'): 将UTC时间戳转为北京时间
-- b. date_format(...): 提取日期部分作为date_id
-- 2. 为什么设置hive.cbo.enable=false?
-- CBO(Cost Based Optimizer)是Hive基于成本的优化器
-- 本装载SQL大量读取struct、map等复杂类型字段,部分Hive版本的CBO在这类查询上可能生成非最优甚至出错的执行计划
-- 因此装载前临时关闭CBO,装载完成后再set hive.cbo.enable=true恢复(见完整代码)
-- 3. 页面浏览的会话划分逻辑
-- 会话(session)是用户连续访问的页面序列
-- 通常按以下规则划分:
-- a. 同一设备(mid_id)同一用户(user_id)的连续访问
-- b. 相邻页面访问间隔不超过30分钟
-- c. 会话ID在日志采集时生成
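如果日志侧没有生成session_id,也可以在数仓侧按"同一设备相邻两次访问间隔不超过30分钟"的规则自行切分会话。下面是基于窗口函数的切分示意(session_seq为假设的派生列名,30分钟阈值可按业务调整):
sql
-- 会话切分示意:相邻两次浏览间隔超过30分钟(1800秒)即视为新会话
with t1 as (
    select
        mid_id,
        user_id,
        page_id,
        view_time,
        lag(view_time) over (partition by mid_id order by view_time) as prev_view_time
    from dwd_traffic_page_view_inc
    where dt = '2022-06-08'
),
t2 as (
    select
        t1.*,
        -- 设备的第一条记录,或与上一次浏览间隔超过30分钟,标记为新会话起点
        if(prev_view_time is null
           or unix_timestamp(view_time) - unix_timestamp(prev_view_time) > 1800, 1, 0) as new_session_flag
    from t1
)
select
    mid_id,
    user_id,
    page_id,
    view_time,
    -- 对会话起点标记做累计求和,得到设备内自增的会话序号
    sum(new_session_flag) over (partition by mid_id
                                order by view_time
                                rows between unbounded preceding and current row) as session_seq
from t2;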
流量分析实战
sql
-- 1. 基础流量指标计算
WITH daily_traffic AS (
SELECT
date_id,
COUNT(*) as pv, -- 页面浏览量
COUNT(DISTINCT mid_id) as uv, -- 访客数(基于设备)
COUNT(DISTINCT user_id) as login_uv, -- 登录用户数
COUNT(DISTINCT session_id) as sessions, -- 会话数
AVG(during_time) as avg_duration, -- 平均页面停留时长
SUM(CASE WHEN user_id IS NOT NULL THEN 1 ELSE 0 END) as login_pv -- 登录用户PV
FROM dwd_traffic_page_view_inc
WHERE dt = '2022-06-08'
GROUP BY date_id
)
SELECT * FROM daily_traffic;
-- 2. 页面热力图分析
SELECT
page_id,
COUNT(*) as pv,
COUNT(DISTINCT mid_id) as uv,
COUNT(DISTINCT user_id) as login_users,
AVG(during_time) as avg_duration,
PERCENTILE(during_time, 0.5) as median_duration, -- 中位数更抗异常
SUM(CASE WHEN last_page_id IS NULL THEN 1 ELSE 0 END) as entry_count, -- 入口页面次数
SUM(CASE WHEN during_time < 3000 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as bounce_rate -- 跳出率粗略估算:以停留不足3秒作为近似口径
FROM dwd_traffic_page_view_inc
WHERE dt = '2022-06-08'
GROUP BY page_id
ORDER BY pv DESC;
-- 3. 用户路径分析(页面流)
WITH page_flow AS (
SELECT
session_id,
page_id,
last_page_id,
view_time,
LAG(page_id) OVER (PARTITION BY session_id ORDER BY view_time) as prev_page,
LEAD(page_id) OVER (PARTITION BY session_id ORDER BY view_time) as next_page,
ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY view_time) as page_seq
FROM dwd_traffic_page_view_inc
WHERE dt = '2022-06-08'
AND session_id IS NOT NULL
)
SELECT
COALESCE(last_page_id, 'entry') as from_page,
page_id as to_page,
COUNT(*) as transition_count,
COUNT(DISTINCT session_id) as session_count
FROM page_flow
WHERE last_page_id IS NOT NULL OR page_seq = 1
GROUP BY COALESCE(last_page_id, 'entry'), page_id
ORDER BY transition_count DESC
LIMIT 20;
-- 4. 转化漏斗分析(以商品详情页到下单为例)
WITH user_journey AS (
SELECT
user_id,
session_id,
MAX(CASE WHEN page_id = 'good_detail' THEN 1 ELSE 0 END) as viewed_detail,
MAX(CASE WHEN page_id = 'cart' THEN 1 ELSE 0 END) as viewed_cart,
MAX(CASE WHEN page_id = 'trade' THEN 1 ELSE 0 END) as viewed_trade
FROM dwd_traffic_page_view_inc
WHERE dt = '2022-06-08'
AND user_id IS NOT NULL
GROUP BY user_id, session_id
)
SELECT
COUNT(*) as total_sessions,
SUM(viewed_detail) as detail_sessions,
SUM(viewed_cart) as cart_sessions,
SUM(viewed_trade) as trade_sessions,
ROUND(SUM(viewed_detail) * 100.0 / COUNT(*), 2) as detail_rate,
ROUND(SUM(viewed_cart) * 100.0 / SUM(viewed_detail), 2) as cart_conversion,
ROUND(SUM(viewed_trade) * 100.0 / SUM(viewed_cart), 2) as trade_conversion
FROM user_journey;
-- 5. 渠道效果分析
SELECT
channel,
refer_id,
COUNT(*) as pv,
COUNT(DISTINCT mid_id) as uv,
COUNT(DISTINCT user_id) as login_uv,
SUM(CASE WHEN page_id = 'good_detail' THEN 1 ELSE 0 END) as detail_pv,
SUM(CASE WHEN page_item_type = 'sku' THEN 1 ELSE 0 END) as sku_view_pv
FROM dwd_traffic_page_view_inc
WHERE dt = '2022-06-08'
AND refer_id IS NOT NULL
GROUP BY channel, refer_id
ORDER BY pv DESC;
4. 用户域用户注册事务事实表
业务场景深度解析
- 注册渠道分析:
  - 自然流量:用户主动下载注册
  - 广告渠道:通过广告点击下载注册
  - 社交分享:通过分享链接注册
  - 地推扫码:线下推广扫码注册
- 用户质量评估:
  - 注册设备分布
  - 注册地域分布
  - 注册时间分布
表关联逻辑深度解析
sql
-- 关键关联:用户信息表 LEFT JOIN 日志表
-- 为什么用LEFT JOIN?
-- 1. 以用户信息为主表,确保所有注册用户都被记录
-- 2. 日志表中可能没有某些用户的注册页面记录(原因可能包括):
-- a. 老用户注册时日志系统未上线
-- b. 注册时网络问题导致日志丢失
-- c. 通过API注册,不走页面流程
-- 关联条件详解:
-- ui.user_id = log.user_id
-- 日志筛选条件:
-- page.page_id='register' -- 注册页面
-- common.uid is not null -- 用户ID不为空(已注册)
-- dt='2022-06-08' -- 同一天(T+1处理,假设注册和日志同天)
-- 潜在问题与解决方案:
-- 1. 时间差问题:注册时间和日志时间可能不是同一天
-- 解决方案:扩大日志查询时间范围,如注册时间前后3天
-- 2. 多设备注册:一个用户的注册页日志可能来自多台设备,直接关联会产生多行
--    处理逻辑:按日志时间取最近的一条设备信息(去重示意见下方代码)
-- 3. 注册后立即登录:日志可能记录的是登录页面而非注册页面
-- 需要根据业务逻辑调整筛选条件
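针对上面第1、2两点,下面给出"扩大日志时间范围 + 每个用户只保留最近一条注册页日志"的子查询改写示意(日期范围、排序字段均为示意假设,可用来替换文末装载SQL中的log子查询):
sql
-- 日志侧去重示意:日志查询范围扩大到注册日期前后各一天,
-- 并按日志时间戳倒序,为每个用户只保留最近一条注册页记录,避免关联后数据发散
select
    province_id,
    brand,
    channel,
    model,
    mid_id,
    operate_system,
    user_id,
    version_code
from (
    select
        common.ar  province_id,
        common.ba  brand,
        common.ch  channel,
        common.md  model,
        common.mid mid_id,
        common.os  operate_system,
        common.uid user_id,
        common.vc  version_code,
        row_number() over (partition by common.uid order by ts desc) rn
    from ods_log_inc
    where dt between '2022-06-07' and '2022-06-09'   -- 扩大日志查询时间范围
      and page.page_id = 'register'
      and common.uid is not null
) t
where rn = 1;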
注册分析实战
sql
-- 1. 每日注册用户统计(含渠道分布)
SELECT
date_id,
COUNT(*) as total_registrations,
COUNT(CASE WHEN channel = 'appstore' THEN user_id END) as appstore_reg,
COUNT(CASE WHEN channel = 'huawei_store' THEN user_id END) as huawei_reg,
COUNT(CASE WHEN channel LIKE '%wechat%' THEN user_id END) as wechat_reg,
COUNT(CASE WHEN channel IS NULL THEN user_id END) as unknown_channel,
COUNT(DISTINCT province_id) as province_count
FROM dwd_user_register_inc
WHERE dt BETWEEN '2022-06-01' AND '2022-06-30'
GROUP BY date_id
ORDER BY date_id;
-- 2. 注册用户设备分析
SELECT
operate_system,
brand,
model,
COUNT(*) as user_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
FROM dwd_user_register_inc
WHERE dt = '2022-06-09'
GROUP BY operate_system, brand, model
ORDER BY user_count DESC
LIMIT 20;
-- 3. 注册时间分布分析(小时级)
SELECT
HOUR(create_time) as register_hour,
COUNT(*) as register_count,
AVG(COUNT(*)) OVER () as avg_per_hour,
COUNT(*) - AVG(COUNT(*)) OVER () as diff_from_avg
FROM dwd_user_register_inc
WHERE dt = '2022-06-09'
GROUP BY HOUR(create_time)
ORDER BY register_hour;
-- 4. 注册用户地域分布
SELECT
p.province_name, -- 需要关联省份维度表
COUNT(*) as register_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
FROM dwd_user_register_inc r
LEFT JOIN dim_province p ON r.province_id = p.province_id
WHERE r.dt = '2022-06-09'
GROUP BY p.province_name
ORDER BY register_count DESC;
-- 5. 注册后行为分析(关联其他事实表)
WITH registered_users AS (
SELECT
user_id,
create_time as register_time,
channel as register_channel
FROM dwd_user_register_inc
WHERE dt = '2022-06-01'
),
-- 注意:登录、订单相对用户都是一对多关系,若在同一个查询里同时LEFT JOIN两张表,
-- 订单金额会按登录行数重复累加,因此先分别聚合再与注册用户关联
login_7d AS (
    SELECT
        r.user_id,
        MAX(1) as has_login_7d -- 注册后7天内是否有登录
    FROM registered_users r
    JOIN dwd_user_login_inc l
      ON r.user_id = l.user_id
    WHERE l.dt BETWEEN '2022-06-01' AND '2022-06-08'
      AND l.login_time >= r.register_time
      AND datediff(to_date(l.login_time), to_date(r.register_time)) <= 7
    GROUP BY r.user_id
),
order_7d AS (
    SELECT
        r.user_id,
        MAX(1) as has_order_7d,                -- 注册后7天内是否有下单
        SUM(o.order_amount) as order_amount_7d -- 注册后7天内下单金额
    FROM registered_users r
    JOIN dwd_order_info_inc o
      ON r.user_id = o.user_id
    WHERE o.dt BETWEEN '2022-06-01' AND '2022-06-08'
      AND o.order_time >= r.register_time
      AND datediff(to_date(o.order_time), to_date(r.register_time)) <= 7
    GROUP BY r.user_id
),
user_activities AS (
    SELECT
        r.user_id,
        r.register_time,
        r.register_channel,
        COALESCE(l.has_login_7d, 0) as has_login_7d,
        COALESCE(o.has_order_7d, 0) as has_order_7d,
        COALESCE(o.order_amount_7d, 0) as order_amount_7d
    FROM registered_users r
    LEFT JOIN login_7d l ON r.user_id = l.user_id
    LEFT JOIN order_7d o ON r.user_id = o.user_id
)
SELECT
register_channel,
COUNT(*) as total_registrations,
SUM(has_login_7d) as active_users_7d,
SUM(has_order_7d) as paying_users_7d,
ROUND(SUM(has_login_7d) * 100.0 / COUNT(*), 2) as login_rate_7d,
ROUND(SUM(has_order_7d) * 100.0 / COUNT(*), 2) as conversion_rate_7d,
AVG(order_amount_7d) as avg_order_amount_7d
FROM user_activities
GROUP BY register_channel
ORDER BY conversion_rate_7d DESC;
5. 用户域用户登录事务事实表
业务场景深度解析
- 登录方式:
  - 账号密码登录
  - 手机验证码登录
  - 第三方授权登录(微信、QQ)
  - 生物识别登录(指纹、人脸)
- 会话管理:
  - 会话超时:通常30分钟无操作后需要重新登录
  - 会话保持:APP端可能长期保持登录状态
  - 多设备登录:一个账号同时在多个设备登录
技术实现深度解析
sql
-- 核心逻辑:每个会话的第一次页面浏览视为登录
-- 为什么这样设计?
-- 1. 实际登录可能没有独立的日志事件
-- 2. 登录后的第一个页面浏览可以准确标识登录时间
-- 3. 避免重复记录同一会话内的多次页面刷新
-- 窗口函数详解:
-- row_number() over (partition by common.sid order by ts) rn
-- partition by common.sid: 按会话分组
-- order by ts: 按时间戳排序
-- rn = 1: 取每个会话的第一条记录
-- 筛选条件解释:
-- page is not null: 确保是页面浏览事件(不是其他事件如点击、曝光)
-- common.uid is not null: 确保是已登录用户(未登录用户common.uid为空)
-- dt = '2022-06-08': 处理当天的数据
-- 潜在问题:
-- 1. 会话ID频繁重置(如页面自动刷新、APP切后台后重开):同一用户短时间内可能被记为多次登录
--    解决方案:结合APP端的session管理,或对同一用户设置最小登录间隔(示意见下方代码)
-- 2. 未登录用户浏览:不会被记录
--    这符合业务定义:登录事实表只记录登录事件
-- 3. 同一用户多设备登录:会被记录为多次登录
--    这符合业务实际:每台设备的每次登录都应记录
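下面是"最小登录间隔"思路的一个示意:在取每个会话的第一条记录之后,再按用户维度过滤掉与上一次登录间隔过短的记录(60秒阈值为示意假设):
sql
-- 最小登录间隔去重示意:同一用户60秒内的多条"会话首条记录"只保留一条
with session_first as (
    select
        common.uid user_id,
        common.mid mid_id,
        ts,
        row_number() over (partition by common.sid order by ts) rn
    from ods_log_inc
    where dt = '2022-06-08'
      and page is not null
      and common.uid is not null
),
login_candidate as (
    select
        user_id,
        mid_id,
        ts,
        lag(ts) over (partition by user_id order by ts) as prev_ts
    from session_first
    where rn = 1
)
select
    user_id,
    mid_id,
    date_format(from_utc_timestamp(ts,'GMT+8'),'yyyy-MM-dd HH:mm:ss') login_time
from login_candidate
where prev_ts is null
   or ts - prev_ts > 60 * 1000;   -- ts为毫秒时间戳,与上次登录间隔不足60秒的记录视为同一次登录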
登录行为分析实战
sql
-- 1. 基础登录指标
SELECT
date_id,
COUNT(*) as login_times, -- 登录次数
COUNT(DISTINCT user_id) as dau, -- 日活跃用户
COUNT(DISTINCT mid_id) as devices, -- 活跃设备数
AVG(CASE WHEN channel = 'appstore' THEN 1 ELSE 0 END) as appstore_rate,
AVG(CASE WHEN channel LIKE '%wechat%' THEN 1 ELSE 0 END) as wechat_rate
FROM dwd_user_login_inc
WHERE dt = '2022-06-08'
GROUP BY date_id;
-- 2. 用户登录频次分析
WITH user_login_stats AS (
SELECT
user_id,
COUNT(DISTINCT date_id) as login_days, -- 登录天数(按自然日去重)
MIN(date_id) as first_login_date,
MAX(date_id) as last_login_date,
COUNT(DISTINCT brand) as device_brands,
COUNT(DISTINCT channel) as login_channels
FROM dwd_user_login_inc
WHERE dt BETWEEN '2022-06-01' AND '2022-06-30'
GROUP BY user_id
)
SELECT
login_days,
COUNT(*) as user_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage,
AVG(device_brands) as avg_brands,
AVG(login_channels) as avg_channels
FROM user_login_stats
GROUP BY login_days
ORDER BY login_days;
-- 3. 用户登录时段分析
SELECT
HOUR(login_time) as login_hour,
COUNT(*) as login_count,
COUNT(DISTINCT user_id) as unique_users,
COUNT(DISTINCT mid_id) as unique_devices,
ROUND(AVG(CASE WHEN channel = 'appstore' THEN 1 ELSE 0 END) * 100, 2) as appstore_percent
FROM dwd_user_login_inc
WHERE dt = '2022-06-09'
GROUP BY HOUR(login_time)
ORDER BY login_hour;
-- 4. 用户登录设备偏好分析(Hive不支持QUALIFY,改用子查询 + ROW_NUMBER过滤)
WITH device_stats AS (
    SELECT
        user_id,
        brand,
        model,
        COUNT(*) as login_cnt,             -- 该用户在此品牌/型号上的登录次数
        MAX(login_time) as last_login_time
    FROM dwd_user_login_inc
    WHERE dt BETWEEN '2022-06-01' AND '2022-06-30'
    GROUP BY user_id, brand, model
)
SELECT
    user_id,
    brand as preferred_brand, -- 用户最常用的设备品牌
    model as preferred_model, -- 用户最常用的设备型号
    device_count,             -- 用户使用过的品牌/型号组合数
    total_logins              -- 用户登录总次数
FROM (
    SELECT
        user_id,
        brand,
        model,
        COUNT(*) OVER (PARTITION BY user_id) as device_count,
        SUM(login_cnt) OVER (PARTITION BY user_id) as total_logins,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_cnt DESC, last_login_time DESC) as rn
    FROM device_stats
) t
WHERE rn = 1;
-- 5. 登录到行为转化分析
WITH login_sessions AS (
SELECT
l.user_id,
l.login_time,
l.channel,
l.mid_id,
-- 登录后1小时内是否有浏览
MAX(CASE WHEN p.view_time > l.login_time
          AND unix_timestamp(p.view_time) - unix_timestamp(l.login_time) <= 3600
     THEN 1 ELSE 0 END) as has_browse,
-- 登录后24小时内是否有下单
MAX(CASE WHEN o.order_time > l.login_time
          AND unix_timestamp(o.order_time) - unix_timestamp(l.login_time) <= 86400
     THEN 1 ELSE 0 END) as has_order
FROM dwd_user_login_inc l
LEFT JOIN dwd_traffic_page_view_inc p
ON l.user_id = p.user_id
AND l.dt = p.dt
LEFT JOIN dwd_order_info_inc o
ON l.user_id = o.user_id
AND o.dt = l.dt
WHERE l.dt = '2022-06-09'
GROUP BY l.user_id, l.login_time, l.channel, l.mid_id
)
SELECT
channel,
COUNT(*) as total_logins,
SUM(has_browse) as browse_after_login,
SUM(has_order) as order_after_login,
ROUND(SUM(has_browse) * 100.0 / COUNT(*), 2) as browse_rate,
ROUND(SUM(has_order) * 100.0 / COUNT(*), 2) as order_rate
FROM login_sessions
GROUP BY channel
ORDER BY total_logins DESC;
三、数据治理与质量控制
1. 数据一致性检查
sql
-- 检查各事实表的数据完整性
SELECT
'dwd_tool_coupon_used_inc' as table_name,
COUNT(*) as total_rows,
COUNT(DISTINCT dt) as partition_count,
MIN(dt) as min_date,
MAX(dt) as max_date,
SUM(CASE WHEN payment_time IS NULL THEN 1 ELSE 0 END) as null_payment_time
FROM dwd_tool_coupon_used_inc
WHERE dt >= '2022-06-01'
UNION ALL
SELECT
'dwd_interaction_favor_add_inc',
COUNT(*),
COUNT(DISTINCT dt),
MIN(dt),
MAX(dt),
SUM(CASE WHEN create_time IS NULL THEN 1 ELSE 0 END)
FROM dwd_interaction_favor_add_inc
WHERE dt >= '2022-06-01';
2. 数据血缘分析
ods_coupon_use_inc → dwd_tool_coupon_used_inc
ods_favor_info_inc → dwd_interaction_favor_add_inc
ods_log_inc → dwd_traffic_page_view_inc
ods_user_info_inc + ods_log_inc → dwd_user_register_inc
ods_log_inc → dwd_user_login_inc
3. 性能优化建议
sql
-- 1. 分区优化
-- 数据量大时可在dt之外按年/月/日建立多级分区(示例),查询时利用分区裁剪减少扫描量
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING)
-- 2. 索引优化
-- 注意:Hive 3.0起已移除CREATE INDEX功能
-- 可改用ORC自带的行组索引与布隆过滤器加速user_id等高频过滤字段的查询(仅对设置后新写入的数据生效)
ALTER TABLE dwd_user_login_inc SET TBLPROPERTIES ("orc.bloom.filter.columns" = "user_id");
-- 3. 数据压缩
TBLPROPERTIES ("orc.compress" = "snappy") -- 已配置
-- 4. 小文件合并
-- 设置合并参数
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;
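对于已经产生小文件的历史分区,也可以用ORC表自带的合并命令做事后合并(分区值为示例):
sql
-- 合并指定分区内的ORC小文件(CONCATENATE仅适用于ORC/RCFile存储格式)
ALTER TABLE dwd_traffic_page_view_inc PARTITION (dt='2022-06-08') CONCATENATE;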
四、业务应用场景总结
1. 运营分析
- 优惠券使用率监控
- 用户增长趋势分析
- 渠道效果评估
- 用户活跃度监控
2. 用户分析
- 用户行为路径分析
- 用户兴趣偏好挖掘
- 用户生命周期管理
- 用户分群与标签
3. 产品分析
- 页面功能使用分析
- 用户交互行为分析
- 产品迭代效果评估
- A/B测试效果验证
4. 商业分析
- 营销活动ROI计算
- 用户价值评估
- 预测模型构建
- 业务健康度监控
五、常见问题与解决方案
1. 数据延迟问题
sql
-- 解决方案:建立数据延迟监控
SELECT
table_name,
MAX(dt) as latest_partition,
DATEDIFF(CURRENT_DATE, MAX(dt)) as delay_days
FROM (
SELECT 'dwd_tool_coupon_used_inc' as table_name, MAX(dt) as dt
FROM dwd_tool_coupon_used_inc
UNION ALL
SELECT 'dwd_user_login_inc', MAX(dt)
FROM dwd_user_login_inc
) t
GROUP BY table_name;
2. 数据质量监控
sql
-- 监控关键字段的null值率
SELECT
dt,
table_name,
field_name,
total_count,
null_count,
ROUND(null_count * 100.0 / total_count, 2) as null_rate
FROM (
SELECT
dt,
'dwd_user_register_inc' as table_name,
'channel' as field_name,
COUNT(*) as total_count,
SUM(CASE WHEN channel IS NULL THEN 1 ELSE 0 END) as null_count
FROM dwd_user_register_inc
WHERE dt = '2022-06-09'
GROUP BY dt
) t
WHERE null_count * 100.0 / total_count > 5; -- 设置阈值(WHERE中不能直接引用select别名null_rate)
3. 数据回溯处理
sql
-- 当需要重新计算历史数据时
-- 1. 清除历史分区
ALTER TABLE dwd_tool_coupon_used_inc DROP IF EXISTS PARTITION (dt='2022-06-08');
-- 2. 重新装载数据
INSERT OVERWRITE TABLE dwd_tool_coupon_used_inc PARTITION(dt='2022-06-08')
SELECT ... FROM ods_coupon_use_inc WHERE dt='2022-06-08' AND ...;
六、完整代码
sql
-- 六、工具域优惠券使用(支付)事务事实表
-- 建表语句
DROP TABLE IF EXISTS dwd_tool_coupon_used_inc;
CREATE EXTERNAL TABLE dwd_tool_coupon_used_inc
(
`id` STRING COMMENT '编号',
`coupon_id` STRING COMMENT '优惠券ID',
`user_id` STRING COMMENT '用户ID',
`order_id` STRING COMMENT '订单ID',
`date_id` STRING COMMENT '日期ID',
`payment_time` STRING COMMENT '使用(支付)时间'
) COMMENT '优惠券使用(支付)事务事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dwd/dwd_tool_coupon_used_inc/'
TBLPROPERTIES ("orc.compress" = "snappy");
-- 工具域优惠券使用(支付)事务事实表首日数据装载
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_tool_coupon_used_inc partition(dt)
select
data.id,
data.coupon_id,
data.user_id,
data.order_id,
date_format(data.used_time,'yyyy-MM-dd') date_id,
data.used_time,
date_format(data.used_time,'yyyy-MM-dd')
from ods_coupon_use_inc
where dt='2022-06-08'
and type='bootstrap-insert'
and data.used_time is not null;
-- 工具域优惠券使用(支付)事务事实表每日数据装载
insert overwrite table dwd_tool_coupon_used_inc partition(dt='2022-06-09')
select
data.id,
data.coupon_id,
data.user_id,
data.order_id,
date_format(data.used_time,'yyyy-MM-dd') date_id,
data.used_time
from ods_coupon_use_inc
where dt='2022-06-09'
and type='update'
and array_contains(map_keys(old),'used_time');
-- 七、互动域收藏商品事务事实表
-- 建表语句
DROP TABLE IF EXISTS dwd_interaction_favor_add_inc;
CREATE EXTERNAL TABLE dwd_interaction_favor_add_inc
(
`id` STRING COMMENT '编号',
`user_id` STRING COMMENT '用户ID',
`sku_id` STRING COMMENT 'SKU_ID',
`date_id` STRING COMMENT '日期ID',
`create_time` STRING COMMENT '收藏时间'
) COMMENT '互动域收藏商品事务事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dwd/dwd_interaction_favor_add_inc/'
TBLPROPERTIES ("orc.compress" = "snappy");
-- 互动域收藏商品事务事实表首日数据装载
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_interaction_favor_add_inc partition(dt)
select
data.id,
data.user_id,
data.sku_id,
date_format(data.create_time,'yyyy-MM-dd') date_id,
data.create_time,
date_format(data.create_time,'yyyy-MM-dd')
from ods_favor_info_inc
where dt='2022-06-08'
and type = 'bootstrap-insert';
-- 互动域收藏商品事务事实表每日数据装载
insert overwrite table dwd_interaction_favor_add_inc partition(dt='2022-06-09')
select
data.id,
data.user_id,
data.sku_id,
date_format(data.create_time,'yyyy-MM-dd') date_id,
data.create_time
from ods_favor_info_inc
where dt='2022-06-09'
and type = 'insert';
-- 八、流量域页面浏览事务事实表
-- 建表语句
DROP TABLE IF EXISTS dwd_traffic_page_view_inc;
CREATE EXTERNAL TABLE dwd_traffic_page_view_inc
(
`province_id` STRING COMMENT '省份ID',
`brand` STRING COMMENT '手机品牌',
`channel` STRING COMMENT '渠道',
`is_new` STRING COMMENT '是否首次启动',
`model` STRING COMMENT '手机型号',
`mid_id` STRING COMMENT '设备ID',
`operate_system` STRING COMMENT '操作系统',
`user_id` STRING COMMENT '会员ID',
`version_code` STRING COMMENT 'APP版本号',
`page_item` STRING COMMENT '目标ID',
`page_item_type` STRING COMMENT '目标类型',
`last_page_id` STRING COMMENT '上页ID',
`page_id` STRING COMMENT '页面ID ',
`from_pos_id` STRING COMMENT '点击坑位ID',
`from_pos_seq` STRING COMMENT '点击坑位位置',
`refer_id` STRING COMMENT '营销渠道ID',
`date_id` STRING COMMENT '日期ID',
`view_time` STRING COMMENT '跳入时间',
`session_id` STRING COMMENT '所属会话ID',
`during_time` BIGINT COMMENT '持续时间毫秒'
) COMMENT '流量域页面浏览事务事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dwd/dwd_traffic_page_view_inc'
TBLPROPERTIES ('orc.compress' = 'snappy');
-- 流量域页面浏览事务事实表数据装载
set hive.cbo.enable=false;
insert overwrite table dwd_traffic_page_view_inc partition (dt='2022-06-08')
select
common.ar province_id,
common.ba brand,
common.ch channel,
common.is_new is_new,
common.md model,
common.mid mid_id,
common.os operate_system,
common.uid user_id,
common.vc version_code,
page.item page_item,
page.item_type page_item_type,
page.last_page_id,
page.page_id,
page.from_pos_id,
page.from_pos_seq,
page.refer_id,
date_format(from_utc_timestamp(ts,'GMT+8'),'yyyy-MM-dd') date_id,
date_format(from_utc_timestamp(ts,'GMT+8'),'yyyy-MM-dd HH:mm:ss') view_time,
common.sid session_id,
page.during_time
from ods_log_inc
where dt='2022-06-08'
and page is not null;
set hive.cbo.enable=true;
-- 九、用户域用户注册事务事实表
-- 建表语句
DROP TABLE IF EXISTS dwd_user_register_inc;
CREATE EXTERNAL TABLE dwd_user_register_inc
(
`user_id` STRING COMMENT '用户ID',
`date_id` STRING COMMENT '日期ID',
`create_time` STRING COMMENT '注册时间',
`channel` STRING COMMENT '应用下载渠道',
`province_id` STRING COMMENT '省份ID',
`version_code` STRING COMMENT '应用版本',
`mid_id` STRING COMMENT '设备ID',
`brand` STRING COMMENT '设备品牌',
`model` STRING COMMENT '设备型号',
`operate_system` STRING COMMENT '设备操作系统'
) COMMENT '用户域用户注册事务事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dwd/dwd_user_register_inc/'
TBLPROPERTIES ("orc.compress" = "snappy");
-- 用户域用户注册事务事实表首日数据装载
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_user_register_inc partition(dt)
select
ui.user_id,
date_format(create_time,'yyyy-MM-dd') date_id,
create_time,
channel,
province_id,
version_code,
mid_id,
brand,
model,
operate_system,
date_format(create_time,'yyyy-MM-dd')
from
(
select
data.id user_id,
data.create_time
from ods_user_info_inc
where dt='2022-06-08'
and type='bootstrap-insert'
)ui
left join
(
select
common.ar province_id,
common.ba brand,
common.ch channel,
common.md model,
common.mid mid_id,
common.os operate_system,
common.uid user_id,
common.vc version_code
from ods_log_inc
where dt='2022-06-08'
and page.page_id='register'
and common.uid is not null
)log
on ui.user_id=log.user_id;
-- 用户域用户注册事务事实表每日数据装载
insert overwrite table dwd_user_register_inc partition(dt='2022-06-09')
select
ui.user_id,
date_format(create_time,'yyyy-MM-dd') date_id,
create_time,
channel,
province_id,
version_code,
mid_id,
brand,
model,
operate_system
from
(
select
data.id user_id,
data.create_time
from ods_user_info_inc
where dt='2022-06-09'
and type='insert'
)ui
left join
(
select
common.ar province_id,
common.ba brand,
common.ch channel,
common.md model,
common.mid mid_id,
common.os operate_system,
common.uid user_id,
common.vc version_code
from ods_log_inc
where dt='2022-06-09'
and page.page_id='register'
and common.uid is not null
)log
on ui.user_id=log.user_id;
-- 十、用户域用户登录事务事实表
-- 建表语句
DROP TABLE IF EXISTS dwd_user_login_inc;
CREATE EXTERNAL TABLE dwd_user_login_inc
(
`user_id` STRING COMMENT '用户ID',
`date_id` STRING COMMENT '日期ID',
`login_time` STRING COMMENT '登录时间',
`channel` STRING COMMENT '应用下载渠道',
`province_id` STRING COMMENT '省份ID',
`version_code` STRING COMMENT '应用版本',
`mid_id` STRING COMMENT '设备ID',
`brand` STRING COMMENT '设备品牌',
`model` STRING COMMENT '设备型号',
`operate_system` STRING COMMENT '设备操作系统'
) COMMENT '用户域用户登录事务事实表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dwd/dwd_user_login_inc/'
TBLPROPERTIES ("orc.compress" = "snappy");
-- 用户域用户登录事务事实表数据装载
insert overwrite table dwd_user_login_inc partition (dt = '2022-06-08')
select user_id,
date_format(from_utc_timestamp(ts, 'GMT+8'), 'yyyy-MM-dd') date_id,
date_format(from_utc_timestamp(ts, 'GMT+8'), 'yyyy-MM-dd HH:mm:ss') login_time,
channel,
province_id,
version_code,
mid_id,
brand,
model,
operate_system
from (
select user_id,
channel,
province_id,
version_code,
mid_id,
brand,
model,
operate_system,
ts
from (select common.uid user_id,
common.ch channel,
common.ar province_id,
common.vc version_code,
common.mid mid_id,
common.ba brand,
common.md model,
common.os operate_system,
ts,
row_number() over (partition by common.sid order by ts) rn
from ods_log_inc
where dt = '2022-06-08'
and page is not null
and common.uid is not null) t1
where rn = 1
) t2;
通过以上详细分析,可以看出DWD层事实表设计充分考虑了业务需求和技术实现,为上层的数据分析和应用提供了稳定、准确、高效的明细数据基础。