尚硅谷大数据项目《在线教育之离线数仓》笔记006

视频地址:尚硅谷大数据项目《在线教育之离线数仓》_哔哩哔哩_bilibili

目录

[第11章 数仓开发之ADS层](#第11章 数仓开发之ADS层)

P087

P088

P089

P090

P091

P092

P093

P094

P095

P096

P097

P098

P099

P100

P101

P102

P103

P104

P105

P106

P107

P108

P109

P110

P111


第11章 数仓开发之ADS层

P087

++第11章 数仓开发之ADS层++

++11.1 流量主题++

++11.1.1 各来源流量统计++

atguigu@node001 \~\]$ nohup /opt/module/hive/hive-3.1.2/bin/hive \& \[atguigu@node001 \~\]$ nohup /opt/module/hive/hive-3.1.2/bin/hive --service hiveserver2 \&

bash 复制代码
[atguigu@node001 hadoop]$ cd /opt/module/hive/hive-3.1.2/
[atguigu@node001 hive-3.1.2]$ nohup bin/hive &
[1] 11485
[atguigu@node001 hive-3.1.2]$ nohup: 忽略输入并把输出追加到"nohup.out"
 
[atguigu@node001 hive-3.1.2]$ nohup bin/hive --service hiveserver2 &
[2] 11626
[atguigu@node001 hive-3.1.2]$ nohup: 忽略输入并把输出追加到"nohup.out"
[atguigu@node001 hive-3.1.2]$ 
bash 复制代码
[atguigu@node001 ~]$ nohup /opt/module/hive/hive-3.1.2/bin/hive &
[1] 3815
[atguigu@node001 ~]$ nohup: 忽略输入并把输出追加到"nohup.out"

[atguigu@node001 ~]$ 
[atguigu@node001 ~]$ jpsall
================ node001 ================
2848 NameNode
3634 JobHistoryServer
3014 DataNode
3815 RunJar
3933 Jps
3422 NodeManager
================ node002 ================
2113 NodeManager
2531 Jps
1989 ResourceManager
1783 DataNode
================ node003 ================
1908 SecondaryNameNode
2055 NodeManager
2270 Jps
1823 DataNode
[atguigu@node001 ~]$ 
[atguigu@node001 ~]$ nohup /opt/module/hive/hive-3.1.2/bin/hive --service hiveserver2 &
[2] 4038
[atguigu@node001 ~]$ nohup: 忽略输入并把输出追加到"nohup.out"

[atguigu@node001 ~]$ 
[atguigu@node001 ~]$ jpsall
================ node001 ================
2848 NameNode
4225 Jps
3634 JobHistoryServer
4038 RunJar
3014 DataNode
3815 RunJar
3422 NodeManager
================ node002 ================
2113 NodeManager
1989 ResourceManager
1783 DataNode
2591 Jps
================ node003 ================
1908 SecondaryNameNode
2055 NodeManager
2334 Jps
1823 DataNode
[atguigu@node001 ~]$ 

P088

++11.1.2 路径分析++

桑基图

P089

11.1.3 各来源下单统计

P090

++11.2 用户主题++

++11.2.1 用户变动统计++

P091

11.2.2 用户留存率

P092

11.2.3 用户新增活跃统计

2023-09-04 10:04:35\] org.apache.hadoop.hive.ql.parse.SemanticException:Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session 1af1be50-bc68-46e3-a27f-ff9bf04c44c0

bash 复制代码
[atguigu@node001 ~]$ cd /opt/module/hive/hive-3.1.2/
[atguigu@node001 hive-3.1.2]$ bin/hive

hive (default)> set hive.spark.client.server.connect.timeout=300000ms;
2023-09-04 10:14:03,044 INFO  [fd295ca1-5b9b-4f12-bcff-c5db47bee932 main] conf.HiveConf (HiveConf.java:getLogIdVar(5040)) - Using the default value passed in for log id: fd295ca1-5b9b-4f12-bcff-c5db47bee932
Query returned non-zero code: 1, cause: Cannot modify hive.spark.client.server.connect.timeout at runtime. It is in the list of parameters that can't be modified at runtime or is prefixed by a restricted variable
2023-09-04 10:14:03,091 INFO  [fd295ca1-5b9b-4f12-bcff-c5db47bee932 main] conf.HiveConf (HiveConf.java:getLogIdVar(5040)) - Using the default value passed in for log id: fd295ca1-5b9b-4f12-bcff-c5db47bee932
2023-09-04 10:14:03,091 INFO  [fd295ca1-5b9b-4f12-bcff-c5db47bee932 main] session.SessionState (SessionState.java:resetThreadName(452)) - Resetting thread name to  main
hive (default)> 
复制代码
set hive.spark.client.server.connect.timeout=300000ms;

P093

11.2.4 用户行为漏斗分析

P094

11.2.5 新增交易用户统计

P095

11.2.6 各年龄段下单用户数

P096

11.3 商品主题

11.3.1 各分类课程交易统计

P097

11.3.3 各课程交易统计

P098

11.3.4 各课程评价统计

P099

11.3.5 各分类课程试听留存统计

P100

11.3.6 各学科试听留存统计

P101

11.4 交易主题

11.4.1 交易综合统计

11.4.2 各省份交易统计

P102

11.5 考试主题

11.5.1 各试卷平均统计

P103

11.5.2 各课程考试相关指标统计

P104

11.5.3 各试卷成绩分布

P105

11.5.4 各题正确率

P106

11.6 播放主题

11.6.1 各章节视频播放情况统计

P107

11.6.2 各课程视频播放情况统计

P108

11.7 完课主题

11.7.1 各课程完课人数统计

bash 复制代码
-- 11.7 完课主题
-- 11.7.1 各课程完课人数统计
-- 1)建表语句
DROP TABLE IF EXISTS ads_complete_complete_user_count_per_course;
CREATE EXTERNAL TABLE ads_complete_complete_user_count_per_course
(
    `dt`          STRING COMMENT '统计日期',
    `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
    `course_id`   STRING COMMENT '课程 id',
    `user_count`  BIGINT COMMENT '完课人数'
) COMMENT '各课程完课人数'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/edu/ads/ads_complete_complete_user_count_per_course/';


-- 2)数据装载
-- 统计用户对应一个课程完成多少个章节
insert overwrite table edu2077.ads_complete_complete_user_count_per_course
select dt,
       recent_days,
       course_id,
       user_count
from edu2077.ads_complete_complete_user_count_per_course
union
select '2022-02-21'   dt,
       recent_days,
       course_id,
       count(user_id) order_count
from (select course_id,
             id,
             user_id,
             max_chapter_complete_date
      from (select course_id,
                   user_id,
                   max(first_complete_date)   max_chapter_complete_date,
                   count(first_complete_date) user_chapter_complete_count
            from edu2077.dwd_learn_play_stats_full
            where dt = '2022-02-21'
            group by course_id, user_id
           ) chapter_complete
               left join
           (select id,
                   chapter_num
            from edu2077.dim_course_full
            where dt = '2022-02-21') dim_course
           on chapter_complete.course_id = dim_course.id
      where user_chapter_complete_count = chapter_num
     ) course_complete
         lateral view explode(array(1, 7, 30)) tmp as recent_days
where max_chapter_complete_date >= date_add('2022-02-21', -recent_days + 1)
group by recent_days, course_id;

select * from ads_complete_complete_user_count_per_course;



insert overwrite table edu2077.ads_complete_complete_user_count_per_course
select *
from edu2077.ads_complete_complete_user_count_per_course
union
select '2022-02-21'   dt,
       recent_days,
       course_id,
       count(user_id) user_conunt
from (
         -- 完整的用户数据
         select course_id,
                user_id,
                first_complete_date
         from (
                  -- 统计用户对应一个课程完成多少个章节
                  select course_id,
                         user_id,
                         max(first_complete_date)   first_complete_date,
                         count(first_complete_date) user_complete_nums
                  from dwd_learn_play_stats_full
                  where dt = '2022-02-21'
                  group by course_id, user_id
              ) t1
                  left join (
             -- 统计课程一共有多少章节
             select id,
                    chapter_num
             from dim_course_full
             where dt = '2022-02-21'
         ) t2
                            on t1.course_id = t2.id and user_complete_nums = chapter_num
     ) t3 lateral view explode(array(1, 7, 30)) tmp as recent_days
where first_complete_date <= '2022-02-21'
  and first_complete_date >= date_sub('2022-02-21', recent_days - 1)
group by recent_days, course_id;


select * from ads_complete_complete_user_count_per_course;

P109

11.7.2 完课综合指标

bash 复制代码
-- 11.7.2 完课综合指标
-- 1)建表语句
DROP TABLE IF EXISTS ads_complete_complete_stats;
CREATE EXTERNAL TABLE ads_complete_complete_stats
(
    `dt`                         STRING COMMENT '统计日期',
    `recent_days`                BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
    `user_complete_count`        BIGINT COMMENT '完课人数',
    `user_course_complete_count` BIGINT COMMENT '完课人次'
) COMMENT '完课综合指标'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/edu/ads/ads_complete_complete_stats/';


-- 2)数据装载
insert overwrite table edu2077.ads_complete_complete_stats
select dt,
       recent_days,
       user_complete_count,
       user_course_complete_count
from edu2077.ads_complete_complete_stats
union
select '2022-02-21'            dt,
       recent_days,
       count(distinct user_id) user_complete_count,
       count(*)                user_course_complete_count
from (select course_id,
             id,
             user_id,
             max_chapter_complete_date
      from (select course_id,
                   user_id,
                   max(first_complete_date)   max_chapter_complete_date,
                   count(first_complete_date) user_chapter_complete_count
            from edu2077.dwd_learn_play_stats_full
            where dt = '2022-02-21'
            group by course_id, user_id
           ) chapter_complete
               left join
           (select id,
                   chapter_num
            from edu2077.dim_course_full
            where dt = '2022-02-21') dim_course
           on chapter_complete.course_id = dim_course.id
      where user_chapter_complete_count = chapter_num
     ) course_complete
         lateral view explode(array(1, 7, 30)) tmp as recent_days
where max_chapter_complete_date >= date_add('2022-02-21', -recent_days + 1)
group by recent_days;



select * from ads_complete_complete_stats;

insert overwrite table edu2077.ads_complete_complete_stats
select dt,
       recent_days,
       user_complete_count,
       user_course_complete_count
from edu2077.ads_complete_complete_stats
union
select '2022-02-21'            dt,
       recent_days,
       count(distinct user_id) user_complete_count,
       count(*)                user_course_complete_count
from (
         -- 完整的用户数据
         select course_id,
                user_id,
                first_complete_date
         from (
                  -- 统计用户对应一个课程完成多少个章节
                  select course_id,
                         user_id,
                         max(first_complete_date)   first_complete_date,
                         count(first_complete_date) user_complete_nums
                  from dwd_learn_play_stats_full
                  where dt = '2022-02-21'
                  group by course_id, user_id
              ) t1
                  left join (
             -- 统计课程一共有多少章节
             select id,
                    chapter_num
             from dim_course_full
             where dt = '2022-02-21'
         ) t2
                            on t1.course_id = t2.id and user_complete_nums = chapter_num
     ) t3 lateral view explode(array(1, 7, 30)) tmp as recent_days
where first_complete_date <= '2022-02-21'
  and first_complete_date >= date_sub('2022-02-21', recent_days - 1)
group by recent_days;

select * from ads_complete_complete_stats;

P110

11.7.3 各个课程人均完成章节数

bash 复制代码
-- 11.7.3 各个课程人均完成章节数
-- 1)建表语句
DROP TABLE IF EXISTS ads_complete_complete_chapter_count_per_course;
CREATE EXTERNAL TABLE ads_complete_complete_chapter_count_per_course
(
    `dt`                     STRING COMMENT '统计日期',
    `recent_days`            BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
    `course_id`              STRING COMMENT '课程 id',
    `complete_chapter_count` DECIMAL(16, 2) COMMENT '用户平均完成章节数'
) COMMENT '各课程人均完成章节视频数'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/edu/ads/ads_complete_complete_chapter_count_per_course/';


-- 2)数据装载
insert overwrite table edu2077.ads_complete_complete_chapter_count_per_course
select dt,
       recent_days,
       course_id,
       complete_chapter_count
from edu2077.ads_complete_complete_chapter_count_per_course
union
select '2022-02-21'                                         dt,
       recent_days,
       course_id,
       count(first_complete_date) / count(distinct user_id) complete_chapter_count
from edu2077.dwd_learn_play_stats_full
         lateral view explode(array(1, 7, 30)) tmp as recent_days
where dt = '2022-02-21'
  and first_complete_date is not null
  and first_complete_date >= date_add('2022-02-21', -recent_days + 1)
group by recent_days, course_id;



select * from ads_complete_complete_chapter_count_per_course;

insert overwrite table edu2077.ads_complete_complete_chapter_count_per_course
select *
from edu2077.ads_complete_complete_chapter_count_per_course
union
select '2022-02-21' dt,
       recent_days,
       course_id,
       cast(count(first_complete_date) / count(distinct user_id) as decimal(16, 2)) complete_chapter_count
from (
         -- 统计用户对应一个课程完成多少个章节
         -- 完成的章节数
         select user_id,
                course_id,
                max(first_complete_date)   first_complete_date,
                count(first_complete_date) user_complete_nums
         from dwd_learn_play_stats_full
         where dt = '2022-02-21'
         group by user_id, course_id
     ) t1 lateral view explode(array(1, 7, 30)) tmp as recent_days
group by recent_days, course_id;

select * from ads_complete_complete_chapter_count_per_course;

P111

11.8 数据装载脚本

相关推荐
黎阳之光10 分钟前
黎阳之光:全域实景立体管控,重构智慧电厂与变电站数字孪生新范式
大数据·人工智能·算法·安全·数字孪生
草莓熊Lotso16 分钟前
Linux 线程同步与互斥(一):彻底搞懂线程互斥原理、互斥量底层实现与 RAII 封装
linux·运维·服务器·开发语言·数据库·c++
feng_you_ying_li19 分钟前
linux之进程概念
linux
j_xxx404_19 分钟前
深入理解Linux底层存储:从物理磁盘架构到文件系统(inode/Block)原理
linux·运维·服务器·后端
Chuer_25 分钟前
详解智能问数的增效逻辑,全方位拆解智能问数实战场景
大数据·数据库·甘特图
嵌入式×边缘AI:打怪升级日志26 分钟前
深度剖析Linux按键驱动四种访问方式:从查询到异步通知
linux·运维·服务器
凉、介29 分钟前
从设备树到驱动源码:揭秘嵌入式 Linux 中 MMC 子系统的统一与差异
linux·驱动开发·笔记·学习·嵌入式·sd·emmc
咕咕姐与Ai32 分钟前
扣子(Coze)实战:秒出一条情感早安电台!工作流,把治愈内容做成了全自动流水线
大数据·人工智能·程序人生·语言模型·ai写作
Full Stack Developme34 分钟前
Linux 软连接与硬连接比较
linux·运维·服务器
云边有个稻草人37 分钟前
【Linux系统】第九节—进程状态续集+进程优先级+进程切换
linux·进程状态·进程优先级·linux进程调度算法·linux进程切换·死循环进程如何运行·pri and ni