Prerequisites:
COVID-19 dataset, MySQL, Hive, Sqoop, Davinci, and a Hadoop cluster environment
Stage 1: Data preparation
1. MySQL source table
mysql
CREATE DATABASE covid19 DEFAULT CHARSET 'utf8mb4';
USE covid19;
CREATE TABLE `covid_source` (
`id` INT NOT NULL AUTO_INCREMENT,
`province` VARCHAR(20),
`date` DATE,
`confirmed` INT,
`deaths` INT,
`recovered` INT,
PRIMARY KEY (`id`)
);
-- Insert statements (the rows can be inserted from Python or with plain SQL)
INSERT INTO `covid_source`
(province, date, confirmed, deaths, recovered) VALUES
('湖北','2023-01-01', 54306, 1320, 32415),
('广东','2023-01-01', 1523, 8, 1240),
...
(You can also use the LOAD DATA command to load the dataset directly into Hive.)
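The comment above notes that the rows can also be inserted from Python. A minimal sketch of that route follows. It assumes a DB-API driver such as `pymysql` is installed; the helper name `build_insert` is hypothetical, and the two sample rows are the ones already shown in the SQL.

```python
from datetime import date

# Column order matches covid_source above; `id` is AUTO_INCREMENT and is omitted.
COLUMNS = ("province", "date", "confirmed", "deaths", "recovered")


def build_insert(rows):
    """Return (sql, params) suitable for a DB-API cursor.executemany call.

    Hypothetical helper: values travel as parameters instead of being
    formatted into the SQL text, so the driver takes care of quoting.
    """
    cols = ", ".join(f"`{c}`" for c in COLUMNS)
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = f"INSERT INTO `covid_source` ({cols}) VALUES ({placeholders})"
    params = [tuple(row[c] for c in COLUMNS) for row in rows]
    return sql, params


sample_rows = [
    {"province": "湖北", "date": date(2023, 1, 1),
     "confirmed": 54306, "deaths": 1320, "recovered": 32415},
    {"province": "广东", "date": date(2023, 1, 1),
     "confirmed": 1523, "deaths": 8, "recovered": 1240},
]
sql, params = build_insert(sample_rows)
print(sql)

# With a live connection (an assumption, not part of the original steps):
#   import pymysql
#   conn = pymysql.connect(host="master", user="root", password="123456",
#                          database="covid19", charset="utf8mb4")
#   with conn.cursor() as cur:
#       cur.executemany(sql, params)
#   conn.commit()
```

Keeping the statement builder separate from the connection code makes it possible to check the generated SQL without a running MySQL server.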
Stage 2: Building the Hive data warehouse
1. Sqoop import into Hive (key command)
bash
sqoop import \
--connect jdbc:mysql://master:3306/covid19 \
--username root \
--password 123456 \
--table covid_source \
--hive-import \
--hive-table ods_covid_source \
--create-hive-table \
--fields-terminated-by '\001'
2. Hive layered modeling
hivesql
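-- ODS layer (raw data). With --create-hive-table, Sqoop creates this table during the import,
-- and the import fails if the table already exists. Run this DDL only when creating the table
-- by hand, and drop that flag in that case.
CREATE TABLE ods_covid_source (
id INT,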
province STRING,
`date` DATE,
confirmed INT,
deaths INT,
recovered INT
) STORED AS TEXTFILE;
-- DWD layer (cleaned data)
CREATE TABLE dwd_covid_stats AS
SELECT
province,
`date`,
confirmed,
deaths,
recovered,
(confirmed - deaths - recovered) AS current_confirmed
FROM ods_covid_source
WHERE `date` IS NOT NULL;
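To make the cleaning rule concrete, the DWD step can be mirrored in a few lines of Python. This is only an illustrative sketch: the function name `to_dwd` is hypothetical, and the second sample row has its date removed on purpose to show the filter.

```python
def to_dwd(ods_rows):
    """Mirror the DWD query: drop rows without a date, add current_confirmed."""
    dwd = []
    for row in ods_rows:
        if row["date"] is None:  # WHERE `date` IS NOT NULL
            continue
        cleaned = dict(row)
        # Active cases = cumulative confirmed minus deaths minus recoveries
        cleaned["current_confirmed"] = (
            row["confirmed"] - row["deaths"] - row["recovered"]
        )
        dwd.append(cleaned)
    return dwd


ods_rows = [
    {"province": "湖北", "date": "2023-01-01",
     "confirmed": 54306, "deaths": 1320, "recovered": 32415},
    # Date deliberately missing (illustration only): this row is filtered out.
    {"province": "广东", "date": None,
     "confirmed": 1523, "deaths": 8, "recovered": 1240},
]
dwd_rows = to_dwd(ods_rows)
print(dwd_rows[0]["current_confirmed"])  # 54306 - 1320 - 32415 = 20571
```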
Stage 3: Computing the analysis metrics
1. Core analysis SQL examples
hivesql
-- Cumulative confirmed cases per province (DWS layer)
CREATE TABLE dws_province_top AS
SELECT
province,
MAX(confirmed) AS total_confirmed
FROM dwd_covid_stats
GROUP BY province
ORDER BY total_confirmed DESC;
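-- National daily new cases (DWS layer)
-- The inner query aggregates to the national cumulative total per date; the outer query
-- subtracts the previous date's total. The first date has no previous row, so it reports 0.
CREATE TABLE dws_daily_growth AS
SELECT
`date`,
COALESCE(total_confirmed - LAG(total_confirmed, 1) OVER (ORDER BY `date`), 0) AS new_confirmed
FROM (
SELECT `date`, SUM(confirmed) AS total_confirmed
FROM dwd_covid_stats
GROUP BY `date`
) daily_total;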
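The two DWS metrics above can be checked by hand on a tiny dataset. The sketch below reimplements them in plain Python under the same assumption the SQL makes, namely that `confirmed` is a cumulative count. The function names and the figures for provinces "A" and "B" are made up for illustration.

```python
from collections import defaultdict


def province_top(rows):
    """Per-province MAX(confirmed), sorted descending (dws_province_top)."""
    best = {}
    for r in rows:
        p = r["province"]
        if p not in best or r["confirmed"] > best[p]:
            best[p] = r["confirmed"]
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


def daily_growth(rows):
    """National cumulative total per date, minus the previous date's total."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["date"]] += r["confirmed"]
    result, prev = [], None
    for d in sorted(totals):  # ISO date strings sort chronologically
        new = 0 if prev is None else totals[d] - prev
        result.append((d, new))
        prev = totals[d]
    return result


# Made-up figures, for illustration only.
rows = [
    {"province": "A", "date": "2023-01-01", "confirmed": 100},
    {"province": "B", "date": "2023-01-01", "confirmed": 40},
    {"province": "A", "date": "2023-01-02", "confirmed": 130},
    {"province": "B", "date": "2023-01-02", "confirmed": 45},
]
print(province_top(rows))  # [('A', 130), ('B', 45)]
print(daily_growth(rows))  # [('2023-01-01', 0), ('2023-01-02', 35)]
```

As in the SQL, "previous" means the previous date present in the data, so a gap in the calendar shows up as one larger jump rather than as several daily values.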
Stage 4: Sqoop export to MySQL
1. MySQL result table design
mysql
CREATE TABLE `ads_province_top` (
`province` VARCHAR(20) PRIMARY KEY,
`total_confirmed` INT
);
CREATE TABLE `ads_daily_growth` (
`date` DATE PRIMARY KEY,
`new_confirmed` INT
);
2. Sqoop export commands
bash
# Export the province ranking data (dws_province_top)
sqoop export \
--connect "jdbc:mysql://master:3306/covid19?useUnicode=true&characterEncoding=utf8" \
--username root \
--password 123456 \
--table ads_province_top \
--export-dir /user/hive/warehouse/dws_province_top \
--input-fields-terminated-by '\001'
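# Export the daily trend data (incremental export example)
# --update-key with --update-mode allowinsert updates rows whose date already exists
# and inserts the new ones, so the export can be re-run as new days arrive.
sqoop export \
--connect "jdbc:mysql://master:3306/covid19?useUnicode=true&characterEncoding=utf8" \
--username root \
--password 123456 \
--table ads_daily_growth \
--export-dir /user/hive/warehouse/dws_daily_growth \
--update-key date \
--update-mode allowinsert \
--input-fields-terminated-by '\001'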
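The `--input-fields-terminated-by '\001'` flag tells Sqoop that the files in the Hive warehouse directory separate fields with the control character `\x01` (Ctrl-A), which is Hive's default for text tables. A rough Python sketch of how one such line splits into columns is shown below. The function name `parse_hive_line` and the sample line are illustrative, not taken from a real export.

```python
FIELD_SEP = "\x01"  # the same character as '\001' in the Sqoop flags
HIVE_NULL = "\\N"   # Hive's default text representation of NULL


def parse_hive_line(line, columns):
    """Split one line of a Hive text file into a column -> value mapping."""
    values = line.rstrip("\n").split(FIELD_SEP)
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return {c: (None if v == HIVE_NULL else v) for c, v in zip(columns, values)}


# Illustrative line, shaped like a row of dws_province_top.
line = "湖北\x0154306\n"
record = parse_hive_line(line, ["province", "total_confirmed"])
print(record)  # {'province': '湖北', 'total_confirmed': '54306'}
```

If the exported tables can contain NULLs, it is probably worth also passing `--input-null-string '\\N'` and `--input-null-non-string '\\N'` to `sqoop export`, since Sqoop does not treat `\N` as NULL by default.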
Stage 5: Davinci visualization setup
1. Data source connection
- Add a MySQL data source in Davinci:
Type: MySQL
URL: jdbc:mysql://master:3306/covid19
Username/password: root / 123456
2. Dashboard design
- Component 1: Epidemic map
- Data model: SELECT province, total_confirmed FROM ads_province_top
- Chart type: China map
- Field mapping: province -> region, total_confirmed -> color intensity
- Component 2: Trend line chart
- Data model: SELECT date, new_confirmed FROM ads_daily_growth
- Chart type: line chart
- X axis: date, Y axis: new_confirmed
Result screenshot
