Hive, Day 3

1. Starting Hive's JDBC service (hiveserver2) in the background

In the command below:

0 is standard input (stdin)
1 is standard output (stdout)
2 is standard error (stderr)
nohup keeps the process running after the terminal session ends (it ignores the hangup signal)
the trailing & launches the process in the background

nohup hive --service hiveserver2 > /usr/local/soft/hive-3.1.2/log/hiveserver2.log 2>&1 &

Use jps to check for the RunJar process.

To stop the background JDBC service: kill -9 the PID of the RunJar process.
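Once it is up, you can verify the JDBC endpoint with beeline; a minimal sketch, assuming the default port 10000 (substitute your own host and username):

beeline -u jdbc:hive2://localhost:10000 -n root
-- then, inside the beeline session:
SHOW DATABASES;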

2. Hive WordCount

CREATE TABLE learn3.wordcount(
    word STRING COMMENT "word"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

INSERT INTO TABLE learn3.wordcount (word) VALUES ("hello,word"),("hello,java"),("hive,hello");

-- WordCount: count how many times each word appears

SELECT
    T1.col
    ,sum(num)
FROM (
    SELECT
        col2 as col,
        count(*) as num
    FROM (
        SELECT
            split(word,",")[0] as col1,
            split(word,",")[1] as col2
        FROM learn3.wordcount
    ) T GROUP BY T.col2
    UNION ALL
    SELECT
        col1 as col,
        count(*) as num
    FROM (
        SELECT
            split(word,",")[0] as col1,
            split(word,",")[1] as col2
        FROM learn3.wordcount
    ) T GROUP BY T.col1
) T1 GROUP BY T1.col;
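The same count can be written more concisely with EXPLODE, which is introduced later in this note; a sketch (the aliases t and w are illustrative):

SELECT
    t.w as word
    ,count(*) as num
FROM learn3.wordcount
LATERAL VIEW EXPLODE(split(word, ",")) t as w
GROUP BY t.w;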

3. WITH ... AS usage

Syntax:

WITH table1 AS (
    SELECT query 1
)
, table2 AS (
    SELECT query 2
)
[INSERT INTO TABLE ...] SELECT ... FROM ...

-- WordCount rewritten with WITH ... AS

WITH split_res AS (
    SELECT
        split(word,",")[0] as col1,
        split(word,",")[1] as col2
    FROM learn3.wordcount
)
, col1_count AS (
    SELECT
        col1 as col
        ,count(*) as num
    FROM split_res
    GROUP BY col1
)
, col2_count AS (
    SELECT
        col2 as col
        ,count(*) as num
    FROM split_res
    GROUP BY col2
)
SELECT
    T.col,
    sum(T.num)
FROM (
    SELECT * FROM col1_count
    UNION ALL
    SELECT * FROM col2_count
) T GROUP BY T.col;

4. Collection functions

COLLECT_LIST(column2)
Used together with GROUP BY column1: gathers the column2 values of each column1 group into one array.

COLLECT_SET(column2)
Same as COLLECT_LIST, but also deduplicates the values in the array.

-- Requirement:
Put identical values of one column into the same group;
put all num values that belong to the same word into one array.

CREATE TABLE learn3.wordcount2(
    word STRING COMMENT "word",
    num int COMMENT "count"
)
STORED AS TEXTFILE;

INSERT INTO TABLE learn3.wordcount2 (word,num) VALUES ("hello",1),("hello",2),("hive",3);

SELECT
    word
    ,COLLECT_LIST(num)
FROM learn3.wordcount2
GROUP BY word;

+--------+------------+
| word   | _c1        |
+--------+------------+
| hello  | [1,2,1,2]  |
| hive   | [3,3]      |
+--------+------------+

(the sample INSERT above was evidently run twice, hence the duplicated values)

SELECT
    word
    ,COLLECT_SET(num)
FROM learn3.wordcount2
GROUP BY word;

| hello  | [1,2]  |
| hive   | [3]    |

CREATE EXTERNAL TABLE IF NOT EXISTS learn3.student20(
    id STRING COMMENT "student ID",
    name STRING COMMENT "student name",
    age int COMMENT "age",
    gender STRING COMMENT "gender",
    subject STRING COMMENT "subject track"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE;

load data local inpath "/usr/local/soft/hive-3.1.2/data/student_20.txt" INTO TABLE learn3.student20;

Goal - produce one row per subject/gender combination, e.g.: 理科|男  1500100020|杭振凯,1500100019|娄曦之

① Concatenate the two pairs of columns:

WITH concat_stu AS (
    SELECT
        CONCAT(subject,"|",gender) as subject_gender
        ,CONCAT_WS("|",id,name) as id_name
    FROM learn3.student20
)
SELECT
    subject_gender
    ,CONCAT_WS(",",COLLECT_LIST(id_name))
FROM concat_stu
GROUP BY subject_gender;

| 文科|女 | ["1500100001|施笑槐","1500100007|尚孤风","1500100016|潘访烟","1500100018|骆怜雪"] |
| 文科|男 | ["1500100002|吕金鹏","1500100013|逯君昊"] |
| 理科|女 | ["1500100003|单乐蕊","1500100005|宣谷芹","1500100008|符半双","1500100012|梁易槐","1500100015|宦怀绿","1500100017|高芷天"] |
| 理科|男 | ["1500100004|葛德曜","1500100006|边昂雄","1500100009|沈德昌","1500100010|羿彦昌","1500100011|宰运华","1500100014|羿旭炎","1500100019|娄曦之","1500100020|杭振凯"] |

For reference, Hive's built-in help (DESC FUNCTION concat_ws) describes the function as:
concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator.

5. Row/column conversion

+-----------------+-------------------+
| subject_gender  | id_name           |
+-----------------+-------------------+
| 文科|女 | 1500100001|施笑槐 |
| 文科|男 | 1500100002|吕金鹏 |
| 理科|女 | 1500100003|单乐蕊 |
| 理科|男 | 1500100004|葛德曜 |
| 理科|女 | 1500100005|宣谷芹 |
| 理科|男 | 1500100006|边昂雄 |
| 文科|女 | 1500100007|尚孤风 |
| 理科|女 | 1500100008|符半双 |
| 理科|男 | 1500100009|沈德昌 |
| 理科|男 | 1500100010|羿彦昌 |
| 理科|男 | 1500100011|宰运华 |
| 理科|女 | 1500100012|梁易槐 |
| 文科|男 | 1500100013|逯君昊 |
| 理科|男 | 1500100014|羿旭炎 |
| 理科|女 | 1500100015|宦怀绿 |
| 文科|女 | 1500100016|潘访烟 |
| 理科|女 | 1500100017|高芷天 |
| 文科|女 | 1500100018|骆怜雪 |
| 理科|男 | 1500100019|娄曦之 |
| 理科|男 | 1500100020|杭振凯 |
+-----------------+-------------------+

After grouping (many rows merged into one row per group):

| 文科|女 | 1500100001|施笑槐,1500100007|尚孤风,1500100016|潘访烟,1500100018|骆怜雪 |
| 文科|男 | 1500100002|吕金鹏,1500100013|逯君昊 |
| 理科|女 | 1500100003|单乐蕊,1500100005|宣谷芹,1500100008|符半双,1500100012|梁易槐,1500100015|宦怀绿,1500100017|高芷天 |
| 理科|男 | 1500100004|葛德曜,1500100006|边昂雄,1500100009|沈德昌,1500100010|羿彦昌,1500100011|宰运华,1500100014|羿旭炎,1500100019|娄曦之,1500100020|杭振凯 |

Row-to-column (行转列):
Turn multiple rows into a single row.
How: use COLLECT_SET() or COLLECT_LIST() together with GROUP BY, storing each GROUP BY group's values in one collection.

Column-to-row (列转行):
Turn one row into multiple rows.
How: use EXPLODE(), optionally wrapped in LATERAL VIEW (both shown below).

Example:

+------------------+
| wordcount.word   |
+------------------+
| hello,word,hive  |
| hello,java       |
| hello,hive       |
| hello,word       |
| hello,java       |
| hive,hello       |
+------------------+

Result:

+----------+
| _u1.col  |
+----------+
| hello    |
| hive     |
| java     |
| word     |
| hello    |
| hive     |
+----------+

INSERT INTO TABLE learn3.wordcount (word) VALUES ("hello,word,hello,java,hello,spark");

The EXPLODE() function
Turns a collection held in one row into multiple rows (one output row per element):

SELECT
    EXPLODE(split(word,",")) as word
FROM learn3.wordcount;

CREATE TABLE learn3.movie(
    movie_name STRING COMMENT "movie title",
    types STRING COMMENT "movie genres"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

load data local inpath "/usr/local/soft/hive-3.1.2/data/moive.txt" INTO TABLE learn3.movie;

-- Split the genres in the types column and pair each genre with its movie title,
turning | 肖申克的救赎 | 犯罪/剧情 | into:

肖申克的救赎 犯罪
肖申克的救赎 剧情

-- This does NOT work: a UDTF such as EXPLODE cannot be selected alongside other columns
SELECT
    movie_name, EXPLODE(split(types,"/")) as type
FROM learn3.movie;

-- This works, using LATERAL VIEW:
SELECT
    movie_name, type
FROM learn3.movie LATERAL VIEW EXPLODE(split(types,"/")) view as type;

+-------------------+----------------+
| movie.movie_name  | movie.types    |
+-------------------+----------------+
| 肖申克的救赎 | 犯罪/剧情 |
| 霸王别姬 | 剧情/爱情/同性 |
| 阿甘正传 | 剧情/爱情 |
| 泰坦尼克号 | 剧情/爱情/灾难 |
| 这个杀手不太冷 | 剧情/动作/犯罪 |
+-------------------+----------------+

+-------------+-------+
| movie_name  | type  |
+-------------+-------+
| 肖申克的救赎 | 犯罪 |
| 肖申克的救赎 | 剧情 |
| 霸王别姬 | 剧情 |
| 霸王别姬 | 爱情 |
| 霸王别姬 | 同性 |
| 阿甘正传 | 剧情 |
| 阿甘正传 | 爱情 |
| 泰坦尼克号 | 剧情 |
| 泰坦尼克号 | 爱情 |
| 泰坦尼克号 | 灾难 |
| 这个杀手不太冷 | 剧情 |
| 这个杀手不太冷 | 动作 |
| 这个杀手不太冷 | 犯罪 |
+-------------+-------+

LATERAL VIEW EXPLODE(split(types,"/")) view as type - step by step:

① split() cuts the types string into an array
② EXPLODE() turns that array from one row into multiple rows
③ LATERAL VIEW wraps EXPLODE's output as a lateral (side) table named view, whose column is named type
④ each original row of learn3.movie is then joined with every row its lateral table produced (a per-row cross join)
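A related sketch: if you also need each genre's position inside the list, Hive's POSEXPLODE emits an index column alongside the value (the aliases t, idx, and type here are illustrative):

SELECT
    movie_name, idx, type
FROM learn3.movie LATERAL VIEW POSEXPLODE(split(types,"/")) t as idx, type;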

CREATE EXTERNAL TABLE IF NOT EXISTS learn3.student1(
    id STRING COMMENT "student ID",
    name STRING COMMENT "student name",
    age int COMMENT "age",
    gender STRING COMMENT "gender",
    clazz STRING COMMENT "class"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

load data local inpath "/usr/local/soft/hive-3.1.2/data/students.txt" INTO TABLE learn3.student1;

CREATE EXTERNAL TABLE IF NOT EXISTS learn3.score1(
    id STRING COMMENT "student ID",
    subject_id STRING COMMENT "subject ID",
    score int COMMENT "score"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

load data local inpath "/usr/local/soft/hive-3.1.2/data/score.txt" INTO TABLE learn3.score1;

Requirements:

1. For each gender, find the three oldest students.

-- Older Hive versions do not support this correlated-subquery form:

SELECT
    s1.*
FROM learn3.student1 as s1
WHERE 3 > (
    SELECT count(*) FROM learn3.student1 as s2
    WHERE s1.gender = s2.gender and s1.age < s2.age
);
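A portable alternative is a window function; a sketch (the alias age_pm is illustrative; ROW_NUMBER breaks ties arbitrarily, so students with equal ages may be cut off - use RANK to keep them all):

SELECT TT.*
FROM (
    SELECT
        s.*
        ,ROW_NUMBER() OVER(PARTITION BY gender ORDER BY age DESC) as age_pm
    FROM learn3.student1 s
) TT WHERE TT.age_pm <= 3;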

2. For each class, find the top three students by total score.

① Compute every student's total score:

WITH score_sum AS (
    SELECT
        id
        ,sum(score) as total_score
    FROM learn3.score1
    GROUP BY id
)
, student_score AS (
    SELECT
        T1.name
        ,T2.total_score
        ,T1.clazz
    FROM learn3.student1 T1
    JOIN score_sum T2 ON T1.id = T2.id
)

-- ② Rank within each class and keep the top three:
SELECT
    TT.*
FROM (
    SELECT
        T.name
        ,T.total_score
        ,T.clazz
        ,ROW_NUMBER() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as row_pm
    FROM student_score T
) TT
WHERE TT.row_pm <= 3;

-- Comparing the ranking functions (reusing the WITH clauses above):
SELECT
    T.name
    ,T.total_score
    ,T.clazz
    ,ROW_NUMBER() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as row_pm
    ,dense_rank() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as dense_pm
    ,rank() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as rank_pm
    ,percent_rank() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as percent_rank_pm
    ,cume_dist() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as cume_dist_pm
FROM student_score T;

row_number: no tied ranks; every row in a partition gets a distinct, consecutive rank.

* usage: select xxxx, row_number() over(partition by <group col> order by <sort col> desc) as rn from tb

dense_rank: tied rows share a rank, and the next rank increases by one (1,2,2,3).

* usage: select xxxx, dense_rank() over(partition by <group col> order by <sort col> desc) as rn from tb

rank: tied rows share a rank, and the following rank skips ahead (1,2,2,4).

* usage: select xxxx, rank() over(partition by <group col> order by <sort col> desc) as rn from tb

percent_rank: (rank - 1) / (rows in partition - 1)

* usage: select xxxx, percent_rank() over(partition by <group col> order by <sort col> desc) as rn from tb
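A self-contained sketch that makes the differences visible (inline literals, no table needed):

WITH t AS (
    SELECT 100 as score UNION ALL
    SELECT 90 UNION ALL
    SELECT 90 UNION ALL
    SELECT 80
)
SELECT
    score
    ,row_number() OVER(ORDER BY score DESC) as rn    -- 1,2,3,4
    ,rank()       OVER(ORDER BY score DESC) as rk    -- 1,2,2,4
    ,dense_rank() OVER(ORDER BY score DESC) as drk   -- 1,2,2,3
FROM t;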

Requirement 1:

Find the student(s) with the highest total score in each class.

CREATE TABLE learn3.student_score(
    name STRING COMMENT "student name",
    total_score int COMMENT "total score",
    clazz STRING COMMENT "class"
);

WITH score_sum AS (
    SELECT
        id
        ,sum(score) as total_score
    FROM learn3.score1
    GROUP BY id
)
INSERT INTO TABLE learn3.student_score
SELECT
    T1.name
    ,T2.total_score
    ,T1.clazz
FROM learn3.student1 T1
JOIN score_sum T2 ON T1.id = T2.id;

-- Use max() over a window partitioned by class to pick out each class's highest total score

SELECT
    TT.*
FROM (
    SELECT
        T1.*
        ,max(T1.total_score) OVER(PARTITION BY T1.clazz ORDER BY T1.total_score DESC) as max_score
    FROM learn3.student_score T1
) TT WHERE TT.total_score = TT.max_score;

-- If OVER() is given no partition, the window is the entire dataset (here, the whole school), so this returns the school-wide maximum

SELECT
    TT.*
FROM (
    SELECT
        T1.*
        ,max(T1.total_score) OVER() as max_score
    FROM learn3.student_score T1
) TT WHERE TT.total_score = TT.max_score;

max usage:

① max(T1.total_score) OVER(PARTITION BY T1.clazz ORDER BY T1.total_score DESC)
takes the maximum within each partition

② max(T1.total_score) OVER()
takes the maximum over the whole dataset

min, avg, count, sum: used exactly like max.
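For instance, a sketch attaching each class's average score and row count to every row (same student_score table as above; alias names are illustrative):

SELECT
    T1.*
    ,avg(T1.total_score) OVER(PARTITION BY T1.clazz) as avg_score   -- class average
    ,count(*)            OVER(PARTITION BY T1.clazz) as class_cnt   -- students per class
FROM learn3.student_score T1;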

6. Traffic checkpoint (卡口) flow analysis

CREATE TABLE learn3.veh_pass(
    id STRING COMMENT "checkpoint ID",
    pass_time STRING COMMENT "pass-through time",
    pass_num int COMMENT "vehicle count"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE;

load data local inpath "/usr/local/soft/hive-3.1.2/data/veh_pass.txt" INTO TABLE learn3.veh_pass;

Requirement 1: list the devices seen this month and their total number.

-- Approach 1

SELECT
    T.id
    ,count(*) OVER()
FROM (
    SELECT
        id
        ,pass_time
    FROM learn3.veh_pass
    WHERE substr(pass_time,1,7) = substr(current_date,1,7)
) T GROUP BY T.id;

-- Wrong approach: DISTINCT is applied after count(*) OVER() has already been computed over every (still duplicated) row, so the count is not the number of distinct devices

SELECT
    DISTINCT id
    ,count(*) OVER()
FROM (
    SELECT
        id
        ,pass_time
    FROM learn3.veh_pass
    WHERE substr(pass_time,1,7) = substr(current_date,1,7)
) T;

-- Approach 2: deduplicate first, then count
SELECT
    T1.id
    ,count(*) OVER()
FROM (
    SELECT
        DISTINCT id
    FROM (
        SELECT
            id
            ,pass_time
        FROM learn3.veh_pass
        WHERE substr(pass_time,1,7) = substr(current_date,1,7)
    ) T
) T1;

+---------------------+-----------------+
| t1.id               | count_window_0  |
+---------------------+-----------------+
| 451000000000071117  | 5               |
| 451000000000071116  | 5               |
| 451000000000071115  | 5               |
| 451000000000071114  | 5               |
| 451000000000071113  | 5               |
+---------------------+-----------------+

+---------------------+
| id                  |
+---------------------+
| 451000000000071113  |
| 451000000000071114  |
| 451000000000071115  |
| 451000000000071116  |
| 451000000000071117  |
+---------------------+

-- Requirement 2: show every flow record together with the monthly flow total across all devices

SELECT
    T1.id
    ,T1.pass_time
    ,T1.pass_num
    ,SUM(T1.pass_num) OVER(PARTITION BY SUBSTRING(T1.pass_time,1,7)) as total_pass
FROM learn3.veh_pass T1;

Requirement 3: list the records ordered by device ID and date, and compute:

1) a running total from the first day through the current day
2) the sum of yesterday and today
3) the sum of the previous day, the current day, and the next day
4) the sum of the current day and the next day
5) the sum of the current day and all following days

1) Running total from the first day through the current day:

SELECT
    T1.*
    ,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM (
    SELECT * FROM learn3.veh_pass ORDER BY pass_time
) T1;

2) Yesterday plus today:

SELECT
    T1.*
    ,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
FROM (
    SELECT * FROM learn3.veh_pass ORDER BY pass_time
) T1;

3) Previous day, current day, and next day:

SELECT
    T1.*
    ,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM (
    SELECT * FROM learn3.veh_pass ORDER BY pass_time
) T1;

4) Current day plus the next day:

SELECT
    T1.*
    ,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
FROM (
    SELECT * FROM learn3.veh_pass ORDER BY pass_time
) T1;

5) Current day and all following days:

SELECT
    T1.*
    ,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM (
    SELECT * FROM learn3.veh_pass ORDER BY pass_time
) T1;

Requirement 4: for each device, find the previous date and the next date that have data:

SELECT
    T1.*
    ,LAG(T1.pass_time,1,"2022-01-01") OVER(PARTITION BY T1.id ORDER BY T1.pass_time) as before_time
    ,LEAD(T1.pass_time,1,"2022-12-31") OVER(PARTITION BY T1.id ORDER BY T1.pass_time) as after_time
FROM learn3.veh_pass T1;

Summary:

OVER(): opens a window for every row; by default the window is the entire result set.
OVER(PARTITION BY ...): partitions by the given columns; each row's window is its whole partition.
OVER(PARTITION BY ... ORDER BY ...): same partitioning, with the rows inside each partition sorted; once ORDER BY is present, the default frame shrinks to the rows from the start of the partition up to the current row.

Frame specification inside OVER:

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

OVER(): sets the window of data an analytic function works on; the window can change from row to row.
CURRENT ROW: the current row
n PRECEDING: n rows before the current row
n FOLLOWING: n rows after the current row
UNBOUNDED: the partition boundary;
UNBOUNDED PRECEDING means starting from the first row,
UNBOUNDED FOLLOWING means through the last row
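One subtlety worth a sketch: with an ORDER BY, a RANGE frame treats rows with equal sort keys as peers and includes all of them, while a ROWS frame counts physical rows (same veh_pass table as above; alias names are illustrative):

SELECT
    T1.*
    ,SUM(pass_num) OVER(ORDER BY pass_time ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as rows_sum   -- stops at this physical row
    ,SUM(pass_num) OVER(ORDER BY pass_time RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as range_sum  -- includes all rows tied on pass_time
FROM learn3.veh_pass T1;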

LAG(col, n, default_val): the value of col from the row n rows before the current row (default_val if there is none)
LEAD(col, n, default_val): the value of col from the row n rows after the current row
NTILE(n): distributes the ordered rows of the window into n numbered buckets (numbered from 1) and returns, for each row, the number of the bucket it falls into.
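For example, a sketch splitting each device's records into an earlier and a later half by time (the alias half is illustrative):

SELECT
    T1.*
    ,NTILE(2) OVER(PARTITION BY T1.id ORDER BY T1.pass_time) as half   -- 1 = earlier half, 2 = later half
FROM learn3.veh_pass T1;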

-- Custom UDFs

Old-style UDF (deprecated):

① Create a class that extends UDF. Note: the method must be named evaluate, or Hive will not find it.

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDFAddString extends UDF {
    /**
     * The method must be named evaluate.
     * Appends the suffix ### to the input string.
     *
     * @param col the value passed in when the function is called in Hive
     * @return one row of output
     */
    public String evaluate(String col) {
        return col + "###";
    }
}

② Package the code and add the jar to Hive:

add jar /usr/local/soft/test/HiveCode15-1.0-SNAPSHOT.jar;

③ Create a temporary function:

CREATE TEMPORARY FUNCTION my_udf as "com.shujia.udf.MyUDFAddString";

④ Call it like any built-in function: my_udf(col)
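A quick check, reusing the wordcount table from earlier in this note:

SELECT my_udf(word) FROM learn3.wordcount;
-- e.g. "hello,word" becomes "hello,word###"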
