Hive表【汇总】

提前必备

1、内部表和外部表的区别

概念讲解：

复制代码

外部表：
	1、存放他人给予自己的数据
	2、当我们删除表操作时，会将表的元数据删除，保留数据文件
内部表：
	1、存放已有的数据
	2、当我们删除表操作时，会将表的元数据以及数据文件都删除掉

2、公共查询语句

复制代码

with：一次查询内可无数次调用
temporary table：一次会话内可无数次调用【临时】
view与table：何时都可无数次调用【永久】

一：内部表

概念

内部表是由 Hive 管理数据和元数据的一种表类型，通常包含表的名称、列定义、存储格式等信息。
【默认创建的表】就是【内部表】

基本形式

sql 复制代码

create table if not exists inner_table_employee(
	name string,
	places array<string>,
	info struct<gender:string,age:int>,
	scores map<string,int>,
	dept_pos map<string,string>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;

二：外部表

概念

创建表时，带有【external】关键字的表即为【外部表】。
外部表允许在 Hive 中定义一个表结构，并对外部存储系统中的数据进行查询和分析，而不会对数据本身进行移动或修改。

基本形式

数据准备

复制代码

{"name":"henry","age":22,"gender":"male","phone":"18014499655"}
{"name":"pola","age":18,"gender":"female","phone":"18014499656"}

外部表创建

sql 复制代码

create external table if not exists hive_ext_json_family(
	name string,
	age int,
	gender string,
	phone string
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile
location '/hive_data/hive_cha01/json';

三：分区表

1、概念

将一份大的文件拆分成多份，每一份存储在hdfs上表空间内的一个专门文件夹内

文件夹的命名包含了字段名和字段值，如：year=2012 形式。

注意：year可以作为字段来使用，但本质上year不是原始表字段，是分区字段。

2、优势

hive进行查询时就是文件流读文件，即使读取一条数据也需要加载整个文件。因此分区表将文件切割成更小的粒度，当需要针对局部数据进行检索、聚合等处理时只需要加载对应的粒度即可，从而提高了处理的效率。

3、建立分区要素

复制代码

1、字段频繁出现于：
	一：where... , on...
	二：group by... , distribute by ... , cluster by...
	此时，就要考虑将此字段来建立分区
2、数据的容量(大)，需要考虑建立分区

4、基本语法形式

sql 复制代码

create [external] table TABLE_NAME(FIELD_NAME TYPE,...)
partitioned by(PAR_FIELD_NAME TYPE,...)		✔
row format delimited | serde 'SERDE_CLASS'
....

5、实际操作

一：分区的建立

创建一级分区

sql 复制代码

drop table if exists test1w_partitioned_by_year;
create external table if not exists test1w_partitioned_by_year(
	user_id int,
	user_gender string,
	order_time timestamp,
	order_amount decimal(6,2)
)
partitioned by (year int)	✔
row format delimited
fields terminated by ';'
stored as textfile;

创建多级分区【以二级分区为例】

sql 复制代码

drop table if exists test1w_partitioned_by_year_month;
create table if not exists test1w_partitioned_by_year_month(
    user_id int,
    user_gender string,
    order_time timestamp,
    order_amount decimal(6,2)
)
partitioned by(year int,month int)	✔
row format delimited
fields terminated by ';'
stored as textfile;

二：数据的操作

静态分区

主要用处：客户按分区级别改|插入数据

sql 复制代码

1.筛选原文件：
	一级分区：
        cat test1w.log | awk '/2012-/{print $0}'>test2012.log
        cat test1w.log | awk '/2013-/{print $0}'>test2013.log
	多级分区【以二级分区为例】：
		cat test1w.log | awk '/2012-7/{print $0}'>test20127.log
		cat test1w.log | awk '/2012-8/{print $0}'>test20128.log
		
2.装到两分区内：
	一级分区：
		test2012.log:
			load data local inpath '/root/file/test2012.log' 
			overwrite into table hive_internal_par_regex_test1w partition(year=2012);
										
		test2013.log:
			load data local inpath '/root/file/test2013.log' 
			overwrite into table hive_internal_par_regex_test1w partition(year=2013);
								
	多级分区【以二级分区为例】：
		test20127.log:
			load data local inpath '/root/file/test20127.log'
			overwrite into table zhou.test1w_partitioned_by_year_month partition(year=2012,month=7);
									
		test20128.log:
			load data local inpath '/root/file/test20128.log'
			overwrite into table zhou.test1w_partitioned_by_year_month partition(year=2012,month=8);

动态分区

主要用处：项目初期导入数据

准备工作[动态配置]

复制代码

set hive.exec.dynamic.partition=true;				-- 1、会话
set hive.exec.dynamic.partition.mode=nonstrict;
hive-site.xml										-- 2、个性化配置
hive-default.xml									-- 3、为所有配置项提供默认配置

具体代码：

sql 复制代码

一级分区：
	insert overwrite table test1w_partitioned_by_year partition (`year`)
	select *,year(order_time) from test1w;
	
多级分区【以二级分区为例】：
	insert overwrite table zhou.test1w_partitioned_by_year_month partition (`year`,`month`)
	select * ,year(order_time),month(order_time) from test1w where year(order_time)<=2012;

三：分区的其他操作【查删改】

-- 查看分区信息

sql 复制代码

show partitions 表名;

-- 手动添加分区

sql 复制代码

一级分区：
	alter table zhou.test1w_partitioned_by_year add partition (year=2014);
多级分区【以二级分区为例】：
	alter table zhou.test1w_partitioned_by_year_month add partition (year=2012,month=7);

-- 手动删除分区

sql 复制代码

一级分区：
	alter table zhou.test1w_partitioned_by_year drop partition (year=2014);
多级分区【以二级分区为例】：
	alter table zhou.test1w_partitioned_by_year_month drop partition (year=2012,month=7);

四：分桶表

1、概念

分桶表时将一个表或分区内的数据，拆分成更小的文件片段，使抽样更加高效。

2、必知点

复制代码

1、分桶字段必须是表中已存在的原始字段
2、默认采用：原始字段值的hashcode%分桶数列 => 决定当前行数据会被拆分到几号桶
3、优势：数据采样
4、采样率：10% -> 桶数定义为10
5、一般是在 【分区】 的基础上进行 【分桶】,更好地优化查询性能。

3、实际应用场景

1.抽样【数据采样】

在开发中，数据量大的情况下，我们为了针对开发做测试，就可以采用分桶来进行数据采样，采样得到的结果是一个具有代表性的查询结果，可以达到快速开发的目的。

2.拉链表【便于修改】

修改某行数据时，无需将整个文件都读取出来，只需将小份文件导出进行修改即可。

4、实际操作

一：创建分桶表

在根据year分区的基础上，对每个year内部进行了分桶，分为4份数据，便于抽样|修改

sql 复制代码

drop table if exists test1w_partitioned_SeparateBarrel;
create table if not exists test1w_partitioned_SeparateBarrel(
    user_id int,
    user_gender string,
    order_time timestamp,
    order_amount decimal(6,2)
)
partitioned by(year int)
clustered by(order_time) into 4 buckets	✔	=> 此时采样率：25%
row format delimited
fields terminated by ';'
stored as textfile;

二：数据的操作

准备工作[动态配置]

复制代码

set hive.exec.dynamic.partition=true;				-- 1、会话
set hive.exec.dynamic.partition.mode=nonstrict;
hive-site.xml										-- 2、个性化配置
hive-default.xml									-- 3、为所有配置项提供默认配置

具体代码

sql 复制代码

insert into table zhou.test1w_partitioned_SeparateBarrel partition (year)
select *,year(order_time) from test1w;

三：实际应用【数据采集】

--随机抽样【基于整行数据】

基本解释：

取每个桶中四分之三的数据【很少用】
进行随机抽样，不考虑数据的顺序或时间等因素，可以使用类似 bucket 3 out of 4 on rand()形式，这样每次抽样的结果可能会有所不同，适合需要随机性的分析或实验。

sql 复制代码

select * from test1w_partitioned_SeparateBarrel
tablesample(bucket 3 out of 4 on rand())s;

--分桶字段抽样【基于指定列】✔

基本解释：

取每个桶中四分之一的数据[桶]【随机】 => 推荐使用，使用分桶列更高效
从【有序的数据】中抽样，例如按照时间排序的订单数据，可以使用类似于 bucket 3 out of 4 on order_time形式，这样可以保证抽样数据具有一定的顺序性和连续性。

最终结果分析 ：最后获取的数据是在每个分区【文件夹】内随机抽取指定数量【如：四分之一]的数据[桶]】=> 抽到的数据[桶]是具有随机性的。

sql 复制代码

select year,count(*) as order_count from test1w_partitioned_SeparateBarrel
tablesample ( bucket 1 out of 4 on order_time)s
group by year;

五：临时表(temporary)

1、概念

一次链接（会话session）内临时创建的表格，会话结束后自动删除

默认hdfs路径：/tmp/hive/root 内根据时间寻找临时表
idea中：show tables; 才可看到临时表。

2、实际操作

sql 复制代码

create temporary table if not exists test1w_gender as
select user_gender,count(*) as gender_cnt from zhou.test1w group by user_gender;

六：视图(view)

1、概念

本质：一条较为复杂的共用的查询语句

2、实际操作

sql 复制代码

create view if not exists hive_view_test1w_Girl as
select * from test1w where user_gender = "女";

七：拉链表(zip tables)

1、发展流程

hive发展

hive 0.14就已经有这一逻辑模型,名为slowly changing dimension。
hive 2.6.0 支持merge语法，运用了事务管理

拉链表由来

原来采用分区表，用户分区存储历史增量数据，缺点是重复数据太多

目前运用拉链表来解决这一问题

2、含义

用于解决持续增长且存在一定时间范围内重复的数据，即：合并有一定重复性【较小时间范围内】的数据。

3、优点

节约空间（一份订单只有一条数据）

4、应用场景

【数据规模庞大】，新数据【在有限区间(时间...)内】存在多种状态变化

5、准备工作[动态配置]

复制代码

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true; -- 表合并开启
set hive.compactor.worker.threads=1; -- 表合并线程必须为一
set hive.auto.convert.join=false; -- 关闭 mapjoin
set hive.merge.cardinality.check=false; -- 关闭检查数据列的基数(列值的差异性)
set mapreduce.job.reduces=4;

6、具体代码

sql 复制代码

drop table if exists zhou.hive_zipper_order;
create table zhou.hive_zipper_order(
	order_id bigint,
	user_id bigint,
	order_modify_dt timestamp,
	order_money decimal(10,2),
	current_status int
)
row format delimited 
fields terminated by ',';
--导入f F的数据至普通表hive_zipper_order中
load data local inpath '/root/file/log/order_record.log'
overwrite into table zhou.hive_zipper_order;

--创建拉链表hive_zipper_pc_order	✔
drop table if exists zhou.hive_zipper_pc_order;
create table zhou.hive_zipper_pc_order(
	order_id bigint,
	user_id bigint,
	order_create_dt timestamp,
	order_modify_dt timestamp,
	order_money decimal(10,2),
	current_status int
)
partitioned by(year int,month int,day int)
clustered by(order_create_dt) into 4 buckets
row format delimited
	fields terminated by ','
stored as orc
tblproperties("transactional"="true");

--操作历史全量数据用动态分区	✔
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;
set hive.auto.convert.join=false;
set hive.merge.cardinality.check=false;
set mapreduce.job.reduces=4;

--开启动态分区，一次性挂载至拉链表hive_zipper_pc_order中	✔
with zip_src as (
	select order_id,user_id,order_money,
		min(order_modify_dt) as order_create_dt,
		max(order_modify_dt) as order_modify_dt,
		max(current_status) as current_status
	from zhou.hive_zipper_order
	group by order_id,user_id,order_money
)
insert overwrite table zhou.hive_zipper_pc_order partition(year,month,day)
select
	order_id,
	user_id,
	order_create_dt,
	order_modify_dt,
	order_money,
	current_status,
	year(order_create_dt) as year,
	month(order_create_dt) as month,
	day(order_create_dt) as day
from zip_src;

-- 拉链表查询	✔
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
select * from zhou.hive_zipper_pc_order
where to_date(order_modify_dt)='2021-02-04'
order by order_modify_dt desc;

--之后每天，增量添加【在原表处新增】
load data local inpath '/root/file/log/order_record_2021_02_05.log'
overwrite into table zhou.hive_zipper_order;

--拉链处理增量数据(新增新数据，修改旧数据)	✔
merge into zhou.hive_zipper_pc_order as O
using (
	select 
		order_id,
		user_id,
		order_create_dt,
		order_modify_dt,
		order_money,
		current_status,
		year(order_create_dt) as year,
		month(order_create_dt) as month,
		day(order_create_dt) as day
	from (
		select order_id,user_id,order_money,
			min(order_modify_dt) as order_create_dt,
			max(order_modify_dt) as order_modify_dt,
			max(current_status) as current_status
		from zhou.hive_zipper_order
		--where to_date(order_modify_dt)='2021-02-05'
		group by order_id,user_id,order_money
	)T
) as H
on O.order_id=H.order_id
when matched then 
update set order_modify_dt=H.order_modify_dt,current_status=H.current_status
when not matched then 
insert values(H.order_id,H.user_id,H.order_create_dt,H.order_modify_dt,H.order_money,H.current_status,H.year,H.month,H.day);

--验证拉链结果	✔
select * from zhou.hive_zipper_pc_order
where to_date(order_modify_dt)>to_date(order_create_dt);