Hive: Partitioned Tables

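Contents

- Preface
- Partitioned tables
  - Basic partitioned-table operations
    - Why partitioned tables
    - Syntax for creating a partitioned table
    - Loading data into a partitioned table
    - Querying a partitioned table
    - Adding partitions
    - Dropping partitions
    - Listing the partitions of a table
    - Viewing the partitioned table's structure
  - Two-level partitioning
    - Loading data the normal way
    - Associating uploaded data with the partitioned table
  - Dynamic partitioning
    - Dynamic-partition parameter settings
    - Worked example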

Preface

Linux version: CentOS 7.5
Hive version: Hive 3.1.2

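Partitioned tables

A partition of a table corresponds to a separate directory on HDFS, and that directory holds all of the data files belonging to the partition. In other words, partitioning in Hive means splitting by directory: a large dataset is divided into smaller ones according to business needs. When a query's WHERE clause selects only the partitions it needs, Hive reads just those directories instead of scanning the whole table, which makes the query much more efficient.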

Basic partitioned-table operations

Why partitioned tables

Logs need to be managed by date. Here, department data stands in for the log files:

dept_20200401.log
dept_20200402.log
dept_20200403.log
......

Syntax for creating a partitioned table

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

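Note: the partition column must not be a column that already exists in the table. You can think of the partition column as a pseudo-column of the table.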
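As a small illustration of the pseudo-column idea, the partition column can be selected and filtered like an ordinary column even though its value is not stored in the data files. The queries below are only a sketch against the dept_partition table defined above, and they assume some data has already been loaded.

```sql
-- day is not stored in the data files, but it can be projected and filtered like any column
select deptno, dname, loc, day from dept_partition where day = '20200401';

-- the partition column is listed after the regular columns
desc dept_partition;
```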

Loading data into a partitioned table

(1) Prepare the data

dept_20200401.log

10	ACCOUNTING	1700
20	RESEARCH	1800

dept_20200402.log

30	SALES	1900
40	OPERATIONS	1700

dept_20200403.log

50	TEST	2000
60	DEV	1900

(2) Load the data

hive (default)> load data local inpath '/export/server/hive-3.1.2/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath '/export/server/hive-3.1.2/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
hive (default)> load data local inpath '/export/server/hive-3.1.2/datas/dept_20200403.log' into table dept_partition partition(day='20200403');

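Note: when loading data into a partitioned table, specify the target partition.

Figure: the partition directories as seen in the HDFS web UI

Figure: the partitions as seen from a Hive query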

Querying a partitioned table

Single-partition query

hive (default)> select * from dept_partition where day='20200401';

Querying several partitions together

hive (default)> select * from dept_partition where day='20200401'
              union
              select * from dept_partition where day='20200402'
              union
              select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or
                day='20200402' or day='20200403' ;
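The OR form can also be written with IN, which tends to be easier to read as the list of partitions grows. The statements below are a sketch of equivalent filters and are not part of the original walkthrough. One difference between the two queries above is worth keeping in mind: a bare UNION in Hive is UNION DISTINCT and removes duplicate rows, whereas the OR form returns every matching row.

```sql
select * from dept_partition where day in ('20200401', '20200402', '20200403');

-- string comparison works here because the day values are fixed-width yyyyMMdd strings
select * from dept_partition where day between '20200401' and '20200403';
```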

Adding partitions

Add a single partition

hive (default)> alter table dept_partition add partition(day='20200404') ;

Add several partitions at once (the partition specs are separated by spaces, not commas)

hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');

Dropping partitions

Drop a single partition

hive (default)> alter table dept_partition drop partition (day='20200406');

Drop several partitions at once (the partition specs are separated by commas)

hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');
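When these statements live in scripts that may be re-run, the optional IF NOT EXISTS and IF EXISTS clauses are worth knowing. IF NOT EXISTS lets ADD PARTITION succeed when the partition is already registered, and IF EXISTS states explicitly that a missing partition is acceptable for DROP PARTITION. A minimal sketch on the same table:

```sql
-- succeeds even if day=20200404 is already registered
alter table dept_partition add if not exists partition(day='20200404');

-- for a managed (non-external) table, dropping a partition also removes its data directory
alter table dept_partition drop if exists partition(day='20200404');
```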

Listing the partitions of a table

hive> show partitions dept_partition;

Viewing the partitioned table's structure

hive> desc formatted dept_partition;

# Partition Information
# col_name              data_type               comment
day                     string

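Two-level partitioning

Suppose a new requirement comes up: a single day's log volume is very large. How can the data be split further?

The answer is two-level partitioning, covered next.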
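The statements that follow use a table named dept_partition2, whose DDL is not shown. A plausible definition, assuming the same three columns as dept_partition and two string partition columns named day and hour (as implied by the load and query statements), might look like this:

```sql
-- assumed schema: same data columns as dept_partition, partitioned by day and then hour
create table dept_partition2(
    deptno int, dname string, loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
```

On HDFS each day partition then becomes a directory containing one subdirectory per hour, for example day=20200401/hour=12.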

Loading data the normal way

(1) Load data into the two-level partitioned table

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');

(2) Query the partition data

hive (default)> select * from dept_partition2 where day='20200401' and hour='12';
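With two partition levels it can be handy to list only the hour partitions under one day. SHOW PARTITIONS accepts a partial partition spec for this. The sketch below assumes the dept_partition2 table used above.

```sql
-- list every partition of the table, e.g. day=20200401/hour=12
show partitions dept_partition2;

-- list only the hour partitions under a single day
show partitions dept_partition2 partition(day='20200401');

-- filtering on the first level alone reads all hour partitions of that day
select * from dept_partition2 where day='20200401';
```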

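Associating uploaded data with the partitioned table

There are three ways to upload data directly into a partition directory and have the partitioned table pick it up.

(1) Method 1: upload the data, then repair the table

Upload the data (with dfs -mkdir -p from the Hive CLI, or hadoop fs -mkdir -p from the shell)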

hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/datas/dept_20200401.log  /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

Query the data (the newly uploaded data is not returned, because the metastore does not know about the new partition yet)

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

Run the repair command

hive> msck repair table dept_partition2;

Query the data again

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

(2) Method 2: upload the data, then add the partition

Upload the data

hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /export/server/hive-3.1.2/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

Add the partition

hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');

Query the data

hive (default)> select * from dept_partition2 where day='20200401' and hour='14';

(3) Method 3: create the directory, then load the data into the partition

hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;

hive (default)> load data local inpath '/export/server/hive-3.1.2/datas/dept_20200401.log' into table
 dept_partition2 partition(day='20200401',hour='15');

Query the data

hive (default)> select * from dept_partition2 where day='20200401' and hour='15';
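After trying the three approaches, one way to check that each directory is registered as a partition and actually returns rows is to list the partitions and count rows per partition. This is only a sketch. The counts depend on which files were uploaded.

```sql
-- every registered partition should appear here, e.g. day=20200401/hour=13
show partitions dept_partition2;

-- one row per partition with the number of records it holds
select day, hour, count(*) as row_count
from dept_partition2
group by day, hour;
```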

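Dynamic partitioning

In a relational database, when data is inserted into a partitioned table, the database automatically routes each row to the matching partition based on the value of the partition column. Hive offers a similar mechanism, dynamic partitioning (Dynamic Partition), but it has to be configured before use.

Dynamic-partition parameter settings

(1) Enable dynamic partitioning (default: true, i.e. enabled)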

hive.exec.dynamic.partition=true

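(2) Set non-strict mode. The dynamic-partition mode defaults to strict, which requires at least one partition column to be static. The nonstrict mode allows all partition columns to be dynamic.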

hive.exec.dynamic.partition.mode=nonstrict

(3) The maximum total number of dynamic partitions that can be created across all nodes running MR. Default: 1000

hive.exec.max.dynamic.partitions=1000

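(4) The maximum number of dynamic partitions that can be created on each node running MR. Set this according to the actual data. For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter must be set greater than 365. With the default of 100 the job fails.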

hive.exec.max.dynamic.partitions.pernode=100

(5) The maximum number of HDFS files that can be created by the whole MR job. Default: 100000

hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. This usually does not need to be changed. Default: false

hive.error.on.empty.partition=false
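The properties above can be applied per session from the Hive CLI with SET. The values below simply repeat the defaults listed above, except for the mode, which is switched to nonstrict. Treat them as a starting point rather than recommended settings.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
set hive.exec.max.created.files=100000;
set hive.error.on.empty.partition=false;

-- SET with only the property name prints its current value
set hive.exec.dynamic.partition.mode;
```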

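Worked example

Requirement: insert the rows of the dept table into the matching partitions of the target table dept_partition_dy, partitioned by region (the loc column).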
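The example relies on a non-partitioned source table named dept that already exists and contains data. Its definition is not given here. A hypothetical version, consistent with the columns selected in the insert statement below and with loc typed as int to match the target table, could be:

```sql
-- assumed schema for the source table; adjust to match your actual dept table
create table if not exists dept(
    deptno int, dname string, loc int
)
row format delimited fields terminated by '\t';
```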

(1) Create the target partitioned table

hive (default)> create table dept_partition_dy(id int, name string) partitioned by (loc int) row format delimited fields terminated by '\t';

(2) Switch to non-strict mode and run the dynamic-partition insert

hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;

(3) Check the partitions of the target table

hive (default)> show partitions dept_partition_dy;

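Follow-up question: how does the target table know which column feeds the partition column?

Answer: by position. The trailing column(s) of the SELECT are taken as the partition column(s), because the partition "pseudo-columns" sit at the end of the table's column list.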
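To make the positional rule concrete, here is a sketch with two inserts. It assumes a source table dept with columns deptno, dname and loc, and a two-level table dept_partition2 with three data columns partitioned by (day, hour). The hour value '16' is arbitrary and chosen only for illustration.

```sql
-- the alias is ignored: the last SELECT expression feeds partition column loc
insert into table dept_partition_dy partition(loc)
select deptno, dname, loc as any_alias from dept;

-- static day plus dynamic hour: the static value is given in the PARTITION clause,
-- and only the dynamic column appears in the SELECT, in last position
insert into table dept_partition2 partition(day='20200401', hour)
select deptno, dname, cast(loc as string), '16' from dept;
```

In a mixed insert like the second one, the static partition columns must be listed before the dynamic ones in the PARTITION clause. Because it has a static partition, it also runs under the default strict mode.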