Common Hive SQL Syntax

Contents

- DDL (Data Definition Language)
  - drop
  - alter
  - show statements
  - Creating a table
- DML (Data Manipulation Language)
  - load: loading data
  - insert: inserting data
  - select: querying
    - Deduplication
    - where: filtering rows
    - Aggregation
    - group by: grouping
    - having: filtering groups
    - order by: sorting
    - limit: limiting the number of returned rows
    - select: query execution order
    - Join queries
- Lessons learned

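DDL (Data Definition Language)

DDL is the part of SQL used to create, drop, and alter the structure of databases and tables. Its core statements are create, alter, and drop. DDL does not operate on individual rows inside a table.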

drop

-- Drop the table: removes its definition and, for a managed table, its data
drop table t_people_part_dynamic;

alter
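
As a minimal sketch of what alter is typically used for, the statements below change the structure of tables used elsewhere in this article. The column names email and community and the partition value 'hebei' are hypothetical and chosen only for illustration.

```sql
-- Add a new column to an existing table (email is a hypothetical column)
alter table t_user add columns (email string comment "email address");

-- Rename a column: change column <old name> <new name> <type>
-- (community is a hypothetical new name for the house column)
alter table t_user change column house community string;

-- Add and drop a partition on a partitioned table
alter table t_people_part add partition (prov='hebei');
alter table t_people_part drop partition (prov='hebei');
```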

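show statements

-- List all databases
show databases;
-- List the tables in a database (the current database by default; square brackets mark the optional part)
show tables [in database];
-- Show a table's metadata in a formatted layout (desc is shorthand for describe)
describe formatted t_people_part_dynamic;
-- List Hive built-in functions; like filters the result
show functions like "s%";
-- Show how to use a function
describe function extended split;
-- Show the statement that created a table
show create table t_people_part_dynamic;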
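
Two related statements are often useful alongside the ones above. This is a small sketch that assumes t_people_part is a partitioned table, as the load example later in this article suggests.

```sql
-- List the partitions of a partitioned table
show partitions t_people_part;

-- Switch to a database ("default" is Hive's built-in database) and confirm which one is active
use default;
select current_database();
```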

Creating a table

create table t_user(
    id int comment "ID",
    name string comment "name",
    sex string comment "sex",
    provice string comment "province",
    city string comment "city",
    age int comment "age",
    house string comment "residential community"
) comment "population information"
    row format delimited
        fields terminated by "\t";
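
The later examples load into and select from partitioned tables such as t_people_part, whose definition is not shown in this article. The following is a sketch of what such a table could look like, assuming it has the same columns as t_user plus a partition column prov, which is the name used in the load and insert examples.

```sql
-- Assumed layout: the t_user columns, partitioned by province
create table t_people_part(
    id int comment "ID",
    name string comment "name",
    sex string comment "sex",
    provice string comment "province",
    city string comment "city",
    age int comment "age",
    house string comment "residential community"
) comment "population information, partitioned by province"
    partitioned by (prov string comment "province partition")
    row format delimited
        fields terminated by "\t";
```

The partition column is declared in partitioned by rather than in the column list, and its name must differ from every regular column name.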

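The default field delimiter is \001. If the data file already uses the default delimiter, the row format delimited part can be left out, which simplifies the create table statement.

How can you find out which delimiter a data file uses, given that some delimiters cannot be typed on a keyboard and are not displayed on screen?

Open the file in Notepad++ and choose View > Show Symbol > Show All Characters to make them visible.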
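
To illustrate the default-delimiter case, here is a sketch with a hypothetical table t_user_default. The two statements should describe the same file layout: the first relies on the default, the second spells it out.

```sql
-- Relies on the default field delimiter \001 (t_user_default is a hypothetical name)
create table t_user_default(
    id int,
    name string
);

-- Equivalent layout with the delimiter written explicitly
create table t_user_explicit(
    id int,
    name string
)
    row format delimited
        fields terminated by '\001';
```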

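DML (Data Manipulation Language)

load: loading data

-- Load a file from the local file system into a table, into the given partition
load data local inpath '/home/datahouse/data/people/user_henan.txt' into table t_people_part partition (prov='henan');
-- Load a file from the local file system into a table
load data local inpath '/home/datahouse/data/user.txt' into table t_user_load;

"Local file system" here is in contrast to HDFS, the distributed file system. Without the local keyword, the path is read from HDFS.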
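
For comparison, a sketch of the two common variations of load. The HDFS path /tmp/example/user.txt is hypothetical.

```sql
-- Without "local" the path refers to HDFS; the file is moved into the table's directory
load data inpath '/tmp/example/user.txt' into table t_user_load;

-- overwrite replaces the existing contents of the table (or partition) instead of appending
load data local inpath '/home/datahouse/data/user.txt' overwrite into table t_user_load;
```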

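insert: inserting data

The recommended way to get data into Hive:

Clean the data into structured files, then load them with the load statement. This is more efficient.

Inserting data with insert is not recommended, because it has to run through the MapReduce (MR) compute framework and is very slow.

The main use case for insert is writing the result of a query into another table.

-- Dynamic-partition insert: the last column of the select (p.provice) supplies the value of the partition column prov
insert into table t_people_part_dynamic partition (prov)
select p.*,p.provice from t_people p;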
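
Depending on the Hive version and configuration, a dynamic-partition insert like the one above may be rejected unless dynamic partitioning is enabled and set to non-strict mode. A sketch of the session settings that are commonly set first, assuming the defaults have not already been changed on your cluster:

```sql
-- Enable dynamic partitioning (already on by default in many versions)
set hive.exec.dynamic.partition=true;
-- Allow all partition columns to be dynamic; strict mode requires at least one static partition
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table t_people_part_dynamic partition (prov)
select p.*, p.provice from t_people p;

-- insert overwrite replaces the data in the affected partitions instead of appending
insert overwrite table t_people_part_dynamic partition (prov)
select p.*, p.provice from t_people p;
```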

select: querying

select * from t_user;

Deduplication

-- Deduplicate on a single column
select distinct name from t_user;
-- Deduplicate on a combination of columns
select distinct name,age from t_user;

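where: filtering rows

Rows for which the where clause evaluates to true are returned by the select clause; rows for which it does not are left out.

Aggregate functions cannot be used in a where condition.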

select * from t_people where provice = '河南省' and age>30;

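Special conditions (null checks, between, in)

select * from t_people where provice is null;
-- The value lies within a range; both boundaries are included
select * from t_people where age between 20 and 30;
-- The value matches one of a list of discrete values
select * from t_people where age in (20,30,40);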

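Aggregation

An aggregate function takes multiple rows as input and returns a single row of output. Aggregate functions are used in the select clause.

avg() returns the average

count(column) returns the number of rows in which the column is not null

count(*) returns the number of all rows

sum() returns the total

max() returns the maximum

min() returns the minimum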
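
A short sketch that applies all of these functions in one query, assuming t_people has the provice and age columns used in the where examples. The column aliases are illustrative.

```sql
select
    count(*)       as total_rows,      -- all rows
    count(provice) as rows_with_prov,  -- only rows where provice is not null
    avg(age)       as avg_age,
    sum(age)       as sum_age,
    max(age)       as max_age,
    min(age)       as min_age
from t_people;
```

Without a group by clause the whole table is treated as one group, so the query returns exactly one row.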

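group by: grouping

Groups the result set by one or more columns.

-- Group by sex and count the people in each group
select sex,count(*) from t_people_part_dynamic group by sex;

Every column in the select clause must either be a grouping column or be used inside an aggregate function; otherwise the query fails.

Grouping is usually combined with aggregate functions.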
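
The rule about select columns can be sketched with a grouping on two columns. The query assumes t_people has the provice, sex, and age columns seen elsewhere in this article; the aliases are illustrative.

```sql
-- Valid: provice and sex are grouping columns, the rest are aggregates
select provice, sex, count(*) as cnt, avg(age) as avg_age
from t_people
group by provice, sex;

-- Invalid: name is neither a grouping column nor inside an aggregate function
-- select provice, name, count(*) from t_people group by provice;
```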

having: filtering groups

select sex,count(*) as c from t_people_part_dynamic group by sex having c > 3;

having filters after grouping, so it can use aggregate functions.
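
The difference between where and having can be sketched in a single query, assuming t_people_part_dynamic has an age column like t_people:

```sql
select sex, count(*) as c
from t_people_part_dynamic
where age > 30          -- filters individual rows before grouping; no aggregates allowed here
group by sex
having count(*) > 3;    -- filters whole groups after aggregation
```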

order by: sorting

select * from t_people_part_dynamic order by id;
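
Sorting is ascending by default. A sketch with an explicit direction and more than one sort column, again assuming the table has an age column:

```sql
-- Oldest first; rows with the same age are ordered by id ascending
select * from t_people_part_dynamic order by age desc, id asc;
```

Note that order by produces a global ordering, which in Hive generally means the final sort runs in a single reducer, so it can be slow on large results unless combined with limit.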

limit: limiting the number of returned rows

select * from t_people_part_dynamic order by id limit 5;

select: query execution order

from > where > group by / aggregation > having > select > order by > limit
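
The logical order can be sketched by numbering the clauses of one query. This is the conceptual order of evaluation, not necessarily the physical plan Hive chooses; the age column is assumed to exist on this table.

```sql
select sex, count(*) as c        -- 5. select: compute the output columns
from t_people_part_dynamic       -- 1. from: read the source table
where age > 30                   -- 2. where: filter rows
group by sex                     -- 3. group by: form groups and aggregate
having count(*) > 3              -- 4. having: filter groups
order by c desc                  -- 6. order by: sort, the alias c is visible here
limit 5;                         -- 7. limit: cut the result to 5 rows
```

Because where is evaluated before select, an alias defined in the select clause cannot be referenced in where.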

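Join queries

Inner join: inner join, join, or an implicit join

-- inner join
select * from t_people_part_dynamic pd inner join t_people tp on pd.id=tp.id;
-- join
select * from t_people_part_dynamic pd join t_people tp on pd.id=tp.id;
-- implicit join
select * from t_people_part_dynamic pd,t_people tp where pd.id = tp.id;

The three forms above are equivalent. The result contains only the rows that match in both tables, i.e. the intersection of the two sets, A ∩ B.

Left join: left join

All rows of the left table are kept. Where the right table has a matching row, its columns are shown; where it does not, they are null.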
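
A sketch of a left join on the same two tables as the inner join examples, assuming both have id and name columns. The alias tp_name is illustrative.

```sql
-- Every row of pd appears; tp_name is null when t_people has no row with the same id
select pd.id, pd.name, tp.name as tp_name
from t_people_part_dynamic pd
left join t_people tp on pd.id = tp.id;

-- Only the rows of pd that have no match in t_people
select pd.id, pd.name
from t_people_part_dynamic pd
left join t_people tp on pd.id = tp.id
where tp.id is null;
```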

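Lessons learned

1. Hive SQL syntax is very similar to MySQL's; the basics are almost identical.

2. The purpose of a data warehouse is to analyze massive amounts of historical data, and analysis in Hive SQL is essentially a combination of the syntax above. MySQL has the same syntax, so can it be used for data analysis as well, and what is the difference? Yes, it can. MySQL suits analysis of small data volumes, while Hive on a data warehouse suits analysis of massive data volumes. The reason is that the data warehouse provides distributed storage and distributed computation, so it can run analysis efficiently on clusters of inexpensive machines.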