Prerequisites
Data Types
Primitive Types
| Type | Syntax |
| --- | --- |
| Character | char, varchar, string✔ |
| Integer | tinyint, smallint, int✔, bigint✔ |
| Decimal | float, double, numeric(m,n), decimal(m,n)✔ |
| Boolean | boolean✔ |
| Date/time | date✔, timestamp✔ |
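Complex Types (Collection Types)
1. Array: array<T>
Used to map raw, user-supplied data onto a structured column
Sample: [] / |156,1778,42,138| => several values describing the same dimension
2. Key-value pairs: map<K,V>
Sample: |LogicJava:88,mysql:89|
3. Struct: struct<name1:type1,name2:type2,...>
Sample: JSON-like format [starts and ends with {}, with a stable structure] => structured data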
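As a quick way to try the three complex types without loading a file, the sketch below builds one value of each with Hive's built-in constructor functions and reads them back. The aliases demo, nums, scores, and address are illustrative names, and the query assumes a Hive version that accepts a select without a from clause (0.13 or later).

```sql
-- Build one array, one map, and one struct in a CTE, then access their parts.
with demo as (
    select
        array(156, 1778, 42, 138)                      as nums,
        map('LogicJava', 88, 'mysql', 89)              as scores,
        named_struct('province', '南京', 'city', '江宁') as address
)
select
    nums[0]          as first_num,
    size(nums)       as num_count,
    scores['mysql']  as mysql_score,
    map_keys(scores) as subjects,
    address.city     as city
from demo;
```

Array elements are indexed from 0, map values are looked up by key, and struct fields are reached with dot notation.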
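[Create] Table Operations
I. Creating Hive Tables [Basic Syntax]
Syntax components
Part 1: create table = basic format + row format + extra handling
Part 2: load data
*Basic format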
create table if not exists TABLE_NAME(
FIELD_NAME DATA_TYPE,
FIELD_NAME DATA_TYPE,
....
)[comment 'table description']
*Row format
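1. Use case: text data, unstructured and semi-structured
2. Sample data (struct values are positional, so only the values are written, separated by the collection delimiter):
123,张三,16853210211116,true,26238.5,阅读;跑步;唱歌,java:98;mysql:54,南京;江宁
3. Example (the salary column is added so that the columns line up with the fifth field, 26238.5, of the sample row):
create table if not exists TABLE_NAME(
id int,
name string,
time bigint,
isPartyMember boolean,
salary decimal(10,2),
hobby array<string>,
scores map<string,int>,
address struct<province:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by ';'
map keys terminated by ':'
lines terminated by '\n'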
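4. Explanation:
fields terminated by ',' column delimiter [between fields: id, name, ...]
collection items terminated by ';' delimiter between the elements of an array, the entries of a map, or the fields of a struct
map keys terminated by ':' delimiter between the key and the value of a map entry
lines terminated by '\n' row delimiter [the default, so it can usually be omitted]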
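For comparison, when the row format clause is left out of a text table, Hive falls back to non-printing default delimiters. The sketch below spells those defaults out explicitly; the table name demo_default_delims and its columns are hypothetical.

```sql
-- Equivalent to omitting "row format delimited ..." on a textfile table:
-- \001 (Ctrl-A) between fields, \002 (Ctrl-B) between collection items,
-- \003 (Ctrl-C) between a map key and its value.
create table if not exists demo_default_delims(
    id     int,
    hobby  array<string>,
    scores map<string,int>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile;
```

Explicit delimiters such as ',' and ';' are only needed when the input file was not written by Hive itself.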
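1. Use case: structured data, i.e. data with a clear, regular layout
2. CLASS_PATH (the SerDe class named in row format serde 'CLASS_PATH') can be one of the following:
Option 1: CSV [simple types]
Data:
"1","2","Football"
"2","2","Soccer"
"3","2","Baseball & Softball"
Code: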
create table if not exists TABLE_NAME(
id string,
page string,
word string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
'separatorChar'=',',
'quoteChar'='"',
'escapeChar'='\\'
)
Option 2: regex [regular expression]
Data:
123,张三,16853210211116,true,26238.5,阅读;跑步;唱歌,java:98;mysql:54,province:南京;city:江宁
Code (RegexSerDe accepts only primitive column types and needs one capture group per column, so the three collection fields are read as plain strings here):
create table if not exists TABLE_NAME(
id int,
name string,
time bigint,
isPartyMember boolean,
salary decimal(10,2),
hobby string,
scores string,
address string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
'input.regex'='^(\\d+),(.*?),(\\d+),(true|false),(\\d+\\.?\\d*),([^,]*),([^,]*),([^,]*)$'
)
Option 3: JsonSerDe
Data:
{"name":"henry","age":22,"gender":"male","phone":"18014499655"}
Code:
create table if not exists json(
name string,
age int,
gender string,
phone string
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
*Extra handling
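1. stored as [storage format]
Basic syntax: stored as FILE_FORMAT (the format name is written without quotes)
File formats: textfile✔, orc, parquet, sequencefile, ...
Example:
stored as textfile
2. tblproperties [table properties] (general purpose):
Example [adjust to the actual situation]:
tblproperties(
'skip.header.line.count'='1' [skips the header, i.e. the first line]
...
)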
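Because load data only moves files into the table directory and does not convert them, a columnar format such as orc is usually filled from a text staging table instead. A minimal sketch of that pattern follows; the tables demo_trade_text and demo_trade_orc and their columns are hypothetical.

```sql
-- Text staging table: matches the raw CSV file, with the header line skipped.
create table if not exists demo_trade_text(
    user_id   bigint,
    buy_mount int
)
row format delimited
fields terminated by ','
stored as textfile
tblproperties('skip.header.line.count'='1');

-- Target table stored as ORC; no delimiters are needed for a binary format.
create table if not exists demo_trade_orc(
    user_id   bigint,
    buy_mount int
)
stored as orc;

-- Hive rewrites the rows into ORC files while running this query.
insert overwrite table demo_trade_orc
select user_id, buy_mount from demo_trade_text;
```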
*Loading data into the table
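Method 1 [not recommended]:
hdfs dfs -put employee.txt /hive312/warehouse/yb12211.db/inner_table_employee
Method 2 [includes a validation step]: ✔
Notes:
local : the file is on the local file system of the virtual machine
without local : the file is on HDFS
overwrite : replace the existing data
without overwrite : append
Variant 1 [local virtual machine]:
load data local inpath '/root/file/employee.txt'
overwrite into table yb12211.inner_table_employee;
Variant 2 [HDFS]:
load data inpath '/hive_data/hive_cha01/employee/employee.txt'
overwrite into table yb12211.inner_table_employee;
Method 3 [typically used with external tables]: ✔
Basic format: location 'path of the HDFS [directory] that holds the files' (the directory is mounted as the table's data)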
Hands-on practice
Case 1:
/*
1|henry|1.81|1995-03-18|江苏,南京,玄武,北京东路68号|logicjava:88,javaoop:76,mysql:80,ssm:82|beauty,money,joke
2|arill|1.59|1996-7-30|安徽,芜湖,南山,西湖东路68号|logicjava:79,javaoop:58,mysql:65,ssm:85|beauty,power,sleeping
3|mary|1.72|1995-09-02|山东,青岛,长虹,天山东路68
*/
drop table if exists students;
create table if not exists students(
number int,
name string,
height decimal(3,2),
birthday date,
house struct<province:string,city:string,district:string,street:string>,
scores map<string,int>,
hobby array<string>
)
row format delimited
fields terminated by "|"
collection items terminated by ","
map keys terminated by ":"
stored as textfile;
load data inpath '/zhou/students.txt'
overwrite into table zhou.students;
Case 2:
/*
user_id,auction_id,cat_id,cat1,property,buy_mount,day
786295544,41098319944,50014866,50022520,21458:86755362;13023209:3593274;10984217:21985;122217965:3227750;21477:28695579;22061:30912;122217803:3230095,2,123434123
*/
drop table if exists sam_mum_baby_trade;
create external table if not exists sam_mum_baby_trade(
user_id bigint,
auction_id bigint,
cat_id bigint,
cat1 bigint,
property map<bigint,bigint>,
buy_mount int,
day bigint
)
row format delimited
fields terminated by ","
collection items terminated by ";"
map keys terminated by ":"
stored as textfile
tblproperties (
'skip.header.line.count'='1'
);
load data inpath '/zhou/sam_mum_baby_trade.csv'
into table zhou.sam_mum_baby_trade;
Case 3:
/*
"1","2","Football"
"2","2","Soccer"
"3","2","Baseball & Softball"
*/
drop table if exists categories;
create table if not exists categories(
id string,
page string,
word string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
'separatorChar'=',',
'quoteChar'='"',
'escapeChar'='\\'
)
stored as textfile;
load data inpath '/zhou/categories.csv'
overwrite into table zhou.categories;
select * from categories;
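OpenCSVSerde exposes every column as string regardless of the declared type, so the string-only declaration above matches what the SerDe actually returns. A minimal sketch of casting on read follows; the view name categories_typed is hypothetical.

```sql
-- Wrap the string-only CSV table in a view that exposes typed columns.
create view if not exists categories_typed as
select
    cast(id   as int) as id,
    cast(page as int) as page,
    word
from categories;

select * from categories_typed;
```

Keeping the cast in a view leaves the raw table untouched, which helps when some rows turn out not to be numeric.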
Case 4:
/*
{"name":"henry","age":22,"gender":"male","phone":"18014499655"}
*/
-- JSON
drop table if exists json;
create table if not exists json(
name string,
age int,
gender string,
phone string
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;
load data inpath '/zhou/json.log'
overwrite into table zhou.json;
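The same SerDe can also map nested JSON onto Hive complex types. The sketch below assumes a hypothetical table json_nested and input lines shaped like the one in the comment; neither comes from the original notes.

```sql
/*
{"name":"henry","address":{"province":"江苏","city":"南京"},"hobby":["beauty","money"]}
*/
-- A nested JSON object maps to a struct, a JSON list maps to an array.
create table if not exists json_nested(
    name    string,
    address struct<province:string,city:string>,
    hobby   array<string>
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;

select name, address.city, hobby[0] from json_nested;
```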
Case 5:
/*
125;男;2015-9-7 1:52:22;1521.84
883;男;2014-9-18 5:24:42;6391.45
652;女;2014-5-4 5:56:45;9603.79
*/
create external table if not exists test1w(
user_id int,
user_gender string,
order_time timestamp,
order_amount decimal(6,2)
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
'input.regex'='(\\d+);(.*?);(\\d{4}-\\d{1,2}-\\d{1,2} \\d{1,2}:\\d{1,2}:\\d{1,2});(\\d+\\.?\\d+?)'
)
stored as textfile
location '/zhou/test1w';
select * from test1w;
II. Creating Hive Tables [Advanced Syntax]
1: CTAS (Create Table As Select)
[Essence]: query an existing table and create a new table from the result
Basic syntax:
create table if not exists NEW_TABLE_NAME as select ... from OLD_TABLE_NAME ...
Example:
Existing table: hive_ext_regex_test1w
Statement:
create table if not exists hive_ext_test_before2015 as
select * from hive_ext_regex_test1w
where year(order_time)<=2015;
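CTAS can also set the storage format of the new table, since the storage clause goes between the table name and as. A minimal sketch reusing the source table from the example above; the target name hive_test_before2015_orc is hypothetical.

```sql
create table if not exists hive_test_before2015_orc
stored as orc
as
select user_id, user_gender, order_time, order_amount
from hive_ext_regex_test1w
where year(order_time) <= 2015;

-- The new table takes its column names and types from the select list.
desc hive_test_before2015_orc;
```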
2: CTE (Common Table Expression)
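[Essence]: filter a table step by step through named subqueries, then build a new table from the final result
Basic syntax:
create table NEW_TABLE_NAME as with CTE_NAME as (select ...) [, ...] select ...
Example:
Scenario: all data from 2015 and earlier, plus order data after 2015 for male users with at least 5 orders or at least 50,000 in total order amount.
Existing table: hive_ext_regex_test1w
Statement: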
create table hive_test_before2015_and_male_over5or5w as
with
before2015 as (
select * from hive_ext_regex_test1w
where year(order_time)<=2015
),
agg_male_over5or5w as (
select user_id
from hive_ext_regex_test1w
where year(order_time)>2015 and user_gender='男'
group by user_id
having count(*)>=5 or sum(order_amount)>=50000
),
male_over5or5w as (
select A.*
from hive_ext_regex_test1w A
inner join agg_male_over5or5w B
on year(A.order_time)>2015 and A.user_id=B.user_id
)
select * from before2015
-- note: union all appends the two result sets without removing duplicates
union all
select * from male_over5or5w;
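A CTE does not have to feed a create table statement; the same with clause can drive an ordinary query. A minimal sketch against the same source table, where the CTE name male_orders is hypothetical:

```sql
with male_orders as (
    select user_id, order_amount
    from hive_ext_regex_test1w
    where user_gender = '男'
)
select user_id,
       count(*)          as order_count,
       sum(order_amount) as total_amount
from male_orders
group by user_id
order by total_amount desc
limit 10;
```

Running each CTE as a plain query like this is a convenient way to check one filtering step before wrapping the whole chain in create table ... as.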
3: CTL (Create Table Like)
[Essence]: copy the structure of an existing table, without its data
Basic syntax:
create table NEW_TABLE_NAME like OLD_TABLE_NAME;
Example:
create table hive_test1w_like like hive_ext_regex_test1w;
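Since like copies only the definition, the new table is expected to start out empty. A short sketch for checking that, using the table created in the example above:

```sql
-- Same columns as the source table.
desc hive_test1w_like;

-- Expected to contain no rows right after creation.
select count(*) from hive_test1w_like;

-- The row format and SerDe settings are copied along with the columns.
show create table hive_test1w_like;
```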
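[Alter] Table Operations
Before you start
1. View basic column information:
desc TABLE_NAME;
2. View detailed table information:
desc formatted TABLE_NAME; => shows which table properties can be modified
3. View the statement that created the table:
show create table TABLE_NAME;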
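Basic syntax (each line below is a separate statement that follows alter table TABLE_NAME)
alter table TABLE_NAME
rename to NEW_NAME; -- rename the table
set tblproperties('key'='value'); -- change table properties
set serdeproperties('key'='value'); -- change SerDe properties, such as the delimiters
set fileformat FORMAT; -- change the file format
change old_name new_name TYPE; -- rename a column
add columns (field_name TYPE); -- add columns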
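To make the alter forms concrete, the sketch below applies several of them to a throwaway table. The table demo_alter, its columns, and the new names are all hypothetical.

```sql
create table if not exists demo_alter(
    id   int,
    name string
)
row format delimited
fields terminated by ','
stored as textfile;

-- Rename the table.
alter table demo_alter rename to demo_alter_v2;

-- Change a table property.
alter table demo_alter_v2 set tblproperties('skip.header.line.count'='1');

-- Change the field delimiter, which is a SerDe property.
alter table demo_alter_v2 set serdeproperties('field.delim'='|');

-- Rename a column while keeping its type.
alter table demo_alter_v2 change column name full_name string;

-- Append a new column at the end.
alter table demo_alter_v2 add columns (remark string);

-- Confirm the result.
desc formatted demo_alter_v2;
```

These statements change metadata only, so files already in the table directory are not rewritten to match a new delimiter or column list.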