Hive案例分析之消费数据
部分数据展示
1.customer_details
markdown
customer_id,first_name,last_name,email,gender,address,country,language,job,credit_type,credit_no
1,Spencer,Raffeorty,sraffeorty0@dropbox.com,Male,9274 Lyons Court,China,Khmer Safety,Technician III,jcb,3589373385487669
2,Cherye,Poynor,cpoynor1@51.la,Female,1377 Anzinger Avenue,China,Czech,Research Nurse,instapayment,6376594861844533
3,Natasha,Abendroth,nabendroth2@scribd.com,Female,2913 Evergreen Lane,China,Yiddish,Budget/Accounting Analyst IV,visa,4041591905616356
4,Huntley,Seally,hseally3@prlog.org,Male,694 Del Sol Lane,China,Albanian,Environmental Specialist,laser,677118310740263477
5,Druci,Coad,dcoad4@weibo.com,Female,16 Debs Way,China,Hebrew,Teacher,jcb,3537287259845047
6,Sayer,Brizell,sbrizell5@opensource.org,Male,71 Banding Terrace,China,Maltese,Accountant IV,americanexpress,379709885387687
7,Becca,Brawley,bbrawley6@sitemeter.com,Female,7 Doe Crossing Junction,China,Czech,Payment Adjustment Coordinator,jcb,3545377719922245
8,Michele,Bastable,mbastable7@sun.com,Female,98 Clyde Gallagher Pass,China,Malayalam,Tax Accountant,jcb,3588131787131504
9,Marla,Brotherhood,mbrotherhood8@illinois.edu,Female,4538 Fair Oaks Trail,China,Dari,Design Engineer,china-unionpay,5602233845197745479
10,Lionello,Gogarty,lgogarty9@histats.com,Male,800 Sage Alley,China,Danish,Clinical Specialist,diners-club-carte-blanche,30290846607043
2.store_details
markdown
1,NoFrill,10
2,Lablaws,23
3,FoodMart,18
4,FoodLovers,26
5,Walmart,30
3.store_review
markdown
7430,1,5
912,3,3
4203,5,3
2205,4,4
5166,5,5
2755,5,
2036,5,5
5712,1,2
5296,5,4
6964,4,2
4.transaction_details
markdown
transaction_id,customer_id,store_id,price,product,date,time
1,225,5,47.02,Bamboo Shoots - Sliced,2017-08-04,8:18
2,290,5,43.12,Tarts Assorted,2017-09-23,14:41
3,300,1,27.01,Soup - Campbells, Minestrone,2017-07-17,11:53
4,191,2,8.08,Hold Up Tool Storage Rack,2017-12-02,20:08
5,158,4,14.03,Pail With Metal Handle 16l White,2017-12-14,6:47
6,66,4,19.33,Flour - Bread,2017-06-25,18:48
7,440,2,7.41,Rice - Jasmine Sented,2017-04-10,18:29
8,419,2,9.28,Soup - Campbells Bean Medley,2018-02-08,17:03
9,351,3,14.07,Butter - Salted,2017-07-01,2:07
10,455,5,8.31,Trout - Smoked,2018-02-20,0:53
一、创建数据库
sql
spark-sql>create database shopping;
二、创建表
客户细节表
sql
spark-sql>create table ext_customer_details(
customer_id string,
first_name string,
last_name string,
email string,
gender string,
address string,
country string,
language string,
job string,
credit_type string,
credit_no string)
row format delimited fields terminated by ',' tblproperties ("skip.header.line.count"="1");
spark-sql>load data local inpath '/usr/local/src/datas/consume/cu.txt' into table ext_customer_details;
交易表
sql
spark-sql>create table ext_transaction_details (
transaction_id string,
customer_id string,
store_id string,
price string,
product string,
purchase_date string,
purchase_time string
)row format delimited fields terminated by ',' tblproperties ("skip.header.line.count"="1");
spark-sql>load data local inpath '/usr/local/src/datas/consume/transaction_details.txt' into table ext_transaction_details;
商店表
sql
spark-sql>create table ext_store_details (
store_id string,
store_name string,
employee_number int
)row format delimited fields terminated by ',' tblproperties ("skip.header.line.count"="1");
spark-sql>load data local inpath '/usr/local/src/datas/consume/store_details.txt' into table ext_store_details;
商店评价表
sql
spark-sql>create table ext_store_review (
transaction_id string,
store_id string,
review_score int
)row format delimited fields terminated by ',' tblproperties ("skip.header.line.count"="1");
spark-sql>load data local inpath '/usr/local/src/datas/consume/store_review.txt' into table ext_store_review;
对个人信息进行加密
sql
create view if not exists vw_customer_details as
select
customer_id,
first_name,
unbase64(last_name) as last_name,
unbase64(email) as email,
gender,
unbase64(address) as address,
country,
job,
credit_type,
unbase64(concat(unbase64(credit_no), 'seed')) as credit_no --better way than hash since it can be decoded base64(regexp_replace(base64(credit_no), unbase64('seed'), ''))
from ext_customer_details
ext_transaction_details中有重复的transaction_id
先建立一个transaction_details表
sql
create table if not exists transaction_details (
transaction_id string,
customer_id string,
store_id string,
price string,
product string,
purchase_time string,
purchase_date date
)
partitioned by (purchase_month string)
设置临时结果集base:查找重复的transaction_id。相同的transaction_id会被编上不同的行号rn,rn大于1的即为重复记录
sql
with base as (
select row_number() over(partition by transaction_id order by 1) as rn,*
from ext_transaction_details
)
select count(*) from base where rn > 1
交易细节表
生成临时子集base,使用transaction_id分组
sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
with base as (
select
transaction_id,
customer_id,
store_id,
price,
product,
purchase_time,
purchase_date,
--对时间格式进行转换
from_unixtime(unix_timestamp(purchase_date ,'yyyy-MM-dd'), 'yyyy-MM') as purchase_month,
--以transaction_id为基础排列row-number
row_number() over(partition by transaction_id order by store_id) as rn
from ext_transaction_details
where customer_id<>'customer_id'
)
from base
insert overwrite table transaction_details partition(purchase_month)
select
if(rn = 1, transaction_id, concat(transaction_id, '_fix', rn)) as transaction_id,
--rn = 1的时候输出transaction_id,当rn >= 2的时候通过concat(transaction_id, '_fix', rn)连接字符串
customer_id,
store_id,
price,
product,
purchase_time,
purchase_date,
purchase_month
商店评价表
创建视图,取出不为空的部分
sql
create view if not exists vw_store_review as
select transaction_id, review_score
from ext_store_review
where review_score is not null
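视图建立后,如果需要按商店统计平均评分,可以像下面这样把评分与交易、商店信息关联起来(仅为示意写法,原文并未给出该查询及其结果):
sql
select sd.store_name, round(avg(sr.review_score), 2) as avg_score
from vw_store_review sr
join transaction_details td on sr.transaction_id = td.transaction_id
join ext_store_details sd on td.store_id = sd.store_id
group by sd.store_name;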
三、分析数据
以客户为中心
1. 找出顾客最常用的信用卡:
sql
spark-sql>select credit_type,count(distinct credit_no) as count_cre from ext_customer_details group by credit_type order by count_cre desc limit 1;
jcb 14
Time taken: 0.553 seconds, Fetched 1 row(s)
2.找出客户数据中的前5个职位名称:
sql
spark-sql>select job,count(*) as count_job from ext_customer_details group by job order by count_job desc limit 5;
Environmental Specialist 2
Assistant Professor 1
Analyst Programmer 1
Recruiting Manager 1
Accountant IV 1
Time taken: 0.275 seconds, Fetched 5 row(s)
3. 对于美国女性来说,她们手中最受欢迎的信用卡片是什么
sql
spark-sql>select credit_type,count(distinct credit_no) as count_cre from ext_customer_details where country = 'United States' and gender = 'Female' group by credit_type order by count_cre desc;
jcb 1
Time taken: 0.569 seconds, Fetched 1 row(s)
22/03/08 09:33:00 INFO CliDriver: Time taken: 0.569 seconds, Fetched 1 row(s)
4. 按性别和国家计算客户统计
sql
spark-sql> select country,gender,count(*) from ext_customer_details group by country,gender;
China Female 22
China Male 6
country gender 1
United States Male 1
United States Female 1
Time taken: 0.358 seconds, Fetched 5 row(s)
22/03/08 09:45:54 INFO CliDriver: Time taken: 0.358 seconds, Fetched 5 row(s)
以交易为中心
1. 计算每月总收入
sql
spark-sql>select purchase_month,round(sum(price),2)as sum_price from transaction_details group by purchase_month order by sum_price desc;
spark-sql>select DATE_FORMAT(purchase_date,'yyyy-MM'),round(sum(price),2)as sum_price from ext_transaction_details group by DATE_FORMAT(purchase_date,'yyyy-MM') order by sum_price desc;
2017-06 721.07
2018-03 690.26
2017-09 606.86
2017-05 587.72
2018-01 566.27
2017-10 564.49
2017-07 559.8
2017-04 543.02
2017-08 531.8
2018-02 486.14
2017-12 366.27
2017-11 334.58
Time taken: 0.935 seconds, Fetched 13 row(s)
22/03/08 10:04:29 INFO CliDriver: Time taken: 0.935 seconds, Fetched 13 row(s)
2. 按季度计算总收入
sql
spark-sql>select round(sum(price),2)as revenue_price,year_quarter from (select price,concat_ws('-',substr(purchase_date,1,4),cast(ceil(month(purchase_date)/3.0)as string))as year_quarter from transaction_details) group by year_quarter order by revenue_price desc;
1851.81 2017-2 --2017年第二季度,即2017年4,5,6月总收入
1742.67 2018-1
1698.46 2017-3
1265.34 2017-4
Time taken: 1.138 seconds, Fetched 4 row(s)
22/03/08 15:31:10 INFO CliDriver: Time taken: 1.138 seconds, Fetched 4 row(s)
3. 按年计算总收入
sql
spark-sql>select round(sum(price),2)as reven_price,year(purchase_date)as years from ext_transaction_details group by year(purchase_date) order by reven_price desc;
4815.61 2017
1742.67 2018
Time taken: 0.696 seconds, Fetched 2 row(s)
22/03/08 11:18:38 INFO CliDriver: Time taken: 0.696 seconds, Fetched 2 row(s)
4. 按工作日星期计算总收入
sql
spark-sql>select round(sum(price),2)as total_rev,DATE_FORMAT(purchase_date,'u')as week from transaction_details group by DATE_FORMAT(purchase_date,'u') order by total_rev desc;
1153.9 2 --所有星期二的收入
1104.69 5
1026.67 3
969.47 7
894.33 4
741.44 1
667.78 6
Time taken: 0.933 seconds, Fetched 7 row(s)
22/03/08 15:50:15 INFO CliDriver: Time taken: 0.933 seconds, Fetched 7 row(s)
5. 按时间段(早晨、上午、中午等)计算总收入/平均消费(需要先清理时间数据)
markdown
--定义时间分桶
--early morning: (5:00, 8:00]
--morning: (8:00, 11:00]
--noon: (11:00, 13:00]
--afternoon: (13:00, 18:00]
--evening: (18:00, 22:00]
--night: (22:00, 5:00]
--night时间段跨越午夜,区间不是线性递增的,所以在判断时把它作为else处理
--同时需要统一时间格式:先把形如'7:23 PM'的12小时制时间转换成24小时制(如19:23),再把分钟折算成小时后再进行比较
sql
-------------------------------------写法1-----------------------------------------------
spark-sql>select time_bucket, round(avg(price),2) as avg_spend, round(sum(price)/1000,2) as revenue_k
from (
select
price, purchase_time, purchase_time_in_hrs,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from (
select
purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) + cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs
from (
select price, purchase_time, if(purchase_time like '%M',
from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format
from
transaction_details)))
group by time_bucket order by avg_spend desc;
-------------------------------------写法2-----------------------------------------------
spark-sql>with base as (
select price, purchase_time, if(purchase_time like '%M',
from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format
from transaction_details ),
timeformat as (
select
purchase_time, price, (cast(split(time_format, ':')[0] as decimal(4,2)) + cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs
from base ),
timebucket as (
select
price, purchase_time, purchase_time_in_hrs,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
select time_bucket, round(avg(price),2) as avg_spend, round(sum(price)/1000,2) as revenue_k from timebucket group by time_bucket order by avg_spend desc;
night 29.01 1.83
noon 28.23 0.45
early morning 27.83 1.03
evening 27.01 1.0
morning 25.58 1.0
afternoon 24.55 1.25
Time taken: 1.285 seconds, Fetched 6 row(s)
22/03/08 16:42:58 INFO CliDriver: Time taken: 1.285 seconds, Fetched 6 row(s)
6. 按工作日计算平均消费
sql
spark-sql>select round(avg(price),2) as avg_price,DATE_FORMAT(purchase_date,'u')as week from ext_transaction_details where date_format(purchase_date, 'u') is not null group by DATE_FORMAT(purchase_date,'u') order by avg_price desc;
29.59 2
28.85 4
28.33 5
27.02 3
26.48 1
24.86 7
23.03 6
Time taken: 0.328 seconds, Fetched 7 row(s)
22/03/08 17:23:09 INFO CliDriver: Time taken: 0.328 seconds, Fetched 7 row(s)
7. 按年、季度、月、星期统计交易笔数
sql
spark-sql>with base as (
select transaction_id, date_format(purchase_date, 'u') as weekday, purchase_month,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null )
select count(distinct transaction_id) as total, weekday, purchase_month, year_quarter, year
from base group by weekday, purchase_month, year_quarter, year order by year, purchase_month;
total weekday purchase_month year_quarter year
1 2 2017-04 2017-2 2017 --2017年第二季度的4月中,星期二的订单共有1笔
7 5 2017-04 2017-2 2017
3 6 2017-04 2017-2 2017
4 1 2017-04 2017-2 2017
2 7 2017-04 2017-2 2017
2 3 2017-04 2017-2 2017
4 7 2017-05 2017-2 2017
4 4 2017-05 2017-2 2017
3 3 2017-05 2017-2 2017
6 2 2017-05 2017-2 2017
2 5 2017-05 2017-2 2017
2 1 2017-05 2017-2 2017
4 5 2017-06 2017-2 2017
1 4 2017-06 2017-2 2017
8. 找出交易量最大的10个客户
sql
spark-sql>select name,totalspend from (select *,concat_ws(' ',first_name,last_name)as name from (select customer_id,count(distinct(transaction_id))as ct,sum(price)as totalspend from ext_transaction_details group by customer_id)t1 join ext_customer_details t2 on t1.customer_id=t2.customer_id)as xx order by totalspend desc limit 10;
Spencer Raffeorty 140.93
Anjanette Penk 70.89
Guinna Damsell 64.59
Marla Brotherhood 63.44
Ingeberg Sutehall 49.26
Hermina Adacot 48.45
Camile Ringer 46.05
Cherye Poynor 28.24
Erie Gilleson 28.16
Cordy Herety 16.2
Time taken: 0.585 seconds, Fetched 10 row(s)
22/03/08 19:35:23 INFO CliDriver: Time taken: 0.585 seconds, Fetched 10 row(s)
9. 找出消费最多的十大客户
sql
spark-sql>with base as (
select
customer_id, count(distinct transaction_id) as trans_cnt, sum(price) as spend_total
from ext_transaction_details
where month(purchase_date) is not null group by customer_id
),
cust_detail as (
select *, concat_ws(' ', first_name, last_name) as cust_name
from base td join ext_customer_details cd on td.customer_id = cd.customer_id )
select spend_total, cust_name as top10_trans_cust
from cust_detail order by spend_total desc limit 10;
140.93 Spencer Raffeorty
70.89 Anjanette Penk
64.59 Guinna Damsell
63.44 Marla Brotherhood
49.26 Ingeberg Sutehall
48.45 Hermina Adacot
46.05 Camile Ringer
28.24 Cherye Poynor
28.16 Erie Gilleson
16.2 Cordy Herety
Time taken: 0.515 seconds, Fetched 10 row(s)
22/03/08 19:40:17 INFO CliDriver: Time taken: 0.515 seconds, Fetched 10 row(s)
10. 该期间内交易次数最多的客户
sql
spark-sql>select customer_id,count(distinct(transaction_id))as cnt from ext_transaction_details group by customer_id order by cnt desc limit 10;
375 4
376 4
1 3
419 3
191 3
483 2
15 2
139 2
434 2
428 2
Time taken: 1.711 seconds, Fetched 10 row(s)
22/03/08 19:48:55 INFO CliDriver: Time taken: 1.711 seconds, Fetched 10 row(s)
11. 按季度/年统计交易总数
sql
spark-sql>with base as (
select transaction_id,
concat_ws('-', substr(purchase_date, 1, 4), cast(ceil(month(purchase_date)/3.0) as string)) as year_quarter, substr(purchase_date, 1, 4) as year
from transaction_details
where purchase_month is not null )
select count(distinct transaction_id) as total, year_quarter, year
from base
group by year_quarter, year order by year_quarter;
total year_quarter year
64 2017-2 2017
61 2017-3 2017
51 2017-4 2017
67 2018-1 2018
Time taken: 7.481 seconds, Fetched 4 row(s)
22/04/02 17:43:05 INFO CliDriver: Time taken: 7.481 seconds, Fetched 4 row(s)
12. 计算整个期间内客户平均消费金额的最大值
sql
spark-sql>with base as (
select customer_id, avg(price) as price_avg, max(price) as price_max
from transaction_details
where purchase_month is not null group by customer_id )
select max(price_avg)
from base;
max(price_avg)
48.59
Time taken: 4.807 seconds, Fetched 1 row(s)
22/04/02 19:20:16 INFO CliDriver: Time taken: 4.807 seconds, Fetched 1 row(s)
13. 每个月谁花的钱最多
sql
spark-sql>with base as (
select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt
from transaction_details
where purchase_month is not null group by purchase_month, customer_id ),
rank_sum as (
select
rank() over(partition by purchase_month order by price_sum desc) as rn_sum,
rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,
purchase_month, price_sum, trans_cnt, customer_id
from base )
select purchase_month, 'spend' as measure_name, price_sum as measure_value, customer_id
from rank_sum where rn_sum = 1;
purchase_month measure_name measure_value customer_id
2017-09 spend 46.64 456
2017-10 spend 48.17 491
2017-05 spend 63.44 9
2017-11 spend 85.36 375
2018-03 spend 49.55 402
2018-02 spend 45.79 156
2017-08 spend 47.48 428
2017-06 spend 49.63 1
2017-04 spend 49.58 158
2017-07 spend 49.26 20
2017-12 spend 47.59 154
2018-01 spend 48.45 26
Time taken: 3.618 seconds, Fetched 12 row(s)
22/04/02 23:43:28 INFO CliDriver: Time taken: 3.618 seconds, Fetched 12 row(s)
14. 每个月谁是最频繁的访客
sql
spark-sql>with base as (
select customer_id, purchase_month, sum(price) as price_sum, count(transaction_id) as trans_cnt
from transaction_details
where purchase_month is not null group by purchase_month, customer_id ),
rank_sum as (
select
rank() over(partition by purchase_month order by price_sum desc) as rn_sum,
rank() over(partition by purchase_month order by trans_cnt desc) as rn_cnt,
purchase_month, price_sum, trans_cnt, customer_id
from base )
select purchase_month, 'visit' as measure_name, trans_cnt as measure_value, customer_id
from rank_sum where rn_cnt = 1 order by measure_name, purchase_month;
purchase_month measure_name measure_value customer_id
2017-04 visit 1 432
2017-04 visit 1 330
2017-04 visit 1 158
2017-04 visit 1 376
2017-04 visit 1 62
2017-04 visit 1 68
2017-04 visit 1 282
2017-04 visit 1 191
2017-04 visit 1 302
2017-04 visit 1 11
2017-04 visit 1 47
2017-04 visit 1 345
2017-04 visit 1 371
15. 找出最受欢迎的5种产品的总价格
sql
spark-sql>select product,sum(price)as totalprice from transaction_details group by product order by totalprice desc;
product totalprice
Shrimp - Tiger 21/25 86.16
Muffin Batt - Carrot Spice 68.81
Venison - Liver 56.95
Flour - Whole Wheat 49.63
Danishes - Mini Raspberry 49.58
Cheese - Cheddarsliced 49.55
Maple Syrup 49.49
Spice - Onion Powder Granulated 48.73
Goat - Whole Cut 48.59
Peppercorns - Pink 48.17
Time taken: 2.056 seconds, Fetched 10 row(s)
22/04/03 04:25:45 INFO CliDriver: Time taken: 2.056 seconds, Fetched 10 row(s)
16. 根据购买频率找出最受欢迎的5种产品
sql
spark-sql>select product,count(transaction_id)as times from transaction_details group by product order by times desc limit 5;
product times
Muffin Batt - Carrot Spice 2
Oil - Coconut 2
Ice Cream Bar - Hagen Daz 2
Shrimp - Tiger 21/25 2
Ecolab - Hobart Washarm End Cap 2
Time taken: 1.835 seconds, Fetched 5 row(s)
22/04/03 05:28:17 INFO CliDriver: Time taken: 1.835 seconds, Fetched 5 row(s)
17. 根据客户数量找出最受欢迎的5种产品
sql
spark-sql>select product,count(customer_id)as cc from transaction_details group by product order by cc desc limit 5;
以商店为中心
1. 按独立访问客户数找出最受欢迎的商店
sql
spark-sql>select t2.store_name,count(distinct t1.customer_id)as unique_visit from transaction_details t1 join ext_store_details t2 on t1.store_id=t2.store_id group by t2.store_name order by unique_visit desc limit 5;
store_name unique_visit
Walmart 50
Lablaws 49
FoodLovers 49
NoFrill 44
FoodMart 39
Time taken: 4.698 seconds, Fetched 5 row(s)
22/04/03 06:14:07 INFO CliDriver: Time taken: 4.698 seconds, Fetched 5 row(s)
2. 根据顾客购买的金额找出最受欢迎的商店
sql
spark-sql>select t2.store_name,sum(price)as reven_price from transaction_details t1 join ext_store_details t2 on t1.store_id=t2.store_id group by store_name order by reven_price desc limit 5;
store_name reven_price
Walmart 1603.0099999999995
Lablaws 1442.46
FoodLovers 1241.9799999999998
NoFrill 1227.32
FoodMart 1043.51
Time taken: 1.645 seconds, Fetched 5 row(s)
22/04/03 06:21:03 INFO CliDriver: Time taken: 1.645 seconds, Fetched 5 row(s)
3. 根据顾客交易情况找出最受欢迎的商店
sql
spark-sql>select t2.store_name,count(transaction_id)as buytimes from transaction_details t1 join ext_store_details t2 on t1.store_id=t2.store_id group by store_name order by buytimes desc limit 5;
store_name buytimes
Walmart 56
Lablaws 52
FoodLovers 50
NoFrill 45
FoodMart 40
Time taken: 1.742 seconds, Fetched 5 row(s)
22/04/03 06:21:56 INFO CliDriver: Time taken: 1.742 seconds, Fetched 5 row(s)
4. 按独立客户数找出每个商店最受欢迎的产品
sql
spark-sql>with base as (
select store_id,product,count(distinct customer_id)as freq_customer from transaction_details group by product,store_id),
re as (
select store_id,product,freq_customer,rank() over(partition by store_id order by freq_customer desc) as rk from base)
select t2.store_name,t1.product,t1.freq_customer from re t1 join ext_store_details t2 on t1.store_id=t2.store_id where rk=1 limit 5;
store_name product freq_customer
FoodMart Pepper - Orange 1
FoodMart Carbonated Water - Peach 1
FoodMart Glaze - Apricot 1
FoodMart Dish Towel 1
FoodMart Crab - Soft Shell 1
Time taken: 3.722 seconds, Fetched 5 row(s)
22/04/03 06:36:34 INFO CliDriver: Time taken: 3.722 seconds, Fetched 5 row(s)
5. 获得每个商店的员工与客户访问率
sql
spark-sql>with base as (
select store_id,count(distinct customer_id,purchase_date)as freq_cust from transaction_details group by store_id
)
select
t1.store_name,t2.freq_cust,t1.employee_number,concat(round(t2.freq_cust/t1.employee_number,2),'%')as ratio
from ext_store_details t1 join base t2 on t1.store_id=t2.store_id;
store_name freq_cust employee_number ratio
FoodMart 40 18 2.22%
Walmart 56 30 1.87%
NoFrill 45 10 4.5%
FoodLovers 50 26 1.92%
Lablaws 52 23 2.26%
Time taken: 6.101 seconds, Fetched 5 row(s)
22/04/04 18:52:07 INFO CliDriver: Time taken: 6.101 seconds, Fetched 5 row(s)
6. 按年、月计算各门店收入
sql
spark-sql>with base as (
select
t1.store_id,t2.store_name,t1.price,t1.purchase_month
from transaction_details t1 join ext_store_details t2 on t1.store_id=t2.store_id
)
select store_id,store_name,purchase_month,sum(price)as totalre from base group by store_id,store_name,purchase_month order by totalre desc limit 5;
store_id store_name purchase_month totalre
5 Walmart 2017-05 306.01000000000005
1 NoFrill 2018-03 228.88
2 Lablaws 2017-06 218.98
2 Lablaws 2017-10 214.9
5 Walmart 2017-04 213.68
Time taken: 3.337 seconds, Fetched 5 row(s)
22/04/04 21:20:52 INFO CliDriver: Time taken: 3.337 seconds, Fetched 5 row(s)
7. 根据商店的总收入制作饼图
sql
spark-sql>select store_name, sum(price) as revenue
from transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where purchase_month is not null
group by store_name
store_name revenue
FoodMart 1043.51
NoFrill 1227.3200000000002
Lablaws 1442.4600000000003
Walmart 1603.0099999999998
FoodLovers 1241.98
Time taken: 2.317 seconds, Fetched 5 row(s)
22/04/04 21:25:24 INFO CliDriver: Time taken: 2.317 seconds, Fetched 5 row(s)
8. 找出每个商店最繁忙的时间段
sql
spark-sql>with base as (
select transaction_id, purchase_time, if(purchase_time like '%M', from_unixtime(unix_timestamp(purchase_time,'hh:mm aa'),'HH:mm'), purchase_time) as time_format, store_id from transaction_details
where purchase_month is not null
),
timeformat as (
select purchase_time, transaction_id, (cast(split(time_format, ':')[0] as decimal(4,2)) + cast(split(time_format, ':')[1] as decimal(4,2))/60) as purchase_time_in_hrs, store_id
from base
),
timebucket as (
select
transaction_id, purchase_time, purchase_time_in_hrs, store_id,
if(purchase_time_in_hrs > 5 and purchase_time_in_hrs <=8, 'early morning',
if(purchase_time_in_hrs > 8 and purchase_time_in_hrs <=11, 'morning',
if(purchase_time_in_hrs > 11 and purchase_time_in_hrs <=13, 'noon',
if(purchase_time_in_hrs > 13 and purchase_time_in_hrs <=18, 'afternoon',
if(purchase_time_in_hrs > 18 and purchase_time_in_hrs <=22, 'evening', 'night'))))) as time_bucket
from timeformat
)
select sd.store_name, count(transaction_id) as tran_cnt, time_bucket
from timebucket td join ext_store_details sd on td.store_id = sd.store_id
group by sd.store_name, time_bucket order by sd.store_name, tran_cnt desc;
store_name tran_cnt time_bucket
FoodLovers 13 afternoon
FoodLovers 11 morning
FoodLovers 10 evening
FoodLovers 6 early morning
FoodLovers 6 night
FoodLovers 4 noon
FoodMart 13 night
FoodMart 8 evening
FoodMart 8 afternoon
FoodMart 4 morning
FoodMart 4 early morning
FoodMart 3 noon
Lablaws 16 night
Lablaws 12 morning
Lablaws 8 afternoon
Lablaws 7 early morning
Lablaws 6 evening
Lablaws 3 noon
NoFrill 11 afternoon
NoFrill 11 early morning
NoFrill 9 night
NoFrill 6 evening
NoFrill 5 morning
NoFrill 3 noon
Walmart 19 night
Walmart 11 afternoon
Walmart 9 early morning
Walmart 7 evening
Walmart 7 morning
Walmart 3 noon
Time taken: 3.315 seconds, Fetched 30 row(s)
22/04/04 21:28:56 INFO CliDriver: Time taken: 3.315 seconds, Fetched 30 row(s)
9. 找出每家店的忠实顾客
sql
spark-sql>with base as (
select store_name, customer_id, sum(td.price) as total_cust_purphase
from transaction_details td join ext_store_details sd on td.store_id = sd.store_id
where purchase_month is not null
group by store_name, customer_id
),
rk_cust as (
select store_name, customer_id, total_cust_purphase,
rank() over(partition by store_name order by total_cust_purphase desc) as rn
from base
)
select * from rk_cust where rn <= 5;
store_name customer_id total_cust_purphase rn
FoodMart 158 49.58 1
FoodMart 220 46.74 2
FoodMart 1 46.14 3
FoodMart 445 45.16 4
FoodMart 379 44.98 5
Lablaws 397 60.92 1
Lablaws 1 49.63 2
Lablaws 402 49.55 3
Lablaws 375 49.49 4
Lablaws 76 48.73 5
NoFrill 77 47.49 1
NoFrill 428 47.48 2
NoFrill 400 46.49 3
NoFrill 453 45.38 4
NoFrill 1 45.16 5
Walmart 376 119.79 1
Walmart 15 70.89 2
Walmart 9 63.44 3
Walmart 154 47.59 4
Walmart 225 47.02 5
FoodLovers 491 48.17 1
FoodLovers 85 47.38 2
FoodLovers 456 46.64 3
FoodLovers 156 45.79 4
FoodLovers 297 45.18 5
Time taken: 3.919 seconds, Fetched 25 row(s)
22/04/04 21:33:43 INFO CliDriver: Time taken: 3.919 seconds, Fetched 25 row(s)
10. 计算各商店的人均创收,找出明星店铺
sql
spark-sql>with base as (
select store_id, sum(price) as revenue
from transaction_details
where purchase_month is not null
group by store_id
)
Select store_name, revenue, employee_number,
round(revenue/employee_number,2) as revenue_per_employee_within_period
from base td join ext_store_details sd on td.store_id = sd.store_id;
store_name revenue employee_number revenue_per_employee_within_period
FoodMart 1043.51 18 57.97
Walmart 1603.01 30 53.43
NoFrill 1227.32 10 122.73
FoodLovers 1241.9799999999998 26 47.77
Lablaws 1442.4600000000003 23 62.72
Time taken: 5.687 seconds, Fetched 5 row(s)
22/04/05 19:13:44 INFO CliDriver: Time taken: 5.687 seconds, Fetched 5 row(s)
Hive函数
尚硅谷Hive3.x
一、系统内置函数
1.查看内置函数
sql
spark-sql>show functions;
spark-sql>desc function upper;
Function: upper
Class: org.apache.spark.sql.catalyst.expressions.Upper
Usage: upper(str) - Returns `str` with all characters changed to uppercase.
Time taken: 0.02 seconds, Fetched 3 row(s)
22/03/08 21:01:57 INFO CliDriver: Time taken: 0.02 seconds, Fetched 3 row(s)
spark-sql>desc function extended upper;
Function: upper
Class: org.apache.spark.sql.catalyst.expressions.Upper
Usage: upper(str) - Returns `str` with all characters changed to uppercase.
Extended Usage:
Examples:
> SELECT upper('SparkSql');
SPARKSQL
Time taken: 0.012 seconds, Fetched 4 row(s)
22/03/08 21:02:29 INFO CliDriver: Time taken: 0.012 seconds, Fetched 4 row(s)
2.UDF\UDAF\UDTF
sql
`UDAF:多进一出` --聚合函数:avg...
`UDF:一进一出` --普通函数
`UDTF:一进多出` --炸裂函数:explode
这里的“一”和“多”指的都是数据的行数(输入或输出的行数)
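下面用前面已建的表和简单字面量粗略演示三类函数在行数上的区别(仅作示意):
sql
--UDF(一进一出):每输入一行得到一行,如upper
spark-sql>select upper('hive');
--UDAF(多进一出):多行聚合成一行,如count、avg
spark-sql>select count(*) from ext_customer_details;
--UDTF(一进多出):一行炸裂成多行,如explode
spark-sql>select explode(array(1,2,3));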
二、常用内置函数
1.空字符赋值:nvl
sql
--数据准备
NULL -1.0
300.0 300.0
500.0 500.0
NULL -1.0
1500.0 1400.0
NULL -1.0
NULL -1.0
--建表语句
spark-sql> create table nvl_test(salary double,id double)row format delimited fields terminated by '\t';
--添加数据
[root@hadoop01 datas]# hadoop fs -put nvl.txt /user/hive/warehouse/shopping.db/nvl_test
--对于数字类型的值进行赋值
spark-sql> select salary from nvl_test;
NULL
300.0
500.0
NULL
1500.0
NULL
NULL
spark-sql> select nvl(salary,888)from nvl_test;
888.0
300.0
500.0
888.0
1500.0
888.0
888.0
Time taken: 0.045 seconds, Fetched 7 row(s)
22/03/08 21:22:18 INFO CliDriver: Time taken: 0.045 seconds, Fetched 7 row(s)
2.CASE WHEN THEN ELSE
sql
--数据准备
悟空 A 男
大海 A 男
宋宋 B 男
凤姐 A 女
婷姐 B 女
听听 B 女
--建表语句
spark-sql> create table case_test(name string,dept_id string,sex string)row format delimited fields terminated by '\t';
--添加数据
[root@hadoop01 datas]# hadoop fs -put case.txt /user/hive/warehouse/shopping.db/case_test
--需求:求出不同部门的男女人数
spark-sql> select dept_id,sum(case sex when '男' then 1 else 0 end )as malecount,sum(case sex when '女' then 1 else 0 end)as femalecount from case_test group by dept_id;
spark-sql> select dept_id,sum(if(sex='男',1,0))as malecount,sum(if(sex='女',1,0))as femalecount from case_test group by dept_id;
B 1 2
A 2 1
Time taken: 0.283 seconds, Fetched 2 row(s)
22/03/08 21:38:04 INFO CliDriver: Time taken: 0.283 seconds, Fetched 2 row(s)
3.行转列
concat
字符拼接,对多个字符串或二进制字符码按照参数顺序进行拼接。
concat(string|binary A, string|binary B...)
sql
select concat('a','b','c');
abc
concat_ws
按照指定分隔符将字符或者数组进行拼接;第一个参数是分隔符。
concat_ws(string SEP, array)/concat_ws(string SEP, string A, string B...)
sql
select concat_ws(' ','a','b','c')
a b c
--将数组列表元素按照指定分隔符拼接,类似于python中的join方法
select concat_ws(' ',array('a','b','c'))
a b c
select concat_ws(",",array('a','b','c'));
a,b,c
collect_set
将分组内的数据放入到一个集合中,具有去重的功能;
sql
create table collect_set (name string, area string, course string, score int);
insert into table collect_set values('zhang3','bj','math',88);
insert into table collect_set values('li4','bj','math',99);
insert into table collect_set values('wang5','sh','chinese',92);
insert into table collect_set values('zhao6','sh','chinese',54);
insert into table collect_set values('tian7','bj','chinese',91);
--把同一分组的不同行的数据聚合成一个集合
select course,collect_set(area),avg(score)from collect_set group by course;
OK
chinese ["sh","bj"] 79.0
math ["bj"] 93.5
--用下标可以取某一个
select course,collect_set(area)[0],avg(score)from collect_set group by course;
OK
chinese sh 79.0
math bj 93.5
collect_list
和collect_set一样,但是没有去重功能
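沿用上面的collect_set表,可以对比一下两者的区别(仅作示意,数组内元素顺序可能与实际运行略有不同):
sql
--collect_list保留重复元素,例如math分组会得到["bj","bj"]
select course,collect_list(area),avg(score)from collect_set group by course;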
案例
使用concat_ws和collect_set函数
sql
--数据准备
孙悟空 白羊座 A
大海 射手座 A
宋宋 白羊座 B
猪八戒 白羊座 A
凤姐 射手座 A
沙僧 白羊座 B
--建表语句
spark-sql> create table ranksTran_test(name string,constelation string, blood_type string)row format delimited fields terminated by '\t';
--添加数据
[root@hadoop01 datas]# hadoop fs -put hanglie.txt /user/hive/warehouse/shopping.db/rankstran_test
--需求:星座血型相同的人用|拼接在一起
--1.第一步:星座和血型拼接
select concat_ws(',',constelation,blood_type)as con_blood,name from ranksTran_test;
白羊座,A 孙悟空
射手座,A 大海
白羊座,B 宋宋
白羊座,A 猪八戒
射手座,A 凤姐
白羊座,B 沙僧
--2.第二步:聚合相同星座血型人的姓名到一行
select con_blood,concat_ws('|',collect_set(name)) from (select concat_ws(',',constelation,blood_type)as con_blood,name from ranksTran_test)xx group by con_blood;
白羊座,A 猪八戒|孙悟空
射手座,A 大海|凤姐
白羊座,B 沙僧|宋宋
Time taken: 0.98 seconds, Fetched 3 row(s)
22/03/09 10:08:27 INFO CliDriver: Time taken: 0.98 seconds, Fetched 3 row(s)
4.列转行
explode
列转行,通常是将一个数组内的元素打开,拆成多行
sql
spark-sql>select explode(array(1,2,3,4,5));
1
2
3
4
5
Time taken: 0.09 seconds, Fetched 5 row(s)
22/03/09 10:15:43 INFO CliDriver: Time taken: 0.09 seconds, Fetched 5 row(s)
lateral view
用法:lateral view udtf(expression) tableAlias AS columnAlias
split:按照指定分隔符将字符串切分成数组
案例
综合使用explode\split\lateral view
sql
--数据准备
《疑犯追踪》 悬疑,动作,科幻,剧情
《Lie to me》 悬疑,警匪,动作,心理,剧情
《战狼2》 战争,动作,灾难
--建表语句
spark-sql> create table explode_test(name string,category string)row format delimited fields terminated by '\t';
--添加数据
[root@hadoop01 datas]# hadoop fs -put explode.txt /user/hive/warehouse/shopping.db/explode_test
--读表
spark-sql> select * from explode_test;
《疑犯追踪》 悬疑,动作,科幻,剧情
《Lie to me》 悬疑,警匪,动作,心理,剧情
《战狼2》 战争,动作,灾难
Time taken: 0.094 seconds, Fetched 3 row(s)
22/03/09 10:26:04 INFO CliDriver: Time taken: 0.094 seconds, Fetched 3 row(s)
--split函数
spark-sql> select split(category,',')from explode_test;
["悬疑","动作","科幻","剧情"]
["悬疑","警匪","动作","心理","剧情"]
["战争","动作","灾难"]
Time taken: 0.094 seconds, Fetched 3 row(s)
22/03/09 10:27:11 INFO CliDriver: Time taken: 0.094 seconds, Fetched 3 row(s)
--explode
spark-sql> select explode(split(category,',')) from explode_test;
悬疑
动作
科幻
剧情
悬疑
警匪
动作
心理
剧情
战争
动作
灾难
Time taken: 0.075 seconds, Fetched 12 row(s)
22/03/09 10:27:51 INFO CliDriver: Time taken: 0.075 seconds, Fetched 12 row(s)
--lateral view
select name,category_name from explode_test lateral view explode(split(category,','))movie_tmp AS category_name;
《疑犯追踪》 悬疑
《疑犯追踪》 动作
《疑犯追踪》 科幻
《疑犯追踪》 剧情
《Lie to me》 悬疑
《Lie to me》 警匪
《Lie to me》 动作
《Lie to me》 心理
《Lie to me》 剧情
《战狼2》 战争
《战狼2》 动作
《战狼2》 灾难
Time taken: 0.084 seconds, Fetched 12 row(s)
22/03/09 10:32:26 INFO CliDriver: Time taken: 0.084 seconds, Fetched 12 row(s)
5.窗口函数
http
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
over():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化
案例
sql
--数据准备
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
--建表语句
spark-sql> create table over_test(name string,orderdata string,cost int)row format delimited fields terminated by ',';
--添加数据
[root@hadoop01 datas]# hadoop fs -put overFunc.txt /user/hive/warehouse/shopping.db/over_test
----------------------需求一:查询2017年4月份购买过的顾客及其总人数-----------------------------
spark-sql>select distinct(name) from over_test where substring(orderdata,0,7) ='2017-04';
mart
jack
spark-sql>select name from over_test where substring(orderdata,0,7) ='2017-04' group by name;
mart
jack
spark-sql>select count(distinct(name)) from over_test where substring(orderdata,0,7) ='2017-04';
2 --代表上述2人
spark-sql>select count(*) from (select name from over_test where substring(orderdata,0,7) ='2017-04' group by name)t1;
2
spark-sql>select name,count(*) over() from over_test where substring(orderdata,0,7) ='2017-04' group by name;
mart 2
jack 2
Time taken: 0.41 seconds, Fetched 2 row(s)
22/03/09 11:31:36 INFO CliDriver: Time taken: 0.41 seconds, Fetched 2 row(s)
--over的作用!!!!!!!!!!!!!!!!
--1.只加over
spark-sql> select name,count(*) over() from over_test;
jack 14 --count(*)=14
tony 14
jack 14
tony 14
jack 14
jack 14
tony 14
jack 14
mart 14
mart 14
neil 14
mart 14
neil 14
mart 14
--2.over()中分区
spark-sql> select name,count(*) over(partition by name) from over_test;
mart 4
mart 4
mart 4
mart 4
jack 5
jack 5
jack 5
jack 5
jack 5
tony 3
tony 3
tony 3
neil 2
neil 2
--3.过滤月份
spark-sql> select distinct(name),count(*) over(partition by name) from over_test where substring(orderdata,0,7)='2017-04';
mart 4
jack 1
Time taken: 0.363 seconds, Fetched 2 row(s)
22/03/09 11:47:01 INFO CliDriver: Time taken: 0.363 seconds, Fetched 2 row(s)
-----------------------------------------------------------------------------------------
-------------------------需求二:查询顾客的购买明细及顾客购买总额-------------------------------
spark-sql>select distinct(name),orderdata,cost,sum(cost) over(partition by name,month(orderdata)) from over_test;
neil 2017-06-12 80 80 --姓名,时间,购买花费,该顾客当月购买总额
tony 2017-01-02 15 94
jack 2017-02-03 23 23
mart 2017-04-11 75 299
jack 2017-04-06 42 42
neil 2017-05-10 12 12
jack 2017-01-08 55 111
jack 2017-01-05 46 111
mart 2017-04-08 62 299
tony 2017-01-04 29 94
mart 2017-04-09 68 299
mart 2017-04-13 94 299
tony 2017-01-07 50 94
jack 2017-01-01 10 111
Time taken: 0.744 seconds, Fetched 14 row(s)
22/03/09 14:25:39 INFO CliDriver: Time taken: 0.744 seconds, Fetched 14 row(s)
--计算顾客明细和月购买总额
spark-sql>select name,orderdata,concat(month(orderdata),'月'),cost,sum(cost) over(partition by month(orderdata)) from over_test;
jack 2017-01-01 1月 10 205
tony 2017-01-02 1月 15 205
tony 2017-01-04 1月 29 205
jack 2017-01-05 1月 46 205
tony 2017-01-07 1月 50 205
jack 2017-01-08 1月 55 205
neil 2017-06-12 6月 80 80
neil 2017-05-10 5月 12 12
jack 2017-04-06 4月 42 341
mart 2017-04-08 4月 62 341
mart 2017-04-09 4月 68 341
mart 2017-04-11 4月 75 341
mart 2017-04-13 4月 94 341
jack 2017-02-03 2月 23 23
Time taken: 0.372 seconds, Fetched 14 row(s)
22/03/09 14:31:38 INFO CliDriver: Time taken: 0.372 seconds, Fetched 14 row(s)
--计算每月消费总额
spark-sql>select distinct(concat(month(orderdata),'月')),sum(cost) over(partition by month(orderdata)) from over_test;
1月 205
5月 12
4月 341
2月 23
6月 80
Time taken: 0.552 seconds, Fetched 5 row(s)
22/03/09 14:36:09 INFO CliDriver: Time taken: 0.552 seconds, Fetched 5 row(s)
-----------------------------------------------------------------------------------------
--------------------------需求三:将每个顾客的cost按照日期进行累加-----------------------------
spark-sql>select name,orderdata,cost,sum(cost) over(partition by name order by orderdata) from over_test;
mart 2017-04-08 62 62
mart 2017-04-09 68 130
mart 2017-04-11 75 205
mart 2017-04-13 94 299
jack 2017-01-01 10 10
jack 2017-01-05 46 56
jack 2017-01-08 55 111
jack 2017-02-03 23 134
jack 2017-04-06 42 176
tony 2017-01-02 15 15
tony 2017-01-04 29 44
tony 2017-01-07 50 94
neil 2017-05-10 12 12
neil 2017-06-12 80 92
Time taken: 0.285 seconds, Fetched 14 row(s)
22/03/09 14:43:24 INFO CliDriver: Time taken: 0.285 seconds, Fetched 14 row(s)
--写法二
spark-sql>select name,orderdata,cost,sum(cost) over(partition by name order by orderdata rows between UNBOUNDED PRECEDING AND current row) from over_test;
--相邻三行累加
select name,orderdata,cost,sum(cost) over(partition by name order by orderdata rows between 1 PRECEDING AND 1 FOLLOWING) from over_test;
mart 2017-04-08 62 130
mart 2017-04-09 68 205
mart 2017-04-11 75 237
mart 2017-04-13 94 169
jack 2017-01-01 10 56
jack 2017-01-05 46 111
jack 2017-01-08 55 124
jack 2017-02-03 23 120
jack 2017-04-06 42 65
tony 2017-01-02 15 44
tony 2017-01-04 29 94
tony 2017-01-07 50 79
neil 2017-05-10 12 92
neil 2017-06-12 80 92
Time taken: 0.269 seconds, Fetched 14 row(s)
22/03/09 14:50:57 INFO CliDriver: Time taken: 0.269 seconds, Fetched 14 row(s)
--总的累加:不分组,按日期排序,从第一行累加到当前行
select name,orderdata,cost,sum(cost) over(order by orderdata rows between UNBOUNDED PRECEDING AND current row) from over_test;
jack 2017-01-01 10 10
tony 2017-01-02 15 25
tony 2017-01-04 29 54
jack 2017-01-05 46 100
tony 2017-01-07 50 150
jack 2017-01-08 55 205
jack 2017-02-03 23 228
jack 2017-04-06 42 270
mart 2017-04-08 62 332
mart 2017-04-09 68 400
mart 2017-04-11 75 475
mart 2017-04-13 94 569
neil 2017-05-10 12 581
neil 2017-06-12 80 661
Time taken: 0.086 seconds, Fetched 14 row(s)
22/03/09 15:05:49 INFO CliDriver: Time taken: 0.086 seconds, Fetched 14 row(s)
-----------------------------------------------------------------------------------------
-------------------------------需求四:查询每个顾客上次的购买时间------------------------------
--lag:前一行拿下来,lead:后一行拿上去
spark-sql>select name,orderdata,lag(orderdata,1,'1970-01-01') over(partition by name order by orderdata)from over_test;
mart 2017-04-08 1970-01-01
mart 2017-04-09 2017-04-08
mart 2017-04-11 2017-04-09
mart 2017-04-13 2017-04-11
jack 2017-01-01 1970-01-01
jack 2017-01-05 2017-01-01
jack 2017-01-08 2017-01-05
jack 2017-02-03 2017-01-08
jack 2017-04-06 2017-02-03
tony 2017-01-02 1970-01-01
tony 2017-01-04 2017-01-02
tony 2017-01-07 2017-01-04
neil 2017-05-10 1970-01-01
neil 2017-06-12 2017-05-10
Time taken: 0.422 seconds, Fetched 14 row(s)
22/03/09 15:25:51 INFO CliDriver: Time taken: 0.422 seconds, Fetched 14 row(s)
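--补充(示意):lead与lag对称,取的是“后一行”的值,可用来查询每个顾客下一次的购买时间(此处仅给出写法,不附运行结果)
spark-sql>select name,orderdata,lead(orderdata,1,'9999-12-31') over(partition by name order by orderdata)from over_test;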
-----------------------------------------------------------------------------------------
--------------------------------需求五:查询20%时间的订单信息---------------------------------
--分为5组,取1就是百分之20
spark-sql>select name,orderdata,cost,ntile(5) over(order by orderdata)groupId from over_test;
jack 2017-01-01 10 1
tony 2017-01-02 15 1
tony 2017-01-04 29 1
jack 2017-01-05 46 2
tony 2017-01-07 50 2
jack 2017-01-08 55 2
jack 2017-02-03 23 3
jack 2017-04-06 42 3
mart 2017-04-08 62 3
mart 2017-04-09 68 4
mart 2017-04-11 75 4
mart 2017-04-13 94 4
neil 2017-05-10 12 5
neil 2017-06-12 80 5
Time taken: 0.124 seconds, Fetched 14 row(s)
22/03/09 15:50:06 INFO CliDriver: Time taken: 0.124 seconds, Fetched 14 row(s)
spark-sql>select name,orderdata,cost from (select name,orderdata,cost,ntile(5) over(order by orderdata)groupId from over_test)t1 where groupId=1;
jack 2017-01-01 10
tony 2017-01-02 15
tony 2017-01-04 29
Time taken: 0.109 seconds, Fetched 3 row(s)
22/03/09 15:51:31 INFO CliDriver: Time taken: 0.109 seconds, Fetched 3 row(s)
6.Rank
rank():排序相同时会重复,总数不变
dense_rank():排序相同时会重复,总数会减少
row_number():根据顺序计算
sql
--数据准备
孙悟空 语文 87
孙悟空 数学 95
孙悟空 英语 68
大海 语文 94
大海 数学 56
大海 英语 84
宋宋 语文 64
宋宋 数学 86
宋宋 英语 84
婷婷 语文 65
婷婷 数学 85
婷婷 英语 78
--建表语句
spark-sql> create table rank_test(name string,subject string, score string)row format delimited fields terminated by '\t';
--加载数据
[root@hadoop01 datas]# hadoop fs -put rank.txt /user/hive/warehouse/shopping.db/rank_test
-----------------------------------案例-------------------------------------------------
--rank()全局排序
spark-sql> select *,rank() over(order by score)from rank_test;
大海 数学 56 1
宋宋 语文 64 2
婷婷 语文 65 3
孙悟空 英语 68 4
婷婷 英语 78 5
大海 英语 84 6
宋宋 英语 84 6
婷婷 数学 85 8
宋宋 数学 86 9
孙悟空 语文 87 10
大海 语文 94 11
孙悟空 数学 95 12
Time taken: 0.118 seconds, Fetched 12 row(s)
22/03/09 16:04:01 INFO CliDriver: Time taken: 0.118 seconds, Fetched 12 row(s)
--rank() 学科分组排序
spark-sql> select *,rank() over(partition by subject order by score desc)from rank_test;
大海 英语 84 1
宋宋 英语 84 1
婷婷 英语 78 3
孙悟空 英语 68 4
大海 语文 94 1
孙悟空 语文 87 2
婷婷 语文 65 3
宋宋 语文 64 4
孙悟空 数学 95 1
宋宋 数学 86 2
婷婷 数学 85 3
大海 数学 56 4
Time taken: 0.226 seconds, Fetched 12 row(s)
22/03/09 16:04:59 INFO CliDriver: Time taken: 0.226 seconds, Fetched 12 row(s)
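--补充(示意):dense_rank()学科分组排序,名次相同时不占用后续名次
--例如英语的两个84分并列第1,78分紧接其后为第2(rank()中78分则排第3),此处仅给出写法,不附运行结果
spark-sql> select *,dense_rank() over(partition by subject order by score desc)from rank_test;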
--row_number()全局排序
spark-sql> select *,row_number() over(order by score)from rank_test;
大海 数学 56 1
宋宋 语文 64 2
婷婷 语文 65 3
孙悟空 英语 68 4
婷婷 英语 78 5
大海 英语 84 6
宋宋 英语 84 7
婷婷 数学 85 8
宋宋 数学 86 9
孙悟空 语文 87 10
大海 语文 94 11
孙悟空 数学 95 12
Time taken: 0.075 seconds, Fetched 12 row(s)
22/03/09 16:06:34 INFO CliDriver: Time taken: 0.075 seconds, Fetched 12 row(s)
--row_number()按照学科分组排序
spark-sql> select *,row_number() over(partition by subject order by score)from rank_test;
孙悟空 英语 68 1
婷婷 英语 78 2
大海 英语 84 3
宋宋 英语 84 4
宋宋 语文 64 1
婷婷 语文 65 2
孙悟空 语文 87 3
大海 语文 94 4
大海 数学 56 1
婷婷 数学 85 2
宋宋 数学 86 3
孙悟空 数学 95 4
Time taken: 0.239 seconds, Fetched 12 row(s)
22/03/09 16:07:09 INFO CliDriver: Time taken: 0.239 seconds, Fetched 12 row(s)
--需求:取每个学科前三名的学生
spark-sql>select *,rank() over(partition by subject order by score desc) as rk from rank_test;
spark-sql>select name,subject,score from (select *,rank() over(partition by subject order by score desc) as rk from rank_test)t1 where rk<=3;
大海 英语 84
宋宋 英语 84
婷婷 英语 78
大海 语文 94
孙悟空 语文 87
婷婷 语文 65
孙悟空 数学 95
宋宋 数学 86
婷婷 数学 85
Time taken: 0.293 seconds, Fetched 9 row(s)
22/03/09 16:17:36 INFO CliDriver: Time taken: 0.293 seconds, Fetched 9 row(s)
7.其他常用函数
日期类函数
unix_timestamp:返回当前或指定时间的时间戳
sql
--获取当前时间戳--返回10位的时间戳,到秒
select unix_timestamp();
--查询指定日期的时间戳,需要写入格式
spark-sql> select unix_timestamp('2000-04-19','yyyy-MM-dd');
956073600
Time taken: 0.023 seconds, Fetched 1 row(s)
22/03/09 17:18:43 INFO CliDriver: Time taken: 0.023 seconds, Fetched 1 row(s)
--返回当前日期的时间戳
spark-sql> select unix_timestamp(current_date());
from_unixtime:将时间戳转为日期格式
sql
spark-sql> select from_unixtime(956073600)
2000-04-19 00:00:00
--指定格式
spark-sql> select from_unixtime(956073600,'yyyy-MM-dd')
2000-04-19
current_date:当前日期
sql
spark-sql> select current_date()
2022-03-09
Time taken: 0.024 seconds, Fetched 1 row(s)
22/03/09 17:22:03 INFO CliDriver: Time taken: 0.024 seconds, Fetched 1 row(s)
current_timestamp:当前的日期加时间
sql
select current_timestamp();
to_date:抽取日期部分:以-划分
sql
spark-sql> select to_date('2022-03-09')
year:获取年
month:获取月
day:获取日
hour:获取时
minute:获取分
sql
spark-sql> select minute('2022-03-09 12:13:14');
13
second:获取秒
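补充一个简单示例(沿用上面的时间字符串):
sql
spark-sql> select second('2022-03-09 12:13:14');
14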
weekofyear:当前时间是一年中的第几周
sql
spark-sql> select weekofyear(current_date());
10
Time taken: 0.021 seconds, Fetched 1 row(s)
22/03/09 17:28:36 INFO CliDriver: Time taken: 0.021 seconds, Fetched 1 row(s)
dayofmonth:当前时间是一个月中的第几天
sql
spark-sql> select dayofmonth(current_date());
9
months_between:两个日期间的月份
sql
spark-sql> select months_between('2020-10-28','2020-04-28');
6.0
add_months:日期加减月
sql
spark-sql> select add_months('2022-10-12',3);
2023-01-12
spark-sql> select add_months('2022-10-12',-3);
2022-07-12
datediff:两个日期相差的天数
sql
spark-sql> select datediff('2020-10-24','2020-10-14');
10
date_add:日期加天数
sql
spark-sql> select date_add('2022-03-09',46);
2022-04-24
date_sub:日期减天数
sql
spark-sql> select date_sub('2022-10-1',-3);
2022-10-04
last_day:日期所在月份的最后一天
sql
spark-sql> select last_day('2022-03-09')
2022-03-31
date_format:格式化日期
sql
select date_format('2022-03-09','yyyy-MM')
2022-03
数据取整函数
round:四舍五入
sql
spark-sql>select round(3.14)
3
--取小数点
spark-sql> select round(3.1415926,2);
3.14
ceil:向上取整--不遵循四舍五入
sql
spark-sql> select ceil(3.00001);
4
floor:向下取整
sql
spark-sql> select floor(3.9999);
3
字符串操作函数
upper:转大写
sql
spark-sql> select upper('abcd');
ABCD
lower:转小写
sql
spark-sql> select lower('ABCD');
abcd
length:长度
sql
spark-sql> select length('abcdefg');
7
trim:去除前后空格
sql
spark-sql> select trim(' abc ');
abc
lpad:向左补齐,到指定长度
sql
spark-sql> select lpad('hello',1,'');
h
spark-sql> select lpad('hello',9,'o');
oooohello
rpad:向右补齐,到指定长度
sql
spark-sql> select rpad('hello',9,'o');
hellooooo
regexp_replace:正则替换
sql
spark-sql> select regexp_replace('2022/03/09','/','-');
2022-03-09
集合操作
size:集合中元素个数
sql
select size(array("hadoop","hive"));
2
map_keys:返回map中的key
map_values:返回map中的value
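补充一个简单示例,用map()构造一个临时map来演示(仅作示意):
sql
spark-sql> select map_keys(map('a',1,'b',2));
["a","b"]
spark-sql> select map_values(map('a',1,'b',2));
[1,2]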
array_contains:判断array中是否包含某个元素
sql
spark-sql> select array_contains(array(1,2,3),1);
true
sort_array:将array中的元素排序
sql
spark-sql> select sort_array(array(1,6,9,8,7,3));
[1,3,6,7,8,9]
三、Hive词频统计案例
sql
--数据准备
hello,saddam
hello,hive,hadoop
hello,zookeeper,spark,flink
hive,spark
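--下面给出一种参考写法(仅作示意):假设把上面的数据加载到只有一列line的表wordcount_test中,表名和路径仅为示例
create table wordcount_test(line string);
load data local inpath '/usr/local/src/datas/wordcount.txt' into table wordcount_test;
--先用split按逗号切分成数组,再用explode炸裂成多行,最后分组计数
select word,count(*) as cnt
from (select explode(split(line,',')) as word from wordcount_test)t1
group by word
order by cnt desc;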
------------------------------------
一、日期函数
DATE_FORMAT
将日期进行格式化
1.取年月
sql
select date_format('2019-12-12','yyyy-MM');
2019-12
Time taken: 0.026 seconds, Fetched 1 row(s)
22/03/08 14:57:36 INFO CliDriver: Time taken: 0.026 seconds, Fetched 1 row(s)
2.取月份
sql
select date_format('2019-12-12','MM');
12
Time taken: 0.022 seconds, Fetched 1 row(s)
22/03/08 15:07:41 INFO CliDriver: Time taken: 0.022 seconds, Fetched 1 row(s)
3.取年
sql
select date_format('2019-12-12','yyyy');
2019
Time taken: 0.02 seconds, Fetched 1 row(s)
22/03/08 15:08:30 INFO CliDriver: Time taken: 0.02 seconds, Fetched 1 row(s)
4.时分秒
sql
select date_format('2017-01-01 13:14:52','yyyy-MM-dd HH:mm:ss'); --日期字符串必须满足yyyy-MM-dd格式
5.计算星期
sql
select date_format('2022-03-08','u');
2
Time taken: 0.021 seconds, Fetched 1 row(s)
22/03/08 15:09:04 INFO CliDriver: Time taken: 0.021 seconds, Fetched 1 row(s)
year
year(string date)
返回时间字符串的年份部分
sql
--最后得到2020
select year('2020-09-02')
month
month(string date)
返回时间字符串的月份部分
sql
--最后得到9
select month('2020-09-10')
day
day(string date)
返回时间字符串的天
sql
--最后得到10
select day('2002-09-10')
from_unixtime
from_unixtime(bigint unixtime[, string format])
将时间的秒值转换成format格式(format可为"yyyy-MM-dd hh:mm:ss","yyyy-MM-dd hh","yyyy-MM-dd hh:mm"等等)
sql
select from_unixtime(1599898989,'yyyy-MM-dd') as current_time
unix_timestamp
unix_timestamp():获取当前时间戳
unix_timestamp(string date):获取指定时间对应的时间戳
通过该函数结合from_unixtime使用,或者可计算两个时间差等
sql
select
unix_timestamp() as current_timestamp,--获取当前时间戳
unix_timestamp('2020-09-01 12:03:22') as speical_timestamp,--指定时间对于的时间戳
from_unixtime(unix_timestamp(),'yyyy-MM-dd') as current_date --获取当前日期
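例如,可以用两个时间戳相减来计算时间差(仅作示意):
sql
select
  unix_timestamp('2020-09-01 14:00:00') - unix_timestamp('2020-09-01 12:03:22') as diff_seconds, --相差6998秒
  round((unix_timestamp('2020-09-01 14:00:00') - unix_timestamp('2020-09-01 12:03:22'))/3600,2) as diff_hours; --约1.94小时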
to_date
to_date(string timestamp)
返回时间字符串的日期部分
sql
--最后得到2020-09-10
select to_date('2020-09-10 10:31:31')
date_add
date_add(string startdate, int days)
从开始时间startdate加上days
sql
--获取当前时间下未来一周的时间
select date_add(now(),7)
--获取上周的时间
select date_add(now(),-7)
date_sub
date_sub(string startdate, int days)
从开始时间startdate减去days
sql
--获取当前时间下未来一周的时间
select date_sub(now(),-7)
--获取上周的时间
select date_sub(now(),7)
示例:统计月度订单数量
hql
select from_unixtime(unix_timestamp(order_date), "yyyy-MM") as year_month,
count(order_id) from orders
group by from_unixtime(unix_timestamp(order_date), "yyyy-MM")
---------------------------------
If判断
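if(条件表达式, 值1, 值2):条件为真返回值1,否则返回值2,前面统计部门男女人数时已经用过。这里补充一个简单示例:
sql
spark-sql> select if(10 > 5, '大于', '不大于');
大于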
Decimal
markdown
Hive的decimal类型借鉴于Oracle,decimal(m,n)表示数字总长度为m位,小数位为n位,那么整数位就只有m-n位了。这与MySql是不一样的,MySql就直接表示整数位为m位了。
1)DECIMAL(9,8)代表最多9位数字,后8位是小数。此时也就是说,小数点前最多有1位数字,如果超过一位则会变成null。
2)如果不指定参数,那么默认是DECIMAL(10,0),即没有小数位,此时0.82会变成1。
sql
spark-sql> select cast(3.1415926 as decimal(7,2));
3.14
Time taken: 0.016 seconds, Fetched 1 row(s)
22/03/08 17:13:58 INFO CliDriver: Time taken: 0.016 seconds, Fetched 1 row(s)
spark-sql> select cast(0.82 as decimal);
1
Time taken: 0.018 seconds, Fetched 1 row(s)
22/03/08 17:15:29 INFO CliDriver: Time taken: 0.018 seconds, Fetched 1 row(s)