HiveSQL Problems: Data Explosion and Data Merging

Contents

I. Data Explosion

0 Problem Description

1 Data Preparation

2 Data Analysis

3 Summary

II. Data Merging

0 Problem Description

1 Data Preparation

2 Data Analysis

3 Summary

I. Data Explosion

0 Problem Description

How can the string "1-5,16,11-13,9" be expanded into "1,2,3,4,5,16,11,12,13,9" with the order unchanged?

1 Data Preparation

```sql
-- CTE holding the sample string; every query in this part starts with this clause
with data as (select '1-5,16,11-13,9' as a)
```

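2 Data Analysis

Step 1: explode(split(a, ',')) plus row_number(). explode turns the single row into one row per comma-separated segment, and row_number() numbers those rows so that their order can be preserved in the later steps. Because the window has no ORDER BY, the numbering relies on the rows arriving in split order, which Hive does not formally guarantee; posexplode, which returns each element's position, avoids that dependency.

```sql
with data as (select '1-5,16,11-13,9' as a)
select
    a,
    -- no ORDER BY in the window: rn follows the order in which the rows arrive
    row_number() over () as rn
from (
         select
             explode(split(a, ',')) as a
         from data
     ) tmp1;
```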

Output: four rows, 1-5, 16, 11-13 and 9, expected to be numbered rn = 1 to 4 in that order.
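As a cross-check that needs no Hive cluster, the plain-Python sketch below mirrors step 1: `str.split` stands in for `split`, and `enumerate` supplies a 1-based row number. The function name `number_segments` is made up for this illustration and is not part of the original solution.

```python
def number_segments(a: str) -> list[tuple[str, int]]:
    """Split a on commas and pair each segment with a 1-based row number."""
    return [(segment, rn) for rn, segment in enumerate(a.split(","), start=1)]


for segment, rn in number_segments("1-5,16,11-13,9"):
    print(segment, rn)
```

Unlike `row_number() over ()`, `enumerate` is tied to the position in the list, so the numbering here always follows the original order.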

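Step 2: lateral view explode(split(a, '-')), then max(b) - min(b) as diff

(1) lateral view + explode splits each segment on '-' into one row per endpoint and joins every generated row back to the source row it came from;

(2) group by a, rn with min(b) as start_data gives the start value of each segment; b is cast to int first, because split returns strings and min/max on strings compare lexicographically (for example, '10' sorts before '9');

(3) max(b) - min(b) gives the span diff of each segment, which is 0 for a single number such as 16.

```sql
with data as (select '1-5,16,11-13,9' as a)
select
    a,
    rn,
    -- split() returns strings, so cast before min/max to compare numerically
    min(cast(b as int))                       as start_data,
    max(cast(b as int)) - min(cast(b as int)) as diff
from (
         select
             a,
             rn,
             b
         from (
                  select
                      a,
                      row_number() over () as rn
                  from (
                           select
                               explode(split(a, ',')) as a
                           from data
                       ) tmp1
              ) tmp2
                  lateral view explode(split(a, '-')) table1 as b
     ) tmp3
group by a, rn;
```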

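Output: one row per segment, with (a, rn, start_data, diff) expected to be (1-5, 1, 1, 4), (16, 2, 16, 0), (11-13, 3, 11, 2) and (9, 4, 9, 0).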
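The start value and span of each segment can be checked with a small Python sketch. It assumes the endpoints are compared as integers; the helper name `segment_bounds` is hypothetical. The last line shows why a string comparison would be unsafe for a segment such as 9-10.

```python
def segment_bounds(segment: str) -> tuple[int, int]:
    """Return (start, diff) for a segment such as '11-13' or '16'."""
    endpoints = [int(b) for b in segment.split("-")]  # numeric, not string, comparison
    start = min(endpoints)
    return start, max(endpoints) - start


for segment in "1-5,16,11-13,9".split(","):
    print(segment, segment_bounds(segment))

# Compared as strings, '10' is smaller than '9'; compared as numbers, 9 is smaller
print(min("9", "10"), min(9, 10))
```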

Step 3: generate index values from the span, then add each index to the start value to get the expanded values

(1) lateral view posexplode(split(space(cast(diff as int)), '')) table1 as pos, val

space(diff) builds a string of diff spaces; splitting it yields diff + 1 elements, so posexplode produces one row per index pos in [0, diff];

(2) (start_data + pos) as str adds the index to the start value to get each expanded value.

```sql
with data as (select '1-5,16,11-13,9' as a)
select
    a,
    rn,
    cast((start_data + pos) as int) as str
from (
         select
             a,
             rn,
             start_data,
             diff,
             pos
         from (
                  select
                      a,
                      rn,
                      -- cast before min/max so the endpoints compare numerically
                      min(cast(b as int))                       as start_data,
                      max(cast(b as int)) - min(cast(b as int)) as diff
                  from (
                           select
                               a,
                               rn,
                               b
                           from (
                                    select
                                        a,
                                        row_number() over () as rn
                                    from (
                                             select
                                                 explode(split(a, ',')) as a
                                             from data
                                         ) tmp1
                                ) tmp2
                                    lateral view explode(split(a, '-')) table1 as b
                       ) tmp3
                  group by a, rn
              ) tmp4
                  lateral view posexplode(split(space(cast(diff as int)), '')) table1 as pos, val
     ) tmp5
order by rn;
```

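Output: ten rows ordered by rn, with str values 1, 2, 3, 4, 5 (rn 1), 16 (rn 2), 11, 12, 13 (rn 3) and 9 (rn 4).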
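The index arithmetic of this step is easy to verify outside Hive. In the sketch below, `range(diff + 1)` plays the role of the pos values that posexplode generates; the function name `expand` and the hard-coded `groups` list (the rn, start value and span of each segment) are assumptions made for the illustration.

```python
def expand(start: int, diff: int) -> list[int]:
    """Return start + pos for every index pos in [0, diff]."""
    return [start + pos for pos in range(diff + 1)]


# (rn, start_data, diff) for the segments 1-5, 16, 11-13 and 9
groups = [(1, 1, 4), (2, 16, 0), (3, 11, 2), (4, 9, 0)]
for rn, start, diff in groups:
    for value in expand(start, diff):
        print(rn, value)
```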

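Step 4: concatenate the str values in (rn, pos) order to obtain the final result

collect_set would drop duplicate values, and neither collect_set nor collect_list guarantees the order of its elements. Each value is therefore tagged with a zero-padded rn:pos key, the tagged strings are sorted with sort_array, and the keys are stripped with regexp_replace.

```sql
with data as (select '1-5,16,11-13,9' as a)
select
    -- tag each value with a zero-padded 'rn:pos:' key, sort the tagged strings,
    -- then strip the keys so only the values remain, in (rn, pos) order;
    -- five digits of padding cover up to 99999 segments and positions
    regexp_replace(
            concat_ws(',', sort_array(collect_list(
                    concat(lpad(cast(rn as string), 5, '0'), ':',
                           lpad(cast(pos as string), 5, '0'), ':',
                           cast(str as string))))),
            '[0-9]{5}:[0-9]{5}:', '') as result
from (
         select
             rn,
             pos,
             cast((start_data + pos) as int) as str
         from (
                  select
                      a,
                      rn,
                      start_data,
                      diff,
                      pos
                  from (
                           select
                               a,
                               rn,
                               min(cast(b as int))                       as start_data,
                               max(cast(b as int)) - min(cast(b as int)) as diff
                           from (
                                    select
                                        a,
                                        rn,
                                        b
                                    from (
                                             select
                                                 a,
                                                 row_number() over () as rn
                                             from (
                                                      select
                                                          explode(split(a, ',')) as a
                                                      from data
                                                  ) tmp1
                                         ) tmp2
                                             lateral view explode(split(a, '-')) table1 as b
                                ) tmp3
                           group by a, rn
                       ) tmp4
                           lateral view posexplode(split(space(cast(diff as int)), '')) table1 as pos, val
              ) tmp5
     ) tmp6;
```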

Final output: 1,2,3,4,5,16,11,12,13,9
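Order matters in the final string, and an aggregate that collects values does not by itself promise any particular element order. One way to make the order explicit is to tag every value with a zero-padded (rn, pos) key, sort the tagged strings, and strip the keys afterwards, which is what a `sort_array` over `collect_list` would do in Hive. The Python sketch below illustrates the idea on deliberately shuffled rows; `join_in_order` is a hypothetical helper, and the five-digit padding is an arbitrary choice.

```python
import re


def join_in_order(rows: list[tuple[int, int, int]]) -> str:
    """Join values in (rn, pos) order; rows are (rn, pos, value) in any order."""
    # Zero-padded keys make the lexicographic sort agree with the numeric order
    tagged = sorted(f"{rn:05d}:{pos:05d}:{value}" for rn, pos, value in rows)
    # Strip the 'rn:pos:' key in front of every value
    return re.sub(r"\d{5}:\d{5}:", "", ",".join(tagged))


# Deliberately shuffled (rn, pos, value) rows
rows = [(3, 1, 12), (1, 0, 1), (4, 0, 9), (1, 4, 5), (2, 0, 16),
        (1, 2, 3), (3, 0, 11), (1, 1, 2), (3, 2, 13), (1, 3, 4)]
print(join_in_order(rows))
```

Because the keys are unique per (rn, pos), repeated values such as the two 2s in "1-2,2-3" survive the concatenation, which a set-based aggregate would not allow.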

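3 Summary

The general approach to data explosion is:

1. Compute the span diff of the interval [a, b], that is, b - a;
2. Use split together with posexplode to turn one row into diff + 1 rows and to generate the index pos, which ranges over [0, diff];
3. Add pos to the start value a of the interval to lay the values out one per row;
4. Process the flattened data set further as needed, for example with grouping and aggregation.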

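II. Data Merging

0 Problem Description

Interview question: generate the data of table C from the data of table A. Consecutive rows (ordered by id) that share the same name are merged into one row, which keeps the largest id of the run and joins the names with '|'.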

1 Data Preparation

```sql
-- source data
create table if not exists  tableA
(
    id     string comment 'user id',
    name   string comment 'user name'
) comment 'table A';

insert overwrite table tableA values
    ('1','aa'),
    ('2','aa'),
    ('3','aa'),
    ('4','d'),
    ('5','c'),
    ('6','aa'),
    ('7','aa'),
    ('8','e'),
    ('9','f'),
    ('10','g');


-- expected result
create table if not exists  tableC
(
    id     string comment 'user id',
    name   string comment 'user name'
) comment 'table C';

insert overwrite table tableC values
    ('3','aa|aa|aa'),
    ('4','d'),
    ('5','c'),
    ('7','aa|aa'),
    ('8','e'),
    ('9','f'),
    ('10','g');
```

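2 Data Analysis

Step 1: find the breakpoints, that is, the rows whose name differs from the name of the previous row when ordered by id

```sql
select
    id,
    name,
    if(name != lag_name, 1, 0) as flag
from (
         select
             id,
             name,
             -- the default value is the row's own name, so the first row is never a breakpoint
             lag(name, 1, name) over (order by cast(id as int)) as lag_name
         from tableA
     ) tmp1;
```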

Output: flag is 1 for ids 4, 5, 6, 8, 9 and 10, and 0 for ids 1, 2, 3 and 7.
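The flags can be reproduced with a few lines of Python. The sketch assumes the names are already sorted by numeric id, and it compares the first row with itself, as `lag(name, 1, name)` does; `breakpoint_flags` is a name chosen for this example.

```python
names = ["aa", "aa", "aa", "d", "c", "aa", "aa", "e", "f", "g"]  # ids 1..10


def breakpoint_flags(names: list[str]) -> list[int]:
    """Return 1 where a name differs from the previous one, else 0."""
    # The first row is compared with itself, like lag(name, 1, name)
    previous = names[:1] + names[:-1]
    return [int(name != prev) for name, prev in zip(names, previous)]


print(breakpoint_flags(names))
```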

Step 2: mark breakpoints with 1 and all other rows with 0, then take the running total of the marks to build the group label

```sql
select
    id,
    name,
    -- running total of the breakpoint marks, used as the group label
    sum(flag) over (order by cast(id as int)) grp
from (
         select
             id,
             name,
             -- 1 at a breakpoint, 0 elsewhere
             if(name != lag_name, 1, 0) flag
         from (
                  select
                      id,
                      name,
                      lag(name, 1, name) over (order by cast(id as int)) as lag_name
                  from tableA
              ) tmp1
     ) tmp2;
```

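Output: grp is 0 for ids 1 to 3, 1 for id 4, 2 for id 5, 3 for ids 6 and 7, 4 for id 8, 5 for id 9 and 6 for id 10.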
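The running total is the step that turns isolated marks into group labels: every breakpoint raises the total by one, so all rows up to the next breakpoint share the same value. A minimal Python sketch, using `itertools.accumulate` in place of the window sum and the breakpoint marks of table A written out by hand:

```python
from itertools import accumulate

flags = [0, 0, 0, 1, 1, 1, 0, 1, 1, 1]  # breakpoint marks for ids 1..10

# Running total of the marks, like sum(flag) over (order by id)
grp = list(accumulate(flags))
print(grp)
```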

Step 3: merge the rows by group label and take the largest id in each group as the id

```sql
select
    max_id,
    -- collect_list gathers the names of a group; concat_ws joins them with '|'
    concat_ws('|', collect_list(name)) as name
from (
         select
             name,
             grp,
             -- id is a string, so cast it to compare numerically
             max(cast(id as int)) over (partition by grp) as max_id
         from (
                  select
                      id,
                      name,
                      sum(if(name != lag_name, 1, 0)) over (order by cast(id as int)) as grp
                  from (
                           select
                               id,
                               name,
                               lag(name, 1, name) over (order by cast(id as int)) as lag_name
                           from tableA
                       ) tmp1
              ) tmp2
     ) tmp3
group by max_id, grp;
```

Output: seven rows that match the contents of tableC.

Grouping by max_id and grp and applying concat_ws('|', collect_list(name)) to name yields the final result.
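The whole merge can be checked end to end against table C with a short Python sketch. It assumes ids are compared numerically even though they are stored as strings; `merge_runs` and its variable names are made up for the illustration and are not part of the original solution.

```python
from collections import defaultdict


def merge_runs(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge consecutive rows with equal names; rows are (id, name) pairs."""
    ordered = sorted(rows, key=lambda row: int(row[0]))  # order by cast(id as int)
    groups: dict[int, list[tuple[str, str]]] = defaultdict(list)
    grp, previous = 0, None
    for row_id, name in ordered:
        if previous is not None and name != previous:
            grp += 1  # breakpoint: start a new group
        groups[grp].append((row_id, name))
        previous = name
    # Per group: numerically largest id, names joined with '|'
    return [
        (max((i for i, _ in members), key=int), "|".join(n for _, n in members))
        for members in groups.values()
    ]


table_a = [("1", "aa"), ("2", "aa"), ("3", "aa"), ("4", "d"), ("5", "c"),
           ("6", "aa"), ("7", "aa"), ("8", "e"), ("9", "f"), ("10", "g")]
for row in merge_runs(table_a):
    print(row)
```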

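3 Summary

The algorithm for breakpoint grouping problems:

Step 1: find the breakpoints that satisfy the condition
Step 2: mark breakpoints with 1 and all other rows with 0
Step 3: take the running total of the marks with sum(xx) over (order by xx) to build the group label
Step 4: group by the label and solve the problem within each group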