7. Hive解析JSON字符串、JSON数组

文章目录

- - Hive解析JSON字符串
  - - [1. get_json_object](#1. get_json_object)
    - - 局限性
    - [2. json_tuple](#2. json_tuple)
  - Hive解析JSON数组
  - - 前置知识
    - - explode函数
      - regexp_replace函数
    - [1. 嵌套子查询解析JSON数组（使用explode+regexp_replace）](#1. 嵌套子查询解析JSON数组（使用explode+regexp_replace）)
    - [2. 使用 lateral view 解析JSON数组](#2. 使用 lateral view 解析JSON数组)
  - 学习链接

Hive解析JSON字符串

1. get_json_object

语法：get_json_object(json_string, path)
- json_string 是要解析的JSON字符串
- path 是用于指定要提取的字段路径的字符串

sql 复制代码

-- 示例1(单层JSON)
SELECT get_json_object('{
                            "name": "John",
                            "age": 30
                         }', '$.name');
-- res: "John"

-- 示例2(嵌套JSON)
SELECT get_json_object('{
                            "person": {
                                "name": "John",
                                "age": 30,
                                "address": {
                                    "street": "123 Main St",
                                    "city": "New York"
                                }
                            }
                         }', '$.person.address.street');
-- res: "123 Main St"

局限性

get_json_object 函数的性能会受到 JSON数据的结构和大小 的影响。对于较复杂的嵌套结构，考虑使用Hive的其他函数或自定义函数来处理JSON数据可能更合适。
get_json_object 函数每次只能返回一个数据项。

2. json_tuple

语法：json_tuple(json_string, field1, field2, ...)
- json_string 是要解析的JSON字符串
- field1、field2 ... 是要提取的字段名

sql 复制代码

-- 示例1(单层JSON)
select json_tuple('{
                        "name": "zhangsan",
                        "age": 18
                    }','name','age');
-- res: zhangsan	18

-- 示例2(嵌套JSON)
SELECT json_tuple('{
                        "person": {
                            "name": "Alice",
                            "age": 25,
                            "address": {
                                "city": "New York",
                                "country": "USA"
                            }
                        }
                    }', 'person.name', 'person.age', 'person.address.city');
-- res: Alice    25    New York

Hive解析JSON数组

前置知识

explode函数

语法：explode(Array OR Map)
说明：explode() 函数接收一个 array 或者 map 类型的数据作为输入，然后将 array 或 map 里面的元素按照每行的形式输出，即将 hive 一列中复杂的 array 或者 map 结构拆分成多行显示

sql 复制代码

-- 解析 array
hive> select explode(array('A','B','C'));
OK
A
B
C

-- 解析map
hive> select explode(map('A',10,'B',20,'C',30));
OK
A       10
B       20
C       30

regexp_replace函数

语法： regexp_replace(string A, string B, string C)
说明：将字符串A中符合 java正则表达式B 的部分替换为C。注意，在有些情况下要使用转义字符，类似oracle中的regexp_replace函数。

sql 复制代码

hive> select regexp_replace('foobar', 'oo|ar', ''); 
OK
fb

1. 嵌套子查询解析JSON数组（使用explode+regexp_replace）

数据如下所示：

array(json_str)
[{"website":"baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]

解析出其中的website、name

website	name
baidu.com	百度
google.com	谷歌

sql 复制代码

-- 思路
-- 1. 使用 regexp_replace 函数将原数据转换为 {"website":"baidu.com","name":"百度"};{"website":"google.com","name":"谷歌"}
-- 2. 使用 split 函数按照 ';' 分割 {"website":"baidu.com","name":"百度"};{"website":"google.com","name":"谷歌"},返回 [{"website":"baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]
-- 3. 使用 explode 炸裂为
-- {"website":"baidu.com","name":"百度"}
-- {"website":"google.com","name":"谷歌"}
-- 4. 使用 json_tuple 解析数据

-- 实现
-- 1. 先将json数组中的元素解析出来，转化为每行显示
SELECT explode(split(regexp_replace(regexp_replace(
                                            '[{"website":"baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]',
                                            '\\[|\\]', ''), '\\}\\,\\{', '\\}\\;\\{'), '\\;'));

-- 2. 使用 json_tuple 解析数据
select json_tuple(json, 'website', 'name') 
from (
select explode(split(regexp_replace(regexp_replace('[{"website":"baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) 
as json) t1;

上面 regexp_replace 函数中的内容解析：

sql 复制代码

SELECT explode(split(
    regexp_replace(
        regexp_replace(
            '[
                {"website":"baidu.com","name":"百度"},
                {"website":"google.com","name":"谷歌"}
            ]', 
            '\\[|\\]' , ''), --将json数组两边的中括号去掉
            
              '\\}\\,\\{' , '\\}\\;\\{'), --将json数组元素之间的逗号换成分号
                
                 '\\;') --以分号作为分隔符(split函数以分号作为分隔)
          );
          
-- 问：为什么要将json数组中元素之间的逗号换成分号？
-- 答：因为元素内的分隔也是逗号，如果不将元素之间的逗号换掉的话，后面用split函数分隔时也会把元素内的数据给分隔，这不是我们想要的结果。

2. 使用 lateral view 解析JSON数组

当我们数据的结构如下，

ids	names
[1,2,3]	[{"name": "daming", "age": "15"}, {"name": "lingling", "age": "14"}, {"name": "tom", "age": "17"}]

我们想要获取的数据为，

id	name
1	daming
2	daming
3	daming
1	lingling
2	lingling
3	lingling
1	tom
2	tom
3	tom

sql 复制代码

with json_data as (
    select `array`(1, 2, 3)                        as ids,
           `array`('{"name": "daming", "age": "15"}', '{"name": "lingling", "age": "14"}',
                   '{"name": "tom", "age": "17"}') as json_infos
)

-- 使用json_tuple()报错: org.apache.hadoop.hive.ql.parse.SemanticException:
-- UDTF's are not supported outside the SELECT clause, nor nested in expressions
-- 原因: 未知

-- SELECT id, json_tuple(json_info, 'name')

SELECT id, get_json_object(json_info, '$.name')
FROM json_data
         lateral view explode(json_data.ids) tmp_ids as id
         lateral view explode(json_data.json_infos) tmp_json_infos as json_info;

学习链接

文心一言

ChatGPT

Hive解析Json数组超全讲解