hive sql 处理多层 json 数组

1. 背景

json 字符串值数据示例:

json 复制代码
{
    "score": 1,
    "submitTime": 1712491933,
    "answerFlag": 1,
    "groupId": 1755547960,
    "answers": [
        {
            "value": "[1, 2, 3]",
            "ids": [
                4,
                5,
                6
            ],
            "isPic": 0,
            "duration": 22314,
            "status": 1,
            "tid": 1
        },
        {
            "value": "aabbcc",
            "lessons": [
                44,
                55,
                66
            ],
            "isPic": 0,
            "duration": 22314,
            "status": 2,
            "tid": 2
        }
    ],
    "questionType": 65
}

现在这个 json 字符串形式的字段值在 hive 表的某个字段中,我需要获取到 "answers" 这个 json 数组,然后将其按照数组长度,列转行到多行数据,每行数据一个子 json ,并且从中获取到每个子 json 的 "tid" 和 "status" 值,理想情况下,我需要这行数据处理完之后,结果如下表所示。

tid status
1 1
2 2

2. 常见方案

通过 ge_json_object 函数,先获取到 "answers" 对应的 json 数组字符串,然后通过正则替换掉 [] 符号,之后将 },{ 符号替换为 }我是分隔符{,最后将 我是分隔符 作为 split 函数的分隔符号,将字符串分割,再通过 lateral view explode() 语法,将数组放到多行。多行数据,都是处理好的 json 对象,之后通过 get_json_object 函数后取需要的字段值即可。具体代码示例如下。

  1. 获取 "answers" 对应的 json 数组。

    sql 复制代码
    with table1 as (
        select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )
    
    select get_json_object(col, '$.answers') as answers
    from table1

    结果如下所示,正常获取到 "answers" 下的 json 数组,结果为字符串。

    answers
    {"value":"\[1, 2, 3","ids":4,5,6,"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":44,55,66,"isPic":0,"duration":22314,"status":2,"tid":2}]
  2. 将最外层的 [] 符号去掉

    sql 复制代码
    with table1 as (
        select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )
    
    select regexp_replace(answers, '\\[|\\]', '') as answers
    from (
        select get_json_object(col, '$.answers') as answers
        from table1
        ) as a

    结果如下所示。

    answers
    {"value":"1, 2, 3","ids":4,5,6,"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":44,55,66,"isPic":0,"duration":22314,"status":2,"tid":2}

    由于第一步处理的结果最外层是 json 数组,左右有 [] 符号,但是由于内层还有子 json 数组,这种直接全局替换的方式,会将内层子 json 数组的 [] 符号也一并去掉,可以查看下面的结果,"ids":4,5,6,,就是因为全局替换,造成整个 json 结构被破坏,之后将无法使用 get_json_object() 函数来获取想要的 key 对应的值了。

由此可见,这种方式只适合于 "answers" 下的 json 数组内的每个 json 对象中都只包含 json 对象才行,不能再包含 json 数组,否则就会造成处理错误,拿不到想要的数据。

3. 推荐方案

3.1 具体步骤

  1. 获取 "answers" 对应的 json 数组。

    sql 复制代码
    with table1 as (
        select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )
    
    select get_json_object(col, '$.answers') as answers
    from table1

    结果如下所示,正常获取到 "answers" 下的 json 数组,结果为字符串。

    answers
    {"value":"\[1, 2, 3","ids":4,5,6,"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":44,55,66,"isPic":0,"duration":22314,"status":2,"tid":2}]
  2. 通过字符串截取的方式,将第一步的结果最前面和最后面的 [] 符号去掉

    sql 复制代码
    with table1 as (
        select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )
    
    select substring(answers, 2, length(answers) - 2) as answers
    from (
        select get_json_object(col, '$.answers') as answers
        from table1
        ) as a

    结果如下所示,最前面和最后面的 [] 符号已经被去掉。

    answers
    {"value":"1, 2, 3","ids":4,5,6,"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":44,55,66,"isPic":0,"duration":22314,"status":2,"tid":2}
  3. 最重要的一步 :通过正则替换,只要匹配到 [.*] 内容,就直接替换为数字 0。

    sql 复制代码
    with table1 as (
        select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )
    
    select substring(answers, 1, length(answers) - 3) as answers
    from (
        select get_json_object(col, '$.answers') as answers
        from table1
        ) as a

    结果如下所示,字符串中所有 [.*] 的部分,都已经被替换为数字 0

    answers
    {"value":"0","ids":0,"isPic":0,"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":0,"isPic":0,"duration":22314,"status":2,"tid":2}

    将 json 数字替换为数字 0,是为了兼容 "[.*]"[.*] 两种情况,不管 json 数组是不是被英文双引号包围,替换为数字 0,都是没问题的,都可以保证 json 的格式不被破坏。

  4. 由于已经去掉了所有的子 json 数组,之后就可以按照传统的方式,将 },{ 替换为 }我是分隔符号{

    sql 复制代码
    with table1 as (
        select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )
    
    select regexp_replace(a.answers, '\\}\\,\\{', '\\}我是分隔符号{') as answers
    from (
        select regexp_replace(answers, '\\[.*\\]', '0') as answers
        from (
            select substring(answers, 2, length(answers) - 2) as answers
            from (
                select get_json_object(col, '$.answers') as answers
                from table1
                ) as a
            ) as a
        ) as a
    answers
    {"value":"0","ids":0,"isPic":0,"duration":22314,"status":1,"tid":1}我是分隔符号{"value":"aabbcc","lessons":0,"isPic":0,"duration":22314,"status":2,"tid":2}
  5. 通过 我是分隔符号 切换字符串,再通过 lateral view explode() 语法将 json 数组展开,最后通过 get_json_object() 函数,获取需要的值即可。

    sql 复制代码
    with table1 as (
        select '{"score":1,"submitTime":1712491933,"answerFlag":1,"groupId":1755547960,"answers":[{"value":"[1, 2, 3]","ids":[4,5,6],"duration":22314,"status":1,"tid":1},{"value":"aabbcc","lessons":[44,55,66],"isPic":0,"duration":22314,"status":2,"tid":2}],"questionType":65}' as col
    )
    
    select get_json_object(b.answer, '$.tid') as tid, get_json_object(b.answer, '$.status') as status
    from (
        select regexp_replace(a.answers, '\\}\\,\\{', '\\}我是分隔符号{') as answers
        from (
            select regexp_replace(answers, '\\[.*\\]', '0') as answers
            from (
                select substring(answers, 2, length(answers) - 2) as answers
                from (
                    select get_json_object(col, '$.answers') as answers
                    from table1
                    ) as a
                ) as a
            ) as a
        ) as a
    lateral view explode(split(regexp_replace(a.answers, '\\}\\,\\{', '\\}我是分隔符号{'), '我是分隔符号')) b as answer

    结果如下,可以看到,数据处理结果符合预期

    tid status
    1 1
    2 2

3.2 注意事项

上面的步骤 3,只适用于你想要展开最外层的这个 json 数组,并且完全不需要内部嵌套的子 json 数组才行,否则将内部的子 json 数组全部替换为数字 0 之后,你就获取不到子 json 数组数据了。

如果还想要内部的子 json 数组,单纯的 sql 应该是实现不了的,需要去自定义 udf,然后通过 java 代码一层一层解析了。

相关推荐
疯狂SQL1 天前
手写高性能在线 JSON 工具|Web Worker 工程化打包 + 语法自动修复 + 多语言代码生成实战
typescript·json·next.js·web worker·前端性能优化·esbuild·源码实战
zzzzzz3102 天前
9K Star 炸裂开源!这个 C 语言写的代码知识图谱,把 Linux 内核索引压缩到了 3 分钟
linux·服务器·sql
云技纵横4 天前
唯一索引 INSERT 死锁实战:5 秒复现交叉插入的 S 锁循环等待
sql·mysql
王小王-1236 天前
基于 Hive 的网易云音乐数据分析及可视化系统
hive·hadoop·数据分析·音乐数据分析·网易云音乐分析·hive音乐分析·hadoop网易云
BD_Marathon6 天前
SQL学习指南——视图
数据库·sql
2601_962072556 天前
李梦娇常识4600问|题库|打印版
sql·华为od·华为·c#·华为云·.net·harmonyos
HackTwoHub6 天前
Sqli-Scanner SQL注入SKILL自动化挖掘SQL注入,零依赖自动化SQL注入挖掘,赏金猎人
数据库·人工智能·sql·web安全·网络安全·自动化·系统安全
Volunteer Technology6 天前
Flink Table API与SQL(一)
大数据·sql·flink
持敬chijing6 天前
Web渗透之SQL注入-常用sql语句
sql·安全·web安全·网络安全
Theo·Chan6 天前
更换 Kingbase V9 License 踩坑记
sql·信创·kingbase