CDP + Hudi Integration in Practice: Hive

[0] About This Article

This article tests the integration between Hive and Hudi. The software versions used:

| Software | Version |
|----------------|----------------------|
| Hudi | 1.0.0 |
| Hadoop | 3.1.1.7.3.1.0-197 |
| Hive | 3.1.3000.7.3.1.0-197 |
| Spark | 3.4.1.7.3.1.0-197 |
| CDP | 7.3.1 |

[1] Deploying the Jar Files

1 - Deploy the hudi-hadoop-mr-bundle-1.0.0.jar file

bash
[root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-1.0.0.jar   cdp73-$i:/opt/cloudera/parcels/CDH/lib/hive/lib/; done
hudi-hadoop-mr-bundle-1.0.0.jar                                                                                                                                                   100%   42MB 464.2MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar                                                                                                                                                   100%   42MB 407.5MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar                                                                                                                                                   100%   42MB 378.2MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar                                                                                                                                                   100%   42MB 422.0MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar                                                                                                                                                   100%   42MB 411.4MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar                                                                                                                                                   100%   42MB 420.9MB/s   00:00
[root@cdp73-1 ~]#

2 - Deploy the hudi-hive-sync-bundle-1.0.0.jar file

bash
[root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-1.0.0.jar   cdp73-$i:/opt/cloudera/parcels/CDH/lib/hive/lib/; done
hudi-hive-sync-bundle-1.0.0.jar                                                                                                                                                   100%   46MB 399.8MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar                                                                                                                                                   100%   46MB 463.1MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar                                                                                                                                                   100%   46MB 376.3MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar                                                                                                                                                   100%   46MB 396.3MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar                                                                                                                                                   100%   46MB 413.9MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar                                                                                                                                                   100%   46MB 408.7MB/s   00:00
[root@cdp73-1 ~]#
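
Both bundles need to be present on every node that runs Hive, and HiveServer2 only picks up jars from its lib directory at startup, so restart the Hive services from Cloudera Manager after copying. A quick verification over the same hosts:

bash
for i in $(seq 1 6); do ssh cdp73-$i "ls -lh /opt/cloudera/parcels/CDH/lib/hive/lib/hudi-*-bundle-1.0.0.jar"; done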

[2] Syncing a Hudi Table to Hive from Spark

Hive Metastore is a service provided by Apache Hive, backed by a relational database (RDBMS), that acts as the catalog for a data warehouse or data lake: it stores all table metadata such as partitions, columns, and column types. Hudi table metadata can also be synced into the Hive Metastore, which makes Hudi tables queryable not only through Hive but also through interactive query engines such as Presto and Trino. This section covers the different ways to sync a Hudi table to the Hive Metastore.
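
Besides the Spark-datasource-triggered sync used below, the Hudi documentation also describes a standalone HiveSyncTool (shipped under hudi-sync/hudi-hive-sync in the source tree) that can sync an existing table after the fact. A rough sketch only; the JDBC URL and credentials here are placeholders, and the flags follow the Hudi docs:

bash
cd /opt/software/hudi-1.0.0/hudi-sync/hudi-hive-sync
./run_sync_tool.sh \
  --jdbc-url jdbc:hive2://hiveserver:10000 \
  --user hive --pass hive \
  --partitioned-by partitionId \
  --base-path /user/hive/warehouse/hudi_cow \
  --database my_db \
  --table hudi_cow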

1 - Launch spark-shell

bash
spark-shell --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
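
Since this cluster already has a locally built Hudi 1.0.0, the bundle can also be passed with --jars instead of having --packages resolve it from Maven, which helps on offline clusters. The jar path below is an assumption based on the build layout in [1]:

bash
spark-shell --jars /opt/software/hudi-1.0.0/packaging/hudi-spark-bundle/target/hudi-spark3.4-bundle_2.12-1.0.0.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'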

2 - Initialize a test table

scala
// spark-shell
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row


val databaseName = "my_db"
val tableName = "hudi_cow"
val basePath = "/user/hive/warehouse/hudi_cow"

val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("partitionId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("versionId", StringType, true),
  StructField("toBeDeletedStr", StringType, true),
  StructField("intToLong", IntegerType, true),
  StructField("longToInt", LongType, true)
))

val data0 = Seq(
  Row("row_1", "2021/01/01", 0L, "bob", "v_0", "toBeDel0", 0, 1000000L),
  Row("row_2", "2021/01/01", 0L, "john", "v_0", "toBeDel0", 0, 1000000L),
  Row("row_3", "2021/01/02", 0L, "tom", "v_0", "toBeDel0", 0, 1000000L)
)

var dfFromData0 = spark.createDataFrame(data0, schema)
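
An optional sanity check that the DataFrame looks as intended before writing:

scala
// Expect 8 columns matching the schema above and 3 rows.
dfFromData0.printSchema()
dfFromData0.show(false)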

3 - Write the test table with Hive sync enabled

scala
dfFromData0.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option("hoodie.datasource.write.precombine.field", "preComb").
  option("hoodie.datasource.write.recordkey.field", "rowId").
  option("hoodie.datasource.write.partitionpath.field", "partitionId").
  option("hoodie.database.name", databaseName).
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.hive_style_partitioning","true").
  option("hoodie.datasource.meta.sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.embed.timeline.server", "false").
  mode(Overwrite).
  save(basePath)
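
Before switching to Hive, it is worth confirming from Spark that the write itself succeeded, so any problem seen later can be isolated to the Hive side:

scala
// Read the table back through the Hudi datasource; 3 rows are expected.
val readDf = spark.read.format("hudi").load(basePath)
readDf.select("rowId", "partitionId", "name").show(false)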

4 - Inspect the Hudi table in Hive

sql
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2> show create table my_db.hudi_cow
. . . . . . . . . . . . . . . . . . . . . . .> ;
WARN  : WARNING! Query command could not be redacted.java.lang.IllegalStateException: Error loading from /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json: java.io.FileNotFoundException: /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json (No such file or directory)
INFO  : Compiling command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6): show create table my_db.hudi_cow
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6); Time taken: 0.104 seconds
INFO  : Executing command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6): show create table my_db.hudi_cow
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6); Time taken: 0.202 seconds
INFO  : OK
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `my_db`.`hudi_cow`(          |
|   `_hoodie_commit_time` string COMMENT '',         |
|   `_hoodie_commit_seqno` string COMMENT '',        |
|   `_hoodie_record_key` string COMMENT '',          |
|   `_hoodie_partition_path` string COMMENT '',      |
|   `_hoodie_file_name` string COMMENT '',           |
|   `rowid` string COMMENT '',                       |
|   `precomb` bigint COMMENT '',                     |
|   `name` string COMMENT '',                        |
|   `versionid` string COMMENT '',                   |
|   `tobedeletedstr` string COMMENT '',              |
|   `inttolong` int COMMENT '',                      |
|   `longtoint` bigint COMMENT '')                   |
| PARTITIONED BY (                                   |
|   `partitionid` string COMMENT '')                 |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'hoodie.query.as.ro.table'='false',              |
|   'path'='/user/hive/warehouse/hudi_cow')          |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hudi.hadoop.HoodieParquetInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION                                           |
|   'hdfs://nameservice1/user/hive/warehouse/hudi_cow' |
| TBLPROPERTIES (                                    |
|   'last_commit_completion_time_sync'='20250104012010327',  |
|   'last_commit_time_sync'='20250104011949137',     |
|   'spark.sql.create.version'='3.4.1.7.3.1.0-197',  |
|   'spark.sql.sources.provider'='hudi',             |
|   'spark.sql.sources.schema.numPartCols'='1',      |
|   'spark.sql.sources.schema.numParts'='1',         |
|   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"rowId","type":"string","nullable":true,"metadata":{}},{"name":"preComb","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"versionId","type":"string","nullable":true,"metadata":{}},{"name":"toBeDeletedStr","type":"string","nullable":true,"metadata":{}},{"name":"intToLong","type":"integer","nullable":true,"metadata":{}},{"name":"longToInt","type":"long","nullable":true,"metadata":{}},{"name":"partitionId","type":"string","nullable":true,"metadata":{}}]}',  |
|   'spark.sql.sources.schema.partCol.0'='partitionId',  |
|   'transient_lastDdlTime'='1735971612')            |
+----------------------------------------------------+
36 rows selected (0.421 seconds)
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2>
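
The DDL above confirms the sync worked: the table was registered as EXTERNAL, reads through HoodieParquetInputFormat, and is partitioned by partitionid. A quick way to double-check that the partitions themselves were registered in the metastore:

sql
SHOW PARTITIONS my_db.hudi_cow;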

5 - Query the Hudi table data in Hive

sql
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2> SELECT * FROM my_db.hudi_cow;
WARN  : WARNING! Query command could not be redacted.java.lang.IllegalStateException: Error loading from /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json: java.io.FileNotFoundException: /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json (No such file or directory)
INFO  : Compiling command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37): SELECT * FROM my_db.hudi_cow
INFO  : No Stats for my_db@hudi_cow, Columns: _hoodie_commit_time, inttolong, longtoint, _hoodie_partition_path, versionid, precomb, _hoodie_record_key, name, tobedeletedstr, _hoodie_commit_seqno, _hoodie_file_name, rowid
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:hudi_cow._hoodie_commit_time, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_commit_seqno, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_record_key, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_partition_path, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_file_name, type:string, comment:null), FieldSchema(name:hudi_cow.rowid, type:string, comment:null), FieldSchema(name:hudi_cow.precomb, type:bigint, comment:null), FieldSchema(name:hudi_cow.name, type:string, comment:null), FieldSchema(name:hudi_cow.versionid, type:string, comment:null), FieldSchema(name:hudi_cow.tobedeletedstr, type:string, comment:null), FieldSchema(name:hudi_cow.inttolong, type:int, comment:null), FieldSchema(name:hudi_cow.longtoint, type:bigint, comment:null), FieldSchema(name:hudi_cow.partitionid, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37); Time taken: 0.207 seconds
INFO  : Executing command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37): SELECT * FROM my_db.hudi_cow
INFO  : Completed executing command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37); Time taken: 0.007 seconds
INFO  : OK
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
| hudi_cow._hoodie_commit_time  | hudi_cow._hoodie_commit_seqno  | hudi_cow._hoodie_record_key  | hudi_cow._hoodie_partition_path  | hudi_cow._hoodie_file_name  | hudi_cow.rowid  | hudi_cow.precomb  | hudi_cow.name  | hudi_cow.versionid  | hudi_cow.tobedeletedstr  | hudi_cow.inttolong  | hudi_cow.longtoint  | hudi_cow.partitionid  |
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
No rows selected (0.346 seconds)
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2>

The query returns nothing!! Why is that?
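
The root cause is left for a follow-up, but one commonly reported direction (an assumption to verify, not a confirmed diagnosis for this cluster) is that Hive answers simple SELECTs with a local fetch task that can bypass the Hudi input format. Forcing the regular read path is a cheap first experiment:

sql
-- Disable fetch-task conversion so the query goes through HoodieParquetInputFormat.
set hive.fetch.task.conversion=none;
SELECT * FROM my_db.hudi_cow;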
