[0] About This Article
This article tests the integration of Hive with Hudi. The software versions used are listed below:
| Software | Version |
|----------------|----------------------|
| Hudi | 1.0.0 |
| Hadoop | 3.1.1.7.3.1.0-197 |
| Hive | 3.1.3000.7.3.1.0-197 |
| Spark | 3.4.1.7.3.1.0-197 |
| CDP | 7.3.1 |
[1] Deploying the JAR Files
1- Deploy the hudi-hadoop-mr-bundle-1.0.0.jar file
bash
[root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-1.0.0.jar cdp73-$i:/opt/cloudera/parcels/CDH/lib/hive/lib/; done
hudi-hadoop-mr-bundle-1.0.0.jar 100% 42MB 464.2MB/s 00:00
hudi-hadoop-mr-bundle-1.0.0.jar 100% 42MB 407.5MB/s 00:00
hudi-hadoop-mr-bundle-1.0.0.jar 100% 42MB 378.2MB/s 00:00
hudi-hadoop-mr-bundle-1.0.0.jar 100% 42MB 422.0MB/s 00:00
hudi-hadoop-mr-bundle-1.0.0.jar 100% 42MB 411.4MB/s 00:00
hudi-hadoop-mr-bundle-1.0.0.jar 100% 42MB 420.9MB/s 00:00
[root@cdp73-1 ~]#
2- Deploy the hudi-hive-sync-bundle-1.0.0.jar file
bash
[root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-1.0.0.jar cdp73-$i:/opt/cloudera/parcels/CDH/lib/hive/lib/; done
hudi-hive-sync-bundle-1.0.0.jar 100% 46MB 399.8MB/s 00:00
hudi-hive-sync-bundle-1.0.0.jar 100% 46MB 463.1MB/s 00:00
hudi-hive-sync-bundle-1.0.0.jar 100% 46MB 376.3MB/s 00:00
hudi-hive-sync-bundle-1.0.0.jar 100% 46MB 396.3MB/s 00:00
hudi-hive-sync-bundle-1.0.0.jar 100% 46MB 413.9MB/s 00:00
hudi-hive-sync-bundle-1.0.0.jar 100% 46MB 408.7MB/s 00:00
[root@cdp73-1 ~]#
[2] Syncing a Hudi Table to Hive from Spark
Hive Metastore is a service provided by Apache Hive and backed by a relational database (RDBMS); it acts as the catalog for a data warehouse or data lake, storing all of a table's metadata: partitions, columns, column types, and so on. Hudi table metadata can also be synced to the Hive Metastore, which makes Hudi tables queryable not only through Hive but also through interactive query engines such as Presto and Trino. The steps below sync a Hudi table to the Hive Metastore from Spark.
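For reference, the two datasource options that actually drive the sync in step 3 below are the following (a minimal sketch; the keys are the same ones used later in this article):
scala
// The two write options that enable metastore sync (used again in step 3)
val hiveSyncOpts = Map(
  "hoodie.datasource.meta.sync.enable" -> "true", // sync metadata on every write
  "hoodie.datasource.hive_sync.mode"   -> "hms"   // talk to the Hive Metastore directly
)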
1- Launch spark-shell
bash
spark-shell --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
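Before going on, it is worth confirming inside the shell that the launch-time configs were actually picked up (a quick sanity check; it assumes only the spark-shell command above):
scala
// Sanity check: these should echo the values passed via --conf at launch
println(spark.version)
println(spark.sparkContext.getConf.get("spark.serializer"))     // expect KryoSerializer
println(spark.sparkContext.getConf.get("spark.sql.extensions")) // expect HoodieSparkSessionExtension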
2- Initialize a test table
scala
// spark-shell
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val databaseName = "my_db"
val tableName = "hudi_cow"
val basePath = "/user/hive/warehouse/hudi_cow"
val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("partitionId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("versionId", StringType, true),
  StructField("toBeDeletedStr", StringType, true),
  StructField("intToLong", IntegerType, true),
  StructField("longToInt", LongType, true)
))
val data0 = Seq(Row("row_1", "2021/01/01",0L,"bob","v_0","toBeDel0",0,1000000L),
Row("row_2", "2021/01/01",0L,"john","v_0","toBeDel0",0,1000000L),
Row("row_3", "2021/01/02",0L,"tom","v_0","toBeDel0",0,1000000L))
var dfFromData0 = spark.createDataFrame(data0,schema)
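Optionally, inspect the DataFrame before writing to confirm the schema and the three seed rows:
scala
// Optional: verify the schema and the seed data before the first write
dfFromData0.printSchema()
dfFromData0.show(false)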
3- Write to the test table with Hive sync configured
scala
dfFromData0.write.format("hudi").
options(getQuickstartWriteConfigs).
option("hoodie.datasource.write.precombine.field", "preComb").
option("hoodie.datasource.write.recordkey.field", "rowId").
option("hoodie.datasource.write.partitionpath.field", "partitionId").
option("hoodie.database.name", databaseName).
option("hoodie.table.name", tableName).
option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
option("hoodie.datasource.write.operation", "upsert").
option("hoodie.datasource.write.hive_style_partitioning","true").
option("hoodie.datasource.meta.sync.enable", "true").
option("hoodie.datasource.hive_sync.mode", "hms").
option("hoodie.embed.timeline.server", "false").
mode(Overwrite).
save(basePath)
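Before switching over to Hive, the write can be verified from the same spark-shell session with a standard Hudi snapshot read (`snapshotDF` is just an illustrative name):
scala
// Snapshot-read the table straight from the base path to confirm the upsert landed
val snapshotDF = spark.read.format("hudi").load(basePath)
snapshotDF.select("_hoodie_commit_time", "rowId", "partitionId", "name").show(false)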
4- View the Hudi table in Hive
bash
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2> show create table my_db.hudi_cow
. . . . . . . . . . . . . . . . . . . . . . .> ;
WARN : WARNING! Query command could not be redacted.java.lang.IllegalStateException: Error loading from /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json: java.io.FileNotFoundException: /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json (No such file or directory)
INFO : Compiling command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6): show create table my_db.hudi_cow
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6); Time taken: 0.104 seconds
INFO : Executing command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6): show create table my_db.hudi_cow
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6); Time taken: 0.202 seconds
INFO : OK
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `my_db`.`hudi_cow`( |
| `_hoodie_commit_time` string COMMENT '', |
| `_hoodie_commit_seqno` string COMMENT '', |
| `_hoodie_record_key` string COMMENT '', |
| `_hoodie_partition_path` string COMMENT '', |
| `_hoodie_file_name` string COMMENT '', |
| `rowid` string COMMENT '', |
| `precomb` bigint COMMENT '', |
| `name` string COMMENT '', |
| `versionid` string COMMENT '', |
| `tobedeletedstr` string COMMENT '', |
| `inttolong` int COMMENT '', |
| `longtoint` bigint COMMENT '') |
| PARTITIONED BY ( |
| `partitionid` string COMMENT '') |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
| WITH SERDEPROPERTIES ( |
| 'hoodie.query.as.ro.table'='false', |
| 'path'='/user/hive/warehouse/hudi_cow') |
| STORED AS INPUTFORMAT |
| 'org.apache.hudi.hadoop.HoodieParquetInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION |
| 'hdfs://nameservice1/user/hive/warehouse/hudi_cow' |
| TBLPROPERTIES ( |
| 'last_commit_completion_time_sync'='20250104012010327', |
| 'last_commit_time_sync'='20250104011949137', |
| 'spark.sql.create.version'='3.4.1.7.3.1.0-197', |
| 'spark.sql.sources.provider'='hudi', |
| 'spark.sql.sources.schema.numPartCols'='1', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"rowId","type":"string","nullable":true,"metadata":{}},{"name":"preComb","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"versionId","type":"string","nullable":true,"metadata":{}},{"name":"toBeDeletedStr","type":"string","nullable":true,"metadata":{}},{"name":"intToLong","type":"integer","nullable":true,"metadata":{}},{"name":"longToInt","type":"long","nullable":true,"metadata":{}},{"name":"partitionId","type":"string","nullable":true,"metadata":{}}]}', |
| 'spark.sql.sources.schema.partCol.0'='partitionId', |
| 'transient_lastDdlTime'='1735971612') |
+----------------------------------------------------+
36 rows selected (0.421 seconds)
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2>
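The DDL above confirms the sync worked: the table is EXTERNAL, partitioned by `partitionid`, and wired to `HoodieParquetInputFormat`. Since it is now registered in the metastore, it can also be addressed by name from spark-shell (a quick cross-check, assuming the same session as above):
scala
// Query the synced table by name through the metastore-backed catalog
spark.sql("SHOW TABLES IN my_db").show(false)
spark.sql("SELECT rowId, name, partitionId FROM my_db.hudi_cow").show(false)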
5- Query the Hudi table data in Hive
bash
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2> SELECT * FROM my_db.hudi_cow;
WARN : WARNING! Query command could not be redacted.java.lang.IllegalStateException: Error loading from /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json: java.io.FileNotFoundException: /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json (No such file or directory)
INFO : Compiling command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37): SELECT * FROM my_db.hudi_cow
INFO : No Stats for my_db@hudi_cow, Columns: _hoodie_commit_time, inttolong, longtoint, _hoodie_partition_path, versionid, precomb, _hoodie_record_key, name, tobedeletedstr, _hoodie_commit_seqno, _hoodie_file_name, rowid
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:hudi_cow._hoodie_commit_time, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_commit_seqno, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_record_key, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_partition_path, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_file_name, type:string, comment:null), FieldSchema(name:hudi_cow.rowid, type:string, comment:null), FieldSchema(name:hudi_cow.precomb, type:bigint, comment:null), FieldSchema(name:hudi_cow.name, type:string, comment:null), FieldSchema(name:hudi_cow.versionid, type:string, comment:null), FieldSchema(name:hudi_cow.tobedeletedstr, type:string, comment:null), FieldSchema(name:hudi_cow.inttolong, type:int, comment:null), FieldSchema(name:hudi_cow.longtoint, type:bigint, comment:null), FieldSchema(name:hudi_cow.partitionid, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37); Time taken: 0.207 seconds
INFO : Executing command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37): SELECT * FROM my_db.hudi_cow
INFO : Completed executing command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37); Time taken: 0.007 seconds
INFO : OK
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
| hudi_cow._hoodie_commit_time | hudi_cow._hoodie_commit_seqno | hudi_cow._hoodie_record_key | hudi_cow._hoodie_partition_path | hudi_cow._hoodie_file_name | hudi_cow.rowid | hudi_cow.precomb | hudi_cow.name | hudi_cow.versionid | hudi_cow.tobedeletedstr | hudi_cow.inttolong | hudi_cow.longtoint | hudi_cow.partitionid |
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
No rows selected (0.346 seconds)
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2>
No rows come back!! Why is that?
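One way to narrow this down (a suggestion, not a full diagnosis): read the same base path back from spark-shell. If Spark returns the rows, the data and the Hudi metadata are intact, and the problem lies in Hive's read path rather than in the sync itself:
scala
// If this count is 3, the data is fine and only Hive's read path is failing
spark.read.format("hudi").load("/user/hive/warehouse/hudi_cow").count()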