## \[0\] About This Article

This article tests the integration of Hive with Hudi.

| Software       | Version              |
|----------------|----------------------|
| Hudi           | 1.0.0                |
| Hadoop Version | 3.1.1.7.3.1.0-197    |
| Hive Version   | 3.1.3000.7.3.1.0-197 |
| Spark Version  | 3.4.1.7.3.1.0-197    |
| CDP            | 7.3.1                |

## \[1\] Deploying the JAR Files

### 1 - Deploy the hudi-hadoop-mr-bundle-1.0.0.jar file

```bash
[root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-1.0.0.jar cdp73-$i:/opt/cloudera/parcels/CDH/lib/hive/lib/; done
hudi-hadoop-mr-bundle-1.0.0.jar    100%   42MB 464.2MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar    100%   42MB 407.5MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar    100%   42MB 378.2MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar    100%   42MB 422.0MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar    100%   42MB 411.4MB/s   00:00
hudi-hadoop-mr-bundle-1.0.0.jar    100%   42MB 420.9MB/s   00:00
[root@cdp73-1 ~]#
```

### 2 - Deploy the hudi-hive-sync-bundle-1.0.0.jar file

```bash
[root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-1.0.0.jar cdp73-$i:/opt/cloudera/parcels/CDH/lib/hive/lib/; done
hudi-hive-sync-bundle-1.0.0.jar    100%   46MB 399.8MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar    100%   46MB 463.1MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar    100%   46MB 376.3MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar    100%   46MB 396.3MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar    100%   46MB 413.9MB/s   00:00
hudi-hive-sync-bundle-1.0.0.jar    100%   46MB 408.7MB/s   00:00
[root@cdp73-1 ~]#
```

## \[2\] Syncing a Hudi Table to Hive from Spark

Hive Metastore is a service provided by Apache Hive, backed by a relational database (RDBMS), that serves as the catalog for a data warehouse or data lake. It stores all of a table's metadata, such as its partitions, columns, and column types. A Hudi table's metadata can also be synced to the Hive Metastore, which makes the table queryable not only through Hive but also through interactive query engines such as Presto and Trino. Below we walk through syncing a Hudi table to the Hive Metastore.

### 1 - Start the Spark shell

```bash
spark-shell --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```

### 2 - Initialize a test table

```scala
// spark-shell
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val databaseName = "my_db"
val tableName = "hudi_cow"
val basePath = "/user/hive/warehouse/hudi_cow"

val schema = StructType(Array(
  StructField("rowId", StringType, true),
  StructField("partitionId", StringType, true),
  StructField("preComb", LongType, true),
  StructField("name", StringType, true),
  StructField("versionId", StringType, true),
  StructField("toBeDeletedStr", StringType, true),
  StructField("intToLong", IntegerType, true),
  StructField("longToInt", LongType, true)
))

val data0 = Seq(
  Row("row_1", "2021/01/01", 0L, "bob",  "v_0", "toBeDel0", 0, 1000000L),
  Row("row_2", "2021/01/01", 0L, "john", "v_0", "toBeDel0", 0, 1000000L),
  Row("row_3", "2021/01/02", 0L, "tom",  "v_0", "toBeDel0", 0, 1000000L))

var dfFromData0 = spark.createDataFrame(data0, schema)
```

### 3 - Write the test table and enable Hive sync

```scala
dfFromData0.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option("hoodie.datasource.write.precombine.field", "preComb").
  option("hoodie.datasource.write.recordkey.field", "rowId").
  option("hoodie.datasource.write.partitionpath.field", "partitionId").
  option("hoodie.database.name", databaseName).
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.datasource.meta.sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "hms").
  option("hoodie.embed.timeline.server", "false").
  mode(Overwrite).
  save(basePath)
```
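Before switching to Hive, it is worth confirming from the Spark side that the write actually landed. The snippet below is a minimal sanity check, assuming the same spark-shell session with `basePath` still in scope:

```scala
// Snapshot-read the table back through the Hudi datasource;
// with the data above this should return the three upserted rows.
val snapshotDF = spark.read.format("hudi").load(basePath)
snapshotDF.select("rowId", "partitionId", "name", "versionId").show(false)
println(snapshotDF.count()) // expected: 3
```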
option("hoodie.datasource.write.partitionpath.field", "partitionId"). option("hoodie.database.name", databaseName). option("hoodie.table.name", tableName). option("hoodie.datasource.write.table.type", "COPY_ON_WRITE"). option("hoodie.datasource.write.operation", "upsert"). option("hoodie.datasource.write.hive_style_partitioning","true"). option("hoodie.datasource.meta.sync.enable", "true"). option("hoodie.datasource.hive_sync.mode", "hms"). option("hoodie.embed.timeline.server", "false"). mode(Overwrite). save(basePath) ```  ### 4-在Hive中查看Hudi表 ```bash 0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2> show create table my_db.hudi_cow . . . . . . . . . . . . . . . . . . . . . . .> ; WARN : WARNING! Query command could not be redacted.java.lang.IllegalStateException: Error loading from /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json: java.io.FileNotFoundException: /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json (No such file or directory) INFO : Compiling command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6): show create table my_db.hudi_cow INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:createtab_stmt, type:string, comment:from deserializer)], properties:null) INFO : Completed compiling command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6); Time taken: 0.104 seconds INFO : Executing command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6): show create table my_db.hudi_cow INFO : Starting task [Stage-0:DDL] in serial mode INFO : Completed executing command(queryId=hive_20250104012540_9035bba6-0eec-4cea-815d-25a780faf8e6); Time taken: 0.202 seconds INFO : OK +----------------------------------------------------+ | createtab_stmt | +----------------------------------------------------+ | CREATE EXTERNAL TABLE `my_db`.`hudi_cow`( | | `_hoodie_commit_time` string COMMENT '', | | `_hoodie_commit_seqno` string COMMENT '', | | `_hoodie_record_key` string COMMENT '', | | `_hoodie_partition_path` string COMMENT '', | | `_hoodie_file_name` string COMMENT '', | | `rowid` string COMMENT '', | | `precomb` bigint COMMENT '', | | `name` string COMMENT '', | | `versionid` string COMMENT '', | | `tobedeletedstr` string COMMENT '', | | `inttolong` int COMMENT '', | | `longtoint` bigint COMMENT '') | | PARTITIONED BY ( | | `partitionid` string COMMENT '') | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' | | WITH SERDEPROPERTIES ( | | 'hoodie.query.as.ro.table'='false', | | 'path'='/user/hive/warehouse/hudi_cow') | | STORED AS INPUTFORMAT | | 'org.apache.hudi.hadoop.HoodieParquetInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' | | LOCATION | | 'hdfs://nameservice1/user/hive/warehouse/hudi_cow' | | TBLPROPERTIES ( | | 'last_commit_completion_time_sync'='20250104012010327', | | 'last_commit_time_sync'='20250104011949137', | | 'spark.sql.create.version'='3.4.1.7.3.1.0-197', | | 'spark.sql.sources.provider'='hudi', | | 'spark.sql.sources.schema.numPartCols'='1', | | 'spark.sql.sources.schema.numParts'='1', | | 
### 5 - Query the Hudi Table Data in Hive

```bash
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2> SELECT * FROM my_db.hudi_cow;
WARN  : WARNING! Query command could not be redacted.java.lang.IllegalStateException: Error loading from /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json: java.io.FileNotFoundException: /home/opt/cloudera/parcels/CDH-7.3.1-1.cdh7.3.1.p0.60371244/bin/../lib/hive/conf/redaction-rules.json (No such file or directory)
INFO  : Compiling command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37): SELECT * FROM my_db.hudi_cow
INFO  : No Stats for my_db@hudi_cow, Columns: _hoodie_commit_time, inttolong, longtoint, _hoodie_partition_path, versionid, precomb, _hoodie_record_key, name, tobedeletedstr, _hoodie_commit_seqno, _hoodie_file_name, rowid
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:hudi_cow._hoodie_commit_time, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_commit_seqno, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_record_key, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_partition_path, type:string, comment:null), FieldSchema(name:hudi_cow._hoodie_file_name, type:string, comment:null), FieldSchema(name:hudi_cow.rowid, type:string, comment:null), FieldSchema(name:hudi_cow.precomb, type:bigint, comment:null), FieldSchema(name:hudi_cow.name, type:string, comment:null), FieldSchema(name:hudi_cow.versionid, type:string, comment:null), FieldSchema(name:hudi_cow.tobedeletedstr, type:string, comment:null), FieldSchema(name:hudi_cow.inttolong, type:int, comment:null), FieldSchema(name:hudi_cow.longtoint, type:bigint, comment:null), FieldSchema(name:hudi_cow.partitionid, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37); Time taken: 0.207 seconds
INFO  : Executing command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37): SELECT * FROM my_db.hudi_cow
INFO  : Completed executing command(queryId=hive_20250104100159_e3235cbc-3bde-44bc-9485-9154675afa37); Time taken: 0.007 seconds
INFO  : OK
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
| hudi_cow._hoodie_commit_time  | hudi_cow._hoodie_commit_seqno  | hudi_cow._hoodie_record_key  | hudi_cow._hoodie_partition_path  | hudi_cow._hoodie_file_name  | hudi_cow.rowid  | hudi_cow.precomb  | hudi_cow.name  | hudi_cow.versionid  | hudi_cow.tobedeletedstr  | hudi_cow.inttolong  | hudi_cow.longtoint  | hudi_cow.partitionid  |
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
+-------------------------------+--------------------------------+------------------------------+----------------------------------+-----------------------------+-----------------+-------------------+----------------+---------------------+--------------------------+---------------------+---------------------+-----------------------+
No rows selected (0.346 seconds)
0: jdbc:hive2://cdp73-1.test.com:2181,cdp73-2>
```

**No rows come back!! Why is that?**
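A reasonable first diagnostic is to rule out the write path: if Spark can still read the rows through the Hudi datasource, the data files and timeline are intact, and the problem is isolated to Hive's read side (the `hudi-hadoop-mr-bundle` / `HoodieParquetInputFormat` integration). A minimal sketch, assuming the spark-shell session from step 1 is still available:

```scala
// If this prints 3 rows, the table itself is healthy, and the empty Hive
// result points at how Hive reads the table, not at how Spark wrote it.
val checkDF = spark.read.format("hudi").load("/user/hive/warehouse/hudi_cow")
checkDF.select("_hoodie_commit_time", "rowId", "partitionId", "name").show(false)
```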