CDP Integration with Hudi in Practice - spark-shell

[〇] About This Article

This article walks through examples of working with Hudi tables from spark-shell.

| Software       | Version              |
|----------------|----------------------|
| Hudi           | 1.0.0                |
| Hadoop Version | 3.1.1.7.3.1.0-197    |
| Hive Version   | 3.1.3000.7.3.1.0-197 |
| Spark Version  | 3.4.1.7.3.1.0-197    |
| CDP            | 7.3.1                |

[一] Using Spark-shell

1-Deploy the Hudi jar

bash
[root@cdp73-1 ~]# for i in $(seq 1 6); do scp /opt/software/hudi-1.0.0/packaging/hudi-spark-bundle/target/hudi-spark3.4-bundle_2.12-1.0.0.jar   cdp73-$i:/opt/cloudera/parcels/CDH/lib/spark3/jars/; done
hudi-spark3.4-bundle_2.12-1.0.0.jar                                                                                                                                                                                          100%  105MB 418.2MB/s   00:00
hudi-spark3.4-bundle_2.12-1.0.0.jar                                                                                                                                                                                          100%  105MB 304.8MB/s   00:00
hudi-spark3.4-bundle_2.12-1.0.0.jar                                                                                                                                                                                          100%  105MB 365.0MB/s   00:00
hudi-spark3.4-bundle_2.12-1.0.0.jar                                                                                                                                                                                          100%  105MB 406.1MB/s   00:00
hudi-spark3.4-bundle_2.12-1.0.0.jar                                                                                                                                                                                          100%  105MB 472.7MB/s   00:00
hudi-spark3.4-bundle_2.12-1.0.0.jar                                                                                                                                                                                          100%  105MB 447.1MB/s   00:00
[root@cdp73-1 ~]#

2-Launch Spark-shell

bash
spark-shell --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:1.0.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
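
Once the shell starts, a quick sanity check (an illustrative sketch, not part of the original walkthrough) confirms that the Hudi bundle is actually on the classpath:

scala
// Sanity check inside spark-shell: these should resolve without a ClassNotFoundException
// if hudi-spark3.4-bundle_2.12-1.0.0.jar was picked up from the Spark jars directory.
spark.version                                            // expect 3.4.x
Class.forName("org.apache.hudi.DataSourceWriteOptions")
Class.forName("org.apache.spark.HoodieSparkKryoRegistrar")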

3-Initialize the session

scala
// spark-shell
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.table.HoodieTableConfig._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.common.model.HoodieRecord
// col() is used in the update and merge steps below
import org.apache.spark.sql.functions.col
import spark.implicits._

val tableName = "trips_table"
val basePath = "hdfs:///tmp/trips_table"

4-Create the table

There is no separate create-table step in spark-shell: the first commit automatically initializes the table if it does not already exist at the specified base path.
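
As an illustrative check (assuming the default HDFS filesystem and the basePath defined in step 3), the .hoodie metadata directory should appear under the base path once the first write in step 5 completes:

scala
// Run after the first write in step 5. The .hoodie directory holds the table's
// metadata and timeline; its presence shows the table has been initialized.
import org.apache.hadoop.fs.Path
val fs = new Path(basePath).getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.exists(new Path(basePath, ".hoodie"))   // expected: true after the first commit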

5-Insert data

scala
// spark-shell
val columns = Seq("ts","uuid","rider","driver","fare","city")
val data =
  Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
    (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
    (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
    (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"    ),
    (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));

var inserts = spark.createDataFrame(data).toDF(columns:_*)
inserts.write.format("hudi").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.embed.timeline.server", "false").
  option("hoodie.table.name", tableName).
  mode(Overwrite).
  save(basePath)

**[Mapping to Hudi write operations]** Hudi provides multiple write operations, both batch and incremental, with different semantics and performance characteristics for writing data into Hudi tables. When no record key is configured (see the record-key option used in step 10), bulk_insert is chosen as the write operation, which matches the non-default behavior of Spark's Parquet datasource.
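
As a hedged sketch (the option keys below are standard Hudi datasource options; choosing uuid as the record key is an assumption for this dataset), the same insert could set the record key and write operation explicitly instead of relying on the bulk_insert fallback:

scala
// Same data as above, but with an explicit record key and write operation.
// With a record key configured, later upserts and deletes can target individual records.
inserts.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.datasource.write.operation", "insert").
  option("hoodie.table.name", tableName).
  mode(Overwrite).
  save(basePath)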

6-Query data

scala
// spark-shell
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")

spark.sql("SELECT uuid, fare, ts, rider, driver, city FROM  trips_table WHERE fare > 20.0").show()
spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare FROM  trips_table").show()

7-Update data

scala
// Let's read data from the target Hudi table, modify the fare column for rider-D, and write the update back.
val updatesDf = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-D").withColumn("fare", col("fare") * 10)

updatesDf.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.embed.timeline.server", "false").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.table.name", tableName).
  mode(Append).
  save(basePath)
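
To confirm the update landed, an illustrative read-back of rider-D's fare should show the original value multiplied by 10 (33.90 becomes 339.0):

scala
// rider-D's fare was 33.90 before the update, so 339.0 is expected here.
spark.read.format("hudi").load(basePath).
  filter($"rider" === "rider-D").
  select("uuid", "rider", "fare").show()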

8-Merge data

scala
// spark-shell
val adjustedFareDF = spark.read.format("hudi").
  load(basePath).limit(2).
  withColumn("fare", col("fare") * 10)
adjustedFareDF.write.format("hudi").
option("hoodie.embed.timeline.server", "false").
mode(Append).
save(basePath)
// Notice Fare column has been updated but all other columns remain intact.
spark.read.format("hudi").load(basePath).show()
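
Note that limit(2) without an ordering picks an arbitrary pair of rows, so exactly which fares get adjusted here is not deterministic; the point of the example is that the re-written rows are merged back into the table while their other columns stay intact.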

9-Delete data

scala
// spark-shell
// Let's delete rider: rider-F
val deletesDF = spark.read.format("hudi").load(basePath).filter($"rider" === "rider-F")

deletesDF.write.format("hudi").
  option("hoodie.datasource.write.operation", "delete").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.table.name", tableName).
  option("hoodie.embed.timeline.server", "false").
  mode(Append).
  save(basePath)
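
An illustrative read-back confirms the delete; rider-F should no longer appear in the table:

scala
// Expected: 0 rows remain for rider-F after the delete commit.
spark.read.format("hudi").load(basePath).filter($"rider" === "rider-F").count()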


10-Data indexing

scala
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.table.HoodieTableConfig._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.common.model.HoodieRecord
import spark.implicits._

val tableName = "trips_table_index"
val basePath = "hdfs:///tmp/hudi_indexed_table"

val columns = Seq("ts","uuid","rider","driver","fare","city")
val data =
  Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
    (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
    (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
    (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"    ),
    (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));

var inserts = spark.createDataFrame(data).toDF(columns:_*)
inserts.write.format("hudi").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.table.name", tableName).
  option("hoodie.write.record.merge.mode", "COMMIT_TIME_ORDERING").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option("hoodie.embed.timeline.server", "false").
  mode(Overwrite).
  save(basePath)

// Create record index and secondary index for the table
spark.sql(s"CREATE TABLE hudi_indexed_table USING hudi LOCATION '$basePath'")
// Create bloom filter expression index on driver column
spark.sql(s"CREATE INDEX idx_bloom_driver ON hudi_indexed_table USING bloom_filters(driver) OPTIONS(expr='identity')");
// It would show bloom filter expression index 
spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");
// Query on driver column would prune the data using the idx_bloom_driver index
spark.sql(s"SELECT uuid, rider FROM hudi_indexed_table WHERE driver = 'driver-S'");

// Create column stat expression index on ts column
spark.sql(s"CREATE INDEX idx_column_ts ON hudi_indexed_table USING column_stats(ts) OPTIONS(expr='from_unixtime', format = 'yyyy-MM-dd')");
// Shows both expression indexes 
spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");
// Query on ts column would prune the data using the idx_column_ts index
spark.sql(s"SELECT * FROM hudi_indexed_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2023-09-24'");

// To create secondary index, first create the record index
spark.sql(s"SET hoodie.metadata.record.index.enable=true");
spark.sql(s"CREATE INDEX record_index ON hudi_indexed_table (uuid)");
// Create secondary index on rider column
spark.sql(s"CREATE INDEX idx_rider ON hudi_indexed_table (rider)");

// Expression index and secondary index should show up
spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");
// Query on rider column would leverage the secondary index idx_rider
spark.sql(s"SELECT * FROM hudi_indexed_table WHERE rider = 'rider-E'");

// Update a record and query the table based on indexed columns
spark.sql(s"UPDATE hudi_indexed_table SET rider = 'rider-B', driver = 'driver-N', ts = '1697516137' WHERE rider = 'rider-A'");
// Data skipping would be performed using column stat expression index
spark.sql(s"SELECT uuid, rider FROM hudi_indexed_table WHERE from_unixtime(ts, 'yyyy-MM-dd') = '2023-10-17'");
// Data skipping would be performed using bloom filter expression index
spark.sql(s"SELECT * FROM hudi_indexed_table WHERE driver = 'driver-N'");
// Data skipping would be performed using secondary index
spark.sql(s"SELECT * FROM hudi_indexed_table WHERE rider = 'rider-B'");

// Drop all the indexes
spark.sql(s"DROP INDEX secondary_index_idx_rider on hudi_indexed_table");
spark.sql(s"DROP INDEX record_index on hudi_indexed_table");
spark.sql(s"DROP INDEX expr_index_idx_bloom_driver on hudi_indexed_table");
spark.sql(s"DROP INDEX expr_index_idx_column_ts on hudi_indexed_table");
// No indexes should show up for the table
spark.sql(s"SHOW INDEXES FROM hudi_indexed_table");

spark.sql(s"SET hoodie.metadata.record.index.enable=false");