Apache Zeppelin 整合 Spark 和 Hudi

一 环境信息

1.1 组件版本

组件 版本
Spark 3.2.3
Hudi 0.14.0
Zeppelin 0.11.0-SNAPSHOT

1.2 环境准备

  1. Zeppelin 整合 Spark 参考:Apache Zeppelin 一文打尽
  2. Hudi0.14.0编译参考:Hudi0.14.0 最新编译

二 整合 Spark 和 Hudi

2.1 配置

shell 复制代码
%spark.conf

SPARK_HOME /usr/lib/spark

# set execution mode
spark.master yarn
spark.submit.deployMode client

# --jars
spark.jars /root/app/jars/hudi-spark3.2-bundle_2.12-0.14.0.jar

# --conf
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.kryo.registrator org.apache.spark.HoodieSparkKryoRegistrar

Specifying yarn-client & yarn-cluster in spark.master is not supported in Spark 3.x any more, instead you need to use spark.master and spark.submit.deployMode together.

Mode spark.master spark.submit.deployMode
Yarn Client yarn client
Yarn Cluster yarn cluster

2.2 导入依赖

scala 复制代码
%spark
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.table.HoodieTableConfig._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.common.model.HoodieRecord
import spark.implicits._

2.3 插入数据

scala 复制代码
%spark
val tableName = "trips_table"
val basePath = "hdfs:///tmp/trips_table"
val columns = Seq("ts","uuid","rider","driver","fare","city")
val data =
  Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
    (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
    (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
    (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"    ),
    (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));

var inserts = spark.createDataFrame(data).toDF(columns:_*)
inserts.write.format("hudi").
  option(PARTITIONPATH_FIELD_NAME.key(), "city").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

2.3 查询数据

scala 复制代码
%spark
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")
spark.sql("SELECT uuid, fare, ts, rider, driver, city FROM  trips_table WHERE fare > 20.0").show()

结果:

shell 复制代码
+--------------------+-----+-------------+-------+--------+-------------+
|                uuid| fare|           ts|  rider|  driver|         city|
+--------------------+-----+-------------+-------+--------+-------------+
|e96c4396-3fad-413...| 27.7|1695091554788|rider-C|driver-M|san_francisco|
|9909a8b1-2d15-4d3...| 33.9|1695046462179|rider-D|driver-L|san_francisco|
|e3cf430c-889d-401...|34.15|1695516137016|rider-F|driver-P|    sao_paulo|
+--------------------+-----+-------------+-------+--------+-------------+

相关推荐
昨天今天明天好多天29 分钟前
【数据仓库】
大数据
油头少年_w44 分钟前
大数据导论及分布式存储HadoopHDFS入门
大数据·hadoop·hdfs
Elastic 中国社区官方博客2 小时前
释放专利力量:Patently 如何利用向量搜索和 NLP 简化协作
大数据·数据库·人工智能·elasticsearch·搜索引擎·自然语言处理
力姆泰克2 小时前
看电动缸是如何提高农机的自动化水平
大数据·运维·服务器·数据库·人工智能·自动化·1024程序员节
力姆泰克2 小时前
力姆泰克电动缸助力农业机械装备,提高农机的自动化水平
大数据·服务器·数据库·人工智能·1024程序员节
QYR市场调研2 小时前
自动化研磨领域的革新者:半自动与自动自磨机的技术突破
大数据·人工智能
半部论语3 小时前
第三章:TDengine 常用操作和高级功能
大数据·时序数据库·tdengine
EasyGBS4 小时前
国标GB28181公网直播EasyGBS国标GB28181软件管理解决方案
大数据·网络·音视频·媒体·视频监控·gb28181
2403_875736874 小时前
道品科技的水肥一体化智能灌溉:开启现代农业的创新征程
大数据·人工智能·1024程序员节
河南查新信息技术研究院4 小时前
科技查新在医药健康领域的应用
大数据·科技·全文检索