Apache Zeppelin 整合 Spark 和 Hudi

一 环境信息

1.1 组件版本

组件 版本
Spark 3.2.3
Hudi 0.14.0
Zeppelin 0.11.0-SNAPSHOT

1.2 环境准备

  1. Zeppelin 整合 Spark 参考:Apache Zeppelin 一文打尽
  2. Hudi0.14.0编译参考:Hudi0.14.0 最新编译

二 整合 Spark 和 Hudi

2.1 配置

shell 复制代码
%spark.conf

SPARK_HOME /usr/lib/spark

# set execution mode
spark.master yarn
spark.submit.deployMode client

# --jars
spark.jars /root/app/jars/hudi-spark3.2-bundle_2.12-0.14.0.jar

# --conf
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.kryo.registrator org.apache.spark.HoodieSparkKryoRegistrar

Specifying yarn-client & yarn-cluster in spark.master is not supported in Spark 3.x any more, instead you need to use spark.master and spark.submit.deployMode together.

Mode spark.master spark.submit.deployMode
Yarn Client yarn client
Yarn Cluster yarn cluster

2.2 导入依赖

scala 复制代码
%spark
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.table.HoodieTableConfig._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.common.model.HoodieRecord
import spark.implicits._

2.3 插入数据

scala 复制代码
%spark
val tableName = "trips_table"
val basePath = "hdfs:///tmp/trips_table"
val columns = Seq("ts","uuid","rider","driver","fare","city")
val data =
  Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
    (1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
    (1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
    (1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo"    ),
    (1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));

var inserts = spark.createDataFrame(data).toDF(columns:_*)
inserts.write.format("hudi").
  option(PARTITIONPATH_FIELD_NAME.key(), "city").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

2.3 查询数据

scala 复制代码
%spark
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.createOrReplaceTempView("trips_table")
spark.sql("SELECT uuid, fare, ts, rider, driver, city FROM  trips_table WHERE fare > 20.0").show()

结果:

shell 复制代码
+--------------------+-----+-------------+-------+--------+-------------+
|                uuid| fare|           ts|  rider|  driver|         city|
+--------------------+-----+-------------+-------+--------+-------------+
|e96c4396-3fad-413...| 27.7|1695091554788|rider-C|driver-M|san_francisco|
|9909a8b1-2d15-4d3...| 33.9|1695046462179|rider-D|driver-L|san_francisco|
|e3cf430c-889d-401...|34.15|1695516137016|rider-F|driver-P|    sao_paulo|
+--------------------+-----+-------------+-------+--------+-------------+

相关推荐
阿 才1 小时前
跟文件系统(busybox)的构建
大数据·hadoop·分布式
宁波鹿语心理1 小时前
过度卷入的三角化:养育者情感投射对青少年自我边界形成的结构性影响及干预路径
大数据
逐米时代1 小时前
制造型企业AI智能体实施步骤详解:提升协同效率的实战指南
大数据·人工智能
大嘴皮猴儿2 小时前
跨境电商运营笔记:我是如何用工具解决多语言素材问题的
大数据·人工智能·新媒体运营·自动翻译·教育电商
赤龙ERP2 小时前
赤龙一周观察 · 6月第2周
大数据·人工智能·ai·erp
JGDT_2 小时前
ERP重塑与未来趋势:SAP的实践及大一统格局(上)
大数据·人工智能·安全·架构·开源
ACP广源盛139246256733 小时前
IX7008 PCIe 交换芯片@ACP#RTX Spark 经济型 8 口扩展芯片(对比 ASM1806)
大数据·人工智能·分布式·嵌入式硬件·gpt·spark·电脑
lauo3 小时前
碳基心脏最后的堡垒——ibbot青春版:你的随身Token生产厂
大数据·人工智能·chatgpt·智能手机·ai-native
TDengine (老段)3 小时前
TDengine 扫描算子 — TableScan、TagScan 与下推优化
大数据·数据库·物联网·时序数据库·tdengine·涛思数据
ACP广源盛139246256733 小时前
IX6012 PCIe 交换芯片@ACP#RTX Spark 入门级 12 口存储外设扩展方案(对比 ASM1812)
大数据·人工智能·分布式·嵌入式硬件·gpt·spark·电脑