38、spark读取hudi报错:java.io.NotSerializableException: org.apache.hadoop.fs.Path

场景:spark.table()的方式读取hudi映射的hive表。

开源组件版本:

spark 2.4.5_2.11

hudi 0.10.0

hive 3.1.0

hadoop 2.8.5

报错代码:

scala 复制代码
spark.table("dwd.dwd_card_menu_a_1d")
.show(false)

报错日志:

java 复制代码
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
	- object not serializable (class: org.apache.hadoop.fs.Path, value: obs://xxxxx/user/hive/warehouse/dwd.db/dwd_card_menu_a_1d)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 1)
	- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
	- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(obs://xxxxxx/user/hive/warehouse/dwd.db/dwd_card_menu_a_1d))
	- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
	- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@691)
	- field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
	- object (class org.apache.spark.scheduler.ResultTask, ResultTask(0, 0))
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
	at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
	at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
	at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:100)
	at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:81)
	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:291)

org.apache.hadoop.fs.Path 未实现java.io.serializabe。 因为 Hudi-0.10.0 在 Spark 2.4.5 下会把 Path 对象放进 ParallelCollectionRDD 的闭包,而 Path 不可序列化。

解决办法:

很简单,在sparkConf中设置spark的序列化为:KryoSerializer

scala 复制代码
new SparkConf()
    .set("spark.serializer", classOf[KryoSerializer].getName)
相关推荐
派大鑫wink2 分钟前
【Day12】String 类详解:不可变性、常用方法与字符串拼接优化
java·开发语言
JIngJaneIL5 分钟前
基于springboot + vue健康管理系统(源码+数据库+文档)
java·开发语言·数据库·vue.js·spring boot·后端
秋饼7 分钟前
【三大锁王争霸赛:Java锁、数据库锁、分布式锁谁是卷王?】
java·数据库·分布式
电商API&Tina9 分钟前
【电商API接口】关于电商数据采集相关行业
java·python·oracle·django·sqlite·json·php
刘一说13 分钟前
Spring Boot中IoC(控制反转)深度解析:从实现机制到项目实战
java·spring boot·后端
悟空码字13 分钟前
SpringBoot参数配置:一场“我说了算”的奇幻之旅
java·spring boot·后端
我居然是兔子16 分钟前
Java虚拟机(JVM)内存模型与垃圾回收全解析
java·开发语言·jvm
关于不上作者榜就原神启动那件事18 分钟前
Spring Data Redis 中的 opsFor 方法详解
java·redis·spring
其美杰布-富贵-李19 分钟前
Java (Spring Boot) 反射完整学习笔记
java·spring boot·学习
小许好楠30 分钟前
java开发工程师-学习方式
java·开发语言·学习