Hive dynamic partition exception: org.apache.hadoop.hive.ql.metadata.HiveException

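I was recently working on a requirement that was not complicated in itself: it only needed a new field. Through my own carelessness, though, I triggered a Hive dynamic partition exception, so I am recording the process here.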

Error: Number of dynamic partitions created is 72686, which is more than 1000.

org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 72686, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 72686.
09-04-2024 16:41:38 CST INFO -   at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1573)
09-04-2024 16:41:38 CST INFO -   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
09-04-2024 16:41:38 CST INFO -   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
09-04-2024 16:41:38 CST INFO -   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
09-04-2024 16:41:38 CST INFO -   at java.lang.reflect.Method.invoke(Method.java:498)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:874)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveClientImpl.scala:740)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:738)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:738)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.client.HiveClientImpl.loadDynamicPartitions(HiveClientImpl.scala:738)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveExternalCatalog.scala:892)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:880)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:880)
09-04-2024 16:41:38 CST INFO -   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
09-04-2024 16:41:38 CST INFO -   ... 69 more

As soon as I saw this error I searched Baidu, and many posts recommended adding these parameters:

hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.exec.max.dynamic.partitions=100000
hive.exec.max.dynamic.partitions.pernode=100000
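
In a Spark job, settings like these are often passed when the session is built. The following is only a minimal sketch: the object name and application name are hypothetical, and whether the two limit settings actually take effect when supplied this way depends on the Spark and Hive versions in use, so treat that as an assumption to verify.

```scala
import org.apache.spark.sql.SparkSession

object DynamicPartitionSession {
  // Builds a Hive-enabled session carrying the four settings listed above.
  def build(): SparkSession =
    SparkSession.builder()
      .appName("dynamic-partition-demo") // hypothetical application name
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("hive.exec.max.dynamic.partitions", "100000")
      .config("hive.exec.max.dynamic.partitions.pernode", "100000")
      .enableHiveSupport()
      .getOrCreate()
}
```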

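I tried them and they did not help. These parameters enable dynamic partitioning, switch it to non-strict mode, and raise the limit on the number of partitions, but the real cause of my error lay elsewhere: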

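My code first runs spark.sql and then iterates over the result with rdd.map. I added a field a to the SQL and put it in the second position of the select list, but in the case Row(...) pattern inside rdd.map I put it in the first position, so every value after it was misaligned. There should have been only one partition, but after the misalignment the partition columns picked up the values of other fields, and the partition count shot up to 72686.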

import org.apache.spark.sql.Row

// The select list puts name first and the new field a second.
val sql = s"""select name,a,age,school,d,h,m5
FROM video.cdncolv1
WHERE d = '$dd'
        AND h = '$hh'
        AND m5='$m5'
        """.stripMargin

val rdd = spark.sql(sql).rdd.map {
  // Mistake: this pattern binds a first and name second, which does not match the select list above.
  case Row(a: String, name: String, age: Int, school: String, d: String, h: String, m5: String) => {
    .......

    Row(a, name, age, school, d, h, m5)
  }
}

spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

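The table is partitioned by d, h, and m5. The misalignment put other fields' values into the partition columns, which inflated the number of partition values and triggered this error. The new field a has to sit at the same position in the SQL select list as it does in case Row(...); I was simply careless. With the positions aligned, setting hive.exec.dynamic.partition.mode and spark.sql.sources.partitionColumnTypeInference.enabled is enough and the error does not occur.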
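
One way to make this kind of mistake harder to repeat is to read columns by name instead of by position, and to count the partitions before writing. The sketch below is not the original job: the object name, the local session, and the sample rows are all made up for illustration, and it assumes the incoming rows carry a schema (as rows from spark.sql do) so that Row.getAs can look fields up by column name.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object ReadColumnsByName {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the real job's session setup is not shown in the post.
    val spark = SparkSession.builder()
      .appName("read-columns-by-name")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-in for the query result, in the same column order as the select list.
    val df = Seq(
      ("alice", "x", 20, "s1", "20240409", "16", "40"),
      ("bob",   "y", 21, "s2", "20240409", "16", "40")
    ).toDF("name", "a", "age", "school", "d", "h", "m5")

    val rows = df.rdd.map { row =>
      // Look each field up by column name, so the select-list order no longer matters.
      val name   = row.getAs[String]("name")
      val a      = row.getAs[String]("a")
      val age    = row.getAs[Int]("age")
      val school = row.getAs[String]("school")
      val d      = row.getAs[String]("d")
      val h      = row.getAs[String]("h")
      val m5     = row.getAs[String]("m5")
      Row(name, a, age, school, d, h, m5)
    }

    // Sanity check before any write: count the distinct (d, h, m5) combinations,
    // which sit at positions 4, 5, and 6 of the rows built above.
    val partitions = rows
      .map(r => (r.getString(4), r.getString(5), r.getString(6)))
      .distinct()
      .count()
    require(partitions == 1, s"expected 1 partition, got $partitions")

    spark.stop()
  }
}
```

The require call fails fast when the partition columns hold unexpected values, which surfaces a misalignment well before Hive's partition limit does.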
