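I was recently working on a requirement that was not complicated in itself: it only needed a new field. Through my own carelessness, though, I triggered a Hive dynamic partition error, so I am recording the process here.
The error: Number of dynamic partitions created is 72686, which is more than 1000. To solve this try to set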
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 72686, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 72686.
09-04-2024 16:41:38 CST INFO - at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1573)
09-04-2024 16:41:38 CST INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
09-04-2024 16:41:38 CST INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
09-04-2024 16:41:38 CST INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
09-04-2024 16:41:38 CST INFO - at java.lang.reflect.Method.invoke(Method.java:498)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:874)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveClientImpl.scala:740)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:738)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadDynamicPartitions$1.apply(HiveClientImpl.scala:738)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.client.HiveClientImpl.loadDynamicPartitions(HiveClientImpl.scala:738)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(HiveExternalCatalog.scala:892)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:880)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadDynamicPartitions$1.apply(HiveExternalCatalog.scala:880)
09-04-2024 16:41:38 CST INFO - at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
09-04-2024 16:41:38 CST INFO - ... 69 more
As soon as I saw this error I searched Baidu, and many posts suggested adding these parameters:
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.exec.max.dynamic.partitions=100000
hive.exec.max.dynamic.partitions.pernode=100000
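I tried them, but it still did not work. These parameters enable dynamic partitioning and raise the limits on how many partitions may be created. The actual cause turned out to be the following:
In my code I first run spark.sql and then iterate over the result with rdd.map. I added a field a to the SQL and put it in the second position of the select list, but in the case Row() pattern inside rdd.map I put it in the first position. As a result, all the following values were shifted out of position. There should have been only one partition, but after the shift the partition columns took the values of other fields, and the number of partitions exploded to 72686.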
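// The SELECT list puts the new column a in the second position: name, a, ...
val sql = s"""select name,a,age,school,d,h,m5
FROM video.cdncolv1
WHERE d = '$dd'
AND h = '$hh'
AND m5='$m5'
""".stripMargin
val rdd = spark.sql(sql).rdd.map {
  // Mistake: a Row(...) pattern binds by position, but a is listed first here,
  // so the bound names no longer match the SELECT order above.
  case Row(a:String,name:String,age:Int,school:String, d:String, h:String, m5:String) => {
    .......
    Row(a,name,age,school,d,h,m5)
  }
}
// Allow every partition column to be resolved dynamically.
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
// Keep partition column values as strings instead of inferring their types.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")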
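The partitions written to the table were supposed to be d, h, m5. Because the values were shifted, the number of distinct partition values grew, and that is what raised this error. The position of the a field added to the SQL has to match the position of a in case Row(); I was simply careless. Had the positions matched, setting hive.exec.dynamic.partition.mode and spark.sql.sources.partitionColumnTypeInference.enabled would have been enough and no error would have been reported.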
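One way to make this kind of slip harder to repeat is to read each field by column name instead of by position, and to count the distinct partition values before writing. The following is only a minimal sketch under assumptions: the object PartitionSafety and the helpers rebuildRow and distinctPartitions are hypothetical names that are not part of the original job, the column types are taken from the pattern match above, and the output order simply mirrors the Row(a,name,age,school,d,h,m5) of the original code.

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helpers, not part of the original job.
object PartitionSafety {

  // Read every field by column name, so reordering the SELECT list
  // cannot shift values into the wrong variables.
  // Assumes the row carries a schema (rows from spark.sql(...) do) and
  // that the column types match those in the original pattern match.
  def rebuildRow(row: Row): Row = {
    val name   = row.getAs[String]("name")
    val a      = row.getAs[String]("a")
    val age    = row.getAs[Int]("age")
    val school = row.getAs[String]("school")
    val d      = row.getAs[String]("d")
    val h      = row.getAs[String]("h")
    val m5     = row.getAs[String]("m5")
    // Output order mirrors the Row built in the original code (an assumption
    // about the target table layout).
    Row(a, name, age, school, d, h, m5)
  }

  // Count the distinct combinations of the partition columns.
  // For a job that filters on a single d/h/m5 this should be 1.
  def distinctPartitions(df: DataFrame, partitionCols: Seq[String]): Long =
    df.select(partitionCols.map(c => df.col(c)): _*).distinct().count()
}
```

In the job above this could be used as spark.sql(sql).rdd.map(PartitionSafety.rebuildRow), with a check such as require(PartitionSafety.distinctPartitions(df, Seq("d", "h", "m5")) == 1) before the write, so that a misaligned column fails fast in the job itself rather than surfacing as a partition-limit error from Hive.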