[Pitfall] Deduplication behavior of Spark SQL's union/unionAll
- Test data
```scala
// Test Datasets (run in spark-shell; in a standalone app you would also need
// `import spark.implicits._` for the Employee encoder).
case class Employee(first_name: String)

val employeeDF1 = spark.createDataset(Seq(
  Employee("Mary"),
  Employee("Mandy"),
  Employee("Kurt")
))

val employeeDF2 = spark.createDataset(Seq(
  Employee("Mary"),
  Employee("Julie"),
  Employee("Mandy"),
  Employee("Julie"),
  Employee("Kurt")
))
```
- Neither `union` nor `unionAll` deduplicates when called on DataFrames/Datasets
```scala
employeeDF1.union(employeeDF2).show()
```
![](https://i-blog.csdnimg.cn/direct/46792d9507c44dbf90a7e242c60da3ff.png)
```scala
// unionAll behaves exactly like union here: it performs no deduplication either.
employeeDF1.unionAll(employeeDF2).show()
```
![](https://i-blog.csdnimg.cn/direct/9542398e60744886be89d8cc46a3a5d0.png)
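If deduplicated output is needed while staying in the DataFrame/Dataset API, one option is to chain `distinct()` (or `dropDuplicates()`) after `union`. A minimal sketch, assuming the `employeeDF1`/`employeeDF2` Datasets defined above:
```scala
// Deduplicate explicitly after union to get SQL-UNION-like behaviour.
employeeDF1.union(employeeDF2).distinct().show()

// dropDuplicates() is equivalent here; it also accepts a column subset,
// e.g. dropDuplicates("first_name"), if only some columns should decide uniqueness.
employeeDF1.union(employeeDF2).dropDuplicates().show()
```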
- When the query is run through `spark.sql`, the SQL `UNION` operator does deduplicate
```scala
// Register the Datasets as temp views so they can be queried with SQL.
employeeDF1.createOrReplaceTempView("ds1")
employeeDF2.createOrReplaceTempView("ds2")
```
```scala
spark.sql("select * from ds1 union select * from ds2").show()
```
![](https://i-blog.csdnimg.cn/direct/43f735dd9867483e8c9b9df4214b316e.png)
```scala
spark.sql("select * from ds1 union all select * from ds2").show()
```
![](https://i-blog.csdnimg.cn/direct/e377f94317144d59b19300cb5732f05f.png)
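One way to see where the difference comes from is to inspect the query plans. This is only an inspection sketch (using the same temp views as above): the plan for SQL `UNION` should contain a Distinct/Aggregate step on top of the Union node, while `UNION ALL` should not.
```scala
// Print the parsed/analyzed/optimized/physical plans for both queries.
// The UNION query is rewritten as Distinct over Union, which is where the
// deduplication happens; UNION ALL stays as a plain Union.
spark.sql("select * from ds1 union select * from ds2").explain(true)
spark.sql("select * from ds1 union all select * from ds2").explain(true)
```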
- Common misconceptions
- At the level of standard SQL (e.g. in a Hive environment): `UNION` deduplicates, while `UNION ALL` simply concatenates the inputs and performs better.
- Spark's DataFrame `union` merges columns purely by position, so mismatched schemas can be silently combined into the wrong columns; `unionByName` is the safer alternative (see the sketch after this list).
- Latest official set-operator documentation: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html#set-operators
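As a hypothetical illustration of the position-vs-name pitfall mentioned above (the case classes and values below are made up for this sketch and are not part of the original test data):
```scala
// Two case classes with the same fields in a different order.
case class Person(first_name: String, last_name: String)
case class Flipped(last_name: String, first_name: String)

val people  = spark.createDataset(Seq(Person("Mary", "Smith"))).toDF()
val flipped = spark.createDataset(Seq(Flipped("Jones", "Kurt"))).toDF()

// union pairs columns purely by position:
// "Jones" silently ends up under first_name and "Kurt" under last_name.
people.union(flipped).show()

// unionByName pairs columns by name, so the row stays aligned as ("Kurt", "Jones").
people.unionByName(flipped).show()
```
In Spark 3.1+, `unionByName` also accepts an `allowMissingColumns` flag for schemas that do not fully overlap.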