【大数据学习 | Spark-Core】Spark中的join原理

join是两个结果集之间的链接,需要进行数据的匹配。

演示一下join是否存在shuffle。

1. 如果两个rdd没有分区器,分区个数一致

,会发生shuffle。但分区数量不变。

Scala 复制代码
scala> val arr = Array(("zhangsan",300),("lisi",400),("wangwu",350),("zhaosi",450))
arr: Array[(String, Int)] = Array((zhangsan,300), (lisi,400), (wangwu,350), (zhaosi,450))

scala> val arr1 = Array(("zhangsan",22),("lisi",24),("wangwu",30),("guangkun",5))
arr1: Array[(String, Int)] = Array((zhangsan,22), (lisi,24), (wangwu,30), (guangkun,5))

scala> sc.makeRDD(arr,3)
res116: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[108] at makeRDD at <console>:27

scala> sc.makeRDD(arr1,3)
res117: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[109] at makeRDD at <console>:27

scala> res116 join res117
res118: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[112] at join at <console>:28

scala> res118.collect
res119: Array[(String, (Int, Int))] = Array((zhangsan,(300,22)), (wangwu,(350,30)), (lisi,(400,24)))

2. 如果分区个数不一致,有shuffle,且产生的rdd的分区个数以多的为主。

3. 如果分区个数一样并且分区器一样,那么是没有shuffle的

Scala 复制代码
scala> sc.makeRDD(arr,3)
res128: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[118] at makeRDD at <console>:27

scala> sc.makeRDD(arr1,3)
res129: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[119] at makeRDD at <console>:27

scala> res128.reduceByKey(_+_)
res130: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[120] at reduceByKey at <console>:26

scala> res129.reduceByKey(_+_)
res131: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[121] at reduceByKey at <console>:26

scala> res130 join res131
res132: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[124] at join at <console>:28

scala> res132.collect
res133: Array[(String, (Int, Int))] = Array((zhangsan,(300,22)), (wangwu,(350,30)), (lisi,(400,24)))

scala> res132.partitions.size
res134: Int = 3

4. 都存在分区器但是分区个数不同,也会存在shuffle

Scala 复制代码
scala> val arr = Array(("zhangsan",300),("lisi",400),("wangwu",350),("zhaosi",450))
arr: Array[(String, Int)] = Array((zhangsan,300), (lisi,400), (wangwu,350), (zhaosi,450))

scala>  val arr1 = Array(("zhangsan",22),("lisi",24),("wangwu",30),("guangkun",5))
arr1: Array[(String, Int)] = Array((zhangsan,22), (lisi,24), (wangwu,30), (guangkun,5))

scala> sc.makeRDD(arr,3)
res0: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at makeRDD at <console>:27

scala> sc.makeRDD(arr1,4)
res1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at makeRDD at <console>:27

scala> res0.reduceByKey(_+_)
res2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:26

scala> res1.reduceByKey(_+_)
res3: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:26

scala> res2 join res3
res4: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[6] at join at <console>:28

scala> res4.collect
res5: Array[(String, (Int, Int))] = Array((zhangsan,(300,22)), (wangwu,(350,30)), (lisi,(400,24)))

scala> res4.partitions.size
res6: Int = 4

这里为啥stage3里reduceByKey和join过程是连在一起的,因为分区多的RDD是不需要进行shuffle的,数据该在哪个分区就在哪个分区,反而是分区少的RDD要进行join,要进行数据的打散。

分区以多的为主。

5. 一个带有分区器一个没有分区器,那么以带有分区器的rdd分区数量为主,并且存在shuffle

Scala 复制代码
scala> arr
res7: Array[(String, Int)] = Array((zhangsan,300), (lisi,400), (wangwu,350), (zhaosi,450))

scala> arr1
res8: Array[(String, Int)] = Array((zhangsan,22), (lisi,24), (wangwu,30), (guangkun,5))

scala> sc.makeRDD(arr,3)
res9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[7] at makeRDD at <console>:27

scala> sc.makeRDD(arr,4)
res10: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[8] at makeRDD at <console>:27

scala> res9.reduceByKey(_+_)
res11: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:26

scala> res10 join res11
res12: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[12] at join at <console>:28

scala> res12.partitions.size
res13: Int = 3

scala> res12.collect
res14: Array[(String, (Int, Int))] = Array((zhangsan,(300,300)), (wangwu,(350,350)), (lisi,(400,400)), (zhaosi,(450,450)))

同理,stage6的reduceByKey过程和join过程是连在一起的,是因为有分区器的RDD并不需要进行shuffle操作,原来的数据该在哪在哪,而没有分区器的RDD要进行join要进行数据的打散,有shuffle过程,所以有stage4到stage6的连线。

相关推荐
房产中介行业研习社6 分钟前
市面上比较主流的房产中介管理系统有哪些推荐?
大数据·人工智能·房产直播技巧·房产直播培训
Live&&learn11 分钟前
Redis语法入门
数据库·redis
忧郁蓝调2620 分钟前
Redis不停机数据迁移:基于 redis-shake 的跨实例 / 跨集群同步方案
运维·数据库·redis·阿里云·缓存·云原生·paas
五阿哥永琪25 分钟前
Redis的常用数据结构
数据结构·数据库·redis
猴子年华、32 分钟前
【每日一技】:SQL 常用函数实战速查表(函数 + 场景版)
java·数据库·sql·mysql
远方160933 分钟前
110-Oracle中核心业务的年度分区表建立
数据库·oracle·database
月明长歌37 分钟前
MySQL数据库约束:把“能插入”升级成“插入就对”
数据库·mysql·oracle
云器科技1 小时前
NinjaVan x 云器Lakehouse: 从传统自建Spark架构升级到新一代湖仓架构
大数据·ai·架构·spark·湖仓平台
泰迪智能科技1 小时前
分享|2025年广东水利电力职业技术学院泰迪数据智能产业学院订单班结业典礼圆满结束
大数据·人工智能
中科天工1 小时前
如何实现工业AI在智能制造中的应用?
大数据·人工智能·智能