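When the number of partitions is reduced, no shuffle is required: the parent RDD and the child RDD are connected by a narrow dependency.
When the number of partitions is increased, a shuffle is required.
In extreme cases, however (for example, 1000 partitions collapsed into 1), leaving shuffle set to false keeps the parent and child RDDs in a narrow dependency, so they sit in the same stage. The operations before the coalesce then run with the reduced parallelism as well, which can hurt the performance of the Spark job. To give the operations before coalesce better parallelism when going from 1000 partitions to 1, set shuffle to true.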
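The difference between the two settings can be checked by inspecting the lineage. The following is a minimal sketch, assuming a local Spark installation; the object name `CoalesceLineageSketch` and the helper `hasShuffle` are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoalesceLineageSketch {
  // Hypothetical helper: walks the lineage and reports whether any
  // ancestor edge is a shuffle dependency.
  def hasShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists { d =>
      d.isInstanceOf[ShuffleDependency[_, _, _]] || hasShuffle(d.rdd)
    }

  // Returns (shuffle found without shuffle flag, shuffle found with shuffle flag).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("coalesce-lineage-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // 1000 partitions feeding a map step, as in the extreme case above.
      val mapped = sc.parallelize(1 to 100000, 1000).map(_ * 2)

      // Narrow dependency: the map step ends up in the same stage as the
      // single coalesced partition.
      val narrow = mapped.coalesce(1)

      // Shuffle dependency: the map step stays in its own stage with
      // 1000 partitions, and only the post-shuffle stage has 1 partition.
      val shuffled = mapped.coalesce(1, shuffle = true)

      // The indentation in toDebugString marks shuffle boundaries.
      println(narrow.toDebugString)
      println(shuffled.toDebugString)

      (hasShuffle(narrow), hasShuffle(shuffled))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

No job is executed here; only the lineage is built and inspected, so the sketch stays cheap even with 1000 partitions.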
```scala
scala> val arr = Array(1,2,3,4,5,6,7,8,9)
arr: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> sc.makeRDD(arr,3)
res12: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at makeRDD at <console>:27
scala> res12.coalesce(2)
res13: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[15] at coalesce at <console>:26
scala> res13.partitions.size
res14: Int = 2
scala> res12.coalesce(12)
res15: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[16] at coalesce at <console>:26
scala> res15.partitions.size
res16: Int = 3
scala> res12.repartition(2)
res17: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[20] at repartition at <console>:26
scala> res17.partitions.size
res18: Int = 2
scala> res12.repartition(12)
res19: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[24] at repartition at <console>:26
scala> res19.partitions.size
res20: Int = 12
```
Under the hood, the repartition operator calls coalesce with shuffle set to true, so it always introduces a shuffle.
Because it always shuffles, repartition can both increase and decrease the number of partitions.
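With its default settings, the coalesce operator can only decrease the number of partitions, not increase it, because its shuffle parameter defaults to false. Asking for more partitions than the parent has leaves the partition count unchanged.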
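The behaviour of both operators can be compared side by side. The following is a minimal sketch, assuming a local Spark installation; the object name `CoalesceVsRepartitionSketch` is a hypothetical name used only for this illustration. It reuses the 9-element array with 3 partitions from the shell session and adds the one case the session does not cover: coalesce with shuffle explicitly set to true.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartitionSketch {
  // Returns the resulting partition count for each call on a 3-partition RDD.
  def run(): Map[String, Int] = {
    val conf = new SparkConf().setAppName("coalesce-vs-repartition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
      Map(
        // Decreasing without a shuffle works.
        "coalesce(2)" -> rdd.coalesce(2).getNumPartitions,
        // Increasing without a shuffle is ignored: the count stays at 3.
        "coalesce(12)" -> rdd.coalesce(12).getNumPartitions,
        // With shuffle = true, coalesce can increase the count.
        "coalesce(12, shuffle = true)" -> rdd.coalesce(12, shuffle = true).getNumPartitions,
        // repartition always shuffles, so it can increase the count as well.
        "repartition(12)" -> rdd.repartition(12).getNumPartitions
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().foreach { case (call, n) => println(s"$call -> $n partitions") }
}
```

The last two entries should agree, which is consistent with repartition being a call to coalesce with shuffle set to true.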
Example:
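If a stage contains a union or coalesce operator, the number of partitions changes from one RDD to the next without any shuffle taking place. In that case, the number of tasks in the stage equals the partition count of the last RDD in that stage.
The coalesce operator does not cause a shuffle, so it does not start a new stage. reduceByKey does cause a shuffle, so a stage boundary is placed there.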
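The rule about task counts can be observed with a stage listener. The following is a minimal sketch, assuming a local Spark installation; the job itself (two small string RDDs, the partition counts 3, 2, 2 and 4) and the object name `StageTaskCountSketch` are made up for this illustration and are not the job from the original example. It also assumes that stopping the SparkContext delivers all pending listener events before returning.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

object StageTaskCountSketch {
  // Runs a small job and returns the task count of every completed stage, sorted.
  def run(): List[Int] = {
    val conf = new SparkConf().setAppName("stage-task-count-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val taskCounts = ArrayBuffer.empty[Int]

    // Record how many tasks each stage ran.
    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
        taskCounts.synchronized { taskCounts += event.stageInfo.numTasks }
    })

    try {
      val left = sc.parallelize(Seq("a", "b", "c"), 3)   // 3 partitions
      val right = sc.parallelize(Seq("a", "b"), 2)       // 2 partitions
      val pairs = left.union(right)                       // 5 partitions, no shuffle
        .coalesce(2)                                      // 2 partitions, no shuffle
        .map(word => (word, 1))                           // last RDD of the first stage
      val counts = pairs.reduceByKey(_ + _, 4)            // shuffle: second stage, 4 partitions
      counts.collect()
    } finally {
      // Assumption: stop() drains the listener queue before returning.
      sc.stop()
    }
    taskCounts.synchronized { taskCounts.toList.sorted }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Under these assumptions the first stage should report 2 tasks, matching the partition count after coalesce rather than the 5 partitions produced by union, and the stage after reduceByKey should report 4.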