How Spark SQL Splits Files on the DataSource Read Path

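0 Introduction

When Spark reads a Hive table stored as ORC or Parquet, it automatically converts the scan to its native DataSource path, provided spark.sql.hive.convertMetastoreOrc=true / spark.sql.hive.convertMetastoreParquet=true (both default to true).

If the corresponding setting is false, or the Hive table is STORED AS textfile/sequencefile/rcfile/avro, Spark SQL reads the table through Hadoop's InputFormat instead (see "Spark RDD Task Parallelism, Part 1: A Source Walkthrough of File-Read Parallelism").

Run EXPLAIN EXTENDED SELECT * FROM table and inspect the physical plan to see which path a query takes:

FileScan parquet/orc → DataSource path (the node carries Batched and PushedFilters fields)

HiveTableScan → Hive path (the plan shows HiveTableRelation and the SerDe class name)

This article explains how Spark SQL builds the RDD and determines the partition count when it reads an ORC Hive table through the DataSource path.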

1 FileSourceScanExec

Files are scanned by FileSourceScanExec, an implementation of DataSourceScanExec. The core logic lives in inputRDD, which builds the RDD[InternalRow] that actually reads the file data.

  lazy val inputRDD: RDD[InternalRow] = {
    // Define the function that reads one file slice
    val readFile: (PartitionedFile) => Iterator[InternalRow] =
      relation.fileFormat.buildReaderWithPartitionValues(
        sparkSession = relation.sparkSession,
        dataSchema = relation.dataSchema,
        partitionSchema = relation.partitionSchema,
        requiredSchema = requiredSchema,
        filters = pushedDownFilters,
        options = relation.options,
        hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
	
    // Decide whether this scan can be read as a bucketed table
    val readRDD = if (bucketedScan) { // bucketed table
      createBucketedReadRDD(relation.bucketSpec.get, readFile, dynamicallySelectedPartitions,
        relation)
    } else { // non-bucketed table
      createNonBucketedReadRDD(readFile, dynamicallySelectedPartitions, relation)
    }
    sendDriverMetrics()
    readRDD
  }

bucketedScan decides whether the files are scanned bucket by bucket:

  lazy val bucketedScan: Boolean = {
    // 1. Is this a usable bucketed table?
    if (relation.sparkSession.sessionState.conf.bucketingEnabled && relation.bucketSpec.isDefined
      && !disableBucketedScan) {
      // 2. Check that every bucket column can be resolved
      val spec = relation.bucketSpec.get // bucket metadata: number of buckets, bucket columns, sort columns
      val bucketColumns = spec.bucketColumnNames.flatMap(n => toAttribute(n)) // map bucket column names to Attributes of the current plan
      // Every bucket column declared in the metadata must resolve to an Attribute in this scan's output.
      // Otherwise Spark cannot rely on the bucket layout for later optimizations such as bucket joins that avoid a shuffle.
      bucketColumns.size == spec.bucketColumnNames.size
    } else {
      false
    }
  }

① relation.sparkSession.sessionState.conf.bucketingEnabled

When spark.sql.sources.bucketing.enabled is false, Spark does not use a bucketed scan even if the table itself is bucketed. The setting is enabled by default.

  val BUCKETING_ENABLED = buildConf("spark.sql.sources.bucketing.enabled")
    .doc("When false, we will treat bucketed table as normal table")
    .version("2.0.0")
    .booleanConf
    .createWithDefault(true)

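② relation.bucketSpec.isDefined

Whether the relation being read actually carries bucket metadata. For plain parquet/csv/orc files without bucket metadata this is false.

③ !disableBucketedScan

Whether the current plan has not explicitly disabled the bucketed scan. Even for a bucketed table Spark may decide against a bucketed scan, for example when the optimizer or planner judges that reading by bucket brings no benefit for this query, or when the bucket information cannot be used safely.

Only when all three conditions hold does Spark go on to check whether the bucket columns can be resolved.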
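
The decision above can be sketched without Spark on the classpath. This is only an illustration under simplifying assumptions: ScanContext and its fields are hypothetical stand-ins for the state that FileSourceScanExec reads, and column resolution is reduced to a set lookup.

```scala
// Hypothetical, simplified stand-in for the state bucketedScan inspects.
final case class ScanContext(
    bucketingEnabled: Boolean,              // spark.sql.sources.bucketing.enabled
    bucketColumnNames: Option[Seq[String]], // Some(...) when the relation has a BucketSpec
    disableBucketedScan: Boolean,           // set by the planner when a bucketed scan is not wanted
    outputColumns: Set[String])             // columns this scan can resolve

object BucketedScanSketch {
  def bucketedScan(ctx: ScanContext): Boolean =
    ctx.bucketColumnNames match {
      // Conditions 1-3 hold: now every bucket column must be resolvable.
      case Some(cols) if ctx.bucketingEnabled && !ctx.disableBucketedScan =>
        cols.forall(ctx.outputColumns.contains)
      // Any failed condition falls back to the non-bucketed path.
      case _ => false
    }
}
```

With this model, a bucketed table whose bucket column is missing from the scan output is read through the non-bucketed path, just like a table with no bucket metadata at all.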

The following sections walk through the Spark source (Spark 3.1.2) to show how partitions are generated for non-bucketed and for bucketed tables.

2 RDD creation for non-bucketed tables: createNonBucketedReadRDD

2.1 Overall logic

org.apache.spark.sql.execution.FileSourceScanExec

  private def createNonBucketedReadRDD(
      readFile: (PartitionedFile) => Iterator[InternalRow],
      selectedPartitions: Array[PartitionDirectory],
      fsRelation: HadoopFsRelation): RDD[InternalRow] = {
    // Cost of opening a file
    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
    // 1. Determine the split size
    val maxSplitBytes =
      FilePartition.maxSplitBytes(fsRelation.sparkSession, selectedPartitions)
    logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
      s"open cost is considered as scanning $openCostInBytes bytes.")
    // 2. Split the files
    val splitFiles = selectedPartitions.flatMap { partition =>
      partition.files.flatMap { file =>
        // getPath() is very expensive so we only want to call it once in this block:
        val filePath = file.getPath
        val isSplitable = relation.fileFormat.isSplitable(
          relation.sparkSession, relation.options, filePath)
        PartitionedFileUtil.splitFiles(
          sparkSession = relation.sparkSession,
          file = file,
          filePath = filePath,
          isSplitable = isSplitable,
          maxSplitBytes = maxSplitBytes,
          partitionValues = partition.values
        )
      }
    }.sortBy(_.length)(implicitly[Ordering[Long]].reverse)

    // 3. Build the partitions
    val partitions =
      FilePartition.getFilePartitions(relation.sparkSession, splitFiles, maxSplitBytes)

    new FileScanRDD(fsRelation.sparkSession, readFile, partitions)
  }

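The full pipeline is:

  -> determine the split size
  -> iterate over the files in the selected partitions
  -> check whether each file is splittable
  -> cut files into PartitionedFiles of at most maxSplitBytes
  -> sort by PartitionedFile.length in descending order
  -> pack into FilePartitions with getFilePartitions
  -> create the FileScanRDD
  -> Spark schedules tasks that read the files in parallel

How the core objects relate:

physical file
  -> PartitionedFile
  -> FilePartition
  -> FileScanRDD partition
  -> Spark task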

2.2 Related configuration parameters

org.apache.spark.sql.internal.SQLConf

`spark.sql.files.maxPartitionBytes`

Definition:

val FILES_MAX_PARTITION_BYTES = buildConf("spark.sql.files.maxPartitionBytes")
  .doc("The maximum number of bytes to pack into a single partition when reading files. " +
    "This configuration is effective only when using file-based sources such as Parquet, JSON " +
    "and ORC.")
  .version("2.0.0")
  .bytesConf(ByteUnit.BYTE)
  .createWithDefaultString("128MB")

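Meaning:

The maximum number of bytes to pack into a single partition when reading files.

The setting only takes effect for file-based sources such as Parquet, JSON, and ORC.

Default: 128MB

Effect:

Controls the target amount of data each Spark task processes in the file-scan stage.
A smaller value produces more read partitions and therefore more tasks.
A larger value produces fewer read partitions and therefore fewer tasks.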

`spark.sql.files.openCostInBytes`

Definition:

val FILES_OPEN_COST_IN_BYTES = buildConf("spark.sql.files.openCostInBytes")
  .internal()
  .doc("The estimated cost to open a file, measured by the number of bytes could be scanned in" +
    " the same time. This is used when putting multiple files into a partition. It's better to" +
    " over estimated, then the partitions with small files will be faster than partitions with" +
    " bigger files (which is scheduled first). This configuration is effective only when using" +
    " file-based sources such as Parquet, JSON and ORC.")
  .version("2.0.0")
  .longConf
  .createWithDefault(4 * 1024 * 1024)

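Meaning:

The estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time.

It is used when multiple files are packed into one partition.

It is better to overestimate this value: partitions made of small files then finish faster than partitions made of larger files, which are scheduled first.

The setting only takes effect for file-based sources such as Parquet, JSON, and ORC.

Default: 4MB

Effect:

Used to estimate the read cost when small files are merged into one partition.
Spark does not weigh a partition by real file size alone; it also charges every file the cost of opening it.
Roughly:

estimated file cost = actual file size + openCostInBytes

The parameter matters mostly when there are many small files. Each file is small, but opening files over and over adds overhead, so an estimate based on size alone would understate the real read cost.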

`spark.sql.files.minPartitionNum`

Definition:

val FILES_MIN_PARTITION_NUM = buildConf("spark.sql.files.minPartitionNum")
  .doc("The suggested (not guaranteed) minimum number of split file partitions. " +
    "If not set, the default value is `spark.default.parallelism`. This configuration is " +
    "effective only when using file-based sources such as Parquet, JSON and ORC.")
  .version("3.1.0")
  .intConf
  .checkValue(v => v > 0, "The min partition number must be a positive integer.")
  .createOptional

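Meaning:

The suggested minimum number of split file partitions. It is not strictly guaranteed.

If it is not set, the default is `spark.default.parallelism`.

The setting only takes effect for file-based sources such as Parquet, JSON, and ORC.

Effect:

Keeps the file-read stage from producing too few partitions.
When the input has few files, or the files are large, this parameter can raise read parallelism.
It is only a suggestion: the final partition count may still be lower than this value.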

2.3 Determining the split size: computing maxSplitBytes

The split size is computed from the three settings above.

org.apache.spark.sql.execution.datasources.FilePartition

  def maxSplitBytes(
      sparkSession: SparkSession,
      selectedPartitions: Seq[PartitionDirectory]): Long = {
    // Read spark.sql.files.maxPartitionBytes
    val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
    // Read spark.sql.files.openCostInBytes
    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
    // Read spark.sql.files.minPartitionNum; when it is unset, fall back to sparkContext.defaultParallelism
    val minPartitionNum = sparkSession.sessionState.conf.filesMinPartitionNum
      .getOrElse(sparkSession.sparkContext.defaultParallelism)
    // Estimated total read cost = sum over all selected files of (file length + openCostInBytes)
    val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
    // Bytes per partition if the input were spread evenly over minPartitionNum partitions
    val bytesPerCore = totalBytes / minPartitionNum

    Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  }

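For large inputs, where totalBytes / minPartitionNum is at least spark.sql.files.maxPartitionBytes, maxSplitBytes equals spark.sql.files.maxPartitionBytes. For smaller inputs it shrinks to bytesPerCore, but never drops below openCostInBytes.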
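
To see the three regimes of the formula, here is a minimal Spark-free sketch of the same arithmetic. The object name, the plain Seq[Long] of file sizes, and the default arguments are assumptions made for illustration; only the min/max expression mirrors the source above.

```scala
object MaxSplitBytesSketch {
  val MB: Long = 1024L * 1024L

  // fileSizes stands in for the lengths of all files in the selected partitions.
  def maxSplitBytes(
      fileSizes: Seq[Long],
      minPartitionNum: Int,
      maxPartitionBytes: Long = 128 * MB, // spark.sql.files.maxPartitionBytes default
      openCostInBytes: Long = 4 * MB      // spark.sql.files.openCostInBytes default
  ): Long = {
    // Every file is charged its length plus the open cost.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / minPartitionNum
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

Ten 1GB files with minPartitionNum = 8 hit the 128MB cap. Two files of 150MB and 160MB with minPartitionNum = 8 give 318MB / 8 = 39.75MB. Three 1MB files with minPartitionNum = 8 are clamped up to the 4MB open cost.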

2.4 Splitting files: producing `PartitionedFile`

Source:

val splitFiles = selectedPartitions.flatMap { partition => // 1. iterate over the partition directories left after partition pruning
  partition.files.flatMap { file => // 2. iterate over the files in each partition directory
    // getPath() is very expensive so we only want to call it once in this block:
    val filePath = file.getPath // 2.1 get the file path
    val isSplitable = relation.fileFormat.isSplitable( // 2.2 check whether the file is splittable
      relation.sparkSession, relation.options, filePath)
    PartitionedFileUtil.splitFiles( // 2.3 produce one PartitionedFile, or several when the file is splittable and longer than maxSplitBytes
      sparkSession = relation.sparkSession,
      file = file,
      filePath = filePath,
      isSplitable = isSplitable,
      maxSplitBytes = maxSplitBytes,
      partitionValues = partition.values
    )
  }
}.sortBy(_.length)(implicitly[Ordering[Long]].reverse) // 3. sort by PartitionedFile.length, largest first

What this code does:

selectedPartitions -> files -> PartitionedFile

The source that produces the PartitionedFiles is:

org.apache.spark.sql.execution.PartitionedFileUtil

  def splitFiles(
      sparkSession: SparkSession,
      file: FileStatus,
      filePath: Path,
      isSplitable: Boolean,
      maxSplitBytes: Long,
      partitionValues: InternalRow): Seq[PartitionedFile] = {
    if (isSplitable) { // 1. the file is splittable
      (0L until file.getLen by maxSplitBytes).map { offset =>
        val remaining = file.getLen - offset
        val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining // 1.1 each slice is at most maxSplitBytes long
        val hosts = getBlockHosts(getBlockLocations(file), offset, size)
        PartitionedFile(partitionValues, filePath.toUri.toString, offset, size, hosts) // 1.2 build the PartitionedFile
      }
    } else { // 2. the file is not splittable, so the whole file becomes one PartitionedFile
      Seq(getPartitionedFile(file, filePath, partitionValues))
    }
  }

2.4.1 What a `PartitionedFile` represents

A PartitionedFile describes one concrete slice of a file to read.

For example, a splittable 300MB ORC file with maxSplitBytes = 128MB is cut into:

PartitionedFile(file1.orc, start=0MB, length=128MB)
PartitionedFile(file1.orc, start=128MB, length=128MB)
PartitionedFile(file1.orc, start=256MB, length=44MB)

If the file is not splittable, for example a gzip-compressed text file, it usually yields a single PartitionedFile no matter how large it is:

PartitionedFile(file1.gz, start=0, length=300MB)
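
The slicing rule can be reproduced with a few lines of plain Scala. This is a sketch, not Spark's code: Slice is a hypothetical stand-in for PartitionedFile that keeps only path, start, and length, and block-location lookup is left out.

```scala
object SplitFilesSketch {
  val MB: Long = 1024L * 1024L

  // Hypothetical, trimmed-down stand-in for PartitionedFile.
  final case class Slice(path: String, start: Long, length: Long)

  def splitFile(path: String, fileLen: Long, isSplitable: Boolean, maxSplitBytes: Long): Seq[Slice] =
    if (isSplitable) {
      // One slice per maxSplitBytes-sized step; the last slice takes whatever remains.
      (0L until fileLen by maxSplitBytes).map { offset =>
        Slice(path, offset, math.min(fileLen - offset, maxSplitBytes))
      }
    } else {
      // Not splittable: the whole file is a single slice.
      Seq(Slice(path, 0L, fileLen))
    }
}
```

Calling SplitFilesSketch.splitFile("file1.orc", 300 * MB, true, 128 * MB) yields the three slices listed above (128MB, 128MB, 44MB), while the same call with isSplitable = false yields one 300MB slice.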

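2.4.2 Why sort by size in descending order

The code ends with:

.sortBy(_.length)(implicitly[Ordering[Long]].reverse)

which sorts the file slices from longest to shortest.

The goal is to let large slices go first in the packing and scheduling that follow, which lowers the risk that a few large tasks are left until the end and hold up the whole stage.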

2.5 Packing file slices: producing `FilePartition`

Source:

val partitions =
  FilePartition.getFilePartitions(relation.sparkSession, splitFiles, maxSplitBytes)

This step packs the PartitionedFile slices into FilePartitions. The relationship is:

PartitionedFile = one file slice
FilePartition  = the group of file slices one Spark task reads

In other words, a Spark task usually maps to one FilePartition rather than to one file.

Source of getFilePartitions

org.apache.spark.sql.execution.datasources.FilePartition

def getFilePartitions(
    sparkSession: SparkSession,
    partitionedFiles: Seq[PartitionedFile],
    maxSplitBytes: Long): Seq[FilePartition] = {
  val partitions = new ArrayBuffer[FilePartition] // all FilePartitions produced so far
  val currentFiles = new ArrayBuffer[PartitionedFile] // file slices in the partition being built
  var currentSize = 0L // estimated size of the partition being built

  /** Close the current partition and move to the next. */
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) {
      // Copy to a new Array.
      // partitions.size becomes the new partition's index; the slices collected so far become its content
      val newPartition = FilePartition(partitions.size, currentFiles.toArray)
      partitions += newPartition
    }
    // Reset the working state for the next partition
    currentFiles.clear()
    currentSize = 0
  }

  val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
    
  // Assign files to partitions using "Next Fit Decreasing"
  // ("Decreasing" because partitionedFiles is already sorted from largest to smallest)
  partitionedFiles.foreach { file => // visit each file slice in order
    // If the slice does not fit, close the current partition and start the next one.
    if (currentSize + file.length > maxSplitBytes) {
      closePartition()
    }
    // Add the given file to the current partition.
    // The slice always lands in the current (possibly fresh) partition and is charged its length plus openCostInBytes
    currentSize += file.length + openCostInBytes
    currentFiles += file
  }
    
  closePartition()
  partitions.toSeq
}

2.6 Creating the `FileScanRDD`

Source:

new FileScanRDD(fsRelation.sparkSession, readFile, partitions)

This step creates the RDD that actually performs the file scan.

Parameters:

fsRelation.sparkSession: the current SparkSession.
readFile: the function that reads a single PartitionedFile.
partitions: the list of FilePartitions built in the previous step.

readFile can be thought of as:

PartitionedFile => Iterator[InternalRow]

That is:

input: one file slice
output: the rows contained in that slice

Each file format supplies its own reader:

Parquet uses the Parquet reader.
ORC uses the ORC reader.
JSON uses the JSON reader.
CSV uses the CSV reader.

FileScanRDD exposes the FilePartitions as RDD partitions. At execution time each RDD partition usually maps to one task.

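2.7 Example

Suppose there are two files, A (150MB) and B (160MB), openCostInBytes is the default 4MB, and maxSplitBytes resolves to the default 128MB. The latter requires totalBytes / minPartitionNum to be at least 128MB; with totalBytes = 318MB here, that means a minPartitionNum (or default parallelism) of at most 2.

Files A and B produce the following PartitionedFiles: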

A-1: file=A, start=0MB,   length=128MB
A-2: file=A, start=128MB, length=22MB
B-1: file=B, start=0MB,   length=128MB
B-2: file=B, start=128MB, length=32MB

After sorting, they enter the packing step in this order:

A-1(128MB), B-1(128MB), B-2(32MB), A-2(22MB)

Resulting FilePartitions:

FilePartition 0:
  A-1(128MB)

FilePartition 1:
  B-1(128MB)

FilePartition 2:
  B-2(32MB)
  A-2(22MB)

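The scan ends up with 3 tasks. The third FilePartition spans both files A and B:

FilePartition 2:
  B-2(32MB) // tail slice of file B
  A-2(22MB) // tail slice of file A

So a FilePartition is not "one file". It is the group of PartitionedFiles that one Spark task reads.

Packing this way merges small files and small slices so that the scan does not produce too many tasks.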
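
The packing of this example can be replayed with a small Spark-free sketch of the Next Fit Decreasing loop. Slice and pack are hypothetical names introduced for illustration, and maxSplitBytes is passed in directly (assumed to be 128MB) instead of being derived from the session.

```scala
import scala.collection.mutable.ArrayBuffer

object BinPackingSketch {
  val MB: Long = 1024L * 1024L

  // Hypothetical stand-in for PartitionedFile: a label and a length only.
  final case class Slice(name: String, length: Long)

  def pack(slices: Seq[Slice], maxSplitBytes: Long, openCostInBytes: Long): Seq[Seq[Slice]] = {
    val partitions = ArrayBuffer.empty[Seq[Slice]]
    val current = ArrayBuffer.empty[Slice]
    var currentSize = 0L

    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current.toList
      current.clear()
      currentSize = 0L
    }

    // Largest slices first, then "next fit": only the partition being built is considered.
    slices.sortBy(_.length)(Ordering[Long].reverse).foreach { s =>
      if (currentSize + s.length > maxSplitBytes) closePartition()
      currentSize += s.length + openCostInBytes // each slice is charged the open cost
      current += s
    }
    closePartition()
    partitions.toList
  }
}
```

Feeding it A-1 (128MB), A-2 (22MB), B-1 (128MB), and B-2 (32MB) with maxSplitBytes = 128MB and openCostInBytes = 4MB gives three groups: [A-1], [B-1], and [B-2, A-2]. After B-2 is added the running size is 36MB, so A-2 (22MB) still fits in the same partition.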

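2.8 Key takeaways

When Spark SQL reads files, the task count is not simply the file count.

What actually happens is:

Large files are split first, small files and slices are then merged, and the result becomes the partitions of the FileScanRDD.

Where:

maxPartitionBytes controls the target size of a single read partition.
openCostInBytes estimates the extra cost of opening a file and mainly affects how small files are merged.
minPartitionNum suggests a minimum number of read partitions to raise read parallelism.
PartitionedFile is a file slice.
FilePartition is the group of file slices one Spark task reads.
FileScanRDD is the RDD that actually performs the file scan.

Dividing a Spark SQL file read into tasks is therefore a bin-packing process driven by file size, file open cost, and target partition size.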

3 RDD creation for bucketed tables: createBucketedReadRDD

3.1 Overall logic

org.apache.spark.sql.execution.FileSourceScanExec

private def createBucketedReadRDD(
    bucketSpec: BucketSpec,
    readFile: (PartitionedFile) => Iterator[InternalRow],
    selectedPartitions: Array[PartitionDirectory],
    fsRelation: HadoopFsRelation): RDD[InternalRow] = {
  logInfo(s"Planning with ${bucketSpec.numBuckets} buckets")
    
  val filesGroupedToBuckets =
    selectedPartitions.flatMap { p =>
      p.files.map { f =>
        PartitionedFileUtil.getPartitionedFile(f, f.getPath, p.values)
      }
    }.groupBy { f =>
      BucketingUtils
        .getBucketId(new Path(f.filePath).getName)
        .getOrElse(sys.error(s"Invalid bucket file ${f.filePath}"))
    }

  // TODO(SPARK-32985): Decouple bucket filter pruning and bucketed table scan
  val prunedFilesGroupedToBuckets = if (optionalBucketSet.isDefined) {
    val bucketSet = optionalBucketSet.get
    filesGroupedToBuckets.filter {
      f => bucketSet.get(f._1)
    }
  } else {
    filesGroupedToBuckets
  }

  val filePartitions = optionalNumCoalescedBuckets.map { numCoalescedBuckets =>
    logInfo(s"Coalescing to ${numCoalescedBuckets} buckets")
    val coalescedBuckets = prunedFilesGroupedToBuckets.groupBy(_._1 % numCoalescedBuckets)
    Seq.tabulate(numCoalescedBuckets) { bucketId =>
      val partitionedFiles = coalescedBuckets.get(bucketId).map {
        _.values.flatten.toArray
      }.getOrElse(Array.empty)
      FilePartition(bucketId, partitionedFiles)
    }
  }.getOrElse {
    Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
      FilePartition(bucketId, prunedFilesGroupedToBuckets.getOrElse(bucketId, Array.empty))
    }
  }

  new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions)
}

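createBucketedReadRDD builds a FileScanRDD for a bucketed table.

A regular file scan combines file slices into FilePartitions based on file size, maxPartitionBytes, openCostInBytes, and so on.

A bucketed scan is different. It cares about which bucket id a file belongs to and tries to preserve the mapping:

RDD partition index <-> bucket id

This mapping lets Spark exploit the bucket distribution later in the plan, for example to reduce shuffles in bucket joins.

The overall flow of createBucketedReadRDD is:

selectedPartitions
  -> iterate over the files in the partition directories
  -> convert each file to a PartitionedFile
  -> parse the bucket id from the file name
  -> group by bucket id
  -> prune buckets according to optionalBucketSet
  -> decide whether to coalesce buckets according to optionalNumCoalescedBuckets
  -> create the FilePartitions
  -> create the FileScanRDD

The core data flow can be written as: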

Array[PartitionDirectory]
  -> Array[PartitionedFile]
  -> Map[bucketId, Array[PartitionedFile]]
  -> Seq[FilePartition]
  -> FileScanRDD

3.2 The bucket parameters

bucketSpec: BucketSpec

The bucket metadata of the table: number of buckets, bucket columns, sort columns, and so on. For example:

CLUSTERED BY (user_id) INTO 8 BUCKETS

gives bucketSpec.numBuckets = 8.

logInfo(s"Planning with ${bucketSpec.numBuckets} buckets")

This log line reports how many buckets the scan plan is organized around.

3.3 Converting files to PartitionedFile

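selectedPartitions.flatMap { p => // 1. iterate over the partition directories left after partition pruning
  p.files.map { f => // 2. iterate over the files in each partition directory
    PartitionedFileUtil.getPartitionedFile(f, f.getPath, p.values) // 3. wrap the whole file as one PartitionedFile
  }
}

This code walks every partition directory that has to be scanned and turns each file into a PartitionedFile. Unlike the non-bucketed path, files are never split here: each file becomes exactly one PartitionedFile.

org.apache.spark.sql.execution.PartitionedFileUtil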

  def getPartitionedFile(
      file: FileStatus,
      filePath: Path,
      partitionValues: InternalRow): PartitionedFile = {
    val hosts = getBlockHosts(getBlockLocations(file), 0, file.getLen)
    PartitionedFile(partitionValues, filePath.toUri.toString, 0, file.getLen, hosts)
  }

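A PartitionedFile is the smallest unit of description in a Spark file scan. It typically holds:

the file path
the start offset
the length to read
the partition column values

3.4 Parsing the bucket id from the file name and grouping PartitionedFiles by bucket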

.groupBy { f =>
  BucketingUtils
    .getBucketId(new Path(f.filePath).getName)
    .getOrElse(sys.error(s"Invalid bucket file ${f.filePath}"))
}

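This step parses the bucket id from each file name and groups the files by bucket id.

Bucketed files written by Spark carry the bucket id as a zero-padded number after an underscore, just before the file extension. For example:

part-00000-xxxxx_00003.c000.snappy.orc

Here the trailing _00003 is parsed as bucket id 3. The leading part-00000 is the index of the writing task's output, not the bucket id.

If the file name does not follow the bucketed naming convention:

getOrElse(sys.error(s"Invalid bucket file ${f.filePath}"))

Spark fails immediately. A bucketed scan has to know which bucket every file belongs to; otherwise it cannot preserve bucket semantics.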
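
The parsing rule can be illustrated with a small regex-based sketch. The pattern below is written from memory of how BucketingUtils matches file names in Spark 3.x and should be treated as an assumption to verify against your Spark version; the object name and sample file names are made up for illustration.

```scala
object BucketIdSketch {
  // Assumed pattern: anything, then "_", then the bucket id digits, then an optional ".extension" tail.
  private val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  def getBucketId(fileName: String): Option[Int] = fileName match {
    case bucketedFileName(id) => Some(id.toInt)
    case _                    => None // not a bucketed file name
  }
}
```

Under this pattern, "part-00000-2dd664f9_00003.c000.snappy.orc" resolves to bucket 3, whereas a name without the underscore suffix, such as "part-00003-xxxxx.orc", resolves to None and would trigger the "Invalid bucket file" error in a bucketed scan.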

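After grouping, the data structure is:

Map[Int, Array[PartitionedFile]]

Example:

bucket 0 -> [file_0_1, file_0_2]
bucket 1 -> [file_1_1]
bucket 2 -> [file_2_1]
bucket 3 -> (no entry: a bucket without files does not appear in the map)

In a partitioned bucketed table, different partition directories can each hold files with the same bucket id. Those files all land in the same bucket group.

For example:

dt=2026-06-16/part-00000-xxx_00000.c000.snappy.orc
dt=2026-06-17/part-00000-yyy_00000.c000.snappy.orc

Both files belong to bucket 0.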

3.5 Bucket pruning: skipping buckets that need not be read

val prunedFilesGroupedToBuckets = if (optionalBucketSet.isDefined) {
  val bucketSet = optionalBucketSet.get
  filesGroupedToBuckets.filter {
    f => bucketSet.get(f._1)
  }
} else {
  filesGroupedToBuckets
}

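This step performs bucket pruning.

If optionalBucketSet is defined, Spark has already derived from the filter conditions that only some buckets need to be read.

For example, a table bucketed by user_id into 8 buckets:

CLUSTERED BY (user_id) INTO 8 BUCKETS

with the query condition:

WHERE user_id = 100

If Spark can compute that user_id = 100 always falls into bucket 5, only bucket 5 has to be read.

bucketSet can be thought of as a bitset: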

bucket 0 -> false
bucket 1 -> false
bucket 2 -> false
bucket 3 -> false
bucket 4 -> false
bucket 5 -> true
bucket 6 -> false
bucket 7 -> false

The filter:

filesGroupedToBuckets.filter {
  f => bucketSet.get(f._1)
}

Here f._1 is the bucket id.

If optionalBucketSet is not defined, no bucket pruning happens and the files of all buckets are kept.
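
A minimal sketch of this filter, assuming Scala's immutable BitSet as a stand-in for the bitset Spark uses internally and plain strings as stand-ins for PartitionedFile (both are simplifications for illustration):

```scala
import scala.collection.immutable.BitSet

object BucketPruningSketch {
  // grouped maps bucket id -> files of that bucket; bucketSet holds the buckets that must be read.
  def prune[A](grouped: Map[Int, Seq[A]], bucketSet: Option[BitSet]): Map[Int, Seq[A]] =
    bucketSet match {
      case Some(bits) => grouped.filter { case (bucketId, _) => bits.contains(bucketId) }
      case None       => grouped // no pruning information: keep every bucket
    }
}
```

With grouped = Map(0 -> Seq("f0"), 5 -> Seq("f5"), 7 -> Seq("f7")) and bucketSet = Some(BitSet(5)), only the entry for bucket 5 survives; with bucketSet = None the map is returned unchanged.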

3.6 Creating the FilePartitions

Next, Spark builds the list of FilePartitions that is handed to the FileScanRDD.

val filePartitions = optionalNumCoalescedBuckets.map { numCoalescedBuckets =>
  ...
}.getOrElse {
  ...
}

There are two cases:

bucket coalescing is in effect
bucket coalescing is not in effect

3.6.1 Case 1: bucket coalescing is in effect

optionalNumCoalescedBuckets.map { numCoalescedBuckets =>
  logInfo(s"Coalescing to ${numCoalescedBuckets} buckets")
  val coalescedBuckets = prunedFilesGroupedToBuckets.groupBy(_._1 % numCoalescedBuckets)
  Seq.tabulate(numCoalescedBuckets) { bucketId =>
    val partitionedFiles = coalescedBuckets.get(bucketId).map {
      _.values.flatten.toArray
    }.getOrElse(Array.empty)
    FilePartition(bucketId, partitionedFiles)
  }
}

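If optionalNumCoalescedBuckets is defined (when this happens is covered at the end of this subsection), Spark has decided to merge several buckets into fewer scan partitions.

For example, a table with 8 buckets that is coalesced to 4 at scan time:

original buckets:  0 1 2 3 4 5 6 7
coalesced buckets: 0 1 2 3

The merge rule is:

_._1 % numCoalescedBuckets

that is, the bucket id modulo the number of coalesced buckets.

With numCoalescedBuckets = 4: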

bucket 0 -> 0 % 4 = 0
bucket 4 -> 4 % 4 = 0

bucket 1 -> 1 % 4 = 1
bucket 5 -> 5 % 4 = 1

bucket 2 -> 2 % 4 = 2
bucket 6 -> 6 % 4 = 2

bucket 3 -> 3 % 4 = 3
bucket 7 -> 7 % 4 = 3

So the result of the merge is:

coalesced bucket 0 -> original bucket 0 + original bucket 4
coalesced bucket 1 -> original bucket 1 + original bucket 5
coalesced bucket 2 -> original bucket 2 + original bucket 6
coalesced bucket 3 -> original bucket 3 + original bucket 7

One FilePartition is then created per coalesced bucket:

Seq.tabulate(numCoalescedBuckets) { bucketId =>
  ...
  FilePartition(bucketId, partitionedFiles)
}

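partitionedFiles is built as follows:

coalescedBuckets.get(bucketId).map {
  _.values.flatten.toArray
}.getOrElse(Array.empty)

Step by step:

Look up all original buckets that map to the current coalesced bucket.
Take the PartitionedFile arrays of those original buckets.
Merge them into one array with flatten.
If there are no files, return an empty array.

As a result, one coalesced FilePartition may contain the files of several original buckets.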
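
The modulo grouping can be tried out in isolation. This sketch uses file-name strings in place of PartitionedFile and returns plain sequences in place of FilePartition. It also sorts the original buckets by id inside each coalesced bucket, which the Spark code does not do; that is added here only to make the output order deterministic.

```scala
object BucketCoalesceSketch {
  def coalesce(grouped: Map[Int, Seq[String]], numCoalescedBuckets: Int): Seq[Seq[String]] = {
    // Original bucket b goes to coalesced bucket b % numCoalescedBuckets.
    val coalesced = grouped.groupBy { case (bucketId, _) => bucketId % numCoalescedBuckets }
    // Always emit exactly numCoalescedBuckets partitions, empty ones included.
    Seq.tabulate(numCoalescedBuckets) { id =>
      coalesced.get(id)
        .map(_.toSeq.sortBy(_._1).flatMap(_._2)) // sort added for a stable order (assumption)
        .getOrElse(Seq.empty)
    }
  }
}
```

For eight buckets holding one file each (f0 to f7) and numCoalescedBuckets = 4, the result is [f0, f4], [f1, f5], [f2, f6], [f3, f7], matching the table above.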

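Notes on `spark.sql.bucketing.coalesceBucketsInJoin.enabled`

optionalNumCoalescedBuckets is not something users set directly in their code. It is normally filled in by a Spark physical-plan optimization rule. Its purpose is to coalesce the side with more buckets down to the smaller bucket count in a bucket join. For it to take effect, spark.sql.bucketing.coalesceBucketsInJoin.enabled must be set to true (the default is false).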

  val COALESCE_BUCKETS_IN_JOIN_ENABLED =
    buildConf("spark.sql.bucketing.coalesceBucketsInJoin.enabled")
      .doc("When true, if two bucketed tables with the different number of buckets are joined, " +
        "the side with a bigger number of buckets will be coalesced to have the same number " +
        "of buckets as the other side. Bigger number of buckets is divisible by the smaller " +
        "number of buckets. Bucket coalescing is applied to sort-merge joins and " +
        "shuffled hash join. Note: Coalescing bucketed table can avoid unnecessary shuffling " +
        "in join, but it also reduces parallelism and could possibly cause OOM for " +
        "shuffled hash join.")
      .version("3.1.0")
      .booleanConf
      .createWithDefault(false)
    /* In short: when two bucketed tables with different bucket counts are joined,
    the side with more buckets is coalesced down to the bucket count of the other side,
    provided the larger count is divisible by the smaller one.
    This applies to sort-merge joins and shuffled hash joins.

    Trade-off: coalescing avoids an unnecessary shuffle in the join,
    but it lowers parallelism and may cause OOM in a shuffled hash join.
    */

3.6.2 Case 2: bucket coalescing is not in effect

Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
  FilePartition(bucketId, prunedFilesGroupedToBuckets.getOrElse(bucketId, Array.empty))
}

If optionalNumCoalescedBuckets is empty, FilePartitions are created according to the original bucket count.

For example, with bucketSpec.numBuckets = 8, 8 FilePartitions are created:

FilePartition 0 -> files of bucket 0
FilePartition 1 -> files of bucket 1
FilePartition 2 -> files of bucket 2
FilePartition 3 -> files of bucket 3
FilePartition 4 -> files of bucket 4
FilePartition 5 -> files of bucket 5
FilePartition 6 -> files of bucket 6
FilePartition 7 -> files of bucket 7

If a bucket has no files, or has been removed by bucket pruning, an empty array is used:

prunedFilesGroupedToBuckets.getOrElse(bucketId, Array.empty)

So even buckets that do not need to be read still produce a FilePartition, just an empty one.

This preserves:

FilePartition index == bucket id

which matters for any later step that relies on the bucket distribution.
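
A minimal sketch of why empty partitions appear, using strings in place of PartitionedFile and a sequence of sequences in place of Seq[FilePartition] (both are simplifications, and the file name is made up):

```scala
object BucketPartitionsSketch {
  // One output slot per bucket id in [0, numBuckets); missing or pruned buckets become empty slots.
  def toPartitions(numBuckets: Int, pruned: Map[Int, Seq[String]]): Seq[Seq[String]] =
    Seq.tabulate(numBuckets)(bucketId => pruned.getOrElse(bucketId, Seq.empty))
}
```

For a 4-bucket table where pruning kept only bucket 2, toPartitions(4, Map(2 -> Seq("part-00002-c_00002.parquet"))) returns four entries, of which only index 2 is non-empty. The position in the result is the bucket id, which is exactly the invariant the bucketed scan needs.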

3.7 Creating the FileScanRDD

new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions)

Finally, filePartitions is used to create the FileScanRDD that actually performs the scan.

When the FileScanRDD runs, each Spark task usually maps to one FilePartition.

The execution can be pictured as:

FileScanRDD partition
  -> FilePartition
  -> Array[PartitionedFile]
  -> call readFile on each PartitionedFile
  -> Iterator[InternalRow]

That is:

FilePartition decides which file slices a task reads.
PartitionedFile describes one concrete file slice.
readFile turns a file slice into InternalRows.
FileScanRDD wires this logic into a Spark RDD.

3.8 Example

Suppose a bucketed table has 4 buckets:

CREATE TABLE orders (
  order_id BIGINT,
  user_id BIGINT,
  amount DOUBLE
)
USING parquet
CLUSTERED BY (user_id) INTO 4 BUCKETS;

The files are:

part-00000-a_00000.c000.snappy.parquet
part-00001-b_00001.c000.snappy.parquet
part-00002-c_00002.c000.snappy.parquet
part-00003-d_00003.c000.snappy.parquet

After parsing the bucket id (the number after the underscore):

part-00000-a_00000.c000.snappy.parquet -> bucket 0
part-00001-b_00001.c000.snappy.parquet -> bucket 1
part-00002-c_00002.c000.snappy.parquet -> bucket 2
part-00003-d_00003.c000.snappy.parquet -> bucket 3

Without bucket pruning and without bucket coalescing, the scan produces:

FilePartition 0 -> files of bucket 0
FilePartition 1 -> files of bucket 1
FilePartition 2 -> files of bucket 2
FilePartition 3 -> files of bucket 3

If the query condition only hits bucket 2:

WHERE user_id = 100

bucket pruning may keep only:

bucket 2 -> [part-00002-c_00002.c000.snappy.parquet]

The scan still produces 4 FilePartitions, but only the one for bucket 2 contains a file:

FilePartition 0 -> []
FilePartition 1 -> []
FilePartition 2 -> [part-00002-c_00002.c000.snappy.parquet]
FilePartition 3 -> []

If bucket coalescing is in effect, for example from 4 buckets down to 2:

bucket 0 and bucket 2 are merged into coalesced bucket 0
bucket 1 and bucket 3 are merged into coalesced bucket 1

the scan produces:

FilePartition 0 -> files of bucket 0 + bucket 2
FilePartition 1 -> files of bucket 1 + bucket 3

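3.9 Key takeaways

Without bucket coalescing (it is off by default), a bucketed scan produces exactly bucketSpec.numBuckets partitions, one per bucket.

In essence, createBucketedReadRDD:

organizes files into FileScanRDD partitions by bucket id rather than by file size.

Its main goal is to preserve the physical distribution of the bucketed table so that Spark can use bucket properties to optimize the rest of the plan.

In one sentence:

createBucketedReadRDD groups the files of a bucketed table by the bucket id in their file names and turns each group into a FilePartition of the FileScanRDD.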

