Spark: partitions

Listing leaf files and directories

Analyzing how it is parallelized

org.apache.spark.util.HadoopFSUtils

      sc.parallelize(paths, numParallelism)
        .mapPartitions { pathsEachPartition =>
          val hadoopConf = serializableConfiguration.value
          pathsEachPartition.map { path =>
            val leafFiles = listLeafFiles(
              path = path,
              hadoopConf = hadoopConf,
              filter = filter,
              contextOpt = None, // Can't execute parallel scans on workers
              ignoreMissingFiles = ignoreMissingFiles,
              ignoreLocality = ignoreLocality,
              isRootPath = isRootLevel,
              parallelismThreshold = Int.MaxValue,
              parallelismMax = 0)
            (path, leafFiles)
          }
        }.collect()

    // Set the number of parallelism to prevent following file listing from generating many tasks
    // in case of large #defaultParallelism.
    val numParallelism = Math.min(paths.size, parallelismMax)
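To make the effect of this cap concrete, here is a minimal sketch in plain Scala with no Spark dependency. The helper names `numListingTasks` and `pathsPerTask` are hypothetical and exist only for illustration; only the `min` formula comes from the excerpt above, and the even-split estimate is an assumption about how `parallelize` slices the path list.

```scala
// Hypothetical helper mirroring the formula in the excerpt; not part of Spark's API.
def numListingTasks(numPaths: Int, parallelismMax: Int): Int =
  math.min(numPaths, parallelismMax)

// Rough upper bound on paths handled by one task, assuming an even split (an assumption).
def pathsPerTask(numPaths: Int, parallelismMax: Int): Int = {
  val tasks = numListingTasks(numPaths, parallelismMax)
  if (tasks == 0) 0 else math.ceil(numPaths.toDouble / tasks).toInt
}
```

With fewer paths than the cap, each path gets its own task; with more, the task count stops at `parallelismMax` and each task lists several paths in sequence.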

parallelismMax is ultimately determined by the following configuration.

  val PARALLEL_PARTITION_DISCOVERY_PARALLELISM =
    buildConf("spark.sql.sources.parallelPartitionDiscovery.parallelism")
      .doc("The number of parallelism to list a collection of path recursively, Set the " +
        "number to prevent file listing from generating too many tasks.")
      .version("2.1.1")
      .internal()
      .intConf
      .createWithDefault(10000)
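As a rough sketch of how this setting might be lowered for a job that lists many paths, the snippet below sets it when building a local `SparkSession`. The value `100`, the app name, and the `local[*]` master are arbitrary choices for illustration, not recommendations. Since the option is marked internal, its behavior may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object ListingParallelismExample {
  // Config key taken from the definition above.
  val Key = "spark.sql.sources.parallelPartitionDiscovery.parallelism"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("listing-parallelism-example") // illustrative name
      .master("local[*]")                     // local mode, for experimentation only
      .config(Key, "100")                     // arbitrary example value
      .getOrCreate()

    // Read the value back to confirm what the session is using.
    println(spark.conf.get(Key))

    spark.stop()
  }
}
```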