Spark 之 partitons

Listing leaf files and directories

分析其并行化

org.apache.spark.util.HadoopFSUtils

复制代码
      sc.parallelize(paths, numParallelism)
        .mapPartitions { pathsEachPartition =>
          val hadoopConf = serializableConfiguration.value
          pathsEachPartition.map { path =>
            val leafFiles = listLeafFiles(
              path = path,
              hadoopConf = hadoopConf,
              filter = filter,
              contextOpt = None, // Can't execute parallel scans on workers
              ignoreMissingFiles = ignoreMissingFiles,
              ignoreLocality = ignoreLocality,
              isRootPath = isRootLevel,
              parallelismThreshold = Int.MaxValue,
              parallelismMax = 0)
            (path, leafFiles)
          }
        }.collect()

    // Set the number of parallelism to prevent following file listing from generating many tasks
    // in case of large #defaultParallelism.
    val numParallelism = Math.min(paths.size, parallelismMax)

parallelismMax 最终由以下配置决定。

复制代码
  val PARALLEL_PARTITION_DISCOVERY_PARALLELISM =
    buildConf("spark.sql.sources.parallelPartitionDiscovery.parallelism")
      .doc("The number of parallelism to list a collection of path recursively, Set the " +
        "number to prevent file listing from generating too many tasks.")
      .version("2.1.1")
      .internal()
      .intConf
      .createWithDefault(10000)
相关推荐
paperxie_xiexuo1 小时前
如何用自然语言生成科研图表?深度体验PaperXie AI科研绘图模块在流程图、机制图与结构图场景下的实际应用效果
大数据·人工智能·流程图·大学生
Mr_sun.1 小时前
Day07——RabbitMQ-高级
分布式·rabbitmq
Qiuner3 小时前
Spring Boot 配置文件高级实战指南 热更新/动态配置/安全加密/分布式同步/环境变量注入
spring boot·分布式·安全
旗讯数字3 小时前
旗讯 OCR 技术解析:金融行业手写表格识别方案与系统集成实践
大数据·金融·ocr
无心水3 小时前
【分布式利器:事务】4、SAGA模式:长事务的最佳选择?
分布式·seata·分布式事务·saga模式·tcc·分布式利器·长事务
lang201509285 小时前
Kafka延迟操作机制深度解析
分布式·python·kafka
zl97989910 小时前
RabbitMQ-下载安装与Web页面
linux·分布式·rabbitmq
2501_9414043113 小时前
绿色科技与可持续发展:科技如何推动环境保护与资源管理
大数据·人工智能