Spark 之 partitons

Listing leaf files and directories

分析其并行化

org.apache.spark.util.HadoopFSUtils

复制代码
      sc.parallelize(paths, numParallelism)
        .mapPartitions { pathsEachPartition =>
          val hadoopConf = serializableConfiguration.value
          pathsEachPartition.map { path =>
            val leafFiles = listLeafFiles(
              path = path,
              hadoopConf = hadoopConf,
              filter = filter,
              contextOpt = None, // Can't execute parallel scans on workers
              ignoreMissingFiles = ignoreMissingFiles,
              ignoreLocality = ignoreLocality,
              isRootPath = isRootLevel,
              parallelismThreshold = Int.MaxValue,
              parallelismMax = 0)
            (path, leafFiles)
          }
        }.collect()

    // Set the number of parallelism to prevent following file listing from generating many tasks
    // in case of large #defaultParallelism.
    val numParallelism = Math.min(paths.size, parallelismMax)

parallelismMax 最终由以下配置决定。

复制代码
  val PARALLEL_PARTITION_DISCOVERY_PARALLELISM =
    buildConf("spark.sql.sources.parallelPartitionDiscovery.parallelism")
      .doc("The number of parallelism to list a collection of path recursively, Set the " +
        "number to prevent file listing from generating too many tasks.")
      .version("2.1.1")
      .internal()
      .intConf
      .createWithDefault(10000)
相关推荐
TDengine (老段)22 分钟前
TDengine 时间函数 WEEK 用户手册
大数据·数据库·物联网·时序数据库·iot·tdengine·涛思数据
还是鼠鼠1 小时前
Redisson实现的分布式锁能解决主从一致性的问题吗?
java·数据库·redis·分布式·缓存·面试·redisson
G***E3161 小时前
区块链在能源中的分布式交易
分布式·区块链·能源
xieyan08111 小时前
选股中的财务指标运用_ROE_PE_PB...
大数据·人工智能
颜子鱼3 小时前
git基础
大数据·git·elasticsearch
BD_Marathon3 小时前
【Zookeeper】 Zookeeper入门
分布式·zookeeper·云原生
乌恩大侠5 小时前
AI-RAN 在 Spark上部署 Sionna-RK
大数据·分布式·spark
csdn_aspnet6 小时前
【探索实战】Kurator入门体验与分布式云原生环境搭建
分布式·云原生·kurator
G皮T7 小时前
【ELasticsearch】索引字段设置 “index”: false 的作用
大数据·elasticsearch·搜索引擎·全文检索·索引·index·检索
q***69777 小时前
集成RabbitMQ+MQ常用操作
分布式·rabbitmq