【Spark精讲】RDD缓存源码分析

面试题:cache后面能不能接其他算子,它是不是action操作?

能,不是action算子。

源码解析

RDD调用cache或persist之后,会指定RDD的缓存级别,但只是在成员变量中记录了RDD的存储级别,并未真正地对RDD进行缓存。只有当RDD计算的时候才会对RDD进行缓存。

以HadoopRDD为例

Scala 复制代码
    override def compute(split: Partition, context: TaskContext): Iterator[U] = {
      val partition = split.asInstanceOf[HadoopPartition]
      val inputSplit = partition.inputSplit.value
      f(inputSplit, firstParent[T].iterator(split, context))
    }

调用的iterator方法

Scala 复制代码
  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }

继续看 getOrCompute方法:这里可以看到blockId的生成规则,可以确定block和partition是一一对应的

Scala 复制代码
@DeveloperApi
case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
  override def name: String = "rdd_" + rddId + "_" + splitIndex
}

在executor端调用SparkEnv.get.blockManager.getOrElseUpdate()方法,

Scala 复制代码
  /**
   * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
   */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      case Left(blockResult) =>
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().inputMetrics
          existingMetrics.incBytesRead(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }

再看BlockManager中的getOrElseUpdate方法,用来缓存数据的

Scala 复制代码
  /**
   * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
   * to compute the block, persist it, and return its values.
   *
   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
   *         could not be cached.
   */
  def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // Attempt to read the block from local or remote storage. If it's present, then we don't need
    // to go through the local-get-or-put path.
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block)
      case _ =>
        // Need to compute the block.
    }
    // Initially we hold no locks on this block.
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      case None =>
        // doPut() didn't hand work back to us, so the block already existed or was successfully
        // stored. Therefore, we now hold a read lock on the block.
        val blockResult = getLocalValues(blockId).getOrElse {
          // Since we held a read lock between the doPut() and get() calls, the block should not
          // have been evicted, so get() not returning the block indicates some internal error.
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        // We already hold a read lock on the block from the doPut() call and getLocalValues()
        // acquires the lock again, so we need to call releaseLock() here so that the net number
        // of lock acquisitions is 1 (since the caller will only call release() once).
        releaseLock(blockId)
        Left(blockResult)
      case Some(iter) =>
        // The put failed, likely because the data was too large to fit in memory and could not be
        // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
        // that they can decide what to do with the values (e.g. process them without caching).
       Right(iter)
    }
  }
相关推荐
G皮T10 分钟前
【Elasticsearch】查询性能调优(七):为什么计数对性能影响如此之大?
大数据·elasticsearch·搜索引擎·全文检索·索引·查询·opensearch
立控信息LKONE24 分钟前
库室管控核心产品-仓库安防设施建设
大数据·安全
福客AI智能客服43 分钟前
AI智能客服系统:增值服务行业的售后核心解决方案
大数据·人工智能
thubier(段新建)43 分钟前
2025技术实践复盘:在沉淀中打磨,在融合中锚定AI协同新方向
大数据·人工智能
Hello.Reader1 小时前
Flink Catalogs 元数据统一入口、JDBC/Hive/自定义 Catalog、Time Travel、Catalog Store 与监听器
大数据·hive·flink
Hello.Reader1 小时前
Flink Modules 把自定义函数“伪装成内置函数”,以及 Core/Hive/自定义模块的加载与解析顺序
大数据·hive·flink
档案宝档案管理1 小时前
一键对接OA/ERP/企业微信|档案宝实现业务与档案一体化管理
大数据·数据库·人工智能·档案·档案管理
SmartBrain1 小时前
解读:《华为变革法:打造可持续进步的组织》
大数据·人工智能·华为·语言模型
是阿威啊1 小时前
【用户行为归因分析项目】- 【企业级项目开发第一站】项目架构和需求设计
大数据·hive·hadoop·架构·spark·scala
qq_12498707531 小时前
基于spark的西南天气数据的分析与应用(源码+论文+部署+安装)
大数据·分布式·爬虫·python·spark·毕业设计·数据可视化