How Spark Canonicalizes Query Plans

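Background

This article is based on Spark 4.0.

While building a query plan, Spark performs a top-down replacement pass over the logical plan: if a subtree and an already cached/persisted query satisfy sameResult, the subtree is replaced with the corresponding InMemoryRelation, so that later optimization and physical planning read from the in-memory cache instead of recomputing the result.

This pass relies on canonicalizing the logical plan, whose purpose is to decide whether two plans are semantically equal. This article analyzes how that canonicalization is implemented. The approach is useful not only for Spark; other engines can borrow from it as well.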

Analysis

Spark canonicalizes in two layers: a user-defined layer driven by custom rules, and a layer implemented inside Spark.

User-defined rules are injected through SparkSessionExtensions.injectPlanNormalizationRule:


    /**
     * Inject a plan normalization `Rule` builder into the [[SparkSession]]. The injected rules will
     * be executed just before query caching decisions are made. Such rules can be used to improve the
     * cache hit rate by normalizing different plans to the same form. These rules should never modify
     * the result of the LogicalPlan.
     */
    def injectPlanNormalizationRule(builder: RuleBuilder): Unit = {
      planNormalizationRules += builder
    }

These rules are applied by QueryExecution.normalize, which normalizes the plan before the cache lookup takes place.
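As an illustration of the injection point, here is a minimal sketch of an extension that registers one normalization rule. The names DropTrivialFilter and CacheNormalizationExtensions are hypothetical, the rule itself is only an example of a rewrite that does not change the result, and the code assumes the Spark SQL and Catalyst jars are on the classpath.

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Hypothetical rule: removes filters whose condition is the literal `true`,
// so that "WHERE true" queries normalize to the same plan as unfiltered ones.
// It never changes the result of the plan.
object DropTrivialFilter extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case Filter(Literal(true, BooleanType), child) => child
  }
}

// Hypothetical extension entry point. The RuleBuilder receives the session;
// this sketch ignores it and always returns the same rule.
class CacheNormalizationExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectPlanNormalizationRule(_ => DropTrivialFilter)
  }
}
```

One way to activate such a class is to set the spark.sql.extensions configuration to its fully qualified class name when the session is created.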

The Spark-internal layer starts mainly from the following code in QueryExecution:


    private val lazyWithCachedData = LazyTry {
      sparkSession.withActive {
        assertAnalyzed()
        assertSupported()
        // clone the plan to avoid sharing the plan instance between different stages like analyzing,
        // optimizing and planning.
        sparkSession.sharedState.cacheManager.useCachedData(normalized.clone())
      }
    }
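To make the top-down replacement concrete, the following is a simplified, Spark-free sketch. Every type in it (Plan, Scan, Filter, Project, CachedRelation) is a toy stand-in invented for this illustration, and plain structural equality takes the place of sameResult; it is not how CacheManager is actually written.

```scala
object CacheLookupSketch {
  // Toy plan model; none of these types are Spark's real classes.
  sealed trait Plan {
    def children: Seq[Plan]
    def withChildren(newChildren: Seq[Plan]): Plan

    // Top-down rewrite: try the rule on this node first, then recurse
    // into the children of whatever node the rule produced.
    def transformDown(rule: PartialFunction[Plan, Plan]): Plan = {
      val afterRule = rule.applyOrElse(this, (p: Plan) => p)
      afterRule.withChildren(afterRule.children.map(_.transformDown(rule)))
    }
  }

  case class Scan(table: String) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }

  case class Filter(condition: String, child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }

  case class Project(columns: Seq[String], child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }

  // Stand-in for InMemoryRelation: a leaf that points at a cache entry.
  case class CachedRelation(cacheId: Int) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }

  // Replaces every fragment that matches a cached plan. Because the walk is
  // top-down, the largest matching fragment wins and its subtree is not visited.
  def useCachedData(plan: Plan, cache: Map[Plan, Int]): Plan =
    plan.transformDown {
      case fragment if cache.contains(fragment) => CachedRelation(cache(fragment))
    }
}
```

With Filter("id > 3", Scan("t")) registered in the cache, a query Project(Seq("id"), Filter("id > 3", Scan("t"))) is rewritten so that only the Project remains above a CachedRelation leaf.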

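useCachedData is where plan canonicalization comes into play, because matching a fragment against the cache compares canonicalized plans: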


    protected def doCanonicalize(): PlanType = {
      val canonicalizedChildren = children.map(_.canonicalized)
      var id = -1
      val allAttributesSeq = this.allAttributes
      mapExpressions {
        case a: Alias =>
          id += 1
          // As the root of the expression, Alias will always take an arbitrary exprId, we need to
          // normalize that for equality testing, by assigning expr id from 0 incrementally. The
          // alias name doesn't matter and should be erased.
          val normalizedChild = QueryPlan.normalizeExpressions(a.child, allAttributesSeq)
          Alias(normalizedChild, "")(ExprId(id), a.qualifier)

        case ar: AttributeReference if allAttributesSeq.indexOf(ar.exprId) == -1 =>
          // Top level `AttributeReference` may also be used for output like `Alias`, we should
          // normalize the exprId too.
          id += 1
          ar.withExprId(ExprId(id)).canonicalized

        case other => QueryPlan.normalizeExpressions(other, allAttributesSeq)
      }.withNewChildren(canonicalizedChildren)
    }

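First, the child nodes are canonicalized.
Next, each Alias is normalized:
1. The Alias name is unified to "".
2. The Alias ExprId becomes an index that increments from 0 within the plan node, instead of the JVM-wide auto-incremented ID.
Then column references are normalized:
1. A column reference whose exprId is not found in this node's allAttributes gets the incrementing ID as its ExprId, and its name is set to "none".
2. A column reference whose exprId is found in allAttributes gets its ordinal in the AttributeSeq as its ExprId (its name is erased to "none" as well).
3. For semantically commutative expressions (such as Add), the sub-expressions are canonicalized, sorted by hashCode, and only then compared.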
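The steps above can be sketched with a small, Spark-free model. The types Attr, Lit, Add and Alias, and the functions normalize and canonicalize, are simplified stand-ins made up for this illustration (they only cover the expression list of a single node), and the toString tie-break in the sort is an addition of this sketch, not something taken from Spark.

```scala
object CanonicalizeSketch {
  // Toy expression model; not Spark's Catalyst classes.
  sealed trait Expr
  case class Attr(name: String, exprId: Long) extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Alias(child: Expr, name: String, exprId: Long) extends Expr

  // Rewrites references that come from the child output to their ordinal,
  // erases attribute names, and puts the operands of Add into a stable order.
  def normalize(e: Expr, input: Seq[Long]): Expr = e match {
    case Attr(_, exprId) =>
      val ordinal = input.indexOf(exprId)
      Attr("none", if (ordinal == -1) exprId else ordinal.toLong)
    case Add(l, r) =>
      // Sort by hashCode; toString only breaks ties so the order is deterministic.
      val operands = Seq(normalize(l, input), normalize(r, input))
        .sortBy(x => (x.hashCode, x.toString))
      Add(operands(0), operands(1))
    case Alias(child, name, exprId) => Alias(normalize(child, input), name, exprId)
    case lit: Lit => lit
  }

  // Canonicalizes the expression list of one node. `input` holds the exprIds
  // of the attributes produced by the node's children.
  def canonicalize(expressions: Seq[Expr], input: Seq[Long]): Seq[Expr] = {
    var id = -1L
    expressions.map {
      case Alias(child, _, _) =>
        id += 1 // top-level Alias: erase the name, number from 0
        Alias(normalize(child, input), "", id)
      case Attr(_, exprId) if input.indexOf(exprId) == -1 =>
        id += 1 // top-level reference that is not part of the child output
        Attr("none", id)
      case other => normalize(other, input)
    }
  }
}
```

In this model, the lists [id#100, (id#100 + 1) AS x#101] and [id#200, (1 + id#200) AS y#205] canonicalize to the same value, while swapping the two entries of either list produces a different one.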

An example:


  -- Session A
  SELECT id, id + 1 AS x FROM range(10) WHERE id > 3

  -- Session B (same semantics, but the Analyzer assigns different exprIds)
  SELECT id, id + 1 AS x FROM range(10) WHERE id > 3

Simplified logical plans (only the part above the Filter is shown):


  Plan A                          Plan B
  Project                         Project
    Alias(id+1, "x") exprId=101      Alias(id+1, "x") exprId=205
    AttributeReference("id") 100     AttributeReference("id") 200
  ...                              ...

After canonicalization they become:

  Plan A                             Plan B
  Project                            Project
    Alias(none#0 + 1, "") exprId=0     Alias(none#0 + 1, "") exprId=0
    AttributeReference("none") 0       AttributeReference("none") 0
  ...                                ...

This processing also shows that the order of expressions matters: for example, sameResult is false for select a, b versus select b, a.

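A logical plan for which sameResult is true is replaced with an InMemoryRelation, which is backed by the RDD already cached on the executors, so the computation reads its data directly from the cached RDD instead of recomputing it.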