[Spark] Metrics Collection Flow

This post looks at how the metrics of a Spark job are collected and how they are stored.

Task

The Task class has a serializedTaskMetrics field, which carries the task's metrics as a serialized TaskMetrics instance.

```scala
private[spark] abstract class Task[T](
    val stageId: Int,
    val stageAttemptId: Int,
    val partitionId: Int,
    @transient var localProperties: Properties = new Properties,
    // The default value is only used in tests.
    serializedTaskMetrics: Array[Byte] =
      SparkEnv.get.closureSerializer.newInstance().serialize(TaskMetrics.registered).array(),
    val jobId: Option[Int] = None,
    val appId: Option[String] = None,
    val appAttemptId: Option[String] = None,
    val isBarrier: Boolean = false) extends Serializable {
```
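
For context, TaskMetrics.registered builds an empty TaskMetrics and registers its internal accumulators with the AccumulatorContext, so that updates coming back from executors can be matched by accumulator id. A paraphrased sketch (not verbatim source; details may differ between Spark versions):

```scala
// Paraphrased from TaskMetrics.scala (sketch, not verbatim):
def registered: TaskMetrics = {
  val tm = empty                                            // fresh TaskMetrics, all accumulators at zero
  tm.internalAccums.foreach(AccumulatorContext.register)    // driver-side id -> accumulator mapping
  tm
}
```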

TaskMetrics

Each metric here is essentially a LongAccumulator, which extends AccumulatorV2. The values accumulate inside the executor and are reported to the driver via heartbeats and events: heartbeats carry the current accumulator values, which the driver merges incrementally, and when the task finishes the final values are reported along with statusUpdate.

```scala
class LongAccumulator extends AccumulatorV2[jl.Long, jl.Long] {
  private var _sum = 0L
  private var _count = 0L
  ......
```
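
As a minimal illustration of those merge semantics (a standalone sketch, not from the Spark source): each copy accumulates locally with add, and the driver folds partial copies together with merge, which is what happens with heartbeat deltas and the final statusUpdate values.

```scala
import org.apache.spark.util.LongAccumulator

// Two "executor-side" partial copies of the same metric (illustrative values).
val partial1 = new LongAccumulator
val partial2 = new LongAccumulator
partial1.add(120L)   // e.g. an executorRunTime contribution
partial2.add(230L)

// Driver-side aggregation: merge folds the partial values into one accumulator.
val onDriver = new LongAccumulator
onDriver.merge(partial1)
onDriver.merge(partial2)
assert(onDriver.sum == 350L && onDriver.count == 2)
```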

Basic metrics

```scala
  private val _executorDeserializeTime = new LongAccumulator
  private val _taskWaitResourceTime = new LongAccumulator
  private val _taskDeployDelay = new LongAccumulator
  private val _executorDeserializeBytes = new LongAccumulator
  private val _taskInitTime = new LongAccumulator
  private val _executorDeserializeCpuTime = new LongAccumulator
  private val _executorRunTime = new LongAccumulator
  private val _executorRunTimeNanos = new LongAccumulator
  private val _executorCpuTime = new LongAccumulator
  private val _resultSize = new LongAccumulator
  private val _jvmGCTime = new LongAccumulator
  private val _memoryHeapUsed = new LongAccumulator
  private val _memoryOffHeapUsed = new LongAccumulator
  private val _resultSerializationTime = new LongAccumulator
  private val _taskResultSize = new LongAccumulator
  private val _memoryBytesSpilled = new LongAccumulator
  private val _diskBytesSpilled = new LongAccumulator
  private val _peakExecutionMemory = new LongAccumulator
  private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]
  private val _taskStartTime = new LongAccumulator
  private val _usedCores = new LongAccumulator
```

LinkedHashMap

TaskMetrics wraps these accumulators in a LinkedHashMap keyed by the metric name, which makes it easy to ship them to the driver.
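
A simplified, hypothetical version of that map (the real one is TaskMetrics.nameToAccums, keyed by the internal.metrics.* names; the entries here are only illustrative):

```scala
import scala.collection.mutable.LinkedHashMap
import org.apache.spark.util.LongAccumulator

// Hypothetical, cut-down equivalent of TaskMetrics.nameToAccums.
// LinkedHashMap preserves insertion order, so metrics are always
// reported to the driver in a stable, well-known order.
val executorRunTime = new LongAccumulator
val jvmGCTime = new LongAccumulator

val nameToAccums: LinkedHashMap[String, LongAccumulator] = LinkedHashMap(
  "internal.metrics.executorRunTime" -> executorRunTime,
  "internal.metrics.jvmGCTime" -> jvmGCTime
)

nameToAccums("internal.metrics.executorRunTime").add(1234L)
```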

Executor

Focus on the run method of its inner class TaskRunner; it splits into two phases: while the task is running and when it finishes.

While the task is running

A heartbeat thread periodically collects the accumulators of each running task and puts the (taskId, accumulators) pairs into a Heartbeat message.
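
For reference, the Heartbeat message in open-source Spark 3.x looks roughly like this (paraphrased from memory of HeartbeatReceiver.scala, with the private[spark] modifier dropped so the sketch stands alone; fields may differ in other versions):

```scala
import org.apache.spark.executor.ExecutorMetrics
import org.apache.spark.storage.BlockManagerId
import org.apache.spark.util.AccumulatorV2

// Sketch of the heartbeat payload sent from each executor to the driver.
case class Heartbeat(
    executorId: String,
    // taskId -> latest accumulator values of that running task
    accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
    blockManagerId: BlockManagerId,
    // (stageId, stageAttemptId) -> peak executor-level metrics
    executorUpdates: Map[(Int, Int), ExecutorMetrics])
```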

On the driver side the message travels through HeartbeatReceiver --> TaskSchedulerImpl --> DAGScheduler.executorHeartbeatReceived.

DAGScheduler then posts a SparkListenerExecutorMetricsUpdate event onto the listener bus:

```scala
listenerBus.post(SparkListenerExecutorMetricsUpdate(execId, accumUpdates,
  executorUpdates))
```

When AppStatusListener.onExecutorMetricsUpdate receives this event, it incrementally updates the task, stage, and executor metrics:

```scala
  override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate): Unit = {
    val now = System.nanoTime()

    event.accumUpdates.foreach { case (taskId, sid, sAttempt, accumUpdates) =>
      liveTasks.get(taskId).foreach { task =>
        val metrics = TaskMetrics.fromAccumulatorInfos(accumUpdates)
        val delta = task.updateMetrics(metrics)
        maybeUpdate(task, now)

        Option(liveStages.get((sid, sAttempt))).foreach { stage =>
          stage.metrics = LiveEntityHelpers.addMetrics(stage.metrics, delta)
          maybeUpdate(stage, now)

          val esummary = stage.executorSummary(event.execId)
          esummary.metrics = LiveEntityHelpers.addMetrics(esummary.metrics, delta)
          maybeUpdate(esummary, now)
        }
      }
    }

    // check if there is a new peak value for any of the executor level memory metrics
    // for the live UI. SparkListenerExecutorMetricsUpdate events are only processed
    // for the live UI.
    event.executorUpdates.foreach { case (key, peakUpdates) =>
      liveExecutors.get(event.execId).foreach { exec =>
        if (exec.peakExecutorMetrics.compareAndUpdatePeakValues(peakUpdates)) {
          update(exec, now)
        }
      }

      // Update stage level peak executor metrics.
      updateStageLevelPeakExecutorMetrics(key._1, key._2, event.execId, peakUpdates, now)
    }

    // Flush updates if necessary. Executor heartbeat is an event that happens periodically. Flush
    // here to ensure the staleness of Spark UI doesn't last more than
    // `max(heartbeat interval, liveUpdateMinFlushPeriod)`.
    if (now - lastFlushTimeNs > liveUpdateMinFlushPeriod) {
      flush(maybeUpdate(_, now))
      // Re-get the current system time because `flush` may be slow and `now` is stale.
      lastFlushTimeNs = System.nanoTime()
    }
  }
```

When the task finishes

Merge the metrics. For example, the temporary per-dependency shuffle-read metrics are folded into the final shuffleReadMetrics:

```scala
private[spark] def mergeShuffleReadMetrics(): Unit = synchronized {
  if (tempShuffleReadMetrics.nonEmpty) {
    shuffleReadMetrics.setMergeValues(tempShuffleReadMetrics.toSeq)
  }
}
```

The result and the accumulators are then sent to the driver via execBackend.statusUpdate(taskId, TaskState.FINISHED/FAILED/KILLED, serializedResultOrReason); the path is executor --> ExecutorBackend.statusUpdate --> CoarseGrainedExecutorBackend.statusUpdate --> driverRef.send.
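
For context, the executor packages the task value, the final accumulator values, and the metric peaks into a DirectTaskResult before calling statusUpdate. A simplified sketch of that packaging in TaskRunner.run (paraphrased, not verbatim; variable names such as valueByteBuffer and resultSer are illustrative, and the size limits / IndirectTaskResult fallback are omitted):

```scala
// Paraphrased sketch of TaskRunner.run (executor side): the value, the
// accumulators and the metric peaks travel together in one TaskResult.
val accumUpdates = task.collectAccumulatorUpdates()   // final accumulator values
val directResult = new DirectTaskResult(valueByteBuffer, accumUpdates, metricPeaks)
val serializedResult = resultSer.serialize(directResult)

execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)
```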

On the driver side, TaskSchedulerImpl.statusUpdate handles it first:

```scala
if (TaskState.isFinished(state)) {
  cleanupTaskState(tid)
  taskSet.removeRunningTask(tid)
  if (state == TaskState.FINISHED) {
    taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
  } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
    taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
  }
}
```

Inside taskResultGetter.enqueueSuccessfulTask, it is TaskResultGetter that deserializes the data into the directResult, then calls back into handleSuccessfulTask, which hands off to the DAGScheduler, and a SparkListenerTaskEnd event is eventually posted. As before, AppStatusListener is what ultimately consumes it.

```scala
def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager,
    tid: Long,
    serializedData: ByteBuffer): Unit = {
  getTaskResultExecutor.execute(new Runnable {
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {

        ......

        scheduler.handleSuccessfulTask(taskSetManager, tid, result)
```

In handleSuccessfulTask, the result is handed to the DAGScheduler via taskEnded:

```scala
sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates,
  result.metricPeaks, info)
```

DAGScheduler posts the task-completion message to its eventProcessLoop and lets the loop process it:

```scala
eventProcessLoop.post(
  CompletionEvent(task, reason, result, accumUpdates, metricPeaks, taskInfo))

// In the event loop:
case completion: CompletionEvent =>
  dagScheduler.handleTaskCompletion(completion)
```

handleTaskCompletion carries on the processing and eventually calls postTaskEnd:

```scala
private[scheduler] def handleTaskCompletion(event: CompletionEvent): Unit = {
  val task = event.task
  val stageId = task.stageId

  outputCommitCoordinator.taskCompleted(
    stageId,
    task.stageAttemptId,
    task.partitionId,
    event.taskInfo.attemptNumber, // this is a task attempt number
    event.reason)

  if (!stageIdToStage.contains(task.stageId)) {
    // The stage may have already finished when we get this event -- e.g. maybe it was a
    // speculative task. It is important that we send the TaskEnd event in any case, so listeners
    // are properly notified and can chose to handle it. For instance, some listeners are
    // doing their own accounting and if they don't get the task end event they think
    // tasks are still running when they really aren't.
    postTaskEnd(event)
    ......
```

postTaskEnd mainly aggregates the metrics and then posts the SparkListenerTaskEnd event:

```scala
private def postTaskEnd(event: CompletionEvent): Unit = {
  val taskMetrics: TaskMetrics =
    if (event.accumUpdates.nonEmpty) {
      try {
        TaskMetrics.fromAccumulators(event.accumUpdates)
      } catch {
        case NonFatal(e) =>
          val taskId = event.taskInfo.taskId
          logError(s"Error when attempting to reconstruct metrics for task $taskId", e)
          null
      }
    } else {
      null
    }

  listenerBus.post(SparkListenerTaskEnd(event.task.stageId, event.task.stageAttemptId,
    Utils.getFormattedClassName(event.task), event.reason, event.taskInfo,
    new ExecutorMetrics(event.metricPeaks), taskMetrics))
}
```

The per-task metrics are reassembled from the accumulators by fromAccumulators:

```scala
def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
  val tm = new TaskMetrics
  for (acc <- accums) {
    val name = acc.name
    if (name.isDefined && tm.nameToAccums.contains(name.get)) {
      val tmAcc = tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, Any]]
      tmAcc.metadata = acc.metadata
      tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]])
    } else {
      tm._externalAccums.add(acc)
    }
  }
  tm
}
```

AppStatusListener

onTaskEnd

This method does the final bookkeeping. After the driver receives the terminal event, it rolls the metrics up from task to stage to executor, persists the updates, maintains the peak-value records, and performs some cleanup.
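
Schematically, the metrics part of onTaskEnd looks like the following (a rough sketch following the pattern of the onExecutorMetricsUpdate snippet above, not the actual source; the real method also handles errors, speculation, peak executor metrics, and cleanup):

```scala
// Rough schematic of the metrics roll-up in onTaskEnd (not verbatim Spark code).
liveTasks.remove(event.taskInfo.taskId).foreach { task =>
  val delta = task.updateMetrics(event.taskMetrics)   // final values replace earlier increments
  update(task, now, last = true)                      // terminal state: write unconditionally

  Option(liveStages.get((event.stageId, event.stageAttemptId))).foreach { stage =>
    stage.metrics = LiveEntityHelpers.addMetrics(stage.metrics, delta)
    maybeUpdate(stage, now)

    val esummary = stage.executorSummary(event.taskInfo.executorId)
    esummary.metrics = LiveEntityHelpers.addMetrics(esummary.metrics, delta)
    maybeUpdate(esummary, now)
  }
}
```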

Persistence generally goes into the AppStatusStore, which sits on top of the KVStore abstraction; writes go through update or maybeUpdate.

The two differ: update is used for terminal states or important one-off changes, while maybeUpdate is for high-frequency changes such as heartbeats and is rate-limited.

```scala
  private def update(entity: LiveEntity, now: Long, last: Boolean = false): Unit = {
    entity.write(kvstore, now, checkTriggers = last)
  }
  
  /** Update a live entity only if it hasn't been updated in the last configured period. */
  private def maybeUpdate(entity: LiveEntity, now: Long): Unit = {
    if (live && liveUpdatePeriodNs >= 0 && now - entity.lastWriteTime > liveUpdatePeriodNs) {
      update(entity, now)
    }
  }
```
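
The thresholds behind maybeUpdate and the flush in onExecutorMetricsUpdate come from configuration. The keys and defaults below are what I recall for Spark 3.x; treat them as assumptions and verify against your version:

```scala
import org.apache.spark.SparkConf

// Assumed Spark 3.x settings controlling live-UI write frequency (defaults shown):
val conf = new SparkConf()
  .set("spark.ui.liveUpdate.period", "100ms")       // min interval between maybeUpdate writes per entity
  .set("spark.ui.liveUpdate.minFlushPeriod", "1s")  // max staleness before a forced flush on heartbeat
```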