This note focuses on how the metrics of a Spark job are collected and how they are persisted.
Task
The Task class has a serializedTaskMetrics field, which carries the task's metrics in serialized form:
```scala
private[spark] abstract class Task[T](
    val stageId: Int,
    val stageAttemptId: Int,
    val partitionId: Int,
    @transient var localProperties: Properties = new Properties,
    // The default value is only used in tests.
    serializedTaskMetrics: Array[Byte] =
      SparkEnv.get.closureSerializer.newInstance().serialize(TaskMetrics.registered).array(),
    val jobId: Option[Int] = None,
    val appId: Option[String] = None,
    val appAttemptId: Option[String] = None,
    val isBarrier: Boolean = false) extends Serializable {
```
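As a rough, self-contained sketch of the pattern above (not Spark source): metrics are registered and serialized on the driver, shipped as bytes, and rebuilt on the executor side. Here a plain LongAccumulator stands in for TaskMetrics.registered and JavaSerializer stands in for the configured closure serializer.

```scala
// Sketch of the serialize-on-driver / deserialize-on-executor pattern behind
// serializedTaskMetrics (illustration only, not Spark source).
import java.nio.ByteBuffer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.JavaSerializer
import org.apache.spark.util.LongAccumulator

object SerializedMetricsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("metrics-sketch"))
    try {
      // "Driver side": an accumulator must be registered before it can be serialized,
      // which is why the default value above uses TaskMetrics.registered.
      val runTime = sc.longAccumulator("runTimeSketch")
      val ser = new JavaSerializer(sc.getConf).newInstance()
      val bytes: Array[Byte] = ser.serialize(runTime).array()

      // "Executor side": rebuild the accumulator from the shipped bytes and record into it.
      val revived = ser.deserialize[LongAccumulator](ByteBuffer.wrap(bytes))
      revived.add(42L)
      println(s"executor-local value = ${revived.value}")
    } finally {
      sc.stop()
    }
  }
}
```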
TaskMetrics
Each of the metrics here is backed by a LongAccumulator, which extends AccumulatorV2. The metrics are collected inside the executor and reported to the driver through heartbeats and events: on each heartbeat an incremental merge is performed, and when the task finishes the final values are reported along with statusUpdate.
```scala
class LongAccumulator extends AccumulatorV2[jl.Long, jl.Long] {
  private var _sum = 0L
  private var _count = 0L
```
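For reference, the merge operation that both the heartbeat path and the end-of-task path ultimately rely on can be tried directly against LongAccumulator's public API (illustration only, e.g. in spark-shell):

```scala
// merge folds another accumulator's sum and count into this one.
import org.apache.spark.util.LongAccumulator

val driverSide = new LongAccumulator      // the copy tracked on the driver
val fromExecutor = new LongAccumulator    // an update shipped from an executor
fromExecutor.add(120L)
fromExecutor.add(80L)

driverSide.merge(fromExecutor)            // fold sum/count into the driver-side copy
assert(driverSide.sum == 200L && driverSide.count == 2L)
println(s"avg = ${driverSide.avg}")       // 100.0
```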
Basic metrics
```scala
private val _executorDeserializeTime = new LongAccumulator
private val _taskWaitResourceTime = new LongAccumulator
private val _taskDeployDelay = new LongAccumulator
private val _executorDeserializeBytes = new LongAccumulator
private val _taskInitTime = new LongAccumulator
private val _executorDeserializeCpuTime = new LongAccumulator
private val _executorRunTime = new LongAccumulator
private val _executorRunTimeNanos = new LongAccumulator
private val _executorCpuTime = new LongAccumulator
private val _resultSize = new LongAccumulator
private val _jvmGCTime = new LongAccumulator
private val _memoryHeapUsed = new LongAccumulator
private val _memoryOffHeapUsed = new LongAccumulator
private val _resultSerializationTime = new LongAccumulator
private val _taskResultSize = new LongAccumulator
private val _memoryBytesSpilled = new LongAccumulator
private val _diskBytesSpilled = new LongAccumulator
private val _peakExecutionMemory = new LongAccumulator
private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]
private val _taskStartTime = new LongAccumulator
private val _usedCores = new LongAccumulator
```
LinkedHashMap
TaskMetrics also packs these accumulators into a LinkedHashMap keyed by metric name (nameToAccums), which makes it convenient to ship them to the driver as one batch.
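A minimal sketch of that idea (assumed structure for illustration, not the exact Spark code); the metric names below are only examples:

```scala
// Named accumulators kept in an insertion-ordered map, whose values are the batch
// that gets reported to the driver.
import scala.collection.mutable.LinkedHashMap
import org.apache.spark.util.{AccumulatorV2, LongAccumulator}

val nameToAccums = LinkedHashMap.empty[String, AccumulatorV2[_, _]]
nameToAccums += "internal.metrics.executorRunTime" -> new LongAccumulator
nameToAccums += "internal.metrics.jvmGCTime" -> new LongAccumulator

// The value set of this map is what travels to the driver.
val accums: Seq[AccumulatorV2[_, _]] = nameToAccums.values.toSeq
println(accums.size)
```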
Executor
Focus on the run method of its inner class TaskRunner; it splits into two parts: while the task is running, and when it finishes.
While running
On each heartbeat, the executor collects the set of accumulators and puts every task's (taskId, accumulators) pair into the Heartbeat message.
On the driver side the update flows through HeartbeatReceiver --> TaskSchedulerImpl --> DAGScheduler.executorHeartbeatReceived.
DAGScheduler then posts a SparkListenerExecutorMetricsUpdate event onto the listener bus:
```scala
listenerBus.post(SparkListenerExecutorMetricsUpdate(execId, accumUpdates,
  executorUpdates))
```
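Since the event goes onto the regular listener bus, any registered SparkListener can observe these per-task accumulator updates, not only AppStatusListener. A minimal example listener (illustration only):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

class MetricsUpdateLogger extends SparkListener {
  override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate): Unit = {
    // accumUpdates is a sequence of (taskId, stageId, stageAttemptId, accumulator infos)
    event.accumUpdates.foreach { case (taskId, stageId, stageAttemptId, updates) =>
      println(s"exec=${event.execId} task=$taskId stage=$stageId.$stageAttemptId " +
        s"accumUpdates=${updates.size}")
    }
  }
}
// Register it with: sc.addSparkListener(new MetricsUpdateLogger)
```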
When AppStatusListener.onExecutorMetricsUpdate receives this event, it incrementally updates the task/stage/executor metrics:
```scala
override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate): Unit = {
  val now = System.nanoTime()

  event.accumUpdates.foreach { case (taskId, sid, sAttempt, accumUpdates) =>
    liveTasks.get(taskId).foreach { task =>
      val metrics = TaskMetrics.fromAccumulatorInfos(accumUpdates)
      val delta = task.updateMetrics(metrics)
      maybeUpdate(task, now)

      Option(liveStages.get((sid, sAttempt))).foreach { stage =>
        stage.metrics = LiveEntityHelpers.addMetrics(stage.metrics, delta)
        maybeUpdate(stage, now)

        val esummary = stage.executorSummary(event.execId)
        esummary.metrics = LiveEntityHelpers.addMetrics(esummary.metrics, delta)
        maybeUpdate(esummary, now)
      }
    }
  }

  // check if there is a new peak value for any of the executor level memory metrics
  // for the live UI. SparkListenerExecutorMetricsUpdate events are only processed
  // for the live UI.
  event.executorUpdates.foreach { case (key, peakUpdates) =>
    liveExecutors.get(event.execId).foreach { exec =>
      if (exec.peakExecutorMetrics.compareAndUpdatePeakValues(peakUpdates)) {
        update(exec, now)
      }
    }

    // Update stage level peak executor metrics.
    updateStageLevelPeakExecutorMetrics(key._1, key._2, event.execId, peakUpdates, now)
  }

  // Flush updates if necessary. Executor heartbeat is an event that happens periodically. Flush
  // here to ensure the staleness of Spark UI doesn't last more than
  // `max(heartbeat interval, liveUpdateMinFlushPeriod)`.
  if (now - lastFlushTimeNs > liveUpdateMinFlushPeriod) {
    flush(maybeUpdate(_, now))
    // Re-get the current system time because `flush` may be slow and `now` is stale.
    lastFlushTimeNs = System.nanoTime()
  }
}
```
At task completion
Merging metrics
```scala
private[spark] def mergeShuffleReadMetrics(): Unit = synchronized {
  if (tempShuffleReadMetrics.nonEmpty) {
    shuffleReadMetrics.setMergeValues(tempShuffleReadMetrics.toSeq)
  }
}
```
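The idea is simply to fold the per-dependency temp metrics into the single task-level value. A stand-alone sketch of that fold, where TempReadMetrics is a stand-in for Spark's private TempShuffleReadMetrics and only two counters are tracked:

```scala
final case class TempReadMetrics(remoteBytesRead: Long, recordsRead: Long)

def setMergeValues(temps: Seq[TempReadMetrics]): TempReadMetrics =
  temps.foldLeft(TempReadMetrics(0L, 0L)) { (acc, t) =>
    TempReadMetrics(acc.remoteBytesRead + t.remoteBytesRead, acc.recordsRead + t.recordsRead)
  }

println(setMergeValues(Seq(TempReadMetrics(10L, 2L), TempReadMetrics(30L, 5L))))  // TempReadMetrics(40,7)
```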
The result and accumulators are sent to the driver via execBackend.statusUpdate(taskId, TaskState.FINISHED/FAILED/KILLED, serializedResultOrReason); the path is executor --> ExecutorBackend.statusUpdate --> CoarseGrainedExecutorBackend.statusUpdate --> driverRef.send.
On the driver side, TaskSchedulerImpl.statusUpdate handles it first:
```scala
if (TaskState.isFinished(state)) {
  cleanupTaskState(tid)
  taskSet.removeRunningTask(tid)
  if (state == TaskState.FINISHED) {
    taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
  } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
    taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
  }
}
```
Inside taskResultGetter.enqueueSuccessfulTask, it is TaskResultGetter that deserializes the payload into a directResult and then calls back into handleSuccessfulTask, which hands the result down to the DAGScheduler, which in turn emits a SparkListenerTaskEnd event. As before, that event is ultimately consumed by AppStatusListener:
```scala
def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager,
    tid: Long,
    serializedData: ByteBuffer): Unit = {
  getTaskResultExecutor.execute(new Runnable {
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
          ......
        scheduler.handleSuccessfulTask(taskSetManager, tid, result)
```
Inside handleSuccessfulTask, the result value, accumulator updates, and metric peaks are handed to the DAGScheduler:
```scala
sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates,
  result.metricPeaks, info)
```
DAGScheduler then posts a CompletionEvent carrying the task-completion information to eventProcessLoop and lets the event loop process it:
```scala
eventProcessLoop.post(
  CompletionEvent(task, reason, result, accumUpdates, metricPeaks, taskInfo))

case completion: CompletionEvent =>
  dagScheduler.handleTaskCompletion(completion)
```
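The post/loop split decouples the caller from the handler: completion events are queued and drained by a single dedicated thread, so scheduler state is only ever mutated serially. A generic, self-contained sketch of that pattern (this is not Spark's EventLoop class):

```scala
import java.util.concurrent.LinkedBlockingQueue

object CompletionLoopSketch {
  final case class Completion(taskId: Long)   // stand-in for DAGScheduler's CompletionEvent

  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[Completion]()

    val loop = new Thread(new Runnable {
      override def run(): Unit =
        while (!Thread.currentThread().isInterrupted) {
          val event = queue.take()              // blocks until something is posted
          println(s"handleTaskCompletion(task ${event.taskId})")
        }
    })
    loop.setDaemon(true)
    loop.start()

    queue.put(Completion(42L))                  // the "eventProcessLoop.post(...)" side
    Thread.sleep(100)                           // give the loop thread time to drain
  }
}
```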
handleTaskCompletion continues the processing and eventually calls postTaskEnd:
```scala
private[scheduler] def handleTaskCompletion(event: CompletionEvent): Unit = {
  val task = event.task
  val stageId = task.stageId
  outputCommitCoordinator.taskCompleted(
    stageId,
    task.stageAttemptId,
    task.partitionId,
    event.taskInfo.attemptNumber, // this is a task attempt number
    event.reason)
  if (!stageIdToStage.contains(task.stageId)) {
    // The stage may have already finished when we get this event -- e.g. maybe it was a
    // speculative task. It is important that we send the TaskEnd event in any case, so listeners
    // are properly notified and can chose to handle it. For instance, some listeners are
    // doing their own accounting and if they don't get the task end event they think
    // tasks are still running when they really aren't.
    postTaskEnd(event)
```
The main job here is to aggregate the metrics and then post the SparkListenerTaskEnd event:
```scala
private def postTaskEnd(event: CompletionEvent): Unit = {
  val taskMetrics: TaskMetrics =
    if (event.accumUpdates.nonEmpty) {
      try {
        TaskMetrics.fromAccumulators(event.accumUpdates)
      } catch {
        case NonFatal(e) =>
          val taskId = event.taskInfo.taskId
          logError(s"Error when attempting to reconstruct metrics for task $taskId", e)
          null
      }
    } else {
      null
    }

  listenerBus.post(SparkListenerTaskEnd(event.task.stageId, event.task.stageAttemptId,
    Utils.getFormattedClassName(event.task), event.reason, event.taskInfo,
    new ExecutorMetrics(event.metricPeaks), taskMetrics))
}
```
The TaskMetrics are reassembled from the accumulator updates via fromAccumulators:
```scala
def fromAccumulators(accums: Seq[AccumulatorV2[_, _]]): TaskMetrics = {
  val tm = new TaskMetrics
  for (acc <- accums) {
    val name = acc.name
    if (name.isDefined && tm.nameToAccums.contains(name.get)) {
      val tmAcc = tm.nameToAccums(name.get).asInstanceOf[AccumulatorV2[Any, Any]]
      tmAcc.metadata = acc.metadata
      tmAcc.merge(acc.asInstanceOf[AccumulatorV2[Any, Any]])
    } else {
      tm._externalAccums.add(acc)
    }
  }
  tm
}
```
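Because the aggregated TaskMetrics rides on SparkListenerTaskEnd, any registered listener can read the final per-task values the same way AppStatusListener does. A minimal example (note that taskMetrics may be null, as the postTaskEnd code above shows; illustration only):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskEndLogger extends SparkListener {
  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    Option(event.taskMetrics).foreach { m =>
      println(s"stage=${event.stageId} task=${event.taskInfo.taskId} " +
        s"runTime=${m.executorRunTime}ms gc=${m.jvmGCTime}ms resultSize=${m.resultSize}B")
    }
  }
}
// Register it with: sc.addSparkListener(new TaskEndLogger)
```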
AppStatusListener
onTaskEnd
This method does the final bookkeeping. Once the terminal event from the driver side arrives, it completes the bottom-up aggregation of metrics (task → stage → executor summary), persists the updates, maintains the peak-value indices, and performs some cleanup.
Persistence generally goes into the AppStatusStore, which is backed by the KVStore abstraction, by calling update or maybeUpdate.
The two differ: update is used for terminal states and one-off important changes, while maybeUpdate is used for high-frequency changes such as heartbeats and is rate-limited:
```scala
private def update(entity: LiveEntity, now: Long, last: Boolean = false): Unit = {
  entity.write(kvstore, now, checkTriggers = last)
}

/** Update a live entity only if it hasn't been updated in the last configured period. */
private def maybeUpdate(entity: LiveEntity, now: Long): Unit = {
  if (live && liveUpdatePeriodNs >= 0 && now - entity.lastWriteTime > liveUpdatePeriodNs) {
    update(entity, now)
  }
}
```
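A stand-alone sketch of the throttling idea behind maybeUpdate, with Entity standing in for Spark's private LiveEntity and timestamps passed in explicitly for clarity:

```scala
import java.util.concurrent.TimeUnit

final class Entity(var lastWriteTime: Long = -1L)

val liveUpdatePeriodNs: Long = TimeUnit.MILLISECONDS.toNanos(100)

def write(e: Entity, now: Long): Unit = {
  // persist the entity to the store here, then remember when it was last written
  e.lastWriteTime = now
}

def maybeWrite(e: Entity, now: Long): Unit =
  if (now - e.lastWriteTime > liveUpdatePeriodNs) write(e, now)   // skip high-frequency writes

val e = new Entity
maybeWrite(e, TimeUnit.SECONDS.toNanos(1))                                       // written
maybeWrite(e, TimeUnit.SECONDS.toNanos(1) + TimeUnit.MILLISECONDS.toNanos(50))   // 50 ms later: skipped
maybeWrite(e, TimeUnit.SECONDS.toNanos(1) + TimeUnit.MILLISECONDS.toNanos(200))  // 200 ms later: written
```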