Spark: The Driver and Executor Startup Process

1. Context

The previous article, "Spark-SparkSubmit详细过程" (a detailed walkthrough of SparkSubmit), analyzed how the driver ends up invoking the Spark code we wrote after a job is submitted from the launch scripts. But before our Spark code can run, its distributed resources must already be in place, so next we analyze how those resources are allocated.

2. A Spark Code Example

We take a simple WordCount program as an example and analyze how the Spark backend allocates resources for it.

Scala:
object WordCount {
  def main(args: Array[String]): Unit = {
    // SparkConf lets you set most Spark configuration parameters; these take priority over system properties.
    // Note: once a SparkConf has been passed to Spark it can no longer be modified, because Spark
    // does not support runtime configuration changes.
    val conf = new SparkConf().setAppName("WordCount")
    // SparkContext is the main entry point of Spark. It represents a connection to a Spark cluster
    // and is used to create RDDs, accumulators, and broadcast variables on that cluster.
    // Only one SparkContext may be active per JVM.
    val sc = new SparkContext(conf)
    // Reads a text file from HDFS, the local file system (available on all nodes), or any
    // Hadoop-supported file system URI, and returns it as an RDD of strings.
    val sourceRdd = sc.textFile("file/word_count_data.txt")
    // A raw input line looks like: js,c,vba,json,xml
    // flatMap splits each line on commas, producing (word1) (word2) (word1) ...
    // map pairs each word with a count of 1, producing (word1,1) (word2,1) (word1,1) ...
    // reduceByKey sums the counts of identical words
    val resultRdd = sourceRdd.flatMap(_.split(",")).map(x => (x, 1)).reduceByKey(_ + _)
    // Print the result
    resultRdd.foreach(println)
    // Stop the SparkContext
    sc.stop()
  }
}

SparkConf holds the application's configuration, supports chained setters, and overrides the default system properties. Once the program is running it can no longer be modified.

SparkContext is the program's entry point for connecting to a Spark cluster; it can create RDDs, broadcast variables, accumulators, and so on against that cluster. So the resources an RDD needs must already be in place by the time the SparkContext is created. Below we look at how SparkContext works with the spark-submit parameters to negotiate resources.
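
As a minimal, self-contained illustration of those two points (chained setters, and immutability once the context is up), here is a sketch; the local[2] master and the memory value are demo assumptions, since a real job would get its master from spark-submit:

Scala:
import org.apache.spark.{SparkConf, SparkContext}

object ConfExample {
  def main(args: Array[String]): Unit = {
    // Chained setters: each set* call returns the same SparkConf instance.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local[2]")                  // normally supplied via spark-submit --master
      .set("spark.executor.memory", "1g")

    val sc = new SparkContext(conf)
    // Changing `conf` here has no effect on the running context:
    // SparkContext snapshots the settings at construction time.
    conf.set("spark.executor.memory", "4g")   // ignored by the already-created sc
    println(sc.getConf.get("spark.executor.memory")) // still "1g"
    sc.stop()
  }
}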

3. SparkContext

Scala:
class SparkContext(config: SparkConf) extends Logging {

  // Create a SparkContext that loads settings from system properties
  // (for example, when launched with ./bin/spark-submit)
  def this() = this(new SparkConf())

  // ...omitted...

  // Private variables. These keep the context's internal state and are not accessible to the
  // outside world. They are mutable because we want to initialize them to a neutral value up
  // front, so that calling "stop()" while the constructor is still running is safe.
  // Only the important fields are listed here.
  // Per-application configuration
  private var _conf: SparkConf = _
  // Holds all the runtime environment objects for a running Spark instance (either master or
  // worker), including the serializer, RpcEnv, block manager, map output tracker, etc.
  // Spark code currently finds the SparkEnv through a global variable, so all threads can access
  // the same SparkEnv. It can be accessed via SparkEnv.get (e.g. after creating a SparkContext).
  // We will analyze it separately later.
  private var _env: SparkEnv = _
  // Backend interface for the scheduling system, allowing different implementations to be
  // plugged in under TaskSchedulerImpl. We assume a Mesos-like model where the application gets
  // resource offers as machines become available and can launch tasks on them.
  private var _schedulerBackend: SchedulerBackend = _
  // Low-level task scheduler interface, currently implemented exclusively by
  // [[org.apache.spark.scheduler.TaskSchedulerImpl]].
  // This interface allows different task schedulers to be plugged in. Each TaskScheduler
  // schedules tasks for a single SparkContext. These schedulers receive the task sets submitted
  // per stage from the DAGScheduler and are responsible for sending tasks to the cluster,
  // running them, retrying on failure, and mitigating stragglers. They report events back to
  // the DAGScheduler.
  private var _taskScheduler: TaskScheduler = _
  // A HeartbeatReceiver RPC endpoint resident on the driver, which receives
  // heartbeats from all executors
  private var _heartbeatReceiver: RpcEndpointRef = _
  // Implements the high-level, stage-oriented scheduling layer. It computes a DAG of stages for
  // each job, tracks which RDDs and stage outputs are materialized, and finds a minimal schedule
  // to run the job. It then submits stages as TaskSets to the underlying TaskScheduler
  // implementation that runs them on the cluster.
  // TaskSets contain fully independent tasks that can run right away based on data already on
  // the cluster.
  @volatile private var _dagScheduler: DAGScheduler = _

  try {
    // ...omitted...
    // Create the SparkEnv
    _env = createSparkEnv(_conf, isLocal, listenerBus)
    // Register the driver-resident _heartbeatReceiver endpoint
    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
      HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // Start the task scheduler
    _taskScheduler.start()

    // ...omitted...

  // Create a task scheduler based on the given master URL.
  // Returns a 2-tuple of the scheduler backend and the task scheduler.
  private def createTaskScheduler(...){
    // ...omitted...
    master match {
      case "local" =>
      case LOCAL_N_REGEX(threads) =>
      case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      case SPARK_REGEX(sparkUrl) =>
        // Create the task scheduler implementation class, TaskSchedulerImpl
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        // Create a StandaloneSchedulerBackend, a subclass of SchedulerBackend
        val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
        // Initialize the task scheduler
        scheduler.initialize(backend)
        (backend, scheduler)
      case LOCAL_CLUSTER_REGEX(numWorkers, coresPerWorker, memoryPerWorker) =>
      case masterUrl =>
  }

}
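
The SPARK_REGEX(sparkUrl) branch is the one that fires for a standalone cluster. The snippet below replays just that parsing step in isolation; the regex mirrors the """spark://(.*)""" pattern in Spark's source, and the host names are made up:

Scala:
object MasterUrlParsing {
  // Mirrors SparkContext's SPARK_REGEX: everything after "spark://" is captured.
  private val SparkRegex = """spark://(.*)""".r

  def main(args: Array[String]): Unit = {
    val master = "spark://host1:7077,host2:7077" // HA standalone cluster, two masters
    master match {
      case SparkRegex(sparkUrl) =>
        // Same expansion as in createTaskScheduler: split on ",", re-prefix each host.
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        masterUrls.foreach(println)
        // prints: spark://host1:7077
        //         spark://host2:7077
      case other =>
        println(s"not a standalone master URL: $other")
    }
  }
}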

4. TaskSchedulerImpl

Scala:
private[spark] class TaskSchedulerImpl(
    val sc: SparkContext,
    val maxTaskFailures: Int,
    isLocal: Boolean = false,
    clock: Clock = new SystemClock)
  extends TaskScheduler with Logging {

  // Initialization
  def initialize(backend: SchedulerBackend): Unit = {
    this.backend = backend
    // Build the schedulable pools: FIFO by default, configurable via spark.scheduler.mode
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, sc)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }

  // Start the scheduler
  override def start(): Unit = {
    // SparkContext created a StandaloneSchedulerBackend, so its start() is invoked here;
    // StandaloneSchedulerBackend in turn calls start() on its parent class,
    // CoarseGrainedSchedulerBackend
    backend.start()

    if (!isLocal && conf.get(SPECULATION_ENABLED)) {
      logInfo("Starting speculative execution thread")
      speculationScheduler.scheduleWithFixedDelay(
        () => Utils.tryOrStopSparkContext(sc) { checkSpeculatableTasks() },
        SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
  }
}
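
The scheduling mode above comes from the spark.scheduler.mode property, FIFO by default. A small sketch of requesting FAIR mode from the application side (the app name is arbitrary):

Scala:
import org.apache.spark.SparkConf

object SchedulerModeConf {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-demo")
      .set("spark.scheduler.mode", "FAIR") // TaskSchedulerImpl then builds a FairSchedulableBuilder
    // An invalid value (e.g. "LIFO") would make initialize() throw
    // IllegalArgumentException, as in the match above.
    println(conf.get("spark.scheduler.mode"))
  }
}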

5. CoarseGrainedSchedulerBackend

Scala:
// A scheduler backend that waits for coarse-grained executors to connect. This backend holds
// onto each executor for the duration of the Spark job rather than relinquishing executors
// whenever a task is done and asking the scheduler to launch a new executor for each new task.
// Executors may be launched in a variety of ways, such as Mesos tasks for the coarse-grained
// Mesos mode, or standalone processes for Spark's standalone deploy mode (spark.deploy.*).
class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
  extends ExecutorAllocationClient with SchedulerBackend with Logging {

  // Creates and registers a DriverEndpoint; the DriverEndpoint's onStart() runs on registration
  val driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint())
  protected def createDriverEndpoint(): DriverEndpoint = new DriverEndpoint()

  override def start(): Unit = {

  }

  class DriverEndpoint extends IsolatedRpcEndpoint with Logging {
    override def onStart(): Unit = {
      // Periodically revive offers to allow delay scheduling to work
      val reviveIntervalMs = conf.get(SCHEDULER_REVIVE_INTERVAL).getOrElse(1000L)

      reviveThread.scheduleAtFixedRate(() => Utils.tryLogNonFatalError {
        Option(self).foreach(_.send(ReviveOffers))
      }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
    }

    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls,
          attributes, resources, resourceProfileId) =>
        if (executorDataMap.contains(executorId)) {
          context.sendFailure(new IllegalStateException(s"Duplicate executor ID: $executorId"))
        } else if (scheduler.excludedNodes.contains(hostname) ||
            isExecutorExcluded(executorId, hostname)) {
          // If the cluster manager gives us an executor on an excluded node (because it already
          // started allocating those resources before we informed it of our exclusion, or
          // because it ignored our exclusion), we reject that executor immediately
          logInfo(s"Rejecting $executorId as it has been excluded.")
          context.sendFailure(
            new IllegalStateException(s"Executor is excluded due to failures: $executorId"))
        } else {
          // If the executor's rpc env is not listening for incoming connections, `hostPort`
          // will be null, and the client connection should be used to contact the executor.
          val executorAddress = if (executorRef.address != null) {
              executorRef.address
            } else {
              context.senderAddress
            }
          logInfo(s"Registered executor $executorRef ($executorAddress) with ID $executorId, " +
            s" ResourceProfileId $resourceProfileId")
          addressToExecutorId(executorAddress) = executorId
          totalCoreCount.addAndGet(cores)
          totalRegisteredExecutors.addAndGet(1)
          val resourcesInfo = resources.map { case (rName, info) =>
            // This must be synchronized because variables mutated in this block are read
            // when requesting executors
            val numParts = scheduler.sc.resourceProfileManager
              .resourceProfileFromId(resourceProfileId).getNumSlotsPerAddress(rName, conf)
            (info.name, new ExecutorResourceInfo(info.name, info.addresses, numParts))
          }
          val data = new ExecutorData(executorRef, executorAddress, hostname,
            0, cores, logUrlHandler.applyPattern(logUrls, attributes), attributes,
            resourcesInfo, resourceProfileId, registrationTs = System.currentTimeMillis())
          // This must be synchronized because variables mutated
          // in this block are read when requesting executors
          CoarseGrainedSchedulerBackend.this.synchronized {
            executorDataMap.put(executorId, data)
            if (currentExecutorIdCounter < executorId.toInt) {
              currentExecutorIdCounter = executorId.toInt
            }
          }
          listenerBus.post(
            SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
          // Note: some tests expect the reply to come after we put the executor in the map
          context.reply(true)
        }

    }

    override def receive: PartialFunction[Any, Unit] = {
      // An executor reports that it has launched
      case LaunchedExecutor(executorId) =>
        executorDataMap.get(executorId).foreach { data =>
          data.freeCores = data.totalCores
        }
        // Make fake resource offers for just this one executor
        makeOffers(executorId)
    }

  }

}
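
A detail worth calling out: setupEndpoint guarantees that the endpoint's onStart() runs before any message reaches receive. The toy model below illustrates just that contract; it is not Spark's RpcEnv, only a single-threaded inbox sketch:

Scala:
import java.util.concurrent.LinkedBlockingQueue

// Toy model of the RpcEndpoint contract: onStart() runs once,
// strictly before receive() sees any message.
trait ToyEndpoint {
  def onStart(): Unit
  def receive(msg: Any): Unit
}

object ToyRpcEnv {
  def setupEndpoint(endpoint: ToyEndpoint): LinkedBlockingQueue[Any] = {
    val inbox = new LinkedBlockingQueue[Any]()
    val loop = new Thread(() => {
      endpoint.onStart()                          // first, like DriverEndpoint.onStart
      while (true) endpoint.receive(inbox.take()) // then the single-threaded message loop
    })
    loop.setDaemon(true)
    loop.start()
    inbox
  }

  def main(args: Array[String]): Unit = {
    val inbox = setupEndpoint(new ToyEndpoint {
      def onStart(): Unit = println("onStart: schedule periodic ReviveOffers")
      def receive(msg: Any): Unit = println(s"received: $msg")
    })
    inbox.put("RegisterExecutor(executorId = 0)")
    Thread.sleep(200) // let the daemon loop drain the inbox before main exits
  }
}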

6. StandaloneSchedulerBackend

Scala:
// A [[SchedulerBackend]] implementation for Spark's standalone cluster manager.
private[spark] class StandaloneSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)
  with StandaloneAppClientListener
  with Logging {

  override def start(): Unit = {
    super.start()

    // The scheduler backend should only try to connect to the launcher in client mode.
    // In cluster mode, the code that submits the application to the Master needs to connect
    // to the launcher instead.
    if (sc.deployMode == "client") {
      launcherBackend.connect()
    }

    // The endpoint on each executor needs the driver's address to communicate with the driver
    val driverUrl = RpcEndpointAddress(
      sc.conf.get(config.DRIVER_HOST_ADDRESS),
      sc.conf.get(config.DRIVER_PORT),
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
    val args = Seq(
      "--driver-url", driverUrl,
      "--executor-id", "{{EXECUTOR_ID}}",
      "--hostname", "{{HOSTNAME}}",
      "--cores", "{{CORES}}",
      "--app-id", "{{APP_ID}}",
      "--worker-url", "{{WORKER_URL}}")
    val extraJavaOpts = sc.conf.get(config.EXECUTOR_JAVA_OPTIONS)
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    val classPathEntries = sc.conf.get(config.EXECUTOR_CLASS_PATH)
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
    val libraryPathEntries = sc.conf.get(config.EXECUTOR_LIBRARY_PATH)
      .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)


    // Start executors with a few necessary configs so they can register with the scheduler
    val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
    val javaOpts = sparkJavaOpts ++ extraJavaOpts
    // The main class on the executor side: CoarseGrainedExecutorBackend
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
    val webUrl = sc.ui.map(_.webUrl).getOrElse("")
    val coresPerExecutor = conf.getOption(config.EXECUTOR_CORES.key).map(_.toInt)
    // If dynamic allocation is in use, set the initial executor limit to 0 for now.
    // ExecutorAllocationManager will send the real initial limit to the Master later.
    val initialExecutorLimit =
      if (Utils.isDynamicAllocationEnabled(conf)) {
        Some(0)
      } else {
        None
      }
    val executorResourceReqs = ResourceUtils.parseResourceRequirements(conf,
      config.SPARK_EXECUTOR_PREFIX)
    // Here we have an ApplicationDescription; earlier there was a DriverDescription.
    // As you might guess, it is used to launch an Application.
    val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit,
      resourceReqsPerExecutor = executorResourceReqs)
    // Create a StandaloneAppClient, which lets the application talk to Spark's standalone
    // cluster manager. It takes the master URLs, the application description, and a listener
    // for cluster events, and calls the listener back when various events occur.
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    // Start it
    client.start()
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    waitForRegistration()
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }

}
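
The {{EXECUTOR_ID}}, {{HOSTNAME}}, and {{CORES}} tokens in args are placeholders that the Worker side fills in when it materializes the command (the ExecutorRunner substitutes them at launch time). Below is a simplified stand-in for that substitution, with hypothetical values:

Scala:
object CommandTemplate {
  // Simplified stand-in for ExecutorRunner's placeholder substitution.
  def substitute(arg: String, vars: Map[String, String]): String =
    vars.foldLeft(arg) { case (acc, (k, v)) => acc.replace(s"{{$k}}", v) }

  def main(args: Array[String]): Unit = {
    val template = Seq(
      "--driver-url", "spark://CoarseGrainedScheduler@driver-host:7077",
      "--executor-id", "{{EXECUTOR_ID}}",
      "--hostname", "{{HOSTNAME}}",
      "--cores", "{{CORES}}",
      "--app-id", "{{APP_ID}}")
    // Hypothetical values a Worker would know at launch time:
    val vars = Map(
      "EXECUTOR_ID" -> "0",
      "HOSTNAME"    -> "worker-1",
      "CORES"       -> "4",
      "APP_ID"      -> "app-20240101000000-0000")
    println(template.map(substitute(_, vars)).mkString(" "))
  }
}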

7. StandaloneAppClient

Scala:
private[spark] class StandaloneAppClient(...){

  private val masterRpcAddresses = masterUrls.map(RpcAddress.fromSparkURL(_))
  private val REGISTRATION_TIMEOUT_SECONDS = 20
  private val REGISTRATION_RETRIES = 3

  private class ClientEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint
    with Logging {

    override def onStart(): Unit = {
        // Register the app with the Master
        registerWithMaster(1)
    }

    // Register with all masters asynchronously. It calls `registerWithMaster` every 20 seconds
    // until it exceeds 3 retries. Once we connect to a master successfully, all scheduling
    // work and futures are cancelled.
    // nthRetry means this is the nth attempt to register with the master.
    private def registerWithMaster(nthRetry: Int): Unit = {
      registerMasterFutures.set(tryRegisterAllMasters())
      registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
        override def run(): Unit = {
          if (registered.get) {
            registerMasterFutures.get.foreach(_.cancel(true))
            registerMasterThreadPool.shutdownNow()
          } else if (nthRetry >= REGISTRATION_RETRIES) {
            markDead("All masters are unresponsive! Giving up.")
          } else {
            registerMasterFutures.get.foreach(_.cancel(true))
            registerWithMaster(nthRetry + 1)
          }
        }
      }, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
    }

  }

    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      for (masterAddress <- masterRpcAddresses) yield {
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = try {
            if (registered.get) {
              return
            }
            logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
            val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
            // Send a RegisterApplication message to the Master
            masterRef.send(RegisterApplication(appDescription, self))
          } catch {
            case ie: InterruptedException => // Cancelled
            case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
          }
        })
      }
    }


  def start(): Unit = {
    // Just launch an rpcEndpoint; it will call back into the listener.
    // Registering the "AppClient" endpoint (a ClientEndpoint) triggers its onStart() method
    endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
  }

}

8. Master

Scala:
private[deploy] class Master(...){

  // Handle messages from other endpoints
  override def receive: PartialFunction[Any, Unit] = {
    case RegisterApplication(description, driver) =>
      // TODO Prevent repeated registrations from some driver
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        // Create an app
        val app = createApplication(description, driver)
        // Record the app on the master side, e.g. add it to waitingApps
        // in preparation for the upcoming scheduling pass
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        // Persist the app
        persistenceEngine.addApplication(app)
        // Reply to the driver that registration succeeded
        driver.send(RegisteredApplication(app.id, self))
        // Schedule: start allocating executors on the Workers
        schedule()
      }
  }

  // Create an app
  private def createApplication(desc: ApplicationDescription, driver: RpcEndpointRef):
      ApplicationInfo = {
    val now = System.currentTimeMillis()
    val date = new Date(now)
    val appId = newApplicationId(date)
    // ApplicationInfo runs its own init()
    new ApplicationInfo(now, appId, desc, date, driver, defaultCores)
  }

  private def schedule(): Unit = {
    if (state != RecoveryState.ALIVE) {
      return
    }
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
    val numWorkersAlive = shuffledAliveWorkers.size
    var curPos = 0
    // Launch all waiting drivers; our driver is already running at this point, so skip this step
    for (driver <- waitingDrivers.toList) {
      ...
    }
    // Launch executors on the Workers
    startExecutorsOnWorkers()
  }

  // Schedule and launch executors on the workers
  private def startExecutorsOnWorkers(): Unit = {
    // Right now this is a very simple FIFO scheduler. We schedule the waiting apps in order.
    for (app <- waitingApps) {
      val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
      // If the cores left are less than coresPerExecutor, the leftover cores will not be
      // allocated. Put simply, the app's remaining cores must cover at least one executor's
      // worth of cores; executors are allocated in units of coresPerExecutor, minimum 1.
      if (app.coresLeft >= coresPerExecutor) {
        // Filter out workers that do not have enough resources, then sort by free cores
        // in descending order and launch executors on them in turn
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(canLaunchExecutor(_, app.desc))
          .sortBy(_.coresFree).reverse
        val appMayHang = waitingApps.length == 1 &&
          waitingApps.head.executors.isEmpty && usableWorkers.isEmpty
        if (appMayHang) {
          logWarning(s"App ${app.id} requires more resource than any of Workers could have.")
        }
        // Compute the actual resource assignment; returns an array of cores assigned per worker
        val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

        // Now that we've decided how many cores to allocate on each worker, let's allocate them
        for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
          allocateWorkerResourceToExecutors(
            app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
        }
      }
    }
  }

  // Schedule executors to be launched on the workers. Returns an array containing the number
  // of cores assigned to each worker.
  // There are two modes of launching executors:
  //   The first attempts to spread out an application's executors onto as many workers as
  //   possible (the default), which is better for data locality.
  //   The second does the opposite (i.e. launches them on as few workers as possible).
  // The number of cores per executor is configurable. When it is explicitly set, multiple
  // executors from the same application may be launched on the same worker if the worker has
  // enough cores and memory. Otherwise, by default, each executor grabs all the cores available
  // on the worker, in which case only one executor per application may be launched on each
  // worker during a single schedule iteration.
  // Note that when "spark.executor.cores" is not set, we may still launch multiple executors
  // from the same application on the same worker:
  //   Suppose appA and appB both have an executor running on worker1 and appA.coresLeft > 0.
  //   Then appB finishes and releases all its cores on worker1, so on the next schedule
  //   iteration appA launches a new executor that grabs all the free cores on worker1,
  //   giving appA multiple executors on worker1.
  private def scheduleExecutorsOnWorkers(
      app: ApplicationInfo,
      usableWorkers: Array[WorkerInfo],
      spreadOutApps: Boolean): Array[Int] = {
    // Cores required per executor
    val coresPerExecutor = app.desc.coresPerExecutor
    // Minimum cores required per executor
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
    // One executor per worker
    val oneExecutorPerWorker = coresPerExecutor.isEmpty
    // Memory required per executor
    val memoryPerExecutor = app.desc.memoryPerExecutorMB
    // Other resources required per executor
    val resourceReqsPerExecutor = app.desc.resourceReqsPerExecutor
    // Number of usable workers
    val numUsable = usableWorkers.length
    val assignedCores = new Array[Int](numUsable) // cores to request on each worker
    val assignedExecutors = new Array[Int](numUsable) // executors to request on each worker
    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

    /** Return whether the specified worker can launch an executor for this app */
    def canLaunchExecutorForApp(pos: Int): Boolean = {
      val keepScheduling = coresToAssign >= minCoresPerExecutor
      val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor
      val assignedExecutorNum = assignedExecutors(pos)

      // If we allow multiple executors per worker, we can always launch new executors.
      // Otherwise, if there is already an executor on this worker, just give it more cores.
      val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutorNum == 0
      if (launchingNewExecutor) {
        val assignedMemory = assignedExecutorNum * memoryPerExecutor
        val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
        val assignedResources = resourceReqsPerExecutor.map {
          req => req.resourceName -> req.amount * assignedExecutorNum
        }.toMap
        val resourcesFree = usableWorkers(pos).resourcesAmountFree.map {
          case (rName, free) => rName -> (free - assignedResources.getOrElse(rName, 0))
        }
        val enoughResources = ResourceUtils.resourcesMeetRequirements(
          resourcesFree, resourceReqsPerExecutor)
        val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
        keepScheduling && enoughCores && enoughMemory && enoughResources && underLimit
      } else {
        // We're adding cores to an existing executor, so no need to check memory
        // and executor limits
        keepScheduling && enoughCores
      }
    }

    // Filter the workers that can launch an executor
    var freeWorkers = (0 until numUsable).filter(canLaunchExecutorForApp)
    // Loop over the eligible workers, assigning executors
    while (freeWorkers.nonEmpty) {
      freeWorkers.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunchExecutorForApp(pos)) {
          coresToAssign -= minCoresPerExecutor
          assignedCores(pos) += minCoresPerExecutor

          if (oneExecutorPerWorker) {
            assignedExecutors(pos) = 1
          } else {
            assignedExecutors(pos) += 1
          }

          // Spreading out an app means spreading its executors across as many workers as
          // possible. If we are not spreading out, we should keep scheduling executors on
          // this worker until all of its resources are used; otherwise, move on to the
          // next worker. keepScheduling defaults to true.
          if (spreadOutApps) {
            keepScheduling = false
          }
        }
      }
      freeWorkers = freeWorkers.filter(canLaunchExecutorForApp)
    }
    assignedCores
  }


  // Allocate a worker's resources to one or more executors.
  private def allocateWorkerResourceToExecutors(
      app: ApplicationInfo,
      assignedCores: Int,
      coresPerExecutor: Option[Int],
      worker: WorkerInfo): Unit = {
    // If the number of cores per executor is specified, we divide the cores assigned to this
    // worker evenly among the executors with no remainder.
    // Otherwise, we launch a single executor that grabs all the cores assigned on this worker.
    val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    // Launch the executors on the worker one by one
    for (i <- 1 to numExecutors) {
      val allocated = worker.acquireResources(app.desc.resourceReqsPerExecutor)
      // This creates an ExecutorDesc describing the executor launch,
      // just like the descriptions we saw earlier for the app and the driver
      val exec = app.addExecutor(worker, coresToAssign, allocated)
      // Launch the executor
      launchExecutor(worker, exec)
      app.state = ApplicationState.RUNNING
    }
  }

  private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
    // Launch the executor on that worker
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    worker.addExecutor(exec)
    // The Master endpoint sends a LaunchExecutor message to the worker endpoint
    worker.endpoint.send(LaunchExecutor(masterUrl, exec.application.id, exec.id,
      exec.application.desc, exec.cores, exec.memory, exec.resources))
    // The Master endpoint sends an ExecutorAdded message to the driver endpoint
    exec.application.driver.send(
      ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
  }

}
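
To see what scheduleExecutorsOnWorkers actually produces, here is a self-contained simulation of its core-assignment loop, stripped down to cores only (no memory, resource, or executor-limit checks), with hypothetical worker sizes:

Scala:
object ScheduleSim {
  /** Simplified scheduleExecutorsOnWorkers: cores only, no memory/resource checks.
   *  (The real algorithm also sorts workers by free cores, descending.) */
  def assignCores(coresLeft: Int,
                  workerFreeCores: Array[Int],
                  coresPerExecutor: Option[Int],
                  spreadOut: Boolean): Array[Int] = {
    val minCores = coresPerExecutor.getOrElse(1)
    val assigned = new Array[Int](workerFreeCores.length)
    var toAssign = math.min(coresLeft, workerFreeCores.sum)

    def canLaunch(pos: Int): Boolean =
      toAssign >= minCores && workerFreeCores(pos) - assigned(pos) >= minCores

    var free = workerFreeCores.indices.filter(canLaunch)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunch(pos)) {
          toAssign -= minCores
          assigned(pos) += minCores
          if (spreadOut) keepScheduling = false // move on to the next worker
        }
      }
      free = free.filter(canLaunch)
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    val workers = Array(8, 8, 8) // three workers, 8 free cores each
    // The app wants 12 cores, 2 cores per executor:
    println(assignCores(12, workers, Some(2), spreadOut = true).mkString(","))  // 4,4,4
    println(assignCores(12, workers, Some(2), spreadOut = false).mkString(",")) // 8,4,0
  }
}

With spread-out on, the 12 requested cores land as 4,4,4 across the three workers; with it off, the first worker is filled before the second is touched, giving 8,4,0.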

9. Worker

Scala:
private[deploy] class Worker(......
  extends ThreadSafeRpcEndpoint with Logging {
  override def receive: PartialFunction[Any, Unit] = synchronized {
    case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_, resources_) =>
      if (masterUrl != activeMasterUrl) {
        logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
      } else if (decommissioned) {
        logWarning("Asked to launch an executor while decommissioned. Not launching executor.")
      } else {
        try {
          logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

          // Create the executor's local working directory
          val executorDir = new File(workDir, appId + "/" + execId)
          if (!executorDir.mkdirs()) {
            throw new IOException("Failed to create directory " + executorDir)
          }

          // Create local dirs for the executor. These are passed to the executor via the
          // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
          // application finishes.
          val appLocalDirs = appDirectories.getOrElse(appId, {
            val localRootDirs = Utils.getOrCreateLocalRootDirs(conf)
            val dirs = localRootDirs.flatMap { dir =>
              try {
                val appDir = Utils.createDirectory(dir, namePrefix = "executor")
                Utils.chmod700(appDir)
                Some(appDir.getAbsolutePath())
              } catch {
                ...
              }
            }.toSeq
            dirs
          })
          appDirectories(appId) = appLocalDirs
          // Manages the execution of one executor process. Currently only used in
          // standalone mode.
          val manager = new ExecutorRunner(
            appId,
            execId,
            appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
            cores_,
            memory_,
            self,
            workerId,
            webUi.scheme,
            host,
            webUi.boundPort,
            publicAddress,
            sparkHome,
            executorDir,
            workerUri,
            conf,
            appLocalDirs,
            ExecutorState.LAUNCHING,
            resources_)
          executors(appId + "/" + execId) = manager
          manager.start()
          coresUsed += cores_
          memoryUsed += memory_
          addResourcesUsed(resources_)
        } catch {
          ......
        }
      }
  }

}
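
The working directory created above is simply workDir/appId/execId. A tiny runnable sketch of that layout, using a temp directory and a made-up application id:

Scala:
import java.io.{File, IOException}

object ExecutorDirLayout {
  def main(args: Array[String]): Unit = {
    val workDir = new File(System.getProperty("java.io.tmpdir"), "spark-work-demo")
    val appId = "app-20240101000000-0000" // hypothetical application id
    val execId = 0
    // Same construction as in Worker's LaunchExecutor handler:
    val executorDir = new File(workDir, appId + "/" + execId)
    // mkdirs() returns false if the directory already exists, so tolerate reruns
    if (!executorDir.mkdirs() && !executorDir.isDirectory) {
      throw new IOException("Failed to create directory " + executorDir)
    }
    println(executorDir.getAbsolutePath)
  }
}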

10. ExecutorRunner

Scala:
private[deploy] class ExecutorRunner(...){

  private[worker] def start(): Unit = {
    // Prepare a thread to launch the executor
    workerThread = new Thread("ExecutorRunner for " + fullId) {
      override def run(): Unit = { fetchAndRunExecutor() }
    }
    // The thread runs fetchAndRunExecutor(): download and run the executor described in the
    // application description. The launch class packaged in step 6 is
    // org.apache.spark.executor.CoarseGrainedExecutorBackend.
    // Next we look at what happens inside the executor.
    workerThread.start()
    ......
  }

}
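
fetchAndRunExecutor() ultimately turns the Command from step 6 into an OS process. The sketch below shows the underlying mechanism, a ProcessBuilder launching a child JVM; the command here just runs java -version as a stand-in for the real executor main class and arguments:

Scala:
import java.io.File
import scala.io.Source

object ChildJvmLaunch {
  def main(args: Array[String]): Unit = {
    // Illustrative only: ExecutorRunner builds the real command from the
    // ApplicationDescription (java opts, classpath, CoarseGrainedExecutorBackend args).
    val javaBin = new File(new File(System.getProperty("java.home"), "bin"), "java").getPath
    val command = Seq(javaBin, "-version") // stand-in for the executor main class + args
    val builder = new ProcessBuilder(command: _*)
      .directory(new File("."))   // the executor's working dir (executorDir in the Worker)
      .redirectErrorStream(true)  // merge stderr into stdout, like the runner's log capture
    val process = builder.start()
    println(Source.fromInputStream(process.getInputStream).mkString)
    println(s"exit code: ${process.waitFor()}")
  }
}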

11. CoarseGrainedExecutorBackend (the executor process's main class)

Scala:
private[spark] object CoarseGrainedExecutorBackend extends Logging {

  def main(args: Array[String]): Unit = {
    val createFn: (RpcEnv, Arguments, SparkEnv, ResourceProfile) =>
      CoarseGrainedExecutorBackend = { case (rpcEnv, arguments, env, resourceProfile) =>
      new CoarseGrainedExecutorBackend(rpcEnv, arguments.driverUrl, arguments.executorId,
        arguments.bindAddress, arguments.hostname, arguments.cores, arguments.userClassPath.toSeq,
        env, arguments.resourcesFileOpt, resourceProfile)
    }
    // Parse the arguments and run
    run(parseArguments(args, this.getClass.getCanonicalName.stripSuffix("$")), createFn)
    System.exit(0)
  }

  def run(
      arguments: Arguments,
      backendCreateFn: (RpcEnv, Arguments, SparkEnv, ResourceProfile) =>
        CoarseGrainedExecutorBackend): Unit = {

      //......

      // Create an RpcEnv to fetch the driver's Spark properties
      val executorConf = new SparkConf
      val fetcher = RpcEnv.create(
        "driverPropsFetcher",
        arguments.bindAddress,
        arguments.hostname,
        -1,
        executorConf,
        new SecurityManager(executorConf),
        numUsableCores = 0,
        clientMode = true)

      // Try up to 3 times to obtain a reference to the driver endpoint
      var driver: RpcEndpointRef = null
      val nTries = 3
      for (i <- 0 until nTries if driver == null) {
        try {
          driver = fetcher.setupEndpointRefByURI(arguments.driverUrl)
        } catch {
          case e: Throwable => if (i == nTries - 1) {
            throw e
          }
        }
      }

      // Send a RetrieveSparkAppConfig message to the driver
      val cfg = driver.askSync[SparkAppConfig](RetrieveSparkAppConfig(arguments.resourceProfileId))
      val props = cfg.sparkProperties ++ Seq[(String, String)](("spark.app.id", arguments.appId))
      fetcher.shutdown()

      // Create the SparkEnv from the properties fetched from the driver
      val driverConf = new SparkConf()
      for ((key, value) <- props) {
        // this is required for SSL in standalone mode
        if (SparkConf.isExecutorStartupConf(key)) {
          driverConf.setIfMissing(key, value)
        } else {
          driverConf.set(key, value)
        }
      }

      cfg.hadoopDelegationCreds.foreach { tokens =>
        SparkHadoopUtil.get.addDelegationTokens(tokens, driverConf)
      }

      driverConf.set(EXECUTOR_ID, arguments.executorId)
      // Create the SparkEnv for the executor
      val env = SparkEnv.createExecutorEnv(driverConf, arguments.executorId, arguments.bindAddress,
        arguments.hostname, arguments.cores, cfg.ioEncryptionKey, isLocal = false)
      // Set the application attempt id in the BlockStoreClient if available
      val appAttemptId = env.conf.get(APP_ATTEMPT_ID)
      appAttemptId.foreach(attemptId =>
        env.blockManager.blockStoreClient.setAppAttemptId(attemptId)
      )
      // Create the CoarseGrainedExecutorBackend
      val backend = backendCreateFn(env.rpcEnv, arguments, env, cfg.resourceProfile)
      // Register it as the "Executor" endpoint, which triggers its onStart() method
      env.rpcEnv.setupEndpoint("Executor", backend)
      arguments.workerUrl.foreach { url =>
        env.rpcEnv.setupEndpoint("WorkerWatcher",
          new WorkerWatcher(env.rpcEnv, url, isChildProcessStopping = backend.stopping))
      }
      env.rpcEnv.awaitTermination()
    }
  }

}
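
The three-attempt loop around setupEndpointRefByURI is a plain bounded-retry pattern. Factored out as a standalone helper it looks like this (a sketch, not a Spark API):

Scala:
object RetryHelper {
  /** Try `op` up to `nTries` times; rethrow the last failure. */
  def retry[T](nTries: Int)(op: => T): T = {
    var attempt = 0
    while (true) {
      try {
        return op // success: hand the value back immediately
      } catch {
        case e: Throwable =>
          attempt += 1
          if (attempt >= nTries) throw e // out of attempts: rethrow
      }
    }
    sys.error("unreachable")
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val ref = retry(3) {
      calls += 1
      if (calls < 3) sys.error(s"connect failed (attempt $calls)")
      "driver-endpoint-ref" // stand-in for fetcher.setupEndpointRefByURI(driverUrl)
    }
    println(s"$ref after $calls attempts")
  }
}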

12. CoarseGrainedExecutorBackend (the RPC endpoint inside the executor)

Scala:
private[spark] class CoarseGrainedExecutorBackend(...)
  extends IsolatedRpcEndpoint with ExecutorBackend with Logging {
  // Message used internally to start the executor once the driver has successfully
  // accepted the registration request.
  case object RegisteredExecutor

  override def onStart(): Unit = {
    //......
    logInfo("Connecting to driver: " + driverUrl)
    try {
      // A utility for converting the SparkConf inside a Spark JVM (e.g. executor, driver, or
      // standalone shuffle service) into a TransportConf with details about our environment,
      // such as the number of cores assigned to this JVM.
      val shuffleClientTransportConf = SparkTransportConf.fromSparkConf(env.conf, "shuffle")
      // If Netty prefers direct (off-heap) buffers but the maximum off-heap memory the JVM can
      // allocate is below spark.network.maxRemoteBlockSizeFetchToMem (default 200M), throw.
      // Related configs:
      //   spark.network.sharedByteBufAllocators.enabled (default true): whether pooled ByteBuf
      //     allocators are shared between Netty channels. If enabled, only two pooled
      //     allocators are created: one that allows caching (for transport servers) and one
      //     that does not (for transport clients). If disabled, a new allocator is created for
      //     every transport server and client.
      //   spark.io.preferDirectBufs (default true): the shared ByteBuf allocators prefer
      //     off-heap byte buffers
      //   spark.network.io.preferDirectBufs (default true): allocate off-heap byte buffers
      //     in Netty
      if (NettyUtils.preferDirectBufs(shuffleClientTransportConf) &&
          PlatformDependent.maxDirectMemory() < env.conf.get(MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM)) {
        throw new SparkException(s"Netty direct memory should at least be bigger than " +
          s"'${MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM.key}', but got " +
          s"${PlatformDependent.maxDirectMemory()} bytes < " +
          s"${env.conf.get(MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM)}")
      }

      _resources = parseOrFindResources(resourcesFileOpt)
    } catch {
      case NonFatal(e) =>
        exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
    }
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      driver = Some(ref)
      // Send a RegisterExecutor message to the driver; see the driver-side handling in step 5
      ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls,
        extractAttributes, _resources, resourceProfile.id))
    }(ThreadUtils.sameThread).onComplete {
      case Success(_) =>
        self.send(RegisteredExecutor)
      case Failure(e) =>
        exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
    }(ThreadUtils.sameThread)
  }

  override def receive: PartialFunction[Any, Unit] = {
    // The driver accepted the executor registration and replied true, so run this logic
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        // Create an Executor
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false,
          resources = _resources)
        // Then send a LaunchedExecutor message to the driver;
        // see the driver-side handling in step 5
        driver.get.send(LaunchedExecutor(executorId))
      } catch {
        case NonFatal(e) =>
          exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
      }
  }

}
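
The registration handshake is fully asynchronous: ask returns a future, and only its successful completion triggers the self-sent RegisteredExecutor. Below is a minimal model of that completion pattern using plain Scala futures in place of Spark's RPC:

Scala:
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object RegistrationHandshake {
  implicit val ec: ExecutionContext = ExecutionContext.global

  def main(args: Array[String]): Unit = {
    // Stand-in for ref.ask[Boolean](RegisterExecutor(...)): here the driver replies true
    val registration: Future[Boolean] = Future { true }

    registration.onComplete {
      case Success(_) =>
        // Equivalent of self.send(RegisteredExecutor): the backend now builds
        // its Executor object and sends LaunchedExecutor to the driver
        println("Successfully registered with driver")
      case Failure(e) =>
        // Equivalent of exitExecutor(1, "Cannot register with driver: ...", e)
        println(s"Cannot register with driver: $e")
    }
    Thread.sleep(500) // let the callback run before this demo JVM exits
  }
}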

13. Executor

Scala:
// Spark executor, backed by a thread pool to run tasks.
// This can be used with Mesos, YARN, Kubernetes, and the standalone scheduler. An internal RPC
// interface is used for communication with the driver, except in the case of Mesos
// fine-grained mode.
private[spark] class Executor(...){

  // Maintains the list of running tasks
  private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]

  // When the executor fails to send heartbeats to the driver more than HEARTBEAT_MAX_FAILURES
  // times in a row, it should kill itself. The default is 60. For example, with a maximum of
  // 60 failures and a heartbeat interval of 10s, it will try to send heartbeats for up to
  // 600s (10 minutes).
  private val HEARTBEAT_MAX_FAILURES = conf.get(EXECUTOR_HEARTBEAT_MAX_FAILURES)
  // Interval between heartbeats, in ms: spark.executor.heartbeatInterval, default 10s
  private val HEARTBEAT_INTERVAL_MS = conf.get(EXECUTOR_HEARTBEAT_INTERVAL)
  // The heartbeat task
  private val heartbeater = new Heartbeater(
    () => Executor.this.reportHeartBeat(),
    "executor-heartbeater",
    HEARTBEAT_INTERVAL_MS)

  // Start the worker thread pool
  private val threadPool = {
    val threadFactory = new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat("Executor task launch worker-%d")
      .setThreadFactory((r: Runnable) => new UninterruptibleThread(r, "unused"))
      .build()
    Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
  }


  // Launch a task
  def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
    val tr = new TaskRunner(context, taskDescription, plugins)
    runningTasks.put(taskDescription.taskId, tr)
    threadPool.execute(tr)
    if (decommissioned) {
      log.error(s"Launching a task while in decommissioned state.")
    }
  }

}
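
The heartbeat bookkeeping described above can be sketched as a small scheduled task: send every spark.executor.heartbeatInterval (10s by default), reset the failure counter on success, and exit the JVM after HEARTBEAT_MAX_FAILURES consecutive failures. This is a simplified model, not Spark's Heartbeater class:

Scala:
import java.util.concurrent.{Executors, TimeUnit}

object HeartbeatSketch {
  def main(args: Array[String]): Unit = {
    val intervalMs  = 10000L // spark.executor.heartbeatInterval, default 10s
    val maxFailures = 60     // EXECUTOR_HEARTBEAT_MAX_FAILURES, default 60
    var failures    = 0

    // Stand-in for the real report: driver.askSync[HeartbeatResponse](Heartbeat(...))
    def reportHeartBeat(): Unit = println("heartbeat sent")

    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(() => {
      try {
        reportHeartBeat()
        failures = 0 // any successful heartbeat resets the counter
      } catch {
        case _: Exception =>
          failures += 1
          if (failures >= maxFailures) {
            // Spark exits with ExecutorExitCode.HEARTBEAT_FAILURE here,
            // and the Worker notices the process death
            System.exit(1)
          }
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }
}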

14. Summary

1. The application code builds a SparkContext from a SparkConf.

2. The task scheduler is created and started.

3. StandaloneSchedulerBackend's and CoarseGrainedSchedulerBackend's start() methods run.

4. The DriverEndpoint is created and waits for messages from other endpoints (e.g. the Master and Executor endpoints).

5. The executor launch arguments are assembled, with CoarseGrainedExecutorBackend as the main class.

6. A StandaloneAppClient is created and started.

7. The driver creates a ClientEndpoint and registers with the Master.

8. An app description is created and a RegisterApplication message is sent to the Master.

9. The Master schedules resources based on the app description, deciding how many executors to launch on which Workers.

10. The Master sends a LaunchExecutor message per executor to the chosen Workers, and an ExecutorAdded message to the Driver.

11. Each Worker spawns a thread that launches the packaged executor process (main class CoarseGrainedExecutorBackend).

12. The executor creates its Executor endpoint and registers with the Driver; on success it sends itself a RegisteredExecutor message.

13. Handling its own RegisteredExecutor message, the executor creates an Executor object and sends the Driver a LaunchedExecutor message.

14. The Executor object runs tasks on a thread pool and by default sends a heartbeat to the Driver every 10s.

To make this easier to understand and remember, we also drew a flow diagram; download it and zoom in for a clearer view.

大数据·hadoop·分布式