Flink is an open-source distributed computing framework under the Apache Software Foundation that unifies stream and batch processing, providing both real-time stream computation and high-throughput batch computation for big data workloads. This column records and shares an analysis of the Flink source code.
The Flink source code version analyzed in this article is flink-1.19.0.
1. Overview of Flink Compute Resource Allocation and Scheduling
The previous articles 《Flink-1.19.0源码详解7-Flink集群端调度》, 《Flink-1.19.0源码详解8-ExecutionGraph生成-前篇》, and 《Flink-1.19.0源码详解9-ExecutionGraph生成-后篇》 covered how Flink starts cluster-side scheduling and how the ExecutionGraph is generated. Once the JobMaster has generated the ExecutionGraph and partitioned it into SchedulingPipelinedRegions, it starts requesting CPU and memory resources from the Flink ResourceManager for each SchedulingPipelinedRegion, which begins the allocation and scheduling of Flink compute resources.
This article starts the analysis from the allocation and scheduling of compute resources for each SchedulingPipelinedRegion (the red part of the flowchart below).

Complete source code diagram:

2. Compute Resource Scheduling for SchedulingPipelinedRegions
The previous article 《Flink-1.19.0源码详解7-Flink集群端调度》 showed how Flink starts cluster-side scheduling and enters compute resource scheduling by calling DefaultScheduler's startSchedulingInternal() method. This article continues the analysis of the complete resource scheduling source code from that method.
In DefaultScheduler.startSchedulingInternal(), Flink first transitions the ExecutionGraph's state to RUNNING and then starts the actual scheduling by calling startScheduling() on PipelinedRegionSchedulingStrategy (the SchedulingStrategy implementation).
Source of DefaultScheduler.startSchedulingInternal():
java
protected void startSchedulingInternal() {
    log.info(
            "Starting scheduling with scheduling strategy [{}]",
            schedulingStrategy.getClass().getName());
    // Transition the ExecutionGraph's state to RUNNING
    transitionToRunning();
    // Start scheduling; the SchedulingStrategy implementation is PipelinedRegionSchedulingStrategy
    schedulingStrategy.startScheduling();
}
In PipelinedRegionSchedulingStrategy.startScheduling(), Flink collects from the schedulingTopology all SchedulingPipelinedRegions that contain sources and then calls PipelinedRegionSchedulingStrategy's maybeScheduleRegions() method to schedule them.
Source of PipelinedRegionSchedulingStrategy.startScheduling():
java
public void startScheduling() {
    // Collect from the schedulingTopology the SchedulingPipelinedRegions that contain sources;
    // resources are scheduled per SchedulingPipelinedRegion
    final Set<SchedulingPipelinedRegion> sourceRegions =
            IterableUtils.toStream(schedulingTopology.getAllPipelinedRegions())
                    .filter(this::isSourceRegion)
                    .collect(Collectors.toSet());
    // Start scheduling
    maybeScheduleRegions(sourceRegions);
}
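To illustrate the filter above, here is a minimal, self-contained sketch (not Flink's actual isSourceRegion() implementation) of what "source region" means: a region none of whose vertices consumes data produced outside the region. The Region and Vertex types below are hypothetical simplifications.
java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SourceRegionSketch {

    // Hypothetical, simplified model of a pipelined region and its vertices.
    record Vertex(String id, List<String> inputVertexIds) {}
    record Region(String id, List<Vertex> vertices) {}

    // A region counts as a "source region" if none of its vertices
    // reads input from a vertex that lives outside the region.
    static boolean isSourceRegion(Region region) {
        Set<String> memberIds =
                region.vertices().stream().map(Vertex::id).collect(Collectors.toSet());
        return region.vertices().stream()
                .flatMap(v -> v.inputVertexIds().stream())
                .allMatch(memberIds::contains);
    }

    public static void main(String[] args) {
        Region source = new Region("r1", List.of(new Vertex("src", List.of())));
        Region downstream = new Region("r2", List.of(new Vertex("map", List.of("src"))));
        System.out.println(isSourceRegion(source));     // true
        System.out.println(isSourceRegion(downstream)); // false
    }
}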
PipelinedRegionSchedulingStrategy's maybeScheduleRegions() method sorts the SchedulingPipelinedRegions in topological order and calls scheduleRegion() to schedule each individual SchedulingPipelinedRegion.
Source of PipelinedRegionSchedulingStrategy.maybeScheduleRegions():
java
private void maybeScheduleRegions(final Set<SchedulingPipelinedRegion> regions) {
    // Add each schedulable SchedulingPipelinedRegion to the regionsToSchedule set
    final Set<SchedulingPipelinedRegion> regionsToSchedule = new HashSet<>();
    Set<SchedulingPipelinedRegion> nextRegions = regions;
    while (!nextRegions.isEmpty()) {
        nextRegions = addSchedulableAndGetNextRegions(nextRegions, regionsToSchedule);
    }
    // schedule regions in topological order.
    // Sort the SchedulingPipelinedRegions topologically and call scheduleRegion()
    // on each individual SchedulingPipelinedRegion
    SchedulingStrategyUtils.sortPipelinedRegionsInTopologicalOrder(
                    schedulingTopology, regionsToSchedule)
            .forEach(this::scheduleRegion);
}
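sortPipelinedRegionsInTopologicalOrder orders the regions so that upstream regions are scheduled before downstream ones. Below is a minimal sketch of such an ordering using Kahn's algorithm over a hypothetical region dependency map; it is not Flink's implementation, only an illustration of the ordering.
java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RegionTopoSortSketch {

    // Hypothetical dependency map: region -> the upstream regions it depends on.
    static List<String> sortTopologically(Map<String, List<String>> upstreamOf) {
        Map<String, Integer> remainingDeps = new HashMap<>();
        Map<String, List<String>> downstreamOf = new HashMap<>();
        upstreamOf.forEach((region, ups) -> {
            remainingDeps.put(region, ups.size());
            for (String up : ups) {
                downstreamOf.computeIfAbsent(up, k -> new ArrayList<>()).add(region);
            }
        });

        // Start with regions that have no upstream dependencies (source regions).
        Deque<String> ready = new ArrayDeque<>();
        remainingDeps.forEach((region, deps) -> { if (deps == 0) ready.add(region); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String region = ready.poll();
            order.add(region);
            for (String down : downstreamOf.getOrDefault(region, List.of())) {
                if (remainingDeps.merge(down, -1, Integer::sum) == 0) {
                    ready.add(down);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // source -> middle -> sink
        System.out.println(sortTopologically(Map.of(
                "source", List.of(),
                "middle", List.of("source"),
                "sink", List.of("middle"))));
    }
}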
In PipelinedRegionSchedulingStrategy.scheduleRegion(), the current SchedulingPipelinedRegion is first added to the scheduledRegions set that tracks regions being scheduled, and then DefaultScheduler's allocateSlotsAndDeploy() method is called; DefaultScheduler is the concrete implementation of SchedulerOperations.
Source of PipelinedRegionSchedulingStrategy.scheduleRegion():
java
private void scheduleRegion(final SchedulingPipelinedRegion region) {
    checkState(
            areRegionVerticesAllInCreatedState(region),
            "BUG: trying to schedule a region which is not in CREATED state");
    // Add the region to the set of regions currently being scheduled
    scheduledRegions.add(region);
    // Execute the scheduling of the current region
    schedulerOperations.allocateSlotsAndDeploy(regionVerticesSorted.get(region));
}
DefaultScheduler's allocateSlotsAndDeploy() method obtains the current Execution of each ExecutionVertex in the SchedulingPipelinedRegion and schedules them, Execution by Execution, by calling DefaultExecutionDeployer's allocateSlotsAndDeploy() method; DefaultExecutionDeployer is the concrete implementation of ExecutionDeployer.
Source of DefaultScheduler.allocateSlotsAndDeploy():
java
public void allocateSlotsAndDeploy(final List<ExecutionVertexID> verticesToDeploy) {
    final Map<ExecutionVertexID, ExecutionVertexVersion> requiredVersionByVertex =
            executionVertexVersioner.recordVertexModifications(verticesToDeploy);
    // Obtain the current Execution of each ExecutionVertex in the SchedulingPipelinedRegion
    final List<Execution> executionsToDeploy =
            verticesToDeploy.stream()
                    .map(this::getCurrentExecutionOfVertex)
                    .collect(Collectors.toList());
    // Schedule the Executions of this region
    executionDeployer.allocateSlotsAndDeploy(executionsToDeploy, requiredVersionByVertex);
}
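The requiredVersionByVertex map comes from the ExecutionVertexVersioner, which guards against stale scheduling attempts: a vertex's version is recorded when it is scheduled, and a later deployment can be dropped if the version has changed in the meantime (for example after a failover). The following is a minimal sketch of this idea with a hypothetical versioner, not Flink's actual class.
java
import java.util.HashMap;
import java.util.Map;

public class VertexVersionerSketch {

    // Hypothetical versioner: bump and remember a counter per vertex id.
    private final Map<String, Long> versions = new HashMap<>();

    long recordModification(String vertexId) {
        return versions.merge(vertexId, 1L, Long::sum);
    }

    boolean isStale(String vertexId, long recordedVersion) {
        return versions.getOrDefault(vertexId, 0L) != recordedVersion;
    }

    public static void main(String[] args) {
        VertexVersionerSketch versioner = new VertexVersionerSketch();
        long v = versioner.recordModification("vertex-1"); // version recorded at scheduling time
        versioner.recordModification("vertex-1");          // e.g. the vertex was rescheduled
        // The original deployment is now stale and should be dropped.
        System.out.println(versioner.isStale("vertex-1", v)); // true
    }
}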
3. Compute Resource Scheduling for Executions
Entering DefaultExecutionDeployer.allocateSlotsAndDeploy() starts the scheduling of every Execution in the SchedulingPipelinedRegion.
DefaultExecutionDeployer's allocateSlotsAndDeploy() method is the main method for scheduling and executing Executions. It first validates the current state of each Execution and transitions it to SCHEDULED, then allocates slots for the List<Execution> of the current SchedulingPipelinedRegion, and finally creates DeploymentHandles for Task deployment, waits for the slot allocation results, and deploys the Tasks.
Source of DefaultExecutionDeployer.allocateSlotsAndDeploy():
java
public void allocateSlotsAndDeploy(
        final List<Execution> executionsToDeploy,
        final Map<ExecutionVertexID, ExecutionVertexVersion> requiredVersionByVertex) {
    // Validate the current state of each Execution
    validateExecutionStates(executionsToDeploy);
    // Transition the state to SCHEDULED
    transitionToScheduled(executionsToDeploy);
    // Allocate slots for the List<Execution> of the current SchedulingPipelinedRegion
    final Map<ExecutionAttemptID, ExecutionSlotAssignment> executionSlotAssignmentMap =
            allocateSlotsFor(executionsToDeploy);
    // Create DeploymentHandles for Task deployment
    final List<ExecutionDeploymentHandle> deploymentHandles =
            createDeploymentHandles(
                    executionsToDeploy, requiredVersionByVertex, executionSlotAssignmentMap);
    // Wait for the slot allocation results and deploy the Tasks
    waitForAllSlotsAndDeploy(deploymentHandles);
}
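waitForAllSlotsAndDeploy defers deployment until every slot future of the bulk has completed. A minimal sketch of this "collect futures, then act once all have completed" pattern with CompletableFuture is shown below; the Slot type and deploy() method are hypothetical stand-ins that only illustrate the pattern.
java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class WaitForSlotsSketch {

    record Slot(String id) {}

    // Deploy a task once its slot is available (hypothetical stand-in).
    static void deploy(Slot slot) {
        System.out.println("deploying task into slot " + slot.id());
    }

    public static void main(String[] args) {
        // Each element stands for the logical-slot future of one Execution.
        List<CompletableFuture<Slot>> slotFutures =
                List.of(
                        CompletableFuture.supplyAsync(() -> new Slot("slot-1")),
                        CompletableFuture.supplyAsync(() -> new Slot("slot-2")));

        // Only when all slot futures are complete are the tasks deployed, in order.
        CompletableFuture.allOf(slotFutures.toArray(new CompletableFuture[0]))
                .thenRun(() -> slotFutures.forEach(f -> deploy(f.join())))
                .join();
    }
}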
Let us first look at how slots are allocated for the SchedulingPipelinedRegion's List<Execution>. After several calls, first to DefaultExecutionDeployer's allocateSlotsFor() method and then to SlotSharingExecutionSlotAllocator's allocateSlotsFor() method, the code finally reaches SlotSharingExecutionSlotAllocator's allocateSlotsForVertices() method, where allocation begins.
Source of DefaultExecutionDeployer.allocateSlotsFor():
java
private Map<ExecutionAttemptID, ExecutionSlotAssignment> allocateSlotsFor(
        final List<Execution> executionsToDeploy) {
    final List<ExecutionAttemptID> executionAttemptIds =
            executionsToDeploy.stream()
                    .map(Execution::getAttemptId)
                    .collect(Collectors.toList());
    // Continue the call chain
    return executionSlotAllocator.allocateSlotsFor(executionAttemptIds);
}
Source of SlotSharingExecutionSlotAllocator.allocateSlotsFor():
java
public Map<ExecutionAttemptID, ExecutionSlotAssignment> allocateSlotsFor(
        List<ExecutionAttemptID> executionAttemptIds) {
    //...
    // Continue the call chain
    return allocateSlotsForVertices(vertexIds).stream()
            .collect(
                    Collectors.toMap(
                            vertexAssignment ->
                                    vertexIdToExecutionId.get(
                                            vertexAssignment.getExecutionVertexId()),
                            vertexAssignment ->
                                    new ExecutionSlotAssignment(
                                            vertexIdToExecutionId.get(
                                                    vertexAssignment.getExecutionVertexId()),
                                            vertexAssignment.getLogicalSlotFuture())));
}
SlotSharingExecutionSlotAllocator's allocateSlotsForVertices() method first partitions all ExecutionVertexIDs by shared slot into groups (groupsToAssign) and allocates resources per shared slot. If a group's shared slot has already been allocated, the group is assigned to that shared slot; otherwise SlotSharingExecutionSlotAllocator's allocateSharedSlots() method is called to request new resources.
Source of SlotSharingExecutionSlotAllocator.allocateSlotsForVertices():
java
private List<SlotExecutionVertexAssignment> allocateSlotsForVertices(
        List<ExecutionVertexID> executionVertexIds) {
    // Partition all ExecutionVertexIDs by shared slot into executionsByGroup (the groups to assign)
    SharedSlotProfileRetriever sharedSlotProfileRetriever =
            sharedSlotProfileRetrieverFactory.createFromBulk(new HashSet<>(executionVertexIds));
    Map<ExecutionSlotSharingGroup, List<ExecutionVertexID>> executionsByGroup =
            executionVertexIds.stream()
                    .collect(
                            Collectors.groupingBy(
                                    slotSharingStrategy::getExecutionSlotSharingGroup));
    Map<ExecutionSlotSharingGroup, SharedSlot> slots = new HashMap<>(executionsByGroup.size());
    Set<ExecutionSlotSharingGroup> groupsToAssign = new HashSet<>(executionsByGroup.keySet());
    // If a group's sharing group already has a shared slot, assign the group to that slot
    Map<ExecutionSlotSharingGroup, SharedSlot> assignedSlots =
            tryAssignExistingSharedSlots(groupsToAssign);
    slots.putAll(assignedSlots);
    groupsToAssign.removeAll(assignedSlots.keySet());
    // Otherwise, request slots for the remaining groups
    if (!groupsToAssign.isEmpty()) {
        // Enter slot allocation
        Map<ExecutionSlotSharingGroup, SharedSlot> allocatedSlots =
                allocateSharedSlots(groupsToAssign, sharedSlotProfileRetriever);
        slots.putAll(allocatedSlots);
        groupsToAssign.removeAll(allocatedSlots.keySet());
        Preconditions.checkState(groupsToAssign.isEmpty());
    }
    Map<ExecutionVertexID, SlotExecutionVertexAssignment> assignments =
            allocateLogicalSlotsFromSharedSlots(slots, executionsByGroup);
    // we need to pass the slots map to the createBulk method instead of using the allocator's
    // 'sharedSlots'
    // because if any physical slots have already failed, their shared slots have been removed
    // from the allocator's 'sharedSlots' by failed logical slots.
    SharingPhysicalSlotRequestBulk bulk = createBulk(slots, executionsByGroup);
    bulkChecker.schedulePendingRequestBulkTimeoutCheck(bulk, allocationTimeout);
    return executionVertexIds.stream().map(assignments::get).collect(Collectors.toList());
}
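The split between tryAssignExistingSharedSlots and allocateSharedSlots follows a simple cache-then-request pattern: groups whose sharing group already holds a shared slot reuse it, and only the remaining groups trigger new physical slot requests. A minimal sketch of this pattern, with hypothetical Group and SharedSlot types rather than Flink's classes, is shown below.
java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SharedSlotReuseSketch {

    record Group(String name) {}
    record SharedSlot(String id) {}

    // Cache of shared slots already allocated to sharing groups (hypothetical).
    private final Map<Group, SharedSlot> existingSharedSlots = new HashMap<>();
    private int nextSlotId = 0;

    Map<Group, SharedSlot> assign(Set<Group> groupsToAssign) {
        Map<Group, SharedSlot> result = new HashMap<>();
        Set<Group> remaining = new HashSet<>(groupsToAssign);

        // 1. Reuse shared slots that were already allocated for a group.
        for (Group group : groupsToAssign) {
            SharedSlot existing = existingSharedSlots.get(group);
            if (existing != null) {
                result.put(group, existing);
                remaining.remove(group);
            }
        }
        // 2. Request a new physical slot only for the groups that remain.
        for (Group group : remaining) {
            SharedSlot fresh = new SharedSlot("slot-" + nextSlotId++);
            existingSharedSlots.put(group, fresh);
            result.put(group, fresh);
        }
        return result;
    }

    public static void main(String[] args) {
        SharedSlotReuseSketch allocator = new SharedSlotReuseSketch();
        Group g1 = new Group("g1");
        System.out.println(allocator.assign(Set.of(g1))); // requests a new slot
        System.out.println(allocator.assign(Set.of(g1))); // reuses the same slot
    }
}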
Resources are requested per ExecutionSlotSharingGroup, i.e. per group of ExecutionVertices in the SchedulingPipelinedRegion that share the same slot. The method first iterates over all ExecutionSlotSharingGroups of the region and creates a slot request for each group, then submits the SlotRequests to the ResourceManager, and finally creates a SharedSlot instance to wrap each allocated slot.
Source of SlotSharingExecutionSlotAllocator.allocateSharedSlots():
java
private Map<ExecutionSlotSharingGroup, SharedSlot> allocateSharedSlots(
        Set<ExecutionSlotSharingGroup> executionSlotSharingGroups,
        SharedSlotProfileRetriever sharedSlotProfileRetriever) {
    List<PhysicalSlotRequest> slotRequests = new ArrayList<>();
    Map<ExecutionSlotSharingGroup, SharedSlot> allocatedSlots = new HashMap<>();
    Map<SlotRequestId, ExecutionSlotSharingGroup> requestToGroup = new HashMap<>();
    Map<SlotRequestId, ResourceProfile> requestToPhysicalResources = new HashMap<>();
    // Iterate over all ExecutionSlotSharingGroups of the region and create a slot request for each
    for (ExecutionSlotSharingGroup group : executionSlotSharingGroups) {
        SlotRequestId physicalSlotRequestId = new SlotRequestId();
        ResourceProfile physicalSlotResourceProfile = getPhysicalSlotResourceProfile(group);
        SlotProfile slotProfile =
                sharedSlotProfileRetriever.getSlotProfile(group, physicalSlotResourceProfile);
        PhysicalSlotRequest request =
                new PhysicalSlotRequest(
                        physicalSlotRequestId, slotProfile, slotWillBeOccupiedIndefinitely);
        slotRequests.add(request);
        requestToGroup.put(physicalSlotRequestId, group);
        requestToPhysicalResources.put(physicalSlotRequestId, physicalSlotResourceProfile);
    }
    // Submit the SlotRequests to the ResourceManager
    Map<SlotRequestId, CompletableFuture<PhysicalSlotRequest.Result>> allocateResult =
            slotProvider.allocatePhysicalSlots(slotRequests);
    // Create SharedSlot instances to wrap the slot resources
    allocateResult.forEach(
            (slotRequestId, resultCompletableFuture) -> {
                ExecutionSlotSharingGroup group = requestToGroup.get(slotRequestId);
                CompletableFuture<PhysicalSlot> physicalSlotFuture =
                        resultCompletableFuture.thenApply(
                                PhysicalSlotRequest.Result::getPhysicalSlot);
                SharedSlot slot =
                        new SharedSlot(
                                slotRequestId,
                                requestToPhysicalResources.get(slotRequestId),
                                group,
                                physicalSlotFuture,
                                slotWillBeOccupiedIndefinitely,
                                this::releaseSharedSlot);
                allocatedSlots.put(group, slot);
                Preconditions.checkState(!sharedSlots.containsKey(group));
                sharedSlots.put(group, slot);
            });
    return allocatedSlots;
}
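getPhysicalSlotResourceProfile(group) determines the physical resources a shared slot must provide for all executions placed in it, which conceptually means combining the per-execution requirements of the group into one profile. Below is a minimal sketch of that combination over a hypothetical, simplified resource profile; Flink's real ResourceProfile is richer (managed memory, network memory, extended resources, UNKNOWN profiles, etc.).
java
import java.util.List;

public class ResourceProfileMergeSketch {

    // Hypothetical, simplified resource profile: CPU cores and memory in MB.
    record SimpleProfile(double cpuCores, int memoryMb) {

        SimpleProfile merge(SimpleProfile other) {
            return new SimpleProfile(cpuCores + other.cpuCores, memoryMb + other.memoryMb);
        }
    }

    // Combine the requirements of all executions sharing one slot into a single physical profile.
    static SimpleProfile physicalSlotProfileFor(List<SimpleProfile> executionProfiles) {
        return executionProfiles.stream()
                .reduce(new SimpleProfile(0, 0), SimpleProfile::merge);
    }

    public static void main(String[] args) {
        // e.g. a source, a map and a sink subtask placed in the same shared slot
        SimpleProfile physical =
                physicalSlotProfileFor(
                        List.of(
                                new SimpleProfile(0.5, 256),
                                new SimpleProfile(1.0, 512),
                                new SimpleProfile(0.5, 256)));
        System.out.println(physical); // SimpleProfile[cpuCores=2.0, memoryMb=1024]
    }
}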
The SlotRequests are submitted in PhysicalSlotProviderImpl.allocatePhysicalSlots(). This method iterates over the slot requests and, for each ExecutionSlotSharingGroup whose request cannot be served by an already available physical slot, calls PhysicalSlotProviderImpl's requestNewSlot() method to submit a new slot request.
Source of PhysicalSlotProviderImpl.allocatePhysicalSlots():
java
public Map<SlotRequestId, CompletableFuture<PhysicalSlotRequest.Result>> allocatePhysicalSlots(
        Collection<PhysicalSlotRequest> physicalSlotRequests) {
    //...
    // Submit the SlotRequests
    return availablePhysicalSlots.entrySet().stream()
            .collect(
                    //..
                    CompletableFuture<PhysicalSlot> slotFuture =
                            availablePhysicalSlot
                                    .map(CompletableFuture::completedFuture)
                                    .orElseGet(
                                            // Request a new slot for each ExecutionSlotSharingGroup
                                            // that has no available physical slot
                                            () ->
                                                    requestNewSlot(
                                                            slotRequestId,
                                                            resourceProfile,
                                                            slotProfile
                                                                    .getPreferredAllocations(),
                                                            physicalSlotRequest
                                                                    .willSlotBeOccupiedIndefinitely()));
                    return slotFuture.thenApply(
                            physicalSlot ->
                                    new PhysicalSlotRequest.Result(
                                            slotRequestId, physicalSlot));
                }));
}
The request then passes through a further series of calls:

Eventually DeclarativeSlotPoolService's connectToResourceManager() method is reached, which sends the slot resource requirements to Flink's ResourceManager through the ResourceManagerGateway.
Source of DeclarativeSlotPoolService.connectToResourceManager():
java
public void connectToResourceManager(ResourceManagerGateway resourceManagerGateway) {
    assertHasBeenStarted();
    // Send the slot resource requirements to the ResourceManager via the ResourceManagerGateway
    resourceRequirementServiceConnectionManager.connect(
            resourceRequirements ->
                    resourceManagerGateway.declareRequiredResources(
                            jobMasterId, resourceRequirements, rpcTimeout));
    // Declare the slot pool's current resource requirements
    declareResourceRequirements(declarativeSlotPool.getResourceRequirements());
}
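declareRequiredResources is declarative: instead of asking for individual slots one by one, the JobMaster tells the ResourceManager how many slots of which resource profile it currently needs in total. A minimal sketch of aggregating such a declaration from the requested shared slots is shown below; the Profile and Requirement types are hypothetical and only mirror the idea of (profile, count) requirements.
java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResourceDeclarationSketch {

    // Hypothetical (profile, count) requirement, mirroring the declarative style.
    record Profile(double cpuCores, int memoryMb) {}
    record Requirement(Profile profile, int numberOfSlots) {}

    // Aggregate the per-slot requests into "N slots of profile P" declarations.
    static List<Requirement> declare(List<Profile> requestedSlotProfiles) {
        Map<Profile, Integer> counts = new LinkedHashMap<>();
        requestedSlotProfiles.forEach(p -> counts.merge(p, 1, Integer::sum));
        return counts.entrySet().stream()
                .map(e -> new Requirement(e.getKey(), e.getValue()))
                .toList();
    }

    public static void main(String[] args) {
        Profile small = new Profile(1.0, 1024);
        Profile large = new Profile(2.0, 4096);
        // Three shared slots were requested: two small ones and one large one.
        System.out.println(declare(List.of(small, small, large)));
    }
}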
4. Conclusion
At this point, scheduling moves from the JobMaster to Flink's ResourceManager for ResourceManager-level resource scheduling. The source code analysis of ResourceManager scheduling is covered in 《Flink-1.19.0源码详解11-Flink ResourceManager资源调度》.