Flink 1.19.0 Source Code Walkthrough 10: Requesting and Scheduling Flink Compute Resources

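Flink is an open-source distributed computing framework under the Apache Software Foundation that unifies stream and batch processing, offering both real-time stream computation and high-throughput batch processing. This column records and shares my analysis of the Flink source code.

The Flink source version analyzed in this article is flink-1.19.0.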

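1. Overview of Compute Resource Requests and Scheduling in Flink

The earlier article "Flink 1.19.0 Source Code Walkthrough 7: Cluster-Side Scheduling in Flink" described how Flink starts cluster-side scheduling, and "Flink 1.19.0 Source Code Walkthrough 8: ExecutionGraph Generation, Part 1" and "Flink 1.19.0 Source Code Walkthrough 9: ExecutionGraph Generation, Part 2" covered how the ExecutionGraph is generated. Once the JobMaster has generated the ExecutionGraph and partitioned it into SchedulingPipelinedRegions, it starts requesting CPU and memory resources from Flink's ResourceManager for each SchedulingPipelinedRegion. This is where the request and scheduling of compute resources begins.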

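This article picks up at the point where Flink requests and schedules compute resources for each SchedulingPipelinedRegion (the part marked in red in the flowchart below).

Complete annotated source diagram: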

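2. Scheduling Compute Resources for a SchedulingPipelinedRegion

The earlier article "Flink 1.19.0 Source Code Walkthrough 7: Cluster-Side Scheduling in Flink" showed how Flink starts cluster-side scheduling and enters compute resource scheduling by calling DefaultScheduler.startSchedulingInternal(). This article continues from that method and follows the complete source path of compute resource scheduling.

In DefaultScheduler.startSchedulingInternal(), Flink first transitions the ExecutionGraph to the RUNNING state and then starts the actual scheduling by calling startScheduling() on PipelinedRegionSchedulingStrategy, the SchedulingStrategy implementation in use.

Source of DefaultScheduler.startSchedulingInternal():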

protected void startSchedulingInternal() {
    log.info(
            "Starting scheduling with scheduling strategy [{}]",
            schedulingStrategy.getClass().getName());
    // Transition the ExecutionGraph to the RUNNING state
    transitionToRunning();
    
    // Start scheduling; the concrete SchedulingStrategy here is PipelinedRegionSchedulingStrategy
    schedulingStrategy.startScheduling();
}

In PipelinedRegionSchedulingStrategy.startScheduling(), Flink finds the SchedulingPipelinedRegions in schedulingTopology that contain sources and then calls PipelinedRegionSchedulingStrategy.maybeScheduleRegions() to schedule them.

Source of PipelinedRegionSchedulingStrategy.startScheduling():

public void startScheduling() {
    // Find the SchedulingPipelinedRegions in schedulingTopology that contain sources; resources are scheduled per SchedulingPipelinedRegion
    final Set<SchedulingPipelinedRegion> sourceRegions =
            IterableUtils.toStream(schedulingTopology.getAllPipelinedRegions())
                    .filter(this::isSourceRegion)
                    .collect(Collectors.toSet());
                    
    // Start scheduling
    maybeScheduleRegions(sourceRegions);
}

PipelinedRegionSchedulingStrategy.maybeScheduleRegions() collects the regions that can be scheduled, sorts them in topological order, and calls scheduleRegion() on each SchedulingPipelinedRegion.

Source of PipelinedRegionSchedulingStrategy.maybeScheduleRegions():

private void maybeScheduleRegions(final Set<SchedulingPipelinedRegion> regions) {
    // Collect every schedulable SchedulingPipelinedRegion into the regionsToSchedule set
    final Set<SchedulingPipelinedRegion> regionsToSchedule = new HashSet<>();
    Set<SchedulingPipelinedRegion> nextRegions = regions;
    while (!nextRegions.isEmpty()) {
        nextRegions = addSchedulableAndGetNextRegions(nextRegions, regionsToSchedule);
    }
    
    // schedule regions in topological order.
    // Sort the SchedulingPipelinedRegions and call scheduleRegion() on each one
    SchedulingStrategyUtils.sortPipelinedRegionsInTopologicalOrder(
                    schedulingTopology, regionsToSchedule)
            .forEach(this::scheduleRegion);
}
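
The two steps above (pick the source regions, then schedule regions in topological order) can be illustrated with a small standalone sketch. This is not Flink's implementation: the class `RegionOrderSketch`, the string region names, and the `downstream` edge map are all hypothetical, and Kahn's algorithm is used here only as one common way to obtain a topological order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy model (not Flink code): regions are names, edges go from upstream to downstream. */
final class RegionOrderSketch {

    /** Regions without any upstream region, loosely analogous to the source regions. */
    static List<String> sourceRegions(List<String> regions, Map<String, List<String>> downstream) {
        Map<String, Integer> inDegree = inDegrees(regions, downstream);
        return regions.stream().filter(r -> inDegree.get(r) == 0).collect(Collectors.toList());
    }

    /** Kahn's algorithm: a region is emitted only after all of its upstream regions. */
    static List<String> topologicalOrder(
            List<String> regions, Map<String, List<String>> downstream) {
        Map<String, Integer> inDegree = inDegrees(regions, downstream);
        Deque<String> ready = new ArrayDeque<>(sourceRegions(regions, downstream));
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String region = ready.removeFirst();
            order.add(region);
            for (String next : downstream.getOrDefault(region, List.of())) {
                // The downstream region becomes ready once its last upstream region is emitted.
                if (inDegree.merge(next, -1, Integer::sum) == 0) {
                    ready.addLast(next);
                }
            }
        }
        return order;
    }

    /** Counts, for every region, how many upstream regions point at it. */
    private static Map<String, Integer> inDegrees(
            List<String> regions, Map<String, List<String>> downstream) {
        Map<String, Integer> inDegree = new HashMap<>();
        regions.forEach(r -> inDegree.put(r, 0));
        downstream.values().forEach(list -> list.forEach(d -> inDegree.merge(d, 1, Integer::sum)));
        return inDegree;
    }

    public static void main(String[] args) {
        List<String> regions = List.of("sink", "join", "sourceA", "sourceB");
        Map<String, List<String>> downstream =
                Map.of(
                        "sourceA", List.of("join"),
                        "sourceB", List.of("join"),
                        "join", List.of("sink"));
        System.out.println(sourceRegions(regions, downstream));
        System.out.println(topologicalOrder(regions, downstream));
    }
}
```

In this toy graph the two source regions come first, then the join region, then the sink region, which mirrors the guarantee that an upstream region is handed to scheduleRegion() before the regions that consume from it.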

In PipelinedRegionSchedulingStrategy.scheduleRegion(), the current SchedulingPipelinedRegion is first added to the scheduledRegions collection, and then DefaultScheduler.allocateSlotsAndDeploy() is called. DefaultScheduler is the concrete implementation of SchedulerOperations.

Source of PipelinedRegionSchedulingStrategy.scheduleRegion():

private void scheduleRegion(final SchedulingPipelinedRegion region) {
    checkState(
            areRegionVerticesAllInCreatedState(region),
            "BUG: trying to schedule a region which is not in CREATED state");
    // Record the region in scheduledRegions
    scheduledRegions.add(region);
    // Schedule the vertices of this region
    schedulerOperations.allocateSlotsAndDeploy(regionVerticesSorted.get(region));
}

DefaultScheduler.allocateSlotsAndDeploy() looks up the current Execution of every ExecutionVertex in the SchedulingPipelinedRegion and passes these Executions to DefaultExecutionDeployer.allocateSlotsAndDeploy() for scheduling. DefaultExecutionDeployer is the concrete implementation of ExecutionDeployer.

Source of DefaultScheduler.allocateSlotsAndDeploy():

public void allocateSlotsAndDeploy(final List<ExecutionVertexID> verticesToDeploy) {
    final Map<ExecutionVertexID, ExecutionVertexVersion> requiredVersionByVertex =
            executionVertexVersioner.recordVertexModifications(verticesToDeploy);
    // Get the current Execution of each ExecutionVertex in this SchedulingPipelinedRegion
    final List<Execution> executionsToDeploy =
            verticesToDeploy.stream()
                    .map(this::getCurrentExecutionOfVertex)
                    .collect(Collectors.toList());
    // Schedule the Executions of this region
    executionDeployer.allocateSlotsAndDeploy(executionsToDeploy, requiredVersionByVertex);
}

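3. Scheduling Compute Resources for Executions

Once execution enters DefaultExecutionDeployer.allocateSlotsAndDeploy(), Flink starts scheduling each Execution in the SchedulingPipelinedRegion.

DefaultExecutionDeployer.allocateSlotsAndDeploy() is the main method for scheduling and running Executions. It first validates the current state of each Execution and transitions it to SCHEDULED. It then allocates slots for the List<Execution> of the current SchedulingPipelinedRegion. Finally, it creates ExecutionDeploymentHandles to handle task deployment, waits for the slot allocation results, and deploys the tasks.

Source of DefaultExecutionDeployer.allocateSlotsAndDeploy():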

public void allocateSlotsAndDeploy(
        final List<Execution> executionsToDeploy,
        final Map<ExecutionVertexID, ExecutionVertexVersion> requiredVersionByVertex) {
    // Validate the current state of each Execution
    validateExecutionStates(executionsToDeploy);

    // Transition the Executions to SCHEDULED
    transitionToScheduled(executionsToDeploy);

    // Allocate slots for the List<Execution> of the current SchedulingPipelinedRegion
    final Map<ExecutionAttemptID, ExecutionSlotAssignment> executionSlotAssignmentMap =
            allocateSlotsFor(executionsToDeploy);

    // Create the ExecutionDeploymentHandles that drive task deployment
    final List<ExecutionDeploymentHandle> deploymentHandles =
            createDeploymentHandles(
                    executionsToDeploy, requiredVersionByVertex, executionSlotAssignmentMap);
    // Wait for the slot allocation results, then deploy the tasks
    waitForAllSlotsAndDeploy(deploymentHandles);
}
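
The last step, waiting for every slot before deploying, follows a common `CompletableFuture` pattern. The sketch below is only an illustration of that pattern under simplifying assumptions: `WaitAndDeploySketch`, `deployWhenAllSlotsReady`, and the use of plain strings for slots are hypothetical and do not correspond to Flink's actual waitForAllSlotsAndDeploy() logic, which also handles failures and partition registration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

/** Toy model (not Flink code): deploy only after every slot future has completed. */
final class WaitAndDeploySketch {

    /** Returns a future that completes with one "deployment" per slot, in request order. */
    static CompletableFuture<List<String>> deployWhenAllSlotsReady(
            List<CompletableFuture<String>> slotFutures) {
        CompletableFuture<Void> allReady =
                CompletableFuture.allOf(slotFutures.toArray(new CompletableFuture<?>[0]));
        // join() cannot block here because allOf guarantees every future is already complete.
        return allReady.thenApply(
                ignored ->
                        slotFutures.stream()
                                .map(slot -> "deployed on " + slot.join())
                                .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        CompletableFuture<String> first = new CompletableFuture<>();
        CompletableFuture<String> second = new CompletableFuture<>();
        CompletableFuture<List<String>> deployed = deployWhenAllSlotsReady(List.of(first, second));

        first.complete("slot-1");
        System.out.println(deployed.isDone()); // still waiting for the second slot
        second.complete("slot-2");
        System.out.println(deployed.join());
    }
}
```

Deploying as a bulk, rather than task by task, matches the region-level granularity described above: the tasks of a pipelined region need to run together, so nothing is deployed until all of their slots are available.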

We first look at slot allocation for the region's List<Execution>. The call passes through DefaultExecutionDeployer.allocateSlotsFor(), then SlotSharingExecutionSlotAllocator.allocateSlotsFor(), and finally reaches SlotSharingExecutionSlotAllocator.allocateSlotsForVertices(), where the allocation begins.

Source of DefaultExecutionDeployer.allocateSlotsFor():

private Map<ExecutionAttemptID, ExecutionSlotAssignment> allocateSlotsFor(
        final List<Execution> executionsToDeploy) {
    final List<ExecutionAttemptID> executionAttemptIds =
            executionsToDeploy.stream()
                    .map(Execution::getAttemptId)
                    .collect(Collectors.toList());
    // Delegate to the ExecutionSlotAllocator
    return executionSlotAllocator.allocateSlotsFor(executionAttemptIds);
}

Source of SlotSharingExecutionSlotAllocator.allocateSlotsFor():

public Map<ExecutionAttemptID, ExecutionSlotAssignment> allocateSlotsFor(
        List<ExecutionAttemptID> executionAttemptIds) {
    //...
    // Delegate to allocateSlotsForVertices()
    return allocateSlotsForVertices(vertexIds).stream()
            .collect(
                    Collectors.toMap(
                            vertexAssignment ->
                                    vertexIdToExecutionId.get(
                                            vertexAssignment.getExecutionVertexId()),
                            vertexAssignment ->
                                    new ExecutionSlotAssignment(
                                            vertexIdToExecutionId.get(
                                                    vertexAssignment.getExecutionVertexId()),
                                            vertexAssignment.getLogicalSlotFuture())));
}

SlotSharingExecutionSlotAllocator.allocateSlotsForVertices() first groups all ExecutionVertices by the slot they share, producing the groupsToAssign set, and allocates resources per shared slot. If the shared slot of a group has already been allocated, the group is assigned to that existing shared slot. Otherwise, SlotSharingExecutionSlotAllocator.allocateSharedSlots() is called to request resources.

Source of SlotSharingExecutionSlotAllocator.allocateSlotsForVertices():

private List<SlotExecutionVertexAssignment> allocateSlotsForVertices(
        List<ExecutionVertexID> executionVertexIds) {

    // Group the ExecutionVertexIDs by ExecutionSlotSharingGroup (executionsByGroup; its key set becomes groupsToAssign)
    SharedSlotProfileRetriever sharedSlotProfileRetriever =
            sharedSlotProfileRetrieverFactory.createFromBulk(new HashSet<>(executionVertexIds));   
    Map<ExecutionSlotSharingGroup, List<ExecutionVertexID>> executionsByGroup =
            executionVertexIds.stream()
                    .collect(
                            Collectors.groupingBy(
                                    slotSharingStrategy::getExecutionSlotSharingGroup));

    Map<ExecutionSlotSharingGroup, SharedSlot> slots = new HashMap<>(executionsByGroup.size());
    Set<ExecutionSlotSharingGroup> groupsToAssign = new HashSet<>(executionsByGroup.keySet());

    // If a group's ExecutionSlotSharingGroup already has a SharedSlot, assign the group to that slot
    Map<ExecutionSlotSharingGroup, SharedSlot> assignedSlots =
            tryAssignExistingSharedSlots(groupsToAssign);
    slots.putAll(assignedSlots);
    groupsToAssign.removeAll(assignedSlots.keySet());


    // Otherwise, request slots for the groups that are still unassigned
    if (!groupsToAssign.isEmpty()) {
        // Enter slot allocation
        Map<ExecutionSlotSharingGroup, SharedSlot> allocatedSlots =
                allocateSharedSlots(groupsToAssign, sharedSlotProfileRetriever);
        slots.putAll(allocatedSlots);
        groupsToAssign.removeAll(allocatedSlots.keySet());
        Preconditions.checkState(groupsToAssign.isEmpty());
    }

    Map<ExecutionVertexID, SlotExecutionVertexAssignment> assignments =
            allocateLogicalSlotsFromSharedSlots(slots, executionsByGroup);

    // we need to pass the slots map to the createBulk method instead of using the allocator's
    // 'sharedSlots'
    // because if any physical slots have already failed, their shared slots have been removed
    // from the allocator's 'sharedSlots' by failed logical slots.
    SharingPhysicalSlotRequestBulk bulk = createBulk(slots, executionsByGroup);
    bulkChecker.schedulePendingRequestBulkTimeoutCheck(bulk, allocationTimeout);

    return executionVertexIds.stream().map(assignments::get).collect(Collectors.toList());
}
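
The "group, reuse, then allocate" logic can be reduced to a few lines. The following is a minimal sketch, not Flink's allocator: `SharedSlotSketch`, the string slot ids, and the `groupOf` function (which in the usage example derives the group from a `#subtaskIndex` suffix) are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy model (not Flink code): vertices of the same sharing group end up in one shared slot. */
final class SharedSlotSketch {
    // group -> shared slot id; survives across calls so later regions can reuse slots
    private final Map<String, String> sharedSlots = new HashMap<>();
    private int nextSlot = 0;

    /** Returns the slot assigned to every vertex, allocating a slot only for new groups. */
    Map<String, String> assign(List<String> vertexIds, Function<String, String> groupOf) {
        Map<String, List<String>> verticesByGroup =
                vertexIds.stream().collect(Collectors.groupingBy(groupOf));
        Map<String, String> slotByVertex = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : verticesByGroup.entrySet()) {
            // Reuse the group's existing shared slot, or "allocate" a fresh one.
            String slot = sharedSlots.computeIfAbsent(entry.getKey(), g -> "slot-" + nextSlot++);
            entry.getValue().forEach(vertex -> slotByVertex.put(vertex, slot));
        }
        return slotByVertex;
    }

    public static void main(String[] args) {
        SharedSlotSketch allocator = new SharedSlotSketch();
        // Assumption for this example only: the suffix after '#' identifies the sharing group.
        Function<String, String> groupOf = vertex -> vertex.substring(vertex.indexOf('#'));
        System.out.println(allocator.assign(List.of("source#0", "map#0", "source#1"), groupOf));
        System.out.println(allocator.assign(List.of("sink#0"), groupOf));
    }
}
```

In the usage example, `sink#0` arrives in a later call but still lands in the slot already held by `source#0` and `map#0`, which is the role tryAssignExistingSharedSlots() plays in the method above.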

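Resources are requested per ExecutionSlotSharingGroup, that is, per group of vertices in the SchedulingPipelinedRegion that share one slot. allocateSharedSlots() first iterates over the region's ExecutionSlotSharingGroups and creates a PhysicalSlotRequest for each, then submits the requests (on their way to the ResourceManager) through slotProvider.allocatePhysicalSlots(), and finally creates a SharedSlot instance that wraps each requested slot.

Source of SlotSharingExecutionSlotAllocator.allocateSharedSlots():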

private Map<ExecutionSlotSharingGroup, SharedSlot> allocateSharedSlots(
        Set<ExecutionSlotSharingGroup> executionSlotSharingGroups,
        SharedSlotProfileRetriever sharedSlotProfileRetriever) {

    List<PhysicalSlotRequest> slotRequests = new ArrayList<>();
    Map<ExecutionSlotSharingGroup, SharedSlot> allocatedSlots = new HashMap<>();

    Map<SlotRequestId, ExecutionSlotSharingGroup> requestToGroup = new HashMap<>();
    Map<SlotRequestId, ResourceProfile> requestToPhysicalResources = new HashMap<>();
     
    // Iterate over the region's ExecutionSlotSharingGroups and create a PhysicalSlotRequest for each
    for (ExecutionSlotSharingGroup group : executionSlotSharingGroups) {
        SlotRequestId physicalSlotRequestId = new SlotRequestId();
        ResourceProfile physicalSlotResourceProfile = getPhysicalSlotResourceProfile(group);
        SlotProfile slotProfile =
                sharedSlotProfileRetriever.getSlotProfile(group, physicalSlotResourceProfile);
        PhysicalSlotRequest request =
                new PhysicalSlotRequest(
                        physicalSlotRequestId, slotProfile, slotWillBeOccupiedIndefinitely);
        slotRequests.add(request);
        requestToGroup.put(physicalSlotRequestId, group);
        requestToPhysicalResources.put(physicalSlotRequestId, physicalSlotResourceProfile);
    }

    // Submit the slot requests through the PhysicalSlotProvider (they eventually reach the ResourceManager)
    Map<SlotRequestId, CompletableFuture<PhysicalSlotRequest.Result>> allocateResult =
            slotProvider.allocatePhysicalSlots(slotRequests);

    // Wrap each requested slot in a SharedSlot instance
    allocateResult.forEach(
            (slotRequestId, resultCompletableFuture) -> {
                ExecutionSlotSharingGroup group = requestToGroup.get(slotRequestId);
                CompletableFuture<PhysicalSlot> physicalSlotFuture =
                        resultCompletableFuture.thenApply(
                                PhysicalSlotRequest.Result::getPhysicalSlot);
                SharedSlot slot =
                        new SharedSlot(
                                slotRequestId,
                                requestToPhysicalResources.get(slotRequestId),
                                group,
                                physicalSlotFuture,
                                slotWillBeOccupiedIndefinitely,
                                this::releaseSharedSlot);
                allocatedSlots.put(group, slot);
                Preconditions.checkState(!sharedSlots.containsKey(group));
                sharedSlots.put(group, slot);
            });
    return allocatedSlots;
}

The requests are submitted by PhysicalSlotProviderImpl.allocatePhysicalSlots(). This method iterates over the requests, one per ExecutionSlotSharingGroup, and calls PhysicalSlotProviderImpl.requestNewSlot() for every request that cannot be served from an already available slot.

public Map<SlotRequestId, CompletableFuture<PhysicalSlotRequest.Result>> allocatePhysicalSlots(
        Collection<PhysicalSlotRequest> physicalSlotRequests) {
    //...
    // Submit the slot requests
    return availablePhysicalSlots.entrySet().stream()
            .collect(
                    //..
                        CompletableFuture<PhysicalSlot> slotFuture =
                                availablePhysicalSlot
                                        .map(CompletableFuture::completedFuture)
                                        .orElseGet(
                                                // No available slot: request a new one for this ExecutionSlotSharingGroup
                                                () ->
                                                        requestNewSlot(
                                                                slotRequestId,
                                                                resourceProfile,
                                                                slotProfile
                                                                        .getPreferredAllocations(),
                                                                physicalSlotRequest
                                                                        .willSlotBeOccupiedIndefinitely()));

                        return slotFuture.thenApply(
                                physicalSlot ->
                                        new PhysicalSlotRequest.Result(
                                                slotRequestId, physicalSlot));
                    }));
}
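
The `map(CompletableFuture::completedFuture).orElseGet(...)` idiom in the fragment above is worth isolating: an available slot yields an already completed future, and only a miss triggers a new request. Here is a minimal sketch of just that idiom; `SlotOrRequestSketch`, `slotFuture`, and the string slot type are hypothetical names chosen for the example.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Toy model (not Flink code): use an available slot if present, otherwise request one. */
final class SlotOrRequestSketch {

    static CompletableFuture<String> slotFuture(
            Optional<String> availableSlot, Supplier<CompletableFuture<String>> requestNewSlot) {
        // orElseGet is lazy: the supplier runs only when no slot is available.
        return availableSlot.map(CompletableFuture::completedFuture).orElseGet(requestNewSlot);
    }

    public static void main(String[] args) {
        Supplier<CompletableFuture<String>> requestNewSlot =
                () -> {
                    System.out.println("requesting a new slot");
                    return new CompletableFuture<>(); // completed later by the slot pool
                };

        CompletableFuture<String> hit = slotFuture(Optional.of("slot-1"), requestNewSlot);
        System.out.println(hit.isDone());

        CompletableFuture<String> miss = slotFuture(Optional.empty(), requestNewSlot);
        System.out.println(miss.isDone());
    }
}
```

Because both branches return a `CompletableFuture`, the caller can treat "slot already available" and "slot still being requested" uniformly, which is why the original method can simply chain `thenApply` afterwards.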

After a further chain of calls, the request eventually reaches DeclarativeSlotPoolService.connectToResourceManager(), which sends the slot resource request to Flink's ResourceManager through the ResourceManagerGateway.

Source of DeclarativeSlotPoolService.connectToResourceManager():

public void connectToResourceManager(ResourceManagerGateway resourceManagerGateway) {
    assertHasBeenStarted();
    
    // Register the callback that declares resource requirements to the ResourceManager via the ResourceManagerGateway
    resourceRequirementServiceConnectionManager.connect(
            resourceRequirements ->
                    resourceManagerGateway.declareRequiredResources(
                            jobMasterId, resourceRequirements, rpcTimeout));
    // Declare the slot pool's current resource requirements to the ResourceManager
    declareResourceRequirements(declarativeSlotPool.getResourceRequirements());
}
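
The shape of this method (connect first, then declare whatever the pool currently requires) can be mimicked with a small model. This is a simplified sketch under stated assumptions, not the real service: `DeclarativeSlotPoolSketch`, `setRequiredSlots`, and the reduction of resource requirements to a single slot count are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model (not Flink code): requirements are remembered and declared once connected. */
final class DeclarativeSlotPoolSketch {
    private Consumer<Integer> resourceManager; // null while disconnected
    private int requiredSlots;

    /** Updates the requirement and declares it right away if a connection exists. */
    void setRequiredSlots(int slots) {
        requiredSlots = slots;
        declareIfConnected();
    }

    /** Connects, then declares the current requirement so earlier updates are not lost. */
    void connectToResourceManager(Consumer<Integer> gateway) {
        resourceManager = gateway;
        declareIfConnected();
    }

    private void declareIfConnected() {
        if (resourceManager != null) {
            resourceManager.accept(requiredSlots);
        }
    }

    public static void main(String[] args) {
        List<Integer> received = new ArrayList<>();
        DeclarativeSlotPoolSketch pool = new DeclarativeSlotPoolSketch();
        pool.setRequiredSlots(2); // not connected yet, nothing is sent
        pool.connectToResourceManager(received::add);
        pool.setRequiredSlots(3);
        System.out.println(received);
    }
}
```

The point of the model is the ordering: a requirement recorded before the connection exists is still delivered, because connecting is followed by a declaration of the pool's current state.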

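4. Conclusion

At this point, scheduling has moved from the JobMaster to Flink's ResourceManager, which performs resource scheduling at its own level. The ResourceManager side is analyzed in "Flink 1.19.0 Source Code Walkthrough 11: Flink ResourceManager Resource Scheduling".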
