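Flink is an open-source distributed computing framework from the Apache Software Foundation that unifies stream and batch processing, offering both real-time stream computation and high-throughput batch processing. This column records and shares analyses of the Flink source code.
The Flink source version analyzed in this article is flink-1.19.0.
1. Overview of ResourceManager resource scheduling
The previous article, 《Flink-1.19.0源码详解10-Flink计算资源的申请与调度》 (Flink 1.19.0 Source Code Explained 10: Requesting and Scheduling Flink Compute Resources), described how the Flink JobMaster uses the SchedulingPipelinedRegion as its scheduling unit: for each SchedulingPipelinedRegion, it requests compute resources from the Flink ResourceManager for the groups of Execution nodes partitioned by slot, one group after another, and then schedules the Tasks.
This article picks up at the point where the Flink ResourceManager requests compute resources and starts a TaskManager for each group of Execution nodes (the red part of the flow chart below). It walks through how the Flink ResourceManager manages its own compute resources, requests from the YARN ResourceManager a Container that encapsulates CPU and memory for the node group, and then starts a new TaskManager for that group.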

[Figure: complete source-code call diagram]

2. ResourceManager resource scheduling
After the JobMaster sends a scheduling request to the Flink ResourceManager, the request is handled by ResourceManager.declareRequiredResources().
The ResourceManager delegates slot allocation to its SlotManager. The concrete implementation is FineGrainedSlotManager, so the call goes to FineGrainedSlotManager.processResourceRequirements().
Source of ResourceManager.declareRequiredResources():
java
public CompletableFuture<Acknowledge> declareRequiredResources(
        JobMasterId jobMasterId, ResourceRequirements resourceRequirements, Time timeout) {
    final JobID jobId = resourceRequirements.getJobId();
    final JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
    if (null != jobManagerRegistration) {
        if (Objects.equals(jobMasterId, jobManagerRegistration.getJobMasterId())) {
            return getReadyToServeFuture()
                    .thenApply(
                            acknowledge -> {
                                validateRunsInMainThread();
                                // Delegate slot allocation to the SlotManager
                                slotManager.processResourceRequirements(resourceRequirements);
                                return null;
                            });
        } else {
            return FutureUtils.completedExceptionally(
                    new ResourceManagerException(
                            "The job leader's id "
                                    + jobManagerRegistration.getJobMasterId()
                                    + " does not match the received id "
                                    + jobMasterId
                                    + '.'));
        }
    } else {
        return FutureUtils.completedExceptionally(
                new ResourceManagerException(
                        "Could not find registered job manager for job " + jobId + '.'));
    }
}
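The guard logic in this method can be sketched in isolation: look up the registration for the job, compare the JobMaster id, and return an exceptionally completed future when either check fails. The sketch below is a simplified stand-in, not Flink code; the class `LeaderCheckSketch`, the string ids, and the `"ack"` result are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the registration check; not part of Flink.
class LeaderCheckSketch {
    // jobId -> id of the JobMaster currently registered for that job
    private final Map<String, String> registrations = new HashMap<>();

    void register(String jobId, String jobMasterId) {
        registrations.put(jobId, jobMasterId);
    }

    CompletableFuture<String> declare(String jobId, String jobMasterId) {
        String registered = registrations.get(jobId);
        if (registered == null) {
            return failed("Could not find registered job manager for job " + jobId + '.');
        }
        if (!Objects.equals(jobMasterId, registered)) {
            return failed(
                    "The job leader's id " + registered
                            + " does not match the received id " + jobMasterId + '.');
        }
        // In Flink, the SlotManager would process the requirements at this point.
        return CompletableFuture.completedFuture("ack");
    }

    private static <T> CompletableFuture<T> failed(String message) {
        CompletableFuture<T> future = new CompletableFuture<>();
        future.completeExceptionally(new IllegalStateException(message));
        return future;
    }

    public static void main(String[] args) {
        LeaderCheckSketch rm = new LeaderCheckSketch();
        rm.register("job-1", "master-A");
        System.out.println(rm.declare("job-1", "master-A").join());
        System.out.println(rm.declare("job-1", "master-B").isCompletedExceptionally());
        System.out.println(rm.declare("job-2", "master-A").isCompletedExceptionally());
    }
}
```

Returning a failed future instead of throwing keeps the error on the same asynchronous path as the success case, so the caller handles both through the returned future.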
FineGrainedSlotManager.processResourceRequirements() then goes through a chain of calls and eventually reaches FineGrainedSlotManager.declareNeededResources(), where TaskManager resource management begins.
[Figure: source-code call diagram]

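FineGrainedSlotManager.declareNeededResources() is the method that manages TaskManager resources. It first collects the unwanted TaskManagers, that is, those that should no longer be scheduled. It then gathers the registered TaskManagers (excluding the unwanted ones) and the pending TaskManagers that are still starting, and counts them per WorkerResourceSpec to obtain the number of workers required. Finally it wraps each requirement in a ResourceDeclaration, collects them in resourceDeclarations, and passes them to ActiveResourceManager.ResourceAllocatorImpl.declareResourceNeeded(); ActiveResourceManager.ResourceAllocatorImpl is the concrete implementation of ResourceAllocator.
Source of FineGrainedSlotManager.declareNeededResources():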
java
private void declareNeededResources() {
    // Collect the TaskManagers that should no longer be scheduled
    Map<InstanceID, WorkerResourceSpec> unWantedTaskManagers =
            taskManagerTracker.getUnWantedTaskManager();
    Map<WorkerResourceSpec, Set<InstanceID>> unWantedTaskManagerBySpec =
            unWantedTaskManagers.entrySet().stream()
                    .collect(
                            Collectors.groupingBy(
                                    Map.Entry::getValue,
                                    Collectors.mapping(Map.Entry::getKey, Collectors.toSet())));
    // Get the registered TaskManagers
    // registered TaskManagers except unwanted worker.
    Stream<WorkerResourceSpec> registeredTaskManagerStream =
            taskManagerTracker.getRegisteredTaskManagers().stream()
                    .filter(t -> !unWantedTaskManagers.containsKey(t.getInstanceId()))
                    .map(
                            t ->
                                    WorkerResourceSpec.fromTotalResourceProfile(
                                            t.getTotalResource(), t.getDefaultNumSlots()));
    // Get the TaskManagers that are still starting
    // pending TaskManagers.
    Stream<WorkerResourceSpec> pendingTaskManagerStream =
            taskManagerTracker.getPendingTaskManagers().stream()
                    .map(
                            t ->
                                    WorkerResourceSpec.fromTotalResourceProfile(
                                            t.getTotalResourceProfile(), t.getNumSlots()));
    // Count the required workers per WorkerResourceSpec
    Map<WorkerResourceSpec, Integer> requiredWorkers =
            Stream.concat(registeredTaskManagerStream, pendingTaskManagerStream)
                    .collect(
                            Collectors.groupingBy(
                                    Function.identity(), Collectors.summingInt(e -> 1)));
    Set<WorkerResourceSpec> workerResourceSpecs = new HashSet<>(requiredWorkers.keySet());
    workerResourceSpecs.addAll(unWantedTaskManagerBySpec.keySet());
    // Wrap the requirement of each spec in a ResourceDeclaration
    List<ResourceDeclaration> resourceDeclarations = new ArrayList<>();
    workerResourceSpecs.forEach(
            spec ->
                    resourceDeclarations.add(
                            new ResourceDeclaration(
                                    spec,
                                    requiredWorkers.getOrDefault(spec, 0),
                                    unWantedTaskManagerBySpec.getOrDefault(
                                            spec, Collections.emptySet()))));
    // Continue with the resource allocation
    resourceAllocator.declareResourceNeeded(resourceDeclarations);
}
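The two stream pipelines in this method are ordinary `Collectors.groupingBy` reductions. The following sketch reproduces them on plain strings so the resulting maps are easy to inspect. It assumes specs and instance ids can be represented as `String`; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only; strings stand in for WorkerResourceSpec and InstanceID.
class WorkerCountSketch {
    // Counts workers per spec across registered and pending workers.
    static Map<String, Integer> countBySpec(List<String> registered, List<String> pending) {
        return Stream.concat(registered.stream(), pending.stream())
                .collect(
                        Collectors.groupingBy(
                                Function.identity(), Collectors.summingInt(e -> 1)));
    }

    // Inverts an "instance id -> spec" map into "spec -> set of instance ids".
    static Map<String, Set<String>> unwantedBySpec(Map<String, String> unwantedIdToSpec) {
        return unwantedIdToSpec.entrySet().stream()
                .collect(
                        Collectors.groupingBy(
                                Map.Entry::getValue,
                                Collectors.mapping(Map.Entry::getKey, Collectors.toSet())));
    }

    public static void main(String[] args) {
        Map<String, Integer> required =
                countBySpec(Arrays.asList("small", "small", "large"), Arrays.asList("small"));
        System.out.println(required.get("small") + " small, " + required.get("large") + " large");

        Map<String, String> unwanted = new HashMap<>();
        unwanted.put("tm-1", "small");
        unwanted.put("tm-2", "small");
        System.out.println(unwantedBySpec(unwanted).get("small").size() + " unwanted small workers");
    }
}
```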
ActiveResourceManager.ResourceAllocatorImpl.declareResourceNeeded() simply forwards to ActiveResourceManager.declareResourceNeeded(), which in turn calls ActiveResourceManager.checkResourceDeclarations().
[Figure: source-code call diagram]

Source of ActiveResourceManager.ResourceAllocatorImpl.declareResourceNeeded():
java
public void declareResourceNeeded(Collection<ResourceDeclaration> resourceDeclarations) {
    validateRunsInMainThread();
    // Forward to the enclosing ActiveResourceManager
    ActiveResourceManager.this.declareResourceNeeded(resourceDeclarations);
}
Source of ActiveResourceManager.declareResourceNeeded():
java
public void declareResourceNeeded(Collection<ResourceDeclaration> resourceDeclarations) {
    this.resourceDeclarations = Collections.unmodifiableCollection(resourceDeclarations);
    log.debug("Update resource declarations to {}.", resourceDeclarations);
    // Continue by checking the declarations against the current workers
    checkResourceDeclarations();
}
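ActiveResourceManager.checkResourceDeclarations() compares, for each resource declaration, the current number of workers with the declared number. If there are more workers than declared, it releases the surplus, starting with the unwanted workers, then the unallocated ones, then those still starting, and finally the registered ones. If there are fewer workers than declared, it calls ActiveResourceManager.requestNewWorker() to start new ones.
Source of ActiveResourceManager.checkResourceDeclarations():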
java
private void checkResourceDeclarations() {
    validateRunsInMainThread();
    // Iterate over all resource declarations
    for (ResourceDeclaration resourceDeclaration : resourceDeclarations) {
        WorkerResourceSpec workerResourceSpec = resourceDeclaration.getSpec();
        int declaredWorkerNumber = resourceDeclaration.getNumNeeded();
        // Positive: surplus workers to release; negative: missing workers to request
        final int releaseOrRequestWorkerNumber =
                totalWorkerCounter.getNum(workerResourceSpec) - declaredWorkerNumber;
        if (releaseOrRequestWorkerNumber > 0) {
            log.info(
                    "need release {} workers, current worker number {}, declared worker number {}",
                    releaseOrRequestWorkerNumber,
                    totalWorkerCounter.getNum(workerResourceSpec),
                    declaredWorkerNumber);
            // Release the unwanted workers first
            // release unwanted workers.
            int remainingReleasingWorkerNumber =
                    releaseUnWantedResources(
                            resourceDeclaration.getUnwantedWorkers(),
                            releaseOrRequestWorkerNumber);
            // Then release workers that have not been allocated yet
            if (remainingReleasingWorkerNumber > 0) {
                // release not allocated workers
                remainingReleasingWorkerNumber =
                        releaseUnallocatedWorkers(
                                workerResourceSpec, remainingReleasingWorkerNumber);
            }
            // Then release workers that are still starting
            if (remainingReleasingWorkerNumber > 0) {
                // release starting workers
                remainingReleasingWorkerNumber =
                        releaseAllocatedWorkers(
                                currentAttemptUnregisteredWorkers,
                                workerResourceSpec,
                                remainingReleasingWorkerNumber);
            }
            // Finally release workers that are already registered
            if (remainingReleasingWorkerNumber > 0) {
                // release registered workers
                remainingReleasingWorkerNumber =
                        releaseAllocatedWorkers(
                                workerNodeMap.keySet(),
                                workerResourceSpec,
                                remainingReleasingWorkerNumber);
            }
            checkState(
                    remainingReleasingWorkerNumber == 0,
                    "there are no more workers to release");
            // Fewer workers than declared: new workers must be started
        } else if (releaseOrRequestWorkerNumber < 0) {
            // In case of start worker failures, we should wait for an interval before
            // trying to start new workers.
            // Otherwise, ActiveResourceManager will always re-requesting the worker,
            // which keeps the main thread busy.
            if (startWorkerCoolDown.isDone()) {
                int requestWorkerNumber = -releaseOrRequestWorkerNumber;
                log.info(
                        "need request {} new workers, current worker number {}, declared worker number {}",
                        requestWorkerNumber,
                        totalWorkerCounter.getNum(workerResourceSpec),
                        declaredWorkerNumber);
                for (int i = 0; i < requestWorkerNumber; i++) {
                    // Start a new worker
                    requestNewWorker(workerResourceSpec);
                }
            } else {
                startWorkerCoolDown.thenRun(this::checkResourceDeclarations);
            }
        } else {
            log.debug(
                    "current worker number {} meets the declared worker {}",
                    totalWorkerCounter.getNum(workerResourceSpec),
                    declaredWorkerNumber);
        }
    }
}
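The release-or-request decision reduces to one signed difference followed by a cascade over the release candidates. The sketch below models that arithmetic with plain integers. It assumes the four pools (unwanted, unallocated, starting, registered) can be represented by their sizes alone, and the class `ReleaseCascadeSketch` is hypothetical rather than part of Flink.

```java
import java.util.Arrays;

// Illustrative sketch only; pool sizes stand in for Flink's worker collections.
class ReleaseCascadeSketch {
    /**
     * Returns how many workers to release from each pool (in pool order),
     * followed by the number of new workers to request as the last element.
     */
    static int[] reconcile(int current, int declared, int[] pools) {
        int[] result = new int[pools.length + 1];
        int delta = current - declared;
        if (delta > 0) {
            // Surplus: drain the pools in order until the surplus is gone.
            int remaining = delta;
            for (int i = 0; i < pools.length && remaining > 0; i++) {
                result[i] = Math.min(pools[i], remaining);
                remaining -= result[i];
            }
            if (remaining != 0) {
                throw new IllegalStateException("there are no more workers to release");
            }
        } else if (delta < 0) {
            // Shortage: request the missing workers.
            result[pools.length] = -delta;
        }
        return result;
    }

    public static void main(String[] args) {
        // 5 workers, 2 declared; pools: unwanted=1, unallocated=1, starting=0, registered=3
        System.out.println(Arrays.toString(reconcile(5, 2, new int[] {1, 1, 0, 3})));
        // prints [1, 1, 0, 1, 0]
        // 1 worker, 4 declared: request 3 new workers
        System.out.println(Arrays.toString(reconcile(1, 4, new int[] {0, 0, 0, 1})));
        // prints [0, 0, 0, 0, 3]
    }
}
```

Ordering the pools this way releases the cheapest candidates first, so registered workers are touched only when nothing else is left to release.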
ActiveResourceManager.requestNewWorker() in turn calls requestResource() on the ResourceManagerDriver to request the resource. On YARN, the driver is YarnResourceManagerDriver.
Source of ActiveResourceManager.requestNewWorker():
java
public void requestNewWorker(WorkerResourceSpec workerResourceSpec) {
    //...
    // Request a Container from YARN through the driver
    final CompletableFuture<WorkerType> requestResourceFuture =
            resourceManagerDriver.requestResource(taskExecutorProcessSpec);
    unallocatedWorkerFutures.put(requestResourceFuture, workerResourceSpec);
    //...
}
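The excerpt stores the pending request future in a map keyed by the future itself, so a request that has not been fulfilled yet can later be found again (for example, to release unallocated workers). The following sketch shows that bookkeeping pattern in a simplified form. The class, the string worker ids, and the `whenComplete` callback are assumptions for illustration and do not mirror the elided parts of the Flink method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only; strings stand in for worker nodes and resource specs.
class PendingRequestSketch {
    // pending request -> spec it was issued for
    final Map<CompletableFuture<String>, String> unallocated = new HashMap<>();
    // worker id -> spec, filled once a request has been fulfilled
    final Map<String, String> allocated = new HashMap<>();

    CompletableFuture<String> requestNewWorker(String spec) {
        // A real driver would complete this future when the container arrives.
        CompletableFuture<String> request = new CompletableFuture<>();
        unallocated.put(request, spec);
        request.whenComplete(
                (workerId, error) -> {
                    unallocated.remove(request);
                    if (error == null) {
                        allocated.put(workerId, spec);
                    }
                });
        return request;
    }

    public static void main(String[] args) {
        PendingRequestSketch rm = new PendingRequestSketch();
        CompletableFuture<String> request = rm.requestNewWorker("small");
        System.out.println("pending: " + rm.unallocated.size());
        request.complete("container-1"); // simulate the allocation callback
        System.out.println("pending: " + rm.unallocated.size()
                + ", allocated spec: " + rm.allocated.get("container-1"));
    }
}
```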
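YarnResourceManagerDriver.requestResource() finally submits the request for the new worker's Container, which travels to the YARN ResourceManager with the heartbeat. It also shortens the heartbeat interval so that the request is transmitted quickly.
Source of YarnResourceManagerDriver.requestResource():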
java
public CompletableFuture<YarnWorkerNode> requestResource(
        TaskExecutorProcessSpec taskExecutorProcessSpec) {
    //...
    // Speed up the heartbeat to the YARN ResourceManager while a container request is pending
    // make sure we transmit the request fast and receive fast news of granted allocations
    resourceManagerClient.setHeartbeatInterval(containerRequestHeartbeatIntervalMillis);
    //...
}
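The heartbeat handling follows a simple rule: use a short interval while container requests are outstanding, and fall back to the regular interval once none remain. A minimal sketch of that rule is shown below. The class `HeartbeatIntervalSketch` and the interval values are assumptions chosen for illustration, not values read from Flink's configuration.

```java
// Illustrative sketch only; models the interval switch, not the YARN client.
class HeartbeatIntervalSketch {
    private final long regularMillis;
    private final long fastMillis;
    private int pendingRequests;
    private long currentMillis;

    HeartbeatIntervalSketch(long regularMillis, long fastMillis) {
        this.regularMillis = regularMillis;
        this.fastMillis = fastMillis;
        this.currentMillis = regularMillis;
    }

    // A new container request switches to the fast interval.
    void requestContainer() {
        pendingRequests++;
        currentMillis = fastMillis;
    }

    // Once no request is outstanding, return to the regular interval.
    void containerAllocated() {
        if (pendingRequests > 0) {
            pendingRequests--;
        }
        if (pendingRequests <= 0) {
            currentMillis = regularMillis;
        }
    }

    long currentIntervalMillis() {
        return currentMillis;
    }

    public static void main(String[] args) {
        HeartbeatIntervalSketch heartbeat = new HeartbeatIntervalSketch(5000L, 500L);
        heartbeat.requestContainer();
        heartbeat.requestContainer();
        System.out.println(heartbeat.currentIntervalMillis()); // 500
        heartbeat.containerAllocated();
        System.out.println(heartbeat.currentIntervalMillis()); // 500, one request still pending
        heartbeat.containerAllocated();
        System.out.println(heartbeat.currentIntervalMillis()); // 5000
    }
}
```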
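3. Starting the TaskManager
When the YARN ResourceManager receives the new-worker request from the Flink ResourceManager, it allocates a new Container encapsulating CPU and memory for the worker and returns it to the Flink ResourceManager, which then starts a TaskManager in that Container.
[Figure: source-code call diagram]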

The Flink ResourceManager receives the response from the YARN ResourceManager in YarnResourceManagerDriver.YarnContainerEventHandler.onContainersAllocated().
onContainersAllocated() groups the Containers returned by the YARN ResourceManager by priority and calls YarnResourceManagerDriver.onContainersOfPriorityAllocated() for each group.
Source of YarnResourceManagerDriver.YarnContainerEventHandler.onContainersAllocated():
java
public void onContainersAllocated(List<Container> containers) {
    runAsyncWithFatalHandler(
            () -> {
                checkInitialized();
                log.info("Received {} containers.", containers.size());
                for (Map.Entry<Priority, List<Container>> entry :
                        groupContainerByPriority(containers).entrySet()) {
                    // Handle the containers allocated for each priority
                    onContainersOfPriorityAllocated(entry.getKey(), entry.getValue());
                }
                // if we are waiting for no further containers, we can go to the
                // regular heartbeat interval
                if (getNumRequestedNotAllocatedWorkers() <= 0) {
                    resourceManagerClient.setHeartbeatInterval(yarnHeartbeatIntervalMillis);
                }
            });
}
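Grouping the allocated containers by priority before processing them can be illustrated with a small stand-alone example. The sketch below uses a plain loop with `computeIfAbsent`. The nested `Container` class and its fields are hypothetical stand-ins for YARN's container type, and this is not the implementation of `groupContainerByPriority`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; Container here is a hypothetical stand-in type.
class PriorityGroupingSketch {
    static final class Container {
        final String id;
        final int priority;

        Container(String id, int priority) {
            this.id = id;
            this.priority = priority;
        }
    }

    // Groups container ids by priority, keeping arrival order within each group.
    static Map<Integer, List<String>> groupByPriority(List<Container> containers) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (Container container : containers) {
            groups.computeIfAbsent(container.priority, p -> new ArrayList<>())
                    .add(container.id);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Container> containers =
                Arrays.asList(
                        new Container("a", 2), new Container("b", 1), new Container("c", 2));
        for (Map.Entry<Integer, List<String>> entry : groupByPriority(containers).entrySet()) {
            System.out.println("priority " + entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```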
YarnResourceManagerDriver.onContainersOfPriorityAllocated() calls YarnResourceManagerDriver.startTaskExecutorInContainerAsync(), which starts the TaskManager in the Container asynchronously.
Source of YarnResourceManagerDriver.onContainersOfPriorityAllocated():
java
private void onContainersOfPriorityAllocated(Priority priority, List<Container> containers) {
    //...
    // Start the TaskManager in the Container asynchronously
    startTaskExecutorInContainerAsync(container, taskExecutorProcessSpec, resourceId);
    //...
}
YarnResourceManagerDriver.startTaskExecutorInContainerAsync() first creates the launch context of the TaskManager, then has the YARN NodeManager execute the launch command carried by that context, which creates the TaskManager.
Source of YarnResourceManagerDriver.startTaskExecutorInContainerAsync():
java
private void startTaskExecutorInContainerAsync(
        Container container,
        TaskExecutorProcessSpec taskExecutorProcessSpec,
        ResourceID resourceId) {
    final CompletableFuture<ContainerLaunchContext> containerLaunchContextFuture =
            FutureUtils.supplyAsync(
                    () ->
                            // Create the launch context of the TaskManager
                            createTaskExecutorLaunchContext(
                                    resourceId,
                                    container.getNodeId().getHost(),
                                    taskExecutorProcessSpec),
                    getIoExecutor());
    FutureUtils.assertNoException(
            containerLaunchContextFuture.handleAsync(
                    (context, exception) -> {
                        if (exception == null) {
                            // Have the YARN NodeManager run the TaskManager launch command
                            // in the allocated container
                            nodeManagerClient.startContainerAsync(container, context);
                        } else {
                            getResourceEventHandler()
                                    .onWorkerTerminated(resourceId, exception.getMessage());
                        }
                        return null;
                    },
                    getMainThreadExecutor()));
}
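The method combines two executors: the potentially slow construction of the launch context runs on an I/O executor, and the reaction to its result runs back on the main thread executor. The sketch below shows that hand-off using only the JDK's `CompletableFuture`. It is an assumption-laden simplification: strings replace the launch context, the two single-thread executors are stand-ins, and `CompletableFuture.supplyAsync` is used in place of Flink's `FutureUtils.supplyAsync`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; strings stand in for the container launch context.
class AsyncLaunchSketch {
    static CompletableFuture<String> launch(
            String containerId,
            boolean failBuild,
            ExecutorService ioExecutor,
            ExecutorService mainThreadExecutor) {
        // Step 1: build the launch context off the main thread.
        CompletableFuture<String> contextFuture =
                CompletableFuture.supplyAsync(
                        () -> {
                            if (failBuild) {
                                throw new IllegalStateException("cannot build launch context");
                            }
                            return "context-for-" + containerId;
                        },
                        ioExecutor);
        // Step 2: handle success or failure back on the main thread executor.
        return contextFuture.handleAsync(
                (context, error) ->
                        error == null
                                ? "started " + containerId + " with " + context
                                : "terminated " + containerId + ": " + error.getMessage(),
                mainThreadExecutor);
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        ExecutorService mainThreadExecutor = Executors.newSingleThreadExecutor();
        try {
            System.out.println(launch("c1", false, ioExecutor, mainThreadExecutor).join());
            System.out.println(launch("c2", true, ioExecutor, mainThreadExecutor).join());
        } finally {
            ioExecutor.shutdown();
            mainThreadExecutor.shutdown();
        }
    }
}
```

Because the failure is handled inside `handleAsync`, the returned future completes normally in both cases, which matches the idea of reporting a failed launch as a terminated worker instead of propagating the exception.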
The launch context of the TaskManager is created by YarnResourceManagerDriver.createTaskExecutorLaunchContext(), which delegates the actual construction to Utils.createTaskExecutorContext().
Source of YarnResourceManagerDriver.createTaskExecutorLaunchContext():
java
private ContainerLaunchContext createTaskExecutorLaunchContext(
        ResourceID containerId, String host, TaskExecutorProcessSpec taskExecutorProcessSpec)
        throws Exception {
    // init the ContainerLaunchContext
    final String currDir = configuration.getCurrentDir();
    final ContaineredTaskManagerParameters taskManagerParameters =
            ContaineredTaskManagerParameters.create(flinkConfig, taskExecutorProcessSpec);
    log.info(
            "TaskExecutor {} will be started on {} with {}.",
            containerId.getStringWithMetadata(),
            host,
            taskExecutorProcessSpec);
    final Configuration taskManagerConfig = BootstrapTools.cloneConfiguration(flinkConfig);
    taskManagerConfig.set(
            TaskManagerOptions.TASK_MANAGER_RESOURCE_ID, containerId.getResourceIdString());
    taskManagerConfig.set(
            TaskManagerOptionsInternal.TASK_MANAGER_RESOURCE_ID_METADATA,
            containerId.getMetadata());
    final String taskManagerDynamicProperties =
            BootstrapTools.getDynamicPropertiesAsString(flinkClientConfig, taskManagerConfig);
    log.debug("TaskManager configuration: {}", taskManagerConfig);
    // Create the launch context of the TaskManager:
    // Utils.createTaskExecutorContext() builds the environment the TaskManager starts in
    final ContainerLaunchContext taskExecutorLaunchContext =
            Utils.createTaskExecutorContext(
                    flinkConfig,
                    yarnConfig,
                    configuration,
                    taskManagerParameters,
                    taskManagerDynamicProperties,
                    currDir,
                    YarnTaskExecutorRunner.class,
                    log);
    taskExecutorLaunchContext.getEnvironment().put(ENV_FLINK_NODE_ID, host);
    return taskExecutorLaunchContext;
}
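One detail worth noting in this method is that the shared Flink configuration is cloned before the per-TaskManager entries are set, so each launch gets its own copy. The sketch below shows that clone-then-override pattern with a plain `Map`. The class name and the keys `example.resource-id` and `example.resource-id.metadata` are hypothetical placeholders, not Flink option keys.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; a Map stands in for Flink's Configuration.
class LaunchConfigSketch {
    // Copies the base configuration and adds per-TaskManager entries to the copy.
    static Map<String, String> forTaskManager(
            Map<String, String> base, String resourceId, String metadata) {
        Map<String, String> copy = new HashMap<>(base);
        copy.put("example.resource-id", resourceId); // hypothetical key
        copy.put("example.resource-id.metadata", metadata); // hypothetical key
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("parallelism", "4");
        Map<String, String> tmConfig = forTaskManager(base, "container-1", "host-a");
        System.out.println("copy has resource id: " + tmConfig.containsKey("example.resource-id"));
        System.out.println("base untouched: " + !base.containsKey("example.resource-id"));
    }
}
```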
Utils.createTaskExecutorContext(), which builds the launch context, assembles the launch command of the TaskManager.
Source of Utils.createTaskExecutorContext():
java
static ContainerLaunchContext createTaskExecutorContext(
        org.apache.flink.configuration.Configuration flinkConfig,
        YarnConfiguration yarnConfig,
        YarnResourceManagerDriverConfiguration configuration,
        ContaineredTaskManagerParameters tmParams,
        String taskManagerDynamicProperties,
        String workingDirectory,
        Class<?> taskManagerMainClass,
        Logger log)
        throws Exception {
    //...
    // Launch command of the TaskManager
    String launchCommand =
            getTaskManagerShellCommand(
                    flinkConfig,
                    tmParams,
                    ".",
                    ApplicationConstants.LOG_DIR_EXPANSION_VAR,
                    hasLogback,
                    hasLog4j,
                    hasKrb5,
                    taskManagerMainClass,
                    taskManagerDynamicProperties);
    //...
}
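Finally, in YarnResourceManagerDriver.startTaskExecutorInContainerAsync(), the Flink ResourceManager has the YARN NodeManager execute the TaskManager launch command through nodeManagerClient.startContainerAsync(container, context), so the TaskManager ends up running in the Container allocated by YARN.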
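4. Conclusion
This completes the walkthrough of the source code by which the Flink ResourceManager allocates resources and creates TaskManagers for the Execution groups that the JobMaster is scheduling. From here, the JobMaster obtains the information about the newly created TaskManager and starts the Task computations on it.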