DispatcherResourceManagerComponentFactory startup flow (using Standalone Session as the example)

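This article answers a single question: once the JobManager has finished initializeServices(...), the step that lays down the foundation services, how does it select and create the three core components (Dispatcher, ResourceManager, WebMonitorEndpoint) and package them into one component that can be closed and whose shutdown can be awaited?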

1. Topic and core component responsibilities (overview)

  • org.apache.flink.runtime.entrypoint.ClusterEntrypoint#runCluster: splits startup into two phases. First initializeServices(...) builds the foundation, then a DispatcherResourceManagerComponentFactory brings up the core components.
  • org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint: the entrypoint for Standalone Session mode; it decides which DispatcherResourceManagerComponentFactory to use.
  • org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponentFactory: the facade that abstracts "create and start Dispatcher/RM/Web endpoint" (its core method is create(...)).
  • org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory: the default implementation; it assembles the different modes out of three sub-factories:
    • DispatcherRunnerFactory: how the Dispatcher is started (Session mode vs Job mode).
    • ResourceManagerFactory: how the ResourceManager is created (Standalone/Yarn/K8s...).
    • RestEndpointFactory: how the Web/REST endpoint is created (Session vs Job).
  • org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent: wraps the DispatcherRunner, ResourceManagerService, the LeaderRetrievalService instances, and the WebMonitorEndpoint into one component that exposes getShutDownFuture() and closeAsync().

Class diagram:

<<interface>> DispatcherResourceManagerComponentFactory  (+create(...))
    ▲ implements
DefaultDispatcherResourceManagerComponentFactory
    ├─ holds → <<interface>> DispatcherRunnerFactory
    ├─ holds → <<interface>> ResourceManagerFactory
    └─ holds → <<interface>> RestEndpointFactory

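2. Prerequisite: initializeServices(...) must run before anything else

DispatcherResourceManagerComponentFactory does not create the three core components out of nothing. It depends on a full set of foundation objects created during initializeServices(configuration, pluginManager), typically including:

  • commonRpcService: the later RpcGatewayRetriever instances and the Dispatcher/RM RPC endpoints all depend on it.
  • haServices: provides the leader election and leader retrieval handles for the Dispatcher, the RM, and the Web endpoint.
  • blobServer: used by the REST endpoint (jar upload) and by the RM/Dispatcher (distributing dependencies).
  • heartbeatServices: the heartbeat policy between the RM and the TMs.
  • metricRegistry + metricQueryService: the basis on which the Web side fetches metrics snapshots.
  • ioExecutor, delegationTokenManager, executionGraphInfoStore, and so on.

This is also why the code order in ClusterEntrypoint#runCluster is fixed: first initializeServices(...), then createDispatcherResourceManagerComponentFactory(...), and finally factory.create(...).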

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java>

FQCN: org.apache.flink.runtime.entrypoint.ClusterEntrypoint#runCluster

```java
private void runCluster(Configuration configuration, PluginManager pluginManager) throws Exception {
    synchronized (lock) {
        // Phase 1: build the foundation services (RPC, HA, blob server, metrics, ...).
        initializeServices(configuration, pluginManager);

        // Phase 2a: select the factory; nothing is started here.
        final DispatcherResourceManagerComponentFactory dispatcherResourceManagerComponentFactory =
                createDispatcherResourceManagerComponentFactory(configuration);

        // Phase 2b: create and start the core components from the foundation objects.
        clusterComponent =
                dispatcherResourceManagerComponentFactory.create(
                        configuration,
                        resourceId.unwrap(),
                        ioExecutor,
                        commonRpcService,
                        haServices,
                        blobServer,
                        heartbeatServices,
                        delegationTokenManager,
                        metricRegistry,
                        executionGraphInfoStore,
                        new RpcMetricQueryServiceRetriever(
                                metricRegistry.getMetricQueryServiceRpcService()),
                        failureEnrichers,
                        this);
    }
}
```
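  • initializeServices(...) marks the point at which the foundation services are complete: both the factory selection and create(...) happen after it.
  • createDispatcherResourceManagerComponentFactory(...) only selects the factory; the actual startup happens inside factory.create(...).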
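
The two-phase ordering can be illustrated with a small standalone model. This is only a sketch: Foundation, ComponentFactory, and selectFactory are hypothetical stand-ins invented for illustration, not Flink classes. It simply records the steps to show that selection starts nothing and that create(...) runs last.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the two-phase startup; these are not Flink's classes.
class TwoPhaseStartupSketch {

    /** Stand-in for the foundation objects produced by initializeServices(...). */
    static final class Foundation {
        final String rpcService = "rpc";
        final String haServices = "ha";
    }

    /** Stand-in for the component factory: foundation objects in, started component out. */
    interface ComponentFactory {
        String create(Foundation foundation, List<String> log);
    }

    /** Phase 2a: selection only. Nothing is started here. */
    static ComponentFactory selectFactory(List<String> log) {
        log.add("select factory");
        return (foundation, l) -> {
            l.add("factory.create");
            return "component[" + foundation.rpcService + "," + foundation.haServices + "]";
        };
    }

    /** Runs both phases in the fixed order and returns the recorded steps. */
    static List<String> runCluster() {
        List<String> log = new ArrayList<>();
        Foundation foundation = new Foundation(); // phase 1: foundation services
        log.add("initializeServices");
        ComponentFactory factory = selectFactory(log); // phase 2a: choose a factory
        log.add(factory.create(foundation, log)); // phase 2b: create and start
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runCluster());
    }
}
```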

3. Mode selection: which factory does StandaloneSessionClusterEntrypoint choose?

The Standalone Session entrypoint class directly returns the Session variant of DefaultDispatcherResourceManagerComponentFactory.

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/StandaloneSessionClusterEntrypoint.java>

FQCN: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint#createDispatcherResourceManagerComponentFactory

```java
@Override
protected DefaultDispatcherResourceManagerComponentFactory
        createDispatcherResourceManagerComponentFactory(Configuration configuration) {
    // Session variant of the default factory, combined with the standalone ResourceManager.
    return DefaultDispatcherResourceManagerComponentFactory.createSessionComponentFactory(
            StandaloneResourceManagerFactory.getInstance());
}
```
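  • The key branching point here is StandaloneResourceManagerFactory.getInstance(): it makes the ResourceManager implementation the Standalone one (no integration with an external resource management system).
  • Whether the cluster is "Session or Job" is decided by createSessionComponentFactory(...), which selects the matching DispatcherRunnerFactory and RestEndpointFactory.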
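
The selection hook is essentially a template method: the base entrypoint owns the startup flow and each concrete entrypoint only overrides the factory choice. A minimal sketch of that shape follows. It assumes string descriptions in place of real factories, and ExternalSessionEntrypoint and sessionFactory are hypothetical names used only to show that the ResourceManager is the part that varies.

```java
// Hypothetical sketch of the selection hook; class and method names are illustrative, not Flink's.
class EntrypointSelectionSketch {

    /** Hypothetical helper: the Session variant fixes Dispatcher and REST, the RM is a parameter. */
    static String sessionFactory(String resourceManager) {
        return "session-dispatcher + " + resourceManager + " + session-rest";
    }

    /** Base entrypoint: owns the startup flow and delegates the choice to subclasses. */
    abstract static class Entrypoint {
        protected abstract String createComponentFactory();

        final String runCluster() {
            return "started: " + createComponentFactory();
        }
    }

    /** Standalone Session: Session variant plus a standalone ResourceManager. */
    static final class StandaloneSessionEntrypoint extends Entrypoint {
        @Override
        protected String createComponentFactory() {
            return sessionFactory("standalone-rm");
        }
    }

    /** Hypothetical second entrypoint that differs only in the ResourceManager choice. */
    static final class ExternalSessionEntrypoint extends Entrypoint {
        @Override
        protected String createComponentFactory() {
            return sessionFactory("external-rm");
        }
    }

    public static void main(String[] args) {
        System.out.println(new StandaloneSessionEntrypoint().runCluster());
        System.out.println(new ExternalSessionEntrypoint().runCluster());
    }
}
```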

4. DefaultDispatcherResourceManagerComponentFactory: three sub-factories assembled into one overall factory

4.1 Constructor: fixing the three sub-factories

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/component/DefaultDispatcherResourceManagerComponentFactory.java>

FQCN: org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory#DefaultDispatcherResourceManagerComponentFactory

```java
public DefaultDispatcherResourceManagerComponentFactory(
        @Nonnull DispatcherRunnerFactory dispatcherRunnerFactory,
        @Nonnull ResourceManagerFactory<?> resourceManagerFactory,
        @Nonnull RestEndpointFactory<?> restEndpointFactory) {
    // The three strategies are stored here and used later by create(...).
    this.dispatcherRunnerFactory = dispatcherRunnerFactory;
    this.resourceManagerFactory = resourceManagerFactory;
    this.restEndpointFactory = restEndpointFactory;
}
```
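  • DispatcherRunnerFactory: responsible for how the Dispatcher is started (it wraps the Dispatcher in a runner that can be run and shut down).
  • ResourceManagerFactory: responsible for how the RM is created.
  • RestEndpointFactory: responsible for how the web/REST endpoint is created (different modes create different endpoints).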
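
A minimal sketch of this composition, assuming heavily reduced signatures: RunnerFactory, RmFactory, and RestFactory are hypothetical stand-ins for the three sub-factories, and the components are plain strings. It shows the constructor pinning the three strategies and create() delegating to each of them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical, reduced model of a factory composed of three sub-factories (not Flink's API).
class ComposedFactorySketch {

    interface RunnerFactory {
        String createRunner();
    }

    interface RmFactory {
        String createResourceManager(String restBaseUrl);
    }

    interface RestFactory {
        String createRestEndpoint();
    }

    static final class ComposedFactory {
        private final RunnerFactory runnerFactory;
        private final RmFactory rmFactory;
        private final RestFactory restFactory;

        ComposedFactory(RunnerFactory runnerFactory, RmFactory rmFactory, RestFactory restFactory) {
            this.runnerFactory = Objects.requireNonNull(runnerFactory);
            this.rmFactory = Objects.requireNonNull(rmFactory);
            this.restFactory = Objects.requireNonNull(restFactory);
        }

        /** Delegates to each sub-factory; the REST URL is produced first and handed to the RM. */
        List<String> create() {
            String rest = restFactory.createRestEndpoint();
            String rm = rmFactory.createResourceManager(rest);
            String runner = runnerFactory.createRunner();
            return Arrays.asList(rest, rm, runner);
        }
    }

    static List<String> demo() {
        ComposedFactory factory =
                new ComposedFactory(
                        () -> "session-dispatcher-runner",
                        restUrl -> "standalone-rm(rest=" + restUrl + ")",
                        () -> "http://localhost:8081"); // illustrative address only
        return factory.create();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Swapping any one lambda changes a single aspect of the cluster without touching the other two, which is the point of keeping the three sub-factories separate.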

4.2 Session factory method: always selects the Session implementations

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/component/DefaultDispatcherResourceManagerComponentFactory.java>

FQCN: org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory#createSessionComponentFactory

```java
public static DefaultDispatcherResourceManagerComponentFactory createSessionComponentFactory(
        ResourceManagerFactory<?> resourceManagerFactory) {
    // Dispatcher and REST endpoint are fixed to their Session implementations;
    // only the ResourceManager factory is supplied by the caller.
    return new DefaultDispatcherResourceManagerComponentFactory(
            DefaultDispatcherRunnerFactory.createSessionRunner(SessionDispatcherFactory.INSTANCE),
            resourceManagerFactory,
            SessionRestEndpointFactory.INSTANCE);
}
```
  • In Session mode:
    • The Dispatcher uses SessionDispatcherFactory (jobs enter the Session cluster by being submitted over REST).
    • The REST endpoint uses SessionRestEndpointFactory (the REST/Web UI for a Session cluster).
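
The pattern here is a named static factory that pre-binds two of the three choices and leaves the third to the caller. A small sketch under stated assumptions: Mode, sessionVariant, and jobVariant are hypothetical names, and the job variant is included only to contrast with the Session one.

```java
// Hypothetical sketch of named factory variants; not Flink's classes or method names.
class NamedVariantSketch {

    enum Mode { SESSION, JOB }

    static final class ComponentFactory {
        final Mode dispatcherMode;
        final String resourceManager;
        final Mode restMode;

        // Private: callers must go through a named variant, so the two modes cannot be mixed.
        private ComponentFactory(Mode dispatcherMode, String resourceManager, Mode restMode) {
            this.dispatcherMode = dispatcherMode;
            this.resourceManager = resourceManager;
            this.restMode = restMode;
        }

        /** Session variant: Dispatcher and REST are both SESSION; the RM comes from the caller. */
        static ComponentFactory sessionVariant(String resourceManager) {
            return new ComponentFactory(Mode.SESSION, resourceManager, Mode.SESSION);
        }

        /** Job variant, shown only for contrast. */
        static ComponentFactory jobVariant(String resourceManager) {
            return new ComponentFactory(Mode.JOB, resourceManager, Mode.JOB);
        }
    }

    public static void main(String[] args) {
        ComponentFactory factory = ComponentFactory.sessionVariant("standalone-rm");
        System.out.println(
                factory.dispatcherMode + " / " + factory.resourceManager + " / " + factory.restMode);
    }
}
```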

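5. Core interface: DispatcherResourceManagerComponentFactory#create(...)

The core value of DispatcherResourceManagerComponentFactory is that it folds the creation and startup order of the three core components into a single method, so ClusterEntrypoint only has to pass in the foundation objects as arguments.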

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/component/DispatcherResourceManagerComponentFactory.java>

FQCN: org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponentFactory#create

```java
// Every parameter is a foundation object prepared by ClusterEntrypoint.
DispatcherResourceManagerComponent create(
        Configuration configuration,
        ResourceID resourceId,
        Executor ioExecutor,
        RpcService rpcService,
        HighAvailabilityServices highAvailabilityServices,
        BlobServer blobServer,
        HeartbeatServices heartbeatServices,
        DelegationTokenManager delegationTokenManager,
        MetricRegistry metricRegistry,
        ExecutionGraphInfoStore executionGraphInfoStore,
        MetricQueryServiceRetriever metricQueryServiceRetriever,
        Collection<FailureEnricher> failureEnrichers,
        FatalErrorHandler fatalErrorHandler)
        throws Exception;
```
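  • The signature itself expresses the dependency boundary: everything passed in is a foundation object, and what comes back is the set of already-started core components.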

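6. DefaultDispatcherResourceManagerComponentFactory#create(...): creation and startup order

The code below is the heart of the second half of JM startup: create the web endpoint → start the web endpoint → create the RM service → create the DispatcherRunner → start the RM → start leader retrieval (the discovery chain) → package everything into a component and return it.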

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/component/DefaultDispatcherResourceManagerComponentFactory.java>

FQCN: org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory#create

```java
// Abridged excerpt of create(...): declarations of locals such as hostname, updateInterval,
// partialDispatcherServices and dispatcherOperationCaches are not shown here.

// 1. Leader retrieval services and gateway retrievers (the discovery chain).
dispatcherLeaderRetrievalService = highAvailabilityServices.getDispatcherLeaderRetriever();
resourceManagerRetrievalService = highAvailabilityServices.getResourceManagerLeaderRetriever();

final LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever =
        new RpcGatewayRetriever<>(rpcService, DispatcherGateway.class, DispatcherId::fromUuid, ...);

final LeaderGatewayRetriever<ResourceManagerGateway> resourceManagerGatewayRetriever =
        new RpcGatewayRetriever<>(
                rpcService, ResourceManagerGateway.class, ResourceManagerId::fromUuid, ...);

// 2. Executor and metric fetcher used by the web/REST endpoint.
final ScheduledExecutorService executor =
        WebMonitorEndpoint.createExecutorService(
                configuration.get(RestOptions.SERVER_NUM_THREADS),
                configuration.get(RestOptions.SERVER_THREAD_PRIORITY),
                "DispatcherRestEndpoint");

final MetricFetcher metricFetcher =
        updateInterval == 0
                ? VoidMetricFetcher.INSTANCE
                : MetricFetcherImpl.fromConfiguration(
                        configuration,
                        metricQueryServiceRetriever,
                        dispatcherGatewayRetriever,
                        executor);

// 3. Create and start the web/REST endpoint.
webMonitorEndpoint =
        restEndpointFactory.createRestEndpoint(
                configuration,
                dispatcherGatewayRetriever,
                resourceManagerGatewayRetriever,
                blobServer,
                executor,
                metricFetcher,
                highAvailabilityServices.getClusterRestEndpointLeaderElection(),
                fatalErrorHandler);
webMonitorEndpoint.start();

// 4. Create the ResourceManager service (not started yet); it receives the REST base URL.
resourceManagerService =
        ResourceManagerServiceImpl.create(
                resourceManagerFactory,
                configuration,
                resourceId,
                rpcService,
                highAvailabilityServices,
                heartbeatServices,
                delegationTokenManager,
                fatalErrorHandler,
                new ClusterInformation(hostname, blobServer.getPort()),
                webMonitorEndpoint.getRestBaseUrl(),
                metricRegistry,
                hostname,
                ioExecutor);

// 5. Create the DispatcherRunner.
dispatcherRunner =
        dispatcherRunnerFactory.createDispatcherRunner(
                highAvailabilityServices.getDispatcherLeaderElection(),
                fatalErrorHandler,
                new HaServicesJobPersistenceComponentFactory(highAvailabilityServices),
                ioExecutor,
                rpcService,
                partialDispatcherServices);

// 6. Start the ResourceManager service.
resourceManagerService.start();

// 7. Start leader retrieval so the gateway retrievers begin receiving leader addresses.
resourceManagerRetrievalService.start(resourceManagerGatewayRetriever);
dispatcherLeaderRetrievalService.start(dispatcherGatewayRetriever);

// 8. Package everything into one component.
return new DispatcherResourceManagerComponent(
        dispatcherRunner,
        resourceManagerService,
        dispatcherLeaderRetrievalService,
        resourceManagerRetrievalService,
        webMonitorEndpoint,
        fatalErrorHandler,
        dispatcherOperationCaches);
```
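  • LeaderRetrievalService and RpcGatewayRetriever together form a general-purpose chain of "discover the leader → connect over RPC → obtain a gateway proxy". This is also the key way the components inside the JM find each other in HA mode.
  • The web endpoint is started first, so that webMonitorEndpoint.getRestBaseUrl() can be injected into the RM as the address it advertises (the REST endpoint also takes part in leader election itself).
  • resourceManagerService.start() comes before retrievalService.start(...): the RM itself is brought up first, and only then is the discovery listener that lets others find the RM started, which avoids a window in which a leader has been discovered but the service is not yet running.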
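
The "discover the leader → connect → gateway" chain can be sketched with plain CompletableFuture. This is a simplified model under stated assumptions: RetrievalService, LeaderListener, and GatewayRetriever are hypothetical stand-ins, the connect step is an ordinary function rather than an RPC call, and real leader retrieval is driven by an HA backend rather than by manual announcements.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical model of leader discovery feeding a gateway future; not Flink's implementation.
class LeaderDiscoverySketch {

    /** Receives leader address updates. */
    interface LeaderListener {
        void notifyLeaderAddress(String address);
    }

    /** Stand-in for a leader retrieval service with a single listener. */
    static final class RetrievalService {
        private LeaderListener listener;

        synchronized void start(LeaderListener listener) {
            this.listener = listener;
        }

        /** Simulates the HA backend announcing a (new) leader. */
        synchronized void announceLeader(String address) {
            if (listener != null) {
                listener.notifyLeaderAddress(address);
            }
        }
    }

    /** Stand-in for a gateway retriever: leader address in, gateway future out. */
    static final class GatewayRetriever<G> implements LeaderListener {
        private final Function<String, G> connect;
        private CompletableFuture<G> gatewayFuture = new CompletableFuture<>();

        GatewayRetriever(Function<String, G> connect) {
            this.connect = connect;
        }

        @Override
        public synchronized void notifyLeaderAddress(String address) {
            G gateway = connect.apply(address); // the "connect" step
            if (gatewayFuture.isDone()) {
                // A newer leader replaces the previously resolved gateway.
                gatewayFuture = CompletableFuture.completedFuture(gateway);
            } else {
                gatewayFuture.complete(gateway);
            }
        }

        synchronized CompletableFuture<G> getFuture() {
            return gatewayFuture;
        }
    }

    public static void main(String[] args) {
        RetrievalService retrieval = new RetrievalService();
        GatewayRetriever<String> retriever = new GatewayRetriever<>(addr -> "gateway->" + addr);
        retrieval.start(retriever);
        retrieval.announceLeader("rm-address-1");
        System.out.println(retriever.getFuture().join());
    }
}
```

Callers only ever hold the future, so they do not need to know whether the leader was already elected when they asked or appears later.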

7. Packaging: DispatcherResourceManagerComponent as the single external lifecycle handle

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/component/DispatcherResourceManagerComponent.java>

FQCN: org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent

```java
// Abridged: the remaining fields (including shutDownFuture), the constructor,
// and closeAsync() are not shown in this excerpt.
public class DispatcherResourceManagerComponent implements AutoCloseableAsync {
    @Nonnull private final DispatcherRunner dispatcherRunner;
    @Nonnull private final ResourceManagerService resourceManagerService;
    @Nonnull private final LeaderRetrievalService dispatcherLeaderRetrievalService;
    @Nonnull private final LeaderRetrievalService resourceManagerRetrievalService;
    @Nonnull private final RestService webMonitorEndpoint;

    public final CompletableFuture<ApplicationStatus> getShutDownFuture() {
        return shutDownFuture;
    }
}
```
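  • From ClusterEntrypoint's point of view, there is no need to know how the Dispatcher, RM, and Web endpoint are each shut down: holding a single DispatcherResourceManagerComponent is enough to wait for shutdown and to close everything in one place.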
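
A minimal sketch of such a composite lifecycle, assuming a hypothetical AsyncCloseable interface and string statuses. It closes all parts and combines their futures with CompletableFuture.allOf; the real class may order the shutdown of its parts differently, so treat this only as an illustration of the single-handle idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical composite lifecycle; a simplification, not Flink's DispatcherResourceManagerComponent.
class CompositeLifecycleSketch {

    interface AsyncCloseable {
        CompletableFuture<Void> closeAsync();
    }

    /** A part that records when it is closed. */
    static final class Part implements AsyncCloseable {
        private final String name;
        private final List<String> log;

        Part(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public CompletableFuture<Void> closeAsync() {
            log.add("closed " + name);
            return CompletableFuture.completedFuture(null);
        }
    }

    /** One handle for the caller: a shutdown future to wait on and a single closeAsync(). */
    static final class Component implements AsyncCloseable {
        private final List<AsyncCloseable> parts;
        private final CompletableFuture<String> shutDownFuture = new CompletableFuture<>();

        Component(List<AsyncCloseable> parts) {
            this.parts = parts;
        }

        CompletableFuture<String> getShutDownFuture() {
            return shutDownFuture;
        }

        void signalShutDown(String status) {
            shutDownFuture.complete(status);
        }

        @Override
        public CompletableFuture<Void> closeAsync() {
            // Completes once every part has finished closing.
            return CompletableFuture.allOf(
                    parts.stream().map(AsyncCloseable::closeAsync).toArray(CompletableFuture[]::new));
        }
    }

    static List<String> demo() {
        List<String> log = new ArrayList<>();
        Component component =
                new Component(
                        Arrays.asList(
                                new Part("dispatcher", log),
                                new Part("resource-manager", log),
                                new Part("rest", log)));
        // The caller reacts to shutdown without knowing about the individual parts.
        component.getShutDownFuture().thenAccept(status -> log.add("shutdown: " + status));
        component.signalShutDown("SUCCEEDED");
        component.closeAsync().join();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```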

8. The sequence, drawn out (from initializeServices to the three components starting)

ClusterEntrypoint.runCluster
  1. initializeServices(configuration, pluginManager)
  2. createDispatcherResourceManagerComponentFactory(configuration)
     (StandaloneSessionClusterEntrypoint → createSessionComponentFactory(StandaloneRMFactory))
  3. factory.create(...)
     a. create & start WebMonitorEndpoint
     b. create ResourceManagerService
     c. create DispatcherRunner
     d. resourceManagerService.start()
     e. start LeaderRetrievalService (Dispatcher/RM)
     f. new DispatcherResourceManagerComponent(...)

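9. Back to the topic (wrap-up)

  • initializeServices(...) answers "are all the foundation dependencies in place?", while DispatcherResourceManagerComponentFactory answers "how are the core components created and started in order?".
  • In Standalone Session mode the entrypoint class does exactly one thing: it selects the Session combination of DefaultDispatcherResourceManagerComponentFactory (Session Dispatcher + Session REST + Standalone RM).
  • DefaultDispatcherResourceManagerComponentFactory#create(...) is the main line of the second half of JM startup: it brings the creation, startup, discovery, and assembly of the Dispatcher, ResourceManager, and WebMonitorEndpoint together in one place.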