5.HeartbeatServices启动解析.md

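HeartbeatServices Startup Walkthrough (RM ↔ TM in Three Stages)

This article splits the RM ↔ TM heartbeat into three parts, so that the details do not all arrive at once:

  • Part 1 (focus): how HeartbeatServices/HeartbeatManager are set up during startup
  • Part 2 (brief): how the heartbeat relationship is established and the peer "joins the queue" when a TaskExecutor first connects to the ResourceManager
  • Part 3 (focus): how the ResourceManager periodically triggers heartbeats, how the TaskExecutor replies, and how the ResourceManager handles the payload

Overview: the three-stage main line (bring the complexity down first)

(1) Startup: ClusterEntrypoint creates HeartbeatServices; RM/TM create their HeartbeatManagers
(2) Link-up: the TM registers with the RM, and both sides call monitorTarget
(3) Running: the RM periodically calls requestHeartbeat, the TM replies, the RM handles the payload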

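Part 1: Startup (focus)

1.0 HeartbeatManager: the "core object" of the heartbeat subsystem

First pin down how the following classes and interfaces relate; the startup path and the periodic heartbeat are much easier to follow afterwards:

  • org.apache.flink.runtime.heartbeat.HeartbeatServices: the heartbeat "facade/factory"; it creates the two kinds of HeartbeatManager (receiver / sender).
  • org.apache.flink.runtime.heartbeat.HeartbeatServicesImpl: the default implementation; it fixes interval, timeout and the RPC failure threshold, and uses them to instantiate HeartbeatManagerImpl / HeartbeatManagerSenderImpl.
  • org.apache.flink.runtime.heartbeat.HeartbeatTarget: the abstraction of the heartbeat RPC protocol (request / receive).
  • org.apache.flink.runtime.heartbeat.HeartbeatManager: adds a "monitored target management" layer on top of HeartbeatTarget (monitorTarget / unmonitorTarget / stop).
  • org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl: the general implementation; it maintains the ResourceID → HeartbeatMonitor table and implements the send/receive logic of HeartbeatTarget.
  • org.apache.flink.runtime.heartbeat.HeartbeatManagerSenderImpl: extends HeartbeatManagerImpl and additionally implements Runnable, so that a scheduled task periodically triggers requestHeartbeat.

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManager.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManager

java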
public interface HeartbeatManager<I, O> extends HeartbeatTarget<I> {
    void monitorTarget(ResourceID resourceID, HeartbeatTarget<O> heartbeatTarget);
    void unmonitorTarget(ResourceID resourceID);
    void stop();
    long getLastHeartbeatFrom(ResourceID resourceId);
}
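  • extends HeartbeatTarget<I> is the crucial part: a HeartbeatManager is itself a target on which the peer can invoke request/receive (the peer reaches it through a gateway/RPC).
  • monitorTarget(...) means "add this peer to the set of monitored heartbeat targets" (the "heartbeat queue" in this article; the underlying structure is a table keyed by ResourceID). Only after a peer has been added does the Sender visit it in each periodic round.

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl

java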
class HeartbeatManagerImpl<I, O> implements HeartbeatManager<I, O> {
    // ...
}

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerSenderImpl

java
class HeartbeatManagerSenderImpl<I, O> extends HeartbeatManagerImpl<I, O> implements Runnable {
    // ...
}

1.0.1 UML: how HeartbeatManager / Impl / SenderImpl relate

Types and members:

  • ClusterEntrypoint
  • <<interface>> HeartbeatServices
      +createHeartbeatManager(ResourceID, HeartbeatListener, ScheduledExecutor, Logger) : HeartbeatManager
      +createHeartbeatManagerSender(ResourceID, HeartbeatListener, ScheduledExecutor, Logger) : HeartbeatManager
  • HeartbeatServicesImpl
  • <<interface>> HeartbeatTarget<I>
      +receiveHeartbeat(ResourceID, I) : CompletableFuture<Void>
      +requestHeartbeat(ResourceID, I) : CompletableFuture<Void>
  • <<interface>> HeartbeatManager<I,O>
      +monitorTarget(ResourceID, HeartbeatTarget<O>)
      +unmonitorTarget(ResourceID)
      +stop()
      +getLastHeartbeatFrom(ResourceID) : long
  • HeartbeatManagerImpl<I,O>
      -heartbeatTargets : Map<ResourceID, HeartbeatMonitor<O>>
  • <<interface>> HeartbeatMonitor<O>
      +reportHeartbeat()
      +cancel()
      +reportHeartbeatRpcFailure()
      +reportHeartbeatRpcSuccess()
  • DefaultHeartbeatMonitor<O>
  • HeartbeatManagerSenderImpl<I,O>
  • <<interface>> Runnable
      +run()

Relationships:

  • ClusterEntrypoint → HeartbeatServices: createHeartbeatServices / fromConfiguration
  • HeartbeatServicesImpl implements HeartbeatServices
  • HeartbeatServices → HeartbeatManagerImpl: createHeartbeatManager
  • HeartbeatServices → HeartbeatManagerSenderImpl: createHeartbeatManagerSender
  • HeartbeatManager extends HeartbeatTarget
  • HeartbeatManagerImpl implements HeartbeatManager
  • HeartbeatManagerSenderImpl extends HeartbeatManagerImpl
  • HeartbeatManagerSenderImpl implements Runnable
  • DefaultHeartbeatMonitor implements HeartbeatMonitor
  • DefaultHeartbeatMonitor implements Runnable

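1.0.2 Active sending vs. passive timeout: how the responsibilities are split

This is the point that is easiest to misread: both look like "heartbeat", but the Sender and the Monitor solve two different problems.

  • org.apache.flink.runtime.heartbeat.HeartbeatManagerSenderImpl: "actively sends a request every interval", prompting the peer to send a heartbeat back (which doubles as a liveness probe).
  • org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor: the per-target state machine and timer (a received heartbeat renews the lease; no heartbeat by the deadline is a timeout; consecutive RPC failures mark the target unreachable).

1.0.2.1 What the Sender does: periodically iterate over all targets and send requests

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerSenderImpl#run

java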
@Override
public void run() {
    if (!stopped) {
        log.debug("Trigger heartbeat request.");
        for (HeartbeatMonitor<O> heartbeatMonitor : getHeartbeatTargets().values()) {
            requestHeartbeat(heartbeatMonitor);
        }

        getMainThreadExecutor().schedule(this, heartbeatPeriod, TimeUnit.MILLISECONDS);
    }
}
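  • The keyword here is "request": active sending.
  • It is not responsible for resetting timeout timers; it only makes sure heartbeat traffic is generated.

The success or failure of the Sender's RPC to each individual target is fed back to that target's monitor (updating the RPC failure count and the unreachable state). That logic lives in HeartbeatManagerImpl:

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl#handleHeartbeatRpc

java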
protected BiConsumer<Void, Throwable> handleHeartbeatRpc(ResourceID heartbeatTarget) {
    return (unused, failure) -> {
        if (failure != null) {
            handleHeartbeatRpcFailure(
                    heartbeatTarget, ExceptionUtils.stripCompletionException(failure));
        } else {
            handleHeartbeatRpcSuccess(heartbeatTarget);
        }
    };
}
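
To make the "RPC failure count → unreachable" idea concrete, here is a minimal sketch. It is a toy model, not Flink's code: the class RpcFailureCounter and all of its members are hypothetical names, and the rule it assumes is that only consecutive failures count and any success resets the streak.

```java
/** Toy model (hypothetical names) of per-target RPC failure counting. */
final class RpcFailureCounter {
    enum State { RUNNING, UNREACHABLE }

    // Assumption of this sketch: a threshold <= 0 disables the check.
    private final int failuresUntilUnreachable;
    private int consecutiveFailures = 0;
    private State state = State.RUNNING;

    RpcFailureCounter(int failuresUntilUnreachable) {
        this.failuresUntilUnreachable = failuresUntilUnreachable;
    }

    /** Called when a heartbeat RPC to the target failed. */
    void reportRpcFailure() {
        if (state != State.RUNNING) {
            return;
        }
        consecutiveFailures++;
        if (failuresUntilUnreachable > 0 && consecutiveFailures >= failuresUntilUnreachable) {
            state = State.UNREACHABLE;
        }
    }

    /** Called when a heartbeat RPC succeeded: the failure streak is broken. */
    void reportRpcSuccess() {
        if (state == State.RUNNING) {
            consecutiveFailures = 0;
        }
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        RpcFailureCounter counter = new RpcFailureCounter(2);
        counter.reportRpcFailure();
        counter.reportRpcSuccess(); // resets the streak
        counter.reportRpcFailure();
        System.out.println(counter.state()); // RUNNING
        counter.reportRpcFailure();
        System.out.println(counter.state()); // UNREACHABLE
    }
}
```

The design point the sketch isolates: a single lost RPC is tolerated, while a run of failures reaching the threshold flips the target to unreachable without waiting for the full heartbeat timeout.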

1.0.2.2 What the Monitor does: one "timeout alarm" per target

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/DefaultHeartbeatMonitor.java>

FQCN: org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor#resetHeartbeatTimeout

java
void resetHeartbeatTimeout(long heartbeatTimeout) {
    if (state.get() == State.RUNNING) {
        cancelTimeout();

        futureTimeout = scheduledExecutor.schedule(this, heartbeatTimeout, TimeUnit.MILLISECONDS);

        if (state.get() != State.RUNNING) {
            cancelTimeout();
        }
    }
}
  • The keyword here is "timeout": passive waiting.
  • scheduledExecutor.schedule(this, ...) schedules DefaultHeartbeatMonitor#run itself; when the timer fires, run triggers the timeout callback:

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/DefaultHeartbeatMonitor.java>

FQCN: org.apache.flink.runtime.heartbeat.DefaultHeartbeatMonitor#run

java
@Override
public void run() {
    if (state.compareAndSet(State.RUNNING, State.TIMEOUT)) {
        heartbeatListener.notifyHeartbeatTimeout(resourceID);
    }
}

1.0.2.3 Who renews the lease: HeartbeatManagerImpl.reportHeartbeat → DefaultHeartbeatMonitor.reportHeartbeat

Flink puts the "reset the timeout whenever a heartbeat arrives" action at a single entry point in HeartbeatManagerImpl:

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl#reportHeartbeat

java
HeartbeatTarget<O> reportHeartbeat(ResourceID resourceID) {
    if (heartbeatTargets.containsKey(resourceID)) {
        HeartbeatMonitor<O> heartbeatMonitor = heartbeatTargets.get(resourceID);
        heartbeatMonitor.reportHeartbeat();

        return heartbeatMonitor.getHeartbeatTarget();
    } else {
        return null;
    }
}
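  • This is where "a heartbeat arrived, so push the corresponding monitor's timeout alarm back" actually happens: heartbeatMonitor.reportHeartbeat() leads to the resetHeartbeatTimeout(...) call shown in 1.0.2.2.
  • Hence the conclusion: the Sender is responsible for "sending", the Monitor for "state and timing"; only the two together make up the complete heartbeat mechanism.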
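
The "timeout alarm that every heartbeat pushes back" can be sketched without any real timers. This is a simplified model under stated assumptions: TimeoutAlarm and its methods are hypothetical names, and the clock is passed in explicitly instead of using a ScheduledExecutor, so the behaviour is deterministic.

```java
/** Toy per-target timeout alarm driven by an explicit clock (hypothetical names). */
final class TimeoutAlarm {
    private final long timeoutMillis;
    private long deadline;
    private boolean timedOut = false;

    TimeoutAlarm(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.deadline = nowMillis + timeoutMillis;
    }

    /** A heartbeat arrived: push the deadline back, unless the alarm already fired. */
    void reportHeartbeat(long nowMillis) {
        if (!timedOut) {
            deadline = nowMillis + timeoutMillis;
        }
    }

    /** Advance the clock; returns true exactly once, when the deadline is first reached. */
    boolean tick(long nowMillis) {
        if (!timedOut && nowMillis >= deadline) {
            timedOut = true;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TimeoutAlarm alarm = new TimeoutAlarm(50, 0); // deadline = 50
        alarm.reportHeartbeat(40);                    // deadline = 90
        System.out.println(alarm.tick(60));           // false: the heartbeat at t=40 renewed the lease
        System.out.println(alarm.tick(90));           // true: no heartbeat since t=40
        System.out.println(alarm.tick(100));          // false: the alarm fires only once
    }
}
```

The "fires only once" behaviour plays the same role as state.compareAndSet(State.RUNNING, State.TIMEOUT) in DefaultHeartbeatMonitor#run: after the transition, later ticks or late heartbeats no longer change anything.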

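1.1 ClusterEntrypoint: initializing HeartbeatServices (the foundational service entry point)

On the cluster side, the earliest initialization of HeartbeatServices happens in the service-initialization phase of org.apache.flink.runtime.entrypoint.ClusterEntrypoint: it constructs HeartbeatServices first, and that object is later passed to components such as the ResourceManager so they can create their own HeartbeatManagers. A TaskExecutor running in a separate TaskManager process receives a HeartbeatServices built in that process from the same configuration options.

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java>

FQCN: org.apache.flink.runtime.entrypoint.ClusterEntrypoint

java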
// ...
heartbeatServices = createHeartbeatServices(configuration);
// ...

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java>

FQCN: org.apache.flink.runtime.entrypoint.ClusterEntrypoint#createHeartbeatServices

java
protected HeartbeatServices createHeartbeatServices(Configuration configuration) {
    return HeartbeatServices.fromConfiguration(configuration);
}
  • The output of this step is a HeartbeatServices (factory/facade), not a concrete HeartbeatManager; the concrete RM/TM heartbeat managers are created from it later, when those components start.

1.2 HeartbeatServices: fixing interval/timeout from configuration (foundational service)

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatServices.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatServices#fromConfiguration

java
static HeartbeatServices fromConfiguration(Configuration configuration) {
    long heartbeatInterval = configuration.get(HeartbeatManagerOptions.HEARTBEAT_INTERVAL);
    long heartbeatTimeout = configuration.get(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT);
    int failedRpcRequestsUntilUnreachable =
            configuration.get(HeartbeatManagerOptions.HEARTBEAT_RPC_FAILURE_THRESHOLD);

    return new HeartbeatServicesImpl(
            heartbeatInterval, heartbeatTimeout, failedRpcRequestsUntilUnreachable);
}
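  • HeartbeatServices is the "convergence point" for the heartbeat parameters: interval, timeout and the failure threshold are fixed once here, and components afterwards only use it to build HeartbeatManagers.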
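
As an illustration of "read three values once, then freeze them", the sketch below does the same with plain java.util.Properties. It is not Flink's Configuration API. The class HeartbeatSettings is hypothetical; the keys heartbeat.interval, heartbeat.timeout and heartbeat.rpc-failure-threshold and the defaults 10000 / 50000 / 2 reflect my understanding of Flink 1.x's HeartbeatManagerOptions and should be checked against the documentation of the version in use. The validation rules are choices made for this sketch.

```java
import java.util.Properties;

/** Toy stand-in (hypothetical) for turning configuration into immutable heartbeat settings. */
final class HeartbeatSettings {
    final long intervalMillis;
    final long timeoutMillis;
    final int rpcFailureThreshold;

    private HeartbeatSettings(long intervalMillis, long timeoutMillis, int rpcFailureThreshold) {
        // Sanity checks chosen for this sketch.
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (timeoutMillis < intervalMillis) {
            throw new IllegalArgumentException("timeout must be >= interval");
        }
        this.intervalMillis = intervalMillis;
        this.timeoutMillis = timeoutMillis;
        this.rpcFailureThreshold = rpcFailureThreshold;
    }

    /** Reads the three values, falling back to the assumed defaults. */
    static HeartbeatSettings fromProperties(Properties props) {
        return new HeartbeatSettings(
                Long.parseLong(props.getProperty("heartbeat.interval", "10000")),
                Long.parseLong(props.getProperty("heartbeat.timeout", "50000")),
                Integer.parseInt(props.getProperty("heartbeat.rpc-failure-threshold", "2")));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("heartbeat.timeout", "30000"); // override only the timeout
        HeartbeatSettings s = fromProperties(props);
        System.out.println(s.intervalMillis + " " + s.timeoutMillis + " " + s.rpcFailureThreshold);
        // prints: 10000 30000 2
    }
}
```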

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatServicesImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatServicesImpl

java
public final class HeartbeatServicesImpl implements HeartbeatServices {
    @Override
    public <I, O> HeartbeatManager<I, O> createHeartbeatManager(
            ResourceID resourceId,
            HeartbeatListener<I, O> heartbeatListener,
            ScheduledExecutor mainThreadExecutor,
            Logger log) {
        return new HeartbeatManagerImpl<>(
                heartbeatTimeout,
                failedRpcRequestsUntilUnreachable,
                resourceId,
                heartbeatListener,
                mainThreadExecutor,
                log);
    }

    @Override
    public <I, O> HeartbeatManager<I, O> createHeartbeatManagerSender(
            ResourceID resourceId,
            HeartbeatListener<I, O> heartbeatListener,
            ScheduledExecutor mainThreadExecutor,
            Logger log) {
        return new HeartbeatManagerSenderImpl<>(
                heartbeatInterval,
                heartbeatTimeout,
                failedRpcRequestsUntilUnreachable,
                resourceId,
                heartbeatListener,
                mainThreadExecutor,
                log);
    }
}
  • createHeartbeatManager(...) returns a HeartbeatManagerImpl: it never initiates requests (it only monitors for timeouts and replies when a request arrives).
  • createHeartbeatManagerSender(...) returns a HeartbeatManagerSenderImpl: the same as above, plus periodically triggering requests (this is what the RM uses towards TMs).

1.3 ResourceManager: initializing the heartbeat Sender in onStart

ResourceManager is an RPC endpoint (it extends org.apache.flink.runtime.rpc.FencedRpcEndpoint). Its startup entry point is onStart, which starts the heartbeat services inside startResourceManagerServices.

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager#onStart

java
@Override
public final void onStart() throws Exception {
    try {
        log.info("Starting the resource manager.");
        startResourceManagerServices();
        startedFuture.complete(null);
    } catch (Throwable t) {
        final ResourceManagerException exception =
                new ResourceManagerException(
                        String.format("Could not start the ResourceManager %s", getAddress()),
                        t);
        onFatalError(exception);
        throw exception;
    }
}

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager#startResourceManagerServices

java
private void startResourceManagerServices() throws Exception {
    try {
        jobLeaderIdService.start(new JobLeaderIdActionsImpl());
        registerMetrics();
        startHeartbeatServices();

        slotManager.start(
                getFencingToken(),
                getMainThreadExecutor(),
                resourceAllocator,
                new ResourceEventListenerImpl(),
                blocklistHandler::isBlockedTaskManager);

        delegationTokenManager.start(this);
        initialize();
    } catch (Exception e) {
        handleStartResourceManagerServicesException(e);
    }
}

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager#startHeartbeatServices

java
private void startHeartbeatServices() {
    taskManagerHeartbeatManager =
            heartbeatServices.createHeartbeatManagerSender(
                    resourceId,
                    new TaskManagerHeartbeatListener(),
                    getMainThreadExecutor(),
                    log);

    jobManagerHeartbeatManager =
            heartbeatServices.createHeartbeatManagerSender(
                    resourceId,
                    new JobManagerHeartbeatListener(),
                    getMainThreadExecutor(),
                    log);
}
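  • First conclusion: towards TMs the RM uses createHeartbeatManagerSender (and, as the code shows, it does the same towards JobManagers), so the RM "actively and periodically initiates heartbeat requests".

1.4 TaskExecutor: preparing the ResourceManagerHeartbeatManager (Receiver) ahead of time

Only briefly here: the TaskExecutor creates its "RM heartbeat manager" as early as its constructor, but with createHeartbeatManager (not the sender variant).

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor#createResourceManagerHeartbeatManager

java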
private HeartbeatManager<Void, TaskExecutorHeartbeatPayload>
        createResourceManagerHeartbeatManager(
                HeartbeatServices heartbeatServices, ResourceID resourceId) {
    return heartbeatServices.createHeartbeatManager(
            resourceId, new ResourceManagerHeartbeatListener(), getMainThreadExecutor(), log);
}
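  • This means the TM side does not "send requests periodically"; its job is mainly to monitor whether the RM is alive and to reply when an RM request arrives.

Part 2: TaskExecutor establishes the heartbeat with ResourceManager for the first time (brief)

This part touches the TaskExecutor startup path and the registration retry mechanism. This article only marks the key points of "establishing the heartbeat relationship"; a separate article on TaskExecutor startup will fill in the rest.

2.1 Discover the RM leader → trigger a reconnect

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor#notifyOfNewResourceManagerLeader

java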
private void notifyOfNewResourceManagerLeader(
        String newLeaderAddress, ResourceManagerId newResourceManagerId) {
    resourceManagerAddress =
            createResourceManagerAddress(newLeaderAddress, newResourceManagerId);
    reconnectToResourceManager(
            new FlinkException(
                    String.format(
                            "ResourceManager leader changed to new address %s",
                            resourceManagerAddress)));
}

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor#reconnectToResourceManager

java
private void reconnectToResourceManager(Exception cause) {
    closeResourceManagerConnection(cause);
    startRegistrationTimeout();
    tryConnectToResourceManager();
}

2.2 Initiating the registration connection (TaskExecutor → ResourceManager)

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor#connectToResourceManager

java
private void connectToResourceManager() {
    log.info("Connecting to ResourceManager {}.", resourceManagerAddress);

    final TaskExecutorRegistration taskExecutorRegistration =
            new TaskExecutorRegistration(
                    getAddress(),
                    getResourceID(),
                    unresolvedTaskManagerLocation.getDataPort(),
                    JMXService.getPort().orElse(-1),
                    hardwareDescription,
                    memoryConfiguration,
                    taskManagerConfiguration.getDefaultSlotResourceProfile(),
                    taskManagerConfiguration.getTotalResourceProfile(),
                    unresolvedTaskManagerLocation.getNodeId());

    resourceManagerConnection =
            new TaskExecutorToResourceManagerConnection(
                    log,
                    getRpcService(),
                    taskManagerConfiguration.getRetryingRegistrationConfiguration(),
                    resourceManagerAddress.getAddress(),
                    resourceManagerAddress.getResourceManagerId(),
                    getMainThreadExecutor(),
                    new ResourceManagerRegistrationListener(),
                    taskExecutorRegistration);
    resourceManagerConnection.start();
}

2.3 Registration-success callback: actually establishing the heartbeat monitoring relationship (joining HeartbeatTargets)

After the TaskExecutor registers successfully, it calls establishResourceManagerConnection, which contains the key line: resourceManagerHeartbeatManager.monitorTarget(...)

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor.ResourceManagerRegistrationListener#onRegistrationSuccess

java
@Override
public void onRegistrationSuccess(
        TaskExecutorToResourceManagerConnection connection,
        TaskExecutorRegistrationSuccess success) {
    final ResourceID resourceManagerId = success.getResourceManagerId();
    final InstanceID taskExecutorRegistrationId = success.getRegistrationId();
    final ClusterInformation clusterInformation = success.getClusterInformation();
    final ResourceManagerGateway resourceManagerGateway = connection.getTargetGateway();

    runAsync(
            () -> {
                if (resourceManagerConnection == connection) {
                    establishResourceManagerConnection(
                            resourceManagerGateway,
                            resourceManagerId,
                            taskExecutorRegistrationId,
                            clusterInformation);
                }
            });
}

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor#establishResourceManagerConnection

java
private void establishResourceManagerConnection(
        ResourceManagerGateway resourceManagerGateway,
        ResourceID resourceManagerResourceId,
        InstanceID taskExecutorRegistrationId,
        ClusterInformation clusterInformation) {

    // monitor the resource manager as heartbeat target
    resourceManagerHeartbeatManager.monitorTarget(
            resourceManagerResourceId,
            new ResourceManagerHeartbeatReceiver(resourceManagerGateway));

    establishedResourceManagerConnection =
            new EstablishedResourceManagerConnection(
                    resourceManagerGateway,
                    resourceManagerResourceId,
                    taskExecutorRegistrationId);

    stopRegistrationTimeout();
}

On the ResourceManager side, registerTaskExecutorInternal likewise adds the TaskExecutor to its own set of heartbeat targets while it handles the registration (this step is crucial, because the RM's sender only sends heartbeats to targets that have gone through monitorTarget).

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager#registerTaskExecutorInternal

java
taskManagerHeartbeatManager.monitorTarget(
        taskExecutorResourceId, new TaskExecutorHeartbeatSender(taskExecutorGateway));
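  • One sentence to remember from this part: after a successful registration, both sides monitorTarget the peer, which forms the "ResourceID → HeartbeatMonitor → HeartbeatTarget" monitoring relationship.

2.4 The "essential call chain" of a heartbeat: HeartbeatTarget → Gateway → TaskExecutor (important)

The monitorTarget(...) call above only adds the peer to HeartbeatTargets. What actually makes the heartbeat "beat" afterwards is this: the RM-side Sender periodically calls HeartbeatTarget#requestHeartbeat(...) on some target, and the concrete implementation of that HeartbeatTarget ends up as one RPC to the TaskExecutor (TaskExecutorGateway#heartbeatFromResourceManager).

Start with the protocol abstraction. HeartbeatTarget defines only two things: asking the target to send a heartbeat back (requestHeartbeat), and delivering a heartbeat to the target (receiveHeartbeat).

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatTarget.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatTarget

java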
public interface HeartbeatTarget<I> {
    CompletableFuture<Void> receiveHeartbeat(ResourceID heartbeatOrigin, I heartbeatPayload);
    CompletableFuture<Void> requestHeartbeat(ResourceID requestOrigin, I heartbeatPayload);
}

2.4.1 RM → TM: TaskExecutorHeartbeatSender maps requestHeartbeat onto TaskExecutorGateway

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager.TaskExecutorHeartbeatSender#requestHeartbeat

java
private static final class TaskExecutorHeartbeatSender extends HeartbeatSender<Void> {
    private final TaskExecutorGateway taskExecutorGateway;

    private TaskExecutorHeartbeatSender(TaskExecutorGateway taskExecutorGateway) {
        this.taskExecutorGateway = taskExecutorGateway;
    }

    @Override
    public CompletableFuture<Void> requestHeartbeat(ResourceID resourceID, Void payload) {
        return taskExecutorGateway.heartbeatFromResourceManager(resourceID);
    }
}
  • The "essence" of TaskExecutorHeartbeatSender: it turns HeartbeatTarget#requestHeartbeat into one TaskExecutorGateway#heartbeatFromResourceManager RPC call.
  • Because the RM's sender periodically iterates over HeartbeatTargets, the net effect is a periodic "call into the TaskExecutor".

2.4.2 TM receives the RM's heartbeat request: TaskExecutor#heartbeatFromResourceManager → HeartbeatManagerImpl#requestHeartbeat

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor#heartbeatFromResourceManager

java
@Override
public CompletableFuture<Void> heartbeatFromResourceManager(ResourceID resourceID) {
    return resourceManagerHeartbeatManager.requestHeartbeat(resourceID, null);
}
  • This is the step you described as "the executor calling the resourceManager's requestHeart method". To be precise, what gets called is resourceManagerHeartbeatManager.requestHeartbeat(...): the TaskExecutor's own local heartbeat manager for the RM, not a method on the ResourceManager itself.
  • requestHeartbeat(...) means: record the RM's heartbeat (renew the lease / reset the timeout) and send one heartbeat back to the RM through HeartbeatTarget#receiveHeartbeat (the reply goes through ResourceManagerHeartbeatReceiver → ResourceManagerGateway#heartbeatFromTaskManager).

2.4.3 UML: the end-to-end call chain of one RM-triggered heartbeat request

Participants: ResourceManager, taskManagerHeartbeatManager (Sender), TaskExecutorHeartbeatSender (HeartbeatTarget), TaskExecutorGateway (RPC), TaskExecutor, resourceManagerHeartbeatManager (Impl), ResourceManagerHeartbeatReceiver (HeartbeatTarget), ResourceManagerGateway (RPC)

(1) taskManagerHeartbeatManager (Sender): run(), once per interval
(2) taskManagerHeartbeatManager (Sender) → TaskExecutorHeartbeatSender: requestHeartbeat(rmResourceId, null)
(3) TaskExecutorHeartbeatSender → TaskExecutorGateway (RPC): heartbeatFromResourceManager(rmResourceId)
(4) TaskExecutorGateway (RPC) → TaskExecutor: heartbeatFromResourceManager(rmResourceId)
(5) TaskExecutor → resourceManagerHeartbeatManager (Impl): requestHeartbeat(rmResourceId, null)
(6) resourceManagerHeartbeatManager (Impl) → ResourceManagerHeartbeatReceiver: receiveHeartbeat(tmResourceId, payload)
(7) ResourceManagerHeartbeatReceiver → ResourceManagerGateway (RPC): heartbeatFromTaskManager(tmResourceId, payload)
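
The whole round trip can be replayed in memory with a cut-down model. This is a sketch under loud assumptions, not Flink's implementation: Target and Manager are hypothetical stand-ins for HeartbeatTarget and HeartbeatManagerImpl, calls are synchronous instead of returning CompletableFuture, plain strings replace ResourceID and the payload type, and the two managers reference each other directly where the real system goes through TaskExecutorHeartbeatSender / ResourceManagerHeartbeatReceiver and RPC gateways.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class HeartbeatRoundTrip {
    /** Cut-down, synchronous analogue of HeartbeatTarget (hypothetical). */
    interface Target<I> {
        void requestHeartbeat(String origin, I payload);
        void receiveHeartbeat(String origin, I payload);
    }

    /** Toy manager: I is the inbound payload type, O the outbound one. */
    static final class Manager<I, O> implements Target<I> {
        private final String ownId;
        private final Function<String, O> payloadFor; // plays the role of retrievePayload
        private final List<String> log;
        private final Map<String, Target<O>> targets = new LinkedHashMap<>();

        Manager(String ownId, Function<String, O> payloadFor, List<String> log) {
            this.ownId = ownId;
            this.payloadFor = payloadFor;
            this.log = log;
        }

        void monitorTarget(String id, Target<O> target) {
            targets.put(id, target);
        }

        @Override
        public void requestHeartbeat(String origin, I payload) {
            Target<O> replyChannel = targets.get(origin);
            if (replyChannel == null) {
                return; // requests from unmonitored origins are ignored
            }
            log.add(ownId + " got request from " + origin);
            replyChannel.receiveHeartbeat(ownId, payloadFor.apply(origin));
        }

        @Override
        public void receiveHeartbeat(String origin, I payload) {
            if (targets.containsKey(origin)) {
                log.add(ownId + " got heartbeat from " + origin + " with " + payload);
            }
        }

        /** One sender round: ask every monitored target for a heartbeat. */
        void triggerRound() {
            for (Map.Entry<String, Target<O>> entry : targets.entrySet()) {
                entry.getValue().requestHeartbeat(ownId, payloadFor.apply(entry.getKey()));
            }
        }
    }

    static List<String> simulate() {
        List<String> log = new ArrayList<>();
        // RM: receives a report (String here), sends nothing (Void).
        Manager<String, Void> rm = new Manager<String, Void>("rm", id -> null, log);
        // TM: receives nothing (Void), sends a report.
        Manager<Void, String> tm = new Manager<Void, String>("tm-1", id -> "slot-report", log);
        rm.monitorTarget("tm-1", tm); // RM side of the registration
        tm.monitorTarget("rm", rm);   // TM side of the registration
        rm.triggerRound();
        return log;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
        // tm-1 got request from rm
        // rm got heartbeat from tm-1 with slot-report
    }
}
```

Note how the generic parameters mirror each other: the RM-side manager is Manager<String, Void> and the TM-side one Manager<Void, String>, which matches the HeartbeatManager<Void, TaskExecutorHeartbeatPayload> type seen on the TaskExecutor in 1.4.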

Part 3: ResourceManager starts heartbeating periodically (focus)

3.1 HeartbeatManagerSenderImpl: periodically iterating HeartbeatTargets and sending requests

The key point of org.apache.flink.runtime.heartbeat.HeartbeatManagerSenderImpl: it schedules itself in its constructor, and every run then schedules the next round.

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerSenderImpl#run

java
@Override
public void run() {
    if (!stopped) {
        log.debug("Trigger heartbeat request.");
        for (HeartbeatMonitor<O> heartbeatMonitor : getHeartbeatTargets().values()) {
            requestHeartbeat(heartbeatMonitor);
        }

        getMainThreadExecutor().schedule(this, heartbeatPeriod, TimeUnit.MILLISECONDS);
    }
}

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerSenderImpl#requestHeartbeat

java
private void requestHeartbeat(HeartbeatMonitor<O> heartbeatMonitor) {
    O payload = getHeartbeatListener().retrievePayload(heartbeatMonitor.getHeartbeatTargetId());
    final HeartbeatTarget<O> heartbeatTarget = heartbeatMonitor.getHeartbeatTarget();

    heartbeatTarget
            .requestHeartbeat(getOwnResourceID(), payload)
            .whenCompleteAsync(
                    handleHeartbeatRpc(heartbeatMonitor.getHeartbeatTargetId()),
                    getMainThreadExecutor());
}
  • Key dependency: only targets registered through monitorTarget(...) show up in getHeartbeatTargets(), so the monitorTarget call from Part 2 is exactly the "join the heartbeat queue" action.
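
Both properties, "schedule yourself at construction, reschedule at the end of every run" and "only monitored targets receive requests", can be checked with a deterministic toy. This is a sketch with hypothetical names (SelfReschedulingSender and its members); a plain Deque<Runnable> stands in for the main-thread executor and no real delay is modelled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy sender (hypothetical) that re-enqueues itself after every round. */
final class SelfReschedulingSender implements Runnable {
    private final Deque<Runnable> scheduler; // stand-in for the main-thread executor
    private final Map<String, List<String>> targets = new LinkedHashMap<>(); // id -> requests it received
    private boolean stopped = false;
    private int round = 0;

    SelfReschedulingSender(Deque<Runnable> scheduler) {
        this.scheduler = scheduler;
        scheduler.add(this); // the first round is scheduled at construction time
    }

    void monitorTarget(String id) {
        targets.putIfAbsent(id, new ArrayList<>());
    }

    void stop() {
        stopped = true;
    }

    List<String> requestsSeenBy(String id) {
        return targets.getOrDefault(id, List.of());
    }

    @Override
    public void run() {
        if (stopped) {
            return; // a stopped sender neither sends nor reschedules
        }
        round++;
        for (List<String> inbox : targets.values()) {
            inbox.add("request#" + round);
        }
        scheduler.add(this); // schedule the next round
    }

    public static void main(String[] args) {
        Deque<Runnable> scheduler = new ArrayDeque<>();
        SelfReschedulingSender sender = new SelfReschedulingSender(scheduler);
        scheduler.poll().run();         // round 1: nothing is monitored yet
        sender.monitorTarget("tm-1");
        scheduler.poll().run();         // round 2: tm-1 gets its first request
        System.out.println(sender.requestsSeenBy("tm-1")); // [request#2]
        sender.stop();
        scheduler.poll().run();         // stopped: no request, no reschedule
        System.out.println(scheduler.isEmpty()); // true
    }
}
```

Self-rescheduling (as opposed to a fixed-rate timer) means a new round is only queued after the previous one has run, and stopping is as simple as not queueing the next one.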

3.2 RM → TM: the HeartbeatTarget is backed by TaskExecutorGateway

On the RM side, the HeartbeatTarget for a TM is TaskExecutorHeartbeatSender, which boils down to a single RPC: TaskExecutorGateway#heartbeatFromResourceManager

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager.TaskExecutorHeartbeatSender#requestHeartbeat

java
private static final class TaskExecutorHeartbeatSender extends HeartbeatSender<Void> {
    private final TaskExecutorGateway taskExecutorGateway;

    private TaskExecutorHeartbeatSender(TaskExecutorGateway taskExecutorGateway) {
        this.taskExecutorGateway = taskExecutorGateway;
    }

    @Override
    public CompletableFuture<Void> requestHeartbeat(ResourceID resourceID, Void payload) {
        return taskExecutorGateway.heartbeatFromResourceManager(resourceID);
    }
}

3.3 TM receives the RM's heartbeat request: TaskExecutorGateway#heartbeatFromResourceManager

When the RM's heartbeatFromResourceManager RPC reaches the TM, it enters the TaskExecutor implementation, which delegates to resourceManagerHeartbeatManager.requestHeartbeat(...)

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor#heartbeatFromResourceManager

java
@Override
public CompletableFuture<Void> heartbeatFromResourceManager(ResourceID resourceID) {
    return resourceManagerHeartbeatManager.requestHeartbeat(resourceID, null);
}

The requestHeartbeat implementation used here comes from HeartbeatManagerImpl. Its core is "record the heartbeat, then send one back (receiveHeartbeat)".

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerImpl.java>

FQCN: org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl#requestHeartbeat

java
@Override
public CompletableFuture<Void> requestHeartbeat(
        final ResourceID requestOrigin, I heartbeatPayload) {
    if (!stopped) {
        log.debug("Received heartbeat request from {}.", requestOrigin);

        final HeartbeatTarget<O> heartbeatTarget = reportHeartbeat(requestOrigin);

        if (heartbeatTarget != null) {
            if (heartbeatPayload != null) {
                heartbeatListener.reportPayload(requestOrigin, heartbeatPayload);
            }

            heartbeatTarget
                    .receiveHeartbeat(
                            getOwnResourceID(),
                            heartbeatListener.retrievePayload(requestOrigin))
                    .whenCompleteAsync(handleHeartbeatRpc(requestOrigin), mainThreadExecutor);
        }
    }

    return FutureUtils.completedVoidFuture();
}
  • reportHeartbeat(requestOrigin) updates the monitor's heartbeat time and returns the HeartbeatTarget that was bound when monitorTarget(...) was called (that is, the "reply channel"). If the origin is not monitored, it returns null and no reply is sent.
  • The manager then replies with one heartbeat through heartbeatTarget.receiveHeartbeat(...).

On the TM side, the HeartbeatTarget used to "reply to the RM" is ResourceManagerHeartbeatReceiver, which maps receiveHeartbeat onto ResourceManagerGateway#heartbeatFromTaskManager

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java>

FQCN: org.apache.flink.runtime.taskexecutor.TaskExecutor.ResourceManagerHeartbeatReceiver#receiveHeartbeat

java
private static final class ResourceManagerHeartbeatReceiver
        extends HeartbeatReceiver<TaskExecutorHeartbeatPayload> {
    private final ResourceManagerGateway resourceManagerGateway;

    private ResourceManagerHeartbeatReceiver(ResourceManagerGateway resourceManagerGateway) {
        this.resourceManagerGateway = resourceManagerGateway;
    }

    @Override
    public CompletableFuture<Void> receiveHeartbeat(
            ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) {
        return resourceManagerGateway.heartbeatFromTaskManager(resourceID, heartbeatPayload);
    }
}

3.4 RM receives the TM's reply: TaskManagerHeartbeatListener handles payload and timeouts

The RM-side entry point of the heartbeat RPC is ResourceManager#heartbeatFromTaskManager. It hands the call to taskManagerHeartbeatManager.receiveHeartbeat(...), and TaskManagerHeartbeatListener then handles the payload as well as timeout/unreachable conditions.

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager#heartbeatFromTaskManager

java
@Override
public CompletableFuture<Void> heartbeatFromTaskManager(
        final ResourceID resourceID, final TaskExecutorHeartbeatPayload heartbeatPayload) {
    return taskManagerHeartbeatManager.receiveHeartbeat(resourceID, heartbeatPayload);
}

Path: <flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java>

FQCN: org.apache.flink.runtime.resourcemanager.ResourceManager.TaskManagerHeartbeatListener#reportPayload

java
@Override
public void reportPayload(
        final ResourceID resourceID, final TaskExecutorHeartbeatPayload payload) {
    validateRunsInMainThread();
    final WorkerRegistration<WorkerType> workerRegistration = taskExecutors.get(resourceID);

    if (workerRegistration == null) {
        log.debug(
                "Received slot report from TaskManager {} which is no longer registered.",
                resourceID.getStringWithMetadata());
    } else {
        InstanceID instanceId = workerRegistration.getInstanceID();

        slotManager.reportSlotStatus(instanceId, payload.getSlotReport());
        clusterPartitionTracker.processTaskExecutorClusterPartitionReport(
                resourceID, payload.getClusterPartitionReport());
    }
}
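  • The one thing to remember here: the payload's slot report and cluster partition report are handed over to the SlotManager and the cluster partition tracker respectively (SlotManager details will be covered separately).

Back to the question (summary)

  • Part 1 answers "startup": the RM creates a sender in onStart (createHeartbeatManagerSender), and the TM creates a receiver ahead of time (createHeartbeatManager).
  • Part 2 answers "link-up": after a successful registration both sides call monitorTarget(...), adding the peer to HeartbeatTargets (what you called "joining the heartbeat queue").
  • Part 3 answers "periodic heartbeat": the RM's sender periodically iterates over HeartbeatTargets and sends requestHeartbeat; the TM receives the RPC and replies through HeartbeatManagerImpl; the RM receives the reply and TaskManagerHeartbeatListener handles the payload and error conditions.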