Startup Analysis of the TaskManager, the Worker Node of a Flink Cluster

1. Overview

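The TaskManager is the worker process of a Flink cluster. It carries out the actual computation of a dataflow and is therefore also called a "Worker". A Flink cluster must have at least one TaskManager, and every TaskManager contains a certain number of task slots. A slot is the smallest unit of resource scheduling, and the number of slots limits how many tasks a TaskManager can process in parallel.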
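
To make the role of the slot count concrete, here is a minimal sketch, assuming a worker that simply refuses new tasks once all slots are taken. SlotLimitedWorker and its methods are hypothetical names for illustration, not Flink classes, and a Semaphore stands in for Flink's real slot bookkeeping.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical model: a worker whose slot count caps the number of concurrently running tasks. */
class SlotLimitedWorker {
    private final Semaphore freeSlots;

    SlotLimitedWorker(int numberOfSlots) {
        if (numberOfSlots <= 0) {
            throw new IllegalArgumentException("The number of slots has to be larger than 0.");
        }
        this.freeSlots = new Semaphore(numberOfSlots);
    }

    /** Tries to occupy one slot; returns false when every slot is already in use. */
    boolean tryOccupySlot() {
        return freeSlots.tryAcquire();
    }

    /** Returns a previously occupied slot to the pool. */
    void releaseSlot() {
        freeSlots.release();
    }

    /** Number of slots that are currently free. */
    int availableSlots() {
        return freeSlots.availablePermits();
    }

    public static void main(String[] args) {
        SlotLimitedWorker worker = new SlotLimitedWorker(2);
        System.out.println(worker.tryOccupySlot()); // first task gets a slot
        System.out.println(worker.tryOccupySlot()); // second task gets a slot
        System.out.println(worker.tryOccupySlot()); // third task finds no free slot
    }
}
```

With two slots, the third request is rejected until one of the running tasks releases its slot.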

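After startup, the TaskManager registers its slots with the ResourceManager. On instruction from the ResourceManager, the TaskManager offers one or more slots to a JobMaster, which can then assign tasks to them for execution.

While a job is running, a TaskManager can buffer data and exchange data with other TaskManagers that run the same application.

TaskManager is a logical abstraction that stands for one server. Starting that server necessarily brings up a set of services, plus one TaskExecutor that lives inside the TaskManager and does the actual core work on its behalf: submitting Tasks for execution and requesting and releasing slots.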

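2. TaskManager startup

The TaskManager is mainly responsible for managing the slot resources of its machine and for executing the actual tasks. According to the cluster startup scripts, the main class that starts the TaskManager is TaskManagerRunner.

2.1 Entry point (main)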

public static void main(String[] args) throws Exception {
		// startup checks and logging: environment information printed when the worker node starts
		EnvironmentInformation.logEnvironmentInfo(LOG, "TaskManager", args);
		SignalHandler.register(LOG);
		JvmShutdownSafeguard.installAsShutdownHook(LOG);
		long maxOpenFileHandles = EnvironmentInformation.getOpenFileHandlesLimit();
		if(maxOpenFileHandles != -1L) {
			LOG.info("Maximum number of open file descriptors is {}.", maxOpenFileHandles);
		} else {
			LOG.info("Cannot determine the maximum number of open file descriptors");
		}
		// Startup entry point
		runTaskManagerSecurely(args, ResourceID.generate());
	}

Note on ResourceID: when a Flink cluster starts, the master node and every worker node each generate a globally unique ID.
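
As an illustration of such an ID, the sketch below generates a random identifier per process. It assumes a UUID-based scheme; ProcessId is a hypothetical class, and Flink's ResourceID.generate() may derive its value differently.

```java
import java.util.UUID;

/** Hypothetical stand-in for an ID that identifies one cluster process. */
final class ProcessId {
    private final String value;

    private ProcessId(String value) {
        this.value = value;
    }

    /** Generates a random 32-character hex ID; a collision is practically impossible. */
    static ProcessId generate() {
        return new ProcessId(UUID.randomUUID().toString().replace("-", ""));
    }

    String value() {
        return value;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof ProcessId && ((ProcessId) other).value.equals(value);
    }

    @Override
    public int hashCode() {
        return value.hashCode();
    }

    @Override
    public String toString() {
        return value;
    }

    public static void main(String[] args) {
        // Each process would call generate() once at startup and keep the result.
        System.out.println(ProcessId.generate());
    }
}
```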

2.2 runTaskManagerSecurely (entry point)

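  • 1. Load the parameters (parse the arguments of main + the configuration file)
  • 2. Start the TaskManager
    • 2.1 Initialize the plugin service and the file system service (basic services)
    • 2.2 Start the TaskManager inside the installed security context (runSecured)
      • 2.2.1 Build the TaskManagerRunner instance
        • ① Initialize a number of services (services provided to the outside)
        • ② Initialize the Executor
      • 2.2.2 Send the start message to confirm that startup succeeded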
public static void runTaskManagerSecurely(String[] args, ResourceID resourceID) {
			try {
				// Load the configuration (arguments passed by the shell script + flink-conf.yaml)
				Configuration configuration = loadConfiguration(args);
	
				// Start the TaskManager
				runTaskManagerSecurely(configuration, resourceID);
	
			} catch(Throwable t) {
				final Throwable strippedThrowable = ExceptionUtils.stripException(t, UndeclaredThrowableException.class);
				LOG.error("TaskManager initialization failed.", strippedThrowable);
				System.exit(STARTUP_FAILURE_RETURN_CODE);
			}
		}

// --> runTaskManagerSecurely(configuration, resourceID);
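// 1. Initialize the plugin manager and the file systems
// 2. Start the TaskManager inside the installed security context (runSecured invokes the Callable)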
public static void runTaskManagerSecurely(Configuration configuration, ResourceID resourceID) throws Exception {
		replaceGracefulExitWithHaltIfConfigured(configuration);
		/*************************************************
		 *  Initialize the plugins
		 */
		final PluginManager pluginManager = PluginUtils.createPluginManagerFromRootFolder(configuration);

		// Initialize the file systems
		FileSystem.initialize(configuration, pluginManager);
		SecurityUtils.install(new SecurityConfiguration(configuration));

		/*************************************************
		 *  Run the startup wrapped in the security context
		 */
		SecurityUtils.getInstalledContext().runSecured(

			// Start the TaskManager inside the Callable passed to runSecured
			() -> {
				runTaskManager(configuration, resourceID, pluginManager);
				return null;
			});
	}

// -->runTaskManager(configuration, resourceID, pluginManager);
public static void runTaskManager(Configuration configuration, ResourceID resourceId, PluginManager pluginManager) throws Exception {
		/*************************************************
		 *  Build the TaskManagerRunner instance.
		 *  TaskManagerRunner is the executable entry point of the TaskManager in standalone mode.
		 *  It constructs the related components (network, I/O manager, memory manager, RPC service, HA service) and starts them.
		 */
		final TaskManagerRunner taskManagerRunner = new TaskManagerRunner(configuration, resourceId, pluginManager);

		/*************************************************
		 *  Send the START message to confirm that startup succeeded
		 */
		taskManagerRunner.start();
	}
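
The wrapping done by runSecured can be pictured with a small model. This is a sketch under stated assumptions: SimpleSecurityContext, PassThroughContext and SecuredStartupDemo are hypothetical names rather than Flink classes, and the pass-through context applies no credentials and simply invokes the Callable on the calling thread. A context that applies real credentials would do its work around the same call.

```java
import java.util.concurrent.Callable;

/** Hypothetical, simplified security context; not Flink's actual interface. */
interface SimpleSecurityContext {
    <T> T runSecured(Callable<T> securedCallable) throws Exception;
}

/** Context that applies no credentials and calls the callable directly. */
class PassThroughContext implements SimpleSecurityContext {
    @Override
    public <T> T runSecured(Callable<T> securedCallable) throws Exception {
        return securedCallable.call();
    }
}

class SecuredStartupDemo {
    /** Runs the startup logic inside the context and returns the name of the thread it ran on. */
    static String startSecurely(SimpleSecurityContext context) {
        try {
            return context.runSecured(() -> Thread.currentThread().getName());
        } catch (Exception e) {
            throw new IllegalStateException("Startup failed.", e);
        }
    }

    public static void main(String[] args) {
        // With the pass-through context this prints the name of the calling thread.
        System.out.println(startSecurely(new PassThroughContext()));
    }
}
```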

2.3 Instantiating the TaskManagerRunner object

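1. Initialize the services:

  • Thread pool: handles asynchronous callbacks (asynchronous programming: future.xxx(() -> xxxxx(), executor))
  • HA service: ZooKeeperHaServices (when the HA mode in flink-conf.yaml is set to ZooKeeper)
  • RPC service: the RpcServer is created by building a proxy object
  • Heartbeat service: the heartbeat between ResourceManager and TaskManager has two key parameters, the heartbeat interval (10 s) and the heartbeat timeout (50 s)
  • Blob service: internally it consists of two scheduled tasks that periodically check for and delete the resource files of expired jobs. Whether a file has expired is decided by reference counting. The two caches are PermanentBlobCache and TransientBlobCache

2. Start the TaskManager

  • Starts the TaskExecutor, which is responsible for running multiple Tasks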
public TaskManagerRunner(Configuration configuration, ResourceID resourceId, PluginManager pluginManager) throws Exception {
		this.configuration = checkNotNull(configuration);
		this.resourceId = checkNotNull(resourceId);
		timeout = AkkaUtils.getTimeoutAsTime(configuration);

		// Initialize the thread pool that handles callbacks
		this.executor = java.util.concurrent.Executors
			.newScheduledThreadPool(Hardware.getNumberCPUCores(), new ExecutorThreadFactory("taskmanager-future"));

		/*************************************************
		 *  HA service: ZooKeeperHaServices
		 *  Gives access to all services needed for high availability: registration, distributed counters, and leader election
		 */
		highAvailabilityServices = HighAvailabilityServicesUtils
			.createHighAvailabilityServices(configuration, executor, HighAvailabilityServicesUtils.AddressResolution.NO_ADDRESS_RESOLUTION);

		// Initialize the RpcService
		rpcService = createRpcService(configuration, highAvailabilityServices);

		// Initialize the HeartbeatServices
		HeartbeatServices heartbeatServices = HeartbeatServices.fromConfiguration(configuration);

		metricRegistry = new MetricRegistryImpl(MetricRegistryConfiguration.fromConfiguration(configuration),
			ReporterSetup.fromConfiguration(configuration, pluginManager));
		final RpcService metricQueryServiceRpcService = MetricUtils.startRemoteMetricsRpcService(configuration, rpcService.getAddress());
		metricRegistry.startQueryService(metricQueryServiceRpcService, resourceId);

		// Initialize the BlobCacheService
		blobCacheService = new BlobCacheService(configuration, highAvailabilityServices.createBlobStore(), null);

		// Provides information about external resources
		final ExternalResourceInfoProvider externalResourceInfoProvider = ExternalResourceUtils
			.createStaticExternalResourceInfoProvider(ExternalResourceUtils.getExternalResourceAmountMap(configuration),
				ExternalResourceUtils.externalResourceDriversFromConfig(configuration, pluginManager));

		/*************************************************
		 *  Start the TaskManager.
		 *  Creates the TaskExecutor, which is responsible for running multiple Tasks
		 */
		taskManager = startTaskManager(this.configuration, this.resourceId, rpcService, highAvailabilityServices, heartbeatServices, metricRegistry,
			blobCacheService, false, externalResourceInfoProvider, this);

		this.terminationFuture = new CompletableFuture<>();
		this.shutdown = false;

		MemoryLogger.startIfConfigured(LOG, configuration, terminationFuture);
	}
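
The callback thread pool created in the constructor follows the usual pattern of handing an explicit executor to asynchronous callbacks. Here is a minimal sketch of that pattern, assuming a fixed pool of two threads and a "-thread-n" naming suffix; CallbackExecutorDemo and namedFactory are hypothetical helpers, not Flink's ExecutorThreadFactory.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

class CallbackExecutorDemo {
    /** Thread factory producing daemon threads named "<prefix>-thread-<n>" (illustrative naming). */
    static ThreadFactory namedFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger(1);
        return runnable -> {
            Thread thread = new Thread(runnable, prefix + "-thread-" + counter.getAndIncrement());
            thread.setDaemon(true);
            return thread;
        };
    }

    /** Runs a callback on the dedicated pool and returns the name of the thread that executed it. */
    static String runCallback(String prefix) {
        ExecutorService executor = Executors.newFixedThreadPool(2, namedFactory(prefix));
        try {
            return CompletableFuture
                .supplyAsync(() -> 21, executor)
                // The second argument pins the callback to the dedicated pool.
                .thenApplyAsync(value -> Thread.currentThread().getName(), executor)
                .join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runCallback("taskmanager-future"));
    }
}
```

Passing the executor explicitly keeps callbacks off the common fork-join pool and makes the responsible threads easy to spot in thread dumps.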

2.4 startTaskManager (starting the TaskManager)

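  • 1. Obtain the resource specification: the resources of one real physical node (CPU, memory, network)
  • 2. taskExecutorResourceSpec --> TaskManagerServicesConfiguration (the configuration is wrapped in a TaskManagerServicesConfiguration object)
  • 3. Initialize ioExecutor (the I/O thread pool)
  • 4. Build the TaskManagerServices object, which wraps the service components the TaskManager has to provide while it runs
    • 1. Initialize TaskEventDispatcher (dispatches task events)
    • 2. Initialize IOManagerAsync (moves data asynchronously)
    • 3. shuffleEnvironment = NettyShuffleEnvironment (for shuffles between upstream and downstream Tasks)
    • 4. Initialize KvStateService (state service)
    • 5. Initialize BroadcastVariableManager (broadcast service)
    • 6. Initialize TaskSlotTable [interface --> TaskSlotTableImpl] (maintains the relationships between all TaskSlots on the TaskManager, the Tasks, and the Jobs)
    • 7. Initialize the DefaultJobTable service (job information)
    • 8. Initialize the JobLeaderService (serves the startup of JobMasters)
  • 5. Return the TaskExecutor object (startTaskManager --> TaskExecutor), which internally builds two important heartbeat managers
    • JobManagerHeartbeatManager
    • ResourceManagerHeartbeatManager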
public static TaskExecutor startTaskManager(Configuration configuration, ResourceID resourceID, RpcService rpcService,
		HighAvailabilityServices highAvailabilityServices, HeartbeatServices heartbeatServices, MetricRegistry metricRegistry,
		BlobCacheService blobCacheService, boolean localCommunicationOnly, ExternalResourceInfoProvider externalResourceInfoProvider,
		FatalErrorHandler fatalErrorHandler) throws Exception {

		checkNotNull(configuration);
		checkNotNull(resourceID);
		checkNotNull(rpcService);
		checkNotNull(highAvailabilityServices);

		LOG.info("Starting TaskManager with ResourceID: {}", resourceID);

		String externalAddress = rpcService.getAddress();

		final TaskExecutorResourceSpec taskExecutorResourceSpec = TaskExecutorResourceUtils.resourceSpecFromConfig(configuration);

		// Build the TaskManagerServicesConfiguration
		TaskManagerServicesConfiguration taskManagerServicesConfiguration = TaskManagerServicesConfiguration.fromConfiguration(configuration, resourceID, externalAddress, localCommunicationOnly, taskExecutorResourceSpec);
		Tuple2<TaskManagerMetricGroup, MetricGroup> taskManagerMetricGroup = MetricUtils.instantiateTaskManagerMetricGroup(metricRegistry, externalAddress,
			resourceID, taskManagerServicesConfiguration.getSystemResourceMetricsProbingInterval());

		// Initialize ioExecutor
		final ExecutorService ioExecutor =
			newFixedThreadPool(taskManagerServicesConfiguration.getNumIoThreads(), new ExecutorThreadFactory("flink-taskexecutor-io"));

		// Build the TaskManagerServices
		TaskManagerServices taskManagerServices = TaskManagerServices
			.fromConfiguration(taskManagerServicesConfiguration, blobCacheService.getPermanentBlobService(), taskManagerMetricGroup.f1, ioExecutor,
				fatalErrorHandler);

		// Build the TaskManagerConfiguration
		TaskManagerConfiguration taskManagerConfiguration = TaskManagerConfiguration
			.fromConfiguration(configuration, taskExecutorResourceSpec, externalAddress);

		String metricQueryServiceAddress = metricRegistry.getMetricQueryServiceGatewayRpcAddress();

		/*************************************************
		 *  Create the TaskExecutor instance.
		 *  Internally it creates two important heartbeat managers:
		 *  1. JobManagerHeartbeatManager
		 *  2. ResourceManagerHeartbeatManager
		 */
		return new TaskExecutor(rpcService, taskManagerConfiguration, highAvailabilityServices, taskManagerServices, externalResourceInfoProvider,
			heartbeatServices, taskManagerMetricGroup.f0, metricQueryServiceAddress, blobCacheService, fatalErrorHandler,
			new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
			createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));
	}

//--> TaskManagerServices fromConfiguration
public static TaskManagerServices fromConfiguration(TaskManagerServicesConfiguration taskManagerServicesConfiguration,
		PermanentBlobService permanentBlobService, MetricGroup taskManagerMetricGroup, ExecutorService ioExecutor,
		FatalErrorHandler fatalErrorHandler) throws Exception {

		// pre-start checks: verify the working (temp) directories
		checkTempDirs(taskManagerServicesConfiguration.getTmpDirPaths());

		// Initialize TaskEventDispatcher
		final TaskEventDispatcher taskEventDispatcher = new TaskEventDispatcher();

		// Initialize IOManagerAsync
		// start the I/O manager, it will create some temp directories.
		final IOManager ioManager = new IOManagerAsync(taskManagerServicesConfiguration.getTmpDirPaths());

		// shuffleEnvironment = NettyShuffleEnvironment
		final ShuffleEnvironment<?, ?> shuffleEnvironment = createShuffleEnvironment(taskManagerServicesConfiguration, taskEventDispatcher,
			taskManagerMetricGroup, ioExecutor);
		final int listeningDataPort = shuffleEnvironment.start();

		// Initialize KvStateService
		final KvStateService kvStateService = KvStateService.fromConfiguration(taskManagerServicesConfiguration);
		kvStateService.start();

		final UnresolvedTaskManagerLocation unresolvedTaskManagerLocation = new UnresolvedTaskManagerLocation(
			taskManagerServicesConfiguration.getResourceID(), taskManagerServicesConfiguration.getExternalAddress(),
			// we expose the task manager location with the listening port
			// iff the external data port is not explicitly defined
			taskManagerServicesConfiguration.getExternalDataPort() > 0 ? taskManagerServicesConfiguration.getExternalDataPort() : listeningDataPort);

		// Initialize BroadcastVariableManager
		final BroadcastVariableManager broadcastVariableManager = new BroadcastVariableManager();

		// Initialize TaskSlotTable
		final TaskSlotTable<Task> taskSlotTable = createTaskSlotTable(taskManagerServicesConfiguration.getNumberOfSlots(),
			taskManagerServicesConfiguration.getTaskExecutorResourceSpec(), taskManagerServicesConfiguration.getTimerServiceShutdownTimeout(),
			taskManagerServicesConfiguration.getPageSize(), ioExecutor);

		// Initialize DefaultJobTable
		final JobTable jobTable = DefaultJobTable.create();

		// Initialize JobLeaderService
		final JobLeaderService jobLeaderService = new DefaultJobLeaderService(unresolvedTaskManagerLocation,
			taskManagerServicesConfiguration.getRetryingRegistrationConfiguration());

		final String[] stateRootDirectoryStrings = taskManagerServicesConfiguration.getLocalRecoveryStateRootDirectories();

		final File[] stateRootDirectoryFiles = new File[stateRootDirectoryStrings.length];

		for(int i = 0; i < stateRootDirectoryStrings.length; ++i) {
			stateRootDirectoryFiles[i] = new File(stateRootDirectoryStrings[i], LOCAL_STATE_SUB_DIRECTORY_ROOT);
		}

		// Initialize TaskExecutorLocalStateStoresManager
		final TaskExecutorLocalStateStoresManager taskStateManager = new TaskExecutorLocalStateStoresManager(
			taskManagerServicesConfiguration.isLocalRecoveryEnabled(), stateRootDirectoryFiles, ioExecutor);

		final boolean failOnJvmMetaspaceOomError = taskManagerServicesConfiguration.getConfiguration()
			.getBoolean(CoreOptions.FAIL_ON_USER_CLASS_LOADING_METASPACE_OOM);

		// Initialize LibraryCacheManager
		final LibraryCacheManager libraryCacheManager = new BlobLibraryCacheManager(permanentBlobService,
			BlobLibraryCacheManager.defaultClassLoaderFactory(taskManagerServicesConfiguration.getClassLoaderResolveOrder(),
				taskManagerServicesConfiguration.getAlwaysParentFirstLoaderPatterns(), failOnJvmMetaspaceOomError ? fatalErrorHandler : null));

		// Return the TaskManagerServices
		return new TaskManagerServices(unresolvedTaskManagerLocation, taskManagerServicesConfiguration.getManagedMemorySize().getBytes(), ioManager,
			shuffleEnvironment, kvStateService, broadcastVariableManager, taskSlotTable, jobTable, jobLeaderService, taskStateManager,
			taskEventDispatcher, ioExecutor, libraryCacheManager);
	}
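
Of the services built above, the TaskSlotTable is the one that ties slots to jobs. The following is a much-simplified sketch of that bookkeeping, assuming each slot belongs to at most one job at a time; SimpleSlotTable and its methods are hypothetical and do not mirror the TaskSlotTableImpl API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, much-simplified slot table: slot index -> ID of the job that owns the slot. */
class SimpleSlotTable {
    private final int numberOfSlots;
    private final Map<Integer, String> slotToJob = new HashMap<>();

    SimpleSlotTable(int numberOfSlots) {
        this.numberOfSlots = numberOfSlots;
    }

    /** Allocates the slot for the (non-null) job; fails if the index is invalid or the slot is taken. */
    boolean allocateSlot(int index, String jobId) {
        if (index < 0 || index >= numberOfSlots || slotToJob.containsKey(index)) {
            return false;
        }
        slotToJob.put(index, jobId);
        return true;
    }

    /** Frees the slot and returns the job that owned it, or null if the slot was free. */
    String freeSlot(int index) {
        return slotToJob.remove(index);
    }

    /** Sorted indices of all slots currently allocated to the given job. */
    Set<Integer> slotsOfJob(String jobId) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, String> entry : slotToJob.entrySet()) {
            if (entry.getValue().equals(jobId)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Number of slots that are not allocated to any job. */
    int freeSlotCount() {
        return numberOfSlots - slotToJob.size();
    }
}
```

A real slot table additionally tracks the tasks running in each slot and times out slots that were allocated but never used, which this sketch leaves out.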

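2.5 startTaskManager --> TaskExecutor (returning the TaskExecutor)

The worker node starts by instantiating a TaskManagerRunner object, and the work that follows falls into two parts: 1. initializing the basic services (thread pool, HA, RPC, heartbeat service, and Blob service); 2. starting the TaskManager: the hardware configuration is wrapped in a TaskManagerServicesConfiguration object, the I/O thread pool is initialized, a TaskManagerServices object is built (it wraps the service components the TaskManager provides while it runs), and finally a TaskExecutor object is returned (it holds two heartbeat managers: JobManagerHeartbeatManager and ResourceManagerHeartbeatManager).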

// --> Return the TaskExecutor object
return new TaskExecutor(rpcService, taskManagerConfiguration, highAvailabilityServices, taskManagerServices, externalResourceInfoProvider,
			heartbeatServices, taskManagerMetricGroup.f0, metricQueryServiceAddress, blobCacheService, fatalErrorHandler,
			new TaskExecutorPartitionTrackerImpl(taskManagerServices.getShuffleEnvironment()),
			createBackPressureSampleService(configuration, rpcService.getScheduledExecutor()));

// Because TaskExecutor is an RpcEndpoint, its onStart() method is executed after this constructor has finished and the endpoint is started
    public TaskExecutor(RpcService rpcService, TaskManagerConfiguration taskManagerConfiguration, HighAvailabilityServices haServices,
		TaskManagerServices taskExecutorServices, ExternalResourceInfoProvider externalResourceInfoProvider, HeartbeatServices heartbeatServices,
		TaskManagerMetricGroup taskManagerMetricGroup, @Nullable String metricQueryServiceAddress, BlobCacheService blobCacheService,
		FatalErrorHandler fatalErrorHandler, TaskExecutorPartitionTracker partitionTracker, BackPressureSampleService backPressureSampleService) {
		// Creates a random name of the form prefix_X, where X is an increasing number
		super(rpcService, AkkaRpcServiceUtils.createRandomName(TASK_MANAGER_NAME));
		checkArgument(taskManagerConfiguration.getNumberSlots() > 0, "The number of slots has to be larger than 0.");
		this.taskManagerConfiguration = checkNotNull(taskManagerConfiguration);
		this.taskExecutorServices = checkNotNull(taskExecutorServices);
		this.haServices = checkNotNull(haServices);
		this.fatalErrorHandler = checkNotNull(fatalErrorHandler);
		this.partitionTracker = partitionTracker;
		this.taskManagerMetricGroup = checkNotNull(taskManagerMetricGroup);
		this.blobCacheService = checkNotNull(blobCacheService);
		this.metricQueryServiceAddress = metricQueryServiceAddress;
		this.backPressureSampleService = checkNotNull(backPressureSampleService);
		this.externalResourceInfoProvider = checkNotNull(externalResourceInfoProvider);
		this.libraryCacheManager = taskExecutorServices.getLibraryCacheManager();
		this.taskSlotTable = taskExecutorServices.getTaskSlotTable();
		this.jobTable = taskExecutorServices.getJobTable();
		this.jobLeaderService = taskExecutorServices.getJobLeaderService();
		this.unresolvedTaskManagerLocation = taskExecutorServices.getUnresolvedTaskManagerLocation();
		this.localStateStoresManager = taskExecutorServices.getTaskManagerStateStore();
		this.shuffleEnvironment = taskExecutorServices.getShuffleEnvironment();
		this.kvStateService = taskExecutorServices.getKvStateService();
		this.ioExecutor = taskExecutorServices.getIOExecutor();
		this.resourceManagerLeaderRetriever = haServices.getResourceManagerLeaderRetriever();
		// Hardware description object
		this.hardwareDescription = HardwareDescription.extractFromSystem(taskExecutorServices.getManagedMemorySize());
		this.resourceManagerAddress = null;
		this.resourceManagerConnection = null;
		this.currentRegistrationTimeoutId = null;
		final ResourceID resourceId = taskExecutorServices.getUnresolvedTaskManagerLocation().getResourceID();
		// jobManagerHeartbeatManager is a HeartbeatManagerImpl
		this.jobManagerHeartbeatManager = createJobManagerHeartbeatManager(heartbeatServices, resourceId);
		// resourceManagerHeartbeatManager is a HeartbeatManagerImpl
		this.resourceManagerHeartbeatManager = createResourceManagerHeartbeatManager(heartbeatServices, resourceId);
	}
// --> Execution then continues in TaskExecutor's onStart() method
public void onStart() throws Exception {
		try {

			/*************************************************
			 *  Start the services.
			 *  The important ones:
			 *  1. Monitor the ResourceManager
			 *  2. Start the TaskSlotTable service
			 *  3. Monitor the JobMaster
			 *  4. Start the FileCache service
			 */
			startTaskExecutorServices();

		} catch(Exception e) {
			final TaskManagerException exception = new TaskManagerException(String.format("Could not start the TaskExecutor %s", getAddress()), e);
			onFatalError(exception);
			throw exception;
		}

		// Start registration: arm the registration timeout
		startRegistrationTimeout();
	}
//-->startTaskExecutorServices();
private void startTaskExecutorServices() throws Exception {
		try {

			// Start the ResourceManagerLeaderListener. The TaskManager registers with the ResourceManager through
			// this listener: it watches for ResourceManager leader changes, and when a new leader is elected,
			// notifyLeaderAddress() is called to trigger a reconnect to the ResourceManager.
			// start by connecting to the ResourceManager
			resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
            
            // How to read the code above:
            /* 1. ResourceManagerLeaderListener implements LeaderRetrievalListener. Its notifyLeaderAddress() method
             *    connects to the ResourceManager: it builds a TaskExecutorRegistration object and a
             *    TaskExecutorToResourceManagerConnection object (the connection between TaskExecutor and
             *    ResourceManager), starts the connection, creates the registration object (newRegistration), and
             *    begins registering with the ResourceManager. The ResourceManager obtains the TaskManager's globally
             *    unique ID, turns the taskExecutorRegistration into a WorkerRegistration object, and stores the pair
             *    in a map: taskExecutors.put(taskExecutorResourceId, registration). It finally returns a
             *    TaskExecutorRegistrationSuccess object. From then on the heartbeat between ResourceManager and
             *    TaskExecutor has to be maintained: taskExecutorGateway stands for the TaskExecutor that just
             *    registered, and taskExecutorGateway.heartbeatFromResourceManager(resourceID) delivers the
             *    ResourceManager's heartbeat request, to which the TaskExecutor answers with its heartbeat report.
             * 2. In the standalone scenario, resourceManagerLeaderRetriever is implemented by
             *    ZooKeeperLeaderRetrievalService, which implements NodeCacheListener, a listener interface provided
             *    by Curator. When the data of the watched ZooKeeper znode changes, nodeChanged() of
             *    ZooKeeperLeaderRetrievalService is called back; nodeChanged() then calls notifyIfNewLeaderAddress()
             *    to notify the corresponding LeaderRetrievalListener.
             * 3. ZooKeeperLeaderRetrievalService, the implementation behind resourceManagerLeaderRetriever, is a
             *    LeaderRetrievalService.
             * 4. resourceManagerLeaderRetriever keeps listening; whenever the leader changes, notifyLeaderAddress()
             *    of ResourceManagerLeaderListener is called.
             */

			// Start the TaskSlotTable
			// tell the task slot table who's responsible for the task slot actions
			taskSlotTable.start(new SlotActionsImpl(), getMainThreadExecutor());

			// Start the JobLeaderService
			// start the job leader service
			jobLeaderService.start(getAddress(), getRpcService(), haServices, new JobLeaderListenerImpl());

			// Initialize the FileCache
			fileCache = new FileCache(taskManagerConfiguration.getTmpDirectories(), blobCacheService.getPermanentBlobService());
		} catch(Exception e) {
			handleStartTaskExecutorServicesException(e);
		}
	}
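
The listener mechanism described in the comments above boils down to a callback that fires on every leader change. The sketch below models only that shape, assuming the leader is identified by a plain address string; LeaderListener, SimpleLeaderRetriever and ReconnectingWorker are hypothetical types, and the ZooKeeper watch is replaced by a direct method call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical, simplified listener that mirrors the callback shape described above. */
interface LeaderListener {
    void notifyLeaderAddress(String leaderAddress);
}

/** Hypothetical retrieval service; in the article's setup a ZooKeeper watch plays this role. */
class SimpleLeaderRetriever {
    private LeaderListener listener;
    private String lastLeaderAddress;

    void start(LeaderListener listener) {
        this.listener = Objects.requireNonNull(listener);
    }

    /** Called when the watched leader information changes; notifies only on a real change. */
    void onLeaderInfoChanged(String newLeaderAddress) {
        if (listener != null && !Objects.equals(lastLeaderAddress, newLeaderAddress)) {
            lastLeaderAddress = newLeaderAddress;
            listener.notifyLeaderAddress(newLeaderAddress);
        }
    }
}

/** Worker side: reconnects whenever a new leader is announced. */
class ReconnectingWorker implements LeaderListener {
    final List<String> connectionLog = new ArrayList<>();

    @Override
    public void notifyLeaderAddress(String leaderAddress) {
        // A real worker would close the old connection and register with the new leader here.
        connectionLog.add("connect:" + leaderAddress);
    }
}
```

Repeated notifications for an unchanged leader are filtered out, so the worker reconnects only when the address actually differs.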

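Once the TaskExecutor object has been instantiated, its onStart() method is executed:

  • Start the services: startTaskExecutorServices()
    • Monitor the ResourceManager (connect to the ResourceManager, register (with a registration timeout), and listen for RM changes)
      • Heartbeat between Flink's master and worker nodes:
        • 1. The ResourceManager starts and starts its HeartbeatManager, which every 10 s iterates over the registered TaskExecutors and sends each of them a heartbeat request
        • 2. The TaskExecutor starts and arms the registration timeout check (5 min); after startup it registers, and once it receives heartbeat requests the heartbeat between RM and TaskManager is being maintained
        • 3. Every time the TaskManager receives a heartbeat from the ResourceManager, it resets the timeout task.
    • Start the TaskSlotTable service: it contains a timeout-check service
    • Monitor the JobLeaderService: start a listener (when a running JobMaster moves to another node, the JobLeaderService receives the notification and handles it)
    • Start the FileCache service: a cache for resource files
  • Start registration: startRegistrationTimeout()
    • If the steps above take longer than 5 min, registration times out. This is the timeout check (a 5-minute registration timeout)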
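
The reset-on-heartbeat behaviour described in the list above can be sketched with an explicit clock, which keeps the example deterministic. HeartbeatTimeoutMonitor is a hypothetical class; the 10 s interval and 50 s timeout are the values quoted in this article, and a real heartbeat manager would schedule and cancel a timeout task instead of comparing timestamps.

```java
/** Hypothetical heartbeat monitor; time is passed in explicitly so the behaviour is deterministic. */
class HeartbeatTimeoutMonitor {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatTimeoutMonitor(long timeoutMillis, long nowMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("The timeout has to be larger than 0.");
        }
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Records a heartbeat, which restarts the timeout window. */
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** True once no heartbeat has arrived for longer than the timeout. */
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatTimeoutMonitor monitor = new HeartbeatTimeoutMonitor(50_000L, 0L);
        // Heartbeats arrive every 10 s, so the 50 s window never elapses.
        for (long now = 10_000L; now <= 100_000L; now += 10_000L) {
            monitor.onHeartbeat(now);
        }
        System.out.println(monitor.isTimedOut(100_000L)); // heartbeat just arrived
        System.out.println(monitor.isTimedOut(150_001L)); // more than 50 s of silence
    }
}
```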

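3. Summary

  • As the worker node of a Flink cluster, the TaskManager is mainly responsible for managing slot resources and executing the actual tasks, while staying in communication with the JobManager.

  • The entry-point class of the TaskManager is TaskManagerRunner.

  • The TaskManager startup mainly consists of:

    • Load the configuration (the arguments passed to main + flink-conf.yaml), and initialize the plugin service and the file system service
    • Start the TaskManager inside the installed security context (runSecured)
    • Instantiate the TaskManagerRunner object; once that succeeds, it sends itself a start message as confirmation.
      • TaskManagerRunner holds many basic services (HA / RPC / HeartbeatServices / the Blob service for large files)
      • Start the TaskManager, which finally returns the TaskExecutor that runs multiple Tasks.
        • Initialize TaskManagerServices (many service components, such as shuffleEnvironment, TaskSlotTable, and JobLeaderService)
        • Create two heartbeat managers: JobManagerHeartbeatManager and ResourceManagerHeartbeatManager
    • After the TaskExecutor has been instantiated, its onStart() method runs and starts four services: the heartbeat service, the management of the mapping between Tasks and Slots, the JobMaster service, and the file cache service.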