聊聊PowerJob的InstanceStatusCheckService

本文主要研究一下PowerJob的InstanceStatusCheckService

InstanceStatus

tech/powerjob/common/enums/InstanceStatus.java

复制代码
@Getter
@AllArgsConstructor
public enum InstanceStatus {
    /**
     *
     */
    WAITING_DISPATCH(1, "等待派发"),
    WAITING_WORKER_RECEIVE(2, "等待Worker接收"),
    RUNNING(3, "运行中"),
    FAILED(4, "失败"),
    SUCCEED(5, "成功"),
    CANCELED(9, "取消"),
    STOPPED(10, "手动停止");

    private final int v;
    private final String des;

    /**
     * 广义的运行状态
     */
    public static final List<Integer> GENERALIZED_RUNNING_STATUS = Lists.newArrayList(WAITING_DISPATCH.v, WAITING_WORKER_RECEIVE.v, RUNNING.v);
    /**
     * 结束状态
     */
    public static final List<Integer> FINISHED_STATUS = Lists.newArrayList(FAILED.v, SUCCEED.v, CANCELED.v, STOPPED.v);

    public static InstanceStatus of(int v) {
        for (InstanceStatus is : values()) {
            if (v == is.v) {
                return is;
            }
        }
        throw new IllegalArgumentException("InstanceStatus has no item for value " + v);
    }
}

InstanceStatus定义了任务实例的状态,广义运行中的状态为WAITING_DISPATCH、WAITING_WORKER_RECEIVE、RUNNING;终态为FAILED、SUCCEED、CANCELED、STOPPED

InstanceStatusCheckService

tech/powerjob/server/core/scheduler/InstanceStatusCheckService.java

复制代码
@Slf4j
@Service
@RequiredArgsConstructor
public class InstanceStatusCheckService {

    private static final int MAX_BATCH_NUM_APP = 10;
    private static final int MAX_BATCH_NUM_INSTANCE = 3000;
    private static final int MAX_BATCH_UPDATE_NUM = 500;
    private static final long DISPATCH_TIMEOUT_MS = 30000;
    private static final long RECEIVE_TIMEOUT_MS = 60000;
    private static final long RUNNING_TIMEOUT_MS = 60000;
    private static final long WORKFLOW_WAITING_TIMEOUT_MS = 60000;

    public static final long CHECK_INTERVAL = 10000;

    private final TransportService transportService;

    private final DispatchService dispatchService;

    private final InstanceManager instanceManager;

    private final WorkflowInstanceManager workflowInstanceManager;

    private final AppInfoRepository appInfoRepository;

    private final JobInfoRepository jobInfoRepository;

    private final InstanceInfoRepository instanceInfoRepository;

    private final WorkflowInfoRepository workflowInfoRepository;

    private final WorkflowInstanceInfoRepository workflowInstanceInfoRepository;

    //......
}    

InstanceStatusCheckService提供了checkRunningInstance、checkWaitingDispatchInstance、checkWaitingWorkerReceiveInstance、checkWorkflowInstance方法

checkRunningInstance

复制代码
    public void checkRunningInstance() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        // 查询 DB 获取该 Server 需要负责的 AppGroup
        List<Long> allAppIds = appInfoRepository.listAppIdByCurrentServer(transportService.defaultProtocol().getAddress());
        if (CollectionUtils.isEmpty(allAppIds)) {
            log.info("[InstanceStatusChecker] current server has no app's job to check");
            return;
        }
        try {
            // 检查 RUNNING 状态的任务(一定时间没收到 TaskTracker 的状态报告,视为失败)
            Lists.partition(allAppIds, MAX_BATCH_NUM_APP).forEach(this::handleRunningInstance);
        } catch (Exception e) {
            log.error("[InstanceStatusChecker] RunningInstance status check failed.", e);
        }
        log.info("[InstanceStatusChecker] RunningInstance status check using {}.", stopwatch.stop());
    }

    private void handleRunningInstance(List<Long> partAppIds) {
        // 3. 检查 RUNNING 状态的任务(一定时间没收到 TaskTracker 的状态报告,视为失败)
        long threshold = System.currentTimeMillis() - RUNNING_TIMEOUT_MS;
        List<BriefInstanceInfo> failedInstances = instanceInfoRepository.selectBriefInfoByAppIdInAndStatusAndGmtModifiedBefore(partAppIds, InstanceStatus.RUNNING.getV(), new Date(threshold), PageRequest.of(0, MAX_BATCH_NUM_INSTANCE));
        while (!failedInstances.isEmpty()) {
            // collect job id
            Set<Long> jobIds = failedInstances.stream().map(BriefInstanceInfo::getJobId).collect(Collectors.toSet());
            // query job info and map
            Map<Long, JobInfoDO> jobInfoMap = jobInfoRepository.findByIdIn(jobIds).stream().collect(Collectors.toMap(JobInfoDO::getId, e -> e));
            log.warn("[InstanceStatusCheckService] find some instances have not received status report for a long time : {}", failedInstances.stream().map(BriefInstanceInfo::getInstanceId).collect(Collectors.toList()));
            failedInstances.forEach(instance -> {
                Optional<JobInfoDO> jobInfoOpt = Optional.ofNullable(jobInfoMap.get(instance.getJobId()));
                if (!jobInfoOpt.isPresent()) {
                    final Optional<InstanceInfoDO> opt = instanceInfoRepository.findById(instance.getId());
                    opt.ifPresent(e -> updateFailedInstance(e, SystemInstanceResult.REPORT_TIMEOUT));
                    return;
                }
                TimeExpressionType timeExpressionType = TimeExpressionType.of(jobInfoOpt.get().getTimeExpressionType());
                SwitchableStatus switchableStatus = SwitchableStatus.of(jobInfoOpt.get().getStatus());
                // 如果任务已关闭,则不进行重试,将任务置为失败即可;秒级任务也直接置为失败,由派发器重新调度
                if (switchableStatus != SwitchableStatus.ENABLE || TimeExpressionType.FREQUENT_TYPES.contains(timeExpressionType.getV())) {
                    final Optional<InstanceInfoDO> opt = instanceInfoRepository.findById(instance.getId());
                    opt.ifPresent(e -> updateFailedInstance(e, SystemInstanceResult.REPORT_TIMEOUT));
                    return;
                }
                // CRON 和 API一样,失败次数 + 1,根据重试配置进行重试
                if (instance.getRunningTimes() < jobInfoOpt.get().getInstanceRetryNum()) {
                    dispatchService.redispatchAsync(instance.getInstanceId(), InstanceStatus.RUNNING.getV());
                } else {
                    final Optional<InstanceInfoDO> opt = instanceInfoRepository.findById(instance.getId());
                    opt.ifPresent(e -> updateFailedInstance(e, SystemInstanceResult.REPORT_TIMEOUT));
                }
            });
            threshold = System.currentTimeMillis() - RUNNING_TIMEOUT_MS;
            failedInstances = instanceInfoRepository.selectBriefInfoByAppIdInAndStatusAndGmtModifiedBefore(partAppIds, InstanceStatus.RUNNING.getV(), new Date(threshold), PageRequest.of(0, MAX_BATCH_NUM_INSTANCE));
        }

    }    

checkRunningInstance查询该server负责的appId,然后挨个遍历执行handleRunningInstance;handleRunningInstance查找一定时间没收到TaskTracker状态报告的任务实例,若任务已经关闭则不进行重试,若是秒级任务则更新为失败,其他的则判断运行次数市场超过重试次数,否则通过dispatchService.redispatchAsync重试

checkWaitingDispatchInstance

复制代码
    public void checkWaitingDispatchInstance() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        // 查询 DB 获取该 Server 需要负责的 AppGroup
        List<Long> allAppIds = appInfoRepository.listAppIdByCurrentServer(transportService.defaultProtocol().getAddress());
        if (CollectionUtils.isEmpty(allAppIds)) {
            log.info("[InstanceStatusChecker] current server has no app's job to check");
            return;
        }
        try {
            // 检查等待 WAITING_DISPATCH 状态的任务
            Lists.partition(allAppIds, MAX_BATCH_NUM_APP).forEach(this::handleWaitingDispatchInstance);
        } catch (Exception e) {
            log.error("[InstanceStatusChecker] WaitingDispatchInstance status check failed.", e);
        }
        log.info("[InstanceStatusChecker] WaitingDispatchInstance status check using {}.", stopwatch.stop());
    }

    private void handleWaitingDispatchInstance(List<Long> partAppIds) {
        // 1. 检查等待 WAITING_DISPATCH 状态的任务
        long threshold = System.currentTimeMillis() - DISPATCH_TIMEOUT_MS;
        List<InstanceInfoDO> waitingDispatchInstances = instanceInfoRepository.findAllByAppIdInAndStatusAndExpectedTriggerTimeLessThan(partAppIds, InstanceStatus.WAITING_DISPATCH.getV(), threshold, PageRequest.of(0, MAX_BATCH_NUM_INSTANCE));
        while (!waitingDispatchInstances.isEmpty()) {
            List<Long> overloadAppIdList = new ArrayList<>();
            long startTime = System.currentTimeMillis();
            // 按照 appId 分组处理,方便处理超载的逻辑
            Map<Long, List<InstanceInfoDO>> waitingDispatchInstancesMap = waitingDispatchInstances.stream().collect(Collectors.groupingBy(InstanceInfoDO::getAppId));
            for (Map.Entry<Long, List<InstanceInfoDO>> entry : waitingDispatchInstancesMap.entrySet()) {
                final Long currentAppId = entry.getKey();
                final List<InstanceInfoDO> currentAppWaitingDispatchInstances = entry.getValue();
                // collect job id
                Set<Long> jobIds = currentAppWaitingDispatchInstances.stream().map(InstanceInfoDO::getJobId).collect(Collectors.toSet());
                // query job info and map
                Map<Long, JobInfoDO> jobInfoMap = jobInfoRepository.findByIdIn(jobIds).stream().collect(Collectors.toMap(JobInfoDO::getId, e -> e));
                log.warn("[InstanceStatusChecker] find some instance in app({}) which is not triggered as expected: {}", currentAppId, currentAppWaitingDispatchInstances.stream().map(InstanceInfoDO::getInstanceId).collect(Collectors.toList()));
                final Holder<Boolean> overloadFlag = new Holder<>(false);
                // 先这么简单处理没问题,毕竟只有这一个地方用了 parallelStream
                currentAppWaitingDispatchInstances.parallelStream().forEach(instance -> {
                    if (overloadFlag.get()) {
                        // 直接忽略
                        return;
                    }
                    Optional<JobInfoDO> jobInfoOpt = Optional.ofNullable(jobInfoMap.get(instance.getJobId()));
                    if (jobInfoOpt.isPresent()) {
                        // 处理等待派发的任务没有必要再重置一次状态,减少 io 次数
                        dispatchService.dispatch(jobInfoOpt.get(), instance.getInstanceId(), Optional.of(instance), Optional.of(overloadFlag));
                    } else {
                        log.warn("[InstanceStatusChecker] can't find job by jobId[{}], so redispatch failed, failed instance: {}", instance.getJobId(), instance);
                        final Optional<InstanceInfoDO> opt = instanceInfoRepository.findById(instance.getId());
                        opt.ifPresent(instanceInfoDO -> updateFailedInstance(instanceInfoDO, SystemInstanceResult.CAN_NOT_FIND_JOB_INFO));
                    }
                });
                threshold = System.currentTimeMillis() - DISPATCH_TIMEOUT_MS;
                if (overloadFlag.get()) {
                    overloadAppIdList.add(currentAppId);
                }
            }
            log.info("[InstanceStatusChecker] process {} task,use {} ms", waitingDispatchInstances.size(), System.currentTimeMillis() - startTime);
            if (!overloadAppIdList.isEmpty()) {
                log.warn("[InstanceStatusChecker] app[{}] is overload, so skip check waiting dispatch instance", overloadAppIdList);
                partAppIds.removeAll(overloadAppIdList);
            }
            if (partAppIds.isEmpty()) {
                break;
            }
            waitingDispatchInstances = instanceInfoRepository.findAllByAppIdInAndStatusAndExpectedTriggerTimeLessThan(partAppIds, InstanceStatus.WAITING_DISPATCH.getV(), threshold, PageRequest.of(0, MAX_BATCH_NUM_INSTANCE));
        }

    }    

checkWaitingDispatchInstance查找待派发的任务实例,通过dispatchService.dispatch进行派发

checkWaitingWorkerReceiveInstance

复制代码
    public void checkWaitingWorkerReceiveInstance() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        // 查询 DB 获取该 Server 需要负责的 AppGroup
        List<Long> allAppIds = appInfoRepository.listAppIdByCurrentServer(transportService.defaultProtocol().getAddress());
        if (CollectionUtils.isEmpty(allAppIds)) {
            log.info("[InstanceStatusChecker] current server has no app's job to check");
            return;
        }
        try {
            // 检查 WAITING_WORKER_RECEIVE 状态的任务
            Lists.partition(allAppIds, MAX_BATCH_NUM_APP).forEach(this::handleWaitingWorkerReceiveInstance);
        } catch (Exception e) {
            log.error("[InstanceStatusChecker] WaitingWorkerReceiveInstance status check failed.", e);
        }
        log.info("[InstanceStatusChecker] WaitingWorkerReceiveInstance status check using {}.", stopwatch.stop());
    }

    private void handleWaitingWorkerReceiveInstance(List<Long> partAppIds) {
        // 2. 检查 WAITING_WORKER_RECEIVE 状态的任务
        long threshold = System.currentTimeMillis() - RECEIVE_TIMEOUT_MS;
        List<BriefInstanceInfo> waitingWorkerReceiveInstances = instanceInfoRepository.selectBriefInfoByAppIdInAndStatusAndActualTriggerTimeLessThan(partAppIds, InstanceStatus.WAITING_WORKER_RECEIVE.getV(), threshold, PageRequest.of(0, MAX_BATCH_NUM_INSTANCE));
        while (!waitingWorkerReceiveInstances.isEmpty()) {
            log.warn("[InstanceStatusChecker] find some instance didn't receive any reply from worker, try to redispatch: {}", waitingWorkerReceiveInstances.stream().map(BriefInstanceInfo::getInstanceId).collect(Collectors.toList()));
            final List<List<BriefInstanceInfo>> partitions = Lists.partition(waitingWorkerReceiveInstances, MAX_BATCH_UPDATE_NUM);
            for (List<BriefInstanceInfo> partition : partitions) {
                dispatchService.redispatchBatchAsyncLockFree(partition.stream().map(BriefInstanceInfo::getInstanceId).collect(Collectors.toList()), InstanceStatus.WAITING_WORKER_RECEIVE.getV());
            }
            // 重新查询
            threshold = System.currentTimeMillis() - RECEIVE_TIMEOUT_MS;
            waitingWorkerReceiveInstances = instanceInfoRepository.selectBriefInfoByAppIdInAndStatusAndActualTriggerTimeLessThan(partAppIds, InstanceStatus.WAITING_WORKER_RECEIVE.getV(), threshold, PageRequest.of(0, MAX_BATCH_NUM_INSTANCE));
        }
    }    

checkWaitingWorkerReceiveInstance检查等待worker接收的实例,挨个执行dispatchService.redispatchBatchAsyncLockFree

checkWorkflowInstance

复制代码
    public void checkWorkflowInstance() {
        Stopwatch stopwatch = Stopwatch.createStarted();
        // 查询 DB 获取该 Server 需要负责的 AppGroup
        List<Long> allAppIds = appInfoRepository.listAppIdByCurrentServer(transportService.defaultProtocol().getAddress());
        if (CollectionUtils.isEmpty(allAppIds)) {
            log.info("[InstanceStatusChecker] current server has no app's job to check");
            return;
        }
        try {
            checkWorkflowInstance(allAppIds);
        } catch (Exception e) {
            log.error("[InstanceStatusChecker] WorkflowInstance status check failed.", e);
        }
        log.info("[InstanceStatusChecker] WorkflowInstance status check using {}.", stopwatch.stop());
    }

    private void checkWorkflowInstance(List<Long> allAppIds) {

        // 重试长时间处于 WAITING 状态的工作流实例
        long threshold = System.currentTimeMillis() - WORKFLOW_WAITING_TIMEOUT_MS;
        Lists.partition(allAppIds, MAX_BATCH_NUM_APP).forEach(partAppIds -> {
            List<WorkflowInstanceInfoDO> waitingWfInstanceList = workflowInstanceInfoRepository.findByAppIdInAndStatusAndExpectedTriggerTimeLessThan(partAppIds, WorkflowInstanceStatus.WAITING.getV(), threshold);
            if (!CollectionUtils.isEmpty(waitingWfInstanceList)) {

                List<Long> wfInstanceIds = waitingWfInstanceList.stream().map(WorkflowInstanceInfoDO::getWfInstanceId).collect(Collectors.toList());
                log.warn("[WorkflowInstanceChecker] wfInstance({}) is not started as expected, oms try to restart these workflowInstance.", wfInstanceIds);

                waitingWfInstanceList.forEach(wfInstance -> {
                    Optional<WorkflowInfoDO> workflowOpt = workflowInfoRepository.findById(wfInstance.getWorkflowId());
                    workflowOpt.ifPresent(workflowInfo -> {
                        workflowInstanceManager.start(workflowInfo, wfInstance.getWfInstanceId());
                        log.info("[Workflow-{}|{}] restart workflowInstance successfully~", workflowInfo.getId(), wfInstance.getWfInstanceId());
                    });
                });
            }
        });
    }    

checkWorkflowInstance定期检测工作流实例的状态,针对WAITING的挨个执行workflowInstanceManager.start

小结

InstanceStatusCheckService提供了checkRunningInstance、checkWaitingDispatchInstance、checkWaitingWorkerReceiveInstance、checkWorkflowInstance方法,他们分别用于检查状态是运行中但是上报超时的任务实例、等待worker接收处理的任务实例、等待调度的工作流实例。

相关推荐
lili0012几秒前
CC GUI 插件架构剖析:如何为 JetBrains IDE 打造完整的 AI 编程工作台
java·ide·人工智能·python·架构·ai编程
Royzst4 分钟前
学生信息管理案例
java
爱棋笑谦6 分钟前
单元测试简述
java
音符犹如代码14 分钟前
Docker 一键部署带有 TimescaleDB 插件的 PostgreSQL
java·运维·数据库·后端·docker·postgresql·容器
sleepcattt25 分钟前
Java反射技术
java
小锋java123426 分钟前
【技术专题】Spring AI 2.0 - Advisors —— 拦截器模式增强AI能力
java·人工智能
AI人工智能+电脑小能手26 分钟前
【大白话说Java面试题 第56题】【JVM篇】第16题:JVM有哪些垃圾收集器?
java·开发语言·jvm·面试
二哈赛车手1 小时前
新人笔记---简易版AI实现以图搜图功能
java·人工智能·笔记·spring·ai
夕除1 小时前
spring boot 6
java·spring boot·后端
johnrui1 小时前
JUC之AQS
java·开发语言·jvm