Nacos Source Code, Part 3: Analysis of Nacos Cluster High Availability

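Outline

1. Questions about the Nacos cluster

2. A single node performs heartbeat health checks on services and syncs the results

3. How a newly registered service instance is synced to the other cluster nodes

4. Data synchronization when the health status of cluster nodes changes

5. How a newly added cluster node syncs existing service instance data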

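1. Questions about the Nacos cluster

**Question 1:** In standalone mode, the Nacos server starts a scheduled heartbeat health-check task. In cluster mode, is it necessary for every cluster node to run this scheduled task?

**Question 2:** When the Nacos server detects, through the scheduled heartbeat health-check task, that a service instance's health status has changed, how does it sync that status to the other Nacos cluster nodes?

**Question 3:** When a new service instance sends a registration request, only one Nacos cluster node handles that request. After handling it, how should the cluster nodes sync the service instance data among themselves?

**Question 4:** Suppose a Nacos cluster has three nodes and a new node needs to be added. How should the new node sync the service instance data that already exists in the cluster?

**Question 5:** Do Nacos cluster nodes use a heartbeat mechanism among themselves to detect whether a node is available?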

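2. A single node performs heartbeat health checks on services and syncs the results

(1) Design of heartbeat health checks on services in a cluster

(2) Source code: selecting one node to run the heartbeat health check for a service

(3) Source code: syncing service health status between cluster nodes

(4) Summary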

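(1) Design of heartbeat health checks on services in a cluster

Suppose a Nacos cluster has three nodes. We already know that a standalone Nacos server starts a scheduled heartbeat health-check task. With three cluster nodes, does every node need to run that task?

**Option 1:** All three nodes run the heartbeat health-check task. If the nodes reach different results, whose result wins?

**Option 2:** Only one node runs the heartbeat health-check task and then syncs the results to the other nodes.

Option 2 is clearly simpler and easier to reason about, and it is what the Nacos cluster does. In cluster mode, all three nodes start the scheduled heartbeat health-check task, but for any given service only one node actually executes the check logic. A separate scheduled task then syncs the check results to the other nodes.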

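(2) Source code: selecting one node to run the heartbeat health check for a service

The task that performs the heartbeat health check on a service is ClientBeatCheckTask. The Nacos server starts this task when it handles a service instance registration request, as shown below.

ClientBeatCheckTask is a Runnable, and its run() method starts with two if checks. The first checks whether, in cluster mode, the current node should run the heartbeat health check for this Service. The second checks whether health checking is enabled (it is by default). Note that ClientBeatProcessor handles the heartbeats sent by service instances; both service instances and services need heartbeat health checks.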
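
After the two guard checks, run() applies two time thresholds to each instance: a heartbeat timeout that flips a healthy instance to unhealthy, and a longer delete timeout that removes the instance (only when instance expiry is enabled). The sketch below condenses that per-instance decision into one pure function so the thresholds are easy to test. InstanceState, Action and decide() are hypothetical names, and the 15 s / 30 s values in the usage example are illustrative; they match commonly used Nacos 1.x defaults, but your deployment may differ.

```java
/**
 * Sketch of the per-instance decision made in ClientBeatCheckTask.run().
 * InstanceState and the timeout arguments are hypothetical stand-ins, not Nacos types.
 */
class BeatCheckSketch {

    enum Action { KEEP, MARK_UNHEALTHY, DELETE }

    static final class InstanceState {
        final long lastBeat;   // timestamp of the last heartbeat, in ms
        final boolean healthy; // current health flag
        final boolean marked;  // "marked" instances are skipped by both loops in run()

        InstanceState(long lastBeat, boolean healthy, boolean marked) {
            this.lastBeat = lastBeat;
            this.healthy = healthy;
            this.marked = marked;
        }
    }

    /**
     * Returns the strongest action the two loops in run() would take for one instance.
     * In the original, an expired healthy instance is first marked unhealthy by the
     * first loop and then deleted by the second; this sketch reports only DELETE.
     */
    static Action decide(InstanceState inst, long now, long heartBeatTimeoutMs,
                         long ipDeleteTimeoutMs, boolean expireInstance) {
        if (inst.marked) {
            return Action.KEEP;
        }
        long silence = now - inst.lastBeat;
        if (expireInstance && silence > ipDeleteTimeoutMs) {
            return Action.DELETE;
        }
        if (inst.healthy && silence > heartBeatTimeoutMs) {
            return Action.MARK_UNHEALTHY;
        }
        return Action.KEEP;
    }
}
```

With now = 100 000 ms and the 15 s / 30 s thresholds, a healthy instance that has been silent for 20 s is marked unhealthy, one silent for 40 s is deleted, and the same 40 s instance is only marked unhealthy when expiry is disabled.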

In cluster mode, the first check, DistroMapper.responsible(), is what guarantees that only one node runs the heartbeat health check for a given Service. As responsible() shows, exactly one cluster node is responsible for each Service; the other nodes skip the check for it.

//Check and update statues of ephemeral instances, remove them if they have been expired.
public class ClientBeatCheckTask implements Runnable {
    private Service service;//each ClientBeatCheckTask corresponds to exactly one Service
    ...
    
    @JsonIgnore
    public DistroMapper getDistroMapper() {
        return ApplicationUtils.getBean(DistroMapper.class);
    }
    
    @Override
    public void run() {
        try {
            //First if check: DistroMapper.responsible()
            //decides whether, in cluster mode, the current node must run the heartbeat health check for this Service
            if (!getDistroMapper().responsible(service.getName())) {
                return;
            }
            //Second if check:
            //whether health checking is enabled (it is by default)
            if (!getSwitchDomain().isHealthCheckEnabled()) {
                return;
            }
            List<Instance> instances = service.allIPs(true);
        
            //first set health status of instances:
            for (Instance instance : instances) {
                if (System.currentTimeMillis() - instance.getLastBeat() > instance.getInstanceHeartBeatTimeOut()) {
                    if (!instance.isMarked()) {
                        if (instance.isHealthy()) {
                            instance.setHealthy(false);
                            getPushService().serviceChanged(service);
                            ApplicationUtils.publishEvent(new InstanceHeartbeatTimeoutEvent(this, instance));
                        }
                    }
                }
            }
        
            if (!getGlobalConfig().isExpireInstance()) {
                return;
            }
        
            //then remove obsolete instances:
            for (Instance instance : instances) {
                if (instance.isMarked()) {
                    continue;
                }
                if (System.currentTimeMillis() - instance.getLastBeat() > instance.getIpDeleteTimeout()) {
                    //delete instance
                    deleteIp(instance);
                }
            }
        } catch (Exception e) {
            Loggers.SRV_LOG.warn("Exception while processing client beat time out.", e);
        }
    }
    ...
}

//Distro mapper, judge which server response input service.
@Component("distroMapper")
public class DistroMapper extends MemberChangeListener {
    //List of service nodes, you must ensure that the order of healthyList is the same for all nodes.
    private volatile List<String> healthyList = new ArrayList<>();
    
    //init server list.
    @PostConstruct
    public void init() {
        NotifyCenter.registerSubscriber(this);//register as a subscriber
        this.healthyList = MemberUtil.simpleMembers(memberManager.allMembers());
    }
    ...
    //Judge whether current server is responsible for input service.
    public boolean responsible(String serviceName) {
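        //the healthy cluster nodes; this walkthrough assumes three
        final List<String> servers = healthyList;
        //if Distro is disabled or the server runs in standalone mode, return true directly
        if (!switchDomain.isDistroEnabled() || EnvUtil.getStandaloneMode()) {
            return true;
        }
        //if there are no healthy cluster nodes, return false directly
        if (CollectionUtils.isEmpty(servers)) {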
            //means distro config is not ready yet
            return false;
        }
        int index = servers.indexOf(EnvUtil.getLocalAddress());
        int lastIndex = servers.lastIndexOf(EnvUtil.getLocalAddress());
        if (lastIndex < 0 || index < 0) {
            return true;
        }
        //hash serviceName, then take it modulo servers.size() to get the index of the node responsible for the heartbeat health check
        int target = distroHash(serviceName) % servers.size();
        return target >= index && target <= lastIndex;
    }
    
    private int distroHash(String serviceName) {
        return Math.abs(serviceName.hashCode() % Integer.MAX_VALUE);
    }
    ...
}
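
The core of responsible() is a deterministic hash-modulo rule: as long as every node holds the same healthy list in the same order, all nodes compute the same target index for a service name, so exactly one node claims it. The following standalone sketch reproduces that rule without the Spring and switch dependencies. DistroHashSketch and responsibleServer() are hypothetical helpers, and the addresses in the test are made up.

```java
import java.util.List;

/**
 * Standalone sketch of DistroMapper's hash-modulo responsibility rule.
 * The Distro/standalone switches of the original are omitted.
 */
class DistroHashSketch {

    // Same expression as DistroMapper.distroHash(). Applying "% Integer.MAX_VALUE" before
    // Math.abs() avoids Math.abs(Integer.MIN_VALUE), which would stay negative.
    static int distroHash(String serviceName) {
        return Math.abs(serviceName.hashCode() % Integer.MAX_VALUE);
    }

    // The node all members agree on, provided they hold healthyList in the same order.
    static String responsibleServer(String serviceName, List<String> healthyList) {
        return healthyList.get(distroHash(serviceName) % healthyList.size());
    }

    // The check each node runs locally, as in DistroMapper.responsible().
    static boolean responsible(String serviceName, List<String> healthyList, String localAddress) {
        if (healthyList.isEmpty()) {
            return false; // distro config not ready yet
        }
        int index = healthyList.indexOf(localAddress);
        int lastIndex = healthyList.lastIndexOf(localAddress);
        if (index < 0 || lastIndex < 0) {
            return true; // local node not in the list: the original also returns true
        }
        int target = distroHash(serviceName) % healthyList.size();
        return target >= index && target <= lastIndex;
    }
}
```

A consequence of this design is that responsibility for a service can move whenever the healthy list changes, which is presumably why DistroMapper extends MemberChangeListener and registers itself with NotifyCenter in init().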

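(3) Source code: syncing service health status between cluster nodes

I. How health status is synced between cluster nodes

II. Source code for syncing health status between cluster nodes

III. The first async task: ServiceReporter

IV. The second async task: UpdatedServiceProcessor

Since only one node in the cluster runs the heartbeat health check for a given Service, how are the results of that check synced to the other nodes?

I. How health status is synced between cluster nodes

Every node runs a scheduled task that syncs its heartbeat health-check results to the other nodes. The task calls an HTTP endpoint on each of the other cluster nodes to carry out the synchronization.

II. Source code for syncing health status between cluster nodes

ServiceManager has an init() method annotated with @PostConstruct, so it is called when the ServiceManager bean is created. This method starts the scheduled task that syncs the heartbeat health-check results.

Two of the async tasks it starts are related to syncing instance health status: the first sends the requests that sync heartbeat health-check results, and the second processes those incoming sync requests. The processing follows the pattern "in-memory queue for peak shaving + async task for throughput".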

//Core manager storing all services in Nacos.
@Component
public class ServiceManager implements RecordListener<Service> {
    ...
    //Init service maneger.
    @PostConstruct
    public void init() {
        //用来发起 同步心跳健康检查结果请求 的异步任务
        GlobalExecutor.scheduleServiceReporter(new ServiceReporter(), 60000, TimeUnit.MILLISECONDS);
        //用来处理 同步心跳健康检查结果请求 的异步任务:内存队列削峰 + 异步任务提速
        GlobalExecutor.submitServiceUpdateManager(new UpdatedServiceProcessor());
    
        if (emptyServiceAutoClean) {
            Loggers.SRV_LOG.info("open empty service auto clean job, initialDelay : {} ms, period : {} ms", cleanEmptyServiceDelay, cleanEmptyServicePeriod);
        
            //delay 60s, period 20s;
            //This task is not recommended to be performed frequently in order to avoid
            //the possibility that the service cache information may just be deleted
            //and then created due to the heartbeat mechanism
            GlobalExecutor.scheduleServiceAutoClean(new EmptyServiceAutoClean(), cleanEmptyServiceDelay, cleanEmptyServicePeriod);
        }
        try {
            Loggers.SRV_LOG.info("listen for service meta change");
            consistencyService.listen(KeyBuilder.SERVICE_META_KEY_PREFIX, this);
        } catch (NacosException e) {
            Loggers.SRV_LOG.error("listen for service meta change failed!");
        }
    }
    ...
}

public class GlobalExecutor {
    private static final ScheduledExecutorService SERVICE_SYNCHRONIZATION_EXECUTOR = 
        ExecutorFactory.Managed.newSingleScheduledExecutorService(
            ClassUtils.getCanonicalName(NamingApp.class),
            new NameThreadFactory("com.alibaba.nacos.naming.service.worker")
        );
    
    public static final ScheduledExecutorService SERVICE_UPDATE_MANAGER_EXECUTOR = 
        ExecutorFactory.Managed.newSingleScheduledExecutorService(
            ClassUtils.getCanonicalName(NamingApp.class),
            new NameThreadFactory("com.alibaba.nacos.naming.service.update.processor")
        );
    ...
    public static void scheduleServiceReporter(Runnable command, long delay, TimeUnit unit) {
        //run the task once after the given delay
        SERVICE_SYNCHRONIZATION_EXECUTOR.schedule(command, delay, unit);
    }
    
    public static void submitServiceUpdateManager(Runnable runnable) {
        //submit the task to the thread pool for execution
        SERVICE_UPDATE_MANAGER_EXECUTOR.submit(runnable);
    }
    ...
}

public final class ExecutorFactory {
    ...
    public static final class Managed {
        private static final String DEFAULT_NAMESPACE = "nacos";
        private static final ThreadPoolManager THREAD_POOL_MANAGER = ThreadPoolManager.getInstance();
        ...
        //Create a new single scheduled executor service with input thread factory and register to manager.
        public static ScheduledExecutorService newSingleScheduledExecutorService(final String group, final ThreadFactory threadFactory) {
            ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1, threadFactory);
            THREAD_POOL_MANAGER.register(DEFAULT_NAMESPACE, group, executorService);
            return executorService;
        }
        ...
    }
}

//Thread pool manager.
public final class ThreadPoolManager {
    private Map<String, Map<String, Set<ExecutorService>>> resourcesManager;
    private Map<String, Object> lockers = new ConcurrentHashMap<String, Object>(8);
    private static final ThreadPoolManager INSTANCE = new ThreadPoolManager();
    private static final AtomicBoolean CLOSED = new AtomicBoolean(false);
    
    static {
        INSTANCE.init();
        //register a JVM shutdown hook that releases thread resources
        ThreadUtils.addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                LOGGER.warn("[ThreadPoolManager] Start destroying ThreadPool");
                //shut down the thread pool manager
                shutdown();
                LOGGER.warn("[ThreadPoolManager] Destruction of the end");
            }
        }));
    }
    
    public static ThreadPoolManager getInstance() {
        return INSTANCE;
    }
    
    private ThreadPoolManager() {
    }
    
    private void init() {
        resourcesManager = new ConcurrentHashMap<String, Map<String, Set<ExecutorService>>>(8);
    }
    
    //Register the thread pool resources with the resource manager.
    public void register(String namespace, String group, ExecutorService executor) {
        if (!resourcesManager.containsKey(namespace)) {
            synchronized (this) {
                lockers.put(namespace, new Object());
            }
        }
        final Object monitor = lockers.get(namespace);
        synchronized (monitor) {
            Map<String, Set<ExecutorService>> map = resourcesManager.get(namespace);
            if (map == null) {
                map = new HashMap<String, Set<ExecutorService>>(8);
                map.put(group, new HashSet<ExecutorService>());
                map.get(group).add(executor);
                resourcesManager.put(namespace, map);
                return;
            }
            if (!map.containsKey(group)) {
                map.put(group, new HashSet<ExecutorService>());
            }
            map.get(group).add(executor);
        }
    }
    
    //Shutdown thread pool manager.
    public static void shutdown() {
        if (!CLOSED.compareAndSet(false, true)) {
            return;
        }
        Set<String> namespaces = INSTANCE.resourcesManager.keySet();
        for (String namespace : namespaces) {
            //destroy all thread pool resources in this namespace
            INSTANCE.destroy(namespace);
        }
    }
    
    //Destroys all thread pool resources under this namespace.
    public void destroy(final String namespace) {
        final Object monitor = lockers.get(namespace);
        if (monitor == null) {
            return;
        }
        synchronized (monitor) {
            Map<String, Set<ExecutorService>> subResource = resourcesManager.get(namespace);
            if (subResource == null) {
                return;
            }
            for (Map.Entry<String, Set<ExecutorService>> entry : subResource.entrySet()) {
                for (ExecutorService executor : entry.getValue()) {
                    //shut down the thread pool
                    ThreadUtils.shutdownThreadPool(executor);
                }
            }
            resourcesManager.get(namespace).clear();
            resourcesManager.remove(namespace);
        }
    }
    ...
}

public final class ThreadUtils {
    ...
    public static void addShutdownHook(Runnable runnable) {
        Runtime.getRuntime().addShutdownHook(new Thread(runnable));
    }
    
    public static void shutdownThreadPool(ExecutorService executor) {
        shutdownThreadPool(executor, null);
    }
    
    //Shutdown thread pool.
    public static void shutdownThreadPool(ExecutorService executor, Logger logger) {
        executor.shutdown();
        int retry = 3;
        while (retry > 0) {
            retry--;
            try {
                if (executor.awaitTermination(1, TimeUnit.SECONDS)) {
                    return;
                }
            } catch (InterruptedException e) {
                executor.shutdownNow();
                Thread.interrupted();
            } catch (Throwable ex) {
                if (logger != null) {
                    logger.error("ThreadPoolManager shutdown executor has error : {}", ex);
                }
            }
        }
        executor.shutdownNow();
    }
    ...
}

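III. The first async task: ServiceReporter

ServiceReporter first gets all service names from the in-memory registry. ServiceManager.getAllServiceNames() returns a Map whose key is the namespace ID and whose value is the set of service names in that namespace. It then iterates over allServiceNames, processing each namespace with two for loops. When a run finishes, the task schedules itself again as a delayed task, so the sync repeats periodically.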
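
ServiceReporter does not use scheduleAtFixedRate or scheduleWithFixedDelay; it schedules itself again from a finally block. A minimal sketch of that pattern follows. Reporter, runRounds() and the simulated failure in round 2 are hypothetical, and the round limit exists only so the sketch terminates, whereas ServiceReporter reschedules unconditionally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch of the "reschedule yourself in finally" pattern used by ServiceReporter.
 * Reporter, runRounds() and the simulated failure are illustrative only.
 */
class SelfReschedulingSketch {

    static final class Reporter implements Runnable {
        private final ScheduledExecutorService executor;
        private final long periodMs;
        private final CountDownLatch remaining;
        final AtomicInteger runs = new AtomicInteger();

        Reporter(ScheduledExecutorService executor, long periodMs, CountDownLatch remaining) {
            this.executor = executor;
            this.periodMs = periodMs;
            this.remaining = remaining;
        }

        @Override
        public void run() {
            try {
                if (runs.incrementAndGet() == 2) {
                    // simulate one failed round
                    throw new IllegalStateException("simulated failure");
                }
                // ... build checksums and send them to the other nodes ...
            } catch (Exception e) {
                // ServiceReporter logs the error; the cycle must not stop here
            } finally {
                remaining.countDown();
                if (remaining.getCount() > 0) {
                    // like GlobalExecutor.scheduleServiceReporter(this, period, ...)
                    executor.schedule(this, periodMs, TimeUnit.MILLISECONDS);
                }
            }
        }
    }

    /** Runs the reporter until it has completed `rounds` rounds; returns the run count, or -1 on timeout. */
    static int runRounds(int rounds, long periodMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(rounds);
        Reporter reporter = new Reporter(executor, periodMs, remaining);
        executor.schedule(reporter, periodMs, TimeUnit.MILLISECONDS);
        try {
            return remaining.await(5, TimeUnit.SECONDS) ? reporter.runs.get() : -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Two properties follow from this design: rounds never overlap, because the next one is only scheduled when the current one ends, and the delay is read again on every reschedule, so a value such as switchDomain.getServiceStatusSynchronizationPeriodMillis() can change at runtime without restarting the task.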

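First for loop: iterate over all service names in the namespace and build the request parameters.

Using the same hash algorithm as before, it first checks whether the current node is responsible for the Service, i.e. whether it should sync that Service's health results. If so, it recalculates the Service's checksum and adds the serviceName/checksum pair to a ServiceChecksum object, which is serialized to JSON with JacksonUtils and placed into a Message request object.

Second for loop: iterate over the cluster nodes and send a request to each other node to sync the data.

It first checks whether the member is the current node and skips it if so. Otherwise it calls ServiceStatusSynchronizer.send(), which sends a request to the other node's endpoint to sync the heartbeat health-check results. The core of the inter-node synchronization is this send() method.

The code of ServiceStatusSynchronizer.send() shows that the data is ultimately synced over HTTP, to the path "v1/ns/service/status". The handler for this path is ServiceController.serviceStatus().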
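
The URL logic in send() has one subtle point: if the member address has no port, the local node's own port (EnvUtil.getPort()) is used, which implicitly assumes the peers listen on the same port. The sketch below isolates that logic. The NAMING_CONTEXT value "/v1/ns", the "/nacos" context path and port 8848 in the test are assumptions (typical Nacos 1.x defaults), and containsPort() is a simplified IPv4-only stand-in for IPUtil.containsPort().

```java
/**
 * Sketch of how ServiceStatusSynchronizer.send() builds the target URL.
 * NAMING_CONTEXT's value and containsPort() are assumptions for illustration.
 */
class StatusUrlSketch {

    // Assumed to match UtilsAndCommons.NACOS_NAMING_CONTEXT in Nacos 1.x
    static final String NAMING_CONTEXT = "/v1/ns";

    // Simplified stand-in for IPUtil.containsPort(): handles IPv4 "host:port" only
    static boolean containsPort(String address) {
        return address.indexOf(':') > 0;
    }

    // If the member address carries no port, the local node's own port is used for the peer
    static String statusUrl(String serverIP, int localPort, String contextPath) {
        String path = contextPath + NAMING_CONTEXT + "/service/status";
        if (containsPort(serverIP)) {
            return "http://" + serverIP + path;
        }
        return "http://" + serverIP + ":" + localPort + path;
    }
}
```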

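In ServiceController.serviceStatus(), the receiving node compares each incoming checksum with the checksum of the same Service in its local registry. If they differ, the service status has changed, and it calls ServiceManager.addUpdatedServiceToQueue().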
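
The comparison works because both sides compute the checksum the same way from their own copy of a service: equal checksums mean equal state, and any mismatch is queued for a full update. The sketch below illustrates the idea with a hypothetical checksum (MD5 over the sorted "ip:port_healthy" entries). Nacos's own Service.recalculateChecksum() builds a different, richer string, so the actual values differ; ChecksumSketch and outdatedServices() are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Sketch of checksum-based change detection between two nodes (hypothetical checksum format). */
class ChecksumSketch {

    // Hypothetical checksum: MD5 over the sorted "ip:port_healthy" entries of one service
    static String checksum(Map<String, Boolean> instanceHealth) {
        List<String> entries = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : instanceHealth.entrySet()) {
            entries.add(e.getKey() + "_" + e.getValue());
        }
        Collections.sort(entries); // makes the checksum independent of map iteration order
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(String.join(",", entries).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex); // MD5 is required on every Java platform
        }
    }

    // Receiver side: services whose remote checksum differs from the locally computed one
    static List<String> outdatedServices(Map<String, String> remoteChecksums,
                                         Map<String, Map<String, Boolean>> localRegistry) {
        List<String> outdated = new ArrayList<>();
        for (Map.Entry<String, String> e : remoteChecksums.entrySet()) {
            Map<String, Boolean> local = localRegistry.get(e.getKey());
            if (local == null) {
                continue; // unknown service: skipped, like the null check in serviceStatus()
            }
            if (!e.getValue().equals(checksum(local))) {
                outdated.add(e.getKey());
            }
        }
        return outdated;
    }
}
```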

addUpdatedServiceToQueue() wraps the arguments in a ServiceKey object and puts it into the toBeUpdatedServicesQueue blocking queue.
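
BlockingQueue.offer(e, timeout, unit) returns false, rather than throwing, when the timeout elapses. In the excerpt below, the poll()/add() fallback sits in the catch block, so it only runs when offer() throws (for example on interruption). If the intent is "drop the oldest pending update when the queue is full", it can be expressed by checking the return value, as in this hypothetical variant; DropOldestQueueSketch is not a Nacos class.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of a bounded "to be updated" queue with a drop-oldest fallback.
 * An illustrative variant, not the Nacos implementation.
 */
class DropOldestQueueSketch<T> {
    private final LinkedBlockingDeque<T> queue;
    private final ReentrantLock lock = new ReentrantLock(); // serializes producers, like ServiceManager.lock

    DropOldestQueueSketch(int capacity) {
        this.queue = new LinkedBlockingDeque<>(capacity);
    }

    void add(T item) {
        lock.lock();
        try {
            // the timed offer() returns false (it does not throw) when the queue stays full
            if (!queue.offer(item, 5, TimeUnit.MILLISECONDS)) {
                queue.poll();      // evict the oldest pending entry
                queue.offer(item); // succeeds: other producers wait on the lock, consumers only remove
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // the item is dropped if interrupted while waiting
        } finally {
            lock.unlock();
        }
    }

    T poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

With capacity 2, adding 1, 2 and 3 leaves 2 and 3 in the queue: the oldest entry is evicted to make room for the newest.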

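Since ServiceKey objects end up in a blocking queue, some async task must take them from the queue and process them. This works the same way as service instance registration, where Pair objects are put into a blocking queue; here the consumer is the second async task started in ServiceManager.init().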

//Core manager storing all services in Nacos.
@Component
public class ServiceManager implements RecordListener<Service> {
    //Map(namespace, Map(group::serviceName, Service)).
    private final Map<String, Map<String, Service>> serviceMap = new ConcurrentHashMap<>();
    private final DistroMapper distroMapper;
    private final Synchronizer synchronizer = new ServiceStatusSynchronizer();
    ...
    public Map<String, Set<String>> getAllServiceNames() {
        Map<String, Set<String>> namesMap = new HashMap<>(16);
        for (String namespaceId : serviceMap.keySet()) {
            namesMap.put(namespaceId, serviceMap.get(namespaceId).keySet());
        }
        return namesMap;
    }
    
    private class ServiceReporter implements Runnable {
        @Override
        public void run() {
            try {
                //get all service names in the in-memory registry, grouped by namespace
                Map<String, Set<String>> allServiceNames = getAllServiceNames();
                if (allServiceNames.size() <= 0) {
                    //ignore
                    return;
                }
                //iterate over allServiceNames, i.e. over each namespace:
                //build the request parameters, then send requests to sync the heartbeat health-check results
                for (String namespaceId : allServiceNames.keySet()) {
                    ServiceChecksum checksum = new ServiceChecksum(namespaceId);
                    //first loop: build the request parameters
                    for (String serviceName : allServiceNames.get(namespaceId)) {
                        //same algorithm as before: this node only syncs results for the Services it is responsible for
                        if (!distroMapper.responsible(serviceName)) {
                            continue;
                        }
                        Service service = getService(namespaceId, serviceName);
                        if (service == null || service.isEmpty()) {
                            continue;
                        }
                        service.recalculateChecksum();
                        //add the serviceName/checksum pair to the request parameters
                        checksum.addItem(serviceName, service.getChecksum());
                    }
                    //create the Message request object for the sync
                    Message msg = new Message();
                    //serialize the checksums to JSON
                    msg.setData(JacksonUtils.toJson(checksum));
                    Collection<Member> sameSiteServers = memberManager.allMembers();
                    if (sameSiteServers == null || sameSiteServers.size() <= 0) {
                        return;
                    }
                   
                    //second loop: iterate over all cluster nodes and send the sync request to every other node
                    for (Member server : sameSiteServers) {
                        //skip the current node
                        if (server.getAddress().equals(NetUtils.localServer())) {
                            continue;
                        }
                        //sync to the other cluster node
                        synchronizer.send(server.getAddress(), msg);
                    }
                }
            } catch (Exception e) {
                Loggers.SRV_LOG.error("[DOMAIN-STATUS] Exception while sending service status", e);
            } finally {
                //schedule the next run as a delayed task
                GlobalExecutor.scheduleServiceReporter(this, switchDomain.getServiceStatusSynchronizationPeriodMillis(), TimeUnit.MILLISECONDS);
            }
        }
    }
    ...
}

public class ServiceStatusSynchronizer implements Synchronizer {
    @Override
    public void send(final String serverIP, Message msg) {
        if (serverIP == null) {
            return;
        }
        //build the request parameters
        Map<String, String> params = new HashMap<String, String>(10);
        params.put("statuses", msg.getData());
        params.put("clientIP", NetUtils.localServer());
        //build the URL
        String url = "http://" + serverIP + ":" + EnvUtil.getPort() + EnvUtil.getContextPath() + UtilsAndCommons.NACOS_NAMING_CONTEXT + "/service/status";
        if (IPUtil.containsPort(serverIP)) {
            url = "http://" + serverIP + EnvUtil.getContextPath() + UtilsAndCommons.NACOS_NAMING_CONTEXT + "/service/status";
        }
       
        try {
            //send the HTTP request asynchronously to .../v1/ns/service/status on the target node to sync the heartbeat health-check results
            HttpClient.asyncHttpPostLarge(url, null, JacksonUtils.toJson(params), new Callback<String>() {
                @Override
                public void onReceive(RestResult<String> result) {
                    if (!result.ok()) {
                        Loggers.SRV_LOG.warn("[STATUS-SYNCHRONIZE] failed to request serviceStatus, remote server: {}", serverIP);
                    }
                }
                
                @Override
                public void onError(Throwable throwable) {
                    Loggers.SRV_LOG.warn("[STATUS-SYNCHRONIZE] failed to request serviceStatus, remote server: " + serverIP, throwable);
                }
                
                @Override
                public void onCancel() {
                }
            });
        } catch (Exception e) {
            Loggers.SRV_LOG.warn("[STATUS-SYNCHRONIZE] failed to request serviceStatus, remote server: " + serverIP, e);
        }
    }
    ...
}

//Service operation controller.
@RestController
@RequestMapping(UtilsAndCommons.NACOS_NAMING_CONTEXT + "/service")
public class ServiceController {
    @Autowired
    protected ServiceManager serviceManager;
    ...
    //Check service status whether latest.
    @PostMapping("/status")
    public String serviceStatus(HttpServletRequest request) throws Exception {
        String entity = IoUtils.toString(request.getInputStream(), "UTF-8");
        String value = URLDecoder.decode(entity, "UTF-8");
        JsonNode json = JacksonUtils.toObj(value);
        String statuses = json.get("statuses").asText();
        String serverIp = json.get("clientIP").asText();
        if (!memberManager.hasMember(serverIp)) {
            throw new NacosException(NacosException.INVALID_PARAM, "ip: " + serverIp + " is not in serverlist");
        }
    
        try {
            ServiceManager.ServiceChecksum checksums = JacksonUtils.toObj(statuses, ServiceManager.ServiceChecksum.class);
            if (checksums == null) {
                Loggers.SRV_LOG.warn("[DOMAIN-STATUS] receive malformed data: null");
                return "fail";
            }
        
            for (Map.Entry<String, String> entry : checksums.serviceName2Checksum.entrySet()) {
                if (entry == null || StringUtils.isEmpty(entry.getKey()) || StringUtils.isEmpty(entry.getValue())) {
                    continue;
                }
                String serviceName = entry.getKey();
                String checksum = entry.getValue();
                Service service = serviceManager.getService(checksums.namespaceId, serviceName);
                if (service == null) {
                    continue;
                }
                service.recalculateChecksum();
                //compare the incoming checksum with the local one; a mismatch means the service status has changed
                if (!checksum.equals(service.getChecksum())) {
                    if (Loggers.SRV_LOG.isDebugEnabled()) {
                        Loggers.SRV_LOG.debug("checksum of {} is not consistent, remote: {}, checksum: {}, local: {}", serviceName, serverIp, checksum, service.getChecksum());
                    }
                    //add to the blocking queue
                    serviceManager.addUpdatedServiceToQueue(checksums.namespaceId, serviceName, serverIp, checksum);
                }
            }
        } catch (Exception e) {
            Loggers.SRV_LOG.warn("[DOMAIN-STATUS] receive malformed data: " + statuses, e);
        }
        return "ok";
    }
    ...
}

//Core manager storing all services in Nacos.
@Component
public class ServiceManager implements RecordListener<Service> {
    private final Lock lock = new ReentrantLock();
    //blocking queue
    private final LinkedBlockingDeque<ServiceKey> toBeUpdatedServicesQueue = new LinkedBlockingDeque<>(1024 * 1024);
    ...
    //Add a service into queue to update.
    public void addUpdatedServiceToQueue(String namespaceId, String serviceName, String serverIP, String checksum) {
        lock.lock();
        try {
            //wrap the arguments into a ServiceKey and put it into the toBeUpdatedServicesQueue blocking queue
            toBeUpdatedServicesQueue.offer(new ServiceKey(namespaceId, serviceName, serverIP, checksum), 5, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            toBeUpdatedServicesQueue.poll();
            toBeUpdatedServicesQueue.add(new ServiceKey(namespaceId, serviceName, serverIP, checksum));
            Loggers.SRV_LOG.error("[DOMAIN-STATUS] Failed to add service to be updated to queue.", e);
        } finally {
            lock.unlock();
        }
    }
    ...
}

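IV. The second async task: UpdatedServiceProcessor

The run() method of UpdatedServiceProcessor contains an infinite while loop that keeps taking tasks from the toBeUpdatedServicesQueue blocking queue. For each ServiceKey it takes, it wraps the key in a ServiceUpdater and submits that as a task to a thread pool.

Syncing heartbeat health-check results works much like handling service instance registration: both use the "blocking queue + async task" design. Putting tasks into the blocking queue smooths out traffic peaks, and taking them out and handing them to a thread pool speeds up processing.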
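
The "take from a queue, hand off to a pool" loop can be sketched in isolation as follows. dispatch(), the worker count and the "updated:" marker are hypothetical; the dispatcher thread plays the role of UpdatedServiceProcessor and each submitted lambda plays the role of ServiceUpdater. Unlike the original while(true) loop, this sketch exits when its thread is interrupted so it can be shut down cleanly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the UpdatedServiceProcessor pattern: one dispatcher thread blocks on take()
 * and hands each key to a worker pool. All names are hypothetical.
 */
class QueueDispatcherSketch {

    static List<String> dispatch(BlockingQueue<String> queue, int expected, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch done = new CountDownLatch(expected);

        Thread dispatcher = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String key = queue.take(); // blocks while the queue is empty
                    pool.submit(() -> {        // stands in for ServiceUpdater
                        processed.add("updated:" + key);
                        done.countDown();
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // leave the loop
                }
            }
        }, "update-dispatcher");
        dispatcher.setDaemon(true);
        dispatcher.start();

        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            dispatcher.interrupt();
            pool.shutdown();
        }
        return new ArrayList<>(processed);
    }
}
```

The split keeps the single dispatcher cheap (it only blocks on take() and submits), while the slower per-service work, which in Nacos includes fetching the instance list back from the reporting node via synchronizer.get(), runs in parallel on the pool.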

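When the thread pool executes a health status sync task, i.e. ServiceUpdater.run(), it calls ServiceManager.updatedHealthStatus() to update the health status of the service's instances.

ServiceManager.updatedHealthStatus() first fetches the service's instance statuses from the node that reported the checksum (via synchronizer.get()) and parses them. It then gets all Instance objects of the service from the local registry and iterates over them. For every instance whose health status differs, it sets the healthy property directly; if any instance changed, it publishes a service-change event to notify clients to update.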
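
The parse-and-diff step of updatedHealthStatus() can be isolated as two small functions. This is a sketch with hypothetical names (HealthSyncSketch, parse(), apply()) operating on plain maps instead of Instance objects. It keeps one detail of the original: Boolean.parseBoolean(null) is false, so a local instance missing from the remote list ends up marked unhealthy.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the parse-and-diff logic in updatedHealthStatus(); names are hypothetical. */
class HealthSyncSketch {

    // Parses entries such as "10.0.0.1:8080_true" into "ip:port" -> healthy,
    // the same split("_") that updatedHealthStatus() applies to the "ips" array
    static Map<String, Boolean> parse(List<String> ips) {
        Map<String, Boolean> result = new HashMap<>(ips.size());
        for (String ip : ips) {
            String[] parts = ip.split("_");
            result.put(parts[0], Boolean.parseBoolean(parts[1]));
        }
        return result;
    }

    // Applies the remote view to the local "ip:port" -> healthy map and reports whether
    // anything changed (the point where Nacos calls pushService.serviceChanged())
    static boolean apply(Map<String, Boolean> remote, Map<String, Boolean> local) {
        boolean changed = false;
        for (Map.Entry<String, Boolean> entry : local.entrySet()) {
            // an absent remote entry counts as unhealthy, like Boolean.parseBoolean(null)
            boolean valid = Boolean.TRUE.equals(remote.get(entry.getKey()));
            if (valid != entry.getValue()) {
                entry.setValue(valid);
                changed = true;
            }
        }
        return changed;
    }
}
```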

//Core manager storing all services in Nacos.
@Component
public class ServiceManager implements RecordListener<Service> {
    //blocking queue
    private final LinkedBlockingDeque<ServiceKey> toBeUpdatedServicesQueue = new LinkedBlockingDeque<>(1024 * 1024);
    ...
    private class UpdatedServiceProcessor implements Runnable {
        //get changed service from other server asynchronously
        @Override
        public void run() {
            ServiceKey serviceKey = null;
            try {
                //infinite loop
                while (true) {
                    try {
                        //take a task from the blocking queue
                        serviceKey = toBeUpdatedServicesQueue.take();
                    } catch (Exception e) {
                        Loggers.EVT_LOG.error("[UPDATE-DOMAIN] Exception while taking item from LinkedBlockingDeque.");
                    }
                    if (serviceKey == null) {
                        continue;
                    }
                    GlobalExecutor.submitServiceUpdate(new ServiceUpdater(serviceKey));
                }
            } catch (Exception e) {
                Loggers.EVT_LOG.error("[UPDATE-DOMAIN] Exception while update service: {}", serviceKey, e);
            }
        }
    }

    private class ServiceUpdater implements Runnable {
        String namespaceId;
        String serviceName;
        String serverIP;
        
        public ServiceUpdater(ServiceKey serviceKey) {
            this.namespaceId = serviceKey.getNamespaceId();
            this.serviceName = serviceKey.getServiceName();
            this.serverIP = serviceKey.getServerIP();
        }
        
        @Override
        public void run() {
            try {
                //update the health status of the service's instances
                updatedHealthStatus(namespaceId, serviceName, serverIP);
            } catch (Exception e) {
                Loggers.SRV_LOG.warn("[DOMAIN-UPDATER] Exception while update service: {} from {}, error: {}", serviceName, serverIP, e);
            }
        }
    }
    
    //Update health status of instance in service.
    public void updatedHealthStatus(String namespaceId, String serviceName, String serverIP) {
        Message msg = synchronizer.get(serverIP, UtilsAndCommons.assembleFullServiceName(namespaceId, serviceName));
        //parse the instance statuses returned by the remote node
        JsonNode serviceJson = JacksonUtils.toObj(msg.getData());
    
        ArrayNode ipList = (ArrayNode) serviceJson.get("ips");
        Map<String, String> ipsMap = new HashMap<>(ipList.size());
        for (int i = 0; i < ipList.size(); i++) {
            String ip = ipList.get(i).asText();
            String[] strings = ip.split("_");
            ipsMap.put(strings[0], strings[1]);
        }
    
        Service service = getService(namespaceId, serviceName);
        if (service == null) {
            return;
        }
       
        //whether anything changed
        boolean changed = false;
        //get all instances and iterate over them
        List<Instance> instances = service.allIPs();
        for (Instance instance : instances) {
            //apply the synced health status
            boolean valid = Boolean.parseBoolean(ipsMap.get(instance.toIpAddr()));
            if (valid != instance.isHealthy()) {
                changed = true;
                //update the instance's health status
                instance.setHealthy(valid);
                Loggers.EVT_LOG.info("{} {SYNC} IP-{} : {}:{}@{}", serviceName, (instance.isHealthy() ? "ENABLED" : "DISABLED"), instance.getIp(), instance.getPort(), instance.getClusterName());
            }
        }
        //if any instance's health status changed, publish a "service changed" event and notify clients via UDP
        if (changed) {
            pushService.serviceChanged(service);
            if (Loggers.EVT_LOG.isDebugEnabled()) {
                StringBuilder stringBuilder = new StringBuilder();
                List<Instance> allIps = service.allIPs();
                for (Instance instance : allIps) {
                    stringBuilder.append(instance.toIpAddr()).append("_").append(instance.isHealthy()).append(",");
                }
                Loggers.EVT_LOG.debug("[HEALTH-STATUS-UPDATED] namespace: {}, service: {}, ips: {}", service.getNamespaceId(), service.getName(), stringBuilder.toString());
            }
        }
    }
    ...
}

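(4) Summary

**Question 1:** In standalone mode, the Nacos server starts a scheduled task that performs heartbeat health checks on services. In cluster mode, is it necessary for every node to run this scheduled task?

**Answer:** When the heartbeat health-check task started by Service.init() runs, it first performs a check: it hashes the service name and takes the result modulo the number of healthy cluster nodes, which selects one node to perform the heartbeat health check. So in a cluster, the heartbeat health check for a given Service is not performed by every node; an algorithm selects one node to perform it, and that node then syncs the results to the other cluster nodes.

**Question 2:** When the Nacos server detects, through the scheduled heartbeat health-check task, that a service's health status has changed, how does it sync that status to the other Nacos cluster nodes?

**Answer:** The results of the heartbeat health-check task started by Service.init() are synced by a separate scheduled task started in ServiceManager.init(), which sends them to the other nodes over HTTP. Each time this task finishes, it schedules itself again as a delayed task, so the sync continues periodically.

ServiceManager.init() also starts a second, long-running task that processes incoming sync requests. It uses an in-memory blocking queue + async task design: an infinite while loop keeps taking entries from the blocking queue and handing them off for processing.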

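3. How a newly registered service instance is synced to the other cluster nodes

(1) Architecture for syncing a new service instance to the other cluster nodes

(2) Source code for syncing a new service instance to the other cluster nodes

(3) Summary

(1) Architecture for syncing a new service instance to the other cluster nodes

Nacos uses a two-layer in-memory queue + async task architecture.

Layer 1:

Nacos uses a ConcurrentHashMap as the container for delayed tasks. It wraps the information about the new service instance in a DistroDelayTask and puts it into this map.

DistroTaskEngineHolder has a field of type DistroDelayTaskExecuteEngine. The constructor of that engine's parent class starts an async task that takes DistroDelayTask tasks out of the ConcurrentHashMap.

Layer 2:

Nacos uses a BlockingQueue as the container for sync tasks. It creates a DistroSyncChangeTask from the task parameters and puts it into the BlockingQueue.

Nacos starts an InnerWorker async task that takes DistroSyncChangeTask objects from the BlockingQueue and calls their run() method.

DistroSyncChangeTask.run() finally calls the API of another cluster node over HTTP to complete the data sync.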
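
The two layers can be illustrated with a deliberately simplified, single-threaded sketch: a ConcurrentHashMap keyed by resource and target node collapses repeated changes into one pending task, a scan moves pending tasks into a BlockingQueue, and a consumer drains the queue. All names are hypothetical. In Nacos the scan is driven by a scheduled executor, the consumer is the InnerWorker thread, and addTask() takes a lock and combines a new task with any existing task for the same key rather than simply overwriting it as this sketch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of the two-layer "delay map + blocking queue" pipeline.
 * Threads and timing are removed so each step can be called directly.
 */
class TwoLayerSyncSketch {
    // Layer 1: one pending task per (resource, target); a newer task for the same key replaces the older one
    private final ConcurrentHashMap<String, String> delayedTasks = new ConcurrentHashMap<>();
    // Layer 2: tasks ready to execute
    private final BlockingQueue<String> readyTasks = new LinkedBlockingQueue<>();

    void addTask(String resourceKey, String targetServer, String action) {
        delayedTasks.put(resourceKey + "@" + targetServer, action + " " + resourceKey + " -> " + targetServer);
    }

    // Nacos runs the equivalent (processTasks()) periodically on a scheduled executor
    int moveDueTasks() {
        int moved = 0;
        for (String key : delayedTasks.keySet()) {
            String task = delayedTasks.remove(key);
            if (task != null) {
                readyTasks.add(task);
                moved++;
            }
        }
        return moved;
    }

    // Nacos's InnerWorker blocks on its queue; here the queue is drained synchronously
    List<String> executeReadyTasks() {
        List<String> sent = new ArrayList<>();
        readyTasks.drainTo(sent);
        return sent; // each entry stands in for one HTTP sync call to the target node
    }
}
```

Adding the same resource for the same target twice before a scan yields a single sync, while a different target gets its own entry; this is why the target address is part of the key.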

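(2) Source code for syncing a new service instance to the other cluster nodes

I. Build delayed tasks, store them in a Map, and process them asynchronously

II. Build sync tasks, store them in a Queue, and process them asynchronously

III. The core method that syncs service instance data to the cluster nodes

I. Build delayed tasks, store them in a Map, and process them asynchronously

When handling a service instance registration request, the Nacos server calls DistroConsistencyServiceImpl.onPut() to trigger the in-memory registry update, and only then calls DistroProtocol.sync() to sync the data across the cluster.

The for loop in DistroProtocol.sync() iterates over all cluster nodes except the current one. The cluster node list comes from the cluster.conf file configured when the Nacos cluster is set up, so every Nacos server knows the full set of cluster nodes. The current node is excluded because it does not need to sync data to itself; it only needs to push the data to the other nodes.

For each target node, the loop wraps a DistroDelayTask and calls NacosDelayTaskExecuteEngine.addTask() to add it to the tasks field, which is a ConcurrentHashMap. DistroDelayTask implements NacosTask (through its parent class AbstractDelayTask).

When NacosDelayTaskExecuteEngine is initialized, it starts an async task. That task runs ProcessRunnable.run(), which in turn calls NacosDelayTaskExecuteEngine.processTasks().

processTasks() first gets all keys of the tasks map and iterates over them. For each key it calls NacosDelayTaskExecuteEngine.removeTask(), which removes the delayed task from the tasks map and returns it. It then looks up the task processor for the key, a DistroDelayTaskProcessor, and calls its process() method, which turns the NacosTask returned by removeTask() into a sync task and puts it into the second-layer in-memory queue.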
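
The AbstractDelayTask fields shown below (taskInterval and lastProcessTime) suggest how removeTask() decides whether a task may leave the map: a task is due once the interval has elapsed since its last processing time, and until then it stays in the map for the next scan. The following sketch shows that check-and-remove under a lock, assuming this behavior. DelayTaskSketch is hypothetical and takes the current time as a parameter for testability, where Nacos reads the system clock; the interval passed from DistroConsistencyServiceImpl.put() is globalConfig.getTaskDispatchPeriod() / 2.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of a delay-task map whose removeTask() only returns tasks that are due.
 * The clock is passed in explicitly for testability.
 */
class DelayTaskSketch {

    static final class DelayTask {
        final long lastProcessTime; // ms; DistroDelayTask sets it to the creation time
        final long taskInterval;    // ms; the delay passed to DistroDelayTask

        DelayTask(long lastProcessTime, long taskInterval) {
            this.lastProcessTime = lastProcessTime;
            this.taskInterval = taskInterval;
        }

        boolean shouldProcess(long now) {
            return now - lastProcessTime >= taskInterval;
        }
    }

    private final ConcurrentHashMap<String, DelayTask> tasks = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    void addTask(String key, DelayTask task) {
        lock.lock();
        try {
            tasks.put(key, task);
        } finally {
            lock.unlock();
        }
    }

    // Check-and-remove under the same lock as addTask(), so a concurrent add cannot slip in between
    DelayTask removeTask(String key, long now) {
        lock.lock();
        try {
            DelayTask task = tasks.get(key);
            if (task != null && task.shouldProcess(now)) {
                return tasks.remove(key);
            }
            return null; // not due yet: stays in the map for the next scan
        } finally {
            lock.unlock();
        }
    }

    int pendingCount() {
        return tasks.size();
    }
}
```

A task created at t = 1000 ms with a 500 ms interval is not returned at t = 1200 ms but is returned (and removed) at t = 1500 ms.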

@DependsOn("ProtocolManager")
@org.springframework.stereotype.Service("distroConsistencyService")
public class DistroConsistencyServiceImpl implements EphemeralConsistencyService, DistroDataProcessor {
    private final DistroProtocol distroProtocol;
    ...
    @Override
    public void put(String key, Record value) throws NacosException {
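        //store the latest instance list (which includes the newly registered instance) in the DataStore,
        //and add an async task that writes this latest list into the in-memory registry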
        onPut(key, value);
        //in cluster mode, DistroProtocol.sync() syncs the service instance data to the other cluster nodes
        distroProtocol.sync(new DistroKey(key, KeyBuilder.INSTANCE_LIST_KEY_PREFIX), DataOperation.CHANGE, globalConfig.getTaskDispatchPeriod() / 2);
    }
    ...
}

@Component
public class DistroProtocol {
    private final ServerMemberManager memberManager;
    private final DistroTaskEngineHolder distroTaskEngineHolder;
    ...
    //Start to sync data to all remote server.
    public void sync(DistroKey distroKey, DataOperation action, long delay) {
        //iterate over all cluster nodes except the current one
        for (Member each : memberManager.allMembersWithoutSelf()) {
            //first wrapper: a DistroKey that carries the target node
            DistroKey distroKeyWithTarget = new DistroKey(distroKey.getResourceKey(), distroKey.getResourceType(), each.getAddress());
            //second wrapper: a DistroDelayTask
            DistroDelayTask distroDelayTask = new DistroDelayTask(distroKeyWithTarget, action, delay);
            //actually calls NacosDelayTaskExecuteEngine.addTask() to add the task
            distroTaskEngineHolder.getDelayTaskExecuteEngine().addTask(distroKeyWithTarget, distroDelayTask);
            if (Loggers.DISTRO.isDebugEnabled()) {
                Loggers.DISTRO.debug("[DISTRO-SCHEDULE] {} to {}", distroKey, each.getAddress());
            }
        }
    }
    ...
}

public class DistroKey {
    private String resourceKey;
    private String resourceType;    
    private String targetServer;    
   
    public DistroKey() {
    }
    
    public DistroKey(String resourceKey, String resourceType, String targetServer) {
        this.resourceKey = resourceKey;
        this.resourceType = resourceType;
        this.targetServer = targetServer;
    }
    ...
}

//Distro delay task.
public class DistroDelayTask extends AbstractDelayTask {
    private final DistroKey distroKey;
    private DataOperation action;
    private long createTime;
    
    public DistroDelayTask(DistroKey distroKey, DataOperation action, long delayTime) {
        this.distroKey = distroKey;
        this.action = action;
        this.createTime = System.currentTimeMillis();
        setLastProcessTime(createTime);
        setTaskInterval(delayTime);
    }
    ...
}

//Abstract task which can delay and merge.
public abstract class AbstractDelayTask implements NacosTask {
    //Task time interval between twice processing, unit is millisecond.
    private long taskInterval;
    //The time which was processed at last time, unit is millisecond.
    private long lastProcessTime;
    
    public void setTaskInterval(long interval) {
        this.taskInterval = interval;
    }
    
    public void setLastProcessTime(long lastProcessTime) {
        this.lastProcessTime = lastProcessTime;
    }
    ...
}

//Distro task engine holder.
@Component
public class DistroTaskEngineHolder {
    private final DistroDelayTaskExecuteEngine delayTaskExecuteEngine = new DistroDelayTaskExecuteEngine();
    
    public DistroDelayTaskExecuteEngine getDelayTaskExecuteEngine() {
        return delayTaskExecuteEngine;
    }
    ...
}

public class DistroDelayTaskExecuteEngine extends NacosDelayTaskExecuteEngine {
    public DistroDelayTaskExecuteEngine() {
        super(DistroDelayTaskExecuteEngine.class.getName(), Loggers.DISTRO);
    }
    ...
}

//Nacos delay task execute engine.
public class NacosDelayTaskExecuteEngine extends AbstractNacosTaskExecuteEngine<AbstractDelayTask> {
    private final ScheduledExecutorService processingExecutor;
    protected final ConcurrentHashMap<Object, AbstractDelayTask> tasks;//task pool
    protected final ReentrantLock lock = new ReentrantLock();
    ...
    public NacosDelayTaskExecuteEngine(String name, int initCapacity, Logger logger, long processInterval) {
        super(logger);
        tasks = new ConcurrentHashMap<Object, AbstractDelayTask>(initCapacity);
        processingExecutor = ExecutorFactory.newSingleScheduledExecutorService(new NameThreadFactory(name));
        //Run ProcessRunnable repeatedly with a fixed delay of processInterval ms
        processingExecutor.scheduleWithFixedDelay(new ProcessRunnable(), processInterval, processInterval, TimeUnit.MILLISECONDS);
    }
    
    @Override
    public void addTask(Object key, AbstractDelayTask newTask) {
        lock.lock();
        try {
            AbstractDelayTask existTask = tasks.get(key);
            if (null != existTask) {
                newTask.merge(existTask);
            }
            //Finally put the (possibly merged) task into the ConcurrentHashMap
            tasks.put(key, newTask);
        } finally {
            lock.unlock();
        }
    }
    ...
    private class ProcessRunnable implements Runnable {
        @Override
        public void run() {
            try {
                processTasks();
            } catch (Throwable e) {
                getEngineLog().error(e.toString(), e);
            }
        }
    }
    ...
    //process tasks in execute engine.
    protected void processTasks() {
        //Get the keys of all tasks in the pool and iterate over them
        Collection<Object> keys = getAllTaskKeys();
        for (Object taskKey : keys) {
            //Get the task by key and remove it from the pool (removeTask() returns null if its delay has not elapsed yet)
            AbstractDelayTask task = removeTask(taskKey);
            if (null == task) {
                continue;
            }
            //Get the NacosTaskProcessor for taskKey: here, DistroDelayTaskProcessor
            NacosTaskProcessor processor = getProcessor(taskKey);
            if (null == processor) {
                getEngineLog().error("processor not found for task, so discarded. " + task);
                continue;
            }
            try {
                //ReAdd task if process failed
                //Call DistroDelayTaskProcessor.process() to move the sync task into the second-layer in-memory queue
                if (!processor.process(task)) {
                    //If processing fails, put the task back into the tasks map to retry it
                    retryFailedTask(taskKey, task);
                }
            } catch (Throwable e) {
                getEngineLog().error("Nacos task execute error : " + e.toString(), e);
                retryFailedTask(taskKey, task);
            }
        }
    }
    
    @Override
    public AbstractDelayTask removeTask(Object key) {
        lock.lock();
        try {
            AbstractDelayTask task = tasks.get(key);
            if (null != task && task.shouldProcess()) {
                return tasks.remove(key);
            } else {
                return null;
            }
        } finally {
            lock.unlock();
        }
    }
}
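To make the first layer concrete, it can be reduced to a small delay-and-merge map. The sketch below is a simplified model, not Nacos code: the class names SketchDelayTask and SketchDelayEngine are hypothetical, the current time is passed in explicitly (instead of System.currentTimeMillis()) so the behavior is deterministic, and the merge rule (keep the older lastProcessTime, so a burst of updates cannot keep pushing the sync back) is an assumption of this sketch. What it shows is that repeated additions for the same key collapse into one pending task, which is handed out only once its interval has elapsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-in for AbstractDelayTask/DistroDelayTask.
final class SketchDelayTask {
    final String key;          // e.g. service key + target node, like distroKeyWithTarget
    final long taskInterval;   // delay in ms
    long lastProcessTime;      // ms
    int mergedCount = 1;       // how many additions this task represents

    SketchDelayTask(String key, long taskInterval, long now) {
        this.key = key;
        this.taskInterval = taskInterval;
        this.lastProcessTime = now;
    }

    // Due once the interval has elapsed since lastProcessTime.
    boolean shouldProcess(long now) {
        return now - lastProcessTime >= taskInterval;
    }

    // Assumed merge rule for this sketch: keep the older task's lastProcessTime.
    void merge(SketchDelayTask existing) {
        this.lastProcessTime = existing.lastProcessTime;
        this.mergedCount += existing.mergedCount;
    }
}

// Hypothetical first-layer engine: a task pool keyed by task key, guarded by a lock.
final class SketchDelayEngine {
    private final Map<String, SketchDelayTask> tasks = new ConcurrentHashMap<>();
    private final ReentrantLock lock = new ReentrantLock();

    void addTask(SketchDelayTask newTask) {
        lock.lock();
        try {
            SketchDelayTask existing = tasks.get(newTask.key);
            if (existing != null) {
                newTask.merge(existing);
            }
            tasks.put(newTask.key, newTask);
        } finally {
            lock.unlock();
        }
    }

    // One pass of the processing loop: remove and return only the tasks that are due.
    List<SketchDelayTask> drainDue(long now) {
        List<SketchDelayTask> due = new ArrayList<>();
        for (String key : new ArrayList<>(tasks.keySet())) {
            lock.lock();
            try {
                SketchDelayTask task = tasks.get(key);
                if (task != null && task.shouldProcess(now)) {
                    due.add(tasks.remove(key));
                }
            } finally {
                lock.unlock();
            }
        }
        return due;
    }

    int pendingCount() {
        return tasks.size();
    }
}
```

In this model, a thousand registrations of the same service inside the delay window leave one pending sync task per target node rather than a thousand.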

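II. Building sync tasks, storing them in a queue, and processing them asynchronously

DistroDelayTaskProcessor's process() method moves each NacosTask delay task it receives into the second-layer in-memory queue. It first casts the NacosTask to a DistroDelayTask; if the action is DataOperation.CHANGE, it wraps the task in a DistroSyncChangeTask and adds that to a queue by calling NacosExecuteTaskExecuteEngine's addTask() method. For any other action, process() returns false, and the first-layer engine puts the task back into its map to retry it.

When NacosExecuteTaskExecuteEngine.addTask() runs, it first checks whether a processor is registered for the tag. If there is none, it calls getWorker() in the same class, which picks one TaskExecuteWorker by hashing the tag, and then calls that worker's process() method, which puts the DistroSyncChangeTask into the worker's queue.

Creating a NacosExecuteTaskExecuteEngine creates dispatchWorkerCount TaskExecuteWorkers, and each TaskExecuteWorker starts an InnerWorker thread when it is initialized. The InnerWorker thread keeps taking sync tasks from its blocking queue and running them: InnerWorker's run() calls DistroSyncChangeTask's run(), which performs the cluster synchronization of the instance data.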
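The second layer is essentially a fixed set of single-threaded workers, each draining its own bounded blocking queue, with tasks routed by tag hash. The following is a minimal sketch under that reading: SketchWorker and SketchExecuteEngine are hypothetical names, the queue capacity is a constructor argument rather than Nacos's QUEUE_CAPACITY, and the worker threads are daemon threads so the example cannot keep a JVM alive (the InnerWorker in the code uses setDaemon(false)).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified TaskExecuteWorker: one bounded queue drained by one thread.
final class SketchWorker {
    private final BlockingQueue<Runnable> queue;
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final Thread thread;

    SketchWorker(String name, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.thread = new Thread(this::loop, name);
        this.thread.setDaemon(true); // daemon only so the sketch never blocks JVM exit
        this.thread.start();
    }

    // Like putTask(): blocks while the queue is full.
    void put(Runnable task) {
        try {
            queue.put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void loop() {
        while (!closed.get()) {
            Runnable task;
            try {
                task = queue.take();
            } catch (InterruptedException e) {
                return; // close() interrupts the thread
            }
            try {
                task.run();
            } catch (Throwable t) {
                // Keep the worker alive, as InnerWorker does by catching Throwable and logging.
            }
        }
    }

    void close() {
        closed.set(true);
        thread.interrupt();
    }
}

// Hypothetical engine that shards tasks across workers by tag hash.
final class SketchExecuteEngine {
    private final SketchWorker[] workers;

    SketchExecuteEngine(int workerCount, int queueCapacity) {
        workers = new SketchWorker[workerCount];
        for (int mod = 0; mod < workerCount; mod++) {
            workers[mod] = new SketchWorker("sketch_" + mod + "%" + workerCount, queueCapacity);
        }
    }

    // Same formula as getWorker(): clear the sign bit, then take the remainder.
    int indexFor(Object tag) {
        return (tag.hashCode() & Integer.MAX_VALUE) % workers.length;
    }

    void addTask(Object tag, Runnable task) {
        workers[indexFor(tag)].put(task);
    }

    void close() {
        for (SketchWorker worker : workers) {
            worker.close();
        }
    }
}
```

Because one tag always maps to one worker thread, tasks for the same service and target node run in submission order, while tasks with different tags can run in parallel on other workers.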

//Distro delay task processor.
public class DistroDelayTaskProcessor implements NacosTaskProcessor {
    private final DistroTaskEngineHolder distroTaskEngineHolder;
    private final DistroComponentHolder distroComponentHolder;
    
    public DistroDelayTaskProcessor(DistroTaskEngineHolder distroTaskEngineHolder, DistroComponentHolder distroComponentHolder) {
        this.distroTaskEngineHolder = distroTaskEngineHolder;
        this.distroComponentHolder = distroComponentHolder;
    }
    
    @Override
    public boolean process(NacosTask task) {
        if (!(task instanceof DistroDelayTask)) {
            return true;
        }
        //Cast the NacosTask to a DistroDelayTask
        DistroDelayTask distroDelayTask = (DistroDelayTask) task;
        DistroKey distroKey = distroDelayTask.getDistroKey();
        if (DataOperation.CHANGE.equals(distroDelayTask.getAction())) {
            //Wrap it in a DistroSyncChangeTask
            DistroSyncChangeTask syncChangeTask = new DistroSyncChangeTask(distroKey, distroComponentHolder);
            //Call NacosExecuteTaskExecuteEngine.addTask() to put it into a worker queue
            distroTaskEngineHolder.getExecuteWorkersManager().addTask(distroKey, syncChangeTask);
            return true;
        }
        return false;
    }
}

//Nacos execute task execute engine.
public class NacosExecuteTaskExecuteEngine extends AbstractNacosTaskExecuteEngine<AbstractExecuteTask> {
    private final TaskExecuteWorker[] executeWorkers;
    
    public NacosExecuteTaskExecuteEngine(String name, Logger logger, int dispatchWorkerCount) {
        super(logger);
        //Each TaskExecuteWorker starts a thread on construction that processes the tasks in its queue
        executeWorkers = new TaskExecuteWorker[dispatchWorkerCount];
        for (int mod = 0; mod < dispatchWorkerCount; ++mod) {
            executeWorkers[mod] = new TaskExecuteWorker(name, mod, dispatchWorkerCount, getEngineLog());
        }
    }
    ...
    @Override
    public void addTask(Object tag, AbstractExecuteTask task) {
        //Use the processor registered for this tag, if any; otherwise fall back to a worker chosen by hash
        NacosTaskProcessor processor = getProcessor(tag);
        if (null != processor) {
            processor.process(task);
            return;
        }
        TaskExecuteWorker worker = getWorker(tag);
        //Call TaskExecuteWorker.process() to put the DistroSyncChangeTask into the worker's queue
        worker.process(task);
    }
    
    private TaskExecuteWorker getWorker(Object tag) {
        int idx = (tag.hashCode() & Integer.MAX_VALUE) % workersCount();
        return executeWorkers[idx];
    }
    ...
}

//Nacos execute task execute worker.
public final class TaskExecuteWorker implements NacosTaskProcessor, Closeable {
    //Task queue
    private final BlockingQueue<Runnable> queue;
    
    public TaskExecuteWorker(final String name, final int mod, final int total, final Logger logger) {
        this.name = name + "_" + mod + "%" + total;
        this.queue = new ArrayBlockingQueue<Runnable>(QUEUE_CAPACITY);
        this.closed = new AtomicBoolean(false);
        this.log = null == logger ? LoggerFactory.getLogger(TaskExecuteWorker.class) : logger;
        new InnerWorker(name).start();
    }
    ...
    @Override
    public boolean process(NacosTask task) {
        if (task instanceof AbstractExecuteTask) {
            //Put the DistroSyncChangeTask into the queue
            putTask((Runnable) task);
        }
        return true;
    }
    
    private void putTask(Runnable task) {
        try {
            //Put the DistroSyncChangeTask into the queue (blocks while the queue is full)
            queue.put(task);
        } catch (InterruptedException ire) {
            log.error(ire.toString(), ire);
        }
    }
    ...
    //Inner execute worker.
    private class InnerWorker extends Thread {
        InnerWorker(String name) {
            setDaemon(false);
            setName(name);
        }
    
        @Override
        public void run() {
            while (!closed.get()) {
                try {
                    //Keep taking tasks from the queue; the task type here is DistroSyncChangeTask
                    Runnable task = queue.take();
                    long begin = System.currentTimeMillis();
                    //Call DistroSyncChangeTask.run()
                    task.run();
                    long duration = System.currentTimeMillis() - begin;
                    if (duration > 1000L) {
                        log.warn("distro task {} takes {}ms", task, duration);
                    }
                } catch (Throwable e) {
                    log.error("[DISTRO-FAILED] " + e.toString(), e);
                }
            }
        }
    }
}

III. The core methods that sync instance data to cluster nodes

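In DistroSyncChangeTask's run() method, the task looks up the DistroData for its key and the transport agent for its resource type (DistroHttpAgent), then calls DistroHttpAgent.syncData(), which sends the new instance data to the target cluster node over HTTP; if the sync fails, handleFailedTask() is called. The endpoint used to sync instance data to a cluster node is /v1/ns/distro/datum, which maps to DistroController's onSyncDatum() method.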
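For illustration only, the request that DistroHttpAgent ends up sending can be approximated with the JDK 11 java.net.http API. This sketch assumes the default context path /nacos and the /v1/ns prefix implied by the endpoint named in the previous paragraph; SketchDistroSender is a hypothetical name, it omits the headers that NamingProxy.syncData() sets, and it only builds the request without sending it. isSyncSuccess() mirrors the two success checks in NamingProxy.syncData(), assuming result.ok() corresponds to HTTP 200.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; Nacos itself sends the data with HttpClient.httpPutLarge().
final class SketchDistroSender {
    // Assumed path pieces: context path (default "/nacos") + "/v1/ns" + "/distro/datum".
    static String datumUrl(String server, String contextPath) {
        return "http://" + server + contextPath + "/v1/ns/distro/datum";
    }

    // Builds (but does not send) a PUT request carrying the serialized datum; headers are omitted.
    static HttpRequest buildSyncRequest(String server, String contextPath, byte[] body) {
        return HttpRequest.newBuilder(URI.create(datumUrl(server, contextPath)))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    // 200 (assumed meaning of result.ok()) and 304 Not Modified both count as a successful sync.
    static boolean isSyncSuccess(int statusCode) {
        return statusCode == 200 || statusCode == 304;
    }
}
```

Treating 304 as success means a target that reports "nothing changed" is not retried.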

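DistroController's onSyncDatum() iterates over the instance data it receives. For each ephemeral instance-list key, if ServiceManager.containService() shows that the service does not exist yet (and instances are ephemeral by default), it first creates an empty service with ServiceManager.createEmptyService(). It then calls DistroProtocol.onReceive(), which calls DistroConsistencyServiceImpl.processData(), which finally calls the same DistroConsistencyServiceImpl.onPut() method used during instance registration.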
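The order of these steps matters. onPut() only notifies listeners when one is registered for the key, so, assuming (as the registration path suggests) that creating the Service is what registers it as the listener for its key, the empty Service has to exist before the data is applied; otherwise the data would land in the DataStore without ever reaching the registry. The sketch below models just that interaction with two maps. It is hypothetical and heavily simplified: keys are plain illustrative strings, the ephemeral-key check and namespace/service parsing are omitted, and the listener update is applied synchronously instead of through the Notifier.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the receiving side of /distro/datum.
final class SketchDistroReceiver {
    // Stands in for DataStore: key -> latest instance list.
    final Map<String, List<String>> dataStore = new ConcurrentHashMap<>();
    // Stands in for the services registered as listeners for each key (their current instances).
    final Map<String, List<String>> services = new ConcurrentHashMap<>();

    void onSyncDatum(Map<String, List<String>> dataMap) {
        if (dataMap.isEmpty()) {
            throw new IllegalArgumentException("receive empty entity!");
        }
        for (Map.Entry<String, List<String>> entry : dataMap.entrySet()) {
            // Like createEmptyService(): make sure a listening service exists first.
            services.putIfAbsent(entry.getKey(), List.of());
            onPut(entry.getKey(), entry.getValue());
        }
    }

    void onPut(String key, List<String> instances) {
        dataStore.put(key, List.copyOf(instances));
        // Like the listeners check in onPut(): without a listener, nothing updates the registry.
        if (!services.containsKey(key)) {
            return;
        }
        // Applied synchronously here; the real onPut() queues it via notifier.addTask().
        services.put(key, List.copyOf(instances));
    }
}
```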

//Distro sync change task.
public class DistroSyncChangeTask extends AbstractDistroExecuteTask {    
    private final DistroComponentHolder distroComponentHolder;
    
    public DistroSyncChangeTask(DistroKey distroKey, DistroComponentHolder distroComponentHolder) {
        super(distroKey);
        this.distroComponentHolder = distroComponentHolder;
    }
      
    @Override
    public void run() {
        Loggers.DISTRO.info("[DISTRO-START] {}", toString());
        try {
            //Build the request payload
            String type = getDistroKey().getResourceType();
            DistroData distroData = distroComponentHolder.findDataStorage(type).getDistroData(getDistroKey());
            distroData.setType(DataOperation.CHANGE);
            //Call DistroHttpAgent.syncData() to send the new instance data over HTTP
            boolean result = distroComponentHolder.findTransportAgent(type).syncData(distroData, getDistroKey().getTargetServer());
            if (!result) {
                handleFailedTask();
            }
            Loggers.DISTRO.info("[DISTRO-END] {} result: {}", toString(), result);
        } catch (Exception e) {
            Loggers.DISTRO.warn("[DISTRO] Sync data change failed.", e);
            handleFailedTask();
        }
    }
    ...
}

//Distro http agent.
public class DistroHttpAgent implements DistroTransportAgent {
    private final ServerMemberManager memberManager;
    
    public DistroHttpAgent(ServerMemberManager memberManager) {
        this.memberManager = memberManager;
    }
    
    @Override
    public boolean syncData(DistroData data, String targetServer) {
        if (!memberManager.hasMember(targetServer)) {
            return true;
        }
        byte[] dataContent = data.getContent();
        //Send the new instance data over HTTP
        return NamingProxy.syncData(dataContent, data.getDistroKey().getTargetServer());
    }
    ...
}

public class NamingProxy {
    ...
    //Synchronize datum to target server.
    public static boolean syncData(byte[] data, String curServer) {
        Map<String, String> headers = new HashMap<>(128);
    
        headers.put(HttpHeaderConsts.CLIENT_VERSION_HEADER, VersionUtils.version);
        headers.put(HttpHeaderConsts.USER_AGENT_HEADER, UtilsAndCommons.SERVER_VERSION);
        headers.put(HttpHeaderConsts.ACCEPT_ENCODING, "gzip,deflate,sdch");
        headers.put(HttpHeaderConsts.CONNECTION, "Keep-Alive");
        headers.put(HttpHeaderConsts.CONTENT_ENCODING, "gzip");
    
        try {
            //Sync the data over HTTP: /v1/ns/distro/datum
            RestResult<String> result = HttpClient.httpPutLarge(
                "http://" + curServer + EnvUtil.getContextPath() + UtilsAndCommons.NACOS_NAMING_CONTEXT + DATA_ON_SYNC_URL, headers, data);
            if (result.ok()) {
                return true;
            }
            if (HttpURLConnection.HTTP_NOT_MODIFIED == result.getCode()) {
                return true;
            }
            throw new IOException("failed to req API:" + "http://" + curServer + EnvUtil.getContextPath()
                + UtilsAndCommons.NACOS_NAMING_CONTEXT + DATA_ON_SYNC_URL + ". code:" + result.getCode() + " msg: "
                + result.getData());
        } catch (Exception e) {
            Loggers.SRV_LOG.warn("NamingProxy", e);
        }
        return false;
    }
    ...
}

//Restful methods for Partition protocol.
@RestController
@RequestMapping(UtilsAndCommons.NACOS_NAMING_CONTEXT + "/distro")
public class DistroController {
    @Autowired
    private DistroProtocol distroProtocol;
  
    @Autowired
    private ServiceManager serviceManager;
    ...
    //Synchronize datum.
    @PutMapping("/datum")
    public ResponseEntity onSyncDatum(@RequestBody Map<String, Datum<Instances>> dataMap) throws Exception {    
        if (dataMap.isEmpty()) {
            Loggers.DISTRO.error("[onSync] receive empty entity!");
            throw new NacosException(NacosException.INVALID_PARAM, "receive empty entity!");
        }


        //Iterate over the received instance data
        for (Map.Entry<String, Datum<Instances>> entry : dataMap.entrySet()) {
            if (KeyBuilder.matchEphemeralInstanceListKey(entry.getKey())) {
                //Extract the namespace and the service name from the key
                String namespaceId = KeyBuilder.getNamespace(entry.getKey());
                String serviceName = KeyBuilder.getServiceName(entry.getKey());
                if (!serviceManager.containService(namespaceId, serviceName) && switchDomain.isDefaultInstanceEphemeral()) {
                    //Create an empty Service, the same as during instance registration
                    serviceManager.createEmptyService(namespaceId, serviceName, true);
                }
                DistroHttpData distroHttpData = new DistroHttpData(createDistroKey(entry.getKey()), entry.getValue());
                //Apply the new instance data
                distroProtocol.onReceive(distroHttpData);
            }
        }
        return ResponseEntity.ok("ok");
    }
    ...
}

@Component
public class DistroProtocol {
    ...
    //Receive synced distro data, find processor to process.
    public boolean onReceive(DistroData distroData) {
        String resourceType = distroData.getDistroKey().getResourceType();
        //Look up the data processor: DistroConsistencyServiceImpl
        DistroDataProcessor dataProcessor = distroComponentHolder.findDataProcessor(resourceType);
        if (null == dataProcessor) {
            Loggers.DISTRO.warn("[DISTRO] Can't find data process for received data {}", resourceType);
            return false;
        }
        //Call DistroConsistencyServiceImpl.processData() to handle the received instances
        return dataProcessor.processData(distroData);
    }
    ...
}

@DependsOn("ProtocolManager")
@org.springframework.stereotype.Service("distroConsistencyService")
public class DistroConsistencyServiceImpl implements EphemeralConsistencyService, DistroDataProcessor {
    //Stores the data of all registered instances
    private final DataStore dataStore;
    private volatile Notifier notifier = new Notifier();
    ...
    @Override
    public boolean processData(DistroData distroData) {
        DistroHttpData distroHttpData = (DistroHttpData) distroData;
        Datum<Instances> datum = (Datum<Instances>) distroHttpData.getDeserializedContent();
        //This is the same onPut() method that instance registration calls
        onPut(datum.key, datum.value);
        return true;
    }
    
    public void onPut(String key, Record value) {
        if (KeyBuilder.matchEphemeralInstanceListKey(key)) {
            //Create a Datum holding the service key and all of the service's Instances
            Datum<Instances> datum = new Datum<>();
            datum.value = (Instances) value;
            datum.key = key;
            datum.timestamp.incrementAndGet();
            //Put it into DataStore's map
            dataStore.put(key, datum);
        }
        if (!listeners.containsKey(key)) {
            return;
        }
        //Add a processing task that notifies the listeners asynchronously
        notifier.addTask(key, DataOperation.CHANGE);
    }
    
    @Override
    public void put(String key, Record value) throws NacosException {
        //Store the latest instance list, including the newly registered instance, in the DataStore
        onPut(key, value);
        //In cluster mode, DistroProtocol.sync() syncs the instance data to the other cluster nodes
        distroProtocol.sync(new DistroKey(key, KeyBuilder.INSTANCE_LIST_KEY_PREFIX), DataOperation.CHANGE, globalConfig.getTaskDispatchPeriod() / 2);
    }
    ...
}

(4) Summary

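Registering an instance starts with DistroConsistencyServiceImpl.put(), which calls DistroProtocol.sync() to propagate the new instance to the other cluster nodes. The sync first builds delay tasks that are stored in a map and processed asynchronously, then builds sync tasks that are stored in a blocking queue and processed asynchronously. Finally, the async task sends an HTTP request to sync the instance data, and on the receiving node this ends up calling DistroConsistencyServiceImpl.onPut() again to update the registry. As a result, every node in the cluster holds the data of all (ephemeral) instances.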

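Why use two in-memory layers instead of one, handing the sync task for a new instance straight to a TaskExecuteWorker? The extra layer acts as a relay that further improves throughput. Instance registration can be extremely concurrent: if a thousand machines start at the same time, the Nacos server receives a thousand concurrent registration requests. With a single in-memory queue, the thousand sync tasks for those new instances would compete for the lock to enter a TaskExecuteWorker's blocking queue, and the Nacos clients that sent the registration requests would wait too long for the server's response. With the first layer, the registering thread only puts a task into a map (where tasks for the same service and target node are merged), and a single background thread later moves the due tasks into the worker queues.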

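4. Data synchronization when a cluster node's health status changes

(1) Introduction to the cluster management module of the Nacos console

(2) Source code: starting the node health-check task when a cluster node starts

(3) Source code: data synchronization after a node receives a health-check request

(4) Summary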

(1) Introduction to the cluster management module of the Nacos console

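The cluster management module shows the state and metadata of every node. Node IP is the node's IP address and port, node state indicates whether the node is currently available, and node metadata contains information such as the node's Raft metadata.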

An example of the node metadata:

{
    // last refresh time
    "lastRefreshTime": 1674093895774,
    // Raft metadata
    "raftMetaData": {
        "metaDataMap": {
            "naming_persistent_service": {
                // leader IP address
                "leader": "10.0.16.3:7849",
                // Raft group members
                "raftGroupMember": [
                    "10.0.16.3:7850",
                    "10.0.16.3:7848",
                    "10.0.16.3:7849"
                ],
                "term": 1
            }
        }
    },
    // Raft port
    "raftPort": "7849",
    // Nacos version
    "version": "1.4.1"
}

(2) Source code: starting the node health-check task when a cluster node starts

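Because the ServerMemberManager bean listens for the WebServerInitializedEvent, Spring calls ServerMemberManager.onApplicationEvent() during startup. In cluster mode, this method schedules the cluster-node health-check task, MemberInfoReportTask, to run after 5 seconds; running MemberInfoReportTask means running Task's run() method.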

MemberInfoReportTask extends the abstract class Task, which uses the template method pattern. So when Task.run() executes, it first calls MemberInfoReportTask's executeBody() and then MemberInfoReportTask's after().

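In MemberInfoReportTask.executeBody(), the task first gets the list of the other cluster nodes (excluding itself), then picks this round's target Member by incrementing the cursor variable modulo the list size, and finally sends the target an HTTP request (/v1/core/cluster/report) carrying this node's own Member information. If the target responds successfully, MemberUtil.onSuccess() marks it UP. If the request fails, MemberUtil.onFail() marks it SUSPICIOUS and increments its failure count; its state becomes DOWN only when the failure count exceeds nacos.core.member.fail-access-cnt (default 3) or the error shows that the connection was refused.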
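The target selection is plain round robin. Because the cursor is advanced before it is used, the first report goes to index 1 of the peer list (or index 0 if there is only one peer); and since each run reports to a single peer and the next run is scheduled 2 seconds later, with N peers a given peer hears from this node roughly once every 2N seconds. A hypothetical helper reproducing just that arithmetic:

```java
import java.util.List;

// Hypothetical helper reproducing the cursor arithmetic of MemberInfoReportTask.executeBody().
final class SketchRoundRobin {
    private int cursor = 0;

    // Advances the cursor first, then indexes; returns null when there are no peers.
    <T> T next(List<T> members) {
        if (members.isEmpty()) {
            return null;
        }
        cursor = (cursor + 1) % members.size();
        return members.get(cursor);
    }
}
```

The list is re-read on every run, so taking the remainder against the current size keeps the index valid even when peers join or leave between runs.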

Finally, MemberInfoReportTask.after() schedules the same MemberInfoReportTask again after 2 seconds, so the health check runs repeatedly.
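The Task/after() combination is a self-rescheduling template method: run() fixes the order (body first, then after()), and after() puts the same object back on a scheduler. The sketch below uses hypothetical class names, a ScheduledExecutorService passed in instead of GlobalExecutor, and a simulated failure on the second run. Because after() is called from a finally block, an exception in executeBody() does not stop the cycle.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Same shape as the Task class in this section: run() fixes the order, subclasses fill in the steps.
abstract class SketchTask implements Runnable {
    protected volatile boolean shutdown = false;

    @Override
    public void run() {
        if (shutdown) {
            return;
        }
        try {
            executeBody();
        } catch (Throwable t) {
            // Task logs the error; the sketch ignores it so that after() still runs.
        } finally {
            if (!shutdown) {
                after();
            }
        }
    }

    protected abstract void executeBody();

    protected void after() {
    }

    public void shutdown() {
        shutdown = true;
    }
}

// Hypothetical self-rescheduling task, modeled on MemberInfoReportTask.after().
final class SketchReportTask extends SketchTask {
    private final ScheduledExecutorService executor;
    private final long delayMs;
    final AtomicInteger runs = new AtomicInteger();

    SketchReportTask(ScheduledExecutorService executor, long delayMs) {
        this.executor = executor;
        this.delayMs = delayMs;
    }

    @Override
    protected void executeBody() {
        // Simulate a failing round on the second run.
        if (runs.incrementAndGet() == 2) {
            throw new IllegalStateException("simulated report failure");
        }
    }

    @Override
    protected void after() {
        try {
            executor.schedule(this, delayMs, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            // Executor already shut down; stop rescheduling (scheduleByCommon checks isShutdown() instead).
        }
    }
}
```

Using schedule() with a fresh delay after each run, rather than scheduleWithFixedDelay(), keeps the rescheduling decision inside the task itself, which is what lets shutdown() stop the cycle cleanly.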

@Component(value = "serverMemberManager")
public class ServerMemberManager implements ApplicationListener<WebServerInitializedEvent> {
    private final NacosAsyncRestTemplate asyncRestTemplate = HttpClientBeanHolder.getNacosAsyncRestTemplate(Loggers.CORE);
    //Address information for the local node.
    private String localAddress;
    //Broadcast this node element information task.
    private final MemberInfoReportTask infoReportTask = new MemberInfoReportTask();
    ...
    //Listen for the WebServerInitializedEvent published once Spring's web server has started
    @Override
    public void onApplicationEvent(WebServerInitializedEvent event) {
        //Mark the local node as UP
        getSelf().setState(NodeState.UP);
        //Start the cluster-node health-check task only in cluster mode
        if (!EnvUtil.getStandaloneMode()) {
            //Schedule MemberInfoReportTask.run() to run once after 5 seconds
            GlobalExecutor.scheduleByCommon(this.infoReportTask, 5_000L);
        }
        EnvUtil.setPort(event.getWebServer().getPort());
        EnvUtil.setLocalAddress(this.localAddress);
        Loggers.CLUSTER.info("This node is ready to provide external services");
    }
    ...    
    class MemberInfoReportTask extends Task {
        private final GenericType<RestResult<String>> reference = new GenericType<RestResult<String>>() { };
        private int cursor = 0;
        
        @Override
        protected void executeBody() {
            //Get all cluster nodes except this one
            List<Member> members = ServerMemberManager.this.allMembersWithoutSelf();
            if (members.isEmpty()) {
                return;
            }
            //Round robin: each executeBody() call advances cursor by 1 and picks the Member at that index
            this.cursor = (this.cursor + 1) % members.size();
            Member target = members.get(cursor);
            Loggers.CLUSTER.debug("report the metadata to the node : {}", target.getAddress());
            //Build the URL: /v1/core/cluster/report
            final String url = HttpUtils.buildUrl(false, target.getAddress(), EnvUtil.getContextPath(), Commons.NACOS_CORE_CONTEXT, "/cluster/report");
            try {
                //Send a health-check (report) request carrying this node's info to the target Member over HTTP
                asyncRestTemplate.post(url, Header.newInstance().addParam(Constants.NACOS_SERVER_HEADER, VersionUtils.version),
                    Query.EMPTY, getSelf(), reference.getType(), new Callback<String>() {
                        @Override
                        public void onReceive(RestResult<String> result) {
                            if (result.getCode() == HttpStatus.NOT_IMPLEMENTED.value() || result.getCode() == HttpStatus.NOT_FOUND.value()) {
                                Loggers.CLUSTER.warn("{} version is too low, it is recommended to upgrade the version : {}", target, VersionUtils.version);
                                return;
                            }
                            if (result.ok()) {
                                //On success, set the target Member's state to NodeState.UP
                                MemberUtil.onSuccess(ServerMemberManager.this, target);
                            } else {
                                //On failure, mark the target Member SUSPICIOUS (DOWN once the failure threshold is exceeded)
                                Loggers.CLUSTER.warn("failed to report new info to target node : {}, result : {}", target.getAddress(), result);
                                MemberUtil.onFail(ServerMemberManager.this, target);
                            }
                        }
                        
                        @Override
                        public void onError(Throwable throwable) {
                            Loggers.CLUSTER.error("failed to report new info to target node : {}, error : {}", target.getAddress(), ExceptionUtil.getAllExceptionMsg(throwable));
                            //On error, mark the target Member SUSPICIOUS (DOWN once the threshold is exceeded or the connection is refused)
                            MemberUtil.onFail(ServerMemberManager.this, target, throwable);
                        }
                        
                        @Override
                        public void onCancel() {
                        }
                    }
                );
            } catch (Throwable ex) {
                Loggers.CLUSTER.error("failed to report new info to target node : {}, error : {}", target.getAddress(), ExceptionUtil.getAllExceptionMsg(ex));
            }
        }
        
        @Override
        protected void after() {
            //Resubmit this health-check task after 2 seconds so that it runs repeatedly
            GlobalExecutor.scheduleByCommon(this, 2_000L);
        }
    }
}

//Task uses the template method pattern
public abstract class Task implements Runnable {
    protected volatile boolean shutdown = false;
    
    @Override
    public void run() {
        if (shutdown) {
            return;
        }
        try {
            //Core logic of the task: an abstract method that subclasses implement
            executeBody();
        } catch (Throwable t) {
            Loggers.CORE.error("this task execute has error : {}", ExceptionUtil.getStackTrace(t));
        } finally {
            if (!shutdown) {
                after();
            }
        }
    }
    
    protected abstract void executeBody();
    
    protected void after() {
    }
    
    public void shutdown() {
        shutdown = true;
    }
}

public class MemberUtil {
    ...
    //Successful processing of the operation on the node.
    public static void onSuccess(final ServerMemberManager manager, final Member member) {
        final NodeState old = member.getState();
        manager.getMemberAddressInfos().add(member.getAddress());
        member.setState(NodeState.UP);
        member.setFailAccessCnt(0);
        if (!Objects.equals(old, member.getState())) {
            manager.notifyMemberChange();
        }
    }
    
    public static void onFail(final ServerMemberManager manager, final Member member) {
        onFail(manager, member, ExceptionUtil.NONE_EXCEPTION);
    }
    
    //Failure processing of the operation on the node.
    public static void onFail(final ServerMemberManager manager, final Member member, Throwable ex) {
        manager.getMemberAddressInfos().remove(member.getAddress());
        final NodeState old = member.getState();
        member.setState(NodeState.SUSPICIOUS);
        member.setFailAccessCnt(member.getFailAccessCnt() + 1);
        int maxFailAccessCnt = EnvUtil.getProperty("nacos.core.member.fail-access-cnt", Integer.class, 3);
        if (member.getFailAccessCnt() > maxFailAccessCnt || StringUtils.containsIgnoreCase(ex.getMessage(), TARGET_MEMBER_CONNECT_REFUSE_ERRMSG)) {
            member.setState(NodeState.DOWN);
        }
        if (!Objects.equals(old, member.getState())) {
            manager.notifyMemberChange();
        }
    }
    ...
}

public class GlobalExecutor {
    private static final ScheduledExecutorService COMMON_EXECUTOR = 
        ExecutorFactory.Managed.newScheduledExecutorService(
            ClassUtils.getCanonicalName(GlobalExecutor.class), 
            4,
            new NameThreadFactory("com.alibaba.nacos.core.common")
        );
    ...
    public static void scheduleByCommon(Runnable runnable, long delayMs) {
        if (COMMON_EXECUTOR.isShutdown()) {
            return;
        }
        //Run the task once after the given delay
        COMMON_EXECUTOR.schedule(runnable, delayMs, TimeUnit.MILLISECONDS);
    }
    ...
}

public final class ExecutorFactory {
    ...
    public static final class Managed {
        private static final String DEFAULT_NAMESPACE = "nacos";
        private static final ThreadPoolManager THREAD_POOL_MANAGER = ThreadPoolManager.getInstance();
        ...
        //Create a new scheduled executor service with input thread factory and register to manager.
        public static ScheduledExecutorService newScheduledExecutorService(final String group, final int nThreads, final ThreadFactory threadFactory) {
            ScheduledExecutorService executorService = Executors.newScheduledThreadPool(nThreads, threadFactory);
            THREAD_POOL_MANAGER.register(DEFAULT_NAMESPACE, group, executorService);
            return executorService;
        }
        ...
    }
    ...
}
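MemberUtil.onSuccess() and onFail() amount to a small state machine. The sketch below reproduces the transitions shown in the code under stated simplifications: SketchMember and SketchMemberUtil are hypothetical, the threshold is passed as a parameter (the code reads nacos.core.member.fail-access-cnt, default 3), and the "Connection refused" marker text is an assumption standing in for TARGET_MEMBER_CONNECT_REFUSE_ERRMSG.

```java
import java.util.Locale;

// Hypothetical, simplified Member with only the fields onSuccess()/onFail() touch.
final class SketchMember {
    enum State { STARTING, UP, SUSPICIOUS, DOWN }

    State state = State.UP;
    int failAccessCnt = 0;
}

final class SketchMemberUtil {
    // Assumed marker text for a refused connection.
    static final String CONNECT_REFUSED = "Connection refused";

    // Returns true when the state changed, i.e. when notifyMemberChange() would be called.
    static boolean onSuccess(SketchMember member) {
        SketchMember.State old = member.state;
        member.state = SketchMember.State.UP;
        member.failAccessCnt = 0;
        return old != member.state;
    }

    static boolean onFail(SketchMember member, String errorMessage, int maxFailAccessCnt) {
        SketchMember.State old = member.state;
        member.state = SketchMember.State.SUSPICIOUS;
        member.failAccessCnt++;
        boolean refused = errorMessage != null
                && errorMessage.toLowerCase(Locale.ROOT).contains(CONNECT_REFUSED.toLowerCase(Locale.ROOT));
        if (member.failAccessCnt > maxFailAccessCnt || refused) {
            member.state = SketchMember.State.DOWN;
        }
        return old != member.state;
    }
}
```

With the default threshold of 3, a peer goes from UP to SUSPICIOUS on the first failed report and reaches DOWN only on the fourth consecutive failure; any successful report resets the count, and a refused connection goes straight to DOWN.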

(3) Source code: data synchronization after a node receives a health-check request

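When a cluster node receives a /v1/core/cluster/report request from another node, NacosClusterController.report() handles it. report() sets the sending node's state to UP and resets its failure count to 0, then calls ServerMemberManager.update() to update that node's attributes. update() updates the corresponding Member stored in serverList by copying the new attributes over the old object with MemberUtil.copy(), and publishes a member-change notification if the node's basic information has changed.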

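Note: every node in the cluster keeps its own copy of serverList, so a node that receives a health-check request must update the corresponding entry in its own serverList.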
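The update path boils down to an in-place update inside computeIfPresent() on a ConcurrentSkipListMap keyed by address. A hypothetical sketch, assuming "basic info" means only state and version (the real isBasicInfoChanged() compares more fields) and passing the timestamp in so the result is deterministic:

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical node record with a few of Member's fields.
final class SketchNode {
    final String address;
    String state;
    String version;
    long lastRefreshTime;

    SketchNode(String address, String state, String version) {
        this.address = address;
        this.state = state;
        this.version = version;
    }
}

// Hypothetical serverList wrapper following the shape of ServerMemberManager.update().
final class SketchMemberTable {
    private final ConcurrentSkipListMap<String, SketchNode> serverList = new ConcurrentSkipListMap<>();
    int changeEvents = 0; // counts notifyMemberChange()-style notifications

    void add(SketchNode node) {
        serverList.put(node.address, node);
    }

    SketchNode get(String address) {
        return serverList.get(address);
    }

    boolean update(SketchNode incoming, long now) {
        if (!serverList.containsKey(incoming.address)) {
            return false; // update() never adds unknown nodes
        }
        serverList.computeIfPresent(incoming.address, (address, existing) -> {
            boolean changed = !Objects.equals(existing.state, incoming.state)
                    || !Objects.equals(existing.version, incoming.version);
            // Copy onto the existing object (like MemberUtil.copy()) so other holders see the update.
            existing.state = incoming.state;
            existing.version = incoming.version;
            existing.lastRefreshTime = now;
            if (changed) {
                changeEvents++;
            }
            return existing;
        });
        return true;
    }
}
```

One caveat: the ConcurrentSkipListMap Javadoc states that computeIfPresent() is not guaranteed to apply the function once atomically, so under contention the function may run more than once. Side effects inside it, such as the counter here or a change notification, should therefore tolerate repetition.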

@RestController
@RequestMapping(Commons.NACOS_CORE_CONTEXT + "/cluster")
public class NacosClusterController {
    private final ServerMemberManager memberManager;
    ...
    //Other nodes return their own metadata information.
    @PostMapping(value = {"/report"})
    public RestResult<String> report(@RequestBody Member node) {
        if (!node.check()) {
            return RestResultUtils.failedWithMsg(400, "Node information is illegal");
        }
        LoggerUtils.printIfDebugEnabled(Loggers.CLUSTER, "node state report, receive info : {}", node);
        //A node that can successfully call this endpoint must be healthy, so set its state to UP directly
        node.setState(NodeState.UP);
        node.setFailAccessCnt(0);
        //Update the cluster node
        boolean result = memberManager.update(node);
        return RestResultUtils.success(Boolean.toString(result));
    }
    ...
}

@Component(value = "serverMemberManager")
public class ServerMemberManager implements ApplicationListener<WebServerInitializedEvent> {
    //Cluster node list.
    private volatile ConcurrentSkipListMap<String, Member> serverList;
    ...
    //member information update.
    public boolean update(Member newMember) {
        Loggers.CLUSTER.debug("member information update : {}", newMember);
        String address = newMember.getAddress();
        if (!serverList.containsKey(address)) {
            return false;
        }
        //Update the entry in serverList
        serverList.computeIfPresent(address, (s, member) -> {
            //If the node is DOWN, remove its address from memberAddressInfos
            if (NodeState.DOWN.equals(newMember.getState())) {
                memberAddressInfos.remove(newMember.getAddress());
            }
            //Check whether the node's basic information has changed
            boolean isPublishChangeEvent = MemberUtil.isBasicInfoChanged(newMember, member);
            //Set lastRefreshTime to the current time
            newMember.setExtendVal(MemberMetaDataConstants.LAST_REFRESH_TIME, System.currentTimeMillis());
            //Copy the new attributes onto the existing Member
            MemberUtil.copy(newMember, member);
            if (isPublishChangeEvent) {
                //member basic data changes and all listeners need to be notified
                //If anything changed, publish a change notification
                notifyMemberChange();
            }
            return member;
        });
        return true;
    }
    ...
}

(4) Summary

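How does a Nacos cluster keep the health status of its nodes in sync? In short, the nodes keep communicating with each other: every 2 seconds, each node reports its own information to one peer, chosen in round-robin order. A node that receives a report marks the sender UP. If a report fails, the sender marks the target SUSPICIOUS, and then DOWN once the consecutive failures exceed the threshold or the connection is refused.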

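5. How a newly added node syncs existing instance data

(1) The async task that loads all instance data when a node starts

(2) Source code: handling a request for all instance data

(3) Summary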

(1) The async task that loads all instance data when a node starts

The Nacos server has a DistroProtocol class, which is a Spring bean, so a DistroProtocol bean is created when the Spring application starts.

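Creating the DistroProtocol bean runs its constructor, which calls startDistroTask(). In standalone mode this only marks the protocol as initialized; in cluster mode it calls startVerifyTask() and then startLoadTask(), which starts an async task that loads data.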

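startLoadTask() submits an async task and passes in a callback that records whether initialization succeeded. The submitted task is a DistroLoadDataTask, so its run() method executes. run() calls load(), which calls loadAllDataSnapshotFromRemote() for each data storage type that has not been loaded yet, fetching all instance data from the other cluster nodes and updating the local registry. If some type is still not loaded, run() resubmits the task after distroConfig.getLoadDataRetryDelayMillis(); once every type has loaded, the callback's onSuccess() sets isInitialized to true.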
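The load-and-retry loop can be sketched as follows. This is a simplified, hypothetical model of DistroLoadDataTask: the waits for the member list and for data storage registration are omitted, loading one data type is abstracted as a Predicate that returns true on success, and the scheduler and retry delay are passed in rather than taken from GlobalExecutor and DistroConfig.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical counterpart of DistroCallback.
interface SketchCallback {
    void onSuccess();

    void onFailed(Throwable throwable);
}

// Hypothetical, simplified DistroLoadDataTask: load each type, retry until all types are loaded.
final class SketchLoadTask implements Runnable {
    private final List<String> dataTypes;
    private final Predicate<String> loadFromRemote; // true = this type was loaded from some peer
    private final ScheduledExecutorService executor;
    private final long retryDelayMs;
    private final SketchCallback callback;
    private final Map<String, Boolean> loadCompleted = new ConcurrentHashMap<>();

    SketchLoadTask(List<String> dataTypes, Predicate<String> loadFromRemote,
                   ScheduledExecutorService executor, long retryDelayMs, SketchCallback callback) {
        this.dataTypes = dataTypes;
        this.loadFromRemote = loadFromRemote;
        this.executor = executor;
        this.retryDelayMs = retryDelayMs;
        this.callback = callback;
    }

    @Override
    public void run() {
        try {
            // Only retry the types that have not been loaded yet.
            for (String type : dataTypes) {
                if (!loadCompleted.getOrDefault(type, false)) {
                    loadCompleted.put(type, loadFromRemote.test(type));
                }
            }
            if (!checkCompleted()) {
                executor.schedule(this, retryDelayMs, TimeUnit.MILLISECONDS);
            } else {
                callback.onSuccess();
            }
        } catch (Exception e) {
            callback.onFailed(e);
        }
    }

    private boolean checkCompleted() {
        return dataTypes.stream().allMatch(type -> loadCompleted.getOrDefault(type, false));
    }
}
```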

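loadAllDataSnapshotFromRemote() iterates over the cluster nodes other than this one. For each node, it calls DistroHttpAgent.getDatumSnapshot(), which sends an HTTP request to /v1/ns/distro/datums to fetch all of that node's instance data, and then calls DistroConsistencyServiceImpl.processSnapshot() to write the data into the local registry. As soon as the data from one node has been synced successfully, the method returns; otherwise it moves on to the next node and tries again.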
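The "first peer that succeeds wins" rule is a simple loop. A hypothetical generic sketch: fetching a snapshot stands in for getDatumSnapshot() and applying it stands in for processSnapshot(), and a failure of either (an exception, or a false result from applying) moves on to the next peer.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical helper following the shape of loadAllDataSnapshotFromRemote().
final class SketchSnapshotLoader {
    // Tries peers in order; the first peer whose snapshot is fetched and applied successfully wins.
    static <S> boolean loadFromAnyPeer(List<String> peers,
                                       Function<String, S> fetchSnapshot,
                                       Predicate<S> processSnapshot) {
        for (String peer : peers) {
            try {
                S snapshot = fetchSnapshot.apply(peer);
                if (processSnapshot.test(snapshot)) {
                    return true; // one successful peer is enough
                }
            } catch (RuntimeException e) {
                // This peer failed (e.g. unreachable); move on to the next one.
            }
        }
        return false;
    }
}
```

Stopping at the first success keeps startup cheap, since a new node downloads the full data set only once, at the cost of trusting that one peer's copy.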

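When handling instance registration, the Nacos server uses an in-memory queue plus an async task, and that async task calls the listener's onChange() method, which updates the local registry using copy-on-write. processSnapshot() also calls the listener's onChange() to update the registry; here the listener's onChange() implementation is Service.onChange().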
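Copy-on-write is what lets readers (for example, threads serving instance queries) read the registry without locks while onChange() replaces it. A minimal sketch of the idea, assuming a single service's instance list held in a volatile field; the class is hypothetical and far simpler than the real Service.onChange() path, and serializing writers with synchronized is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical copy-on-write holder for one service's instance list.
final class SketchCowInstances {
    private volatile List<String> instances = Collections.emptyList();

    // Readers take the current reference without locking; a list they hold is never modified.
    List<String> snapshot() {
        return instances;
    }

    // Full replacement: build a fresh list and publish it with one volatile write.
    synchronized void onChange(List<String> latest) {
        instances = Collections.unmodifiableList(new ArrayList<>(latest));
    }

    // Incremental change: copy the current list, modify the copy, then swap the reference.
    synchronized void add(String instance) {
        List<String> copy = new ArrayList<>(instances);
        copy.add(instance);
        instances = Collections.unmodifiableList(copy);
    }
}
```

Readers holding an old snapshot keep seeing a consistent list; they simply miss the update until they read the field again.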

@Component
public class DistroProtocol {
    ...
    public DistroProtocol(ServerMemberManager memberManager, DistroComponentHolder distroComponentHolder, DistroTaskEngineHolder distroTaskEngineHolder, DistroConfig distroConfig) {
        this.memberManager = memberManager;
        this.distroComponentHolder = distroComponentHolder;
        this.distroTaskEngineHolder = distroTaskEngineHolder;
        this.distroConfig = distroConfig;
        //Start the async tasks
        startDistroTask();
    }
    
    private void startDistroTask() {
        if (EnvUtil.getStandaloneMode()) {
            isInitialized = true;
            return;
        }
        startVerifyTask();
        //Submit an async task that loads data
        startLoadTask();
    }
    
    private void startLoadTask() {
        //Callback for loading data: sets isInitialized to record whether initialization succeeded
        DistroCallback loadCallback = new DistroCallback() {
            @Override
            public void onSuccess() {
                isInitialized = true;
            }
            
            @Override
            public void onFailed(Throwable throwable) {
                isInitialized = false;
            }
        };
        //Submit the asynchronous task
        GlobalExecutor.submitLoadDataTask(new DistroLoadDataTask(memberManager, distroComponentHolder, distroConfig, loadCallback));
    }
    ...
}

//Distro load data task.
public class DistroLoadDataTask implements Runnable {
    ...
    @Override
    public void run() {
        try {
            load();
            if (!checkCompleted()) {
                GlobalExecutor.submitLoadDataTask(this, distroConfig.getLoadDataRetryDelayMillis());
            } else {
                loadCallback.onSuccess();
                Loggers.DISTRO.info("[DISTRO-INIT] load snapshot data success");
            }
        } catch (Exception e) {
            loadCallback.onFailed(e);
            Loggers.DISTRO.error("[DISTRO-INIT] load snapshot data failed. ", e);
        }
    }
    
    private void load() throws Exception {
        while (memberManager.allMembersWithoutSelf().isEmpty()) {
            Loggers.DISTRO.info("[DISTRO-INIT] waiting server list init...");
            TimeUnit.SECONDS.sleep(1);
        }
        while (distroComponentHolder.getDataStorageTypes().isEmpty()) {
            Loggers.DISTRO.info("[DISTRO-INIT] waiting distro data storage register...");
            TimeUnit.SECONDS.sleep(1);
        }
        for (String each : distroComponentHolder.getDataStorageTypes()) {
            if (!loadCompletedMap.containsKey(each) || !loadCompletedMap.get(each)) {
                loadCompletedMap.put(each, loadAllDataSnapshotFromRemote(each));
            }
        }
    }
    
    private boolean loadAllDataSnapshotFromRemote(String resourceType) {
        DistroTransportAgent transportAgent = distroComponentHolder.findTransportAgent(resourceType);
        DistroDataProcessor dataProcessor = distroComponentHolder.findDataProcessor(resourceType);
        if (null == transportAgent || null == dataProcessor) {
            Loggers.DISTRO.warn("[DISTRO-INIT] Can't find component for type {}, transportAgent: {}, dataProcessor: {}", resourceType, transportAgent, dataProcessor);
            return false;
        }
        //Iterate over all cluster nodes except this one
        for (Member each : memberManager.allMembersWithoutSelf()) {
            try {
                Loggers.DISTRO.info("[DISTRO-INIT] load snapshot {} from {}", resourceType, each.getAddress());
                //Call DistroHttpAgent.getDatumSnapshot() to fetch the other node's data over HTTP
                DistroData distroData = transportAgent.getDatumSnapshot(each.getAddress());
                //Call DistroConsistencyServiceImpl.processSnapshot() to apply the returned data to this node's in-memory registry
                boolean result = dataProcessor.processSnapshot(distroData);
                Loggers.DISTRO.info("[DISTRO-INIT] load snapshot {} from {} result: {}", resourceType, each.getAddress(), result);
                //Stop as soon as one cluster node's full data has been fetched and applied successfully
                if (result) {
                    return true;
                }
            } catch (Exception e) {
                Loggers.DISTRO.error("[DISTRO-INIT] load snapshot {} from {} failed.", resourceType, each.getAddress(), e);
            }
        }
        return false;
    }
    ...
}

public class DistroHttpAgent implements DistroTransportAgent {
    ...
    @Override
    public DistroData getDatumSnapshot(String targetServer) {
        try {
            //Send the HTTP request through NamingProxy
            byte[] allDatum = NamingProxy.getAllData(targetServer);
            return new DistroData(new DistroKey("snapshot", KeyBuilder.INSTANCE_LIST_KEY_PREFIX), allDatum);
        } catch (Exception e) {
            throw new DistroException(String.format("Get snapshot from %s failed.", targetServer), e);
        }
    }
    ...
}

public class NamingProxy {
    ...
    //Fetch all data from the target node
    public static byte[] getAllData(String server) throws Exception {
        Map<String, String> params = new HashMap<>(8);
        RestResult<String> result = HttpClient.httpGet(
            "http://" + server + EnvUtil.getContextPath() + UtilsAndCommons.NACOS_NAMING_CONTEXT + ALL_DATA_GET_URL,
            new ArrayList<>(),
            params
        );
    
        if (result.ok()) {
            return result.getData().getBytes();
        }
    
        throw new IOException("failed to req API: " + "http://" + server + EnvUtil.getContextPath()
            + UtilsAndCommons.NACOS_NAMING_CONTEXT + ALL_DATA_GET_URL + ". code: " + result.getCode() + " msg: "
            + result.getMessage());
    }
    ...
}

@DependsOn("ProtocolManager")
@org.springframework.stereotype.Service("distroConsistencyService")
public class DistroConsistencyServiceImpl implements EphemeralConsistencyService, DistroDataProcessor {
    ...
    @Override
    public boolean processSnapshot(DistroData distroData) {
        try {
            return processData(distroData.getContent());
        } catch (Exception e) {
            return false;
        }
    }
    
    private boolean processData(byte[] data) throws Exception {
        if (data.length > 0) {
            //Deserialize the bytes into a map of key -> Datum<Instances>
            Map<String, Datum<Instances>> datumMap = serializer.deserializeMap(data, Instances.class);

            //Create empty Service objects
            for (Map.Entry<String, Datum<Instances>> entry : datumMap.entrySet()) {
                ...
            }
        
            for (Map.Entry<String, Datum<Instances>> entry : datumMap.entrySet()) {
                if (!listeners.containsKey(entry.getKey())) {
                    // Should not happen:
                    Loggers.DISTRO.warn("listener of {} not found.", entry.getKey());
                    continue;
                }
                try {
                    //Update the local registry
                    for (RecordListener listener : listeners.get(entry.getKey())) {
                        listener.onChange(entry.getKey(), entry.getValue().value);
                    }
                } catch (Exception e) {
                    Loggers.DISTRO.error("[NACOS-DISTRO] error while execute listener of key: {}", entry.getKey(), e);
                    continue;
                }
                //Update data store if listener executed successfully:
                dataStore.put(entry.getKey(), entry.getValue());
            }
        }
        return true;
    }
    ...
}

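**Summary:** When a Nacos server cluster node starts, it creates a DistroProtocol Bean whose constructor starts an asynchronous task. The task fetches service data from other cluster nodes over HTTP and writes the returned service instance data into the local in-memory registry, completing the synchronization. As soon as synchronization from any one cluster node succeeds, the task's loading logic ends.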

In addition, a node fetches all service instance data from another cluster node by sending an HTTP request to the "/v1/ns/distro/datums" endpoint.
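For reference, the request that NamingProxy.getAllData() sends can be reproduced with the JDK's java.net.http client (Java 11+). Some details are assumptions:

- The constant values "/v1/ns" and "/distro/datums" are inferred from the endpoint path given above.
- The context path (commonly "/nacos") is passed in by the caller.
- `DatumsClient` is a hypothetical helper, not part of Nacos.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper that mirrors the URL NamingProxy.getAllData() builds.
final class DatumsClient {
    static final String NAMING_CONTEXT = "/v1/ns";          // assumed value, from the path above
    static final String ALL_DATA_GET_URL = "/distro/datums"; // assumed value, from the path above

    private DatumsClient() {
    }

    // "http://" + server + contextPath + naming context + "/distro/datums"
    static String datumsUrl(String server, String contextPath) {
        return "http://" + server + contextPath + NAMING_CONTEXT + ALL_DATA_GET_URL;
    }

    // Fetch the raw snapshot bytes; a non-200 status is treated as a failure.
    static byte[] fetch(HttpClient client, String server, String contextPath)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(datumsUrl(server, contextPath))).GET().build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() != 200) {
            throw new IOException("failed to req API: " + request.uri() + ". code: " + response.statusCode());
        }
        return response.body();
    }
}
```

For example, `DatumsClient.fetch(HttpClient.newHttpClient(), "192.168.0.2:8848", "/nacos")` would request http://192.168.0.2:8848/nacos/v1/ns/distro/datums.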

(2) Source code for handling a request for all service instance data

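When a Nacos cluster node receives an HTTP request for "/v1/ns/distro/datums", DistroController.getAllDatums() runs. It calls DistroProtocol.onSnapshot(), which looks up the data storage for the instance-list key prefix and calls DistroDataStorageImpl.getDatumSnapshot(). getAllDatums() then returns the snapshot's content directly as the response body.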

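getDatumSnapshot() takes its result from DataStore.getDataMap(). When a service instance is registered, a copy of its data is put into DataStore's map, and the same happens when service instances are synchronized from other nodes. DataStore therefore holds the complete service instance data. This get-all-data interface is served from DataStore, not from the in-memory registry.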
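The snapshot exchange amounts to "serialize the whole DataStore map on one node, deserialize it on the other". The sketch below shows that round trip under simplifying assumptions:

- Keys and values are strings and lists of strings instead of Datum objects.
- Standard Java serialization stands in for the Serializer bean Nacos actually uses.
- `SnapshotCodec` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical codec for a DataStore-like map; Java serialization is only a stand-in.
final class SnapshotCodec {
    private SnapshotCodec() {
    }

    // Serving side: turn the whole map into bytes (compare DistroDataStorageImpl.getDatumSnapshot()).
    static byte[] serialize(Map<String, List<String>> dataMap) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            // Copy into a plain HashMap so the caller's Map implementation does not matter.
            out.writeObject(new HashMap<>(dataMap));
        }
        return bytes.toByteArray();
    }

    // Receiving side: an empty byte array means "no data", as processData() skips length 0.
    @SuppressWarnings("unchecked")
    static Map<String, List<String>> deserialize(byte[] data) throws IOException, ClassNotFoundException {
        if (data.length == 0) {
            return new HashMap<>();
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Map<String, List<String>>) in.readObject();
        }
    }
}
```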

@RestController
@RequestMapping(UtilsAndCommons.NACOS_NAMING_CONTEXT + "/distro")
public class DistroController {
    @Autowired
    private DistroProtocol distroProtocol;
    ...
    //Get all datums.
    @GetMapping("/datums")
    public ResponseEntity getAllDatums() {
        DistroData distroData = distroProtocol.onSnapshot(KeyBuilder.INSTANCE_LIST_KEY_PREFIX);
        return ResponseEntity.ok(distroData.getContent());
    }
    ...
}

@Component
public class DistroProtocol {
    ...
    //Query all datum snapshot.
    public DistroData onSnapshot(String type) {
        DistroDataStorage distroDataStorage = distroComponentHolder.findDataStorage(type);
        if (null == distroDataStorage) {
            Loggers.DISTRO.warn("[DISTRO] Can't find data storage for received key {}", type);
            return new DistroData(new DistroKey("snapshot", type), new byte[0]);
        }
        //Call DistroDataStorageImpl.getDatumSnapshot()
        return distroDataStorage.getDatumSnapshot();
    }
    ...
}

public class DistroDataStorageImpl implements DistroDataStorage {    
    private final DataStore dataStore;
    ...
    @Override
    public DistroData getDatumSnapshot() {
        Map<String, Datum> result = dataStore.getDataMap();
        //Serialize the service instance data
        byte[] dataContent = ApplicationUtils.getBean(Serializer.class).serialize(result);
        DistroKey distroKey = new DistroKey("snapshot", KeyBuilder.INSTANCE_LIST_KEY_PREFIX);
        //Wrap it in a DistroData object and return it
        return new DistroData(distroKey, dataContent);
    }
    ...
}

//Store of data. Holds all registered service instance data
@Component
public class DataStore {
    private Map<String, Datum> dataMap = new ConcurrentHashMap<>(1024);
    
    public void put(String key, Datum value) {
        dataMap.put(key, value);
    }
    
    public Datum remove(String key) {
        return dataMap.remove(key);
    }
    
    public Set<String> keys() {
        return dataMap.keySet();
    }
       
    public Datum get(String key) {
        return dataMap.get(key);
    }
    
    public boolean contains(String key) {
        return dataMap.containsKey(key);
    }
    
    public Map<String, Datum> batchGet(List<String> keys) {
        Map<String, Datum> map = new HashMap<>(128);
        for (String key : keys) {
            Datum datum = dataMap.get(key);
            if (datum == null) {
                continue;
            }
            map.put(key, datum);
        }
        return map;
    }
    ...
    public Map<String, Datum> getDataMap() {
        return dataMap;
    }
}

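Note: DataStore's data is ultimately kept in memory, because dataMap is a ConcurrentHashMap. Using DataStore provides the following functions and benefits: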

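I. A complete copy of the data

Despite its name, the DataStore shown above does not persist anything to disk. dataMap is an in-memory ConcurrentHashMap, so its contents do not survive a restart; a restarted node reloads them from its peers through the load task described earlier. What DataStore does provide is a complete copy of the node's instance data, kept separately from the registry. Each Datum's key identifies a service and its value is that service's instance list, and each Instance carries its ClusterName, IP and Port. The complete in-memory registry can therefore be rebuilt from DataStore.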

// The in-memory registry: Map(namespace, Map(group::serviceName, Service))
Map<String, Map<String, Service>> serviceMap;
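The claim that the registry can be rebuilt from DataStore can be illustrated with a short sketch. It regroups flat per-service instance lists into namespace, service and cluster levels. `ServiceKey`, `Instance` and `RegistryRebuilder` are hypothetical, simplified types. In Nacos the DataStore key is a single string and each value is a Datum<Instances>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical key: namespace plus service name (Nacos itself uses one String key per Datum).
final class ServiceKey {
    final String namespace;
    final String serviceName; // e.g. "group::serviceName", following the serviceMap comment

    ServiceKey(String namespace, String serviceName) {
        this.namespace = namespace;
        this.serviceName = serviceName;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ServiceKey)) {
            return false;
        }
        ServiceKey other = (ServiceKey) o;
        return namespace.equals(other.namespace) && serviceName.equals(other.serviceName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespace, serviceName);
    }
}

// Simplified stand-in for an instance: just the fields the text says are needed.
final class Instance {
    final String clusterName;
    final String ip;
    final int port;

    Instance(String clusterName, String ip, int port) {
        this.clusterName = clusterName;
        this.ip = ip;
        this.port = port;
    }
}

final class RegistryRebuilder {
    private RegistryRebuilder() {
    }

    // Regroup flat DataStore-style entries into namespace -> service -> cluster -> ["ip:port", ...].
    static Map<String, Map<String, Map<String, List<String>>>> rebuild(Map<ServiceKey, List<Instance>> dataStore) {
        Map<String, Map<String, Map<String, List<String>>>> registry = new TreeMap<>();
        for (Map.Entry<ServiceKey, List<Instance>> entry : dataStore.entrySet()) {
            Map<String, List<String>> clusters = registry
                    .computeIfAbsent(entry.getKey().namespace, ns -> new TreeMap<>())
                    .computeIfAbsent(entry.getKey().serviceName, svc -> new TreeMap<>());
            for (Instance instance : entry.getValue()) {
                clusters.computeIfAbsent(instance.clusterName, c -> new ArrayList<>())
                        .add(instance.ip + ":" + instance.port);
            }
        }
        return registry;
    }
}
```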

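II. Data synchronization

DataStore is the source this node serves when other cluster nodes request its data, as the snapshot interface above shows, and processData() writes synchronized data back into it. Because dataMap is a ConcurrentHashMap, individual reads and writes from concurrent registrations and updates on the same node are thread-safe. Consistency across nodes comes from the Distro protocol, not from DataStore itself.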

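III. Data management

DataStore offers simple add, lookup and delete operations on the data: put, get, contains, keys, batchGet and remove. Because it is backed by a hash map, lookups by key stay fast even with a large amount of instance data.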

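IV. Data access control

The DataStore code shown here performs no permission checks of its own. Any component holding the Bean can call its methods, and getDataMap() even returns the underlying map. Controlling who may read or modify the data would therefore have to happen outside DataStore.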

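In short, DataStore is each node's central in-memory store of instance data. It keeps a complete copy from which the registry can be rebuilt, it is the data this node serves to other cluster nodes, and it supports thread-safe concurrent reads and writes.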

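(3) Summary

When a new node is added to a Nacos cluster, it synchronizes service data as follows:

First, the constructor of the DistroProtocol Bean starts an asynchronous task that requests all data from other cluster nodes over HTTP.

Once the new node has all the data, it calls Service.onChange(), which updates the local in-memory registry using copy-on-write, and it stores the data in its own DataStore.

When a Nacos cluster node handles a request for all service instance data, it reads from DataStore rather than from the in-memory registry.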
