nacos热更新引起tcp激增导致服务不可用

简介

本次遇到线上nacos热更新问题导致服务不可用,记录在此,也帮后续者排查原因。

这是一个偶发现象,非必现,但是也需要注意.

现象

环境如下:

  1. centos8
  2. nacos 2.2.1
  3. spring-cloud-alibaba-dependencies:2022.0.0.0

特殊点:此现象必须使用阿里云的nacos服务才会出现。在我们开发测试环境中自行搭建的nacos多次尝试并未复现,所以导致应用上线后多次迭代未出现问题直至使用热更新。

在更新nacos配置热更新时,一段时间后服务不可用。

服务所在节点的ecs服务器监控可以看到,在发布热更新的时间后,tcp同时连接数量激增,导致资源占用。多节点的tcp的同时连接数增大导致nginx连接数到max,间接导致域名下应用不可用。

应用日志一直打印

notify-currentSkip **** listener is not finish yet,will try next time.

直接原因

1、 应用tcp连接数量激增

2、nginx同时连接数设置过小

3、没有设置报警及后续处理

根本原因

通过频繁的热更新,我们复现了问题,通过dump分析我们拿到了线程堆栈。

bash 复制代码
Thread 2117539: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=836 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, int) @bci=67, line=870 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) @bci=17, line=1199 (Compiled frame)
 - java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock() @bci=5, line=943 (Compiled frame)
 - org.springframework.cloud.context.scope.GenericScope.destroy() @bci=69, line=140 (Interpreted frame)
 - org.springframework.cloud.context.scope.refresh.RefreshScope.refreshAll() @bci=1, line=154 (Interpreted frame)
 - org.springframework.cloud.context.refresh.ContextRefresher.refresh() @bci=9, line=86 (Interpreted frame)
 - org.springframework.cloud.endpoint.event.RefreshEventListener.handle(org.springframework.cloud.endpoint.event.RefreshEvent) @bci=44, line=72 (Interpreted frame)
 - org.springframework.cloud.endpoint.event.RefreshEventListener.onApplicationEvent(org.springframework.context.ApplicationEvent) @bci=30, line=61 (Interpreted frame)
 - org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(org.springframework.context.ApplicationListener, org.springframework.context.ApplicationEvent) @bci=2, line=172 (Compiled frame)
 - org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(org.springframework.context.ApplicationListener, org.springframework.context.ApplicationEvent) @bci=34, line=165 (Compiled frame)
 - org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(org.springframework.context.ApplicationEvent, org.springframework.core.ResolvableType) @bci=83, line=139 (Compiled frame)
 - org.springframework.context.support.AbstractApplicationContext.publishEvent(java.lang.Object, org.springframework.core.ResolvableType) @bci=70, line=404 (Compiled frame)
 - org.springframework.context.support.AbstractApplicationContext.publishEvent(org.springframework.context.ApplicationEvent) @bci=3, line=361 (Compiled frame)
 - com.alibaba.cloud.nacos.refresh.NacosContextRefresher$1.innerReceive(java.lang.String, java.lang.String, java.lang.String) @bci=34, line=132 (Interpreted frame)
 - com.alibaba.nacos.api.config.listener.AbstractSharedListener.receiveConfigInfo(java.lang.String) @bci=10, line=40 (Interpreted frame)
 - com.alibaba.nacos.client.config.impl.CacheData.lambda$safeNotifyListener$1(com.alibaba.nacos.api.config.listener.Listener, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, com.alibaba.nacos.client.config.impl.CacheData$ManagerListenerWrap, java.lang.String) @bci=146, line=317 (Interpreted frame)
 - com.alibaba.nacos.client.config.impl.CacheData$$Lambda$1538.run() @bci=36 (Interpreted frame)
 - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
 - java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
 - java.lang.Thread.run() @bci=11, line=750 (Compiled frame)

通过此和服务器打印出的WARN日志的代码,就能找到对应的源码

java 复制代码
    private void safeNotifyListener(final String dataId, final String group, final String content, final String type,
            final String md5, final String encryptedDataKey, final ManagerListenerWrap listenerWrap) {
        final Listener listener = listenerWrap.listener;
        if (listenerWrap.inNotifying) {
            LOGGER.warn(
                    "[{}] [notify-currentSkip] dataId={}, group={}, md5={}, listener={}, listener is not finish yet,will try next time.",
                    name, dataId, group, md5, listener);
            return;
        }
        Runnable job = () -> {
            long start = System.currentTimeMillis();
            ClassLoader myClassLoader = Thread.currentThread().getContextClassLoader();
            ClassLoader appClassLoader = listener.getClass().getClassLoader();
            try {
                if (listener instanceof AbstractSharedListener) {
                    AbstractSharedListener adapter = (AbstractSharedListener) listener;
                    adapter.fillContext(dataId, group);
                    LOGGER.info("[{}] [notify-context] dataId={}, group={}, md5={}", name, dataId, group, md5);
                }
                // Before executing the callback, set the thread classloader to the classloader of
                // the specific webapp to avoid exceptions or misuses when calling the spi interface in
                // the callback method (this problem occurs only in multi-application deployment).
                Thread.currentThread().setContextClassLoader(appClassLoader);
                
                ConfigResponse cr = new ConfigResponse();
                cr.setDataId(dataId);
                cr.setGroup(group);
                cr.setContent(content);
                cr.setEncryptedDataKey(encryptedDataKey);
                configFilterChainManager.doFilter(null, cr);
                String contentTmp = cr.getContent();
                listenerWrap.inNotifying = true;
                listener.receiveConfigInfo(contentTmp);
                // compare lastContent and content
                if (listener instanceof AbstractConfigChangeListener) {
                    Map<String, ConfigChangeItem> data = ConfigChangeHandler.getInstance()
                            .parseChangeData(listenerWrap.lastContent, contentTmp, type);
                    ConfigChangeEvent event = new ConfigChangeEvent(data);
                    ((AbstractConfigChangeListener) listener).receiveConfigChange(event);
                    listenerWrap.lastContent = contentTmp;
                }
                
                listenerWrap.lastCallMd5 = md5;
                LOGGER.info("[{}] [notify-ok] dataId={}, group={}, md5={}, listener={} ,cost={} millis.", name, dataId,
                        group, md5, listener, (System.currentTimeMillis() - start));
            } catch (NacosException ex) {
                LOGGER.error("[{}] [notify-error] dataId={}, group={}, md5={}, listener={} errCode={} errMsg={}", name,
                        dataId, group, md5, listener, ex.getErrCode(), ex.getErrMsg());
            } catch (Throwable t) {
                LOGGER.error("[{}] [notify-error] dataId={}, group={}, md5={}, listener={}", name, dataId, group, md5,
                        listener, t);
            } finally {
                listenerWrap.inNotifying = false;
                Thread.currentThread().setContextClassLoader(myClassLoader);
            }
        };
        
        final long startNotify = System.currentTimeMillis();
        try {
            if (null != listener.getExecutor()) {
                listener.getExecutor().execute(job);
            } else {
                try {
                    INTERNAL_NOTIFIER.submit(job);
                } catch (RejectedExecutionException rejectedExecutionException) {
                    LOGGER.warn(
                            "[{}] [notify-blocked] dataId={}, group={}, md5={}, listener={}, no available internal notifier,will sync notifier ",
                            name, dataId, group, md5, listener);
                    job.run();
                } catch (Throwable throwable) {
                    LOGGER.error(
                            "[{}] [notify-blocked] dataId={}, group={}, md5={}, listener={}, submit internal async task fail,throwable= ",
                            name, dataId, group, md5, listener, throwable);
                    job.run();
                }
            }
        } catch (Throwable t) {
            LOGGER.error("[{}] [notify-error] dataId={}, group={}, md5={}, listener={} throwable={}", name, dataId,
                    group, md5, listener, t.getCause());
        }
        final long finishNotify = System.currentTimeMillis();
        LOGGER.info("[{}] [notify-listener] time cost={}ms in ClientWorker, dataId={}, group={}, md5={}, listener={} ",
                name, (finishNotify - startNotify), dataId, group, md5, listener);
    }
    

我们根据现象只打印了一条WARN日志可以推论出在Runnable job中listenerWrap.inNotifying=true后的几行代码中

java 复制代码
		listener.receiveConfigInfo(contentTmp);
		// compare lastContent and content
		if (listener instanceof AbstractConfigChangeListener) {
		   Map<String, ConfigChangeItem> data = ConfigChangeHandler.getInstance()
		           .parseChangeData(listenerWrap.lastContent, contentTmp, type);
		   ConfigChangeEvent event = new ConfigChangeEvent(data);
		   ((AbstractConfigChangeListener) listener).receiveConfigChange(event);
		   listenerWrap.lastContent = contentTmp;
		}
		
		listenerWrap.lastCallMd5 = md5;

会发生阻塞,导致所有线程LOGGER.warn(略)一直被执行,tcp连接数在不断线程阻塞中开始猛增。

那我们接着一层层的去找

问题出现在com.alibaba.cloud.nacos.refresh.NacosContextRefresher红框范围内。由于时间原因,没有专门的去debug去一条条的看,只是知道了问题出现的点,在nacos库的issue中也可以看到listener被指出,所以相关问题的点就是在此处了。

ps:由于时间问题,就将问题记录于此步,后续有时间会进一步找到其阻塞原因

解决方案

1、重写listener方案

2、不使用热更新。没错,就是不使用热更新,重新构建一次包运行,也有同样的方法去保证服务不间断,为什么一定要使用热更新呢,热更新一直会带来很多问题,最终的方案我觉得都可以转化为冷更新,没有必要为了技术便捷而是用,保证服务正常上线提供服务才是第一选项。

相关推荐
斯内普吖几秒前
(开源)高校素拓分管理系统小程序实战指南 基于 Java + SpringBoot + uni-app + Vue + MySQL
java·spring boot·mysql·小程序·uni-app·开源
兰令水11 分钟前
leecodecode【面试150】【2026.6.15打卡-java版本】
java·算法·面试
是苏浙15 分钟前
Java实现链表2
java·开发语言·数据结构
Orchestrator_me17 分钟前
Centos7安装maven 3.9.11
java·maven
AI人工智能_电脑小能手19 分钟前
【大白话说Java面试题 第106题】【并发篇】第6题:synchronized 锁的锁对象可以是什么?
java·后端·面试
IP老炮不瞎唠19 分钟前
采集运行不稳定?分清住宅IP与数据中心代理的差异
网络·网络协议·tcp/ip
森G26 分钟前
65、UDP协议(拓展选学)---------网络编程
网络·c++·qt·网络协议·tcp/ip·udp
MacroZheng29 分钟前
横空出世!Claude Code画图神器来了,比Visio快10倍!
java·人工智能·后端
半夜修仙29 分钟前
延迟队列的介绍及常见问题
java·数据库·中间件·rabbitmq