Notes on an outage caused by a full Java heap

1 Symptoms and diagnosis

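One of our team's services had been running fine when it suddenly started misbehaving: requests first timed out, and then the service became completely unreachable. After the problem was reported, the first step was to check the logs. They contained no errors from the application logic, but request handling times had grown extremely long, and calls to external interfaces were all timing out.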

Handling times for the same endpoint compare as follows:

2024-01-25 00:03:44.451 [http-nio-8843-exec-7] INFO  c.b.w.o.api.v1.ApiControllerV1 - 2410710a-17ec-49ce-a74e-23e410189b08 clientRealIp: 10.181.202.12, finish getStatus, duration time: 634 ms
2024-01-25 02:29:38.707 [http-nio-8843-exec-105] INFO  c.b.w.o.a.a.BasicApiExceptionHandler - 3f3b24ef-b1c0-4558-ad15-ec88b30aa0bd clientRealIp: 10.181.200.33, finish getStatus, duration time: 1217935 ms

The Eureka client periodically logged the following:

2024-01-25 01:00:59.823 [DiscoveryClient-1] WARN  c.n.discovery.TimedSupervisorTask -  task supervisor timed out
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at com.netflix.discovery.TimedSupervisorTask.run(TimedSupervisorTask.java:66)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Database connections were being dropped abnormally:

2024-01-25 01:12:21.965 [http-nio-8843-exec-13] WARN  c.zaxxer.hikari.pool.ProxyConnection - d0487d3e-89cf-4385-ae59-6d26ff1e277a HikariPool-1 - Connection com.mysql.jdbc.JDBC4Connection@2b165bf4 marked as broken because of SQLSTATE(08007), ErrorCode(0)
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during commit(). Transaction resolution unknown.
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
	at com.mysql.jdbc.Util.getInstance(Util.java:360)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
	at com.mysql.jdbc.ConnectionImpl.commit(ConnectionImpl.java:1611)
	at com.zaxxer.hikari.pool.ProxyConnection.commit(ProxyConnection.java:361)
	at com.zaxxer.hikari.pool.HikariProxyConnection.commit(HikariProxyConnection.java)

Calls to external interfaces failed:

2024-01-25 01:12:22.148 [http-nio-8843-exec-3] ERROR c.b.w.o.a.a.BasicApiExceptionHandler - 7c4a2b46-dc8a-4121-85ff-cf25ec16a17d org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1055)
	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
	at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.boot.actuate.web.trace.servlet.HttpTraceFilter.doFilterInternal(HttpTraceFilter.java:88)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at com.baidu.wuyou.operation.api.aspect.CachingRequestBodyFilter.doFilterInternal(CachingRequestBodyFilter.java:25)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:94)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.filterAndRecordMetrics(WebMvcMetricsFilter.java:114)
	at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:104)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:367)
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:860)
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1598)
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at com.fasterxml.jackson.core.util.BufferRecycler.calloc(BufferRecycler.java:191)
	at com.fasterxml.jackson.core.util.BufferRecycler.allocCharBuffer(BufferRecycler.java:161)
	at com.fasterxml.jackson.core.util.BufferRecycler.allocCharBuffer(BufferRecycler.java:151)
	at com.fasterxml.jackson.core.io.IOContext.allocConcatBuffer(IOContext.java:234)
	at com.fasterxml.jackson.core.json.UTF8JsonGenerator.<init>(UTF8JsonGenerator.java:139)
	at com.fasterxml.jackson.core.JsonFactory._createUTF8Generator(JsonFactory.java:1817)
	at com.fasterxml.jackson.core.JsonFactory.createGenerator(JsonFactory.java:1281)

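This last log entry very "helpfully" states the cause: java.lang.OutOfMemoryError: Java heap space. Checking the application's heap monitoring right away showed that the heap had filled up and the service had gone down.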

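The next step was to find out exactly what had driven memory usage up. Fortunately, the project was configured with the options below, so a heap dump is written when the JVM throws an OutOfMemoryError.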

JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=${LOG_DIR}/${PROJ_NAME}-`date +%s`-pid$$.hprof"
JVM_OPTS="$JVM_OPTS -Xloggc:${LOG_DIR}/${PROJ_NAME}-gc.log"
JVM_OPTS="$JVM_OPTS -XX:ErrorFile=${LOG_DIR}/${PROJ_NAME}-`date +%s`-pid%p.log"
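
Since the whole diagnosis depended on that heap dump existing, it can be worth confirming that a running JVM actually picked up these flags. The following is a minimal sketch, assuming a HotSpot JVM; the class name HeapDumpFlagCheck is made up for illustration and is not part of the original project.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reads HotSpot VM options from inside the running process.
class HeapDumpFlagCheck {

    /** Returns the current value of a HotSpot VM option as a string. */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // "true" when -XX:+HeapDumpOnOutOfMemoryError is in effect.
        System.out.println("HeapDumpOnOutOfMemoryError = " + vmOption("HeapDumpOnOutOfMemoryError"));
        // Empty when -XX:HeapDumpPath was not set.
        System.out.println("HeapDumpPath = " + vmOption("HeapDumpPath"));
    }
}
```

Running it with the same JVM_OPTS as the service should print true for the first option and the expanded dump path for the second.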

We immediately opened the heap dump in MAT (Memory Analyzer Tool) and finally found the culprit.

2 Root cause analysis

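The code that filled the heap has one core job: it fetches data from multiple data sources and serves it to Grafana for dashboard display (via the simplejson plugin). Grafana dashboards also refresh automatically at a fixed interval. When a single panel shows many curves over a wide time range, and the dashboard contains many panels, the data processing logic cannot keep up: requests from the next refresh cycle arrive before the previous cycle has finished. The Java heap was also not configured large enough. As a result, heap usage kept accumulating, the JVM spent its time in STW (Stop The World) garbage collections, and the service eventually went down.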
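
The pile-up described above can be put into rough numbers with Little's law (mean requests in flight = arrival rate × mean handling time). The sketch below is only an illustration: the two handling times come from the log lines in section 1, while the panel count (10) and refresh interval (30 s) are assumed values that the article does not state.

```java
// Hypothetical back-of-the-envelope estimate; not code from the original service.
class InFlightEstimate {

    /** Little's law: mean in-flight requests = arrival rate x mean handling time. */
    static double inFlight(int panels, double refreshSeconds, double handlingMillis) {
        double arrivalsPerSecond = panels / refreshSeconds; // one request per panel per refresh
        return arrivalsPerSecond * (handlingMillis / 1000.0);
    }

    public static void main(String[] args) {
        // 634 ms and 1217935 ms are the durations seen in the logs;
        // 10 panels and a 30 s refresh interval are assumptions.
        System.out.println("healthy:  " + inFlight(10, 30, 634));
        System.out.println("degraded: " + inFlight(10, 30, 1_217_935));
    }
}
```

Under these assumptions the healthy case keeps well under one request in flight on average, while the degraded case would need several hundred. In practice the servlet thread pool caps concurrency, but every request that is in flight holds its working data on the heap until it completes.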

3 Fixes and optimizations

The module's logic looks fairly simple, but this incident still exposed a number of design problems. We applied the following optimizations:

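  1. First, of course, increase the heap size; every bit of extra headroom helps.

  2. While reviewing the processing logic, we found there was still room to improve performance. Faster processing shortens the time data is held, so memory is released sooner.

  3. Requests to external data interfaces and waits for concurrent tasks already had timeouts to avoid hanging. They had been set fairly long so that data could be retrieved whenever possible (the external interfaces are sometimes genuinely slow to respond). We shortened them to remove that risk.

  4. Limit the maximum number of curves in a single panel, and limit the maximum display time window.

  5. Fetch data incrementally.

  6. Split this feature out into its own microservice instead of keeping it in the same service as other logic (as an independent microservice it does not lengthen the call chain either).

  7. Optimize the original data structure.

    The original data structure was: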

    public class Avail {
        private Long total;
        private Integer loss;
        private Double latency;
        private Double rate;
    }

    It was changed to:

    public class Avail {
        private long total;
        private int loss;
        private double latency;
        private double rate;
    }

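    With primitive fields the values are stored inline in the Avail object rather than as references to separately allocated wrapper objects, each of which carries its own object header. The original estimate was a saving of 16 bytes per data point without compressed object pointers, or 8 bytes with them, although those byte counts look low next to the stated overall reduction of roughly 50% to 67% in memory per data point.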
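
The 50% to 67% range can be cross-checked with some layout arithmetic. The sketch below assumes a typical 64-bit HotSpot layout (12-byte object header and 4-byte references with compressed pointers, 16-byte header and 8-byte references without, objects aligned to 8 bytes); these numbers are assumptions about the JVM, not measurements from the original service, and a tool such as JOL would be needed for exact figures. The class name AvailFootprint is made up for illustration.

```java
// Hypothetical estimate of per-data-point memory under an assumed HotSpot layout.
class AvailFootprint {

    /** Rounds n up to the next multiple of unit. */
    static long alignUp(long n, long unit) {
        return (n + unit - 1) / unit * unit;
    }

    /** Size of one wrapper object (Long, Integer or Double) holding a value of valueBytes. */
    static long wrapper(long header, long valueBytes) {
        return alignUp(alignUp(header, valueBytes) + valueBytes, 8);
    }

    /**
     * Boxed Avail: four references plus the wrapper objects they point to.
     * If cachedIntegers is true, the Long and Integer are assumed to come from
     * the small-value caches (-128..127) and cost no extra allocation.
     */
    static long boxedAvail(long header, long ref, boolean cachedIntegers) {
        long shell = alignUp(header + 4 * ref, 8);
        long doubles = 2 * wrapper(header, 8);
        long integers = cachedIntegers ? 0 : wrapper(header, 8) + wrapper(header, 4);
        return shell + doubles + integers;
    }

    /** Primitive Avail: long + int + double + double stored inline. */
    static long primitiveAvail(long header) {
        return alignUp(header + 8 + 4 + 8 + 8, 8);
    }

    public static void main(String[] args) {
        long[][] layouts = {{12, 4}, {16, 8}}; // {header, reference}: compressed, uncompressed
        for (long[] l : layouts) {
            System.out.println("header=" + l[0] + " ref=" + l[1]
                    + " boxed=" + boxedAvail(l[0], l[1], false)
                    + " boxedCached=" + boxedAvail(l[0], l[1], true)
                    + " primitive=" + primitiveAvail(l[0]));
        }
    }
}
```

Under these assumptions a boxed data point costs 120 bytes (compressed pointers) or 144 bytes (uncompressed) against 40 or 48 bytes for the primitive version, a reduction of about 67%. When total and loss are small enough to hit the wrapper caches, the boxed cost drops to 80 or 96 bytes and the reduction is 50%.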
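
Items 3 and 4 in the list above (shorter timeouts, and caps on curve count and time window) might look roughly like the following. This is a sketch under stated assumptions, not the project's actual code: the class PanelQueryGuard, its method names, and the limit values (20 curves, 7 days) are all hypothetical, since the article does not give the real ones.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical guard applied to each panel query before any data is fetched.
class PanelQueryGuard {
    // Assumed limits; tune to the real dashboards.
    static final int MAX_SERIES_PER_PANEL = 20;
    static final Duration MAX_WINDOW = Duration.ofDays(7);

    /** Keeps at most MAX_SERIES_PER_PANEL targets from one panel request. */
    static <T> List<T> limitSeries(List<T> targets) {
        if (targets.size() <= MAX_SERIES_PER_PANEL) {
            return targets;
        }
        return new ArrayList<>(targets.subList(0, MAX_SERIES_PER_PANEL));
    }

    /** Moves the start of the range forward when the window exceeds MAX_WINDOW. */
    static Instant clampFrom(Instant from, Instant to) {
        Instant earliest = to.minus(MAX_WINDOW);
        return from.isBefore(earliest) ? earliest : from;
    }

    /**
     * Runs all fetch tasks and waits at most timeoutMs in total.
     * Tasks that have not finished by then are cancelled and left out of the result.
     */
    static <T> List<T> fetchAll(ExecutorService pool, List<Callable<T>> tasks, long timeoutMs) {
        List<T> results = new ArrayList<>();
        try {
            for (Future<T> f : pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS)) {
                if (f.isCancelled()) {
                    continue; // did not finish within the timeout
                }
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    // this data source failed; skip it and keep the others
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return results;
    }
}
```

Bounding the inputs before fetching limits how much data a single request can pull onto the heap, and the overall timeout bounds how long that data stays referenced when a source is slow.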
