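1 Symptoms and Diagnosis
Our team has a service that had been running fine until it suddenly started misbehaving: requests first timed out, and then the service became unreachable altogether. After the problem was reported, our first step was to check the logs. They contained no errors from the application's own processing logic, but request handling times had become extremely long, and calls to external interfaces were all timing out.
Handling times for the same interface compare as follows (634 ms when healthy versus 1,217,935 ms, roughly 20 minutes, during the incident):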
```text
2024-01-25 00:03:44.451 [http-nio-8843-exec-7] INFO c.b.w.o.api.v1.ApiControllerV1 - 2410710a-17ec-49ce-a74e-23e410189b08 clientRealIp: 10.181.202.12, finish getStatus, duration time: 634 ms
2024-01-25 02:29:38.707 [http-nio-8843-exec-105] INFO c.b.w.o.a.a.BasicApiExceptionHandler - 3f3b24ef-b1c0-4558-ad15-ec88b30aa0bd clientRealIp: 10.181.200.33, finish getStatus, duration time: 1217935 ms
```
The Eureka client periodically logged the following:
```text
2024-01-25 01:00:59.823 [DiscoveryClient-1] WARN c.n.discovery.TimedSupervisorTask - task supervisor timed out
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at com.netflix.discovery.TimedSupervisorTask.run(TimedSupervisorTask.java:66)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
Database connections were dropped abnormally:
```text
2024-01-25 01:12:21.965 [http-nio-8843-exec-13] WARN c.zaxxer.hikari.pool.ProxyConnection - d0487d3e-89cf-4385-ae59-6d26ff1e277a HikariPool-1 - Connection com.mysql.jdbc.JDBC4Connection@2b165bf4 marked as broken because of SQLSTATE(08007), ErrorCode(0)
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during commit(). Transaction resolution unknown.
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
    at com.mysql.jdbc.Util.getInstance(Util.java:360)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
    at com.mysql.jdbc.ConnectionImpl.commit(ConnectionImpl.java:1611)
    at com.zaxxer.hikari.pool.ProxyConnection.commit(ProxyConnection.java:361)
    at com.zaxxer.hikari.pool.HikariProxyConnection.commit(HikariProxyConnection.java)
```
Calls to external interfaces failed:
```text
2024-01-25 01:12:22.148 [http-nio-8843-exec-3] ERROR c.b.w.o.a.a.BasicApiExceptionHandler - 7c4a2b46-dc8a-4121-85ff-cf25ec16a17d org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1055)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.boot.actuate.web.trace.servlet.HttpTraceFilter.doFilterInternal(HttpTraceFilter.java:88)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at com.baidu.wuyou.operation.api.aspect.CachingRequestBodyFilter.doFilterInternal(CachingRequestBodyFilter.java:25)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:94)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.filterAndRecordMetrics(WebMvcMetricsFilter.java:114)
    at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:104)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:367)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:860)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1598)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at com.fasterxml.jackson.core.util.BufferRecycler.calloc(BufferRecycler.java:191)
    at com.fasterxml.jackson.core.util.BufferRecycler.allocCharBuffer(BufferRecycler.java:161)
    at com.fasterxml.jackson.core.util.BufferRecycler.allocCharBuffer(BufferRecycler.java:151)
    at com.fasterxml.jackson.core.io.IOContext.allocConcatBuffer(IOContext.java:234)
    at com.fasterxml.jackson.core.json.UTF8JsonGenerator.<init>(UTF8JsonGenerator.java:139)
    at com.fasterxml.jackson.core.JsonFactory._createUTF8Generator(JsonFactory.java:1817)
    at com.fasterxml.jackson.core.JsonFactory.createGenerator(JsonFactory.java:1281)
```
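This last log entry is "kind" enough to state the cause directly: java.lang.OutOfMemoryError: Java heap space. We immediately checked the application's heap monitoring, which showed that the heap had filled up completely and the process had gone down, as shown in the monitoring chart below:
The next step was to pin down exactly what was driving memory usage up. Fortunately, the project had been configured as follows, so that the JVM writes a heap dump when it runs out of heap memory.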
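```bash
# Write a heap dump when the JVM throws OutOfMemoryError
# ($$ expands to the PID of the launching shell).
JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=${LOG_DIR}/${PROJ_NAME}-$(date +%s)-pid$$.hprof"
# GC log and fatal-error log (%p is expanded by the JVM to its own PID).
JVM_OPTS="$JVM_OPTS -Xloggc:${LOG_DIR}/${PROJ_NAME}-gc.log"
JVM_OPTS="$JVM_OPTS -XX:ErrorFile=${LOG_DIR}/${PROJ_NAME}-$(date +%s)-pid%p.log"
```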
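As a quick sanity check that flags like these actually reached the running process, the sketch below reads them back through the HotSpot diagnostic MXBean. It assumes a HotSpot-based JVM; the class name `HeapDumpFlagCheck` and the helper `vmOption` are made up for illustration and are not part of the original project.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpFlagCheck {

    /** Returns the current value of a HotSpot VM option as a string. */
    public static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // getVMOption throws IllegalArgumentException if the option does not exist.
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("HeapDumpOnOutOfMemoryError = "
                + vmOption("HeapDumpOnOutOfMemoryError"));
        System.out.println("HeapDumpPath = " + vmOption("HeapDumpPath"));
    }
}
```

Running this inside the service (for example from a diagnostic endpoint) shows whether the heap dump option is `true` in the live JVM rather than only in the startup script.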
We then opened the heap dump in MAT (Memory Analyzer) and finally found the culprit, as shown below:
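2 Root Cause Analysis
The code that exhausted the heap has one core job: it fetches data from multiple data sources and serves it to Grafana for dashboard display (via the simplejson plugin). Grafana dashboards also refresh automatically at a fixed interval. When a single panel shows many curves, the displayed time range is large, and the number of panels is large as well, the data-processing logic cannot keep up: requests from the next refresh cycle arrive before the previous cycle has finished. On top of that, the Java heap was not sized generously enough, so heap usage kept accumulating and the JVM spent all its time in STW (Stop The World) garbage collection until it finally went down.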
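The pile-up described above can be illustrated with a toy model. This is only a sketch: the class `RefreshBacklog`, its parameters, and the numbers in `main` are hypothetical and were not measured during the incident. It assumes that each refresh cycle brings a fixed number of new requests and that the service can finish at most a fixed number per cycle.

```java
public class RefreshBacklog {

    /**
     * Number of requests still unfinished after the given number of refresh cycles,
     * assuming arrivalsPerCycle new requests per cycle and at most
     * completionsPerCycle requests finished per cycle.
     */
    public static long backlogAfter(int cycles, int arrivalsPerCycle, int completionsPerCycle) {
        long backlog = 0;
        for (int i = 0; i < cycles; i++) {
            // The backlog can never drop below zero.
            backlog = Math.max(0, backlog + arrivalsPerCycle - completionsPerCycle);
        }
        return backlog;
    }

    public static void main(String[] args) {
        // Hypothetical load: 12 panel requests per refresh, 8 completed per refresh.
        for (int cycles = 0; cycles <= 30; cycles += 10) {
            System.out.println(cycles + " cycles -> backlog " + backlogAfter(cycles, 12, 8));
        }
    }
}
```

As long as arrivals exceed completions, the backlog grows linearly with the number of cycles. Every unfinished request keeps its intermediate data reachable, so a growing backlog means growing heap usage, which matches the pattern seen in the heap monitoring.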
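3 Fixes and Optimizations
This incident showed that although the module's processing logic looks simple, it still had quite a few design problems. We applied the following optimizations: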
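- First, of course, increase the heap size; any extra headroom buys some time.
- While reviewing the processing logic, we found that performance could still be improved. Faster processing shortens the time data stays in memory, so memory is released sooner.
- Calls to external data interfaces and waits for concurrent tasks already had timeouts to avoid hanging. These timeouts had originally been set quite long so that data would be retrieved whenever possible (the external interfaces can be genuinely slow to respond); we shortened them to remove this hidden risk.
- Limit the maximum number of curves in a single panel, and limit the maximum size of the displayed time window.
- Fetch data incrementally.
- Split this feature out into its own microservice instead of hosting it alongside other logic in one service (as a standalone microservice, it does not lengthen the call chain either).
- Optimize the original data structure.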
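The original data structure was:
```java
public class Avail {
    // Boxed fields: each value is a separate object on the heap.
    private Long total;
    private Integer loss;
    private Double latency;
    private Double rate;
}
```
It was changed to:
```java
public class Avail {
    // Primitive fields: values are stored inline in the Avail object.
    private long total;
    private int loss;
    private double latency;
    private double rate;
}
```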
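By our estimate, this saves 16 bytes (without compressed object pointers) or 8 bytes (with compressed object pointers) per data point, which amounts to roughly 50% to 67% of the memory used by one data point.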
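The effect of replacing boxed fields with primitives can be approximated with a back-of-the-envelope layout model. The sketch below is not a measurement. It assumes a typical 64-bit HotSpot layout (12-byte object header and 4-byte references with compressed pointers, 16 and 8 bytes without, objects aligned to 8 bytes) and assumes that every boxed value is a distinct heap object, which ignores the `Integer` cache for small values. The class `AvailFootprint` is hypothetical.

```java
public class AvailFootprint {

    /** Rounds a size up to the next multiple of 8 bytes (object alignment). */
    static long align8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    /** Estimated bytes for one Avail with Long/Integer/Double/Double fields. */
    public static long boxedBytes(boolean compressedOops) {
        int header = compressedOops ? 12 : 16;
        int ref = compressedOops ? 4 : 8;
        long holder = align8(header + 4L * ref);   // Avail itself: four references
        long boxedLong = align8(header + 8);
        long boxedInt = align8(header + 4);
        long boxedDouble = align8(header + 8);
        return holder + boxedLong + boxedInt + 2 * boxedDouble;
    }

    /** Estimated bytes for one Avail with long/int/double/double fields. */
    public static long primitiveBytes(boolean compressedOops) {
        int header = compressedOops ? 12 : 16;
        return align8(header + 8 + 4 + 8 + 8);
    }

    public static void main(String[] args) {
        for (boolean compressed : new boolean[] {true, false}) {
            long before = boxedBytes(compressed);
            long after = primitiveBytes(compressed);
            System.out.printf("compressedOops=%b: %d -> %d bytes (%.0f%% saved)%n",
                    compressed, before, after, 100.0 * (before - after) / before);
        }
    }
}
```

Under these assumptions the model gives 120 versus 40 bytes with compressed pointers and 144 versus 48 bytes without, a saving of about 67% in both cases, at the upper end of the range quoted above. Actual sizes depend on the JVM and its settings, so a layout tool such as JOL is the way to confirm them.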
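For the request-side items in the list above (shorter timeouts, a cap on curves per panel, and a cap on the time window), a minimal sketch might look like the following. The class `PanelQueryGuard`, its method names, and the limit values are all hypothetical; the article does not state the real limits or how the service implements them. The bounded wait uses the standard `ExecutorService.invokeAll` overload with a timeout, which cancels tasks that have not finished in time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PanelQueryGuard {
    // Hypothetical limits, chosen only for illustration.
    public static final int MAX_SERIES_PER_PANEL = 20;
    public static final long MAX_WINDOW_MS = TimeUnit.DAYS.toMillis(7);

    /** Keeps at most maxSeries curve names and drops the rest. */
    public static List<String> limitSeries(List<String> series, int maxSeries) {
        return new ArrayList<>(series.subList(0, Math.min(series.size(), maxSeries)));
    }

    /** Moves the window start forward so the window spans at most maxWindowMs. */
    public static long clampFrom(long fromMs, long toMs, long maxWindowMs) {
        return Math.max(fromMs, toMs - maxWindowMs);
    }

    /**
     * Runs all fetch tasks and waits at most timeoutMs in total.
     * Tasks that time out (and are cancelled) or fail are skipped.
     */
    public static <T> List<T> fetchAll(ExecutorService pool, List<Callable<T>> tasks,
                                       long timeoutMs) throws InterruptedException {
        List<T> results = new ArrayList<>();
        for (Future<T> future : pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS)) {
            if (future.isCancelled()) {
                continue; // did not finish before the deadline
            }
            try {
                results.add(future.get());
            } catch (ExecutionException e) {
                // the task itself failed; skip its result
            }
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            tasks.add(() -> "fast source");
            tasks.add(() -> {
                Thread.sleep(5_000); // simulates a slow external interface
                return "slow source";
            });
            // The slow task is cancelled once the 200 ms budget is used up.
            System.out.println(fetchAll(pool, tasks, 200));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Skipping slow sources trades completeness for bounded latency and memory: a panel may show partial data, but a slow dependency can no longer hold request data on the heap for minutes.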