Live-Network / Production / Frontline Issue Log

For information-security reasons, company-confidential details are replaced with placeholders.

Problem symptom

A microservice was upgraded during an early-morning maintenance window. After the upgrade, O&M reported that the service kept restarting, roughly once every ten-odd minutes.

Troubleshooting process

Checking the node logs

The first step, of course, is to look at the logs.

bash
[2024-07-15 20:09:49,340]-[5c603b4abb964f14]-[1g7m]-[plioservice-mateinfo-cse-tenant-thread-118]-[com.huawei.ies.plioservice.service.impl.AlarmWriteBackServiceImpl.invokeUpdateAlarmsInBatch(AlarmWriteBackServiceImpl.java:359)]-[WARN] updateAlarmsInBatch: The batch update request body = {"body":{"alarms":[{"condition":[{"field":"identifier","values":["1721074052000_3ed485f0a0a9e05053c19d2343bde3428f7b2a6ddacda60a2377a63d43ff8316"],"id":"1","operator":"="}],"expression":"1","updateFields":{"circuit_count":1,"customer_count":1}}]},"header":{"commonValues":{"currentTenant":"1g7m"},"tracker_id":"5c603b4abb964f14"}}
[2024-07-15 20:09:49,372]-[5c603b4abb964f14]-[1g7m]-[plioservice-mateinfo-cse-tenant-thread-118]-[com.huawei.ies.plioservice.service.impl.AlarmWriteBackServiceImpl.invokeUpdateAlarmsInBatch(AlarmWriteBackServiceImpl.java:369)]-[WARN] updateAlarmsInBatch: End call serviceName:[app.service.ict_alarm_interface.updateAlarmsInBatches] it costs [31]ms
[2024-07-15 20:09:53,508]-[5c603b4abb964f14]-[1g7m]-[plioservice-mateinfo-cse-tenant-thread-118]-[com.huawei.ies.plioservice.service.impl.AlarmWriteBackServiceImpl.invokeUpdateAlarmsInBatch(AlarmWriteBackServiceImpl.java:375)]-[WARN] updateAlarmsInBatch: results = ServiceMessage{id='null', comeFrom='null', header={"commonValues":{"currentUser":"","currentTenant":"1g7m"},"tracker_id":"5c603b4abb964f14"}, body={"result":{"failedNum":"0","successNum":"1"},"total":1,"flag":"true","errorMessage":"","errorCode":""}, error=null}
[2024-07-15 20:10:41,500]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1960641160]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 4121 millis SELECT 1 FROM dual
[2024-07-15 20:10:52,646]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[transport-vert.x-eventloop-thread-5,1,main] has been blocked for 3276 ms, time limit is 2000 ms
[2024-07-15 20:10:57,834]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[transport-vert.x-eventloop-thread-4,1,main] has been blocked for 4621 ms, time limit is 2000 ms
[2024-07-15 20:11:01,955]-[]-[1g7m]-[LoadBalancerStatsTimer]-[org.apache.servicecomb.registry.consumer.SimpleMicroserviceInstancePing.ping(SimpleMicroserviceInstancePing.java:50)]-[WARN] pin*****#*#*****e8b11efb6a00255ac100019 endpoint rest://172.18.0.79:28443?sslEnabled=true&urlPrefix=%2Fadc-service%2Fcse failed
[2024-07-15 20:11:01,955]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[registry-vert.x-eventloop-thread-1,1,main] has been blocked for 4120 ms, time limit is 2000 ms
[2024-07-15 20:11:06,534]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[registry-vert.x-eventloop-thread-1,1,main] has been blocked for 4576 ms, time limit is 2000 m
[2024-07-15 20:11:39,726]-[]-[]-[vertx-blocked-thread-checker]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:202)]-[WARN] Thread Thread[transport-vert.x-eventloop-thread-3,1,main] has been blocked for 4594 ms, time limit is 2000 ms
[2024-07-15 20:11:43,927]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1815678973]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 4196 millis SELECT 1 FROM dual
[2024-07-15 20:11:43,930]-[]-[]-[registry-vert.x-eventloop-thread-1]-[org.apache.servicecomb.serviceregistry.client.http.RestClientUtil.lambda$null$3(RestClientUtil.java:145)]-[ERROR] PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat fail, endpoint is 10.2.16.130:30100, message: The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100
[2024-07-15 20:11:48,634]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1815678973]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 4696 millis SELECT 1 FROM dual
[2024-07-15 20:11:52,486]-[]-[]-[registry-vert.x-eventloop-thread-1]-[org.apache.servicecomb.serviceregistry.client.http.ServiceRegistryClientImpl.retry(ServiceRegistryClientImpl.java:128)]-[WARN] invoke service [10.2.16.130:30100] failed, retry address [10.2.16.130:30100].
[2024-07-15 20:11:52,493]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1960641160]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 9138 millis SELECT 1 FROM dual
[2024-07-15 20:11:52,494]-[]-[1g7m]-[Druid-ConnectionPool-Destroy-1815678973]-[com.huawei.mateinfo.sdk.datasource.monitor.filter.MateInfoStatFilter.internalAfterStatementExecute4MateInfo(MateInfoStatFilter.java:93)]-[WARN] Slow sql 3858 millis SELECT 1 FROM dual
[2024-07-15 20:11:52,495]-[]-[]-[registry-vert.x-eventloop-thread-1]-[io.vertx.core.logging.SLF4JLogDelegate.log(SLF4JLogDelegate.java:205)]-[ERROR] The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100[N]io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100
io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 3000ms has been exceeded while executing PUT /v4/default/registry/microservices/89dedccb32b411eeaef40255ac10001b/instances/3b5fa4f542e211ef85210255ac10002b/heartbeat for server 10.2.16.130:30100
[2024-07-15 20:12:40,597]-[]-[]-[Catalina-utility-1]-[com.huawei.dsp.boot.core.config.impl.BootConfigManager.startRefresherThread(BootConfigManager.java:272)]-[WARN] boot.core.config.refresher.period is unknown, value=null
[2024-07-15 20:12:42,869]-[]-[]-[Catalina-utility-1]-[com.netflix.config.sources.URLConfigurationSource.<init>(URLConfigurationSource.java:126)]-[WARN] No URLs will be polled as dynamic configuration sources.
[2024-07-15 20:12:43,361]-[]-[]-[Catalina-utility-1]-[org.apache.commons.logging.impl.SLF4JLog.warn(SLF4JLog.java:176)]-[WARN] Multiple PropertySourcesPlaceholderConfigurer beans register*****#*#*****laceholderConfigurer, org.apache.servicecomb.core.ConfigurationSpringInitializer#0], falling back to Environment
[2024-07-15 20:12:44,078]-[]-[]-[Catalina-utility-1]-[com.huawei.mateinfo.sdk.starter.common.tracing.spi.impl.ConfigCenterBasedTracingConfig.getTracingCollectorAddress(ConfigCenterBasedTracingConfig.java:38)]-[WARN] get tracing collector address http://10.2.17.56:9411 through http
[2024-07-15 20:12:45,834]-[]-[]-[Catalina-utility-1]-[com.huawei.ies.plioservice.listener.AlarmSimulationListener.initAlarmConsumer(AlarmSimulationListener.java:59)]-[WARN] init kafka simulator alarm and affected service topic consumer
[2024-07-15 20:12:45,835]-[]-[]-[Catalina-utility-1]-[com.huawei.ies.plioservice.listener.AlarmSimulationListener.initAlarmConsumer(AlarmSimulationListener.java:65)]-[WARN] kafka simulator alarm consumer is working

Key log lines such as "init kafka alarm and affected service topic consumer" and "boot.core.config.refresher.period is unknown" tell us the service was restarting, and that the trigger was a failed heartbeat check.

The restart times can also be confirmed from tomcat-catalina.log:

bash
[2024-07-16 04:21:14] WARNING [main] org.apache.tomcat.util.net.SSLHostConfig.setProtocols The protocol [TLSv1.3] was added to the list of protocols on the SSLHostConfig named [_default_]. Check if a +/- prefix is missing.
[2024-07-16 04:21:14] WARNING [main] org.apache.tomcat.util.net.SSLHostConfig.setProtocols The protocol [TLSv1.3] was added to the list of protocols on the SSLHostConfig named [_default_]. Check if a +/- prefix is missing.
[2024-07-16 04:21:14] WARNING [main] org.apache.tomcat.util.digester.SetPropertiesRule.begin Match [Server/Service/Engine/Host] failed to set property [hostConfigClass] to [org.apache.catalina.core.startup.SortedHostConfig]
[2024-07-16 04:21:14] INFO [main] org.apache.catalina.core.AprLifecycleListener.lifecycleEvent The Apache Tomcat Native library which allows using OpenSSL was not found on the java.library.path: [/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/lib:/home/gtsgsdba/GaussDB_T_1.1.0-RUN-EULER20SP5-64bit/add-ons::/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib]
[2024-07-16 04:21:16] INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["https-jsse-nio-172.18.15.87-18443"]
[2024-07-16 04:21:16] INFO [main] org.apache.tomcat.util.net.AbstractEndpoint.logCertificate Connector [https-jsse-nio-172.18.15.87-18443], TLS virtual host [_default_], certificate type [UNDEFINED] configured from [/opt/mateinfo/conf/security/mateinfo.keystore] using alias [tomcat] and with trust store [/opt/mateinfo/conf/security/mateinfo.keystore]
[2024-07-16 04:21:17] INFO [main] org.apache.coyote.AbstractProtocol.init Initializing ProtocolHandler ["https-jsse-nio-172.18.15.87-28443"]
[2024-07-16 04:21:17] INFO [main] org.apache.tomcat.util.net.AbstractEndpoint.logCertificate Connector [https-jsse-nio-172.18.15.87-28443], TLS virtual host [_default_], certificate type [UNDEFINED] configured from [/opt/mateinfo/conf/security/mateinfo.keystore] using alias [tomcat] and with trust store [/opt/mateinfo/conf/security/mateinfo.keystore]
[2024-07-16 04:21:17] INFO [main] org.apache.catalina.startup.Catalina.load Server initialization in [3635] milliseconds
[2024-07-16 04:21:17] INFO [main] org.apache.catalina.core.StandardService.startInternal Starting service [Catalina]
[2024-07-16 04:21:17] INFO [main] org.apache.catalina.core.StandardEngine.startInternal Starting Servlet engine: [Platform app/2.8]
[2024-07-16 04:21:17] WARNING [Catalina-utility-1] org.apache.tomcat.util.digester.SetPropertiesRule.begin Match [Context/Manager] failed to set property [sessionIdLength] to [24]
[2024-07-16 04:21:34] INFO [Catalina-utility-1] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
[2024-07-16 04:21:34] INFO [Catalina-utility-1] org.apache.catalina.core.ApplicationContext.log 2 Spring WebApplicationInitializers detected on classpath
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/mateinfo/app/webapps/plioservice/WEB-INF/lib/mateinfo-sdk-base-common-23.5.13.B209.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/mateinfo/app/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [com.huawei.mateinfo.sdk.common.log.logger.Log4jLoggerFactory]
### Excluding compile: org.apache.logging.log4j.core.config.AppenderControl::getAppender
             __  __       _       _        __
            |  \/  | __ _| |_ ___(_)_ __  / _| ___
    __      | |\/| |/ _` | __/ _ \ | '_ \| |_ / _ \
 _/    \    | |  | | (_| | ||  __/ | | | |  _| (_) |
(______/__) |_|  |_|\__,_|\__\___|_|_| |_|_|  \___/

 :: Spring Boot ::                  (v2.6.7)
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
[2024-07-16 04:21:43] INFO [Catalina-utility-1] org.apache.catalina.core.ApplicationContext.log Initializing Spring embedded WebApplicationContext
### Excluding compile: static org.springframework.core.ResolvableType::forMethodParameter
[2024-07-16 04:22:02] SEVERE [Catalina-utility-1] org.apache.tomcat.util.descriptor.web.SecurityConstraint.findUncoveredHttpMethods For security constraints with URL pattern [/*] only the HTTP methods [TRACE HEAD OPTIONS] are covered. All other methods are uncovered.
[2024-07-16 04:22:02] INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["https-jsse-nio-172.18.15.87-18443"]
[2024-07-16 04:22:02] INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["https-jsse-nio-172.18.15.87-28443"]
[2024-07-16 04:22:02] INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [45079] milliseconds
[2024-07-16 04:22:10] INFO [https-jsse-nio-172.18.15.87-18443-exec-4] org.apache.catalina.core.ApplicationContext.log Initializing Spring DispatcherServlet 'dispatcherServlet'

Analyzing the cause of the restarts

Looking at the logs between two restarts, we determined that the service was pulling business-analysis results for a certain service from Kafka and writing them into a sharded database. The data flow is roughly as shown in the figure below:

Discovering a Kafka message backlog

Based on experience, we first checked how many records Kafka was delivering per poll, to see whether messages were backing up.

(The configured upper limit per poll is 500 records.)

A global search of the logs showed that after the upgrade every poll returned exactly 500 records, suggesting a large backlog had already accumulated in Kafka. Querying the Kafka management console later confirmed this.

bash
[2024-07-15 20:25:47,024]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.handleMessage(AlarmListener.java:106)]-[WARN] consume records size is 500  
[2024-07-15 20:25:47,048]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.siaAssetNameFilter(AlarmListener.java:128)]-[WARN] key=1g7m_sia_asset_ip_az,topic=topic-1g7m-sia-service-outer,value={"service_catalog":"1","old_service_status":2,"start_time":1721070029496,"asset_name":"sia_asset_ip_az","service_name":"default","service_id":"MCQLL0268","service_status":"0","specialLineMessages":[{"route_id":"MCQLL0268","ids":[],"status":0}],"alarm_nodes*****#*#*****  
[2024-07-15 20:25:47,054]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.siaAssetNameFilter(AlarmListener.java:128)]-[WARN] key=1g7m_sia_asset_ip_az,topic=topic-1g7m-sia-service-outer,value={"service_catalog":"1","old_service_status":2,"start_time":1721070029496,"asset_name":"sia_asset_ip_az","service_name":"default","service_id":"MCQLL0305","service_status":"0","specialLineMessages":[{"route_id":"MCQLL0305","ids":[],"status":0}],"alarm_nodes*****#*#*****  
[2024-07-15 20:25:47,054]-[]-[1g7m]-[AlarmListener-alarm-1]-[com.huawei.ies.plioservice.listener.AlarmListener.siaAssetNameFilter(AlarmListener.java:128)]-[WARN] key=1g7m_sia_asset_ip_az,topic=topic-1g7m-sia-service-outer,value={"service_catalog":"1","old_service_status":2,"start_time":1721070029496,"asset_name":"sia_asset_ip_az","service_name":"default","service_id":"MCSLL0230","service_status":"0","specialLineMessages":[{"route_id":"MCSLL0230","ids":[],"status":0}],"alarm_nodes*****#*#*****
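The backlog check itself is simple arithmetic: lag per partition is the log end offset minus the group's committed offset. The sketch below is an illustration only; in practice the two maps would be filled from the Kafka consumer/admin APIs (or read off `kafka-consumer-groups.sh --describe`), not hard-coded as here.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: computing consumer-group lag. In reality the end offsets and
// committed offsets come from Kafka; they are hard-coded here for illustration.
public class LagCheck {
    // Lag for one partition = log end offset - committed offset (floored at 0).
    static long partitionLag(long endOffset, long committedOffset) {
        return Math.max(0, endOffset - committedOffset);
    }

    // Total lag across all partitions of a topic.
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committed) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long c = committed.getOrDefault(e.getKey(), 0L);
            total += partitionLag(e.getValue(), c);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = new HashMap<>();
        Map<Integer, Long> committed = new HashMap<>();
        end.put(0, 120_000L); committed.put(0, 20_000L);  // 100k behind
        end.put(1, 50_000L);  committed.put(1, 50_000L);  // caught up
        System.out.println(totalLag(end, committed));     // prints 100000
    }
}
```

When every poll returns exactly the configured maximum (500 here, i.e. `max.poll.records`), the consumer is saturated and total lag like the above is almost certainly growing.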

Analyzing the heap dump

We took a heap snapshot of the running service with Arthas' heapdump command. Heap usage turned out to be low, about 700 MB against a 2 GB maximum heap, presumably because the service had only just restarted when the dump was taken.

Continuing through the logs, we found a large number of entries like the following:

bash
Asset is not supported.||AssetName = MBB_SIA_Asset_Wireless

Tracing this log line, we found that the data pulled from Kafka included records we do not support analyzing: MBB_SIA_Asset_Wireless belongs to another product. Our initial suspicion was therefore that a flood of unsupported data was to blame. Test scenarios rarely cover this situation.

We also noticed many extremely long log lines, as shown in the figure below, and the log4j configuration had no length truncation.

Tracing the code

Tracing the logging call sites, we found that the code prints the entire value pulled from Kafka. Combined with the earlier logs: if every record in a poll is of an unsupported type, the code just loops and logs without doing anything else, so we suspected that the volume of asynchronously logged data was eating the memory.

java
/**  
 * Filter records by asset package name  
 *  
 * @param records consumer records  
 * @param siaAssetNameList list of SIA asset names  
 * @return List<SiaResultModel> filtered model list  
 */  
public static List<SiaResultModel> siaAssetNameFilter(ConsumerRecords<String, String> records,  
    List<String> siaAssetNameList) {  
    List<SiaResultModel> siaResultModelList = new ArrayList<>();  
    for (ConsumerRecord<String, String> record : records) {  
        String key = record.key();  
        String topic = record.topic();  
        String value = record.value();  
        LOGGER.warn("key={},topic={},value={}", key, topic, value);  
        if (StringUtils.isEmpty(value)) {  
            LOGGER.error("kafka message is empty");  
            continue;  
        }  
        SiaResultModel siaResultModel = JSON.parseObject(value, SiaResultModel.class);  
        if (StringUtils.hasText(siaResultModel.getAssetName()) && !siaAssetNameList.contains(  
            siaResultModel.getAssetName())) {  
            LOGGER.error("Asset is not supported.||AssetName = {}", siaResultModel.getAssetName());  
            continue;  
        }  
        if (!SiaResultModel.isValid(siaResultModel)) {  
            LOGGER.error("siaResultModel is invalid.||siaResultModel = {}", siaResultModel);  
            continue;  
        }  
        siaResultModelList.add(siaResultModel);  
    }  
    return siaResultModelList;  
}
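A straightforward guard here, sketched below (illustrative only, not the fix that actually shipped, and MAX_LOGGED_CHARS is an assumed limit), is to truncate external payloads before they ever reach the logger, so a single Kafka record cannot emit a multi-megabyte log line even when the log4j pattern has no length cap:

```java
// Sketch: truncate untrusted/external payloads before logging them.
// MAX_LOGGED_CHARS is an illustrative limit, not a value from the real config.
public class LogSafety {
    private static final int MAX_LOGGED_CHARS = 512;

    static String abbreviate(String value) {
        if (value == null || value.length() <= MAX_LOGGED_CHARS) {
            return value;
        }
        // Keep a prefix plus a marker showing how much was dropped.
        return value.substring(0, MAX_LOGGED_CHARS)
                + "...(" + (value.length() - MAX_LOGGED_CHARS) + " chars truncated)";
    }

    public static void main(String[] args) {
        String huge = "x".repeat(10_000);
        System.out.println(abbreviate(huge).length()); // far below 10_000
    }
}
```

In siaAssetNameFilter the call would become LOGGER.warn("key={},topic={},value={}", key, topic, abbreviate(value)). The same cap can also be enforced at the appender level with Log4j 2's maxLen pattern converter.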

We then checked the GC logs and found many entries like the following: the GC at 03:51:19 emptied the eden space of the young generation, yet by 03:51:21 eden was 100% full again, suggesting some code was allocating objects non-stop.

bash
{Heap before GC invocations=0 (full 0):
 par new generation   total 824000K, used 775552K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K, 100% used [0x0000000089400000, 0x00000000b8960000, 0x00000000b8960000)
  from space 48448K,   0% used [0x00000000b8960000, 0x00000000b8960000, 0x00000000bb8b0000)
  to   space 48448K,   0% used [0x00000000bb8b0000, 0x00000000bb8b0000, 0x00000000be800000)
 concurrent mark-sweep generation total 561152K, used 0K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22029K, capacity 22618K, committed 22784K, reserved 1069056K
  class space    used 2512K, capacity 2709K, committed 2816K, reserved 1048576K
2024-07-16T03:51:19.489+0000: 6.418: [GC (Allocation Failure) 2024-07-16T03:51:19.489+0000: 6.418: [ParNew: 775552K->43478K(824000K), 0.0411209 secs] 775552K->43478K(1385152K), 0.0412038 secs] [Times: user=0.06 sys=0.02, real=0.04 secs] 
Heap after GC invocations=1 (full 0):
 par new generation   total 824000K, used 43478K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K,   0% used [0x0000000089400000, 0x0000000089400000, 0x00000000b8960000)
  from space 48448K,  89% used [0x00000000bb8b0000, 0x00000000be325bf0, 0x00000000be800000)
  to   space 48448K,   0% used [0x00000000b8960000, 0x00000000b8960000, 0x00000000bb8b0000)
 concurrent mark-sweep generation total 561152K, used 0K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22029K, capacity 22618K, committed 22784K, reserved 1069056K
  class space    used 2512K, capacity 2709K, committed 2816K, reserved 1048576K
}
{Heap before GC invocations=1 (full 0):
 par new generation   total 824000K, used 819030K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K, 100% used [0x0000000089400000, 0x00000000b8960000, 0x00000000b8960000)
  from space 48448K,  89% used [0x00000000bb8b0000, 0x00000000be325bf0, 0x00000000be800000)
  to   space 48448K,   0% used [0x00000000b8960000, 0x00000000b8960000, 0x00000000bb8b0000)
 concurrent mark-sweep generation total 561152K, used 0K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22405K, capacity 23046K, committed 23296K, reserved 1069056K
  class space    used 2543K, capacity 2743K, committed 2816K, reserved 1048576K
2024-07-16T03:51:21.011+0000: 7.941: [GC (Allocation Failure) 2024-07-16T03:51:21.012+0000: 7.941: [ParNew: 819030K->48447K(824000K), 0.1400724 secs] 819030K->85515K(1385152K), 0.1401505 secs] [Times: user=0.25 sys=0.05, real=0.14 secs] 
Heap after GC invocations=2 (full 0):
 par new generation   total 824000K, used 48447K [0x0000000089400000, 0x00000000be800000, 0x00000000be800000)
  eden space 775552K,   0% used [0x0000000089400000, 0x0000000089400000, 0x00000000b8960000)
  from space 48448K,  99% used [0x00000000b8960000, 0x00000000bb8afff8, 0x00000000bb8b0000)
  to   space 48448K,   0% used [0x00000000bb8b0000, 0x00000000bb8b0000, 0x00000000be800000)
 concurrent mark-sweep generation total 561152K, used 37067K [0x00000000be800000, 0x00000000e0c00000, 0x0000000100000000)
 Metaspace       used 22405K, capacity 23046K, committed 23296K, reserved 1069056K
  class space    used 2543K, capacity 2743K, committed 2816K, reserved 1048576K
}

Since the service was still restarting continuously, we kept monitoring the Java process with Arthas and took another heap dump at a moment when memory was high but the service had not yet restarted.

After clicking Calculate (retained size) on the right-hand side, the analyzer computed the largest retained object: a JSONArray occupying almost the entire heap.

Expanding it, we found its GC root.

Tracing the code led us to the getFmListAll method, which queries the database in a loop.

Using Arthas to watch the arguments of getFmListAll, we found that total was a staggering 5,271,948!

Using Arthas' stack command on getFmListAll, we found it was being called from getHistoryAlramByIdentifier, i.e. inside the query path.

Conclusion

At this point the root cause was clear: during its queries the service pulled back a huge volume of data, filling the heap, which made the heartbeat check fail and the service restart. Because the service continuously pulls messages from Kafka and queries the corresponding data, the problem re-triggers as long as unconsumed messages remain in Kafka, hence the frequent restarts.

One last question remained: why did the query return so much data?

After syncing with the developers of the neighboring service, we learned that for the data we query, records of type A use the ID as their unique key, whereas records of type B are stored in ES with a unique key of ID + occurrence time. A few days earlier the environment had gone through a large load test, leaving a huge batch of records sharing the same IDs but with different occurrence times. Our service queries type-B data by ID alone, so it pulled back that entire batch of alarms.
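The key mismatch can be illustrated with a toy in-memory store (all names here are hypothetical; the real type-B data lives in ES): when records are unique per (id, occurTime), a lookup by id alone returns every time-variant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Toy illustration of the key mismatch: type-B records are unique per
// (id, occurTime), so querying by id alone returns every time-variant.
public class KeyMismatch {
    static class AlarmB {
        final String id;
        final long occurTime;
        AlarmB(String id, long occurTime) { this.id = id; this.occurTime = occurTime; }
    }

    static List<AlarmB> findById(List<AlarmB> store, String id) {
        return store.stream().filter(a -> a.id.equals(id)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<AlarmB> store = new ArrayList<>();
        // A load test wrote many records sharing one id but different times.
        for (long t = 0; t < 5; t++) store.add(new AlarmB("ALM-1", t));
        System.out.println(findById(store, "ALM-1").size()); // prints 5, not 1
    }
}
```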

Lessons learned

  1. Guard reads of external data and validate key data volumes; never read everything in one go. In the code above, the while loop should have a maximum iteration count, or the value of total should be validated.
  2. Configure length truncation for log output; otherwise it puts heavy pressure on memory.
  3. Never dump a large variable directly into a log line!
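The first lesson can be sketched as a bounded paging loop (hypothetical names and limits; the real getFmListAll is not shown in this article): cap both the page count and the total rows pulled, and fail loudly rather than silently filling the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Sketch of lesson 1: cap both iterations and total rows when draining an
// external source. Names and limits are illustrative, not production values.
public class BoundedFetch {
    static final int PAGE_SIZE = 1000;
    static final int MAX_PAGES = 50;          // hard stop on loop iterations
    static final int MAX_TOTAL_ROWS = 20_000; // sanity cap on result size

    // pageQuery takes (offset, limit) and returns one page of rows.
    static <T> List<T> fetchAllBounded(BiFunction<Integer, Integer, List<T>> pageQuery) {
        List<T> all = new ArrayList<>();
        for (int page = 0; page < MAX_PAGES; page++) {
            List<T> batch = pageQuery.apply(page * PAGE_SIZE, PAGE_SIZE);
            all.addAll(batch);
            if (batch.size() < PAGE_SIZE) {
                return all; // source exhausted normally
            }
            if (all.size() >= MAX_TOTAL_ROWS) {
                // Fail loudly instead of silently filling the heap.
                throw new IllegalStateException(
                    "Result exceeds " + MAX_TOTAL_ROWS + " rows; refusing to load more");
            }
        }
        throw new IllegalStateException("Exceeded " + MAX_PAGES + " pages");
    }
}
```

With a cap like this, a surprise total of 5,271,948 rows would have aborted the query with a clear error instead of taking the whole service down.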