【JVM】一次JVM内存泄露分析处理

一次内存泄露分析

背景情况

编写了一个大数据基础组件的可用性监控程序，采用Bootstrap监测端口的方式，使得方法常驻（main线程常驻），通过一个调度线程ScheduledThreadPoolExecutor，定时的调动监测任务。

监测任务中，通过一个工作的线程池，执行Callable<?> 的任务，每个组件的检测一个Callable Task。要求每个检测带回检测结果，所以使用了Callable。

定时调度的监测任务

java 复制代码

public void run() {
                try {
                    boolean alertFlag = false;
                    long t1 = System.currentTimeMillis();
                    StringBuilder htmlBody = new StringBuilder();
                    for (Monitor monitor : monitorList) {
                        // 返回数据结果Map中有两个值， 一个key为componentName,值为组件的名称;一个key为组件的名称.值为此组件连续发生异常的数量
                        final Future<?> future =
                                ThreadPool.submitTask(() -> {
                                    Map<String, String> map = new HashMap<>();
                                    StringBuilder result = new StringBuilder();
                                    result.append(HTML_BR).append("【").append(monitor.getName()).append("】").append(HTML_BR);
                                    try {
                                        result.append(monitor.monitor());
                                    } catch (Exception e) {
                                        result.append(MonitorSupport.ERROR_INFO).append(HTML_BR);
                                        componentExceptionNum.put(monitor.getName(), componentExceptionNum.get(monitor.getName()) + 1);
                                        LOGGER.error("{}执行异常", monitor.getName(), e);
                                    } finally {
                                        map.put(monitor.getName(), result.toString());
                                        map.put(MONITOR_RESULT_KEY, monitor.getName());
                                    }
                                    return map;
                                });
                        list.add(future);
                    }
                    // 执行的最大时长应该是监控任务调度的周期也就是monitorInterval,
                    // 后期优化思路：在快要超时的前1分钟，将未完成的任务取消cancel(true)，快速返回。
                    do {
                        Iterator<Future<?>> it = list.iterator();
                        while (it.hasNext()) {
                            Future<?> e = it.next();
                            try {
                                @SuppressWarnings("unchecked")
                                Map<String, String> map = (HashMap<String, String>) e.get(5, TimeUnit.MINUTES);
                                if (e.isDone()) {
                                    try {
                                        if (map.get(map.get(MONITOR_RESULT_KEY)).contains(MonitorSupport.ERROR_INFO)) {
                                            htmlBody.insert(0, map.get(map.get(MONITOR_RESULT_KEY)));
                                            Integer newComponentExceptionNum = componentExceptionNum.get(map.get(MONITOR_RESULT_KEY)) + 1;
                                            if (newComponentExceptionNum >= alertThreshold.get(map.get(MONITOR_RESULT_KEY))) {
                                                alertFlag = true;
                                            } else {
                                                componentExceptionNum.put(map.get(MONITOR_RESULT_KEY), newComponentExceptionNum);
                                            }
                                            if (ComponentEnum.HBASE.getName().equals(map.get(MONITOR_RESULT_KEY))
                                                    && map.get(map.get(MONITOR_RESULT_KEY)).contains(MonitorSupport.HBASE_RESTORE_SETTINGS_ERROR)) {
                                                alertFlag = true;
                                            }
                                        } else {
                                            componentExceptionNum.put(map.get(MONITOR_RESULT_KEY), 0);
                                            htmlBody.append(map.get(map.get(MONITOR_RESULT_KEY)));
                                        }
                                        String a = map.get(map.get(MONITOR_RESULT_KEY));
                                        LOGGER.info("执行结果={}.", a);
                                    } catch (Exception ex) {
                                        LOGGER.error("执行结果获取异常", ex);
                                    }
                                    it.remove();
                                } else {
                                    LOGGER.warn("{}执行超时", map.get(MONITOR_RESULT_KEY));
                                    e.cancel(true);
                                    makeTimeOutInfo(htmlBody);
                                    alertFlag = true;
                                    it.remove();
                                }
                                e.cancel(true);
                            } catch (Exception e1) {
                                LOGGER.error("执行异常", e1);
                            }
                        }

                    } while (!list.isEmpty());

                    long t3 = System.currentTimeMillis();
                    LOGGER.info("Monitoring spent total time: [{}]ms.", (t3 - t1));
                    htmlBody.append("<div><br/></div>");
                    if (LOGGER.isDebugEnabled()) {
                        LOGGER.debug("检查结果:{}{}{}", HTML_HEAD, htmlBody, HTML_FOOD);
                    }
                    // 异常邮件告警
                    if (alertFlag) {
                        SNEmail.sendHtmlEmail(HTML_HEAD + htmlBody + HTML_FOOD);
                        LOGGER.warn("resetComponentExceptionNum");
                        resetComponentExceptionNum(mcs, componentExceptionNum);
                    }
                } catch (Exception e) {
                    LOGGER.warn("未处理异常异常。", e);
                }
            }

问题现象

第一次现象：执行的时候，主线程没有报错，监测任务无法调度起来了。对监测任务增加了外部的try catch，看看是不是异常被吞掉了。

第二次现象：等待了近一个月（缓慢泄露，最为致命），出现了，大量OOM（Java heap space）的报错。堆内存溢出了。

内存问题分析

使用MAT（ Eclipse Memory Analysis Tools ）工具进行分析。下载安装参考：使用MAT进行内存问题定位

查询java默认采用的垃圾回收器的命令：

shell 复制代码

java -XX:+PrintCommandLineFlags -version

查询某个java进程采用的垃圾回收器的命令：

12910是pid

shell 复制代码

jcmd 12910 VM.flags

结果

复制代码

12910:
-XX:CICompilerCount=12 -XX:ConcGCThreads=3 -XX:G1HeapRegionSize=2097152 -XX:GCLogFileSize=33554432 -XX:InitialHeapSize=4294967296 -XX:MarkStackSize=4194304 -XX:MaxHeapSize=4294967296 -XX:MaxNewSize=2575302656 -XX:MinHeapDeltaBytes=2097152 -XX:NumberOfGCLogFiles=5 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation

-XX:+UseG1GC参数来启用G1GC。

jmap生存内存快照

shell 复制代码

./jmap -dump:format=b,file=/opt/logs/java_pid27258.hprof 27258

利用MAT生成报告

相关命令：./ParseHeapDump.sh java_pid27258.hprof org.eclipse.mat.api:suspects org.eclipse.mat.api:overview org.eclipse.mat.api:top_components，执行命令的用户需要配置jdk（用的JDK8）

报告分析

产生三个文件java_pid27258_Leak_Suspects.zip，java_pid27258_System_Overview.zip ，java_pid27258_Top_Components.zip

查看占用内存对象中高占比的对象Top Consumers

查询溢出原因猜测Leak Suspects

猜想

看这个对象，应该是ES客户端的Sniff线程。使用RestHighLevelClient使用为了实现节点嗅探增加的Sniff。

处理

尝试了highLevelClient.close()和highLevelClient.getLowLevelClient().close()，发现这个Sniff线程竟然关不掉。关不掉，又在不停创建新的客户端，那肯定会出现内存溢出了。由于监控ES是不停的创建新ES客户端，并非以保证长连接的要求去实现的。所以去掉了了Sniff功能。

以下是开启Sniff的写法。

java 复制代码

    private RestHighLevelClient initRestHighLevelClientByIp(String ip, Integer port) {
        HttpHost[] httpHosts = buildHttpHosts(ip, port);
        // 开启嗅探
        SniffOnFailureListener sniffOnFailureListener = new SniffOnFailureListener();
        RestHighLevelClient highLevelClient = new RestHighLevelClient(RestClient.builder(httpHosts)
                .setFailureListener(sniffOnFailureListener));
        // 嗅探器可以通过setSniffIntervalMills(以毫秒为单位）更新一次节点列表，
        Sniffer sniffer = Sniffer.builder(highLevelClient.getLowLevelClient())
                .setSniffAfterFailureDelayMillis(5 * 60000).build();
        sniffOnFailureListener.setSniffer(sniffer);
        return highLevelClient;
    }

重新启动程序。观察内存变化（多次jmap，看生成的hprof文件的大小变化）。

再次使用MAT生成内存报告，查询泄露情况

发现有java.lang.ref.Finalizer的占比挺高的、但是JVM垃圾回收处于正常运行了。

TODO 什么是java.lang.ref.Finalizer，下次分析学习。