Troubleshooting an OOM Caused by Improper Use of OSSClient

First published on the WeChat official account 《赵侠客》

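Preface

A fairly peripheral project of ours recently hit an OOM in production. Fortunately it only runs offline processing tasks, so the OOM had no impact on live business. This post records how the problem was tracked down.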

Examining the dump and GC logs

The project's main JVM options are:

-Xmx5120m -XX:+PreserveFramePointer -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/usr/local/update/heap_trace.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/update/dump.log

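The maximum heap is 5 GB, GC logging is enabled, and a heap dump is written on OOM. I started with the heap dump produced by the OOM: dump.log turned out to be 5 GB, the same size as the maximum heap, so my first guess was a memory leak.

Next I looked at the GC log in heap_trace.log. The last few GCs each took about 0.01 to 0.02 seconds, and the heap retained after each collection never came down: in the excerpt below it stays between roughly 333 MB and 342 MB and creeps upward. That again pointed to a memory leak.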

2023-09-18T09:58:28.259+0800: 234057.213: [GC (Allocation Failure) [PSYoungGen: 438400K->7648K(441344K)] 763838K->333358K(961024K), 0.0140907 secs] [Times: user=0.04 sys=0.00, real=0.02 secs]  
2023-09-18T10:01:33.925+0800: 234242.879: [GC (Allocation Failure) [PSYoungGen: 436704K->7344K(441856K)] 762414K->333326K(961536K), 0.0134861 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]  
2023-09-18T10:04:16.426+0800: 234405.380: [GC (Allocation Failure) [PSYoungGen: 437424K->8832K(441856K)] 763406K->335022K(961536K), 0.0147276 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]  
2023-09-18T10:06:30.923+0800: 234539.877: [GC (Allocation Failure) [PSYoungGen: 438912K->11520K(442368K)] 765102K->338158K(962048K), 0.0202829 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]  
2023-09-18T10:08:27.655+0800: 234656.609: [GC (Allocation Failure) [PSYoungGen: 442112K->12272K(442880K)] 768750K->340510K(962560K), 0.0216111 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]  
2023-09-18T10:11:37.773+0800: 234846.727: [GC (Allocation Failure) [PSYoungGen: 442864K->12000K(445440K)] 771102K->340918K(965120K), 0.0243473 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]  
2023-09-18T10:14:56.925+0800: 235045.879: [GC (Allocation Failure) [PSYoungGen: 443616K->8192K(445952K)] 772534K->337110K(965632K), 0.0152287 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]  
2023-09-18T10:17:49.358+0800: 235218.312: [GC (Allocation Failure) [PSYoungGen: 439808K->8432K(445952K)] 768726K->337790K(965632K), 0.0151303 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]  
2023-09-18T10:20:51.356+0800: 235400.310: [GC (Allocation Failure) [PSYoungGen: 441072K->8976K(446464K)] 770430K->338470K(966144K), 0.0159285 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]  
2023-09-18T10:24:05.395+0800: 235594.349: [GC (Allocation Failure) [PSYoungGen: 441616K->9504K(446464K)] 771110K->339358K(966144K), 0.0219962 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]  
2023-09-18T10:26:48.374+0800: 235757.328: [GC (Allocation Failure) [PSYoungGen: 443168K->11680K(446976K)] 773022K->341950K(966656K), 0.0195554 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
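
Reading the "heap after GC" trend by eye is error-prone, so the figures can also be pulled out of the log mechanically. The following is a minimal sketch and was not part of the original investigation; the class name `GcLogScan` and its regular expression are assumptions, and the pattern only matches the Parallel GC line format shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogScan {
    // Matches the whole-heap figures that follow the PSYoungGen bracket, e.g.
    // "] 763838K->333358K(961024K)": heap before GC, heap after GC, committed heap (KB).
    private static final Pattern HEAP =
            Pattern.compile("\\]\\s+(\\d+)K->(\\d+)K\\((\\d+)K\\)");

    /** Returns the heap occupancy after each GC, in KB, in log order. */
    public static List<Long> heapAfterGc(List<String> lines) {
        List<Long> after = new ArrayList<>();
        for (String line : lines) {
            Matcher m = HEAP.matcher(line);
            if (m.find()) {
                after.add(Long.parseLong(m.group(2)));
            }
        }
        return after;
    }

    public static void main(String[] args) {
        // First and last lines of the excerpt above.
        List<String> sample = Arrays.asList(
            "2023-09-18T09:58:28.259+0800: 234057.213: [GC (Allocation Failure) [PSYoungGen: 438400K->7648K(441344K)] 763838K->333358K(961024K), 0.0140907 secs] [Times: user=0.04 sys=0.00, real=0.02 secs]",
            "2023-09-18T10:26:48.374+0800: 235757.328: [GC (Allocation Failure) [PSYoungGen: 443168K->11680K(446976K)] 773022K->341950K(966656K), 0.0195554 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]");
        System.out.println(heapAfterGc(sample)); // prints [333358, 341950]
    }
}
```

A series that keeps rising across many collections means objects are surviving GC and accumulating, which is the pattern to look for when a leak is suspected.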

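Analyzing the dump file with JProfiler

Opening the dump file in JProfiler shows that HashMap$Node alone accounts for 1 GB.

Selecting Node shows more than 30 million instances.

Choosing Merged incoming references and expanding the reference chain of the Node objects step by step eventually leads to an OSSClient object that references the Nodes.

This service uploads a lot of files to Alibaba Cloud OSS, so I went looking for the code that uses OSSClient. It turned out that every upload called new OSSClient(). Still, even if a new client is created as a local variable on every call, that alone shouldn't leak memory, should it?

So I read the source of DefaultServiceClient, which shows that creating an OSSClient calls createHttpClientConnectionManager: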

    public DefaultServiceClient(ClientConfiguration config) {
        super(config);
        // Every client instance builds its own connection manager
        this.connectionManager = createHttpClientConnectionManager();
        this.httpClient = createHttpClient(this.connectionManager);
        RequestConfig.Builder requestConfigBuilder = RequestConfig.custom();
        // ... (rest of the constructor omitted)
    }

createHttpClientConnectionManager hands the new connection manager to IdleConnectionReaper, which manages idle connections:

    protected HttpClientConnectionManager createHttpClientConnectionManager() {
        SSLContext sslContext = null;
        // ... (creation of connectionManager omitted in this excerpt)
        if (config.isUseReaper()) {
            // Register the new manager with the static idle-connection reaper
            IdleConnectionReaper.setIdleConnectionTime(config.getIdleConnectionTime());
            IdleConnectionReaper.registerConnectionManager(connectionManager);
        }
        return connectionManager;
    }

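IdleConnectionReaper.registerConnectionManager, in turn, adds each connection manager to a static ArrayList. Because the list is static, every manager registered there stays reachable, and therefore cannot be garbage collected, until it is explicitly removed: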

public final class IdleConnectionReaper extends Thread {
    private static final int REAP_INTERVAL_MILLISECONDS = 5 * 1000;
    private static final ArrayList<HttpClientConnectionManager> connectionManagers = new ArrayList<HttpClientConnectionManager>();

    private static IdleConnectionReaper instance;

    private static long idleConnectionTime = 60 * 1000;

    private volatile boolean shuttingDown;

    private IdleConnectionReaper() {
        super("idle_connection_reaper");
        setDaemon(true);
    }

    public static synchronized boolean registerConnectionManager(HttpClientConnectionManager connectionManager) {
        if (instance == null) {
            instance = new IdleConnectionReaper();
            instance.start();
        }
        return connectionManagers.add(connectionManager);
    }
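
The effect of that static list can be reproduced without the SDK. Below is a minimal sketch using stand-in classes: `Registry` and `FakeClient` are hypothetical names that only mimic the registration pattern of `IdleConnectionReaper` and `OSSClient`, and the 1 KB array is a placeholder for whatever a real connection manager holds.

```java
import java.util.ArrayList;
import java.util.List;

public class StaticRegistryLeak {
    /** Stand-in for IdleConnectionReaper: a static list that outlives every caller. */
    static final class Registry {
        private static final List<Object> MANAGERS = new ArrayList<>();
        static synchronized void register(Object m) { MANAGERS.add(m); }
        static synchronized void remove(Object m) { MANAGERS.remove(m); }
        static synchronized int size() { return MANAGERS.size(); }
    }

    /** Stand-in for OSSClient: registers its "connection manager" on construction. */
    static final class FakeClient {
        private final byte[] manager = new byte[1024]; // placeholder for a connection pool
        FakeClient() { Registry.register(manager); }
        void shutdown() { Registry.remove(manager); }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            // The client itself is unreferenced right away,
            // but the static registry still holds its manager.
            new FakeClient();
        }
        System.out.println("without shutdown: " + Registry.size()); // prints "without shutdown: 1000"

        for (int i = 0; i < 1000; i++) {
            FakeClient c = new FakeClient();
            c.shutdown(); // unregisters, so this manager becomes collectable
        }
        System.out.println("with shutdown: " + Registry.size()); // prints "with shutdown: 1000"
    }
}
```

The second loop adds nothing to the registry, while everything registered by the first loop is still there. That is the same shape as the production leak: a local variable going out of scope does not help when a static collection keeps a reference.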

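OSSClient does provide a shutdown method. An OSSClient that is no longer needed must be shut down to release its connections, which also removes the corresponding connection manager from connectionManagers. So yes, the OOM really was caused by our own misuse of the client.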

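    // Client side: unregister the connection manager from the reaper, then close it
    @Override
    public void shutdown() {
        IdleConnectionReaper.removeConnectionManager(this.connectionManager);
        this.connectionManager.shutdown();
    }

    // IdleConnectionReaper: drop the manager from the static list,
    // and shut the reaper down once the list is empty
    public static synchronized boolean removeConnectionManager(HttpClientConnectionManager connectionManager) {
        boolean b = connectionManagers.remove(connectionManager);
        if (connectionManagers.isEmpty())
            shutdown();
        return b;
    }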

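I also found the same problem described on the Alibaba Cloud website.

Solutions:

Make the OSSClient instance a singleton so the application does not instantiate OSSClient repeatedly
Call OSSClient.shutdown() to close an OSSClient instance and release its resources
Use a try-finally block and call OSSClient.shutdown() in finally
Whenever the application uses an OSSClient, make sure the instance is closed once it is no longer needed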
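
The try-finally advice can be captured in one small helper so that call sites cannot forget the cleanup. This is a sketch, not SDK code: `ShutdownGuard` and `use` are hypothetical names, and the helper is written generically so that it compiles without the OSS SDK on the classpath.

```java
import java.util.function.Consumer;
import java.util.function.Function;

public final class ShutdownGuard {
    private ShutdownGuard() {}

    /**
     * Runs work against client and always invokes shutdown afterwards,
     * whether work returns normally or throws.
     */
    public static <C, R> R use(C client,
                               Consumer<? super C> shutdown,
                               Function<? super C, ? extends R> work) {
        try {
            return work.apply(client);
        } finally {
            shutdown.accept(client);
        }
    }

    public static void main(String[] args) {
        // A String stands in for the client; the "shutdown" action just records that it ran.
        StringBuilder log = new StringBuilder();
        Integer length = use("client", c -> log.append("shutdown:").append(c), String::length);
        System.out.println(length + " " + log); // prints "6 shutdown:client"
    }
}
```

With the real SDK, a call site would look roughly like `ShutdownGuard.use(new OSSClient(endpoint, accessKey, accessSecret), OSSClient::shutdown, c -> c.putObject(bucket, key, file))`, where `bucket`, `key` and `file` are placeholders. This pattern suits a client created for a single operation; a shared singleton should instead be shut down once, when the application stops.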

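Reproducing locally

We started the project locally and attached JProfiler to the JVM.

The feature is exposed as an API endpoint, so we wrote a for loop that kept calling it and watched the memory. After 6 minutes, free memory had dropped to 0.

The server reported an OOM as well.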

Fixing the problem

Rewrite the code using a factory pattern:

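    // One shared OSSClient per accessKey, so repeated uploads reuse the same client.
    // Note: the cache is keyed by accessKey only, so each key is assumed to use one endpoint.
    public static final Map<String, OSSClient> map = new ConcurrentHashMap<>();

    public static OSSClient getClient(String endpoint, String accessKey, String accessSecret) {
        // computeIfAbsent is atomic per key, so concurrent callers cannot each create
        // a client and leave the losing one registered without ever shutting it down.
        return map.computeIfAbsent(accessKey,
                k -> new OSSClient(endpoint, accessKey, accessSecret));
    }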
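
One detail worth checking in a cache like this is what happens when several threads ask for the same key at once. A separate `containsKey` followed by `put` lets two threads both miss and both construct a client, and the one that loses the race is never shut down. Below is a minimal sketch of the atomic alternative, `ConcurrentHashMap.computeIfAbsent`; `ClientCacheDemo` and `FakeClient` are hypothetical stand-ins for the utility class and `OSSClient` so that the example runs without the SDK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class ClientCacheDemo {
    static final AtomicInteger CREATED = new AtomicInteger();

    /** Stand-in for OSSClient; counts how many instances get constructed. */
    static final class FakeClient {
        FakeClient() { CREATED.incrementAndGet(); }
    }

    private static final Map<String, FakeClient> CACHE = new ConcurrentHashMap<>();

    static FakeClient getClient(String accessKey) {
        // Atomic per key: the factory lambda runs at most once for a given key.
        return CACHE.computeIfAbsent(accessKey, k -> new FakeClient());
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 16;
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                try {
                    start.await();          // release all threads at the same moment
                    getClient("demo-key");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        start.countDown();
        done.await();
        System.out.println("clients created: " + CREATED.get()); // prints "clients created: 1"
    }
}
```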

Then replace the original call site:

  OSSClient client = AliyunUtil.getClient(endpoint, getAccessKey(), getAccessSecret());

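Testing again, every GC now reclaimed memory properly. After running for 6 minutes, memory usage stayed below 200 MB, so the problem was solved.

Summary

This post walked through using JProfiler to track down a production OOM caused by improper use of the Alibaba Cloud OSSClient. The root cause was that the code never called shutdown on the OSSClient instances it created, which the caller has to do manually. Luckily this did not happen in a core business system, where the consequences would have been much more serious. When using a tool provided by someone else, check how the official documentation says to use it and read the source, so that similar problems don't happen again.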