Why I Recommend Releasing Direct Memory Explicitly - Java

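Introduction to direct memory

Direct memory is memory outside the JVM-managed heap (off-heap memory), but it still lives in the virtual address space of the Java process.

The benefit of direct memory is that it saves one copy during I/O, which is also why Netty uses direct memory by default.

Direct memory is not managed by the JVM heap, so how does it get released?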

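When ByteBuffer.allocateDirect() is called, the JVM creates a DirectByteBuffer object (an ordinary Java heap object). Internally that object allocates off-heap memory through Unsafe.allocateMemory() and is tied to a Cleaner object through a phantom reference (PhantomReference).

Once the DirectByteBuffer object is no longer reachable through any strong reference, the GC can reclaim it. At that point the Cleaner notices that its phantom reference has been added to the reference queue and runs the registered cleanup task, which frees the off-heap memory. In other words, the off-heap memory is only freed some time after a GC cycle has discovered that the heap object is dead.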
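The mechanism above can be sketched with the public java.lang.ref.Cleaner API (JDK 9+). This is only an analogy: DirectByteBuffer uses a JDK-internal cleaner class, not this one, and the Resource class and its freed flag are hypothetical stand-ins for an off-heap allocation. The sketch contrasts running the cleanup action explicitly with waiting for the GC to trigger it.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

class CleanerSketch {
    private static final Cleaner CLEANER = Cleaner.create();

    // Hypothetical stand-in for an object that owns off-heap memory.
    // The flag plays the role of "the memory has been freed".
    static final class Resource {
        final AtomicBoolean freed = new AtomicBoolean(false);
        final Cleaner.Cleanable cleanable;

        Resource() {
            // The action must not capture 'this', otherwise the object
            // would stay reachable and the cleaner would never fire.
            AtomicBoolean flag = freed;
            this.cleanable = CLEANER.register(this, () -> flag.set(true));
        }
    }

    // Explicit release: the action runs right away on the calling thread.
    static boolean freeExplicitly() {
        Resource r = new Resource();
        r.cleanable.clean(); // runs the action at most once
        return r.freed.get();
    }

    // Deferred release: drop the reference and wait for a GC to notice.
    // System.gc() is only a hint, so the timing (and even the outcome
    // within this window) is up to the collector.
    static boolean freeViaGc() throws InterruptedException {
        AtomicBoolean flag = new Resource().freed; // Resource is unreachable from here on
        for (int i = 0; i < 50 && !flag.get(); i++) {
            System.gc();
            Thread.sleep(20);
        }
        return flag.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("explicit: freed=" + freeExplicitly());
        System.out.println("via GC:   freed=" + freeViaGc());
    }
}
```

The explicit path is deterministic, while the GC path depends on when a collection happens, which is the delay the rest of this article is concerned with.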

It is exactly this delayed release that causes performance problems.

Netty

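Netty offers two ways to use direct memory: pooled and unpooled.

Netty's pooled allocator uses a jemalloc-like memory pool, which has the edge for small allocations under many threads and high concurrency.

But whether pooled or unpooled, a Netty ByteBuf has to be released manually.

Why does an unpooled buffer also need to be released manually? My guess is that it is a performance optimization.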
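To make the "manual release" contract concrete, here is a minimal reference-counting sketch using only the JDK. It is not Netty's implementation: the class name RefCountedSketch and the Runnable deallocator are assumptions made for illustration. It only mirrors the shape of the retain()/release() contract, where release() returns true when the count reaches zero and the memory is freed on the spot.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a reference-counted buffer handle.
class RefCountedSketch {
    private final AtomicInteger refCnt = new AtomicInteger(1); // the creator holds one reference
    private final Runnable deallocator;                        // frees the underlying memory

    RefCountedSketch(Runnable deallocator) {
        this.deallocator = deallocator;
    }

    int refCnt() {
        return refCnt.get();
    }

    // Adds one reference; fails if the object was already deallocated.
    RefCountedSketch retain() {
        for (;;) {
            int current = refCnt.get();
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return this;
            }
        }
    }

    // Drops one reference; deallocates immediately when the last one is dropped.
    boolean release() {
        for (;;) {
            int current = refCnt.get();
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    deallocator.run(); // no GC involved: memory goes back right now
                    return true;
                }
                return false;
            }
        }
    }

    public static void main(String[] args) {
        RefCountedSketch buf = new RefCountedSketch(() -> System.out.println("memory freed"));
        buf.retain();                      // refCnt = 2
        System.out.println(buf.release()); // false, one reference left
        System.out.println(buf.release()); // true, deallocator has run
    }
}
```

The point of the design is that the moment of deallocation is decided by the caller rather than by the garbage collector.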

malloc

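After all that, why would manual release improve performance? No rush, let's first look at how memory allocation works.

We usually obtain memory through two APIs, malloc and mmap. Small allocations generally go through malloc, and malloc maintains its own memory pool. Linux (glibc) uses ptmalloc by default; read up on it if you are interested, since it is not the focus of this article.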

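Without manual release

When we request a small block, malloc hands us a suitably sized block from its pool. If we do not release manually, memory that should already have been freed is not returned, so malloc has to keep carving small blocks out of larger chunks. When the JVM eventually runs a GC, a large amount of direct memory is returned at once, and the pool merges many small blocks back into large ones or even returns them to the operating system, all of which costs performance.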

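With manual release

When a small block is requested and returned as soon as we are done with it, small blocks are reused much more effectively. Both the frequency of splitting large chunks and the frequency of merging small blocks drop sharply, so performance improves.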
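The argument in the two cases above can be illustrated with a deliberately simplified toy model. It is not how ptmalloc actually works: the FreeListModel class, its counters, and the rule "a bulk free coalesces everything" are modeling assumptions chosen only to mirror the reasoning in the text. It counts how often a small block must be carved from a large chunk (splits) and how many small blocks get merged back (merges).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a single size class in an allocator (assumption, not ptmalloc).
class FreeListModel {
    private final Deque<Integer> freeList = new ArrayDeque<>(); // cached small blocks
    private int nextBlockId = 0;
    int splits = 0; // small blocks carved out of a large chunk
    int merges = 0; // small blocks merged back into a large chunk

    int allocate() {
        Integer cached = freeList.poll();
        if (cached != null) {
            return cached;      // fast path: reuse a block that was returned earlier
        }
        splits++;               // slow path: carve a new block
        return nextBlockId++;
    }

    // Immediate return: the block goes straight back onto the free list.
    void free(int block) {
        freeList.push(block);
    }

    // Modeling assumption: when many blocks come back at once (as after a GC),
    // the allocator coalesces them instead of caching them for reuse.
    void bulkFree(int count) {
        merges += count;
    }

    static FreeListModel immediate(int allocations) {
        FreeListModel m = new FreeListModel();
        for (int i = 0; i < allocations; i++) {
            m.free(m.allocate());
        }
        return m;
    }

    static FreeListModel deferred(int allocations, int batch) {
        FreeListModel m = new FreeListModel();
        int pending = 0;
        for (int i = 0; i < allocations; i++) {
            m.allocate();
            if (++pending == batch) { // stands in for a GC cycle
                m.bulkFree(pending);
                pending = 0;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        FreeListModel a = immediate(10240);
        FreeListModel b = deferred(10240, 1024);
        System.out.println("immediate: splits=" + a.splits + " merges=" + a.merges); // splits=1 merges=0
        System.out.println("deferred:  splits=" + b.splits + " merges=" + b.merges); // splits=10240 merges=10240
    }
}
```

A real allocator is far more nuanced (bins, thresholds, arenas), so the numbers here only show the direction of the effect, not its size.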

Time comparison

Let's go straight to a benchmark.

Environment

JDK 21

CentOS 7

glibc /lib64/libc.so.6 （ptmalloc）


package com.qiu;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.Unpooled;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import sun.misc.Unsafe;

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 3)
@Threads(4)
@Fork(2)
@State(Scope.Benchmark)
public class NettyBufferAllocationBenchmark {

    // Benchmark different buffer sizes
    @Param({"16384", "1048576"}) // 16 KB and 1 MB
    public int bufferSize;

    private PooledByteBufAllocator nettyPooledAllocator;


    Unsafe unsafe;

    @Setup
    public void setup() throws IllegalAccessException, NoSuchFieldException {
        nettyPooledAllocator = new PooledByteBufAllocator(true); // prefer direct memory
        Field unsafeField = Unsafe.class.getDeclaredField("theUnsafe");
        unsafeField.setAccessible(true);
        unsafe = (Unsafe) unsafeField.get(null);
    }

    // Benchmark Netty's pooled ByteBuf
    @Benchmark
    public void nettyPooled(Blackhole bh) {
        ByteBuf buf = nettyPooledAllocator.buffer(bufferSize);
        try {
            bh.consume(buf); // stand-in for writing data
        } finally {
            buf.release(); // return the buffer to the pool
        }
    }


    // Benchmark Netty's unpooled ByteBuf
    @Benchmark
    public void nettyUnpooled(Blackhole bh) {
        ByteBuf buf = Unpooled.directBuffer(bufferSize);
        try {
            bh.consume(buf);
        } finally {
            buf.release(); // actually invokes the Deallocator
        }
    }

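    // Benchmark the standard ByteBuffer (direct memory)
    @Benchmark
    public void standardDirectByteBuffer(Blackhole bh) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
        bh.consume(buffer); // stand-in for writing data
        // Direct memory is left to the GC (uncomment to force a collection and simulate pressure)
//        System.gc();
    }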

    // Benchmark a reused ByteBuffer (best-practice control group)
    private ThreadLocal<ByteBuffer> threadLocalBuffer = ThreadLocal.withInitial(
            () -> ByteBuffer.allocateDirect(bufferSize)
    );


    @Benchmark
    public void reusedDirectByteBuffer(Blackhole bh) {
        ByteBuffer buffer = threadLocalBuffer.get();
        buffer.clear();
        bh.consume(buffer);
    }

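    // Benchmark the standard ByteBuffer with its cleaner invoked explicitly
    @Benchmark
    public void cleanDirectByteBuffer(Blackhole bh) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
        bh.consume(buffer); // stand-in for writing data

        unsafe.invokeCleaner(buffer); // free the off-heap memory now instead of waiting for GC
    }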


    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(NettyBufferAllocationBenchmark.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }

}

Results


Benchmark                                                (bufferSize)   Mode  Cnt       Score       Error   Units
NettyBufferAllocationBenchmark.nettyPooled                      16384  thrpt   10    1584.894 ±    35.131  ops/ms
NettyBufferAllocationBenchmark.nettyPooled                    1048576  thrpt   10     102.882 ±    44.919  ops/ms
NettyBufferAllocationBenchmark.nettyUnpooled                    16384  thrpt   10     932.260 ±    87.911  ops/ms
NettyBufferAllocationBenchmark.nettyUnpooled                  1048576  thrpt   10      25.768 ±     0.835  ops/ms
NettyBufferAllocationBenchmark.reusedDirectByteBuffer           16384  thrpt   10  429511.452 ± 16732.835  ops/ms
NettyBufferAllocationBenchmark.reusedDirectByteBuffer         1048576  thrpt   10  436561.625 ± 20858.339  ops/ms
NettyBufferAllocationBenchmark.standardDirectByteBuffer         16384  thrpt   10       2.535 ±     0.324  ops/ms
NettyBufferAllocationBenchmark.standardDirectByteBuffer       1048576  thrpt   10       5.271 ±     2.726  ops/ms
NettyBufferAllocationBenchmark.cleanDirectByteBuffer            16384  thrpt   10    1130.793 ±    33.644  ops/ms
NettyBufferAllocationBenchmark.cleanDirectByteBuffer          1048576  thrpt   10      25.230 ±     0.792  ops/ms

We can see that nettyUnpooled ≈ cleanDirectByteBuffer >> standardDirectByteBuffer.

Space comparison


package com.qiu;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import org.junit.Test;

import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.concurrent.locks.LockSupport;

public class UnpooledTest {
    // 16 KB
    int size = 16384;
    int times = 1024 * 10;

    public static void main(String[] args) {
        String name = ManagementFactory.getRuntimeMXBean().getName();
        String pid = name.split("@")[0];
        System.out.println("Pid is: " + pid);

//        new UnpooledTest().testEmpty();
//        new UnpooledTest().testByteBuffer();
        new UnpooledTest().testUnpooledBytebuf();
    }

    @Test
    public void testEmpty() {
        System.out.println("allocate finish");
        LockSupport.park();
    }

    @Test
    public void testByteBuffer() {
        for (int i = 0; i < times; i++) {
            ByteBuffer b = ByteBuffer.allocateDirect(size);
            b.put((byte) 1);

            // release
            b = null;
        }
        System.out.println("allocate finish");
        LockSupport.park();
    }

    @Test
    public void testUnpooledBytebuf() {
        for (int i = 0; i < times; i++) {
            ByteBuf byteBuf = Unpooled.directBuffer(size);
            byteBuf.setByte(1, (byte) 1);
            byteBuf.release();
        }
        System.out.println("allocate finish");
        LockSupport.park();
    }
}


PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
74922 root      20   0 4878924  40436  15024 S   0.3  0.5   0:00.79 java     
77398 root      20   0 5076560 206440  15352 S   0.7  2.6   0:01.93 java
79027 root      20   0 5012052  53304  15976 S   0.3  0.7   0:01.73 java

74922 testEmpty
77398 testByteBuffer
79027 testUnpooledBytebuf

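Looking at RES (the physical memory the process actually occupies) and %CPU, the difference is obvious: testByteBuffer (RES 206440) did not release its memory right away, while testUnpooledBytebuf (RES 53304) stays close to the testEmpty baseline (RES 40436). For Java this space difference is not the important part, of course; what matters most is still the gap in performance (time).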
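Besides watching RES in top, the JVM's own accounting of direct buffers can be read from inside the process through the standard BufferPoolMXBean. The sketch below is an illustration only; the class name DirectPoolStats and its helper methods are made up for this example. It covers buffers created through ByteBuffer.allocateDirect(); direct memory that a library allocates by other means may not show up in this counter.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectPoolStats {
    // Finds the JVM's "direct" buffer pool among the platform MXBeans.
    static BufferPoolMXBean directPool() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool;
            }
        }
        throw new IllegalStateException("no 'direct' buffer pool exposed by this JVM");
    }

    // Allocates one direct buffer and reports the pool usage while it is still alive.
    static long usedAfterAllocating(int size) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(size);
        long used = directPool().getMemoryUsed();
        buffer.put(0, (byte) 1); // keeps the buffer reachable past the measurement
        return used;
    }

    public static void main(String[] args) {
        BufferPoolMXBean pool = directPool();
        System.out.println("before: count=" + pool.getCount() + " used=" + pool.getMemoryUsed());
        long used = usedAfterAllocating(16384);
        System.out.println("with one 16 KB buffer alive: used=" + used);
        // Buffers that are merely unreachable keep counting here until a GC
        // (or an explicit cleaner call) actually frees them.
        System.out.println("after:  count=" + pool.getCount() + " used=" + pool.getMemoryUsed());
    }
}
```

Calling something like this at the end of testByteBuffer would show how much direct memory is still waiting for the GC at that moment.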