Memory Performance Testing Tools

Commonly used memory performance testing tools include stream (the most widely used) and sysbench, among others.

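1. A quick memory read/write test with dd

dd is rarely used for memory benchmarking, but it ships with Linux and needs no installation, so it is handy for a quick check of memory performance: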

# Read from the Linux zero device and write to the null device.
$ dd if=/dev/zero of=/dev/null bs=4096 count=1048576
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 2.69363 s, 1.6 GB/s

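The reported copy rate gives a rough basis for comparing memory performance across machines. Treat it as a coarse indicator only: with a 4 KiB block size the figure also reflects system-call overhead and CPU cache effects.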
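Because the dd result depends on the block size, it can help to repeat the run with a few block sizes while keeping the total amount of data fixed. The following is a minimal sketch, not part of the original procedure: the helper name dd_count, the 1 GiB total, and the chosen block sizes are all assumptions, and the summary-line parsing assumes GNU dd, which prints its statistics on stderr.

```shell
# Total bytes moved per run (assumption: 1 GiB keeps each run short).
total_bytes=$((1024 * 1024 * 1024))

# Number of blocks needed to move $1 bytes with a block size of $2 bytes.
dd_count() {
    echo $(( $1 / $2 ))
}

for bs in 4096 65536 1048576; do
    count=$(dd_count "$total_bytes" "$bs")
    # Keep only the last line of dd's stderr output, which holds the copy rate.
    summary=$(dd if=/dev/zero of=/dev/null bs="$bs" count="$count" 2>&1 | tail -n 1)
    printf 'bs=%-8s count=%-8s %s\n' "$bs" "$count" "$summary"
done
```

Compare the rates only between machines that ran the same block size.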

2. Testing memory performance with stream

2.1 Installation

$ mkdir stream
$ cd stream/
# Option 1: download from the upstream site
$ wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c

# Option 2: clone from a mirror hosted in China (Gitee)
$ git clone https://gitee.com/lldhsds/stream.git
$ cd stream/

# Compile
$ gcc stream.c -O3 -fopenmp -DSTREAM_ARRAY_SIZE=1024*1024*1024 -DNTIMES=20 -mcmodel=medium -o stream.1g.20

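Compile options:

stream.c: the source file to compile; the latest version is 5.10.
-O3: compiler optimization level.
-fopenmp: enables OpenMP for multiprocessor systems, which gets closer to the real peak memory bandwidth. When enabled, the program defaults to one thread per CPU hardware thread.
-DSTREAM_ARRAY_SIZE: size (in elements) of the test arrays a[], b[], and c[]; this value strongly affects the results.
-DNTIMES=20: number of times each kernel is executed; the best time, excluding the first iteration, is used for the reported bandwidth.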

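The stream.c source recommends making each array at least 4 times the size of the last-level cache (L3 cache). The arrays hold double values, so each element occupies 8 bytes. The recommended array size, rounded to an integer, is:

last-level cache (MB) × 1024 × 1024 × 4.1 × number of CPU sockets / 8, or equivalently last-level cache (bytes) × 4.1 × number of CPU sockets / 8

For example, on a dual-socket machine with a 32 MB last-level cache, the value is 32 × 1024 × 1024 × 4.1 × 2 / 8 ≈ 34393292.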

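-mcmodel=medium: required when a single array is larger than 2 GB. Other code models such as large, small, and tiny exist; which of them are accepted depends on the GCC version and the target architecture (tiny, for example, is an AArch64 code model).
-o stream.1g.20: name of the output executable; choose any name.
-mtune=native -march=native: optional optimizations for the CPU's instruction set. native is appropriate here because the build machine is also the machine that runs the test.
-DOFFSET=4096: array offset; usually not needed.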
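The array-size rule of thumb above is easy to get wrong by hand, so here is a minimal sketch that evaluates it. The function name stream_array_size is made up for illustration; the 4.1 factor and the 8 bytes per element come from the formula above, and awk is used only because the shell has no floating-point arithmetic.

```shell
# Recommended STREAM_ARRAY_SIZE in elements.
# $1: last-level cache size in MB, $2: number of CPU sockets.
stream_array_size() {
    awk -v mb="$1" -v sockets="$2" \
        'BEGIN { printf "%d\n", int(mb * 1024 * 1024 * 4.1 * sockets / 8) }'
}

stream_array_size 32 2   # dual-socket, 32 MB L3: prints 34393292
stream_array_size 16 1   # single-socket, 16 MB L3: prints 8598323
```

The result is a lower bound; picking a larger round value is fine as long as the arrays still fit in memory.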

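Other notes:

In stream 5.9 the array size is set with -DN=2000000. In 5.10 the parameter is named -DSTREAM_ARRAY_SIZE and defaults to 10000000.
Account for memory capacity as well: as a rough estimate, STREAM_ARRAY_SIZE × 8 (bytes per double) × 3 (three arrays) <= 0.6 × M, where M is the memory available to the user.
The test arrays must be much larger than the CPU's last-level cache (usually the L3 cache); otherwise the test measures cache throughput rather than memory throughput.
The test must run long enough for memory bandwidth to reach its sustained maximum; otherwise the real peak may not be observed. If any of the four kernels completes in fewer than 20 clock ticks (20 microseconds when the timer granularity is 1 microsecond), increase STREAM_ARRAY_SIZE.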
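The capacity estimate above can be checked before compiling. This is a rough sketch under stated assumptions: stream_fits is a hypothetical helper, M is taken from the MemAvailable field of /proc/meminfo (Linux only, reported in kB), and the comparison is scaled by 10 to stay in integer arithmetic.

```shell
# Succeeds when three double arrays of $1 elements fit within 60% of $2 bytes:
# elements * 8 * 3 <= 0.6 * M, written as elements * 240 <= M * 6.
stream_fits() {
    [ $(( $1 * 8 * 3 * 10 )) -le $(( $2 * 6 )) ]
}

# Read available memory; falls back to 0 if the field is missing.
avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo 2>/dev/null)
avail_bytes=$(( ${avail_kb:-0} * 1024 ))

if stream_fits 104857600 "$avail_bytes"; then
    echo "104857600 elements fit within 60% of available memory"
else
    echo "104857600 elements exceed 60% of available memory; reduce the array size"
fi
```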

2.2 Testing

# Check the CPU: the last-level (L3) cache is 16 MB and the machine has a single socket.
$ lscpu | grep -i "L3 cache\|Socket\|core"
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
L3 cache:              16384K

# By the formula, STREAM_ARRAY_SIZE should be at least 16384*1024*4.1*1/8 = 8,598,323.2; use roughly 100 million (1024*1024*100) here. Compile the executable:
$ gcc stream.c -O3 -fopenmp -DSTREAM_ARRAY_SIZE=1024*1024*100 -DNTIMES=20 -mcmodel=medium -o stream.100M
$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 104857600 (elements), Offset = 0 (elements)
Memory per array = 800.0 MiB (= 0.8 GiB).
Total memory required = 2400.0 MiB (= 2.3 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 29659 microseconds.
   (= 29659 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           34701.7     0.066024     0.048347     0.084770
Scale:          39939.1     0.056470     0.042007     0.070259
Add:            41795.4     0.079521     0.060212     0.102223
Triad:          41073.6     0.079864     0.061270     0.102483
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Note

If STREAM_ARRAY_SIZE is too large, stream needs more memory than is available at run time and crashes with a segmentation fault. In that case, add memory or reduce STREAM_ARRAY_SIZE.

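2.3 Analyzing the results

Record the Copy, Scale (multiply), Add, and Triad (combined) values from the output. Run the test several times and average the results, then compare them across machines.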
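Averaging over several runs can be scripted. The sketch below is one possible way, not the article's own tooling: stream_rates and stream_average are hypothetical helper names, and the parsing assumes the result lines look like those in the sample output above ("Copy:" followed by the best rate).

```shell
# Print "Function BestRate" pairs from STREAM output read on stdin.
stream_rates() {
    awk '$1 ~ /^(Copy|Scale|Add|Triad):$/ { sub(":", "", $1); print $1, $2 }'
}

# Average the best rate per function over the combined output of several runs.
stream_average() {
    stream_rates | awk '
        { sum[$1] += $2; n[$1]++ }
        END { for (f in sum) printf "%s %.1f\n", f, sum[f] / n[f] }' | sort
}

# Usage (assumes the stream.100M binary built earlier):
# for i in 1 2 3; do ./stream.100M; done | stream_average
```

The output is sorted alphabetically by function name, one "name average" pair per line.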

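How the stream test works:

One Add operation accesses memory three times (two reads and one write), and Triad also needs three accesses, whereas Copy and Scale need two. The more memory accesses each operation issues, the better the access latency is hidden and the higher the measured bandwidth.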
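The access counts above are what turn a measured time into the reported rate: bytes moved = arrays touched × 8 bytes × array size, divided by the best time. The sketch below reproduces two figures from the sample run; stream_mbps is a hypothetical helper, and MB/s here means 10^6 bytes per second.

```shell
# Bandwidth in MB/s (10^6 bytes per second).
# $1: arrays touched per element (2 for Copy/Scale, 3 for Add/Triad)
# $2: array size in elements, $3: best (minimum) time in seconds
stream_mbps() {
    awk -v arrays="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", arrays * 8 * n / t / 1e6 }'
}

stream_mbps 2 104857600 0.048347   # Copy: prints 34701.7
stream_mbps 3 104857600 0.060212   # Add:  prints 41795.4
```

Both values match the Best Rate column in the sample output, which confirms that the rate is derived from the Min time column.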

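In a single-core STREAM test, the result depends not only on the memory controller but also on the core's ROB (reorder buffer) and load/store units, so it is not a pure memory bandwidth test. A multi-core STREAM test issues a large number of concurrent memory requests from many cores, saturates the memory subsystem more fully, and therefore measures the bandwidth limit.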
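To see the single-core versus multi-core difference, the OpenMP thread count can be varied with the standard OMP_NUM_THREADS environment variable. This is a sketch under assumptions: thread_counts is a made-up helper, the binary path ./stream.100M refers to the build from section 2.2, and the loop only prints the commands when that binary is absent.

```shell
# Thread counts to try: powers of two below $1, always ending with $1 itself.
thread_counts() {
    max=$1; t=1; list=""
    while [ "$t" -lt "$max" ]; do
        list="$list$t "
        t=$(( t * 2 ))
    done
    echo "$list$max"
}

stream_bin=./stream.100M                      # assumption: built as in section 2.2
max_threads=$(nproc 2>/dev/null || echo 1)    # logical CPUs visible to this shell

for t in $(thread_counts "$max_threads"); do
    if [ -x "$stream_bin" ]; then
        echo "== OMP_NUM_THREADS=$t =="
        OMP_NUM_THREADS=$t "$stream_bin" | grep -E '^(Copy|Scale|Add|Triad):'
    else
        echo "would run: OMP_NUM_THREADS=$t $stream_bin"
    fi
done
```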

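3. Testing memory performance with sysbench

3.1 Installing sysbench

3.1.1 Installing from package repositories

On CentOS, the EPEL repository must be configured before sysbench can be installed.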

# CentOS
sudo yum -y install sysbench

# Ubuntu
sudo apt -y install sysbench

3.1.2 Building from source

Download, build, and install:

$ wget https://github.com/akopytov/sysbench/archive/master.zip
$ unzip master.zip
$ cd sysbench-master/

# Or use a repository mirror hosted in China
$ git clone https://gitee.com/mirrors/sysbench.git
$ cd sysbench/

$ ./autogen.sh
# If you only test memory and do not need MySQL, add the option below; otherwise configure fails when the MySQL client libraries are not installed.
$ ./configure --without-mysql
# Build and install
$ make && sudo make install

Option reference:

$ sysbench --help
Usage:
  sysbench [options]... [testname] [command]

Commands implemented by most tests: prepare run cleanup help

General options:
  --threads=N                     number of threads to use [1]
  --events=N                      limit for total number of events [0]
  --time=N                        limit for total execution time in seconds [10]
  --warmup-time=N                 execute events for this many seconds with statistics disabled before the actual benchmark run with statistics enabled [0]
  --forced-shutdown=STRING        number of seconds to wait after the --time limit before forcing shutdown, or 'off' to disable [off]
  --thread-stack-size=SIZE        size of stack per thread [64K]
  --thread-init-timeout=N         wait time in seconds for worker threads to initialize [30]
  --rate=N                        average transactions rate. 0 for unlimited rate [0]
  --report-interval=N             periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports [0]
  --report-checkpoints=[LIST,...] dump full statistics and reset all counters at specified points in time. The argument is a list of comma-separated values representing the amount of time in seconds elapsed from start of test when report checkpoint(s) must be performed. Report checkpoints are off by default. []
  --debug[=on|off]                print more debugging info [off]
  --validate[=on|off]             perform validation checks where possible [off]
  --help[=on|off]                 print help and exit [off]
  --version[=on|off]              print version and exit [off]
  --config-file=FILENAME          File containing command line options
  --luajit-cmd=STRING             perform LuaJIT control command. This option is equivalent to 'luajit -j'. See LuaJIT documentation for more information

Pseudo-Random Numbers Generator options:
  --rand-type=STRING   random numbers distribution {uniform, gaussian, pareto, zipfian} to use by default [uniform]
  --rand-seed=N        seed for random number generator. When 0, the current time is used as an RNG seed. [0]
  --rand-pareto-h=N    shape parameter for the Pareto distribution [0.2]
  --rand-zipfian-exp=N shape parameter (exponent, theta) for the Zipfian distribution [0.8]

Log options:
  --verbosity=N verbosity level {5 - debug, 0 - only critical messages} [3]

  --percentile=N       percentile to calculate in latency statistics (1-100). Use the special value of 0 to disable percentile calculations [95]
  --histogram[=on|off] print latency histogram in report [off]

General database options:

  --db-driver=STRING  specifies database driver to use ('help' to get list of available drivers)
  --db-ps-mode=STRING prepared statements usage mode {auto, disable} [auto]
  --db-debug[=on|off] print database-specific debug information [off]


Compiled-in database drivers:

Compiled-in tests:
  fileio - File I/O test
  cpu - CPU performance test
  memory - Memory functions speed test
  threads - Threads subsystem performance test
  mutex - Mutex performance test

See 'sysbench <testname> help' for a list of options for each test.

3.1.3 Testing

# Show the help for the memory test
$ sysbench memory help
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

memory options:
  --memory-block-size=SIZE    size of memory block for test [1K]
  --memory-total-size=SIZE    total size of data to transfer [100G]
  --memory-scope=STRING       memory access scope {global,local} [global]
  --memory-hugetlb[=on|off]   allocate memory from HugeTLB pool [off]
  --memory-oper=STRING        type of memory operations {read, write, none} [write]
  --memory-access-mode=STRING memory access mode {seq,rnd} [seq]

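# Test memory read performance: sequential reads, 100G of data in total, 8K block size, reporting every 1s.
# Note: the sample output below reports "operation: write", so it was evidently captured without
# --memory-oper=read taking effect; a read run should report "operation: read".
$ sysbench memory --threads=4 --time=60 --report-interval=1 --memory-block-size=8K --memory-total-size=100G --memory-oper=read --memory-access-mode=seq run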
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Report intermediate results every 1 second(s)
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 8KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

[ 1s ] 7663.72 MiB/sec
[ 2s ] 3820.58 MiB/sec
[ 3s ] 2627.22 MiB/sec
[ 4s ] 2616.21 MiB/sec
...
[ 31s ] 2542.26 MiB/sec
[ 32s ] 2532.57 MiB/sec
[ 33s ] 2474.34 MiB/sec
[ 34s ] 2760.10 MiB/sec
Total operations: 13107200 (375099.39 per second)

102400.00 MiB transferred (2930.46 MiB/sec)   # average read/write throughput


Throughput:
    events/s (eps):                      375099.3918
    time elapsed:                        34.9433s
    total number of events:              13107200   # number of events; one event is one read or write of a memory block

Latency (ms):
         min:                                    0.00
         avg:                                    0.01
         max:                                   16.04
         95th percentile:                        0.02
         sum:                               130907.47

Threads fairness:
    events (avg/stddev):           3276800.0000/0.00
    execution time (avg/stddev):   32.7269/0.09

# Test memory write performance: sequential writes, 100G of data in total, 8K block size, reporting every 1s.
$ sysbench memory --threads=4 --time=60 --report-interval=1 --memory-block-size=8K --memory-total-size=100G --memory-oper=write --memory-access-mode=seq run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Report intermediate results every 1 second(s)
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 8KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

[ 1s ] 2745.40 MiB/sec
[ 2s ] 2692.62 MiB/sec
[ 3s ] 2712.13 MiB/sec
...
[ 28s ] 2806.32 MiB/sec
[ 29s ] 2747.49 MiB/sec
[ 30s ] 2721.71 MiB/sec
[ 31s ] 5733.25 MiB/sec
Total operations: 13107200 (420671.73 per second)

102400.00 MiB transferred (3286.50 MiB/sec)


Throughput:
    events/s (eps):                      420671.7259
    time elapsed:                        31.1578s
    total number of events:              13107200

Latency (ms):
         min:                                    0.00
         avg:                                    0.01
         max:                                   20.13
         95th percentile:                        0.02
         sum:                               115533.04

Threads fairness:
    events (avg/stddev):           3276800.0000/0.00
    execution time (avg/stddev):   28.8833/0.31

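3.1.4 Analyzing the results

Record the average read and write throughput, vary the test parameters, and average the results over several runs.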
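Varying the parameters and collecting the averages can be scripted. The sketch below assumes the sysbench 1.x command syntax used in this article and the "MiB transferred (... MiB/sec)" summary line shown in the sample output; sysbench_mibps is a hypothetical helper, and the block sizes and the 10G total are arbitrary choices for a shorter run.

```shell
# Extract the average throughput (MiB/sec) from sysbench memory output on stdin.
sysbench_mibps() {
    sed -n 's/^.*MiB transferred (\([0-9.]*\) MiB\/sec).*$/\1/p'
}

# Sweep a few block sizes for sequential reads; skipped if sysbench is not installed.
for bs in 1K 8K 1M; do
    if command -v sysbench >/dev/null 2>&1; then
        rate=$(sysbench memory --threads=4 --time=10 --memory-block-size="$bs" \
                   --memory-total-size=10G --memory-oper=read \
                   --memory-access-mode=seq run | sysbench_mibps)
        echo "block=$bs read=$rate MiB/sec"
    fi
done
```

Repeat the sweep with --memory-oper=write to collect the write figures.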

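4. Testing memory with memtester

memtester is a utility for testing memory correctness, aimed mainly at hardware developers. Starting with version 4.1.0, memtester can test from a specified starting physical memory address.

It can also be used to create a high memory load.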

# Download, build, and install
$ wget https://pyropus.ca./software/memtester/old-versions/memtester-4.6.0.tar.gz
$ tar xf memtester-4.6.0.tar.gz
$ cd memtester-4.6.0/
$ make && sudo make install

# Usage
Usage: memtester [-p physaddrbase [-d device]] <mem>[B|K|M|G] [loops]

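# Specify the amount of memory to test and the number of loops. The main tests include random value, XOR comparison, subtraction, multiplication, division, and AND/OR operations.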
$ sudo memtester 1G 3
memtester version 4.6.0 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got  1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1/3:
...
Loop 3/3:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok

Done.
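When memtester runs from a script, the exit code is more convenient than reading the per-test lines. The sketch below decodes it, assuming the bit meanings described in the memtester man page (1: allocation, locking, or invocation error; 2: stuck address test; 4: any other test); verify them against your installed version. memtester_status is a hypothetical helper name.

```shell
# Print a description for each bit set in a memtester exit code ($1).
memtester_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    [ $(( code & 1 )) -ne 0 ] && echo "allocation, locking, or invocation error"
    [ $(( code & 2 )) -ne 0 ] && echo "stuck address test failed"
    [ $(( code & 4 )) -ne 0 ] && echo "one of the other tests failed"
    return 0
}

# Usage:
# sudo memtester 1G 3 >/dev/null; memtester_status $?
```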

5. Testing memory performance with mbw

# Ubuntu
$ sudo apt install -y mbw

# CentOS (build from source)
$ git clone https://github.com/raas/mbw.git
$ cd mbw
$ make

# Help
$ mbw -h
    Usage: mbw [options] array_size_in_MiB
    Options:
            -n: number of runs per test
            -a: Don't display average
            -t0: memcpy test    # memory copy
            -t1: dumb (b[i]=a[i] style) test # element-by-element copy in a loop
            -t2 : memcpy test with fixed block size # memcpy in fixed-size blocks
            -b <size>: block size in bytes for -t2 (default: 262144)
            -q: quiet (print statistics only)
    (will then use two arrays, watch out for swapping)
    'Bandwidth' is amount of data copied over the time this operation took.

    The default is to run all tests available.

# Run the test: -q suppresses log output, -n 10 runs each test 10 times, and 256 is the array size in MiB
$ ./mbw -q -n 10 256
0       Method: MEMCPY  Elapsed: 0.04187        MiB: 256.00000  Copy: 6114.455 MiB/s
1       Method: MEMCPY  Elapsed: 0.04571        MiB: 256.00000  Copy: 5600.525 MiB/s
2       Method: MEMCPY  Elapsed: 0.05306        MiB: 256.00000  Copy: 4824.727 MiB/s
3       Method: MEMCPY  Elapsed: 0.05574        MiB: 256.00000  Copy: 4592.999 MiB/s
4       Method: MEMCPY  Elapsed: 0.06371        MiB: 256.00000  Copy: 4018.460 MiB/s
5       Method: MEMCPY  Elapsed: 0.05230        MiB: 256.00000  Copy: 4894.744 MiB/s
6       Method: MEMCPY  Elapsed: 0.05222        MiB: 256.00000  Copy: 4902.336 MiB/s
7       Method: MEMCPY  Elapsed: 0.05833        MiB: 256.00000  Copy: 4388.446 MiB/s
8       Method: MEMCPY  Elapsed: 0.05498        MiB: 256.00000  Copy: 4656.662 MiB/s
9       Method: MEMCPY  Elapsed: 0.05776        MiB: 256.00000  Copy: 4431.903 MiB/s
AVG     Method: MEMCPY  Elapsed: 0.05357        MiB: 256.00000  Copy: 4779.017 MiB/s
0       Method: DUMB    Elapsed: 0.04523        MiB: 256.00000  Copy: 5659.585 MiB/s
1       Method: DUMB    Elapsed: 0.04219        MiB: 256.00000  Copy: 6067.357 MiB/s
2       Method: DUMB    Elapsed: 0.03677        MiB: 256.00000  Copy: 6962.197 MiB/s
3       Method: DUMB    Elapsed: 0.04211        MiB: 256.00000  Copy: 6078.739 MiB/s
4       Method: DUMB    Elapsed: 0.04162        MiB: 256.00000  Copy: 6150.446 MiB/s
5       Method: DUMB    Elapsed: 0.04325        MiB: 256.00000  Copy: 5919.075 MiB/s
6       Method: DUMB    Elapsed: 0.04290        MiB: 256.00000  Copy: 5966.671 MiB/s
7       Method: DUMB    Elapsed: 0.03596        MiB: 256.00000  Copy: 7120.011 MiB/s
8       Method: DUMB    Elapsed: 0.03747        MiB: 256.00000  Copy: 6831.950 MiB/s
9       Method: DUMB    Elapsed: 0.03587        MiB: 256.00000  Copy: 7137.281 MiB/s
AVG     Method: DUMB    Elapsed: 0.04034        MiB: 256.00000  Copy: 6346.342 MiB/s
0       Method: MCBLOCK Elapsed: 0.03189        MiB: 256.00000  Copy: 8026.336 MiB/s
1       Method: MCBLOCK Elapsed: 0.03841        MiB: 256.00000  Copy: 6664.931 MiB/s
2       Method: MCBLOCK Elapsed: 0.03263        MiB: 256.00000  Copy: 7846.503 MiB/s
3       Method: MCBLOCK Elapsed: 0.03469        MiB: 256.00000  Copy: 7379.648 MiB/s
4       Method: MCBLOCK Elapsed: 0.03270        MiB: 256.00000  Copy: 7828.986 MiB/s
5       Method: MCBLOCK Elapsed: 0.03393        MiB: 256.00000  Copy: 7544.056 MiB/s
6       Method: MCBLOCK Elapsed: 0.03700        MiB: 256.00000  Copy: 6919.293 MiB/s
7       Method: MCBLOCK Elapsed: 0.03924        MiB: 256.00000  Copy: 6523.623 MiB/s
8       Method: MCBLOCK Elapsed: 0.04240        MiB: 256.00000  Copy: 6037.736 MiB/s
9       Method: MCBLOCK Elapsed: 0.03011        MiB: 256.00000  Copy: 8503.288 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.03530        MiB: 256.00000  Copy: 7252.125 MiB/s

# Higher values mean better performance
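For comparing machines, only the AVG line of each method is usually needed. A minimal sketch, assuming the mbw output format shown above; mbw_averages is a hypothetical helper name.

```shell
# Print "Method AverageRate" (MiB/s) for each AVG line of mbw output on stdin.
mbw_averages() {
    awk '$1 == "AVG" { print $3, $(NF-1) }'
}

# Usage:
# ./mbw -q -n 10 256 | mbw_averages
```

For the sample run above this would list MEMCPY, DUMB, and MCBLOCK with their average copy rates.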
