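Memory Performance Testing Tools
Common memory benchmarking tools include stream (the most widely used) and sysbench, among others.
1. A quick memory read/write check with dd
dd is not commonly used for memory benchmarking, but it ships with Linux and needs no installation, so it is handy for a rough check of memory performance: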
shell
# Read from the Linux zero device and write to the null device.
$ dd if=/dev/zero of=/dev/null bs=4096 count=1048576
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 2.69363 s, 1.6 GB/s
The copy rate gives a rough basis for comparing memory performance across machines.
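The rate dd prints is simply the number of bytes copied divided by the elapsed time. As a minimal sketch, the helper below (the name `dd_rate_gbs` is made up for illustration) recomputes it from the figures in the sample output, using decimal gigabytes as dd does:

```shell
# Hypothetical helper: throughput in GB/s (1 GB = 10^9 bytes)
# from a byte count and an elapsed time in seconds.
dd_rate_gbs() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1e9 }'
}

# Figures taken from the sample dd output above.
dd_rate_gbs 4294967296 2.69363
```

This prints 1.6, matching the sample. Since /dev/zero and /dev/null never touch a real file, the figure mostly reflects how quickly the kernel fills and discards buffers, so it is best treated as a rough comparison only.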
2. Testing memory performance with stream
2.1 Installation
shell
$ mkdir stream
$ cd stream/
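# Option 1: download stream.c from the upstream site
$ wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
# Option 2: clone a mirror hosted in China instead, then enter the cloned directory
$ git clone https://gitee.com/lldhsds/stream.git
$ cd stream/
# Compile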
$ gcc stream.c -O3 -fopenmp -DSTREAM_ARRAY_SIZE=1024*1024*1024 -DNTIMES=20 -mcmodel=medium -o stream.1g.20
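Compile options:
- stream.c: the source file to compile; the latest version is 5.10.
- -O3: compiler optimization level.
- -fopenmp: enables OpenMP so that the benchmark runs multi-threaded, which gets closer to the actual peak memory bandwidth on multi-processor systems. When enabled, the program defaults to one thread per CPU thread.
- -DSTREAM_ARRAY_SIZE: the number of elements in each of the test arrays a[], b[], and c[] (the array size). This value has a large effect on the results.
The stream.c source recommends that each array be at least 4 times the size of the last-level cache (L3 cache), and each array element is a double occupying 8 bytes. The recommended array size, rounded down to an integer, is:
L3 cache size (MB) × 1024 × 1024 × 4.1 × number of CPU sockets / 8, or equivalently L3 cache size (bytes) × 4.1 × number of CPU sockets / 8
For example, on a dual-socket machine with a 32 MB L3 cache, the value is 32 × 1024 × 1024 × 4.1 × 2 / 8 ≈ 34393292.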
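- -mcmodel=medium: needed when the statically allocated arrays exceed about 2 GB. On x86-64 the other common values are small (the default) and large; the set of available models depends on the target architecture (tiny, for example, exists only on some targets such as AArch64).
- -o stream.1g.20: name of the output executable; choose any name.
- -mtune=native -march=native (optional): optimizes for the instruction set of the build machine's CPU, which is appropriate here because the build machine is also the test machine.
- -DOFFSET=4096 (optional): array offset; usually does not need to be defined.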
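The array-size formula above can be sketched as a small shell helper. The function name `stream_array_size` is hypothetical; it takes the L3 cache size in MB and the number of sockets, and truncates the result to an integer:

```shell
# Hypothetical helper: recommended STREAM_ARRAY_SIZE.
# Implements: cache_MB * 1024 * 1024 * 4.1 * sockets / 8, truncated to an integer.
stream_array_size() {
    awk -v mb="$1" -v sockets="$2" \
        'BEGIN { printf "%d\n", mb * 1024 * 1024 * 4.1 * sockets / 8 }'
}

stream_array_size 32 2   # dual-socket machine, 32 MB L3 cache
stream_array_size 16 1   # single-socket machine, 16 MB L3 cache
```

The two calls print 34393292 and 8598323. The result is a lower bound; choosing a larger value is fine as long as the arrays still fit in memory.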
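Other notes:
- In stream 5.9 the array size is set with -DN=2000000. In 5.10 the macro is renamed to -DSTREAM_ARRAY_SIZE, with a default of 10000000.
- Account for memory capacity. As a rough estimate, STREAM_ARRAY_SIZE × 8 (bytes per double) × 3 (three arrays) <= 0.6 × M, where M is the memory available to the user.
- The array size must be much larger than the CPU's last-level cache (usually the L3 cache); otherwise the test measures cache throughput rather than memory throughput.
- Each test must run long enough for memory bandwidth to reach its sustained maximum; otherwise the real peak is not observed. If any of the four kernels completes in less than 20 microseconds (STREAM itself asks for at least 20 clock ticks per test), increase STREAM_ARRAY_SIZE.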
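The memory estimate in the notes above can be checked before compiling. The sketch below uses two hypothetical helpers, `stream_mem_mib` and `stream_fits`; the 8192 MiB of available memory in the last call is an assumed figure chosen only for illustration:

```shell
# Hypothetical helper: total MiB needed by the three STREAM arrays of 8-byte doubles.
stream_mem_mib() {
    echo $(( $1 * 8 * 3 / 1024 / 1024 ))
}

# Hypothetical helper: succeeds if the arrays fit within 60% of the
# available memory, given in MiB as the second argument.
stream_fits() {
    need=$(stream_mem_mib "$1")
    [ $(( need * 10 )) -le $(( $2 * 6 )) ]
}

stream_mem_mib $(( 1024 * 1024 * 100 ))    # a 100M-element array
stream_mem_mib $(( 1024 * 1024 * 1024 ))   # the 1G-element array from the compile example
# 8192 MiB is an assumed amount of available memory.
stream_fits $(( 1024 * 1024 * 100 )) 8192 && echo "fits in 8 GiB"
```

The first two calls print 2400 and 24576, so the 1G-element build from section 2.1 needs about 24 GiB for its arrays alone and would not pass the check on a machine with 8 GiB available.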
2.2 Running the test
shell
# The machine has a single CPU socket with a 16 MB L3 cache.
$ lscpu | grep -i "L3 cache\|Socket\|core"
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
L3 cache: 16384K
# The formula gives STREAM_ARRAY_SIZE = 16384*1024*4.1*1/8 = 8,598,323.2; a larger value of about 100 million (1024*1024*100) is used here. Compile the executable:
$ gcc stream.c -O3 -fopenmp -DSTREAM_ARRAY_SIZE=1024*1024*100 -DNTIMES=20 -mcmodel=medium -o stream.100M
$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 104857600 (elements), Offset = 0 (elements)
Memory per array = 800.0 MiB (= 0.8 GiB).
Total memory required = 2400.0 MiB (= 2.3 GiB).
Each kernel will be executed 20 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 29659 microseconds.
(= 29659 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 34701.7 0.066024 0.048347 0.084770
Scale: 39939.1 0.056470 0.042007 0.070259
Add: 41795.4 0.079521 0.060212 0.102223
Triad: 41073.6 0.079864 0.061270 0.102483
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
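Note
If STREAM_ARRAY_SIZE is set too large at compile time, stream needs more memory than is available at run time and fails with a segmentation fault. In that case, add memory or reduce STREAM_ARRAY_SIZE.
2.3 Analyzing the results
Record the Copy, Scale (multiply by a scalar), Add, and Triad (combined multiply-add) rates. Run the test several times and average the results to compare machines against each other.
How stream works:
Each Add operation makes three memory accesses (two reads and one write), as does Triad; Copy and Scale make two. The more memory accesses per operation, the better memory latency is hidden and the higher the measured bandwidth.
A single-core stream run is limited not only by the memory controller but also by the core's reorder buffer (ROB) and load/store units, so it is not a pure memory bandwidth test. A multi-core run issues many memory requests from all cores at once, saturating memory more fully and therefore measuring something closer to the bandwidth limit.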
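The access counts above are also how the reported rates are derived: stream.c counts 8 bytes per element for every array a kernel touches and divides by the best time, using 1 MB = 10^6 bytes. As a rough sketch (the helper name `stream_rate` is hypothetical), the Copy and Add figures from the sample run in section 2.2 can be reproduced from its array size and minimum times:

```shell
# Hypothetical helper: STREAM-style rate in MB/s (1 MB = 10^6 bytes).
# Arguments: arrays touched per iteration, array size in elements, best time in seconds.
stream_rate() {
    awk -v arrays="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", arrays * 8 * n / t / 1e6 }'
}

stream_rate 2 104857600 0.048347   # Copy: one read + one write per element
stream_rate 3 104857600 0.060212   # Add: two reads + one write per element
```

These print 34701.7 and 41795.4, the Copy and Add rates shown in the sample output.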
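3. Testing memory performance with sysbench
3.1 Installing sysbench
3.1.1 Installing from a package repository
On CentOS, sysbench is installed from the EPEL repository, which must be configured first.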
shell
# CentOS
sudo yum -y install sysbench
# Ubuntu
sudo apt -y install sysbench
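3.1.2 Building from source
Download, build, and install:
shell
# Option 1: download the source archive from GitHub
$ wget https://github.com/akopytov/sysbench/archive/master.zip
$ unzip master.zip
$ cd sysbench-master/
# Option 2: clone a mirror hosted in China instead
$ git clone https://gitee.com/mirrors/sysbench.git
$ cd sysbench/
$ ./autogen.sh
# If only memory is being tested and MySQL is not involved, pass the option below; otherwise configure fails when the MySQL development files are missing.
$ ./configure --without-mysql
# Build and install
$ make && sudo make install
Option reference: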
shell
$ sysbench --help
Usage:
sysbench [options]... [testname] [command]
Commands implemented by most tests: prepare run cleanup help
General options:
--threads=N number of threads to use [1]
--events=N limit for total number of events [0]
--time=N limit for total execution time in seconds [10]
--warmup-time=N execute events for this many seconds with statistics disabled before the actual benchmark run with statistics enabled [0]
--forced-shutdown=STRING number of seconds to wait after the --time limit before forcing shutdown, or 'off' to disable [off]
--thread-stack-size=SIZE size of stack per thread [64K]
--thread-init-timeout=N wait time in seconds for worker threads to initialize [30]
--rate=N average transactions rate. 0 for unlimited rate [0]
--report-interval=N periodically report intermediate statistics with a specified interval in seconds. 0 disables intermediate reports [0]
--report-checkpoints=[LIST,...] dump full statistics and reset all counters at specified points in time. The argument is a list of comma-separated values representing the amount of time in seconds elapsed from start of test when report checkpoint(s) must be performed. Report checkpoints are off by default. []
--debug[=on|off] print more debugging info [off]
--validate[=on|off] perform validation checks where possible [off]
--help[=on|off] print help and exit [off]
--version[=on|off] print version and exit [off]
--config-file=FILENAME File containing command line options
--luajit-cmd=STRING perform LuaJIT control command. This option is equivalent to 'luajit -j'. See LuaJIT documentation for more information
Pseudo-Random Numbers Generator options:
--rand-type=STRING random numbers distribution {uniform, gaussian, pareto, zipfian} to use by default [uniform]
--rand-seed=N seed for random number generator. When 0, the current time is used as an RNG seed. [0]
--rand-pareto-h=N shape parameter for the Pareto distribution [0.2]
--rand-zipfian-exp=N shape parameter (exponent, theta) for the Zipfian distribution [0.8]
Log options:
--verbosity=N verbosity level {5 - debug, 0 - only critical messages} [3]
--percentile=N percentile to calculate in latency statistics (1-100). Use the special value of 0 to disable percentile calculations [95]
--histogram[=on|off] print latency histogram in report [off]
General database options:
--db-driver=STRING specifies database driver to use ('help' to get list of available drivers)
--db-ps-mode=STRING prepared statements usage mode {auto, disable} [auto]
--db-debug[=on|off] print database-specific debug information [off]
Compiled-in database drivers:
Compiled-in tests:
fileio - File I/O test
cpu - CPU performance test
memory - Memory functions speed test
threads - Threads subsystem performance test
mutex - Mutex performance test
See 'sysbench <testname> help' for a list of options for each test.
3.1.3 Running the test
shell
# Show help for the memory test
$ sysbench memory help
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
memory options:
--memory-block-size=SIZE size of memory block for test [1K]
--memory-total-size=SIZE total size of data to transfer [100G]
--memory-scope=STRING memory access scope {global,local} [global]
--memory-hugetlb[=on|off] allocate memory from HugeTLB pool [off]
--memory-oper=STRING type of memory operations {read, write, none} [write]
--memory-access-mode=STRING memory access mode {seq,rnd} [seq]
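# Memory read test: sequential read of 100G of data with an 8K block size, reporting every 1 s.
# Note: the sample output below reports "operation: write". It appears to have been captured from a command that lacked the space before --memory-oper, in which case that option is not applied and the default (write) is used.
$ sysbench memory --threads=4 --time=60 --report-interval=1 --memory-block-size=8K --memory-total-size=100G --memory-oper=read --memory-access-mode=seq run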
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 4
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 8KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
[ 1s ] 7663.72 MiB/sec
[ 2s ] 3820.58 MiB/sec
[ 3s ] 2627.22 MiB/sec
[ 4s ] 2616.21 MiB/sec
...
[ 31s ] 2542.26 MiB/sec
[ 32s ] 2532.57 MiB/sec
[ 33s ] 2474.34 MiB/sec
[ 34s ] 2760.10 MiB/sec
Total operations: 13107200 (375099.39 per second)
102400.00 MiB transferred (2930.46 MiB/sec) # average read/write throughput
Throughput:
events/s (eps): 375099.3918
time elapsed: 34.9433s
total number of events: 13107200 # number of events; one event is a read or write of one memory block
Latency (ms):
min: 0.00
avg: 0.01
max: 16.04
95th percentile: 0.02
sum: 130907.47
Threads fairness:
events (avg/stddev): 3276800.0000/0.00
execution time (avg/stddev): 32.7269/0.09
# Memory write test: sequential write of 100G of data with an 8K block size, reporting every 1 s.
$ sysbench memory --threads=4 --time=60 --report-interval=1 --memory-block-size=8K --memory-total-size=100G --memory-oper=write --memory-access-mode=seq run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 4
Report intermediate results every 1 second(s)
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 8KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
[ 1s ] 2745.40 MiB/sec
[ 2s ] 2692.62 MiB/sec
[ 3s ] 2712.13 MiB/sec
...
[ 28s ] 2806.32 MiB/sec
[ 29s ] 2747.49 MiB/sec
[ 30s ] 2721.71 MiB/sec
[ 31s ] 5733.25 MiB/sec
Total operations: 13107200 (420671.73 per second)
102400.00 MiB transferred (3286.50 MiB/sec)
Throughput:
events/s (eps): 420671.7259
time elapsed: 31.1578s
total number of events: 13107200
Latency (ms):
min: 0.00
avg: 0.01
max: 20.13
95th percentile: 0.02
sum: 115533.04
Threads fairness:
events (avg/stddev): 3276800.0000/0.00
execution time (avg/stddev): 28.8833/0.31
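3.1.4 Analyzing the results
Record the average read and write throughput. Vary the test parameters, run several times, and average the results.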
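Averaging over several runs is easy to script. The sketch below assumes the sysbench output of each run has been saved and is fed on standard input; the helper name `sysbench_avg_mibs` is hypothetical, and it relies only on the "MiB transferred (... MiB/sec)" summary line shown in the sample output:

```shell
# Hypothetical helper: average the "MiB transferred (... MiB/sec)" figure
# across one or more sysbench outputs read from standard input.
sysbench_avg_mibs() {
    awk '/MiB transferred/ {
             rate = $4              # fourth field looks like (2930.46
             sub(/^\(/, "", rate)   # drop the leading parenthesis
             sum += rate; runs++
         }
         END { if (runs) printf "%.2f\n", sum / runs }'
}

# The two summary lines from the sample runs above, used only to exercise the helper.
printf '%s\n' \
    '102400.00 MiB transferred (2930.46 MiB/sec)' \
    '102400.00 MiB transferred (3286.50 MiB/sec)' | sysbench_avg_mibs
```

This prints 3108.48. In practice, average only runs that share the same parameters (operation, block size, thread count), since mixing them makes the figure hard to interpret.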
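4. Testing memory with memtester
memtester is a utility for checking memory correctness, aimed mainly at hardware developers. Since version 4.1.0 it can test memory starting at a specified physical address.
It can also be used to create a high memory load.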
shell
# Download, build, and install
$ wget https://pyropus.ca./software/memtester/old-versions/memtester-4.6.0.tar.gz
$ tar xf memtester-4.6.0.tar.gz
$ cd memtester-4.6.0/
$ make && sudo make install
# Usage
Usage: memtester [-p physaddrbase [-d device]] <mem>[B|K|M|G] [loops]
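# Given the amount of memory to test and the number of loops, memtester runs tests including random value, XOR comparison, subtraction, multiplication, division, and OR/AND operations.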
$ sudo memtester 1G 3
memtester version 4.6.0 (64-bit)
Copyright (C) 2001-2020 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got 1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1/3:
...
Loop 3/3:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Done.
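When memtester is run from a script, its exit status is more convenient to check than the per-test output. The sketch below decodes that status with a hypothetical helper, `memtester_status`. The bit meanings are taken from the memtester man page as best recalled and should be treated as an assumption to verify against the installed version:

```shell
# Hypothetical helper: explain a memtester exit status.
# Assumed bit meanings (per the memtester man page; verify for your version):
#   1 = allocation, locking, or invocation error
#   2 = failure in the stuck address test
#   4 = failure in one of the other tests
memtester_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "ok"
        return
    fi
    msg=""
    [ $(( code & 1 )) -ne 0 ] && msg="$msg allocation/locking/invocation-error"
    [ $(( code & 2 )) -ne 0 ] && msg="$msg stuck-address-failure"
    [ $(( code & 4 )) -ne 0 ] && msg="$msg other-test-failure"
    echo "${msg# }"
}

memtester_status 0
memtester_status 6
```

A typical use would be to run `sudo memtester 1G 3` and then pass `$?` to the helper.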
5. Testing memory performance with mbw
shell
# Ubuntu: install from the package repository
$ sudo apt install -y mbw
# CentOS: build from source
$ git clone https://github.com/raas/mbw.git
$ cd mbw
$ make
# Help
$ mbw -h
Usage: mbw [options] array_size_in_MiB
Options:
-n: number of runs per test
-a: Don't display average
-t0: memcpy test # memory copy with memcpy
-t1: dumb (b[i]=a[i] style) test # element-by-element copy loop
-t2 : memcpy test with fixed block size # memcpy in fixed-size blocks
-b <size>: block size in bytes for -t2 (default: 262144)
-q: quiet (print statistics only)
(will then use two arrays, watch out for swapping)
'Bandwidth' is amount of data copied over the time this operation took.
The default is to run all tests available.
# Run the test: -q suppresses log output, -n 10 runs each test 10 times, and 256 is the array size in MiB
$ ./mbw -q -n 10 256
0 Method: MEMCPY Elapsed: 0.04187 MiB: 256.00000 Copy: 6114.455 MiB/s
1 Method: MEMCPY Elapsed: 0.04571 MiB: 256.00000 Copy: 5600.525 MiB/s
2 Method: MEMCPY Elapsed: 0.05306 MiB: 256.00000 Copy: 4824.727 MiB/s
3 Method: MEMCPY Elapsed: 0.05574 MiB: 256.00000 Copy: 4592.999 MiB/s
4 Method: MEMCPY Elapsed: 0.06371 MiB: 256.00000 Copy: 4018.460 MiB/s
5 Method: MEMCPY Elapsed: 0.05230 MiB: 256.00000 Copy: 4894.744 MiB/s
6 Method: MEMCPY Elapsed: 0.05222 MiB: 256.00000 Copy: 4902.336 MiB/s
7 Method: MEMCPY Elapsed: 0.05833 MiB: 256.00000 Copy: 4388.446 MiB/s
8 Method: MEMCPY Elapsed: 0.05498 MiB: 256.00000 Copy: 4656.662 MiB/s
9 Method: MEMCPY Elapsed: 0.05776 MiB: 256.00000 Copy: 4431.903 MiB/s
AVG Method: MEMCPY Elapsed: 0.05357 MiB: 256.00000 Copy: 4779.017 MiB/s
0 Method: DUMB Elapsed: 0.04523 MiB: 256.00000 Copy: 5659.585 MiB/s
1 Method: DUMB Elapsed: 0.04219 MiB: 256.00000 Copy: 6067.357 MiB/s
2 Method: DUMB Elapsed: 0.03677 MiB: 256.00000 Copy: 6962.197 MiB/s
3 Method: DUMB Elapsed: 0.04211 MiB: 256.00000 Copy: 6078.739 MiB/s
4 Method: DUMB Elapsed: 0.04162 MiB: 256.00000 Copy: 6150.446 MiB/s
5 Method: DUMB Elapsed: 0.04325 MiB: 256.00000 Copy: 5919.075 MiB/s
6 Method: DUMB Elapsed: 0.04290 MiB: 256.00000 Copy: 5966.671 MiB/s
7 Method: DUMB Elapsed: 0.03596 MiB: 256.00000 Copy: 7120.011 MiB/s
8 Method: DUMB Elapsed: 0.03747 MiB: 256.00000 Copy: 6831.950 MiB/s
9 Method: DUMB Elapsed: 0.03587 MiB: 256.00000 Copy: 7137.281 MiB/s
AVG Method: DUMB Elapsed: 0.04034 MiB: 256.00000 Copy: 6346.342 MiB/s
0 Method: MCBLOCK Elapsed: 0.03189 MiB: 256.00000 Copy: 8026.336 MiB/s
1 Method: MCBLOCK Elapsed: 0.03841 MiB: 256.00000 Copy: 6664.931 MiB/s
2 Method: MCBLOCK Elapsed: 0.03263 MiB: 256.00000 Copy: 7846.503 MiB/s
3 Method: MCBLOCK Elapsed: 0.03469 MiB: 256.00000 Copy: 7379.648 MiB/s
4 Method: MCBLOCK Elapsed: 0.03270 MiB: 256.00000 Copy: 7828.986 MiB/s
5 Method: MCBLOCK Elapsed: 0.03393 MiB: 256.00000 Copy: 7544.056 MiB/s
6 Method: MCBLOCK Elapsed: 0.03700 MiB: 256.00000 Copy: 6919.293 MiB/s
7 Method: MCBLOCK Elapsed: 0.03924 MiB: 256.00000 Copy: 6523.623 MiB/s
8 Method: MCBLOCK Elapsed: 0.04240 MiB: 256.00000 Copy: 6037.736 MiB/s
9 Method: MCBLOCK Elapsed: 0.03011 MiB: 256.00000 Copy: 8503.288 MiB/s
AVG Method: MCBLOCK Elapsed: 0.03530 MiB: 256.00000 Copy: 7252.125 MiB/s
# Higher is better
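When comparing machines it can help to recompute the per-method averages from saved mbw output, for instance after dropping warm-up runs. The sketch below is one way to do that with a hypothetical helper, `mbw_avg`; it assumes the line layout shown in the sample output and ignores the AVG lines that mbw prints itself:

```shell
# Hypothetical helper: average the per-run copy rate (MiB/s) for each method
# in mbw output read from standard input, skipping mbw's own AVG lines.
mbw_avg() {
    awk '$2 == "Method:" && $1 != "AVG" { sum[$3] += $9; runs[$3]++ }
         END { for (m in sum) printf "%s %.3f\n", m, sum[m] / runs[m] }' | sort
}

# First two MEMCPY and DUMB runs from the sample output above.
printf '%s\n' \
    '0 Method: MEMCPY Elapsed: 0.04187 MiB: 256.00000 Copy: 6114.455 MiB/s' \
    '1 Method: MEMCPY Elapsed: 0.04571 MiB: 256.00000 Copy: 5600.525 MiB/s' \
    '0 Method: DUMB Elapsed: 0.04523 MiB: 256.00000 Copy: 5659.585 MiB/s' \
    '1 Method: DUMB Elapsed: 0.04219 MiB: 256.00000 Copy: 6067.357 MiB/s' | mbw_avg
```

For these four lines it prints `DUMB 5863.471` followed by `MEMCPY 5857.490`.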