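I have started learning software optimization and discovered Easyperf, a real gem of a blogger, along with his book Performance Analysis and Tuning on Modern CPUs.
These notes cover the first two videos of Performance Ninja.
They mostly deal with environment setup and optimizing a toy program; the goal is to get familiar with the whole workflow.
The Linux perf tool has to be installed, and its package must match the running kernel, so check your kernel version first if you set this up on your own machine.
Here is my kernel version: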
sh
xxx@G15-5511-Ubuntu-005:~$ uname -a
Linux G15-5511-Ubuntu-005 6.8.0-85-generic #85~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 19 16:18:59 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
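P.S. This looks a bit odd: my OS is Ubuntu 22.04, yet the kernel is 6.8.0 (presumably the HWE kernel, judging by the ~22.04.1 suffix). It does not affect the configuration that follows.
Create the Docker container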
bash
docker run --privileged --cap-add SYS_ADMIN --pid=host -it -v /home/xxx:/home/xxx --name perf-ninja ubuntu:22.04 /bin/bash
Install the required software from the container's shell
bash
apt-get update
apt-get upgrade -y
apt-get install build-essential git-core wget cmake -y
mkdir /workspace
cd /workspace
Install Google Benchmark, following perf-ninja/GetStarted.md and make_benchmark_library.sh
bash
# Check out the library.
git clone https://github.com/google/benchmark.git
# Benchmark requires Google Test as a dependency. Add the source tree as a subdirectory.
git clone https://github.com/google/googletest.git benchmark/googletest
# Go to the library root directory
cd /workspace/benchmark
# Make a build directory to place the build output.
cmake -E make_directory "build"
# Generate build system files with cmake.
cmake -E chdir "build" cmake -DCMAKE_BUILD_TYPE=Release ../
# or, starting with CMake 3.13, use a simpler form:
# cmake -DCMAKE_BUILD_TYPE=Release -S . -B "build"
# Build the library.
cmake --build "build" --config Release --parallel 4
# Install the release build of the Google Benchmark library.
cd /workspace/benchmark
cmake --build "build" --config Release --target install
Install Clang-17
bash
cd /workspace/
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
apt install lsb-release wget software-properties-common gnupg -y
./llvm.sh 17 all
# Use clang-17 for building the labs by registering it as the system default C and C++ compiler.
update-alternatives --install /usr/bin/cc cc /usr/bin/clang-17 30
update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang++-17 30
Build the lab program
bash
# build lab assignment
cd /workspace
git clone https://github.com/dendibakh/perf-ninja.git
cd /workspace/perf-ninja/labs/misc/warmup
cmake -E make_directory build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release --parallel 8
cmake --build . --target validateLab
cmake --build . --target benchmarkLab
To build in debug mode instead, use the following commands
bash
cmake -DCMAKE_BUILD_TYPE=Debug .. -DCMAKE_C_FLAGS="-g" -DCMAKE_CXX_FLAGS="-g"
cmake --build . --config Debug --parallel 8
cmake --build . --target validateLab
cmake --build . --target benchmarkLab
Running the benchmark target prints output like this
bash
2025-10-22T21:19:16+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1280 KiB (x8)
L3 Unified 24576 KiB (x1)
Load Average: 0.18, 0.15, 0.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
bench1 23.7 ns 23.7 ns 119575979
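You can collect performance data with perf record (the program prints the same benchmark output as above) and then analyze and display it with perf report, which also shows the corresponding assembly code and more.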
bash
perf record ./lab
perf report
Warmup lab
In the directory /workspace/perf-ninja/labs/misc/warmup/,
bench.cpp is the framework file. It defines the optimization task: summing the numbers from 1 to 1000.
cpp
#include "benchmark/benchmark.h"
#include "solution.h"
#include <iostream>

static void bench1(benchmark::State &state) {
  // problem: count sum of all the numbers up to N
  constexpr int N = 1000;
  int arr[N];
  for (int i = 0; i < N; i++) {
    arr[i] = i + 1;
  }
  int result = 0;
  // benchmark
  for (auto _ : state) {
    result = solution(arr, N);
    benchmark::DoNotOptimize(arr);
  }
}

// Register the function as a benchmark
BENCHMARK(bench1);
// Run the benchmark
BENCHMARK_MAIN();
The solution function is the one we need to optimize; it lives in solution.cpp and solution.h
cpp
#include "solution.h"

int solution(int *arr, int N) {
  int res = 0;
  for (int i = 0; i < N; i++) {
    res += arr[i];
  }
  return res;
}
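Build, validate, and display the performance data in one command line. (The build step passes --config Debug, but that flag only matters for multi-config generators; with the Makefile generator used here the build type stays Release.)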
bash
root@c68f663b1972:/workspace/perf-ninja/labs/misc/warmup/build# cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build . --config Debug --parallel 8 && cmake --build . --target validateLab && cmake --build . --target benchmarkLab
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/perf-ninja/labs/misc/warmup/build
Consolidate compiler generated dependencies of target validate
Consolidate compiler generated dependencies of target lab
[ 16%] Building CXX object CMakeFiles/validate.dir/solution.cpp.o
[ 33%] Building CXX object CMakeFiles/lab.dir/solution.cpp.o
[ 66%] Linking CXX executable validate
[ 66%] Linking CXX executable lab
[ 83%] Built target validate
[100%] Built target lab
Consolidate compiler generated dependencies of target validate
[100%] Built target validate
Validation Successful
[100%] Built target validateLab
Consolidate compiler generated dependencies of target lab
[100%] Built target lab
2025-10-22T21:19:16+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1280 KiB (x8)
L3 Unified 24576 KiB (x1)
Load Average: 0.18, 0.15, 0.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
bench1 23.7 ns 23.7 ns 119575979
[100%] Built target benchmarkLab
From the story of Gauss we know the closed-form formula for this sum:
$$(1 + N) * N / 2$$
Since the array holds exactly the numbers 1 to N, we can change solution.cpp to
cpp
#include "solution.h"

int solution(int *arr, int N) {
  return (N + 1) * N / 2;
}
Building and running it gives
bash
root@c68f663b1972:/workspace/perf-ninja/labs/misc/warmup/build# cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build . --config Debug --parallel 8 && cmake --build . --target validateLab && cmake --build . --target benchmarkLab
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/perf-ninja/labs/misc/warmup/build
Consolidate compiler generated dependencies of target validate
Consolidate compiler generated dependencies of target lab
[ 16%] Building CXX object CMakeFiles/validate.dir/solution.cpp.o
[ 33%] Building CXX object CMakeFiles/lab.dir/solution.cpp.o
[ 66%] Linking CXX executable lab
[ 66%] Linking CXX executable validate
[ 83%] Built target validate
[100%] Built target lab
Consolidate compiler generated dependencies of target validate
[100%] Built target validate
Validation Successful
[100%] Built target validateLab
Consolidate compiler generated dependencies of target lab
[100%] Built target lab
2025-10-22T21:19:33+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1280 KiB (x8)
L3 Unified 24576 KiB (x1)
Load Average: 0.14, 0.14, 0.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
bench1 0.692 ns 0.692 ns 4007876365
[100%] Built target benchmarkLab
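The result is correct (cmake --build . --target validateLab reported no error), and the measured time dropped from 23.7 ns to 0.692 ns, roughly a 34x speedup.
If you want to push your solution to a private branch on GitHub, see the second half of the video tutorial Warmup Lab Assignment.
Troubleshooting
Installing the perf tool
A Docker container uses the host's kernel, so the tools matching the host kernel version (6.8.0 here) have to be installed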
bash
echo "deb http://archive.ubuntu.com/ubuntu noble main restricted" | tee /etc/apt/sources.list.d/noble-perf.list
apt update
apt install linux-tools-6.8.0-85-generic linux-cloud-tools-6.8.0-85-generic -y