Perf-Ninja Course Notes: Environment Setup and Warmup

I recently started learning software optimization and came across Easyperf, a gem of a blog, along with its author's book "Performance Analysis and Tuning on Modern CPUs".

These notes cover the first two videos of Performance Ninja.

They deal mainly with environment setup and optimizing a toy program; the goal is to get familiar with the whole workflow.

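The labs rely on the Linux perf command, whose packages are tied to the kernel version, so if you set up the environment on your own machine, check your kernel version first.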

Here is my kernel version:

xxx@G15-5511-Ubuntu-005:~$ uname -a
Linux G15-5511-Ubuntu-005 6.8.0-85-generic #85~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 19 16:18:59 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

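P.S. This looks a bit odd: my OS is Ubuntu 22.04, yet the kernel is 6.8.0 (most likely the 22.04 HWE kernel, as the ~22.04.1 suffix suggests). It does not affect the configuration below.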

Creating the Docker container

docker run --privileged --cap-add SYS_ADMIN --pid=host -it -v /home/xxx:/home/xxx --name perf-ninja ubuntu:22.04 /bin/bash

Install the required software from the container's shell:

apt-get update
apt-get upgrade -y
apt-get install build-essential git-core wget cmake -y
mkdir /workspace
cd /workspace

Install Google Benchmark as described in perf-ninja/GetStarted.md and make_benchmark_library.sh:

# Check out the library.
git clone https://github.com/google/benchmark.git
# Benchmark requires Google Test as a dependency. Add the source tree as a subdirectory.
git clone https://github.com/google/googletest.git benchmark/googletest
# Go to the library root directory
cd /workspace/benchmark
# Make a build directory to place the build output.
cmake -E make_directory "build"
# Generate build system files with cmake.
cmake -E chdir "build" cmake -DCMAKE_BUILD_TYPE=Release ../
# or, starting with CMake 3.13, use a simpler form:
# cmake -DCMAKE_BUILD_TYPE=Release -S . -B "build"
# Build the library.
cmake --build "build" --config Release --parallel 4

# Install the release build of the Google Benchmark library
cd /workspace/benchmark
cmake --build "build" --config Release --target install

Installing Clang-17

cd /workspace/
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
apt install lsb-release wget software-properties-common gnupg -y
./llvm.sh 17 all
# Enable the clang-17 compiler for building the labs. To make clang-17 the system default, do the following:
update-alternatives --install /usr/bin/cc cc /usr/bin/clang-17 30
update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang++-17 30

Building the lab program

# build lab assignment
cd /workspace
git clone https://github.com/dendibakh/perf-ninja.git
cd /workspace/perf-ninja/labs/misc/warmup
cmake -E make_directory build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release --parallel 8
cmake --build . --target validateLab
cmake --build . --target benchmarkLab

For a debug build, compile with the following commands instead:

cmake -DCMAKE_BUILD_TYPE=Debug .. -DCMAKE_C_FLAGS="-g" -DCMAKE_CXX_FLAGS="-g"
cmake --build . --config Debug --parallel 8
cmake --build . --target validateLab
cmake --build . --target benchmarkLab

Running the benchmark target produces output like this:

2025-10-22T21:19:16+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.18, 0.15, 0.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bench1           23.7 ns         23.7 ns    119575979

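You can collect profiling data with perf record (the benchmark prints the same output as above), then analyze and browse it with perf report, which also shows the corresponding assembly.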

perf record ./lab
perf report

Warmup lab

In the directory /workspace/perf-ninja/labs/misc/warmup/, bench.cpp is the harness file. It defines the optimization task: summing the numbers from 1 to 1000.

#include "benchmark/benchmark.h"
#include "solution.h"
#include <iostream>

static void bench1(benchmark::State &state) {
  // problem: count sum of all the numbers up to N
  constexpr int N = 1000;
  int arr[N];
  for (int i = 0; i < N; i++) {
    arr[i] = i + 1;
  }

  int result = 0;

  // benchmark
  for (auto _ : state) {
    result = solution(arr, N);
    benchmark::DoNotOptimize(arr);
  }
}

// Register the function as a benchmark
BENCHMARK(bench1);

// Run the benchmark
BENCHMARK_MAIN();

The solution function is what we need to optimize. It is declared in solution.h and defined in solution.cpp:

#include "solution.h"
int solution(int *arr, int N) {
  int res = 0;
  for (int i = 0; i < N; i++) {
    res += arr[i];
  }
  return res;
}
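
To make the measurement loop in bench.cpp less of a black box, here is a minimal sketch of the same idea without Google Benchmark. It is not how the library is implemented: keep_alive and ns_per_call are hypothetical names introduced for illustration, and keep_alive is only a rough stand-in for benchmark::DoNotOptimize that assumes GCC or Clang (it uses their extended inline-asm syntax).

```cpp
#include <cassert>
#include <chrono>

// Reference implementation, same as solution.cpp above.
int solution(int *arr, int N) {
  int res = 0;
  for (int i = 0; i < N; i++) {
    res += arr[i];
  }
  return res;
}

// Hypothetical stand-in for benchmark::DoNotOptimize (GCC/Clang only).
// The empty asm statement takes the pointer as an input and clobbers
// memory, so the compiler must assume the pointed-to data was read and
// possibly changed. That keeps the call in the loop from being hoisted
// out or discarded.
inline void keep_alive(void *p) { asm volatile("" : : "r"(p) : "memory"); }

// Hypothetical helper: average wall-clock nanoseconds per solution() call.
double ns_per_call(int *arr, int N, long iters) {
  assert(iters > 0);
  int result = 0;
  auto start = std::chrono::steady_clock::now();
  for (long i = 0; i < iters; i++) {
    result = solution(arr, N);
    keep_alive(arr);      // the input may have changed: recompute next time
    keep_alive(&result);  // the output is "used": do not drop the call
  }
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iters);
}
```

Calling ns_per_call(arr, 1000, 1000000) on an array filled with 1..1000 gives a rough per-call time. Google Benchmark additionally picks the iteration count automatically and reports CPU time, so its numbers are the ones to trust.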

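Build, validate, and benchmark with a single command line. (The --config Debug in the command below has no effect with the single-configuration Makefile generator used here; the build type stays Release, as set by -DCMAKE_BUILD_TYPE.)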

root@c68f663b1972:/workspace/perf-ninja/labs/misc/warmup/build# cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build . --config Debug --parallel 8 && cmake --build . --target validateLab && cmake --build . --target benchmarkLab
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/perf-ninja/labs/misc/warmup/build
Consolidate compiler generated dependencies of target validate
Consolidate compiler generated dependencies of target lab
[ 16%] Building CXX object CMakeFiles/validate.dir/solution.cpp.o
[ 33%] Building CXX object CMakeFiles/lab.dir/solution.cpp.o
[ 66%] Linking CXX executable validate
[ 66%] Linking CXX executable lab
[ 83%] Built target validate
[100%] Built target lab
Consolidate compiler generated dependencies of target validate
[100%] Built target validate
Validation Successful
[100%] Built target validateLab
Consolidate compiler generated dependencies of target lab
[100%] Built target lab
2025-10-22T21:19:16+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.18, 0.15, 0.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bench1           23.7 ns         23.7 ns    119575979
[100%] Built target benchmarkLab

From the well-known story about Gauss, we know the closed-form formula for this sum:
$(1 + N) \times N / 2$
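
The pairing argument behind this formula can be written out as follows (a worked derivation added for clarity; $S$ is just a name chosen here for the sum):

```latex
\begin{aligned}
S  &= 1 + 2 + \dots + N \\
S  &= N + (N-1) + \dots + 1 \\
2S &= \underbrace{(N+1) + (N+1) + \dots + (N+1)}_{N \text{ terms}} = N(N+1) \\
S  &= \frac{N(N+1)}{2}, \qquad N = 1000 \;\Rightarrow\; S = 500500
\end{aligned}
```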

Because the benchmark fills arr with exactly 1..N, we can change solution.cpp to:

#include "solution.h"

int solution(int *arr, int N) {
  return (N+1)*N/2;
}

Rebuilding and running gives:

root@c68f663b1972:/workspace/perf-ninja/labs/misc/warmup/build# cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build . --config Debug --parallel 8 && cmake --build . --target validateLab && cmake --build . --target benchmarkLab
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/perf-ninja/labs/misc/warmup/build
Consolidate compiler generated dependencies of target validate
Consolidate compiler generated dependencies of target lab
[ 16%] Building CXX object CMakeFiles/validate.dir/solution.cpp.o
[ 33%] Building CXX object CMakeFiles/lab.dir/solution.cpp.o
[ 66%] Linking CXX executable lab
[ 66%] Linking CXX executable validate
[ 83%] Built target validate
[100%] Built target lab
Consolidate compiler generated dependencies of target validate
[100%] Built target validate
Validation Successful
[100%] Built target validateLab
Consolidate compiler generated dependencies of target lab
[100%] Built target lab
2025-10-22T21:19:33+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.14, 0.14, 0.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bench1          0.692 ns        0.692 ns   4007876365
[100%] Built target benchmarkLab

The result is correct (the validateLab target reports "Validation Successful"), and the time per iteration drops from 23.7 ns to 0.692 ns, roughly a 34x speedup.
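
The closed form ignores arr entirely, so it is only equivalent to the loop under the lab's assumption that arr holds 1..N. The sketch below checks that equivalence outside the lab and shows where the int arithmetic would overflow. The names loop_sum, closed_form_sum, fits_in_int, and agrees_with_loop are hypothetical helpers introduced for illustration, and the overflow bound assumes a 32-bit int.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Reference: the original loop from solution.cpp.
int loop_sum(const int *arr, int N) {
  int res = 0;
  for (int i = 0; i < N; i++) {
    res += arr[i];
  }
  return res;
}

// Closed form; equals loop_sum only when arr holds exactly 1..N.
constexpr int closed_form_sum(int N) { return (N + 1) * N / 2; }

// Hypothetical helper: true if the intermediate product (N + 1) * N
// still fits in an int, i.e. the closed form does not overflow.
constexpr bool fits_in_int(int N) {
  return static_cast<long long>(N + 1) * N <= INT_MAX;
}

// Hypothetical helper: compare both versions on arr = {1, 2, ..., N}.
bool agrees_with_loop(int N) {
  assert(N >= 0);
  std::vector<int> arr(N);
  for (int i = 0; i < N; i++) {
    arr[i] = i + 1;
  }
  return loop_sum(arr.data(), N) == closed_form_sum(N);
}

// Compile-time check of the lab's case, N = 1000.
static_assert(closed_form_sum(1000) == 500500, "Gauss sum for N = 1000");
```

For the lab's N = 1000 both versions agree. With a 32-bit int the product (N + 1) * N stops fitting at N = 46341, so a general-purpose version would need a wider type.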

If you want to push your solution to a private branch on GitHub, see the second half of the video tutorial "Warmup Lab Assignment".

Troubleshooting

Installing the perf tool

A Docker container shares the host's kernel, so the perf tools installed inside it must match the host's 6.8.0 kernel:

echo "deb http://archive.ubuntu.com/ubuntu noble main restricted" | tee /etc/apt/sources.list.d/noble-perf.list
apt update
apt install linux-tools-6.8.0-85-generic linux-cloud-tools-6.8.0-85-generic -y