Perf-Ninja Course Notes - Environment Setup and Warmup

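I have started learning software optimization and came across Easyperf, a real gem of a blogger, and his book Performance Analysis and Tuning on Modern CPUs.

These notes cover the first two videos of Performance Ninja.

They deal mainly with environment setup and optimizing a toy program; the goal is to get familiar with the whole workflow.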

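The labs rely on the Linux perf command, and perf has to match the running kernel. If you want to set up the environment on your own machine, check your kernel version first.

Here is my kernel version: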

xxx@G15-5511-Ubuntu-005:~$ uname -a
Linux G15-5511-Ubuntu-005 6.8.0-85-generic #85~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 19 16:18:59 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

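P.S. This looks a little odd: my OS is Ubuntu 22.04, yet the kernel is 6.8.0. The ~22.04.1 suffix in the version string suggests it is a newer kernel backported to 22.04 (most likely the HWE kernel). Either way, it does not affect the setup that follows.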

Create the Docker container

docker run --privileged --cap-add SYS_ADMIN --pid=host -it -v /home/xxx:/home/xxx --name perf-ninja ubuntu:22.04 /bin/bash

Install the required software from the container's shell

apt-get update
apt-get upgrade -y
apt-get install build-essential git-core wget cmake -y
mkdir /workspace
cd /workspace

Install Google Benchmark by following perf-ninja's GetStarted.md and make_benchmark_library.sh

# Check out the library.
git clone https://github.com/google/benchmark.git
# Benchmark requires Google Test as a dependency. Add the source tree as a subdirectory.
git clone https://github.com/google/googletest.git benchmark/googletest
# Go to the library root directory
cd /workspace/benchmark
# Make a build directory to place the build output.
cmake -E make_directory "build"
# Generate build system files with cmake.
cmake -E chdir "build" cmake -DCMAKE_BUILD_TYPE=Release ../
# or, starting with CMake 3.13, use a simpler form:
# cmake -DCMAKE_BUILD_TYPE=Release -S . -B "build"
# Build the library.
cmake --build "build" --config Release --parallel 4

# Install the release build of the Google Benchmark library system-wide.
cd /workspace/benchmark
cmake --build "build" --config Release --target install

Install Clang-17

cd /workspace/
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
apt install lsb-release wget software-properties-common gnupg -y
./llvm.sh 17 all
# Use clang-17 to build the labs. To make clang-17 the system default compiler, register it with update-alternatives:
update-alternatives --install /usr/bin/cc cc /usr/bin/clang-17 30
update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang++-17 30

Build the lab program

# build lab assignment
cd /workspace
git clone https://github.com/dendibakh/perf-ninja.git
cd /workspace/perf-ninja/labs/misc/warmup
cmake -E make_directory build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release --parallel 8
cmake --build . --target validateLab
cmake --build . --target benchmarkLab

For a debug build, compile with the following commands instead:

cmake -DCMAKE_BUILD_TYPE=Debug .. -DCMAKE_C_FLAGS="-g" -DCMAKE_CXX_FLAGS="-g"
cmake --build . --config Debug --parallel 8
cmake --build . --target validateLab
cmake --build . --target benchmarkLab

The benchmarkLab target prints output like this:

2025-10-22T21:19:16+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.18, 0.15, 0.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bench1           23.7 ns         23.7 ns    119575979

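You can collect performance data with perf record (the program prints the same benchmark output as above) and then analyze and browse it with perf report, which also shows the corresponding assembly, among other things.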

perf record ./lab
perf report

Warmup lab

In the directory /workspace/perf-ninja/labs/misc/warmup/, bench.cpp is the harness file. It defines the optimization task: sum the numbers from 1 to 1000.

#include "solution.h"
#include <iostream>

static void bench1(benchmark::State &state) {
  // problem: count sum of all the numbers up to N
  constexpr int N = 1000;
  int arr[N];
  for (int i = 0; i < N; i++) {
    arr[i] = i + 1;
  }

  int result = 0;

  // benchmark
  for (auto _ : state) {
    result = solution(arr, N);
    benchmark::DoNotOptimize(arr);
  }
}

// Register the function as a benchmark
BENCHMARK(bench1);

// Run the benchmark
BENCHMARK_MAIN();

The solution function is the one we need to optimize. It lives in solution.cpp and solution.h:

#include "solution.h"
int solution(int *arr, int N) {
  int res = 0;
  for (int i = 0; i < N; i++) {
    res += arr[i];
  }
  return res;
}
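
Before optimizing, it helps to know what the baseline should return. The sketch below repeats the loop above and feeds it the same 1, 2, ..., N input that bench.cpp builds; baseline_sum_up_to is a hypothetical helper of mine, not part of perf-ninja.

```cpp
#include <cassert>
#include <vector>

// Same loop as the lab's baseline solution().
int solution(int *arr, int N) {
  int res = 0;
  for (int i = 0; i < N; i++) {
    res += arr[i];
  }
  return res;
}

// Hypothetical helper (not in perf-ninja): rebuilds the harness input
// 1, 2, ..., n and returns what solution() computes for it.
int baseline_sum_up_to(int n) {
  assert(n >= 0);
  std::vector<int> arr(n);
  for (int i = 0; i < n; i++) {
    arr[i] = i + 1;
  }
  return solution(arr.data(), n);
}
```

For the lab's N = 1000, baseline_sum_up_to(1000) returns 500500, which is the value any optimized version has to reproduce.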

Build, validate, and print the benchmark numbers with a single command:

root@c68f663b1972:/workspace/perf-ninja/labs/misc/warmup/build# cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build . --config Release --parallel 8 && cmake --build . --target validateLab && cmake --build . --target benchmarkLab
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/perf-ninja/labs/misc/warmup/build
Consolidate compiler generated dependencies of target validate
Consolidate compiler generated dependencies of target lab
[ 16%] Building CXX object CMakeFiles/validate.dir/solution.cpp.o
[ 33%] Building CXX object CMakeFiles/lab.dir/solution.cpp.o
[ 66%] Linking CXX executable validate
[ 66%] Linking CXX executable lab
[ 83%] Built target validate
[100%] Built target lab
Consolidate compiler generated dependencies of target validate
[100%] Built target validate
Validation Successful
[100%] Built target validateLab
Consolidate compiler generated dependencies of target lab
[100%] Built target lab
2025-10-22T21:19:16+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.18, 0.15, 0.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bench1           23.7 ns         23.7 ns    119575979
[100%] Built target benchmarkLab

From the story of Gauss, we know the closed-form formula for this sum:
$$\sum_{i=1}^{N} i = \frac{(1 + N) \cdot N}{2}$$
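
As a quick sanity check on the formula, here is a minimal sketch that compares it against a plain loop. The function names are my own, and I use 64-bit arithmetic so the product (1 + N) * N cannot overflow for any int N.

```cpp
#include <cassert>

// Sum 1 + 2 + ... + n the slow way.
long long sum_by_loop(int n) {
  assert(n >= 0);
  long long total = 0;
  for (int i = 1; i <= n; i++) {
    total += i;
  }
  return total;
}

// Gauss closed form, computed in 64 bits to keep the product from overflowing.
long long sum_by_formula(int n) {
  assert(n >= 0);
  return (1LL + n) * n / 2;
}

// True when loop and formula agree for every n in [0, limit].
bool formulas_agree(int limit) {
  for (int n = 0; n <= limit; n++) {
    if (sum_by_loop(n) != sum_by_formula(n)) {
      return false;
    }
  }
  return true;
}
```

The division by 2 is always exact because one of N and N + 1 is even.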

So we can change solution.cpp to:

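#include "solution.h"

// Closed-form sum of 1..N (Gauss). It ignores arr, so it is only correct
// because the benchmark fills arr with 1, 2, ..., N.
int solution(int *arr, int N) {
  return (N + 1) * N / 2;
}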
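
One caveat worth spelling out: the closed form never reads the array, so it is a shortcut for this particular input rather than a general replacement for the loop. The sketch below puts both versions side by side; the names solution_loop, solution_closed_form, and versions_agree are mine, chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Baseline from the lab: actually reads the array.
int solution_loop(const int *arr, int N) {
  assert(N >= 0);
  int res = 0;
  for (int i = 0; i < N; i++) {
    res += arr[i];
  }
  return res;
}

// Closed form from these notes: the array argument is never touched.
int solution_closed_form(const int * /*arr*/, int N) {
  return (N + 1) * N / 2;
}

// Hypothetical helper: true when both versions agree on the given input.
bool versions_agree(const std::vector<int> &input) {
  const int n = static_cast<int>(input.size());
  return solution_loop(input.data(), n) == solution_closed_form(input.data(), n);
}
```

The two agree on {1, 2, 3, 4, 5} but not on {5, 5, 5}, where the loop gives 15 and the formula gives 6. That is fine for a warmup exercise whose input is fixed, as long as you keep it in mind.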

Building and running it gives:

root@c68f663b1972:/workspace/perf-ninja/labs/misc/warmup/build# cmake -DCMAKE_BUILD_TYPE=Release .. && cmake --build . --config Release --parallel 8 && cmake --build . --target validateLab && cmake --build . --target benchmarkLab
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/perf-ninja/labs/misc/warmup/build
Consolidate compiler generated dependencies of target validate
Consolidate compiler generated dependencies of target lab
[ 16%] Building CXX object CMakeFiles/validate.dir/solution.cpp.o
[ 33%] Building CXX object CMakeFiles/lab.dir/solution.cpp.o
[ 66%] Linking CXX executable lab
[ 66%] Linking CXX executable validate
[ 83%] Built target validate
[100%] Built target lab
Consolidate compiler generated dependencies of target validate
[100%] Built target validate
Validation Successful
[100%] Built target validateLab
Consolidate compiler generated dependencies of target lab
[100%] Built target lab
2025-10-22T21:19:33+00:00
Running ./lab
Run on (16 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.14, 0.14, 0.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
bench1          0.692 ns        0.692 ns   4007876365
[100%] Built target benchmarkLab

The result is correct (cmake --build . --target validateLab reported no errors), and the time per call dropped from 23.7 ns to 0.692 ns, roughly a 34x speedup.
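
To put those two numbers in perspective, here is a small sketch of the arithmetic. It assumes the core really ran at the 4600 MHz that Google Benchmark reported; with CPU scaling enabled (see the warning in the output) the true frequency may differ, so treat the cycle counts as rough estimates. Both function names are my own.

```cpp
#include <cassert>

// How many times faster the new version is.
double speedup(double before_ns, double after_ns) {
  assert(after_ns > 0.0);
  return before_ns / after_ns;
}

// Converts a duration to CPU cycles: ns * MHz = 1e-3 cycles.
double cycles(double time_ns, double freq_mhz) {
  assert(freq_mhz > 0.0);
  return time_ns * freq_mhz / 1000.0;
}
```

With the measurements above, speedup(23.7, 0.692) is about 34.2, cycles(23.7, 4600) is about 109, and cycles(0.692, 4600) is about 3.2. Around 109 cycles for 1000 additions is well under one cycle per element, which suggests the compiler had already vectorized the baseline loop, while about 3 cycles for the closed form is close to the cost of the call itself.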

If you need to push your work to a private branch on GitHub, see the second half of the video tutorial Warmup Lab Assignment.

Troubleshooting

Installing the perf tool

A Docker container runs on the host's kernel, so the perf tools installed inside it must match the host kernel, 6.8.0 in my case:

echo "deb http://archive.ubuntu.com/ubuntu noble main restricted" | tee /etc/apt/sources.list.d/noble-perf.list
apt update
apt install linux-tools-6.8.0-85-generic linux-cloud-tools-6.8.0-85-generic -y
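
The package name above is just linux-tools- followed by the kernel release string that uname reports. As a minimal sketch, and assuming a Linux system where the POSIX uname() call is available, the name can be derived for whatever kernel you are running; both helper names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <sys/utsname.h>

// Release string of the running kernel (the same value `uname -r` prints),
// or an empty string if the uname() call fails.
std::string running_kernel_release() {
  struct utsname info {};
  if (uname(&info) != 0) {
    return "";
  }
  return info.release;
}

// Hypothetical helper: the Ubuntu package that ships perf for a given release.
std::string perf_tools_package(const std::string &release) {
  assert(!release.empty());
  return "linux-tools-" + release;
}
```

Inside the container, running_kernel_release() reports the host's kernel because containers share it, which is exactly why the package has to match the host rather than the container image.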