Running the llama2-7b or DeepSeek-R1:1.5B model with PaddleNLP on a CPU with AVX instruction support (80% complete)

Source: 🚣‍♂️ Running the llama2-7b model with PaddleNLP on a CPU (with AVX instruction support) 🚣 --- PaddleNLP documentation

Running the llama2-7b model with PaddleNLP on a CPU (with AVX instruction support) 🚣

PaddleNLP has deeply adapted and optimized the llama model family for CPUs that support AVX instructions. This document describes how to run high-performance inference of llama-family models with PaddleNLP on such CPUs.

Hardware check:

Chip                              GCC version   cmake version
Intel(R) Xeon(R) Platinum 8463B   9.4.0         >=3.18

Note: to verify that your machine supports AVX instructions, run the following command and check whether it prints anything:

lscpu | grep -o -P '(?<!\w)(avx\w*)'

# Expected output looks like this:
avx
avx2
avx512f
avx512dq
avx512ifma
avx512cd
avx512bw
avx512vl
avx_vnni
avx512_bf16
avx512vbmi
avx512_vbmi2
avx512_vnni
avx512_bitalg
avx512_vpopcntdq
avx512_fp16
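The same check can be done programmatically. Below is a minimal Python sketch of my own (not part of PaddleNLP) that parses /proc/cpuinfo, so it is Linux-only:

```python
import re

def avx_flags(cpuinfo_text):
    """Return the sorted AVX-family flags found in /proc/cpuinfo text."""
    m = re.search(r"^flags\s*:\s*(.+)$", cpuinfo_text, re.MULTILINE)
    if not m:
        return []
    return sorted(f for f in m.group(1).split() if f.startswith("avx"))

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            # An empty list means the CPU (or VM) does not advertise AVX,
            # and the AVX inference path below will not work.
            print(avx_flags(fh.read()))
    except FileNotFoundError:
        print("no /proc/cpuinfo (not Linux?)")
```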

Environment setup:

1 Install numactl

apt-get update
apt-get install numactl

2 Install Paddle

2.1 Build from source:

git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle && mkdir build && cd build

cmake .. -DPY_VERSION=3.8 -DWITH_GPU=OFF

make -j128
pip install -U python/dist/paddlepaddle-0.0.0-cp38-cp38-linux_x86_64.whl

2.2 Install with pip:

python -m pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/

2.3 Verify the installation:

python -c "import paddle; paddle.version.show()"
python -c "import paddle; paddle.utils.run_check()"

3 Clone the PaddleNLP repository and install its dependencies

# PaddleNLP is a natural language processing and large language model (LLM) development library built on PaddlePaddle ("飞桨"). It hosts PaddlePaddle implementations of many large models, including the llama family. To get the most out of PaddleNLP, clone the whole repository.
pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html

4 Install the third-party libraries and paddlenlp_ops

# The PaddleNLP repository ships dedicated fused operators that keep inference cost as low as possible
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/csrc/cpu
sh setup.sh

5 If the third-party libraries fail to build

# If oneccl fails to build, rebuild with gcc between 8.2 and 9.4
cd csrc/cpu/xFasterTransformer/3rdparty/
sh prepare_oneccl.sh

# If xFasterTransformer fails to build, rebuild with gcc 9.2 or newer
cd csrc/cpu/xFasterTransformer/build/
make -j24

# For more commands and environment variables, see csrc/cpu/setup.sh

High-performance CPU inference

PaddleNLP also provides high-performance CPU inference based on intel/xFasterTransformer. It currently supports FP16, BF16, and INT8 precision, plus a mixed mode that runs Prefill in FP16 and Decode in INT8.

Reference for high-performance inference on non-HBM machines:

1 Determine OMP_NUM_THREADS

OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}')

2 Dynamic-graph inference

cd ../../llm/
# 2. Dynamic-graph inference: reference command for high-performance AVX dynamic-graph inference
OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"

3 Static-graph inference

# step 1: export the static graph
python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"
# step 2: static-graph inference
OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float32" --mode "static" --device "cpu" --avx_mode

Reference for high-performance inference on HBM machines:

1 Confirm the hardware and OMP_NUM_THREADS

# In theory, an HBM machine achieves a 1.3x-1.9x next-token latency speedup over a non-HBM machine
# Confirm that the machine has HBM
lscpu
# CPU-less nodes such as node2 and node3 below indicate HBM support
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
NUMA node2 CPU(s):
NUMA node3 CPU(s):

# Determine OMP_NUM_THREADS
lscpu | grep "Socket(s)" | awk -F ':' '{print $2}'
OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}')
2 Dynamic-graph inference

cd ../../llm/
# Reference command for high-performance AVX dynamic-graph inference
FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=2 OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"
Note: FIRST_TOKEN_WEIGHT_LOCATION and NEXT_TOKEN_WEIGHT_LOCATION place the first_token weights on numa0 and the next_token weights on numa2 (the HBM cache node).
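To decide which node id to pass as NEXT_TOKEN_WEIGHT_LOCATION, you need to know which NUMA nodes are the CPU-less HBM cache nodes. Here is a small sketch of my own (the parsing logic is an assumption based on the lscpu output shape shown above, not part of PaddleNLP) that lists nodes with no CPUs attached:

```python
import re

def hbm_candidate_nodes(lscpu_text):
    """Return NUMA node ids whose 'NUMA nodeN CPU(s):' line lists no CPUs.

    On HBM-as-cache machines these CPU-less nodes are the HBM memory
    nodes (node2/node3 in the lscpu output above).
    """
    nodes = []
    for line in lscpu_text.splitlines():
        m = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S*)", line)
        if m and not m.group(2):
            nodes.append(int(m.group(1)))
    return nodes
```

With the lscpu output shown above, this returns [2, 3], matching the note that numa2 is an HBM cache node.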
3 Static-graph inference

# Reference commands for high-performance static-graph inference
# step 1: export the static graph
python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"
# step 2: static-graph inference
FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=2 OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float32" --mode "static" --device "cpu" --avx_mode

Hands-on walkthrough

Installation

Install system packages

sudo apt update
sudo apt install numactl

Check whether the CPU supports AVX

lscpu | grep -o -P '(?<!\w)(avx\w*)'

Install PaddlePaddle

pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/

Verify that PaddlePaddle installed correctly

python -c "import paddle; paddle.version.show()"
python -c "import paddle; paddle.utils.run_check()"

Install the PaddleNLP library

pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html

Download the PaddleNLP source and install the fused operators

git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/csrc/cpu
sh setup.sh

The build fails:
Successfully installed intel-cmplr-lib-ur-2024.2.1 intel-openmp-2024.2.1 mkl-include-2024.0.0 mkl-static-2024.0.0 tbb-2021.13.1
CMake Error at CMakeLists.txt:129 (find_package):
  By not providing "FindoneCCL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "oneCCL", but
  CMake did not find one.

  Could not find a package configuration file provided by "oneCCL" with any
  of the following names:

    oneCCLConfig.cmake
    oneccl-config.cmake

  Add the installation prefix of "oneCCL" to CMAKE_PREFIX_PATH or set
  "oneCCL_DIR" to a directory containing one of the above files.  If "oneCCL"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!
make: *** No targets specified and no makefile found.  Stop.

Go to the oneccl subdirectory and try rebuilding:

(py312) skywalk@DESKTOP-9C5AU01:~/github/PaddleNLP/csrc/cpu$ cd xFasterTransformer/3rdparty/
(py312) skywalk@DESKTOP-9C5AU01:~/github/PaddleNLP/csrc/cpu/xFasterTransformer/3rdparty$ sh prepare_oneccl.sh

Still failing. The docs say gcc 8-9 works best, but this machine has gcc 13.3, which is on the new side, so I am setting this aside for now.

Current status: the local build fails; on 星河社区 (AI Studio) the GitHub connection is too slow and the build fails; the Kaggle build fails as well.
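If the system compiler is too new, one possible workaround, purely a sketch that I have not verified here (it assumes gcc-9/g++-9 are installable from your distro's repositories, and that the CMake configure inside setup.sh picks up the standard CC/CXX environment variables on a fresh build directory), is to build with an older GCC without changing the system default:

```shell
# Sketch: build with gcc-9 without touching the system-wide default.
# Package names are Ubuntu-specific and may differ elsewhere.
sudo apt install gcc-9 g++-9
export CC=/usr/bin/gcc-9
export CXX=/usr/bin/g++-9
cd PaddleNLP/csrc/cpu && sh setup.sh
```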

Installing the fused operators, take two

First add Intel's apt repository for Ubuntu

# Add Intel's GPG key and apt repository
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update

# Install the full development packages (includes oneCCL)
sudo apt install intel-oneapi-ccl intel-oneapi-ccl-devel intel-oneapi-runtime-dnnl

Then run the build again

cd PaddleNLP/csrc/cpu && oneCCL_DIR=/opt/intel/oneapi/ccl/latest/lib/cmake/oneCCL sh setup.sh

Inference

Go to the PaddleNLP/llm directory and run:

python ./predict/predictor.py --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"

Summary

More pitfalls than expected; it is still not running end to end.

Debugging

Error: This system does not support NUMA policy

OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0 -m 0 python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"

numactl: This system does not support NUMA policy

In that case, drop the numactl prefix and run the command directly.
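A tiny helper of my own (a sketch, not part of PaddleNLP) that builds the launch command and only prepends the NUMA binding when the system actually supports it, so the same script works on machines without NUMA (such as some WSL setups):

```python
import shutil
import subprocess

def numa_available():
    """True when numactl exists and the kernel accepts NUMA policy calls."""
    if shutil.which("numactl") is None:
        return False
    # 'numactl -s' fails with "This system does not support NUMA policy"
    # on kernels built without NUMA support.
    return subprocess.run(["numactl", "-s"], capture_output=True).returncode == 0

def launch_cmd(predictor_args, numa_node=0):
    """Build the predictor command, with NUMA binding only when usable."""
    cmd = ["python", "./predict/predictor.py"] + predictor_args
    if numa_available():
        cmd = ["numactl", "-N", str(numa_node), "-m", str(numa_node)] + cmd
    return cmd
```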

Error: ModuleNotFoundError: No module named 'paddlenlp_ops'

from paddlenlp_ops import (

ModuleNotFoundError: No module named 'paddlenlp_ops'

So building paddlenlp_ops really is unavoidable!

Building paddlenlp_ops on Kaggle fails

cd xFasterTransformer/3rdparty/

!cd PaddleNLP/csrc/cpu/xFasterTransformer/3rdparty && sh prepare_oneccl.sh

One last try; if this fails I am giving up. Building oneccl on its own succeeds, but building paddlenlp_ops still fails:
-- MKL directory already exists. Skipping installation.
CMake Error at CMakeLists.txt:129 (find_package):
  By not providing "FindoneCCL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "oneCCL", but
  CMake did not find one.

  Could not find a package configuration file provided by "oneCCL" with any
  of the following names:

    oneCCLConfig.cmake
    oneccl-config.cmake

  Add the installation prefix of "oneCCL" to CMAKE_PREFIX_PATH or set
  "oneCCL_DIR" to a directory containing one of the above files.  If "oneCCL"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!
make: *** No targets specified and no makefile found.  Stop.

On Kaggle I am out of ideas... giving up.

Local build error

-- MKL directory already exists. Skipping installation.
CMake Error at CMakeLists.txt:129 (find_package):
  By not providing "FindoneCCL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "oneCCL", but
  CMake did not find one.

  Could not find a package configuration file provided by "oneCCL" with any
  of the following names:

    oneCCLConfig.cmake
    oneccl-config.cmake

  Add the installation prefix of "oneCCL" to CMAKE_PREFIX_PATH or set
  "oneCCL_DIR" to a directory containing one of the above files.  If "oneCCL"
  provides a separate development package or SDK, be sure it has been
  installed.

-- Configuring incomplete, errors occurred!

Try a plain pip install:

pip install oneccl

Same error as before.

Try installing this:

sudo apt install libdnnl3

Trying a new approach

# Add Intel's GPG key and apt repository
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update

# Install the full development packages (includes oneCCL)
sudo apt install intel-oneapi-ccl intel-oneapi-ccl-devel

The download is very slow on my machine, and Kaggle is not much faster:

12% [4 intel-oneapi-mpi-2021.14 7797 kB/45.6 MB 17%] 23.0 kB/s 1h 14min 51s

Kaggle has finished installing, so the ops can be built now:

!cd PaddleNLP/csrc/cpu && oneCCL_DIR=/opt/intel/oneapi/ccl/latest/ sh setup.sh

The build emits warnings like this:
warnings.warn(warning_message)
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See Why you shouldn't invoke setup.py directly for details.
        ********************************************************************************

!!
  self.initialize_options()
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

Kaggle ultimately fails with /usr/bin/ld: cannot find -l:libxfastertransformer.so: No such file or directory
/usr/bin/ld: cannot find /kaggle/working/PaddleNLP/csrc/cpu/build/paddlenlp_ops/lib.linux-x86_64-cpython-310/avx_weight_only.o: No such file or directory
/usr/bin/ld: cannot find /kaggle/working/PaddleNLP/csrc/cpu/build/paddlenlp_ops/lib.linux-x86_64-cpython-310/stop_generation_multi_ends.o: No such file or directory
/usr/bin/ld: cannot find -l:libxfastertransformer.so: No such file or directory
/usr/bin/ld: cannot find -l:libxft_comm_helper.so: No such file or directory
collect2: error: ld returned 1 exit status
error: command '/usr/bin/x86_64-linux-gnu-g++' failed with exit code 1

The root cause is here:

-- Using src='https://github.com/google/sentencepiece/releases/download/v0.1.99/sentencepiece-0.1.99.tar.gz'
/kaggle/working/PaddleNLP/csrc/cpu/xFasterTransformer/src/comm_helper/comm_helper.cpp:17:10: fatal error: oneapi/ccl.hpp: No such file or directory
   17 | #include "oneapi/ccl.hpp"
      |          ^~~~~~~~~~~~~~~~

In other words, the oneAPI headers cannot be found.

Found it: the oneCCL_DIR path set earlier was wrong.

# Standard Intel oneAPI path (Linux)
export oneCCL_DIR=/opt/intel/oneapi/ccl/latest/lib/cmake/ccl

# Custom install path
export oneCCL_DIR=/your/custom/path/lib/cmake/ccl

# Pass the variable to CMake
cmake -DoneCCL_DIR=$oneCCL_DIR ..
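Rather than guessing the directory, you can search for the config file itself. A small helper of my own (a sketch; the /opt/intel/oneapi root is an assumption based on the default installer prefix used earlier):

```python
import os

def find_cmake_config(root, name="oneCCLConfig.cmake"):
    """Walk `root` and return every directory that contains `name`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        if name in files:
            hits.append(dirpath)
    return hits

if __name__ == "__main__":
    # /opt/intel/oneapi is the default installer prefix; adjust if needed.
    for d in find_cmake_config("/opt/intel/oneapi"):
        print(d)  # pass one of these directories as oneCCL_DIR
```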

The command should be:

!cd PaddleNLP/csrc/cpu && oneCCL_DIR=/opt/intel/oneapi/ccl/latest/lib/cmake/oneCCL sh setup.sh

Still failing; this also needs to be installed:

sudo apt install intel-oneapi-runtime-dnnl

Kaggle error: Your notebook tried to allocate more memory than is available. It has restarted. (giving up)

Nothing to be done about this one; it is simply over the memory limit.

Giving up on Kaggle.

Local build error: status_string: "Failure when receiving data from the peer"

-- Using src='https://github.com/oneapi-src/oneDNN/releases/download/v0.21/mklml_lnx_2019.0.5.20190502.tgz'
Cloning into 'oneccl'...
CMake Error at /home/skywalk/github/PaddleNLP/csrc/cpu/xFasterTransformer/build/xdnn_lib-prefix/src/xdnn_lib-stamp/download-xdnn_lib.cmake:170 (message):
  Each download failed!
  error: downloading 'https://github.com/intel/xFasterTransformer/releases/download/IntrinsicGemm/xdnn_v1.5.2.tar.gz' failed
  status_code: 56
  status_string: "Failure when receiving data from the peer"

CMake Error at /home/skywalk/github/PaddleNLP/csrc/cpu/xFasterTransformer/build/examples/cpp/cmdline-prefix/src/cmdline-stamp/download-cmdline.cmake:170 (message):
  Each download failed!
  error: downloading 'https://github.com/tanakh/cmdline/archive/refs/heads/master.zip' failed
  status_code: 56
  status_string: "Failure when receiving data from the peer"

Probably just GitHub acting up.

Shelving this for now.

Some additional packages that may be needed:

sudo apt install libdnnl-dev

sudo apt install intel-oneapi-mkl

sudo apt install libmkl-vml-avx libmkl-dev intel-oneapi-runtime-mkl

While installing intel-mkl (the math library), a prompt appeared:

Intel Math Kernel Library (Intel MKL)

Intel MKL's Single Dynamic Library (SDL) is installed on your machine. This shared object can be used as an alternative to both libblas.so.3 and liblapack.so.3, so that packages built against BLAS/LAPACK can directly use MKL without rebuild.

However, MKL is non-free software, and in particular its source code is not publicly available. By using MKL as the default BLAS/LAPACK implementation, you might be violating the licensing terms of copyleft software that would become dynamically linked against it. Please verify that the licensing terms of the program(s) that you intend to use with MKL are compatible with the MKL licensing terms. For the case of software under the GNU General Public License, you may want to read this FAQ:

https://www.gnu.org/licenses/gpl-faq.html#GPLIncompatibleLibs

If you don't know what MKL is, or unwilling to set it as default, just choose the preset value or simply type Enter.

Use libmkl_rt.so as the default alternative to BLAS/LAPACK?  <Yes>  <No>

So this library ships under its own (non-free) license terms?
