Running the llama2-7b or DeepSeek-r1:1.5b model with PaddleNLP on a CPU (with AVX support) (80% complete)

Source: 🚣‍♂️ Running the llama2-7b model with PaddleNLP on an AVX-capable CPU 🚣 --- PaddleNLP documentation

Running the llama2-7b model with PaddleNLP on an AVX-capable CPU 🚣

PaddleNLP has deeply adapted and optimized the llama family of models for CPUs with AVX instructions. This document describes how to run high-performance inference for llama-family models with PaddleNLP on an AVX-capable CPU.

Hardware check:

Chip                              GCC version   cmake version
Intel(R) Xeon(R) Platinum 8463B   9.4.0         >=3.18

Note: to verify that your machine supports AVX instructions, just run the following command in a shell and check whether it produces any output:

lscpu | grep -o -P '(?<!\w)(avx\w*)'

# Expected output looks like this:
avx
avx2
**avx512f**
avx512dq
avx512ifma
avx512cd
**avx512bw**
avx512vl
avx_vnni
**avx512_bf16**
avx512vbmi
avx512_vbmi2
avx512_vnni
avx512_bitalg
avx512_vpopcntdq
**avx512_fp16**
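To see what the grep pattern above actually matches, you can run it against a captured flags string instead of live `lscpu` output; the sample string below is illustrative, not from a real machine:

```shell
# Sample CPU flags string, as lscpu might report them (truncated here).
flags="fpu vme sse4_2 avx f16c avx2 avx512f avx512bw avx512_bf16"

# Same pattern as above: print every token that begins with "avx".
# The (?<!\w) lookbehind stops it matching "avx" inside a longer word.
echo "$flags" | grep -o -P '(?<!\w)(avx\w*)'
```

If the command prints nothing when run against the real `lscpu` output, the CPU (or the VM it runs in) does not expose AVX and the rest of this guide will not work.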

Environment setup:

1 Install numactl

apt-get update
apt-get install numactl

2 Install Paddle

2.1 Build from source:

git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle && mkdir build && cd build

cmake .. -DPY_VERSION=3.8 -DWITH_GPU=OFF

make -j128
pip install -U python/dist/paddlepaddle-0.0.0-cp38-cp38-linux_x86_64.whl

2.2 Install with pip:

python -m pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/

2.3 Verify the installation:

python -c "import paddle; paddle.version.show()"
python -c "import paddle; paddle.utils.run_check()"

3 Clone the PaddleNLP repository and install its dependencies

# PaddleNLP is a natural language processing and large language model (LLM) development library built on the PaddlePaddle framework. It hosts a wide range of large models implemented on PaddlePaddle, including the llama family. To make full use of PaddleNLP, clone the whole repository.
pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html

4 Install the third-party libraries and paddlenlp_ops

# PaddleNLP ships dedicated fused operators so users get maximally reduced inference cost
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/csrc/cpu
sh setup.sh

5 If the third-party libraries fail to install

# If oneCCL fails to install, rebuild with gcc between 8.2 and 9.4
cd csrc/cpu/xFasterTransformer/3rdparty/
sh prepare_oneccl.sh

# If xFasterTransformer fails to install, rebuild with gcc 9.2 or newer
cd csrc/cpu/xFasterTransformer/build/
make -j24

# See csrc/cpu/setup.sh for more commands and environment variables
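Since the workable gcc range differs between oneCCL (8.2-9.4, per the comment above) and xFasterTransformer (9.2+), it can save a failed build to check the active compiler first. A minimal sketch, assuming GNU gcc is on PATH; the `in_range` helper is illustrative, not part of any PaddleNLP tooling:

```shell
# Check whether the active gcc falls inside a version range, using
# sort -V so that e.g. 8.10 correctly compares greater than 8.2.
in_range() {  # usage: in_range VERSION MIN MAX
    v=$1; min=$2; max=$3
    [ "$(printf '%s\n' "$min" "$v" | sort -V | head -n1)" = "$min" ] &&
    [ "$(printf '%s\n' "$v" "$max" | sort -V | head -n1)" = "$v" ]
}

gcc_ver=$(gcc -dumpfullversion 2>/dev/null || gcc -dumpversion 2>/dev/null)
if in_range "$gcc_ver" 8.2 9.4; then
    echo "gcc $gcc_ver is inside the range suggested for oneCCL"
else
    echo "gcc $gcc_ver is outside 8.2-9.4; the oneCCL build may fail"
fi
```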

CPU high-performance inference

PaddleNLP also provides high-performance CPU inference based on intel/xFasterTransformer. It currently supports FP16, BF16 and INT8 precision, as well as a mixed mode that runs Prefill in FP16 and Decode in INT8.

High-performance inference on non-HBM machines:

1 Determine OMP_NUM_THREADS

OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}')

2 Dynamic-graph inference

cd ../../llm/
# 2. Dynamic-graph inference: reference command for high-performance AVX dynamic-graph model inference
OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"

3 Static-graph inference

# step 1: export the static graph
python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"
# step 2: run static-graph inference
OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float32" --mode "static" --device "cpu" --avx_mode
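Note that the awk expression used above returns the cores-per-socket value with leading whitespace, which the shell tolerates but is worth trimming. A hedged sketch of the same derivation with a fallback to nproc for environments (containers, WSL) where lscpu may not report the field; the fallback choice is this writeup's assumption, not from the PaddleNLP docs:

```shell
# Physical cores per socket, as in the commands above, with whitespace
# stripped; fall back to nproc when lscpu does not report the field.
cores=$(lscpu 2>/dev/null | awk -F ':' '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
[ -n "$cores" ] || cores=$(nproc)
export OMP_NUM_THREADS=$cores
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```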

High-performance inference on HBM machines:

1 Confirm the hardware and OMP_NUM_THREADS

# In theory an HBM machine gets a 1.3x-1.9x next-token latency speedup over a non-HBM machine
# Confirm the machine has HBM
lscpu
# Nodes like node2 and node3 below indicate HBM support
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
NUMA node2 CPU(s):
NUMA node3 CPU(s):

# Determine OMP_NUM_THREADS
lscpu | grep "Socket(s)" | awk -F ':' '{print $2}'
OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}')

2 Dynamic-graph inference

cd ../../llm/
# Reference command for high-performance AVX dynamic-graph model inference
FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=2 OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"

Note: FIRST_TOKEN_WEIGHT_LOCATION and NEXT_TOKEN_WEIGHT_LOCATION place the first_token weights on NUMA node 0 and the next_token weights on NUMA node 2 (an HBM node).
3 Static-graph inference

# Reference commands for high-performance static-graph model inference
# step 1: export the static graph
python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"
# step 2: run static-graph inference
FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=2 OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0  -m 0 python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float32" --mode "static" --device "cpu" --avx_mode
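The HBM check above boils down to: a NUMA node that lscpu lists with an empty CPU(s) field is a memory-only node, and its index is what goes into NEXT_TOKEN_WEIGHT_LOCATION and `numactl -m`. A sketch of that parse, run here on a captured lscpu snippet rather than live output:

```shell
# Captured lscpu lines for illustration; node2/node3 have no CPUs,
# so they are memory-only (HBM) nodes.
sample='NUMA node0 CPU(s):   0-31,64-95
NUMA node1 CPU(s):   32-63,96-127
NUMA node2 CPU(s):
NUMA node3 CPU(s):'

# Print the index of every NUMA node whose CPU list is empty.
printf '%s\n' "$sample" | awk -F ':' '/NUMA node[0-9]+ CPU\(s\)/ {
    gsub(/ /, "", $2)
    if ($2 == "") {
        n = $1
        sub(/^NUMA node/, "", n)
        sub(/ CPU\(s\)$/, "", n)
        print n
    }
}'
```

On the sample above this prints 2 and 3, matching the note about placing next_token weights on node 2.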

Hands-on walkthrough

Installation

Install system packages

sudo apt update
sudo apt install numactl

Check whether the CPU supports AVX

lscpu | grep -o -P '(?<!\w)(avx\w*)'

Install PaddlePaddle

pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/

Verify the PaddlePaddle installation

python -c "import paddle; paddle.version.show()"
python -c "import paddle; paddle.utils.run_check()"

Install the PaddleNLP library

pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html

Download the PaddleNLP source and install the fused operators

git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/csrc/cpu
sh setup.sh

Build failure

Successfully installed intel-cmplr-lib-ur-2024.2.1 intel-openmp-2024.2.1 mkl-include-2024.0.0 mkl-static-2024.0.0 tbb-2021.13.1
CMake Error at CMakeLists.txt:129 (find_package):
  By not providing "FindoneCCL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "oneCCL", but
  CMake did not find one.

  Could not find a package configuration file provided by "oneCCL" with any
  of the following names:

    oneCCLConfig.cmake
    oneccl-config.cmake

  Add the installation prefix of "oneCCL" to CMAKE_PREFIX_PATH or set
  "oneCCL_DIR" to a directory containing one of the above files.  If "oneCCL"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!
make: *** No targets specified and no makefile found.  Stop.

Go into the oneCCL subdirectory and try rebuilding:

(py312) skywalk@DESKTOP-9C5AU01:~/github/PaddleNLP/csrc/cpu$ cd xFasterTransformer/3rdparty/
(py312) skywalk@DESKTOP-9C5AU01:~/github/PaddleNLP/csrc/cpu/xFasterTransformer/3rdparty$ sh prepare_oneccl.sh

Still failing. The docs suggest gcc between 8 and 9 works best, while the local gcc is 13.3, which is on the high side, so set this aside for now.

Current status: the local build fails, the AI Studio (星河社区) build fails because its GitHub connection is too slow, and the Kaggle build fails as well.

Install the fused operators, take two

First add Intel's CPU package repository for Ubuntu

# Download the repository signing key and register the repo
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update

# Install the full development suite (includes oneCCL)
sudo apt install intel-oneapi-ccl intel-oneapi-ccl-devel intel-oneapi-runtime-dnnl

Then install again

cd PaddleNLP/csrc/cpu && oneCCL_DIR=/opt/intel/oneapi/ccl/latest/lib/cmake/oneCCL sh setup.sh
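If the oneAPI layout on a given machine differs from the path above, it is safer to locate oneCCLConfig.cmake first and derive oneCCL_DIR from wherever it actually lives. A sketch assuming the default /opt/intel/oneapi root; adjust ONEAPI_ROOT for other installs:

```shell
# cmake needs oneCCL_DIR to be the directory that contains
# oneCCLConfig.cmake; find it instead of hard-coding the path.
ONEAPI_ROOT=${ONEAPI_ROOT:-/opt/intel/oneapi}
cfg=$(find "$ONEAPI_ROOT" -name oneCCLConfig.cmake 2>/dev/null | head -n1)
if [ -n "$cfg" ]; then
    export oneCCL_DIR=$(dirname "$cfg")
    echo "oneCCL_DIR=$oneCCL_DIR"
else
    echo "oneCCLConfig.cmake not found under $ONEAPI_ROOT" >&2
fi
```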

Inference

Go to the PaddleNLP/llm directory and run:

python ./predict/predictor.py --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"

Summary

More pitfalls than expected; it still isn't running end to end.

Debugging

Error: This system does not support NUMA policy

OMP_NUM_THREADS=$(lscpu | grep "Core(s) per socket" | awk -F ':' '{print $2}') numactl -N 0 -m 0 python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float32 --avx_mode --avx_type "fp16_int8" --device "cpu"

numactl: This system does not support NUMA policy

So just drop numactl
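Dropping numactl can also be made automatic: build the launch prefix conditionally so the same command line works both on real NUMA hardware and on systems (such as WSL) that reject the policy. A small sketch; the RUN variable is this writeup's convention, not part of PaddleNLP:

```shell
# Use NUMA binding only when the system actually supports it;
# otherwise run the predictor unpinned (as ended up necessary here).
if command -v numactl >/dev/null 2>&1 && numactl --hardware >/dev/null 2>&1; then
    RUN="numactl -N 0 -m 0"
else
    RUN=""
fi
echo "launch prefix: '$RUN'"
# $RUN python ./predict/predictor.py ... (same arguments as before)
```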

Error: ModuleNotFoundError: No module named 'paddlenlp_ops'

from paddlenlp_ops import (

ModuleNotFoundError: No module named 'paddlenlp_ops'

So there is no way around building paddlenlp_ops!

Building paddlenlp_ops on Kaggle fails

cd xFasterTransformer/3rdparty/

!cd PaddleNLP/csrc/cpu/xFasterTransformer/3rdparty && sh prepare_oneccl.sh

One last try before giving up. Building oneCCL on its own now succeeds, but building paddlenlp_ops still fails:

-- MKL directory already exists. Skipping installation.
CMake Error at CMakeLists.txt:129 (find_package):
  By not providing "FindoneCCL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "oneCCL", but
  CMake did not find one.

  Could not find a package configuration file provided by "oneCCL" with any
  of the following names:

    oneCCLConfig.cmake
    oneccl-config.cmake

  Add the installation prefix of "oneCCL" to CMAKE_PREFIX_PATH or set
  "oneCCL_DIR" to a directory containing one of the above files.  If "oneCCL"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!
make: *** No targets specified and no makefile found.  Stop.

Not sure what else to try inside Kaggle... giving up on it for now.

Local build error

-- MKL directory already exists. Skipping installation.
CMake Error at CMakeLists.txt:129 (find_package):
  By not providing "FindoneCCL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "oneCCL", but
  CMake did not find one.

  Could not find a package configuration file provided by "oneCCL" with any
  of the following names:

    oneCCLConfig.cmake
    oneccl-config.cmake

  Add the installation prefix of "oneCCL" to CMAKE_PREFIX_PATH or set
  "oneCCL_DIR" to a directory containing one of the above files.  If "oneCCL"
  provides a separate development package or SDK, be sure it has been
  installed.

-- Configuring incomplete, errors occurred!

Try installing directly with pip

pip install oneccl

Same error as before

Try installing this

sudo apt install libdnnl3

Try a new approach

# Download the repository signing key and register the repo
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
echo "deb https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update

# Install the full development suite (includes oneCCL)
sudo apt install intel-oneapi-ccl intel-oneapi-ccl-devel

The download is very slow locally, and not much faster on Kaggle

12% [4 intel-oneapi-mpi-2021.14 7797 kB/45.6 MB 17%] 23.0 kB/s 1h 14min 51s

Kaggle finished installing, so the ops can be built there now

!cd PaddleNLP/csrc/cpu && oneCCL_DIR=/opt/intel/oneapi/ccl/latest/ sh setup.sh

The build prints deprecation warnings like this

warnings.warn(warning_message)
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See Why you shouldn't invoke setup.py directly for details.
        ********************************************************************************

!!
  self.initialize_options()
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

Kaggle finally fails with /usr/bin/ld: cannot find -l:libxfastertransformer.so: No such file or directory

/usr/bin/ld: cannot find /kaggle/working/PaddleNLP/csrc/cpu/build/paddlenlp_ops/lib.linux-x86_64-cpython-310/avx_weight_only.o: No such file or directory
/usr/bin/ld: cannot find /kaggle/working/PaddleNLP/csrc/cpu/build/paddlenlp_ops/lib.linux-x86_64-cpython-310/stop_generation_multi_ends.o: No such file or directory
/usr/bin/ld: cannot find -l:libxfastertransformer.so: No such file or directory
/usr/bin/ld: cannot find -l:libxft_comm_helper.so: No such file or directory
collect2: error: ld returned 1 exit status
error: command '/usr/bin/x86_64-linux-gnu-g++' failed with exit code 1

The cause turns out to be here:

-- Using src='https://github.com/google/sentencepiece/releases/download/v0.1.99/sentencepiece-0.1.99.tar.gz'
/kaggle/working/PaddleNLP/csrc/cpu/xFasterTransformer/src/comm_helper/comm_helper.cpp:17:10: fatal error: oneapi/ccl.hpp: No such file or directory
   17 | #include "oneapi/ccl.hpp"
      |          ^~~~~~~~~~~~~~~~

So the missing piece is oneAPI: the oneapi/ccl.hpp header cannot be found.

Found the cause: the oneCCL_DIR path set earlier was wrong

# Standard Intel oneAPI path (Linux)
export oneCCL_DIR=/opt/intel/oneapi/ccl/latest/lib/cmake/ccl

# Custom installation path
export oneCCL_DIR=/your/custom/path/lib/cmake/ccl

# Pass the variable to CMake
cmake -DoneCCL_DIR=$oneCCL_DIR ..

This is the command that should be used:

!cd PaddleNLP/csrc/cpu && oneCCL_DIR=/opt/intel/oneapi/ccl/latest/lib/cmake/oneCCL sh setup.sh

Still failing; this also needs to be installed:

sudo apt install intel-oneapi-runtime-dnnl

Kaggle error: Your notebook tried to allocate more memory than is available. It has restarted. (giving up)

Nothing to be done about this one: it simply exceeds the memory limit

Giving up on Kaggle

Local build error: status_string: "Failure when receiving data from the peer"

Two download steps fail (the raw log interleaves the two CMake errors; cleaned up below):

-- Using src='https://github.com/oneapi-src/oneDNN/releases/download/v0.21/mklml_lnx_2019.0.5.20190502.tgz'
Cloning into 'oneccl'...
CMake Error at /home/skywalk/github/PaddleNLP/csrc/cpu/xFasterTransformer/build/xdnn_lib-prefix/src/xdnn_lib-stamp/download-xdnn_lib.cmake:170 (message):
  Each download failed!
  error: downloading 'https://github.com/intel/xFasterTransformer/releases/download/IntrinsicGemm/xdnn_v1.5.2.tar.gz' failed
  status_code: 56
  status_string: "Failure when receiving data from the peer"

CMake Error at /home/skywalk/github/PaddleNLP/csrc/cpu/xFasterTransformer/build/examples/cpp/cmdline-prefix/src/cmdline-stamp/download-cmdline.cmake:170 (message):
  Each download failed!
  error: downloading 'https://github.com/tanakh/cmdline/archive/refs/heads/master.zip' failed
  status_code: 56
  status_string: "Failure when receiving data from the peer"

(The interleaved curl trace shows github.com resolving to 20.205.243.166 and the TLS handshake starting before the transfer drops.)

Probably just GitHub being flaky.

Setting this aside for now

Some libraries that may also be needed:

sudo apt install libdnnl-dev
sudo apt install intel-oneapi-mkl
sudo apt install libmkl-vml-avx libmkl-dev intel-oneapi-runtime-mkl

Installing intel-mkl (the math library) brings up a prompt:

┤ Intel Math Kernel Library (Intel MKL) ├

Intel MKL's Single Dynamic Library (SDL) is installed on your machine. This shared object can be used as an alternative to both libblas.so.3 and liblapack.so.3, so that packages built against BLAS/LAPACK can directly use MKL without rebuild.

However, MKL is non-free software, and in particular its source code is not publicly available. By using MKL as the default BLAS/LAPACK implementation, you might be violating the licensing terms of copyleft software that would become dynamically linked against it. Please verify that the licensing terms of the program(s) that you intend to use with MKL are compatible with the MKL licensing terms. For the case of software under the GNU General Public License, you may want to read this FAQ:

https://www.gnu.org/licenses/gpl-faq.html#GPLIncompatibleLibs

If you don't know what MKL is, or unwilling to set it as default, just choose the preset value or simply type Enter.

Use libmkl_rt.so as the default alternative to BLAS/LAPACK?

<Yes> <No>

In other words, this library comes with its own separate licensing terms to accept?
