paddle.utils.run_check() fails: NCCL not found

Full error output:

Running verify PaddlePaddle program ... 
I1203 11:17:31.474777 26315 program_interpreter.cc:212] New Executor is Running.
W1203 11:17:31.475191 26315 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:17:31.476090 26315 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I1203 11:17:31.743631 26315 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I1203 11:17:32.905944 26397 tcp_utils.cc:107] Retry to connect to 127.0.0.1:43873 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
=======================================================================
I1203 11:17:32.950119 26399 tcp_utils.cc:107] Retry to connect to 127.0.0.1:43873 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='2', default_value='')
=======================================================================
I1203 11:17:32.954985 26396 tcp_utils.cc:181] The server starts to listen on IP_ANY:43873
I1203 11:17:32.955080 26398 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43873
I1203 11:17:32.955158 26396 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43873
I1203 11:17:32.955351 26398 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W1203 11:17:33.164513 26398 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:17:33.165470 26398 gpu_resources.cc:164] device: 2, cuDNN Version: 8.9.
W1203 11:17:33.173501 26398 dynamic_loader.cc:285] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-download before install PaddlePaddle.
I1203 11:17:33.278411 26398 process_group_nccl.cc:132] ProcessGroupNCCL destruct 


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   phi::distributed::CreateOrGetGlobalTCPStore()
1   phi::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
2   phi::distributed::TCPStore::waitWorkers()

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1764731853 (unix time) try "date -d @1764731853" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3f5000066cb) received by PID 26396 (TID 0x7fa753ce9200) from PID 26315 ***]



--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   phi::distributed::CreateOrGetGlobalTCPStore()
1   phi::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
2   phi::distributed::detail::TCPClient::connect(std::string, unsigned short)
3   phi::distributed::tcputils::tcp_connect(std::string, std::string, int, std::chrono::duration<long, std::ratio<1l, 1l> >)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1764731853 (unix time) try "date -d @1764731853" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3f5000066cb) received by PID 26397 (TID 0x7f12e3983200) from PID 26315 ***]



--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   phi::distributed::CreateOrGetGlobalTCPStore()
1   phi::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
2   phi::distributed::detail::TCPClient::connect(std::string, unsigned short)
3   phi::distributed::tcputils::tcp_connect(std::string, std::string, int, std::chrono::duration<long, std::ratio<1l, 1l> >)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1764731853 (unix time) try "date -d @1764731853" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3f5000066cb) received by PID 26399 (TID 0x7f81faf2e200) from PID 26315 ***]

[2025-12-03 11:17:33,599] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 4 GPUs. This may be caused by:
 1. There is not enough GPUs visible on your system
 2. Some GPUs are occupied by other process now
 3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests 
 to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2025-12-03 11:17:33,599] [ WARNING] install_check.py:297 - 
 Original Error is: 

----------------------------------------------
Process 2 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
    result = func(*args)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
    dp_layer = paddle.DataParallel(layer)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
    sync_params_buffers(self._layers, fuse_params=False)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
    return func(*args, **kwargs)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
    paddle.distributed.broadcast(
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
    return stream.broadcast(
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
    return _broadcast_in_dygraph(
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
    task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
  Suggestions:
  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
  - Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:311)


PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 302, in run_check
    raise e
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 283, in run_check
    _run_parallel(device_list)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 210, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 614, in spawn
    while not context.join():
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 423, in join
    self._throw_exception(error_index)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 447, in _throw_exception
    raise Exception(msg)
Exception: 

----------------------------------------------
Process 2 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
    result = func(*args)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
    dp_layer = paddle.DataParallel(layer)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
    sync_params_buffers(self._layers, fuse_params=False)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
    return func(*args, **kwargs)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
    paddle.distributed.broadcast(
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
    return stream.broadcast(
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
    return _broadcast_in_dygraph(
  File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
    task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
  Suggestions:
  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
  - Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:311)

First check whether torch is installed; installing torch normally pulls in NCCL (the nvidia-nccl-cu12 wheel) as a dependency:

```shell
proxychains4 python -m pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```
```shell
pip list | grep nccl
# nvidia-nccl-cu12         2.21.5
```

Next, check whether the system's dynamic linker already knows about an NCCL library:

```shell
ldconfig -p | grep nccl || true

# libvncclient.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libvncclient.so.1
```

The only hit is libvncclient.so.1, which is the LibVNC client library and just a false grep match on the substring "nccl"; no real libnccl is registered with the system linker.
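Besides `ldconfig` (which only sees the system linker cache), a quick probe of what the current process environment can actually resolve, including anything on LD_LIBRARY_PATH, is to try loading the library with ctypes. A small sketch, not part of the original fix:

```shell
# probe: can the dynamic loader resolve libnccl.so.2 in this environment?
python3 - <<'EOF'
import ctypes
try:
    ctypes.CDLL("libnccl.so.2")
    print("libnccl.so.2 loadable")
except OSError as err:
    print("libnccl.so.2 NOT loadable:", err)
EOF
```

If this prints "NOT loadable" while Paddle reports the same `libnccl.so` error, the problem is purely a loader-path issue, which is what the steps below fix.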

Locate the library shipped inside the pip wheel:

```shell
python -m pip show -f nvidia-nccl-cu12 || python -m pip show -f nvidia-nccl || true
```

```
Name: nvidia-nccl-cu12
Version: 2.21.5
Summary: NVIDIA Collective Communication Library (NCCL) Runtime
Home-page: https://developer.nvidia.com/cuda-zone
Author: Nvidia CUDA Installer Team
Author-email: cuda_installer@nvidia.com
License: NVIDIA Proprietary Software
Location: /home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages
Requires: 
Required-by: torch
Files:
  nvidia/__init__.py
  nvidia/__pycache__/__init__.cpython-310.pyc
  nvidia/nccl/__init__.py
  nvidia/nccl/__pycache__/__init__.cpython-310.pyc
  nvidia/nccl/include/__init__.py
  nvidia/nccl/include/__pycache__/__init__.cpython-310.pyc
  nvidia/nccl/include/nccl.h
  nvidia/nccl/include/nccl_net.h
  nvidia/nccl/lib/__init__.py
  nvidia/nccl/lib/__pycache__/__init__.cpython-310.pyc
  nvidia/nccl/lib/libnccl.so.2
  nvidia_nccl_cu12-2.21.5.dist-info/INSTALLER
  nvidia_nccl_cu12-2.21.5.dist-info/License.txt
  nvidia_nccl_cu12-2.21.5.dist-info/METADATA
  nvidia_nccl_cu12-2.21.5.dist-info/RECORD
  nvidia_nccl_cu12-2.21.5.dist-info/WHEEL
  nvidia_nccl_cu12-2.21.5.dist-info/top_level.txt
```

The file listing above shows the library at nvidia/nccl/lib/libnccl.so.2 under the wheel's install location.
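Instead of reading the `pip show -f` listing by hand, the same path can be resolved programmatically. A hedged sketch (the variable name `NCCL_LIB` is arbitrary; it assumes the wheel layout shown above):

```shell
# resolve the wheel's libnccl path instead of hard-coding the site-packages path
NCCL_LIB=$(python3 - <<'EOF'
import importlib.util, os
try:
    spec = importlib.util.find_spec("nvidia.nccl")
except ModuleNotFoundError:
    spec = None
if spec and spec.submodule_search_locations:
    print(os.path.join(list(spec.submodule_search_locations)[0], "lib", "libnccl.so.2"))
EOF
)
if [ -n "$NCCL_LIB" ] && [ -f "$NCCL_LIB" ]; then
    echo "found: $NCCL_LIB"
else
    echo "nvidia-nccl wheel not installed"
fi
```

This keeps the next step independent of the Python minor version baked into the site-packages path.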

Paddle only needs the dynamic loader to find it, so copy (or symlink) the library into the environment's lib directory and put that directory on LD_LIBRARY_PATH:

```shell
# get the env prefix
CONDA_PREFIX=$(python -c "import sys; print(sys.prefix)")

# copy the wheel's library into the env's lib dir and create the unversioned symlink
cp "$CONDA_PREFIX/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2" "$CONDA_PREFIX/lib/"
ln -sf "$CONDA_PREFIX/lib/libnccl.so.2" "$CONDA_PREFIX/lib/libnccl.so"

ls -l "$CONDA_PREFIX/lib/" | grep nccl

# refresh the linker cache (only effective if the directory is listed in /etc/ld.so.conf)
sudo ldconfig
```
```shell
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
echo "LD_LIBRARY_PATH now: $LD_LIBRARY_PATH"
```
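The export above only lasts for the current shell session. One conventional way to make it stick for this conda env is an activation hook: conda sources every script in `$CONDA_PREFIX/etc/conda/activate.d/` when the env is activated. A sketch (the file name `nccl_path.sh` is arbitrary):

```shell
# persist LD_LIBRARY_PATH for this env via a conda activation hook
mkdir -p "$CONDA_PREFIX/etc/conda/activate.d"
cat > "$CONDA_PREFIX/etc/conda/activate.d/nccl_path.sh" <<'EOF'
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
EOF
```

After `conda deactivate && conda activate fullduplex`, the variable is set automatically.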
```shell
ls -l "$CONDA_PREFIX/lib/" | grep nccl || true
```
Finally, rerun the check:

```shell
python -c "import paddle; print(paddle.utils.run_check())"
```

(run_check() returns None, which is why a trailing None appears at the end of the output.)
Running verify PaddlePaddle program ... 
I1203 11:44:38.511579 48378 program_interpreter.cc:212] New Executor is Running.
W1203 11:44:38.512056 48378 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:38.513100 48378 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I1203 11:44:38.752264 48378 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
=======================================================================
I1203 11:44:39.863541 48883 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52019 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='2', default_value='')
=======================================================================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I1203 11:44:39.969185 48882 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52019 while the server is not yet listening.
I1203 11:44:39.969283 48881 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52019 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I1203 11:44:39.970425 48880 tcp_utils.cc:181] The server starts to listen on IP_ANY:52019
I1203 11:44:39.970592 48880 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.863881 48883 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.864301 48883 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I1203 11:44:42.969539 48882 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.969566 48881 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.986387 48882 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I1203 11:44:42.986418 48881 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I1203 11:44:42.996657 48880 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W1203 11:44:43.516970 48883 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.517972 48883 gpu_resources.cc:164] device: 3, cuDNN Version: 8.9.
W1203 11:44:43.566345 48881 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.567473 48881 gpu_resources.cc:164] device: 1, cuDNN Version: 8.9.
W1203 11:44:43.580153 48882 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.581290 48882 gpu_resources.cc:164] device: 2, cuDNN Version: 8.9.
W1203 11:44:43.582572 48880 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.583482 48880 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I1203 11:44:44.659265 48881 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
I1203 11:44:44.680932 48880 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
I1203 11:44:44.686960 48882 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
I1203 11:44:44.688688 48883 process_group_nccl.cc:132] ProcessGroupNCCL destruct 
I1203 11:44:44.755733 48940 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 4 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
None