Full error output:

```shell
Running verify PaddlePaddle program ...
I1203 11:17:31.474777 26315 program_interpreter.cc:212] New Executor is Running.
W1203 11:17:31.475191 26315 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:17:31.476090 26315 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I1203 11:17:31.743631 26315 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I1203 11:17:32.905944 26397 tcp_utils.cc:107] Retry to connect to 127.0.0.1:43873 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
=======================================================================
I1203 11:17:32.950119 26399 tcp_utils.cc:107] Retry to connect to 127.0.0.1:43873 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='2', default_value='')
=======================================================================
I1203 11:17:32.954985 26396 tcp_utils.cc:181] The server starts to listen on IP_ANY:43873
I1203 11:17:32.955080 26398 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43873
I1203 11:17:32.955158 26396 tcp_utils.cc:130] Successfully connected to 127.0.0.1:43873
I1203 11:17:32.955351 26398 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W1203 11:17:33.164513 26398 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:17:33.165470 26398 gpu_resources.cc:164] device: 2, cuDNN Version: 8.9.
W1203 11:17:33.173501 26398 dynamic_loader.cc:285] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-download before install PaddlePaddle.
I1203 11:17:33.278411 26398 process_group_nccl.cc:132] ProcessGroupNCCL destruct
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 phi::distributed::CreateOrGetGlobalTCPStore()
1 phi::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
2 phi::distributed::TCPStore::waitWorkers()
----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1764731853 (unix time) try "date -d @1764731853" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3f5000066cb) received by PID 26396 (TID 0x7fa753ce9200) from PID 26315 ***]
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 phi::distributed::CreateOrGetGlobalTCPStore()
1 phi::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
2 phi::distributed::detail::TCPClient::connect(std::string, unsigned short)
3 phi::distributed::tcputils::tcp_connect(std::string, std::string, int, std::chrono::duration<long, std::ratio<1l, 1l> >)
----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1764731853 (unix time) try "date -d @1764731853" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3f5000066cb) received by PID 26397 (TID 0x7f12e3983200) from PID 26315 ***]
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 phi::distributed::CreateOrGetGlobalTCPStore()
1 phi::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
2 phi::distributed::detail::TCPClient::connect(std::string, unsigned short)
3 phi::distributed::tcputils::tcp_connect(std::string, std::string, int, std::chrono::duration<long, std::ratio<1l, 1l> >)
----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1764731853 (unix time) try "date -d @1764731853" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3f5000066cb) received by PID 26399 (TID 0x7f81faf2e200) from PID 26315 ***]
[2025-12-03 11:17:33,599] [ WARNING] install_check.py:289 - PaddlePaddle meets some problem with 4 GPUs. This may be caused by:
1. There is not enough GPUs visible on your system
2. Some GPUs are occupied by other process now
3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2025-12-03 11:17:33,599] [ WARNING] install_check.py:297 -
Original Error is:
----------------------------------------------
Process 2 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
result = func(*args)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
dp_layer = paddle.DataParallel(layer)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
sync_params_buffers(self._layers, fuse_params=False)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
return caller(func, *(extras + args), **kw)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
return func(*args, **kwargs)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
return caller(func, *(extras + args), **kw)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
return wrapped_func(*args, **kwargs)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
return func(*args, **kwargs)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
paddle.distributed.broadcast(
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
return stream.broadcast(
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
return _broadcast_in_dygraph(
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:311)
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 302, in run_check
raise e
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 283, in run_check
_run_parallel(device_list)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 210, in _run_parallel
paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 614, in spawn
while not context.join():
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 423, in join
self._throw_exception(error_index)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 447, in _throw_exception
raise Exception(msg)
Exception:
----------------------------------------------
Process 2 terminated with the following error:
----------------------------------------------
Traceback (most recent call last):
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 372, in _func_wrapper
result = func(*args)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/utils/install_check.py", line 184, in train_for_run_parallel
dp_layer = paddle.DataParallel(layer)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 398, in __init__
sync_params_buffers(self._layers, fuse_params=False)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
return caller(func, *(extras + args), **kw)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/dygraph/base.py", line 340, in __impl__
return func(*args, **kwargs)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/decorator.py", line 235, in fun
return caller(func, *(extras + args), **kw)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
return wrapped_func(*args, **kwargs)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/base/framework.py", line 593, in __impl__
return func(*args, **kwargs)
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/parallel.py", line 197, in sync_params_buffers
paddle.distributed.broadcast(
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/broadcast.py", line 64, in broadcast
return stream.broadcast(
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 124, in broadcast
return _broadcast_in_dygraph(
File "/home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/paddle/distributed/communication/stream/broadcast.py", line 32, in _broadcast_in_dygraph
task = group.process_group.broadcast(tensor, src_rank_in_group, sync_op)
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:311)
```
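Reading the log: worker 2 is the one that actually fails (it cannot dlopen `libnccl.so` during `DataParallel` setup); the parent process (PID 26315) then sends SIGTERM to the other workers, which are still waiting in `TCPStore` connect, producing the three SIGTERM tracebacks. The root cause can be reproduced outside Paddle with a minimal loader check (a sketch; assumes only that `python` is on PATH):

```shell
# Ask the dynamic loader for libnccl.so directly; this is the same lookup
# that fails inside Paddle's dynamic_loader.cc.
python - <<'EOF'
import ctypes
try:
    ctypes.CDLL("libnccl.so")
    print("libnccl.so: loadable")
except OSError as exc:
    print("libnccl.so: NOT loadable:", exc)
EOF
```

If this prints "NOT loadable", the task is simply to get a `libnccl.so` somewhere the loader searches, which is what the steps below do.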
First, check whether torch is installed; the CUDA builds of torch pull in NCCL as a dependency (the wheel is `nvidia-nccl-cu12`, which can also be installed on its own):

```shell
proxychains4 python -m pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```

```shell
pip list | grep nccl
# nvidia-nccl-cu12 2.21.5
```
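Grepping `pip list` confirms the wheel is present; the directory it installed into can also be derived programmatically, which is handy in scripts (a sketch; prints a fallback when the wheel is absent):

```shell
# Resolve the nvidia.nccl package location via importlib rather than parsing pip output
python - <<'EOF'
import os
import importlib.util
try:
    spec = importlib.util.find_spec("nvidia.nccl")
except ModuleNotFoundError:
    spec = None
if spec and spec.origin:
    # spec.origin is .../nvidia/nccl/__init__.py; the shared library lives in lib/
    print(os.path.join(os.path.dirname(spec.origin), "lib"))
else:
    print("nvidia-nccl wheel not installed")
EOF
```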
Check whether an NCCL library is visible to the system loader. Note that the only hit below is `libvncclient` (the VNC client library, matched only because "nccl" happens to occur inside its name), so NCCL is in fact absent from the loader cache:

```shell
ldconfig -p | grep nccl || true
# libvncclient.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libvncclient.so.1
```
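Since `grep nccl` also matches `libvncclient`, anchoring the pattern avoids the false positive (a sketch):

```shell
# Match libnccl.so itself rather than any name containing "nccl"
ldconfig -p 2>/dev/null | grep -E '\blibnccl\.so' || echo "libnccl not in the ldconfig cache"
```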
Locate the file the wheel ships:

```shell
python -m pip show -f nvidia-nccl-cu12 || python -m pip show -f nvidia-nccl || true
```

```shell
Name: nvidia-nccl-cu12
Version: 2.21.5
Summary: NVIDIA Collective Communication Library (NCCL) Runtime
Home-page: https://developer.nvidia.com/cuda-zone
Author: Nvidia CUDA Installer Team
Author-email: cuda_installer@nvidia.com
License: NVIDIA Proprietary Software
Location: /home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages
Requires:
Required-by: torch
Files:
nvidia/__init__.py
nvidia/__pycache__/__init__.cpython-310.pyc
nvidia/nccl/__init__.py
nvidia/nccl/__pycache__/__init__.cpython-310.pyc
nvidia/nccl/include/__init__.py
nvidia/nccl/include/__pycache__/__init__.cpython-310.pyc
nvidia/nccl/include/nccl.h
nvidia/nccl/include/nccl_net.h
nvidia/nccl/lib/__init__.py
nvidia/nccl/lib/__pycache__/__init__.cpython-310.pyc
nvidia/nccl/lib/libnccl.so.2
nvidia_nccl_cu12-2.21.5.dist-info/INSTALLER
nvidia_nccl_cu12-2.21.5.dist-info/License.txt
nvidia_nccl_cu12-2.21.5.dist-info/METADATA
nvidia_nccl_cu12-2.21.5.dist-info/RECORD
nvidia_nccl_cu12-2.21.5.dist-info/WHEEL
nvidia_nccl_cu12-2.21.5.dist-info/top_level.txt
```
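Copying the library into the env's `lib/` (next step) works; an alternative that avoids duplicating the `.so` is to create the unversioned name inside the wheel's own `lib/` directory and put that directory on `LD_LIBRARY_PATH` (a sketch; the `nvidia/nccl/lib` layout is assumed to match the `pip show` listing above):

```shell
# Derive the wheel's lib dir from the active interpreter's site-packages
NCCL_LIB="$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')/nvidia/nccl/lib"
if [ -d "$NCCL_LIB" ]; then
    # Give the loader the unversioned name it asks for, in place
    ln -sf "$NCCL_LIB/libnccl.so.2" "$NCCL_LIB/libnccl.so"
    export LD_LIBRARY_PATH="$NCCL_LIB:$LD_LIBRARY_PATH"
    echo "using $NCCL_LIB"
else
    echo "nvidia-nccl wheel not found under $NCCL_LIB"
fi
```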
The listing above shows the library at `nvidia/nccl/lib/libnccl.so.2`. Expose it on the loader path by copying it into the env's `lib/` directory and creating the unversioned `libnccl.so` symlink that the loader asks for:

```shell
# Resolve the env prefix
CONDA_PREFIX=$(python -c "import sys; print(sys.prefix)")
# Copy the library and create the unversioned symlink
cp /home/wangguisen/miniconda3/envs/fullduplex/lib/python3.10/site-packages/nvidia/nccl/lib/libnccl.so.2 "$CONDA_PREFIX/lib/"
ln -sf "$CONDA_PREFIX/lib/libnccl.so.2" "$CONDA_PREFIX/lib/libnccl.so"
ls -l "$CONDA_PREFIX/lib/" | grep nccl
# Note: `sudo ldconfig` only rebuilds the system-wide cache and does not index
# $CONDA_PREFIX/lib, so the LD_LIBRARY_PATH export below is what actually matters
sudo ldconfig
```
Then add the env's `lib/` directory to the runtime library search path:

```shell
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
echo "LD_LIBRARY_PATH now: $LD_LIBRARY_PATH"
```
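The export above only lasts for the current shell session. To make it stick for this conda env, conda's standard `activate.d` hook mechanism can be used (a sketch; the hook file name `nccl_ld_path.sh` is arbitrary):

```shell
# Fall back to the interpreter prefix if CONDA_PREFIX is not exported
PREFIX="${CONDA_PREFIX:-$(python -c 'import sys; print(sys.prefix)')}"
mkdir -p "$PREFIX/etc/conda/activate.d"
# This script runs automatically on every `conda activate` of the env
cat > "$PREFIX/etc/conda/activate.d/nccl_ld_path.sh" <<'EOF'
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"
EOF
cat "$PREFIX/etc/conda/activate.d/nccl_ld_path.sh"
```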
Verify the files are in place:

```shell
ls -l "$CONDA_PREFIX/lib/" | grep nccl || true
```
Re-run the verification (`run_check()` returns `None`, which explains the trailing `None` in the output):

```shell
python -c "import paddle; print(paddle.utils.run_check())"
```
```shell
Running verify PaddlePaddle program ...
I1203 11:44:38.511579 48378 program_interpreter.cc:212] New Executor is Running.
W1203 11:44:38.512056 48378 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:38.513100 48378 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I1203 11:44:38.752264 48378 interpreter_util.cc:624] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
=======================================================================
I1203 11:44:39.863541 48883 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52019 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='2', default_value='')
=======================================================================
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
=======================================================================
I1203 11:44:39.969185 48882 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52019 while the server is not yet listening.
I1203 11:44:39.969283 48881 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52019 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I1203 11:44:39.970425 48880 tcp_utils.cc:181] The server starts to listen on IP_ANY:52019
I1203 11:44:39.970592 48880 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.863881 48883 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.864301 48883 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I1203 11:44:42.969539 48882 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.969566 48881 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52019
I1203 11:44:42.986387 48882 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I1203 11:44:42.986418 48881 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I1203 11:44:42.996657 48880 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
W1203 11:44:43.516970 48883 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.517972 48883 gpu_resources.cc:164] device: 3, cuDNN Version: 8.9.
W1203 11:44:43.566345 48881 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.567473 48881 gpu_resources.cc:164] device: 1, cuDNN Version: 8.9.
W1203 11:44:43.580153 48882 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.581290 48882 gpu_resources.cc:164] device: 2, cuDNN Version: 8.9.
W1203 11:44:43.582572 48880 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.0
W1203 11:44:43.583482 48880 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
I1203 11:44:44.659265 48881 process_group_nccl.cc:132] ProcessGroupNCCL destruct
I1203 11:44:44.680932 48880 process_group_nccl.cc:132] ProcessGroupNCCL destruct
I1203 11:44:44.686960 48882 process_group_nccl.cc:132] ProcessGroupNCCL destruct
I1203 11:44:44.688688 48883 process_group_nccl.cc:132] ProcessGroupNCCL destruct
I1203 11:44:44.755733 48940 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 4 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
None
```
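As a last sanity check, the NCCL version Paddle actually linked against can be queried (a sketch; `paddle.version.nccl()` is present in recent Paddle releases, but treat the exact accessor as an assumption, hence the fallback echo):

```shell
# Print the NCCL version Paddle was built/loaded with, if Paddle is importable
python -c "import paddle; print('NCCL', paddle.version.nccl())" 2>/dev/null \
  || echo "paddle not importable in this environment"
```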