Setting up the CPGNet training environment on H20

1. Local CUDA installation

Version: 12.1

sudo sh cuda_12.1.1_530.30.02_linux.run

echo 'export CUDA_HOME=/usr/local/cuda-12.1' >> ~/.bashrc

echo 'export PATH=$CUDA_HOME/bin:$PATH' >> ~/.bashrc

echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

source ~/.bashrc
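
After reloading the shell, the three variables can be sanity-checked. The following is a minimal sketch, assuming the install prefix /usr/local/cuda-12.1 used above; check_cuda_env is a hypothetical helper name, not part of the CUDA toolkit.

```python
import os


def check_cuda_env(env, cuda_home="/usr/local/cuda-12.1"):
    """Return a list of problems found in the CUDA-related variables of env."""
    problems = []
    if env.get("CUDA_HOME") != cuda_home:
        problems.append("CUDA_HOME is not set to " + cuda_home)
    # PATH and LD_LIBRARY_PATH are colon-separated lists on Linux.
    if cuda_home + "/bin" not in env.get("PATH", "").split(":"):
        problems.append("PATH does not contain " + cuda_home + "/bin")
    if cuda_home + "/lib64" not in env.get("LD_LIBRARY_PATH", "").split(":"):
        problems.append("LD_LIBRARY_PATH does not contain " + cuda_home + "/lib64")
    return problems


if __name__ == "__main__":
    # An empty result means all three variables look as expected.
    for problem in check_cuda_env(os.environ):
        print("WARNING:", problem)
```

An empty list means the exports from ~/.bashrc are visible in the current process.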

2. torch installation

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
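
To confirm that the cu121 wheel was picked up, a quick check along these lines may help. This is only a sketch: torch_report is a hypothetical helper name, and the function returns early if torch is not importable so that it can be run before or after the install.

```python
import importlib.util


def torch_report():
    """Collect basic facts about the installed torch build, if any."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "version": torch.__version__,         # version string of the wheel
        "cuda_build": torch.version.cuda,      # CUDA version the wheel was built for
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(key, "=", value)
```

With the command above, version should start with 2.4.1 and cuda_build should correspond to CUDA 12.1; cuda_available being False usually points to a driver or environment problem rather than to the wheel itself.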

3. Dependency installation

Training environment:

pip install pyyaml==5.4.1

pip install pyyaml

pip install opencv-python

pip install scipy

pip install nuscenes-devkit

pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.4/index.html

pkl generation environment:

pip install terminaltables

pip install requests

pip install numba

pip install xlwt
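
Once both sets of packages are installed, their versions can be listed in one pass. The sketch below uses only the standard library; the REQUIRED list simply mirrors the pip commands above, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as used in the pip commands of this section.
REQUIRED = [
    "PyYAML", "opencv-python", "scipy", "nuscenes-devkit", "mmcv",
    "terminaltables", "requests", "numba", "xlwt",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version if version else 'NOT INSTALLED'}")
```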

4. Code changes

4.1

Traceback (most recent call last):

File "/home/foton/M/temp/Project/generate_pkl/tools/create_data.py", line 11, in <module>

from det3d.datasets.kitti import kitti_common as kitti_ds

ModuleNotFoundError: No module named 'det3d'

Fix: update the local path so that Python can find the det3d package.
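
One possible way to do this is sketched below. It rests on an assumption that the notes do not confirm: that the det3d directory sits directly under the generate_pkl project root, so that adding this root to sys.path before the import is enough. add_to_sys_path is a hypothetical helper name.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Put directory at the front of sys.path once and return its resolved form."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumed usage at the top of tools/create_data.py, before "from det3d ..." is
# executed. parents[1] is the directory above tools/, i.e. the project root in
# the layout assumed here:
#
#     add_to_sys_path(Path(__file__).resolve().parents[1])
```

Setting PYTHONPATH to the same directory before launching the script has the same effect without touching the code.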

4.2 yaml

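The code calls yaml.load(f). Calling yaml.load() without a Loader has been deprecated since PyYAML 5.1 for security reasons, and since PyYAML 6.0 the Loader argument must be passed explicitly; otherwise the call fails with: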

TypeError: load() missing 1 required positional argument: 'Loader'

self.task_cfg = yaml.load(f)

Change to:

self.task_cfg = yaml.safe_load(f)

4.3

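collections.Iterable was removed in Python 3.10 (the abstract base classes are now available only from collections.abc), but the det3d code still uses the old import: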

from collections import Iterable, defaultdict

After the change:

from collections.abc import Iterable

from collections import defaultdict
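
The same legacy import may appear in more than one file. A small scanner such as the following can list the remaining occurrences; it is a rough sketch that only handles single-line "from collections import ..." statements, and find_legacy_abc_imports and ABC_NAMES are hypothetical names introduced here.

```python
import re
from pathlib import Path

# A subset of the names that moved to collections.abc.
ABC_NAMES = {
    "Iterable", "Iterator", "Mapping", "MutableMapping", "Sequence",
    "MutableSequence", "Set", "Callable", "Hashable", "Sized", "Container",
}
IMPORT_RE = re.compile(r"^\s*from\s+collections\s+import\s+(.+)$")


def find_legacy_abc_imports(root):
    """Return (path, line number, line) for imports of ABCs from collections."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            code = line.split("#")[0]  # drop trailing comments
            match = IMPORT_RE.match(code)
            if not match:
                continue
            names = {
                part.strip().split(" as ")[0].strip()
                for part in match.group(1).strip().strip("()").split(",")
            }
            if names & ABC_NAMES:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Running it on the det3d source directory and fixing each reported line as shown above avoids hitting the same ImportError one file at a time.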

4.4

warnings.warn(warning.format(ret))

[rank0]: Traceback (most recent call last):

[rank0]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

[rank0]:     main(args, config)

[rank0]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

[rank0]:     model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

[rank0]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

[rank0]:     _verify_param_shape_across_processes(self.process_group, parameters)

[rank0]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

[rank0]:     return dist._verify_params_across_processes(process_group, tensors, logger)

[rank0]: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.

[rank0]: Last error:

[rank0]: Cuda failure 802 'system not yet initialized'

[rank3]: Traceback (most recent call last):

[rank3]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

[rank3]:     main(args, config)

[rank3]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

[rank3]:     model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

[rank3]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

[rank3]:     _verify_param_shape_across_processes(self.process_group, parameters)

[rank3]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

[rank3]:     return dist._verify_params_across_processes(process_group, tensors, logger)

[rank3]: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

[rank3]: ncclUnhandledCudaError: Call to CUDA function failed.

[rank3]: Last error:

[rank3]: Cuda failure 802 'system not yet initialized'

[rank2]: Traceback (most recent call last):

[rank2]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

[rank2]:     main(args, config)

[rank2]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

[rank2]:     model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

[rank2]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

[rank2]:     _verify_param_shape_across_processes(self.process_group, parameters)

[rank2]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

[rank2]:     return dist._verify_params_across_processes(process_group, tensors, logger)

[rank2]: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

[rank2]: ncclUnhandledCudaError: Call to CUDA function failed.

[rank2]: Last error:

[rank2]: Cuda failure 802 'system not yet initialized'

[rank1]: Traceback (most recent call last):

[rank1]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

[rank1]:     main(args, config)

[rank1]:   File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

[rank1]:     model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

[rank1]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

[rank1]:     _verify_param_shape_across_processes(self.process_group, parameters)

[rank1]:   File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

[rank1]:     return dist._verify_params_across_processes(process_group, tensors, logger)

[rank1]: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.

[rank1]: Last error:

[rank1]: Cuda failure 802 'system not yet initialized'

[rank0]: [W611 19:02:58.669442585 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the applicatio

The error does not occur when only GPUs 0 and 1 are used.
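
The notes do not record how the two cards were selected. One common way is the CUDA_VISIBLE_DEVICES environment variable, combined with NCCL_DEBUG=INFO as the error message itself suggests. The sketch below only builds such an environment; restricted_gpu_env is a hypothetical helper name, and the launch command in the comment is a placeholder, not the project's actual command.

```python
import os


def restricted_gpu_env(gpu_ids, base_env=None, nccl_debug=True):
    """Return a copy of the environment that exposes only the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA only enumerates the devices listed here, renumbered from 0.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    if nccl_debug:
        # Makes NCCL print details about the failing CUDA call.
        env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    env = restricted_gpu_env([0, 1])
    print(env["CUDA_VISIBLE_DEVICES"], env["NCCL_DEBUG"])
    # Placeholder launch, to be replaced with the real training command:
    #     subprocess.run(["torchrun", "--nproc_per_node=2", "train.py"], env=env)
```

The variable must be set before the training process initializes CUDA, which is why it is passed through the environment of the launched process rather than set inside train.py after torch has been used.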

5. nccl vs. gloo

nccl is 1.5 times faster than gloo.
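
The backend is chosen when the process group is created. As a minimal sketch of how the two could be switched for such a comparison, choose_backend below is a hypothetical helper; how train.py actually selects its backend is not shown in these notes.

```python
def choose_backend(cuda_available, force_gloo=False):
    """Pick the torch.distributed backend name for this run."""
    # nccl requires CUDA devices; gloo also works on CPU-only machines.
    if cuda_available and not force_gloo:
        return "nccl"
    return "gloo"


# Assumed usage inside the training script:
#
#     import torch
#     import torch.distributed as dist
#     dist.init_process_group(backend=choose_backend(torch.cuda.is_available()))

if __name__ == "__main__":
    print(choose_backend(True))
    print(choose_backend(True, force_gloo=True))
```

Keeping gloo available as a fallback can help isolate NCCL-specific failures such as the one in 4.4, at the cost of the slower communication noted above.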