Setting Up the Environment for Training CPGNet on H20

1. Local CUDA installation

Version: 12.1

sudo sh cuda_12.1.1_530.30.02_linux.run

echo 'export CUDA_HOME=/usr/local/cuda-12.1' >> ~/.bashrc

echo 'export PATH=$CUDA_HOME/bin:$PATH' >> ~/.bashrc

echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

source ~/.bashrc
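
After sourcing ~/.bashrc, it can be worth confirming that the three variables agree with each other. The following is a minimal sketch, not part of the original setup: the helper name check_cuda_env is made up for illustration, and it only inspects the environment variables; it does not verify that the toolkit itself is installed.

```python
import os


def check_cuda_env(env):
    """Return a list of problems found in the CUDA-related environment variables."""
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return ["CUDA_HOME is not set"]

    problems = []
    base = cuda_home.rstrip("/")
    # Linux-style search paths are separated by ":".
    path_entries = env.get("PATH", "").split(":")
    lib_entries = env.get("LD_LIBRARY_PATH", "").split(":")
    if base + "/bin" not in path_entries:
        problems.append("PATH does not contain $CUDA_HOME/bin")
    if base + "/lib64" not in lib_entries:
        problems.append("LD_LIBRARY_PATH does not contain $CUDA_HOME/lib64")
    return problems


if __name__ == "__main__":
    for line in check_cuda_env(os.environ) or ["CUDA environment variables look consistent"]:
        print(line)
```

An empty result only means the variables are consistent with CUDA_HOME; running nvcc --version is still the direct way to confirm the toolkit version.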

2. Installing torch

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

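
To confirm that the pinned versions were actually installed, the package metadata can be compared with the pins. This is a sketch under assumptions: find_mismatches is a hypothetical helper, and it assumes wheels from the cu121 index may report a local suffix such as "+cu121", which is ignored in the comparison.

```python
from importlib import metadata

# Pins taken from the pip command above.
EXPECTED = {"torch": "2.4.1", "torchvision": "0.19.1", "torchaudio": "2.4.1"}


def find_mismatches(expected, get_version=metadata.version):
    """Map each package whose installed version differs from its pin to what was found."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            mismatches[name] = "not installed"
            continue
        # Drop a local version suffix such as "+cu121" before comparing.
        if installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    print(find_mismatches(EXPECTED) or "all pinned versions match")
```

The get_version parameter exists only so the check can be exercised without the packages installed; in normal use the default is enough.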
3. Installing dependencies

Training environment:

pip install pyyaml==5.4.1

pip install pyyaml

pip install opencv-python

pip install scipy

pip install nuscenes-devkit

pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.4/index.html

Environment for pkl generation:

pip install terminaltables

pip install requests

pip install numba

pip install xlwt

4. Code changes

4.1 Traceback (most recent call last):

File "/home/foton/M/temp/Project/generate_pkl/tools/create_data.py", line 11, in <module>

from det3d.datasets.kitti import kitti_common as kitti_ds

ModuleNotFoundError: No module named 'det3d'

Fix: change the local path so that Python can find the det3d package.
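
One common way to make such a package importable is to put its parent directory on sys.path before the import runs. The sketch below is an illustration only: add_project_root is a hypothetical helper, and PROJECT_ROOT is an assumed location for the directory that contains det3d/; the original does not say which path was changed.

```python
import sys
from pathlib import Path


def add_project_root(root):
    """Put `root` at the front of sys.path if it is not already there."""
    root = str(Path(root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# Hypothetical location: the directory that contains the det3d/ package.
PROJECT_ROOT = Path("/home/foton/M/temp/Project/generate_pkl")
add_project_root(PROJECT_ROOT)

# After this, `from det3d.datasets.kitti import kitti_common` can be resolved
# provided det3d/ really sits directly under PROJECT_ROOT.
```

Setting PYTHONPATH in the shell before launching the script achieves the same thing without touching the code.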

4.2 yaml

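The code calls yaml.load(f). Calling yaml.load() without a Loader has been deprecated for security reasons since PyYAML 5.1, and from PyYAML 6.0 on the Loader argument is required; without it the call fails with: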

TypeError: load() missing 1 required positional argument: 'Loader'

self.task_cfg = yaml.load(f)

Change it to:

self.task_cfg = yaml.safe_load(f)
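
The change can be tried in isolation with a small stand-in config. This is a sketch: the keys batch_size and lr are made up for illustration and are not taken from the CPGNet configuration.

```python
import io

import yaml  # PyYAML

# Stand-in for the opened config file; the keys are made up for illustration.
f = io.StringIO("batch_size: 4\nlr: 0.01\n")

# safe_load builds only plain Python objects (dict, list, str, numbers, ...).
task_cfg = yaml.safe_load(f)
print(task_cfg)

# Equivalent spelling with an explicit loader.
f.seek(0)
assert yaml.load(f, Loader=yaml.SafeLoader) == task_cfg
```

If a config file relies on Python-specific YAML tags, safe_load will reject it; in that case an explicit Loader would have to be chosen deliberately rather than by default.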

4.3

The collections.Iterable alias was removed in Python 3.10, but the det3d code still uses the old import:

from collections import Iterable, defaultdict

After the change:

from collections.abc import Iterable

from collections import defaultdict
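
A short, self-contained check that the new imports behave as the old one did. The flatten helper and the label data are illustrative only and do not come from det3d.

```python
from collections import defaultdict
from collections.abc import Iterable


def flatten(values):
    """Yield the leaves of a nested iterable, treating str and bytes as leaves."""
    for v in values:
        if isinstance(v, Iterable) and not isinstance(v, (str, bytes)):
            yield from flatten(v)
        else:
            yield v


# Count labels in a nested structure using both imported names.
counts = defaultdict(int)
for label in flatten([["car", "bus"], "car", ("truck", ["car"])]):
    counts[label] += 1

print(dict(counts))  # {'car': 3, 'bus': 1, 'truck': 1}
```

Importing from collections.abc also works on older Python 3 versions, so the change does not need a version guard.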

4.4

warnings.warn(warning.format(ret))

rank0: Traceback (most recent call last):

rank0: File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

rank0: main(args, config)

rank0: File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

rank0: model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

rank0: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

rank0: _verify_param_shape_across_processes(self.process_group, parameters)

rank0: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

rank0: return dist._verify_params_across_processes(process_group, tensors, logger)

rank0: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

rank0: ncclUnhandledCudaError: Call to CUDA function failed.

rank0: Last error:

rank0: Cuda failure 802 'system not yet initialized'

rank3: Traceback (most recent call last):

rank3: File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

rank3: main(args, config)

rank3: File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

rank3: model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

rank3: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

rank3: _verify_param_shape_across_processes(self.process_group, parameters)

rank3: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

rank3: return dist._verify_params_across_processes(process_group, tensors, logger)

rank3: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

rank3: ncclUnhandledCudaError: Call to CUDA function failed.

rank3: Last error:

rank3: Cuda failure 802 'system not yet initialized'

rank2: Traceback (most recent call last):

rank2: File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

rank2: main(args, config)

rank2: File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

rank2: model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

rank2: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

rank2: _verify_param_shape_across_processes(self.process_group, parameters)

rank2: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

rank2: return dist._verify_params_across_processes(process_group, tensors, logger)

rank2: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

rank2: ncclUnhandledCudaError: Call to CUDA function failed.

rank2: Last error:

rank2: Cuda failure 802 'system not yet initialized'

rank1: Traceback (most recent call last):

rank1: File "/home/foton/M/temp/Project/CPGNet/train.py", line 260, in <module>

rank1: main(args, config)

rank1: File "/home/foton/M/temp/Project/CPGNet/train.py", line 209, in main

rank1: model = torch.nn.parallel.DistributedDataParallel(base_net.to(device),

rank1: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__

rank1: _verify_param_shape_across_processes(self.process_group, parameters)

rank1: File "/home/foton/miniconda3/envs/cpg/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes

rank1: return dist._verify_params_across_processes(process_group, tensors, logger)

rank1: torch.distributed.DistBackendError: NCCL error in: .../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5

rank1: ncclUnhandledCudaError: Call to CUDA function failed.

rank1: Last error:

rank1: Cuda failure 802 'system not yet initialized'

rank0:W611 19:02:58.669442585 ProcessGroupNCCL.cpp:1168 Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the applicatio

With only GPUs 0 and 1, training runs without this error.
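
A common way to limit a run to specific GPUs is the CUDA_VISIBLE_DEVICES environment variable. The sketch below assumes that mechanism; the original does not say how GPUs 0 and 1 were selected, and restrict_gpus is a hypothetical helper.

```python
import os


def restrict_gpus(indices, env=os.environ):
    """Expose only the given GPU indices to CUDA via CUDA_VISIBLE_DEVICES."""
    value = ",".join(str(i) for i in indices)
    env["CUDA_VISIBLE_DEVICES"] = value
    return value


# Must run before the first CUDA call (in practice: before importing torch).
restrict_gpus([0, 1])
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same effect is obtained by prefixing the launch command with CUDA_VISIBLE_DEVICES=0,1; the number of processes started must then match the two visible devices.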

5. nccl vs. gloo

nccl is 1.5 times faster than gloo.
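
To compare the two backends, it helps to make the backend name switchable without editing the training script. This is a minimal sketch: pick_backend and the DIST_BACKEND environment variable are hypothetical names used only here, not part of CPGNet or PyTorch.

```python
import os


def pick_backend(cuda_available, override=None):
    """Choose a torch.distributed backend name: "nccl" for GPU runs, "gloo" otherwise."""
    if override is not None:
        if override not in ("nccl", "gloo"):
            raise ValueError(f"unsupported backend: {override}")
        return override
    return "nccl" if cuda_available else "gloo"


# DIST_BACKEND is a hypothetical environment variable used only in this sketch.
backend = pick_backend(cuda_available=True, override=os.environ.get("DIST_BACKEND"))
print(backend)

# In a training script the name would then be passed on, e.g.
#   torch.distributed.init_process_group(backend=backend)
```

Keeping the choice in one place makes it easy to rerun the same job with each backend and time both.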
