NVML_SUCCESS == DriverAPI::get()->nvmlInit_v2_()
Problem Description
When running a model on a GPU (which other people were also using) and repeatedly calling it with the same piece of data, the following exception occasionally occurred:
```bash
Traceback (most recent call last):
File "/data/gpu_info.py", line 21, in <module>
img = deepcopy(img)
File "/data/envs/birefnet/lib/python3.10/copy.py", line 153, in deepcopy
y = copier(memo)
File "/data/envs/birefnet/lib/python3.10/site-packages/torch/_tensor.py", line 172, in __deepcopy__
new_storage = self._typed_storage()._deepcopy(memo)
File "/data/envs/birefnet/lib/python3.10/site-packages/torch/storage.py", line 1134, in _deepcopy
return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
File "/data/envs/birefnet/lib/python3.10/copy.py", line 153, in deepcopy
y = copier(memo)
File "/data/envs/birefnet/lib/python3.10/site-packages/torch/storage.py", line 239, in __deepcopy__
new_storage = self.clone()
File "/data/envs/birefnet/lib/python3.10/site-packages/torch/storage.py", line 253, in clone
return type(self)(self.nbytes(), device=self.device).copy_(self)
RuntimeError: NVML_SUCCESS == DriverAPI::get()->nvmlInit_v2_() INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":983, please report a bug to PyTorch.
```
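For reference, here is a minimal sketch of the call pattern behind the traceback. The tensor shape and device index are hypothetical, and the comment describes the most likely failure mechanism rather than a confirmed one:

```python
import torch
from copy import deepcopy

# Hypothetical stand-in for the image tensor in /data/gpu_info.py.
device = torch.device("cuda:0")
img = torch.rand(3, 1024, 1024, device=device)

# Each deepcopy clones the CUDA storage, which requests fresh device memory
# from the caching allocator. If that request fails while NVML cannot be
# initialized (nvidia-smi was broken here too), PyTorch can raise the
# INTERNAL ASSERT above instead of a regular out-of-memory error.
for _ in range(1000):
    img = deepcopy(img)
```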
Troubleshooting
nvidia-smi could not be used normally, which matches the failing nvmlInit_v2_() call in the assert message. For practical reasons the server could not be rebooted, so there was no way to restore the GPU driver to a working state.
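One way to confirm that NVML itself is broken, independently of PyTorch, is to initialize it directly. The sketch below assumes the pynvml package is installed:

```python
import pynvml

try:
    # Same NVML initialization call named in the assert message (nvmlInit_v2).
    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    print(f"NVML OK, {count} device(s) visible")
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    # If this fails, nvidia-smi and PyTorch's NVML-based OOM reporting fail too.
    print(f"NVML initialization failed: {err}")
```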
Checking GPU Resources
Check GPU memory: take an arbitrary large tensor, push it into GPU memory, and copy it repeatedly. This showed that the GPU currently in use had very little free memory left (less than 4 MB).
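A sketch of that trial-allocation probe is shown below; the 64 MB step size is arbitrary, and the loop simply allocates blocks on each visible GPU until an allocation fails:

```python
import torch

def probe_allocatable_mb(device_index: int, step_mb: int = 64) -> int:
    """Roughly estimate how many MB can still be allocated on a GPU by
    grabbing fixed-size blocks until an allocation fails."""
    device = torch.device(f"cuda:{device_index}")
    blocks, allocated_mb = [], 0
    try:
        while True:
            # A float32 element is 4 bytes, so step_mb MB is step_mb * 2**20 / 4 elements.
            blocks.append(torch.empty(step_mb * (1 << 20) // 4,
                                      dtype=torch.float32, device=device))
            allocated_mb += step_mb
    except RuntimeError:
        pass  # allocation failed: this GPU is roughly full at this point
    finally:
        del blocks
        torch.cuda.empty_cache()
    return allocated_mb

for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: roughly {probe_allocatable_mb(i)} MB allocatable")
```

When the driver is healthy, `torch.cuda.mem_get_info()` returns the free and total memory in bytes directly; the trial-allocation probe is mainly useful when tools like nvidia-smi cannot be trusted, as was the case here.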
Solution
Using the approach above, I found a GPU with enough free memory, ran the code on that GPU, and the problem went away.
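A minimal sketch of pinning the process to the chosen GPU; the index "1" is a placeholder for whichever device the probe showed to have enough free memory:

```python
import os

# Restrict this process to the healthy GPU before CUDA is initialized.
# "1" is a placeholder index, not the actual device used in the original setup.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

device = torch.device("cuda")  # now refers to physical GPU 1
x = torch.rand(3, 1024, 1024, device=device)
print(torch.cuda.get_device_name(device), x.device)
```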