While training a model on a server, I ran into the following error:
Traceback (most recent call last):
File "/home/ubuntu/zcardvein/zzz_dataAndTrain.py", line 163, in <module>
preds = model(img_batch)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/zcardvein/zzz_dataAndTrain.py", line 31, in forward
enc1 = self.encoder1(x)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/ubuntu/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
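After asking GPT and some friends in a group chat, I concluded that the problem was on the CUDA side, because the same code ran fine on the CPU.
The main troubleshooting steps were:
1: Check GPU memory usage. The card had not run out of memory.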
nvidia-smi
2: Check that the torch build matches CUDA. The reported version was 2.0.1+cu117, which is fine.
python -c "import torch; print(torch.__version__)"
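3: Check whether CUDA is available. The result was True.
python -c "import torch; print(torch.cuda.is_available())"
4: Switch the device to cpu. The code ran normally.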
device = torch.device("cpu")
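Steps 2 to 4 can also be gathered in one small script. The following is only a sketch: the function name collect_cuda_diagnostics and the dictionary keys are made up for illustration, and the cuDNN fields go slightly beyond what the steps above checked.

```python
import importlib.util


def collect_cuda_diagnostics():
    """Collect the version and availability facts from steps 2 to 4.

    Hypothetical helper for illustration; the key names are arbitrary.
    """
    info = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not info["torch_installed"]:
        return info

    import torch

    info["torch_version"] = torch.__version__      # the post reports 2.0.1+cu117
    info["built_with_cuda"] = torch.version.cuda   # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["cudnn_enabled"] = torch.backends.cudnn.enabled
    try:
        # version() returns None when cuDNN cannot be used.
        info["cudnn_version"] = torch.backends.cudnn.version()
    except RuntimeError as exc:
        # Record the failure instead of aborting the whole report.
        info["cudnn_version"] = f"error: {exc}"
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If cuda_available is True while cudnn_version is None or an error string, that would point at cuDNN rather than at CUDA as a whole.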
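In theory, simply moving to a different server would also have made the problem go away.
Then someone in the group suggested trying to disable cuDNN.
So...
5: Add a statement that disables cuDNN at the top of the Python file being run.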
torch.backends.cudnn.enabled = False
With that, training finally ran!
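To confirm that cuDNN is really the culprit before disabling it for the whole run, one option is a tiny convolution smoke test with cuDNN on and off. This is a sketch under assumptions, not what I actually ran: conv_smoke_test is a hypothetical helper, and the layer and tensor sizes are arbitrary. It uses the torch.backends.cudnn.flags context manager so the global setting is left unchanged afterwards.

```python
import importlib.util


def conv_smoke_test(use_cudnn):
    """Run one small Conv2d forward pass on the GPU.

    Returns "ok", or a string starting with "skipped:" or "failed:".
    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec("torch") is None:
        return "skipped: torch is not installed"

    import torch

    if not torch.cuda.is_available():
        return "skipped: no CUDA device available"

    try:
        conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
        x = torch.randn(2, 3, 32, 32, device="cuda")
        # Toggle cuDNN only inside this block.
        with torch.backends.cudnn.flags(enabled=use_cudnn):
            conv(x)
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return "ok"
    except RuntimeError as exc:
        return f"failed: {exc}"


if __name__ == "__main__":
    print("with cuDNN:   ", conv_smoke_test(use_cudnn=True))
    print("without cuDNN:", conv_smoke_test(use_cudnn=False))
```

If the first call fails with the cuDNN error and the second returns ok, the workaround in step 5 should apply. Note that with cuDNN disabled, PyTorch falls back to its own CUDA kernels, which are usually slower for convolutions.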