While tinkering with the environment, I followed advice found online and ran sudo apt install nvidia-cuda-toolkit. After that, the nvidia-smi command on machine 28 stopped working altogether:
# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
CUDA is no longer detected correctly either:
# python
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/root/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
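So, following the blog post listed at the end of this post, I checked which NVIDIA packages are installed and which kernel module version is loaded:
# sudo dpkg --list | grep nvidia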
iU libnvidia-cfg1-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
iU libnvidia-common-525 525.147.05-0ubuntu0~gpu18.04.1 all Shared files used by the NVIDIA libraries
iU libnvidia-compute-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA libcompute package
iU libnvidia-decode-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA Video Decoding runtime libraries
iU libnvidia-encode-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVENC Video Encoding runtime library
iU libnvidia-extra-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 Extra libraries for the NVIDIA driver
iU libnvidia-fbc1-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
iU libnvidia-gl-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
iU nvidia-dkms-525 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA DKMS package
iU nvidia-driver-510 525.147.05-0ubuntu0~gpu18.04.1 amd64 Transitional package for nvidia-driver-525
iU nvidia-driver-525 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA driver metapackage
iU nvidia-kernel-common-525 525.147.05-0ubuntu0~gpu18.04.1 amd64 Shared files used with the kernel module
iU nvidia-kernel-source-525 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA kernel source package
iU nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
iU nvidia-settings 470.57.01-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
iU xserver-xorg-video-nvidia-525 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA binary Xorg driver
# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.47.03 Mon Jan 24 22:58:54 UTC 2022
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
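So the problem is a version mismatch between CUDA and the GPU driver: the installed NVIDIA packages are at 525.147.05, while the kernel module that is currently loaded is 510.47.03. The versions need to be unified to 510.47.03.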
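The comparison above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the filter on a three-digit branch suffix (such as -525) is an assumption taken from the package names in the listing. In the dpkg listing, the iU prefix means the package is selected for installation but currently only unpacked, not configured.

```python
import re
from typing import Dict, List, Optional


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Extract the loaded module version from /proc/driver/nvidia/version text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def driver_package_versions(dpkg_text: str) -> Dict[str, str]:
    """Map branch-suffixed packages (e.g. libnvidia-compute-525) to their upstream version."""
    versions: Dict[str, str] = {}
    for line in dpkg_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        name = fields[1].split(":")[0]            # drop the ":amd64" suffix
        if not re.search(r"-\d{3}$", name):       # assumption: skip nvidia-prime, nvidia-settings
            continue
        versions[name] = fields[2].split("-")[0]  # "525.147.05-0ubuntu0~..." -> "525.147.05"
    return versions


def mismatched(kernel_version: str, packages: Dict[str, str]) -> List[str]:
    """Return the packages whose upstream version differs from the loaded module."""
    return sorted(name for name, version in packages.items() if version != kernel_version)


if __name__ == "__main__":
    # Sample lines copied from the output shown in this post.
    proc_sample = "NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.47.03 Mon Jan 24 22:58:54 UTC 2022"
    dpkg_sample = (
        "iU libnvidia-compute-525:amd64 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA libcompute package\n"
        "iU nvidia-dkms-525 525.147.05-0ubuntu0~gpu18.04.1 amd64 NVIDIA DKMS package\n"
        "iU nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime\n"
    )
    kernel = kernel_module_version(proc_sample)
    print("kernel module:", kernel)
    print("mismatched packages:", mismatched(kernel, driver_package_versions(dpkg_sample)))
```

On a real machine the two sample strings would be replaced by the contents of /proc/driver/nvidia/version and the output of dpkg --list.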
I followed these steps:
- Uninstall the driver:
sudo apt-get purge 'nvidia*'
- Add the graphics-drivers PPA (Personal Package Archive, Ubuntu only):
sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
- Reinstall the driver:
sudo apt-get install nvidia-driver-510 nvidia-settings nvidia-prime
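One detail in the dpkg listing earlier is worth checking before reinstalling: nvidia-driver-510 appears there at version 525.147.05 as a "Transitional package for nvidia-driver-525", so installing it from this PPA may simply pull in 525 again. That is my reading of the listing, not a confirmed diagnosis. A minimal sketch for checking which version apt would actually install, assuming apt-cache is available (the helper names are hypothetical):

```python
import re
import shutil
import subprocess
from typing import Optional


def candidate_version(policy_text: str) -> Optional[str]:
    """Return the 'Candidate:' version from `apt-cache policy <pkg>` output, if any."""
    match = re.search(r"^\s*Candidate:\s*(\S+)", policy_text, flags=re.MULTILINE)
    if not match or match.group(1) == "(none)":
        return None
    return match.group(1)


def matches_branch(version: Optional[str], branch: str) -> bool:
    """True when the version starts with the wanted driver branch, e.g. '510'."""
    return version is not None and version.split(".")[0] == branch


if __name__ == "__main__":
    package, branch = "nvidia-driver-510", "510"
    if shutil.which("apt-cache") is None:
        print("apt-cache not found; this check only applies to Debian/Ubuntu")
    else:
        result = subprocess.run(
            ["apt-cache", "policy", package],
            capture_output=True, text=True, check=False,
        )
        version = candidate_version(result.stdout)
        print(package, "candidate:", version)
        if not matches_branch(version, branch):
            print("warning: the candidate is not a", branch, "build; installing it may not give", branch)
```

If the candidate turns out to be a 525 build, the install step above cannot bring the machine back to 510.47.03 by itself.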
However, it keeps failing with the following error:
Errors were encountered while processing:
/tmp/apt-dpkg-install-T8KJGT/08-nvidia-compute-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb
/tmp/apt-dpkg-install-T8KJGT/12-nvidia-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
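The summary above only names the archives that failed; the actual reason is normally printed by dpkg further up in the same output. As a small aid, the sketch below pulls the package name, version and architecture out of those paths so each one can be inspected separately. The name pattern (a sequence number followed by name_version_arch.deb) is an assumption based on the two lines shown here, and FailedDeb and failed_debs are hypothetical names.

```python
import re
from typing import List, NamedTuple


class FailedDeb(NamedTuple):
    package: str
    version: str
    arch: str


# Assumed layout: ".../<NN>-<name>_<version>_<arch>.deb", as in the error above.
_DEB_RE = re.compile(r"/(?:\d+-)?([^/_]+)_([^/_]+)_([^/_]+)\.deb\s*$")


def failed_debs(apt_output: str) -> List[FailedDeb]:
    """Collect the archives listed after 'Errors were encountered while processing:'."""
    found = []
    for line in apt_output.splitlines():
        match = _DEB_RE.search(line.strip())
        if match:
            found.append(FailedDeb(*match.groups()))
    return found


if __name__ == "__main__":
    # Error text copied from this post.
    error_text = (
        "Errors were encountered while processing:\n"
        "/tmp/apt-dpkg-install-T8KJGT/08-nvidia-compute-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb\n"
        "/tmp/apt-dpkg-install-T8KJGT/12-nvidia-utils-525_525.147.05-0ubuntu0~gpu18.04.1_amd64.deb\n"
        "E: Sub-process /usr/bin/dpkg returned an error code (1)\n"
    )
    for deb in failed_debs(error_text):
        # dpkg -s prints the recorded status of a single package.
        print("inspect with: dpkg -s", deb.package, "(version", deb.version + ")")
```

Both failing archives belong to the 525 branch even though nvidia-driver-510 was requested, which fits the version mismatch found earlier.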
I will update this post once the problem is solved.
PS: if you know how to fix this, feel free to add it in the comments!
References
【nvidia-smi报错】Failed to initialize NVML: Driver/library version mismatch (CSDN blog)