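Step 1: Completely remove the old CUDA toolkit and driver (essential, to avoid conflicts)
bash
# 1. Uninstall the old driver and CUDA (installed via apt)
sudo apt-get --purge remove "*nvidia*" "*cuda*" "*cudnn*" -y
# 2. Manually delete leftover driver files (specific to the 470 driver)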
sudo rm -rf /usr/lib/x86_64-linux-gnu/libnvidia*
sudo rm -rf /usr/lib/x86_64-linux-gnu/libcuda*
sudo rm -rf /usr/local/nvidia*
3. Delete the CUDA 11.4 directory
bash
sudo rm -rf /usr/local/cuda-11.4
sudo rm -rf /usr/local/cuda # remove the symlink
4. Clean up dependencies and the package cache
bash
sudo apt autoremove -y
sudo apt clean
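Step 2: Disable the nouveau driver (required when upgrading from the 470 driver)
1. Write the blacklist file
bash
sudo tee /etc/modprobe.d/blacklist-nouveau.conf << EOF
blacklist nouveau
options nouveau modeset=0
EOF
2. Apply the configuration (key step: rebuild the initramfs)
bash
sudo update-initramfs -u
3. Verify that nouveau is disabled (no output means success; if nouveau was loaded, the change only takes effect after a reboot)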
bash
lsmod | grep nouveau
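A plain grep also matches substrings, so a stricter check can compare the first lsmod column exactly. The following is a minimal sketch; module_loaded is a helper name made up for these notes, not a standard command.

```bash
# module_loaded NAME
# Reads `lsmod`-style text on stdin and succeeds only if a module with
# exactly that name appears in the first column.
# (module_loaded is a hypothetical helper, not a standard command.)
module_loaded() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}
```

Possible usage: lsmod | module_loaded nouveau && echo "nouveau is still loaded"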
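The uninstall above did not remove everything. The leftovers must be cleaned up completely, or the new version will not install.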
(base) root@ubuntu:/data/ghf# sudo ./cuda_12.2.0_535.54.03_linux.run --silent --driver --toolkit
Existing package manager installation of the driver found.
It is strongly recommended that you remove this before continuing.
Override this check by passing --override-driver-check
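This error occurs because NVIDIA driver packages installed through apt are still present on the system (some package-manager leftovers survived the cleanup above). The .run installer detects them and refuses to upgrade the driver.
Continue cleaning up the leftovers:
Step 1: Find and remove every NVIDIA package installed through apt
bash
# 1. List all NVIDIA-related apt packages (to find the leftovers)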
dpkg -l | grep nvidia
(base) root@ubuntu:/data/ghf# dpkg -l | grep nvidia
ii libnvidia-cfg1-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-470 470.129.06-0ubuntu0.20.04.1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-440:amd64 450.119.03-0ubuntu0.20.04.1 amd64 Transitional package for libnvidia-compute-450
ii libnvidia-compute-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA libcompute package
ii libnvidia-decode-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii nvidia-compute-utils-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 9.1.85-3ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
ii nvidia-driver-440 450.119.03-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-450
ii nvidia-driver-450 460.91.03-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-460
ii nvidia-driver-460 470.129.06-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-470
ii nvidia-driver-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-470 470.129.06-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
ii nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 amd64 NVIDIA OpenCL development files
ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 470.57.01-0ubuntu0.20.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA driver support binaries
ii nvidia-visual-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii screen-resolution-extra 0.18build1 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA binary Xorg driver
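The first column of the listing is the package state: ii means installed, while rc means the package was removed but its configuration files remain. As a rough sketch, the listing can be split by state so the removal list is generated from the output rather than typed by hand. pkgs_in_state is a hypothetical helper name used only in these notes.

```bash
# pkgs_in_state STATE
# Reads `dpkg -l` output on stdin and prints the names of packages whose
# status column equals STATE (for example "ii" or "rc") and whose line
# mentions nvidia or cuda (case-insensitive).
# (pkgs_in_state is a hypothetical helper, not part of dpkg.)
pkgs_in_state() {
  awk -v state="$1" '$1 == state && tolower($0) ~ /nvidia|cuda/ { print $2 }'
}
```

Possible usage: dpkg -l | pkgs_in_state ii shows what still needs apt-get --purge remove, and dpkg -l | pkgs_in_state rc shows packages that only need dpkg --purge. Review the list before passing it to any removal command.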
Purge all NVIDIA/CUDA apt packages (by exact name)
Step 1: Remove the NVIDIA 470 driver packages (essential)
bash
# Remove all 470-series NVIDIA packages in one command
sudo apt-get --purge remove \
libnvidia-cfg1-470:amd64 \
libnvidia-common-470 \
libnvidia-compute-470:amd64 \
libnvidia-decode-470:amd64 \
libnvidia-encode-470:amd64 \
libnvidia-extra-470:amd64 \
libnvidia-fbc1-470:amd64 \
libnvidia-gl-470:amd64 \
libnvidia-ifr1-470:amd64 \
nvidia-compute-utils-470 \
nvidia-dkms-470 \
nvidia-driver-470 \
nvidia-kernel-common-470 \
nvidia-kernel-source-470 \
nvidia-utils-470 \
xserver-xorg-video-nvidia-470 -y
Step 2: Remove the transitional packages and the old CUDA 9.1 packages
bash
# Remove the 440/450/460 transitional packages and the CUDA 9.1 packages
sudo apt-get --purge remove \
libnvidia-compute-440:amd64 \
nvidia-driver-440 \
nvidia-driver-450 \
nvidia-driver-460 \
nvidia-cuda-dev \
nvidia-cuda-doc \
nvidia-cuda-gdb \
nvidia-cuda-toolkit \
nvidia-opencl-dev:amd64 \
nvidia-profiler \
nvidia-visual-profiler -y
Step 3: Remove the remaining NVIDIA helper packages
bash
sudo apt-get --purge remove nvidia-prime nvidia-settings screen-resolution-extra -y
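Step 4: Clean up leftover dependencies and configuration files
bash
# Remove automatically installed dependencies that are no longer needed
sudo apt-get autoremove -y
# Clean the package cache
sudo apt-get autoclean -y
# Purge any remaining dpkg configuration files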
sudo dpkg --purge $(dpkg -l | grep 'nvidia' | awk '{print $2}') 2>/dev/null
(base) root@ubuntu:/data/ghf# lsmod | grep nvidia
nvidia_uvm 1036288 8
nvidia_drm 57344 4
nvidia_modeset 1200128 2 nvidia_drm
nvidia 35340288 292 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 cirrus,nvidia_drm
drm 491520 10 drm_kms_helper,nvidia,cirrus,nvidia_drm
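In lsmod output the third column is the use count and the fourth lists the modules that depend on this one. A use count larger than the number of listed dependents suggests something outside the module system, typically a running process, still holds the module. The sketch below is only a rough heuristic under the assumption that each listed dependent accounts for exactly one reference; externally_held is a hypothetical helper name.

```bash
# externally_held
# Reads `lsmod` output on stdin. For each nvidia* module, prints the module
# name and the part of its use count that is NOT explained by the modules
# listed in the "Used by" column.
# Assumption: each listed dependent module accounts for exactly one reference.
# (externally_held is a hypothetical helper, not a standard command.)
externally_held() {
  awk '$1 ~ /^nvidia/ {
    n = ($4 == "") ? 0 : split($4, holders, ",")
    if ($3 > n) print $1, $3 - n
  }'
}
```

Possible usage: lsmod | externally_held. Any module it prints is unlikely to unload with rmmod until whatever holds it has exited.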
The NVIDIA kernel modules are still loaded!
bash
# Kill all CUDA/PyTorch-related processes
sudo pkill -9 nvidia-smi
sudo pkill -9 nvcc
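# Be careful when killing python processes
#
# First list all Python processes (confirm which PIDs to terminate)
ps aux | grep python
## If nothing important is listed, kill them all
sudo pkill -9 python
The output of ps aux | grep python was: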
systemd+ 3517887 0.0 0.0 26468 20692 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517889 0.0 0.0 26468 20676 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517896 0.0 0.0 26468 20580 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517934 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517945 0.0 0.0 26468 20644 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517948 0.0 0.0 26468 20696 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517954 0.0 0.0 26468 20652 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517960 0.0 0.0 26468 20672 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518010 0.0 0.0 26468 20744 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518011 0.0 0.0 26468 20664 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518019 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518020 0.0 0.0 26468 20600 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518021 0.0 0.0 26468 20704 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518022 0.0 0.0 26468 20676 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518024 0.0 0.0 26468 20664 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518025 0.0 0.0 26468 20660 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518033 0.0 0.0 26468 20620 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518034 0.0 0.0 26468 20576 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518035 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518038 0.0 0.0 26468 20684 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
These python processes turned out to be workers started by the fastgpt docker compose stack running on this server.
bash
# From the fastgpt directory, run:
docker compose down
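Since killing every python process is risky, it may help to first print only the PID and command line of the processes that match, and then decide which ones really need to go. This is a minimal sketch that assumes the standard ps aux column layout (the command starts at the 11th field); list_matching is a hypothetical helper name.

```bash
# list_matching PATTERN
# Reads `ps aux` output on stdin and prints "PID COMMAND" for every process
# whose command line contains PATTERN as a plain substring.
# The pattern is passed through the environment so that this helper's own
# awk process does not show the pattern on its command line.
# (list_matching is a hypothetical helper, not a standard command.)
list_matching() {
  PAT="$1" awk '$2 ~ /^[0-9]+$/ {
    cmd = $11
    for (i = 12; i <= NF; i++) cmd = cmd " " $i
    if (index(cmd, ENVIRON["PAT"]) > 0) print $2, cmd
  }'
}
```

Possible usage: ps aux | list_matching worker.py. The printed PIDs can then be terminated individually with kill instead of a blanket pkill.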
1. Kill all python/python3 processes (covers every version)
bash
# 1. Kill all python/python3 processes (covers every version)
sudo pkill -9 python
sudo pkill -9 python3
sudo pkill -9 ipykernel_launcher
sudo pkill -9 jupyter-lab
# 2. Verify that everything was killed (no output means success)
ps aux | grep -E "python|ipykernel|jupyter" | grep -v grep
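Unload the NVIDIA kernel modules (key step)
The remaining system Python processes do not use the GPU, so the modules can be unloaded directly:
bash
# Unload the driver kernel modules in order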
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
# Verify that the modules were unloaded (no output means success)
lsmod | grep nvidia
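The order matters because nvidia cannot be removed while nvidia_uvm or nvidia_modeset still depend on it. One way to sketch this is a loop that stops at the first module that refuses to unload, so later failures do not hide the real cause. RMMOD is a hypothetical override variable introduced here for dry runs; it is not something rmmod itself reads.

```bash
# unload_nvidia_modules
# Unloads the NVIDIA modules dependents-first and stops at the first failure.
# RMMOD is a hypothetical override for dry runs, e.g. RMMOD=echo;
# when unset, the real command "sudo rmmod" is used.
unload_nvidia_modules() {
  for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
    if ! ${RMMOD:-sudo rmmod} "$mod"; then
      echo "could not unload $mod; stopping" >&2
      return 1
    fi
  done
}
```

Possible usage: run RMMOD=echo first to see the order without touching the kernel, then unset RMMOD and call unload_nvidia_modules for the real run.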
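(base) root@ubuntu:/data/fastgpt# sudo rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use
The server must be rebooted at this point!
reboot
After the reboot, run the commands below again. rmmod should report that the modules are not loaded, and CUDA 12 can then be installed normally.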
bash
# Unload the driver kernel modules in order
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
# Verify that the modules were unloaded (no output means success)
lsmod | grep nvidia
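Install CUDA 12.2
bash
# Change to the directory containing the installer
cd /data/ghf
# Install CUDA 12.2 together with the bundled 535.54.03 driver, non-interactively
sudo ./cuda_12.2.0_535.54.03_linux.run --silent --driver --toolkit
# A reboot is required after installation (essential)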
sudo reboot
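After the reboot, nvidia-smi should report driver 535.54.03 and nvcc --version should report release 12.2. If nvcc is not found, the toolkit directories are probably not on the search paths yet. The sketch below assumes the installer's default prefix /usr/local/cuda-12.2, which these notes do not confirm; add_cuda_paths is a hypothetical helper name.

```bash
# Assumed default install prefix of the CUDA 12.2 runfile toolkit.
CUDA_HOME=/usr/local/cuda-12.2

# add_cuda_paths
# Prepends the toolkit's bin and lib64 directories to PATH and
# LD_LIBRARY_PATH, but only if they are not already present, so calling it
# more than once does not create duplicate entries.
# (add_cuda_paths is a hypothetical helper, not part of CUDA.)
add_cuda_paths() {
  case ":$PATH:" in
    *":$CUDA_HOME/bin:"*) ;;
    *) PATH="$CUDA_HOME/bin:$PATH" ;;
  esac
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$CUDA_HOME/lib64:"*) ;;
    *) LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
  export PATH LD_LIBRARY_PATH
}
```

Possible usage: place the block in ~/.bashrc followed by a call to add_cuda_paths, then open a new shell and run nvidia-smi and nvcc --version.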
Success!