Upgrading CUDA from 11.4 to 12.2: install README

Step 1: Completely remove the old CUDA and driver (essential, to avoid conflicts)

bash
# 1. Uninstall the old driver + CUDA (apt-installed packages)
sudo apt-get --purge remove "*nvidia*" "*cuda*" "*cudnn*" -y

# 2. Manually delete leftover driver files (specific to the 470 driver)
sudo rm -rf /usr/lib/x86_64-linux-gnu/libnvidia*
sudo rm -rf /usr/lib/x86_64-linux-gnu/libcuda*
sudo rm -rf /usr/local/nvidia*

3. Delete the CUDA 11.4 directory

bash
sudo rm -rf /usr/local/cuda-11.4
sudo rm -rf /usr/local/cuda  # remove the /usr/local/cuda symlink

4. Clean up dependencies and the package cache

bash
sudo apt autoremove -y

sudo apt clean
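Before moving on, you can check that the `rm -rf` globs really left nothing behind. The sketch below uses a hypothetical helper, `find_driver_leftovers`, which is not part of the original procedure. It assumes GNU `find` and the default Ubuntu library directory `/usr/lib/x86_64-linux-gnu`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): list leftover
# libnvidia*/libcuda* files in a library directory after the cleanup.
# Prints each match; returns 0 if the directory is clean, 1 otherwise.
find_driver_leftovers() {
  local dir found
  dir=${1:-/usr/lib/x86_64-linux-gnu}   # assumption: default Ubuntu multiarch lib dir
  # -maxdepth 1: only the top level, like the rm globs above
  found=$(find "$dir" -maxdepth 1 \( -name 'libnvidia*' -o -name 'libcuda*' \) 2>/dev/null | sort)
  [ -z "$found" ] && return 0
  printf '%s\n' "$found"
  return 1
}
```

Run `find_driver_leftovers` after the cleanup. Any path it prints is a file the globs above missed.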

Step 2: Disable the nouveau driver (mandatory when upgrading from the 470 driver)

1. Write the blacklist file

bash
sudo tee /etc/modprobe.d/blacklist-nouveau.conf << EOF
blacklist nouveau
options nouveau modeset=0
EOF

2. Apply the configuration (key step: rebuild the initramfs)

bash
sudo update-initramfs -u

3. Verify that nouveau is not loaded (no output means success; the blacklist takes effect from the next boot)

bash
lsmod | grep nouveau
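The blacklist file must contain two separate directive lines. A heredoc collapsed onto a single line would write one unusable line instead. Below is a minimal check, assuming the file path used above. The function name `check_nouveau_blacklist` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check (not from the original post): confirm the nouveau
# blacklist file holds both directives on their own lines.
check_nouveau_blacklist() {
  local conf=${1:-/etc/modprobe.d/blacklist-nouveau.conf}
  if [ ! -r "$conf" ]; then
    echo "missing or unreadable: $conf" >&2
    return 1
  fi
  # -x: the whole line must match; leading/trailing blanks are tolerated
  if ! grep -qxE '[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*' "$conf"; then
    echo "no 'blacklist nouveau' line in $conf" >&2
    return 1
  fi
  if ! grep -qxE '[[:space:]]*options[[:space:]]+nouveau[[:space:]]+modeset=0[[:space:]]*' "$conf"; then
    echo "no 'options nouveau modeset=0' line in $conf" >&2
    return 1
  fi
  echo "ok: $conf"
}
```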

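In my case the old version was not fully uninstalled. Everything has to be removed cleanly, otherwise the new version will not install: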

(base) root@ubuntu:/data/ghf# sudo ./cuda_12.2.0_535.54.03_linux.run --silent --driver --toolkit

Existing package manager installation of the driver found.

It is strongly recommended that you remove this before continuing.

Override this check by passing --override-driver-check

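The error occurs because NVIDIA driver packages installed via apt are still on the system. Even after the cleanup above, some package-manager leftovers remained hidden, and the .run installer stops the driver upgrade when it detects them.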

Continue cleaning up the leftovers:

Step 1: Find and remove every NVIDIA package installed via apt

bash
# 1. List all NVIDIA-related apt packages (to find the leftovers)
dpkg -l | grep nvidia

(base) root@ubuntu:/data/ghf# dpkg -l | grep nvidia
ii  libnvidia-cfg1-470:amd64       470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-470           470.129.06-0ubuntu0.20.04.1  all    Shared files used by the NVIDIA libraries
rc  libnvidia-compute-440:amd64    450.119.03-0ubuntu0.20.04.1  amd64  Transitional package for libnvidia-compute-450
ii  libnvidia-compute-470:amd64    470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA libcompute package
ii  libnvidia-decode-470:amd64     470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-470:amd64     470.129.06-0ubuntu0.20.04.1  amd64  NVENC Video Encoding runtime library
ii  libnvidia-extra-470:amd64      470.129.06-0ubuntu0.20.04.1  amd64  Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-470:amd64       470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470:amd64         470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470:amd64       470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  nvidia-compute-utils-470       470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA compute utilities
ii  nvidia-cuda-dev                9.1.85-3ubuntu1              amd64  NVIDIA CUDA development files
ii  nvidia-cuda-doc                9.1.85-3ubuntu1              all    NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                9.1.85-3ubuntu1              amd64  NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit            9.1.85-3ubuntu1              amd64  NVIDIA CUDA development toolkit
ii  nvidia-dkms-470                470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA DKMS package
ii  nvidia-driver-440              450.119.03-0ubuntu0.20.04.1  amd64  Transitional package for nvidia-driver-450
ii  nvidia-driver-450              460.91.03-0ubuntu0.20.04.1   amd64  Transitional package for nvidia-driver-460
ii  nvidia-driver-460              470.129.06-0ubuntu0.20.04.1  amd64  Transitional package for nvidia-driver-470
ii  nvidia-driver-470              470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA driver metapackage
ii  nvidia-kernel-common-470       470.129.06-0ubuntu0.20.04.1  amd64  Shared files used with the kernel module
ii  nvidia-kernel-source-470       470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA kernel source package
ii  nvidia-opencl-dev:amd64        9.1.85-3ubuntu1              amd64  NVIDIA OpenCL development files
ii  nvidia-prime                   0.8.16~0.18.04.1             all    Tools to enable NVIDIA's Prime
ii  nvidia-profiler                9.1.85-3ubuntu1              amd64  NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                470.57.01-0ubuntu0.20.04.1   amd64  Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-470               470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA driver support binaries
ii  nvidia-visual-profiler         9.1.85-3ubuntu1              amd64  NVIDIA Visual Profiler for CUDA and OpenCL
ii  screen-resolution-extra        0.18build1                   all    Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-470  470.129.06-0ubuntu0.20.04.1  amd64  NVIDIA binary Xorg driver
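The two-letter code in the first column is the dpkg state. `ii` means installed, and `rc` means removed but with configuration files still present, as with libnvidia-compute-440 above. Instead of copying names by hand, you can extract the package list from the same output. The sketch below uses a hypothetical function, `nvidia_pkg_names`. It matches on the package name only, so `screen-resolution-extra` (a helper for nvidia-settings) is not selected and still has to be listed explicitly.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): read `dpkg -l` output on
# stdin and print the names of NVIDIA packages that are installed (ii) or
# removed with config files left (rc).
nvidia_pkg_names() {
  # $1 = dpkg state, $2 = package name (may carry an arch suffix like :amd64)
  awk '($1 == "ii" || $1 == "rc") && $2 ~ /nvidia/ { print $2 }'
}
```

`dpkg -l | nvidia_pkg_names | xargs -r sudo apt-get --purge -s remove` previews what apt would do (`-s` only simulates). Drop `-s` to actually purge.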

Completely uninstall all NVIDIA/CUDA apt packages (by exact package name)

Step 1: Uninstall the NVIDIA 470 driver packages (core step)

bash
# Purge all NVIDIA 470 packages in one command
sudo apt-get --purge remove \
libnvidia-cfg1-470:amd64 \
libnvidia-common-470 \
libnvidia-compute-470:amd64 \
libnvidia-decode-470:amd64 \
libnvidia-encode-470:amd64 \
libnvidia-extra-470:amd64 \
libnvidia-fbc1-470:amd64 \
libnvidia-gl-470:amd64 \
libnvidia-ifr1-470:amd64 \
nvidia-compute-utils-470 \
nvidia-dkms-470 \
nvidia-driver-470 \
nvidia-kernel-common-470 \
nvidia-kernel-source-470 \
nvidia-utils-470 \
xserver-xorg-video-nvidia-470 -y
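Every package in this step ends in the `-470` suffix, which identifies the driver branch. The sketch below selects installed packages by that suffix from `dpkg -l` output. The function name `driver_branch_pkgs` is hypothetical, and the architecture qualifier such as `:amd64` is kept in the output.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): print installed packages
# whose name (without the :arch qualifier) ends in "-<branch>".
driver_branch_pkgs() {
  # $1: driver branch number, e.g. 470
  awk -v b="$1" '
    $1 == "ii" {
      name = $2
      sub(/:.*/, "", name)                 # drop ":amd64" before matching
      if (name ~ ("-" b "$")) print $2     # keep the full name for apt
    }'
}
```

Applied to the `dpkg -l` output above, `driver_branch_pkgs 470` selects the same 16 packages as this step. The transitional `nvidia-driver-460` is not selected: its name ends in `-460`, even though its version is 470.129.06.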

Step 2: Uninstall the transitional packages and the old CUDA 9.1 packages

bash
# Purge the 440/450/460 transitional packages and the CUDA 9.1 packages
sudo apt-get --purge remove \
libnvidia-compute-440:amd64 \
nvidia-driver-440 \
nvidia-driver-450 \
nvidia-driver-460 \
nvidia-cuda-dev \
nvidia-cuda-doc \
nvidia-cuda-gdb \
nvidia-cuda-toolkit \
nvidia-opencl-dev:amd64 \
nvidia-profiler \
nvidia-visual-profiler -y
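The 440/450/460 packages are transitional. Each one's description names the package that replaced it, which is why its version number belongs to the next branch (nvidia-driver-440, for example, has version 450.119.03). The sketch below prints this chain from `dpkg -l` output. The function `transitional_map` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): print "pkg -> target"
# for every package whose description reads "Transitional package for X".
transitional_map() {
  awk '($1 == "ii" || $1 == "rc") && match($0, /Transitional package for [^ ]+/) {
         # 25 = length("Transitional package for ")
         print $2 " -> " substr($0, RSTART + 25, RLENGTH - 25)
       }'
}
```

On the listing above, the chain runs 440 -> 450 -> 460 -> 470 for nvidia-driver (and 440 -> 450 for libnvidia-compute). It ends at the 470 metapackage, so these transitional packages are leftovers of earlier upgrades and can be purged together with it.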

Step 3: Uninstall the remaining NVIDIA helper packages

bash
sudo apt-get --purge remove nvidia-prime nvidia-settings screen-resolution-extra -y

Step 4: Clean up leftover dependencies and configuration files

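bash
# Remove automatically installed dependencies that are no longer needed
sudo apt-get autoremove -y

# Clear obsolete .deb files from the package cache
sudo apt-get autoclean -y

# Force-purge every dpkg entry whose line matches "nvidia" (including "rc" config-only entries)
sudo dpkg --purge $(dpkg -l | grep 'nvidia' | awk '{print $2}') 2>/dev/null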
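Before re-running the .run installer, it helps to confirm that dpkg no longer knows about any NVIDIA package in any state. The sketch below is such a pre-flight check. The function `assert_no_nvidia_pkgs` is a hypothetical name, and it is not the installer's own detection logic, which this post does not describe.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not from the original post): read `dpkg -l`
# output on stdin; print and fail on any NVIDIA package still known to dpkg.
assert_no_nvidia_pkgs() {
  awk '
    # Data lines start with a lowercase state code such as "ii" or "rc";
    # header lines ("Desired=...", "||/ Name ...", "+++-...") do not.
    # A second letter "n" means not installed, so such entries are ignored.
    $1 ~ /^[a-z][a-zA-Z]+$/ && substr($1, 2, 1) != "n" && $2 ~ /nvidia/ {
      print "left: " $1 " " $2
      bad = 1
    }
    END { exit bad }
  '
}
```

Usage: `dpkg -l | assert_no_nvidia_pkgs && echo clean`.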

(base) root@ubuntu:/data/ghf# lsmod | grep nvidia

nvidia_uvm           1036288  8
nvidia_drm             57344  4
nvidia_modeset       1200128  2 nvidia_drm
nvidia              35340288  292 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  4 cirrus,nvidia_drm
drm                   491520  10 drm_kms_helper,nvidia,cirrus,nvidia_drm

The NVIDIA kernel modules are still loaded!!!!
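Purging the packages does not unload modules that are already in the running kernel, which is why lsmod still lists them. In lsmod output, the third column is the module's reference count and the fourth lists the modules that depend on it. References not accounted for by dependent modules come from elsewhere, typically open device files or the display/console stack, and rmmod refuses to unload such a module. The sketch below makes this visible. The function `explain_module_refs` is hypothetical, and the subtraction is only an approximation.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): for each nvidia* module in
# lsmod output on stdin, split its refcount into references from dependent
# modules and "other" references (approximate).
explain_module_refs() {
  awk '$1 ~ /^nvidia/ {
         n_mods = 0
         if (NF >= 4) n_mods = split($4, users, ",")   # names in "Used by"
         printf "%s refcount=%d module_users=%d other=%d\n", $1, $3, n_mods, $3 - n_mods
       }'
}
```

For the output above, nvidia_drm shows refcount=4 with no dependent modules. All 4 of its references therefore come from outside the module graph, which is consistent with the later `Module nvidia_drm is in use` error.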

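bash
# Kill all CUDA/PyTorch-related processes
sudo pkill -9 nvidia-smi
sudo pkill -9 nvcc

# Be careful when killing python processes here:
# first list all Python processes (confirm the PIDs you intend to terminate)
ps aux | grep python

# Only if nothing important is listed, kill them all
sudo pkill -9 python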

The output of ps aux | grep python was:

systemd+ 3517887 0.0 0.0 26468 20692 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517889 0.0 0.0 26468 20676 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517896 0.0 0.0 26468 20580 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517934 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517945 0.0 0.0 26468 20644 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517948 0.0 0.0 26468 20696 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517954 0.0 0.0 26468 20652 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517960 0.0 0.0 26468 20672 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518010 0.0 0.0 26468 20744 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518011 0.0 0.0 26468 20664 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518019 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518020 0.0 0.0 26468 20600 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518021 0.0 0.0 26468 20704 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518022 0.0 0.0 26468 20676 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518024 0.0 0.0 26468 20664 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518025 0.0 0.0 26468 20660 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518033 0.0 0.0 26468 20620 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518034 0.0 0.0 26468 20576 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518035 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518038 0.0 0.0 26468 20684 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
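With many near-identical processes, it is easier to group them by owner and command line before deciding what to kill. The sketch below uses a hypothetical function, `summarize_procs`. It assumes the standard `ps aux` column layout, where COMMAND starts at field 11.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): read `ps aux` output on
# stdin and print "count user | command", most frequent first.
summarize_procs() {
  awk '$1 != "USER" {                       # skip the ps header line
         cmd = $11
         for (i = 12; i <= NF; i++) cmd = cmd " " $i
         count[$1 " | " cmd]++
       }
       END { for (k in count) print count[k], k }' | sort -rn
}
```

`ps aux | grep python | grep -v grep | summarize_procs` collapses the listing above into one line with a count of 20. That points to a single service rather than scattered user jobs.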

These Python processes turned out to be the running processes of the FastGPT service on this server, started with docker compose:

bash
# In the FastGPT directory, run:
docker compose down

Kill all python/python3 processes (covering all versions):

bash
# 1. Kill all python/python3 processes (covering all versions)
sudo pkill -9 python
sudo pkill -9 python3
sudo pkill -9 ipykernel_launcher
sudo pkill -9 jupyter-lab

# 2. Verify they are all gone (no output means success)
ps aux | grep -E "python|ipykernel|jupyter" | grep -v grep

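`pkill -9 python` also kills processes that never touched the GPU. A narrower alternative is to find only the processes that hold an NVIDIA device file open, which is what `fuser -v /dev/nvidia*` or `lsof /dev/nvidia*` report. The sketch below does the same by scanning `/proc/<pid>/fd`. The function `gpu_holder_pids` is a hypothetical name. Run it as root to see other users' processes; the optional argument replaces `/proc` (for testing).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): print the PIDs that have
# any /dev/nvidia* device file open, by reading /proc/<pid>/fd symlinks.
gpu_holder_pids() {
  local root=${1:-/proc} fd target pid
  for fd in "$root"/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue   # vanished or not permitted
    case $target in
      /dev/nvidia*)
        pid=${fd#"$root"/}           # "<pid>/fd/<n>"
        printf '%s\n' "${pid%%/*}"
        ;;
    esac
  done | sort -un
}
```

As root, inspect the PIDs it prints and stop only those services, instead of killing every Python process.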

Unload the NVIDIA kernel modules (key step)

The remaining system Python processes do not use the GPU, so the modules can be unloaded directly:

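bash
# Unload the driver kernel modules in order (modules that depend on nvidia first)
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

# Verify the modules were unloaded (no output means success)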
lsmod | grep nvidia
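rmmod refuses to remove a module while another module still uses it, so the order above follows the "Used by" column: dependents first, nvidia last. The sketch below only illustrates how that order follows from lsmod output. The function `unload_order` is hypothetical. As a conservative choice, modules still used by something outside the nvidia* set are never printed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): read lsmod output on stdin
# and print nvidia* modules so that each appears after all modules using it.
unload_order() {
  awk '
    $1 ~ /^nvidia/ {
      mods[++n] = $1
      users_of[$1] = (NF >= 4) ? $4 : ""   # "Used by" names, comma-separated
    }
    END {
      left = n
      while (left > 0) {
        progress = 0
        for (i = 1; i <= n; i++) {
          m = mods[i]
          if (m in gone) continue
          ok = 1
          k = split(users_of[m], u, ",")
          for (j = 1; j <= k; j++)
            if (u[j] != "-" && !(u[j] in gone)) ok = 0   # a user is still loaded
          if (ok) { print m; gone[m] = 1; left--; progress = 1 }
        }
        if (!progress) break   # the rest are held by non-nvidia users
      }
    }'
}
```

Fed the lsmod output shown earlier, it prints nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia, which is the order used here.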

(base) root@ubuntu:/data/fastgpt# sudo rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use

!!! At this point the server has to be rebooted, because nvidia_drm is still in use and cannot be unloaded:

reboot

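After the reboot, run the commands below again. They report that the modules are not loaded, and from there CUDA 12 can be installed normally: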

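bash
# Unload the driver kernel modules in order (modules that depend on nvidia first)
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

# Verify the modules were unloaded (no output means success)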
lsmod | grep nvidia

Install CUDA 12.2

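bash
# Change to the directory containing the installer
cd /data/ghf

# Silently install the bundled 535.54.03 driver and the CUDA 12.2 toolkit
sudo ./cuda_12.2.0_535.54.03_linux.run --silent --driver --toolkit
# A reboot is required after installation (essential)
sudo reboot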
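After a runfile install, the toolkit lives under its versioned prefix, by default /usr/local/cuda-12.2. Its bin and lib64 directories have to be on PATH and LD_LIBRARY_PATH, as the CUDA installation guide's post-installation steps describe. The sketch below does this idempotently. The functions `path_prepend` and `setup_cuda_env` are hypothetical, and the default prefix is an assumption; adjust it if you installed elsewhere.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the original post).
# Print DIR ($2) prepended to a colon-separated list ($1), unless already present.
path_prepend() {
  case ":$1:" in
    *":$2:"*) printf '%s\n' "$1" ;;            # already present: unchanged
    *)        printf '%s\n' "$2${1:+:$1}" ;;   # prepend, no stray ":" if empty
  esac
}

# Assumption: the runfile's default toolkit prefix; pass another path if needed.
setup_cuda_env() {
  cuda_prefix=${1:-/usr/local/cuda-12.2}
  PATH=$(path_prepend "$PATH" "$cuda_prefix/bin")
  LD_LIBRARY_PATH=$(path_prepend "${LD_LIBRARY_PATH:-}" "$cuda_prefix/lib64")
  export PATH LD_LIBRARY_PATH
}
```

You can put these definitions and a `setup_cuda_env` call in ~/.bashrc. Calling it twice does not duplicate entries.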

Success!!!!
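To confirm the upgrade, compare the toolkit release reported by `nvcc --version` with the driver version in the `nvidia-smi` header. Expected here: 12.2 and 535.54.03, the versions in the installer's file name. The "CUDA Version" field in nvidia-smi is the highest CUDA version the driver supports, not the installed toolkit. The sketch below provides two hypothetical parsing helpers.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not from the original post).
# "Cuda compilation tools, release 12.2, V..." -> "12.2"
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# "... Driver Version: 535.54.03 ..." -> "535.54.03"
smi_driver_version() {
  sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p' | head -n 1
}
```

Usage: `nvcc --version | nvcc_release` and `nvidia-smi | smi_driver_version`. Alternatively, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints the driver version directly.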
