Upgrading CUDA from 11.4 to 12.2: install README

Step 1: Completely remove the old CUDA toolkit and driver (essential, to avoid conflicts)

1. Uninstall the old driver and CUDA (apt packages)

bash
sudo apt-get --purge remove "*nvidia*" "*cuda*" "*cudnn*" -y

2. Manually delete leftover driver files (specific to the 470 driver)

bash
sudo rm -rf /usr/lib/x86_64-linux-gnu/libnvidia*
sudo rm -rf /usr/lib/x86_64-linux-gnu/libcuda*
sudo rm -rf /usr/local/nvidia*

3. Delete the CUDA 11.4 directory

bash
sudo rm -rf /usr/local/cuda-11.4
sudo rm -rf /usr/local/cuda # remove the symlink

4. Clean up dependencies and the package cache

bash
sudo apt autoremove -y
sudo apt clean
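Item 3 deletes /usr/local/cuda without checking where it points. As a small sketch that is not part of the original notes, the helper below (the name `cuda_link_target` is made up for illustration) prints the target of a symlink so it can be recorded before the link is removed. It assumes GNU `readlink`, which is what Ubuntu ships.

```shell
# cuda_link_target PATH: print what PATH resolves to if it is a symlink.
# Prints nothing and returns 1 when PATH is missing or is not a symlink.
# Hypothetical helper, not from the original post.
cuda_link_target() {
    if [ -L "$1" ]; then
        readlink -f "$1"
    else
        return 1
    fi
}
```

Run before item 3, `cuda_link_target /usr/local/cuda` shows which toolkit directory the link selected, which is worth noting if anything on the machine still expects that path.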

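Step 2: Disable the nouveau driver (required when upgrading from the 470 driver)

1. Write the blacklist file

bash
sudo tee /etc/modprobe.d/blacklist-nouveau.conf << EOF
blacklist nouveau
options nouveau modeset=0
EOF

2. Apply the configuration (key point: rebuild the initramfs)

bash
sudo update-initramfs -u

3. Verify that nouveau is disabled (no output means success; if nouveau was already loaded, the blacklist only takes effect after a reboot)

bash
lsmod | grep nouveau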
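One caveat with `lsmod | grep nouveau`: grep matches the string anywhere on the line, including the "Used by" column of other modules. A minimal sketch of a stricter check, assuming the usual three-column `lsmod` layout (the function name `module_loaded` is invented here), compares only the first column:

```shell
# module_loaded NAME: read lsmod-style text on stdin and succeed only if
# NAME appears exactly in the first (module name) column.
# Hypothetical helper, not from the original post.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```

Usage would look like `lsmod | module_loaded nouveau && echo "nouveau is still loaded"`.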

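The uninstall above did not remove everything. The leftovers have to be cleaned out completely, otherwise the new version will not install.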

(base) root@ubuntu:/data/ghf# sudo ./cuda_12.2.0_535.54.03_linux.run --silent --driver --toolkit

Existing package manager installation of the driver found.

It is strongly recommended that you remove this before continuing.

Override this check by passing --override-driver-check

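This error appears because NVIDIA driver packages installed through apt are still present on the system (even after the cleanup above, some package-manager entries remain). The .run installer detects them and refuses to upgrade the driver.

Continue cleaning up the leftovers:

Step 1: Find and delete all NVIDIA packages installed through apt

bash
# 1. List all NVIDIA-related apt packages (to find the leftovers)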
dpkg -l | grep nvidia

(base) root@ubuntu:/data/ghf# dpkg -l | grep nvidia
ii libnvidia-cfg1-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-470 470.129.06-0ubuntu0.20.04.1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-440:amd64 450.119.03-0ubuntu0.20.04.1 amd64 Transitional package for libnvidia-compute-450
ii libnvidia-compute-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA libcompute package
ii libnvidia-decode-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-470:amd64 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii nvidia-compute-utils-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 9.1.85-3ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
ii nvidia-driver-440 450.119.03-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-450
ii nvidia-driver-450 460.91.03-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-460
ii nvidia-driver-460 470.129.06-0ubuntu0.20.04.1 amd64 Transitional package for nvidia-driver-470
ii nvidia-driver-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-470 470.129.06-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
ii nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 amd64 NVIDIA OpenCL development files
ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 470.57.01-0ubuntu0.20.04.1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA driver support binaries
ii nvidia-visual-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii screen-resolution-extra 0.18build1 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-470 470.129.06-0ubuntu0.20.04.1 amd64 NVIDIA binary Xorg driver
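The first column of this listing is the dpkg state: `ii` means the package is installed, while `rc` means it was removed but its configuration files are still registered. The sketch below is an illustration rather than part of the original procedure (the function name `pkgs_in_state` is invented). It splits the listing by state and matches on the package name only, so an entry such as screen-resolution-extra, which plain grep caught through its description, is left out and has to be handled by name.

```shell
# pkgs_in_state STATE: read `dpkg -l` output on stdin and print the names of
# packages in the given state (e.g. ii or rc) whose name contains
# "nvidia" or "cuda". Hypothetical helper, not from the original post.
pkgs_in_state() {
    awk -v s="$1" '$1 == s && $2 ~ /nvidia|cuda/ { print $2 }'
}
```

For example, `dpkg -l | pkgs_in_state ii` would list what still has to be uninstalled, and `dpkg -l | pkgs_in_state rc` what only needs its configuration purged.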

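Completely uninstall all NVIDIA/CUDA apt packages (exact package names)

Step 1: Uninstall the NVIDIA 470 driver packages (the core of the cleanup)

bash
# Uninstall every NVIDIA package of the 470 series in one go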
sudo apt-get --purge remove \
libnvidia-cfg1-470:amd64 \
libnvidia-common-470 \
libnvidia-compute-470:amd64 \
libnvidia-decode-470:amd64 \
libnvidia-encode-470:amd64 \
libnvidia-extra-470:amd64 \
libnvidia-fbc1-470:amd64 \
libnvidia-gl-470:amd64 \
libnvidia-ifr1-470:amd64 \
nvidia-compute-utils-470 \
nvidia-dkms-470 \
nvidia-driver-470 \
nvidia-kernel-common-470 \
nvidia-kernel-source-470 \
nvidia-utils-470 \
xserver-xorg-video-nvidia-470 -y

Step 2: Uninstall the transitional packages and the old CUDA 9.1 packages

bash
# Uninstall the 440/450/460 transitional packages and the CUDA 9.1 packages
sudo apt-get --purge remove \
libnvidia-compute-440:amd64 \
nvidia-driver-440 \
nvidia-driver-450 \
nvidia-driver-460 \
nvidia-cuda-dev \
nvidia-cuda-doc \
nvidia-cuda-gdb \
nvidia-cuda-toolkit \
nvidia-opencl-dev:amd64 \
nvidia-profiler \
nvidia-visual-profiler -y

Step 3: Uninstall the remaining NVIDIA helper packages

bash
sudo apt-get --purge remove nvidia-prime nvidia-settings screen-resolution-extra -y

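Step 4: Clean up leftover dependencies and configuration files

bash
# Remove automatically installed dependencies that are no longer needed
sudo apt-get autoremove -y

# Clean the package cache
sudo apt-get autoclean -y

# Force-purge any leftover dpkg configuration entries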
sudo dpkg --purge $(dpkg -l | grep 'nvidia' | awk '{print $2}') 2>/dev/null
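The purge command above acts on every package whose `dpkg -l` line mentions nvidia, and when nothing matches it calls `dpkg --purge` with no arguments, an error that `2>/dev/null` hides. A more cautious variant, sketched here under the assumption that only `rc` entries should remain at this point, prints the command for review instead of running it (`leftover_purge_cmd` is a made-up name):

```shell
# leftover_purge_cmd: read `dpkg -l` output on stdin and print a purge command
# for NVIDIA packages in state rc (removed, config files left behind).
# Prints nothing when there is nothing to purge.
# Hypothetical helper, not from the original post.
leftover_purge_cmd() {
    pkgs=$(awk '$1 == "rc" && $2 ~ /nvidia/ { printf "%s ", $2 }')
    if [ -n "$pkgs" ]; then
        printf 'sudo dpkg --purge %s\n' "${pkgs% }"
    fi
}
```

`dpkg -l | leftover_purge_cmd` shows what would be purged; the printed line can then be run by hand once it looks right.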

(base) root@ubuntu:/data/ghf# lsmod | grep nvidia
nvidia_uvm 1036288 8
nvidia_drm 57344 4
nvidia_modeset 1200128 2 nvidia_drm
nvidia 35340288 292 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 cirrus,nvidia_drm
drm 491520 10 drm_kms_helper,nvidia,cirrus,nvidia_drm

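The NVIDIA kernel modules are still loaded!

bash
# Kill all CUDA/PyTorch-related processes
sudo pkill -9 nvidia-smi
sudo pkill -9 nvcc

# Be careful when killing python.
# First list all Python processes (confirm which PIDs should be terminated)
ps aux | grep python

# Only when nothing important is listed, kill them all
sudo pkill -9 python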

Output of ps aux | grep python:

systemd+ 3517887 0.0 0.0 26468 20692 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517889 0.0 0.0 26468 20676 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517896 0.0 0.0 26468 20580 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517934 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517945 0.0 0.0 26468 20644 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517948 0.0 0.0 26468 20696 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517954 0.0 0.0 26468 20652 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3517960 0.0 0.0 26468 20672 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518010 0.0 0.0 26468 20744 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518011 0.0 0.0 26468 20664 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518019 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518020 0.0 0.0 26468 20600 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518021 0.0 0.0 26468 20704 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518022 0.0 0.0 26468 20676 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518024 0.0 0.0 26468 20664 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518025 0.0 0.0 26468 20660 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518033 0.0 0.0 26468 20620 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518034 0.0 0.0 26468 20576 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518035 0.0 0.0 26468 20656 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py
systemd+ 3518038 0.0 0.0 26468 20684 ? S Mar17 0:04 python3 -u /app/src/pool/worker.py

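These Python processes turned out to be the workers started by the FastGPT docker compose stack on this server, so they are better stopped through compose:

bash
# Change into the FastGPT directory, then run: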
docker compose down
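Killing every Python process is a blunt tool, as the FastGPT workers show. A narrower approach, offered only as a sketch, is to ask which processes actually hold the NVIDIA device files open, for instance with `sudo lsof /dev/nvidia*`. The helper below (the name `gpu_pids` is hypothetical) assumes the default `lsof` column layout, with the PID in the second column, and reduces that output to a unique PID list.

```shell
# gpu_pids: read default-format `lsof` output on stdin (header line first,
# PID in column 2) and print each PID once, in numeric order.
# Hypothetical helper, not from the original post.
gpu_pids() {
    awk 'NR > 1 { print $2 }' | sort -un
}
```

A possible use is `sudo lsof /dev/nvidia* 2>/dev/null | gpu_pids`, followed by `ps -p <pid> -o pid,user,cmd` for each PID before deciding what to stop.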

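1. Kill all python/python3 processes (covers every version)

bash
# pkill matches its pattern against the process name, so "python" also covers python3
sudo pkill -9 python
sudo pkill -9 python3
# ipykernel_launcher runs as a python module, so it only matches on the full command line (-f)
sudo pkill -9 -f ipykernel_launcher
sudo pkill -9 jupyter-lab

# 2. Verify that everything was killed (no output means success)
ps aux | grep -E "python|ipykernel|jupyter" | grep -v grep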


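Unload the NVIDIA kernel modules (critical)

The remaining system Python processes do not use the GPU, so the modules can be unloaded directly:

bash
# Unload the driver kernel modules in order
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

# Verify that the modules were unloaded (no output means success)
lsmod | grep nvidia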

(base) root@ubuntu:/data/fastgpt# sudo rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use
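The "in use" error corresponds to the third column of `lsmod`, the module's use count: `rmmod` refuses to unload a module while that count is above zero. When the count is nonzero but no dependent modules are listed after it, the references typically come from something other than a module, such as a process or a display server. A small sketch for spotting the busy modules (the function name `busy_nvidia_modules` is invented for this example):

```shell
# busy_nvidia_modules: read lsmod-style text on stdin and print the name and
# use count of every nvidia* module whose use count is greater than zero.
# Hypothetical helper, not from the original post.
busy_nvidia_modules() {
    awk '$1 ~ /^nvidia/ && $3 > 0 { print $1, $3 }'
}
```

`lsmod | busy_nvidia_modules` printing nothing would mean the modules can be removed; any line it prints names a module that `rmmod` will reject.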

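The server must be rebooted at this point!

reboot

After the reboot, run the commands below again. Each rmmod now reports that the module is not loaded, and CUDA 12 can then be installed normally.

bash
# Unload the driver kernel modules in order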
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

# Verify that the modules were unloaded (no output means success)
lsmod | grep nvidia
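The fixed `rmmod` sequence prints an error for every module that is already gone. As an optional sketch, the helper below (the name `nvidia_unload_plan` is made up) keeps the same unload order as above but only emits commands for modules that are actually present in the `lsmod` output it is given:

```shell
# nvidia_unload_plan: read lsmod-style text on stdin and print one
# `sudo rmmod` command per loaded NVIDIA module, dependents first.
# Prints nothing when none of the four modules is loaded.
# Hypothetical helper, not from the original post.
nvidia_unload_plan() {
    loaded=$(awk '{ print $1 }')
    for m in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        if printf '%s\n' "$loaded" | grep -qxF "$m"; then
            printf 'sudo rmmod %s\n' "$m"
        fi
    done
}
```

After the reboot described above, `lsmod | nvidia_unload_plan` would be expected to print nothing, which matches the "not loaded" messages from the manual commands.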

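Install CUDA 12.2

bash
# Change to the directory that holds the installer
cd /data/ghf

# Install the CUDA 12.2 toolkit and the bundled 535.54.03 driver non-interactively
sudo ./cuda_12.2.0_535.54.03_linux.run --silent --driver --toolkit

# A reboot is required once the installation finishes (essential)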
sudo reboot

Success!
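To confirm the result after the reboot, one option is to compare the running driver version with the version in the installer's file name (535.54.03). The comparison helper below is a sketch with an invented name, `version_at_least`, and it relies on GNU `sort -V`, which is available on Ubuntu.

```shell
# version_at_least HAVE WANT: succeed when dotted version HAVE is the same as
# or newer than WANT. Uses GNU sort -V for the version ordering.
# Hypothetical helper, not from the original post.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

A possible check is `version_at_least "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 535.54.03 && echo "driver OK"`. For the toolkit, `nvcc --version` should report release 12.2; this assumes the runfile's default install location, so `/usr/local/cuda-12.2/bin` may first need to be added to `PATH`.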
