Software Installation: Installing the NVIDIA Driver and CUDA Toolkit on Ubuntu 24

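Requirement

Deploy the NVIDIA driver and the CUDA Toolkit on Ubuntu 24. The procedure should work for any version of either.

Approach

Download the files, install the software, and test that it works.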

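Methods

Method 1

Install the NVIDIA driver, install CUDA separately, then test.

Pros: NVIDIA stands behind the stability of the standalone driver, so it is reliable over the long term.

Cons: CUDA will not be the latest version and may lag by a major version. Two packages have to be installed separately. Occasionally CUDA cannot find the NVIDIA driver and needs a manual fix.

Consider this method if driver stability matters most and you neither chase new CUDA releases nor need one specific CUDA version.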

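Method 2

Install CUDA and let its installer install the bundled NVIDIA driver, then test.

Pros: a single installation provides everything, and you can use the latest CUDA or pin a specific CUDA version.

Cons: the driver bundled with CUDA may not be a certified stable release. It works, but long-term stability is not promised.

Choose this method if you want the newest CUDA or must use a specific CUDA version.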

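Method 3

Network (online) installation. It needs a good connection to both domestic and international networks; in return it is simple, stable, and supports cross-platform development well.

Because of network restrictions in mainland China, this article does not cover this method.

See the following documents: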

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#
https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#

Procedure

Detect the GPU

Copy and run the following commands to create the script file, then execute it to collect GPU information.

cat > check_GPU.sh << 'EOF'
#!/bin/bash

# Log file path. The heredoc delimiter above is quoted so that $(...) and
# $VAR are written literally and expand when the script runs, not when the
# file is created.
LOG_FILE="gpu_detection_$(date +%Y%m%d_%H%M%S).log"

# Create the log file and set its permissions
touch "$LOG_FILE"
chmod 644 "$LOG_FILE"

# Write a timestamp to the log
echo "Detection time: $(date)" >> "$LOG_FILE"
echo "==================================================" >> "$LOG_FILE"

# Detect GPU vendor and model (first matching PCI device)
echo "Detecting GPU vendor and model..." >> "$LOG_FILE"
GPU_VENDOR=$(lspci | grep -i "vga\|3d\|display" | awk -F': ' '{print $2}' | head -1)
echo "GPU vendor and model: $GPU_VENDOR" >> "$LOG_FILE"

# Count GPU devices
echo "Counting GPU devices..." >> "$LOG_FILE"
GPU_COUNT=$(lspci | grep -i "vga\|3d\|display" | wc -l)
echo "Number of GPU devices: $GPU_COUNT" >> "$LOG_FILE"

# Check for nvidia-smi
echo "Checking for nvidia-smi..." >> "$LOG_FILE"
if command -v nvidia-smi &>/dev/null; then
    DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
    echo "nvidia-smi is installed, driver version: $DRIVER_VERSION" >> "$LOG_FILE"

    # Highest CUDA version supported by the driver, taken from the nvidia-smi header
    CUDA_VERSION=$(nvidia-smi | grep -o "CUDA Version: [0-9.]*" | awk '{print $3}')
    echo "CUDA version supported by the driver: $CUDA_VERSION" >> "$LOG_FILE"
else
    echo "nvidia-smi is not installed" >> "$LOG_FILE"

    # Fall back to nvcc to detect CUDA
    echo "Trying to detect CUDA through nvcc..." >> "$LOG_FILE"
    if command -v nvcc &>/dev/null; then
        # The release line ends with a field such as V12.9.86; strip the leading V
        CUDA_VERSION=$(nvcc --version | grep release | awk '{print $6}' | cut -c2-)
        echo "CUDA is installed, version: $CUDA_VERSION" >> "$LOG_FILE"
    else
        echo "CUDA is not installed" >> "$LOG_FILE"
    fi
fi

echo "==================================================" >> "$LOG_FILE"
echo "Detection finished, results saved to $LOG_FILE"

# Show the log file contents
cat "$LOG_FILE"
EOF

bash check_GPU.sh

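Download the software

Choosing suitable versions

Note: choose the versions first, then download. If unsure, download several major versions; for the minor version, the latest is usually fine.

Open https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id7 in a browser. The compatibility table on that page lets you pick a CUDA version from the driver version, or a driver version from the CUDA version.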

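Download the NVIDIA driver

Open https://www.nvidia.cn/drivers/lookup/ in a browser, enter the GPU model name, select Linux 64-bit, and search for drivers that match the GPU.

A certified driver is recommended because it is stable and reliable.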

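Download CUDA

Open https://developer.nvidia.com/cuda-toolkit in a browser.

Select the options that match your system. This article uses the runfile installer, as the .run file names in the installation steps show.

Download the file from the address the page generates, using a download tool or the browser.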

Download the test files

Open https://github.com/NVIDIA/cuda-samples/tags in a browser and pick the archive whose tag matches your CUDA version.
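As an alternative to clicking through the tags page, the archive address can be built from the CUDA version. The sketch below is only an illustration: `samples_url` is a hypothetical helper, and it assumes the tag is named `v<major>.<minor>` (which matches the `cuda-samples-12.8` directory used later in this article). Check the tags page if the download fails, since not every release necessarily follows that naming.

```shell
# Hypothetical helper: print an assumed download URL for the cuda-samples
# archive matching a CUDA version such as 12.8.1.
# Assumption: the tag is named v<major>.<minor>, e.g. v12.8.
samples_url() {
    local mm
    mm=$(printf '%s\n' "$1" | cut -d. -f1,2)   # 12.8.1 -> 12.8
    printf 'https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v%s.tar.gz\n' "$mm"
}

# Usage (not executed here):
#   wget "$(samples_url 12.8.1)"
```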

Upload and set permissions

Upload the downloaded files to the server.

Make the .sh and .run files executable:

chmod +x *.sh
chmod +x *.run

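Extract the archives

# Loop over the matches so each archive is extracted on its own;
# passing several archives to one unzip or tar call does not extract them all.
for f in *.zip; do unzip "$f"; done
for f in *.tar.gz; do tar -zxf "$f"; done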

Installation

Install the build prerequisites

apt update -y && apt install -y gcc g++ make cmake

Method 1

Install the NVIDIA driver

./NVIDIA-Linux-x86_64-570.172.08.run

Confirm every prompt and answer yes throughout. When the installer finishes, be sure to reboot the system. After the reboot, check the result:

root@testserver1 Fri Jul 18 [03:32:10] : ~# nvidia-smi
Fri Jul 18 03:32:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 5000                Off |   00000000:2D:00.0 Off |                  Off |
| 33%   51C    P0             58W /  230W |       0MiB /  16384MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

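The output shows CUDA Version: 12.8. This is the highest CUDA version the installed driver supports, so the toolkit you install should be no newer than 12.8. Any 12.8.x release, such as 12.8.1, is fine.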
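The rule above can be checked mechanically before running an installer. The following is a minimal sketch, not part of the original procedure: `toolkit_supported` is a hypothetical helper that compares the toolkit's major.minor version against the CUDA version reported by nvidia-smi, and it assumes GNU `sort -V` is available (it is on Ubuntu).

```shell
# Hypothetical helper: succeed if the toolkit's major.minor version is not
# newer than the CUDA version the driver reports.
#   $1 = CUDA version from nvidia-smi, e.g. 12.8
#   $2 = toolkit version to install,   e.g. 12.8.1
toolkit_supported() {
    local driver_cuda toolkit_mm lowest
    driver_cuda="$1"
    toolkit_mm=$(printf '%s\n' "$2" | cut -d. -f1,2)
    # sort -V orders version strings; the first line is the lower version
    lowest=$(printf '%s\n%s\n' "$toolkit_mm" "$driver_cuda" | sort -V | head -n1)
    [ "$lowest" = "$toolkit_mm" ]
}

# Usage (not executed here):
#   drv=$(nvidia-smi | grep -o "CUDA Version: [0-9.]*" | awk '{print $3}')
#   toolkit_supported "$drv" 12.8.1 && echo "12.8.1 is within the driver's range"
```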

Install CUDA

./cuda_12.8.1_570.124.06_linux.run

In the installer, deselect the driver component and confirm everything else. When the installation finishes, the following summary appears:

root@testserver1 Fri Jul 18 [05:00:56] : /opt/nvidia# ./cuda_12.8.1_570.124.06_linux.run
===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-12.8/

Please make sure that
 -   PATH includes /usr/local/cuda-12.8/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.8/lib64, or, add /usr/local/cuda-12.8/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.8/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 570.00 is required for CUDA 12.8 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log
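The runfile name used above carries two version numbers: the toolkit version and the version of the driver bundled with it. In Method 1 the separately installed driver (570.172.08) is newer than the bundled one (570.124.06), which is consistent with the installer's stated minimum of 570.00. A small sketch for pulling both numbers out of a file name follows; `parse_cuda_runfile` is a hypothetical helper and assumes the `cuda_<toolkit>_<driver>_linux.run` pattern seen in this article.

```shell
# Hypothetical helper: print "<toolkit_version> <bundled_driver_version>"
# for a runfile named cuda_<toolkit>_<driver>_linux.run.
parse_cuda_runfile() {
    local name
    name=$(basename "$1" _linux.run)   # cuda_12.8.1_570.124.06
    name=${name#cuda_}                 # 12.8.1_570.124.06
    # Text before the first underscore is the toolkit, the rest is the driver
    printf '%s %s\n' "${name%%_*}" "${name#*_}"
}

# Usage (not executed here):
#   parse_cuda_runfile ./cuda_12.8.1_570.124.06_linux.run
```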

Configure environment variables

echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
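Running the echo commands above more than once appends duplicate lines to ~/.bashrc. One way to avoid that is to append a line only when it is not already present. The sketch below is an optional variation, not the author's method, and `append_once` is a hypothetical helper name.

```shell
# Hypothetical helper: append a line to a file only if the exact line is
# not already there.
#   $1 = line to add, $2 = target file
append_once() {
    local line file
    line="$1"
    file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (not executed here):
#   append_once 'export PATH=/usr/local/cuda-12.8/bin:$PATH' ~/.bashrc
#   append_once 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' ~/.bashrc
```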

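After installation, check the toolkit with nvcc --version. The release line should match the toolkit you installed. The sample output below reports release 12.9, so it appears to have been captured on a 12.9 installation rather than the 12.8.1 installation in this method.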

root@testserver1 Fri Jul 18 [07:49:03] : /opt/nvidia
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0

Rebooting the operating system after installation is recommended.

Method 2

One-step installation

./cuda_12.9.1_575.57.08_linux.run

Keep the default selection, which installs every component. When the installation finishes, the following summary appears:

root@testserver1 Fri Jul 18 [07:13:29] : /opt/nvidia# ./cuda_12.9.1_575.57.08_linux.run
===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-12.9/

Please make sure that
 -   PATH includes /usr/local/cuda-12.9/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.9/lib64, or, add /usr/local/cuda-12.9/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.9/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

Configure environment variables

First comment out any older CUDA path entries in ~/.bashrc so the old and new settings do not conflict.
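Commenting out the old entries can be done by hand in an editor. As an illustration only, the sketch below does it with sed; `comment_out_cuda_paths` is a hypothetical helper, it assumes the entries have the same form as the export lines used in this article, and it relies on GNU sed's `-i.bak` to keep a backup copy next to the file.

```shell
# Hypothetical helper: comment out export lines for PATH and LD_LIBRARY_PATH
# that point at /usr/local/cuda-<version>/ in the given file.
# A backup is written to <file>.bak. Lines already commented are untouched.
comment_out_cuda_paths() {
    sed -i.bak -E \
        '/^[[:space:]]*export (PATH|LD_LIBRARY_PATH)=\/usr\/local\/cuda-[0-9.]+\//s/^/# /' \
        "$1"
}

# Usage (not executed here):
#   comment_out_cuda_paths ~/.bashrc
```

Run it before appending the new 12.9 lines; otherwise those lines would be commented out as well.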


echo 'export PATH=/usr/local/cuda-12.9/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

To be safe, reboot the system.

Testing

Basic test

It is enough that both commands return output.

nvidia-smi
nvcc --version
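When nvcc is reported as not found even though the toolkit is installed, the usual cause is that the PATH entry from ~/.bashrc has not been loaded in the current shell. A quick way to see which of the two commands is visible is sketched below; `check_cmds` is a hypothetical helper, not something shipped with the driver or toolkit.

```shell
# Hypothetical helper: report whether each named command is on PATH.
# Returns non-zero if at least one is missing.
check_cmds() {
    local missing cmd
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (not executed here):
#   check_cmds nvidia-smi nvcc
```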

Extended test

Build the sample

root@testserver1 Fri Jul 18 [05:38:17] : cuda-samples-12.8/Samples/1_Utilities/deviceQuery# cmake ./
root@testserver1 Fri Jul 18 [05:38:17] : cuda-samples-12.8/Samples/1_Utilities/deviceQuery# make

Run ./deviceQuery. Output like the following appears; Result = PASS means the device environment check passed.

root@testserver1 Fri Jul 18 [05:41:53] : /opt/nvidia/cuda-samples-12.8/Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro RTX 5000"
  CUDA Driver Version / Runtime Version          12.8 / 12.8
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 15928 MBytes (16701652992 bytes)
  (048) Multiprocessors, (064) CUDA Cores/MP:    3072 CUDA Cores
  GPU Max Clock rate:                            1815 MHz (1.81 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 45 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.8, CUDA Runtime Version = 12.8, NumDevs = 1
Result = PASS
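For unattended checks, the pass condition described above can be tested by looking for the Result = PASS line instead of reading the full output. A minimal sketch follows; `device_query_passed` is a hypothetical helper that reads deviceQuery output on standard input.

```shell
# Hypothetical helper: succeed if the text on stdin contains the
# "Result = PASS" line that deviceQuery prints on success.
device_query_passed() {
    grep -q 'Result = PASS'
}

# Usage (not executed here):
#   ./deviceQuery | device_query_passed && echo "GPU environment OK"
```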

Uninstalling

After uninstalling, remember to comment out the environment variables you added to ~/.bashrc, and be sure to reboot the system.

Uninstall the NVIDIA driver

/usr/bin/nvidia-uninstall

Uninstall CUDA

/usr/local/cuda-12.8/bin/cuda-uninstaller

A successful uninstall prints the following:

root@testserver1 Fri Jul 18 [06:10:34] : /opt/nvidia
# /usr/local/cuda-12.8/bin/cuda-uninstaller
 Successfully uninstalled