Installing the NVIDIA driver, CUDA, cuDNN, and TensorFlow on a GPU server

System version compatibility requirements

CentOS 7.2    CUDA 9.0    cuDNN 7.4
CentOS 7.5    CUDA 9.2    cuDNN 7.4

Install gcc

yum -y install gcc gcc-c++ kernel-devel

Package manager overview
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-overview

1. Install the GPU driver

Check the NVIDIA GPU information (this command only works once an NVIDIA driver is installed):

# nvidia-smi

2. Install nvidia-detect

2.1 Add the ELRepo repository

# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org

# rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm

2.2 Install the driver detection tool

yum install nvidia-detect

2.3 Run it

# nvidia-detect -v
Probing for supported NVIDIA devices...
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[102b:0538] Matrox Electronics Systems Ltd. Device 0538
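
When several GPUs are installed, the same driver line is repeated once per device. As a minimal sketch, the required driver version can be pulled out of that output with a small parser. The function name required_drivers is hypothetical, and the regular expression is based only on the wording of the sample output above, so other nvidia-detect versions may phrase the line differently.

```python
import re

# Matches lines such as:
#   "This device requires the current 410.78 NVIDIA driver kmod-nvidia"
# The wording is taken from the sample output only (an assumption).
_DRIVER_LINE = re.compile(r"requires the \w+ (\S+) NVIDIA driver (\S+)")


def required_drivers(detect_output):
    """Return the unique (version, package) pairs named in nvidia-detect output."""
    found = []
    for line in detect_output.splitlines():
        match = _DRIVER_LINE.search(line)
        if match and match.groups() not in found:
            found.append(match.groups())
    return found


sample = """\
Probing for supported NVIDIA devices...
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[102b:0538] Matrox Electronics Systems Ltd. Device 0538
"""

print(required_drivers(sample))
```

The Matrox line is ignored because it carries no driver requirement.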

2.4 Edit the grub file

vim /etc/default/grub

Add the following to "GRUB_CMDLINE_LINUX":

rd.driver.blacklist=nouveau nouveau.modeset=0

After the change, the file looks like this:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rd.driver.blacklist=nouveau nouveau.modeset=0 rhgb quiet"
GRUB_DISABLE_RECOVERY="true"

Then regenerate the configuration:

grub2-mkconfig -o /boot/grub2/grub.cfg
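
If the grub edit has to be repeated on several servers, it can be scripted instead of done in vim. The following is only a sketch: add_kernel_args is a hypothetical helper that works on the text of /etc/default/grub and appends the two nouveau parameters if they are missing. It appends them at the end of the value rather than before "rhgb quiet" as in the file shown above; the order of kernel parameters does not matter here.

```python
import re

NOUVEAU_ARGS = ("rd.driver.blacklist=nouveau", "nouveau.modeset=0")


def add_kernel_args(grub_text, args=NOUVEAU_ARGS):
    """Append any missing args inside the GRUB_CMDLINE_LINUX="..." value."""

    def patch(match):
        current = match.group(2).split()
        missing = [arg for arg in args if arg not in current]
        return match.group(1) + " ".join(current + missing) + match.group(3)

    # Only the GRUB_CMDLINE_LINUX line is touched; other lines pass through.
    return re.sub(
        r'^(GRUB_CMDLINE_LINUX=")([^"]*)(")', patch, grub_text, flags=re.M
    )


original = (
    "GRUB_TIMEOUT=5\n"
    'GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root '
    'rd.lvm.lv=centos/swap rhgb quiet"\n'
)
print(add_kernel_args(original))
```

Running the function a second time on its own output changes nothing, so it is safe to re-run. The result still has to be written back to the file and grub2-mkconfig run afterwards.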

2.5 Create the blacklist

vim /etc/modprobe.d/blacklist.conf

Add:

blacklist nouveau

2.6 Update the configuration (rebuild the initramfs)

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)

2.7 Reboot

reboot

2.8 Confirm that nouveau is disabled

lsmod | grep nouveau

No output means nouveau was disabled successfully.
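
The same check can be done from a script. lsmod reads its data from /proc/modules, where the first field of each line is the module name, so a rough equivalent looks like the sketch below. The function names module_loaded and nouveau_loaded are hypothetical.

```python
def module_loaded(name, modules_text):
    """True if `name` is the first field of a line in /proc/modules-style text."""
    return any(line.split()[:1] == [name] for line in modules_text.splitlines())


def nouveau_loaded(proc_modules="/proc/modules"):
    """Read the kernel's module list and report whether nouveau is loaded."""
    with open(proc_modules) as handle:
        return module_loaded("nouveau", handle.read())
```

After the reboot, nouveau_loaded() should return False on a correctly configured machine.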

3. Install CUDA

CUDA download page:

https://developer.nvidia.com/cuda-toolkit

# sh cuda_9.0.176_384.81_linux.run

If the installer reports "You appear to be running an X server; please exit X before installing", run init 3 to switch to command-line mode, which stops the X server, and then run the install command again.

===========
= Summary =
===========
Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Installed in /root, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_7874.log
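
The summary asks for two environment changes, PATH and LD_LIBRARY_PATH. A small sketch of how to check which of them are still missing and print the matching export lines (for example, to append to ~/.bashrc) follows. The helper names missing_cuda_paths and export_lines are hypothetical, and CUDA_HOME is simply the install prefix reported in the summary.

```python
import os

CUDA_HOME = "/usr/local/cuda-9.0"  # install prefix reported in the summary


def missing_cuda_paths(environ, cuda_home=CUDA_HOME):
    """Return {variable: directory} for each variable lacking its CUDA directory."""
    wanted = {
        "PATH": cuda_home + "/bin",
        "LD_LIBRARY_PATH": cuda_home + "/lib64",
    }
    missing = {}
    for var, directory in wanted.items():
        entries = environ.get(var, "").split(":")
        if directory not in entries:
            missing[var] = directory
    return missing


def export_lines(missing):
    """Render shell export lines; ${VAR:+:$VAR} avoids a trailing colon."""
    return [
        "export " + var + "=" + path + "${" + var + ":+:$" + var + "}"
        for var, path in sorted(missing.items())
    ]


for line in export_lines(missing_cuda_paths(os.environ)):
    print(line)
```

On a shell where neither variable mentions CUDA yet, this prints one export line per variable; once both are set it prints nothing.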

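Verify that CUDA 9.0 installed successfully

In a terminal, run:

nvcc -V

This prints the CUDA version information. If nvcc is not found, add /usr/local/cuda-9.0/bin to PATH as the installer summary instructs.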

Next, try running one of the samples that ships with CUDA:

cd /usr/local/cuda-9.0/samples/1_Utilities/deviceQuery
make
./deviceQuery

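The output should end with a PASS result. (The sample below reports version 10.0, apparently because it was captured from the CUDA 10.0 installation logged in the appendix.)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 2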
Result = PASS

Uninstall

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

4. Install cuDNN v7

https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

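After the download finishes, extract the archive and copy its contents into the CUDA directory by running the following commands in order:

tar -xzvf cudnn-9.0-linux-x64-v7.4.1.5.tgz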
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
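
A quick way to confirm which cuDNN version was copied is to read the version macros from the header. The sketch below assumes that cuDNN 7.x defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL in cudnn.h; the function names cudnn_version and installed_cudnn_version are hypothetical.

```python
import re


def cudnn_version(header_text):
    """Extract (major, minor, patch) from the #define lines of cudnn.h."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+%s\s+(\d+)" % key, header_text, flags=re.M
        )
        if match is None:
            return None  # header does not look like a cuDNN 7.x cudnn.h
        parts.append(int(match.group(1)))
    return tuple(parts)


def installed_cudnn_version(path="/usr/local/cuda/include/cudnn.h"):
    """Read the header copied in the step above and return its version."""
    with open(path) as handle:
        return cudnn_version(handle.read())
```

For the archive used here (v7.4.1.5), installed_cudnn_version() would be expected to report major 7, minor 4, patch 1.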

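Then run a small demo to verify the installation.

If you installed the code samples and user guide package, the mnistCUDNN example can be found under /usr/src/cudnn_samples_v7.

Copy it to any directory under your home directory:

$ cp -r /usr/src/cudnn_samples_v7/ $HOME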

Change into mnistCUDNN:

$ cd $HOME/cudnn_samples_v7/mnistCUDNN

Build:

$ make clean && make

Run:

$ ./mnistCUDNN

If the installation succeeded, you will see this result:

Test passed!

You can also run cmake in your caffe/build directory, which is another quick way to test whether the installation succeeded.

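5. Install the GPU build of TensorFlow (configure the pip mirror first)

$ sudo pip install tensorflow-gpu

The prebuilt tensorflow-gpu 1.5 through 1.12 packages are built against CUDA 9.0, so on this setup pin the version if pip selects a newer release, for example tensorflow-gpu==1.12.0.

As the root user, create a .pip directory in root's home directory and create the file pip.conf in it (/root/.pip/pip.conf) with the following content. This uses the Tsinghua mirror, which is quite fast:

[global]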
index-url=https://pypi.tuna.tsinghua.edu.cn/simple
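
The mirror setting can also be written programmatically, which helps when preparing several machines. This is a sketch that assumes Python 3 (the server in this guide ships Python 2.7); write_pip_conf is a hypothetical helper, and the demo writes to a scratch directory rather than /root/.pip.

```python
import configparser
import os
import tempfile


def write_pip_conf(path, index_url="https://pypi.tuna.tsinghua.edu.cn/simple"):
    """Create or update pip.conf so that [global] index-url points at the mirror."""
    config = configparser.ConfigParser()
    config.read(path)  # a missing file is skipped silently
    if not config.has_section("global"):
        config.add_section("global")
    config.set("global", "index-url", index_url)
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w") as handle:
        config.write(handle)


# Demo in a temporary directory; on the server the path would be
# /root/.pip/pip.conf as described above.
demo_path = os.path.join(tempfile.mkdtemp(), ".pip", "pip.conf")
write_pip_conf(demo_path)
with open(demo_path) as handle:
    print(handle.read())
```

Existing settings in the file are preserved; only the index-url key in the [global] section is set.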

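Once this is configured, nothing else is required: pip install now fetches any package you want from the mirror. For comparison, the original post showed a screenshot here, taken immediately after running pip install tensorflow.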

6. Test TensorFlow

After all the bumps along the way, we have finally reached the testing step. Feeling happy yet?

[root@gpuserver ~]# python
Python 2.7.5 (default, Nov 20 2015, 02:00:19) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-12-12 17:10:51.572488: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> 

If this small example runs correctly, congratulations: the GPU build of TensorFlow is installed. What are you waiting for? Go build something!
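
The hello-world session above would also run on the CPU, so on its own it does not show that the GPUs are visible to TensorFlow. One way to check, sketched below, is to list the local devices through device_lib.list_local_devices() from tensorflow.python.client and keep the GPU entries. The helper names gpu_device_names and tensorflow_gpus are hypothetical.

```python
def gpu_device_names(devices):
    """Keep the names of devices whose device_type is 'GPU'."""
    return [device.name for device in devices if device.device_type == "GPU"]


def tensorflow_gpus():
    """Ask TensorFlow which local devices it sees and return the GPU ones."""
    # Imported lazily so this file can be loaded without TensorFlow installed.
    from tensorflow.python.client import device_lib

    return gpu_device_names(device_lib.list_local_devices())
```

On the two-GPU server used in this guide, tensorflow_gpus() should return two device names; an empty list suggests that the CPU-only package is installed or that the CUDA libraries are not being found.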

Install pip on CentOS 7.2

yum install -y epel-release
yum install -y python-pip

Install kernel-devel

yum -y install kernel-devel

Configure CentOS 7.2 to boot into the graphical interface

# systemctl get-default
multi-user.target
# systemctl set-default graphical.target 

Appendix:

1. CUDA installation log

Installing the NVIDIA display driver...
Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so

Installing the CUDA Samples in /root ...
Copying samples to /root/NVIDIA_CUDA-10.0_Samples now...
Finished copying samples.

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Installed in /root, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_16878.log