Installing the NVIDIA driver, CUDA, cuDNN, and TensorFlow on a GPU server

OS and version compatibility

CentOS 7.2: CUDA 9.0, cuDNN 7.4
CentOS 7.5: CUDA 9.2, cuDNN 7.4
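
The matrix above can be turned into a quick pre-flight check. This is only a sketch: the function names centos_release and expected_stack are hypothetical, and it assumes /etc/centos-release has the usual "CentOS Linux release 7.5.1804 (Core)" form.

```shell
# Extract "major.minor" from a CentOS release string.
centos_release() {
    # $1 = contents of /etc/centos-release
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Map a CentOS release to the CUDA/cuDNN pair listed in the matrix above.
expected_stack() {
    case "$1" in
        7.2) echo "CUDA 9.0, cuDNN 7.4" ;;
        7.5) echo "CUDA 9.2, cuDNN 7.4" ;;
        *)   echo "unlisted" ;;
    esac
}

# Only runs on hosts that actually have the release file.
if [ -r /etc/centos-release ]; then
    expected_stack "$(centos_release "$(cat /etc/centos-release)")"
fi
```

Releases that are not in the matrix print "unlisted", which only means this guide did not cover that combination.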

Install gcc and the kernel headers

yum -y install gcc gcc-c++ kernel-devel
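
The runfile installer compiles the driver's kernel module against the headers from kernel-devel, so the installed kernel-devel version needs to match the running kernel. A minimal sketch of that check follows. The helper name kernel_devel_matches is hypothetical, and it assumes rpm -q kernel-devel prints package names of the form kernel-devel-<output of uname -r>.

```shell
# Succeeds if the package list on stdin contains kernel-devel for the given kernel release.
kernel_devel_matches() {
    # $1 = kernel release as printed by `uname -r`
    grep -qxF "kernel-devel-$1"
}

# Only runs on RPM-based hosts.
if command -v rpm >/dev/null 2>&1; then
    if rpm -q kernel-devel 2>/dev/null | kernel_devel_matches "$(uname -r)"; then
        echo "kernel-devel matches the running kernel"
    else
        echo "mismatch: try yum -y install kernel-devel-$(uname -r), or update the kernel and reboot"
    fi
fi
```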

Package manager overview (NVIDIA CUDA Installation Guide for Linux):
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-overview

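1. Install the GPU driver

Check the NVIDIA GPU information. nvidia-smi ships with the driver, so it only works once a driver is installed; before that, lspci | grep -i nvidia lists the cards.

# nvidia-smi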

2. Detect the required NVIDIA driver

2.1 Add the ELRepo repository

# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org

# rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm

2.2 Install the driver detection tool

yum install nvidia-detect

2.3 Run the detection

# nvidia-detect -v
Probing for supported NVIDIA devices...
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[10de:15f8] NVIDIA Corporation Device 15f8
This device requires the current 410.78 NVIDIA driver kmod-nvidia
[102b:0538] Matrox Electronics Systems Ltd. Device 0538
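
When the detection has to be scripted, the driver version can be pulled out of the output shown above. The sketch below assumes the "This device requires the current <version> NVIDIA driver" wording stays as in that output; the function name required_driver_versions is hypothetical.

```shell
# Print the distinct driver versions that nvidia-detect reports as required.
required_driver_versions() {
    sed -n 's/.*requires the current \([0-9][0-9.]*\) NVIDIA driver.*/\1/p' | sort -u
}

# Only runs when nvidia-detect is installed.
if command -v nvidia-detect >/dev/null 2>&1; then
    nvidia-detect -v | required_driver_versions
fi
```

With two identical cards, as in the output above, the two matching lines collapse into a single version.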

2.4 Edit the GRUB configuration

vim /etc/default/grub

Append the following to "GRUB_CMDLINE_LINUX" so that the nouveau driver is not loaded at boot:

rd.driver.blacklist=nouveau nouveau.modeset=0

The edited file looks like this:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rd.driver.blacklist=nouveau nouveau.modeset=0 rhgb quiet"
GRUB_DISABLE_RECOVERY="true"

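Then regenerate the GRUB configuration (on UEFI systems the output file is /boot/efi/EFI/centos/grub.cfg instead):

grub2-mkconfig -o /boot/grub2/grub.cfg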
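
Before rebooting, it can be worth confirming that both nouveau parameters actually ended up in GRUB_CMDLINE_LINUX. The following is a sketch under that assumption; the function name grub_blacklists_nouveau is hypothetical.

```shell
# Succeeds if GRUB_CMDLINE_LINUX in the given file carries both nouveau parameters.
grub_blacklists_nouveau() {
    # $1 = path to the GRUB defaults file
    line=$(grep '^GRUB_CMDLINE_LINUX=' "$1") || return 1
    case "$line" in
        *rd.driver.blacklist=nouveau*) ;;
        *) return 1 ;;
    esac
    case "$line" in
        *nouveau.modeset=0*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only runs on hosts that have the GRUB defaults file.
if [ -r /etc/default/grub ]; then
    if grub_blacklists_nouveau /etc/default/grub; then
        echo "nouveau parameters present"
    else
        echo "nouveau parameters missing"
    fi
fi
```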

2.5 Create the modprobe blacklist

vim /etc/modprobe.d/blacklist.conf

Add:

blacklist nouveau

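2.6 Rebuild the initramfs

Back up the current initramfs image, then rebuild it so that the blacklist also applies inside the initramfs:

mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img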
dracut /boot/initramfs-$(uname -r).img $(uname -r)

2.7 Reboot

reboot

2.8 Confirm that nouveau is disabled

lsmod | grep nouveau

No output means nouveau was disabled successfully.
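
For use in a provisioning script, the same lsmod check can be expressed as a yes/no test. This is a sketch; the function name nouveau_loaded is hypothetical, and it matches only a module whose name is exactly "nouveau" in the first column.

```shell
# Succeeds if the lsmod-style listing on stdin has a module named exactly "nouveau".
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Only runs when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_loaded; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is disabled"
    fi
fi
```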

3. Install CUDA

CUDA download page:

https://developer.nvidia.com/cuda-toolkit

Run the downloaded runfile installer as root:

# sh cuda_9.0.176_384.81_linux.run

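If the installer reports "You appear to be running an X server; please exit X before installing", run init 3 to switch to the text-mode (multi-user) target, which stops the X server, and then run the installer again.

When the installation finishes, the installer prints a summary like this:

===========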
= Summary =
===========
Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Installed in /root, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_7874.log
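
The summary asks for two environment changes, PATH and LD_LIBRARY_PATH, and nvcc will not be found until the first one is made. A minimal sketch of one way to do it follows. The helper name prepend_once is hypothetical, and the toolkit path is the one from the summary above.

```shell
# Prepend a directory to a colon-separated list unless it is already present.
prepend_once() {
    # $1 = directory, $2 = current value of the variable (may be empty)
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)        printf '%s\n' "${1}${2:+:$2}" ;;
    esac
}

CUDA_HOME=/usr/local/cuda-9.0   # toolkit path reported by the installer summary
PATH=$(prepend_once "$CUDA_HOME/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_once "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

These settings last only for the current shell. To make them permanent, the same lines can go into ~/.bashrc or a file under /etc/profile.d/.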

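Verify that CUDA 9.0 was installed

In a terminal, run:

nvcc -V

This prints the CUDA compiler version. If nvcc is not found, /usr/local/cuda-9.0/bin is not on PATH yet (see the installer summary above).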

Next, build and run one of the samples bundled with CUDA:

cd /usr/local/cuda-9.0/samples/1_Utilities/deviceQuery
make
./deviceQuery

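The run succeeded if the output ends with Result = PASS. The capture below reports version 10.0, so it was apparently taken from the CUDA 10.0 installation recorded in the appendix; a CUDA 9.0 toolkit reports 9.0 as the runtime version.

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 2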
Result = PASS

Uninstalling

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

4. Install cuDNN v7

Installation guide:

https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

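After the download completes, unpack the archive and copy its files into the CUDA directory by running the following commands in order:

tar -xzvf cudnn-9.0-linux-x64-v7.4.1.5.tgz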
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
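
A quick way to confirm which cuDNN version was copied is to read the version macros from the header. The sketch below assumes the cuDNN 7 layout, where cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL; the function name cudnn_version is hypothetical.

```shell
# Print the cuDNN version recorded in a cudnn.h header as "major.minor.patch".
cudnn_version() {
    # $1 = path to cudnn.h
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$1"
}

# Only runs when the header has been copied into place.
if [ -r /usr/local/cuda/include/cudnn.h ]; then
    cudnn_version /usr/local/cuda/include/cudnn.h
fi
```

For the archive used above, the expected result is a 7.4.x version.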

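To verify the installation, run a small demo.

If the cuDNN code samples and user guide package is installed, the mnistCUDNN sample can be found under /usr/src/cudnn_samples_v7.

Copy the samples to a writable directory, for example your home directory:

$ cp -r /usr/src/cudnn_samples_v7/ $HOME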

Change into the mnistCUDNN directory:

$ cd $HOME/cudnn_samples_v7/mnistCUDNN

Build:

$ make clean && make

Run:

$ ./mnistCUDNN

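If the installation succeeded, the output ends with:

Test passed!

Alternatively, if you have a Caffe checkout, running cmake in its build directory (caffe/build) is another quick way to check that cuDNN is detected.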

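5. Install the GPU build of TensorFlow (configure the pip mirror first)

The pip mirror configuration described after the command is optional but speeds up the download, so set it up first if you want it. Without a version pin, pip installs the newest release, which may target a different CUDA version; for CUDA 9.0, choose a tensorflow-gpu 1.x release no newer than 1.12.

$ sudo pip install tensorflow-gpu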

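As root, create a .pip directory in root's home directory and create the file pip.conf in it (/root/.pip/pip.conf) with the following content. This uses the Tsinghua University PyPI mirror, which is quite fast:

[global]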
index-url=https://pypi.tuna.tsinghua.edu.cn/simple
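
The same file can be created non-interactively. The sketch below writes exactly the two lines shown above; the function name write_pip_conf is hypothetical, and the target directory is passed as an argument so that nothing is written until the function is called.

```shell
# Write a pip.conf that points pip at the Tsinghua mirror.
write_pip_conf() {
    # $1 = directory that should hold pip.conf (for example /root/.pip)
    mkdir -p "$1" || return 1
    cat > "$1/pip.conf" <<'EOF'
[global]
index-url=https://pypi.tuna.tsinghua.edu.cn/simple
EOF
}

# Example (run as root): write_pip_conf /root/.pip
```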

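With this file in place no further setup is needed: pip install now fetches any package from the mirror. The original post showed a screenshot here, taken right after running pip install tensorflow, to illustrate the download speed.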

6. Test TensorFlow

After all the bumps along the way, we finally reach the testing step. Happy yet?

[root@gpuserver ~]# python
Python 2.7.5 (default, Nov 20 2015, 02:00:19) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-12-12 17:10:51.572488: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> 

If this small example runs correctly, congratulations: the GPU build of TensorFlow is installed. What are you waiting for? Go build something!
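
The hello-world session above also succeeds on a CPU-only setup, so it can help to check that TensorFlow actually sees a GPU. The sketch below assumes TensorFlow 1.x, whose device_lib.list_local_devices() output contains a device_type: "GPU" entry for each visible GPU; the function name has_gpu_device is hypothetical.

```shell
# Succeeds if the device listing on stdin mentions at least one GPU device.
has_gpu_device() {
    grep -q 'device_type: "GPU"'
}

# Only query TensorFlow when it can be imported.
if python -c 'import tensorflow' 2>/dev/null; then
    if python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())' 2>/dev/null | has_gpu_device; then
        echo "GPU visible to TensorFlow"
    else
        echo "no GPU visible to TensorFlow"
    fi
fi
```

If no GPU is reported, the usual suspects are a CUDA/cuDNN version that the installed TensorFlow build does not match, or a missing LD_LIBRARY_PATH entry for the CUDA lib64 directory.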

Installing pip on CentOS 7.2

yum install -y epel-release
yum install -y python-pip

Installing kernel-devel

yum -y install kernel-devel

Booting CentOS 7.2 into the graphical interface

# systemctl get-default
multi-user.target
# systemctl set-default graphical.target 

Appendix:

1. CUDA installation log (from a CUDA 10.0 install)

Installing the NVIDIA display driver...
Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so

Installing the CUDA Samples in /root ...
Copying samples to /root/NVIDIA_CUDA-10.0_Samples now...
Finished copying samples.

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Installed in /root, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_16878.log