Docker [Deployment 05] Installing tensorflow-gpu in Docker and a Record of Pitfalls When Calling the GPU

A record of pitfalls encountered while installing tensorflow-gpu and calling the GPU

  • 1. Installing tensorflow-gpu
  • 2. Using the GPU in Docker
    • 2.1 Could not find cuda drivers
    • 2.2 was unable to find libcuda.so DSO
    • 2.3 Could not find TensorRT && Cannot dlopen some GPU libraries
    • 2.4 Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
    • 2.5 CuDNN library needs to have matching major version and equal or higher minor version

1. Installing tensorflow-gpu

bash
Building wheels for collected packages: tensorflow-gpu
  Building wheel for tensorflow-gpu (setup.py): started
  Building wheel for tensorflow-gpu (setup.py): finished with status 'error'
  Running setup.py clean for tensorflow-gpu
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-i6frcfa8/tensorflow-gpu_2cea358528754cc596c541f9c2ce45ca/setup.py", line 37, in <module>
          raise Exception(TF_REMOVAL_WARNING)
      Exception:

      =========================================================
      The "tensorflow-gpu" package has been removed!

      Please install "tensorflow" instead.

      Other than the name, the two packages have been identical
      since TensorFlow 2.1, or roughly since Sep 2019. For more
      information, see: pypi.org/project/tensorflow-gpu
      =========================================================


      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tensorflow-gpu
Failed to build tensorflow-gpu

Other than the name, the two packages have been identical since TensorFlow 2.1. In other words, any tensorflow release from 2.1 onward already ships with GPU support, so install the plain tensorflow package instead, as shown below.
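
A minimal sketch of the replacement install; the version pin is only illustrative, and the one-line check prints an empty list when no GPU is visible:

bash
# install the unified package instead of the removed tensorflow-gpu
pip install tensorflow==2.13.0

# quick sanity check: lists the GPUs TensorFlow can see
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"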

2. Using the GPU in Docker

GPU models and driver versions vary from machine to machine; the driver and CUDA versions in this environment are as follows:

bash
[root@localhost ~]# nvidia-smi
# output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+

2.1 Could not find cuda drivers

shell
# error message
I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.

The program inside the Docker container could not see the CUDA environment. The following steps can help resolve the problem:

  1. Check the CUDA version: first confirm that CUDA is correctly installed on the host by running nvcc --version there.
  2. Use an NVIDIA Docker image: NVIDIA provides pre-configured images that already bundle CUDA and the other required libraries; they can serve as the base image in a Dockerfile.
  3. Set environment variables: in the Dockerfile, use the ENV instruction. For example, if CUDA is installed under /usr/local/cuda, add this line to the Dockerfile: ENV PATH /usr/local/cuda/bin:$PATH
  4. Use nvidia-docker: nvidia-docker is a tool for running GPU-accelerated Docker containers.

Checking the CUDA version is always worthwhile. Because the image here came from an exported image file, options 2 and 3 were not applicable, so the environment variables were finally injected with -e:

bash
# add the CUDA environment variables
-e PATH=/usr/local/cuda-11.2/bin:$PATH -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH

# launch command
nvidia-docker run --name deepface --privileged=true --restart=always --net="host" -e PATH=/usr/local/cuda-11.2/bin:$PATH -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH -v /root/.deepface/weights/:/root/.deepface/weights/ -v /usr/local/cuda-11.2/:/usr/local/cuda-11.2/ -d deepface_image
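
To confirm the variables actually reached the container, a quick hedged check (assuming the deepface container from the command above):

bash
# print the PATH and LD_LIBRARY_PATH the container actually sees
docker exec deepface env | grep -E 'PATH|LD_LIBRARY_PATH'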

2.2 was unable to find libcuda.so DSO

bash
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: localhost.localdomain
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: localhost.localdomain
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 460.27.4

On Linux, Docker can mount host directories into a container, which means that if a host directory contains symlinks, those symlinks are mounted into the container as well. Note, however, that the paths the symlinks point to must also be accessible inside the container: if a symlink's target has not been mounted into the container, dereferencing that symlink inside the container will fail.

Original reference: https://blog.csdn.net/u013546508/article/details/88637434. The steps that resolved the problem in this environment:

bash
# 1. Locate the libcuda.so files (quote the pattern so the shell does not expand it)
find / -name "libcuda.so*"
# results
/usr/lib/libcuda.so
/usr/lib/libcuda.so.1
/usr/lib/libcuda.so.460.27.04
/usr/lib64/libcuda.so
/usr/lib64/libcuda.so.1
/usr/lib64/libcuda.so.460.27.04

# 2. Check LD_LIBRARY_PATH
echo $LD_LIBRARY_PATH
# result
/usr/local/cuda/lib64

# 3. Copy the 64-bit libcuda.so.460.27.04 into the LD_LIBRARY_PATH directory (libcuda.so and libcuda.so.1 are both symlinks)
cp /usr/lib64/libcuda.so.460.27.04 /usr/local/cuda-11.2/lib64/

# 4. Create the symlinks inside the target directory
cd /usr/local/cuda-11.2/lib64/
ln -s libcuda.so.460.27.04 libcuda.so.1
ln -s libcuda.so.1 libcuda.so
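
Afterwards it is worth confirming that the link chain resolves; a hedged check for this environment:

bash
# the chain libcuda.so -> libcuda.so.1 -> libcuda.so.460.27.04 should resolve
ls -l /usr/local/cuda-11.2/lib64/libcuda.so*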

2.3 Could not find TensorRT && Cannot dlopen some GPU libraries

bash
I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

This problem is caused by TensorRT not being installed in the Docker image. Add the install command to the Dockerfile and rebuild the image:

bash
RUN pip install tensorrt -i https://pypi.tuna.tsinghua.edu.cn/simple
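
Then rebuild; a hedged example where the tag matches the image name used in the run commands in this post:

bash
# rebuild the image with the added TensorRT layer (run in the Dockerfile's directory)
docker build -t deepface_image .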

The following approach, installing inside a running container, is not recommended, but it also works:

bash
# 1. Find the container ID
docker ps

# 2. Attach to the running container
docker exec -it ContainerID /bin/bash

# 3. Install the package
pip install tensorrt -i https://pypi.tuna.tsinghua.edu.cn/simple

# 4. Commit a new image (the new image can be exported for reuse)
docker commit ContainerID imageName:version
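
Since the committed image is meant to be exported for reuse, a hedged sketch of the save/load round trip (the file and image names are placeholders):

bash
# export the committed image to a tarball, then load it on the target host
docker save -o deepface_image.tar imageName:version
docker load -i deepface_image.tar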

Behavior after the installation:

bash
root@localhost:/app# python
Python 3.8.18 (default, Sep 20 2023, 11:41:31)
[GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

# importing tensorflow alone still reports the warning
>>> import tensorflow as tf
2023-10-09 10:15:55.482545: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 10:15:56.498608: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

# importing tensorrt first, then tensorflow: the GPU becomes usable
>>> import tensorrt as tr
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-10-09 10:16:41.452672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /device:GPU:0 with 11389 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:2f:00.0, compute capability: 7.5
True

To apply this at startup, add the following code to the .py file the container executes; here it went into app.py:

python
# import tensorrt before tensorflow so the GPU libraries can be dlopened
import tensorrt as tr
import tensorflow as tf

if __name__ == "__main__":
    available = tf.config.list_physical_devices('GPU')
    print(f"available:{available}")

The file after adding the code:

python
# 3rd party dependencies (import tensorrt before tensorflow)
import tensorrt as tr
import tensorflow as tf
from flask import Flask
from routes import blueprint

def create_app():
    available = tf.config.list_physical_devices('GPU')
    print(f"available:{available}")
    app = Flask(__name__)
    app.register_blueprint(blueprint)
    return app

Start the container:

bash
nvidia-docker run --name deepface --privileged=true --restart=always --net="host" -e PATH=/usr/local/cuda-11.2/bin:$PATH -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH -v /root/.deepface/weights/:/root/.deepface/weights/ -v /usr/local/cuda-11.2/:/usr/local/cuda-11.2/ -v /opt/xinan-facesearch-service-public/deepface/api/app.py:/app/app.py -d deepface_image

2.4 Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED

bash
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:437] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:441] Memory usage: 1100742656 bytes free, 15843721216 bytes total.
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:451] Possibly insufficient driver version: 460.27.4
W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at conv_ops_impl.h:770 : UNIMPLEMENTED: DNN library is not found.

This error is caused by cuDNN not being installed; installing it resolves the problem, as sketched below.
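
A hedged sketch of a tarball install for this CUDA 11.2 environment. The archive name below is the cuDNN 8.6.0 build for CUDA 11.x, which matches the version required in section 2.5; download it from the NVIDIA developer site first:

bash
# unpack the cuDNN archive downloaded from developer.nvidia.com
tar -xf cudnn-linux-x86_64-8.6.0.163_cuda11-archive.tar.xz

# copy the headers and libraries into the CUDA installation
cp cudnn-linux-x86_64-8.6.0.163_cuda11-archive/include/cudnn*.h /usr/local/cuda-11.2/include/
cp cudnn-linux-x86_64-8.6.0.163_cuda11-archive/lib/libcudnn* /usr/local/cuda-11.2/lib64/
chmod a+r /usr/local/cuda-11.2/include/cudnn*.h /usr/local/cuda-11.2/lib64/libcudnn*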

2.5 CuDNN library needs to have matching major version and equal or higher minor version

The installed cuDNN version did not match the version the TensorFlow binary was compiled against; after adjusting to a matching version, the GPU worked successfully.

bash
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:425] Loaded runtime CuDNN library: 8.1.1 but source was compiled with: 8.6.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
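
To see which cuDNN version is actually installed before adjusting, a hedged check (the header path assumes the tarball install sketched in 2.4):

bash
# print the cuDNN version recorded in the installed headers
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda-11.2/include/cudnn_version.h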