Building a GPU-Enabled DL + LLM Docker Image on a Server

Goal: build a Docker environment that contains Conda (installed via Miniconda), PyTorch, Hugging Face, and reinforcement learning libraries, with GPU support (CUDA 12.4).

Software versions:

Ubuntu: 22.04
CUDA: 12.4
Python: 3.12
PyTorch: 2.5.1
torchvision: 0.20.1
torchaudio: 2.5.1

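I. Pull the Base Image and Build the Container

1. Write the Dockerfile

This Dockerfile uses an NVIDIA CUDA image as its base, configures the Tsinghua mirrors, and installs the required libraries.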

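# Step 1: start from NVIDIA's official CUDA 12.4 base image
# so that the underlying CUDA environment is known to be correct
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

# Avoid interactive prompts during apt-get installs
ENV DEBIAN_FRONTEND=noninteractive

# Step 2: install basic tools and Miniconda (lighter than full Anaconda, a better fit for Docker)
# - wget downloads the installer script
# - bash and git are commonly needed tools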
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget \
    bash \
    git \
    && rm -rf /var/lib/apt/lists/*

# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh

# Add Conda to the PATH
ENV PATH=/opt/conda/bin:$PATH

# Accept the Anaconda Terms of Service for the default channels
# (recent conda releases require this before non-interactive use)
RUN conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
RUN conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
RUN conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/msys2

# Step 3: configure the Tsinghua mirrors (Conda and pip)
RUN conda config --set show_channel_urls yes && \
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main && \
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free && \
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r && \
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro && \
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge && \
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/ && \
    pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Step 4: create a new Conda environment
# named 'myenv'
RUN conda create -n myenv python=3.12 -y

# Use the SHELL instruction so that subsequent RUN commands execute inside the conda environment
SHELL ["conda", "run", "-n", "myenv", "/bin/bash", "-c"]

# Step 5: install PyTorch, Hugging Face, and reinforcement learning libraries
# Install the CUDA 12.4 build of PyTorch from the official PyTorch wheel index;
# --index-url overrides the Tsinghua pip mirror for this command only
RUN pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
# RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install the core Hugging Face libraries
RUN pip install transformers datasets accelerate

# Install common reinforcement learning libraries
RUN pip install gymnasium stable-baselines3

# (Optional) install Jupyter
RUN pip install jupyterlab

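# Set the working directory
WORKDIR /app

# SHELL only affects RUN instructions, so put myenv first on PATH
# to make its python, pip, and jupyter the defaults at run time
ENV PATH=/opt/conda/envs/myenv/bin:$PATH

# Start bash by default
CMD ["/bin/bash"]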
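
Once the image is built, it can be useful to confirm from inside the container that the interpreter sees the pinned PyTorch build and the GPU. The following is a minimal sketch, not part of the original setup: the function name environment_report is a hypothetical helper, and the script is assumed to be saved somewhere under /app and run with the environment's python. It reports missing pieces instead of raising if PyTorch is not installed.

```python
import importlib.util
import sys


def environment_report() -> dict:
    """Collect Python, PyTorch, and CUDA details without failing if torch is absent."""
    report = {"python": sys.version.split()[0], "torch": None, "cuda_available": False}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this interpreter

    import torch

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # CUDA version the wheel was built against
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If cuda_available comes back False while torch is present, the usual suspects are a container started without --gpus all or a host without the NVIDIA Container Toolkit.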

2. Build the image

This step needs to run only once after each change to the Dockerfile.

docker build -t my_container .

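-t my_container: names (tags) the image.
.: the build context; Docker looks for the Dockerfile in the current directory.

After the build finishes, run docker images to confirm that the image exists.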

3. Run the container

Run this command each time you start working.

docker run --net=host -it --rm --gpus all -v "$(pwd)":/app -p 8888:8888 my_container

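--gpus all: exposes all host GPUs to the container.
-v "$(pwd)":/app: mounts the current directory at /app in the container so that code stays in sync.
-p 8888:8888: publishes port 8888 for Jupyter. Because the command also passes --net=host, Docker ignores -p and the container uses the host's ports directly; the mapping only takes effect if --net=host is dropped.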
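
The interaction between these flags can be sketched as a small helper that assembles the docker run arguments. This is an illustration only: docker_run_command and its parameters are hypothetical names, not part of Docker or of the original setup, and the function just builds the argument list without running anything.

```python
import shlex
from pathlib import Path


def docker_run_command(image, workdir=".", host_network=True, jupyter_port=8888, detach=False):
    """Return the `docker run` argv for the options described above."""
    cmd = ["docker", "run", "--gpus", "all"]
    # -d -t keeps bash alive in the background; -it --rm gives a throwaway interactive shell
    cmd += ["-d", "-t"] if detach else ["-it", "--rm"]
    if host_network:
        cmd.append("--net=host")  # published ports are ignored in host mode, so skip -p
    else:
        cmd += ["-p", f"{jupyter_port}:{jupyter_port}"]
    # Mount the working directory at /app, as in the command above
    cmd += ["-v", f"{Path(workdir).resolve()}:/app", image]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command line for the image built earlier
    print(shlex.join(docker_run_command("my_container")))
```

Passing host_network=False yields the bridge-network variant in which -p 8888:8888 is actually honored.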

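This command runs the container in the foreground and drops you into an interactive TTY session. The container stops when you exit, and because --rm is set, it is also deleted.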

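To keep the container running in the background, use the -d flag. Keep -t as well so that bash has a terminal and does not exit immediately, and name the container with --name (my_dev here is an arbitrary choice) so that later commands can refer to it.

docker run --net=host -d -t --name my_dev --gpus all -v "$(pwd)":/app my_container

Then attach to it with exec, passing the container name rather than the image name:

docker exec -it my_dev /bin/bash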

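3. Work inside the container

After the previous step you are at a shell prompt inside the container.

Install any extra tools you need. The container runs as root, so sudo is unnecessary (and is not installed in the base image):

apt-get update
apt-get install -y iproute2        # ip
apt-get install -y net-tools       # ifconfig
apt-get install -y inetutils-ping  # ping
apt-get install -y vim             # vim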

Start Jupyter Lab:

nohup jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser >jupyter.out 2>&1 &

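Access: the startup log, including the http://x.x.x.x:8888/lab?token=... address, is written to jupyter.out. Copy that address into the browser on your own computer.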
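
Because the nohup command redirects Jupyter's output to jupyter.out, the tokenized URL has to be pulled out of that file. The following is a rough sketch of one way to do it; find_jupyter_urls is a hypothetical helper, and the pattern assumes the token is a hexadecimal string and the path is /lab, as in the address above.

```python
import re
from pathlib import Path

# Matches addresses such as http://127.0.0.1:8888/lab?token=abc123 in the Jupyter log
URL_PATTERN = re.compile(r"https?://[^\s/]+:\d+/lab\?token=[0-9a-f]+")


def find_jupyter_urls(log_text: str) -> list[str]:
    """Return the unique tokenized Jupyter Lab URLs found in the log, in order."""
    seen = []
    for url in URL_PATTERN.findall(log_text):
        if url not in seen:
            seen.append(url)
    return seen


if __name__ == "__main__":
    log_path = Path("jupyter.out")  # the file that the nohup command writes to
    if log_path.exists():
        for url in find_jupyter_urls(log_path.read_text()):
            print(url)
```

Jupyter may take a few seconds to write its startup messages, so an empty result right after launch usually just means the log is not complete yet.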

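4. Commit container changes

Changes made inside a container can be saved as a new image of your own. After leaving the background container's shell, run the following on the host. The first argument is the container's name or ID as shown by docker ps -a, not the image name:

docker commit <container_name_or_id> my_new_image:latest
docker images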

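II. Host Configuration and Troubleshooting

Several problems came up while building and running the Docker container. The main ones are listed below with their symptoms, diagnoses, and fixes.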

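Problem 1: `docker build` times out while pulling the image

Symptom: i/o timeout, or 404 Not Found from a registry mirror.
Diagnosis: the connection to the official Docker Hub is unstable, or the configured mirror no longer works.
Fix: configure the Docker daemon to use registry mirrors hosted in China. Edit /etc/docker/daemon.json: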
```
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com"
  ]
}
```
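Then restart the Docker service: sudo systemctl restart docker.
Note: the mirrors do not carry the complete nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 image, so for this step it is better to set up a proxy and pull from the official registry.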
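
Editing daemon.json by hand risks overwriting settings that are already there, such as a registered NVIDIA runtime. The sketch below merges mirror URLs into an existing file instead of replacing it. It is an assumption-laden illustration: add_registry_mirrors is a hypothetical helper, and writing to /etc/docker/daemon.json requires root.

```python
import json
from pathlib import Path


def add_registry_mirrors(config_path, mirrors):
    """Merge mirror URLs into daemon.json, preserving other keys and existing order."""
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    config = json.loads(text) if text.strip() else {}  # start fresh if missing or empty

    merged = list(config.get("registry-mirrors", []))
    for url in mirrors:
        if url not in merged:  # skip mirrors that are already configured
            merged.append(url)

    config["registry-mirrors"] = merged
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

A typical call would be add_registry_mirrors("/etc/docker/daemon.json", ["https://hub-mirror.c.163.com", "https://mirror.baidubce.com"]), followed by the Docker restart described above.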

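Problem 2: `docker run` cannot use the GPU

Symptom: could not select device driver "" with capabilities: [[gpu]].
Diagnosis: the host is missing the bridge between Docker and the NVIDIA driver.

Fix: install the NVIDIA Container Toolkit on the host and restart the Docker service.

# (Ubuntu example; assumes NVIDIA's apt repository for the toolkit is already configured)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime in Docker's configuration
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker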
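
Before reinstalling anything, a quick check of which pieces are actually missing on the host can narrow the problem down. The sketch below looks for the relevant binaries on PATH. It is a heuristic under stated assumptions: missing_gpu_prerequisites is a hypothetical helper, and the binary names reflect what the driver and the nvidia-container-toolkit package normally install.

```python
import shutil

# Binaries expected on the host PATH; the hook ships with nvidia-container-toolkit
REQUIRED_TOOLS = {
    "docker": "Docker CLI",
    "nvidia-smi": "NVIDIA driver utilities",
    "nvidia-container-runtime-hook": "NVIDIA Container Toolkit",
}


def missing_gpu_prerequisites(which=shutil.which):
    """Return descriptions of required host tools that are not found on PATH."""
    return [label for tool, label in REQUIRED_TOOLS.items() if which(tool) is None]


if __name__ == "__main__":
    missing = missing_gpu_prerequisites()
    print("missing:", ", ".join(missing) if missing else "nothing")
```

An empty result does not prove that GPU passthrough works, since the Docker daemon also has to be restarted after the toolkit is installed, but a non-empty one points at what to install first.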

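Problem 3: no network access at all from inside the container

Fix: use host network mode, which lets the container share the host's network stack directly and bypasses Docker's own network layer.

Add the --net=host flag to the docker run command:

docker run -it --rm --gpus all --net=host -v "$(pwd)":/app my_container

Note: in this mode the -p port mapping has no effect, and any port a service listens on inside the container is taken directly on the host. It also gives up the container's network isolation, which reduces security.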
