1. First run of the Docker container
python
$ docker run --gpus all \
--cpu-shares 768 \
-p 8000:8000 \
-v /data2/model_output:/model_output \
vllm/vllm-openai \
--model your-model
It fails with the following error:
python
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
Run 'docker run --help' for more information
Analysis:
This error means Docker has not been set up for GPU access. To fix it, install the NVIDIA Container Toolkit, which lets Docker containers use the host's GPUs.
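Before installing anything, it can help to see which of the relevant commands are already on the PATH. The helper below is a minimal sketch: `missing_cmds` is a hypothetical name, and checking for `nvidia-ctk` assumes a toolkit release that ships that CLI.

```shell
# missing_cmds (hypothetical helper): print each given command that is NOT on PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: prints nothing when the driver tool, Docker and the toolkit CLI are all present.
missing_cmds nvidia-smi docker nvidia-ctk
```

If `nvidia-smi` is listed, the driver needs attention first; if only `nvidia-ctk` is listed, the toolkit is the missing piece.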
Solution
- Install the NVIDIA Container Toolkit
First make sure the NVIDIA driver is installed (verify with the nvidia-smi command), then install the toolkit with the following steps:
python
# Add the NVIDIA Container Toolkit repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Update the package lists and install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# Restart the Docker service
sudo systemctl restart docker
Running the repository step:
python
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Warning:
python
Warning: apt-key is deprecated. Manage keyring files in trusted.gpg.d instead (see apt-key(8)).
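This warning means the apt-key command is deprecated and APT keys should be managed with keyring files instead. The updated steps below avoid apt-key:
Updated NVIDIA Container Toolkit installation (without apt-key)
python
# 1. Determine the distribution identifier (e.g. ubuntu22.04)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# 2. Download NVIDIA's GPG key and store it de-armored under /usr/share/keyrings
curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-docker.gpg
# 3. Add the repository: download the .list file and add signed-by to each deb line
#    (writing the .list URL itself as a deb entry produces a malformed source file)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-docker.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# 4. Update the package lists and install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# 5. Restart the Docker service
sudo systemctl restart docker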
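The original steps can also be chained into a single command. Note that this variant still uses the deprecated apt-key:
python
# Add the key and repository, install the package and restart Docker in one go
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && \
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list && \
sudo apt-get update && sudo apt-get install -y nvidia-docker2 && \
sudo systemctl restart docker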
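2. Clean install of the NVIDIA Container Toolkit
python
# Remove all old components
sudo apt-get purge -y nvidia-docker2 nvidia-container-runtime nvidia-container-toolkit
sudo apt-get autoremove -y
# Add the official libnvidia-container repository (experimental channel)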
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the latest version
sudo apt-get update && \
sudo apt-get install -y nvidia-container-toolkit
Error:
python
apt-get purge -y nvidia-docker2 nvidia-container-runtime nvidia-container-toolkit
E: Malformed entry 1 in list file /etc/apt/sources.list.d/nvidia-docker.list (Suite)
E: The list of sources could not be read.
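Analysis:
This error means /etc/apt/sources.list.d/nvidia-docker.list is malformed: every deb entry needs at least a URI and a suite, and the first entry in this file has no suite field. That is what happens when the URL of a .list file is written as the deb entry itself instead of the contents of that file. The fix: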
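To make the format rule concrete, here is a minimal sketch of a check for a one-line-style source entry. `is_valid_deb_line` is a hypothetical helper, it only counts fields (type, URI, suite) after dropping an optional `[options]` block, and the sample entries in the comments are illustrative rather than copied from NVIDIA's files.

```shell
# is_valid_deb_line (hypothetical helper): succeed when the line looks like
#   deb [options] URI suite [components...]
is_valid_deb_line() {
  printf '%s\n' "$1" | awk '
    { sub(/\[[^]]*\]/, "") }   # drop the optional [options] block; fields are re-split
    ($1 == "deb" || $1 == "deb-src") && NF >= 3 { ok = 1 }
    END { exit (ok ? 0 : 1) }'
}

# Accepted: type, URI and a suite ("/" is a valid suite for flat repositories).
#   is_valid_deb_line 'deb [signed-by=/usr/share/keyrings/k.gpg] https://example.invalid/repo /'
# Rejected: the URI is the only field after the type, so the suite is missing.
#   is_valid_deb_line 'deb [signed-by=/usr/share/keyrings/k.gpg] https://example.invalid/repo/x.list'
```

The check is deliberately loose; apt itself remains the authority on whether a source file parses.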
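Complete fix
- First repair the broken source list file
python
# Move the broken file out of the way, keeping a backup
sudo mv /etc/apt/sources.list.d/nvidia-docker.list /etc/apt/sources.list.d/nvidia-docker.list.bak
# Remove the stale GPG keyring (errors are ignored if it does not exist)
sudo rm /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg 2>/dev/null || true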
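- Re-add the correct NVIDIA container repository
python
# Determine the distribution identifier
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# Download and install the new GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the repository, pointing each deb line at the keyring via signed-by
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list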
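- Finish the cleanup and reinstall
python
# Update the package lists
sudo apt-get update
# Remove leftover NVIDIA Docker packages by name.
# Avoid wildcards such as nvidia-* docker-* here: they would also purge the GPU driver and Docker itself.
sudo apt-get purge -y nvidia-docker2 nvidia-container-runtime
sudo apt-get autoremove -y
# Reinstall the required component
sudo apt-get install -y nvidia-container-toolkit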
- Configure Docker to use the NVIDIA runtime
python
# Register the NVIDIA runtime in Docker's configuration
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify the configuration
docker info | grep -i runtime
- Final GPU access test
python
# Use an image whose CUDA version matches your driver
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
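The runtime check in the configuration step above can be turned into a yes/no test for use in scripts. This is a minimal sketch: `has_nvidia_runtime` is a hypothetical helper, and it assumes `docker info` prints a `Runtimes:` line listing the registered runtime names.

```shell
# has_nvidia_runtime (hypothetical helper): succeed when the given `docker info`
# text lists "nvidia" on its "Runtimes:" line.
has_nvidia_runtime() {
  printf '%s\n' "$1" | grep -qiE '^[[:space:]]*Runtimes:.*nvidia'
}

# Usage on a host with Docker installed:
#   has_nvidia_runtime "$(docker info 2>/dev/null)" && echo "nvidia runtime registered"
```

Passing the text in as an argument keeps the function usable on saved output as well as on a live host.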
3. Install Docker manually (recommended)
python
# Download the Docker packages (using wget to work around the curl problem)
wget https://download.docker.com/linux/ubuntu/dists/jammy/pool/stable/amd64/docker-ce-cli_24.0.7-1~ubuntu.22.04~jammy_amd64.deb
wget https://download.docker.com/linux/ubuntu/dists/jammy/pool/stable/amd64/docker-ce_24.0.7-1~ubuntu.22.04~jammy_amd64.deb
wget https://download.docker.com/linux/ubuntu/dists/jammy/pool/stable/amd64/containerd.io_1.6.31-1_amd64.deb
# Install the downloaded packages
sudo apt install ./docker-ce*.deb ./containerd.io_*.deb
The download fails with:
python
wget https://download.docker.com/linux/ubuntu/dists/jammy/pool/stable/amd64/docker-ce-cli_24.0.7-1~ubuntu.22.04~jammy_amd64.deb
--2025-07-14 17:03:32-- https://download.docker.com/linux/ubuntu/dists/jammy/pool/stable/amd64/docker-ce-cli_24.0.7-1~ubuntu.22.04~jammy_amd64.deb
Resolving download.docker.com (download.docker.com)... 65.9.66.54, 65.9.66.46, 65.9.66.125, ...
Connecting to download.docker.com (download.docker.com)|65.9.66.54|:443... connected.
Unable to establish SSL connection.
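The download of the Docker packages failed while establishing the SSL connection. Possible causes include a corporate firewall, an outdated SSL library, or an incorrect system clock.
Solution: install from the official Ubuntu repository (no manual download needed)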
python
sudo apt update
sudo apt install docker.io
Error:
python
sudo apt install docker.io
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
containerd.io : Conflicts: containerd
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
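Analysis:
This error is a containerd package conflict. Docker's own containerd.io package declares a conflict with Ubuntu's containerd package, which docker.io depends on, so apt cannot satisfy both.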
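To see which side of the conflict is currently installed, the `dpkg -l` listing can be filtered for the four packages involved. A minimal sketch, where `installed_container_pkgs` is a hypothetical helper that reads `dpkg -l` output on standard input:

```shell
# installed_container_pkgs (hypothetical helper): read `dpkg -l` output on stdin and
# print the fully installed (status "ii") packages among the four that can clash.
installed_container_pkgs() {
  awk '$1 == "ii" && ($2 == "containerd" || $2 == "containerd.io" ||
                      $2 == "docker.io"  || $2 == "docker-ce") { print $2 }'
}

# Usage on a Debian or Ubuntu host:
#   dpkg -l | installed_container_pkgs
```

Seeing both containerd and containerd.io, or containerd.io next to a pending docker.io install, points to the conflict described above.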
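Solution: purge the old versions completely, then reinstall
python
# 1. Purge every Docker and containerd package
sudo apt purge -y docker.io docker-ce containerd containerd.io
# 2. Delete leftover data and configuration (this also deletes existing images and containers)
sudo rm -rf /var/lib/docker /etc/docker
sudo rm -rf /var/lib/containerd /etc/containerd
# 3. Reinstall (--fix-broken repairs dependencies)
sudo apt update
sudo apt install -y docker.io --fix-broken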
Key checks
python
# Were the old versions fully removed?
dpkg -l | grep containerd
# Was the new version installed?
docker --version
# Is the service running?
sudo systemctl status docker
python
root@/home/__su# dpkg -l | grep containerd
ii containerd 1.7.27-0ubuntu1~22.04.1 amd64 daemon to control runC
root@/home/__su# docker --version
Docker version 27.5.1, build 27.5.1-0ubuntu3~22.04.2
root@/home/__su# sudo systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2025-07-14 17:10:58 CST; 4min 34s ago
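This shows that:
- Docker was installed successfully (version 27.5.1)
- The Docker service is running
- containerd is at version 1.7.27 (the build maintained by Ubuntu)
4. Verify the installation
Run the following command to test whether the GPU is accessible inside a container: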
python
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
python
$ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Tue Jul 15 06:17:17 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.4 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L20 On | 00000000:23:00.0 Off | 0 |
| N/A 31C P0 79W / 350W | 23388MiB / 46068MiB | 0% Default |
| | | N/A |
The Docker container can now access the GPU.
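The test image tag carries a CUDA version (12.4.0), and nvidia-smi reports a CUDA version in its header. A conservative rule of thumb, assumed here rather than taken from NVIDIA's compatibility documentation, is to pick an image whose major.minor version does not exceed the value nvidia-smi reports on the host. `cuda_tag_supported` is a hypothetical helper that sketches that comparison.

```shell
# cuda_tag_supported (hypothetical helper)
#   $1: "CUDA Version" shown by nvidia-smi on the host, e.g. 12.4
#   $2: CUDA version in the image tag, e.g. 12.4.0
# Succeeds when the image's major.minor is not newer than the host's.
cuda_tag_supported() {
  host_major="${1%%.*}"; host_rest="${1#*.}"; host_minor="${host_rest%%.*}"
  img_major="${2%%.*}";  img_rest="${2#*.}";  img_minor="${img_rest%%.*}"
  [ "$img_major" -lt "$host_major" ] ||
    { [ "$img_major" -eq "$host_major" ] && [ "$img_minor" -le "$host_minor" ]; }
}

# Usage:
#   cuda_tag_supported 12.4 12.4.0 && echo "image tag looks compatible"
```

Both arguments need at least a major and a minor component; patch levels are ignored.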
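5. Troubleshooting: containerd version conflict
This is the same "containerd.io : Conflicts: containerd" error encountered in step 3; the purge-and-reinstall fix and the three checks from that step apply here unchanged.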
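6. Change the Docker storage directory
The target storage directory is /data2/docker.
python
# Check the containerd version:
containerd --version
# If the output is containerd.io 1.6.x, the switch succeeded.
# Check the Docker storage path:
docker info | grep "Docker Root Dir"
# It should be /data2/docker
Check: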
python
# containerd --version
containerd github.com/containerd/containerd 1.7.27
# docker info | grep "Docker Root Dir"
Docker Root Dir: /var/lib/docker
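According to the latest output, Docker and GPU support already work, but the configuration still needs adjusting:
Problem 1: the Docker storage directory does not point to /data2/docker
Current setting: Docker Root Dir: /var/lib/docker
The required location is /data2/docker.
Solution
python
# 1. Stop Docker
sudo systemctl stop docker
# 2. Migrate the data (if there is any)
sudo rsync -av /var/lib/docker/ /data2/docker/
# 3. Update the configuration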
sudo tee /etc/docker/daemon.json <<-'EOF'
{
"data-root": "/data2/docker",
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
EOF
# 4. Start Docker again
sudo systemctl start docker
Verify
python
docker info | grep "Docker Root Dir"
Expected output:
python
Docker Root Dir: /data2/docker
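The same check can be scripted so that a deployment step fails early when the data root was not picked up. A minimal sketch: `docker_root_matches` is a hypothetical helper that takes the text of `docker info` and the expected directory.

```shell
# docker_root_matches (hypothetical helper)
#   $1: output of `docker info`
#   $2: expected data root, e.g. /data2/docker
docker_root_matches() {
  actual="$(printf '%s\n' "$1" | sed -n 's/^[[:space:]]*Docker Root Dir:[[:space:]]*//p')"
  [ "$actual" = "$2" ]
}

# Usage on the host:
#   docker_root_matches "$(docker info 2>/dev/null)" /data2/docker || echo "data-root not applied"
```

If the check fails right after editing daemon.json, the usual suspects are a JSON syntax error or a Docker service that was not restarted.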
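7. Test docker commands with GPU access
python
# Method 1: run vLLM directly (no interactive shell)
# --cpu-shares is a relative CPU weight (default 1024); 512 halves the weight under contention, it is not a hard 50% cap
# --gpu-memory-utilization 0.5 lets vLLM use 50% of the GPU memory
docker run --gpus all \
-it \
--cpu-shares 512 \
-v /data2/model_output:/model_output \
vllm/vllm-openai \
--gpu-memory-utilization 0.5
# Method 2: open a shell in the container first, then start vLLM by hand
# 1. Open a shell in the container
docker run --gpus all \
-it \
--name vllm-container \
--rm \
--cpu-shares 768 \
--entrypoint /bin/bash \
-v /data2/model_output:/model_output \
vllm/vllm-openai
# 2. Inside the container, start vLLM manually
python3 -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.3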
python
$ docker run --gpus all \
> -it \
> --cpu-shares 768 \
> --entrypoint /bin/bash \
> -v /data2/model_output:/model_output \
> vllm/vllm-openai
root@8d2bf25cc1c2:/vllm-workspace# ll
total 48
drwxr-xr-x 1 root root 4096 Jul 7 11:38 ./
drwxr-xr-x 1 root root 4096 Jul 14 23:02 ../
drwxr-xr-x 8 root root 4096 Jul 7 11:09 benchmarks/
-rw-r--r-- 1 root root 28292 Jul 7 11:09 collect_env.py
drwxr-xr-x 5 root root 4096 Jul 7 11:09 examples/
drwxr-xr-x 1 root root 4096 Jul 7 11:38 requirements/
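Two details of these multi-line commands are easy to get wrong: a comment cannot follow a trailing backslash (it ends the continuation), and Docker options must come before the image name, because everything after the image is passed to the container's entrypoint. One way to keep a note next to each option is to build the argument list step by step. This is a minimal sketch; `vllm_run_args` is a hypothetical helper and the values are the ones from Method 1.

```shell
# vllm_run_args (hypothetical helper): print the `docker run` arguments one per line.
vllm_run_args() {
  set -- --gpus all                                  # expose every GPU to the container
  set -- "$@" --cpu-shares 512                       # relative CPU weight (default 1024)
  set -- "$@" -v /data2/model_output:/model_output   # host path : container path
  set -- "$@" vllm/vllm-openai                       # image name; Docker options go before it
  set -- "$@" --gpu-memory-utilization 0.5           # after the image: passed to vLLM
  printf '%s\n' "$@"
}

# Usage (word splitting is safe here because no argument contains whitespace):
#   docker run $(vllm_run_args)
```

Printing the list first also makes it easy to review the exact command before running it.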
8. Rename the container
Original container name
python
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6cb08717e162 vllm/vllm-openai "/bin/bash" 20 hours ago Up 20 hours mystifying_goldstine
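Renaming:
A container can be renamed in place with docker rename, even while it is running. Alternatively, stop it and create a new container under the desired name.
Method 1: docker rename (recommended)
python
docker rename mystifying_goldstine vllm-container
Method 2: stop the container and create a new one (the new container starts from scratch; the old one does not keep running)
Stop the current container:
python
docker stop mystifying_goldstine
Start a new container, setting the new name with --name:
python
docker run --gpus all \
-it \
--name vllm-container \
--rm \
--cpu-shares 768 \
--entrypoint /bin/bash \
-v /data2/model_output:/model_output \
vllm/vllm-openai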
After renaming:
python
$ docker rename mystifying_goldstine vllm-container
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6cb08717e162 vllm/vllm-openai "/bin/bash" 20 hours ago Up 20 hours vllm-container