Server - Configuring a PyTorch Development Environment with Docker

Welcome to follow my CSDN: https://spike.blog.csdn.net/

Article URL: https://spike.blog.csdn.net/article/details/148421901

Disclaimer: this article draws on personal knowledge and public materials, and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.


It is recommended to configure the PyTorch development environment with Docker: machines vary widely in configuration and their environments all differ, so a bare-metal install often ends with a training job that still will not start, wasting a great deal of time. Building virtual environments with Docker + Conda (Mamba) covers most tasks.

1. Network Proxy

Downloads from GitHub are often slow; a proxy is recommended to speed them up.

To use a dedicated network proxy:

```bash
export https_proxy=http://xxx:80
export http_proxy=http://xxx:80

# To turn the proxy off again:
unset https_proxy http_proxy
```

xxx is the proxy server's IP address.
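As a quick sanity check (a minimal sketch; the xxx placeholder must be filled in first), confirm that traffic actually goes through the proxy:

```bash
# Expect an HTTP status line such as "HTTP/2 200" if the proxy is working.
curl -sI https://github.com | head -n 1
```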

Alternatively, use a free online proxy, e.g. https://ghproxy.link/:

```bash
# https://ghfast.top
git clone https://ghfast.top/https://github.com/hiyouga/LLaMA-Factory.git   # example
```

Note: free proxies can go offline at any time; check their status before relying on them.
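A one-line liveness check before cloning through the free proxy (a minimal sketch; these endpoints change frequently):

```bash
# A 2xx/3xx status line suggests the proxy is currently up.
curl -sI https://ghfast.top | head -n 1
```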

For the Huggingface environment, use the mirror, see: https://hf-mirror.com/

```bash
export HF_ENDPOINT=https://hf-mirror.com
```

2. Environment Variables

Print the system environment variables:

```bash
printenv
```

Configure the LLM-related environment variables by appending the following to ~/.bashrc:

```bash
export WORK_DIR="xxx"
export TORCH_HOME="$WORK_DIR/torch_home/"
export HF_HOME="$WORK_DIR/huggingface/"
export HUGGINGFACE_TOKEN="xxx"
export MODELSCOPE_CACHE="$WORK_DIR/modelscope_models/"
export MODELSCOPE_API_TOKEN="xxx"
export CUDA_HOME="/usr/local/cuda"
export OMP_NUM_THREADS=64
```
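To apply and verify the variables in the current shell (a minimal check; the xxx placeholders stay as-is until filled in):

```bash
source ~/.bashrc
# Each configured variable should print with the value set above.
printenv | grep -E 'WORK_DIR|TORCH_HOME|HF_HOME|MODELSCOPE_CACHE|CUDA_HOME|OMP_NUM_THREADS'
```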

3. Docker

NVIDIA's images are recommended, since they ship with sensible defaults and a preinstalled environment: https://docker.aityp.com/r/docker.io/nvcr.io/nvidia/pytorch

Pull the Docker image (via a mirror inside China). Use the 24.12-py3 tag; do not use the latest version, which has compatibility problems:

```bash
docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/nvcr.io/nvidia/pytorch:24.12-py3
```

A standard template for starting the container:

```bash
docker run -itd \
  --name [your name] \
  --gpus all \
  --shm-size=128g \
  --memory=256g \
  --cpus=64 \
  --restart=unless-stopped \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v [your path]:[your path] \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  --privileged \
  --network host \
  swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/nvcr.io/nvidia/pytorch:24.12-py3 \
  /bin/bash
```
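Once the container is running, attach to it and confirm the GPUs are visible (a minimal check; [your name] is the container name chosen above):

```bash
docker exec -it [your name] /bin/bash
# Inside the container:
nvidia-smi                                                   # all host GPUs should be listed
python -c "import torch; print(torch.cuda.device_count())"   # the NGC image ships with PyTorch
```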

4. Virtual Environment

Conda or Mamba is recommended; taking Mamba as an example:

```bash
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
```

In manual-download mode, download micro.mamba.pm/install.sh directly (it resolves to a GitHub-hosted script) and prepend the https://ghfast.top/ proxy, as sketched below.
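A hedged sketch of the manual route, assuming the installer resolves to the script in the mamba-org/micromamba-releases repository:

```bash
# Fetch the install script through the proxy, then run it with the login shell.
curl -L "https://ghfast.top/https://raw.githubusercontent.com/mamba-org/micromamba-releases/main/install.sh" \
  -o mamba_install.sh
"${SHELL}" mamba_install.sh
```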

Configure the pip source:

```bash
# Remove the pip configs baked into the Docker image (these take precedence):
rm -rf /usr/pip.conf
rm -rf /root/.config/pip/pip.conf
rm -rf /etc/pip.conf
rm -rf /etc/xdg/pip/pip.conf

# Point pip at a different index:
mkdir -p ~/.pip
cat > ~/.pip/pip.conf <<'EOF'
[global]
no-cache-dir = true
index-url = http://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
EOF
```

Create the torch_def environment and install PyTorch:

```bash
micromamba create -n torch_def python=3.11
micromamba activate torch_def
pip3 install torch torchvision torchaudio --timeout=100
```

Downloads can be slow; --timeout=100 raises pip's socket timeout to avoid spurious failures.

Verify the PyTorch environment:

```python
import torch
print(torch.__version__)          # 2.7.0+cu126
print(torch.cuda.is_available())  # True
```

Install the related Python packages:

```bash
pip install datasets accelerate bitsandbytes peft swanlab sentencepiece trl deepspeed modelscope
pip install -U "huggingface_hub[cli]"
```

Download Huggingface models and datasets, for example:

```bash
huggingface-cli download Qwen/Qwen3-8B --local-dir Qwen/Qwen3-8B
huggingface-cli download --repo-type dataset FreedomIntelligence/medical-o1-reasoning-SFT --local-dir FreedomIntelligence/medical-o1-reasoning-SFT
```

Reference dataset: FreedomIntelligence/medical-o1-reasoning-SFT
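Since MODELSCOPE_CACHE is configured above, the same model can also be pulled from ModelScope when the Huggingface mirror is slow (a hedged sketch; verify the flags against the installed modelscope version):

```bash
# Downloads into MODELSCOPE_CACHE unless --local_dir is given.
modelscope download --model Qwen/Qwen3-8B --local_dir Qwen/Qwen3-8B
```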

Miscellaneous

The adapted Mamba install script mamba_install.sh (with the GitHub proxy applied) is as follows:

```bash
#!/bin/sh

set -eu

# Detect the shell from which the script was called
parent=$(ps -o comm $PPID |tail -1)
parent=${parent#-}  # remove the leading dash that login shells have
case "$parent" in
  # shells supported by `micromamba shell init`
  bash|fish|xonsh|zsh)
    shell=$parent
    ;;
  *)
    # use the login shell (basename of $SHELL) as a fallback
    shell=${SHELL##*/}
    ;;
esac

# Parsing arguments
if [ -t 0 ] ; then
  printf "Micromamba binary folder? [~/.local/bin] "
  read BIN_FOLDER
  printf "Init shell ($shell)? [Y/n] "
  read INIT_YES
  printf "Configure conda-forge? [Y/n] "
  read CONDA_FORGE_YES
fi

# Fallbacks
BIN_FOLDER="${BIN_FOLDER:-${HOME}/.local/bin}"
INIT_YES="${INIT_YES:-yes}"
CONDA_FORGE_YES="${CONDA_FORGE_YES:-yes}"

# Prefix location is relevant only if we want to call `micromamba shell init`
case "$INIT_YES" in
  y|Y|yes)
    if [ -t 0 ]; then
      printf "Prefix location? [~/micromamba] "
      read PREFIX_LOCATION
    fi
    ;;
esac
PREFIX_LOCATION="${PREFIX_LOCATION:-${HOME}/micromamba}"

# Computing artifact location
case "$(uname)" in
  Linux)
    PLATFORM="linux" ;;
  Darwin)
    PLATFORM="osx" ;;
  *NT*)
    PLATFORM="win" ;;
esac

ARCH="$(uname -m)"
case "$ARCH" in
  aarch64|ppc64le|arm64)
      ;;  # pass
  *)
    ARCH="64" ;;
esac

case "$PLATFORM-$ARCH" in
  linux-aarch64|linux-ppc64le|linux-64|osx-arm64|osx-64|win-64)
      ;;  # pass
  *)
    echo "Failed to detect your OS" >&2
    exit 1
    ;;
esac

if [ "${VERSION:-}" = "" ]; then
  RELEASE_URL="https://ghfast.top/https://github.com/mamba-org/micromamba-releases/releases/latest/download/micromamba-${PLATFORM}-${ARCH}"
else
  RELEASE_URL="https://ghfast.top/https://github.com/mamba-org/micromamba-releases/releases/download/${VERSION}/micromamba-${PLATFORM}-${ARCH}"
fi


# Downloading artifact
mkdir -p "${BIN_FOLDER}"
if hash curl >/dev/null 2>&1; then
  curl "${RELEASE_URL}" -o "${BIN_FOLDER}/micromamba" -fsSL --compressed ${CURL_OPTS:-}
elif hash wget >/dev/null 2>&1; then
  wget ${WGET_OPTS:-} -qO "${BIN_FOLDER}/micromamba" "${RELEASE_URL}"
else
  echo "Neither curl nor wget was found" >&2
  exit 1
fi
chmod +x "${BIN_FOLDER}/micromamba"


# Initializing shell
case "$INIT_YES" in
  y|Y|yes)
    case $("${BIN_FOLDER}/micromamba" --version) in
      1.*|0.*)
        shell_arg=-s
        prefix_arg=-p
        ;;
      *)
        shell_arg=--shell
        prefix_arg=--root-prefix
        ;;
    esac
    "${BIN_FOLDER}/micromamba" shell init $shell_arg "$shell" $prefix_arg "$PREFIX_LOCATION"

    echo "Please restart your shell to activate micromamba or run the following:\n"
    echo "  source ~/.bashrc (or ~/.zshrc, ~/.xonshrc, ~/.config/fish/config.fish, ...)"
    ;;
  *)
    echo "You can initialize your shell later by running:"
    echo "  micromamba shell init"
    ;;
esac


# Initializing conda-forge
case "$CONDA_FORGE_YES" in
  y|Y|yes)
    "${BIN_FOLDER}/micromamba" config append channels conda-forge
    "${BIN_FOLDER}/micromamba" config append channels nodefaults
    "${BIN_FOLDER}/micromamba" config set channel_priority strict
    ;;
esac
```