Server - Setting Up a PyTorch Development Environment with Docker

Welcome to follow my CSDN blog: https://spike.blog.csdn.net/

This article: https://spike.blog.csdn.net/article/details/148421901

Disclaimer: this article is based on personal knowledge and public materials and is intended for academic exchange only; discussion is welcome, but reposting is not permitted.


Using Docker to set up the PyTorch development environment is recommended: machine configurations differ considerably and environments vary, so a from-scratch installation often ends with a training job that still fails to start, wasting a great deal of time. Building a virtual environment directly on top of Docker + Conda (Mamba) is sufficient for most tasks.

1. Network Proxy

Downloads from GitHub are often slow, so a proxy is recommended to speed them up.

Use a dedicated network proxy:

bash
export https_proxy=http://xxx:80   # route HTTPS traffic through the proxy
export http_proxy=http://xxx:80    # route HTTP traffic through the proxy

unset https_proxy http_proxy       # disable the proxy when finished

Here xxx is the proxy's IP address.
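
To confirm the proxy variables are in effect, a quick sanity check (any HTTPS URL works; github.com is just an example):

bash
env | grep -i _proxy                      # show the proxy variables currently set
curl -sI https://github.com | head -n 1   # should return an HTTP status line via the proxy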

Alternatively, use a free online proxy, e.g. from https://ghproxy.link/:

bash
# https://ghfast.top
git clone https://ghfast.top/https://github.com/hiyouga/LLaMA-Factory.git   # example

Note: free proxies may stop working at any time, so check their status before relying on them.

For the Hugging Face environment, refer to the mirror: https://hf-mirror.com/

bash
export HF_ENDPOINT=https://hf-mirror.com

2. Environment Variables

Print the system environment variables:

bash
printenv

Configure the LLM-related environment variables by adding the following to ~/.bashrc:

bash
export WORK_DIR="xxx"                                    # root working directory
export TORCH_HOME="$WORK_DIR/torch_home/"                # PyTorch model cache
export HF_HOME="$WORK_DIR/huggingface/"                  # Hugging Face cache
export HUGGINGFACE_TOKEN="xxx"                           # Hugging Face access token
export MODELSCOPE_CACHE="$WORK_DIR/modelscope_models/"   # ModelScope model cache
export MODELSCOPE_API_TOKEN="xxx"                        # ModelScope access token
export CUDA_HOME="/usr/local/cuda"                       # CUDA toolkit location
export OMP_NUM_THREADS=64                                # cap OpenMP thread count
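
After editing ~/.bashrc, reload it and confirm the variables are visible, reusing printenv from above:

bash
source ~/.bashrc
printenv | grep -E "WORK_DIR|TORCH_HOME|HF_HOME|MODELSCOPE|CUDA_HOME"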

3. Docker

Nvidia's images are recommended, since they ship with the default configuration and environment already in place: https://docker.aityp.com/r/docker.io/nvcr.io/nvidia/pytorch

Pull the Docker image (via a China-based mirror). The 24.12-py3 tag is recommended; do not use the latest tag, which has compatibility issues:

bash
docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/nvcr.io/nvidia/pytorch:24.12-py3
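
Once the pull completes, confirm the image is available locally:

bash
docker images | grep "nvidia/pytorch"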

A standard template for starting the Docker container:

bash
docker run -itd \
--name [your name] \
--gpus all \
--shm-size=128g \
--memory=256g \
--cpus=64 \
--restart=unless-stopped \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v [your path]:[your path] \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
--privileged \
--network host \
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/nvcr.io/nvidia/pytorch:24.12-py3 \
/bin/bash
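
After the container starts, attach to it and verify that the GPUs are visible ([your name] is the value passed to --name above):

bash
docker exec -it [your name] /bin/bash
nvidia-smi   # run inside the container; it should list all GPUs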

4. Virtual Environment

Conda or Mamba is recommended; taking Mamba as an example:

bash
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

Manual download mode: fetch micro.mamba.pm/install.sh directly (it resolves to a GitHub release path) and prefix that path with the proxy https://ghfast.top/, as sketched below.
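
A minimal sketch of the manual route, assuming the proxy prefix is applied the same way as in the optimized script in the Other section below:

bash
# download the installer, then prefix its GitHub release URL with the proxy
curl -L micro.mamba.pm/install.sh -o mamba_install.sh
sed -i 's#https://github.com#https://ghfast.top/https://github.com#g' mamba_install.sh
"${SHELL}" mamba_install.sh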

Configure the pip source:

bash
# remove existing pip configs (inside Docker these take precedence)
rm -rf /usr/pip.conf
rm -rf /root/.config/pip/pip.conf
rm -rf /etc/pip.conf
rm -rf /etc/xdg/pip/pip.conf

# configure an alternative index
mkdir -p ~/.pip
vim ~/.pip/pip.conf

# contents of ~/.pip/pip.conf:
[global]
no-cache-dir = true
index-url = http://mirrors.aliyun.com/pypi/simple/
trusted-host = mirrors.aliyun.com
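
To check which configuration pip actually picks up:

bash
pip config list   # should show the index-url from ~/.pip/pip.conf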

Create the torch_def environment and install PyTorch:

bash
micromamba create -n torch_def python=3.11
micromamba activate torch_def   # activate first, so pip installs into the new env
pip3 install torch torchvision torchaudio --timeout=100

Downloads can be slow; --timeout=100 avoids timeouts.
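
To make the longer timeout persistent rather than passing the flag each time, pip can store it in its config:

bash
pip config set global.timeout 100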

Verify the PyTorch environment:

python
import torch
print(torch.__version__)          # 2.7.0+cu126
print(torch.cuda.is_available())  # True
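
The same check can be run as shell one-liners, and the device name confirms the GPU is actually usable (assuming at least one CUDA device):

bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.get_device_name(0))"   # prints the GPU model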

Install the related Python packages:

bash
pip install datasets accelerate bitsandbytes peft swanlab sentencepiece trl deepspeed modelscope
pip install -U "huggingface_hub[cli]"
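
For gated models, log in with the HUGGINGFACE_TOKEN configured in Section 2 (assuming it is set):

bash
huggingface-cli login --token "$HUGGINGFACE_TOKEN"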

Download Hugging Face models and datasets, for reference:

bash
huggingface-cli download Qwen/Qwen3-8B --local-dir Qwen/Qwen3-8B
huggingface-cli download --repo-type dataset FreedomIntelligence/medical-o1-reasoning-SFT --local-dir FreedomIntelligence/medical-o1-reasoning-SFT

Reference dataset: FreedomIntelligence/medical-o1-reasoning-SFT
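
If the direct connection is slow, the download can be combined with the hf-mirror endpoint from Section 1:

bash
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3-8B --local-dir Qwen/Qwen3-8B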

Other

The optimized Mamba installation script mamba_install.sh (the micromamba installer with the GitHub proxy prefix applied) is as follows:

bash
#!/bin/sh

set -eu

# Detect the shell from which the script was called
parent=$(ps -o comm $PPID |tail -1)
parent=${parent#-}  # remove the leading dash that login shells have
case "$parent" in
  # shells supported by `micromamba shell init`
  bash|fish|xonsh|zsh)
    shell=$parent
    ;;
  *)
    # use the login shell (basename of $SHELL) as a fallback
    shell=${SHELL##*/}
    ;;
esac

# Parsing arguments
if [ -t 0 ] ; then
  printf "Micromamba binary folder? [~/.local/bin] "
  read BIN_FOLDER
  printf "Init shell ($shell)? [Y/n] "
  read INIT_YES
  printf "Configure conda-forge? [Y/n] "
  read CONDA_FORGE_YES
fi

# Fallbacks
BIN_FOLDER="${BIN_FOLDER:-${HOME}/.local/bin}"
INIT_YES="${INIT_YES:-yes}"
CONDA_FORGE_YES="${CONDA_FORGE_YES:-yes}"

# Prefix location is relevant only if we want to call `micromamba shell init`
case "$INIT_YES" in
  y|Y|yes)
    if [ -t 0 ]; then
      printf "Prefix location? [~/micromamba] "
      read PREFIX_LOCATION
    fi
    ;;
esac
PREFIX_LOCATION="${PREFIX_LOCATION:-${HOME}/micromamba}"

# Computing artifact location
case "$(uname)" in
  Linux)
    PLATFORM="linux" ;;
  Darwin)
    PLATFORM="osx" ;;
  *NT*)
    PLATFORM="win" ;;
esac

ARCH="$(uname -m)"
case "$ARCH" in
  aarch64|ppc64le|arm64)
      ;;  # pass
  *)
    ARCH="64" ;;
esac

case "$PLATFORM-$ARCH" in
  linux-aarch64|linux-ppc64le|linux-64|osx-arm64|osx-64|win-64)
      ;;  # pass
  *)
    echo "Failed to detect your OS" >&2
    exit 1
    ;;
esac

if [ "${VERSION:-}" = "" ]; then
  RELEASE_URL="https://ghfast.top/https://github.com/mamba-org/micromamba-releases/releases/latest/download/micromamba-${PLATFORM}-${ARCH}"
else
  RELEASE_URL="https://ghfast.top/https://github.com/mamba-org/micromamba-releases/releases/download/${VERSION}/micromamba-${PLATFORM}-${ARCH}"
fi


# Downloading artifact
mkdir -p "${BIN_FOLDER}"
if hash curl >/dev/null 2>&1; then
  curl "${RELEASE_URL}" -o "${BIN_FOLDER}/micromamba" -fsSL --compressed ${CURL_OPTS:-}
elif hash wget >/dev/null 2>&1; then
  wget ${WGET_OPTS:-} -qO "${BIN_FOLDER}/micromamba" "${RELEASE_URL}"
else
  echo "Neither curl nor wget was found" >&2
  exit 1
fi
chmod +x "${BIN_FOLDER}/micromamba"


# Initializing shell
case "$INIT_YES" in
  y|Y|yes)
    case $("${BIN_FOLDER}/micromamba" --version) in
      1.*|0.*)
        shell_arg=-s
        prefix_arg=-p
        ;;
      *)
        shell_arg=--shell
        prefix_arg=--root-prefix
        ;;
    esac
    "${BIN_FOLDER}/micromamba" shell init $shell_arg "$shell" $prefix_arg "$PREFIX_LOCATION"

    echo "Please restart your shell to activate micromamba or run the following:\n"
    echo "  source ~/.bashrc (or ~/.zshrc, ~/.xonshrc, ~/.config/fish/config.fish, ...)"
    ;;
  *)
    echo "You can initialize your shell later by running:"
    echo "  micromamba shell init"
    ;;
esac


# Initializing conda-forge
case "$CONDA_FORGE_YES" in
  y|Y|yes)
    "${BIN_FOLDER}/micromamba" config append channels conda-forge
    "${BIN_FOLDER}/micromamba" config append channels nodefaults
    "${BIN_FOLDER}/micromamba" config set channel_priority strict
    ;;
esac
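
Usage note: because the script only prompts when stdin is a terminal ([ -t 0 ]), it can be run fully non-interactively with all defaults by redirecting stdin:

bash
sh mamba_install.sh < /dev/null   # installs to ~/.local/bin, inits the shell, adds conda-forge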