Deploying VeRL on Ubuntu with a Single 5090 GPU: From Installation to Running

Reference documentation: https://verl.readthedocs.io/en/latest/start/install.html#install-dependencies

Prepare the conda environment

conda create -n verl python==3.12
conda activate verl
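
Before installing anything, it can help to confirm that the shell is really using the interpreter from the new environment. The following is a minimal sketch, not part of the verl documentation; the helper name `python_major_minor` is hypothetical and only parses the text printed by `python3 --version`.

```shell
# Hypothetical helper: print "major.minor" from a "Python X.Y.Z" version string.
python_major_minor() {
    printf '%s\n' "$1" | sed -E 's/^Python ([0-9]+\.[0-9]+).*/\1/'
}

# Compare the active interpreter against the version requested above.
if command -v python3 >/dev/null 2>&1; then
    actual="$(python_major_minor "$(python3 --version 2>&1)")"
    if [ "$actual" = "3.12" ]; then
        echo "OK: python3 is $actual"
    else
        echo "WARNING: python3 is $actual, expected 3.12 (is the verl env active?)"
    fi
fi
```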

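Installation

The install script ships with the verl repository, so run the commands below from the repository root.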

# Make sure you have activated verl conda env
# If you need to run with megatron
bash scripts/install_vllm_sglang_mcore.sh
# Or if you simply need to run with FSDP
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh

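Because of network problems, the script failed to install TransformerEngine, so it has to be installed manually.

Install TransformerEngine manually

1. Clone the repository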

git clone https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine

2. Check out the target version and update the submodules

git checkout v2.6
git submodule update --init --recursive

3. Run the installation

NVTE_FRAMEWORK=pytorch pip3 install --no-deps --no-build-isolation .

The build failed with a CMake error reporting that cuDNN could not be found, even though pip shows that it is installed:

> pip show nvidia-cudnn-cu12
Name: nvidia-cudnn-cu12
Version: 9.10.2.21
Summary: cuDNN runtime libraries
Home-page: https://developer.nvidia.com/cuda-zone
Author: Nvidia CUDA Installer Team
Author-email: compute_installer@nvidia.com
License: LicenseRef-NVIDIA-Proprietary
Location: /home/sw/.conda/envs/verl/lib/python3.12/site-packages
Requires: nvidia-cublas-cu12
Required-by: torch
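
The `Location` field explains the failure: the pip wheel keeps cuDNN inside `site-packages/nvidia/cudnn`, which is not a directory the compiler searches by default. A minimal sketch for deriving that directory from the `pip show` output and checking its contents follows. Both function names are hypothetical, and the `include/cudnn.h` and `lib/libcudnn*.so*` layout is an assumption about how the wheel is packaged.

```shell
# Hypothetical helper: read `pip show nvidia-cudnn-cu12` output on stdin and
# print the directory where the wheel is assumed to keep headers and libraries.
cudnn_root_from_pip_show() {
    awk -F': ' '$1 == "Location" { print $2 "/nvidia/cudnn" }'
}

# Hypothetical helper: report whether the header and at least one shared
# library exist under the given root directory.
check_cudnn_root() {
    root="$1"
    [ -f "$root/include/cudnn.h" ] || { echo "missing: $root/include/cudnn.h"; return 1; }
    ls "$root"/lib/libcudnn*.so* >/dev/null 2>&1 || { echo "missing: $root/lib/libcudnn*.so*"; return 1; }
    echo "found cuDNN under $root"
}

# Usage (inside the verl env):
#   root="$(pip show nvidia-cudnn-cu12 | cudnn_root_from_pip_show)"
#   check_cudnn_root "$root"
```

Deriving the path this way avoids hardcoding the home directory and Python version.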

Set the compiler search paths and rebuild:

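# 1. Clear cached build artifacts
rm -rf build/
rm -rf transformer_engine.egg-info

# 2. Define the base paths
export CONDA_ENV_PATH=/home/sw/.conda/envs/verl
export CUDNN_ROOT=$CONDA_ENV_PATH/lib/python3.12/site-packages/nvidia/cudnn

# 3. Set CPATH (where the compiler looks for header files).
#    The ${VAR:+:$VAR} form avoids a trailing colon when the variable is unset.
export CPATH=$CUDNN_ROOT/include${CPATH:+:$CPATH}

# 4. Set LIBRARY_PATH (where the compiler looks for libraries at link time)
export LIBRARY_PATH=$CUDNN_ROOT/lib${LIBRARY_PATH:+:$LIBRARY_PATH}

# 5. Set LD_LIBRARY_PATH (where the system looks for shared libraries at run time)
export LD_LIBRARY_PATH=$CUDNN_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# 6. Run the installation again
NVTE_FRAMEWORK=pytorch pip3 install --no-deps --no-build-isolation .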
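
If this snippet is re-run in the same shell, the same directory is prepended again each time. A small helper can keep the variables tidy by skipping duplicates and never producing an empty entry, which GCC and the dynamic loader interpret as the current directory. This is a sketch only; `prepend_path` is a hypothetical name, not something provided by conda or TransformerEngine.

```shell
# Hypothetical helper: prepend a directory to a colon-separated path variable,
# skipping duplicates and never leaving an empty entry.
prepend_path() {
    var_name="$1"
    dir="$2"
    # Read the current value of the variable whose name was passed in.
    eval "current=\${$var_name:-}"
    case ":$current:" in
        *":$dir:"*) return 0 ;;   # already present, nothing to do
    esac
    if [ -n "$current" ]; then
        eval "export $var_name=\"\$dir:\$current\""
    else
        eval "export $var_name=\"\$dir\""
    fi
}

# Usage, with CUDNN_ROOT set as in step 2:
#   prepend_path CPATH           "$CUDNN_ROOT/include"
#   prepend_path LIBRARY_PATH    "$CUDNN_ROOT/lib"
#   prepend_path LD_LIBRARY_PATH "$CUDNN_ROOT/lib"
```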

The build and installation then succeeded:

...
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformer_engine
  Building wheel for transformer_engine (pyproject.toml) ... done
  Created wheel for transformer_engine: filename=transformer_engine-2.6.0+c90a7207-cp312-cp312-linux_x86_64.whl size=243922631 sha256=3c4e706de6d5626d641563a7f0bb672aab7dddc76199a9cae9cc8e3f0abf6b5c
  Stored in directory: /tmp/pip-ephem-wheel-cache-fazev1_w/wheels/ac/bc/ce/4c6381eafba27055735110a8656f8ffd1ad0581b962c5c4b2b
Successfully built transformer_engine
Installing collected packages: transformer_engine
Successfully installed transformer_engine-2.6.0+c90a7207
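
A successful pip install does not guarantee that the package imports cleanly, so a quick import check is worthwhile. The sketch below uses a hypothetical `can_import` helper. It is best run from outside the TransformerEngine source directory; otherwise Python may pick up the source tree instead of the installed package.

```shell
# Hypothetical helper: return 0 if the named Python module can be imported.
can_import() {
    python3 -c "import importlib, sys; importlib.import_module(sys.argv[1])" "$1" >/dev/null 2>&1
}

# Check the top-level package and its PyTorch submodule.
for mod in transformer_engine transformer_engine.pytorch; do
    if can_import "$mod"; then
        echo "OK   $mod"
    else
        echo "FAIL $mod"
    fi
done
```

If the second import fails while the first succeeds, the runtime library path (`LD_LIBRARY_PATH`) is a reasonable first thing to check.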

Install Megatron-LM manually (core_v0.13.1)

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_v0.13.1
pip3 install --no-deps .

Install verl

git clone https://github.com/volcengine/verl.git
cd verl
pip install --no-deps -e .

Run verl

Convert the dataset format

python3 examples/data_preprocess/gsm8k.py --local_dataset_path /home/sw/hw/downloads/openai/gsm8k --local_save_dir /home/sw/hw/downloads/gsm8k
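
The training command in the next step reads `train.parquet` and `test.parquet` from the save directory, so it is worth confirming that both files were written before starting a long run. This is a minimal sketch with a hypothetical `require_files` helper.

```shell
# Hypothetical helper: fail with a message unless every given path is an
# existing, non-empty file.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage, with the save directory from the command above:
#   require_files /home/sw/hw/downloads/gsm8k/train.parquet \
#                 /home/sw/hw/downloads/gsm8k/test.parquet
```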

Run the demo

MODEL_PATH="/hub/weights/Qwen/Qwen2.5-0.5B"
TRAIN_DATA="/home/sw/hw/downloads/gsm8k/train.parquet"
TEST_DATA="/home/sw/hw/downloads/gsm8k/test.parquet"
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$TRAIN_DATA \
 data.val_files=$TEST_DATA \
 data.train_batch_size=256 \
 data.max_prompt_length=512 \
 data.max_response_length=512 \
 actor_rollout_ref.model.path=$MODEL_PATH \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.name=vllm \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=$MODEL_PATH \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=console \
 trainer.val_before_train=False \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log
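
The three batch-size settings in this command are related. The arithmetic below is a rough sketch of that relationship. It assumes that verl divides `ppo_mini_batch_size` evenly across GPUs and that each prompt yields one response; neither assumption comes from this article.

```shell
# Values copied from the command above.
train_batch_size=256
ppo_mini_batch_size=64
ppo_micro_batch_size_per_gpu=4
n_gpus=1

# Optimizer updates per rollout batch (per PPO epoch).
updates_per_batch=$(( train_batch_size / ppo_mini_batch_size ))

# Micro-batches accumulated per update on each GPU.
grad_accum_steps=$(( ppo_mini_batch_size / n_gpus / ppo_micro_batch_size_per_gpu ))

echo "updates per rollout batch: $updates_per_batch"
echo "gradient accumulation steps per update: $grad_accum_steps"
```

If the run hits out-of-memory errors on a single card, lowering the micro-batch size should increase the accumulation count without changing the effective mini-batch.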