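[Lessons Learned] Installing Microsoft DeepSpeed from Source on Ubuntu

1. Background

The goal is to use DeepSpeed for distributed training of large models across multiple servers.

2. Installation

2.1 Check the GPU parameters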

~$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
(8, 0)
~$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_properties(torch.device('cuda')))"
_CudaDeviceProperties(name='NVIDIA A800 80GB PCIe', major=8, minor=0, total_memory=81050MB, multi_processor_count=108)
~$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_arch_list())"
['sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_37', 'sm_90', 'compute_37']
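
The capability tuple and the architecture list printed above are the two values that matter for the build step later on. As a minimal sketch (the helper names `capability_to_arch` and `is_supported` are hypothetical and not part of PyTorch), the following converts a tuple such as `(8, 0)` into the `"8.0"` form used by `TORCH_CUDA_ARCH_LIST` and checks that the installed PyTorch build lists the matching `sm_` target. The values are hard-coded from the output above so the snippet runs without a GPU; in practice they would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
def capability_to_arch(capability):
    """Turn a (major, minor) compute capability into a string such as "8.0"."""
    major, minor = capability
    return f"{major}.{minor}"


def is_supported(capability, arch_list):
    """Return True if arch_list contains the sm_XY target for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


# Values copied from the command output above (A800, compute capability 8.0).
capability = (8, 0)
arch_list = ["sm_50", "sm_60", "sm_61", "sm_70", "sm_75",
             "sm_80", "sm_86", "sm_37", "sm_90", "compute_37"]

print(capability_to_arch(capability))
print(is_supported(capability, arch_list))
```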

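2.2 Install from source

2.2.1 Create a virtual environment

Create a dedicated Anaconda environment for DeepSpeed by cloning an existing one (here, the peft environment):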

~$ conda create -n deepspeed --clone peft

2.2.2 Activate the environment

~$ conda activate deepspeed

2.2.3 Install Transformers from source

Following the official documentation, install Transformers with the command below:

~$ pip install git+https://github.com/huggingface/transformers

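2.2.4 Install DeepSpeed from source

Set TORCH_CUDA_ARCH_LIST to match the actual GPU. The compute capability (8, 0) reported in section 2.1 corresponds to "8.0".

To offload optimizer state to the CPU (CPU Offload), set DS_BUILD_CPU_ADAM=1.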

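To use NVMe Offload, set DS_BUILD_AIO=1 and install the libaio-dev package system-wide. DS_BUILD_UTILS=1, which the command below uses, pre-builds DeepSpeed's utils op and does not by itself enable NVMe Offload.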
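
The build variables in this section can also be assembled programmatically, which helps when the same install is repeated on several servers. The sketch below is one possible way to do it; the function name `deepspeed_build_env` and its parameters are hypothetical, and only the variable names `TORCH_CUDA_ARCH_LIST`, `DS_BUILD_CPU_ADAM`, and `DS_BUILD_UTILS` come from the install command in this section.

```python
import os


def deepspeed_build_env(arch, cpu_adam=False, utils=False, base_env=None):
    """Return a copy of the environment with the DeepSpeed build variables set."""
    # Start from the current environment unless a base mapping is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = arch
    if cpu_adam:
        env["DS_BUILD_CPU_ADAM"] = "1"
    if utils:
        env["DS_BUILD_UTILS"] = "1"
    return env


# An empty base environment keeps the printed output limited to the build variables.
env = deepspeed_build_env("8.0", cpu_adam=True, utils=True, base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when invoking `pip install .` from the DeepSpeed checkout; in that case, leave `base_env` unset so that `PATH` and the conda variables are inherited.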

~$ git clone https://github.com/microsoft/DeepSpeed/
Cloning into 'DeepSpeed'...
remote: Enumerating objects: 45020, done.
remote: Counting objects: 100% (3618/3618), done.
remote: Compressing objects: 100% (413/413), done.
remote: Total 45020 (delta 3387), reused 3299 (delta 3202), pack-reused 41402
Receiving objects: 100% (45020/45020), 207.74 MiB | 14.32 MiB/s, done.
Resolving deltas: 100% (32479/32479), done.
Updating files: 100% (1554/1554), done.
~$ cd DeepSpeed/
~$ TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
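
Because the command above pipes its output through `tee build.log`, a failed build can be triaged from that file afterwards. The following is a rough heuristic sketch, not a DeepSpeed tool: the function name `find_error_lines` and the keyword pattern are assumptions, and the sample text is synthetic rather than real build output.

```python
import re

# Heuristic: flag lines that contain the whole word "error" or "fatal".
ERROR_PATTERN = re.compile(r"\b(error|fatal)\b", re.IGNORECASE)


def find_error_lines(log_text):
    """Return (line_number, line) pairs whose text mentions an error."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Synthetic sample for illustration only; a real run would read build.log instead.
sample = "\n".join([
    "step 1 ok",
    "error: something failed",
    "step 3 ok",
])
for number, line in find_error_lines(sample):
    print(number, line)
```

To apply it to the real log, read the file first, for example with `pathlib.Path("build.log").read_text(errors="replace")`, and pass the result to `find_error_lines`.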

Confirm that the installation succeeded:

~$ pip show deepspeed
Name: deepspeed
Version: 0.14.3+fbdf0eaf
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /public/home/acc5trotmy/.conda/envs/deepspeed/lib/python3.10/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
Required-by: 
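
The same check can be made from Python, which is convenient inside a training script or a setup test. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the package list is only an example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version of each package in the active environment, or None if missing.
for package in ("deepspeed", "transformers", "torch"):
    print(package, installed_version(package))
```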

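Running the deepspeed command without arguments prints its usage. The async_io warnings show that libaio-dev is not installed on this machine: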

~$ deepspeed 
[2024-04-24 12:05:52,629] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /public/home/acc5trotmy/.triton/autotune: No such file or directory
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
usage: deepspeed [-h] [-H HOSTFILE] [-i INCLUDE] [-e EXCLUDE] [--num_nodes NUM_NODES] [--min_elastic_nodes MIN_ELASTIC_NODES]
                 [--max_elastic_nodes MAX_ELASTIC_NODES] [--num_gpus NUM_GPUS] [--master_port MASTER_PORT] [--master_addr MASTER_ADDR]
                 [--launcher LAUNCHER] [--launcher_args LAUNCHER_ARGS] [--module] [--no_python] [--no_local_rank] [--no_ssh_check] [--force_multi]
                 [--save_pid] [--enable_each_rank_log ENABLE_EACH_RANK_LOG] [--autotuning {tune,run}] [--elastic_training] [--bind_cores_to_rank]
                 [--bind_core_list BIND_CORE_LIST] [--ssh_port SSH_PORT]
                 user_script ...
deepspeed: error: the following arguments are required: user_script, user_args
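
For the multi-server training mentioned in the background, the `-H HOSTFILE` option in the usage text above takes a file that lists one node per line in the form `<hostname> slots=<number of GPUs>`. The sketch below generates such a file's contents; the function name `make_hostfile` and the host names `worker-1` and `worker-2` are hypothetical placeholders.

```python
def make_hostfile(slots_by_host):
    """Render hostfile text with one "<hostname> slots=<gpu count>" line per node."""
    lines = [f"{host} slots={slots}" for host, slots in slots_by_host.items()]
    return "\n".join(lines) + "\n"


# Hypothetical host names; replace them with nodes reachable over passwordless SSH.
hostfile_text = make_hostfile({"worker-1": 8, "worker-2": 8})
print(hostfile_text, end="")
```

After saving the text to a file named `hostfile`, a launch would look like `deepspeed -H hostfile train.py`, where `train.py` stands in for your own training script.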