大模型训练框架DeepSpeed使用入门(1): 训练设置

文章目录


官方文档直接抄过来,留个笔记。
https://deepspeed.readthedocs.io/en/latest/initialize.html

使用案例来自:
https://github.com/OvJat/DeepSpeedTutorial


大模型训练的痛点是模型参数过大,动辄上百亿,如果单靠单个GPU来完成训练基本不可能。所以需要多卡或者分布式训练来完成这项工作。

DeepSpeed是由Microsoft提供的分布式训练工具,旨在支持更大规模的模型和提供更多的优化策略和工具。对于更大模型的训练来说,DeepSpeed提供了更多策略,例如:Zero、Offload等。

本文简单介绍下如何使用DeepSpeed。


一、安装

python 复制代码
pip install deepspeed

二、训练设置

Step1 第一步参数解析

DeepSpeed 使用 argparse 来应用控制台的设置,使用

python 复制代码
deepspeed.add_config_arguments()

可以将DeepSpeed内置的参数增加到我们自己的应用参数解析中。

python 复制代码
parser = argparse.ArgumentParser(description='My training script.')
parser.add_argument('--local_rank', type=int, default=-1,
                    help='local rank passed from distributed launcher')
# Include DeepSpeed configuration arguments
parser = deepspeed.add_config_arguments(parser)
cmd_args = parser.parse_args()

Step2 初始化后端

与Step3中的 deepspeed.initialize() 不同,

直接调用即可。

一般发生在以下场景

when using model parallelism, pipeline parallelism, or certain data loader scenarios.

在Step3的initialize前,进行调用

python 复制代码
deepspeed.init_distributed()

Step3 训练初始化

首先调用 deepspeed.initialize() 进行初始化,是整个调用DeepSpeed训练的入口。

调用后,如果分布式后端没有被初始化后,此时会初始化分布式后端。

使用案例:

python 复制代码
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=net,
                                                     model_parameters=net.parameters(),
                                                     training_data=ds)

API如下:

python 复制代码
def initialize(args=None,
               model: torch.nn.Module = None,
               optimizer: Optional[Union[Optimizer, DeepSpeedOptimizerCallable]] = None,
               model_parameters: Optional[torch.nn.Module] = None,
               training_data: Optional[torch.utils.data.Dataset] = None,
               lr_scheduler: Optional[Union[_LRScheduler, DeepSpeedSchedulerCallable]] = None,
               distributed_port: int = TORCH_DISTRIBUTED_DEFAULT_PORT,
               mpu=None,
               dist_init_required: Optional[bool] = None,
               collate_fn=None,
               config=None,
               config_params=None):
    """Initialize the DeepSpeed Engine.

    Arguments:
        args: an object containing local_rank and deepspeed_config fields.
            This is optional if `config` is passed.

        model: Required: nn.module class before apply any wrappers

        optimizer: Optional: a user defined Optimizer or Callable that returns an Optimizer object.
            This overrides any optimizer definition in the DeepSpeed json config.

        model_parameters: Optional: An iterable of torch.Tensors or dicts.
            Specifies what Tensors should be optimized.

        training_data: Optional: Dataset of type torch.utils.data.Dataset

        lr_scheduler: Optional: Learning Rate Scheduler Object or a Callable that takes an Optimizer and returns a Scheduler object.
            The scheduler object should define a get_lr(), step(), state_dict(), and load_state_dict() methods

        distributed_port: Optional: Master node (rank 0)'s free port that needs to be used for communication during distributed training

        mpu: Optional: A model parallelism unit object that implements
            get_{model,data}_parallel_{rank,group,world_size}()

        dist_init_required: Optional: None will auto-initialize torch distributed if needed,
            otherwise the user can force it to be initialized or not via boolean.

        collate_fn: Optional: Merges a list of samples to form a
            mini-batch of Tensor(s).  Used when using batched loading from a
            map-style dataset.

        config: Optional: Instead of requiring args.deepspeed_config you can pass your deepspeed config
            as an argument instead, as a path or a dictionary.

        config_params: Optional: Same as `config`, kept for backwards compatibility.

    Returns:
        A tuple of ``engine``, ``optimizer``, ``training_dataloader``, ``lr_scheduler``

        * ``engine``: DeepSpeed runtime engine which wraps the client model for distributed training.

        * ``optimizer``: Wrapped optimizer if a user defined ``optimizer`` is supplied, or if
          optimizer is specified in json config else ``None``.

        * ``training_dataloader``: DeepSpeed dataloader if ``training_data`` was supplied,
          otherwise ``None``.

        * ``lr_scheduler``: Wrapped lr scheduler if user ``lr_scheduler`` is passed, or
          if ``lr_scheduler`` specified in JSON configuration. Otherwise ``None``.
    """

三、训练代码展示

python 复制代码
def parse_arguments():
    import argparse
    parser = argparse.ArgumentParser(description='deepspeed training script.')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from distributed launcher')
    # Include DeepSpeed configuration arguments
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    return args


def train():
    args = parse_arguments()

    # init distributed
    deepspeed.init_distributed()

    # init model
    model = MyClassifier(3, 100, ch_multi=128)

    # init dataset
    ds = MyDataset((3, 512, 512), 100, sample_count=int(1e6))

    # init engine
    engine, optimizer, training_dataloader, lr_scheduler = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
        training_data=ds,
        # config=deepspeed_config,
    )

    # load checkpoint
    engine.load_checkpoint("./data/checkpoints/MyClassifier/")

    # train
    last_time = time.time()
    loss_list = []
    echo_interval = 10

    engine.train()
    for step, (xx, yy) in enumerate(training_dataloader):
        step += 1
        xx = xx.to(device=engine.device, dtype=torch.float16)
        yy = yy.to(device=engine.device, dtype=torch.long).reshape(-1)

        outputs = engine(xx)
        loss = tnf.cross_entropy(outputs, yy)
        engine.backward(loss)
        engine.step()
        loss_list.append(loss.detach().cpu().numpy())

        if step % echo_interval == 0:
            loss_avg = np.mean(loss_list[-echo_interval:])
            used_time = time.time() - last_time
            time_p_step = used_time / echo_interval
            if args.local_rank == 0:
                logging.info(
                    "[Train Step] Step:{:10d}  Loss:{:8.4f} | Time/Batch: {:6.4f}s",
                    step, loss_avg, time_p_step,
                )
            last_time = time.time()
    # save checkpoint
    engine.save_checkpoint("./data/checkpoints/MyClassifier/")

最后~

码字不易~~

独乐不如众乐~~

如有帮助,欢迎点赞+收藏~~


相关推荐
大模型铲屎官2 小时前
【深度学习-Day 23】框架实战:模型训练与评估核心环节详解 (MNIST实战)
人工智能·pytorch·python·深度学习·大模型·llm·mnist
一 铭9 小时前
Github Copilot新特性:Copilot Spaces-成为某个主题的专家
人工智能·大模型·llm
致Great11 小时前
Gemini开源项目DeepResearch:基于LangGraph的智能研究代理技术原理与实现
大模型
般若Neo1 天前
大模型高效提示词Prompt编写指南
大模型·prompt·提示词
小阿鑫1 天前
记录第一次公司内部分享:如何基于大模型搭建企业+AI业务
大模型·llm·agent·大模型落地·ai落地·mcp·mcpserver
Eastmount1 天前
[论文阅读] (38)基于大模型的威胁情报分析与知识图谱构建论文总结(读书笔记)
论文阅读·人工智能·大模型·知识图谱·威胁情报
百里香酚兰2 天前
【AI学习笔记】Coze工作流写入飞书多维表格(即:多维表格飞书官方插件使用教程)
笔记·学习·大模型·飞书·pe·coze
InternLM2 天前
基于InternLM的情感调节大师FunGPT
大模型·大语言模型·大模型应用·书生