[Framework] Simplifying Multi-GPU Training: An Introduction to HuggingFace accelerate

HuggingFace's accelerate library lets you enable DDP training by changing only a few lines of code, and it also supports mixed-precision training and TPU training (it even supports DeepSpeed).

accelerate supports training on CPU, a single GPU (or TPU), and multiple GPUs (or TPUs) in DDP mode, with fp32 or fp16 precision, among other configurations.
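
In a typical PyTorch training loop, the integration boils down to a handful of lines. Here is a minimal sketch, where model, optimizer, train_loader, and loss_fn stand in for your own objects:

from accelerate import Accelerator

accelerator = Accelerator()  # detects CPU / single GPU / multi-GPU automatically

# wrap the existing objects; accelerate moves them to the right device(s)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

for data, target in train_loader:
    output = model(data)            # no manual .to(device) calls needed
    loss = loss_fn(output, target)
    accelerator.backward(loss)      # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()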

Installation

pip install accelerate
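
After installing, you can verify the setup and see what hardware accelerate detects with:

accelerate env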

Usage

The code for single-GPU and multi-GPU training with accelerate is identical, except that in the single-GPU case the gather_for_metrics() call is not strictly needed: in DDP mode each process evaluates its own shard of the test set, and gather_for_metrics() collects the per-process results while dropping the duplicate samples that are padded in to make the dataset evenly divisible. To keep the code the same in both cases, gather_for_metrics() is kept here. Below is a sample script main.py that trains a handwritten-digit classifier on the MNIST dataset with accelerate.

import datetime
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

#======================================================================
# import accelerate
from accelerate import Accelerator
from accelerate.utils import set_seed
#======================================================================


class BasicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        self.act = F.relu

    def forward(self, x):
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.act(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def main(epochs,
         lr=1e-3,
         batch_size=1024,
         ckpt_dir="ckpts",
         ckpt_path="checkpoint.pt",
         mixed_precision="no",  # "fp16" or "bf16"
         ):

    os.makedirs(ckpt_dir, exist_ok=True)  # exist_ok avoids a race between DDP processes

    ckpt_path = os.path.join(ckpt_dir, ckpt_path)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean/std
    ])

    train_dset = datasets.MNIST('data', train=True, download=True, transform=transform)
    test_dset = datasets.MNIST('data', train=False, transform=transform)

    train_loader = torch.utils.data.DataLoader(train_dset, shuffle=True, batch_size=batch_size, num_workers=2)
    test_loader = torch.utils.data.DataLoader(test_dset, shuffle=False, batch_size=batch_size, num_workers=2)

    model = BasicNet()
    optimizer = optim.AdamW(model.parameters(), lr=lr)

    #======================================================================
    # initialize accelerator and auto move data/model to accelerator.device
    set_seed(42)
    accelerator = Accelerator(mixed_precision=mixed_precision)
    accelerator.print(f'device {str(accelerator.device)} is used!')
    # Send everything through `accelerator.prepare`
    train_loader, test_loader, model, optimizer = accelerator.prepare(
        train_loader, test_loader, model, optimizer
    )
    #======================================================================


    optimizer.zero_grad()

    for epoch in range(epochs):
        model.train()  # re-enable train mode each epoch, since eval() is called below
        for batch_idx, (data, target) in enumerate(train_loader):
            output = model(data)
            loss = F.nll_loss(output, target)

            #======================================================================
            # attention here: accelerator.backward() replaces loss.backward()
            accelerator.backward(loss)
            #======================================================================

            optimizer.step()
            optimizer.zero_grad()

        model.eval()
        correct = 0
        with torch.no_grad():
            for data, target in test_loader:
                output = model(data)
                pred = output.argmax(dim=1, keepdim=True)

                #======================================================================
                # gather predictions/targets from all GPUs (needed in DDP mode)
                pred = accelerator.gather_for_metrics(pred)
                target = accelerator.gather_for_metrics(target)
                #======================================================================

                correct += pred.eq(target.view_as(pred)).sum().item()

            eval_metric = 100. * correct / len(test_loader.dataset)
        #======================================================================
        #print logs and save ckpt  
        accelerator.wait_for_everyone()
        nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        accelerator.print(f"epoch【{epoch}】@{nowtime} --> eval_accuracy= {eval_metric:.2f}%")
        net_dict = accelerator.get_state_dict(model)
        accelerator.save(net_dict, ckpt_path+"_"+str(epoch))
        #======================================================================


if __name__ == '__main__':
    # mixed_precision can be "fp16" or "bf16"

    main(epochs=5,
         lr=1e-4,
         batch_size=1024,
         ckpt_dir="ckpts",
         ckpt_path="checkpoint.pt",
         mixed_precision="no")

Running on a single GPU

Simply running CUDA_VISIBLE_DEVICES=0 python main.py runs the script on a single GPU; CUDA_VISIBLE_DEVICES can be set to any other GPU id. (An accelerate-based launch is also possible, as shown after the results below.)

Result:

device cuda is used!
epoch【0】@2024-05-20 11:46:21 --> eval_accuracy= 89.84%
epoch【1】@2024-05-20 11:46:27 --> eval_accuracy= 93.44%
epoch【2】@2024-05-20 11:46:32 --> eval_accuracy= 95.52%
epoch【3】@2024-05-20 11:46:39 --> eval_accuracy= 96.55%
epoch【4】@2024-05-20 11:46:44 --> eval_accuracy= 97.07%
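
accelerate launch also works as a uniform entry point for the single-GPU case (if no config file has been created yet, it falls back to defaults with a warning), for example:

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes 1 main.py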

Running on multiple GPUs

First, a default_config.yaml file must be generated under ~/.cache/huggingface/accelerate. You can run

accelerate config

in the terminal for an interactive configuration, but that is fairly involved. Alternatively, the following code generates a simple configuration.

import os
from accelerate.utils import write_basic_config
write_basic_config() # Write a config file
os._exit(0) # Restart the notebook to reload info from the latest config file 
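
write_basic_config also accepts a few arguments, e.g. mixed_precision and save_location, so a non-default configuration can be written directly; a small sketch (the file name here is arbitrary):

from accelerate.utils import write_basic_config
# write an fp16 config to a custom path instead of the default location
write_basic_config(mixed_precision="fp16", save_location="my_config.yaml")

A custom file can then be selected at launch time with accelerate launch --config_file my_config.yaml.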

The generated default_config.yaml is as follows:

{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}

To allow several DDP jobs to run on the same machine, it is recommended to also set main_process_port. If you want to use 2 GPUs, change num_processes to 2.
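
For example, with 2 GPUs the relevant entries would become (41011 is just an arbitrary free port):

  "main_process_port": 41011,
  "num_processes": 2,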

Method 1: run with the default_config

CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py

Result

device cuda:0 is used!
epoch【0】@2024-05-20 12:11:38 --> eval_accuracy= 84.74%
epoch【1】@2024-05-20 12:11:41 --> eval_accuracy= 90.13%
epoch【2】@2024-05-20 12:11:44 --> eval_accuracy= 92.16%
epoch【3】@2024-05-20 12:11:48 --> eval_accuracy= 93.28%
epoch【4】@2024-05-20 12:11:51 --> eval_accuracy= 94.11%

Method 2: default_config plus manual overrides

The settings that usually need overriding are main_process_port, num_processes, and the GPU ids.

CUDA_VISIBLE_DEVICES=0,1 accelerate launch --main_process_port 41011 --num_processes 2 main.py

Result

device cuda:0 is used!
epoch【0】@2024-05-20 12:11:38 --> eval_accuracy= 84.74%
epoch【1】@2024-05-20 12:11:41 --> eval_accuracy= 90.13%
epoch【2】@2024-05-20 12:11:44 --> eval_accuracy= 92.16%
epoch【3】@2024-05-20 12:11:48 --> eval_accuracy= 93.28%
epoch【4】@2024-05-20 12:11:51 --> eval_accuracy= 94.11%
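
Depending on your accelerate version, the GPU ids can also be selected with the --gpu_ids launch flag instead of CUDA_VISIBLE_DEVICES; an equivalent invocation would be:

accelerate launch --main_process_port 41011 --num_processes 2 --gpu_ids 0,1 main.py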

Method 3: launch with PyTorch

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node 2 \
    --use_env \
    --master_port 41011 \
    main.py

Result

[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] 
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] *****************************************
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] *****************************************
device cuda:0 is used!
epoch【0】@2024-05-20 11:26:20 --> eval_accuracy= 84.13%
epoch【1】@2024-05-20 11:26:24 --> eval_accuracy= 90.28%
epoch【2】@2024-05-20 11:26:27 --> eval_accuracy= 92.35%
epoch【3】@2024-05-20 11:26:31 --> eval_accuracy= 93.59%
epoch【4】@2024-05-20 11:26:34 --> eval_accuracy= 94.52%
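
Note that torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun, which exports the rank environment variables by default (so --use_env is not needed); the equivalent command would be:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 --master_port 41011 main.py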

Method 4: launch from a notebook

from accelerate import notebook_launcher
args = dict(
    epochs = 5,
    lr = 1e-4,
    batch_size= 1024,
    ckpt_dir = "ckpts",
    ckpt_path = "checkpoint.pt",
    mixed_precision="no").values()

notebook_launcher(main, args, num_processes=2, use_port="41011")

Run (this snippet replaces the if __name__ == '__main__': block of main.py; note that notebook_launcher raises an error if CUDA has already been initialized in the current process):

CUDA_VISIBLE_DEVICES=0,1 python main.py

Result

Launching training on 2 GPUs.
device cuda:0 is used!
epoch【0】@2024-05-20 12:09:43 --> eval_accuracy= 84.69%
epoch【1】@2024-05-20 12:09:47 --> eval_accuracy= 90.36%
epoch【2】@2024-05-20 12:09:51 --> eval_accuracy= 92.13%
epoch【3】@2024-05-20 12:09:54 --> eval_accuracy= 93.20%
epoch【4】@2024-05-20 12:09:57 --> eval_accuracy= 94.03%
