HuggingFace's accelerate library lets you switch a training script to DDP with only a few lines of code changes, and it also supports mixed-precision training, TPU training, and even DeepSpeed. Supported training modes include CPU, single GPU/TPU, multi-GPU/TPU DDP, and fp32/fp16/bf16.
Installation
pip install accelerate
Usage
The code for single-GPU and multi-GPU training with accelerate is identical, except that in single-GPU training the gather_for_metrics() call is not needed to aggregate results across processes. To keep the code identical in both settings, gather_for_metrics() is retained here
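Conceptually, gathering metrics means concatenating each process's shard of predictions before scoring; here is a single-process sketch of that idea using plain torch.cat over simulated shards (gather_for_metrics additionally drops the duplicate samples that the distributed sampler pads onto the last batch, which plain gathering would not):

```python
import torch

# Two simulated DDP ranks, each scoring only its own shard of the test set.
shard_preds = [torch.tensor([1, 0]), torch.tensor([2, 2])]
shard_targets = [torch.tensor([1, 1]), torch.tensor([2, 0])]

# What gathering across ranks conceptually produces: one full tensor each.
pred = torch.cat(shard_preds)
target = torch.cat(shard_targets)
correct = pred.eq(target).sum().item()
print(correct)  # 2 of the 4 predictions are correct
```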
。Below is a sample script main.py that trains a handwritten-digit classifier on the MNIST dataset with accelerate.
```python
import datetime
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

#======================================================================
# import accelerate
from accelerate import Accelerator
from accelerate.utils import set_seed
#======================================================================


class BasicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        self.act = F.relu

    def forward(self, x):
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.act(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def main(epochs,
         lr=1e-3,
         batch_size=1024,
         ckpt_dir="ckpts",
         ckpt_path="checkpoint.pt",
         mixed_precision="no",  # 'fp16' or 'bf16'
         ):
    if not os.path.exists(ckpt_dir):
        os.makedirs(ckpt_dir)
    ckpt_path = os.path.join(ckpt_dir, ckpt_path)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    train_dset = datasets.MNIST('data', train=True, download=True, transform=transform)
    test_dset = datasets.MNIST('data', train=False, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_dset, shuffle=True, batch_size=batch_size, num_workers=2)
    test_loader = torch.utils.data.DataLoader(test_dset, shuffle=False, batch_size=batch_size, num_workers=2)

    model = BasicNet()
    optimizer = optim.AdamW(model.parameters(), lr=lr)

    #======================================================================
    # initialize accelerator and auto move data/model to accelerator.device
    set_seed(42)
    accelerator = Accelerator(mixed_precision=mixed_precision)
    accelerator.print(f'device {str(accelerator.device)} is used!')
    # Send everything through `accelerator.prepare`
    train_loader, test_loader, model, optimizer = accelerator.prepare(
        train_loader, test_loader, model, optimizer
    )
    #======================================================================

    for epoch in range(epochs):
        # train (re-enter train mode each epoch, since eval mode is set below)
        model.train()
        optimizer.zero_grad()
        for batch_idx, (data, target) in enumerate(train_loader):
            output = model(data)
            loss = F.nll_loss(output, target)
            #======================================================================
            # attention here!
            accelerator.backward(loss)
            #======================================================================
            optimizer.step()
            optimizer.zero_grad()

        # evaluate
        model.eval()
        correct = 0
        with torch.no_grad():
            for data, target in test_loader:
                output = model(data)
                pred = output.argmax(dim=1, keepdim=True)
                #======================================================================
                # gather data from multi-gpus (used when in ddp mode)
                pred = accelerator.gather_for_metrics(pred)
                target = accelerator.gather_for_metrics(target)
                #======================================================================
                correct += pred.eq(target.view_as(pred)).sum().item()
        eval_metric = 100. * correct / len(test_loader.dataset)

        #======================================================================
        # print logs and save ckpt
        accelerator.wait_for_everyone()
        nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        accelerator.print(f"epoch【{epoch}】@{nowtime} --> eval_accuracy= {eval_metric:.2f}%")
        net_dict = accelerator.get_state_dict(model)
        accelerator.save(net_dict, ckpt_path + "_" + str(epoch))
        #======================================================================


if __name__ == '__main__':
    # mixed_precision='fp16' or 'bf16'
    main(epochs=5,
         lr=1e-4,
         batch_size=1024,
         ckpt_dir="ckpts",
         ckpt_path="checkpoint.pt",
         mixed_precision="no")
```
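accelerator.get_state_dict unwraps the DDP-wrapped model before extracting its weights, and accelerator.save is essentially torch.save, so the checkpoints written above are plain state_dict files that ordinary torch.load can restore. A minimal save/restore sketch, using a stand-in nn.Linear instead of BasicNet and a temp directory instead of ckpts/:

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for BasicNet
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt_0")
torch.save(model.state_dict(), path)  # roughly what accelerator.save does

restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location="cpu"))
assert torch.equal(model.weight, restored.weight)
```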
Single-GPU run
Running
CUDA_VISIBLE_DEVICES=0 python main.py
pins the script to a single GPU; CUDA_VISIBLE_DEVICES can be set to any other GPU id.
Result:
device cuda is used!
epoch【0】@2024-05-20 11:46:21 --> eval_accuracy= 89.84%
epoch【1】@2024-05-20 11:46:27 --> eval_accuracy= 93.44%
epoch【2】@2024-05-20 11:46:32 --> eval_accuracy= 95.52%
epoch【3】@2024-05-20 11:46:39 --> eval_accuracy= 96.55%
epoch【4】@2024-05-20 11:46:44 --> eval_accuracy= 97.07%
Multi-GPU run
First, a default_config.yaml file must be generated under ~/.cache/huggingface/accelerate. You can configure it interactively in the terminal with
accelerate config
but that walks through many options. The following code generates a simple default configuration instead.
```python
import os
from accelerate.utils import write_basic_config

write_basic_config()  # Write a config file
os._exit(0)  # Restart the notebook to reload info from the latest config file
```
The generated default_config.yaml looks like this:
```json
{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}
```
To be able to run several DDP jobs at the same time, it is recommended to also set main_process_port. If you want to train on 2 GPUs, change num_processes to 2.
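Since write_basic_config stores JSON-formatted content inside default_config.yaml, the file can also be edited programmatically with the stdlib json module. A sketch on a trimmed-down copy of the config (the port number is just a hypothetical free port):

```python
import json

# A trimmed-down copy of the generated default_config.yaml (JSON-formatted).
config_text = '{"distributed_type": "MULTI_GPU", "num_processes": 8, "mixed_precision": "no"}'
cfg = json.loads(config_text)

# The two adjustments suggested above: pin a port and match the GPU count.
cfg["main_process_port"] = 41011  # hypothetical free port
cfg["num_processes"] = 2

print(json.dumps(cfg, indent=2))
```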
Method 1: run with the default_config
CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py
Result
device cuda:0 is used!
epoch【0】@2024-05-20 12:11:38 --> eval_accuracy= 84.74%
epoch【1】@2024-05-20 12:11:41 --> eval_accuracy= 90.13%
epoch【2】@2024-05-20 12:11:44 --> eval_accuracy= 92.16%
epoch【3】@2024-05-20 12:11:48 --> eval_accuracy= 93.28%
epoch【4】@2024-05-20 12:11:51 --> eval_accuracy= 94.11%
Method 2: default_config plus manual overrides
The settings typically overridden on the command line are main_process_port, num_processes, and the GPU ids.
CUDA_VISIBLE_DEVICES=0,1 accelerate launch --main_process_port 41011 --num_processes 2 main.py
Result
device cuda:0 is used!
epoch【0】@2024-05-20 12:11:38 --> eval_accuracy= 84.74%
epoch【1】@2024-05-20 12:11:41 --> eval_accuracy= 90.13%
epoch【2】@2024-05-20 12:11:44 --> eval_accuracy= 92.16%
epoch【3】@2024-05-20 12:11:48 --> eval_accuracy= 93.28%
epoch【4】@2024-05-20 12:11:51 --> eval_accuracy= 94.11%
Method 3: launch with PyTorch
The same script can also be launched with PyTorch's own launcher (note that torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun):
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node 2 \
    --use_env \
    --master_port 41011 \
    main.py
Result
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING]
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] *****************************************
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-20 11:26:14,715] torch.distributed.run: [WARNING] *****************************************
device cuda:0 is used!
epoch【0】@2024-05-20 11:26:20 --> eval_accuracy= 84.13%
epoch【1】@2024-05-20 11:26:24 --> eval_accuracy= 90.28%
epoch【2】@2024-05-20 11:26:27 --> eval_accuracy= 92.35%
epoch【3】@2024-05-20 11:26:31 --> eval_accuracy= 93.59%
epoch【4】@2024-05-20 11:26:34 --> eval_accuracy= 94.52%
Method 4: launch from a notebook

```python
from accelerate import notebook_launcher

args = dict(
    epochs=5,
    lr=1e-4,
    batch_size=1024,
    ckpt_dir="ckpts",
    ckpt_path="checkpoint.pt",
    mixed_precision="no").values()
notebook_launcher(main, args, num_processes=2, use_port="41011")
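Note that notebook_launcher passes args positionally, so the dict(...).values() trick relies on dict insertion order (guaranteed since Python 3.7) matching main()'s parameter order. A quick stdlib check of that assumption with a toy stand-in for main:

```python
def toy_main(epochs, lr=1e-3, batch_size=1024):
    # Stand-in for main(): just echoes what it received.
    return (epochs, lr, batch_size)

# Keys listed in the same order as toy_main's parameters, as in the snippet above.
args = tuple(dict(epochs=5, lr=1e-4, batch_size=1024).values())
assert toy_main(*args) == (5, 1e-4, 1024)
```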
Run
CUDA_VISIBLE_DEVICES=0,1 python main.py
Result
Launching training on 2 GPUs.
device cuda:0 is used!
epoch【0】@2024-05-20 12:09:43 --> eval_accuracy= 84.69%
epoch【1】@2024-05-20 12:09:47 --> eval_accuracy= 90.36%
epoch【2】@2024-05-20 12:09:51 --> eval_accuracy= 92.13%
epoch【3】@2024-05-20 12:09:54 --> eval_accuracy= 93.20%
epoch【4】@2024-05-20 12:09:57 --> eval_accuracy= 94.03%