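In the previous chapter I wrote a clean, flexible Trainer modeled on the LAVIS training framework (github.com/Coobiw/Mini...). The Trainer directly encapsulates the training features and keeps them controllable. In the following sections I break down and introduce some common techniques in large-model training, give usage examples, and benchmark them on eva-vit-G (0.99B parameters). The result is a simple tutorial codebase on mixed-precision training and gradient checkpointing, open-sourced at: github.com/Coobiw/Mini...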

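Note: a few diagrams reuse images from the blogs listed in the references. If this infringes on anyone's rights, please contact me and I will remove them and redraw replacements!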

A brief analysis of GPU memory usage during training

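In general, GPU memory usage during training breaks down into the following parts (all assuming the regular fp32 data type, with x the number of parameters and sizes given in bytes):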

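Static:
- Model parameters: x * 4
- Gradients of the model parameters: x * 4
- Optimizer states: for example, Adam stores first- and second-moment estimates of the gradients, taking 4 * x + 4 * x = x * 8
Dynamic:
- Intermediate activations: while the forward pass builds the computation graph, intermediate activations must be stored for the later gradient computation. For a Transformer, for example, their size scales with the product of batch_size, sequence length, and channel dimension
- They are stored and released dynamically: saved into the computation graph during the forward pass, then released progressively during the backward pass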
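The static part of this breakdown can be turned into a quick estimate. The sketch below is only back-of-the-envelope arithmetic: the function name and the `optimizer_states` argument are made up for illustration, and the parameter count is the one this post reports for EVA ViT-G.

```python
def fp32_static_memory_gb(num_params: float, optimizer_states: int = 2) -> float:
    """Static training memory in GB (1e9 bytes) for plain fp32 training.

    optimizer_states is the number of fp32 tensors the optimizer keeps per
    parameter (2 for Adam's two moment estimates, 0 for plain SGD).
    """
    bytes_per_param = 4 + 4 + 4 * optimizer_states  # weights + grads + optimizer
    return num_params * bytes_per_param / 1e9


eva_vit_g_params = 0.985894528e9  # parameter count reported in this post

print(f"Adam: {fp32_static_memory_gb(eva_vit_g_params):.2f} GB")
print(f"SGD : {fp32_static_memory_gb(eva_vit_g_params, optimizer_states=0):.2f} GB")
```

Activations come on top of this figure and depend on batch size and sequence length, so the estimate is a lower bound rather than a prediction of peak usage.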

Gradient Checkpoint: reducing the memory used by intermediate activations

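The key idea of gradient checkpointing: suppose there are n intermediate activations (think of them as n nodes in the computation graph). Instead of storing all of them, only a subset (the checkpoints) is stored and the rest are released. If the backward pass needs an activation that is no longer in memory, it is recomputed by running the forward pass again from the nearest checkpoint. This trades time for space. A classic animated illustration accompanies this explanation in the original post.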

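When activation memory is originally O(n) (n being the number of layers), gradient checkpointing can reduce it to roughly O(sqrt(n)). In a later subsection I also show that on EVA ViT-G (0.99B), gradient checkpointing cuts memory from about 14.8 GB to about 4.5 GB (batch size 16).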
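To see where the sqrt(n) figure comes from, here is a small counting sketch. It assumes a simplified model that is not taken from the original post: one checkpoint is kept every `segment` layers, and during backward only one segment at a time is recomputed and held in full.

```python
import math


def peak_activations(n_layers: int, segment: int) -> int:
    """Rough number of activations alive at the backward-pass peak.

    Simplified model: ceil(n / segment) checkpoints stay resident, plus the
    `segment` activations of the one segment currently being recomputed.
    """
    n_checkpoints = math.ceil(n_layers / segment)
    return n_checkpoints + segment


n = 64
best_segment = min(range(1, n + 1), key=lambda k: peak_activations(n, k))
print("best segment length:", best_segment)
print("peak activations   :", peak_activations(n, best_segment))
```

The cost n / k + k is minimized near k = sqrt(n), which gives a peak of about 2 * sqrt(n) stored activations instead of n.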

Pitfalls when using torch.utils.checkpoint.checkpoint

The function is used as follows:


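torch.utils.checkpoint.checkpoint(a callable function or object, the arguments it needs)
For example:
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SimpleModel(nn.Module):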
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layer1 = nn.Linear(512, 512)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(512, 512)
        self.relu2 = nn.ReLU()

    def forward(self, x):
        x = checkpoint(self.layer1,x)
        x = self.relu1(x)
        x = checkpoint(self.layer2,x)
        return x
model = SimpleModel().cuda()
inputs = torch.randn(int(1e6), 512,requires_grad=True).cuda() 
labels = torch.zeros(int(1e6),512).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters())

print(f"begin: {torch.cuda.memory_allocated()/1e9}")

model.train()

# forward pass, backward pass, and one optimizer step
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs,labels)
loss.backward()
optimizer.step()

print(f"final: {torch.cuda.memory_allocated()/1e9}")

Note that in this example:


inputs = torch.randn(int(1e6),512,requires_grad=True).cuda()

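Here the model input must be created with requires_grad=True, because:

With the reentrant implementation (the default when this was written), the requires_grad attribute of the output of torch.utils.checkpoint.checkpoint follows its inputs: if the inputs are False the output is False, and if an input is True the output is True
- So if none of the inputs to the function have requires_grad=True, you get this warning:
  - UserWarning: None of the inputs have requires_grad=True. Gradients will be None
In this example every nn.Linear is wrapped in checkpoint, and nn.ReLU does not change requires_grad, so the model output has the same requires_grad as the model input. If requires_grad=True is not set on inputs, none of the checkpointed activations require gradients, and the backward pass fails with:
- RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn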
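As an alternative to setting requires_grad=True on the input, PyTorch's checkpoint also accepts a `use_reentrant` keyword, and with `use_reentrant=False` the input does not need to require gradients. The following is a minimal CPU sketch, assuming a PyTorch version that exposes this keyword. The layer size is arbitrary and this is not the code used in the original post.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
layer = nn.Linear(8, 8)
x = torch.randn(4, 8)  # plain input: requires_grad stays False

# Non-reentrant checkpointing tracks gradients through the parameters,
# so the output requires grad even though the input does not.
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()

grad_ok = layer.weight.grad is not None
print(out.requires_grad, grad_ok)
```

Whether this mode fits a given training setup is worth checking against the installed PyTorch version's documentation before relying on it.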

Benchmarking on EVA ViT-G

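First, note that torch.utils.checkpoint.checkpoint(function, args) does not store the activations produced inside the function. Only the function's final output activation is kept (together with its inputs, which are needed for recomputation). For example: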


class xxx:
    ......
    def pass_two(self,x):
        x = self.layer1(x)
        return self.layer2(x)
    def forward(self,x):
        x = torch.utils.checkpoint.checkpoint(self.pass_two,x)
        return self.layer3(x)

In this code, layer3's activation is stored as usual, the output activation of layer2 is kept by the checkpoint, and layer1's activation is not stored at all.

The checkpointed part of EVA ViT, the Transformer block:


    def forward_features(self, x):
        x = self.patch_embed(x)
        batch_size, seq_len, _ = x.size()

        cls_tokens = self.cls_token.expand(batch_size, -1, -1)  # stole cls_tokens impl from Phil Wang, thanks
        x = torch.cat((cls_tokens, x), dim=1)
        if self.pos_embed is not None:
            x = x + self.pos_embed
        x = self.pos_drop(x)

        rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
        for blk in self.blocks:
            if self.use_checkpoint:
                x = checkpoint.checkpoint(blk, x, rel_pos_bias)
            else:
                x = blk(x, rel_pos_bias)
        return x

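This way, only the final output activation of each transformer block is checkpointed.
The intermediate activations inside the block (Q, K, V, the attention output, and so on) are all released and recomputed when the backward pass needs them.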

Benchmark code on EVA ViT-G:


import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from eva_vit import create_eva_vit_g

from torch.cuda.amp import autocast

def train(model):
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    data = torch.randn(16,3,224,224).cuda()

    torch.cuda.reset_peak_memory_stats()

    with autocast():
        output = model(data)
    loss = output.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # peak GPU memory after one training step
    final_memory = torch.cuda.max_memory_allocated()

    return final_memory/1e9

use_checkpoint = True
model = create_eva_vit_g(img_size=224,drop_path_rate=0.,use_checkpoint=use_checkpoint,precision="fp16").cuda()
print(f"Parameters: {sum([param.numel() for param in model.parameters()])/1e9} B")

info = 'with' if use_checkpoint else 'without'
print(f"Memory used {info} checkpointing: ", train(model))

Output:

Parameters: 0.985894528 B

Memory used without checkpointing: 14.777437696

Parameters: 0.985894528 B

Memory used with checkpointing: 4.459111424

As you can see, gradient checkpointing greatly reduces memory usage!

Mixed-precision training

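When the forward and backward passes run in fp32, every computation is fp32, which costs some computational efficiency. On current GPUs half-precision arithmetic is generally faster, so computing in fp16 is a natural choice. The problem with fp16 is its narrow range: values below 5.96e-8 underflow. It also has a larger rounding error:

The maximum rounding error of float16 is about 2^-10, much larger than that of float32 (about 2^-23). Any operation on a sufficiently small floating-point value rounds it to zero, and in backpropagation many, even most, gradient update values are very small but nonzero. Accumulated rounding error during backpropagation can turn these numbers into 0 or nan, which leads to inaccurate gradient updates and hurts convergence.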
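Both limits can be checked without a GPU. The sketch below uses the half-precision format code of Python's standard `struct` module to round values to fp16. The helper name `to_fp16` is made up for illustration.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


smallest_subnormal = 2.0 ** -24  # about 5.96e-8, the smallest positive fp16

print(to_fp16(smallest_subnormal) == smallest_subnormal)  # still representable
print(to_fp16(1e-8))              # below the range: underflows
print(to_fp16(1.0 + 2.0 ** -10))  # the next fp16 value after 1.0
print(to_fp16(1.0 + 2.0 ** -12))  # too fine for fp16: rounds back
```

The last two lines show the 2^-10 spacing around 1.0: increments smaller than half of that spacing are simply lost.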

Given these two issues, mixed-precision training is well motivated:

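The weights are cast from fp32 to fp16 for the forward computation. Once the loss is obtained, gradients are computed in fp16 and then cast to fp32 to update the fp32 weights. (Note: the loss itself is also FP32 because it involves accumulation, as explained below.)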

Loss scaling：

Casting the fp32 loss directly to fp16 may underflow, so the fp32 loss is first scaled up, then cast to fp16 for the backward pass, and finally the gradients are unscaled.
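The effect of loss scaling on a single small gradient can be sketched with the same `struct`-based fp16 rounding. The gradient value and the scale factor below are hypothetical, and a Python float stands in for the higher-precision side.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


true_grad = 1e-8    # hypothetical gradient, too small for fp16
scale = 2.0 ** 16   # hypothetical loss scale

naive = to_fp16(true_grad)            # stored directly in fp16
scaled = to_fp16(true_grad * scale)   # scaling the loss scales every gradient
recovered = scaled / scale            # unscale in higher precision

print("without scaling:", naive)
print("with scaling   :", recovered)
```

Because backpropagation is linear in the loss, multiplying the loss by `scale` multiplies every gradient by the same factor, which is why a single division at the end restores the original magnitude.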

Improved arithmetic: multiplications run in fp16, while additions (accumulations) run in fp32.
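Why accumulation needs the wider type can be shown with a toy sum. In this sketch the list of ones stands in for fp16 products, fp16 rounding is emulated with Python's `struct` module, and a Python float plays the role of the wider accumulator. None of this is the author's code.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


products = [1.0] * 4096  # stand-ins for individual fp16 products

acc_fp16 = 0.0
for p in products:
    acc_fp16 = to_fp16(acc_fp16 + p)  # accumulator rounded to fp16 every step

acc_wide = 0.0
for p in products:
    acc_wide += p                     # accumulator kept in higher precision

print("fp16 accumulator:", acc_fp16)
print("wide accumulator:", acc_wide)
```

Once the fp16 accumulator reaches 2048, the spacing between neighboring fp16 values is 2, so adding 1.0 no longer changes it and the sum stalls, while the wider accumulator reaches the true total.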

Memory analysis of mixed-precision training

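Model parameters: fp16, so x * 2
Model gradients: fp16, x * 2
fp32 copy of the model: fp32, x * 4
Optimizer states: stored in fp32, x * 4 * 2 = x * 8
Together this is 16 * x, the same as fp32 training
However, the intermediate activations are fp16 and therefore take less memory, and hardware supports fp16 better, so total memory should still drop somewhat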
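The tally above can be written as a tiny function to compare the two regimes side by side. The function name and arguments are illustrative only.

```python
def static_bytes_per_param(mixed_precision: bool, optimizer_states: int = 2) -> int:
    """Static bytes per parameter, following the breakdown in the text.

    optimizer_states counts the fp32 tensors kept per parameter
    (2 for Adam, 0 for plain SGD).
    """
    if mixed_precision:
        weights, grads, master_copy = 2, 2, 4  # fp16 weights/grads + fp32 copy
    else:
        weights, grads, master_copy = 4, 4, 0  # everything in fp32
    return weights + grads + master_copy + 4 * optimizer_states


for mixed in (False, True):
    print("mixed" if mixed else "fp32 ", static_bytes_per_param(mixed))
```

Under these assumptions the static cost is identical in both regimes, so any saving has to come from the activations.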

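So the main purpose of mixed-precision training is to improve speed rather than to reduce memory. Note: when the batch size is small, the bottleneck is IO (data moving frequently from CPU to GPU), in which case mixed precision may bring no real speedup, or may even be slower.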

Using amp in apex

Mixed precision in apex has four levels, O0, O1, O2, and O3 (the original post shows their settings in a figure):

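O0 is full-precision fp32 training (the accuracy baseline); O3 is pure half-precision fp16 training (very unstable; the speed baseline)
O1 uses half precision for most computation while keeping all model parameters in single precision. The few operations that work better in single precision (such as softmax) stay in single precision (a whitelist/blacklist automatically decides whether an op runs in FP16, e.g. convolution, or in FP32, e.g. softmax)
O2 is "almost FP16" mixed-precision training: there is no whitelist/blacklist, and nearly everything except BatchNorm is computed in FP16
bfloat16 does not seem to be supported yet (I am not entirely sure, but in my use the optimizer did not support bfloat16)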

Usage example: mixed precision with apex is simple. You only need one extra amp.initialize call to wrap the model and optimizer, then use the amp.scale_loss context to scale and unscale the loss:


from apex import amp

# Initialization
opt_level = 'O2'
assert opt_level in ['O1','O2','O3'], 'ValueError'

model = ...
optimizer = ...
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

loss = ...
optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
optimizer.step()

Benchmarking on EVA ViT-G:


import torch
import torch.nn as nn
import torch.optim as optim

from apex import amp

import time
import os

from eva_vit import create_eva_vit_g

# training function
def train(opt_level, epochs=10):
    torch.cuda.reset_peak_memory_stats()

    # build the model and optimizer
    model = create_eva_vit_g(img_size=224,drop_path_rate=0.,use_checkpoint=False,precision="fp32").cuda()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    # enable automatic mixed precision
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    # train and record time and memory
    start_time = time.time()
    for epoch in range(epochs):
        optimizer.zero_grad()
        inputs = torch.randn(16,3,224,224).cuda()
        outputs = model(inputs)
        loss = outputs.sum()
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
    end_time = time.time()

    # compute and return elapsed time and peak memory
    duration = end_time - start_time
    memory = torch.cuda.max_memory_allocated()
    return duration, memory

# test different optimization levels
opt_levels = ['O3']
for level in opt_levels:
    duration, memory = train(opt_level=level)
    print(f"Opt Level: {level}, Duration: {duration} seconds, Max Memory: {memory} bytes")

Test results:


Opt Level: O1, Duration: 8.245030641555786 seconds, Max Memory: 18018722816 bytes
Opt Level: O2, Duration: 7.409193515777588 seconds, Max Memory: 17048065536 bytes
Opt Level: O3, Duration: 6.8711607456207275 seconds, Max Memory: 11131785728 bytes

Using PyTorch's built-in amp

PyTorch's built-in amp is equivalent to the O1 level in apex, but it also supports bfloat16!

Usage example:


model = ...
optimizer = ...
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast(dtype=torch.float16):
    output = model(...)
loss = ...
optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

EVA ViT test:


import torch
import torch.nn as nn
import torch.optim as optim

from torch.cuda.amp import autocast, GradScaler

import time
import os

from eva_vit import create_eva_vit_g

# training function
def train(epochs=10):
    torch.cuda.reset_peak_memory_stats()

    # build the model and optimizer
    model = create_eva_vit_g(img_size=224,drop_path_rate=0.,use_checkpoint=False,precision="fp32").cuda()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    scaler = GradScaler()

    # train and record time and memory
    start_time = time.time()
    for epoch in range(epochs):
        optimizer.zero_grad()
        inputs = torch.randn(16,3,224,224).cuda()
        with autocast():
            outputs = model(inputs)
            loss = outputs.sum()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    end_time = time.time()

    # compute and return elapsed time and peak memory
    duration = end_time - start_time
    memory = torch.cuda.max_memory_allocated()
    return duration, memory

duration, memory = train()
print(f"Opt Level: Pytorch(Equal to O1), Duration: {duration} seconds, Max Memory: {memory} bytes")

Test results:


Opt Level: Pytorch(Equal to O1), Duration: 8.872779130935669 seconds, Max Memory: 18857502720 bytes
Compared with apex:
Opt Level: O1, Duration: 8.245030641555786 seconds, Max Memory: 18018722816 bytes
Opt Level: O2, Duration: 7.409193515777588 seconds, Max Memory: 17048065536 bytes
Opt Level: O3, Duration: 6.8711607456207275 seconds, Max Memory: 11131785728 bytes
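The raw numbers are easier to compare in GB and as a speed ratio. The snippet below only re-expresses the figures reported above (10 iterations each, as in the benchmark scripts). The dictionary keys and the helper name are made up for illustration.

```python
# (duration in seconds for 10 iterations, peak memory in bytes), as reported
RESULTS = {
    "apex O1": (8.245030641555786, 18018722816),
    "apex O2": (7.409193515777588, 17048065536),
    "apex O3": (6.8711607456207275, 11131785728),
    "torch.cuda.amp": (8.872779130935669, 18857502720),
}


def summarize(results, baseline):
    """Return (name, peak memory in GB, speed relative to baseline) rows."""
    base_sec, _ = results[baseline]
    rows = []
    for name, (sec, mem) in results.items():
        rows.append((name, round(mem / 1e9, 2), round(base_sec / sec, 2)))
    return rows


for name, mem_gb, speedup in summarize(RESULTS, "apex O1"):
    print(f"{name:15s} {mem_gb:6.2f} GB  {speedup:.2f}x vs apex O1")
```

These are single short runs, so small differences between apex O1 and PyTorch's amp should be read with caution.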

A pitfall in PyTorch's built-in amp

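If a model:

- has its dtype converted right after loading, so that part of it is fp16 or bf16 and part is fp32, and the fp16 or bf16 part requires gradients, or
- is entirely fp16 or bf16 and requires gradients

then an error is raised in both cases: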


import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

model = nn.ModuleList([nn.Linear(10, 20),nn.Linear(20,10)]).cuda()

model[0].half() # problematic if requires_grad is True: ValueError: Attempting to unscale FP16 gradients.
# model[0].requires_grad_(False) # with this, only model[1] has requires_grad=True, and it is float32

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

scaler = GradScaler()

input_data = torch.randn(10, 10).cuda()
target = torch.randn(10, 10).cuda()

for epoch in range(1):
    optimizer.zero_grad()

    with autocast():  # autocast automatically handles the necessary FP16/FP32 casts
        print(f"input.dtype: {input_data.dtype}")
        output = model[1](model[0](input_data))
        print(f"output.dtype: {output.dtype}")
        loss = nn.MSELoss()(output, target)
        print(f"loss.dtype: {loss.dtype}")
        print(f"model[0].weight.dtype: {model[0].weight.dtype}")
        print(f"model[1].weight.dtype: {model[1].weight.dtype}")

    scaler.scale(loss).backward()
    print(f"model[0].weight.grad.dtype: {model[0].weight.grad.dtype}")
    print(f"model[1].weight.grad.dtype: {model[1].weight.grad.dtype}")
    scaler.step(optimizer)
    scaler.update()

Output:


input.dtype: torch.float32
output.dtype: torch.float16
loss.dtype: torch.float32
model[0].weight.dtype: torch.float16
model[1].weight.dtype: torch.float32
model[0].weight.grad.dtype: torch.float16
model[1].weight.grad.dtype: torch.float32

ValueError: Attempting to unscale FP16 gradients.

The bf16 case is similar:


import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

model = nn.ModuleList([nn.Linear(10, 20),nn.Linear(20,10)]).cuda()

# problematic if requires_grad is True: RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
model[0] = model[0].to(torch.bfloat16)

# model[0].requires_grad_(False) # with this, only model[1] has requires_grad=True, and it is float32

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

scaler = GradScaler()

input_data = torch.randn(10, 10).cuda()
target = torch.randn(10, 10).cuda()

for epoch in range(1):
    optimizer.zero_grad()

    with autocast(dtype=torch.bfloat16):  # autocast automatically handles the necessary BF16/FP32 casts
        print(f"input.dtype: {input_data.dtype}")
        output = model[1](model[0](input_data))
        print(f"output.dtype: {output.dtype}")
        loss = nn.MSELoss()(output, target)
        print(f"loss.dtype: {loss.dtype}")
        print(f"model[0].weight.dtype: {model[0].weight.dtype}")
        print(f"model[1].weight.dtype: {model[1].weight.dtype}")

    scaler.scale(loss).backward()
    print(f"model[0].weight.grad.dtype: {model[0].weight.grad.dtype}")
    print(f"model[1].weight.grad.dtype: {model[1].weight.grad.dtype}")
    scaler.step(optimizer)
    scaler.update()

Output:


input.dtype: torch.float32
output.dtype: torch.bfloat16
loss.dtype: torch.float32
model[0].weight.dtype: torch.bfloat16
model[1].weight.dtype: torch.float32
model[0].weight.grad.dtype: torch.bfloat16
model[1].weight.grad.dtype: torch.float32

RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
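One arrangement that sidesteps both errors is to leave the trainable parameters in fp32 and let autocast do the casting. For bf16 the GradScaler can usually be dropped as well, since bf16 has the same exponent range as fp32. The sketch below is an assumption-laden illustration rather than the author's fix: it runs on CPU via `torch.autocast(device_type="cpu", ...)` so that it needs no GPU, and the model sizes mirror the toy example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Parameters are left in fp32; nothing is converted with .half() or .to(bfloat16)
model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(10, 10)
target = torch.randn(10, 10)

optimizer.zero_grad()
# device_type="cpu" keeps the sketch runnable anywhere; use "cuda" on a GPU
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(inputs)                           # computed in bf16
loss = nn.functional.mse_loss(output.float(), target)  # loss in fp32
loss.backward()                                      # no GradScaler involved
optimizer.step()

print(output.dtype, model[0].weight.grad.dtype)
```

The gradients land in the dtype of the parameters, so the optimizer only ever sees fp32 tensors and no unscaling step is involved.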

References

[1] https://zhuanlan.zhihu.com/p/448395808
[2] https://zhuanlan.zhihu.com/p/84219777
[3] https://zhuanlan.zhihu.com/p/79887894
[4] https://zhuanlan.zhihu.com/p/647389318

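Multimodal LLM in Practice, MiniGPT4Qwen Series 3: Basic Large-Model Training Techniques, Pitfalls in Mixed-Precision Training and Gradient Checkpointing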
