Accumulated Gradients in PyTorch

https://stackoverflow.com/questions/62067400/understanding-accumulated-gradients-in-pytorch

The example below uses a small computation graph: the gradient accumulated over two forward/backward passes (batch size 1 each) is exactly equal to the gradient computed in a single pass over the full batch.


Code:

import numpy as np
import torch


class ExampleLinear(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Initialize the weight at 1
        self.weight = torch.nn.Parameter(torch.Tensor([1]).float(),
                                         requires_grad=True)

    def forward(self, x):
        return self.weight * x


model = ExampleLinear()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


def calculate_loss(x: torch.Tensor) -> torch.Tensor:
    # The target function is y = 2 * x, so the optimal weight is 2
    y = 2 * x
    y_hat = model(x)
    # Squared error for this (single-sample) batch
    return (y - y_hat) ** 2


# With multiple batches of size 1
batches = [torch.tensor([4.0]), torch.tensor([2.0])]

optimizer.zero_grad()
for i, batch in enumerate(batches):
    # The loss needs to be scaled, because the mean should be taken across the whole
    # dataset, which requires the loss to be divided by the number of batches.
    loss = calculate_loss(batch) / len(batches)
    loss.backward()
    print(f"Batch size 1 (batch {i}) - grad: {model.weight.grad}")
    print(f"Batch size 1 (batch {i}) - weight: {model.weight}")
    print("="*50)

# Updating the model only after all batches
optimizer.step()
print(f"Batch size 1 (final) - grad: {model.weight.grad}")
print(f"Batch size 1 (final) - weight: {model.weight}")

Output:

Batch size 1 (batch 0) - grad: tensor([-16.])
Batch size 1 (batch 0) - weight: Parameter containing:
tensor([1.], requires_grad=True)
==================================================
Batch size 1 (batch 1) - grad: tensor([-20.])
Batch size 1 (batch 1) - weight: Parameter containing:
tensor([1.], requires_grad=True)
==================================================
Batch size 1 (final) - grad: tensor([-20.])
Batch size 1 (final) - weight: Parameter containing:
tensor([1.2000], requires_grad=True)
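
The numbers check out analytically: for a single sample, d/dw (y - w*x)^2 = -2x(y - w*x), which at w = 1 and y = 2x is -2x^2. Each per-sample loss is scaled by 1/2 (the number of batches), so x = 4 contributes -16 and x = 2 contributes -4, giving the accumulated -20. As a cross-check, here is a minimal sketch (not part of the original answer) that computes the gradient in one pass over the full batch with a fresh model; it should reproduce the same -20:

full_model = ExampleLinear()            # fresh model, weight initialized to 1
full_batch = torch.tensor([4.0, 2.0])   # both samples in a single batch

y = 2 * full_batch
y_hat = full_model(full_batch)
loss = ((y - y_hat) ** 2).mean()        # mean over the batch = sum of the scaled per-sample losses
loss.backward()
print(f"Full batch - grad: {full_model.weight.grad}")  # expected: tensor([-20.])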

However, when training a real model the results are not this ideal. For example, when training a BERT model, compare B = 8, N = 1 (no gradient accumulation, update after every step) with B = 2, N = 4 (gradient accumulation, one update every 4 steps), where B is the per-step batch size and N is the number of accumulation steps. Both settings have an effective batch size of 8, yet the two runs do not match exactly; a generic accumulation loop for such a setting is sketched below.
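
A minimal sketch of such a loop, assuming placeholder names model, optimizer, loss_fn, and data_loader (a DataLoader yielding batches of size B), with accum_steps playing the role of N:

accum_steps = 4  # N: number of micro-batches accumulated per optimizer update

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):  # each batch has size B
    outputs = model(inputs)
    # Scale the loss so the accumulated gradient approximates the mean over B * N samples
    loss = loss_fn(outputs, targets) / accum_steps
    loss.backward()  # gradients accumulate in the parameters' .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()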

Using Batch Normalization together with gradient accumulation generally does not work well, for a simple reason: the BatchNorm statistics are computed per micro-batch and cannot be accumulated. A better solution is to use Group Normalization instead of BatchNorm; an illustrative swap is shown below.
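
For instance, in a convolutional block one could replace torch.nn.BatchNorm2d with torch.nn.GroupNorm, which normalizes within each sample and is therefore independent of the (micro-)batch size. The channel count and num_groups=8 below are illustrative, not taken from the original post:

channels = 64

bn_block = torch.nn.Sequential(
    torch.nn.Conv2d(3, channels, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(channels),   # statistics depend on the micro-batch
    torch.nn.ReLU(),
)

gn_block = torch.nn.Sequential(
    torch.nn.Conv2d(3, channels, kernel_size=3, padding=1),
    torch.nn.GroupNorm(num_groups=8, num_channels=channels),  # batch-size independent
    torch.nn.ReLU(),
)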

https://ai.stackexchange.com/questions/21972/what-is-the-relationship-between-gradient-accumulation-and-batch-size
