Debugging and Optimizing Large Deep Learning Models - 2: Using PyTorch Profiler to Analyze a Model's Operators on the GPU and Extract Performance Data
flyfish
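ProfilerActivity.CPU and ProfilerActivity.CUDA specify that both CPU and GPU activity should be profiled.
record_shapes=True records the input tensor shapes of every operation, which is very helpful for debugging and optimization.
record_function("model_inference") is a context manager that labels a region of code for profiling. You can wrap it around any block of code for finer-grained performance analysis.
prof.key_averages().table(sort_by="cuda_time_total", row_limit=10) prints profiling data for the top 10 operators sorted by total GPU execution time, including each operation's elapsed time and number of calls.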
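As a small illustration of the record_shapes and record_function options described above, here is a minimal sketch. The single-layer model and the labels "forward_pass" and "backward_pass" are hypothetical and chosen only for this example. It assumes that CUDA activity should be requested only when a GPU is actually available, so it also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Always profile the CPU; add CUDA only when a GPU is present
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in model and input, used only for illustration
layer = nn.Linear(100, 50).to(device)
x = torch.randn(10, 100, device=device)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("forward_pass"):   # label for the forward region
        y = torch.relu(layer(x))
    with record_function("backward_pass"):  # separate label for the backward region
        y.sum().backward()

# Because record_shapes=True, operators can be grouped by name and input shapes
averages = prof.key_averages(group_by_input_shape=True)
print(averages.table(sort_by="cpu_time_total", row_limit=10))
```

Grouping by input shape makes it easier to see whether one operator is slow for every call or only for a particular tensor size.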
Version 2.4.0
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity

# Check whether a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple neural network
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Instantiate the model, loss function, and optimizer
model = SimpleModel().to(device)
criterion = nn.MSELoss().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Create the input data
inputs = torch.randn(10, 100).to(device)
targets = torch.randn(10, 10).to(device)

# Use PyTorch Profiler to analyze the operators running on the GPU
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Print the profiling results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# The trace can also be saved to a file for further analysis
prof.export_chrome_trace("trace.json")
Output
A table listing the elapsed time, number of calls, and other data for each operation on the GPU
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                        model_inference        43.86%      53.190ms        82.46%     100.010ms     100.010ms      52.881ms        43.63%      99.954ms      99.954ms             1
                                           aten::linear         0.07%      79.700us        16.40%      19.893ms       9.947ms      44.000us         0.04%      19.780ms       9.890ms             2
                                            aten::addmm        16.15%      19.588ms        16.15%      19.588ms       9.794ms      19.520ms        16.11%      19.520ms       9.760ms             2
                                aten::mse_loss_backward         5.04%       6.107ms        10.17%      12.330ms       6.165ms       6.103ms         5.04%      12.345ms       6.173ms             2
                                             aten::relu         0.05%      66.000us         8.29%      10.055ms      10.055ms      25.000us         0.02%      10.408ms      10.408ms             1
                                        aten::clamp_min         8.24%       9.989ms         8.24%       9.989ms       9.989ms      10.383ms         8.57%      10.383ms      10.383ms             1
    autograd::engine::evaluate_function: AddmmBackward0         0.13%     153.600us         8.48%      10.279ms       5.140ms     160.000us         0.13%      10.285ms       5.143ms             2
                                         aten::mse_loss         4.82%       5.845ms         7.19%       8.717ms       8.717ms       5.836ms         4.82%       8.721ms       8.721ms             1
autograd::engine::evaluate_function: MseLossBackward...         0.05%      59.900us         5.32%       6.450ms       6.450ms      24.000us         0.02%       6.382ms       6.382ms             1
                                       MseLossBackward0         0.03%      40.600us         5.27%       6.390ms       6.390ms       9.000us         0.01%       6.358ms       6.358ms             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 121.286ms
Self CUDA time total: 121.191ms
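The trace.json file written by export_chrome_trace can be opened in a trace viewer, but it can also be inspected with a few lines of Python. The sketch below is not part of the original post. It assumes the file follows the Chrome trace event layout: a top-level "traceEvents" list in which complete events have "ph" set to "X" and a "dur" field in microseconds. The summarize_trace helper and the sample events are hypothetical and written by hand purely for demonstration.

```python
import json
from collections import defaultdict


def summarize_trace(trace, top_n=5):
    """Return up to top_n (name, total_us, calls) tuples, largest total first."""
    totals = defaultdict(lambda: [0.0, 0])
    for event in trace.get("traceEvents", []):
        # Only "X" (complete) events carry a duration; skip metadata and others
        if event.get("ph") != "X":
            continue
        entry = totals[event.get("name", "<unnamed>")]
        entry[0] += float(event.get("dur", 0))  # accumulated duration in microseconds
        entry[1] += 1                           # number of occurrences
    ranked = sorted(
        ((name, total, calls) for name, (total, calls) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical, hand-written events; a real trace contains many more fields
sample_trace = {
    "traceEvents": [
        {"ph": "X", "name": "aten::addmm", "ts": 0, "dur": 120},
        {"ph": "X", "name": "aten::addmm", "ts": 200, "dur": 80},
        {"ph": "X", "name": "aten::relu", "ts": 130, "dur": 30},
        {"ph": "M", "name": "process_name"},
    ]
}

for name, total_us, calls in summarize_trace(sample_trace):
    print(f"{name:<20} {total_us:>10.1f} us  {calls} call(s)")

# With a real file (assumed to exist in the working directory):
# with open("trace.json") as f:
#     print(summarize_trace(json.load(f)))
```

Note that these sums are inclusive durations: a parent event such as aten::linear also covers the time of the operators it calls, so the totals are closest to the "CPU total" and "CUDA total" columns rather than the "Self" columns.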