PyTorch Study Notes (21): PyTorch Profiler

Resources:

https://docs.pytorch.org/docs/stable/profiler.html
https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
https://github.com/ZhiqianXia/perf-compiler-learning/tree/main/08-pytorch/4-performance/profiler_labs (download it and experiment if you are interested)

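1. Overview

PyTorch Profiler is the performance analysis tool that ships with PyTorch. It collects performance metrics during training and inference. Through its context manager API, developers can:

Identify the most expensive operators in a model
Inspect operator input shapes and stack traces
Study device kernel activity
Visualize the execution trace

⚠️ Note: the earlier API in the torch.autograd module is considered legacy and is slated for deprecation; use torch.profiler instead.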

2. Core API in Detail

2.1 The profile context manager (recommended)

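import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile()  # placeholder for the code under test
# row_limit=-1 prints every row; rows are sorted by self CUDA time
print(p.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1))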

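Main parameters

Parameter	Type	Description
`activities`	iterable	Activity groups to profile: CPU, CUDA, XPU
`schedule`	Callable	Schedule that decides at which steps the profiler waits, warms up, and records
`on_trace_ready`	Callable	Callback invoked when a trace is ready
`record_shapes`	bool	Record operator input shapes
`profile_memory`	bool	Track tensor memory allocation and deallocation
`with_stack`	bool	Record source information (file and line number)
`with_flops`	bool	Estimate FLOPs (floating-point operations) for specific operators such as matrix multiplication and 2D convolution
`with_modules`	bool	Record the module hierarchy (currently TorchScript models only)
`acc_events`	bool	Accumulate events across multiple profiling cycles
`post_processing_timeout_s`	float	Timeout, in seconds, for post-processing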
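
As a minimal sketch of how these options combine, the following CPU-only run enables `record_shapes` and groups the results by input shape. The model size, the tensor shape, and the helper name `profile_linear_cpu` are illustrative assumptions, not part of the original notes.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_linear_cpu():
    """Profile one forward pass of a small Linear layer on the CPU (illustrative sizes)."""
    model = torch.nn.Linear(10, 10)
    inputs = torch.randn(100, 10)

    # CPU-only, so the sketch also runs on machines without a GPU.
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        model(inputs)

    # Group by operator name and input shape; shapes are recorded
    # because record_shapes=True was set above.
    return prof.key_averages(group_by_input_shape=True)


averages = profile_linear_cpu()
print(averages.table(sort_by="self_cpu_time_total", row_limit=5))
```

On a GPU machine, adding `ProfilerActivity.CUDA` to `activities` and sorting by `self_cuda_time_total` would give the device-side view instead.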

2.2 Using a schedule

For long-running training jobs, a schedule lets you capture multiple traces at different iterations:

def trace_handler(prof):
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1))

with torch.profiler.profile(
    activities=[...],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=trace_handler,
) as p:
    for step in range(N):  # N: total number of iterations (placeholder)
        code_iteration_to_profile(step)
        p.step()  # signal the profiler that the next iteration has started

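schedule parameters:

wait: number of steps to idle at the start of each cycle, with profiling off
warmup: number of warm-up steps, during which the profiler traces but discards the results
active: number of steps that are actually recorded
repeat: number of cycles (0 means keep cycling until profiling ends)
skip_first: number of initial steps to skip before the first cycle
skip_first_wait: if non-zero, skip the wait phase of the first cycle only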
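
To make the phases concrete, the sketch below calls the schedule directly. It relies on `torch.profiler.schedule` returning a callable that maps a step index to a `ProfilerAction` enum member; the range of six steps is an arbitrary choice for illustration.

```python
from torch.profiler import schedule

# Same settings as the example above: 1 wait step, 1 warmup step, 2 active steps, 1 cycle.
sched = schedule(wait=1, warmup=1, active=2, repeat=1)

# Ask the schedule which action applies at each step index.
actions = [sched(step).name for step in range(6)]
for step, name in enumerate(actions):
    print(step, name)
```

The last step of each active window is reported as `RECORD_AND_SAVE`, which is the point at which the `on_trace_ready` callback is triggered; once the requested number of cycles has completed, every later step maps to `NONE`.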

2.3 TensorBoard integration

with torch.profiler.profile(
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
    ...
) as p:
    ...  # training code

# Launch TensorBoard (viewing the traces requires the torch-tb-profiler plugin)
# tensorboard --logdir ./log

3. Advanced Features

3.1 Toggling activity collection dynamically

with torch.profiler.profile(...) as p:
    code_to_profile_0()
    # Stop collecting CUDA activity
    p.toggle_collection_dynamic(False, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_1()
    # Resume collecting CUDA activity
    p.toggle_collection_dynamic(True, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_2()

3.2 Execution Trace Observer

Captures a graph representation of an AI/ML workload, which can then be replayed for benchmarking:

import torch
from torch.profiler import ExecutionTraceObserver

with torch.profiler.profile(
    execution_trace_observer=ExecutionTraceObserver().register_callback("./execution_trace.json"),
) as p:
    ...  # training code

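3.3 Export features

Method	Purpose
`export_chrome_trace(path)`	Export the trace in Chrome JSON format
`export_stacks(path, metric)`	Save stack traces to a file (requires `with_stack=True`)
`key_averages()`	Average events, grouped by operator name
`events()`	Return the list of unaggregated profiler events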
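
The following sketch exercises three of these methods on a small CPU-only workload. The matrix sizes and the temporary output path are assumptions made for illustration; `export_stacks` is left out because it needs `with_stack=True`.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(64, 64) @ torch.randn(64, 64)

# Hypothetical output location; any writable path works.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)

raw_events = prof.events()      # unaggregated: one entry per recorded event
averaged = prof.key_averages()  # aggregated by operator name
print(len(raw_events), len(averaged), os.path.getsize(trace_path))
```

The exported JSON file can be opened in a Chrome-trace-compatible viewer such as chrome://tracing or Perfetto.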

4. ProfilerActivity Types

CPU - CPU activity
CUDA - NVIDIA GPU activity
XPU - Intel GPU activity
MTIA - Meta Training and Inference Accelerator
HPU - Habana Gaudi devices
PrivateUse1 - privately registered custom device backends
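
Which of these activities can actually be used depends on the PyTorch build and the hardware present. A minimal sketch of choosing them at runtime, assuming a hypothetical helper named `pick_activities`, might look like this:

```python
import torch
from torch.profiler import ProfilerActivity, profile, supported_activities


def pick_activities():
    """Hypothetical helper: always profile the CPU, add CUDA only when usable."""
    available = supported_activities()  # set of activities this build supports
    wanted = [ProfilerActivity.CPU]
    # A CUDA-enabled build may list CUDA even on a machine without a GPU,
    # so the device check is done as well.
    if ProfilerActivity.CUDA in available and torch.cuda.is_available():
        wanted.append(ProfilerActivity.CUDA)
    return wanted


activities = pick_activities()
with profile(activities=activities) as prof:
    torch.ones(4, 4).sum()
print(activities)
```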

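5. Performance Considerations

Enabling shape and stack tracing adds overhead
With record_shapes=True, the profiler temporarily holds references to tensors, which can prevent optimizations that depend on reference counts and introduce extra tensor copies
Enable the full feature set while debugging; in production, enable only what you need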
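
One way to follow this advice is to keep the costlier options behind a single switch. The helper `profiler_options` below is a hypothetical convenience, not a PyTorch API, and the workload is an arbitrary small layer run on the CPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profiler_options(debug: bool) -> dict:
    """Hypothetical helper: turn the costlier options on only when debugging."""
    return {
        "activities": [ProfilerActivity.CPU],
        "record_shapes": debug,
        "with_stack": debug,
        "profile_memory": debug,
    }


model = torch.nn.Linear(10, 10)
inputs = torch.randn(100, 10)

# Lightweight configuration: operator timings only.
with profile(**profiler_options(debug=False)) as light_prof:
    model(inputs)

# Full configuration: shapes, stacks, and memory, at a higher cost.
with profile(**profiler_options(debug=True)) as debug_prof:
    model(inputs)

print(debug_prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```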

6. Intel ITT API (optional)

Additional support for Intel platforms through the Instrumentation and Tracing Technology (ITT) API:

torch.profiler.itt.is_available()  # check whether ITT is available
torch.profiler.itt.mark(msg)       # mark an instantaneous event
torch.profiler.itt.range_push(msg) # push a nested range
torch.profiler.itt.range_pop()     # pop the innermost nested range
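
Because ITT is not available in every build, calls are best guarded. The sketch below wraps a matrix multiplication in an ITT range only when `is_available()` returns true; the function name `annotated_matmul` and the range label are illustrative assumptions.

```python
import torch
from torch.profiler import itt


def annotated_matmul(x: torch.Tensor) -> torch.Tensor:
    """Wrap a matmul in an ITT range when ITT is available; otherwise just run it."""
    use_itt = itt.is_available()
    if use_itt:
        itt.range_push("annotated_matmul")  # the range name is an arbitrary label
    try:
        return x @ x
    finally:
        # Pop only if a range was pushed, so push/pop calls stay balanced.
        if use_itt:
            itt.range_pop()


result = annotated_matmul(torch.randn(8, 8))
print(result.shape)
```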

7. Complete Example

import torch
import torch.profiler

# Define the model and input (this example requires a CUDA-capable GPU)
model = torch.nn.Linear(10, 10).cuda()
inputs = torch.randn(100, 10).cuda()

# Profile the training loop
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
    with_flops=True,
) as prof:

    for step in range(10):
        with torch.profiler.record_function("model_forward"):
            output = model(inputs)
            loss = output.sum()

        with torch.profiler.record_function("model_backward"):
            loss.backward()

        prof.step()

# Print summary statistics
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
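
For machines without a GPU, a CPU-only variant of the same loop can be sketched as follows. Instead of the TensorBoard handler it uses a custom callback, here called `save_trace` (a hypothetical name), that writes one Chrome trace per completed cycle into a temporary directory.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function, schedule

trace_dir = tempfile.mkdtemp()  # hypothetical output directory
saved_traces = []


def save_trace(prof):
    """Called once per completed active window; writes one Chrome trace per cycle."""
    path = os.path.join(trace_dir, f"trace_step_{prof.step_num}.json")
    prof.export_chrome_trace(path)
    saved_traces.append(path)


model = torch.nn.Linear(10, 10)
inputs = torch.randn(100, 10)

with profile(
    activities=[ProfilerActivity.CPU],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=save_trace,
) as prof:
    for _ in range(10):
        with record_function("model_forward"):
            loss = model(inputs).sum()
        with record_function("model_backward"):
            loss.backward()
        prof.step()

print(saved_traces)
```

With wait=1, warmup=1, active=3, and repeat=2, each cycle spans five steps, so the ten iterations cover two full cycles and the callback should fire twice.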

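8. Summary

PyTorch Profiler covers performance analysis needs from basic to advanced:

Scenario	Recommended usage
Quick debugging	Basic `profile` context manager
Long-running training	`schedule` + `on_trace_ready`
Visual analysis	TensorBoard integration
Memory optimization	`profile_memory=True`
Distributed training	`tensorboard_trace_handler` with `worker_name` set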