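CANN Debugging Tools and Performance Profiling: A Complete Debugging Workflow from Log Analysis to NPU Behavior Tracing

I. The Debugging Toolbox at a Glance

Before looking at each tool in depth, it helps to have the overall picture. CANN's debugging stack has four layers: the logging layer (records events), the tracing layer (follows operator behavior), the profiling layer (quantifies performance bottlenecks), and the checking layer (verifies numerical correctness). Knowing how these layers relate to each other lets you pick the right tool quickly when a problem appears.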

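1.1 Choosing a Tool

Problem type	Primary tool	Supporting tools	Typical stage
Incorrect operator results	Operator behavior tracer	Logs + dump data	Development/testing
High inference latency	Profiling	System-level performance tools	Optimization/deployment
Abnormal training loss	Gradient checks + logs	Numerical precision analysis	Training/fine-tuning
Device memory exhaustion (OOM)	Memory checker	Profiling memory view	Development/debugging
Compilation failure	ATC compilation logs	Model checking tools	Compilation
Runtime crash	Runtime logs	Core dump analysis	Runtime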

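1.2 Configuring the Logging System in Depth

CANN's logging system is more than a switch between two levels. It also supports per-module log control, switching log levels at runtime, and remote log collection.

Basic Configuration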

bash

# Set the log level (0=DEBUG, 1=INFO, 2=WARNING, 3=ERROR)
export ASCEND_LOG_LEVEL=3        # level for the current session
export ASCEND_GLOBAL_LOG_LEVEL=3 # global level (all modules)

# Write logs to a file (the default is stdout)
export ASCEND_LOG_PATH=/tmp/cann_debug.log

# Cap the size of each log file so the disk does not fill up
export ASCEND_LOG_FILE_SIZE_MB=500

# Cap the number of log files (rotation policy)
export ASCEND_LOG_FILE_COUNT=10

Per-Module Log Control

CANN contains several internal modules (Runtime, Task Scheduler, Memory Manager, Communication, and others), and each one can be given its own log level:

bash

# Enable DEBUG logs for the Task Scheduler only; keep other modules at ERROR
export ASCEND_LOG_LEVEL=3
export ASCEND_LOG_LEVEL_TASK_SCHEDULER=0

# Enable WARNING logs for the Memory Manager
export ASCEND_LOG_LEVEL_MEMORY_MANAGER=2

# Enable DEBUG logs for the Communication module (for HCCL communication problems)
export ASCEND_LOG_LEVEL_COMMUNICATION=0
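
When several modules need different levels, assembling the variables in code helps avoid typos. The sketch below assumes the `ASCEND_LOG_LEVEL_<MODULE>` naming pattern used in this article; `build_log_env` and its parameters are hypothetical helper names and are not part of CANN.

```python
import os

# Numeric levels as listed in the basic configuration above
LEVELS = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3}


def build_log_env(default_level, module_levels=None, base_env=None):
    """Return a copy of the environment with log-level variables set.

    Variable names follow the ASCEND_LOG_LEVEL[_<MODULE>] pattern shown in
    this article (an assumption); adjust them for your CANN version.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ASCEND_LOG_LEVEL"] = str(LEVELS[default_level])
    for module, level in (module_levels or {}).items():
        env[f"ASCEND_LOG_LEVEL_{module.upper()}"] = str(LEVELS[level])
    return env


env = build_log_env(
    "ERROR",
    {"task_scheduler": "DEBUG", "communication": "DEBUG"},
    base_env={},
)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` so that only the process under investigation runs with verbose logging.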

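Switching Log Levels at Runtime

Leaving DEBUG logging on permanently in production is not practical because the performance overhead is too high. CANN supports adjusting the log level while the program is running: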

c

// Adjust the log level at runtime
#include "ascendc_runtime/log.h"

// Switch from ERROR to DEBUG (while reproducing a problem)
AscendCSetLogLevel(ASCEND_LOG_DEBUG);

// Restore the ERROR level
AscendCSetLogLevel(ASCEND_LOG_ERROR);

// Enable DEBUG for one specific module only
AscendCSetModuleLogLevel("TaskScheduler", ASCEND_LOG_DEBUG);

Customizing the Log Output Format

bash

# Custom log format: time, module, level, message
export ASCEND_LOG_FORMAT="%Y-%m-%d %H:%M:%S.%f [%M] [%L] %m"

# JSON output (easier for log collection systems to parse)
export ASCEND_LOG_FORMAT=json
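
One reason to prefer JSON output is that each line can be filtered without fragile regular expressions. The following is a minimal sketch of such a filter. The field names `level`, `module`, and `message` are assumptions made for illustration, since the article does not show the actual JSON schema, and `filter_json_logs` is a hypothetical helper.

```python
import json

LEVEL_ORDER = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3}


def filter_json_logs(lines, min_level="WARNING", module=None):
    """Keep JSON log records at or above min_level, optionally for one module.

    Assumes one JSON object per line with "level" and "module" fields
    (hypothetical schema); malformed lines are skipped.
    """
    threshold = LEVEL_ORDER[min_level]
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON line, ignore it
        if not isinstance(record, dict):
            continue
        if LEVEL_ORDER.get(record.get("level"), -1) < threshold:
            continue
        if module is not None and record.get("module") != module:
            continue
        selected.append(record)
    return selected


sample_lines = [
    '{"level": "DEBUG", "module": "TaskScheduler", "message": "task queued"}',
    '{"level": "ERROR", "module": "Communication", "message": "HCCL timeout"}',
    'not json at all',
    '{"level": "WARNING", "module": "MemoryManager", "message": "pool nearly full"}',
]
hits = filter_json_logs(sample_lines, min_level="WARNING")
```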

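1.3 Setting Up a Debugging Environment Quickly

Before you start debugging, make sure the environment is configured correctly. The following script prepares a complete debugging environment: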

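bash

#!/bin/bash
# cann_debug_setup.sh - quick setup of a CANN debugging environment
# Source this file (". cann_debug_setup.sh") so the exported variables stay
# in your current shell; running it as a child process would discard them.

echo "=== Configuring CANN debugging environment ==="

# 1. Confirm the CANN version
echo "CANN version:"
cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg 2>/dev/null || echo "not found"

# 2. Confirm the NPU status
echo "NPU status:"
if command -v npu-smi >/dev/null 2>&1; then
    npu-smi info 2>/dev/null | head -20
else
    echo "npu-smi not available"
fi

# 3. Configure debug logging
export ASCEND_LOG_LEVEL=0                    # DEBUG level
export ASCEND_LOG_PATH=/tmp/cann_debug.log   # log output path
export ASCEND_LOG_FILE_SIZE_MB=500           # at most 500 MB per file
export ASCEND_LOG_FILE_COUNT=10              # at most 10 files
export OPS_DEBUG=1                           # operator-level debugging

# 4. Enable operator dump (saves input and output data)
export ASCEND_DUMP_PATH=/tmp/cann_dump
mkdir -p "$ASCEND_DUMP_PATH"

# 5. Enable profiling
export ASCEND_PROFILING_DIR=/tmp/cann_profile
mkdir -p "$ASCEND_PROFILING_DIR"

# 6. Enable core dumps (a core file is written on a crash)
ulimit -c unlimited
if [ -w /proc/sys/kernel/core_pattern ]; then
    echo "/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern
else
    echo "warning: cannot write core_pattern (root required)"
fi

echo "=== Debugging environment configured ==="
echo "Log file: $ASCEND_LOG_PATH"
echo "Dump directory: $ASCEND_DUMP_PATH"
echo "Profile directory: $ASCEND_PROFILING_DIR"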
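
Because exported variables are easy to lose (for example when the script is executed instead of sourced), a quick check from inside the Python process can confirm that the settings arrived. This is a minimal sketch under the assumption that the variable names match the script above; `check_debug_env` is a hypothetical helper.

```python
import os

# Variable names taken from the setup script in this article
REQUIRED_VARS = (
    "ASCEND_LOG_LEVEL",
    "ASCEND_LOG_PATH",
    "ASCEND_DUMP_PATH",
    "ASCEND_PROFILING_DIR",
)
DIRECTORY_VARS = ("ASCEND_DUMP_PATH", "ASCEND_PROFILING_DIR")


def check_debug_env(env=None):
    """Return a list of problems; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in DIRECTORY_VARS:
        path = env.get(name)
        if path and not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems
```

Calling `check_debug_env()` at the start of a debugging session and printing the returned list makes a forgotten `source` step visible immediately.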

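II. Operator Behavior Tracer

The operator behavior tracer is the main tool for investigating wrong operator results. It inserts checkpoints before and after each operator, records the shape, data type, and value range of the inputs and outputs, and automatically detects NaN/Inf anomalies.

2.1 Tracing Operator Execution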

python

import os
import json
import time
import numpy as np
from contextlib import contextmanager


class OpTracer:
    """Operator behavior tracer.

    For each operator it records:
    - input/output shapes and dtypes
    - execution latency
    - value ranges (with Inf/NaN checks)

    Usage:
        tracer = OpTracer()
        tracer.start()
        # run inference, calling tracer.trace_op(...) for each operator
        output = model(input)
        tracer.stop()
        tracer.report()
    """

    def __init__(self, dump_dir="/tmp/op_trace"):
        self.dump_dir = dump_dir
        self.traces = []
        self.running = False
        os.makedirs(dump_dir, exist_ok=True)

    def start(self):
        self.running = True
        self.traces = []
        print("Operator tracing started")

    def stop(self):
        self.running = False
        print(f"Operator tracing stopped, {len(self.traces)} operators recorded")

    @staticmethod
    def _summarize(index, tensor):
        """Collect shape, dtype, and value statistics for one tensor."""
        arr = np.asarray(tensor)
        info = {
            'index': index,
            'shape': list(arr.shape),
            'dtype': str(arr.dtype),
            'min': None, 'max': None, 'mean': None,
            'has_nan': False, 'has_inf': False,
        }
        # Statistics only make sense for non-empty, real-valued arrays
        if arr.size > 0 and arr.dtype.kind in 'iuf':
            info.update({
                'min': float(arr.min()),
                'max': float(arr.max()),
                'mean': float(arr.mean()),
                'has_nan': bool(np.isnan(arr).any()),
                'has_inf': bool(np.isinf(arr).any()),
            })
        return info

    def trace_op(self, op_name: str, inputs: list, outputs: list,
                 latency_ms: float = 0):
        """Record the execution details of a single operator."""
        if not self.running:
            return

        self.traces.append({
            'op_name': op_name,
            'timestamp': time.time(),
            'latency_ms': latency_ms,
            'inputs': [self._summarize(i, t) for i, t in enumerate(inputs)],
            'outputs': [self._summarize(i, t) for i, t in enumerate(outputs)],
        })

    def report(self):
        """Print the trace report and save the details as JSON."""
        print(f"\n{'='*60}")
        print(f"Operator trace report ({len(self.traces)} operators)")
        print(f"{'='*60}")

        # Sort by latency, slowest first
        sorted_traces = sorted(self.traces, key=lambda t: t['latency_ms'], reverse=True)

        print("\nTop 10 slowest operators:")
        print(f"{'Operator':<30s} {'Latency(ms)':<12s} {'Input shape':<20s} {'Output shape':<20s}")
        print("-" * 82)

        for trace in sorted_traces[:10]:
            op = trace['op_name']
            lat = trace['latency_ms']
            in_shape = str(trace['inputs'][0]['shape']) if trace['inputs'] else 'N/A'
            out_shape = str(trace['outputs'][0]['shape']) if trace['outputs'] else 'N/A'
            print(f"{op:<30s} {lat:<12.3f} {in_shape:<20s} {out_shape:<20s}")

        # Look for numerical anomalies in inputs and outputs
        anomalies = []
        for trace in self.traces:
            for key, label in (('inputs', 'input'), ('outputs', 'output')):
                for item in trace[key]:
                    if item['has_nan']:
                        anomalies.append(f"{trace['op_name']}: {label} contains NaN")
                    if item['has_inf']:
                        anomalies.append(f"{trace['op_name']}: {label} contains Inf")

        if anomalies:
            print(f"\nFound {len(anomalies)} anomalies:")
            for a in anomalies:
                print(f"  - {a}")
        else:
            print("\nNo numerical anomalies found")

        # Save the detailed report
        report_path = os.path.join(self.dump_dir, "trace_report.json")
        with open(report_path, 'w') as f:
            json.dump(self.traces, f, indent=2)
        print(f"\nDetailed report saved: {report_path}")

    @contextmanager
    def trace_context(self, op_name: str):
        """Context-manager form that measures latency automatically."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the latency even if the wrapped code raises
            latency = (time.perf_counter() - start) * 1000
            self.trace_op(op_name, [], [], latency)

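2.2 Automatic NaN/Inf Checks

NaN and Inf are the most common numerical anomalies in deep learning. Once an operator produces a NaN, every downstream computation is contaminated. The core value of NaNInspector is that it identifies the first operator that produced the anomaly automatically, so you do not have to work backward by hand from the last anomalous operator.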

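python

class NaNInspector:
    """Automatic NaN/Inf checker.

    Inserts a checkpoint around each operator to locate the source of a
    numerical anomaly.

    How it works:
    - during the forward pass, the output of every operator is checked
    - when NaN/Inf appears, the checkpoint is recorded
    - the first flagged checkpoint identifies the operator that introduced it

    Why this tool is needed:
    - a model has hundreds of operators, so manual checks are impractical
    - NaN propagates: once one operator emits NaN, all later outputs are NaN
    - only the first operator that produced NaN points to the root cause
    """

    def __init__(self):
        self.checkpoints = []
        self.first_anomaly = None
        self.first_anomaly_index = None

    def check(self, tensor, op_name: str, stage: str = "output"):
        """Check a tensor for NaN/Inf; return True when it is clean."""
        import torch

        if not isinstance(tensor, torch.Tensor):
            return True  # non-tensor values are not checked

        has_nan = torch.isnan(tensor).any().item()
        has_inf = torch.isinf(tensor).any().item()
        # Cast to float so integer and bool tensors are supported as well
        abs_mean = tensor.float().abs().mean().item()

        self.checkpoints.append({
            'op_name': op_name,
            'stage': stage,
            'has_nan': has_nan,
            'has_inf': has_inf,
            'shape': list(tensor.shape),
            'abs_mean': abs_mean,
        })

        if (has_nan or has_inf) and self.first_anomaly is None:
            self.first_anomaly = op_name
            self.first_anomaly_index = len(self.checkpoints) - 1
            print(f"[ALERT] First numerical anomaly: {op_name} ({stage})")
            print(f"  Shape: {tensor.shape}")
            print(f"  NaN: {has_nan}, Inf: {has_inf}")
            print(f"  Mean absolute value: {abs_mean:.6f}")

        return not has_nan and not has_inf

    def report(self):
        print(f"\nNumerical check report ({len(self.checkpoints)} checkpoints)")
        anomalies = [c for c in self.checkpoints if c['has_nan'] or c['has_inf']]
        if anomalies:
            print(f"  Anomalies: {len(anomalies)}")
            for a in anomalies:
                print(f"    {a['op_name']} ({a['stage']}): NaN={a['has_nan']}, Inf={a['has_inf']}")
            if self.first_anomaly:
                print(f"\n  First anomalous operator: {self.first_anomaly}")
                print("  → Focus on this operator's input data and computation logic")
        else:
            print("  All checkpoints passed ✓")

    def find_root_cause(self):
        """Trace the NaN back to its likely origin."""
        if self.first_anomaly_index is None:
            return None

        # Every checkpoint before the first anomaly passed, so the one
        # immediately before it is the last clean operator (if any).
        idx = self.first_anomaly_index
        last_normal = self.checkpoints[idx - 1]['op_name'] if idx > 0 else None

        if last_normal is None:
            suggestion = f"Check the inputs and computation logic of {self.first_anomaly}"
        else:
            suggestion = f"Check the computation between {last_normal} → {self.first_anomaly}"

        return {
            'first_anomaly': self.first_anomaly,
            'last_normal_op': last_normal,
            'suggestion': suggestion,
        }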
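
The idea of stopping at the first contaminated output can be shown without any framework. The sketch below is a toy illustration, not the article's tool: it runs a chain of scalar functions over plain Python floats and reports the first stage whose output contains NaN or Inf. The stage names and `first_bad_stage` are made up for the example.

```python
import math


def first_bad_stage(stages, values):
    """Run stages in order and return the name of the first stage whose
    output contains NaN or Inf, or None if every output is finite."""
    for name, fn in stages:
        values = [fn(v) for v in values]
        if any(math.isnan(v) or math.isinf(v) for v in values):
            return name
    return None


def log_or_neg_inf(x):
    # Mimics a device log operator that returns -inf for non-positive input
    return math.log(x) if x > 0 else float("-inf")


stages = [
    ("scale", lambda x: x * 0.5),
    ("shift", lambda x: x - 1.0),
    ("log", log_or_neg_inf),           # first stage that can emit -inf
    ("mul_zero", lambda x: x * 0.0),   # would turn -inf into NaN downstream
    ("bias", lambda x: x + 1.0),
]

culprit = first_bad_stage(stages, [2.0, 4.0, 8.0])
print(culprit)
```

Here the input 2.0 becomes 0.0 after `shift`, so `log` is the stage reported, even though the later stages would also show NaN if they were inspected in isolation.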

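2.3 Dumping and Comparing Data

Sometimes the NaN/Inf checks pass but the result is still wrong. In that case you need to compare operator outputs from two versions (for example, a CPU run and an NPU run) and find where they diverge.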

python

import os


class DataDumper:
    """Operator data dump tool.

    Saves each operator's inputs and outputs to disk for:
    - cross-device comparison (CPU vs NPU)
    - cross-version comparison (before vs after an upgrade)
    - offline analysis (exact inspection on a PC)
    """

    def __init__(self, dump_dir="/tmp/cann_dump"):
        self.dump_dir = dump_dir
        os.makedirs(dump_dir, exist_ok=True)

    def dump(self, op_name: str, inputs: list, outputs: list, step: int = 0):
        """Save an operator's input and output tensors."""
        import torch

        op_dir = os.path.join(self.dump_dir, f"step_{step:04d}_{op_name}")
        os.makedirs(op_dir, exist_ok=True)

        for prefix, tensors in (("input", inputs), ("output", outputs)):
            for i, t in enumerate(tensors):
                if isinstance(t, torch.Tensor):
                    path = os.path.join(op_dir, f"{prefix}_{i}.pt")
                    # Detach and move to host memory before saving
                    torch.save(t.detach().cpu(), path)

    def compare(self, dump_dir_a: str, dump_dir_b: str, op_name: str,
                atol: float = 1e-3):
        """Compare the outputs of one operator across two dump directories."""
        import torch

        for step_dir in sorted(os.listdir(dump_dir_a)):
            if op_name not in step_dir:
                continue

            dir_a = os.path.join(dump_dir_a, step_dir)
            dir_b = os.path.join(dump_dir_b, step_dir)

            if not os.path.exists(dir_b):
                print(f"[compare] {step_dir}: missing in directory B")
                continue

            # Compare every output file
            for fname in sorted(os.listdir(dir_a)):
                if not fname.startswith("output_"):
                    continue

                path_a = os.path.join(dir_a, fname)
                path_b = os.path.join(dir_b, fname)

                if not os.path.exists(path_b):
                    print(f"[compare] {step_dir}/{fname}: missing in B")
                    continue

                a = torch.load(path_a, map_location="cpu")
                b = torch.load(path_b, map_location="cpu")

                if a.shape != b.shape:
                    print(f"[compare] {step_dir}/{fname}: shape mismatch "
                          f"{tuple(a.shape)} vs {tuple(b.shape)} ✗")
                    continue

                # Compute the difference in float to avoid integer overflow
                diff = (a.float() - b.float()).abs()
                max_diff = diff.max().item()
                mean_diff = diff.mean().item()

                status = "✓" if max_diff < atol else "✗"
                print(f"[compare] {step_dir}/{fname}: max_diff={max_diff:.6f}, mean_diff={mean_diff:.6f} {status}")
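
A single absolute threshold treats a difference of 0.5 on a value of 1000 as a failure and a 10% error on a value of 0.001 as a pass. A common alternative is to combine absolute and relative tolerance, in the same form that `numpy.allclose` uses. The sketch below illustrates that check on plain Python lists; `compare_outputs` and the default tolerances are illustrative choices, not values taken from CANN.

```python
def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-5):
    """Element-wise comparison with combined tolerance.

    An element passes when |ref - cand| <= atol + rtol * |ref|.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    max_abs_err = 0.0
    mismatches = 0
    for ref, cand in zip(reference, candidate):
        abs_err = abs(ref - cand)
        max_abs_err = max(max_abs_err, abs_err)
        if abs_err > atol + rtol * abs(ref):
            mismatches += 1
    return {
        "max_abs_err": max_abs_err,
        "mismatches": mismatches,
        "passed": mismatches == 0,
    }


cpu_out = [1000.0, 0.001, -2.5]
npu_out = [1000.5, 0.0011, -2.5]
result = compare_outputs(cpu_out, npu_out)
print(result)
```

With these inputs the large value passes (its relative error is 0.05%) while the small value is flagged (its relative error is about 10%), which is the opposite of what a fixed `1e-3` absolute threshold would report.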

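III. Performance Profiling

Profiling is the main way to address high inference latency and slow training. CANN provides profiling at several levels, from the whole system down to individual operators.

3.1 System-Level Profiling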

python

import math
import random
import time
from collections import defaultdict


class SystemProfiler:
    """System-level performance profiler.

    Collects:
    - NPU utilization over time
    - device memory usage over time
    - CPU utilization (to see whether the NPU is waiting on the host)
    - inference latency distribution

    Produces:
    - time-series data for plotting
    - a bottleneck analysis report
    - optimization suggestions
    """

    def __init__(self, sample_interval_ms=100):
        self.interval = sample_interval_ms / 1000.0
        self.samples = []
        self.running = False

    def start(self, duration_seconds=10):
        """Collect samples for the given duration (blocks the caller)."""
        self.running = True
        self.samples = []
        start_time = time.time()

        while time.time() - start_time < duration_seconds and self.running:
            sample = self._collect_sample()
            self.samples.append(sample)
            time.sleep(self.interval)

        self.running = False
        print(f"Sampling finished: {len(self.samples)} samples")

    def _collect_sample(self):
        """Collect one sample.

        Placeholder: random values stand in for real measurements here.
        Replace them with readings from your monitoring source.
        """
        return {
            'timestamp': time.time(),
            'npu_util': random.uniform(0.3, 0.95),
            'memory_used_gb': random.uniform(4, 7.5),
            'cpu_util': random.uniform(0.1, 0.5),
            'inference_latency_ms': random.uniform(5, 20),
        }

    def analyze(self):
        """Analyze the collected samples."""
        if not self.samples:
            return {}

        npu_utils = [s['npu_util'] for s in self.samples]
        latencies = sorted(s['inference_latency_ms'] for s in self.samples)
        mem_used = [s['memory_used_gb'] for s in self.samples]

        # Nearest-rank p99: smallest sample with at least 99% of samples <= it
        p99_index = math.ceil(len(latencies) * 99 / 100) - 1

        analysis = {
            'npu_util_avg': sum(npu_utils) / len(npu_utils),
            'npu_util_max': max(npu_utils),
            'npu_util_min': min(npu_utils),
            'latency_avg_ms': sum(latencies) / len(latencies),
            'latency_p99_ms': latencies[p99_index],
            'memory_avg_gb': sum(mem_used) / len(mem_used),
            'memory_peak_gb': max(mem_used),
        }

        # Bottleneck heuristics
        if analysis['npu_util_avg'] < 0.5:
            analysis['bottleneck'] = 'CPU or I/O'
            analysis['suggestion'] = 'Increase the batch size or use asynchronous prefetching'
        elif analysis['latency_p99_ms'] > analysis['latency_avg_ms'] * 3:
            analysis['bottleneck'] = 'Tail latency'
            analysis['suggestion'] = 'Check for cold starts or GC pauses'
        else:
            analysis['bottleneck'] = 'NPU compute'
            analysis['suggestion'] = 'Consider operator fusion or quantization'

        print("\nPerformance analysis report:")
        print(f"  NPU utilization: {analysis['npu_util_avg']:.1%} (avg), {analysis['npu_util_max']:.1%} (max)")
        print(f"  Inference latency: {analysis['latency_avg_ms']:.2f}ms (avg), {analysis['latency_p99_ms']:.2f}ms (p99)")
        print(f"  Device memory: {analysis['memory_avg_gb']:.2f}GB (avg), {analysis['memory_peak_gb']:.2f}GB (peak)")
        print(f"  Bottleneck: {analysis['bottleneck']}")
        print(f"  Suggestion: {analysis['suggestion']}")

        return analysis
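
The p99 figure depends on how the percentile index is chosen, and an off-by-one there can silently turn "p99" into "maximum" for small sample counts. A minimal sketch of the nearest-rank definition follows; `percentile` is an illustrative helper, not a CANN API.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it. p must be in (0, 100].
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so integer percentiles stay exact
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies = [float(i) for i in range(1, 101)]  # 1.0 ms .. 100.0 ms
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(p50, p99)  # prints "50.0 99.0"
```

For comparison, indexing the sorted list at `int(len * 0.99)` would return 100.0 for these 100 samples, which is the maximum and not the 99th percentile.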

3.2 Operator-Level Profiling

python

from collections import defaultdict


class OpProfiler:
    """Operator-level performance profiler.

    Times each operator separately to find hot spots.

    Recorded per operator:
    - share of total time
    - call count
    - average latency
    - minimum and maximum latency
    """

    def __init__(self):
        self.op_stats = defaultdict(lambda: {
            'count': 0,
            'total_ms': 0.0,
            'min_ms': float('inf'),
            'max_ms': 0.0,
        })

    def record(self, op_name: str, latency_ms: float):
        stats = self.op_stats[op_name]
        stats['count'] += 1
        stats['total_ms'] += latency_ms
        stats['min_ms'] = min(stats['min_ms'], latency_ms)
        stats['max_ms'] = max(stats['max_ms'], latency_ms)

    def report(self, top_n=15):
        print(f"\nOperator performance, top {top_n}:")
        print(f"{'Operator':<30s} {'Calls':<8s} {'Total(ms)':<12s} {'Avg(ms)':<10s} {'Share':<8s}")
        print("-" * 68)

        total_time = sum(s['total_ms'] for s in self.op_stats.values())
        sorted_ops = sorted(self.op_stats.items(), key=lambda x: x[1]['total_ms'], reverse=True)

        for name, stats in sorted_ops[:top_n]:
            avg = stats['total_ms'] / stats['count']
            # Guard against division by zero when nothing measurable ran
            pct = stats['total_ms'] / total_time * 100 if total_time > 0 else 0.0
            print(f"{name:<30s} {stats['count']:<8d} {stats['total_ms']:<12.2f} {avg:<10.3f} {pct:<7.1f}%")
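
Feeding latencies into such a profiler by hand is error-prone, so a context manager that times a block and accumulates the result is a natural companion. The following is a minimal, framework-free sketch; `timed` and `op_stats` are illustrative names. Note that it measures host wall-clock time only: on a device that executes asynchronously you would need to synchronize before the timer stops, and that call is deliberately left out here.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Per-operator accumulators: call count and total latency in milliseconds
op_stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0})


@contextmanager
def timed(op_name, stats=op_stats):
    """Time the enclosed block and add the result to stats[op_name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls are still counted
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        stats[op_name]["count"] += 1
        stats[op_name]["total_ms"] += elapsed_ms


for _ in range(3):
    with timed("matmul"):
        sum(i * i for i in range(10_000))  # stand-in workload

average_ms = op_stats["matmul"]["total_ms"] / op_stats["matmul"]["count"]
```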

3.3 CANN Native Profiling Interface

Besides custom profiling, CANN provides a native profiling interface that collects performance data directly at the NPU hardware level:

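c

#include <string.h>
#include "acl/acl.h"
#include "acl/acl_prof.h"

// Initialize profiling; the second argument is the length of the path string
const char *result_path = "./profile_data";
aclprofInit(result_path, strlen(result_path));

// Create the profiling configuration.
// Enum names follow recent CANN releases; check acl_prof.h for your version.
uint32_t device_ids[] = {0};             // devices to profile
aclprofConfig *config = aclprofCreateConfig(
    device_ids, 1,                       // device list and its length
    ACL_AICORE_PIPE_UTILIZATION,         // AI Core metric group
    NULL,                                // AI Core events (reserved)
    ACL_PROF_ACL_API | ACL_PROF_TASK_TIME | ACL_PROF_AICORE_METRICS
);

// Start profiling collection
aclprofStart(config);

// ... run inference or training ...

// Stop profiling collection
aclprofStop(config);

// Release the configuration and finalize profiling
aclprofDestroyConfig(config);
aclprofFinalize();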

IV. Memory Checker

4.1 Tracking Device Memory Usage

python

import time


class MemoryTracker:
    """Device memory usage tracker.

    Tracks:
    - the size and address of every allocation and free
    - peak device memory usage
    - leaks (blocks allocated but never freed)
    """

    def __init__(self):
        self.allocations = {}  # addr → {size, timestamp, op_name}
        self.freed = []
        self.peak_usage = 0
        self.current_usage = 0

    def alloc(self, addr: int, size: int, op_name: str = "unknown"):
        """Record a memory allocation."""
        # If the address is reused without a recorded free, drop the stale entry
        stale = self.allocations.pop(addr, None)
        if stale is not None:
            self.current_usage -= stale['size']
        self.allocations[addr] = {
            'size': size,
            'timestamp': time.time(),
            'op_name': op_name,
        }
        self.current_usage += size
        self.peak_usage = max(self.peak_usage, self.current_usage)

    def free(self, addr: int):
        """Record a memory release."""
        if addr in self.allocations:
            info = self.allocations.pop(addr)
            self.current_usage -= info['size']
            self.freed.append(info)

    def check_leaks(self):
        """Report blocks that were never freed."""
        if self.allocations:
            print(f"\nFound {len(self.allocations)} unreleased memory blocks:")
            for addr, info in self.allocations.items():
                print(f"  Address: {addr}, size: {info['size']} bytes, allocated by: {info['op_name']}")
        else:
            print("No memory leaks found ✓")

    def report(self):
        print("\nDevice memory usage report:")
        print(f"  Current usage: {self.current_usage / 1024**3:.2f} GB")
        print(f"  Peak usage: {self.peak_usage / 1024**3:.2f} GB")
        print(f"  Total allocations: {len(self.allocations) + len(self.freed)}")
        print(f"  Freed: {len(self.freed)}")
        print(f"  Not freed: {len(self.allocations)}")

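4.2 Device Memory Fragmentation Analysis

Fragmentation is a common reason why an allocation fails even though the total amount of free device memory is sufficient. A fragmentation analyzer helps you understand how device memory is actually being used: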

python

import time


class MemoryFragmentationAnalyzer:
    """Device memory fragmentation analyzer.

    Analyzes:
    - fragmentation ratio = 1 - (largest free block / total free memory)
    - causes of fragmentation (frequent small allocations and frees)
    - optimization suggestions (pre-allocation, memory pools)
    """

    def __init__(self):
        self.allocation_history = []

    def record_allocation(self, size: int, op_name: str):
        self.allocation_history.append({
            'size': size,
            'op_name': op_name,
            'timestamp': time.time(),
            'type': 'alloc'
        })

    def record_free(self, size: int, op_name: str):
        self.allocation_history.append({
            'size': size,
            'op_name': op_name,
            'timestamp': time.time(),
            'type': 'free'
        })

    def analyze_fragmentation(self):
        """Estimate fragmentation from the recorded history."""
        # Simulation only: real block addresses must come from the CANN API.
        # Freed blocks are assumed never to merge with their neighbors, and
        # an allocation reuses a freed block only on an exact size match.
        live_blocks = []
        free_blocks = []

        for event in self.allocation_history:
            size = event['size']
            if event['type'] == 'alloc':
                if size in free_blocks:
                    free_blocks.remove(size)
                live_blocks.append(size)
            elif event['type'] == 'free' and size in live_blocks:
                live_blocks.remove(size)
                free_blocks.append(size)

        if not free_blocks:
            print("No freed blocks recorded, nothing to analyze")
            return None

        total_free = sum(free_blocks)
        largest_free = max(free_blocks)
        avg_free = total_free / len(free_blocks)

        # Fragmentation ratio as defined in the class docstring
        fragmentation = 1 - (largest_free / total_free)

        print("\nDevice memory fragmentation analysis:")
        print(f"  Live blocks: {len(live_blocks)}")
        print(f"  Free blocks: {len(free_blocks)}")
        print(f"  Total free: {total_free / 1024**3:.2f} GB")
        print(f"  Largest free block: {largest_free / 1024**3:.2f} GB")
        print(f"  Average free block: {avg_free / 1024**2:.2f} MB")
        print(f"  Fragmentation ratio: {fragmentation:.1%}")

        if fragmentation > 0.3:
            print("\n  ⚠️ High fragmentation, suggestions:")
            print("    1. Pre-allocate device memory with a memory pool")
            print("    2. Merge small memory requests")
            print("    3. Reduce frequent allocate/free cycles")

        return fragmentation
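
When block addresses are available, the fragmentation ratio can be computed directly from the gaps between live blocks. The sketch below works through the formula `1 - largest free block / total free memory` on a toy arena. It assumes a single contiguous arena and non-overlapping blocks; `fragmentation` and the sample layout are illustrative only.

```python
def fragmentation(arena_size, live_blocks):
    """Compute free-space statistics for one contiguous arena.

    live_blocks is an iterable of (offset, size) pairs that do not overlap
    and lie within [0, arena_size).
    """
    gaps = []
    cursor = 0
    for offset, size in sorted(live_blocks):
        if offset < cursor:
            raise ValueError("blocks overlap")
        if offset > cursor:
            gaps.append(offset - cursor)  # free space before this block
        cursor = offset + size
    if cursor > arena_size:
        raise ValueError("block extends past the end of the arena")
    if cursor < arena_size:
        gaps.append(arena_size - cursor)  # free space after the last block

    total_free = sum(gaps)
    largest_free = max(gaps, default=0)
    ratio = 0.0 if total_free == 0 else 1 - largest_free / total_free
    return {
        "total_free": total_free,
        "largest_free": largest_free,
        "fragmentation": ratio,
    }


# 100-unit arena with four 10-unit blocks spread evenly across it
layout = [(0, 10), (30, 10), (60, 10), (90, 10)]
stats = fragmentation(100, layout)
print(stats)
```

In this layout 60 units are free, but the largest gap is only 20 units, so a 30-unit request would fail. The ratio of about 0.67 captures exactly that situation.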

V. End-to-End Debugging Workflow

The tools above can be combined into one complete debugging workflow:

python

import time


def debug_inference_pipeline(model, input_data):
    """Complete inference debugging workflow.

    Steps:
    1. Operator behavior tracing
    2. NaN/Inf checks
    3. Performance analysis
    4. Memory tracking
    5. Report generation
    """
    print("=" * 60)
    print("CANN inference debugging workflow")
    print("=" * 60)

    # 1. Operator tracing
    tracer = OpTracer(dump_dir="/tmp/debug_trace")
    tracer.start()

    # 2. Numerical checks
    nan_inspector = NaNInspector()

    # 3. Performance analysis
    op_profiler = OpProfiler()

    # 4. Memory tracking (only reports what alloc()/free() have recorded,
    #    so hook it into your allocator to get meaningful numbers)
    mem_tracker = MemoryTracker()

    # Run inference
    print("\n[Step 1] Running inference...")
    for i in range(10):
        start = time.perf_counter()
        output = model(input_data)
        latency = (time.perf_counter() - start) * 1000

        tracer.trace_op(f"inference_{i}", [input_data], [output], latency)
        op_profiler.record("model_inference", latency)

        # Numerical check
        nan_inspector.check(output, f"inference_{i}", "output")

    tracer.stop()

    # Generate reports
    print("\n[Step 2] Generating debug reports...")
    tracer.report()
    nan_inspector.report()
    op_profiler.report()
    mem_tracker.report()

    # NaN root-cause analysis
    root_cause = nan_inspector.find_root_cause()
    if root_cause:
        print("\nNaN root-cause analysis:")
        print(f"  First anomalous operator: {root_cause['first_anomaly']}")
        print(f"  Last normal operator: {root_cause['last_normal_op']}")
        print(f"  Suggestion: {root_cause['suggestion']}")

    print("\nDebugging finished")

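VI. Troubleshooting Checklist for Common Problems

Symptom	Troubleshooting steps	Common causes
Output is all zeros	Check model loading → check input data → check the first operator	Model failed to load
Output is NaN	Use NaNInspector to work back from the output to the first anomalous operator	Learning rate too high / division by zero
Sudden spike in inference latency	Compare profiling data from normal and abnormal periods	NPU thermal throttling / memory fragmentation
Device memory keeps growing	Check for leaks with MemoryTracker	Intermediate results not released
Compilation error	Look for "unsupported op" in the ATC logs	Model uses an unsupported operator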
