CANN: Quick Start for Ascend NPU Development

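1. Background: Ascend in One Minute

1.1 What Is Ascend?

Ascend is Huawei's family of AI accelerator cards:

Ascend 910: training card (256 TFLOPS FP16)
Ascend 310: inference card (8 TFLOPS FP16)

Put simply, Ascend is Huawei's counterpart to an NVIDIA GPU, built specifically for AI workloads.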

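1.2 What Is CANN?

CANN (Compute Architecture for Neural Networks) is Ascend's heterogeneous computing architecture. It sits between your code and the hardware:

Your code → CANN → Ascend hardware
   ↓
PyTorch / MindSpore → CANN → NPU

The two pieces to keep apart:

CANN = the software stack (driver + compiler + operator libraries)
Ascend = the hardware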

2. Environment Setup

2.1 Hardware Check

# List the NPUs present on the host
npu-smi info

# Expected output (layout varies with the driver version):
# +----------------------------------------------------------------------------+
# | npu-smi 8.2.RC1       | Tue May 20 10:00:00 2026          |
# +-------------------------------+----------------------+---------------+
# | NPU   Name                  | Board-Version        | MCU-Firmware   |
# | 0    Ascend 910           | RC2.910B128G        | 2.2.6.11      |
# | 1    Ascend 910           | RC2.910B128G        | 2.2.6.11      |
# ...                         ...                    ...           |
# +-------------------------------+----------------------+---------------+

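If no NPU shows up, check the following:

# 1. Check that the card is visible on the PCIe bus
#    (the device may be listed under the vendor name rather than "Ascend")
lspci | grep -i -E "ascend|huawei"

# 2. Check that the driver is loaded (it creates the /dev/davinci* device nodes)
ls /dev/davinci*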
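The same pre-flight check can be scripted. The sketch below is a minimal example that assumes the driver exposes cards as `/dev/davinci<N>` device nodes and that `npu-smi` is on the `PATH` when the driver is installed. The function name `npu_precheck` is made up for illustration.

```python
import glob
import shutil


def npu_precheck(dev_pattern="/dev/davinci[0-9]*"):
    """Return a small report on whether the host looks NPU-ready."""
    # Assumption: each card appears as /dev/davinci<N> once the driver is loaded
    device_nodes = sorted(glob.glob(dev_pattern))
    return {
        "npu_smi_on_path": shutil.which("npu-smi") is not None,
        "device_nodes": device_nodes,
        "driver_loaded": len(device_nodes) > 0,
    }


report = npu_precheck()
for key, value in report.items():
    print(f"{key}: {value}")
```

On a host without an NPU the report shows an empty device list, which points to the same two checks listed above.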

2.2 Choosing an Image

We recommend the official prebuilt Docker images:

# Training environment
docker pull registry.baidubce.com/ascend/ascend-cann:8.2.RC1-training-ubuntu22.04-x86_64

# Inference environment
docker pull registry.baidubce.com/ascend/ascend-cann:8.2.RC1-infer-ubuntu22.04-x86_64
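Pulling the image is only half the job: the container also needs access to the NPU device nodes and the host driver. The sketch below builds a `docker run` command line. The device node names (`davinci<N>`, `davinci_manager`, `devmm_svm`, `hisi_hdc`) and mount paths are assumptions based on a default driver installation, and `docker_run_command` is a hypothetical helper, so check what actually exists on your host before using it.

```python
import shlex


def docker_run_command(image, npu_ids=(0,), shell="/bin/bash"):
    """Build a `docker run` argv that exposes the given NPUs to a container."""
    argv = ["docker", "run", "-it", "--rm"]
    # One device node per card (assumed naming: /dev/davinci<N>)
    for npu_id in npu_ids:
        argv.append(f"--device=/dev/davinci{npu_id}")
    # Shared management nodes (assumed; verify they exist on your host)
    for node in ("davinci_manager", "devmm_svm", "hisi_hdc"):
        argv.append(f"--device=/dev/{node}")
    # Host driver files and npu-smi, mounted read-only (assumed default paths)
    for path in ("/usr/local/Ascend/driver", "/usr/local/bin/npu-smi"):
        argv += ["-v", f"{path}:{path}:ro"]
    argv += [image, shell]
    return argv


image = "registry.baidubce.com/ascend/ascend-cann:8.2.RC1-training-ubuntu22.04-x86_64"
print(shlex.join(docker_run_command(image, npu_ids=(0, 1))))
```

Printing the command instead of executing it makes it easy to review the device list before starting the container.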

2.3 Base Dependencies

# Check inside the container
python --version
# Example output: Python 3.9.18

pip list | grep -E "torch|numpy"
# Example output:
# torch           2.1.0
# numpy           1.26.0
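The same dependency check can be done from Python, which is convenient in a startup script. This is a minimal sketch using only the standard library. The helper name `installed_version` is made up for illustration.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a pip package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(f"python: {sys.version.split()[0]}")
for name in ("torch", "numpy"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Unlike `pip list | grep`, this reports on the interpreter that is actually running, which matters when several Python environments coexist in one container.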

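3. Installing CANN

3.1 Driver Installation (Required on Physical Hosts)

# 1. Download the CANN Community Edition
# https://www.hiascend.com/software/cann/community

# 2. Install the driver
tar -zxf Ascend-cann-driver-{version}-linux-x86_64.tar.gz
cd Ascend-cann-driver-{version}
sudo ./install.sh --all

# 3. Verify
npu-smi info
# Success if every installed card is listed (8 on an 8-card server)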

3.2 PyTorch Environment

# Option 1: official image (recommended)
# The image already includes PyTorch + CANN

# Option 2: manual installation
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cpu

# Install the NPU adapter (torch_npu; its version must match the torch version)
pip install torch-npu

# Verify
python -c "import torch, torch_npu; print(f'PyTorch: {torch.__version__}'); print(f'torch_npu: {torch_npu.__version__}')"

3.3 Installing the Operator Libraries

# Core operator library (required)
pip install ascend-ops-nn

# Transformer acceleration library (optional)
pip install ascend-atb

# Check the installation
pip list | grep -i ascend
# Lists the Ascend packages installed above with their versions

4. First Model: Running ResNet50

4.1 Prepare the Code

# test_resnet.py
import torch
import torch_npu  # registers the "npu" device with PyTorch
import torchvision.models as models

# 1. Load the model
model = models.resnet50(weights='IMAGENET1K_V1').eval()
model = model.to("npu")

# 2. Prepare the input
input_tensor = torch.randn(1, 3, 224, 224).to("npu")

# 3. Run inference
with torch.no_grad():
    output = model(input_tensor)

print(f"Output shape: {output.shape}")
print(f"Top probability: {output.softmax(dim=1).max().item():.4f}")
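A script that hard-codes `"npu"` fails on a machine without a card. The sketch below falls back to the CPU in that case. It assumes the PyTorch adapter is the `torch_npu` package and that importing it registers the `torch.npu` namespace. `pick_device` is a hypothetical helper name.

```python
def pick_device():
    """Return "npu" if the adapter imports and a card is visible, else "cpu"."""
    try:
        import torch
        import torch_npu  # noqa: F401  (assumed: the import registers the "npu" backend)
    except (ImportError, OSError):
        # torch or the adapter is missing, or its native libraries failed to load
        return "cpu"
    return "npu" if torch.npu.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

With this in place, `model.to(device)` and `input_tensor.to(device)` let the same script run on a laptop for debugging and on an Ascend host for real measurements.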

4.2 Run It

python test_resnet.py

Expected output (the probability varies from run to run because the input is random):

Output shape: torch.Size([1, 1000])
Top probability: 0.4532

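4.3 Performance Comparison

# Compare CPU vs NPU
import time

import torch
import torch_npu  # registers the "npu" device with PyTorch
import torchvision.models as models

input_cpu = torch.randn(1, 3, 224, 224)

# CPU
model_cpu = models.resnet50(weights='IMAGENET1K_V1').eval()
with torch.no_grad():
    start = time.time()
    for _ in range(100):
        _ = model_cpu(input_cpu)
    cpu_time = time.time() - start

# NPU
model_npu = models.resnet50(weights='IMAGENET1K_V1').eval().to("npu")
input_npu = input_cpu.to("npu")
with torch.no_grad():
    # Warm up so that one-time operator compilation is not timed
    for _ in range(10):
        _ = model_npu(input_npu)
    torch.npu.synchronize()

    start = time.time()
    for _ in range(100):
        _ = model_npu(input_npu)
    torch.npu.synchronize()  # NPU calls are asynchronous; wait before stopping the clock
    npu_time = time.time() - start

print(f"CPU: {cpu_time*1000:.1f} ms")
print(f"NPU: {npu_time*1000:.1f} ms")
print(f"Speedup: {cpu_time/npu_time:.1f}x")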

Results:

Device	Time for 100 inferences	Speedup
CPU	4,200 ms	1x
NPU (Ascend 910)	890 ms	4.7x
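Accelerator timings are easy to get wrong because device calls return before the work finishes. The helper below is a minimal sketch of a timing loop with warm-up and an optional synchronization hook. `benchmark` and `speedup` are hypothetical names, and the workload is a pure-CPU stand-in so the sketch runs anywhere.

```python
import time


def benchmark(fn, iters=100, warmup=10, sync=None):
    """Return the average seconds per call of fn over `iters` timed runs."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain queued device work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the device to finish before stopping the clock
    return (time.perf_counter() - start) / iters


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


avg = benchmark(lambda: sum(range(10_000)), iters=50, warmup=5)
print(f"avg per call: {avg * 1e3:.3f} ms")
# Figures from the table: 4,200 ms on CPU vs 890 ms on NPU
print(f"speedup from the table: {speedup(4.2, 0.89):.1f}x")
```

On an Ascend host, a device synchronization function would be passed as `sync`, assuming the adapter provides one (for `torch_npu` this would be `torch.npu.synchronize`).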

5. Going Further: Running a Large Language Model

5.1 Install Dependencies

# Install the Transformer-related packages
pip install transformers
pip install accelerate
pip install ascend-atb

5.2 Running LLaMA

# test_llama.py
import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and move it to the NPU
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf"
).to("npu")

# Run inference
input_text = "Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt").to("npu")

outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))

5.3 Performance

First generation: 2.8 s
Second generation: 0.15 s (KV cache hit)
Throughput: 215 tokens/s
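Throughput is simply generated tokens divided by wall-clock time. The sketch below works through that arithmetic. It assumes the timings above refer to the 32-token generation from section 5.2, which the text does not state explicitly.

```python
def tokens_per_second(new_tokens, seconds):
    """Generation throughput: tokens produced per second of wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return new_tokens / seconds


# Assumption: both timings cover max_new_tokens=32 from test_llama.py
warm = tokens_per_second(32, 0.15)
cold = tokens_per_second(32, 2.8)
print(f"warm run: {warm:.0f} tokens/s")
print(f"cold run: {cold:.1f} tokens/s")
```

Under that assumption the warm run comes out at roughly 213 tokens/s, in line with the reported figure, while the first run is about 11 tokens/s.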

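6. Verifying the Environment

6.1 Basic Checks

# 1. NPUs are present
npu-smi info
# The cards are listed

# 2. PyTorch can see the NPU
python -c "import torch, torch_npu; print(torch.npu.is_available())"
# Output: True

# 3. Operators run on the NPU
python -c "import torch, torch_npu; a = torch.ones(2, 2).to('npu'); print(torch.matmul(a, a).cpu())"
# Output: a 2x2 tensor filled with 2.0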

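6.2 Performance Check

# Matrix-multiplication benchmark
import time

import torch
import torch_npu  # registers the "npu" device with PyTorch

a = torch.randn(4096, 4096).to("npu")
b = torch.randn(4096, 4096).to("npu")

# NPU (warm up first; synchronize because NPU calls are asynchronous)
for _ in range(10):
    c = torch.matmul(a, b)
torch.npu.synchronize()
start = time.time()
for _ in range(100):
    c = torch.matmul(a, b)
torch.npu.synchronize()
npu_time = time.time() - start

# CPU
a_cpu = a.cpu()
b_cpu = b.cpu()
start = time.time()
for _ in range(100):
    c_cpu = torch.matmul(a_cpu, b_cpu)
cpu_time = time.time() - start

# total seconds * 1000 / 100 iterations = milliseconds per matmul
print(f"CPU: {cpu_time*10:.1f} ms per matmul")
print(f"NPU: {npu_time*10:.1f} ms per matmul")
print(f"Speedup: {cpu_time/npu_time:.1f}x")

Expected: the NPU is 8-10x faster than the CPU.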
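A speedup ratio says nothing about how close the card gets to its rated peak. The sketch below converts a measured matmul time into achieved TFLOPS, using the standard count of 2·n³ floating-point operations for an n×n matrix product. The 5 ms timing is a made-up figure for illustration, not a measurement.

```python
def matmul_flops(n):
    """Floating-point operations in an (n x n) @ (n x n) product: n^3 multiplies + n^3 adds."""
    return 2 * n ** 3


def achieved_tflops(n, seconds_per_matmul):
    """Throughput in TFLOPS given the time for one n x n matmul."""
    return matmul_flops(n) / seconds_per_matmul / 1e12


n = 4096
print(f"FLOPs per matmul: {matmul_flops(n):,}")
# Hypothetical timing of 5 ms per matmul, for illustration only
print(f"At 5 ms per matmul: {achieved_tflops(n, 5e-3):.1f} TFLOPS")
```

Note that `torch.randn` creates FP32 tensors by default, so the result is only comparable to an FP16 peak figure if the inputs are first cast to half precision.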

7. References

Related repositories

Repository	Description	Link
ascend-npu	NPU Python adapter	https://gitee.com/ascend/ascend-npu
ops-nn	Core operator library	https://gitee.com/ascend/ops-nn
ATB	Transformer acceleration library	https://gitee.com/ascend/ascend-transformer-engine
GE	Graph compiler	https://gitee.com/ascend/ge-graph
