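1. Recap: Ascend in One Minute
1.1 What is Ascend?
Ascend is Huawei's family of AI accelerator cards:
- Ascend 910: training card (256 TFLOPS FP16)
- Ascend 310: inference card (8 TFLOPS FP16)
A rough mental model: Ascend plays the role in Huawei's AI stack that an NVIDIA GPU plays in the CUDA stack.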
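To put the two peak figures in perspective, here is a back-of-the-envelope sketch. It assumes the standard 2·m·n·k FLOP count for a dense matmul and that the card sustains its quoted FP16 peak, which real workloads rarely do, so read the result as a lower bound on time. The helper name `ideal_matmul_ms` is made up for this illustration.

```python
# Peak FP16 throughput figures quoted above, in TFLOPS.
PEAK_FP16_TFLOPS = {"Ascend 910": 256.0, "Ascend 310": 8.0}


def ideal_matmul_ms(m: int, n: int, k: int, peak_tflops: float) -> float:
    """Best-case time in ms for an (m x k) @ (k x n) matmul at peak throughput."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / (peak_tflops * 1e12) * 1e3


for name, peak in PEAK_FP16_TFLOPS.items():
    print(f"{name}: {ideal_matmul_ms(4096, 4096, 4096, peak):.3f} ms")
```

The 32x gap between the two results is simply the ratio of the quoted peaks (256 / 8).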
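1.2 What is CANN?
CANN (Compute Architecture for Neural Networks) is Huawei's heterogeneous computing architecture for Ascend. It sits between your code and the hardware:
your code → CANN → Ascend hardware
↓
PyTorch / MindSpore → CANN → NPU
Two things to keep apart:
- CANN = the software stack (driver + compiler + operator libraries)
- Ascend = the hardware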
2. Environment Setup
2.1 Hardware check
bash
# List the NPUs the driver can see
npu-smi info
# Example output (illustrative; the exact layout depends on the driver version):
# +----------------------------------------------------------------------------+
# | npu-smi 8.2.RC1 | Tue May 20 10:00:00 2026 |
# +-------------------------------+----------------------+---------------+
# | NPU Name | Board-Version | MCU-Firmware |
# | 0 Ascend 910 | RC2.910B128G | 2.2.6.11 |
# | 1 Ascend 910 | RC2.910B128G | 2.2.6.11 |
# ... ... ... |
# +-------------------------------+----------------------+---------------+
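If you want to script this check, a small parser over the table is enough. The sketch below assumes rows shaped like the sample above (`| <index> <name> | ... |`). Real `npu-smi info` output may be laid out differently depending on the driver version, so treat the regular expression as a starting point. `list_npus` and `SAMPLE` are names invented for this example.

```python
import re

# Sample text shaped like the illustrative output above (an assumption,
# not a guaranteed npu-smi format).
SAMPLE = """\
+-------------------------------+----------------------+---------------+
| npu-smi 8.2.RC1 | Tue May 20 10:00:00 2026 |
| NPU Name | Board-Version | MCU-Firmware |
| 0 Ascend 910 | RC2.910B128G | 2.2.6.11 |
| 1 Ascend 910 | RC2.910B128G | 2.2.6.11 |
+-------------------------------+----------------------+---------------+
"""

# A device row starts with "|", then a numeric index, then the device name.
ROW = re.compile(r"^\|\s*(\d+)\s+([^|]+?)\s*\|", re.MULTILINE)


def list_npus(text: str) -> list:
    """Return (index, name) pairs for every device row found in the text."""
    return [(int(index), name) for index, name in ROW.findall(text)]


print(list_npus(SAMPLE))
```

On a real machine you would feed it the captured stdout of `subprocess.run(["npu-smi", "info"], capture_output=True, text=True)` instead of `SAMPLE`.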
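If no NPU shows up, check the following:
bash
# 1. Is the card visible on the PCIe bus?
#    Some lspci databases list Ascend cards only as Huawei "Processing accelerators".
lspci | grep -i -E "ascend|processing accelerators"
# 2. Is the driver installed? npu-smi is a CLI tool, not a system service,
#    so inspect the driver's version file instead (default install path).
cat /usr/local/Ascend/driver/version.info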
2.2 Choosing an image
The official Docker images are the recommended starting point:
bash
# Training environment
docker pull registry.baidubce.com/ascend/ascend-cann:8.2.RC1-training-ubuntu22.04-x86_64
# Inference environment
docker pull registry.baidubce.com/ascend/ascend-cann:8.2.RC1-infer-ubuntu22.04-x86_64
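Both tags follow the same `<CANN version>-<flavor>-<OS>-<arch>` pattern, so the tag can be assembled from the host's architecture instead of being hard-coded. This is a minimal sketch: `cann_image_tag` is a hypothetical helper, and the idea that a tag exists for every architecture `platform.machine()` can return (for example `aarch64`) is an assumption you should check against the registry.

```python
import platform
from typing import Optional


def cann_image_tag(cann_version: str, flavor: str,
                   os_name: str = "ubuntu22.04",
                   arch: Optional[str] = None) -> str:
    """Assemble a tag in the <version>-<flavor>-<os>-<arch> pattern used above."""
    if flavor not in ("training", "infer"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    # Fall back to the host architecture, e.g. "x86_64" or "aarch64" on Linux.
    arch = arch or platform.machine()
    return f"{cann_version}-{flavor}-{os_name}-{arch}"


print(cann_image_tag("8.2.RC1", "training", arch="x86_64"))
```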
2.3 Base dependencies
bash
# Check inside the container
python --version
# Output: Python 3.9.18
pip list | grep -E "torch|numpy"
# Output:
# torch 2.1.0
# numpy 1.26.0
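The same check can be done from Python, which is handy in a setup script. The sketch below uses only the standard library. The expected versions are the ones shown above, and the helper names (`installed_version`, `version_tuple`) are invented for this example.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric components: '2.1.0+cpu' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'post3' or 'rc1'
        parts.append(int(piece))
    return tuple(parts)


# Major.minor versions taken from the pip output above.
EXPECTED = {"torch": (2, 1), "numpy": (1, 26)}

for package, wanted in EXPECTED.items():
    found = installed_version(package)
    ok = found is not None and version_tuple(found)[:len(wanted)] == wanted
    print(f"{package}: found={found} expected={wanted} ok={ok}")
```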
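3. Installing CANN
3.1 Driver installation (bare-metal hosts only)
bash
# 1. Download CANN Community Edition
# https://www.hiascend.com/software/cann/community
# 2. Install the driver
tar -zxf Ascend-cann-driver-{version}-linux-x86_64.tar.gz
cd Ascend-cann-driver-{version}
sudo ./install.sh --all
# 3. Verify
npu-smi info
# Success: every installed card is listed (8 on an 8-card server)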
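3.2 PyTorch environment
bash
# Option 1: official image (recommended)
# The image already ships PyTorch + CANN
# Option 2: manual installation
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cpu
# Install the NPU adapter (Ascend Extension for PyTorch);
# pick the torch-npu release that matches torch 2.1.0 and your CANN version
pip install torch-npu
# Verify (statements are joined with ";" because bash does not expand \n inside double quotes)
python -c "import torch, torch_npu; print(f'PyTorch: {torch.__version__}'); print(f'torch_npu: {torch_npu.__version__}')"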
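3.3 Operator libraries
bash
# Core operator library (required)
pip install ascend-ops-nn
# Transformer acceleration library (optional)
pip install ascend-atb
# Check the installation by running one operator on the NPU
python -c "import torch, torch_npu; x = torch.randn(4, 4).to('npu'); print(torch.matmul(x, x).shape)"
# Output: torch.Size([4, 4])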
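4. First Model: Running ResNet50
4.1 Prepare the code
python
# test_resnet.py
import torch
import torch_npu  # Ascend's PyTorch adapter (pip package torch-npu); importing it registers the "npu" device
import torchvision.models as models
# 1. Load the model
model = models.resnet50(weights='IMAGENET1K_V1').eval()
model = model.to("npu")
# 2. Prepare the input (random noise, so the predicted class is meaningless)
input_tensor = torch.randn(1, 3, 224, 224).to("npu")
# 3. Run inference
with torch.no_grad():
    output = model(input_tensor)
print(f"Output shape: {output.shape}")
print(f"Max probability: {output.softmax(dim=1).max().item():.4f}")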
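4.2 Run it
bash
python test_resnet.py
Expected output (the probability varies from run to run because the input is random):
Output shape: torch.Size([1, 1000])
Max probability: 0.4532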
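The "max probability" line is the largest entry of the softmax over the 1000 output logits. If that step feels opaque, here is a plain-Python sketch of the same computation on three made-up logits. It illustrates the formula only and is not how PyTorch implements it internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # made-up logits for a 3-class toy example
print(f"Max probability: {max(probs):.4f}")
```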
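4.3 Performance comparison
python
# Compare CPU vs NPU
import time
import torch
import torch_npu  # registers the "npu" device
import torchvision.models as models
# CPU
model_cpu = models.resnet50(weights='IMAGENET1K_V1').eval()
input_cpu = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    start = time.time()
    for _ in range(100):
        _ = model_cpu(input_cpu)
    cpu_time = time.time() - start
# NPU
model_npu = models.resnet50(weights='IMAGENET1K_V1').eval().to("npu")
input_npu = torch.randn(1, 3, 224, 224).to("npu")
with torch.no_grad():
    _ = model_npu(input_npu)  # warm-up: the first call includes one-off setup cost
    torch.npu.synchronize()
    start = time.time()
    for _ in range(100):
        _ = model_npu(input_npu)
    torch.npu.synchronize()  # NPU kernels run asynchronously; wait before stopping the clock
    npu_time = time.time() - start
print(f"CPU: {cpu_time*1000:.1f} ms")
print(f"NPU: {npu_time*1000:.1f} ms")
print(f"Speedup: {cpu_time/npu_time:.1f}x")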
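Accelerator timings are easy to get wrong: the first call carries one-off setup cost, and device work is usually queued asynchronously, so the clock can stop before the work finishes. A reusable harness along these lines keeps both concerns in one place. It is a sketch, and the `sync` parameter is a hypothetical hook where you would pass the device's synchronize function. It stays `None` for plain CPU code.

```python
import time


def benchmark(fn, iters=100, warmup=5, sync=None):
    """Return the average seconds per call of fn, excluding warm-up runs."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    if sync is not None:
        sync()  # drain any queued device work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to actually finish
    return (time.perf_counter() - start) / iters


# CPU-only demo so the sketch runs anywhere.
calls = []
avg = benchmark(lambda: calls.append(sum(range(1000))), iters=20, warmup=3)
print(f"{avg * 1e6:.1f} us per call, {len(calls)} calls in total")
```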
Results:
| Device | 100 inferences | Speedup |
|---|---|---|
| CPU | 4,200 ms | 1x |
| NPU 910 | 890 ms | 4.7x |
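The table reports totals for 100 runs. Per-inference latency and the speedup column follow from simple division, sketched here with the numbers copied from the table (`summarize` is a helper made up for this example).

```python
def summarize(total_ms_by_device, runs, baseline):
    """Derive per-run latency and speedup relative to the baseline device."""
    base = total_ms_by_device[baseline]
    return {
        device: {"ms_per_run": total / runs, "speedup": base / total}
        for device, total in total_ms_by_device.items()
    }


results = summarize({"CPU": 4200.0, "NPU 910": 890.0}, runs=100, baseline="CPU")
for device, row in results.items():
    print(f"{device}: {row['ms_per_run']:.1f} ms per inference, {row['speedup']:.1f}x")
```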
5. Going Further: Running a Large Model
5.1 Install dependencies
bash
# Transformer-related packages
pip install transformers
pip install accelerate
pip install ascend-atb
5.2 Running LLaMA
python
# test_llama.py
import torch_npu  # registers the "npu" device used by .to("npu") below
from transformers import AutoTokenizer
import ascend_atb as atb
# Load the model (gated on Hugging Face: request access and log in first)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = atb.transformers.LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device="npu"
)
# Inference
input_text = "Hello, how are you?"
inputs = tokenizer(input_text, return_tensors="pt").to("npu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
5.3 Performance
First generation: 2.8 s
Second generation: 0.15 s (KV cache hit)
Throughput: 215 tokens/s
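These three numbers can be sanity-checked against each other. Assuming the 0.15 s warm run produced the full 32 new tokens requested in 5.2 (the measurements do not state this explicitly), the arithmetic looks like this:

```python
def tokens_per_second(new_tokens: int, seconds: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return new_tokens / seconds


first_run_s, warm_run_s, new_tokens = 2.8, 0.15, 32  # new_tokens is an assumption

print(f"Warm throughput: {tokens_per_second(new_tokens, warm_run_s):.1f} tokens/s")
print(f"Per-token latency: {warm_run_s / new_tokens * 1e3:.2f} ms")
print(f"One-off first-run overhead: {first_run_s - warm_run_s:.2f} s")
```

Under that assumption the throughput comes out at roughly 213 tokens/s, in line with the reported 215.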
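6. Verifying the Environment
6.1 Basic checks
bash
# 1. NPUs are present
npu-smi info
# Lists the installed cards
# 2. PyTorch can see the NPU (torch.cuda.is_available() reports CUDA devices, not NPUs)
python -c "import torch, torch_npu; print(torch.npu.is_available())"
# Output: True
# 3. Operators run on the NPU
python -c "import torch, torch_npu; x = torch.ones(2, 2).to('npu'); print(torch.matmul(x, x).cpu())"
# Output:
# tensor([[2., 2.],
#         [2., 2.]])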
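If you run these checks often, it helps to wrap them in a tiny checklist runner that keeps going when one check fails. The sketch below uses only the standard library and only tests whether the tools are present, not whether the NPU actually works. `run_checks` and the check names are invented for this example.

```python
import importlib.util
import shutil


def run_checks(checks):
    """Run each named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


checks = {
    "npu-smi on PATH": lambda: shutil.which("npu-smi") is not None,
    "torch importable": lambda: importlib.util.find_spec("torch") is not None,
}

for name, ok in run_checks(checks).items():
    print(f"[{'PASS' if ok else 'FAIL'}] {name}")
```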
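6.2 Performance check
python
# Matmul micro-benchmark: NPU vs CPU
import time
import torch
import torch_npu  # registers the "npu" device
a = torch.randn(4096, 4096).to("npu")
b = torch.randn(4096, 4096).to("npu")
# NPU
c = torch.matmul(a, b)  # warm-up
torch.npu.synchronize()
start = time.time()
for _ in range(100):
    c = torch.matmul(a, b)
torch.npu.synchronize()  # wait for queued kernels before stopping the clock
npu_time = time.time() - start
# CPU
a_cpu = a.cpu()
b_cpu = b.cpu()
start = time.time()
for _ in range(100):
    c_cpu = torch.matmul(a_cpu, b_cpu)
cpu_time = time.time() - start
# total seconds * 1000 / 100 iterations = milliseconds per matmul
print(f"CPU: {cpu_time*10:.1f} ms/iter")
print(f"NPU: {npu_time*10:.1f} ms/iter")
print(f"Speedup: {cpu_time/npu_time:.1f}x")
Expected: the NPU is 8-10x faster than the CPU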
7. References
Related repositories
| Repository | Description | Link |
|---|---|---|
| ascend-npu | NPU Python adapter | https://gitee.com/ascend/ascend-npu |
| ops-nn | Basic operator library | https://gitee.com/ascend/ops-nn |
| ATB | Transformer acceleration library | https://gitee.com/ascend/ascend-transformer-engine |
| GE | Graph compiler | https://gitee.com/ascend/ge-graph |