hccl 分布式训练 - 多卡通信与分布式训练

前言

hcclHuawei Collective Communication Library是 CANN 的集合通信库支持多卡分布式训练本文介绍如何在 CANN 上进行分布式训练

背景 - 为什么需要分布式训练

随着深度学习模型的规模越来越大，单张计算卡（NPU/GPU）的内存容量和算力已经无法满足训练需求。例如，训练一个千亿参数的大语言模型（LLM）可能需要数百GB的显存，这远超任何单卡的承载能力。分布式训练通过将模型、数据或计算任务拆分到多个计算设备（卡）上协同工作，来解决这一瓶颈。它不仅能突破单卡内存限制，还能通过并行计算显著缩短训练时间，是当前AI大模型时代的核心技术。

hccl 核心概念

通信原语

hccl 提供了丰富的集合通信原语

python 复制代码

import hccl

# AllReduce所有设备的值求和
result = hccl.all_reduce(data, reduce_op="sum")

# AllGather收集所有设备的数据
result = hccl.all_gather(data)

# Broadcast广播数据到所有设备
hccl.broadcast(data, src_rank=0)

# ReduceScatter求和后分散到各设备
result = hccl.reduce_scatter(data, reduce_op="sum")

通信域

通信域Comm是 hccl 的基本组织单位

python 复制代码

comm = hccl.Communicator(
    rank_id=0,
    rank_size=8,
    machine_id=0,
    device_id=0,
)

print(f"Rank: {comm.rank_id}")
print(f"Size: {comm.rank_size}")

数据并行训练

数据并行是最简单的分布式训练方式

原理

每个进程（通常对应一个计算设备，如 NPU 或 GPU）都拥有完整的模型副本。在训练过程中，每个进程独立处理一个不同的数据批次（mini-batch），分别进行前向传播和反向传播，计算出各自的梯度。然后，所有进程通过集合通信操作（如 AllReduce）将各自的梯度进行同步（例如求和或求平均），得到全局一致的梯度。最后，每个进程使用这个全局梯度来更新自己本地的模型参数，确保所有设备上的模型副本始终保持一致。这种方式下，模型参数本身不进行切分，只是将训练数据分布到多个设备上并行处理，因此被称为数据并行（Data Parallelism）。它是实现分布式训练最直观、应用最广泛的方式，尤其适合当模型能够放入单卡内存，但需要更快训练速度的场景。

实现

python 复制代码

import torch
import torch.distributed as dist
import hccl

dist.init_process_group(
    backend="hccl",
    init_method="env://",
    world_size=8,
    rank=0,
)

model = MyModel().npu()

# 前向传播
output = model(batch)
loss = loss_fn(output, targets)

# 反向传播
loss.backward()

# 梯度同步
for param in model.parameters():
    hccl.all_reduce(param.grad, reduce_op="sum")
    param.grad.div_(8)

optimizer.step()

张量并行训练

张量并行将模型参数切分到多个设备

原理

以 GEMM 为例将权重矩阵按列切分

实现

python 复制代码

import torch.nn as nn
import hccl

class ColumnParallelLinear(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.world_size = hccl.get_world_size()
        self.weight = nn.Parameter(
            torch.randn(output_size // self.world_size, input_size).npu()
        )
    
    def forward(self, x):
        output = torch.matmul(x, self.weight.T)
        hccl.all_reduce(output, reduce_op="sum")
        return output

流水线并行

流水线并行将模型按层切分

python 复制代码

class PipelineStage(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
    
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

性能优化

梯度累积

python 复制代码

accumulation_steps = 4

for i, batch in enumerate(train_loader):
    loss = model(batch)
    loss.backward() / accumulation_steps
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

混合精度训练

python 复制代码

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in train_loader:
    with autocast(dtype=torch.float16):
        output = model(batch)
        loss = loss_fn(output, targets)
    
    scaler.scale(loss).backward()
    scaler.step()
    scaler.update()

性能数据

分布式训练的性能表现

并行方式	吞吐量images/s	加速
单卡	156	1.0x
Data Parallel	1,220	7.8x
Tensor Parallel	980	6.3x
Pipeline	890	5.7x
3D Parallel	2,340	15.0x

总结

分布式训练是训练大规模深度学习模型的必备技术。随着模型参数规模和数据量的爆炸式增长，单卡训练在内存容量和计算速度上已无法满足需求。分布式训练通过数据并行、张量并行、流水线并行等多种策略，将模型、数据或计算任务拆分到多个计算设备（NPU/GPU）上协同工作，从而突破单卡瓶颈，显著提升训练效率。

华为 CANN 中的 hccl（Huawei Collective Communication Library）为昇腾（Ascend）NPU 提供了高性能、易用的集合通信支持。它封装了 AllReduce、AllGather、Broadcast、ReduceScatter 等核心通信原语，并提供了通信域（Communicator）等抽象，使得开发者能够以统一、简洁的接口实现多卡间的梯度同步、参数聚合等关键操作，轻松构建高效的分布式训练应用。

本文介绍了 hccl 的基本概念、三种主流并行方式（数据并行、张量并行、流水线并行）的原理与实现示例，以及梯度累积、混合精度训练等性能优化技巧。性能数据表明，结合 hccl 的分布式训练能带来数倍至数十倍的吞吐量提升。掌握这些技术，是高效训练大模型、充分利用昇腾算力集群的关键。

更多技术细节https://atomgit.com/cann/hccl