Summary Table of Important APIs in NCCL (NVIDIA Collective Communications Library)

The table below summarizes the important NCCL APIs, covering the common collective operations as well as the initialization and management functions:

📌 NCCL Key API Summary

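| API function | Purpose | Key parameters |
|---|---|---|
| `ncclCommInitRank` | Creates the NCCL communicator for the calling rank; every participating rank calls it with its own unique rank | `ncclComm_t* comm`: output communicator handle; `int nranks`: total number of ranks in the communicator; `ncclUniqueId commId`: unique ID produced by `ncclGetUniqueId`; `int rank`: the caller's rank (0 to nranks-1) |
| `ncclGetUniqueId` | Generates the unique ID used to initialize a communicator | `ncclUniqueId* uniqueId`: output ID; it must be distributed to all participating ranks (for example with an MPI broadcast) |
| `ncclCommDestroy` | Destroys the communicator and releases its resources | `ncclComm_t comm`: communicator to destroy |
| `ncclCommAbort` | Forcibly aborts the communicator after an error, including operations still in flight | `ncclComm_t comm`: communicator to abort |
| `ncclAllReduce` | All-reduce: every rank supplies an input buffer, and the reduced result (such as the sum) is written to the output buffer of every rank | `const void* sendbuff`: send buffer; `void* recvbuff`: receive buffer; `size_t count`: number of elements; `ncclDataType_t datatype`: data type (e.g. `ncclFloat`); `ncclRedOp_t op`: reduction operation (e.g. `ncclSum`); `ncclComm_t comm`: communicator; `cudaStream_t stream`: CUDA stream the operation is enqueued on |
| `ncclBroadcast` | Broadcast: copies the root rank's data to all ranks | `const void* sendbuff`: send buffer (read only on root); `void* recvbuff`: receive buffer; `size_t count`: number of elements; `ncclDataType_t datatype`: data type; `int root`: source rank; `ncclComm_t comm`: communicator; `cudaStream_t stream`: CUDA stream. The older `ncclBcast` variant takes a single in-place `void* buff` instead of the two buffers |
| `ncclReduce` | Reduce: every rank supplies data, and the reduced result is written only on the root rank | `const void* sendbuff`: send buffer; `void* recvbuff`: receive buffer (used only on root); `size_t count`: number of elements; `ncclDataType_t datatype`: data type; `ncclRedOp_t op`: reduction operation; `int root`: destination rank; `ncclComm_t comm`: communicator; `cudaStream_t stream`: CUDA stream |
| `ncclAllGather` | All-gather: every rank supplies data, and the concatenation of all contributions (ordered by rank) is delivered to every rank | `const void* sendbuff`: send buffer (sendcount elements); `void* recvbuff`: receive buffer (sendcount * nranks elements); `size_t sendcount`: number of elements sent by each rank; `ncclDataType_t datatype`: data type; `ncclComm_t comm`: communicator; `cudaStream_t stream`: CUDA stream |
| `ncclReduceScatter` | Reduce-scatter: reduces element-wise across ranks, then gives each rank one block of the result | `const void* sendbuff`: send buffer (recvcount * nranks elements); `void* recvbuff`: receive buffer (recvcount elements); `size_t recvcount`: number of elements received by each rank; `ncclDataType_t datatype`: data type; `ncclRedOp_t op`: reduction operation; `ncclComm_t comm`: communicator; `cudaStream_t stream`: CUDA stream |
| `ncclSend` / `ncclRecv` | Point-to-point communication between two ranks of a communicator (NCCL 2.7+) | `const void* sendbuff` / `void* recvbuff`: send/receive buffer; `size_t count`: number of elements; `ncclDataType_t datatype`: data type; `int peer`: destination/source rank; `ncclComm_t comm`: communicator; `cudaStream_t stream`: CUDA stream |
| `ncclGroupStart` / `ncclGroupEnd` | Group calls: the NCCL calls issued between the two are collected into one group and launched together at `ncclGroupEnd`. This is batching, not an atomicity guarantee | No parameters. Wrap several NCCL calls, for example paired `ncclSend`/`ncclRecv` calls that must progress concurrently, or calls for several GPUs driven by one thread |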
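The buffer sizes listed for `ncclAllGather` and `ncclReduceScatter` are easier to see with a small host-side model. The sketch below is plain C and does not call NCCL: "ranks" are simply rows of ordinary arrays, and the function names `ref_all_gather` and `ref_reduce_scatter_sum` are made up for this illustration. It only mirrors the data layout described in the table, under the assumption of a sum reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Reference model only: each "rank" is a row of a host array, not a GPU. */

/* AllGather: rank r contributes `count` elements (send[r*count + i]).
 * Every rank ends up with nranks*count elements, where block r of the
 * output holds rank r's contribution. recv has nranks rows of nranks*count. */
void ref_all_gather(const float *send, float *recv, size_t count, int nranks) {
    assert(send && recv && nranks > 0);
    for (int dst = 0; dst < nranks; ++dst)
        for (int src = 0; src < nranks; ++src)
            for (size_t i = 0; i < count; ++i)
                recv[(size_t)dst * nranks * count + (size_t)src * count + i] =
                    send[(size_t)src * count + i];
}

/* ReduceScatter with a sum: rank r contributes nranks*count elements
 * (send has nranks rows of nranks*count). Rank r receives `count` elements:
 * block r of the element-wise sum over all ranks. */
void ref_reduce_scatter_sum(const float *send, float *recv, size_t count, int nranks) {
    assert(send && recv && nranks > 0);
    for (int dst = 0; dst < nranks; ++dst)
        for (size_t i = 0; i < count; ++i) {
            float acc = 0.0f;
            for (int src = 0; src < nranks; ++src)
                acc += send[(size_t)src * nranks * count + (size_t)dst * count + i];
            recv[(size_t)dst * count + i] = acc;
        }
}
```

With two ranks and a per-rank count of 2, the all-gather model turns inputs {1,2} and {3,4} into {1,2,3,4} on both ranks, while the reduce-scatter model turns inputs {1,2,3,4} and {10,20,30,40} into {11,22} on rank 0 and {33,44} on rank 1.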

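📝 Additional Notes

1. Data types (`ncclDataType_t`)

ncclInt8, ncclUint8, ncclInt32, ncclUint32, ncclInt64, ncclUint64
ncclFloat16, ncclFloat32, ncclFloat64, ncclBfloat16
Aliases: ncclChar (ncclInt8), ncclInt (ncclInt32), ncclHalf (ncclFloat16), ncclFloat (ncclFloat32), ncclDouble (ncclFloat64)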
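The `count` arguments in the API are element counts, not byte counts, so the data type determines how many bytes a buffer needs. The following is a minimal sketch of that arithmetic. The enum `demo_dtype_t` and both helper functions are hypothetical stand-ins written for this example; real code would use `ncclDataType_t` from `nccl.h`, and NCCL itself does not provide these helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for illustration only; NOT the NCCL enum. */
typedef enum {
    DT_INT8, DT_UINT8, DT_INT32, DT_UINT32, DT_INT64, DT_UINT64,
    DT_FLOAT16, DT_FLOAT32, DT_FLOAT64, DT_BFLOAT16
} demo_dtype_t;

/* Size in bytes of one element of the given type (0 if unknown). */
size_t demo_dtype_size(demo_dtype_t t) {
    switch (t) {
        case DT_INT8:    case DT_UINT8:                    return 1;
        case DT_FLOAT16: case DT_BFLOAT16:                 return 2;
        case DT_INT32:   case DT_UINT32: case DT_FLOAT32:  return 4;
        case DT_INT64:   case DT_UINT64: case DT_FLOAT64:  return 8;
    }
    return 0;
}

/* Bytes needed for a buffer of `count` elements of type `t`. */
size_t demo_buffer_bytes(size_t count, demo_dtype_t t) {
    size_t elem = demo_dtype_size(t);
    assert(elem != 0);
    return count * elem;
}
```

For example, `demo_buffer_bytes(1024, DT_FLOAT32)` gives 4096, which is the size one would allocate for a 1024-element `ncclFloat32` buffer.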

2. Reduction operations (`ncclRedOp_t`)

ncclSum: sum
ncclProd: product
ncclMax: maximum
ncclMin: minimum
ncclAvg: average (NCCL 2.10+)
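To make the five operations concrete, here is a host-only reference of what an all-reduce computes element by element. It is a sketch, not NCCL code: `demo_op_t` and `ref_all_reduce` are names invented for this example, the ranks are rows of a plain array, and the average is modeled as the sum divided by the number of ranks.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for illustration only; NOT the NCCL enum. */
typedef enum { OP_SUM, OP_PROD, OP_MAX, OP_MIN, OP_AVG } demo_op_t;

/* in:  nranks rows of `count` elements (in[r*count + i] is rank r's value).
 * out: `count` elements, the result every rank would see after an all-reduce. */
void ref_all_reduce(const double *in, double *out, size_t count,
                    int nranks, demo_op_t op) {
    assert(in && out && nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        double acc = in[i];                      /* start from rank 0 */
        for (int r = 1; r < nranks; ++r) {
            double v = in[(size_t)r * count + i];
            switch (op) {
                case OP_SUM:
                case OP_AVG:  acc += v;              break;
                case OP_PROD: acc *= v;              break;
                case OP_MAX:  if (v > acc) acc = v;  break;
                case OP_MIN:  if (v < acc) acc = v;  break;
            }
        }
        if (op == OP_AVG) acc /= nranks;         /* average = sum / nranks */
        out[i] = acc;
    }
}
```

With three ranks holding {1,2}, {3,4}, and {5,12}, this model yields {9,18} for the sum, {15,96} for the product, {5,12} for the maximum, {1,2} for the minimum, and {3,6} for the average.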

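3. Typical usage flow

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Fragment: assumes MPI_Init has run, `rank` and `nranks` come from MPI, the
// GPU was selected with cudaSetDevice, and `sendbuf`, `recvbuf` (device
// memory), `count`, and `stream` are already set up.

// 1. Get the unique ID (rank 0 generates it, then broadcasts it)
ncclUniqueId id;
if (rank == 0) ncclGetUniqueId(&id);
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

// 2. Initialize the communicator
ncclComm_t comm;
ncclCommInitRank(&comm, nranks, id, rank);

// 3. Run a collective (AllReduce here); the call only enqueues work on `stream`
ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
cudaStreamSynchronize(stream);  // wait until recvbuf holds the result

// 4. Destroy the communicator
ncclCommDestroy(comm);
```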

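4. Things to keep in mind

Every NCCL communication call takes a CUDA stream and is enqueued on it asynchronously; the buffers are normally GPU (device) memory
All ranks of a communicator must issue their collective calls in the same order
Wrapping several operations in ncclGroupStart/ncclGroupEnd batches them into one launch, which can improve performance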
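The ordering rule is worth a closer look, because a mismatch typically shows up as a hang rather than an error message. One way to reason about it is to log the sequence of collectives each rank issues and compare the logs offline. The sketch below does that comparison in plain C; `demo_call_t` and `first_order_mismatch` are hypothetical names for this example and are not part of NCCL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for logged calls; NOT NCCL types. */
typedef enum { CALL_ALLREDUCE, CALL_BROADCAST, CALL_ALLGATHER } demo_call_t;

/* calls: nranks rows of nsteps entries (calls[r*nsteps + s] is the s-th
 * collective issued by rank r).
 * Returns the first step at which some rank differs from rank 0,
 * or -1 if every rank issued the same sequence. */
long first_order_mismatch(const demo_call_t *calls, int nranks, size_t nsteps) {
    assert(calls && nranks > 0);
    for (size_t s = 0; s < nsteps; ++s)
        for (int r = 1; r < nranks; ++r)
            if (calls[(size_t)r * nsteps + s] != calls[s])
                return (long)s;
    return -1;
}
```

If rank 0 logs AllReduce then Broadcast while rank 1 logs AllReduce then AllGather, the function returns 1, the step at which the two ranks would wait on different collectives.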

If you need more detailed example code or guidance for a specific scenario, feel free to ask for more.
