CUDA Programming Notes (28): cudaMemcpyPeer and the P2P Access Mechanism

What is `cudaMemcpyPeer`?

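`cudaMemcpyPeer()` is the CUDA runtime function for cross-GPU memory copies. It transfers data between two different GPU devices without the application having to stage it through a host buffer itself.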

```cpp
cudaError_t cudaMemcpyPeer(
    void* dst, int dstDevice,
    const void* src, int srcDevice,
    size_t count);

// Asynchronous version
cudaError_t cudaMemcpyPeerAsync(
    void* dst, int dstDevice,
    const void* src, int srcDevice,
    size_t count,
    cudaStream_t stream = 0);
```

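When the two GPUs support P2P (peer-to-peer) access and it has been enabled, the copy bypasses host memory and is carried out directly by the GPUs' DMA engines. Otherwise it falls back to host staging: the data is first copied into host memory and then on to the destination GPU.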
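One way to see the difference between the two paths is to time the same copy before and after peer access is enabled. The following is only a rough sketch under some assumptions: the machine has at least two GPUs, they are devices 0 and 1, and a single 256 MiB transfer is a reasonable probe. The `CHECK` macro and the helpers `gbPerSec` and `timePeerCopyMs` are names made up for this illustration; they are not part of CUDA.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative macro: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: bytes moved in `ms` milliseconds, expressed in GB/s.
static double gbPerSec(size_t bytes, float ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}

// Hypothetical helper: times one GPU0 -> GPU1 copy with events on a device-0 stream.
static float timePeerCopyMs(void* dst1, const void* src0, size_t bytes) {
    CHECK(cudaSetDevice(0));
    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaMemcpyPeerAsync(dst1, 1, src0, 0, bytes, stream));  // warm-up copy
    CHECK(cudaEventRecord(start, stream));
    CHECK(cudaMemcpyPeerAsync(dst1, 1, src0, 0, bytes, stream));  // timed copy
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    return ms;
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs\n");
        return 0;
    }

    const size_t bytes = 256u << 20;  // 256 MiB, an arbitrary probe size
    void *d0 = nullptr, *d1 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&d0, bytes));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&d1, bytes));

    // Peer access has not been enabled yet, so this measures the fallback path.
    float msBefore = timePeerCopyMs(d1, d0, bytes);
    printf("peer access not enabled: %.2f GB/s\n", gbPerSec(bytes, msBefore));

    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (can01 && can10) {
        CHECK(cudaSetDevice(0));
        CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaSetDevice(1));
        CHECK(cudaDeviceEnablePeerAccess(0, 0));
        float msAfter = timePeerCopyMs(d1, d0, bytes);
        printf("peer access enabled:     %.2f GB/s\n", gbPerSec(bytes, msAfter));
    } else {
        printf("P2P is not supported here; only the fallback path was measured\n");
    }

    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(d0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(d1));
    return 0;
}
```

A single timed copy is noisy, so the numbers should be read as a rough indication rather than a benchmark.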

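The P2P (Peer-to-Peer) access mechanism

P2P means that different GPUs can access each other's device memory directly, without going through the CPU or host memory.

This is especially useful on multi-GPU systems where the GPUs are connected by NVLink or sit in the same PCIe domain.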

With P2P:

- one GPU can read another GPU's global memory directly;
- `cudaMemcpyPeer()` can move data over the GPU-to-GPU link;
- kernels running on the two GPUs can share a data buffer (as long as peer access has been enabled).
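The first and third points can be illustrated with a kernel that runs on GPU0 while reading a buffer allocated on GPU1. This is a minimal sketch that assumes devices 0 and 1 exist and that GPU0 is allowed to access GPU1; the kernel `addOne`, the helper `gridSize`, and the `CHECK` macro are illustrative names, not CUDA APIs. Only the direction GPU0 to GPU1 is enabled, because only GPU0 dereferences a peer pointer here.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Illustrative macro: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
static int gridSize(int n, int block) { return (n + block - 1) / block; }

// Runs on GPU0; `in` points into GPU1's memory, `out` into GPU0's memory.
__global__ void addOne(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs\n");
        return 0;
    }
    int can01 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    if (!can01) {
        printf("GPU0 cannot access GPU1 memory directly\n");
        return 0;
    }

    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 41.0f);

    // Input buffer lives on GPU1.
    float* in1 = nullptr;
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&in1, bytes));
    CHECK(cudaMemcpy(in1, host.data(), bytes, cudaMemcpyHostToDevice));

    // Output buffer and the kernel live on GPU0.
    float* out0 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));  // GPU0 may now dereference GPU1 pointers
    CHECK(cudaMalloc(&out0, bytes));

    addOne<<<gridSize(n, 256), 256>>>(in1, out0, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaMemcpy(host.data(), out0, bytes, cudaMemcpyDeviceToHost));
    int wrong = 0;
    for (float v : host) {
        if (v != 42.0f) ++wrong;
    }
    printf("elements that differ from 42: %d\n", wrong);

    CHECK(cudaFree(out0));
    CHECK(cudaDeviceDisablePeerAccess(1));
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(in1));
    return 0;
}
```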

```cpp
// Queries whether `device` can access `peerDevice` via P2P; the result is returned
// through canAccessPeer. A value of 0 means P2P is not supported.
cudaError_t cudaDeviceCanAccessPeer(int *canAccessPeer, int device, int peerDevice);

// Enables P2P access from the current device to peerDevice; flags must be 0.
cudaError_t cudaDeviceEnablePeerAccess(int peerDevice, unsigned int flags);

// Disables P2P access from the current device to peerDevice.
cudaError_t cudaDeviceDisablePeerAccess(int peerDevice);
```

Steps to enable P2P

Suppose we have two devices, 0 and 1:

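```cpp
int canAccess01 = 0, canAccess10 = 0;
cudaDeviceCanAccessPeer(&canAccess01, 0, 1);  // can device 0 access device 1?
cudaDeviceCanAccessPeer(&canAccess10, 1, 0);  // can device 1 access device 0?

// Peer access is one-directional, so it is enabled once from each side.
if (canAccess01 && canAccess10) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
}
```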
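The same steps extend naturally to machines with more than two GPUs. The sketch below loops over every ordered pair of devices and enables peer access wherever the runtime reports it as supported. `enableAllPeerAccess`, `directedPairs`, and `CHECK` are names invented for this example. Treating `cudaErrorPeerAccessAlreadyEnabled` as harmless is a choice made for this sketch.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative macro: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: number of ordered (device, peer) pairs among n devices.
static int directedPairs(int n) { return n * (n - 1); }

// Hypothetical helper: enables peer access for every supported ordered pair and
// returns how many directed links are enabled afterwards.
static int enableAllPeerAccess(int deviceCount) {
    int enabled = 0;
    for (int dev = 0; dev < deviceCount; ++dev) {
        CHECK(cudaSetDevice(dev));  // EnablePeerAccess acts on the current device
        for (int peer = 0; peer < deviceCount; ++peer) {
            if (peer == dev) continue;
            int canAccess = 0;
            CHECK(cudaDeviceCanAccessPeer(&canAccess, dev, peer));
            if (!canAccess) {
                printf("GPU%d -> GPU%d: not supported\n", dev, peer);
                continue;
            }
            cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
            if (err == cudaErrorPeerAccessAlreadyEnabled) {
                (void)cudaGetLastError();  // reset the recorded error and carry on
            } else {
                CHECK(err);
            }
            printf("GPU%d -> GPU%d: enabled\n", dev, peer);
            ++enabled;
        }
    }
    return enabled;
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    int enabled = enableAllPeerAccess(deviceCount);
    printf("%d of %d directed links enabled\n", enabled, directedPairs(deviceCount));
    return 0;
}
```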

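Caveats

- The system must support Unified Virtual Addressing (UVA);
- If the two GPUs are not under the same PCIe root complex, P2P may be unavailable;
- Once enabled, pointers into the peer's device memory can be dereferenced directly (for example, a kernel can read from a buffer that lives on the other GPU);
- On NVLink topologies, P2P bandwidth is typically several times higher than over PCIe.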
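Some of these conditions can be queried from the runtime. The sketch below, which assumes devices 0 and 1 exist, prints each device's unified-addressing attribute and PCI bus ID, then asks `cudaDeviceGetP2PAttribute` about the 0 to 1 link. The helper `yesNo` and the `CHECK` macro are illustrative names. The performance rank is printed as reported, without interpreting it here.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative macro: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: formats a 0/1 attribute value.
static const char* yesNo(int v) { return v ? "yes" : "no"; }

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs\n");
        return 0;
    }

    for (int dev = 0; dev < 2; ++dev) {
        int uva = 0;
        CHECK(cudaDeviceGetAttribute(&uva, cudaDevAttrUnifiedAddressing, dev));
        char busId[32] = {0};
        CHECK(cudaDeviceGetPCIBusId(busId, static_cast<int>(sizeof(busId)), dev));
        printf("GPU%d: PCI bus id %s, unified addressing: %s\n",
               dev, busId, yesNo(uva));
    }

    int supported = 0, atomics = 0, rank = 0;
    CHECK(cudaDeviceGetP2PAttribute(&supported, cudaDevP2PAttrAccessSupported, 0, 1));
    CHECK(cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, 0, 1));
    CHECK(cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, 0, 1));
    printf("GPU0 -> GPU1: P2P access supported: %s\n", yesNo(supported));
    printf("GPU0 -> GPU1: native atomics over the link: %s\n", yesNo(atomics));
    printf("GPU0 -> GPU1: relative performance rank: %d\n", rank);
    return 0;
}
```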

Example code

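```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    // Peer access is directional, so check both directions before enabling both.
    int canAccess01 = 0, canAccess10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&canAccess10, 1, 0));
    if (!canAccess01 || !canAccess10) {
        printf("P2P is not supported between GPU0 and GPU1\n");
        return 0;
    }

    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));

    const int N = 1024;
    float *d0 = nullptr, *d1 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&d0, N * sizeof(float)));
    CHECK(cudaMemset(d0, 0, N * sizeof(float)));  // give the source defined contents
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&d1, N * sizeof(float)));

    // GPU1 <- GPU0
    CHECK(cudaMemcpyPeer(d1, 1, d0, 0, N * sizeof(float)));
    CHECK(cudaDeviceSynchronize());

    printf("GPU0 -> GPU1 copy finished\n");
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(d0));
    CHECK(cudaDeviceDisablePeerAccess(1));  // optional: disable P2P
    CHECK(cudaSetDevice(1));
    CHECK(cudaFree(d1));
    CHECK(cudaDeviceDisablePeerAccess(0));  // optional: disable P2P
    return 0;
}
```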

Unfortunately, P2P is not available on my machine:

```bash
P2P is not supported between GPU0 and GPU1
```

The GPU topology can be inspected with the following command:

```bash
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-39    0               N/A
GPU1    SYS      X      0-39    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

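| GPU pair | Reported | Meaning |
| --- | --- | --- |
| GPU0 ↔ GPU1 | `SYS` | Communication goes over the system interconnect (SMP interconnect) and has to cross CPU NUMA nodes or a motherboard bus bridge |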

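GPU0 and GPU1 are not under the same PCIe root complex, so traffic between them has to cross the CPU and the QPI/UPI link. As a result, on this machine the runtime reports that a direct GPU-to-GPU P2P connection cannot be established.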
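Even on a machine like this, `cudaMemcpyPeer` should remain usable, since the copy falls back to host staging as described earlier. The sketch below never enables peer access: it copies a known pattern from GPU0 to GPU1 and then checks it on the host. It assumes devices 0 and 1 exist; `countMismatches` and `CHECK` are illustrative names rather than CUDA APIs.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Illustrative macro: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: counts positions where two host arrays differ.
// Arrays of different length are treated as entirely mismatched.
static size_t countMismatches(const std::vector<float>& a,
                              const std::vector<float>& b) {
    if (a.size() != b.size()) return a.size() > b.size() ? a.size() : b.size();
    size_t bad = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++bad;
    }
    return bad;
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("This sketch needs at least two GPUs\n");
        return 0;
    }

    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> src(n), dst(n, -1.0f);
    for (size_t i = 0; i < n; ++i) src[i] = static_cast<float>(i);

    float *d0 = nullptr, *d1 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&d0, bytes));
    CHECK(cudaMemcpy(d0, src.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&d1, bytes));

    // Deliberately no cudaDeviceEnablePeerAccess: the runtime picks the path.
    CHECK(cudaMemcpyPeer(d1, 1, d0, 0, bytes));
    CHECK(cudaMemcpy(dst.data(), d1, bytes, cudaMemcpyDeviceToHost));

    printf("mismatched elements after GPU0 -> GPU1 copy: %zu\n",
           countMismatches(src, dst));

    CHECK(cudaFree(d1));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(d0));
    return 0;
}
```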
