Instrumenting the NCCL library underneath PyTorch

Intro

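This post documents how to instrument the system NCCL library with dlsym, and then shows the instrumentation firing when a PyTorch script runs. For the environment setup, see "Rebuilding PyTorch against the system NCCL" (使用系统内NCCL环境重新编译Pytorch).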

Instrumentation code

// nccl_instrument.cpp
#include <nccl.h>
#include <stdio.h>
#include <dlfcn.h> // dlsym, RTLD_NEXT (g++ on Linux defines _GNU_SOURCE by default)

// Function pointer type matching ncclBroadcast, used to hold the original implementation
typedef ncclResult_t (*ncclBroadcast_fn)(const void *, void *, size_t, ncclDataType_t, int, ncclComm_t, cudaStream_t);
static ncclBroadcast_fn original_ncclBroadcast = NULL;

extern "C" ncclResult_t ncclBroadcast(const void *sendbuff, void *recvbuff, size_t count,
                                      ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)
{
    // Look up the original ncclBroadcast on the first call
    if (!original_ncclBroadcast)
    {
        original_ncclBroadcast = (ncclBroadcast_fn)dlsym(RTLD_NEXT, "ncclBroadcast");
        if (!original_ncclBroadcast)
        {
            // dlerror() may return NULL, so fall back to a fixed message
            const char *err = dlerror();
            fprintf(stderr, "Error loading original ncclBroadcast: %s\n", err ? err : "symbol not found");
            return ncclSystemError;
        }
        else
        {
            printf("Successfully change the Point!\n");
        }
    }

    // Print information about this broadcast
    printf("NEW![Instrumentation] ncclBroadcast called with count: %zu, root: %d\n", count, root);

    // Forward to the original ncclBroadcast
    return original_ncclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, stream);
}

This is a simple instrumentation of the broadcast call.

It first has to be compiled into a shared library (the CUDA lib and include paths need to be passed by hand here):

 g++ -shared -fPIC -o libnccl_instrument.so nccl_instrument.cpp -L/usr/local/cuda/lib64 -lnccl -lcudart -I/usr/local/cuda/include

Once that is done, try it out with a small test program first:

// test_nccl.c
#include <nccl.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    ncclComm_t comm;
    int size = 1; // 1 is enough for a single machine with a single GPU
    int rank = 0; // this process is rank 0

    // Initialize NCCL communication
    ncclUniqueId id;
    ncclGetUniqueId(&id);
    ncclCommInitRank(&comm, size, id, rank);

    // Allocate GPU memory
    int *sendbuff, *recvbuff;
    cudaMalloc((void **)&sendbuff, sizeof(int) * size);
    cudaMalloc((void **)&recvbuff, sizeof(int) * size);

    // Assume the root is rank 0
    int root = 0;

    // Broadcast call (goes through the wrapper when it is preloaded)
    ncclBroadcast(sendbuff, recvbuff, size, ncclInt, root, comm, 0);

    // Wait for the broadcast queued on the default stream to finish
    cudaDeviceSynchronize();

    // Release resources
    ncclCommDestroy(comm);
    cudaFree(sendbuff);
    cudaFree(recvbuff);

    printf("Broadcast test completed.Successful!\n");
    return 0;
}

Compile:

nvcc -o test_nccl test_nccl.c -lnccl -lcudart

Run it (using the LD_PRELOAD environment variable so that the custom library is loaded first):

LD_PRELOAD=./libnccl_instrument.so ./test_nccl

The output looks like this:

Successfully change the Point!
NEW![Instrumentation] ncclBroadcast called with count: 1, root: 0
Broadcast test completed.Successful!
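
If the wrapper later grows to cover more collectives, log lines like these can be tallied mechanically instead of read by eye. Here is a minimal sketch, assuming the exact line format printed by the wrapper above. `CALL_RE` and `parse_calls` are names made up for this illustration, not part of NCCL or PyTorch:

```python
import re

# Matches the line format printed by the wrapper in the instrumentation code.
CALL_RE = re.compile(
    r"\[Instrumentation\] (?P<func>\w+) called with "
    r"count: (?P<count>\d+), root: (?P<root>\d+)"
)


def parse_calls(output):
    """Return one dict per instrumented call found in the captured output."""
    calls = []
    for line in output.splitlines():
        match = CALL_RE.search(line)
        if match:
            calls.append({
                "func": match.group("func"),
                "count": int(match.group("count")),
                "root": int(match.group("root")),
            })
    return calls


sample = """Successfully change the Point!
NEW![Instrumentation] ncclBroadcast called with count: 1, root: 0
Broadcast test completed.Successful!"""

print(parse_calls(sample))
```

On the sample above this prints `[{'func': 'ncclBroadcast', 'count': 1, 'root': 0}]`. Lines that do not match the pattern are simply skipped.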

Next, test with a simple PyTorch script:

import os
import torch
import torch.distributed as dist

# Set up the communication environment
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'

# Initialize the process group
rank = 0  # rank of this process
world_size = 1  # total number of processes
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# Create the tensor
x = torch.zeros(6)  # starts out as all zeros

if torch.cuda.is_available():
    # Move the tensor to the GPU
    x = x.cuda()
    if rank == 0:
        # Only rank 0 fills in the tensor
        x = torch.arange(1, 7).float().cuda()

    # Broadcast the tensor from rank 0 to every process
    dist.broadcast(x, src=0)

    # Print the result after the broadcast
    print(f"Rank {rank} broadcasted tensor: {x}")

# Tear down the process group before exiting
dist.destroy_process_group()
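
Before running a script like this under LD_PRELOAD, it can help to confirm which NCCL shared objects the process actually maps, since preloading can only intercept calls that reach NCCL through a shared library. Here is a minimal, Linux-only sketch that reads /proc/self/maps. The helper names `mapped_libraries` and `nccl_libraries_in_this_process` are made up for this illustration:

```python
import os


def mapped_libraries(name_part, maps_text):
    """Return sorted paths of mapped files whose basename contains name_part."""
    paths = set()
    for line in maps_text.splitlines():
        # Each line of /proc/<pid>/maps reads:
        # address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if name_part in os.path.basename(path):
                paths.add(path)
    return sorted(paths)


def nccl_libraries_in_this_process():
    """Linux only: list the NCCL-related shared objects mapped right now."""
    with open("/proc/self/maps") as maps:
        return mapped_libraries("nccl", maps.read())
```

Calling `print(nccl_libraries_in_this_process())` at the end of the PyTorch script (after the broadcast) should show the path of the libnccl in use, plus libnccl_instrument.so when it is preloaded. If no libnccl path shows up at all, NCCL is probably linked into PyTorch statically, and the preload approach has nothing to intercept.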

Getting this to run took me quite a while. Oddly, if I only specify:

LD_PRELOAD=/mnt/d/Ubuntu_Code/nccl_PI/libnccl_instrument.so  python PI_test.py

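That run fails: the ncclBroadcast symbol cannot be found, even though the C test program has no such problem. Unless the original NCCL library is preloaded as well, dlsym cannot find the symbol, and the error is undefined symbol: ncclBroadcast.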
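
One way to narrow this down from inside the Python process is to ask whether a symbol is visible in the process-wide symbol scope. Here is a minimal sketch using ctypes, where `ctypes.CDLL(None)` corresponds to `dlopen(NULL)`. That scope is only an approximation of what `dlsym(RTLD_NEXT, ...)` searches from the preloaded wrapper, so treat the result as a hint. `visible_in_global_scope` is a made-up helper name:

```python
import ctypes


def visible_in_global_scope(symbol):
    """True if dlsym on the process-wide handle can resolve `symbol`."""
    # CDLL(None) is dlopen(NULL): the main program plus the libraries
    # that were loaded into the global scope.
    process = ctypes.CDLL(None)
    try:
        getattr(process, symbol)
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    # ncclBroadcast is also exported by the wrapper itself, so it shows up
    # whenever the wrapper is preloaded. ncclGetVersion is only exported by
    # the real NCCL, which makes it the more telling probe.
    for name in ("ncclBroadcast", "ncclGetVersion"):
        print(name, visible_in_global_scope(name))
```

Run it after `import torch` in the failing setup. If `ncclGetVersion` is reported as not visible while `ncclBroadcast` is, the real NCCL is not in the global scope, which would be consistent with the dlsym failure.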

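My suspicion is that the dynamic linker does not bring in the original NCCL library on its own in a way the wrapper can see. PyTorch's libraries are loaded dynamically when torch is imported rather than at process startup, so the original NCCL may not be in the search order that dlsym(RTLD_NEXT, ...) walks from the preloaded library. NCCL therefore has to be loaded explicitly through LD_PRELOAD as well. Switching to the following command works: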

LD_PRELOAD=/mnt/d/Ubuntu_Code/nccl_PI/libnccl_instrument.so:/usr/local/cuda/lib64/libnccl.so python PI_test.py

The result is as follows:

Successfully change the Point!
NEW![Instrumentation] ncclBroadcast called with count: 6, root: 0
Rank 0 broadcasted tensor: tensor([1., 2., 3., 4., 5., 6.], device='cuda:0')
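
The two-library launch can also be scripted so that the preload order is not retyped every time. Here is a minimal sketch, assuming the wrapper must come first in LD_PRELOAD so that its ncclBroadcast wins symbol resolution and RTLD_NEXT then finds the real one in the library after it. `preload_env` and `run_instrumented` are made-up helper names:

```python
import os
import subprocess
import sys


def preload_env(libraries, base_env=None):
    """Copy base_env and put `libraries` at the front of LD_PRELOAD, in order."""
    env = dict(os.environ if base_env is None else base_env)
    entries = list(libraries)
    existing = env.get("LD_PRELOAD", "")
    if existing:
        # Keep anything that was already preloaded, after our entries.
        entries.append(existing)
    env["LD_PRELOAD"] = ":".join(entries)
    return env


def run_instrumented(script, wrapper_lib, nccl_lib):
    """Run a Python script with the wrapper preloaded ahead of the real NCCL."""
    env = preload_env([wrapper_lib, nccl_lib])
    return subprocess.run([sys.executable, script], env=env, check=False)
```

With the paths from this post, the call would be `run_instrumented("PI_test.py", "/mnt/d/Ubuntu_Code/nccl_PI/libnccl_instrument.so", "/usr/local/cuda/lib64/libnccl.so")`.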