CUDA Series: An Introduction to Commonly Used Runtime APIs

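Table of Contents

[1. Introduction](#1. Introduction)
[2. Device Management](#2. Device Management)
[3. Error Handling](#3. Error Handling)
[4. Memory Management](#4. Memory Management)
[5. Stream Management](#5. Stream Management)
[6. Event Management](#6. Event Management)
[7. Execution Control](#7. Execution Control)
[8. Multi-GPU Management](#8. Multi-GPU Management)
[9. Practical Example](#9. Practical Example)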

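1. Introduction

Before reading this article, you should already have a working CUDA environment; if it is not installed yet, see the previous article in this series. The CUDA runtime API is the most frequently used CUDA API, and being fluent with it makes our programs far more flexible.

For detailed descriptions of each API, see the official documentation.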

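2. Device Management

| API | Description |
| --- | --- |
| cudaChooseDevice | Selects the device that best matches the given properties |
| cudaDeviceGetAttribute | Returns the value of the requested attribute for the specified device |
| cudaDeviceGetP2PAttribute | Returns a P2P attribute for the link from srcDevice to dstDevice |
| cudaGetDeviceCount | Returns the number of available devices |
| cudaGetDevice | Returns the ID of the current device |
| cudaGetDeviceProperties | Returns the properties of the specified device |
| cudaSetDevice | Sets the current device by ID; commonly used to switch between GPUs in multi-GPU programs |
| cudaDeviceReset | Resets all state of the current device in the current process |
| cudaDeviceSynchronize | Blocks until the device has finished all previously issued work |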
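As a minimal sketch of how these device-management calls fit together, the program below enumerates the visible GPUs and prints a few of their properties. The helper `visibleDeviceCount` is a hypothetical name introduced only for this example; everything else is a runtime API call from the table above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): returns the number of visible
// devices, or 0 if the query fails (e.g. no driver or no GPU present).
int visibleDeviceCount() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    return count;
}

int main() {
    const int count = visibleDeviceCount();
    std::printf("found %d CUDA device(s)\n", count);

    for (int dev = 0; dev < count; ++dev) {
        // Bulk query: fills a struct with many properties at once.
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;
        }
        // Single-attribute query: cheaper when only one value is needed.
        int smCount = 0;
        cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);

        std::printf("device %d: %s, compute capability %d.%d, %d SMs, %zu MiB\n",
                    dev, prop.name, prop.major, prop.minor, smCount,
                    prop.totalGlobalMem >> 20);
    }

    if (count > 0) {
        // Make the last device current for this host thread, then read it back.
        cudaSetDevice(count - 1);
        int current = -1;
        cudaGetDevice(&current);
        std::printf("current device: %d\n", current);
        // Nothing has been launched here, so this returns immediately.
        cudaDeviceSynchronize();
    }
    return 0;
}
```

On a machine without a GPU the sketch simply reports zero devices rather than failing.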

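3. Error Handling

These APIs are mainly for retrieving and handling errors.

| API | Description |
| --- | --- |
| cudaGetLastError | Returns the last error produced by a runtime API call and resets the error state to cudaSuccess |
| cudaGetErrorName | Returns the name of an error code |
| cudaGetErrorString | Returns the description string for an error code |
| cudaPeekAtLastError | Returns the last error produced by a runtime API call without resetting the error state |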
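The difference between peeking at and fetching the last error is easiest to see with a launch that is expected to fail. The sketch below assumes a GPU whose per-block thread limit is 1024, so a 2048-thread block should be rejected; `report` and `emptyKernel` are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the symbolic name and description of an error.
void report(const char* label, cudaError_t err) {
    std::printf("%-24s %s (%s)\n", label, cudaGetErrorName(err),
                cudaGetErrorString(err));
}

__global__ void emptyKernel() {}

int main() {
    // Assumption: the device allows at most 1024 threads per block, so this
    // launch configuration is invalid and the launch is rejected.
    emptyKernel<<<1, 2048>>>();

    // Peek: reads the pending error but leaves it in place.
    report("cudaPeekAtLastError:", cudaPeekAtLastError());
    // Get: reads the same error and resets the state to cudaSuccess.
    report("cudaGetLastError:", cudaGetLastError());
    // A second fetch should now see the cleared state.
    report("cudaGetLastError again:", cudaGetLastError());
    return 0;
}
```

A common design choice is to call `cudaGetLastError` right after every kernel launch, since `<<<>>>` itself returns nothing.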

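4. Memory Management

| API | Description |
| --- | --- |
| cudaMalloc | Allocates device memory on the current GPU; analogous to malloc on the CPU |
| cudaFree | Frees memory allocated by cudaMalloc; analogous to free on the CPU |
| cudaMallocHost | Allocates page-locked (pinned) host memory; on systems with unified virtual addressing, kernels can access this memory directly |
| cudaFreeHost | Frees memory allocated by cudaMallocHost |
| cudaMallocManaged | Allocates managed memory that is accessible from both host and device; the runtime migrates it between them automatically |
| cudaMemcpy | Copies memory; supports D2D, H2D, D2H, and H2H |
| cudaMemcpyAsync | Copies memory asynchronously; supports D2D, H2D, D2H, and H2H |
| cudaMemset | Initializes device memory; analogous to memset on the CPU |
| cudaMemsetAsync | Initializes device memory asynchronously |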
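To contrast explicit device memory with managed memory, here is a small sketch that runs the same kernel over both kinds of allocation. `CUDA_CHECK`, `gridSize`, and `addOne` are hypothetical helpers defined for this example, not part of CUDA.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
constexpr unsigned int gridSize(size_t n, unsigned int threads) {
    return static_cast<unsigned int>((n + threads - 1) / threads);
}

__global__ void addOne(int* data, size_t n) {
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

int main() {
    constexpr size_t n = 1 << 20;
    constexpr size_t bytes = n * sizeof(int);
    constexpr unsigned int threads = 256;

    // Explicit path: device buffer, memset, kernel, copy back to pinned memory.
    int *dbuf = nullptr, *hbuf = nullptr;
    CUDA_CHECK(cudaMalloc(&dbuf, bytes));
    CUDA_CHECK(cudaMallocHost(&hbuf, bytes));
    CUDA_CHECK(cudaMemset(dbuf, 0, bytes));
    addOne<<<gridSize(n, threads), threads>>>(dbuf, n);
    CUDA_CHECK(cudaGetLastError());
    // A blocking copy on the default stream waits for the kernel above.
    CUDA_CHECK(cudaMemcpy(hbuf, dbuf, bytes, cudaMemcpyDeviceToHost));

    // Managed path: one pointer used from both host and device.
    int* mbuf = nullptr;
    CUDA_CHECK(cudaMallocManaged(&mbuf, bytes));
    for (size_t i = 0; i < n; ++i) mbuf[i] = 41;
    addOne<<<gridSize(n, threads), threads>>>(mbuf, n);
    CUDA_CHECK(cudaGetLastError());
    // The host must wait for the kernel before touching managed memory again.
    CUDA_CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (size_t i = 0; i < n; ++i) ok = ok && hbuf[i] == 1 && mbuf[i] == 42;
    std::printf("%s\n", ok ? "ok" : "mismatch");

    CUDA_CHECK(cudaFree(mbuf));
    CUDA_CHECK(cudaFreeHost(hbuf));
    CUDA_CHECK(cudaFree(dbuf));
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```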

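5. Stream Management

| API | Description |
| --- | --- |
| cudaStreamAddCallback | Attaches a host callback to a stream |
| cudaStreamCreate | Creates a stream |
| cudaStreamCreateWithFlags | Creates a stream with the given flags |
| cudaStreamCreateWithPriority | Creates a stream with a priority |
| cudaStreamDestroy | Destroys a stream |
| cudaStreamGetFlags | Returns the flags of a stream |
| cudaStreamGetPriority | Returns the priority of a stream |
| cudaStreamQuery | Queries whether all work in a stream has completed |
| cudaStreamSynchronize | Blocks until all work in a stream has completed |
| cudaStreamWaitEvent | Makes a stream wait for an event |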
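A typical use of streams is to split a buffer into chunks and let each chunk's copy-in, kernel, and copy-out proceed independently. The following is a sketch under that assumption; the chunk size, the `scale` kernel, and the helpers `CUDA_CHECK` and `chunkOffset` are made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: element offset of the chunk handled by a stream.
constexpr size_t chunkOffset(int stream, size_t chunk) {
    return static_cast<size_t>(stream) * chunk;
}

__global__ void scale(float* data, size_t n, float factor) {
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

int main() {
    constexpr int kStreams = 2;
    constexpr size_t kChunk = 1 << 18;
    constexpr size_t kTotal = kStreams * kChunk;
    constexpr size_t kChunkBytes = kChunk * sizeof(float);
    constexpr unsigned int kThreads = 256;
    constexpr unsigned int kBlocks = (kChunk + kThreads - 1) / kThreads;

    float *host = nullptr, *dev = nullptr;
    // Pinned host memory lets the async copies actually run asynchronously.
    CUDA_CHECK(cudaMallocHost(&host, kTotal * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&dev, kTotal * sizeof(float)));
    for (size_t i = 0; i < kTotal; ++i) host[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CUDA_CHECK(cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking));
    }

    // Work queued in one stream runs in order; different streams may overlap.
    for (int s = 0; s < kStreams; ++s) {
        const size_t off = chunkOffset(s, kChunk);
        CUDA_CHECK(cudaMemcpyAsync(dev + off, host + off, kChunkBytes,
                                   cudaMemcpyHostToDevice, streams[s]));
        scale<<<kBlocks, kThreads, 0, streams[s]>>>(dev + off, kChunk, 2.0f);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(host + off, dev + off, kChunkBytes,
                                   cudaMemcpyDeviceToHost, streams[s]));
    }

    // Non-blocking poll: cudaSuccess if finished, cudaErrorNotReady otherwise.
    std::printf("stream 0 status: %s\n", cudaGetErrorName(cudaStreamQuery(streams[0])));

    bool ok = true;
    for (int s = 0; s < kStreams; ++s) CUDA_CHECK(cudaStreamSynchronize(streams[s]));
    for (size_t i = 0; i < kTotal; ++i) ok = ok && host[i] == 2.0f;
    std::printf("%s\n", ok ? "ok" : "mismatch");

    for (int s = 0; s < kStreams; ++s) CUDA_CHECK(cudaStreamDestroy(streams[s]));
    CUDA_CHECK(cudaFree(dev));
    CUDA_CHECK(cudaFreeHost(host));
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```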

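6. Event Management

Events are often used together with streams to synchronize between streams; they can also be used to measure elapsed time on the device.

| API | Description |
| --- | --- |
| cudaEventCreate | Creates an event |
| cudaEventCreateWithFlags | Creates an event with the given flags |
| cudaEventDestroy | Destroys an event |
| cudaEventElapsedTime | Returns the elapsed time between a start and an end event, in milliseconds |
| cudaEventQuery | Queries the status of an event |
| cudaEventRecord | Records an event in a stream |
| cudaEventSynchronize | Blocks until an event has completed |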
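Both uses of events mentioned above, cross-stream ordering and device-side timing, can be sketched in one short program. The producer/consumer naming, the two kernels, and the helpers `CUDA_CHECK` and `blocksFor` are assumptions made for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
constexpr unsigned int blocksFor(size_t n, unsigned int threads) {
    return static_cast<unsigned int>((n + threads - 1) / threads);
}

__global__ void fill(int* data, size_t n, int value) {
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < n) data[idx] = value;
}

__global__ void increment(int* data, size_t n) {
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

int main() {
    constexpr size_t n = 1 << 20;
    constexpr unsigned int threads = 256;
    int* dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(int)));

    cudaStream_t producer, consumer;
    CUDA_CHECK(cudaStreamCreate(&producer));
    CUDA_CHECK(cudaStreamCreate(&consumer));
    cudaEvent_t start, filled;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&filled));

    // Bracket the fill kernel with two events in the producer stream.
    CUDA_CHECK(cudaEventRecord(start, producer));
    fill<<<blocksFor(n, threads), threads, 0, producer>>>(dev, n, 1);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(filled, producer));

    // The consumer stream may not start until `filled` has completed.
    CUDA_CHECK(cudaStreamWaitEvent(consumer, filled, 0));
    increment<<<blocksFor(n, threads), threads, 0, consumer>>>(dev, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaStreamSynchronize(consumer));

    // Both events have completed by now, so the elapsed time is available.
    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, filled));
    std::printf("fill kernel: %.3f ms\n", ms);

    int first = 0;
    CUDA_CHECK(cudaMemcpy(&first, dev, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("dev[0] = %d\n", first);  // 2 when fill ran before increment

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(filled));
    CUDA_CHECK(cudaStreamDestroy(producer));
    CUDA_CHECK(cudaStreamDestroy(consumer));
    CUDA_CHECK(cudaFree(dev));
    return first == 2 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```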

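7. Execution Control

| API | Description |
| --- | --- |
| cudaFuncGetAttributes | Returns the attributes of a kernel function |
| cudaFuncSetSharedMemConfig | Sets the shared memory bank-size configuration for a device function |
| cudaLaunchKernel | Launches a kernel; for simple kernels, launching with `<<<>>>` is recommended |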
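For comparison with the `<<<>>>` syntax, the sketch below queries a kernel's attributes and then launches it through `cudaLaunchKernel`, which takes the arguments as an array of pointers. The kernel `addOne` and the helpers `CUDA_CHECK` and `blocksFor` are hypothetical names for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
constexpr unsigned int blocksFor(size_t n, unsigned int threads) {
    return static_cast<unsigned int>((n + threads - 1) / threads);
}

__global__ void addOne(int* data, size_t n) {
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

int main() {
    // Inspect the compiled kernel: register use, static shared memory, limits.
    cudaFuncAttributes attr;
    CUDA_CHECK(cudaFuncGetAttributes(&attr, addOne));
    std::printf("maxThreadsPerBlock=%d numRegs=%d sharedSizeBytes=%zu\n",
                attr.maxThreadsPerBlock, attr.numRegs, attr.sharedSizeBytes);

    size_t n = 4096;  // non-const: its address is passed as a kernel argument
    int* dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(int)));
    CUDA_CHECK(cudaMemset(dev, 0, n * sizeof(int)));

    const unsigned int threads = 256;
    // Each entry points at one kernel argument, in declaration order.
    void* args[] = {&dev, &n};
    // Equivalent to: addOne<<<blocksFor(n, threads), threads>>>(dev, n);
    CUDA_CHECK(cudaLaunchKernel((const void*)addOne, dim3(blocksFor(n, threads)),
                                dim3(threads), args, 0, 0));
    CUDA_CHECK(cudaDeviceSynchronize());

    int first = -1;
    CUDA_CHECK(cudaMemcpy(&first, dev, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("dev[0] = %d\n", first);  // 1 after a successful launch

    CUDA_CHECK(cudaFree(dev));
    return first == 1 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The explicit form is mostly useful when the kernel or its argument list is chosen at run time; otherwise `<<<>>>` is shorter and type-checked.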

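8. Multi-GPU Management

| API | Description |
| --- | --- |
| cudaDeviceCanAccessPeer | Queries whether one device can directly access the memory of another device |
| cudaDeviceDisablePeerAccess | Disables the current device's direct access to a peer device |
| cudaDeviceEnablePeerAccess | Enables the current device's direct access to a peer device |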
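A rough sketch of the peer-access workflow on a machine with at least two GPUs follows; it exits early otherwise. It additionally uses `cudaMemcpyPeer`, a runtime API that is not listed in the table above, and the helpers `CUDA_CHECK` and `deviceCount` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: number of visible devices, or 0 if the query fails.
int deviceCount() {
    int count = 0;
    return cudaGetDeviceCount(&count) == cudaSuccess ? count : 0;
}

int main() {
    if (deviceCount() < 2) {
        std::printf("fewer than two GPUs visible; nothing to demonstrate\n");
        return EXIT_SUCCESS;
    }

    // Can device 0 directly address memory that lives on device 1?
    int canAccess = 0;
    CUDA_CHECK(cudaDeviceCanAccessPeer(&canAccess, 0, 1));
    std::printf("device 0 -> device 1 peer access: %s\n", canAccess ? "yes" : "no");

    constexpr size_t bytes = 1 << 20;
    void *buf0 = nullptr, *buf1 = nullptr;

    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaMalloc(&buf0, bytes));
    CUDA_CHECK(cudaMemset(buf0, 0x5A, bytes));
    if (canAccess) {
        // Peer access is one-directional: this enables 0 -> 1 only.
        CUDA_CHECK(cudaDeviceEnablePeerAccess(1, 0));
    }

    CUDA_CHECK(cudaSetDevice(1));
    CUDA_CHECK(cudaMalloc(&buf1, bytes));

    // Copy from device 0 to device 1; this call works whether or not
    // peer access has been enabled.
    CUDA_CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, bytes));

    unsigned char first = 0;
    CUDA_CHECK(cudaMemcpy(&first, buf1, 1, cudaMemcpyDeviceToHost));
    std::printf("first byte on device 1: 0x%02X\n", first);

    CUDA_CHECK(cudaFree(buf1));
    CUDA_CHECK(cudaSetDevice(0));
    if (canAccess) {
        CUDA_CHECK(cudaDeviceDisablePeerAccess(1));
    }
    CUDA_CHECK(cudaFree(buf0));
    return first == 0x5A ? EXIT_SUCCESS : EXIT_FAILURE;
}
```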

9. Practical Example

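```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

__global__ void vectorAdd(int* out, const int* in1, const int* in2, size_t size) {
    // Use size_t for the index so the comparison with size is not mixed-sign.
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < size) {
        out[idx] = in1[idx] + in2[idx];
    }
}

int main() {
    int *dout, *din1, *din2;
    int *hout, *hin1, *hin2;
    constexpr size_t size = 10240;
    constexpr size_t bytes = size * sizeof(int);
    // Allocate device memory
    CUDA_CHECK(cudaMalloc(&dout, bytes));
    CUDA_CHECK(cudaMalloc(&din1, bytes));
    CUDA_CHECK(cudaMalloc(&din2, bytes));
    // Allocate page-locked host memory
    CUDA_CHECK(cudaMallocHost(&hout, bytes));
    CUDA_CHECK(cudaMallocHost(&hin1, bytes));
    CUDA_CHECK(cudaMallocHost(&hin2, bytes));
    // Initialize the inputs; keep values small so the sum cannot overflow int
    for (size_t i = 0; i < size; i++) {
        hin1[i] = rand() % 1000;
        hin2[i] = rand() % 1000;
    }
    // Copy the initialized data from host to device
    CUDA_CHECK(cudaMemcpy(din1, hin1, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(din2, hin2, bytes, cudaMemcpyHostToDevice));
    // Launch the kernel
    dim3 block(1024);
    dim3 grid((size + block.x - 1) / block.x);
    vectorAdd<<<grid, block>>>(dout, din1, din2, size);
    CUDA_CHECK(cudaGetLastError());       // catches launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
    // Copy the kernel results from device back to host
    CUDA_CHECK(cudaMemcpy(hout, dout, bytes, cudaMemcpyDeviceToHost));
    bool is_pass = true;
    for (size_t i = 0; i < size; i++) {
        is_pass &= hout[i] == (hin1[i] + hin2[i]);
    }
    // Release memory
    CUDA_CHECK(cudaFree(dout));
    CUDA_CHECK(cudaFree(din1));
    CUDA_CHECK(cudaFree(din2));
    CUDA_CHECK(cudaFreeHost(hout));
    CUDA_CHECK(cudaFreeHost(hin1));
    CUDA_CHECK(cudaFreeHost(hin2));
    return is_pass ? EXIT_SUCCESS : EXIT_FAILURE;
}
```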

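Compile (compute_90/sm_90 targets GPUs with compute capability 9.0; adjust these values to match your GPU):

```bash
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/include -O3 -DNDEBUG  --generate-code=arch=compute_90,code=[compute_90,sm_90] -std=c++17 vector_add.cu -o vector_add
```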