CUDA Beginner Notes 1

The provided code snippet is a CUDA (Compute Unified Device Architecture) kernel definition and its invocation in the main() function, written in C/C++. This code is intended to perform matrix addition on two 2D arrays (or matrices) A and B, storing the result in a third matrix C. Let's break it down step by step:

Kernel Definition

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

  1. Kernel Declaration: The __global__ qualifier indicates that this function is a CUDA kernel, which means it can be called from the host (CPU) and executed on the device (GPU).

  2. Matrix Addition Logic:

    • The kernel takes three 2D arrays (matrices) as inputs: A, B, and C, where A and B are the input matrices and C is the output matrix.
    • Inside the kernel, matrix indices i and j are calculated. i corresponds to the row index, and j corresponds to the column index.
  3. Index Calculation:

    • blockIdx and threadIdx are CUDA built-in variables that hold the index of the current block within the grid and the index of the current thread within its block.
    • blockDim holds the number of threads per block along each dimension.
    • The formula combines these into the global row and column indices into the N x N matrix. The kernel assumes a 2D grid of 2D thread blocks.
  4. Conditional Check:

    • The if (i < N && j < N) statement ensures that the computed indices are within the bounds of the matrix dimensions (N x N). This is a safety measure to prevent out-of-bounds access.
  5. Matrix Addition Operation: If the indices are valid, the corresponding elements of matrices A and B are added and stored in matrix C.
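
As a rough illustration of the steps above, the following host-only C++ sketch replays the kernel body serially for every (block, thread) pair that a launch would create. It is not CUDA code: Dim2, mat_add_thread and mat_add_emulated are hypothetical names introduced here, and the matrices are assumed to be stored row-major in flat vectors rather than as float[N][N].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in index types, limited to two dimensions.
struct Dim2 { int x; int y; };

// The body of MatAdd as one emulated thread would run it.
// Matrices are n x n, stored row-major in flat vectors (an assumption of this sketch).
void mat_add_thread(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx, int n,
                    const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < n && j < n)  // threads that fall outside the matrix do nothing
        C[i * n + j] = A[i * n + j] + B[i * n + j];
}

// Visits, one after another, every thread of every block in the emulated grid.
// On a GPU these iterations would run in parallel instead.
std::vector<float> mat_add_emulated(Dim2 numBlocks, Dim2 threadsPerBlock, int n,
                                    const std::vector<float>& A,
                                    const std::vector<float>& B)
{
    assert(A.size() == static_cast<std::size_t>(n) * n && B.size() == A.size());
    std::vector<float> C(A.size(), 0.0f);
    for (int by = 0; by < numBlocks.y; ++by)
        for (int bx = 0; bx < numBlocks.x; ++bx)
            for (int ty = 0; ty < threadsPerBlock.y; ++ty)
                for (int tx = 0; tx < threadsPerBlock.x; ++tx)
                    mat_add_thread({bx, by}, threadsPerBlock, {tx, ty}, n, A, B, C);
    return C;
}
```

Calling mat_add_emulated with a 2 x 2 grid of 4 x 4 blocks on a 5 x 5 matrix exercises both cases: 25 emulated threads write one element each, and the remaining 39 are filtered out by the bounds check.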

Kernel Invocation

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

  1. Thread and Block Configuration:

    • dim3 threadsPerBlock(16, 16) specifies that each block will contain 16x16 = 256 threads.
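    • dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y) calculates the number of blocks needed to cover the entire matrix. This assumes that N is a multiple of 16 (the block size); otherwise the integer division rounds down and the last rows and columns are left without threads.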
  2. Kernel Launch:

    • The kernel MatAdd is launched with the specified numBlocks and threadsPerBlock. The <<<...>>> syntax is used for launching the kernel in CUDA.
    • During the execution of the kernel, the GPU will use the defined blocks and threads to perform the matrix addition in parallel, where each thread computes the addition for one element of the matrices.
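
When N is not a multiple of the block size, a common workaround is to round the block count up instead of down. The host-only C++ sketch below compares the two; blocks_truncated and blocks_needed are hypothetical helper names, not part of the CUDA API.

```cpp
#include <cassert>

// Block count as written in the snippet above: integer division rounds down.
int blocks_truncated(int n, int threads_per_block)
{
    assert(n > 0 && threads_per_block > 0);
    return n / threads_per_block;
}

// Rounded-up block count (ceiling division), so that
// blocks * threads_per_block >= n always holds.
int blocks_needed(int n, int threads_per_block)
{
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For N = 100 and 16 threads per dimension, the truncating version yields 6 blocks (96 threads, so indices 96 to 99 are never visited), while the rounded-up version yields 7 blocks (112 threads). The surplus threads are harmless only because the kernel already guards its work with if (i < N && j < N).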

Complete Context

In the context of a complete program, you would need to handle memory allocation and data transfer between the host and the device (e.g., using CUDA API functions such as cudaMalloc, cudaMemcpy, etc.) prior to invoking the kernel. You would also need to include error checking and potentially free allocated memory afterward. The snippet provided is focused specifically on the kernel's definition and invocation.

This CUDA kernel performs matrix addition: two matrices A and B are added element-wise to produce a result matrix C. The sections below go through its main components in more detail, focusing on the index calculation and the kernel launch.

Kernel Definition

The kernel function MatAdd is defined with the __global__ keyword, which indicates that this function will run on the GPU and can be called from the CPU code. The function takes three two-dimensional arrays of floats representing matrices:

  • A[N][N]: the first input matrix.
  • B[N][N]: the second input matrix.
  • C[N][N]: the output matrix where the sum of A and B will be stored.

Calculation of Global Indices

Inside the kernel, the lines:

int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

are critical for determining which element of the matrices each thread will operate on.

Breakdown:
  • blockIdx.x / blockIdx.y: These are built-in variables that indicate the index of the block in the grid. CUDA divides the work into blocks, a higher-level grouping of threads. Every block can contain multiple threads.

  • blockDim.x / blockDim.y: These denote the number of threads along each dimension (x and y) within a block. In this case, each block is a 16x16 arrangement of threads (because of dim3 threadsPerBlock(16, 16)).

  • threadIdx.x / threadIdx.y: These represent the thread's index within its block in the x and y dimensions. With 16x16 blocks, each component ranges from 0 to 15.

Understanding the Index Calculations

  1. Calculating the Global Index:

    • For row index i:

      \[ i = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x} \]

      This formula calculates which row of the matrix the current thread is responsible for. The term blockIdx.x * blockDim.x gives us the starting index of the block in the global context, and adding threadIdx.x provides the local offset within that block. Therefore, for the first block (block index 0), the row range would be 0 to 15 (if there are enough rows), for the second block (block index 1), from 16 to 31, and so on.

  2. Calculating the Column Index:

    • For column index j:

      \[ j = \text{blockIdx.y} \times \text{blockDim.y} + \text{threadIdx.y} \]

      This works similarly to the row indexing but affects the column. It determines the specific position in the matrix that the current thread is responsible for in the y-axis.
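
To make the arithmetic concrete, here is a small host-only C++ sketch of the same formula for a single dimension. The names global_index, IndexRange and block_range are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Global index of one thread along a single dimension:
// block_idx * block_dim + thread_idx, as in the kernel.
int global_index(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 0 && block_dim > 0);
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

// First and last global index handled by one block along a dimension.
struct IndexRange { int first; int last; };

IndexRange block_range(int block_idx, int block_dim)
{
    return { global_index(block_idx, block_dim, 0),
             global_index(block_idx, block_dim, block_dim - 1) };
}
```

With 16 threads per dimension, block_range(0, 16) covers indices 0 to 15 and block_range(1, 16) covers 16 to 31, matching the ranges described above; thread 5 of block 2 maps to 2 * 16 + 5 = 37.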

Boundary Checking

if (i < N && j < N)
    C[i][j] = A[i][j] + B[i][j];

This condition checks whether the calculated indices i and j are within the bounds of the N x N matrices, so that no thread accesses memory outside the allocated arrays. If the indices are valid, the addition takes place; otherwise the thread does nothing.
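
The check matters whenever the grid holds more threads than the matrix has elements. The host-only C++ sketch below counts how many threads of a square launch pass the check; LaunchStats and count_threads are hypothetical names, and the grid is assumed to use the same number of blocks and threads in x and y.

```cpp
#include <cassert>

// Number of threads in the launch and how many of them pass the bounds check.
struct LaunchStats { long launched; long active; };

// Walks over every global (i, j) pair that a square grid of square blocks
// would produce and applies the same test as the kernel.
LaunchStats count_threads(int n, int blocks_per_dim, int threads_per_dim)
{
    assert(n > 0 && blocks_per_dim > 0 && threads_per_dim > 0);
    LaunchStats stats{0, 0};
    int extent = blocks_per_dim * threads_per_dim;  // threads along one axis
    for (int i = 0; i < extent; ++i) {
        for (int j = 0; j < extent; ++j) {
            ++stats.launched;
            if (i < n && j < n)
                ++stats.active;
        }
    }
    return stats;
}
```

For N = 100 with 7 blocks of 16 threads per dimension, 12544 threads are launched and 10000 of them do work; the other 2544 fail the check and return immediately. For N = 64 with 4 blocks per dimension, all 4096 threads are active.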

Kernel Invocation

In the main function, the kernel is invoked with:

MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

Here, numBlocks is computed as:

dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

This determines how many blocks are needed to cover the entire N x N matrix, given that each block holds 16 x 16 threads. The integer division is exact only when N is a multiple of 16.

Summary

In summary, the index calculations map the 2D grid of threads onto the 2D matrix indices, so that each thread operates on one specific element of the matrices. This parallelizes the matrix addition at the level of individual elements, which is one of the main advantages of CUDA programming on GPUs.

https://harmanani.github.io/classes/csc447/Notes/Lecture15.pdf

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
