CUDA Beginner Notes 1

The code below defines a CUDA (Compute Unified Device Architecture) kernel and launches it from the main() function; it is written in CUDA C++. It adds two N x N matrices A and B element-wise and stores the result in a third matrix C. Let's break it down step by step:

Kernel Definition

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

Kernel Declaration: The __global__ qualifier marks this function as a CUDA kernel, which means it is called from the host (CPU) and executed on the device (GPU). A kernel must return void.
Matrix Addition Logic:
- The kernel takes three 2D arrays (matrices) as inputs: A, B, and C, where A and B are the input matrices and C is the output matrix.
- Inside the kernel, matrix indices i and j are calculated. i corresponds to the row index, and j corresponds to the column index.
Index Calculation:
- blockIdx and threadIdx are CUDA built-in variables that represent the indices of the block and thread currently executing the code.
- blockDim holds the dimensions of a block, that is, the number of threads per block along each axis.
- The formula calculates the global row and column indices into the matrix from the block and thread indices. The kernel assumes a 2D grid of thread blocks.
Conditional Check:
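- The if (i < N && j < N) statement ensures that the computed indices are within the bounds of the matrix dimensions (N x N). This prevents out-of-bounds access whenever the grid contains more threads than matrix elements along a dimension, for example when the block count is rounded up because N is not a multiple of the block size.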
Matrix Addition Operation: If the indices are valid, the corresponding elements of matrices A and B are added and stored in matrix C.

Kernel Invocation

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

Thread and Block Configuration:
- dim3 threadsPerBlock(16, 16) specifies that each block will contain 16x16 = 256 threads.
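- dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y) calculates the number of blocks along each dimension. Because integer division truncates, this covers the entire matrix only if N is divisible by 16 (the block size); otherwise the last N % 16 rows and columns are left unprocessed.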
Kernel Launch:
- The kernel MatAdd is launched with the specified numBlocks and threadsPerBlock. The <<<...>>> syntax is used for launching the kernel in CUDA.
- During the execution of the kernel, the GPU will use the defined blocks and threads to perform the matrix addition in parallel, where each thread computes the addition for one element of the matrices.
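
The divisibility assumption can be lifted by rounding the block count up instead of truncating it. The following is a minimal host-only sketch in plain C++ (no GPU required) that compares the two calculations. The helper names blocksTruncated, blocksRoundedUp, and coveredElements are hypothetical and are not part of the CUDA API.

```cpp
#include <cassert>

// Hypothetical helpers (not CUDA API): block count along one dimension.
inline int blocksTruncated(int n, int blockSize)
{
    assert(n >= 0 && blockSize > 0);
    return n / blockSize;                    // what N / threadsPerBlock.x computes
}

inline int blocksRoundedUp(int n, int blockSize)
{
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;  // ceiling division
}

// Number of indices along one dimension that some thread maps to.
inline int coveredElements(int blocks, int blockSize)
{
    return blocks * blockSize;
}
```

With N = 1000 and 16 threads per dimension, blocksTruncated gives 62 blocks, which cover only 992 indices, so the last 8 rows and columns are never touched. blocksRoundedUp gives 63 blocks, or 1008 threads per dimension, and the kernel's bounds check filters out the 8 surplus indices. In the launch code this would correspond to writing (N + threadsPerBlock.x - 1) / threadsPerBlock.x for each dimension.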

Complete Context

In the context of a complete program, you would need to handle memory allocation and data transfer between the host and the device (e.g., using CUDA API functions such as cudaMalloc, cudaMemcpy, etc.) prior to invoking the kernel. You would also need to include error checking and potentially free allocated memory afterward. The snippet provided is focused specifically on the kernel's definition and invocation.
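
The steps listed above (allocate, copy in, launch, check for errors, copy back, free) might be put together as in the sketch below. It is one possible arrangement, not the original author's program: the matrix size N = 1024, the fill values, and the CUDA_CHECK macro are assumptions made for illustration, and the block count is rounded up so that sizes not divisible by 16 would also be covered. The runtime calls used (cudaMalloc, cudaMemcpy, cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString, cudaFree) are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024  // assumed matrix size for this sketch

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    const size_t bytes = sizeof(float) * N * N;

    // Host buffers, viewed as N x N arrays.
    float (*hA)[N] = static_cast<float (*)[N]>(std::malloc(bytes));
    float (*hB)[N] = static_cast<float (*)[N]>(std::malloc(bytes));
    float (*hC)[N] = static_cast<float (*)[N]>(std::malloc(bytes));
    if (!hA || !hB || !hC) return EXIT_FAILURE;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { hA[i][j] = 1.0f; hB[i][j] = 2.0f; }

    // Device buffers with the same row type, matching the kernel signature.
    float (*dA)[N] = nullptr;
    float (*dB)[N] = nullptr;
    float (*dC)[N] = nullptr;
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&dA), bytes));
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&dB), bytes));
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&dC), bytes));

    // Copy the inputs from host to device.
    CUDA_CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    // Launch: 16x16 threads per block, block count rounded up.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    // Copy the result back and verify it on the host.
    CUDA_CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));
    bool ok = true;
    for (int i = 0; i < N && ok; ++i)
        for (int j = 0; j < N; ++j)
            if (hC[i][j] != 3.0f) { ok = false; break; }
    std::printf("%s\n", ok ? "PASS" : "FAIL");

    // Release device and host memory.
    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dB));
    CUDA_CHECK(cudaFree(dC));
    std::free(hA); std::free(hB); std::free(hC);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

A file like this would typically be compiled with nvcc. The kernel launch is asynchronous, which is why the sketch checks cudaGetLastError right after the launch and then synchronizes before reading the result.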

This CUDA kernel performs matrix addition: two matrices A and B are added element-wise to produce a result matrix C. Let's break down the important components, focusing on the two lines that compute the global indices.

Kernel Definition

The kernel function MatAdd is defined with the __global__ keyword, which indicates that this function will run on the GPU and can be called from the CPU code. The function takes three two-dimensional arrays of floats representing matrices:

A[N][N]: the first input matrix.
B[N][N]: the second input matrix.
C[N][N]: the output matrix where the sum of A and B will be stored.

Calculation of Global Indices

Inside the kernel, the lines:

int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

are critical for determining which element of the matrices each thread will operate on.

Breakdown:

blockIdx.x / blockIdx.y: These built-in variables give the index of the current block within the grid. CUDA divides the work into blocks, a higher-level grouping of threads; every block contains multiple threads.
blockDim.x / blockDim.y: These give the number of threads along each dimension (x and y) of a block. In this case each block is a 16x16 arrangement of threads (because of dim3 threadsPerBlock(16, 16)), so both values are 16.
threadIdx.x / threadIdx.y: These give the thread's index within its block in the x and y dimensions. With 16 threads per dimension, each value ranges from 0 to 15.

Understanding the Index Calculations

Calculating the Global Index:
- For row index i:

  \[ i = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x} \]
  
  This formula calculates which row of the matrix the current thread is responsible for. The term blockIdx.x * blockDim.x gives the starting index of the block in the global context, and adding threadIdx.x provides the local offset within that block. Therefore, with 16 threads per block in x, the first block (blockIdx.x = 0) covers rows 0 to 15 (if there are enough rows), the second block (blockIdx.x = 1) covers rows 16 to 31, and so on.
Calculating the Column Index:
- For column index j:

  \[ j = \text{blockIdx.y} \times \text{blockDim.y} + \text{threadIdx.y} \]

This works the same way as the row index but selects the column: blockIdx.y * blockDim.y is the starting column of the block, and threadIdx.y is the offset within the block. Together, i and j identify the single matrix element that the current thread is responsible for.
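
To check the arithmetic by hand, the two index lines can be mirrored on the CPU. The sketch below is plain C++ rather than device code; Idx2 is a hypothetical stand-in for CUDA's built-in index types, and globalIndex is an illustrative helper, not a CUDA function.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's built-in blockIdx/blockDim/threadIdx values.
struct Idx2 { int x; int y; };

// Mirrors the kernel's two index lines for a single thread.
inline Idx2 globalIndex(Idx2 bIdx, Idx2 bDim, Idx2 tIdx)
{
    Idx2 g;
    g.x = bIdx.x * bDim.x + tIdx.x;  // i in the kernel (row)
    g.y = bIdx.y * bDim.y + tIdx.y;  // j in the kernel (column)
    return g;
}
```

For example, with 16x16 blocks, the thread with threadIdx (3, 5) in the block with blockIdx (1, 2) gets i = 1 * 16 + 3 = 19 and j = 2 * 16 + 5 = 37, so it handles element C[19][37].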

Boundary Checking

if (i < N && j < N)
    C[i][j] = A[i][j] + B[i][j];

This condition checks whether the calculated indices i and j are within the bounds of the N x N matrices, so that no thread accesses memory outside the allocated arrays. If the indices are valid, the addition takes place; otherwise the thread does nothing.
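
How many threads the check actually filters out depends on the launch configuration. The following host-only sketch in plain C++ emulates every thread of a square 2D launch and applies the same test; LaunchCount and countThreads are hypothetical names introduced for illustration.

```cpp
#include <cassert>

struct LaunchCount { long long active; long long idle; };

// Emulates, on the CPU, every thread of a square 2D launch and applies
// the kernel's bounds test to its global indices.
inline LaunchCount countThreads(int n, int blocksPerDim, int threadsPerDim)
{
    LaunchCount c = {0, 0};
    const int span = blocksPerDim * threadsPerDim;  // threads along each dimension
    for (int i = 0; i < span; ++i) {
        for (int j = 0; j < span; ++j) {
            if (i < n && j < n) {
                ++c.active;  // this thread would write C[i][j]
            } else {
                ++c.idle;    // this thread would skip the addition
            }
        }
    }
    return c;
}
```

For N = 1000 with 63 blocks of 16 threads per dimension, 1,000,000 threads pass the check and 16,064 are idle. With 62 blocks no thread is idle, but only 984,064 of the 1,000,000 elements get computed.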

Kernel Invocation

In the main function, the kernel is invoked with:

MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

Here, numBlocks is computed as:

dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

This determines how many blocks are launched to cover the N x N matrix, given that each block contains 16 x 16 threads. Because the division truncates, full coverage requires N to be a multiple of 16.

Summary

In summary, the two index-calculation lines map the 2D grid of threads onto the 2D matrix indices, so that each thread operates on one specific element of the matrices. This structure parallelizes the matrix addition, which is one of the main advantages of CUDA programming on GPUs.

https://harmanani.github.io/classes/csc447/Notes/Lecture15.pdf

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html