CUDA Beginner Notes 1

The code below is a CUDA (Compute Unified Device Architecture) kernel definition and its invocation in the main() function, written in CUDA C++. It performs element-wise addition of two N x N matrices A and B and stores the result in a third matrix C. Let's break it down step by step:

Kernel Definition

cpp
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

  1. Kernel Declaration: The __global__ qualifier marks this function as a CUDA kernel, which means it is called from the host (CPU) and executed on the device (GPU).

  2. Matrix Addition Logic:

    • The kernel takes three 2D arrays (matrices) as inputs: A, B, and C, where A and B are the input matrices and C is the output matrix.
    • Inside the kernel, matrix indices i and j are calculated. i corresponds to the row index, and j corresponds to the column index.
  3. Index Calculation:

    • blockIdx and threadIdx are CUDA built-in variables that represent the indices of the block and thread currently executing the code.
    • blockDim holds the number of threads per block along each dimension.
    • The formula blockIdx * blockDim + threadIdx gives each thread its global index along x and y, and the kernel uses those two values directly as the row and column of the 2D matrix. The kernel assumes a 2D grid of 2D thread blocks.
  4. Conditional Check:

    • The if (i < N && j < N) statement ensures that the computed indices are within the bounds of the matrix dimensions (N x N). This is a safety measure to prevent out-of-bounds access.
  5. Matrix Addition Operation: If the indices are valid, the corresponding elements of matrices A and B are added and stored in matrix C.
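
To see how the index arithmetic and the bounds check fit together, the kernel can be emulated on the CPU, with nested loops standing in for the grid and the block. This is only a sketch for intuition: the names Dim2 and emulateMatAdd are made up for this note, the matrices are stored as flat row-major std::vector<float> rather than float[N][N], and a real GPU runs these iterations in parallel instead of one after another.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the x/y parts of CUDA's dim3 (illustration only).
struct Dim2 {
    int x;
    int y;
};

// CPU emulation of MatAdd: the two outer loops play the role of blockIdx,
// the two inner loops play the role of threadIdx. Matrices are n x n and
// stored row-major in flat vectors (an assumption of this sketch).
// Returns the number of "threads" that passed the bounds check.
int emulateMatAdd(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n, Dim2 numBlocks, Dim2 blockDim)
{
    int active = 0;
    for (int by = 0; by < numBlocks.y; ++by)
        for (int bx = 0; bx < numBlocks.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx) {
                    int i = bx * blockDim.x + tx;  // same formula as the kernel
                    int j = by * blockDim.y + ty;
                    if (i < n && j < n) {
                        std::size_t idx = static_cast<std::size_t>(i) * n + j;
                        C[idx] = A[idx] + B[idx];
                        ++active;
                    }
                }
    return active;
}
```

With n = 32, a 2x2 grid of 16x16 blocks runs the inner body 1024 times, once per matrix element, so every element of C is written exactly once.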

Kernel Invocation

cpp
int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

  1. Thread and Block Configuration:

    • dim3 threadsPerBlock(16, 16) specifies that each block will contain 16x16 = 256 threads.
    • dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y) calculates the number of blocks along each dimension. Because integer division truncates, this covers the entire matrix only when N is a multiple of 16 (the block size).
  2. Kernel Launch:

    • The kernel MatAdd is launched with the specified numBlocks and threadsPerBlock. The <<<...>>> syntax is used for launching the kernel in CUDA.
    • During the execution of the kernel, the GPU will use the defined blocks and threads to perform the matrix addition in parallel, where each thread computes the addition for one element of the matrices.
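
The divisibility assumption matters: when N is not a multiple of the block size, N / threadsPerBlock.x launches too few blocks and the last rows and columns are never touched. A common way around this is to round the block count up and rely on the kernel's if (i < N && j < N) check to idle the extra threads. The helper names below are hypothetical and the code is host-side C++ only, a sketch of the arithmetic rather than part of the original program.

```cpp
// Blocks per dimension as in the original snippet: truncating division.
// Covers all n indices only when n is a multiple of threadsPerBlock.
int blocksTruncated(int n, int threadsPerBlock)
{
    return n / threadsPerBlock;
}

// Blocks per dimension rounded up, so blocks * threadsPerBlock >= n.
int blocksRoundedUp(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of indices in [0, n) that at least one thread maps to.
int coveredIndices(int n, int blocks, int threadsPerBlock)
{
    int launched = blocks * threadsPerBlock;
    return launched < n ? launched : n;
}
```

For N = 100 and 16 threads per block dimension, truncation gives 6 blocks (96 indices covered, 4 missed), while rounding up gives 7 blocks (112 threads, all 100 indices covered). In the launch code, the rounded-up form would be dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x, (N + threadsPerBlock.y - 1) / threadsPerBlock.y).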

Complete Context

In the context of a complete program, you would need to handle memory allocation and data transfer between the host and the device (e.g., using CUDA API functions such as cudaMalloc, cudaMemcpy, etc.) prior to invoking the kernel. You would also need to include error checking and potentially free allocated memory afterward. The snippet provided is focused specifically on the kernel's definition and invocation.

This CUDA kernel is designed for matrix addition, where two matrices A and B are added element-wise to produce a result matrix C. The rest of these notes go through the same code a second time in more detail, focusing on the index calculation lines.

Kernel Definition

The kernel function MatAdd is defined with the __global__ keyword, which indicates that this function will run on the GPU and can be called from the CPU code. The function takes three two-dimensional arrays of floats representing matrices:

  • A[N][N]: the first input matrix.
  • B[N][N]: the second input matrix.
  • C[N][N]: the output matrix where the sum of A and B will be stored.

Calculation of Global Indices

Inside the kernel, the lines:

cpp
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;

are critical for determining which element of the matrices each thread will operate on.

Breakdown:
  • blockIdx.x / blockIdx.y: These are built-in variables that give the index of the current block within the grid. CUDA divides the work into blocks, a higher-level grouping of threads. Every block can contain multiple threads.

  • blockDim.x / blockDim.y: These give the number of threads along each dimension (x and y) within a block. In this case, each block is a 16x16 arrangement of threads (because of dim3 threadsPerBlock(16, 16)).

  • threadIdx.x / threadIdx.y: These give the thread's index within its block in the x and y dimensions. With 16x16 blocks, each component ranges from 0 to 15.

Understanding the Index Calculations

  1. Calculating the Row Index:

    • For row index i:

      \[ i = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x} \]

      This formula calculates which row of the matrix the current thread is responsible for. The term blockIdx.x * blockDim.x gives us the starting index of the block in the global context, and adding threadIdx.x provides the local offset within that block. Therefore, for the first block (block index 0), the row range would be 0 to 15 (if there are enough rows), for the second block (block index 1), from 16 to 31, and so on.

  2. Calculating the Column Index:

    • For column index j:

    \[ j = \text{blockIdx.y} \times \text{blockDim.y} + \text{threadIdx.y} \]

    This works the same way as the row index but uses the y components of the built-in variables. It selects the column of the matrix that the current thread is responsible for.
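
The mapping can also be read backwards: given a global index, integer division and remainder by the block size recover which block and which thread handle it. The sketch below is plain host-side C++ with hypothetical names (ThreadCoord, globalIndex, locate), meant only to make the block ranges above concrete.

```cpp
// Which block, and which thread within that block, own a global index.
// One dimension only; the same arithmetic applies to x and y.
struct ThreadCoord {
    int block;   // plays the role of blockIdx.x (or .y)
    int thread;  // plays the role of threadIdx.x (or .y)
};

// Forward mapping used in the kernel.
int globalIndex(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}

// Inverse mapping: split a non-negative global index into block and thread.
ThreadCoord locate(int index, int blockDim)
{
    return ThreadCoord{index / blockDim, index % blockDim};
}
```

With blockDim = 16, index 37 lands in block 2 at thread 5, since 2 × 16 + 5 = 37. Block 0 covers indices 0 to 15 and block 1 covers 16 to 31, matching the ranges described above.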

Boundary Checking

cpp
if (i < N && j < N)
    C[i][j] = A[i][j] + B[i][j];

This condition checks whether the calculated indices i and j are within the bounds of the matrix, so that no thread reads or writes outside the memory allocated for the N x N matrices. If the indices are valid, the addition takes place.
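
With the truncating grid size used in this snippet (N / 16 blocks per dimension), the check never actually fails, because the largest index launched is always below N. It starts to matter once the grid is rounded up to cover an N that is not a multiple of the block size. The following host-side C++ sketch counts how many threads such a rounded-up launch would create and how many of them the check turns away. The function names are hypothetical and the square grid is an assumption of the example.

```cpp
// Threads launched by a square 2D grid whose block count per dimension is
// rounded up (n and blockDim are per-dimension sizes).
long long launchedThreads(int n, int blockDim)
{
    long long perDim =
        static_cast<long long>((n + blockDim - 1) / blockDim) * blockDim;
    return perDim * perDim;
}

// Threads that fail the (i < n && j < n) check and therefore do no work.
long long idleThreads(int n, int blockDim)
{
    return launchedThreads(n, blockDim) - static_cast<long long>(n) * n;
}
```

For N = 100 with 16x16 blocks, 112 × 112 = 12544 threads are launched and 2544 of them are rejected by the check. For N = 64 none are.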

Kernel Invocation

In the main function, the kernel is invoked with:

cpp
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

Here, numBlocks is computed as:

cpp
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

This determines how many blocks are needed to cover the entire N x N matrix, given that each block contains 16x16 threads.

Summary

In summary, the index calculation lines are essential for mapping the 2D grid of threads to the 2D matrix indices, so that each thread operates on one specific element of the matrices. This structure parallelizes the matrix addition, which is one of the main advantages of CUDA programming on GPUs.

https://harmanani.github.io/classes/csc447/Notes/Lecture15.pdf

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
