CUDA Kernel Programming Fundamentals: The Full Theory, Explained
Your document covers CUDA's "minimal loop": write a kernel → map threads correctly → use memory correctly → take a rough performance measurement → locate bottlenecks with tools → do one round of optimization. ✅
1) The CUDA Programming Model: Host vs. Device, and Kernel Semantics

- The host (CPU) orchestrates: device memory allocation, data preparation/copies, kernel launches, synchronization, and cleanup.
- The device (GPU) executes kernels in parallel.
- A `__global__` function is a kernel:
  - Launched from the host via the `<<<grid, block>>>` launch configuration
  - Executed on the device
- CUDA's execution hierarchy is layered:
  - Grid: the full extent of one kernel launch
  - Block: a partition of the grid; threads within a block can cooperate (shared memory, synchronization)
  - Thread: the smallest execution entity
  - Warp: the hardware scheduling unit, typically 32 threads/warp. The warp is the key to reasoning about both branch divergence and memory coalescing. ⚠️
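As a concrete anchor for these terms, here is a minimal end-to-end sketch (the kernel name `addOne` and the sizes are illustrative, not from your document):

```cuda
#include <cuda_runtime.h>

// Kernel: each thread increments one element.
__global__ void addOne(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n) data[idx] += 1.0f;                   // bounds check (see section 2.1)
}

int main() {
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));           // host allocates device memory
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;   // round up to cover all elements
    addOne<<<gridSize, blockSize>>>(d_data, n);       // host launches, device executes
    cudaDeviceSynchronize();                          // host waits for the device
    cudaFree(d_data);
    return 0;
}
```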
2) Computing Thread Indices: Why These Formulas Are the Core of "Correct Mapping"
2.1 One-Dimensional Indexing (1D)

```cuda
int idx = blockIdx.x * blockDim.x + threadIdx.x;
```

- `threadIdx.x`: the thread's local index within its block
- `blockIdx.x`: the block's index within the grid
- `blockDim.x`: threads per block
- `idx` commonly maps to an array subscript: `data[idx]`
- `if (idx < n)` is a required bounds check, because the grid size is usually rounded up
2.2 Two-Dimensional Indexing (2D): Mapping (row, col) to Linear Memory

```cuda
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * width + col;
```

- GPU global memory is linear (1D), but we often work with 2D/3D data.
- `idx = row * width + col` is the standard row-major flattening.
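A minimal sketch of a 2D launch built on these formulas (the kernel name `scale2D` and the tile size are illustrative):

```cuda
__global__ void scale2D(float* data, int width, int height, float s) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width)               // 2D bounds check
        data[row * width + col] *= s;              // row-major flattening
}

void launchScale2D(float* d_data, int width, int height) {
    dim3 block(16, 16);                            // 256 threads, a multiple of 32
    dim3 grid((width  + block.x - 1) / block.x,    // round up in x
              (height + block.y - 1) / block.y);   // round up in y
    scale2D<<<grid, block>>>(d_data, width, height, 2.0f);
}
```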
2.3 Three-Dimensional Indexing (3D): Linearizing Volumes/Voxels/Multi-Dimensional Tensors

```cuda
int idx = z * width * height + y * width + x;
```

- The essence is still "multi-dimensional coordinates → 1D offset". For example, with width = 4 and height = 3, the voxel (x, y, z) = (1, 2, 0) lands at idx = 0 * 12 + 2 * 4 + 1 = 9.
- The crucial part is being explicit about what width/height/depth mean, and about the memory layout order.
3) Grid and Block Configuration: Why blockSize = 256 Is Common but Not an Iron Rule

3.1 The Typical gridSize Calculation

```cuda
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(data, n);
```

- `(n + blockSize - 1) / blockSize` rounds up, ensuring every element is covered.
- `blockSize` is commonly 128/256/512 because:
  - It is a multiple of the warp size (32) → friendly to scheduling and memory access patterns
  - It usually strikes a decent balance between parallelism and resource usage (registers/shared memory)

3.2 What "Best Practice" Really Means

- "A multiple of the warp size" is a guideline, not a performance guarantee.
- The factors that actually limit performance are typically:
  - The kernel uses many registers → occupancy drops
  - Heavy shared memory usage → fewer blocks can run concurrently
  - Uncoalesced memory accesses → throughput strangled by bandwidth/latency
  - Severe branch divergence → warp utilization drops
4) The GPU Memory Hierarchy: This Mostly Determines Your Kernel's Performance Ceiling

A useful mental model: the closer a memory is to the SMs (the compute cores), the faster, smaller, and harder to use it is; the farther away, the larger, slower, and more general-purpose.

4.1 Global Memory

- The largest and most general, but relatively the slowest (compared with shared memory/registers).
- Accessible by all threads.
- Performance keys: whether accesses are coalesced, and whether data is reused (caches/L2).
4.2 Shared Memory (the Core of Intra-Block Cooperation)

- Private to each block; shared among that block's threads.
- Very fast (on-chip, close to L1), small capacity.
- Typical uses:
  - Data reuse: cut down repeated reads from global memory
  - Data reshuffling: e.g., matrix transpose, convolution/tiling
  - Reductions: aggregating partial results
- Two points you must understand:
  - `__syncthreads()` is a block-level barrier: it guarantees phase consistency for shared memory reads and writes
  - Shared memory can suffer bank conflicts (covered later)
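A minimal sketch of the "reduce in shared memory" pattern combining these ideas (the kernel name is illustrative; it assumes blockDim.x == 256, a power of two):

```cuda
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[256];                    // one slot per thread; assumes blockDim.x == 256
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    tile[tid] = (idx < n) ? in[idx] : 0.0f;        // load with bounds check
    __syncthreads();                               // barrier: all loads complete

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();                           // barrier between phases
    }
    if (tid == 0) out[blockIdx.x] = tile[0];       // one partial sum per block
}
```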
4.3 Constant Memory (the Right Tool When All Threads Read the Same Copy)

- Read-only, with a dedicated cache.
- Best fit: threads in a warp reading the same address / the same small table (a broadcast).
- If threads in the same warp read different constant addresses, the accesses serialize (performance drops).
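A minimal sketch of declaring and filling constant memory (the names `coeffs` and `applyGain` are illustrative):

```cuda
__constant__ float coeffs[16];                     // lives in constant memory, cached

__global__ void applyGain(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= coeffs[0];                    // every thread reads the same address → broadcast
}

// Host side: copy the table into constant memory before launching.
void setCoeffs(const float* h_coeffs) {
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}
```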
4.4 Texture Memory (Spatial Locality + Sampling)

- Read-only + cached, optimized for spatially local access patterns.
- Historically used for image-like 2D access; also supports interpolation/addressing modes (depending on the API).
- Modern CUDA prefers texture objects over the legacy `texture<>` reference style, but the core idea in your example stands: reading through the texture path can improve cache hit rates for certain access patterns.
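A hedged sketch of the texture-object style for a plain linear 1D buffer (error handling omitted; the function names are illustrative):

```cuda
cudaTextureObject_t makeTex(float* d_data, int n) {
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;      // bind a plain linear buffer
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;    // return raw floats, no conversion

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}

__global__ void readViaTex(cudaTextureObject_t tex, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = tex1Dfetch<float>(tex, idx);  // read through the texture path
}
```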
5) Memory Coalescing: Why Contiguous Access Is So Much Faster

Analyze it at the warp level: if a warp's 32 threads access contiguous addresses, the hardware can serve the loads/stores with fewer memory transactions → high effective bandwidth utilization ✅

- Good pattern (coalesced): thread `t` accesses `data[base + t]`
- Bad pattern (strided): thread `t` accesses `data[base + t * stride]`

The larger the stride, the more likely a single warp access splits into many memory transactions.

This is exactly what your example emphasizes:
- Linear access like `data[idx]` is usually the easiest to coalesce
- `idx * stride` breaks coalescing
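A minimal sketch contrasting the two patterns (the kernel names are illustrative):

```cuda
// Coalesced: consecutive threads touch consecutive addresses.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}

// Strided: consecutive threads sit `stride` elements apart, so one
// warp access can split into many memory transactions.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n) out[idx] = in[idx];
}
```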
6) The Bandwidth-Test Kernel: What It Measures, and Why It Can Mislead

```cuda
__global__ void bandwidthTest(float* data, int n, int iterations) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;                                        // bounds check
    float value = data[idx];                                     // global read
    for (int i = 0; i < iterations; i++) value = value * 1.01f;  // FP work
    data[idx] = value;                                           // global write
}
```

This kind of kernel mixes:
- Global memory reads/writes (a bandwidth factor)
- Floating-point computation (a compute-throughput factor)

As `iterations` grows, the kernel may shift from memory-bound to compute-bound. It is really a tunable micro-benchmark: be explicit about which side you intend to measure.

A more rigorous bandwidth benchmark usually:
- Minimizes computation (closer to a pure copy / load-store)
- Controls the access pattern, alignment, and data types
- Prevents the compiler from folding the loop away (which takes some care)
7) CUDA Event Timing: The Standard Way to Time Kernels, and the Common Traps

- `cudaEventRecord(start)` / `cudaEventRecord(stop)` record GPU-side timestamps
- `cudaEventSynchronize(stop)` makes sure the GPU has actually finished
- `cudaEventElapsedTime` returns milliseconds

Common traps ⚠️:
- Forgetting to synchronize: without the `Synchronize`, the time you read may be wrong
- First-run warm-up overhead: the first run includes JIT, cache, paging, and similar costs; run several times and report statistics
- Asynchronous semantics: a kernel launch itself is asynchronous; events are one way to deal with this and measure actual device execution time
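A minimal sketch of the whole pattern, timing a copy-style kernel (such as the `copyCoalesced` sketch from section 5; error handling omitted):

```cuda
float timeCopyMs(float* d_in, float* d_out, int n, int gridSize, int blockSize) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyCoalesced<<<gridSize, blockSize>>>(d_in, d_out, n);  // warm-up run (not timed)
    cudaDeviceSynchronize();

    cudaEventRecord(start);                                  // GPU-side timestamp
    copyCoalesced<<<gridSize, blockSize>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                              // wait until `stop` has happened

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);                  // elapsed milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    // Effective bandwidth: the copy moves 2 * n * sizeof(float) bytes,
    // so GB/s = bytes / (ms * 1e6).
    return ms;
}
```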
8) Occupancy: What It Is, What It Solves, and What It Doesn't

Occupancy usually means: the ratio of active warps per SM to the theoretical maximum (or an equivalent metric).

What `cudaOccupancyMaxPotentialBlockSize` does:
- Takes the kernel's resource demands (registers/shared memory, etc.) and the hardware limits
- Suggests a block size that is "theoretically likely to be good"

Two points worth stressing:
- Higher occupancy is not always better: if you are already bandwidth-limited, raising occupancy won't make you faster.
- It only addresses one part of the problem (parallelism / latency hiding): real performance still depends on memory coalescing, cache hits, instruction efficiency, branch divergence, and overlapping memory traffic with compute.
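A minimal sketch of using the occupancy API to pick a launch configuration (reusing the illustrative `addOne` kernel from section 1):

```cuda
void launchWithOccupancyHint(float* d_data, int n) {
    int minGridSize = 0, blockSize = 0;
    // Suggest a block size that maximizes theoretical occupancy for addOne.
    // Last two args: dynamic shared memory per block (0), block-size cap (0 = none).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, addOne, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;  // still round up to cover all n elements
    addOne<<<gridSize, blockSize>>>(d_data, n);
}
```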
9) Nsight Compute: Which Metrics to Look At (from "Can Use It" to "Can Analyze")

`ncu --set full ...` generates a report. When analyzing, you don't need to read everything up front; grab the main thread first.

A very general pipeline to follow ✅:
- Is the kernel memory-bound or compute-bound?
  - Look at throughput, stalls, and the roofline (if you're familiar with it)
- Are memory accesses coalesced?
  - Global load/store efficiency, transaction counts, L2 hit rate
- Branch divergence and warp execution efficiency
  - Branch efficiency, warp execution efficiency
- Occupancy and its limiters
  - Achieved occupancy, register and shared memory limits
- Shared memory bank conflicts (if relevant)
  - Shared load/store replay/conflict-style metrics
10) Common Problems: Root Causes and Engineering Fixes

10.1 Bank Conflicts (Shared Memory)

Shared memory is divided into banks. If multiple threads in the same warp access different addresses that fall into the same bank, the accesses conflict and serialize (the exact rules are architecture-dependent, but the core to remember is "same warp, same bank, different addresses").

Engineering fixes:
- Padding: give 2D arrays a row stride that sidesteps conflicts (e.g., the `tile[BLOCK][BLOCK+1]` commonly seen in transpose kernels), as in the sketch below
- Rework the access pattern so a warp's accesses spread across different banks
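A minimal sketch of the padded shared-memory transpose, a standard pattern (the `TILE` size is illustrative; it assumes a TILE × TILE thread block):

```cuda
#define TILE 32

__global__ void transpose(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];         // +1 padding shifts the bank mapping
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();                               // barrier before reusing the tile

    // Swap block coordinates so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free via padding
}
```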
10.2 Branch Divergence

Threads within the same warp take different branches → the warp must execute each branch path in turn → effective parallelism drops.

Fixes:
- Restructure the conditional logic (hoist the `if` out of the warp/block; use data layout to reduce divergence)
- Sometimes replace branches with arithmetic, as in the sketch below (the trade-off: more arithmetic vs. less divergence)
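A minimal illustration of the branchless idea (names are illustrative; note that a compiler may already predicate a branch this simple, so measure before committing):

```cuda
// Divergent version: threads in a warp may take different paths.
__global__ void clampBranchy(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] < 0.0f)
        data[idx] = 0.0f;                          // data-dependent branch
}

// Branchless version: every thread runs the same instruction stream.
__global__ void clampBranchless(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = fmaxf(data[idx], 0.0f);        // arithmetic instead of a branch
}
```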
10.3 Low Occupancy

Common causes:
- High register usage (more registers per thread → fewer resident warps)
- High shared memory usage (more shared memory per block → fewer resident blocks)
- Poorly chosen block dimensions leading to insufficient concurrency

Fixes:
- Control register usage (algorithm changes / compiler options / fewer temporaries)
- Shrink the shared memory footprint, or re-tile
- Tune the block size and check the limiting resource with ncu
11) Advanced Topics: What Exactly Is "Advanced" About Each

- Warp-level primitives: use warp shuffles, votes, etc. to exchange data within a warp, cutting down on `__syncthreads()` and shared memory (see the sketch after this list).
- Dynamic parallelism: launching kernels from inside kernels; suits recursive/adaptive tasks, but brings higher scheduling overhead and complexity.
- Unified Memory: an easier programming model, but performance depends on migration/prefetch strategy and access patterns; high-performance code still often manages memory manually.
- Streams: use streams for copy/compute overlap and concurrent kernels; the key issues are dependencies and resource contention.
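A minimal sketch of a warp-level sum using shuffles, the canonical use of these primitives (assumes all 32 lanes of the warp are active):

```cuda
__device__ float warpSum(float v) {
    // Each step pulls a value from `offset` lanes away; after five
    // steps, lane 0 holds the sum of the whole warp — no shared
    // memory and no __syncthreads() needed.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}
```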
12) Your Four Exercises: The Theory Each One Should Train

Exercise 1: Vector Addition Optimization
- Key ideas: coalesced access, grid/block configuration, event timing, recognizing a bandwidth-bound kernel.

Exercise 2: Matrix Transpose (shared memory + avoiding bank conflicts)
- Key ideas: tiling, shared memory, `__syncthreads()`, padding to avoid bank conflicts.

Exercise 3: Reduction (Sum)
- Key ideas: shared-memory reduction patterns, warp-level optimization (less synchronization and divergence).

Exercise 4: Bandwidth Test (Different Access Patterns)
- Key ideas: how stride breaks coalescing, cache behavior, benchmark methodology.