【CUDA编程系列】之01

CUDA 核函数编程基础：全理论讲清楚（中英双语）

你的文档覆盖的是 CUDA 的"最小闭环"：写核函数 → 正确映射线程 → 正确用内存 → 粗测性能 → 用工具定位瓶颈 → 做一轮优化。✅

1) CUDA 编程模型：Host vs Device、Kernel 的语义

中文

Host（CPU） 负责：分配显存、准备数据、发起 kernel、做同步/回收资源。
Device（GPU） 负责：并行执行 kernel（核函数）。
__global__ 表示该函数是 kernel：
- 从 Host 端调用（用 <<<grid, block>>> 启动配置）
- 在 Device 端执行
CUDA 的并行执行单位分层：
- Grid（网格）：一次 kernel launch 的整体范围
- Block（线程块）：Grid 的分块；块内线程可协作（共享内存、同步）
- Thread（线程）：最小执行实体
- Warp（线程束）：硬件调度的基本单位，通常 32 线程/warp；分支发散与访存合并都以 warp 为核心来理解 ⚠️

English

The host (CPU) orchestrates: memory allocation/copies, kernel launches, synchronization, and cleanup.
The device (GPU) executes kernels in parallel.
A __global__ function is a kernel:
- Launched from the host via <<<grid, block>>>
- Executed on the device
CUDA's execution hierarchy:
- Grid (one kernel launch)
- Block (a cooperative group; shared memory + synchronization)
- Thread
- Warp (hardware scheduling unit, typically 32 threads/warp). The warp is the key unit for reasoning about branch divergence and memory coalescing.
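
The host/device split described above can be sketched as one round trip. This is a minimal illustration, not a reference implementation: the kernel `addOne` and the helper `addOneOnGpu` are hypothetical names introduced here, and the input is assumed to be non-empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: runs on the device, one thread per element.
__global__ void addOne(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;   // guard: the last block may overshoot n
}

// Host-side lifecycle: allocate -> copy in -> launch -> sync -> copy out -> free.
// Returns false if any CUDA call fails. Assumes host is non-empty.
bool addOneOnGpu(std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    size_t bytes = n * sizeof(float);
    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int block = 256;
        int grid = (n + block - 1) / block;
        addOne<<<grid, block>>>(dev, n);            // asynchronous launch
        ok = cudaGetLastError() == cudaSuccess      // launch-configuration errors
          && cudaDeviceSynchronize() == cudaSuccess // errors raised during execution
          && cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Calling `addOneOnGpu(v)` on a vector of ones should leave every element equal to 2.0f.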

2) 线程索引计算：为什么这些公式是"正确映射"的核心

2.1 一维索引（1D）

中文

int idx = blockIdx.x * blockDim.x + threadIdx.x;

threadIdx.x：块内线程编号
blockIdx.x：块编号
blockDim.x：每块线程数
这个 idx 通常对应数组下标：data[idx]
if (idx < n) 是必要的 越界保护（因为 gridSize 往往是向上取整）

English

int idx = blockIdx.x * blockDim.x + threadIdx.x;

threadIdx.x: thread's local index within a block
blockIdx.x: block index within the grid
blockDim.x: threads per block
idx commonly maps to array indices, and if (idx < n) is a required bounds check since grid sizes are usually rounded up.

2.2 二维索引（2D）：把 (row, col) 映射到线性内存

中文

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * width + col;

GPU 全局内存是线性的（1D），但我们经常处理 2D/3D 数据
idx = row * width + col 是行优先（row-major）布局的典型展开

English

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * width + col;

Global memory is linear, but data is often conceptualized as 2D/3D.
idx = row * width + col is the standard row-major flattening.
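
As a sketch of how the 2D mapping is typically launched, the example below scales a row-major matrix in place. The names `scale2D` and `launchScale2D` and the 16x16 block shape are assumptions made for illustration. Note that `threadIdx.x` maps to the column, so neighbouring threads touch neighbouring addresses.

```cuda
#include <cuda_runtime.h>

// Scales every element of a row-major (height x width) matrix.
__global__ void scale2D(float* m, int width, int height, float factor) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width) {      // both dimensions are rounded up
        m[row * width + col] *= factor;
    }
}

// Launch helper: dim3 is (x, y, z), so width goes in x and height in y.
void launchScale2D(float* dM, int width, int height, float factor) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(dM, width, height, factor);
}
```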

2.3 三维索引（3D）：体数据/体素/多维张量的线性化

中文

int idx = z * width * height + y * width + x;

本质还是"多维坐标 → 一维 offset"
关键是你要明确：width/height/depth 的含义、以及内存布局顺序

English

Same idea: map (x,y,z) into a linear offset.
The crucial part is being explicit about dimension semantics and layout order.
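
To make the layout order explicit, the flattening can live in one small function shared by host and device code. This is a sketch under the assumption that x varies fastest, then y, then z; `flatten3D` and `fill3D` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Flatten (x, y, z) with x fastest, then y, then z.
// Equivalent to z * width * height + y * width + x.
__host__ __device__ inline size_t flatten3D(int x, int y, int z, int width, int height) {
    return (static_cast<size_t>(z) * height + y) * width + x;
}

// One thread per voxel of a (depth x height x width) volume.
__global__ void fill3D(float* vol, int width, int height, int depth, float value) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < width && y < height && z < depth) {
        vol[flatten3D(x, y, z, width, height)] = value;
    }
}
```

For width = 4 and height = 5, the coordinate (1, 2, 3) maps to 3 * 20 + 2 * 4 + 1 = 69.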

3) 网格与块配置：为什么"blockSize 选 256"常见但不是铁律

中文

3.1 典型 gridSize 计算

int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;
myKernel<<<gridSize, blockSize>>>(data, n);

(n + blockSize - 1) / blockSize：向上取整，确保覆盖所有元素
blockSize 常选 128/256/512 的原因：
- 是 warp(32) 的倍数 → 便于调度与访存模式
- 通常能在"并行度 vs 资源占用（寄存器/共享内存）"之间取得不错平衡

3.2 最佳实践的真实含义

"warp 倍数"是指导原则，不是性能保证
真正限制性能的常见因素：
- kernel 使用了很多寄存器 → 降低 occupancy
- 使用大量共享内存 → 降低并发 block 数
- 内存访问不合并（coalescing 差） → 吞吐被带宽/延迟拖死
- 分支发散严重 → warp 利用率下降

English

gridSize = (n + blockSize - 1)/blockSize rounds up to cover all elements.
Block sizes such as 128, 256, or 512 are common because they are multiples of 32 (the warp size), which often yields decent scheduling and memory behavior.
However, "multiple of warp" is a heuristic, not a guarantee. Real bottlenecks include registers/shared memory limiting concurrency, poor coalescing, or heavy branch divergence.
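
The round-up arithmetic is easy to check with concrete numbers. The helper name `ceilDiv` is an assumption introduced for this sketch; the checks run at compile time.

```cuda
#include <cuda_runtime.h>

// Integer ceiling division: the smallest g with g * blockSize >= n
// (valid for n >= 0 and blockSize > 0).
__host__ __device__ constexpr int ceilDiv(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// n = 1000, blockSize = 256: 4 blocks = 1024 threads,
// so 24 threads fail the idx < n check and do nothing.
static_assert(ceilDiv(1000, 256) == 4, "rounds up");
static_assert(ceilDiv(1024, 256) == 4, "an exact multiple needs no extra block");
static_assert(ceilDiv(1025, 256) == 5, "one extra element costs one extra block");
```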

4) GPU 内存层次：你写 kernel 的"性能上限"主要由这里决定

可以把它理解为：越靠近 SM（计算核心）的存储越快、容量越小、越需要程序员显式管理；越远的越大、越慢、越通用。

4.1 全局内存 Global Memory

中文

最大、最通用、最慢（相对共享/寄存器）
所有线程可访问
性能关键：访问是否合并（coalesced）、是否复用（缓存/L2）

English

Largest and most general, but relatively the slowest.
Accessible by all threads.
Performance depends heavily on coalescing and cache reuse.

4.2 共享内存 Shared Memory（块内协作的核心）

中文

每个 block 私有，block 内线程共享
速度很快（片上存储，延迟接近 L1；在多数架构上与 L1 共用同一块片上存储），容量小
典型用途：
1. 重用数据：减少对 global memory 的重复读取
2. 做数据重排：如矩阵转置、卷积/tiling
3. 做规约（reduction）：把局部结果聚合
必须理解两个点：
- __syncthreads() 是块内 barrier：保证共享内存读写的阶段一致性；块内所有线程都必须执行到它，只有当条件对整个 block 求值一致时才能把它放进分支，否则可能挂起或产生错误结果
- 共享内存可能有 bank conflict（后面讲）

English

Private per block, shared among threads in the same block.
Fast, small.
Used for data reuse, tiling/reordering (e.g., transpose), and reductions.
Two key concepts:
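- __syncthreads() is a block-level barrier; every thread in the block must reach it, so it may sit inside a conditional only if the condition evaluates identically across the whole block.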
- Shared memory can suffer bank conflicts.
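
A small sketch of the write, barrier, read pattern: each block reverses its own chunk of an array through shared memory. The kernel name `reverseChunks` and the constant `BLOCK` are hypothetical, and the array length is assumed to be a multiple of `BLOCK` to keep the example short.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 256;

// Reverses each BLOCK-sized chunk of `data` in place.
// Assumes the array length is a multiple of BLOCK and blockDim.x == BLOCK.
__global__ void reverseChunks(float* data) {
    __shared__ float tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;
    tile[t] = data[base + t];               // phase 1: every thread writes one slot
    __syncthreads();                        // all writes are visible before any read
    data[base + t] = tile[BLOCK - 1 - t];   // phase 2: read a slot another thread wrote
}
```

Launched as `reverseChunks<<<n / BLOCK, BLOCK>>>(dData)`. Without the barrier, a thread could read a slot that its partner has not written yet.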

4.3 常量内存 Constant Memory（"所有线程读同一份"时的利器）

中文

只读，带专用缓存
适合：warp 内线程读取同一个地址/相同小表
如果同一 warp 里线程读不同常量地址，会发生序列化（性能下降）

English

Read-only, cached.
Best when threads in a warp read the same address (broadcast).
If threads read many different constant addresses, access can serialize.
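
The broadcast case can be illustrated with a small polynomial evaluation in which every thread reads the same coefficient at the same step. The names `coeffs`, `evalPoly`, and `setCoeffs` are assumptions for this sketch.

```cuda
#include <cuda_runtime.h>

constexpr int DEGREE = 3;
__constant__ float coeffs[DEGREE + 1];   // c0 + c1*x + c2*x^2 + c3*x^3

// All threads of a warp read coeffs[k] for the same k in the same iteration,
// so each read is a broadcast from the constant cache.
__global__ void evalPoly(const float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float xi = x[idx];
    float acc = coeffs[DEGREE];
    for (int k = DEGREE - 1; k >= 0; --k) acc = acc * xi + coeffs[k];  // Horner's rule
    y[idx] = acc;
}

// Host side: constant memory is written with cudaMemcpyToSymbol.
cudaError_t setCoeffs(const float (&host)[DEGREE + 1]) {
    return cudaMemcpyToSymbol(coeffs, host, sizeof(host));
}
```

Indexing `coeffs` with a per-thread value such as `coeffs[idx % 4]` would be the serialized case the text warns about.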

4.4 纹理内存 Texture Memory（空间局部性 + 采样能力）

中文

只读 + 缓存，针对"空间局部性"访问优化
历史上常用于图像/2D 访问；也支持插值/寻址模式（取决于 API）
旧式 texture<> 引用写法已被弃用，并在 CUDA 12 中移除；现代 CUDA 应使用 texture objects。但你的例子表达的是核心思想：用纹理路径读能提升某些访问模式下的命中率

English

Read-only, cached, optimized for spatial locality.
Historically used for image-like 2D access; can support sampling/addressing modes.
Legacy texture references are deprecated and were removed in CUDA 12, so modern code should use texture objects. The principle remains: the texture path can improve cache behavior for certain access patterns.
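
A minimal texture-object sketch, assuming a plain linear float buffer allocated with cudaMalloc rather than a 2D CUDA array. The names `copyViaTexture` and `makeLinearTexture` are hypothetical. Whether this path is faster than an ordinary load depends on the architecture and access pattern, so it should be measured.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Reads a linear float buffer through the texture path.
__global__ void copyViaTexture(cudaTextureObject_t tex, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = tex1Dfetch<float>(tex, idx);
}

// Wraps an existing device buffer in a texture object.
cudaError_t makeLinearTexture(cudaTextureObject_t* tex, float* dBuf, size_t count) {
    cudaResourceDesc res;
    std::memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dBuf;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc td;
    std::memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;   // return raw floats, no normalization

    return cudaCreateTextureObject(tex, &res, &td, nullptr);
}
```

The object should be released with `cudaDestroyTextureObject(tex)` once the kernels that use it have finished.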

5) 内存合并（Coalescing）：为什么"连续访问"能快很多

中文

以 warp 为单位看：一个 warp 的 32 个线程，如果访问连续地址（例如 32 个 4 字节 float 恰好是连续的 128 字节），硬件可以用更少的内存事务完成加载/存储 → 带宽利用率高 ✅

好模式（合并）：线程 t 访问 data[base + t]
坏模式（跨步 stride）：线程 t 访问 data[base + t * stride]
stride 越大，越可能导致一次 warp 访问分裂成很多内存事务

你的示例正是在强调：

data[idx] 这种线性访问通常最容易合并
idx * stride 会破坏合并

English

Coalescing is analyzed at the warp level: if the 32 threads of a warp access contiguous addresses (for example, 32 four-byte floats covering 128 contiguous bytes), hardware can serve them with fewer memory transactions, improving effective bandwidth.

Good: thread t accesses data[base + t]
Bad: thread t accesses data[base + t * stride], which often splits into many transactions.
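
The two patterns side by side, as a sketch. The kernel names are hypothetical, and the strided variant assumes both buffers hold at least n * stride elements.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Coalesced: thread t of a warp touches element base + t.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}

// Strided: neighbouring threads touch elements `stride` apart.
// The caller must size both buffers for n * stride elements.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        size_t j = static_cast<size_t>(idx) * stride;
        out[j] = in[j];
    }
}
```

With 4-byte floats and stride = 32, the addresses touched by one warp are 128 bytes apart, so the warp's 32 accesses cannot share a memory transaction. Timing both kernels for the same n is a simple way to observe the gap.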

6) 带宽测试 kernel：它在"测什么"，又为什么容易误导

中文

float value = data[idx];
for (int i = 0; i < iterations; i++) value = value * 1.01f;
data[idx] = value;

这类 kernel 同时包含：

global memory 读/写（带宽因素）
浮点计算（计算吞吐因素）
iterations 增大后，可能从"带宽受限"变成"计算受限"
因此它更像是一个可调的 micro-benchmark：你要明确自己要测的是哪一边。

更严谨的带宽测试往往会：

降低计算（更接近纯 copy / load-store）
控制访存模式、对齐、数据类型
避免编译器把计算当作死代码删掉：结果必须写回内存或以其他方式被使用

English

This kernel mixes global memory traffic and floating-point computation. As iterations increases, it may shift from memory-bound to compute-bound, so you must be clear what you intend to measure. A more rigorous bandwidth benchmark minimizes computation, controls access patterns and alignment, and makes sure the result is stored or otherwise used so the compiler cannot remove the work as dead code.
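
For reference, the three-line fragment from this section wrapped into a complete kernel, together with the usual effective-bandwidth formula. The names `loadComputeStore` and `effectiveBandwidthGBps` are assumptions introduced for this sketch.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// With iterations == 0 this degenerates to one 4-byte load and one
// 4-byte store per element, which is the memory-bound end of the dial.
__global__ void loadComputeStore(float* data, int n, int iterations) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float value = data[idx];
    for (int i = 0; i < iterations; i++) value = value * 1.01f;
    data[idx] = value;   // the store keeps the loop from being removed as dead code
}

// Effective bandwidth in GB/s: total bytes moved divided by elapsed seconds.
inline double effectiveBandwidthGBps(size_t bytesRead, size_t bytesWritten, float elapsedMs) {
    double seconds = elapsedMs / 1.0e3;
    return static_cast<double>(bytesRead + bytesWritten) / seconds / 1.0e9;
}
```

For n floats, both byte counts are `n * sizeof(float)`. Comparing the result against the device's theoretical peak gives a rough sense of how memory-bound the run was.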

7) CUDA Event 计时：测 kernel 时间的标准姿势与常见坑

中文

cudaEventRecord(start) / cudaEventRecord(stop) 记录 GPU 时间戳
cudaEventSynchronize(stop) 确保 GPU 执行完成
cudaEventElapsedTime 得到毫秒

常见坑 ⚠️：

忘了同步：stop 事件尚未完成时调用 cudaEventElapsedTime 会返回 cudaErrorNotReady，拿不到有效时间
首次运行有 warm-up 开销：第一次包含 JIT、缓存、页面等开销，建议跑多次取统计量
异步语义：kernel launch 本身是异步的，事件是解决这个问题的方式之一

English

CUDA events provide standard GPU-side timing. Common pitfalls:

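Missing synchronization: calling cudaEventElapsedTime before the stop event has completed returns cudaErrorNotReady instead of a valid time.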
First-run overhead (warm-up required)
Kernel launches are asynchronous; events help measure actual device execution time.
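
The timing pattern, including a warm-up launch and averaging over several runs, might look like the sketch below. The kernel `touch` and the helper `timeKernelMs` are hypothetical, and everything runs on the default stream.

```cuda
#include <cuda_runtime.h>

__global__ void touch(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// Returns the mean time per launch in milliseconds, or a negative value on error.
float timeKernelMs(float* dData, int n, int repeats) {
    int block = 256;
    int grid = (n + block - 1) / block;
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) { cudaEventDestroy(start); return -1.0f; }

    touch<<<grid, block>>>(dData, n);   // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);             // enqueued in stream order, before the kernels
    for (int r = 0; r < repeats; ++r) touch<<<grid, block>>>(dData, n);
    cudaEventRecord(stop);              // enqueued after the last kernel
    cudaEventSynchronize(stop);         // block the host until `stop` has completed

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms / repeats : -1.0f;
}
```

Averaging over `repeats` smooths out launch jitter. For very short kernels the per-launch overhead is part of what gets measured.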

8) Occupancy（占用率）：它是什么、能解决什么、不能解决什么

中文

Occupancy 通常指：每个 SM 上活跃 warp 数占理论最大 warp 数的比例（或等价指标）。

cudaOccupancyMaxPotentialBlockSize 做的是：

根据 kernel 的资源需求（寄存器/共享内存等）和硬件限制
给出一个"理论上可能较优"的 block size 建议

但要强调两点：

Occupancy 不是越高越好：
- 如果你已经被带宽限制，occupancy 再高也不会快
它只解决"并行度/隐藏延迟"的一部分问题：
- 真性能还得看：内存合并、缓存命中、指令效率、分支发散、访存与计算的重叠等

English

Occupancy is the ratio of active warps per SM to the maximum possible. The occupancy API suggests a block size based on resource constraints, but:

Higher occupancy is not always better (e.g., memory-bound kernels).
True performance depends on coalescing, cache, instruction efficiency, divergence, and overlap.
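
A sketch of querying the occupancy API for one kernel. The kernel `saxpy`, the struct `LaunchSuggestion`, and the helper `suggestLaunch` are names assumed for illustration. The returned block size is a starting point for tuning, not a guaranteed optimum.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) y[idx] = a * x[idx] + y[idx];
}

struct LaunchSuggestion {
    int blockSize;          // block size that maximizes theoretical occupancy
    int minGridSize;        // smallest grid that can fill the device at that block size
    int activeBlocksPerSM;  // resident blocks per SM at that block size
};

cudaError_t suggestLaunch(LaunchSuggestion* out) {
    // No dynamic shared memory (0) and no upper limit on block size (0).
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &out->minGridSize, &out->blockSize, saxpy, 0, 0);
    if (err != cudaSuccess) return err;
    return cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &out->activeBlocksPerSM, saxpy, out->blockSize, 0);
}
```

Theoretical occupancy then follows as `activeBlocksPerSM * blockSize` divided by the device's maximum resident threads per SM (`maxThreadsPerMultiProcessor` in `cudaDeviceProp`).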

9) Nsight Compute：你该看哪些指标（从"会用"到"会分析"）

中文

ncu --set full ... 会生成报告。实际分析时你不需要一开始就看"全部"，优先抓住主线：

建议你按这条链路看（非常通用）✅：

Kernel 是否 memory-bound 还是 compute-bound
- 看吞吐、stall、roofline（如果你熟悉的话）
Memory 访问是否合并
- global load/store efficiency、transactions、L2 hit rate
分支发散与 warp 执行效率
- branch efficiency、warp execution efficiency
占用率与限制项
- achieved occupancy、寄存器、shared memory 限制
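共享内存 bank conflict（如有）
- shared load/store replay/conflict 类指标
注：上面的 efficiency 类名称沿用了旧工具 nvprof 的叫法；在 Nsight Compute 里，对应信息分布在 GPU Speed Of Light、Memory Workload Analysis、Warp State Statistics、Occupancy、Source Counters 等 section 中。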

English

With Nsight Compute, don't start by reading everything. Follow a pipeline:

Determine memory-bound vs compute-bound
Check coalescing and cache behavior
Inspect divergence / warp efficiency
Review achieved occupancy and limiting resources
Diagnose shared memory bank conflicts if relevant

10) 常见问题：本质原因与工程解法

10.1 Bank Conflict（共享内存银行冲突）

中文

共享内存被分成多个 bank（当前架构上为 32 个 bank，每个 bank 宽 4 字节）。同一个 warp 若多个线程访问落在同一个 bank 的不同地址，会产生冲突并序列化；多个线程读同一地址则会广播，不算冲突（具体规则与架构有关，但你只需抓住"同 warp、同 bank、多地址"这个核心）。

工程解法：

padding（填充）：让二维数组的行跨度避开冲突（比如转置里常见 tile[BLOCK][BLOCK+1]）
调整访问模式，让同 warp 访问分布到不同 bank

English

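Shared memory is banked (32 banks of 4-byte words on current architectures). If multiple threads in a warp hit the same bank with different addresses, the accesses serialize; reads of the same address are broadcast and do not conflict. Typical fixes include padding (e.g., TILE_DIM+1) and reworking access patterns.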
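
The padding trick in context: a tiled transpose sketch, assuming 32 banks of 4-byte words and a 32x32 thread block. The kernel name `transposeTiled` is hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int TILE_DIM = 32;

// Transposes a row-major (height x width) matrix `in` into a (width x height) matrix `out`.
// Launch with block = (TILE_DIM, TILE_DIM) and
// grid = (ceil(width / TILE_DIM), ceil(height / TILE_DIM)).
__global__ void transposeTiled(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column of padding

    int col = blockIdx.x * TILE_DIM + threadIdx.x;
    int row = blockIdx.y * TILE_DIM + threadIdx.y;
    if (row < height && col < width)
        tile[threadIdx.y][threadIdx.x] = in[row * width + col];   // coalesced read
    __syncthreads();

    // Swap the block coordinates so the global write is also contiguous in threadIdx.x.
    int outCol = blockIdx.y * TILE_DIM + threadIdx.x;   // column in `out`, range [0, height)
    int outRow = blockIdx.x * TILE_DIM + threadIdx.y;   // row in `out`, range [0, width)
    if (outRow < width && outCol < height)
        out[outRow * height + outCol] = tile[threadIdx.x][threadIdx.y];   // column-wise tile read
}
```

In the final read, the threads of a warp vary `threadIdx.x`, so their tile addresses are 33 words apart and land in 32 distinct banks. With an unpadded `tile[32][32]` they would be 32 words apart and all fall into the same bank.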

10.2 Branch Divergence（分支发散）

中文

同一个 warp 内线程走不同分支 → warp 需要按分支路径分段执行 → 有效并行度下降。

解法：

重构条件逻辑（让分支条件以 warp 为粒度对齐，使同一 warp 内的线程走同一路径；或用数据布局减少分歧）
有时可以用算术表达式替代分支（trade-off：更多算术 vs 更少 divergence）

English

Warp divergence occurs when threads in a warp take different branches, forcing serialized execution. Fix by restructuring control flow, improving data layout, or using branchless arithmetic when appropriate.
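
A small sketch of the branchless alternative, using a clamp as the example. Both kernel names are hypothetical, and the two versions are assumed to be compared only for non-NaN inputs with lo <= hi. Compilers often turn short branches like this into predicated instructions on their own, so the benefit should be confirmed with a profiler.

```cuda
#include <cuda_runtime.h>

// Branchy: lanes of one warp may take different sides depending on the data.
__global__ void clampBranchy(float* x, int n, float lo, float hi) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float v = x[idx];
    if (v < lo) {
        v = lo;
    } else if (v > hi) {
        v = hi;
    }
    x[idx] = v;
}

// Branchless: every lane executes the same min/max instructions.
__global__ void clampBranchless(float* x, int n, float lo, float hi) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    x[idx] = fminf(fmaxf(x[idx], lo), hi);
}
```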

10.3 Occupancy 低

中文

常见原因：

寄存器用量过高（每线程寄存器多 → 同时驻留的 warps 下降）
shared memory 用量过高（每 block shared 多 → 同时驻留 block 数下降）
block 维度不合理导致并发不足

解法：

控制寄存器（算法/编译选项如 -maxrregcount、__launch_bounds__/减少临时变量）
减少 shared footprint 或重新 tiling
调整 block size 并结合 ncu 看限制项

English

Low occupancy is often caused by high register usage, heavy shared memory usage, or a poorly chosen block size. Fix by reducing register or shared memory usage, or by tuning the block size guided by profiling.
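
One way to act on register pressure is the `__launch_bounds__` qualifier, combined with a query of what the compiled kernel actually uses. This is a sketch: the kernel `boundedKernel`, the helper `queryResources`, and the bounds (256 threads, 2 blocks per SM) are arbitrary choices for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Promise at most 256 threads per block and ask for at least 2 resident
// blocks per SM; the compiler may limit registers per thread to honour this.
__global__ void __launch_bounds__(256, 2) boundedKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] * 2.0f + 1.0f;
}

// Reports the register and static shared memory usage of the compiled kernel.
cudaError_t queryResources(int* regsPerThread, size_t* staticSharedBytes) {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, boundedKernel);
    if (err == cudaSuccess) {
        *regsPerThread = attr.numRegs;
        *staticSharedBytes = attr.sharedSizeBytes;
    }
    return err;
}
```

Forcing the register count too low can cause spills to local memory, which may cost more than the occupancy gained, so the effect is worth checking in a profiler.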

11) 进阶主题：每个点到底"进阶在哪"

中文 + English（对照）

warp 级原语（warp primitives）：
用 warp shuffle、投票等在 warp 内交换数据，减少 __syncthreads() 和 shared memory。
Use warp shuffles/votes to exchange data within a warp, reducing shared memory and barriers.

动态并行（dynamic parallelism）：
kernel 内部再 launch kernel，适合递归/自适应任务，但调度开销与复杂度更高。
Launching kernels from kernels; useful but adds overhead and complexity.

统一内存（Unified Memory）：
更易用，但性能依赖迁移策略与访问模式；高性能场景仍常手动管理。
Easier programming model, but performance depends on migration/prefetch; manual management is still common for peak performance.

多流并发（Streams）：
用 stream 实现 copy/compute overlap、并发 kernel；关键是依赖关系与资源竞争。
Overlap transfers and compute; manage dependencies and resource contention carefully.
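
Of these topics, stream overlap is the easiest to show in a few lines. The sketch below splits one buffer into chunks and gives each chunk its own stream. It assumes the host buffer is pinned (allocated with cudaMallocHost) and that n is divisible by the stream count; `addOneChunk` and `processInChunks` are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void addOneChunk(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// Copies in, processes, and copies out `n` floats in `nStreams` independent chunks.
// hPinned must be pinned host memory for the copies to be truly asynchronous.
cudaError_t processInChunks(float* hPinned, float* dBuf, int n, int nStreams) {
    const int MAX_STREAMS = 8;
    if (nStreams < 1 || nStreams > MAX_STREAMS || n % nStreams != 0)
        return cudaErrorInvalidValue;

    cudaStream_t streams[MAX_STREAMS];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n / nStreams;
    size_t bytes = chunk * sizeof(float);
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        // Work inside one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(dBuf + off, hPinned + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        addOneChunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dBuf + off, chunk);
        cudaMemcpyAsync(hPinned + off, dBuf + off, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaError_t err = cudaDeviceSynchronize();   // wait for every stream
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    return err;
}
```

Whether the copies and kernels actually overlap depends on the device's copy engines and on how much of the GPU each kernel occupies.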

12) 你的四个练习：每题"理论要点"应该训练什么

练习1：向量加法优化

理论要点：合并访问、grid/block 配置、事件计时、带宽瓶颈识别
Key ideas: coalescing, launch config, event timing, bandwidth bound analysis.

练习2：矩阵转置（shared memory + 避免 bank conflict）

理论要点：tiling、shared memory、__syncthreads()、padding 避免冲突
Key ideas: tiling, shared memory, barriers, padding to avoid bank conflicts.

练习3：归约求和（reduction）

理论要点：shared reduction、warp-level 优化（减少同步/分支）
Key ideas: shared reduction patterns, warp-level primitives, minimizing sync/divergence.
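
One possible shape for this exercise is a block-level sum that combines warp shuffles with a single shared-memory stage. This is a sketch, not a tuned implementation. It assumes blockDim.x is a multiple of 32 and at most 1024, and the names `warpReduceSum` and `blockSum` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Sum across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warpReduceSum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Each block writes one partial sum to out[blockIdx.x].
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float warpSums[32];             // one slot per warp (at most 32 warps per block)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;

    float v = (idx < n) ? in[idx] : 0.0f;      // out-of-range threads contribute 0
    v = warpReduceSum(v);                      // stage 1: no shared memory, no barrier
    if (lane == 0) warpSums[warp] = v;
    __syncthreads();                           // the only block-wide barrier

    if (warp == 0) {                           // stage 2: first warp reduces the warp totals
        int numWarps = blockDim.x / warpSize;
        v = (lane < numWarps) ? warpSums[lane] : 0.0f;
        v = warpReduceSum(v);
        if (lane == 0) out[blockIdx.x] = v;
    }
}
```

The per-block partial sums in `out` still need a final pass, for example a second launch of the same kernel or a short host-side loop.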

练习4：带宽测试（不同访问模式）

理论要点：stride 对 coalescing 的破坏、缓存行为、benchmark 设计
Key ideas: stride hurts coalescing, cache effects, benchmark methodology.