Efficient CUDA Parallel Algorithms with the Thrust Library

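Thrust is a template-based C++ library that provides parallel counterparts to the algorithms and data structures of the Standard Template Library (STL). It lets developers write efficient GPU-parallel code in concise C++ syntax.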

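1. Core Advantages of Thrust

1.1 Abstraction and Ease of Use

STL-like: Thrust's API closely mirrors the C++ STL (for example std::sort, std::vector, std::for_each), which flattens the CUDA learning curve considerably.
Single-source programming: developers do not have to write kernels by hand or manage thread blocks and thread indices; Thrust handles the low-level parallel details.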

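1.2 High Performance and Portability

Automatic optimization: Thrust ships with a large number of CUDA kernels tuned by NVIDIA engineers. Its algorithms dispatch to implementations suited to the GPU architecture and the problem size, building on primitives such as segmented reduction and parallel prefix sum.
Switchable backends: Thrust supports several execution backends:
- CUDA (default): runs in parallel on the GPU.
- TBB/OpenMP: runs in parallel on a multi-core CPU.
- Sequential: runs serially on the CPU.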

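2. Core Components of Thrust

2.1 Vectors (`thrust::device_vector`)

thrust::device_vector is Thrust's primary container for data stored in GPU device memory. It is the device-side counterpart of the host-side std::vector.

Function: it manages GPU memory allocation, deallocation, and host-device data transfer automatically.
Usage: it can be used much like an ordinary std::vector, with no manual calls to cudaMalloc or cudaFree.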

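```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Create and initialize a vector on the host (CPU)
thrust::host_vector<int> h_data(N);
// ... initialize h_data ...

// Assignment copies the data from host to device
thrust::device_vector<int> d_data = h_data;

// Assignment in the other direction copies the data back to the host
h_data = d_data;
```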

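2.2 Iterators

Thrust algorithms operate on iterators, which lets them work with a variety of containers, including device_vector and host_vector, as well as with iterators built from raw pointers.

Raw pointers: thrust::device_ptr wraps a raw CUDA pointer (such as one returned by cudaMalloc) so that it can be used as a Thrust iterator.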

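3. Core Parallel Algorithm Examples

Thrust provides a rich set of efficient parallel algorithms that cover common data-processing needs.

3.1 Transformation (`thrust::transform`)

Applies a function to every element of a container (comparable to an element-wise kernel in CUDA).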

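```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

// d_A: input vector, d_B: output vector
thrust::device_vector<float> d_A(N), d_B(N);

// Negate every element of d_A into d_B with the thrust::negate function object
thrust::transform(d_A.begin(), d_A.end(), d_B.begin(),
                  thrust::negate<float>());

// Square every element of d_A into d_B with a device lambda
// (C++11 or later; compile with nvcc --extended-lambda)
thrust::transform(d_A.begin(), d_A.end(), d_B.begin(),
                  [] __device__ (float x) { // __device__ is required
                      return x * x;
                  });
```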

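3.2 Reduction (`thrust::reduce`)

Combines all elements of a container into a single value (for example a sum or a maximum); it is one of the fundamental operations in parallel computing.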

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Sum all elements of d_data
thrust::device_vector<int> d_data(N);
// ... fill d_data ...

int sum = thrust::reduce(d_data.begin(), d_data.end());
```

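3.3 Scan / Prefix Sum (`thrust::inclusive_scan`, `thrust::exclusive_scan`)

A parallel scan computes a running total for every position in an array. It is the building block of many more complex parallel algorithms, such as stream compaction and radix sort.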

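```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Exclusive scan: each output is the sum of all preceding inputs
// Input:  [1, 2, 3, 4]
// Output: [0, 1, 3, 6]
thrust::device_vector<int> d_in(4), d_out(4);
// ... fill d_in with 1, 2, 3, 4 ...
thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin());
```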

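3.4 Sorting (`thrust::sort`)

A parallel sort. The implementation is typically a parallel radix sort or a parallel merge sort; on the CUDA backend, radix sort is generally used for primitive key types with the default comparison, and merge sort otherwise.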

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Sort the elements of d_data in ascending order
thrust::device_vector<int> d_data(N);
// ... fill d_data ...

thrust::sort(d_data.begin(), d_data.end());
```

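3.5 Filtering / Compaction (`thrust::copy_if`)

Copies the elements of a source container that satisfy a predicate into a destination container, in parallel.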

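```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Goal: copy every element of d_A greater than 10 into d_B
thrust::device_vector<int> d_A(N), d_B(N);
// ... fill d_A ...

// Define the predicate (device lambda; compile with nvcc --extended-lambda)
auto is_greater_than_10 = [] __device__ (int x) { return x > 10; };

// Run the parallel compaction; the return value marks the end of the output
auto d_B_end = thrust::copy_if(d_A.begin(), d_A.end(),
                               d_B.begin(),
                               is_greater_than_10);
d_B.resize(d_B_end - d_B.begin());  // keep only the copied elements
```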

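4. Summary and Best Practices

Thrust is a core tool in the CUDA developer's toolbox.

Prefer Thrust: when implementing a parallel algorithm, first check whether Thrust already provides it (for example transform, reduce, sort). If it does, the Thrust implementation is usually faster and more reliable than a hand-written kernel.
Combine with kernels: for custom operations that have no direct Thrust equivalent, Thrust and hand-written CUDA kernels can be combined:
1. Use the Thrust container device_vector to manage the data.
2. Pass the underlying raw pointer to the custom kernel, obtained with thrust::raw_pointer_cast(d_vec.data()). On their own, .data() and .begin() return a thrust::device_ptr and an iterator, not a raw pointer.
3. Run the custom logic in the kernel.
4. Return to Thrust for efficient post-processing (such as a reduction or a sort).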

With Thrust, developers can focus on higher-level parallel algorithm design and leave low-level GPU optimization to NVIDIA's heavily tuned code.
