CUDA Programming Notes (33): Using the Thrust Library

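Thrust is a parallel algorithms library modeled on the C++ STL (Standard Template Library):

It provides containers and algorithms analogous to std::vector, std::sort, and std::reduce;
The same code can target the GPU (CUDA) or the CPU (OpenMP/TBB) by switching backends;
Its containers manage device and host memory automatically;
It supports iterators, functors, and template programming.

In essence it is a parallel STL: template abstractions give it high-performance algorithms that do not depend on the device they run on.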

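Containers

<thrust/device_vector.h>: GPU vector container (manages CUDA memory automatically)

<thrust/host_vector.h>: CPU vector container

1️⃣ `thrust::host_vector`

Very similar to std::vector<T>. The difference is that a host_vector can exchange data directly with GPU containers: assigning it to or from a device_vector performs the host/device copy.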

#include <thrust/host_vector.h>
#include <iostream>

int main() {
    thrust::host_vector<int> h_vec(5);  // create a host vector with 5 elements
    for (int i = 0; i < 5; i++) h_vec[i] = i;
    for (int x : h_vec) std::cout << x << " ";
}

2️⃣ `thrust::device_vector`

The GPU counterpart of std::vector.

It calls cudaMalloc / cudaFree automatically to manage device memory.

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

int main() {
    thrust::host_vector<int> h_vec(5);
    for (int i = 0; i < 5; i++) h_vec[i] = i;

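    // Copy to the GPU
    thrust::device_vector<int> d_vec = h_vec;

    // Modify in place on the GPU (each element access from host code
    // issues its own small host/device transfer, so avoid it in loops)
    d_vec[0] = 100;

    // Copy back to the CPU
    h_vec = d_vec;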
}

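It encapsulates device memory;
Copy construction or assignment transfers data between GPU and CPU automatically;
It can be passed directly to Thrust algorithms (such as thrust::sort and thrust::reduce).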
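
To make the three points above concrete, here is a minimal sketch that round-trips data through a device_vector and runs two algorithms on it. The function name `sorted_sum` is made up for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Hypothetical helper: sorts a copy of h_in on the GPU, stores the sorted
// data in h_sorted, and returns the sum of the elements.
int sorted_sum(const thrust::host_vector<int>& h_in,
               thrust::host_vector<int>& h_sorted) {
    thrust::device_vector<int> d = h_in;            // host -> device copy
    thrust::sort(d.begin(), d.end());               // runs on the GPU
    int sum = thrust::reduce(d.begin(), d.end());   // runs on the GPU, result returned to the host
    h_sorted = d;                                   // device -> host copy
    return sum;
}
```

No cudaMalloc, cudaMemcpy, or cudaFree appears in the code: the containers take care of them.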

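Question: how to do an asynchronous copy

A plain assignment is a synchronous copy. What if we want an asynchronous one?

For background on asynchronous copies, see CUDA Programming Notes (6): Streams (CSDN blog).

That raises two questions: how to get a host_vector's memory allocated with cudaHostAlloc, and how to use cudaMemcpyAsync once the data is wrapped by Thrust. (Grabbing the raw pointer with .data() would work, but it defeats the purpose of the abstraction.)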

Custom allocator

Every Thrust container accepts an allocator as a template parameter.

We can write a custom allocator that uses cudaHostAlloc() and cudaFreeHost().

// pinned_allocator.cuh
#pragma once
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>   // std::size_t
#include <iostream>
#include <new>       // std::bad_alloc

// ---------------- Custom pinned allocator ----------------
template <typename T>
struct pinned_allocator {
    using value_type = T;

    pinned_allocator() = default;

    // Converting constructor required for allocator rebinding
    template <class U>
    constexpr pinned_allocator(const pinned_allocator<U>&) noexcept {}

    // Allocate: page-locked (pinned) host memory via cudaHostAlloc
    T* allocate(std::size_t n) {
        T* ptr = nullptr;
        cudaError_t err = cudaHostAlloc((void**)&ptr, n * sizeof(T), cudaHostAllocDefault);
        if (err != cudaSuccess) {
            throw std::bad_alloc();
        }
        return ptr;
    }

    // Release: cudaFreeHost
    void deallocate(T* ptr, std::size_t) noexcept {
        cudaFreeHost(ptr);
    }
};

// Comparison operators are required by the allocator requirements.
// The allocator is stateless, so any two instances compare equal.
template <class T, class U>
bool operator==(const pinned_allocator<T>&, const pinned_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const pinned_allocator<T>&, const pinned_allocator<U>&) { return false; }

The interface above is the standard boilerplate for a custom allocator; the only parts specific to this one are the calls to cudaHostAlloc and cudaFreeHost.

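// No-op under nvcc, which already defines __CUDACC__; presumably only here
// to keep IDE code analysis from flagging CUDA-specific declarations.
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <cuda_runtime.h>
#include "pinned_allocator.cuh"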

int main() {
    const int N = 1 << 20;

    // Use the custom allocator so the host buffer lives in pinned (page-locked) memory
    thrust::host_vector<float, pinned_allocator<float>> h_vec(N, 1.0f);
    thrust::device_vector<float> d_vec(N);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

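    // Issue the copy on `stream`. Pinned host memory is a precondition for a
    // truly asynchronous transfer. Depending on the Thrust version, the call
    // may still synchronize the stream before returning (newer versions offer
    // thrust::cuda::par_nosync to avoid that).
    thrust::copy(thrust::cuda::par.on(stream),
        h_vec.begin(), h_vec.end(),
        d_vec.begin());

    std::cout << "Copy issued on stream\n";

    // Other work can be done here
    // ...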

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

Execution policies

To specify an execution policy, first include this header:

#include <thrust/execution_policy.h>

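By default, Thrust runs every operation on the default stream (stream 0).

So if you want a Thrust operation to run in the same asynchronous stream as your own kernels (rather than on the default stream, which synchronizes with other blocking streams), you must specify the stream explicitly.

thrust::cuda::par is an execution policy object; it is the entry point for dispatching an algorithm to Thrust's CUDA backend.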

thrust::cuda::par.on(stream)

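Meaning:

thrust::cuda::par: tells Thrust to use the CUDA backend;
.on(stream): tells it to run on the given CUDA stream;
The result is an execution policy object that you can pass to Thrust algorithms.

Internally, the returned object is of a stream_attachment_type type (derived from execution_policy) that wraps the stream.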
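
As an illustration of sharing one stream between a hand-written kernel and a Thrust algorithm, here is a minimal sketch. The kernel `add_one` and the helper `increment_then_sum` are hypothetical names, and error checking is omitted for brevity.

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: runs the kernel and a Thrust reduction on the same
// stream, so the reduction is ordered after the kernel without extra syncs.
// Assumes d_vec is non-empty.
int increment_then_sum(thrust::device_vector<int>& d_vec) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int n = static_cast<int>(d_vec.size());
    int* raw = thrust::raw_pointer_cast(d_vec.data());
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(raw, n);

    // thrust::reduce returns its result to the host, so it waits for the
    // work queued on `stream` before returning.
    int sum = thrust::reduce(thrust::cuda::par.on(stream),
                             d_vec.begin(), d_vec.end(), 0);

    cudaStreamDestroy(stream);
    return sum;
}
```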

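There are other execution policies as well:

Execution policy	Meaning	Runs on
`thrust::seq`	Sequential execution in the calling thread	CPU
`thrust::host`	Host backend (sequential C++ by default; OpenMP or TBB if selected through THRUST_HOST_SYSTEM)	CPU
`thrust::device`	Device backend (CUDA by default)	GPU
`thrust::omp::par`	OpenMP parallel	CPU
`thrust::tbb::par`	Intel TBB parallel	CPU
`thrust::cuda::par`	CUDA backend	GPU
`thrust::cuda::par.on(stream)`	CUDA backend (on a given stream)	GPU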

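Thrust chooses the backend implementation at compile time, much like template-based polymorphism:

If you pass device_vector iterators → CUDA is used automatically;
If you pass host_vector iterators → the CPU is used automatically;
You can also pass thrust::seq or thrust::cuda::par explicitly to force a backend.

When wrapping a Thrust algorithm in a function of our own, we can rely on the same mechanism and let the caller choose the backend: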

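#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <utility>

// Forward whatever policy the caller supplies to thrust::sort
template <typename ExecPolicy, typename Iterator>
void my_sort(ExecPolicy&& policy, Iterator begin, Iterator end) {
    thrust::sort(std::forward<ExecPolicy>(policy), begin, end);
}

int main() {
    // Built from an iterator range; initializer-list construction
    // is not supported by older Thrust releases
    const int init[] = {5, 1, 4, 2};

    thrust::device_vector<int> d_vec(init, init + 4);
    my_sort(thrust::cuda::par, d_vec.begin(), d_vec.end());

    thrust::host_vector<int> h_vec(init, init + 4);
    my_sort(thrust::seq, h_vec.begin(), h_vec.end());
}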
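
The wrapper above passes the policy explicitly. The automatic case, where the iterator type alone selects the backend, can be sketched as follows; the function name `total` is made up for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>

// Hypothetical helper: no execution policy is passed, so Thrust picks the
// backend at compile time from the iterator type of Vector.
template <typename Vector>
int total(const Vector& v) {
    return thrust::reduce(v.begin(), v.end(), 0);
}
```

Calling `total` with a thrust::host_vector<int> runs the reduction on the CPU, and calling it with a thrust::device_vector<int> runs it on the GPU, with no change to the function body.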

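device_ptr

thrust::device_ptr is a key class template in Thrust; the GPU versions of almost all Thrust algorithms rely on it under the hood.

You can think of it as:

"a wrapper around a CUDA device pointer that can be used like an STL iterator".

The following covers it from three angles: motivation, how it works, and usage.

Why `device_ptr` is needed

With the native CUDA API we get a raw pointer (float* d_data), for example: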

float* d_data;
cudaMalloc(&d_data, 100 * sizeof(float));

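The problems are:

A CUDA device pointer cannot be dereferenced directly on the host;
Standard library (STL) algorithms such as std::sort and std::copy dereference their iterators on the host, so they cannot operate on device memory;
A CUDA device pointer has the same type as a host pointer (both are float*), so the compiler cannot tell GPU memory from CPU memory.

💡 So Thrust provides device_ptr<T>:

It is a type-safe wrapper that tells the compiler "this pointer refers to device memory" while behaving like an STL iterator.

Put simply: std::sort accepts either vector iterators or raw pointers.

Thrust algorithms, however, cannot infer where a raw pointer points: without an explicit execution policy, a raw pointer is treated as host memory. Device memory must therefore reach the algorithm as a device iterator or a wrapped pointer (or together with an explicit device policy such as thrust::device).

Note: it does not manage memory. It is only a lightweight, non-owning wrapper whose main purpose is to tell Thrust that the address is on the GPU.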

Basic definition and prototype

To use it, include this header:

#include <thrust/device_ptr.h>


namespace thrust {

// Simplified sketch of the interface, not the actual declaration
template<typename T>
class device_ptr {
public:
    using element_type = T;
    using value_type   = T;
    using pointer      = device_ptr<T>;
    using reference    = device_reference<T>;   // proxy type, not a plain T&
    using difference_type = ptrdiff_t;

    __host__ __device__
    device_ptr(T* ptr = nullptr);      // wrap a raw pointer

    __host__ __device__
    T* get() const;                    // get the underlying raw pointer

    __host__ __device__
    reference operator*() const;       // dereference (returns the proxy)
    __host__ __device__
    device_ptr operator+(ptrdiff_t n) const;
    __host__ __device__
    difference_type operator-(device_ptr const& other) const;

    // ... other comparison operators, ++, --, and so on
};
}

A simple example

#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <cuda_runtime.h>

int main() {
    int N = 10;
    int* d_data;
    cudaMalloc(&d_data, N * sizeof(int));

    // Wrap the raw pointer in a device_ptr
    thrust::device_ptr<int> dev_ptr(d_data);

    // Operate on it directly with a Thrust algorithm
    thrust::fill(dev_ptr, dev_ptr + N, 42);

    // Get the raw pointer back
    int* raw_ptr = dev_ptr.get();

    cudaFree(raw_ptr);
}

`device_pointer_cast`: wrapping a raw pointer

Thrust provides a helper function:

template<typename T>
thrust::device_ptr<T> device_pointer_cast(T* raw_ptr);


float* d_data;
cudaMalloc(&d_data, 100 * sizeof(float));

auto dev_ptr = thrust::device_pointer_cast(d_data);
thrust::fill(dev_ptr, dev_ptr + 100, 1.0f);

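`device_reference` (further detail)

Because a device pointer cannot be dereferenced directly on the host, device_ptr::operator*() does not actually return T& but a proxy type:

thrust::device_reference<T>

This type encapsulates reads and writes of a value that lives on the device (similar in spirit to std::reference_wrapper) and can be passed safely into algorithms.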
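
A minimal sketch of what the proxy does when used from host code; the function name `roundtrip_through_proxy` is made up for illustration. Each read or write through the proxy turns into a small host/device transfer.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>

// Hypothetical helper: writes a value to device memory through the proxy,
// reads it back, and returns the stored value plus one.
int roundtrip_through_proxy(int value) {
    thrust::device_vector<int> storage(1);        // owns one int on the device
    thrust::device_ptr<int> p = storage.data();   // non-owning wrapper

    *p = value;        // assignment to device_reference<int>: host -> device write
    int back = *p;     // conversion to int: device -> host read
    p[0] = back + 1;   // operator[] also yields a device_reference<int>
    return p[0];
}
```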

Converting a device_ptr back to a raw pointer

.get() works, but Thrust also provides a dedicated function for this.

The .data() member of thrust::device_vector<T> does not return a plain T*,

but a thrust::device_ptr<T> object.

thrust::device_vector<int> d_vec(100);
auto ptr = d_vec.data();     // ptr has type thrust::device_ptr<int>

However, CUDA kernel arguments must be raw pointers, so Thrust provides a safe conversion function:

T* thrust::raw_pointer_cast(thrust::device_ptr<T> ptr);

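What it does:

"Extracts the real raw device pointer wrapped inside a device_ptr."

If .get() exists, why have this function at all?

Besides device_ptr, raw_pointer_cast also accepts other pointer-like types:

plain raw pointers T* (returned unchanged)
thrust::universal_ptr (unified memory)
custom pointer adaptors built on thrust::pointer (advanced users)

raw_pointer_cast() handles all of these,

so template code does not need to care which pointer type it was given.

The .get() member function, by contrast, exists only on Thrust's pointer wrappers and cannot be called on a raw pointer.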
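
The point about template code can be sketched as follows. The kernel `scale` and the helper `scale_on_device` are hypothetical names; the same helper accepts either a device_ptr<float> or a raw float*.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: Pointer may be thrust::device_ptr<float> or float*.
// raw_pointer_cast works for both, whereas .get() would not compile for float*.
template <typename Pointer>
void scale_on_device(Pointer data, int n, float factor) {
    if (n <= 0) return;
    float* raw = thrust::raw_pointer_cast(data);
    scale<<<(n + 255) / 256, 256>>>(raw, n, factor);
    cudaDeviceSynchronize();
}
```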

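Generic algorithms

Thrust mirrors the C++ STL's algorithm family, but its algorithms can:

Run on the CPU (host), e.g. with thrust::seq or thrust::host;
Run on the GPU (device), e.g. with thrust::device or thrust::cuda::par;
And the calling code is almost identical in both cases.

You only need to change the execution policy (and keep the data in memory that the chosen backend can access), and Thrust dispatches to the CPU or GPU implementation.

Common Thrust algorithms by category, with examples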

Copy and assignment

Function	Purpose	Example
`thrust::copy`	Copy one range to another	`thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());`
`thrust::fill`	Fill a range with a given value	`thrust::fill(d_vec.begin(), d_vec.end(), 5);`
`thrust::sequence`	Generate a linear sequence (e.g. 0, 1, 2, ...)	`thrust::sequence(d_vec.begin(), d_vec.end(), 0);`
`thrust::transform`	Apply a function to each element	`thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), thrust::plus<int>());`
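
The four functions in this table can be combined in one short sketch. The function name `iota_plus_offset` is made up for illustration.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Hypothetical helper: computes c[i] = i + offset on the GPU and returns
// the result as a host_vector.
thrust::host_vector<int> iota_plus_offset(int n, int offset) {
    thrust::device_vector<int> a(n), b(n), c(n);

    thrust::sequence(a.begin(), a.end(), 0);     // a = 0, 1, 2, ...
    thrust::fill(b.begin(), b.end(), offset);    // b = offset, offset, ...
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<int>());      // c[i] = a[i] + b[i]

    thrust::host_vector<int> h(n);
    thrust::copy(c.begin(), c.end(), h.begin()); // device -> host
    return h;
}
```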

Reduction

Function	Purpose	Example
`thrust::reduce`	Sum by default; minimum, maximum, etc. with a custom binary operator	`auto sum = thrust::reduce(d_vec.begin(), d_vec.end());`
`thrust::transform_reduce`	Transform first, then reduce	`thrust::transform_reduce(d_vec.begin(), d_vec.end(), square(), 0, thrust::plus<int>());`
`thrust::inclusive_scan` / `exclusive_scan`	Prefix sum	`thrust::inclusive_scan(d_vec.begin(), d_vec.end(), result.begin());`
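
The table's transform_reduce example uses a `square` functor without defining it. A minimal sketch with one possible definition follows; `square`, `sum_of_squares`, and `prefix_sums` are illustrative names.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/transform_reduce.h>

// One possible definition of the unary functor used in the table.
struct square {
    __host__ __device__ int operator()(int x) const { return x * x; }
};

// Hypothetical helper: sum of x*x over all elements.
int sum_of_squares(const thrust::device_vector<int>& v) {
    return thrust::transform_reduce(v.begin(), v.end(), square(), 0,
                                    thrust::plus<int>());
}

// Hypothetical helper: out[i] = v[0] + ... + v[i].
thrust::device_vector<int> prefix_sums(const thrust::device_vector<int>& v) {
    thrust::device_vector<int> out(v.size());
    thrust::inclusive_scan(v.begin(), v.end(), out.begin());
    return out;
}
```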

Sorting and reordering

Function	Purpose	Example
`thrust::sort`	Sort a range	`thrust::sort(d_vec.begin(), d_vec.end());`
`thrust::sort_by_key`	Sort keys and reorder the matching values along with them	`thrust::sort_by_key(keys.begin(), keys.end(), values.begin());`
`thrust::unique`	Remove consecutive duplicate elements	`auto new_end = thrust::unique(d_vec.begin(), d_vec.end());`
`thrust::gather` / `thrust::scatter`	Reorder data according to an index map	`thrust::gather(index.begin(), index.end(), src.begin(), dst.begin());`
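
Because thrust::unique only removes adjacent duplicates, it is usually applied after a sort. A minimal sketch combining it with sort_by_key; the function name `sort_pairs_and_count_distinct` is made up for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// Hypothetical helper: sorts (key, value) pairs by key in place and returns
// the number of distinct keys.
int sort_pairs_and_count_distinct(thrust::device_vector<int>& keys,
                                  thrust::device_vector<char>& values) {
    // values are permuted in the same way as keys
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());

    // unique compacts the range in place, so work on a copy of the keys
    thrust::device_vector<int> tmp = keys;
    auto new_end = thrust::unique(tmp.begin(), tmp.end());
    return static_cast<int>(new_end - tmp.begin());
}
```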

Comparison and search

Function	Purpose	Example
`thrust::count`	Count the elements equal to a value (use `thrust::count_if` for a predicate)	`int n = thrust::count(d_vec.begin(), d_vec.end(), 5);`
`thrust::find`	Find a given element	`auto it = thrust::find(d_vec.begin(), d_vec.end(), 42);`
`thrust::min_element` / `max_element`	Find the smallest / largest element	`auto max_it = thrust::max_element(d_vec.begin(), d_vec.end());`
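
Note that find and max_element return iterators, not values. A minimal sketch of turning them into an index and a value; the struct `Stats` and the function `inspect` are illustrative names.

```cuda
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/find.h>

// Hypothetical result type for the sketch.
struct Stats {
    int count_of_target;  // how many elements equal target
    int first_index;      // index of the first match, or -1 if none
    int max_value;        // largest element
};

// Hypothetical helper; assumes v is non-empty.
Stats inspect(const thrust::device_vector<int>& v, int target) {
    Stats s;
    s.count_of_target =
        static_cast<int>(thrust::count(v.begin(), v.end(), target));

    auto it = thrust::find(v.begin(), v.end(), target);
    s.first_index = (it == v.end()) ? -1 : static_cast<int>(it - v.begin());

    auto max_it = thrust::max_element(v.begin(), v.end());
    s.max_value = *max_it;  // reads one element back from the device
    return s;
}
```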

An execution policy can be passed as the first argument of any of these algorithms, for example:

thrust::sort(thrust::device, d_vec.begin(), d_vec.end());     // sort on the GPU
thrust::sort(thrust::host, h_vec.begin(), h_vec.end());       // sort on the CPU
thrust::sort(thrust::seq,  h_vec.begin(), h_vec.end());       // sequential (single-threaded) sort
thrust::sort(thrust::cuda::par.on(stream), d_vec.begin(), d_vec.end()); // sort on the GPU, issued on `stream`

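Predicates

There are three main kinds of callable object that can be passed to a Thrust algorithm:

Function pointer

The traditional approach, but it is only reliable with host execution:

__device__ __host__ int square(int x) { return x * x; }

// h_vec / h_out are host_vectors: the address taken here is the host address
thrust::transform(h_vec.begin(), h_vec.end(), h_out.begin(), square);

Notes:

Taking the address of square in host code yields a host function address, which a kernel cannot call. Passing it to an algorithm that runs on the GPU therefore does not work, even with __device__ __host__; use a functor or a lambda for device execution.
It is also harder to read and maintain, and harder for the compiler to inline.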

Function objects (functors)

The recommended, idiomatic C++ approach, and more flexible:

struct Square {
    __host__ __device__  // callable on both CPU and GPU
    int operator()(int x) const { return x * x; }
};

thrust::transform(d_vec.begin(), d_vec.end(), d_out.begin(), Square());

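Characteristics:

Can carry state (for example, a functor constructed with parameters);
Easier for the compiler to optimize;
Can be inlined and called on the GPU.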
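
The "can carry state" point is easiest to see with a functor that stores a parameter. A minimal SAXPY sketch follows; the names `Saxpy` and `saxpy` are chosen for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Functor with state: the scalar a is stored in the object and copied to
// the GPU together with it.
struct Saxpy {
    float a;
    explicit Saxpy(float a_) : a(a_) {}

    __host__ __device__
    float operator()(float x, float y) const { return a * x + y; }
};

// Hypothetical helper: y[i] = a * x[i] + y[i], computed on the GPU.
void saxpy(float a, const thrust::device_vector<float>& x,
           thrust::device_vector<float>& y) {
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), Saxpy(a));
}
```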

Lambda expressions (C++11/14)

The newest and most flexible option:

thrust::transform(d_vec.begin(), d_vec.end(), d_out.begin(),
                  [] __host__ __device__ (int x) { return x * x; });

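The __host__ __device__ annotation must be written, otherwise the lambda cannot be used on the GPU;
A lambda can capture local variables (the captured values must be something that can be copied to the GPU).
Be careful with captures:
Capture by value [=] is comparatively safe;
Capture by reference [&] is generally unsafe on the GPU, because the reference points into host stack memory.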
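
A minimal sketch of a by-value capture; the function name `add_offset` is made up for illustration. It only compiles when nvcc is given the extended-lambda option.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Hypothetical helper: adds offset to every element on the GPU.
// Requires nvcc's extended-lambda option.
void add_offset(thrust::device_vector<int>& v, int offset) {
    // offset is captured by value, so a copy travels to the GPU with the lambda
    thrust::transform(v.begin(), v.end(), v.begin(),
                      [offset] __host__ __device__ (int x) { return x + offset; });
}
```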

Note that device lambdas require an nvcc compile option. The Visual Studio setup on Windows was covered earlier: https://blog.csdn.net/ouliten/article/details/149886418?spm=1001.2014.3001.5502#t3

With CMake it can be done like this:

# Add compile options to the CUDA target
target_compile_options(my_cuda_program
    PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--expt-extended-lambda>
)

Here my_cuda_program is the target name you chose in add_executable(my_cuda_program ...).

Built-in functor templates

These are declared in <thrust/functional.h>:

thrust::less<T>          // less than
thrust::greater<T>       // greater than
thrust::less_equal<T>    // less than or equal
thrust::greater_equal<T> // greater than or equal
thrust::equal_to<T>      // equal
thrust::not_equal_to<T>  // not equal
thrust::plus<T>          // +
thrust::minus<T>         // -
thrust::multiplies<T>    // *
thrust::divides<T>       // /
thrust::modulus<T>       // %
thrust::maximum<T>       // max
thrust::minimum<T>       // min
thrust::logical_and<T>   // &&
thrust::logical_or<T>    // ||
thrust::logical_not<T>   // !
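
Two of these functors in use, as a minimal sketch; the function names `sort_descending` and `max_value` are made up for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <limits>

// Hypothetical helper: thrust::greater as the comparison gives a descending sort.
void sort_descending(thrust::device_vector<int>& v) {
    thrust::sort(v.begin(), v.end(), thrust::greater<int>());
}

// Hypothetical helper: thrust::maximum as the binary operator turns reduce
// into a max-reduction. Returns the lowest int for an empty vector.
int max_value(const thrust::device_vector<int>& v) {
    return thrust::reduce(v.begin(), v.end(),
                          std::numeric_limits<int>::lowest(),
                          thrust::maximum<int>());
}
```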
