Parallel Programming in Practice: Pipelines in CUDA Programming

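1. Pipelines

Anyone who has studied computer science has heard the term pipeline many times. Instruction processing in a CPU is the classic example. A pipeline splits a task into several stages so that different stages can work in parallel on different pieces of work.

The key point is parallelism, and raising the degree of parallelism implicitly requires synchronization. For example, once the pipeline has been disrupted, how is the flow of new work set up again?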

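2. Pipelines in CUDA programming

CUDA also provides pipelines, but they are a software-level abstraction, not a hardware pipeline. In other words, a CUDA pipeline is a technique for overlapping asynchronous data movement with computation in order to hide memory latency. It can be compared with the hardware instruction pipeline of a CPU, but the two should not be treated as the same thing.

A CUDA pipeline is a mechanism for staging work and coordinating multi-buffer producer-consumer patterns. It is a high-level synchronization primitive, and using it well can improve GPU efficiency and performance.

Because it is a pipeline, it can be divided into several stages, and each stage can hold independent data. Producers and consumers synchronize their production and consumption stage by stage.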

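CUDA pipelines generally fall into two kinds:

Unified pipeline: every participating thread is both a producer and a consumer

Partitioned pipeline: each participating thread is either a producer or a consumer, and the two groups coordinate through shared state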
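As a minimal sketch of the difference, the kernel below builds a partitioned pipeline by passing a per-thread `cuda::pipeline_role` to `cuda::make_pipeline`; leaving out the third argument would give a unified pipeline instead. The names `partitioned_by_role` and `run_partitioned_by_role`, the single 128-thread block and the doubling step are illustrative assumptions, not something the article prescribes.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

#pragma nv_diag_suppress static_var_with_dynamic_init

namespace cg = cooperative_groups;

// Partitioned pipeline: the first half of the block produces, the second half consumes.
__global__ void partitioned_by_role(const float* in, float* out, unsigned n)
{
    extern __shared__ float tile[];                 // n floats, used as a single stage
    auto block = cg::this_thread_block();
    const unsigned rank = block.thread_rank();
    const unsigned half = block.size() / 2;
    const bool is_producer = rank < half;

    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(
        block, &state,
        is_producer ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer);

    if (is_producer) {
        pipe.producer_acquire();
        for (unsigned i = rank; i < n; i += half)   // each producer copies a strided subset
            cuda::memcpy_async(&tile[i], &in[i], sizeof(float), pipe);
        pipe.producer_commit();
    } else {
        pipe.consumer_wait();                       // blocks until every producer has committed
        for (unsigned i = rank - half; i < n; i += block.size() - half)
            out[i] = tile[i] * 2.0f;
        pipe.consumer_release();
    }
}

// Hypothetical host helper: returns true when every output equals 2 * input.
bool run_partitioned_by_role()
{
    const unsigned n = 512, threads = 128;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (unsigned i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    partitioned_by_role<<<1, threads, bytes>>>(d_in, d_out, n);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    for (unsigned i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0f * h_in[i]);
    return ok;
}
```

The roles are chosen per half-block so that every warp is entirely producer or entirely consumer, which keeps the acquire/commit and wait/release calls converged.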

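From the thread's point of view, pipelines can also be divided into:

Thread-local pipeline: has the lowest overhead, but cannot be partitioned

Shared pipeline: supports more complex operations, but needs barriers (kept in shared state) for synchronization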
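A thread-local pipeline can be sketched as follows: `cuda::make_pipeline()` called without arguments returns a `cuda::pipeline<cuda::thread_scope_thread>` that needs no shared state. The kernel name `scale_thread_local`, the helper `run_thread_local` and the one-float-per-thread layout are assumptions made for the illustration.

```cuda
#include <cuda/pipeline>
#include <cuda_runtime.h>
#include <vector>

// Each thread owns a private pipeline and stages one float through shared memory.
__global__ void scale_thread_local(const float* in, float* out, unsigned n)
{
    extern __shared__ float tile[];                 // one float per thread
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Thread-scope pipeline: no pipeline_shared_state and no producer/consumer split.
    cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

    if (gid < n) {
        pipe.producer_acquire();
        cuda::memcpy_async(&tile[threadIdx.x], &in[gid], sizeof(float), pipe);
        pipe.producer_commit();

        pipe.consumer_wait();                       // this thread's own copy has completed
        out[gid] = tile[threadIdx.x] * 2.0f;
        pipe.consumer_release();
    }
}

// Hypothetical host helper: returns true when every output equals 2 * input.
bool run_thread_local()
{
    const unsigned n = 256, threads = 128;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (unsigned i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_thread_local<<<n / threads, threads, threads * sizeof(float)>>>(d_in, d_out, n);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    for (unsigned i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0f * h_in[i]);
    return ok;
}
```

Because each thread only ever reads the element it copied itself, no block-wide synchronization is needed in this sketch; a kernel in which threads read each other's data would need one after the wait.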

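3. Analysis

As the earlier figure showed, branching inside a warp, that is, whether its threads are converged or diverged, has to be handled. The same applies to CUDA pipelines, because branching can leave the synchronization state in an unexpected condition. A CUDA pipeline is a lightweight synchronization abstraction; if a warp is heavily diverged, the operations it commits are spread over many different stages, which is very likely to cause over-waiting (known as warp entanglement). This is unfriendly to parallel programs and should be avoided as far as possible.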
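One way to keep commits converged, sketched here under the assumption of a thread-scope pipeline, is to reconverge the warp with `__syncwarp()` after a divergent branch and before `producer_commit()`, as the CUDA programming guide recommends. The kernel `select_converged`, the helper `run_select_converged` and the even/odd selection rule are made up for the illustration.

```cuda
#include <cuda/pipeline>
#include <cuda_runtime.h>
#include <vector>

// Even threads fetch from a, odd threads fetch from b; the commit happens after reconvergence.
__global__ void select_converged(const float* a, const float* b, float* out, unsigned n)
{
    extern __shared__ float tile[];                 // one float per thread
    const unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;

    cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

    pipe.producer_acquire();
    if (gid < n) {
        if (threadIdx.x % 2 == 0)                   // divergent branch inside the warp
            cuda::memcpy_async(&tile[threadIdx.x], &a[gid], sizeof(float), pipe);
        else
            cuda::memcpy_async(&tile[threadIdx.x], &b[gid], sizeof(float), pipe);
    }
    __syncwarp();                                   // reconverge the warp before committing
    pipe.producer_commit();                         // every thread of the warp commits together

    pipe.consumer_wait();
    if (gid < n) out[gid] = tile[threadIdx.x];
    pipe.consumer_release();
}

// Hypothetical host helper: checks that even indices come from a and odd indices from b.
bool run_select_converged()
{
    const unsigned n = 256, threads = 128;          // even block size keeps index parity aligned
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_a(n), h_b(n), h_out(n, 0.0f);
    for (unsigned i = 0; i < n; ++i) { h_a[i] = float(i); h_b[i] = -float(i); }

    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return false; }
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return false; }
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    select_converged<<<n / threads, threads, threads * sizeof(float)>>>(d_a, d_b, d_out, n);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    for (unsigned i = 0; ok && i < n; ++i)
        ok = (h_out[i] == (i % 2 == 0 ? h_a[i] : h_b[i]));
    return ok;
}
```

The result is the same with or without the `__syncwarp()`; the call only affects how the warp's commits are grouped, and therefore how long threads may end up waiting.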

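If a thread taking part in a pipeline needs to exit early, CUDA provides an explicit way to quit, which keeps synchronization safe. This is similar to the asynchronous barrier discussed earlier. The pipeline exposes four main operations: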

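producer_acquire(): the producer acquires the next available pipeline stage, blocking if none is free
producer_commit(): the producer commits the stage and advances the pipeline head
consumer_wait(): the consumer waits for the oldest stage to complete
consumer_release(): the consumer releases the stage so that its resources can be reused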
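The four calls can be seen together in a minimal single-stage sketch with a unified, block-scope pipeline, where every thread acts as both producer and consumer. The kernel `scale_single_stage`, the helper `run_single_stage`, the tile sizes and the doubling step are assumptions chosen for the illustration.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

#pragma nv_diag_suppress static_var_with_dynamic_init

namespace cg = cooperative_groups;

// One shared tile, one pipeline stage: copy a tile, wait for it, process it, release it.
__global__ void scale_single_stage(const float* in, float* out, unsigned tile_len, unsigned num_tiles)
{
    extern __shared__ float tile[];                 // tile_len floats
    auto block = cg::this_thread_block();

    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(block, &state); // unified: no producer count or role given

    for (unsigned t = 0; t < num_tiles; ++t) {
        pipe.producer_acquire();                    // blocks until the stage is free
        cuda::memcpy_async(block, tile, in + t * tile_len, sizeof(float) * tile_len, pipe);
        pipe.producer_commit();                     // hands the stage over to the consumers

        pipe.consumer_wait();                       // the whole tile is now in shared memory
        for (unsigned i = block.thread_rank(); i < tile_len; i += block.size())
            out[t * tile_len + i] = tile[i] * 2.0f;
        pipe.consumer_release();                    // the stage may be overwritten again
    }
}

// Hypothetical host helper: returns true when every output equals 2 * input.
bool run_single_stage()
{
    const unsigned tile_len = 256, num_tiles = 4, threads = 128;
    const unsigned n = tile_len * num_tiles;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (unsigned i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    scale_single_stage<<<1, threads, tile_len * sizeof(float)>>>(d_in, d_out, tile_len, num_tiles);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    for (unsigned i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0f * h_in[i]);
    return ok;
}
```

With a single stage there is nothing to overlap yet: each copy must finish before the tile is processed. The sketch only shows the order of the calls.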

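4. Workflow for building an application

Using a pipeline in CUDA involves the following steps:

Initialization
Decide which kind of pipeline to use (unified, partitioned, and so on) and create it with cuda::make_pipeline
Producer commit
A producer thread calls pipeline.producer_acquire() to obtain the next available stage, submits asynchronous work with cuda::memcpy_async, and then calls pipeline.producer_commit() to advance the pipeline head to the next stage. This can be repeated for several stages, but the warp should be kept converged when committing
Consumer consumption
Call pipeline.consumer_wait() to wait for the oldest stage to complete, or cuda::pipeline_consumer_wait_prior<N>() (defined for thread-scope pipelines) to wait for everything except the last N committed stages. Process the data of the completed stage and then call consumer_release() to free its resources. Consumption can alternate with production or run in parallel with it, which improves efficiency
Loop for complex tasks
With a multi-stage pipeline, techniques such as double buffering let asynchronous copies overlap with computation (compare asynchronous I/O with completion ports)
Error handling and resource reclamation
CUDA pipelines support early exit: pipeline.quit() lets a thread withdraw from the pipeline's synchronization. A thread-local pipeline behaves rather like a local variable and is cleaned up automatically. For a shared pipeline, only the associated memory needs to be managed correctly; the rest is also handled automatically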
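The steps above can be sketched with a two-stage (double-buffered) unified pipeline, following the general shape of the staged examples in the CUDA programming guide: the inner loop keeps up to `kStages` copies in flight while the outer loop processes the oldest tile. The constant `kStages`, the kernel `scale_double_buffered`, the helper `run_double_buffered` and the sizes are illustrative assumptions.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

#pragma nv_diag_suppress static_var_with_dynamic_init

namespace cg = cooperative_groups;

constexpr unsigned kStages = 2;                     // double buffering

__global__ void scale_double_buffered(const float* in, float* out, unsigned tile_len, unsigned num_tiles)
{
    extern __shared__ float tiles[];                // kStages * tile_len floats
    auto block = cg::this_thread_block();

    // Step 1: initialization (unified, block scope, kStages stages).
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, kStages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    for (unsigned compute = 0, fetch = 0; compute < num_tiles; ++compute) {
        // Step 2: producer side, keep the pipeline full with prefetched tiles.
        for (; fetch < num_tiles && fetch < compute + kStages; ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, tiles + (fetch % kStages) * tile_len,
                               in + fetch * tile_len, sizeof(float) * tile_len, pipe);
            pipe.producer_commit();
        }

        // Step 3: consumer side, process the oldest tile while later copies are in flight.
        pipe.consumer_wait();
        const float* cur = tiles + (compute % kStages) * tile_len;
        for (unsigned i = block.thread_rank(); i < tile_len; i += block.size())
            out[compute * tile_len + i] = cur[i] * 2.0f;
        pipe.consumer_release();
    }
    // Step 5: the pipeline object is cleaned up when it goes out of scope.
}

// Hypothetical host helper: returns true when every output equals 2 * input.
bool run_double_buffered()
{
    const unsigned tile_len = 256, num_tiles = 8, threads = 128;
    const unsigned n = tile_len * num_tiles;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (unsigned i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const size_t smem = kStages * tile_len * sizeof(float);
    scale_double_buffered<<<1, threads, smem>>>(d_in, d_out, tile_len, num_tiles);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    for (unsigned i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0f * h_in[i]);
    return ok;
}
```

Raising `kStages` trades shared memory for more copies in flight; whether that helps depends on how long the compute step takes relative to the copy.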

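Generally speaking, following these steps is enough to implement producer-consumer processing with the pipeline synchronization mechanism.

5. Example

Here is a producer-consumer example: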


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // fabs, used when verifying the result in main

#include <cuda/pipeline>
#include <cooperative_groups.h>

#pragma nv_diag_suppress static_var_with_dynamic_init

using pipeline = cuda::pipeline<cuda::thread_scope_block>;

#define CUDA_CHECK(call)                                                   \
    do {                                                                    \
        cudaError_t err = call;                                             \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err));                               \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

__device__ void produce(pipeline& pipe, int num_stages, int stage, int num_batches, int batch, float* buffer, int buffer_len, float* in, int N)
{
    if (batch < num_batches)
    {
        auto block = cooperative_groups::this_thread_block();
        int producer_count = block.size() / 2; // partition: half producers
        int tid = block.thread_rank();         // 0..producer_count-1 for producers

        pipe.producer_acquire();

        float* buf_stage = buffer + stage * buffer_len;
        int in_offset = batch * buffer_len;

        // each producer thread copies a strided subset of the batch
        // (plain loads/stores here; cuda::memcpy_async(..., pipe) would make the copy asynchronous)
        for (int i = tid; i < buffer_len; i += producer_count)
        {
            buf_stage[i] = in[in_offset + i];
        }

        // commit the produced stage
        pipe.producer_commit();
    }
}

__device__ void consume(pipeline& pipe, int num_stages, int stage, int num_batches, int batch, float* buffer, int buffer_len, float* out, int N)
{
    if (batch < num_batches)
    {
        auto block = cooperative_groups::this_thread_block();
        int producer_count = block.size() / 2;
        int consumer_count = block.size() - producer_count;
        int tid = block.thread_rank() - producer_count; // consumer local rank: 0..consumer_count-1

        pipe.consumer_wait();

        float* buf_stage = buffer + stage * buffer_len;
        int out_offset = batch * buffer_len;

        // each consumer thread consumes a strided subset and writes result to global output
        for (int i = tid; i < buffer_len; i += consumer_count)
        {
            // example processing: multiply by 2.0
            out[out_offset + i] = buf_stage[i] * 2.0f;
        }

        pipe.consumer_release();
    }
}

__global__ void producer_consumer_pattern(float* in, float* out, int N, int buffer_len)
{
    auto block = cooperative_groups::this_thread_block();

    /* Shared memory buffer declared below is of size 2 * buffer_len
       so that we can alternatively work between two buffers.
       buffer_0 = buffer and buffer_1 = buffer + buffer_len */
    extern __shared__ float buffer[];

    const int num_batches = N / buffer_len;

    // Create a partitioned pipeline with 2 stages where half the threads are producers and the other half are consumers.
    constexpr auto scope = cuda::thread_scope_block;
    constexpr int num_stages = 2;
    cuda::std::size_t producer_count = block.size() / 2;
    __shared__ cuda::pipeline_shared_state<scope, num_stages> shared_state;
    pipeline pipe = cuda::make_pipeline(block, &shared_state, producer_count);

    // Fill the pipeline
    if (block.thread_rank() < producer_count)
    {
        for (int s = 0; s < num_stages; ++s)
        {
            produce(pipe, num_stages, s, num_batches, s, buffer, buffer_len, in, N);
        }
    }

    // Process the batches
    int stage = 0;
    for (size_t b = 0; b < num_batches; ++b)
    {
        if (block.thread_rank() < producer_count)
        {
            // Prefetch the next batch
            produce(pipe, num_stages, stage, num_batches, int(b) + num_stages, buffer, buffer_len, in, N);
        }
        else
        {
            // Consume the oldest batch
            consume(pipe, num_stages, stage, num_batches, int(b), buffer, buffer_len, out, N);
        }
        stage = (stage + 1) % num_stages;
    }
}

int main()
{
    // Parameters (N must be divisible by buffer_len)
    const int buffer_len = 1024;
    const int N = 1024 * 8; // 8 batches
    const int num_batches = N / buffer_len;

    const int threads_per_block = 128; // must be >= 2
    const int blocks = 1;

    size_t bytes = N * sizeof(float);

    float* h_in = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    if (!h_in || !h_out) {
        fprintf(stderr, "Host malloc failed\n");
        return EXIT_FAILURE;
    }

    // initialize input
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float* d_in = nullptr;
    float* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in, bytes));
    CUDA_CHECK(cudaMalloc(&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(d_out, 0, bytes));

    // dynamic shared memory: 2 * buffer_len floats
    size_t shared_mem_bytes = 2 * buffer_len * sizeof(float);

    producer_consumer_pattern<<<blocks, threads_per_block, shared_mem_bytes>>>(d_in, d_out, N, buffer_len);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));

    // verify: we expect out[i] == in[i] * 2.0f
    bool ok = true;
    for (int i = 0; i < N; ++i)
    {
        float expect = h_in[i] * 2.0f;
        if (fabs(h_out[i] - expect) > 1e-5f)
        {
            printf("Mismatch at %d: got %f expected %f\n", i, h_out[i], expect);
            ok = false;
            break;
        }
    }
    if (ok) printf("Result OK\n");

    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    free(h_in);
    free(h_out);

    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}

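The code is based on the official documentation. Note that running it requires CUDA 12.0 or later and a GPU with compute capability 7.0 or higher.

6. Summary

In parallel development, synchronization is essential and has a large impact on performance and efficiency. A pipeline is a lightweight synchronization mechanism, but if it is used badly, at best the task will not be processed as intended, and at worst the result is over-waiting or even deadlock. It is therefore important to understand how it works internally and to see the pipeline as a whole.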