Parallel Programming in Practice: Parallel Task Patterns in CUDA Programming

1. Operational Paradigms of Parallel Tasks

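Parallel task processing is one of CUDA's defining characteristics. Earlier articles in this series covered the basics of data access and threads, including how they relate and how to use them. In a parallel task, however, there is also a mapping model between the thread space and the data space, that is, an operational paradigm (pattern). The main patterns are Map, Gather, Scatter, Stencil, Transpose, Reduce, and Scan.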

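If the terms data space and thread space are unfamiliar, an ordinary programming example helps. A program can declare one-, two-, or three-dimensional (and higher) arrays, but physical memory is only a flat, one-dimensional address space.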

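Likewise, CUDA threads may be organized in 2D or 3D, while the actual input data is usually laid out in 1D (2D matrices and similar layouts also occur). The two spaces therefore have to be mapped onto each other so that every thread reaches its own data safely. See the thread-index calculations for Grid and Block in the earlier articles.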
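The mapping between the two spaces can be sketched on the host in plain C++. This is only an illustration under stated assumptions: `Idx2`, `globalCoord`, `flatIndex`, and `coverage` are hypothetical names, and the nested loops stand in for the threads that a real kernel launch would run in parallel.

```cpp
#include <vector>

// Hypothetical stand-in for CUDA's built-in index types (2D only).
struct Idx2 {
    int x;
    int y;
};

// Global 2D coordinate of a thread, mirroring the usual kernel expressions
//   x = blockIdx.x * blockDim.x + threadIdx.x
//   y = blockIdx.y * blockDim.y + threadIdx.y
Idx2 globalCoord(Idx2 blockIndex, Idx2 blockSize, Idx2 threadIndex) {
    return {blockIndex.x * blockSize.x + threadIndex.x,
            blockIndex.y * blockSize.y + threadIndex.y};
}

// Row-major flattening of a 2D coordinate into a 1D memory offset.
int flatIndex(Idx2 c, int width) {
    return c.y * width + c.x;
}

// Walks every thread of a grid that covers a width x height array and counts
// how many threads land on each element. Threads that fall outside the array
// are skipped, which is the job of the bounds check inside a kernel.
std::vector<int> coverage(int width, int height, Idx2 blockSize) {
    Idx2 gridSize = {(width + blockSize.x - 1) / blockSize.x,
                     (height + blockSize.y - 1) / blockSize.y};
    std::vector<int> hits(width * height, 0);
    for (int by = 0; by < gridSize.y; ++by)
        for (int bx = 0; bx < gridSize.x; ++bx)
            for (int ty = 0; ty < blockSize.y; ++ty)
                for (int tx = 0; tx < blockSize.x; ++tx) {
                    Idx2 c = globalCoord({bx, by}, blockSize, {tx, ty});
                    if (c.x < width && c.y < height) {
                        hits[flatIndex(c, width)] += 1;
                    }
                }
    return hits;
}
```

Calling `coverage(50, 30, {16, 16})` should report exactly one hit per element, even though 50 and 30 are not multiples of 16: the grid is rounded up and the bounds check discards the surplus threads.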

2. Common Patterns

The parallel patterns in CUDA, that is, the core data-movement and computation paradigms, are mainly the following:

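Map
The simplest and most basic parallel pattern: each output element is computed from exactly one input element. In CUDA, threads map one-to-one onto data elements, so the work is fully parallel and needs no inter-thread communication. It suits compute-intensive workloads with contiguous memory access.
Gather
Reads data from irregular, non-contiguous locations of an input array and writes it to a contiguous output. In CUDA, each thread reads one index, fetches (and optionally processes) the input element at that index, and writes the result to its own output position.
Scatter
The opposite of Gather: an index array directs where each input element is written (scattered) among irregular, non-contiguous positions of the output array. In CUDA, each thread reads the index that corresponds to its input element and writes to that position in the output. Note that writes can conflict, because several indices may be identical. Such conflicts have to be resolved, for example with atomic operations when colliding values are accumulated, or by guaranteeing that the indices are unique.
Transpose
Usually used for matrix transposition, and can be viewed as a special case of Scatter. The simplest form swaps rows and columns. There is a lot of room for memory optimization here, including staging tiles in shared memory and padding them to avoid bank conflicts.
Stencil
A special Gather with locality. Each output element is computed from a fixed pattern (the stencil) of neighboring input elements, which must be read first. In CUDA it typically involves data prefetching and shared memory.
Reduce
Easy to grasp for anyone with distributed-computing experience: it is a divide-and-conquer computation followed by aggregation. In CUDA, a binary operator combines a set of values into a single result, usually through a tree. Because the combining step is somewhat involved, there is also a lot of room for optimization, including loop unrolling and vectorized loads. Common variants include interleaved (neighbor-pair) reduction, tree reduction, and block-wise reduction.
Scan
A scan always works on multiple elements: the output at index n is computed by combining the input elements up to that index (including element n for an inclusive scan, excluding it for an exclusive scan). These are the two kinds, the inclusive prefix sum and the exclusive prefix sum. Its distinctive feature is a dependency chain that looks inherently serial. In CUDA it can be implemented with the Hillis-Steele algorithm (work-inefficient) or the Blelloch algorithm (work-efficient).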
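To make the Scan description more concrete, here is a host-side C++ sketch of the Hillis-Steele passes. It is an illustration, not a GPU implementation: the inner loop plays the role of the threads, the two buffers play the role of the barrier between passes, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum organized as Hillis-Steele passes.
// In each pass, "thread" i adds the element `offset` positions to its left,
// and `offset` doubles from pass to pass (1, 2, 4, ...).
// A pass reads only from `cur` and writes only to `next`, so every read sees
// values from the previous pass, as a barrier would guarantee on the GPU.
std::vector<int> hillisSteeleInclusive(const std::vector<int>& input) {
    std::vector<int> cur = input;
    std::vector<int> next(input.size());
    const int n = static_cast<int>(input.size());
    for (int offset = 1; offset < n; offset *= 2) {
        for (int i = 0; i < n; ++i) {  // one iteration per simulated thread
            next[i] = (i >= offset) ? cur[i] + cur[i - offset] : cur[i];
        }
        cur.swap(next);
    }
    return cur;
}

// Exclusive scan derived from the inclusive one: shift right, insert 0.
std::vector<int> exclusiveFromInclusive(const std::vector<int>& inclusive) {
    std::vector<int> out(inclusive.size(), 0);
    for (std::size_t i = 1; i < inclusive.size(); ++i) {
        out[i] = inclusive[i - 1];
    }
    return out;
}
```

The sketch performs about n additions in each of roughly log2(n) passes, which is why the algorithm is called work-inefficient compared with a sequential scan that needs only n - 1 additions.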

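Besides the patterns above, a few others exist: Divide-and-Conquer, often used in sorting, decomposes a problem recursively; Pipeline overlaps the execution of multiple processing stages; and Sparse Patterns optimize work on irregular data structures.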

3. Analysis

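Many algorithms stand behind the patterns analyzed above. These common patterns are not a base library provided by CUDA itself; they are a summary of algorithmic abstractions. Thinking of them the way you think of design patterns in C++ or design approaches in distributed systems makes them easy to understand.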

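In general, use Map when the data transformation has no dependencies (it resembles a simple function mapping). Gather fits lookups and data collection. Stencil is recommended for graph neighborhoods and convolutions. Transpose is the choice for reordering data or transposing a matrix. Scatter should be used with strict care, since it can cause write conflicts and its output accesses are non-contiguous. Scan and Reduce suit large-scale computation: the former handles prefix sums and running accumulations, the latter reduces and aggregates large data sets.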
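The caution about Scatter can be illustrated with a small host-side C++ sketch. It assumes duplicate indices and contrasts two policies; the function names are hypothetical, and `std::atomic` with `fetch_add` is used here only as a stand-in for the role an atomic add would play in a kernel.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Plain scatter: output[indices[i]] = input[i].
// With duplicate indices the surviving value depends on the write order.
// This loop is sequential, so the last write wins; on a GPU the order of
// conflicting writes is not defined.
std::vector<int> scatterOverwrite(const std::vector<int>& input,
                                  const std::vector<int>& indices,
                                  int outputSize) {
    std::vector<int> output(outputSize, 0);
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[indices[i]] = input[i];
    }
    return output;
}

// Accumulating scatter: colliding values are added together, so the result
// does not depend on the order in which the writes happen.
std::vector<int> scatterAccumulate(const std::vector<int>& input,
                                   const std::vector<int>& indices,
                                   int outputSize) {
    std::vector<std::atomic<int>> acc(outputSize);
    for (auto& a : acc) a.store(0);
    for (std::size_t i = 0; i < input.size(); ++i) {
        acc[indices[i]].fetch_add(input[i]);
    }
    std::vector<int> output(outputSize);
    for (int j = 0; j < outputSize; ++j) output[j] = acc[j].load();
    return output;
}
```

Atomic operations make an accumulating scatter well defined, but they do not make a plain overwrite deterministic. If the algorithm needs a specific winner, the indices have to be unique or the conflict has to be resolved by some explicit rule.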

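To make these patterns and the related algorithms easier to implement, the CUDA ecosystem also offers library interfaces. Thrust, a high-level abstraction library, is a good fit for Map, Reduce, Scan, and Sort. CUB, a lower-level library of tuned primitives, targets peak performance and is particularly strong for block-level operations. cuBLAS is dedicated to linear algebra and includes optimized transposition. NPP is aimed at image processing, including Stencil-style operations.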
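Thrust's interface is modeled on the C++ standard algorithms, so the correspondence between patterns and library calls can be sketched with host-only standard library code. The block below is an analogy rather than Thrust code: the wrapper names are hypothetical, and the comments name the Thrust algorithm that covers the same pattern on the device.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Map: square every element (Thrust counterpart: thrust::transform).
std::vector<int> mapSquare(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [](int v) { return v * v; });
    return out;
}

// Reduce: sum of all elements (Thrust counterpart: thrust::reduce).
int reduceSum(const std::vector<int>& in) {
    return std::accumulate(in.begin(), in.end(), 0);
}

// Scan: inclusive prefix sum (Thrust counterpart: thrust::inclusive_scan).
std::vector<int> scanInclusive(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Sort: ascending order (Thrust counterpart: thrust::sort).
std::vector<int> sortAscending(std::vector<int> in) {
    std::sort(in.begin(), in.end());
    return in;
}
```

With Thrust the same calls would operate on device containers, which is what makes the library convenient when a pattern is needed as-is and a hand-written kernel is not justified.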

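In practice, although a single pattern can often be used directly, more often several of them are combined to reach the design goal, each playing to its strengths. Stream compaction, for example, typically uses Map (to build a flag array marking the elements to keep) and Scan (over the flag array, to compute each kept element's output position); image processing may combine Gather (to read image data) with Stencil (for convolution).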
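The stream-compaction combination can be sketched on the host as three separate steps. This is a sequential illustration under assumptions: the predicate (keep positive values) and the name `compactPositive` are made up for the example, and the final Scatter step is added here to complete the Map and Scan steps mentioned above.

```cpp
#include <cstddef>
#include <vector>

// Stream compaction of the positive elements, built from three patterns.
std::vector<int> compactPositive(const std::vector<int>& input) {
    const std::size_t n = input.size();

    // Map: flags[i] is 1 if element i is kept, 0 otherwise.
    std::vector<int> flags(n);
    for (std::size_t i = 0; i < n; ++i) {
        flags[i] = (input[i] > 0) ? 1 : 0;
    }

    // Exclusive scan of the flags: positions[i] is the number of kept
    // elements before i, which is exactly the output slot of element i.
    std::vector<int> positions(n);
    int kept = 0;
    for (std::size_t i = 0; i < n; ++i) {
        positions[i] = kept;
        kept += flags[i];
    }

    // Scatter: every kept element is written to its scanned position.
    // The positions of kept elements are unique, so no writes conflict.
    std::vector<int> output(kept);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            output[positions[i]] = input[i];
        }
    }
    return output;
}
```

Each of the three loops is independent per element (or is a scan), so on the GPU each one maps onto a pattern from the list above, and the relative order of the kept elements is preserved.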

4. Application

Based on the analysis above, here is a corresponding example:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

void initArray(float* arr, int size, int seed = 0) {
	srand(seed);
	for (int i = 0; i < size; i++) {
		arr[i] = (float)(rand() % 100) / 10.0f;
	}
}

void initMatrix(float* matrix, int rows, int cols, int seed = 0) {
	srand(seed);
	for (int i = 0; i < rows * cols; i++) {
		matrix[i] = (float)(rand() % 100) / 10.0f;
	}
}


// Map pattern: each thread squares one element
__global__ void mapKernel(float* input, float* output, int n) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n) {
		output[idx] = input[idx] * input[idx];   
	}
}

void mapCPU(float* input, float* output, int n) {
	for (int i = 0; i < n; i++) {
		output[i] = input[i] * input[i];
	}
}

// Gather pattern: output[i] = input[indices[i]]
__global__ void gatherKernel(float* input, int* indices, float* output, int n) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n) {
		int srcIdx = indices[idx];
		output[idx] = input[srcIdx];
	}
}

void gatherCPU(float* input, int* indices, float* output, int n) {
	for (int i = 0; i < n; i++) {
		output[i] = input[indices[i]];
	}
}

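// Scatter pattern: output[indices[i]] = input[i]
// Duplicate indices make the writes race; which value survives is undefined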
__global__ void scatterKernel(float* input, int* indices, float* output, int n) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n) {
		int dst_idx = indices[idx];
		output[dst_idx] = input[idx];  
	}
}

void scatterCPU(float* input, int* indices, float* output, int n, int output_size) {
	// Duplicate indices conflict; in this sequential loop the last write wins
	for (int i = 0; i < n; i++) {
		output[indices[i]] = input[i];
	}
}

// Stencil pattern (3x3 convolution); border pixels are copied unchanged
__global__ void stencilKernel(float* input, float* output, int width, int height,
	float* kernel, int kernel_size) {
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	if (x >= 1 && x < width - 1 && y >= 1 && y < height - 1) {
		float sum = 0.0f;
		for (int ky = -1; ky <= 1; ky++) {
			for (int kx = -1; kx <= 1; kx++) {
				int idx = (y + ky) * width + (x + kx);
				int kernelIdx = (ky + 1) * kernel_size + (kx + 1);
				sum += input[idx] * kernel[kernelIdx];
			}
		}
		output[y * width + x] = sum;
	}
	else if (x < width && y < height) {
		output[y * width + x] = input[y * width + x];
	}
}

void stencilCPU(float* input, float* output, int width, int height,
	float* kernel, int kernelSize) {
	for (int y = 1; y < height - 1; y++) {
		for (int x = 1; x < width - 1; x++) {
			float sum = 0.0f;
			for (int ky = -1; ky <= 1; ky++) {
				for (int kx = -1; kx <= 1; kx++) {
					int idx = (y + ky) * width + (x + kx);
					int kernelIdx = (ky + 1) * kernelSize + (kx + 1);
					sum += input[idx] * kernel[kernelIdx];
				}
			}
			output[y * width + x] = sum;
		}
	}

	for (int y = 0; y < height; y++) {
		output[y * width] = input[y * width];
		output[y * width + width - 1] = input[y * width + width - 1];
	}
	for (int x = 0; x < width; x++) {
		output[x] = input[x];
		output[(height - 1) * width + x] = input[(height - 1) * width + x];
	}
}

// Transpose pattern
__global__ void transposeNaiveKernel(float* input, float* output, int rows, int cols) {
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	if (x < cols && y < rows) {
		output[x * rows + y] = input[y * cols + x];
	}
}

__global__ void transposeSharedKernel(float* input, float* output, int rows, int cols) {
	// Assumes a 32x32 thread block; the padding column avoids bank conflicts
	__shared__ float tile[32][33];

	int blockX = blockIdx.x * 32;
	int blockY = blockIdx.y * 32;

	int x = blockX + threadIdx.x;
	int y = blockY + threadIdx.y;

	// Load the tile into shared memory
	if (x < cols && y < rows) {
		tile[threadIdx.y][threadIdx.x] = input[y * cols + x];
	}

	__syncthreads();

	// Write the transposed tile
	int out_x = blockY + threadIdx.x;
	int out_y = blockX + threadIdx.y;

	if (out_x < rows && out_y < cols) {
		output[out_y * rows + out_x] = tile[threadIdx.x][threadIdx.y];
	}
}

void transposeCPU(float* input, float* output, int rows, int cols) {
	for (int y = 0; y < rows; y++) {
		for (int x = 0; x < cols; x++) {
			output[x * rows + y] = input[y * cols + x];
		}
	}
}

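// Scan pattern: work-efficient (Blelloch) exclusive scan.
// Each block scans its own 2 * blockDim.x elements independently. Block totals
// are not propagated, so only the first block matches a scan of the whole array.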
__global__ void scanBlockKernel(float* input, float* output, int n) {
	extern __shared__ float  tmp[];

	int tid = threadIdx.x;
	int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;

	float val1 = (idx < n) ? input[idx] : 0;
	float val2 = (idx + blockDim.x < n) ? input[idx + blockDim.x] : 0;

	int offset = 1;
	tmp[2 * tid] = val1;
	tmp[2 * tid + 1] = val2;

	// Up-sweep (reduce) phase: build partial sums in place
	for (int d = blockDim.x; d > 0; d >>= 1) {
		__syncthreads();
		if (tid < d) {
			int ai = offset * (2 * tid + 1) - 1;
			int bi = offset * (2 * tid + 2) - 1;
			tmp[bi] += tmp[ai];
		}
		offset *= 2;
	}

	// Clear the last element before the down-sweep
	if (tid == 0) {
		tmp[2 * blockDim.x - 1] = 0;
	}

	// Down-sweep phase: turn the partial sums into an exclusive scan
	for (int d = 1; d <= blockDim.x; d *= 2) {
		offset >>= 1;
		__syncthreads();
		if (tid < d) {
			int ai = offset * (2 * tid + 1) - 1;
			int bi = offset * (2 * tid + 2) - 1;
			float t = tmp[ai];
			tmp[ai] = tmp[bi];
			tmp[bi] += t;
		}
	}
	__syncthreads();

	if (idx < n) {
		output[idx] = tmp[2 * tid];
	}
	if (idx + blockDim.x < n) {
		output[idx + blockDim.x] = tmp[2 * tid + 1];
	}
}

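// CPU reference: exclusive scan over the whole array
// (the GPU kernel above is simplified and scans each block separately)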
void scanCPU(float* input, float* output, int n) {
	output[0] = 0;
	for (int i = 1; i < n; i++) {
		output[i] = output[i - 1] + input[i - 1];
	}
}

// Reduce pattern: each block writes one partial sum
__global__ void reduceKernel(float* input, float* output, int n) {
	extern __shared__ float sdata[];

	int tid = threadIdx.x;
	int i = blockIdx.x * blockDim.x * 2 + tid;

	sdata[tid] = 0;
	if (i < n) sdata[tid] += input[i];
	if (i + blockDim.x < n) sdata[tid] += input[i + blockDim.x];

	__syncthreads();

	// Tree reduction in shared memory
	for (int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s) {
			sdata[tid] += sdata[tid + s];
		}
		__syncthreads();
	}

	if (tid == 0) {
		output[blockIdx.x] = sdata[0];
	}
}

float reduceCPU(float* input, int n) {
	float sum = 0;
	for (int i = 0; i < n; i++) {
		sum += input[i];
	}
	return sum;
}

int main() {

	// Problem sizes
	const int N = 1 << 20;
	const int IMAGE_WIDTH = 512;   // image width
	const int IMAGE_HEIGHT = 512;  // image height
	const int MATRIX_ROWS = 1024;  // matrix rows
	const int MATRIX_COLS = 1024;  // matrix columns


	// Host memory
	float *hInput = (float*)malloc(N * sizeof(float));
	float *hOutputCpu = (float*)malloc(N * sizeof(float));
	float *hOutputGpu = (float*)malloc(N * sizeof(float));

	// Gather/Scatter indices
	int *hIndices = (int*)malloc(N * sizeof(int));

	// Image
	int imgSize = IMAGE_WIDTH * IMAGE_HEIGHT;
	float *hImgInput = (float*)malloc(imgSize * sizeof(float));
	float *hImgOutputCpu = (float*)malloc(imgSize * sizeof(float));
	float *hImgOutputGpu = (float*)malloc(imgSize * sizeof(float));

	// Matrix
	int mat_size = MATRIX_ROWS * MATRIX_COLS;
	float *hMatInput = (float*)malloc(mat_size * sizeof(float));
	float *hMatOutputCpu = (float*)malloc(mat_size * sizeof(float));
	float *hMatOutputGpu = (float*)malloc(mat_size * sizeof(float));

	// Convolution kernel (3x3 box filter)
	float hKernel[9];
	for (int i = 0; i < 9; i++) {
		hKernel[i] = 1.0f / 9.0f;
	}

	// Device memory
	float *dInput = NULL, *dOutput = NULL;
	int *dIndices = NULL;
	float *dImgInput = NULL, *dImgOutput = NULL;
	float *dMatInput = NULL, *dMatOutput = NULL;
	float *dKernel = NULL;
	float *dScanOutput = NULL, *dReduceOutput = NULL;

	cudaMalloc(&dInput, N * sizeof(float));
	cudaMalloc(&dOutput, N * sizeof(float));
	cudaMalloc(&dIndices, N * sizeof(int));
	cudaMalloc(&dImgInput, imgSize * sizeof(float));
	cudaMalloc(&dImgOutput, imgSize * sizeof(float));
	cudaMalloc(&dMatInput, mat_size * sizeof(float));
	cudaMalloc(&dMatOutput, mat_size * sizeof(float));
	cudaMalloc(&dKernel, 9 * sizeof(float));
	cudaMalloc(&dScanOutput, N * sizeof(float));
	cudaMalloc(&dReduceOutput, ((N + 511) / 512) * sizeof(float));

	// Initialize the input data
	initArray(hInput, N, 68);
	initArray(hImgInput, imgSize, 333);
	initMatrix(hMatInput,  MATRIX_ROWS,  MATRIX_COLS, 666);

	// Initialize the indices (used by Gather/Scatter); duplicates are possible
	srand(127);
	for (int i = 0; i < N; i++) {
		hIndices[i] = rand() % N;
	}

	// Copy the data to the device
	cudaMemcpy(dInput, hInput, N * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dIndices, hIndices, N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(dImgInput, hImgInput, imgSize * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dMatInput, hMatInput, mat_size * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dKernel, hKernel, 9 * sizeof(float), cudaMemcpyHostToDevice);

	// Launch configuration, then MAP
	int blockSize = 256;
	int gridSize = (N + blockSize - 1) / blockSize;
	mapKernel << <gridSize, blockSize >> >(dInput, dOutput, N);
	cudaDeviceSynchronize();


	cudaMemcpy(hOutputGpu, dOutput, N * sizeof(float), cudaMemcpyDeviceToHost);
	mapCPU(hInput, hOutputCpu, N);

	// GATHER
	gatherKernel << <gridSize, blockSize >> >(dInput, dIndices, dOutput, N);
	cudaDeviceSynchronize();

	cudaMemcpy(hOutputGpu, dOutput, N * sizeof(float), cudaMemcpyDeviceToHost);
	gatherCPU(hInput, hIndices, hOutputCpu, N);

	// SCATTER
	// Reset the output buffer
	cudaMemset(dOutput, 0, N * sizeof(float));

	// hIndices contains duplicates, so conflicting writes race on the GPU
	scatterKernel << <gridSize, blockSize >> >(dInput, dIndices, dOutput, N);
	cudaDeviceSynchronize();

	cudaMemcpy(hOutputGpu, dOutput, N * sizeof(float), cudaMemcpyDeviceToHost);

	// STENCIL
	dim3 blockDimStencil(16, 16);
	dim3 gridDimStencil((IMAGE_WIDTH + blockDimStencil.x - 1) / blockDimStencil.x,(IMAGE_HEIGHT + blockDimStencil.y - 1) / blockDimStencil.y);

	stencilKernel << <gridDimStencil, blockDimStencil >> >(dImgInput, dImgOutput, IMAGE_WIDTH, IMAGE_HEIGHT, dKernel, 3);
	cudaDeviceSynchronize();

	cudaMemcpy(hImgOutputGpu, dImgOutput, imgSize * sizeof(float), cudaMemcpyDeviceToHost);
	stencilCPU(hImgInput, hImgOutputCpu, IMAGE_WIDTH, IMAGE_HEIGHT, hKernel, 3);


	// TRANSPOSE
	// transposeSharedKernel works on 32x32 tiles, so the block must be 32x32
	dim3 blockDimTranspose(32, 32);
	dim3 gridDimTranspose((MATRIX_COLS + blockDimTranspose.x - 1) / blockDimTranspose.x,
		(MATRIX_ROWS + blockDimTranspose.y - 1) / blockDimTranspose.y);

	transposeSharedKernel << <gridDimTranspose, blockDimTranspose >> >(dMatInput, dMatOutput, MATRIX_ROWS, MATRIX_COLS);
	cudaDeviceSynchronize();

	cudaMemcpy(hMatOutputGpu, dMatOutput, mat_size * sizeof(float),cudaMemcpyDeviceToHost);
	transposeCPU(hMatInput, hMatOutputCpu,  MATRIX_ROWS,  MATRIX_COLS);

	// SCAN

	// Use a smaller data set to test SCAN
	const int SCAN_SIZE = 1 << 16;  // 64K elements
	float *hScanInput = (float*)malloc(SCAN_SIZE * sizeof(float));
	float *hScanOutputCpu = (float*)malloc(SCAN_SIZE * sizeof(float));
	float *hScanOutputGpu = (float*)malloc(SCAN_SIZE * sizeof(float));

	initArray(hScanInput, SCAN_SIZE, 99);

	float * d_scan_input;
	cudaMalloc(&d_scan_input, SCAN_SIZE * sizeof(float));
	cudaMemcpy(d_scan_input, hScanInput, SCAN_SIZE * sizeof(float),cudaMemcpyHostToDevice);


	int scanBlockSize = 512;
	int scanGridSize = (SCAN_SIZE + scanBlockSize * 2 - 1) / (scanBlockSize * 2);
	int sharedMemSize = 2 * scanBlockSize * sizeof(float);

	scanBlockKernel << <scanGridSize, scanBlockSize, sharedMemSize >> >(d_scan_input, dScanOutput, SCAN_SIZE);
	cudaDeviceSynchronize();

	cudaMemcpy(hScanOutputGpu, dScanOutput, SCAN_SIZE * sizeof(float),cudaMemcpyDeviceToHost);
	scanCPU(hScanInput, hScanOutputCpu, SCAN_SIZE);

	// REDUCE
	int reduceBlockSize = 512;
	int reduceGridSize = (N + reduceBlockSize * 2 - 1) / (reduceBlockSize * 2);
	int reduceSharedMem = reduceBlockSize * sizeof(float);

	// First-level reduction: one partial sum per block
	reduceKernel << <reduceGridSize, reduceBlockSize, reduceSharedMem >> >(dInput, dReduceOutput, N);

	// Finish the reduction
	float hReduceResult;
	if (reduceGridSize > 1) {
		// Copy the per-block partial sums to the host and add them there
		float *hIntermediate = (float*)malloc(reduceGridSize * sizeof(float));
		cudaMemcpy(hIntermediate, dReduceOutput, reduceGridSize * sizeof(float), cudaMemcpyDeviceToHost);

		hReduceResult = 0;
		for (int i = 0; i < reduceGridSize; i++) {
			hReduceResult += hIntermediate[i];
		}

		free(hIntermediate);
	}
	else {
		cudaMemcpy(&hReduceResult, dReduceOutput,sizeof(float), cudaMemcpyDeviceToHost);
	}


	float cpu_reduce_result = reduceCPU(hInput, N);


	// Free host memory
	free(hInput);
	free(hOutputCpu);
	free(hOutputGpu);
	free(hIndices);
	free(hImgInput);
	free(hImgOutputCpu);
	free(hImgOutputGpu);
	free(hMatInput);
	free(hMatOutputCpu);
	free(hMatOutputGpu);
	free(hScanInput);
	free(hScanOutputCpu);
	free(hScanOutputGpu);

	// Free device memory
	cudaFree(dInput);
	cudaFree(dOutput);
	cudaFree(dIndices);
	cudaFree(dImgInput);
	cudaFree(dImgOutput);
	cudaFree(dMatInput);
	cudaFree(dMatOutput);
	cudaFree(dKernel);
	cudaFree(d_scan_input);
	cudaFree(dScanOutput);
	cudaFree(dReduceOutput);

	return 0;
}

Running it in a real environment makes the behavior clear.
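The listing computes both a CPU and a GPU result for most patterns but does not compare them. A small host-side helper along the following lines could be used for that check. It is only a sketch: `maxAbsDiff` and `closeRelative` are hypothetical names, and the tolerance is an assumption that has to be chosen per pattern.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Largest absolute element-wise difference between two arrays of length n.
float maxAbsDiff(const float* a, const float* b, std::size_t n) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// True when two scalars agree within a relative tolerance. Useful for the
// Reduce result, where CPU and GPU add the values in a different order and
// therefore round differently.
bool closeRelative(float a, float b, float relTol) {
    float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= relTol * std::max(scale, 1.0f);
}
```

In the listing, `hOutputGpu` is overwritten by every stage, so a call such as `maxAbsDiff(hOutputCpu, hOutputGpu, N)` would have to follow directly after the corresponding copy and CPU reference call. For Map, Gather, Stencil, and Transpose the difference is expected to be zero or very small, while the Reduce totals are better compared with `closeRelative`.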

5. Summary

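This article leans toward applied design practice in CUDA, yet it is closely tied to CUDA development. It is written at a fairly high level, and some details are not analyzed in depth, so the outline above should be treated as a starting point for filling in the rest. Learning is a process of continually summarizing and refining, and that is how a personal body of knowledge takes shape. Everyone sees a different Hamlet, but the essence of the knowledge they finally arrive at is the same.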