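Until now, every kernel has been launched from a host thread. Kernel launches were static, which means all parallel work had to be laid out by the host code in advance. Dynamic parallelism is the ability of a running CUDA kernel to launch other kernels directly from the device. To use it, compile the CUDA code with the -rdc=true option (relocatable device code) and make sure your GPU supports the feature, which requires compute capability 3.5 or higher.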
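One way to check whether a GPU supports the feature is to query its compute capability at run time, since dynamic parallelism requires compute capability 3.5 or higher. The following is a minimal sketch that is not part of the original example; it only uses the standard runtime calls cudaGetDeviceCount and cudaGetDeviceProperties.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess || deviceCount == 0)
    {
        printf("No usable CUDA device found\n");
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;

        // Dynamic parallelism needs compute capability 3.5 or higher.
        bool supported = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);

        printf("Device %d (%s): compute capability %d.%d, dynamic parallelism %s\n",
               dev, prop.name, prop.major, prop.minor,
               supported ? "supported" : "not supported");
    }
    return 0;
}
```

This program contains no device-side launch, so it can be compiled without -rdc=true.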
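In dynamic parallelism, kernel executions fall into two types: parent and child. A parent thread, parent thread block, or parent grid launches a new grid, called the child grid. A child thread, child thread block, or child grid is one that was launched by a parent. A child grid must complete before its parent thread, parent thread block, or parent grid completes, so a parent is not considered complete until all of its child grids have completed.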
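The parent/child completion rule can be illustrated with a small sketch. The kernel names parentKernel and childKernel and the buffer layout are made up for this illustration and do not come from the original example. The parent thread fills a buffer and then launches a child grid that updates it. Because the parent grid does not complete until its child grid has completed, a single cudaDeviceSynchronize on the host is enough to wait for both.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child grid: each thread updates one element that the parent initialized.
__global__ void childKernel(int *data)
{
    data[threadIdx.x] = data[threadIdx.x] * 10 + threadIdx.x;
}

// Parent grid: thread 0 initializes the buffer, then launches the child grid.
// Global memory writes made by the launching thread before the launch
// are visible to the child grid.
__global__ void parentKernel(int *data, int n)
{
    if (threadIdx.x == 0)
    {
        for (int i = 0; i < n; ++i) data[i] = 1;
        childKernel<<<1, n>>>(data);
    }
}

int main()
{
    const int n = 4;
    int hData[n] = {0};
    int *dData = nullptr;

    cudaMalloc(&dData, n * sizeof(int));
    parentKernel<<<1, 1>>>(dData, n);

    // Waiting for the parent grid also waits for the child grid.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hData, dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", hData[i]);
    printf("\n");

    cudaFree(dData);
    return 0;
}
```

Compiled with -rdc=true and run on a device that supports dynamic parallelism, this should print 10 11 12 13, which shows that the child's writes are already in place once the parent grid has finished.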
3.4.1 Nested Hello World on the GPU
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <windows.h>
#include "../common/common.h"
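
// Every thread prints one line; thread 0 of each block then launches
// a child grid with half as many threads.
__global__ void nestedHelloWorld(int const iSize, int iDepth)
{
    int tid = threadIdx.x;
    printf("Recursion = %d : helloworld from thread %d block %d\n", iDepth, tid, blockIdx.x);

    // Stop recursing once only one thread is left.
    if (iSize == 1) return;

    // Halve the thread count for the next level.
    int nthreads = iSize >> 1;

    // Only thread 0 launches the child grid.
    if (tid == 0 && nthreads > 0)
    {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("------> nested execution depth : %d\n", iDepth);
    }
}

int main(int argc, char **argv)
{
    int size = 8;
    int blocksize = 8; // initial block size
    int igrid = 1;

    // An optional first argument sets the number of blocks in the parent grid.
    if (argc > 1)
    {
        igrid = atoi(argv[1]);
        size = igrid * blocksize;
    }

    dim3 block(blocksize, 1);
    dim3 grid((size + block.x - 1) / block.x, 1);
    printf("%s Execution Configuration: grid %d block %d\n", argv[0], grid.x, block.x);

    nestedHelloWorld<<<grid, block>>>(block.x, 0);
    CHECK(cudaGetLastError());

    // Wait for the parent grid and all of its child grids before resetting the device.
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaDeviceReset());
    return 0;
}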
Compile: nvcc .\nestedHelloworld.cu -o .\nestedHelloworld -rdc=true
Output:
nestedHelloworld.exe Execution Configuration: grid 1 block 8
Recursion = 0 : helloworld from thread 0 block 0
Recursion = 0 : helloworld from thread 1 block 0
Recursion = 0 : helloworld from thread 2 block 0
Recursion = 0 : helloworld from thread 3 block 0
Recursion = 0 : helloworld from thread 4 block 0
Recursion = 0 : helloworld from thread 5 block 0
Recursion = 0 : helloworld from thread 6 block 0
Recursion = 0 : helloworld from thread 7 block 0
------> nested execution depth : 1
Recursion = 1 : helloworld from thread 0 block 0
Recursion = 1 : helloworld from thread 1 block 0
Recursion = 1 : helloworld from thread 2 block 0
Recursion = 1 : helloworld from thread 3 block 0
------> nested execution depth : 2
Recursion = 2 : helloworld from thread 0 block 0
Recursion = 2 : helloworld from thread 1 block 0
------> nested execution depth : 3
Recursion = 3 : helloworld from thread 0 block 0
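The output shows that the parent grid launched from the host has 1 thread block with 8 threads. The nestedHelloWorld kernel is then invoked recursively three times, and each invocation uses half as many threads as the previous one (8, 4, 2, 1). The recursion stops when only one thread is left, because of the iSize == 1 check.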
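The halving pattern can also be checked numerically instead of by reading the printed lines. The sketch below is an illustrative variant rather than the original program: the kernel name nestedCount and the counters array are assumptions made for this example. It replaces printf with atomic counters that record how many threads ran in total and the deepest recursion level reached.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// counters[0]: total number of threads executed across all levels
// counters[1]: deepest recursion level reached
__global__ void nestedCount(int size, int depth, int *counters)
{
    atomicAdd(&counters[0], 1);
    atomicMax(&counters[1], depth);

    if (size == 1) return;

    // Same halving rule as nestedHelloWorld.
    int nthreads = size >> 1;
    if (threadIdx.x == 0 && nthreads > 0)
    {
        nestedCount<<<1, nthreads>>>(nthreads, depth + 1, counters);
    }
}

int main()
{
    int hCounters[2] = {0, 0};
    int *dCounters = nullptr;

    cudaMalloc(&dCounters, 2 * sizeof(int));
    cudaMemset(dCounters, 0, 2 * sizeof(int));

    // One block of 8 threads, matching the default configuration above.
    nestedCount<<<1, 8>>>(8, 0, dCounters);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hCounters, dCounters, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("threads executed: %d, deepest level: %d\n", hCounters[0], hCounters[1]);

    cudaFree(dCounters);
    return 0;
}
```

With one block of 8 threads, the expected result is 15 threads in total (8 + 4 + 2 + 1) and a deepest level of 3, which matches the printed output above. Like the original program, this needs to be compiled with -rdc=true.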