The cuDNN FrontEnd (FE) API is a C++ header-only library that wraps the cuDNN C backend API. The FE and backend APIs are both entry points to the same set of functionality (the Graph API).

GitHub: NVIDIA/cudnn-frontend

In the FE v1.0 API, users can describe multiple operations that form a subgraph through a persistent cudnn_frontend::graph::Graph object. Unlike the FE v0.x API, users no longer need to worry about specifying the shapes and sizes of intermediate virtual tensors. The FE v1.0 API builds on the groundwork of earlier versions and introduces a new set of APIs to further streamline the workflow.
This post starts with an FE v0.x API example (I have not studied v1.0 yet). Since FE is a header-only library, everything is defined across a set of headers; for example, each BackendDescriptor lives in its own .h file:
- cudnn_frontend_Tensor.h -> CUDNN_BACKEND_TENSOR_DESCRIPTOR
- cudnn_frontend_Engine.h -> CUDNN_BACKEND_ENGINE_DESCRIPTOR
- cudnn_frontend_ExecutionPlan.h -> CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR
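In practice there is no need to include these headers one by one; the repository ships an umbrella header that pulls everything in:

```cpp
// Umbrella header from NVIDIA/cudnn-frontend; it includes all the
// builder/descriptor headers listed above.
#include <cudnn_frontend.h>
```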
The following example demonstrates the basic FE v0.x workflow. The code uses the Catch2 unit-testing framework and exercises the forward pass of a fused Conv+Bias+Scale+Activation operator.
The source lives in cudnn-frontend/samples/legacy_samples/test_lists.cpp.
ConvBiasScaleAct sample
```cpp
TEST_CASE("ConvBiasScaleAct sample", "[frontend][fusion][ConvBiasScaleAct]") {
    std::cout << "TEST_CASE ConvBiasScaleAct :: ConvBiasScaleAct sample" << std::endl;
    INFO("TEST_CASE :: ConvBiasScaleAct sample");

    int64_t xTensorDim[]     = {1, 16, 512, 512};
    int64_t wTensorDim[]     = {64, 16, 3, 3};
    int64_t yTensorDim[]     = {1, 64, 512, 512};

    int64_t conv_padA[]      = {1, 1};
    int64_t conv_dilationA[] = {1, 1};
    int64_t conv_strideA[]   = {1, 1};

    int64_t bTensorDim[]     = {1, 64, 1, 1};  // bias
    int64_t sTensorDim[]     = {1, 64, 1, 1};  // scale

    printf("====DIMENSIONS====\n");
    printf("input dims are %" PRId64 ", %" PRId64 ", %" PRId64 ", %" PRId64 "\n",
           xTensorDim[0], xTensorDim[1], xTensorDim[2], xTensorDim[3]);
    printf("filter dims are %" PRId64 ", %" PRId64 ", %" PRId64 ", %" PRId64 "\n",
           wTensorDim[0], wTensorDim[1], wTensorDim[2], wTensorDim[3]);
    printf("output dims are %" PRId64 ", %" PRId64 ", %" PRId64 ", %" PRId64 "\n",
           yTensorDim[0], yTensorDim[1], yTensorDim[2], yTensorDim[3]);

    int64_t Ysize = yTensorDim[0] * yTensorDim[1] * yTensorDim[2] * yTensorDim[3];

    Surface<float> X(xTensorDim[0] * xTensorDim[1] * xTensorDim[2] * xTensorDim[3], false);
    Surface<float> W(wTensorDim[0] * wTensorDim[1] * wTensorDim[2] * wTensorDim[3], false);
    Surface<float> Y(Ysize, true);
    Surface<float> B(bTensorDim[0] * bTensorDim[1] * bTensorDim[2] * bTensorDim[3], false);
    Surface<float> S(sTensorDim[0] * sTensorDim[1] * sTensorDim[2] * sTensorDim[3], false);

    run_conv_bias_scale_relu(xTensorDim,
                             wTensorDim,
                             yTensorDim,
                             bTensorDim,
                             sTensorDim,
                             CUDNN_DATA_HALF,
                             2,
                             conv_padA,
                             conv_dilationA,
                             conv_strideA,
                             X.devPtr,
                             W.devPtr,
                             Y.devPtr,
                             B.devPtr,
                             S.devPtr);

    checkCudaErr(cudaDeviceSynchronize());
    checkCudaErr(cudaMemcpy(Y.hostPtr, Y.devPtr, (size_t)(sizeof(Y.hostPtr[0]) * Ysize), cudaMemcpyDeviceToHost));
    checkCudaErr(cudaDeviceSynchronize());
    std::cout << "\n========================================================================================\n";
}
```
This test case mainly sets up the dimensions of the input and output tensors, allocates memory and randomly initializes the input tensors through Surface<float>, and then calls the core function run_conv_bias_scale_relu. Let's look at how that function performs the forward computation through the cuDNN FE API.
1. Create the handle
```cpp
cudnnHandle_t handle_;
// Create a cuDNN handle
checkCudnnErr(cudnnCreate(&handle_));
```
2. Create the tensor descriptors
This example creates quite a few tensor descriptors; for brevity only some of them are shown here.
```cpp
// Creates the necessary tensor descriptors
int64_t stride[4];
generateStrides(x_dim, stride, 4, CUDNN_TENSOR_NHWC);

auto xTensor = cudnn_frontend::TensorBuilder()
                   .setDim(4, x_dim)
                   .setStride(4, stride)
                   .setId('x')
                   .setAlignment(16)  // 16B alignment is needed to run a tensor core engine
                   .setDataType(dataType)
                   .build();
/* Create wTensor, bTensor, sTensor ... */

auto afterConvTensor = cudnn_frontend::TensorBuilder()
                           .setDim(4, y_dim)
                           .setStride(4, stride)
                           .setId('A')  // after conv
                           .setAlignment(16)
                           .setVirtual()
                           .setDataType(dataType)
                           .build();
/* Create afterBiasTensor, afterScaleTensor ... */

auto yTensor = cudnn_frontend::TensorBuilder()
                   .setDim(4, y_dim)
                   .setStride(4, stride)
                   .setId('y')  // output
                   .setAlignment(16)
                   .setDataType(dataType)
                   .build();

std::cout << xTensor.describe() << std::endl;
std::cout << wTensor.describe() << std::endl;
std::cout << bTensor.describe() << std::endl;
std::cout << sTensor.describe() << std::endl;
std::cout << afterConvTensor.describe() << std::endl;
std::cout << afterBiasTensor.describe() << std::endl;
std::cout << afterScaleTensor.describe() << std::endl;
std::cout << yTensor.describe() << std::endl;
```
The TensorBuilder class builds and manages tensor descriptors: properties are set through the set*() methods and the descriptor is constructed with build(). The properties include:
- dataType
- alignment
- unique identifier
- tensor dimensions
- tensor strides
- isVirtual
- isByValue
The describe() output of each tensor lists these properties. There, Str (stride) is the spacing, in elements, between consecutive entries along each dimension of the tensor in memory; isVirtual says whether the tensor is an intermediate tensor; isByValue says whether the tensor lives in host memory and must be passed to the kernel by value.
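generateStrides is a helper from the samples, not from the library itself. For illustration, here is a minimal reconstruction of what it computes for the 4-D NHWC case used above (my own sketch, not the samples' exact code; dim is given in NCHW order):

```cpp
// Sketch of stride generation for NHWC layout (reconstruction).
// dim is in NCHW order: {N, C, H, W}; strides count elements, not bytes.
void generate_nhwc_strides(const int64_t dim[4], int64_t stride[4]) {
    stride[1] = 1;                         // C: innermost dimension in NHWC
    stride[3] = dim[1];                    // W: stepping W skips C elements
    stride[2] = dim[3] * dim[1];           // H: stepping H skips W*C elements
    stride[0] = dim[2] * dim[3] * dim[1];  // N: stepping N skips H*W*C elements
}
```

For the input tensor {1, 16, 512, 512} this yields strides {4194304, 1, 8192, 16}.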
3. Define the op descriptors
```cpp
// Define the bias descriptor
auto biasDesc = cudnn_frontend::PointWiseDescBuilder()
                    .setMode(CUDNN_POINTWISE_ADD)
                    .setComputeType(CUDNN_DATA_FLOAT)
                    .build();
std::cout << biasDesc.describe() << std::endl;

// Define the scale descriptor
auto scaleDesc = cudnn_frontend::PointWiseDescBuilder()
                     .setMode(CUDNN_POINTWISE_MUL)
                     .setComputeType(CUDNN_DATA_FLOAT)
                     .build();
std::cout << scaleDesc.describe() << std::endl;

// Define the activation descriptor
auto actDesc = cudnn_frontend::PointWiseDescBuilder()
                   .setMode(CUDNN_POINTWISE_RELU_FWD)
                   .setComputeType(CUDNN_DATA_FLOAT)
                   .build();
std::cout << actDesc.describe() << std::endl;

// Define the convolution problem
auto convDesc = cudnn_frontend::ConvDescBuilder()
                    .setComputeType(CUDNN_DATA_FLOAT)
                    .setMathMode(CUDNN_CROSS_CORRELATION)
                    .setSpatialDimCount(convDim)
                    .setSpatialStride(convDim, conv_strideA)
                    .setPrePadding(convDim, conv_padA)
                    .setPostPadding(convDim, conv_padA)
                    .setDilation(convDim, conv_dilationA)
                    .build();
std::cout << convDesc.describe() << std::endl;
```
The PointWiseDescBuilder class creates and manages descriptors for pointwise operations, and ConvDescBuilder does the same for convolution operations. Both follow the same builder pattern, each with its own specific properties.
In the describe() output, Mode indicates the type of the op; convolution descriptors additionally print their convolution-specific attributes (spatial dimensions, padding, stride, dilation).
4. Create the op nodes
```cpp
float alpha = 1.0f;
float beta  = 0.0f;

// Create a convolution node
auto conv_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR)
                   .setxDesc(xTensor)
                   .setwDesc(wTensor)
                   .setyDesc(afterConvTensor)
                   .setcDesc(convDesc)
                   .setAlpha(alpha)
                   .setBeta(beta)
                   .build();
std::cout << conv_op.describe() << std::endl;

// Create a bias node; its input is the convolution node's output
auto bias_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR)
                   .setxDesc(conv_op.getOutputTensor())
                   .setbDesc(bTensor)
                   .setyDesc(afterBiasTensor)
                   .setpwDesc(biasDesc)
                   .build();
std::cout << bias_op.describe() << std::endl;

// Create a multiplication node with the scaling parameters
auto scale_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR)
                    .setxDesc(bias_op.getOutputTensor())
                    .setbDesc(sTensor)
                    .setyDesc(afterScaleTensor)
                    .setpwDesc(scaleDesc)
                    .build();
std::cout << scale_op.describe() << std::endl;

// Create an activation node
auto act_op = cudnn_frontend::OperationBuilder(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR)
                  .setxDesc(scale_op.getOutputTensor())
                  .setyDesc(yTensor)
                  .setpwDesc(actDesc)
                  .build();
std::cout << act_op.describe() << std::endl;
```
Each op node (convolution, bias, multiplication, activation) is created through the single OperationBuilder structure, wiring up the corresponding input/output tensor descriptors and op descriptor. alpha and beta are blending factors (conceptually, y = alpha * op(x) + beta * y); the usual alpha = 1, beta = 0 means no blending.
In the describe() output of each node, the trailing fields are the handles of the descriptors the node points to (0 when the node has no such descriptor) and the values of alpha and beta.
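Chained together through getOutputTensor(), the four nodes make the graph compute

$$y = \mathrm{ReLU}\big(s \odot (\mathrm{conv}(x, w) + b)\big)$$

where $\odot$ is an elementwise multiply and the bias $b$ and scale $s$ broadcast from their {1, 64, 1, 1} shapes over the output.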
5. Create the operation graph
```cpp
// Create an operation graph. In this case it is convolution-bias-scale-activation
std::array<cudnn_frontend::Operation const*, 4> ops = {&conv_op, &bias_op, &scale_op, &act_op};

auto opGraph = cudnn_frontend::OperationGraphBuilder()
                   .setHandle(handle_)
                   .setOperationGraph(ops.size(), ops.data())
                   .build();
```
OperationGraphBuilder takes the handle and the array of op nodes from the previous step, and build() assembles them into the op graph.
6. Select a plan via heuristics
```cpp
auto plan = get_execplan_from_heuristics_else_fall_back(std::move(opGraph), handle_);
std::cout << "Plan tag: " << plan.getTag() << std::endl;

auto workspace_size = plan.getWorkspaceSize();
std::cout << plan.describe() << " requires workspace " << workspace_size << std::endl;
```
The function get_execplan_from_heuristics_else_fall_back queries the heuristics and selects the first valid engine, which is typically the highest-performing way to run the fused operator. The workspace size required by the chosen engine is then retrieved.
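Note that get_execplan_from_heuristics_else_fall_back is a helper defined in the samples, not part of the library proper. Below is a simplified sketch of what such a helper does with the v0.x building blocks (the builder names are real FE v0.x API; the fallback logic here is condensed and hypothetical):

```cpp
// Simplified sketch of heuristics-driven plan selection (condensed from the
// idea of the samples' helper; real code has more elaborate fallbacks).
cudnn_frontend::ExecutionPlan
get_plan_from_heuristics(cudnn_frontend::OperationGraph&& opGraph, cudnnHandle_t handle) {
    // Query heuristics: returns engine configs ranked by expected performance.
    auto heuristics = cudnn_frontend::EngineHeuristicsBuilder()
                          .setOperationGraph(opGraph)
                          .setHeurMode(CUDNN_HEUR_MODE_INSTANT)
                          .build();
    auto& configs = heuristics.getEngineConfig(heuristics.getEngineConfigCount());

    // Take the first config that successfully builds into an execution plan.
    for (auto& cfg : configs) {
        try {
            return cudnn_frontend::ExecutionPlanBuilder()
                       .setHandle(handle)
                       .setEngineConfig(cfg, opGraph.getTag())
                       .build();
        } catch (cudnn_frontend::cudnnException&) {
            // This engine config does not support the graph; try the next one.
        }
    }
    throw std::runtime_error("no engine config supported this op graph");
}
```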
7. Create the backend variant pack and execute the plan
```cpp
void* workspace_ptr = nullptr;
if (workspace_size > 0) {
    checkCudaErr(cudaMalloc(&workspace_ptr, (size_t)workspace_size));
}

void* data_ptrs[] = {devPtrX, devPtrY, devPtrW, devPtrB, devPtrS};
int64_t uids[]    = {'x', 'y', 'w', 'b', 's'};

auto variantPack = cudnn_frontend::VariantPackBuilder()
                       .setWorkspacePointer(workspace_ptr)
                       .setDataPointers(5, data_ptrs)
                       .setUids(5, uids)
                       .build();
std::cout << "variantPack " << variantPack.describe() << std::endl;

cudnnStatus_t status = cudnnBackendExecute(handle_, plan.get_raw_desc(), variantPack.get_raw_desc());

if (workspace_size > 0) {
    checkCudaErr(cudaFree(workspace_ptr));
}
checkCudnnErr(cudnnDestroy(handle_));
```
If workspace_size > 0, device memory of that size is allocated for the workspace. data_ptrs points at the device memory of each tensor, and the uids match the ids assigned with setId() earlier; the variant pack bundles the workspace pointer, the data pointers, and the uids.
Finally, the standard cuDNN API cudnnBackendExecute runs the plan built on the selected high-performance engine with the variant pack, performing the actual computation and returning a status (which should be checked); afterwards the related memory and the handle are released.
Summary
The steps above implement the forward pass of a fused Conv+Bias+Scale+Activation computation with the FE v0.x API. The uniform API that FE provides makes it fairly straightforward to write high-performance training and inference code in a formulaic way.
In follow-up posts I plan to look at implementing the same thing directly with the backend descriptor types, at the more streamlined FE v1.0 approach, and at how operator fusion works under the hood.