Model Deployment Tips (Part 1)

These notes follow Chapter 6 of the CUDA and TensorRT model deployment course material and focus on tricks for image pre-/post-processing.

References:
1. Deploying a classifier: int8 calibration
2. cuDNN download page
3. How to find the TensorRT version that matches your CUDA and cuDNN versions
4. Timing cache

I. Preprocessing (preprocess)

Learning objective

Study several ways of converting a cv::Mat from BGR to RGB and compare their running speed.

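1. BGR2RGB on the CPU

When small images are preprocessed on the GPU, the workload often cannot saturate the hardware, so GPU resources are wasted.

In that case we may move preprocessing to the CPU, keep the DNN forward pass on the GPU, and run the two asynchronously.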

1.1 cv::cvtColor

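// tar is taken by reference so the caller sees the result even when
// cv::cvtColor has to (re)allocate the destination.
void preprocess_cv_cvtcolor(const cv::Mat& src, cv::Mat& tar){
    cv::cvtColor(src, tar, cv::COLOR_BGR2RGB);
}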

1.2 .at

void preprocess_cv_mat_at(cv::Mat src, cv::Mat tar){
    for (int i = 0; i < src.rows; i++) {
        for (int j = 0; j < src.cols; j++) {
            tar.at<cv::Vec3b>(i, j)[2] = src.at<cv::Vec3b>(i, j)[0];
            tar.at<cv::Vec3b>(i, j)[1] = src.at<cv::Vec3b>(i, j)[1];
            tar.at<cv::Vec3b>(i, j)[0] = src.at<cv::Vec3b>(i, j)[2];
        }
    }
}

1.3 cv::MatIterator_

void preprocess_cv_mat_iterator(cv::Mat src, cv::Mat tar){
    cv::MatIterator_<cv::Vec3b> src_it = src.begin<cv::Vec3b>();
    cv::MatIterator_<cv::Vec3b> tar_it = tar.begin<cv::Vec3b>();
    cv::MatIterator_<cv::Vec3b> end    = src.end<cv::Vec3b>();
    for (; src_it != end; src_it++, tar_it++) {
        (*tar_it)[2] = (*src_it)[0];
        (*tar_it)[1] = (*src_it)[1];
        (*tar_it)[0] = (*src_it)[2];
    }
}

1.4 .data

void preprocess_cv_mat_data(cv::Mat src, cv::Mat tar){
    int height   = src.rows;
    int width    = src.cols;
    int channels = src.channels();

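    for (int i = 0; i < height; i ++) {
        // Use the row stride (step, in bytes) rather than width * channels,
        // so that non-continuous Mats (e.g. ROI views) are indexed correctly.
        const uchar* src_row = src.data + i * src.step[0];
        uchar*       tar_row = tar.data + i * tar.step[0];
        for (int j = 0; j < width; j ++) {
            int index = j * channels;
            tar_row[index + 2] = src_row[index + 0];
            tar_row[index + 1] = src_row[index + 1];
            tar_row[index + 0] = src_row[index + 2];
        }
    }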
}
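
When indexing through .data, the row stride matters: a cv::Mat that is not continuous (for example an ROI view) has padding bytes between rows, so offsets of the form i * width * channels are only valid for tightly packed data. The sketch below shows the same B/R swap on raw buffers with explicit strides, without OpenCV. The function name and parameters are illustrative assumptions; src_step and dst_step play the role that cv::Mat::step plays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Swap the B and R channels of an 8-bit, 3-channel image stored row by row.
// src_step / dst_step are row strides in bytes and may exceed cols * 3
// when rows are padded. src and dst must not overlap.
void swap_br_strided(const std::uint8_t* src, std::uint8_t* dst,
                     int rows, int cols,
                     std::size_t src_step, std::size_t dst_step) {
    const std::size_t row_bytes = static_cast<std::size_t>(cols) * 3;
    assert(src_step >= row_bytes && dst_step >= row_bytes);
    for (int i = 0; i < rows; ++i) {
        const std::uint8_t* s = src + i * src_step;  // start of row i in src
        std::uint8_t*       d = dst + i * dst_step;  // start of row i in dst
        for (int j = 0; j < cols; ++j) {
            d[3 * j + 0] = s[3 * j + 2];
            d[3 * j + 1] = s[3 * j + 1];
            d[3 * j + 2] = s[3 * j + 0];
        }
    }
}
```

With a 2x1 source whose rows are padded to 4 bytes, calling swap_br_strided(src, dst, 2, 1, 4, 3) skips the padding byte of each row, whereas a width * channels offset would read it as pixel data.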

1.5 pointer

void preprocess_cv_pointer(cv::Mat src, cv::Mat tar){
    for (int i = 0; i < src.rows; i ++) {
        cv::Vec3b* src_ptr = src.ptr<cv::Vec3b>(i);
        cv::Vec3b* tar_ptr = tar.ptr<cv::Vec3b>(i);
        for (int j = 0; j < src.cols; j ++) {
            tar_ptr[j][2] = src_ptr[j][0];
            tar_ptr[j][1] = src_ptr[j][1];
            tar_ptr[j][0] = src_ptr[j][2];
        }
    }
}

Conclusion:

cv::Mat::at: slowest
cv::MatIterator_: medium
cv::Mat.data
cv::Mat.ptr: fastest

Tips. Fastest way to do BGR2RGB + norm + hwc2chw on an image

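// Fuses BGR2RGB, normalization and HWC->CHW into a single pass over the image.
// mean[k] / std[k] are applied to source channel k, so they are expected in BGR order.
void preprocess_cv_pointer(cv::Mat src, float* tar, float* mean, float* std){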
    int area = src.rows * src.cols;
    int offset_ch0 = area * 0;
    int offset_ch1 = area * 1;
    int offset_ch2 = area * 2;

    for (int i = 0; i < src.rows; i ++) {
        cv::Vec3b* src_ptr = src.ptr<cv::Vec3b>(i);
        for (int j = 0; j < src.cols; j ++) {
            tar[offset_ch2++] = (src_ptr[j][0] / 255.0f - mean[0]) / std[0];
            tar[offset_ch1++] = (src_ptr[j][1] / 255.0f - mean[1]) / std[1];
            tar[offset_ch0++] = (src_ptr[j][2] / 255.0f - mean[2]) / std[2];
        }
    }
}
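
To make the resulting memory layout concrete, here is a small OpenCV-free sketch of the same fused pass on a tightly packed buffer. The function name is hypothetical; as in the routine above, the mean and std arrays are indexed by the source (BGR) channel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a tightly packed HWC/BGR uint8 image into a CHW/RGB float tensor,
// normalizing each value as (v / 255 - mean) / std in the same pass.
// mean_bgr / std_bgr are indexed by the SOURCE channel (0=B, 1=G, 2=R).
std::vector<float> hwc_bgr_to_chw_rgb(const std::uint8_t* src, int h, int w,
                                      const float mean_bgr[3], const float std_bgr[3]) {
    assert(h > 0 && w > 0);
    const int area = h * w;
    std::vector<float> dst(3 * static_cast<std::size_t>(area));
    for (int p = 0; p < area; ++p) {      // p: pixel index in row-major order
        for (int c = 0; c < 3; ++c) {     // c: source channel, 0=B, 1=G, 2=R
            const float v = (src[3 * p + c] / 255.0f - mean_bgr[c]) / std_bgr[c];
            dst[(2 - c) * area + p] = v;  // B -> plane 2, G -> plane 1, R -> plane 0
        }
    }
    return dst;
}
```

For a 1x2 image with pixels (B, G, R) = (0, 255, 51) and (255, 0, 102), zero mean and unit std, the output holds the R plane first (about 0.2, 0.4), then the G plane (1, 0), then the B plane (0, 1).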

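II. Designing a general model inference framework

2.1 Worker class

The constructor initializes a model according to the task type (classification, detection, segmentation); beyond that the class only needs an inference function.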

Worker::Worker(string onnxPath, logger::Level level, model::Params params) {
    m_logger = logger::create_logger(level);

    // Pick the trt_model subclass to create based on task_type; detection and segmentation will be added later
    if (params.task == model::task_type::CLASSIFICATION) 
        m_classifier = model::classifier::make_classifier(onnxPath, level, params);

}

void Worker::inference(string imagePath) {
    if (m_classifier != nullptr) {
        m_classifier->load_image(imagePath);
        m_classifier->inference();
    }
}

2.2 Model base class

Its member variables include the set of model parameters, the various path strings, the logger, the timer, and so on.

Model::Model(string onnx_path, logger::Level level, Params params) {
    m_onnxPath      = onnx_path;
    m_enginePath    = getEnginePath(onnx_path);
    m_workspaceSize = WORKSPACESIZE;
    m_logger        = make_shared<logger::Logger>(level);
    m_timer         = make_shared<timer::Timer>();
    m_params        = new Params(params);
}

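Its member functions cover model initialization, data loading, inference, and building/loading/saving the engine, plus several pure virtual functions (setup, CPU pre-/post-processing, GPU pre-/post-processing).

The pure virtual functions must be implemented by the subclasses.

setup allocates host/device memory, sets up the bindings, and creates the execution context needed for inference. Because the input/output tensors differ from task to task, setup has to be implemented in the subclass.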
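
The division of labor between the base class and its subclasses can be sketched without TensorRT as follows. Everything here (ModelBase, ToyClassifier, make_toy_classifier, the trace vector, the parameterless setup) is a hypothetical simplification for illustration, not the framework's actual code: the base class fixes the call order, and the subclass supplies the task-specific steps.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, TensorRT-free skeleton of the base-class idea.
class ModelBase {
public:
    virtual ~ModelBase() = default;

    // Shared driver: the call order lives here, the steps come from subclasses.
    void inference() {
        if (!m_ready) { setup(); m_ready = true; }
        assert(m_ready && "setup() must run before the first inference");
        preprocess_cpu();
        enqueue();
        postprocess_cpu();
    }
    const std::vector<std::string>& trace() const { return m_trace; }

protected:
    virtual void setup()           = 0;  // task-specific buffers and context
    virtual void preprocess_cpu()  = 0;
    virtual void postprocess_cpu() = 0;
    void enqueue() { m_trace.push_back("enqueue"); }  // stand-in for the forward pass
    std::vector<std::string> m_trace;                 // records which steps ran

private:
    bool m_ready = false;
};

class ToyClassifier : public ModelBase {
protected:
    void setup() override           { m_trace.push_back("setup"); }
    void preprocess_cpu() override  { m_trace.push_back("preprocess_cpu"); }
    void postprocess_cpu() override { m_trace.push_back("postprocess_cpu"); }
};

// Factory in the spirit of make_classifier: callers only see the base type.
std::unique_ptr<ModelBase> make_toy_classifier() {
    return std::unique_ptr<ModelBase>(new ToyClassifier());
}
```

Calling inference() twice on the returned object runs setup once and the three per-image steps twice, which is the behavior a detection or segmentation subclass would inherit unchanged.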

2.3 Classifier subclass

It mainly provides concrete implementations of the pure virtual functions declared in the Model base class.

E.g.

void Classifier::setup(void const* data, size_t size) {
    m_runtime     = shared_ptr<IRuntime>(createInferRuntime(*m_logger), destroy_trt_ptr<IRuntime>);
    m_engine      = shared_ptr<ICudaEngine>(m_runtime->deserializeCudaEngine(data, size), destroy_trt_ptr<ICudaEngine>);
    m_context     = shared_ptr<IExecutionContext>(m_engine->createExecutionContext(), destroy_trt_ptr<IExecutionContext>);
    m_inputDims   = m_context->getBindingDimensions(0);
    m_outputDims  = m_context->getBindingDimensions(1);
    // Most classification models have 1 input and 1 output, hence the fixed indices. Multi-output models such as BEVFusion need changes here

    CUDA_CHECK(cudaStreamCreate(&m_stream));
    
    m_inputSize     = m_params->img.h * m_params->img.w * m_params->img.c * sizeof(float);
    m_outputSize    = m_params->num_cls * sizeof(float);
    m_imgArea       = m_params->img.h * m_params->img.w;

    // Allocate host (pinned) and device memory together
    CUDA_CHECK(cudaMallocHost(&m_inputMemory[0], m_inputSize));
    CUDA_CHECK(cudaMallocHost(&m_outputMemory[0], m_outputSize));
    CUDA_CHECK(cudaMalloc(&m_inputMemory[1], m_inputSize));
    CUDA_CHECK(cudaMalloc(&m_outputMemory[1], m_outputSize));

    // Create m_bindings; later address lookups go through it
    m_bindings[0] = m_inputMemory[1];
    m_bindings[1] = m_outputMemory[1];
}

2.4 Logger class

A logging class that prints messages according to a configured level, which is cleaner than using cout directly.
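
A minimal sketch of level-based filtering, assuming four severity levels and an output stream passed in at construction. The names here are illustrative and are not the logger API used in the framework.

```cpp
#include <cassert>
#include <iostream>
#include <ostream>
#include <string>

// Hypothetical level-filtered logger (illustrative names only).
enum class Level { Error = 0, Warn = 1, Info = 2, Debug = 3 };

class Logger {
public:
    explicit Logger(Level level, std::ostream& out = std::cerr)
        : m_level(level), m_out(out) {}

    // Emit msg only if its severity is within the configured verbosity.
    // Returns true when the message was printed.
    bool log(Level severity, const std::string& msg) {
        if (static_cast<int>(severity) > static_cast<int>(m_level)) return false;
        m_out << '[' << name(severity) << "] " << msg << '\n';
        return true;
    }

private:
    static const char* name(Level l) {
        switch (l) {
            case Level::Error: return "error";
            case Level::Warn:  return "warn";
            case Level::Info:  return "info";
            case Level::Debug: return "debug";
        }
        assert(false && "unknown log level");
        return "?";
    }

    Level         m_level;
    std::ostream& m_out;
};
```

A Logger constructed with Level::Info prints errors, warnings and info messages and silently drops debug messages, so verbosity is changed in one place instead of by editing print statements.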

2.5 Timer class

Records start/stop times on the CPU and the GPU and computes the elapsed time between them.

    m_timer->start_cpu();
    /* code being timed */
    m_timer->stop_cpu();
    m_timer->duration_cpu<timer::Timer::ms>("preprocess(CPU)");
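
One way such a timer could be built for the CPU side is shown below. The method names mirror the calls above, but the internals (std::chrono::steady_clock, std::ratio unit tags, returning the value as well as printing it) are assumptions; the GPU side would typically rely on CUDA events and is left out.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <ratio>
#include <string>

// Hypothetical CPU-only timer sketch.
class Timer {
public:
    // Unit tags used as the template argument of duration_cpu.
    using s  = std::ratio<1, 1>;
    using ms = std::ratio<1, 1000>;
    using us = std::ratio<1, 1000000>;

    void start_cpu() {
        m_start   = std::chrono::steady_clock::now();
        m_running = true;
    }

    void stop_cpu() {
        assert(m_running && "stop_cpu() called before start_cpu()");
        m_stop    = std::chrono::steady_clock::now();
        m_running = false;
    }

    // Print and return the elapsed time in the unit selected by Period.
    template <typename Period>
    double duration_cpu(const std::string& label) const {
        const std::chrono::duration<double, Period> elapsed = m_stop - m_start;
        std::printf("%-24s %.3f\n", label.c_str(), elapsed.count());
        return elapsed.count();
    }

private:
    std::chrono::steady_clock::time_point m_start{};
    std::chrono::steady_clock::time_point m_stop{};
    bool m_running = false;
};
```

steady_clock is used rather than system_clock because it is monotonic, so a wall-clock adjustment during the measurement cannot produce a negative interval.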

2.6 process namespace

The process namespace defines functions such as preprocess_resize_cpu and preprocess_resize_gpu.

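III. int8 quantization

3.1 A calibrator class must inherit from one of the calibrator interfaces in nvinfer1. NVIDIA provides the following five:

nvinfer1::IInt8EntropyCalibrator2: an interface introduced in TensorRT 7.0 that implements an entropy-based INT8 calibrator. (Preferred by default.)
nvinfer1::IInt8MinMaxCalibrator
nvinfer1::IInt8EntropyCalibrator: the interface used before TensorRT 7.0, also an entropy-based INT8 calibrator. (Now deprecated.)
nvinfer1::IInt8LegacyCalibrator (percentile)
nvinfer1::IInt8Calibrator (deprecated)

3.2 Only four functions need to be implemented in the calibrator class: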

int         getBatchSize() const noexcept override {return m_batchSize;};
bool        getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override;
const void* readCalibrationCache(std::size_t &length) noexcept override;
void        writeCalibrationCache (const void* ptr, std::size_t length) noexcept override;

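getBatchSize: returns the batch size used for calibration. Note that different batch sizes give different calibration results; in general, larger is better.
getBatch: the images it supplies must go through exactly the same preprocessing as real inference; otherwise the dynamic range will be inaccurate.
readCalibrationCache: reads the calibration table, i.e. the dynamic range of each layer's output tensor collected in an earlier calibration run. Implementing it avoids repeating calibration every time an int8 engine is built.
writeCalibrationCache: writes the collected dynamic ranges to the calibration table.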
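
The bookkeeping behind getBatch can be sketched independently of TensorRT. BatchCursor below is a hypothetical helper, not part of any NVIDIA API: it hands out one full batch of image paths per call and reports when none is left, which is the point where a real getBatch would return false to end calibration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: walks a calibration image list one full batch at a time.
class BatchCursor {
public:
    BatchCursor(std::vector<std::string> images, std::size_t batch_size)
        : m_images(std::move(images)), m_batch(batch_size) {
        assert(batch_size > 0);
    }

    // Fills `out` with the next batch of paths and returns true, or returns
    // false once fewer than batch_size images remain. A real getBatch would
    // then load these images, apply the inference-time preprocessing, copy
    // them to the device and store the device pointer in bindings[0].
    bool next(std::vector<std::string>& out) {
        if (m_pos + m_batch > m_images.size()) return false;
        const auto first = m_images.begin() + static_cast<std::ptrdiff_t>(m_pos);
        out.assign(first, first + static_cast<std::ptrdiff_t>(m_batch));
        m_pos += m_batch;
        return true;
    }

private:
    std::vector<std::string> m_images;
    std::size_t              m_batch;
    std::size_t              m_pos = 0;
};
```

With five images and a batch size of 2, next succeeds twice and then returns false, leaving the fifth image unused rather than reading past the end of the list.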

3.3 Once the basic calibrator is implemented, pass it to the config when building the engine.

shared_ptr<Int8EntropyCalibrator> calibrator(new Int8EntropyCalibrator(
    64, 
    "calibration/calibration_list_imagenet.txt", 
    "calibration/calibration_table.txt",
    3 * 224 * 224, 224, 224));
config->setInt8Calibrator(calibrator.get());

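The calibration_list_imagenet.txt used here lists part of the ImageNet2012 test set; change it to suit your own setup. Note that the batch size (64 here) must be changed to a value that evenly divides the number of images in the calibration dataset, otherwise the program core dumps.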
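
The divisibility requirement can be checked before the calibrator is constructed. The two helpers below are hypothetical and only show the arithmetic: either trim the list to a whole number of batches, or pick a batch size that divides the list length exactly.

```cpp
#include <cassert>
#include <cstddef>

// Number of images that can be consumed as full batches; the remainder is dropped.
std::size_t usable_calibration_images(std::size_t num_images, std::size_t batch_size) {
    assert(batch_size > 0);
    return num_images - num_images % batch_size;
}

// Largest batch size <= max_batch that divides num_images exactly
// (1 always qualifies, so the function always returns a valid value).
std::size_t largest_dividing_batch(std::size_t num_images, std::size_t max_batch) {
    assert(num_images > 0 && max_batch > 0);
    for (std::size_t b = max_batch; b > 1; --b) {
        if (num_images % b == 0) return b;
    }
    return 1;
}
```

For a list of 1000 images and a batch size of 64, only 960 images form full batches; alternatively, 50 is the largest batch size up to 64 that divides 1000 exactly.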

Note that if the calibrator or the model architecture changes, calibration_table.txt must be deleted so that the dynamic ranges are recomputed; otherwise an error is reported.

Tips.

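While building the engine in practice I hit a core dump, together with a warning that the wrong cuDNN library version had been loaded. Running ldd ./bin/trt-infer showed libnvinfer.so.8 => /home/xx/opt/TensorRT-8.5.3.1/lib/libnvinfer.so.8, meaning the TensorRT version resolved at run time did not match the version specified in the Makefile configuration.

Check the server's dynamic library path, first remove the TensorRT library path that was being picked up, then add your own: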

echo $LD_LIBRARY_PATH
export LD_LIBRARY_PATH=""  # clear it first
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/home/xxxpersonal/others/TensorRT-8.4.0.6/lib

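IV. Timing Cache

Main purpose: it speeds up engine building by caching the kernel timing results collected during kernel selection, so the same measurements do not have to be repeated.

How to use the timing cache:

Create and save the timing cache: the first time an engine is built, TensorRT creates a timing cache. You can save it to a file for later reuse.
Load the timing cache: when building a new engine, load the saved timing cache to avoid repeating the time-consuming kernel tuning.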
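
The save/load half of this workflow is plain file I/O on an opaque byte blob. The helpers below (load_blob, save_blob, and the file name in the comment) are hypothetical; the trailing comment indicates where the TensorRT 8.x builder-config calls would go, as far as I understand that API, and should be checked against the TensorRT version in use.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <iterator>
#include <string>
#include <vector>

// Read a previously saved cache blob; an empty result means "no cache yet".
std::vector<char> load_blob(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Write a blob to disk; returns false if the file could not be written.
bool save_blob(const std::string& path, const void* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(static_cast<const char*>(data), static_cast<std::streamsize>(size));
    return static_cast<bool>(out);
}

// Where the TensorRT calls would go (not compiled here; `config` is an
// nvinfer1::IBuilderConfig*):
//   auto blob  = load_blob("model.timing.cache");
//   auto cache = config->createTimingCache(blob.data(), blob.size());
//   config->setTimingCache(*cache, /*ignoreMismatch=*/false);
//   ... build the engine ...
//   auto mem = config->getTimingCache()->serialize();  // nvinfer1::IHostMemory*
//   save_blob("model.timing.cache", mem->data(), mem->size());
```

Returning an empty vector for a missing file lets the first build and later builds share one code path: an empty blob simply starts a fresh cache.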

V. trt-engine-explorer

trt-engine-explorer is NVIDIA's official toolkit for analyzing the structure of an inference engine after TensorRT optimization. Link:

TensorRT tool: trt-engine-explorer