Deploying the yolov7_pose Model with TensorRT and the C++ API

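Although the title says yolov7_pose, the steps in this tutorial apply to deploying essentially any PyTorch model with TensorRT, provided the model can be exported to ONNX.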

Repository: https://github.com/WongKinYiu/yolov7/tree/pose

OS version: Ubuntu 18.04

Driver version: CUDA Version: 11.4

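During inference, TensorRT-based applications can run up to 40 times faster than on CPU-only platforms. With TensorRT you can optimize neural network models trained in all major frameworks, calibrate them for lower precision while preserving accuracy, and deploy them to hyperscale data centers, embedded devices, or automotive product platforms.

TensorRT is built on CUDA, NVIDIA's parallel programming model. It lets you use the libraries, development tools, and technologies in CUDA-X to optimize inference for all deep learning frameworks across artificial intelligence, autonomous machines, high-performance computing, and graphics.

TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, and natural language processing. Reduced-precision inference significantly lowers application latency, which is a requirement for many real-time services and for autonomous and embedded applications.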

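Our deployment has two main steps: convert the PyTorch model to an ONNX model, then convert the ONNX model to a TensorRT model.

The steps look simple, but there are quite a few pitfalls along the way.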

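1. Install TensorRT

First check your CUDA version. Run nvidia-smi in cmd on Windows or in a terminal on Ubuntu, and read the "CUDA Version" field in the header. Note that this is the highest CUDA version the installed driver supports; if the CUDA toolkit is installed as well, nvcc --version shows the toolkit version. In general, pick the newest TensorRT release you can download for your CUDA version, because older releases may not implement some operators. I lost a whole day to exactly this problem.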
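If you want to check the version from a script rather than by eye, the header that nvidia-smi prints can be parsed. The following is a minimal sketch; the helper names `parse_cuda_version` and `query_cuda_version` are made up for this example, and it assumes the header contains the text "CUDA Version: X.Y" as it does on the machine used here.

```python
import re
import subprocess


def parse_cuda_version(smi_output):
    """Return (major, minor) from the 'CUDA Version: X.Y' header field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def query_cuda_version():
    """Run nvidia-smi and parse its output; None if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_cuda_version(result.stdout)
```

On the setup described above, `query_cuda_version()` would be expected to return `(11, 4)`; on a machine without an NVIDIA driver it returns `None`.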

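Then download the matching release from the official site. Click Download; if you do not have an NVIDIA account yet, you need to register and log in. Official site: https://developer.nvidia.com/tensorrt

Accept the license agreement, then choose the release that matches your CUDA version. Mine is CUDA 11.4, for example. The Tar package is usually the one to pick.

Next, extract the tar or zip archive to wherever you want TensorRT to live. The package works as soon as it is extracted, with no separate installer. All that remains is to add its lib directory to the library search path and its bin directory to PATH.

Ubuntu: open ~/.bashrc with vim and append the following two lines at the end of the file.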

export LD_LIBRARY_PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH
export PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/bin:$PATH

Replace the TensorRT path with the directory you extracted the archive to. Then source the file in the current terminal:

source ~/.bashrc

Then run trtexec directly. If it prints its help text without errors, TensorRT is installed correctly.

~/Downloads trtexec
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec -h
=== Model Options ===
  --uff=<file>                UFF model
  --onnx=<file>               ONNX model
  --model=<file>              Caffe model (default = no model, random weights used)
  --deploy=<file>             Caffe prototxt file
  --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is 
  ......
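If trtexec is not found, the usual cause is that one of the two export lines did not take effect in the current shell. Here is a small sketch for checking both at once; `check_tensorrt_env` is a hypothetical helper written for this article, and the TensorRT root you pass in is whatever directory you extracted the archive to.

```python
import os
import shutil


def check_tensorrt_env(trt_root, environ=None, which=shutil.which):
    """Check that <trt_root>/lib is on LD_LIBRARY_PATH and trtexec is on PATH."""
    environ = os.environ if environ is None else environ
    lib_dir = os.path.join(trt_root, "lib")
    # Split the search path and drop empty entries left by stray separators
    lib_paths = [
        p for p in environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p
    ]
    return {
        "lib_on_ld_library_path": lib_dir in lib_paths,
        # Full path of the trtexec binary, or None if it cannot be found
        "trtexec_path": which("trtexec", path=environ.get("PATH", "")),
    }
```

Calling `check_tensorrt_env("/home/ubuntu/mySoftware/TensorRT-8.6.1.6")` in a fresh terminal should report `True` for the library path and a non-empty path for trtexec once ~/.bashrc has been sourced.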

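2. Convert the PyTorch model to an ONNX model

Let's start with the YOLO project. The project directory contains models/export.py.

Open the file and you will see the following argument definitions.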

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='./yolov5s.pt', help='weights path')
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='image size')  # height, width
    parser.add_argument('--batch-size', type=int, default=1, help='batch size')
    parser.add_argument('--grid', action='store_true', help='export Detect() layer grid')
    parser.add_argument('--device', default='cpu', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--dynamic', action='store_true', help='dynamic ONNX axes')  # ONNX-only
    parser.add_argument('--simplify', action='store_true', help='simplify ONNX model')  # ONNX-only
    parser.add_argument('--export-nms', action='store_true', help='export the nms part in ONNX model')  # ONNX-only, #opt.grid has to be set True for nms export to work
    opt = parser.parse_args()
    opt.img_size *= 2 if len(opt.img_size) == 1 else 1  # expand
    print(opt)
    set_logging()
    t = time.time()

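Set the arguments to suit your model. Note that if you have changed the number of output classes or the number of keypoints, you must also edit the network code by hand before exporting the NMS layer. In models/common.py, around line 361, change the arguments passed to non_max_suppression: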

class NMS(nn.Module):
    # Non-Maximum Suppression (NMS) module
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.25, kpt_label=False):
        super(NMS, self).__init__()
        self.conf=conf
        self.kpt_label = kpt_label


    def forward(self, x):
        return non_max_suppression(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label,nc=2,nkpt=3)

class NMS_Export(nn.Module):
    # Non-Maximum Suppression (NMS) module used while exporting ONNX model
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.001, kpt_label=False):
        super(NMS_Export, self).__init__()
        self.conf = conf
        self.kpt_label = kpt_label

    def forward(self, x):
        return non_max_suppression_export(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label,nc=2)

Change nc and nkpt to your own values. In my case there are 2 classes and 3 keypoints. Then export the model.

python models/export.py --img-size 960 --weights /home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.pt --grid --export-nms --simplify
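The command passes a single value to --img-size even though the argument is defined as a height and width pair. This works because of the `opt.img_size *= 2 if len(opt.img_size) == 1 else 1` line in the argument handling shown earlier. The sketch below isolates just that behaviour; `parse_img_size` is a name chosen for this example and is not part of export.py.

```python
import argparse


def parse_img_size(argv):
    """Parse --img-size the way the export script does and return [height, width]."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640],
                        help='image size')  # height, width
    opt = parser.parse_args(argv)
    # A single value N is repeated to give [N, N]; two values are kept as given
    opt.img_size *= 2 if len(opt.img_size) == 1 else 1
    return opt.img_size


print(parse_img_size(['--img-size', '960']))         # [960, 960]
print(parse_img_size(['--img-size', '640', '960']))  # [640, 960]
```

So `--img-size 960` exports a model with a 960 x 960 input, which is the shape used throughout the rest of this article.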

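If everything goes well, you get a model in ONNX format. Open https://netron.app/ and load the ONNX file to view the model graph.

What matters here are the model's inputs and outputs and their shapes.

The graph shows that my model's input is named images and has shape 1 * 3 * 960 * 960, and that the output is named detections with a shape that is not obvious from the graph. When the shape is unclear, run the model once with onnxruntime and inspect the result: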

import onnxruntime
import numpy as np
import cv2
# Path to your ONNX model file
onnx_model_path = '/home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.onnx'
# Create the ONNX Runtime inference session
sess = onnxruntime.InferenceSession(onnx_model_path)

# Get the input name and shape
input_name = sess.get_inputs()[0].name
input_shape = sess.get_inputs()[0].shape

# Path to the test image
image_path = '/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg'

# Read the image with OpenCV
image = cv2.imread(image_path)
# Resize the image to the model's input size (width, height)
resized_image = cv2.resize(image, (input_shape[3], input_shape[2]))
# Convert to float and normalize to [0, 1]
input_data = resized_image.astype(np.float32) / 255.0
# Reorder HWC -> CHW and add the batch dimension the model expects
input_data = np.transpose(input_data, [2, 0, 1])
input_data = np.expand_dims(input_data, axis=0)

# Run inference
outputs = sess.run(None, {input_name: input_data})

# Print every output of the model
for i, output_data in enumerate(outputs):
    print(f"Output {i + 1}: {output_data}")
print(f"Output  {output_data.shape}")

Running the script prints the following:
Output 1: [[8.01661621e+02 1.53809937e+02 9.72689453e+02 3.77949707e+02
  4.21597920e-02 0.00000000e+00 5.15294671e-02 8.84101624e+02
  2.51692810e+02 9.91612077e-01 9.03469177e+02 1.68072296e+02
  6.35425091e-01 8.85691345e+02 1.72709320e+02 7.30822206e-01]
 [7.85901917e+02 1.61294067e+02 9.64655701e+02 3.66809448e+02
  4.08335961e-02 1.00000000e+00 6.32926583e-01 8.77714966e+02
  2.57085205e+02 9.89280879e-01 8.91954224e+02 1.80863663e+02
  2.32283741e-01 8.78342041e+02 1.87161697e+02 5.20734370e-01]
 [7.05231201e+02 3.90309601e+02 7.51886230e+02 4.35935760e+02
  1.86153594e-02 1.00000000e+00 6.94520175e-01 7.35046814e+02
  4.11621490e+02 7.23196447e-01 7.14923584e+02 4.14582092e+02
  4.62090850e-01 7.09832214e+02 4.12042603e+02 2.80124098e-01]
 [4.01937828e+01 4.64705994e+02 1.51267151e+02 6.35167419e+02
  1.55489137e-02 1.00000000e+00 9.99976933e-01 8.51227875e+01
  5.72096252e+02 9.97074127e-01 8.59449158e+01 4.89000427e+02
  9.83235717e-01 8.48072968e+01 5.18143494e+02 9.95639443e-01]
 [4.67657043e+02 2.47014786e+02 6.09315125e+02 4.11179565e+02
  1.50994565e-02 0.00000000e+00 1.29642010e-01 5.45577820e+02
  3.71885773e+02 9.93896723e-01 5.56157104e+02 3.50972717e+02
  9.97142434e-01 5.54454590e+02 3.20836670e+02 9.76849675e-01]
 [3.69356445e+02 1.81159134e+01 4.91651611e+02 1.81579437e+02
  1.44530777e-02 1.00000000e+00 9.98439074e-01 4.16761169e+02
  1.16163483e+02 9.97292042e-01 4.29588745e+02 2.69206352e+01
  9.79286790e-01 4.28487366e+02 8.01969910e+01 9.97563720e-01]
 [7.12836548e+02 3.89805634e+02 7.66137817e+02 4.36001556e+02
  1.32421134e-02 0.00000000e+00 2.13130921e-01 7.40284363e+02
  4.09640594e+02 7.56286979e-01 7.18195129e+02 4.12563293e+02
  1.05279446e-01 7.11785156e+02 4.14483521e+02 1.00254148e-01]
 [7.01546204e+02 3.92902222e+02 7.31227966e+02 4.25415100e+02
  1.30005283e-02 1.00000000e+00 9.94012475e-01 7.22401733e+02
  4.12053406e+02 4.85429347e-01 7.12319214e+02 4.13364197e+02
  7.06610680e-01 7.13084656e+02 4.11362488e+02 4.67233360e-01]
 [6.80663696e+02 4.66796997e+02 7.09215454e+02 4.98112915e+02
  1.06324852e-02 0.00000000e+00 6.49383068e-02 6.97597473e+02
  4.87214142e+02 9.42029715e-01 6.90804749e+02 4.85028137e+02
  9.82081532e-01 6.85866089e+02 4.70633820e+02 9.92424369e-01]]
Output  (9, 16)

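The output shows that for this image the result has shape 9 * 16. The number of boxes that survive NMS is not fixed, so in general the shape is x * 16. Looking closely at the 16 values in each row, every row has this layout: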

[x1, y1, x2, y2, conf, prob1, prob2, kpt1_x, kpt1_y, kpt1_conf, kpt2_x, kpt2_y, kpt2_conf, kpt3_x, kpt3_y, kpt3_conf]

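The first four values are the corners of the detection box, followed by the confidence, the class probabilities, and then each keypoint's coordinates and confidence. One caveat: in the sample output above the sixth value is always exactly 0 or 1, so it may actually be a class index rather than a probability. Check this against your own export before relying on it.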
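To make the layout concrete, here is a minimal sketch that splits one output row into named fields. It assumes the layout described above (4 box values, 1 confidence, nc class values, then x, y and confidence for each keypoint); `row_width` and `parse_detection` are hypothetical helpers, not functions from the yolov7 repository.

```python
def row_width(nc, nkpt):
    """Number of floats per detection: box (4) + conf (1) + nc + 3 per keypoint."""
    return 4 + 1 + nc + 3 * nkpt


def parse_detection(row, nc=2, nkpt=3):
    """Split one flat output row into box, confidence, class values and keypoints."""
    row = [float(v) for v in row]
    expected = row_width(nc, nkpt)
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    start = 5 + nc  # index of the first keypoint value
    keypoints = [
        tuple(row[start + 3 * k: start + 3 * k + 3])  # (x, y, conf)
        for k in range(nkpt)
    ]
    return {
        "box": tuple(row[0:4]),           # x1, y1, x2, y2
        "conf": row[4],
        "class_probs": tuple(row[5:start]),
        "keypoints": keypoints,
    }
```

With nc=2 and nkpt=3, `row_width` gives 16, which matches the width of the printed output. If you train with a different number of classes or keypoints, the row width changes accordingly and the C++ structures later in this article need the same adjustment.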

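3. Convert the ONNX model to a TensorRT .engine model

Run the command below and wait for the conversion to finish. Replace yolov7.onnx with the path of the ONNX file you exported.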

trtexec --onnx=yolov7.onnx --fp16 --saveEngine=yolov7.engine

If the conversion fails, for example because an operator is not supported, try updating TensorRT to the latest version.
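If you convert models regularly, it can help to drive trtexec from a script so that a failed conversion shows up as a non-zero return code. This is only a sketch: it uses the three flags from the command above and nothing else, and `build_trtexec_command` and `convert` are names invented for this example.

```python
import shlex
import subprocess


def build_trtexec_command(onnx_path, engine_path, fp16=True):
    """Build the trtexec argument list for an ONNX -> engine conversion."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd


def convert(onnx_path, engine_path, fp16=True):
    """Run trtexec and return its exit code (0 means the engine was built)."""
    cmd = build_trtexec_command(onnx_path, engine_path, fp16)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd).returncode
```

For example, `convert("yolov7.onnx", "yolov7.engine")` runs the same conversion as the command shown above. It requires trtexec to be on PATH.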

4. C++ deployment

#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <fstream>
#include <vector>
#include <opencv2/opencv.hpp>
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#define INPUT_W 960
#define INPUT_H 960
#define DEVICE 0  // GPU id

#define CONF_THRESH 0.2

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char *msg) noexcept override {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;

#define CHECK(status) \
    do\
    {\
        auto ret = (status);\
        if (ret != 0)\
        {\
            std::cerr << "Cuda failure: " << ret << std::endl;\
            abort();\
        }\
    } while (0)

float *blobFromImage(cv::Mat &img) {
    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);

    float *blob = new float[img.total() * 3];
    int channels = 3;
    int img_h = img.rows;
    int img_w = img.cols;
    for (int c = 0; c < channels; c++) {
        for (int h = 0; h < img_h; h++) {
            for (int w = 0; w < img_w; w++) {
                blob[c * img_w * img_h + h * img_w + w] =
                        (((float) img.at<cv::Vec3b>(h, w)[c]) / 255.0f);
            }
        }
    }
    return blob;
}

cv::Mat static_resize(cv::Mat &img) {
    float r = std::min(INPUT_W / (img.cols * 1.0), INPUT_H / (img.rows * 1.0));
    int unpad_w = r * img.cols;
    int unpad_h = r * img.rows;
    cv::Mat re(unpad_h, unpad_w, CV_8UC3);
    cv::resize(img, re, re.size());
    cv::Mat out(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(114, 114, 114));  // cv::Mat takes (rows, cols)
    re.copyTo(out(cv::Rect(0, 0, re.cols, re.rows)));
    return out;
}

const char *INPUT_BLOB_NAME = "images";
const char *OUTPUT_BLOB_NAME = "detections";
static Logger gLogger;
static constexpr int MAX_OUTPUT_BBOX_COUNT = 100;
static constexpr int CLASS_NUM = 2;
static constexpr int LOCATIONS = 4;
static constexpr int KEY_POINTS_NUM = 3;
struct Keypoint {
    float x;
    float y;
    float kpt_conf;
};

struct alignas(float) Detection {
    // x1 y1 x2 y2 (corner format, as produced by the exported NMS layer)
    float bbox[LOCATIONS];
    float conf;  // bbox_conf * cls_conf
    float prob[CLASS_NUM]; // Probabilities for each class
    // 3 keypoints
    Keypoint kpts[KEY_POINTS_NUM];
};


void
doInference(IExecutionContext &context, float *input, float *output, const int output_size, const int input_shape) {
    const ICudaEngine &engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void *buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);

    assert(engine.getBindingDataType(inputIndex) == nvinfer1::DataType::kFLOAT);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
    assert(engine.getBindingDataType(outputIndex) == nvinfer1::DataType::kFLOAT);
    // int mBatchSize = engine.getMaxBatchSize();

    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], input_shape * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], output_size * sizeof(float)));

    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, input_shape * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueueV2(buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], output_size * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

static constexpr int DETECTION_SIZE = sizeof(Detection) / sizeof(float);

static void
postprocess_decode(float *feat_blob, float prob_threshold,std::vector<Detection> &objects_map) {
    for (int i = 0; i < MAX_OUTPUT_BBOX_COUNT; i++) {

        int base_index = i * DETECTION_SIZE;  // Calculate the base index for the current detection
        if (feat_blob[base_index + LOCATIONS] <= prob_threshold)
            continue;
        Detection det;
        // Copy the detection information from feat_blob to the Detection structure
        memcpy(&det, &feat_blob[base_index], DETECTION_SIZE * sizeof(float));
        objects_map.push_back(det);
    }
}

int main() {
    char *trtModelStream{nullptr};
    cudaSetDevice(DEVICE);
    size_t size{0};
    const char *engine_file_path = "/home/ubuntu/GITHUG/yolov7_pose/yolov7.engine";
    std::ifstream file(engine_file_path, std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    } else {
        std::cerr << "Failed to open engine file: " << engine_file_path << std::endl;
        return -1;
    }
    // Deserialize the engine from the bytes read above
    IRuntime *runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    IExecutionContext *context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream;
    // auto out_dims = engine->getBindingDimensions(1);


    int input_size = 1 * 3 * 960 * 960;
    int output_size = MAX_OUTPUT_BBOX_COUNT * 16 * 1;
    static float *prob = new float[output_size]();  // zero-initialized so unused rows fail the confidence check
    const char *input_image_path = "/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg";
    cv::Mat img = cv::imread(input_image_path);
    cv::Mat pr_img = static_resize(img);
    float *blob;
//    cv::imshow("Image", pr_img);
    blob = blobFromImage(pr_img);

    cv::waitKey(200);
    // Close any open windows
    cv::destroyAllWindows();
    doInference(*context, blob, prob, output_size, input_size);

    std::vector<Detection> objects_map;
    for (int i = 0; i < prob[0] && i < MAX_OUTPUT_BBOX_COUNT; i++) {
        std::cout << ": " << prob[i] << std::endl;
    }
    postprocess_decode(prob, CONF_THRESH, objects_map);
    float r_w = INPUT_W / (img.cols * 1.0);
    float r_h = INPUT_H / (img.rows * 1.0);
    cv::cvtColor(pr_img, pr_img, cv::COLOR_RGB2BGR);
    for (const auto &det: objects_map) {
            // Access other information in the Detection structure as needed
            // Example: Print bbox coordinates
            std::cout << "  Bbox: ";
            for (int i = 0; i < LOCATIONS; i++) {
                std::cout << det.bbox[i] << " ";
            }
            float r = 0.0;
            if (img.rows <= img.cols) {
                r = r_w;
            } else {
                r = r_h;
            }
            cv::Point pt1(det.bbox[0]/r, det.bbox[1]/r);
            cv::Point pt2(det.bbox[2]/r, det.bbox[3]/r);

            cv::rectangle(img, pt1, pt2, cv::Scalar(0, 255, 0), 2);

            cv::Point point1(det.kpts[0].x / r, det.kpts[0].y / r);
            cv::Point point2(det.kpts[1].x / r, det.kpts[1].y / r);
            cv::Point point3(det.kpts[2].x / r, det.kpts[2].y / r);
            // Draw line segments between consecutive keypoints
            cv::line(img, point1, point2, cv::Scalar(0, 0, 255), 2);  // Scalar is (B, G, R), so this is red
            cv::line(img, point2, point3, cv::Scalar(255, 0, 0), 2);  // Scalar is (B, G, R), so this is blue


            cv::imshow("Rectangle", img);
            cv::waitKey(0);
            std::cout << std::endl;
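    }
    // Release host buffers and TensorRT objects (TensorRT 8 allows plain delete)
    delete[] blob;
    delete[] prob;
    delete context;
    delete engine;
    delete runtime;
    return 0;
}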

This is my demo, along with the final result.
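One detail of the demo that is easy to get wrong is the coordinate mapping. The image is resized with a single ratio r and pasted into the top-left corner of a 960 x 960 canvas, so the network's coordinates are mapped back to the original image by dividing by r, with no offset. The sketch below reproduces that arithmetic in Python so it can be checked on its own; the function names are made up for this example.

```python
def letterbox_params(img_w, img_h, input_w=960, input_h=960):
    """Return (r, unpad_w, unpad_h) for a top-left aligned letterbox resize."""
    r = min(input_w / img_w, input_h / img_h)
    return r, int(r * img_w), int(r * img_h)


def to_original(x, y, r):
    """Map a point from network input coordinates back to the original image."""
    # Padding is only added on the right and bottom, so no offset is subtracted
    return x / r, y / r


r, unpad_w, unpad_h = letterbox_params(1920, 1080)
print(r, unpad_w, unpad_h)        # 0.5 960 540
print(to_original(480, 270, r))   # (960.0, 540.0)
```

If you change the preprocessing to center the image on the canvas instead, the padding offsets have to be subtracted before dividing by r.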

The key part of the code is the section that parses the model output, which you can use as a reference.
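The parsing step treats the output buffer as up to 100 rows of 16 floats and keeps the rows whose confidence (index 4) is above the threshold. Here is a minimal Python sketch of the same logic, handy for checking results against the onnxruntime output. `decode_output` is a hypothetical name, and the defaults mirror the constants in the C++ demo.

```python
def decode_output(flat, conf_thresh=0.2, row_len=16, max_rows=100):
    """Split a flat output buffer into rows and keep those above the threshold."""
    detections = []
    n_rows = min(len(flat) // row_len, max_rows)
    for i in range(n_rows):
        row = flat[i * row_len:(i + 1) * row_len]
        # Index 4 holds the detection confidence in the layout used here
        if row[4] <= conf_thresh:
            continue
        detections.append(list(row))
    return detections
```

This filter only works if the unused part of the buffer is zero-filled. Rows past the real detections then have a confidence of 0 and are skipped, whereas an uninitialized buffer can yield spurious detections.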