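Deploying the yolov7_pose model with TensorRT and the C++ API
Although the title mentions only yolov7_pose, the workflow below applies to most PyTorch models that can be exported to ONNX.
Repository: https://github.com/WongKinYiu/yolov7/tree/pose
OS: Ubuntu 18.04
Driver: CUDA Version: 11.4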
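NVIDIA states that during inference, TensorRT-based applications can run up to 40 times faster than on CPU-only platforms. With TensorRT you can optimize neural network models trained in all major frameworks, calibrate them for lower precision while keeping accuracy high, and deploy them to hyperscale data centers, embedded devices, or automotive platforms.
TensorRT is built on CUDA, NVIDIA's parallel programming model, so it can use the libraries, development tools, and technologies in CUDA-X to optimize inference for AI, autonomous machines, high-performance computing, and graphics across deep learning frameworks.
TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, and natural language processing. Reduced-precision inference significantly lowers application latency, which is exactly what many real-time services and autonomous and embedded applications require.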
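The deployment has two main steps: convert the PyTorch model to an ONNX model, then convert the ONNX model to a TensorRT engine.
The steps look simple, but there are quite a few pitfalls along the way.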
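1. Installing TensorRT
First check your CUDA version: run nvidia-smi in cmd on Windows, or in a terminal on Ubuntu. In general, pick the newest TensorRT release you can download for that CUDA version, so you are less likely to hit operators that are not implemented. I lost a whole day to this.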
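If you would rather read the version from a script than by eye, the sketch below parses the `CUDA Version: X.Y` field out of the text that nvidia-smi prints. The helper names `parse_cuda_version` and `query_cuda_version` are made up for this example. Keep in mind that this field reports the highest CUDA version the installed driver supports.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_cuda_version(smi_text: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) parsed from nvidia-smi output, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def query_cuda_version() -> Optional[Tuple[int, int]]:
    """Run nvidia-smi and parse its output; None if the tool is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_cuda_version(result.stdout)


if __name__ == "__main__":
    # The sample string mirrors the "CUDA Version: 11.4" shown at the top of this post.
    print(parse_cuda_version("CUDA Version: 11.4"))
    print(query_cuda_version())
```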
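Then download the matching release from the official site: click Download, and if you do not have an NVIDIA account yet, register and log in. Site: https://developer.nvidia.com/tensorrt
Accept the license agreement and pick the package that matches your CUDA version. Mine is CUDA 11.4, for example. The Tar package is usually the one to choose.
Next, extract the tar (or zip) package to wherever you want it to live. The package works as soon as it is extracted, with no separate install step. All we need to do is add its bin directory to PATH and its lib directory to the library search path.
Ubuntu: open ~/.bashrc with vim and append the following two lines to the end of the file.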
bash
export LD_LIBRARY_PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/lib:$LD_LIBRARY_PATH
export PATH=/home/ubuntu/mySoftware/TensorRT-8.6.1.6/bin:$PATH
Replace the TensorRT path with the directory you extracted it to. Then source the file in the current terminal:
bash
source ~/.bashrc
Then run trtexec directly. If it prints its help text without errors, TensorRT is installed correctly.
bash
~/Downloads trtexec
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec -h
=== Model Options ===
--uff=<file> UFF model
--onnx=<file> ONNX model
--model=<file> Caffe model (default = no model, random weights used)
--deploy=<file> Caffe prototxt file
--output=<name>[,<name>]* Output names (it can be specified multiple times); at least one output is
......
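If trtexec is not found, the usual cause is that the two export lines did not take effect in the current shell. The following is a small sketch for checking both variables from Python, using only the standard library. The helper names `find_tool` and `library_dirs_containing` are invented for this example.

```python
import os
import shutil
from typing import List, Optional


def find_tool(name: str = "trtexec") -> Optional[str]:
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def library_dirs_containing(token: str, env_value: str) -> List[str]:
    """Return the entries of a path list (e.g. LD_LIBRARY_PATH) that contain token."""
    return [entry for entry in env_value.split(os.pathsep) if token in entry]


if __name__ == "__main__":
    print("trtexec:", find_tool("trtexec"))
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    print("TensorRT lib dirs:", library_dirs_containing("TensorRT", ld_path))
```

If the first line prints None, check PATH. If the second list is empty, check LD_LIBRARY_PATH.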
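2. Converting the PyTorch model to ONNX
Let's start with the YOLO project: the project directory contains models/export.py.
Open the file and you will see the following arguments.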
python
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='./yolov5s.pt', help='weights path')
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='image size')  # height, width
    parser.add_argument('--batch-size', type=int, default=1, help='batch size')
    parser.add_argument('--grid', action='store_true', help='export Detect() layer grid')
    parser.add_argument('--device', default='cpu', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--dynamic', action='store_true', help='dynamic ONNX axes')  # ONNX-only
    parser.add_argument('--simplify', action='store_true', help='simplify ONNX model')  # ONNX-only
    parser.add_argument('--export-nms', action='store_true', help='export the nms part in ONNX model')  # ONNX-only, #opt.grid has to be set True for nms export to work
    opt = parser.parse_args()
    opt.img_size *= 2 if len(opt.img_size) == 1 else 1  # expand
    print(opt)
    set_logging()
    t = time.time()
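Set the arguments to suit your model. Note that if you changed the number of output classes or the number of keypoints, you have to edit the network code by hand before exporting the NMS layer: in models/common.py, around line 361, change the arguments passed to non_max_suppression.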
python
class NMS(nn.Module):
    # Non-Maximum Suppression (NMS) module
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.25, kpt_label=False):
        super(NMS, self).__init__()
        self.conf = conf
        self.kpt_label = kpt_label

    def forward(self, x):
        return non_max_suppression(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label, nc=2, nkpt=3)


class NMS_Export(nn.Module):
    # Non-Maximum Suppression (NMS) module used while exporting ONNX model
    iou = 0.45  # IoU threshold
    classes = None  # (optional list) filter by class

    def __init__(self, conf=0.001, kpt_label=False):
        super(NMS_Export, self).__init__()
        self.conf = conf
        self.kpt_label = kpt_label

    def forward(self, x):
        return non_max_suppression_export(x[0], conf_thres=self.conf, iou_thres=self.iou, classes=self.classes, kpt_label=self.kpt_label, nc=2)
Change nc and nkpt to your own settings. In my case there are 2 classes and 3 keypoints. Then export the model with the following command.
sh
python models/export.py --img-size 960 --weights /home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.pt --grid --export-nms --simplify
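Because nc and nkpt decide how many values each detection row holds, it helps to work out the expected row width before looking at the exported model. The sketch below assumes each row is laid out as 4 box values, 1 confidence, nc class fields, and 3 values (x, y, confidence) per keypoint. That assumption matches the 16 columns my model produces with nc=2 and nkpt=3, but verify it against your own export. The function name `detection_row_width` is made up for this example.

```python
def detection_row_width(nc: int, nkpt: int) -> int:
    """Number of floats per detection row under the assumed layout."""
    box_values = 4             # x1, y1, x2, y2
    conf_values = 1            # detection confidence
    class_values = nc          # one field per class
    keypoint_values = 3 * nkpt  # x, y, confidence for each keypoint
    return box_values + conf_values + class_values + keypoint_values


if __name__ == "__main__":
    # 2 classes and 3 keypoints, as used in this tutorial
    print(detection_row_width(nc=2, nkpt=3))
```

If the width you compute here differs from the last dimension of the ONNX output, the nc or nkpt edit in models/common.py probably did not take effect.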
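If all goes well, we get a model in ONNX format. Open https://netron.app/ and load the ONNX file to view the model graph.
What we care about are the model's inputs and outputs, and their shapes.
The graph shows that my model's input is named images with shape 1 * 3 * 960 * 960, and the output is named detections, whose shape is not obvious yet. When the shape is unclear, we can run the model once with onnxruntime and print it.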
python
import onnxruntime
import numpy as np
import cv2
# Path to your ONNX model file
onnx_model_path = '/home/ubuntu/GITHUG/yolov7_pose/runs/train/exp2/weights/best.onnx'
# Create the ONNX Runtime inference session
sess = onnxruntime.InferenceSession(onnx_model_path)
# Get the input name and shape
input_name = sess.get_inputs()[0].name
input_shape = sess.get_inputs()[0].shape
# Path to the test image
image_path = '/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg'
# Read the image with OpenCV
image = cv2.imread(image_path)
# Resize the image to the model's input size (width, height)
resized_image = cv2.resize(image, (input_shape[3], input_shape[2]))
# Convert to float32 and normalize to [0, 1]
input_data = resized_image.astype(np.float32) / 255.0
# Rearrange HWC -> CHW and add the batch dimension expected by the model
input_data = np.transpose(input_data, [2, 0, 1])
input_data = np.expand_dims(input_data, axis=0)
# Run inference
outputs = sess.run(None, {input_name: input_data})
# Print every model output and its shape
for i, output_data in enumerate(outputs):
    print(f"Output {i + 1}: {output_data}")
    print(f"Output {output_data.shape}")
python
Output 1: [[8.01661621e+02 1.53809937e+02 9.72689453e+02 3.77949707e+02
4.21597920e-02 0.00000000e+00 5.15294671e-02 8.84101624e+02
2.51692810e+02 9.91612077e-01 9.03469177e+02 1.68072296e+02
6.35425091e-01 8.85691345e+02 1.72709320e+02 7.30822206e-01]
[7.85901917e+02 1.61294067e+02 9.64655701e+02 3.66809448e+02
4.08335961e-02 1.00000000e+00 6.32926583e-01 8.77714966e+02
2.57085205e+02 9.89280879e-01 8.91954224e+02 1.80863663e+02
2.32283741e-01 8.78342041e+02 1.87161697e+02 5.20734370e-01]
[7.05231201e+02 3.90309601e+02 7.51886230e+02 4.35935760e+02
1.86153594e-02 1.00000000e+00 6.94520175e-01 7.35046814e+02
4.11621490e+02 7.23196447e-01 7.14923584e+02 4.14582092e+02
4.62090850e-01 7.09832214e+02 4.12042603e+02 2.80124098e-01]
[4.01937828e+01 4.64705994e+02 1.51267151e+02 6.35167419e+02
1.55489137e-02 1.00000000e+00 9.99976933e-01 8.51227875e+01
5.72096252e+02 9.97074127e-01 8.59449158e+01 4.89000427e+02
9.83235717e-01 8.48072968e+01 5.18143494e+02 9.95639443e-01]
[4.67657043e+02 2.47014786e+02 6.09315125e+02 4.11179565e+02
1.50994565e-02 0.00000000e+00 1.29642010e-01 5.45577820e+02
3.71885773e+02 9.93896723e-01 5.56157104e+02 3.50972717e+02
9.97142434e-01 5.54454590e+02 3.20836670e+02 9.76849675e-01]
[3.69356445e+02 1.81159134e+01 4.91651611e+02 1.81579437e+02
1.44530777e-02 1.00000000e+00 9.98439074e-01 4.16761169e+02
1.16163483e+02 9.97292042e-01 4.29588745e+02 2.69206352e+01
9.79286790e-01 4.28487366e+02 8.01969910e+01 9.97563720e-01]
[7.12836548e+02 3.89805634e+02 7.66137817e+02 4.36001556e+02
1.32421134e-02 0.00000000e+00 2.13130921e-01 7.40284363e+02
4.09640594e+02 7.56286979e-01 7.18195129e+02 4.12563293e+02
1.05279446e-01 7.11785156e+02 4.14483521e+02 1.00254148e-01]
[7.01546204e+02 3.92902222e+02 7.31227966e+02 4.25415100e+02
1.30005283e-02 1.00000000e+00 9.94012475e-01 7.22401733e+02
4.12053406e+02 4.85429347e-01 7.12319214e+02 4.13364197e+02
7.06610680e-01 7.13084656e+02 4.11362488e+02 4.67233360e-01]
[6.80663696e+02 4.66796997e+02 7.09215454e+02 4.98112915e+02
1.06324852e-02 0.00000000e+00 6.49383068e-02 6.97597473e+02
4.87214142e+02 9.42029715e-01 6.90804749e+02 4.85028137e+02
9.82081532e-01 6.85866089e+02 4.70633820e+02 9.92424369e-01]]
Output (9, 16)
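The printed shape shows that my output here is 9 * 16. Because the number of boxes left after the NMS layer is not fixed, the general shape is x * 16. Looking closely at the 16 values, every row has the form
python
[x1, y1, x2, y2, conf, prob1, prob2, kpt1_x, kpt1_y, kpt1_conf, kpt2_x, kpt2_y, kpt2_conf, kpt3_x, kpt3_y, kpt3_conf]
The first four values are the bounding box corners, followed by the confidence, the two class fields, and then each keypoint's coordinates and confidence. In the sample output above, the sixth value is always exactly 0 or 1, so it may actually be a class index rather than a probability. Check this against your own model.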
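To make the layout concrete, here is a minimal sketch that splits one 16-value row into named fields. It assumes the layout described above with nc=2 and nkpt=3. The function name `parse_detection` and the dictionary keys are my own choices for illustration and are not part of the yolov7 code.

```python
from typing import Dict, Sequence

NC = 2    # number of classes used in this tutorial
NKPT = 3  # number of keypoints used in this tutorial


def parse_detection(row: Sequence[float], nc: int = NC, nkpt: int = NKPT) -> Dict[str, object]:
    """Split one detection row into bbox, conf, class fields and keypoints."""
    expected = 4 + 1 + nc + 3 * nkpt
    if len(row) != expected:
        raise ValueError(f"expected {expected} values per row, got {len(row)}")
    kpt_start = 5 + nc
    keypoints = [
        tuple(row[kpt_start + 3 * k: kpt_start + 3 * k + 3])  # (x, y, conf)
        for k in range(nkpt)
    ]
    return {
        "bbox": tuple(row[0:4]),       # x1, y1, x2, y2
        "conf": row[4],
        "prob": list(row[5:5 + nc]),   # class fields
        "keypoints": keypoints,
    }


if __name__ == "__main__":
    # First row of the onnxruntime output printed above
    first_row = [
        801.661621, 153.809937, 972.689453, 377.949707,
        0.0421597920, 0.0, 0.0515294671,
        884.101624, 251.692810, 0.991612077,
        903.469177, 168.072296, 0.635425091,
        885.691345, 172.709320, 0.730822206,
    ]
    print(parse_detection(first_row))
```

The C++ Detection struct in section 4 uses the same field order, which is why the demo can copy a row into it with memcpy.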
3. Converting the ONNX model to a TensorRT .engine model
Run the command below and wait for the conversion to finish.
sh
trtexec --onnx=yolov7.onnx --fp16 --saveEngine=yolov7.engine
If it fails, for example because an operator is not supported, try upgrading TensorRT to the latest version.
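If you convert models often, it can be handy to build the trtexec command from a script. The sketch below only assembles the same three flags used above (--onnx, --saveEngine, --fp16) and prints the command. The function name `build_trtexec_command` is made up for this example, and the subprocess call is left as a comment so nothing runs by accident.

```python
import shlex
from typing import List


def build_trtexec_command(onnx_path: str, engine_path: str, fp16: bool = True) -> List[str]:
    """Assemble the trtexec argument list used in this tutorial."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd


if __name__ == "__main__":
    cmd = build_trtexec_command("yolov7.onnx", "yolov7.engine")
    print(shlex.join(cmd))
    # To actually run the conversion:
    # import subprocess
    # subprocess.run(cmd, check=True)
```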
4. C++ deployment
c++
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>
#include <opencv2/opencv.hpp>
#include <NvInfer.h>
#include <cuda_runtime_api.h>

#define INPUT_W 960
#define INPUT_H 960
#define DEVICE 0 // GPU id
#define CONF_THRESH 0.2

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char *msg) noexcept override {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;
// Abort with the CUDA error code if a CUDA runtime call fails
#define CHECK(status) \
    do\
    {\
        auto ret = (status);\
        if (ret != 0)\
        {\
            std::cerr << "Cuda failure: " << ret << std::endl;\
            abort();\
        }\
    } while (0)
// Convert a BGR uint8 image into a normalized RGB float blob in CHW order.
// Note: img is converted to RGB in place; the caller owns the returned buffer.
float *blobFromImage(cv::Mat &img) {
    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);
    float *blob = new float[img.total() * 3];
    int channels = 3;
    int img_h = img.rows;
    int img_w = img.cols;
    for (int c = 0; c < channels; c++) {
        for (int h = 0; h < img_h; h++) {
            for (int w = 0; w < img_w; w++) {
                blob[c * img_w * img_h + h * img_w + w] =
                        (((float) img.at<cv::Vec3b>(h, w)[c]) / 255.0f);
            }
        }
    }
    return blob;
}
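// Letterbox: scale the image to fit INPUT_W x INPUT_H while keeping the aspect
// ratio, place it at the top-left corner and pad the rest with gray (114).
cv::Mat static_resize(cv::Mat &img) {
    float r = std::min(INPUT_W / (img.cols * 1.0), INPUT_H / (img.rows * 1.0));
    int unpad_w = r * img.cols;
    int unpad_h = r * img.rows;
    cv::Mat re(unpad_h, unpad_w, CV_8UC3);
    cv::resize(img, re, re.size());
    // cv::Mat takes (rows, cols), i.e. (height, width)
    cv::Mat out(INPUT_H, INPUT_W, CV_8UC3, cv::Scalar(114, 114, 114));
    re.copyTo(out(cv::Rect(0, 0, re.cols, re.rows)));
    return out;
}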
// Tensor names as shown in Netron
const char *INPUT_BLOB_NAME = "images";
const char *OUTPUT_BLOB_NAME = "detections";
static Logger gLogger;
static constexpr int MAX_OUTPUT_BBOX_COUNT = 100;
static constexpr int CLASS_NUM = 2;
static constexpr int LOCATIONS = 4;
static constexpr int KEY_POINTS_NUM = 3;

struct Keypoint {
    float x;
    float y;
    float kpt_conf;
};

// One output row; the field order must match the 16-value layout from section 2
struct alignas(float) Detection {
    // x1 y1 x2 y2 (box corners, as drawn in main)
    float bbox[LOCATIONS];
    float conf; // bbox_conf * cls_conf
    float prob[CLASS_NUM]; // class fields
    // 3 keypoints
    Keypoint kpts[KEY_POINTS_NUM];
};
// input_shape and output_size are element counts (number of floats), not byte sizes
void
doInference(IExecutionContext &context, float *input, float *output, const int output_size, const int input_shape) {
    const ICudaEngine &engine = context.getEngine();
    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void *buffers[2];
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    assert(engine.getBindingDataType(inputIndex) == nvinfer1::DataType::kFLOAT);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);
    assert(engine.getBindingDataType(outputIndex) == nvinfer1::DataType::kFLOAT);
    // Create GPU buffers on device
    CHECK(cudaMalloc(&buffers[inputIndex], input_shape * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], output_size * sizeof(float)));
    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    // Zero the output buffer so that rows the engine does not write have conf == 0
    CHECK(cudaMemsetAsync(buffers[outputIndex], 0, output_size * sizeof(float), stream));
    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, input_shape * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueueV2(buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], output_size * sizeof(float), cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));
    // Release stream and buffers
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}
// Number of floats per detection row (16 for 2 classes and 3 keypoints)
static constexpr int DETECTION_SIZE = sizeof(Detection) / sizeof(float);

// Walk the flat output buffer row by row and keep detections above the threshold
static void
postprocess_decode(float *feat_blob, float prob_threshold, std::vector<Detection> &objects_map) {
    for (int i = 0; i < MAX_OUTPUT_BBOX_COUNT; i++) {
        int base_index = i * DETECTION_SIZE; // Calculate the base index for the current detection
        // feat_blob[base_index + LOCATIONS] is the confidence, right after the 4 box values
        if (feat_blob[base_index + LOCATIONS] <= prob_threshold)
            continue;
        Detection det;
        // Copy the detection information from feat_blob to the Detection structure
        memcpy(&det, &feat_blob[base_index], DETECTION_SIZE * sizeof(float));
        objects_map.push_back(det);
    }
}
int main() {
    char *trtModelStream{nullptr};
    cudaSetDevice(DEVICE);
    size_t size{0};
    const char *engine_file_path = "/home/ubuntu/GITHUG/yolov7_pose/yolov7.engine";
    // Read the serialized engine file into memory
    std::ifstream file(engine_file_path, std::ios::binary);
    if (!file.good()) {
        std::cerr << "Failed to open engine file: " << engine_file_path << std::endl;
        return -1;
    }
    file.seekg(0, file.end);
    size = file.tellg();
    file.seekg(0, file.beg);
    trtModelStream = new char[size];
    assert(trtModelStream);
    file.read(trtModelStream, size);
    file.close();
    // Deserialize the engine and create an execution context
    IRuntime *runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    ICudaEngine *engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    IExecutionContext *context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream;
    // auto out_dims = engine->getBindingDimensions(1);
    int input_size = 1 * 3 * 960 * 960;
    int output_size = MAX_OUTPUT_BBOX_COUNT * 16 * 1;
    static float *prob = new float[output_size];
    const char *input_image_path = "/home/ubuntu/GITHUG/yolov7_pose/501_png.rf.9cc0a917ca7972be6c8088aa9d17d651.jpg";
    cv::Mat img = cv::imread(input_image_path);
    if (img.empty()) {
        std::cerr << "Failed to read image: " << input_image_path << std::endl;
        return -1;
    }
    // Letterbox to the network input size, then convert to a CHW float blob
    cv::Mat pr_img = static_resize(img);
    float *blob;
    // cv::imshow("Image", pr_img);
    blob = blobFromImage(pr_img);
    cv::waitKey(200);
    // Close any open windows
    cv::destroyAllWindows();
    doInference(*context, blob, prob, output_size, input_size);
    delete[] blob;
    std::vector<Detection> objects_map;
    // Debug: print some of the raw output values
    for (int i = 0; i < prob[0] && i < MAX_OUTPUT_BBOX_COUNT; i++) {
        std::cout << ": " << prob[i] << std::endl;
    }
    postprocess_decode(prob, CONF_THRESH, objects_map);
    float r_w = INPUT_W / (img.cols * 1.0);
    float r_h = INPUT_H / (img.rows * 1.0);
    cv::cvtColor(pr_img, pr_img, cv::COLOR_RGB2BGR);
    for (const auto &det: objects_map) {
        // Access other information in the Detection structure as needed
        // Example: Print bbox coordinates
        std::cout << " Bbox: ";
        for (int i = 0; i < LOCATIONS; i++) {
            std::cout << det.bbox[i] << " ";
        }
        // Use the same scale as static_resize (the smaller of the two ratios)
        float r = 0.0;
        if (img.rows <= img.cols) {
            r = r_w;
        } else {
            r = r_h;
        }
        // Map coordinates from the letterboxed input back to the original image
        cv::Point pt1(det.bbox[0] / r, det.bbox[1] / r);
        cv::Point pt2(det.bbox[2] / r, det.bbox[3] / r);
        cv::rectangle(img, pt1, pt2, cv::Scalar(0, 255, 0), 2);
        cv::Point point1(det.kpts[0].x / r, det.kpts[0].y / r);
        cv::Point point2(det.kpts[1].x / r, det.kpts[1].y / r);
        cv::Point point3(det.kpts[2].x / r, det.kpts[2].y / r);
        // Draw the line segments between keypoints; Scalar is the color in (B, G, R) order
        cv::line(img, point1, point2, cv::Scalar(0, 0, 255), 2); // red
        cv::line(img, point2, point3, cv::Scalar(255, 0, 0), 2); // blue
        cv::imshow("Rectangle", img);
        cv::waitKey(0);
        std::cout << std::endl;
    }
    return 0;
}
This is my demo and the final result.
The key part is the code that parses the model output, so feel free to use it as a reference.
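One detail of the parsing that is easy to get wrong is mapping coordinates from the 960 x 960 network input back to the original image. The demo resizes with a single scale factor and pads only on the right and bottom, so undoing it is a single division with no offset. Here is a small sketch of that arithmetic. The function names `letterbox_scale` and `to_original` are made up for this example.

```python
from typing import Tuple


def letterbox_scale(img_w: int, img_h: int, input_w: int = 960, input_h: int = 960) -> float:
    """Scale factor used when fitting the image inside the network input."""
    return min(input_w / img_w, input_h / img_h)


def to_original(x: float, y: float, scale: float) -> Tuple[float, float]:
    """Map a point from network-input coordinates back to the original image.

    Valid only when the resized image sits at the top-left corner of the
    padded input, as in the demo's static_resize.
    """
    return x / scale, y / scale


if __name__ == "__main__":
    scale = letterbox_scale(1920, 1080)   # a 1920 x 1080 image as an example
    print(scale)
    print(to_original(480.0, 270.0, scale))
```

If you center the image inside the padding instead, subtract the padding offset before dividing.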