RDK X5 Model Conversion and Deployment That Even a Junior Can Pick Up at a Glance: Are You Sure You Don't Want to Learn It?

Author: SkyXZ

CSDN: SkyXZ～ (CSDN blog)

Cnblogs: SkyXZ (Cnblogs)

Host environment: WSL2-Ubuntu22.04 + CUDA 12.6, D-Robotics-OE 1.2.8, Ubuntu20.04 GPU Docker

Board environment: RDK X5-Server-3.1.0

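Bought an RDK X5 but still treating it like a Raspberry Pi? Want to deploy deep learning but don't know where to start with the BPU? Finally tracked down the OE delivery package and Model Zoo but can't tell what they are for? I know you're in a hurry, but slow down for a moment! Follow this easy-to-pick-up quantization and deployment tutorial and in about 30 minutes you will no longer be an RDK model quantization and deployment beginner! This tutorial draws on the following materials and documents: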

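D-Robotics RDK user manual: 1. Quick Start | RDK DOC
D-Robotics X5 algorithm toolchain: D-Robotics X5 Algorithm Toolchain Release
D-Robotics RDK Model Zoo introduction: 4.3.1 ModelZoo Overview | RDK DOC
D-Robotics RDK Model Zoo repository: https://github.com/D-Robotics/rdk_model_zoo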

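I. Algorithm Toolchain Overview and Environment Setup

Models trained on a GPU normally use floating-point formats, because floating point offers high numerical precision and flexibility. For edge devices, however, the compute and storage a floating-point model needs are far beyond what the hardware can carry, so the AI accelerators on edge devices generally support only INT8 fixed-point models (INT8 being the common precision across the industry), and the BPU on our X5 is no exception. We therefore have to convert the trained floating-point model into a fixed-point model, a process called model quantization. D-Robotics has developed its own algorithm toolchain for D-Robotics processors that quantizes a floating-point model into a fixed-point one quickly and conveniently and deploys it on D-Robotics processors. Here is how to install that toolchain: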
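To make the idea of "floating point to INT8" concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is only an illustration under simple assumptions (a single scale derived from the largest magnitude, clamping to [-128, 127]); the function names are made up for this example, and it is not how the D-Robotics toolchain is implemented internally.

```python
def quantize_int8(values, scale):
    """Map floats to INT8 codes: round(v / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize_int8(codes, scale):
    """Recover approximate floats from INT8 codes."""
    return [c * scale for c in codes]


# Hypothetical weights, used only for illustration.
weights = [0.02, -0.75, 1.27, -1.27, 0.5]

# Symmetric scale: the largest magnitude maps to code 127.
scale = max(abs(w) for w in weights) / 127

codes = quantize_int8(weights, scale)
restored = dequantize_int8(codes, scale)

# For values inside the representable range the rounding error is at most scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max rounding error:", max_error)
```

Each float is stored as one signed byte plus a shared scale, which is where the savings in memory and compute come from; the price is the small rounding error printed at the end.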

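The D-Robotics toolchain currently runs only on Linux, so first make sure your development machine meets the official requirements and has WSL2 Ubuntu installed (see: "Say Goodbye to Virtual Machines! WSL2 Installation and Configuration Tutorial" - SkyXZ - Cnblogs), or Ubuntu in a virtual machine. Because D-Robotics ships the toolchain as a Docker image, the exact Ubuntu version matters little.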

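(1) Install Docker and the NVIDIA Container Toolkit

Next, install Docker in Ubuntu (D-Robotics requires 19.03 or later; see Get Docker | Docker Docs) and the NVIDIA Container Toolkit (D-Robotics requires 1.13.1 to 1.13.5; see Installing the NVIDIA Container Toolkit, NVIDIA Container Toolkit 1.17.3 documentation). I will walk you through the whole process from scratch. Docker comes first: remove any Docker packages the system installed by default and install a few prerequisites: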

# Remove old packages if present; an error saying they are not installed can be ignored
sudo apt-get remove docker docker-engine docker.io containerd runc
# Install the required dependencies
sudo apt install apt-transport-https ca-certificates curl software-properties-common gnupg lsb-release

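We assume you are not using a proxy, so every source below is a mirror inside China. After adding Aliyun's GPG key and Aliyun's APT source, the latest Docker can be installed directly with APT.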

# step 1: add the Aliyun GPG key
curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

# step 2: add the Aliyun Docker APT source
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# step 3: refresh the package index
sudo apt-get update

# step 4: install Docker
sudo apt install docker-ce docker-ce-cli containerd.io

# step 5: verify the installation
sudo docker version            # show the Docker version
sudo systemctl status docker   # check that the Docker service is running

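If both verification commands produce output and Docker is running, the installation is complete. Next, add your non-root user to the docker group so that docker commands work without switching to root or prefixing sudo (log out and back in, or run newgrp docker, for the group change to take effect):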

  sudo groupadd docker
  sudo gpasswd -a ${USER} docker
  sudo service docker restart

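We are not done yet, though, because running docker run hello-world will most likely keep failing with a network error.

That happens because Docker's official registry is currently not directly reachable from inside China, so a third-party registry mirror is needed. I have collected some common mirrors for you; just add them to /etc/docker/daemon.json (paste only the JSON object into the file, not the # step comments):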

# step 1: create or edit /etc/docker/daemon.json
sudo nano /etc/docker/daemon.json
# step 2: paste the following into the file
{
    "registry-mirrors": [
        "https://dockerproxy.com",
        "https://docker.m.daocloud.io",
        "https://cr.console.aliyun.com",
        "https://ccr.ccs.tencentyun.com",
        "https://hub-mirror.c.163.com",
        "https://mirror.baidubce.com",
        "https://docker.nju.edu.cn",
        "https://docker.mirrors.sjtug.sjtu.edu.cn",
        "https://registry.docker-cn.com"
    ]
}
# step 3: reload the configuration and restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker
# step 4: inspect the Docker configuration to confirm the mirrors were picked up
sudo docker info

After running docker info, the terminal lists the registry mirrors we just added. Run docker run hello-world again and Docker pulls the image and prints "Hello from Docker!".
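Since /etc/docker/daemon.json must stay valid JSON, and the NVIDIA runtime configuration later in this section writes to the same file, it can be safer to merge mirrors into the existing content instead of overwriting it. The sketch below shows that merge on a string; `merge_registry_mirrors` is a helper invented for this example, and the `runtimes` entry is only sample input.

```python
import json


def merge_registry_mirrors(existing_text, mirrors):
    """Return daemon.json text with `mirrors` merged into "registry-mirrors".

    Other top-level keys are preserved. Invalid JSON raises json.JSONDecodeError,
    which doubles as a syntax check before Docker is restarted.
    """
    config = json.loads(existing_text) if existing_text.strip() else {}
    merged = list(config.get("registry-mirrors", []))
    for url in mirrors:
        if url not in merged:  # avoid duplicate entries
            merged.append(url)
    config["registry-mirrors"] = merged
    return json.dumps(config, indent=4)


# Sample existing content (illustrative only).
current = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "args": []}}}'
updated = merge_registry_mirrors(current, ["https://docker.m.daocloud.io"])
print(updated)
```

The printed text can then be written back to /etc/docker/daemon.json with root privileges before restarting Docker.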

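With Docker installed, the next step is the NVIDIA Container Toolkit. (If your computer has no GPU, or you are using VM-style virtual machines that cannot access the GPU, skip this step.) The toolkit is a set of tools from NVIDIA that lets containers use the GPU, with GPU acceleration. NVIDIA's documentation is very detailed, so we simply follow its steps.

As with Docker earlier, add NVIDIA's official source first; after that the toolkit installs directly through APT.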

# step 1: configure the production repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# step 2: refresh the package index
sudo apt-get update
# step 3: install with APT
sudo apt-get install -y nvidia-container-toolkit # this can take quite a while without a proxy

Next, configure the NVIDIA Container Runtime for Docker. It only takes two commands:

sudo nvidia-ctk runtime configure --runtime=docker # nvidia-ctk edits /etc/docker/daemon.json
sudo systemctl restart docker # restart the Docker daemon

Finally, run the command below to verify the configuration. If nvidia-smi prints its GPU status table, the NVIDIA Container Toolkit is installed!

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

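(2) Set Up the D-Robotics Algorithm Toolchain

If everything above went smoothly, all the prerequisites are now in place and we can set up the toolchain itself. First download the RDK OE delivery package (V1.2.8 is the latest version at the time of writing) and the matching Docker image: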

# Download the OE v1.2.8 delivery package (the password is quoted so the shell leaves $ and % alone)
wget -c ftp://x5ftp@vrftp.horizon.ai/OpenExplorer/v1.2.8_release/horizon_x5_open_explorer_v1.2.8-py310_20240926.tar.gz --ftp-password='x5ftp@123$%'

# Pick ONE of the Docker images below, CPU or GPU
# Ubuntu20.04 CPU Docker image
wget -c ftp://x5ftp@vrftp.horizon.ai/OpenExplorer/v1.2.8_release/docker_openexplorer_ubuntu_20_x5_cpu_v1.2.8.tar.gz --ftp-password='x5ftp@123$%'
# Ubuntu20.04 GPU Docker image
wget -c ftp://x5ftp@vrftp.horizon.ai/OpenExplorer/v1.2.8_release/docker_openexplorer_ubuntu_20_x5_gpu_v1.2.8.tar.gz --ftp-password='x5ftp@123$%'

# X5 algorithm toolchain user development documentation (optional)
wget -c ftp://x5ftp@vrftp.horizon.ai/OpenExplorer/v1.2.8_release/x5_doc-v1.2.8-py310-cn.zip --ftp-password='x5ftp@123$%'
# Checksums (optional)
wget -c ftp://x5ftp@vrftp.horizon.ai/OpenExplorer/v1.2.8_release/md5sum.txt --ftp-password='x5ftp@123$%'

The Docker image file is large, so the download takes a while. When it finishes, ls shows the two files (the OE package and the Docker image).
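If you also downloaded md5sum.txt, the large archives can be checked before going further. The sketch below is one way to do it, assuming md5sum.txt uses the usual `md5sum` output layout (`<hash>  <filename>` per line); that layout is an assumption, not something verified against the actual file, and both function names are invented for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks so multi-GB archives fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_md5sum(md5sum_path):
    """Return {filename: matches} for every listed file found next to md5sum.txt."""
    base = Path(md5sum_path).parent
    results = {}
    for line in Path(md5sum_path).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected = parts[0].lower()
        name = parts[1].strip().lstrip("*")  # binary-mode entries start with '*'
        target = base / name
        if target.exists():  # files that were not downloaded are ignored
            results[name] = md5_of_file(target) == expected
    return results


if Path("md5sum.txt").exists():
    for name, ok in verify_against_md5sum("md5sum.txt").items():
        print(name, "OK" if ok else "MISMATCH")
```

A MISMATCH usually means an interrupted download; since the wget commands use -c, rerunning them resumes the transfer.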

Extract the package with the following command:

tar -xvf horizon_x5_open_explorer_v1.2.8-py310_20240926.tar.gz # extract the OE delivery package

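After extraction, enter the OE package. Its top level splits into two large folders, package and samples. package mainly holds the board-side and host-side development environments for the RDK series; since we use the Docker image, we can ignore it and focus on samples. samples contains three folders. The third, model_zoo, is a symlink to ai_toolchain/model_zoo inside the second. The first, ai_benchmark, is the official AI benchmark sample package. It evaluates the performance and accuracy of classification, detection and segmentation models, and supports single-frame latency and dual-core multithreaded scheduling measurements, so it can tell us whether a model meets its performance target and how accurate it is after quantization. If you use a stock YOLO architecture without modifying it, though, you usually don't need to worry much about this part.

Now for the main event, the ai_toolchain folder. Its structure covers model quantization and conversion examples, model training examples and model runtime examples; how to use them is covered in Part III.

With the OE package covered, import the Docker image next. The image depends on the OE package at run time, so first set the paths Docker will mount, then load the image from the tar archive: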

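# Adjust these to your own paths
export version=v1.2.8
export ai_toolchain_package_path=/path/OE/horizon_x5_open_explorer_v1.2.8-py310_20240926  # change to your own path
export dataset_path=/path/OE/dataset   # change to your own path; create the dataset folder if it does not exist
# Load the image (use the archive you downloaded; the GPU image is shown here)
docker load < docker_openexplorer_ubuntu_20_x5_gpu_v1.2.8.tar.gz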

The image is large, so loading takes a while; just wait for it. Then start a container with the following command:

sudo docker run -it --rm --gpus all --shm-size=15g -v "$ai_toolchain_package_path":/open_explorer -v "$dataset_path":/data openexplorer/ai_toolchain_ubuntu_20_x5_gpu:v1.2.8-py310

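Inside the container, run hb_mapper. If it prints its help output, the environment is ready~~

**Tip:** Add the alias line below to ~/.bashrc, and from then on typing RDK_Ai_Toolchain in a terminal opens the toolchain, so there is no need to remember that long command. The alias is single-quoted so the two path variables are expanded when it runs; put the two export lines for ai_toolchain_package_path and dataset_path in ~/.bashrc as well.

alias RDK_Ai_Toolchain='sudo docker run -it --rm --gpus all --shm-size=15g -v "$ai_toolchain_package_path":/open_explorer -v "$dataset_path":/data openexplorer/ai_toolchain_ubuntu_20_x5_gpu:v1.2.8-py310'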

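That completes the installation and configuration of the D-Robotics toolchain environment!

II. Introducing Model Zoo

I think anyone who has just received an RDK board cannot skip D-Robotics' newly released Model Zoo and jump straight into the toolchain, so this X5 quantization, conversion and deployment tutorial starts with Model Zoo. As the name says, it is a "model zoo": an open-source repository of algorithm examples maintained by the D-Robotics developer community. According to the official description, it contains a range of D-Robotics heterogeneous models that can be deployed on the board directly, suit many scenarios and generalize well (such as the YOLO series, FCOS, ResNet and PaddleOCR). They are selected and optimized for areas including, but not limited to, image classification, object detection, semantic segmentation and natural language processing. The models are efficient, already quantized and converted into ready-to-run .bin files, and come with C++/Python and Jupyter examples.

So how do we use the repository? First clone Model Zoo from GitHub, then look at the project structure:

git clone https://github.com/D-Robotics/rdk_model_zoo   # clone Model Zoo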

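The top-level folder holds READMEs in Chinese and English, the resource folder with the README images, requirement.txt, and the most important folder, demos. demos groups every currently supported model by task: object detection (detect), classification (classification), keypoint detection (Pose) and so on. Taking detect as an example, it contains many officially supported model families. Opening the YOLOv5 folder, we find the official C++/Jupyter deployment examples, the officially converted model files, and the PTQ configuration files used for quantization.

By now you should have a basic picture of Model Zoo. Next, using YOLOv5 v2.0 as the example, I will show you how to convert a model.

III. Model Quantization Walkthrough

Now we get into using the toolchain properly. Using the official YOLOv5 v2.0 release as the example, we will convert a model and pick up a few concepts along the way. The flow follows the official documentation in Model Zoo at rdk_model_zoo/demos/detect/YOLOv5/README_cn.md. First fetch the official YOLOv5 v2.0 source and download the official weights: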

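git clone https://github.com/ultralytics/yolov5.git # clone the repository
cd yolov5 # enter the repository
git checkout v2.0 # check out the v2.0 tag
git branch # check: "* (HEAD detached at v2.0)" means the checkout succeeded
# The official 80-class weights are used for this demo; if you have your own trained model, skip this step and use it instead
wget https://github.com/ultralytics/yolov5/releases/download/v2.0/yolov5s.pt -O yolov5s_tag2.0.pt # download the official weights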

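The BPU needs a 4-D NHWC output, i.e. (batch_size, height, width, channels), whereas the YOLOv5 source is built on PyTorch and therefore outputs NCHW, i.e. (batch_size, channels, height, width). We need to modify the model's output section so that the trained .pt file exports to ONNX with the right layout. Open yolov5/models/yolo.py and go to around line 22. The output head only needs changing for ONNX export and must stay as it was for training, so I recommend commenting out the original code rather than deleting it. Replace the forward function with the following: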

def forward(self, x):
    return [self.m[i](x[i]).permute(0,2,3,1).contiguous() for i in range(self.nl)]
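To see what permute(0, 2, 3, 1) does to the data layout, here is a small pure-Python sketch that performs the same NCHW to NHWC reordering on nested lists. It is only an illustration of the index mapping (no PyTorch involved), and `nchw_to_nhwc` is a name invented for this example.

```python
def nchw_to_nhwc(tensor):
    """Reorder nested lists from [N][C][H][W] to [N][H][W][C], like permute(0, 2, 3, 1)."""
    return [
        [
            [[image[c][h][w] for c in range(len(image))]  # gather all channels of one pixel
             for w in range(len(image[0][0]))]
            for h in range(len(image[0]))
        ]
        for image in tensor
    ]


# One image, 2 channels, 2x3 pixels: channel 0 holds 0..5, channel 1 holds 10..15.
nchw = [[
    [[0, 1, 2], [3, 4, 5]],
    [[10, 11, 12], [13, 14, 15]],
]]

nhwc = nchw_to_nhwc(nchw)
print(nhwc[0][0][0])  # the channel values of pixel (h=0, w=0)
print(nhwc[0][1][2])  # the channel values of pixel (h=1, w=2)
```

In NHWC all channel values of one pixel sit next to each other, so in the detection head every grid cell's predictions become one contiguous run.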

Next we use export.py, the export tool that ships with YOLOv5. First copy it out of yolov5/models/export.py:

cp ./models/export.py .

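Then open the file. We only need the ONNX export, so delete the TorchScript export section at line 32 and the CoreML export section at line 60, keeping only the ONNX part. In the ONNX part, also specify the opset version and add an onnx simplify pass, which performs graph optimizations such as constant folding.

PS: Every ONNX operator (convolution, activation, matrix multiplication and so on) has a specific version, and the opset version identifies the set of operator versions in use. The RDK series currently supports only opset 10 and opset 11, so we specify version 11.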

try:
    import onnx
    from onnxsim import simplify

    print('\nStarting ONNX export with onnx %s...' % onnx.__version__)
    f = opt.weights.replace('.pt', '.onnx')  # filename
    model.fuse()  # only for ONNX
    torch.onnx.export(model, img, f, verbose=False, opset_version=11, input_names=['images'],
                      output_names=['small', 'medium', 'big'])
    # Checks
    onnx_model = onnx.load(f)  # load onnx model
    onnx.checker.check_model(onnx_model)  # check onnx model
    print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model
    # simplify
    onnx_model, check = simplify(
        onnx_model,
        dynamic_input_shape=False,
        input_shapes=None)
    assert check, 'assert check failed'
    onnx.save(onnx_model, f)
    print('ONNX export success, saved as %s' % f)
except Exception as e:
    print('ONNX export failure: %s' % e)

If modifying export.py feels like a hassle, you can also copy my finished version below and replace the original contents with it:

import argparse
from models.common import *
from utils import google_utils
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--weights', type=str, default='./yolov5s.pt', help='weights path')
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='image size')
    parser.add_argument('--batch-size', type=int, default=1, help='batch size')
    opt = parser.parse_args()
    opt.img_size *= 2 if len(opt.img_size) == 1 else 1  # expand
    print(opt)
    img = torch.zeros((opt.batch_size, 3, *opt.img_size))  # image size(1,3,320,192) iDetection
    google_utils.attempt_download(opt.weights)
    model = torch.load(opt.weights, map_location=torch.device('cpu'))['model'].float()
    model.eval()
    model.model[-1].export = True  # set Detect() layer export=True
    y = model(img)  # dry run
    try:
        import onnx
        from onnxsim import simplify

        print('\nStarting ONNX export with onnx %s...' % onnx.__version__)
        f = opt.weights.replace('.pt', '.onnx')  # filename
        model.fuse()  # only for ONNX
        torch.onnx.export(model, img, f, verbose=False, opset_version=11, input_names=['images'],
                          output_names=['small', 'medium', 'big'])
        # Checks
        onnx_model = onnx.load(f)  # load onnx model
        onnx.checker.check_model(onnx_model)  # check onnx model
        print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model
        # simplify
        onnx_model, check = simplify(
            onnx_model,
            dynamic_input_shape=False,
            input_shapes=None)
        assert check, 'assert check failed'
        onnx.save(onnx_model, f)
        print('ONNX export success, saved as %s' % f)
    except Exception as e:
        print('ONNX export failure: %s' % e)

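With that done, we can export the trained .pt model to ONNX **(this assumes your YOLOv5 Conda environment is already set up)**

python3 export.py --weights ./yolov5s_tag2.0.pt # the weights downloaded above; use your own .pt file if you have one

Now we can start quantizing! Copy the exported .onnx model into the OE package:

cp ./yolov5s_tag2.0.onnx /path/to/OE # adjust the destination to your own setup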

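Tip: To keep files organized, I created a Model folder inside the OE package to hold all my own model projects, and I recommend you do the same.

Next, start the official toolchain Docker image and check the ONNX model first. This uses the official command hb_mapper checker, whose usage is as follows: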

hb_mapper checker --model-type ${model_type} \
                    --march ${march} \
                    --proto ${proto} \
                    --model ${caffe_model/onnx_model} \
                    --input-shape ${input_node} ${input_shape} \
                    --output ${output}
# --model-type   type of the model to check; only caffe and onnx are supported
# --march        target D-Robotics processor architecture:
#                RDK X3 uses bernoulli2, RDK Ultra uses bayes, RDK X5 uses bayes-e
# --proto        only used when model-type is caffe; the prototxt file of the Caffe model
# --model        when model-type is caffe, the caffemodel file of the Caffe model;
#                when model-type is onnx, the ONNX model file
# --input-shape  optional; explicitly sets the model input shape
#                format: {input_name} {NxHxWxC/NxCxHxW}, with a space between input_name and the shape
#                e.g. for an input named data1 with shape [1,224,224,3]: --input-shape data1 1x224x224x3
#                if this shape differs from the shape stored in the model, the value given here wins

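Following the official description, run the command below to check the model. It prints a long report, from which we can see that the X5 BPU supports every operator in YOLOv5 v2.0, so all of the model's computation can run on the BPU.

# Change --model to the path of your own model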
hb_mapper checker --model-type onnx --march bayes-e --model /path/to/model
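If you check models often, the command line can be assembled in a small script. The sketch below only builds and prints the argument list from the flags documented above and does not run anything; `build_checker_command` is a helper invented for this example, and the model path is a placeholder.

```python
import shlex

# Architecture names taken from the help text above.
VALID_MARCH = {"bernoulli2", "bayes", "bayes-e"}
VALID_MODEL_TYPES = {"caffe", "onnx"}


def build_checker_command(model_path, march="bayes-e", model_type="onnx",
                          input_name=None, input_shape=None):
    """Build the argv list for `hb_mapper checker` without executing it."""
    if march not in VALID_MARCH:
        raise ValueError(f"unknown march: {march}")
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    command = ["hb_mapper", "checker",
               "--model-type", model_type,
               "--march", march,
               "--model", model_path]
    if input_name and input_shape:
        # The shape is written as dimensions joined by 'x', e.g. 1x3x640x640.
        command += ["--input-shape", input_name, "x".join(str(d) for d in input_shape)]
    return command


cmd = build_checker_command("/path/to/model.onnx",
                            input_name="images", input_shape=(1, 3, 640, 640))
print(shlex.join(cmd))
```

Inside the toolchain container the list could be handed to `subprocess.run(cmd, check=True)`. The input name images matches the `input_names` used in the export script earlier.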

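If this check passes, we can start converting the model. The D-Robotics toolchain uses the PTQ approach and provides a similar command for it, hb_mapper makertbin, which automatically converts the floating-point model into a D-Robotics hybrid heterogeneous model. After this stage we have a model that runs on D-Robotics processors. Let's look at the official command reference first:

PS: PTQ (Post-Training Quantization) converts an already trained model to a lower-precision representation (such as 8-bit integers) to cut storage and compute cost. It quantizes the finished model without retraining, which speeds up inference and shrinks the model while keeping its accuracy as close as possible to the original.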

# Without fast-perf mode
hb_mapper makertbin --config ${config_file}  \
                      --model-type  ${model_type}
# With fast-perf mode
hb_mapper makertbin --fast-perf --model ${caffe_model/onnx_model} --model-type ${model_type} \
                  --proto ${caffe_proto} \
                  --march ${march}
# --help        show the help message and exit
# -c, --config  model compilation config file in YAML format, with a .yaml suffix
# --model-type  type of the model to convert; caffe and onnx are supported
# --fast-perf   enable fast-perf mode, which produces the bin model with the highest on-board performance
# fast-perf mode additionally needs:
# --model       the Caffe or ONNX floating-point model file
# --proto       the prototxt file of the Caffe model
# --march       BPU micro-architecture: bernoulli2 for RDK X3, bayes for RDK Ultra, bayes-e for RDK X5
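To connect the PTQ note above with the calibration data used during conversion: post-training quantization has to choose its scales from sample inputs instead of from training. The sketch below shows the simplest possible variant (take the largest absolute value seen during calibration). It is an illustration under that assumption only; the calibration methods the toolchain actually uses are described in its documentation, and the function names here are invented.

```python
def calibrate_scale(calibration_batches):
    """Pick a symmetric INT8 scale from the largest absolute activation seen."""
    peak = max(abs(v) for batch in calibration_batches for v in batch)
    return peak / 127 if peak > 0 else 1.0


def fake_quantize(values, scale):
    """Quantize to INT8 and dequantize again, exposing rounding and clipping."""
    return [max(-128, min(127, round(v / scale))) * scale for v in values]


# Hypothetical activations recorded while running calibration images.
calibration_batches = [[0.1, -0.4, 2.0], [1.5, -2.54, 0.3]]
scale = calibrate_scale(calibration_batches)

in_range, out_of_range = fake_quantize([0.5, 3.0], scale)
print("scale:", scale)
print("0.5 ->", in_range)       # inside the calibrated range: small rounding error
print("3.0 ->", out_of_range)   # outside the calibrated range: clipped to the maximum
```

The second value shows why calibration images should resemble real inputs: anything larger than what calibration saw gets clipped.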

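This command needs a model compilation config file, which holds the conversion parameters: the data pre-processing used by the original floating-point training framework, the mean subtracted from images, the pre-processing scale factor, compiler options, and other required settings. If your model belongs to one of the families in Model Zoo, D-Robotics already provides ready-to-use PTQ config files in each model's folder. Usually you only need to adapt three things to your environment and board: the model location onnx_model, the architecture march, and the calibration data directory cal_data_dir.

But then someone will ask: what if Model Zoo doesn't have the model I'm using? How do I write these parameters myself? Don't worry. D-Robotics also provides PTQ template files for different devices and model types (Caffe, ONNX): see 8.5 Algorithm Toolchain | RDK DOC, where the end of the page has quantization YAML templates for RDK X3, RDK X5 and RDK Ultra. And then someone will ask: what do the parameters in this YAML file do, and how do I set them? Again, don't worry! The official documentation describes the model conversion YAML parameters in great detail: see PTQ Principles and Steps Explained | RDK DOC. Note that all four parameter groups must be present in the config file; within them, parameters are either required or optional, and optional ones may be left out.

Back to the conversion. As noted above, YOLOv5 v2.0 is included in the official Model Zoo, so we can use the official PTQ config file directly. First copy it from Model Zoo into the OE package:

# Adjust to your own setup; copying the YAML into the OE package makes it visible inside Docker (keeping it next to the model is recommended)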
cp /path/to/demos/detect/YOLOv5/ptq_yamls/yolov5_detect_bayese_640x640_nv12.yaml /path/to/OE

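Then edit the model path and architecture in the YAML file, and change the output path and so on as needed. At this point we find that we also need calibration data for converting the floating-point model to a fixed-point one. That is easy too: calibration samples are simply images from the training or validation set you used when training the model, so copy about 100 of them into the OE package. The config also offers an option, preprocess_on, that enables automatic processing of the calibration images: the toolchain then reads them with skimage and resizes them to the input node size **(convenient as this option is, I still recommend reading the official user manual and the samples in the OE package and writing your own preprocessing code)**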
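Picking the roughly 100 calibration images can be scripted. The sketch below copies a reproducible random sample from a dataset folder into a calibration folder; the folder names in the usage note, the suffix list and the function name are placeholders chosen for this example.

```python
import random
import shutil
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed image formats


def sample_calibration_images(dataset_dir, output_dir, count=100, seed=0):
    """Copy up to `count` randomly chosen images into `output_dir`; return how many were copied."""
    images = sorted(
        p for p in Path(dataset_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    picked = random.Random(seed).sample(images, min(count, len(images)))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, src in enumerate(picked):
        # Rename by index so files with the same name in different subfolders do not collide.
        shutil.copy2(src, out / f"{index:04d}{src.suffix.lower()}")
    return len(picked)
```

For example, `sample_calibration_images("/data/train/images", "/open_explorer/Model/calibration_data")` would fill a folder that cal_data_dir can point to; both paths are placeholders to adapt.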

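After making the changes you need, start the conversion. Run the command below inside the Docker environment and wait a moment; if no error is reported, the conversion succeeded! An output folder then appears in the current directory, containing the converted model~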

hb_mapper makertbin --model-type onnx --config yolov5_detect_bayese_640x640_nv12.yaml

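Although the model is converted, to be safe we still inspect it visually and check its inputs and outputs. First run the command below; the toolchain generates a visual structure diagram of the converted .bin model in hb_perf_result.

hb_perf /path/to/model # change to your own model path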

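Once that looks right, check the model's inputs and outputs. Likewise, run the command below and the toolchain prints the basic input and output information of the model.

hrt_model_exec model_info --model_file /path/to/model # change to your own model path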

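At this point, if the model structure and the inputs and outputs all look correct, the model conversion is complete!

IV. Model Deployment Example

Now for the part everyone cares about and is most curious about: deployment! Previously RDK only offered a C++ model deployment interface, but with the X5 release the Python interface has been made available as well. Based on the official manual and API reference, this article walks you through deploying inference code in C++/Python from scratch!

Reference manual: Model Inference API Reference (TODO: add C++ examples) | RDK DOC

Before we start: the official Model Zoo ships example code with detailed comments for each model, so you can take out your RDK X5 and use the official example on the board to test whether your model conversion succeeded. The code is in the cpp folder inside the model's folder in Model Zoo.

Open main.cc there and change the macros for the model path, number of classes and label names, as well as the test image path.

Then build and run the executable to see the detection result: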

mkdir build && cd build 
cmake ..
make
./main

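Note that all of the above runs on the board! It is also worth reading through this file yourself to understand the deployment flow.

(1) Write the CMake File

Next, using a single-class YOLOv5 v2.0 model I trained and converted earlier as the example, I will take you through deploying C++ inference from scratch. The RDK inference API falls into six categories: inference library information, model loading and release, model information, model inference, model memory operations, and model pre-processing. These six categories also mean the inference part of our code goes through six steps. First, create the CMake file: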

# step 1: project name and minimum CMake version
cmake_minimum_required(VERSION 2.8)
project(rdk_yolov5_detect)
# step 2: C++ standard
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# step 3: build type
if(NOT CMAKE_BUILD_TYPE)
    set(CMAKE_BUILD_TYPE Release)
endif()
message(STATUS "Build type: ${CMAKE_BUILD_TYPE}")
# step 4: compiler flags
set(CMAKE_CXX_FLAGS_DEBUG " -Wall -Werror -g -O0 ")
set(CMAKE_C_FLAGS_DEBUG " -Wall -Werror -g -O0 ")
set(CMAKE_CXX_FLAGS_RELEASE " -Wall -Werror -O3 ")
set(CMAKE_C_FLAGS_RELEASE " -Wall -Werror -O3 ")
# libdnn.so dependency settings
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wl,-unresolved-symbols=ignore-in-shared-libs")
# step 5: external packages
find_package(OpenCV REQUIRED) # OpenCV
# step 6: RDK BPU library paths
set(DNN_PATH "/usr/include/dnn")      # BPU header path
set(DNN_LIB_PATH "/usr/lib/")         # BPU library path
# step 7: include directories
include_directories(
    ${DNN_PATH}
    ${OpenCV_INCLUDE_DIRS}
)
# step 8: library directories
link_directories(
    ${DNN_LIB_PATH}
)
# step 9: source files
add_executable(main 
    main.cpp
)
# step 10: link libraries
target_link_libraries(main
    ${OpenCV_LIBS}     # OpenCV
    dnn                # RDK BPU library
    pthread            # threads
    rt                 # realtime extensions
    dl                 # dynamic loading
)
# step 11: install target
install(TARGETS main
    RUNTIME DESTINATION bin
)

(2) Headers and Macro Definitions

With the CMake file written, we can start on the C++ code! Create a main.cpp file and include the headers we need:

// C/C++ Standard Libraries
#include <iostream>     // stream I/O
#include <vector>       // vector container
#include <string>       // std::string
#include <utility>      // std::pair
#include <algorithm>    // algorithms
#include <chrono>       // timing
#include <iomanip>      // I/O formatting
#include <cstdio>       // fopen / fread
#include <cstdlib>      // malloc / free

// Third Party Libraries
#include <opencv2/opencv.hpp>      // main OpenCV header
#include <opencv2/dnn/dnn.hpp>     // OpenCV deep learning module

// RDK BPU libDNN API
#include "dnn/hb_dnn.h"           // BPU core API
#include "dnn/hb_dnn_ext.h"       // BPU extended API
#include "dnn/plugin/hb_dnn_layer.h"    // BPU layer definitions
#include "dnn/plugin/hb_dnn_plugin.h"   // BPU plugins
#include "dnn/hb_sys.h"           // BPU system API

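To keep the code tidy and make the detection parameters easy to change, we configure the model path, number of classes, confidence threshold and so on through macros. We also add an error-check macro so we can tell whether each API call succeeded. And because debugging and normal runs need different image display behavior, we add two more macros, DETECT_MODE and ENABLE_DRAW, which select single-image or real-time inference and turn drawing and display on or off.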

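// Error-check macro: on failure, log the error code and return false.
// Meant for the bool-returning member functions of BPU_Detect, so a failed
// API call is never mistaken for success (a non-zero int converts to true).
#define RDK_CHECK_SUCCESS(value, errmsg)                                     \
    do                                                                       \
    {                                                                        \
        auto ret_code = value;                                               \
        if (ret_code != 0)                                                   \
        {                                                                    \
            std::cout << errmsg << ", error code:" << ret_code << std::endl; \
            return false;                                                    \
        }                                                                    \
    } while (0)

// Default parameters
#define DEFAULT_MODEL_PATH "/root/Deep_Learning/YOLOv5/models/tennis_detect_640x640_bayese_.bin" // model path
#define DEFAULT_CLASSES_NUM 1 // number of classes
#define DEFAULT_NMS_THRESHOLD 0.45f // NMS threshold, default 0.45
#define DEFAULT_SCORE_THRESHOLD 0.25f // confidence threshold, default 0.25
#define DEFAULT_NMS_TOP_K 300 // number of top-K boxes kept for NMS, default 300
#define DEFAULT_FONT_SIZE 1.0f // label font size, default 1.0
#define DEFAULT_FONT_THICKNESS 1.0f // label font thickness, default 1.0
#define DEFAULT_LINE_SIZE 2.0f // rectangle line width, default 2.0
#define DETECT_MODE 0 // inference mode: 0 for a single image, 1 for real-time detection
#define ENABLE_DRAW 0  // 1: enable drawing, 0: disable drawing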

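(3) Wrapping Detection in a BPU Class

For ease of use, we wrap the inference code in a BPU_Detect class. It exposes three main interfaces, Init(), Detect() and Release(), which initialize the BPU and model, run detection, and release resources. To implement them we also create several internal helpers: LoadModel(), GetModelInfo(), PreProcess(), Inference(), PostProcess(), DrawResults() and PrintResults(), which load the model, fetch model information, pre-process the image, run inference, post-process, draw the results, and print them in a formatted way.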

class BPU_Detect {
    public:
        BPU_Detect(const std::string& model_path = DEFAULT_MODEL_PATH,
                 int classes_num = DEFAULT_CLASSES_NUM,
                 float nms_threshold = DEFAULT_NMS_THRESHOLD,
                 float score_threshold = DEFAULT_SCORE_THRESHOLD,
                 int nms_top_k = DEFAULT_NMS_TOP_K,
                 int d_mode = DETECT_MODE);
        ~BPU_Detect();        // destructor
        bool Init();  // initialize the BPU and the model
        bool Detect(const cv::Mat& input_img, cv::Mat& output_img);  // run detection
        bool Release();  // release resources
    private:
        bool LoadModel();  // load the model
        bool GetModelInfo();  // fetch model information (bool, so API failures can be reported)
        bool PreProcess(const cv::Mat& input_img);  // image pre-processing
        bool Inference();  // model inference
        bool PostProcess();  // post-processing
        void DrawResults(cv::Mat& img);  // draw results
        void PrintResults() const;  // print detection results
        // Member variables (in constructor initialization order)
        std::string model_path_;      // model file path
        int classes_num_;             // number of classes
        float nms_threshold_;         // NMS threshold
        float score_threshold_;       // confidence threshold
        int nms_top_k_;               // maximum number of boxes kept by NMS
        bool is_initialized_;         // initialization flag
        float font_size_;             // text size for drawing
        float font_thickness_;        // text thickness for drawing
        float line_size_;             // line width for drawing
        // further private members are added in the steps below
};

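We start with the constructor and destructor. The constructor takes in all the macro values and sets up the small, medium and large anchors, and the destructor releases our resources.

PS: What are anchors? In computer vision, and object detection in particular, anchors are a set of predefined bounding boxes that are matched against the objects in the input image. Their sizes, shapes and positions are usually fixed before training, and their purpose is to cope with objects of very different scales. In short, an anchor is a kind of "reference box": it covers a region of the image in advance, and the model predicts the actual position and size of an object relative to these predefined boxes.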
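To show how a "reference box" is used, the sketch below splits the 18 anchor values (the same numbers that appear in the constructor) into three heads and decodes one raw prediction into a box. The decoding formula is the one commonly used by YOLOv5's detection head and the strides 8/16/32 are the standard ones; both are assumptions to check against the post-processing code for your own model.

```python
import math

# Anchor values as used in the constructor: (w, h) pairs, small to large.
ANCHORS = [10, 13, 16, 30, 33, 23,
           30, 61, 62, 45, 59, 119,
           116, 90, 156, 198, 373, 326]
STRIDES = (8, 16, 32)  # assumed strides of the small / medium / large heads


def split_anchors(flat):
    """Group the flat list into three heads of three (w, h) pairs each."""
    pairs = list(zip(flat[0::2], flat[1::2]))
    return [pairs[0:3], pairs[3:6], pairs[6:9]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, grid_x, grid_y, stride, anchor):
    """Turn raw (tx, ty, tw, th) of one grid cell into (cx, cy, w, h) in input pixels."""
    tx, ty, tw, th = raw
    cx = (sigmoid(tx) * 2.0 - 0.5 + grid_x) * stride
    cy = (sigmoid(ty) * 2.0 - 0.5 + grid_y) * stride
    w = (sigmoid(tw) * 2.0) ** 2 * anchor[0]
    h = (sigmoid(th) * 2.0) ** 2 * anchor[1]
    return cx, cy, w, h


small, medium, large = split_anchors(ANCHORS)
# With all-zero raw outputs the box sits at the cell centre and equals the anchor size.
box = decode_box((0.0, 0.0, 0.0, 0.0), grid_x=3, grid_y=4, stride=STRIDES[0], anchor=small[0])
print(small, medium, large)
print(box)
```

The network only has to predict small corrections around each anchor, which is why the anchor sizes should roughly match the objects in your dataset.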


// Excerpt: add these private members to the BPU_Detect declaration above
class BPU_Detect {
    private:
        std::vector<std::string> class_names_; // class names
        std::vector<std::pair<double, double>> s_anchors_; // small-object anchors
        std::vector<std::pair<double, double>> m_anchors_; // medium-object anchors
        std::vector<std::pair<double, double>> l_anchors_; // large-object anchors
};
// Constructor
BPU_Detect::BPU_Detect(const std::string& model_path,
                          int classes_num,
                          float nms_threshold,
                          float score_threshold,
                          int nms_top_k,
                          int /*d_mode*/)  // matches the declaration; not stored here
    : model_path_(model_path),
      classes_num_(classes_num),
      nms_threshold_(nms_threshold),
      score_threshold_(score_threshold),
      nms_top_k_(nms_top_k),
      is_initialized_(false),
      font_size_(DEFAULT_FONT_SIZE),
      font_thickness_(DEFAULT_FONT_THICKNESS),
      line_size_(DEFAULT_LINE_SIZE) {
    // CLASSES_LIST is expected to be a macro with the label strings,
    // defined alongside the other default parameters
    class_names_ = {CLASSES_LIST};    // initialize class names
    std::vector<float> anchors = {10.0, 13.0, 16.0, 30.0, 33.0, 23.0,     
                                 30.0, 61.0, 62.0, 45.0, 59.0, 119.0, 
                                 116.0, 90.0, 156.0, 198.0, 373.0, 326.0}; // anchor values
    // Split into small, medium and large anchors: three (w, h) pairs each
    for(int i = 0; i < 3; i++) {
        s_anchors_.push_back({anchors[i*2], anchors[i*2+1]});
        m_anchors_.push_back({anchors[i*2+6], anchors[i*2+7]});
        l_anchors_.push_back({anchors[i*2+12], anchors[i*2+13]});
    }
}
// Destructor
BPU_Detect::~BPU_Detect() {
    if(is_initialized_) {
        Release();
    }
}

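(4) Implement the Private LoadModel() Function

Now we implement the private LoadModel() function. The official API manual offers two ways to load a model: from a file and from memory. Comparing the two, hbDNNInitializeFromFiles is relatively slow because of file I/O, but its code is simple and the model file is stored separately, which suits development and debugging. hbDNNInitializeFromDDR reads directly from memory and is faster, which suits embedded systems or scenarios that need quick loading; the drawback is more complex code, closer to how TensorRT loads a model. The two APIs are described below: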

/**
 * Creates and initializes Horizon DNN Networks from file list
 * @param[out] packedDNNHandle  Horizon DNN handle, pointing to multiple models
 * @param[in] modelFileNames    paths of the model files
 * @param[in] modelFileCount    number of model files
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNInitializeFromFiles(hbPackedDNNHandle_t *packedDNNHandle,
                                 char const **modelFileNames,
                                 int32_t modelFileCount);
/**
 * Creates and initializes Horizon DNN Networks from memory
 * @param[out] packedDNNHandle  Horizon DNN handle, pointing to multiple models
 * @param[in] modelData         pointers to the model data
 * @param[in] modelDataLengths  lengths of the model data
 * @param[in] modelDataCount    number of model data blocks
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNInitializeFromDDR(hbPackedDNNHandle_t *packedDNNHandle,
                               const void **modelData,
                               int32_t *modelDataLengths,
                               int32_t modelDataCount);

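Both APIs take the model in and hand back a model handle of type hbPackedDNNHandle_t, so to use them we first need a private member variable of that type, packed_dnn_handle_. Both loading methods are common, so we cover how to use each.

We begin with the simpler hbDNNInitializeFromFiles. Since the model path came in through a macro, we first obtain a character pointer to the path, then call the loading API through our error-check macro: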

// Excerpt: add this private member to the BPU_Detect declaration
class BPU_Detect {
    private:
        hbPackedDNNHandle_t packed_dnn_handle_;
};
// Method One: FromFiles
const char* model_file_name = model_path_.c_str(); // character pointer to the file path
RDK_CHECK_SUCCESS(
        hbDNNInitializeFromFiles(&packed_dnn_handle_, &model_file_name, 1),
        "Initialize model from file failed"); // call the model loading API

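Now for loading the model from memory. The core of this step is getting the file into memory. First open the model file with the C standard library, then move the file pointer to the end to obtain the file size. Knowing the size, allocate memory for the model with malloc, read the model data into it, and verify that the file was read completely. Then prepare the model data array and length array and call the RDK loading API to initialize the model from memory. As you can see, this is far more work than the previous API.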

// Excerpt: uses the same private member as Method One
class BPU_Detect {
    private:
        hbPackedDNNHandle_t packed_dnn_handle_;
};
// Method Two: FromDDR
FILE* fp = fopen(model_path_.c_str(), "rb"); // open the model file
if (!fp) {
    std::cout << "Failed to open model file: " << model_path_ << std::endl;
    return false;
}
// Get the file size:
fseek(fp, 0, SEEK_END);        // 1. move the file pointer to the end
long file_size = ftell(fp);    // 2. the current position is the file size (-1 on error)
fseek(fp, 0, SEEK_SET);        // 3. move the file pointer back to the start
if (file_size <= 0) {
    std::cout << "Failed to get size of model file: " << model_path_ << std::endl;
    fclose(fp);
    return false;
}
size_t model_size = static_cast<size_t>(file_size);

// Allocate memory for the model data
void* model_data = malloc(model_size);
if (!model_data) {
    std::cout << "Failed to allocate memory for model data" << std::endl;
    fclose(fp);
    return false;
}
// Read the model data into memory
size_t read_size = fread(model_data, 1, model_size, fp);
fclose(fp);

// Verify that the whole file was read
if (read_size != model_size) {
    std::cout << "Failed to read model data, expected " << model_size 
             << " bytes, but got " << read_size << " bytes" << std::endl;
    free(model_data);
    return false;
}

// Prepare the model data array and length array
const void* model_data_array[] = {model_data};
int32_t model_data_length[] = {static_cast<int32_t>(model_size)};
// Initialize the model from memory with the BPU API
int32_t ddr_ret = hbDNNInitializeFromDDR(&packed_dnn_handle_, model_data_array, model_data_length, 1);

// Release the temporary buffer before checking, so a failure does not leak it
free(model_data);
RDK_CHECK_SUCCESS(ddr_ret, "Initialize model from DDR failed");

That completes LoadModel()! We add one more macro to choose the loading method. The full code is as follows:

#define LOAD_FROM_DDR 0  // 0: load the model from file, 1: load the model from memory
// Two implementations of model loading
bool BPU_Detect::LoadModel() {
#if LOAD_FROM_DDR
    // Read the model file into memory
    FILE* fp = fopen(model_path_.c_str(), "rb");
    if (!fp) {
        std::cout << "Failed to open model file: " << model_path_ << std::endl;
        return false;
    }
    // Get the file size (ftell returns -1 on error)
    fseek(fp, 0, SEEK_END);
    long file_size = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    if (file_size <= 0) {
        std::cout << "Failed to get size of model file: " << model_path_ << std::endl;
        fclose(fp);
        return false;
    }
    size_t model_size = static_cast<size_t>(file_size);
    // Allocate memory and read the model data
    void* model_data = malloc(model_size);
    if (!model_data) {
        std::cout << "Failed to allocate memory for model data" << std::endl;
        fclose(fp);
        return false;
    }
    size_t read_size = fread(model_data, 1, model_size, fp);
    fclose(fp);
    if (read_size != model_size) {
        std::cout << "Failed to read model data, expected " << model_size 
                 << " bytes, but got " << read_size << " bytes" << std::endl;
        free(model_data);
        return false;
    }
    // Load the model from memory
    const void* model_data_array[] = {model_data};
    int32_t model_data_length[] = {static_cast<int32_t>(model_size)};
    int32_t ddr_ret = hbDNNInitializeFromDDR(&packed_dnn_handle_, model_data_array, model_data_length, 1);
    // Free the buffer before checking, so a failure does not leak it
    free(model_data);
    RDK_CHECK_SUCCESS(ddr_ret, "Initialize model from DDR failed");
#else
    // Load the model from file
    const char* model_file_name = model_path_.c_str();
    RDK_CHECK_SUCCESS(
        hbDNNInitializeFromFiles(&packed_dnn_handle_, &model_file_name, 1),
        "Initialize model from file failed");
#endif
    return true;
}

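(5) Implement the Private GetModelInfo() Function

Next comes GetModelInfo(). After the model is loaded, this function fetches its basic information: the model name list, the model handle, the input information and the output information. The official API manual documents the following model-information APIs:

hbDNNGetModelNameList() gets the list and number of model names that packedDNNHandle points to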

/**
 * Get model names from given packed handle
 * @param[out] modelNameList    list of model names
 * @param[out] modelNameCount   number of model names
 * @param[in] packedDNNHandle   Horizon DNN handle, pointing to multiple models
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetModelNameList(char const ***modelNameList,
                              int32_t *modelNameCount,
                              hbPackedDNNHandle_t packedDNNHandle);

hbDNNGetModelHandle() gets the handle of one model from the list that packedDNNHandle points to; the returned dnnHandle can be used across functions and threads

/**
 * Get DNN Network handle from packed Handle with given model name
 * @param[out] dnnHandle        DNN handle, pointing to one model
 * @param[in] packedDNNHandle   DNN handle, pointing to multiple models
 * @param[in] modelName         model name
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetModelHandle(hbDNNHandle_t *dnnHandle,
                            hbPackedDNNHandle_t packedDNNHandle,
                            char const *modelName);

hbDNNGetInputCount() gets the number of input tensors of the model that dnnHandle points to

/**
 * Get input count
 * @param[out] inputCount   number of model input tensors
 * @param[in] dnnHandle     DNN handle, pointing to one model
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetInputCount(int32_t *inputCount, hbDNNHandle_t dnnHandle);

hbDNNGetInputName() gets the name of an input tensor of the model that dnnHandle points to

/**
 * Get model input name
 * @param[out] name         name of the model input tensor
 * @param[in] dnnHandle     DNN handle, pointing to one model
 * @param[in] inputIndex    index of the model input tensor
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetInputName(char const **name, hbDNNHandle_t dnnHandle,
                          int32_t inputIndex);

hbDNNGetInputTensorProperties() gets the properties of a specific input tensor of the model that dnnHandle points to

/**
 * Get input tensor properties
 * @param[out] properties   information about the input tensor
 * @param[in] dnnHandle     DNN handle, pointing to one model
 * @param[in] inputIndex    index of the model input tensor
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetInputTensorProperties(hbDNNTensorProperties *properties,
                                      hbDNNHandle_t dnnHandle,
                                      int32_t inputIndex);

hbDNNGetOutputCount() gets the number of output tensors of the model that dnnHandle points to

/**
 * Get output count
 * @param[out] outputCount  number of model output tensors
 * @param[in] dnnHandle     DNN handle, pointing to one model
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetOutputCount(int32_t *outputCount, hbDNNHandle_t dnnHandle);

hbDNNGetOutputName() gets the name of an output tensor of the model that dnnHandle points to

/**
 * Get model output name
 * @param[out] name         name of the model output tensor
 * @param[in] dnnHandle     DNN handle, pointing to one model
 * @param[in] outputIndex   index of the model output tensor
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetOutputName(char const **name, hbDNNHandle_t dnnHandle,
                           int32_t outputIndex);

hbDNNGetOutputTensorProperties() gets the properties of a specific output tensor of the model that dnnHandle points to

/**
 * Get output tensor properties
 * @param[out] properties   information about the output tensor
 * @param[in] dnnHandle     DNN handle, pointing to one model
 * @param[in] outputIndex   index of the model output tensor
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNGetOutputTensorProperties(hbDNNTensorProperties *properties,
                                       hbDNNHandle_t dnnHandle,
                                       int32_t outputIndex);

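In this step, though, we only need five of these model-level APIs to collect the basic information about our model. We start with hbDNNGetModelNameList(), which tells us how many models are packed inside the bin file we loaded. We know we only want a single YOLOv5 model, so if the converted bin does not contain exactly one packed model, something went wrong with the bin. Following the API, we create two variables to receive the model name list and the model count, call the API, and check the count:

// Add private member variable
class BPU_Detect {
    private:
        const char* model_name_;// model name
};
// Get the model name list and the model count
const char** model_name_list; // receives the model name list
int model_count = 0; // receives the number of packed models
RDK_CHECK_SUCCESS(
        hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle_),
        "hbDNNGetModelNameList failed");
// Exactly one model is expected; this also guards the [0] access below
if(model_count != 1) {
    std::cout << "Model count: " << model_count << std::endl;
    std::cout << "Please check the model count!" << std::endl;
    return false;
}
model_name_ = model_name_list[0];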

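With the model list checked, we can fetch the model's dnnHandle, a handle the caller can keep and reuse across functions and threads. Following the API, we add a private hbDNNHandle_t member to the class and then call the API directly:

// Add private member variable
class BPU_Detect {
    private:
        hbDNNHandle_t dnn_handle_;// model handle
};
// Get the model handle
RDK_CHECK_SUCCESS(
    hbDNNGetModelHandle(&dnn_handle_, packed_dnn_handle_, model_name_),
    "hbDNNGetModelHandle failed");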

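With the model handle created we can fetch the input information! Two APIs are involved: hbDNNGetInputCount returns the number of model inputs, and hbDNNGetInputTensorProperties returns the properties of one input tensor. Since we are using a YOLOv5 detection model, the model should have a single input; more than one input means the model is wrong. The output of hbDNNGetInputTensorProperties is an hbDNNTensorProperties struct. Looking at its definition, it is a nested struct: it embeds the hbDNNTensorShape, hbDNNQuantiShift and hbDNNQuantiScale structs plus the hbDNNQuantiType enum, which together describe the input tensor precisely. The definitions and the meaning of each member are as follows:

typedef struct {
  int32_t dimensionSize[HB_DNN_TENSOR_MAX_DIMENSIONS];// size of each dimension; HB_DNN_TENSOR_MAX_DIMENSIONS is the maximum number of dimensions a tensor can have
  int32_t numDimensions;// number of dimensions of the tensor
} hbDNNTensorShape;

typedef struct {
  int32_t shiftLen;// number of shift values used for quantization
  uint8_t *shiftData;// pointer to the shift values applied to the tensor data during quantization
} hbDNNQuantiShift;

typedef struct {
  int32_t scaleLen;// number of scale factors
  float *scaleData;// pointer to the scale factors used to rescale the tensor data during quantization
  int32_t zeroPointLen;// number of zero points
  int8_t *zeroPointData;// pointer to the zero points used to offset the tensor data during quantization
} hbDNNQuantiScale;

typedef enum {
  NONE, // no quantization
  SHIFT,// shift quantization
  SCALE// scale quantization
} hbDNNQuantiType;

typedef struct {
  hbDNNTensorShape validShape;// valid shape, the real size of the tensor
  hbDNNTensorShape alignedShape;// aligned shape, the size of the tensor after alignment
  int32_t tensorLayout;// tensor layout, how the data is organized in memory
  int32_t tensorType;// data type of the tensor elements
  hbDNNQuantiShift shift;// shift information for quantization
  hbDNNQuantiScale scale;// scale information for quantization
  hbDNNQuantiType quantiType;// quantization type: shift, scale, or none
  int32_t quantizeAxis;// the dimension along which quantization is applied
  int32_t alignedByteSize;// size in bytes of the tensor in memory after alignment
  int32_t stride[HB_DNN_TENSOR_MAX_DIMENSIONS];// stride of each dimension, up to HB_DNN_TENSOR_MAX_DIMENSIONS dimensions
} hbDNNTensorProperties;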

Now that we know these structs we can define our member variables. We also know what the input should look like: a single input, NV12 data, NCHW layout, and a valid shape of (1,3,H,W). So after fetching the input information through the API we can use these known facts to sanity-check the model. We first add the private members we need, then call hbDNNGetInputCount and hbDNNGetInputTensorProperties, and finally validate the input count and the tensor properties:

// Add private member variables
class BPU_Detect {
    private:
        // Model input parameters
        int input_h_;// input height
        int input_w_;// input width
        hbDNNTensorProperties input_properties_; // input tensor properties
};
// Get input information
int32_t input_count = 0;
RDK_CHECK_SUCCESS(
    hbDNNGetInputCount(&input_count, dnn_handle_),
    "hbDNNGetInputCount failed");
RDK_CHECK_SUCCESS(
    hbDNNGetInputTensorProperties(&input_properties_, dnn_handle_, 0),
    "hbDNNGetInputTensorProperties failed");
/*----------------------------- Model sanity checks -----------------*/
// Check the number of model inputs
if(input_count > 1){
    std::cout << "The model has more than one input node, please check!" << std::endl;
    return false;
}
// Check the input tensor type (the tensorType field, expected to be NV12)
if(input_properties_.tensorType == HB_DNN_IMG_TYPE_NV12){
    std::cout << "Input tensor type: HB_DNN_IMG_TYPE_NV12" << std::endl;
}
else{
    std::cout << "Input tensor type is not HB_DNN_IMG_TYPE_NV12, please check!" << std::endl;
    return false;
}
// Check the input data layout (the tensorLayout field, expected to be NCHW)
if(input_properties_.tensorLayout == HB_DNN_LAYOUT_NCHW){
    std::cout << "Input tensor layout: HB_DNN_LAYOUT_NCHW" << std::endl;
}
else{
    std::cout << "Input tensor layout is not HB_DNN_LAYOUT_NCHW, please check!" << std::endl;
    return false;
}
// Check the valid shape of the input tensor
input_h_ = input_properties_.validShape.dimensionSize[2];
input_w_ = input_properties_.validShape.dimensionSize[3];
if (input_properties_.validShape.numDimensions == 4)
{
    std::cout << "Input shape: (" << input_properties_.validShape.dimensionSize[0];
    std::cout << ", " << input_properties_.validShape.dimensionSize[1];
    std::cout << ", " << input_h_;
    std::cout << ", " << input_w_ << ")" << std::endl;
}
else
{
    std::cout << "Input shape is not 4-dimensional (1,3,H,W), please check!" << std::endl;
    return false;
}

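The input is fetched and checked, so we can't leave the output behind! We use hbDNNGetOutputCount to get the number of outputs; since we know YOLOv5 should have three outputs, we can check that here as well. After that we add a private member hbDNNTensor* output_tensors_ and allocate an array with one hbDNNTensor per model output (this only allocates the tensor descriptors; the data buffers are allocated later, in Inference()):

// Add private member variable
class BPU_Detect {
    private:
        hbDNNTensor* output_tensors_;// array of output tensors
};
// Check the number of model outputs
int32_t output_count = 0;
RDK_CHECK_SUCCESS(
    hbDNNGetOutputCount(&output_count, dnn_handle_),
    "hbDNNGetOutputCount failed");
// The code below assumes exactly three YOLOv5 output heads
if(output_count != 3) {
    std::cout << "Expected 3 model outputs, got " << output_count << std::endl;
    return false;
}
// Allocate the output tensor array
output_tensors_ = new hbDNNTensor[output_count];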

There is one more important step here. YOLOv5 has three output heads, one per feature-map scale, so we need to make sure we read the outputs in the order small objects (stride 8) -> medium objects (stride 16) -> large objects (stride 32). To do that we define an array output_order_[3], initialize it to the default order, and define the feature-map size and channel count we expect for each scale. We then loop over the expected scales, and whenever the actual size and channel count of an output match an expected scale, we record that output's index as the right one for the scale:

// Add private member variable
class BPU_Detect {
    private:
        int output_order_[3];// output order mapping
};
// Initialize the default order
output_order_[0] = 0;  // default: 1st output
output_order_[1] = 1;  // default: 2nd output
output_order_[2] = 2;  // default: 3rd output
// Expected feature-map sizes and channel counts
int32_t expected_shapes[3][3] = {
    {H_8,  W_8,  3 * (5 + classes_num_)},   // small-object feature map: H/8 x W/8
    {H_16, W_16, 3 * (5 + classes_num_)},   // medium-object feature map: H/16 x W/16
    {H_32, W_32, 3 * (5 + classes_num_)}    // large-object feature map: H/32 x W/32
};
// Loop over the expected scales
for(int i = 0; i < 3; i++) {
    // Loop over the actual output nodes
    for(int j = 0; j < 3; j++) {
        hbDNNTensorProperties output_properties;// properties of the current output node
        RDK_CHECK_SUCCESS(
            hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, j),
            "Get output tensor properties failed");
        // Actual feature-map size and channel count (NHWC)
        int32_t actual_h = output_properties.validShape.dimensionSize[1];
        int32_t actual_w = output_properties.validShape.dimensionSize[2];
        int32_t actual_c = output_properties.validShape.dimensionSize[3];
        // If the actual size and channel count match the expected ones
        if(actual_h == expected_shapes[i][0] && 
           actual_w == expected_shapes[i][1] && 
           actual_c == expected_shapes[i][2]) {
            output_order_[i] = j;// record the correct output index
            break;
        }
    }
}
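
The matching loop above can be tried out without a board. The following is a minimal sketch, not the author's code: the `Shape` struct and the `MatchOutputOrder` helper are hypothetical names (neither belongs to the hbDNN API), and the usage example assumes a 640x640 input with 80 classes. Unlike the loop above, it reports a scale that matches no output as -1 instead of silently keeping the default index.

```cpp
#include <array>
#include <vector>

// Hypothetical stand-in for the (H, W, C) part of an NHWC output shape.
struct Shape {
    int h;
    int w;
    int c;
};

// For each expected scale (small, medium, large), find the index of the
// actual output whose shape matches. A scale with no match is reported as -1.
std::array<int, 3> MatchOutputOrder(const std::array<Shape, 3>& expected,
                                    const std::vector<Shape>& actual) {
    std::array<int, 3> order = {-1, -1, -1};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < static_cast<int>(actual.size()); ++j) {
            if (actual[j].h == expected[i].h &&
                actual[j].w == expected[i].w &&
                actual[j].c == expected[i].c) {
                order[i] = j;  // output j carries scale i
                break;
            }
        }
    }
    return order;
}
```

For example, if the model exports its heads in the order (20x20, 80x80, 40x40), the helper returns {1, 2, 0}, so the small-object head is read from output 1. Checking for -1 after the call is a cheap way to catch a bin file whose heads do not have the expected shapes.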

With that, our GetModelInfo() is complete!!! The full code is as follows:

// Get model information
bool BPU_Detect::GetModelInfo() {
    // Get the model name list
    const char** model_name_list;
    int model_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle_),
        "hbDNNGetModelNameList failed");
    // Exactly one model is expected; this also guards the [0] access below
    if(model_count != 1) {
        std::cout << "Model count: " << model_count << std::endl;
        std::cout << "Please check the model count!" << std::endl;
        return false;
    }
    model_name_ = model_name_list[0];
    // Get the model handle
    RDK_CHECK_SUCCESS(
        hbDNNGetModelHandle(&dnn_handle_, packed_dnn_handle_, model_name_),
        "hbDNNGetModelHandle failed");
    // Get input information
    int32_t input_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetInputCount(&input_count, dnn_handle_),
        "hbDNNGetInputCount failed");
    RDK_CHECK_SUCCESS(
        hbDNNGetInputTensorProperties(&input_properties_, dnn_handle_, 0),
        "hbDNNGetInputTensorProperties failed");

    if(input_count > 1){
        std::cout << "The model has more than one input node, please check!" << std::endl;
        return false;
    }
    // Input tensor type: the tensorType field, expected to be NV12
    if(input_properties_.tensorType == HB_DNN_IMG_TYPE_NV12){
        std::cout << "Input tensor type: HB_DNN_IMG_TYPE_NV12" << std::endl;
    }
    else{
        std::cout << "Input tensor type is not HB_DNN_IMG_TYPE_NV12, please check!" << std::endl;
        return false;
    }
    // Input data layout: the tensorLayout field, expected to be NCHW
    if(input_properties_.tensorLayout == HB_DNN_LAYOUT_NCHW){
        std::cout << "Input tensor layout: HB_DNN_LAYOUT_NCHW" << std::endl;
    }
    else{
        std::cout << "Input tensor layout is not HB_DNN_LAYOUT_NCHW, please check!" << std::endl;
        return false;
    }
    // Get the input size
    input_h_ = input_properties_.validShape.dimensionSize[2];
    input_w_ = input_properties_.validShape.dimensionSize[3];
    if (input_properties_.validShape.numDimensions == 4)
    {
        std::cout << "Input shape: (" << input_properties_.validShape.dimensionSize[0];
        std::cout << ", " << input_properties_.validShape.dimensionSize[1];
        std::cout << ", " << input_h_;
        std::cout << ", " << input_w_ << ")" << std::endl;
    }
    else
    {
        std::cout << "Input shape is not 4-dimensional (1,3,H,W), please check!" << std::endl;
        return false;
    }
    // Get output information and adjust the output order
    int32_t output_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetOutputCount(&output_count, dnn_handle_),
        "hbDNNGetOutputCount failed");
    // The code below assumes exactly three YOLOv5 output heads
    if(output_count != 3) {
        std::cout << "Expected 3 model outputs, got " << output_count << std::endl;
        return false;
    }
    // Allocate the output tensor array
    output_tensors_ = new hbDNNTensor[output_count];
    // =============== Build the output head order mapping ===============
    // YOLOv5 has 3 output heads, one per feature-map scale
    // Required order: small objects (stride 8) -> medium (stride 16) -> large (stride 32)
    // Initialize the default order
    output_order_[0] = 0;  // default: 1st output
    output_order_[1] = 1;  // default: 2nd output
    output_order_[2] = 2;  // default: 3rd output
    // Expected feature-map sizes and channel counts
    int32_t expected_shapes[3][3] = {
        {H_8,  W_8,  3 * (5 + classes_num_)},   // small-object feature map: H/8 x W/8
        {H_16, W_16, 3 * (5 + classes_num_)},   // medium-object feature map: H/16 x W/16
        {H_32, W_32, 3 * (5 + classes_num_)}    // large-object feature map: H/32 x W/32
    };
    // Loop over the expected scales
    for(int i = 0; i < 3; i++) {
        // Loop over the actual output nodes
        for(int j = 0; j < 3; j++) {
            // Get the properties of the current output node
            hbDNNTensorProperties output_properties;
            RDK_CHECK_SUCCESS(
                hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, j),
                "Get output tensor properties failed");
            // Actual feature-map size and channel count (NHWC)
            int32_t actual_h = output_properties.validShape.dimensionSize[1];
            int32_t actual_w = output_properties.validShape.dimensionSize[2];
            int32_t actual_c = output_properties.validShape.dimensionSize[3];

            // If the actual size and channel count match the expected ones
            if(actual_h == expected_shapes[i][0] && 
               actual_w == expected_shapes[i][1] && 
               actual_c == expected_shapes[i][2]) {
                // Record the correct output index
                output_order_[i] = j;
                break;
            }
        }
    }
    // Print the output order mapping
    std::cout << "\n============ Output Order Mapping ============" << std::endl;
    std::cout << "Small object  (1/" << 8  << "): output[" << output_order_[0] << "]" << std::endl;
    std::cout << "Medium object (1/" << 16 << "): output[" << output_order_[1] << "]" << std::endl;
    std::cout << "Large object  (1/" << 32 << "): output[" << output_order_[2] << "]" << std::endl;
    std::cout << "==========================================\n" << std::endl;

    return true;
}

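(6) Implementing the private PreProcess() function

Next up is the preprocessing function. Image preprocessing boils down to converting the image size and the image format, so this part is fairly simple and I will go through it a bit faster. For the size conversion we use letterbox. As everyone knows, OpenCV has a function called resize that changes the image size directly, but it does so bluntly: when the source and target aspect ratios differ, it changes the aspect ratio of the image and distorts it. A wide frame resized straight to a square input, for example, comes out visibly stretched.

With letterbox the picture is not distorted, because the image is scaled uniformly and keeps its original aspect ratio: once the long side has been resized to the required length, the remaining space along the short side is filled with gray.

So we implement the preprocessing with letterbox; the code follows. The core idea is to scale the image proportionally to fit the target size while keeping its aspect ratio. To keep the image centered inside the target, the empty area is padded on both sides, usually with a neutral color such as (127, 127, 127). This way the image is not distorted when scaled and its aspect ratio is preserved.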

// Add private member variables
class BPU_Detect {
    private:
        float x_scale_;                          // scale factor along X
        float y_scale_;                          // scale factor along Y
        int x_shift_;                            // padding offset along X
        int y_shift_;                            // padding offset along Y
        cv::Mat resized_img_;                    // resized (letterboxed) image
        hbDNNTensor input_tensor_;               // input tensor
};
// Preprocess with letterbox: one uniform scale for both axes
x_scale_ = std::min(1.0f * input_h_ / input_img.rows, 1.0f * input_w_ / input_img.cols);
y_scale_ = x_scale_;

// Scaled width, then left (x_shift_) and right (x_other) padding
int new_w = input_img.cols * x_scale_;
x_shift_ = (input_w_ - new_w) / 2;
int x_other = input_w_ - new_w - x_shift_;

// Scaled height, then top (y_shift_) and bottom (y_other) padding
int new_h = input_img.rows * y_scale_;
y_shift_ = (input_h_ - new_h) / 2;
int y_other = input_h_ - new_h - y_shift_;

cv::resize(input_img, resized_img_, cv::Size(new_w, new_h));
cv::copyMakeBorder(resized_img_, resized_img_, y_shift_, y_other, 
                   x_shift_, x_other, cv::BORDER_CONSTANT, cv::Scalar(127, 127, 127));
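
The letterbox arithmetic can be checked on its own, without OpenCV. Below is a minimal sketch under stated assumptions: `LetterboxParams` and `ComputeLetterbox` are hypothetical names, and the sketch rounds the scaled size to the nearest pixel, whereas the snippet above truncates when converting float to int.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical result type: what is needed to letterbox a frame and to map
// detections back to the original image afterwards.
struct LetterboxParams {
    double scale;  // uniform scale factor applied to both axes
    int new_w;     // width of the resized image before padding
    int new_h;     // height of the resized image before padding
    int x_shift;   // left padding in pixels
    int y_shift;   // top padding in pixels
    int x_other;   // right padding in pixels
    int y_other;   // bottom padding in pixels
};

LetterboxParams ComputeLetterbox(int src_w, int src_h, int dst_w, int dst_h) {
    LetterboxParams p{};
    // The smaller of the two ratios makes the whole image fit in the target.
    p.scale = std::min(static_cast<double>(dst_h) / src_h,
                       static_cast<double>(dst_w) / src_w);
    // Round to the nearest pixel and clamp so padding never goes negative.
    p.new_w = std::min(dst_w, static_cast<int>(std::lround(src_w * p.scale)));
    p.new_h = std::min(dst_h, static_cast<int>(std::lround(src_h * p.scale)));
    // Split the leftover space between the two sides.
    p.x_shift = (dst_w - p.new_w) / 2;
    p.y_shift = (dst_h - p.new_h) / 2;
    p.x_other = dst_w - p.new_w - p.x_shift;
    p.y_other = dst_h - p.new_h - p.y_shift;
    return p;
}
```

For a 1920x1080 frame and a 640x640 model input this gives a 640x360 resized image with 140 rows of padding above and below and no horizontal padding.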

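With the resize done, we use OpenCV to convert the image from BGR to planar YUV (I420); the U and V planes are interleaved into NV12 when we fill the input tensor further down:

// Convert BGR to planar YUV I420 (interleaved into NV12 later)
cv::Mat yuv_mat;
cv::cvtColor(resized_img_, yuv_mat, cv::COLOR_BGR2YUV_I420);

With the image operations done we can prepare the model's input data! We need to turn the processed image into an input the model accepts: allocate memory for the input tensor and copy the processed image data (in YUV format) into it, so that the model can access and use it correctly. This involves the hbSysAllocCachedMem API; let's look at its description and the structs it involves: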

/**
 * Allocate cachable system memory
 * @param[out] mem
 * @param[in] size
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbSysAllocCachedMem(hbSysMem *mem, uint32_t size);

typedef struct {
  hbSysMem sysMem[4];
  hbDNNTensorProperties properties;
} hbDNNTensor;

typedef struct {
  uint64_t phyAddr;
  void *virAddr;
  uint32_t memSize;
} hbSysMem;

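As the API shows, a memory block is described by an hbSysMem struct, which holds its physical address (phyAddr), its virtual address (virAddr) and its size (memSize). We call hbSysAllocCachedMem to allocate memory for the input tensor. The memory is cacheable: CPU reads and writes go through the CPU cache, which keeps them fast, but it also means the cache has to be flushed explicitly before the BPU reads the buffer and invalidated before the CPU reads what the BPU wrote. hbDNNTensor is the struct that describes a whole tensor: its sysMem array holds the memory blocks that store the tensor data (a single block, sysMem[0], is all our NV12 input uses), and its hbDNNTensorProperties member stores the tensor's attributes, such as shape, data type and quantization information.

We first allocate cached memory for the input tensor with hbSysAllocCachedMem; sysMem[0] is the block that stores the YUV data. The size is 3 * input_h_ * input_w_ / 2 bytes: in YUV 4:2:0 the Y plane takes one byte per pixel (input_h_ * input_w_ bytes), while the U and V planes are each subsampled by two in both directions and so take a quarter of that each. We then copy the Y plane from yuv_mat into ynv12, the virtual address of the memory we just allocated, and interleave the U and V planes byte by byte to obtain the NV12 layout the model expects. Finally, once the data is ready, we call hbSysFlushMem with HB_SYS_MEM_CACHE_CLEAN to write the CPU cache back to memory. The code is as follows: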

// Prepare the input tensor (check the allocation result like the other hb* calls)
RDK_CHECK_SUCCESS(
    hbSysAllocCachedMem(&input_tensor_.sysMem[0], int(3 * input_h_ * input_w_ / 2)),
    "Allocate input memory failed");
uint8_t* yuv = yuv_mat.ptr<uint8_t>();
uint8_t* ynv12 = (uint8_t*)input_tensor_.sysMem[0].virAddr;
// Height and width of the chroma planes, and size of the Y plane
int uv_height = input_h_ / 2;
int uv_width = input_w_ / 2;
int y_size = input_h_ * input_w_;
// Copy the Y plane into the input tensor
memcpy(ynv12, yuv, y_size);
// Locate the UV area of the NV12 buffer and the U and V planes of the I420 source
uint8_t* nv12 = ynv12 + y_size;
uint8_t* u_data = yuv + y_size;
uint8_t* v_data = u_data + uv_height * uv_width;
// Interleave U and V to form NV12
for(int i = 0; i < uv_width * uv_height; i++) {
    *nv12++ = *u_data++;
    *nv12++ = *v_data++;
}
// Write the CPU cache back to memory so the BPU sees the data
hbSysFlushMem(&input_tensor_.sysMem[0], HB_SYS_MEM_CACHE_CLEAN);
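
The I420-to-NV12 repacking is easy to get wrong by one plane, so it helps to test it on a tiny buffer. The following is a minimal sketch using std::vector in place of the BPU memory; `I420ToNV12` is a hypothetical helper name, and it assumes an even width and height and a source buffer of at least width * height * 3 / 2 bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Repack planar I420 (Y plane, U plane, V plane) into NV12
// (Y plane followed by interleaved U/V pairs).
std::vector<uint8_t> I420ToNV12(const std::vector<uint8_t>& i420,
                                int width, int height) {
    const std::size_t y_size = static_cast<std::size_t>(width) * height;
    const std::size_t uv_size = y_size / 4;  // samples in each chroma plane
    std::vector<uint8_t> nv12(y_size + 2 * uv_size);

    // The Y plane is identical in both formats.
    std::copy(i420.begin(), i420.begin() + y_size, nv12.begin());

    const uint8_t* u = i420.data() + y_size;  // start of the U plane
    const uint8_t* v = u + uv_size;           // start of the V plane
    uint8_t* uv = nv12.data() + y_size;       // start of the interleaved area
    for (std::size_t i = 0; i < uv_size; ++i) {
        uv[2 * i] = u[i];
        uv[2 * i + 1] = v[i];
    }
    return nv12;
}
```

On a 4x2 image the Y plane has 8 bytes and each chroma plane has 2, so a source of Y0..Y7, U0, U1, V0, V1 becomes Y0..Y7, U0, V0, U1, V1.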

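With that, our PreProcess() is complete!!! The full function is the three snippets above put together in order: letterbox, color conversion, and filling the input tensor.

(7) Implementing the private Inference() function

Now for the inference part. The user manual shows that we mainly need the two APIs below: hbDNNInfer runs the model inference, and hbDNNWaitTaskDone blocks until the inference task finishes or the given timeout expires.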

/**
 * DNN inference
 * @param[out] taskHandle: return a pointer represent the task if success,  otherwise nullptr
 * @param[out] output: pointer to the output tensor array, the size of array should be equal to $(`hbDNNGetOutputCount`)
 * @param[in] input: input tensor array, the size of array should be equal to  $(`hbDNNGetInputCount`)
 * @param[in] dnnHandle: pointer to the dnn handle
 * @param[in] inferCtrlParam: infer control parameters
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNInfer(hbDNNTaskHandle_t *taskHandle, hbDNNTensor **output,
                   hbDNNTensor const *input, hbDNNHandle_t dnnHandle,
                   hbDNNInferCtrlParam *inferCtrlParam);
/**
 * Wait util task completed or timeout.
 * @param[in] taskHandle: pointer to the task
 * @param[in] timeout: timeout of milliseconds
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNWaitTaskDone(hbDNNTaskHandle_t taskHandle, int32_t timeout);

Next we look up the hbDNNInferCtrlParam *inferCtrlParam parameter. Its definition and the macro used to initialize it are as follows:

#define HB_DNN_INITIALIZE_INFER_CTRL_PARAM(param) \
  {                                               \
    (param)->bpuCoreId = HB_BPU_CORE_ANY;         \
    (param)->dspCoreId = HB_DSP_CORE_ANY;         \
    (param)->priority = HB_DNN_PRIORITY_LOWEST;   \
    (param)->more = false;                        \
    (param)->customId = 0;                        \
    (param)->reserved1 = 0;                       \
    (param)->reserved2 = 0;                       \
  }
typedef struct {
  int32_t bpuCoreId;    // BPU core ID: which BPU core runs the inference task
  int32_t dspCoreId;    // DSP core ID: which DSP core runs the inference task
  int32_t priority;     // priority of the inference task
  int32_t more;         // whether more inference tasks follow, usually false
  int64_t customId;     // custom ID that can be used to identify the task
  int32_t reserved1;    // reserved, currently unused
  int32_t reserved2;    // reserved, currently unused
} hbDNNInferCtrlParam;

With that background we can start writing the inference code! First come a few preparations. We add an hbDNNTaskHandle_t member, task_handle_, which uniquely identifies one inference task and makes task management easier, and set it to nullptr so that it is guaranteed to be empty before inference starts. Then, for each output tensor, we fetch its properties and allocate memory based on its aligned size (alignedByteSize). The allocation again goes through hbSysAllocCachedMem, introduced above, so every output tensor gets a cached buffer large enough for the BPU to write into without overrunning it. The code looks like this:

// Add private member variable
class BPU_Detect {
    private:
        hbDNNTaskHandle_t task_handle_;          // inference task handle
};
// Initialize the task handle to nullptr
task_handle_ = nullptr;
// Set the input tensor properties
input_tensor_.properties = input_properties_;
// Get the output tensor properties
for(int i = 0; i < 3; i++) {
    hbDNNTensorProperties output_properties;
    RDK_CHECK_SUCCESS(
        hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, i),
        "Get output tensor properties failed");
    output_tensors_[i].properties = output_properties;

    // Allocate memory for the output
    int out_aligned_size = output_properties.alignedByteSize;
    RDK_CHECK_SUCCESS(
        hbSysAllocCachedMem(&output_tensors_[i].sysMem[0], out_aligned_size),
        "Allocate output memory failed");
}

Preparations done, we can run inference!!! We create an hbDNNInferCtrlParam, initialize it with the official HB_DNN_INITIALIZE_INFER_CTRL_PARAM macro, and call hbDNNInfer to start the inference. We then call hbDNNWaitTaskDone to wait for the task to finish:

hbDNNInferCtrlParam infer_ctrl_param;
HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
RDK_CHECK_SUCCESS(
        hbDNNInfer(&task_handle_, &output_tensors_, &input_tensor_, dnn_handle_, &infer_ctrl_param),
        "Model inference failed");
RDK_CHECK_SUCCESS(
    hbDNNWaitTaskDone(task_handle_, 0),
    "Wait task done failed");

With that, our Inference() is complete!!! The full code is as follows:

// Inference implementation
bool BPU_Detect::Inference() {
    // Initialize the task handle to nullptr
    task_handle_ = nullptr;
    
    // Set the input tensor properties
    input_tensor_.properties = input_properties_;
    
    // Get the output tensor properties
    for(int i = 0; i < 3; i++) {
        hbDNNTensorProperties output_properties;
        RDK_CHECK_SUCCESS(
            hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, i),
            "Get output tensor properties failed");
        output_tensors_[i].properties = output_properties;
        
        // Allocate memory for the output
        int out_aligned_size = output_properties.alignedByteSize;
        RDK_CHECK_SUCCESS(
            hbSysAllocCachedMem(&output_tensors_[i].sysMem[0], out_aligned_size),
            "Allocate output memory failed");
    }
    
    hbDNNInferCtrlParam infer_ctrl_param;
    HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
    
    RDK_CHECK_SUCCESS(
        hbDNNInfer(&task_handle_, &output_tensors_, &input_tensor_, dnn_handle_, &infer_ctrl_param),
        "Model inference failed");
    
    RDK_CHECK_SUCCESS(
        hbDNNWaitTaskDone(task_handle_, 0),
        "Wait task done failed");
    
    return true;
}

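(8) Implementing the private ProcessFeatureMap() function

Before post-processing we still need ProcessFeatureMap, a helper that extracts bounding boxes and their scores from one output feature map and stores them for the later NMS (non-maximum suppression) step. First we check the quantization type (quantiType) of the output tensor. If it is not NONE we print an error and return, because this code assumes the output is unquantized floating-point data; quantized outputs would have to be handled differently.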

if (output_tensor.properties.quantiType != NONE) {
    std::cout << "Output tensor quantization type should be NONE!" << std::endl;
    return;
}

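Next, to make sure the data we read is up to date, we call hbSysFlushMem with HB_SYS_MEM_CACHE_INVALIDATE. This invalidates the CPU cache for the buffer, so subsequent reads come from the memory the BPU just wrote instead of stale cached data: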

/**
 * Flush cachable system memory
 * @param[in] mem
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbSysFlushMem(hbSysMem *mem, int32_t flag);

hbSysFlushMem(&output_tensor.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);

Then we take the address of the output data from output_tensor.sysMem[0].virAddr and cast it to float*. This address points to the raw output of the model inference:

auto* raw_data = reinterpret_cast<float*>(output_tensor.sysMem[0].virAddr);

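After that we use for loops to visit every position (height and width) of the output feature map. Each position holds one prediction per anchor, made up of the box center, the box width and height, the objectness score and the class scores; each anchor represents one candidate box shape: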

for(int h = 0; h < height; h++) {
    for(int w = 0; w < width; w++) {
        for(const auto& anchor : anchors) {

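For each position and anchor we first take the current prediction vector (cur_raw, which is 5 + classes_num_ floats long) and filter on cur_raw[4], the raw objectness value that indicates whether an object is present. If it is below the preset raw threshold (conf_thres_raw), we skip this anchor: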

if(cur_raw[4] < conf_thres_raw) continue;

Next we look for the largest class score among cur_raw[5] through cur_raw[classes_num_ + 4], which gives the class predicted for the current anchor:

int cls_id = 5;
int end = classes_num_ + 5;
for(int i = 6; i < end; i++) {
    if(cur_raw[i] > cur_raw[cls_id]) {
        cls_id = i;
    }
}
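
The same search can be written with the standard library. This is a minimal sketch, assuming the prediction layout described above ([x, y, w, h, objectness, class scores...]); `BestClass` is a hypothetical helper name, and unlike cls_id in the loop above it returns a zero-based class index directly.

```cpp
#include <algorithm>

// Returns the zero-based index of the class with the largest raw score in one
// prediction vector laid out as [x, y, w, h, obj, cls_0, ..., cls_{n-1}].
// On ties the first maximum wins, matching the strict '>' in the manual loop.
int BestClass(const float* cur_raw, int classes_num) {
    const float* first = cur_raw + 5;         // first class score
    const float* last = first + classes_num;  // one past the last class score
    return static_cast<int>(std::max_element(first, last) - first);
}
```

Because the sigmoid is monotonic, picking the largest raw score picks the class with the largest probability as well, so no sigmoid is needed for this step.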

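With the best class found, we compute the final score of the current anchor: the sigmoid of the objectness value multiplied by the sigmoid of the best class score. Detections whose score is below score_threshold_ are filtered out: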

float score = 1.0f / (1.0f + std::exp(-cur_raw[4])) * 
              1.0f / (1.0f + std::exp(-cur_raw[cls_id]));
if(score < score_threshold_) continue;

Finally we decode the position and size of the box. Applying the sigmoid to the raw outputs, we turn the center terms (cur_raw[0], cur_raw[1]) and the size terms (cur_raw[2], cur_raw[3]) into a box in model-input pixels: the center is offset by the grid cell and multiplied by the stride, and the width and height are multiplied by the anchor size. We then store the box and its score under the predicted class in bboxes_ (the detected boxes) and scores_ (the matching scores):

float stride = input_h_ / height;
float center_x = ((1.0f / (1.0f + std::exp(-cur_raw[0]))) * 2 - 0.5f + w) * stride;
float center_y = ((1.0f / (1.0f + std::exp(-cur_raw[1]))) * 2 - 0.5f + h) * stride;
float bbox_w = std::pow((1.0f / (1.0f + std::exp(-cur_raw[2]))) * 2, 2) * anchor.first;
float bbox_h = std::pow((1.0f / (1.0f + std::exp(-cur_raw[3]))) * 2, 2) * anchor.second;
float bbox_x = center_x - bbox_w / 2.0f;
float bbox_y = center_y - bbox_h / 2.0f;

cls_id -= 5; // turn the offset into cur_raw into a zero-based class index
bboxes_[cls_id].push_back(cv::Rect2d(bbox_x, bbox_y, bbox_w, bbox_h));
scores_[cls_id].push_back(score);
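
The decoding formulas can be isolated into a small function and checked by hand. The following is a minimal sketch, not the author's code: `Box`, `Sigmoid` and `DecodeBox` are hypothetical names, and the anchor is passed in pixels, as in the snippet above.

```cpp
#include <cmath>

// Hypothetical box type: top-left corner plus size, in model-input pixels.
struct Box {
    float x;
    float y;
    float w;
    float h;
};

inline float Sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Decode one YOLOv5 prediction at grid cell (grid_x, grid_y).
// cur_raw[0..3] are the raw center and size terms for this anchor.
Box DecodeBox(const float* cur_raw, int grid_x, int grid_y, float stride,
              float anchor_w, float anchor_h) {
    // Center: sigmoid output mapped to (-0.5, 1.5) around the cell, then
    // scaled from grid units to pixels.
    const float cx = (Sigmoid(cur_raw[0]) * 2.0f - 0.5f + grid_x) * stride;
    const float cy = (Sigmoid(cur_raw[1]) * 2.0f - 0.5f + grid_y) * stride;
    // Size: (2 * sigmoid)^2 lies in (0, 4) and multiplies the anchor size.
    const float sw = Sigmoid(cur_raw[2]) * 2.0f;
    const float sh = Sigmoid(cur_raw[3]) * 2.0f;
    const float bw = sw * sw * anchor_w;
    const float bh = sh * sh * anchor_h;
    return Box{cx - bw / 2.0f, cy - bh / 2.0f, bw, bh};
}
```

With all four raw values at 0 the sigmoid is 0.5, so the box is centered in its cell and has exactly the anchor's size: at cell (3, 2) with stride 8 and a 10x13 anchor, the center is (28, 20) and the box is (23, 13.5, 10, 13).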

With that, our ProcessFeatureMap() is complete!!! The full code is as follows:

// Feature map processing helper
void BPU_Detect::ProcessFeatureMap(hbDNNTensor& output_tensor, 
                                  int height, int width,
                                  const std::vector<std::pair<double, double>>& anchors,
                                  float conf_thres_raw) {
    // Check the quantization type
    if (output_tensor.properties.quantiType != NONE) {
        std::cout << "Output tensor quantization type should be NONE!" << std::endl;
        return;
    }
    
    // Invalidate the CPU cache so we read what the BPU wrote
    hbSysFlushMem(&output_tensor.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);
    
    // Get the pointer to the output data
    auto* raw_data = reinterpret_cast<float*>(output_tensor.sysMem[0].virAddr);
    
    // Visit every position of the feature map
    for(int h = 0; h < height; h++) {
        for(int w = 0; w < width; w++) {
            for(const auto& anchor : anchors) {
                // Take the prediction for this position and anchor, then advance
                float* cur_raw = raw_data;
                raw_data += (5 + classes_num_);
                
                // Filter on the raw objectness value
                if(cur_raw[4] < conf_thres_raw) continue;
                
                // Find the largest class score
                int cls_id = 5;
                int end = classes_num_ + 5;
                for(int i = 6; i < end; i++) {
                    if(cur_raw[i] > cur_raw[cls_id]) {
                        cls_id = i;
                    }
                }
                
                // Final score: sigmoid(objectness) * sigmoid(best class score)
                float score = 1.0f / (1.0f + std::exp(-cur_raw[4])) * 
                            1.0f / (1.0f + std::exp(-cur_raw[cls_id]));
                
                // Filter on the final score
                if(score < score_threshold_) continue;
                cls_id -= 5; // turn the offset into a zero-based class index
                
                // Decode the bounding box
                float stride = input_h_ / height;
                float center_x = ((1.0f / (1.0f + std::exp(-cur_raw[0]))) * 2 - 0.5f + w) * stride;
                float center_y = ((1.0f / (1.0f + std::exp(-cur_raw[1]))) * 2 - 0.5f + h) * stride;
                float bbox_w = std::pow((1.0f / (1.0f + std::exp(-cur_raw[2]))) * 2, 2) * anchor.first;
                float bbox_h = std::pow((1.0f / (1.0f + std::exp(-cur_raw[3]))) * 2, 2) * anchor.second;
                float bbox_x = center_x - bbox_w / 2.0f;
                float bbox_y = center_y - bbox_h / 2.0f;
                
                // Store the detection
                bboxes_[cls_id].push_back(cv::Rect2d(bbox_x, bbox_y, bbox_w, bbox_h));
                scores_[cls_id].push_back(score);
            }
        }
    }
}

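(9) Implementing the private PostProcess() function

Once inference is done it is time for post-processing, which consists of three steps: clear the previous results, process the output feature maps, and run NMS (non-maximum suppression) for each class. Before each post-processing pass we clear the stored results: bboxes_ holds the detected boxes, scores_ holds the score of each box, and indices_ holds the indices of the boxes that NMS keeps. We then resize all three to classes_num_ so that there is one slot per class, and convert the preset score_threshold_ (score threshold) into its raw form, conf_thres_raw. (PS: the model outputs raw values, not probabilities. Since the sigmoid is monotonic, sigmoid(x) >= t is equivalent to x >= -log(1/t - 1), so the raw outputs can be compared against conf_thres_raw directly and the sigmoid can be skipped for most anchors.) The code is as follows:

// Add private member variables
class BPU_Detect {
    private:
        // Detection result storage
        std::vector<std::vector<cv::Rect2d>> bboxes_;  // boxes of each class
        std::vector<std::vector<float>> scores_;       // scores of each class
        std::vector<std::vector<int>> indices_;        // indices kept by NMS
        // YOLOv5 anchors
        std::vector<std::pair<double, double>> s_anchors_;  // small-object anchors
        std::vector<std::pair<double, double>> m_anchors_;  // medium-object anchors
        std::vector<std::pair<double, double>> l_anchors_;  // large-object anchors
};

bboxes_.clear();  // clear the boxes
scores_.clear();  // clear the scores
indices_.clear(); // clear the indices

bboxes_.resize(classes_num_);  // one box list per class
scores_.resize(classes_num_);  // one score list per class
indices_.resize(classes_num_); // one index list per class

float conf_thres_raw = -log(1 / score_threshold_ - 1);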
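
The threshold conversion is the inverse of the sigmoid, often called the logit. Writing \(\sigma(x) = 1 / (1 + e^{-x})\), solving \(\sigma(x) = t\) for \(x\) gives \(x = -\ln(1/t - 1)\), which is the expression used for conf_thres_raw. A minimal sketch that checks this numerically, with hypothetical helper names `Sigmoid` and `Logit`:

```cpp
#include <cmath>

// Maps a raw model output to a probability in (0, 1).
inline float Sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Inverse of Sigmoid for 0 < p < 1: the raw value whose sigmoid equals p.
inline float Logit(float p) { return -std::log(1.0f / p - 1.0f); }

// Comparing the raw value with the raw threshold gives the same decision as
// comparing probabilities, without evaluating exp() for every anchor.
inline bool PassesRawThreshold(float raw, float score_threshold) {
    return raw >= Logit(score_threshold);
}
```

For a score threshold of 0.25 the raw threshold is about -1.0986 (that is, -ln 3), so a raw objectness of -2.0 is rejected and 0.0 is accepted. Filtering on objectness alone is a safe pre-filter because the final score is the objectness probability multiplied by a class probability that is at most 1.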

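Detection models usually produce multi-scale outputs, with each scale responsible for objects of a different size, so we call our ProcessFeatureMap helper once per feature map. We index output_tensors_ through the output_order_ mapping built in GetModelInfo(), so that each scale is paired with the right anchors regardless of the order in which the model exports its outputs:

// Process the three output scales, using the order detected in GetModelInfo()
ProcessFeatureMap(output_tensors_[output_order_[0]], H_8, W_8, s_anchors_, conf_thres_raw);
ProcessFeatureMap(output_tensors_[output_order_[1]], H_16, W_16, m_anchors_, conf_thres_raw);
ProcessFeatureMap(output_tensors_[output_order_[2]], H_32, W_32, l_anchors_, conf_thres_raw);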

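Finally we use OpenCV's cv::dnn::NMSBoxes to suppress duplicate boxes based on their scores, their overlap (IoU) and the configured thresholds, which leaves us with the indices of the boxes to keep for each class (indices_):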

for(int i = 0; i < classes_num_; i++) {
    cv::dnn::NMSBoxes(bboxes_[i], scores_[i], score_threshold_, 
                       nms_threshold_, indices_[i], 1.f, nms_top_k_);
}
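
To make clear what the NMS step does, here is a plain greedy NMS written from scratch. This is an illustration of the general technique under my own assumptions, not OpenCV's implementation: `Box`, `IoU` and `GreedyNms` are hypothetical names, and details such as top-k limiting and the eta parameter of cv::dnn::NMSBoxes are left out.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical box type: top-left corner plus size.
struct Box {
    float x;
    float y;
    float w;
    float h;
};

// Intersection over union of two axis-aligned boxes.
float IoU(const Box& a, const Box& b) {
    const float x1 = std::max(a.x, b.x);
    const float y1 = std::max(a.y, b.y);
    const float x2 = std::min(a.x + a.w, b.x + b.w);
    const float y2 = std::min(a.y + a.h, b.y + b.h);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Keep the highest-scoring box, drop every remaining box that overlaps a
// kept box by more than iou_thr, and repeat. Returns the kept indices.
std::vector<int> GreedyNms(const std::vector<Box>& boxes,
                           const std::vector<float>& scores,
                           float score_thr, float iou_thr) {
    std::vector<int> order;
    for (int i = 0; i < static_cast<int>(boxes.size()); ++i) {
        if (scores[i] > score_thr) order.push_back(i);
    }
    // Highest score first.
    std::sort(order.begin(), order.end(),
              [&scores](int a, int b) { return scores[a] > scores[b]; });

    std::vector<int> keep;
    for (int idx : order) {
        bool suppressed = false;
        for (int kept : keep) {
            if (IoU(boxes[idx], boxes[kept]) > iou_thr) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) keep.push_back(idx);
    }
    return keep;
}
```

Running it per class, as the loop above does with cv::dnn::NMSBoxes, means boxes of different classes never suppress each other.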

With that, our PostProcess() is complete!!! The full code is as follows:

// Post-processing implementation
bool BPU_Detect::PostProcess() {
    // Clear the previous results
    bboxes_.clear();
    scores_.clear();
    indices_.clear();
    
    // One slot per class
    bboxes_.resize(classes_num_);
    scores_.resize(classes_num_);
    indices_.resize(classes_num_);
    
    float conf_thres_raw = -log(1 / score_threshold_ - 1);
    
    // Process the three output scales, using the order detected in GetModelInfo()
    ProcessFeatureMap(output_tensors_[output_order_[0]], H_8, W_8, s_anchors_, conf_thres_raw);
    ProcessFeatureMap(output_tensors_[output_order_[1]], H_16, W_16, m_anchors_, conf_thres_raw);
    ProcessFeatureMap(output_tensors_[output_order_[2]], H_32, W_32, l_anchors_, conf_thres_raw);
    
    // Run NMS for each class
    for(int i = 0; i < classes_num_; i++) {
        cv::dnn::NMSBoxes(bboxes_[i], scores_[i], score_threshold_, 
                         nms_threshold_, indices_[i], 1.f, nms_top_k_);
    }
    
    return true;
}

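(10) Implementing the private DrawResults() function

Next comes the helper that draws the results, DrawResults! Whether we want the boxes drawn differs between development and debugging on the one hand and deployment on the other, so we first add a macro that switches drawing on or off:

#define ENABLE_DRAW 0    // Drawing switch: 0 = disabled, 1 = enabled

This part is plain OpenCV, so we won't go into detail: we just iterate over the detections that survived NMS for each class. The one thing to watch is that the image was letterboxed during preprocessing, so the boxes have to be mapped back to the original image space using x_shift_, y_shift_, x_scale_ and y_scale_: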

float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;

Finally, the full code of the DrawResults function is as follows:

// Draw results implementation
void BPU_Detect::DrawResults(cv::Mat& img) {
#if ENABLE_DRAW
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        if(!indices_[cls_id].empty()) {
            for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                int idx = indices_[cls_id][i];
                float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                float score = scores_[cls_id][idx];
                
                // Draw the bounding box
                cv::rectangle(img, cv::Point(x1, y1), cv::Point(x2, y2), 
                            cv::Scalar(255, 0, 0), line_size_);
                
                // Draw the label
                std::string text = class_names_[cls_id] + ": " + 
                                std::to_string(static_cast<int>(score * 100)) + "%";
                cv::putText(img, text, cv::Point(x1, y1 - 5), 
                          cv::FONT_HERSHEY_SIMPLEX, font_size_, 
                          cv::Scalar(0, 0, 255), font_thickness_, cv::LINE_AA);
            }
        }
    }
#endif
    // Print the detection results
    PrintResults();
}

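(11) Implementing the private PrintResults() function

Only the PrintResults function is left now, and there is not much to say about it: we just use for loops to print the model's results in a tidy format. The only thing to know is that indices_ is organized by class (cls_id), and each class holds the indices of all boxes of that class that passed NMS. The full code is as follows:

// Print detection results implementation
void BPU_Detect::PrintResults() const {
    // Print a summary of the detection results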
    int total_detections = 0;
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        total_detections += indices_[cls_id].size();
    }
    std::cout << "\n============ Detection Results ============" << std::endl;
    std::cout << "Total detections: " << total_detections << std::endl;
    
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        if(!indices_[cls_id].empty()) {
            std::cout << "\nClass: " << class_names_[cls_id] << std::endl;
            std::cout << "Number of detections: " << indices_[cls_id].size() << std::endl;
            std::cout << "Details:" << std::endl;
            
            for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                int idx = indices_[cls_id][i];
                float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                float score = scores_[cls_id][idx];
                
                // Print the details of each detection box
                std::cout << "  Detection " << i + 1 << ":" << std::endl;
                std::cout << "    Position: (" << x1 << ", " << y1 << ") to (" << x2 << ", " << y2 << ")" << std::endl;
                std::cout << "    Confidence: " << std::fixed << std::setprecision(2) << score * 100 << "%" << std::endl;
            }
        }
    }
    std::cout << "========================================\n" << std::endl;
}

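With that, all the private helper functions are done and we can move on to the three public functions!!!

(12) Implementing the public Init() function

We start with the initialization function. During initialization we only need to load the model and then fetch and check the model information, so we simply call our LoadModel and GetModelInfo functions:

if(!LoadModel()) {
    std::cout << "Failed to load model!" << std::endl;
    return false;
}
if(!GetModelInfo()) {
    std::cout << "Failed to get model info!" << std::endl;
    return false;
}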

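Finally we add the initialized flag and print the elapsed time, and our initialization function is complete!!! The full code is as follows:

// Init function implementation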
bool BPU_Detect::Init() {
    if(is_initialized_) {
        std::cout << "Already initialized!" << std::endl;
        return true;
    }
    
    auto init_start = std::chrono::high_resolution_clock::now();
    
    if(!LoadModel()) {
        std::cout << "Failed to load model!" << std::endl;
        return false;
    }
    
    if(!GetModelInfo()) {
        std::cout << "Failed to get model info!" << std::endl;
        return false;
    }
    
    is_initialized_ = true;
    
    auto init_end = std::chrono::high_resolution_clock::now();
    float init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
    std::cout << "\n============ Model Loading Time ============" << std::endl;
    std::cout << "Total init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
    std::cout << "=========================================\n" << std::endl;
    
    return true;
}

(13) Implementing the public Detect() function

Next we implement the Detect function. We first check whether initialization has succeeded:

if(!is_initialized_) {
    std::cout << "Please initialize first!" << std::endl;
    return false;
}

Then we call the PreProcess, Inference and PostProcess functions in order, bailing out as soon as one of them fails, and finally call our DrawResults function:

if(!PreProcess(input_img)) {
    return false;
}
if(!Inference()) {
    return false;
}
if(!PostProcess()) {
    return false;
}

DrawResults(output_img);
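
The "run each stage in order and stop at the first failure" pattern above can also be written generically. The sketch below is not part of the tutorial's class: `Stage` and `RunStages` are hypothetical names, and the stages are plain callables rather than the real PreProcess/Inference/PostProcess members. One small advantage of this form is that the caller learns which stage failed.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One named pipeline stage; the callable returns false on failure.
using Stage = std::pair<std::string, std::function<bool()>>;

// Runs the stages in order and stops at the first failure.
// Returns an empty string on success, otherwise the name of the failing stage.
inline std::string RunStages(const std::vector<Stage>& stages) {
    for (const auto& stage : stages) {
        assert(stage.second && "stage has no callable");
        if (!stage.second()) {
            return stage.first;  // later stages are never executed
        }
    }
    return "";
}
```

In the detector this could be called with three lambdas wrapping the member functions, but for a fixed three-step pipeline the explicit `if` chain shown above is just as readable.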

Finally, we add the timing output, and our detection function is complete! The full code is as follows:

// Detect() implementation
bool BPU_Detect::Detect(const cv::Mat& input_img, cv::Mat& output_img) {
    if(!is_initialized_) {
        std::cout << "Please initialize first!" << std::endl;
        return false;
    }
    
    auto total_start = std::chrono::high_resolution_clock::now();
    
#if ENABLE_DRAW
    input_img.copyTo(output_img);
#endif

    // Time the preprocessing stage
    auto preprocess_start = std::chrono::high_resolution_clock::now();
    if(!PreProcess(input_img)) {
        return false;
    }
    auto preprocess_end = std::chrono::high_resolution_clock::now();
    float preprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(preprocess_end - preprocess_start).count() / 1000.0f;
    
    // Time the inference stage
    auto infer_start = std::chrono::high_resolution_clock::now();
    if(!Inference()) {
        return false;
    }
    auto infer_end = std::chrono::high_resolution_clock::now();
    float infer_time = std::chrono::duration_cast<std::chrono::microseconds>(infer_end - infer_start).count() / 1000.0f;
    
    // Time the postprocessing stage
    auto postprocess_start = std::chrono::high_resolution_clock::now();
    if(!PostProcess()) {
        return false;
    }
    auto postprocess_end = std::chrono::high_resolution_clock::now();
    float postprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(postprocess_end - postprocess_start).count() / 1000.0f;
    
    // Time the drawing stage
    auto draw_start = std::chrono::high_resolution_clock::now();
    DrawResults(output_img);
    auto draw_end = std::chrono::high_resolution_clock::now();
    float draw_time = std::chrono::duration_cast<std::chrono::microseconds>(draw_end - draw_start).count() / 1000.0f;
    
    // Total time
    auto total_end = std::chrono::high_resolution_clock::now();
    float total_time = std::chrono::duration_cast<std::chrono::microseconds>(total_end - total_start).count() / 1000.0f;
    
    // Print the timing statistics
    std::cout << "\n============ Time Statistics ============" << std::endl;
    std::cout << "Preprocess time: " << std::fixed << std::setprecision(2) << preprocess_time << " ms" << std::endl;
    std::cout << "Inference time: " << std::fixed << std::setprecision(2) << infer_time << " ms" << std::endl;
    std::cout << "Postprocess time: " << std::fixed << std::setprecision(2) << postprocess_time << " ms" << std::endl;
    std::cout << "Draw time: " << std::fixed << std::setprecision(2) << draw_time << " ms" << std::endl;
    std::cout << "Total time: " << std::fixed << std::setprecision(2) << total_time << " ms" << std::endl;
    std::cout << "FPS: " << std::fixed << std::setprecision(2) << 1000.0f / total_time << std::endl;
    std::cout << "======================================\n" << std::endl;
    
    return true;
}
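
The start/end/duration_cast pattern is repeated for every stage in Detect(). As a rough sketch of how it could be factored out, here are two small helpers. `MeasureMs` and `FpsFromMs` are hypothetical names, not part of the tutorial code, and this sketch uses `std::chrono::steady_clock` rather than the `high_resolution_clock` used above because it is guaranteed to be monotonic.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Runs `fn` and returns how long it took, in milliseconds.
template <typename Fn>
float MeasureMs(Fn&& fn) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    std::forward<Fn>(fn)();
    const std::chrono::duration<float, std::milli> elapsed = Clock::now() - start;
    return elapsed.count();
}

// Converts a per-frame latency in milliseconds into frames per second.
inline float FpsFromMs(float total_ms) {
    assert(total_ms > 0.0f);  // guards the division
    return 1000.0f / total_ms;
}
```

A stage could then be timed with something like `bool ok = false; float ms = MeasureMs([&] { ok = PreProcess(input_img); });`, which keeps the success flag and the timing separate.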

(14) Implementing the public Release() function

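The last thing to implement is the resource release function. We first check whether the detector has been initialized; if it has not, there is nothing to release: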

if(!is_initialized_) {
    return true;
}

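Next we check whether we still hold an inference task handle. If we do, we release the task with hbDNNReleaseTask, which also cancels or stops a task that has not finished. The API is documented as follows: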

/**
 * Release a task and its related resources. If the task has not been executed then it will be canceled,
 * and if the task has not been finished then it will be stopped.
 * This interface will return immediately, and all operations will run in the background
 * @param[in] taskHandle: pointer to the task
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbDNNReleaseTask(hbDNNTaskHandle_t taskHandle);

So our code only needs to call this function and then set the task handle to null:

if(task_handle_) {
    hbDNNReleaseTask(task_handle_);
    task_handle_ = nullptr;
}
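
Setting the handle to null right after releasing it is what makes a second release attempt harmless. The "release, then reset" idiom can be sketched as a tiny generic helper. `ReleaseAndReset` is a hypothetical name for this example, and the release function is passed in as a callable so the sketch does not depend on the BPU headers.

```cpp
// Releases `handle` through `release_fn` if it is non-null, then resets it to null.
// Returns true if a release actually happened, false if there was nothing to release.
template <typename T, typename ReleaseFn>
bool ReleaseAndReset(T*& handle, ReleaseFn release_fn) {
    if (handle == nullptr) {
        return false;
    }
    release_fn(handle);
    handle = nullptr;  // a second call becomes a no-op
    return true;
}
```

Assuming the task handle is a plain pointer type, as the `nullptr` assignment above suggests, the snippet could be written as `ReleaseAndReset(task_handle_, hbDNNReleaseTask);`.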

Finally we use the hbSysFreeMem API to free the input and output memory in turn, and then release the model itself with hbDNNRelease:

/**
 * Free mem
 * @param[in] mem
 * @return 0 if success, return defined error code otherwise
 */
int32_t hbSysFreeMem(hbSysMem *mem);

// Free the input memory
if(input_tensor_.sysMem[0].virAddr) {
    hbSysFreeMem(&(input_tensor_.sysMem[0]));
}

// Free the output memory
for(int i = 0; i < 3; i++) {
    if(output_tensors_ && output_tensors_[i].sysMem[0].virAddr) {
        hbSysFreeMem(&(output_tensors_[i].sysMem[0]));
    }
}

if(output_tensors_) {
    delete[] output_tensors_;
    output_tensors_ = nullptr;
}

// Release the model
if(packed_dnn_handle_) {
    hbDNNRelease(packed_dnn_handle_);
    packed_dnn_handle_ = nullptr;
}

Finally, we add a few details, and our resource release function is complete! The full code is as follows:

// Release() implementation
bool BPU_Detect::Release() {
    if(!is_initialized_) {
        return true;
    }
    
    // Release the inference task
    if(task_handle_) {
        hbDNNReleaseTask(task_handle_);
        task_handle_ = nullptr;
    }
    
    try {
        // Free the input memory
        if(input_tensor_.sysMem[0].virAddr) {
            hbSysFreeMem(&(input_tensor_.sysMem[0]));
        }
        
        // Free the output memory
        for(int i = 0; i < 3; i++) {
            if(output_tensors_ && output_tensors_[i].sysMem[0].virAddr) {
                hbSysFreeMem(&(output_tensors_[i].sysMem[0]));
            }
        }
        
        if(output_tensors_) {
            delete[] output_tensors_;
            output_tensors_ = nullptr;
        }
        
        // Release the model
        if(packed_dnn_handle_) {
            hbDNNRelease(packed_dnn_handle_);
            packed_dnn_handle_ = nullptr;
        }
    } catch(const std::exception& e) {
        std::cout << "Exception during release: " << e.what() << std::endl;
    }
    
    is_initialized_ = false;
    return true;
}
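
The `is_initialized_` flag is what makes Init() and Release() safe to call more than once, and what lets the destructor call Release() unconditionally. The following sketch shows that lifecycle with a mock class that only counts resources instead of touching the BPU. `MockDetector` and `live_resources()` are hypothetical and exist only for this illustration.

```cpp
// Stand-in for BPU_Detect: counts "resources" instead of loading a real model.
class MockDetector {
public:
    ~MockDetector() { Release(); }  // safe even if Release() was already called

    bool Init() {
        if (initialized_) {
            return true;            // a second Init() is a no-op
        }
        ++live_resources_;          // pretend to load the model
        initialized_ = true;
        return true;
    }

    bool Release() {
        if (!initialized_) {
            return true;            // nothing to free
        }
        --live_resources_;          // pretend to free the model
        initialized_ = false;
        return true;
    }

    int live_resources() const { return live_resources_; }

private:
    bool initialized_ = false;
    int live_resources_ = 0;
};
```

Calling Init() twice leaves exactly one resource alive, and calling Release() twice brings the count back to zero without a double free, which is the behavior the real class aims for.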

(15) Implementing the main function

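The tutorial is almost at its end. All that remains is the logic that creates the class and runs inference, and then this section is done. Note that the current code has not been optimized yet, so inference does not reach the best possible performance. A dedicated optimization tutorial is planned for after the New Year holiday, so stay tuned!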

Using the detection class is simple: create a detector instance, initialize it, pass the image or frame you want to detect to detector.Detect(), and finally release the resources:

BPU_Detect detector;
// Initialize
if (!detector.Init()) {
    std::cout << "Failed to initialize detector" << std::endl;
    return -1;
}
if (!detector.Detect(input_img, output_img)) {
    std::cout << "Detection failed" << std::endl;
    return -1;
}
// Release resources
detector.Release();

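Remember the macros we added earlier for single-image and real-time detection? In the main function we branch on those macros and add a few details. The full code is as follows: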

int main() {
    // Create the detector instance
    BPU_Detect detector;
    // Initialize
    if (!detector.Init()) {
        std::cout << "Failed to initialize detector" << std::endl;
        return -1;
    }
#if DETECT_MODE == 0
    // Single-image detection mode
    std::cout << "Single image detection mode" << std::endl;
    // Load the test image
    cv::Mat input_img = cv::imread("/path/to/img");
    if (input_img.empty()) {
        std::cout << "Failed to load image" << std::endl;
        return -1;
    }
    // Run detection
    cv::Mat output_img;
    if (!detector.Detect(input_img, output_img)) {
        std::cout << "Detection failed" << std::endl;
        return -1;
    }
#if ENABLE_DRAW
    // Save the result (output_img is only filled when drawing is enabled)
    cv::imwrite("cpp_result.jpg", output_img);
#endif
#else
    // Real-time detection mode
    std::cout << "Real-time detection mode" << std::endl;
    // Open the camera
    cv::VideoCapture cap(0);
    if (!cap.isOpened()) {
        std::cout << "Failed to open camera" << std::endl;
        return -1;
    }
    cv::Mat frame, output_frame;
    while (true) {
        // Read one frame
        cap >> frame;
        if (frame.empty()) {
            std::cout << "Failed to read frame" << std::endl;
            break;
        }
        // Run detection
        if (!detector.Detect(frame, output_frame)) {
            std::cout << "Detection failed" << std::endl;
            break;
        }
#if ENABLE_DRAW
        // Show the result
        cv::imshow("Real-time Detection", output_frame);
        
        // Press 'q' to quit
        if (cv::waitKey(1) == 'q') {
            break;
        }
#endif
    }
    // Release the camera
    cap.release();
#if ENABLE_DRAW
    cv::destroyAllWindows();
#endif
#endif
    // Release resources
    detector.Release();
    return 0;
}

The complete code, for reference only:

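// C++ standard library
#include <iostream>     // input/output streams
#include <string>       // std::string
#include <vector>       // vector container
#include <algorithm>    // algorithms (std::min)
#include <chrono>       // timing utilities
#include <iomanip>      // output formatting
#include <cmath>        // log
#include <cstdio>       // fopen, fread
#include <cstdlib>      // malloc, free
#include <cstring>      // memcpy

// OpenCV
#include <opencv2/opencv.hpp>      // main OpenCV header
#include <opencv2/dnn/dnn.hpp>     // OpenCV deep learning module (NMSBoxes)

// Horizon RDK BPU API
#include "dnn/hb_dnn.h"           // core BPU functions
#include "dnn/hb_dnn_ext.h"       // BPU extensions
#include "dnn/plugin/hb_dnn_layer.h"    // BPU layer definitions
#include "dnn/plugin/hb_dnn_plugin.h"   // BPU plugins
#include "dnn/hb_sys.h"           // BPU system functions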

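// Error-checking macro. Use it only inside functions that return bool:
// on failure it prints the error code and returns false. (Returning the
// non-zero error code itself would convert to true and hide the failure.)
#define RDK_CHECK_SUCCESS(value, errmsg) \
    do { \
        auto ret_code = (value); \
        if (ret_code != 0) { \
            std::cout << (errmsg) << ", error code: " << ret_code << std::endl; \
            return false; \
        } \
    } while (0)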

// Default parameters for the model and for detection
#define DEFAULT_MODEL_PATH "/root/Deep_Learning/YOLOv5/models/tennis_detect_640x640_bayese_.bin"  // default model path
#define DEFAULT_CLASSES_NUM 1          // default number of classes
#define CLASSES_LIST "tennis_ball"     // class names
#define DEFAULT_NMS_THRESHOLD 0.45f    // non-maximum suppression threshold
#define DEFAULT_SCORE_THRESHOLD 0.25f  // confidence threshold
#define DEFAULT_NMS_TOP_K 300          // maximum number of boxes kept by NMS
#define DEFAULT_FONT_SIZE 1.0f         // font size for drawn text
#define DEFAULT_FONT_THICKNESS 1.0f    // font thickness for drawn text
#define DEFAULT_LINE_SIZE 2.0f         // line thickness for drawn boxes

// Run mode selection
#define DETECT_MODE 0    // detection mode: 0 = single image, 1 = real-time
#define ENABLE_DRAW 0    // drawing switch: 0 = disabled, 1 = enabled
#define LOAD_FROM_DDR 1  // model loading: 0 = load from file, 1 = load from memory

// Feature map sizes (fractions of the input size)
#define H_8 (input_h_ / 8)    // 1/8 of the input height
#define W_8 (input_w_ / 8)    // 1/8 of the input width
#define H_16 (input_h_ / 16)  // 1/16 of the input height
#define W_16 (input_w_ / 16)  // 1/16 of the input width
#define H_32 (input_h_ / 32)  // 1/32 of the input height
#define W_32 (input_w_ / 32)  // 1/32 of the input width

// BPU object detection class
class BPU_Detect {
public:
    // Constructor: sets the detector parameters
    // @param model_path: path to the model file
    // @param classes_num: number of detection classes
    // @param nms_threshold: NMS threshold
    // @param score_threshold: confidence threshold
    // @param nms_top_k: maximum number of boxes kept by NMS
    BPU_Detect(const std::string& model_path = DEFAULT_MODEL_PATH,
                 int classes_num = DEFAULT_CLASSES_NUM,
                 float nms_threshold = DEFAULT_NMS_THRESHOLD,
                 float score_threshold = DEFAULT_SCORE_THRESHOLD,
                 int nms_top_k = DEFAULT_NMS_TOP_K);
    
    // Destructor: releases resources
    ~BPU_Detect();

    // Main public interface
    bool Init();  // initialize the BPU and the model
    bool Detect(const cv::Mat& input_img, cv::Mat& output_img);  // run object detection
    bool Release();  // release all resources

private:
    // Internal helpers
    bool LoadModel();  // load the model file
    bool GetModelInfo();  // get the model's input/output information
    bool PreProcess(const cv::Mat& input_img);  // image preprocessing (resize and format conversion)
    bool Inference();  // run model inference
    bool PostProcess();  // postprocessing (NMS etc.)
    void DrawResults(cv::Mat& img);  // draw the detections on the image
    void PrintResults() const;  // print the detections to the console

    // Feature map processing helper
    // @param output_tensor: output tensor
    // @param height, width: feature map size
    // @param anchors: anchor boxes for this scale
    // @param conf_thres_raw: raw (pre-sigmoid) confidence threshold
    void ProcessFeatureMap(hbDNNTensor& output_tensor, 
                          int height, int width,
                          const std::vector<std::pair<double, double>>& anchors,
                          float conf_thres_raw);

    // Member variables (in constructor initialization order)
    std::string model_path_;      // path to the model file
    int classes_num_;             // number of classes
    float nms_threshold_;         // NMS threshold
    float score_threshold_;       // confidence threshold
    int nms_top_k_;              // maximum number of boxes kept by NMS
    bool is_initialized_;         // initialization flag
    float font_size_;            // font size for drawn text
    float font_thickness_;       // font thickness for drawn text
    float line_size_;            // line thickness for drawn boxes
    
    // BPU-related variables
    hbPackedDNNHandle_t packed_dnn_handle_;  // packed model handle
    hbDNNHandle_t dnn_handle_;               // model handle
    const char* model_name_;                 // model name
    
    // Input and output tensors
    hbDNNTensor input_tensor_;               // input tensor
    hbDNNTensor* output_tensors_;            // array of output tensors
    hbDNNTensorProperties input_properties_; // input tensor properties
    
    // Task
    hbDNNTaskHandle_t task_handle_;          // inference task handle
    
    // Model input size
    int input_h_;                            // input height
    int input_w_;                            // input width
    
    // Detection results
    std::vector<std::vector<cv::Rect2d>> bboxes_;  // bounding boxes per class
    std::vector<std::vector<float>> scores_;       // scores per class
    std::vector<std::vector<int>> indices_;        // indices kept after NMS
    
    // Image processing parameters
    float x_scale_;                          // scale factor in X
    float y_scale_;                          // scale factor in Y
    int x_shift_;                            // padding offset in X
    int y_shift_;                            // padding offset in Y
    cv::Mat resized_img_;                    // resized (letterboxed) image
    
    // YOLOv5 anchors
    std::vector<std::pair<double, double>> s_anchors_;  // anchors for small objects
    std::vector<std::pair<double, double>> m_anchors_;  // anchors for medium objects
    std::vector<std::pair<double, double>> l_anchors_;  // anchors for large objects
    
    // Output handling
    int output_order_[3];                    // output order mapping
    std::vector<std::string> class_names_;   // list of class names
};

// Constructor
BPU_Detect::BPU_Detect(const std::string& model_path,
                          int classes_num,
                          float nms_threshold,
                          float score_threshold,
                          int nms_top_k)
    : model_path_(model_path),
      classes_num_(classes_num),
      nms_threshold_(nms_threshold),
      score_threshold_(score_threshold),
      nms_top_k_(nms_top_k),
      is_initialized_(false),
      font_size_(DEFAULT_FONT_SIZE),
      font_thickness_(DEFAULT_FONT_THICKNESS),
      line_size_(DEFAULT_LINE_SIZE),
      // Start with null handles and zeroed tensors so that Release()
      // never tests or frees an uninitialized pointer
      packed_dnn_handle_(nullptr),
      dnn_handle_(nullptr),
      model_name_(nullptr),
      input_tensor_{},
      output_tensors_(nullptr),
      input_properties_{},
      task_handle_(nullptr),
      input_h_(0),
      input_w_(0) {
    
    // Initialize the class names
    class_names_ = {CLASSES_LIST};
    
    // Initialize the anchors (width, height pairs)
    std::vector<float> anchors = {10.0, 13.0, 16.0, 30.0, 33.0, 23.0, 
                                 30.0, 61.0, 62.0, 45.0, 59.0, 119.0, 
                                 116.0, 90.0, 156.0, 198.0, 373.0, 326.0};
    
    // Split them into small, medium and large anchors
    for(int i = 0; i < 3; i++) {
        s_anchors_.push_back({anchors[i*2], anchors[i*2+1]});
        m_anchors_.push_back({anchors[i*2+6], anchors[i*2+7]});
        l_anchors_.push_back({anchors[i*2+12], anchors[i*2+13]});
    }
}

// Destructor
BPU_Detect::~BPU_Detect() {
    if(is_initialized_) {
        Release();
    }
}

// Init() implementation
bool BPU_Detect::Init() {
    if(is_initialized_) {
        std::cout << "Already initialized!" << std::endl;
        return true;
    }
    
    auto init_start = std::chrono::high_resolution_clock::now();
    
    if(!LoadModel()) {
        std::cout << "Failed to load model!" << std::endl;
        return false;
    }
    
    if(!GetModelInfo()) {
        std::cout << "Failed to get model info!" << std::endl;
        return false;
    }
    
    is_initialized_ = true;
    
    auto init_end = std::chrono::high_resolution_clock::now();
    float init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
    std::cout << "\n============ Model Loading Time ============" << std::endl;
    std::cout << "Total init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
    std::cout << "=========================================\n" << std::endl;
    
    return true;
}

// LoadModel() implementation
bool BPU_Detect::LoadModel() {
    // Start of the total loading time
    auto load_start = std::chrono::high_resolution_clock::now();

#if LOAD_FROM_DDR
    // Time spent reading the model data from the file
    float read_time = 0.0f;
#endif
    // Time spent initializing the model
    float init_time = 0.0f;
    
#if LOAD_FROM_DDR
    // =============== Read the model file into memory ===============
    auto read_start = std::chrono::high_resolution_clock::now();
    
    // Open the model file
    FILE* fp = fopen(model_path_.c_str(), "rb");
    if (!fp) {
        std::cout << "Failed to open model file: " << model_path_ << std::endl;
        return false;
    }
    
    // Get the file size
    fseek(fp, 0, SEEK_END);      // 1. Move the file pointer to the end
    long file_size = ftell(fp);  // 2. The current position is the file size
    fseek(fp, 0, SEEK_SET);      // 3. Move the file pointer back to the start
    if (file_size <= 0) {
        std::cout << "Failed to get model file size: " << model_path_ << std::endl;
        fclose(fp);
        return false;
    }
    size_t model_size = static_cast<size_t>(file_size);
    
    // Allocate memory for the model data
    void* model_data = malloc(model_size);
    if (!model_data) {
        std::cout << "Failed to allocate memory for model data" << std::endl;
        fclose(fp);
        return false;
    }
    
    // Read the model data into memory
    size_t read_size = fread(model_data, 1, model_size, fp);
    fclose(fp);
    
    // Compute the file reading time
    auto read_end = std::chrono::high_resolution_clock::now();
    read_time = std::chrono::duration_cast<std::chrono::microseconds>(read_end - read_start).count() / 1000.0f;
    
    // Verify that the whole file was read
    if (read_size != model_size) {
        std::cout << "Failed to read model data, expected " << model_size 
                 << " bytes, but got " << read_size << " bytes" << std::endl;
        free(model_data);
        return false;
    }
    
    // =============== Initialize the model from memory ===============
    auto init_start = std::chrono::high_resolution_clock::now();
    
    // Prepare the model data array and the length array
    const void* model_data_array[] = {model_data};
    int32_t model_data_length[] = {static_cast<int32_t>(model_size)};
    
    // Initialize the model from memory with the BPU API
    int32_t init_ret = hbDNNInitializeFromDDR(&packed_dnn_handle_, model_data_array, model_data_length, 1);
    
    // Free the temporary buffer whether or not initialization succeeded
    free(model_data);
    if (init_ret != 0) {
        std::cout << "Initialize model from DDR failed, error code: " << init_ret << std::endl;
        return false;
    }
    
    // Compute the model initialization time
    auto init_end = std::chrono::high_resolution_clock::now();
    init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
#else
    // =============== Initialize the model directly from the file ===============
    auto init_start = std::chrono::high_resolution_clock::now();
    
    // Get the model file path
    const char* model_file_name = model_path_.c_str();
    
    // Initialize the model from the file with the BPU API
    RDK_CHECK_SUCCESS(
        hbDNNInitializeFromFiles(&packed_dnn_handle_, &model_file_name, 1),
        "Initialize model from file failed");
    
    // Compute the model initialization time
    auto init_end = std::chrono::high_resolution_clock::now();
    init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
#endif

    // =============== Compute and print the timing summary ===============
    auto load_end = std::chrono::high_resolution_clock::now();
    float total_load_time = std::chrono::duration_cast<std::chrono::microseconds>(load_end - load_start).count() / 1000.0f;

    // Print the timing statistics
    std::cout << "\n============ Model Loading Details ============" << std::endl;
#if LOAD_FROM_DDR
    std::cout << "File reading time: " << std::fixed << std::setprecision(2) << read_time << " ms" << std::endl;
#endif
    std::cout << "Model init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
    std::cout << "Total loading time: " << std::fixed << std::setprecision(2) << total_load_time << " ms" << std::endl;
    std::cout << "===========================================\n" << std::endl;

    return true;
}

// GetModelInfo() implementation
bool BPU_Detect::GetModelInfo() {
    // Get the list of model names
    const char** model_name_list;
    int model_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle_),
        "hbDNNGetModelNameList failed");
    if(model_count > 1) {
        std::cout << "Model count: " << model_count << std::endl;
        std::cout << "Please check the model count!" << std::endl;
        return false;
    }
    model_name_ = model_name_list[0];
    
    // Get the model handle
    RDK_CHECK_SUCCESS(
        hbDNNGetModelHandle(&dnn_handle_, packed_dnn_handle_, model_name_),
        "hbDNNGetModelHandle failed");
    
    // Get the input information
    int32_t input_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetInputCount(&input_count, dnn_handle_),
        "hbDNNGetInputCount failed");
    RDK_CHECK_SUCCESS(
        hbDNNGetInputTensorProperties(&input_properties_, dnn_handle_, 0),
        "hbDNNGetInputTensorProperties failed");

    if(input_count > 1){
        std::cout << "The model has more than one input node, please check!" << std::endl;
        return false;
    }
    // Check the input tensor type (the model is expected to take NV12 images)
    if(input_properties_.tensorType == HB_DNN_IMG_TYPE_NV12){
        std::cout << "Input tensor type: HB_DNN_IMG_TYPE_NV12" << std::endl;
    }
    else{
        std::cout << "Input tensor type is not HB_DNN_IMG_TYPE_NV12, please check!" << std::endl;
        return false;
    }
    // Check the input tensor layout
    if(input_properties_.tensorLayout == HB_DNN_LAYOUT_NCHW){
        std::cout << "Input tensor layout: HB_DNN_LAYOUT_NCHW" << std::endl;
    }
    else{
        std::cout << "Input tensor layout is not HB_DNN_LAYOUT_NCHW, please check!" << std::endl;
        return false;
    }
    // Get the input size (NCHW: index 2 is height, index 3 is width)
    if (input_properties_.validShape.numDimensions != 4)
    {
        std::cout << "Input shape is not 4-dimensional, e.g. (1,3,640,640), please check!" << std::endl;
        return false;
    }
    input_h_ = input_properties_.validShape.dimensionSize[2];
    input_w_ = input_properties_.validShape.dimensionSize[3];
    std::cout << "Input shape: (" << input_properties_.validShape.dimensionSize[0];
    std::cout << ", " << input_properties_.validShape.dimensionSize[1];
    std::cout << ", " << input_h_;
    std::cout << ", " << input_w_ << ")" << std::endl;
    
    // Get the output information and work out the output order
    int32_t output_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetOutputCount(&output_count, dnn_handle_),
        "hbDNNGetOutputCount failed");
    // The rest of this class assumes exactly 3 YOLOv5 output heads
    if(output_count != 3) {
        std::cout << "Expected 3 output nodes, but got " << output_count << ", please check!" << std::endl;
        return false;
    }
    
    // Allocate the (zero-initialized) output tensor array
    output_tensors_ = new hbDNNTensor[output_count]();
    
    // =============== Build the output head order mapping ===============
    // YOLOv5 has 3 output heads, one per feature map scale.
    // We want the order: small objects (stride 8) -> medium (stride 16) -> large (stride 32)
    
    // Start with the default order
    output_order_[0] = 0;  // default: first output
    output_order_[1] = 1;  // default: second output
    output_order_[2] = 2;  // default: third output

    // Expected feature map size and channel count for each scale
    int32_t expected_shapes[3][3] = {
        {H_8,  W_8,  3 * (5 + classes_num_)},   // small objects: H/8 x W/8
        {H_16, W_16, 3 * (5 + classes_num_)},   // medium objects: H/16 x W/16
        {H_32, W_32, 3 * (5 + classes_num_)}    // large objects: H/32 x W/32
    };

    // For each expected scale
    for(int i = 0; i < 3; i++) {
        // Look through the actual output nodes
        for(int j = 0; j < 3; j++) {
            // Get the properties of this output node
            hbDNNTensorProperties output_properties;
            RDK_CHECK_SUCCESS(
                hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, j),
                "Get output tensor properties failed");
            
            // Actual feature map size and channel count (NHWC)
            int32_t actual_h = output_properties.validShape.dimensionSize[1];
            int32_t actual_w = output_properties.validShape.dimensionSize[2];
            int32_t actual_c = output_properties.validShape.dimensionSize[3];

            // If the actual shape matches the expected one
            if(actual_h == expected_shapes[i][0] && 
               actual_w == expected_shapes[i][1] && 
               actual_c == expected_shapes[i][2]) {
                // Record which output node holds this scale
                output_order_[i] = j;
                break;
            }
        }
    }

    // Print the output order mapping
    std::cout << "\n============ Output Order Mapping ============" << std::endl;
    std::cout << "Small object  (1/" << 8  << "): output[" << output_order_[0] << "]" << std::endl;
    std::cout << "Medium object (1/" << 16 << "): output[" << output_order_[1] << "]" << std::endl;
    std::cout << "Large object  (1/" << 32 << "): output[" << output_order_[2] << "]" << std::endl;
    std::cout << "==========================================\n" << std::endl;

    return true;
}

// Detect() implementation
bool BPU_Detect::Detect(const cv::Mat& input_img, cv::Mat& output_img) {
    if(!is_initialized_) {
        std::cout << "Please initialize first!" << std::endl;
        return false;
    }
    
    auto total_start = std::chrono::high_resolution_clock::now();
    
#if ENABLE_DRAW
    input_img.copyTo(output_img);
#endif

    // Time the preprocessing stage
    auto preprocess_start = std::chrono::high_resolution_clock::now();
    if(!PreProcess(input_img)) {
        return false;
    }
    auto preprocess_end = std::chrono::high_resolution_clock::now();
    float preprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(preprocess_end - preprocess_start).count() / 1000.0f;
    
    // Time the inference stage
    auto infer_start = std::chrono::high_resolution_clock::now();
    if(!Inference()) {
        return false;
    }
    auto infer_end = std::chrono::high_resolution_clock::now();
    float infer_time = std::chrono::duration_cast<std::chrono::microseconds>(infer_end - infer_start).count() / 1000.0f;
    
    // Time the postprocessing stage
    auto postprocess_start = std::chrono::high_resolution_clock::now();
    if(!PostProcess()) {
        return false;
    }
    auto postprocess_end = std::chrono::high_resolution_clock::now();
    float postprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(postprocess_end - postprocess_start).count() / 1000.0f;
    
    // Time the drawing stage
    auto draw_start = std::chrono::high_resolution_clock::now();
    DrawResults(output_img);
    auto draw_end = std::chrono::high_resolution_clock::now();
    float draw_time = std::chrono::duration_cast<std::chrono::microseconds>(draw_end - draw_start).count() / 1000.0f;
    
    // Total time
    auto total_end = std::chrono::high_resolution_clock::now();
    float total_time = std::chrono::duration_cast<std::chrono::microseconds>(total_end - total_start).count() / 1000.0f;
    
    // Print the timing statistics
    std::cout << "\n============ Time Statistics ============" << std::endl;
    std::cout << "Preprocess time: " << std::fixed << std::setprecision(2) << preprocess_time << " ms" << std::endl;
    std::cout << "Inference time: " << std::fixed << std::setprecision(2) << infer_time << " ms" << std::endl;
    std::cout << "Postprocess time: " << std::fixed << std::setprecision(2) << postprocess_time << " ms" << std::endl;
    std::cout << "Draw time: " << std::fixed << std::setprecision(2) << draw_time << " ms" << std::endl;
    std::cout << "Total time: " << std::fixed << std::setprecision(2) << total_time << " ms" << std::endl;
    std::cout << "FPS: " << std::fixed << std::setprecision(2) << 1000.0f / total_time << std::endl;
    std::cout << "======================================\n" << std::endl;
    
    return true;
}

// PreProcess() implementation
bool BPU_Detect::PreProcess(const cv::Mat& input_img) {
    if (input_img.empty()) {
        std::cout << "Input image is empty!" << std::endl;
        return false;
    }
    // Letterbox: scale uniformly, then pad to the model input size
    x_scale_ = std::min(1.0f * input_h_ / input_img.rows, 1.0f * input_w_ / input_img.cols);
    y_scale_ = x_scale_;
    
    int new_w = input_img.cols * x_scale_;
    x_shift_ = (input_w_ - new_w) / 2;
    int x_other = input_w_ - new_w - x_shift_;
    
    int new_h = input_img.rows * y_scale_;
    y_shift_ = (input_h_ - new_h) / 2;
    int y_other = input_h_ - new_h - y_shift_;
    
    cv::resize(input_img, resized_img_, cv::Size(new_w, new_h));
    cv::copyMakeBorder(resized_img_, resized_img_, y_shift_, y_other, 
                       x_shift_, x_other, cv::BORDER_CONSTANT, cv::Scalar(127, 127, 127));
    
    // Convert to YUV I420 first; it is repacked into NV12 below
    cv::Mat yuv_mat;
    cv::cvtColor(resized_img_, yuv_mat, cv::COLOR_BGR2YUV_I420);
    
    // Allocate the input tensor buffer (NV12 takes 1.5 bytes per pixel).
    // Note: this allocates on every call; only the latest buffer is freed in Release().
    if (hbSysAllocCachedMem(&input_tensor_.sysMem[0], int(3 * input_h_ * input_w_ / 2)) != 0) {
        std::cout << "Allocate input memory failed" << std::endl;
        return false;
    }
    uint8_t* yuv = yuv_mat.ptr<uint8_t>();
    uint8_t* ynv12 = (uint8_t*)input_tensor_.sysMem[0].virAddr;
    // Size of the UV planes and of the Y plane
    int uv_height = input_h_ / 2;
    int uv_width = input_w_ / 2;
    int y_size = input_h_ * input_w_;
    // Copy the Y plane into the input tensor
    memcpy(ynv12, yuv, y_size);
    // Locate the UV data: NV12 destination, and the I420 U and V source planes
    uint8_t* nv12 = ynv12 + y_size;
    uint8_t* u_data = yuv + y_size;
    uint8_t* v_data = u_data + uv_height * uv_width;
    // Interleave U and V, as NV12 requires
    for(int i = 0; i < uv_width * uv_height; i++) {
        *nv12++ = *u_data++;
        *nv12++ = *v_data++;
    }
    // Flush the CPU cache to memory so the BPU sees the data we just wrote
    hbSysFlushMem(&input_tensor_.sysMem[0], HB_SYS_MEM_CACHE_CLEAN);
    return true;
}

// Inference() implementation
bool BPU_Detect::Inference() {
    // Reset the task handle to nullptr
    task_handle_ = nullptr;
    
    // Set the input tensor properties
    input_tensor_.properties = input_properties_;
    
    // Get the output tensor properties
    for(int i = 0; i < 3; i++) {
        hbDNNTensorProperties output_properties;
        RDK_CHECK_SUCCESS(
            hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, i),
            "Get output tensor properties failed");
        output_tensors_[i].properties = output_properties;
        
        // Allocate memory for the output.
        // Note: this allocates on every call; only the latest buffers are freed in Release().
        int out_aligned_size = output_properties.alignedByteSize;
        RDK_CHECK_SUCCESS(
            hbSysAllocCachedMem(&output_tensors_[i].sysMem[0], out_aligned_size),
            "Allocate output memory failed");
    }
    
    hbDNNInferCtrlParam infer_ctrl_param;
    HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
    
    // Submit the inference task
    RDK_CHECK_SUCCESS(
        hbDNNInfer(&task_handle_, &output_tensors_, &input_tensor_, dnn_handle_, &infer_ctrl_param),
        "Model inference failed");
    
    // Wait for the task to finish
    RDK_CHECK_SUCCESS(
        hbDNNWaitTaskDone(task_handle_, 0),
        "Wait task done failed");
    
    return true;
}

// PostProcess() implementation
bool BPU_Detect::PostProcess() {
    // Clear the previous results
    bboxes_.clear();
    scores_.clear();
    indices_.clear();
    
    // Resize to one entry per class
    bboxes_.resize(classes_num_);
    scores_.resize(classes_num_);
    indices_.resize(classes_num_);
    
    // Inverse sigmoid of the score threshold, so raw outputs can be
    // compared against it without computing a sigmoid for every value
    float conf_thres_raw = -log(1 / score_threshold_ - 1);
    
    // Process the three scales, using the order mapping built in GetModelInfo()
    ProcessFeatureMap(output_tensors_[output_order_[0]], H_8, W_8, s_anchors_, conf_thres_raw);
    ProcessFeatureMap(output_tensors_[output_order_[1]], H_16, W_16, m_anchors_, conf_thres_raw);
    ProcessFeatureMap(output_tensors_[output_order_[2]], H_32, W_32, l_anchors_, conf_thres_raw);
    
    // Run NMS for each class
    for(int i = 0; i < classes_num_; i++) {
        cv::dnn::NMSBoxes(bboxes_[i], scores_[i], score_threshold_, 
                         nms_threshold_, indices_[i], 1.f, nms_top_k_);
    }
    
    return true;
}

// PrintResults() implementation
void BPU_Detect::PrintResults() const {
    // Print a summary of the detection results
    int total_detections = 0;
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        total_detections += indices_[cls_id].size();
    }
    std::cout << "\n============ Detection Results ============" << std::endl;
    std::cout << "Total detections: " << total_detections << std::endl;
    
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        if(!indices_[cls_id].empty()) {
            std::cout << "\nClass: " << class_names_[cls_id] << std::endl;
            std::cout << "Number of detections: " << indices_[cls_id].size() << std::endl;
            std::cout << "Details:" << std::endl;
            
            for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                int idx = indices_[cls_id][i];
                float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                float score = scores_[cls_id][idx];
                
                // Print the details of each detection box
                std::cout << "  Detection " << i + 1 << ":" << std::endl;
                std::cout << "    Position: (" << x1 << ", " << y1 << ") to (" << x2 << ", " << y2 << ")" << std::endl;
                std::cout << "    Confidence: " << std::fixed << std::setprecision(2) << score * 100 << "%" << std::endl;
            }
        }
    }
    std::cout << "========================================\n" << std::endl;
}

// DrawResults implementation
void BPU_Detect::DrawResults(cv::Mat& img) {
#if ENABLE_DRAW
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        if(!indices_[cls_id].empty()) {
            for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                int idx = indices_[cls_id][i];
                float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                float score = scores_[cls_id][idx];
                
                // Draw the bounding box
                cv::rectangle(img, cv::Point(x1, y1), cv::Point(x2, y2), 
                            cv::Scalar(255, 0, 0), line_size_);
                
                // Draw the class label and confidence
                std::string text = class_names_[cls_id] + ": " + 
                                std::to_string(static_cast<int>(score * 100)) + "%";
                cv::putText(img, text, cv::Point(x1, y1 - 5), 
                          cv::FONT_HERSHEY_SIMPLEX, font_size_, 
                          cv::Scalar(0, 0, 255), font_thickness_, cv::LINE_AA);
            }
        }
    }
#endif
    // Print the detection results
    PrintResults();
}

// Helper that decodes one output feature map
void BPU_Detect::ProcessFeatureMap(hbDNNTensor& output_tensor, 
                                  int height, int width,
                                  const std::vector<std::pair<double, double>>& anchors,
                                  float conf_thres_raw) {
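    // The decoder below reads float data, so the output must not be quantized
    if (output_tensor.properties.quantiType != NONE) {
        std::cout << "Output tensor quantization type should be NONE!" << std::endl;
        return;
    }
    
    // Invalidate the CPU cache so the data written by the BPU is visible
    hbSysFlushMem(&output_tensor.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);
    
    // Pointer to the raw float output
    auto* raw_data = reinterpret_cast<float*>(output_tensor.sysMem[0].virAddr);
    
    // Walk every grid cell of the feature map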
    for(int h = 0; h < height; h++) {
        for(int w = 0; w < width; w++) {
            for(const auto& anchor : anchors) {
                // Predictions for the current cell and anchor: 4 box values,
                // 1 objectness value, then classes_num_ class values
                float* cur_raw = raw_data;
                raw_data += (5 + classes_num_);
                
                // Objectness filter, done in logit space to skip the sigmoid
                if(cur_raw[4] < conf_thres_raw) continue;
                
                // Find the class with the highest raw score
                int cls_id = 5;
                int end = classes_num_ + 5;
                for(int i = 6; i < end; i++) {
                    if(cur_raw[i] > cur_raw[cls_id]) {
                        cls_id = i;
                    }
                }
                
                // Final score = sigmoid(objectness) * sigmoid(class score)
                float score = 1.0f / (1.0f + std::exp(-cur_raw[4])) * 
                            1.0f / (1.0f + std::exp(-cur_raw[cls_id]));
                
                // Score filter
                if(score < score_threshold_) continue;
                cls_id -= 5;
                
                // Decode the bounding box; cast first so the stride is not
                // computed with integer division
                float stride = static_cast<float>(input_h_) / height;
                float center_x = ((1.0f / (1.0f + std::exp(-cur_raw[0]))) * 2 - 0.5f + w) * stride;
                float center_y = ((1.0f / (1.0f + std::exp(-cur_raw[1]))) * 2 - 0.5f + h) * stride;
                float bbox_w = std::pow((1.0f / (1.0f + std::exp(-cur_raw[2]))) * 2, 2) * anchor.first;
                float bbox_h = std::pow((1.0f / (1.0f + std::exp(-cur_raw[3]))) * 2, 2) * anchor.second;
                float bbox_x = center_x - bbox_w / 2.0f;
                float bbox_y = center_y - bbox_h / 2.0f;
                
                // Store the detection under its class
                bboxes_[cls_id].push_back(cv::Rect2d(bbox_x, bbox_y, bbox_w, bbox_h));
                scores_[cls_id].push_back(score);
            }
        }
    }
}

// Release implementation
bool BPU_Detect::Release() {
    if(!is_initialized_) {
        return true;
    }
    
    // Release the inference task
    if(task_handle_) {
        hbDNNReleaseTask(task_handle_);
        task_handle_ = nullptr;
    }
    
    try {
        // Free the input tensor memory
        if(input_tensor_.sysMem[0].virAddr) {
            hbSysFreeMem(&(input_tensor_.sysMem[0]));
        }
        
        // Free the memory of the three output tensors
        for(int i = 0; i < 3; i++) {
            if(output_tensors_ && output_tensors_[i].sysMem[0].virAddr) {
                hbSysFreeMem(&(output_tensors_[i].sysMem[0]));
            }
        }
        
        if(output_tensors_) {
            delete[] output_tensors_;
            output_tensors_ = nullptr;
        }
        
        // Release the model
        if(packed_dnn_handle_) {
            hbDNNRelease(packed_dnn_handle_);
            packed_dnn_handle_ = nullptr;
        }
    } catch(const std::exception& e) {
        std::cout << "Exception during release: " << e.what() << std::endl;
    }
    
    is_initialized_ = false;
    return true;
}

// main function
int main() {
    // Create the detector instance
    BPU_Detect detector;
    
    // Initialize the detector
    if (!detector.Init()) {
        std::cout << "Failed to initialize detector" << std::endl;
        return -1;
    }

#if DETECT_MODE == 0
    // Single-image detection mode
    std::cout << "Single image detection mode" << std::endl;
    
    // Load the test image
    cv::Mat input_img = cv::imread("/root/Deep_Learning/YOLOv5/imgs/tennis_1_frame_0001.jpg");
    if (input_img.empty()) {
        std::cout << "Failed to load image" << std::endl;
        return -1;
    }
    
    // Run detection (the call is the same whether or not drawing is enabled)
    cv::Mat output_img;
    if (!detector.Detect(input_img, output_img)) {
        std::cout << "Detection failed" << std::endl;
        return -1;
    }
#if ENABLE_DRAW
    // Save the annotated image
    cv::imwrite("cpp_result.jpg", output_img);
#endif

#else
    // Real-time detection mode
    std::cout << "Real-time detection mode" << std::endl;
    
    // Open the camera
    cv::VideoCapture cap(0);
    if (!cap.isOpened()) {
        std::cout << "Failed to open camera" << std::endl;
        return -1;
    }
    
    cv::Mat frame, output_frame;
    while (true) {
        // Grab one frame
        cap >> frame;
        if (frame.empty()) {
            std::cout << "Failed to read frame" << std::endl;
            break;
        }
        
        // Run detection
        if (!detector.Detect(frame, output_frame)) {
            std::cout << "Detection failed" << std::endl;
            break;
        }
        
#if ENABLE_DRAW
        // Show the result
        cv::imshow("Real-time Detection", output_frame);
        
        // Press 'q' to quit
        if (cv::waitKey(1) == 'q') {
            break;
        }
#endif
    }
    
    // Release the camera whether or not drawing is enabled
    cap.release();
#if ENABLE_DRAW
    cv::destroyAllWindows();
#endif
#endif
    
    // Release the detector resources
    detector.Release();
    
    return 0;
}
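
Two bits of math in the post-processing above are easy to skim past: the threshold is moved into logit space so that most anchors are rejected without ever calling `exp`, and each surviving anchor is decoded with the YOLOv5 box formula. The sketch below isolates both steps so you can check them by hand. The names `Sigmoid`, `Logit`, `Box` and `DecodeBox` are made up for this illustration and are not part of the listing above or of the D-Robotics API.

```cpp
#include <cmath>

// Plain logistic sigmoid.
inline float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Inverse sigmoid (logit): the raw value whose sigmoid equals p, for 0 < p < 1.
// Because sigmoid is monotonic, raw < Logit(p) is the same test as
// Sigmoid(raw) < p, which is why the raw output can be compared directly.
inline float Logit(float p) { return -std::log(1.0f / p - 1.0f); }

// Top-left corner plus size, in input-image pixels (hypothetical helper type).
struct Box { float x, y, w, h; };

// Decode one anchor prediction at grid cell (gx, gy) with the YOLOv5 formula.
// raw points at {tx, ty, tw, th}; stride is input size divided by grid size.
inline Box DecodeBox(const float* raw, int gx, int gy, float stride,
                     float anchor_w, float anchor_h) {
    float cx = (Sigmoid(raw[0]) * 2.0f - 0.5f + gx) * stride;
    float cy = (Sigmoid(raw[1]) * 2.0f - 0.5f + gy) * stride;
    float sw = Sigmoid(raw[2]) * 2.0f;
    float sh = Sigmoid(raw[3]) * 2.0f;
    float w = sw * sw * anchor_w;
    float h = sh * sh * anchor_h;
    return {cx - w / 2.0f, cy - h / 2.0f, w, h};
}
```

With all four raw values at zero, every sigmoid is 0.5, so the box center lands in the middle of its cell and the box size equals the anchor size. That makes a handy sanity check when your decoded boxes look shifted or scaled.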