RoboMaster- RDK X5能量机关实现案例（一）识别

作者：SkyXZ

CSDN：https://blog.csdn.net/xiongqi123123

博客园：https://www.cnblogs.com/SkyXZ

在RoboMaster的25赛季，我主要负责了能量机关的视觉方案开发，目前整体算法已经搭建完成，实际方案上我使用的上位机是Jetson Orin NX 16GB，其具备100TOPS的算力，在经过TensorRT优化部署后，实现了1080P原始图像从识别到PNP解算再到得到预测结果的每帧耗时仅为2.5ms，由于算法部分已经完成，且正值寒假，我准备利用手头的RDK X5进行能量机关的识别的算法验证，试试RDK X5作为RoboMaster的上位机可不可行

一、训练模型

我采用的模型是Yolov8n-Pose，数据集我们使用的是西交利物浦GMaster战队开源的数据集：zRzRzRzRzRzRzR/YOLO-of-RoboMaster-Keypoints-Detection-2023: 2023年西交利物浦大学动云科技GMaster战队yolo 装甲板四点模型，能量机关五点模型，区域赛视觉识别板目标检测其标注格式为4类别5点，具体介绍如下：

训练部分没什么好说的，配置好环境使用如下数据配置及命令进行训练即可：

yaml 复制代码

# buff.yaml
path: buff_format
train: train
val: test
kpt_shape: [5, 2]
names:
  0: RR
  1: RW
  2: BR
  3: BW

bash 复制代码

yolo pose train data=buff.yaml model=yolov8n-pose.pt epochs=200 batch=32 imgsz=640 iou=0.7 max_det=10 kobj=10 rect=True name=buff

二、量化模型

完成了模型的训练便到了我们最关键的一步模型量化啦，我们首先需要修改模型的输出头使得三个特征层的Bounding Box信息和Classify信息分开输出，具体而言，我们找到Yolov8的源码中的./ultralytics/ultralytics/nn/modules/head.py文件，接着在大约第64行的地方用如下代码替代Detect类的forward方法：

python 复制代码

def forward(self, x):  # Detect
    result = []
    for i in range(self.nl):
        result.append(self.cv2[i](x[i]).permute(0, 2, 3, 1).contiguous())
        result.append(self.cv3[i](x[i]).permute(0, 2, 3, 1).contiguous())
    return result

然后继续用如下代码来替换大约第242行的Pose类的forward方法：

python 复制代码

def forward(self, x):
    detect_results = Detect.forward(self, x)
    kpt_results = []
    for i in range(self.nl):
        kpt_results.append(self.cv4[i](x[i]).permute(0, 2, 3, 1).contiguous())
    return (detect_results, kpt_results)

修改完上述部分后我们便可以使用如下命令来导出ONNX模型啦：

bash 复制代码

yolo export model=/path/to/your/model format=onnx simplify=True opset=11 imgsz=640
# 注意，如果生成的onnx模型显示ir版本过高，可以将simplify=False

然后我们进入地瓜的RDK算法工具链的Docker镜像（具体安装配置可见我的另外一篇Blogs：学弟一看就会的RDKX5模型转换及部署，你确定不学？），使用如下命令对我们的ONNX进行验证，之后终端便会打印出我们这个模型的基本信息、结构信息以及算子信息

bash 复制代码

hb_mapper checker --model-type onnx --march bayes-e --model /path/to/your/model.onnx

我们根据如下的打印信息可以知道我们这个模型的所有算子均可以放到BPU上

接着我们便可以开始配置我们的模型量化配置文件啦，我们开启了calibration_parameters校准数据类的preprocess_on功能来开启图片校准样本自动处理，大家只需要修改onnx_model模型路径和cal_data_dir校准图片地址即可

yaml 复制代码

model_parameters:
  onnx_model: 'buff_dim3.onnx'
  march: "bayes-e"
  layer_out_dump: False
  working_dir: 'buff'
  output_model_file_prefix: 'buff_dim3'

input_parameters:
  input_name: ""
  input_type_rt: 'nv12'
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  norm_type: 'data_scale'
  scale_value: 0.003921568627451

calibration_parameters:
  cal_data_dir: './buff_format'
  cal_data_type: 'float32'
  preprocess_on: True

compiler_parameters:
  compile_mode: 'latency'
  debug: False
  optimize_level: 'O3'

接着我们使用如下命令即可开始量化我们的ONNX模型为RDK所支持的Bin模型，过程有些小慢，如果没有红色报错的话安心等待即可

bash 复制代码

hb_mapper makertbin --model-type onnx --config /path/to/your/yaml

在模型的转换过程中我们查看日志可以找到大小为[1, 80, 80, 64], [1, 40, 40, 64], [1, 20, 20, 64]的三个输出的名称分别为output0, 352, 368

由于反量化操作会将int8的量化数据转换回float32格式消耗额外的计算资源和时间，因此我们需要移除反量化节点可以减少不必要的计算开销，提高模型推理速度，我们使用如下命令查看可以被移除的反量化节点

bash 复制代码

hb_model_modifier /path/to/your/convert_model.bin

我们打开生成的/open_explorer/Model/Buff/hb_model_modifier.log日志，这里面有详细的节点说明，我们根据之前找到的三个名称output0、352、368，可以找到其详细信息

我们使用如下命令移除上述反量化节点：(请根据自己的模型进行修改)

bash 复制代码

hb_model_modifier /path/to/your/convert_model.bin \
-r /model.22/cv2.0/cv2.0.2/Conv_output_0_HzDequantize \
-r /model.22/cv2.1/cv2.1.2/Conv_output_0_HzDequantize \
-r /model.22/cv2.2/cv2.2.2/Conv_output_0_HzDequantize

最后我们便得到了摘掉反量化节点的最终模型buff_dim3_modified.bin，这个模型便可以直接用于部署啦，但是在实际部署之前我们使用可视化命令对其进行检查：

bash 复制代码

hb_perf /path/to/your/convert_model.bin

以及检查模型的输入输出：

bash 复制代码

hrt_model_exec model_info --model_file /path/to/your/convert_model.bin

三、模型部署

在得到摘掉反量化节点的最终模型buff_dim3_modified.bin后我们便可以进行部署啦，我们首先先对X5板卡进行超频，使之CPU和BPU均处于最佳状态，具体超频命令如下：

bash 复制代码

sudo bash -c "echo 1 > /sys/devices/system/cpu/cpufreq/boost"  # CPU: 1.8Ghz
sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor" # Performance Mode
echo 1200000000 > /sys/kernel/debug/clk/bpu_mclk_2x_clk/clk_rate # BPU: 1.2GHz

Yolov8-Pose的部署流程和Yolov5-Detect的总体上没有太大差别，具体部署的教程可以查看我的另一篇Blogs：学弟一看就会的RDKX5模型转换及部署，你确定不学？，但是我们需要在此基础上主要修改我们的特征图辅助处理函数ProcessFeatureMap()、GetModelInfo()模型信息检查函数中的输出顺序以及绘图函数中的DrawResults()中的关键点可视化部分，我们运行代码可以看到在使用普通USB相机的情况下CPU占用率很低，BPU的负载也较低，并且在没有开多线程对推理优化的情况下能跑满我手上这个摄像头的帧率到80帧

python 复制代码

// 标准C++库
#include <iostream>     // 输入输出流
#include <vector>      // 向量容器
#include <algorithm>   // 算法库
#include <chrono>      // 时间相关功能
#include <iomanip>     // 输入输出格式控制

// OpenCV库
#include <opencv2/opencv.hpp>      // OpenCV主要头文件
#include <opencv2/dnn/dnn.hpp>     // OpenCV深度学习模块

// 地平线RDK BPU API
#include "dnn/hb_dnn.h"           // BPU基础功能
#include "dnn/hb_dnn_ext.h"       // BPU扩展功能
#include "dnn/plugin/hb_dnn_layer.h"    // BPU层定义
#include "dnn/plugin/hb_dnn_plugin.h"   // BPU插件
#include "dnn/hb_sys.h"           // BPU系统功能

// 错误检查宏定义
#define RDK_CHECK_SUCCESS(value, errmsg)                        \
    do                                                          \
    {                                                          \
        auto ret_code = value;                                  \
        if (ret_code != 0)                                      \
        {                                                       \
            std::cout << errmsg << ", error code:" << ret_code; \
            return ret_code;                                    \
        }                                                       \
    } while (0);

// 模型和检测相关的默认参数定义
#define DEFAULT_MODEL_PATH "/root/Deep_Learning/YOLOV8-Pose/models/buff_dim3_modified.bin"  // 默认模型路径
#define DEFAULT_CLASSES_NUM 4          // 默认类别数量
#define CLASSES_LIST "RR","RW","BR","BW"     // 类别名称
#define KPT_NUM 5 // num of kpt
#define KPT_ENCODE 3 // kpt 的编码，2:x,y, 3:x,y,vis
#define KPT_SCORE_THRESHOLD 0.5 // kpt 分数阈值, 默认0.25
#define REG 16 // 控制回归部分离散化程度的超参数, 默认16
#define DEFAULT_NMS_THRESHOLD 0.45f    // 非极大值抑制阈值
#define DEFAULT_SCORE_THRESHOLD 0.25f  // 置信度阈值
#define DEFAULT_NMS_TOP_K 300          // NMS保留的最大框数
#define DEFAULT_FONT_SIZE 1.0f         // 绘制文字大小
#define DEFAULT_FONT_THICKNESS 1.0f    // 绘制文字粗细
#define DEFAULT_LINE_SIZE 2.0f         // 绘制线条粗细

// 运行模式选择
#define DETECT_MODE 0    // 检测模式: 0-单张图片, 1-实时检测
#define ENABLE_DRAW 1    // 绘图开关: 0-禁用, 1-启用
#define LOAD_FROM_DDR 0  // 模型加载方式: 0-从文件加载, 1-从内存加载

// 特征图尺度定义 (基于输入尺寸的倍数关系)
#define H_8 (input_h_ / 8)    // 输入高度的1/8
#define W_8 (input_w_ / 8)    // 输入宽度的1/8
#define H_16 (input_h_ / 16)  // 输入高度的1/16
#define W_16 (input_w_ / 16)  // 输入宽度的1/16
#define H_32 (input_h_ / 32)  // 输入高度的1/32
#define W_32 (input_w_ / 32)  // 输入宽度的1/32

// BPU目标检测类
class BPU_Detect {
public:
    // 构造函数：初始化检测器的参数
    // @param model_path: 模型文件路径
    // @param classes_num: 检测类别数量
    // @param nms_threshold: NMS阈值
    // @param score_threshold: 置信度阈值
    // @param nms_top_k: NMS保留的最大框数
    BPU_Detect(const std::string& model_path = DEFAULT_MODEL_PATH,
                 int classes_num = DEFAULT_CLASSES_NUM,
                 float nms_threshold = DEFAULT_NMS_THRESHOLD,
                 float score_threshold = DEFAULT_SCORE_THRESHOLD,
                 int nms_top_k = DEFAULT_NMS_TOP_K);
    
    // 析构函数：释放资源
    ~BPU_Detect();

    // 主要功能接口
    bool Init();  // 初始化BPU和模型
    bool Detect(const cv::Mat& input_img, cv::Mat& output_img);  // 执行目标检测
    bool Release();  // 释放所有资源

private:
    // 内部工具函数
    bool LoadModel();  // 加载模型文件
    bool GetModelInfo();  // 获取模型的输入输出信息
    bool PreProcess(const cv::Mat& input_img);  // 图像预处理（resize和格式转换）
    bool Inference();  // 执行模型推理
    bool PostProcess();  // 后处理（NMS等）
    void DrawResults(cv::Mat& img);  // 在图像上绘制检测结果
    void PrintResults() const;  // 打印检测结果到控制台

    // 特征图处理辅助函数
    // @param output_tensor: 输出tensor
    // @param height, width: 特征图尺寸
    // @param anchors: 对应尺度的anchor boxes
    // @param conf_thres_raw: 原始置信度阈值
    void ProcessFeatureMap(hbDNNTensor& output_tensor_REG, 
                          hbDNNTensor& output_tensor_CLA,
                          hbDNNTensor& output_tensor_KPT,
                          int height, int width,
                          const std::vector<std::pair<double, double>>& anchors,
                          float conf_thres_raw);

    // 成员变量（按照构造函数初始化顺序排列）
    std::string model_path_;      // 模型文件路径
    int classes_num_;             // 类别数量
    float nms_threshold_;         // NMS阈值
    float score_threshold_;       // 置信度阈值
    int nms_top_k_;              // NMS保留的最大框数
    bool is_initialized_;         // 初始化状态标志
    float font_size_;            // 绘制文字大小
    float font_thickness_;       // 绘制文字粗细
    float line_size_;            // 绘制线条粗细
    
    // BPU相关变量
    hbPackedDNNHandle_t packed_dnn_handle_;  // 打包模型句柄
    hbDNNHandle_t dnn_handle_;               // 模型句柄
    const char* model_name_;                 // 模型名称
    
    // 输入输出张量
    hbDNNTensor input_tensor_;               // 输入tensor
    hbDNNTensor* output_tensors_;            // 输出tensor数组
    hbDNNTensorProperties input_properties_; // 输入tensor属性
    
    // 任务相关
    hbDNNTaskHandle_t task_handle_;          // 推理任务句柄
    
    // 模型输入参数
    int input_h_;                            // 输入高度
    int input_w_;                            // 输入宽度
    
    // 检测结果存储
    std::vector<std::vector<cv::Rect2d>> bboxes_;  // 每个类别的边界框
    std::vector<std::vector<float>> scores_;       // 每个类别的得分
    std::vector<std::vector<int>> indices_;        // NMS后的索引
    std::vector<std::vector<cv::Point2f>> kpts_xy_;
    std::vector<std::vector<float>> kpts_score_;
    
    // 图像处理参数
    float x_scale_;                          // X方向缩放比例
    float y_scale_;                          // Y方向缩放比例
    int x_shift_;                            // X方向偏移量
    int y_shift_;                            // Y方向偏移量
    cv::Mat resized_img_;                    // 缩放后的图像

    float conf_thres_raw_;
    float kpt_conf_thres_raw_;
    
    // YOLOv5 anchors信息
    std::vector<std::pair<double, double>> s_anchors_;  // 小目标anchors
    std::vector<std::pair<double, double>> m_anchors_;  // 中目标anchors
    std::vector<std::pair<double, double>> l_anchors_;  // 大目标anchors
    
    // 输出处理
    int output_order_[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8};                    // 输出顺序映射
    std::vector<std::string> class_names_;   // 类别名称列表
};

// 构造函数实现
BPU_Detect::BPU_Detect(const std::string& model_path,
                          int classes_num,
                          float nms_threshold,
                          float score_threshold,
                          int nms_top_k)
    : model_path_(model_path),
      classes_num_(classes_num),
      nms_threshold_(nms_threshold),
      score_threshold_(score_threshold),
      nms_top_k_(nms_top_k),
      is_initialized_(false),
      font_size_(DEFAULT_FONT_SIZE),
      font_thickness_(DEFAULT_FONT_THICKNESS),
      line_size_(DEFAULT_LINE_SIZE),
      packed_dnn_handle_(nullptr),
      dnn_handle_(nullptr),
      task_handle_(nullptr),
      output_tensors_(nullptr) {
    
    // 初始化类别名称
    class_names_ = {CLASSES_LIST};
    
    // 初始化anchors
    std::vector<float> anchors = {10.0, 13.0, 16.0, 30.0, 33.0, 23.0, 
                                 30.0, 61.0, 62.0, 45.0, 59.0, 119.0, 
                                 116.0, 90.0, 156.0, 198.0, 373.0, 326.0};
    
    // 设置small, medium, large anchors
    for(int i = 0; i < 3; i++) {
        s_anchors_.push_back({anchors[i*2], anchors[i*2+1]});
        m_anchors_.push_back({anchors[i*2+6], anchors[i*2+7]});
        l_anchors_.push_back({anchors[i*2+12], anchors[i*2+13]});
    }
}

// 析构函数实现
BPU_Detect::~BPU_Detect() {
    if(is_initialized_) {
        Release();
    }
}

// 初始化函数实现
bool BPU_Detect::Init() {
    if(is_initialized_) {
        std::cout << "Already initialized!" << std::endl;
        return true;
    }
    
    auto init_start = std::chrono::high_resolution_clock::now();
    
    if(!LoadModel()) {
        std::cout << "Failed to load model!" << std::endl;
        return false;
    }
    
    if(!GetModelInfo()) {
        std::cout << "Failed to get model info!" << std::endl;
        return false;
    }
    
    is_initialized_ = true;
    
    auto init_end = std::chrono::high_resolution_clock::now();
    float init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
    std::cout << "\n============ Model Loading Time ============" << std::endl;
    std::cout << "Total init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
    std::cout << "=========================================\n" << std::endl;
    
    return true;
}

// 加载模型实现
bool BPU_Detect::LoadModel() {
    // 记录总加载时间的起点
    auto load_start = std::chrono::high_resolution_clock::now();

#if LOAD_FROM_DDR
    // 用于记录从文件读取模型数据的时间
    float read_time = 0.0f;
#endif
    // 用于记录模型初始化的时间
    float init_time = 0.0f;
    
#if LOAD_FROM_DDR
    // =============== 从文件读取模型到内存 ===============
    auto read_start = std::chrono::high_resolution_clock::now();
    
    // 打开模型文件
    FILE* fp = fopen(model_path_.c_str(), "rb");
    if (!fp) {
        std::cout << "Failed to open model file: " << model_path_ << std::endl;
        return false;
    }
    
    // 获取文件大小:
    fseek(fp, 0, SEEK_END);// 1. 将文件指针移到末尾
    size_t model_size = static_cast<size_t>(ftell(fp));// 2. 获取当前位置(即文件大小)
    fseek(fp, 0, SEEK_SET);// 3. 将文件指针重置到开头
    
    // 为模型数据分配内存
    void* model_data = malloc(model_size);
    if (!model_data) {
        std::cout << "Failed to allocate memory for model data" << std::endl;
        fclose(fp);
        return false;
    }
    
    // 读取模型数据到内存
    size_t read_size = fread(model_data, 1, model_size, fp);
    fclose(fp);
    
    // 计算文件读取时间
    auto read_end = std::chrono::high_resolution_clock::now();
    read_time = std::chrono::duration_cast<std::chrono::microseconds>(read_end - read_start).count() / 1000.0f;
    
    // 验证是否完整读取了文件
    if (read_size != model_size) {
        std::cout << "Failed to read model data, expected " << model_size 
                 << " bytes, but got " << read_size << " bytes" << std::endl;
        free(model_data);
        return false;
    }
    
    // =============== 从内存初始化模型 ===============
    auto init_start = std::chrono::high_resolution_clock::now();
    
    // 准备模型数据数组和长度数组
    const void* model_data_array[] = {model_data};
    int32_t model_data_length[] = {static_cast<int32_t>(model_size)};
    
    // 使用BPU API从内存初始化模型
    RDK_CHECK_SUCCESS(
        hbDNNInitializeFromDDR(&packed_dnn_handle_, model_data_array, model_data_length, 1),
        "Initialize model from DDR failed");
    
    // 释放临时分配的内存
    free(model_data);
    
    // 计算模型初始化时间
    auto init_end = std::chrono::high_resolution_clock::now();
    init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
    
#else
    // =============== 直接从文件初始化模型 ===============
    auto init_start = std::chrono::high_resolution_clock::now();
    
    // 获取模型文件路径
    const char* model_file_name = model_path_.c_str();
    
    // 使用BPU API从文件初始化模型
    RDK_CHECK_SUCCESS(
        hbDNNInitializeFromFiles(&packed_dnn_handle_, &model_file_name, 1),
        "Initialize model from file failed");
    
    // 计算模型初始化时间
    auto init_end = std::chrono::high_resolution_clock::now();
    init_time = std::chrono::duration_cast<std::chrono::microseconds>(init_end - init_start).count() / 1000.0f;
#endif

    // =============== 计算并打印总时间统计 ===============
    auto load_end = std::chrono::high_resolution_clock::now();
    float total_load_time = std::chrono::duration_cast<std::chrono::microseconds>(load_end - load_start).count() / 1000.0f;

    // 打印时间统计信息
    std::cout << "\n============ Model Loading Details ============" << std::endl;
#if LOAD_FROM_DDR
    std::cout << "File reading time: " << std::fixed << std::setprecision(2) << read_time << " ms" << std::endl;
#endif
    std::cout << "Model init time: " << std::fixed << std::setprecision(2) << init_time << " ms" << std::endl;
    std::cout << "Total loading time: " << std::fixed << std::setprecision(2) << total_load_time << " ms" << std::endl;
    std::cout << "===========================================\n" << std::endl;

    return true;
}

// 获取模型信息实现
bool BPU_Detect::GetModelInfo() {
    // 获取模型名称列表
    const char** model_name_list;
    int model_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle_),
        "hbDNNGetModelNameList failed");
    if(model_count > 1) {
        std::cout << "Model count: " << model_count << std::endl;
        std::cout << "Please check the model count!" << std::endl;
        return false;
    }
    model_name_ = model_name_list[0];
    
    // 获取模型句柄
    RDK_CHECK_SUCCESS(
        hbDNNGetModelHandle(&dnn_handle_, packed_dnn_handle_, model_name_),
        "hbDNNGetModelHandle failed");
    
    // 获取输入信息
    int32_t input_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetInputCount(&input_count, dnn_handle_),
        "hbDNNGetInputCount failed");
    RDK_CHECK_SUCCESS(
        hbDNNGetInputTensorProperties(&input_properties_, dnn_handle_, 0),
        "hbDNNGetInputTensorProperties failed");

    if(input_count > 1){
        std::cout << "模型输入节点大于1，请检查！" << std::endl;
        return false;
    }
    if(input_properties_.validShape.numDimensions == 4){
        std::cout << "输入tensor类型: HB_DNN_IMG_TYPE_NV12" << std::endl;
    }
    else{
        std::cout << "输入tensor类型不是HB_DNN_IMG_TYPE_NV12，请检查！" << std::endl;
        return false;
    }
    if(input_properties_.tensorType == 1){
        std::cout << "输入tensor数据排布: HB_DNN_LAYOUT_NCHW" << std::endl;
    }
    else{
        std::cout << "输入tensor数据排布不是HB_DNN_LAYOUT_NCHW，请检查！" << std::endl;
        return false;
    }
    // 获取输入尺寸
    input_h_ = input_properties_.validShape.dimensionSize[2];
    input_w_ = input_properties_.validShape.dimensionSize[3];
    if (input_properties_.validShape.numDimensions == 4)
    {
        std::cout << "输入的尺寸为: (" << input_properties_.validShape.dimensionSize[0];
        std::cout << ", " << input_properties_.validShape.dimensionSize[1];
        std::cout << ", " << input_h_;
        std::cout << ", " << input_w_ << ")" << std::endl;
    }
    else
    {
        std::cout << "输入的尺寸不是(1,3,640,640)，请检查！" << std::endl;
        return false;
    }
    
    // 获取输出信息并调整输出顺序
    int32_t output_count = 0;
    RDK_CHECK_SUCCESS(
        hbDNNGetOutputCount(&output_count, dnn_handle_),
        "hbDNNGetOutputCount failed");
    std::cout << "output_count: " << output_count << std::endl;
    // 分配输出tensor内存
    output_tensors_ = new hbDNNTensor[output_count];
    
    // =============== 调整输出头顺序映射 ===============
    // YOLOv5有3个输出头，分别对应3种不同尺度的特征图
    // 需要确保输出顺序为: 小目标(8倍下采样) -> 中目标(16倍下采样) -> 大目标(32倍下采样)
    

    // 定义期望的输出特征图尺寸和通道数
    int32_t expected_shapes[9][3] = {
        {H_8, W_8, 64},            // output[order[0]]: (1, H // 8,  W // 8,  64)
        {H_8, W_8, DEFAULT_CLASSES_NUM},   // output[order[1]]: (1, H // 8,  W // 8,  CLASSES_NUM)
        {H_16, W_16, 64},          // output[order[2]]: (1, H // 16, W // 16, 64)
        {H_16, W_16, DEFAULT_CLASSES_NUM}, // output[order[3]]: (1, H // 16, W // 16, CLASSES_NUM)
        {H_32, W_32, 64},          // output[order[4]]: (1, H // 32, W // 32, 64)
        {H_32, W_32, DEFAULT_CLASSES_NUM}, // output[order[5]]: (1, H // 32, W // 32, CLASSES_NUM)
        {H_8, W_8, KPT_NUM * KPT_ENCODE},            // output[order[6]]: (1, H // 8 , W // 8 , KPT_NUM * KPT_ENCODE)
        {H_16, W_16, KPT_NUM * KPT_ENCODE},          // output[order[7]]: (1, H // 16, W // 16, KPT_NUM * KPT_ENCODE)
        {H_32, W_32, KPT_NUM * KPT_ENCODE},          // output[order[8]]: (1, H // 32, W // 32, KPT_NUM * KPT_ENCODE)
    };

    // 遍历每个期望的输出尺度
    for(int i = 0; i < 9; i++) {
        // 遍历实际的输出节点
        for(int j = 0; j < 9; j++) {
            // 获取当前输出节点的属性
            hbDNNTensorProperties output_properties;
            RDK_CHECK_SUCCESS(
                hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, j),
                "Get output tensor properties failed");
            
            int32_t actual_h = output_properties.validShape.dimensionSize[1];
            int32_t actual_w = output_properties.validShape.dimensionSize[2];
            int32_t actual_c = output_properties.validShape.dimensionSize[3];

            if(actual_h == expected_shapes[i][0] && 
               actual_w == expected_shapes[i][1] && 
               actual_c == expected_shapes[i][2]) {
                // 记录正确的输出顺序
                output_order_[i] = j;
                break;
            }
        }
    }

    // 打印输出顺序映射信息
    if (output_order_[0] + output_order_[1] + output_order_[2] + output_order_[3] + output_order_[4] + output_order_[5] + output_order_[6] + output_order_[7] + output_order_[8] == 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8)
    {
        std::cout << "Outputs order check SUCCESS, continue." << std::endl;
        std::cout << "order = {";
        for (int i = 0; i < 9; i++)
        {
            std::cout << output_order_[i] << ", ";
        }
        std::cout << "}" << std::endl;
    }
    else
    {
        std::cout << "Outputs order check FAILED, use default" << std::endl;
        for (int i = 0; i < 9; i++)
            output_order_[i] = i;
    }

    return true;
}

// 检测函数实现
bool BPU_Detect::Detect(const cv::Mat& input_img, cv::Mat& output_img) {
    if(!is_initialized_) {
        std::cout << "Please initialize first!" << std::endl;
        return false;
    }
    
    if(input_img.empty()) {
        std::cout << "Input image is empty!" << std::endl;
        return false;
    }
    
    // 定义所有时间变量
    float preprocess_time = 0.0f;
    float infer_time = 0.0f;
    float postprocess_time = 0.0f;
    float draw_time = 0.0f;
    float total_time = 0.0f;
    
    auto total_start = std::chrono::high_resolution_clock::now();
    
#if ENABLE_DRAW
    input_img.copyTo(output_img);
#endif

    bool success = true;
    
    // 预处理
    {
        auto preprocess_start = std::chrono::high_resolution_clock::now();
        success = PreProcess(input_img);
        auto preprocess_end = std::chrono::high_resolution_clock::now();
        preprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(
            preprocess_end - preprocess_start).count() / 1000.0f;
        
        if (!success) {
            std::cout << "Preprocess failed" << std::endl;
            goto cleanup;  
        }
    }
    
    // 推理
    {
        auto infer_start = std::chrono::high_resolution_clock::now();
        success = Inference();
        auto infer_end = std::chrono::high_resolution_clock::now();
        infer_time = std::chrono::duration_cast<std::chrono::microseconds>(
            infer_end - infer_start).count() / 1000.0f;
        
        if (!success) {
            std::cout << "Inference failed" << std::endl;
            goto cleanup;
        }
        
    }
    
    // 后处理
    {
        auto postprocess_start = std::chrono::high_resolution_clock::now();
        success = PostProcess();
        auto postprocess_end = std::chrono::high_resolution_clock::now();
        postprocess_time = std::chrono::duration_cast<std::chrono::microseconds>(
            postprocess_end - postprocess_start).count() / 1000.0f;
        
        if (!success) {
            std::cout << "Postprocess failed" << std::endl;
            goto cleanup;
        }
        
    }
    
    // 绘制结果
    {
        auto draw_start = std::chrono::high_resolution_clock::now();
        DrawResults(output_img);
        auto draw_end = std::chrono::high_resolution_clock::now();
        draw_time = std::chrono::duration_cast<std::chrono::microseconds>(
            draw_end - draw_start).count() / 1000.0f;
    }
    
    // 计算总时间
    {
        auto total_end = std::chrono::high_resolution_clock::now();
        total_time = std::chrono::duration_cast<std::chrono::microseconds>(
            total_end - total_start).count() / 1000.0f;
    }
    
    // 打印时间统计
    std::cout << "\n============ Time Statistics ============" << std::endl;
    std::cout << "Preprocess time: " << std::fixed << std::setprecision(2) << preprocess_time << " ms" << std::endl;
    std::cout << "Inference time: " << std::fixed << std::setprecision(2) << infer_time << " ms" << std::endl;
    std::cout << "Postprocess time: " << std::fixed << std::setprecision(2) << postprocess_time << " ms" << std::endl;
    std::cout << "Draw time: " << std::fixed << std::setprecision(2) << draw_time << " ms" << std::endl;
    std::cout << "Total time: " << std::fixed << std::setprecision(2) << total_time << " ms" << std::endl;
    std::cout << "FPS: " << std::fixed << std::setprecision(2) << 1000.0f / total_time << std::endl;
    std::cout << "======================================\n" << std::endl;

cleanup:
    // 清理资源
    if (task_handle_) {
        hbDNNReleaseTask(task_handle_);
        task_handle_ = nullptr;
    }
    
    // 释放输入内存
    if(input_tensor_.sysMem[0].virAddr) {
        hbSysFreeMem(&(input_tensor_.sysMem[0]));
        input_tensor_.sysMem[0].virAddr = nullptr;
    }
    
    return success;
}

// 预处理实现
bool BPU_Detect::PreProcess(const cv::Mat& input_img) {
    // 使用letterbox方式进行预处理
    x_scale_ = std::min(1.0f * input_h_ / input_img.rows, 1.0f * input_w_ / input_img.cols);
    y_scale_ = x_scale_;
    
    int new_w = input_img.cols * x_scale_;
    x_shift_ = (input_w_ - new_w) / 2;
    int x_other = input_w_ - new_w - x_shift_;
    
    int new_h = input_img.rows * y_scale_;
    y_shift_ = (input_h_ - new_h) / 2;
    int y_other = input_h_ - new_h - y_shift_;
    
    cv::resize(input_img, resized_img_, cv::Size(new_w, new_h));
    cv::copyMakeBorder(resized_img_, resized_img_, y_shift_, y_other, 
                       x_shift_, x_other, cv::BORDER_CONSTANT, cv::Scalar(127, 127, 127));
    
    // 转换为NV12格式
    cv::Mat yuv_mat;
    cv::cvtColor(resized_img_, yuv_mat, cv::COLOR_BGR2YUV_I420);
    
    // 准备输入tensor
    input_tensor_.properties = input_properties_;
    input_tensor_.properties.validShape.dimensionSize[0] = 1;  // 设置batch size为1
    input_tensor_.properties.validShape.dimensionSize[1] = 3;  // 3通道
    input_tensor_.properties.validShape.dimensionSize[2] = input_h_;
    input_tensor_.properties.validShape.dimensionSize[3] = input_w_;
    
    hbSysAllocCachedMem(&input_tensor_.sysMem[0], int(3 * input_h_ * input_w_ / 2));
    uint8_t* yuv = yuv_mat.ptr<uint8_t>();
    uint8_t* ynv12 = (uint8_t*)input_tensor_.sysMem[0].virAddr;
    // 计算UV部分的高度和宽度，以及Y部分的大小
    int uv_height = input_h_ / 2;
    int uv_width = input_w_ / 2;
    int y_size = input_h_ * input_w_;
    // 将Y分量数据复制到输入张量
    memcpy(ynv12, yuv, y_size);
    // 获取NV12格式的UV分量位置
    uint8_t* nv12 = ynv12 + y_size;
    uint8_t* u_data = yuv + y_size;
    uint8_t* v_data = u_data + uv_height * uv_width;
    // 将U和V分量交替写入NV12格式
    for(int i = 0; i < uv_width * uv_height; i++) {
        *nv12++ = *u_data++;
        *nv12++ = *v_data++;
    }
    // 将内存缓存清理，确保数据准备好可以供模型使用
    hbSysFlushMem(&input_tensor_.sysMem[0], HB_SYS_MEM_CACHE_CLEAN);// 清除缓存，确保数据同步
    return true;
}

// 推理实现
bool BPU_Detect::Inference() {
    // 确保先释放之前的任务
    if (task_handle_) {
        hbDNNReleaseTask(task_handle_);
        task_handle_ = nullptr;
    }
    // 初始化输入tensor属性
    input_tensor_.properties = input_properties_;
    input_tensor_.properties.validShape.dimensionSize[0] = 1;  // batch size
    input_tensor_.properties.validShape.dimensionSize[1] = 3;  // channels
    input_tensor_.properties.validShape.dimensionSize[2] = input_h_;
    input_tensor_.properties.validShape.dimensionSize[3] = input_w_;
    // 获取输出tensor属性并分配内存
    for(int i = 0; i < 9; i++) {
        hbDNNTensorProperties output_properties;
        RDK_CHECK_SUCCESS(
            hbDNNGetOutputTensorProperties(&output_properties, dnn_handle_, i),
            "Get output tensor properties failed");

        output_tensors_[i].properties = output_properties;
        // 分配内存
        int out_aligned_size = output_properties.alignedByteSize;
        RDK_CHECK_SUCCESS(
            hbSysAllocCachedMem(&output_tensors_[i].sysMem[0], out_aligned_size),
            "Allocate output memory failed");
            
        // 验证内存分配
        if (!output_tensors_[i].sysMem[0].virAddr) {
            std::cout << "Failed to allocate memory for output tensor " << i << std::endl;
            return false;
        }
    }
    // 设置推理控制参数
    hbDNNInferCtrlParam infer_ctrl_param;
    HB_DNN_INITIALIZE_INFER_CTRL_PARAM(&infer_ctrl_param);
    // 执行推理
    int ret = hbDNNInfer(&task_handle_, &output_tensors_, &input_tensor_, dnn_handle_, &infer_ctrl_param);
    if (ret != 0) {
        std::cout << "Model inference failed with error code: " << ret << std::endl;
        return false;
    }
   // 等待任务完成
    ret = hbDNNWaitTaskDone(task_handle_, 0);
    if (ret != 0) {
        std::cout << "Wait task done failed with error code: " << ret << std::endl;
        return false;
    }
    return true;
}

// 后处理实现
bool BPU_Detect::PostProcess() {
    // 清空上次的结果
    bboxes_.clear();
    scores_.clear();
    indices_.clear();
    kpts_xy_.clear();
    kpts_score_.clear();
    
    // 调整大小
    bboxes_.resize(classes_num_);
    scores_.resize(classes_num_);
    indices_.resize(classes_num_);
    
    conf_thres_raw_ = -log(1 / score_threshold_ - 1);
    kpt_conf_thres_raw_ = -log(1 / KPT_SCORE_THRESHOLD - 1); // kpt 利用反函数作用阈值，利用单调性筛选

    // 处理三个尺度的输出
    ProcessFeatureMap(output_tensors_[0], output_tensors_[1], output_tensors_[6], H_8, W_8, s_anchors_, conf_thres_raw_);
    ProcessFeatureMap(output_tensors_[2], output_tensors_[3], output_tensors_[7], H_16, W_16, m_anchors_, conf_thres_raw_);
    ProcessFeatureMap(output_tensors_[4], output_tensors_[5], output_tensors_[8], H_32, W_32, l_anchors_, conf_thres_raw_);

    
    // 对每个类别进行NMS
    for(int i = 0; i < classes_num_; i++) {
        cv::dnn::NMSBoxes(bboxes_[i], scores_[i], score_threshold_, 
                         nms_threshold_, indices_[i], 1.f, nms_top_k_);
    }
    
    return true;
}

// 打印检测结果实现
void BPU_Detect::PrintResults() const {
    // 打印检测结果的总体信息
    int total_detections = 0;
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        total_detections += indices_[cls_id].size();
    }
    std::cout << "\n============ Detection Results ============" << std::endl;
    std::cout << "Total detections: " << total_detections << std::endl;
    
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        if(!indices_[cls_id].empty()) {
            std::cout << "\nClass: " << class_names_[cls_id] << std::endl;
            std::cout << "Number of detections: " << indices_[cls_id].size() << std::endl;
            std::cout << "Details:" << std::endl;
            
            for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                int idx = indices_[cls_id][i];
                float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                float score = scores_[cls_id][idx];
                
                // 打印每个检测框的详细信息
                std::cout << "  Detection " << i + 1 << ":" << std::endl;
                std::cout << "    Position: (" << x1 << ", " << y1 << ") to (" << x2 << ", " << y2 << ")" << std::endl;
                std::cout << "    Confidence: " << std::fixed << std::setprecision(2) << score * 100 << "%" << std::endl;
            }
        }
    }
    std::cout << "========================================\n" << std::endl;
}

// 绘制结果实现
void BPU_Detect::DrawResults(cv::Mat& img) {
#if ENABLE_DRAW
    for(int cls_id = 0; cls_id < classes_num_; cls_id++) {
        if(!indices_[cls_id].empty()) {
            for(size_t i = 0; i < indices_[cls_id].size(); i++) {
                int idx = indices_[cls_id][i];
                float x1 = (bboxes_[cls_id][idx].x - x_shift_) / x_scale_;
                float y1 = (bboxes_[cls_id][idx].y - y_shift_) / y_scale_;
                float x2 = x1 + (bboxes_[cls_id][idx].width) / x_scale_;
                float y2 = y1 + (bboxes_[cls_id][idx].height) / y_scale_;
                float score = scores_[cls_id][idx];
                
                // 绘制边界框
                cv::rectangle(img, cv::Point(x1, y1), cv::Point(x2, y2), 
                            cv::Scalar(255, 0, 0), line_size_);
                
                // 绘制标签
                std::string text = class_names_[cls_id] + ": " + 
                                std::to_string(static_cast<int>(score * 100)) + "%";
                cv::putText(img, text, cv::Point(x1, y1 - 5), 
                          cv::FONT_HERSHEY_SIMPLEX, font_size_, 
                          cv::Scalar(0, 0, 255), font_thickness_, cv::LINE_AA);
            }
            for (int j = 0; j < KPT_NUM; ++j)
            {
                if (kpts_score_[cls_id][j] < kpt_conf_thres_raw_)
                {
                    continue;
                }

                int x = static_cast<int>((kpts_xy_[cls_id][j].x - x_shift_) / x_scale_);
                int y = static_cast<int>((kpts_xy_[cls_id][j].y - y_shift_) / y_scale_);

                // 绘制内圈黄色圆, 外圈红色圆
                cv::circle(img, cv::Point(x, y), 5, cv::Scalar(0, 0, 255), -1);
                cv::circle(img, cv::Point(x, y), 2, cv::Scalar(0, 255, 255), -1);
                // 绘制黄色文本, 红色文本
                cv::putText(img, std::to_string(j), cv::Point(x, y), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 0, 255), 3, cv::LINE_AA);
                cv::putText(img, std::to_string(j), cv::Point(x, y), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 255, 255), 1, cv::LINE_AA);
            }

        }
    }
#endif
    // 打印检测结果
    PrintResults();
}

// 特征图处理辅助函数
void BPU_Detect::ProcessFeatureMap(hbDNNTensor& output_tensor_REG, 
                                  hbDNNTensor& output_tensor_CLA, 
                                  hbDNNTensor& output_tensor_KPT,
                                  int height, int width,
                                  const std::vector<std::pair<double, double>>& anchors,
                                  float conf_thres_raw) {
    // 检查内存是否有效
    if (!output_tensor_REG.sysMem[0].virAddr || !output_tensor_REG.properties.scale.scaleData ||
        !output_tensor_CLA.sysMem[0].virAddr || !output_tensor_KPT.sysMem[0].virAddr) {
        std::cout << "Invalid memory for tensors!" << std::endl;
        return;
    }

    // 打印内存大小信息
    std::cout << "REG tensor aligned size: " << output_tensor_REG.properties.alignedByteSize << std::endl;
    std::cout << "REG tensor scale length: " << output_tensor_REG.properties.scale.scaleLen << std::endl;

    // 获取数据指针前先刷新内存
    hbSysFlushMem(&output_tensor_REG.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);
    hbSysFlushMem(&output_tensor_CLA.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);
    hbSysFlushMem(&output_tensor_KPT.sysMem[0], HB_SYS_MEM_CACHE_INVALIDATE);

    // 获取所有指针
    auto *s_bbox_raw = static_cast<int32_t *>(output_tensor_REG.sysMem[0].virAddr);
    auto *s_cls_raw = static_cast<float *>(output_tensor_CLA.sysMem[0].virAddr);
    auto *s_kpts_raw = static_cast<float *>(output_tensor_KPT.sysMem[0].virAddr);
    auto *s_bbox_scale = static_cast<float *>(output_tensor_REG.properties.scale.scaleData);

    // 验证指针
    std::cout << "s_bbox_raw valid range: " << 0 << " to " << (output_tensor_REG.properties.alignedByteSize/sizeof(int32_t)-1) << std::endl;
    std::cout << "s_bbox_scale valid range: " << 0 << " to " << (output_tensor_REG.properties.scale.scaleLen-1) << std::endl;

    // 遍历特征图的每个位置
    for(int h = 0; h < height; h++) {
        for(int w = 0; w < width; w++) {
            float *cur_s_cls_raw = s_cls_raw;
            int32_t *cur_s_bbox_raw = s_bbox_raw;
            float *cur_s_kpts_raw = s_kpts_raw;

            // 在移动指针之前保存原始位置
            int32_t *original_bbox_raw = cur_s_bbox_raw;
            float *original_kpts_raw = cur_s_kpts_raw;
            
            s_cls_raw += DEFAULT_CLASSES_NUM;
            s_bbox_raw += REG * 4;
            s_kpts_raw += KPT_NUM * KPT_ENCODE;

            // 找到最大类别概率
            int cls_id = 0;
            for (int i = 1; i < DEFAULT_CLASSES_NUM; i++) {
                if (cur_s_cls_raw[i] > cur_s_cls_raw[cls_id]) {
                    cls_id = i;
                }
            }

            if (cur_s_cls_raw[cls_id] < conf_thres_raw) {
                continue;
            }

            float score = 1 / (1 + std::exp(-cur_s_cls_raw[cls_id]));
            float ltrb[4], sum;

            for (int i = 0; i < 4; i++) {
                ltrb[i] = 0.;
                sum = 0.;
                for (int j = 0; j < REG; j++) {
                    // 计算实际访问的索引
                    size_t bbox_index = (original_bbox_raw - static_cast<int32_t *>(output_tensor_REG.sysMem[0].virAddr)) + REG * i + j;
                    
                    // 检查索引是否在有效范围内
                    if (bbox_index >= output_tensor_REG.properties.alignedByteSize/sizeof(int32_t)) {
                        std::cout << "bbox_index out of range: " << bbox_index << std::endl;
                        return;
                    }
                    if (j >= output_tensor_REG.properties.scale.scaleLen) {
                        std::cout << "scale index out of range: " << j << std::endl;
                        return;
                    }

                    // 安全地访问数据
                    try {
                        int32_t raw_val = original_bbox_raw[REG * i + j];
                        float scale_val = s_bbox_scale[j];
                        float exp_val = float(raw_val) * scale_val;
                        
                        // 限制exp的输入范围
                        if (exp_val > 88.0f) exp_val = 88.0f;
                        if (exp_val < -88.0f) exp_val = -88.0f;
                        
                        float dfl = std::exp(exp_val);
                        ltrb[i] += dfl * j;
                        sum += dfl;
                    } catch (const std::exception& e) {
                        std::cout << "Exception during memory access: " << e.what() << std::endl;
                        return;
                    }
                }
                if (sum > 0) {
                    ltrb[i] /= sum;
                }
            }

            // 计算边界框坐标
            float x1 = (w + 0.5 - ltrb[0]) * (height == H_8 ? 8.0 : (height == H_16 ? 16.0 : 32.0));
            float y1 = (h + 0.5 - ltrb[1]) * (height == H_8 ? 8.0 : (height == H_16 ? 16.0 : 32.0));
            float x2 = (w + 0.5 + ltrb[2]) * (height == H_8 ? 8.0 : (height == H_16 ? 16.0 : 32.0));
            float y2 = (h + 0.5 + ltrb[3]) * (height == H_8 ? 8.0 : (height == H_16 ? 16.0 : 32.0));

            // 处理关键点
            std::vector<cv::Point2f> kpt_xy(KPT_NUM);
            std::vector<float> kpt_score(KPT_NUM);
            float stride = (height == H_8 ? 8.0 : (height == H_16 ? 16.0 : 32.0));
            
            for (int j = 0; j < KPT_NUM; j++) {
                try {
                    float x = (original_kpts_raw[KPT_ENCODE * j] * 2.0 + w) * stride;
                    float y = (original_kpts_raw[KPT_ENCODE * j + 1] * 2.0 + h) * stride;
                    float vis = original_kpts_raw[KPT_ENCODE * j + 2];

                    kpt_xy[j] = cv::Point2f(x, y);
                    kpt_score[j] = vis;
                } catch (const std::exception& e) {
                    std::cout << "Exception during keypoint processing: " << e.what() << std::endl;
                    continue;
                }
            }

            // 添加检测结果到对应类别的向量中
            bboxes_[cls_id].push_back(cv::Rect2d(x1, y1, x2 - x1, y2 - y1));
            scores_[cls_id].push_back(score);
            kpts_xy_.push_back(kpt_xy);
            kpts_score_.push_back(kpt_score);
        }
    }
}

// 释放资源实现
bool BPU_Detect::Release() {
    if (!is_initialized_) {
        return true;
    }
    
    // 释放task
    if (task_handle_) {
        hbDNNReleaseTask(task_handle_);
        task_handle_ = nullptr;
    }
    
    // 释放输入内存
    if (input_tensor_.sysMem[0].virAddr) {
        hbSysFreeMem(&input_tensor_.sysMem[0]);
        input_tensor_.sysMem[0].virAddr = nullptr;
    }
    
    // 释放输出内存
    if (output_tensors_) {
        for (int i = 0; i < 9; i++) {
            if (output_tensors_[i].sysMem[0].virAddr) {
                hbSysFreeMem(&output_tensors_[i].sysMem[0]);
                output_tensors_[i].sysMem[0].virAddr = nullptr;
            }
        }
        delete[] output_tensors_;
        output_tensors_ = nullptr;
    }
    
    // 释放模型
    if (packed_dnn_handle_) {
        hbDNNRelease(packed_dnn_handle_);
        packed_dnn_handle_ = nullptr;
    }
    
    is_initialized_ = false;
    return true;
}

// 修改main函数
int main() {
    // 创建检测器实例
    BPU_Detect detector;
    
    // 初始化
    if (!detector.Init()) {
        std::cout << "Failed to initialize detector" << std::endl;
        return -1;
    }

#if DETECT_MODE == 0
    // 单张图片检测模式
    std::cout << "Single image detection mode" << std::endl;
    
    // 读取测试图片
    cv::Mat input_img = cv::imread("/root/Deep_Learning/YOLOV8-Pose/imgs/0a84fc03-1873.jpg");
    if (input_img.empty()) {
        std::cout << "Failed to load image" << std::endl;
        return -1;
    }
    
    // 执行检测
    cv::Mat output_img;
#if ENABLE_DRAW
    if (!detector.Detect(input_img, output_img)) {
        std::cout << "Detection failed" << std::endl;
        return -1;
    }
    // 保存结果
    cv::imwrite("cpp_result.jpg", output_img);
#else
    if (!detector.Detect(input_img, output_img)) {
        std::cout << "Detection failed" << std::endl;
        return -1;
    }
#endif

#else
    // 实时检测模式
    std::cout << "Real-time detection mode" << std::endl;
    
    // 打开摄像头
    cv::VideoCapture cap(0);
    if (!cap.isOpened()) {
        std::cout << "Failed to open camera" << std::endl;
        return -1;
    }
    
    cv::Mat frame, output_frame;
    while (true) {
        // 读取一帧
        cap >> frame;
        if (frame.empty()) {
            std::cout << "Failed to read frame" << std::endl;
            break;
        }
        
        // 执行检测
        if (!detector.Detect(frame, output_frame)) {
            std::cout << "Detection failed" << std::endl;
            break;
        }
        
#if ENABLE_DRAW
        // 显示结果
        cv::imshow("Real-time Detection", output_frame);
        
        // 按'q'退出
        if (cv::waitKey(1) == 'q') {
            break;
        }
#endif
    }
    
#if ENABLE_DRAW
    // 释放摄像头
    cap.release();
    cv::destroyAllWindows();
#endif
#endif
    
    // 释放资源
    detector.Release();
    
    return 0;
}