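Deploying a YOLOv26 inference service with a Triton ensemble

This post shows how to deploy a YOLOv26 model with Triton Inference Server using an ensemble model, running on a CUDA-capable NVIDIA GPU. Preprocessing and postprocessing are implemented with the Python backend and packaged together with model inference in a single Triton service. The complete project code is available at: triton_ensemble_model_zoo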


- Abstract
- Deploying the ONNX model with a Triton ensemble
  - Model preparation
  - Directory layout
    - 1. yolov26_preprocess (preprocessing)
      - config.pbtxt
      - preprocess.py
    - 2. yolov26_inference (inference)
      - config.pbtxt
    - 3. yolov26_postprocess (postprocessing)
      - config.pbtxt
      - postprocess.py
    - 4. yolov26_ensemble (ensemble/orchestration)
      - config.pbtxt
  - Model deployment
- Deploying the TensorRT model with a Triton ensemble

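Abstract

A typical Triton deployment of YOLOv26 serves only the model inference, while preprocessing and postprocessing run on the client, for example:

client preprocessing (read image, resize, padding) -> Triton service (model inference) -> client postprocessing (filter boxes, map boxes back to the original image)

However, client-side preprocessing turns each image into a float32 tensor of shape $1 \times 3 \times 640 \times 640$, which is about 4.9 MB (1 × 3 × 640 × 640 × 4 bytes). Every request therefore needs substantial bandwidth, and batched requests need proportionally more. A common industry approach is to run two services, FastAPI + Triton:

client request (encoded image) -> FastAPI preprocessing (decode image, read image, resize, padding) -> Triton (model inference) -> FastAPI postprocessing (filter boxes, restore boxes) -> client receives the result

FastAPI + Triton is the current industry practice, but the data has to travel back and forth between two services, which adds latency. Another option is to move preprocessing and postprocessing into Triton itself by using an ensemble model:

client request (encoded image) -> Triton ensemble (preprocessing, model inference, postprocessing) -> client receives the result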
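
The bandwidth argument above can be checked with quick arithmetic. The sketch below compares the size of the preprocessed float32 tensor with a Base64-encoded image; the 200 KB JPEG size is an assumed, illustrative figure rather than a measurement.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; float32 uses 4 bytes per element."""
    return math.prod(shape) * bytes_per_element


def base64_bytes(raw_size):
    """Base64 turns every 3 input bytes into 4 output characters (with padding)."""
    return 4 * math.ceil(raw_size / 3)


preprocessed = tensor_bytes((1, 3, 640, 640))  # what client-side preprocessing would upload
jpeg_size = 200 * 1024                         # assumed size of one encoded JPEG (hypothetical)
encoded = base64_bytes(jpeg_size)

print(f"float32 tensor: {preprocessed / 1e6:.2f} MB")
print(f"Base64 JPEG:    {encoded / 1e6:.2f} MB")
print(f"ratio:          {preprocessed / encoded:.1f}x")
```

Under this assumption the encoded image is more than an order of magnitude smaller than the tensor, which is why sending the encoded image and preprocessing on the server side is attractive.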

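Deploying the ONNX model with a Triton ensemble

Model preparation

Prepare the ONNX model first. For converting the YOLOv26 .pt weights to ONNX, see the post: yolov26推理 (YOLOv26 inference)

Directory layout

Triton imposes a strict layout on the model repository; the files must be stored as follows: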

models/
├── yolov26_ensemble/
│   ├── 1/
│   └── config.pbtxt
├── yolov26_inference/
│   ├── 1/
│   │   └── coco_yolo26s_fp16.onnx
│   └── config.pbtxt
├── yolov26_postprocess/
│   ├── 1/
│   │   └── postprocess.py
│   └── config.pbtxt
└── yolov26_preprocess/
    ├── 1/
    │   └── preprocess.py
    └── config.pbtxt

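Where:

yolov26_preprocess (preprocessing):
Processes the image before it enters the model (for example resizing, normalization, color-space conversion).
yolov26_inference (inference):
Runs the YOLOv26 model itself; the model file is in FP16 (half precision) to speed up inference.
yolov26_postprocess (postprocessing):
Processes the raw model output (for example non-maximum suppression (NMS), decoding box coordinates, filtering low-confidence results) and turns it into human-readable detections.
yolov26_ensemble (ensemble/orchestration):
A special scheduling model. It runs no code of its own; its config.pbtxt defines the execution order and data flow of the three steps above (preprocessing -> inference -> postprocessing) and chains them into one complete pipeline.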
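
Because a misplaced file is a common reason for a model failing to load, it can help to check the layout before starting the server. The following is a minimal sketch that only tests for the paths shown in the tree above; the function name `missing_entries` and the default `models` root are illustrative choices, not part of Triton.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
EXPECTED = {
    "yolov26_ensemble": ["config.pbtxt", "1"],
    "yolov26_inference": ["config.pbtxt", "1/coco_yolo26s_fp16.onnx"],
    "yolov26_postprocess": ["config.pbtxt", "1/postprocess.py"],
    "yolov26_preprocess": ["config.pbtxt", "1/preprocess.py"],
}


def missing_entries(repo_root):
    """Return the relative paths that are expected but absent under repo_root."""
    root = Path(repo_root)
    missing = []
    for model, entries in EXPECTED.items():
        for entry in entries:
            if not (root / model / entry).exists():
                missing.append(f"{model}/{entry}")
    return missing


if __name__ == "__main__":
    problems = missing_entries("models")
    print("layout OK" if not problems else f"missing: {problems}")
```

Note that the ensemble model still needs its (empty) version directory `1/`, which is why it appears in the expected list.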

1. yolov26_preprocess (preprocessing)

config.pbtxt:

name: "yolov26_preprocess"
backend: "python"
max_batch_size: 64
default_model_filename: "preprocess.py"

input [
  {
    name: "RAW_IMAGE"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

output [
  {
    name: "PREPROCESSED_IMAGE"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  },
  {
    name: "IMAGE_SHAPE"
    data_type: TYPE_INT32
    dims: [ 2 ]
  },
  {
    name: "SCALE"
    data_type: TYPE_FP32
    dims: [ 1 ]
  },
  {
    name: "PADDING"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]

instance_group [
  {
    count: 32
    kind: KIND_CPU
  }
]

preprocess.py

import triton_python_backend_utils as pb_utils
import numpy as np
import cv2
import base64
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class TritonPythonModel:
    def initialize(self, args):
        
        self.img_size = 640
        self.mean = np.array([0.0, 0.0, 0.0], dtype=np.float32)
        self.std = np.array([1.0, 1.0, 1.0], dtype=np.float32) / 255.0

    def letterbox(self, img, new_shape=(640, 640), color=(114, 114, 114)):
        shape = img.shape[:2]  # current shape [height, width]
        if isinstance(new_shape, int):
            new_shape = (new_shape, new_shape)
        
        scale = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
        new_unpad = int(shape[1] * scale), int(shape[0] * scale)
        
        dw = new_shape[1] - new_unpad[0]
        dh = new_shape[0] - new_unpad[1]

        dw /= 2
        dh /= 2
        
        if shape[::-1] != new_unpad:
            img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
        
        top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
        left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
        
        img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)
        return img, scale, (dw, dh)

    def resize_image(self, img):
        img_resized, scale, padding = self.letterbox(img, new_shape=(self.img_size, self.img_size))  # Letterbox resize
        img_resized = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB).astype(np.float32)  # BGR to RGB
        img_resized = (img_resized - self.mean) * self.std  # Normalize
        img_resized = np.transpose(img_resized, (2, 0, 1))  # HWC to CHW
        return img_resized, scale, padding


    def base64_to_image(self, base64_str):
        try:
            raw_bytes = base64.b64decode(base64_str)
            nparr = np.frombuffer(raw_bytes, np.uint8)
            img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

            return img

        except Exception as error:
            logger.error(f"Error: {error}")
            logger.error(f"error line: {error.__traceback__.tb_lineno}")


    def execute(self, requests):
        responses = []
        for request in requests:
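            # 1. Get the input tensor; it holds one batch of Base64 strings
            in_tensor = pb_utils.get_input_tensor_by_name(request, "RAW_IMAGE")
            if in_tensor is None:
                logger.error("Input tensor 'RAW_IMAGE' not found!")
                # Triton expects one response per request, so return an error
                # response instead of silently skipping the request
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError("Input tensor 'RAW_IMAGE' not found")))
                continue
            raw_batch = in_tensor.as_numpy()  # shape (batch_size, 1)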
            batch_size = raw_batch.shape[0]
            imgs_resized = []
            origins_shape = []
            scales = []
            paddings = []

            # 2. Process each image in the batch
            for i in range(batch_size):
                base64_str = raw_batch[i][0]  # Base64 bytes of the i-th image
                # 3. Decode Base64 into an image
                img = self.base64_to_image(base64_str)
                if img is None:
                    # Decoding failed: substitute an all-zero (black) input so that
                    # every output tensor keeps batch_size rows
                    logger.warning("Base64 converted to image failed!")
                    imgs_resized.append(
                        np.zeros((3, self.img_size, self.img_size), dtype=np.float32))
                    origins_shape.append([self.img_size, self.img_size])
                    scales.append(1.0)
                    paddings.append((0.0, 0.0))
                else:
                    logger.info("Base64 converted to image Successfully!")
                    orig_h, orig_w = img.shape[:2]
                    origins_shape.append([orig_h, orig_w])
                    # 4. Preprocess
                    img_resized, scale, padding = self.resize_image(img)
                    imgs_resized.append(img_resized)
                    scales.append(scale)
                    paddings.append(padding)

            # Convert the lists into one batch
            if len(imgs_resized) == 0:
                # No usable images: create an all-zero batch
                batch_imgs = np.zeros((batch_size, 3, self.img_size, self.img_size), dtype=np.float32)
                batch_shapes = np.zeros((batch_size, 2), dtype=np.int32)
                batch_scales = np.zeros((batch_size,1), dtype=np.float32)
                batch_paddings = np.zeros((batch_size, 2), dtype=np.float32)

            else:
                batch_imgs = np.stack(imgs_resized, axis=0)  # (batch_size, 3, 640, 640)
                batch_shapes = np.array(origins_shape, dtype=np.int32)  # (batch_size, 2)
                batch_scales = np.array(scales, dtype=np.float32).reshape(-1, 1)
                batch_paddings = np.array(paddings, dtype=np.float32)

            # Build the output tensors
            out_img = pb_utils.Tensor("PREPROCESSED_IMAGE", batch_imgs)
            out_shape = pb_utils.Tensor("IMAGE_SHAPE", batch_shapes)
            out_scale = pb_utils.Tensor("SCALE", batch_scales)
            out_padding = pb_utils.Tensor("PADDING", batch_paddings)

            response = pb_utils.InferenceResponse(output_tensors=[out_img, out_shape, out_scale, out_padding])
            responses.append(response)

        return responses
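
To make the SCALE and PADDING outputs concrete, the letterbox arithmetic can be traced for one image size without touching any pixels. The helper below is a sketch that mirrors the calculations in `letterbox()`; the name `letterbox_params` and the 1280x720 example image are illustrative.

```python
def letterbox_params(orig_h, orig_w, target=640):
    """Repeat the arithmetic of letterbox() for a given image size."""
    scale = min(target / orig_h, target / orig_w)
    new_w, new_h = int(orig_w * scale), int(orig_h * scale)
    dw = (target - new_w) / 2  # horizontal padding per side
    dh = (target - new_h) / 2  # vertical padding per side
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    return scale, (new_w, new_h), (dw, dh), (top, bottom, left, right)


# A 1280x720 (width x height) image: width is the limiting side.
scale, resized, padding, borders = letterbox_params(720, 1280)
print(scale, resized, padding, borders)
# → 0.5 (640, 360) (0.0, 140.0) (140, 140, 0, 0)
```

The image is shrunk to 640x360 and 140 rows of gray are added above and below, so SCALE is 0.5 and PADDING is (0, 140) for this input.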

2. yolov26_inference (inference)

config.pbtxt

name: "yolov26_inference"
backend: "onnxruntime"
default_model_filename: "coco_yolo26s_fp16.onnx"
max_batch_size: 64  # enables batching (recommended); or set a fixed value such as 1
dynamic_batching {
  max_queue_delay_microseconds: 100000
  preferred_batch_size: [ 16, 32, 64 ]  # values must not exceed max_batch_size
}

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 640, 640]  # fixed input size in CHW; the batch dimension is implicit
  }
]

output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [300, 6]  # per-image output; the full tensor is [batch, 300, 6]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]  # use GPU 0 (A10)
  }
]

3. yolov26_postprocess (postprocessing)

config.pbtxt

name: "yolov26_postprocess"
backend: "python"
max_batch_size: 64
default_model_filename: "postprocess.py"

input [
  {
    name: "DETECTION_OUTPUT"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]  # output shape of the ONNX model
  },
  {
    name: "IMAGE_SHAPE"
    data_type: TYPE_INT32
    dims: [ 2 ]  # height and width
  },
  {
    name: "SCALE"
    data_type: TYPE_FP32
    dims: [ 1 ]
  },
  {
    name: "PADDING"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]

output [
  {
    name: "OUTPUT_RESULTS"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

# Parameter: confidence threshold
parameters: {
  key: "CONF_THRESHOLD"
  value: { string_value: "0.25" }
}

instance_group [
  {
    count: 32
    kind: KIND_CPU
  }
]

postprocess.py

import json
import logging

import numpy as np
import triton_python_backend_utils as pb_utils

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Inputs and outputs follow the postprocess config.pbtxt above.
# Assumption: each row of DETECTION_OUTPUT is [x1, y1, x2, y2, score, class_id]
# in letterboxed 640x640 pixel coordinates, and OUTPUT_RESULTS carries one
# JSON string per image.


class TritonPythonModel:
    def initialize(self, args):
        # Read CONF_THRESHOLD from the "parameters" section of config.pbtxt
        model_config = json.loads(args["model_config"])
        params = model_config.get("parameters", {})
        self.conf_threshold = float(
            params.get("CONF_THRESHOLD", {}).get("string_value", "0.25"))

    def restore_boxes(self, boxes, scale, padding, image_shape):
        # Undo the letterbox: subtract the padding, divide by the resize scale,
        # then clip to the bounds of the original image
        pad_w, pad_h = float(padding[0]), float(padding[1])
        orig_h, orig_w = int(image_shape[0]), int(image_shape[1])
        boxes = boxes.astype(np.float32, copy=True)
        boxes[:, [0, 2]] = np.clip((boxes[:, [0, 2]] - pad_w) / scale, 0, orig_w)
        boxes[:, [1, 3]] = np.clip((boxes[:, [1, 3]] - pad_h) / scale, 0, orig_h)
        return boxes

    def execute(self, requests):
        responses = []
        for request in requests:
            # 1. Fetch the input tensors; each one holds a whole batch
            detections = pb_utils.get_input_tensor_by_name(
                request, "DETECTION_OUTPUT").as_numpy()  # (batch_size, 300, 6)
            shapes = pb_utils.get_input_tensor_by_name(
                request, "IMAGE_SHAPE").as_numpy()       # (batch_size, 2)
            scales = pb_utils.get_input_tensor_by_name(
                request, "SCALE").as_numpy()             # (batch_size, 1)
            paddings = pb_utils.get_input_tensor_by_name(
                request, "PADDING").as_numpy()           # (batch_size, 2)

            results = []
            # 2. Process each image in the batch
            for i in range(detections.shape[0]):
                dets = detections[i]
                # 3. Drop low-confidence boxes
                dets = dets[dets[:, 4] >= self.conf_threshold]
                # 4. Map the boxes back to the original image
                scale = float(scales[i][0])
                if scale <= 0:
                    scale = 1.0  # guard against a zero scale from a failed decode
                boxes = self.restore_boxes(dets[:, :4], scale, paddings[i], shapes[i])
                # 5. Format as JSON
                objects = [
                    {
                        "bbox": [round(float(v), 2) for v in box],
                        "score": round(float(det[4]), 4),
                        "class_id": int(det[5]),
                    }
                    for box, det in zip(boxes, dets)
                ]
                results.append([json.dumps(objects).encode("utf-8")])

            # Build the output tensor: TYPE_STRING maps to a NumPy object array
            out_results = pb_utils.Tensor(
                "OUTPUT_RESULTS", np.array(results, dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_results]))

        return responses
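
The coordinate restoration that postprocessing performs is the inverse of the letterbox step: remove the padding, divide by the scale, and clip to the original image. As a quick numeric check, the standalone sketch below applies this to one box; the function name `restore_box` and the sample box are illustrative, and the box layout [x1, y1, x2, y2] is an assumption about the model output.

```python
def restore_box(box, scale, padding, image_shape):
    """Map [x1, y1, x2, y2] from letterboxed coordinates back to the original image."""
    pad_w, pad_h = padding
    orig_h, orig_w = image_shape
    x1, y1, x2, y2 = box
    xs = [(x - pad_w) / scale for x in (x1, x2)]
    ys = [(y - pad_h) / scale for y in (y1, y2)]
    # Clip so that boxes touching the padded border stay inside the image.
    xs = [min(max(x, 0.0), float(orig_w)) for x in xs]
    ys = [min(max(y, 0.0), float(orig_h)) for y in ys]
    return [xs[0], ys[0], xs[1], ys[1]]


# A 1280x720 image letterboxed to 640x640 has scale 0.5 and padding (0, 140).
restored = restore_box([100.0, 200.0, 300.0, 400.0], 0.5, (0.0, 140.0), (720, 1280))
print(restored)
# → [200.0, 120.0, 600.0, 520.0]
```

Using the fractional half-padding instead of the rounded border width can shift a box by up to half a pixel, which is usually negligible for detection results.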

4. yolov26_ensemble (ensemble/orchestration)

config.pbtxt

name: "yolov26_ensemble"
platform: "ensemble"
max_batch_size: 64

# Final input of the whole pipeline (what the client sends)
# Corresponds to the input of the preprocess model
input [
  {
    name: "RAW_IMAGE"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

# Final output of the whole pipeline (what the client receives)
# Corresponds to the output of the postprocess model
output [
  {
    name: "OUTPUT_RESULTS"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

# Pipeline step definitions
ensemble_scheduling {
  step [
    # =======================================================
    # Step 1: Preprocessing
    # Function: resize, normalize, HWC->CHW, add batch dimension
    # =======================================================
    {
      model_name: "yolov26_preprocess"
      model_version: -1  # -1 means use the latest available version
      input_map {
        key: "RAW_IMAGE"      # input name defined in the preprocess model config
        value: "RAW_IMAGE"    # mapped to the ensemble's global input
      }
      output_map [
        {
          key: "PREPROCESSED_IMAGE"    # output name of the preprocess model
          value: "PREPROCESSED_IMAGE"  # intermediate tensor name, consumed by the next step
        },
        {
          key: "IMAGE_SHAPE"    # original image size [H, W] from preprocess
          value: "IMAGE_SHAPE"  # intermediate tensor name, passed through to postprocess
        },
        {
          key: "SCALE"    # resize scale relative to the original image [RATE]
          value: "SCALE"  # intermediate tensor name, passed through to postprocess
        },
        {
          key: "PADDING"    # letterbox padding from preprocess [pw, ph]
          value: "PADDING"  # intermediate tensor name, passed through to postprocess
        }
      ]
    },

    # =======================================================
    # Step 2: Model inference (ONNX)
    # Function: YOLO26 forward pass
    # =======================================================
    {
      model_name: "yolov26_inference"
      model_version: -1
      input_map {
        key: "images"                # input name defined in the ONNX model config
        value: "PREPROCESSED_IMAGE"  # takes the output of the previous step
      }
      output_map {
        key: "output0"    # output name defined in the ONNX model config, i.e. [N, 300, 6]
        value: "output0"  # intermediate tensor name, consumed by the next step
      }
    },

    # =======================================================
    # Step 3: Postprocessing
    # Function: filter low-score boxes, restore coordinates, format JSON
    # =======================================================
    {
      model_name: "yolov26_postprocess"
      model_version: -1
      input_map [
        {
          key: "DETECTION_OUTPUT"  # input 1 defined in the postprocess model config
          value: "output0"         # takes the output of step 2
        },
        {
          key: "IMAGE_SHAPE"    # input 2 defined in the postprocess model config
          value: "IMAGE_SHAPE"  # original image size passed through from step 1
        },
        {
          key: "SCALE"    # resize scale relative to the original image [RATE]
          value: "SCALE"  # scale passed through from step 1, used to restore the boxes
        },
        {
          key: "PADDING"    # letterbox padding [pw, ph]
          value: "PADDING"  # padding passed through from step 1, used to restore the boxes
        }
      ]
      output_map {
        key: "OUTPUT_RESULTS"    # output defined in the postprocess model config
        value: "OUTPUT_RESULTS"  # mapped to the ensemble's global output
      }
    }
  ]
}
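
A mistyped tensor name in `input_map` or `output_map` is easy to miss in a config of this length. The sketch below mirrors the mapping from the ensemble config in plain Python and checks that every tensor a step reads is either an ensemble input or produced by some step. It is an illustration of the wiring only: the dictionaries are hand-copied from the config above and `check_wiring` is a hypothetical helper, not a Triton API.

```python
# input_map is {model input name: ensemble tensor name},
# output_map is {model output name: ensemble tensor name}.
ENSEMBLE_INPUTS = {"RAW_IMAGE"}
ENSEMBLE_OUTPUTS = {"OUTPUT_RESULTS"}
STEPS = [
    {"model": "yolov26_preprocess",
     "input_map": {"RAW_IMAGE": "RAW_IMAGE"},
     "output_map": {"PREPROCESSED_IMAGE": "PREPROCESSED_IMAGE",
                    "IMAGE_SHAPE": "IMAGE_SHAPE",
                    "SCALE": "SCALE",
                    "PADDING": "PADDING"}},
    {"model": "yolov26_inference",
     "input_map": {"images": "PREPROCESSED_IMAGE"},
     "output_map": {"output0": "output0"}},
    {"model": "yolov26_postprocess",
     "input_map": {"DETECTION_OUTPUT": "output0",
                   "IMAGE_SHAPE": "IMAGE_SHAPE",
                   "SCALE": "SCALE",
                   "PADDING": "PADDING"},
     "output_map": {"OUTPUT_RESULTS": "OUTPUT_RESULTS"}},
]


def check_wiring(steps, ensemble_inputs, ensemble_outputs):
    """Return a list of wiring problems; an empty list means every tensor has a producer."""
    produced_by_steps = set()
    for step in steps:
        produced_by_steps.update(step["output_map"].values())
    available = set(ensemble_inputs) | produced_by_steps

    problems = []
    for step in steps:
        for tensor in step["input_map"].values():
            if tensor not in available:
                problems.append(f"{step['model']} reads unknown tensor {tensor!r}")
    for tensor in sorted(ensemble_outputs):
        if tensor not in produced_by_steps:
            problems.append(f"ensemble output {tensor!r} is never produced")
    return problems


print(check_wiring(STEPS, ENSEMBLE_INPUTS, ENSEMBLE_OUTPUTS))
# → []
```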

Model deployment

# --gpus device=1 exposes host GPU 1, which appears as GPU 0 inside the container
docker run -d \
  --gpus device=1 \
  --name tritonserver \
  -p 127.0.0.1:8000:8000 \
  -v /path/to/yolov26/models:/models \
  nvcr.io/nvidia/tritonserver:23.01-py3-v0.0.1 \
  tritonserver --model-repository=/models --strict-model-config=false --log-verbose=1
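
Once the server is up, a client only has to send Base64 strings to the ensemble. The sketch below builds a request for Triton's HTTP/REST (KServe v2) inference endpoint using only the standard library. It assumes the port mapping from the command above and the tensor names from the ensemble config; the helper names and the `test.jpg` file in the usage comment are hypothetical.

```python
import base64
import json
import urllib.request


def build_infer_request(images):
    """Build a KServe v2 JSON body carrying one Base64 string per image.

    `images` is a list of raw encoded image bytes (e.g. the contents of .jpg files).
    """
    data = [base64.b64encode(img).decode("ascii") for img in images]
    return {
        "inputs": [{
            "name": "RAW_IMAGE",
            "shape": [len(data), 1],  # [batch, 1], matching dims: [ 1 ] in config.pbtxt
            "datatype": "BYTES",      # TYPE_STRING is BYTES in the HTTP/REST protocol
            "data": data,
        }],
        "outputs": [{"name": "OUTPUT_RESULTS"}],
    }


def infer(images, url="http://127.0.0.1:8000/v2/models/yolov26_ensemble/infer"):
    """POST the request and return the decoded JSON response (needs a running server)."""
    body = json.dumps(build_infer_request(images)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


# Example usage (requires the running server and a local image file):
# with open("test.jpg", "rb") as f:
#     print(infer([f.read()])["outputs"][0]["data"])
```

The official `tritonclient` package offers the same functionality with typed helpers; plain JSON is used here only to keep the example dependency-free.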

Deploying the TensorRT model with a Triton ensemble

Converting ONNX to TensorRT

docker run --gpus 1 -v $(pwd):/workspace -it nvcr.io/nvidia/tensorrt:23.01-py3 \
    bash -c \
    "cd /workspace && \
trtexec \
--onnx=yolov26s.onnx \
--minShapes=images:1x3x640x640 \
--optShapes=images:64x3x640x640 \
--maxShapes=images:128x3x640x640 \
--workspace=8192 \
--saveEngine=yolov26s_fp16.plan \
--explicitBatch \
--fp16"

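Directory structure

1. Replace the ONNX model with the TensorRT model

In models/yolov26_inference/1/, replace the ONNX model with the TensorRT engine. The engine file name must match default_model_filename in config.pbtxt (coco_yolo26s_fp16.plan below), so rename the generated .plan file if necessary.

2. Modify config.pbtxt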

name: "yolov26_inference"
backend: "tensorrt"  # changed from onnxruntime to tensorrt
default_model_filename: "coco_yolo26s_fp16.plan"  # name of the TensorRT engine file
max_batch_size: 64  # enables batching (recommended); or set a fixed value such as 1
dynamic_batching {
  max_queue_delay_microseconds: 100000
  preferred_batch_size: [ 16, 32, 64 ]  # values must not exceed max_batch_size
}

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 640, 640]  # fixed input size in CHW; the batch dimension is implicit
  }
]

output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [300, 6]  # per-image output; the full tensor is [batch, 300, 6]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]  # use GPU 0 (A10)
  }
]