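This post shows how to deploy a YOLOv26 model with Triton Inference Server using the ensemble platform on a CUDA-capable NVIDIA GPU. Preprocessing and postprocessing are implemented with the Python backend and run in the same Triton server as model inference. The complete project code is available in the triton_ensemble_model_zoo repository.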
YOLOv26 Triton ensemble
- Summary
- Deploying an ONNX model with a Triton ensemble
- Preparing the model
- Preparing the directory layout
- 1. yolov26_preprocess (preprocessing)
- 2. yolov26_inference (inference)
- 3. yolov26_postprocess (postprocessing)
- 4. yolov26_ensemble (orchestration)
- Deploying the model
- Deploying a TensorRT model with a Triton ensemble
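Summary
A typical Triton deployment of YOLOv26 serves only the model inference step, while preprocessing and postprocessing run on the client:
client preprocessing (read image, resize, padding) -> Triton server (model inference) -> client postprocessing (filter boxes, map boxes back to the original image)
The drawback is that client-side preprocessing produces a float32 tensor of shape 1×3×640×640, which is about 4.9 MB per image (1 × 3 × 640 × 640 × 4 bytes). Sending it to the server takes a lot of bandwidth, and batched requests take proportionally more. A common alternative is therefore to run two services, FastAPI + Triton:
client request (encoded image) -> FastAPI preprocessing (decode image, resize, padding) -> Triton (model inference) -> FastAPI postprocessing (filter boxes, map boxes back) -> client receives the result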
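The bandwidth argument above can be checked with a quick calculation. The sketch below compares the size of one preprocessed float32 tensor with the size of a Base64-encoded image; the 200 KB JPEG size is an assumed figure used only to show the order of magnitude, not a measurement from this project.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (float32 by default)."""
    return math.prod(shape) * bytes_per_element


def base64_bytes(raw_bytes):
    """Length of the padded Base64 text that encodes raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


preprocessed = tensor_bytes((1, 3, 640, 640))
# Hypothetical 200 KB JPEG, chosen only for illustration.
encoded = base64_bytes(200_000)

print(f"preprocessed tensor: {preprocessed / 1e6:.2f} MB")
print(f"Base64 JPEG (assumed 200 KB source): {encoded / 1e6:.2f} MB")
```

Under that assumption the encoded image is more than an order of magnitude smaller than the preprocessed tensor, which is why sending the encoded image and preprocessing on the server side is attractive.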
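FastAPI + Triton is common practice today, but the data has to travel between two services on every request, which adds latency. A third option is to move preprocessing and postprocessing into the Triton server itself by using an ensemble model:
client request (encoded image) -> Triton ensemble (preprocessing, model inference, postprocessing) -> client receives the result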
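Deploying an ONNX model with a Triton ensemble
Preparing the model
Prepare the ONNX model first. For converting the YOLOv26 .pt weights to ONNX, see the companion post on YOLOv26 inference ("yolov26推理").
Preparing the directory layout
Triton imposes a strict layout on the model repository. The files must be arranged as follows: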
bash
models/
├── yolov26_ensemble/
│   ├── 1/
│   └── config.pbtxt
├── yolov26_inference/
│   ├── 1/
│   │   └── coco_yolo26s_fp16.onnx
│   └── config.pbtxt
├── yolov26_postprocess/
│   ├── 1/
│   │   └── postprocess.py
│   └── config.pbtxt
└── yolov26_preprocess/
    ├── 1/
    │   └── preprocess.py
    └── config.pbtxt
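Where:
- yolov26_preprocess (preprocessing): prepares the image before it reaches the model (for example resizing, normalization, color-space conversion).
- yolov26_inference (inference): runs the YOLOv26 model itself, using an FP16 (half-precision) model to speed up inference.
- yolov26_postprocess (postprocessing): turns the raw model output into readable detection results (for example non-maximum suppression, decoding box coordinates, filtering low-confidence results).
- yolov26_ensemble (orchestration): a special scheduling model. It runs no code of its own. Its config.pbtxt defines the execution order and data flow of the three steps above (preprocessing -> inference -> postprocessing) and chains them into one pipeline.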
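As a convenience, the layout above can be scaffolded and checked with a few lines of Python. This is a minimal sketch, not part of the original project: the helper names `create_skeleton` and `missing_configs` are hypothetical, and the model names are copied from the directory tree. Note that the ensemble still gets a version directory `1/`, which stays empty, as in the tree.

```python
from pathlib import Path
import tempfile

# Model names taken from the directory tree above.
MODELS = [
    "yolov26_preprocess",
    "yolov26_inference",
    "yolov26_postprocess",
    "yolov26_ensemble",
]


def create_skeleton(root):
    """Create <root>/<model>/1/ for every model and return the version directories."""
    created = []
    for name in MODELS:
        version_dir = Path(root) / name / "1"
        version_dir.mkdir(parents=True, exist_ok=True)
        created.append(version_dir)
    return created


def missing_configs(root):
    """Return the models that do not have a config.pbtxt yet."""
    return [n for n in MODELS if not (Path(root) / n / "config.pbtxt").is_file()]


with tempfile.TemporaryDirectory() as tmp:
    create_skeleton(tmp)
    # No config.pbtxt has been written yet, so all four models are reported.
    print(missing_configs(tmp))
```

Running `missing_configs("models")` before starting the server is a cheap way to catch a forgotten config.pbtxt.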
1. yolov26_preprocess (preprocessing)
config.pbtxt:
powershell
name: "yolov26_preprocess"
backend: "python"
max_batch_size: 64
default_model_filename: "preprocess.py"
input [
  {
    name: "RAW_IMAGE"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "PREPROCESSED_IMAGE"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  },
  {
    name: "IMAGE_SHAPE"
    data_type: TYPE_INT32
    dims: [ 2 ]
  },
  {
    name: "SCALE"
    data_type: TYPE_FP32
    dims: [ 1 ]
  },
  {
    name: "PADDING"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
instance_group [
  {
    count: 32
    kind: KIND_CPU
  }
]
preprocess.py
python
import base64
import logging

import cv2
import numpy as np
import triton_python_backend_utils as pb_utils

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class TritonPythonModel:
    def initialize(self, args):
        self.img_size = 640
        # With mean 0 and std 1/255, normalization just rescales pixels to [0, 1]
        self.mean = np.array([0.0, 0.0, 0.0], dtype=np.float32)
        self.std = np.array([1.0, 1.0, 1.0], dtype=np.float32) / 255.0

    def letterbox(self, img, new_shape=(640, 640), color=(114, 114, 114)):
        shape = img.shape[:2]  # current shape [height, width]
        if isinstance(new_shape, int):
            new_shape = (new_shape, new_shape)
        scale = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
        new_unpad = int(shape[1] * scale), int(shape[0] * scale)  # (width, height)
        dw = (new_shape[1] - new_unpad[0]) / 2  # horizontal padding per side
        dh = (new_shape[0] - new_unpad[1]) / 2  # vertical padding per side
        if shape[::-1] != new_unpad:
            img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
        top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
        left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
        img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)
        return img, scale, (dw, dh)

    def resize_image(self, img):
        img_resized, scale, padding = self.letterbox(img, new_shape=(self.img_size, self.img_size))  # letterbox
        img_resized = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB).astype(np.float32)  # BGR to RGB
        img_resized = (img_resized - self.mean) * self.std  # normalize
        img_resized = np.transpose(img_resized, (2, 0, 1))  # HWC to CHW
        return img_resized, scale, padding

    def base64_to_image(self, base64_str):
        """Decode a Base64 string into a BGR image; return None on failure."""
        try:
            raw_bytes = base64.b64decode(base64_str)
            nparr = np.frombuffer(raw_bytes, np.uint8)
            return cv2.imdecode(nparr, cv2.IMREAD_COLOR)  # None if the bytes are not an image
        except Exception as error:
            logger.error(f"Error: {error}")
            logger.error(f"error line: {error.__traceback__.tb_lineno}")
            return None

    def execute(self, requests):
        responses = []
        for request in requests:
            # 1. Get the input tensor; it holds one batch of Base64 strings
            in_tensor = pb_utils.get_input_tensor_by_name(request, "RAW_IMAGE")
            if in_tensor is None:
                # Triton expects exactly one response per request, so answer with an error
                logger.error("Input tensor 'RAW_IMAGE' not found!")
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError("Input tensor 'RAW_IMAGE' not found")))
                continue
            raw_batch = in_tensor.as_numpy()
            batch_size = raw_batch.shape[0]
            imgs_resized, origins_shape, scales, paddings = [], [], [], []
            # 2. Process every image in the batch
            for i in range(batch_size):
                base64_str = raw_batch[i][0]  # Base64 bytes of the i-th image
                # 3. Decode Base64 into an image
                img = self.base64_to_image(base64_str)
                if img is None:
                    # Decoding failed: substitute a black image so the batch keeps its size
                    logger.warning("Base64 converted to image failed!")
                    img = np.zeros((self.img_size, self.img_size, 3), dtype=np.uint8)
                else:
                    logger.info("Base64 converted to image Successfully!")
                orig_h, orig_w = img.shape[:2]
                origins_shape.append([orig_h, orig_w])
                # 4. Preprocess; each image contributes exactly one entry to every list
                img_resized, scale, padding = self.resize_image(img)
                imgs_resized.append(img_resized)
                scales.append(scale)
                paddings.append(padding)
            # Turn the lists into batch arrays
            if len(imgs_resized) == 0:
                # Empty batch: emit correctly shaped empty arrays
                batch_imgs = np.zeros((batch_size, 3, self.img_size, self.img_size), dtype=np.float32)
                batch_shapes = np.zeros((batch_size, 2), dtype=np.int32)
                batch_scales = np.zeros((batch_size, 1), dtype=np.float32)
                batch_paddings = np.zeros((batch_size, 2), dtype=np.float32)
            else:
                batch_imgs = np.stack(imgs_resized, axis=0)  # (batch_size, 3, 640, 640)
                batch_shapes = np.array(origins_shape, dtype=np.int32)  # (batch_size, 2)
                batch_scales = np.array(scales, dtype=np.float32).reshape(-1, 1)
                batch_paddings = np.array(paddings, dtype=np.float32)
            # Build the output tensors
            out_img = pb_utils.Tensor("PREPROCESSED_IMAGE", batch_imgs)
            out_shape = pb_utils.Tensor("IMAGE_SHAPE", batch_shapes)
            out_scale = pb_utils.Tensor("SCALE", batch_scales)
            out_padding = pb_utils.Tensor("PADDING", batch_paddings)
            response = pb_utils.InferenceResponse(output_tensors=[out_img, out_shape, out_scale, out_padding])
            responses.append(response)
        return responses
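To make the SCALE and PADDING outputs concrete, here is a small worked example of the letterbox arithmetic used in preprocess.py, written without OpenCV so it runs anywhere. The helper names `letterbox_params` and `to_letterbox` are illustrative and do not appear in the project; the 1280×720 input size is an arbitrary choice.

```python
def letterbox_params(orig_h, orig_w, size=640):
    """Return (scale, (new_w, new_h), (pad_w, pad_h)) using the same formulas as letterbox()."""
    scale = min(size / orig_h, size / orig_w)
    new_w, new_h = int(orig_w * scale), int(orig_h * scale)
    pad_w, pad_h = (size - new_w) / 2, (size - new_h) / 2
    return scale, (new_w, new_h), (pad_w, pad_h)


def to_letterbox(x, y, scale, pad):
    """Map a pixel from the original image into the padded 640x640 canvas."""
    return x * scale + pad[0], y * scale + pad[1]


scale, new_size, pad = letterbox_params(720, 1280)
print(scale, new_size, pad)               # 0.5 (640, 360) (0.0, 140.0)
print(to_letterbox(400, 200, scale, pad))  # (200.0, 240.0)
```

A 1280×720 frame is shrunk by 0.5 to 640×360 and centered with 140 pixels of gray above and below. Postprocessing needs exactly these two values to map boxes back to the original frame.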
2. yolov26_inference (inference)
config.pbtxt
powershell
name: "yolov26_inference"
backend: "onnxruntime"
default_model_filename: "coco_yolo26s_fp16.onnx"
max_batch_size: 64  # enables batching (recommended); use a fixed value such as 1 to restrict it
dynamic_batching {
  max_queue_delay_microseconds: 100000  # wait up to 100 ms to fill a batch
  preferred_batch_size: [ 16, 32, 64 ]  # values must not exceed max_batch_size
}
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]  # fixed input size in CHW order; the batch dimension is implicit
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]  # per-image output; the full tensor is [batch, 300, 6]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]  # use GPU 0 (A10)
  }
]
3. yolov26_postprocess (postprocessing)
config.pbtxt
powershell
name: "yolov26_postprocess"
backend: "python"
max_batch_size: 64
default_model_filename: "postprocess.py"
input [
  {
    name: "DETECTION_OUTPUT"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]  # output dims of the ONNX model
  },
  {
    name: "IMAGE_SHAPE"
    data_type: TYPE_INT32
    dims: [ 2 ]  # height and width
  },
  {
    name: "SCALE"
    data_type: TYPE_FP32
    dims: [ 1 ]
  },
  {
    name: "PADDING"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
output [
  {
    name: "OUTPUT_RESULTS"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
# Parameter: confidence threshold
parameters: {
  key: "CONF_THRESHOLD"
  value: { string_value: "0.25" }
}
instance_group [
  {
    count: 32
    kind: KIND_CPU
  }
]
postprocess.py
This model reads DETECTION_OUTPUT, IMAGE_SHAPE, SCALE, and PADDING, drops boxes whose score is below CONF_THRESHOLD, maps the remaining boxes back to the original image using the scale and padding from preprocessing, and returns the result as a JSON string in OUTPUT_RESULTS. The full source is in the triton_ensemble_model_zoo repository.
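The postprocessing contract declared in the config above (filter by CONF_THRESHOLD, undo the letterbox, emit JSON) can be sketched as a plain function. This is a minimal illustration, not the project's postprocess.py. It assumes each of the 300 rows is laid out as [x1, y1, x2, y2, score, class_id] in letterboxed 640×640 coordinates, which the post does not state explicitly, and the function name `restore_detections` and the JSON field names are hypothetical.

```python
import json


def restore_detections(detections, image_shape, scale, padding, conf_threshold=0.25):
    """Filter one image's detections and map the boxes back to original-image pixels.

    Assumed row layout: [x1, y1, x2, y2, score, class_id] in letterboxed coordinates.
    """
    orig_h, orig_w = (float(v) for v in image_shape)  # IMAGE_SHAPE is [height, width]
    pad_w, pad_h = (float(v) for v in padding)        # PADDING is (dw, dh) from letterbox()
    scale = float(scale)

    def unmap(value, pad, limit):
        # Remove the padding, undo the resize, then clamp to the image bounds.
        return round(min(max((float(value) - pad) / scale, 0.0), limit), 2)

    results = []
    for x1, y1, x2, y2, score, class_id in detections:
        if float(score) < conf_threshold:
            continue
        results.append({
            "box": [unmap(x1, pad_w, orig_w), unmap(y1, pad_h, orig_h),
                    unmap(x2, pad_w, orig_w), unmap(y2, pad_h, orig_h)],
            "score": round(float(score), 4),
            "class_id": int(class_id),
        })
    return results


# One confident box and one low-score row for a 1280x720 image
# (scale 0.5, 140 px of padding above and below).
rows = [[100.0, 240.0, 300.0, 440.0, 0.9, 2.0], [0.0, 0.0, 10.0, 10.0, 0.1, 0.0]]
print(json.dumps(restore_detections(rows, (720, 1280), 0.5, (0.0, 140.0))))
```

Inside a TritonPythonModel.execute method, the four inputs would be read with pb_utils.get_input_tensor_by_name(...).as_numpy(), the function would be called once per batch item, and the JSON strings would be packed into an object-dtype NumPy array of shape (batch_size, 1) for OUTPUT_RESULTS.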
4. yolov26_ensemble (orchestration)
config.pbtxt
powershell
name: "yolov26_ensemble"
platform: "ensemble"
max_batch_size: 64
# Input of the whole pipeline (what the client sends);
# it corresponds to the input of the preprocess model
input [
  {
    name: "RAW_IMAGE"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
# Output of the whole pipeline (what the client receives);
# it corresponds to the output of the postprocess model
output [
  {
    name: "OUTPUT_RESULTS"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
# Pipeline steps
ensemble_scheduling {
  step [
    # =======================================================
    # Step 1: preprocessing
    # Resize, normalize, HWC -> CHW, add the batch dimension
    # =======================================================
    {
      model_name: "yolov26_preprocess"
      model_version: -1  # -1 means the latest available version
      input_map {
        key: "RAW_IMAGE"    # input name defined in the preprocess model config
        value: "RAW_IMAGE"  # mapped to the ensemble's global input
      }
      output_map [
        {
          key: "PREPROCESSED_IMAGE"    # output name of the preprocess model
          value: "PREPROCESSED_IMAGE"  # intermediate tensor name used by the next step
        },
        {
          key: "IMAGE_SHAPE"    # original image size [H, W] from preprocess
          value: "IMAGE_SHAPE"  # intermediate tensor, passed through to postprocess
        },
        {
          key: "SCALE"    # resize ratio relative to the original image
          value: "SCALE"  # intermediate tensor, passed through to postprocess
        },
        {
          key: "PADDING"    # letterbox padding [pw, ph] from preprocess
          value: "PADDING"  # intermediate tensor, passed through to postprocess
        }
      ]
    },
    # =======================================================
    # Step 2: model inference (ONNX)
    # YOLO26 forward pass
    # =======================================================
    {
      model_name: "yolov26_inference"
      model_version: -1
      input_map {
        key: "images"                # input name defined in the ONNX model config
        value: "PREPROCESSED_IMAGE"  # output of the previous step
      }
      output_map {
        key: "output0"    # output name defined in the ONNX model config, shape [N, 300, 6]
        value: "output0"  # intermediate tensor name used by the next step
      }
    },
    # =======================================================
    # Step 3: postprocessing
    # Filter low-score boxes, restore coordinates, format JSON
    # =======================================================
    {
      model_name: "yolov26_postprocess"
      model_version: -1
      input_map [
        {
          key: "DETECTION_OUTPUT"  # input 1 defined in the postprocess model config
          value: "output0"         # output of step 2
        },
        {
          key: "IMAGE_SHAPE"    # input 2 defined in the postprocess model config
          value: "IMAGE_SHAPE"  # original image size passed through from step 1
        },
        {
          key: "SCALE"    # resize ratio relative to the original image
          value: "SCALE"  # passed through from step 1; used to map boxes back to the original image
        },
        {
          key: "PADDING"    # letterbox padding [pw, ph]
          value: "PADDING"  # passed through from step 1; used to map boxes back to the original image
        }
      ]
      output_map {
        key: "OUTPUT_RESULTS"    # output defined in the postprocess model config
        value: "OUTPUT_RESULTS"  # mapped to the ensemble's global output
      }
    }
  ]
}
Deploying the model
powershell
# --gpus 1 exposes a single GPU, which appears as GPU 0 inside the container
# (this matches gpus: [ 0 ] in the inference config.pbtxt)
docker run -d \
  --gpus 1 \
  --name tritonserver \
  -p 127.0.0.1:8000:8000 \
  -v /path/to/yolov26/models:/models \
  nvcr.io/nvidia/tritonserver:23.01-py3-v0.0.1 \
  tritonserver --model-repository=/models --strict-model-config=false --log-verbose=1
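Once the server is running, the ensemble can be called over Triton's HTTP endpoint on port 8000. The sketch below uses only the Python standard library and the JSON form of the KServe v2 inference protocol, where a BYTES tensor is sent as a list of strings. It assumes the server is reachable at 127.0.0.1:8000 as in the docker command above; the helper names `build_infer_request` and `infer` are illustrative, and the official tritonclient package is the more common choice in practice.

```python
import base64
import json
import urllib.request


def build_infer_request(image_bytes_list):
    """Build a KServe v2 JSON body with one Base64 string per image (shape [N, 1])."""
    data = [base64.b64encode(b).decode("ascii") for b in image_bytes_list]
    return {
        "inputs": [{
            "name": "RAW_IMAGE",        # ensemble input name from config.pbtxt
            "shape": [len(data), 1],
            "datatype": "BYTES",        # TYPE_STRING in the model config
            "data": data,
        }],
        "outputs": [{"name": "OUTPUT_RESULTS"}],
    }


def infer(image_paths, url="http://127.0.0.1:8000/v2/models/yolov26_ensemble/infer"):
    """Send the images to the ensemble and return one result string per image."""
    images = []
    for path in image_paths:
        with open(path, "rb") as f:
            images.append(f.read())
    body = json.dumps(build_infer_request(images)).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["outputs"][0]["data"]
```

For example, `infer(["bus.jpg"])` (a placeholder file name) would send one encoded image and return the list of result strings produced by the postprocess model.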
Deploying a TensorRT model with a Triton ensemble
Converting ONNX to TensorRT
powershell
docker run --gpus 1 -v $(pwd):/workspace -it nvcr.io/nvidia/tensorrt:23.01-py3 \
bash -c \
"cd /workspace && \
trtexec \
--onnx=yolov26s.onnx \
--minShapes=images:1x3x640x640 \
--optShapes=images:64x3x640x640 \
--maxShapes=images:128x3x640x640 \
--workspace=8192 \
--saveEngine=yolov26s_fp16.plan \
--explicitBatch \
--fp16"
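Directory structure
1. Replace the ONNX model with the TensorRT engine
In models/yolov26_inference/1/, replace the ONNX file with the TensorRT engine, and make sure the file name matches default_model_filename in config.pbtxt.
2. Update config.pbtxt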
powershell
name: "yolov26_inference"
backend: "tensorrt"  # changed from onnxruntime to tensorrt
default_model_filename: "coco_yolo26s_fp16.plan"  # file name of the TensorRT engine
max_batch_size: 64  # enables batching (recommended); use a fixed value such as 1 to restrict it
dynamic_batching {
  max_queue_delay_microseconds: 100000  # wait up to 100 ms to fill a batch
  preferred_batch_size: [ 16, 32, 64 ]  # values must not exceed max_batch_size
}
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]  # fixed input size in CHW order; the batch dimension is implicit
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 300, 6 ]  # per-image output; the full tensor is [batch, 300, 6]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]  # use GPU 0 (A10)
  }
]
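After swapping the engine and restarting the server, it is worth confirming that every model in the pipeline loaded. Triton exposes readiness endpoints under /v2/health/ready and /v2/models/<name>/ready. The sketch below polls them with the standard library; the base URL is an assumption taken from the docker command earlier, and the helper names are illustrative.

```python
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed from the port mapping in the docker command


def ready_url(model=None, base=BASE_URL):
    """Return the server readiness URL, or the per-model one when a name is given."""
    if model is None:
        return f"{base}/v2/health/ready"
    return f"{base}/v2/models/{model}/ready"


def is_ready(model=None, base=BASE_URL, timeout=5):
    """True if the endpoint answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(ready_url(model, base), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Checking `is_ready("yolov26_inference")` and `is_ready("yolov26_ensemble")` after the restart shows quickly whether the .plan file was picked up, since an ensemble cannot become ready if one of its steps fails to load.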