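I. Environment setup
1) Download the models
Keep the models under /root/autodl-tmp, where I/O reads are faster. The project itself lives under /root, so create symlinks from the project into /root/autodl-tmp, and create the ONNX export directory as well.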
shell
cd /root/3rd/GPT-SoVITS_minimal_inference
mkdir -p /root/autodl-tmp/GPT-SoVITS_minimal_inference/pretrained_models
mkdir -p /root/autodl-tmp/GPT-SoVITS_minimal_inference/onnx_export
rm -rf pretrained_models onnx_export
ln -s /root/autodl-tmp/GPT-SoVITS_minimal_inference/pretrained_models pretrained_models
ln -s /root/autodl-tmp/GPT-SoVITS_minimal_inference/onnx_export onnx_export
Then download or copy all models into this directory:
/root/autodl-tmp/GPT-SoVITS_minimal_inference/pretrained_models
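Before downloading anything, it can help to confirm that both symlinks really point at the fast disk. The following is a minimal sketch, not part of the project; the helper name `check_symlink` is made up for illustration, and the paths are the ones used in the shell commands above.

```python
from pathlib import Path


def check_symlink(link: Path, expected_target: Path) -> bool:
    """Return True if `link` is a symlink that resolves to `expected_target`."""
    return link.is_symlink() and link.resolve() == expected_target.resolve()


if __name__ == "__main__":
    # Paths copied from the shell commands above.
    project = Path("/root/3rd/GPT-SoVITS_minimal_inference")
    storage = Path("/root/autodl-tmp/GPT-SoVITS_minimal_inference")
    for name in ("pretrained_models", "onnx_export"):
        ok = check_symlink(project / name, storage / name)
        print(f"{name}: {'ok' if ok else 'NOT linked as expected'}")
```

If either entry is reported as not linked, the `rm -rf` / `ln -s` pair above probably ran in a different working directory.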
shell
# Start downloading the models
cd /root/3rd/GPT-SoVITS_minimal_inference
HF_ENDPOINT=https://hf-mirror.com hf download lj1995/GPT-SoVITS \
--include "chinese-hubert-base/*" \
--include "chinese-roberta-wwm-ext-large/*" \
--include "s1v3.ckpt" \
--include "v2Pro/s2Gv2ProPlus.pth" \
--include "sv/pretrained_eres2netv2w24s4ep4.ckpt" \
--local-dir pretrained_models
# After the download finishes, check the files:
ls -lh pretrained_models
ls -lh pretrained_models/v2Pro
ls -lh pretrained_models/sv
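The same check can be scripted so that a partial download is caught before the export step. This is a rough sketch under the assumption that the files land at the paths named in the `--include` patterns above; `missing_entries` is a hypothetical helper, not something the project provides.

```python
from pathlib import Path

# Entries taken from the --include patterns above, relative to pretrained_models.
EXPECTED = [
    "chinese-hubert-base",
    "chinese-roberta-wwm-ext-large",
    "s1v3.ckpt",
    "v2Pro/s2Gv2ProPlus.pth",
    "sv/pretrained_eres2netv2w24s4ep4.ckpt",
]


def missing_entries(root: Path, expected: list[str]) -> list[str]:
    """Return the expected entries that are absent or empty under `root`."""
    missing = []
    for rel in expected:
        path = root / rel
        if path.is_dir():
            # A directory counts as present only if it contains something.
            if not any(path.iterdir()):
                missing.append(rel)
        elif not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    result = missing_entries(Path("pretrained_models"), EXPECTED)
    print("missing:", result if result else "nothing")
```

It only checks presence and non-zero size, so a truncated file would still pass; comparing sizes against the `ls -lh` output remains worthwhile.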
If the download is slow, switch to ModelScope, a source hosted in China:
shell
cd /root/3rd/GPT-SoVITS_minimal_inference
python - <<'PY'
from modelscope import snapshot_download
snapshot_download(
'dienstag/chinese-roberta-wwm-ext-large',
local_dir='pretrained_models/chinese-roberta-wwm-ext-large'
)
snapshot_download(
'innnky/chinese-hubert-base-tencent',
local_dir='pretrained_models/chinese-hubert-base'
)
PY
2) Export the pth models to ONNX
The base models cannot be exported to ONNX as they are, so download these additional model files:
shell
cd /root/3rd/GPT-SoVITS_minimal_inference
#pretrained_models/GPT_weights_v2ProPlus/*.ckpt
#pretrained_models/SoVITS_weights_v2ProPlus/*.pth
HF_HUB_DISABLE_XET=1 python - <<'PY'
from huggingface_hub import hf_hub_download
import os
import shutil
repo_id = "lj1995/GPT-SoVITS"
downloads = [
("s1v3.ckpt", "pretrained_models/GPT_weights_v2ProPlus/s1v3.ckpt"),
("v2Pro/s2Gv2ProPlus.pth", "pretrained_models/SoVITS_weights_v2ProPlus/s2Gv2ProPlus.pth"),
("sv/pretrained_eres2netv2w24s4ep4.ckpt", "pretrained_models/sv/pretrained_eres2netv2w24s4ep4.ckpt"),
]
for filename, target in downloads:
    os.makedirs(os.path.dirname(target), exist_ok=True)
    src = hf_hub_download(repo_id=repo_id, filename=filename)
    shutil.copy2(src, target)
    print(f"saved: {target}")
PY
Export the ONNX models:
shell
cd /root/3rd/GPT-SoVITS_minimal_inference
pip install -r requirements.txt
python export_onnx.py \
--gpt_path "pretrained_models/GPT_weights_v2ProPlus/s1v3.ckpt" \
--sovits_path "pretrained_models/SoVITS_weights_v2ProPlus/s2Gv2ProPlus.pth" \
--cnhubert_base_path "pretrained_models/chinese-hubert-base" \
--bert_path "pretrained_models/chinese-roberta-wwm-ext-large" \
--sv_path "pretrained_models/sv/pretrained_eres2netv2w24s4ep4.ckpt" \
--output_dir "onnx_export/v2proplus_base" \
--max_len 1000
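- Error
shell
ImportError: libcudart.so.13
# Cause: the installed onnxruntime-gpu is 1.27.0, which tries to load libcudart.so.13.
# Fix: uninstall it and pin an older release.
pip uninstall -y onnxruntime-gpu onnxruntime
pip install --no-cache-dir onnxruntime-gpu==1.22.0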
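To see which CUDA runtime the dynamic loader can actually find before reinstalling anything, a quick probe with `ctypes` is enough. This is only a sketch: `can_load` is a made-up helper, `libcudart.so.12` is an assumed alternative to compare against (the article only mentions `libcudart.so.13`), and the probe covers the default loader search path only, not libraries that a Python package loads from its own directory.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can find and open `library_name`."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # libcudart.so.13 comes from the error above; .so.12 is an assumed comparison.
    for name in ("libcudart.so.12", "libcudart.so.13"):
        print(name, "loadable" if can_load(name) else "not found")
```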
- Run the ONNX export
shell
(base) root@autodl-container-943b48886a-a5c25e1d:~/3rd/GPT-SoVITS_minimal_inference# python export_onnx.py \
--gpt_path "pretrained_models/GPT_weights_v2ProPlus/s1v3.ckpt" \
--sovits_path "pretrained_models/SoVITS_weights_v2ProPlus/s2Gv2ProPlus.pth" \
--cnhubert_base_path "pretrained_models/chinese-hubert-base" \
--bert_path "pretrained_models/chinese-roberta-wwm-ext-large" \
--sv_path "pretrained_models/sv/pretrained_eres2netv2w24s4ep4.ckpt" \
--output_dir "onnx_export/v2proplus_base" \
--max_len 1000
Loading models...
Exporting to onnx_export/v2proplus_base...
Exporting SSL...
Exporting BERT...
Exporting VQEncoder...
Exporting GPT Encoder...
Exporting GPT Step...
Exporting SoVITS...
Exporting Spectrogram...
min value is tensor(-3.8879)
max value is tensor(4.3243)
Exporting SV Embedding...
Export complete! Config saved to onnx_export/v2proplus_base/config.json
Convert to FP16:
shell
python onnx_to_fp16.py \
--input_dir "onnx_export/v2proplus_base" \
--output_dir "onnx_export/v2proplus_base_fp16"
....................
Saved: onnx_export/v2proplus_base_fp16/gpt_encoder.onnx
Processing: gpt_step.onnx | Strategy: FP16 (Mixed) [FP16]
Converting to FP16...
[Attribute Fix] Fixed 49 attributes (Random/Cast mismatch).
Simplifying...
Saved: onnx_export/v2proplus_base_fp16/gpt_step.onnx
Processing: sovits.onnx | Strategy: FP16 (Mixed) [FP16]
Converting to FP16...
/root/miniconda3/lib/python3.12/site-packages/onnxconverter_common/float16.py:63: UserWarning: the float32 number -10000.0 will be truncated to -10000.0
warnings.warn(
[Attribute Fix] Fixed 45 attributes (Random/Cast mismatch).
Simplifying...
Saved: onnx_export/v2proplus_base_fp16/sovits.onnx
Processing: spectrogram.onnx | Strategy: FP32 (Keep) [FP16]
Skipping FP16 conversion (Sensitivity/Low-Cost).
Simplifying...
Saved: onnx_export/v2proplus_base_fp16/spectrogram.onnx
Processing: sv_embedding.onnx | Strategy: FP32 (Keep) [FP16]
Skipping FP16 conversion (Sensitivity/Low-Cost).
Simplifying...
Saved: onnx_export/v2proplus_base_fp16/sv_embedding.onnx
Optimization complete: onnx_export/v2proplus_base_fp16
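The log shows that not every model is converted: some are kept in FP32. A small parser makes that split easy to read off a long log. The sketch below assumes the `Processing: <file> | Strategy: <strategy> [...]` line format seen in the output above; `strategies` is a hypothetical helper and not part of onnx_to_fp16.py.

```python
import re

# Matches lines such as: "Processing: sovits.onnx | Strategy: FP16 (Mixed) [FP16]"
LINE_RE = re.compile(r"^Processing:\s*(\S+)\s*\|\s*Strategy:\s*(.+?)\s*\[")


def strategies(log_text: str) -> dict[str, str]:
    """Map each model file name to the strategy reported for it."""
    result = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match.group(1)] = match.group(2)
    return result


if __name__ == "__main__":
    # Sample lines copied from the log above.
    sample = "\n".join([
        "Processing: gpt_step.onnx | Strategy: FP16 (Mixed) [FP16]",
        "Converting to FP16...",
        "Processing: spectrogram.onnx | Strategy: FP32 (Keep) [FP16]",
        "Skipping FP16 conversion (Sensitivity/Low-Cost).",
    ])
    for name, strategy in strategies(sample).items():
        print(f"{name:20s} {strategy}")
```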
3) Install TensorRT and convert the ONNX files to TensorRT engines
- Download TensorRT 10.16
shell
mkdir -p /root/3rd/trt
cd /root/3rd/trt
wget -O nv-tensorrt-local-repo-ubuntu2204-10.16.1-cuda-13.2_1.0-1_amd64.deb \
"https://developer.download.nvidia.com/compute/tensorrt/10.16.1/local_installers/nv-tensorrt-local-repo-ubuntu2204-10.16.1-cuda-13.2_1.0-1_amd64.deb"
# After the download finishes, install:
dpkg -i nv-tensorrt-local-repo-ubuntu2204-10.16.1-cuda-13.2_1.0-1_amd64.deb
cp /var/nv-tensorrt-local-repo-ubuntu2204-10.16.1-cuda-13.2/nv-tensorrt-local-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get install -y tensorrt libnvinfer-dev libnvinfer-plugin-dev libnvonnxparsers-dev libnvinfer-bin
# Verify:
which trtexec
trtexec --version
- After installing TensorRT, build the engine files
shell
# Go back to the ONNX project and convert the models to TensorRT engines:
cd /root/3rd/GPT-SoVITS_minimal_inference
python onnx2trt.py \
--input_dir "onnx_export/v2proplus_base_fp16" \
--output_dir "onnx_export/v2proplus_base_trt_fp16" \
--precision fp16 \
--shape_profile fitted
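4) Errors when converting ONNX to TensorRT
(1) Approach and what to check
- Error message
Finished parsing network model
This line shows that ONNX parsing did not fail. A parse failure would look like this instead: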
[TRT] ModelImporter.cpp:xxx: ERROR: builtin_op_importers.cpp:xxxx In function importXXX:
[TRT] No importer registered for op: SomeOp
[TRT] Failed to parse ONNX model
[E] Failed to parse onnx file
[E] Parsing model failed
or:
[E] Error[4]: [graphShapeAnalyzer.cpp::...]
[E] Network must have at least one output
[E] ModelImporter.cpp:... While parsing node number ...
Nor is it an obviously missing operator. In that case the log usually names the op directly, for example:
No importer registered for op: GridSample
or:
Unsupported ONNX data type: UINT64
Plugin not found, are the plugin name, version, and namespace correct?
- Analysis
The real failure happens in the build stage:
MyelinCheckException
CHECK_EQ(dim_count(), stride_order().size()) failed
ForeignNode[/dec/Constant_9_output_0 + ONNXTRT_Broadcast_9642.../dec/Tanh]
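The distinction between a parse-stage and a build-stage failure can be automated with a few substring checks. The following is a rough sketch, assuming the marker strings quoted in this article (plus the `&&&& PASSED` line trtexec prints on success) are representative; `classify` is a made-up helper and the marker lists are certainly not exhaustive.

```python
# Marker strings quoted in this article; real logs may contain others.
PARSE_FAILURE_MARKERS = (
    "Failed to parse ONNX model",
    "Failed to parse onnx file",
    "Parsing model failed",
    "No importer registered for op",
    "Unsupported ONNX data type",
    "Plugin not found",
)
BUILD_FAILURE_MARKERS = ("MyelinCheckException", "ForeignNode[")
PARSE_OK_MARKER = "Finished parsing network model"
PASSED_MARKER = "&&&& PASSED"


def classify(log_text: str) -> str:
    """Return 'passed', 'parse failure', 'build failure' or 'unknown'."""
    if PASSED_MARKER in log_text:
        return "passed"
    if any(marker in log_text for marker in PARSE_FAILURE_MARKERS):
        return "parse failure"
    # Parsing finished, yet a builder-side marker shows up: build-stage problem.
    if PARSE_OK_MARKER in log_text and any(
        marker in log_text for marker in BUILD_FAILURE_MARKERS
    ):
        return "build failure"
    return "unknown"


if __name__ == "__main__":
    sample = "Finished parsing network model\nMyelinCheckException\n"
    print(classify(sample))
```

Feeding it the full trtexec output of a failed run gives a first hint about whether to look at unsupported operators or at the builder.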
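The ForeignNode[...] entry in the error shows that TensorRT fused a large ONNX subgraph into a single internal execution block, spanning from around /dec/Constant_9_output_0 all the way to /dec/Tanh. The thing to investigate is therefore this fused block, not the Tanh node alone. Looking at the ONNX nodes, the tail of the decoder has this structure: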
/dec/Add_10
↓
/dec/Div_9
↓
/dec/LeakyRelu_5
↓
/dec/conv_post/Conv
↓
/dec/Tanh
↓
audio
- Step-by-step troubleshooting
The investigation then went as follows:
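First, try Div -> Mul to rule out a scalar Div/broadcast problem. The build still fails and only the starting point of the error changes, so the Div node alone is not the cause.
Next, try a static profile: fix min/opt/max of pred_semantic/text_seq/refer_spec to the same values. The build succeeds, so the operators themselves can be compiled by TensorRT and the problem is related to dynamic shapes.
Then test each input separately to see which dynamic dimension triggers the failure:
pred_semantic dynamic: fails
text_seq dynamic: succeeds
refer_spec dynamic: succeeds
This narrows the problem down to:
the dynamic sem_len of pred_semantic + decoder fusion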
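The per-input test above amounts to building three profiles, each with exactly one dynamic input. The article does not list the exact shapes used for this step, so the sketch below assumes that the other two inputs are pinned to their opt shape and that the dynamic one uses the min/opt/max values from the trtexec commands in this article. Both helper names are hypothetical.

```python
# Shapes taken from the trtexec commands in this article.
MIN = {"pred_semantic": "1x1x1", "text_seq": "1x1", "refer_spec": "1x1025x1"}
OPT = {"pred_semantic": "1x1x100", "text_seq": "1x64", "refer_spec": "1x1025x200"}
MAX = {"pred_semantic": "1x1x150", "text_seq": "1x128", "refer_spec": "1x1025x400"}


def single_dynamic_profile(dynamic_input: str) -> dict[str, dict[str, str]]:
    """Pin every input to its opt shape except `dynamic_input` (an assumption)."""
    if dynamic_input not in OPT:
        raise KeyError(dynamic_input)
    profile = {"min": dict(OPT), "opt": dict(OPT), "max": dict(OPT)}
    profile["min"][dynamic_input] = MIN[dynamic_input]
    profile["max"][dynamic_input] = MAX[dynamic_input]
    return profile


def shape_flags(profile: dict[str, dict[str, str]]) -> list[str]:
    """Render a profile as trtexec --minShapes/--optShapes/--maxShapes flags."""
    flags = []
    for key, option in (("min", "--minShapes"), ("opt", "--optShapes"), ("max", "--maxShapes")):
        spec = ",".join(f"{name}:{shape}" for name, shape in profile[key].items())
        flags.append(f"{option}={spec}")
    return flags


if __name__ == "__main__":
    for name in OPT:
        print(name, *shape_flags(single_dynamic_profile(name)), sep="\n  ")
```

Each printed set of flags can be pasted into a trtexec call on the unmodified sovits.onnx to repeat the three builds.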
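Finally, break up the fusion. The method is to add an extra graph output on an intermediate tensor, which forces TensorRT to keep a boundary there. I tried several break points:
/dec/Add_10_output_0 fails
/dec/Div_9_output_0 fails
/dec/LeakyRelu_5_output_0 fails
/dec/conv_post/Conv_output_0 succeeds
Why does only the last one succeed? The first three break points are too early, so TensorRT can still fuse what follows:
Div/LeakyRelu and later -> Conv -> Tanh
into the problematic block.
/dec/conv_post/Conv_output_0, however, sits right before the final Tanh and cuts the edge:
Conv -> Tanh
TensorRT can no longer hide the conv_post output entirely inside the large fusion that ends at /dec/Tanh, which sidesteps what looks like a Myelin shape/stride bug.
So the final conclusion is not "the Conv -> Tanh operators are broken", but rather:
A dynamic pred_semantic causes the second half of the decoder to be fused by TensorRT into a problematic ForeignNode.
Exposing /dec/conv_post/Conv_output_0 as an extra output prevents that fusion.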
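(2) Helper scripts
① First, try Div -> Mul to rule out a scalar Div/broadcast problem. The build still fails and only the starting point of the error changes, so the Div node alone is not the cause.
Idea:
The two forms are mathematically equivalent (up to float32 rounding):
x / 3.0
becomes:
x * 0.33333334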
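The constant 0.33333334 is simply 1/3 rounded to float32, and the rewritten graph can differ from the original only in the last bit or so. The sketch below illustrates this with the standard library alone; `to_float32` is a made-up helper and the sample values are arbitrary, not taken from the model.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# The float32 constant that replaces the divisor 3.0.
recip = to_float32(1.0 / 3.0)
print(recip)

# Compare x / 3.0 with x * recip, rounding inputs and results to float32
# the way a float32 graph would. Sample values are arbitrary.
samples = [to_float32(x) for x in (0.5, 1.0, 2.0, 3.0, -0.25, 0.1)]
max_abs_diff = max(
    abs(to_float32(x / 3.0) - to_float32(x * recip)) for x in samples
)
print(max_abs_diff)
```

The two results may differ by about one float32 rounding step for some inputs, which is far too small to matter for the purpose of this experiment.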
Script:
python
from pathlib import Path

import numpy as np
import onnx
from onnx import numpy_helper

src = Path("onnx_export/v2proplus_base_opset18/sovits.onnx")
dst = Path("onnx_export/v2proplus_base_opset18/sovits_div9_to_mul.onnx")

# Load the graph only; external weight files are not read.
model = onnx.load(src, load_external_data=False)

# Find the Div node to rewrite.
target = None
for node in model.graph.node:
    if node.name == "/dec/Div_9":
        target = node
        break
if target is None:
    raise SystemExit("node /dec/Div_9 not found")

# Add the reciprocal as a new initializer (shape 1x1x1, so it broadcasts).
new_const = "/dec/Constant_9_recip_output_0"
recip = numpy_helper.from_array(
    np.array([[[1.0 / 3.0]]], dtype=np.float32),
    name=new_const,
)
model.graph.initializer.append(recip)

# Turn x / 3.0 into x * (1 / 3.0).
target.op_type = "Mul"
target.input[1] = new_const
del target.attribute[:]

onnx.checker.check_model(model)
onnx.save(model, dst)
Then build a TensorRT engine from the modified ONNX file:
shell
trtexec \
--onnx=onnx_export/v2proplus_base_opset18/sovits_div9_to_mul.onnx \
--saveEngine=onnx_export/v2proplus_base_trt_fp16/sovits_div9_to_mul.engine \
--minShapes=pred_semantic:1x1x1,text_seq:1x1,refer_spec:1x1025x1 \
--optShapes=pred_semantic:1x1x100,text_seq:1x64,refer_spec:1x1025x200 \
--maxShapes=pred_semantic:1x1x150,text_seq:1x128,refer_spec:1x1025x400 \
--memPoolSize=workspace:4096M \
--builderOptimizationLevel=1
The build still fails; the only difference is that the error no longer starts from:
ForeignNode[/dec/Constant_9_output_0 ... /dec/Tanh]
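Comparing the first and last names inside the ForeignNode[...] label is what reveals that the start of the fused region moved between runs. A small helper can extract both ends from a log line. This is a sketch only: it assumes the label looks like the ones quoted in this article (names separated by " + " and "..."), that no name contains a closing bracket, and `fused_region` is a hypothetical name.

```python
import re

LABEL_RE = re.compile(r"ForeignNode\[(.*?)\]")


def fused_region(log_line: str):
    """Return (first, last) names from a ForeignNode[...] label, or None."""
    match = LABEL_RE.search(log_line)
    if match is None:
        return None
    body = match.group(1)
    # The first name ends at the first " + " or "..." separator.
    first = re.split(r"\s\+\s|\.\.\.", body, maxsplit=1)[0].strip()
    # The last name follows the final separator.
    last = re.split(r"\s\+\s|\.\.\.", body)[-1].strip()
    return first, last


if __name__ == "__main__":
    # Label copied from the build error quoted earlier in this article.
    line = "ForeignNode[/dec/Constant_9_output_0 + ONNXTRT_Broadcast_9642.../dec/Tanh]"
    print(fused_region(line))
```

Running it on the error line of each experiment gives a compact before/after view of where the fused block begins and ends.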
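② Next, try a static profile: fix min/opt/max of pred_semantic/text_seq/refer_spec to the same values. The build succeeds, so the operators themselves can be compiled by TensorRT and the problem is related to dynamic shapes.
For the static-profile test, set all three shape options to exactly the same values: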
shell
trtexec \
--onnx=onnx_export/v2proplus_base_opset18/sovits.onnx \
--saveEngine=onnx_export/v2proplus_base_trt_fp16/sovits_static_test.engine \
--minShapes=pred_semantic:1x1x100,text_seq:1x64,refer_spec:1x1025x200 \
--optShapes=pred_semantic:1x1x100,text_seq:1x64,refer_spec:1x1025x200 \
--maxShapes=pred_semantic:1x1x100,text_seq:1x64,refer_spec:1x1025x200 \
--memPoolSize=workspace:4096M \
--builderOptimizationLevel=1
In other words, the input shapes are fixed to:
pred_semantic = 1x1x100
text_seq = 1x64
refer_spec = 1x1025x200
The build succeeds:
Engine generation completed
Created engine with size: 498 MiB
&&&& PASSED TensorRT.trtexec
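This shows that:
- TensorRT can compile the main operators in sovits.onnx;
- operators such as ConvTranspose, Tanh and LeakyRelu are not unsupported in themselves;
- the problem arises when certain input dimensions are dynamic and TensorRT's shape/stride/fusion inference fails.
So this step narrows the problem from:
the model/operators are unsupported
down to:
a build problem caused by the dynamic shape profile
③ Finally, break up the fusion. The method is to add an extra graph output on an intermediate tensor, which forces TensorRT to keep a boundary there. I tried several break points: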
breaks = {
"break_add10": "/dec/Add_10_output_0",
"break_div9": "/dec/Div_9_output_0",
"break_lrelu5": "/dec/LeakyRelu_5_output_0",
"break_convpost": "/dec/conv_post/Conv_output_0",
}
Then generate a new ONNX file for each tensor:
python
from pathlib import Path
import copy

import onnx
from onnx import helper, TensorProto

src = Path("onnx_export/v2proplus_base_opset18/sovits.onnx")
base = Path("onnx_export/v2proplus_base_opset18")
breaks = {
    "break_add10": "/dec/Add_10_output_0",
    "break_div9": "/dec/Div_9_output_0",
    "break_lrelu5": "/dec/LeakyRelu_5_output_0",
    "break_convpost": "/dec/conv_post/Conv_output_0",
}

orig = onnx.load(src, load_external_data=False)

# Run shape inference to get the dtype/shape of the intermediate tensors.
inferred = onnx.shape_inference.infer_shapes(orig)
value_infos = {}
for vi in list(inferred.graph.value_info) + list(inferred.graph.output) + list(inferred.graph.input):
    value_infos[vi.name] = vi

for suffix, tensor_name in breaks.items():
    model = copy.deepcopy(orig)
    if tensor_name in value_infos:
        vi = copy.deepcopy(value_infos[tensor_name])
    else:
        # Fallback when shape inference has no entry: a float tensor of rank 3.
        vi = helper.make_tensor_value_info(
            tensor_name,
            TensorProto.FLOAT,
            ["unk0", "unk1", "unk2"],
        )
    # Expose the tensor as an extra graph output unless it already is one.
    if not any(o.name == tensor_name for o in model.graph.output):
        model.graph.output.append(vi)
    out = base / f"sovits_{suffix}.onnx"
    onnx.checker.check_model(model)
    onnx.save(model, out)
    print("saved", out, "extra_output", tensor_name)
This produces:
sovits_break_add10.onnx
sovits_break_div9.onnx
sovits_break_lrelu5.onnx
sovits_break_convpost.onnx
I then compiled each of them with the same dynamic profile:
shell
trtexec \
--onnx=onnx_export/v2proplus_base_opset18/sovits_break_convpost.onnx \
--saveEngine=onnx_export/v2proplus_base_trt_fp16/sovits_break_convpost.engine \
--minShapes=pred_semantic:1x1x1,text_seq:1x1,refer_spec:1x1025x1 \
--optShapes=pred_semantic:1x1x100,text_seq:1x64,refer_spec:1x1025x200 \
--maxShapes=pred_semantic:1x1x150,text_seq:1x128,refer_spec:1x1025x400 \
--memPoolSize=workspace:4096M \
--builderOptimizationLevel=1
All four are run this way; only --onnx and --saveEngine change.
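Since the four runs differ only in two flags, the commands can be generated instead of edited by hand. The sketch below just prints them; the flags and paths are copied from the trtexec call above, while `trtexec_command` is a hypothetical helper.

```python
import shlex

ONNX_DIR = "onnx_export/v2proplus_base_opset18"
ENGINE_DIR = "onnx_export/v2proplus_base_trt_fp16"
VARIANTS = ["break_add10", "break_div9", "break_lrelu5", "break_convpost"]

# Flags shared by all four runs, copied from the command above.
SHARED_FLAGS = [
    "--minShapes=pred_semantic:1x1x1,text_seq:1x1,refer_spec:1x1025x1",
    "--optShapes=pred_semantic:1x1x100,text_seq:1x64,refer_spec:1x1025x200",
    "--maxShapes=pred_semantic:1x1x150,text_seq:1x128,refer_spec:1x1025x400",
    "--memPoolSize=workspace:4096M",
    "--builderOptimizationLevel=1",
]


def trtexec_command(variant: str) -> list[str]:
    """Build the trtexec argument list for one break-point variant."""
    return [
        "trtexec",
        f"--onnx={ONNX_DIR}/sovits_{variant}.onnx",
        f"--saveEngine={ENGINE_DIR}/sovits_{variant}.engine",
        *SHARED_FLAGS,
    ]


if __name__ == "__main__":
    for variant in VARIANTS:
        print(shlex.join(trtexec_command(variant)))
```

To execute the builds rather than print them, each argument list could be passed to `subprocess.run`, with the output saved per variant for later comparison.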
Results:
sovits_break_add10.onnx fails
sovits_break_div9.onnx fails
sovits_break_lrelu5.onnx fails
sovits_break_convpost.onnx succeeds
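It succeeds because, with /dec/conv_post/Conv_output_0 exposed as an extra output, TensorRT can no longer fuse the final:
Conv -> Tanh
together with the preceding dynamic-shape subgraph into the problematic ForeignNode.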