PVN3D ORT CUDA Custom Ops: Implementation and Integration Log

1. Task Objective

This task adds a separate GPU version on top of the CPU C++ custom ops that already work:

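Create a standalone directory deploy/ort_custom_ops_gpu/
Upgrade the 6 PointNet2 custom ops to CUDA custom kernels
Export a .so through the official ONNX Runtime C++ custom op API
Load this GPU .so on the Python side via register_custom_ops_library(...)
Run the complete pvn3d_full.onnx with CUDAExecutionProvider
Compare outputs and pose accuracy against the native PyTorch weights

This work does not replace the original CPU version. The two projects are kept side by side:

CPU version: deploy/ort_custom_ops/
GPU version: deploy/ort_custom_ops_gpu/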

2. Final Deliverables

2.1 GPU custom op project

3. Environment Facts

The pvn3d-dev container is the reference environment for this work.

Key environment details:

Ubuntu 18.04.5
g++ 7.5.0
nvcc 11.3.109
GPU: NVIDIA GeForce RTX 4060 Laptop GPU
cuDNN 8.2
onnxruntime==1.16.3
onnxruntime-gpu==1.16.3
onnx==1.14.1
torch==1.10.0+cu113

The ORT development headers still come from the official package you have already unpacked:

/workspace/tmp/onnxruntime-linux-x64-1.16.3/include/onnxruntime_c_api.h
/workspace/tmp/onnxruntime-linux-x64-1.16.3/include/onnxruntime_cxx_api.h

In other words:

Headers used for compilation: onnxruntime-linux-x64-1.16.3
Python runtime: onnxruntime-gpu==1.16.3

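4. Why a Separate GPU Directory Is Needed

The original CPU custom op project can already run the full graph, but it works like this:

custom op kernels execute on the CPU
the ORT provider is CPUExecutionProvider

This time the goal is:

custom op kernels execute on CUDA
the ORT provider is explicitly CUDAExecutionProvider

A separate directory is therefore required, to avoid mixing up the following:

The CMake build languages differ
The runtime provider binding differs
The .so dependencies differ
The debugging problems are entirely different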

5. GPU Implementation Strategy

5.1 Keep using ORT's official `CustomOpBase`

This work does not switch to a different wrapper. It continues to use:

Ort::CustomOpBase<TOp, TKernel>

The difference is that every op explicitly declares:

const char* GetExecutionProviderType() const { return "CUDAExecutionProvider"; }

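Only then does ORT assign these custom nodes to the CUDA EP. Without this override, CustomOpBase leaves the provider type unset and the ops are treated as CPU ops.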

5.2 Get the current CUDA stream from ORT

At run time, every kernel uses:

Ort::KernelContext ctx(context);
auto* stream = static_cast<cudaStream_t>(ctx.GetGPUComputeStream());

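to obtain ORT's current execution stream.

This is the key to making the GPU custom ops cooperate correctly with the ORT CUDA EP: launching on this stream keeps the custom kernels ordered with ORT's own CUDA kernels without extra synchronization.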

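5.3 Directly reuse the CUDA semantics of PVN3D's PointNet2

The GPU kernels are not a redesign. They are aligned directly with the PointNet2 CUDA implementation that already exists in the repository.

The corresponding sources are mainly:

One important adaptation was made, however:

Many kernels in the original repository use int
On the ONNX / ORT side, indices in the graph are uniformly int64

So in the GPU .so, all index-related inputs and outputs were changed to int64_t.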

6. Custom Ops Covered

The GPU version takes over all 6 PointNet2 custom ops in the full graph in one go:

PVN3D_FurthestPointSample
PVN3D_GatherPoints
PVN3D_BallQuery
PVN3D_GroupPoints
PVN3D_ThreeNN
PVN3D_ThreeInterpolate

All ops are registered under:

custom domain: ai.onnx.contrib

This matches the current full ONNX export.
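To make the expected semantics of one of these ops concrete, here is a pure-Python reference sketch of furthest point sampling, the operation behind PVN3D_FurthestPointSample. It assumes the usual PointNet2 convention (start from index 0, then repeatedly pick the point farthest from the already selected set). The function name furthest_point_sample is hypothetical, and the repository's CUDA kernel may differ in details such as tie-breaking or the handling of near-origin points, so treat this as an illustration that can serve as a small CPU oracle, not as the actual kernel.

```python
def furthest_point_sample(points, npoint):
    """Return `npoint` indices into `points`, a list of (x, y, z) tuples.

    Assumption: sampling starts at index 0, as in common PointNet2 code.
    """
    if not 0 < npoint <= len(points):
        raise ValueError("npoint must be in [1, len(points)]")

    # Squared distance from every point to its nearest selected point so far.
    min_dist = [float("inf")] * len(points)
    selected = [0]

    while len(selected) < npoint:
        lx, ly, lz = points[selected[-1]]
        best_idx, best_dist = 0, -1.0
        for i, (x, y, z) in enumerate(points):
            d = (x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2
            if d < min_dist[i]:
                min_dist[i] = d
            # Keep the point that is farthest from the selected set.
            if min_dist[i] > best_dist:
                best_idx, best_dist = i, min_dist[i]
        selected.append(best_idx)
    return selected
```

Python integers are unbounded, so the returned indices map naturally onto the int64 index tensors the ONNX graph expects.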

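7. Steps Inside the Container

7.1 Check the initial state of CUDA and ORT

At the start, the container had the CUDA toolchain and a GPU, but the ORT installed in the Python environment only had the CPU provider.

The check command actually used: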

docker exec pvn3d-dev bash -lc '
which nvcc || true &&
nvcc --version || true &&
source /opt/conda/etc/profile.d/conda.sh &&
conda activate pvn3d &&
python - << "PY"
import onnxruntime as ort
print("ort", ort.__version__)
print("providers", ort.get_available_providers())
PY
'

The result at that point was:

nvcc 11.3.109
ort 1.16.3
providers: only CPUExecutionProvider

7.2 Install `onnxruntime-gpu`

To let the Python entry layer actually use the CUDA EP, the following was installed inside the container:

docker exec pvn3d-dev bash -lc '
source /opt/conda/etc/profile.d/conda.sh &&
conda activate pvn3d &&
python -m pip install --upgrade --force-reinstall onnxruntime-gpu==1.16.3
'

Check again after installation:

docker exec pvn3d-dev bash -lc '
source /opt/conda/etc/profile.d/conda.sh &&
conda activate pvn3d &&
python - << "PY"
import onnxruntime as ort
print("ort", ort.__version__)
print("providers", ort.get_available_providers())
PY
'

The provider list became:

TensorrtExecutionProvider
CUDAExecutionProvider
AzureExecutionProvider
CPUExecutionProvider

7.3 Build the GPU custom op `.so`

The final build command:

docker exec pvn3d-dev bash -lc '
cd /workspace/workflow/self/PVN3D/deploy/ort_custom_ops_gpu &&
rm -rf build &&
mkdir build &&
cd build &&
cmake -DONNXRUNTIME_ROOT=/workspace/tmp/onnxruntime-linux-x64-1.16.3 .. &&
cmake --build . -- -j2
'

A successful build produces:

libpvn3d_ort_custom_ops_gpu.so

7.4 Check exported symbols and linkage

Check command:

docker exec pvn3d-dev bash -lc '
cd /workspace/workflow/self/PVN3D &&
nm -D deploy/ort_custom_ops_gpu/build/libpvn3d_ort_custom_ops_gpu.so | rg "RegisterCustomOps" -n -S &&
ldd deploy/ort_custom_ops_gpu/build/libpvn3d_ort_custom_ops_gpu.so
'

The results show:

RegisterCustomOps is exported correctly
The .so correctly depends on libcudart.so.11.0
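The same export check can also be done from Python, which is convenient inside the conda environment where the runner script lives. The following is a minimal sketch using the standard ctypes module; the helper name exports_symbol is hypothetical and not part of the repository. Note that ctypes.CDLL really loads the library, so it also raises OSError when a dependency such as libcudart cannot be resolved, which makes it a combined symbol and linkage check.

```python
import ctypes


def exports_symbol(lib_path, symbol):
    """Return True if the shared library at `lib_path` exports `symbol`.

    ctypes.CDLL raises OSError if the library or one of its dependencies
    cannot be loaded; that error is deliberately not swallowed here.
    """
    lib = ctypes.CDLL(lib_path)
    # Attribute lookup on a CDLL object resolves the symbol lazily and
    # raises AttributeError when it is missing, so hasattr() is sufficient.
    return hasattr(lib, symbol)
```

For this project the call would look like exports_symbol("deploy/ort_custom_ops_gpu/build/libpvn3d_ort_custom_ops_gpu.so", "RegisterCustomOps"), run from the repository root.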

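7.5 Run a smoke test first

Skip the PyTorch comparison and the pose evaluation for now and verify only:

whether the .so can be loaded by ORT
whether ORT can really run the full graph with the CUDA EP

Command: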

docker exec pvn3d-dev bash -lc '
cd /workspace/workflow/self/PVN3D &&
source /opt/conda/etc/profile.d/conda.sh &&
conda activate pvn3d &&
python deploy/scripts/run_full_onnx_ort_cpp_gpu.py \
  --checkpoint weights/ape_pvn3d_best.pth.tar \
  --onnx deploy/models/onnx_ape/pvn3d_full.onnx \
  --custom-ops-lib deploy/ort_custom_ops_gpu/build/libpvn3d_ort_custom_ops_gpu.so \
  --cls ape \
  --sample-index 0 \
  --num-points 4096 \
  --height 480 \
  --width 624 \
  --crop-left 8 \
  --skip-torch-compare \
  --skip-pose-eval
'

This step succeeded, which shows:

CUDAExecutionProvider is available
The .so is loaded by ORT without problems
The 6 custom nodes are taken over by the CUDA EP
The complete pvn3d_full.onnx is executable

7.6 Then run the full validation

Command:

docker exec pvn3d-dev bash -lc '
cd /workspace/workflow/self/PVN3D &&
source /opt/conda/etc/profile.d/conda.sh &&
conda activate pvn3d &&
python deploy/scripts/run_full_onnx_ort_cpp_gpu.py \
  --checkpoint weights/ape_pvn3d_best.pth.tar \
  --onnx deploy/models/onnx_ape/pvn3d_full.onnx \
  --custom-ops-lib deploy/ort_custom_ops_gpu/build/libpvn3d_ort_custom_ops_gpu.so \
  --cls ape \
  --sample-index 0 \
  --num-points 4096 \
  --height 480 \
  --width 624 \
  --crop-left 8 \
  --output deploy/benchmarks/linemod_ape_full_onnx_ort_cpp_gpu.json
'

8. Actual Results

The final results are written to:

linemod_ape_full_onnx_ort_cpp_gpu.json

Key results (differences against the PyTorch outputs):

pred_kp_of
  max_abs = 0.014214515686035156
  mean_abs = 8.158481250575278e-06
pred_rgbd_seg
  max_abs = 0.07232666015625
  mean_abs = 0.0005051378393545747
pred_ctr_of
  max_abs = 0.001059534028172493
  mean_abs = 5.723987214878434e-06
ADD = 0.004475891590118408
ADD-S = 0.002825918374583125
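For readers who want to reproduce the max_abs / mean_abs figures, the sketch below shows one plausible way to compute them. It assumes that both numbers are taken over the element-wise absolute difference between the flattened PyTorch output and the flattened ORT output; the comparison code in run_full_onnx_ort_cpp_gpu.py is not shown in this record, so the function name diff_stats and the exact definition are assumptions.

```python
def diff_stats(reference, candidate):
    """Element-wise absolute difference statistics of two flat sequences.

    Assumption: `reference` holds the PyTorch values and `candidate`
    the ORT values, both flattened to the same length.
    """
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_diff = [abs(r - c) for r, c in zip(reference, candidate)]
    return {
        "max_abs": max(abs_diff),
        "mean_abs": sum(abs_diff) / len(abs_diff),
    }
```

A mean_abs several orders of magnitude below max_abs, as for pred_kp_of above, indicates that the largest deviation comes from a few elements rather than from a systematic offset.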

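These results show:

The GPU custom op route can execute the full graph
The outputs still deviate only slightly from PyTorch
The pose-level ADD / ADD-S results are also stable on this sample (sample index 0)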
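As a reminder of what the two pose metrics measure, here is a small pure-Python sketch based on their standard definitions: ADD averages the distance between corresponding model points under the predicted and the ground-truth pose, while ADD-S averages the distance from each predicted point to its closest ground-truth point, which makes it tolerant of object symmetries. The helper names and the (rotation, translation) pose layout are assumptions for illustration; the evaluation code used by the script may be organized differently.

```python
import math


def transform(points, rotation, translation):
    """Apply p' = R * p + t to every 3D point (R is a 3x3 nested list)."""
    return [
        tuple(
            sum(rotation[row][k] * p[k] for k in range(3)) + translation[row]
            for row in range(3)
        )
        for p in points
    ]


def add_metric(model_points, pred_pose, gt_pose):
    """ADD: mean distance between corresponding transformed points."""
    pred = transform(model_points, *pred_pose)
    gt = transform(model_points, *gt_pose)
    return sum(math.dist(a, b) for a, b in zip(pred, gt)) / len(pred)


def adds_metric(model_points, pred_pose, gt_pose):
    """ADD-S: mean distance to the closest ground-truth point."""
    pred = transform(model_points, *pred_pose)
    gt = transform(model_points, *gt_pose)
    return sum(min(math.dist(a, b) for b in gt) for a in pred) / len(pred)
```

Because every closest-point distance is at most the corresponding-point distance, ADD-S never exceeds ADD for the same pair of poses, which is consistent with the values reported above.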

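9. Problems Encountered and Fixes

9.1 Problem 1: the container has CUDA, but ORT only has the CPU provider

Symptom:

ort.get_available_providers() returns only CPUExecutionProvider

Conclusion:

Having nvcc and a GPU does not mean that ORT on the Python side can run CUDA

Fix:

Install onnxruntime-gpu==1.16.3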

9.2 Problem 2: the GPU custom op `.cc` file also needs the CUDA headers

Symptom:

On the first build the .cu files compiled, but the .cc file failed with:

fatal error: cuda_runtime.h: No such file or directory

Cause:

pvn3d_pointnet2_ops_cuda.cc also uses cudaStream_t
but CMake only exposed the CUDA include path to the CUDA compilation units

Fix:

Explicitly add the following to target_include_directories(...):
/usr/local/cuda/include

9.3 Problem 3: CMake 3.10 does not pass CUDA arch flags cleanly

Symptom:

The first version did build, but the device link stage emitted:

nvlink warning : SM Arch ('sm_52') not found ...

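Cause:

The container has cmake 3.10.2
With only target-level -gencode options, the device link stage may still pick up the default older architecture

Fix:

Put -gencode into CMAKE_CUDA_FLAGS
The warning disappeared after rebuilding

The configuration is currently fixed to:

sm_86
compute_86

This is the more robust choice for newer cards such as the RTX 4060 under the current nvcc 11.3. The RTX 4060 has compute capability 8.9, which nvcc 11.3 cannot target directly; sm_86 binaries run on it because cubins are compatible with later minor revisions of the same major architecture, and the embedded compute_86 PTX leaves a JIT fallback.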

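9.4 Problem 4: the original PVN3D CUDA kernels use `int`, but the ONNX graph uses `int64`

Symptom:

Indices in the original PointNet2 CUDA extension are mostly int
but the index nodes exported by the current ONNX symbolic functions are int64

Copying the original kernels unchanged would make the input and output types of the ORT custom ops inconsistent with the graph.

Fix:

The GPU custom op version changes all index-related tensors to int64_t

This affects:

FurthestPointSample output
BallQuery output
ThreeNN idx output
idx input of GatherPoints / GroupPoints / ThreeInterpolate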
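The byte-level consequence of this mismatch is easy to demonstrate. The sketch below is only an illustration in Python using the standard struct module; the actual fix described above was made in the CUDA kernels, which now read and write int64_t directly. It shows that a buffer written as 32-bit indices cannot simply be reinterpreted as 64-bit indices, and that each value has to be widened instead. The helper name widen_int32_indices is hypothetical.

```python
import struct


def widen_int32_indices(raw_int32):
    """Convert a little-endian int32 index buffer into an int64 buffer."""
    count = len(raw_int32) // 4
    values = struct.unpack(f"<{count}i", raw_int32)
    return struct.pack(f"<{count}q", *values)


# Four indices as an int32 kernel would write them (16 bytes).
raw = struct.pack("<4i", 0, 1, 2, 3)

# Reinterpreting the same bytes as int64 fuses neighbouring indices.
misread = struct.unpack("<2q", raw)

# Widening each element preserves the values (32 bytes).
widened = struct.unpack("<4q", widen_int32_indices(raw))
```

Here misread contains two meaningless large values, while widened holds the original four indices, which is why the index tensors had to change type on both the kernel side and the ORT type-declaration side.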

9.5 Problem 5: the GPU custom ops cannot keep using the CPU run script

Symptom:

The CPU script defaults to CPUExecutionProvider
and does not check whether CUDAExecutionProvider is actually available

Fix:

Create a separate run_full_onnx_ort_cpp_gpu.py
Change the default providers to:
["CUDAExecutionProvider", "CPUExecutionProvider"]
and explicitly verify:
"CUDAExecutionProvider" in ort.get_available_providers()
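The provider check and session setup described here can be sketched as follows. This is not the content of run_full_onnx_ort_cpp_gpu.py, which is not reproduced in this record; the function names select_providers and create_session are hypothetical. The ORT calls used (SessionOptions.register_custom_ops_library, get_available_providers, InferenceSession, get_providers) are part of the onnxruntime Python API. The extra check on session.get_providers() is an addition of this sketch: a provider can be listed as available and still fail to initialize, in which case ORT may fall back to the CPU provider.

```python
PREFERRED_PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def select_providers(available, require_cuda=True):
    """Pick providers in preferred order, failing early if CUDA is missing."""
    if require_cuda and "CUDAExecutionProvider" not in available:
        raise RuntimeError(
            "CUDAExecutionProvider is not available; got: %s" % list(available)
        )
    return [p for p in PREFERRED_PROVIDERS if p in available]


def create_session(onnx_path, custom_ops_lib):
    """Create an ORT session with the GPU custom op library registered."""
    # Imported lazily so select_providers() can be used without ORT installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.register_custom_ops_library(custom_ops_lib)
    providers = select_providers(ort.get_available_providers())
    session = ort.InferenceSession(
        onnx_path, sess_options=options, providers=providers
    )
    # Confirm the session really ended up with the CUDA EP.
    if "CUDAExecutionProvider" not in session.get_providers():
        raise RuntimeError("session fell back to: %s" % session.get_providers())
    return session
```

Keeping CPUExecutionProvider as the second entry lets the standard ONNX nodes that the CUDA EP does not cover run on the CPU, while the custom ops stay bound to CUDA through GetExecutionProviderType().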

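10. Current Scope

What has been completed this time:

A standalone GPU custom op project
CUDA kernel versions of the 6 PointNet2 custom ops
A native .so build
Loading the GPU .so from the Python side
Full graph execution under CUDAExecutionProvider
Output and pose comparison against PyTorch

What has not been done yet:

Performance optimization of the CUDA kernels
Tuning for multiple batches / dynamic shapes
ORT custom allocator / workspace pooling
Nsight-level profiling

The current implementation is positioned as:

A correctness-first GPU custom op baseline
A reference for later performance optimization and for alignment with a TensorRT plugin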

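11. Conclusion

This work has moved the ORT custom op chain from:

CPU C++ custom ops are runnable

to:

CUDA C++ custom ops are runnable
The complete pvn3d_full.onnx executes under the ORT CUDAExecutionProvider
Results can be compared reliably against native PyTorch

If the work continues, the suggested order of priority is:

Profile the 6 kernels
Identify the most time-consuming PointNet2 op
Then decide whether to keep optimizing the ORT GPU version or to go back to the TensorRT plugin route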
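A possible first step for the profiling item is ORT's built-in profiler, enabled with SessionOptions.enable_profiling = True; session.end_profiling() then returns the path of a JSON trace. The sketch below aggregates that trace per custom op. It assumes the trace is a list of events in which node events carry cat == "Node", a name ending in "_kernel_time", a duration in "dur" (microseconds) and the op type in args["op_name"]; this layout should be checked against an actual trace from ORT 1.16.3. The function names are hypothetical. These are host-side timings, so for asynchronous CUDA kernels they may understate GPU time, and Nsight remains the reference for kernel-level numbers.

```python
import json
from collections import defaultdict


def load_profile(path):
    """Load the JSON trace written by session.end_profiling()."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_profile(events, op_prefix="PVN3D_"):
    """Total duration per custom op type, slowest first.

    Assumption: node events look like
    {"cat": "Node", "name": "..._kernel_time", "dur": <us>,
     "args": {"op_name": "PVN3D_BallQuery", ...}}.
    """
    totals = defaultdict(int)
    for event in events:
        if event.get("cat") != "Node":
            continue
        if not event.get("name", "").endswith("_kernel_time"):
            continue
        op_name = event.get("args", {}).get("op_name", "")
        if op_name.startswith(op_prefix):
            totals[op_name] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The first entry of the returned list is the candidate for the "most time-consuming PointNet2 op" mentioned above, subject to the host-side timing caveat.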
