[Paper Reading] Multi-view 3D Object Reconstruction and Uncertainty Modelling with Neural Shape Prior

  • Abstract
  • 1. INTRODUCTION
  • 2. Related Work
    • 2.1. 3D Object Representations and Reconstruction
    • 2.2. Neural Implicit Representation
    • 2.3. Uncertainty Modeling in Deep Learning
  • 3. Methods
    • 3.1. Framework Overview
    • 3.2. Uncertainty-aware Neural Object Model
    • 3.3. Uncertainty-aware Image Encoder
    • 3.4. Multi-view Bayesian Fusion in Latent Space
  • 4. Experiments
    • 4.1. Implementation and Training Details
    • 4.2. Metrics and Baselines
    • 4.3. Single-view reconstruction
    • 4.4. Multi-view Reconstruction
    • 4.5. Ablation Study
    • 4.6. Evaluation on ShapeNet
    • 4.7. Uncertainty Analysis
  • 5. CONCLUSIONS


I read this paper while working on uncertainty estimation for neural implicit fields; it is similar to Bayes' Rays.

Abstract

3D object reconstruction is important for semantic scene understanding. It is challenging to reconstruct detailed 3D shapes from monocular images directly due to a lack of depth information, occlusion and noise. Most current methods generate deterministic object models without any awareness of the uncertainty of the reconstruction. We tackle this problem by leveraging a neural object representation which learns an object shape distribution from a large dataset of 3D object models and maps it into a latent space. We propose a method to model uncertainty as part of the representation and define an uncertainty-aware encoder which generates latent codes with uncertainty directly from individual input images. Further, we propose a method to propagate the uncertainty in the latent code to SDF values and generate a 3D object mesh with local uncertainty for each mesh component. Finally, we propose an incremental fusion method under a Bayesian framework to fuse the latent codes from multi-view observations. We evaluate the system on both synthetic and real datasets to demonstrate the effectiveness of uncertainty-based fusion to improve 3D object reconstruction accuracy.

1. INTRODUCTION

To couple uncertainty, we propose a framework that can propagate uncertainty from image space, to latent space, and finally to 3D object shape, as in Fig. 2. Specifically, we propose a way to teach the encoder to produce a code uncertainty that leads to the right model uncertainties from single images. Then, we propose a method to propagate the uncertainty through the decoder to the SDF and onto the mesh. We design a two-stage training strategy following the previous work [35]. First, we train the decoder to learn a latent space. Then, holding the decoder fixed, we force the encoder to output the correct code uncertainty. This strategy makes the encoder and decoder loosely coupled, and stores the uncertainty in the latent space, which can in theory generalize to different types of decoders. We summarize our contributions below:

• We propose a 3D object modeling approach that relies on an implicit neural representation and provides both a 3D object reconstruction and an uncertainty measure for each object.
• We propose an image encoder with direct uncertainty modelling to estimate latent codes with uncertainty from a single image.
• We propose an incremental fusion method that relies on Bayesian inference to fuse multi-view observations in the latent space to improve reconstruction accuracy and reduce spatial uncertainty.
• We evaluate the system on both synthetic and realistic datasets, demonstrating the benefit of fusing object models produced from different views through Bayesian inference on the encoded representation.

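Figure 2. The proposed system framework. It consists of an uncertainty-aware image encoder and a pretrained decoder. We fuse multi-view observations in the latent space under a Bayesian framework. The decoder takes the fused latent code and generates SDF values with their associated uncertainty. Finally, the Marching Cubes algorithm generates a mesh from the SDF values, with an uncertainty attached to each vertex. Throughout the paper, the color bar inside each model visualizes relative uncertainty values.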

2.1. 3D Object Representations and Reconstruction

2.2. Neural Implicit Representation

Implicit representation in 3D currently presents many open problems, such as effective neural architectures, multi-view fusion methods and uncertainty representations. As described above, this work focuses on identifying an effective fusion method and providing accurate uncertainty measures for downstream tasks.

2.3. Uncertainty Modeling in Deep Learning

Modelling uncertainty in deep learning inference has been well studied in the area of Bayesian Neural Networks [24]. Common uncertainty modelling techniques include sampling methods such as MC Dropout and Deep Ensembles, Error Propagation, and Direct Modelling [8]. MC Dropout [9] and Deep Ensembles [19] need to run the network multiple times to produce samples from which to estimate uncertainty. Direct Modelling [16] can output uncertainty from a single forward pass and is much more efficient, so we use it to estimate the uncertainty in our work. Error Propagation [33] can also be run efficiently at inference time but requires complex modification of network layers, which can affect network performance adversely, so we leave it as future work.

Direct modelling faces the problem of inaccurate and uncalibrated uncertainty in classification and regression [18]. Several methods have been proposed to evaluate the output calibration, including calibration plots and proper scoring rules [13] such as the Energy Score and Negative Log Likelihood. A recalibration method [12] has been proposed to rectify the calibration via temperature scaling. We will give a detailed analysis with proper scoring rules, and a calibration plot for our uncertainty output.

Very limited work exists on uncertainty in neural implicit representations. Most related to ours is [37], which models uncertainty in the color and density output of a scene-level neural representation. However, we concentrate on the problem of 3D object reconstruction and multi-view fusion. As far as we know, we are the first to estimate uncertainty for neural object representation from monocular images.

3. Methods

3.1. Framework Overview

The system framework is shown in Fig. 2. The inputs are monocular image sequences of an object taken from different viewpoints. For each input image, the system outputs a reconstructed 3D object shape with uncertainty. The system can fuse multi-view observations in an uncertainty-aware way to incrementally update the shape.

The system consists of an uncertainty-aware neural object representation and an uncertainty-aware Image Encoder. The neural object representation learns an object shape prior in a latent code space. It has a decoder to generate Signed Distance Function (SDF) values conditioned on each latent code. Then, the Marching Cubes algorithm [23] is used to generate a mesh from the SDF values, with uncertainty represented as an isotropic variance attached to its vertices.

The uncertainty-aware Image Encoder takes in monocular images and outputs latent codes with uncertainty. In this work, we consider a diagonal covariance matrix for all the dimensions of the latent space. When there are multiple images, the multi-view fusion module fuses each output through a Bayesian update rule to estimate both the mean and covariance of the latent code. We now proceed with a more detailed formulation of our approach.

3.2. Uncertainty-aware Neural Object Model

Building on DeepSDF [32], we propose to expand the current decoder-based neural object representation to model uncertainty. It is worth mentioning that the proposed uncertainty modelling and fusion method generalizes to other similar neural representations with limited modification.

3D object shape modelling with a neural network. A neural network $f_\theta$ can be trained as a function that maps any 3D coordinate $X = [x, y, z] \in \mathbb{R}^3$ to its SDF value $s \in \mathbb{R}$:

$$s = f_\theta(X) \tag{1}$$

where $\theta$ are the network parameters. Given a 3D grid of SDF values, the Marching Cubes algorithm can then generate a mesh. Each setting of the parameters $\theta$ therefore models one 3D shape. To model a specific class of objects, e.g. chairs or tables, we make the network conditional on a D-dimensional latent code $z \in \mathbb{R}^D$:

$$s = f_\theta(X; z) \tag{2}$$

By varying $z$, the SDF function changes, and so does the 3D reconstruction it produces. In this manner, a single decoder network can be trained to express the SDF representations of multiple semantically and geometrically similar objects, based on a latent code associated with each training object instance.
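To make the idea of a latent-conditioned SDF concrete, here is a minimal pure-Python sketch. It does not use a trained network: `toy_decoder` is a hypothetical stand-in for $f_\theta(X; z)$ that returns the exact signed distance to a sphere whose radius is a one-dimensional latent code. The names `toy_decoder`, `sdf_grid` and `count_inside` are illustrative only.

```python
import math

def toy_decoder(point, code):
    """Toy stand-in for f_theta(X; z): signed distance from `point` to a
    sphere at the origin whose radius is the 1-D latent code code[0]."""
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z) - code[0]

def sdf_grid(code, resolution=9, extent=1.0):
    """Evaluate the decoder on a regular resolution^3 grid in [-extent, extent]^3."""
    step = 2.0 * extent / (resolution - 1)
    axis = [-extent + i * step for i in range(resolution)]
    return [[[toy_decoder((x, y, z), code) for z in axis] for y in axis] for x in axis]

def count_inside(grid):
    """Number of grid points with negative SDF, i.e. points inside the surface."""
    return sum(value < 0.0 for plane in grid for row in plane for value in row)

# Changing only the latent code changes the shape the same decoder describes.
small = sdf_grid([0.3])
large = sdf_grid([0.9])
print(count_inside(small), count_inside(large))
```

A grid like the one returned by `sdf_grid` is what a Marching Cubes implementation would consume to extract the zero level set as a mesh.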

Modelling uncertainty into 3D object shape. In Eq. 2, the code $z$ is deterministic. To model uncertainty, we model the D-dimensional latent code $z$ as a random variable obeying a multivariate Gaussian distribution $z \sim \mathcal{N}_D(\mu, \Sigma)$. To simplify the problem, we assume each dimension of $z$ is independent, which leads to a diagonal covariance matrix $\Sigma$. We will train a neural network to output the mean and variance for each dimension of $z$.

We also model the SDF value at $X$ as a random variable, $s \sim \mathcal{N}(\mu_s, \sigma_s^2)$. According to Eq. 2, we can propagate the code uncertainty in $z$ to the SDF value through the decoder network. Since the neural network $f_\theta$ is nonlinear, we cannot solve for $\sigma_s^2$ directly, and must employ some form of approximation to propagate the uncertainty from the code input to the SDF output.

Uncertainty propagation through neural network. We use Monte Carlo sampling [14] to propagate the uncertainty through the nonlinear network. First, we sample $M$ codes $Z = \{z_m\}_{m=1}^{M}$ from the code distribution $z \sim \mathcal{N}_D(\mu, \Sigma)$. For a 3D point $X$, to get the variance $\sigma_s^2$ of its SDF, we pass each code $z_m \in Z$ through Eq. 2 to get $s_m$. We then calculate the sample variance [2] from the $M$ SDF values:

$$\sigma_s^2 = \frac{1}{M-1}\sum_{m=1}^{M}\left(s_m - s_\mu\right)^2 \tag{3}$$


where $s_\mu = \frac{1}{M}\sum_{m=1}^{M} s_m$ is the sample mean. We then calculate the SDF uncertainty for each of the vertices of the mesh generated using Marching Cubes. Now we can use a random variable code $z$ with mean and variance to represent a 3D object shape and its uncertainty. The remaining question is how to estimate the latent codes and the variance from input images.
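The sampling procedure described above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the authors' code: `toy_decoder` is a hypothetical stand-in for the trained decoder, and the function name, the sample count and the seed are made up for the example.

```python
import math
import random
import statistics

def toy_decoder(point, code):
    """Hypothetical stand-in for f_theta(X; z): sphere SDF with radius code[0]."""
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z) - code[0]

def propagate_sdf_uncertainty(point, mean, var, decoder, num_samples=1000, seed=0):
    """Monte Carlo estimate of the SDF mean and variance at one 3D point.

    `mean` and `var` hold the per-dimension mean and variance of the latent
    code, i.e. the diagonal Gaussian z ~ N(mean, diag(var))."""
    rng = random.Random(seed)
    std = [math.sqrt(v) for v in var]
    sdf_samples = []
    for _ in range(num_samples):
        # Draw one code; dimensions are sampled independently (diagonal covariance).
        code = [rng.gauss(m, s) for m, s in zip(mean, std)]
        sdf_samples.append(decoder(point, code))
    # statistics.variance is the sample variance with the 1 / (M - 1) factor.
    return statistics.fmean(sdf_samples), statistics.variance(sdf_samples)

sdf_mean, sdf_var = propagate_sdf_uncertainty((1.0, 0.0, 0.0), [0.5], [0.04], toy_decoder)
print(sdf_mean, sdf_var)
```

For this toy decoder the SDF is linear in the code, so the estimated SDF variance should land close to the code variance. With a real nonlinear decoder no such closed form exists, which is why sampling is used.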

3.3. Uncertainty-aware Image Encoder

We propose training a simple encoder network $f_\beta$ to map an RGB image $m \in \mathbb{R}^{H \times W \times 3}$ with height $H$ and width $W$ to a D-dimensional latent code $z$ with mean $\mu \in \mathbb{R}^D$ and covariance $\Sigma \in \mathbb{R}^{D \times D}$. Since we assume each code dimension is independent, the covariance matrix is diagonal and can be represented as $\Sigma = \mathrm{diag}(\sigma^2)$, where $\sigma \in \mathbb{R}^D$.

We use the direct modelling [16] approach to output uncertainty, which is well established and does not add computational complexity. We leave the comparison with other uncertainty modelling methods as future work. The Encoder consists of a feature backbone, ResNet-50, and an output layer for the mean and variance. The architecture is straightforward, and we concentrate on the choice of proper losses [13] to generate calibrated and accurate uncertainty. We consider two common losses, the Negative Log-Likelihood loss (NLL) and the Energy Score. We conduct extensive experiments to explore their effectiveness compared with the baseline model trained without uncertainty. We briefly introduce the two losses below. Their advantages and applications in object detection have been discussed in [13].

NLL loss.
The NLL loss can be viewed as a standard L2 loss weighted by uncertainty. Considering a batch of outputs $\{(\mu_i, \sigma_i)\}_{i=1}^{N}$ directly from the encoder with $N$ data samples, and the ground-truth codes $\{z_i\}_{i=1}^{N}$, NLL can be written as:

where $\Sigma_i = \mathrm{diag}(\sigma_i^2) \in \mathbb{R}^{D \times D}$ and $\sigma_i \in \mathbb{R}^{D}$. The first term pushes down the error, where the variance $\Sigma_i$ acts to reduce the weight of samples in high-uncertainty areas. The second term is a regularizer that keeps the uncertainty from growing too large.
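The NLL equation itself did not survive in this copy. A standard Gaussian negative log-likelihood that matches the two-term description above, with constants dropped, is

$$\mathrm{NLL} = \frac{1}{2N}\sum_{i=1}^{N}\Big[(z_i-\mu_i)^\top \Sigma_i^{-1}(z_i-\mu_i) + \log\det\Sigma_i\Big]$$

The exact scaling used by the authors is an assumption here. A minimal pure-Python sketch for the diagonal case follows; `gaussian_nll` and its argument names are illustrative, not taken from the paper.

```python
import math

def gaussian_nll(means, variances, targets):
    """Diagonal-Gaussian NLL averaged over a batch, constants dropped.

    Each argument is a list of N codes, each code a list of D floats.
    With a diagonal covariance, the Mahalanobis term is a sum of squared
    errors divided by variances, and log det is a sum of log variances."""
    total = 0.0
    for mu, var, z in zip(means, variances, targets):
        for m, v, t in zip(mu, var, z):
            total += (t - m) ** 2 / v + math.log(v)
    return total / (2 * len(means))

# A large error is penalized less when the predicted variance is larger...
print(gaussian_nll([[0.0]], [[1.0]], [[2.0]]), gaussian_nll([[0.0]], [[4.0]], [[2.0]]))
# ...but with no error, a larger variance costs more (the log term).
print(gaussian_nll([[0.0]], [[1.0]], [[0.0]]), gaussian_nll([[0.0]], [[4.0]], [[0.0]]))
```

In practice the variance is often predicted in log space for numerical stability; whether the paper does so is not stated in this section.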

Energy Score.
The Energy Score (ES) can be generalized to any distribution. It concentrates on optimizing the result of high-uncertainty data samples to improve performance during training. For computational efficiency, we use a Monte Carlo approximation [13], which is written as:

where $z_{i,m}$ is the $m^{th}$ i.i.d. sample from $\mathcal{N}(\mu_i, \Sigma_i)$. We take $M = 1000$, which adds very little computational overhead.
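The Energy Score equation is also missing from this copy. The Monte Carlo approximation commonly used with [13] has the form

$$\mathrm{ES} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{M}\sum_{m=1}^{M}\lVert z_{i,m}-z_i\rVert - \frac{1}{2(M-1)}\sum_{m=1}^{M-1}\lVert z_{i,m}-z_{i,m+1}\rVert\right)$$

Whether the paper uses exactly this estimator is an assumption. The pure-Python sketch below implements that form for diagonal Gaussians; `energy_score`, its argument names and the seed are illustrative.

```python
import math
import random

def energy_score(means, variances, targets, num_samples=1000, seed=0):
    """Monte Carlo Energy Score for diagonal-Gaussian predictions (lower is better).

    Each argument is a list of N codes, each code a list of D floats."""
    rng = random.Random(seed)
    total = 0.0
    for mu, var, z in zip(means, variances, targets):
        std = [math.sqrt(v) for v in var]
        samples = [[rng.gauss(m, s) for m, s in zip(mu, std)]
                   for _ in range(num_samples)]
        # First term: mean distance between samples and the ground-truth code.
        to_target = sum(math.dist(s, z) for s in samples) / num_samples
        # Second term: half the mean distance between consecutive samples.
        between = sum(math.dist(samples[m], samples[m + 1])
                      for m in range(num_samples - 1)) / (2 * (num_samples - 1))
        total += to_target - between
    return total / len(means)

centered = energy_score([[0.0, 0.0]], [[1.0, 1.0]], [[0.0, 0.0]])
shifted = energy_score([[5.0, 5.0]], [[1.0, 1.0]], [[0.0, 0.0]])
print(centered, shifted)
```

A prediction centered on the ground truth scores lower than one shifted away from it, which is the behavior a proper scoring rule should show.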

3.4. Multi-view Bayesian Fusion in Latent Space

Bayesian Fusion. As we assume that each dimension of the D-dimensional latent codes follows an independent Gaussian distribution, we fuse each code dimension independently. According to Gaussian inference [1], given the same dimension from two separate codes with means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$, the updated mean $\mu$ and variance $\sigma^2$ of the code dimension can be written as:

$$\mu = \frac{\sigma_2^2\,\mu_1 + \sigma_1^2\,\mu_2}{\sigma_1^2 + \sigma_2^2}, \qquad \sigma^2 = \frac{\sigma_1^2\,\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$
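The per-dimension update is the standard product of two Gaussians: each mean is weighted by the other observation's variance, and the fused variance is never larger than either input. A minimal pure-Python sketch of incremental fusion follows; the function names and the `(means, variances)` tuple layout are illustrative choices, not from the paper.

```python
def fuse_gaussian(mu1, var1, mu2, var2):
    """Fuse two 1-D Gaussian estimates of the same quantity."""
    fused_var = var1 * var2 / (var1 + var2)
    fused_mu = (var2 * mu1 + var1 * mu2) / (var1 + var2)
    return fused_mu, fused_var

def fuse_codes(code_a, code_b):
    """Fuse two latent codes dimension by dimension.

    Each code is a (means, variances) pair of equal-length lists."""
    means_a, vars_a = code_a
    means_b, vars_b = code_b
    fused = [fuse_gaussian(ma, va, mb, vb)
             for ma, va, mb, vb in zip(means_a, vars_a, means_b, vars_b)]
    return [m for m, _ in fused], [v for _, v in fused]

def fuse_incrementally(observations):
    """Fold a sequence of codes into one estimate, one view at a time."""
    current = observations[0]
    for observation in observations[1:]:
        current = fuse_codes(current, observation)
    return current

views = [([0.0], [1.0]), ([3.0], [1.0]), ([6.0], [1.0])]
print(fuse_incrementally(views))
```

With equal variances the fused mean reduces to the plain average, so the uncertainty-aware update only differs from averaging when the views disagree about how reliable they are.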

Outlier rejection.
When facing extreme situations such as highly occluded objects, experimentation revealed that performance improves by treating such observations as outliers and filtering them out of the fusion process, instead of incorporating them with high uncertainty. We define a modified inference strategy, "Bayesian-N", which only selects the N observations with the lowest uncertainty for Bayesian fusion. When N = 1, we simply select the lowest-uncertainty viewpoint. When N = Nmax, we use all available measurements without rejection, and refer to this as "Bayesian", omitting N.
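The selection step can be sketched as follows. The ranking criterion used here, the trace of the diagonal covariance (the sum of the per-dimension variances), is the one mentioned in Section 4.4; the function name and data layout are illustrative. The fusion is written in its batch precision-weighted form, which gives the same result as applying the two-Gaussian update one view at a time.

```python
def bayesian_n_fusion(observations, n=None):
    """Fuse the n lowest-uncertainty observations; n=None keeps all of them.

    Each observation is a (means, variances) pair of equal-length lists."""
    # Rank by the trace of the diagonal covariance, lowest uncertainty first.
    ranked = sorted(observations, key=lambda obs: sum(obs[1]))
    kept = ranked if n is None else ranked[:n]
    dims = len(kept[0][0])
    fused_mean, fused_var = [], []
    for d in range(dims):
        # Precision (inverse variance) adds up across independent observations.
        precision = sum(1.0 / var[d] for _, var in kept)
        weighted = sum(mu[d] / var[d] for mu, var in kept)
        fused_var.append(1.0 / precision)
        fused_mean.append(weighted / precision)
    return fused_mean, fused_var

views = [([0.0], [1.0]), ([2.0], [1.0]), ([100.0], [50.0])]
print(bayesian_n_fusion(views, n=2))   # the high-variance view is rejected
print(bayesian_n_fusion(views))        # all views, the outlier still pulls the mean
```

Even with a large variance, the rejected view would shift the fused mean, which illustrates why dropping it outright can help.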

4. Experiments

4.1. Implementation and Training Details

Our system consists of an encoder and a decoder. For the decoder, we follow the implementation and training of DeepSDF [32] on ShapeNet [3]. For the encoder, we use ResNet-50 pretrained on ImageNet as the feature backbone, modify the output layer to have N dimensions for the code, and add a further K dimensions for the uncertainty. We take N = K = 64 in the experiments.

We need monocular images and ground-truth latent codes to train the encoder. We use the images from the ShapeNetRendering dataset [5], which contains rendered images of 24 different views of the CAD models in ShapeNet [3]. After training the decoder, we get optimized latent codes for each CAD model, and we use them as the ground-truth latent codes for the training and evaluation of the Encoder.

For training the Encoder, we use the same dataset split as FroDO [35]. We augment the training data with random resizing, horizontal flips, and random backgrounds clipped from the SUN dataset [49]. We set a learning rate of 0.1, a batch size of 64, and a random seed of 1000. We use a polynomial learning rate scheduler and train for 50 epochs.
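For readers unfamiliar with polynomial decay, a common form of the schedule is $\mathrm{lr}(t) = \mathrm{lr}_0 \,(1 - t/T)^{p}$. The base rate 0.1 and the 50 epochs come from the text; the exponent `power=0.9` and the per-epoch (rather than per-iteration) update are assumptions, since this section does not state them.

```python
def polynomial_lr(epoch, base_lr=0.1, total_epochs=50, power=0.9):
    """Polynomial decay from base_lr at epoch 0 to 0 at total_epochs.

    `power` is a hypothetical value chosen for illustration."""
    progress = min(epoch, total_epochs) / total_epochs
    return base_lr * (1.0 - progress) ** power

for epoch in (0, 10, 25, 49, 50):
    print(epoch, polynomial_lr(epoch))
```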

To verify that our model can generalize to different categories, we test on both the chair and table categories of the ShapeNet dataset. To verify the performance in real scenarios, after training on the synthetic ShapeNet dataset, we directly evaluate on the Pix3D dataset without fine-tuning.

4.2. Metrics and Baselines

Metrics.
For the reconstruction, we calculate Intersection over Union (IoU), Chamfer Distance (CD), and Earth Mover's Distance (EMD) on the voxelized mesh with a resolution of $32^3$, following [43]. For the uncertainty, we use Negative Log Likelihood (NLL) and Energy Score (ES), which evaluate the error of the regression as well as the calibration and sharpness of the estimated uncertainty.
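As a reminder of how IoU is computed on voxel grids, here is a small pure-Python sketch that represents each voxelized shape as a set of occupied `(i, j, k)` indices. The helper `box_voxels` and the convention of returning 1.0 for two empty grids are illustrative assumptions; the evaluation protocol of [43] may differ in its details.

```python
def box_voxels(lo, hi):
    """Occupied voxel indices of an axis-aligned cube spanning [lo, hi) per axis."""
    return {(i, j, k)
            for i in range(lo, hi)
            for j in range(lo, hi)
            for k in range(lo, hi)}

def voxel_iou(occupied_a, occupied_b):
    """Intersection over Union of two sets of occupied voxel indices."""
    union = occupied_a | occupied_b
    if not union:
        return 1.0  # convention chosen here: two empty grids count as identical
    return len(occupied_a & occupied_b) / len(union)

# Two 4x4x4 cubes inside a 32^3 grid, offset by 2 voxels along every axis.
prediction = box_voxels(0, 4)
ground_truth = box_voxels(2, 6)
print(voxel_iou(prediction, ground_truth))
```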

Baselines. We train our model with the Energy Score and denote it as Ours. We also compare the two training losses, Energy Score and NLL, in the ablation study. FroDO [35] is the baseline closest to ours, with an encoder trained with an L2 loss and a DeepSDF decoder for reconstruction, but without uncertainty. It fuses multiple latent codes by averaging them to get the final reconstruction. It also supports pose estimation and joint optimization of shape and pose. Since pose estimation is outside the scope of this paper, we compare against its encoder part to investigate the effectiveness of uncertainty. Note that the Pix3D results in the original paper do not use the pose module, so it is a fair comparison. Since FroDO is not open source, we implemented it ourselves to fully investigate its performance, and denote our implementation as FroDO*. We compare our implementation with the originally published results on the Pix3D dataset. We also compare our models with other published state-of-the-art models for reconstruction accuracy on Pix3D.

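Figure 5. Qualitative results of multi-view fusion on the Pix3D dataset. After fusing 1 to 10 corrupted observation images, Bayesian fusion reduces the shape uncertainty and yields a more accurate final reconstruction than the baseline FroDO*. Note that our method can fuse observations without knowing the camera poses, and works even when the objects have different textures.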

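Table 2. Multi-view fusion IoU on the chair category of the Pix3D-MV dataset. Ours with uncertainty reaches a higher IoU than the deterministic baseline. % denotes the relative improvement over the single-view/multi-view baseline.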

4.3. Single-view reconstruction

4.4. Multi-view Reconstruction

The Pix3D dataset contains real images and ground-truth CAD models but has no splits for instances and their multi-view observations. To evaluate the multi-view performance, we group the images of the chair category into separate instances according to their GT models, and keep 10 views per instance, which results in a multi-view dataset with a total of 1,490 images from 149 instances. We denote this multi-view dataset as Pix3D-MV; it is a subset of the original Pix3D dataset. We show the results of multi-view fusion on the Pix3D-MV chair set in Fig. 5 and Table 2. We consider the following multi-view fusion methods: Average fuses the estimated latent codes with equal weight; Bayesian fuses with uncertainty according to Bayesian Fusion in Equation 7; Bayesian-K (called Bayesian-N in Section 3.4) keeps the top-K observations with the lowest uncertainty, evaluated by taking the trace of the covariance matrix, and then fuses them with Bayesian.

Compared with the deterministic baseline, Ours with uncertainty achieves an IoU of 0.3816 vs. 0.3456, a relative margin of 10.4%. When using Bayesian-4 to filter the outliers and keep the 4 lowest-uncertainty observations, Ours further improves to an IoU of 0.3902, a margin of 12.9% over the baseline. The experiment demonstrates that uncertainty can effectively help the system trust the observations that contain more valid information, and makes the system more robust to outliers in the multi-view observations.
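The quoted margins are relative improvements over the baseline IoU, not absolute differences. A quick check of the arithmetic, using only the numbers quoted above (the function name is illustrative):

```python
def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# Bayesian fusion vs. the deterministic baseline.
print(round(relative_improvement(0.3816, 0.3456), 1))
# Bayesian-4 vs. the deterministic baseline.
print(round(relative_improvement(0.3902, 0.3456), 1))
```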

We further test the limits of the robustness that uncertainty brings to multi-view fusion in Table 3. In real applications such as robotics, the input images are often heavily corrupted by occlusions, segmentation errors or sensor noise. We simulate challenging situations by randomly cropping images to a size in the range [c, 1.0], so that only part of the original image is kept. By changing the value of the min scale, c, we vary the difficulty of the experiments. As is visible in Table 3, when the min scale decreases, the task becomes more difficult. In the most difficult task, where images can be cropped to only 10% of the original images, the deterministic FroDO model suffers noticeably from the occlusions and drops to an IoU of only 0.318, while Ours w/ uncertainty remains robust to the cropping and maintains an IoU of 0.366, an improvement of 15.1%. The experiments show the effectiveness of using uncertainty in multi-view fusion to select valid information from a group of corrupted input images.
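The corruption can be sketched as a random crop whose size is drawn from [c, 1.0]. The text does not say whether the scale refers to the image area or to its side length, nor whether the aspect ratio is kept; the sketch below assumes an area fraction with a preserved aspect ratio, and all names are illustrative.

```python
import math
import random

def random_crop_box(height, width, min_scale, rng):
    """Sample a crop (top, left, crop_h, crop_w) keeping a random area fraction.

    Assumption: `scale` is the fraction of the image AREA that is kept."""
    scale = rng.uniform(min_scale, 1.0)
    side = math.sqrt(scale)  # per-axis factor that yields the desired area fraction
    crop_h = max(1, round(height * side))
    crop_w = max(1, round(width * side))
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return top, left, crop_h, crop_w

def crop(image, box):
    """Crop an image stored as a list of rows."""
    top, left, crop_h, crop_w = box
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

rng = random.Random(0)
image = [[(r, c) for c in range(48)] for r in range(64)]
patch = crop(image, random_crop_box(64, 48, 0.1, rng))
print(len(patch), len(patch[0]))
```

Lowering `min_scale` allows smaller crops, which matches the statement that the task gets harder as the min scale decreases.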

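Table 3. Multi-view reconstruction (IoU) on the Pix3D-MV chair set when the input images are cropped to a region whose scale is randomly chosen in [Min Scale, 1.0]. As the min scale decreases, the fusion task becomes more difficult. Ours with uncertainty is more robust on the difficult tasks.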

4.5. Ablation Study

Loss function. We compare two options for the uncertainty training loss, NLL (NLL) and ES (Ours), in Table 3. Even though training with NLL can improve the performance in difficult tasks, it generally gives lower accuracy than training with ES.

Selection of K in Bayesian. With uncertainty, we can detect outliers and actively deal with them. As shown in Table 2, when decreasing K, the performance increases since the outlier codes are filtered out. The highest IoU of 0.39 is achieved with K = 4, an improvement of 12.9% over the baseline. When further decreasing K, the system has too few observations to fuse and the performance begins to drop. An interesting finding is that, with only the single best view, we get better performance than the equal-weight Average fusion of FroDO. In real applications, we have the option of adjusting the parameter K to better suit the data.

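Table 4. Single-view and multi-view reconstruction results for chairs and tables on the ShapeNet dataset. As the multi-view fusion task becomes harder, our method shows a clear improvement, demonstrating the robustness to corruption that uncertainty brings.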

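Figure 4. Qualitative single-view reconstruction results for tables on the ShapeNet dataset. All models are trained and tested on ShapeNet. Compared with the baseline FroDO*, our output additionally carries uncertainty and shows fewer artifacts.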

4.6. Evaluation on ShapeNet

Our network architecture, including the uncertainty framework, is not specifically designed for any category. If training data is available, we can support new categories. We show more experimental results on the Chairs and Tables categories of the ShapeNet dataset in Table 4. We also show the results for Tables in Fig. 4. We train each category separately. During inference, we consider two settings: Easy, which takes the original rendered images in ShapeNet, and Mid/Hard, which randomly crops the images into a range of [c, 1.0] (c = 0.1 for Mid and c = 0.01 for Hard). For the multi-view fusion method, we use Bayesian Fusion for Ours and Average for FroDO*. In the Easy task, we get higher IoU on Tables and comparable but slightly lower IoU on Chairs, since our model, with the same architecture, spends part of its capacity on regressing uncertainty. It is notable that uncertainty is not very helpful in easy tasks where each input image contains enough information for reconstruction. In the Mid and Hard tasks, the effectiveness of uncertainty becomes more pronounced, as Ours gets higher multi-view accuracy than the baseline. On the Hard task in particular, we get an IoU of 0.385 vs. 0.359 on Chairs, and 0.378 vs. 0.350 on Tables. This result demonstrates that uncertainty can robustly find and fuse the valid data from a set of inputs of varying quality.

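Figure 6. Calibration plots on the Pix3D dataset. The plots for the uncertainty in the latent space and in the SDF space are shown separately. We also show a baseline whose output uncertainty is fixed to 1. Our method is better calibrated than the baseline (closer to the line Y = X).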

4.7. Uncertainty Analysis

We further design experiments to quantitatively evaluate whether the estimated uncertainty values are well calibrated. We present the calibration plot on the Pix3D dataset in Figure 6, where the X axis shows the predicted probability from zero to one, and the Y axis shows the observed frequency of the variable. A well-calibrated plot will be close to the line Y = X. In our case, we have two uncertainty sources: the one in the latent space, output directly by the Encoder, and the one in the 3D SDF space, propagated from the latent space through the decoder with Monte Carlo sampling. We use a baseline with an equal variance of 1 for each code dimension. We can see that it is non-trivial to output a well-calibrated uncertainty for both the latent code and the 3D SDF values. Our trained model outputs uncertainty that is significantly better calibrated than the baseline. At the same time, we still see an opportunity for further improvement, e.g., with temperature scaling [12]. We hope our approach can serve as a benchmark for future research on better uncertainty for 3D neural shape models. We show more analysis in the supplementary materials.
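One common way to build such a calibration curve for Gaussian predictions is to check, for each confidence level p, how often the ground truth falls inside the central p-interval of the predicted distribution. The sketch below follows that recipe with the standard library only; it is not necessarily the exact procedure behind Figure 6, and the function name and synthetic data are illustrative.

```python
import random
from statistics import NormalDist

def calibration_curve(means, stds, targets, levels):
    """Return (predicted probability, observed frequency) pairs.

    For each level p in (0, 1), count how often the target lies inside the
    central p-interval mean +/- k * std of the predicted Gaussian."""
    unit = NormalDist()
    curve = []
    for p in levels:
        half_width = unit.inv_cdf(0.5 + p / 2.0)  # interval half-width in std units
        hits = sum(abs(t - m) <= half_width * s
                   for m, s, t in zip(means, stds, targets))
        curve.append((p, hits / len(targets)))
    return curve

# Synthetic check: targets really are N(0, 1) around the predicted mean.
rng = random.Random(0)
targets = [rng.gauss(0.0, 1.0) for _ in range(5000)]
means = [0.0] * len(targets)
levels = [0.1 * k for k in range(1, 10)]
well_calibrated = calibration_curve(means, [1.0] * len(targets), targets, levels)
overconfident = calibration_curve(means, [0.5] * len(targets), targets, levels)
print(well_calibrated[4], overconfident[4])
```

A well-calibrated predictor gives points near the line Y = X, while an overconfident one (predicted standard deviation too small) falls below it.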

5. CONCLUSIONS

We propose an uncertainty-aware 3D object reconstruction framework that can take in both monocular and multi-view images. Based on neural shape models, we introduce a method to model and estimate uncertainty in latent space and a method to propagate uncertainty into 3D object space, so that we can output 3D object shapes with uncertainty awareness. Our proposed method can be trained on a purely synthetic dataset and directly evaluated on real datasets. It achieves higher reconstruction performance than deterministic models, and in particular demonstrates better robustness and accuracy in multi-view fusion when the input image sequences are corrupted.

In future work, we plan to scale up to multi-class object reconstruction and uncertainty estimation. It will also be interesting to leverage the uncertainty-aware shape model for downstream object-related tasks such as detection, segmentation, tracking, and object-level SLAM.
