A Quick Look at ICCV 2023 Papers on Monocular Cameras

Paper1 Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition

Original abstract: From video, we reconstruct a neural volume that captures time-varying color, density, scene flow, semantics, and attention information. The semantics and attention let us identify salient foreground objects separately from the background across spacetime. To mitigate low resolution semantic and attention features, we compute pyramids that trade detail with whole-image context. After optimization, we perform a saliency-aware clustering to decompose the scene. To evaluate real-world scenes, we annotate object masks in the NVIDIA Dynamic Scene and DyCheck datasets. We demonstrate that this method can decompose dynamic scenes in an unsupervised way with competitive performance to a supervised method, and that it improves foreground/background segmentation over recent static/dynamic split methods. Project webpage: https://visual.cs.brown.edu/saff

Summary: From video, the method reconstructs a neural volume that captures time-varying color, density, scene flow, semantics, and attention. The semantic and attention signals make it possible to identify salient foreground objects separately from the background across spacetime. To mitigate the low resolution of the semantic and attention features, the authors compute pyramids that trade detail against whole-image context. After optimization, a saliency-aware clustering decomposes the scene. To evaluate on real-world scenes, they annotate object masks for the NVIDIA Dynamic Scene and DyCheck datasets. The method decomposes dynamic scenes in an unsupervised way with performance competitive with a supervised method, and it improves foreground/background segmentation over recent static/dynamic split approaches. Project page: https://visual.cs.brown.edu/saff

Paper2 NDDepth: Normal-Distance Assisted Monocular Depth Estimation

Original abstract: not provided.

Paper3 Delving into Motion-Aware Matching for Monocular 3D Object Tracking

Summary: Recent advances in monocular 3D object detection have enabled 3D multi-object tracking based on low-cost camera sensors. This paper observes that the motion cues of objects across time frames are critical for 3D multi-object tracking, yet are rarely explored in existing monocular-based methods. The authors propose MoMA-M3T, a framework built around three motion-aware components. First, an object's possible movement relative to all object tracklets in the feature space is represented as its motion features. Then, historical object tracklets are further modeled from a spatio-temporal perspective with a motion transformer. Finally, a motion-aware matching module associates historical object tracklets with current observations to form the final tracking results. Extensive experiments on the nuScenes and KITTI datasets show that MoMA-M3T performs competitively with state-of-the-art methods. Moreover, the proposed tracker is flexible and can easily be plugged into existing image-based 3D object detectors without re-training. Code and models are available at https://github.com/kuanchihhuang/MoMA-M3T.

Paper4 LivePose: Online 3D Reconstruction from Monocular Video with Dynamic Camera Poses

Original abstract: Dense 3D reconstruction from RGB images traditionally assumes static camera pose estimates. This assumption has endured, even as recent works have increasingly focused on real-time methods for mobile devices. However, the assumption of a fixed pose for each image does not hold for online execution: poses from real-time SLAM are dynamic and may be updated following events such as bundle adjustment and loop closure. This has been addressed in the RGB-D setting, by de-integrating past views and re-integrating them with updated poses, but it remains largely untreated in the RGB-only setting. We formalize this problem to define the new task of dense online reconstruction from dynamically-posed images. To support further research, we introduce a dataset called LivePose containing the dynamic poses from a SLAM system running on ScanNet. We select three recent reconstruction systems and apply a framework based on de-integration to adapt each one to the dynamic-pose setting. In addition, we propose a novel, non-linear de-integration module that learns to remove stale scene content. We show that responding to pose updates is critical for high-quality reconstruction, and that our de-integration framework is an effective solution.

Summary: Dense 3D reconstruction from RGB images has traditionally assumed static camera pose estimates, an assumption that has persisted even as recent work increasingly targets real-time methods for mobile devices. For online execution, however, a fixed pose per image does not hold: poses from real-time SLAM are dynamic and may be updated after events such as bundle adjustment and loop closure. This has been addressed in the RGB-D setting by de-integrating past views and re-integrating them with updated poses, but it remains largely untreated in the RGB-only setting. To support further research, the authors introduce LivePose, a dataset containing the dynamic poses produced by a SLAM system run on ScanNet. They select three recent reconstruction systems and apply a de-integration-based framework to adapt each to the dynamic-pose setting, and additionally propose a novel non-linear de-integration module that learns to remove stale scene content. They show that responding to pose updates is critical for high-quality reconstruction and that the de-integration framework is an effective solution.
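A rough illustration of the de-/re-integration idea the paper builds on follows; this is a minimal sketch of the general mechanism, not the paper's learned non-linear module, and the per-frame observation/weight volumes are assumed to have been produced by some projection step (e.g., a TSDF back-projection) that is omitted here.

```python
import numpy as np

class FusionVolume:
    """Minimal sketch of pose de-/re-integration for online fusion.

    Each frame's contribution is accumulated into a fused volume as a weighted
    observation; when SLAM later revises that frame's pose, the stale
    contribution is subtracted (de-integrated) and the frame is integrated
    again under the corrected pose.
    """
    def __init__(self, shape):
        self.value = np.zeros(shape)   # accumulated weighted observations
        self.weight = np.zeros(shape)  # accumulation weights

    def integrate(self, obs, w):
        # obs/w: the frame's observation projected into the volume under some pose.
        self.value += w * obs
        self.weight += w

    def deintegrate(self, obs, w):
        # Exactly undoes a previous integrate() made with the same obs/w.
        self.value -= w * obs
        self.weight -= w

    def fused(self):
        return self.value / np.maximum(self.weight, 1e-8)

# Responding to a pose update for one frame:
vol = FusionVolume((32, 32, 32))
obs_old, w = np.random.rand(32, 32, 32), np.ones((32, 32, 32))
vol.integrate(obs_old, w)             # first integration, initial pose
obs_new = np.random.rand(32, 32, 32)  # same frame re-projected under the updated pose
vol.deintegrate(obs_old, w)           # remove the stale contribution
vol.integrate(obs_new, w)             # re-integrate with the corrected pose
```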

Paper5 Towards Zero-Shot Scale-Aware Monocular Depth Estimation

Original abstract: Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models will be geometry-specific, with learned scales that cannot be directly transferred across domains. Because of that, recent works focus instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages, via a variational latent representation that is conditioned on single frame information. We evaluated ZeroDepth targeting both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, and achieved a new state-of-the-art in both settings using the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates.

Summary: Monocular depth estimation is scale-ambiguous and therefore requires scale supervision to produce metric predictions; even then, the resulting models are geometry-specific, with learned scales that cannot be transferred directly across domains. Recent work has therefore focused on relative depth, giving up scale in favor of better up-to-scale zero-shot transfer. This work introduces ZeroDepth, a monocular depth estimation framework that predicts metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) input-level geometric embeddings that let the network learn a scale prior over objects, and (ii) decoupling the encoder and decoder stages via a variational latent representation conditioned on single-frame information. Evaluated on outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, ZeroDepth sets a new state of the art in both settings with a single pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates.
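As a minimal sketch of the kind of input-level geometric embedding such a model can consume, the snippet below computes per-pixel viewing rays from the camera intrinsics; the paper's exact embedding may differ, and the intrinsics values in the usage line are only illustrative.

```python
import numpy as np

def ray_embedding(H, W, K):
    """Per-pixel unit viewing directions derived from camera intrinsics.

    Exposing the camera geometry at the input level is what lets a network
    reason about metric scale across different cameras.
    """
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x (H*W)
    rays = np.linalg.inv(K) @ pix                                      # back-project pixels
    rays = rays / np.linalg.norm(rays, axis=0, keepdims=True)          # normalize to unit length
    return rays.T.reshape(H, W, 3)

# Usage: concatenate with the RGB image along the channel axis (KITTI-like intrinsics).
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
emb = ray_embedding(375, 1242, K)   # shape (375, 1242, 3)
```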

Paper6 MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

Original abstract: Monocular 3D object detection has long been a challenging task in autonomous driving. Most existing methods follow conventional 2D detectors to first localize object centers, and then predict 3D attributes by neighboring features. However, only using local visual features is insufficient to understand the scene-level 3D spatial structures and ignores the long-range inter-object depth relations. In this paper, we introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR. We modify the vanilla transformer to be depth-aware and guide the whole detection process by contextual depth cues. Specifically, concurrent to the visual encoder that captures object appearances, we introduce to predict a foreground depth map, and specialize a depth encoder to extract non-local depth embeddings. Then, we formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions. In this way, each object query estimates its 3D attributes adaptively from the depth-guided regions on the image and is no longer constrained to local visual features. On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations. Besides, our depth-guided modules can also be plug-and-play to enhance multi-view 3D object detectors on nuScenes dataset, demonstrating our superior generalization capacity. Code is available at https://github.com/ZrrSkywalker/MonoDETR.

Summary: Monocular 3D object detection has long been a challenging task in autonomous driving. Most existing methods follow conventional 2D detectors: first localize object centers, then predict 3D attributes from neighboring features. However, local visual features alone are insufficient to understand scene-level 3D spatial structure and ignore long-range inter-object depth relations. This paper introduces MonoDETR, the first DETR framework for monocular detection with a depth-guided transformer. The vanilla transformer is made depth-aware so that contextual depth cues guide the whole detection process: alongside the visual encoder that captures object appearance, a foreground depth map is predicted and a dedicated depth encoder extracts non-local depth embeddings. 3D object candidates are then formulated as learnable queries, and a depth-guided decoder conducts object-scene depth interactions, so each object query estimates its 3D attributes adaptively from depth-guided regions of the image instead of being constrained to local visual features. On the KITTI benchmark with monocular input, MonoDETR achieves state-of-the-art performance without extra dense depth annotations, and its depth-guided modules can be plugged into multi-view 3D detectors on nuScenes, demonstrating strong generalization. Code is available at https://github.com/ZrrSkywalker/MonoDETR.
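A minimal PyTorch sketch of what a depth-guided decoder layer in this spirit could look like is given below; the module names, ordering, and sizes are assumptions for illustration and not the released MonoDETR implementation.

```python
import torch
import torch.nn as nn

class DepthGuidedDecoderLayer(nn.Module):
    """Object queries attend first to non-local depth embeddings, then to visual features."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.depth_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, queries, depth_feats, visual_feats):
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.depth_attn(q, depth_feats, depth_feats)[0])     # contextual depth cues
        q = self.norms[2](q + self.visual_attn(q, visual_feats, visual_feats)[0])  # object appearance
        return self.norms[3](q + self.ffn(q))

# Usage: 50 learnable 3D object queries attend over flattened depth/visual feature maps.
layer = DepthGuidedDecoderLayer()
queries = torch.randn(2, 50, 256)
out = layer(queries, torch.randn(2, 1200, 256), torch.randn(2, 1200, 256))
```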

Paper7 GEDepth: Ground Embedding for Monocular Depth Estimation

Original abstract: Monocular depth estimation is an ill-posed problem as the same 2D image can be projected from infinite 3D scenes. Although the leading algorithms in this field have reported significant improvement, they are essentially geared to the particular compound of pictorial observations and camera parameters (i.e., intrinsics and extrinsics), strongly limiting their generalizability in real-world scenarios. In order to cope with this difficulty, this paper proposes a novel ground embedding module to decouple camera parameters from pictorial cues, thus promoting the generalization capability. Given camera parameters, our module generates the ground depth, which is stacked with the input image and referenced in the final depth prediction. A ground attention is designed in the module to optimally combine the ground depth with the residual depth. The proposed ground embedding is highly flexible and lightweight, leading to a plug-in module that is amenable to be integrated into various depth estimation networks. Experiments reveal that our approach achieves the state-of-the-art results on popular benchmarks, and more importantly, renders significant improvement on the cross-domain generalization.

Summary: Monocular depth estimation is ill-posed because the same 2D image can be projected from infinitely many 3D scenes. Although leading algorithms report significant improvements, they are essentially tied to a particular combination of pictorial observations and camera parameters (intrinsics and extrinsics), which strongly limits their generalization to real-world scenarios. To cope with this, the paper proposes a ground embedding module that decouples camera parameters from pictorial cues and thereby improves generalization. Given the camera parameters, the module generates a ground depth map, which is stacked with the input image and referenced in the final depth prediction; a ground attention mechanism optimally combines the ground depth with the residual depth. The proposed ground embedding is flexible and lightweight, yielding a plug-in module that can be integrated into various depth estimation networks. Experiments show state-of-the-art results on popular benchmarks and, more importantly, significant improvements in cross-domain generalization.
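A simplified sketch of how such a ground-plane depth prior can be derived from camera parameters, assuming zero camera pitch and a flat road (the paper handles the general case); the KITTI-like values in the usage line are only illustrative.

```python
import numpy as np

def ground_depth(H, W, fy, cy, cam_height):
    """Depth of the flat ground plane at every pixel, from camera parameters.

    With the camera mounted at height h above a planar ground and zero pitch,
    a pixel row v below the horizon (v > cy) sees a ground point at depth
    Z = fy * h / (v - cy); rows at or above the horizon see no ground.
    """
    v = np.arange(H, dtype=np.float64)[:, None]     # pixel rows
    denom = v - cy
    depth_col = np.where(denom > 1e-6,
                         fy * cam_height / np.maximum(denom, 1e-6),
                         np.inf)                    # no ground above the horizon
    return np.broadcast_to(depth_col, (H, W)).copy()  # same depth along each image row

# Usage with KITTI-like values: fy ~ 721.5 px, cy ~ 172.9, camera ~1.65 m above the road.
gd = ground_depth(375, 1242, fy=721.5, cy=172.9, cam_height=1.65)
```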

Paper8 Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor

Original abstract: Light-weight time-of-flight (ToF) depth sensors are compact and cost-efficient, and thus widely used on mobile devices for tasks such as autofocus and obstacle detection. However, due to the sparse and noisy depth measurements, these sensors have rarely been considered for dense geometry reconstruction. In this work, we present the first dense SLAM system with a monocular camera and a light-weight ToF sensor. Specifically, we propose a multi-modal implicit scene representation that supports rendering both the signals from the RGB camera and light-weight ToF sensor which drives the optimization by comparing with the raw sensor inputs. Moreover, in order to guarantee successful pose tracking and reconstruction, we exploit a predicted depth as an intermediate supervision and develop a coarse-to-fine optimization strategy for efficient learning of the implicit representation. At last, the temporal information is explicitly exploited to deal with the noisy signals from light-weight ToF sensors to improve the accuracy and robustness of the system. Experiments demonstrate that our system well exploits the signals of light-weight ToF sensors and achieves competitive results both on camera tracking and dense scene reconstruction. Project page: https://zju3dv.github.io/tof_slam/.

Summary: Light-weight time-of-flight (ToF) depth sensors are compact and cost-efficient and are widely used on mobile devices for tasks such as autofocus and obstacle detection, but their sparse and noisy depth measurements mean they are rarely used for dense geometry reconstruction. This work presents the first dense SLAM system built on a monocular camera and a light-weight ToF sensor. It proposes a multi-modal implicit scene representation that supports rendering both the RGB camera signal and the light-weight ToF signal, with optimization driven by comparison against the raw sensor inputs. To guarantee successful pose tracking and reconstruction, a predicted depth is used as intermediate supervision and a coarse-to-fine optimization strategy is developed for efficient learning of the implicit representation. Finally, temporal information is explicitly exploited to handle the noisy ToF signal and improve the system's accuracy and robustness. Experiments show the system exploits the light-weight ToF signal well and achieves competitive results in both camera tracking and dense scene reconstruction. Project page: https://zju3dv.github.io/tof_slam/.

Paper9 MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

Original abstract: In the field of monocular 3D detection, it is common practice to utilize scene geometric clues to enhance the detector's performance. However, many existing works adopt these clues explicitly such as estimating a depth map and back-projecting it into 3D space. This explicit methodology induces sparsity in 3D representations due to the increased dimensionality from 2D to 3D, and leads to substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments conducted on the KITTI-3D benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Codes are available at https://github.com/cskkxjk/MonoNeRD.

Summary: In monocular 3D detection it is common practice to use scene geometric cues to improve the detector, but many existing works use these cues explicitly, e.g. by estimating a depth map and back-projecting it into 3D space. This explicit approach induces sparsity in the 3D representations because of the increased dimensionality from 2D to 3D, and causes substantial information loss, especially for distant and occluded objects. To alleviate this, the authors propose MonoNeRD, a detection framework that infers dense 3D geometry and occupancy. Scenes are modeled with signed distance functions (SDF), facilitating the production of dense 3D representations; these are treated as neural radiance fields (NeRF), and volume rendering is used to recover RGB images and depth maps. To the authors' knowledge, this is the first work to introduce volume rendering for monocular 3D detection, demonstrating the potential of implicit reconstruction for image-based 3D perception. Extensive experiments on the KITTI-3D benchmark and the Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Code is available at https://github.com/cskkxjk/MonoNeRD.
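For reference, the standard NeRF-style volume-rendering step that such a pipeline borrows can be sketched as below; this illustrates only the rendering of depth from per-ray densities, while the SDF-to-density mapping and the detection heads are omitted and the code is not the paper's implementation.

```python
import torch

def render_depth_from_density(sigma, z_vals):
    """Standard volume-rendering weights, used to recover depth from densities.

    sigma:  (R, S) non-negative densities for R rays with S samples each
    z_vals: (R, S) sample depths along each ray, increasing
    """
    deltas = torch.cat([z_vals[:, 1:] - z_vals[:, :-1],
                        torch.full_like(z_vals[:, :1], 1e10)], dim=-1)  # sample spacing
    alpha = 1.0 - torch.exp(-sigma * deltas)                            # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                             # rendering weights
    depth = (weights * z_vals).sum(dim=-1)                              # expected ray depth
    return depth, weights

# Usage: 1024 rays, 64 samples per ray.
sigma = torch.rand(1024, 64)
z_vals = torch.linspace(2.0, 60.0, 64).expand(1024, 64)
depth, weights = render_depth_from_density(sigma, z_vals)
```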

Paper10 Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver

Summary: The central challenge of monocular 3D object detection is accurately localizing the 3D center. The authors propose a stage-wise approach that combines the 2D-to-3D information flow (generating 3D bounding box proposals from a single 2D image) with the 3D-to-2D flow (verifying proposals through 3D-to-2D context). Specifically, initial proposals are obtained from an off-the-shelf backbone monocular 3D detector, a 3D anchoring space is generated by local grid sampling, and 3D bounding box denoising is then performed in a 3D-to-2D proposal verification stage. To learn features that are discriminative enough to denoise highly overlapping proposals, the paper fuses 3D-to-2D geometric information with 2D appearance information using a Perceiver I/O model; on top of the encoded latent representation of a proposal, the verification head is implemented with a self-attention module. The resulting method, MonoXiver, is generic and can easily be adapted to any backbone monocular 3D detector. Experiments on the well-known KITTI dataset and the challenging large-scale Waymo dataset show that MonoXiver consistently yields improvements with limited computational overhead.

Paper11 Cyclic Test-Time Adaptation on Monocular Video for 3D Human Mesh Reconstruction

Original abstract: Despite recent advances in 3D human mesh reconstruction, domain gap between training and test data is still a major challenge. Several prior works tackle the domain gap problem via test-time adaptation that fine-tunes a network relying on 2D evidence (e.g., 2D human keypoints) from test images. However, the high reliance on 2D evidence during adaptation causes two major issues. First, 2D evidence induces depth ambiguity, preventing the learning of accurate 3D human geometry. Second, 2D evidence is noisy or partially non-existent during test time, and such imperfect 2D evidence leads to erroneous adaptation. To overcome the above issues, we introduce CycleAdapt, which cyclically adapts two networks: a human mesh reconstruction network (HMRNet) and a human motion denoising network (MDNet), given a test video. In our framework, to alleviate high reliance on 2D evidence, we fully supervise HMRNet with generated 3D supervision targets by MDNet. Our cyclic adaptation scheme progressively elaborates the 3D supervision targets, which compensate for imperfect 2D evidence. As a result, our CycleAdapt achieves state-of-the-art performance compared to previous test-time adaptation methods. The codes are available in here: https://github.com/hygenie1228/CycleAdapt_RELEASE.

Summary: Despite recent advances in 3D human mesh reconstruction, the domain gap between training and test data remains a major challenge. Several prior works tackle it via test-time adaptation that fine-tunes a network using 2D evidence (e.g., 2D human keypoints) from test images, but the heavy reliance on 2D evidence during adaptation causes two major issues: 2D evidence introduces depth ambiguity, preventing accurate learning of 3D human geometry, and it can be noisy or partially missing at test time, leading to erroneous adaptation. To overcome these issues, the authors introduce CycleAdapt, which cyclically adapts two networks on a test video: a human mesh reconstruction network (HMRNet) and a human motion denoising network (MDNet). To reduce the reliance on 2D evidence, HMRNet is fully supervised with 3D supervision targets generated by MDNet, and the cyclic adaptation scheme progressively refines these 3D targets to compensate for imperfect 2D evidence. As a result, CycleAdapt achieves state-of-the-art performance compared with previous test-time adaptation methods. Code is available at https://github.com/hygenie1228/CycleAdapt_RELEASE.

Paper12 NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space

Original abstract: Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs. In this paper, we identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Computation Imbalance in the 3D convolution across different depth levels. To address these problems, we devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to a Normalized Device Coordinates (NDC) space, rather than to the world space directly, through progressive restoration of the dimension of depth with deconvolution operations. Experiment results demonstrate that transferring the majority of computation from the target 3D space to the proposed normalized device coordinates space benefits monocular SSC tasks. Additionally, we design a Depth-Adaptive Dual Decoder to simultaneously upsample and fuse the 2D and 3D feature maps, further improving overall performance. Our extensive experiments confirm that the proposed method consistently outperforms state-of-the-art methods on both outdoor SemanticKITTI and indoor NYUv2 datasets. Our code are available at https://github.com/Jiawei-Yao0812/NDCScene.

Summary: Monocular 3D semantic scene completion (SSC) has attracted significant attention in recent years for its potential to predict complex semantics and geometry from a single image without any 3D input. The paper identifies several critical issues in current state-of-the-art methods: feature ambiguity when 2D features are projected along rays into 3D space, pose ambiguity in the 3D convolutions, and computation imbalance of the 3D convolutions across different depth levels. To address these, the authors devise NDC-Scene, a normalized device coordinates scene completion network that extends the 2D feature map into a normalized device coordinates (NDC) space rather than directly into world space, progressively restoring the depth dimension with deconvolution operations. Experiments show that moving most of the computation from the target 3D space into the proposed NDC space benefits monocular SSC. In addition, a depth-adaptive dual decoder simultaneously upsamples and fuses the 2D and 3D feature maps, further improving overall performance. Extensive experiments confirm that the method consistently outperforms state-of-the-art approaches on both the outdoor SemanticKITTI and indoor NYUv2 datasets. Code is available at https://github.com/Jiawei-Yao0812/NDCScene.
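A minimal sketch of the general mechanism described (lifting a 2D feature map into a 3D volume by progressively restoring the depth axis with 3D deconvolutions) is shown below; channel sizes, stage count, and names are illustrative assumptions, not the released NDC-Scene architecture.

```python
import torch
import torch.nn as nn

class NDCLifter(nn.Module):
    """Lift a 2D feature map to a 3D volume by progressively growing the depth axis."""
    def __init__(self, c2d=128, c3d=32):
        super().__init__()
        self.to3d = nn.Conv2d(c2d, c3d * 4, kernel_size=1)   # seed a short depth axis (D=4)
        self.up = nn.Sequential(
            nn.ConvTranspose3d(c3d, c3d, kernel_size=(4, 3, 3), stride=(2, 1, 1),
                               padding=(1, 1, 1)), nn.ReLU(),  # D: 4 -> 8
            nn.ConvTranspose3d(c3d, c3d, kernel_size=(4, 3, 3), stride=(2, 1, 1),
                               padding=(1, 1, 1)), nn.ReLU())  # D: 8 -> 16

    def forward(self, feat2d):                                 # (B, c2d, H, W)
        B, _, H, W = feat2d.shape
        vol = self.to3d(feat2d).reshape(B, -1, 4, H, W)        # (B, c3d, 4, H, W)
        return self.up(vol)                                    # (B, c3d, 16, H, W)

# Usage on a small feature map.
lifter = NDCLifter()
vol = lifter(torch.randn(1, 128, 24, 80))   # -> (1, 32, 16, 24, 80)
```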

Paper13 LATR: 3D Lane Detection from Monocular Images with Transformer

Original abstract: 3D lane detection from monocular images is a fundamental yet challenging task in autonomous driving. Recent advances primarily rely on structural 3D surrogates (e.g., bird's eye view) built from front-view image features and camera parameters. However, the depth ambiguity in monocular images inevitably causes misalignment between the constructed surrogate feature map and the original image, posing a great challenge for accurate lane detection. To address the above issue, we present a novel LATR model, an end-to-end 3D lane detector that uses 3D-aware front-view features without transformed view representation. Specifically, LATR detects 3D lanes via cross-attention based on query and key-value pairs, constructed using our lane-aware query generator and dynamic 3D ground positional embedding. On the one hand, each query is generated based on 2D lane-aware features and adopts a hybrid embedding to enhance the lane information. On the other hand, 3D space information is injected as positional embedding from an iteratively-updated 3D ground plane. LATR outperforms previous state-of-the-art methods on both synthetic Apollo and realistic OpenLane, ONCE-3DLanes datasets by large margins (e.g., 11.4 gain in terms of F1 score on OpenLane). Code will be released at https://github.com/JMoonr/LATR.

Summary: 3D lane detection from monocular images is a fundamental yet challenging task in autonomous driving. Recent advances rely mainly on structural 3D surrogates (e.g., bird's-eye view) built from front-view image features and camera parameters, but depth ambiguity in monocular images inevitably causes misalignment between the constructed surrogate feature map and the original image, making accurate lane detection difficult. To address this, the authors present LATR, an end-to-end 3D lane detector that uses 3D-aware front-view features without a transformed view representation. LATR detects 3D lanes via cross-attention over query and key-value pairs constructed with a lane-aware query generator and a dynamic 3D ground positional embedding: each query is generated from 2D lane-aware features and adopts a hybrid embedding to enhance lane information, while 3D spatial information is injected as positional embeddings from an iteratively updated 3D ground plane. LATR outperforms previous state-of-the-art methods by large margins on the synthetic Apollo and the real-world OpenLane and ONCE-3DLanes datasets (e.g., an 11.4 F1 gain on OpenLane). Code will be released at https://github.com/JMoonr/LATR.

Paper14 Reconstructing Interacting Hands with Interaction Prior from Monocular Images

Original abstract: Reconstructing interacting hands from monocular images is indispensable in AR/VR applications. Most existing solutions rely on the accurate localization of each skeleton joint. However, these methods tend to be unreliable due to the severe occlusion and confusing similarity among adjacent hand parts. This also defies human perception because humans can quickly imitate an interaction pattern without localizing all joints. Our key idea is to first construct a two-hand interaction prior and recast the interaction reconstruction task as the conditional sampling from the prior. To expand more interaction states, a large-scale multimodal dataset with physical plausibility is proposed. Then a VAE is trained to further condense these interaction patterns as latent codes in a prior distribution. When looking for image cues that contribute to interaction prior sampling, we propose the interaction adjacency heatmap (IAH). Compared with a joint-wise heatmap for localization, IAH assigns denser visible features to those invisible joints. Compared with an all-in-one visible heatmap, it provides more fine-grained local interaction information in each interaction region. Finally, the correlations between the extracted features and corresponding interaction codes are linked by the ViT module. Comprehensive evaluations on benchmark datasets have verified the effectiveness of this framework. The code and dataset are publicly available at: https://github.com/binghui-z/InterPrior_pytorch.

Summary: Reconstructing interacting hands from monocular images is indispensable for AR/VR applications. Most existing solutions rely on accurately localizing every skeleton joint, which tends to be unreliable under severe occlusion and the confusing similarity of adjacent hand parts. The key idea here is to first construct a two-hand interaction prior and recast interaction reconstruction as conditional sampling from that prior. To cover more interaction states, a large-scale, physically plausible multimodal dataset is proposed, and a VAE is trained to condense these interaction patterns into latent codes under a prior distribution. To find image cues that help sample from the interaction prior, the authors propose the interaction adjacency heatmap (IAH), which assigns denser visible features to invisible joints than a joint-wise localization heatmap and provides finer-grained local interaction information than an all-in-one visible heatmap. Finally, a ViT module links the extracted features with the corresponding interaction codes. Comprehensive evaluations on benchmark datasets verify the effectiveness of the framework. Code and dataset are publicly available at https://github.com/binghui-z/InterPrior_pytorch.

Paper15 MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

Original abstract: We propose MAMo, a novel memory and attention framework for monocular video depth estimation. MAMo can augment and improve any single-image depth estimation networks into video depth estimation models, enabling them to take advantage of the temporal information to predict more accurate depth. In MAMo, we augment model with memory which aids the depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens of the previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both the past and the present visual information. We adopt attention-based approach to process memory features where we first learn the spatio-temporal relation among the resultant visual and displacement memory tokens using self-attention module. Further, the output features of self-attention are aggregated with the current visual features through cross-attention. The cross-attended features are finally given to a decoder to predict depth on the current frame. Through extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, we show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy. Notably, our MAMo video depth estimation provides higher accuracy with lower latency, when comparing to SOTA cost-volume-based video depth models.

Summary: MAMo is a memory-and-attention framework for monocular video depth estimation. It can augment any single-image depth estimation network into a video depth model, letting it exploit temporal information to predict more accurate depth. The model is augmented with a memory that aids depth prediction as it streams through the video: the memory stores learned visual and displacement tokens from previous time instances, so the depth network can cross-reference relevant past features when predicting depth on the current frame. A novel scheme continuously updates the memory, optimizing it to keep tokens that correspond to both past and present visual information. Memory features are processed with attention: a self-attention module first learns the spatio-temporal relations among the visual and displacement memory tokens, and the resulting features are aggregated with the current visual features through cross-attention before a decoder predicts the current frame's depth. Extensive experiments on KITTI, NYU-Depth V2, and DDAD show that MAMo consistently improves monocular depth networks and sets new state-of-the-art accuracy, with higher accuracy and lower latency than state-of-the-art cost-volume-based video depth models.

Paper16 Neural Reconstruction of Relightable Human Model from Monocular Video

Original abstract: Creating relightable and animatable human characters from monocular video at a low cost is a critical task for digital human modeling and virtual reality applications. This task is complex due to intricate articulation motion, a wide range of ambient lighting conditions, and pose-dependent clothing deformations. In this paper, we introduce a novel self-supervised framework that takes a monocular video of a moving human as input and generates a 3D neural representation capable of being rendered with novel poses under arbitrary lighting conditions. Our framework decomposes dynamic humans under varying illumination into neural fields in canonical space, taking into account geometry and spatially varying BRDF material properties. Additionally, we introduce pose-driven deformation fields, enabling bidirectional mapping between canonical space and observation. Leveraging the proposed appearance decomposition and deformation fields, our framework learns in a self-supervised manner. Ultimately, based on pose-driven deformation, recovered appearance, and physically-based rendering, the reconstructed human figure becomes relightable and can be explicitly driven by novel poses. We demonstrate significant performance improvements over previous works and provide compelling examples of relighting from monocular videos of moving humans in challenging, uncontrolled capture scenarios.

Summary: This paper introduces a self-supervised framework that takes a monocular video of a moving person as input and produces a 3D neural representation that can be rendered in novel poses under arbitrary lighting. The framework decomposes the dynamic human under varying illumination into neural fields in a canonical space, accounting for geometry and spatially varying BRDF material properties. Pose-driven deformation fields enable bidirectional mapping between the canonical space and observations. Leveraging the proposed appearance decomposition and deformation fields, the framework learns in a self-supervised manner. Based on pose-driven deformation, the recovered appearance, and physically based rendering, the reconstructed human becomes relightable and can be explicitly driven by novel poses. The authors demonstrate significant performance improvements over previous work and show compelling relighting examples from monocular videos of moving people in challenging, uncontrolled capture scenarios.

Paper17 Beyond the Limitation of Monocular 3D Detector via Knowledge Distillation

Original abstract: Knowledge distillation (KD) is a promising approach that facilitates the compact student model to learn dark knowledge from the huge teacher model for better results. Although KD methods are well explored in the 2D detection task, existing approaches are not suitable for 3D monocular detection without considering spatial cues. Motivated by the potential of depth information, we propose a novel distillation framework that validly improves the performance of the student model without extra depth labels. Specifically, we first put forward a perspective-induced feature imitation, which utilizes the perspective principle (the farther the smaller) to facilitate the student to imitate more features of farther objects from the teacher model. Moreover, we construct a depth-guided matrix by the predicted depth gap of teacher and student to facilitate the model to learn more knowledge of farther objects in prediction level distillation. The proposed method is available for advanced monocular detectors with various backbones, which also brings no extra inference time. Extensive experiments on the KITTI and nuScenes benchmarks with diverse settings demonstrate that the proposed method outperforms the state-of-the-art KD methods.

Summary: Knowledge distillation (KD) lets a compact student model learn dark knowledge from a large teacher model for better results. Although KD is well explored for 2D detection, existing approaches are not suitable for monocular 3D detection because they do not consider spatial cues. Motivated by the potential of depth information, the authors propose a distillation framework that effectively improves the student model without extra depth labels. They first put forward a perspective-induced feature imitation, which uses the perspective principle (the farther, the smaller) to make the student imitate more features of farther objects from the teacher. They further construct a depth-guided matrix from the predicted depth gap between teacher and student, so that prediction-level distillation transfers more knowledge about farther objects. The method works with advanced monocular detectors of various backbones and adds no extra inference time. Extensive experiments on the KITTI and nuScenes benchmarks under diverse settings show that it outperforms state-of-the-art KD methods.
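As a minimal sketch of the perspective-induced idea (up-weighting the imitation of farther, hence smaller, objects), consider the loss below; the weighting function and `tau` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def perspective_weighted_imitation(student_feats, teacher_feats, obj_depths, tau=20.0):
    """Feature-imitation distillation loss weighted by object distance.

    student_feats, teacher_feats: (N, C) pooled per-object features
    obj_depths: (N,) object center depths in meters
    Farther objects cover fewer pixels, so their imitation term is up-weighted.
    """
    w = 1.0 + obj_depths / tau                                        # farther -> larger weight
    per_obj = ((student_feats - teacher_feats.detach()) ** 2).mean(dim=1)
    return (w * per_obj).sum() / w.sum()

# Usage with dummy per-object features and depths.
loss = perspective_weighted_imitation(torch.randn(8, 256), torch.randn(8, 256),
                                       torch.tensor([5., 12., 20., 35., 8., 50., 18., 27.]))
```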

Paper18 Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation

Summary: Recent audio-to-mesh methods show great promise for speech-driven 3D facial animation, but several thorny challenges remain: data scarcity is essentially unavoidable because 4D data acquisition is difficult, and current methods usually lack controllability over the animated face. The authors propose Speech4Mesh, a framework that continuously generates 4D talking-head data and trains an audio-to-mesh network on the reconstructed meshes. The framework first reconstructs 4D talking-head sequences from monocular videos; to precisely capture speech-related variations on the face, it exploits audio-visual alignment information from the video with a contrastive learning scheme. An audio-to-mesh network (e.g., FaceFormer) can then be trained on the generated 4D data. To control the animated talking face, speech-unrelated factors (e.g., emotion) are encoded into embeddings for manipulation, and a differentiable renderer ensures more accurate photometric detail in both reconstruction and animation. Experiments show that Speech4Mesh not only outperforms existing reconstruction methods, especially on the lower face, but also achieves better perceptual and objective animation quality after pre-training on the synthesized data, and that the framework can explicitly control the emotion of the animated talking face.

Paper19 Robust Monocular Depth Estimation under Challenging Conditions

Original abstract: While state-of-the-art monocular depth estimation approaches achieve impressive results in ideal settings, they are highly unreliable under challenging illumination and weather conditions, such as at nighttime or in the presence of rain. In this paper, we uncover these safety-critical issues and tackle them with md4all: a simple and effective solution that works reliably under both adverse and ideal conditions, as well as for different types of learning supervision. We achieve this by exploiting the efficacy of existing methods under perfect settings. Therefore, we provide valid training signals independently of what is in the input. First, we generate a set of complex samples corresponding to the normal training ones. Then, we train the model by guiding its self- or full-supervision by feeding the generated samples and computing the standard losses on the corresponding original images. Doing so enables a single model to recover information across diverse conditions without modifications at inference time. Extensive experiments on two challenging public datasets, namely nuScenes and Oxford RobotCar, demonstrate the effectiveness of our techniques, outperforming prior works by a large margin in both standard and challenging conditions. Source code and data are available at: https://md4all.github.io.

Summary: While state-of-the-art monocular depth estimation methods achieve impressive results in ideal settings, they are highly unreliable under challenging illumination and weather, such as at night or in rain. The paper addresses these safety-critical issues with md4all, a simple and effective solution that works reliably under both adverse and ideal conditions and for different types of learning supervision. It exploits the efficacy of existing methods in ideal settings to provide valid training signals regardless of what is in the input: a set of challenging samples is generated corresponding to the normal training ones, and the model is trained by feeding the generated samples while computing the standard self- or fully supervised losses on the corresponding original images. A single model can thus recover information across diverse conditions without any modification at inference time. Extensive experiments on two challenging public datasets, nuScenes and Oxford RobotCar, demonstrate the effectiveness of the approach, which outperforms prior work by a large margin in both standard and adverse conditions. Source code and data are available at https://md4all.github.io.
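A minimal sketch of one training step following this recipe is shown below; `day2adverse` is a hypothetical stand-in for a pre-trained day-to-night/rain translation model and `depth_loss` for the standard self- or fully supervised objective, so names and structure are assumptions for illustration only.

```python
import torch

def md4all_style_step(model, batch, day2adverse, depth_loss):
    """Feed a generated adverse-condition image, supervise with the original's signal."""
    img = batch["image"]                  # ideal-condition training image
    with torch.no_grad():
        adverse = day2adverse(img)        # generated challenging counterpart (e.g., night/rain)
    pred = model(adverse)                 # the model only ever sees the hard version
    # The standard losses are computed w.r.t. the original, ideal-condition image
    # and its targets, giving a valid training signal regardless of the input.
    return depth_loss(pred, batch)
```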

Paper20 CORE: Co-planarity Regularized Monocular Geometry Estimation with Weak Supervision

Original abstract: The ill-posed nature of monocular 3D geometry (depth map and surface normals) estimation makes it rely mostly on data-driven approaches such as Deep Neural Networks (DNN). However, data acquisition of surface normals, especially the reliable normals, is acknowledged difficult. Commonly, reconstruction of surface normals with high quality is heuristic and time-consuming. Such fact urges methodologies to minimize dependency on ground-truth normals when predicting 3D geometry. In this work, we devise CO-planarity REgularized (CORE) loss functions and Structure-Aware Normal Estimator (SANE). Without involving any knowledge of ground-truth normals, these two designs enable pixel-wise 3D geometry estimation weakly supervised by only ground-truth depth map. For CORE loss functions, the key idea is to exploit locally linear depth-normal orthogonality under spherical coordinates as pixel-level constraints, and utilize our designed Adaptive Polar Regularization (APR) to resolve underlying numerical degeneracies. Meanwhile, SANE easily establishes multi-task learning with CORE loss functions on both depth and surface normal estimation, leading to the whole performance leap. Extensive experiments present the effectiveness of our method on various DNN architectures and data benchmarks. The experimental results demonstrate that our depth estimation achieves the state-of-the-art performance across all metrics on indoor scenes and comparable performance on outdoor scenes. In addition, our surface normal estimation is overall superior.

Summary: The ill-posed nature of monocular 3D geometry estimation (depth maps and surface normals) makes it rely mostly on data-driven approaches such as deep neural networks, yet acquiring surface-normal data, especially reliable normals, is acknowledged to be difficult, and high-quality normal reconstruction is typically heuristic and time-consuming. This motivates minimizing the dependence on ground-truth normals when predicting 3D geometry. The authors devise CO-planarity REgularized (CORE) loss functions and a Structure-Aware Normal Estimator (SANE); without any knowledge of ground-truth normals, these enable pixel-wise 3D geometry estimation weakly supervised only by the ground-truth depth map. The key idea of the CORE losses is to exploit locally linear depth-normal orthogonality in spherical coordinates as pixel-level constraints, with a designed Adaptive Polar Regularization (APR) to resolve the underlying numerical degeneracies. Meanwhile, SANE readily forms multi-task learning with the CORE losses on both depth and surface-normal estimation, lifting overall performance. Extensive experiments across DNN architectures and benchmarks show that the depth estimation reaches state-of-the-art performance on all metrics for indoor scenes and comparable performance outdoors, while the surface-normal estimation is overall superior.
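To illustrate the geometric idea behind a depth-normal orthogonality constraint, the sketch below back-projects a depth map, takes finite-difference tangent vectors, and penalizes their dot product with the predicted normals; the paper formulates the constraint in spherical coordinates with an adaptive polar regularizer, which is omitted here, so this is only an assumed simplification.

```python
import torch
import torch.nn.functional as F

def depth_normal_orthogonality_loss(depth, normals, K_inv):
    """Penalize non-orthogonality between surface tangents (from depth) and predicted normals.

    depth:   (B, 1, H, W) predicted or ground-truth depth
    normals: (B, 3, H, W) predicted unit normals
    K_inv:   (3, 3) float tensor, inverse camera intrinsics
    """
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float()        # (3, H, W) homogeneous pixels
    rays = (K_inv @ pix.reshape(3, -1)).reshape(1, 3, H, W)             # per-pixel viewing rays
    points = rays * depth                                               # back-projected 3D points

    tangent_x = points[..., :, 1:] - points[..., :, :-1]                # vector to right neighbour
    tangent_y = points[..., 1:, :] - points[..., :-1, :]                # vector to bottom neighbour
    n_x = F.normalize(normals[..., :, :-1], dim=1)
    n_y = F.normalize(normals[..., :-1, :], dim=1)

    dot_x = (F.normalize(tangent_x, dim=1) * n_x).sum(dim=1).abs()      # should be ~0 on a surface
    dot_y = (F.normalize(tangent_y, dim=1) * n_y).sum(dim=1).abs()
    return dot_x.mean() + dot_y.mean()
```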

Paper21 Out-of-Distribution Detection for Monocular Depth Estimation

Original abstract: In monocular depth estimation, uncertainty estimation approaches mainly target the data uncertainty introduced by image noise. In contrast to prior work, we address the uncertainty due to lack of knowledge, which is relevant for the detection of data not represented by the training distribution, the so-called out-of-distribution (OOD) data. Motivated by anomaly detection, we propose to detect OOD images from an encoder-decoder depth estimation model based on the reconstruction error. Given the features extracted with the fixed depth encoder, we train an image decoder for image reconstruction using only in-distribution data. Consequently, OOD images result in a high reconstruction error, which we use to distinguish between in- and out-of-distribution samples. We built our experiments on the standard NYU Depth V2 and KITTI benchmarks as in-distribution data. Our post hoc method performs astonishingly well on different models and outperforms existing uncertainty estimation approaches without modifying the trained encoder-decoder depth estimation model.

Summary: In monocular depth estimation, uncertainty estimation approaches mainly target the data uncertainty introduced by image noise. In contrast, this work addresses the uncertainty due to lack of knowledge, which matters for detecting data not represented by the training distribution, i.e., out-of-distribution (OOD) data. Motivated by anomaly detection, the authors propose to detect OOD images for an encoder-decoder depth estimation model based on reconstruction error: given the features extracted by the fixed depth encoder, an image decoder is trained for image reconstruction using only in-distribution data, so OOD images yield a high reconstruction error that is used to distinguish in-distribution from out-of-distribution samples. Experiments use the standard NYU Depth V2 and KITTI benchmarks as in-distribution data. The post hoc method performs surprisingly well across different models and outperforms existing uncertainty estimation approaches without modifying the trained encoder-decoder depth model.
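A minimal sketch of such a post hoc reconstruction-error score is given below; `depth_encoder` and `image_decoder` are placeholders for the frozen encoder of the depth model and the decoder trained only on in-distribution data, and the thresholding strategy in the comment is an assumption for illustration.

```python
import torch

@torch.no_grad()
def ood_score(image, depth_encoder, image_decoder):
    """Per-image OOD score as the reconstruction error of an ID-trained decoder.

    image: (B, 3, H, W); returns a (B,) tensor of mean squared reconstruction errors.
    Out-of-distribution inputs reconstruct poorly, so high scores flag OOD samples.
    """
    feats = depth_encoder(image)       # features from the fixed depth encoder
    recon = image_decoder(feats)       # decoder trained on in-distribution data only
    return ((recon - image) ** 2).mean(dim=(1, 2, 3))

# Usage: threshold the score (e.g., at a high percentile of validation-set errors)
# to decide whether a test image should be treated as out-of-distribution.
```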
