
Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks


Florian Barthel1, 2  Arian Beckmann1  Wieland Morgenstern1  Anna Hilsmann1  Peter Eisert1,2
1 Fraunhofer Heinrich Hertz Institute, HHI
2 Humboldt University of Berlin

Abstract

NeRF-based 3D-aware Generative Adversarial Networks (GANs) like EG3D or GIRAFFE have shown very high rendering quality under large representational variety. However, rendering with Neural Radiance Fields poses challenges for 3D applications: First, the significant computational demands of NeRF rendering preclude its use on low-power devices, such as mobiles and VR/AR headsets. Second, implicit representations based on neural networks are difficult to incorporate into explicit 3D scenes, such as VR environments or video games. 3D Gaussian Splatting (3DGS) overcomes these limitations by providing an explicit 3D representation that can be rendered efficiently at high frame rates. In this work, we present a novel approach that combines the high rendering quality of NeRF-based 3D-aware GANs with the flexibility and computational advantages of 3DGS. By training a decoder that maps implicit NeRF representations to explicit 3D Gaussian Splatting attributes, we can integrate the representational diversity and quality of 3D GANs into the ecosystem of 3D Gaussian Splatting for the first time. Additionally, our approach allows for a high resolution GAN inversion and real-time GAN editing with 3D Gaussian Splatting scenes.
Project page: florian-barthel.github.io/gaussian_decoder


Figure 1: We propose a novel 3D Gaussian Splatting decoder that converts high quality results from pre-trained 3D-aware GANs into Gaussian Splatting scenes in real-time for efficient and high resolution rendering.


1 Introduction

Creating and editing realistic 3D assets is of vital importance for applications such as Virtual Reality (VR) or video games. Often, this process is very costly and requires a significant amount of manual editing. Over the last few years, there have been drastic improvements to 2D [14, 15, 16, 37] and 3D [8, 33, 7, 6, 39, 2, 42] image synthesis. These advancements increasingly narrow the gap between professionally created 3D assets and those that are automatically synthesized. One of the most notable recent methods is the Efficient Geometry-aware 3D GAN (EG3D) [8]. It successfully combines the strength of StyleGAN [16], originally built for 2D image generation, with a 3D NeRF renderer [26, 3], achieving state-of-the-art 3D renderings synthesized from a latent space. Despite EG3D's significant contributions to 3D rendering quality, its integration into 3D modeling environments like Unity or Blender remains difficult. This challenge stems from its NeRF dependency, which only generates 2D images from 3D scenes, without ever explicitly representing the 3D scene. As a result, EG3D cannot be imported or manipulated in these computer graphics tools.

The recently introduced 3D Gaussian Splatting (3DGS) [19] provides a novel explicit 3D scene representation, enabling high-quality renderings at high frame rates. Following its debut, numerous derivative techniques have already emerged, including the synthesis of controllable human heads [34, 43, 48], the rendering of full-body humans [20], or the compression of the storage size of Gaussian objects [30]. On the one hand, 3DGS provides a substantial improvement in terms of rendering speed and flexibility compared to NeRF: the explicit modelling enables simple exporting of the scenes into 3D software environments. Furthermore, the novel and efficient renderer in 3DGS allows for high-resolution renderings and an increase in rendering speed by a factor of up to 1000× over state-of-the-art NeRF frameworks [18, 4, 31, 36]. On the other hand, NeRF's implicit scene representation allows for straightforward decoding of scene information from latent spaces, notably through the usage of tri-planes [8], which store visual and geometric information of the scene to be rendered. This enables the integration of NeRF rendering into GAN frameworks, lifting the representational variety and visual fidelity of GANs up into three-dimensional space. Combining NeRFs and GANs is highly advantageous, as rendering from a latent space offers multiple benefits: firstly, it allows for rendering an unlimited number of unique appearances; secondly, a large variety of editing methods [12, 1, 29] can be applied; and thirdly, single 2D images can be inverted using 3D GAN inversion [15, 35, 5], allowing for full 3D reconstructions from a single image.

Sampling visual information from latent spaces with large representational variety poses a challenge for rendering with 3DGS, as the framework requires the information for the appearance of the scene to be encoded as attributes of individual splats, rather than in the latent space itself. This severely complicates the task of fitting 3D Gaussian splats to variable latent spaces, given that the splats would need to be repositioned for every new latent code - a challenge that is not addressed in the original 3DGS framework. Several approaches tackling the problem of rendering with 3DGS from latent tri-planes have been proposed [20, 49, 45], but to the best of our knowledge, no method exists to create 3D heads rendered with Gaussian Splatting from a latent space.

In this work, we propose a framework for the synthesis of explicit 3D scenes representing human heads from a latent space. This is done by combining the representational variety and fidelity of 3D-aware GANs with the explicit scenes and fast rendering speed of 3D Gaussian Splatting. Our main contributions can be summarized as follows:


1. A novel method that allows for GAN-based synthesis of explicit 3D Gaussian Splatting scenes, additionally avoiding super-resolution modules as used in the generation of implicit scene representations.


2. A novel sequential decoder architecture, a strategy for sampling Gaussian splat positions around human heads, and a generator backbone fine-tuning technique to improve the decoder's capacity.


3. An open-source end-to-end pipeline for synthesizing state-of-the-art 3D assets to be used in 3D software.


2 Related Work

2.1 Neural Radiance Fields

In their foundational work on Neural Radiance Fields (NeRFs), Mildenhall et al. [27] propose to implicitly represent a 3D scene with an MLP that outputs color and density at any point in space from any viewing direction. This representation revolutionized novel-view synthesis due to its ability to reconstruct scenes with high fidelity, its flexibility with respect to viewpoints, and its compactness through the usage of the MLP. To obtain a 2D rendering, a ray is cast for each pixel from the camera into the scene, with multiple points sampled along each ray, which in turn are fed to the MLP in order to obtain their respective color and density values. NeRFs have proven to provide high-quality renderings, but are slow during both training and inference: the sampling and decoding process requires querying a substantial number of points per ray, which has a high computational cost. Successors of the seminal NeRF approach improve on quality [25] as well as training and inference speed [3, 36, 31]. Plenoxels [36] replaces the MLP representation of the scene with a sparse voxel grid representation, which speeds up optimization by two orders of magnitude compared to vanilla NeRF while maintaining high rendering quality. InstantNGP [31] proposes the usage of multi-resolution hash tables to store scene-related information. By leveraging parallelism on GPUs and a highly optimized implementation that fits the hash tables into the GPU cache, it achieves significant speedups in processing times, making real-time applications feasible.

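To make the volume-rendering step concrete, the sketch below shows the standard NeRF quadrature that turns per-sample colors and densities into a pixel color. It is a minimal illustration, not the implementation of any specific framework; the MLP is replaced by random dummy outputs.

```python
import torch

def composite_ray(colors, densities, deltas):
    """Alpha-composite the samples of one ray (standard NeRF volume-rendering quadrature).

    colors:    (N, 3) RGB values predicted by the MLP at the N samples
    densities: (N,)   volume densities at the N samples
    deltas:    (N,)   distances between consecutive samples along the ray
    """
    alpha = 1.0 - torch.exp(-densities * deltas)                # per-segment opacity
    # transmittance: probability that the ray reaches sample i without being absorbed
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                     # contribution of each sample
    rgb = (weights[:, None] * colors).sum(dim=0)                # final pixel color
    return rgb, weights

# toy usage: 64 samples along one ray with dummy "MLP" outputs
colors = torch.rand(64, 3)
densities = torch.rand(64) * 5.0
deltas = torch.full((64,), 1.0 / 64)
rgb, weights = composite_ray(colors, densities, deltas)
```
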
Moreover, various approaches aiming to render in real-time propose to either store the view-dependent colors and opacities of NeRF in volumetric data representations or partition the scene into multiple voxels represented by small independent neural networks [46, 44, 21, 11, 40, 9, 13, 23].


2.2 3D Gaussian Splatting

Recently, Kerbl et al. [18] proposed to represent scenes explicitly in the form of Gaussian splats. Each individual splat represents a three-dimensional Gaussian distribution with mean μ and covariance matrix Σ. For computational simplicity, the authors represent the covariance matrix as the configuration of an ellipsoid, i.e. Σ = R S Sᵀ Rᵀ, with scaling and rotation matrices S and R. To characterize the appearance, each splat holds further attributes describing its opacity and view-dependent color through a set of spherical harmonics. Each splat's attributes are optimized in end-to-end training, utilizing a novel differentiable renderer. This renderer is essential for the success of 3D Gaussian Splatting. Its design allows for high-resolution rendering in real-time through fast GPU execution utilizing anisotropic splatting while being visibility-aware, which significantly accelerates the training process and novel-view rendering time. In general, 3DGS [18] is able to outperform several NeRF-based state-of-the-art approaches in rendering speed by factors of up to 1000×, while keeping competitive or better image quality.

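The covariance parameterization above can be made explicit with a short sketch. Assuming the rotation is stored as a unit quaternion and the scaling as three per-axis factors (the convention used by 3DGS), the covariance of a single splat is assembled as Σ = R S Sᵀ Rᵀ:

```python
import numpy as np

def covariance_from_splat(quat, scale):
    """Build the 3x3 covariance of one Gaussian splat as Sigma = R S S^T R^T.

    quat:  (4,) unit quaternion (w, x, y, z) describing the rotation R
    scale: (3,) per-axis scaling factors, i.e. the diagonal of S
    """
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    S = np.diag(scale)
    M = R @ S
    return M @ M.T   # symmetric positive semi-definite by construction

# example: a flat, disc-like splat with large extent in x/y and a thin z axis
sigma = covariance_from_splat(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.05, 0.05, 0.001]))
```
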
Several approaches that utilize 3DGS for the representation and rendering of human heads have been proposed [48, 34, 43]. GaussianAvatar [34] allows editing of facial expressions within a scene of Gaussians already fitted to a specific identity. To do so, they use the FLAME [22] 3D morphable face model to create a triangle mesh representing the head in 3D space and assign a splat to each triangle. Moreover, the geometric attributes of the splats are dynamically adjusted to align with the respective triangle's properties; for example, the global rotation of the splat is adjusted to match that of the triangle. Similarly, Xu et al. [43] anchor the Gaussian splats to a 3D triangle mesh fitted to a head that depicts a neutral expression. They utilize deformation MLPs conditioned on expression vectors to adjust the triangle mesh and the resulting Gaussian positions to account for changes in expression. Ultimately, they render a feature map from their scene with 3DGS and translate it into high-fidelity head images at 2K resolution with a super-resolution network.


Figure 2: Visualization of our method (orange parts are optimized). We initially clone the backbone of the 3D-aware GAN. Afterwards, we iteratively optimize the Gaussian Splatting decoder by comparing the output of the pre-trained GAN, after super-resolution, with the output of the decoder. The xyz coordinates at which the tri-plane is sampled originate from the density information of the NeRF renderer.


2.3 3D-aware GANs

Following the success of 2D GANs in recent years, several methods have been proposed to synthesize 3D content with GANs as well. To achieve this, the generator component of a GAN is modified to create an internal 3D representation suitable for output through differentiable renderers. Given that these renderers return a 2D representation of the 3D model, the framework can be trained with 2D data. This is crucial, as high-quality 2D datasets are more readily available than their 3D counterparts. One of the first high-resolution 3D-aware GANs is LiftingStyleGAN [38]. Its architecture extends the 2D StyleGAN [17] with a custom-built renderer based on texture and depth maps. Shortly after, π-GAN [7] and GIRAFFE [33] were introduced, both of which use a Neural Radiance Field (NeRF) renderer. Those methods show very promising visual results. Nevertheless, while π-GAN is very slow in rendering, only achieving 1 frame per second, GIRAFFE fails to estimate a good 3D geometry. Both challenges are solved with the introduction of the Efficient Geometry-aware 3D Generative Adversarial Network (EG3D) by Chan et al. [8]. EG3D combines the strength of the StyleGAN architecture for 2D image synthesis with the rendering quality of a NeRF renderer. This is done by reshaping the output features of a StyleGAN generator into a three-dimensional tri-plane structure to span a 3D space. From this tri-plane, 3D points are projected onto 2D feature maps and forwarded to a NeRF renderer. The renderer creates a 2D image at a small resolution, which is then forwarded to a super-resolution module. This approach returns state-of-the-art renderings at a resolution of 512×512 pixels, while maintaining reasonable rendering speeds of about 30 FPS. The use of the super-resolution module effectively locks in the output size and aspect ratio, making it impossible to adjust or enlarge them without training a completely new network. This limitation of EG3D contrasts with explicit methods, which feature a renderer capable of adjusting the rendering resolution on demand.

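The following sketch illustrates the tri-plane lookup described above: a 3D query point is projected orthogonally onto the three axis-aligned feature planes, each plane is sampled bilinearly, and the three features are aggregated. The averaging, tensor layout, and feature dimension are assumptions of this sketch and differ between implementations.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Query a tri-plane at 3D points by projecting onto the XY, XZ and YZ planes.

    planes: (3, C, H, W) feature maps reshaped from the StyleGAN generator output
    xyz:    (N, 3) query positions, assumed to lie in [-1, 1]^3
    returns (N, C) aggregated features
    """
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]        # per-plane 2D coordinates
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, 1, -1, 2)                                   # (1, 1, N, 2) for grid_sample
        f = F.grid_sample(plane.unsqueeze(0), grid,
                          mode='bilinear', align_corners=False)       # (1, C, 1, N)
        feats.append(f[0, :, 0].T)                                    # (N, C)
    return torch.stack(feats).mean(dim=0)                             # average over the three planes

features = sample_triplane(torch.randn(3, 32, 256, 256), torch.rand(1000, 3) * 2 - 1)
```
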
Since the inception of EG3D, several approaches have adopted its architecture, extending the rendering capabilities [42, 28, 2, 32]. PanoHead [2] stands out in particular, as it addresses the synthesis of full 360° heads. This is done by adding further training data that shows heads from the back and by disentangling the head from the background. The latter is done by separately generating the background and blending it with the foreground using a foreground mask obtained during the rendering process.


2.4 Decoding Gaussian Attributes from Tri-planes

Decoding NeRF attributes, i.e. color and density, from a tri-plane has proven to produce state-of-the-art frameworks. Decoding tri-planes into Gaussian Splatting attributes, on the other hand, induces further complexity. This is because a Gaussian splat located at a specific position on the tri-plane need not represent the color and density of this specific location, but instead a 3D shape with a scale that extends into other regions of the scene. Naively, this could be solved by treating Gaussian splats as a point cloud with a very high number of tiny colored points. This approach would, however, neglect the advantages of Gaussian Splatting and reintroduce high computational costs for rendering the point cloud. Instead, when decoding Gaussian attributes [18], we seek to find suitable representations such that Gaussian splats adapt their geometry to represent the structure of the target shape. Thus, smooth surfaces should be represented by wide flat Gaussian splats, while fine structures are best represented by thin long Gaussians.

Recent work has already investigated the ability to decode Gaussian splats from tri-planes. HUGS [20] uses a small fully connected network to predict the Gaussian attributes to render full-body humans in 3D. Contrary to our approach, HUGS overfits a single identity iteratively instead of converting arbitrary identities from a latent space in a single shot. Similarly, [49] uses a transformer network to decode Gaussian attributes from a tri-plane in order to synthesize 3D objects. A different approach that also combines 3DGS with tri-planes is Gaussian3Diff [45]. Instead of decoding Gaussian attributes from a tri-plane, they equip each Gaussian with a local tri-plane that is attached to its position. This hybrid approach shows promising quality, although the rendering speed is lower compared to 3DGS.


3 Method

Our goal is to design a decoder network that converts the output of a 3D-aware GAN, specifically tailored for human head generation, into a 3D Gaussian Splatting scene without requiring an iterative scene optimization process. An overview of our method is shown in Figure 2. We extract the tri-plane of the 3D GAN, which is originally used to render a NeRF scene, and train a decoder network to obtain 3D Gaussian Splatting attributes (i.e. position, color, rotation, scale, and opacity). For simplicity, we omit the estimation of view-dependent spherical harmonics coefficients. For training, we compare the synthesized images from the 3D GAN to the renderings of the decoded 3D Gaussian Splatting scenes. Importantly, our decoder does not use any super-resolution module. Instead, we render the decoded Gaussian Splatting scene directly at the same resolution as the final output of the 3D GAN. The absence of a super-resolution module allows for exporting the decoded scene directly into 3D modeling environments and for rendering at different resolutions and aspect ratios at high frame rates.

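For clarity, the set of attributes our decoder has to produce for every splat can be summarized in a small container. This is an illustrative sketch; the field names and shapes are ours, not identifiers from any particular codebase.

```python
from dataclasses import dataclass
import torch

@dataclass
class GaussianScene:
    """Explicit 3DGS attributes predicted for each of the N splats of one head."""
    xyz: torch.Tensor       # (N, 3) positions: initialization plus a predicted offset
    rgb: torch.Tensor       # (N, 3) view-independent color (spherical harmonics omitted)
    rotation: torch.Tensor  # (N, 4) unit quaternions
    scale: torch.Tensor     # (N, 3) per-axis extents of each splat
    opacity: torch.Tensor   # (N, 1) opacity in [0, 1]
```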

3.1 Position Initialization

Given that the 3D Gaussian splats, being described with multiple attributes (position, color, rotation, scale, and opacity), have multiple degrees of freedom, it is difficult to receive a meaningful gradient for the position during optimization. To overcome this issue, 3DGS uses several strategies to prune, clone, and split Gaussians during the training in order to spawn new Gaussians at fitting locations in the scene or remove redundant ones. For example, if a Gaussian splat is located at an incorrect position, 3DGS prefers to make the Gaussian splat vanish by reducing the opacity or to change its color to fit the current position, rather than moving its position. For our purpose of training a decoder that automatically creates new Gaussian scenes in a single forward pass, this iterative approach is not available. Instead, we take advantage of the geometric information already contained in the pre-trained 3D GAN's tri-plane. This is done by decoding the tri-plane features into opacity values using the pre-trained MLP of the NeRF renderer followed by a surface estimation based on the opacity values. Specifically, we uniformly sample a cube of points (128×128×128), decode the opacity and estimate the surface with marching cubes [24]. On this surface, we sample 500k points at random positions and slightly interpolate the points randomly towards the center, thus creating a thick surface. This provides us with a good position initialization for the Gaussians representing any head created by the 3D GAN. Even so, sampling the opacity from the NeRF renderer is computationally expensive. Nevertheless, this only has to be done once per ID / latent vector. After the 3D Gaussian scene is created, it can be rendered very efficiently.

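A possible implementation of this initialization is sketched below. The opacity grid is assumed to have been decoded beforehand from the tri-plane with the pre-trained NeRF MLP; here it is replaced by a toy sphere, and surface samples are drawn from the marching-cubes vertices as a simplification.

```python
import numpy as np
from skimage import measure

def init_positions(opacity_grid, n_points=500_000, level=0.5, shrink=0.1, box=0.5):
    """Sample initial splat positions on (and slightly inside) the opacity iso-surface.

    opacity_grid: (128, 128, 128) opacities, decoded from the tri-plane by the
                  pre-trained NeRF MLP on a uniform cube of query points.
    """
    verts, faces, _, _ = measure.marching_cubes(opacity_grid, level=level)
    verts = verts / (np.array(opacity_grid.shape) - 1) * 2 * box - box   # grid indices -> world coords
    pts = verts[np.random.randint(0, len(verts), size=n_points)]         # random surface samples
    # pull every point a small random amount towards the centre -> "thick" surface
    t = np.random.uniform(0.0, shrink, size=(n_points, 1))
    centre = verts.mean(axis=0)
    return pts + t * (centre - pts)

# toy example: a sphere-shaped opacity field instead of a decoded head
g = np.linspace(-1.0, 1.0, 128)
X, Y, Z = np.meshgrid(g, g, g, indexing="ij")
opacity = (np.sqrt(X**2 + Y**2 + Z**2) < 0.6).astype(np.float32)
positions = init_positions(opacity, n_points=10_000)
```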

3.2 Decoder Architecture

Recent works that use a decoder network to estimate Gaussian Splatting attributes from tri-plane features employ fully connected networks [20] or transformer-based models [49]. For our approach, we also use a fully connected network; however, instead of computing all Gaussian attributes at once, we calculate them sequentially. Specifically, we first forward the tri-plane features to the first decoder, which estimates the color. After that, we use the color information together with the tri-plane features and feed them to the next decoder, which estimates the opacity. This is done iteratively until all attributes are decoded (color → opacity → rotation → scale → position offset). Thus, the last decoder receives all preceding attributes along with the tri-plane features. The intuition behind this approach is to create a dependency between the attributes. We hypothesize that, for instance, the scale decoder benefits from information about the color or rotation in order to decide how large the respective Gaussian splat will be. Additionally, the high degrees of freedom of the combined Gaussian splat attributes are reduced heavily for each decoder, allowing for easier specialization.

Inside each decoder, we use three hidden layers, each equipped with 128 neurons and a GELU activation. The output layer has no activation function, except for the scaling decoder. There, we apply an inverted Softplus activation to keep the splats from getting too large, avoiding excessive GPU memory usage during rasterization.

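The sequential decoder can be sketched as follows. The hidden-layer sizes and GELU activations follow the description above; the output activations (sigmoid for opacity, quaternion normalization, and the exact form of the inverted Softplus bound on the scale) are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def head(in_dim, out_dim, hidden=128):
    """One attribute decoder: three hidden layers with 128 units and GELU activations."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.GELU(),
        nn.Linear(hidden, hidden), nn.GELU(),
        nn.Linear(hidden, hidden), nn.GELU(),
        nn.Linear(hidden, out_dim))

class SequentialGaussianDecoder(nn.Module):
    """Decodes sampled tri-plane features into splat attributes one after another,
    feeding every previously decoded attribute into the next decoder."""

    def __init__(self, feat_dim=32):
        super().__init__()
        self.color    = head(feat_dim, 3)
        self.opacity  = head(feat_dim + 3, 1)
        self.rotation = head(feat_dim + 4, 4)
        self.scale    = head(feat_dim + 8, 3)
        self.offset   = head(feat_dim + 11, 3)

    def forward(self, feats):
        c = self.color(feats)
        o = self.opacity(torch.cat([feats, c], dim=-1))
        r = self.rotation(torch.cat([feats, c, o], dim=-1))
        s = self.scale(torch.cat([feats, c, o, r], dim=-1))
        d = self.offset(torch.cat([feats, c, o, r, s], dim=-1))
        # one plausible reading of the "inverted Softplus": an upper-bounded log-scale,
        # so that exp(.) yields a positive scale that cannot grow arbitrarily large
        log_scale = -F.softplus(-s)
        return {
            "rgb": c,                            # view-independent color
            "opacity": torch.sigmoid(o),         # assumed squashing to [0, 1]
            "rotation": F.normalize(r, dim=-1),  # assumed normalization to unit quaternions
            "scale": torch.exp(log_scale),       # bounded positive scale
            "xyz_offset": d,                     # added to the initialized positions
        }

decoder = SequentialGaussianDecoder()
attrs = decoder(torch.randn(1000, 32))           # 1000 splats, 32-dim tri-plane features
```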

Figure 3: A comparison between a parallel decoder, which maps all Gaussian attributes at once, and our sequential decoder, where each attribute is decoded one after another using the prior information.


| Model | Training Data / Target GAN | MSE ↓ | LPIPS ↓ | SSIM ↑ | ID Sim. ↑ | FPS@512 ↑ | FPS@1024 ↑ |
| PanoHead | FFHQ-H | | | | | 37 | N/A |
| Ours | PanoHead | 0.002 | 0.161 | 0.820 | 0.902 | 170 | 90 |
| EG3DLPFF | LPFF | | | | | 32 | N/A |
| Ours | EG3DLPFF | 0.004 | 0.248 | 0.852 | 0.946 | 135 | 96 |
| EG3DFFHQ | FFHQ | | | | | 31 | N/A |
| Ours | EG3DFFHQ | 0.002 | 0.195 | 0.842 | 0.968 | 164 | 132 |

Table 1: Results for training our decoder for different pre-trained 3D-aware GANs. Rows labeled Ours refer to a decoder that was trained with the GAN specified in the row above. For all three decoders we observe high similarity scores along with high rendering speeds.


3.3 Backbone Fine-tuning

In addition to optimizing the weights of the decoder network, we create a copy of the pre-trained 3D generator and optimize its weights as well. This fine-tuning allows the optimization process to adapt the tri-plane features to provide a better basis for creating Gaussian Splatting attributes, as they are inherently different. While NeRFs only require the color and density of a specific location, Gaussian splats additionally have a scale and rotation, thus influencing adjacent regions too.

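In practice this amounts to cloning the generator and registering its parameters with the same optimizer as the decoder, for example as in the sketch below. The two modules are illustrative stand-ins, not the real networks; the learning rate follows Section 4.1.

```python
import copy
import torch
import torch.nn as nn

# illustrative stand-ins for the real networks (not the actual architectures)
pretrained_generator = nn.Linear(512, 96)   # maps a latent code to (flattened) tri-plane features
decoder = nn.Linear(32, 14)                 # maps sampled tri-plane features to splat attributes

backbone = copy.deepcopy(pretrained_generator)    # the original generator stays frozen as the target
optimizer = torch.optim.Adam(
    [{"params": decoder.parameters()},
     {"params": backbone.parameters()}],          # the cloned backbone is fine-tuned with the decoder
    lr=9e-4)
```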

3.4 Loss Functions

The vanilla 3D Gaussian Splatting algorithm uses a combination of L1 loss and structural similarity. This combination has proven to be very successful for learning static scenes. For our purpose, however, of learning a decoder network that is able to synthesize a tremendous diversity of images, we require a loss function that provides better perceptual feedback. This is because we aim to produce a 3D Gaussian Splatting face that looks perceptually very close to the GAN-rendered face, without penalizing the model too much if small structures like hair do not align perfectly. For that reason, we supplement the existing L1 and structural similarity losses with an LPIPS norm [47] and an ID similarity [10] loss. The ID loss is based on a pre-trained face recognition model (ArcFace) and estimates how similar two faces are. Since PanoHead renders the head from all 360° views, we only apply the ID loss when the face is viewed from a frontal viewpoint. Additionally, to guide the decoder towards areas needing finer structural details, we calculate the difference between the synthesized image and the target image after applying a Sobel filter. Formally, our loss function can be expressed as follows:


ℒ = λ1 ℒ_L1 + λ2 ℒ_SSIM + λ3 ℒ_LPIPS + λ4 ℒ_ID + λ5 ℒ_Sobel.   (1)
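A sketch of this objective is given below. The SSIM, LPIPS, and ID terms are passed in as callables (e.g. from a pre-trained LPIPS network and an ArcFace model, which are not reproduced here), the ID term is only added for frontal views, and the Sobel term compares edge maps of the rendering and the target.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Edge maps via a Sobel filter, applied per color channel (img: (B, 3, H, W))."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    k = torch.cat([kx, ky]).repeat(3, 1, 1, 1)                 # (6, 1, 3, 3), grouped per channel
    return F.conv2d(img, k, padding=1, groups=3)

def total_loss(render, target, ssim_loss, lpips_loss, id_loss, frontal,
               w=(0.2, 0.5, 1.0, 1.0, 0.2)):
    """Weighted sum of Eq. (1); ssim_loss / lpips_loss / id_loss are external callables."""
    l1    = F.l1_loss(render, target)
    ssim  = ssim_loss(render, target)
    lpips = lpips_loss(render, target)
    ident = id_loss(render, target) if frontal else torch.zeros(())   # ID loss only for frontal views
    sobel = F.l1_loss(sobel_edges(render), sobel_edges(target))
    return w[0] * l1 + w[1] * ssim + w[2] * lpips + w[3] * ident + w[4] * sobel
```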

4 Experiments

4.1 Implementation Details

In the following experiments, we train our Gaussian Splatting decoder for multiple pre-trained target GANs. These are: EG3D trained on the FFHQ [15] dataset, EG3D trained on the LPFF dataset [41], and PanoHead trained on the FFHQ-H [2] dataset. We train for 500k iterations with an Adam optimizer using a learning rate of 0.0009. Loss weights are set to (λ1, λ2, λ3, λ4, λ5) = (0.2, 0.5, 1.0, 1.0, 0.2) for all experiments unless stated otherwise. For PanoHead, we sample random cameras all around the head, and for EG3D, we sample mainly frontal views with small vertical and horizontal rotations.

We have optimized all training parameters for PanoHead, since it synthesizes full 360° views, making it ideally suited for being rendered in an explicit 3D space.

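The camera sampling can be sketched as follows; the concrete ranges and distributions are illustrative assumptions, not the exact values used in training.

```python
import numpy as np

def sample_camera(full_360: bool, rng=np.random.default_rng()):
    """Sample a (yaw, pitch) camera pose in radians.

    full_360=True  -> PanoHead-style: views all around the head
    full_360=False -> EG3D-style: near-frontal views with small perturbations
    """
    if full_360:
        yaw = rng.uniform(0.0, 2.0 * np.pi)   # anywhere around the head
        pitch = rng.uniform(-0.3, 0.3)        # assumed vertical range
    else:
        yaw = rng.normal(0.0, 0.25)           # small horizontal rotation around frontal
        pitch = rng.normal(0.0, 0.15)         # small vertical rotation
    return yaw, pitch
```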

4.2 Metrics

To evaluate the performance of our decoder, we measure the image similarity for 10k images using MSE, LPIPS, ID similarity, and structural similarity. In order to measure the frame rate, we use a custom visualization tool that is based on the EG3D visualizer. This way, we ensure that the performance differences are purely due to the renderer and not dependent on the programming language or compiler. With a very efficient Gaussian Splatting renderer like the SIBR viewer [19], which is written purely in C++, we could achieve even higher FPS.

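A minimal sketch of the per-image similarity metrics is shown below. LPIPS and ID similarity additionally rely on pre-trained networks (e.g. the lpips package and an ArcFace embedding) and are only indicated in the comments.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mse(a, b):
    """Mean squared error between two images in [0, 1] with shape (H, W, 3)."""
    return float(np.mean((a - b) ** 2))

def ssim(a, b):
    return structural_similarity(a, b, channel_axis=-1, data_range=1.0)

# LPIPS and ID similarity compare each decoded rendering against the GAN output for the
# same latent code and camera; they require pre-trained networks and are omitted here
# to keep the sketch dependency-free.
render = np.random.rand(512, 512, 3)
target = np.random.rand(512, 512, 3)
print(mse(render, target), ssim(render, target))
```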

4.3 Quantitative Results

After training our decoders, we observe a high image similarity between the decoded Gaussian Splatting scenes and the respective target GANs, as stated in Table 1. The low MSE and high SSIM indicate that the renderings have similar colors and structures, respectively. In addition, the LPIPS and ID similarity metrics underline that the images are perceptually very close. The highest ID similarity is found when decoding the EG3DFFHQ model. Here, we reach a similarity score of 0.968. A possible explanation for this is that the FFHQ training dataset contains the fewest images across all three comparisons, making it the easiest to decode, given that there is less variation. The lowest ID similarity is found for the decoder trained with the PanoHead GAN. In this case, the decoder has to learn a full 360° view of the head. This is not captured by the ID similarity, as it is only computed for renderings showing a frontal view.

Considering render speed for each model, rendering the Gaussian Splatting scene achieves about four times the FPS compared to rendering the 3D-aware GANs. Furthermore, as the rendering resolution for Gaussian Splatting is not limited by any super-resolution module, it can be rendered at arbitrary resolution. Here, we observe that when increasing resolution four-fold, we still achieve more than three times the framerate of the GAN models at the lower resolution.


Figure 4: Example renderings of the target images produced by the respective 3D-aware GAN (top row) and renderings of the decoded 3D Gaussian Splatting scene (bottom row, Ours). Additional renderings can be found in the supplementary material.


Figure 5: Renderings along example interpolation paths, demonstrating the possibility of applying GAN editing methods.


Figure 6: Example 360° rendering of our Gaussian Splatting decoder trained with PanoHead.


4.4 Qualitative Results

In addition to the quantitative measures, we also demonstrate our method qualitatively. Figure 4 shows one example rendering for each of the three 3D-GAN methods, along with our decoded Gaussian Splatting scenes. We observe a high visual similarity between target and rendering, as indicated by the image similarity metrics. While EG3D uses one single 3D scene that combines head and background, PanoHead uses two separate renderings. This allows our decoder to exclusively learn the head and rotate it independently of the background. An example of a full 360° rotation of two decoded Gaussian Splatting heads is shown in Figure 6.

Additionally, we observe a reduction in aliasing and texture-sticking artifacts with our 3D representation when rotating the camera around the head. These artifacts were often observed when rendering with EG3D or PanoHead. Specifically, some structures like hair or skin pores shifted when changing the camera viewpoint, instead of moving along with the 3D head. This is no longer the case with our Gaussian Splatting representation, as we produce one fixed explicit 3D scene for each ID.

We also demonstrate in Figure 5 that our decoder allows latent interpolation. This makes various GAN editing and GAN inversion methods applicable to our approach.

In some renderings, we observe that the eyes appear uncanny or blurry. We believe this occurs because the underlying target data produced from the 3D-aware GAN almost exclusively shows renderings where the person is looking towards the camera. The GAN's NeRF-renderer, being view-dependent, likely learns to place the pupils to align with the camera angle. However, as we disabled the spherical harmonics to reduce complexity, our decoder is not able to learn any view dependencies. Instead, it learns an averaged eye, which is slightly blurry and always looks forward. To overcome this limitation, it might be beneficial to incorporate the spherical harmonics into the decoder training for future work.


4.5 Ablation Study

In the following, we justify our design decisions by performing an ablation study on the main components.

Position Initialization: The position initialization is a crucial component of our decoder, as it decides where to place the Gaussian splats. For this, multiple approaches are possible: sampling the points on a 3D grid, sampling the points randomly in a 3D cube, or sampling the points on the surface of a 3D shape created by marching cubes. Interestingly, the quantitative results for all three approaches in Table 2 clearly favor sampling on a 3D grid, as it achieves the overall best scores. Nevertheless, when inspecting the resulting renderings in Figure 7, we observe that grid sampling creates some artifacts. We see some horizontal and vertical lines on the surface of the head, where the splats are placed. Although this is not penalized by the chosen metrics, it significantly decreases the level of realism. Therefore, given that marching cube sampling scores second best while producing good visual results, we chose it for our decoder.


| Sampling Method | LPIPS ↓ | SSIM ↑ | ID Sim ↑ |
| Random Pos | 0.179 | 0.839 | 0.856 |
| 3D Grid | 0.167 | 0.851 | 0.898 |
| March. Cubes | 0.176 | 0.842 | 0.883 |

Table 2: Comparing different position sampling methods.


Figure 7: A visual comparison between different position sampling approaches. When using grid sampling, we observe some grid artifacts on the surface.

Decoder Architecture: The core component of our method is the decoder. Its architecture can have a big influence on the capacity to learn a mapping between tri-plane features and Gaussian Splatting attributes. In the following, we look into three different architecture types: first, the sequential decoder, which decodes each attribute using the information of the previous ones (color → opacity → rotation → scale → position offset); second, a parallel decoder that maps all attributes at once; and third, a sequential decoder in which we invert the order of the decoders. The results for all three decoders are listed in Table 3. We notice that the sequential decoder is the overall best, although by a very slim margin over the parallel decoder. Interestingly, despite having the same number of connections in the network as the parallel decoder, the reversed sequential decoder performs worse, suggesting that the order of decoding significantly impacts performance. A possible explanation for this disparity is that the outputs from earlier stages in the sequential version might introduce noise, thereby impeding the optimization process.


| Architecture | LPIPS ↓ | SSIM ↑ | ID Sim ↑ |
| Sequential | 0.176 | 0.842 | 0.883 |
| Parallel | 0.177 | 0.841 | 0.879 |
| Sequential Reversed | 0.228 | 0.803 | 0.765 |

Table 3: Image difference metrics for decoder architectures.

Backbone Fine-tuning: During the training, we fine-tune the weights of the pre-trained StyleGAN backbone that produces the tri-plane features. This distributes some of the workload from the decoder to the backbone, as we penalize the tri-plane creation if the decoder cannot easily create Gaussian splats from it. Disabling this component leads to a decline in all performance metrics, especially the ID similarity, which drops from 0.883 to 0.858, as shown in Table 4. This demonstrates that fine-tuning the StyleGAN backbone enhances the tri-plane features for decoding them into high-quality Gaussian Splatting scenes.

Loss Functions: Training the decoder network requires appropriate loss functions that yield meaningful gradients. For our proposed decoder training, we employ a combination of several different loss functions. To better understand their individual impact, we conduct an ablation study by training multiple models, each with one loss function deactivated. The resulting renderings, compared using the same ID, can be seen in Figure 8. Here, the biggest difference is visible when deactivating the LPIPS loss. In this case, the rendering becomes very blurry. This is surprising, given that L1 or SSIM are expected to penalize such blurry renderings. Instead, when disabling them, some artifacts are created at the edges. This hints that those loss functions help build the coarse geometry of the face, while LPIPS provides a gradient that creates fine structures.


| Method | LPIPS ↓ | SSIM ↑ | ID Sim ↑ |
| Baseline | 0.176 | 0.842 | 0.883 |
| w/o fine-tuning | 0.188 | 0.837 | 0.858 |
| w/o L1 Loss | 0.175 | 0.841 | 0.881 |
| w/o LPIPS Loss | 0.260 | 0.859 | 0.885 |
| w/o SSIM Loss | 0.174 | 0.832 | 0.880 |
| w/o Sobel Loss | 0.175 | 0.839 | 0.880 |
| w/o ID Loss | 0.176 | 0.842 | 0.827 |

Table 4: Comparison of our baseline model with variants, each with a single component deactivated. While the baseline does not achieve the highest score for every metric, it offers a balanced trade-off across all three metrics combined.


Figure 8: A visual comparison between decoders trained with the respective loss function deactivated.


5 Limitations & Future Work

Since all target images we use to train our framework are generated by either PanoHead or EG3D, the output fidelity of our method is bounded by the fidelity of these 3D GANs. A possible approach to push the visual quality of our renderings closer to photorealism would be to train the entire pipeline, i.e. the generator backbone alongside the decoder, from scratch in a GAN-based end-to-end manner. This approach, while being straightforward in theory, is subject to some challenges, including finding good initial positions for the Gaussians in 3D space and, especially, handling the unstable nature of adversarial training. We aim to tackle these challenges in future work, using the general structure of this framework and the insights obtained while developing it as a foundation.

Moreover, we observe that the eyes of the faces in our scenes appear uncanny or blurry. As described in Section 4.4, we expect to solve this issue by including view-dependent spherical harmonics in the future.


6 Conclusion

We have presented a framework that decodes tri-plane features of pre-trained 3D-aware GANs for facial image generation, such as PanoHead or EG3D, into scenes suitable for rendering with 3D Gaussian Splatting. This not only allows for rendering at up to 5 times higher frame rates with flexible image resolutions, but also enables exporting the resulting scenes into 3D software environments, allowing for realistic 3D asset creation in real-time. As our decoders show very high visual similarity to the 3D-aware target GANs, we are able to maintain high visual quality along interpolation paths, paving the way for applying GAN editing or GAN inversion methods to explicit 3D Gaussian Splatting scenes for the first time. In an in-depth ablation study, we discuss all components of our method, providing a basis for future work on decoding Gaussian Splatting attributes. Going forward, we plan to broaden our training scheme to be able to train a GAN adversarially for the generation of scenes compatible with 3DGS, as mentioned in the previous section.

