[Paper Reading] [In-Depth] 3D Gaussian Splatting for Real-Time Radiance Field Rendering

Table of Contents

    • [1. What:](#1. What:)
    • [2. Why:](#2. Why:)
    • [3. How:](#3. How:)
      • [3.1 Real-time rendering](#3.1 Real-time rendering)
      • [3.2 Adaptive Control of Gaussians](#3.2 Adaptive Control of Gaussians)
      • [3.3 Differentiable 3D Gaussian splatting](#3.3 Differentiable 3D Gaussian splatting)
    • [4. Self-thoughts](#4. Self-thoughts)

1. What:

What does this paper set out to do? (From the abstract and conclusion, summarized in one sentence.)

To satisfy the requirements of efficiency and quality simultaneously, this paper starts from sparse points and represents the scene with 3D Gaussians, which preserve desirable properties of continuous volumetric radiance fields. It then optimizes anisotropic covariances to achieve an accurate representation. Lastly, it introduces a fast visibility-aware rendering algorithm, thereby achieving state-of-the-art results in the field.

2. Why:

Under what conditions or needs was this research proposed (Introduction)? What core problems or deficiencies does it solve, what have others done, and what are the innovations? (From the Introduction and Related Work.)

This may cover background, question, prior work, and innovations:

Three threads of related work frame this question.

  1. Traditional reconstruction methods such as SfM and MVS need to re-project and blend the input images into the novel-view camera, using the recovered geometry to guide this re-projection (from 2D to 3D).

    Drawback: they cannot fully recover from unreconstructed regions, or from "over-reconstruction", when MVS generates non-existent geometry.

  2. Neural Rendering and Radiance Fields

    Neural rendering is a broader category of techniques that leverage deep learning for image synthesis, while a radiance field is a specific technique within neural rendering focused on representing light and color in 3D space.

  • Earlier deep-learning approaches were mainly built on MVS-based geometry, which is also their major drawback.

  • NeRF follows the volumetric-representation line, introducing positional encoding and importance sampling.

  • Faster training methods focus on spatial data structures that store (neural) features subsequently interpolated during volumetric ray-marching, on different encodings, and on MLP capacity.

  • Today, notable works include InstantNGP and Plenoxels, both of which rely on Spherical Harmonics.

    Spherical Harmonics can be understood as a set of basis functions for fitting a function on the sphere in a 3D spherical coordinate system.

    Introduction to Spherical Harmonics (球谐函数介绍) - 知乎 (zhihu.com)

  3. Point-Based Rendering and Radiance Fields
  • Methods from human performance capture inspired the choice of 3D Gaussians as the scene representation.
  • Point-based and spherical rendering had been achieved before.

3. How:

Following the gradient flow through this paper's pipeline, we will try to connect Parts 4, 5, and 6 of the paper.

Firstly, start from the loss function, which combines an ${\mathcal L}_{1}$ loss and an $SSIM$ term, as shown below:

$${\mathcal L}=(1-\lambda){\mathcal L}_{1}+\lambda{\mathcal L}_{\mathrm{D-SSIM}}.\tag{1}$$
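To make Eq. (1) concrete, here is a minimal sketch; the `d_ssim` argument stands in for the structural dissimilarity term $1-\mathrm{SSIM}$ computed by any SSIM implementation, and $\lambda=0.2$ is the value used in the paper.

```python
import numpy as np

# Minimal sketch of Eq. (1). `d_ssim` is assumed to be 1 - SSIM between
# the two images, provided by an external SSIM implementation.
def total_loss(rendered, gt, d_ssim, lam=0.2):
    l1 = np.abs(rendered - gt).mean()        # pixel-wise L1 term
    return (1.0 - lam) * l1 + lam * d_ssim   # Eq. (1)
```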

This loss relates the ground-truth image to the rendered image. So to carry out the optimization, we need to dive into the rendering process. From the related-work chapter, we know point-based $\alpha$-blending and NeRF-style volumetric rendering share essentially the same image formation model. That is

$$C=\sum_{i=1}^{N}T_{i}(1-\exp(-\sigma_{i}\delta_{i}))c_{i}\quad\mathrm{with}\quad T_{i}=\exp\left(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\right).\tag{2}$$

This paper actually uses a typical neural point-based approach with the same image formation model as (2), which can be written as:

$$C=\sum_{i\in N}c_{i}\alpha_{i}\prod_{j=1}^{i-1}(1-\alpha_{j})\tag{3}$$

From this formulation, we can see that the volume representation must carry color $c$ and opacity $\alpha$. These are attached to each Gaussian, with Spherical Harmonics used to represent color, just as in Plenoxels. The other attributes are the position and the covariance matrix. So we now have the four attributes that represent the scene: the position $p$, opacity $\alpha$, covariance $\Sigma$, and SH coefficients representing the color $c$ of each Gaussian.
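As a minimal sketch of Eq. (3), the per-pixel blend of depth-sorted Gaussians could look like this (illustrative NumPy, not the paper's CUDA kernel):

```python
import numpy as np

# Front-to-back alpha blending of one pixel, per Eq. (3).
# colors: (N, 3) per-Gaussian colors c_i, already sorted front-to-back.
# alphas: (N,) per-Gaussian opacities alpha_i after 2D evaluation.
def blend(colors, alphas):
    C = np.zeros(3)
    T = 1.0                      # transmittance: prod_{j<i} (1 - alpha_j)
    for c, a in zip(colors, alphas):
        C += T * a * c
        T *= 1.0 - a
        if T < 1e-4:             # stop once the pixel is nearly opaque
            break
    return C

colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
alphas = np.array([0.6, 0.5])
print(blend(colors, alphas))     # the front (red) Gaussian dominates
```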

Knowing the basic elements we need, let's now work backward, starting with rendering, which was addressed in the authors' previous paper.

3.1 Real-time rendering

This method is independent of gradient propagation but is critical for real-time performance; it builds on a method published in the authors' earlier paper.

Earlier video games already tried to model the world with ellipsoids and render them, which matches the rendering process of Gaussian splatting; the latter, however, makes heavy use of thread- and GPU-level techniques.

  • Firstly, it splits the screen into 16×16 tiles, then culls 3D Gaussians against the view frustum and each tile, keeping only Gaussians whose 99% confidence interval intersects the view frustum.
  • Then each Gaussian is instantiated once per tile it overlaps, and each instance is assigned a key that combines view-space depth and tile ID.
  • Then the Gaussians are sorted by these keys using a single fast GPU radix sort (see the key-packing sketch after this list).
  • Finally, one thread block is launched per tile; for a given pixel, color and transparency values are accumulated by traversing the per-tile list front-to-back, until the accumulated $\alpha$ reaches one.
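A rough NumPy sketch of the key construction and global sort; the exact bit layout is an assumption for illustration, and `np.argsort` stands in for the GPU radix sort:

```python
import numpy as np

# Pack tile ID (high 32 bits) and view-space depth (low 32 bits) into one
# 64-bit key per Gaussian-tile instance. For positive float32 depths, the
# raw bit pattern preserves ordering, so a single sort groups instances by
# tile and orders them front-to-back within each tile.
def build_keys(tile_ids, depths):
    depth_bits = depths.astype(np.float32).view(np.uint32).astype(np.uint64)
    return (tile_ids.astype(np.uint64) << np.uint64(32)) | depth_bits

tile_ids = np.array([3, 1, 3, 1])                 # tiles each instance overlaps
depths = np.array([2.0, 5.0, 1.0, 0.5], dtype=np.float32)
order = np.argsort(build_keys(tile_ids, depths))  # stand-in for GPU radix sort
print(order)                                      # [3 1 2 0]
```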

3.2 Adaptive Control of Gaussians

In fitting Gaussians to the scene, we should exploit both the number and the volume of the Gaussians to strengthen the representation. Two operations are used, named clone and split, as described below.

Both are triggered by large view-space positional gradients: under-reconstruction and over-reconstruction alike exhibit large view-space positional gradients, and we clone or split a Gaussian depending on the condition.
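A simplified sketch of this decision logic; the gradient threshold and the split factor 1.6 follow the paper, while the size threshold, array layout, and names are assumptions:

```python
import numpy as np

GRAD_THRESHOLD = 0.0002    # view-space positional gradient threshold (paper)
SCALE_THRESHOLD = 0.01     # assumed size cutoff between "small" and "large"

# Clone small Gaussians (under-reconstruction) and split large ones
# (over-reconstruction); both cases show large view-space gradients.
def densify(positions, scales, grads):
    needs_densify = grads > GRAD_THRESHOLD
    small = scales.max(axis=1) <= SCALE_THRESHOLD

    clone_mask = needs_densify & small   # copy; the real method also moves the copy
    split_mask = needs_densify & ~small  # replace by two smaller children

    cloned = positions[clone_mask]
    children = np.repeat(positions[split_mask], 2, axis=0)
    child_scales = np.repeat(scales[split_mask], 2, axis=0) / 1.6  # paper divides scale by 1.6

    return cloned, children, child_scales

positions = np.zeros((3, 3))
scales = np.array([[0.005] * 3, [0.05] * 3, [0.005] * 3])
grads = np.array([0.0005, 0.0005, 0.0001])
print([a.shape for a in densify(positions, scales, grads)])  # [(1, 3), (2, 3), (2, 3)]
```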

3.3 Differentiable 3D Gaussian splatting

We have seen the rendering process and the adaptive control of Gaussians. Finally, let's discuss how the gradients are propagated back to the parameters we can optimize. This is mainly about how the Gaussian function itself is handled.

The basic simplified formulation of a 3D Gaussian can be written as:

$$G(x)=e^{-\frac{1}{2}x^{T}\Sigma^{-1}x}.\tag{4}$$
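Evaluating Eq. (4) at an offset $x$ from the Gaussian center is straightforward; a small sketch:

```python
import numpy as np

# Evaluate the unnormalized 3D Gaussian of Eq. (4) at offset x from its center.
def gaussian_3d(x, Sigma):
    return np.exp(-0.5 * x @ np.linalg.inv(Sigma) @ x)

Sigma = np.diag([0.5, 1.0, 2.0])                      # an anisotropic covariance
print(gaussian_3d(np.zeros(3), Sigma))                # 1.0 at the center
print(gaussian_3d(np.array([1.0, 0.0, 0.0]), Sigma))  # falls off with distance
```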

We use $\alpha$-blending to combine these Gaussians into the rendered image, so that we can compute the loss and carry out the optimization. Now we need to know how to optimize, that is, how to compute the gradients of the Gaussians.

When rasterizing, the three-dimensional scene needs to be transformed into two-dimensional space. The authors want the 3D Gaussian to keep its distribution through this transformation (otherwise, if the rasterized footprint had nothing to do with a Gaussian, all the effort would be in vain). So we need a way to transform the covariance matrix into camera coordinates without breaking this property, using a local affine approximation of the projection. That is

$$\Sigma'=JW\Sigma W^{T}J^{T},\tag{5}$$

where $W$ is the viewing transformation and $J$ is the Jacobian of the affine approximation of the projective transformation.
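A sketch of Eq. (5) for a pinhole camera, using the standard first-order (affine) approximation of the perspective projection; the focal lengths and demo values below are illustrative:

```python
import numpy as np

# Project a 3D covariance into 2D screen space per Eq. (5).
# W: 3x3 viewing rotation; t: Gaussian center in camera coordinates;
# fx, fy: focal lengths in pixels (illustrative values).
def project_covariance(Sigma, W, t, fx, fy):
    tx, ty, tz = t
    # Jacobian of (u, v) = (fx*x/z, fy*y/z) evaluated at t
    J = np.array([
        [fx / tz, 0.0,     -fx * tx / tz**2],
        [0.0,     fy / tz, -fy * ty / tz**2],
    ])
    Sigma_cam = W @ Sigma @ W.T    # covariance in camera coordinates
    return J @ Sigma_cam @ J.T     # 2x2 screen-space covariance

Sigma = np.eye(3) * 0.01
print(project_covariance(Sigma, W=np.eye(3), t=(0.1, 0.2, 2.0), fx=500.0, fy=500.0))
```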

Another problem is that the covariance matrix must be positive semi-definite, which direct gradient descent cannot easily guarantee. So we use a scaling matrix $S$ and a rotation matrix $R$ to ensure it. That is

$$\Sigma=RSS^{T}R^{T}\tag{6}$$

We can then use a 3D vector $s$ for scaling and a quaternion $q$ to represent rotation; the gradients are back-propagated to them directly. That completes the optimization process.
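A small sketch of Eq. (6), building a covariance that is positive semi-definite by construction from the scale vector $s$ and a unit quaternion $q$ (illustrative values):

```python
import numpy as np

# Unit quaternion (w, x, y, z) -> 3x3 rotation matrix.
def quat_to_rotmat(q):
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Eq. (6): Sigma = R S S^T R^T is positive semi-definite by construction.
def covariance(s, q):
    R = quat_to_rotmat(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T

s = np.array([0.5, 1.0, 2.0])                  # anisotropic scales
q = np.array([1.0, 0.0, 0.0, 0.0])             # identity rotation
Sigma = covariance(s, q)
print(np.all(np.linalg.eigvalsh(Sigma) >= 0))  # True: a valid covariance
```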

4. Self-thoughts

  1. Summary of different representations
  • Explicit representation: Mesh, Point Cloud
  • Implicit representation
    • Volumetric representation: NeRF

      The density value returned at the sample points reflects whether there is geometric occupancy at that location.

    • Surface representation: SDF (Signed Distance Function)

      Outputs the distance from a point in space to the nearest surface, where a positive value indicates the point is outside the surface and a negative value indicates it is inside (see the sphere example below).
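For example, a sphere of radius $r$ centered at the origin has this SDF:

```python
import numpy as np

# Signed distance to a sphere of radius r at the origin:
# positive outside, negative inside, zero on the surface.
def sphere_sdf(p, r=1.0):
    return np.linalg.norm(p) - r

print(sphere_sdf(np.array([2.0, 0.0, 0.0])))  #  1.0 (outside)
print(sphere_sdf(np.array([0.5, 0.0, 0.0])))  # -0.5 (inside)
```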

