ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception
30 Jan 2023
Abstract
We present ASH, a modern and high-performance framework for parallel spatial hashing on GPU. Compared to existing GPU hash map implementations, ASH achieves higher performance, supports richer functionality, and requires fewer lines of code (LoC) when used for implementing spatially varying operations from volumetric geometry reconstruction to differentiable appearance reconstruction. Unlike existing GPU hash maps, the ASH framework provides a versatile tensor interface, hiding low-level details from the users. In addition, by decoupling the internal hashing data structures and key-value data in buffers, we offer direct access to spatially varying data via indices, enabling seamless integration to modern libraries such as PyTorch. To achieve this, we 1) detach stored key-value data from the low-level hash map implementation; 2) bridge the pointer-first low level data structures to index-first high-level tensor interfaces via an index heap; 3) adapt both generic and non-generic integer-only hash map implementations as backends to operate on multi-dimensional keys. We first profile our hash map against state-of-the-art hash maps on synthetic data to show the performance gain from this architecture. We then show that ASH can consistently achieve higher performance on various large-scale 3D perception tasks with fewer LoC by showcasing several applications, including 1) point cloud voxelization, 2) retargetable volumetric scene reconstruction, 3) non-rigid point cloud registration and volumetric deformation, and 4) spatially varying geometry and appearance refinement. ASH and its example applications are open sourced in Open3D (http://www.open3d.org).
1 INTRODUCTION
3D space is only one dimension higher than the 2D image space, yet the additional dimension introduces an unpredictable multiplier to the computational and storage cost. This increased dimension makes 3D perception tasks such as geometric reconstruction and appearance refinement challenging to implement. To reduce complexity, one compromise is to reuse dense data structures and constrain the 3D space by bounding the region of interest, e.g., adopting a 3D array in a bounded space [40]. While this simple approach succeeds at the scale of objects, it cannot meet the demands of room- or city-scale perception, which is necessary for virtual tours, telepresence, and autonomous driving.
Since the 3D space is generally a collection of 2D surface manifolds, its sparsity can be exploited by partitioning to reduce computational cost. The general idea is to split the large 3D space into smaller regions and only proceed with the non-empty ones. There is a plethora of well-established data structures for 3D space partitioning. Examples include trees (Octree [36], KD-tree [4]) and hash maps (spatial hashing [41]). While trees are able to adaptively achieve high precision, they 1) require an initial bounding volume and 2) usually take unbalanced traversal time for a batch of spatial queries and hence 3) are less friendly to batched operations. On the other hand, spatial hashing coupled with a plain array structure is more scalable and parallelizable for reconstruction tasks.
In classical dense SLAM pipelines [41], [47], spatial hashing is used to map 3D coordinates to internal pointers that are only accessible in GPU device code. A pre-allocated memory pool allows dynamically growing hash maps, but adaptive rehashing is generally unavailable. Recent feature-grid encoders [37] apply spatial hashing to map 3D coordinates to features stored at grid points in a static bounded region. While differentiable spatial queries are supported, collisions are not resolved, limiting their usage to stochastic feature optimization where incorrect key-value mappings are tolerated. A general, user-friendly, collision-free hash map for efficient spatial perception at scale is missing.
The reason for this absence is understandable. A parallel hash map on GPU has to resolve collisions and thread conflicts, and preferably organize an optimized memory manager, none of which is trivial to implement. Previous studies have attempted to tackle the problem in one or more aspects, driven by their selected downstream applications. Furthermore, most of the popular parallel GPU hash maps are implemented in C++/CUDA and only expose low-level interfaces. As a result, customized extensions must start from low-level programming. While these designs usually guarantee performance under certain circumstances [2], [3], [41], [47], as of today, they leave a gap from the standpoint of the research community, which prefers off-the-shelf libraries for fast prototyping in a high-level scripting language with tensors and automatic differentiation. Our motivation is to bridge this gap to enable researchers to develop sophisticated 3D perception routines with less effort, and to drive the community towards large-scale 3D perception.
To this end, we design a modern hash map framework with the following major contributions:
- a user-friendly *dynamic*, *generic*¹, and collision-free hash map interface that enables tensor I/O, *advanced indexing*, and *in-place automatic differentiation* when bridged to autodiff engines such as PyTorch;
- an index-first adaptor that supports various state-of-the-art parallel GPU hash map backends and accelerates hash map operations with an improved structure-of-arrays (SoA) data layout;
- a number of downstream applications that achieve higher performance compared to state-of-the-art implementations with fewer LoC.
Experiments show that ASH achieves better performance with fewer LoC on both synthetic and real-world tasks.
- ¹A dynamic hash map supports insertion and deletion after hash map construction. A generic hash map supports arbitrary-dimensional keys and values in various data types.
2 RELATED WORK
2.1 Parallel Hash Map
The hash map is a data structure that seeks to map sparse keys (e.g., unbounded indices, strings, coordinates) from the set $K$ to values from the set $V$ with amortized $O(1)$ access. It has a hash function $h : K \to I_n,\ k \mapsto h(k)$ that maps the key to the index set $I_n = \{0, 1, \ldots, n-1\}$ for indexing (or addressing) that is viable on a computer.
Ideally, with a perfect injective hash function $h$, a hash map could be implemented by $H : K \to V,\ k \mapsto v^A[h(k)]$, where $v^A$ is an array of objects of type $V$ and $[\cdot]$ is the trivial array element accessor. However, in practice, it is intractable to find an injective map given a sparse key distribution in $K$ and a constrained index set $I_n$ of size $n$ due to the computational budget. Therefore, modifications are required to resolve inevitable collisions, where $i = h(k_1) = h(k_2),\ k_1 \neq k_2$. There are two classes of techniques for collision resolution: open addressing and separate chaining. Open addressing searches for another candidate $j \neq i,\ j \in I_n$ via a probing algorithm until an empty address is found. The simplest probing scheme, linear probing [28], computes $j = (h(k) + t) \bmod n$ starting from $i = h(k)$, where $t$ is the number of attempts. Separate chaining, on the other hand, maintains multiple entries per mapped index: a linked list is grown at $i$ if $i = h(k_1) = h(k_2),\ k_1 \neq k_2$.
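As a concrete illustration of the open-addressing scheme above, a minimal serial map with linear probing can be sketched as follows (the class and its names are illustrative, not part of any library discussed here):

```python
EMPTY = object()  # sentinel for unoccupied slots

class LinearProbingMap:
    """Toy open-addressing hash map: probe j = (h(k) + t) mod n."""

    def __init__(self, n):
        self.n = n
        self.keys = [EMPTY] * n  # keys are stored to detect collisions
        self.vals = [EMPTY] * n

    def _h(self, k):
        # Placeholder hash; production maps use stronger mixing.
        return hash(k) % self.n

    def insert(self, k, v):
        for t in range(self.n):              # at most n probe attempts
            j = (self._h(k) + t) % self.n
            if self.keys[j] is EMPTY or self.keys[j] == k:
                self.keys[j], self.vals[j] = k, v
                return j
        raise RuntimeError("hash map is full")

    def find(self, k):
        for t in range(self.n):
            j = (self._h(k) + t) % self.n
            if self.keys[j] is EMPTY:
                return None                  # empty slot reached: key absent
            if self.keys[j] == k:
                return self.vals[j]
        return None
```

Because stored keys are compared at each probe, two keys with the same hashed index are still distinguished correctly.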
While hash map implementations are widely available for CPU, their GPU counterparts have only emerged in the recent decade. Most GPU hash maps use open addressing [2], [17], [27], [53], mainly due to its simplicity of implementation and its capability of handling highly concurrent operations. CUDPP [2] utilizes Cuckoo Hashing [44], while CoherentHash [17] adopts Robin Hood Hashing [6], both involving advanced probing designs. Although performant when $K, V$ are limited to integer sets, these variations cannot be generalized to spatial hashing and only allow static input. Recently, WarpCore [27] has proposed support for non-integer $V$ and dynamic insertion, but the key domain is still limited to at most 64 bits.
There are also a few separate chaining implementations on GPU involving device-side linked lists. SlabHash [3] builds a linked list with a 128-bit *Slab* as the minimal unit, optimized for Single Instruction Multiple Threads (SIMT) warp operations. Although SlabHash allows dynamic insertion, similar to the aforementioned GPU hash maps, only integer $K, V$ are supported. stdgpu [52] follows the conventional C++ Standard Library std::unordered_map and builds supporting vectors, bitset lock guards, and linked lists from scratch, resulting in a generic, dynamic hash map. Despite these rich functionalities, however, stdgpu is not optimized for large value sets. In addition, due to its low-level templated design, users have to write device code even for simple tasks.
We refer the readers to a comprehensive review of GPU hash maps [33].
2.2 Space Partitioning Structures
3D data is not as simple to organize as 2D images. While a 2D image can be stored in a dense matrix, exploiting sparsity in 3D data is paramount due to present-day limits in computer memory.
The most widely used data structures for 3D indexing are arguably trees. A KD-tree [4] recursively sorts k-dimensional data along a selected axis and partitions the data at the median point. By nature, a KD-tree is designed for neighbor search. In 3D, it is mainly used to organize 3D points and their features. Examples include normal estimation and nearest-neighbor association in Iterative Closest Points (ICP) [49], [62], and 3D feature association in global registration [50], [59]. GPU adaptations of KD-trees exist [51], [58], but they are not suitable for incrementally changing scenes, as they are usually constructed once and queried repeatedly.
Bounding volume hierarchy (BVH) is another hierarchical representation that organizes primitives such as objects and triangles in 3D. There are various GPU adaptations [19], [32], [55] mostly targeted at ray tracing and dynamic collision detection. While a parallel construction is possible and deformation of the nodes is allowed, the tree structure typically remains unchanged, assuming a fixed layout.
While KD-trees and BVH split the space unevenly by data distribution, an Octree [36], on the other hand, recursively partitions the 3D space evenly into 8 subvolumes according to space occupation states. It has been widely used in adaptive 3D mapping [23], [38] for robot navigation. There have been parallel implementations on GPU, from optimized data structures [22] to domain-specific languages [24]. However, these works generally focus on physics simulation within a bounded region of interest where the spatial partition is predefined. While parallel incremental division [57] is possible, an initial bounding region is still required, and the trees are not guaranteed to be balanced.
Spatial hashing is another variation of spatial management, with $O(1)$ access time via hash maps. Bundled with small dense 3D arrays, it has been widely used in real-time volumetric scene reconstruction within an unbounded region of interest. A handful of CPU implementations achieve real-time performance [20], [30] at the expense of resolution. Similarly, GPU implementations [12], [15], [41], [47] reach high frame rates using GPU-based spatial hashing. However, all of these studies depend on ad hoc GPU hash maps exclusive to the specific systems. Concurrent race conditions have not been fully resolved in several implementations [12], [41], where volumes can be randomly under-allocated. Recent neural feature grids [37] apply spatial hashing in a bounded volume to query voxelized feature embeddings. These approaches use simplified hashing designs that are not collision-free, and are thus only compatible with stochastic optimization that tolerates noise from incorrect queries.
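For reference, the coordinate hash commonly used in such voxel hashing pipelines multiplies each integer block coordinate by a large prime and XORs the results (the primes below follow the widely used choice of Teschner et al.); the block-size convention in `point_to_block` is an illustrative sketch, not the exact convention of any cited system:

```python
import math

# Large primes commonly used for hashing integer 3D block coordinates.
P1, P2, P3 = 73856093, 19349669, 83492791

def spatial_hash(x, y, z, n):
    """Map an integer block coordinate (x, y, z) to a bucket index in [0, n)."""
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % n

def point_to_block(p, voxel_size=0.01, block_resolution=16):
    """Assign a world-space point to the voxel block containing it."""
    block_size = voxel_size * block_resolution
    return tuple(math.floor(c / block_size) for c in p)
```

A world point is first quantized to its block coordinate, which is then hashed to locate the small dense 3D array storing that block's voxels.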
2.3 Spatially Varying 3D Representations
A truncated signed distance function (TSDF) [11] is an implicit representation of surfaces, recording point-wise distance to the nearest surface point. It is frequently used for dense scene reconstruction with noisy input. The distribution of surfaces is generally spatially varying and therefore, a proper parameterization is often necessary, either in a discrete [41] or neural [7], [37] form. Non-rigid deformation methods [39], [60] seek to embed point clouds in a deformable grid, where each point is anchored to and deformed by neighbor grids. They are mainly used for animation or non-rigid distortion calibration. Similar to a deformation grid, complex lighting for rendering can be approximated by spatially-varying spherical harmonics (SVSH) [35] placed at a sparse grid. These grids are natural applications of spatial hashing. A comprehensive review of spatially varying representations for real-world scene reconstruction is available [64].
While ad hoc implementations of these representations have been introduced on both CPU and GPU, ASH provides a device-agnostic interface that requires less code and delivers better performance. Table 1 compares various aspects of existing GPU hash maps, either as standalone data structures (Section 2.1) or embedded in applications (Section 2.2). To the best of our knowledge, ASH is the first implementation that simultaneously supports dynamic insertion, ensures correctness via atomic collision-free operations, allows generic keys, and offers a modern tensor interface and Python binding for better usability.
TABLE 1 Comparison of existing parallel GPU hash maps. ASH preserves the dynamic, generic, and atomic properties, and is extendable to the non-templated high-level Python interfaces.
3 OVERVIEW
Before plunging into the details, we first provide a high-level overview of our framework in Fig. 1.
Conventional parallel hash maps reorganize the structure-of-arrays (SoA) input, i.e., the separate key array and value array, into an array of structures (AoS) where keys and values are paired, inserted, and stored. Therefore, an array of pointers to pair structures (std::pair in C++, thrust::pair in CUDA, and tuple in Python) is returned upon query. Consequently, operations from insertion and query to in-place value increments require users to write device code and visit the AoS through low-level pointers.
In contrast, ASH sticks to SoA. Fig. 1 shows the workflow of ASH. Instead of pointers to pairs, ASH returns index and mask arrays that can be directly consumed by tensor libraries such as PyTorch [46] (without memory copy) and NumPy [21] (with a GPU-to-host memory copy). As a result, post-processing functions such as duplicate key removal and in-place modification can be chained with insertion and query in ASH via advanced indexing, without writing any device code. As a general and device-agnostic interface for parallel hash maps, our framework is built upon switchable backends whose details are hidden from the user. Currently, separate chaining backends are supported, including the generic stdgpu [52] backend and the SlabHash [3] backend, extended from integer-only to arbitrary key-value data types. TBB's concurrent hash map [31] powers the CPU counterpart with an interface identical to the GPU's.
Fig. 1. The interface illustration of ASH. Left: ASH takes a key and/or a list of value tensors as input, with a one-liner interface. For insertion, tensors are organized in SoA. Right: buffer indices and masks are returned upon insertion and query, organized together in SoA. Chained functions can be easily applied to hashed data by indexing ASH buffers with indices and masks. Examples include selecting unique keys through insertion, and applying in-place value increment through query. Differentiable ops can be applied in downstream applications.
In this paper, we use calligraphic letters to represent sets, and normal lower-case letters for their elements. Normal upper-case letters denote functions. Bold lower-case letters denote vectors of elements, or arrays from the programmer's perspective. Bold upper-case letters are for matrices. For instance, in a hash map, we are interested in key elements $k \in \mathcal{K}$ and their vectorized processing, e.g., the query $Q(\pmb{k})$. Specifically, we use $\mathcal{I}_n$ to denote the set of indices $\{0, 1, \ldots, n-1\}$, and $\Theta$ the boolean selection set $\{0, 1\}$. Given an arbitrary vector $\pmb{x}$, we denote by $\pmb{x}(\pmb{i})$ and $\pmb{x}(\pmb{\theta})$ the indexing and selection functions applied to $\pmb{x}$, where $\forall i \in \pmb{i},\ i \in \mathcal{I}_n$ and $\forall \theta \in \pmb{\theta},\ \theta \in \Theta$. We use $\langle \mathcal{K}, \mathcal{V} \rangle$ as the key and value sets for a hash map. $h : \mathcal{K} \to \mathcal{I}_n$ is the internal hash function that converts a key to an index. $H : \mathcal{K} \to \mathcal{V}$ is the general hash map enclosing $h$.
4 THE ASH FRAMEWORK
4.1 Classical Hashing
In a hash map $\langle \mathcal{K}, \mathcal{V} \rangle$, since the hash function $h$ cannot be perfect, as discussed in Section 2, we have to store keys to verify whether collisions happen ($h(k_1) = h(k_2)$ but $k_1 \neq k_2$, $k_1, k_2 \in \mathcal{K}$).
In separate chaining, the bucket-linked list architecture is used to resolve hash collisions. With $n$ initial buckets, we construct the hash function $h : \mathcal{K} \to \mathcal{I}_n$, where $\mathcal{I}_n$ is defined in Section 2. As shown in Fig. 2, keys with the same hashed index $i = h(k) \in \mathcal{I}_n$ are first aggregated in the $i$-th bucket, where a linked list grows adaptively to accommodate different keys. A conventional hash map stores key-value pairs as its storage units. Consequently, two keys $k_1, k_2$ can be distinguished by checking $h(k_1) = h(k_2)$ and then $k_1 = k_2$ on the stored pair, and manipulation of the keys and values is achieved by iterating over such pairs.
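A serial sketch of this bucket-linked list layout may help (Python lists stand in for the per-bucket linked lists; names are illustrative):

```python
class ChainedHashMap:
    """Toy separate-chaining hash map: one chain of [key, value] pairs
    per bucket; stored keys disambiguate colliding hashes."""

    def __init__(self, n_buckets):
        self.n = n_buckets
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, k, v):
        chain = self.buckets[hash(k) % self.n]
        for pair in chain:
            if pair[0] == k:     # key already present: update in place
                pair[1] = v
                return
        chain.append([k, v])     # grow the per-bucket linked list

    def find(self, k):
        for key, val in self.buckets[hash(k) % self.n]:
            if key == k:
                return val
        return None
```

Keys hashing to the same bucket coexist in one chain, mirroring Fig. 2.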
Fig. 2. Illustration of a classical hash map using separate chaining. Keys (left) are put into the corresponding buckets (middle) obtained by the hash function h. A linked list (right) is constructed per bucket to store key-value pairs that fall within the same bucket but have unequal keys.
With this formulation, assuming a subset $\mathcal{X} \subset \mathcal{K}$ has been inserted into the hash map with associated values $\mathcal{Y} \subset \mathcal{V}$, a query function can be described as

$$Q_{\mathcal{X}, \mathcal{Y}}(x) = \langle k, v \rangle, \quad \text{s.t. } k = x, \tag{1}$$

where $\langle k, v \rangle$ forms a concrete pair stored in the hash map. This format is common in implementations, e.g., in C++ (std::unordered_map) and Python (dict).
4.2 Function Chaining and Parallel Hashing
The element-wise operation in Eq. 1 can be extended to vectors via parallel device kernels. However, interpretation of the returned iterators of pairs is still required at the low level. In other words, although the parallel version can be implemented efficiently, results are still packed in an AoS instead of an SoA:

$$Q_{\mathcal{X}, \mathcal{Y}}(\pmb{x}) = \{\langle k, v \rangle \mid k = x,\ x \in \pmb{x}\},$$

i.e., an array of pointers to pairs.
This forms a barrier when the parallel query is located in a chain of functions. For instance, to apply any function $G$ (e.g., a geometry transformation) over the result of a query, the low-level function *second* that selects the value element from a pair $\langle k, v \rangle$ must be provided to dereference the low-level structures and manipulate the keys and values in place. In other words, we have to implement a non-trivial $\widetilde{G}$:

$$\widetilde{G} = G \circ \text{second} \circ Q_{\mathcal{X}, \mathcal{Y}},$$

to force the conversion from AoS to SoA and chain a high-level function $G$ with $Q_{\mathcal{X}, \mathcal{Y}}$². This could be tedious when prototyping geometry perception routines that require hash map structures, since off-the-shelf operations would have to be reimplemented in device code.
- ²This could be achieved simply by returning a copy of the values, but it is not feasible, especially when dealing with large-scale data, e.g., hierarchical voxel grids.
We reformulate this problem by introducing two affiliated arrays, $k^B$ and $v^B$ (note the superscript $B$ for buffering; they are not the input $\pmb{k}, \pmb{v}$), of capacity $c \geq n$, where $n$ is the number of buckets. These arrays are designed for the explicit storage of keys and values, respectively, and serve as buffers to support a natural SoA layout. They are exposed to users for direct access and in-place modification. Now the query function can be rewritten as

$$Q_{\mathcal{X}, \mathcal{Y}}(x) = i, \quad \text{s.t. } k^B(i) = x,$$
and this version is ready for parallelization. At this stage, to combine $G$ and $Q_{\mathcal{X}, \mathcal{Y}}$, we can chain $G \circ v^B \circ Q_{\mathcal{X}, \mathcal{Y}}$ to manipulate values:

$$G\big(v^B(Q_{\mathcal{X}, \mathcal{Y}}(\pmb{x}))\big) = (G \circ v^B \circ Q_{\mathcal{X}, \mathcal{Y}})(\pmb{x}),$$
which retains convenient properties such as array vectorization and advanced indexing.

When the input set $\widetilde{\mathcal{X}} \not\subset \mathcal{X}$ is not fully stored in the hash map, our formulation maintains its effectiveness via a simple masked extension:

$$Q_{\mathcal{X}, \mathcal{Y}}(x) = (i, \theta), \quad \theta = \begin{cases} 1, & x \in \mathcal{X}, \\ 0, & \text{otherwise}, \end{cases}$$

where the buffer index $i$ is valid only when $\theta = 1$,
which is also ready for parallelization. Now the chaining of functions is given by

$$G\big(v^B(\pmb{i}(\pmb{\theta}))\big),$$

using advanced indexing with masks. We can also select valid queries with $\pmb{k}(\pmb{\theta})$ without visiting $k^B$. While our discussion was about the query function, the same applies to insertion.
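The index-first workflow can be sketched in NumPy terms as follows; a brute-force lookup stands in for the real hash map, and all names are illustrative rather than the actual ASH API:

```python
import numpy as np

# Explicit key/value buffers k_B, v_B with three occupied slots.
k_B = np.array([10, 20, 30, 0, 0, 0, 0, 0])
v_B = np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0])

def query(keys, n_occupied=3):
    """Return (buffer indices i, masks theta) for a batch of keys."""
    idx = np.zeros(len(keys), dtype=np.int64)
    msk = np.zeros(len(keys), dtype=bool)
    for q, key in enumerate(keys):
        hits = np.flatnonzero(k_B[:n_occupied] == key)
        if hits.size > 0:
            idx[q], msk[q] = hits[0], True
    return idx, msk

# Chain a high-level function G (here: doubling) over found values in
# place, i.e. G(v_B(i(theta))), purely via advanced indexing.
i, theta = query(np.array([20, 99, 10]))  # key 99 is absent
v_B[i[theta]] *= 2.0
```

No device code is involved: the indices and masks returned by the query drive ordinary tensor advanced indexing on the value buffer.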
In essence, by converting the pair-first AoS to an index-first SoA format with the help of array buffers, we can conveniently chain high-level functions over hash map queries and insertions. This simple change enables easy development on top of hash maps and unleashes their potential for fast prototyping and differentiable computation. However, the layout requires fundamental changes to the hash map data structure. With this in mind, we move on to illustrate how the ASH layer converts the AoS in native backends to our SoA layout.
4.3 Generic Backends
We start by converting stdgpu [52], a state-of-the-art generic GPU hash map, into a backend of ASH. stdgpu follows the convention of its CPU counterpart std::unordered_map by providing a templated interface. The underlying implementation is a classical bucket-linked list structure with locks to avoid race conditions on GPU. To exploit the power of a generic hash map without reinventing the wheel, we seek to reuse the operations over keys (i.e., the lock-guarded bucket and linked list operations) and redirect the value mapping to our buffer $v^B$.
A dynamic GPU hash map requires dynamic allocation and freeing of keys and values in device kernels. With a pre-allocated key buffer $k^B$ and value buffer $v^B$, we maintain an additional index heap $\pmb{h}$, as shown in Fig. 3. The index heap stores buffer indices $i$ pointing into the buffers $k^B, v^B$, as a map $P : \mathcal{I}_c \to \mathcal{I}_c$, where the heap top $t$ maintains the currently available buffer index in $\pmb{h}[t]$. The heap top starts at $t = 0$, and is atomically increased at allocation and decreased at free. With $\pmb{h}$ and the dynamically changing $t$, we instantiate a generic hash map whose templated value type in stdgpu is $\mathcal{V} = \text{Int32}$, where the values are buffer indices $i$ stored in $\pmb{h}$ for accessing the user-exposed $k^B, v^B$.
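The index heap can be sketched serially as follows (the atomicAdd/atomicSub of the device code become plain increments here; names are illustrative):

```python
class IndexHeap:
    """Toy index heap: a pre-filled array of free buffer indices with a
    heap top t. Allocation pops h[t]; freeing pushes an index back."""

    def __init__(self, capacity):
        self.h = list(range(capacity))  # all buffer indices start free
        self.t = 0                      # heap top

    def allocate(self):
        i = self.h[self.t]              # device code: atomicAdd on t
        self.t += 1
        return i

    def free(self, i):
        self.t -= 1                     # device code: atomicSub on t
        self.h[self.t] = i              # the freed index becomes next to go
```

A freed buffer index lands at the new heap top and is therefore reused by the next allocation.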
Fig. 3. Illustration of a generic hash map bridged to tensors in ASH. Buffer indices $i$ are dynamically provided by the available indices maintained in the index heap at the increasing heap top $t$ (middle), acting as the values in the underlying hash map (left). It connects the hash map and the actual keys and values stored in the buffers (right) accessed through $i$. The key-bucket correspondences are the same as in Fig. 2, omitted for simplicity.
4.3.1 Insertion
The insertion of a $\langle k, v \rangle \in \langle \mathcal{K}, \mathcal{V} \rangle$ pair is now decoupled into two steps: i) insertion of $\langle k, i \rangle \in \langle \mathcal{K}, \mathcal{I}_c \rangle$ into the hash map, where $i$ is the buffer index dynamically acquired from the heap top $h[t]$; and ii) insertion of $k^B(i) := k$, $v^B(i) := v$ into the buffers.
A naive implementation would acquire a buffer index $i$ from $h$ on every insertion attempt and free it if the insertion fails because the key already exists. However, when running in parallel, the interleaved atomicAdd and atomicSub calls may conflict among threads, leading to race conditions. A two-pass insertion could resolve the issue: in the first pass, we allocate a batch of indices from $h$ determined by the input size, attempt the insertions, and record the results; in the second pass, we return the indices of failed insertions to $h$.
We adopt a more efficient one-pass lazy insertion. We first attempt to insert $\langle k, -1 \rangle$, with $-1$ as a dummy index, into the backend and observe whether it succeeds. If not, nothing needs to be done. Otherwise, we capture the returned pointer to the pair, trigger the allocation of an index $i$ from $h$, and directly replace the dummy $-1$ with $i$. This significantly reduces the overhead when key uniqueness is low (i.e., many duplicates exist among the keys to be inserted).
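The control flow of lazy insertion can be sketched sequentially; a Python dict stands in for the stdgpu backend, all names are illustrative, and on GPU the dummy $-1$ is patched through the pair pointer returned by the backend inside the device kernel:

```python
# Sequential sketch of one-pass lazy insertion (illustrative names; a dict
# models the lock-guarded backend, heap.pop() models index heap allocation).
def lazy_insert(backend, heap, key_buf, keys):
    buf_indices, masks = [], []
    for k in keys:
        if k in backend:                # attempt to insert <k, -1> fails:
            buf_indices.append(backend[k])
            masks.append(False)         # duplicate key, no heap traffic at all
            continue
        backend[k] = -1                 # attempt succeeds with the dummy index
        i = heap.pop()                  # only now allocate a buffer index
        backend[k] = i                  # replace the dummy -1 with i
        key_buf[i] = k
        buf_indices.append(i)
        masks.append(True)
    return buf_indices, masks

backend, heap, key_buf = {}, list(range(7, -1, -1)), [None] * 8
idx, ok = lazy_insert(backend, heap, key_buf,
                      [(0, 0, 0), (1, 0, 0), (0, 0, 0)])
```

Note that the duplicate third key triggers no allocation or free at all, which is exactly the saving over the naive and two-pass schemes.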
4.3.2 Query
The query operation is relatively simpler. We first look up the buffer index $i \in \mathcal{I}_c$ given $k$ in the backend. On success, we have $k = k^B(i)$, and the target $v = v^B(i)$ is accessible to users through $i$.
4.4 Non-generic Backends
While generic GPU hash maps have only recently become available, the parallel computation research community has long focused on more controlled setups where both $\mathcal{K}$ and $\mathcal{V}$ are limited to certain dimensions or data types. We seek to generalize this non-generic setup with our index heap and verify its performance in more real-world applications. In this section, we show how ASH can be used to generalize SlabHash [3], a warp-oriented dynamic GPU hash map that only allows insertions and queries of the Int32 data type.
An extension to generic key types is non-trivial for SlabHash since its warp operations only apply to variables with limited word length. Our implementation extends the hash set variation of SlabHash, where only integer keys are maintained in the backend.
4.4.1 Generalization via Index Heap
The index heap $h$ is the core of generalizing the SlabHash backend. In brief, a generic key is represented by its associated buffer index $i$ in an integer-only hash set, allocated the same way as discussed in Section 4.3.1. As illustrated in Fig. 4, all insertions and queries are redirected from the buffer indices to the actual keys and values via the index heap. However, the actual implementation involves more complicated changes in design.
Given a generic key $k$, we first locate the bucket $b = h(k) \in \mathcal{I}_c$. Ideally, we can then allocate a buffer index $i$ at $h$'s top $t$ and insert it into the linked list at bucket $b$ in the integer-only hash set. The accompanying key and value are put in $k^B(i)$, $v^B(i)$. During a query, we similarly first locate the bucket $b$, then search for the key in the linked list by visiting $k^B$ via the stored index $i$.
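The delegate-key idea can be sketched with a toy bucketed hash set; the bucket count, function names, and list-based buckets are illustrative (the real backend is warp-oriented SlabHash), but the essential point survives: only integer indices live in the set, and key equality is resolved by dereferencing $k^B$:

```python
# Sketch of delegate keys: the integer-only hash set stores buffer indices i,
# and comparisons go through the key buffer k^B (all names illustrative).
NUM_BUCKETS = 4

def bucket_of(key):
    return hash(key) % NUM_BUCKETS

def insert(buckets, key_buf, heap, key):
    b = buckets[bucket_of(key)]
    for i in b:                        # linked-list scan, comparing via k^B
        if key_buf[i] == key:
            return i, False            # key already present
    i = heap.pop()                     # allocate a buffer index
    key_buf[i] = key                   # actual key goes to the buffer...
    b.append(i)                        # ...only the integer enters the set
    return i, True

def find(buckets, key_buf, key):
    for i in buckets[bucket_of(key)]:
        if key_buf[i] == key:
            return i
    return -1

buckets = [[] for _ in range(NUM_BUCKETS)]
key_buf, heap = [None] * 8, list(range(7, -1, -1))
i0, ok0 = insert(buckets, key_buf, heap, (1, 2, 3))
i1, ok1 = insert(buckets, key_buf, heap, (1, 2, 3))  # duplicate is rejected
```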
Fig. 4. Illustration of a non-generic hash set enhanced by ASH. Integer buffer indices $i$ allocated from the index heap (middle) are inserted as delegate keys directly into the hash set (left), associated with the actual keys in the buffer (right) at $i$. The key-bucket correspondences are the same as in Figs. 2 and 3, omitted for simplicity.
4.4.2 Multi-Pass Insertion
Although queries can be applied as mentioned above, the lazy insertion from Section 4.3.1 is problematic in this setup. The main reason is that, while the race condition in inserting the index $i$ does not occur in warp-oriented insertions, copying the actual key $k \in \mathcal{K}$ to $k^B$ requires global memory writes. These may not be synchronized among threads, as copying a multi-dimensional key takes several non-atomic instructions. As a result, the insertion of a key $k_2$ could be accidentally triggered when i) the index $i_1 \in \mathcal{I}_c$ of a duplicate $k_1 (= k_2)$ has been inserted, but ii) the actual key $k_1$ has only been partially copied to the buffer $k^B$. Unsynchronized, this would mistakenly yield $k^B(i_1) = k_1 \neq k_2$, followed by the unexpected insertion of $k_2$. In practice, with more than 1 million keys inserted in parallel, such conflicts happen with low probability (≤ 0.1%), but they do happen. To resolve the conflicts, we split insertion into three passes:
• Pass 1: batch insert all keys $k$ into $k^B$ by directly copying all candidates via corresponding indices $i$ batch-allocated from $h$;
• Pass 2: perform parallel hashing with the indices $i$ from pass 1. In this pass, keys are read-only in the global buffers and hence free of race conditions. Successful insertions are marked in a mask array;
• Pass 3: batch insert values into $v^B$ for the successfully masked entries, and return the remaining indices to $h$.
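The three passes can be sketched sequentially; on GPU each pass is a parallel kernel, and here `try_insert` is an illustrative stand-in for the integer-only hash set insertion:

```python
# Sequential sketch of three-pass insertion (illustrative names; each pass
# is a parallel GPU kernel in ASH).
def multi_pass_insert(hashset_insert, heap, key_buf, val_buf, keys, values):
    # Pass 1: batch-allocate indices and copy every candidate key to k^B.
    indices = [heap.pop() for _ in keys]
    for i, k in zip(indices, keys):
        key_buf[i] = k
    # Pass 2: hash over now read-only keys; record success in a mask array.
    masks = [hashset_insert(i, key_buf) for i in indices]
    # Pass 3: commit values for winners; return losers' indices to the heap.
    for i, v, won in zip(indices, values, masks):
        if won:
            val_buf[i] = v
        else:
            heap.append(i)
    return indices, masks

seen = {}
def try_insert(i, key_buf):
    # The first index to claim a key wins, mimicking hash set semantics.
    return seen.setdefault(key_buf[i], i) == i

heap = list(range(7, -1, -1))            # free indices, top at the end
key_buf, val_buf = [None] * 8, [None] * 8
idx, ok = multi_pass_insert(try_insert, heap, key_buf, val_buf,
                            [(0, 0, 0), (0, 0, 0), (1, 1, 1)], [10, 11, 12])
```

The duplicate key's value (11) is never copied, and its index is recycled, matching the claim below that the expensive value copies happen without redundancy.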
While the multi-pass operation incurs overhead, it is still practical for a dynamic hash map. First, keys are relatively inexpensive to copy, especially for spatial coordinates, while the more expensive copying of values is done without redundancy. Second, a dynamic hash map generally reserves sufficient memory for further growth, so inserting all keys would not exceed the buffer capacity.
4.5 Rehashing and Memory Management
While the buffers are represented as fixed-size arrays, storage must grow to accommodate accumulated input data that can exceed the hash map's capacity, e.g., 3D points from an RGB-D stream. This triggers rehashing: we adopt the conventional ×2 strategy common in the C++ Standard Library to double the buffer size, collect all active keys and values, and batch insert them into the enlarged buffer.
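The ×2 policy can be sketched with a toy map; the class and its fields are illustrative stand-ins (a dict models the backend, and frees are omitted for brevity):

```python
# Sketch of x2 rehashing (illustrative names): when active entries reach
# capacity, double the buffers and batch-reinsert all active pairs.
class SketchMap:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.backend = {}                   # key -> buffer index
        self.key_buf = [None] * capacity
        self.val_buf = [None] * capacity

    def insert(self, key, value):
        if key in self.backend:
            return self.backend[key]
        if len(self.backend) == self.capacity:
            self._rehash()
        i = len(self.backend)               # next free slot (no frees here)
        self.backend[key] = i
        self.key_buf[i], self.val_buf[i] = key, value
        return i

    def _rehash(self):
        # Collect active pairs, double the buffers, batch-reinsert.
        active = [(self.key_buf[i], self.val_buf[i])
                  for i in self.backend.values()]
        self.capacity *= 2
        self.key_buf = [None] * self.capacity
        self.val_buf = [None] * self.capacity
        self.backend.clear()
        for k, v in active:
            self.insert(k, v)

m = SketchMap(capacity=2)
for p in [(0, 0, 0), (1, 0, 0), (2, 0, 0)]:
    m.insert(p, sum(p))                     # third insert triggers rehashing
```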
In dynamic insertions, there can be frequent freeing and allocation of small memory blobs that are adjacent and mergeable. In view of this, we implement another tree-structured global GPU memory manager similar to PyTorch's [46].
4.6 Dispatch Routines
To enable bindings to non-templated languages, e.g., Python, the tensor interface is non-templated and instead takes data types and shapes as arguments. In the context of spatial hashing, we support arbitrary-dimensional keys by expanding the dispatcher macros in C++. Floating-point types have precision-dependent equality behavior on GPU; we therefore recommend converting them to integers at the desired precision before using the hash map.
We additionally dispatch values by their element byte sizes into the intrinsically supported vector types: int, int2, int3, and int4. This adaptation accelerates trivially copyable value objects such as int3, and supports non-trivially copyable value blocks (e.g., an $8^3$ array referenced through a void pointer). It improves the insertion of large value chunks by approximately a factor of 10.
4.7 Multi-value Hash Map and Hash Set
ASH supports multi-value hash maps that store values organized in SoA, as well as hash sets with only keys and no values.
Multi-value hash maps.
Various applications in 3D processing require mapping coordinates to several properties. For instance, a 3D coordinate in a point cloud can be mapped to a normal, a color, and a label. While the mapped values could be packed as an array of structures (i.e., AoS) to fit a hash map, code complexity would increase, since structure-level functions have to be implemented. We generalize the hash map's functionality by extending the single value buffer $v^B$ to an array of value buffers $\{v^B_i\}$ and looping over the properties per index during an insertion. This simple change supports the storage of complex value types in SoA, allowing easy vectorized query and indexing.
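The SoA extension amounts to one shared buffer index addressing several property buffers; a minimal sketch with illustrative names:

```python
# Sketch of the multi-value SoA extension: one buffer index addresses an
# array of per-property value buffers (illustrative names).
capacity = 8
value_buffers = {
    "normal": [None] * capacity,
    "color":  [None] * capacity,
    "label":  [None] * capacity,
}

def insert_values(i, **properties):
    # The per-index loop over properties; buffer index i comes from the
    # single underlying hash map insertion.
    for name, v in properties.items():
        value_buffers[name][i] = v

insert_values(0, normal=(0.0, 0.0, 1.0), color=(255, 0, 0), label=7)
insert_values(1, normal=(1.0, 0.0, 0.0), color=(0, 255, 0), label=3)
# A query returns buffer indices; each property is then gathered directly.
labels = [value_buffers["label"][i] for i in (0, 1)]
```

Because each property lives in its own contiguous buffer, a batch of query indices can be used to gather any single property without touching the others, which is what makes the vectorized indexing cheap.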
Hash set.
A hash set, on the other hand, is a simplified hash map: an unordered set that stores unique keys. It is generally useful for maintaining a set by rejecting duplicates, such as in point cloud voxelization. By removing $v^B$ and ignoring value insertion, a hash map becomes a hash set.
5 EXPERIMENTS
We start with synthetic experiments to show that ASH, with its optimized memory layout, increases performance while improving usability. All experiments in this section are conducted on a laptop with an Intel i7-6700HQ CPU and an Nvidia GTX 1070 GPU. In all experiments, we assume the hash map capacity is equivalent to the number of input keys (regardless of duplicates). Each reported time is an average of 10 trials.
5.1 Spatial Hashing with Generic Backend
The first experiment compares vanilla stdgpu against ASH with the stdgpu backend (ASH-stdgpu). For fairness, we extend the examples of stdgpu such that an array of iterators and masks is returned for in-place manipulations. The number of buckets and the load factor are determined internally by stdgpu.
Setup 1. We test randomly generated 3D spatial coordinates mapped to float value blocks of varying sizes. The key $\mathcal{K}$, value $\mathcal{V}$, capacity $c$, and uniqueness $\rho$ are chosen as follows:
where $\rho$ indicates the ratio of the number of unique keys to the total number of keys being inserted or queried.
Fig. 5. Hash map performance comparison between ASH-stdgpu and the vanilla stdgpu with 3D integer keys. Each curve shows the average operation time (y-axis) with varying hash map value sizes in bytes (x-axis), given a controlled backend, input length, and input key uniqueness ratio. Lower is better. Further factors are denoted by the legends on the right. ASH-stdgpu runs consistently faster than the vanilla stdgpu.
Fig. 5 illustrates the comparison between vanilla stdgpu and ASH-stdgpu. For the insert operation, ASH-stdgpu is significantly faster than stdgpu at the low uniqueness $\rho = 0.1$, and the performance gain grows as the value byte size increases. This is mainly due to the SoA memory layout and the lazy insertion mechanism, where a lightweight integer $i \in \mathcal{I}_c$ is inserted in an attempt instead of the actual value $v \in \mathcal{V}$. At the high input uniqueness $\rho = 0.99$, ASH-stdgpu maintains the performance advantage at low and medium value sizes, and its performance is comparable to stdgpu at large value sizes. This indicates that our dispatch pattern for copying values helps in high-throughput scenarios. For the find operation, ASH-stdgpu is consistently faster than vanilla stdgpu under both high and low key uniqueness settings.
In addition to insert and find, we introduce a new activate operation. It "activates" the input keys by inserting them into the hash map and obtaining the associated buffer indices, without writing values. This is especially useful when the element-wise initialization can be pre-determined and applied separately. Examples include TSDF voxel blocks (zeros) and multi-layer perceptrons (random initializations). The activate operation is absent in most existing hash maps and is only available as hard-coded functions [12], [41], [47].
With the activate operation, we conduct ablation studies comparing the insertion time of the keys alone against that of both keys and values. Fig. 6 compares the runtime of insert and activate in ASH-stdgpu. The key, value, capacity, and uniqueness choices are the same as in Setup 1. We observe that while the insertion time increases with the value size, the activation time remains stable.
Fig. 6. Study of the activate operation introduced in ASH against insert with 3D integer keys on the ASH-stdgpu backend. Each curve shows the average operation time (y-axis) with varying hash map value sizes in bytes (x-axis), given an input length and input key uniqueness ratio. Lower is better. Activate keeps a stable runtime in the tasks that do not require explicit value insertion, while insert time increases corresponding to the hash map value size.
5.2 Integer Hashing with Non-Generic Backend
Next, we compare ASH based on the SlabHash backend (ASH-slab) with vanilla SlabHash. Since SlabHash only supports integers as keys and values, we limit our ASH-slab backend to the same integer types here. The number of buckets is 2× the capacity (a load factor of approximately 0.5), since it is empirically the best factor when ASH-slab is applied to non-generic and generic tasks. Since vanilla SlabHash only supports data I/O from the host, we include the data transfer time between host and device when measuring the performance of ASH-slab.
Setup 2. We test random scalar integer keys mapped to scalar float values. The key $\mathcal{K}$, value $\mathcal{V}$, capacity $c$, and uniqueness $\rho$ are chosen as follows:
As shown in Fig. 7, although ASH-slab forgoes the non-blocking warp-oriented operations in SlabHash in order to support generic key and value types, our insert is still comparable to vanilla SlabHash, which is optimized solely for integers. The drop in ASH-slab's performance as $\rho$ increases is an expected indication that the overhead of multi-pass insertion grows correspondingly.
It is worth mentioning that, with an improved global memory manager, the construction of an ASH-slab hash map takes less than 1 ms under all circumstances, while vanilla SlabHash consistently takes 30 ms for its redundant slab memory manager. In practice, where a hash map is constructed and used once (e.g., in voxelization), ASH-slab is the more practical solution.
Fig. 7. Hash map performance comparison between ASH-slab and the vanilla SlabHash with 1D integer keys and values. Each curve shows the average operation time (y-axis) with varying input key uniqueness ratio (x-axis), given an input length. Lower is better. ASH-slab retains a comparable performance for integers while supporting generalization to arbitrary dimensional keys and values of various data types.
5.3 Ablation Between Backends
We now conduct an ablation study with different backends in ASH, namely ASH-stdgpu and ASH-slab, with arbitrary input key-value types beyond integers. The experimental setup follows Section 5.1.
In Fig. 8, we can see that ASH-stdgpu outperforms ASH-slab in most circumstances with the 3D coordinate keys and varying-length values that are common in real-world applications. While the warp-oriented operations heavily used in SlabHash enjoy the benefits of intrinsic acceleration, they sacrifice the granularity of operations: threads can only move on to the next task once all the operations in a warp (of 32 threads) are finished. As a result, early termination upon an insertion failure is less likely in a warp-oriented hash map. If the data layout is not well-distributed for the intrinsic operations (e.g., low-uniqueness input, or keys with long word widths), the performance drop can be significant.
This observation is more apparent for insertion under varying input densities. With a relatively small value size and high uniqueness, ASH-slab performs better. When the uniqueness is low, however, each thread in ASH-slab still has to finish a similar workload before termination, while ASH-stdgpu can reject many failed insertions early and move on to the following workloads. As of now, ASH-slab is suitable for the voxel downsampling application, while ASH-stdgpu is better for other tasks. Therefore, we set stdgpu as the default backend for ASH in the remaining sections.
Fig. 8. Ablation study of hash map performance with 3D integer keys over different backends. Each curve shows the average operation time (y-axis) with varying hash map value sizes in bytes (x-axis), given a controlled backend, input length, and input key uniqueness ratio. Lower is better. ASH-stdgpu outperforms ASH-slab in most circumstances with the 3D integer keys and varying-length values that are common in real-world applications.
5.4 Code Complexity
We now study usage at the user end. First of all, the ASH framework, regardless of the backend used, is compiled into a library in advance. A C++ developer can simply include the header, build the example directly with a CPU compiler, and link against the precompiled library with a light tensor engine. An equivalent Python interface is provided via pybind [26], as shown in Fig. 1.
In comparison, using SlabHash's interface with an input array from host memory requires a CUDA compiler, along with manual configuration of the bucket/linked-list chains. For further performance improvement, detailed memory management has to be done manually via cudaMalloc, cudaMemcpy, and cudaFree. stdgpu provides a built-in memory manager but requires writing device and host functions. In a query operation, SlabHash returns the found values by copy, so in-place modification requires further modification of the library. stdgpu exposes iterators in an AoS fashion, so device code has to be implemented to reinterpret an array of iterators and masks for further operations.
The compilation complexity and the interface LoC required for the same functionality in C++ are listed in Table 2.
TABLE 2 Comparison of the complexity of coding (top) and LoC (bottom) of each operation among the implementations. Unlike stdgpu and SlabHash, ASH does not require that users write device code or use a CUDA compiler. It requires few LoC for construction, query, and insertion.
6 APPLICATIONS
We now present a number of applications and ready-to-use systems in 3D perception to demonstrate the power of ASH with fewer LoC and better performance. The presented applications include:
- Point cloud voxelization;
- Retargetable volumetric reconstruction;
- Non-rigid registration and deformation;
- Joint geometry and appearance refinement.
The first two experiments are conducted on an Intel i7-6700HQ CPU and an Nvidia GeForce GTX 1070 GPU for indoor scenes. Outdoor scene experiments are run on an Intel i7-11700 CPU and an Nvidia RTX 3060 GPU. The rest are done on an Intel i7-7700 CPU and an Nvidia GeForce GTX 1080Ti GPU.
6.1 Point Cloud Voxelization
Setup 3. In voxelization, a hash map maps a point cloud's discretized coordinates to its natural array indices, and the hash map capacity is the point cloud size, typically ranging from $10^5$ to $10^7$.
Voxelization is a core operation for discretizing coordinates. It is essential for sparse convolution [9], [10] at the quantization preprocessing stage, and is often used to generate point cloud "pyramids" [62] in coarse-to-fine 3D registration for improved speed and robustness.
Voxelization is a natural task for parallel hashing, as the essence of the operation is to discard duplicates at grid points. To achieve this, we first discretize the input by converting the coordinates from the continuous meter metric to voxel units. A simple hash set insertion then eliminates the duplicates and associates them with the remaining unique coordinates. The returned indices can be reused for tracing other properties associated with the input, such as colors and point normals. The Python code for voxelization can be found in Listing 1.
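Listing 1 is not reproduced in this excerpt; as a rough CPU stand-in, the same logic can be sketched with NumPy, where `np.unique` plays the role of the GPU hash set insertion and `voxelize` is an illustrative name:

```python
# CPU sketch of voxelization (illustrative; ASH performs the unique step
# with a parallel GPU hash set instead of np.unique).
import numpy as np

def voxelize(points, voxel_size):
    # Discretize metric coordinates into integer voxel units.
    voxels = np.floor(points / voxel_size).astype(np.int64)
    # Keep one representative per occupied voxel; return_index gives the
    # surviving point indices, reusable to trace colors/normals.
    _, indices = np.unique(voxels, axis=0, return_index=True)
    return voxels[indices], indices

points = np.array([[0.01, 0.02, 0.0],
                   [0.02, 0.01, 0.0],   # falls into the same 5 cm voxel
                   [0.27, 0.00, 0.0]])
unique_voxels, kept = voxelize(points, voxel_size=0.05)
```

The `kept` indices are exactly the "returned indices" mentioned above: indexing the input's color or normal arrays with them carries those attributes through the downsampling.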
Fig. 9. Visualization of point cloud voxelization. Top: scene-level largescale inputs. Bottom: fragment-level small-scale inputs. Left: original point clouds. Right: voxelized point clouds.
We compare the voxelization implemented in ASH with two popular implementations, MinkowskiEngine [10] on CUDA and Open3D [62] on CPU. Our experiments are conducted on a large scene input with $8 \times 10^6$ points, typical for scene perception, and a small fragment of the scene with $5 \times 10^5$ points, typical for an RGB-D input frame, as shown in Fig. 9.
To evaluate the performance, we vary the voxel size parameter from 5 mm to 5 cm, a typical range across voxelization applications from dense reconstruction to feature extraction. In Fig. 10, we can see that our implementation consistently outperforms the baselines for inputs at both scales. Meanwhile, measuring the LoC written in C++ (the Python wrappers are one-liners) required for this functionality given the hash map interface, we observe that ASH requires only 28 LoC, while MinkowskiEngine and Open3D on CPU take 71 and 72 LoC, respectively.
Fig. 10. Performance comparison of voxelization. Each curve shows the run time (y-axis) over the varying voxel size (x-axis). Lower is better. ASH is consistently faster than Open3D's default voxelizer (CPU) and MinkowskiEngine (CUDA).
6.2 Retargetable Volumetric Reconstruction
6.2.1 Truncated Signed Distance Function
Scene representation with a truncated signed distance function (TSDF) computed from a sequence of 3D inputs was introduced in [11] and adapted to RGB-D in [40]. It takes a sequence of depth images $\{D_j\}$ with their poses $\{\mathbf{T}_j \in SE(3)\}$ as input, and seeks to optimize the signed distance, an implicit function value $d$ per point at $\mathbf{x} \in \mathbb{R}^3$. The signed distance measured for frame $j$ is given by
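The display equation was lost in extraction; under the KinectFusion convention the text adopts, a standard reconstruction of the per-frame projective signed distance is (the exact sign convention and symbols may differ slightly from the original):

```latex
d^j(\mathbf{x}) \;=\; D_j\!\big(\Pi(\mathbf{T}_j^{-1}\mathbf{x})\big) \;-\; \big[\mathbf{T}_j^{-1}\mathbf{x}\big]_z
```

with $[\cdot]_z$ denoting the depth (range reading) of the point transformed into the camera frame.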
where $\Pi$ projects the 3D point to 2D with a range reading after a rigid transformation. To reject outliers, a truncation function $\Psi_\mu(d) = \mathrm{clamp}(d, -\mu, \mu)$ is applied to $d^j$. There are multiple variations of $\Psi$ and of the definition of the signed distance $d^j$ [5]. For this paper, we follow the convention in KinectFusion [40].
With a sequence of inputs, the per-point signed distance can be estimated in least squares with a closed-form solution
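The closed-form solution (presumably Eq. 11 in the original numbering) was lost in extraction; a standard reconstruction consistent with the surrounding text is the weighted average

```latex
d(\mathbf{x}) \;=\; \frac{\sum_j w^j \,\Psi_\mu\!\big(d^j(\mathbf{x})\big)}{\sum_j w^j}
```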
where $w^j$ is a weight selected based on view angles and distances [5]. In other words, with a sequence of depth inputs and their poses, we can measure the TSDF at any 3D point within the camera frustums. We can also rewrite Eq. 11 incrementally:
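The incremental form was also lost in extraction; the standard running-average reconstruction of the weighted least-squares solution is

```latex
d \;\leftarrow\; \frac{w\,d \,+\, w^j\,\Psi_\mu\!\big(d^j\big)}{w + w^j},
\qquad
w \;\leftarrow\; w + w^j
```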
where $w$ is the accumulated weight paired with $d$.
Equipped with a projection model $\Pi$ that converts a point to a signed distance, TSDF reconstruction can be generalized to imaging LiDARs [13] for larger-scale scenes.
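The incremental rewrite of Eq. 11 is a running weighted average, which can be sketched as follows. This is a minimal illustration of the update rule; the name `integrate_observation` is ours:

```python
def integrate_observation(d, w, d_j, w_j):
    """One incremental TSDF update (running weighted average, Eq. 12).

    d, w:   accumulated signed distance and weight stored in the voxel.
    d_j, w_j: truncated signed distance and weight observed in frame j.
    Returns the updated (d, w) pair.
    """
    d_new = (w * d + w_j * d_j) / (w + w_j)
    return d_new, w + w_j
```

Applying this update per frame reproduces the closed-form least-squares estimate over the full sequence, without storing past observations.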
6.2.2 Spatially Hashed TSDF Blocks
Setup 4. In a scene represented by a volumetric TSDF grid, a hash map maps the coarse voxel blocks' coordinates to the TSDF data structure of each voxel block. The hash map capacity is typically $10^3$ to $10^5$ for small to large-scale indoor scenes:
While recent neural representations utilize multi-layer perceptrons to approximate the TSDF in continuous space [7], classical approaches use discretized voxels. Such representations have a long history and are ready for real-world, real-time applications. They can also provide data for training neural representations.
The state-of-the-art volumetric discretization for TSDF reconstruction is spatial hashing, where voxels are allocated around surfaces on demand at a resolution of around 5mm. While it is possible to hash high-resolution voxels directly, the access pattern would be less cache-friendly, as neighbor voxels are scattered across the hash map. A hierarchical structure is a better layout, where small dense voxel grids (e.g., of shape $8^3$ or $16^3$) are the minimal unit in the hash map; fine-grained access can be redirected to simple 3D indexing. In other words, a voxel can be indexed by a hash map lookup coupled with direct local addressing
where $s$ is the voxel size and $l$ is the voxel block resolution, as described in Setup 4.
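The coupled indexing of Eq. 13 splits a point into a hash map key and an in-block offset. A minimal sketch with illustrative names (`locate_voxel`), using floor division so negative coordinates are handled correctly:

```python
import numpy as np

def locate_voxel(x, s, l):
    """Split a 3D point into (block key, local voxel index).

    s: voxel size; l: voxel block resolution (e.g., 8 or 16).
    The block key addresses the hash map; the local index addresses
    the dense l^3 grid stored as the hash map value.
    """
    v = np.floor(np.asarray(x) / s).astype(np.int64)  # global voxel coordinate
    block = v // l           # hash map key; floor division handles negatives
    local = v - block * l    # in-block offset, each component in [0, l)
    return block, local
```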
While previous implementations [12], [41], [47] have achieved remarkable performance, modularized designs are missing. Geometry processing and hash map operations were coupled due to the absence of a well-designed parallel GPU hash map. One deficiency of this design is unsafe parallel insertion, where the capacity of a hash map can be exceeded. Another is ad hoc, recurring low-level linked list access in geometry processing kernels, which causes high code redundancy. Our implementation demonstrates the first modularized pipeline in which safe hash map operations are used without any ad hoc modifications.
6.2.3 Voxel Block Allocation and TSDF Integration
Setup 5. For an input depth image, a hash map maps the unprojected point coordinates to the active indices described in Setup 4. The capacity of this hash map is typically $10^5$ to $10^6$, with $10^2$ to $10^3$ valid entries:
In the modularized design, we first introduce a double hash map structure for voxel block allocation and TSDF estimation. Voxel block allocation identifies points from $\{D_j\}$ as surfaces and computes coordinates with Eq. 13. Intuitively, they could be directly inserted into the global hash map described in Setup 4. This is achievable in an ad hoc implementation where the core of the hash map is modified at the device code level and unsafe insertion is allowed [41], [47]. However, in a modularized and safe setup, this could lead to problems. A VGA-resolution depth input contains $640 \times 480 \approx 3 \times 10^5$ points and easily exceeds the empirical global hash map capacity. As we have mentioned, rehashing will be triggered under such circumstances, which is both time- and memory-consuming, especially for a hash map with memory-demanding voxel blocks as values.
To address this issue without changing the low-level implementation and sacrificing safety, we introduce a second hash map, the local hash map from Setup 5. This hash map is similar to the one used in voxelization: it maps discretized 3D coordinates unprojected from depths to integer indices. With this setup, a larger input capacity is acceptable, as the local hash map is lightweight and can be cleared or constructed from scratch per iteration.
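The role of the throwaway local map can be sketched as follows. This is an illustrative CPU stand-in, not the GPU implementation: a Python `dict` plays the persistent global hash map, and `np.unique` plays the per-frame local hash map that deduplicates ~$10^5$ point coordinates down to ~$10^3$ block keys before the global map is touched. The names (`allocate_blocks`, `block_factory`) are ours:

```python
import numpy as np

def allocate_blocks(points, s, l, global_map, block_factory):
    """Frame-wise voxel block allocation via a throwaway local map.

    points: (N, 3) surface points unprojected from a depth image.
    global_map: dict from block-key tuples to voxel block payloads.
    block_factory: creates an empty voxel block payload on demand.
    Returns the block keys active in this frame.
    """
    # Discretize points to block coordinates (Eq. 13 at block granularity).
    keys = np.floor(points / (s * l)).astype(np.int64)
    # Local "hash map": deduplicate to the block scale.
    unique_keys = np.unique(keys, axis=0)
    active = []
    for key in map(tuple, unique_keys):
        if key not in global_map:      # safe insertion: input already small
            global_map[key] = block_factory()
        active.append(key)
    return active
```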
Fig. 11. Illustration of local and global hash maps iteratively used in real-time reconstruction. The local hash map activates voxel blocks enclosing points observed in the viewing frustum. The global hash map accumulates such activated blocks and maintains all the blocks around the isosurface.
There are two main benefits to using a local hash map: it converts the input from the $10^5$ raw point scale to the $10^3$ voxel block scale, which is safe for the global hash map without rehashing; as a byproduct, it keeps track of the active voxel blocks for the current frame $j$, which can be directly used in the following TSDF integration and ray casting. The local and global hash maps can be connected through indices, where a query of a coordinate in the local map is redirected to the global map in-place. Fig. 11 shows the roles of the two hash maps. Listing 2 details the construction of and interaction between the two hash maps.
By accessing the global hash map's TSDF and colors through the returned indices, TSDF integration can then be implemented following Eq. 12 in a pure geometry function, either in a low-level GPU kernel or in a high-level vectorized Python script. The spatial hash map is detached from the core geometric computation, providing more flexibility in performance optimization.
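Such an index-based, vectorized integration step might look like the following sketch. The buffer layout and names (`integrate_frame`, flat `tsdf`/`weight` arrays) are illustrative assumptions; the point is that no hash map code appears in the geometry function:

```python
import numpy as np

def integrate_frame(tsdf, weight, indices, d_obs, w_obs):
    """Vectorized Eq. 12 over voxels returned by a hash map query.

    tsdf, weight: flat buffers backing the global hash map's values.
    indices: buffer indices of the active voxels (query result).
    d_obs, w_obs: per-voxel truncated SDF and weight for the new frame.
    Updates the buffers in place; pure geometry, no hash map access.
    """
    w_old = weight[indices]
    w_new = w_old + w_obs
    tsdf[indices] = (w_old * tsdf[indices] + w_obs * d_obs) / w_new
    weight[indices] = w_new
```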
6.2.4 Surface Extraction
A volumetric scene reconstruction is not usable by most software and solutions until the results are exported to point clouds or triangle meshes. Hence, we implement a variation of Marching Cubes [34] that extracts vertices with triangle faces at zero crossings in a volume. In a spatially hashed implementation, boundary voxels in voxel blocks are harder to process, since queries of neighbor voxel blocks are frequent and shared vertices among triangles are hard to track. One common approach is to simply visit vertices at isosurfaces and disregard duplicates, but this usually results in a heavily redundant mesh [30], [47], or in time-consuming post-processing to merge vertices [41]. Another method is to introduce an auxiliary volumetric data structure to atomically record the vertex-voxel association, but the implementations are overly complex and require frequent low-level hash map queries coupled with surface extraction [12], [14].
Now that we have a unified hash map interface, we simplify the voxel block neighbor search routine [12] and set up a 1-radius neighbor lookup table in advance, as described in Listing 3. Surface extraction is then detached from hash map access and can be optimized separately. As low-hanging fruit, point cloud extraction is implemented with the same routine by ignoring the triangle generation step. In fact, surface extraction of the medium-scale scenes shown in Fig. 13 takes less than 100ms, making interactive surface updates possible in a real-time reconstruction system.
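The 1-radius lookup table can be sketched as a batched pre-pass: for each active block, resolve the buffer indices of its 26 spatial neighbors (plus itself) once, so the extraction kernel never queries the hash map in its inner loop. This illustrative version (`build_neighbor_table`) again uses a `dict` as a stand-in for the hash map and `-1` to mark unallocated neighbors:

```python
import numpy as np

def build_neighbor_table(active_keys, global_map):
    """Precompute per-block neighbor buffer indices for surface extraction.

    active_keys: list of block-key tuples active in this pass.
    global_map: dict from block-key tuples to buffer indices.
    Returns an (N, 27) table; entry -1 means the neighbor is unallocated.
    """
    # All 27 offsets in the 1-radius neighborhood, in a fixed order.
    offsets = [(dx, dy, dz)
               for dx in (-1, 0, 1)
               for dy in (-1, 0, 1)
               for dz in (-1, 0, 1)]
    table = np.full((len(active_keys), 27), -1, dtype=np.int64)
    for i, key in enumerate(active_keys):
        for n, (dx, dy, dz) in enumerate(offsets):
            neighbor = (key[0] + dx, key[1] + dy, key[2] + dz)
            table[i, n] = global_map.get(neighbor, -1)
    return table
```

With this table, a boundary voxel's neighbor lookup reduces to a constant-time array read, and the same routine serves point cloud extraction by skipping triangle generation.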
Fig. 13. Visualization of triangle mesh extracted from the real-time dense SLAM system on scene lounge and copyroom in the fast mode. Rendered with Mitsuba 2 [42].
6.2.5 Ray Casting
Another way to interpret a volume is through ray casting or ray marching. Given camera intrinsics and extrinsics, ray casting renders depth and color images by marching rays in the spatially hashed volumes, querying color and TSDF values, and finding zero-crossing interfaces. It allows rendering at known viewpoints, synthesizing novel views, and estimating camera poses.
Various accelerations can speed up ray casting. Adaptive spherical ray casting and a precomputed min-max range estimate [47] constrain the search range and boost performance. The latter can be conducted by simply projecting the active keys collected in Listing 2, without the involvement of hash maps. In addition, we can squeeze more from our double-hash-map architecture. Conventional ray marching applies queries in the global hash map [12], [41], [47]. Since the local hash map is directly associated with the global hash map through shared active indices, we can replace the global hash map with the local one, accompanied by the $v^B$ buffer in the global hash map. With such a simple change, we can now query the more compact local hash map and access the global hash map in-place without touching the geometric computations. Since out-of-frustum voxel blocks are ignored, this operation slightly sacrifices rendering completeness at image boundaries, as shown in Fig. 12. However, it is able to boost speed by a factor of 5, and is useful for real-time systems such as dense SLAM.
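The core marching loop, independent of which hash map serves the queries, can be sketched as follows. This is a simplified scalar version with our own names (`march_ray`, `query_tsdf`); it steps at a fixed rate rather than adaptively, and refines the zero crossing by linear interpolation:

```python
import numpy as np

def march_ray(origin, direction, query_tsdf, t_min, t_max, step):
    """March along a ray until the TSDF changes sign (front surface).

    query_tsdf(p) returns the TSDF at point p, or None if the enclosing
    voxel block is unallocated (e.g., a local hash map miss).
    Returns the refined depth of the zero crossing, or None if no hit.
    """
    t_prev, d_prev = None, None
    t = t_min
    while t < t_max:
        d = query_tsdf(origin + t * direction)
        if d is not None:
            if d_prev is not None and d_prev > 0 >= d:
                # Zero crossing between t_prev and t: linear interpolation.
                return t_prev + step * d_prev / (d_prev - d)
            t_prev, d_prev = t, d
        t += step
    return None
```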
Fig. 12. Visualization of volumetric ray casting. From left to right: rendered depth from local, global hash maps, and input ground truth depth. Note the difference at boundaries.
6.3 Non-Rigid Volumetric Deformation
6.4 Joint Geometry and Appearance Refinement
7 CONCLUSION AND FUTURE WORK
We presented ASH, a performant and easy-to-use framework for spatial hashing. Both synthetic and real-world experiments demonstrate the power of the framework. With ASH, users can achieve the same or better performance in 3D perception tasks while writing less code.
There are various avenues for future work. At the architecture level, we seek to introduce the open-addressing variants [2], [27] of parallel hash maps for flexibility and potentially high-performance static hash maps. At the low level, we plan to further optimize the GPU backend and accelerate the CPU counterpart, potentially with cache-level optimization and code generation [24], [25]. We also plan to apply ASH to sparse convolution [10], [54] and neural rendering [16], [48], where spatially varying parameterizations are exploited.
ASH accelerates a variety of 3D perception workloads. We hope that the presented framework will serve both research and production applications.