Learn Tech, Learn English: Data Structures - Elasticsearch BKD Tree

I worked on Elasticsearch back in 2015, when it was best known for its text search capabilities built on inverted indexes. When I picked it up again last year for another project, I saw that Elasticsearch had added core support for data types beyond text, such as numbers, IP addresses, and geospatial types.

As I looked into what enables optimized search over such data types, I stumbled upon BKD trees. Surprisingly, there is not much written about BKD trees apart from a white paper and a few blog posts. This post covers the ideas leading up to the development of BKD trees and their advantages, starting from KD trees.

We will start with the BST (Binary Search Tree), which forms the base for this post. A BST is a binary tree in which, for every node, smaller elements sit in the left subtree and greater elements in the right subtree. This article will not go further into insertion, deletion, and search in a BST, since there are numerous sources covering those.

A BST, or a self-balancing variant such as an AVL tree, leverages the ability to halve the search space at each node during traversal, resulting in an O(log N) search when the tree is balanced. BSTs can be balanced by rotating the tree around a pivot node.

A major limitation of the BST is its inability to deal with multiple dimensions. For example, suppose we have a store of latitudes and longitudes and are asked to search for a specific latitude-longitude pair. It is easy to use a BST to search on either the latitude or the longitude, but not on both together, since a BST can index only one dimension.

What do we do if we have multiple dimensions or multiple metrics across which we need to run our search queries?

KD or K-Dimensional trees

This is where KD trees, or K-Dimensional trees, come into the picture. KD trees are similar to a BST in that they segregate data to the left or right depending on the value. The main difference is that multiple planes, or dimensions, are considered while constructing the tree. In a KD tree (commonly used for k-NN problems), each traversal step divides a particular plane into 2 sub-planes. As the traversal goes deeper, the combination of these plane divisions narrows down to the point in space being searched for.

A good example of splitting in a one-dimensional structure is binary search on a sorted array: with every jump, half of the remaining array is taken out of consideration.
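To make the halving concrete, here is a minimal binary search sketch in Python (the array and target below are purely illustrative):

```python
def binary_search(arr, target):
    """Return the index of target in a sorted array, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # split the current range in half
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1              # discard the left half
        else:
            hi = mid - 1              # discard the right half
    return -1

print(binary_search([3, 8, 15, 23, 42, 57, 91], 42))  # -> 4
```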

For a 2D structure, the plane can be split in the following way:

Fig: A 2D structure

At point A, the space is split along the X axis. At points B and C, it is split along the Y axis, and so on.

How do we represent the 2D split in a tree?

To do this, we define something called the discriminator. The discriminator determines which plane (dimension) is used for the split at each step. The formula for the discriminator is:

discriminator = level % N, where level is the level of the tree and N is the number of dimensions

Each dimension is assigned a key (index). At each node, the discriminator value tells us which dimension's key to compare.
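As a quick illustration of how the discriminator cycles through the dimensions as we descend (the dimension names below are just for this example):

```python
DIMENSIONS = ["x", "y"]                      # N = 2 for the 2D example

def discriminator(level, n_dims=len(DIMENSIONS)):
    """Index of the dimension to compare at a given level (root is level 0)."""
    return level % n_dims

for level in range(4):
    print(level, DIMENSIONS[discriminator(level)])
# 0 x
# 1 y
# 2 x
# 3 y
```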

Fig: KD tree representation of a 2D structure

Let us use the above formula to explain the tree.

At node A, discriminator = 0 % 2 = 0

At node B, discriminator = 1 % 2 = 1

At node C, discriminator = 1 % 2 = 1

At node D, discriminator = 2 % 2 = 0

At node E, discriminator = 3 % 2 = 1

At node F, discriminator = 3 % 2 = 1

Since there are only 2 dimensions here, the discriminator value can only be 0 or 1 (the level starts from 0). We will call the first dimension X and the second dimension Y; X could very well be the latitude and Y the longitude.

Applying the BST strategy here, we select the next child node by comparing values along the dimension picked by the discriminator.

Let us try to search for the element S(66, 85).

At point A, the discriminator is 0, so we compare the X dimension of S and A. S(X) is 66 and A(X) is 40. Since 66 > 40, we navigate to the right, which is node C.

At point C, the discriminator is 1, so we compare the Y dimension of S and C. S(Y) is 85 and C(Y) is 10. Since 85 > 10, we navigate to the right, which is node D.

At point D, the discriminator is 0, so we compare the X dimension of S and D. S(X) is 66 and D(X) is 69. Since 66 < 69, we navigate to the left, which is node E.

We finally reach the node that we were searching for.
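Putting the walkthrough into code, here is a minimal KD tree search sketch. Only A(X) = 40, C(Y) = 10, D(X) = 69 and E = (66, 85) come from the walkthrough above; the remaining coordinates (and the omission of node F) are assumptions made just to build a consistent example.

```python
class Node:
    def __init__(self, point, left=None, right=None):
        self.point = point                 # tuple of coordinates, e.g. (x, y)
        self.left = left
        self.right = right

def kd_search(node, target, level=0, k=2):
    """Exact-match search: compare one dimension per level, chosen by level % k."""
    if node is None:
        return None
    if node.point == target:
        return node
    d = level % k                          # the discriminator
    if target[d] < node.point[d]:
        return kd_search(node.left, target, level + 1, k)
    return kd_search(node.right, target, level + 1, k)

# Hypothetical tree consistent with the walkthrough: A -> C -> D -> E for S(66, 85).
E = Node((66, 85))
D = Node((69, 90), left=E)
C = Node((70, 10), right=D)
B = Node((20, 30))
A = Node((40, 50), left=B, right=C)

print(kd_search(A, (66, 85)).point)        # (66, 85), reached via A -> C -> D -> E
```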

The same algorithm applies to a 3D or 4D structure as well. The only change is the discriminator calculation: there will be 3 or 4 possible discriminator values, depending on the number of dimensions. For a 3D structure, the splitting dimension repeats itself every 3 jumps.
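For instance, with 3 dimensions the splitting dimension cycles every 3 levels:

```python
for level in range(6):
    print(level, ["x", "y", "z"][level % 3])
# 0 x, 1 y, 2 z, 3 x, 4 y, 5 z -- the cycle repeats every 3 jumps
```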

As we can see above, read operations are well optimized with N dimensions in mind.

Insertion or deletion in a KD tree can be a little trickier than in a BST. In a BST, rotation works because we rotate along only one dimension. Rotation in a KD tree does not work as easily, since rotating along one dimension disrupts the ordering of the other dimensions. Hence, write operations on a KD tree can become expensive.

In the above example, the splitting dimension is selected in a round-robin fashion. That is the simplest approach to ensure that all dimensions are cut evenly. If we want to give more priority to specific dimensions, this can easily be controlled by modifying the discriminator formula, as sketched below.
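As a rough sketch of what such a modification might look like (this weighting scheme is my own illustration, not something prescribed by KD or BKD trees), a discriminator that splits on x twice as often as on y could be:

```python
def weighted_discriminator(level):
    """Split on dimension 0 (x) on two out of every three levels, dimension 1 (y) otherwise."""
    return 0 if level % 3 < 2 else 1

print([weighted_discriminator(level) for level in range(6)])  # [0, 0, 1, 0, 0, 1]
```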

The BKD tree is a collection of multiple KD trees. This makes sense, as write propagation can be confined to a single KD tree as the data grows. Since BKD trees were mainly built for disk operations, they borrow a leaf (pun intended) from B+ trees and store the actual points only at the leaf nodes; the internal nodes mainly act as pointers to reach the appropriate blocks. The tree is a combination of complete binary trees and B+ trees. Since each tree is a complete binary tree, array techniques using formulas like 2n+1 and 2n+2 can be used to fetch the appropriate child nodes, which further optimizes disk IO. I have a hunch that the number of KD trees and the nodes in them can be directly correlated with the shard sizes used in Elasticsearch.
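A minimal sketch of the array layout mentioned above, assuming 0-based indices: for a complete binary tree stored level by level, the children of the node at index n sit at 2n+1 and 2n+2.

```python
# Level-order storage of a complete binary tree in a plain array; no child
# pointers are needed, which maps nicely onto contiguous disk blocks.
tree = ["A", "B", "C", "D", "E", "F", "G"]

def children(n):
    left, right = 2 * n + 1, 2 * n + 2
    return (tree[left] if left < len(tree) else None,
            tree[right] if right < len(tree) else None)

print(children(0))  # ('B', 'C')  -- children of the root
print(children(2))  # ('F', 'G')  -- children of node 'C'
```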

Searching an N-dimensional space

In this section, we use the knowledge built up above to define the atomic query types from which other, more complex queries can be built.

  • Filter based on X dimensions, where 1 <= X <= N (see the sketch below)
  • Top hits based on X dimensions as the unique scope, on a function of Y dimensions, where 1 <= X <= N and 1 <= Y <= N
  • Range distribution or bucket formation with X dimensions as the unique scope, on a function of Y dimensions falling within defined ranges, where 1 <= X <= N and 1 <= Y <= N
  • Temporal listing of X dimensions with sampling functions performed on Y dimensions, where 1 <= X <= N and 1 <= Y <= N
  • Datatable listing of X dimensions, where 1 <= X <= N

I think restricting queries to combinations of the above query types should be enough to answer more complex queries.
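To make the first query type concrete, here is a minimal range-filter sketch, reusing the Node class and the hypothetical tree rooted at A from the search example above (this illustrates the idea only, not Elasticsearch's or Lucene's actual implementation):

```python
def kd_range_search(node, lo, hi, level=0, k=2, out=None):
    """Collect every point p with lo[d] <= p[d] <= hi[d] for all dimensions d."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(lo[d] <= node.point[d] <= hi[d] for d in range(k)):
        out.append(node.point)
    d = level % k                              # the discriminator again
    if lo[d] <= node.point[d]:                 # the box may overlap the left subtree
        kd_range_search(node.left, lo, hi, level + 1, k, out)
    if hi[d] >= node.point[d]:                 # the box may overlap the right subtree
        kd_range_search(node.right, lo, hi, level + 1, k, out)
    return out

# Filter on both dimensions: 60 <= x <= 80 and 0 <= y <= 100, over the tree rooted at A.
print(kd_range_search(A, lo=(60, 0), hi=(80, 100)))  # [(70, 10), (69, 90), (66, 85)]
```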

Summary:

  1. Once you understand the two figures in this post, you understand the BKD tree.

  2. For ease of understanding, this post uses only 2 dimensions (x, y).

  3. The first figure is the geometric interpretation:

In two dimensions the points can be drawn on a plane, with x on the horizontal axis and y on the vertical axis. When comparing two points, smaller is to the left and larger to the right, smaller is below and larger above. For ease of understanding, we can assign arbitrary values to the points in the figure:

A (2,6)

B(1,5)

C (6,9)

D (6,5.8)

  4. The second figure is the computer data structure implementation:

The first level compares X (the first dimension), the second level compares Y (the second dimension), and so on.

Which dimension is compared at each level is computed by the formula:

discriminator (the dimension to compare) = level (the tree level, starting from 0) % N (the number of dimensions, 2 in this example)
