Build Your Own Database From Scratch: 03. B-Tree: The Ideas

3.1 The Intuitions of the B-Tree and BST

Our first intuition comes from balanced binary search trees (BSTs). Binary search trees are popular data structures for sorted data. Keeping a tree in good shape after inserting or removing keys is what "balancing" means. As stated in a previous chapter, n-ary trees should be used instead of binary trees to make use of the "page" (the minimum unit of I/O).

B-trees can be seen as a generalization of BSTs. Each node of a B-tree contains multiple keys and multiple links to its children. When looking up a key in a node, all of its keys are used to decide which child node to visit next.

```
     [1,   4,   9]
     /     |     \
    v      v      v
[1, 2, 3] [4, 6] [9, 11, 12]
```
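
As a rough illustration of the lookup rule above, here is a minimal Python sketch of choosing the next child within a single node. It assumes, as in the diagram, that each key of the internal node is the first key of the corresponding child; the function name `child_index` is hypothetical.

```python
from bisect import bisect_right
from typing import List


def child_index(keys: List[int], key: int) -> int:
    """Pick the child of an internal node whose key range may contain `key`.

    Assumes keys[i] is the first key of child i, so the answer is the last
    child whose first key is <= key.
    """
    i = bisect_right(keys, key) - 1
    # A key smaller than keys[0] can only belong to the leftmost child.
    return max(i, 0)


# The node from the diagram above.
internal_keys = [1, 4, 9]
children = [[1, 2, 3], [4, 6], [9, 11, 12]]

print(children[child_index(internal_keys, 6)])   # [4, 6]
print(children[child_index(internal_keys, 10)])  # [9, 11, 12]
```

Because all keys of the node take part in one bisection, a single node read narrows the search to one of many children instead of one of two.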

Balancing a B-tree differs from balancing a BST. Popular BSTs such as red-black (RB) trees and AVL trees are balanced on the heights of their subtrees (by rotation). A B-tree, by contrast, keeps all leaf nodes at the same depth and is balanced by the sizes of its nodes:

If a node is too large to fit on one page, it is split into two nodes. This increases the size of the parent node, which may in turn need splitting; if the root node is split, the height of the tree increases.
If a node is too small, try merging it with a sibling.
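
To make "balanced by size" concrete, the sketch below decides whether a node should be split or merged based on its encoded size in bytes. The page size, the merge threshold, and the byte layout in `node_size` are hypothetical assumptions chosen for illustration, not a real on-disk format.

```python
from typing import List, Tuple

PAGE_SIZE = 4096                  # hypothetical page size in bytes
MERGE_THRESHOLD = PAGE_SIZE // 4  # hypothetical "too small" threshold

KV = Tuple[bytes, bytes]


def node_size(kvs: List[KV]) -> int:
    # Hypothetical layout: a 4-byte header, then for each pair a 2-byte key
    # length, a 2-byte value length, and the raw key and value bytes.
    return 4 + sum(4 + len(k) + len(v) for k, v in kvs)


def balance_action(kvs: List[KV]) -> str:
    size = node_size(kvs)
    if size > PAGE_SIZE:
        return "split"   # does not fit in one page
    if size < MERGE_THRESHOLD:
        return "merge"   # small enough to try merging with a sibling
    return "ok"


def merge_fits(left: List[KV], right: List[KV]) -> bool:
    # A merge is only possible if the combined node still fits in one page.
    return node_size(left + right) <= PAGE_SIZE


print(balance_action([(b"a", b"1")]))                                  # merge
print(balance_action([(bytes([i]), b"x" * 100) for i in range(50)]))  # split
```

Nothing here inspects subtree heights: leaves stay at the same depth because the tree only grows or shrinks at the root.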

If you are familiar with RB trees, you may also be aware of 2-3 trees, which can easily be generalized to B-trees.

3.2 B-tree and Nested Arrays

Even if you are not familiar with the 2-3 tree, you can still gain some intuition using nested arrays.

Let's start with a sorted array. Queries can be done by bisection (binary search) in O(log(n)). But updating the array is O(n), which is what we need to tackle. Updating a big array is costly, so we split it into smaller arrays. Say we split the array into sqrt(n) parts, each containing sqrt(n) keys on average.

```
[[1,2,3], [4,6], [9,11,12]]
```

To query a key, we first determine which part contains it; bisecting over the sqrt(n) parts is O(log(sqrt(n))) = O(log(n)). After that, bisecting for the key within that part is again O(log(n)), so queries are no worse than before. Updating, meanwhile, improves to O(sqrt(n)), since only one part of about sqrt(n) keys has to be rewritten.
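
The 2-level nested array can be sketched in Python with the standard `bisect` module. This is an illustrative sketch under stated assumptions: the class name `NestedArray` is hypothetical, parts are assumed to be non-empty and non-overlapping, and a part that grows too large is never re-split.

```python
from bisect import bisect_left, bisect_right, insort
from typing import List


class NestedArray:
    """A 2-level sorted nested array: ~sqrt(n) sorted parts of ~sqrt(n) keys."""

    def __init__(self, parts: List[List[int]]):
        self.parts = parts                   # assumed non-empty, sorted, non-overlapping
        self.firsts = [p[0] for p in parts]  # first key of each part, kept sorted

    def _part(self, key: int) -> int:
        # Bisect over the sqrt(n) first keys: O(log n).
        return max(bisect_right(self.firsts, key) - 1, 0)

    def contains(self, key: int) -> bool:
        part = self.parts[self._part(key)]
        i = bisect_left(part, key)           # bisect within one part: O(log n)
        return i < len(part) and part[i] == key

    def insert(self, key: int) -> None:
        i = self._part(key)
        insort(self.parts[i], key)           # shifts only one part: O(sqrt(n))
        self.firsts[i] = self.parts[i][0]    # changes only if key is the new first key
        # This sketch never re-splits a part that grows too large.


arr = NestedArray([[1, 2, 3], [4, 6], [9, 11, 12]])
arr.insert(5)
print(arr.parts)                         # [[1, 2, 3], [4, 5, 6], [9, 11, 12]]
print(arr.contains(5), arr.contains(7))  # True False
```

Adding more levels, so that the list of first keys is itself split into parts, leads toward the B-tree.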

This is a 2-level sorted nested array. What if we add more levels? That is another intuition behind the B-tree.

3.3 B-Tree Operations

Querying a B-tree is the same as querying a BST.

Updating a B-tree is more complicated. From now on, we'll use a variant of the B-tree called the "B+ tree": a B+ tree stores values only in leaf nodes, while internal nodes contain only keys.
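
Since querying a B+ tree works like querying a BST, a lookup descends from the root to a leaf, picking one child per level, and only the leaf holds values. The sketch below assumes a hypothetical in-memory representation (`Leaf` and `Internal` dataclasses with made-up values); a real implementation would decode nodes from pages instead.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Any, List, Optional, Union


@dataclass
class Leaf:
    keys: List[int]  # sorted keys
    vals: List[Any]  # vals[i] belongs to keys[i]; only leaves store values


@dataclass
class Internal:
    keys: List[int]      # keys[i] is the first key of children[i]
    children: List[Any]  # child nodes (Leaf or Internal); no values here


def lookup(node: Union[Leaf, Internal], key: int) -> Optional[Any]:
    # Descend like a BST, but choose among many children at each level.
    while isinstance(node, Internal):
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    # At the leaf, bisect for an exact match.
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.vals[i]
    return None


tree = Internal([1, 4, 9], [
    Leaf([1, 2, 3], ["a", "b", "c"]),
    Leaf([4, 6], ["d", "f"]),
    Leaf([9, 11, 12], ["i", "k", "l"]),
])
print(lookup(tree, 6))  # f
print(lookup(tree, 5))  # None
```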

Key insertion starts at a leaf. A leaf is just a sorted list of keys, so inserting a key into it is trivial. But the insertion may cause the node size to exceed the page size. In that case, we split the leaf node into 2 nodes, each containing half of the keys, so that each leaf node fits into one page.
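
A minimal sketch of leaf insertion with splitting follows. It uses a hypothetical `MAX_KEYS` count as a stand-in for "fits into one page"; a real implementation would compare the encoded byte size with the page size instead.

```python
from bisect import insort
from typing import List

MAX_KEYS = 4  # hypothetical stand-in for "the node fits into one page"


def leaf_insert(leaf: List[int], key: int) -> List[List[int]]:
    """Insert `key` into a sorted leaf.

    Returns [new_leaf], or [left, right] if the leaf overflowed and was split.
    The input list is copied, not modified.
    """
    new = list(leaf)
    insort(new, key)             # keep the leaf sorted
    if len(new) <= MAX_KEYS:
        return [new]
    mid = len(new) // 2          # each half gets (about) half of the keys
    return [new[:mid], new[mid:]]


print(leaf_insert([9, 11, 12], 10))      # [[9, 10, 11, 12]]
print(leaf_insert([9, 10, 11, 12], 13))  # [[9, 10], [11, 12, 13]]
```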

An internal node consists of:

A list of pointers to its children.
A list of keys paired with the pointer list. Each of the keys is the first key of the corresponding child.

After a leaf node is split into 2 nodes, the parent node replaces the old pointer and key with the new pointers and keys. This increases the size of the parent node, which may trigger further splitting.

```
    parent              parent
   /  |  \     =>      /  | |  \
L1   L2   L6         L1  L3 L4  L6
```

After the root node is split, a new root node is added. This is how a B-tree grows.

```
                        new_root
                          / \
    root                 N1 N2
   /  |  \     =>      /  | |  \
L1   L2   L6         L1  L3 L4  L6
```
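
Putting the insertion steps together, here is a sketch of a recursive B+ tree insert: a node is split when it overflows, the parent swaps in the new pointers and keys, and a split root gets a new root on top. It assumes a hypothetical `Node` class holding keys only (no values), uses a count `MAX` in place of a page size, updates nodes in place rather than immutably, and does not handle duplicate keys.

```python
from bisect import bisect_right, insort

MAX = 4  # hypothetical: at most 4 keys (leaf) or 4 children (internal) per "page"


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # leaf: stored keys; internal: first key of each child
        self.children = children  # None for a leaf node

    def is_leaf(self):
        return self.children is None


def _split(node):
    """Split an oversized node into two halves."""
    mid = len(node.keys) // 2
    if node.is_leaf():
        return [Node(node.keys[:mid]), Node(node.keys[mid:])]
    return [Node(node.keys[:mid], node.children[:mid]),
            Node(node.keys[mid:], node.children[mid:])]


def _insert(node, key):
    """Insert key below node (in place); return [node] or the 2 halves of a split."""
    if node.is_leaf():
        insort(node.keys, key)
    else:
        i = max(bisect_right(node.keys, key) - 1, 0)
        parts = _insert(node.children[i], key)
        # The parent replaces the old pointer and key with the new pointers and keys.
        node.children[i:i + 1] = parts
        node.keys[i:i + 1] = [p.keys[0] for p in parts]
    # The node may have grown past its capacity and need splitting too.
    return _split(node) if len(node.keys) > MAX else [node]


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    parts = _insert(root, key)
    if len(parts) == 1:
        return root
    # The root itself was split: add a new root on top; the tree grows one level.
    return Node([p.keys[0] for p in parts], parts)


def height(node):
    return 1 if node.is_leaf() else 1 + height(node.children[0])


def all_keys(node):
    if node.is_leaf():
        return list(node.keys)
    return [k for child in node.children for k in all_keys(child)]


root = Node([])
for k in range(1, 12):
    root = insert(root, k)
print(height(root), root.keys)  # 3 [1, 5]
```

With `MAX = 4`, inserting the keys 1 to 10 yields a 2-level tree; the 11th key overflows the root, which is split and gets a new root above it.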

Key deletion is the opposite of insertion. A node is never empty because a small node will be merged into either its left sibling or its right sibling.

When a non-leaf root is reduced to a single key, and therefore a single child, the root can be replaced by its sole child. This is how a B-tree shrinks.
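
Deletion can be sketched in the same style: delete from the leaf, merge an undersized node into its left or right sibling when the result fits, and replace a non-leaf root that is left with one child. The `MAX` and `MIN` thresholds and the `Node` class are hypothetical, and when neither sibling has room the small node is simply left as-is.

```python
from bisect import bisect_left, bisect_right

MAX = 4  # hypothetical node capacity (keys in a leaf, children in an internal node)
MIN = 2  # hypothetical threshold below which a node is "too small"


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # leaf: stored keys; internal: first key of each child
        self.children = children  # None for a leaf node

    def is_leaf(self):
        return self.children is None


def _merge(a, b):
    """Concatenate two adjacent siblings into one node."""
    if a.is_leaf():
        return Node(a.keys + b.keys)
    return Node(a.keys + b.keys, a.children + b.children)


def _delete(node, key):
    """Delete key below node (in place); return True if it was found."""
    if node.is_leaf():
        i = bisect_left(node.keys, key)
        if i == len(node.keys) or node.keys[i] != key:
            return False
        del node.keys[i]
        return True
    i = max(bisect_right(node.keys, key) - 1, 0)
    child = node.children[i]
    if not _delete(child, key):
        return False
    if child.keys:
        node.keys[i] = child.keys[0]  # keep keys[i] == first key of children[i]
    if len(child.keys) < MIN:
        for j in (i - 1, i + 1):      # try the left sibling, then the right one
            if 0 <= j < len(node.children):
                lo = min(i, j)
                a, b = node.children[lo], node.children[lo + 1]
                if len(a.keys) + len(b.keys) <= MAX:
                    merged = _merge(a, b)
                    node.children[lo:lo + 2] = [merged]
                    node.keys[lo:lo + 2] = [merged.keys[0]]
                    break
    return True


def delete(root, key):
    """Delete key and return the (possibly new) root."""
    _delete(root, key)
    # A non-leaf root left with a single child is replaced by that child.
    while not root.is_leaf() and len(root.children) == 1:
        root = root.children[0]
    return root


root = Node([1, 4, 9], [Node([1, 2, 3]), Node([4, 6]), Node([9, 11, 12])])
root = delete(root, 6)
print(root.keys, [c.keys for c in root.children])  # [1, 9] [[1, 2, 3, 4], [9, 11, 12]]
for k in (4, 11, 12):
    root = delete(root, k)
print(root.is_leaf(), root.keys)  # True [1, 2, 3, 9]
```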

3.4 Immutable Data Structures

Immutable means never updating data in place. Some related terms are "append-only", "copy-on-write", and "persistent data structures" (the word "persistent" here has nothing to do with the "persistence" we talked about earlier).

For example, when inserting a key into a leaf node, do not modify the node in place; instead, create a new node with all the keys from the node being updated plus the new key. The parent node must then also be updated to point to the new node.

Likewise, the parent node is duplicated with the new pointer, and so on up the tree: by the time we reach the root node, the entire path from the root to the leaf has been duplicated. This effectively creates a new version of the tree that coexists with the old version. The LSM-tree we mentioned before is also considered immutable.
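
The path copying described above can be sketched with frozen (immutable) Python dataclasses. This is a minimal illustration under stated assumptions: the `Node` class and `cow_insert` function are hypothetical, and node splitting is omitted so that only the copy-on-write part is shown.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    keys: Tuple[int, ...]                          # leaf: keys; internal: first key of each child
    children: Optional[Tuple["Node", ...]] = None  # None for a leaf node


def cow_insert(node: Node, key: int) -> Node:
    """Return the root of a NEW version with key inserted; nothing changes in place.

    Node splitting is omitted to keep the sketch short.
    """
    if node.children is None:
        keys = list(node.keys)
        insort(keys, key)
        return Node(tuple(keys))  # a brand-new leaf; the old leaf is untouched
    i = max(bisect_right(node.keys, key) - 1, 0)
    new_child = cow_insert(node.children[i], key)
    # Copy the parent, swapping in the pointer (and first key) of the new child.
    children = node.children[:i] + (new_child,) + node.children[i + 1:]
    keys = node.keys[:i] + (new_child.keys[0],) + node.keys[i + 1:]
    return Node(keys, children)


v1 = Node((1, 4, 9), (Node((1, 2, 3)), Node((4, 6)), Node((9, 11, 12))))
v2 = cow_insert(v1, 5)
print(v1.children[1].keys, v2.children[1].keys)  # (4, 6) (4, 5, 6)
print(v2.children[0] is v1.children[0])          # True
```

Only the nodes on the root-to-leaf path are copied; untouched subtrees are shared between `v1` and `v2`, and a reader still holding `v1` sees the old version unchanged.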

There are several advantages of immutable data structures:

Avoid data corruption. Immutable data structures do not modify the existing data, they merely add new data, so the old version of data remains intact even if the update is interrupted.
Easy concurrency. Readers can operate concurrently with writers, since readers can work on older versions unaffected by ongoing updates.

Persistence and concurrency are covered in later chapters. For now, we'll code an immutable B+ tree first.
