Build a Database from Scratch: 04. B-Tree: The Practice (Part I)

04. B-Tree: The Practice (Part I)

This chapter implements an immutable B+ tree in Go. The implementation is minimal and therefore easy to follow.

4.1 The Node Format

Our B-tree will eventually be persisted to disk, so we first need to design the wire format (the byte layout) of the B-tree nodes. Without a fixed format, we cannot know the size of a node or when to split it.

A node consists of:

  1. A fixed-size header containing the type of the node (leaf node or internal node) and the number of keys.
  2. A list of pointers to the child nodes (used by internal nodes).
  3. A list of offsets pointing to each key-value pair.
  4. Packed KV pairs.

| type | nkeys |  pointers  |   offsets  | key-values
|  2B  |   2B  | nkeys * 8B | nkeys * 2B | ...
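
As a rough illustration of the layout above, the start of each section follows directly from the number of keys. The helper name `sectionStarts` and the constant names are made up for this sketch and are not part of the chapter's code.

```go
package main

import "fmt"

const (
	headerSize  = 4 // type (2B) + nkeys (2B)
	pointerSize = 8 // one child pointer per key
	offsetSize  = 2 // one offset per key
)

// sectionStarts returns the byte positions at which the pointer list,
// the offset list, and the packed key-values begin for a node with nkeys keys.
func sectionStarts(nkeys int) (pointers, offsets, kvs int) {
	pointers = headerSize
	offsets = pointers + nkeys*pointerSize
	kvs = offsets + nkeys*offsetSize
	return
}

func main() {
	p, o, kv := sectionStarts(3)
	fmt.Println(p, o, kv) // 4 28 34
}
```

With 3 keys, the pointers occupy bytes 4 to 27, the offsets bytes 28 to 33, and the first KV pair starts at byte 34.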

This is the format of a KV pair: the key and value lengths, followed by the key and value data.

| klen | vlen | key | val |
|  2B  |  2B  | ... | ... |
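
A minimal sketch of packing and unpacking one pair in this format. The function names `encodeKV` and `decodeKV` are hypothetical, and little-endian byte order is assumed because the chapter's accessors use `binary.LittleEndian`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeKV packs one pair as | klen (2B) | vlen (2B) | key | val |.
func encodeKV(key, val []byte) []byte {
	buf := make([]byte, 4+len(key)+len(val))
	binary.LittleEndian.PutUint16(buf[0:], uint16(len(key)))
	binary.LittleEndian.PutUint16(buf[2:], uint16(len(val)))
	copy(buf[4:], key)
	copy(buf[4+len(key):], val)
	return buf
}

// decodeKV reads the pair back; the returned slices alias buf.
func decodeKV(buf []byte) (key, val []byte) {
	klen := binary.LittleEndian.Uint16(buf[0:])
	vlen := binary.LittleEndian.Uint16(buf[2:])
	key = buf[4 : 4+klen]
	val = buf[4+klen : 4+klen+vlen]
	return
}

func main() {
	buf := encodeKV([]byte("k1"), []byte("hello"))
	key, val := decodeKV(buf)
	fmt.Println(len(buf), string(key), string(val)) // 11 k1 hello
}
```

The 2-byte length fields cap each key and value at 65535 bytes, which is far above the limits the chapter sets later.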

To keep things simple, both leaf nodes and internal nodes use the same format.

4.2 Data Types

Since we're going to dump our B-tree to the disk eventually, why not use an array of bytes as our in-memory data structure as well?

type BNode struct {
    data []byte // can be dumped to the disk
}

const (
    BNODE_NODE = 1 // internal nodes without values
    BNODE_LEAF = 2 // leaf nodes with values
)

We also can't use in-memory pointers: the pointers here are 64-bit integers that reference disk pages rather than in-memory nodes. We'll add some callbacks to abstract this away so that our data structure code remains pure data structure code.

type BTree struct {
    // pointer (a nonzero page number)
    root uint64
    // callbacks for managing on-disk pages
    get func(uint64) BNode // dereference a pointer
    new func(BNode) uint64 // allocate a new page
    del func(uint64)       // deallocate a page
}
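
To make the callbacks concrete, here is one way they could be backed by an in-memory map, for example in tests. The helper `newMemTree` and the map-based page store are assumptions of this sketch, not part of the chapter.

```go
package main

import "fmt"

type BNode struct {
	data []byte
}

type BTree struct {
	root uint64
	get  func(uint64) BNode // dereference a pointer
	new  func(BNode) uint64 // allocate a new page
	del  func(uint64)       // deallocate a page
}

// newMemTree wires the callbacks to an in-memory map (hypothetical helper).
func newMemTree() (*BTree, map[uint64]BNode) {
	pages := map[uint64]BNode{}
	next := uint64(1) // page numbers are nonzero
	tree := &BTree{
		get: func(ptr uint64) BNode {
			node, ok := pages[ptr]
			if !ok {
				panic("dangling page pointer")
			}
			return node
		},
		new: func(node BNode) uint64 {
			ptr := next
			next++
			pages[ptr] = node
			return ptr
		},
		del: func(ptr uint64) {
			delete(pages, ptr)
		},
	}
	return tree, pages
}

func main() {
	tree, pages := newMemTree()
	ptr := tree.new(BNode{data: []byte{1, 2, 3}})
	fmt.Println(ptr, len(tree.get(ptr).data), len(pages)) // 1 3 1
	tree.del(ptr)
	fmt.Println(len(pages)) // 0
}
```

Because the tree code only sees `get`, `new`, and `del`, the same data structure code can later run against real disk pages without changes.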

The page size is defined to be 4K bytes. A larger page size such as 8K or 16K also works.

We also add some constraints on the size of keys and values so that a node with a single KV pair always fits on a single page. If you need to support bigger keys or values, you have to allocate extra pages for them, which adds complexity.

const HEADER = 4

const BTREE_PAGE_SIZE = 4096
const BTREE_MAX_KEY_SIZE = 1000
const BTREE_MAX_VAL_SIZE = 3000

func init() {
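    // worst case for a node with one KV pair:
    // header + 1 pointer (8B) + 1 offset (2B) + klen and vlen (4B) + key + value
    node1max := HEADER + 8 + 2 + 4 + BTREE_MAX_KEY_SIZE + BTREE_MAX_VAL_SIZE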
    assert(node1max <= BTREE_PAGE_SIZE)
}
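
The listings in this chapter call `assert`, which is not a Go builtin. A minimal stand-in, together with the arithmetic behind the size check, might look like this; both `assert` as written here and `singleKVNodeSize` are assumptions of this sketch.

```go
package main

import "fmt"

const (
	HEADER             = 4
	BTREE_PAGE_SIZE    = 4096
	BTREE_MAX_KEY_SIZE = 1000
	BTREE_MAX_VAL_SIZE = 3000
)

// assert panics when cond is false (hypothetical helper; Go has no built-in assert).
func assert(cond bool) {
	if !cond {
		panic("assertion failure")
	}
}

// singleKVNodeSize is the size of a node that holds exactly one KV pair:
// header + 1 pointer + 1 offset + klen/vlen + key + value.
func singleKVNodeSize(klen, vlen int) int {
	return HEADER + 8 + 2 + 4 + klen + vlen
}

func main() {
	worst := singleKVNodeSize(BTREE_MAX_KEY_SIZE, BTREE_MAX_VAL_SIZE)
	assert(worst <= BTREE_PAGE_SIZE)
	fmt.Println(worst, BTREE_PAGE_SIZE-worst) // 4018 78
}
```

With the chosen limits, the largest single-KV node takes 4018 bytes, leaving 78 bytes of slack in a 4096-byte page.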

4.3 Decoding the B-tree Node

Since a node is just an array of bytes, we'll add some helper functions to access its content.

// header
func (node BNode) btype() uint16 {
    return binary.LittleEndian.Uint16(node.data)
}
func (node BNode) nkeys() uint16 {
    return binary.LittleEndian.Uint16(node.data[2:4])
}
func (node BNode) setHeader(btype uint16, nkeys uint16) {
    binary.LittleEndian.PutUint16(node.data[0:2], btype)
    binary.LittleEndian.PutUint16(node.data[2:4], nkeys)
}

// pointers
func (node BNode) getPtr(idx uint16) uint64 {
    assert(idx < node.nkeys())
    pos := HEADER + 8*idx
    return binary.LittleEndian.Uint64(node.data[pos:])
}
func (node BNode) setPtr(idx uint16, val uint64) {
    assert(idx < node.nkeys())
    pos := HEADER + 8*idx
    binary.LittleEndian.PutUint64(node.data[pos:], val)
}
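
A small round trip through the header and pointer accessors shows what actually lands in the byte array. The accessors are repeated from the listing above with the `assert` calls dropped so that the sketch runs on its own; the 64-byte buffer is an arbitrary choice.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	HEADER     = 4
	BNODE_NODE = 1
)

type BNode struct {
	data []byte
}

func (node BNode) btype() uint16 { return binary.LittleEndian.Uint16(node.data[0:2]) }
func (node BNode) nkeys() uint16 { return binary.LittleEndian.Uint16(node.data[2:4]) }
func (node BNode) setHeader(btype uint16, nkeys uint16) {
	binary.LittleEndian.PutUint16(node.data[0:2], btype)
	binary.LittleEndian.PutUint16(node.data[2:4], nkeys)
}
func (node BNode) getPtr(idx uint16) uint64 {
	return binary.LittleEndian.Uint64(node.data[HEADER+8*idx:])
}
func (node BNode) setPtr(idx uint16, val uint64) {
	binary.LittleEndian.PutUint64(node.data[HEADER+8*idx:], val)
}

func main() {
	node := BNode{data: make([]byte, 64)}
	node.setHeader(BNODE_NODE, 2)
	node.setPtr(0, 100)
	node.setPtr(1, 200)
	fmt.Println(node.btype(), node.nkeys(), node.getPtr(0), node.getPtr(1)) // 1 2 100 200
	fmt.Println(node.data[:12]) // [1 0 2 0 100 0 0 0 0 0 0 0]
}
```

Note that the methods use value receivers but still modify the node: a `BNode` copy shares the same underlying byte slice.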

Some details about the offset list:

  • The offset is relative to the position of the first KV pair.
  • The offset of the first KV pair is always zero, so it is not stored in the list.
  • We store the offset to the end of the last KV pair in the offset list, which is used to determine the size of the node.

// offset list
func offsetPos(node BNode, idx uint16) uint16 {
    assert(1 <= idx && idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*(idx-1)
}
func (node BNode) getOffset(idx uint16) uint16 {
    if idx == 0 {
        return 0
    }
    return binary.LittleEndian.Uint16(node.data[offsetPos(node, idx):])
}
func (node BNode) setOffset(idx uint16, offset uint16) {
    binary.LittleEndian.PutUint16(node.data[offsetPos(node, idx):], offset)
}
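
To see what the stored offsets look like, the following sketch derives them from the encoded sizes of the KV pairs. The function `offsetList` is a hypothetical illustration and works on plain integers rather than on a node.

```go
package main

import "fmt"

// offsetList returns the stored offsets for KV pairs of the given encoded sizes.
// Entry i-1 holds the offset of pair i; the last entry is the end of the last pair.
// The offset of pair 0 is always zero and therefore not stored.
func offsetList(kvSizes []uint16) []uint16 {
	offsets := make([]uint16, len(kvSizes))
	end := uint16(0)
	for i, size := range kvSizes {
		end += size
		offsets[i] = end
	}
	return offsets
}

func main() {
	// three pairs whose encoded sizes (4B of lengths + key + value) are 10, 7 and 12
	fmt.Println(offsetList([]uint16{10, 7, 12})) // [10 17 29]
}
```

Here pair 1 starts at offset 10, pair 2 at offset 17, and the final entry 29 is the total size of the packed KV area.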

The offset list is used to locate the nth KV pair quickly.

// key-values
func (node BNode) kvPos(idx uint16) uint16 {
    assert(idx <= node.nkeys())
    return HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}
func (node BNode) getKey(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node.data[pos:])
    return node.data[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte {
    assert(idx < node.nkeys())
    pos := node.kvPos(idx)
    klen := binary.LittleEndian.Uint16(node.data[pos+0:])
    vlen := binary.LittleEndian.Uint16(node.data[pos+2:])
    return node.data[pos+4+klen:][:vlen]
}

The same helper also gives the size of the node: it is the position just past the last KV pair.

// node size in bytes
func (node BNode) nbytes() uint16 {
    return node.kvPos(node.nkeys())
}
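
As a worked example, the sketch below encodes a two-key leaf byte by byte and reads it back with the accessors. The accessors follow the listings above without the `assert` calls; `handBuiltLeaf` is a hypothetical helper used only for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const HEADER = 4

type BNode struct {
	data []byte
}

func (node BNode) nkeys() uint16 {
	return binary.LittleEndian.Uint16(node.data[2:4])
}
func (node BNode) getOffset(idx uint16) uint16 {
	if idx == 0 {
		return 0
	}
	pos := HEADER + 8*node.nkeys() + 2*(idx-1)
	return binary.LittleEndian.Uint16(node.data[pos:])
}
func (node BNode) kvPos(idx uint16) uint16 {
	return HEADER + 8*node.nkeys() + 2*node.nkeys() + node.getOffset(idx)
}
func (node BNode) getKey(idx uint16) []byte {
	pos := node.kvPos(idx)
	klen := binary.LittleEndian.Uint16(node.data[pos:])
	return node.data[pos+4:][:klen]
}
func (node BNode) getVal(idx uint16) []byte {
	pos := node.kvPos(idx)
	klen := binary.LittleEndian.Uint16(node.data[pos:])
	vlen := binary.LittleEndian.Uint16(node.data[pos+2:])
	return node.data[pos+4+klen:][:vlen]
}
func (node BNode) nbytes() uint16 {
	return node.kvPos(node.nkeys())
}

// handBuiltLeaf encodes a two-key leaf byte by byte (illustration only).
func handBuiltLeaf() BNode {
	data := []byte{2, 0, 2, 0}               // type = 2 (leaf), nkeys = 2
	data = append(data, make([]byte, 16)...) // two unused 8-byte pointers
	data = append(data, 8, 0, 19, 0)         // offsets: end of pair 0, end of pair 1
	data = append(data, 2, 0, 2, 0)          // pair 0: klen = 2, vlen = 2
	data = append(data, "k1hi"...)
	data = append(data, 2, 0, 5, 0) // pair 1: klen = 2, vlen = 5
	data = append(data, "k2hello"...)
	return BNode{data: data}
}

func main() {
	node := handBuiltLeaf()
	fmt.Println(string(node.getKey(1)), string(node.getVal(1))) // k2 hello
	fmt.Println(node.kvPos(1), node.nbytes(), len(node.data))   // 32 43 43
}
```

The KV area starts at byte 24 (4 + 2*8 + 2*2), so pair 1 sits at 24 + 8 = 32 and the node ends at 24 + 19 = 43.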

4.4 The B-Tree Insertion

The code is broken down into small steps.

Step 1: Look Up the Key

To insert a key into a leaf node, we need to look up its position in the sorted KV list.

// returns the first kid node whose range intersects the key. (kid[i] <= key)
// TODO: bisect
func nodeLookupLE(node BNode, key []byte) uint16 {
    nkeys := node.nkeys()
    found := uint16(0)
    // the first key is a copy from the parent node,
    // thus it's always less than or equal to the key.
    for i := uint16(1); i < nkeys; i++ {
        cmp := bytes.Compare(node.getKey(i), key)
        if cmp <= 0 {
            found = i
        }
        if cmp >= 0 {
            break
        }
    }
    return found
}
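
The `TODO: bisect` could be addressed with a binary search. The sketch below is one possible approach using the standard library's `sort.Search`; it works on a plain slice of keys rather than on a `BNode`, and the name `lookupLE` is made up for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// lookupLE returns the index of the last key that is <= target.
// keys must be sorted and non-empty, and keys[0] is assumed to be <= target
// (as in the chapter, where the first key is a copy from the parent node).
func lookupLE(keys [][]byte, target []byte) int {
	// sort.Search returns the smallest i in [0, n) for which the predicate is
	// true, or n if there is none. The predicate looks at keys[i+1], so the
	// result is the index just before the first key greater than target.
	return sort.Search(len(keys)-1, func(i int) bool {
		return bytes.Compare(keys[i+1], target) > 0
	})
}

func main() {
	keys := [][]byte{[]byte(""), []byte("b"), []byte("d"), []byte("f")}
	for _, target := range []string{"a", "c", "d", "z"} {
		fmt.Print(lookupLE(keys, []byte(target)), " ")
	}
	fmt.Println() // 0 1 2 3
}
```

Like the linear version, this never compares against the first key, and it returns the same index for both exact matches and keys that fall between two entries.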

The lookup works for both leaf nodes and internal nodes. Note that the first key is skipped in the comparison, since it has already been compared in the parent node.

Step 2: Update Leaf Nodes

After looking up the position to insert, we need to create a copy of the node with the new key in it.

// add a new key to a leaf node
func leafInsert(
    new BNode, old BNode, idx uint16,
    key []byte, val []byte,
) {
    new.setHeader(BNODE_LEAF, old.nkeys()+1)
    nodeAppendRange(new, old, 0, 0, idx)
    nodeAppendKV(new, idx, 0, key, val)
    nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx)
}

The nodeAppendRange function copies a range of KV pairs, together with their pointers and offsets, from the old node to the new node.

// copy multiple KVs into the position
func nodeAppendRange(
    new BNode, old BNode,
    dstNew uint16, srcOld uint16, n uint16,
) {
    assert(srcOld+n <= old.nkeys())
    assert(dstNew+n <= new.nkeys())
    if n == 0 {
        return
    }

    // pointers
    for i := uint16(0); i < n; i++ {
        new.setPtr(dstNew+i, old.getPtr(srcOld+i))
    }
    // offsets
    dstBegin := new.getOffset(dstNew)
    srcBegin := old.getOffset(srcOld)
    for i := uint16(1); i <= n; i++ { // NOTE: the range is [1, n]
        offset := dstBegin + old.getOffset(srcOld+i) - srcBegin
        new.setOffset(dstNew+i, offset)
    }
    // KVs
    begin := old.kvPos(srcOld)
    end := old.kvPos(srcOld + n)
    copy(new.data[new.kvPos(dstNew):], old.data[begin:end])
}

The nodeAppendKV function writes a single KV pair into the new node.

// copy a KV into the position
func nodeAppendKV(new BNode, idx uint16, ptr uint64, key []byte, val []byte) {
    // ptrs
    new.setPtr(idx, ptr)
    // KVs
    pos := new.kvPos(idx)
    binary.LittleEndian.PutUint16(new.data[pos+0:], uint16(len(key)))
    binary.LittleEndian.PutUint16(new.data[pos+2:], uint16(len(val)))
    copy(new.data[pos+4:], key)
    copy(new.data[pos+4+uint16(len(key)):], val)
    // the offset of the next key
    new.setOffset(idx+1, new.getOffset(idx)+4+uint16((len(key)+len(val))))
}
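
Putting the pieces together, the sketch below condenses the helpers from this chapter (with the `assert` calls dropped and some accessors shortened) and inserts three keys into a leaf. The wrappers `insertAt` and `keysOf` are hypothetical additions for the demo.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	HEADER     = 4
	BNODE_LEAF = 2
)

var le = binary.LittleEndian

type BNode struct {
	data []byte
}

func (n BNode) nkeys() uint16 { return le.Uint16(n.data[2:4]) }
func (n BNode) setHeader(btype, nkeys uint16) {
	le.PutUint16(n.data[0:2], btype)
	le.PutUint16(n.data[2:4], nkeys)
}
func (n BNode) getPtr(i uint16) uint64 { return le.Uint64(n.data[HEADER+8*i:]) }
func (n BNode) setPtr(i uint16, v uint64) { le.PutUint64(n.data[HEADER+8*i:], v) }
func (n BNode) offsetPos(i uint16) uint16 { return HEADER + 8*n.nkeys() + 2*(i-1) }
func (n BNode) getOffset(i uint16) uint16 {
	if i == 0 {
		return 0
	}
	return le.Uint16(n.data[n.offsetPos(i):])
}
func (n BNode) setOffset(i, off uint16) { le.PutUint16(n.data[n.offsetPos(i):], off) }
func (n BNode) kvPos(i uint16) uint16 { return HEADER + 10*n.nkeys() + n.getOffset(i) }
func (n BNode) getKey(i uint16) []byte {
	pos := n.kvPos(i)
	return n.data[pos+4:][:le.Uint16(n.data[pos:])]
}

func nodeAppendRange(new, old BNode, dstNew, srcOld, n uint16) {
	if n == 0 {
		return
	}
	for i := uint16(0); i < n; i++ {
		new.setPtr(dstNew+i, old.getPtr(srcOld+i))
	}
	dstBegin, srcBegin := new.getOffset(dstNew), old.getOffset(srcOld)
	for i := uint16(1); i <= n; i++ {
		new.setOffset(dstNew+i, dstBegin+old.getOffset(srcOld+i)-srcBegin)
	}
	copy(new.data[new.kvPos(dstNew):], old.data[old.kvPos(srcOld):old.kvPos(srcOld+n)])
}

func nodeAppendKV(new BNode, idx uint16, ptr uint64, key, val []byte) {
	new.setPtr(idx, ptr)
	pos := new.kvPos(idx)
	le.PutUint16(new.data[pos+0:], uint16(len(key)))
	le.PutUint16(new.data[pos+2:], uint16(len(val)))
	copy(new.data[pos+4:], key)
	copy(new.data[pos+4+uint16(len(key)):], val)
	new.setOffset(idx+1, new.getOffset(idx)+4+uint16(len(key)+len(val)))
}

func leafInsert(new, old BNode, idx uint16, key, val []byte) {
	new.setHeader(BNODE_LEAF, old.nkeys()+1)
	nodeAppendRange(new, old, 0, 0, idx)
	nodeAppendKV(new, idx, 0, key, val)
	nodeAppendRange(new, old, idx+1, idx, old.nkeys()-idx)
}

// insertAt is a hypothetical wrapper that allocates the copy for leafInsert.
func insertAt(old BNode, idx uint16, key, val string) BNode {
	new := BNode{data: make([]byte, 4096)}
	leafInsert(new, old, idx, []byte(key), []byte(val))
	return new
}

// keysOf lists the keys of a node in order.
func keysOf(n BNode) []string {
	var keys []string
	for i := uint16(0); i < n.nkeys(); i++ {
		keys = append(keys, string(n.getKey(i)))
	}
	return keys
}

func main() {
	empty := BNode{data: make([]byte, 4096)}
	empty.setHeader(BNODE_LEAF, 0)
	a := insertAt(empty, 0, "a", "1")
	ac := insertAt(a, 1, "c", "3")
	abc := insertAt(ac, 1, "b", "2")
	fmt.Println(keysOf(ac), keysOf(abc)) // [a c] [a b c]
}
```

Each insertion builds a fresh node and leaves the old one untouched, which is what makes the tree immutable: `ac` still holds two keys after `abc` is created.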

Step 3: Recursive Insertion

This is the main function for inserting a key.

// insert a KV into a node, the result might be split into 2 nodes.
// the caller is responsible for deallocating the input node
// and splitting and allocating result nodes.
func treeInsert(tree *BTree, node BNode, key []byte, val []byte) BNode {
    // the result node.
    // it's allowed to be bigger than 1 page and will be split if so
    new := BNode{data: make([]byte, 2*BTREE_PAGE_SIZE)}

    // where to insert the key?
    idx := nodeLookupLE(node, key)
    // act depending on the node type
    switch node.btype() {
    case BNODE_LEAF:
        // leaf, node.getKey(idx) <= key
        if bytes.Equal(key, node.getKey(idx)) {
            // found the key, update it.
            leafUpdate(new, node, idx, key, val)
        } else {
            // insert it after the position.
            leafInsert(new, node, idx+1, key, val)
        }
    case BNODE_NODE:
        // internal node, insert it to a kid node.
        nodeInsert(tree, new, node, idx, key, val)
    default:
        panic("bad node!")
    }
    return new
}

The leafUpdate function is similar to the leafInsert function.
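
Since leafUpdate is not listed, here is a simplified model of how it presumably differs from leafInsert. It uses a plain slice of pairs instead of the byte layout, and the names `kv`, `leafInsertModel`, and `leafUpdateModel` are assumptions of this sketch, not the author's code.

```go
package main

import "fmt"

// kv is a simplified stand-in for an encoded KV pair (assumption for this sketch).
type kv struct {
	key, val string
}

// leafInsertModel returns a copy of old with a new pair placed at idx.
func leafInsertModel(old []kv, idx int, pair kv) []kv {
	out := make([]kv, 0, len(old)+1)
	out = append(out, old[:idx]...) // pairs before the position
	out = append(out, pair)         // the new pair
	out = append(out, old[idx:]...) // the old pair at idx and everything after
	return out
}

// leafUpdateModel returns a copy of old with the pair at idx replaced.
func leafUpdateModel(old []kv, idx int, pair kv) []kv {
	out := make([]kv, 0, len(old))
	out = append(out, old[:idx]...)   // pairs before the position
	out = append(out, pair)           // the replacement
	out = append(out, old[idx+1:]...) // skip the old pair at idx
	return out
}

func main() {
	old := []kv{{"a", "1"}, {"c", "3"}}
	fmt.Println(leafInsertModel(old, 1, kv{"b", "2"})) // [{a 1} {b 2} {c 3}]
	fmt.Println(leafUpdateModel(old, 1, kv{"c", "9"})) // [{a 1} {c 9}]
	fmt.Println(old)                                   // [{a 1} {c 3}]
}
```

In terms of the chapter's helpers, an update would keep the key count unchanged and copy the second range starting from idx+1 instead of idx.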

Step 4: Handle Internal Nodes

Now comes the code for handling internal nodes.

// part of the treeInsert(): KV insertion to an internal node
func nodeInsert(
    tree *BTree, new BNode, node BNode, idx uint16,
    key []byte, val []byte,
) {
    // get and deallocate the kid node
    kptr := node.getPtr(idx)
    knode := tree.get(kptr)
    tree.del(kptr)
    // recursive insertion to the kid node
    knode = treeInsert(tree, knode, key, val)
    // split the result
    nsplit, splited := nodeSplit3(knode)
    // update the kid links
    nodeReplaceKidN(tree, new, node, idx, splited[:nsplit]...)
}

Step 5: Split Big Nodes

Inserting keys into a node increases its size, which may cause it to exceed the page size. In that case, the node is split into multiple smaller nodes.

The maximum allowed key size and value size only guarantee that a single KV pair always fits on one page. In the worst case, the fat node is split into 3 nodes (one large KV pair in the middle).

// split a bigger-than-allowed node into two.
// the second node always fits on a page.
func nodeSplit2(left BNode, right BNode, old BNode) {
    // code omitted...
}

// split a node if it's too big. the results are 1~3 nodes.
func nodeSplit3(old BNode) (uint16, [3]BNode) {
    if old.nbytes() <= BTREE_PAGE_SIZE {
        old.data = old.data[:BTREE_PAGE_SIZE]
        return 1, [3]BNode{old}
    }
    left := BNode{make([]byte, 2*BTREE_PAGE_SIZE)} // might be split later
    right := BNode{make([]byte, BTREE_PAGE_SIZE)}
    nodeSplit2(left, right, old)
    if left.nbytes() <= BTREE_PAGE_SIZE {
        left.data = left.data[:BTREE_PAGE_SIZE]
        return 2, [3]BNode{left, right}
    }
    // the left node is still too large
    leftleft := BNode{make([]byte, BTREE_PAGE_SIZE)}
    middle := BNode{make([]byte, BTREE_PAGE_SIZE)}
    nodeSplit2(leftleft, middle, left)
    assert(leftleft.nbytes() <= BTREE_PAGE_SIZE)
    return 3, [3]BNode{leftleft, middle, right}
}
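
The body of nodeSplit2 is omitted above. As a rough sketch of one possible policy, the model below picks the split point greedily so that the right part always fits on a page, and mirrors the control flow of nodeSplit3. It works on encoded KV sizes only, assumes every single KV pair fits on a page, and all names in it are hypothetical.

```go
package main

import "fmt"

const (
	pageSize    = 4096
	header      = 4
	perKeyBytes = 8 + 2 // one pointer and one offset per key
)

// nodeBytes is the encoded size of a node holding KV pairs of the given sizes
// (each size already includes the 4 bytes of klen and vlen).
func nodeBytes(kvSizes []int) int {
	total := header + perKeyBytes*len(kvSizes)
	for _, size := range kvSizes {
		total += size
	}
	return total
}

// splitIndex returns i such that kvSizes[i:] fits on one page, moving as many
// trailing pairs as possible to the right while keeping at least one on the left.
func splitIndex(kvSizes []int) int {
	idx := len(kvSizes)
	for idx > 1 && nodeBytes(kvSizes[idx-1:]) <= pageSize {
		idx--
	}
	return idx
}

// split3 mirrors the control flow of nodeSplit3 on the size model.
func split3(kvSizes []int) [][]int {
	if nodeBytes(kvSizes) <= pageSize {
		return [][]int{kvSizes}
	}
	i := splitIndex(kvSizes)
	left, right := kvSizes[:i], kvSizes[i:]
	if nodeBytes(left) <= pageSize {
		return [][]int{left, right}
	}
	j := splitIndex(left)
	return [][]int{left[:j], left[j:], right}
}

func main() {
	fmt.Println(split3([]int{100, 200}))         // [[100 200]]
	fmt.Println(split3([]int{3000, 1000, 1000})) // [[3000] [1000 1000]]
	fmt.Println(split3([]int{2000, 4004, 2000})) // [[2000] [4004] [2000]]
}
```

The last case is the worst case described above: one maximum-size pair in the middle forces a 3-way split. The greedy choice is not balanced, and this model does not prove that the leftmost piece always fits, which is why nodeSplit3 asserts it.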

Step 6: Update Internal Nodes

Inserting a key into a node can result in either 1, 2 or 3 nodes. The parent node must update itself accordingly. The code for updating an internal node is similar to that for updating a leaf node.

// replace a link with multiple links
func nodeReplaceKidN(
    tree *BTree, new BNode, old BNode, idx uint16,
    kids ...BNode,
) {
    inc := uint16(len(kids))
    new.setHeader(BNODE_NODE, old.nkeys()+inc-1)
    nodeAppendRange(new, old, 0, 0, idx)
    for i, node := range kids {
        nodeAppendKV(new, idx+uint16(i), tree.new(node), node.getKey(0), nil)
    }
    nodeAppendRange(new, old, idx+inc, idx+1, old.nkeys()-(idx+1))
}

We have finished the B-tree insertion. Deletion and the rest of the code will be introduced in the next chapter.
