Since our B-tree is immutable, every update to the KV store creates new nodes in the path instead of updating current nodes, leaving some nodes unreachable from the latest version. We need to reuse these unreachable nodes from old versions; otherwise the database file will grow indefinitely.
7.1 Design the Free List
To reuse these pages, we'll add a persistent free list to keep track of unused pages. Update operations reuse pages from the list before appending new pages, and unused pages from the current version are added to the list.
The list is used as a stack (first-in-last-out). Each update operation can both remove items from and add items to the top of the list.
```go
// number of items in the list
func (fl *FreeList) Total() int

// get the nth pointer
func (fl *FreeList) Get(topn int) uint64

// remove `popn` pointers and add some new pointers
func (fl *FreeList) Update(popn int, freed []uint64)
```
The free list is also immutable like our B-tree. Each node contains:
- Multiple pointers to unused pages.
- The link to the next node.
- The total number of items in the list. This only applies to the head node.
```
|   node1   |     |   node2   |     |   node3   |
+-----------+     +-----------+     +-----------+
| total=xxx |     |           |     |           |
|  next=yyy | ==> |  next=qqq | ==> |  next=eee | ==> ...
|  size=zzz |     |  size=ppp |     |  size=rrr |
|  pointers |     |  pointers |     |  pointers |
```
The node format:
```
| type | size | total | next |  pointers  |
|  2B  |  2B  |  8B   |  8B  | size * 8B  |
```
```go
const BNODE_FREE_LIST = 3
const FREE_LIST_HEADER = 4 + 8 + 8
const FREE_LIST_CAP = (BTREE_PAGE_SIZE - FREE_LIST_HEADER) / 8
```
Functions for accessing the list node:
```go
func flnSize(node BNode) int
func flnNext(node BNode) uint64
func flnPtr(node BNode, idx int) uint64
func flnSetPtr(node BNode, idx int, ptr uint64)
func flnSetHeader(node BNode, size uint16, next uint64)
func flnSetTotal(node BNode, total uint64)
```
7.2 The Free List Datatype
The `FreeList` type consists of the pointer to the head node and callbacks for managing disk pages.
```go
type FreeList struct {
	head uint64
	// callbacks for managing on-disk pages
	get func(uint64) BNode  // dereference a pointer
	new func(BNode) uint64  // append a new page
	use func(uint64, BNode) // reuse a page
}
```
These callbacks differ from the B-tree's because the pages used by the list are managed by the list itself.
- The `new` callback is only for appending new pages since the free list must reuse pages from itself.
- There is no `del` callback because the free list adds unused pages to itself.
- The `use` callback registers a pending update to a reused page.

For comparison, the B-tree callbacks are:
```go
type BTree struct {
	// pointer (a nonzero page number)
	root uint64
	// callbacks for managing on-disk pages
	get func(uint64) BNode // dereference a pointer
	new func(BNode) uint64 // allocate a new page
	del func(uint64)       // deallocate a page
}
```
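Because the free list only talks to storage through its three callbacks, it can be exercised without a real file. The sketch below wires a `FreeList` to a hypothetical in-memory page store; `memPager` and its methods are illustration names, not part of the original code, and the page size is an assumption.

```go
package main

import "fmt"

const BTREE_PAGE_SIZE = 4096 // assumption: page size is not given in this section

type BNode struct {
	data []byte
}

type FreeList struct {
	head uint64
	get  func(uint64) BNode
	new  func(BNode) uint64
	use  func(uint64, BNode)
}

// memPager is a hypothetical in-memory page store, useful for tests.
type memPager struct {
	pages []BNode
}

func (p *memPager) getPage(ptr uint64) BNode { return p.pages[ptr] }

// appendPage always grows the store; it never reuses a page.
func (p *memPager) appendPage(node BNode) uint64 {
	p.pages = append(p.pages, node)
	return uint64(len(p.pages) - 1)
}

// usePage overwrites an existing page chosen by the free list itself.
func (p *memPager) usePage(ptr uint64, node BNode) { p.pages[ptr] = node }

func newMemFreeList(p *memPager) *FreeList {
	return &FreeList{get: p.getPage, new: p.appendPage, use: p.usePage}
}

func main() {
	// Page 0 stands in for the master page, so real pointers are nonzero.
	p := &memPager{pages: []BNode{{make([]byte, BTREE_PAGE_SIZE)}}}
	fl := newMemFreeList(p)

	node := BNode{make([]byte, BTREE_PAGE_SIZE)}
	node.data[0] = 0xAB
	ptr := fl.new(node)
	fmt.Println(ptr, fl.get(ptr).data[0] == 0xAB) // prints: 1 true
}
```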
7.3 The Free List Implementation
Getting the nth item from the list is just a simple list traversal.
```go
func (fl *FreeList) Get(topn int) uint64 {
	assert(0 <= topn && topn < fl.Total())
	node := fl.get(fl.head)
	for flnSize(node) <= topn {
		topn -= flnSize(node)
		next := flnNext(node)
		assert(next != 0)
		node = fl.get(next)
	}
	return flnPtr(node, flnSize(node)-topn-1)
}
```
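To make the indexing concrete, here is a small in-memory model of the same traversal. `memNode` is a hypothetical stand-in for an on-disk node; only the index arithmetic is taken from `Get`.

```go
package main

import "fmt"

// memNode is a hypothetical in-memory stand-in for an on-disk list node.
type memNode struct {
	next *memNode
	ptrs []uint64
}

// get mirrors the traversal in FreeList.Get: skip whole nodes while topn
// is beyond them, then index from the end of the node that holds the item.
func get(head *memNode, topn int) uint64 {
	node := head
	for len(node.ptrs) <= topn {
		topn -= len(node.ptrs)
		node = node.next // the real code asserts that the link is nonzero
	}
	return node.ptrs[len(node.ptrs)-topn-1]
}

func main() {
	tail := &memNode{ptrs: []uint64{10, 11, 12}}
	head := &memNode{next: tail, ptrs: []uint64{20, 21}}
	for i := 0; i < 5; i++ {
		fmt.Println(i, get(head, i))
	}
}
```

Index 0 is the last pointer of the head node, so the items come out in the order 21, 20, 12, 11, 10: the top of the stack is the end of the head node.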
Updating the list is tricky. It first removes `popn` items from the list, then adds the `freed` pointers to the list. This can be divided into 3 phases:
- If `popn` is at least the size of the head node, remove the whole node. The node itself will be added to the list later. Repeat this step until it is no longer possible.
- We may need to remove some items from the list and possibly add some new items to the list. Updating the list head requires new pages, and new pages should be reused from the items of the list itself. Pop some items from the list one by one until there are enough pages to reuse for the next phase.
- Modify the list by adding new nodes.
```go
// remove `popn` pointers and add some new pointers
func (fl *FreeList) Update(popn int, freed []uint64) {
	assert(popn <= fl.Total())
	if popn == 0 && len(freed) == 0 {
		return // nothing to do
	}

	// prepare to construct the new list
	total := fl.Total()
	reuse := []uint64{}
	for fl.head != 0 && len(reuse)*FREE_LIST_CAP < len(freed) {
		node := fl.get(fl.head)
		freed = append(freed, fl.head) // recycle the node itself
		if popn >= flnSize(node) {
			// phase 1:
			// remove all pointers in this node
			popn -= flnSize(node)
		} else {
			// phase 2:
			// remove some pointers
			remain := flnSize(node) - popn
			popn = 0
			// reuse pointers from the free list itself
			for remain > 0 && len(reuse)*FREE_LIST_CAP < len(freed)+remain {
				remain--
				reuse = append(reuse, flnPtr(node, remain))
			}
			// move the remaining pointers of the node into the `freed` list
			for i := 0; i < remain; i++ {
				freed = append(freed, flnPtr(node, i))
			}
		}
		// discard the node and move to the next node
		total -= flnSize(node)
		fl.head = flnNext(node)
	}
	assert(len(reuse)*FREE_LIST_CAP >= len(freed) || fl.head == 0)

	// phase 3: prepend new nodes
	flPush(fl, freed, reuse)

	// done
	flnSetTotal(fl.get(fl.head), uint64(total+len(freed)))
}
```
```go
func flPush(fl *FreeList, freed []uint64, reuse []uint64) {
	for len(freed) > 0 {
		new := BNode{make([]byte, BTREE_PAGE_SIZE)}

		// construct a new node
		size := len(freed)
		if size > FREE_LIST_CAP {
			size = FREE_LIST_CAP
		}
		flnSetHeader(new, uint16(size), fl.head)
		for i, ptr := range freed[:size] {
			flnSetPtr(new, i, ptr)
		}
		freed = freed[size:]

		if len(reuse) > 0 {
			// reuse a pointer from the list
			fl.head, reuse = reuse[0], reuse[1:]
			fl.use(fl.head, new)
		} else {
			// or append a page to house the new node
			fl.head = fl.new(new)
		}
	}
	assert(len(reuse) == 0)
}
```
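The condition `len(reuse)*FREE_LIST_CAP >= len(freed)` says that the reused pages can hold every pointer about to be pushed. The sketch below shows the arithmetic with a hypothetical helper that mimics how `flPush` splits the pointers into nodes; the capacity assumes 4096-byte pages and the 20-byte header.

```go
package main

import "fmt"

// freeListCap assumes 4096-byte pages and a 20-byte node header.
const freeListCap = (4096 - 20) / 8

// chunkSizes returns the sizes of the nodes that a flPush-style loop
// would create for nfreed pointers, in creation order.
func chunkSizes(nfreed int, capacity int) []int {
	sizes := []int{}
	for nfreed > 0 {
		size := nfreed
		if size > capacity {
			size = capacity
		}
		sizes = append(sizes, size)
		nfreed -= size
	}
	return sizes
}

// pagesNeeded is the ceiling of nfreed / capacity.
func pagesNeeded(nfreed int, capacity int) int {
	return (nfreed + capacity - 1) / capacity
}

func main() {
	fmt.Println(chunkSizes(1200, freeListCap))  // prints: [509 509 182]
	fmt.Println(pagesNeeded(1200, freeListCap)) // prints: 3
}
```

Pushing 1200 pointers therefore needs three pages, which `Update` tries to take from the list itself before falling back to appending.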
7.4 Manage Disk Pages
Step 1: Modify the Data Structure
The `KV` data structure is modified to track pending page changes. Temporary pages are kept in a map keyed by their assigned page numbers. Deallocated page numbers are stored in the same map with a nil value.
```go
type KV struct {
	// omitted...
	page struct {
		flushed uint64 // database size in number of pages
		nfree   int    // number of pages taken from the free list
		nappend int    // number of pages to be appended
		// newly allocated or deallocated pages keyed by the pointer.
		// nil value denotes a deallocated page.
		updates map[uint64][]byte
	}
}
```
Step 2: Page Management for B-Tree
The `pageGet` function is modified to also return temporary pages because the free list code depends on this behavior.
```go
// callback for BTree & FreeList, dereference a pointer.
func (db *KV) pageGet(ptr uint64) BNode {
	if page, ok := db.page.updates[ptr]; ok {
		assert(page != nil)
		return BNode{page} // for new pages
	}
	return pageGetMapped(db, ptr) // for written pages
}

func pageGetMapped(db *KV, ptr uint64) BNode {
	start := uint64(0)
	for _, chunk := range db.mmap.chunks {
		end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
		if ptr < end {
			offset := BTREE_PAGE_SIZE * (ptr - start)
			return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
		}
		start = end
	}
	panic("bad ptr")
}
```
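The `updates` map relies on Go's two-value map lookup to tell three cases apart: a pending page, a deallocated page (present with a nil value), and a page that is not in the map at all. A minimal sketch of that distinction, where `classify` is a hypothetical helper for illustration only:

```go
package main

import "fmt"

// classify reports how a pointer would be treated, based only on the map.
func classify(updates map[uint64][]byte, ptr uint64) string {
	page, ok := updates[ptr]
	switch {
	case !ok:
		return "mapped" // not touched in this update: read it from the mmap
	case page == nil:
		return "deallocated" // pageGet asserts that this case never happens
	default:
		return "pending" // a temporary page that is not written yet
	}
}

func main() {
	updates := map[uint64][]byte{}
	updates[5] = []byte{1, 2, 3} // pending write
	updates[6] = nil             // deallocated page

	for _, ptr := range []uint64{5, 6, 7} {
		fmt.Println(ptr, classify(updates, ptr))
	}
}
```

A plain `updates[ptr] == nil` check would not be enough, since it is true both for deallocated pages and for pages missing from the map.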
The function for allocating a B-tree page is changed to reuse pages from the free list first.
```go
// callback for BTree, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
	assert(len(node.data) <= BTREE_PAGE_SIZE)
	ptr := uint64(0)
	if db.page.nfree < db.free.Total() {
		// reuse a deallocated page
		ptr = db.free.Get(db.page.nfree)
		db.page.nfree++
	} else {
		// append a new page
		ptr = db.page.flushed + uint64(db.page.nappend)
		db.page.nappend++
	}
	db.page.updates[ptr] = node.data
	return ptr
}
```
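The allocation order can be traced with a simplified model. The `allocator` type below is hypothetical: it replaces the on-disk free list with a slice (top of the stack first) and keeps only the counters used by `pageNew`.

```go
package main

import "fmt"

// allocator is a hypothetical, simplified model of the bookkeeping in KV.page.
type allocator struct {
	free    []uint64 // free list contents, top of the stack first
	flushed uint64   // number of pages already in the file
	nfree   int      // pages taken from the free list so far
	nappend int      // pages to be appended so far
}

// alloc follows the same policy as pageNew: free list first, then append.
func (a *allocator) alloc() uint64 {
	if a.nfree < len(a.free) {
		ptr := a.free[a.nfree]
		a.nfree++
		return ptr
	}
	ptr := a.flushed + uint64(a.nappend)
	a.nappend++
	return ptr
}

func main() {
	a := &allocator{free: []uint64{8, 3}, flushed: 10}
	fmt.Println(a.alloc(), a.alloc(), a.alloc(), a.alloc())
	// prints: 8 3 10 11
}
```

Note that allocating does not modify the list; it only advances `nfree`. The items are actually removed later, when `Update` is called with `popn` set to `nfree`.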
Removed pages are marked for the free list update later.
```go
// callback for BTree, deallocate a page.
func (db *KV) pageDel(ptr uint64) {
	db.page.updates[ptr] = nil
}
```
Step 3: Page Management for the Free List
Callbacks for appending a new page and reusing a page for the free list:
```go
// callback for FreeList, allocate a new page.
func (db *KV) pageAppend(node BNode) uint64 {
	assert(len(node.data) <= BTREE_PAGE_SIZE)
	ptr := db.page.flushed + uint64(db.page.nappend)
	db.page.nappend++
	db.page.updates[ptr] = node.data
	return ptr
}

// callback for FreeList, reuse a page.
func (db *KV) pageUse(ptr uint64, node BNode) {
	db.page.updates[ptr] = node.data
}
```
Step 4: Update the Free List
Before extending the file and writing pages to disk, we must update the free list first since it also creates pending writes.
```go
func writePages(db *KV) error {
	// update the free list
	freed := []uint64{}
	for ptr, page := range db.page.updates {
		if page == nil {
			freed = append(freed, ptr)
		}
	}
	db.free.Update(db.page.nfree, freed)

	// extend the file & mmap if needed
	// omitted...

	// copy pages to the file
	for ptr, page := range db.page.updates {
		if page != nil {
			copy(pageGetMapped(db, ptr).data, page)
		}
	}
	return nil
}
```
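Go does not define the iteration order of a map, so the `freed` slice built above can come out in a different order on each run. That does not affect correctness, but a reproducible file layout can help when debugging or testing. One optional variation, not part of the original code, is to sort the collected pointers; `collectFreed` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// collectFreed gathers the pointers whose value is nil (deallocated pages)
// and sorts them so the result does not depend on map iteration order.
func collectFreed(updates map[uint64][]byte) []uint64 {
	freed := []uint64{}
	for ptr, page := range updates {
		if page == nil {
			freed = append(freed, ptr)
		}
	}
	sort.Slice(freed, func(i, j int) bool { return freed[i] < freed[j] })
	return freed
}

func main() {
	updates := map[uint64][]byte{
		9: nil,
		4: {1},
		2: nil,
		7: nil,
	}
	fmt.Println(collectFreed(updates)) // prints: [2 7 9]
}
```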
The pointer to the list head is added to the master page:
| sig | btree_root | page_used | free_list |
| 16B | 8B | 8B | 8B |
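A sketch of reading and writing this layout follows. The signature string, the little-endian byte order, and the `master` type with its helper functions are assumptions made for illustration; only the field order and sizes come from the table above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// dbSig is a placeholder 16-byte signature, not the real one.
const dbSig = "ExampleSignature"

// master is a hypothetical in-memory view of the master page fields.
type master struct {
	root     uint64 // btree_root
	used     uint64 // page_used
	freeList uint64 // free_list: pointer to the head node of the free list
}

// encodeMaster lays out | sig 16B | btree_root 8B | page_used 8B | free_list 8B |.
func encodeMaster(m master) []byte {
	data := make([]byte, 40)
	copy(data[:16], dbSig)
	binary.LittleEndian.PutUint64(data[16:24], m.root)
	binary.LittleEndian.PutUint64(data[24:32], m.used)
	binary.LittleEndian.PutUint64(data[32:40], m.freeList)
	return data
}

// decodeMaster checks the signature and reads the three fields back.
func decodeMaster(data []byte) (master, error) {
	if len(data) < 40 || !bytes.Equal(data[:16], []byte(dbSig)) {
		return master{}, errors.New("bad master page")
	}
	return master{
		root:     binary.LittleEndian.Uint64(data[16:24]),
		used:     binary.LittleEndian.Uint64(data[24:32]),
		freeList: binary.LittleEndian.Uint64(data[32:40]),
	}, nil
}

func main() {
	data := encodeMaster(master{root: 3, used: 12, freeList: 7})
	m, err := decodeMaster(data)
	fmt.Println(m.root, m.used, m.freeList, err) // prints: 3 12 7 <nil>
}
```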
Step 5: Done
The KV store is finished. It is persistent and crash resistant, although it can only be accessed sequentially.
There is more to learn in part II of the book:
- Relational DB on the KV store.
- Concurrent access to the database and transactions.