Build Your Own Database From Scratch: 06. Persist to Disk

The B-tree data structure from the previous chapter can be dumped to disk easily. Let's build a simple KV store on top of it. Since our B-tree implementation is immutable, we'll allocate disk space in an append-only manner; reusing disk space is deferred to the next chapter.

6.1 The Method for Persisting Data

As mentioned in previous chapters, persisting data to disk is more than just dumping data into files. There are a couple of considerations:

Crash recovery: This includes database process crashes, OS crashes, and power failures. The database must be in a usable state after a reboot.
Durability: After a successful response from the database, the data involved is guaranteed to persist, even after a crash. In other words, persistence occurs before responding to the client.

Many materials describe databases using the ACID jargon (atomicity, consistency, isolation, durability), but these concepts are not orthogonal and are hard to explain, so let's focus on our practical example instead.

The immutable aspect of our B-tree: Updating the B-tree does not touch the previous version of the B-tree, which makes crash recovery easy. Should the update go wrong, we can simply recover to the previous version.
Durability is achieved via the fsync Linux syscall. Normal file IO via write or mmap goes to the page cache first, and the system flushes the page cache to the disk later. The fsync syscall blocks until all dirty pages of the file are flushed.

How do we recover to the previous version if an update goes wrong? We can split the update into two phases:

An update creates new nodes; write them to the disk.
Each update creates a new root node; we need to store the pointer to the root node somewhere.

The first phase may involve writing multiple pages to the disk, which is generally not atomic. The second phase, however, involves only a single pointer and can be done in an atomic single-page write. This makes the whole operation atomic: if the database crashes before the second phase completes, the update simply does not happen.

The first phase must be persisted before the second phase; otherwise, the root pointer could point to a corrupted (partly persisted) version of the tree after a crash. There should be an fsync between the two phases to serve as a barrier.

And the second phase should also be fsync'd before responding to the client.
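
The two phases and their fsync calls can be sketched on a plain file, without the B-tree. This is only an illustration under assumptions: the names `commit`, `readRoot`, `twoPhaseDemo`, and `pageSize` are hypothetical, page 0 holds nothing but an 8-byte root pointer, and pages are written with `WriteAt` rather than through a mapping.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const pageSize = 4096 // hypothetical page size for this sketch

// commit appends `pages` starting at page number `used`,
// then publishes `root` as the new root pointer in page 0.
func commit(fp *os.File, used uint64, pages [][]byte, root uint64) error {
	// phase 1: write the new pages; as a whole this is not atomic.
	for i, page := range pages {
		off := int64(used+uint64(i)) * pageSize
		if _, err := fp.WriteAt(page, off); err != nil {
			return fmt.Errorf("write page: %w", err)
		}
	}
	// barrier: the new pages must be durable before the root points at them.
	if err := fp.Sync(); err != nil {
		return fmt.Errorf("fsync pages: %w", err)
	}
	// phase 2: one small write that publishes the new root pointer.
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], root)
	if _, err := fp.WriteAt(buf[:], 0); err != nil {
		return fmt.Errorf("write root: %w", err)
	}
	// the root pointer must be durable before we report success.
	if err := fp.Sync(); err != nil {
		return fmt.Errorf("fsync root: %w", err)
	}
	return nil
}

// readRoot reads the root pointer back from page 0.
func readRoot(fp *os.File) (uint64, error) {
	var buf [8]byte
	if _, err := fp.ReadAt(buf[:], 0); err != nil {
		return 0, err
	}
	return binary.LittleEndian.Uint64(buf[:]), nil
}

// twoPhaseDemo commits one page to a temporary file and returns the stored root.
func twoPhaseDemo() (uint64, error) {
	fp, err := os.CreateTemp("", "twophase-*.db")
	if err != nil {
		return 0, err
	}
	defer os.Remove(fp.Name())
	defer fp.Close()

	page := make([]byte, pageSize)
	copy(page, "new root node")
	if err := commit(fp, 1, [][]byte{page}, 1); err != nil {
		return 0, err
	}
	return readRoot(fp)
}

func main() {
	root, err := twoPhaseDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("root pointer:", root)
}
```

If a crash happens anywhere before the second `WriteAt` reaches the disk, page 0 still holds the old root pointer, so the half-written pages are simply unreachable.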

6.2 mmap-Based IO

The contents of a disk file can be mapped into a process's virtual address space using the mmap syscall. Reading from the mapped address initiates transparent disk IO, which is equivalent to reading the file via the read syscall, but without the need for a user-space buffer or the overhead of a syscall. The mapped address is a proxy to the page cache; modifying data through it is equivalent to using the write syscall.
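
As a small illustration of the mapping acting as a proxy to the page cache, the sketch below maps a temporary file, reads it through the mapping, modifies it through the mapping, and then reads it back with an ordinary file read. It assumes Linux (or another system where Go's `syscall.Mmap` is available and shared mappings are coherent with file reads); `mmapDemo` is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mmapDemo returns the bytes seen through the mapping before the change,
// and the bytes seen through a regular read after the change.
func mmapDemo() (string, string, error) {
	fp, err := os.CreateTemp("", "mmap-demo-*")
	if err != nil {
		return "", "", err
	}
	defer os.Remove(fp.Name())
	defer fp.Close()

	if _, err := fp.WriteAt([]byte("hello"), 0); err != nil {
		return "", "", err
	}

	// map the first 5 bytes of the file, shared with the page cache.
	data, err := syscall.Mmap(
		int(fp.Fd()), 0, 5,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
	)
	if err != nil {
		return "", "", fmt.Errorf("mmap: %w", err)
	}
	defer syscall.Munmap(data)

	before := string(data) // a plain memory read, no read syscall
	copy(data, "HELLO")    // a plain memory write, no write syscall

	// the change is visible to normal file IO because both go through the page cache.
	buf := make([]byte, 5)
	if _, err := fp.ReadAt(buf, 0); err != nil {
		return "", "", err
	}
	return before, string(buf), nil
}

func main() {
	before, after, err := mmapDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println(before, after)
}
```

Note that nothing here makes the change durable; that still requires an fsync.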

mmap is convenient, and we'll use it for our KV store. However, the use of mmap is not essential.

// create the initial mmap that covers the whole file.
func mmapInit(fp *os.File) (int, []byte, error) {
    fi, err := fp.Stat()
    if err != nil {
        return 0, nil, fmt.Errorf("stat: %w", err)
    }

    if fi.Size()%BTREE_PAGE_SIZE != 0 {
        return 0, nil, errors.New("File size is not a multiple of page size.")
    }

    mmapSize := 64 << 20
    assert(mmapSize%BTREE_PAGE_SIZE == 0)
    for mmapSize < int(fi.Size()) {
        mmapSize *= 2
    }
    // mmapSize can be larger than the file

    chunk, err := syscall.Mmap(
        int(fp.Fd()), 0, mmapSize,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
    )
    if err != nil {
        return 0, nil, fmt.Errorf("mmap: %w", err)
    }

    return int(fi.Size()), chunk, nil
}

The above function creates an initial mapping that is at least as large as the file. The mapping can be larger than the file; the range past the end of the file is not accessible (accessing it raises SIGBUS), but it becomes usable once the file is extended.

We may have to extend the range of the mapping as the file grows. The syscall for extending an mmap range is mremap. Unfortunately, we may not be able to keep the starting address when extending a range by remapping. Our approach is to use multiple mappings instead: create a new mapping for the overflow file range.

type KV struct {
    Path string
    // internals
    fp   *os.File
    tree BTree
    mmap struct {
        file   int      // file size, can be larger than the database size
        total  int      // mmap size, can be larger than the file size
        chunks [][]byte // multiple mmaps, can be non-continuous
    }
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}

// extend the mmap by adding new mappings.
func extendMmap(db *KV, npages int) error {
    if db.mmap.total >= npages*BTREE_PAGE_SIZE {
        return nil
    }

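    // double the address space: the new chunk maps the file range
    // [total, 2*total), so it is as large as everything mapped so far.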
    chunk, err := syscall.Mmap(
        int(db.fp.Fd()), int64(db.mmap.total), db.mmap.total,
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED,
    )
    if err != nil {
        return fmt.Errorf("mmap: %w", err)
    }

    db.mmap.total += db.mmap.total
    db.mmap.chunks = append(db.mmap.chunks, chunk)
    return nil
}

The size of the new mapping increases exponentially so that we don't have to call mmap frequently.
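
To make the growth pattern concrete, the following sketch simulates repeated extensions without calling mmap at all. `chunkSizes` is a hypothetical helper that only does the arithmetic: each extension adds one chunk as large as the total mapped so far.

```go
package main

import "fmt"

// chunkSizes lists the chunk sizes that result from starting with a mapping
// of `initial` bytes and doubling the total until it covers `need` bytes.
func chunkSizes(initial, need int) []int {
	sizes := []int{initial}
	total := initial
	for total < need {
		sizes = append(sizes, total) // the new chunk equals the current total
		total += total
	}
	return sizes
}

func main() {
	const mb = 1 << 20
	for _, sz := range chunkSizes(64*mb, 1024*mb) {
		fmt.Println(sz/mb, "MB")
	}
}
```

Starting from 64 MB, covering 1 GB takes chunks of 64, 64, 128, 256, and 512 MB, so only four extra mmap calls are needed.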

Below is how we access a page from the mapped address.

// callback for BTree, dereference a pointer.
func (db *KV) pageGet(ptr uint64) BNode {
    start := uint64(0)
    for _, chunk := range db.mmap.chunks {
        end := start + uint64(len(chunk))/BTREE_PAGE_SIZE
        if ptr < end {
            offset := BTREE_PAGE_SIZE * (ptr - start)
            return BNode{chunk[offset : offset+BTREE_PAGE_SIZE]}
        }
        start = end
    }
    panic("bad ptr")
}
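
A worked example may help with the chunk arithmetic. The sketch below repeats the same lookup on chunk lengths alone, so it runs without any mapping; `locate` and `pageSize` are hypothetical names, and the page size of 4096 is an assumption.

```go
package main

import "fmt"

const pageSize = 4096 // assumed page size

// locate returns the index of the chunk holding page `ptr` and the byte
// offset of that page inside the chunk. ok is false if no chunk covers it.
func locate(chunkLens []int, ptr uint64) (idx int, offset uint64, ok bool) {
	start := uint64(0) // first page number covered by the current chunk
	for i, n := range chunkLens {
		end := start + uint64(n)/pageSize
		if ptr < end {
			return i, pageSize * (ptr - start), true
		}
		start = end
	}
	return 0, 0, false
}

func main() {
	// three chunks holding 2, 2, and 4 pages: page numbers 0-1, 2-3, 4-7.
	chunks := []int{2 * pageSize, 2 * pageSize, 4 * pageSize}
	fmt.Println(locate(chunks, 5)) // page 5 is the second page of the third chunk
	fmt.Println(locate(chunks, 8)) // page 8 is not mapped
}
```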

6.3 The Master Page

The first page of the file is used to store the pointer to the root; let's call it the "master page". The total number of pages is needed for allocating new nodes, so it is also stored there.

|     the_master_page    | pages... | tree_root | pages... |
| btree_root | page_used |                ^                ^
      |            |                      |                |
      +------------+----------------------+                |
                   |                                       |
                   +---------------------------------------+

The function below reads the master page when initializing a database:

const DB_SIG = "BuildYourOwnDB06"

// the master page format.
// it contains the pointer to the root and other important bits.
// | sig | btree_root | page_used |
// | 16B |     8B     |     8B    |
func masterLoad(db *KV) error {
    if db.mmap.file == 0 {
        // empty file, the master page will be created on the first write.
        db.page.flushed = 1 // reserved for the master page
        return nil
    }

    data := db.mmap.chunks[0]
    root := binary.LittleEndian.Uint64(data[16:])
    used := binary.LittleEndian.Uint64(data[24:])

    // verify the page
    if !bytes.Equal([]byte(DB_SIG), data[:16]) {
        return errors.New("Bad signature.")
    }
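    // `used` counts the master page itself, so it is at least 1
    // and cannot exceed the number of pages in the file.
    bad := !(1 <= used && used <= uint64(db.mmap.file/BTREE_PAGE_SIZE))
    // the root must refer to an allocated page
    // (`root` is unsigned, so only the upper bound needs checking).
    bad = bad || !(root < used)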
    if bad {
        return errors.New("Bad master page.")
    }

    db.tree.root = root
    db.page.flushed = used
    return nil
}

Below is the function for updating the master page. Unlike the code for reading, it doesn't use the mapped address for writing. This is because modifying a page via mmap is not atomic. The kernel could flush the page midway and corrupt the disk file, while a small write that doesn't cross the page boundary is guaranteed to be atomic.

// update the master page. it must be atomic.
func masterStore(db *KV) error {
    var data [32]byte
    copy(data[:16], []byte(DB_SIG))
    binary.LittleEndian.PutUint64(data[16:], db.tree.root)
    binary.LittleEndian.PutUint64(data[24:], db.page.flushed)
    // NOTE: Updating the page via mmap is not atomic.
    //       Use the `pwrite()` syscall instead.
    _, err := db.fp.WriteAt(data[:], 0)
    if err != nil {
        return fmt.Errorf("write master page: %w", err)
    }
    return nil
}
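
The 32-byte layout can be exercised in isolation with a round trip through plain byte slices. This is a sketch rather than part of the KV store: `encodeMaster` and `decodeMaster` are hypothetical helpers that mirror the format and checks above, with the file size passed in as a page count.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

const sig = "BuildYourOwnDB06" // the 16-byte signature, same value as DB_SIG

// encodeMaster builds the layout | sig 16B | btree_root 8B | page_used 8B |.
func encodeMaster(root, used uint64) [32]byte {
	var data [32]byte
	copy(data[:16], sig)
	binary.LittleEndian.PutUint64(data[16:], root)
	binary.LittleEndian.PutUint64(data[24:], used)
	return data
}

// decodeMaster parses and validates a master page against the file size in pages.
func decodeMaster(data []byte, filePages uint64) (root, used uint64, err error) {
	if len(data) < 32 || !bytes.Equal([]byte(sig), data[:16]) {
		return 0, 0, errors.New("bad signature")
	}
	root = binary.LittleEndian.Uint64(data[16:])
	used = binary.LittleEndian.Uint64(data[24:])
	if used < 1 || used > filePages || root >= used {
		return 0, 0, errors.New("bad master page")
	}
	return root, used, nil
}

func main() {
	page := encodeMaster(3, 10)

	root, used, err := decodeMaster(page[:], 16)
	fmt.Println(root, used, err) // 3 10 <nil>

	// a file of only 8 pages cannot contain 10 used pages.
	_, _, err = decodeMaster(page[:], 8)
	fmt.Println(err)
}
```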

6.4 Allocating Disk Pages

We'll simply append new pages to the end of the database until we add a free list in the next chapter.

New pages are kept in memory temporarily until they are copied to the file later (possibly after extending the file).

type KV struct {
    // omitted...
    page struct {
        flushed uint64   // database size in number of pages
        temp    [][]byte // newly allocated pages
    }
}

// callback for BTree, allocate a new page.
func (db *KV) pageNew(node BNode) uint64 {
    // TODO: reuse deallocated pages
    assert(len(node.data) <= BTREE_PAGE_SIZE)
    ptr := db.page.flushed + uint64(len(db.page.temp))
    db.page.temp = append(db.page.temp, node.data)
    return ptr
}

// callback for BTree, deallocate a page.
func (db *KV) pageDel(uint64) {
    // TODO: implement this
}

Before writing the pending pages, we may need to extend the file first. The corresponding syscall is fallocate.

// extend the file to at least `npages`.
func extendFile(db *KV, npages int) error {
    filePages := db.mmap.file / BTREE_PAGE_SIZE
    if filePages >= npages {
        return nil
    }

    for filePages < npages {
        // the file size is increased exponentially,
        // so that we don't have to extend the file for every update.
        inc := filePages / 8
        if inc < 1 {
            inc = 1
        }
        filePages += inc
    }

    fileSize := filePages * BTREE_PAGE_SIZE
    err := syscall.Fallocate(int(db.fp.Fd()), 0, 0, int64(fileSize))
    if err != nil {
        return fmt.Errorf("fallocate: %w", err)
    }

    db.mmap.file = fileSize
    return nil
}
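
The growth rule inside the loop can be tried out on its own. The sketch below keeps only the arithmetic (no fallocate call); `grow` is a hypothetical helper that returns the new file size in pages.

```go
package main

import "fmt"

// grow returns the file size in pages after extending a file of
// `filePages` pages so that it holds at least `npages` pages.
// Each step adds 1/8 of the current size, with a minimum of one page.
func grow(filePages, npages int) int {
	for filePages < npages {
		inc := filePages / 8
		if inc < 1 {
			inc = 1
		}
		filePages += inc
	}
	return filePages
}

func main() {
	fmt.Println(grow(0, 3))       // small files grow one page at a time
	fmt.Println(grow(80, 81))     // 80 pages grow by 10
	fmt.Println(grow(1000, 1001)) // 1000 pages grow by 125
}
```

For a large file, needing a single extra page extends the file by about 12.5%, so most subsequent updates find the space already allocated.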

6.5 Initializing the Database

Let's put together what we have done so far.

func (db *KV) Open() error {
    // open or create the DB file
    fp, err := os.OpenFile(db.Path, os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        return fmt.Errorf("OpenFile: %w", err)
    }
    db.fp = fp

    // create the initial mmap
    sz, chunk, err := mmapInit(db.fp)
    if err != nil {
        goto fail
    }
    db.mmap.file = sz
    db.mmap.total = len(chunk)
    db.mmap.chunks = [][]byte{chunk}

    // btree callbacks
    db.tree.get = db.pageGet
    db.tree.new = db.pageNew
    db.tree.del = db.pageDel

    // read the master page
    err = masterLoad(db)
    if err != nil {
        goto fail
    }

    // done
    return nil

fail:
    db.Close()
    return fmt.Errorf("KV.Open: %w", err)
}

// cleanups
func (db *KV) Close() {
    for _, chunk := range db.mmap.chunks {
        err := syscall.Munmap(chunk)
        assert(err == nil)
    }
    _ = db.fp.Close()
}

6.6 Update Operations

Unlike queries, update operations must persist the data before returning.

// read the db
func (db *KV) Get(key []byte) ([]byte, bool) {
    return db.tree.Get(key)
}

// update the db
func (db *KV) Set(key []byte, val []byte) error {
    db.tree.Insert(key, val)
    return flushPages(db)
}

func (db *KV) Del(key []byte) (bool, error) {
    deleted := db.tree.Delete(key)
    return deleted, flushPages(db)
}

flushPages is the function that persists the new pages.

// persist the newly allocated pages after updates
func flushPages(db *KV) error {
    if err := writePages(db); err != nil {
        return err
    }
    return syncPages(db)
}

It is split into two phases as mentioned earlier.

func writePages(db *KV) error {
    // extend the file & mmap if needed
    npages := int(db.page.flushed) + len(db.page.temp)
    if err := extendFile(db, npages); err != nil {
        return err
    }
    if err := extendMmap(db, npages); err != nil {
        return err
    }

    // copy data to the file
    for i, page := range db.page.temp {
        ptr := db.page.flushed + uint64(i)
        copy(db.pageGet(ptr).data, page)
    }
    return nil
}

An fsync is issued between the two phases and again after the second one.

func syncPages(db *KV) error {
    // flush data to the disk. must be done before updating the master page.
    if err := db.fp.Sync(); err != nil {
        return fmt.Errorf("fsync: %w", err)
    }
    db.page.flushed += uint64(len(db.page.temp))
    db.page.temp = db.page.temp[:0]

    // update & flush the master page
    if err := masterStore(db); err != nil {
        return err
    }
    if err := db.fp.Sync(); err != nil {
        return fmt.Errorf("fsync: %w", err)
    }
    return nil
}

Our KV store is functional, but the file can't grow forever as we update the database. We'll finish our KV store by reusing disk pages in the next chapter.