Build Your Own Database From Scratch: 01. Files vs. Databases

01. Files vs. Databases

This chapter shows the limitations of simply dumping data to files and the problems that databases solve.


1.1 Persisting Data to Files


Let's say you have some data that needs to be persisted to a file; this is a typical way to do it:


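```go
import "os"

// SaveData1 overwrites the file at path with data.
func SaveData1(path string, data []byte) error {
    // O_TRUNC empties an existing file as soon as it is opened.
    fp, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0664)
    if err != nil {
        return err
    }
    defer fp.Close()

    _, err = fp.Write(data)
    return err
}
```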

This naive approach has some drawbacks:


  1. It truncates the file before updating it. What if the file needs to be read concurrently?


  2. Writing data to files may not be atomic, depending on the size of the write. Concurrent readers might get incomplete data.


  3. When is the data actually persisted to the disk? The data is probably still in the operating system's page cache after the write syscall returns. What's the state of the file when the system crashes and reboots?


1.2 Atomic Renaming

To address some of these problems, let's propose a better approach:


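```go
import (
    "fmt"
    "os"
)

// SaveData2 writes data to a temporary file, then renames it over path.
// randomInt is a helper that is not shown here; it is assumed to return a random integer.
func SaveData2(path string, data []byte) error {
    tmp := fmt.Sprintf("%s.tmp.%d", path, randomInt())
    // O_EXCL makes the open fail if the temporary name already exists.
    fp, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0664)
    if err != nil {
        return err
    }
    defer fp.Close()

    _, err = fp.Write(data)
    if err != nil {
        return err
    }

    return os.Rename(tmp, path)
}
```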

This approach is slightly more sophisticated: it first dumps the data to a temporary file, then renames the temporary file to the target file. This seems to be free of the non-atomic problem of updating a file directly, because the rename operation is atomic. If the system crashes before the rename, the original file remains intact, and applications have no problem reading the file concurrently.


However, this is still problematic because it doesn't control when the data is persisted to the disk. The metadata (such as the size of the file) may be persisted to the disk before the data, potentially corrupting the file if the system crashes. (You may have noticed that some log files have zeros in them after a power failure; that's a sign of file corruption.)


1.3 fsync

To fix the problem, we must flush the data to the disk before renaming the temporary file. The Linux syscall for this is "fsync".


```go
func SaveData3(path string, data []byte) error {
    // code omitted...

    _, err = fp.Write(data)
    if err != nil {
        return err
    }

    err = fp.Sync() // fsync: flush the file's data to the disk before renaming
    if err != nil {
        return err
    }

    return os.Rename(tmp, path)
}
```

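Are we done yet? The answer is no. We have flushed the file's data to the disk, but what about the metadata? The rename changes a directory entry, which is stored in the containing directory rather than in the file itself. Should we also call fsync on the directory containing the file?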


This rabbit hole is quite deep and that's why databases are preferred over files for persisting data to the disk.


1.4 Append-Only Logs

In some use cases, it makes sense to persist data using an append-only log.


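```go
import "os"

// LogCreate opens the log file, creating it if it does not exist.
// O_APPEND makes every write go to the end of the file, even when an
// existing log is reopened; without it, writes would start at offset 0.
func LogCreate(path string) (*os.File, error) {
    return os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0664)
}

// LogAppend writes one line to the log and flushes it to the disk.
func LogAppend(fp *os.File, line string) error {
    buf := []byte(line)
    buf = append(buf, '\n')
    _, err := fp.Write(buf)
    if err != nil {
        return err
    }
    return fp.Sync() // fsync
}
```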

The nice thing about the append-only log is that it does not modify the existing data, nor does it deal with the rename operation, making it more resistant to corruption. But logs alone are not enough to build a database.


  1. A database uses additional "indexes" to query the data efficiently. Without an index, the only way to query a bunch of records in arbitrary order is a brute-force scan.


  2. How do logs handle deleted data? They cannot grow forever.


We have already seen some of the problems we must handle. Let's start with indexing in the next chapter.


