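Case 3: File Systems and Data Persistence

Scenario

A financial trading system appends every transaction to a log file. The developer used AI-generated code that ran fine in the test environment, but after go-live the data center lost power unexpectedly. After the reboot, the most recent 30 minutes of transaction records were gone, leaving the system with a serious data consistency problem.

The code had already called file.write(), so why was the data lost?

Problem code

import json

def record_transaction(transaction_data):
    """Append one transaction to the log file."""
    with open("transactions.log", "a") as f:
        f.write(json.dumps(transaction_data) + "\n")
    # Leaving the with-block closes the file, which only pushes the data into
    # the OS page cache; nothing has forced it onto the disk yet.
    return "Transaction recorded"

# A power failure at this point can lose the data.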

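Operating system concepts

1. The layered buffering behind file I/O

The full write path:

Application
    ↓ f.write() / fwrite()
Language runtime buffer (Python / C standard library)
    ↓ write() system call
OS page cache
    ↓ background writeback / explicit fsync()
Disk controller cache
    ↓ physical write
Physical disk

Root cause: the write() system call only copies data into the operating system's page cache. It returns before anything has been written to the disk.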
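
The first layer of that stack can be observed from Python itself. The sketch below is illustrative only (the helper name and the temporary file are assumptions, not part of the trading system): it writes a few bytes and asks the OS for the file size before and after f.flush(). On CPython with default buffering, the size should stay at 0 until the flush.

```python
import os
import tempfile

def size_before_and_after_flush(payload="Hello"):
    """Return the file size seen by the OS before and after f.flush()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write(payload)                 # stays in Python's user-space buffer
            before = os.path.getsize(path)   # the kernel has not seen the data yet
            f.flush()                        # issues the write() system call
            after = os.path.getsize(path)    # now in the page cache (still not on disk)
        return before, after
    finally:
        os.remove(path)

print(size_before_and_after_flush())
```

Even the "after" value only proves that the data reached the page cache; making it durable still requires fsync().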

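2. The page cache

import os

# Write a file
with open("test.txt", "w") as f:
    f.write("Hello")  # still in Python's user-space buffer
# close() has flushed Python's buffer, so the data is now in the page cache

# A power failure here loses the data.

# The kernel writes dirty pages back to disk when:
# 1. Dirty pages exceed a threshold (vm.dirty_background_ratio / vm.dirty_ratio)
# 2. A page has been dirty longer than vm.dirty_expire_centisecs (30 s by default
#    on Linux); the flusher threads wake every vm.dirty_writeback_centisecs (5 s)
# 3. The application calls fsync(), fdatasync() or sync() explicitly
# Note: close() does not trigger writeback and gives no durability guarantee.

Checking the page cache size:

free -h
#               total  used  free  shared  buff/cache  available
# The buff/cache column is mostly page cache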

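3. File descriptors and handles

import os
import subprocess

# A file descriptor is a small integer that indexes the per-process table of
# open files maintained by the kernel
fd = os.open("file.txt", os.O_WRONLY | os.O_CREAT)
os.write(fd, b"data")
os.close(fd)

# List the files this process has open
subprocess.run(["lsof", "-p", str(os.getpid())])

# File descriptor leak example
def leak_file_descriptors():
    files = []  # holding references keeps CPython from closing the files
    for i in range(10000):
        files.append(open(f"file_{i}.txt", "w"))  # never closed!
        # Once the per-process limit (ulimit -n) is reached:
        # OSError: [Errno 24] Too many open files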

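Solutions

Solution 1: Flush the buffers explicitly

import json
import os

def record_transaction_safe(transaction_data):
    """Append a transaction and force it onto the disk before returning."""
    with open("transactions.log", "a") as f:
        f.write(json.dumps(transaction_data) + "\n")
        f.flush()  # Python buffer → OS page cache
        os.fsync(f.fileno())  # OS page cache → disk
    return "Transaction recorded safely"

Comparison of the flush operations:

Operation	Scope	Guarantee
`f.flush()`	Python buffer → OS page cache	Data has reached the OS, not the disk
`os.fsync(fd)`	Page cache → disk (data and metadata)	Data and file metadata are on stable storage
`os.fdatasync(fd)`	Page cache → disk (data plus the metadata needed to read it back)	Usually faster than fsync; non-essential metadata such as mtime may not be flushed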
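
One detail the table does not cover: when a file has just been created, its directory entry is a separate piece of metadata. A minimal sketch, assuming a POSIX system where a directory can be opened read-only and passed to os.fsync(); the helper name durable_append and the demo path are illustrative.

```python
import os
import tempfile

def durable_append(path, line):
    """Append one line and return only after it has been fsync'ed."""
    is_new = not os.path.exists(path)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()                # Python buffer -> page cache
        os.fsync(f.fileno())     # page cache -> disk
    if is_new:
        # A newly created file also needs its directory entry persisted.
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)

demo_path = os.path.join(tempfile.mkdtemp(), "transactions.log")
durable_append(demo_path, '{"id": 1, "amount": 100}')
durable_append(demo_path, '{"id": 2, "amount": 250}')
with open(demo_path, encoding="utf-8") as f:
    print(f.read(), end="")
```

Opening the file for every record keeps the sketch short; a long-running service would normally keep the file open and call flush() and fsync() on the same handle.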

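Solution 2: Synchronous and direct I/O

import os

# O_SYNC: synchronous writes; write() returns only after the data is on disk
# O_APPEND: every write goes to the end of the file (without it, writes start at offset 0)
fd = os.open("transactions.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC, 0o644)

data = b"transaction data\n"
os.write(fd, data)  # has reached the disk by the time this returns
os.close(fd)

# Note: O_DIRECT bypasses the page cache but does not by itself make a write
# durable (it is usually combined with O_SYNC/O_DSYNC or fsync). It also
# requires aligned buffers, sizes and offsets, so it is rarely used in
# application code.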

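Solution 3: Write-ahead logging (WAL)

import json
import os

class TransactionLog:
    def __init__(self, data_file, wal_file):
        self.data_file = data_file
        self.wal_file = wal_file

    def _append_durably(self, path, record):
        """Append one JSON line and fsync before returning."""
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def commit(self, transaction):
        """Use a WAL to keep the data file consistent.

        transaction is assumed to be a dict with an "id" key.
        """
        # Step 1: write the intent to the WAL and make sure it is on disk
        self._append_durably(self.wal_file, {"type": "begin", "txn": transaction})

        # Step 2: update the data file
        self._append_durably(self.data_file, transaction)

        # Step 3: mark the WAL entry as applied
        self._append_durably(self.wal_file, {"type": "done", "id": transaction["id"]})

    def recover(self):
        """Recover after a crash."""
        # Read the WAL, find "begin" entries with no matching "done",
        # and re-apply those transactions to the data file
        pass

Advantages of a WAL:

Sequential writes: the WAL is append-only, which is faster than random writes
Batched flushing: several transactions can share a single fsync
Crash recovery: after a restart, transactions are replayed from the WAL

Used in practice by:

SQLite's WAL mode
PostgreSQL's WAL
Redis's AOF persistence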
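
The crash-recovery step is where a WAL earns its keep, so here is a small self-contained sketch of it. The record layout ("begin"/"done" lines in JSON), the class name MiniWAL and the helper names are assumptions made for illustration; real databases use binary records, checksums and log truncation, none of which appear here.

```python
import json
import os
import tempfile

def append_durably(path, record):
    """Append one JSON line and fsync it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())

def read_records(path):
    """Yield JSON records; stop at a torn (partially written) last line."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                return

class MiniWAL:
    def __init__(self, data_path, wal_path):
        self.data_path = data_path
        self.wal_path = wal_path

    def commit(self, txn):
        append_durably(self.wal_path, {"type": "begin", "txn": txn})
        append_durably(self.data_path, txn)
        append_durably(self.wal_path, {"type": "done", "id": txn["id"]})

    def recover(self):
        """Re-apply transactions that were logged but never marked done."""
        pending = {}
        for rec in read_records(self.wal_path):
            if rec["type"] == "begin":
                pending[rec["txn"]["id"]] = rec["txn"]
            else:
                pending.pop(rec["id"], None)
        applied = {txn["id"] for txn in read_records(self.data_path)}
        for txn_id, txn in pending.items():
            if txn_id not in applied:          # keeps the replay idempotent
                append_durably(self.data_path, txn)
            append_durably(self.wal_path, {"type": "done", "id": txn_id})
        return list(pending)

workdir = tempfile.mkdtemp()
wal = MiniWAL(os.path.join(workdir, "data.log"), os.path.join(workdir, "wal.log"))
wal.commit({"id": 1, "amount": 100})
# Simulate a crash after the WAL write but before the data-file write
append_durably(wal.wal_path, {"type": "begin", "txn": {"id": 2, "amount": 250}})
print(wal.recover())   # transaction 2 is replayed
print(wal.recover())   # nothing left to replay
```

The check against the data file is what makes recovery safe to run twice: a crash between the data write and the "done" marker must not apply the same transaction again.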

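File system journaling

1. Journaling file systems

Modern file systems (ext4, NTFS, XFS) use journaling to stay consistent across crashes:

Without a journal:
    write data → update metadata
    ↑ a crash in between leaves the file system inconsistent

With a journal:
    1. Write the changes to the journal (metadata, plus data in data=journal mode)
    2. Write the commit record and flush the journal to disk (commit)
    3. Write the changes to their final location (checkpoint)
    4. Mark the journal space as reusable
    ↑ after a crash at any point, committed transactions are replayed and incomplete ones are discarded

The three ext4 journaling modes:

# Show the mount options
mount | grep ext4

# data=journal: journal both data and metadata (safest, slowest)
# data=ordered: write data blocks first, then commit the metadata to the journal (default, balanced)
# data=writeback: journal metadata only, with no ordering against data writes (fastest, least safe)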

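2. Enforcing write order (barriers)

# An application can use fsync() as a barrier to enforce write ordering
import os

fd = os.open("file.txt", os.O_WRONLY | os.O_CREAT, 0o644)

# Write data A
os.write(fd, b"important data A")

# Barrier: A is on disk before B is written
os.fsync(fd)

# Write data B
os.write(fd, b"dependent data B")
os.fsync(fd)

os.close(fd)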
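
A common application of fsync-as-barrier is replacing a file atomically: the new contents must be durable before the rename that makes them visible. The following is a sketch under POSIX assumptions (os.replace is atomic within one file system, and a directory can be fsync'ed); atomic_write is a hypothetical helper name.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path with data so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # barrier 1: contents reach the disk first
        os.replace(tmp_path, path)    # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # barrier 2: the rename itself reaches the disk
    finally:
        os.close(dir_fd)

target = os.path.join(tempfile.mkdtemp(), "config.json")
atomic_write(target, b'{"version": 1}')
atomic_write(target, b'{"version": 2}')
with open(target, "rb") as f:
    print(f.read())
```

Without the first fsync, the rename could reach the disk before the data does, and a crash would leave an empty or partial file under the final name.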

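In practice: how databases handle persistence

MySQL durability settings

-- Show the current settings
SHOW VARIABLES LIKE 'innodb_flush%';

-- innodb_flush_log_at_trx_commit (the most important one)
-- 0: write and flush the redo log about once per second (fastest; a crash can lose about 1 second of transactions)
-- 1: write and flush on every commit (safest, slowest) ← required for financial systems
-- 2: write to the OS cache on every commit, flush about once per second (a compromise; survives a mysqld crash but not a power failure)

SET GLOBAL innodb_flush_log_at_trx_commit = 1;

-- innodb_flush_method
-- fsync: flush with fsync() (the long-standing default on Linux)
-- O_DIRECT: bypass the OS cache for data files
-- O_DSYNC: open the log files with synchronous writes

Performance comparison:

import time
import pymysql

# Measure TPS under each setting.
# Assumes a local test server, a database named "test_db" and a
# single-column table named "test"; adjust these to your environment.
configs = [
    (0, "flush once per second"),
    (1, "flush on every commit"),
    (2, "write to OS cache on commit")
]

for value, desc in configs:
    conn = pymysql.connect(host='localhost', user='root', database='test_db')
    cursor = conn.cursor()
    # The variable is global only, so SET GLOBAL is required (needs admin privileges)
    cursor.execute(f"SET GLOBAL innodb_flush_log_at_trx_commit={value}")

    start = time.time()
    for i in range(1000):
        cursor.execute("INSERT INTO test VALUES (%s)", (i,))
        conn.commit()
    elapsed = time.time() - start
    conn.close()

    print(f"{desc}: {1000/elapsed:.0f} TPS")

# Example output (illustrative; actual numbers depend on the hardware):
# flush once per second: 8000 TPS
# flush on every commit: 400 TPS  ← safe, but 20x slower
# write to OS cache on commit: 3000 TPS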

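Redis persistence options

# Redis offers two persistence mechanisms

# 1. RDB (snapshots)
# Periodically dump the in-memory dataset to disk
redis_config = {
    "save": [
        (900, 1),   # at least 1 key changed within 900 seconds
        (300, 10),  # at least 10 keys changed within 300 seconds
        (60, 10000) # at least 10000 keys changed within 60 seconds
    ]
}
# Pros: fast recovery, compact files
# Cons: changes made after the last snapshot can be lost

# 2. AOF (Append Only File)
# Log every write command
redis_config = {
    "appendfsync": "everysec"  # fsync once per second
    # "appendfsync": "always"  # fsync after every write command (safest)
    # "appendfsync": "no"      # leave it to the OS (fastest)
}
# Pros: better durability
# Cons: larger files, slower recovery

# Financial workloads: enable both RDB and AOF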

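Performance tuning

1. Batch commits

import os

# ❌ fsync after every write (slow)
for i in range(1000):
    with open("log.txt", "a") as f:
        f.write(f"line {i}\n")
        f.flush()
        os.fsync(f.fileno())  # 1000 fsync calls

# ✅ Write the whole batch, then fsync once (far fewer disk flushes; the
# speedup depends on the device). Trade-off: nothing in the batch is durable
# until the final fsync returns.
with open("log.txt", "a") as f:
    for i in range(1000):
        f.write(f"line {i}\n")
    f.flush()
    os.fsync(f.fileno())  # a single fsync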
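
Between "fsync every record" and "fsync once at the end" sits group commit: fsync once per N records. A minimal sketch follows; the class name GroupCommitWriter and the batch_size parameter are illustrative, and the fsync counter exists only to make the effect visible.

```python
import os
import tempfile

class GroupCommitWriter:
    """Append records and fsync once per batch_size records."""

    def __init__(self, path, batch_size=100):
        self._file = open(path, "a", encoding="utf-8")
        self._batch_size = batch_size
        self._pending = 0
        self.fsync_calls = 0

    def write(self, record):
        self._file.write(record + "\n")
        self._pending += 1
        if self._pending >= self._batch_size:
            self.sync()

    def sync(self):
        """Make everything written so far durable."""
        if self._pending == 0:
            return
        self._file.flush()
        os.fsync(self._file.fileno())
        self.fsync_calls += 1
        self._pending = 0

    def close(self):
        self.sync()
        self._file.close()

path = os.path.join(tempfile.mkdtemp(), "log.txt")
writer = GroupCommitWriter(path, batch_size=4)
for i in range(10):
    writer.write(f"line {i}")
writer.close()
print(writer.fsync_calls)  # 3 fsyncs: after records 4 and 8, then the final 2
```

A caller that must not acknowledge a transaction before it is durable can call sync() itself before replying; the batch size then only bounds how much work one fsync covers.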

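2. Asynchronous flushing

import os
import queue
import threading

class AsyncLogger:
    def __init__(self, filename):
        self.filename = filename
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._worker)
        self.thread.daemon = True
        self.thread.start()

    def log(self, message):
        """Non-blocking write: returns before the message is on disk."""
        self.queue.put(message)

    def _worker(self):
        """Background thread that flushes in batches."""
        with open(self.filename, "a") as f:
            while True:
                messages = []
                # Collect up to 1000 messages; stop as soon as the queue
                # stays empty for 1 second
                try:
                    while len(messages) < 1000:
                        msg = self.queue.get(timeout=1.0)
                        messages.append(msg)
                except queue.Empty:
                    pass

                # Write the batch
                if messages:
                    f.writelines(messages)
                    f.flush()
                    os.fsync(f.fileno())

# Usage
logger = AsyncLogger("app.log")
logger.log("transaction completed\n")  # returns immediately, does not block
# Caveat: the worker is a daemon thread, so messages still in the queue are
# lost if the process exits or crashes before they are flushed.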
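
An asynchronous logger with a daemon worker can drop whatever is still queued when the process exits. One way to close that gap, sketched here with assumed names (DrainingLogger, a private _STOP sentinel), is a non-daemon worker plus an explicit close() that drains the queue before returning.

```python
import os
import queue
import tempfile
import threading

class DrainingLogger:
    _STOP = object()   # sentinel that tells the worker to finish

    def __init__(self, filename):
        self._queue = queue.Queue()
        self._file = open(filename, "a", encoding="utf-8")
        self._thread = threading.Thread(target=self._worker)
        self._thread.start()

    def log(self, message):
        self._queue.put(message)

    def _worker(self):
        while True:
            item = self._queue.get()              # block until something arrives
            batch, stop = [], False
            while True:
                if item is self._STOP:
                    stop = True
                    break
                batch.append(item)
                try:
                    item = self._queue.get_nowait()   # drain what is already queued
                except queue.Empty:
                    break
            if batch:
                self._file.writelines(batch)
                self._file.flush()
                os.fsync(self._file.fileno())     # one fsync per batch
            if stop:
                return

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(self._STOP)
        self._thread.join()
        self._file.close()

log_path = os.path.join(tempfile.mkdtemp(), "app.log")
logger = DrainingLogger(log_path)
for i in range(5):
    logger.log(f"transaction {i} completed\n")
logger.close()
with open(log_path, encoding="utf-8") as f:
    print(len(f.readlines()))
```

Because the worker is not a daemon thread, forgetting to call close() keeps the process alive; that is deliberate here, since silently losing queued records is the failure this sketch is meant to avoid. It still offers no protection against a power failure while messages are queued.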

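3. Partitioned writes

import datetime

class PartitionedLogger:
    """Write logs into time-based partitions."""
    def log(self, message):
        # One file per hour
        filename = f"log_{datetime.datetime.now().strftime('%Y%m%d_%H')}.log"
        with open(filename, "a") as f:
            f.write(message)
            # fsync only for critical entries; ordinary entries are not fsync'ed

# Benefits:
# 1. Each file stays small, so fsync is quick
# 2. Old logs can be deleted to free space
# 3. Different partitions can be written in parallel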

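Monitoring and diagnostics

1. File I/O statistics

# Linux: monitor disk I/O with iostat
iostat -x 1

# Metrics to watch:
# %util: device utilization (sustained values above ~80% suggest a bottleneck on a single HDD; less meaningful on SSD/NVMe, which serve requests in parallel)
# await: average time per request, including queueing (lower is better)
# w/s: write requests per second
# wMB/s: megabytes written per second

# Find the processes doing the most I/O
iotop

# Show how much dirty data is waiting in the page cache
grep -i dirty /proc/meminfo
# Dirty: size of dirty pages (data waiting to be written to disk)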
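
The same counters can be read programmatically, for example to export them to a metrics system. A small sketch: the function name dirty_memory_kb is an assumption, and the sample text is made-up input in the /proc/meminfo format rather than output from a real machine.

```python
def dirty_memory_kb(meminfo_text):
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            result[name] = int(rest.split()[0])
    return result

# Illustrative input only
sample = """MemTotal:       16384000 kB
Dirty:               1284 kB
Writeback:              0 kB
WritebackTmp:           0 kB
"""
print(dirty_memory_kb(sample))
```

On a Linux host the real values come from passing the contents of /proc/meminfo to the function. A Dirty value that keeps growing means data is being written faster than it is flushed, which is exactly the data a power failure would take with it.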

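2. Tracing file operations

# strace: trace system calls (modern libc opens files with openat rather than open)
strace -e trace=openat,write,fsync python app.py

# Example output:
# openat(AT_FDCWD, "log.txt", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
# write(3, "transaction data\n", 17) = 17
# fsync(3) = 0  # was fsync called at all?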

3. Detecting slow fsync

import time
import os

def benchmark_fsync():
    """Measure fsync latency."""
    filename = "test.dat"
    iterations = 100

    with open(filename, "w") as f:
        start = time.time()
        for i in range(iterations):
            f.write("x" * 4096)  # write 4 KB
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.time() - start

    print(f"Average fsync time: {elapsed/iterations*1000:.2f}ms")
    print(f"fsyncs per second: {iterations/elapsed:.0f}")

    os.remove(filename)

# Rough orders of magnitude (they vary widely with hardware and file system):
# HDD: ~5-10 ms per fsync (100-200 per second)
# SSD: ~0.1-1 ms per fsync (1,000-10,000 per second)
# NVMe SSD: ~0.02-0.1 ms per fsync (10,000-50,000 per second)

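Key takeaways

1. Why AI-generated code tends to lose data

AI-generated code typically:
- Uses high-level APIs without accounting for the buffering underneath
- Ignores power-loss and crash scenarios
- Never calls fsync()

Developers need to:
- Understand the layered buffering behind file I/O
- Choose a flush strategy that matches the business requirements
- Weigh performance against data safety

2. The three layers of data persistence

Layer 1: Application

- When to call flush/fsync
- Batch or single-record writes
- Synchronous or asynchronous

Layer 2: File system

- Journaling file systems (ext4/NTFS)
- Mount options (data=journal/ordered)
- Write barriers

Layer 3: Hardware

- Disk write cache
- RAID configuration
- Battery-backed cache (BBU)

3. Trading performance against safety

Data safety ←→ performance
         ↓
    must be balanced

Financial systems: safety > performance (fsync is mandatory)
Logging systems: performance > safety (asynchronous is acceptable)
Cache systems: performance >> safety (data may be lost)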

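Best practices

1. Business-critical data

✅ Do:
- fsync on every write
- Use a WAL
- Back up regularly
- Replicate to a standby

❌ Don't:
- Rely on the OS to flush eventually
- Assume a successful write() means the data is safe

2. Ordinary log data

✅ Acceptable:
- Batched flushing (every second or every N records)
- Asynchronous writes
- Partitioned storage

❌ Don't:
- Never flush at all
- Ignore the disk-full case

3. Temporary data

✅ Acceptable:
- Skip fsync
- Use an in-memory file system (tmpfs)
- Delete on process exit

❌ Don't:
- Treat important data as temporary

Further reading

Paper: "The Design and Implementation of a Log-Structured File System"
Book: Computer Systems: A Programmer's Perspective, Chapter 10
Documentation: the Linux man pages (man 2 fsync)

Summary

Knowing how the operating system handles files lets developers:

✅ Understand when data actually becomes durable
✅ Design reliable storage schemes
✅ Avoid data loss from power failures and crashes
✅ Optimize disk I/O performance

Without this knowledge, a developer cannot judge whether AI-generated code is actually safe, and may lose data at the worst possible moment.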

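Advanced topics

1. How it works under the hood

1.1 The file system I/O stack, from application to disk

Application:     fwrite("data", 4, 1, fp)
           ↓
C library:       [user-space buffer] → fflush()
           ↓
System call:     write(fd, buf, count)  [user mode → kernel mode switch]
           ↓
VFS:             vfs_write()  [virtual file system]
           ↓
File system:     ext4_file_write_iter()
           ↓
Page cache:      [Page Cache] → page marked dirty
           ↓
Block layer:     submit_bio()  [I/O scheduler: mq-deadline/BFQ/Kyber/none; CFQ/Deadline/NOOP on pre-5.0 kernels]
           ↓
Device driver:   SCSI/NVMe driver
           ↓
Disk controller: [on-disk DRAM cache]
           ↓
Physical media:  [HDD platters / SSD flash]

Approximate latency per layer (orders of magnitude only):
- System call: ~100 ns
- VFS: ~50 ns
- Page cache hit: ~1 μs
- Page cache miss: ~5 ms (HDD) / ~100 μs (SSD)
- fsync() forcing data to disk: ~10 ms (HDD) / ~1 ms (SSD)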

1.2 How the Linux kernel implements file I/O

// The write system call in the Linux VFS layer (simplified for illustration)
// Source file: fs/read_write.c

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)
{
    struct fd f = fdget_pos(fd);  // 获取文件描述符
    if (!f.file)
        return -EBADF;

    // 调用VFS write
    loff_t pos = file_pos_read(f.file);
    ret = vfs_write(f.file, buf, count, &pos);
    if (ret >= 0)
        file_pos_write(f.file, pos);

    fdput_pos(f);
    return ret;
}

// VFS write实现
ssize_t vfs_write(struct file *file, const char __user *buf,
                  size_t count, loff_t *pos)
{
    // 检查权限
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;

    // 检查是否可写
    if (!file->f_op->write && !file->f_op->write_iter)
        return -EINVAL;

    // 调用文件系统特定的write函数
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else
        ret = do_iter_write(file, &iter, pos, 0);

    // Notify fsnotify watchers and account the bytes written
    if (ret > 0) {
        fsnotify_modify(file);
        add_wchar(current, ret);
    }

    return ret;
}

// ext4文件系统的write实现
// 源文件: fs/ext4/file.c

static ssize_t ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
    // 如果是O_DIRECT,绕过Page Cache
    if (iocb->ki_flags & IOCB_DIRECT) {
        return ext4_dio_write_iter(iocb, from);
    }

    // 否则使用Page Cache
    return ext4_buffered_write_iter(iocb, from);
}

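// Writing through the page cache (heavily simplified; the real work is done
// by generic_perform_write())
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
                                         struct iov_iter *from)
{
    // 1. Find or allocate the page in the page cache
    struct page *page = grab_cache_page_write_begin(mapping, index);

    // 2. Copy the data from user space into the page
    copy_from_user(page_address(page) + offset, buf, bytes);

    // 3. Mark the page dirty
    set_page_dirty(page);

    // 4. Unlock the page
    unlock_page(page);

    return bytes;
}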

// fsync实现 - 强制刷新到磁盘
// 源文件: fs/sync.c

SYSCALL_DEFINE1(fsync, unsigned int, fd)
{
    struct fd f = fdget(fd);
    if (!f.file)
        return -EBADF;

    ret = vfs_fsync(f.file, 0);
    fdput(f);
    return ret;
}

int vfs_fsync(struct file *file, int datasync)
{
    return file->f_op->fsync(file, datasync);
}

// ext4的fsync实现
int ext4_sync_file(struct file *file, int datasync)
{
    struct inode *inode = file->f_mapping->host;
    struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

    // 1. 刷新Page Cache中的脏页到块设备
    ret = filemap_write_and_wait_range(inode->i_mapping, start, end);

    // 2. 如果是日志文件系统,提交事务日志
    if (!datasync && (inode->i_state & I_DIRTY_DATASYNC)) {
        ret = ext4_force_commit(inode->i_sb);
    }

    // 3. 刷新块设备缓存
    ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);

    return ret;
}

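/*
Write path summary:

Plain write():
1. Data is copied from user space into the page cache
2. The page is marked dirty
3. The call returns immediately (the disk write happens asynchronously)
4. Kernel flusher threads (per-device writeback workers; pdflush before 2.6.32) write dirty pages back in the background
5. Defaults: the flushers wake every 5 s and write back pages that have been dirty for more than 30 s

fsync():
1. Forces the file's dirty pages out to the block device
2. Commits the file system journal (if there is one)
3. Flushes the device's write cache (a CACHE FLUSH command)
4. Blocks until all of that has completed
5. Typical cost: HDD ~10 ms, SSD ~1 ms, NVMe ~100 μs
*/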

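1.3 ext4 journaling (JBD2)

// The ext4 journaling layer (Journaling Block Device 2)
// Source file: fs/jbd2/transaction.c

/*
The three ext4 journaling modes:
1. journal: data and metadata both go through the journal (safest, slowest)
2. ordered: only metadata is journaled, but data blocks are written to disk before the metadata commits (default, balanced)
3. writeback: only metadata is journaled, data is written in no particular order (fastest; files may contain stale data after a crash)
*/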

// 日志事务结构
struct journal_transaction {
    journal_t *t_journal;           // 所属日志
    tid_t t_tid;                    // 事务ID
    enum {
        T_RUNNING,                  // 运行中
        T_LOCKED,                   // 已锁定
        T_FLUSH,                    // 刷新中
        T_COMMIT,                   // 提交中
        T_FINISHED                  // 已完成
    } t_state;

    struct journal_head *t_buffers;  // 脏缓冲区列表
    unsigned long t_log_start;       // 日志起始块号
};

// 开始事务
handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
{
    handle_t *handle;
    int err;

    // 1. 分配事务句柄
    handle = kmem_cache_alloc(journal_handle_cache, GFP_NOFS);

    // 2. 加入当前运行的事务
    err = start_this_handle(journal, handle, gfp_mask);
    if (err < 0) {
        kmem_cache_free(journal_handle_cache, handle);
        return ERR_PTR(err);
    }

    return handle;
}

// 修改元数据
int jbd2_journal_get_write_access(handle_t *handle, struct buffer_head *bh)
{
    transaction_t *transaction = handle->h_transaction;
    journal_t *journal = transaction->t_journal;

    // 1. 如果缓冲区不在当前事务中,创建日志头
    if (!buffer_jbd(bh)) {
        jh = jbd2_journal_add_journal_head(bh);
        jh->b_transaction = transaction;

        // 2. 复制原始数据到日志区
        do_get_write_access(handle, jh, 0);
    }

    return 0;
}

// 提交事务
int jbd2_journal_stop(handle_t *handle)
{
    transaction_t *transaction = handle->h_transaction;
    tid_t tid = transaction->t_tid;

    // 1. 减少事务引用计数
    if (--handle->h_ref > 0) {
        return 0;
    }

    // 2. 如果事务已满或超时,触发提交
    if (transaction->t_state == T_LOCKED ||
        time_after_eq(jiffies, transaction->t_expires)) {

        // 异步提交事务
        jbd2_journal_commit_transaction(transaction);
    }

    kmem_cache_free(journal_handle_cache, handle);
    return 0;
}

// 提交事务到磁盘
void jbd2_journal_commit_transaction(journal_t *journal)
{
    transaction_t *commit_transaction = journal->j_committing_transaction;

    // 阶段1: 将数据块写入磁盘 (ordered模式)
    if (journal->j_flags & JBD2_BARRIER) {
        blkdev_issue_flush(journal->j_dev, GFP_NOFS, NULL);
    }

    // 阶段2: 写入Descriptor Block (描述日志内容)
    write_descriptor_block(journal, commit_transaction);

    // 阶段3: 将修改的元数据写入日志区
    jbd2_journal_write_metadata_buffer(commit_transaction, &wbuf);

    // 阶段4: 写入Commit Block (标记事务完成)
    write_commit_block(journal, commit_transaction);

    // 阶段5: 刷新磁盘缓存 (确保落盘)
    blkdev_issue_flush(journal->j_dev, GFP_NOFS, NULL);

    // 阶段6: 更新检查点,释放日志空间
    jbd2_journal_update_sb_log_tail(journal, commit_transaction->t_tid);
}

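/*
Example of a journal commit (data=journal mode, where file data is journaled too;
in the default ordered mode the data block goes straight to the data area):

Writing to the file "test.txt":

1. Start transaction T1
2. Modify the inode (metadata): size, modification time
3. Modify the data block: write the file contents
4. Commit the transaction:

   Disk layout:
   +------------------+------------------+------------------+
   | Journal Area     | Data Area        | Metadata Area    |
   +------------------+------------------+------------------+

   Order of writes into the journal:
   [Descriptor] → [inode copy] → [data block] → [Commit Block]
       ↓              ↓              ↓              ↓
    describes     copy of the    file contents  commit marker
    the blocks    inode

5. Flush the disk cache (the journal is now durable)
6. Write the inode to its home location in the metadata area
7. Mark the journal space as reusable

Crash recovery:
- Crash before step 5 completes: the transaction has no valid commit block, so it is ignored on replay and the change never happened
- Crash after step 5: the journal is replayed and the changes are restored

Benefits:
- Keeps the file system consistent
- Cuts fsck time (from hours to seconds)
- Prevents metadata corruption
*/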

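1.4 Direct I/O vs Buffered I/O

// Direct I/O (O_DIRECT) bypasses the page cache
#define _GNU_SOURCE  // needed for O_DIRECT on Linux/glibc
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096  // buffer address, size and file offset must be aligned to the logical block size

void direct_io_write() {
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return;
    }

    // Allocate an aligned buffer
    void* buffer;
    if (posix_memalign(&buffer, BLOCK_SIZE, BLOCK_SIZE) != 0) {
        close(fd);
        return;
    }
    memset(buffer, 'A', BLOCK_SIZE);

    // Write straight to the device, bypassing the page cache
    ssize_t bytes = write(fd, buffer, BLOCK_SIZE);
    if (bytes < 0)
        perror("write");

    free(buffer);
    close(fd);
}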

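/*
Direct I/O vs Buffered I/O:

Buffered I/O (default):
Pros:
- Fewer disk I/Os (several write() calls may become one disk write)
- Read-ahead
- Write coalescing

Cons:
- Data can be lost (power failure before fsync)
- Double caching (application cache + page cache)

Direct I/O (O_DIRECT):
Pros:
- Full control over when I/O happens
- No double caching
- A good fit for applications that manage their own cache, such as databases

Cons:
- Strict alignment (buffer address, size and file offset must all be aligned)
- No kernel optimizations (no read-ahead, no write coalescing)
- Performance depends on what the application does itself
- Not a durability guarantee on its own: the device write cache still needs fsync() or O_DSYNC

Typical users:
- Databases that manage their own buffer pool, e.g. MySQL/InnoDB with innodb_flush_method=O_DIRECT (PostgreSQL uses buffered I/O by default)
- Streaming workloads: large sequential reads and writes that would only pollute the page cache
- In-memory caches such as Redis and Memcached are not Direct I/O users; Redis persists through ordinary buffered writes plus fsync
*/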

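2. Production case studies

2.1 Ant Financial: lost database binlog

Background:

In 2018, a trading system handling 10 million orders per day
The data center lost power; after the restart, the most recent hour of binlog was missing
Impact: 1 million orders left inconsistent and in need of manual reconciliation

The problematic configuration:

# MySQL configuration, my.cnf (unsafe)
[mysqld]
innodb_flush_log_at_trx_commit = 2  # redo log flushed to disk only once per second (unsafe!)
sync_binlog = 10                     # binlog fsync'ed once every 10 commit groups (unsafe!)

# File system mount options (unsafe)
$ cat /etc/fstab
/dev/sda1 /data ext4 defaults,nobarrier 0 0  # nobarrier disables write barriers!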
Monitoring data (at the time of the incident):

-- Show the MySQL settings
SHOW VARIABLES LIKE '%flush%';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |  # dangerous setting!
| innodb_flush_method            | O_DIRECT |
+--------------------------------+-------+

SHOW VARIABLES LIKE 'sync_binlog';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sync_binlog   | 10    |  # fsync only once every 10 commit groups!
+---------------+-------+

-- Show the binlog files
SHOW BINARY LOGS;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000123 | 536870912 | # 512MB
| mysql-bin.000124 | 234567890 | # last file, not completely written
+------------------+-----------+

-- Attempted recovery
$ mysqlbinlog mysql-bin.000124
...
ERROR: binlog truncated in the middle of event; consider using --force-if-open

Root cause analysis:

# 1. Disk cache state
$ hdparm -W /dev/sda
/dev/sda:
 write-caching =  1 (on)   # disk write cache enabled ← problem!

# 2. File system journaling mode
$ tune2fs -l /dev/sda1 | grep 'Filesystem features'
Filesystem features: ... data=ordered ...  # ordered mode

# 3. Kernel dirty page settings
$ sysctl -a | grep dirty
vm.dirty_background_ratio = 10     # background writeback starts when dirty pages reach 10% of memory
vm.dirty_ratio = 20                # writers are blocked when dirty pages reach 20% of memory
vm.dirty_expire_centisecs = 3000   # pages dirty for more than 30 s must be written back
vm.dirty_writeback_centisecs = 500 # flusher threads wake every 5 s

# 4. Kernel log
$ dmesg | grep -i 'I/O error'
[12345.678] EXT4-fs error (device sda1): ... I/O error writing to journal

# 5. System load just before the power loss
$ sar -b -f /var/log/sa/sa15
14:58:01  tps  rtps  wtps  bread/s  bwrtn/s
14:59:01  8520  1234  7286   45678    890123  # heavy write load

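Why the data was lost:

Write path:

Application → MySQL → InnoDB Buffer Pool
                    ↓ (innodb_flush_log_at_trx_commit=2)
                Redo Log Buffer
                    ↓ (written at commit, flushed once per second)
                Page Cache (OS)
                    ↓ (not yet fsync'ed when power was lost)
                Disk cache (DRAM) ← lost on power failure!
                    ↓
                Physical disk

Binlog write path:

Transaction commit → binlog buffer
                ↓ (sync_binlog=10)
            Page Cache
                ↓ (fsync only once every 10 commit groups)
            Disk cache (DRAM) ← lost on power failure!
                ↓
            Physical disk

Chain of problems:
1. innodb_flush_log_at_trx_commit=2: the redo log reaches the disk only once per second
2. sync_binlog=10: the binlog is fsync'ed only once every 10 commit groups
3. nobarrier on the file system: no cache-flush commands are sent to the disk, so write ordering is not guaranteed and even fsync'ed data may still sit in the disk's volatile cache
4. Disk write cache enabled: whatever is in that cache is lost when power is cut
5. Unlucky timing: the power failed in between flushes

Result: part of the data from the most recent hour (3,600 seconds) was lost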

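Solution:

# MySQL configuration (safe)
[mysqld]
# InnoDB settings
innodb_flush_log_at_trx_commit = 1   # flush on every commit
innodb_flush_method = O_DIRECT       # bypass the page cache
innodb_file_per_table = 1            # one file per table
innodb_log_file_size = 512M          # larger redo log

# Binlog settings
sync_binlog = 1                      # fsync the binlog on every commit
binlog_cache_size = 4M              # binlog cache
binlog_format = ROW                 # row format (safer)

# The "double 1" setup: innodb_flush_log_at_trx_commit=1 + sync_binlog=1
# Guarantee: a committed transaction is durable

# System configuration
# 1. Disable the disk's volatile write cache (rely on the RAID controller's BBU-protected cache instead)
$ hdparm -W 0 /dev/sda

# 2. Enable file system write barriers
$ mount -o remount,barrier /data

# Or edit /etc/fstab
/dev/sda1 /data ext4 defaults,barrier 0 0

# 3. Tune the kernel parameters
# (sysctl.conf only accepts comments on their own lines)
$ cat >> /etc/sysctl.conf <<EOF
# Lower the dirty page thresholds
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
# Shorten the dirty page expiry to 10 seconds
vm.dirty_expire_centisecs = 1000
# Avoid swapping as far as possible (this does not disable swap)
vm.swappiness = 0
EOF

$ sysctl -p

# 4. Use a RAID controller with a BBU (Battery Backup Unit)
# - On power loss, the cache is written to flash
# - After the restart, the cached data is restored
$ megacli -AdpBbuCmd -GetBbuStatus -a0
Battery State: Optimal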

Performance impact and tuning:

# TPS under different configurations

# Config 1: unsafe (innodb=2, binlog=10)
$ sysbench oltp_read_write --tables=10 --table-size=1000000 run
TPS: 8500
Latency P95: 25ms

# Config 2: safe (innodb=1, binlog=1, no tuning)
$ sysbench oltp_read_write --tables=10 --table-size=1000000 run
TPS: 1200  # an 86% drop!
Latency P95: 180ms

# Config 3: safe + tuned (SSD + group commit)
$ cat my.cnf
[mysqld]
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
binlog_group_commit_sync_delay = 100      # wait up to 100 μs
binlog_group_commit_sync_no_delay_count = 10  # or until 10 transactions are queued

$ sysbench oltp_read_write --tables=10 --table-size=1000000 run
TPS: 6800  # back to 80% of the original throughput
Latency P95: 35ms

# Config 4: safe + tuned (NVMe SSD + large redo log)
TPS: 9500  # 12% above the original
Latency P95: 22ms

Final outcome:

Comparison (production, over 6 months):

Before:
- TPS: 8500
- Data safety: ✗ data lost
- Failure recovery: manual reconciliation (3 days)

After:
- TPS: 6800 (20% lower, acceptable)
- Data safety: ✓ zero data loss
- Failure recovery: automatic (< 5 minutes)

Cost:
- Hardware upgrade: NVMe SSD + RAID controller with BBU = 50,000 CNY
- Performance loss: 20%
- Benefit: no more data loss, which is priceless

2.2 ByteDance: log files filling the disk

Background:

In 2020, a recommendation system producing 5 TB of logs per day
The disk ran out of space and the service became unavailable
Impact: 3 million users could not use the recommendation feature

Monitoring data:

# Disk usage
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       2.0T  2.0T     0 100% /data  # disk full!

# Find the largest directories
$ du -sh /data/* | sort -hr | head -10
1.2T    /data/logs/recommend/
300G    /data/logs/search/
200G    /data/logs/app/
150G    /data/tmp/
...

# Check inode usage (too many small files)
$ df -i
Filesystem      Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1     10000000 9999999       1  100%  /data  # inodes exhausted!

# Count the small files
$ find /data/logs -type f | wc -l
9876543  # nearly 10 million files!

# Watch the rate at which logs are created
$ watch -n 1 'ls -l /data/logs/recommend/ | wc -l'
# 100+ new files per second

# Check file descriptor usage
$ lsof -p $(pgrep recommend_server) | wc -l
98765  # nearly 100,000 open files!

Problem code:

# Recommendation service: one log file per request
import datetime
import os

class RecommendLogger:
    def __init__(self, log_dir="/data/logs/recommend"):
        self.log_dir = log_dir

    def log_request(self, user_id, items):
        # Problem: every request creates its own file!
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        log_file = f"{self.log_dir}/user_{user_id}_{timestamp}.log"

        with open(log_file, 'w') as f:  # a new file every time
            f.write(f"User: {user_id}\n")
            f.write(f"Items: {items}\n")
            # Problem: no fsync, relies on the OS to flush eventually

# Usage
logger = RecommendLogger()

# 10 million requests per day = 10 million files!
for request in requests:
    logger.log_request(request.user_id, request.items)

"""
Problems:
1. One file per request: 10 million requests = 10 million files
2. File names contain a microsecond timestamp: nothing can be merged
3. Files are tiny (~100 bytes): wastes inodes and disk blocks
4. Constant open/close: heavy system call overhead
5. No log rotation: old logs are never deleted
"""

Solution:

# The reworked logging system
import logging
import logging.handlers
import gzip
import os
from datetime import datetime, timedelta

class OptimizedLogger:
    def __init__(self, log_dir="/data/logs/recommend"):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

        # 使用Python标准logging
        self.logger = logging.getLogger("recommend")
        self.logger.setLevel(logging.INFO)

        # 方案1: 按时间轮转 (每小时一个文件)
        handler = logging.handlers.TimedRotatingFileHandler(
            filename=f"{log_dir}/recommend.log",
            when='H',           # 每小时轮转
            interval=1,         # 间隔1小时
            backupCount=72,     # 保留72小时 (3天)
            encoding='utf-8'
        )

        # 方案2: 按大小轮转
        # handler = logging.handlers.RotatingFileHandler(
        #     filename=f"{log_dir}/recommend.log",
        #     maxBytes=100 * 1024 * 1024,  # 100MB
        #     backupCount=20,               # 保留20个文件
        # )

        # 格式化
        formatter = logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        )
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

    def log_request(self, user_id, items):
        # 所有请求写入同一个文件
        self.logger.info(f"User: {user_id}, Items: {items}")

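# Option 3: compress old logs
import shutil

class CompressedLogger:
    def __init__(self, log_dir="/data/logs/recommend"):
        self.log_dir = log_dir

    def compress_old_logs(self, days=3):
        """Compress rotated logs older than the given number of days."""
        cutoff = datetime.now() - timedelta(days=days)

        for filename in os.listdir(self.log_dir):
            # TimedRotatingFileHandler names rotated files "recommend.log.<timestamp>";
            # skip the active log and anything already compressed
            if not filename.startswith('recommend.log.') or filename.endswith('.gz'):
                continue

            file_path = os.path.join(self.log_dir, filename)
            file_time = datetime.fromtimestamp(os.path.getmtime(file_path))

            if file_time < cutoff:
                # Compress the file
                with open(file_path, 'rb') as f_in:
                    with gzip.open(f"{file_path}.gz", 'wb') as f_out:
                        shutil.copyfileobj(f_in, f_out)

                # Remove the original
                os.remove(file_path)
                print(f"Compressed: {filename}")

# Option 4: asynchronous logging (does not block request handling)
import queue
import threading

class AsyncLogger:
    def __init__(self, log_dir="/data/logs/recommend"):
        self.log_dir = log_dir
        self.queue = queue.Queue(maxsize=10000)

        # Open the log file once and keep it open; do this before starting
        # the worker thread, which uses self.log_file
        self.log_file = open(f"{log_dir}/recommend.log", 'a', buffering=1024*1024)

        self.thread = threading.Thread(target=self._worker, daemon=True)
        self.thread.start()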

    def log_request(self, user_id, items):
        # 非阻塞: 立即放入队列并返回
        try:
            self.queue.put_nowait(f"User: {user_id}, Items: {items}\n")
        except queue.Full:
            # 队列满了,丢弃日志 (或记录到错误日志)
            pass

    def _worker(self):
        """后台线程批量写入"""
        while True:
            messages = []

            # 收集100条消息或等待1秒
            try:
                while len(messages) < 100:
                    msg = self.queue.get(timeout=1.0)
                    messages.append(msg)
            except queue.Empty:
                pass

            # 批量写入
            if messages:
                self.log_file.writelines(messages)
                self.log_file.flush()  # 刷新到OS缓存
                # 注意: 不调用fsync,依赖OS定期刷新

    def close(self):
        self.log_file.close()

System-level measures:

# 1. Clean up logs on a schedule
$ cat /etc/cron.daily/clean_logs.sh
#!/bin/bash
# Compress logs older than 3 days
find /data/logs -name "*.log" -mtime +3 -exec gzip {} \;
# Delete logs older than 7 days, including the compressed ones
find /data/logs \( -name "*.log" -o -name "*.log.gz" \) -mtime +7 -delete

$ chmod +x /etc/cron.daily/clean_logs.sh

# 2. Use logrotate (comments must be on their own lines)
$ cat /etc/logrotate.d/recommend
/data/logs/recommend/*.log {
    # rotate daily, keep 7 rotations
    daily
    rotate 7
    # compress old logs, but leave the most recent rotation uncompressed
    compress
    delaycompress
    # no error if the log is missing; skip empty logs
    missingok
    notifempty
    # mode and owner of the newly created log file
    create 0644 app app
    postrotate
        # ask the server to reopen its log file
        killall -HUP recommend_server
    endscript
}

# 3. Put logs on a dedicated partition
$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   500G  0 disk /
sdb      8:16   0     5T  0 disk /data/logs  # separate log partition

# 4. Use XFS (allocates inodes dynamically)
$ mkfs.xfs /dev/sdb
$ mount -o noatime,nodiratime /dev/sdb /data/logs

# noatime: do not update access times, which saves writes
# nodiratime: do not update directory access times (already implied by noatime)

# 5. Monitor disk space
$ cat /etc/cron.hourly/check_disk.sh
#!/bin/bash
USAGE=$(df /data | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 90 ]; then
    echo "Disk usage is $USAGE%" | mail -s "Disk Alert" ops@company.com
fi
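
The shell check above watches disk space only, while this incident also exhausted inodes. A sketch of a check covering both, assuming a Unix system (os.statvfs is not available on Windows); the function names and the 90% threshold are illustrative.

```python
import os
import shutil

def usage_percent(path="/"):
    """Return (space_used_percent, inode_used_percent) for the file system holding path."""
    total, used, _free = shutil.disk_usage(path)
    space_pct = 100.0 * used / total if total else 0.0
    st = os.statvfs(path)
    # Some file systems report 0 total inodes because they allocate them dynamically.
    inode_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files if st.f_files else 0.0
    return space_pct, inode_pct

def alerts(path="/", threshold=90.0):
    """Return human-readable alerts for space or inode usage above the threshold."""
    space_pct, inode_pct = usage_percent(path)
    messages = []
    if space_pct > threshold:
        messages.append(f"{path}: disk space at {space_pct:.0f}%")
    if inode_pct > threshold:
        messages.append(f"{path}: inodes at {inode_pct:.0f}%")
    return messages

for message in alerts("/"):
    print(message)
```

Delivering the alert (mail, pager, chat) is left out on purpose; the point is that inode usage needs its own threshold, because df -h can show free space while no new file can be created.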

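Final outcome:

Comparison (10 million requests per day):

Before:
- Files: 10 million per day
- Disk usage: 1.2 TB per day (each file wastes most of a 4 KB block)
- Inode usage: 10 million per day (inodes exhausted)
- Performance: P99 latency = 150 ms (constant open/close)

After:
- Files: 24 per day (one per hour)
- Disk usage: 200 GB per day (50 GB after compression)
- Inode usage: 24 per day
- Performance: P99 latency = 5 ms (asynchronous logging)

Disk space saved: 83%
Performance gain: 30x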
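
The block-level cost of one file per request can be estimated with simple arithmetic. The sketch below assumes every file occupies at least one whole block and that the payload is about 100 bytes per request, as in the problem list earlier; the function name is illustrative, and the estimate covers block allocation only, not directory entries or journal traffic.

```python
import math

def small_file_overhead(num_files, payload_bytes, block_size=4096):
    """Estimate disk space used when every file occupies whole blocks."""
    blocks_per_file = max(1, math.ceil(payload_bytes / block_size))
    allocated = num_files * blocks_per_file * block_size
    payload = num_files * payload_bytes
    return {
        "inodes": num_files,
        "payload_bytes": payload,
        "allocated_bytes": allocated,
        "wasted_fraction": 1 - payload / allocated,
    }

stats = small_file_overhead(num_files=10_000_000, payload_bytes=100)
print(f"{stats['allocated_bytes'] / 1e9:.1f} GB allocated "
      f"for {stats['payload_bytes'] / 1e9:.1f} GB of log text "
      f"({stats['wasted_fraction']:.1%} of the allocated space is padding)")
```

Under these assumptions, roughly 41 GB of blocks hold about 1 GB of text per day, and each file also consumes one inode, which is why the inode table filled up at 10 million files.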

3. Advanced tuning techniques

3.1 Tracing file I/O with strace

# Trace file operations (-T shows the time spent in each call)
$ strace -e trace=openat,close,read,write,fsync -T python app.py

# Example output:
openat(AT_FDCWD, "data.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 <0.000123>
write(3, "Hello World\n", 12) = 12 <0.000045>
fsync(3) = 0 <0.008234>  # fsync took 8 ms!
close(3) = 0 <0.000012>

# Summarize system calls
$ strace -c python app.py

% time     seconds  usecs/call     calls    errors syscall
 68.50    0.234567          15     15678           write
 20.10    0.068912          89       774           fsync  # most expensive per call
  5.20    0.017823          12      1485           read
  3.10    0.010623           7      1523           openat
  2.10    0.007234           5      1456           close

# Attach to a running process
$ strace -p $(pgrep -o mysqld) -e trace=fsync -T

# Follow child processes as well
$ strace -f -e trace=openat python app.py
3.2 Analyzing I/O performance with perf

# Record write system calls
$ sudo perf record -e 'syscalls:sys_enter_write' -ag -- python app.py

# View the report
$ sudo perf report

#   45.00%  python  [kernel.kallsyms]  [k] vfs_write
#   20.00%  python  [ext4]             [k] ext4_file_write_iter
#   15.00%  python  libc.so            [.] __write
#   10.00%  python  python             [.] file_write

# Analyze block I/O
$ sudo perf record -e 'block:block_rq_insert' -ag -- python app.py
$ sudo perf report

# Count block I/O requests
$ sudo perf stat -e 'block:block_rq_issue' python app.py

Performance counter stats for 'python app.py':

     12,345      block:block_rq_issue

# I/O flame graph
$ sudo perf record -e 'syscalls:sys_enter_write' -ag -- python app.py
$ sudo perf script | stackcollapse-perf.pl | flamegraph.pl > io.svg
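3.3 io_uring: the new generation of asynchronous I/O

// io_uring: a high-performance asynchronous I/O interface (Linux 5.1+; the
// IORING_OP_WRITE operation behind io_uring_prep_write needs 5.6+)
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>   // strerror
#include <unistd.h>   // close

#define QUEUE_DEPTH 64
#define BLOCK_SIZE 4096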

void iouring_write_example() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    // 1. 初始化io_uring
    io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

    // 2. 打开文件
    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

    // 3. 准备写入请求
    char buffer[BLOCK_SIZE] = "Hello io_uring";

    sqe = io_uring_get_sqe(&ring);  // 获取SQE (Submission Queue Entry)
    io_uring_prep_write(sqe, fd, buffer, BLOCK_SIZE, 0);  // 准备写入
    io_uring_sqe_set_data(sqe, buffer);  // 关联用户数据

    // 4. 提交请求 (一次系统调用提交多个I/O)
    io_uring_submit(&ring);

    // 5. 等待完成 (可以去做其他事情)
    io_uring_wait_cqe(&ring, &cqe);  // 获取CQE (Completion Queue Entry)

    // 6. 检查结果
    if (cqe->res < 0) {
        fprintf(stderr, "Write failed: %s\n", strerror(-cqe->res));
    } else {
        printf("Wrote %d bytes\n", cqe->res);
    }

    // 7. 标记完成
    io_uring_cqe_seen(&ring, cqe);

    // 8. 清理
    close(fd);
    io_uring_queue_exit(&ring);
}

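/*
io_uring vs traditional I/O:

Traditional synchronous I/O:
- One system call (write) per I/O
- Blocks until done
- 100 I/Os = 100 system calls

Traditional asynchronous I/O (Linux AIO):
- Awkward API
- Only truly asynchronous with Direct I/O
- Rarely used

io_uring:
- Submission and completion go through ring buffers shared with the kernel; with SQPOLL enabled, submissions need no system call at all
- Batched submission (100 I/Os can go in with a single io_uring_enter call)
- Asynchronous for buffered I/O as well
- Covers a wide range of operations (read, write, fsync, ...)

Reported gains (workload-dependent, not a guarantee):
- IOPS: 2-3x
- Latency: down 50%
- CPU: down 30%
*/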

3.4 Tracing file I/O with eBPF

#!/usr/bin/env bpftrace
// bpftrace script: trace_io.bt (traces file operations)

BEGIN {
    printf("Tracing file I/O...\n");
}

// Trace the openat system call
tracepoint:syscalls:sys_enter_openat {
    @open[comm, str(args->filename)] = count();
}

// Trace the write system call
tracepoint:syscalls:sys_enter_write {
    @write_bytes[comm] = sum(args->count);
    @write_count[comm] = count();
}

// Trace the fsync system call
tracepoint:syscalls:sys_enter_fsync {
    @fsync[comm] = count();
}

// Block I/O latency, measured from issue to completion
tracepoint:block:block_rq_issue {
    @io_start[args->dev, args->sector] = nsecs;
}

// Only requests whose issue event was seen are measured
tracepoint:block:block_rq_complete
/@io_start[args->dev, args->sector]/
{
    $duration = nsecs - @io_start[args->dev, args->sector];
    @io_latency = hist($duration / 1000);  // convert to microseconds
    delete(@io_start[args->dev, args->sector]);
}

END {
    printf("\n--- File Open Count ---\n");
    print(@open);

    printf("\n--- Write Bytes by Process ---\n");
    print(@write_bytes);

    printf("\n--- Write Count by Process ---\n");
    print(@write_count);

    printf("\n--- Fsync Count ---\n");
    print(@fsync);

    printf("\n--- I/O Latency (μs) ---\n");
    print(@io_latency);

    clear(@open);
    clear(@write_bytes);
    clear(@write_count);
    clear(@fsync);
    clear(@io_latency);
}

# Run it
$ sudo bpftrace trace_io.bt

# Example output:
Attaching 7 probes...
Tracing file I/O...
^C

--- File Open Count ---
@open[python, /var/log/app.log]: 12345
@open[mysql, /data/mysql/ib_logfile0]: 678

--- Write Bytes by Process ---
@write_bytes[python]: 123456789  # 117MB
@write_bytes[mysql]: 8765432100  # 8.2GB

--- Write Count by Process ---
@write_count[python]: 98765
@write_count[mysql]: 234567

--- Fsync Count ---
@fsync[mysql]: 15678

--- I/O Latency (μs) ---
@io_latency:
[0, 1)             12345 |@@@@@@@@@@@@@@@@@@@@@@@@@@              |
[1, 2)             23456 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)             18765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[4, 8)              8765 |@@@@@@@@@@@@@@@@                        |
[8, 16)             2345 |@@@@@                                   |
[16, 32)             678 |@                                       |
[32, 64)             123 |                                        |
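The latency histogram above groups samples into power-of-two buckets. As a rough sketch of that bucketing for latency samples collected some other way (for example, timed in application code), the following Python reproduces the same grouping. The names `log2_bucket`, `histogram`, and `render` are hypothetical helpers, not part of bpftrace.

```python
from collections import Counter


def log2_bucket(value_us):
    """Return the [low, high) power-of-two bucket containing value_us."""
    if value_us < 1:
        return (0, 1)
    low = 1 << (int(value_us).bit_length() - 1)
    return (low, low * 2)


def histogram(samples_us):
    """Count samples per power-of-two bucket, sorted by bucket start."""
    counts = Counter(log2_bucket(s) for s in samples_us)
    return dict(sorted(counts.items()))


def render(hist, width=40):
    """Format the histogram as text, scaling bars to the largest bucket."""
    peak = max(hist.values(), default=1)
    lines = []
    for (low, high), count in hist.items():
        bar = "@" * (count * width // peak)
        label = f"[{low}, {high})".ljust(12)
        lines.append(f"{label}{count:>8} |{bar.ljust(width)}|")
    return "\n".join(lines)


if __name__ == "__main__":
    samples = [0, 1, 1, 2, 3, 3, 3, 5, 7, 9, 20, 40]
    print(render(histogram(samples)))
```

Power-of-two buckets keep the histogram compact across several orders of magnitude, which is why they suit latency distributions with long tails.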

4. Source-level implementations

4.1 Hand-writing a simple Write-Ahead Log (WAL)

c

// Simplified Write-Ahead Log implementation
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

#define WAL_FILE "wal.log"
#define DATA_FILE "data.dat"
#define BLOCK_SIZE 512

typedef enum {
    OP_INSERT,
    OP_UPDATE,
    OP_DELETE
} OperationType;

typedef struct {
    uint64_t txn_id;        // transaction ID
    OperationType op;       // operation type
    uint64_t offset;        // offset in the data file
    uint32_t length;        // payload length in bytes
    char data[BLOCK_SIZE];  // payload
    uint32_t checksum;      // checksum over all preceding bytes
} WALRecord;

typedef struct {
    int wal_fd;
    int data_fd;
    uint64_t next_txn_id;
} WALManager;

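// Compute a checksum (a simple byte sum; real systems use CRC32 or stronger)
uint32_t calculate_checksum(const void* data, size_t len) {
    uint32_t sum = 0;
    const uint8_t* ptr = (const uint8_t*)data;
    for (size_t i = 0; i < len; i++) {
        sum += ptr[i];
    }
    return sum;
}

// Initialize the WAL; exits on failure so callers always get a usable manager
WALManager* wal_init() {
    WALManager* wal = (WALManager*)malloc(sizeof(WALManager));
    if (wal == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    // Open the WAL file (append mode)
    wal->wal_fd = open(WAL_FILE,
                       O_WRONLY | O_CREAT | O_APPEND,
                       0644);

    // Open the data file
    wal->data_fd = open(DATA_FILE,
                        O_RDWR | O_CREAT,
                        0644);

    if (wal->wal_fd < 0 || wal->data_fd < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    // Simplification: IDs restart at 1; a real WAL resumes from the last logged ID
    wal->next_txn_id = 1;

    return wal;
}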

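// Append a record to the WAL
int wal_append(WALManager* wal, OperationType op,
               uint64_t offset, const char* data, uint32_t length) {

    // Reject payloads that would overflow record.data
    if (length > BLOCK_SIZE) {
        fprintf(stderr, "WAL record too large: %u bytes\n", length);
        return -1;
    }

    WALRecord record;
    memset(&record, 0, sizeof(record));  // also zeroes padding, which the checksum covers
    record.txn_id = wal->next_txn_id++;
    record.op = op;
    record.offset = offset;
    record.length = length;
    memcpy(record.data, data, length);

    // Compute the checksum over everything except the checksum field
    record.checksum = calculate_checksum(&record,
                                         sizeof(WALRecord) - sizeof(uint32_t));

    // 1. Write the record to the WAL file
    ssize_t written = write(wal->wal_fd, &record, sizeof(WALRecord));
    if (written != sizeof(WALRecord)) {
        perror("WAL write failed");
        return -1;
    }

    // 2. Force the WAL to disk (the critical step!)
    if (fsync(wal->wal_fd) != 0) {
        perror("WAL fsync failed");
        return -1;
    }

    // 3. Update the data file (may be deferred; the WAL already holds the change)
    if (pwrite(wal->data_fd, data, length, (off_t)offset) != (ssize_t)length) {
        perror("data write failed");
        return -1;
    }

    printf("WAL: txn=%llu, op=%d, offset=%llu, length=%u\n",
           (unsigned long long)record.txn_id, op,
           (unsigned long long)offset, length);

    return 0;
}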

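// Recover the data file from the WAL
void wal_recover(WALManager* wal) {
    printf("Starting WAL recovery...\n");

    // Reopen the WAL file for reading
    int read_fd = open(WAL_FILE, O_RDONLY);
    if (read_fd < 0) {
        printf("No WAL file found, nothing to recover.\n");
        return;
    }

    WALRecord record;
    int recovered = 0;

    // Read every complete WAL record; a short read means a torn tail
    while (read(read_fd, &record, sizeof(WALRecord)) == sizeof(WALRecord)) {

        // Verify the checksum and the length field
        uint32_t checksum = calculate_checksum(&record,
                                               sizeof(WALRecord) - sizeof(uint32_t));
        if (checksum != record.checksum || record.length > BLOCK_SIZE) {
            fprintf(stderr, "Corrupted WAL record at txn %llu\n",
                    (unsigned long long)record.txn_id);
            break;
        }

        // Replay the operation against the data file
        // (simplified: every operation type is replayed as a write)
        if (pwrite(wal->data_fd, record.data, record.length,
                   (off_t)record.offset) != (ssize_t)record.length) {
            perror("replay write failed");
            close(read_fd);
            return;  // keep the WAL so recovery can be retried
        }

        printf("Recovered: txn=%llu, op=%d, offset=%llu\n",
               (unsigned long long)record.txn_id, record.op,
               (unsigned long long)record.offset);

        recovered++;
    }

    close(read_fd);

    // Flush the data file; keep the WAL if the flush fails
    if (fsync(wal->data_fd) != 0) {
        perror("data fsync failed");
        return;
    }

    printf("Recovery complete: %d transactions replayed.\n", recovered);

    // Truncate the WAL (optional), only after the data file is durable
    ftruncate(wal->wal_fd, 0);
}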

// Test
int main() {
    WALManager* wal = wal_init();

    // Recover (in case a previous run left records in the WAL)
    wal_recover(wal);

    // Write some data
    char data1[] = "Hello WAL";
    wal_append(wal, OP_INSERT, 0, data1, strlen(data1));

    char data2[] = "Transaction 2";
    wal_append(wal, OP_UPDATE, 100, data2, strlen(data2));

    // Last write before the simulated crash
    char data3[] = "Before crash";
    wal_append(wal, OP_INSERT, 200, data3, strlen(data3));

    printf("\nSimulating crash and recovery...\n\n");

    // Re-initialize and recover
    WALManager* wal2 = wal_init();
    wal_recover(wal2);

    // Release resources
    close(wal->wal_fd);
    close(wal->data_fd);
    free(wal);
    close(wal2->wal_fd);
    close(wal2->data_fd);
    free(wal2);

    return 0;
}

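/*
Output:
Starting WAL recovery...
Recovery complete: 0 transactions replayed.
WAL: txn=1, op=0, offset=0, length=9
WAL: txn=2, op=1, offset=100, length=13
WAL: txn=3, op=0, offset=200, length=12

Simulating crash and recovery...

Starting WAL recovery...
Recovered: txn=1, op=0, offset=0
Recovered: txn=2, op=1, offset=100
Recovered: txn=3, op=0, offset=200
Recovery complete: 3 transactions replayed.

(The first recovery finds an empty log because wal_init() has just created wal.log.)

Advantages:
1. Crash recovery: each WAL record is fsynced before the data file is touched, so an acknowledged write survives a crash
2. Performance: WAL writes are sequential (fast); the random data-file writes (slow) can be deferred
3. Atomicity: each record is either replayed in full or discarded (a torn record fails the checksum)
*/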
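The same write-then-fsync-then-apply idea can be sketched in Python, the language of the article's opening example. This is a minimal illustration under stated assumptions, not a production design: the record layout, the `MAX_PAYLOAD` limit, and the function names `wal_append` and `wal_recover` are chosen here for the example, and CRC32 from `zlib` replaces the byte-sum checksum used in the C version.

```python
import os
import struct
import tempfile
import zlib

# Assumed header layout (little-endian): txn_id u64, offset u64, length u32, crc32 u32
HEADER = struct.Struct("<QQII")
MAX_PAYLOAD = 1 << 20  # sanity limit so a corrupt length field cannot trigger a huge read


def _crc(txn_id, offset, payload):
    return zlib.crc32(struct.pack("<QQI", txn_id, offset, len(payload)) + payload)


def wal_append(fd, txn_id, offset, payload):
    """Append one record and fsync before reporting success."""
    record = HEADER.pack(txn_id, offset, len(payload), _crc(txn_id, offset, payload))
    view = memoryview(record + payload)
    while view:                      # handle short writes
        view = view[os.write(fd, view):]
    os.fsync(fd)                     # the record is durable only after this returns


def wal_recover(path):
    """Return all valid records; stop at the first torn or corrupt one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break                # clean end of log, or a torn header
            txn_id, offset, length, crc = HEADER.unpack(header)
            if length > MAX_PAYLOAD:
                break                # implausible length: treat as corruption
            payload = f.read(length)
            if len(payload) < length or crc != _crc(txn_id, offset, payload):
                break                # torn or corrupt tail: discard it
            records.append((txn_id, offset, payload))
    return records


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        wal_append(fd, 1, 0, b"Hello WAL")
        wal_append(fd, 2, 100, b"Transaction 2")
        os.write(fd, b"\x07" * 10)   # simulate a write torn by power loss
        os.close(fd)
        for rec in wal_recover(path):
            print(rec)
```

Variable-length records avoid writing a full fixed-size block for every small payload, at the cost of needing the length check during recovery.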

4.2 Implementing a simple LRU cache (to reduce disk I/O)

c

// LRU (Least Recently Used) cache implementation
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SIZE 100

typedef struct CacheNode {
    char* key;
    void* value;
    size_t value_size;
    struct CacheNode* prev;
    struct CacheNode* next;
} CacheNode;

typedef struct {
    CacheNode* head;
    CacheNode* tail;
    int size;
    int capacity;
} LRUCache;

// Create the cache
LRUCache* cache_create(int capacity) {
    LRUCache* cache = (LRUCache*)malloc(sizeof(LRUCache));
    cache->head = NULL;
    cache->tail = NULL;
    cache->size = 0;
    cache->capacity = capacity;
    return cache;
}

// Move a node to the head (most recently used)
void move_to_head(LRUCache* cache, CacheNode* node) {
    if (node == cache->head) return;

    // Unlink from the current position
    if (node->prev) node->prev->next = node->next;
    if (node->next) node->next->prev = node->prev;
    if (node == cache->tail) cache->tail = node->prev;

    // Insert at the head
    node->prev = NULL;
    node->next = cache->head;
    if (cache->head) cache->head->prev = node;
    cache->head = node;
    if (!cache->tail) cache->tail = node;
}

// Look up a key (linear scan, O(n); real caches add a hash map for O(1) lookup)
void* cache_get(LRUCache* cache, const char* key, size_t* out_size) {
    CacheNode* node = cache->head;

    while (node) {
        if (strcmp(node->key, key) == 0) {
            // Hit: move to the head
            move_to_head(cache, node);
            if (out_size) *out_size = node->value_size;
            return node->value;
        }
        node = node->next;
    }

    // Miss
    return NULL;
}

// Insert or update an entry
void cache_put(LRUCache* cache, const char* key, void* value, size_t size) {
    // Check whether the key already exists
    CacheNode* node = cache->head;
    while (node) {
        if (strcmp(node->key, key) == 0) {
            // Update the value
            free(node->value);
            node->value = malloc(size);
            memcpy(node->value, value, size);
            node->value_size = size;
            move_to_head(cache, node);
            return;
        }
        node = node->next;
    }

    // New node
    CacheNode* new_node = (CacheNode*)malloc(sizeof(CacheNode));
    new_node->key = strdup(key);
    new_node->value = malloc(size);
    memcpy(new_node->value, value, size);
    new_node->value_size = size;
    new_node->prev = NULL;
    new_node->next = cache->head;

    if (cache->head) cache->head->prev = new_node;
    cache->head = new_node;
    if (!cache->tail) cache->tail = new_node;

    cache->size++;

    // Over capacity: evict the tail (least recently used)
    if (cache->size > cache->capacity) {
        CacheNode* lru = cache->tail;
        cache->tail = lru->prev;
        if (cache->tail) cache->tail->next = NULL;
        else cache->head = NULL;  // the list is now empty

        printf("Evicting: %s\n", lru->key);

        free(lru->key);
        free(lru->value);
        free(lru);
        cache->size--;
    }
}

// Test
int main() {
    LRUCache* cache = cache_create(3);

    // Add entries
    char data1[] = "value1";
    cache_put(cache, "key1", data1, strlen(data1) + 1);

    char data2[] = "value2";
    cache_put(cache, "key2", data2, strlen(data2) + 1);

    char data3[] = "value3";
    cache_put(cache, "key3", data3, strlen(data3) + 1);

    // Access key1 (moves it to the head)
    size_t size;
    char* val = (char*)cache_get(cache, "key1", &size);
    printf("Get key1: %s\n", val);

    // Add key4 (evicts key2, now the least recently used)
    char data4[] = "value4";
    cache_put(cache, "key4", data4, strlen(data4) + 1);

    // Try to get key2 (it should have been evicted)
    val = (char*)cache_get(cache, "key2", &size);
    printf("Get key2: %s\n", val ? val : "Not found");

    return 0;
}

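/*
Output:
Get key1: value1
Evicting: key2
Get key2: Not found

Where this pattern appears:
- File systems: the page cache (caches disk blocks)
- Databases: the buffer pool (caches data pages)
- Web services: caching hot data to reduce database queries

Performance impact (rough figures):
- Cache hit: ~100 ns (memory)
- Cache miss: ~5 ms (disk)
- At a 90% hit rate the average access takes 0.9 x 100 ns + 0.1 x 5 ms ≈ 0.5 ms,
  about 10x faster than going to disk every time; a 50x gain needs a hit rate near 98%
*/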
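The C version above scans the list on every lookup, which is O(n). A common refinement pairs the recency list with a hash map so both lookup and eviction are O(1). The sketch below uses Python's `collections.OrderedDict` for that; the class name `LRUCache` and the helper `average_latency_ns` are illustrative assumptions, and the default latencies are the rough figures quoted above rather than measurements.

```python
from collections import OrderedDict


class LRUCache:
    """LRU cache with O(1) get/put, backed by an ordered hash map."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def average_latency_ns(hit_rate, hit_ns=100, miss_ns=5_000_000):
    """Expected access time: hit_rate * hit cost + (1 - hit_rate) * miss cost."""
    return hit_rate * hit_ns + (1 - hit_rate) * miss_ns


if __name__ == "__main__":
    cache = LRUCache(3)
    for k in ("key1", "key2", "key3"):
        cache.put(k, k.replace("key", "value"))
    cache.get("key1")                        # key1 becomes most recently used
    cache.put("key4", "value4")              # evicts key2
    print(cache.get("key2", "Not found"))
    speedup = 5_000_000 / average_latency_ns(0.90)
    print(f"speedup at 90% hit rate: about {speedup:.1f}x")
```

The expected-latency helper makes the sensitivity visible: because a miss costs tens of thousands of times more than a hit, the miss rate dominates the average, so the last few percent of hit rate matter most.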

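5. Performance benchmarks

5.1 fsync performance test

c

// Benchmark different flush strategies
#define _GNU_SOURCE   // required for O_DIRECT on Linux/glibc
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>   // posix_memalign, free
#include <time.h>
#include <string.h>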

#define ITERATIONS 10000
#define DATA_SIZE 4096

void benchmark_no_sync() {
    int fd = open("test_nosync.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char data[DATA_SIZE];
    memset(data, 'A', DATA_SIZE);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < ITERATIONS; i++) {
        write(fd, data, DATA_SIZE);
        // no fsync: the data may still be in the page cache
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("No fsync: %.2f s (%.0f ops/s)\n",
           elapsed, ITERATIONS / elapsed);

    close(fd);
}

void benchmark_fsync_each() {
    int fd = open("test_fsync.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char data[DATA_SIZE];
    memset(data, 'A', DATA_SIZE);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < ITERATIONS; i++) {
        write(fd, data, DATA_SIZE);
        fsync(fd);  // flush after every write
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("fsync each: %.2f s (%.0f ops/s)\n",
           elapsed, ITERATIONS / elapsed);

    close(fd);
}

void benchmark_fsync_batch() {
    int fd = open("test_batch.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char data[DATA_SIZE];
    memset(data, 'A', DATA_SIZE);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < ITERATIONS; i++) {
        write(fd, data, DATA_SIZE);
        if ((i + 1) % 100 == 0) {
            fsync(fd);  // flush once every 100 writes
        }
    }
    fsync(fd);  // final flush

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("fsync batch(100): %.2f s (%.0f ops/s)\n",
           elapsed, ITERATIONS / elapsed);

    close(fd);
}

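void benchmark_direct_io() {
    int fd = open("test_direct.dat",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT,
                  0644);
    if (fd < 0) {
        perror("open with O_DIRECT (not supported on every file system)");
        return;
    }

    // Direct I/O requires an aligned buffer (and aligned sizes and offsets)
    void* data;
    if (posix_memalign(&data, 4096, DATA_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return;
    }
    memset(data, 'A', DATA_SIZE);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < ITERATIONS; i++) {
        write(fd, data, DATA_SIZE);
        // O_DIRECT bypasses the page cache but does not by itself guarantee
        // durability: the device write cache and file metadata still need
        // fsync() or O_DSYNC. This loop measures uncached writes only.
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("Direct I/O: %.2f s (%.0f ops/s)\n",
           elapsed, ITERATIONS / elapsed);

    free(data);
    close(fd);
}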

int main() {
    printf("File I/O Benchmark (%d iterations, %d bytes each)\n\n",
           ITERATIONS, DATA_SIZE);

    benchmark_no_sync();
    benchmark_fsync_each();
    benchmark_fsync_batch();
    benchmark_direct_io();

    return 0;
}

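/*
Illustrative output (HDD); actual numbers depend on the hardware:

File I/O Benchmark (10000 iterations, 4096 bytes each)

No fsync: 0.15 s (66667 ops/s)         # fastest, but not crash-safe
fsync each: 95.50 s (105 ops/s)        # slowest, but safest
fsync batch(100): 1.20 s (8333 ops/s)  # a middle ground
Direct I/O: 15.50 s (645 ops/s)        # bypasses the page cache

Illustrative output (NVMe SSD):

No fsync: 0.10 s (100000 ops/s)
fsync each: 2.50 s (4000 ops/s)        # much faster on SSD
fsync batch(100): 0.35 s (28571 ops/s)
Direct I/O: 1.20 s (8333 ops/s)

Conclusions:
1. fsync is the bottleneck: on the HDD, fsync after every write is roughly 640x slower than no fsync
2. Batching fsync recovers most of the throughput (about 80x faster than fsync after every write)
3. With fsync after every write, the SSD is about 38x faster than the HDD
4. Direct I/O suits applications that manage their own cache, such as databases
*/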
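The batched strategy measured above can be packaged as a small writer that syncs once every N records. The following is a sketch under assumptions: the class name `BatchedLog`, its `batch_size` parameter, and the `fsync_calls` counter are invented for illustration. Records appended since the last sync can still be lost on power failure, so a caller that needs durability should acknowledge a transaction only after `sync()` has returned.

```python
import os
import tempfile


class BatchedLog:
    """Append-only log that calls fsync once every `batch_size` records."""

    def __init__(self, path, batch_size=100):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = 0       # records written since the last fsync
        self.fsync_calls = 0   # exposed so the trade-off can be observed

    def append(self, line):
        os.write(self.fd, line.encode("utf-8") + b"\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.sync()

    def sync(self):
        """Make all pending records durable."""
        if self.pending:
            os.fsync(self.fd)
            self.fsync_calls += 1
            self.pending = 0

    def close(self):
        self.sync()            # flush the final partial batch
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = BatchedLog(os.path.join(tmp, "batched.log"), batch_size=100)
        for i in range(250):
            log.append(f"txn-{i}")
        log.close()
        print(f"250 records, {log.fsync_calls} fsync calls")
```

Choosing the batch size is a trade-off between throughput and the amount of unacknowledged work at risk; databases often bound it by time as well as by count.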

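5.2 Comparing file system performance

bash

#!/bin/bash
# Test script: benchmark_fs.sh
# WARNING: mkfs erases everything on the target device; run as root on a scratch disk only.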

MOUNT_POINT="/mnt/test"
TEST_FILE="$MOUNT_POINT/testfile"
SIZE_MB=1024

benchmark_fs() {
    FS=$1
    DEV=$2

    echo "=== Testing $FS ==="

    # Format the device
    case "$FS" in
        ext4)
            mkfs.ext4 -F "$DEV" ;;
        xfs)
            mkfs.xfs -f "$DEV" ;;
        btrfs)
            mkfs.btrfs -f "$DEV" ;;
    esac

    # Mount
    mkdir -p "$MOUNT_POINT"
    mount "$DEV" "$MOUNT_POINT"

    # Test 1: sequential write
    echo "Sequential write:"
    dd if=/dev/zero of="$TEST_FILE" bs=1M count="$SIZE_MB" conv=fdatasync 2>&1 | grep copied

    # Test 2: random write
    echo "Random write:"
    fio --name=randwrite --ioengine=libaio --iodepth=16 --rw=randwrite \
        --bs=4k --direct=1 --size=1G --numjobs=1 --filename="$TEST_FILE" \
        --runtime=30 --time_based --group_reporting | grep IOPS

    # Test 3: fsync latency
    echo "fsync latency:"
    ./fsync_bench "$TEST_FILE"

    # Unmount
    umount "$MOUNT_POINT"

    echo ""
}

# Test each file system on the same device
benchmark_fs ext4 /dev/sdb1
benchmark_fs xfs /dev/sdb1
benchmark_fs btrfs /dev/sdb1

# Illustrative output (actual numbers depend on the hardware):
#
# === Testing ext4 ===
# Sequential write:
# 1073741824 bytes (1.1 GB) copied, 8.5 s, 126 MB/s
# Random write:
# IOPS=12.3k
# fsync latency: avg=0.85ms, p99=3.2ms
#
# === Testing xfs ===
# Sequential write:
# 1073741824 bytes (1.1 GB) copied, 7.2 s, 149 MB/s
# Random write:
# IOPS=15.8k
# fsync latency: avg=0.62ms, p99=2.1ms
#
# === Testing btrfs ===
# Sequential write:
# 1073741824 bytes (1.1 GB) copied, 9.8 s, 110 MB/s
# Random write:
# IOPS=9.8k
# fsync latency: avg=1.15ms, p99=5.8ms
#
# Conclusions (for this run):
# - XFS: fastest here (well suited to large files and streaming media)
# - ext4: balanced (general-purpose workloads)
# - btrfs: feature-rich (snapshots, compression), but slower in these tests
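The script calls `./fsync_bench`, which is not shown in the article. As a hedged stand-in, the Python sketch below measures what the sample output reports: average and p99 fsync latency. The function name `measure_fsync_latency`, its defaults, and the nearest-rank percentile method are assumptions, not the original tool.

```python
import math
import os
import tempfile
import time


def measure_fsync_latency(path, iterations=1000, block_size=4096):
    """Write one block and fsync, repeatedly; return latency stats in ms."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    block = b"A" * block_size
    samples = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(iterations):
            os.write(fd, block)
            start = time.perf_counter()
            os.fsync(fd)                       # only the fsync is timed
            samples.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    samples.sort()
    p99_index = math.ceil(0.99 * len(samples)) - 1   # nearest-rank percentile
    return {
        "count": len(samples),
        "avg_ms": sum(samples) / len(samples),
        "p99_ms": samples[p99_index],
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stats = measure_fsync_latency(os.path.join(tmp, "testfile"), iterations=200)
        print(f"fsync latency: avg={stats['avg_ms']:.2f}ms, p99={stats['p99_ms']:.2f}ms")
```

To compare file systems, point `path` at a file on each mounted file system in turn; a temporary directory is used here only so the example runs anywhere.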

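Summary

This advanced material covered:

Underlying mechanisms: VFS, ext4 journaling, the page cache, Direct I/O
Real-world cases: full investigations of Ant's lost binlog and ByteDance's log-filled disk
Advanced tools: strace, perf, eBPF, io_uring
Source-level implementations: a hand-written WAL and an LRU cache
Performance testing: detailed fsync and file system benchmarks

With this background you can reason about what file I/O actually does and design systems that are both fast and safe for data.

Next: Case 4: Inter-process communication (IPC) →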