From FileOutputStream.write(byte) to the Disk Sector: The Full Kernel Traversal of a Single Java Write

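Introduction

As Java developers, we call FileOutputStream.write(b) every day to put data into a file. What lies beneath that single line? From the JVM's JNI call, through glibc's system call wrapper, into the Linux kernel's virtual file system, the ext4 file system and the page cache, and finally down to the sectors of the physical disk: it is a long and carefully engineered path.

Based on the source code of OpenJDK 17, glibc 2.35 and the Linux kernel, this article dissects the execution path of FileOutputStream.write(int b) layer by layer. The kernel excerpts quoted here match a recent 6.x tree; a few names and signatures, such as ext4_do_writepages and the folio callbacks, differ in 5.15. We will see:

how the Java layer enters the JNI world through a native method;
how JNI uses a pre-cached field ID to reach the FileDescriptor quickly;
how the native C code calls the write system call;
how glibc uses inline assembly to issue the syscall instruction;
how the Linux kernel gets from sys_write into the VFS layer;
how the ext4 file system handles buffered writes, delayed allocation and the page cache;
how the data finally reaches the block layer through submit_bio.

This article does not aim for a bird's-eye overview. It stays close to the source and reads it line by line, trying to reconstruct the real journey of one write. It is intended for Java engineers, systems programmers and kernel enthusiasts who want a deeper understanding of the I/O stack.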

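1. Java layer: the `write` method of `FileOutputStream`

1.1 `FileOutputStream` overview

java.io.FileOutputStream is the basic class in the Java standard library for writing a byte stream to a file. One of its core methods is write(int b), which writes the given byte to the file output stream (only the low 8 bits are written; the high 24 bits are discarded). Its source in OpenJDK 17 is: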

public void write(int b) throws IOException {
    write(b, fdAccess.getAppend(fd));
}

This delegates to a private native method. The append flag is looked up from the FileDescriptor through the JDK-internal fdAccess helper:

private native void write(int b, boolean append) throws IOException;

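On Linux and other Unix-like platforms, the native side of this method lives in the JDK source tree under src/java.base/unix/native/libjava/FileOutputStream_md.c, with shared helpers in src/java.base/share/native/libjava/io_util.c.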

1.2 JNI initialization: caching the field ID

For performance, the JNI code caches the ID of the file-descriptor field of FileOutputStream once, up front. Java_java_io_FileOutputStream_initIDs is invoked from the class's static initializer:

JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_initIDs(JNIEnv *env, jclass fdClass) {
    fos_fd = (*env)->GetFieldID(env, fdClass, "fd", "Ljava/io/FileDescriptor;");
}

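Here fos_fd is a global variable of type jfieldID that holds the ID of the field named fd in java.io.FileOutputStream. That field has type java.io.FileDescriptor, which wraps the operating system's integer file descriptor. With the field ID cached, later native calls skip the by-name lookup, which makes each call cheaper.

1.3 JNI dispatch of `write`

The native method that actually performs the write is: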

JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_write(JNIEnv *env, jobject this, jint byte, jboolean append) {
    writeSingle(env, this, byte, append, fos_fd);
}

writeSingle is a helper that takes the cached field ID and performs the actual write:

void
writeSingle(JNIEnv *env, jobject this, jint byte, jboolean append, jfieldID fid) {
    // Discard the high 24 bits and keep only the low 8 bits
    char c = (char) byte;
    jint n;
    // Read the fd field of this and extract the integer file descriptor
    FD fd = getFD(env, this, fid);
    if (fd == -1) {
        JNU_ThrowIOException(env, "Stream Closed");
        return;
    }
    if (append == JNI_TRUE) {
        n = IO_Append(fd, &c, 1);
    } else {
        n = IO_Write(fd, &c, 1);
    }
    if (n == -1) {
        JNU_ThrowIOExceptionWithLastError(env, "Write error");
    }
}
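The narrowing performed by the (char) byte cast can be checked in isolation. The following is a minimal sketch; low_byte is a hypothetical helper written for this article, not a JDK function, and it uses an explicit mask so the result does not depend on whether char is signed on the platform.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the JDK): mirrors the narrowing that
 * writeSingle performs, keeping only the low 8 bits of the int argument. */
static uint8_t low_byte(int32_t value)
{
    return (uint8_t)(value & 0xFF);
}
```

Calling low_byte(0x141) yields 0x41, which is why write(0x141) and write(0x41) put the same byte into the file.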

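1.3.1 Implementation of `getFD`

getFD lives in the platform-specific io_util_md.c. It uses JNI to read the java.io.FileDescriptor object out of the stream, and then the integer fd field out of that object: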

int
getFD(JNIEnv *env, jobject obj, jfieldID fid) {
    jobject fdobj = (*env)->GetObjectField(env, obj, fid);
    if (fdobj == NULL) return -1;
    return (*env)->GetIntField(env, fdobj, IO_fd_fdID);
}

IO_fd_fdID is another cached field ID, for the private int field fd of the FileDescriptor class.

1.3.2 The `IO_Write` macro

IO_Write is a macro defined in the Unix io_util_md.h:

#define IO_Write handleWrite

handleWrite is a thin wrapper that restarts the system call when it is interrupted by a signal (the RESTARTABLE macro retries while errno == EINTR):

ssize_t
handleWrite(FD fd, const void *buf, jint len)
{
    ssize_t result;
    RESTARTABLE(write(fd, buf, len), result);
    return result;
}

RESTARTABLE is defined along these lines:

#define RESTARTABLE(_cmd, _result) do { \
    do { \
        _result = _cmd; \
    } while ((_result == -1) && (errno == EINTR)); \
} while(0)

We have now reached the C library's write function. Note that this write is not the Linux system call itself but the wrapper that glibc provides. Next we go inside glibc.
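The retry-on-EINTR pattern is easy to reproduce outside the JDK. The sketch below is an illustration, not the JDK's code: write_fully is a hypothetical helper that, in addition to retrying on EINTR as handleWrite does, also continues after short writes, which the single-byte path above never needs.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: writes all len bytes, retrying on EINTR and
 * continuing after short writes. Returns 0 on success, -1 on error
 * (errno is left as set by write). */
static int write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was written */
            return -1;      /* real error: report it to the caller */
        }
        p += n;             /* short write: advance past what was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would use it as write_fully(fd, data, size) and check for -1, in the same way writeSingle checks the result of IO_Write.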

2. glibc layer: from `write` to `syscall`

2.1 The weak alias behind `write`

In the glibc sources, write is a weak alias for __libc_write:

weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
libc_hidden_weak (write)

__libc_write is the real implementation:

ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
  return SYSCALL_CANCEL (write, fd, buf, nbytes);
}

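2.2 Cancellation points: the `SYSCALL_CANCEL` macro

POSIX threads define cancellation points, and write is one of them: if a thread with a pending cancellation request is blocked in write, it should act on the request instead of staying blocked. glibc implements this with the SYSCALL_CANCEL macro: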

#define SYSCALL_CANCEL(...) \
  ({									     \
    long int sc_ret;							     \
    if (NO_SYSCALL_CANCEL_CHECKING)					     \
      sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); 			     \
    else								     \
      {									     \
	int sc_cancel_oldtype = LIBC_CANCEL_ASYNC ();			     \
	sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__);			     \
        LIBC_CANCEL_RESET (sc_cancel_oldtype);				     \
      }									     \
    sc_ret;								     \
  })

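If cancellation checking is not needed (NO_SYSCALL_CANCEL_CHECKING, which in practice holds while the process has a single thread), the system call is issued directly.
Otherwise the macro first calls LIBC_CANCEL_ASYNC to switch the thread to asynchronous cancellation (cancellable at any moment), issues the system call, and then restores the previous cancellation type.

INLINE_SYSCALL_CALL is what actually triggers the system call.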

2.3 `INLINE_SYSCALL_CALL` and variadic macros

INLINE_SYSCALL_CALL is a variadic macro that dispatches on the number of arguments to a macro with a matching numeric suffix:

#define INLINE_SYSCALL_CALL(...) \
  __INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)

#define __INLINE_SYSCALL_DISP(b,...) \
  __SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)

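__INLINE_SYSCALL_NARGS counts the arguments that follow the system call name (up to 7), and the result is pasted onto the prefix to form a name such as __INLINE_SYSCALL3. For write, which takes 3 arguments, the call ends up in __INLINE_SYSCALL3:

#define __INLINE_SYSCALL3(name, a1, a2, a3) \
  INLINE_SYSCALL (name, 3, a1, a2, a3)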
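The count-and-paste trick can be tried on its own. This is a small sketch under assumed names: every DEMO_ macro is invented for illustration and only imitates the shape of glibc's __INLINE_SYSCALL_NARGS and __INLINE_SYSCALL_DISP. Each DEMO_ARITYn macro simply expands to n so that the selected variant is visible.

```c
/* Count the arguments that follow the first one: the caller's arguments
 * push the literal list 7..0 to the right, and whichever number lands in
 * slot n is the count. */
#define DEMO_NARGS_X(a, b, c, d, e, f, g, h, n, ...) n
#define DEMO_NARGS(...) DEMO_NARGS_X(__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0, )

/* Two-level concatenation so DEMO_NARGS(...) is expanded before pasting. */
#define DEMO_CONCAT_X(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_X(a, b)

/* Paste the count onto a prefix and invoke the resulting macro. */
#define DEMO_DISP(prefix, ...) \
    DEMO_CONCAT(prefix, DEMO_NARGS(__VA_ARGS__))(__VA_ARGS__)

/* Stand-ins for the per-arity macros; each just reports its arity. */
#define DEMO_ARITY0(name)             0
#define DEMO_ARITY1(name, a1)         1
#define DEMO_ARITY2(name, a1, a2)     2
#define DEMO_ARITY3(name, a1, a2, a3) 3

#define DEMO_ARITY_OF(...) DEMO_DISP(DEMO_ARITY, __VA_ARGS__)
```

With these definitions, DEMO_ARITY_OF(write, fd, buf, len) selects DEMO_ARITY3, in the same way INLINE_SYSCALL_CALL (write, fd, buf, nbytes) selects __INLINE_SYSCALL3.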

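The INLINE_SYSCALL family ultimately expands to INTERNAL_SYSCALL, with glibc translating a kernel error return into errno and a result of -1 (which is what lets handleWrite test for -1 and EINTR). INTERNAL_SYSCALL itself is architecture-specific. On x86-64: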

#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, nr, args...)				\
	internal_syscall##nr (SYS_ify (name), args)

Here SYS_ify turns the system call name into its number:

#undef SYS_ify
#define SYS_ify(syscall_name)	__NR_##syscall_name

__NR_write is defined as 1 (in <asm/unistd_64.h>).

2.4 Inline assembly that issues the `syscall` instruction

On x86-64, internal_syscall3 uses inline assembly to execute the syscall instruction:

#undef internal_syscall3
#define internal_syscall3(number, arg1, arg2, arg3)			\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);			 	\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;			\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3)			\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

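This assembly does the following:

Loads the system call number (1 here) into rax; the "0" (number) matching constraint ties this input to output operand 0, which "=a" binds to rax.
Places the three arguments in rdi (fd), rsi (buf) and rdx (count).
Executes the syscall instruction; the CPU switches from user mode to kernel mode, and the kernel uses the value in rax to index the system call table.
On return, the result is in rax and is assigned to resultvar.
"memory" tells the compiler that the kernel may read or modify memory behind its back (for write it reads the user buffer), so memory accesses must not be cached or reordered across the instruction. REGISTERS_CLOBBERED_BY_SYSCALL lists what the syscall instruction itself overwrites: rcx, r11 and the flags.

That completes the last step in user space. The CPU now switches to kernel mode and starts executing the kernel's system call entry point, entry_SYSCALL_64.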
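The same system call can be issued without going through glibc's write wrapper. The sketch below assumes Linux and the default GNU feature macros; raw_write is a hypothetical name, while syscall() and SYS_write are the standard glibc entry point and constant. It skips the cancellation handling described above but still gets the errno translation from syscall().

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issues write(2) through the generic syscall(2)
 * entry point instead of glibc's write() wrapper. Returns the number of
 * bytes written, or -1 with errno set. */
static long raw_write(int fd, const void *buf, unsigned long count)
{
    return syscall(SYS_write, fd, buf, count);
}
```

On x86-64, SYS_write is 1, so raw_write(1, "hi\n", 3) ends up executing the syscall instruction with rax = 1, rdi = 1, rsi pointing at the string and rdx = 3.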

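3. Linux kernel: system call entry and the VFS layer

3.1 System call dispatch

When the syscall instruction executes, the CPU saves the return address in rcx and the flags in r11, then jumps to the kernel's entry_SYSCALL_64 (on x86-64), which saves the user registers and calls do_syscall_64. That function uses rax = 1 to find the handler for write in the system call table. In modern kernels, system calls are defined with the SYSCALL_DEFINE macros: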

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}

On x86-64, SYSCALL_DEFINE3 generates a wrapper named __x64_sys_write, which ends up calling ksys_write.

3.2 `ksys_write`: obtaining the file object

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos, *ppos = file_ppos(f.file);
		if (ppos) {
			pos = *ppos;
			ppos = &pos;
		}
		ret = vfs_write(f.file, buf, count, ppos);
		if (ret >= 0 && ppos)
			f.file->f_pos = pos;
		fdput_pos(f);
	}

	return ret;
}

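fdget_pos(fd) looks up the struct file for the descriptor, taking a reference when one is needed, and also takes the file position lock when the file requires it.
file_ppos returns a pointer to the file's current offset for seekable files such as regular files, and NULL for stream-like files that have no position.
vfs_write is the core write function of the VFS layer.

3.3 `vfs_write`: access checks and write dispatch

vfs_write performs basic permission checks and accounting, then calls the file system's write method: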

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
	ssize_t ret;

	if (!(file->f_mode & FMODE_WRITE))
		return -EBADF;
	if (!(file->f_mode & FMODE_CAN_WRITE))
		return -EINVAL;
	if (unlikely(!access_ok(buf, count)))
		return -EFAULT;

	ret = rw_verify_area(WRITE, file, pos, count);
	if (ret)
		return ret;
	if (count > MAX_RW_COUNT)
		count =  MAX_RW_COUNT;
	file_start_write(file);
	if (file->f_op->write)
		ret = file->f_op->write(file, buf, count, pos);
	else if (file->f_op->write_iter)
		ret = new_sync_write(file, buf, count, pos);
	else
		ret = -EINVAL;
	if (ret > 0) {
		fsnotify_modify(file);
		add_wchar(current, ret);
	}
	inc_syscw(current);
	file_end_write(file);
	return ret;
}

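access_ok checks that the user buffer lies within the user address range.
rw_verify_area validates the offset and length (for example against overflow of the maximum file offset) and runs the security-module permission hook.
MAX_RW_COUNT is INT_MAX & PAGE_MASK, which caps the size of a single read or write.
file_start_write and file_end_write provide file system freeze protection.
For ext4, file->f_op->write_iter is ext4_file_write_iter (the write callback is NULL because ext4 uses the write_iter interface). new_sync_write is a wrapper that packages the buffer into a kiocb and an iov_iter and then calls ext4_file_write_iter.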
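The dispatch at the end of vfs_write is an ordinary table of function pointers. The following user-space sketch shows only that shape; all demo_ types and functions are hypothetical simplifications and are not the kernel's struct file or struct file_operations.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

struct demo_file;

/* Simplified stand-in for the two write entry points of file_operations. */
struct demo_file_ops {
    ssize_t (*write)(struct demo_file *f, const char *buf, size_t len);
    ssize_t (*write_iter)(struct demo_file *f, const char *buf, size_t len);
};

struct demo_file {
    const struct demo_file_ops *ops;
    size_t bytes_written;
};

/* A file system that, like ext4, only provides the write_iter entry point. */
static ssize_t demo_write_iter(struct demo_file *f, const char *buf, size_t len)
{
    (void)buf;                      /* a real implementation would copy it */
    f->bytes_written += len;
    return (ssize_t)len;
}

/* Same decision order as vfs_write: write, then write_iter, else -EINVAL. */
static ssize_t demo_vfs_write(struct demo_file *f, const char *buf, size_t len)
{
    if (f->ops->write)
        return f->ops->write(f, buf, len);
    if (f->ops->write_iter)
        return f->ops->write_iter(f, buf, len);
    return -EINVAL;
}
```

A file whose ops table sets only write_iter, as ext4's does, is routed to demo_write_iter; a table with neither entry yields -EINVAL.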

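4. ext4 file system: how a buffered write is implemented

4.1 `ext4_file_write_iter`: choosing the write path

Based on the inode's attributes and the flags passed in, ext4 chooses between DAX (direct access), direct I/O (DIO) and buffered writes: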

static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
		return -EIO;

#ifdef CONFIG_FS_DAX
	if (IS_DAX(inode))
		return ext4_dax_write_iter(iocb, from);
#endif
	if (iocb->ki_flags & IOCB_DIRECT)
		return ext4_dio_write_iter(iocb, from);
	else
		return ext4_buffered_write_iter(iocb, from);
}

An ordinary FileOutputStream.write (the file is not opened with O_DIRECT) takes ext4_buffered_write_iter.

4.2 `ext4_buffered_write_iter`: locking and the generic write

static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
					struct iov_iter *from)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_NOWAIT)
		return -EOPNOTSUPP;

	inode_lock(inode);
	ret = ext4_write_checks(iocb, from);
	if (ret <= 0)
		goto out;

	ret = generic_perform_write(iocb, from);

out:
	inode_unlock(inode);
	if (unlikely(ret <= 0))
		return ret;
	return generic_write_sync(iocb, ret);
}

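inode_lock(inode) takes the inode's read-write semaphore (i_rwsem) for writing. This is a per-file exclusive lock that keeps concurrent writers from modifying the same file's data or metadata at once.
ext4_write_checks validates the request and updates the position in the iocb, including moving it to the end of the file for O_APPEND.
generic_perform_write is the kernel's generic write routine; it copies the user data into the page cache.
After the write, generic_write_sync forces writeback before returning only if the file was opened with O_SYNC or O_DSYNC; fdatasync is a separate system call and plays no part here.

4.3 `generic_perform_write`: writing into the page cache

generic_perform_write is the key to understanding buffered I/O. For each page-sized chunk of the write it:

calls the file system's write_begin, which finds or creates the page cache page and prepares it;
copies the data from user space into that page (copy_page_from_iter_atomic);
calls the file system's write_end, which marks the page dirty and updates the inode size if the write extended the file;
moves on to the next chunk, and at the end returns the number of bytes written.

With delayed allocation enabled (the ext4 default), these callbacks are ext4_da_write_begin and ext4_da_write_end. They implement delayed allocation: when data is written into the page cache, no disk blocks are allocated yet; the range is only marked as delayed. The actual block allocation happens later, when the page cache is written back.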

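4.4 Delayed allocation and writeback triggers

Delayed allocation lets the file system batch block allocation, which reduces fragmentation and improves performance. The cost is a larger window for data loss: after a power failure, data that was still only in the page cache is gone. Data reaches the disk when one of the following happens:

Memory pressure: the kernel starts reclaim and has to write dirty pages back.
The application calls fsync or fdatasync. close does not normally trigger writeback; ext4 only does so in a few special cases.
The kernel's flusher threads run periodically (see /proc/sys/vm/dirty_writeback_centisecs).
The file is open with O_SYNC, in which case each write system call triggers writeback before it returns.

In ext4, the writeback itself is done by ext4_writepages.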
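From user space, the explicit trigger in the list above looks like this. The sketch is illustrative only: write_durably is a hypothetical helper, and it treats a short write as an error to stay brief.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes to path and forces them to stable
 * storage with fsync() before returning. Returns 0 on success, -1 on error. */
static int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* write() only copies into the page cache; fsync() triggers writeback
     * and waits for it to complete. */
    ssize_t n = write(fd, buf, len);
    if (n != (ssize_t)len || fsync(fd) == -1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync call the function would usually return sooner, but the data would still be sitting in dirty pages waiting for one of the other triggers.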

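5. Page cache writeback: the path through `ext4_writepages`

When dirty pages have to be flushed to disk, the kernel calls the file system's writepages method. For ext4 that is ext4_writepages.

5.1 Preparation: `ext4_do_writepages`

ext4_writepages mostly delegates to ext4_do_writepages, which handles two modes:

data=journal mode: data is first written to the journal and later to its final location (rarely used).
Delayed allocation mode (the default): blocks are actually allocated here and bios are submitted.

In outline: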

static int ext4_do_writepages(struct mpage_da_data *mpd)
{
    struct writeback_control *wbc = mpd->wbc;
    struct inode *inode = mpd->inode;
    // ...
    while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
        // ...
        needed_blocks = ext4_da_writepages_trans_blocks(inode);
        handle = ext4_journal_start_with_reserve(inode, ...);
        ret = mpage_prepare_extent_to_map(mpd);
        if (!ret && mpd->map.m_len)
            ret = mpage_map_and_submit_extent(handle, mpd, &give_up_on_write);
        // ...
        ext4_journal_stop(handle);
    }
    // ...
}

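mpage_prepare_extent_to_map walks the page cache, collects a run of contiguous dirty pages and records it in mpd->map.
mpage_map_and_submit_extent allocates disk blocks for those pages and builds bios that are submitted to the block layer.

5.2 Block allocation and bio submission

The core of mpage_map_and_submit_extent is a loop that calls mpage_map_one_extent to allocate or map one contiguous disk extent, and then mpage_map_and_submit_buffers to submit the corresponding pages as bios.

ext4_io_submit finally hands the bio to the generic block layer: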

void ext4_io_submit(struct ext4_io_submit *io)
{
    struct bio *bio = io->io_bio;
    if (bio) {
        if (io->io_wbc->sync_mode == WB_SYNC_ALL)
            io->io_bio->bi_opf |= REQ_SYNC;
        submit_bio(io->io_bio);
    }
    io->io_bio = NULL;
}

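submit_bio hands the bio to the block layer, which turns it into requests on the device's queue; the device driver (SCSI or NVMe, for example) then performs the actual DMA transfer that puts the data on the medium. That part lies outside the file system and is not covered further here.

6. Locking in depth: what `inode_lock` does underneath

Earlier we said that inode_lock(inode) takes the inode's read-write semaphore for writing. The lock is a struct rw_semaphore. This section looks at how it is implemented.

6.1 Expanding `inode_lock`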

static inline void inode_lock(struct inode *inode)
{
    down_write(&inode->i_rwsem);
}

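down_write is the kernel function that acquires a read-write semaphore for writing. Leaving out the lock debugging annotations, it comes down to __down_write: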

static inline void __down_write(struct rw_semaphore *sem)
{
    __down_write_common(sem, TASK_UNINTERRUPTIBLE);
}

__down_write_common first tries the fast path through rwsem_write_trylock and falls back to the slow path if that fails:

static inline int __down_write_common(struct rw_semaphore *sem, int state)
{
    int ret = 0;
    preempt_disable();
    if (unlikely(!rwsem_write_trylock(sem))) {
        if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
            ret = -EINTR;
    }
    preempt_enable();
    return ret;
}

6.2 `rwsem_write_trylock` and atomic operations

rwsem_write_trylock uses an atomic compare-and-exchange to try to change sem->count from the unlocked value (RWSEM_UNLOCKED_VALUE) to the writer-locked value (RWSEM_WRITER_LOCKED):

static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
{
    long tmp = RWSEM_UNLOCKED_VALUE;
    if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
        rwsem_set_owner(sem);
        return true;
    }
    return false;
}

atomic_long_try_cmpxchg_acquire eventually reaches the architecture's compare-and-exchange instruction. On x86-64 it ends in this inline assembly:

#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)		\
({									\
    bool success;							\
    __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
    __typeof__(*(_ptr)) __old = *_old;				\
    __typeof__(*(_ptr)) __new = (_new);				\
    switch (size) {							\
    case __X86_CASE_Q:						\
    {								\
        volatile u64 *__ptr = (volatile u64 *)(_ptr);		\
        asm volatile(lock "cmpxchgq %[new], %[ptr]"		\
                     CC_SET(z)					\
                     : CC_OUT(z) (success),			\
                       [ptr] "+m" (*__ptr),			\
                       [old] "+a" (__old)			\
                     : [new] "r" (__new)			\
                     : "memory");				\
        break;							\
    }								\
    /* ... cases for the other operand sizes elided ... */	\
    }								\
    if (unlikely(!success))						\
        *_old = __old;						\
    likely(success);						\
})

The instruction used is cmpxchgq with a lock prefix, which makes the compare-and-exchange atomic across CPU cores.
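The fast path can be imitated in portable C11. This is a deliberately reduced sketch, not the kernel's rwsem: the demo_ names are hypothetical, there are no readers, no owner tracking and no wait queue, and unlock is a plain release store.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_UNLOCKED       0L
#define DEMO_WRITER_LOCKED  1L

/* Hypothetical, writer-only stand-in for struct rw_semaphore. */
struct demo_rwsem {
    atomic_long count;
};

/* Fast path: succeed only if count goes from UNLOCKED to WRITER_LOCKED in
 * one atomic step, with acquire ordering on success. */
static bool demo_write_trylock(struct demo_rwsem *sem)
{
    long expected = DEMO_UNLOCKED;

    return atomic_compare_exchange_strong_explicit(
        &sem->count, &expected, DEMO_WRITER_LOCKED,
        memory_order_acquire, memory_order_relaxed);
}

/* Release the lock so a later trylock can succeed. */
static void demo_write_unlock(struct demo_rwsem *sem)
{
    atomic_store_explicit(&sem->count, DEMO_UNLOCKED, memory_order_release);
}
```

On x86-64, GCC and Clang typically compile the compare-exchange to a lock cmpxchg instruction, the same primitive the kernel macro above spells out by hand. A second demo_write_trylock on a held lock returns false, which is the point at which the kernel would enter its slow path.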

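6.3 Slow path: `rwsem_down_write_slowpath`

If the fast path fails (the semaphore is already held), the kernel enters the slow path. It starts with optimistic spinning: if the current owner is running on another CPU, the waiter spins briefly instead of going to sleep right away, which saves a context switch when the lock is released soon.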

static struct rw_semaphore __sched *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
    struct rwsem_waiter waiter;

    // Try optimistic spinning first
    if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
        return sem;
    }
    // Otherwise join the wait queue and sleep
    waiter.task = current;
    waiter.type = RWSEM_WAITING_FOR_WRITE;
    // ...
    for (;;) {
        if (rwsem_try_write_lock(sem, &waiter))
            break;
        schedule_preempt_disabled();  // give up the CPU until woken
    }
    }
    // ...
}

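This design reflects the kernel's trade-off between performance and fairness.

7. Summary and reflections

We have followed the byte passed to FileOutputStream.write(b) from a Java object all the way to the disk controller's DMA transfer. The whole path can be summarized as:

The Java application calls write(int).
The JVM calls a precompiled C function through JNI and obtains the file descriptor.
glibc wraps the system call, handles thread cancellation, and issues the syscall instruction through inline assembly.
The kernel entry code finds sys_write from the system call number; the VFS layer checks permissions and dispatches to ext4.
ext4 copies the data into the page cache in ext4_buffered_write_iter and marks the pages dirty.
Later (or immediately, with O_SYNC), ext4_writepages allocates disk blocks for the dirty pages, builds bios and submits them to the block layer.
The block layer turns bios into requests, and the device driver performs the actual transfer.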

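7.1 Performance versus consistency

Buffered I/O with delayed allocation: the highest performance, but data can be lost on power failure unless fsync has been called.
O_SYNC / O_DSYNC: every write triggers writeback before it returns, which guarantees durability at a significant cost in performance.
Direct I/O (DIO): bypasses the page cache and talks to the block device directly; suited to applications that manage their own cache, such as databases.

7.2 The impact of lock contention

inode_lock is a per-file exclusive lock, so only one thread at a time can be inside a buffered write (or a metadata change) on a given file. For workloads in which many threads write to the same file, such as logging, this becomes a bottleneck. Options include spreading writes over several files, using pwrite at distinct offsets (which removes contention on the file position but is still subject to the inode lock), or queueing writes asynchronously, for example through io_uring, so that application threads do not block on the lock themselves; for buffered I/O the writes still serialize inside the kernel.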
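The pwrite option mentioned above can be sketched as follows. write_record_at is a hypothetical helper and the fixed-size record layout is an assumption made for the example; on ext4, each such buffered call still takes the inode lock while it runs.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes one fixed-size record into slot number
 * `slot` using pwrite(), which takes an explicit offset and leaves the
 * shared file position untouched. Returns 0 on success, -1 on error. */
static int write_record_at(int fd, unsigned slot, const void *rec, size_t rec_size)
{
    off_t offset = (off_t)slot * (off_t)rec_size;
    ssize_t n = pwrite(fd, rec, rec_size, offset);

    return n == (ssize_t)rec_size ? 0 : -1;
}
```

Because every thread computes its own offset, threads do not need to coordinate through lseek or a shared position, and records can be written in any order.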

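7.3 Observability and debugging

Understanding the whole chain matters for performance analysis and troubleshooting. For example, when write calls show high latency, possible causes are:
- too much dirty page cache, so the writer is throttled or blocked waiting for memory;
- a saturated disk, with requests queueing up after submit_bio;
- heavy contention on the file's inode lock.
Tools such as perf, ftrace and blktrace can trace the path from the system call down to the disk.

7.4 Takeaways for Java developers

FileChannel with a direct ByteBuffer avoids some of the copying on the JNI side, but it still goes through the same system call path.
FileOutputStream.getFD().sync() corresponds to fsync and forces the file's dirty pages out to disk.
java.nio.channels.AsynchronousFileChannel offers an asynchronous API, but the implementation depends on the JDK version and operating system; on Linux, OpenJDK services it with blocking reads and writes on a thread pool rather than kernel AIO or io_uring.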

# Source code

jfieldID fos_fd; /* id for jobject 'fd' in java.io.FileOutputStream */

/**************************************************************
 * static methods to store field ID's in initializers
 */

JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_initIDs(JNIEnv *env, jclass fdClass) {
    fos_fd = (*env)->GetFieldID(env, fdClass, "fd", "Ljava/io/FileDescriptor;");
}

JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_write(JNIEnv *env, jobject this, jint byte, jboolean append) {
    writeSingle(env, this, byte, append, fos_fd);
}

void
writeSingle(JNIEnv *env, jobject this, jint byte, jboolean append, jfieldID fid) {
    // Discard the 24 high-order bits of byte. See OutputStream#write(int)
    char c = (char) byte;
    jint n;
    FD fd = getFD(env, this, fid);
    if (fd == -1) {
        JNU_ThrowIOException(env, "Stream Closed");
        return;
    }
    if (append == JNI_TRUE) {
        n = IO_Append(fd, &c, 1);
    } else {
        n = IO_Write(fd, &c, 1);
    }
    if (n == -1) {
        JNU_ThrowIOExceptionWithLastError(env, "Write error");
    }
}

#define IO_Write handleWrite

ssize_t
handleWrite(FD fd, const void *buf, jint len)
{
    ssize_t result;
    RESTARTABLE(write(fd, buf, len), result);
    return result;
}

/* Write NBYTES of BUF to FD.  Return the number written, or -1.  */
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
  return SYSCALL_CANCEL (write, fd, buf, nbytes);
}
libc_hidden_def (__libc_write)

weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
libc_hidden_weak (write)

#define SYSCALL_CANCEL(...) \
  ({									     \
    long int sc_ret;							     \
    if (NO_SYSCALL_CANCEL_CHECKING)					     \
      sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); 			     \
    else								     \
      {									     \
	int sc_cancel_oldtype = LIBC_CANCEL_ASYNC ();			     \
	sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__);			     \
        LIBC_CANCEL_RESET (sc_cancel_oldtype);				     \
      }									     \
    sc_ret;								     \
  })
  
/* Issue a syscall defined by syscall number plus any other argument
   required.  Any error will be handled using arch defined macros and errno
   will be set accordingly.
   It is similar to INLINE_SYSCALL macro, but without the need to pass the
   expected argument number as second parameter.  */
#define INLINE_SYSCALL_CALL(...) \
  __INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)

#define __INLINE_SYSCALL_DISP(b,...) \
  __SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)

#define __INLINE_SYSCALL_NARGS(...) \
  __INLINE_SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,)
  
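#define __INLINE_SYSCALL3(name, a1, a2, a3) \
  INLINE_SYSCALL (name, 3, a1, a2, a3)

/* INLINE_SYSCALL (not shown) calls INTERNAL_SYSCALL and converts a kernel
   error return into errno and -1.  */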

#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, nr, args...)				\
	internal_syscall##nr (SYS_ify (name), args)

#undef SYS_ify
#define SYS_ify(syscall_name)	__NR_##syscall_name

#define __NR_write 1

#undef internal_syscall3
#define internal_syscall3(number, arg1, arg2, arg3)			\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);			 	\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;			\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3)			\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

## System calls
1	common	write			sys_write    

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos, *ppos = file_ppos(f.file);
		if (ppos) {
			pos = *ppos;
			ppos = &pos;
		}
		ret = vfs_write(f.file, buf, count, ppos);
		if (ret >= 0 && ppos)
			f.file->f_pos = pos;
		fdput_pos(f);
	}

	return ret;
}

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
	ssize_t ret;

	if (!(file->f_mode & FMODE_WRITE))
		return -EBADF;
	if (!(file->f_mode & FMODE_CAN_WRITE))
		return -EINVAL;
	if (unlikely(!access_ok(buf, count)))
		return -EFAULT;

	ret = rw_verify_area(WRITE, file, pos, count);
	if (ret)
		return ret;
	if (count > MAX_RW_COUNT)
		count =  MAX_RW_COUNT;
	file_start_write(file);
	if (file->f_op->write)
		ret = file->f_op->write(file, buf, count, pos);
	else if (file->f_op->write_iter)
		ret = new_sync_write(file, buf, count, pos);
	else
		ret = -EINVAL;
	if (ret > 0) {
		fsnotify_modify(file);
		add_wchar(current, ret);
	}
	inc_syscw(current);
	file_end_write(file);
	return ret;
}

static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
		return -EIO;

#ifdef CONFIG_FS_DAX
	if (IS_DAX(inode))
		return ext4_dax_write_iter(iocb, from);
#endif
	if (iocb->ki_flags & IOCB_DIRECT)
		return ext4_dio_write_iter(iocb, from);
	else
		return ext4_buffered_write_iter(iocb, from);
}

static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
					struct iov_iter *from)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (iocb->ki_flags & IOCB_NOWAIT)
		return -EOPNOTSUPP;

	inode_lock(inode);
	ret = ext4_write_checks(iocb, from);
	if (ret <= 0)
		goto out;

	ret = generic_perform_write(iocb, from);

out:
	inode_unlock(inode);
	if (unlikely(ret <= 0))
		return ret;
	return generic_write_sync(iocb, ret);
}

static inline void inode_lock(struct inode *inode)
{
	down_write(&inode->i_rwsem);
}

/*
 * lock for writing
 */
static inline int __down_write_common(struct rw_semaphore *sem, int state)
{
	int ret = 0;

	preempt_disable();
	if (unlikely(!rwsem_write_trylock(sem))) {
		if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
			ret = -EINTR;
	}
	preempt_enable();
	return ret;
}

static inline void __down_write(struct rw_semaphore *sem)
{
	__down_write_common(sem, TASK_UNINTERRUPTIBLE);
}

static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
{
	long tmp = RWSEM_UNLOCKED_VALUE;

	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
		rwsem_set_owner(sem);
		return true;
	}

	return false;
}


/**
 * atomic_long_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
 * @v: pointer to atomic_long_t
 * @old: pointer to long value to compare with
 * @new: long value to assign
 *
 * If (@v == @old), atomically updates @v to @new with acquire ordering.
 * Otherwise, updates @old to the current value of @v.
 *
 * Unsafe to use in noinstr code; use raw_atomic_long_try_cmpxchg_acquire() there.
 *
 * Return: @true if the exchange occured, @false otherwise.
 */
static __always_inline bool
atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
{
	instrument_atomic_read_write(v, sizeof(*v));
	instrument_atomic_read_write(old, sizeof(*old));
	return raw_atomic_long_try_cmpxchg_acquire(v, old, new);
}

/**
 * raw_atomic_long_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
 * @v: pointer to atomic_long_t
 * @old: pointer to long value to compare with
 * @new: long value to assign
 *
 * If (@v == @old), atomically updates @v to @new with acquire ordering.
 * Otherwise, updates @old to the current value of @v.
 *
 * Safe to use in noinstr code; prefer atomic_long_try_cmpxchg_acquire() elsewhere.
 *
 * Return: @true if the exchange occured, @false otherwise.
 */
static __always_inline bool
raw_atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
{
#ifdef CONFIG_64BIT
	return raw_atomic64_try_cmpxchg_acquire(v, (s64 *)old, new);
#else
	return raw_atomic_try_cmpxchg_acquire(v, (int *)old, new);
#endif
}

/**
 * raw_atomic64_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
 * @v: pointer to atomic64_t
 * @old: pointer to s64 value to compare with
 * @new: s64 value to assign
 *
 * If (@v == @old), atomically updates @v to @new with acquire ordering.
 * Otherwise, updates @old to the current value of @v.
 *
 * Safe to use in noinstr code; prefer atomic64_try_cmpxchg_acquire() elsewhere.
 *
 * Return: @true if the exchange occured, @false otherwise.
 */
static __always_inline bool
raw_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
{
#if defined(arch_atomic64_try_cmpxchg_acquire)
	return arch_atomic64_try_cmpxchg_acquire(v, old, new);
#elif defined(arch_atomic64_try_cmpxchg_relaxed)
	bool ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
	__atomic_acquire_fence();
	return ret;
#elif defined(arch_atomic64_try_cmpxchg)
	return arch_atomic64_try_cmpxchg(v, old, new);
#else
	s64 r, o = *old;
	r = raw_atomic64_cmpxchg_acquire(v, o, new);
	if (unlikely(r != o))
		*old = r;
	return likely(r == o);
#endif
}

static __always_inline bool arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
{
	return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg

#define arch_try_cmpxchg(ptr, pold, new) 				\
	__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))

#define __try_cmpxchg(ptr, pold, new, size)				\
	__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)


#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)		\
({									\
	bool success;							\
	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
	__typeof__(*(_ptr)) __old = *_old;				\
	__typeof__(*(_ptr)) __new = (_new);				\
	switch (size) {							\
	case __X86_CASE_B:						\
	{								\
		volatile u8 *__ptr = (volatile u8 *)(_ptr);		\
		asm volatile(lock "cmpxchgb %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "q" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	case __X86_CASE_W:						\
	{								\
		volatile u16 *__ptr = (volatile u16 *)(_ptr);		\
		asm volatile(lock "cmpxchgw %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "r" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	case __X86_CASE_L:						\
	{								\
		volatile u32 *__ptr = (volatile u32 *)(_ptr);		\
		asm volatile(lock "cmpxchgl %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "r" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	case __X86_CASE_Q:						\
	{								\
		volatile u64 *__ptr = (volatile u64 *)(_ptr);		\
		asm volatile(lock "cmpxchgq %[new], %[ptr]"		\
			     CC_SET(z)					\
			     : CC_OUT(z) (success),			\
			       [ptr] "+m" (*__ptr),			\
			       [old] "+a" (__old)			\
			     : [new] "r" (__new)			\
			     : "memory");				\
		break;							\
	}								\
	default:							\
		__cmpxchg_wrong_size();					\
	}								\
	if (unlikely(!success))						\
		*_old = __old;						\
	likely(success);						\
})

/*
 * Wait until we successfully acquire the write lock
 */
static struct rw_semaphore __sched *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
	struct rwsem_waiter waiter;
	DEFINE_WAKE_Q(wake_q);

	/* do optimistic spinning and steal lock if possible */
	if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
		/* rwsem_optimistic_spin() implies ACQUIRE on success */
		return sem;
	}

	/*
	 * Optimistic spinning failed, proceed to the slowpath
	 * and block until we can acquire the sem.
	 */
	waiter.task = current;
	waiter.type = RWSEM_WAITING_FOR_WRITE;
	waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
	waiter.handoff_set = false;

	raw_spin_lock_irq(&sem->wait_lock);
	rwsem_add_waiter(sem, &waiter);

	/* we're now waiting on the lock */
	if (rwsem_first_waiter(sem) != &waiter) {
		rwsem_cond_wake_waiter(sem, atomic_long_read(&sem->count),
				       &wake_q);
		if (!wake_q_empty(&wake_q)) {
			/*
			 * We want to minimize wait_lock hold time especially
			 * when a large number of readers are to be woken up.
			 */
			raw_spin_unlock_irq(&sem->wait_lock);
			wake_up_q(&wake_q);
			raw_spin_lock_irq(&sem->wait_lock);
		}
	} else {
		atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
	}

	/* wait until we successfully acquire the lock */
	set_current_state(state);
	trace_contention_begin(sem, LCB_F_WRITE);

	for (;;) {
		if (rwsem_try_write_lock(sem, &waiter)) {
			/* rwsem_try_write_lock() implies ACQUIRE on success */
			break;
		}

		raw_spin_unlock_irq(&sem->wait_lock);

		if (signal_pending_state(state, current))
			goto out_nolock;

		/*
		 * After setting the handoff bit and failing to acquire
		 * the lock, attempt to spin on owner to accelerate lock
		 * transfer. If the previous owner is a on-cpu writer and it
		 * has just released the lock, OWNER_NULL will be returned.
		 * In this case, we attempt to acquire the lock again
		 * without sleeping.
		 */
		if (waiter.handoff_set) {
			enum owner_state owner_state;

			owner_state = rwsem_spin_on_owner(sem);
			if (owner_state == OWNER_NULL)
				goto trylock_again;
		}

		schedule_preempt_disabled();
		lockevent_inc(rwsem_sleep_writer);
		set_current_state(state);
trylock_again:
		raw_spin_lock_irq(&sem->wait_lock);
	}
	__set_current_state(TASK_RUNNING);
	raw_spin_unlock_irq(&sem->wait_lock);
	lockevent_inc(rwsem_wlock);
	trace_contention_end(sem, 0);
	return sem;

out_nolock:
	__set_current_state(TASK_RUNNING);
	raw_spin_lock_irq(&sem->wait_lock);
	rwsem_del_wake_waiter(sem, &waiter, &wake_q);
	lockevent_inc(rwsem_wlock_fail);
	trace_contention_end(sem, -EINTR);
	return ERR_PTR(-EINTR);
}

static const struct address_space_operations ext4_da_aops = {
	.read_folio		= ext4_read_folio,
	.readahead		= ext4_readahead,
	.writepages		= ext4_writepages,
	.write_begin		= ext4_da_write_begin,
	.write_end		= ext4_da_write_end,
	.dirty_folio		= ext4_dirty_folio,
	.bmap			= ext4_bmap,
	.invalidate_folio	= ext4_invalidate_folio,
	.release_folio		= ext4_release_folio,
	.direct_IO		= noop_direct_IO,
	.migrate_folio		= buffer_migrate_folio,
	.is_partially_uptodate  = block_is_partially_uptodate,
	.error_remove_folio	= generic_error_remove_folio,
	.swap_activate		= ext4_iomap_swap_activate,
};

static int ext4_writepages(struct address_space *mapping,
			   struct writeback_control *wbc)
{
	struct super_block *sb = mapping->host->i_sb;
	struct mpage_da_data mpd = {
		.inode = mapping->host,
		.wbc = wbc,
		.can_map = 1,
	};
	int ret;
	int alloc_ctx;

	if (unlikely(ext4_forced_shutdown(sb)))
		return -EIO;

	alloc_ctx = ext4_writepages_down_read(sb);
	ret = ext4_do_writepages(&mpd);
	/*
	 * For data=journal writeback we could have come across pages marked
	 * for delayed dirtying (PageChecked) which were just added to the
	 * running transaction. Try once more to get them to stable storage.
	 */
	if (!ret && mpd.journalled_more_data)
		ret = ext4_do_writepages(&mpd);
	ext4_writepages_up_read(sb, alloc_ctx);

	return ret;
}


static int ext4_do_writepages(struct mpage_da_data *mpd)
{
	struct writeback_control *wbc = mpd->wbc;
	pgoff_t	writeback_index = 0;
	long nr_to_write = wbc->nr_to_write;
	int range_whole = 0;
	int cycled = 1;
	handle_t *handle = NULL;
	struct inode *inode = mpd->inode;
	struct address_space *mapping = inode->i_mapping;
	int needed_blocks, rsv_blocks = 0, ret = 0;
	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
	struct blk_plug plug;
	bool give_up_on_write = false;

	trace_ext4_writepages(inode, wbc);

	/*
	 * No pages to write? This is mainly a kludge to avoid starting
	 * a transaction for special inodes like journal inode on last iput()
	 * because that could violate lock ordering on umount
	 */
	if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		goto out_writepages;

	/*
	 * If the filesystem has aborted, it is read-only, so return
	 * right away instead of dumping stack traces later on that
	 * will obscure the real source of the problem.  We test
	 * fs shutdown state instead of sb->s_flag's SB_RDONLY because
	 * the latter could be true if the filesystem is mounted
	 * read-only, and in that case, ext4_writepages should
	 * *never* be called, so if that ever happens, we would want
	 * the stack trace.
	 */
	if (unlikely(ext4_forced_shutdown(mapping->host->i_sb))) {
		ret = -EROFS;
		goto out_writepages;
	}

	/*
	 * If we have inline data and arrive here, it means that
	 * we will soon create the block for the 1st page, so
	 * we'd better clear the inline data here.
	 */
	if (ext4_has_inline_data(inode)) {
		/* Just inode will be modified... */
		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
		if (IS_ERR(handle)) {
			ret = PTR_ERR(handle);
			goto out_writepages;
		}
		BUG_ON(ext4_test_inode_state(inode,
				EXT4_STATE_MAY_INLINE_DATA));
		ext4_destroy_inline_data(handle, inode);
		ext4_journal_stop(handle);
	}

	/*
	 * data=journal mode does not do delalloc so we just need to writeout /
	 * journal already mapped buffers. On the other hand we need to commit
	 * transaction to make data stable. We expect all the data to be
	 * already in the journal (the only exception are DMA pinned pages
	 * dirtied behind our back) so we commit transaction here and run the
	 * writeback loop to checkpoint them. The checkpointing is not actually
	 * necessary to make data persistent *but* quite a few places (extent
	 * shifting operations, fsverity, ...) depend on being able to drop
	 * pagecache pages after calling filemap_write_and_wait() and for that
	 * checkpointing needs to happen.
	 */
	if (ext4_should_journal_data(inode)) {
		mpd->can_map = 0;
		if (wbc->sync_mode == WB_SYNC_ALL)
			ext4_fc_commit(sbi->s_journal,
				       EXT4_I(inode)->i_datasync_tid);
	}
	mpd->journalled_more_data = 0;

	if (ext4_should_dioread_nolock(inode)) {
		/*
		 * We may need to convert up to one extent per block in
		 * the page and we may dirty the inode.
		 */
		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
						PAGE_SIZE >> inode->i_blkbits);
	}

	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
		range_whole = 1;

	if (wbc->range_cyclic) {
		writeback_index = mapping->writeback_index;
		if (writeback_index)
			cycled = 0;
		mpd->first_page = writeback_index;
		mpd->last_page = -1;
	} else {
		mpd->first_page = wbc->range_start >> PAGE_SHIFT;
		mpd->last_page = wbc->range_end >> PAGE_SHIFT;
	}

	ext4_io_submit_init(&mpd->io_submit, wbc);
retry:
	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		tag_pages_for_writeback(mapping, mpd->first_page,
					mpd->last_page);
	blk_start_plug(&plug);

	/*
	 * First writeback pages that don't need mapping - we can avoid
	 * starting a transaction unnecessarily and also avoid being blocked
	 * in the block layer on device congestion while having transaction
	 * started.
	 */
	mpd->do_map = 0;
	mpd->scanned_until_end = 0;
	mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
	if (!mpd->io_submit.io_end) {
		ret = -ENOMEM;
		goto unplug;
	}
	ret = mpage_prepare_extent_to_map(mpd);
	/* Unlock pages we didn't use */
	mpage_release_unused_pages(mpd, false);
	/* Submit prepared bio */
	ext4_io_submit(&mpd->io_submit);
	ext4_put_io_end_defer(mpd->io_submit.io_end);
	mpd->io_submit.io_end = NULL;
	if (ret < 0)
		goto unplug;

	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
		/* For each extent of pages we use new io_end */
		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
		if (!mpd->io_submit.io_end) {
			ret = -ENOMEM;
			break;
		}

		WARN_ON_ONCE(!mpd->can_map);
		/*
		 * We have two constraints: We find one extent to map and we
		 * must always write out whole page (makes a difference when
		 * blocksize < pagesize) so that we don't block on IO when we
		 * try to write out the rest of the page. Journalled mode is
		 * not supported by delalloc.
		 */
		BUG_ON(ext4_should_journal_data(inode));
		needed_blocks = ext4_da_writepages_trans_blocks(inode);

		/* start a new transaction */
		handle = ext4_journal_start_with_reserve(inode,
				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
		if (IS_ERR(handle)) {
			ret = PTR_ERR(handle);
			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
			       "%ld pages, ino %lu; err %d", __func__,
				wbc->nr_to_write, inode->i_ino, ret);
			/* Release allocated io_end */
			ext4_put_io_end(mpd->io_submit.io_end);
			mpd->io_submit.io_end = NULL;
			break;
		}
		mpd->do_map = 1;

		trace_ext4_da_write_pages(inode, mpd->first_page, wbc);
		ret = mpage_prepare_extent_to_map(mpd);
		if (!ret && mpd->map.m_len)
			ret = mpage_map_and_submit_extent(handle, mpd,
					&give_up_on_write);
		/*
		 * Caution: If the handle is synchronous,
		 * ext4_journal_stop() can wait for transaction commit
		 * to finish which may depend on writeback of pages to
		 * complete or on page lock to be released.  In that
		 * case, we have to wait until after we have
		 * submitted all the IO, released page locks we hold,
		 * and dropped io_end reference (for extent conversion
		 * to be able to complete) before stopping the handle.
		 */
		if (!ext4_handle_valid(handle) || handle->h_sync == 0) {
			ext4_journal_stop(handle);
			handle = NULL;
			mpd->do_map = 0;
		}
		/* Unlock pages we didn't use */
		mpage_release_unused_pages(mpd, give_up_on_write);
		/* Submit prepared bio */
		ext4_io_submit(&mpd->io_submit);

		/*
		 * Drop our io_end reference we got from init. We have
		 * to be careful and use deferred io_end finishing if
		 * we are still holding the transaction as we can
		 * release the last reference to io_end which may end
		 * up doing unwritten extent conversion.
		 */
		if (handle) {
			ext4_put_io_end_defer(mpd->io_submit.io_end);
			ext4_journal_stop(handle);
		} else
			ext4_put_io_end(mpd->io_submit.io_end);
		mpd->io_submit.io_end = NULL;

		if (ret == -ENOSPC && sbi->s_journal) {
			/*
			 * Commit the transaction which would
			 * free blocks released in the transaction
			 * and try again
			 */
			jbd2_journal_force_commit_nested(sbi->s_journal);
			ret = 0;
			continue;
		}
		/* Fatal error - ENOMEM, EIO... */
		if (ret)
			break;
	}
unplug:
	blk_finish_plug(&plug);
	if (!ret && !cycled && wbc->nr_to_write > 0) {
		cycled = 1;
		mpd->last_page = writeback_index - 1;
		mpd->first_page = 0;
		goto retry;
	}

	/* Update index */
	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
		/*
		 * Set the writeback_index so that range_cyclic
		 * mode will write it back later
		 */
		mapping->writeback_index = mpd->first_page;

out_writepages:
	trace_ext4_writepages_result(inode, wbc, ret,
				     nr_to_write - wbc->nr_to_write);
	return ret;
}
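
The `range_cyclic` branch and the `retry` label together implement a wrap-around scan: writeback resumes at the saved `writeback_index`, and if the end of the file is reached with budget left over, one more pass covers the pages before the starting point. A simplified user-space model of just that control flow, assuming a flat array of dirty flags instead of the real page cache (the `toy_` names are hypothetical, and transactions, tagging, and plugging are all left out):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one file's page cache: dirty[i] marks page i. */
struct toy_cache {
	bool *dirty;
	size_t nr_pages;
	size_t writeback_index;	/* where the next cyclic pass resumes */
};

/*
 * Clean dirty pages in [first, last] while budget remains.
 * Returns the first index that was not visited.
 */
static size_t toy_write_range(struct toy_cache *c, size_t first, size_t last,
			      long *nr_to_write)
{
	size_t i = first;

	for (; i <= last && i < c->nr_pages && *nr_to_write > 0; i++) {
		if (c->dirty[i]) {
			c->dirty[i] = false;
			(*nr_to_write)--;
		}
	}
	return i;
}

/* Range-cyclic writeback: resume at writeback_index, wrap around once. */
static void toy_cyclic_writeback(struct toy_cache *c, long nr_to_write)
{
	size_t start = c->writeback_index;
	bool cycled = (start == 0);	/* starting at 0 needs no wrap */
	size_t next;

	assert(c->nr_pages > 0);
	next = toy_write_range(c, start, c->nr_pages - 1, &nr_to_write);
	if (!cycled && nr_to_write > 0)
		next = toy_write_range(c, 0, start - 1, &nr_to_write);
	c->writeback_index = next;	/* remembered for the next call */
}
```

The point of saving the index is fairness: a large file with a small per-call budget still gets every dirty page visited eventually, rather than rewriting the first few pages over and over.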

/*
 * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
 *				 mpd->len and submit pages underlying it for IO
 *
 * @handle - handle for journal operations
 * @mpd - extent to map
 * @give_up_on_write - we set this to true iff there is a fatal error and there
 *                     is no hope of writing the data. The caller should discard
 *                     dirty pages to avoid infinite loops.
 *
 * The function maps extent starting at mpd->lblk of length mpd->len. If it is
 * delayed, blocks are allocated, if it is unwritten, we may need to convert
 * them to initialized or split the described range from larger unwritten
 * extent. Note that we need not map all the described range since allocation
 * can return less blocks or the range is covered by more unwritten extents. We
 * cannot map more because we are limited by reserved transaction credits. On
 * the other hand we always make sure that the last touched page is fully
 * mapped so that it can be written out (and thus forward progress is
 * guaranteed). After mapping we submit all mapped pages for IO.
 */
static int mpage_map_and_submit_extent(handle_t *handle,
				       struct mpage_da_data *mpd,
				       bool *give_up_on_write)
{
	struct inode *inode = mpd->inode;
	struct ext4_map_blocks *map = &mpd->map;
	int err;
	loff_t disksize;
	int progress = 0;
	ext4_io_end_t *io_end = mpd->io_submit.io_end;
	struct ext4_io_end_vec *io_end_vec;

	io_end_vec = ext4_alloc_io_end_vec(io_end);
	if (IS_ERR(io_end_vec))
		return PTR_ERR(io_end_vec);
	io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
	do {
		err = mpage_map_one_extent(handle, mpd);
		if (err < 0) {
			struct super_block *sb = inode->i_sb;

			if (ext4_forced_shutdown(sb))
				goto invalidate_dirty_pages;
			/*
			 * Let the upper layers retry transient errors.
			 * In the case of ENOSPC, if ext4_count_free_clusters()
			 * is non-zero, a commit should free up blocks.
			 */
			if ((err == -ENOMEM) ||
			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
				if (progress)
					goto update_disksize;
				return err;
			}
			ext4_msg(sb, KERN_CRIT,
				 "Delayed block allocation failed for "
				 "inode %lu at logical offset %llu with"
				 " max blocks %u with error %d",
				 inode->i_ino,
				 (unsigned long long)map->m_lblk,
				 (unsigned)map->m_len, -err);
			ext4_msg(sb, KERN_CRIT,
				 "This should not happen!! Data will "
				 "be lost\n");
			if (err == -ENOSPC)
				ext4_print_free_blocks(inode);
		invalidate_dirty_pages:
			*give_up_on_write = true;
			return err;
		}
		progress = 1;
		/*
		 * Update buffer state, submit mapped pages, and get us new
		 * extent to map
		 */
		err = mpage_map_and_submit_buffers(mpd);
		if (err < 0)
			goto update_disksize;
	} while (map->m_len);

update_disksize:
	/*
	 * Update on-disk size after IO is submitted.  Races with
	 * truncate are avoided by checking i_size under i_data_sem.
	 */
	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
	if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
		int err2;
		loff_t i_size;

		down_write(&EXT4_I(inode)->i_data_sem);
		i_size = i_size_read(inode);
		if (disksize > i_size)
			disksize = i_size;
		if (disksize > EXT4_I(inode)->i_disksize)
			EXT4_I(inode)->i_disksize = disksize;
		up_write(&EXT4_I(inode)->i_data_sem);
		err2 = ext4_mark_inode_dirty(handle, inode);
		if (err2) {
			ext4_error_err(inode->i_sb, -err2,
				       "Failed to mark inode %lu dirty",
				       inode->i_ino);
		}
		if (!err)
			err = err2;
	}
	return err;
}
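
The `update_disksize` tail is easy to misread, so here is the arithmetic in isolation. The sketch assumes 4 KiB pages and uses a hypothetical helper name, `toy_new_disksize`; locking and journaling are omitted. The candidate size is the byte offset of `mpd->first_page`, capped at `i_size`, and `i_disksize` only ever moves forward:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Compute the on-disk size after IO submission, given the index of the
 * first page not yet submitted, the in-memory file size, and the
 * current on-disk size.
 */
static int64_t toy_new_disksize(uint64_t first_page, int64_t i_size,
				int64_t cur_disksize)
{
	int64_t disksize = (int64_t)(first_page << TOY_PAGE_SHIFT);

	assert(i_size >= 0 && cur_disksize >= 0);
	if (disksize > i_size)
		disksize = i_size;	/* never claim more than the file holds */
	if (disksize > cur_disksize)
		return disksize;	/* the on-disk size only grows here */
	return cur_disksize;
}
```

For the single-byte write traced in this article, `first_page` would be 1 once the page is submitted, so the candidate is 4096, the cap brings it down to an `i_size` of 1, and the on-disk size becomes 1 byte.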

void ext4_io_submit(struct ext4_io_submit *io)
{
	struct bio *bio = io->io_bio;

	if (bio) {
		if (io->io_wbc->sync_mode == WB_SYNC_ALL)
			io->io_bio->bi_opf |= REQ_SYNC;
		submit_bio(io->io_bio);
	}
	io->io_bio = NULL;
}
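
Seen from user space, the `WB_SYNC_ALL` test above is the difference between background writeback and a data-integrity request such as `fsync(2)`. A minimal sketch of the latter: write one byte, then ask the kernel to push it to stable storage. The helper name `write_byte_durably` is made up for illustration, and the caller supplies the file path.

```c
#define _POSIX_C_SOURCE 200809L	/* for fsync() under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write a single byte to `path` and force it to stable storage.
 * Returns 0 on success, -1 on any failure.
 */
static int write_byte_durably(const char *path, unsigned char b)
{
	int fd, rc = -1;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, &b, 1) == 1 &&	/* byte lands in the page cache */
	    fsync(fd) == 0)		/* requests synchronous writeback */
		rc = 0;
	if (close(fd) != 0)
		rc = -1;
	return rc;
}
```

On the Java side, the comparable request is `FileOutputStream.getFD().sync()` or `FileChannel.force(true)`; a plain `write(b)` by itself returns once the byte is in the page cache.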
