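Introduction
As Java developers we call FileOutputStream.write(b) every day to write data to a file. What actually happens behind that one line? The call passes through the JVM's JNI boundary, glibc's system call wrapper, and then the Linux kernel's virtual file system (VFS), the ext4 file system, and the page cache before the data finally reaches sectors on a physical disk. It is a long and carefully engineered path.
This article follows the execution path of FileOutputStream.write(int b) layer by layer, using the source code of OpenJDK 17, glibc 2.35, and Linux kernel 5.15 as the reference. Note that several of the kernel excerpts quoted below (for example ext4_do_writepages and the folio-based address_space_operations) match a newer 6.x tree, so names and signatures may differ slightly in 5.15. We will see:
- how the Java layer enters the JNI world through a native method;
- how JNI uses a pre-cached field ID to fetch the FileDescriptor quickly;
- how the native C code calls the write system call wrapper;
- how glibc uses inline assembly to issue the syscall instruction;
- how the Linux kernel goes from sys_write into the VFS layer;
- how ext4 handles buffered writes, delayed allocation, and the page cache;
- how the data finally reaches the block layer through submit_bio.
This article does not aim for a high-level overview. It stays close to the source and reads it line by line, to reconstruct what a single write really does. It runs to about 8,500 Chinese characters in the original and is aimed at Java engineers, systems programmers, and kernel enthusiasts who want a deeper understanding of the I/O stack.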
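1. The Java layer: FileOutputStream's write method
1.1 FileOutputStream overview
java.io.FileOutputStream is the basic standard-library class for writing a byte stream to a file. One of its core methods is write(int b), which writes the given byte to the file output stream: only the low 8 bits are written and the high 24 bits are discarded. In OpenJDK 17 the method looks like this:
java
public void write(int b) throws IOException {
    write(b, fdAccess.getAppend(fd));
}
It delegates to a private native method, whose two parameters match the JNI signature shown in section 1.3:
java
private native void write(int b, boolean append) throws IOException;
On Unix-like platforms the native entry point is in src/java.base/unix/native/libjava/FileOutputStream_md.c, and the shared helper writeSingle is in src/java.base/share/native/libjava/io_util.c.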
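1.2 JNI initialization: caching the field ID
For performance, the JNI code caches the ID of the FileOutputStream field that holds the file descriptor. The static initializer of FileOutputStream calls the native method initIDs, implemented as Java_java_io_FileOutputStream_initIDs, once when the class is initialized:
c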
JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_initIDs(JNIEnv *env, jclass fdClass) {
fos_fd = (*env)->GetFieldID(env, fdClass, "fd", "Ljava/io/FileDescriptor;");
}
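Here fos_fd is a file-scope global of type jfieldID that holds the ID of the field named fd in java.io.FileOutputStream. The field's type is java.io.FileDescriptor, which wraps the operating system's integer file descriptor. Because the field ID is cached, later native calls skip the name-and-signature lookup on every invocation.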
1.3 JNI dispatch for write
The native method that performs the write is:
c
JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_write(JNIEnv *env, jobject this, jint byte, jboolean append) {
writeSingle(env, this, byte, append, fos_fd);
}
writeSingle is a helper that takes the cached field ID and carries out the write:
c
void
writeSingle(JNIEnv *env, jobject this, jint byte, jboolean append, jfieldID fid) {
// Discard the 24 high-order bits of byte. See OutputStream#write(int)
char c = (char) byte;
jint n;
// Read the integer file descriptor through the fd field of this object
FD fd = getFD(env, this, fid);
if (fd == -1) {
JNU_ThrowIOException(env, "Stream Closed");
return;
}
if (append == JNI_TRUE) {
n = IO_Append(fd, &c, 1);
} else {
n = IO_Write(fd, &c, 1);
}
if (n == -1) {
JNU_ThrowIOExceptionWithLastError(env, "Write error");
}
}
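To make the narrowing cast concrete, here is a minimal user-space sketch. It assumes a POSIX system, and write_low_byte is a hypothetical helper rather than a JDK function. It applies the same (char) cast as writeSingle and writes the single resulting byte.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not part of the JDK): applies the same narrowing
 * cast as writeSingle and writes the one resulting byte to fd.
 * Returns what write() returns: 1 on success, -1 on error. */
static ssize_t write_low_byte(int fd, int value)
{
    assert(fd >= 0);
    char c = (char) value;   /* keeps only the low 8 bits of value */
    return write(fd, &c, 1);
}
```

Passing 0x141 through a pipe and reading it back yields the single byte 0x41, because only the low 8 bits survive the cast.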
1.3.1 How getFD works
getFD, defined in the platform-specific io_util_md.c on Unix builds, uses JNI to read the integer fd field from the java.io.FileDescriptor object:
c
int
getFD(JNIEnv *env, jobject obj, jfieldID fid) {
jobject fdobj = (*env)->GetObjectField(env, obj, fid);
if (fdobj == NULL) return -1;
return (*env)->GetIntField(env, fdobj, IO_fd_fdID);
}
IO_fd_fdID is another cached field ID; it refers to the private int field fd of the FileDescriptor class.
1.3.2 The IO_Write macro
IO_Write is a macro defined in the platform-specific header io_util_md.h:
c
#define IO_Write handleWrite
handleWrite is a thin wrapper that restarts the system call if a signal interrupts it (the RESTARTABLE macro retries while the call fails with errno == EINTR):
c
ssize_t
handleWrite(FD fd, const void *buf, jint len)
{
ssize_t result;
RESTARTABLE(write(fd, buf, len), result);
return result;
}
RESTARTABLE is defined along these lines:
c
#define RESTARTABLE(_cmd, _result) do { \
    do { \
        _result = _cmd; \
    } while ((_result == -1) && (errno == EINTR)); \
} while (0)
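The same retry pattern can be written as an ordinary function. The following is a minimal sketch assuming a POSIX system; write_restartable is a hypothetical name, not the JDK's. Like handleWrite, it retries only on EINTR and does not loop on short writes.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical equivalent of handleWrite: repeat write() for as long as
 * it fails because a signal interrupted it. Any other result, including
 * a short write, is returned to the caller unchanged. */
static ssize_t write_restartable(int fd, const void *buf, size_t len)
{
    ssize_t result;

    assert(buf != NULL);
    do {
        result = write(fd, buf, len);
    } while (result == -1 && errno == EINTR);
    return result;
}
```

A caller that must transfer the whole buffer still needs its own loop around this function, since write may accept fewer bytes than requested.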
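At this point we have reached the C library's write function. The write called here is not the kernel's system call itself but the wrapper that glibc provides. Next we look inside glibc.
2. The glibc layer: from write to syscall
2.1 write as a weak alias
In the glibc source, write is a weak alias for __libc_write:
c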
weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
libc_hidden_weak (write)
__libc_write is the actual implementation:
c
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
return SYSCALL_CANCEL (write, fd, buf, nbytes);
}
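2.2 Cancellation points: the SYSCALL_CANCEL macro
POSIX threads define cancellation points. write is a typical one: if a thread is blocked in write when a cancellation request arrives, the request must be acted on instead of waiting for the call to finish. glibc implements this with the SYSCALL_CANCEL macro:
c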
#define SYSCALL_CANCEL(...) \
({ \
long int sc_ret; \
if (NO_SYSCALL_CANCEL_CHECKING) \
sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); \
else \
{ \
int sc_cancel_oldtype = LIBC_CANCEL_ASYNC (); \
sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); \
LIBC_CANCEL_RESET (sc_cancel_oldtype); \
} \
sc_ret; \
})
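- If cancellation checking is not needed (NO_SYSCALL_CANCEL_CHECKING, which is typically true while the process is single-threaded), the system call is issued directly.
- Otherwise LIBC_CANCEL_ASYNC first switches the thread to asynchronous cancellation so that it can be cancelled at any moment, the system call runs, and LIBC_CANCEL_RESET restores the previous cancellation type.
INLINE_SYSCALL_CALL is what actually issues the system call.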
2.3 INLINE_SYSCALL_CALL and variadic macros
INLINE_SYSCALL_CALL is a variadic macro that dispatches on the number of arguments to a macro with a matching numeric suffix:
c
#define INLINE_SYSCALL_CALL(...) \
__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)
#define __INLINE_SYSCALL_DISP(b,...) \
__SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)
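__INLINE_SYSCALL_NARGS counts the arguments that follow the system call name (up to 7), and the result is pasted onto the prefix to form a name such as __INLINE_SYSCALL3. For write, which takes three arguments, the call therefore expands to __INLINE_SYSCALL3. That macro forwards to INLINE_SYSCALL, which runs INTERNAL_SYSCALL and, if the kernel returned an error, stores the code in errno and yields -1:
c
#define __INLINE_SYSCALL3(name, a1, a2, a3) \
  INLINE_SYSCALL (name, 3, a1, a2, a3)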
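The argument-counting dispatch is easier to follow in isolation. The sketch below reproduces the technique with hypothetical DEMO_ names; it is not glibc's code, and the numbered targets only report their arity instead of issuing a system call.

```c
#include <assert.h>

/* Pick the 9th argument: the caller's arguments push the countdown
 * 7,6,...,0 to the right, so the value landing in slot n is the number
 * of arguments that follow the name. */
#define DEMO_NARGS_X(a, b, c, d, e, f, g, h, n, ...) n
#define DEMO_NARGS(...) DEMO_NARGS_X(__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0, )

/* Two-level paste so DEMO_NARGS(...) is expanded before ## is applied. */
#define DEMO_CONCAT_X(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_X(a, b)
#define DEMO_DISP(prefix, ...) \
    DEMO_CONCAT(prefix, DEMO_NARGS(__VA_ARGS__))(__VA_ARGS__)

/* Stand-ins for the numbered targets; each just reports its arity. */
#define DEMO_CALL_0(name) 0
#define DEMO_CALL_1(name, a1) 1
#define DEMO_CALL_2(name, a1, a2) 2
#define DEMO_CALL_3(name, a1, a2, a3) 3

#define DEMO_SYSCALL(...) DEMO_DISP(DEMO_CALL_, __VA_ARGS__)

/* Checked at compile time: the name itself is not counted. */
static_assert(DEMO_SYSCALL(write, 1, 2, 3) == 3, "write takes three arguments");
static_assert(DEMO_SYSCALL(getpid) == 0, "getpid takes none");
```

The extra level of indirection in DEMO_CONCAT matters: pasting with ## directly would glue the unexpanded text DEMO_NARGS onto the prefix instead of the computed digit.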
INTERNAL_SYSCALL is architecture-specific. On x86-64:
c
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, nr, args...) \
internal_syscall##nr (SYS_ify (name), args)
SYS_ify turns the system call name into its number:
c
#undef SYS_ify
#define SYS_ify(syscall_name) __NR_##syscall_name
__NR_write is defined as 1 in <asm/unistd_64.h>.
2.4 Inline assembly that issues the syscall instruction
On x86-64, internal_syscall3 uses inline assembly to execute the syscall instruction:
c
#undef internal_syscall3
#define internal_syscall3(number, arg1, arg2, arg3) \
({ \
unsigned long int resultvar; \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
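The assembly does the following:
- It places the system call number (1 here) in rax; the "0" (number) constraint ties this input to operand 0, the "=a" output, so input and output share rax.
- It places the three arguments in rdi (fd), rsi (buf), and rdx (count).
- It executes syscall, which switches the CPU from user mode to kernel mode; the kernel uses the value in rax to index the system call table.
- When the system call returns, the return value is in rax and is assigned to resultvar.
- The "memory" clobber tells the compiler that the call may read or modify memory, so it must not keep values cached in registers or reorder memory accesses across it. REGISTERS_CLOBBERED_BY_SYSCALL covers the registers that the syscall instruction itself overwrites (rcx and r11 on x86-64) plus the condition codes.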
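The register-level details are architecture-specific, but glibc also exports a generic syscall() function that takes the system call number explicitly. The following sketch assumes Linux with glibc; raw_write is a hypothetical helper. It skips the write wrapper and passes SYS_write (1 on x86-64) directly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke the write system call by number instead of
 * going through glibc's write() wrapper. Like the wrapper, syscall()
 * returns -1 and sets errno when the kernel reports an error. */
static long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, len);
}
```

Note that syscall() is not a cancellation point in the way write() is, which is one reason ordinary code should keep using the wrapper.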
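That completes the last step in user space. The CPU now switches to kernel mode and begins executing the kernel's system call entry point, entry_SYSCALL_64.
3. The Linux kernel: system call entry and the VFS layer
3.1 System call dispatch
After the syscall instruction executes, the CPU jumps to entry_SYSCALL_64 (on x86-64), which saves the user-space register state and calls do_syscall_64. With rax = 1 the dispatcher selects the handler for write. In current kernels, system calls are defined with the SYSCALL_DEFINE macros:
c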
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
return ksys_write(fd, buf, count);
}
SYSCALL_DEFINE3 expands to, among other things, a function named __x64_sys_write on x86-64, which calls ksys_write.
3.2 ksys_write: looking up the file object
c
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos, *ppos = file_ppos(f.file);
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_write(f.file, buf, count, ppos);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f);
}
return ret;
}
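- fdget_pos(fd) looks up the struct file for the descriptor. It takes a reference only when the descriptor table is shared with other threads, and it acquires the file position lock (f_pos_lock) when position updates on the file need to be serialized.
- file_ppos returns a pointer to the file's current offset for seekable files such as regular files, and NULL for stream-like files (for example sockets and pipes).
- vfs_write is the core write function of the VFS layer. It works on a local copy of the offset, and ksys_write stores the updated value back into f_pos once the write has succeeded.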
3.3 vfs_write: access checks and dispatch
vfs_write performs basic permission checks and accounting, then calls the file system's write method:
c
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
if (unlikely(!access_ok(buf, count)))
return -EFAULT;
ret = rw_verify_area(WRITE, file, pos, count);
if (ret)
return ret;
if (count > MAX_RW_COUNT)
count = MAX_RW_COUNT;
file_start_write(file);
if (file->f_op->write)
ret = file->f_op->write(file, buf, count, pos);
else if (file->f_op->write_iter)
ret = new_sync_write(file, buf, count, pos);
else
ret = -EINVAL;
if (ret > 0) {
fsnotify_modify(file);
add_wchar(current, ret);
}
inc_syscw(current);
file_end_write(file);
return ret;
}
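- access_ok checks that the user-space buffer lies within the user address range; it does not verify that the pages are actually mapped.
- rw_verify_area validates the offset and length (for example, rejecting a negative offset or a range that overflows) and runs the security hook.
- MAX_RW_COUNT is INT_MAX & PAGE_MASK and caps the size of a single read or write.
- file_start_write and file_end_write protect the write against a concurrent file system freeze.
- For ext4, file->f_op->write_iter is ext4_file_write_iter, and the write callback is NULL because ext4 uses the write_iter interface. new_sync_write is a wrapper that builds a kiocb and an iov_iter and then calls ext4_file_write_iter.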
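The first check in vfs_write, the FMODE_WRITE test, can be observed directly. This sketch assumes a POSIX system where /dev/null exists; errno_of_write_on_readonly_fd is a hypothetical demo function.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo: open a descriptor read-only, attempt a write, and
 * return the errno that write() leaves behind (0 if the write
 * unexpectedly succeeds). */
static int errno_of_write_on_readonly_fd(void)
{
    int fd = open("/dev/null", O_RDONLY);
    assert(fd >= 0);

    errno = 0;
    ssize_t n = write(fd, "x", 1);
    int saved = (n == -1) ? errno : 0;
    close(fd);
    return saved;
}
```

The result is EBADF, the same code vfs_write returns when the file was not opened for writing.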
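4. The ext4 file system: how a buffered write is implemented
4.1 ext4_file_write_iter: choosing the write path
Depending on the inode's attributes and the flags passed in, ext4 chooses between DAX (direct access), direct I/O (DIO), and buffered writes:
c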
static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
if (unlikely(ext4_forced_shutdown(inode->i_sb)))
return -EIO;
#ifdef CONFIG_FS_DAX
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
#endif
if (iocb->ki_flags & IOCB_DIRECT)
return ext4_dio_write_iter(iocb, from);
else
return ext4_buffered_write_iter(iocb, from);
}
An ordinary FileOutputStream.write, where the file is not opened with O_DIRECT, takes the ext4_buffered_write_iter path.
4.2 ext4_buffered_write_iter: locking and the generic write
c
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
{
ssize_t ret;
struct inode *inode = file_inode(iocb->ki_filp);
if (iocb->ki_flags & IOCB_NOWAIT)
return -EOPNOTSUPP;
inode_lock(inode);
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
ret = generic_perform_write(iocb, from);
out:
inode_unlock(inode);
if (unlikely(ret <= 0))
return ret;
return generic_write_sync(iocb, ret);
}
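- inode_lock(inode) takes the inode's read-write semaphore (i_rwsem) for writing. This is a per-file exclusive lock that keeps concurrent writers from modifying the same file's data or metadata at the same time.
- ext4_write_checks validates the request and updates the position in the iocb, including moving it to end of file for O_APPEND.
- generic_perform_write is the kernel's generic write routine; it copies the user data into the page cache.
- After the copy, generic_write_sync forces the written range to stable storage if the file was opened with O_SYNC or O_DSYNC (or the inode is marked synchronous). Otherwise it returns immediately.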
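The O_APPEND handling mentioned above has a simple user-visible consequence: the kernel repositions every write to end of file, whatever the descriptor's offset was. A minimal sketch, assuming a POSIX system with a writable /tmp; append_ignores_seek is a hypothetical demo function.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical demo: create a 3-byte file, reopen it with O_APPEND, seek
 * back to offset 0, and write one byte. Returns the offset after that
 * write, or -1 on failure. */
static off_t append_ignores_seek(void)
{
    char path[] = "/tmp/fos_append_XXXXXX";
    int fd = mkstemp(path);
    assert(fd >= 0);
    ssize_t n = write(fd, "abc", 3);
    close(fd);
    if (n != 3) {
        unlink(path);
        return (off_t) -1;
    }

    fd = open(path, O_WRONLY | O_APPEND);
    unlink(path);                         /* remove the name; fd stays valid */
    if (fd < 0)
        return (off_t) -1;

    off_t end = (off_t) -1;
    if (lseek(fd, 0, SEEK_SET) == 0 && write(fd, "d", 1) == 1)
        end = lseek(fd, 0, SEEK_CUR);     /* offset after the appended byte */
    close(fd);
    return end;
}
```

The byte lands after the existing three, so the returned offset is 4 rather than 1.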
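4.3 generic_perform_write: writing into the page cache
generic_perform_write is the key to understanding buffered I/O. For each page-sized chunk of the write it:
- calls the address space's write_begin callback, which finds or creates the page cache page and prepares it;
- copies the data from user space into that page (copy_page_from_iter_atomic);
- calls write_end, which marks the page dirty and extends the inode size if the write went past the current end of file;
- advances the position, and after the last chunk returns the number of bytes written.
For ext4 with delayed allocation, these callbacks are ext4_da_write_begin and ext4_da_write_end. They implement delayed allocation: when data is written into the page cache, no disk blocks are allocated yet; the space is only reserved and the range is marked as delayed. The real block allocation happens later, when the page cache is written back.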
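4.4 Delayed allocation and what triggers writeback
Delayed allocation lets the file system allocate blocks in batches, which reduces fragmentation and improves throughput. The cost is a longer window in which data exists only in memory: if power fails before writeback, data that has not yet been written out is lost. Data reaches the disk when:
- memory pressure makes the kernel reclaim memory and write back dirty pages;
- the application calls fsync or fdatasync (close does not flush by itself, although ext4 applies heuristics in some cases, such as replacing a file via rename);
- the kernel's flusher threads run their periodic writeback (interval set by /proc/sys/vm/dirty_writeback_centisecs);
- the file is opened with O_SYNC, in which case each write system call waits for writeback before returning.
In ext4 the writeback itself is done by ext4_writepages.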
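From the application's side, the explicit trigger is fsync. The sketch below assumes a POSIX system with a writable /tmp; write_durable and durable_write_demo are hypothetical helpers. It writes the whole buffer and then asks the kernel to flush the file.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes (retrying on EINTR and on
 * short writes), then flush the file's dirty pages and metadata with
 * fsync(). Returns 0 on success, -1 on error. */
static int write_durable(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return fsync(fd);
}

/* Hypothetical demo: run write_durable against a temporary file. */
static int durable_write_demo(void)
{
    char path[] = "/tmp/fos_sync_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    int rc = write_durable(fd, "hello", 5);
    close(fd);
    unlink(path);
    return rc;
}
```

Until fsync returns, the data may exist only in the page cache, which is exactly the window that delayed allocation widens.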
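5. Page cache writeback: the journey through ext4_writepages
When dirty pages have to be flushed to disk, the kernel calls the file system's writepages method. For ext4 that is ext4_writepages.
5.1 Preparation: ext4_do_writepages
ext4_writepages mostly delegates to ext4_do_writepages, which handles two modes:
- data=journal mode: data is first written to the journal and then to its final location (rarely used).
- delayed allocation mode (the default): this is where blocks are really allocated and bios are submitted.
A simplified view of the function:
c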
static int ext4_do_writepages(struct mpage_da_data *mpd)
{
struct writeback_control *wbc = mpd->wbc;
struct inode *inode = mpd->inode;
// ...
while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
// ...
needed_blocks = ext4_da_writepages_trans_blocks(inode);
handle = ext4_journal_start_with_reserve(inode, ...);
ret = mpage_prepare_extent_to_map(mpd);
if (!ret && mpd->map.m_len)
ret = mpage_map_and_submit_extent(handle, mpd, &give_up_on_write);
// ...
ext4_journal_stop(handle);
}
// ...
}
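- mpage_prepare_extent_to_map walks the page cache, collects a run of contiguous dirty pages, and records it in mpd->map.
- mpage_map_and_submit_extent allocates disk blocks for those pages and builds the bios that go to the block layer.
5.2 Block allocation and bio submission
At its core, mpage_map_and_submit_extent loops over mpage_map_one_extent to allocate or map one contiguous disk extent, then calls mpage_map_and_submit_buffers to submit the corresponding pages as bios.
ext4_io_submit finally hands the bio to the generic block layer:
c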
void ext4_io_submit(struct ext4_io_submit *io)
{
struct bio *bio = io->io_bio;
if (bio) {
if (io->io_wbc->sync_mode == WB_SYNC_ALL)
io->io_bio->bi_opf |= REQ_SYNC;
submit_bio(io->io_bio);
}
io->io_bio = NULL;
}
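submit_bio passes the bio to the block device's queue, and the device driver (SCSI or NVMe, for example) then performs the DMA transfer that writes the data to the storage medium. That part lies outside the file system and is not covered here.
6. Locking in depth: how inode_lock is implemented
Earlier we noted that inode_lock(inode) takes the inode's read-write semaphore for writing. The lock is a struct rw_semaphore. This section looks at its implementation.
6.1 Expanding inode_lock
c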
static inline void inode_lock(struct inode *inode)
{
down_write(&inode->i_rwsem);
}
down_write acquires a read-write semaphore for writing. It calls __down_write:
c
static inline void __down_write(struct rw_semaphore *sem)
{
__down_write_common(sem, TASK_UNINTERRUPTIBLE);
}
__down_write_common first tries the fast path with rwsem_write_trylock and falls back to the slow path if that fails:
c
static inline int __down_write_common(struct rw_semaphore *sem, int state)
{
int ret = 0;
preempt_disable();
if (unlikely(!rwsem_write_trylock(sem))) {
if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
ret = -EINTR;
}
preempt_enable();
return ret;
}
6.2 rwsem_write_trylock and atomic operations
rwsem_write_trylock uses an atomic compare-and-exchange to change sem->count from the unlocked value (RWSEM_UNLOCKED_VALUE) to the writer-locked value (RWSEM_WRITER_LOCKED):
c
static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
{
long tmp = RWSEM_UNLOCKED_VALUE;
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
return true;
}
return false;
}
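atomic_long_try_cmpxchg_acquire eventually maps to the architecture's compare-and-exchange primitive. On x86-64 it ends up in this inline assembly (only the 64-bit case is shown):
c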
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
// ...
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
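The cmpxchgq instruction carries the lock prefix, which makes the compare-and-exchange atomic across CPU cores.
6.3 The slow path: rwsem_down_write_slowpath
If the fast path fails because the semaphore is already held, the kernel enters the slow path. It starts with optimistic spinning: if the current lock holder is running on another CPU, the waiter may spin briefly instead of sleeping right away, which saves a context switch. A condensed version:
c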
static struct rw_semaphore __sched *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
// Try optimistic spinning first
if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
return sem;
}
// Otherwise join the wait queue and sleep
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;
// ...
for (;;) {
if (rwsem_try_write_lock(sem, &waiter))
break;
schedule_preempt_disabled(); // Give up the CPU until woken
}
// ...
}
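This design reflects the kernel's trade-off between performance and fairness.
7. Summary and reflections
We followed the byte passed to FileOutputStream.write(b) from a Java object all the way to the disk controller's DMA transfer. The whole path can be summarized as:
- The Java application calls write(int).
- The JVM calls a precompiled C function through JNI, which fetches the file descriptor.
- glibc wraps the system call, handles thread cancellation, and issues the syscall instruction through inline assembly.
- The kernel entry code uses the system call number to find sys_write; the VFS layer checks permissions and dispatches to ext4.
- ext4 copies the data into the page cache in ext4_buffered_write_iter and marks the pages dirty.
- Later (or before write returns, with O_SYNC), ext4_writepages allocates disk blocks, builds bios, and submits them to the block layer.
- The block layer turns bios into requests, and the device driver performs the transfer.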
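7.1 Performance versus consistency
- Buffered I/O with delayed allocation: the fastest option, but data can be lost on power failure unless fsync has been called.
- O_SYNC / O_DSYNC: each write returns only after the data has reached stable storage, which guarantees durability at a significant cost in throughput.
- Direct I/O (DIO): bypasses the page cache and transfers directly to the block device; suited to applications such as databases that manage their own cache.
7.2 The effect of lock contention
inode_lock is a per-file exclusive lock, so only one thread at a time can perform a buffered write to a given file (or change its metadata). For workloads where many threads write to the same file, such as logging, this becomes a bottleneck. Options include spreading writes over several files, using pwrite with distinct offsets (which avoids contention on the shared file offset but still takes the inode lock for buffered writes), or moving to a different submission model such as io_uring, which lowers per-call overhead but does not by itself remove the inode lock from buffered writes.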
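7.3 Observability and debugging
Understanding the whole path is essential for performance analysis and troubleshooting. For example:
- When write calls show high latency, likely causes are:
  - heavy writeback pressure, so that write is throttled while the kernel cleans dirty pages;
  - a saturated disk, so that requests queue up after submit_bio;
  - heavy contention on the file lock.
- Tools such as perf, ftrace, and blktrace can trace the time spent at each stage from the system call down to the disk.
7.4 Takeaways for Java developers
- FileChannel with a direct ByteBuffer can reduce some per-call overhead, such as intermediate copying, but it still goes through the same system call path.
- FileOutputStream.getFD().sync() corresponds to fsync and forces the file's dirty pages out to the disk.
- java.nio.channels.AsynchronousFileChannel offers an asynchronous API, but in OpenJDK on Linux it is implemented with a thread pool that performs ordinary blocking I/O rather than kernel AIO or io_uring; the details depend on the JDK version and operating system.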
#Source code
jfieldID fos_fd; /* id for jobject 'fd' in java.io.FileOutputStream */
/**************************************************************
* static methods to store field ID's in initializers
*/
JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_initIDs(JNIEnv *env, jclass fdClass) {
fos_fd = (*env)->GetFieldID(env, fdClass, "fd", "Ljava/io/FileDescriptor;");
}
JNIEXPORT void JNICALL
Java_java_io_FileOutputStream_write(JNIEnv *env, jobject this, jint byte, jboolean append) {
writeSingle(env, this, byte, append, fos_fd);
}
void
writeSingle(JNIEnv *env, jobject this, jint byte, jboolean append, jfieldID fid) {
// Discard the 24 high-order bits of byte. See OutputStream#write(int)
char c = (char) byte;
jint n;
FD fd = getFD(env, this, fid);
if (fd == -1) {
JNU_ThrowIOException(env, "Stream Closed");
return;
}
if (append == JNI_TRUE) {
n = IO_Append(fd, &c, 1);
} else {
n = IO_Write(fd, &c, 1);
}
if (n == -1) {
JNU_ThrowIOExceptionWithLastError(env, "Write error");
}
}
#define IO_Write handleWrite
ssize_t
handleWrite(FD fd, const void *buf, jint len)
{
ssize_t result;
RESTARTABLE(write(fd, buf, len), result);
return result;
}
/* Write NBYTES of BUF to FD. Return the number written, or -1. */
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
return SYSCALL_CANCEL (write, fd, buf, nbytes);
}
libc_hidden_def (__libc_write)
weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
libc_hidden_weak (write)
#define SYSCALL_CANCEL(...) \
({ \
long int sc_ret; \
if (NO_SYSCALL_CANCEL_CHECKING) \
sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); \
else \
{ \
int sc_cancel_oldtype = LIBC_CANCEL_ASYNC (); \
sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__); \
LIBC_CANCEL_RESET (sc_cancel_oldtype); \
} \
sc_ret; \
})
/* Issue a syscall defined by syscall number plus any other argument
required. Any error will be handled using arch defined macros and errno
will be set accordingly.
It is similar to INLINE_SYSCALL macro, but without the need to pass the
expected argument number as second parameter. */
#define INLINE_SYSCALL_CALL(...) \
__INLINE_SYSCALL_DISP (__INLINE_SYSCALL, __VA_ARGS__)
#define __INLINE_SYSCALL_DISP(b,...) \
__SYSCALL_CONCAT (b,__INLINE_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)
#define __INLINE_SYSCALL_NARGS(...) \
__INLINE_SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,)
#define __INTERNAL_SYSCALL3(name, a1, a2, a3) \
INTERNAL_SYSCALL (name, 3, a1, a2, a3)
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, nr, args...) \
internal_syscall##nr (SYS_ify (name), args)
#undef SYS_ify
#define SYS_ify(syscall_name) __NR_##syscall_name
#define __NR_write 1
#undef internal_syscall3
#define internal_syscall3(number, arg1, arg2, arg3) \
({ \
unsigned long int resultvar; \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
##System calls
1 common write sys_write
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
return ksys_write(fd, buf, count);
}
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos, *ppos = file_ppos(f.file);
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_write(f.file, buf, count, ppos);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f);
}
return ret;
}
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (!(file->f_mode & FMODE_CAN_WRITE))
return -EINVAL;
if (unlikely(!access_ok(buf, count)))
return -EFAULT;
ret = rw_verify_area(WRITE, file, pos, count);
if (ret)
return ret;
if (count > MAX_RW_COUNT)
count = MAX_RW_COUNT;
file_start_write(file);
if (file->f_op->write)
ret = file->f_op->write(file, buf, count, pos);
else if (file->f_op->write_iter)
ret = new_sync_write(file, buf, count, pos);
else
ret = -EINVAL;
if (ret > 0) {
fsnotify_modify(file);
add_wchar(current, ret);
}
inc_syscw(current);
file_end_write(file);
return ret;
}
static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
if (unlikely(ext4_forced_shutdown(inode->i_sb)))
return -EIO;
#ifdef CONFIG_FS_DAX
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
#endif
if (iocb->ki_flags & IOCB_DIRECT)
return ext4_dio_write_iter(iocb, from);
else
return ext4_buffered_write_iter(iocb, from);
}
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
{
ssize_t ret;
struct inode *inode = file_inode(iocb->ki_filp);
if (iocb->ki_flags & IOCB_NOWAIT)
return -EOPNOTSUPP;
inode_lock(inode);
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
ret = generic_perform_write(iocb, from);
out:
inode_unlock(inode);
if (unlikely(ret <= 0))
return ret;
return generic_write_sync(iocb, ret);
}
static inline void inode_lock(struct inode *inode)
{
down_write(&inode->i_rwsem);
}
/*
* lock for writing
*/
static inline int __down_write_common(struct rw_semaphore *sem, int state)
{
int ret = 0;
preempt_disable();
if (unlikely(!rwsem_write_trylock(sem))) {
if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
ret = -EINTR;
}
preempt_enable();
return ret;
}
static inline void __down_write(struct rw_semaphore *sem)
{
__down_write_common(sem, TASK_UNINTERRUPTIBLE);
}
static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
{
long tmp = RWSEM_UNLOCKED_VALUE;
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
return true;
}
return false;
}
/**
* atomic_long_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_long_t
* @old: pointer to long value to compare with
* @new: long value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Unsafe to use in noinstr code; use raw_atomic_long_try_cmpxchg_acquire() there.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
{
instrument_atomic_read_write(v, sizeof(*v));
instrument_atomic_read_write(old, sizeof(*old));
return raw_atomic_long_try_cmpxchg_acquire(v, old, new);
}
/**
* raw_atomic_long_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic_long_t
* @old: pointer to long value to compare with
* @new: long value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Safe to use in noinstr code; prefer atomic_long_try_cmpxchg_acquire() elsewhere.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
raw_atomic_long_try_cmpxchg_acquire(atomic_long_t *v, long *old, long new)
{
#ifdef CONFIG_64BIT
return raw_atomic64_try_cmpxchg_acquire(v, (s64 *)old, new);
#else
return raw_atomic_try_cmpxchg_acquire(v, (int *)old, new);
#endif
}
/**
* raw_atomic64_try_cmpxchg_acquire() - atomic compare and exchange with acquire ordering
* @v: pointer to atomic64_t
* @old: pointer to s64 value to compare with
* @new: s64 value to assign
*
* If (@v == @old), atomically updates @v to @new with acquire ordering.
* Otherwise, updates @old to the current value of @v.
*
* Safe to use in noinstr code; prefer atomic64_try_cmpxchg_acquire() elsewhere.
*
* Return: @true if the exchange occured, @false otherwise.
*/
static __always_inline bool
raw_atomic64_try_cmpxchg_acquire(atomic64_t *v, s64 *old, s64 new)
{
#if defined(arch_atomic64_try_cmpxchg_acquire)
return arch_atomic64_try_cmpxchg_acquire(v, old, new);
#elif defined(arch_atomic64_try_cmpxchg_relaxed)
bool ret = arch_atomic64_try_cmpxchg_relaxed(v, old, new);
__atomic_acquire_fence();
return ret;
#elif defined(arch_atomic64_try_cmpxchg)
return arch_atomic64_try_cmpxchg(v, old, new);
#else
s64 r, o = *old;
r = raw_atomic64_cmpxchg_acquire(v, o, new);
if (unlikely(r != o))
*old = r;
return likely(r == o);
#endif
}
static __always_inline bool arch_atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 new)
{
return arch_try_cmpxchg(&v->counter, old, new);
}
#define arch_atomic64_try_cmpxchg arch_atomic64_try_cmpxchg
#define arch_try_cmpxchg(ptr, pold, new) \
__try_cmpxchg((ptr), (pold), (new), sizeof(*(ptr)))
#define __try_cmpxchg(ptr, pold, new, size) \
__raw_try_cmpxchg((ptr), (pold), (new), (size), LOCK_PREFIX)
#define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
({ \
bool success; \
__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
__typeof__(*(_ptr)) __old = *_old; \
__typeof__(*(_ptr)) __new = (_new); \
switch (size) { \
case __X86_CASE_B: \
{ \
volatile u8 *__ptr = (volatile u8 *)(_ptr); \
asm volatile(lock "cmpxchgb %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "q" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_W: \
{ \
volatile u16 *__ptr = (volatile u16 *)(_ptr); \
asm volatile(lock "cmpxchgw %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_L: \
{ \
volatile u32 *__ptr = (volatile u32 *)(_ptr); \
asm volatile(lock "cmpxchgl %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
case __X86_CASE_Q: \
{ \
volatile u64 *__ptr = (volatile u64 *)(_ptr); \
asm volatile(lock "cmpxchgq %[new], %[ptr]" \
CC_SET(z) \
: CC_OUT(z) (success), \
[ptr] "+m" (*__ptr), \
[old] "+a" (__old) \
: [new] "r" (__new) \
: "memory"); \
break; \
} \
default: \
__cmpxchg_wrong_size(); \
} \
if (unlikely(!success)) \
*_old = __old; \
likely(success); \
})
/*
* Wait until we successfully acquire the write lock
*/
static struct rw_semaphore __sched *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
/* do optimistic spinning and steal lock if possible */
if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
/* rwsem_optimistic_spin() implies ACQUIRE on success */
return sem;
}
/*
* Optimistic spinning failed, proceed to the slowpath
* and block until we can acquire the sem.
*/
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
waiter.handoff_set = false;
raw_spin_lock_irq(&sem->wait_lock);
rwsem_add_waiter(sem, &waiter);
/* we're now waiting on the lock */
if (rwsem_first_waiter(sem) != &waiter) {
rwsem_cond_wake_waiter(sem, atomic_long_read(&sem->count),
&wake_q);
if (!wake_q_empty(&wake_q)) {
/*
* We want to minimize wait_lock hold time especially
* when a large number of readers are to be woken up.
*/
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
raw_spin_lock_irq(&sem->wait_lock);
}
} else {
atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
}
/* wait until we successfully acquire the lock */
set_current_state(state);
trace_contention_begin(sem, LCB_F_WRITE);
for (;;) {
if (rwsem_try_write_lock(sem, &waiter)) {
/* rwsem_try_write_lock() implies ACQUIRE on success */
break;
}
raw_spin_unlock_irq(&sem->wait_lock);
if (signal_pending_state(state, current))
goto out_nolock;
/*
* After setting the handoff bit and failing to acquire
* the lock, attempt to spin on owner to accelerate lock
* transfer. If the previous owner is a on-cpu writer and it
* has just released the lock, OWNER_NULL will be returned.
* In this case, we attempt to acquire the lock again
* without sleeping.
*/
if (waiter.handoff_set) {
enum owner_state owner_state;
owner_state = rwsem_spin_on_owner(sem);
if (owner_state == OWNER_NULL)
goto trylock_again;
}
schedule_preempt_disabled();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
trylock_again:
raw_spin_lock_irq(&sem->wait_lock);
}
__set_current_state(TASK_RUNNING);
raw_spin_unlock_irq(&sem->wait_lock);
lockevent_inc(rwsem_wlock);
trace_contention_end(sem, 0);
return sem;
out_nolock:
__set_current_state(TASK_RUNNING);
raw_spin_lock_irq(&sem->wait_lock);
rwsem_del_wake_waiter(sem, &waiter, &wake_q);
lockevent_inc(rwsem_wlock_fail);
trace_contention_end(sem, -EINTR);
return ERR_PTR(-EINTR);
}
static const struct address_space_operations ext4_da_aops = {
.read_folio = ext4_read_folio,
.readahead = ext4_readahead,
.writepages = ext4_writepages,
.write_begin = ext4_da_write_begin,
.write_end = ext4_da_write_end,
.dirty_folio = ext4_dirty_folio,
.bmap = ext4_bmap,
.invalidate_folio = ext4_invalidate_folio,
.release_folio = ext4_release_folio,
.direct_IO = noop_direct_IO,
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_folio = generic_error_remove_folio,
.swap_activate = ext4_iomap_swap_activate,
};
static int ext4_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
struct super_block *sb = mapping->host->i_sb;
struct mpage_da_data mpd = {
.inode = mapping->host,
.wbc = wbc,
.can_map = 1,
};
int ret;
int alloc_ctx;
if (unlikely(ext4_forced_shutdown(sb)))
return -EIO;
alloc_ctx = ext4_writepages_down_read(sb);
ret = ext4_do_writepages(&mpd);
/*
* For data=journal writeback we could have come across pages marked
* for delayed dirtying (PageChecked) which were just added to the
* running transaction. Try once more to get them to stable storage.
*/
if (!ret && mpd.journalled_more_data)
ret = ext4_do_writepages(&mpd);
ext4_writepages_up_read(sb, alloc_ctx);
return ret;
}
static int ext4_do_writepages(struct mpage_da_data *mpd)
{
struct writeback_control *wbc = mpd->wbc;
pgoff_t writeback_index = 0;
long nr_to_write = wbc->nr_to_write;
int range_whole = 0;
int cycled = 1;
handle_t *handle = NULL;
struct inode *inode = mpd->inode;
struct address_space *mapping = inode->i_mapping;
int needed_blocks, rsv_blocks = 0, ret = 0;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
struct blk_plug plug;
bool give_up_on_write = false;
trace_ext4_writepages(inode, wbc);
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
* because that could violate lock ordering on umount
*/
if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
goto out_writepages;
/*
* If the filesystem has aborted, it is read-only, so return
* right away instead of dumping stack traces later on that
* will obscure the real source of the problem. We test
* fs shutdown state instead of sb->s_flag's SB_RDONLY because
* the latter could be true if the filesystem is mounted
* read-only, and in that case, ext4_writepages should
* *never* be called, so if that ever happens, we would want
* the stack trace.
*/
if (unlikely(ext4_forced_shutdown(mapping->host->i_sb))) {
ret = -EROFS;
goto out_writepages;
}
/*
* If we have inline data and arrive here, it means that
* we will soon create the block for the 1st page, so
* we'd better clear the inline data here.
*/
    if (ext4_has_inline_data(inode)) {
        /* Just inode will be modified... */
        handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
        if (IS_ERR(handle)) {
            ret = PTR_ERR(handle);
            goto out_writepages;
        }
        BUG_ON(ext4_test_inode_state(inode,
                EXT4_STATE_MAY_INLINE_DATA));
        ext4_destroy_inline_data(handle, inode);
        ext4_journal_stop(handle);
    }

    /*
     * data=journal mode does not do delalloc so we just need to writeout /
     * journal already mapped buffers. On the other hand we need to commit
     * transaction to make data stable. We expect all the data to be
     * already in the journal (the only exception are DMA pinned pages
     * dirtied behind our back) so we commit transaction here and run the
     * writeback loop to checkpoint them. The checkpointing is not actually
     * necessary to make data persistent *but* quite a few places (extent
     * shifting operations, fsverity, ...) depend on being able to drop
     * pagecache pages after calling filemap_write_and_wait() and for that
     * checkpointing needs to happen.
     */
    if (ext4_should_journal_data(inode)) {
        mpd->can_map = 0;
        if (wbc->sync_mode == WB_SYNC_ALL)
            ext4_fc_commit(sbi->s_journal,
                           EXT4_I(inode)->i_datasync_tid);
    }
    mpd->journalled_more_data = 0;

    if (ext4_should_dioread_nolock(inode)) {
        /*
         * We may need to convert up to one extent per block in
         * the page and we may dirty the inode.
         */
        rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
                        PAGE_SIZE >> inode->i_blkbits);
    }

    if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
        range_whole = 1;

    if (wbc->range_cyclic) {
        writeback_index = mapping->writeback_index;
        if (writeback_index)
            cycled = 0;
        mpd->first_page = writeback_index;
        mpd->last_page = -1;
    } else {
        mpd->first_page = wbc->range_start >> PAGE_SHIFT;
        mpd->last_page = wbc->range_end >> PAGE_SHIFT;
    }

    ext4_io_submit_init(&mpd->io_submit, wbc);
retry:
    if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
        tag_pages_for_writeback(mapping, mpd->first_page,
                                mpd->last_page);
    blk_start_plug(&plug);

    /*
     * First writeback pages that don't need mapping - we can avoid
     * starting a transaction unnecessarily and also avoid being blocked
     * in the block layer on device congestion while having transaction
     * started.
     */
    mpd->do_map = 0;
    mpd->scanned_until_end = 0;
    mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
    if (!mpd->io_submit.io_end) {
        ret = -ENOMEM;
        goto unplug;
    }
    ret = mpage_prepare_extent_to_map(mpd);
    /* Unlock pages we didn't use */
    mpage_release_unused_pages(mpd, false);
    /* Submit prepared bio */
    ext4_io_submit(&mpd->io_submit);
    ext4_put_io_end_defer(mpd->io_submit.io_end);
    mpd->io_submit.io_end = NULL;
    if (ret < 0)
        goto unplug;

    while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
        /* For each extent of pages we use new io_end */
        mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
        if (!mpd->io_submit.io_end) {
            ret = -ENOMEM;
            break;
        }

        WARN_ON_ONCE(!mpd->can_map);
        /*
         * We have two constraints: We find one extent to map and we
         * must always write out whole page (makes a difference when
         * blocksize < pagesize) so that we don't block on IO when we
         * try to write out the rest of the page. Journalled mode is
         * not supported by delalloc.
         */
        BUG_ON(ext4_should_journal_data(inode));
        needed_blocks = ext4_da_writepages_trans_blocks(inode);

        /* start a new transaction */
        handle = ext4_journal_start_with_reserve(inode,
                EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
        if (IS_ERR(handle)) {
            ret = PTR_ERR(handle);
            ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
                   "%ld pages, ino %lu; err %d", __func__,
                wbc->nr_to_write, inode->i_ino, ret);
            /* Release allocated io_end */
            ext4_put_io_end(mpd->io_submit.io_end);
            mpd->io_submit.io_end = NULL;
            break;
        }
        mpd->do_map = 1;

        trace_ext4_da_write_pages(inode, mpd->first_page, wbc);
        ret = mpage_prepare_extent_to_map(mpd);
        if (!ret && mpd->map.m_len)
            ret = mpage_map_and_submit_extent(handle, mpd,
                    &give_up_on_write);
        /*
         * Caution: If the handle is synchronous,
         * ext4_journal_stop() can wait for transaction commit
         * to finish which may depend on writeback of pages to
         * complete or on page lock to be released. In that
         * case, we have to wait until after we have
         * submitted all the IO, released page locks we hold,
         * and dropped io_end reference (for extent conversion
         * to be able to complete) before stopping the handle.
         */
        if (!ext4_handle_valid(handle) || handle->h_sync == 0) {
            ext4_journal_stop(handle);
            handle = NULL;
            mpd->do_map = 0;
        }
        /* Unlock pages we didn't use */
        mpage_release_unused_pages(mpd, give_up_on_write);
        /* Submit prepared bio */
        ext4_io_submit(&mpd->io_submit);

        /*
         * Drop our io_end reference we got from init. We have
         * to be careful and use deferred io_end finishing if
         * we are still holding the transaction as we can
         * release the last reference to io_end which may end
         * up doing unwritten extent conversion.
         */
        if (handle) {
            ext4_put_io_end_defer(mpd->io_submit.io_end);
            ext4_journal_stop(handle);
        } else
            ext4_put_io_end(mpd->io_submit.io_end);
        mpd->io_submit.io_end = NULL;

        if (ret == -ENOSPC && sbi->s_journal) {
            /*
             * Commit the transaction which would
             * free blocks released in the transaction
             * and try again
             */
            jbd2_journal_force_commit_nested(sbi->s_journal);
            ret = 0;
            continue;
        }
        /* Fatal error - ENOMEM, EIO... */
        if (ret)
            break;
    }
unplug:
    blk_finish_plug(&plug);
    if (!ret && !cycled && wbc->nr_to_write > 0) {
        cycled = 1;
        mpd->last_page = writeback_index - 1;
        mpd->first_page = 0;
        goto retry;
    }

    /* Update index */
    if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
        /*
         * Set the writeback_index so that range_cyclic
         * mode will write it back later
         */
        mapping->writeback_index = mpd->first_page;

out_writepages:
    trace_ext4_writepages_result(inode, wbc, ret,
                                 nr_to_write - wbc->nr_to_write);
    return ret;
}
/*
* mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
* mpd->len and submit pages underlying it for IO
*
* @handle - handle for journal operations
* @mpd - extent to map
* @give_up_on_write - we set this to true iff there is a fatal error and there
* is no hope of writing the data. The caller should discard
* dirty pages to avoid infinite loops.
*
* The function maps extent starting at mpd->lblk of length mpd->len. If it is
* delayed, blocks are allocated, if it is unwritten, we may need to convert
* them to initialized or split the described range from larger unwritten
* extent. Note that we need not map all the described range since allocation
* can return less blocks or the range is covered by more unwritten extents. We
* cannot map more because we are limited by reserved transaction credits. On
* the other hand we always make sure that the last touched page is fully
* mapped so that it can be written out (and thus forward progress is
* guaranteed). After mapping we submit all mapped pages for IO.
*/
static int mpage_map_and_submit_extent(handle_t *handle,
                                       struct mpage_da_data *mpd,
                                       bool *give_up_on_write)
{
    struct inode *inode = mpd->inode;
    struct ext4_map_blocks *map = &mpd->map;
    int err;
    loff_t disksize;
    int progress = 0;
    ext4_io_end_t *io_end = mpd->io_submit.io_end;
    struct ext4_io_end_vec *io_end_vec;

    io_end_vec = ext4_alloc_io_end_vec(io_end);
    if (IS_ERR(io_end_vec))
        return PTR_ERR(io_end_vec);
    io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
    do {
        err = mpage_map_one_extent(handle, mpd);
        if (err < 0) {
            struct super_block *sb = inode->i_sb;

            if (ext4_forced_shutdown(sb))
                goto invalidate_dirty_pages;
            /*
             * Let the upper layers retry transient errors.
             * In the case of ENOSPC, if ext4_count_free_clusters()
             * is non-zero, a commit should free up blocks.
             */
            if ((err == -ENOMEM) ||
                (err == -ENOSPC && ext4_count_free_clusters(sb))) {
                if (progress)
                    goto update_disksize;
                return err;
            }
            ext4_msg(sb, KERN_CRIT,
                     "Delayed block allocation failed for "
                     "inode %lu at logical offset %llu with"
                     " max blocks %u with error %d",
                     inode->i_ino,
                     (unsigned long long)map->m_lblk,
                     (unsigned)map->m_len, -err);
            ext4_msg(sb, KERN_CRIT,
                     "This should not happen!! Data will "
                     "be lost\n");
            if (err == -ENOSPC)
                ext4_print_free_blocks(inode);
        invalidate_dirty_pages:
            *give_up_on_write = true;
            return err;
        }
        progress = 1;
        /*
         * Update buffer state, submit mapped pages, and get us new
         * extent to map
         */
        err = mpage_map_and_submit_buffers(mpd);
        if (err < 0)
            goto update_disksize;
    } while (map->m_len);

update_disksize:
    /*
     * Update on-disk size after IO is submitted. Races with
     * truncate are avoided by checking i_size under i_data_sem.
     */
    disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
    if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
        int err2;
        loff_t i_size;

        down_write(&EXT4_I(inode)->i_data_sem);
        i_size = i_size_read(inode);
        if (disksize > i_size)
            disksize = i_size;
        if (disksize > EXT4_I(inode)->i_disksize)
            EXT4_I(inode)->i_disksize = disksize;
        up_write(&EXT4_I(inode)->i_data_sem);
        err2 = ext4_mark_inode_dirty(handle, inode);
        if (err2) {
            ext4_error_err(inode->i_sb, -err2,
                           "Failed to mark inode %lu dirty",
                           inode->i_ino);
        }
        if (!err)
            err = err2;
    }
    return err;
}
void ext4_io_submit(struct ext4_io_submit *io)
{
    struct bio *bio = io->io_bio;

    if (bio) {
        if (io->io_wbc->sync_mode == WB_SYNC_ALL)
            io->io_bio->bi_opf |= REQ_SYNC;
        submit_bio(io->io_bio);
    }
    io->io_bio = NULL;
}
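The `range_cyclic` handling in `ext4_do_writepages` (the `cycled` flag and the `goto retry` jump) is easy to miss in the listing. The following is a minimal user-space sketch of that two-pass scan, not kernel code. `model_mapping`, `scan_range`, and `cyclic_writeback` are hypothetical names introduced here for illustration, a plain `bool` array stands in for the page cache's dirty tags, and "writing" a page simply clears its flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address_space: one dirty flag per page. */
struct model_mapping {
    bool *dirty;             /* dirty[i] is true if page i needs writeback */
    size_t npages;           /* number of pages in the file */
    size_t writeback_index;  /* where the next cyclic pass resumes */
};

/*
 * Clean dirty pages in [first, last] until the budget is used up.
 * Returns the index of the first page that was NOT examined.
 */
static size_t scan_range(struct model_mapping *m, size_t first, size_t last,
                         long *nr_to_write)
{
    size_t i = first;

    while (i <= last && i < m->npages && *nr_to_write > 0) {
        if (m->dirty[i]) {
            m->dirty[i] = false;   /* models submitting the page for I/O */
            (*nr_to_write)--;
        }
        i++;
    }
    return i;
}

/*
 * Two-pass cyclic scan: first [writeback_index, end], then, if budget is
 * left and we did not start at page 0, wrap around to [0, writeback_index - 1].
 * Returns the number of pages written.
 */
long cyclic_writeback(struct model_mapping *m, long nr_to_write)
{
    assert(m != NULL && nr_to_write >= 0);

    long budget = nr_to_write;
    size_t start = m->writeback_index;
    bool cycled = (start == 0);    /* starting at 0 already covers everything */
    size_t next;

    next = scan_range(m, start, m->npages - 1, &budget);

    if (!cycled && budget > 0)     /* corresponds to the "goto retry" path */
        next = scan_range(m, 0, start - 1, &budget);

    m->writeback_index = next;     /* corresponds to "Update index" */
    return nr_to_write - budget;
}
```

For an 8-page file with pages 0, 1, 5, 6, 7 dirty, `writeback_index = 5`, and a budget of 4, the first pass cleans pages 5 to 7, the wrap-around pass cleans page 0 and stops because the budget is exhausted, and `writeback_index` ends at 1 so the next call resumes from there.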