PostgreSQL Source Code Study (60): What VFDs Do and How They Work

VFD stands for Virtual File Descriptor. A "virtual" descriptor implies a physical one underneath, so we start there.

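I. Physical File Descriptors (File Descriptor, FD)

1. What is an FD

An FD is the abstraction the operating system gives user programs for accessing files and other I/O resources, so that programs need not care about the underlying implementation.

On Linux, a file descriptor is a non-negative integer; 0, 1, and 2 are conventionally standard input, standard output, and standard error. Each process has its own file descriptor table, and each entry points to an open file or another I/O resource such as a pipe or a socket.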

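2. Why the number of FDs needs an upper limit

① To stop FDs from being opened without bound

Malicious programs: with no limit, a malicious program could open a huge number of file descriptors and exhaust system resources until the system can no longer run normally.
Program bugs: a bug such as forgetting to close descriptors leaks FDs and eventually exhausts resources as well. With a limit in place, the leak shows up early as a failed open() and can be dealt with.

② To bound resource usage

Memory: every open descriptor consumes kernel memory (an entry in the process's descriptor table, an open-file table entry, cached inode data, and so on). Without a limit this could exhaust memory and destabilize the system.
Kernel performance: the kernel maintains data structures for every open descriptor. Too many descriptors make those structures large and can hurt performance.
Process performance: each process's descriptor table lives in the kernel, and a very large table costs more to maintain, for example when fork() has to copy it.
System stability: descriptor exhaustion can leave the whole system unusable. A limit keeps one misbehaving process from taking everything down.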

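3. How the FD limits are set

On Linux, the limits apply at three levels:

System-wide limit: the kernel caps the number of open file handles for the entire system, exposed as /proc/sys/fs/file-max.
Per-user configuration: the limits applied to a user's sessions are usually configured in /etc/security/limits.conf (the nofile item) and can be inspected with ulimit -n.
Per-process limit: what the kernel actually enforces on each process is the RLIMIT_NOFILE resource limit. The soft limit is commonly 1024 by default, and the hard limit cannot be raised above /proc/sys/fs/nr_open.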
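To see the per-process limit from inside a program, here is a minimal sketch that reads RLIMIT_NOFILE with the POSIX getrlimit() call. The helper names query_nofile_limit and print_nofile_limit are my own for illustration and do not come from PostgreSQL.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * Read the calling process's soft and hard limits on open file descriptors.
 * Returns 0 on success, -1 if getrlimit() fails (errno is set by the call).
 */
int
query_nofile_limit(rlim_t *soft, rlim_t *hard)
{
	struct rlimit rl;

	assert(soft != NULL && hard != NULL);

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;

	*soft = rl.rlim_cur;		/* the limit enforced right now */
	*hard = rl.rlim_max;		/* ceiling the soft limit may be raised to */
	return 0;
}

/* Print both limits; a value equal to RLIM_INFINITY means "no limit". */
void
print_nofile_limit(void)
{
	rlim_t		soft;
	rlim_t		hard;

	if (query_nofile_limit(&soft, &hard) != 0)
	{
		perror("getrlimit");
		return;
	}
	printf("soft=%llu hard=%llu\n",
		   (unsigned long long) soft, (unsigned long long) hard);
}
```

Calling print_nofile_limit() from a small main() should report the same soft value that ulimit -n shows in the shell that launched the program.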

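II. Why VFDs are needed

Reading the reasons for the FD limit reminded me strongly of connection counts in a database. The application side often wants far more connections than the DB allows and would like the number to be effectively unlimited; on the DB side, too many connections mean too much memory and too much connection-management overhead.

Databases resolve this tension with a connection pool, and a VFD is to an FD roughly what a pooled connection is to a real one.

The operating system limits how many file descriptors a single process may hold open (commonly 1024). For a PG process that is far from enough, while repeatedly opening and closing files costs system calls and hurts performance. With the connection-pool analogy in mind, the benefits of VFDs are easy to see:

Lifting the FD limit logically: from PG's point of view, a process can have a nearly unlimited number of files open (bounded by memory, not by the OS limit).
Efficient resource management: files are not opened and closed on every use, which saves system calls.
Caching and reclamation: VFDs are managed with an LRU policy, so the least recently used kernel FD is released when a new file needs one, while frequently used files keep their kernel FDs open.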

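III. Core data structures

1. The core VFD mechanism

Each PG backend process manages its VFDs through a process-local array, VfdCache (the "pool"); each VFD plays the role of one pooled connection.
Each VFD is backed by at most one real FD at a time (the "real connection"), and may temporarily have none.
PG applies an LRU policy to the VFDs that currently hold a kernel FD, so the FD of a rarely used VFD can be released when needed.

Let's look at each data structure in turn.

2. A single VFD

The VFD struct is defined as follows (in src/backend/storage/file/fd.c):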


typedef struct vfd
{
	int			fd;				/* current FD, or VFD_CLOSED if none */
	unsigned short fdstate;		/* bitflags for VFD's state */
	ResourceOwner resowner;		/* owner, for automatic cleanup */
	File		nextFree;		/* link to next free VFD, if in freelist */
	File		lruMoreRecently;	/* doubly linked recency-of-use list */
	File		lruLessRecently;
	off_t		fileSize;		/* current size of file (0 if not temporary) */
	char	   *fileName;		/* name of file, or NULL for unused VFD */
	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
	int			fileFlags;		/* open(2) flags for (re)opening the file */
	mode_t		fileMode;		/* mode to pass to open(2) */
} Vfd;

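Meaning of the main fields

fd: the real FD currently backing this VFD. VFD_CLOSED (-1) means no kernel FD is open for it right now, either because the slot is unused or because the FD was closed to make room and will be reopened on demand.
fdstate: bit flags for the VFD's state (FD_DELETE_AT_CLOSE, FD_CLOSE_AT_EOXACT, FD_TEMP_FILE_LIMIT).
resowner: the resource owner of this VFD, used for automatic cleanup.
nextFree: index of the next free VFD, used to chain the free list.
lruMoreRecently and lruLessRecently: the two links of the doubly linked LRU list, pointing to the more recently and the less recently used neighbor.
fileName, fileFlags, fileMode: what is needed to reopen the file with open(2) after its kernel FD has been closed.

These link fields show that the VFD array is organized into two parts: a free list and an LRU pool.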


typedef struct vfd
{
...
	File		nextFree;		/* link to next free VFD, if in freelist */
	File		lruMoreRecently;	/* doubly linked recency-of-use list */
	File		lruLessRecently;
...
} Vfd;

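The free list is logically a singly linked list. Every unused VFD is chained into it through nextFree, and a VFD that has been released is pushed back onto it.
The LRU pool is logically a doubly linked list (a ring that passes through the header, VfdCache[0]). One end holds the most recently used VFD and the other end the least recently used one; each VFD is linked in through lruMoreRecently and lruLessRecently. All links are array indexes of type File, not pointers.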
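To make the "list threaded through an array" idea concrete, here is a small sketch of a free list that uses indexes instead of pointers, with slot 0 as the list header and index 0 as the terminator. The names (slots, freelist_init, freelist_pop, freelist_push) and the fixed capacity are my own simplifications for illustration, not PostgreSQL's code.

```c
#include <assert.h>

#define NSLOTS 8				/* hypothetical fixed capacity; slot 0 is the header */

typedef int Slot;				/* an index into slots[], like PostgreSQL's File */

static struct
{
	Slot		nextFree;		/* next free slot; 0 terminates the list */
	int			in_use;			/* 1 while the slot is handed out */
}			slots[NSLOTS];

/* Chain slots 1..NSLOTS-1 into the free list headed by slots[0]. */
void
freelist_init(void)
{
	for (Slot i = 0; i < NSLOTS; i++)
	{
		slots[i].nextFree = (i + 1 < NSLOTS) ? i + 1 : 0;
		slots[i].in_use = 0;
	}
}

/* Pop the first free slot; returns 0 when the list is empty. */
Slot
freelist_pop(void)
{
	Slot		s = slots[0].nextFree;

	if (s != 0)
	{
		slots[0].nextFree = slots[s].nextFree;
		slots[s].in_use = 1;
	}
	return s;
}

/* Push a slot back at the head of the free list. */
void
freelist_push(Slot s)
{
	assert(s > 0 && s < NSLOTS && slots[s].in_use);

	slots[s].in_use = 0;
	slots[s].nextFree = slots[0].nextFree;
	slots[0].nextFree = s;
}
```

Because the links are indexes, the whole array can be moved by realloc without invalidating them, which is presumably why fd.c uses indexes rather than pointers.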

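3. VfdCache

As said above, VfdCache is like the connection pool and each VFD is like one connection in it, so VfdCache is simply an array of Vfd entries. SizeVfdCache is the current length of that array.

The nfile variable counts how many kernel FDs are currently held open by VFD entries. This is what lets the VFD layer notice that it is about to exceed its FD budget and close the least recently used FD first.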


/*
 * Virtual File Descriptor array pointer and size.  This grows as
 * needed.  'File' values are indexes into this array.
 * Note that VfdCache[0] is not a usable VFD, just a list header.
 */
static Vfd *VfdCache;
static Size SizeVfdCache = 0;

/*
 * Number of file descriptors known to be in use by VFD entries.
 */
static int	nfile = 0;

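IV. The LRU mechanism in VFD

The LRU pool is the doubly linked ring described above: VfdCache[0] is the header, its lruLessRecently link leads to the most recently used VFD, and its lruMoreRecently link wraps around to the least recently used one.

1. Initialization

① At postmaster startup

set_max_safe_fds computes max_safe_fds, the number of kernel FDs that fd.c allows itself to hold open. It is the smaller of two values, the number of FDs the process can actually open (probed by count_usable_fds) and max_files_per_process minus the FDs already open, with NUM_RESERVED_FDS then subtracted for files opened outside fd.c.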


/*
 * set_max_safe_fds
 *		Determine number of file descriptors that fd.c is allowed to use
 */
void
set_max_safe_fds(void)
{
	int			usable_fds;
	int			already_open;

	/*----------
	 * We want to set max_safe_fds to
	 *			MIN(usable_fds, max_files_per_process - already_open)
	 * less the slop factor for files that are opened without consulting
	 * fd.c.  This ensures that we won't exceed either max_files_per_process
	 * or the experimentally-determined EMFILE limit.
	 *----------
	 */
	count_usable_fds(max_files_per_process,
					 &usable_fds, &already_open);

	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);

	/*
	 * Take off the FDs reserved for system() etc.
	 */
	max_safe_fds -= NUM_RESERVED_FDS;

	/*
	 * Make sure we still have enough to get by.
	 */
	if (max_safe_fds < FD_MINFREE)
		ereport(FATAL,
				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
				 errmsg("insufficient file descriptors available to start server process"),
				 errdetail("System allows %d, we need at least %d.",
						   max_safe_fds + NUM_RESERVED_FDS,
						   FD_MINFREE + NUM_RESERVED_FDS)));

	elog(DEBUG2, "max_safe_fds = %d, usable_fds = %d, already_open = %d",
		 max_safe_fds, usable_fds, already_open);
}
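The arithmetic in set_max_safe_fds is easy to check by hand. The sketch below repeats only the calculation, with the inputs passed in as plain integers. The function name sketch_max_safe_fds is hypothetical, and the reserved count of 10 is my assumption about NUM_RESERVED_FDS; check fd.c in your PostgreSQL version for the real constant.

```c
#include <assert.h>

/* Assumed value for illustration; the real constant is defined in fd.c. */
#define SKETCH_NUM_RESERVED_FDS 10

static int
min_int(int a, int b)
{
	return (a < b) ? a : b;
}

/*
 * Same formula as set_max_safe_fds:
 *   MIN(usable_fds, max_files_per_process - already_open) - reserved
 */
int
sketch_max_safe_fds(int usable_fds, int max_files_per_process, int already_open)
{
	int			budget;

	assert(usable_fds >= 0 && already_open >= 0);

	budget = min_int(usable_fds, max_files_per_process - already_open);
	return budget - SKETCH_NUM_RESERVED_FDS;
}
```

For example, with max_files_per_process = 1000, 5 descriptors already open and 1000 usable, the budget is min(1000, 995) - 10 = 985. If the OS limit is the tighter bound, say only 250 usable descriptors, the result is min(250, 995) - 10 = 240.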

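② At backend startup

InitFileAccess runs once in every backend at startup. It allocates VfdCache with a single element, zeroes it, and sets its fd to VFD_CLOSED. SizeVfdCache = 1 records that the array currently has exactly one entry; PG uses this value to track how many VFD slots have been allocated.

At this point VfdCache contains no usable VFD. Its first element, VfdCache[0], serves as the header of both the free list and the LRU ring and never stores a real VFD.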


/*
 * InitFileAccess --- initialize this module during backend startup
 *
 * This is called during either normal or standalone backend start.
 * It is *not* called in the postmaster.
 */
void
InitFileAccess(void)
{
	Assert(SizeVfdCache == 0);	/* call me only once */

	/* initialize cache header entry */
	VfdCache = (Vfd *) malloc(sizeof(Vfd));
	if (VfdCache == NULL)
		ereport(FATAL,
				(errcode(ERRCODE_OUT_OF_MEMORY),
				 errmsg("out of memory")));

	MemSet((char *) &(VfdCache[0]), 0, sizeof(Vfd));
	VfdCache->fd = VFD_CLOSED;

	SizeVfdCache = 1;

	/* register proc-exit hook to ensure temp files are dropped at exit */
	on_proc_exit(AtProcExit_Files, 0);
}


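2. Overview of how a process opens and closes a file

① When a process opens a file, it first gets a VFD slot by calling AllocateVfd, which looks for a free entry through VfdCache[0].nextFree.

If a free entry exists, it is taken off the free list, the file's metadata is recorded in it, and the open proceeds.
If the free list is empty, AllocateVfd grows the array first. The array starts with just the header (size 1); the first growth takes it to 32 entries and each later growth doubles it. The new entries are chained into the free list, and one of them is then handed out.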


static File
AllocateVfd(void)
{
	Index		i;
	File		file;

	DO_DB(elog(LOG, "AllocateVfd. Size %zu", SizeVfdCache));

	Assert(SizeVfdCache > 0);	/* InitFileAccess not called? */

	if (VfdCache[0].nextFree == 0)
	{
		/*
		 * The free list is empty so it is time to increase the size of the
		 * array.  We choose to double it each time this happens. However,
		 * there's not much point in starting *real* small.
		 */
		Size		newCacheSize = SizeVfdCache * 2;
		Vfd		   *newVfdCache;

		if (newCacheSize < 32)
			newCacheSize = 32;

		/*
		 * Be careful not to clobber VfdCache ptr if realloc fails.
		 */
		newVfdCache = (Vfd *) realloc(VfdCache, sizeof(Vfd) * newCacheSize);
		if (newVfdCache == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		VfdCache = newVfdCache;

		/*
		 * Initialize the new entries and link them into the free list.
		 */
		for (i = SizeVfdCache; i < newCacheSize; i++)
		{
			MemSet((char *) &(VfdCache[i]), 0, sizeof(Vfd));
			VfdCache[i].nextFree = i + 1;
			VfdCache[i].fd = VFD_CLOSED;
		}
		VfdCache[newCacheSize - 1].nextFree = 0;
		VfdCache[0].nextFree = SizeVfdCache;

		/*
		 * Record the new size
		 */
		SizeVfdCache = newCacheSize;
	}

	file = VfdCache[0].nextFree;

	VfdCache[0].nextFree = VfdCache[file].nextFree;

	return file;
}
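The growth rule in AllocateVfd can be traced with a few lines of code. This is only a sketch of the size calculation; next_vfd_cache_size and growths_needed are names I made up, and the real function also reallocs the array and relinks the free list as shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Size the array grows to when the free list is empty: double, but at least 32. */
size_t
next_vfd_cache_size(size_t current)
{
	size_t		next = current * 2;

	assert(current > 0);		/* the header slot always exists */

	if (next < 32)
		next = 32;
	return next;
}

/*
 * Number of growth steps needed, starting from the header-only array of
 * size 1, before nfiles virtual files can be held at once.  Slot 0 is the
 * list header, so an array of size N offers N - 1 usable slots.
 */
int
growths_needed(size_t nfiles)
{
	size_t		size = 1;
	int			steps = 0;

	while (size - 1 < nfiles)
	{
		size = next_vfd_cache_size(size);
		steps++;
	}
	return steps;
}
```

Starting from size 1 the sequence is 32, 64, 128, and so on. A backend that never has more than 31 virtual files open at once grows the array exactly once; the 32nd simultaneous file triggers the second growth.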

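② Once the process has a VFD, it checks whether the LRU pool is full, that is, whether the number of kernel FDs in use has reached the max_safe_fds budget.

If there is still room, PG opens the physical file for this VFD and links it into the LRU ring with Insert. Insert does just one thing: it places the VFD at the most recently used end of the ring. LruInsert is the related path for a VFD whose kernel FD was closed earlier: it reopens the file from the saved fileName, fileFlags, and fileMode, and then also calls Insert.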


static void
Insert(File file)
{
	Vfd		   *vfdP;

	Assert(file != 0);

	DO_DB(elog(LOG, "Insert %d (%s)",
			   file, VfdCache[file].fileName));
	DO_DB(_dump_lru());

	vfdP = &VfdCache[file];

	vfdP->lruMoreRecently = 0;
	vfdP->lruLessRecently = VfdCache[0].lruLessRecently;
	VfdCache[0].lruLessRecently = file;
	VfdCache[vfdP->lruLessRecently].lruMoreRecently = file;

	DO_DB(_dump_lru());
}

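If the pool is full, ReleaseLruFile is called first. It calls LruDelete on the least recently used entry, VfdCache[0].lruMoreRecently, which closes that entry's kernel FD and unlinks it from the ring. The victim VFD itself stays allocated, with its file name and open flags kept so that it can be reopened on its next access. The new VFD then opens its physical file with the FD that was freed up.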


/*
 * Release one kernel FD by closing the least-recently-used VFD.
 */
static bool
ReleaseLruFile(void)
{
	DO_DB(elog(LOG, "ReleaseLruFile. Opened %d", nfile));

	if (nfile > 0)
	{
		/*
		 * There are opened files and so there should be at least one used vfd
		 * in the ring.
		 */
		Assert(VfdCache[0].lruMoreRecently != 0);
		LruDelete(VfdCache[0].lruMoreRecently);
		return true;			/* freed a file */
	}
	return false;				/* no files available to free */
}

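③ Removing a VFD from the LRU pool and closing the file

LruDelete (which in turn calls Delete) closes the kernel FD behind the given VFD, sets its fd to VFD_CLOSED, decrements nfile, and unlinks the VFD from the LRU ring. It does not put the VFD back on the free list, because the virtual file is still open from the caller's point of view. When the caller is really finished with the file, FileClose performs the same close-and-unlink steps and then calls FreeVfd, which frees fileName and returns the slot to the free list.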


static void
LruDelete(File file)
{
	Vfd		   *vfdP;

	Assert(file != 0);

	DO_DB(elog(LOG, "LruDelete %d (%s)",
			   file, VfdCache[file].fileName));

	vfdP = &VfdCache[file];

	/*
	 * Close the file.  We aren't expecting this to fail; if it does, better
	 * to leak the FD than to mess up our internal state.
	 */
	if (close(vfdP->fd) != 0)
		elog(vfdP->fdstate & FD_TEMP_FILE_LIMIT ? LOG : data_sync_elevel(LOG),
			 "could not close file \"%s\": %m", vfdP->fileName);
	vfdP->fd = VFD_CLOSED;
	--nfile;

	/* delete the vfd record from the LRU ring */
	Delete(file);
}


static void
Delete(File file)
{
	Vfd		   *vfdP;

	Assert(file != 0);

	DO_DB(elog(LOG, "Delete %d (%s)",
			   file, VfdCache[file].fileName));
	DO_DB(_dump_lru());

	vfdP = &VfdCache[file];

	VfdCache[vfdP->lruLessRecently].lruMoreRecently = vfdP->lruMoreRecently;
	VfdCache[vfdP->lruMoreRecently].lruLessRecently = vfdP->lruLessRecently;

	DO_DB(_dump_lru());
}
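Insert, Delete, and the choice of eviction victim are easier to follow on a toy ring. The sketch below mirrors the index manipulation of Insert and Delete on a small fixed array and adds a ring_touch helper for the move-to-front step on access. All names here (ring, ring_insert, ring_delete, ring_victim, ring_touch) are hypothetical, and no real files are opened.

```c
#include <assert.h>

#define RING_SLOTS 8			/* hypothetical capacity; slot 0 is the header */

typedef int File;				/* an index into ring[], as in fd.c */

static struct
{
	File		more;			/* plays the role of lruMoreRecently */
	File		less;			/* plays the role of lruLessRecently */
}			ring[RING_SLOTS];

/* Empty ring: the header links to itself (index 0) in both directions. */
void
ring_reset(void)
{
	for (File i = 0; i < RING_SLOTS; i++)
		ring[i].more = ring[i].less = 0;
}

/* Link file in as the most recently used entry (same steps as Insert). */
void
ring_insert(File file)
{
	assert(file > 0 && file < RING_SLOTS);

	ring[file].more = 0;
	ring[file].less = ring[0].less;
	ring[0].less = file;
	ring[ring[file].less].more = file;
}

/* Unlink file from the ring (same steps as Delete). */
void
ring_delete(File file)
{
	assert(file > 0 && file < RING_SLOTS);

	ring[ring[file].less].more = ring[file].more;
	ring[ring[file].more].less = ring[file].less;
}

/* Least recently used entry, i.e. the eviction victim; 0 if the ring is empty. */
File
ring_victim(void)
{
	return ring[0].more;
}

/* On access, move an entry that is already in the ring to the MRU end. */
void
ring_touch(File file)
{
	ring_delete(file);
	ring_insert(file);
}
```

After inserting 1, 2, 3 in that order the victim is 1. Touching 1 moves it to the most recently used end, so the victim becomes 2. That is exactly the entry ReleaseLruFile would pick through VfdCache[0].lruMoreRecently.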
