PostgreSQL Source Code Analysis: bgwriter

Why bgwriter exists

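The bgwriter process is responsible for writing dirty pages from the shared buffer pool to disk. It exists purely for performance: without it, the database would still work correctly. So the point to understand here is how bgwriter affects performance.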

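As covered earlier, an INSERT first finds a free page in the buffer pool and inserts the tuple into that page without writing the page to disk. It then builds a WAL record, the WAL is flushed to disk, and the dirty data page itself is written later by a background process (bgwriter). This design is about performance: writing the page to disk after every modification would be expensive. For example, 100 consecutive inserts into the same page would otherwise write that page 100 times; with WAL and bgwriter, this becomes 100 WAL records plus a single write of the dirty page. Reducing how often data pages are written is one of the goals behind both WAL and bgwriter.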

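Second, once we have WAL, could we simply never write dirty pages? Certainly not. Even without bgwriter, pages get evicted from the buffer pool: the pool has a fixed size, so when it is full and another data page must be read from disk, some buffer has to be evicted. PostgreSQL picks the victim with a clock-sweep algorithm, and if the chosen page is dirty it must be written to disk first, which makes that query or update take longer. Writing dirty pages periodically in the background makes it less likely that a foreground query has to write a page during eviction, and so avoids that slowdown.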

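Third, periodic background writing helps keep performance steady: it reduces performance jitter by spreading I/O out as smoothly as possible. bgwriter is not the only process designed with this in mind; autovacuum, for example, follows similar considerations.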

Parameters

bgwriter is controlled by the following settings (shown with their defaults, as they appear in postgresql.conf):

```
#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100            # max buffers written/round, 0 disables
#bgwriter_lru_multiplier = 2.0          # 0-10.0 multiplier on buffers scanned/round
```

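Every bgwriter_delay, bgwriter wakes up and scans the buffer pool. It estimates the number of buffers backends will need in the next round as bgwriter_lru_multiplier × N, where N is a smoothed average of the number of new buffers allocated per round recently, and it writes dirty buffers until that many clean, reusable buffers are available. In any case it writes at most bgwriter_lru_maxpages buffers per round; setting bgwriter_lru_maxpages to 0 disables background writing.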

Source code analysis

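The core flow of the bgwriter process is straightforward: write out some buffers, sleep for a while, repeat. The main loop (BackgroundWriterMain, abridged) is: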

```c
/*
 * Main entry point for bgwriter process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void
BackgroundWriterMain(void)
{
	// Register signal handlers
	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
	// ...

	// Error handling (elided)

	/* Loop forever */
	for (;;)
	{
		bool		can_hibernate;
		int			rc;

		/* Clear any already-pending wakeups */
		ResetLatch(MyLatch);

		HandleMainLoopInterrupts();

		/* Do one cycle of dirty-buffer writing. */
		can_hibernate = BgBufferSync(&wb_context);

		// ...

		/* Sleep until we are signaled or BgWriterDelay has elapsed. */
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

		// ...
	}
}
```

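The real work happens in BgBufferSync, so let's look at that function. First consider its two key design questions. First, the buffer pool holds many pages, so where should each round start scanning? Second, how many pages should be written per round? For the first question, recall why bgwriter exists: to spare buffer eviction from having to write dirty pages. To achieve that, bgwriter should scan ahead of the clock-sweep hand, cleaning buffers before the sweep reaches them. The two LRU parameters above (bgwriter_lru_maxpages and bgwriter_lru_multiplier) address the second question. The function below handles both: it works out where to start scanning and how many pages are best written per round. We won't analyze the detailed algorithm here; see Section 4.12.2 of 《PostgreSQL技术内幕:事务处理探索》.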

```c
/* BgBufferSync -- Write out some dirty buffers in the pool. */
bool
BgBufferSync(WritebackContext *wb_context)
{
	/* info obtained from freelist.c */
	int			strategy_buf_id;
	uint32		strategy_passes;
	uint32		recent_alloc;

	// ... (remaining local state elided)

	/*
	 * Find out where the freelist clock sweep currently is, and how many
	 * buffer allocations have happened since our last call.
	 */
	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);

	// ... (elided: choose where to start and estimate upcoming allocations)

	/* Execute the LRU scan */
	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
	{
		int			sync_state = SyncOneBuffer(next_to_clean, true,
											   wb_context);

		if (++next_to_clean >= NBuffers)
		{
			next_to_clean = 0;
			next_passes++;
		}
		num_to_scan--;

		if (sync_state & BUF_WRITTEN)
		{
			reusable_buffers++;
			if (++num_written >= bgwriter_lru_maxpages)
			{
				BgWriterStats.m_maxwritten_clean++;
				break;
			}
		}
		else if (sync_state & BUF_REUSABLE)
			reusable_buffers++;
	}

	// ...
}
```

Finally, the function that actually writes a buffer: SyncOneBuffer tries to write a single shared buffer out to its data file.

```c
/* SyncOneBuffer -- process a single buffer during syncing. */
static int
SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
	int			result = 0;
	uint32		buf_state;
	BufferTag	tag;

	ReservePrivateRefCountEntry();

	/*
	 * Check whether buffer needs writing.
	 *
	 * We can make this check without taking the buffer content lock so long
	 * as we mark pages dirty in access methods *before* logging changes with
	 * XLogInsert(): if someone marks the buffer dirty just after our check we
	 * don't worry because our checkpoint.redo points before log record for
	 * upcoming changes and so we are not required to write such dirty buffer.
	 */
	buf_state = LockBufHdr(bufHdr);

	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
	{
		result |= BUF_REUSABLE;
	}
	else if (skip_recently_used)
	{
		/* Caller told us not to write recently-used buffers */
		UnlockBufHdr(bufHdr, buf_state);
		return result;
	}

	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
	{
		/* It's clean, so nothing to do */
		UnlockBufHdr(bufHdr, buf_state);
		return result;
	}

	/*
	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
	 * buffer is clean by the time we've locked it.)
	 */
	PinBuffer_Locked(bufHdr);
	LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);

	FlushBuffer(bufHdr, NULL);

	LWLockRelease(BufferDescriptorGetContentLock(bufHdr));

	tag = bufHdr->tag;

	UnpinBuffer(bufHdr, true);

	ScheduleBufferTagForWriteback(wb_context, &tag);

	return result | BUF_WRITTEN;
}
```

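Writing a buffer to disk roughly takes the following steps (see FlushBuffer for details; the excerpt below is abridged):

  1. Use the buffer's descriptor (its tag) to open the relation file (smgropen).
  2. Force WAL to disk up to the page's LSN (XLogFlush), so that the WAL describing a change reaches disk before the data page does.
  3. If data checksums are enabled, copy the page to private memory and set the checksum on the copy.
  4. Write the page into the opened relation file (smgrwrite).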
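```c
static void
FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
	XLogRecPtr	recptr;
	Block		bufBlock;
	char	   *bufToWrite;
	uint32		buf_state;

	// ... (StartBufferIO and error-context setup elided)

	/* Find smgr relation for buffer */
	if (reln == NULL)
		reln = smgropen(buf->tag.rnode, InvalidBackendId);

	/* Read the page LSN while holding the buffer header lock */
	buf_state = LockBufHdr(buf);
	recptr = BufferGetLSN(buf);
	buf_state &= ~BM_JUST_DIRTIED;
	UnlockBufHdr(buf, buf_state);

	/*
	 * Force WAL to disk up to the page's LSN: log records must reach disk
	 * before the data-file changes they describe.
	 */
	if (buf_state & BM_PERMANENT)
		XLogFlush(recptr);

	bufBlock = BufHdrGetBlock(buf);

	/*
	 * Update page checksum if desired.  Since we have only shared lock on the
	 * buffer, other processes might be updating hint bits in it, so we must
	 * copy the page to private storage if we do checksumming.
	 */
	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

	/* bufToWrite is either the shared buffer or a copy, as appropriate. */
	smgrwrite(reln,
			  buf->tag.forkNum,
			  buf->tag.blockNum,
			  bufToWrite,
			  false);

	// ... (TerminateBufferIO etc. elided)
}
```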