PostgreSQL Source Code Analysis: bgwriter

Why bgwriter exists

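The bgwriter process is mainly responsible for flushing dirty pages in the shared buffers to disk. It was added for performance reasons; the database works correctly without it. So the key to understanding bgwriter is understanding its effect on performance.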

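As covered earlier, an INSERT first finds a page with free space in the buffer pool and inserts the tuple into that page. The page is not flushed right away. Instead, a WAL record is built and the WAL is flushed to disk, and the page itself is written out later by a background process (bgwriter). This design exists for performance: flushing data pages after every write would be slow. Take 100 consecutive inserts that all land on the same page. Flushing per insert writes that page 100 times, whereas with WAL plus bgwriter the work becomes 100 WAL writes followed by a single flush of the dirty page. Reducing how often data pages are flushed is one of the goals of both WAL and bgwriter.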
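
The arithmetic in the 100-insert example can be checked with a toy model. This is only a sketch and not PostgreSQL code: ToyPage and the toy_* functions are made-up names, and a "write" here is just a counter increment.

```c
#include <stdbool.h>

/* Toy model, not PostgreSQL code: one buffered page plus write counters. */
typedef struct ToyPage
{
    bool dirty;        /* modified in memory but not yet on disk */
    int  page_writes;  /* how many times the page itself was written */
    int  wal_writes;   /* how many WAL records were written */
} ToyPage;

/* Write-through: every insert immediately rewrites the whole page. */
static void
toy_insert_write_through(ToyPage *page)
{
    page->page_writes++;
}

/* WAL-first: log the change, mark the page dirty, leave the flush for later. */
static void
toy_insert_wal_first(ToyPage *page)
{
    page->wal_writes++;
    page->dirty = true;
}

/* Deferred flush, standing in for what a background writer would do. */
static void
toy_background_flush(ToyPage *page)
{
    if (page->dirty)
    {
        page->page_writes++;
        page->dirty = false;
    }
}
```

Calling toy_insert_write_through 100 times leaves page_writes at 100, while 100 calls to toy_insert_wal_first followed by one toy_background_flush leave wal_writes at 100 and page_writes at 1.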

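Second, once WAL is in place, could we simply never flush dirty pages? Definitely not. Even if bgwriter never wrote anything, the buffer pool would still have to evict pages. Its size is limited, so when it is full and another page must be read from disk, some resident page has to be evicted. The current replacement algorithm is the clock sweep. If the chosen victim is dirty, it must be flushed first, which makes the query or update that triggered the eviction take longer. Flushing dirty pages periodically in the background avoids these eviction-time flushes in the query path and the performance loss that comes with them.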
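
To make the eviction cost concrete, here is a minimal sketch of a clock sweep over a four-buffer pool. It is a simplified illustration, not the PostgreSQL implementation: ToyBuffer, toy_clock_sweep and TOY_NBUFFERS are hypothetical names, and the lap limit assumes usage counts never exceed 5.

```c
#include <stdbool.h>

#define TOY_NBUFFERS 4

typedef struct ToyBuffer
{
    int  usage_count;  /* decremented each time the hand passes */
    bool pinned;       /* in use right now, so it cannot be evicted */
    bool dirty;        /* must be written before the slot can be reused */
} ToyBuffer;

/*
 * Advance the clock hand until an unpinned buffer with usage_count 0 is
 * found. Returns its index and reports through *needs_flush whether the
 * caller would have to write it out first. Returns -1 if no victim was
 * found within the lap limit (for example, every buffer is pinned).
 */
static int
toy_clock_sweep(ToyBuffer *pool, int *hand, bool *needs_flush)
{
    int tries = TOY_NBUFFERS * 6;  /* assumes usage_count <= 5 */

    while (tries-- > 0)
    {
        int        victim = *hand;
        ToyBuffer *buf = &pool[victim];

        *hand = (*hand + 1) % TOY_NBUFFERS;

        if (buf->pinned)
            continue;
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
            continue;
        }
        *needs_flush = buf->dirty;
        return victim;
    }
    return -1;
}
```

When needs_flush comes back true, the backend that asked for a buffer pays for the write itself. A background writer that has already cleaned that buffer turns the same call into a cheap one.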

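Third, periodic flushing by bgwriter keeps performance steady. It reduces jitter to some degree by smoothing I/O out over time. bgwriter is not the only process designed with this in mind; autovacuum, for example, follows similar thinking.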

Parameters

bgwriter is controlled by the following parameters:

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100            # max buffers written/round, 0 disables
#bgwriter_lru_multiplier = 2.0          # 0-10.0 multiplier on buffers scanned/round

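bgwriter is a long-running process that wakes up every bgwriter_delay and scans the buffer pool. In each round it takes N, the average number of buffers newly allocated during recent rounds, and multiplies it by bgwriter_lru_multiplier to estimate how many buffers the next round will need. It then writes out dirty buffers until that many clean, reusable buffers are available, but never more than bgwriter_lru_maxpages per round.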
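
The relationship between the three numbers can be written down as a small sketch. The helpers below are hypothetical and are not functions in the PostgreSQL source; they only restate the rule from the paragraph above.

```c
/*
 * Hypothetical helper: how many clean, reusable buffers one round should
 * aim for, given the recent average allocation count.
 */
static int
toy_round_target(double avg_recent_alloc, double lru_multiplier)
{
    return (int) (avg_recent_alloc * lru_multiplier);
}

/*
 * Hypothetical helper: upper bound on the dirty buffers a round may write.
 * Buffers that are already clean and reusable count toward the target, and
 * lru_maxpages caps the result (0 disables writing).
 */
static int
toy_writes_allowed(int target, int already_reusable, int lru_maxpages)
{
    int needed = target - already_reusable;

    if (needed < 0)
        needed = 0;
    return needed < lru_maxpages ? needed : lru_maxpages;
}
```

With the default settings and a recent average of 60 allocations per round, the target is 120 buffers. If none are reusable yet, the round is still capped at 100 writes by bgwriter_lru_maxpages.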

Source code analysis

The core flow of the bgwriter process is straightforward: flush buffers at intervals. The main loop looks like this:

/*
 * Main entry point for bgwriter process
 *
 * This is invoked from AuxiliaryProcessMain, which has already created the
 * basic execution environment, but not enabled signals yet.
 */
void BackgroundWriterMain(void)
{
	// Register signal handlers
	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
	// ...

	// Error handling (omitted)

	/* Loop forever */
	for (;;)
	{
		bool		can_hibernate;
		int			rc;

		/* Clear any already-pending wakeups */
		ResetLatch(MyLatch);

		HandleMainLoopInterrupts();

		/* Do one cycle of dirty-buffer writing. */
		can_hibernate = BgBufferSync(&wb_context);
    
        // ...

		/* Sleep until we are signaled or BgWriterDelay has elapsed. */
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

        // ...
    }
}

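The real work is done in BgBufferSync, so let us look at that function. First consider the key design questions. One: with so many pages in the buffer pool, where should the scan start? Two: how many pages should each round write? For the starting point, recall why bgwriter exists: part of its job is to keep the buffer replacement path from having to flush dirty victims. So bgwriter should scan ahead of the clock sweep hand, cleaning the buffers that the sweep will reach next; otherwise it has no effect. As for how many pages to write per round, that is what bgwriter_lru_maxpages and bgwriter_lru_multiplier described above control. The function below deals with both questions: working out how many buffers to write and where to begin scanning. We do not analyze the detailed algorithm here; see section 4.12.2 of 《PostgreSQL技术内幕:事务处理探索》.
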
/* BgBufferSync -- Write out some dirty buffers in the pool. */
bool BgBufferSync(WritebackContext *wb_context)
{
	/* info obtained from freelist.c */
	int			strategy_buf_id;
	uint32		strategy_passes;
	uint32		recent_alloc;

    // ...
	/*
	 * Find out where the freelist clock sweep currently is, and how many
	 * buffer allocations have happened since our last call.
	 */
	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);

	/* Execute the LRU scan, writing out dirty buffers as we go */
	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
	{
		int			sync_state = SyncOneBuffer(next_to_clean, true,
											   wb_context);

		if (++next_to_clean >= NBuffers)
		{
			next_to_clean = 0;
			next_passes++;
		}
		num_to_scan--;

		if (sync_state & BUF_WRITTEN)
		{
			reusable_buffers++;
			if (++num_written >= bgwriter_lru_maxpages)
			{
				BgWriterStats.m_maxwritten_clean++;
				break;
			}
		}
		else if (sync_state & BUF_REUSABLE)
			reusable_buffers++;
	}

    // ...
}
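
The excerpt skips the part that decides where next_to_clean starts relative to the clock sweep. The sketch below shows one way such a decision could look, assuming the cleaner tracks its own buffer index and lap count (next_to_clean and next_passes in the excerpt) and compares them with the sweep position (strategy_buf_id and strategy_passes). ToyScanPos and toy_bufs_to_lap are hypothetical names, and wraparound of the lap counter is ignored for simplicity.

```c
#include <stdint.h>

typedef struct ToyScanPos
{
    int      buf_id;  /* index into the buffer pool */
    uint32_t passes;  /* completed laps around the pool */
} ToyScanPos;

/*
 * Given the clock sweep position and the cleaner's own position, return how
 * many buffers the cleaner may scan before it would lap the sweep. If the
 * cleaner has fallen behind, move it up to the sweep position first.
 */
static int
toy_bufs_to_lap(ToyScanPos sweep, ToyScanPos *cleaner, int nbuffers)
{
    if (cleaner->passes > sweep.passes)
    {
        /* One lap ahead by count: only the gap up to the sweep is left. */
        return sweep.buf_id - cleaner->buf_id;
    }
    if (cleaner->passes == sweep.passes && cleaner->buf_id >= sweep.buf_id)
    {
        /* Same lap, at or ahead of the sweep. */
        return nbuffers - (cleaner->buf_id - sweep.buf_id);
    }
    /* Behind the sweep: skip forward and start cleaning from there. */
    *cleaner = sweep;
    return nbuffers;
}
```

The point of the third branch is that buffers behind the sweep hand were just examined by it, so cleaning them does little to help the next allocations. Starting at the hand keeps the cleaner working on buffers the sweep is about to reach.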

Finally, the flush itself. SyncOneBuffer tries to write a single buffer page out to its file on disk.

/* SyncOneBuffer -- process a single buffer during syncing. */
static int SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
{
	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
	int			result = 0;
	uint32		buf_state;
	BufferTag	tag;

	ReservePrivateRefCountEntry();

	/*
	 * Check whether buffer needs writing.
	 *
	 * We can make this check without taking the buffer content lock so long
	 * as we mark pages dirty in access methods *before* logging changes with
	 * XLogInsert(): if someone marks the buffer dirty just after our check we
	 * don't worry because our checkpoint.redo points before log record for
	 * upcoming changes and so we are not required to write such dirty buffer.
	 */
	buf_state = LockBufHdr(bufHdr);

	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
	{
		result |= BUF_REUSABLE;
	}
	else if (skip_recently_used)
	{
		/* Caller told us not to write recently-used buffers */
		UnlockBufHdr(bufHdr, buf_state);
		return result;
	}

	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
	{
		/* It's clean, so nothing to do */
		UnlockBufHdr(bufHdr, buf_state);
		return result;
	}

	/*
	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
	 * buffer is clean by the time we've locked it.)
	 */
	PinBuffer_Locked(bufHdr);
	LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);

	FlushBuffer(bufHdr, NULL);

	LWLockRelease(BufferDescriptorGetContentLock(bufHdr));

	tag = bufHdr->tag;

	UnpinBuffer(bufHdr, true);

	ScheduleBufferTagForWriteback(wb_context, &tag);

	return result | BUF_WRITTEN;
}
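
The branch structure of SyncOneBuffer is easier to see with locking and I/O stripped away. The following is a toy restatement only: ToyBufState and toy_sync_one_buffer are hypothetical names, the flag values are chosen just for this sketch, and clearing the dirty flag stands in for the call to FlushBuffer.

```c
#include <stdbool.h>

#define TOY_BUF_WRITTEN   0x01
#define TOY_BUF_REUSABLE  0x02

typedef struct ToyBufState
{
    int  refcount;    /* number of pins currently held */
    int  usagecount;  /* clock sweep usage counter */
    bool valid;       /* page contents are valid */
    bool dirty;       /* page differs from what is on disk */
} ToyBufState;

/* Same decision order as SyncOneBuffer, without locks or real I/O. */
static int
toy_sync_one_buffer(ToyBufState *buf, bool skip_recently_used)
{
    int result = 0;

    if (buf->refcount == 0 && buf->usagecount == 0)
        result |= TOY_BUF_REUSABLE;
    else if (skip_recently_used)
        return result;          /* recently used: leave it alone */

    if (!buf->valid || !buf->dirty)
        return result;          /* clean: nothing to write */

    buf->dirty = false;         /* stands in for FlushBuffer() */
    return result | TOY_BUF_WRITTEN;
}
```

Because BgBufferSync passes skip_recently_used as true, any buffer it writes is also reusable, which is why the scan loop counts both written buffers and clean reusable ones toward reusable_buffers.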

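Writing a buffer to disk is done by FlushBuffer; see that function for details. The steps are:

  1. Open the relation file identified by the buffer's descriptor
  2. Compute the checksum, copying the buffer contents to be written if necessary
  3. Write the buffer contents to the opened relation file
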
static void
FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
	Block		bufBlock;
	char	   *bufToWrite;

	// ...

	/* Find smgr relation for buffer */
	if (reln == NULL)
		reln = smgropen(buf->tag.rnode, InvalidBackendId);

	bufBlock = BufHdrGetBlock(buf);

	/*
	 * Update page checksum if desired.  Since we have only shared lock on the
	 * buffer, other processes might be updating hint bits in it, so we must
	 * copy the page to private storage if we do checksumming.
	 */
	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

	/* bufToWrite is either the shared buffer or a copy, as appropriate. */
	smgrwrite(reln,
			  buf->tag.forkNum,
			  buf->tag.blockNum,
			  bufToWrite,
			  false);

	// ...
}
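
The comment about copying the page before checksumming deserves a small illustration. Under a shared lock the page bytes can still change, so the checksum is computed on a private copy and that copy is what gets written. The sketch below is a toy version under stated assumptions: ToyPageImage, toy_checksum and toy_set_checksum_copy are hypothetical, and the checksum is a placeholder rather than PostgreSQL's algorithm.

```c
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

typedef struct ToyPageImage
{
    uint16_t checksum;             /* stored alongside the page data */
    uint8_t  data[TOY_PAGE_SIZE];
} ToyPageImage;

/* Placeholder checksum, only here so the example has something to compute. */
static uint16_t
toy_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = (sum * 31 + data[i]) & 0xFFFF;
    return (uint16_t) sum;
}

/*
 * Copy the shared page into caller-provided private storage, stamp the
 * checksum on the copy, and return the copy as the image to write.
 */
static const ToyPageImage *
toy_set_checksum_copy(const ToyPageImage *shared, ToyPageImage *private_copy)
{
    memcpy(private_copy, shared, sizeof(ToyPageImage));
    private_copy->checksum = toy_checksum(private_copy->data, TOY_PAGE_SIZE);
    return private_copy;
}
```

If the shared page changes after the copy is taken, the private image and its checksum still agree with each other, so the block that reaches disk is internally consistent.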