Postgresql CLOG文件及其从库同步解析

放眼所有关系型数据库,PostgreSQL的clog也是很特殊的日志。CLOG的存在跟PG的MVCC机制不无关系。一些事务ID、clog的基础知识本篇不会涉及,感谢兴趣的可参考clog和hintbits。本篇主要讲clog文件的构成、手工定位事务状态、clog的wal日志同步机制,以进一步理解PostgreSQL的clog。

clog segment

clog目录

为了区别普通日志,PG 10对clog和wal的目录进行了重命名[1](#1)

pg9.6 pg10
pg_clog pg_xact
pg_xlog pg_wal

别搞错了,我也有段时间被pg_xlog和pg_xact给困扰···

clog segment name

clog也是由slru管理,clog文件命名也在slru.c

c 复制代码
#define SlruFileName(ctl, path, seg) \
	snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)

%04X表示十六进制(X),宽度为4、不足前面补0(04)。

clog文件名示例如下:

shell 复制代码
[pg_xact]$ ll
-rw------- 1 postgres postgres 262144 Aug 15 16:29 03C0
-rw------- 1 postgres postgres 262144 Aug 19 23:04 03C1
...

TransactionID与clog位置的转换

clog只存储事务ID的状态,不存事务ID本身。通过TransactionID本身是可以直接定位到clog文件及文件中的位置的。在此之前我们需要了解一些基础知识。

CLOG保存的事务状态

事务的状态只有4种:

c 复制代码
typedef int XidStatus;

#define TRANSACTION_STATUS_IN_PROGRESS		0x00
#define TRANSACTION_STATUS_COMMITTED		0x01
#define TRANSACTION_STATUS_ABORTED		0x02
#define TRANSACTION_STATUS_SUB_COMMITTED	0x03

事务状态只有进行中、已提交、回滚、子事务已提交。注意事务id没有"未开始"这个状态,只要数据库中分配了事务ID,那么这个事务肯定是已经开始了。

相反,库中还没有分配的事务ID(实际是少许,参考下面的extend clog章节),在clog中对应in_progress状态。

4个事务状态实际上2个bit位即可存储。那么1个字节(8 bits)可存储4个事务状态,1个页(8k)能放8KB*4=32768个事务状态。这些都在源码中定义了:

c 复制代码
 * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
 * everywhere else in Postgres.
//clog页大小=BLCKSZ=8k(默认) 
#define CLOG_BITS_PER_XACT	2  								 //一个事务状态占2个bit
#define CLOG_XACTS_PER_BYTE 4  								 //1个字节可存放4个事务状态
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)   //1个页可存放32768个事务状态,8KB*4=32768
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)    //事务状态的bitmask位=((1<<2)-1)=3,用二进制表达为11
c 复制代码
#define SLRU_PAGES_PER_SEGMENT	32  //1个segment有32个pages

汇总:

  • 1个clog segment有32个page
  • 1个clog page有8k(一般)
  • 1个bytes有4个事务状态
  • 1个事务状态占2 bit

clog segment/page/bytes的转换

事务id对应在哪个CLOG段不太好找,它藏在注释里:

c 复制代码
 * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
 * CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
 * and CLOG segment numbering at
 * 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT
//segment number=xid/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT=xid/32768/32            //事务id对应在哪个CLOG段,xid/32768/32,需要转成16进制

事务id对应对应到page、byte等比较清晰[2](#2)

c 复制代码
#define TransactionIdToPage(xid)	((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)       //事务id对应在哪个CLOG page,xid/32768
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)       //事务id对应在上面page中的偏移量,xid%32768
#define TransactionIdToByte(xid)	(TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE) //事务id对应在page中的第几个字节,(xid%32768)/4
#define TransactionIdToBIndex(xid)	((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)		//事务id对应在上面字节中的哪个bit index(注意是bit index不是bit本身),xid%4

一般来说(8k的BLCKSZ),1个CLOG段有32个页面;1个CLOG段有32*8k字节,即CLOG文件大小固定256K;1个CLOG段可存放4*32*8k个事务状态。

shell 复制代码
[pg_xact]$ ll  # 256k的clog segment
-rw------- 1 postgres postgres 262144 Aug 15 16:29 03C0
-rw------- 1 postgres postgres 262144 Aug 19 23:04 03C1
...

clog bit位的转换

设置clog bit和获取clog bit的函数(对应TransactionIdSetStatusBitTransactionIdGetStatus)都有以下代码以获取事务id对应到clog中的哪两位bit:

c 复制代码
	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
	char	   *byteptr;
...
	byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
	curval = (*byteptr >> bshift) & CLOG_XACT_BITMASK;

bshift表示右移的位置,其中TransactionIdToBIndex=xid%4CLOG_BITS_PER_XACT=2CLOG_XACT_BITMASK=3(二进制为11)

clog bit获取的关键代码curval = (*byteptr >> bshift) & CLOG_XACT_BITMASK分两段来看:

  • *byteptr >> bshift表示指针右移0、2、4、6位
  • & CLOG_XACT_BITMASK其实就是右移后取后两位(00&11=00、01&11=01、10&11=10、11&11=11)

so,计算在一个byte中事务id状态的位置:

  • xid%4=0时,取第7、8位
  • xid%4=1时,取第5、6位
  • xid%4=2时,取第3、4位
  • xid%4=3时,取第1、2位

注意,事务id状态在byte中的bit位是反着取的,不是正着顺序取。byte、page位这些是顺序递增的取的。

手动计算事务id在clog文件中的位置

如果我们要手工通过hexdump定位CLOG中的事务,需要计算出三个元素<CLOG的段号、段中的偏移量in bytes、byte上的偏移量in bitindex >。(这里参考了《PostgreSQL数据库内核分析》的说法,不过有一些区别[3](#3)

计算之前还需要了解:

  • clog段文件号是16进制的
  • hexdump是16进制的,每行存放16个字节,即每行存放16*CLOG_XACTS_PER_BYTE=16*4=64个事务状态
  • hexdump -s xxx是以byte为单位

下面sql可以计算transactionid对应clog中的位置:

sql 复制代码
--CLOG的段号
--%4294967296代表事务id回卷,/(8192*4*32)代表一个段文件能包含的最大事务数,to_hex转为文件名的16进制,lpad左补全4位
select lpad(upper(to_hex(txid_current()%4294967296/(8192*4*32))),4,'0') as clog_segmentno;

--段中的偏移量in bytes 
--%4294967296代表事务id回卷,%(8192*32*4)取余下的事务id,/4计算成byte单位
select txid_current()%4294967296%(8192*32*4)/4 as in_clog_offset_bytes;

--byte上的偏移量in bitindex
--%4294967296代表事务id回卷,%4取byte中的bitindex
select txid_current()%4294967296%4 as in_byte_offset_bitindex;


--或者一条sql
select lpad(upper(to_hex(txid_current()%4294967296/(8192*4*32))),4,'0') as clog_segmentno,txid_current()%4294967296%(8192*32*4)/4 as in_clog_offset_bytes,txid_current()%4294967296%4 as in_byte_offset_bitindex;

实操模拟计算一个事务id在clog中的状态

sql 复制代码
begin;
 select lpad(upper(to_hex(txid_current()%4294967296/(8192*4*32))),4,'0') as clog_segmentno,txid_current()%4294967296%(8192*32*4)/4 as in_clog_offset_bytes,txid_current()%4294967296%4 as in_byte_offset_bitindex;
 clog_segmentno | in_clog_offset_bytes | in_byte_offset_bitindex 
----------------+----------------------+-------------------------
 0002           |                63196 |                       3
rollback;
checkpoint; 

rollback把事务回滚,主要是方便观察,因为大部分事务都是提交的。

checkpoint是未来保证clog page刷脏,不然clog page在clog buffer中还没写到clog segment文件。

sql 复制代码
cd pg_xact/
 hexdump -C 0002 -s 63196 -n 1 -v
0000f6dc  95                                                |.|
0000f6dd

--十六进制转二进制
> select 'x96'::bit(8);
   bit    
----------
 10010110

xid%4=3时,取第1、2位,所以这个回滚的事务的bit位取10,10代表TRANSACTION_STATUS_ABORTED

为什么clog中一般有很多55和U?

一般的业务事务库clog文件,直接hexdump的话结果就像下面这样:

c 复制代码
hexdump -C 0001 -v|head -10
00000000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000010  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000020  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000030  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000040  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000050  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000060  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000070  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000080  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
00000090  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|

因为事务提交的状态=01=TRANSACTION_STATUS_COMMITTED,一个byte中4个连续的事务都是提交的,那么就是01010101。

  • 二进制为01010101,十六进制为55
  • 十六进制55在ASCII中为U,所以在肉眼看CLOG文件一般可以看到很多U
  • 偶尔有些数组不是55或U,因为生产环境中偶尔有些事务没有完成或者使用了子事务。子事务在clog中的提交状态是x03。

shared clog buffer

clog shared buffers的个数很容易理解:

c 复制代码
/*
 * Number of shared CLOG buffers.
 *
 * On larger multi-processor systems, it is possible to have many CLOG page
 * requests in flight at one time which could lead to disk access for CLOG
 * page if the required page is not found in memory.  Testing revealed that we
 * can get the best performance by having 128 CLOG buffers, more than that it
 * doesn't improve performance.
 *
 * Unconditionally keeping the number of CLOG buffers to 128 did not seem like
 * a good idea, because it would increase the minimum amount of shared memory
 * required to start, which could be a problem for people running very small
 * configurations.  The following formula seems to represent a reasonable
 * compromise: people with very low values for shared_buffers will get fewer
 * CLOG buffers as well, and everyone else will get 128.
 */
Size
CLOGShmemBuffers(void)
{
	return Min(128, Max(4, NBuffers / 512));
}

翻译:经过测试128个clog buffers性能是最好的,再多也提升不了性能。但是由于有些库的配置实在是太小了,128个clog buffers显得有点大,所以取shared buffers个数的1/512。也就是说:

clog buffers的个数=1/512 shared buffer,最小是4,最大是128。注意这里都是buffer的个数,不是大小!

那么单个buffer有多大呢?

clog buffer是slru管理的,slru每个page都是8k:

A page is the same BLCKSZ as is used everywhere

我们可以从clog slru初始化的角度窥探shared clog buffer的大小:

c 复制代码
/*
 * Initialization of shared memory for CLOG
 */
Size
CLOGShmemSize(void)
{
	return SimpleLruShmemSize(CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE);
}

传入的CLOGShmemBuffers()是4~128,传入的CLOG_LSNS_PER_PAGE=1024 byte(8k page时)。
SimpleLruShmemSize初始化slru shmem:

c 复制代码
Size
SimpleLruShmemSize(int nslots, int nlsns)
{
	Size		sz;

	/* we assume nslots isn't so large as to risk overflow */
	sz = MAXALIGN(sizeof(SlruSharedData));
	sz += MAXALIGN(nslots * sizeof(char *));	/* page_buffer[] */
	sz += MAXALIGN(nslots * sizeof(SlruPageStatus));	/* page_status[] */
	sz += MAXALIGN(nslots * sizeof(bool));	/* page_dirty[] */
	sz += MAXALIGN(nslots * sizeof(int));	/* page_number[] */
	sz += MAXALIGN(nslots * sizeof(int));	/* page_lru_count[] */
	sz += MAXALIGN(nslots * sizeof(LWLockPadded));	/* buffer_locks[] */

	if (nlsns > 0)
		sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));	/* group_lsn[] */

	return BUFFERALIGN(sz) + BLCKSZ * nslots;
}

slrus用一些数组保存slru的元数据和控制信息,sz大小都是数据类型*buffer个数,这些大致估算都不是特别不大。主要初始化的内存还是BLCKSZ * nslots,也就是8k * (4~128)=(32k~1M)大小。所以可以粗略的认为,shared clog buffer的大小为1M左右。

CLOG的wal:类型、写入和redo

写clog的时候会写clog的wal日志吗?如果写了那不是clog丢了也可以通过wal日志再次应用就应该找回了事务状态?我们带着问题来看clog写wal和redo的源码。

extend clog

ZeroCLOGPage会写wal,ZeroCLOGPage(pageno, true)实际上ExtendCLOG调用:

c 复制代码
/*
 * Make sure that CLOG has room for a newly-allocated XID.
 *
 * NB: this is called while holding XidGenLock.  We want it to be very fast
 * most of the time; even when it's not so fast, no actual I/O need happen
 * unless we're forced to write out a dirty clog or xlog page to make room
 * in shared memory.
 */
void
ExtendCLOG(TransactionId newestXact)
{
	int			pageno;

	/*
	 * No work except at first XID of a page.  But beware: just after
	 * wraparound, the first XID of page zero is FirstNormalTransactionId.
	 */
	if (TransactionIdToPgIndex(newestXact) != 0 &&
		!TransactionIdEquals(newestXact, FirstNormalTransactionId))
		return;

	pageno = TransactionIdToPage(newestXact); //由TransactionId换算的clog page号

	LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);

	/* Zero the page and make an XLOG entry about it */
	ZeroCLOGPage(pageno, true);

	LWLockRelease(XactSLRULock);
}

ZeroCLOGPage主要调用WriteZeroPageXlogRec

c 复制代码
/*
 * Write a ZEROPAGE xlog record
 */
static void
WriteZeroPageXlogRec(int pageno)
{
	XLogBeginInsert();
	XLogRegisterData((char *) (&pageno), sizeof(int));
	(void) XLogInsert(RM_CLOG_ID, CLOG_ZEROPAGE);
}

WriteZeroPageXlogRec就是在写wal record了,写入的类型是"RM_CLOG_ID, CLOG_ZEROPAGE"。

通过waldump可以查看到CLOG_ZEROPAGE,它的占比一般都是非常少的:

sql 复制代码
pg_waldump -z 000000010000056B00000018 --stat=record
Type                                           N      (%)          Record size      (%)             FPI size      (%)        Combined size      (%)
----                                           -      ---          -----------      ---             --------      ---        -------------      ---
...
CLOG/ZEROPAGE                                  1 (  0.00)                   30 (  0.00)                    0 (  0.00)                   30 (  0.00)
...

extend clog page时都是以page为单位,实际上在clog segment的末尾可以很容易看到00

复制代码
hexdump 03C2
0000000 5555 5555 5555 5555 5555 5555 5555 5555
*
001bb30 5555 5555 0055 0000 0000 0000 0000 0000
001bb40 0000 0000 0000 0000 0000 0000 0000 0000  
* ##clog文件后面的都是0
001c000

truncate clog

除了extend clog外,还有truncate clog,truncate clog会在vacuum期间被调用,调用时会写truncate clog的wal record,并将wal record flush到磁盘

c 复制代码
/*
 * Remove all CLOG segments before the one holding the passed transaction ID
 *
 * Before removing any CLOG data, we must flush XLOG to disk, to ensure
 * that any recently-emitted FREEZE_PAGE records have reached disk; otherwise
 * a crash and restart might leave us with some unfrozen tuples referencing
 * removed CLOG data.  We choose to emit a special TRUNCATE XLOG record too.
 * Replaying the deletion from XLOG is not critical, since the files could
 * just as well be removed later, but doing so prevents a long-running hot
 * standby server from acquiring an unreasonably bloated CLOG directory.
 *
 * Since CLOG segments hold a large number of transactions, the opportunity to
 * actually remove a segment is fairly rare, and so it seems best not to do
 * the XLOG flush unless we have confirmed that there is a removable segment.
 */
void
TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid)
{
	int			cutoffPage;

	/*
	 * The cutoff point is the start of the segment containing oldestXact. We
	 * pass the *page* containing oldestXact to SimpleLruTruncate.
	 */
	//写入wal的是clog的位置,clog位置是oldestXact转换出来的clog page号
	cutoffPage = TransactionIdToPage(oldestXact); 

.....
	/*
	 * Write XLOG record and flush XLOG to disk. We record the oldest xid
	 * we're keeping information about here so we can ensure that it's always
	 * ahead of clog truncation in case we crash, and so a standby finds out
	 * the new valid xid before the next checkpoint.
	 */
	// WriteTruncateXlogRec会写对应的wal record并将其写到磁盘上
	WriteTruncateXlogRec(cutoffPage, oldestXact, oldestxid_datoid);
	
	//wal写完后,才真正执行truncate clog段
	/* Now we can remove the old CLOG segment(s) */
	SimpleLruTruncate(XactCtl, cutoffPage);
}

WriteTruncateXlogRec就是写RMGRRM_CLOG_IDinfoCLOG_TRUNCAT的wal record:

c 复制代码
/*
 * Write a TRUNCATE xlog record
 *
 * We must flush the xlog record to disk before returning --- see notes
 * in TruncateCLOG().
 */
static void
WriteTruncateXlogRec(int pageno, TransactionId oldestXact, Oid oldestXactDb)
{
	XLogRecPtr	recptr;
	xl_clog_truncate xlrec;

	xlrec.pageno = pageno;
	xlrec.oldestXact = oldestXact;
	xlrec.oldestXactDb = oldestXactDb;

	XLogBeginInsert();
	XLogRegisterData((char *) (&xlrec), sizeof(xl_clog_truncate));
	recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE);
	XLogFlush(recptr);
}

当制造了clog wal record后,还需要redo恢复的routine:

c 复制代码
/*
 * CLOG resource manager's routines
 */
void
clog_redo(XLogReaderState *record)
{
...
	//redo info类型是CLOG_ZEROPAGE时,将读取出来的redo信息放在内存,然后写到CLOG page文件中
	if (info == CLOG_ZEROPAGE)
	{
		int			pageno;
		int			slotno;

		memcpy(&pageno, XLogRecGetData(record), sizeof(int));

		LWLockAcquire(XactSLRULock, LW_EXCLUSIVE);

		slotno = ZeroCLOGPage(pageno, false);
		SimpleLruWritePage(XactCtl, slotno); 
		Assert(!XactCtl->shared->page_dirty[slotno]);

		LWLockRelease(XactSLRULock);
	}
	//redo info类型是CLOG_TRUNCATE时,将读取出来的redo信息放在内存中,确认page是可删除的(如果不可用需要write page),然后truncate segment
	else if (info == CLOG_TRUNCATE)
	{
		xl_clog_truncate xlrec;

		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_clog_truncate));

		/*
		 * During XLOG replay, latest_page_number isn't set up yet; insert a
		 * suitable value to bypass the sanity test in SimpleLruTruncate.
		 */
		XactCtl->shared->latest_page_number = xlrec.pageno;

		AdvanceOldestClogXid(xlrec.oldestXact);

		SimpleLruTruncate(XactCtl, xlrec.pageno);
	}
	else
		elog(PANIC, "clog_redo: unknown op code %u", info);
}

clog redo routine所做的事:

  • redo info类型是CLOG_ZEROPAGE时,找到一个合适的slot(如果没有需要换出),根据读取出来的redo信息(实际上是clog page号)进行一些是否可写的判断,可以再讲page write到clog文件
  • redo info类型是CLOG_TRUNCATE时,根据读取出来的redo信息(实际上是clog page号),确认page是可删除的(如果不可用需要write page),然后truncate clog segment

clog的同步小结

CLOG只有两种wal日志,两种都不包含事务状态信息,且仅仅是在extend clog page和truncate clog segment时触发,写入的wal record只是一个clog page编号。

CLOG的wal日志RMGR类型只有一种RM_CLOG_ID,这种类型只有两种信息:CLOG_ZEROPAGECLOG_TRUNCATE

c 复制代码
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10

clog wal同步总结:
从库本质上没有在同步clog信息,只是同步了一些clog文件的扩容和删除信息

但是,从库的clog文件中明明是有状态信息的,从库明显也需要这部分信息以提供可见性检查。clog中的事务状态是怎么同步的呢?

事务id的wal:类型、写入和redo

rmgr=clog的wal不包含事务状态,难道从库不同步clog事务信息?并不是,wal日志中有事务ID的状态信息,clog也会被更新:

sql 复制代码
--回滚一个事务,提交一个事务
>  begin;
BEGIN
>  select txid_current();
 txid_current 
--------------
      1817254
(1 row)

>  rollback;
ROLLBACK
> begin;
BEGIN
>  select txid_current();
 txid_current 
--------------
      1817258
(1 row)

> commit;
COMMIT
> checkpoint;
CHECKPOINT
sql 复制代码
--pg_waldump查看日志中的事务ID状态
[datalzl/pg_wal]$ pg_waldump ../../pg_wal/000000010000007300000008|grep  -E "1817254|1817258"
rmgr: Transaction len (rec/tot):     34/    34, tx:    1817254, lsn: 73/400ED210, prev 73/400ED1E0, desc: ABORT 2024-08-01 14:41:26.017612 CST
rmgr: Transaction len (rec/tot):     46/    46, tx:    1817258, lsn: 73/400EEB08, prev 73/400EEAD8, desc: COMMIT 2024-08-01 14:41:37.042545 CST
pg_waldump: fatal: error in WAL record at 73/400F7C78: invalid record length at 73/400F7F88: wanted 24, got 0

在wal中有记录事务ID(1817254,1817258)的状态,并且分别记录为ABORTCOMMIT ;rmgr为Transaction

事务ID状态在wal日志中,但是pg会写到从库的clog里吗?

很明显,我们需要找到这部分的redo信息。按照刚才的经验,clog_redo代表rmgr=clog的wal redo源码,那么源码搜索_redo应该能找到rmgr=transactionid的wal redo源码。搜搜,在xact.c中的找到函数xact_redo,其主要调用xact_redo_commit,xact_redo_abort,明显各自对应提交事务和回退事务的wal日志应用逻辑。

c 复制代码
void
xact_redo(XLogReaderState *record)
{
	uint8		info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;

	/* Backup blocks are not used in xact records */
	Assert(!XLogRecHasAnyBlockRefs(record));

	if (info == XLOG_XACT_COMMIT)
	{
	...
		xact_redo_commit(&parsed, XLogRecGetXid(record),
						 record->EndRecPtr, XLogRecGetOrigin(record));
	}
...
	else if (info == XLOG_XACT_ABORT)
	{
	...
		xact_redo_abort(&parsed, XLogRecGetXid(record));
	}
...
	}
	else
		elog(PANIC, "xact_redo: unknown op code %u", info);
}

以commit为例

c 复制代码
static void
xact_redo_commit(xl_xact_parsed_commit *parsed,
				 TransactionId xid,
				 XLogRecPtr lsn,
				 RepOriginId origin_id)
{
...

	if (standbyState == STANDBY_DISABLED)
	{
		/*
		 * Mark the transaction committed in pg_xact.
		 */
		TransactionIdCommitTree(xid, parsed->nsubxacts, parsed->subxacts);
	}
	else//standby逻辑
	{
	...
		/*
		 * Mark the transaction committed in pg_xact. We use async commit
		 * protocol during recovery to provide information on database
		 * consistency for when users try to set hint bits. It is important
		 * that we do not set hint bits until the minRecoveryPoint is past
		 * this commit record. This ensures that if we crash we don't see hint
		 * bits set on changes made by transactions that haven't yet
		 * recovered. It's unlikely but it's good to be safe.
		 */
		//在pg_xact中标记事务提交
		TransactionIdAsyncCommitTree(xid, parsed->nsubxacts, parsed->subxacts, lsn);

...
}

看着TransactionIdAsyncCommitTree就是我们要找的写clog的函数。

为了验证wal中事务commit信息的redo逻辑,下面给从库的startup进程打三个断点,然后在源库执行begin;select txid_current();commit;提交一个事务,看看从库的startup进程在做redo的时候是否命中我们想要看到的函数:

c 复制代码
(gdb) bt
#0  TransactionIdAsyncCommitTree (xid=xid@entry=1818665, nxids=0, xids=0x0, lsn=lsn@entry=495398394064) at transam.c:274
#1  0x000000000050c139 in xact_redo_commit (parsed=parsed@entry=0x7ffda52c0fc0, xid=1818665, lsn=495398394064, origin_id=<optimized out>) at xact.c:5805
#2  0x000000000050ffa3 in xact_redo (record=0x2b5ff2434038) at xact.c:5962
#3  0x0000000000519ea5 in StartupXLOG () at xlog.c:7411
#4  0x000000000072f301 in StartupProcessMain () at startup.c:204
#5  0x0000000000528701 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffda52c6ef0) at bootstrap.c:450
#6  0x000000000072c459 in StartChildProcess (type=StartupProcess) at postmaster.c:5494
#7  0x000000000072ec44 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2b5ff242d1c0) at postmaster.c:1407
#8  0x000000000048931f in main (argc=3, argv=0x2b5ff242d1c0) at main.c:210
(gdb) info b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000050c060 in xact_redo_commit at xact.c:5753
        breakpoint already hit 43 times
2       breakpoint     keep y   0x0000000000508190 in TransactionIdCommitTree at transam.c:262
3       breakpoint     keep y   0x00000000005081a0 in TransactionIdAsyncCommitTree at transam.c:274
        breakpoint already hit 1 time

可以命中断点TransactionIdAsyncCommitTree ,并且xid=1818665,也就是刚才在源库提交的事务ID。说明刚才肉眼看的代码逻辑没有问题。

so,standby库的clog事务ID状态由wal rmgr=Transaction来同步

总结

  • clog只存储了事务ID的状态,没有存储事务ID本身
  • 可以通过事务ID手动定位到CLOG文件中的事务状态
  • rmgr=clog的wal只有扩展和清理clog文件,没有更新事务状态
  • rmgr=transaction的wal会更新clog的事务状态

references


  1. 《快速掌握PostgreSQL版本新特性》p24 ↩︎

  2. 阎书利 PostgreSQL的CLOG解析 https://www.modb.pro/db/606433 ↩︎

  3. 《PostgreSQL数据库内核分析》第7章 p380-390 ↩︎

相关推荐
广州智造4 小时前
OptiStruct实例:3D实体转子分析
数据库·人工智能·算法·机器学习·数学建模·3d·性能优化
技术宝哥7 小时前
Redis(2):Redis + Lua为什么可以实现原子性
数据库·redis·lua
学地理的小胖砸8 小时前
【Python 操作 MySQL 数据库】
数据库·python·mysql
dddaidai1239 小时前
Redis解析
数据库·redis·缓存
数据库幼崽9 小时前
MySQL 8.0 OCP 1Z0-908 121-130题
数据库·mysql·ocp
Amctwd9 小时前
【SQL】如何在 SQL 中统计结构化字符串的特征频率
数据库·sql
betazhou10 小时前
基于Linux环境实现Oracle goldengate远程抽取MySQL同步数据到MySQL
linux·数据库·mysql·oracle·ogg
lyrhhhhhhhh10 小时前
Spring 框架 JDBC 模板技术详解
java·数据库·spring
喝醉的小喵12 小时前
【mysql】并发 Insert 的死锁问题 第二弹
数据库·后端·mysql·死锁
付出不多12 小时前
Linux——mysql主从复制与读写分离
数据库·mysql