PostgreSQL Source Code (149): SIMD Usage and Performance Testing

1 SIMD Instruction Sets in PostgreSQL, Explained

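SIMD (Single Instruction, Multiple Data) is a CPU instruction-set technique that processes several data elements at once: a single instruction operates on multiple values held in a vector register, which can significantly speed up compute-intensive tasks.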

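In PG, SIMD is used mainly on x86 and ARM CPUs. It optimizes scenarios such as linear searches over transaction ID arrays, JSON string processing, and subtransaction searches, with performance gains that can exceed 20%.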

Where SIMD is used in PostgreSQL

Determining whether a transaction ID is in the InProgress state.


bool
TransactionIdIsInProgress(TransactionId xid)
{
	...
	...
	if (!TransactionIdEquals(topxid, xid) &&
		pg_lfind32(topxid, xids, nxids))
		return true;

	cachedXidIsNotInProgress = xid;
	return false;
}

Determining whether an xid is in the snapshot's xip or subxip array (running transactions).


bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
	/* Any xid < xmin is not in-progress */
	if (TransactionIdPrecedes(xid, snapshot->xmin))
		return false;
	/* Any xid >= xmax is in-progress */
	if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
		return true;

	if (!snapshot->takenDuringRecovery)
	{
		if (!snapshot->suboverflowed)
		{
			/* we have full data, so search subxip */
			if (pg_lfind32(xid, snapshot->subxip, snapshot->subxcnt))
				return true;
	...
	...
}
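
The decision order in the excerpt above (below xmin, at or above xmax, otherwise search the array) can be tried out in isolation. The following is a minimal sketch, not PostgreSQL code: `Xid`, `MiniSnapshot`, `xid_precedes`, `xid_in_list`, and `xid_in_mini_snapshot` are hypothetical names, the snapshot carries a single array instead of separate xip and subxip arrays, special transaction IDs are ignored, and a plain scalar loop stands in for `pg_lfind32`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;			/* hypothetical stand-in for TransactionId */

/* Hypothetical, heavily simplified snapshot: one array of running xids. */
typedef struct MiniSnapshot
{
	Xid			xmin;			/* every xid before this has finished */
	Xid			xmax;			/* every xid at or after this is running */
	const Xid  *xip;			/* xids running between xmin and xmax */
	uint32_t	xcnt;			/* number of entries in xip */
} MiniSnapshot;

/* Wraparound-aware compare: a precedes b if the signed difference is negative. */
static bool
xid_precedes(Xid a, Xid b)
{
	return (int32_t) (a - b) < 0;
}

/* Scalar linear search; this is the step a SIMD search would replace. */
static bool
xid_in_list(Xid key, const Xid *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}

/* Returns true if xid counts as still running according to the snapshot. */
static bool
xid_in_mini_snapshot(Xid xid, const MiniSnapshot *snap)
{
	assert(snap != NULL);

	if (xid_precedes(xid, snap->xmin))
		return false;			/* older than xmin: not in progress */
	if (!xid_precedes(xid, snap->xmax))
		return true;			/* at or beyond xmax: in progress */

	/* in between: only the array can tell */
	return xid_in_list(xid, snap->xip, snap->xcnt);
}
```

Only the last branch scans the array, so the linear search cost matters for xids that fall between xmin and xmax.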

2 How SIMD Is Implemented in PostgreSQL

pg_lfind8


bool
pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
{
	uint32		i;

	/* round down to multiple of vector length */

Vectorized processing: the following line rounds the array length down to a multiple of the vector length (a 16-byte __m128i on x86):


	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
	Vector8		chunk;

A single operation compares 16 elements at once instead of checking them one by one.


	for (i = 0; i < tail_idx; i += sizeof(Vector8))
	{
		vector8_load(&chunk, &base[i]);
		if (vector8_has(chunk, key))
			return true;
	}

	/* Process the remaining elements one at a time. */
	for (; i < nelem; i++)
	{
		if (key == base[i])
			return true;
	}

	return false;
}
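
To make the 16-at-a-time comparison concrete, here is a minimal sketch of the same search written directly against SSE2 intrinsics. It is not PostgreSQL code: `sketch_lfind8` is a hypothetical name, and the broadcast, compare, and movemask sequence is one common way a helper like `vector8_has` can be built on x86. On builds without SSE2 the sketch falls back to the scalar loop for the whole array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>			/* SSE2 intrinsics */
#endif

/*
 * sketch_lfind8: hypothetical stand-in for pg_lfind8.
 * Returns true if key occurs in base[0 .. nelem-1].
 */
static bool
sketch_lfind8(uint8_t key, const uint8_t *base, uint32_t nelem)
{
	uint32_t	i = 0;

	assert(base != NULL || nelem == 0);

#if defined(__SSE2__)
	/* round down to a multiple of 16, the byte width of __m128i */
	const uint32_t tail_idx = nelem & ~(uint32_t) (sizeof(__m128i) - 1);
	/* copy the key into all 16 byte lanes */
	const __m128i keys = _mm_set1_epi8((char) key);

	for (; i < tail_idx; i += sizeof(__m128i))
	{
		/* unaligned 16-byte load of the next chunk */
		__m128i		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
		/* matching lanes become 0xFF, all others 0x00 */
		__m128i		eq = _mm_cmpeq_epi8(chunk, keys);

		/* movemask collects the top bit of each lane; nonzero means a hit */
		if (_mm_movemask_epi8(eq) != 0)
			return true;
	}
#endif

	/* scalar tail (or the whole array when SSE2 is unavailable) */
	for (; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}
```

With a 40-byte array, the vector loop covers bytes 0 to 31 in two iterations and the scalar loop checks the remaining 8.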

pg_lfind32


bool
pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
{
	uint32		i = 0;

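keys = vector32_broadcast(key): broadcasts the search key to every lane of the vector (for example, 5 becomes [5,5,5,5]).
nelem_per_vector: the number of uint32 values one vector holds, which is 4.
nelem_per_iteration: the number of elements processed per loop iteration, which is 16.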


	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
	const uint32 nelem_per_iteration = 4 * nelem_per_vector;

Compute the range covered by the vectorized loop:

The total element count is rounded down to a multiple of 16; with 100 elements, tail_idx = 96 (6×16).


	/* round down to multiple of elements per iteration */
	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);

	for (i = 0; i < tail_idx; i += nelem_per_iteration)
	{
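		Vector32	vals1,    // four vectors holding the loaded data
					vals2,
					vals3,
					vals4,
					result1,  // four vectors holding the comparison results
					result2,
					result3,
					result4,
					tmp1,     // temporaries for merging results
					tmp2,
					result;   // final combined result vector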

Each vector32_load loads 4 uint32 values into one register (vals1 through vals4); the four loads together bring in 16 uint32 values per iteration.


		/* load the next block into 4 registers */
		vector32_load(&vals1, &base[i]);
		vector32_load(&vals2, &base[i + nelem_per_vector]);
		vector32_load(&vals3, &base[i + nelem_per_vector * 2]);
		vector32_load(&vals4, &base[i + nelem_per_vector * 3]);

Compare and combine the results:


		/* compare each value to the key */
		result1 = vector32_eq(keys, vals1);
		result2 = vector32_eq(keys, vals2);
		result3 = vector32_eq(keys, vals3);
		result4 = vector32_eq(keys, vals4);
		/* combine the results into a single variable */
		tmp1 = vector32_or(result1, result2);
		tmp2 = vector32_or(result3, result4);
		result = vector32_or(tmp1, tmp2);
		/* see if there was a match */
		if (vector32_is_highbit_set(result))
			return true;
	}

Handle the leftover elements when the count is not a multiple of 16:


	/* Process the remaining elements one at a time. */
	for (; i < nelem; i++)
		if (key == base[i])
			return true;

	return false;
}
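
The walkthrough above relies on PostgreSQL's Vector32 wrappers. As a rough illustration of what those steps can map to on x86, the sketch below spells out the same broadcast, four-load, compare, OR, and high-bit test sequence with raw SSE2 intrinsics. `sketch_lfind32` is a hypothetical name, the choice of intrinsics is an assumption rather than a copy of PostgreSQL's headers, and the code falls back to a scalar loop when SSE2 is not available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>			/* SSE2 intrinsics */
#endif

/*
 * sketch_lfind32: hypothetical stand-in for pg_lfind32.
 * Returns true if key occurs in base[0 .. nelem-1].
 */
static bool
sketch_lfind32(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	uint32_t	i = 0;

	assert(base != NULL || nelem == 0);

#if defined(__SSE2__)
	/* broadcast: [key, key, key, key] */
	const __m128i keys = _mm_set1_epi32((int) key);
	/* 4 uint32 values fit in one 128-bit register */
	const uint32_t per_vector = sizeof(__m128i) / sizeof(uint32_t);
	/* 4 registers per loop iteration, so 16 values */
	const uint32_t per_iteration = 4 * per_vector;
	/* round down to a multiple of 16, e.g. 100 -> 96 */
	const uint32_t tail_idx = nelem & ~(per_iteration - 1);

	for (; i < tail_idx; i += per_iteration)
	{
		/* load the next 16 values into four registers */
		__m128i		v1 = _mm_loadu_si128((const __m128i *) &base[i]);
		__m128i		v2 = _mm_loadu_si128((const __m128i *) &base[i + per_vector]);
		__m128i		v3 = _mm_loadu_si128((const __m128i *) &base[i + per_vector * 2]);
		__m128i		v4 = _mm_loadu_si128((const __m128i *) &base[i + per_vector * 3]);

		/* matching 32-bit lanes become all ones, others all zeros */
		__m128i		r1 = _mm_cmpeq_epi32(keys, v1);
		__m128i		r2 = _mm_cmpeq_epi32(keys, v2);
		__m128i		r3 = _mm_cmpeq_epi32(keys, v3);
		__m128i		r4 = _mm_cmpeq_epi32(keys, v4);

		/* OR the four results so one test covers all 16 lanes */
		__m128i		any = _mm_or_si128(_mm_or_si128(r1, r2),
									   _mm_or_si128(r3, r4));

		/* any set high bit means at least one lane matched */
		if (_mm_movemask_epi8(any) != 0)
			return true;
	}
#endif

	/* scalar tail (or the whole array when SSE2 is unavailable) */
	for (; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}
```

Combining the four comparison results before testing means the loop branches once per 16 elements rather than once per register.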

3 PG SIMD Performance Test on x86

https://github.com/mingjiegao/libsimd/tree/main

Running make check automatically performs a performance comparison. The results are as follows:


=== Performance Tests ===
libsimd Performance Tests
=========================
System Information:
- Compiler: 8.5.0 20210514 (Tencent 8.5.0-26)
- AVX2 support: Enabled
- SSE4.2 support: Enabled

Performance Test Results
========================
Test Configuration:
- Iterations per test: 1000
- Warmup runs: 10
- Key distribution: 75% existing, 25% non-existing

Test Name            |     Size |  SIMD (ms) | Linear (ms) |  Speedup | Status
--------------------------------------------------------------------------------
lfind8_small         |    10000 |       0.03 |       0.10 |     3.92x | PASS
lfind8_medium        |   100000 |       0.03 |       0.10 |     4.16x | PASS
lfind8_large         |  1000000 |       0.03 |       0.11 |     4.08x | PASS
lfind8_xlarge        | 10000000 |       0.03 |       0.10 |     3.96x | PASS
lfind32_small        |    10000 |       0.67 |       2.06 |     3.06x | PASS
lfind32_medium       |   100000 |       6.73 |      20.72 |     3.08x | PASS
lfind32_large        |  1000000 |     102.91 |     209.44 |     2.04x | PASS
lfind32_xlarge       | 10000000 |    1986.14 |    3120.74 |     1.57x | PASS
--------------------------------------------------------------------------------

Performance Summary:
- Total tests: 8
- Passed tests: 8
- Failed tests: 0
- Average speedup: 3.23x
- SIMD implementation is 3.23x faster on average

Notes:
- Speedup values > 1.0 indicate SIMD is faster
- Results may vary based on CPU architecture and compiler optimizations
- All tests include correctness verification


Worst Case Performance Analysis
===============================
Worst case (key not found):
- SIMD time: 0.45 ms
- Linear time: 7.01 ms
- Speedup: 15.68x

Performance testing completed.
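
For readers who want to reproduce a comparison of this kind by hand, the following is a minimal timing sketch. It is not the libsimd test harness: `search_fn`, `linear_search32`, and `time_search_ms` are hypothetical names, the 75%/25% hit/miss mix is imitated by making every fourth key absent, and the code assumes the array was filled so that `base[k] == k` and that `nelem + iterations` does not overflow uint32.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the search functions being compared (hypothetical). */
typedef bool (*search_fn) (uint32_t key, const uint32_t *base, uint32_t nelem);

/* Scalar baseline: check elements one at a time. */
static bool
linear_search32(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}

/*
 * Run `iterations` lookups with fn and return the elapsed CPU time in
 * milliseconds. Assumes base[k] == k, so keys >= nelem always miss.
 * The number of successful lookups is stored in *hits, which also keeps
 * the compiler from discarding the search calls.
 */
static double
time_search_ms(search_fn fn, const uint32_t *base, uint32_t nelem,
			   uint32_t iterations, uint32_t *hits)
{
	uint32_t	found = 0;
	clock_t		start;
	clock_t		end;

	assert(fn != NULL && hits != NULL && nelem > 0);

	start = clock();
	for (uint32_t it = 0; it < iterations; it++)
	{
		/* every fourth key is absent: roughly 75% hits, 25% misses */
		uint32_t	key = (it % 4 == 3) ? nelem + it : it % nelem;

		if (fn(key, base, nelem))
			found++;
	}
	end = clock();

	*hits = found;
	return 1000.0 * (double) (end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_search_ms` once with `linear_search32` and once with a SIMD search over the same array gives two times whose ratio is a speedup figure comparable in spirit to the table above; the hit counts from both runs should be identical, which doubles as a correctness check.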