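PostgreSQL partitioned tables

PostgreSQL Partitioned Tables and Inheritance Tables: Design and Implementation Walkthrough

This article works through the PostgreSQL source code to survey two mechanisms that are often discussed together but whose design goals are not quite the same:

- Traditional inheritance: CREATE TABLE ... INHERITS (...)
- Declarative partitioning: CREATE TABLE ... PARTITION BY ...

The aim is not merely to note that both can have child tables, but to make the following threads clear:

- How the two mechanisms are modeled at the catalog level
- How the planner expands an inheritance tree or a partition tree
- How the executor performs insert routing and query pruning
- What the key data structures, functions, and source snippets are each responsible for
- Why PostgreSQL ended up building partitioning as a system that sits on top of the inheritance relationship but has its own metadata and optimization paths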

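1. The conclusion in one sentence

In one sentence:

PostgreSQL's declarative partitioning is not a new tree model independent of inheritance. It is an enhanced inheritance system that records parent-child relationships in pg_inherits, records partition-specific metadata in pg_partitioned_table and pg_class.relpartbound, and has the planner and executor follow dedicated expansion, routing, and pruning logic.

In other words:

- Inheritance supplies the parent-child relationship and the tree-traversal infrastructure
- Partitioning adds on top of that:
  - partition keys
  - partition bounds
  - the default partition and the NULL-accepting partition
  - plan-time and execution-time pruning
  - tuple routing on insert
  - concurrency and visibility handling for attach/detach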

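2. Overall design goals

2.1 Goals of inheritance

Traditional inheritance leans toward schema reuse plus query expansion:

- The parent table defines the common columns
- Child tables may add columns
- A query on the parent can include all children
- Semantically it is closer to an object hierarchy or a logical collection

It does not inherently emphasize:

- automatic insert routing
- efficient pruning by bounds
- overlap checks on child bounds

None of these is a core requirement of traditional inheritance.

2.2 Goals of declarative partitioning

Partitioning aims to split one large logical table into several physical child tables while keeping it looking like a single table at the SQL level as far as possible:

- Route writes to the target partition automatically based on the partition key
- Use bound information to prune irrelevant partitions quickly
- Constrain the set of partitions so that it is non-overlapping and fully verifiable
- Support multi-level partitioning
- Support operational actions such as attach, detach, and the default partition

PostgreSQL therefore did not simply reuse the old "plain inheritance + constraint exclusion" path. It built dedicated:

- partition key caching
- partition descriptors
- partition bound structures
- a step model for partition pruning
- tuple routing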

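3. How inheritance and partitioning relate

3.1 What they share

Both share a parent-child hierarchy, so both depend on:

- pg_inherits
- find_inheritance_children()
- find_all_inheritors()
- expand_inherited_rtentry()

Both also mean that:

- the parent relation is not necessarily the only thing scanned
- the planner has to expand child rels
- the executor and planner have to map attribute numbers between parent and child

3.2 Where they differ

The key differences are:

| Dimension | Traditional inheritance | Declarative partitioning |
| --- | --- | --- |
| Parent-child storage | `pg_inherits` | also `pg_inherits` |
| Dedicated metadata | essentially none | `pg_partitioned_table`, `pg_class.relpartbound` |
| Parent stores data | yes | no, a partitioned table holds no data of its own |
| Multiple parents for a child | multiple inheritance allowed | a partition has exactly one partition parent |
| Automatic insert routing | no general mechanism | yes, `ExecFindPartition()` |
| Query pruning | mainly legacy constraint exclusion on top of plain expansion | dedicated partition pruning |
| Structural constraints | weak | strong: bounds must not overlap, hierarchy must be consistent |
| Planner expansion shape | tends to be flattened | hierarchy preserved, partition tree expanded recursively |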

3.3 How this shows up in the code

The comment on expand_inherited_rtentry() already spells out the two semantics:

/*
 * "inh" on a plain RELATION RTE means that it is a partitioned table or the
 * parent of a traditional-inheritance set.
 *
 * ...
 * In the case of traditional inheritance, the first of the generated
 * RTEs is an RTE for the same table, but with inh = false, to represent the
 * parent table in its role as a simple member of the inheritance set.
 * For partitioning, we don't need a second RTE because the partitioned table
 * itself has no data and need not be scanned.
 */

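These sentences matter, because they show that:

- for traditional inheritance, the parent table is itself one member of the set
- for declarative partitioning, the partitioned table is a logical entry point and a metadata carrier, and is not scanned the way an ordinary table is

4. Catalog design

4.1 `pg_inherits`: one place for parent-child relationships

pg_inherits is the base catalog that both inheritance and partitioning rely on: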

CATALOG(pg_inherits,2611,InheritsRelationId)
{
    Oid         inhrelid;
    Oid         inhparent;
    int32       inhseqno;
    bool        inhdetachpending;
} FormData_pg_inherits;

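Field meanings:

- inhrelid: OID of the child relation
- inhparent: OID of the parent relation
- inhseqno: position of this parent among the child's direct parents when there are several (counted from 1)
- inhdetachpending: marks the intermediate state of a partition undergoing DETACH CONCURRENTLY

Design significance:

- inheritance and partitioning share the same parent-child tree
- partitioning additionally introduces inhdetachpending to control visibility during a concurrent detach

4.2 `pg_partitioned_table`: which tables are partitioned, and by what

The metadata of each partitioned table lives in pg_partitioned_table: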

CATALOG(pg_partitioned_table,3350,PartitionedRelationId)
{
    Oid         partrelid;      /* partitioned table oid */
    char        partstrat;      /* partitioning strategy */
    int16       partnatts;      /* number of partition key columns */
    Oid         partdefid;      /* default partition oid */
    int2vector  partattrs;      /* key columns, 0 means expression */
    oidvector   partclass;      /* opclass */
    oidvector   partcollation;  /* collation */
    pg_node_tree partexprs;     /* expression partition keys */
} FormData_pg_partitioned_table;

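Field responsibilities:

- partstrat: LIST / RANGE / HASH
- partnatts: number of partition key columns
- partdefid: OID of the default partition
- partattrs: attnum of each key column; 0 means that key position is an expression
- partclass / partcollation: comparison and collation rules
- partexprs: expression partition keys

This catalog answers:

- how the partition key is defined
- how bounds are compared
- which partition is the default

4.3 `pg_class.relpartbound`: each partition's own bound

The bound of each partition relation is stored in pg_class.relpartbound. The pg_class header is not reproduced here, but the source reads and writes the field directly in many places, for example in partdesc.c: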

datum = SysCacheGetAttr(RELOID, tuple,
                        Anum_pg_class_relpartbound,
                        &isnull);
if (!isnull)
    boundspec = stringToNode(TextDatumGetCString(datum));

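In other words:

- pg_partitioned_table describes how the parent is partitioned
- pg_class.relpartbound describes which range of values a given child accepts

4.4 `pg_class.relispartition` / `relhassubclass`

These two flags also matter:

- relispartition: whether the relation is a partition
- relhassubclass: whether the relation may have children

relhassubclass is an optimization hint meaning "may have children"; it is not guaranteed to be cleared promptly. The comment in pg_inherits.c is explicit: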

/*
 * has_subclass - does this relation have any children?
 *
 * In the current implementation, has_subclass returns whether a
 * particular class *might* have a subclass.
 *
 * ...
 * Currently has_subclass is only used as an efficiency hack to skip
 * unnecessary inheritance searches, so this is OK.
 */

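Therefore:

- has_subclass() is a performance optimization, not a strongly consistent source of truth
- the actual parent-child relationships are whatever pg_inherits says

5. Module responsibilities

5.1 Source modules

| Module | Main responsibility |
| --- | --- |
| `catalog/pg_inherits.c` | inheritance tree scans, `pg_inherits` read/write helpers |
| `catalog/partition.c` | general utilities: partition parent-child lookup, default partition, partition column mapping |
| `utils/cache/partcache.c` | reads and caches the partition key `PartitionKey` |
| `partitioning/partdesc.c` | builds and caches `PartitionDesc` |
| `partitioning/partbounds.c` | builds, compares, and checks partition bounds |
| `partitioning/partprune.c` | plan-time and execution-time partition pruning |
| `optimizer/util/inherit.c` | planner-side expansion of inheritance and partition trees |
| `executor/execPartition.c` | insert tuple routing, execution-time pruning |
| `commands/tablecmds.c` | DDL such as `CREATE TABLE` and `ATTACH/DETACH PARTITION` |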

5.2 Overall structure

- DDL: CREATE TABLE / ATTACH PARTITION
  - tablecmds.c
    - pg_inherits
    - pg_partitioned_table
    - pg_class.relpartbound
- Relcache open
  - partcache.c: RelationGetPartitionKey
  - partdesc.c: RelationGetPartitionDesc
- Planner
  - inherit.c: expand_inherited_rtentry
  - pg_inherits.c: find_all_inheritors / find_inheritance_children
  - partprune.c: prune_append_rel_partitions
- Executor INSERT
  - execPartition.c: ExecSetupPartitionTupleRouting
  - ExecFindPartition
- Executor scan
  - ExecDoInitialPruning / ExecFindMatchingSubPlans

6. Important data structures

The structures below are covered in the order metadata layer -> planner layer -> executor layer.

6.1 `PartitionKeyData`: the partition key definition

Defined in src/include/utils/partcache.h:

typedef struct PartitionKeyData
{
    PartitionStrategy strategy;
    int16       partnatts;
    AttrNumber *partattrs;
    List       *partexprs;

    Oid        *partopfamily;
    Oid        *partopcintype;
    FmgrInfo   *partsupfunc;

    Oid        *partcollation;

    Oid        *parttypid;
    int32      *parttypmod;
    int16      *parttyplen;
    bool       *parttypbyval;
    char       *parttypalign;
    Oid        *parttypcoll;
} PartitionKeyData;

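It answers:

- which strategy is used: LIST / RANGE / HASH
- how many columns the partition key has
- whether each key position is a plain column or an expression
- which opclass / opfamily / collation applies
- what the data type of each key position is

This is the central key-definition object of the whole partitioning system.

6.2 `PartitionBoundInfoData`: the bound index structure

Defined in src/include/partitioning/partbounds.h: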

typedef struct PartitionBoundInfoData
{
    PartitionStrategy strategy;
    int         ndatums;
    Datum     **datums;
    PartitionRangeDatumKind **kind;
    Bitmapset  *interleaved_parts;
    int         nindexes;
    int        *indexes;
    int         null_index;
    int         default_index;
} PartitionBoundInfoData;

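It is not the raw list of bounds from the SQL, but a bound structure optimized for lookup:

- datums: the bound values, sorted
- indexes: mapping from bound to partition index
- null_index: the partition that accepts NULL
- default_index: the default partition

This lets PostgreSQL:

- binary-search list and range bounds
- look up a hash partition directly by remainder
- reuse one bound representation for both pruning and tuple routing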
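
To make the lookup idea concrete, here is a minimal sketch of a range lookup over sorted bounds. It is not PostgreSQL's code: it assumes a single int key, and the names ToyRangeBounds, toy_bound_bsearch, and toy_range_lookup are hypothetical. Like the real structure, it keeps one more entry in indexes than in datums and uses -1 for a gap that no partition covers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for PartitionBoundInfoData: one int key, RANGE only. */
typedef struct ToyRangeBounds
{
    int         ndatums;        /* number of sorted, de-duplicated bounds */
    const int  *datums;         /* bounds in ascending order */
    const int  *indexes;        /* ndatums + 1 entries; -1 marks a gap */
    int         default_index;  /* -1 if there is no default partition */
} ToyRangeBounds;

/* Return the offset of the greatest bound <= value, or -1 if there is none. */
static int
toy_bound_bsearch(const ToyRangeBounds *b, int value)
{
    int lo = -1;
    int hi = b->ndatums - 1;

    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2;

        if (b->datums[mid] <= value)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Map a key value to a partition index, falling back to the default. */
static int
toy_range_lookup(const ToyRangeBounds *b, int value)
{
    int part;

    assert(b != NULL && b->ndatums >= 0);
    /* the slot after the matched bound names the partition that starts there */
    part = b->indexes[toy_bound_bsearch(b, value) + 1];
    return (part >= 0) ? part : b->default_index;
}
```

With partitions [0,10), [10,20), and [30,40), datums would be {0, 10, 20, 30, 40} and indexes {-1, 0, 1, -1, 2, -1}; a value in the gap [20,30) falls through to default_index.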

6.3 `PartitionDescData`: the set of partitions a partitioned table currently sees

Defined in src/include/partitioning/partdesc.h:

typedef struct PartitionDescData
{
    int         nparts;
    bool        detached_exist;
    Oid        *oids;
    bool       *is_leaf;
    PartitionBoundInfo boundinfo;

    int         last_found_datum_index;
    int         last_found_part_index;
    int         last_found_count;
} PartitionDescData;

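Responsibilities:

- holds the current list of partitions of a partitioned parent
- records whether each partition is a leaf
- holds the corresponding PartitionBoundInfo
- keeps a cache of the most recent hit to speed up ExecFindPartition()

A design point is visible here:

- PartitionKey describes how to partition
- PartitionDesc describes which partitions currently exist
- PartitionBoundInfo describes how those partitions are indexed in bound space

6.4 `PartitionTupleRouting`: the executor's top-level insert-routing object

Defined in src/backend/executor/execPartition.c: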

struct PartitionTupleRouting
{
    Relation        partition_root;
    PartitionDispatch *partition_dispatch_info;
    ResultRelInfo **nonleaf_partitions;
    int             num_dispatch;
    int             max_dispatch;
    ResultRelInfo **partitions;
    bool           *is_borrowed_rel;
    int             num_partitions;
    int             max_partitions;
    MemoryContext   memcxt;
};

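It is the overall dispatcher on the insert path, tracking:

- which table is the root
- the dispatch information for each level of a multi-level partition tree
- the leaf ResultRelInfos that have already been initialized
- which ResultRelInfos are borrowed from the ModifyTableState and which were created on demand

6.5 `PartitionDispatchData`: routing information for one partitioned node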

typedef struct PartitionDispatchData
{
    Relation        reldesc;
    PartitionKey    key;
    List           *keystate;
    PartitionDesc   partdesc;
    TupleTableSlot *tupslot;
    AttrMap        *tupmap;
    int             indexes[FLEXIBLE_ARRAY_MEMBER];
} PartitionDispatchData;

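It represents one partitioned-table node in the partition tree:

- reldesc: the relation at this level
- key: the partition key at this level
- keystate: execution state for expression partition keys
- partdesc: descriptor of this level's child partitions
- tupmap / tupslot: tuple conversion for when parent and child rowtypes differ
- indexes[]: mapping from partition index to the downstream object

6.6 `PartitionPruneInfo` / `PartitionedRelPruneInfo`

The pruning metadata that the planner produces and the executor consumes is defined in plannodes.h: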

typedef struct PartitionPruneInfo
{
    NodeTag     type;
    Bitmapset  *relids;
    List       *prune_infos;
    Bitmapset  *other_subplans;
} PartitionPruneInfo;

typedef struct PartitionedRelPruneInfo
{
    NodeTag     type;
    Index       rtindex;
    Bitmapset  *present_parts;
    int         nparts;
    int        *subplan_map;
    int        *subpart_map;
    int        *leafpart_rti_map;
    Oid        *relid_map;
    List       *initial_pruning_steps;
    List       *exec_pruning_steps;
    Bitmapset  *execparamids;
} PartitionedRelPruneInfo;

The two-level structure matters:

- PartitionPruneInfo: all pruning information for one Append/MergeAppend node
- PartitionedRelPruneInfo: the pruning details for one partitioned table level

6.7 `PartitionPruneStepOp` / `PartitionPruneStepCombine`

Partition pruning is not hard-coded if/else logic; the conditions are first compiled into steps:

typedef struct PartitionPruneStepOp
{
    PartitionPruneStep step;
    StrategyNumber opstrategy;
    List           *exprs;
    List           *cmpfns;
    Bitmapset      *nullkeys;
} PartitionPruneStepOp;

typedef struct PartitionPruneStepCombine
{
    PartitionPruneStep step;
    PartitionPruneCombineOp combineOp;
    List           *source_stepids;
} PartitionPruneStepCombine;

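This effectively makes pruning a small execution plan:

- Op step: a basic pruning operation based on a comparison against the partition key
- Combine step: combines the results of several steps by union or intersection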
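
The step idea can be sketched as a tiny interpreter. This is only an illustration under strong assumptions, not the partprune.c implementation: partition i is assumed to hold keys in [10*i, 10*i + 10), a partition set is a 32-bit mask, and every Toy* name is hypothetical. As in the real design, each step stores its result under its own index, a combine step reads the results of earlier steps, and the last step's result is the answer.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { TOY_STEP_OP, TOY_STEP_COMBINE } ToyStepKind;
typedef enum { TOY_LT, TOY_EQ, TOY_GE } ToyStrategy;
typedef enum { TOY_UNION, TOY_INTERSECT } ToyCombineOp;

typedef struct ToyPruneStep
{
    ToyStepKind  kind;
    ToyStrategy  strategy;      /* OP steps: comparison against the key */
    int          value;         /* OP steps: the constant compared with */
    ToyCombineOp combine_op;    /* COMBINE steps only */
    int          nsources;      /* COMBINE steps: number of source steps */
    int          source_ids[2]; /* COMBINE steps: indexes of earlier steps */
} ToyPruneStep;

#define TOY_MAX_STEPS 16

/* Partitions that may contain rows satisfying "key <strategy> value". */
static uint32_t
toy_base_step(ToyStrategy strategy, int value, int nparts)
{
    uint32_t result = 0;

    for (int i = 0; i < nparts; i++)
    {
        int lo = i * 10;        /* partition i holds [lo, hi) */
        int hi = lo + 10;
        int keep = (strategy == TOY_EQ) ? (value >= lo && value < hi)
                 : (strategy == TOY_LT) ? (lo < value)
                 : (hi > value);         /* TOY_GE */

        if (keep)
            result |= (uint32_t) 1 << i;
    }
    return result;
}

/* Run the steps in order; the last step's result is the surviving set. */
static uint32_t
toy_run_steps(const ToyPruneStep *steps, int nsteps, int nparts)
{
    uint32_t results[TOY_MAX_STEPS];
    uint32_t all_parts;

    assert(nsteps > 0 && nsteps <= TOY_MAX_STEPS);
    assert(nparts > 0 && nparts < 32);
    all_parts = ((uint32_t) 1 << nparts) - 1;

    for (int i = 0; i < nsteps; i++)
    {
        const ToyPruneStep *s = &steps[i];

        if (s->kind == TOY_STEP_OP)
            results[i] = toy_base_step(s->strategy, s->value, nparts);
        else
        {
            uint32_t acc = (s->combine_op == TOY_INTERSECT) ? all_parts : 0;

            for (int j = 0; j < s->nsources; j++)
            {
                assert(s->source_ids[j] >= 0 && s->source_ids[j] < i);
                if (s->combine_op == TOY_INTERSECT)
                    acc &= results[s->source_ids[j]];
                else
                    acc |= results[s->source_ids[j]];
            }
            results[i] = acc;
        }
    }
    return results[nsteps - 1];
}
```

Under these assumptions, a predicate such as key >= 10 AND key < 25 becomes two Op steps plus one intersecting Combine step, while key = 5 OR key = 35 becomes two Op steps plus one union Combine step.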

6.8 `PartitionedRelPruningData` / `PartitionPruneState`

Execution-time pruning state is defined in execPartition.h:

typedef struct PartitionedRelPruningData
{
    Relation    partrel;
    int         nparts;
    int        *subplan_map;
    int        *subpart_map;
    int        *leafpart_rti_map;
    Bitmapset  *present_parts;
    List       *initial_pruning_steps;
    List       *exec_pruning_steps;
    PartitionPruneContext initial_context;
    PartitionPruneContext exec_context;
} PartitionedRelPruningData;

typedef struct PartitionPruneState
{
    ExprContext *econtext;
    Bitmapset   *execparamids;
    Bitmapset   *other_subplans;
    MemoryContext prune_context;
    bool        do_initial_prune;
    bool        do_exec_prune;
    int         num_partprunedata;
    PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
} PartitionPruneState;

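So execution-time pruning is not a plain reuse of the planner's result. Instead:

- the planner produces step descriptions
- the executor binds those steps into executable state
- the steps are run at one of two times, startup or per-scan

7. Building the key catalog and cache structures

7.1 How the partition key gets from the catalog into the relcache

The entry point is RelationGetPartitionKey():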

/*
 * RelationGetPartitionKey -- get partition key, if relation is partitioned
 *
 * Note: partition keys are not allowed to change after the partitioned rel
 * is created.
 */
PartitionKey
RelationGetPartitionKey(Relation rel)
{
    if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
        return NULL;

    if (unlikely(rel->rd_partkey == NULL))
        RelationBuildPartitionKey(rel);

    return rel->rd_partkey;
}

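Design points:

- the partition key is treated as immutable once the table is created
- so it can be cached in the relcache for the long term
- once the relation is open, the catalog need not be rescanned each time

RelationBuildPartitionKey() then reads the metadata from pg_partitioned_table: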

form = (Form_pg_partitioned_table) GETSTRUCT(tuple);
key->strategy = form->partstrat;
key->partnatts = form->partnatts;

attrs = form->partattrs.values;

datum = SysCacheGetAttrNotNull(PARTRELID, tuple,
                               Anum_pg_partitioned_table_partclass);
opclass = (oidvector *) DatumGetPointer(datum);

datum = SysCacheGetAttr(PARTRELID, tuple,
                        Anum_pg_partitioned_table_partexprs, &isnull);

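It then fills in, for each partition key position:

- the opfamily
- the support function
- the collation
- the type information

That is where PartitionKeyData comes from.

7.2 How the partition descriptor is built

The entry point is RelationGetPartitionDesc():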

/*
 * We keep two partdescs in relcache: rd_partdesc includes all partitions
 * (even those being concurrently marked detached), while rd_partdesc_nodetached
 * omits (some of) those.
 */
PartitionDesc
RelationGetPartitionDesc(Relation rel, bool omit_detached)

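This is a key design point that is easy to overlook:

- PostgreSQL does not cache just one PartitionDesc
- it distinguishes between:
  - a version that includes detach-pending partitions
  - a version that filters detached partitions according to the active snapshot

The reason:

- DETACH CONCURRENTLY makes a partition visible to some snapshots and invisible to others
- so a single global cache will not do

The actual construction happens in RelationBuildPartitionDesc(), whose core flow is:

- read pg_inherits via find_inheritance_children_extended()
- obtain each child OID
- read pg_class.relpartbound for each child
- call partition_bounds_create() to build the PartitionBoundInfo
- produce the PartitionDescData

The core comment reads: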

/*
 * Get partition oids from pg_inherits.  This uses a single snapshot to
 * fetch the list of children, so while more children may be getting added
 * or removed concurrently, whatever this function returns will be
 * accurate as of some well-defined point in time.
 */

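This shows how much weight the design puts on snapshot consistency of the partition set.

8. Inheritance tree scans and concurrent visibility

8.1 `find_inheritance_children_extended()`

This is one of the low-level entry points for inheritance and partition scans: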

List *
find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
                                   LOCKMODE lockmode, bool *detached_exist,
                                   TransactionId *detached_xmin)

Most notable is how it handles detach-pending partitions:

/*
 * If a partition's pg_inherits row is marked "detach pending",
 * ...
 * then that partition is omitted from the output list.
 * This makes partitions invisible depending on
 * whether the transaction that marked those partitions as detached appears
 * committed to the active snapshot.
 */

The core of the decision logic:

if (((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhdetachpending)
{
    if (omit_detached && ActiveSnapshotSet())
    {
        TransactionId xmin;
        Snapshot snap;

        xmin = HeapTupleHeaderGetXmin(inheritsTuple->t_data);
        snap = GetActiveSnapshot();

        if (!XidInMVCCSnapshot(xmin, snap))
            continue;
    }
}

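The implementation is very much in PostgreSQL's style:

- a detach does not make the partition vanish for everyone at once
- visibility is decided jointly by the xmin of the catalog tuple and the caller's active snapshot: the partition is omitted only when the transaction that marked it detach-pending appears committed to that snapshot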
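
The visibility rule can be sketched in isolation. This is a simplified model, not the server's code: ToySnapshot and both functions are hypothetical, transaction IDs are plain unsigned integers with no wraparound, and aborted transactions are ignored. The function toy_xid_in_snapshot plays the role of XidInMVCCSnapshot(), which reports whether a transaction still counts as in progress for a snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy MVCC snapshot: xids below xmin are finished, xids at or above xmax
 * had not started yet, and xip[] lists the ones in between still running.
 */
typedef struct ToySnapshot
{
    unsigned int        xmin;
    unsigned int        xmax;
    const unsigned int *xip;
    int                 xcnt;
} ToySnapshot;

/* True if xid still counts as in progress for this snapshot. */
static bool
toy_xid_in_snapshot(unsigned int xid, const ToySnapshot *snap)
{
    assert(snap != NULL);
    if (xid < snap->xmin)
        return false;
    if (xid >= snap->xmax)
        return true;
    for (int i = 0; i < snap->xcnt; i++)
    {
        if (snap->xip[i] == xid)
            return true;
    }
    return false;
}

/* Should a child with this pg_inherits row appear in the children list? */
static bool
toy_child_listed(bool detach_pending, unsigned int row_xmin,
                 bool omit_detached, const ToySnapshot *active)
{
    if (!detach_pending)
        return true;
    if (!omit_detached || active == NULL)
        return true;
    /* keep the child only while the detaching transaction looks unfinished */
    return toy_xid_in_snapshot(row_xmin, active);
}
```

The design consequence is that two callers holding different snapshots can get different child lists from the same pg_inherits rows, which is why a single cached list is not enough.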

8.2 `find_all_inheritors()`

For traditional inheritance, the planner and DDL often want the whole subtree directly:

/*
 * Returns a list of relation OIDs including the given rel plus
 * all relations that inherit from it, directly or indirectly.
 */
List *
find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)

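Implementation points:

- breadth-first, queue-style traversal starting from the root
- de-duplication with a hash table
- optional counting of how many parents each relation was reached through, to support multiple inheritance

This path mainly serves:

- planner expansion of traditional inheritance
- locking or traversing a whole subtree in some DDL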
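
A minimal sketch of that traversal, assuming the pg_inherits rows are given as an in-memory edge array. The names ToyInheritEdge and toy_find_all_inheritors are hypothetical, relations are plain ints, no locks are taken, and a linear scan stands in for the hash table used for de-duplication.

```c
#include <assert.h>

#define TOY_MAX_RELS 32

/* One pg_inherits-like row: child inherits from parent. */
typedef struct ToyInheritEdge
{
    int child;
    int parent;
} ToyInheritEdge;

/*
 * Collect root plus every direct or indirect child, in discovery order.
 * out_oids and out_numparents need room for TOY_MAX_RELS entries.
 * Returns the number of relations found.
 */
static int
toy_find_all_inheritors(int root, const ToyInheritEdge *edges, int nedges,
                        int *out_oids, int *out_numparents)
{
    int n = 0;

    out_oids[n] = root;
    out_numparents[n] = 0;
    n++;

    /* out_oids doubles as the work queue: entries from head on are unexpanded */
    for (int head = 0; head < n; head++)
    {
        for (int e = 0; e < nedges; e++)
        {
            int pos;

            if (edges[e].parent != out_oids[head])
                continue;

            /* linear scan stands in for the hash-table lookup */
            for (pos = 0; pos < n; pos++)
            {
                if (out_oids[pos] == edges[e].child)
                    break;
            }

            if (pos < n)
                out_numparents[pos]++;      /* reached through another parent */
            else
            {
                assert(n < TOY_MAX_RELS);
                out_oids[n] = edges[e].child;
                out_numparents[n] = 1;
                n++;
            }
        }
    }
    return n;
}
```

For a diamond where 2 and 3 inherit from 1 and 4 inherits from both, the result lists each relation once and records two parents for relation 4.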

9. DDL design and implementation

9.1 How `CREATE TABLE` branches into inheritance and partitioning

DefineRelation() is the main entry point:

if (stmt->partspec != NULL)
{
    if (relkind != RELKIND_RELATION)
        elog(ERROR, "unexpected relkind: %d", (int) relkind);

    relkind = RELKIND_PARTITIONED_TABLE;
    partitioned = true;
}
else
    partitioned = false;

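A design principle is already visible here:

- PARTITION BY is not an extra flag on an ordinary table
- it directly makes the relation kind RELKIND_PARTITIONED_TABLE

It then chooses the lock mode on the parent depending on whether the new table is a partition: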

/*
 * If the child table is a partition, then we instead grab an exclusive
 * lock on the parent because its partition descriptor will be changed by
 * addition of the new partition.
 */
parentLockmode = (stmt->partbound != NULL ? AccessExclusiveLock :
                  ShareUpdateExclusiveLock);

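That is:

- adding a plain inheritance child takes only ShareUpdateExclusiveLock on the parent
- creating a new partition changes the parent's PartitionDesc, so here it takes the stronger AccessExclusiveLock

9.2 `StoreSingleInheritance()`: where the shared parent-child edge is written

Whether for plain inheritance or partitioning, the parent-child edge ends up in pg_inherits: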

/*
 * Create a single pg_inherits row with the given data
 */
void
StoreSingleInheritance(Oid relationId, Oid parentOid, int32 seqNumber)
{
    values[Anum_pg_inherits_inhrelid - 1] = ObjectIdGetDatum(relationId);
    values[Anum_pg_inherits_inhparent - 1] = ObjectIdGetDatum(parentOid);
    values[Anum_pg_inherits_inhseqno - 1] = Int32GetDatum(seqNumber);
    values[Anum_pg_inherits_inhdetachpending - 1] = BoolGetDatum(false);
    CatalogTupleInsert(inhRelation, tuple);
}

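So "partitioning is enhanced inheritance" is not just a way of speaking: the two physically share the same catalog edge.

9.3 Writing the partition bound

When creating or attaching a partition, tablecmds.c calls: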

StorePartitionBound(rel, parent, bound);

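So creating a partition must not only:

- write pg_inherits

but also:

- write pg_class.relpartbound
- maintain pg_partitioned_table.partdefid
- invalidate the affected relcache entries

9.4 Design focus of `ATTACH PARTITION`

The entry point is ATExecAttachPartition().

It reflects the strong structural-correctness constraints PostgreSQL places on declarative partitioning.

9.4.1 No mixing a plain inheritance hierarchy directly into a partition hierarchy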

/* A partition can only have one parent */
if (attachrel->rd_rel->relispartition)
    ereport(ERROR, ...);

...
if (HeapTupleIsValid(systable_getnext(scan)))
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("cannot attach inheritance child as partition")));

...
if (HeapTupleIsValid(systable_getnext(scan)) &&
    attachrel->rd_rel->relkind == RELKIND_RELATION)
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("cannot attach inheritance parent as partition")));

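These restrictions show that:

- plain inheritance and declarative partitioning share the underlying edge model
- but at the SQL semantics level they cannot be spliced together freely

9.4.2 Cycle prevention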

attachrel_children = find_all_inheritors(RelationGetRelid(attachrel),
                                         AccessExclusiveLock, NULL);
if (list_member_oid(attachrel_children, RelationGetRelid(rel)))
    ereport(ERROR,
            (errcode(ERRCODE_DUPLICATE_TABLE),
             errmsg("circular inheritance not allowed")));

9.4.3 Checking that the new bound does not overlap

check_new_partition_bound(RelationGetRelationName(attachrel), rel,
                          cmd->bound, pstate);

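This step is one of the central structural constraints that sets partitioning apart from plain inheritance.

9.4.4 Completing dependent objects automatically after the attach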

attachPartitionTable(wqueue, rel, attachrel, cmd->bound);

...
AttachPartitionEnsureIndexes(wqueue, rel, attachrel);
CloneRowTriggersToPartition(rel, attachrel);

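So attaching a partition is more than inserting one pg_inherits row. It has to:

- create the parent-child edge
- write the bound
- create any missing indexes
- clone triggers
- handle foreign keys and constraint validation

9.5 DDL structure

- CREATE TABLE / ATTACH PARTITION
- DefineRelation / ATExecAttachPartition
- check the parent-child relationship, locks, cycles, persistence
- StoreSingleInheritance -> pg_inherits
- StorePartitionBound -> pg_class.relpartbound
- update pg_partitioned_table.partdefid
- invalidate relcache / validate constraints / sync indexes and triggers

10. Planner design

10.1 Main entry point: `expand_inherited_rtentry()`

When the planner meets rte->inh = true, it comes here: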

/*
 * "inh" on a plain RELATION RTE means that it is a partitioned table or the
 * parent of a traditional-inheritance set.
 */
void
expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
                         RangeTblEntry *rte, Index rti)

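It then takes one of two paths:

- RELKIND_PARTITIONED_TABLE -> expand_partitioned_rtentry()
- otherwise -> find_all_inheritors() to expand traditional inheritance

10.2 The traditional inheritance path

Traditional inheritance uses: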

inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

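Then, for each child, it:

- builds a child RTE
- builds an AppendRelInfo
- builds a child RelOptInfo

Key characteristics:

- the parent table itself is one member of the set
- the structure is close to attaching a flat group of tables under an Append

10.3 The partitioning path: `expand_partitioned_rtentry()`

Core logic: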

partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
                                    parentrel);

relinfo->live_parts = prune_append_rel_partitions(relinfo);

...
while ((i = bms_next_member(relinfo->live_parts, i)) >= 0)
{
    Oid childOID = partdesc->oids[i];
    ...
    childrelinfo = build_simple_rel(root, childRTindex, relinfo);
    relinfo->part_rels[i] = childrelinfo;

    if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        expand_partitioned_rtentry(...);
}

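Design characteristics:

- fetch the PartitionDesc first
- then do plan-time pruning
- create child rels only for live partitions
- recurse into non-leaf partitions

The most important comment is: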

/*
 * Create a child RTE for each live partition.
 * ...
 * unlike traditional inheritance, we don't need a child RTE for the
 * partitioned table itself, because it's not going to be scanned.
 */

10.4 The partition hierarchy is not fully flattened the way legacy inheritance is

The comment in expand_single_inheritance_child() is clear:

/*
 * We now expand the partition hierarchy level by level, creating a
 * corresponding hierarchy of AppendRelInfos and RelOptInfos, where each
 * partitioned descendant acts as a parent of its immediate partitions.
 *
 * (This is a difference from what older versions of PostgreSQL did and what
 * is still done in the case of table inheritance for unpartitioned tables,
 * where the hierarchy is flattened during RTE expansion.)
 */

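This passage matters. It says:

- plain inheritance: closer to a flat append set
- partitioning: a hierarchical partition tree is preserved

The reason is natural:

- the partition hierarchy itself carries bound and pruning information
- flattening would get in the way of multi-level pruning, routing, and metadata mapping

10.5 Planner structure

- expand_inherited_rtentry: what kind of parent is it?
  - plain inheritance
    - find_all_inheritors
    - build RTE / AppendRelInfo / RelOptInfo for each child
  - partitioned table
    - PartitionDirectoryLookup
    - prune_append_rel_partitions
    - build child rels for live_parts only
    - is the child itself partitioned?
      - yes: expand it recursively
      - no: the leaf partition becomes a scan candidate

11. Query pruning design

11.1 Plan-time pruning: `prune_append_rel_partitions()`

This is where the planner actually prunes: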

/*
 * Process rel's baserestrictinfo and make use of quals which can be
 * evaluated during query planning in order to determine the minimum set
 * of partitions which must be scanned.
 */
Bitmapset *
prune_append_rel_partitions(RelOptInfo *rel)

Main steps:

- take the predicates from baserestrictinfo
- call gen_partprune_steps() to generate pruning steps
- build a PartitionPruneContext
- call get_matching_partitions() to compute the surviving partitions

The step-generation logic:

/*
 * 'target' tells whether to generate pruning steps for planning (use
 * immutable clauses only), or for executor startup ...
 */
static void
gen_partprune_steps(...)

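So PostgreSQL explicitly splits pruning into three target phases:

- planner: may use only immutable information
- executor startup: may use more values that are stable at run time, but not PARAM_EXEC
- executor per-scan: may depend on PARAM_EXEC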
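
The three-way split can be written down as a small classification. This is an illustrative sketch, not partprune.c logic: the enums and toy_earliest_phase are hypothetical names, and the grouping of external (bind) parameters with stable values and the rejection of volatile expressions are stated here as assumptions about how the comparison value is categorized.

```c
#include <assert.h>

/* What the value being compared with the partition key looks like. */
typedef enum ToyValueKind
{
    TOY_VALUE_CONST,        /* literal or immutable expression known at plan time */
    TOY_VALUE_STABLE,       /* stable function or external (bind) parameter */
    TOY_VALUE_PARAM_EXEC,   /* executor-internal parameter, e.g. an outer-row value */
    TOY_VALUE_VOLATILE      /* volatile expression, assumed never usable */
} ToyValueKind;

typedef enum ToyPrunePhase
{
    TOY_PRUNE_PLANNER,      /* pruning during planning */
    TOY_PRUNE_INITIAL,      /* pruning at executor startup */
    TOY_PRUNE_EXEC,         /* pruning on each scan or rescan */
    TOY_PRUNE_NEVER         /* clause cannot drive pruning at all */
} ToyPrunePhase;

/* Earliest phase at which a clause with this kind of value can prune. */
static ToyPrunePhase
toy_earliest_phase(ToyValueKind kind)
{
    switch (kind)
    {
        case TOY_VALUE_CONST:
            return TOY_PRUNE_PLANNER;
        case TOY_VALUE_STABLE:
            return TOY_PRUNE_INITIAL;
        case TOY_VALUE_PARAM_EXEC:
            return TOY_PRUNE_EXEC;
        case TOY_VALUE_VOLATILE:
            return TOY_PRUNE_NEVER;
    }
    assert(0 && "unknown value kind");
    return TOY_PRUNE_NEVER;
}
```

The later a phase is, the more kinds of values it can see, but the less work it saves, since planning or subplan initialization has already happened.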

11.2 `get_matching_partitions()`: the step executor

foreach(lc, pruning_steps)
{
    PartitionPruneStep *step = lfirst(lc);

    switch (nodeTag(step))
    {
        case T_PartitionPruneStepOp:
            results[step->step_id] =
                perform_pruning_base_step(context,
                                          (PartitionPruneStepOp *) step);
            break;

        case T_PartitionPruneStepCombine:
            results[step->step_id] =
                perform_pruning_combine_step(context,
                                             (PartitionPruneStepCombine *) step,
                                             results);
            break;
    }
}

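This amounts to a small interpreter:

- steps run in list order, so the base steps a combine step refers to have already produced their results
- combine steps then merge those earlier results
- finally the bound offsets are turned into partition indexes

11.3 How the planner's output reaches the executor

The planner also calls make_partition_pruneinfo() to generate run-time pruning metadata, which is later attached to the plan node.

So partition pruning is not something the planner finishes in one go:

- the planner prunes as much as it can up front
- where run-time values remain, the steps are carried into the executor

12. Executor insert routing design

12.1 `ExecSetupPartitionTupleRouting()`

The entry point of the insert path: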

/*
 * ExecSetupPartitionTupleRouting - sets up information needed during
 * tuple routing for partitioned tables
 */
PartitionTupleRouting *
ExecSetupPartitionTupleRouting(EState *estate, Relation rel)

The design principle is stated outright:

/*
 * Here we attempt to expend as little effort as possible in setting up
 * the PartitionTupleRouting.
 * Each partition's ResultRelInfo is built on demand.
 * The reason for this is that a common case is for INSERT to insert a
 * single tuple into a partitioned table and this must be fast.
 */

In other words, insert routing is lazily initialized:

- the root dispatch is built first
- a leaf's ResultRelInfo is created only when a tuple actually lands on it
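
The lazy pattern itself is small enough to sketch. This is a simplified illustration with hypothetical names (ToyRouting, ToyResultRel, and the toy_* functions); it preallocates one pointer slot per partition for brevity, which is an assumption of the sketch and not how execPartition.c sizes its arrays.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a per-partition ResultRelInfo. */
typedef struct ToyResultRel
{
    int  part_index;
    long ntuples;
} ToyResultRel;

typedef struct ToyRouting
{
    int            nparts;
    ToyResultRel **partitions;      /* slot stays NULL until first use */
    int            num_initialized; /* how many slots were actually built */
} ToyRouting;

/* Cheap setup: allocate the slot array only, build no per-partition state. */
static ToyRouting *
toy_setup_routing(int nparts)
{
    ToyRouting *r;

    assert(nparts > 0);
    r = malloc(sizeof(ToyRouting));
    if (r == NULL)
        abort();
    r->partitions = calloc((size_t) nparts, sizeof(ToyResultRel *));
    if (r->partitions == NULL)
        abort();
    r->nparts = nparts;
    r->num_initialized = 0;
    return r;
}

/* Build the partition's state on first hit, reuse it afterwards. */
static ToyResultRel *
toy_get_partition(ToyRouting *r, int partidx)
{
    assert(partidx >= 0 && partidx < r->nparts);
    if (r->partitions[partidx] == NULL)
    {
        ToyResultRel *rel = malloc(sizeof(ToyResultRel));

        if (rel == NULL)
            abort();
        rel->part_index = partidx;
        rel->ntuples = 0;
        r->partitions[partidx] = rel;
        r->num_initialized++;
    }
    return r->partitions[partidx];
}

/* Release everything that was built. */
static void
toy_free_routing(ToyRouting *r)
{
    for (int i = 0; i < r->nparts; i++)
        free(r->partitions[i]);
    free(r->partitions);
    free(r);
}
```

A single-row insert into a table with a thousand partitions then builds state for exactly one of them.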

12.2 `ExecFindPartition()`: actually finding the target leaf

Core flow:

/*
 * ExecFindPartition -- Return the ResultRelInfo for the leaf partition that
 * the tuple contained in *slot should belong to.
 */
ResultRelInfo *
ExecFindPartition(...)
{
    dispatch = pd[0];
    while (dispatch != NULL)
    {
        FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);

        if (partdesc->nparts == 0 ||
            (partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
            ereport(ERROR, ...);

        is_leaf = partdesc->is_leaf[partidx];
        if (is_leaf)
            ...  /* return or initialize the leaf ResultRelInfo */
        else
            ...  /* move down to the next-level dispatch */
    }
}

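In essence it descends the partition tree level by level:

- extract the partition key values at the current level
- find the target partition in the current level's PartitionDesc.boundinfo
- if it is a non-leaf, move to the next level
- if it is a leaf, return the corresponding ResultRelInfo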
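
The descent can be sketched with a toy tree. Everything here is hypothetical and simplified: ToyNode and toy_find_partition are illustrative names, a tuple is an array of ints, each level routes on one column with half-open [lower, upper) ranges, and a linear scan replaces the bound search. Returning NULL stands in for the error the server raises when no partition accepts the row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;

struct ToyNode
{
    const char           *name;
    bool                  is_leaf;
    int                   key_col;   /* tuple column used as key at this level */
    int                   nparts;    /* number of children (non-leaf only) */
    const int            *lower;     /* child i accepts lower[i] <= key < upper[i] */
    const int            *upper;
    const ToyNode *const *children;
};

/* Walk down from the root until a leaf accepts the tuple. */
static const ToyNode *
toy_find_partition(const ToyNode *root, const int *tuple)
{
    const ToyNode *dispatch = root;

    assert(root != NULL && tuple != NULL);
    while (!dispatch->is_leaf)
    {
        int key = tuple[dispatch->key_col];
        int partidx = -1;

        for (int i = 0; i < dispatch->nparts; i++)
        {
            if (key >= dispatch->lower[i] && key < dispatch->upper[i])
            {
                partidx = i;
                break;
            }
        }
        if (partidx < 0)
            return NULL;        /* no partition found for this row */

        dispatch = dispatch->children[partidx];
    }
    return dispatch;
}
```

Each loop iteration corresponds to one level of the partition tree, and each level may route on a different column.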

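12.3 Insert routing flow

- INSERT into the partitioned root
- ExecSetupPartitionTupleRouting
- ExecFindPartition
  - extract the partition key at the current level
  - get_partition_for_tuple
  - hit a leaf?
    - no: switch to the next-level PartitionDispatch and repeat
    - yes: initialize or reuse the leaf ResultRelInfo
- perform the actual write

12.4 Why partitioning needs dedicated executor routing

This is one of the decisive differences between partitioning and plain inheritance.

Plain inheritance has no uniform automatic routing because:

- it has no strongly structured bound metadata
- children are not required to be mutually exclusive or to cover the key space
- the planner and executor cannot decide which table to insert into from the inheritance edges alone

Declarative partitioning, by contrast, has:

- an explicit partition key
- explicit bounds
- overlap checks

which is why the executor can implement automatic routing safely.

13. Execution-time pruning design

13.1 Why planner pruning is not enough

Some predicates have no value available at plan time, for example:

- execution parameters
- nested loop inner params
- stable values that are only determined at execution time

So the executor needs two more pruning phases:

- startup pruning
- per-scan pruning

13.2 `ExecDoInitialPruning()`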

/*
 * ExecDoInitialPruning
 *      Perform runtime "initial" pruning
 */
void
ExecDoInitialPruning(EState *estate)
{
    ...
    prunestate = CreatePartitionPruneState(estate, pruneinfo, ...);
    ...
    if (prunestate->do_initial_prune)
        validsubplans = ExecFindMatchingSubPlans(prunestate, true, ...);
}

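This step happens at executor startup:

- the pruning metadata in the plan is first bound into a PartitionPruneState
- startup pruning is then executed
- only the subplans that may still be accessed are initialized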
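
How a surviving partition set becomes a set of subplans can be sketched with the subplan_map idea from PartitionedRelPruneInfo. This is a single-level illustration only: toy_surviving_subplans is a hypothetical name, sets are 32-bit masks, and sub-partitioned children (the role of subpart_map) are left out. The sketch assumes subplan_map[i] is the subplan index for partition i, or -1 when that partition has no subplan.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate a mask of surviving partitions into a mask of subplans to
 * initialize. subplan_map[i] < 0 means partition i has no subplan, for
 * example because the planner already pruned it.
 */
static uint32_t
toy_surviving_subplans(uint32_t surviving_parts, const int *subplan_map,
                       int nparts)
{
    uint32_t subplans = 0;

    assert(nparts >= 0 && nparts <= 32);
    for (int i = 0; i < nparts; i++)
    {
        bool_like:
        ;
        if (((surviving_parts >> i) & 1u) == 0)
            continue;
        if (subplan_map[i] < 0)
            continue;
        assert(subplan_map[i] < 32);
        subplans |= (uint32_t) 1 << subplan_map[i];
    }
    return subplans;
}
```

Only the subplans whose bits are set would then be initialized; the others never allocate executor state.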

13.3 `ExecInitPartitionExecPruning()`

/*
 * Initialize the data structures needed for runtime "exec" partition
 * pruning
 */
PartitionPruneState *
ExecInitPartitionExecPruning(...)

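It is responsible for:

- carrying the result of initial pruning over to the plan node
- initializing the state needed for later per-scan pruning

13.4 `ExecFindMatchingSubPlans()`

This function recomputes which subplans should be kept, based on the current expression and parameter values.

The overall model is:

- the planner compiles the pruning steps
- the executor executes them at a given moment

This is what lets the partitioning system handle run-time parameters.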

13.5 Execution-time pruning structure

- planner generates PartitionPruneInfo
- executor: ExecDoInitialPruning
  - CreatePartitionPruneState
  - is startup pruning needed?
    - yes: ExecFindMatchingSubPlans(initial=true), then initialize only the surviving subplans
    - no: keep all initial subplans
  - is per-scan pruning needed?
    - yes: ExecInitPartitionExecPruning, then run ExecFindMatchingSubPlans again when parameters change

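14. Implementation details and design trade-offs

14.1 Partitioning still builds its tree through `pg_inherits` rather than a separate parent-child catalog

Benefits:

- reuses inheritance traversal, lock ordering, and dependency management
- the planner and DDL share infrastructure

Costs:

- plain inheritance edges and partition edges must be told apart at the semantic level
- more catalog fields are needed to express partition-specific semantics

14.2 `PartitionDesc` has to take the active snapshot into account

This is the complicated part of DETACH CONCURRENTLY.

Without this design, two requirements could not both be met:

- one transaction should no longer see a detached partition
- another, older transaction should still see it

So partdesc.c cannot simply cache one global array of children.

14.3 The planner keeps the partition tree hierarchical rather than flattening it completely

This makes it easier to:

- keep pruning recursively through multi-level partitions
- preserve the metadata semantics of partitioned parent nodes
- let executor tuple routing descend level by level

14.4 Lazy initialization of tuple routing

The common insert case may touch only one leaf, so initializing every ResultRelInfo in the partition tree up front would cost too much for a single-row insert.

14.5 Pruning uses a step intermediate representation

Advantages:

- planner and executor share one abstraction
- the same pruning logic can run at three points: plan, startup, and per-scan
- AND/OR combinations are easier to express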

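15. Summary: how inheritance and partitioning are implemented

15.1 How plain inheritance is implemented

In summary:

- pg_inherits records the parent-child relationships
- the planner finds the whole subtree through find_all_inheritors()
- an AppendRelInfo and a child RelOptInfo are built for each child
- a query on the parent uses the whole inheritance set as append input

Characteristics:

- simple
- general
- weak structural constraints
- weaker optimization

15.2 How declarative partitioning is implemented

In summary:

- pg_inherits still records the tree edges
- pg_partitioned_table records the partition key
- pg_class.relpartbound records each partition's bound
- PartitionKey and PartitionDesc are built in the relcache
- the planner prunes partitions with dedicated pruning logic
- the executor routes writes with dedicated tuple routing
- attach/detach handle concurrent visibility through a snapshot-aware mechanism

Characteristics:

- more strongly structured
- a smarter write path
- stronger read-path optimization
- more complex concurrency semantics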

16. Recommended order for reading the source

If you want to go deeper into the source, this order is suggested:

- src/include/catalog/pg_inherits.h
  - start with the shared parent-child edge model
- src/include/catalog/pg_partitioned_table.h
  - then the partition-specific metadata
- src/include/utils/partcache.h
  - PartitionKeyData
- src/include/partitioning/partdesc.h
  - PartitionDescData
- src/include/partitioning/partbounds.h
  - PartitionBoundInfoData
- src/backend/utils/cache/partcache.c
  - how the partition key is cached
- src/backend/partitioning/partdesc.c
  - how the partition descriptor is built
- src/backend/optimizer/util/inherit.c
  - how the planner branches between inheritance and partitioning
- src/backend/partitioning/partprune.c
  - the pruning model shared by planner and executor
- src/backend/executor/execPartition.c
  - tuple routing and execution-time pruning
- src/backend/commands/tablecmds.c
  - how CREATE/ATTACH/DETACH DDL is carried out

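17. A guided route through the source

This chapter does not list modules again; it is organized around how to move through the code on a first reading.

17.1 If you want to understand first how the tree is built

Start with these objects:

- pg_inherits
- pg_partitioned_table
- pg_class.relpartbound
- relhassubclass / relispartition

Recommended reading order:

- [pg_inherits.h](file:///e:/udb_pg/src/include/catalog/pg_inherits.h)
- [pg_inherits.c](file:///e:/udb_pg/src/backend/catalog/pg_inherits.c)

Keeping these questions in mind makes it easiest to find the point:

- where the parent-child edge is stored
- where the partition key definition is stored
- where the partition bound is stored
- which catalog rows change on attach / detach

17.2 If you want to understand first why the planner can expand child tables

Read mainly [inherit.c](file:///e:/udb_pg/src/backend/optimizer/util/inherit.c).

The key entry point is [expand_inherited_rtentry](file:///e:/udb_pg/src/backend/optimizer/util/inherit.c#L64-L313).

Follow this thread while reading:

- how rte->inh triggers the expansion
- how it distinguishes:
  - traditional inheritance
  - partitioned tables
- how traditional inheritance calls find_all_inheritors()
- how partitioning goes through expand_partitioned_rtentry()
- how the child RTE / AppendRelInfo / child RelOptInfo are generated

The two passages most worth rereading are: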

/*
 * In the case of traditional inheritance, the first of the generated
 * RTEs is an RTE for the same table, but with inh = false
 * ...
 * For partitioning, we don't need a second RTE because the partitioned table
 * itself has no data and need not be scanned.
 */

and:

/*
 * We now expand the partition hierarchy level by level
 * ...
 * This is a difference from what older versions of PostgreSQL did
 * ... where the hierarchy is flattened during RTE expansion.
 */

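Once these two passages sink in, the planner-level difference between inheritance and partitioning is essentially settled.

17.3 If you want to understand first where the partition key and the partition set come from

Read [partcache.c](file:///e:/udb_pg/src/backend/utils/cache/partcache.c) and [partdesc.c](file:///e:/udb_pg/src/backend/partitioning/partdesc.c) together.

Reading order:

- [RelationGetPartitionKey](file:///e:/udb_pg/src/backend/utils/cache/partcache.c#L41-L60)
- [RelationGetPartitionDesc](file:///e:/udb_pg/src/backend/partitioning/partdesc.c#L53-L110)

Questions to keep in mind:

- how PartitionKeyData is filled from pg_partitioned_table
- how PartitionDescData is assembled from pg_inherits + relpartbound
- why PartitionDesc distinguishes omit_detached

17.4 If you want to understand first why writes can be routed automatically

Go straight to [execPartition.c](file:///e:/udb_pg/src/backend/executor/execPartition.c).

Core reading order:

- [ExecSetupPartitionTupleRouting](file:///e:/udb_pg/src/backend/executor/execPartition.c#L208-L246)
- FormPartitionKeyDatum()
- get_partition_for_tuple()
- ExecInitPartitionDispatchInfo()
- ExecInitPartitionInfo()

Think of it as descending the tree:

- the root dispatch is already prepared
- extract the partition key from the current tuple
- locate the partition index in the current level's boundinfo
- if a non-leaf is hit, keep going down
- if a leaf is hit, fetch or initialize its ResultRelInfo

17.5 If you want to understand first why queries can prune partitions

Read in two steps:

- planner side: partprune.c
- executor side: execPartition.c

Recommended entry points:

- [prune_append_rel_partitions](file:///e:/udb_pg/src/backend/partitioning/partprune.c#L769-L832)
- [ExecDoInitialPruning](file:///e:/udb_pg/src/backend/executor/execPartition.c#L1971-L2028)

Keep three things apart at all times:

- planner pruning: what can be eliminated at plan time
- initial pruning: what can be eliminated at executor startup
- exec pruning: what is recomputed dynamically on each scan

17.6 If you want to understand first why the DDL is so complicated

Read mainly [tablecmds.c](file:///e:/udb_pg/src/backend/commands/tablecmds.c).

Key entry points:

- [DefineRelation](file:///e:/udb_pg/src/backend/commands/tablecmds.c#L765-L978)
- ATExecDetachPartition()
- attachPartitionTable()
- StoreCatalogInheritance1()

When reading the DDL, what deserves attention is not how many things it does but why those things must be done together:

- writing pg_inherits
- writing the bound
- checking for overlap
- updating the default partition constraint
- syncing indexes, triggers, and constraints
- handling visibility for detach concurrently

18. SQL examples and structure diagrams

This chapter uses a few minimal examples to line the source design up with SQL semantics.

18.1 Traditional inheritance example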

CREATE TABLE city (
    id bigint,
    name text
);

CREATE TABLE capital (
    country text
) INHERITS (city);

The semantics here:

- city is the parent table
- capital is the child table
- capital has:
  - the columns of city
  - its own additional column country

If you run:

SELECT * FROM city;

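by default you see:

- city's own rows
- capital's rows

Structure:

- city (ordinary table, can store data)
  - capital (inheritance child, can store data)

The model is more like:

- one logical parent type
- several member tables that can each store data directly

18.2 Declarative partitioning example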

CREATE TABLE sales (
    id bigint,
    sale_date date,
    amount numeric
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_2024
    PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE sales_2025
    PARTITION OF sales
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

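The semantics here:

- sales is a partitioned table
- sales_2024 / sales_2025 are leaf partitions
- the root table mainly serves as:
  - the logical entry point
  - the partition key definition
  - the query entry point
  - the insert routing entry point

Structure:

- sales (partitioned root, PARTITION BY RANGE(sale_date))
  - sales_2024 [2024-01-01, 2025-01-01)
  - sales_2025 [2025-01-01, 2026-01-01)

18.3 Multi-level partitioning example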

CREATE TABLE sales (
    id bigint,
    sale_date date,
    region text,
    amount numeric
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_2025
    PARTITION OF sales
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01')
    PARTITION BY LIST (region);

CREATE TABLE sales_2025_cn
    PARTITION OF sales_2025
    FOR VALUES IN ('CN');

CREATE TABLE sales_2025_us
    PARTITION OF sales_2025
    FOR VALUES IN ('US');

结构图：
sales: RANGE(sale_date)
sales_2025: LIST(region)
sales_2025_cn
sales_2025_us

这个例子对应源码中的两个重要事实：

planner 会按层级递归展开分区树
executor 会按层级逐层路由 tuple

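18.4 How this is stored in the system catalogs

At the catalog level, the sales examples above can be roughly understood as follows:

sales has relkind RELKIND_PARTITIONED_TABLE in pg_class
pg_partitioned_table has one row for it:
- partrelid = the OID of sales
- partstrat = 'r' (RANGE)
- partattrs = the attribute number of sale_date
sales_2024 / sales_2025 each have one row in pg_inherits
sales_2024 / sales_2025 each store their concrete range in pg_class.relpartbound

With multi-level partitioning:

sales_2025 is at the same time:
- a partition of sales
- a partitioned table in its own right

This is why, in the source, a child relation may itself go through expand_partitioned_rtentry() again.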

18.5 INSERT routing example

Using the multi-level sales table from 18.3:

INSERT INTO sales VALUES (1, DATE '2025-05-20', 'CN', 88.8);

On the executor side this can be abstracted as:

start at the root table sales
sale_date = '2025-05-20' matches sales_2025
continue at sales_2025
region = 'CN' matches sales_2025_cn
write the tuple into sales_2025_cn

Corresponding flow:
INSERT INTO sales (...)
  -> ExecFindPartition, level 1: sale_date = 2025-05-20 -> sales_2025
  -> ExecFindPartition, level 2: region = 'CN' -> sales_2025_cn
  -> write to the leaf partition

18.6 SELECT pruning example

SELECT * FROM sales
WHERE sale_date >= DATE '2025-01-01'
  AND sale_date < DATE '2026-01-01';

The planner can usually infer directly that:

sales_2024 cannot match
only sales_2025 needs to be kept

If the query also adds:

AND region = 'CN'

then in the multi-level case it can additionally prune:

sales_2025_us

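18.7 Why traditional inheritance does not get equally strong automatic optimization

Consider the following inheritance example:

CREATE TABLE parent (k int, v text);
CREATE TABLE child1 (CHECK (k < 100)) INHERITS (parent);
CREATE TABLE child2 (CHECK (k >= 100)) INHERITS (parent);

Semantically it "looks like partitioning", but at the source level it is still not declarative partitioning, because it lacks:

the key definition in pg_partitioned_table
the structured bound in relpartbound
PartitionDesc / PartitionBoundInfo
the routing capability of ExecFindPartition()
a dedicated plan of pruning steps

In other words:

it can do some inference from the CHECK constraints (constraint exclusion)
but it is not part of PostgreSQL's declarative partitioning machinery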

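19. `partprune.c` in depth

This chapter takes partprune.c apart, because it is the most "engine-like" part of partition query optimization.

19.1 What problem it actually solves

The core problem fits in one sentence:

Given the query predicates and the bound information of a partitioned table, exclude partitions that cannot match as early as possible.

This goal splits into three sub-problems:

which predicates can be used for pruning
how those predicates are turned into executable pruning steps
when those steps run in the planner / executor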

19.2 The file header comment already summarizes the overall design

The part of the partprune.c header most worth reading is:

/*
 * This module implements partition pruning using the information contained in
 * a table's partition descriptor, query clauses, and run-time parameters.
 *
 * During planning, clauses that can be matched to the table's partition key
 * are turned into a set of "pruning steps", which are then executed to
 * identify a set of partitions
 *
 * ...
 * A "base" pruning step represents tests on partition key column(s)
 * ...
 * A "combine" pruning step represents a Boolean connector (AND/OR)
 */

These few sentences define the whole architecture:

input: PartitionDesc + query clauses + runtime params
intermediate representation: pruning steps
step types:
- base step
- combine step

19.3 What `gen_partprune_steps()` is responsible for

Entry point:

static void
gen_partprune_steps(RelOptInfo *rel, List *clauses, PartClauseTarget target,
                    GeneratePruningStepsContext *context)

The core of this function is not "generating steps" but "deciding which clauses may be used for the target phase":

PARTTARGET_PLANNER
PARTTARGET_INITIAL
PARTTARGET_EXEC

The source comment spells this out:

/*
 * 'target' tells whether to generate pruning steps for planning (use
 * immutable clauses only), or for executor startup ... or for executor
 * per-scan pruning
 */

This means PostgreSQL explicitly acknowledges that:

some pruning can already be done at plan time
some can only be done at executor startup
some has to be redone on every scan

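19.4 Compiling clauses into steps

You can think of partprune.c as a very small "compiler front end":

walk the WHERE quals
identify which quals can be matched to the partition key
generate PartitionPruneStepOp nodes
generate PartitionPruneStepCombine nodes when AND/OR is encountered
compile the whole pruning logic into a step DAG / sequence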

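19.5 What `match_clause_to_partition_key()` does

The function name was listed earlier in this document; what matters here is its role:

not every WHERE condition can drive pruning
only conditions that can be aligned with the partition key qualify

It tries to recognize:

OpExpr
IS NULL
certain Boolean expressions
ORs that can be expanded

A failed match does not mean the query is wrong. It only means:

this qual cannot take part in pruning
it may still be applied as an ordinary filter during execution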

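19.6 Why pruning uses steps instead of a direct recursive check

Because PostgreSQL has to support all of:

execution at plan time
execution at executor startup
execution per scan in the executor

If pruning were hard-wired into the planner, two problems would arise:

run-time parameters could not be supported in a uniform way
Boolean combinations and multi-level partitioning would be hard to reuse

The step design brings these benefits:

the planner "compiles" first
the executor "executes" later
the same structure is reused across phases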

19.7 `prune_append_rel_partitions()`: the plan-time entry point

Bitmapset *
prune_append_rel_partitions(RelOptInfo *rel)
{
    ...
    gen_partprune_steps(rel, clauses, PARTTARGET_PLANNER, &gcontext);
    ...
    return get_matching_partitions(&context, pruning_steps);
}

This function does three things:

extracts the clauses usable at plan time from baserestrictinfo
compiles them into pruning steps
executes those steps to obtain the surviving partitions

The key design decision is:

the output is not a list of child rels
it is a Bitmapset

That is:

the pruning logic only cares about "which partition indexes are still alive"
building the actual child rels is left to inherit.c afterwards

19.8 `get_matching_partitions()`: the step interpreter

This is the most VM-like part of the file:

foreach(lc, pruning_steps)
{
    PartitionPruneStep *step = lfirst(lc);

    switch (nodeTag(step))
    {
        case T_PartitionPruneStepOp:
            ...
        case T_PartitionPruneStepCombine:
            ...
    }
}

The idea is very clear:

each step produces a PruneStepResult
later steps may depend on the results of earlier steps
the result of the last step is the overall pruning result

The corresponding data structure:

typedef struct PruneStepResult
{
    Bitmapset  *bound_offsets;
    bool        scan_default;
    bool        scan_null;
} PruneStepResult;

Note that what is stored here is:

bound_offsets

and not partition indexes directly.

The reason is:

the low-level pruning operations first match in bound space
only at the end are bound offsets mapped to partition indexes

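19.9 Why `scan_default` and `scan_null` are tracked separately

Because for LIST and RANGE partitioning (HASH has neither of these special partitions):

the NULL-accepting partition
the DEFAULT partition

are not a natural part of the ordinary bound array.

The source handles this very directly:

/*
 * Add the null and/or default partition if needed and present.
 */
if (final_result->scan_null)
    ...

So PruneStepResult is not just a single bitmap, but:

the set of ordinary bounds that matched
whether the default partition must be scanned
whether the null partition must be scanned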
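
The interpreter loop and the three-field result can be modeled outside PostgreSQL. The following is a minimal standalone sketch, not the partprune.c code: `Step`, `StepResult`, and `run_steps` are hypothetical names, bound offsets are packed into a `uint32_t` instead of a Bitmapset, combine steps take exactly two inputs, and the way AND/OR treat the two flags is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for PartitionPruneStep / PruneStepResult. */
typedef enum { STEP_OP, STEP_AND, STEP_OR } StepKind;

typedef struct {
    StepKind kind;
    uint32_t matched;      /* STEP_OP: bit i set if bound offset i matches */
    bool     scan_default; /* STEP_OP: the default partition may match too */
    bool     scan_null;    /* STEP_OP: the NULL partition may match too */
    int      left, right;  /* STEP_AND / STEP_OR: indexes of earlier steps */
} Step;

typedef struct {
    uint32_t bound_offsets;
    bool     scan_default;
    bool     scan_null;
} StepResult;

/*
 * Run the steps in order. results[] must have room for nsteps entries and
 * nsteps must be at least 1. The result of the last step is the answer.
 */
StepResult run_steps(const Step *steps, size_t nsteps, StepResult *results)
{
    for (size_t i = 0; i < nsteps; i++) {
        const Step *s = &steps[i];

        if (s->kind == STEP_OP) {
            results[i] = (StepResult){ s->matched, s->scan_default, s->scan_null };
        } else {
            StepResult a = results[s->left];
            StepResult b = results[s->right];

            if (s->kind == STEP_AND)
                results[i] = (StepResult){ a.bound_offsets & b.bound_offsets,
                                           a.scan_default && b.scan_default,
                                           a.scan_null && b.scan_null };
            else
                results[i] = (StepResult){ a.bound_offsets | b.bound_offsets,
                                           a.scan_default || b.scan_default,
                                           a.scan_null || b.scan_null };
        }
    }
    return results[nsteps - 1];
}
```

Keeping the two flags next to the bitmap lets a combine step treat the special partitions with the same AND/OR logic as ordinary bounds, without reserving slots for them in the bound array.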

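19.10 How pruning is chained across multiple partition levels

Multi-level pruning is not done globally over one big array. It is processed level by level:

each partitioned rel level has its own PartitionedRelPruneInfo
in the executor the counterpart is PartitionedRelPruningData
descending into a child level only makes sense if the parent level did not prune it

So the structure of multi-level pruning looks like:
top-level PartitionedRelPruneInfo
  prunes some first-level partitions
  for each surviving non-leaf partition, descend to the next level
    second-level PartitionedRelPruneInfo
      prunes second-level partitions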

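19.11 Why run-time pruning is split into initial and exec

This split is a joint design of partprune.c and execPartition.c:

initial
- done once at executor startup
- does not depend on PARAM_EXEC
- avoids initializing subplans that are not needed
exec
- can be recomputed on every scan
- depends on PARAM_EXEC
- suited to cases such as nested loop parameters

This is also why the planner keeps both:

initial_pruning_steps
exec_pruning_steps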
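
The three-way split can be summarized as a decision on the value the partition key is compared against. This is a hedged sketch only: `ComparedValue`, `PrunePhase`, and `earliest_prune_phase` are hypothetical names, and the real classification in partprune.c looks at more properties than these two flags.

```c
#include <stdbool.h>

typedef enum {
    PRUNE_AT_PLAN,          /* value known while planning */
    PRUNE_AT_EXEC_STARTUP,  /* value known once execution starts */
    PRUNE_PER_SCAN          /* value can change between rescans */
} PrunePhase;

/* Hypothetical summary of the expression compared with the partition key. */
typedef struct {
    bool is_constant;      /* a plan-time constant */
    bool uses_exec_param;  /* depends on PARAM_EXEC, e.g. a nested-loop outer row */
} ComparedValue;

/* Earliest phase in which a clause on this value can be used for pruning. */
PrunePhase earliest_prune_phase(ComparedValue v)
{
    if (v.uses_exec_param)
        return PRUNE_PER_SCAN;
    if (v.is_constant)
        return PRUNE_AT_PLAN;
    return PRUNE_AT_EXEC_STARTUP;
}
```

Under these assumptions, a bound parameter of a prepared statement lands in the executor-startup phase, while a join parameter from the outer side of a nested loop forces per-scan pruning.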

19.12 Reading advice for `partprune.c`

If you study this file on its own, do not read it straight from top to bottom. A better order is:

the file header comment
[make_partition_pruneinfo](file:///e:/udb_pg/src/backend/partitioning/partprune.c#L209-L364)
[prune_append_rel_partitions](file:///e:/udb_pg/src/backend/partitioning/partprune.c#L769-L832)
gen_partprune_steps_internal()
match_clause_to_partition_key()
perform_pruning_base_step()
perform_pruning_combine_step()

Read it with a three-layer mental model:

layer 1: what problem it solves
layer 2: how it compiles clauses into steps
layer 3: how it executes steps to obtain the surviving partitions

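19.13 `partprune.c` in one sentence

You can remember it as:

partprune.c is essentially a "small rule compiler plus small interpreter" inside PostgreSQL's partition optimizer: it compiles predicates related to the partition key into pruning steps and executes those steps in the planner or executor to shrink the set of partitions that must be accessed.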

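20. Combined learning paths

If your goal is not just to understand this document but to truly master the code, pick one of the following three paths.

20.1 Path A: from SQL semantics to source code

Suited to a first systematic pass:

start with the SQL examples in chapter 18 of this document
use chapter 4 to understand how the catalogs store things
then read chapter 10 on the planner
finish with chapters 12, 13, and 19 on the executor and pruning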

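20.2 Path B: starting from the planner/executor main line

Suited to readers already familiar with the catalogs:

inherit.c
partprune.c
execPartition.c
finally go back to partcache.c / partdesc.c / tablecmds.c

The benefits of this order are:

it is the fastest way to understand "why queries are fast"
it makes it easiest to connect the design with its run-time effect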

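20.3 Path C: the full chain from DDL to run time

Suited to kernel modification or feature work:

DefineRelation()
ATExecAttachPartition()
StoreSingleInheritance()
RelationBuildPartitionKey()
RelationBuildPartitionDesc()
expand_partitioned_rtentry()
make_partition_pruneinfo()
ExecFindPartition()
ExecDoInitialPruning()

This path is closest to how you think during real development:

how DDL lands in the catalogs
how the relcache builds the in-memory representation
how the planner consumes it
how the executor consumes it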

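21. `execPartition.c` function-by-function walkthrough

This chapter moves execPartition.c from "knowing what it does" to "knowing why its data structures are organized this way".

21.1 Why `PartitionTupleRouting` has two sets of arrays

Recall the definition:

struct PartitionTupleRouting
{
    Relation        partition_root;
    PartitionDispatch *partition_dispatch_info;
    ResultRelInfo **nonleaf_partitions;
    int             num_dispatch;
    int             max_dispatch;
    ResultRelInfo **partitions;
    bool           *is_borrowed_rel;
    int             num_partitions;
    int             max_partitions;
    MemoryContext   memcxt;
};

The point to notice is that it maintains both:

partition_dispatch_info
partitions

This is not duplication. The two kinds of objects are different by nature:

PartitionDispatch
- targets the partitioned (non-leaf) nodes of the partition tree
- answers "how do we keep searching at the next level"
ResultRelInfo
- targets "how do we finally write to the leaf relation"
- answers "the leaf has been found, how do we perform the write"

So it can be understood as:

one set of arrays managing "navigation nodes"
one set of arrays managing "actual write targets"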

21.2 Why `ExecInitRoutingInfo()` records more than a relation pointer

Core code:

/*
 * Set up tuple conversion between root parent and the partition if the
 * two have different rowtypes.
 */
if (ExecGetRootToChildMap(partRelInfo, estate) != NULL)
{
    partRelInfo->ri_PartitionTupleSlot =
        table_slot_create(partrel, &estate->es_tupleTable);
}
else
    partRelInfo->ri_PartitionTupleSlot = NULL;

This reflects a crucial implementation fact:

column order is not necessarily identical across levels of the partition tree, or even between root and leaf
the executor cannot assume "root tuple descriptor == leaf tuple descriptor"

So insert routing must not only find the leaf, it must also prepare:

a tuple conversion map
a dedicated slot for the converted tuple

It then handles the FDW case:

if (partRelInfo->ri_FdwRoutine != NULL &&
    partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
    partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);

This shows that a leaf of a declaratively partitioned table is not necessarily a plain heap table. It may be a foreign table, so the routing layer has to let a different table AM or FDW take over insert preparation.

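21.3 `ExecInitPartitionDispatchInfo()` essentially "builds navigation info for each inner node"

Its key job is not simply to open the relation, but to turn a partitioned node into a state object that can route further down:

pd->reldesc = rel;
pd->key = RelationGetPartitionKey(rel);
pd->keystate = NIL;
pd->partdesc = partdesc;

These four fields stand for four things:

which relation this level is
what the partition key of this level is
the execution state of this level's expression keys
the set of children and the boundinfo of this level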

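21.4 Why sub-partitioned nodes store `tupmap + tupslot`

The source comment says it all:

/*
 * For sub-partitioned tables where the column order differs from its
 * direct parent partitioned table, we must store a tuple table slot
 * initialized with its tuple descriptor and a tuple conversion map
 */

This means executor routing over multiple levels does not "always evaluate the root tuple all the way down". Instead:

at the parent level, the key is extracted using the parent's rowtype
if routing continues into a sub-partitioned level whose rowtype differs
the tuple is first converted to the child level's rowtype
then the child level's partition key is evaluated

This design guarantees that:

an expression partition key sees the correct tuple descriptor for its own level
differences in column order cannot cause the wrong column to be read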
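
The conversion step can be pictured as applying an attribute map. The sketch below is a simplified standalone model, not PostgreSQL's tuple conversion code: `convert_row` is a hypothetical helper, column values are plain `long`s, and the map convention (1-based parent column number per child column, 0 for "no source") is stated here as an assumption.

```c
#include <stddef.h>

/*
 * attmap[i] is the 1-based parent column that feeds child column i,
 * or 0 if the child column has no counterpart in the parent.
 */
void convert_row(const long *parent_vals, const int *attmap,
                 size_t child_natts, long *child_vals)
{
    for (size_t i = 0; i < child_natts; i++)
        child_vals[i] = attmap[i] > 0 ? parent_vals[attmap[i] - 1] : 0;
}
```

For example, if the parent is laid out as (id, sale_date, region) and the child as (region, id, sale_date), the map is {3, 1, 2}. Evaluating the child's key on the converted row then reads region from the child's first column, which is the guarantee described above.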

21.5 Why non-leaf partitions also get a "minimally valid" `ResultRelInfo`

Key code:

/*
 * If setting up a PartitionDispatch for a sub-partitioned table, we may
 * also need a minimally valid ResultRelInfo for checking the partition
 * constraint later; set that up now.
 */
if (parent_pd)
{
    ResultRelInfo *rri = makeNode(ResultRelInfo);

    InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
    proute->nonleaf_partitions[dispatchidx] = rri;
}

This step is easy to overlook, but it reveals an important executor detail:

even if a node is not the final write target
it may still be needed as a "constraint checking context"

In other words, routing is not only "finding the leaf". While descending, it may also verify:

whether the current tuple satisfies the partition constraint of the intermediate level

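21.6 `FormPartitionKeyDatum()`: the value extractor for insert routing

Core implementation:

if (keycol != 0)
    datum = slot_getattr(slot, keycol, &isNull);
else
    datum = ExecEvalExprSwitchContext((ExprState *) lfirst(partexpr_item),
                                      GetPerTupleExprContext(estate),
                                      &isNull);

It handles both kinds of partition key uniformly:

plain columns
expression keys

The key point is not "it can evaluate expressions", but:

this step runs on the executor's per-tuple path
so the expression state must be lazily initialized and reused

Hence, on first call it does:

if (pd->key->partexprs != NIL && pd->keystate == NIL)
    pd->keystate = ExecPrepareExprList(pd->key->partexprs, estate);

This matches the overall style of ExecSetupPartitionTupleRouting():

do little up front
initialize on demand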

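21.7 `get_partition_for_tuple()`: the actual partition lookup algorithm

This is one of the most central functions in insert routing.

Its overall design is clear:

HASH: compute hash -> remainder -> index
LIST: single-value lookup, binary search when needed
RANGE: range comparison, binary search when needed

The hash path is the simplest:

rowHash = compute_partition_hash_value(...);
return boundinfo->indexes[rowHash % boundinfo->nindexes];

This shows that run-time lookup for hash partitioning is essentially an O(1) array index.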

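21.8 Why LIST/RANGE use a "last hit" cache

The most important comment in the source:

/*
 * ... if we keep finding the same partition
 * PARTITION_CACHED_FIND_THRESHOLD times in a row, then we'll enable caching
 * logic and instead of performing a binary search ...
 */

This is a very pragmatic optimization:

in many real workloads, long runs of consecutive tuples land in the same partition
especially with time-based partitioning and data written in time order

So PostgreSQL caches in PartitionDesc:

last_found_datum_index
last_found_part_index
last_found_count

Once the same partition has been hit consecutively enough times to reach the threshold of 16:

it no longer binary-searches every time
it first quickly checks "are we still in the same partition as last time"

This kind of optimization is typical of PostgreSQL's implementation style:

it does not build a large framework that is optimal in theory
it adds a very cheap hit cache on the hottest path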
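
To make the threshold behavior concrete, here is a small standalone model. It is a sketch under simplifying assumptions, not the execPartition.c code: `Router`, `route`, and the `bsearch_calls` counter are hypothetical, keys are plain `int`s, and partition i is assumed to cover [lower[i], lower[i+1]) with the last partition unbounded above.

```c
#include <stddef.h>

#define CACHED_FIND_THRESHOLD 16  /* mirrors PARTITION_CACHED_FIND_THRESHOLD */

typedef struct {
    const int *lower;      /* sorted lower bounds, one per partition */
    int  nparts;
    int  last_found;       /* partition found by the previous search, or -1 */
    int  last_count;       /* consecutive searches that found last_found */
    long bsearch_calls;    /* instrumentation for this sketch only */
} Router;

/* Greatest i with lower[i] <= v, or -1 if v is below every bound. */
static int search(Router *r, int v)
{
    int lo = -1, hi = r->nparts - 1;

    r->bsearch_calls++;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;

        if (r->lower[mid] <= v)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

static int covers(const Router *r, int p, int v)
{
    return r->lower[p] <= v && (p + 1 == r->nparts || v < r->lower[p + 1]);
}

int route(Router *r, int v)
{
    int p;

    /* Fast path: only trusted after enough consecutive identical hits. */
    if (r->last_count >= CACHED_FIND_THRESHOLD && covers(r, r->last_found, v))
        return r->last_found;

    p = search(r, v);
    if (p >= 0 && p == r->last_found)
        r->last_count++;
    else {
        r->last_found = p;
        r->last_count = 1;
    }
    return p;
}
```

With a `Router` initialized to `last_found = -1` and `last_count = 0`, routing the same key range repeatedly pays for the binary search only until the threshold is reached. A key outside the cached partition falls back to the search and resets the counter.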

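21.9 Understanding the LIST routing logic

The core of LIST routing is:

bound_offset = partition_list_bsearch(..., values[0], &equal);
if (bound_offset >= 0 && equal)
    part_index = boundinfo->indexes[bound_offset];

This means:

first find "the greatest datum <= value" among the sorted list datums
then check whether it is really equal
if equal, the corresponding partition is the match

If it is not equal, the tuple may:

go to the default partition
or finally fail with a "no matching partition" error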
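
The same lookup can be written as a short standalone sketch. It is an illustration under assumptions rather than the PostgreSQL implementation: `ListBounds` and `route_list` are hypothetical names, values are plain `int`s, and a `default_index` of -1 stands for "no default partition", in which case the caller would raise the error.

```c
#include <stdbool.h>

typedef struct {
    const int *datums;    /* sorted list values */
    const int *indexes;   /* indexes[i] = partition that accepts datums[i] */
    int ndatums;
    int default_index;    /* -1 when there is no default partition */
} ListBounds;

/* Greatest offset with datums[offset] <= value, or -1; *equal = exact match. */
static int list_bsearch(const ListBounds *b, int value, bool *equal)
{
    int lo = -1, hi = b->ndatums - 1;

    *equal = false;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;

        if (b->datums[mid] <= value) {
            lo = mid;
            *equal = (b->datums[mid] == value);
            if (*equal)
                break;
        } else
            hi = mid - 1;
    }
    return lo;
}

/* Partition index for value; falls back to the default partition or -1. */
int route_list(const ListBounds *b, int value)
{
    bool equal;
    int  off = list_bsearch(b, value, &equal);

    if (off >= 0 && equal)
        return b->indexes[off];
    return b->default_index;
}
```

Note that `indexes` decouples the sorted datum order from partition numbering, so several datums can point at the same partition.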

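21.10 Understanding the RANGE routing logic

RANGE is more subtle:

bound_offset = partition_range_datum_bsearch(...);
part_index = boundinfo->indexes[bound_offset + 1];

That is, what it finds is:

the greatest bound that is less than or equal to the current tuple

and the actual candidate partition is:

the partition whose upper bound is the next slot after that bound

This follows directly from how PartitionBoundInfo.indexes is encoded:

for range, indexes does not mean "datum -> the partition itself"
it means "which partition (or gap) this bound is the upper bound of"

This is why earlier chapters keep stressing that:

PartitionBoundInfo is not a verbatim copy of the SQL bounds
it is an index structure encoded to be lookup-friendly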
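
The `offset + 1` convention is easier to see on concrete numbers. The following standalone sketch models it with single-column `int` bounds. `RangeBounds` and `route_range` are hypothetical names, and the layout (ndatums + 1 index slots, -1 for a gap or out-of-range area) is a simplified reading of the encoding described above.

```c
typedef struct {
    const int *datums;    /* sorted, de-duplicated bounds */
    const int *indexes;   /* ndatums + 1 slots: indexes[i] is the partition
                           * whose upper bound is datums[i]; -1 means gap;
                           * the last slot covers values above all bounds */
    int ndatums;
} RangeBounds;

/* Greatest offset with datums[offset] <= value, or -1 if there is none. */
static int range_bsearch(const RangeBounds *b, int value)
{
    int lo = -1, hi = b->ndatums - 1;

    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;

        if (b->datums[mid] <= value)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Partition index for value, or -1 if it falls into a gap or out of range. */
int route_range(const RangeBounds *b, int value)
{
    return b->indexes[range_bsearch(b, value) + 1];
}
```

For two partitions [0, 100) and [200, 300), the arrays would be datums = {0, 100, 200, 300} and indexes = {-1, 0, -1, 1, -1}. The value 150 finds bound 100 at offset 1 and lands in slot 2, the gap. The value 200 finds offset 2 and lands in slot 3, the second partition.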

21.11 Why `ExecCleanupTupleRouting()` distinguishes borrowed from owned

Cleanup logic:

if (proute->is_borrowed_rel[i])
    continue;

ExecCloseIndices(resultRelInfo);
table_close(resultRelInfo->ri_RelationDesc, NoLock);

This shows that a leaf ResultRelInfo can come from two sources:

borrowed from ModifyTableState
created dynamically by tuple routing

Without tracking ownership, you would get:

relations closed twice
executor structures released twice

So is_borrowed_rel is in effect a marker of the resource lifetime boundary.

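21.12 `execPartition.c` in one sentence

You can remember it as:

Tuple routing in execPartition.c essentially maintains a "navigable partition tree state machine" inside the executor: each inner node stores how to keep descending, each leaf stores how to perform the actual write, and tuple conversion, expression state, and the hit cache keep the per-row insert cost low on the hot path.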

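22. Attach / Detach, relcache, and snapshot consistency

This chapter explains why partition DDL looks so much more complex than plain inheritance.

22.1 The essential difficulty of attach/detach

Changing a parent-child edge in plain inheritance is mainly a structural change.

Changing a parent-child edge through partition attach/detach additionally affects:

the parent's PartitionDesc
the default partition constraint
whether the planner can see the partition
whether executor routing can still land in the partition
RI queries and snapshot visibility

So it has to handle in addition:

relcache invalidation
visibility with respect to the active snapshot
the two-transaction protocol of DETACH CONCURRENTLY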

22.2 `MarkInheritDetached()`: the first-phase marker of a concurrent detach

Core logic:

/*
 * Set inhdetachpending for a partition
 */
if (inhForm->inhrelid == RelationGetRelid(child_rel))
{
    newtup = heap_copytuple(inheritsTuple);
    ((Form_pg_inherits) GETSTRUCT(newtup))->inhdetachpending = true;
    CatalogTupleUpdate(...);
}

This step does not delete the parent-child edge immediately. It first marks it as:

pending detach

This intermediate state matters a great deal:

for new snapshots, the partition should be hidden
for old snapshots, it may still have to be visible

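22.3 Why only one pending detach is allowed at a time

The source checks this directly:

if (inhForm->inhdetachpending)
    ereport(ERROR, ...);

The purpose is to keep descriptor visibility and the finalize logic out of ambiguous states.

With several pending detaches at once:

the reuse condition for rd_partdesc_nodetached_xmin would become more complex
the state machine for default partition / RI / finalize would become significantly more complex

PostgreSQL chooses a more conservative but more controllable policy here:

a given parent table allows only one pending detach at a time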

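22.4 The concurrency protocol of `ATExecDetachPartition()`

The comment in the file states the protocol very explicitly:

/*
 * The strategy for concurrency is to first modify the partition's
 * pg_inherit catalog row ...
 * In a second transaction, we wait until all transactions that could have
 * seen the partition as attached are gone, then we remove the rest of
 * partition metadata
 */

So DETACH CONCURRENTLY is not a single-transaction atomic action but a two-phase protocol:

first transaction:
- set inhdetachpending = true
- make the partition invisible to new snapshots
second transaction:
- wait for all transactions that might still see the old state to finish
- then remove the remaining metadata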

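22.5 Why a default partition makes concurrent detach fail outright

The source comment is very candid:

/*
 * Concurrent detaching when a default partition exists is not
 * supported.
 * The main problem is that the default partition
 * constraint would change.
 */

The root cause is:

once a partition is detached
the constraint describing which "remaining values" the default partition accepts changes

while concurrent detach wants to:

keep semantics correct while old and new snapshots coexist

Doing both at once is very hard to implement safely, so PostgreSQL simply decides:

when a default partition exists, concurrent detach is not supported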

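22.6 Why the parent's relcache entry is invalidated before the first phase commits

Key code:

/* Invalidate relcache entries for the parent -- must be before close */
CacheInvalidateRelcache(rel);

This makes sessions that rebuild the PartitionDesc afterwards see that:

the partition should no longer appear in new partition descriptors

Without the relcache invalidation you could end up with:

the catalog already changed
other backends still holding the old cached PartitionDesc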

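22.7 Why the second phase calls `WaitForLockersMultiple()`

Key code:

SET_LOCKTAG_RELATION(tag, MyDatabaseId, parentrelid);
WaitForLockersMultiple(list_make1(&tag), AccessExclusiveLock, false);

What this step waits for is:

the end of every transaction that might still, based on an old plan, old snapshot, or old descriptor, consider the partition attached

Only after they have all finished is the final cleanup safe:

delete the pg_inherits row
clear relpartbound
reset relispartition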

22.8 What `DetachPartitionFinalize()` finally cleans up

Its core actions include:

actually removing the inheritance relationship
clearing pg_class.relpartbound
setting relispartition back to false
handling the semantics of attached objects such as identity / FK / trigger / index

The key code makes this plain:

/* Clear relpartbound and reset relispartition */
new_val[Anum_pg_class_relpartbound - 1] = (Datum) 0;
new_null[Anum_pg_class_relpartbound - 1] = true;
...
((Form_pg_class) GETSTRUCT(newtuple))->relispartition = false;

So after finalize, the table is no longer a partition as far as the catalogs are concerned.

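22.9 Why `RelationGetPartitionDesc()` has to care about the active snapshot

This requirement is derived backwards from the design of detach concurrently.

If PartitionDesc did not distinguish snapshots at all:

old transactions could wrongly lose sight of a partition that should still be visible to them
new transactions could wrongly keep seeing a partition that is already pending detach

That is why partdesc.c has:

rd_partdesc
rd_partdesc_nodetached
rd_partdesc_nodetached_xmin

This is a typical example of catalog concurrency semantics forcing the relcache design.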

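22.10 Detach concurrently in one sentence

You can remember it as:

DETACH CONCURRENTLY is not "delete a parent-child edge right now" but "first make new snapshots stop seeing the edge, then wait for every transaction that may still remember the old edge to exit, and only then clean up the remaining metadata".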

23. `partbounds.c` and the bound lookup algorithms

partbounds.c is one of the most data-structure-oriented files in the whole partitioning system.

23.1 What `partition_bounds_create()` does

Its job is not simply to store the SQL bounds, but to compile them into a lookup-friendly internal structure:

/*
 * Build a PartitionBoundInfo struct from a list of PartitionBoundSpec
 * nodes
 *
 * ... 'datums' array will contain Datum representation of individual bounds
 * ... sorted in a canonical order
 * 'indexes' array will contain as many elements as there are bounds
 */
PartitionBoundInfo
partition_bounds_create(...)

It acts as a "bound compiler":

input: parser-level PartitionBoundSpec
output: a PartitionBoundInfo that the executor/planner can search efficiently

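23.2 Why a canonical order is needed

Because all of the following:

planner pruning
executor routing
overlap check

depend heavily on:

being able to compare in one uniform order
being able to binary-search

So PostgreSQL first arranges the bounds into:

a canonical sort order
canonical partition indexes

and records through a mapping:

original definition order -> canonical order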

23.3 Why HASH bounds use `nindexes = greatest modulus`

Look at create_hash_bounds():

greatest_modulus = hbounds[nparts - 1].modulus;
...
boundinfo->nindexes = greatest_modulus;
boundinfo->indexes = (int *) palloc(greatest_modulus * sizeof(int));

The idea behind this is:

hash routing ultimately looks things up in remainder space
so the most convenient structure is not "one bucket per partition"
but a direct "remainder -> partition index" lookup table

At run time this allows:

return boundinfo->indexes[rowHash % boundinfo->nindexes];
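
A small standalone model shows how such a table can be filled when partitions use different moduli. This is a sketch with hypothetical names (`HashBound`, `build_hash_indexes`, `route_hash`), not the create_hash_bounds() code. It assumes every modulus divides the greatest modulus and that the caller supplies an array large enough for the greatest modulus.

```c
#include <stdint.h>

typedef struct {
    int modulus;
    int remainder;
} HashBound;

/*
 * Fill indexes[0 .. greatest_modulus - 1] with the partition that owns
 * each remainder (-1 if none). Returns the greatest modulus.
 */
int build_hash_indexes(const HashBound *parts, int nparts, int *indexes)
{
    int greatest = 0;

    for (int i = 0; i < nparts; i++)
        if (parts[i].modulus > greatest)
            greatest = parts[i].modulus;

    for (int r = 0; r < greatest; r++)
        indexes[r] = -1;

    /* A partition with a smaller modulus owns several slots of the table. */
    for (int i = 0; i < nparts; i++)
        for (int r = parts[i].remainder; r < greatest; r += parts[i].modulus)
            indexes[r] = i;

    return greatest;
}

/* O(1) lookup: hash value -> remainder -> partition index. */
int route_hash(const int *indexes, int nindexes, uint64_t row_hash)
{
    return indexes[row_hash % (uint64_t) nindexes];
}
```

With partitions (modulus 2, remainder 0), (modulus 4, remainder 1), and (modulus 4, remainder 3), the table becomes {0, 1, 0, 2}: the first partition owns remainders 0 and 2 in the modulus-4 space.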

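23.4 Why LIST lookup finds "the greatest datum <= value"

partition_list_bsearch():

/*
 * Returns the index of the greatest bound datum that is less than equal
 * to the given value
 */

It does not directly "search for an equal value". It is a predecessor-style binary search:

find the greatest position whose datum is <= value
then use is_equal to decide whether it is a real hit

The benefits of this design are:

the same bsearch framework stays consistent with the range search variants
"predecessor position" and "exact match" can be handled uniformly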

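23.5 Why RANGE lookup returns "the greatest bound <= tuple"

partition_range_datum_bsearch():

/*
 * Returns the index of the greatest range bound that is less than or
 * equal to the given tuple
 */

This is the key to range routing.

A range partition is not "one value maps to one partition". The question is:

between which two bounds does the tuple fall

So the lookup naturally first determines:

the bound immediately preceding the current tuple

and then uses indexes[offset + 1] to obtain the actual partition.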

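23.6 Why range is more complex than list

Because the thing being compared for range is not a single value, but:

a multi-column tuple
a bound kind for each column
MINVALUE / MAXVALUE to handle as well

So for range, PartitionBoundInfo additionally stores:

kind

and lookups use:

partition_rbound_datum_cmp()
partition_rbound_cmp()

These are essentially "semantics-aware composite key comparators".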

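23.7 Why `check_new_partition_bound()` is central to structural correctness

This function is responsible for checking:

whether the new partition overlaps an existing partition
whether the strategy-specific additional rules are satisfied

For list, for example:

offset = partition_list_bsearch(...);
if (offset >= 0 && equal)
{
    overlap = true;
    with = boundinfo->indexes[offset];
}

So the overlap check is not a separate body of logic. It:

directly reuses the internal bound lookup structure

This is one of the more elegant points of the partitioning design:

the representation built as the "internal bound structure"
serves query optimization
and also serves DDL structural validity checks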
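
The reuse can be illustrated with a short standalone sketch in which the overlap check is nothing more than the routing search applied to each value of the new partition. `ListBounds` and `find_list_overlap` are hypothetical names and values are plain `int`s. The real function also handles NULL, default partitions, and the other strategies.

```c
#include <stdbool.h>

typedef struct {
    const int *datums;    /* sorted values of existing partitions */
    const int *indexes;   /* indexes[i] = partition that accepts datums[i] */
    int ndatums;
} ListBounds;

/* Greatest offset with datums[offset] <= value, or -1; *equal = exact match. */
static int list_bsearch(const ListBounds *b, int value, bool *equal)
{
    int lo = -1, hi = b->ndatums - 1;

    *equal = false;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;

        if (b->datums[mid] <= value) {
            lo = mid;
            *equal = (b->datums[mid] == value);
            if (*equal)
                break;
        } else
            hi = mid - 1;
    }
    return lo;
}

/*
 * Returns the index of an existing partition that already accepts one of
 * new_values, or -1 if the new partition overlaps nothing.
 */
int find_list_overlap(const ListBounds *b, const int *new_values, int nvalues)
{
    for (int i = 0; i < nvalues; i++) {
        bool equal;
        int  off = list_bsearch(b, new_values[i], &equal);

        if (off >= 0 && equal)
            return b->indexes[off];
    }
    return -1;
}
```

The returned partition index plays the role of `with` in the snippet above: it identifies which existing partition the error message should name.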

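23.8 How bound structure, insert routing, and query pruning relate

The three can be seen as three consumers of one and the same kernel fact:

partbounds.c
- defines "how bounds are encoded"
execPartition.c
- consumes it for tuple routing
partprune.c
- consumes it for pruning

Relationship diagram:
PartitionBoundSpec (DDL input)
  -> partbounds.c: partition_bounds_create
    -> PartitionBoundInfo
      -> execPartition.c: get_partition_for_tuple
      -> partprune.c: get_matching_partitions / base step
      -> DDL in tablecmds.c, via check_new_partition_bound

This also explains why partitioning can be optimized more strongly than plain inheritance:

it is not "many child tables plus some constraints"
it is "the bounds of all child tables compiled into one uniformly searchable data structure"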

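23.9 `partbounds.c` in one sentence

You can remember it as:

partbounds.c is the bound compilation layer of declarative partitioning: it compiles the SQL-level partition definitions into a uniform, sortable, binary-searchable, reusable PartitionBoundInfo, which is then consumed jointly by DDL validation, executor routing, and planner/executor pruning.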

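24. Final conclusion

Looking only at the SQL surface, it is easy to assume that:

inheritance tables and partitioned tables are just "roughly the same feature"

A more accurate source-level understanding is:

Traditional inheritance is PostgreSQL's general parent-child table framework. Declarative partitioning is a complete, dedicated infrastructure layered on top of that framework, adding partition keys, bound structures, pruning plans, executor routing, and concurrent detach semantics.

Therefore:

without the inheritance infrastructure, a partition tree would be hard to implement cheaply
without partition-specific metadata and execution paths, inheritance alone is not enough to support efficient partitioning

This is the core design philosophy of PostgreSQL's current partitioning implementation.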
