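Apache Paimon Primary Key Tables Explained

Primary Key Table

1. Overview

The primary key table is the default table type when creating a table. Users can insert, update, or delete records in it.

A primary key consists of a set of columns whose values uniquely identify each record.

Paimon enforces data ordering by sorting the primary key within each bucket, so queries that filter on the primary key can achieve high performance.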

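2. Data Distribution

A bucket is the smallest storage unit for reads and writes. Each bucket directory contains an LSM tree.

By default, a Paimon table has only one bucket, which allows only single-parallelism reads and writes.

a) Fixed Bucket

Configure a bucket count greater than 0. The bucket of a record is computed as Math.abs(key_hashcode % numBuckets).

Changing the number of buckets (rescaling) can only be done through an offline process. Too many buckets lead to too many small files, and too few buckets lead to poor write performance.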
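
As a minimal sketch of the fixed bucket setting, assuming Flink SQL with a Paimon catalog already in use, the bucket count can be given in the table options. The table name, the columns, and the value 4 are illustrative assumptions, not recommendations.

```sql
-- Hypothetical table: the names and the bucket count are illustrative only.
CREATE TABLE fixed_bucket_demo (
    user_id BIGINT,
    item_count BIGINT,
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    'bucket' = '4'  -- fixed bucket mode: any value greater than 0
);
```

With this setting, each record would be routed to bucket Math.abs(key_hashcode % 4).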

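b) Dynamic Bucket

Configure 'bucket' = '-1'. Keys that arrive first fall into the old buckets, and new keys fall into new buckets, so the distribution of buckets and keys depends on the order in which data arrives. Paimon maintains an index to determine which key corresponds to which bucket.

Paimon automatically expands the number of buckets.

Option 1, 'dynamic-bucket.target-row-num': controls the target number of rows per bucket.
Option 2, 'dynamic-bucket.initial-buckets': controls the initial number of buckets.

Dynamic bucket mode supports only a single write job. Do not start multiple jobs that write to the same partition, because this may cause duplicate data. Even enabling 'write-only' and starting a dedicated compaction job does not make this work.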
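
The options above could be combined as in the following sketch, again assuming Flink SQL with a Paimon catalog in use. The table, its columns, and both numeric values are hypothetical. The primary key includes the partition field dt, so upserts in this table never cross partitions.

```sql
-- Hypothetical table: names and option values are illustrative only.
CREATE TABLE dynamic_bucket_demo (
    dt STRING,
    user_id BIGINT,
    item_count BIGINT,
    PRIMARY KEY (dt, user_id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
    'bucket' = '-1',                               -- dynamic bucket mode
    'dynamic-bucket.target-row-num' = '2000000',   -- target rows per bucket
    'dynamic-bucket.initial-buckets' = '2'         -- buckets created at the start
);
```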

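c) Normal Dynamic Bucket Mode

When updates do not cross partitions (the table has no partitions, or the primary key contains all partition fields), dynamic bucket mode uses a HASH index to maintain the mapping from key to bucket. It requires more memory than fixed bucket mode.

Performance:

In general there is no performance loss, but there is some additional memory consumption: 100 million entries in a partition take about 1 GB more memory. Partitions that are no longer active do not take up memory.
For tables with low update rates, this mode is recommended because it significantly improves performance.

Normal Dynamic Bucket Mode supports sort-compact to speed up queries.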

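d) Cross Partitions Upsert Dynamic Bucket Mode

When upserts need to cross partitions (the primary key does not contain all partition fields), dynamic bucket mode directly maintains the mapping from key to partition and bucket. It uses local disk, and when the streaming write job starts it reads all existing keys in the table to initialize the index. Different merge engines behave differently:

Deduplicate: deletes the data from the old partition and inserts the new data into the new partition.
PartialUpdate & Aggregation: inserts the new data into the old partition.
FirstRow: ignores the new data if an old value exists.

Performance: for tables with a large amount of data, there is a significant performance loss, and initialization takes a long time.

If the upsert does not rely on data that is too old, you can configure an index TTL to reduce the index size and the initialization time:

'cross-partition-upsert.index-ttl': the TTL of the RocksDB index and of initialization. It avoids keeping too many index entries, which would degrade performance, but it may cause duplicate data.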
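
A table that falls into this mode might look like the sketch below, assuming Flink SQL with a Paimon catalog in use. The table, its columns, and the 7-day TTL are hypothetical. The key point is that the primary key leaves out the partition field.

```sql
-- Hypothetical table: names and the TTL value are illustrative only.
CREATE TABLE cross_partition_demo (
    dt STRING,
    user_id BIGINT,
    item_count BIGINT,
    PRIMARY KEY (user_id) NOT ENFORCED            -- does not contain the partition field dt
) PARTITIONED BY (dt) WITH (
    'bucket' = '-1',                              -- dynamic bucket mode
    'cross-partition-upsert.index-ttl' = '7 d'    -- drop index entries older than the TTL
);
```

As noted above, a TTL trades a smaller index for the risk of duplicate rows when a key older than the TTL shows up again.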

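3. Merge Engine

When the Paimon sink receives two or more records with the same primary key, it merges them into one record to keep the primary key unique. By specifying the merge-engine table property, users can choose how records are merged.

a) Deduplicate

deduplicate is the default merge engine. It keeps only the latest record and discards the other records with the same primary key.

If the latest record is a DELETE record, all records with the same primary key are deleted.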
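
The behavior can be tried with a small sketch, assuming Flink SQL with a Paimon catalog in use and statements executed one after another. The table and column names are hypothetical.

```sql
-- Hypothetical table used only to illustrate the deduplicate merge engine.
CREATE TABLE dedup_demo (
    id INT PRIMARY KEY NOT ENFORCED,
    note STRING
) WITH (
    'merge-engine' = 'deduplicate'  -- the default, written out here only for clarity
);

INSERT INTO dedup_demo VALUES (1, 'first');
INSERT INTO dedup_demo VALUES (1, 'second');

-- Only the latest record per key is kept, so id 1 is expected to map to 'second'.
SELECT * FROM dedup_demo;
```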

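b) Partial Update

Configure 'merge-engine' = 'partial-update'. Value columns are updated one by one with the latest data under the same primary key, and null values in incoming records do not overwrite existing values.

Suppose Paimon receives three records:
<1, 23.0, 10, NULL>
<1, NULL, NULL, 'This is a book'>
<1, 25.2, NULL, NULL>

The first column is the primary key, so the final result is:
<1, 25.2, 10, 'This is a book'>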

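Note:

1. Streaming queries: the partial-update merge engine must be used together with the lookup or full-compaction changelog producer. The "input" changelog producer is also supported, but it only returns the input records.

2. By default, partial-update does not accept delete records. There are two ways to handle them:

Configure 'partial-update.ignore-delete' to ignore delete records.
Configure a sequence group to retract partial columns.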
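
The three-record example and the two notes can be put together in a sketch like the following, assuming Flink SQL with a Paimon catalog in use. The table and column names are hypothetical, and 'true' is assumed as the value that enables the ignore-delete option.

```sql
-- Hypothetical table mirroring the <id, price, quantity, description> records above.
CREATE TABLE partial_update_demo (
    id INT PRIMARY KEY NOT ENFORCED,
    price DOUBLE,
    quantity INT,
    description STRING
) WITH (
    'merge-engine' = 'partial-update',
    'partial-update.ignore-delete' = 'true',  -- ignore delete records instead of rejecting them
    'changelog-producer' = 'lookup'           -- needed for streaming queries on this table
);

INSERT INTO partial_update_demo VALUES (1, 23.0, 10, CAST(NULL AS STRING));
INSERT INTO partial_update_demo VALUES (1, CAST(NULL AS DOUBLE), CAST(NULL AS INT), 'This is a book');
INSERT INTO partial_update_demo VALUES (1, 25.2, CAST(NULL AS INT), CAST(NULL AS STRING));

-- Per the example above, the merged row should be 1, 25.2, 10, 'This is a book'.
SELECT * FROM partial_update_demo;
```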

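c) Sequence Group (a sequence group works like a version number)

A single sequence field may not solve the disorder problem of a partial-update table that is updated by multiple streams, because during multi-stream updates the sequence field may be overwritten by the latest data of another stream.

The sequence group mechanism is therefore introduced for partial-update tables. It can:

Solve the disorder during multi-stream updates: each stream defines its own sequence group.

Perform a true partial update, rather than only a non-null update.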

CREATE TABLE T (
    k INT,
    a INT,
    b INT,
    g_1 INT,
    c INT,
    d INT,
    g_2 INT,
    PRIMARY KEY (k) NOT ENFORCED
) WITH (
    'merge-engine'='partial-update',
    'fields.g_1.sequence-group'='a,b',
    'fields.g_2.sequence-group'='c,d'
);

INSERT INTO T VALUES (1, 1, 1, 1, 1, 1, 1);

-- g_2 is null, c, d should not be updated
INSERT INTO T VALUES (1, 2, 2, 2, 2, 2, CAST(NULL AS INT));

SELECT * FROM T; -- output 1, 2, 2, 2, 1, 1, 1

-- g_1 is smaller, a, b should not be updated
INSERT INTO T VALUES (1, 3, 3, 1, 3, 3, 3);

SELECT * FROM T; -- output 1, 2, 2, 2, 3, 3, 3

For fields.<field-name>.sequence-group, the valid comparable data types are: DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and TIMESTAMP_LTZ.

d) Aggregation For Partial Update

You can specify an aggregate function for an input field. All functions supported by the aggregation merge engine are supported.

CREATE TABLE T (
    k INT,
    a INT,
    b INT,
    c INT,
    d INT,
    PRIMARY KEY (k) NOT ENFORCED
) WITH (
    'merge-engine'='partial-update',
    'fields.a.sequence-group' = 'b',
    'fields.b.aggregate-function' = 'first_value',
    'fields.c.sequence-group' = 'd',
    'fields.d.aggregate-function' = 'sum'
);

INSERT INTO T VALUES (1, 1, 1, CAST(NULL AS INT), CAST(NULL AS INT));
INSERT INTO T VALUES (1, CAST(NULL AS INT), CAST(NULL AS INT), 1, 1);
INSERT INTO T VALUES (1, 2, 2, CAST(NULL AS INT), CAST(NULL AS INT));
INSERT INTO T VALUES (1, CAST(NULL AS INT), CAST(NULL AS INT), 2, 2);

SELECT * FROM T; -- output 1, 2, 1, 2, 3

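e) Aggregation

Note: always set table.exec.sink.upsert-materialize to NONE in the Flink SQL TableConfig.

The aggregation merge engine aggregates each value field with the latest data, one by one, under the same primary key, according to the field's aggregate function.

Each field that is not part of the primary key can be given an aggregate function, specified by the fields.<field-name>.aggregate-function table property. If none is specified, last_non_null_value is used as the default.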

CREATE TABLE MyTable (
    product_id BIGINT,
    price DOUBLE,
    sales BIGINT,
    PRIMARY KEY (product_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'aggregation',
    'fields.price.aggregate-function' = 'max',
    'fields.sales.aggregate-function' = 'sum'
);

Example:

Input:
<1, 23.0, 15>
<1, 30.2, 20>

Result (price takes the maximum, sales is summed):
<1, 30.2, 35>

Supported functions:

sum: The sum function aggregates the values across multiple rows. It supports DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, and DOUBLE data types.

product: The product function computes the product of values across multiple rows. It supports DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, and DOUBLE data types.

count: The count function counts the values across multiple rows. It supports INTEGER, BIGINT data types.

max: The max function identifies and retains the maximum value. It supports CHAR, VARCHAR, DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and TIMESTAMP_LTZ data types.

min: The min function identifies and retains the minimum value. It supports CHAR, VARCHAR, DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and TIMESTAMP_LTZ data types.

last_value: The last_value function replaces the previous value with the most recently imported value. It supports all data types.

last_non_null_value: The last_non_null_value function replaces the previous value with the latest non-null value. It supports all data types.

listagg: The listagg function concatenates multiple string values into a single string. It supports STRING data type.

bool_and: The bool_and function evaluates whether all values in a boolean set are true. It supports BOOLEAN data type.

bool_or: The bool_or function checks if at least one value in a boolean set is true. It supports BOOLEAN data type.

first_value: The first_value function retrieves the first value from a data set, even if that value is null. It supports all data types.

first_non_null_value: The first_non_null_value function selects the first non-null value in a data set. It supports all data types.

nested_update: The nested_update function collects multiple rows into one array (so-called 'nested table'). It supports ARRAY data types.

Use fields.<field-name>.nested-key=pk0,pk1,... to specify the primary keys of the nested table. If no key is specified, rows are appended to the array.

Example:

-- orders table
CREATE TABLE orders (
  order_id BIGINT PRIMARY KEY NOT ENFORCED,
  user_name STRING,
  address STRING
);

-- sub orders that have the same order_id
-- belong to the same order
CREATE TABLE sub_orders (
  order_id BIGINT,
  sub_order_id INT,
  product_name STRING,
  price BIGINT,
  PRIMARY KEY (order_id, sub_order_id) NOT ENFORCED
);

-- wide table
CREATE TABLE order_wide (
  order_id BIGINT PRIMARY KEY NOT ENFORCED,
  user_name STRING,
  address STRING,
  sub_orders ARRAY<ROW<sub_order_id BIGINT, product_name STRING, price BIGINT>>
) WITH (
  'merge-engine' = 'aggregation',
  'fields.sub_orders.aggregate-function' = 'nested_update',
  'fields.sub_orders.nested-key' = 'sub_order_id'
);

-- widen
INSERT INTO order_wide

SELECT 
  order_id, 
  user_name,
  address, 
  CAST (NULL AS ARRAY<ROW<sub_order_id BIGINT, product_name STRING, price BIGINT>>) 
FROM orders

UNION ALL 
  
SELECT 
  order_id, 
  CAST (NULL AS STRING), 
  CAST (NULL AS STRING), 
  ARRAY[ROW(sub_order_id, product_name, price)] 
FROM sub_orders;

-- query using UNNEST
SELECT order_id, user_name, address, sub_order_id, product_name, price 
FROM order_wide, UNNEST(sub_orders) AS so(sub_order_id, product_name, price);

collect: The collect function collects elements into an Array. You can set fields.<field-name>.distinct=true to deduplicate elements. It only supports ARRAY type.

merge_map: The merge_map function merges input maps. It only supports MAP type.

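Only sum, product, and count support retraction (UPDATE_BEFORE and DELETE). The other aggregate functions do not support retraction.

To let a function ignore retraction messages, configure 'fields.${field_name}.ignore-retract'='true'.

Note: for streaming queries, the aggregation merge engine must be used together with the lookup or full-compaction changelog producer. The "input" changelog producer is also supported, but it only returns the input records.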
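
The retraction and changelog notes can be combined in one table definition. The sketch below assumes Flink SQL with a Paimon catalog in use and the SQL client's SET syntax. The table and column names are hypothetical.

```sql
-- Required for the aggregation merge engine, as noted at the start of this section.
SET 'table.exec.sink.upsert-materialize' = 'NONE';

-- Hypothetical table: names are illustrative only.
CREATE TABLE retract_demo (
    product_id BIGINT PRIMARY KEY NOT ENFORCED,
    max_price DOUBLE,
    sales BIGINT
) WITH (
    'merge-engine' = 'aggregation',
    'fields.max_price.aggregate-function' = 'max',
    'fields.max_price.ignore-retract' = 'true',   -- max cannot retract, so skip such messages
    'fields.sales.aggregate-function' = 'sum',    -- sum supports retraction natively
    'changelog-producer' = 'lookup'               -- needed for streaming queries on this table
);
```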

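f) First Row

By specifying 'merge-engine' = 'first-row', users keep the first row for each primary key. Unlike the deduplicate merge engine, the first-row merge engine produces only an insert-only changelog.

Note:

1. The first-row merge engine must be used together with the lookup changelog producer.
2. sequence.field cannot be specified.
3. DELETE and UPDATE_BEFORE messages are not accepted. You can configure first-row.ignore-delete to ignore these two kinds of records.

This is of great help in replacing log deduplication in streaming computation.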
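
A log-deduplication table along these lines might be declared as follows, assuming Flink SQL with a Paimon catalog in use. The table and column names are hypothetical.

```sql
-- Hypothetical table: keeps only the first event seen for each event_id.
CREATE TABLE first_row_demo (
    event_id STRING PRIMARY KEY NOT ENFORCED,
    payload STRING,
    event_time TIMESTAMP(3)
) WITH (
    'merge-engine' = 'first-row',
    'changelog-producer' = 'lookup'   -- required together with first-row
);
```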

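4. Changelog Producer

Streaming queries continuously produce the latest changes. By specifying the changelog-producer table property when creating a table, users can choose the pattern of changes produced from table files.

The changelog-producer table property only affects the changelog from table files. It does not affect the external log system.

a) None

The default value is none. Paimon sources can only see the merged changes across snapshots, such as which keys were removed and what the new values of some keys are. These merged changes cannot form a complete changelog, because the old values of the keys cannot be read directly from them.

Flink has a "normalize" operator that keeps the value of each key in state. This operator is costly, and it can be removed by setting 'scan.remove-normalize'.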
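
One way to set this for a single query is a Flink dynamic table options hint, sketched below. This assumes the option takes a boolean value and that MyTable is an existing Paimon table. Both are assumptions made for illustration.

```sql
-- Assumption: 'scan.remove-normalize' is a boolean option, passed here through
-- a dynamic table options hint so that only this query is affected.
SELECT * FROM MyTable /*+ OPTIONS('scan.remove-normalize' = 'true') */;
```

Without the normalize operator, downstream consumers only receive the merged changes described above, so this suits jobs that do not need the old values.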

b) Input

Configure 'changelog-producer' = 'input'. All input records are saved in changelog files and provided to consumers by Paimon sources.

The input changelog producer can be used when the input itself is a complete changelog, such as database CDC or the output of Flink stateful computation.
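
A minimal sketch of such a table, assuming Flink SQL with a Paimon catalog in use. The table and column names are hypothetical.

```sql
-- Hypothetical table fed by a source that already emits a complete changelog (e.g. CDC).
CREATE TABLE input_changelog_demo (
    id BIGINT PRIMARY KEY NOT ENFORCED,
    name STRING
) WITH (
    'changelog-producer' = 'input'
);
```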

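c) Lookup

If the input cannot produce a complete changelog but you still want to avoid the costly normalize operator, consider the 'lookup' changelog producer.

Configure 'changelog-producer' = 'lookup'. Paimon generates the changelog through 'lookup' before committing the data write.

Lookup caches data in memory and on local disk. The following options can be used to tune performance: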

Option	Default	Type	Description
lookup.cache-file-retention	1 h	Duration	The cached files retention time for lookup. After the file expires, if there is a need for access, it will be re-read from the DFS to build an index on the local disk.
lookup.cache-max-disk-size	unlimited	MemorySize	Max disk size for lookup cache, you can use this option to limit the use of local disks.
lookup.cache-max-memory-size	256 mb	MemorySize	Max memory size for lookup cache.

The lookup changelog producer supports changelog-producer.row-deduplicate to avoid generating -U and +U changelog records for the same record.

Note: increase the Flink configuration 'execution.checkpointing.max-concurrent-checkpoints' to improve performance.
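
The lookup settings could be combined as in the following sketch, assuming Flink SQL with a Paimon catalog in use and the SQL client's SET syntax. The table name, its columns, and all option values are illustrative assumptions, not tuned recommendations.

```sql
-- Illustrative value; the note above only says this setting should be increased.
SET 'execution.checkpointing.max-concurrent-checkpoints' = '3';

-- Hypothetical table: names and option values are illustrative only.
CREATE TABLE lookup_changelog_demo (
    id BIGINT PRIMARY KEY NOT ENFORCED,
    name STRING
) WITH (
    'changelog-producer' = 'lookup',
    'changelog-producer.row-deduplicate' = 'true',  -- skip -U/+U pairs for unchanged rows
    'lookup.cache-max-memory-size' = '512 mb',      -- default is 256 mb
    'lookup.cache-file-retention' = '2 h'           -- default is 1 h
);
```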

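d) Full Compaction

If the resource consumption of 'lookup' is too high, consider the 'full-compaction' changelog producer, which decouples data writing from changelog generation. It is better suited to scenarios that tolerate high latency (for example, 10 minutes).

Specify 'changelog-producer' = 'full-compaction'. Paimon compares the results between full compactions and produces the differences as the changelog, so the latency of the changelog is affected by the frequency of full compactions.

By specifying the full-compaction.delta-commits table property, a full compaction is triggered after that number of delta commits (checkpoints). It is set to 1 by default, so each checkpoint triggers a full compaction and generates a changelog.

The full-compaction changelog producer can produce a complete changelog for any type of source. However, it is not as efficient as the input changelog producer, and the latency of producing the changelog may be high.

The full-compaction changelog producer supports changelog-producer.row-deduplicate to avoid generating -U and +U changelog records for the same record.

Note: increase the Flink configuration 'execution.checkpointing.max-concurrent-checkpoints' to improve performance.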
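
A sketch of the corresponding table options, assuming Flink SQL with a Paimon catalog in use. The table name, its columns, and the value 5 are hypothetical.

```sql
-- Hypothetical table: names and the delta-commits value are illustrative only.
CREATE TABLE full_compaction_demo (
    id BIGINT PRIMARY KEY NOT ENFORCED,
    name STRING
) WITH (
    'changelog-producer' = 'full-compaction',
    'full-compaction.delta-commits' = '5'   -- full compaction after every 5 checkpoints
);
```

Under these assumed values, a 2-minute checkpoint interval would yield a changelog roughly every 10 minutes, which matches the high-latency scenario described above.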

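5. Sequence and Rowkind

When creating a table, specify 'sequence.field' to choose the field that determines the order of updates, and specify 'rowkind.field' to choose the field that determines the changelog kind of a record.

a) Sequence Field

By default, the primary key table determines the merge order according to the input order (the last input record is the last to be merged). In distributed computing, however, data may arrive out of order. In that case a time field can be used as sequence.field, for example: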

CREATE TABLE MyTable (
    pk BIGINT PRIMARY KEY NOT ENFORCED,
    v1 DOUBLE,
    v2 BIGINT,
    dt TIMESTAMP
) WITH (
    'sequence.field' = 'dt'
);

Regardless of the input order, the record with the largest sequence.field value is the last to be merged.
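
To see this with the MyTable definition above, the following sketch writes the newer record first and the older record second. It assumes the default deduplicate merge engine and statements executed one after another. The values are made up for illustration.

```sql
-- The record with the later dt arrives first.
INSERT INTO MyTable VALUES (1, 2.0, 2, TIMESTAMP '2024-01-02 00:00:00');
-- The record with the earlier dt arrives second, i.e. out of order.
INSERT INTO MyTable VALUES (1, 1.0, 1, TIMESTAMP '2024-01-01 00:00:00');

-- The row with the larger dt is merged last, so pk 1 is expected to keep v1 = 2.0.
SELECT * FROM MyTable;
```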

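Sequence auto padding:

When a record is updated or deleted, its sequence.field must become larger and cannot stay equal. For -U and +U, the sequence field values must differ. If this requirement cannot be met, Paimon provides options to pad the sequence field automatically.

'sequence.auto-padding' = 'row-kind-flag': if -U and +U use the same value, as with "op_ts" (the time the change was made in the database) in the MySQL binlog, padding with the row kind flag is recommended. It automatically distinguishes -U (-D) from +U (+I).
Insufficient precision: if the provided sequence.field does not have enough precision, for example seconds or milliseconds, set sequence.auto-padding to second-to-micro or millis-to-micro so that the value is padded to microsecond precision automatically.
Composite pattern: for example, 'second-to-micro,row-kind-flag' first converts seconds to microseconds and then appends the row kind flag.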
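
For the binlog case, a table definition might look like the sketch below, assuming Flink SQL with a Paimon catalog in use and a Paimon version that provides 'sequence.auto-padding'. The table and column names are hypothetical, and the choice of TIMESTAMP(3) for op_ts is an assumption.

```sql
-- Hypothetical CDC target table: -U and +U of one change share the same op_ts.
CREATE TABLE orders_cdc_demo (
    order_id BIGINT PRIMARY KEY NOT ENFORCED,
    amount DOUBLE,
    op_ts TIMESTAMP(3)
) WITH (
    'sequence.field' = 'op_ts',
    'sequence.auto-padding' = 'row-kind-flag'   -- tells -U (-D) apart from +U (+I)
);
```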

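Row Kind Field:

By default, the primary key table determines the row kind from the input row. You can also define 'rowkind.field' to derive the row kind from a field.

The valid row kind strings are '+I', '-U', '+U', and '-D'.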
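
A sketch of a table that takes its row kind from a column, assuming Flink SQL with a Paimon catalog in use and statements executed one after another. The table and column names are hypothetical.

```sql
-- Hypothetical table: the op column carries one of '+I', '-U', '+U', '-D'.
CREATE TABLE rowkind_demo (
    id BIGINT PRIMARY KEY NOT ENFORCED,
    name STRING,
    op STRING
) WITH (
    'rowkind.field' = 'op'
);

INSERT INTO rowkind_demo VALUES (1, 'a', '+I');  -- treated as an insert
INSERT INTO rowkind_demo VALUES (1, 'a', '-D');  -- treated as a delete of key 1
```

After the second statement, key 1 is expected to be gone from the table, since the record is interpreted as a delete.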