Using Apache Paimon: Querying Tables

Querying Tables

1.Batch Query

A batch read in Paimon returns all the data in a snapshot of the table. By default, a batch read returns the latest snapshot.

-- Flink SQL
SET 'execution.runtime-mode' = 'batch';

2.Batch Time Travel

With time travel, a Paimon batch read can return the data of a specified snapshot or tag.

Flink Dynamic Options

-- read the snapshot with id 1L
SELECT * FROM t /*+ OPTIONS('scan.snapshot-id' = '1') */;

-- read the snapshot from specified timestamp in unix milliseconds
SELECT * FROM t /*+ OPTIONS('scan.timestamp-millis' = '1678883047356') */;

-- read tag 'my-tag'
SELECT * FROM t /*+ OPTIONS('scan.tag-name' = 'my-tag') */;
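
When these statements are generated from application code, the hint text can be assembled from a dictionary of options. The following is a minimal sketch, not part of Paimon: with_options is a hypothetical helper name, and it only formats the same OPTIONS hint shown above without submitting anything to Flink.

```python
def with_options(table: str, options: dict) -> str:
    """Build a SELECT statement that carries Flink dynamic table options as a SQL hint."""
    # Each option becomes 'key' = 'value', matching the hint syntax used above.
    hints = ", ".join(f"'{key}' = '{value}'" for key, value in options.items())
    return f"SELECT * FROM {table} /*+ OPTIONS({hints}) */;"


# Hypothetical usage: the table name t and the option values come from the examples above.
print(with_options("t", {"scan.snapshot-id": "1"}))
print(with_options("t", {"scan.tag-name": "my-tag"}))
```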

Flink 1.18+

-- read the snapshot from specified timestamp
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00';

-- you can also use some simple expressions (see flink document to get supported functions)
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00' + INTERVAL '1' DAY;

Spark3

With Spark 3.3+, you can use VERSION AS OF and TIMESTAMP AS OF in a query to do time travel:

-- read the snapshot with id 1L (use snapshot id as version)
SELECT * FROM t VERSION AS OF 1;

-- read the snapshot from specified timestamp 
SELECT * FROM t TIMESTAMP AS OF '2023-06-01 00:00:00.123';

-- read the snapshot from specified timestamp in unix seconds
SELECT * FROM t TIMESTAMP AS OF 1678883047;

-- read tag 'my-tag'
SELECT * FROM t VERSION AS OF 'my-tag';

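If a tag's name is a number and equals a snapshot id, the VERSION AS OF syntax considers the tag first.

For example, if a tag named 1 is based on snapshot 2, the statement SELECT * FROM t VERSION AS OF '1' actually queries snapshot 2 (that is, tag 1) rather than snapshot 1.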
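
The precedence rule can be pictured with a toy model. This sketch is not Paimon's implementation: resolve_version, the tags dictionary, and the snapshot_ids set are hypothetical names used only to illustrate "tag first, then snapshot id".

```python
def resolve_version(version, tags, snapshot_ids):
    """Return the snapshot id a VERSION AS OF lookup would read in this toy model."""
    key = str(version)
    # A tag with a matching name is considered first.
    if key in tags:
        return tags[key]
    # Otherwise the value is treated as a snapshot id.
    if key.isdigit() and int(key) in snapshot_ids:
        return int(key)
    raise KeyError(f"no tag or snapshot matches {version!r}")


# Tag '1' was created from snapshot 2, so it shadows snapshot 1.
print(resolve_version("1", tags={"1": 2}, snapshot_ids={1, 2}))  # 2
print(resolve_version("1", tags={}, snapshot_ids={1, 2}))        # 1
```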

Spark3-DF

// read the snapshot from specified timestamp in unix milliseconds
spark.read
    .option("scan.timestamp-millis", "1678883047000")
    .format("paimon")
    .load("path/to/table")
    
// read the snapshot with id 1L (use snapshot id as version)
spark.read
    .option("scan.snapshot-id", 1)
    .format("paimon")
    .load("path/to/table")
    
// read tag 'my-tag'
spark.read
    .option("scan.tag-name", "my-tag")
    .format("paimon")
    .load("path/to/table")
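
The examples above mix two units: Spark SQL's TIMESTAMP AS OF with a number takes Unix seconds, while scan.timestamp-millis takes Unix milliseconds. A small conversion helper avoids passing the wrong unit. This is a sketch using only the Python standard library; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to Unix epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def seconds_to_millis(epoch_seconds: int) -> int:
    """Convert Unix seconds (as in TIMESTAMP AS OF 1678883047) to milliseconds."""
    return epoch_seconds * 1000


moment = datetime(2023, 3, 15, 12, 24, 7, tzinfo=timezone.utc)
print(to_epoch_millis(moment))        # 1678883047000
print(seconds_to_millis(1678883047))  # 1678883047000
```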

Hive

Hive requires the following configuration parameters to be added to the hive-site.xml file:

<!--This parameter is used to configure the whitelist of permissible configuration items allowed for use in SQL standard authorization mode.-->
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist</name>
  <value>mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>

<!--This parameter is an additional configuration for hive.security.authorization.sqlstd.confwhitelist. It allows you to add other permissible configuration items to the existing whitelist.-->
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>

-- read the snapshot with id 1L (use snapshot id as version)
SET paimon.scan.snapshot-id=1;
SELECT * FROM t;
SET paimon.scan.snapshot-id=null;

-- read the snapshot from specified timestamp in unix milliseconds
SET paimon.scan.timestamp-millis=1679486589444;
SELECT * FROM t;
SET paimon.scan.timestamp-millis=null;
    
-- read tag 'my-tag'
set paimon.scan.tag-name=my-tag;
SELECT * FROM t;
set paimon.scan.tag-name=null;

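3.Batch Incremental

Read the incremental changes between a start snapshot and an end snapshot.

For example:

'5,10' means the changes between snapshot 5 and snapshot 10.
'TAG1,TAG3' means the changes between TAG1 and TAG3.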

Flink

-- incremental between snapshot ids
SELECT * FROM t /*+ OPTIONS('incremental-between' = '12,20') */;

-- incremental between snapshot time millis
SELECT * FROM t /*+ OPTIONS('incremental-between-timestamp' = '1692169000000,1692169900000') */;
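
The value of incremental-between is a single 'start,end' string. If the endpoints come from variables, a small formatter keeps the value well-formed. This is only a sketch: incremental_between is a hypothetical helper, and its two guards (no mixing of ids and tag names, start id not greater than end id) are choices made for this example rather than documented Paimon behavior.

```python
def incremental_between(start, end):
    """Format a 'start,end' option value from two snapshot ids or two tag names."""
    # Guard chosen for this sketch: both endpoints must be the same kind.
    if type(start) is not type(end):
        raise TypeError("use two snapshot ids or two tag names, not a mix")
    # Guard chosen for this sketch: snapshot ids must be in ascending order.
    if isinstance(start, int) and start > end:
        raise ValueError("start snapshot id must not exceed end snapshot id")
    return f"{start},{end}"


print(incremental_between(12, 20))          # 12,20
print(incremental_between("TAG1", "TAG3"))  # TAG1,TAG3
```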

Spark3

Requires Spark 3.2+.

Paimon supports incremental queries in Spark SQL, implemented as a Spark table-valued function. To enable this feature, the following configuration is required:

--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions

You can use paimon_incremental_query in a query to extract the incremental data:

-- read the incremental data between snapshot id 12 and snapshot id 20.
SELECT * FROM paimon_incremental_query('tableName', 12, 20);

Spark-DF

// incremental between snapshot ids
spark.read
  .format("paimon")
  .option("incremental-between", "12,20")
  .load("path/to/table")

// incremental between snapshot time millis
spark.read
  .format("paimon")
  .option("incremental-between-timestamp", "1692169000000,1692169900000")
  .load("path/to/table")

Hive

-- incremental between snapshot ids
SET paimon.incremental-between='12,20';
SELECT * FROM t;
SET paimon.incremental-between=null;

-- incremental between snapshot time millis
SET paimon.incremental-between-timestamp='1692169000000,1692169900000';
SELECT * FROM t;
SET paimon.incremental-between-timestamp=null;

In batch SQL, DELETE records are not allowed to be returned, so -D records are dropped. If you want to see DELETE records, you can use the audit_log table:

SELECT * FROM t$audit_log /*+ OPTIONS('incremental-between' = '12,20') */;
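
The difference between the two views can be illustrated with a toy changelog. This is a simplified model under stated assumptions, not Paimon's reader: the CHANGES list, batch_view, and audit_log_view are hypothetical, and the sample only contains +I and -D records.

```python
# Each change is (row kind, row), where row is (id, name).
CHANGES = [
    ("+I", (1, "apple")),
    ("+I", (2, "banana")),
    ("-D", (1, "apple")),
]


def batch_view(changes):
    """Plain batch SQL view in this model: row kinds are hidden and -D records are dropped."""
    return [row for kind, row in changes if kind != "-D"]


def audit_log_view(changes):
    """audit_log view in this model: every record is kept, with its row kind as the first column."""
    return [(kind, *row) for kind, row in changes]


print(batch_view(CHANGES))      # [(1, 'apple'), (2, 'banana')]
print(audit_log_view(CHANGES))  # [('+I', 1, 'apple'), ('+I', 2, 'banana'), ('-D', 1, 'apple')]
```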

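4.Streaming Query

By default, a streaming read produces the latest snapshot of the table when it first starts, and then continues to read the latest changes.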

-- Flink SQL
SET 'execution.runtime-mode' = 'streaming';

You can also do a streaming read without the initial snapshot data by using the latest scan mode:

-- Continuously reads latest changes without producing a snapshot at the beginning.
SELECT * FROM t /*+ OPTIONS('scan.mode' = 'latest') */;
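
The two startup behaviors can be contrasted with a toy generator. This is a conceptual sketch only: streaming_read and its arguments are hypothetical, and real streaming reads are unbounded rather than backed by Python lists.

```python
def streaming_read(snapshot_rows, later_changes, scan_mode="default"):
    """Yield what a streaming read would emit in this toy model."""
    if scan_mode == "default":
        # First startup: emit the latest snapshot of the table.
        yield from snapshot_rows
    elif scan_mode != "latest":
        raise ValueError(f"scan mode not covered by this sketch: {scan_mode}")
    # Both modes then continue with the changes that arrive afterwards.
    yield from later_changes


print(list(streaming_read(["a", "b"], ["c"])))                      # ['a', 'b', 'c']
print(list(streaming_read(["a", "b"], ["c"], scan_mode="latest")))  # ['c']
```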

4.Streaming Time Travel

If you only want to process data from today onward, you can filter by partition:

SELECT * FROM t WHERE dt > '2023-06-26';

If the table is not partitioned, or you cannot filter by partition, you can use a streaming read with time travel.

Flink Dynamic Options

-- read changes from snapshot id 1L 
SELECT * FROM t /*+ OPTIONS('scan.snapshot-id' = '1') */;

-- read changes from snapshot specified timestamp
SELECT * FROM t /*+ OPTIONS('scan.timestamp-millis' = '1678883047356') */;

-- read snapshot id 1L upon first startup, and continue to read the changes
SELECT * FROM t /*+ OPTIONS('scan.mode'='from-snapshot-full','scan.snapshot-id' = '1') */;

Flink 1.18+

-- read the snapshot from specified timestamp
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00';

-- you can also use some simple expressions (see flink document to get supported functions)
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00' + INTERVAL '1' DAY;

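A streaming read with time travel relies on snapshots, but by default snapshots are only retained for 1 hour, which can prevent you from reading older incremental data.

Therefore, Paimon also provides another streaming read mode, scan.file-creation-time-millis, which keeps the files generated after timeMillis.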

SELECT * FROM t /*+ OPTIONS('scan.file-creation-time-millis' = '1678883047356') */;

5.Consumer ID

You can specify a consumer-id when streaming read a table.

SELECT * FROM t /*+ OPTIONS('consumer-id' = 'myid') */;

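When a Paimon table is read in streaming mode with a consumer id, the next snapshot id is recorded in the file system. This has the following advantages:

When the previous job stops, the newly started job can continue from the previous progress without restoring from state. The new read starts from the next snapshot id found in the consumer files. If you do not want this behavior, you can set 'consumer.ignore-progress' to true.
When deciding whether a snapshot has expired, Paimon looks at all the consumers of the table in the file system. If a consumer still depends on a snapshot, that snapshot will not be deleted by expiration.
When there is no watermark definition, the Paimon table passes the watermark in the snapshot to the downstream Paimon table, which means you can track the watermark progress of the entire pipeline.

Note: a consumer prevents snapshots from expiring. You can specify 'consumer.expiration-time' to manage the lifetime of consumers.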
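
How consumers hold back expiration can be sketched with a toy model. This is an illustration under an explicit assumption, not Paimon's expiration logic: a consumer is assumed to depend on its recorded next snapshot id and on every later snapshot. The function and argument names are hypothetical.

```python
def expirable_snapshots(candidate_ids, consumer_next_ids):
    """Return the candidate snapshot ids that no consumer still needs in this toy model."""
    if not consumer_next_ids:
        # No consumers registered: nothing holds the candidates back.
        return sorted(candidate_ids)
    # The slowest consumer determines the oldest snapshot that must be kept.
    oldest_needed = min(consumer_next_ids.values())
    return sorted(i for i in candidate_ids if i < oldest_needed)


# Consumer 'myid' will read snapshot 4 next, consumer 'other' snapshot 3.
print(expirable_snapshots([1, 2, 3, 4, 5], {"myid": 4, "other": 3}))  # [1, 2]
```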

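By default, the consumer uses exactly-once mode to record consumption progress, which strictly ensures that the value recorded in the consumer is the snapshot id + 1 that all readers have exactly consumed.

You can set consumer.mode to at-least-once to allow readers to consume snapshots at different rates and record the slowest snapshot id among all readers in the consumer. This mode can provide more capabilities, such as watermark alignment.

Note:

When there is no watermark definition, a consumer in at-least-once mode cannot pass the watermark in the snapshot to the downstream.
Because the implementations of exactly-once mode and at-least-once mode are completely different, the Flink state is incompatible, and a job cannot be restored from state when switching modes.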
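
The two recording rules described above can be written down as a pair of small functions. This is a sketch of the rules as stated in the text, not the Flink source implementation; the function names are hypothetical.

```python
def exactly_once_progress(consumed_snapshot_id: int) -> int:
    """exactly-once: record the snapshot id that all readers have consumed, plus 1."""
    return consumed_snapshot_id + 1


def at_least_once_progress(reader_snapshot_ids) -> int:
    """at-least-once: readers may advance at different rates; record the slowest one."""
    return min(reader_snapshot_ids)


print(exactly_once_progress(7))           # 8
print(at_least_once_progress([9, 7, 8]))  # 7
```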

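You can reset a consumer with a given consumer id and next snapshot id, or delete the consumer with a given consumer id.

First, stop the streaming job that uses this consumer id, and then run the reset-consumer action job.

Flink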

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    reset-consumer \
    --warehouse <warehouse-path> \
    --database <database-name> \
    --table <table-name> \
    --consumer_id <consumer-id> \
    [--next_snapshot <next-snapshot-id>] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]]

If you want to delete the consumer, do not specify the --next_snapshot parameter.
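
When the action is launched from a script, the argument list can be assembled so that --next_snapshot is only present for a reset. This is a minimal sketch: reset_consumer_args is a hypothetical helper, the paths in the usage lines are placeholders, and the flags are the ones shown in the command above. The resulting list could be passed to subprocess.run.

```python
def reset_consumer_args(flink_home, action_jar, warehouse, database, table,
                        consumer_id, next_snapshot=None):
    """Build the argument list for the reset-consumer action shown above."""
    args = [
        f"{flink_home}/bin/flink", "run", action_jar, "reset-consumer",
        "--warehouse", warehouse,
        "--database", database,
        "--table", table,
        "--consumer_id", consumer_id,
    ]
    # Leaving out --next_snapshot deletes the consumer instead of resetting it.
    if next_snapshot is not None:
        args += ["--next_snapshot", str(next_snapshot)]
    return args


# Placeholder paths and names, for illustration only.
reset = reset_consumer_args("/opt/flink", "/path/to/paimon-flink-action.jar",
                            "/tmp/warehouse", "default", "t", "myid", next_snapshot=5)
delete = reset_consumer_args("/opt/flink", "/path/to/paimon-flink-action.jar",
                             "/tmp/warehouse", "default", "t", "myid")
print(" ".join(reset))
print(" ".join(delete))
```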

6.Read Overwrite

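By default, a streaming read ignores the commits generated by INSERT OVERWRITE. If you want to read OVERWRITE commits, you can configure streaming-read-overwrite.

a) Read Parallelism

Flink

By default, the parallelism of a batch read equals the number of splits, and the parallelism of a streaming read equals the number of buckets, but neither is greater than scan.infer-parallelism.max.

Disable scan.infer-parallelism to use the global parallelism configuration instead. You can also specify the parallelism manually with scan.parallelism.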

Key	Default	Type	Description
scan.infer-parallelism	true	Boolean	If it is false, the parallelism of the source is set by the global parallelism. Otherwise, the source parallelism is inferred from the number of splits (batch mode) or the number of buckets (streaming mode).
scan.infer-parallelism.max	1024	Integer	If scan.infer-parallelism is true, limit the parallelism of the source through this option.
scan.parallelism	(none)	Integer	Define a custom parallelism for the scan source. By default, if this option is not defined, the planner derives the parallelism for each statement individually by also considering the global configuration. If the user enables scan.infer-parallelism, the planner uses the inferred parallelism.
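
The inference rule can be summarized in a few lines. This sketch only models the behavior described above (splits for batch, buckets for streaming, capped by scan.infer-parallelism.max, global parallelism when inference is disabled). It deliberately leaves scan.parallelism out, and the function and parameter names are hypothetical.

```python
def source_parallelism(mode, splits, buckets, global_parallelism,
                       infer=True, infer_max=1024):
    """Model the source parallelism chosen under scan.infer-parallelism."""
    if mode not in ("batch", "streaming"):
        raise ValueError(f"unknown mode: {mode}")
    if not infer:
        # scan.infer-parallelism = false: fall back to the global parallelism.
        return global_parallelism
    inferred = splits if mode == "batch" else buckets
    # scan.infer-parallelism.max caps the inferred value.
    return min(inferred, infer_max)


print(source_parallelism("batch", splits=2000, buckets=8, global_parallelism=4))      # 1024
print(source_parallelism("streaming", splits=2000, buckets=8, global_parallelism=4))  # 8
print(source_parallelism("batch", 2000, 8, global_parallelism=4, infer=False))        # 4
```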

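7.Query Optimization

It is highly recommended to specify partition and primary key filters in the query, which speeds up the query.

The filter functions that can accelerate queries are: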

=
<
<=
>
>=
IN (...)
LIKE 'abc%'
IS NULL

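Paimon sorts data by primary key, which speeds up point queries and range queries. With a composite primary key, query filters should match a leftmost prefix of the primary key to get this speedup.

Suppose a table is defined as follows: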

CREATE TABLE orders (
    catalog_id BIGINT,
    order_id BIGINT,
    .....,
    PRIMARY KEY (catalog_id, order_id) NOT ENFORCED -- composite primary key
);

The following queries get a good speedup because they specify a filter on the leftmost prefix of the primary key.

SELECT * FROM orders WHERE catalog_id=1025;

SELECT * FROM orders WHERE catalog_id=1025 AND order_id=29495;

SELECT * FROM orders
  WHERE catalog_id=1025
  AND order_id>2035 AND order_id<6000;

However, the following filters cannot accelerate the query well.

SELECT * FROM orders WHERE order_id=29495;

SELECT * FROM orders WHERE catalog_id=1025 OR order_id=29495;
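
The leftmost-prefix rule can be checked mechanically for filters whose conditions are combined with AND. The sketch below is a simplified illustration, not Paimon's planner: covered_prefix_length is a hypothetical helper, and it does not model OR conditions such as the last query above.

```python
def covered_prefix_length(primary_key, filtered_columns):
    """Count how many leading primary key columns appear in an AND-combined filter."""
    covered = 0
    for column in primary_key:
        if column not in filtered_columns:
            break  # the prefix ends at the first primary key column without a filter
        covered += 1
    return covered


PK = ["catalog_id", "order_id"]
print(covered_prefix_length(PK, {"catalog_id"}))              # 1
print(covered_prefix_length(PK, {"catalog_id", "order_id"}))  # 2
print(covered_prefix_length(PK, {"order_id"}))                # 0
```

A result of 0 corresponds to the queries that cannot be accelerated well, because the filter skips the first primary key column.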