Using Apache Paimon: Pulsar CDC Explained

Pulsar CDC

a) Preparing the dependency

flink-connector-pulsar-*.jar

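b) Supported formats

Flink provides several Pulsar CDC formats: Canal, Debezium, Ogg, and Maxwell JSON.

If the messages in a Pulsar topic are change events captured from another database by a CDC tool, you can use Paimon Pulsar CDC to write the parsed INSERT, UPDATE, and DELETE messages into a Paimon table.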
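
To make the INSERT, UPDATE, and DELETE handling concrete, here is a minimal Python sketch that turns one canal-json message into change rows. It follows the publicly documented canal-json layout (`type`, `data`, `old`) and uses Flink's short row-kind names (`+I`, `-U`, `+U`, `-D`). The function name `canal_to_changes` and the sample record are made up for illustration, and this is not Paimon's actual parser.

```python
import json


def canal_to_changes(message: str):
    """Turn one canal-json message into a list of (row_kind, row) pairs."""
    event = json.loads(message)
    kind = event["type"]
    rows = event.get("data") or []
    # "old" is null for INSERT and DELETE, so fall back to empty dicts.
    olds = event.get("old") or [{}] * len(rows)
    changes = []
    for row, old in zip(rows, olds):
        if kind == "INSERT":
            changes.append(("+I", row))
        elif kind == "DELETE":
            changes.append(("-D", row))
        elif kind == "UPDATE":
            # "old" carries only the columns that changed;
            # the remaining columns are taken from "data".
            before = {**row, **old}
            changes.append(("-U", before))
            changes.append(("+U", row))
    return changes


# Hypothetical sample message, not taken from a real topic.
sample = json.dumps({
    "database": "test_db",
    "table": "orders",
    "type": "UPDATE",
    "data": [{"id": "1", "status": "paid"}],
    "old": [{"status": "new"}],
})

for change in canal_to_changes(sample):
    print(change)
```

For the sample UPDATE, the sketch emits the previous row image first and the new row image second, which is the pair a primary-key table needs in order to replace the old value.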

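Note

The JSON source may be missing some information. For example, the Ogg and Maxwell formats do not contain field types, and when a JSON source is written to the Flink Pulsar sink, only the data and row type are kept while all other information is dropped. The synchronization job handles these gaps on a best-effort basis, as follows:

  • If a field type is missing, Paimon uses the "STRING" type by default.
  • If the database name or table name is missing, database synchronization is not possible, but table synchronization still works.
  • If primary keys are missing, the job may create a table without primary keys. You can set primary keys when submitting a table synchronization job.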
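
The three fallback rules above can be sketched in a few lines of Python. `ParsedEvent` and `resolve_schema` are hypothetical names, and the precedence given to keys supplied on the command line is an assumption made for this illustration, not a description of Paimon's internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParsedEvent:
    """Metadata recovered from one JSON change event (hypothetical structure)."""
    columns: List[str]
    field_types: Dict[str, str] = field(default_factory=dict)
    database: Optional[str] = None
    table: Optional[str] = None
    primary_keys: List[str] = field(default_factory=list)


def resolve_schema(event: ParsedEvent, cli_primary_keys=None):
    # Rule 1: a column without a known type falls back to STRING.
    types = {name: event.field_types.get(name, "STRING") for name in event.columns}
    # Rule 2: database sync needs both names; table sync does not.
    database_sync_possible = event.database is not None and event.table is not None
    # Rule 3 (assumed precedence): keys passed on the command line win,
    # then keys from the source; an empty list means a non-primary-key table.
    primary_keys = list(cli_primary_keys or []) or list(event.primary_keys)
    return {
        "types": types,
        "database_sync_possible": database_sync_possible,
        "primary_keys": primary_keys,
    }


# An event as an Ogg- or Maxwell-style source might deliver it: partial type information.
event = ParsedEvent(columns=["id", "name"], field_types={"id": "INT"})
print(resolve_schema(event, cli_primary_keys=["id"]))
```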

c) Synchronizing Tables

By using PulsarSyncTableAction in a Flink DataStream job, or directly through flink run, you can synchronize one or more tables from a single Pulsar topic into one Paimon table.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    pulsar_sync_table \
    --warehouse <warehouse-path> \
    --database <database-name> \
    --table <table-name> \
    [--partition_keys <partition_keys>] \
    [--primary_keys <primary-keys>] \
    [--type_mapping to-string] \
    [--computed_column <'column-name=expr-name(args[, ...])'> [--computed_column ...]] \
    [--pulsar_conf <pulsar-source-conf> [--pulsar_conf <pulsar-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration | Description
--warehouse | The path to the Paimon warehouse.
--database | The database name in the Paimon catalog.
--table | The Paimon table name.
--partition_keys | The partition keys for the Paimon table. Separate multiple partition keys with commas, for example "dt,hh,mm".
--primary_keys | The primary keys for the Paimon table. Separate multiple primary keys with commas, for example "buyer_id,seller_id".
--type_mapping | Specifies how MySQL data types are mapped to Paimon types. Supported options: "tinyint1-not-bool" maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN; "to-nullable" ignores all NOT NULL constraints (except for primary keys), which works around Flink not accepting the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation; "to-string" maps all MySQL types to STRING; "char-to-string" maps MySQL CHAR(length)/VARCHAR(length) types to STRING; "longtext-to-bytes" maps MySQL LONGTEXT types to BYTES; "bigint-unsigned-to-bigint" maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, and SERIAL to BIGINT (make sure no overflow can occur when using this option).
--computed_column | The definitions of computed columns. The argument field is a field name of the table in the Pulsar topic. See the Paimon documentation for a complete list of configurations.
--pulsar_conf | The configuration for Flink Pulsar sources. Specify each entry in the format key=value. topic/topic-pattern, value.format, pulsar.client.serviceUrl, pulsar.admin.adminUrl, and pulsar.consumer.subscriptionName are required; all others are optional. See the Flink Pulsar connector documentation for a complete list of configurations.
--catalog_conf | The configuration for the Paimon catalog. Specify each entry in the format "key=value". See the Paimon documentation for a complete list of catalog configurations.
--table_conf | The configuration for the Paimon table sink. Specify each entry in the format "key=value". See the Paimon documentation for a complete list of table configurations.
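
As a rough illustration of how the options combine, the Python sketch below checks the required `--pulsar_conf` keys listed above and then assembles a `pulsar_sync_table` command line. The helper names and the relative `bin/flink` path are assumptions; only flags that appear in the synopsis above are used.

```python
import shlex

# Keys the option table marks as required, besides topic / topic-pattern.
REQUIRED_KEYS = [
    "value.format",
    "pulsar.client.serviceUrl",
    "pulsar.admin.adminUrl",
    "pulsar.consumer.subscriptionName",
]


def check_pulsar_conf(conf):
    """Return a list of problems; an empty list means the required keys are present."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in conf]
    # One of "topic" and "topic-pattern" is required, and only one may be given.
    if ("topic" in conf) == ("topic-pattern" in conf):
        problems.append("specify exactly one of topic / topic-pattern")
    return problems


def sync_table_command(flink_bin, jar, warehouse, database, table,
                       pulsar_conf, primary_keys=()):
    """Build the flink run command for pulsar_sync_table as one shell-safe string."""
    problems = check_pulsar_conf(pulsar_conf)
    if problems:
        raise ValueError("; ".join(problems))
    args = [flink_bin, "run", jar, "pulsar_sync_table",
            "--warehouse", warehouse, "--database", database, "--table", table]
    if primary_keys:
        args += ["--primary_keys", ",".join(primary_keys)]
    # Every source option becomes its own --pulsar_conf key=value pair.
    for key, value in pulsar_conf.items():
        args += ["--pulsar_conf", f"{key}={value}"]
    return shlex.join(args)


conf = {
    "topic": "order",
    "value.format": "canal-json",
    "pulsar.client.serviceUrl": "pulsar://127.0.0.1:6650",
    "pulsar.admin.adminUrl": "http://127.0.0.1:8080",
    "pulsar.consumer.subscriptionName": "paimon-tests",
}
print(sync_table_command(
    "bin/flink",
    "/path/to/paimon-flink-action-0.7.0-incubating.jar",
    "hdfs:///path/to/warehouse", "test_db", "test_table",
    conf, primary_keys=["pt", "uid"],
))
```

Validating before submission is a design choice of the sketch: a missing required key is reported locally instead of surfacing later as a failed job.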

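If the specified Paimon table does not exist, the action creates it automatically. Its schema is derived from the tables in all specified Pulsar topics, using the earliest non-DDL record in the topic to parse the schema.

If the Paimon table already exists, its schema is compared against the schemas of the tables in all specified Pulsar topics.

Example 1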

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    pulsar_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --pulsar_conf topic=order \
    --pulsar_conf value.format=canal-json \
    --pulsar_conf pulsar.client.serviceUrl=pulsar://127.0.0.1:6650 \
    --pulsar_conf pulsar.admin.adminUrl=http://127.0.0.1:8080 \
    --pulsar_conf pulsar.consumer.subscriptionName=paimon-tests \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

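If the Pulsar topic contains no messages when the synchronization job starts, you must create the table manually before submitting the job. You only need to define the partition keys and primary keys; the remaining columns will be added by the synchronization job.

Note: In this case, do not use --partition_keys or --primary_keys, because those keys are defined when the table is created and cannot be modified. In addition, if you specify computed columns, you must also define every argument column used by the computed columns.

Example 2: To synchronize a table with primary key "id INT" and compute the partition key "part=date_format(create_time,yyyy-MM-dd)", first create the following table (other columns can be omitted):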

CREATE TABLE test_db.test_table (
    id INT,                 -- primary key
    create_time TIMESTAMP,  -- the argument of computed column part
    part STRING,            -- partition key
    PRIMARY KEY (id, part) NOT ENFORCED
) PARTITIONED BY (part);

Then submit the synchronization job:

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    pulsar_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --computed_column 'part=date_format(create_time,yyyy-MM-dd)' \
    ... (other conf)
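
The reason `create_time` must exist in the pre-created table is that the computed column `part` is derived from it for every row. Below is a minimal Python sketch of that derivation, assuming `date_format` behaves like a plain pattern-based formatter and translating only the three pattern tokens used in the example. The helper is illustrative and is not Paimon's implementation.

```python
from datetime import datetime


def date_format(value: datetime, pattern: str) -> str:
    """Format a timestamp with a Java-style pattern (yyyy, MM, dd only)."""
    # Translate the Java-style tokens into their strftime equivalents.
    for java_token, strftime_token in (("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d")):
        pattern = pattern.replace(java_token, strftime_token)
    return value.strftime(pattern)


# Hypothetical incoming row; only the argument column matters here.
row = {"id": 1, "create_time": datetime(2024, 3, 5, 10, 30)}
row["part"] = date_format(row["create_time"], "yyyy-MM-dd")
print(row["part"])  # 2024-03-05
```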

d) Synchronizing Databases

By using PulsarSyncDatabaseAction in a Flink DataStream job, or directly through flink run, you can synchronize one or more topics into one Paimon database.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    pulsar_sync_database \
    --warehouse <warehouse-path> \
    --database <database-name> \
    [--table_prefix <paimon-table-prefix>] \
    [--table_suffix <paimon-table-suffix>] \
    [--including_tables <table-name|name-regular-expr>] \
    [--excluding_tables <table-name|name-regular-expr>] \
    [--type_mapping to-string] \
    [--pulsar_conf <pulsar-source-conf> [--pulsar_conf <pulsar-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration | Description
--warehouse | The path to the Paimon warehouse.
--database | The database name in the Paimon catalog.
--ignore_incompatible | Defaults to false: if a source table name already exists in Paimon and the two schemas are incompatible, an exception is thrown. Set it to true explicitly to ignore incompatible tables and suppress the exception.
--table_prefix | The prefix of all Paimon tables to be synchronized. For example, if you want all synchronized tables to have "ods_" as prefix, specify "--table_prefix ods_".
--table_suffix | The suffix of all Paimon tables to be synchronized. Usage is the same as "--table_prefix".
--including_tables | Specifies which source tables to synchronize. Use '|' to separate multiple tables. Because '|' is a special character in the shell, the value must be quoted, for example 'a|b|c'. Regular expressions are supported; for example, --including_tables 'test|paimon.*' synchronizes the table 'test' and all tables whose names start with 'paimon'.
--excluding_tables | Specifies which source tables not to synchronize. Usage is the same as "--including_tables". If both are specified, "--excluding_tables" takes priority over "--including_tables".
--type_mapping | Specifies how MySQL data types are mapped to Paimon types. Supported options: "tinyint1-not-bool" maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN; "to-nullable" ignores all NOT NULL constraints (except for primary keys), which works around Flink not accepting the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation; "to-string" maps all MySQL types to STRING; "char-to-string" maps MySQL CHAR(length)/VARCHAR(length) types to STRING; "longtext-to-bytes" maps MySQL LONGTEXT types to BYTES; "bigint-unsigned-to-bigint" maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, and SERIAL to BIGINT (make sure no overflow can occur when using this option).
--pulsar_conf | The configuration for Flink Pulsar sources. Specify each entry in the format key=value. topic/topic-pattern, value.format, pulsar.client.serviceUrl, pulsar.admin.adminUrl, and pulsar.consumer.subscriptionName are required; all others are optional. See the Flink Pulsar connector documentation for a complete list of configurations.
--catalog_conf | The configuration for the Paimon catalog. Specify each entry in the format "key=value". See the Paimon documentation for a complete list of catalog configurations.
--table_conf | The configuration for the Paimon table sink. Specify each entry in the format "key=value". See the Paimon documentation for a complete list of table configurations.
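
The `--type_mapping` options are easier to compare side by side in code. The Python sketch below models only the overrides described in the table. `apply_type_mapping` is a hypothetical helper, the default MySQL-to-Paimon mapping is deliberately left out (the function returns `None` when no option applies), and `to-nullable` is not modelled because it changes nullability rather than the type. Letting `to-string` win over the other options is an assumption based on its description ("all MySQL types").

```python
import re


def apply_type_mapping(mysql_type: str, options):
    """Return the Paimon type forced by the given options, or None if none applies."""
    full = mysql_type.strip().upper()
    # Drop any "(length)" part so that VARCHAR(32) and VARCHAR compare equal.
    base = re.sub(r"\(.*?\)", "", full).strip()
    if "to-string" in options:
        return "STRING"
    if "tinyint1-not-bool" in options and full == "TINYINT(1)":
        return "TINYINT"
    if "char-to-string" in options and base in ("CHAR", "VARCHAR"):
        return "STRING"
    if "longtext-to-bytes" in options and base == "LONGTEXT":
        return "BYTES"
    unsigned_bigints = ("BIGINT UNSIGNED", "BIGINT UNSIGNED ZEROFILL", "SERIAL")
    if "bigint-unsigned-to-bigint" in options and base in unsigned_bigints:
        return "BIGINT"
    return None  # the default mapping is not modelled in this sketch


print(apply_type_mapping("tinyint(1)", {"tinyint1-not-bool"}))
print(apply_type_mapping("varchar(32)", {"char-to-string"}))
print(apply_type_mapping("bigint(20) unsigned", {"bigint-unsigned-to-bigint"}))
```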

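Only tables with primary keys will be synchronized.

This action builds a single combined sink for all tables. For each table in the Pulsar topics to be synchronized, if the corresponding Paimon table does not exist, the action creates it automatically, deriving its schema from the tables in all specified Pulsar topics.

If the Paimon table already exists and its schema differs from the one parsed from the Pulsar topic data, the action attempts schema evolution.

Example: synchronizing from one Pulsar topic to a Paimon database.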
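
Combining the primary-key rule with `--including_tables`, `--excluding_tables`, `--table_prefix`, and `--table_suffix`, table selection can be sketched in Python as follows. The function names are hypothetical, and treating the patterns as full-name regular-expression matches with a default of `.*` is an assumption made for this illustration.

```python
import re


def should_sync(table, has_primary_key, including=".*", excluding=None):
    """Decide whether a source table takes part in database synchronization."""
    # Tables without a primary key are never synchronized.
    if not has_primary_key:
        return False
    # The exclusion pattern takes priority over the inclusion pattern.
    if excluding and re.fullmatch(excluding, table):
        return False
    return re.fullmatch(including, table) is not None


def paimon_table_name(table, prefix="", suffix=""):
    """Apply --table_prefix and --table_suffix to the source table name."""
    return f"{prefix}{table}{suffix}"


# Hypothetical source tables as (name, has_primary_key) pairs.
candidates = [("test", True), ("paimon_orders", True), ("paimon_tmp", True), ("logs", False)]
selected = [
    paimon_table_name(name, prefix="ods_")
    for name, has_pk in candidates
    if should_sync(name, has_pk, including="test|paimon.*", excluding="paimon_tmp")
]
print(selected)
```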

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    pulsar_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --pulsar_conf topic=order \
    --pulsar_conf value.format=canal-json \
    --pulsar_conf pulsar.client.serviceUrl=pulsar://127.0.0.1:6650 \
    --pulsar_conf pulsar.admin.adminUrl=http://127.0.0.1:8080 \
    --pulsar_conf pulsar.consumer.subscriptionName=paimon-tests \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

Synchronizing from multiple Pulsar topics to a Paimon database.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    pulsar_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --pulsar_conf topic=order,logistic_order,user \
    --pulsar_conf value.format=canal-json \
    --pulsar_conf pulsar.client.serviceUrl=pulsar://127.0.0.1:6650 \
    --pulsar_conf pulsar.admin.adminUrl=http://127.0.0.1:8080 \
    --pulsar_conf pulsar.consumer.subscriptionName=paimon-tests \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

e) Additional pulsar_config

These configuration options are not covered by the flink-pulsar-connector documentation.

Key | Default | Type | Description
value.format | (none) | String | Defines the format identifier for encoding value data.
topic | (none) | String | Topic name(s) from which the data is read. A topic list is also supported by separating topics with semicolons, like 'topic-1;topic-2'. Note that only one of "topic-pattern" and "topic" can be specified.
topic-pattern | (none) | String | The regular expression for a pattern of topic names to read from. All topics with names that match the specified regular expression will be subscribed to by the consumer when the job starts running. Note that only one of "topic-pattern" and "topic" can be specified.
pulsar.startCursor.fromMessageId | EARLIEST | String | Uses the unique identifier of a single message to seek the start position. The common format is a triple 'ledgerId,entryId,partitionIndex'. As special values, you can set it to EARLIEST (-1, -1, -1) or LATEST (Long.MAX_VALUE, Long.MAX_VALUE, -1).
pulsar.startCursor.fromPublishTime | (none) | Long | Uses the message publish time to seek the start position.
pulsar.startCursor.fromMessageIdInclusive | true | Boolean | Whether to include the given message id. This option only works when the message id is not EARLIEST or LATEST.
pulsar.stopCursor.atMessageId | (none) | String | Stops consuming when the message id is equal to or greater than the specified message id. A message whose id equals the specified message id will not be consumed. The common format is a triple 'ledgerId,entryId,partitionIndex'. As a special value, you can set it to LATEST (Long.MAX_VALUE, Long.MAX_VALUE, -1).
pulsar.stopCursor.afterMessageId | (none) | String | Stops consuming when the message id is greater than the specified message id. A message whose id equals the specified message id will be consumed. The common format is a triple 'ledgerId,entryId,partitionIndex'. As a special value, you can set it to LATEST (Long.MAX_VALUE, Long.MAX_VALUE, -1).
pulsar.stopCursor.atEventTime | (none) | Long | Stops consuming when the message event time is greater than or equal to the specified timestamp. A message whose event time equals the specified timestamp will not be consumed.
pulsar.stopCursor.afterEventTime | (none) | Long | Stops consuming when the message event time is greater than the specified timestamp. A message whose event time equals the specified timestamp will be consumed.
pulsar.source.unbounded | true | Boolean | Specifies the boundedness of the stream.
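
The difference between `pulsar.stopCursor.atMessageId` and `pulsar.stopCursor.afterMessageId` is whether the message with the given id is still consumed. The Python sketch below parses the 'ledgerId,entryId,partitionIndex' triple and applies both rules to a list of ids from a single partition. The helper names are hypothetical, and ordering ids by (ledgerId, entryId) is a simplifying assumption.

```python
LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE
EARLIEST = (-1, -1, -1)
LATEST = (LONG_MAX, LONG_MAX, -1)


def parse_message_id(text):
    """Parse 'ledgerId,entryId,partitionIndex' or the names EARLIEST / LATEST."""
    name = text.strip().upper()
    if name == "EARLIEST":
        return EARLIEST
    if name == "LATEST":
        return LATEST
    ledger_id, entry_id, partition_index = (int(part) for part in text.split(","))
    return (ledger_id, entry_id, partition_index)


def consume_until(message_ids, stop_id, consume_equal):
    """Return the ids that are consumed before the stop cursor fires.

    consume_equal=False models atMessageId (the equal id is not consumed);
    consume_equal=True models afterMessageId (the equal id is consumed).
    """
    stop_position = stop_id[:2]  # compare by (ledgerId, entryId) only
    consumed = []
    for message_id in message_ids:
        position = message_id[:2]
        if position > stop_position:
            break
        if position == stop_position and not consume_equal:
            break
        consumed.append(message_id)
    return consumed


ids = [(1, 0, 0), (1, 1, 0), (1, 2, 0)]
stop = parse_message_id("1,1,0")
print(consume_until(ids, stop, consume_equal=False))  # atMessageId
print(consume_until(ids, stop, consume_equal=True))   # afterMessageId
```

The event-time options follow the same pattern: `atEventTime` leaves out a message whose event time equals the given timestamp, while `afterEventTime` still consumes it.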