Ingesting Data into Apache Paimon with Kafka CDC

a. Prepare dependencies

flink-sql-connector-kafka-*.jar

b. Supported formats

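Flink provides several Kafka CDC formats: Canal, Debezium, Ogg, and Maxwell JSON.

If the messages in a Kafka topic are change events captured from another database by a Change Data Capture (CDC) tool, you can use Paimon Kafka CDC to write the INSERT, UPDATE, and DELETE messages into a Paimon table.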
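
To make the mapping from change events to row operations concrete, here is a minimal sketch that expands one canal-json message into INSERT, UPDATE, and DELETE changes. It assumes the usual canal-json layout (a "type" field, a "data" array, and an "old" array holding only the changed columns). The function name `to_row_changes` and the sample `shop.orders` record are hypothetical, and this is not Paimon's actual parser.

```python
import json


def to_row_changes(message: str) -> list[tuple[str, dict]]:
    """Expand one canal-json message into (change kind, row) pairs."""
    event = json.loads(message)
    kind = event.get("type")
    rows = event.get("data") or []
    if kind == "INSERT":
        return [("+I", row) for row in rows]
    if kind == "DELETE":
        return [("-D", row) for row in rows]
    if kind == "UPDATE":
        changes = []
        # "old" carries only the columns whose values changed, so the
        # before-image is the new row overlaid with the old values.
        for row, old in zip(rows, event.get("old") or []):
            changes.append(("-U", {**row, **old}))
            changes.append(("+U", row))
        return changes
    return []  # DDL and other event types are skipped in this sketch


# Hypothetical sample message
update = json.dumps({
    "database": "shop", "table": "orders", "type": "UPDATE",
    "data": [{"id": "1", "status": "paid"}],
    "old": [{"status": "new"}],
})
for change_kind, row in to_row_changes(update):
    print(change_kind, row)
```

An UPDATE therefore yields two changes: the row before the update and the row after it.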

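Note

A JSON source may be missing some information. For example, the Ogg and Maxwell format standards do not include field types, and when a JSON source is written through the Flink Kafka sink, only the data and row type are kept and all other information is dropped.

A debezium-json message usually contains a "schema" field, from which Paimon retrieves the data types. Make sure your debezium-json messages include this field; otherwise Paimon falls back to the "STRING" type.

  • If field types are missing, Paimon uses the "STRING" type by default.
  • If the database name or table name is missing, database synchronization is not possible, but table synchronization still works.
  • If the primary key is missing, the job may create a table without a primary key. You can set the primary key when submitting a table synchronization job.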
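
The type fallback described above can be sketched as follows. The sketch assumes the Debezium envelope produced with schemas enabled (a top-level "schema" and "payload"). The `CONNECT_TO_SQL` table is an illustrative mapping only and the function `column_types` is hypothetical; Paimon's real type mapping covers many more types.

```python
import json

# Illustrative mapping only, not Paimon's actual type mapping.
CONNECT_TO_SQL = {"int32": "INT", "int64": "BIGINT", "string": "STRING",
                  "boolean": "BOOLEAN", "double": "DOUBLE"}


def column_types(message: str) -> dict[str, str]:
    """Return a column -> type mapping, falling back to STRING."""
    event = json.loads(message)
    # Without a schema, the change envelope is the top-level object.
    payload = event.get("payload", event)
    row = payload.get("after") or payload.get("before") or {}
    declared = {}
    for struct in (event.get("schema") or {}).get("fields", []):
        if struct.get("field") == "after":
            for col in struct.get("fields", []):
                declared[col["field"]] = col["type"]
    # Unknown or undeclared types fall back to STRING.
    return {name: CONNECT_TO_SQL.get(declared.get(name), "STRING")
            for name in row}


# Hypothetical message without the "schema" field
no_schema = json.dumps({"before": None, "after": {"id": 1, "name": "a"}, "op": "c"})
print(column_types(no_schema))
```

With no "schema" field, every column in this sketch resolves to STRING, which matches the fallback behaviour described in the note.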

c. Synchronizing a table

By using KafkaSyncTableAction in a Flink DataStream job, or directly through flink run, you can synchronize one or more tables from a single Kafka topic into one Paimon table.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    kafka_sync_table \
    --warehouse <warehouse-path> \
    --database <database-name> \
    --table <table-name> \
    [--partition_keys <partition_keys>] \
    [--primary_keys <primary-keys>] \
    [--type_mapping to-string] \
    [--computed_column <'column-name=expr-name(args[, ...])'> [--computed_column ...]] \
    [--kafka_conf <kafka-source-conf> [--kafka_conf <kafka-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration Description
--warehouse The path to the Paimon warehouse.
--database The database name in the Paimon catalog.
--table The Paimon table name.
--partition_keys The partition keys of the Paimon table. Separate multiple partition keys with commas, for example "dt,hh,mm".
--primary_keys The primary keys of the Paimon table. Separate multiple primary keys with commas, for example "buyer_id,seller_id".
--type_mapping Specifies how MySQL data types are mapped to Paimon types. Supported options:
  • "tinyint1-not-bool": maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN.
  • "to-nullable": ignores all NOT NULL constraints (except for primary keys). This works around Flink not accepting the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation.
  • "to-string": maps all MySQL types to STRING.
  • "char-to-string": maps MySQL CHAR(length)/VARCHAR(length) types to STRING.
  • "longtext-to-bytes": maps MySQL LONGTEXT types to BYTES.
  • "bigint-unsigned-to-bigint": maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, and SERIAL to BIGINT. Make sure overflow cannot occur when using this option.
--computed_column The definitions of computed columns. The argument field is a field name of the table in the Kafka topic. See the Paimon documentation for the complete list.
--kafka_conf The configuration for the Flink Kafka source. Each entry must use the format "key=value". properties.bootstrap.servers, topic/topic-pattern, properties.group.id, and value.format are required; all others are optional. See the Flink Kafka connector documentation for the complete list of configurations.
--catalog_conf The configuration for the Paimon catalog. Each entry must use the format "key=value". See the Paimon documentation for the complete list of catalog configurations.
--table_conf The configuration for the Paimon table sink. Each entry must use the format "key=value". See the Paimon documentation for the complete list of table configurations.
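
Since several --kafka_conf keys are required, it can help to check them before submitting a job. The following is a minimal sketch of such a pre-flight check; the helper `parse_kafka_conf` is hypothetical and not part of Paimon or Flink.

```python
# Keys the option table lists as required (plus topic or topic-pattern).
REQUIRED = ("properties.bootstrap.servers", "properties.group.id", "value.format")


def parse_kafka_conf(pairs: list[str]) -> dict[str, str]:
    """Parse repeated key=value arguments and check the required keys."""
    conf = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        conf[key] = value
    missing = [key for key in REQUIRED if key not in conf]
    if "topic" not in conf and "topic-pattern" not in conf:
        missing.append("topic or topic-pattern")
    if missing:
        raise ValueError("missing required kafka_conf: " + ", ".join(missing))
    return conf


conf = parse_kafka_conf([
    "properties.bootstrap.servers=127.0.0.1:9020",
    "topic=order",
    "properties.group.id=123456",
    "value.format=canal-json",
])
print(conf["value.format"])
```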

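If the specified Paimon table does not exist, this action creates it automatically. Its schema is derived from the tables in all specified Kafka topics, using the earliest non-DDL record in the topic to parse the schema.

If the Paimon table already exists, its schema is compared against the schemas of the tables in all specified Kafka topics.

Example 1: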

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    kafka_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --kafka_conf properties.bootstrap.servers=127.0.0.1:9020 \
    --kafka_conf topic=order \
    --kafka_conf properties.group.id=123456 \
    --kafka_conf value.format=canal-json \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

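If the Kafka topic contains no messages when the synchronization job starts, you must create the table manually before submitting the job. You only need to define the partition keys and primary keys; the remaining columns are added by the synchronization job.

Note: in this case, do not use --partition_keys or --primary_keys, because these keys are defined when the table is created and cannot be modified. In addition, if you specify computed columns, you must also define all argument columns used by those computed columns.

Example 2: to synchronize a table with the primary key "id INT" and compute the partition key "part=date_format(create_time,yyyy-MM-dd)", you can first create the table like this (other columns can be omitted):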

CREATE TABLE test_db.test_table (
    id INT,                 -- primary key
    create_time TIMESTAMP,  -- the argument of computed column part
    part STRING,            -- partition key
    PRIMARY KEY (id, part) NOT ENFORCED
) PARTITIONED BY (part);

Then start the synchronization job:

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    kafka_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --computed_column 'part=date_format(create_time,yyyy-MM-dd)' \
    ... (other conf)
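
To illustrate why the argument column create_time has to exist in the pre-created table, here is a rough Python sketch of how the computed column part is derived from each record. The helper names are hypothetical; the only assumption is that the pattern yyyy-MM-dd corresponds to strftime's %Y-%m-%d.

```python
from datetime import datetime


def part_of(create_time: datetime) -> str:
    # The pattern yyyy-MM-dd corresponds to strftime's %Y-%m-%d.
    return create_time.strftime("%Y-%m-%d")


def with_computed_column(row: dict) -> dict:
    """Return a copy of the row with the computed partition column added."""
    return {**row, "part": part_of(row["create_time"])}


row = with_computed_column({"id": 1, "create_time": datetime(2024, 1, 15, 10, 30)})
print(row["part"])  # prints 2024-01-15
```

Every incoming record needs create_time to produce part, which is why the argument column must be defined alongside the key columns.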

d. Synchronizing a database

By using KafkaSyncDatabaseAction in a Flink DataStream job, or directly through flink run, you can synchronize one or more topics into a single Paimon database.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    kafka_sync_database \
    --warehouse <warehouse-path> \
    --database <database-name> \
    [--table_prefix <paimon-table-prefix>] \
    [--table_suffix <paimon-table-suffix>] \
    [--including_tables <table-name|name-regular-expr>] \
    [--excluding_tables <table-name|name-regular-expr>] \
    [--type_mapping to-string] \
    [--kafka_conf <kafka-source-conf> [--kafka_conf <kafka-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration Description
--warehouse The path to the Paimon warehouse.
--database The database name in the Paimon catalog.
--ignore_incompatible Defaults to false, in which case an exception is thrown if a source table name already exists in Paimon and the schemas are incompatible. Set it to true explicitly to skip the incompatible tables instead of failing.
--table_prefix The prefix of all Paimon tables to be synchronized. For example, to give all synchronized tables the prefix "ods_", specify "--table_prefix ods_".
--table_suffix The suffix of all Paimon tables to be synchronized. Usage is the same as "--table_prefix".
--including_tables Specifies which source tables are synchronized. Use '|' to separate multiple tables. Because '|' is a special shell character, the value must be quoted, for example 'a|b|c'. Regular expressions are supported; for example, --including_tables 'test|paimon.*' synchronizes the table 'test' and all tables whose names start with 'paimon'.
--excluding_tables Specifies which source tables are not synchronized. Usage is the same as "--including_tables". If both are specified, "--excluding_tables" takes priority over "--including_tables".
--type_mapping Specifies how MySQL data types are mapped to Paimon types. Supported options:
  • "tinyint1-not-bool": maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN.
  • "to-nullable": ignores all NOT NULL constraints (except for primary keys). This works around Flink not accepting the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation.
  • "to-string": maps all MySQL types to STRING.
  • "char-to-string": maps MySQL CHAR(length)/VARCHAR(length) types to STRING.
  • "longtext-to-bytes": maps MySQL LONGTEXT types to BYTES.
  • "bigint-unsigned-to-bigint": maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, and SERIAL to BIGINT. Make sure overflow cannot occur when using this option.
--kafka_conf The configuration for the Flink Kafka source. Each entry must use the format "key=value". properties.bootstrap.servers, topic/topic-pattern, properties.group.id, and value.format are required; all others are optional. See the Flink Kafka connector documentation for the complete list of configurations.
--catalog_conf The configuration for the Paimon catalog. Each entry must use the format "key=value". See the Paimon documentation for the complete list of catalog configurations.
--table_conf The configuration for the Paimon table sink. Each entry must use the format "key=value". See the Paimon documentation for the complete list of table configurations.
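
The interplay of --including_tables, --excluding_tables, --table_prefix, and --table_suffix can be sketched as below. This is an illustration, not Paimon's implementation: it assumes each pattern must match the whole table name and that all tables are included when no including pattern is given. The function `select_tables` and the sample table names are hypothetical.

```python
import re


def select_tables(source_tables, including=".*", excluding=None,
                  prefix="", suffix=""):
    """Map each selected source table to its target Paimon table name."""
    selected = {}
    for name in source_tables:
        # The excluding pattern is checked first, so it has higher priority.
        if excluding and re.fullmatch(excluding, name):
            continue
        if re.fullmatch(including, name):
            selected[name] = f"{prefix}{name}{suffix}"
    return selected


tables = ["test", "test2", "paimon_a", "paimon_tmp", "other"]
print(select_tables(tables, including="test|paimon.*",
                    excluding="paimon_tmp", prefix="ods_"))
```

Here 'test' and 'paimon_a' are selected and become 'ods_test' and 'ods_paimon_a', while 'paimon_tmp' is dropped because the excluding pattern wins.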

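Only tables with primary keys are synchronized.

This action builds a single combined sink for all tables. For each table in the Kafka topics to be synchronized, if the corresponding Paimon table does not exist, the action creates it automatically, deriving its schema from the tables in all specified Kafka topics.

If the Paimon table already exists and its schema differs from the schema parsed from the Kafka records, the action attempts schema evolution.

Example: synchronize from a single Kafka topic to a Paimon database.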

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    kafka_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --kafka_conf properties.bootstrap.servers=127.0.0.1:9020 \
    --kafka_conf topic=order \
    --kafka_conf properties.group.id=123456 \
    --kafka_conf value.format=canal-json \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

Synchronize from multiple Kafka topics to a Paimon database.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    kafka_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --kafka_conf properties.bootstrap.servers=127.0.0.1:9020 \
    --kafka_conf topic=order\;logistic_order\;user \
    --kafka_conf properties.group.id=123456 \
    --kafka_conf value.format=canal-json \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4
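
When the job is submitted from a script rather than typed into a shell, the argument list can be assembled programmatically. The sketch below uses a hypothetical helper, `sync_database_args`, and the same placeholder paths as the examples above. Topics are joined with ';', and because the arguments are passed as a list, the ';' does not need the backslash escaping that the shell command requires.

```python
import shlex


def sync_database_args(flink_bin, jar, warehouse, database, topics,
                       kafka_conf, catalog_conf=None, table_conf=None):
    """Build the argument list for a kafka_sync_database submission."""
    args = [flink_bin, "run", jar, "kafka_sync_database",
            "--warehouse", warehouse, "--database", database,
            "--kafka_conf", "topic=" + ";".join(topics)]
    for flag, conf in (("--kafka_conf", kafka_conf),
                       ("--catalog_conf", catalog_conf or {}),
                       ("--table_conf", table_conf or {})):
        for key, value in conf.items():
            args += [flag, f"{key}={value}"]
    return args


args = sync_database_args(
    "<FLINK_HOME>/bin/flink",
    "/path/to/paimon-flink-action-0.7.0-incubating.jar",
    "hdfs:///path/to/warehouse", "test_db",
    ["order", "logistic_order", "user"],
    kafka_conf={"properties.bootstrap.servers": "127.0.0.1:9020",
                "properties.group.id": "123456",
                "value.format": "canal-json"},
    table_conf={"bucket": "4"},
)
# shlex.join quotes anything a shell would otherwise interpret.
print(shlex.join(args))
```

The resulting list could be passed to subprocess.run once the placeholder paths are replaced with real ones.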