Overview of CDC Ingestion in Apache Paimon

CDC Ingestion

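1) Overview

Paimon supports ingesting data into Paimon tables with schema evolution: columns added at the source are synchronized to the Paimon table in real time, and the synchronization job does not need to be restarted.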

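The following synchronization methods are currently supported:

MySQL Synchronizing Table: synchronizes one or more MySQL tables into a single Paimon table.
MySQL Synchronizing Database: synchronizes an entire MySQL database into a single Paimon database.
Program API Sync: synchronizes a custom DataStream input into a single Paimon table.
Kafka Synchronizing Table: synchronizes the table in one Kafka topic into a single Paimon table.
Kafka Synchronizing Database: synchronizes one Kafka topic containing multiple tables, or multiple topics each containing one table, into a single Paimon database.
MongoDB Synchronizing Collection: synchronizes one MongoDB collection into a single Paimon table.
MongoDB Synchronizing Database: synchronizes an entire MongoDB database into a single Paimon database.
Pulsar Synchronizing Table: synchronizes the table in one Pulsar topic into a single Paimon table.
Pulsar Synchronizing Database: synchronizes one Pulsar topic containing multiple tables, or multiple topics each containing one table, into a single Paimon database.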

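What Is Schema Evolution

Suppose there is a MySQL table named tableA with three fields: field_1, field_2, and field_3. To load this MySQL table into Paimon, you can use either Flink SQL or MySqlSyncTableAction.

Flink SQL:

With Flink SQL, if the schema of the MySQL table changes after ingestion has started, the schema change is not synchronized to Paimon.

MySqlSyncTableAction:

With MySqlSyncTableAction, if the schema of the MySQL table changes after ingestion has started, the schema change is synchronized to Paimon. For example, if a column field_4 is added, the data of field_4 is synchronized to Paimon as well.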
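
The difference between the two approaches can be illustrated with a toy model. This is only a sketch under stated assumptions: the classes FixedSchemaSink and EvolvingSink are hypothetical names invented for illustration and are not part of Paimon or Flink. The first sink keeps the column list it was created with, the second adds any new column it sees.

```python
class FixedSchemaSink:
    """Hypothetical sink whose columns are fixed at creation time."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.rows = []

    def write(self, record):
        # Fields that are not in the declared schema are silently dropped.
        self.rows.append({c: record.get(c) for c in self.columns})


class EvolvingSink(FixedSchemaSink):
    """Hypothetical sink that adds columns when new fields appear."""

    def write(self, record):
        for name in record:
            if name not in self.columns:
                self.columns.append(name)
                # Rows written before the change get NULL (None) for the new column.
                for row in self.rows:
                    row[name] = None
        super().write(record)


fixed = FixedSchemaSink(["field_1", "field_2", "field_3"])
evolving = EvolvingSink(["field_1", "field_2", "field_3"])

before = {"field_1": 1, "field_2": 2, "field_3": 3}
after = {"field_1": 10, "field_2": 20, "field_3": 30, "field_4": 40}

for sink in (fixed, evolving):
    sink.write(before)
    sink.write(after)

print(fixed.columns)     # field_4 is missing
print(evolving.columns)  # field_4 was added
```

In this model the fixed sink loses field_4 entirely, while the evolving sink keeps it and backfills earlier rows with None.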

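Schema Change Evolution

CDC Ingestion supports only a limited set of schema changes. The framework cannot rename tables or drop columns, so RENAME TABLE and DROP COLUMN are ignored, and RENAME COLUMN adds a new column. The currently supported schema changes are:

Adding columns.
Altering column types:
- from a string type (char, varchar, text) to another string type with a longer length,
- from a binary type (binary, varbinary, blob) to another binary type with a longer length,
- from an integer type (tinyint, smallint, int, bigint) to another integer type with a wider range,
- from a floating-point type (float, double) to another floating-point type with a wider range.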
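
The type-change rules above can be expressed as a small check. This is a minimal sketch, not Paimon's implementation: the function name is_supported_type_change and the (type_name, length) tuple representation are assumptions made for illustration, and the caller is assumed to supply explicit lengths for string and binary types.

```python
# Widening order within each family, narrowest first (from the list above).
INTEGER_ORDER = ["tinyint", "smallint", "int", "bigint"]
FLOAT_ORDER = ["float", "double"]
STRING_TYPES = {"char", "varchar", "text"}
BINARY_TYPES = {"binary", "varbinary", "blob"}


def is_supported_type_change(old, new):
    """Return True if changing `old` to `new` widens the type.

    Both arguments are (type_name, length) tuples; length is None for
    types that carry no length in this simplified model.
    """
    old_name, old_len = old
    new_name, new_len = new

    # Integer and floating-point types: the new type must have a wider range.
    for order in (INTEGER_ORDER, FLOAT_ORDER):
        if old_name in order and new_name in order:
            return order.index(new_name) > order.index(old_name)

    # String and binary types: the new type must have a longer length.
    for family in (STRING_TYPES, BINARY_TYPES):
        if old_name in family and new_name in family:
            if old_len is None or new_len is None:
                return False
            return new_len > old_len

    # Changes across families are not in the supported list.
    return False


print(is_supported_type_change(("int", None), ("bigint", None)))   # True
print(is_supported_type_change(("bigint", None), ("int", None)))   # False
print(is_supported_type_change(("varchar", 10), ("varchar", 20)))  # True
```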

Computed Functions

Function	Description
year(date-column)	Extract year from a DATE, DATETIME or TIMESTAMP (or its corresponding string format). Output is an INT value representing the year.
month(date-column)	Extract month of year from a DATE, DATETIME or TIMESTAMP (or its corresponding string format). Output is an INT value representing the month of year.
day(date-column)	Extract day of month from a DATE, DATETIME or TIMESTAMP (or its corresponding string format). Output is an INT value representing the day of month.
hour(date-column)	Extract hour from a DATE, DATETIME or TIMESTAMP (or its corresponding string format). Output is an INT value representing the hour.
minute(date-column)	Extract minute from a DATE, DATETIME or TIMESTAMP (or its corresponding string format). Output is an INT value representing the minute.
second(date-column)	Extract second from a DATE, DATETIME or TIMESTAMP (or its corresponding string format). Output is an INT value representing the second.
date_format(date-column,format)	Convert date format from a DATE, DATETIME or TIMESTAMP (or its corresponding string format). 'format' is compatible with Java's DateTimeFormatter String (for example, 'yyyy-MM-dd'). Output is a string value in converted date format.
substring(column,beginInclusive)	Get column.substring(beginInclusive). Output is a STRING.
substring(column,beginInclusive,endExclusive)	Get column.substring(beginInclusive,endExclusive). Output is a STRING.
truncate(column,width)	Truncate column by width. The output type is the same as the column type. If the column is a STRING, truncate(column,width) truncates the string to width characters, namely `value.substring(0, width)`. If the column is an INT or LONG, truncate(column,width) truncates the number with the algorithm `v - (((v % W) + W) % W)`; the seemingly redundant `+ W` and second `% W` keep the remainder non-negative, so negative values are also rounded down. If the column is a DECIMAL, truncate(column,width) truncates the decimal with the algorithm: let `scaled_W = decimal(W, scale(v))`, then return `v - (v % scaled_W)`.
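
The three truncate algorithms in the table can be written out in a few lines of Python. This is a sketch of the formulas as stated above, not Paimon's own code; the function names are hypothetical. `decimal(W, scale(v))` is interpreted here as the unscaled integer W with the same scale as v, which is an assumption about the notation.

```python
from decimal import Decimal


def truncate_string(value: str, width: int) -> str:
    # Equivalent to value.substring(0, width).
    return value[:width]


def truncate_int(v: int, w: int) -> int:
    # v - (((v % W) + W) % W): the remainder is forced to be non-negative,
    # so the result is the largest multiple of W that is <= v.
    # Python's % is already non-negative for positive W; the full formula is
    # kept so that it matches languages where % can return a negative value.
    return v - (((v % w) + w) % w)


def truncate_decimal(v: Decimal, w: int) -> Decimal:
    # scale(v) is the number of digits after the decimal point.
    scale = -v.as_tuple().exponent
    # scaled_W: the unscaled value W with the scale of v, e.g. 50 -> 0.50.
    scaled_w = Decimal(w).scaleb(-scale)
    return v - (v % scaled_w)


print(truncate_string("paimon", 3))            # pai
print(truncate_int(17, 10))                    # 10
print(truncate_int(-1, 10))                    # -10
print(truncate_decimal(Decimal("10.65"), 50))  # 10.50
```

Note that truncate_int(-1, 10) yields -10 rather than 0, which is the effect of keeping the remainder non-negative.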

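Special Data Type Mapping

By default, the MySQL TINYINT(1) type is mapped to BOOLEAN. To store numbers (-128 to 127) in it as MySQL does, specify the type mapping option tinyint1-not-bool (via --type_mapping); the column is then mapped to TINYINT in the Paimon table.
The type mapping option to-nullable (via --type_mapping) ignores all NOT NULL constraints (except on primary keys).
The type mapping option to-string (via --type_mapping) maps all MySQL data types to STRING.
The type mapping option char-to-string (via --type_mapping) maps the MySQL CHAR(length)/VARCHAR(length) types to STRING.
The type mapping option longtext-to-bytes (via --type_mapping) maps the MySQL LONGTEXT type to BYTES.
MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, and SERIAL are mapped to DECIMAL(20, 0) by default. The type mapping option bigint-unsigned-to-bigint (via --type_mapping) maps these types to Paimon BIGINT instead, but this risks data overflow: BIGINT UNSIGNED can store integers of up to 20 digits, whereas Paimon BIGINT can store integers of up to 19 digits only. Make sure that no overflow can occur when using this option.
The MySQL BIT(1) type is mapped to BOOLEAN.
When a Hive catalog is used, the MySQL TIME type is mapped to STRING.
MySQL BINARY is mapped to Paimon VARBINARY. Binary values are passed as bytes in the binlog, so the type should map to a byte type (BYTES or VARBINARY); VARBINARY is chosen because it retains the length information.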
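
The mapping rules above can be summarized as a lookup function. This is only an illustrative sketch: map_special_type is a hypothetical helper, it covers just the special cases listed in this section (returning None for everything else), it leaves out to-nullable and the Hive catalog rule, and the precedence given to to-string over the other options is an assumption.

```python
import re

# Largest values of the two types, to show why bigint-unsigned-to-bigint can overflow.
BIGINT_UNSIGNED_MAX = 2**64 - 1  # 18446744073709551615, 20 digits
BIGINT_MAX = 2**63 - 1           # 9223372036854775807, 19 digits


def map_special_type(mysql_type, options=frozenset()):
    """Map a MySQL type string to a Paimon type name for the special cases above."""
    t = " ".join(mysql_type.upper().split())  # normalize case and whitespace

    if "to-string" in options:  # assumed to take precedence over other options
        return "STRING"
    if t == "TINYINT(1)":
        return "TINYINT" if "tinyint1-not-bool" in options else "BOOLEAN"
    if t == "BIT(1)":
        return "BOOLEAN"
    if "char-to-string" in options and re.fullmatch(r"(CHAR|VARCHAR)\(\d+\)", t):
        return "STRING"
    if "longtext-to-bytes" in options and t == "LONGTEXT":
        return "BYTES"
    if t in {"BIGINT UNSIGNED", "BIGINT UNSIGNED ZEROFILL", "SERIAL"}:
        if "bigint-unsigned-to-bigint" in options:
            return "BIGINT"
        return "DECIMAL(20, 0)"
    m = re.fullmatch(r"BINARY\((\d+)\)", t)
    if m:
        return f"VARBINARY({m.group(1)})"  # the length is retained
    return None  # not one of the special cases in this sketch


print(map_special_type("tinyint(1)"))                         # BOOLEAN
print(map_special_type("tinyint(1)", {"tinyint1-not-bool"}))  # TINYINT
print(map_special_type("bigint unsigned"))                    # DECIMAL(20, 0)
print(BIGINT_UNSIGNED_MAX > BIGINT_MAX)                       # True
```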

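Custom Job Settings

Checkpointing

Use -Dexecution.checkpointing.interval= to enable checkpointing and set the interval. For version 0.7 and later, if checkpointing has not been enabled, Paimon enables it by default and sets the checkpoint interval to 180 seconds.

Job Name

Use -Dpipeline.name= to set a custom name for the synchronization job.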
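
As a small illustration, the two settings can be assembled into command-line options programmatically. This is a sketch only: flink_dynamic_options is a hypothetical helper, and the values "10s" and "sync-orders" are made-up examples. The rest of the job submission command is left out on purpose.

```python
def flink_dynamic_options(checkpoint_interval=None, job_name=None):
    """Build the -D options described above; unset values are skipped."""
    options = []
    if checkpoint_interval is not None:
        options.append(f"-Dexecution.checkpointing.interval={checkpoint_interval}")
    if job_name is not None:
        options.append(f"-Dpipeline.name={job_name}")
    return options


# Example values are assumptions chosen for illustration.
print(flink_dynamic_options(checkpoint_interval="10s", job_name="sync-orders"))
```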