How to Run the Flink CDC Pipeline Connector on Amazon EMR

The latest Flink release on Amazon EMR does not natively support Flink CDC, so this post walks through an example of synchronizing data (MySQL -> Kafka) with the Flink CDC Pipeline Connector.

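Environment setup

Launch an Amazon EMR cluster

Once the cluster is up, log in to the EMR master node and run the following commands to download the MySQL and Kafka pipeline connector JARs into the /usr/lib/flink/lib/ directory on the master node.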

sudo wget https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-mysql/3.2.0/flink-cdc-pipeline-connector-mysql-3.2.0.jar -P /usr/lib/flink/lib/

sudo wget https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-kafka/3.2.0/flink-cdc-pipeline-connector-kafka-3.2.0.jar -P /usr/lib/flink/lib/
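
Before moving on, it can help to confirm that both connector JARs actually landed in the Flink lib directory. The following is a minimal sketch under stated assumptions: the function name `missing_connector_jars` is hypothetical, and it only checks that the files exist, not that the downloads are intact.

```shell
# Hypothetical helper: print the name of each expected pipeline connector JAR
# that is missing from the given lib directory.
# Returns non-zero if at least one JAR is missing.
missing_connector_jars() {
  local lib_dir="$1" version="$2" missing=0 name jar
  for name in mysql kafka; do
    jar="flink-cdc-pipeline-connector-${name}-${version}.jar"
    if [ ! -f "${lib_dir}/${jar}" ]; then
      echo "${jar}"
      missing=1
    fi
  done
  return "${missing}"
}

# Example (on the master node):
#   missing_connector_jars /usr/lib/flink/lib 3.2.0 || echo "some connector JARs are missing"
```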

Download Flink CDC

Run the following commands on the EMR master node; working directly in the /home/hadoop directory is fine.

wget https://dlcdn.apache.org/flink/flink-cdc-3.2.0/flink-cdc-3.2.0-bin.tar.gz
tar -xvf flink-cdc-3.2.0-bin.tar.gz

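Synchronizing MySQL to Kafka

Start a YARN session

The command below sets the checkpoint directory and execution.checkpointing.interval, but these settings do not take effect here. They must also be configured in the flink-conf.yaml file, as described in the next step.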

# set flink home
export FLINK_HOME=/usr/lib/flink

# checkpoint directory
checkpoints=s3://<s3bucket>/flink/checkpoints/

sudo flink-yarn-session -jm 2048 -tm 4096 -s 2 \
-D state.checkpoint-storage=filesystem \
-D state.checkpoints.dir=${checkpoints} \
-D execution.checkpointing.interval=10s \
-D state.checkpoints.num-retained=5 \
-D execution.checkpointing.mode=EXACTLY_ONCE \
-D execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION \
-D execution.checkpointing.max-concurrent-checkpoints=2 \
-D execution.checkpointing.checkpoints-after-tasks-finish.enabled=true \
-D rest.flamegraph.enabled=true \
-d

After the session starts, take the Application ID from the command output; it is needed in the following steps.
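
Copying the ID by hand works, but it can also be picked out of the session output with a small filter. This is a sketch only: the function name `extract_application_id` and the log file name are hypothetical, and it relies on YARN application IDs having the form application_<cluster-timestamp>_<sequence>. Running `yarn application -list` is another way to look the ID up.

```shell
# Hypothetical helper: print the first YARN application ID found in the text
# read from stdin. Prints nothing if no ID is present.
extract_application_id() {
  grep -oE 'application_[0-9]+_[0-9]+' | head -n 1
}

# Example: keep a copy of the session output, then read the ID back from it.
#   sudo flink-yarn-session ... -d 2>&1 | tee yarn-session.log
#   app_id="$(extract_application_id < yarn-session.log)"
```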

Edit the Flink configuration file

Modify or add the following entries in flink-conf.yaml:

rest.bind-port: {{REST_PORT}}
rest.address: {{NODE_IP}}
execution.target: yarn-session
yarn.application.id: {{YARN_APPLICATION_ID}}

In addition, checkpointing must be configured in flink-conf.yaml, because the values passed to the YARN session have no effect.

execution.checkpointing.interval: 10000
state.checkpoint-storage: filesystem
state.checkpoints.dir: s3://<s3bucket>/flink/checkpoints/
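
If these entries need to be applied repeatedly, for example after each new YARN session, the edit can be scripted. The helper below is a rough sketch, not part of Flink or EMR: `set_flink_conf` is a hypothetical name, it assumes the flat `key: value` layout shown above, and it expects the target file to already exist and be writable by the current user.

```shell
# Hypothetical helper: set "key: value" in a flat Flink YAML config file.
# Any existing line that starts with "<key>:" is removed, and the new entry
# is appended at the end of the file.
set_flink_conf() {
  local key="$1" value="$2" file="$3" tmp
  tmp="$(mktemp)"
  # Keep every line that does not start with "<key>:".
  awk -v k="${key}:" 'index($0, k) != 1' "${file}" > "${tmp}"
  printf '%s: %s\n' "${key}" "${value}" >> "${tmp}"
  # Overwrite the contents in place so the file keeps its owner and permissions.
  cat "${tmp}" > "${file}"
  rm -f "${tmp}"
}

# Example (the path to flink-conf.yaml depends on the installation):
#   set_flink_conf execution.checkpointing.interval 10000 "${FLINK_CONF_FILE}"
#   set_flink_conf state.checkpoint-storage filesystem "${FLINK_CONF_FILE}"
```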

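Configure the data synchronization YAML file

The following example writes multiple tables to a single Kafka topic.

Note that for tables without a primary key, you must designate a column as the chunk key by setting the scan.incremental.snapshot.chunk.key-column parameter; separate the entries for multiple tables with commas.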

Example 1

cat > cdc_demo_example_1.yaml <<EOF
source:
   type: mysql
   name: MySQL Source
   hostname: <mysql-host/ip>
   port: 3306
   username: <mysql-username>
   password: <mysql-password>
   tables: <database_name>.<table-prefix>\.*,<tablename>
   server-id: 5401-5404
   scan.incremental.snapshot.chunk.key-column: <databasename>.<tablename>:<key-column>

sink:
  type: kafka
  name: Kafka Sink
  properties.bootstrap.servers: <kafka-bootstrap-server>
  topic: <topic-name>

pipeline:
  name: MySQL to Kafka Pipeline example 1
  parallelism: 2
EOF

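Key parameters:

Parameter	Description
topic	Set this parameter to write multiple tables to a single topic. If it is not set, each table is written to its own topic by default (`<database>.<tablename>`).
scan.incremental.snapshot.chunk.key-column	Must be set in the source block for tables without a primary key. Format: `<database>.<tablename>:<column>`; separate the entries for multiple tables with commas (,).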

Example 2

cat > cdc_demo_example_2.yaml <<EOF
source:
   type: mysql
   name: MySQL Source
   hostname: <mysql-host/ip>
   port: 3306
   username: <mysql-username>
   password: <mysql-password>
   tables: <database_name>.<table-prefix>\.*,<tablename>
   server-id: 5401-5404

sink:
  type: kafka
  name: Kafka Sink
  properties.bootstrap.servers: <kafka-bootstrap-server>
  
pipeline:
  name: MySQL to Kafka Pipeline example 2
  parallelism: 2
EOF
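
Both example files are written with angle-bracket placeholders such as `<mysql-host/ip>`, and a job submitted with one of them left in place will fail to connect. A quick check before submission might look like the sketch below; `find_placeholders` is a hypothetical helper name, and the pattern only covers placeholders made of letters, digits, `_`, `/` and `-`.

```shell
# Hypothetical helper: list lines of a pipeline YAML file that still contain an
# unfilled <placeholder>. Prints "line-number:content" for each hit.
# Returns 0 if any placeholder was found, non-zero otherwise (grep semantics).
find_placeholders() {
  grep -nE '<[A-Za-z][A-Za-z0-9_/-]*>' "$1"
}

# Example:
#   if find_placeholders cdc_demo_example_2.yaml; then
#     echo "fill in the values listed above before submitting the job"
#   fi
```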

Run the Flink CDC job

./flink-cdc-3.2.0/bin/flink-cdc.sh -cm claim cdc_demo_example_2.yaml

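Resume from the last offset

Flink jobs started on Amazon EMR require savepoints to be managed manually: take a savepoint when stopping the job, and restore from that savepoint when starting it again. This lets the Flink CDC job resume synchronization from the last synchronized offset.

Example: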

flink cancel -s  s3://<s3bucket>/flink/20240923/01 9b46c81d4f018eafdf6c2cc44b2e356a -yid application_1727084151749_0006

./flink-cdc-3.2.0/bin/flink-cdc.sh -cm claim -s s3://<s3bucket>/flink/20240923/01/savepoint-9b46c8-1633c5a00095 \
	cdc_demo_example_1.yaml
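
The savepoint directory name (savepoint-9b46c8-1633c5a00095 above) is generated when the savepoint is taken, so it has to be read from the output of the cancel command before restarting. The following sketch shows one way to capture it. It assumes the command output contains the full s3:// savepoint path and that the directory follows the savepoint-<hex>-<hex> naming seen above; `extract_savepoint_path` and the log file name are hypothetical.

```shell
# Hypothetical helper: print the first s3:// savepoint path found in the text
# read from stdin. Prints nothing if no such path is present.
extract_savepoint_path() {
  grep -oE 's3://[^[:space:]]*savepoint-[0-9a-f]+-[0-9a-f]+' | head -n 1
}

# Example: keep the cancel output, then restart from the savepoint it reports.
#   flink cancel -s s3://<s3bucket>/flink/20240923/01 <job-id> -yid <application-id> | tee cancel.log
#   savepoint="$(extract_savepoint_path < cancel.log)"
#   ./flink-cdc-3.2.0/bin/flink-cdc.sh -cm claim -s "${savepoint}" cdc_demo_example_1.yaml
```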