Flink series: using Flink CDC 3 to sync data from MySQL to Doris and StarRocks

Extract Flink

bash
tar -zxvf flink-1.19.1-bin-scala_2.12.tgz

Edit the Flink configuration file config.yaml

yaml
taskmanager:
  bind-host: localhost
  host: localhost
  numberOfTaskSlots: 6
  memory:
    process:
      size: 1728m

parallelism:
  default: 1
rest:
  address: 10.66.77.104
  # network interface, such as 0.0.0.0.
  bind-address: 10.66.77.104
  # port: 8081
  # # Port range for the REST and web server to bind to.
  # bind-port: 8080-8090
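With `numberOfTaskSlots: 6` on a single TaskManager, this standalone cluster exposes 6 task slots in total, and a job only runs if its parallelism fits into the free slots. A quick sanity check (a sketch; the 1 TaskManager / 6 slots / parallelism 6 figures mirror this setup):

```shell
#!/usr/bin/env bash
# Slot budget of a standalone cluster: taskmanagers * numberOfTaskSlots.
taskmanagers=1
slots_per_tm=6          # numberOfTaskSlots from config.yaml above
job_parallelism=6       # e.g. a pipeline submitted with parallelism 6

total_slots=$(( taskmanagers * slots_per_tm ))
if (( job_parallelism <= total_slots )); then
  echo "OK: parallelism $job_parallelism fits into $total_slots slots"
else
  echo "NOT ENOUGH SLOTS: need $job_parallelism, have $total_slots"
fi
```

If the parallelism exceeds the slot budget, either raise numberOfTaskSlots or start more TaskManagers.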

Set the Flink environment variables

bash
cd /etc/profile.d
cat flink.sh 

#export HADOOP_CLASSPATH=`hadoop classpath`
FLINK_HOME=/data/src/flink/flink-1.19.1
PATH=$PATH:$FLINK_HOME/bin:$FLINK_HOME/sbin

export PATH
export FLINK_HOME

Start Flink

bash
./start-cluster.sh

Check the processes with jps

bash
jps
760234 StandaloneSessionClusterEntrypoint
390132 Jps
760880 TaskManagerRunner

Check the Flink Web UI at {ip}:{port}

Extract Flink CDC

bash
tar -zxvf flink-cdc-3.3.0-bin.tar.gz

Download the Pipeline Connector JARs and Source Connector JARs into the lib directory

bash
ls /data/src/flink/flink-cdc-3.3.0/lib
flink-cdc-dist-3.3.0.jar                              flink-cdc-pipeline-connector-maxcompute-3.3.0.jar  flink-sql-connector-tidb-cdc-3.3.0.jar
flink-cdc-pipeline-connector-doris-3.3.0.jar          flink-cdc-pipeline-connector-mysql-3.3.0.jar       mysql-connector-java-8.0.28.jar
flink-cdc-pipeline-connector-elasticsearch-3.3.0.jar  flink-cdc-pipeline-connector-paimon-3.3.0.jar
flink-cdc-pipeline-connector-kafka-3.3.0.jar          flink-cdc-pipeline-connector-starrocks-3.3.0.jar

Download the MySQL JDBC driver (https://mvnrepository.com/artifact/mysql/mysql-connector-java/8.0.28):

bash
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar
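The pipeline connector JARs in the listing above can be fetched from Maven Central in the same way. A sketch that only builds the download URLs, assuming the Flink CDC 3.3.0 artifacts are published under the `org.apache.flink` group on repo1.maven.org (verify the coordinates on mvnrepository.com before downloading):

```shell
#!/usr/bin/env bash
# Build Maven Central URLs for the Flink CDC pipeline connector jars.
version=3.3.0
base="https://repo1.maven.org/maven2/org/apache/flink"

for conn in mysql doris starrocks; do
  artifact="flink-cdc-pipeline-connector-${conn}"
  url="${base}/${artifact}/${version}/${artifact}-${version}.jar"
  echo "$url"
  # wget -P /data/src/flink/flink-cdc-3.3.0/lib "$url"   # uncomment to download
done
```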

4. Set a Flink checkpoint interval to support incremental data sync

  • execution.checkpointing.interval: 3000

Parameter notes

  • execution.checkpointing.interval: how often the Flink job takes checkpoints. Checkpointing is Flink's fault-tolerance mechanism: the job state is saved periodically so that, after a failure, the job can be restored from the most recent checkpoint.
  • 3000: the checkpoint interval in milliseconds (ms); 3000 ms is 3 seconds.
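In Flink 1.19's config.yaml, flat configuration keys are written in nested form; a minimal sketch of where this setting would go (layout assumed from the YAML config format shown earlier, check the Flink configuration docs for your version):

```yaml
execution:
  checkpointing:
    interval: 3000   # take a checkpoint every 3 seconds
```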

5. YAML pipeline definitions for MySQL to Doris and MySQL to StarRocks

The files can be placed in any directory.

mysql-to-doris.yaml

yaml
source:
  type: mysql
  hostname: ip
  port: 3306
  username: *********
  password: ************
  tables: data_entry_test.debeziumOfflineClusterInfo,data_entry_test.debeziumRealtimeClusterInfo
  server-id: 5400-5404
  server-time-zone: Asia/Shanghai

sink:
  type: doris
  fenodes: ip:8030
  username: ***********
  password: *************

route:
  - source-table: data_entry_test.debeziumOfflineClusterInfo
    sink-table: optics.debeziumOfflineClusterInfo
  - source-table: data_entry_test.debeziumRealtimeClusterInfo
    sink-table: optics.debeziumRealtimeClusterInfo

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2

mysql-to-starrocks.yaml

yaml
################################################################################
# Description: Sync MySQL tables to StarRocks
################################################################################
source:
  type: mysql
  hostname: ip
  port: 3306
  username: *********
  password: **********
  tables: data_entry_test.debeziumOfflineClusterInfo,data_entry_test.debeziumRealtimeClusterInfo
  server-id: 5400-5404
  server-time-zone: Asia/Shanghai

sink:
  type: starrocks
  name: StarRocks Sink
  jdbc-url: jdbc:mysql://ip:9030
  load-url: ip:8030
  username: ****************
  password: ****************

route:
  - source-table: data_entry_test.debeziumOfflineClusterInfo
    sink-table: dd_test_starrocks.debeziumOfflineClusterInfo
  - source-table: data_entry_test.debeziumRealtimeClusterInfo
    sink-table: dd_test_starrocks.debeziumRealtimeClusterInfo

pipeline:
  name: MySQL to StarRocks Pipeline
  parallelism: 6
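One thing worth checking in this file (my observation, not from the source): the MySQL CDC source hands one server ID to each parallel reader, so the server-id range should contain at least as many IDs as the source parallelism. A quick bash sanity check:

```shell
#!/usr/bin/env bash
# Check that the server-id range provides enough IDs for the parallelism.
server_id_range="5400-5404"   # from the source section above
parallelism=6                 # from the pipeline section above

lo=${server_id_range%-*}      # 5400
hi=${server_id_range#*-}      # 5404
ids=$(( hi - lo + 1 ))

if (( ids < parallelism )); then
  echo "WARN: only $ids server IDs for parallelism $parallelism; widen the range"
else
  echo "OK: $ids server IDs cover parallelism $parallelism"
fi
```

With the values above the check warns: 5400-5404 is only 5 IDs while the pipeline parallelism is 6, so widening the range (e.g. 5400-5405) would avoid server-ID clashes between readers.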

Start Flink

bash
./start-cluster.sh

Start the Flink CDC pipeline

bash
/data/src/flink/flink-cdc-3.3.0/bin/flink-cdc.sh \
  /data/src/flink/flink-cdc-3.3.0/conf/mysql-to-starrocks.yaml

Check the job in the Flink Web UI

bash
2025-02-18 13:48:49,973 INFO  com.starrocks.connector.flink.catalog.StarRocksCatalog       [] - Success to create table dd_test_starrocks.dd_test_starrocks, sql: CREATE TABLE IF NOT EXISTS dd_test_starrocks.debeziumOfflineClusterInfo (
id VARCHAR(21) NOT NULL,
servername VARCHAR(6168) NOT NULL,
connectorname VARCHAR(6168) NOT NULL,
databasename VARCHAR(6168) NOT NULL,
url VARCHAR(6168) NOT NULL,
topicname VARCHAR(6168) NOT NULL,
clustername VARCHAR(6168) NOT NULL
) PRIMARY KEY (id)
DISTRIBUTED BY HASH (id);
bash
2025-02-18 14:04:25,298 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: Flink CDC Event Source: mysql -> SchemaOperator -> PrePartition (1/2)#0 (2069f3b2a289abd02012736f795a34b7_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from INITIALIZING to RUNNING.
2025-02-18 14:04:25,333 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: Flink CDC Event Source: mysql -> SchemaOperator -> PrePartition (2/2)#0 (2069f3b2a289abd02012736f795a34b7_cbc357ccb763df2852fee8c4fc7d55f2_1_0) switched from INITIALIZING to RUNNING.
bash
2025-02-18 14:09:35,729 INFO  com.starrocks.data.load.stream.DefaultStreamLoader           [] - Stream load completed, label : flink-84c2fdac-3341-4b5b-8bf1-3946098c0a97, database : dd_test_starrocks, table : debeziumOfflineClusterInfo, body : {
    "Status": "OK",
    "Message": "",
    "Label": "flink-84c2fdac-3341-4b5b-8bf1-3946098c0a97",
    "TxnId": 108875857,
    "LoadBytes": 133959,
    "StreamLoadPlanTimeMs": 0,
    "ReceivedDataTimeMs": 0
}

8. Compare the MySQL and StarRocks tables

MySQL table

sql
-- data_entry_test.debeziumOfflineClusterInfo definition

CREATE TABLE `debeziumOfflineClusterInfo` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `servername` varchar(2056) NOT NULL COMMENT 'connector identifier',
  `connectorname` varchar(2056) NOT NULL COMMENT 'connector name',
  `databasename` varchar(2056) NOT NULL COMMENT 'database name',
  `url` varchar(2056) NOT NULL COMMENT 'database name',
  `topicname` varchar(2056) NOT NULL COMMENT 'topic name',
  `clustername` varchar(2056) NOT NULL COMMENT 'cluster name',
  `database_server_id` varchar(256) NOT NULL COMMENT 'cluster name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=765 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

StarRocks table

sql
-- dd_test_starrocks.debeziumOfflineClusterInfo definition

CREATE TABLE `debeziumOfflineClusterInfo` (
  `id` varchar(21) NOT NULL COMMENT "",
  `servername` varchar(6168) NOT NULL COMMENT "",
  `connectorname` varchar(6168) NOT NULL COMMENT "",
  `databasename` varchar(6168) NOT NULL COMMENT "",
  `url` varchar(6168) NOT NULL COMMENT "",
  `topicname` varchar(6168) NOT NULL COMMENT "",
  `clustername` varchar(6168) NOT NULL COMMENT ""
) ENGINE=OLAP 
PRIMARY KEY(`id`)
DISTRIBUTED BY HASH(`id`)
PROPERTIES (
"replication_num" = "3",
"in_memory" = "false",
"storage_format" = "DEFAULT",
"enable_persistent_index" = "false",
"compression" = "LZ4"
);
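The column types in the auto-created StarRocks table differ from the MySQL originals. A plausible explanation (my reading, not stated in the source): StarRocks measures VARCHAR length in bytes while MySQL measures it in characters, so a character length is multiplied by 3 (the maximum bytes per character in 3-byte UTF-8), and the unsigned BIGINT key is mapped to a VARCHAR wide enough for its 20 decimal digits. The arithmetic matches the DDL above:

```shell
#!/usr/bin/env bash
# Reproduce the column-length mapping seen in the auto-created table.
mysql_len=2056
starrocks_len=$(( mysql_len * 3 ))           # bytes, assuming 3-byte UTF-8
echo "varchar($mysql_len) -> varchar($starrocks_len)"

max_bigint_unsigned=18446744073709551615     # 2^64 - 1
digits=${#max_bigint_unsigned}               # 20 decimal digits
echo "bigint unsigned -> varchar($(( digits + 1 )))"
```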

As shown above, the table was created in StarRocks automatically, and both the historical data and the incremental data were synced successfully.

Fine-grained schema-change control:

  • Supported operations include adding tables, adding columns, renaming columns, changing column definitions, dropping columns, dropping tables, and truncating tables

When new tables are added in the upstream database, the CDC YAML pipeline can recognize and sync them automatically, without reconfiguring the job. There are two cases:

  • Historical data sync: enable the scan.newly-added-table.enabled option and restart the job from a savepoint to read the new tables' historical data.
  • Incremental data sync: simply enable the scan.binlog.newly-added-table.enabled option to sync the new tables' incremental data automatically.
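As a sketch, the two options would sit in the source section of the pipeline YAML (placement assumed from the option names; check the Flink CDC MySQL source documentation for your version):

```yaml
source:
  type: mysql
  # ...
  scan.newly-added-table.enabled: true          # historical data of new tables (restart from savepoint)
  # scan.binlog.newly-added-table.enabled: true # or: incremental data of new tables only
```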