Flink series: using Flink CDC 3 to sync data from MySQL to Doris and StarRocks

Extract Flink

bash
tar -zxvf flink-1.19.1-bin-scala_2.12.tgz

Edit the Flink configuration file config.yaml

yaml
taskmanager:
  bind-host: localhost
  host: localhost
  numberOfTaskSlots: 6
  memory:
    process:
      size: 1728m

parallelism:
  default: 1
rest:
  address: 10.66.77.104
  # network interface, such as 0.0.0.0.
  bind-address: 10.66.77.104
  # port: 8081
  # # Port range for the REST and web server to bind to.
  # bind-port: 8080-8090

Set the Flink environment variables

bash
cd /etc/profile.d
cat flink.sh 

#export HADOOP_CLASSPATH=`hadoop classpath`
FLINK_HOME=/data/src/flink/flink-1.19.1
PATH=$PATH:$FLINK_HOME/bin:$FLINK_HOME/sbin

export PATH
export FLINK_HOME

Start Flink (run from $FLINK_HOME/bin)

bash
./start-cluster.sh

Check the running processes with jps

bash
jps
760234 StandaloneSessionClusterEntrypoint
390132 Jps
760880 TaskManagerRunner

Open the Flink web UI at {ip}:{port}.

Extract Flink CDC

bash
tar -zxvf flink-cdc-3.3.0-bin.tar.gz

Download the Pipeline Connector JARs and Source Connector JARs into the lib directory

bash
/data/src/flink/flink-cdc-3.3.0/lib   ls
flink-cdc-dist-3.3.0.jar                              flink-cdc-pipeline-connector-maxcompute-3.3.0.jar  flink-sql-connector-tidb-cdc-3.3.0.jar
flink-cdc-pipeline-connector-doris-3.3.0.jar          flink-cdc-pipeline-connector-mysql-3.3.0.jar       mysql-connector-java-8.0.28.jar
flink-cdc-pipeline-connector-elasticsearch-3.3.0.jar  flink-cdc-pipeline-connector-paimon-3.3.0.jar
flink-cdc-pipeline-connector-kafka-3.3.0.jar          flink-cdc-pipeline-connector-starrocks-3.3.0.jar

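The MySQL JDBC driver (mysql-connector-java 8.0.28) is listed at https://mvnrepository.com/artifact/mysql/mysql-connector-java/8.0.28 and can be downloaded directly into the lib directory:

bash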
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar

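4. Configure Flink checkpointing to support incremental sync

  • execution.checkpointing.interval: 3000

Parameter description

  • execution.checkpointing.interval: sets how often the Flink job takes a checkpoint. Checkpointing is Flink's fault-tolerance mechanism: job state is saved periodically so that, after a failure, the job can recover from the most recent checkpoint.
  • 3000: the checkpoint interval in milliseconds (ms), so 3000 ms equals 3 seconds.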
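As a minimal sketch, the setting could be placed in Flink's conf/config.yaml alongside the options edited earlier. Flink 1.19's config.yaml is standard YAML, so the nested form below should be equivalent to the flattened key shown above. Setting it cluster-wide rather than per job is an assumption about this setup.

```yaml
# conf/config.yaml (Flink 1.19), nested form of execution.checkpointing.interval
execution:
  checkpointing:
    # A bare number is interpreted as milliseconds: 3000 ms = 3 s.
    interval: 3000
```

The cluster needs to be restarted after editing config.yaml for the new value to take effect.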

5. YAML configuration files for MySQL to Doris and MySQL to StarRocks

The files can be placed in any directory.

mysql-to-doris.yaml

yaml
   source:
     type: mysql
     hostname: ip
     port: 3306
     username: *********
     password: ************
     tables: data_entry_test.debeziumOfflineClusterInfo,data_entry_test.debeziumRealtimeClusterInfo
     server-id: 5400-5404
     server-time-zone: Asia/Shanghai

   sink:
     type: doris
     fenodes: ip:8030
     username: ***********
     password: *************

   route:
     - source-table: data_entry_test.debeziumOfflineClusterInfo
       sink-table: optics.debeziumOfflineClusterInfo
     - source-table: data_entry_test.debeziumRealtimeClusterInfo
       sink-table: optics.debeziumRealtimeClusterInfo


   pipeline:
     name: Sync MySQL Database to Doris
     parallelism: 2

mysql-to-starrocks.yaml

yaml
################################################################################
# Description: Sync MySQL tables to StarRocks
################################################################################
source:
  type: mysql
  hostname: ip
  port: 3306
  username: *********
  password: **********
  tables: data_entry_test.debeziumOfflineClusterInfo,data_entry_test.debeziumRealtimeClusterInfo
  # The range needs at least one server ID per source subtask (pipeline parallelism is 6).
  server-id: 5400-5405
  server-time-zone: Asia/Shanghai

sink:
  type: starrocks
  name: StarRocks Sink
  jdbc-url: jdbc:mysql://ip:9030
  load-url: ip:8030
  username: ****************
  password: ****************
route:
  - source-table: data_entry_test.debeziumOfflineClusterInfo
    sink-table: dd_test_starrocks.debeziumOfflineClusterInfo
  - source-table: data_entry_test.debeziumRealtimeClusterInfo
    sink-table: dd_test_starrocks.debeziumRealtimeClusterInfo
pipeline:
  name: MySQL to StarRocks Pipeline
  parallelism: 6

Start Flink (if the cluster is not already running)

bash
./start-cluster.sh

Start the Flink CDC job

bash
/data/src/flink/flink-cdc-3.3.0/bin/flink-cdc.sh /data/src/flink/flink-cdc-3.3.0/conf/mysql-to-starrocks.yaml

Check the job in the Flink web UI. The Flink logs show the sink table being created in StarRocks:

text
2025-02-18 13:48:49,973 INFO  com.starrocks.connector.flink.catalog.StarRocksCatalog       [] - Success to create table dd_test_starrocks.dd_test_starrocks, sql: CREATE TABLE IF NOT EXISTS dd_test_starrocks.debeziumOfflineClusterInfo (
id VARCHAR(21) NOT NULL,
servername VARCHAR(6168) NOT NULL,
connectorname VARCHAR(6168) NOT NULL,
databasename VARCHAR(6168) NOT NULL,
url VARCHAR(6168) NOT NULL,
topicname VARCHAR(6168) NOT NULL,
clustername VARCHAR(6168) NOT NULL
) PRIMARY KEY (id)
DISTRIBUTED BY HASH (id);

The source tasks then switch to RUNNING:

text
2025-02-18 14:04:25,298 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: Flink CDC Event Source: mysql -> SchemaOperator -> PrePartition (1/2)#0 (2069f3b2a289abd02012736f795a34b7_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from INITIALIZING to RUNNING.
2025-02-18 14:04:25,333 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Source: Flink CDC Event Source: mysql -> SchemaOperator -> PrePartition (2/2)#0 (2069f3b2a289abd02012736f795a34b7_cbc357ccb763df2852fee8c4fc7d55f2_1_0) switched from INITIALIZING to RUNNING.

A Stream Load into StarRocks completes:

text
2025-02-18 14:09:35,729 INFO  com.starrocks.data.load.stream.DefaultStreamLoader           [] - Stream load completed, label : flink-84c2fdac-3341-4b5b-8bf1-3946098c0a97, database : dd_test_starrocks, table : debeziumOfflineClusterInfo, body : {
    "Status": "OK",
    "Message": "",
    "Label": "flink-84c2fdac-3341-4b5b-8bf1-3946098c0a97",
    "TxnId": 108875857,
    "LoadBytes": 133959,
    "StreamLoadPlanTimeMs": 0,
    "ReceivedDataTimeMs": 0
}

8. Compare the MySQL table and the StarRocks table

MySQL table

sql
-- data_entry_test.debeziumOfflineClusterInfo definition

CREATE TABLE `debeziumOfflineClusterInfo` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `servername` varchar(2056) NOT NULL COMMENT 'connector identifier name',
  `connectorname` varchar(2056) NOT NULL COMMENT 'connector name',
  `databasename` varchar(2056) NOT NULL COMMENT 'database name',
  `url` varchar(2056) NOT NULL COMMENT 'database name',
  `topicname` varchar(2056) NOT NULL COMMENT 'topic name',
  `clustername` varchar(2056) NOT NULL COMMENT 'cluster name',
  `database_server_id` varchar(256) NOT NULL COMMENT 'cluster name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=765 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

StarRocks table

sql
-- dd_test_starrocks.debeziumOfflineClusterInfo definition

CREATE TABLE `debeziumOfflineClusterInfo` (
  `id` varchar(21) NOT NULL COMMENT "",
  `servername` varchar(6168) NOT NULL COMMENT "",
  `connectorname` varchar(6168) NOT NULL COMMENT "",
  `databasename` varchar(6168) NOT NULL COMMENT "",
  `url` varchar(6168) NOT NULL COMMENT "",
  `topicname` varchar(6168) NOT NULL COMMENT "",
  `clustername` varchar(6168) NOT NULL COMMENT ""
) ENGINE=OLAP 
PRIMARY KEY(`id`)
DISTRIBUTED BY HASH(`id`)
PROPERTIES (
"replication_num" = "3",
"in_memory" = "false",
"storage_format" = "DEFAULT",
"enable_persistent_index" = "false",
"compression" = "LZ4"
);

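As shown above, the table was created in StarRocks successfully, and both the historical data and the incremental data were synchronized. Note that each varchar(2056) column became varchar(6168), three times the original length, because StarRocks measures VARCHAR length in bytes rather than characters.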

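Fine-grained control over schema change policies:

  • Supported operations include adding a table, adding a column, renaming a column, changing a column definition, dropping a column, dropping a table, and truncating a table.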
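A rough sketch of how such a policy might look in the pipeline YAML, based on the schema evolution options documented for Flink CDC 3.x. The option names (schema.change.behavior, exclude.schema.changes) and event tags come from that documentation, not from the setup above, and the particular policy chosen here is only a hypothetical example.

```yaml
pipeline:
  name: MySQL to StarRocks Pipeline
  parallelism: 6
  # How upstream schema changes are handled. Other documented values
  # include exception, try_evolve, lenient and ignore.
  schema.change.behavior: evolve

sink:
  type: starrocks
  # ... connection options as in mysql-to-starrocks.yaml ...
  # Hypothetical policy: apply schema changes downstream, except
  # destructive table-level events.
  exclude.schema.changes: [drop.table, truncate.table]
```

Which events a given sink can actually apply depends on the connector, so it is worth checking the StarRocks and Doris connector pages before relying on a specific combination.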

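When a new table is added to the upstream database, CDC YAML can detect it and sync its data automatically, without reconfiguring the job. There are two cases:

  • Historical data sync: enable the scan.newly-added-table.enabled option and restart the job from a savepoint to read the historical data of the newly added tables.
  • Incremental data sync: enable the scan.binlog.newly-added-table.enabled option to sync the incremental data of newly added tables automatically.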
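The two cases above could be sketched in the source block as follows. The wildcard table pattern is an assumption added for illustration: the files above list tables explicitly, and a newly created table is only picked up if it matches the tables setting. This sketch enables one option at a time; whether both may be combined is not covered here.

```yaml
source:
  type: mysql
  # ... hostname, port, username, password, server-id as in the files above ...
  # Hypothetical pattern: match every table in data_entry_test so that
  # tables created later fall inside the capture list.
  tables: data_entry_test.\.*
  # Case 1: read historical data of new tables too
  # (takes effect after restarting the job from a savepoint).
  scan.newly-added-table.enabled: true
  # Case 2: sync only incremental (binlog) data of new tables.
  # scan.binlog.newly-added-table.enabled: true
```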