SeaTunnel MySQL-CDC → ClickHouse sync guide

Under the hood this is a wrapper around Flink CDC. If you are able to, use Flink CDC directly; SeaTunnel is unstable and full of pitfalls.

First, pull down SeaTunnel 2.3.11 and make sure the related jars are in place. Double-check this step: because SeaTunnel is so unstable, I recommend matching my version numbers and ecosystem jars exactly.
```shell
[root@test-app clickhouse]# ls apache-seatunnel-2.3.11/lib
seatunnel-hadoop3-3.1.4-uber.jar seatunnel-hadoop-aws.jar
[root@test-app clickhouse]# ls apache-seatunnel-2.3.11/plugins/
hive-jdbc-2.3.9-standalone.jar mssql-jdbc-10.2.1.jre8.jar mysql-connector-java-8.0.27.jar postgresql-42.3.1.jar README.md sqlite-jdbc-3.36.0.3.jar sqljdbc4-4.0.jar
[root@test-app clickhouse]# ls apache-seatunnel-2.3.11/connectors/
connector-cdc-mysql-2.3.11.jar connector-clickhouse-2.3.11.jar connector-console-2.3.11.jar connector-fake-2.3.11.jar plugin-mapping.properties seatunnel-transforms-v2-2.3.11.jar
[root@test-app clickhouse]#
```
Next, write a shell script. For the initial full sync, run:
```bash
./bin/seatunnel.sh --config ./config/mysqlcdctock.template -e local JvmOption="-Xms4G -Xmx4G -XX:+UseG1GC"
```
After it has run for a while, you can find the corresponding job id in the output:
```
2025-12-15 16:59:15,749 INFO [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-7] - wait checkpoint completed: 4767
2025-12-15 16:59:15,869 INFO [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-7] - pending checkpoint(4767/1@1051533496620679169) notify finished!
2025-12-15 16:59:15,869 INFO [.s.e.s.c.CheckpointCoordinator] [seatunnel-coordinator-service-7] - start notify checkpoint completed, job id: 1051533496620679169, pipeline id: 1, checkpoint id:4767
2025-12-15 16:59:44,886 INFO [o.a.s.e.s.CoordinatorService ] [pool-7-thread-1] - [localhost]:5802 [seatunnel-135960] [5.1]
***********************************************
CoordinatorService Thread Pool Status
***********************************************
activeCount : 2
corePoolSize : 10
maximumPoolSize : 2147483647
poolSize : 10
completedTaskCount : 28232
taskCount : 28234
***********************************************
2025-12-15 16:59:44,887 INFO [o.a.s.e.s.CoordinatorService ] [pool-7-thread-1] - [localhost]:5802 [seatunnel-135960] [5.1]
***********************************************
Job info detail
***********************************************
createdJobCount : 0
scheduledJobCount : 0
runningJobCount : 1
failingJobCount : 0
failedJobCount : 0
cancellingJobCount : 0
canceledJobCount : 0
finishedJobCount : 0
***********************************************
```
For example, the job id in the log above is 1051533496620679169.
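If you don't want to eyeball the log, the job id can be extracted with grep. A minimal sketch; the log path here is a stand-in created for the demo, and the line format is assumed to match the "checkpoint completed" lines above:

```shell
# Stand-in log file for illustration; point LOG at your real SeaTunnel log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2025-12-15 16:59:15,869 INFO [.s.e.s.c.CheckpointCoordinator] - start notify checkpoint completed, job id: 1051533496620679169, pipeline id: 1, checkpoint id:4767
EOF
# Grab the first "job id: <digits>" occurrence and keep only the number.
JOB_ID=$(grep -o 'job id: [0-9]*' "$LOG" | head -n1 | awk '{print $3}')
echo "$JOB_ID"
```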
If the job is interrupted, resume it from where it stopped using the command below. Likewise, once the full sync has finished, it is best to stop the current sync process with kill -13 and restart it with the command below: because it adds the restore option, maintainability improves a lot. Note that some parameters in amzn_order.template also need adjusting; full sync and incremental sync differ slightly. This template is the config file, i.e. the metadata file describing the source and sink data flow.
```bash
nohup /apache-seatunnel-2.3.11/bin/seatunnel.sh --restore 1051533496620679169 --config /apache-seatunnel-2.3.11/config/amzn_order.template -e local JvmOption="-Xms3G -Xmx3G -XX:+UseG1GC" > /apache-seatunnel-2.3.11/logs/amzn_order.log 2>&1 &
```

The key flag is `--restore 1051533496620679169`.
Recovery relies on the checkpoint_snapshot mechanism:
```shell
[root@test-app apache-seatunnel-2.3.11]# ls checkpoint_snapshot/
1051528334246150145 1051533496620679169
[root@test-app apache-seatunnel-2.3.11]#
[root@test-app apache-seatunnel-2.3.11]# ls checkpoint_snapshot/1051533496620679169
1765473815300-338-1-28.ser 1765474196823-602-1-31.ser 1765506409805-58-1-61.ser 1765789215750-20-1-4768.ser 1765789395905-906-1-4771.ser
1765473935993-536-1-29.ser 1765474329165-267-1-32.ser 1765506523907-569-1-62.ser 1765789275751-76-1-4769.ser
1765474077615-86-1-30.ser 1765506261314-645-1-60.ser 1765506685978-839-1-63.ser 1765789335750-750-1-4770.ser
[root@test-app apache-seatunnel-2.3.11]#
```
The *.ser files hold configuration-related metadata used for recovery, including the binlog offset (once incremental sync has started) and the current chunk position (if interrupted during the full sync).

A reference mysqlcdctock.template:
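To see which snapshot a restore would start from, you can sort the .ser files by the checkpoint id encoded at the end of the filename. A sketch, under the assumption (consistent with the listing above) that filenames follow `<timestamp>-<n>-<pipelineId>-<checkpointId>.ser`:

```shell
# Demo directory mimicking checkpoint_snapshot/<jobId>; replace with the real path.
DIR=$(mktemp -d)
touch "$DIR"/1765789215750-20-1-4768.ser "$DIR"/1765789395905-906-1-4771.ser
# Sort by the trailing checkpoint id and keep the newest snapshot.
LATEST=$(ls "$DIR" | sed 's/\.ser$//' | awk -F- '{print $NF, $0}' | sort -n | tail -n1 | awk '{print $2 ".ser"}')
echo "$LATEST"
```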
```hocon
env {
  job.name = "job name here"
  execution.parallelism = 10
  job.mode = "STREAMING"
  checkpoint.interval = 60000
  checkpoint.timeout = 1800000
  close.timeout = 120000
}
source {
  MySQL-CDC {
    # For the initial full sync, ADD these parallelism options -- do not skip
    # them, or the sync will be painfully slow.
    # But once you switch to incremental binlog listening, comment them out
    # again: this plugin currently has a bug that the community has not fixed,
    # and a single-threaded reader sidesteps it.
    #parallelism = 6
    #incremental.parallelism = 4
    #snapshot.parallelism = 4
    base-url = "jdbc:mysql://redacted.rds.aliyuncs.com:3306/redacted_db"
    username = "seatunnel"
    password = "redacted"
    table-names = ["db.table"] # multiple tables are supported: ["db.table1", "db.table2"]
    snapshot.split.size = 50000
    snapshot.fetch.size = 10000
    schema-changes.enabled = false
    chunk-key-column = "id"
    startup.mode = "initial"
    startup_mode = "initial"
  }
}
sink {
  Clickhouse {
    host = "redacted_ip:redacted_port"
    database = "redacted"
    username = "admin"
    password = "redacted"
    table = "db.table"
    primary_key = "id"
    support_upsert = true
    connect_timeout = 60000
    socket_timeout = 600000
    bulk_size = 50000
    flush_interval = 2000
    data_save_mode = "APPEND_DATA"
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    clickhouse.config = {
      socket_timeout = "600000"
      connect_timeout = "10000"
      keep_alive_timeout = "600000"
    }
  }
}
```
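Switching the template from the full-sync phase to the incremental (binlog) phase mostly means commenting the parallelism knobs back out, as noted in the config comments. A hypothetical sed one-liner for that edit (the file here is a throwaway demo; GNU sed's in-place `-i` is assumed):

```shell
# Demo config containing only the three snapshot-phase parallelism options.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
parallelism = 6
incremental.parallelism = 4
snapshot.parallelism = 4
EOF
# Comment out all three options in place before restarting in incremental mode.
sed -i -E 's/^(parallelism|incremental\.parallelism|snapshot\.parallelism)/#&/' "$CFG"
cat "$CFG"
```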
For multi-table sync, follow my config below exactly, changing nothing, or you can easily run into serious problems. Note that the parallelism here should be higher, especially with many tables, and that you must use SeaTunnel 2.3.11: SeaTunnel is messy and every version's implementation differs slightly.
```hocon
env {
  job.name = "job name here"
  execution.parallelism = 24
  job.mode = "STREAMING"
  checkpoint.interval = 60000
  checkpoint.timeout = 1800000
  close.timeout = 120000
}
source {
  MySQL-CDC {
    parallelism = 12
    incremental.parallelism = 4
    snapshot.parallelism = 4
    base-url = "jdbc:mysql://redacted.rds.aliyuncs.com:3306/redacted_db"
    username = "redacted"
    password = "redacted"
    table-names = ["db.table1","db.table2","db.table3","db.table4"]
    snapshot.split.size = 50000
    snapshot.fetch.size = 10000
    schema-changes.enabled = false
    chunk-key-column = "id"
    startup_mode = "initial"
  }
}
sink {
  Clickhouse {
    host = "172.21.18.12:6568"
    database = "database name"
    username = "redacted"
    password = "redacted"
    table = "${table_name}" # provided by the ClickHouse sink; keep this variable name exactly as-is to receive each table's name
    primary_key = "id"
    support_upsert = true
    connect_timeout = 60000
    socket_timeout = 600000
    bulk_size = 50000
    flush_interval = 2000
    data_save_mode = "APPEND_DATA"
    schema_save_mode = "CREATE_SCHEMA_WHEN_NOT_EXIST"
    clickhouse.config = {
      socket_timeout = "600000"
      connect_timeout = "10000"
      keep_alive_timeout = "600000"
    }
  }
}
```
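When the table list is long, you can generate the table-names line rather than typing it by hand. A minimal sketch; the table names are placeholders:

```shell
# Placeholder table list; replace with your real "db.table" names.
TABLES="mydb.t1 mydb.t2 mydb.t3"
# Quote each name, join with commas, drop the trailing comma.
LIST=$(printf '"%s",' $TABLES | sed 's/,$//')
echo "table-names = [$LIST]"
```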
Next, create the ClickHouse tables in advance (the targets of the MySQL-CDC sync). Watch the column types here; ideally, build your ClickHouse tables strictly following the column-handling style below.

MySQL → ClickHouse mapping (columns default to Nullable, except primary key / control columns):

- bigint → Int64 (e.g. id)
- bigint unsigned / version counters → UInt64 (e.g. version)
- int → Int32; tinyint(1) booleans → Bool
- varchar/text → String
- decimal(p,s) → Decimal(p,s) (keep precision and scale identical)
- datetime → DateTime64(3) (uniform millisecond precision)
- date → Date or Date32 (Date32 recommended on newer ClickHouse)
- JSON-like columns: store as String (apply parsing functions only when needed)
- Fact columns and external identifiers: allow NULL (Nullable in ClickHouse)
- Primary key, version counter, soft-delete flag: NOT NULL
- Boolean columns: tinyint(1) (0/1) in MySQL, Bool in ClickHouse
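For scripting table generation, the mapping above can be encoded as a small lookup. A simplified sketch; the `ch_type` helper and the fallback to String are assumptions, not part of any SeaTunnel tooling:

```shell
# Hypothetical helper: map a MySQL column type to the ClickHouse type
# following the mapping table above.
ch_type() {
  case "$1" in
    bigint)            echo "Int64" ;;
    "bigint unsigned") echo "UInt64" ;;
    int)               echo "Int32" ;;
    "tinyint(1)")      echo "Bool" ;;
    varchar|text)      echo "String" ;;
    decimal\(*\))      echo "Decimal${1#decimal}" ;;  # keep (p,s) as-is
    datetime)          echo "DateTime64(3)" ;;
    date)              echo "Date32" ;;
    *)                 echo "String" ;;               # assumed fallback, incl. JSON
  esac
}
ch_type bigint
ch_type "decimal(10,2)"
```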
MySQL template:

```sql
CREATE TABLE `{db}`.`{table}` (
  `id` BIGINT NOT NULL COMMENT 'primary key',
  -- business columns...
  `create_time` DATETIME DEFAULT NULL COMMENT 'created at',
  `update_time` DATETIME DEFAULT NULL COMMENT 'updated at',
  `is_delete` BIGINT NOT NULL DEFAULT 0 COMMENT 'deleted: 0 = no; >0 = delete batch/time',
  `version` BIGINT UNSIGNED NOT NULL DEFAULT 0 COMMENT 'version (monotonically increasing)',
  PRIMARY KEY (`id`) USING BTREE,
  UNIQUE KEY `uni_{biz}` (...),
  KEY `idx_xxx` (...),
  KEY `idx_yyy` (...)
) ENGINE = InnoDB
  DEFAULT CHARSET = utf8mb4
  COLLATE = utf8mb4_0900_as_cs;
```
ClickHouse template:

```sql
CREATE TABLE {db}.{table} (
  `id` Int64,
  -- business columns: Nullable by default, same names and meaning as in MySQL
  `create_time` Nullable(DateTime64(3)),
  `update_time` Nullable(DateTime64(3)),
  `is_delete` Int64,
  `version` UInt64
)
ENGINE = ReplacingMergeTree(version)
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192;
```
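One thing to remember about ReplacingMergeTree(version): it deduplicates only at merge time, so queries may see duplicates until a merge happens (use SELECT ... FINAL to force the collapsed view). The rule a merge applies is: for each id, keep the row with the highest version. A shell illustration of that rule on toy data:

```shell
# Toy rows: "id version payload". Ids 1 and 2, with id 1 written twice.
ROWS=$(mktemp)
printf '%s\n' '1 10 a' '1 12 b' '2 5 c' > "$ROWS"
# Per id, keep only the highest-version row -- what a merge would keep.
RESULT=$(sort -k1,1n -k2,2nr "$ROWS" | awk '!seen[$1]++')
echo "$RESULT"
```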
One final note: this guide covers a local, single-node deployment. If you need a clustered deployment, contact me for paid support.