In the vast universe of big data processing, data integration is the gravitational force that ties the galaxies together, and its importance is hard to overstate. SeaTunnel, a rising star in this field, is an easy-to-use, high-performance distributed data integration platform built for real-time synchronization of massive data volumes. It moves tens of billions of records every day, stably and efficiently, and is already a trusted workhorse in production at close to a hundred companies.
1. SeaTunnel: A Gem of Data Integration
Tackling the pain points of data integration
- A maze of data sources: there are hundreds of commonly used data sources, each with multiple versions and compatibility traps between them. New sources keep appearing, and finding a tool that supports all of them quickly and comprehensively is like looking for a needle in a haystack.
- Complex synchronization scenarios: data synchronization has to cover offline full sync, offline incremental sync, CDC, real-time sync, whole-database sync, and more, and each scenario has its own requirements.
- Heavy resource demands: when synchronizing a large number of small tables in real time, existing integration tools tend to devour compute resources or JDBC connections, saddling companies with high costs.
- No quality control or monitoring: data loss and duplication creep into integration pipelines, and without effective monitoring it is hard to see what is really happening to the data inside a job.
- A tangled technology stack: companies run a wide variety of technical components, forcing users to develop separate synchronization programs for each one.
- Difficult management and maintenance: because the underlying engines (Flink/Spark) differ, offline and real-time synchronization are usually developed and managed separately, which adds considerable operational burden.
SeaTunnel's standout features
- Rich, extensible Connector ecosystem: SeaTunnel provides a Connector API that does not depend on any specific execution engine. Connectors (Source, Transform, Sink) built on this API can run on many different engines, currently including the SeaTunnel engine (Zeta), Flink, and Spark.
- Plugin-based design: the plugin architecture makes it easy to develop your own Connector and integrate it into the SeaTunnel project. SeaTunnel already ships more than 100 connectors, and the number keeps growing.
- Unified batch and streaming: Connectors developed against the SeaTunnel Connector API are fully compatible with offline, real-time, full, and incremental synchronization, which greatly reduces the effort of managing data integration jobs.
- Distributed snapshots for data consistency: a distributed snapshot algorithm ensures data stays accurate and complete as it flows through the pipeline.
- Multi-engine support: SeaTunnel uses its own engine (Zeta) for synchronization by default, but it can also run Connectors on Flink or Spark to match the components a company already operates, and it is compatible with multiple versions of both.
- JDBC reuse and multi-table log parsing: multi-table and whole-database synchronization avoid opening excessive JDBC connections, and multi-table and whole-database log reading and parsing avoid re-reading and re-parsing logs in multi-table CDC scenarios.
- High throughput, low latency: parallel reads and writes provide stable, reliable, high-throughput, low-latency synchronization.
- Comprehensive real-time monitoring: detailed monitoring information is reported for every step of a synchronization job, so you can easily see how many rows were read and written, the data volume, QPS, and other key metrics.

Powerful as SeaTunnel is, deploying it with Docker can still run into a number of tricky problems. Let's walk through them and their solutions.
2. The Official Way to Deploy SeaTunnel with Docker
The project officially offers three deployment options: Locally, Docker, and K8S. This article focuses on the Docker option, deploying SeaTunnel with the docker-compose file from the official documentation. The official example looks like this:
```yaml
version: '3'
services:
  master:
    image: apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
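```
With the file saved as docker-compose.yaml, the cluster can be brought up and checked as follows. A minimal sketch, assuming the image can actually be pulled (see Pitfall 1 below if it cannot):
```shell
# start the three-node cluster in the background
docker compose up -d          # or: docker-compose up -d on older installations

# confirm that the master and both workers are running
docker ps --filter "name=seatunnel"

# follow the master log to watch the cluster members join
docker logs -f seatunnel_master
```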
3. Common Pitfalls of Deploying SeaTunnel with Docker, and How to Fix Them
Pitfall 1: the image cannot be downloaded
Problem: when pulling the apache/seatunnel image, its default full path docker.io/apache/seatunnel is not reachable from mainland China, so the download fails and the deployment grinds to a halt.
Solution:
1. Temporary workaround: change the image name to docker.1ms.run/apache/seatunnel and the pull succeeds, so you can keep the deployment moving.
2. Permanent fix:
a. Edit /etc/docker/daemon.json and configure registry mirrors:
```shell
sudo vim /etc/docker/daemon.json
{
  "registry-mirrors": [
    "https://docker.1ms.run",
    "https://docker.xuanyuan.me"
  ]
}
```
b. Restart Docker:
```shell
systemctl daemon-reload
systemctl restart docker
```
Note: for more available registry mirrors, see this post: xuanyuan.me/blog/archiv...
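To confirm the mirror configuration actually took effect after the restart, a quick check:
```shell
# the configured mirrors should be listed under "Registry Mirrors"
docker info | grep -A 3 "Registry Mirrors"

# a test pull now goes through the mirror; the image name itself stays apache/seatunnel
docker pull apache/seatunnel
```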
Pitfall 2: messy log files
Problem: by default SeaTunnel uses a mixed logging configuration, so the logs of every job are dumped into the SeaTunnel Engine system log file. Finding and analyzing the logs of a specific job becomes like searching for one item in a cluttered warehouse.
Solution: update log4j2.properties so that each job gets its own log file, which keeps log management tidy. Change the configuration to rootLogger.appenderRef.file.ref = routingAppender; from then on each job writes its own log file, such as job-xxx1.log, job-xxx2.log, job-xxx3.log, and so on. For the change to take effect, mount the updated log4j2.properties into the containers. Here is the updated docker-compose configuration (a short sketch for extracting and editing the file follows the example):
```yaml
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # mount the log configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # mount the log configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # mount the log configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
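```
One way to prepare the mounted log4j2.properties is to copy the default file out of the image and switch the appender reference. A sketch, assuming the default config sits at /opt/seatunnel/config/log4j2.properties inside the image (the same path used in the mounts above):
```shell
mkdir -p ./config

# copy the default log4j2.properties out of the image without starting the cluster
docker create --name st-tmp docker.1ms.run/apache/seatunnel true
docker cp st-tmp:/opt/seatunnel/config/log4j2.properties ./config/log4j2.properties
docker rm st-tmp

# switch the root logger from the mixed file appender to the per-job routing appender
sed -i 's/^rootLogger.appenderRef.file.ref.*/rootLogger.appenderRef.file.ref = routingAppender/' ./config/log4j2.properties
```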
Pitfall 3: the RESTful API V2 is unreachable
Problem: after deployment, attempts to access the RESTful API V2 cannot connect, so SeaTunnel cannot be managed and operated conveniently through the API.
Solution: two things need to be configured correctly. First, enable the HTTP service in seatunnel.yaml:
```yaml
seatunnel:
  engine:
    http:
      enable-http: true
      port: 8080
      enable-dynamic-port: true
      port-range: 100
```
Second, expose the HTTP port in the docker-compose file. Here is the full docker-compose configuration:
```yaml
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
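```
Once the cluster is back up with the HTTP service enabled, a quick smoke test against the exposed port. This assumes the /overview and /running-jobs endpoints described in the REST API V2 documentation:
```shell
# cluster overview: node count, number of running jobs, and so on
curl http://localhost:8080/overview

# list the jobs currently running on the cluster
curl http://localhost:8080/running-jobs
```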
Pitfall 4: monitoring metrics don't work
Problem: monitoring has been configured, but the metrics never take effect, so key monitoring information about the synchronization process is unavailable and you are left guessing about the state of your jobs.
Solution: double-check the telemetry settings in seatunnel.yaml and make sure they read as follows:
```yaml
seatunnel:
  engine:
    telemetry:
      metric:
        enabled: true
```
With this in place, the monitoring metrics take effect and report the state of your synchronization jobs in real time.
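To check that metrics are actually being exported, you can query the telemetry endpoint. A sketch, assuming the Prometheus-format endpoint on the Hazelcast port that the telemetry documentation describes:
```shell
# metrics in Prometheus exposition format; show the first few series
curl -s http://localhost:5801/hazelcast/rest/instance/metrics | head -n 20
```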
Pitfall 5: wrong timestamps in the console logs
Problem: the timestamps in the console logs do not match the actual time, which makes troubleshooting and reconstructing the order of events far more difficult.
Solution: set the correct time zone in the docker-compose configuration. Here is the example with the time zone added:
```yaml
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
```
With the time zone set to Asia/Shanghai, the console log timestamps are correct again and give you a reliable time reference.
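A quick way to confirm the time zone took effect inside the containers:
```shell
# both should now print times in CST (UTC+8)
docker exec seatunnel_master date
docker exec seatunnel_worker_1 date
```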
Pitfall 6: metadata is lost when containers restart
Problem: after a container restart, metadata such as the cluster's state data (job status, resource status) and the state of every job and its tasks is gone, which is a disaster for a production environment that needs to run continuously.
Solution: by default the SeaTunnel Engine stores this data in an IMap (Hazelcast's distributed in-memory map), so the IMap needs to be persisted. The official recommendation is to deploy the cluster in separated mode, where only the Master nodes store IMap data and the Worker nodes do not, so only hazelcast-master.yaml needs to change. This article uses MinIO as the object storage for the IMap data; add the following to hazelcast-master.yaml:
```yaml
map:
  engine*:
    map-store:
      enabled: true
      initial-mode: EAGER
      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
      properties:
        type: hdfs
        namespace: /seatunnel/imap
        clusterName: seatunnel-cluster
        storage.type: s3
        s3.bucket: s3a://seatunnel-dev
        fs.s3a.access.key: etoDbE8uGdpg3ED8
        fs.s3a.secret.key: 6hkb90nPCaMrBcbhN1v5iC0QI0MeXDOk
        fs.s3a.endpoint: http://10.1.4.155:9000
        fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
```
At the same time, mount the hazelcast-master.yaml file into the container. Here is the updated docker-compose configuration:
```yaml
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
      # metadata persistence (stores the state of each job and its tasks so that, if the node running a job goes down, another node can read the previous state and recover the job): https://seatunnel.apache.org/zh-CN/docs/2.3.9/seatunnel-engine/separated-cluster-deployment
      - ./config/hazelcast-master.yaml:/opt/seatunnel/config/hazelcast-master.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
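```
After the stack is restarted, you can verify that the IMap data really landed in MinIO. A sketch using the MinIO client, assuming mc is installed and using the endpoint and keys from the hazelcast-master.yaml above (the exact object prefix inside the bucket depends on the configured namespace):
```shell
# register the MinIO endpoint under the alias "st-minio"
mc alias set st-minio http://10.1.4.155:9000 etoDbE8uGdpg3ED8 6hkb90nPCaMrBcbhN1v5iC0QI0MeXDOk

# the persisted IMap files should appear somewhere under the configured bucket
mc ls --recursive st-minio/seatunnel-dev
```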
Pitfall 7: checkpoints are lost when containers restart
Problem: just like the metadata, checkpoints are lost after a container restart, which seriously hurts the continuity and reliability of synchronization jobs and can lead to data inconsistencies.
Solution: store checkpoints in object storage. Using MinIO as an example, configure the following in seatunnel.yaml:
```yaml
seatunnel:
  engine:
    checkpoint:
      interval: 10000
      timeout: 60000
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage.type: s3
          s3.bucket: s3a://seatunnel-dev
          fs.s3a.access.key: ST4HTeGdARHk7Drf
          fs.s3a.secret.key: zyiJYIpYy0ewiozse6kSLIQG62vO9IUh
          fs.s3a.endpoint: http://10.1.4.155:9000
          fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
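```
A simple way to verify both the checkpoint storage and the recovery behavior is to let a streaming job run for a while, restart the master, and confirm the job comes back. A sketch reusing the st-minio alias from the previous check and the /running-jobs endpoint from Pitfall 3:
```shell
# checkpoint files should accumulate in the configured bucket while the job runs
mc ls --recursive st-minio/seatunnel-dev

# restart the master container, then confirm the job is still listed
docker compose restart master
curl http://localhost:8080/running-jobs
```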
4. Summary
SeaTunnel is an Apache open source project led by Chinese developers, and its documentation and code are relatively easy to follow. Real-world deployments do run into all sorts of problems, though. Every pitfall listed above can in fact be solved with information from the official documentation; the docs are just organized in a somewhat scattered way, so you have to read them carefully and dig a little. For convenience, here is the complete docker-compose configuration, which will hopefully help your SeaTunnel deployment go smoothly.
```yaml
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
      # metadata persistence (stores the state of each job and its tasks so that, if the node running a job goes down, another node can read the previous state and recover the job): https://seatunnel.apache.org/zh-CN/docs/2.3.9/seatunnel-engine/separated-cluster-deployment
      - ./config/hazelcast-master.yaml:/opt/seatunnel/config/hazelcast-master.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
```
I hope this article serves as a practical guide to deploying SeaTunnel with Docker, helps you get past the obstacles, and lets you take full advantage of SeaTunnel's data integration capabilities. If you have questions while reading, or run into new problems, feel free to leave a comment. And if you found this useful, a like or a share helps more people benefit from this hands-on experience.