一,Flink 和Flink CDC
1, Flink
Apache Flink是一个框架和分布式处理引擎,用于对无界和有界数据流进行有状态计算。
中文文档 Apache Flink Documentation | Apache Flink
官方文档 :https://flink.apache.org
Flink 中文社区视频课程:https://github.com/flink-china/flink-training-course
Flink 中文社区 :https://www.slidestalk.com/FlinkChina
ververica 教程 :https://training.ververica.com/
ververica 教程中文文档:https://ci.apache.org/projects/flink/flink-docs-master/zh/
源码:https://github.com/apache/flink
Flink 知识图谱 https://developer.aliyun.com/article/744741
2,Flink CDC
CDC 的全称是 Change Data Capture ,在广义的概念上,只要是能捕获数据变更的技术,我们都可以称之为 CDC 。
目前通常描述的 CDC 技术主要面向数据库的变更,是一种用于捕获数据库中数据变更的技术。CDC 技术的应用场景非常广泛:
- 数据迁移:常用于数据库备份、容灾等;
- 数据分发:将一个数据源分发给多个下游,常用于业务解耦、微服务;
- 数据采集:将分散异构的数据源集成到数据仓库中,消除数据孤岛,便于后续的分析。
目前业界主流的 CDC 实现机制可以分为两种:
2.1 基于查询的 CDC
- 离线调度查询作业,批处理。依赖表中的更新时间字段,每次执行查询去获取表中最新的数据;
- 无法捕获删除事件,从而无法保证数据一致性;
- 无法保障实时性,基于离线调度存在天然的延迟。
2.2 基于日志的 CDC
- 实时消费日志,流处理。例如 MySQL 的 binlog 日志完整记录了数据库中的变更,可以把binlog 文件当作流的数据源;
- 保障数据一致性,因为 binlog 文件包含了所有历史变更明细;
- 保障实时性,因为类似 binlog 的日志文件是可以流式消费的,提供的是实时数据。
2.3 主流CDC技术对比
-
DataX 不支持增量同步,Canal 不支持全量同步。虽然两者都是非常流行的数据同步工具,但在场景支持上仍不完善。
-
在全量+增量一体化同步方面,只有 Flink CDC、Debezium、Oracle Goldengate 支持较好。
-
在架构方面,Apache Flink 是一个非常优秀的分布式流处理框架,因此 Flink CDC 作为Apache Flink 的一个组件具有非常灵活的水平扩展能力。而 DataX 和 Canal 是个单机架构,在大数据场景下容易面临性能瓶颈的问题。
-
在数据加工的能力上,CDC 工具是否能够方便地对数据做一些清洗、过滤、聚合,甚至关联打宽。Flink CDC 依托强大的 Flink SQL 流式计算能力,可以非常方便地对数据进行加工。而Debezium 等则需要通过复杂的 Java 代码才能完成,使用门槛比较高。
-
另外,在生态方面,这里指的是上下游存储的支持。Flink CDC 上下游非常丰富,支持对接MySQL、PostgreSQL 等数据源,还支持写入到 TiDB、HBase、Kafka、Hudi 等各种存储系统中,也支持灵活的自定义 connector。
二,安装Flink
1,为了运行Flink,只需提前安装好 Java 11或以上版本
java -version
2,下载Flink
下载地址:Index of /dist/flinkhttps://archive.apache.org/dist/flink
可以找到你要安装的版本
wget https://archive.apache.org/dist/flink/flink-1.19.1/flink-1.19.1-bin-scala_2.12.tgz
3,解压
tar -xzf flink-1.19.1-bin-scala_2.12.tgz
4,进入到flink-1.19.1目录,启动集群
./bin/start-cluster.sh
5,如果想访问WebUI,可以看下面
【问题解决】Flink在linux上运行成功但是无法访问webUI界面_flink web ui的ip访问-CSDN博客
6,停止集群
./bin/stop-cluster.sh
三,运行Flink示例
在examples目录下有很多示例,可以试着运行
1,运行单词统计示例
./bin/flink run examples/streaming/WordCount.jar
进入到log目录,查看日志
tail -f flink-root-taskexecutor-0-rocky8-template.out
四,运行Flink CDC示例
1,通过编写sql脚本来实现同步数据
sql
SET execution.checkpointing.interval = 5s;
drop table if exists order_01;
CREATE TABLE order_01 (
id INT NOT NULL,
`order_id` VARCHAR,
amount VARCHAR,
remark VARCHAR,
PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'x.x.x.x',
'port' = '3306',
'username' = 'root',
'password' = 'p4ssword',
'database-name' = 'account',
'server-time-zone' = 'UTC',
'table-name' = 'order_01'
);
drop table if exists order_02;
CREATE TABLE order_02 (
id INT NOT NULL,
`order_id` VARCHAR,
amount VARCHAR,
remark VARCHAR,
PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://x.x.x.x:3306/account?useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=UTC',
'username' = 'root',
'password' = 'p4ssword',
'table-name' = 'order_02',
'driver' = 'com.mysql.cj.jdbc.Driver',
'scan.fetch-size' = '200'
);
INSERT INTO order_02
SELECT id,order_id,amount,remark
FROM order_01;
编写以上脚本,命名为flinkCdc2Mysql.sql,上传到flink的sql目录下,这里的sql是我新建的,你可以自己指定。
需要下载flink-connecter和flink-sql-connector包
下载地址:Central Repository: com/ververica
通过以下命令执行
./bin/sql-client.sh -f ./sql/flinkCdc2Mysql.sql
就可以完成数据从order_01表同步到order_02表。
2, 编写Java类,编译成jar,执行jar来实现数据同步
AccountVoucherSumaryDWSSQL类
java
package com.xxx.demo
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
/**
* @date 2024/10/31 下午3:57
*/
public class AccountVoucherSumaryDWSSQL {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);
env.enableCheckpointing(5000);
final StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
tEnv.executeSql("CREATE DATABASE IF NOT EXISTS account");
// 动态表,此为source表
tEnv.executeSql("CREATE TABLE account.order_01 (\n" +
" id INT,\n" +
" order_id VARCHAR,\n" +
" amount VARCHAR,\n" +
" remark VARCHAR,\n" +
" PRIMARY KEY (id) NOT ENFORCED\n" +
") WITH (\n" +
" 'connector' = 'mysql-cdc',\n" +
" 'hostname' = 'x.x.x.x',\n" +
" 'port' = '3306',\n" +
" 'username' = 'root',\n" +
" 'password' = 'p4ssword',\n" +
" 'database-name' = 'account',\n" +
" 'table-name' = 'order_01',\n" +
" 'server-time-zone' = 'UTC',\n" +
" 'scan.incremental.snapshot.enabled' = 'false'\n" +
")");
// 动态表,此为sink表。sink表和source表的connector不一样
tEnv.executeSql("CREATE TABLE account.order_02 (\n" +
" id INT,\n" +
" order_id VARCHAR,\n" +
" amount VARCHAR,\n" +
" remark VARCHAR,\n" +
" PRIMARY KEY (id) NOT ENFORCED\n" +
") WITH (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = 'jdbc:mysql://x.x.x.x:3306/account?useSSL=false&allowPublicKeyRetrieval=true&connectionTimeZone=UTC',\n" +
" 'username' = 'root',\n" +
" 'password' = 'p4ssword',\n" +
" 'table-name' = 'order_02',\n" +
" 'sink.buffer-flush.max-rows' = '1',\n" +
" 'sink.buffer-flush.interval' = '1s',\n" +
" 'sink.max-retries' = '3' \n"+
")");
tEnv.executeSql("INSERT INTO account.order_02 (id, order_id, amount, remark)\n" +
" select t1.id,\n" +
" t1.order_id,\n" +
" t1.amount,\n" +
" t1.remark\n" +
" from account.order_01 t1 \n");
env.execute();
}
}
pom文件
XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.xxx.demo</groupId>
<artifactId>FlinkDemo</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.version>2.12</scala.version>
<java.version>8</java.version>
<flink.version>1.19.1</flink.version>
<fastjson.version>1.2.62</fastjson.version>
<hadoop.version>2.8.3</hadoop.version>
<scope.mode>compile</scope.mode>
<slf4j.version>1.7.30</slf4j.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-scala-bridge_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java-bridge</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-files</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-csv</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc</artifactId>
<version>3.2.0-1.19</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka</artifactId>
<version>3.2.0-1.19</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>${fastjson.version}</version>
</dependency>
<!-- Add log dependencies when debugging locally -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime-web</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.7</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.26</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-runtime</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-test-utils</artifactId>
<version>${flink.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.26</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.doris</groupId>
<artifactId>flink-doris-connector-1.15</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>com.ververica</groupId>
<artifactId>flink-connector-mysql-cdc</artifactId>
<version>3.0.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
<!-- 打fatjar配置 -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<archive>
<manifest>
<!--这里指定要运行的main类-->
<mainClass>com.xxx.demo.AccountVoucherSumaryDWSSQL</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- 此处指定继承合并 -->
<phase>package</phase> <!-- 绑定到打包阶段 -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
完成之后,通过maven来打包。
上传包到flink下examples目录下,这里也新建一个目录CDC
通过以下命令来运行
./bin/flink run examples/cdc/FlinkDemo-1.0-SNAPSHOT-jar-with-dependencies.jar
上面两种运行方式运行之后,都可以去Flink的Web页面查看到运行的任务。
五,小结
上面只是搭建了Flink,并运行了示例项目,初步体验了一下Flink的功能,还完全到不了实际项目中运用的程度。后续会一步步去探索Flink在项目中应用的技巧。