MySQL -> Canal -> Kafka -> ES: A Complete Walkthrough of the Data Synchronization Pipeline

This article walks through the complete data synchronization pipeline from MySQL to Elasticsearch, using Canal to capture and replicate MySQL change events in real time. It covers the overall architecture, the role of each component, ZooKeeper cluster deployment, Canal Admin and Canal Deployer cluster configuration, Canal Adapter deployment, and the end-to-end synchronization process, with complete configuration examples and step-by-step instructions.

I. Overall Flow

1. Architecture Diagram

2. Component Overview

2.1 MySQL

Role

  • The source database that stores the business data
  • Records every data change (INSERT/UPDATE/DELETE/DDL) in the binlog (binary log)

Key requirements

  • binlog_format must be ROW (only ROW mode records row-level changes)
  • binlog must be enabled (log_bin=ON)
  • A dedicated Canal user with replication privileges is required (a minimal my.cnf sketch follows the SQL checks below)
sql
-- Check the binlog configuration
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
-- Must show: binlog_format = ROW

-- Check the Canal user's privileges
SHOW GRANTS FOR 'canal'@'%';
-- Must include: REPLICATION SLAVE, REPLICATION CLIENT, SELECT

-- Check the current binlog position
SHOW MASTER STATUS;
-- Note the File and Position values

-- List registered replicas (Canal will show up here once it connects)
SHOW SLAVE HOSTS;
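
For reference, here is a minimal my.cnf sketch of the binlog settings described above; the file path and the server-id value are assumptions and should be adapted to your environment.

sh
# Hypothetical example: append the required binlog settings and restart MySQL
cat >> /etc/my.cnf <<'EOF'
[mysqld]
# Enable the binary log and use row-based logging so Canal sees row-level changes
log_bin = mysql-bin
binlog_format = ROW
# Must be unique across the primary, all replicas, and Canal (which registers as a replica)
server-id = 1
EOF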
2.2 Canal Admin + Zookeeper

Canal Admin: the Canal cluster management platform (web UI)

  • Centrally manages multiple Canal instances
  • Monitors runtime status, manages configuration, handles failure recovery
  • Provides a visual web interface

ZooKeeper: distributed coordination service

  • Stores Canal cluster metadata and configuration
  • Provides high availability and failover for Canal instances
  • Records the binlog consumption position
2.3 Canal Deployer (Canal server: data capture and conversion)

Role

  • Poses as a MySQL replica and requests the binlog from the primary
  • Parses the binary binlog stream into structured messages
  • Filters data (whitelist/blacklist)
  • Publishes change messages to the message queue (Kafka)

Core components

  • EventParser: parses binlog events
  • CanalServer: manages multiple Canal instances
  • CanalMQProducer: message queue producer
2.4 Kafka (message queue / buffer)

Role

  • Decouples producers from consumers
  • Buffers data to absorb traffic spikes
  • Guarantees message ordering (per partition)
  • Provides persistence and message replay

Core concepts (a sample consumer command follows this list)

  • Topic: logical message category (e.g. task_es_index)
  • Partition: a topic partition, which preserves ordering
  • Producer: Canal Deployer acts as the producer
  • Consumer: Canal Adapter acts as the consumer
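
To get a feel for what Canal publishes, you can tail one of the topics with the standard console consumer. The command below is a sketch: the bootstrap address is taken from the Kafka settings later in this article, SASL options are omitted, and the JSON in the comment shows the general shape of Canal's flatMessage format rather than output captured from this cluster.

sh
# Tail the task_es_index topic to inspect Canal's flatMessage JSON
# (add SASL client properties if your cluster requires them)
kafka-console-consumer.sh \
  --bootstrap-server 172.16.130.3:9092 \
  --topic task_es_index \
  --from-beginning

# A flatMessage generally looks like this (fields abridged):
# {
#   "data": [{"id": "1", "del_flag": "0"}],
#   "database": "cqshzl",
#   "table": "personnel_manage_follow",
#   "type": "INSERT",
#   "es": 1700000000000,
#   "ts": 1700000000123,
#   "pkNames": ["id"]
# }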
2.5 Canal Adapter (data conversion and loading)

Role

  • Consumes the messages from Kafka
  • Converts Canal-format messages into the target storage format (ES documents)
  • Writes to Elasticsearch in batches
  • Handles the data conversion and mapping rules

Supported targets

  • Elasticsearch
  • MySQL
  • HBase
  • REST API
2.6 Elasticsearch (data storage and search)

Role

  • Stores the final business data
  • Provides full-text search and aggregations
  • Builds real-time search indexes
  • Exposes a REST API (a sample query follows this list)
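
As a quick illustration of that REST API, the queries below count and search one of the indexes created later in this guide; the address and the elastic:xxx credentials are the placeholders used throughout this article.

sh
# Count documents in the ryjcxxb_index index (created later in this guide)
curl -s -u elastic:xxx -X GET "http://172.16.130.27:19200/ryjcxxb_index/_count?pretty"

# Simple term query against the same index
curl -s -u elastic:xxx -X GET "http://172.16.130.27:19200/ryjcxxb_index/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"del_flag": "0"}}, "size": 1}'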

II. Deploying the ZooKeeper Cluster

This chapter covers deploying the ZooKeeper cluster. As the distributed coordination service, ZooKeeper provides metadata storage, configuration management, and failure recovery for the Canal cluster, and is the foundation of a highly available Canal deployment. It covers downloading the software, configuring the environment, building the cluster, and writing a management script.

1. Download and Install

sh
# (1) Download ZooKeeper
[worker@canal-12 ~]$ cd /data/software/
[worker@canal-12 software]$ wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.5/apache-zookeeper-3.8.5-bin.tar.gz

# (2) Extract the package
[worker@canal-12 software]$ tar xf apache-zookeeper-3.8.5-bin.tar.gz
# The rest of this guide assumes the directory is named zookeeper-3.8.5
[worker@canal-12 software]$ mv apache-zookeeper-3.8.5-bin zookeeper-3.8.5

2. Configure Environment Variables

sh
# Configure the environment variables
# Without root privileges, put them in the current user's ~/.bash_profile; with root, use /etc/profile.d/zk.sh
[worker@canal-12 software]$ cat ~/.bash_profile
# Source /root/.bashrc if user has one
[ -f ~/.bashrc ] && . ~/.bashrc
export ZK_HOME=/data/software/zookeeper-3.8.5
export PATH=$PATH:$ZK_HOME/bin
# Apply the environment variables
[worker@canal-12 software]$ source ~/.bash_profile

3. Edit the Configuration File

sh
# Create the data/log directories and the configuration file
[worker@canal-12 conf]$ mkdir /data/software/zookeeper-3.8.5/{data,logs}
[worker@canal-12 conf]$ cat > /data/software/zookeeper-3.8.5/conf/zoo.cfg <<'EOF'
# Data directory
dataDir=/data/software/zookeeper-3.8.5/data
# Log directory
dataLogDir=/data/software/zookeeper-3.8.5/logs
# Client connection port
clientPort=2181
# Heartbeat interval (ms)
tickTime=2000
# Initial sync timeout (10*tickTime)
initLimit=10
# Sync communication timeout (5*tickTime)
syncLimit=5
# Cluster node configuration (server.${myid}=IP:quorum port:election port)
# server.ID=A:B:C[:D]
# ID:
#    the node's unique ZooKeeper ID (must match the node's myid file)
# A:
#    the node's host address
# B:
#    the quorum communication port (2888); followers connect to the leader on this port
# C:
#    the leader election port (3888)
# D:
#    optional; specifies the role (participant/observer)
server.1=172.16.130.12:2888:3888
server.2=172.16.130.130:2888:3888
server.3=172.16.130.131:2888:3888
EOF

4. Distribute the Configuration

sh
# 1. Copy the zookeeper installation to the other servers (repeat for both 172.16.130.130 and 172.16.130.131)
[worker@canal-12 software]$ scp -P55622 -r zookeeper-3.8.5 172.16.130.131:/data/software/
[worker@canal-12 software]$ scp -P55622 ~/.bash_profile 172.16.130.131:~/.bash_profile

# 2. Give each zookeeper node a unique ID
# On each node, write its own server.id into /data/software/zookeeper-3.8.5/data/myid
[worker@canal-12 ~]$ echo '1' > /data/software/zookeeper-3.8.5/data/myid
[worker@canal-130 ~]$ echo '2' > /data/software/zookeeper-3.8.5/data/myid
[worker@canal-131 ~]$ echo '3' > /data/software/zookeeper-3.8.5/data/myid

5. Write a ZooKeeper Cluster Management Script

sh
# Write the zookeeper cluster management script
# (quote the heredoc delimiter so the $ variables are written literally into the script)
[worker@canal-12 ~]$ cat >/data/software/zookeeper-3.8.5/bin/manager_zk.sh<<'EOF'
#!/bin/bash

# Check that the user passed exactly one argument
if [ $# -ne 1 ];then
    echo "Invalid arguments. Usage: $0 {start|stop|restart|status}"
    exit 1
fi
# The command entered by the user
cmd=$1
# Main dispatch function
function zookeeperManger(){
    case $cmd in
    start)
        echo "Starting the service"
        remoteExecution start
        ;;
    stop)
        echo "Stopping the service"
        remoteExecution stop
        ;;
    restart)
        echo "Restarting the service"
        remoteExecution restart
        ;;
    status)
        echo "Checking status"
        remoteExecution status
        ;;
    *)
        echo "Invalid arguments. Usage: $0 {start|stop|restart|status}"
        ;;
    esac
}
# Run the given command on every node
host_list=(172.16.130.12 172.16.130.130 172.16.130.131)
function remoteExecution(){
  for i in ${host_list[@]} ;do
       tput setaf 2
       echo ========== $i   zkServer.sh $1  ================
       tput setaf 9
       ssh $i "source ~/.bash_profile ; zkServer.sh $1 2>/dev/null"
    done
}
# Invoke the dispatch function
zookeeperManger
EOF
[worker@canal-12 ~]$ chmod +x /data/software/zookeeper-3.8.5/bin/manager_zk.sh

6. Configure Passwordless SSH

sh
# Generate an SSH key pair
[worker@canal-12 ~]$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa -q
# Push the public key to each server (see the loop sketch below)
[worker@canal-12 ~]$ cat ~/.ssh/id_rsa.pub | ssh worker@172.16.130.12 "cat >> ~/.ssh/authorized_keys"
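
Since the management script SSHes into all three nodes, the public key has to land on each of them. A small loop sketch, assuming the worker account can still authenticate with a password for this initial copy; add -p 55622 if SSH listens on the non-default port suggested by the scp commands above.

sh
# Copy the public key to every node in the cluster (run once from canal-12)
for host in 172.16.130.12 172.16.130.130 172.16.130.131; do
  cat ~/.ssh/id_rsa.pub | ssh worker@${host} "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
done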

7. Start and Verify

sh
# Start the zookeeper cluster
[worker@canal-12 ~]$ manager_zk.sh start
# Check the cluster status
[worker@canal-12 ~]$ manager_zk.sh status

# Connect to ZooKeeper
zkCli.sh -server 172.16.130.12:2181

8. Basic ZooKeeper Commands

sh
# Connect to ZooKeeper and browse the root znodes
zkCli.sh -server 172.16.130.12:2181
[zk: 172.16.130.12:2181(CONNECTED) 0] ls /
[canal-adapter, otter, zookeeper]
[zk: 172.16.130.12:2181(CONNECTED) 1] 
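
A couple of extra commands that are handy when verifying the ensemble. Note that on ZooKeeper 3.5+ the srvr four-letter word may need to be whitelisted via 4lw.commands.whitelist in zoo.cfg before it responds.

sh
# Ask each node for its role (leader/follower) and basic stats
for host in 172.16.130.12 172.16.130.130 172.16.130.131; do
  echo "===== ${host} ====="
  echo srvr | nc ${host} 2181
done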

III. Deploying Canal

This chapter walks through deploying the Canal cluster, including the Canal Admin management platform and the Canal Deployer server. Canal captures MySQL binlog changes in real time and publishes them to the Kafka message queue. It covers setting up the management platform, configuring the server, managing instances, and configuring data synchronization.

1. Deploy Canal Admin

1.1 Overview

Canal Admin is the centralized management platform for the Canal Deployer cluster, effectively Canal's control center.

1.2 Architecture Diagram
1.3 Deploy Canal Admin
01 Download and install
sh
# 1. Download
wget https://github.com/alibaba/canal/releases/download/canal-1.1.8/canal.admin-1.1.8.tar.gz
# 2. Extract
[worker@canal-12 ~]$ mkdir /data/software/canal
[worker@canal-12 ~]$ cd /data/software/canal
[worker@canal-12 canal]$ mkdir canal_admin
[worker@canal-12 canal]$ tar xf canal.admin-1.1.8.tar.gz -C canal_admin
02 Edit the configuration file
sh
[worker@canal-12 canal]$ cd canal_admin/conf/
[worker@canal-12 conf]$ cat application.yml 
server:
  port: 8089
spring:
  jackson:
    date-format: yyyy-MM-dd HH:mm:ss
    time-zone: GMT+8
spring.datasource:
  address: 172.16.130.57:53306
  database: canal_manager
  url: jdbc:mysql://${spring.datasource.address}/${spring.datasource.database}?useUnicode=true&characterEncoding=UTF-8&useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=Asia/Shanghai&autoReconnect=true&failOverReadOnly=false&maxReconnects=10
  username: canal_admin
  password: 345t3GmowVsMn!@V
  driver-class-name: com.mysql.cj.jdbc.Driver
  hikari:
    maximum-pool-size: 30
    minimum-idle: 1    
# Password that Canal Servers use to register with Canal Admin
canal:
  adminUser: admin
  adminPasswd: b8hLqT%y#hJgz2Ea
03 Initialize the database

The conf directory contains a canal_manager.sql file that initializes the database.

sh
# Log in to MySQL
mysql -h172.16.130.57 -P53306 -uroot -p
# Run the initialization script
source /data/software/canal/canal_admin/conf/canal_manager.sql

# Create the connection user
create user canal_admin@'%' identified by '345t3GmowVsMn!@V';
GRANT ALL PRIVILEGES ON `canal_manager`.* TO `canal_admin`@`%`;
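
A quick sanity check that the schema was created and the new account can reach it, assuming the same host and credentials as above:

sh
# Verify that the canal_admin user can see the canal_manager schema
mysql -h172.16.130.57 -P53306 -ucanal_admin -p'345t3GmowVsMn!@V' \
  -e "SHOW TABLES FROM canal_manager;"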
04 Start the service
sh
[worker@canal-12 conf]$ cd /data/software/canal/canal_admin/
# Start the service
[worker@canal-12 canal_admin]$ ./bin/startup.sh
# Log in (port 8089, as set in application.yml)
http://172.16.130.12:8089
# Default account/password
admin/123456
# The password can be changed after logging in
05 Manage the cluster with Canal Admin

Click Cluster Management -> New Cluster -> enter the cluster name and the ZooKeeper addresses -> Save.

2. Deploy the Canal Deployer Cluster

Server IPs: 172.16.130.12/130/131

2.1 Download and Install

**canal.deployer:** the Canal core server. It connects to the MySQL primary, parses the binlog, and forwards the changes to clients.

Core functions

  • Poses as a MySQL replica and pulls the binlog from the primary.
  • Parses binlog events (insert/update/delete operations).
  • Delivers the parsed results to clients (such as canal.adapter or other custom programs).
sh
# 1. Download
wget https://github.com/alibaba/canal/releases/download/canal-1.1.8/canal.deployer-1.1.8.tar.gz
# 2. Extract
[worker@canal-12 ~]$ cd /data/software/canal
[worker@canal-12 canal]$ mkdir deployer
[worker@canal-12 canal]$ tar xf canal.deployer-1.1.8.tar.gz -C deployer
2.2 Edit canal_local.properties
sh
# When using Admin to manage the cluster, only canal_local.properties needs editing; otherwise edit canal.properties
# Edit this on every node
[worker@canal-12 canal]$ cd deployer/conf
[worker@canal-12 conf]$ vim canal_local.properties 
# register ip: each server's own IP
canal.register.ip = 172.16.130.12
canal.port = 11111
canal.metrics.pull.port = 11112

#################################################
######     Canal Admin settings
#################################################
canal.admin.mode = manager
canal.admin.manager = 172.16.130.12:8089
# Management port of this node
canal.admin.port = 11110
# Username/password must match the application.yml of Canal Admin
# Note the password must be stored hashed; you can generate it in MySQL (see the example below):
# select upper(sha1(unhex(sha1('b8hLqT%y#hJgz2Ea'))))
canal.admin.user = admin
canal.admin.passwd = B843914782989CD69C847EEBD01D250DC932C642

# Auto-register with Admin
canal.admin.register.auto = true
# Cluster name; must match the cluster created in the Admin UI
canal.admin.register.cluster = zk
# Server name within the cluster; unique per node, the other two are node-130 and node-131
canal.admin.register.name = node-12
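
If you prefer to generate the hashed admin password from the shell rather than inside an interactive MySQL session, a one-liner sketch (the MySQL host, port, and password below are the ones used earlier in this article):

sh
# Produce the double-SHA1 hash that canal.admin.passwd expects
mysql -h172.16.130.57 -P53306 -uroot -p -N -s \
  -e "select upper(sha1(unhex(sha1('b8hLqT%y#hJgz2Ea'))))"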
2.3 Main Canal Configuration (managed centrally via Admin)

In Admin, go to Cluster Management -> Main Configuration -> Load Template, edit the configuration, and save; Admin will push it to the server nodes automatically.

Fill in the address of your own Kafka deployment; if Kafka is not deployed yet, refer to the companion guide on deploying a Kafka 4.1.1 production cluster.

sh
#################################################
######### common arguments / base settings   #############
#################################################
# tcp bind ip (empty means bind all interfaces; in production, binding the internal IP is recommended)
canal.ip =
# register ip to zookeeper
canal.register.ip =
canal.port = 11111
canal.metrics.pull.port = 11112

#################################################
# The following must be changed
# Run mode (cluster: relies on ZooKeeper)
canal.mode = cluster
# Instance load mode (zk: load from ZooKeeper)
canal.instance.loadMode = zk
canal.zkServers = 172.16.130.12:2181,172.16.130.130:2181,172.16.130.131:2181
#################################################

# flush data to zk
canal.zookeeper.flush.period = 1000
canal.withoutNetty = false
# tcp, kafka, rocketMQ, rabbitMQ, pulsarMQ -- change as needed
canal.serverMode = kafka
# flush meta cursor/parse position to file
canal.file.data.dir = ${canal.conf.dir}
canal.file.flush.period = 1000
## memory store RingBuffer size, should be Math.pow(2,n)
canal.instance.memory.buffer.size = 16384
## memory store RingBuffer used memory unit size , default 1kb
canal.instance.memory.buffer.memunit = 1024 
## meory store gets mode used MEMSIZE or ITEMSIZE
canal.instance.memory.batch.mode = MEMSIZE
canal.instance.memory.rawEntry = true

## detecing config
canal.instance.detecting.enable = false
#canal.instance.detecting.sql = insert into retl.xdual values(1,now()) on duplicate key update x=now()
canal.instance.detecting.sql = select 1
canal.instance.detecting.interval.time = 3
canal.instance.detecting.retry.threshold = 3
canal.instance.detecting.heartbeatHaEnable = false

# support maximum transaction size, more than the size of the transaction will be cut into multiple transactions delivery
canal.instance.transaction.size =  1024
# mysql fallback connected to new master should fallback times
canal.instance.fallbackIntervalInSeconds = 60

# network config
canal.instance.network.receiveBufferSize = 16384
canal.instance.network.sendBufferSize = 16384
canal.instance.network.soTimeout = 30

# binlog filter config
canal.instance.filter.druid.ddl = true
canal.instance.filter.query.dcl = false
canal.instance.filter.query.dml = false
canal.instance.filter.query.ddl = false
canal.instance.filter.table.error = false
canal.instance.filter.rows = false
canal.instance.filter.transaction.entry = false
canal.instance.filter.dml.insert = false
canal.instance.filter.dml.update = false
canal.instance.filter.dml.delete = false

# binlog format/image check
canal.instance.binlog.format = ROW,STATEMENT,MIXED 
canal.instance.binlog.image = FULL,MINIMAL,NOBLOB

# binlog ddl isolation
canal.instance.get.ddl.isolation = false

# parallel parser config
canal.instance.parser.parallel = true
## concurrent thread number, default 60% available processors, suggest not to exceed Runtime.getRuntime().availableProcessors()
#canal.instance.parser.parallelThreadSize = 16
## disruptor ringbuffer size, must be power of 2
canal.instance.parser.parallelBufferSize = 256

# table meta tsdb info
canal.instance.tsdb.enable = true
canal.instance.tsdb.dir = ${canal.file.data.dir:../conf}/${canal.instance.destination:}
canal.instance.tsdb.url = jdbc:h2:${canal.instance.tsdb.dir}/h2;CACHE_SIZE=1000;MODE=MYSQL;
canal.instance.tsdb.dbUsername = canal
canal.instance.tsdb.dbPassword = canal
# dump snapshot interval, default 24 hour
canal.instance.tsdb.snapshot.interval = 24
# purge snapshot expire , default 360 hour(15 days)
canal.instance.tsdb.snapshot.expire = 360

#################################################
######### 		destinations		#############
#################################################
# Instance list (instances to create); can be left empty when managed by Admin
canal.destinations =  event_es_index,task_es_index
# conf root dir
canal.conf.dir = ../conf
# auto scan instance dir add/remove and start/stop instance
canal.auto.scan = true
canal.auto.scan.interval = 5
# set this value to 'true' means that when binlog pos not found, skip to latest.
# WARN: pls keep 'false' in production env, or if you know what you want.
canal.auto.reset.latest.pos.mode = false

canal.instance.tsdb.spring.xml = classpath:spring/tsdb/h2-tsdb.xml
#canal.instance.tsdb.spring.xml = classpath:spring/tsdb/mysql-tsdb.xml

canal.instance.global.mode = manager
canal.instance.global.lazy = false
canal.instance.global.manager.address = ${canal.admin.manager}
#canal.instance.global.spring.xml = classpath:spring/memory-instance.xml
#canal.instance.global.spring.xml = classpath:spring/file-instance.xml
canal.instance.global.spring.xml = classpath:spring/default-instance.xml

##################################################
######### 	      MQ Properties      #############
##################################################
canal.mq.flatMessage = true
canal.mq.canalBatchSize = 50
canal.mq.canalGetTimeout = 100
# Set this value to "cloud", if you want open message trace feature in aliyun.
canal.mq.accessChannel = local

canal.mq.database.hash = true
canal.mq.send.thread.size = 30
canal.mq.build.thread.size = 8

##################################################
######### 		     Kafka 		     #############
##################################################
# Change the Kafka addresses to your own
kafka.bootstrap.servers = 172.16.130.3:9092,172.16.130.13:9092,172.16.130.15:9092
kafka.acks = all
kafka.compression.type = none
kafka.batch.size = 16384
kafka.linger.ms = 1
kafka.max.request.size = 1048576
kafka.buffer.memory = 33554432
kafka.max.in.flight.requests.per.connection = 1
kafka.retries = 0

kafka.security.protocol=SASL_PLAINTEXT
kafka.sasl.mechanism=SCRAM-SHA-512
kafka.sasl.jaas.config = org.apache.kafka.common.security.scram.ScramLoginModule required username="kafka_canal" password="xxx";
2.4 Start the Canal Service
sh
# Passing "local" to the startup script makes it use canal_local.properties
# Start on all nodes: 172.16.130.12/130/131
[worker@canal-12 deployer]$ ./bin/startup.sh local
# Watch the log
[worker@canal-12 deployer]$ tail -f logs/canal/canal.log 
# A log line such as "the canal server is running now ......" indicates a successful start
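
You can also confirm from ZooKeeper that the nodes registered and that one server took ownership of each instance. The paths below follow Canal's usual ZooKeeper layout and are illustrative; they only exist once the servers and instances are running.

sh
# List the registered canal servers and check which node currently runs an instance
zkCli.sh -server 172.16.130.12:2181 <<'EOF'
ls /otter/canal/cluster
get /otter/canal/destinations/event_es_index/running
EOF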
2.5 Create the Required Kafka Topics
sh
# event_es_index
kafka-topics.sh --create \
  --bootstrap-server 172.16.130.234:9092 \
  --topic event_es_index \
  --partitions 3 \
  --replication-factor 2
  
# task_es_index
kafka-topics.sh --create \
  --bootstrap-server 172.16.130.234:9092 \
  --topic task_es_index \
  --partitions 3 \
  --replication-factor 2
  
--partitions 3	        create the topic with 3 partitions
--replication-factor 2	keep 2 replicas of each partition
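
To confirm the topics exist with the expected layout, describe them. If the cluster enforces SASL, as the Canal producer settings above suggest, pass a client properties file via --command-config.

sh
# Describe the topics created above
kafka-topics.sh --describe \
  --bootstrap-server 172.16.130.234:9092 \
  --topic event_es_index

kafka-topics.sh --describe \
  --bootstrap-server 172.16.130.234:9092 \
  --topic task_es_index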
2.6 Create the Database Account
sql
mysql -uroot -h172.18.45.121 -P3306 -p'zT1U#cfjJ!WmkBN0'
CREATE USER 'canal'@'%' IDENTIFIED BY 'kV2cf!KqqjX82NHE';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
FLUSH PRIVILEGES;
2.7 Add Instances

Instance Management -> New Instance -> Load Template -> fill in the instance name and its cluster.

Save once the configuration is edited.

sh
# Two instances are configured here: one for events, one for tasks
01 Event instance: event_es_index/instance.properties
# Settings that need to be changed
# Database connection info
canal.instance.master.address=172.16.130.121:3306
canal.instance.dbUsername = canal
canal.instance.dbPassword = kV2cf!KqqjX82NHE
#################################################
## Table filter rules
#################################################
# Business tables
#canal.instance.filter.regex=.*\\..*
canal.instance.filter.regex = cqshzl.cscp_hx_dic_item,cqshzl.cscp_hx_dic,cqshzl.event_basic_info,cqshzl.t_data_burial_point,cqshzl.event_mhww_info,cqshzl.event_other_property,cqshzl.event_basic_info_extend,cqshzl.event_report_record,cqshzl.event_tag_link,cqshzl.t_ryjcxxb
# Blacklist filter
canal.instance.filter.black.regex = mysql\\.slave_.*
# Default topic: all captured data is sent to the event_es_index topic
canal.mq.topic = event_es_index

02 Task instance: task_es_index/instance.properties
# Only the table filter rules and the topic need to change
#################################################
## Table filter rules - adjust as needed
#################################################
# Task-related tables (example; adjust to your schema)
canal.instance.filter.regex = cqshzl.task_period_statistics,cqshzl.personnel_manage_follow
# Blacklist filter
canal.instance.filter.black.regex = mysql\\.slave_.*
# mq config
canal.mq.topic=task_es_index
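
At this point, changes on the filtered tables should already be flowing into Kafka. A quick way to confirm, assuming the SASL credentials from the main configuration above (written here to a hypothetical client properties file):

sh
# Hypothetical client config matching the SASL settings used by the Canal producer
cat > /tmp/kafka-client.properties <<'EOF'
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="kafka_canal" password="xxx";
EOF

# Watch the event topic while updating one of the filtered tables in MySQL
kafka-console-consumer.sh \
  --bootstrap-server 172.16.130.3:9092 \
  --topic event_es_index \
  --consumer.config /tmp/kafka-client.properties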

3. Deploy Canal Adapter

3.1 Download and Install
sh
# 1. Download
wget https://github.com/alibaba/canal/releases/download/canal-1.1.8/canal.adapter-1.1.8.tar.gz
# 2. Extract
[worker@canal-12 ~]$ cd /data/software/canal
[worker@canal-12 canal]$ mkdir adapter
[worker@canal-12 canal]$ tar xf canal.adapter-1.1.8.tar.gz -C adapter
3.2 Edit application.yml

This configuration needs the Elasticsearch addresses; if ES is not deployed yet, refer to the companion guide on deploying an Elasticsearch 8.13.2 cluster.

sh
# Edit on every node; only canal.instance.cluster.id (canal_adapter_cluster_12 on this node) differs per node, everything else is identical
[worker@canal-12 ~]$ cd /data/software/canal/adapter/conf
[worker@canal-12 conf]$ cat application.yml 
server:
  port: 8081
spring:
  jackson:
    date-format: yyyy-MM-dd HH:mm:ss
    time-zone: GMT+8
    default-property-inclusion: non_null
                  
canal.conf:
  # Consumption mode: consume messages from Kafka (other options: tcp, rocketMQ, rabbitMQ)
  mode: kafka
  # Use the flattened message format
  flatMessage: true
  # Cluster setup: load balancing is coordinated through ZooKeeper
  zookeeperHosts: 172.16.130.12:2181,172.16.130.130:2181,172.16.130.131:2181
  syncBatchSize: 1000
  retries: -1
  timeout:
  accessKey:
  secretKey:
  # Consumer configuration
  consumerProperties:
    # canal tcp consumer
    canal.tcp.server.host: 
    canal.tcp.zookeeper.hosts:
    canal.tcp.batch.size: 500
    canal.tcp.username:
    canal.tcp.password:
    # kafka consumer
    kafka.bootstrap.servers: 172.16.130.3:9092,172.16.130.13:9092,172.16.130.15:9092
    # Security protocol: plaintext (if your Kafka listeners require SASL, as in the Deployer config above, add the SASL properties here as well)
    kafka.security.protocol: PLAINTEXT
    # Disable auto-commit of offsets
    kafka.enable.auto.commit: false
    # Auto-commit interval (unused, since auto-commit is disabled above)
    kafka.auto.commit.interval.ms: 100
    # Start from the latest position when there is no stored offset
    kafka.auto.offset.reset: latest
    # Minimum fetch size; 1 means fetch as soon as any data is available
    kafka.fetch.min.bytes: 1
    # Maximum fetch wait time; 10ms so data is returned almost immediately
    kafka.fetch.max.wait.ms: 10
    # Maximum records per poll, to keep batches small
    kafka.max.poll.records: 10
    # Heartbeat interval, for fast detection of consumer health
    kafka.heartbeat.interval.ms: 1000
    # Session timeout, for fast rebalancing
    kafka.session.timeout.ms: 10000
    # Request timeout
    kafka.request.timeout.ms: 30000
    # Connection parameters
    kafka.connections.max.idle.ms: 300000
    kafka.receive.buffer.bytes: 65536
    kafka.send.buffer.bytes: 131072
    # Only read committed messages
    kafka.isolation.level: read_committed
    # Unique cluster ID; the other two nodes use canal_adapter_cluster_130 and canal_adapter_cluster_131
    canal.instance.cluster.id: canal_adapter_cluster_12
    # Retry configuration
    canal.instance.retry.count: 3
    # Retry interval (milliseconds)
    canal.instance.retry.interval: 1000
             
  # Source data sources
  srcDataSources:
    defaultDS:   # Default data source (used to read MySQL during full syncs)
      url: jdbc:mysql://172.16.130.121:3306/cqshzl?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true&failOverReadOnly=false&maxReconnects=10&useSSL=false&connectTimeout=300000&socketTimeout=300000&connectionTimeout=300000
      driverClassName: com.mysql.cj.jdbc.Driver
      username: canal
      password: kV2cf!KqqjX82NHE
      # ========== Connection pool tuning: reduce connection waits ==========
      initialSize: 5   # Initial connections
      minIdle: 5       # Minimum idle connections
      maxActive: 20    # Maximum active connections
      maxWait: 1000    # Maximum wait for a connection (1 second)
      timeBetweenEvictionRunsMillis: 60000    # Eviction check interval
      minEvictableIdleTimeMillis: 300000      # Minimum idle time before eviction (5 minutes)
      validationQuery: SELECT 1               # Validation query
      testWhileIdle: true                     
      testOnBorrow: true
      testOnReturn: false
      removeAbandoned: true
      removeAbandonedTimeout: 18000
      logAbandoned: true
      keepAlive: true
      phyTimeoutMillis: 600000
      phyMaxUseCount: 0
  # Adapter instances: sync to ES
  canalAdapters:
  # Matches the Canal instance name / Kafka topic name
  - instance: event_es_index
    groups:
      # Consumer group for this instance
    - groupId: canal_event_group  # Instance-level consumer group; the adapter cluster uses it to assign partitions to each adapter
      outerAdapters:
      - name: es8
        key: canal-es-event  # Adapter key
        hosts: http://172.16.130.27:19200,http://172.16.130.28:19200,http://172.16.130.41:19200
        properties:
          mode: rest # transport or rest
          security.auth: elastic:xxx #  only used for rest mode
          cluster.name: jczz-es-cluster

  - instance: task_es_index  # canal instance name or mq topic name
    groups:
    - groupId: canal_task_group # Instance-level consumer group
      outerAdapters:
      - name: es8
        key: canal-es-task
        hosts: http://172.16.130.27:19200,http://172.16.130.28:19200,http://172.16.130.41:19200
        properties:
          mode: rest # transport or rest
          security.auth: elastic:xxx #  only used for rest mode
          cluster.name: jczz-es-cluster
3.3 Add Table Sync Mapping Files
sh
# We use the es8 adapter, so the mapping files go under the es8/ directory
[worker@canal-12 conf]$ cat es8/ryjcxxb_index.yml 
dataSourceKey: defaultDS
destination: event_es_index      # Must match the instance name in application.yml
outerAdapterKey: canal-es-event  # Must match the key of the corresponding adapter in application.yml
groupId: canal_event_group       # Must match the groupId in application.yml
esMapping:
  _index: ryjcxxb_index
  _id: _id
  upsert: true
  #  pk: id
  sql: "select t.id as _id,t.id AS id,t.del_flag AS del_flag,t.xm AS xm,t.gmsfzh AS gmsfzh,t.csrq AS csrq,if((((t.yxzdgx = '01') and ((t.rhyzbsdm is null) or (t.rhyzbsdm = '') or (t.rhyzbsdm <> '02'))) or ((t.yxzdgx = '02') and ((t.zxlbdm is null) or (t.zxlbdm = '') or (t.zxlbdm not in ('2','7','9'))))),1,0) AS is_syrk from t_ryjcxxb t"
  # sql: "select '1' as _id from t_ryjcxxb t limit 1"
  # etlCondition: "limit 100"
  commitBatch: 100
  
[worker@canal-12 conf]$ cat es8/personnel_manage_follow_index.yml 
dataSourceKey: defaultDS
destination: task_es_index  # The task instance
outerAdapterKey: canal-es-task
groupId: canal_task_group
esMapping:
  _index: personnel_manage_follow_index
  _id: _id
  upsert: true
  #  pk: id
  sql: "SELECT t.id as _id,t.id as id,t.created_date as created_date,t.creator_id as creator_id,t.del_flag as del_flag,t.updated_date as updated_date,t.updater_id as updater_id,t.rylxdm as rylxdm,t.level as level,t.zdry_id as zdry_id,t.ryid as ryid,t.wgbm as wgbm,LEFT(t.wgbm,6) as wgbm_district,LEFT(t.wgbm,9) as wgbm_town,t.follow_start_time as follow_start_time,t.follow_end_time as follow_end_time,t.is_follow as is_follow,t.add_time as add_time FROM personnel_manage_follow t"
  # objFields:
  #   data_burial_points: object;
  #   prop_codes: array:;
  #   third_event_nums: array:;
  #   system_sources: array:;
  #   tag_ids: array:;
  etlCondition: "where id <{}"
  commitBatch: 1000
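
Once everything is running, a quick way to check incremental sync is to change one row in MySQL and fetch the corresponding document from ES (the document _id equals the row id per the mappings above). The row id and credentials below are placeholders; use any MySQL account with write access to the table.

sh
# 1. Touch one row in the source table (id=1 is a placeholder; use an account with write access)
mysql -h172.16.130.121 -P3306 -uroot -p cqshzl \
  -e "UPDATE personnel_manage_follow SET updated_date = NOW() WHERE id = 1;"

# 2. Fetch the corresponding ES document; the change should show up within seconds
curl -s -u elastic:xxx -X GET \
  "http://172.16.130.27:19200/personnel_manage_follow_index/_doc/1?pretty"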
3.4 Create the Indexes in Elasticsearch
sh
# Create the indexes
curl  -u elastic:xxx -X PUT "http://172.16.130.27:19200/ryjcxxb_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 5,       
    "number_of_replicas": 1,  
    "index.routing.allocation.total_shards_per_node": 2 
  },
  "mappings": {
    "properties": {
      "csrq": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_time||date_time"
      },
      "del_flag": { "type": "keyword" },
      "gmsfzh": { "type": "keyword" },
      "id": { "type": "keyword" },
      "is_syrk": { "type": "keyword" },
      "xm": { "type": "keyword" }
    }
  }
}
'
curl  -u elastic:xxx -X PUT "http://172.16.130.27:19200/personnel_manage_follow_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 5,       
    "number_of_replicas": 1,  
    "index.routing.allocation.total_shards_per_node": 2 
  },
  "mappings" : {
      "properties" : {
        "add_time" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_time||date_time"
        },
        "created_date" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_time||date_time"
        },
        "creator_id" : {
          "type" : "long"
        },
        "del_flag" : {
          "type" : "integer"
        },
        "follow_end_time" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_time||date_time"
        },
        "follow_start_time" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_time||date_time"
        },
        "id" : {
          "type" : "long"
        },
        "is_follow" : {
          "type" : "keyword",
          "ignore_above" : 1
        },
        "level" : {
          "type" : "keyword",
          "ignore_above" : 10
        },
        "ryid" : {
          "type" : "long"
        },
        "rylxdm" : {
          "type" : "keyword",
          "ignore_above" : 30
        },
        "updated_date" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis||strict_date_time||date_time"
        },
        "updater_id" : {
          "type" : "long"
        },
        "wgbm" : {
          "type" : "keyword",
          "ignore_above" : 30
        },
        "wgbm_district" : {
          "type" : "keyword",
          "ignore_above" : 6
        },
        "wgbm_town" : {
          "type" : "keyword",
          "ignore_above" : 9
        },
        "zdry_id" : {
          "type" : "long"
        }
      }
    }            
}
'

# List the indexes
curl  -u elastic:xxx -X GET "http://172.16.130.27:19200/_cat/indices?v&s=index"
# View a single index's settings
curl  -u elastic:xxx -X GET "http://172.16.130.27:19200/personnel_manage_follow_index/_settings?pretty"
curl  -u elastic:xxx -X GET "http://172.16.130.27:19200/personnel_manage_follow_index/_mapping?pretty"
3.5 Start the Adapter Service
sh
[worker@canal-12 adapter]$ ./bin/startup.sh 
[worker@canal-12 adapter]$ tail -f logs/adapter/adapter.log 
3.6 Manually Run a Full Sync to ES
sh
# The first time, trigger a full sync of the existing data into ES
curl -X POST http://172.16.130.12:8081/etl/es8/personnel_manage_follow_index.yml
curl -X POST http://172.16.130.12:8081/etl/es8/ryjcxxb_index.yml
curl -X POST http://172.16.130.130:8081/etl/es8/video_point_manage_index.yml

# Watch the document count grow in ES
watch -n 1 "curl -s -u elastic:xxx -XGET 'http://172.16.130.27:19200/ryjcxxb_index/_count?pretty'"
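
To confirm a full sync is complete, compare the document count in ES with the row count in MySQL; hosts and credentials are the ones used earlier, and the canal account only needs SELECT for this.

sh
# Row count in the source table
mysql -h172.16.130.121 -P3306 -ucanal -p'kV2cf!KqqjX82NHE' cqshzl \
  -e "SELECT COUNT(*) FROM personnel_manage_follow;"

# Document count in the target index; the two numbers should converge
curl -s -u elastic:xxx -X GET \
  "http://172.16.130.27:19200/personnel_manage_follow_index/_count?pretty"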

IV. Summary

This article has walked through the complete data synchronization pipeline from MySQL to Elasticsearch. Canal captures MySQL change events in real time, and the whole stack runs as a highly available cluster coordinated by ZooKeeper. The pipeline consists of:

  • Data capture layer: the Canal Deployer cluster poses as a MySQL replica and parses binlog changes in real time
  • Message buffer layer: Kafka decouples producers from consumers and provides buffering and ordering guarantees
  • Data conversion layer: the Canal Adapter cluster consumes Kafka messages and performs format conversion and mapping
  • Data storage layer: Elasticsearch receives the converted data and builds real-time search indexes
  • Management and monitoring layer: Canal Admin provides a visual cluster management UI, and ZooKeeper provides distributed coordination

Key advantages

  • Real time: change capture is based on the MySQL binlog
  • Highly available: multi-node clusters coordinated by ZooKeeper
  • Scalable: every component can scale horizontally to handle different data volumes
  • Flexible: supports multiple target stores (ES, MySQL, HBase, and more)
  • Easy to operate: a web management UI simplifies cluster operations

Typical use cases

  • Building real-time search indexes
  • Data warehouse ETL pipelines
  • Data synchronization between business systems
  • Real-time analytics platforms

With the configuration in this article, readers can quickly build a complete synchronization pipeline that replicates MySQL data to Elasticsearch in real time.
