Importing and Exporting Data with Kafka Connect
1. Kafka Connect Overview
1.1 What Is Kafka Connect
Kafka Connect is a core component of Apache Kafka for moving data between Kafka and other systems reliably and at scale. It standardizes the integration work, so you do not have to write custom producer/consumer code for every source or target.
1.2 Key Advantages
- Simple integration: a large catalog of ready-made connectors for common sources and sinks
- Scalability: runs in either standalone or distributed mode
- Reliability: fault-tolerant, with at-least-once delivery by default and exactly-once support for source connectors on Kafka 3.3+
- Rich ecosystem: a broad community and commercial connector ecosystem
2. Architecture and How It Works
2.1 Core Components
┌───────────────────────────────────────────────────────┐
│                 Kafka Connect Cluster                 │
├───────────────────────────────────────────────────────┤
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐  │
│  │   Worker    │   │   Worker    │   │   Worker    │  │
│  │ ┌─────────┐ │   │ ┌─────────┐ │   │ ┌─────────┐ │  │
│  │ │ Source  │ │   │ │ Source  │ │   │ │  Sink   │ │  │
│  │ │Connector│ │   │ │Connector│ │   │ │Connector│ │  │
│  │ └─────────┘ │   │ └─────────┘ │   │ └─────────┘ │  │
│  └─────────────┘   └─────────────┘   └─────────────┘  │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────┐    ┌────────────────────┐    ┌────────────────────┐
│  External sources  │◄──►│   Kafka cluster    │◄──►│  External targets  │
│  (files/DB/APIs)   │    │                    │    │ (files/DB/storage) │
└────────────────────┘    └────────────────────┘    └────────────────────┘
2.2 Core Concepts
| Term | Description |
|---|---|
| Connector | Manages a copy job: defines what data to move and splits the work into tasks |
| Task | The actual unit of work created by a connector; tasks do the copying |
| Worker | The JVM process that runs connectors and tasks |
| Source Connector | Imports data from an external system into Kafka |
| Sink Connector | Exports data from Kafka to an external system |
| Converter | Serializes/deserializes data between Connect's internal format and the bytes stored in Kafka |
| Transform | A single message transform (SMT) that modifies records in flight |
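These concepts map directly onto the fields of a connector configuration. The sketch below registers a hypothetical FileStreamSource connector over the REST API purely to show where each concept appears; the connector name, file, and topic are placeholders.
bash
# A hedged sketch: each core concept appears as a config key.
# Connector -> "connector.class"    Task      -> "tasks.max"
# Converter -> "value.converter"    Transform -> "transforms.*"
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8083/connectors \
  -d '{
    "name": "demo-source",
    "config": {
      "connector.class": "FileStreamSource",
      "tasks.max": "1",
      "file": "/tmp/demo.txt",
      "topic": "demo-topic",
      "value.converter": "org.apache.kafka.connect.storage.StringConverter",
      "transforms": "route",
      "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
      "transforms.route.regex": "(.*)",
      "transforms.route.replacement": "$1-copy"
    }
  }'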
3. Installation and Configuration
3.1 Environment Preparation
3.1.1 System Requirements
bash
# Check the Java version (Java 8+ required)
java -version
# Download Apache Kafka (Kafka Connect is included)
wget https://downloads.apache.org/kafka/3.3.1/kafka_2.13-3.3.1.tgz
tar -xzf kafka_2.13-3.3.1.tgz
cd kafka_2.13-3.3.1
# Start ZooKeeper and Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
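Before configuring Connect, it is worth confirming the broker is actually reachable. A quick check, assuming the default localhost:9092 listener:
bash
# List topics to confirm the broker answers on localhost:9092
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
# Optionally pre-create the demo topic used later in this guide
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic connect-test --partitions 1 --replication-factor 1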
3.2 Configuration Modes
3.2.1 Standalone Mode
Suitable for development and testing; everything runs in a single process.
Configuration file: config/connect-standalone.properties
properties
# Core configuration
bootstrap.servers=localhost:9092
# Converters
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
# Standalone mode keeps source offsets in a local file instead of internal topics
offset.storage.file.filename=/tmp/connect.offsets
# Offset commit interval
offset.flush.interval.ms=10000
# Plugin path
plugin.path=/usr/local/share/java,/usr/share/java
# REST API
rest.port=8083
3.2.2 Distributed Mode
Suitable for production; provides high availability and load balancing across workers.
Configuration file: config/connect-distributed.properties
properties
# Cluster configuration
bootstrap.servers=localhost:9092
group.id=connect-cluster
# REST API
rest.port=8083
rest.advertised.host.name=localhost
rest.advertised.port=8083
# Converters
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
# Internal topics (use at least 3 replicas in production)
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
# Offset commit settings
offset.flush.interval.ms=10000
offset.flush.timeout.ms=5000
# Group membership: heartbeat and session timeouts
session.timeout.ms=10000
heartbeat.interval.ms=3000
# Rebalance timeout
rebalance.timeout.ms=60000
# Security (if enabled)
# security.protocol=SSL
# ssl.truststore.location=/path/to/truststore.jks
# ssl.truststore.password=password
# ssl.keystore.location=/path/to/keystore.jks
# ssl.keystore.password=password
# Metrics
metrics.recording.level=INFO
metrics.num.samples=2
metrics.sample.window.ms=30000
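Connect can auto-create these internal topics, but in production they are often created up front with compaction enabled. A sketch, assuming a 3-broker cluster and the default partition counts (25 for offsets, 5 for status, and exactly 1 for configs):
bash
# Internal topics must use cleanup.policy=compact
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic connect-offsets \
  --partitions 25 --replication-factor 3 --config cleanup.policy=compact
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic connect-configs \
  --partitions 1 --replication-factor 3 --config cleanup.policy=compact
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic connect-status \
  --partitions 5 --replication-factor 3 --config cleanup.policy=compact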
4. Hands-On: The File Connector Example
4.1 Prepare Test Data
bash
# 1. Create a test data file
echo -e "zzq\nkafka\nconnect\nexample" > test.txt
# Check the file contents
cat test.txt
# Output:
# zzq
# kafka
# connect
# example
4.2 Configure the Source Connector
Configuration file: config/connect-file-source.properties
properties
# Basic connector settings
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
# Source settings (FileStreamSource only understands file, topic, and batch.size)
file=test.txt
topic=connect-test
batch.size=100
# Per-connector converter overrides
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
# Error handling
errors.tolerance=none
errors.log.enable=true
errors.log.include.messages=true
4.3 Configure the Sink Connector
Configuration file: config/connect-file-sink.properties
properties
# Basic connector settings
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
# Target settings (FileStreamSink only understands file and topics)
file=test.sink.txt
topics=connect-test
# Per-connector converter overrides
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
# Error handling
errors.tolerance=none
errors.log.enable=true
4.4 Start the Connectors
4.4.1 Standalone Mode
bash
# Start Kafka Connect in standalone mode
bin/connect-standalone.sh config/connect-standalone.properties \
  config/connect-file-source.properties \
  config/connect-file-sink.properties
# Or start it in the background
nohup bin/connect-standalone.sh config/connect-standalone.properties \
  config/connect-file-source.properties \
  config/connect-file-sink.properties > connect.log 2>&1 &
4.4.2 Distributed Mode
bash
# 1. Start a distributed Connect worker first
bin/connect-distributed.sh config/connect-distributed.properties &
# 2. Create the connectors over the REST API
curl -X POST -H "Content-Type: application/json" \
  --data @source-connector.json \
  http://localhost:8083/connectors
curl -X POST -H "Content-Type: application/json" \
  --data @sink-connector.json \
  http://localhost:8083/connectors
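The contents of the two JSON files are not shown above; a minimal sketch of what source-connector.json could contain, mirroring the standalone properties from section 4.2 (sink-connector.json would look analogous with FileStreamSink, file, and topics):
bash
# Hypothetical source-connector.json, written here via a heredoc
cat > source-connector.json <<'EOF'
{
  "name": "local-file-source",
  "config": {
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "file": "test.txt",
    "topic": "connect-test",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}
EOF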
4.5 Verify the Data Flow
bash
# 1. Check the output file
cat test.sink.txt
# You should see the data copied over from test.txt
# 2. Inspect the data in the Kafka topic
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic connect-test \
  --from-beginning
# 3. Append new lines to watch the pipeline in real time
echo "new line 1" >> test.txt
echo "new line 2" >> test.txt
# 4. Check that the new data arrives
tail -f test.sink.txt
4.6 Managing Connectors over the REST API
bash
# 1. List all connectors
curl http://localhost:8083/connectors
# 2. Check the status of a specific connector
curl http://localhost:8083/connectors/local-file-source/status
# 3. Show a connector's configuration
curl http://localhost:8083/connectors/local-file-source/config
# 4. Restart a connector
curl -X POST http://localhost:8083/connectors/local-file-source/restart
# 5. Pause a connector
curl -X PUT http://localhost:8083/connectors/local-file-source/pause
# 6. Resume a connector
curl -X PUT http://localhost:8083/connectors/local-file-source/resume
# 7. Delete a connector
curl -X DELETE http://localhost:8083/connectors/local-file-source
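These endpoints combine naturally into a small health-check loop. A sketch, assuming jq is installed and the worker listens on localhost:8083; it restarts any task that reports FAILED:
bash
# Restart every FAILED task of every connector (assumes jq is available)
CONNECT_URL=http://localhost:8083
for connector in $(curl -s "$CONNECT_URL/connectors" | jq -r '.[]'); do
  failed_tasks=$(curl -s "$CONNECT_URL/connectors/$connector/status" \
    | jq -r '.tasks[] | select(.state == "FAILED") | .id')
  for task in $failed_tasks; do
    echo "Restarting failed task $task of $connector"
    curl -s -X POST "$CONNECT_URL/connectors/$connector/tasks/$task/restart"
  done
done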
5. Common Connector Types
5.1 Database Connectors
5.1.1 JDBC Source Connector
json
{
"name": "jdbc-source",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"connection.url": "jdbc:mysql://localhost:3306/testdb",
"connection.user": "root",
"connection.password": "password",
"mode": "incrementing",
"incrementing.column.name": "id",
"table.whitelist": "users",
"topic.prefix": "mysql-",
"poll.interval.ms": "5000",
"batch.max.rows": "100",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"
}
}
5.1.2 JDBC Sink Connector
json
{
"name": "jdbc-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"connection.url": "jdbc:mysql://localhost:3306/analytics",
"connection.user": "root",
"connection.password": "password",
"topics": "user-events",
"auto.create": "true",
"auto.evolve": "true",
"insert.mode": "upsert",
"pk.mode": "record_value",
"pk.fields": "user_id",
"delete.enabled": "false",
"max.retries": "10",
"retry.backoff.ms": "3000"
}
}
5.2 Cloud Service Connectors
5.2.1 AWS S3 Sink Connector
json
{
"name": "s3-sink",
"config": {
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"tasks.max": "3",
"topics": "logs,metrics,events",
"s3.region": "us-west-2",
"s3.bucket.name": "my-kafka-bucket",
"s3.part.size": "5242880",
"flush.size": "1000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "en-US",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "timestamp"
}
}
5.2.2 Elasticsearch Sink Connector
json
{
"name": "elasticsearch-sink",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "2",
"topics": "user-activity",
"connection.url": "http://elasticsearch:9200",
"type.name": "_doc",
"key.ignore": "true",
"schema.ignore": "true",
"behavior.on.null.values": "ignore",
"batch.size": "2000",
"max.in.flight.requests": "5",
"max.buffered.records": "20000",
"linger.ms": "1000",
"flush.timeout.ms": "10000",
"max.retries": "5",
"retry.backoff.ms": "100"
}
}
5.3 Message Queue Connectors
5.3.1 RabbitMQ Source Connector
json
{
"name": "rabbitmq-source",
"config": {
"connector.class": "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max": "1",
"rabbitmq.host": "localhost",
"rabbitmq.port": "5672",
"rabbitmq.username": "guest",
"rabbitmq.password": "guest",
"rabbitmq.virtual.host": "/",
"rabbitmq.queue": "my-queue",
"rabbitmq.automatic.recovery.enabled": "true",
"rabbitmq.network.recovery.interval.ms": "5000",
"kafka.topic": "rabbitmq-events",
"value.converter": "org.apache.kafka.connect.json.JsonConverter"
}
}
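Unlike the FileStream connectors, none of the connectors above ship with Apache Kafka; they have to be installed into a directory on plugin.path before the worker can use them. A sketch using the Confluent Hub client (bundled with Confluent Platform and the cp-kafka-connect images; the versions here are placeholders):
bash
# Install connector plugins into the worker's plugin path, then restart the worker
confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest
confluent-hub install --no-prompt confluentinc/kafka-connect-s3:latest
confluent-hub install --no-prompt confluentinc/kafka-connect-elasticsearch:latest
# Verify the plugins were picked up
curl -s http://localhost:8083/connector-plugins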
6. Data Conversion and Processing
6.1 Built-in Converters
6.1.1 JSON Converter
properties
# JSON converter configuration
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
# Or disable the embedded schema (plain JSON payloads)
value.converter.schemas.enable=false
6.1.2 Avro Converter
properties
# Avro converter configuration (requires Schema Registry)
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter.schema.registry.url=http://localhost:8081
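The Avro converter only works if Schema Registry is reachable at the configured URL. A quick sanity check, assuming the default port 8081:
bash
# Schema Registry should answer with a JSON list of registered subjects
curl -s http://localhost:8081/subjects
# Subjects created by Connect typically follow the <topic>-key / <topic>-value naming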
6.2 Single Message Transforms
6.2.1 A Combined Transform Example
json
{
"name": "jdbc-source-with-transforms",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"connection.url": "jdbc:mysql://localhost:3306/testdb",
"table.whitelist": "users",
"topic.prefix": "mysql-",
"transforms": "createKey,extractInt",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field": "id",
"transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.mask.fields": "password,ssn",
"transforms.mask.replacement": "****",
"transforms.timestamp.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.timestamp.field": "created_at",
"transforms.timestamp.target.type": "Timestamp",
"transforms.timestamp.format": "yyyy-MM-dd HH:mm:ss"
}
}
6.2.2 Field-Level Transform Recipes
properties
# Four independent recipes; in a single connector, combine them into one transforms=... chain
# 1. Rename fields
transforms=renameField
transforms.renameField.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.renameField.renames=old_name:new_name,user_id:userId
# 2. Drop fields
transforms=filterFields
transforms.filterFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.filterFields.blacklist=sensitive_data,temp_field
# 3. Mask values
transforms=maskSensitive
transforms.maskSensitive.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskSensitive.fields=credit_card,password
transforms.maskSensitive.replacement=****
# 4. Convert timestamps
transforms=convertTimestamp
transforms.convertTimestamp.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.convertTimestamp.field=event_time
transforms.convertTimestamp.target.type=Timestamp
transforms.convertTimestamp.format=yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
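Transform chains are easy to get subtly wrong, so it helps to validate a candidate configuration before creating the connector. A sketch using the built-in validation endpoint (the connector class in the URL must match connector.class in the body):
bash
# Validate a config without creating the connector; error_count should be 0
curl -s -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connector-plugins/FileStreamSource/config/validate \
  -d '{
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "file": "test.txt",
    "topic": "connect-test"
  }' | jq '.error_count'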
7. Monitoring and Troubleshooting
7.1 Monitoring Metrics
7.1.1 JMX Monitoring
properties
# Add to connect-*.properties (ConfluentMetricsReporter requires the Confluent packages)
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=localhost:9092
confluent.metrics.reporter.topic.replicas=1
# JMX is enabled through the environment rather than this file:
# run `export JMX_PORT=9999` before starting the worker
7.1.2 Key Status Checks
bash
# 1. Connector status
curl -s http://localhost:8083/connectors/local-file-source/status | jq '.'
# 2. Task status
curl -s http://localhost:8083/connectors/local-file-source/tasks | jq '.'
# 3. Worker information
curl -s http://localhost:8083/ | jq '.'
# 4. Installed connector plugins
curl -s http://localhost:8083/connector-plugins | jq '.'
7.2 Logging Configuration
7.2.1 Log Levels
properties
# config/connect-log4j.properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n
# Verbose logging (for debugging)
log4j.logger.org.apache.kafka.connect=DEBUG
log4j.logger.io.confluent.connect=DEBUG
7.2.2 Pointing the Worker at a Log Configuration
bash
# The Kafka scripts read the log4j configuration from KAFKA_LOG4J_OPTS
KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:config/connect-log4j.properties" \
  bin/connect-standalone.sh \
  config/connect-standalone.properties \
  config/connect-file-source.properties
7.3 Common Problems
7.3.1 Connector Fails to Start
bash
# 1. Check connectivity to the Kafka cluster
telnet localhost 9092
# 2. Check the internal topics
bin/kafka-topics.sh --list --bootstrap-server localhost:9092 | grep connect
# 3. Search the logs for detailed errors
grep -i "error\|exception" logs/connect.log
7.3.2 Data Is Not Flowing
bash
# 1. Check the connector and task states
curl -s http://localhost:8083/connectors/local-file-source/status | jq '.tasks[].state'
# 2. Check whether the topic has data
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic connect-test \
  --from-beginning \
  --max-messages 10
# 3. Check the status of an individual task
curl -s http://localhost:8083/connectors/local-file-source/tasks/0/status | jq '.'
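For sink connectors, "data not flowing" often means the connector's consumer group is lagging or not consuming at all. Each sink connector consumes with the group connect-<connector name>, so the standard consumer-group tooling applies:
bash
# Inspect the lag of the sink connector's consumer group (connect-<connector-name>)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group connect-local-file-sink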
8. Advanced Configuration and Tuning
8.1 Performance Tuning
8.1.1 Batching
properties
# Source connector tuning (exact property names vary by connector)
batch.size=5000
poll.interval.ms=100
max.batch.size=10000
# Sink connector tuning (exact property names vary by connector)
flush.size=2000
batch.size=5000
max.buffer.size=50000
linger.ms=500
8.1.2 Parallelism
properties
# Increase the number of tasks
tasks.max=3
# Partitioning strategy (for storage sink connectors such as S3)
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
# For the JDBC source connector, more tables allow more parallel tasks
table.whitelist=table1,table2,table3
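tasks.max can be raised on a running connector without deleting it; the REST API accepts a full replacement config via PUT. A sketch against the JDBC source defined earlier:
bash
# PUT replaces the whole config, so resend every key, not just tasks.max
# (a single whitelisted table still caps the JDBC source at one task)
curl -s -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/jdbc-source/config \
  -d '{
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/testdb",
    "connection.user": "root",
    "connection.password": "password",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "users",
    "topic.prefix": "mysql-",
    "tasks.max": "3"
  }'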
8.2 Fault Tolerance
8.2.1 Error Handling and Dead Letter Queues
properties
# Error handling (dead letter queues apply to sink connectors only)
errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-connect-test
errors.deadletterqueue.topic.replication.factor=3
errors.deadletterqueue.context.headers.enable=true
# Framework-level retry policy for failed operations
errors.retry.timeout=30000
errors.retry.delay.max.ms=1000
# Connector-specific retries (e.g. the JDBC and Elasticsearch sinks)
retry.backoff.ms=1000
max.retries=10
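When errors.tolerance=all routes bad records to the dead letter queue, the error context is attached as record headers. A quick way to inspect them, assuming the DLQ topic name configured above:
bash
# Read dead-lettered records together with their error-context headers
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic dlq-connect-test \
  --from-beginning \
  --property print.headers=true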
8.2.2 Exactly-Once Semantics
properties
# Worker configuration (Kafka 3.3+, source connectors only)
exactly.once.source.support=enabled
# Connector configuration: require exactly-once delivery for this source
exactly.once.support=required
# Consumer side: only read records from committed transactions
consumer.isolation.level=read_committed
8.3 Security
8.3.1 SSL/TLS and SASL
properties
# SSL configuration
security.protocol=SSL
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=password
ssl.keystore.location=/path/to/keystore.jks
ssl.keystore.password=password
ssl.key.password=password
# SASL authentication (requires security.protocol=SASL_SSL or SASL_PLAINTEXT instead of SSL)
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="admin" \
  password="admin-secret";
9. Production Deployment
9.1 Docker Deployment
9.1.1 Docker Compose
yaml
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Advertise the in-network hostname so the connect container can reach the broker
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  connect:
    image: confluentinc/cp-kafka-connect:latest
    depends_on:
      - kafka
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka:9092
      CONNECT_REST_PORT: 8083
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_GROUP_ID: connect-cluster
      CONNECT_CONFIG_STORAGE_TOPIC: connect-configs
      CONNECT_OFFSET_STORAGE_TOPIC: connect-offsets
      CONNECT_STATUS_STORAGE_TOPIC: connect-status
      # Single broker, so the internal topics can only have one replica
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
    volumes:
      - ./connectors:/usr/share/confluent-hub-components
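A quick way to confirm the stack is healthy once it is up (service and port names as defined in the compose file above):
bash
docker compose up -d
# The worker answers with its version and the Kafka cluster id once it has started
curl -s http://localhost:8083/
# List the connector plugins available inside the container
curl -s http://localhost:8083/connector-plugins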
9.2 Kubernetes Deployment
9.2.1 Deployment and Service Manifests
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kafka-connect
labels:
app: kafka-connect
spec:
replicas: 3
selector:
matchLabels:
app: kafka-connect
template:
metadata:
labels:
app: kafka-connect
spec:
containers:
- name: connect
image: confluentinc/cp-kafka-connect:latest
ports:
- containerPort: 8083
env:
- name: CONNECT_BOOTSTRAP_SERVERS
value: "kafka-broker:9092"
        - name: CONNECT_REST_ADVERTISED_HOST_NAME
          # Each worker should advertise its own pod IP, or REST request forwarding between workers breaks
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
- name: CONNECT_GROUP_ID
value: "connect-cluster"
- name: CONNECT_CONFIG_STORAGE_TOPIC
value: "connect-configs"
- name: CONNECT_OFFSET_STORAGE_TOPIC
value: "connect-offsets"
- name: CONNECT_STATUS_STORAGE_TOPIC
value: "connect-status"
- name: CONNECT_KEY_CONVERTER
value: "org.apache.kafka.connect.json.JsonConverter"
- name: CONNECT_VALUE_CONVERTER
value: "org.apache.kafka.connect.json.JsonConverter"
volumeMounts:
- name: connector-plugins
mountPath: /usr/share/confluent-hub-components
volumes:
- name: connector-plugins
persistentVolumeClaim:
claimName: connector-plugins-pvc
---
apiVersion: v1
kind: Service
metadata:
name: kafka-connect
spec:
selector:
app: kafka-connect
ports:
- port: 8083
targetPort: 8083
type: LoadBalancer
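After applying the manifests, the usual checks apply; this sketch assumes they are saved as kafka-connect.yaml and uses the resource names defined above:
bash
kubectl apply -f kafka-connect.yaml
kubectl rollout status deployment/kafka-connect
# Reach the REST API through the service without exposing it publicly
kubectl port-forward svc/kafka-connect 8083:8083 &
curl -s http://localhost:8083/connectors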
10. Best Practices Summary
10.1 Configuration
- Use distributed mode: production deployments should always run in distributed mode
- Size replication properly: give the internal topics at least 3 replicas
- Enable monitoring: configure JMX or use Confluent Control Center
- Secure the cluster: enable SSL/TLS plus authentication and authorization
10.2 Performance Tuning
- Choose tasks.max deliberately: base it on CPU cores and data volume
- Tune batching: adjust batch.size and linger.ms
- Watch memory: monitor JVM usage and size the heap accordingly
10.3 Operations
- Version control: keep connector configurations in version control
- Automated deployment: roll out connector configs through CI/CD
- Disaster recovery: back up connector configs and offsets regularly (see the sketch below)
- Capacity planning: estimate storage and compute from the expected data volume
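A minimal sketch of a config backup, assuming jq and a worker at localhost:8083; source offsets live in the connect-offsets topic and can be dumped separately with the console consumer:
bash
# Save every connector's config as a JSON file that can be re-POSTed later
CONNECT_URL=http://localhost:8083
mkdir -p connector-backups
for connector in $(curl -s "$CONNECT_URL/connectors" | jq -r '.[]'); do
  curl -s "$CONNECT_URL/connectors/$connector" \
    | jq '{name: .name, config: .config}' \
    > "connector-backups/$connector.json"
done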
With the material above, you can move data between Kafka and other systems efficiently and build stable, reliable data integration pipelines with Kafka Connect.