A Beginner's Guide to Integrating Kafka Connect with a Kafka Cluster

Preface

Kafka Connect is a component of Apache Kafka that makes it easy to connect other systems, such as databases, cloud services, and file systems, to Kafka. Data can flow from those systems into Kafka through Kafka Connect, and from Kafka back out to them as well. A plugin that reads data from another system is called a Source Connector; one that writes data to another system is called a Sink Connector. Neither type connects to a Kafka broker directly: a Source Connector hands its data to Kafka Connect, and a Sink Connector receives data from Kafka Connect.

Install the TDengine Connector plugin

You need to install git, maven, and unzip first.

Build the plugin

shell
cd /tmp

git clone --branch 3.0 https://github.com/taosdata/kafka-connect-tdengine.git

cd kafka-connect-tdengine

mvn clean package -Dmaven.test.skip=true

unzip -d $KAFKA_HOME/components/ target/components/packages/taosdata-kafka-connect-tdengine-*.zip
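
To confirm the plugin unpacked where Kafka Connect will look for it, you can list the components directory (assuming $KAFKA_HOME points at your Kafka installation):

shell
ls $KAFKA_HOME/components/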

Configure the plugin

1. Distributed mode configuration

Edit the $KAFKA_HOME/config/connect-distributed.properties file. The key settings are:

  • bootstrap.servers: the Kafka cluster addresses, in the form host1:port1,host2:port2,...
  • group.id: the consumer group ID used by Kafka Connect.
  • key.converter and value.converter: converters that translate record keys and values between Kafka and the Connect data format. A common choice is org.apache.kafka.connect.json.JsonConverter.
  • offset.storage.topic: the topic where Kafka Connect stores offsets.
  • config.storage.topic: the topic where Kafka Connect stores connector configurations.
  • status.storage.topic: the topic where Kafka Connect stores connector and task status.

My configuration is as follows. First create the /usr/share/java directory referenced by plugin.path; do not reuse the Java installation directory, or the plugin will fail to load:

shell
mkdir -p /usr/share/java

ini
# Kafka cluster addresses
bootstrap.servers=192.168.174.131:9092,192.168.174.130:9092

# unique name for the cluster, used in forming the Connect cluster group. Note that this must not conflict with consumer group IDs
# This id must be unique; otherwise you can hit the "No route to host" error described below
group.id=connect-cluster-1

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# Topic to use for storing offsets. This topic should have many partitions and be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
offset.storage.topic=connect-offsets
# Set to 2 because I configured only two Kafka brokers
offset.storage.replication.factor=2

# Topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated,
# and compacted topic. Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
config.storage.topic=connect-configs
# Set to 2 because I configured only two Kafka brokers
config.storage.replication.factor=2

# Topic to use for storing statuses. This topic can have multiple partitions and should be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
status.storage.topic=connect-status
# Set to 2 because I configured only two Kafka brokers
status.storage.replication.factor=2

# List of comma-separated URIs the REST API will listen on. The supported protocols are HTTP and HTTPS.
# Specify hostname as 0.0.0.0 to bind to all interfaces.
# Leave hostname empty to bind to default interface.
# Examples of legal listener lists: HTTP://myhost:8083,HTTPS://myhost:8084
#listeners=HTTP://:8083
# You can also bind to a specific host, e.g. HTTP://hostname:8083
listeners=HTTP://:8083

plugin.path=/usr/share/java,/usr/kafka/kafka_2.12-3.7.0/components
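
As the comments above note, Connect creates the three internal topics automatically, but you can create them manually first if you need specific settings. A sketch for the offsets topic, assuming kafka-topics.sh is on the PATH; 25 partitions matches Connect's default for this topic, and the replication factor of 2 matches my two-broker cluster:

shell
kafka-topics.sh --bootstrap-server 192.168.174.131:9092 --create \
  --topic connect-offsets --partitions 25 --replication-factor 2 \
  --config cleanup.policy=compact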

Start Kafka

shell
zookeeper-server-start.sh -daemon $KAFKA_HOME/config/zookeeper.properties

kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties

connect-distributed.sh -daemon $KAFKA_HOME/config/connect-distributed.properties

  • -daemon: run as a daemon. If you are not sure the configuration works, run without this flag first so that errors appear on the console.
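
If you do keep -daemon, the console output normally lands in $KAFKA_HOME/logs/connectDistributed.out (assuming the default log directory), so you can still follow startup there:

shell
tail -f $KAFKA_HOME/logs/connectDistributed.out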

Note: On node kafka01, posting to Kafka Connect with curl -X POST -d @sink-demo.json http://localhost:8083/connectors -H "Content-Type: application/json" worked fine, but the same request on kafka02 kept failing with {"error_code":500,"message":"IO Error trying to forward REST request: java.net.NoRouteToHostException: No route to host"}. I then opened port 8083 in the firewall and changed group.id so it differed from kafka01's, after which the error became {"error_code":500,"message":"IO Error trying to forward REST request: java.net.ConnectException: Connection refused"}. Finally I dropped the -daemon flag and saw the following error:

shell
org.apache.kafka.connect.errors.ConnectException: Unable to initialize REST server
  at org.apache.kafka.connect.runtime.rest.RestServer.initializeServer(RestServer.java:199)
  at org.apache.kafka.connect.cli.AbstractConnectCli.startConnect(AbstractConnectCli.java:129)
  at org.apache.kafka.connect.cli.AbstractConnectCli.run(AbstractConnectCli.java:94)
  at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:116)
Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8083
  at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
  at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
  at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
  at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
  at org.eclipse.jetty.server.Server.doStart(Server.java:401)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
  at org.apache.kafka.connect.runtime.rest.RestServer.initializeServer(RestServer.java:197)
  ... 3 more
Caused by: java.net.BindException: Address already in use
  at sun.nio.ch.Net.bind0(Native Method)
  at sun.nio.ch.Net.bind(Net.java:438)
  at sun.nio.ch.Net.bind(Net.java:430)
  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:225)
  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
  at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
  ... 10 more

Check what is using port 8083:

shell
lsof -i :8083

The port was indeed occupied, so kill the offending process (PID 63250 in my case):

shell
kill -9 63250
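
The lookup and kill can also be combined, since lsof -t prints bare PIDs (a convenience sketch; prefer a clean shutdown over kill -9 when the process can be stopped gracefully):

shell
kill -9 $(lsof -ti :8083)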

Rerun without the daemon flag:

shell
connect-distributed.sh $KAFKA_HOME/config/connect-distributed.properties

This time it runs successfully.
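
For reference, the port-opening step mentioned earlier looks like this with firewalld (an assumption about my CentOS setup; adjust for your distribution's firewall):

shell
firewall-cmd --permanent --add-port=8083/tcp
firewall-cmd --reload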

Test Kafka Connect

Run the tests on both kafka01 and kafka02.

Verify that Kafka Connect started successfully

shell
curl http://localhost:8083/connectors

If all components started successfully, you get the following output:

shell
[]
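
You can also confirm that the TDengine plugin itself was picked up by listing the installed connector plugins; the response should include com.taosdata.kafka.connect.sink.TDengineSinkConnector:

shell
curl http://localhost:8083/connector-plugins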

Add the Sink Connector configuration file

bash
mkdir ~/test
cd ~/test
vi sink-demo.json

sink-demo.json is configured for schemaless writes and uses the TAOS-RS (REST) connection. Its content is as follows:

json
{
  "name": "TDengineSinkConnector",
  "config": {
    "connector.class":"com.taosdata.kafka.connect.sink.TDengineSinkConnector",
    "tasks.max": "1",
    "topics": "meters",
    "connection.url": "jdbc:TAOS-RS://127.0.0.1:6041/?user=root&password=taosdata&batchfetch=true",
    "connection.user": "root",
    "connection.password": "taosdata",
    "connection.database": "power",
    "db.schemaless": "line",
    "data.precision": "ns",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dead_letter_topic",
    "errors.deadletterqueue.topic.replication.factor": 1
  }
}

Key settings:

  1. "topics": "meters""connection.database": "power", 表示订阅主题 meters 的数据,并写入数据库 power。
  2. "db.schemaless": "line", 表示使用 InfluxDB Line 协议格式的数据。

Create the Sink Connector instance

shell
curl -X POST -d @sink-demo.json http://localhost:8083/connectors -H "Content-Type: application/json"

If the command succeeds, the output looks like this:

json
{
  "name": "TDengineSinkConnector",
  "config": {
    "connection.database": "power",
    "connection.password": "taosdata",
    "connection.url": "jdbc:TAOS-RS://127.0.0.1:6041/?user=root&password=taosdata&batchfetch=true",
    "connection.user": "root",
    "connector.class": "com.taosdata.kafka.connect.sink.TDengineSinkConnector",
    "data.precision": "ns",
    "db.schemaless": "line",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "tasks.max": "1",
    "topics": "meters",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "name": "TDengineSinkConnector",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dead_letter_topic",
    "errors.deadletterqueue.topic.replication.factor": "1",    
  },
  "tasks": [],
  "type": "sink"
}
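
The empty "tasks": [] only reflects that task assignment had not finished at creation time. To confirm the connector and its task reach the RUNNING state, query the status endpoint:

shell
curl http://localhost:8083/connectors/TDengineSinkConnector/status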

Write test data

Prepare a text file with the test data:

shell
vi ~/test/test-data.txt

meters,location=California.LosAngeles,groupid=2 current=11.8,voltage=221,phase=0.28 1648432611249000000
meters,location=California.LosAngeles,groupid=2 current=13.4,voltage=223,phase=0.29 1648432611250000000
meters,location=California.LosAngeles,groupid=3 current=10.8,voltage=223,phase=0.29 1648432611249000000
meters,location=California.LosAngeles,groupid=3 current=11.3,voltage=221,phase=0.35 1648432611250000000

Use kafka-console-producer to publish the test data to the meters topic.

shell
cat test-data.txt | kafka-console-producer.sh --broker-list localhost:9092 --topic meters
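
Before checking TDengine, you can optionally verify that the messages reached the topic itself with the console consumer (stop it with Ctrl+C):

shell
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic meters --from-beginning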

Verify that the sync succeeded

Use the TDengine CLI to verify that the data arrived:

shell
taos> use power;
Database changed.

taos> select * from meters;
              _ts               |          current          |          voltage          |           phase           | groupid |            location            |
===============================================================================================================================================================
 2022-03-28 09:56:51.249000000 |              11.800000000 |             221.000000000 |               0.280000000 | 2       | California.LosAngeles          |
 2022-03-28 09:56:51.250000000 |              13.400000000 |             223.000000000 |               0.290000000 | 2       | California.LosAngeles          |
 2022-03-28 09:56:51.249000000 |              10.800000000 |             223.000000000 |               0.290000000 | 3       | California.LosAngeles          |
 2022-03-28 09:56:51.250000000 |              11.300000000 |             221.000000000 |               0.350000000 | 3       | California.LosAngeles          |
Query OK, 4 row(s) in set (0.004208s)

Unload the plugin

After testing is complete, unload the connector by deleting it through the REST API.

First list the currently active connectors:

shell
curl http://localhost:8083/connectors

If you followed the steps above, there should be one active connector at this point. Delete it:

shell
curl -X DELETE http://localhost:8083/connectors/TDengineSinkConnector

Closing notes

With schemaless writes, or when writing data directly through the consumer API, the generated subtable names are unreadable. You can refer to the main processing logic of schemaless writes for details; note that the setting below only takes effect for REST connections.

My configuration is as follows:

shell
vi /etc/taos/taos.cfg

# Add smlAutoChildTableNameDelimiter
smlAutoChildTableNameDelimiter _

Restart taosd:

shell
systemctl stop taosd
systemctl start taosd
systemctl status taosd
taos -C

At this point, if you created the Sink Connector earlier, you must delete it and add it again for the setting to take effect; the generated subtable names will then no longer be the default MD5-style names.
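
You can inspect the generated subtable names from the TDengine CLI (a sketch; the exact names depend on your tag values and the configured delimiter):

shell
taos> select distinct tbname from power.meters;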
