A Beginner's Guide to Integrating Kafka Connect with a Kafka Cluster

Preface

Kafka Connect is a component of Apache Kafka that makes it easy to connect other systems, such as databases, cloud services, and file systems, to Kafka. Data can flow from those systems into Kafka through Kafka Connect, and from Kafka back out to them as well. A plugin that reads data from another system is called a Source Connector; one that writes data to another system is called a Sink Connector. Neither type connects to a Kafka broker directly: a Source Connector hands its data to Kafka Connect, and a Sink Connector receives data from Kafka Connect.

Install the TDengine Connector plugin

You need to install git, maven, and unzip first.

Build the plugin

shell
cd /tmp

git clone --branch 3.0 https://github.com/taosdata/kafka-connect-tdengine.git

cd kafka-connect-tdengine

mvn clean package -Dmaven.test.skip=true

unzip -d $KAFKA_HOME/components/ target/components/packages/taosdata-kafka-connect-tdengine-*.zip
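
To confirm the plugin unpacked where Kafka Connect will look for it, you can list the components directory (assuming $KAFKA_HOME points at your Kafka installation):

shell
ls $KAFKA_HOME/components/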

Configure the plugin

1. Distributed mode configuration

Edit the $KAFKA_HOME/config/connect-distributed.properties file. The key settings are:

  • bootstrap.servers: the Kafka cluster addresses, in the form host1:port1,host2:port2,...
  • group.id: the consumer group ID used by Kafka Connect.
  • key.converter and value.converter: converters that translate record keys and values between Kafka and the Connect data format. A common choice is org.apache.kafka.connect.json.JsonConverter.
  • offset.storage.topic: the topic where Kafka Connect stores offsets.
  • config.storage.topic: the topic where Kafka Connect stores connector configurations.
  • status.storage.topic: the topic where Kafka Connect stores connector and task status.

My configuration is as follows. First create the /usr/share/java directory referenced by plugin.path; do not reuse the Java installation directory, or the plugin will fail to load:

shell
mkdir -p /usr/share/java

ini
# Kafka cluster addresses
bootstrap.servers=192.168.174.131:9092,192.168.174.130:9092

# unique name for the cluster, used in forming the Connect cluster group. Note that this must not conflict with consumer group IDs
# This id must be unique; otherwise you can hit the "No route to host" error described below
group.id=connect-cluster-1

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# Topic to use for storing offsets. This topic should have many partitions and be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
offset.storage.topic=connect-offsets
# Set to 2 because I configured only two Kafka brokers
offset.storage.replication.factor=2

# Topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated,
# and compacted topic. Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
config.storage.topic=connect-configs
# Set to 2 because I configured only two Kafka brokers
config.storage.replication.factor=2

# Topic to use for storing statuses. This topic can have multiple partitions and should be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
status.storage.topic=connect-status
# Set to 2 because I configured only two Kafka brokers
status.storage.replication.factor=2

# List of comma-separated URIs the REST API will listen on. The supported protocols are HTTP and HTTPS.
# Specify hostname as 0.0.0.0 to bind to all interfaces.
# Leave hostname empty to bind to default interface.
# Examples of legal listener lists: HTTP://myhost:8083,HTTPS://myhost:8084
#listeners=HTTP://:8083
# You can also bind to a specific host, e.g. HTTP://hostname:8083
listeners=HTTP://:8083

plugin.path=/usr/share/java,/usr/kafka/kafka_2.12-3.7.0/components
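
As the comments above note, Connect creates the three internal topics automatically, but you can create them manually first if you need specific settings. A sketch for the offsets topic, assuming kafka-topics.sh is on the PATH; 25 partitions matches Connect's default for this topic, and the replication factor of 2 matches my two-broker cluster:

shell
kafka-topics.sh --bootstrap-server 192.168.174.131:9092 --create \
  --topic connect-offsets --partitions 25 --replication-factor 2 \
  --config cleanup.policy=compact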

Start Kafka

shell
zookeeper-server-start.sh -daemon $KAFKA_HOME/config/zookeeper.properties

kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties

connect-distributed.sh -daemon $KAFKA_HOME/config/connect-distributed.properties

  • -daemon: run as a daemon. If you are not sure the configuration works, run without this flag first so that errors appear on the console.
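
If you do keep -daemon, the console output normally lands in $KAFKA_HOME/logs/connectDistributed.out (assuming the default log directory), so you can still follow startup there:

shell
tail -f $KAFKA_HOME/logs/connectDistributed.out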

Note: On node kafka01, posting to Kafka Connect with curl -X POST -d @sink-demo.json http://localhost:8083/connectors -H "Content-Type: application/json" worked fine, but the same request on kafka02 kept failing with {"error_code":500,"message":"IO Error trying to forward REST request: java.net.NoRouteToHostException: No route to host"}. I then opened port 8083 in the firewall and changed group.id so it differed from kafka01's, after which the error became {"error_code":500,"message":"IO Error trying to forward REST request: java.net.ConnectException: Connection refused"}. Finally I dropped the -daemon flag and saw the following error:

shell
org.apache.kafka.connect.errors.ConnectException: Unable to initialize REST server
  at org.apache.kafka.connect.runtime.rest.RestServer.initializeServer(RestServer.java:199)
  at org.apache.kafka.connect.cli.AbstractConnectCli.startConnect(AbstractConnectCli.java:129)
  at org.apache.kafka.connect.cli.AbstractConnectCli.run(AbstractConnectCli.java:94)
  at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:116)
Caused by: java.io.IOException: Failed to bind to 0.0.0.0/0.0.0.0:8083
  at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349)
  at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310)
  at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
  at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
  at org.eclipse.jetty.server.Server.doStart(Server.java:401)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
  at org.apache.kafka.connect.runtime.rest.RestServer.initializeServer(RestServer.java:197)
  ... 3 more
Caused by: java.net.BindException: Address already in use
  at sun.nio.ch.Net.bind0(Native Method)
  at sun.nio.ch.Net.bind(Net.java:438)
  at sun.nio.ch.Net.bind(Net.java:430)
  at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:225)
  at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
  at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344)
  ... 10 more

Check what is using port 8083:

shell
lsof -i :8083

The port was indeed occupied, so kill the offending process (PID 63250 in my case):

shell
kill -9 63250
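
The lookup and kill can also be combined, since lsof -t prints bare PIDs (a convenience sketch; prefer a clean shutdown over kill -9 when the process can be stopped gracefully):

shell
kill -9 $(lsof -ti :8083)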

Rerun without the daemon flag:

shell
connect-distributed.sh $KAFKA_HOME/config/connect-distributed.properties

This time it runs successfully.
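
For reference, the port-opening step mentioned earlier looks like this with firewalld (an assumption about my CentOS setup; adjust for your distribution's firewall):

shell
firewall-cmd --permanent --add-port=8083/tcp
firewall-cmd --reload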

Test Kafka Connect

Run the tests on both kafka01 and kafka02.

Verify that Kafka Connect started successfully

shell
curl http://localhost:8083/connectors

If all components started successfully, you get the following output:

shell
[]
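
You can also confirm that the TDengine plugin itself was picked up by listing the installed connector plugins; the response should include com.taosdata.kafka.connect.sink.TDengineSinkConnector:

shell
curl http://localhost:8083/connector-plugins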

Add the Sink Connector configuration file

bash
mkdir ~/test
cd ~/test
vi sink-demo.json

sink-demo.json is configured for schemaless writes and uses the TAOS-RS (REST) connection. Its content is as follows:

json
{
  "name": "TDengineSinkConnector",
  "config": {
    "connector.class":"com.taosdata.kafka.connect.sink.TDengineSinkConnector",
    "tasks.max": "1",
    "topics": "meters",
    "connection.url": "jdbc:TAOS-RS://127.0.0.1:6041/?user=root&password=taosdata&batchfetch=true",
    "connection.user": "root",
    "connection.password": "taosdata",
    "connection.database": "power",
    "db.schemaless": "line",
    "data.precision": "ns",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dead_letter_topic",
    "errors.deadletterqueue.topic.replication.factor": 1
  }
}

Key settings:

  1. "topics": "meters""connection.database": "power", 表示订阅主题 meters 的数据,并写入数据库 power。
  2. "db.schemaless": "line", 表示使用 InfluxDB Line 协议格式的数据。

Create the Sink Connector instance

shell
curl -X POST -d @sink-demo.json http://localhost:8083/connectors -H "Content-Type: application/json"

If the command succeeds, the output looks like this:

json
{
  "name": "TDengineSinkConnector",
  "config": {
    "connection.database": "power",
    "connection.password": "taosdata",
    "connection.url": "jdbc:TAOS-RS://127.0.0.1:6041/?user=root&password=taosdata&batchfetch=true",
    "connection.user": "root",
    "connector.class": "com.taosdata.kafka.connect.sink.TDengineSinkConnector",
    "data.precision": "ns",
    "db.schemaless": "line",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "tasks.max": "1",
    "topics": "meters",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "name": "TDengineSinkConnector",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dead_letter_topic",
    "errors.deadletterqueue.topic.replication.factor": "1",    
  },
  "tasks": [],
  "type": "sink"
}
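
The empty "tasks": [] only reflects that task assignment had not finished at creation time. To confirm the connector and its task reach the RUNNING state, query the status endpoint:

shell
curl http://localhost:8083/connectors/TDengineSinkConnector/status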

Write test data

Prepare a text file with the test data:

shell
vi ~/test/test-data.txt

meters,location=California.LosAngeles,groupid=2 current=11.8,voltage=221,phase=0.28 1648432611249000000
meters,location=California.LosAngeles,groupid=2 current=13.4,voltage=223,phase=0.29 1648432611250000000
meters,location=California.LosAngeles,groupid=3 current=10.8,voltage=223,phase=0.29 1648432611249000000
meters,location=California.LosAngeles,groupid=3 current=11.3,voltage=221,phase=0.35 1648432611250000000

Use kafka-console-producer to publish the test data to the meters topic.

shell
cat test-data.txt | kafka-console-producer.sh --broker-list localhost:9092 --topic meters
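
Before checking TDengine, you can optionally verify that the messages reached the topic itself with the console consumer (stop it with Ctrl+C):

shell
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic meters --from-beginning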

Verify that the sync succeeded

Use the TDengine CLI to verify that the data arrived:

shell
taos> use power;
Database changed.

taos> select * from meters;
              _ts               |          current          |          voltage          |           phase           | groupid |            location            |
===============================================================================================================================================================
 2022-03-28 09:56:51.249000000 |              11.800000000 |             221.000000000 |               0.280000000 | 2       | California.LosAngeles          |
 2022-03-28 09:56:51.250000000 |              13.400000000 |             223.000000000 |               0.290000000 | 2       | California.LosAngeles          |
 2022-03-28 09:56:51.249000000 |              10.800000000 |             223.000000000 |               0.290000000 | 3       | California.LosAngeles          |
 2022-03-28 09:56:51.250000000 |              11.300000000 |             221.000000000 |               0.350000000 | 3       | California.LosAngeles          |
Query OK, 4 row(s) in set (0.004208s)

Unload the plugin

After testing is complete, unload the connector by deleting it through the REST API.

First list the currently active connectors:

shell
curl http://localhost:8083/connectors

If you followed the steps above, there should be one active connector at this point. Delete it:

shell
curl -X DELETE http://localhost:8083/connectors/TDengineSinkConnector

Closing notes

With schemaless writes, or when writing data directly through the consumer API, the generated subtable names are unreadable. You can refer to the main processing logic of schemaless writes for details; note that the setting below only takes effect for REST connections.

My configuration is as follows:

shell
vi /etc/taos/taos.cfg

# Add smlAutoChildTableNameDelimiter
smlAutoChildTableNameDelimiter _

Restart taosd:

shell
systemctl stop taosd
systemctl start taosd
systemctl status taosd
taos -C

At this point, if you created the Sink Connector earlier, you must delete it and add it again for the setting to take effect; the generated subtable names will then no longer be the default MD5-style names.
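
You can inspect the generated subtable names from the TDengine CLI (a sketch; the exact names depend on your tag values and the configured delimiter):

shell
taos> select distinct tbname from power.meters;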
