I. Context
In the previous article, "Kafka Connect's Built-in Example", we launched a producer and a consumer with zero configuration to produce and consume data. How does that work internally? Let's lift the veil by reading the source code.
II. What are the entry classes?
From the startup script (connect-standalone.sh) we can see that the entry class is ConnectStandalone:
bash
# ...... omitted ......
exec $(dirname $0)/kafka-run-class.sh $EXTRA_ARGS org.apache.kafka.connect.cli.ConnectStandalone "$@"
In addition, we passed the source and sink config files as arguments. From the source file we get the entry class FileStreamSource, and from the sink file the entry class FileStreamSink.
FileStreamSource corresponds to FileStreamSourceConnector in the source code.
FileStreamSink corresponds to FileStreamSinkConnector in the source code.
connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/etc/kafka/conf.dist/connect-file-test-data/source.txt
topic=connect-test
connect-file-sink.properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/etc/kafka/conf.dist/connect-file-test-data/sink.txt
topics=connect-test
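These two files are passed on the command line after the worker config. With the layout from the Kafka quickstart (paths vary by installation; the CDH paths above differ), the launch looks like this:
bash
bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties config/connect-file-sink.properties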
III. ConnectStandalone
It runs Kafka Connect as a standalone process. In this mode, work (connectors and tasks) is not distributed; instead, all of the normal Connect machinery runs within a single process. This is useful for ad hoc, small, or experimental jobs.
In this mode, connector and task configurations are kept in memory and are not persistent. Connector offset data, however, is persistent, because it uses file storage (configurable via offset.storage.file.filename).
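For reference, a minimal connect-standalone.properties along the lines of the one shipped with Kafka (the values here are examples; adjust bootstrap.servers and the offset file path to your environment):
connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# standalone mode persists source offsets to this local file
offset.storage.file.filename=/tmp/connect.offsets
# how often offsets are flushed, in ms
offset.flush.interval.ms=10000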
java
public class ConnectStandalone extends AbstractConnectCli<StandaloneConfig> {
    public static void main(String[] args) {
        ConnectStandalone connectStandalone = new ConnectStandalone(args);
        // Invoke run() from the parent class
        connectStandalone.run();
    }
}
// AbstractConnectCli is used here; it holds the common initialization logic for Kafka Connect.
public abstract class AbstractConnectCli<T extends WorkerConfig> {
    // Validate the first CLI argument, process the worker properties, and start Connect
    public void run() {
        if (args.length < 1 || Arrays.asList(args).contains("--help")) {
            log.info("Usage: {}", usage());
            Exit.exit(1);
        }
        try {
            // The connect-standalone.properties config file
            String workerPropsFile = args[0];
            // Turn the parameters configured in connect-standalone.properties into a map
            Map<String, String> workerProps = !workerPropsFile.isEmpty() ?
                    Utils.propsToStringMap(Utils.loadProps(workerPropsFile)) : Collections.emptyMap();
            // The arguments following connect-standalone.properties,
            // e.g. connect-file-source.properties and connect-file-sink.properties
            String[] extraArgs = Arrays.copyOfRange(args, 1, args.length);
            // Start Connect
            Connect connect = startConnect(workerProps, extraArgs);
            // Shutdown will be triggered by Ctrl-C or via an HTTP shutdown request
            connect.awaitStop();
        } catch (Throwable t) {
            log.error("Stopping due to error", t);
            Exit.exit(2);
        }
    }
}
To summarize, run() does two things:
1. Load the configuration from connect-standalone.properties
2. Start Connect
So what is Connect, and what does it do? Let's take a look.
1. Starting Connect
java
public abstract class AbstractConnectCli<T extends WorkerConfig> {
    public Connect startConnect(Map<String, String> workerProps, String... extraArgs) {
        // Kafka Connect worker initializing
        log.info("Kafka Connect worker initializing ...");
        long initStart = time.hiResClockMs();
        WorkerInfo initInfo = new WorkerInfo();
        initInfo.logAll();
        // Scanning for plugin classes; this might take a moment
        log.info("Scanning for plugin classes. This might take a moment ...");
        Plugins plugins = new Plugins(workerProps);
        plugins.compareAndSwapWithDelegatingLoader();
        T config = createConfig(workerProps);
        log.debug("Kafka cluster ID: {}", config.kafkaClusterId());
        RestClient restClient = new RestClient(config);
        ConnectRestServer restServer = new ConnectRestServer(config.rebalanceTimeout(), restClient, config.originals());
        restServer.initializeServer();
        URI advertisedUrl = restServer.advertisedUrl();
        String workerId = advertisedUrl.getHost() + ":" + advertisedUrl.getPort();
        // connector.client.config.override.policy, default "All": the class name or alias
        // of a ConnectorClientConfigOverridePolicy implementation, which defines which
        // client configs a connector may override. The default, "All", means connector
        // configs can override all client properties. The other built-in policies are
        // "None" (connectors may not override client properties) and "Principal"
        // (connectors may only override the client principal).
        ConnectorClientConfigOverridePolicy connectorClientConfigOverridePolicy = plugins.newPlugin(
                config.getString(WorkerConfig.CONNECTOR_CLIENT_POLICY_CLASS_CONFIG),
                config, ConnectorClientConfigOverridePolicy.class);
        Herder herder = createHerder(config, workerId, plugins, connectorClientConfigOverridePolicy, restServer, restClient);
        final Connect connect = new Connect(herder, restServer);
        // Kafka Connect worker initialization is complete
        log.info("Kafka Connect worker initialization took {}ms", time.hiResClockMs() - initStart);
        try {
            connect.start();
        } catch (Exception e) {
            log.error("Failed to start Connect", e);
            connect.stop();
            Exit.exit(3);
        }
        // The extra arguments, e.g. connect-file-source.properties and
        // connect-file-sink.properties, are processed here one by one
        processExtraArgs(herder, connect, extraArgs);
        return connect;
    }
}
You can think of connect-standalone.properties as what starts Connect itself, connect-file-source.properties as what starts the source task, and connect-file-sink.properties as what starts the sink task. The source and sink tasks are created in processExtraArgs(herder, connect, extraArgs), which involves three pieces:
1. Herder (literally "herdsman"): an instance used to perform operations on a Connect cluster; it holds the Worker
2. Connect: the class that ties together all the components of a Kafka Connect process (herder, worker, storage, command interface) and manages their lifecycle, as sketched just after this list
3. extraArgs: the arguments that configure the source and sink tasks
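For orientation, here is an abridged sketch of the Connect class based on the Kafka source (method bodies trimmed to the lifecycle calls; details vary between versions):
java
// Abridged sketch, not the complete class: Connect wires the herder and the REST
// server together and blocks in awaitStop() until shutdown is requested.
public class Connect {
    private final Herder herder;
    private final ConnectRestServer rest;
    private final CountDownLatch stopLatch = new CountDownLatch(1);

    public Connect(Herder herder, ConnectRestServer rest) {
        this.herder = herder;
        this.rest = rest;
    }

    public void start() {
        herder.start();                    // start the worker machinery
        rest.initializeResources(herder);  // expose the REST API backed by the herder
    }

    public void stop() {
        herder.stop();
        rest.stop();
        stopLatch.countDown();             // release awaitStop()
    }

    public void awaitStop() throws InterruptedException {
        stopLatch.await();                 // blocks until Ctrl-C / shutdown hook calls stop()
    }
}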
2. Starting the source and sink tasks
This is easier to follow with the actual arguments (connect-file-source.properties and connect-file-sink.properties) in mind.
java
protected void processExtraArgs(Herder herder, Connect connect, String[] extraArgs) {
    try {
        // Parse the config files one at a time; with the official example that means
        // connect-file-source.properties, then connect-file-sink.properties
        for (final String connectorConfigFile : extraArgs) {
            CreateConnectorRequest createConnectorRequest = parseConnectorConfigurationFile(connectorConfigFile);
            FutureCallback<Herder.Created<ConnectorInfo>> cb = new FutureCallback<>((error, info) -> {
                if (error != null)
                    log.error("Failed to create connector for {}", connectorConfigFile);
                else
                    log.info("Created connector {}", info.result().name());
            });
            // Hand the parsed result of each config file to the herder
            herder.putConnectorConfig(
                    createConnectorRequest.name(), createConnectorRequest.config(),
                    createConnectorRequest.initialTargetState(),
                    false, cb);
            // Future.get() is a blocking call that retrieves the task's result;
            // if the task has not finished yet, the thread blocks until it does
            cb.get();
        }
    } catch (Throwable t) {
        //......
    }
}
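To make the loop above concrete: for a .properties file, parseConnectorConfigurationFile boils down to loading the key/value pairs and taking the connector name from the name key (newer Kafka versions also accept JSON files here). A minimal illustrative sketch, not the actual method:
java
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ConnectorPropsLoader {
    // Illustrative helper: read connect-file-source.properties into the
    // name/config pair that herder.putConnectorConfig(...) expects.
    static Map<String, String> loadConnectorProps(String path) throws IOException {
        Properties props = new Properties();
        try (FileReader reader = new FileReader(path)) {
            props.load(reader);
        }
        Map<String, String> config = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            config.put(key, props.getProperty(key));
        }
        return config; // config.get("name") -> e.g. "local-file-source"
    }
}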
Now let's trace connect-file-source.properties and connect-file-sink.properties through the herder to see how it gets the tasks running. One config file corresponds to one connector. We start with StandaloneHerder.
java
public class StandaloneHerder extends AbstractHerder {
    private final ScheduledExecutorService requestExecutorService;

    StandaloneHerder(Worker worker,
                     String workerId,
                     String kafkaClusterId,
                     StatusBackingStore statusBackingStore,
                     MemoryConfigBackingStore configBackingStore,
                     ConnectorClientConfigOverridePolicy connectorClientConfigOverridePolicy,
                     Time time) {
        super(worker, workerId, kafkaClusterId, statusBackingStore, configBackingStore, connectorClientConfigOverridePolicy, time);
        this.configState = ClusterConfigState.EMPTY;
        // Create a single-threaded executor that can schedule commands to run
        // after a given delay, or to execute periodically
        this.requestExecutorService = Executors.newSingleThreadScheduledExecutor();
        configBackingStore.setUpdateListener(new ConfigUpdateListener());
    }

    public void putConnectorConfig(final String connName, final Map<String, String> config, final TargetState targetState,
                                   final boolean allowReplace, final Callback<Created<ConnectorInfo>> callback) {
        try {
            validateConnectorConfig(config, (error, configInfos) -> {
                if (error != null) {
                    callback.onCompletion(error, null);
                    return;
                }
                // Submit the source or sink ConnectorConfig to the executor
                requestExecutorService.submit(
                        () -> putConnectorConfig(connName, config, targetState, allowReplace, callback, configInfos)
                );
            });
        } catch (Throwable t) {
            callback.onCompletion(t, null);
        }
    }

    private synchronized void putConnectorConfig(String connName,
                                                 final Map<String, String> config,
                                                 TargetState targetState,
                                                 boolean allowReplace,
                                                 final Callback<Created<ConnectorInfo>> callback,
                                                 ConfigInfos configInfos) {
        try {
            //........
            // Write this connector configuration (and optionally a target state) to
            // persistent storage and wait until it has been acknowledged and read back
            // by a consumer tailing the Kafka log. If the worker is configured to write
            // to the config topic with a fencable producer, claimWritePrivileges() must
            // have succeeded before this method is called.
            configBackingStore.putConnectorConfig(connName, config, targetState);
            startConnector(connName, (error, result) -> {
                if (error != null) {
                    callback.onCompletion(error, null);
                    return;
                }
                // Submit the task-creation work to the executor
                requestExecutorService.submit(() -> {
                    updateConnectorTasks(connName);
                    callback.onCompletion(null, new Created<>(created, createConnectorInfo(connName)));
                });
            });
        } catch (Throwable t) {
            callback.onCompletion(t, null);
        }
    }

    private synchronized void updateConnectorTasks(String connName) {
        //......
        // Compute the task configs; connector.class from the config file is used here
        List<Map<String, String>> newTaskConfigs = recomputeTaskConfigs(connName);
        List<Map<String, String>> rawTaskConfigs = reverseTransform(connName, configState, newTaskConfigs);
        if (taskConfigsChanged(configState, connName, rawTaskConfigs)) {
            removeConnectorTasks(connName);
            configBackingStore.putTaskConfigs(connName, rawTaskConfigs);
            createConnectorTasks(connName);
        }
    }

    private void createConnectorTasks(String connName) {
        List<ConnectorTaskId> taskIds = configState.tasks(connName);
        createConnectorTasks(connName, taskIds);
    }

    private void createConnectorTasks(String connName, Collection<ConnectorTaskId> taskIds) {
        Map<String, String> connConfigs = configState.connectorConfig(connName);
        for (ConnectorTaskId taskId : taskIds) {
            // Start each task in turn (the configured source and sink tasks)
            startTask(taskId, connConfigs);
        }
    }

    private boolean startTask(ConnectorTaskId taskId, Map<String, String> connProps) {
        switch (connectorType(connProps)) {
            case SINK:
                return worker.startSinkTask(
                        taskId,
                        configState,
                        connProps,
                        configState.taskConfig(taskId),
                        this,
                        configState.targetState(taskId.connector())
                );
            case SOURCE:
                return worker.startSourceTask(
                        taskId,
                        configState,
                        connProps,
                        configState.taskConfig(taskId),
                        this,
                        configState.targetState(taskId.connector())
                );
            default:
                throw new ConnectException("Failed to start task " + taskId + " since it is not a recognizable type (source or sink)");
        }
    }
}
The herder uses the Worker to start the configured SourceTask and SinkTask. Ultimately both paths call the same method; only the task builder differs. Let's continue the analysis.
Worker starting a SourceTask
java
public boolean startSourceTask(...) {
    return startTask(id, connProps, taskProps, configState, statusListener,
            new SourceTaskBuilder(id, configState, statusListener, initialState));
}
Worker starting a SinkTask
java
public boolean startSinkTask(...) {
    return startTask(id, connProps, taskProps, configState, statusListener,
            new SinkTaskBuilder(id, configState, statusListener, initialState));
}
Next, let's look at startTask(...), which both of them call:
java
// Thread pool:
// Executors.newCachedThreadPool()
private final ExecutorService executor;

private boolean startTask(...) {
    //......
    // Load the class named by connector.class
    String connType = connProps.get(ConnectorConfig.CONNECTOR_CLASS_CONFIG);
    ClassLoader connectorLoader = plugins.connectorLoader(connType);
    //......
    final Class<? extends Task> taskClass = taskConfig.getClass(TaskConfig.TASK_CLASS_CONFIG).asSubclass(Task.class);
    // The corresponding SourceTask or SinkTask
    final Task task = plugins.newTask(taskClass);
    // Depending on the builder passed in, workerTask wraps a different Task:
    //   SourceTaskBuilder ---> SourceTask
    //   SinkTaskBuilder   ---> SinkTask
    workerTask = taskBuilder
            .withTask(task)
            .withConnectorConfig(connConfig)
            .withKeyConverter(keyConverter)
            .withValueConverter(valueConverter)
            .withHeaderConverter(headerConverter)
            .withClassloader(connectorLoader)
            .build();
    workerTask.initialize(taskConfig);
    WorkerTask<?, ?> existing = tasks.putIfAbsent(id, workerTask);
    // Submit the WorkerTask to the thread pool; below we follow how the
    // SourceTask and SinkTask actually execute
    executor.submit(plugins.withClassLoader(connectorLoader, workerTask));
    if (workerTask instanceof WorkerSourceTask) {
        // A source task gets a separate periodic offset-commit job; the default interval is 1 min
        sourceTaskOffsetCommitter.ifPresent(committer -> committer.schedule(id, (WorkerSourceTask) workerTask));
    }
    return true;
}
The Task for FileStreamSource is FileStreamSourceTask.
The Task for FileStreamSink is FileStreamSinkTask.
The Worker then schedules the WorkerTask to produce and consume data.
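Where does the task.class mapping come from? Each Connector declares its Task class via taskClass(), and the framework writes that class name into the task config (TaskConfig.TASK_CLASS_CONFIG) read in startTask(...) above. Abridged from the Kafka source:
java
// Abridged: each Connector declares the Task class the framework should instantiate.
public class FileStreamSourceConnector extends SourceConnector {
    @Override
    public Class<? extends Task> taskClass() {
        return FileStreamSourceTask.class;
    }
    // ......
}

public class FileStreamSinkConnector extends SinkConnector {
    @Override
    public Class<? extends Task> taskClass() {
        return FileStreamSinkTask.class;
    }
    // ......
}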
3. Scheduling and running the WorkerTask
WorkerTask provides the basic methods the Worker uses to manage tasks. Implementations combine the user-specified Task with Kafka to create a data flow. The WorkerTask is submitted to a thread pool for scheduling, so let's look at its run():
java
// Only the key code is shown below
abstract class WorkerTask<T, R extends ConnectRecord<R>> implements Runnable {
    public void run() {
        doRun();
    }

    private void doRun() throws InterruptedException {
        // Initializes the Producer/Consumer parameters we configured in
        // connect-file-source.properties and connect-file-sink.properties
        doStart();
        // The real work starts here:
        //   the SourceTask reads data and hands it to a Producer to publish;
        //   a Consumer receives data and hands it to the SinkTask to process
        execute();
    }
}
doStart()
java
void doStart() {
    retryWithToleranceOperator.reporters(errorReportersSupplier.get());
    initializeAndStart();
    statusListener.onStartup(id);
}
Source and Sink implement initializeAndStart() differently.
Source
For the source side, the WorkerTask subclass AbstractWorkerSourceTask is used:
java
public abstract class AbstractWorkerSourceTask extends WorkerTask<SourceRecord, SourceRecord> {
    protected void initializeAndStart() {
        prepareToInitializeTask();
        offsetStore.start();
        // Mark the task as started
        started = true;
        // Initialize this SourceTask with the given context object
        task.initialize(sourceTaskContext);
        // Start the task. This should handle any configuration parsing and one-time setup.
        // Here the start() of FileStreamSourceTask (or whatever SourceTask we configured)
        // is actually invoked.
        task.start(taskConfig);
        log.info("{} Source task finished initialization and start", this);
    }
}
java
public class FileStreamSourceTask extends SourceTask {
    public void start(Map<String, String> props) {
        AbstractConfig config = new AbstractConfig(FileStreamSourceConnector.CONFIG_DEF, props);
        // props holds the contents of connect-file-source.properties:
        //   name=local-file-source
        //   connector.class=FileStreamSource
        //   tasks.max=1
        //   file=/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/etc/kafka/conf.dist/connect-file-test-data/source.txt
        //   topic=connect-test
        // file
        filename = config.getString(FileStreamSourceConnector.FILE_CONFIG);
        if (filename == null || filename.isEmpty()) {
            stream = System.in;
            // Tracking offsets for stdin doesn't make sense
            streamOffset = null;
            reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        }
        // topic=connect-test
        topic = config.getString(FileStreamSourceConnector.TOPIC_CONFIG);
        // batch.size
        batchSize = config.getInt(FileStreamSourceConnector.TASK_BATCH_SIZE_CONFIG);
    }
}
Sink
For the sink side, the WorkerTask subclass WorkerSinkTask is used:
java
protected void initializeAndStart() {
    SinkConnectorConfig.validate(taskConfig);
    // Subscribe to the topics
    if (SinkConnectorConfig.hasTopicsConfig(taskConfig)) {
        List<String> topics = SinkConnectorConfig.parseTopicsList(taskConfig);
        consumer.subscribe(topics, new HandleRebalance());
        log.debug("{} Initializing and starting task for topics {}", this, String.join(", ", topics));
    } else {
        // topics.regex:
        // the topics to consume, selected by a regular expression
        String topicsRegexStr = taskConfig.get(SinkTask.TOPICS_REGEX_CONFIG);
        Pattern pattern = Pattern.compile(topicsRegexStr);
        consumer.subscribe(pattern, new HandleRebalance());
        log.debug("{} Initializing and starting task for topics regex {}", this, topicsRegexStr);
    }
    // Initialize the context for this task
    task.initialize(context);
    // Start the task. This should handle any configuration parsing and one-time setup.
    // Here the start() of FileStreamSinkTask (or whatever SinkTask we configured)
    // is actually invoked.
    task.start(taskConfig);
    log.info("{} Sink task finished initialization and start", this);
}
java
public class FileStreamSinkTask extends SinkTask {
    public void start(Map<String, String> props) {
        AbstractConfig config = new AbstractConfig(FileStreamSinkConnector.CONFIG_DEF, props);
        filename = config.getString(FileStreamSinkConnector.FILE_CONFIG);
        if (filename == null || filename.isEmpty()) {
            outputStream = System.out;
        } else {
            try {
                // Create the output stream for the result file set in the config
                outputStream = new PrintStream(
                        Files.newOutputStream(Paths.get(filename), StandardOpenOption.CREATE, StandardOpenOption.APPEND),
                        false,
                        StandardCharsets.UTF_8.name());
            } catch (IOException e) {
                throw new ConnectException("Couldn't find or create file '" + filename + "' for FileStreamSinkTask", e);
            }
        }
    }
}
execute()
For clarity, only the key code is listed here.
Source
java
public abstract class AbstractWorkerSourceTask extends WorkerTask<SourceRecord, SourceRecord> {
    public void execute() {
        while (!isStopping()) {
            // Calls task.poll(), i.e. reads data from the file
            toSend = poll();
            // This is where producer.send() is actually called to publish the data
            sendRecords();
        }
    }
}
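Here poll() ends up in FileStreamSourceTask.poll(): it reads up to batch.size lines from the file and wraps each one in a SourceRecord carrying the target topic and the current file offset. An abridged sketch (extractLine(), offsetKey() and offsetValue() are private helpers of the class; buffering and error handling are omitted):
java
// Abridged from FileStreamSourceTask.poll(); helper details omitted.
@Override
public List<SourceRecord> poll() throws InterruptedException {
    List<SourceRecord> records = new ArrayList<>();
    String line;
    while ((line = extractLine()) != null) {      // next complete line, or null
        records.add(new SourceRecord(
                offsetKey(filename),              // source partition: which file
                offsetValue(streamOffset),        // source offset: position in the file
                topic,                            // topic=connect-test from the config
                VALUE_SCHEMA,                     // Schema.STRING_SCHEMA
                line));
        if (records.size() >= batchSize)          // batch.size caps one poll() batch
            break;
    }
    return records.isEmpty() ? null : records;
}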
Sink
java
class WorkerSinkTask extends WorkerTask<ConsumerRecord<byte[], byte[]>, SinkRecord> {
    public void execute() {
        while (!isStopping())
            iteration();
    }

    protected void iteration() {
        poll(timeoutMs);
    }

    protected void poll(long timeoutMs) {
        // Pull data back with the consumer
        ConsumerRecords<byte[], byte[]> msgs = pollConsumer(timeoutMs);
        // Convert the messages and hand them to the user-defined SinkTask
        convertMessages(msgs);
        deliverMessages();
    }

    private ConsumerRecords<byte[], byte[]> pollConsumer(long timeoutMs) {
        ConsumerRecords<byte[], byte[]> msgs = consumer.poll(Duration.ofMillis(timeoutMs));
        return msgs;
    }
}
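Finally, deliverMessages() hands the converted batch to task.put(...). In FileStreamSinkTask that simply writes each record's value to the output stream opened in start(); abridged from the Kafka source:
java
public class FileStreamSinkTask extends SinkTask {
    @Override
    public void put(Collection<SinkRecord> sinkRecords) {
        for (SinkRecord record : sinkRecords) {
            // Write each consumed value to sink.txt (or stdout if no file was configured)
            outputStream.println(record.value());
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
        outputStream.flush();  // ensure buffered lines reach the file before offsets commit
    }
}
This closes the loop: the source task reads lines from source.txt into the connect-test topic, and the sink task writes them back out to sink.txt.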