Flink 1.14.x KafkaSource source code walkthrough

As is well known, a Flink job is mainly made up of three parts: sources, operators (map, flatMap, and so on), and sinks.

This article walks through the internal source-code logic of the Kafka connector (KafkaSource / FlinkKafkaConsumer) that ships with Flink.

1. Adding the KafkaSource

To add a source to a Flink job, application code usually looks like the following:

java
// initialize the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// initialize the deserializer: a user-defined object implementing the (Kafka)DeserializationSchema interface
KafkaDeserializationSchema<Row> deserializationSchema = ...;
// create the Kafka source; the constructor is FlinkKafkaConsumer<T>(List<String> topics, KafkaDeserializationSchema<T> deserializer, Properties props)
SourceFunction<Row> kafkaSource = new FlinkKafkaConsumer<>(topics, deserializationSchema, props);
// add the Kafka source to the environment
env.addSource(kafkaSource, sourceName)
.........
// start the job
env.execute();
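
The deserializer above is user-provided. As a point of reference, here is a minimal sketch of a custom KafkaDeserializationSchema; the class name and the assumption that the record value is a UTF-8 string are purely illustrative, not taken from the original code.

java
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Illustrative schema: turns each Kafka record value into a String.
public class StringValueDeserializationSchema implements KafkaDeserializationSchema<String> {

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false; // unbounded stream: never signal end-of-stream
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) {
        byte[] value = record.value();
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }
}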

2. SourceFunction (the top-level parent of all sources)

(1) Class hierarchy of FlinkKafkaConsumerBase

java
@PublicEvolving
public class FlinkKafkaConsumer<T> extends FlinkKafkaConsumerBase<T> {
    ..........
}
@Internal
public abstract class FlinkKafkaConsumerBase<T> extends RichParallelSourceFunction<T> implements CheckpointListener, ResultTypeQueryable<T>, CheckpointedFunction {
    // irrelevant code omitted
    public void run(SourceContext<T> sourceContext) throws Exception {
        if (this.subscribedPartitionsToStartOffsets == null) {
            throw new Exception("The partitions were not set for the consumer");
        } else {
            this.successfulCommits = this.getRuntimeContext().getMetricGroup().counter("commitsSucceeded");
            this.failedCommits = this.getRuntimeContext().getMetricGroup().counter("commitsFailed");
            final int subtaskIndex = this.getRuntimeContext().getIndexOfThisSubtask();
            this.offsetCommitCallback = new KafkaCommitCallback() {
                public void onSuccess() {
                    FlinkKafkaConsumerBase.this.successfulCommits.inc();
                }

                public void onException(Throwable cause) {
                    FlinkKafkaConsumerBase.LOG.warn(String.format("Consumer subtask %d failed async Kafka commit.", subtaskIndex), cause);
                    FlinkKafkaConsumerBase.this.failedCommits.inc();
                }
            };
            if (this.subscribedPartitionsToStartOffsets.isEmpty()) {
                sourceContext.markAsTemporarilyIdle();
            }

            LOG.info("Consumer subtask {} creating fetcher with offsets {}.", this.getRuntimeContext().getIndexOfThisSubtask(), this.subscribedPartitionsToStartOffsets);
            this.kafkaFetcher = this.createFetcher(sourceContext, this.subscribedPartitionsToStartOffsets, this.watermarkStrategy, (StreamingRuntimeContext)this.getRuntimeContext(), this.offsetCommitMode, this.getRuntimeContext().getMetricGroup().addGroup("KafkaConsumer"), this.useMetrics);
            if (this.running) {
                if (this.discoveryIntervalMillis == -9223372036854775808L) { // Long.MIN_VALUE: partition discovery disabled
                    this.kafkaFetcher.runFetchLoop();
                } else {
                    this.runWithPartitionDiscovery();
                }

            }
        }
    }
    private void runWithPartitionDiscovery() throws Exception {
        AtomicReference<Exception> discoveryLoopErrorRef = new AtomicReference();
        this.createAndStartDiscoveryLoop(discoveryLoopErrorRef);
        this.kafkaFetcher.runFetchLoop();
        this.partitionDiscoverer.wakeup();
        this.joinDiscoveryLoopThread();
        Exception discoveryLoopError = (Exception)discoveryLoopErrorRef.get();
        if (discoveryLoopError != null) {
            throw new RuntimeException(discoveryLoopError);
        }
    }
    // irrelevant code omitted
}
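
One detail in run() above: the magic constant -9223372036854775808L is Long.MIN_VALUE (PARTITION_DISCOVERY_DISABLED), so by default the plain fetch loop runs without partition discovery. Discovery is enabled via the consumer property flink.partition-discovery.interval-millis; the snippet below only illustrates how that property is set (broker address, group id, topic, and the 30-second interval are placeholder values).

java
// SimpleStringSchema comes from org.apache.flink.api.common.serialization
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");   // placeholder broker address
props.setProperty("group.id", "demo-group");                 // placeholder consumer group
// check for newly added topic partitions every 30 seconds (interval value is illustrative)
props.setProperty("flink.partition-discovery.interval-millis", "30000");

FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props);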

FlinkKafkaConsumer extends the parent class FlinkKafkaConsumerBase.

The method that starts the Kafka consuming task is public void run(SourceContext<T> sourceContext):

java
@Public
public abstract class RichParallelSourceFunction<OUT> extends AbstractRichFunction implements ParallelSourceFunction<OUT> {
    private static final long serialVersionUID = 1L;

    public RichParallelSourceFunction() {
    }
}

@Public
public interface ParallelSourceFunction<OUT> extends SourceFunction<OUT> {
}


public interface SourceFunction<T> extends Function, Serializable {
    void run(SourceFunction.SourceContext<T> var1) throws Exception;

    void cancel();

    @Public
    public interface SourceContext<T> {
        void collect(T var1);

        @PublicEvolving
        void collectWithTimestamp(T var1, long var2);

        @PublicEvolving
        void emitWatermark(Watermark var1);

        @PublicEvolving
        void markAsTemporarilyIdle();

        Object getCheckpointLock();

        void close();
    }
}

The run method of FlinkKafkaConsumerBase implements the run method declared in the SourceFunction interface, so when the framework invokes SourceFunction.run, it is FlinkKafkaConsumerBase.run that actually executes.
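
To make the contract concrete, here is a minimal, hypothetical SourceFunction (not taken from Flink's code base): the framework calls run() once, the source keeps emitting until cancel() flips the flag, and emission happens under the checkpoint lock, just as the Kafka fetcher does.

java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Toy source illustrating the SourceFunction contract that FlinkKafkaConsumerBase fulfils.
public class CountingSource implements SourceFunction<Long> {

    private volatile boolean running = true;
    private long counter = 0L;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (running) {
            // emit under the checkpoint lock, like emitRecordsWithTimestamps does
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(counter++);
            }
            Thread.sleep(100L);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}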

(2) Internal execution logic of FlinkKafkaConsumerBase: the KafkaFetcher

java
@Internal
public class KafkaFetcher<T> extends AbstractFetcher<T, TopicPartition> {
    // the thread that consumes data from the Kafka brokers
    final KafkaConsumerThread consumerThread;
    // constructor
    public KafkaFetcher(SourceContext<T> sourceContext, Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets, SerializedValue<WatermarkStrategy<T>> watermarkStrategy, ProcessingTimeService processingTimeProvider, long autoWatermarkInterval, ClassLoader userCodeClassLoader, String taskNameWithSubtasks, KafkaDeserializationSchema<T> deserializer, Properties kafkaProperties, long pollTimeout, MetricGroup subtaskMetricGroup, MetricGroup consumerMetricGroup, boolean useMetrics) throws Exception {
        super(sourceContext, assignedPartitionsWithInitialOffsets, watermarkStrategy, processingTimeProvider, autoWatermarkInterval, userCodeClassLoader, consumerMetricGroup, useMetrics);
        this.deserializer = deserializer;
        this.handover = new Handover();
        this.consumerThread = new KafkaConsumerThread(LOG, this.handover, kafkaProperties, this.unassignedPartitionsQueue, this.getFetcherName() + " for " + taskNameWithSubtasks, pollTimeout, useMetrics, consumerMetricGroup, subtaskMetricGroup);
        this.kafkaCollector = new KafkaFetcher.KafkaCollector();
    }

    public void runFetchLoop() throws Exception {
        try {
            // start a dedicated consumer thread that pulls data from the Kafka brokers into the handover
            this.consumerThread.start();
            // take records from the handover, deserialize them, and emit them downstream
            while(this.running) {
                ConsumerRecords<byte[], byte[]> records = this.handover.pollNext();
                Iterator var2 = this.subscribedPartitionStates().iterator();

                while(var2.hasNext()) {
                    KafkaTopicPartitionState<T, TopicPartition> partition = (KafkaTopicPartitionState)var2.next();
                    List<ConsumerRecord<byte[], byte[]>> partitionRecords = records.records((TopicPartition)partition.getKafkaPartitionHandle());
                    this.partitionConsumerRecordsHandler(partitionRecords, partition);
                }
            }
        } finally {
            this.consumerThread.shutdown();
        }

        try {
            this.consumerThread.join();
        } catch (InterruptedException var8) {
            Thread.currentThread().interrupt();
        }

    }
    protected void partitionConsumerRecordsHandler(List<ConsumerRecord<byte[], byte[]>> partitionRecords, KafkaTopicPartitionState<T, TopicPartition> partition) throws Exception {
        Iterator var3 = partitionRecords.iterator();

        while(var3.hasNext()) {
            ConsumerRecord<byte[], byte[]> record = (ConsumerRecord)var3.next();
            // deserialization; 'deserializer' is user-defined, i.e. the deserializationSchema from the beginning of this article
            this.deserializer.deserialize(record, this.kafkaCollector);
            this.emitRecordsWithTimestamps(this.kafkaCollector.getRecords(), partition, record.offset(), record.timestamp());
            if (this.kafkaCollector.isEndOfStreamSignalled()) {
                this.running = false;
                break;
            }
        }

    }
}

@Internal
public abstract class AbstractFetcher<T, KPH> {
    protected void emitRecordsWithTimestamps(Queue<T> records, KafkaTopicPartitionState<T, KPH> partitionState, long offset, long kafkaEventTimestamp) {
        synchronized(this.checkpointLock) {
            Object record;
            while((record = records.poll()) != null) {
                // extract the timestamp
                long timestamp = partitionState.extractTimestamp(record, kafkaEventTimestamp);
                // emit to the downstream operator; this is where the source's work ends
                this.sourceContext.collectWithTimestamp(record, timestamp);
                // record the event in the partition state
                partitionState.onEvent(record, timestamp);
            }
            // store the offset in the partition state
            partitionState.setOffset(offset);
        }
    }
}

Note that the runFetchLoop method takes records from the temporary buffer, the Handover, rather than directly from the Kafka brokers; fetching from Kafka is done by the consumerThread.
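
The real Handover lives in org.apache.flink.streaming.connectors.kafka.internals. The simplified sketch below is not Flink's actual implementation; it only illustrates the idea of a single-slot exchange point between the consumer thread (produce) and the fetch loop (pollNext).

java
// Illustrative stand-in for Flink's Handover: at most one record batch in flight.
public class SimpleHandover<T> {

    private T element;

    // called by the Kafka consumer thread
    public synchronized void produce(T next) throws InterruptedException {
        while (element != null) {
            wait(); // wait until the fetch loop has taken the previous batch
        }
        element = next;
        notifyAll();
    }

    // called by the fetch loop (runFetchLoop)
    public synchronized T pollNext() throws InterruptedException {
        while (element == null) {
            wait(); // wait until the consumer thread has produced a batch
        }
        T result = element;
        element = null;
        notifyAll();
        return result;
    }
}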

(3) A dedicated consumer thread writes data from the Kafka brokers into the temporary buffer

java
this.consumerThread.start()

This actually invokes the run method below, which starts the thread that pulls data from Kafka:

java
@Internal
public class KafkaConsumerThread<T> extends Thread {
    public void run() {
        // irrelevant code omitted
        if (this.running) {
            ConsumerRecords records = null;
            // irrelevant code omitted
            while(true) {
                while(true) {
                    // irrelevant code omitted
                    try {
                        records = this.consumer.poll(this.pollTimeout);
                        break;
                    } catch (WakeupException var21) {
                    } 
                }
                // put the batch of records into the handover
                try {
                    handover.produce(records);
                    records = null;
                } catch (org.apache.flink.streaming.connectors.kafka.internals.Handover.WakeupException var18) {
                }
            }
        }
    }
}

Here the records pulled from the Kafka brokers are put into the handover.
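
For comparison, the plain kafka-clients pattern that KafkaConsumerThread wraps looks roughly like the sketch below (standalone kafka-clients code, not Flink source; broker address, group id and topic are placeholders).

java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

// Standalone kafka-clients poll loop, for comparison with KafkaConsumerThread.run() above.
public class PlainPollLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder
        props.setProperty("group.id", "demo-group");                // placeholder
        props.setProperty("key.deserializer", ByteArrayDeserializer.class.getName());
        props.setProperty("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
                // in Flink, this is the point where handover.produce(records) hands the batch to the fetch loop
            }
        }
    }
}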

(4) Taking snapshots (snapshotState) and the callback after a successful checkpoint (notifyCheckpointComplete)

FlinkKafkaConsumerBase has two relevant methods:

  1. snapshotState (from the CheckpointedFunction interface), invoked when a checkpoint (snapshot) is taken
  2. notifyCheckpointComplete (from the CheckpointListener interface), the callback invoked after the checkpoint has completed successfully

java
public abstract class FlinkKafkaConsumerBase<T> extends RichParallelSourceFunction<T> implements CheckpointListener, ResultTypeQueryable<T>, CheckpointedFunction {
    public final void snapshotState(FunctionSnapshotContext context) throws Exception {
        if (!this.running) {
            LOG.debug("snapshotState() called on closed source");
        } else {
            this.unionOffsetStates.clear();
            AbstractFetcher<?, ?> fetcher = this.kafkaFetcher;
            if (fetcher != null) {
                // take the offsets of the partitions currently being consumed (already recorded in partitionState) and put them into pendingOffsetsToCommit
                HashMap<KafkaTopicPartition, Long> currentOffsets = fetcher.snapshotCurrentState();
                if (this.offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS) {
                    this.pendingOffsetsToCommit.put(context.getCheckpointId(), currentOffsets);
                }
                // irrelevant code omitted
            }
            // irrelevant code omitted
        }
    
    }
    
    public final void notifyCheckpointComplete(long checkpointId) throws Exception {
        if (!this.running) {
            LOG.debug("notifyCheckpointComplete() called on closed source");
        } else {
            AbstractFetcher<?, ?> fetcher = this.kafkaFetcher;
            if (fetcher == null) {
                LOG.debug("notifyCheckpointComplete() called on uninitialized source");
            } else {
                if (this.offsetCommitMode == OffsetCommitMode.ON_CHECKPOINTS) {
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Consumer subtask {} committing offsets to Kafka/ZooKeeper for checkpoint {}.", this.getRuntimeContext().getIndexOfThisSubtask(), checkpointId);
                    }
    
                    try {
                        int posInMap = this.pendingOffsetsToCommit.indexOf(checkpointId);
                        // irrelevant code omitted
                        // commit the offsets to the Kafka brokers
                        fetcher.commitInternalOffsetsToKafka(offsets, this.offsetCommitCallback);
                    } catch (Exception var7) {
                        if (this.running) {
                            throw var7;
                        }
                    }
                }
    
            }
        }
    }

}

In the snapshotState method, the offset of each partition is read; these per-partition offsets are the ones set by the emitRecordsWithTimestamps method shown earlier.

java
public HashMap<KafkaTopicPartition, Long> snapshotCurrentState() {
    assert Thread.holdsLock(this.checkpointLock);

    HashMap<KafkaTopicPartition, Long> state = new HashMap(this.subscribedPartitionStates.size());
    Iterator var2 = this.subscribedPartitionStates.iterator();

    while(var2.hasNext()) {
        KafkaTopicPartitionState<T, KPH> partition = (KafkaTopicPartitionState)var2.next();
        // partition.getOffset() returns the value stored earlier by partitionState.setOffset
        state.put(partition.getKafkaTopicPartition(), partition.getOffset());
    }

    return state;
}

So for data consumed from Kafka, the offsets are committed back to Kafka only after the checkpoint has been saved and the notifyCheckpointComplete callback has fired.
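
This behaviour depends on checkpointing being enabled, which is what puts the consumer into OffsetCommitMode.ON_CHECKPOINTS. Below is a minimal job sketch showing that configuration; the checkpoint interval, broker address, group id and topic are placeholders, and SimpleStringSchema is used only as a simple deserializer.

java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L); // checkpoint every 60s (placeholder interval)

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
props.setProperty("group.id", "demo-group");               // placeholder

FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props);
// default is true when checkpointing is enabled: offsets go back to Kafka in notifyCheckpointComplete()
consumer.setCommitOffsetsOnCheckpoints(true);

env.addSource(consumer).print();
env.execute("kafka-offset-commit-demo");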

Once you understand the KafkaSource, the other sources are easy to follow: the basic logic is the same.
