源码解析FlinkKafkaConsumer支持punctuated水位线发送

背景

FlinkKafkaConsumer支持当收到某个kafka分区中的某条记录时发送水位线,比如这条特殊的记录代表一个完整记录的结束等,本文就来解析下发送punctuated水位线的源码

punctuated 水位线发送源码解析

1.首先KafkaFetcher中的runFetchLoop方法

java 复制代码
public void runFetchLoop() throws Exception {
        try {
            // kick off the actual Kafka consumer
            consumerThread.start();

            while (running) {
                // this blocks until we get the next records
                // it automatically re-throws exceptions encountered in the consumer thread
                final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

                // get the records for each topic partition
                for (KafkaTopicPartitionState<T, TopicPartition> partition :
                        subscribedPartitionStates()) {

                    List<ConsumerRecord<byte[], byte[]>> partitionRecords =
                            records.records(partition.getKafkaPartitionHandle());
// 算子任务消费的每个分区都调用这个方法
                    partitionConsumerRecordsHandler(partitionRecords, partition);
                }
            }
        } finally {
            // this signals the consumer thread that no more work is to be done
            consumerThread.shutdown();
        }

2.查看partitionConsumerRecordsHandler方法处理当前算子任务对应的每个分区的水位线

java 复制代码
    protected void emitRecordsWithTimestamps(
            Queue<T> records,
            KafkaTopicPartitionState<T, KPH> partitionState,
            long offset,
            long kafkaEventTimestamp) {
        // emit the records, using the checkpoint lock to guarantee
        // atomicity of record emission and offset state update
        synchronized (checkpointLock) {
            T record;
            while ((record = records.poll()) != null) {
                long timestamp = partitionState.extractTimestamp(record, kafkaEventTimestamp);
                // 发送kafka记录到下游算子
                sourceContext.collectWithTimestamp(record, timestamp);

                // this might emit a watermark, so do it after emitting the record
                // 处理分区的水位线,记录这个分区的水位线,并在满足条件时更新整个算子任务的水位线
                partitionState.onEvent(record, timestamp);
            }
            partitionState.setOffset(offset);
        }
    }```

3.处理每个分区的水位线

```java
    public void onEvent(T event, long timestamp) {
        watermarkGenerator.onEvent(event, timestamp, immediateOutput);
    }
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        final org.apache.flink.streaming.api.watermark.Watermark next =
                wms.checkAndGetNextWatermark(event, eventTimestamp);

        if (next != null) {
            output.emitWatermark(new Watermark(next.getTimestamp()));
        }
    }
    其中 output.emitWatermark(new Watermark(next.getTimestamp()));对应方法如下
            public void emitWatermark(Watermark watermark) {
            long timestamp = watermark.getTimestamp();
            // 更新每个分区对应的水位线,并且更新
            boolean wasUpdated = state.setWatermark(timestamp);

            // if it's higher than the max watermark so far we might have to update the
            // combined watermark 这个表明这个算子任务的最低水位线,也就是算子任务级别的水位线,而不是分区级别的了
            if (wasUpdated && timestamp > combinedWatermark) {
                updateCombinedWatermark();
            }
        }
 //每个分区水位线的更新如下
         public boolean setWatermark(long watermark) {
            this.idle = false;
            final boolean updated = watermark > this.watermark;
            this.watermark = Math.max(watermark, this.watermark);
            return updated;
        }       
        

4.最后是发送算子任务级别的水位线的方法

java 复制代码
private void updateCombinedWatermark() {
        long minimumOverAllOutputs = Long.MAX_VALUE;

        boolean hasOutputs = false;
        boolean allIdle = true;
        for (OutputState outputState : watermarkOutputs) {
            if (!outputState.isIdle()) {
                minimumOverAllOutputs = Math.min(minimumOverAllOutputs, outputState.getWatermark());
                allIdle = false;
            }
            hasOutputs = true;
        }

        // if we don't have any outputs minimumOverAllOutputs is not valid, it's still
        // at its initial Long.MAX_VALUE state and we must not emit that
        if (!hasOutputs) {
            return;
        }

        if (allIdle) {
            underlyingOutput.markIdle();
        } else if (minimumOverAllOutputs > combinedWatermark) {
            combinedWatermark = minimumOverAllOutputs;
            underlyingOutput.emitWatermark(new Watermark(minimumOverAllOutputs));
        }
    }

你可以看这个流程,是不是意味着如果使用Punctuated的方式,是不支持Idle空闲时间的?--答案是的

相关推荐
数据库安全7 分钟前
美创科技针对《银行保险机构数据安全管理办法》解读
大数据·人工智能·产品运营
ice___Cpu43 分钟前
Git - 1( 14000 字详解 )
大数据·git·elasticsearch
tebukaopu1481 小时前
官方 Elasticsearch SQL NLPChina Elasticsearch SQL
大数据·sql·elasticsearch
jiedaodezhuti9 小时前
为什么elasticsearch配置文件JVM配置31G最佳
大数据·jvm·elasticsearch
思通数据9 小时前
AI全域智能监控系统重构商业清洁管理范式——从被动响应到主动预防的监控效能革命
大数据·人工智能·目标检测·机器学习·计算机视觉·数据挖掘·ocr
lilye6610 小时前
精益数据分析(55/126):双边市场模式的挑战、策略与创业阶段关联
大数据·人工智能·数据分析
码上地球10 小时前
因子分析基础指南:原理、步骤与地球化学数据分析应用解析
大数据·数据挖掘·数据分析
胡小禾10 小时前
ES常识7:ES8.X集群允许4个 master 节点吗
大数据·elasticsearch·搜索引擎
火龙谷11 小时前
【hadoop】Kafka 安装部署
大数据·hadoop·kafka
强哥叨逼叨11 小时前
没经过我同意,flink window就把数据存到state里的了?
大数据·flink