Source Code Analysis: How FlinkKafkaConsumer Emits Punctuated Watermarks

Background

FlinkKafkaConsumer can emit a watermark upon receiving a particular record from a Kafka partition, for example a special record that marks the end of a complete batch. This article walks through the source code behind punctuated watermark emission.
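
To make "punctuated" concrete, this is the kind of user code that drives the whole path below: a minimal sketch of an AssignerWithPunctuatedWatermarks (the MyEvent type and its fields are hypothetical placeholders, not Flink API):

```java
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class EndOfBatchWatermarkAssigner
        implements AssignerWithPunctuatedWatermarks<EndOfBatchWatermarkAssigner.MyEvent> {

    // Hypothetical event type; endOfBatch marks the special record described above.
    public static class MyEvent {
        public long eventTime;
        public boolean endOfBatch;
    }

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        return element.eventTime;
    }

    @Override
    public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) {
        // Emit a watermark only when the special end-of-batch record arrives;
        // returning null means "no watermark after this record".
        return lastElement.endOfBatch ? new Watermark(extractedTimestamp) : null;
    }
}
```

Such an assigner is installed through the (now deprecated) assignTimestampsAndWatermarks overload on the Kafka consumer, which wraps it into the adapter we will meet in step 3.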

Source walkthrough of punctuated watermark emission

1. First, the runFetchLoop method in KafkaFetcher:

```java
public void runFetchLoop() throws Exception {
        try {
            // kick off the actual Kafka consumer
            consumerThread.start();

            while (running) {
                // this blocks until we get the next records
                // it automatically re-throws exceptions encountered in the consumer thread
                final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

                // get the records for each topic partition
                for (KafkaTopicPartitionState<T, TopicPartition> partition :
                        subscribedPartitionStates()) {

                    List<ConsumerRecord<byte[], byte[]>> partitionRecords =
                            records.records(partition.getKafkaPartitionHandle());
                    // invoked once per partition that this operator subtask consumes
                    partitionConsumerRecordsHandler(partitionRecords, partition);
                }
            }
        } finally {
            // this signals the consumer thread that no more work is to be done
            consumerThread.shutdown();
        }
    }
```

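The handover.pollNext() call blocks until the consumer thread started at the top of the loop hands over the next batch. A minimal sketch of that hand-off pattern (hypothetical simplified code, not Flink's actual Handover class, which additionally re-throws exceptions from the consumer thread):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified stand-in for Flink's Handover: the Kafka consumer thread produces
// batches, and the fetcher's runFetchLoop consumes them one at a time.
public class HandoverSketch {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);

    public void produce(String batch) throws InterruptedException {
        queue.put(batch);    // consumer thread: blocks while the fetcher lags behind
    }

    public String pollNext() throws InterruptedException {
        return queue.take(); // fetcher thread: blocks until the next batch arrives
    }
}
```
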
2. Next, partitionConsumerRecordsHandler: for each partition it deserializes the records and calls emitRecordsWithTimestamps (in AbstractFetcher), which handles that partition's watermark:

```java
    protected void emitRecordsWithTimestamps(
            Queue<T> records,
            KafkaTopicPartitionState<T, KPH> partitionState,
            long offset,
            long kafkaEventTimestamp) {
        // emit the records, using the checkpoint lock to guarantee
        // atomicity of record emission and offset state update
        synchronized (checkpointLock) {
            T record;
            while ((record = records.poll()) != null) {
                long timestamp = partitionState.extractTimestamp(record, kafkaEventTimestamp);
                // emit the Kafka record downstream with its timestamp
                sourceContext.collectWithTimestamp(record, timestamp);

                // this might emit a watermark, so do it after emitting the record
                // maintain this partition's watermark: record it and, when conditions
                // are met, update the watermark of the whole operator subtask
                partitionState.onEvent(record, timestamp);
            }
            partitionState.setOffset(offset);
        }
    }
```
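
The synchronized (checkpointLock) block is what makes record emission and offset update atomic with respect to checkpoints, because the snapshotting code takes the same lock. A minimal illustrative sketch (hypothetical names, not the actual Flink classes):

```java
// Hypothetical sketch of the checkpoint-lock contract: because emission and
// offset update happen under one lock, a snapshot never records an offset for
// data that has not yet been emitted downstream.
public class CheckpointLockSketch {
    private final Object checkpointLock = new Object();
    private long offset = -1L;

    void emitRecord(String record, long recordOffset) {
        synchronized (checkpointLock) {
            System.out.println("emit " + record); // 1. emit the record
            this.offset = recordOffset;           // 2. only then advance the offset
        }
    }

    long snapshotOffset() {
        synchronized (checkpointLock) {
            return offset; // always consistent with what was emitted
        }
    }
}
```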

3. Handling each partition's watermark:

```java
    // the partition state's onEvent: forwards the record to this partition's
    // watermark generator
    public void onEvent(T event, long timestamp) {
        watermarkGenerator.onEvent(event, timestamp, immediateOutput);
    }

    // the generator is an adapter around the user's AssignerWithPunctuatedWatermarks;
    // 'wms' is that user-supplied assigner
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        final org.apache.flink.streaming.api.watermark.Watermark next =
                wms.checkAndGetNextWatermark(event, eventTimestamp);

        if (next != null) {
            output.emitWatermark(new Watermark(next.getTimestamp()));
        }
    }
```

The output.emitWatermark(new Watermark(next.getTimestamp())) call above corresponds to the following method:

```java
    public void emitWatermark(Watermark watermark) {
        long timestamp = watermark.getTimestamp();
        // update the watermark tracked for this partition
        boolean wasUpdated = state.setWatermark(timestamp);

        // if it's higher than the max watermark so far we might have to update the
        // combined watermark, i.e. the minimum across partitions: the watermark of
        // the whole operator subtask rather than of a single partition
        if (wasUpdated && timestamp > combinedWatermark) {
            updateCombinedWatermark();
        }
    }
```

The per-partition watermark update looks like this:

```java
    public boolean setWatermark(long watermark) {
        this.idle = false;
        final boolean updated = watermark > this.watermark;
        this.watermark = Math.max(watermark, this.watermark);
        return updated;
    }
```
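
A quick standalone illustration (hypothetical code mirroring setWatermark above) of the contract: a partition's watermark never moves backwards, and the returned flag is true only when it actually advances.

```java
// Hypothetical mini demo of the monotonic-update contract of setWatermark
// above (idle handling omitted).
public class SetWatermarkSketch {
    private long watermark = Long.MIN_VALUE;

    boolean setWatermark(long wm) {
        final boolean updated = wm > this.watermark;
        this.watermark = Math.max(wm, this.watermark);
        return updated;
    }

    public static void main(String[] args) {
        SetWatermarkSketch s = new SetWatermarkSketch();
        System.out.println(s.setWatermark(10)); // true: advanced to 10
        System.out.println(s.setWatermark(7));  // false: a late watermark is ignored
        System.out.println(s.setWatermark(12)); // true: advanced to 12
    }
}
```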

4. Finally, the method that emits the subtask-level (combined) watermark:

```java
private void updateCombinedWatermark() {
        long minimumOverAllOutputs = Long.MAX_VALUE;

        boolean hasOutputs = false;
        boolean allIdle = true;
        for (OutputState outputState : watermarkOutputs) {
            if (!outputState.isIdle()) {
                minimumOverAllOutputs = Math.min(minimumOverAllOutputs, outputState.getWatermark());
                allIdle = false;
            }
            hasOutputs = true;
        }

        // if we don't have any outputs minimumOverAllOutputs is not valid, it's still
        // at its initial Long.MAX_VALUE state and we must not emit that
        if (!hasOutputs) {
            return;
        }

        if (allIdle) {
            underlyingOutput.markIdle();
        } else if (minimumOverAllOutputs > combinedWatermark) {
            combinedWatermark = minimumOverAllOutputs;
            underlyingOutput.emitWatermark(new Watermark(minimumOverAllOutputs));
        }
    }
```

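To make the min-combination concrete, here is a small standalone sketch (hypothetical code mirroring the loop above): with two active partitions at watermarks 10 and 7 plus one idle partition, the subtask-level watermark becomes 7, and the idle partition does not hold it back.

```java
import java.util.Arrays;
import java.util.List;

public class CombinedWatermarkSketch {
    // Hypothetical per-partition state, mirroring OutputState above.
    static class PartitionState {
        final long watermark;
        final boolean idle;
        PartitionState(long watermark, boolean idle) {
            this.watermark = watermark;
            this.idle = idle;
        }
    }

    public static void main(String[] args) {
        // Partition 0 at watermark 10, partition 1 at 7, partition 2 idle.
        List<PartitionState> partitions = Arrays.asList(
                new PartitionState(10L, false),
                new PartitionState(7L, false),
                new PartitionState(42L, true));

        long min = Long.MAX_VALUE;
        boolean allIdle = true;
        for (PartitionState p : partitions) {
            if (!p.idle) { // idle partitions are skipped and cannot hold back the minimum
                min = Math.min(min, p.watermark);
                allIdle = false;
            }
        }
        System.out.println(allIdle ? "all idle" : "combined watermark = " + min); // prints 7
    }
}
```
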
Looking at this flow, does using the punctuated approach mean that source idleness is not supported? The answer is yes: in this path watermarks are produced only inside onEvent, i.e. only when a record actually arrives, and setWatermark unconditionally resets the partition's idle flag to false. Nothing here ever marks a partition idle, so punctuated watermarks come without idle-time handling.
