源码解析FlinkKafkaConsumer支持punctuated水位线发送

背景

FlinkKafkaConsumer支持当收到某个kafka分区中的某条记录时发送水位线,比如这条特殊的记录代表一个完整记录的结束等,本文就来解析下发送punctuated水位线的源码

punctuated 水位线发送源码解析

1.首先KafkaFetcher中的runFetchLoop方法

java 复制代码
public void runFetchLoop() throws Exception {
        try {
            // kick off the actual Kafka consumer
            consumerThread.start();

            while (running) {
                // this blocks until we get the next records
                // it automatically re-throws exceptions encountered in the consumer thread
                final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

                // get the records for each topic partition
                for (KafkaTopicPartitionState<T, TopicPartition> partition :
                        subscribedPartitionStates()) {

                    List<ConsumerRecord<byte[], byte[]>> partitionRecords =
                            records.records(partition.getKafkaPartitionHandle());
// 算子任务消费的每个分区都调用这个方法
                    partitionConsumerRecordsHandler(partitionRecords, partition);
                }
            }
        } finally {
            // this signals the consumer thread that no more work is to be done
            consumerThread.shutdown();
        }

2.查看partitionConsumerRecordsHandler方法处理当前算子任务对应的每个分区的水位线

java 复制代码
    protected void emitRecordsWithTimestamps(
            Queue<T> records,
            KafkaTopicPartitionState<T, KPH> partitionState,
            long offset,
            long kafkaEventTimestamp) {
        // emit the records, using the checkpoint lock to guarantee
        // atomicity of record emission and offset state update
        synchronized (checkpointLock) {
            T record;
            while ((record = records.poll()) != null) {
                long timestamp = partitionState.extractTimestamp(record, kafkaEventTimestamp);
                // 发送kafka记录到下游算子
                sourceContext.collectWithTimestamp(record, timestamp);

                // this might emit a watermark, so do it after emitting the record
                // 处理分区的水位线,记录这个分区的水位线,并在满足条件时更新整个算子任务的水位线
                partitionState.onEvent(record, timestamp);
            }
            partitionState.setOffset(offset);
        }
    }```

3.处理每个分区的水位线

```java
    public void onEvent(T event, long timestamp) {
        watermarkGenerator.onEvent(event, timestamp, immediateOutput);
    }
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        final org.apache.flink.streaming.api.watermark.Watermark next =
                wms.checkAndGetNextWatermark(event, eventTimestamp);

        if (next != null) {
            output.emitWatermark(new Watermark(next.getTimestamp()));
        }
    }
    其中 output.emitWatermark(new Watermark(next.getTimestamp()));对应方法如下
            public void emitWatermark(Watermark watermark) {
            long timestamp = watermark.getTimestamp();
            // 更新每个分区对应的水位线,并且更新
            boolean wasUpdated = state.setWatermark(timestamp);

            // if it's higher than the max watermark so far we might have to update the
            // combined watermark 这个表明这个算子任务的最低水位线,也就是算子任务级别的水位线,而不是分区级别的了
            if (wasUpdated && timestamp > combinedWatermark) {
                updateCombinedWatermark();
            }
        }
 //每个分区水位线的更新如下
         public boolean setWatermark(long watermark) {
            this.idle = false;
            final boolean updated = watermark > this.watermark;
            this.watermark = Math.max(watermark, this.watermark);
            return updated;
        }       
        

4.最后是发送算子任务级别的水位线的方法

java 复制代码
private void updateCombinedWatermark() {
        long minimumOverAllOutputs = Long.MAX_VALUE;

        boolean hasOutputs = false;
        boolean allIdle = true;
        for (OutputState outputState : watermarkOutputs) {
            if (!outputState.isIdle()) {
                minimumOverAllOutputs = Math.min(minimumOverAllOutputs, outputState.getWatermark());
                allIdle = false;
            }
            hasOutputs = true;
        }

        // if we don't have any outputs minimumOverAllOutputs is not valid, it's still
        // at its initial Long.MAX_VALUE state and we must not emit that
        if (!hasOutputs) {
            return;
        }

        if (allIdle) {
            underlyingOutput.markIdle();
        } else if (minimumOverAllOutputs > combinedWatermark) {
            combinedWatermark = minimumOverAllOutputs;
            underlyingOutput.emitWatermark(new Watermark(minimumOverAllOutputs));
        }
    }

你可以看这个流程,是不是意味着如果使用Punctuated的方式,是不支持Idle空闲时间的?--答案是的

相关推荐
爱吃面的猫3 小时前
大数据Hadoop之——Flink1.17.0安装与使用(非常详细)
大数据·hadoop·分布式
Fireworkitte4 小时前
安装 Elasticsearch IK 分词器
大数据·elasticsearch
ywyy67985 小时前
短剧系统开发定制全流程解析:从需求分析到上线的专业指南
大数据·需求分析·短剧·推客系统·推客小程序·短剧系统开发·海外短剧系统开发
暗影八度7 小时前
Spark流水线数据质量检查组件
大数据·分布式·spark
白鲸开源7 小时前
Linux 基金会报告解读:开源 AI 重塑经济格局,有人失业,有人涨薪!
大数据
海豚调度7 小时前
Linux 基金会报告解读:开源 AI 重塑经济格局,有人失业,有人涨薪!
大数据·人工智能·ai·开源
白鲸开源8 小时前
DolphinScheduler+Sqoop 入门避坑:一文搞定数据同步常见异常
大数据
Edingbrugh.南空8 小时前
Flink ClickHouse 连接器数据读取源码深度解析
java·clickhouse·flink
学术小八9 小时前
第二届云计算与大数据国际学术会议(ICCBD 2025)
大数据·云计算
求职小程序华东同舟求职9 小时前
龙旗科技社招校招入职测评25年北森笔试测评题库答题攻略
大数据·人工智能·科技