Source Code Walkthrough: How FlinkKafkaConsumer Emits Punctuated Watermarks

Background

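FlinkKafkaConsumer can emit a watermark when it receives a particular record from a Kafka partition, for example a special record that marks the end of a complete logical record. This article walks through the source code that emits these punctuated watermarks.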
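To make the idea concrete, here is a minimal sketch of the decision a punctuated assigner makes for each record. It is plain Java with no Flink dependency: `PunctuatedSketch`, `Event`, and the `endOfBatch` flag are hypothetical names chosen for illustration. The method only mirrors the contract of `checkAndGetNextWatermark`, which appears later in this walkthrough: return a watermark timestamp when the record is a marker, or `null` when no watermark should be emitted.

```java
// Hypothetical model, not Flink code: it only illustrates the per-record decision.
final class PunctuatedSketch {

    /** Illustrative record type; endOfBatch marks the "special" record. */
    static final class Event {
        final long timestamp;
        final boolean endOfBatch;

        Event(long timestamp, boolean endOfBatch) {
            this.timestamp = timestamp;
            this.endOfBatch = endOfBatch;
        }
    }

    /** Returns the watermark timestamp to emit for this record, or null for none. */
    static Long checkAndGetNextWatermark(Event lastElement, long extractedTimestamp) {
        return lastElement.endOfBatch ? Long.valueOf(extractedTimestamp) : null;
    }

    public static void main(String[] args) {
        Event[] events = {
            new Event(1_000L, false), new Event(2_000L, false), new Event(3_000L, true)
        };
        for (Event e : events) {
            Long wm = checkAndGetNextWatermark(e, e.timestamp);
            System.out.println(
                    e.timestamp + " -> " + (wm == null ? "no watermark" : "watermark " + wm));
        }
    }
}
```

Only the third event carries the marker flag, so it is the only one that yields a watermark; the first two records pass through without advancing event time.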

Source code walkthrough: emitting punctuated watermarks

1. Start with the `runFetchLoop` method in `KafkaFetcher`

```java
public void runFetchLoop() throws Exception {
    try {
        // kick off the actual Kafka consumer
        consumerThread.start();

        while (running) {
            // this blocks until we get the next records
            // it automatically re-throws exceptions encountered in the consumer thread
            final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

            // get the records for each topic partition
            for (KafkaTopicPartitionState<T, TopicPartition> partition :
                    subscribedPartitionStates()) {

                List<ConsumerRecord<byte[], byte[]>> partitionRecords =
                        records.records(partition.getKafkaPartitionHandle());
                // this method is called for every partition consumed by this operator subtask
                partitionConsumerRecordsHandler(partitionRecords, partition);
            }
        }
    } finally {
        // this signals the consumer thread that no more work is to be done
        consumerThread.shutdown();
    }
    // (rest of the method omitted in this excerpt)
}
```

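2. `partitionConsumerRecordsHandler` processes the records of each partition assigned to the current operator subtask. It passes the deserialized records to `emitRecordsWithTimestamps`, shown below, which emits them and handles the partition's watermark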

```java
protected void emitRecordsWithTimestamps(
        Queue<T> records,
        KafkaTopicPartitionState<T, KPH> partitionState,
        long offset,
        long kafkaEventTimestamp) {
    // emit the records, using the checkpoint lock to guarantee
    // atomicity of record emission and offset state update
    synchronized (checkpointLock) {
        T record;
        while ((record = records.poll()) != null) {
            long timestamp = partitionState.extractTimestamp(record, kafkaEventTimestamp);
            // send the Kafka record to the downstream operator
            sourceContext.collectWithTimestamp(record, timestamp);

            // this might emit a watermark, so do it after emitting the record
            // handle the partition's watermark: record it for this partition and, when the
            // conditions are met, update the watermark of the whole operator subtask
            partitionState.onEvent(record, timestamp);
        }
        partitionState.setOffset(offset);
    }
}
```
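The comment about the checkpoint lock can be illustrated with a small sketch. This is a simplified model, not Flink code: `AtomicEmitSketch`, its `emitted` list, and the `snapshot` method are hypothetical stand-ins for the source context and the checkpointing code. It assumes that whoever takes a checkpoint acquires the same lock, so it never observes records that were emitted without the matching offset update, or the reverse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model of "emit records and update the offset under one lock".
final class AtomicEmitSketch {
    private final Object checkpointLock = new Object();
    private final List<String> emitted = new ArrayList<>(); // stands in for the downstream
    private long offset = -1L;                              // last fully processed offset

    /** Emits every element produced from one Kafka record, then stores its offset. */
    void emitRecords(Queue<String> records, long recordOffset) {
        synchronized (checkpointLock) {
            String record;
            while ((record = records.poll()) != null) {
                emitted.add(record);
            }
            offset = recordOffset;
        }
    }

    /** Reads {emitted count, offset} under the same lock, so the pair is consistent. */
    long[] snapshot() {
        synchronized (checkpointLock) {
            return new long[] {emitted.size(), offset};
        }
    }

    public static void main(String[] args) {
        AtomicEmitSketch fetcher = new AtomicEmitSketch();
        Queue<String> records = new ArrayDeque<>();
        records.add("a");
        records.add("b");
        fetcher.emitRecords(records, 7L);
        long[] snap = fetcher.snapshot();
        System.out.println("emitted=" + snap[0] + ", offset=" + snap[1]);
    }
}
```

Because both methods synchronize on the same object, a snapshot sees either the state before a Kafka record was handled or the state after all of its elements were emitted and its offset stored.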

3. Processing the watermark of each partition

```java
// KafkaTopicPartitionStateWithWatermarkGenerator: forwards the event to the generator
public void onEvent(T event, long timestamp) {
    watermarkGenerator.onEvent(event, timestamp, immediateOutput);
}

// AssignerWithPunctuatedWatermarksAdapter: asks the punctuated assigner whether this
// record should produce a watermark
public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
    final org.apache.flink.streaming.api.watermark.Watermark next =
            wms.checkAndGetNextWatermark(event, eventTimestamp);

    if (next != null) {
        output.emitWatermark(new Watermark(next.getTimestamp()));
    }
}

// output.emitWatermark(new Watermark(next.getTimestamp())) resolves to the method below
// (WatermarkOutputMultiplexer)
public void emitWatermark(Watermark watermark) {
    long timestamp = watermark.getTimestamp();
    // update the watermark tracked for this partition
    boolean wasUpdated = state.setWatermark(timestamp);

    // if it's higher than the max watermark so far we might have to update the
    // combined watermark. The combined watermark is the minimum watermark of this
    // operator subtask, i.e. the subtask-level watermark rather than a partition-level one
    if (wasUpdated && timestamp > combinedWatermark) {
        updateCombinedWatermark();
    }
}

// the per-partition watermark is updated as follows
public boolean setWatermark(long watermark) {
    this.idle = false;
    final boolean updated = watermark > this.watermark;
    this.watermark = Math.max(watermark, this.watermark);
    return updated;
}
```

4. Finally, the method that emits the watermark at the operator-subtask level

```java
private void updateCombinedWatermark() {
    long minimumOverAllOutputs = Long.MAX_VALUE;

    boolean hasOutputs = false;
    boolean allIdle = true;
    for (OutputState outputState : watermarkOutputs) {
        if (!outputState.isIdle()) {
            minimumOverAllOutputs = Math.min(minimumOverAllOutputs, outputState.getWatermark());
            allIdle = false;
        }
        hasOutputs = true;
    }

    // if we don't have any outputs minimumOverAllOutputs is not valid, it's still
    // at its initial Long.MAX_VALUE state and we must not emit that
    if (!hasOutputs) {
        return;
    }

    if (allIdle) {
        underlyingOutput.markIdle();
    } else if (minimumOverAllOutputs > combinedWatermark) {
        combinedWatermark = minimumOverAllOutputs;
        underlyingOutput.emitWatermark(new Watermark(minimumOverAllOutputs));
    }
}
```

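Looking at this flow, does it mean that the punctuated approach does not support idleness? The answer is yes. In the code path above, a partition's watermark only moves when a record arrives, and `setWatermark` resets `idle` to `false` each time. Nothing shown here ever marks a partition as idle, so a partition that stops receiving records keeps holding back the combined watermark.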
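The effect can be reproduced with a small simulation. This is a simplified model of the logic walked through above, not the Flink classes themselves: `CombinedWatermarkSketch`, `PartitionState`, `register`, and the `emitted` list are hypothetical names, the initial per-partition watermark of `Long.MIN_VALUE` is an assumption of the model, and the all-idle branch is left out because nothing in this path sets `idle` to `true`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of per-partition watermarks combined by taking the minimum.
final class CombinedWatermarkSketch {

    /** Per-partition state, modelled on the setWatermark logic shown above. */
    static final class PartitionState {
        long watermark = Long.MIN_VALUE; // assumed initial value for this model
        boolean idle = false;            // never set to true anywhere in this path

        boolean setWatermark(long candidate) {
            idle = false;
            boolean updated = candidate > watermark;
            watermark = Math.max(candidate, watermark);
            return updated;
        }
    }

    final List<PartitionState> partitions = new ArrayList<>();
    final List<Long> emitted = new ArrayList<>(); // subtask-level watermarks sent downstream
    long combinedWatermark = Long.MIN_VALUE;

    PartitionState register() {
        PartitionState state = new PartitionState();
        partitions.add(state);
        return state;
    }

    /** Called when the punctuated assigner returns a watermark for one partition. */
    void onPartitionWatermark(PartitionState state, long timestamp) {
        boolean wasUpdated = state.setWatermark(timestamp);
        if (wasUpdated && timestamp > combinedWatermark) {
            updateCombinedWatermark();
        }
    }

    private void updateCombinedWatermark() {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (PartitionState s : partitions) {
            if (!s.idle) {
                min = Math.min(min, s.watermark);
                anyActive = true;
            }
        }
        // emit only when the minimum over all active partitions has advanced
        if (anyActive && min > combinedWatermark) {
            combinedWatermark = min;
            emitted.add(min);
        }
    }

    public static void main(String[] args) {
        CombinedWatermarkSketch mux = new CombinedWatermarkSketch();
        PartitionState p0 = mux.register();
        PartitionState p1 = mux.register();

        mux.onPartitionWatermark(p0, 1_000L);
        mux.onPartitionWatermark(p0, 2_000L);
        System.out.println("while p1 is silent: " + mux.emitted);

        mux.onPartitionWatermark(p1, 1_500L);
        System.out.println("after p1's marker:  " + mux.emitted);
    }
}
```

While `p1` receives no marker record, its watermark stays at the initial value and the minimum cannot advance, so nothing is emitted even though `p0` has reached 2000. Once `p1` reports 1500, the combined watermark moves to 1500, the smaller of the two.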
