源码解析FlinkKafkaConsumer支持punctuated水位线发送

背景

FlinkKafkaConsumer支持当收到某个kafka分区中的某条记录时发送水位线,比如这条特殊的记录代表一个完整记录的结束等,本文就来解析下发送punctuated水位线的源码

punctuated 水位线发送源码解析

1.首先KafkaFetcher中的runFetchLoop方法

java 复制代码
public void runFetchLoop() throws Exception {
        try {
            // kick off the actual Kafka consumer
            consumerThread.start();

            while (running) {
                // this blocks until we get the next records
                // it automatically re-throws exceptions encountered in the consumer thread
                final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

                // get the records for each topic partition
                for (KafkaTopicPartitionState<T, TopicPartition> partition :
                        subscribedPartitionStates()) {

                    List<ConsumerRecord<byte[], byte[]>> partitionRecords =
                            records.records(partition.getKafkaPartitionHandle());
// 算子任务消费的每个分区都调用这个方法
                    partitionConsumerRecordsHandler(partitionRecords, partition);
                }
            }
        } finally {
            // this signals the consumer thread that no more work is to be done
            consumerThread.shutdown();
        }

2.查看partitionConsumerRecordsHandler方法处理当前算子任务对应的每个分区的水位线

java 复制代码
    protected void emitRecordsWithTimestamps(
            Queue<T> records,
            KafkaTopicPartitionState<T, KPH> partitionState,
            long offset,
            long kafkaEventTimestamp) {
        // emit the records, using the checkpoint lock to guarantee
        // atomicity of record emission and offset state update
        synchronized (checkpointLock) {
            T record;
            while ((record = records.poll()) != null) {
                long timestamp = partitionState.extractTimestamp(record, kafkaEventTimestamp);
                // 发送kafka记录到下游算子
                sourceContext.collectWithTimestamp(record, timestamp);

                // this might emit a watermark, so do it after emitting the record
                // 处理分区的水位线,记录这个分区的水位线,并在满足条件时更新整个算子任务的水位线
                partitionState.onEvent(record, timestamp);
            }
            partitionState.setOffset(offset);
        }
    }```

3.处理每个分区的水位线

```java
    public void onEvent(T event, long timestamp) {
        watermarkGenerator.onEvent(event, timestamp, immediateOutput);
    }
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        final org.apache.flink.streaming.api.watermark.Watermark next =
                wms.checkAndGetNextWatermark(event, eventTimestamp);

        if (next != null) {
            output.emitWatermark(new Watermark(next.getTimestamp()));
        }
    }
    其中 output.emitWatermark(new Watermark(next.getTimestamp()));对应方法如下
            public void emitWatermark(Watermark watermark) {
            long timestamp = watermark.getTimestamp();
            // 更新每个分区对应的水位线,并且更新
            boolean wasUpdated = state.setWatermark(timestamp);

            // if it's higher than the max watermark so far we might have to update the
            // combined watermark 这个表明这个算子任务的最低水位线,也就是算子任务级别的水位线,而不是分区级别的了
            if (wasUpdated && timestamp > combinedWatermark) {
                updateCombinedWatermark();
            }
        }
 //每个分区水位线的更新如下
         public boolean setWatermark(long watermark) {
            this.idle = false;
            final boolean updated = watermark > this.watermark;
            this.watermark = Math.max(watermark, this.watermark);
            return updated;
        }       
        

4.最后是发送算子任务级别的水位线的方法

java 复制代码
private void updateCombinedWatermark() {
        long minimumOverAllOutputs = Long.MAX_VALUE;

        boolean hasOutputs = false;
        boolean allIdle = true;
        for (OutputState outputState : watermarkOutputs) {
            if (!outputState.isIdle()) {
                minimumOverAllOutputs = Math.min(minimumOverAllOutputs, outputState.getWatermark());
                allIdle = false;
            }
            hasOutputs = true;
        }

        // if we don't have any outputs minimumOverAllOutputs is not valid, it's still
        // at its initial Long.MAX_VALUE state and we must not emit that
        if (!hasOutputs) {
            return;
        }

        if (allIdle) {
            underlyingOutput.markIdle();
        } else if (minimumOverAllOutputs > combinedWatermark) {
            combinedWatermark = minimumOverAllOutputs;
            underlyingOutput.emitWatermark(new Watermark(minimumOverAllOutputs));
        }
    }

你可以看这个流程,是不是意味着如果使用Punctuated的方式,是不支持Idle空闲时间的?--答案是的

相关推荐
Lx35233 分钟前
Hadoop日志分析实战:快速定位问题的技巧
大数据·hadoop
喂完待续3 小时前
【Tech Arch】Hive技术解析:大数据仓库的SQL桥梁
大数据·数据仓库·hive·hadoop·sql·apache
SelectDB4 小时前
5000+ 中大型企业首选的 Doris,在稳定性的提升上究竟花了多大的功夫?
大数据·数据库·apache
最初的↘那颗心4 小时前
Flink Stream API 源码走读 - window 和 sum
大数据·hadoop·flink·源码·实时计算·窗口函数
Yusei_05236 小时前
迅速掌握Git通用指令
大数据·git·elasticsearch
一只栖枝12 小时前
华为 HCIE 大数据认证中 Linux 命令行的运用及价值
大数据·linux·运维·华为·华为认证·hcie·it
喂完待续17 小时前
Apache Hudi:数据湖的实时革命
大数据·数据仓库·分布式·架构·apache·数据库架构
青云交17 小时前
Java 大视界 -- 基于 Java 的大数据可视化在城市交通拥堵治理与出行效率提升中的应用(398)
java·大数据·flink·大数据可视化·拥堵预测·城市交通治理·实时热力图
还是大剑师兰特1 天前
Flink面试题及详细答案100道(1-20)- 基础概念与架构
大数据·flink·大剑师·flink面试题
sleetdream1 天前
Flink Sql 按分钟或日期统计数据量
sql·flink