How Flink assigns Kafka partitions to subtasks when consuming from Kafka

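In newer Flink versions, the Kafka source relies on KafkaSourceEnumerator to decide which source subtask reads each Kafka partition. Reading the source code shows that the actual assignment is done by the following code: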

    private void addPartitionSplitChangeToPendingAssignments(
            Collection<KafkaPartitionSplit> newPartitionSplits) {
        // Parallelism configured for the Kafka source
        int numReaders = context.currentParallelism();
        for (KafkaPartitionSplit split : newPartitionSplits) {
            // Compute which subtask (reader) owns this Kafka partition
            int ownerReader = getSplitOwner(split.getTopicPartition(), numReaders);
            // Record the subtask-to-partition mapping
            pendingPartitionSplitAssignment
                    .computeIfAbsent(ownerReader, r -> new HashSet<>())
                    .add(split);
        }
    }

    static int getSplitOwner(TopicPartition tp, int numReaders) {
        // Derive startIndex from the hash of the topic name
        int startIndex = ((tp.topic().hashCode() * 31) & 0x7FFFFFFF) % numReaders;
        // Assign the topic's partitions round-robin, starting at startIndex
        return (startIndex + tp.partition()) % numReaders;
    }
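
To experiment with this formula outside a Flink job, the sketch below reimplements the same arithmetic using only the JDK. It takes the topic name and partition number directly instead of a Kafka TopicPartition. The class name SplitOwnerDemo and the helper assign are hypothetical, introduced only for illustration. The formula may differ between connector versions, so treat this as a sketch and compare it with the KafkaSourceEnumerator source of the version you actually run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone demo; not part of Flink.
final class SplitOwnerDemo {

    // Same arithmetic as getSplitOwner above, with TopicPartition unpacked
    // into (topic, partition) so that no Kafka dependency is needed.
    static int getSplitOwner(String topic, int partition, int numReaders) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numReaders;
        return (startIndex + partition) % numReaders;
    }

    // Groups partitions 0..numPartitions-1 of one topic by owning subtask.
    static Map<Integer, List<Integer>> assign(String topic, int numPartitions, int numReaders) {
        Map<Integer, List<Integer>> owners = new TreeMap<>();
        for (int p = 0; p < numPartitions; p++) {
            owners.computeIfAbsent(getSplitOwner(topic, p, numReaders), r -> new ArrayList<>())
                    .add(p);
        }
        return owners;
    }

    public static void main(String[] args) {
        // Print the subtask-to-partition mapping for one topic.
        assign("test_topic_partition_one", 9, 5)
                .forEach((subtask, parts) ->
                        System.out.println("subtask " + subtask + " ==> " + parts));
    }
}
```

Because consecutive partitions of a topic go to consecutive subtasks, only startIndex depends on the topic name. Within one topic the partitions are always spread round-robin.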

An example:

There are two topics, test_topic_partition_one and test_topic_partition_two, each with 9 partitions, and the Kafka source parallelism is set to 5.

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setProperties(properties)
        .setTopics("test_topic_partition_one", "test_topic_partition_two")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source").setParallelism(5);

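Applying the formula: int startIndex = ((tp.topic().hashCode() * 31) & 0x7FFFFFFF) % numReaders;

For the first topic, test_topic_partition_one, startIndex = 2.

So partition 0 goes to subtask 2 and the remaining partitions follow round-robin, which gives this mapping between subtasks and partitions: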

subtask 0 ==> test_topic_partition_one-3  test_topic_partition_one-8

subtask 1 ==> test_topic_partition_one-4

subtask 2 ==> test_topic_partition_one-0  test_topic_partition_one-5

subtask 3 ==> test_topic_partition_one-1  test_topic_partition_one-6 

subtask 4 ==> test_topic_partition_one-2  test_topic_partition_one-7

Applying the same formula to the second topic, test_topic_partition_two, gives startIndex = 1.

So the mapping between subtasks and partitions is:

subtask 0 ==> test_topic_partition_two-4

subtask 1 ==> test_topic_partition_two-0  test_topic_partition_two-5

subtask 2 ==> test_topic_partition_two-1  test_topic_partition_two-6

subtask 3 ==> test_topic_partition_two-2  test_topic_partition_two-7 

subtask 4 ==> test_topic_partition_two-3  test_topic_partition_two-8

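Combining both topics, each Flink subtask ends up with the partitions listed below. Subtasks 0 and 1 read 3 partitions each, while subtasks 2, 3 and 4 read 4 each. Since the topics may also carry different traffic volumes, this can lead to data skew and reduce processing throughput.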

subtask 0 ==> test_topic_partition_one-3  test_topic_partition_one-8  test_topic_partition_two-4

subtask 1 ==> test_topic_partition_one-4  test_topic_partition_two-0  test_topic_partition_two-5

subtask 2 ==> test_topic_partition_one-0  test_topic_partition_one-5  test_topic_partition_two-1  test_topic_partition_two-6

subtask 3 ==> test_topic_partition_one-1  test_topic_partition_one-6  test_topic_partition_two-2  test_topic_partition_two-7 

subtask 4 ==> test_topic_partition_one-2  test_topic_partition_one-7  test_topic_partition_two-3  test_topic_partition_two-8
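
The per-subtask partition counts above can be re-derived with a few lines of Java. This is a minimal sketch, assuming the startIndex values 2 and 1 reported above for the two topics. The class SubtaskLoadDemo and the method partitionsPerSubtask are hypothetical names, not Flink APIs.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Flink.
final class SubtaskLoadDemo {

    // Counts how many partitions land on each subtask when every topic has
    // partitionsPerTopic partitions and starts at its given startIndex.
    static int[] partitionsPerSubtask(int[] startIndexes, int partitionsPerTopic, int numReaders) {
        int[] counts = new int[numReaders];
        for (int startIndex : startIndexes) {
            for (int p = 0; p < partitionsPerTopic; p++) {
                counts[(startIndex + p) % numReaders]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // startIndex 2 and 1 are the values stated in the text for the two topics.
        int[] counts = partitionsPerSubtask(new int[] {2, 1}, 9, 5);
        System.out.println(Arrays.toString(counts)); // prints [3, 3, 4, 4, 4]
    }
}
```

Counting partitions only shows part of the picture: even with equal counts, subtasks can still be loaded unevenly if the partitions themselves carry different amounts of traffic.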

The corresponding log output:

2024-08-18 18:39:51 INFO [org.apache.flink.connector.kafka.source.enumerator.KafkaSourceEnumerator  Line:393] Discovered new partitions: [test_topic_partition_one-6, test_topic_partition_one-7, test_topic_partition_one-8, test_topic_partition_two-4, test_topic_partition_two-5, test_topic_partition_two-6, test_topic_partition_one-0, test_topic_partition_two-7, test_topic_partition_one-1, test_topic_partition_two-8, test_topic_partition_one-2, test_topic_partition_one-3, test_topic_partition_one-4, test_topic_partition_one-5, test_topic_partition_two-0, test_topic_partition_two-1, test_topic_partition_two-2, test_topic_partition_two-3]

2024-08-18 18:39:51 INFO [org.apache.flink.connector.kafka.source.enumerator.KafkaSourceEnumerator  Line:353] Assigning splits to readers {0=[[Partition: test_topic_partition_one-3, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_two-4, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_one-8, StartingOffset: -1, StoppingOffset: -9223372036854775808]], 1=[[Partition: test_topic_partition_one-4, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_two-5, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_two-0, StartingOffset: -1, StoppingOffset: -9223372036854775808]], 2=[[Partition: test_topic_partition_two-6, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_one-0, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_two-1, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_one-5, StartingOffset: -1, StoppingOffset: -9223372036854775808]], 3=[[Partition: test_topic_partition_one-1, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_two-7, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_two-2, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_one-6, StartingOffset: -1, StoppingOffset: -9223372036854775808]], 4=[[Partition: test_topic_partition_two-8, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_one-2, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_two-3, StartingOffset: -1, StoppingOffset: -9223372036854775808], [Partition: test_topic_partition_one-7, StartingOffset: -1, StoppingOffset: -9223372036854775808]]}