Background
In production we have a microservice that periodically scans the database for changed rows and publishes Kafka messages; another microservice consumes them. This had been running fine for a long time, but recently messages suddenly started piling up and being redelivered in an endless loop.
The scheduled scan job has a dynamic switch, so the first thing we did was turn it off. In theory no new messages should be produced after that, yet the consumer kept consuming. Where were these messages coming from?
Cause
Checking the logs:
```plain
2026-03-23T20:21:35: consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2026-03-23T20:21:35: Member consumer-quick-group-1-7158d065-c551-495c-8cb0-b1f10c41d797 sending LeaveGroup request to coordinator localhost:9092 (id: 2147483646 rack: null isFenced: false) due to consumer poll timeout has expired.
2026-03-23T20:21:35: Resetting generation and member id due to: consumer pro-actively leaving the group
2026-03-23T20:21:35: Request joining group due to: consumer pro-actively leaving the group
```
In short: the consumer spent too long processing the records returned by poll(), the timeout expired, the broker considered the consumer dead, a rebalance kicked in, and the messages were redelivered. The log also points to the parameter max.poll.interval.ms:
The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
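The interplay between batch size, per-record processing time, and this timeout can be sketched as a simple budget check. This is a minimal illustration, not Kafka code: the per-record times below are hypothetical, while 300000 ms and 500 records mirror Kafka's defaults for max.poll.interval.ms and max.poll.records.

```java
// Sketch: will a batch returned by poll() be fully processed before
// max.poll.interval.ms expires? All processing times are illustrative.
public class PollBudget {
    // Kafka's default max.poll.interval.ms is 300000 ms (5 minutes).
    static final long MAX_POLL_INTERVAL_MS = 300_000;

    /** Returns true if the consumer would finish the batch in time. */
    static boolean fitsInPollInterval(int recordsPerPoll, long msPerRecord) {
        return (long) recordsPerPoll * msPerRecord < MAX_POLL_INTERVAL_MS;
    }

    public static void main(String[] args) {
        // 500 records (the default max.poll.records) at 800 ms each:
        // 400000 ms > 300000 ms, so the group would rebalance.
        System.out.println(fitsInPollInterval(500, 800)); // false
        // Capping the batch at 300 records keeps it under the limit.
        System.out.println(fitsInPollInterval(300, 800)); // true
    }
}
```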
By default this timeout is 5 minutes. In our scenario a single record takes a relatively long time to process, and a recent operation happened to sharply increase the message volume, so a single poll() returned more records than could be processed within 5 minutes. The rebalance then fired before the offsets were committed, so the same batch was handed to a consumer again, which again failed to finish in time: an endless loop.
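The log message itself suggests the two knobs to turn: raise max.poll.interval.ms and/or lower max.poll.records. A minimal sketch of such a tuned consumer configuration, using plain JDK Properties with Kafka's literal config keys (the broker address, group id, and the specific values chosen here are illustrative assumptions, not our production settings):

```java
import java.util.Properties;

public class ConsumerTuning {
    /** Builds consumer properties tuned for slow per-record processing. */
    static Properties tunedConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "quick-group");             // placeholder
        // Allow up to 10 minutes between poll() calls (default: 5 minutes).
        props.put("max.poll.interval.ms", "600000");
        // Return at most 100 records per poll() (default: 500), so one
        // batch can be processed well within the poll interval.
        props.put("max.poll.records", "100");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedConsumerProps());
    }
}
```

Either knob alone can be enough; lowering max.poll.records is usually the safer choice, since a very large max.poll.interval.ms also delays how quickly a genuinely dead consumer is detected and its partitions reassigned.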