Kafka Source Code (6): Consumer Consumption

Preface

This chapter analyzes the consumer's consumption path:

1) determining the fetch position; 2) fetching messages; 3) committing the consumed offsets.

Related earlier articles in this series:

1) Kafka Source Code (3): Producing Messages, Client Side

2) Kafka Source Code (4): Producing Messages, Server Side

3) Kafka Source Code (5): Consumer Rebalance

Note: this article is based on Kafka 2.6, without KRaft.

1. Recap: the Result of a Rebalance

KafkaConsumer#poll: on the consumer side there is only one thread, the user's poll thread. The poll method performs the whole sequence of rebalance + auto commit + message fetching.

java
private ConsumerRecords<K, V> poll(final Timer timer, final boolean includeMetadataInTimeout) {
    do {
        // 1. rebalance and auto commit
        updateAssignmentMetadataIfNeeded(timer, false);
        // 2. fetch messages
        Map<TopicPartition, List<ConsumerRecord<K, V>>> records = 
                                        pollForFetches(timer);
        if (!records.isEmpty()) {
            if (fetcher.sendFetches() > 0 || 
                    client.hasPendingRequests()) {
                // if step 2 returned messages, try to send the next FetchRequest right away
                client.transmitSends();
            }
            // 3. return the messages
            return this.interceptors.onConsume(new ConsumerRecords<>(records));
        }
    // 4. the user-supplied timeout has not expired, keep looping
    } while (timer.notExpired());
    return ConsumerRecords.empty();
}
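
For reference, a minimal user-side loop that drives these internals; this is a sketch where props, running, process(...) and the topic "Test1" are placeholders, not from the source being analyzed:

java
// Typical poll loop; each poll() call runs rebalance + auto commit + fetch as shown above.
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("Test1"));
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records)
            process(record); // user logic
    }
}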

SubscriptionState is the consumer-side subscription state. It contains: 1) the subscribed topics (subscription); 2) the partition assignment result (assignment).

The consumer group leader runs the assignment strategy (RangeAssignor) to assign partitions, sends the assignment to the coordinator in a SyncGroupRequest, and the coordinator distributes the assignment to every consumer via SyncGroupResponse.

java
public class SubscriptionState {
    private enum SubscriptionType {
        NONE, AUTO_TOPICS, AUTO_PATTERN, USER_ASSIGNED
    }
    // subscription mode
    private SubscriptionType subscriptionType;
    // AUTO_PATTERN: subscribe by regex
    private Pattern subscribedPattern;
    // AUTO_TOPICS: subscribe by exact topic names
    private Set<String> subscription;
    // all topics subscribed across the consumer group
    private Set<String> groupSubscription;
    // assigned partitions and their state
    private final PartitionStates<TopicPartitionState> assignment;
    // strategy for initializing the position when the group consumes for the first time
    // auto.offset.reset=latest
    private final OffsetResetStrategy defaultResetStrategy;
    // the user may pass a ConsumerRebalanceListener as the second argument to subscribe
    private ConsumerRebalanceListener rebalanceListener;
    // when a rebalance changes the assignment, assignmentId++
    private int assignmentId = 0;
}

SubscriptionState#assignFromSubscribed: every consumer learns its assigned partitions from the SyncGroupResponse. A partition's initial fetch state (TopicPartitionState) is INITIALIZING, and its position is null.

java
public class PartitionStates<S> {
    // partition -> partition state
    private final LinkedHashMap<TopicPartition, S> map = new LinkedHashMap<>();
}
private static class TopicPartitionState {
    // fetch state
    private FetchState fetchState;
    // fetch position
    private FetchPosition position; 
    TopicPartitionState() {
      this.fetchState = FetchStates.INITIALIZING;
      this.position = null;
      // all other fields are null
    }
}

KafkaConsumer#updateAssignmentMetadataIfNeeded: after the rebalance completes, the consumer must determine its fetch positions (updateFetchPositions) before it can start consuming.

java
boolean updateAssignmentMetadataIfNeeded(final Timer timer, final boolean waitForJoinGroup) {
  // 1. coordinator.poll: complete the rebalance and do auto commit
  if (coordinator != null && !coordinator.poll(timer, waitForJoinGroup)) {
      return false;
  }
  // 2. determine fetch positions; only then can FetchRequests be sent
  return updateFetchPositions(timer);
}

A partition's FetchState has four values:

1) INITIALIZING: the partition was just assigned;

2) AWAIT_VALIDATION: the group has consumed this partition before; a position has been fetched and awaits leaderEpoch validation;

3) AWAIT_RESET: the group has never consumed this partition; the position awaits a reset;

4) FETCHING: the position is valid and consumption can start. The transitions between these states are sketched below.
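
This is a sketch only; the real FetchStates enum also encodes which transitions are valid:

java
// Sketch: the four states and the triggers named in this section.
enum FetchState {
    INITIALIZING,     // just assigned; committed offset not yet fetched
    AWAIT_VALIDATION, // has a position; its leaderEpoch must be validated
    AWAIT_RESET,      // no committed offset; position must be reset
    FETCHING          // position valid; FetchRequests may be sent
}
// INITIALIZING     -> AWAIT_VALIDATION : committed offset found (2-1)
// INITIALIZING     -> AWAIT_RESET      : no committed offset (2-2)
// AWAIT_RESET      -> FETCHING         : reset via auto.offset.reset (2-2)
// AWAIT_VALIDATION -> FETCHING         : leaderEpoch validated (2-3)
// FETCHING         -> AWAIT_VALIDATION : partition leaderEpoch changed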

KafkaConsumer#updateFetchPositions: handles all four partition states; the goal is to drive each partition to FETCHING so messages can be pulled.

java
private boolean updateFetchPositions(final Timer timer) {
  // 1. [FETCHING] partition leaderEpoch changed -> [AWAIT_VALIDATION]
  // [AWAIT_VALIDATION] run leaderEpoch validation -> [FETCHING]
  fetcher.validateOffsetsIfNeeded();
  // 2. if every partition is already [FETCHING], return immediately
  cachedSubscriptionHashAllFetchPositions = subscriptions.hasAllFetchPositions();
  if (cachedSubscriptionHashAllFetchPositions) return true;
  // 3. [INITIALIZING] fetch the committed offsets -> [AWAIT_VALIDATION]
  // refreshCommittedOffsetsIfNeeded: true = not timed out, false = timed out
  if (coordinator != null && 
      !coordinator.refreshCommittedOffsetsIfNeeded(timer)) 
    return false;
  // 4. [INITIALIZING] no committed offset -> [AWAIT_RESET]
  // [AWAIT_RESET] apply the reset strategy -> [FETCHING]
  subscriptions.resetInitializingPositions();
  fetcher.resetOffsetsIfNeeded();
  return true;
}

2. Determining the Fetch Position

2-1. Fetching the Committed Offset

If the consumer group has consumed a topic partition before, its committed offset for that partition can be fetched from the coordinator.

2-1-1. Consumer sends OffsetFetchRequest

ConsumerCoordinator#refreshCommittedOffsetsIfNeeded: the consumer sends an OffsetFetchRequest and stores the returned offsets in memory.

java
public boolean refreshCommittedOffsetsIfNeeded(Timer timer) {
  // partitions still in INITIALIZING state
  Set<TopicPartition> initializingPartitions = 
          subscriptions.initializingPartitions();
  // send OffsetFetchRequest to fetch the committed offsets
  Map<TopicPartition, OffsetAndMetadata> offsets = 
          fetchCommittedOffsets(initializingPartitions, timer);
  if (offsets == null) return false; // timed out
  // ...process the returned offsets
  return true;
}

The OffsetFetchRequest looks like this:

java
public class OffsetFetchRequestData implements ApiMessage {
    // consumer group id
    private String groupId;
    // topics and partitions
    private List<OffsetFetchRequestTopic> topics;
    // true
    private boolean requireStable;
}
static public class OffsetFetchRequestTopic implements Message {
    // topic
    private String name;
    // partition ids
    private List<Integer> partitionIndexes;
}
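
On the public API, the same committed offsets can also be queried with KafkaConsumer#committed; a small sketch, where consumer is an existing KafkaConsumer and "Test1"/partition 0 are example values:

java
// Sketch: querying committed offsets explicitly.
TopicPartition tp = new TopicPartition("Test1", 0);
Map<TopicPartition, OffsetAndMetadata> committed =
        consumer.committed(Collections.singleton(tp));
OffsetAndMetadata meta = committed.get(tp);
System.out.println(meta == null ? "no committed offset yet"
        : "committed offset = " + meta.offset());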

2-1-2. Coordinator handles OffsetFetchRequest

GroupMetadataManager#getOffsets: the coordinator loops over the requested partitions and reads the committed offsets from memory (GroupMetadata#offsets). If a partition has no committed offset, it returns offset=-1 (INVALID_OFFSET).

The coordinator creates the consumer group when it receives the first member's JoinGroupRequest; one group id corresponds to one GroupMetadata.

scala
def getOffsets(groupId: String, requireStable: Boolean, 
               topicPartitionsOpt: Option[Seq[TopicPartition]]): 
                    Map[TopicPartition, PartitionData] = {
  val group = groupMetadataCache.get(groupId)
  group.inLock {
    val topicPartitions = 
      topicPartitionsOpt.getOrElse(group.allOffsets.keySet)
    topicPartitions.map { topicPartition =>
      val partitionData = group.offset(topicPartition) match {
        case None =>
          // no committed offset: return offset=-1 (INVALID_OFFSET)
          new PartitionData(OffsetFetchResponse.INVALID_OFFSET,
            Optional.empty(), "", Errors.NONE)
        case Some(offsetAndMetadata) =>
          new PartitionData(offsetAndMetadata.offset,
            offsetAndMetadata.leaderEpoch, offsetAndMetadata.metadata, Errors.NONE)
      }
      topicPartition -> partitionData
    }.toMap
  }
}
// GroupMetadata: partition -> committed offset
private val offsets = new mutable.HashMap[TopicPartition, CommitRecordMetadataAndOffset]
def offset(topicPartition: TopicPartition): Option[OffsetAndMetadata] = offsets.get(topicPartition).map(_.offsetAndMetadata)

The OffsetFetchResponse looks like this:

java
public class OffsetFetchResponseData implements ApiMessage {
    // topic list
    private List<OffsetFetchResponseTopic> topics;
}
static public class OffsetFetchResponseTopic implements Message {
    // topic
    private String name;
    // per-partition offsets
    private List<OffsetFetchResponsePartition> partitions;
}
static public class OffsetFetchResponsePartition implements Message {
    // partition id
    private int partitionIndex;
    // committed offset
    private long committedOffset;
    // the partition leaderEpoch this offset was written under
    private int committedLeaderEpoch;
}

2-1-3. Consumer handles OffsetFetchResponse

OffsetFetchResponseHandler: an offset >= 0 means the group has consumed this partition before and a committed offset is available.

java
private class OffsetFetchResponseHandler {
  public void handle(OffsetFetchResponse response, 
    RequestFuture<Map<TopicPartition, OffsetAndMetadata>> future) {
      Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>(response.responseData().size());
      for (Map.Entry<TopicPartition, OffsetFetchResponse.PartitionData> entry :
           response.responseData().entrySet()) {
          TopicPartition tp = entry.getKey();
          OffsetFetchResponse.PartitionData partitionData = entry.getValue();
        if (partitionData.offset >= 0) {
            // committed offset exists
            offsets.put(tp, new OffsetAndMetadata(
              partitionData.offset, partitionData.leaderEpoch, partitionData.metadata));
        } else {
            // no committed offset
            log.info("Found no committed offset for partition {}", tp);
            offsets.put(tp, null);
        }
      }
      future.complete(offsets);
  }
}

ConsumerCoordinator#refreshCommittedOffsetsIfNeeded: the partition's fetch state moves to AWAIT_VALIDATION, and the committed offset is wrapped in a FetchPosition.

java
public class SubscriptionState {
  // assigned partitions and their state
  private final PartitionStates<TopicPartitionState> assignment;
}
private static class TopicPartitionState {
  // state: AWAIT_VALIDATION
  private FetchState fetchState;
  // fetch position
  private FetchPosition position; 
}
public static class FetchPosition {
  // fetch offset
  public final long offset;
  // the partition leaderEpoch of the record at this offset
  final Optional<Integer> offsetEpoch;
  // the current partition leader and its epoch
  final Metadata.LeaderAndEpoch currentLeader;
}

2-2. Resetting the Offset

If the previous step returned offset=-1 for a partition, the group has never consumed it; the reset strategy is used to obtain an offset from the partition leader.

2-2-1. Consumer sends ListOffsetRequest

SubscriptionState#resetInitializingPositions: marks all INITIALIZING partitions as AWAIT_RESET.

Fetcher#resetOffsetsIfNeeded: looks up each partition's reset strategy and sends a ListOffsetRequest.

Fetcher#offsetResetStrategyTimestamp: there are two reset strategies, configurable via auto.offset.reset:

1) LATEST (default): reset to the partition's latest write position, passing -1;

2) EARLIEST: reset to the partition's earliest position, passing -2.

java
private Long offsetResetStrategyTimestamp(final TopicPartition partition) {
    OffsetResetStrategy strategy = subscriptions.resetStrategy(partition);
    if (strategy == OffsetResetStrategy.EARLIEST)
        return ListOffsetRequest.EARLIEST_TIMESTAMP; // -2
    else if (strategy == OffsetResetStrategy.LATEST) // default
        return ListOffsetRequest.LATEST_TIMESTAMP; // -1
    else
        return null;
}
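
A small config sketch for choosing the strategy; the constant is the real ConsumerConfig key, the value shown is an example:

java
// Sketch: selecting the reset strategy applied when no committed offset exists.
Properties props = new Properties();
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // or "latest" (default)
// "none" is also accepted: poll() then throws NoOffsetForPartitionException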

Fetcher#resetOffsetsAsync: sends the ListOffsetRequest to each partition's leader broker.

Fetcher talks to ordinary brokers; ConsumerCoordinator talks to the group coordinator.

java
private void resetOffsetsAsync(
   // partition -> reset strategy (-1, -2)
   Map<TopicPartition, Long> partitionResetTimestamps) {
   // partition leader -> partition -> reset strategy (-1, -2)
   Map<Node, Map<TopicPartition, ListOffsetRequest.PartitionData>> timestampsToSearchByNode
      = groupListOffsetRequests(partitionResetTimestamps, new HashSet<>());
   // loop over the brokers and send the requests
   for (Map.Entry<Node, Map<TopicPartition, ListOffsetRequest.PartitionData>> 
       entry : timestampsToSearchByNode.entrySet()) {
      Node node = entry.getKey();
      final Map<TopicPartition, ListOffsetRequest.PartitionData> 
        resetTimestamps = entry.getValue();
      // send the request
      RequestFuture<ListOffsetResult> future = sendListOffsetRequest(node, resetTimestamps, false);
      future.addListener(new RequestFutureListener<ListOffsetResult>() {
          //...
      });
    }
 }

ListOffsetRequest

java
public class ListOffsetRequest extends AbstractRequest {
    // replica id, used for inter-broker replication; ordinary consumers use -1
    private final int replicaId;
    // isolation level, default READ_UNCOMMITTED
    private final IsolationLevel isolationLevel;
    // partition -> PartitionData
    private final Map<TopicPartition, PartitionData> partitionTimestamps;
}
public static final class PartitionData {
    // -1 or -2: the reset strategy
    public final long timestamp;
    // the partition leaderEpoch as seen by the consumer
    public final Optional<Integer> currentLeaderEpoch;
}

2-2-2. Partition leader broker handles ListOffsetRequest

A quick recap of some concepts from chapter 4:

1) LEO (LogEndOffset): last written offset + 1 = the offset of the next write;

2) HW (high watermark): min(LEO across ISR replicas, i.e. replicas whose replication lag behind the leader is within 30s);

3) logStartOffset: the offset of the partition's first message; greater than 0 if log cleanup has occurred;

4) leaderEpochCache: the partition's cache of (leader epoch -> start offset of that epoch). For example, (0,0),(1,222),(2,333) means epoch 0 wrote offsets [0,221] and epoch 1 wrote offsets [222,332]. This data is persisted in the leader-epoch-checkpoint file under the partition's log directory.

txt
0     // version
3     // entry count
0 0   // epoch, start offset
1 222 // epoch, start offset
2 333 // epoch, start offset
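
To make the mapping explicit, here is a hypothetical parser for this layout; it is a sketch, not Kafka's own checkpoint reader:

java
// Sketch: reads the layout above into (epoch, startOffset) pairs.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class LeaderEpochCheckpointReader {
    static List<long[]> read(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        int version = Integer.parseInt(lines.get(0).trim()); // expect 0
        int count = Integer.parseInt(lines.get(1).trim());   // number of entries
        List<long[]> entries = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String[] parts = lines.get(2 + i).trim().split("\\s+");
            entries.add(new long[]{Long.parseLong(parts[0]), Long.parseLong(parts[1])});
        }
        return entries; // [(epoch, startOffset), ...]
    }
}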

Partition#fetchOffsetForTimestamp:

1) basic validation: the leaderEpoch in the request equals the broker's own, and the broker is the leader;

2) with the LATEST strategy (default), return an offset plus the current leaderEpoch, depending on the isolation level:

READ_COMMITTED: returns the LSO (covered later with transactional messages);

READ_UNCOMMITTED (default): returns the HW (high watermark).

scala
def fetchOffsetForTimestamp(timestamp: Long,
                              isolationLevel: Option[IsolationLevel],
                              currentLeaderEpoch: Optional[Integer],
                              fetchOnlyFromLeader: Boolean): 
  // leaderAndIsr read lock: leader and ISR cannot change meanwhile
  Option[TimestampAndOffset] = inReadLock(leaderIsrUpdateLock) {
    // 1. verify the partition leaderEpoch sent by the client equals our own
    val localLog = localLogWithEpochOrException(currentLeaderEpoch, fetchOnlyFromLeader)
    // max fetchable offset
    val lastFetchableOffset = isolationLevel match {
      // transactional: a READ_COMMITTED consumer can only consume up to the LSO
      case Some(IsolationLevel.READ_COMMITTED) => localLog.lastStableOffset
      // ordinary consumer: can consume up to the HW = min(LEO across ISR replicas)
      case Some(IsolationLevel.READ_UNCOMMITTED) => localLog.highWatermark
      case None => localLog.logEndOffset
    }
    // look up an offset by timestamp
    def getOffsetByTimestamp: Option[TimestampAndOffset] = {
      logManager.getLog(topicPartition).flatMap(log => log.fetchOffsetByTimestamp(timestamp))
    }
    timestamp match {
      case ListOffsetRequest.LATEST_TIMESTAMP =>
        // 2. by isolation level, return offset = HW or LSO, plus the current leaderEpoch
        Some(new TimestampAndOffset(RecordBatch.NO_TIMESTAMP, 
                                    lastFetchableOffset, 
                                    Optional.of(leaderEpoch)))
      case ListOffsetRequest.EARLIEST_TIMESTAMP =>
        // 3. find the logStartOffset
        getOffsetByTimestamp
    }
  }

Log#fetchOffsetByTimestamp: for the EARLIEST strategy, returns logStartOffset plus the leaderEpoch under which that offset was written.

scala
// records the start offset of each leaderEpoch
// (0,100),(1,222),...
var leaderEpochCache: Option[LeaderEpochFileCache] = None
def fetchOffsetByTimestamp(targetTimestamp: Long): 
              Option[TimestampAndOffset] = {
  if (targetTimestamp == ListOffsetRequest.EARLIEST_TIMESTAMP) {
    // find the earliest leaderEpoch entry
    val earliestEpochEntry = leaderEpochCache.flatMap(_.earliestEntry)
    val epochOpt = earliestEpochEntry match {
      // if logStartOffset falls inside the first leaderEpoch, return that epoch
      case Some(entry) if entry.startOffset <= logStartOffset => Optional.of[Integer](entry.epoch)
      case _ => Optional.empty[Integer]()
    }
    return Some(new TimestampAndOffset(RecordBatch.NO_TIMESTAMP, logStartOffset, epochOpt))
  }
  // ... (LATEST and timestamp-based lookups elided)
}

ListOffsetResponse

java
public class ListOffsetResponse extends AbstractResponse {
    // partition -> partition data
    private final Map<TopicPartition, PartitionData> responseData;
}
public static final class PartitionData {
    // always -1
    public final Long timestamp;
    // offset
    public final Long offset;
    // the partition leader epoch for this offset
    public final Optional<Integer> leaderEpoch;
}

2-2-3. Consumer handles ListOffsetResponse

SubscriptionState#maybeSeekUnvalidated: resets the offset and builds a FetchPosition without a leaderEpoch.

java
synchronized void maybeSeekUnvalidated(TopicPartition tp, long offset, 
              OffsetResetStrategy requestedResetStrategy) {
    TopicPartitionState state = assignedStateOrNull(tp);
    // ...
    log.info("Resetting offset for partition {} to offset {}.", tp, offset);
    state.seekUnvalidated(new FetchPosition(offset));
}

TopicPartitionState#validatePosition: in the reset case there is no leaderEpoch, so AWAIT_VALIDATION is skipped and the state goes straight to FETCHING.

java
private void validatePosition(FetchPosition position) {
    if (position.offsetEpoch.isPresent() && position.currentLeader.epoch.isPresent()) {
        // leaderEpoch present, i.e. not the reset case
        transitionState(FetchStates.AWAIT_VALIDATION, () -> {
            this.position = position;
            this.nextRetryTimeMs = null;
        });
    } else {
        // no leaderEpoch: flip straight to FETCHING
        transitionState(FetchStates.FETCHING, () -> {
            this.position = position;
            this.nextRetryTimeMs = null;
        });
    }
}

2-3. Validating the Offset

2-3-1. Consumer sends OffsetsForLeaderEpochRequest

Fetcher#validateOffsetsIfNeeded: every user poll checks whether the partition leaderEpoch has changed; if it has, the fetch state flips to AWAIT_VALIDATION.

java
private boolean maybeValidatePosition(
      // the partition leaderEpoch from the consumer's metadata
      Metadata.LeaderAndEpoch currentLeaderAndEpoch) {
    if (this.fetchState.equals(FetchStates.AWAIT_RESET)) {
        return false;
    }
    if (!currentLeaderAndEpoch.leader.isPresent()) {
        return false;
    }
    if (position != null && 
        !position.currentLeader.equals(currentLeaderAndEpoch)) {
        // the leaderEpoch in the FetchPosition differs from the one in the metadata:
        // flip the partition fetch state to AWAIT_VALIDATION
        FetchPosition newPosition = new FetchPosition(position.offset, position.offsetEpoch, currentLeaderAndEpoch);
        validatePosition(newPosition);
        preferredReadReplica = null;
    }
    return this.fetchState.equals(FetchStates.AWAIT_VALIDATION);
}

Fetcher#validateOffsetsAsync: for partitions in AWAIT_VALIDATION, sends an OffsetsForLeaderEpochRequest to the partition leader broker.

OffsetsForLeaderEpochRequest

java
public class OffsetsForLeaderEpochRequest extends AbstractRequest {
    // partition -> partition data
    private final Map<TopicPartition, PartitionData> epochsByPartition;
    // ordinary consumers always use -1
    private final int replicaId;
}
public static class PartitionData {
    // current leaderEpoch --- the broker verifies its epoch matches what the consumer saw
    public final Optional<Integer> currentLeaderEpoch;
    // the leaderEpoch of the record at the offset being validated
    public final int leaderEpoch;
}

2-3-2. Leader broker handles OffsetsForLeaderEpochRequest

Partition#lastOffsetForLeaderEpoch:

1) confirm it is the leader and that request.currentLeaderEpoch equals its own epoch;

2) look up the end offset of the epoch given by request.leaderEpoch.

scala
def lastOffsetForLeaderEpoch(currentLeaderEpoch: Optional[Integer],
           leaderEpoch: Int,
           fetchOnlyFromLeader: Boolean): EpochEndOffset = {
    // leader/ISR read lock: neither changes meanwhile
    inReadLock(leaderIsrUpdateLock) {
      // 1. verify the client's currentLeaderEpoch equals the current leaderEpoch
      val localLogOrError = getLocalLog(currentLeaderEpoch, fetchOnlyFromLeader)
      localLogOrError match {
        case Left(localLog) =>
          // 2. for the epoch the client wants validated, return that epoch and its end offset
          localLog.endOffsetForEpoch(leaderEpoch) match {
            case Some(epochAndOffset) => new EpochEndOffset(NONE, epochAndOffset.leaderEpoch, epochAndOffset.offset)
            case None => new EpochEndOffset(NONE, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
          }
        case Right(error) =>
          new EpochEndOffset(error, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
      }
    }
  }

LeaderEpochFileCache#endOffsetFor: using the leaderEpochCache, returns the end offset of the requested epoch, i.e. the offset right after its last record. For example, if epoch=1 covers offsets [1,10], it returns epoch=1 and offset=11.

scala
def endOffsetForEpoch(leaderEpoch: Int): Option[OffsetAndEpoch] = {
  leaderEpochCache.flatMap { cache =>
    val (foundEpoch, foundOffset) = cache.endOffsetFor(leaderEpoch)
    if (foundOffset == EpochEndOffset.UNDEFINED_EPOCH_OFFSET)
      None
    else
      Some(OffsetAndEpoch(foundOffset, foundEpoch))
  }
}
def endOffsetFor(requestedEpoch: Int): (Int, Long) = {
  inReadLock(lock) {
    val epochAndOffset =
      if (requestedEpoch == UNDEFINED_EPOCH) {
        (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
      } else if (latestEpoch.contains(requestedEpoch)) {
        // current leaderEpoch (latestEpoch) == requested epoch: just return the LEO
        (requestedEpoch, logEndOffset())
      } else {
        // subsequentEpochs = (epoch, startOffset) entries with epoch greater than the requested one
        // previousEpochs = entries with epoch less than or equal to the requested one
        val (subsequentEpochs, previousEpochs) = 
                epochs.partition { e => e.epoch > requestedEpoch}
        if (subsequentEpochs.isEmpty) {
          // no entry greater than the requested epoch: undefined
          (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
        } else if (previousEpochs.isEmpty) {
          // no entry less than or equal to the requested epoch:
          // return the requested epoch + the first startOffset above it
          (requestedEpoch, subsequentEpochs.head.startOffset)
        } else {
          // entries on both sides:
          // return the last epoch <= requested + the first startOffset above it
          (previousEpochs.last.epoch, subsequentEpochs.head.startOffset)
        }
      }
    epochAndOffset
  }
}
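
To make the partition logic concrete, here is a standalone re-implementation sketch of the same lookup, assuming the example cache [(0,0),(1,222),(2,333)] and an LEO of 400 (the method and types are invented for illustration):

java
// Sketch: epochs = sorted {epoch, startOffset} pairs.
static long[] endOffsetFor(long requestedEpoch, long[][] epochs, long logEndOffset) {
    long latestEpoch = epochs[epochs.length - 1][0];
    if (requestedEpoch == latestEpoch)
        return new long[]{requestedEpoch, logEndOffset}; // latest epoch: end offset is the LEO
    for (int i = 0; i < epochs.length; i++) {
        if (epochs[i][0] > requestedEpoch) {
            // first epoch greater than the requested one: its startOffset is the end offset
            long prevEpoch = (i == 0) ? requestedEpoch : epochs[i - 1][0];
            return new long[]{prevEpoch, epochs[i][1]};
        }
    }
    return new long[]{-1, -1}; // UNDEFINED_EPOCH / UNDEFINED_EPOCH_OFFSET
}
// endOffsetFor(1, {{0,0},{1,222},{2,333}}, 400) -> (1, 333)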

OffsetsForLeaderEpochResponse: exactly what endOffsetFor returns.

java
public class OffsetsForLeaderEpochResponse extends AbstractResponse {
  private final Map<TopicPartition, EpochEndOffset> 
              epochEndOffsetsByPartition;  
}
public class EpochEndOffset {
    private int leaderEpoch;
    private long endOffset;
}

2-3-3. Consumer handles OffsetsForLeaderEpochResponse

SubscriptionState#maybeCompleteValidation: if response.endOffset is smaller than the current position, a new leader election probably truncated the log, so the position is truncated down to response.endOffset. Once handled, the state flips to FETCHING and consumption can begin.

java
public synchronized Optional<LogTruncation> 
      maybeCompleteValidation(TopicPartition tp,
                FetchPosition requestPosition,
               EpochEndOffset epochEndOffset) {
    TopicPartitionState state = assignedStateOrNull(tp);
    SubscriptionState.FetchPosition currentPosition = state.position;
    if (epochEndOffset.endOffset() < currentPosition.offset) {
        // the returned offset is below the current fetch offset:
        // a leader re-election probably truncated the log;
        // update the FetchPosition to the offset the broker returned
        SubscriptionState.FetchPosition newPosition = 
          new SubscriptionState.FetchPosition(
                epochEndOffset.endOffset(), 
                Optional.of(epochEndOffset.leaderEpoch()),
                currentPosition.currentLeader);
        // flip the state to FETCHING
        state.seekValidated(newPosition);
    } else {
        // flip the state to FETCHING
        state.completeValidation();
    }
    return Optional.empty();
}

3. Fetching Messages

3-1. Main Flow

KafkaConsumer#pollForFetches:

The consumer first takes already-fetched messages from its local cache; otherwise it builds FetchRequests and pulls from the brokers.

java
private Map<TopicPartition, List<ConsumerRecord<K, V>>> 
                                  pollForFetches(Timer timer) {
    long pollTimeout = coordinator == null ? timer.remainingMs() :
            // min(next heartbeat, next auto commit, user timer deadline)
            Math.min(coordinator.timeToNextPoll(timer.currentTimeMs()), timer.remainingMs());
    // 1. first take already-fetched messages from the cache
    final Map<TopicPartition, List<ConsumerRecord<K, V>>> records = fetcher.fetchedRecords();
    if (!records.isEmpty()) {
        return records;
    }
    // 2. build FetchRequests
    fetcher.sendFetches();
    // 3. network IO
    Timer pollTimer = time.timer(pollTimeout);
    client.poll(pollTimer, () -> {
        return !fetcher.hasAvailableFetches();
    });
    timer.update(pollTimer.currentTimeMs());
    // 4. return whatever has landed in the cache
    return fetcher.fetchedRecords();
}

3-2. Fetching Messages from the Broker

3-2-1. The Incremental FetchRequest

Incremental fetch was proposed in KIP-227 and shipped in version 1.1.

Suppose a Kafka cluster has tens of thousands of partitions, say topic0p0 through topicNpN.

In replication, every follower FetchRequest used to carry the full partition list, and the leader had to produce metadata (HW and so on) for every partition in that list, even partitions with no new messages since the last request.

In clusters with huge partition counts, this full-enumeration FetchRequest wastes significant network bandwidth and CPU, limiting scalability and conflicting with low-latency replication.

KIP-227's answer is the incremental FetchRequest: establish a session and transmit only the partitions that changed.

The consumer keeps one FetchSessionHandler per broker to maintain the session state.

java
public class Fetcher<K, V> implements Closeable {
  // brokerId -> session
  private final Map<Integer, FetchSessionHandler> sessionHandlers;
}
public class FetchSessionHandler {
    // brokerId
    private final int node;
    // (sessionId, epoch) pair
    private FetchMetadata nextMetadata = FetchMetadata.INITIAL;
    // partition -> partition data
    private LinkedHashMap<TopicPartition, PartitionData> sessionPartitions =
        new LinkedHashMap<>(0);
}
// cached per-partition fetch data: FetchRequest.PartitionData
public static final class PartitionData {
    // fetch position
    public final long fetchOffset;
    // logStartOffset
    public final long logStartOffset;
    // max.partition.fetch.bytes=1mb
    // max bytes fetched per partition in one FetchRequest
    public final int maxBytes;
    // current leader epoch
    public final Optional<Integer> currentLeaderEpoch;
}
public class FetchMetadata {
    public static final int INVALID_SESSION_ID = 0;
    public static final int INITIAL_EPOCH = 0;
    // initial (0,0)
    public static final FetchMetadata INITIAL = new FetchMetadata(INVALID_SESSION_ID, INITIAL_EPOCH);
    // session id
    private final int sessionId;
    // epoch
    private final int epoch;
}

The broker maintains n sessions (FetchSession) in a FetchSessionCache.

FetchSession.partitionMap holds each partition's current fetch state (CachedPartition), analogous to FetchSessionHandler.sessionPartitions on the consumer side.

scala
class FetchSessionCache {
  // sessionId -> session
  private val sessions = new mutable.HashMap[Int, FetchSession]
}
class FetchSession(val id: Int, // sessionId
                   //...
                   // partition -> partition data
                   val partitionMap: FetchSession.CACHE_MAP,
                   var epoch: Int) { // sessionEpoch
}
// CACHE_MAP is an alias; effectively a map
type CACHE_MAP = ImplicitLinkedHashCollection[CachedPartition]
class CachedPartition(val topic: String,
                      val partition: Int,
                      var maxBytes: Int,
                      var fetchOffset: Long,
                      var highWatermark: Long,
                      var leaderEpoch: Optional[Integer],
                      var fetcherLogStartOffset: Long,
                      var localLogStartOffset: Long) {}

FetchRequests between a consumer and a broker maintain state through the sessionId and preserve ordering through the sessionEpoch:

sessionId: the session id, created and returned by the broker on the first full FetchRequest;

sessionEpoch: the session epoch, incremented by the broker on every request and by the consumer on every response.
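
For example (the ids here are invented): the first request carries (sessionId=0, epoch=0) and enumerates all partitions; the broker creates a session and replies with sessionId=123; the second request carries (sessionId=123, epoch=1) with only the partitions that changed, the third (sessionId=123, epoch=2), and so on.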

3-2-2. Consumer sends FetchRequest

Fetcher#sendFetches: loops over the brokers and sends one FetchRequest to each.

java
public synchronized int sendFetches() {
    // prepare the per-broker data
    Map<Node, FetchSessionHandler.FetchRequestData> fetchRequestMap = prepareFetchRequests();
    for (Map.Entry<Node, FetchSessionHandler.FetchRequestData> entry : 
               fetchRequestMap.entrySet()) {
        final Node fetchTarget = entry.getKey();
        final FetchSessionHandler.FetchRequestData data = entry.getValue();
        final FetchRequest.Builder request = FetchRequest.Builder
                .forConsumer(this.maxWaitMs, this.minBytes, data.toSend())
                .isolationLevel(isolationLevel)
                .setMaxBytes(this.maxBytes)
                .metadata(data.metadata())
                .toForget(data.toForget())
                .rackId(clientRackId);
        RequestFuture<ClientResponse> future = client.send(fetchTarget, request);
        this.nodesWithPendingFetchRequests.add(entry.getKey().id());
        future.addListener(new RequestFutureListener<ClientResponse>() {
            // ... see 3-2-4
        });
    }
    return fetchRequestMap.size();
}

Fetcher#prepareFetchRequests: builds the FetchRequestData for each broker.

java
private Map<Node, FetchSessionHandler.FetchRequestData> prepareFetchRequests() {
    Map<Node, FetchSessionHandler.Builder> fetchable = new LinkedHashMap<>();
    // detect leaderEpoch changes; affected partitions flip to AWAIT_VALIDATION
    subscriptions.assignedPartitions().forEach(tp ->
        subscriptions.maybeValidatePositionForCurrentLeader(apiVersions, tp, metadata.currentLeader(tp))
    );
    long currentTimeMs = time.milliseconds();
    // FETCHING partitions with no cached messages
    for (TopicPartition partition : fetchablePartitions()) {
        FetchPosition position = this.subscriptions.position(partition);
        Optional<Node> leaderOpt = position.currentLeader.leader;
        if (!leaderOpt.isPresent()) {
            // no leader: request a metadata update and skip
            metadata.requestUpdate();
            continue;
        }
        // by default this picks the leader; #6832 (since 2.4) allows fetching from a follower
        Node node = selectReadReplica(partition, leaderOpt.get(), currentTimeMs);
        if (client.isUnavailable(node)) {
            // node is offline
            client.maybeThrowAuthFailure(node);
        } else if (this.nodesWithPendingFetchRequests.contains(node.id())) {
           // this broker still has an in-flight FetchRequest; skip
           // log.trace...
        } else {
            // normal case
            FetchSessionHandler.Builder builder = fetchable.get(node);
            if (builder == null) {
                int id = node.id();
                // get the broker's FetchSessionHandler
                FetchSessionHandler handler = sessionHandler(id);
                if (handler == null) {
                    handler = new FetchSessionHandler(logContext, id);
                    sessionHandlers.put(id, handler);
                }
                builder = handler.newBuilder();
                fetchable.put(node, builder);
            }
            // add the partition to handler.builder,
            // including the fetch position position.offset
            builder.add(partition, new FetchRequest.PartitionData(
              position.offset, FetchRequest.INVALID_LOG_START_OFFSET, 
              this.fetchSize, position.currentLeader.epoch));
        }
    }
    // build the per-broker FetchRequestData; note Builder.build
    Map<Node, FetchSessionHandler.FetchRequestData> reqs = new LinkedHashMap<>();
    for (Map.Entry<Node, FetchSessionHandler.Builder> entry : fetchable.entrySet()) {
        reqs.put(entry.getKey(), entry.getValue().build());
    }
    return reqs;
}

FetchSessionHandler.Builder#add & build:

1) if the session is not yet established (nextMetadata.isFull), build the full partition set;

2) if the session exists, diff against the current sessionPartitions and send only the changed partitions.

java
public class FetchSessionHandler {
    // (sessionId, epoch) pair
    private FetchMetadata nextMetadata;
    // partition -> current fetch data
    private LinkedHashMap<TopicPartition, PartitionData> sessionPartitions;
    public class Builder {
        // partitions to fetch in this round
        private LinkedHashMap<TopicPartition, PartitionData> next;
        public void add(TopicPartition topicPartition, PartitionData data) {
            next.put(topicPartition, data);
        }
    }
}
// Builder#build
public FetchRequestData build() {
    // return (this.epoch == INITIAL_EPOCH) || (this.epoch == FINAL_EPOCH);
    if (nextMetadata.isFull()) {
        // 1. full partition set
        sessionPartitions = next;
        next = null;
        Map<TopicPartition, PartitionData> toSend =
            Collections.unmodifiableMap(new LinkedHashMap<>(sessionPartitions));
        return new FetchRequestData(toSend, Collections.emptyList(), toSend, nextMetadata);
    }
    // 2. incremental diff
    List<TopicPartition> added = new ArrayList<>();
    List<TopicPartition> removed = new ArrayList<>();
    List<TopicPartition> altered = new ArrayList<>();
    // iterate over the partitions already in the session
    for (Iterator<Entry<TopicPartition, PartitionData>> iter =
             sessionPartitions.entrySet().iterator(); iter.hasNext(); ) {
        Entry<TopicPartition, PartitionData> entry = iter.next();
        TopicPartition topicPartition = entry.getKey();
        PartitionData prevData = entry.getValue();
        PartitionData nextData = next.remove(topicPartition);
        if (nextData != null) {
            // case 1: partition changed (e.g. new fetch offset), add to altered
            if (!prevData.equals(nextData)) {
                next.put(topicPartition, nextData);
                entry.setValue(nextData);
                altered.add(topicPartition);
            }
        } else {
            // case 2: partition removed (rebalance), add to removed
            iter.remove();
            removed.add(topicPartition);
        }
    }
    // case 3: partitions added (rebalance)
    for (Entry<TopicPartition, PartitionData> entry : next.entrySet()) {
        TopicPartition topicPartition = entry.getKey();
        PartitionData nextData = entry.getValue();
        if (sessionPartitions.containsKey(topicPartition)) {
            break;
        }
        sessionPartitions.put(topicPartition, nextData);
        added.add(topicPartition);
    }
    Map<TopicPartition, PartitionData> toSend = Collections.unmodifiableMap(next);
    Map<TopicPartition, PartitionData> curSessionPartitions = 
      Collections.unmodifiableMap(new LinkedHashMap<>(sessionPartitions));
    next = null;
    // toSend = added or altered partitions, toForget = removed partitions
    return new FetchRequestData(toSend, Collections.unmodifiableList(removed),
        curSessionPartitions, nextMetadata);
}

The returned FetchRequestData contains:

java
public static class FetchRequestData {
    // partitions that need data in this request -> partition data
    private final Map<TopicPartition, PartitionData> toSend;
    // partitions to stop consuming
    private final List<TopicPartition> toForget;
    // all partitions being consumed -> partition data
    private final Map<TopicPartition, PartitionData> sessionPartitions;
    // (sessionId, epoch) pair
    private final FetchMetadata metadata;
}
// FetchRequest.PartitionData
public static final class PartitionData {
    // fetch position
    public final long fetchOffset;
    // always -1
    public final long logStartOffset;
    // max.partition.fetch.bytes=1mb
    // max bytes fetched per partition in one FetchRequest
    public final int maxBytes;
    // current partition leaderEpoch
    public final Optional<Integer> currentLeaderEpoch;
}

FetchRequest: note the long-poll parameters passed by the consumer:

1) minBytes = fetch.min.bytes, default 1 byte: the broker responds as soon as any messages exist;

2) maxWait = fetch.max.wait.ms, default 500ms: how long the broker parks the FetchRequest while fewer than minBytes are available.

java
public class FetchRequest extends AbstractRequest {
    // replica id; consumers always use -1
    private final int replicaId;
    // fetch.max.wait.ms=500
    // how long the server blocks the FetchRequest while fewer than minBytes bytes are available
    private final int maxWait;
    // fetch.min.bytes=1
    // the FetchRequest is blocked on the server until minBytes bytes are available
    private final int minBytes;
    // fetch.max.bytes=50mb: max bytes per FetchRequest
    private final int maxBytes;
    // isolation level; without transactions, default READ_UNCOMMITTED
    private final IsolationLevel isolationLevel;
    // partitions to consume that were added or altered -> FetchRequest.PartitionData
    private final Map<TopicPartition, PartitionData> fetchData;
    // partitions to stop consuming
    private final List<TopicPartition> toForget;
    // (sessionId, epoch) pair
    private final FetchMetadata metadata;
    // client.rack=null, ignore
    private final String rackId;
}
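
These request fields map onto consumer configs; a sketch with the real ConsumerConfig keys, where the values shown are the defaults:

java
// Sketch: tuning the long-poll behaviour described above.
Properties props = new Properties();
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);                      // minBytes
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);                  // maxWait (ms)
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 50 * 1024 * 1024);       // maxBytes
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1024 * 1024);  // per-partition cap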

3-2-3. Broker handles FetchRequest

3-2-3-1. Updating the Session

FetchManager#newContext:

1) if the sessionId is empty, create a FullFetchContext; the new session is created when the response is built;

2) otherwise update the session's partitions from the FetchRequest, and increment the session epoch.

scala
class FetchManager(private val time: Time,
                   private val cache: FetchSessionCache) extends Logging {
  def newContext(reqMetadata: JFetchMetadata,
                 fetchData: FetchSession.REQ_MAP,
                 toForget: util.List[TopicPartition],
                 isFollower: Boolean): FetchContext = {
    val context = if (reqMetadata.isFull) {
      val context = if (reqMetadata.epoch == FINAL_EPOCH) {
        new SessionlessFetchContext(fetchData)
      } else {
        // the new session is created when the response is built
        new FullFetchContext(time, cache, reqMetadata, fetchData, isFollower)
      }
      context
    } else {
      cache.synchronized {
        cache.get(reqMetadata.sessionId) match {
          case Some(session) => session.synchronized {
            // update the session: which partitions to update, which to remove
            val (added, updated, removed) = session.update(fetchData, toForget, reqMetadata)
            if (session.isEmpty) {
              cache.remove(session)
              new SessionlessFetchContext(fetchData)
            } else {
              cache.touch(session, time.milliseconds())
              // session epoch +1
              session.epoch = JFetchMetadata.nextEpoch(session.epoch)
              new IncrementalFetchContext(time, reqMetadata, session)
            }
          }
          // case None (session not found) elided: returns a SessionErrorContext
        }
      }
    }
    context
  }
}

3-2-3-2. Reading Messages

A recap of chapter 4.

Under the data directory, each partition has its own directory named {Topic}-{PartitionId}, e.g. Test1-0; it stores that partition's data, which is called the Log.

txt
tree -a
.
├── .kafka_cleanshutdown
├── .lock
├── Test1-0 # topic-partitionId
│   ├── 00000000000000000000.index
│   ├── 00000000000000000000.log
│   ├── 00000000000000000000.timeindex
│   ├── 00000000000000001872.index
│   ├── 00000000000000001872.log
│   ├── 00000000000000001872.timeindex
│   └── leader-epoch-checkpoint
├── cleaner-offset-checkpoint
├── log-start-offset-checkpoint
├── meta.properties
├── recovery-point-offset-checkpoint
└── replication-offset-checkpoint

A Log consists of multiple Segments: 0000... is the first segment and 1872... the second. The file name is the segment's starting offset (baseOffset); segment 0000 stores offsets [0,1872), i.e. 1872 messages.

Each Segment consists of 4 files:

a) log: the message data file, append-only; exceeding segment.bytes=1G triggers a roll;

b) index: the offset index, read and written via mmap, size segment.index.bytes=10M;

c) timeindex: the time index;

d) txnindex: the aborted-transaction index.

Messages are written in batch format, and the broker emits one index entry for every 4KB written.

Fetching uses the offset index; each index entry has two parts:

1) key = offset - baseOffset (4 bytes): the end offset of a batch, i.e. the offset of the batch's last message;

2) value = the batch's physical start position in the log file (4 bytes).

KafkaApis#handleFetchRequest: loops over all partitions in the session and reads messages.

scala
def handleFetchRequest(request: RequestChannel.Request): Unit = {
    val fetchRequest = request.body[FetchRequest]
    // update the session and create the fetchContext
    val fetchContext = fetchManager.newContext(
      fetchRequest.metadata, fetchRequest.fetchData,
      fetchRequest.toForget, fetchRequest.isFromFollower)
    val interesting = mutable.ArrayBuffer[(TopicPartition, FetchRequest.PartitionData)]()
    if (fetchRequest.isFromFollower) {

    } else {
      val partitionMap = new mutable.ArrayBuffer[(TopicPartition, FetchRequest.PartitionData)]
      // loop over every partition in the session
      fetchContext.foreachPartition { (topicPartition, partitionData) =>
        // partition -> requested partition data
        partitionMap += topicPartition -> partitionData
      }
      partitionMap.foreach { case (topicPartition, data) =>
           // permissions...
           // partition -> requested partition data
          interesting += (topicPartition -> data)
      }
    }
    if (interesting.isEmpty)
      processResponseCallback(Seq.empty)
    else {
      replicaManager.fetchMessages(
        fetchRequest.maxWait.toLong, fetchRequest.replicaId,
        fetchMinBytes, fetchMaxBytes,
        versionId <= 2, interesting, // session partitions -> partition data
        replicationQuota(fetchRequest), processResponseCallback,
        fetchRequest.isolationLevel, clientMetadata)
    }
}

Log#read: reads messages for one partition.

1) based on the isolation level, determine the max readable offset and its physical position;

2) call Segment#read with startOffset = the fetch position, maxLength = the partition's max readable bytes, maxPosition = the max readable physical position; it returns a FetchDataInfo.

scala
def read(startOffset: Long,
       maxLength: Int,
       isolation: FetchIsolation,
       minOneMessage: Boolean): FetchDataInfo = {
  val includeAbortedTxns = isolation == FetchTxnCommitted
  // LEO (LogEndOffset), the current write position
  val endOffsetMetadata = nextOffsetMetadata
  val endOffset = endOffsetMetadata.messageOffset
  // find the starting segment from the fetch position
  var segmentEntry = segments.floorEntry(startOffset)
  // the isolation level decides the max readable offset
  val maxOffsetMetadata = isolation match {
    // follower fetch: the LEO
    case FetchLogEnd => endOffsetMetadata
    // consumer fetch, READ_UNCOMMITTED: the high watermark
    case FetchHighWatermark => fetchHighWatermarkMetadata
    // consumer fetch, transactional READ_COMMITTED: the LastStableOffset (LSO)
    case FetchTxnCommitted => fetchLastStableOffsetMetadata
  }
  if (startOffset == maxOffsetMetadata.messageOffset) {
    // fetch position == max readable offset: nothing to read, return empty
    return emptyFetchDataInfo(maxOffsetMetadata, includeAbortedTxns)
  } 
  while (segmentEntry != null) {
    val segment = segmentEntry.getValue
    // the segment's max readable physical position
    val maxPosition = {
      if (maxOffsetMetadata.segmentBaseOffset == segment.baseOffset) {
        // max readable offset is in this segment: max = e.g. the HW's physical position
        maxOffsetMetadata.relativePositionInSegment
      } else {
        // max readable offset is not in this segment: max = the segment size
        segment.size
      }
    }
    // segment read
    val fetchInfo = segment.read(startOffset, maxLength, maxPosition, minOneMessage)
    if (fetchInfo == null) {
      segmentEntry = segments.higherEntry(segmentEntry.getKey)
    } else {
      // return the fetchInfo
      return if (includeAbortedTxns)
        addAbortedTransactions(startOffset, segmentEntry, fetchInfo)
      else
        fetchInfo
    }
  }
  FetchDataInfo(nextOffsetMetadata, MemoryRecords.EMPTY)
}

LogSegment#read: [core method] uses the offset index to locate the readable range of the log file.

scala
  def read(startOffset: Long, // fetch position
           maxSize: Int, // max bytes returned per partition, 1mb
           maxPosition: Long = size, // physical position of e.g. the high watermark
           minOneMessage: Boolean = false): FetchDataInfo = {
  // [focus] in: the fetch position; out: the first batch to read (physical position, size)
  // the core index-and-log lookup
  val startOffsetAndSize = translateOffset(startOffset)
  if (startOffsetAndSize == null)
    return null
  val startPosition = startOffsetAndSize.position
  // fetch position, segment base offset, starting physical position
  val offsetMetadata = LogOffsetMetadata(startOffset, this.baseOffset, startPosition)
  // minOneMessage=true: this fetch has read no messages yet,
  // so read the batch even if it is larger than maxSize
  val adjustedMaxSize =
    if (minOneMessage) math.max(maxSize, startOffsetAndSize.size)
    else maxSize
  if (adjustedMaxSize == 0)
    return FetchDataInfo(offsetMetadata, MemoryRecords.EMPTY)
  val fetchSize: Int = min((maxPosition - startPosition).toInt, adjustedMaxSize)
  // [focus] log = FileRecords = this segment's log file,
  // sliced by (start physical position, max readable size)
  FetchDataInfo(offsetMetadata, log.slice(startPosition, fetchSize),
    firstEntryIncomplete = adjustedMaxSize < startOffsetAndSize.size)
}

Key points:

1) On the producer side, multiple messages are merged into one batch before sending; the broker stores the original batches; the consumer fetches whole batches. If the fetch has not yet read any message (minOneMessage=true) and the first batch exceeds the per-partition size limit (consumer config max.partition.fetch.bytes=1mb), the batch is still returned;

2) from the fetch position startOffset, derive the physical position of the first batch to read, startOffsetAndSize.position;

3) from the max fetch bytes maxSize and the physical position limit maxPosition, derive the readable size fetchSize = min(maxPosition - startPosition, maxSize);

4) with 2 and 3, return a slice of the log file (log.slice) covering n message batches.

Suppose the current fetch position is 500; the segment read then works as follows:

Step 1: OffsetIndex#lookup finds the largest index entry with offset <= 500, say (456, 6666).

This is a binary search, optimized by KAFKA-6432: since fetches usually target recent offsets, the search first runs over the newest page of index entries (8192 bytes / 8 bytes per entry = the last 1024 entries), to exploit the mmap page cache. A sketch of the lookup follows.
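
A minimal sketch of Step 1 with hypothetical types, mirroring the floor-style binary search of OffsetIndex#lookup:

java
// index = sorted entries of {relativeOffset, physicalPosition}.
static int[] lookup(int[][] index, long baseOffset, long targetOffset) {
    int target = (int) (targetOffset - baseOffset); // index keys are offsets relative to baseOffset
    int lo = 0, hi = index.length - 1, found = -1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (index[mid][0] <= target) { found = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    // no entry <= target: scan from the segment start (position 0)
    return found >= 0 ? index[found] : new int[]{0, 0};
}
// With an entry (456, 6666) and none closer, lookup(index, 0, 500) returns {456, 6666}.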

Step 2: FileRecords#searchForOffsetWithSize scans forward from the physical position returned by step 1 (6666) to find the first batch containing the requested offset (say its end offset is 555);

Step 3: FileRecords#slice applies the size limits (1mb and the high watermark) and returns the readable range of the log file (say physical position 8888, size 1111); the n batches in it can then be iterated.

java
// is this a slice? true
private final boolean isSlice;
// start physical position
private final int start;
// end physical position
private final int end;
// batch iterator
private final Iterable<FileLogInputStream.FileChannelRecordBatch> batches;
// data size
private final AtomicInteger size;
private final FileChannel channel;
private volatile File file;
public FileRecords slice(int position, int size) throws IOException {
    int currentSizeInBytes = sizeInBytes();
    int end = this.start + position + size;
    if (end < 0 || end > start + currentSizeInBytes)
        end = start + currentSizeInBytes;
    return new FileRecords(file, channel, this.start + position, end, true);
}
FileRecords(File file, FileChannel channel,
            int start, int end, boolean isSlice) throws IOException {
        this.file = file;
        this.channel = channel;
        this.start = start;
        this.end = end;
        this.isSlice = isSlice;
        this.size = new AtomicInteger();
        if (isSlice) {
            // size of the slice
            size.set(end - start);
        } else {
           // ...
        }
        batches = batchesFrom(start);
}
public Iterable<FileChannelRecordBatch> batchesFrom(final int start) {
    return () -> batchIterator(start);
}
private AbstractIterator<FileChannelRecordBatch> batchIterator(int start) {
    final int end;
    if (isSlice) // here
        end = this.end;
    else
        end = this.sizeInBytes();
    FileLogInputStream inputStream = new FileLogInputStream(this, start, end);
    return new RecordBatchIterator<>(inputStream);
}

3-2-3-4. Messages Read: Reply with the Response

ReplicaManager#fetchMessages: if bytesReadable >= fetchMinBytes (default fetchMinBytes = 1 byte), i.e. any partition read some messages, reply with a FetchResponse immediately.

scala
if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {

  val fetchPartitionData = logReadResults.map { case (tp, result) =>
    tp -> FetchPartitionData(result.error, result.highWatermark,
                             result.leaderLogStartOffset, 
                             result.info.records,
                             result.lastStableOffset, ...)
  }
  // send the response
  responseCallback(fetchPartitionData)
} else {
  // nothing read...
}

FullFetchContext#updateAndGenerateResponseData: on the first fetch a session is created; the full per-partition results (updates) go into the fetch response and are cached in the session.

scala
type RESP_MAP = util.LinkedHashMap[TopicPartition, FetchResponse.PartitionData[Records]]
override def updateAndGenerateResponseData(updates: FetchSession.RESP_MAP): FetchResponse[Records] = {
  def createNewSession: FetchSession.CACHE_MAP = {
    val cachedPartitions = new FetchSession.CACHE_MAP(updates.size)
    // loop over all partitions and add them to the session
    updates.forEach { (part, respData) =>
      val reqData = fetchData.get(part)
      cachedPartitions.mustAdd(new CachedPartition(part, reqData, respData))
    }
    cachedPartitions
  }
  // create and cache the session
  val responseSessionId = cache.maybeCreateSession(time.milliseconds(), isFromFollower,
      updates.size, () => createNewSession)
  new FetchResponse(Errors.NONE, updates, 0, responseSessionId)
}
// sessionId generation: a unique random positive int
def newSessionId(): Int = synchronized {
  var id = 0
  do {
    id = ThreadLocalRandom.current().nextInt(1, Int.MaxValue)
  } while (sessions.contains(id) || id == INVALID_SESSION_ID)
  id
}

IncrementalFetchContext#updateAndGenerateResponseData: on subsequent fetches, a PartitionIterator filters which partitions belong in the response: a partition is included when it has data, hit an error, or its HW or other offsets changed.

scala
override def updateAndGenerateResponseData(updates: FetchSession.RESP_MAP): FetchResponse[Records] = {
  session.synchronized {
    // use the iterator to decide which partitions must be in the response
    val partitionIter = new PartitionIterator(updates.entrySet.iterator, true)
    while (partitionIter.hasNext) {
      partitionIter.next()
    }
    new FetchResponse(Errors.NONE, updates, 0, session.id)
  }
}
// CachedPartition#maybeUpdateResponseData: does this partition belong in the response?
def maybeUpdateResponseData(respData: FetchResponse.PartitionData[Records], updateResponseData: Boolean): Boolean = {
  // Check the response data.
  var mustRespond = false
  if ((respData.records != null) && (respData.records.sizeInBytes > 0)) {
    mustRespond = true
  }
  if (highWatermark != respData.highWatermark) {
    mustRespond = true
    if (updateResponseData)
      highWatermark = respData.highWatermark
  }
  if (localLogStartOffset != respData.logStartOffset) {
    mustRespond = true
    if (updateResponseData)
      localLogStartOffset = respData.logStartOffset
  }
  if (respData.preferredReadReplica.isPresent) {
    mustRespond = true
  }
  if (respData.error.code != 0) {
    if (updateResponseData)
      highWatermark = -1
    mustRespond = true
  }
  mustRespond
}

The final FetchResponse is shown below; note that each partition's records field is the FileRecords slice produced above (FileRecords#slice).

java
public class FetchResponse<T extends BaseRecords> extends AbstractResponse {
    // sessionId
    private final int sessionId;
    // partition -> partition data
    private final LinkedHashMap<TopicPartition, PartitionData<T>> responseData;
}
public static final class PartitionData<T extends BaseRecords> {
    public final Errors error;
    // partition high watermark
    public final long highWatermark;
    // partition LSO (transactions)
    public final long lastStableOffset;
    // partition logStartOffset
    public final long logStartOffset;
    public final Optional<Integer> preferredReadReplica;
    public final List<AbortedTransaction> abortedTransactions;
    // the batch data, a FileRecords
    public final T records;
}

Finally, each partition's FileRecords is written to the network channel via transferTo (zero copy).
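
Underneath, this boils down to FileChannel#transferTo; a simplified sketch, where fileChannel, socketChannel, start and count stand in for the real fields rather than being the exact transport-layer code:

java
// Sketch: zero-copy write of a FileRecords slice into the socket channel.
long position = start;   // physical start of the slice in the log file
long remaining = count;  // slice bytes still to send
while (remaining > 0) {
    long sent = fileChannel.transferTo(position, remaining, socketChannel);
    position += sent;
    remaining -= sent;
}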

PlaintextTransportLayer#transferFrom: for example, the figure (omitted here) showed the log of segment 0, physical offset 8409, length 291, containing n batches.

3-2-3-5. No Messages: Park the Request

ReplicaManager#fetchMessages: if no messages were read, the FetchRequest must be parked.

scala
 if (timeout <= 0 || fetchInfos.isEmpty || bytesReadable >= fetchMinBytes || errorReadingData) {
     // 3-2-3-4: data available, respond immediately
} else {
  // partition -> (offset metadata from this read, requested offset)
  val fetchPartitionStatus = new mutable.ArrayBuffer[(TopicPartition, FetchPartitionStatus)]
  fetchInfos.foreach { case (topicPartition, partitionData) =>
    logReadResultMap.get(topicPartition).foreach(logReadResult => {
      val logOffsetMetadata = logReadResult.info.fetchOffsetMetadata
      fetchPartitionStatus += (topicPartition -> FetchPartitionStatus(logOffsetMetadata, partitionData))
    })
  }
  // fetch request params + fetchPartitionStatus
  val fetchMetadata: SFetchMetadata = SFetchMetadata(fetchMinBytes, fetchMaxBytes, hardMaxBytesLimit,
    fetchOnlyFromLeader, fetchIsolation, isFromFollower, replicaId, fetchPartitionStatus)
  val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, quota, clientMetadata,
    responseCallback)
  val delayedFetchKeys = fetchPartitionStatus.map { case (tp, _) => TopicPartitionOperationKey(tp) }
  // try DelayedFetch.tryComplete; on failure park it as a delayed task, delay = maxWait = 500ms
  delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
}

DelayedFetch#tryComplete: before the fetch is parked, the max readable positions are fetched again; if new messages have arrived (at least fetchMinBytes = fetch.min.bytes, default 1 byte), onComplete re-reads the log and completes the response.

scala
override def tryComplete(): Boolean = {
  var accumulatedSize = 0
  fetchMetadata.fetchPartitionStatus.foreach {
    case (topicPartition, fetchStatus) =>
      val fetchOffset = fetchStatus.startOffsetMetadata
      val fetchLeaderEpoch = fetchStatus.fetchInfo.currentLeaderEpoch
      try {
        if (fetchOffset != LogOffsetMetadata.UnknownOffsetMetadata) {
          val partition = replicaManager.getPartitionOrException(topicPartition)
          // re-read the LEO/HW/LSO positions
          val offsetSnapshot = partition.fetchOffsetSnapshot(fetchLeaderEpoch, fetchMetadata.fetchOnlyLeader)
          val endOffset = fetchMetadata.fetchIsolation match {
            case FetchLogEnd => offsetSnapshot.logEndOffset
            case FetchHighWatermark => offsetSnapshot.highWatermark
            case FetchTxnCommitted => offsetSnapshot.lastStableOffset
          }
          // requested offset < readable position: new data has arrived
          if (fetchOffset.messageOffset < endOffset.messageOffset) {
              val bytesAvailable =  math.min(
                    endOffset.positionDiff(fetchOffset), 
                    fetchStatus.fetchInfo.maxBytes)
              // accumulate the readable bytes
              accumulatedSize += bytesAvailable
          }
        }
      } catch {
          // ...
          return forceComplete()
      }
  }
  // readable bytes exceed the default 1 byte: forceComplete re-reads the log and responds
  if (accumulatedSize >= fetchMetadata.fetchMinBytes)
     forceComplete()
  else
    false
}
// once tryComplete succeeds or the timeout fires, read again and respond
override def onComplete(): Unit = {
  val logReadResults = replicaManager.readFromLocalLog(
    // ...
    // partition -> offset
    readPartitionInfo = fetchMetadata.fetchPartitionStatus.map 
    { case (tp, status) => tp -> status.fetchInfo })
  // assemble the result
  val fetchPartitionData = logReadResults.map { case (tp, result) =>
    tp -> FetchPartitionData(result.error, result.highWatermark, result.leaderLogStartOffset, result.info.records,
      result.lastStableOffset, result.info.abortedTransactions, result.preferredReadReplica,
      fetchMetadata.isFromFollower && replicaManager.isAddingReplica(tp, fetchMetadata.replicaId))
  }
  // respond
  responseCallback(fetchPartitionData)
}

While a fetch request is parked, any HW change triggers DelayedFetch#tryComplete again; if the condition is now met, DelayedFetch#onComplete reads the data and replies with the FetchResponse.

Partition#updateFollowerFetchState: for example, a follower fetching from the leader can advance the HW.

scala
def updateFollowerFetchState(followerId: Int,
                               followerFetchOffsetMetadata: LogOffsetMetadata,
                               followerStartOffset: Long,
                               followerFetchTimeMs: Long,
                               leaderEndOffset: Long): Boolean = {
    getReplica(followerId) match {
      case Some(followerReplica) =>
        // ...
        if (leaderLWIncremented || leaderHWIncremented)
          tryCompleteDelayedRequests()
        true
    }
}
// Partition#tryCompleteDelayedRequests
private def tryCompleteDelayedRequests(): Unit = delayedOperations.checkAndCompleteAll()
// DelayedOperations#checkAndCompleteAll
class DelayedOperations(topicPartition: TopicPartition,
                        produce: DelayedOperationPurgatory[DelayedProduce],
                        fetch: DelayedOperationPurgatory[DelayedFetch]) {
    def checkAndCompleteAll(): Unit = {
      val requestKey = TopicPartitionOperationKey(topicPartition)
      // try to complete parked fetch requests
      fetch.checkAndComplete(requestKey)
      // try to complete parked produce requests
      produce.checkAndComplete(requestKey)
      deleteRecords.checkAndComplete(requestKey)
    }
}

If the 500ms maxWait expires, DelayedFetch#onComplete runs anyway: it reads the data and replies with the FetchResponse.

3-2-4. Consumer handles FetchResponse

Fetcher#sendFetches:

1) the FetchSessionHandler validates that the FetchResponse is consistent with the session;

2) the fetched data is appended to the completedFetches queue.

java
private final ConcurrentLinkedQueue<CompletedFetch> completedFetches;
public synchronized int sendFetches() {
  // loop over the brokers and send fetch requests
  // ...
  // handle each fetch response
  future.addListener(new RequestFutureListener<ClientResponse>() {
    @Override
    public void onSuccess(ClientResponse resp) {
        synchronized (Fetcher.this) {
            try {
              FetchResponse<Records> response = (FetchResponse<Records>) resp.responseBody();
              // 1. session bookkeeping
              FetchSessionHandler handler = sessionHandler(fetchTarget.id());
              if (!handler.handleResponse(response)) {
                  return;
              }
              Set<TopicPartition> partitions = new HashSet<>(response.responseData().keySet());
              FetchResponseMetricAggregator metricAggregator = new FetchResponseMetricAggregator(sensors, partitions);
              for (Map.Entry<TopicPartition, FetchResponse.PartitionData<Records>> entry : response.responseData().entrySet()) {
                  TopicPartition partition = entry.getKey();
                  FetchRequest.PartitionData requestData = data.sessionPartitions().get(partition);
                  long fetchOffset = requestData.fetchOffset;
                  FetchResponse.PartitionData<Records> partitionData = entry.getValue();
                  Iterator<? extends RecordBatch> batches = partitionData.records.batches().iterator();
                  short responseVersion = resp.requestHeader().apiVersion();
                  // 2. append to the completedFetches queue
                  completedFetches.add(new CompletedFetch(partition, partitionData,
                          metricAggregator, batches, fetchOffset, responseVersion));
              }
            } finally {
                nodesWithPendingFetchRequests.remove(fetchTarget.id());
            }
        }
    }

    @Override
    public void onFailure(RuntimeException e) {
        synchronized (Fetcher.this) {
            try {
                FetchSessionHandler handler = sessionHandler(fetchTarget.id());
                if (handler != null) {
                    handler.handleError(e);
                }
            } finally {
                nodesWithPendingFetchRequests.remove(fetchTarget.id());
            }
        }
    }
  });
  return fetchRequestMap.size();
}

FetchSessionHandler#handleResponse:

1) on an error response, reset the session to INITIAL_EPOCH so the next request is a full FetchRequest;

2) for a new session, verify the response partitions equal the full session partitions, store response.sessionId, epoch++;

3) for an existing session, verify the response partitions are all contained in the session, epoch++.

java
public boolean handleResponse(FetchResponse<?> response) {
    if (response.error() != Errors.NONE) {
        log.info("Node {} was unable to process the fetch request with {}: {}.",
            node, nextMetadata, response.error());
        if (response.error() == Errors.FETCH_SESSION_ID_NOT_FOUND) {
            nextMetadata = FetchMetadata.INITIAL;
        } else {
            // new FetchMetadata(sessionId, INITIAL_EPOCH);
            nextMetadata = nextMetadata.nextCloseExisting();
        }
        return false;
    }
    if (nextMetadata.isFull()) {
        // new session: verify the response partitions equal the full session partitions
        String problem = verifyFullFetchResponsePartitions(response);
        if (problem != null) {
            nextMetadata = FetchMetadata.INITIAL;
            return false;
        } else if (response.sessionId() == INVALID_SESSION_ID) {
            nextMetadata = FetchMetadata.INITIAL;
            return true;
        } else {
            // sessionId = response.sessionId, epoch++
            nextMetadata = FetchMetadata.newIncremental(response.sessionId());
            return true;
        }
    } else {
        // existing session: verify the response partitions are contained in the session
        String problem = verifyIncrementalFetchResponsePartitions(response);
        if (problem != null) {
            nextMetadata = nextMetadata.nextCloseExisting();
            return false;
        } else if (response.sessionId() == INVALID_SESSION_ID) {
            nextMetadata = FetchMetadata.INITIAL;
            return true;
        } else {
            // epoch++
            nextMetadata = nextMetadata.nextIncremental();
            return true;
        }
    }
}

public void handleError(Throwable t) {
    log.info("Error sending fetch request {} to node {}:", nextMetadata, node, t);
    // new FetchMetadata(sessionId, INITIAL_EPOCH);
    nextMetadata = nextMetadata.nextCloseExisting();
}

3-3. Fetching Messages from the Local Cache

From the rebalance angle (HeartbeatThread#run): if the user takes longer than max.poll.interval.ms (5 minutes) to process one batch returned by poll, the consumer's heartbeat thread proactively sends LeaveGroup to the coordinator, triggering a rebalance.

To avoid that, besides speeding up consumption, you can cap the number of records returned per KafkaConsumer#poll with max.poll.records (default 500).
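
A sketch of the two configs just mentioned; these are the real ConsumerConfig keys and the values shown are the defaults:

java
Properties props = new Properties();
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);         // records per poll()
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000); // max gap between polls, 5 min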

From the FetchRequest angle, however: 1) messages are always fetched as whole batches; 2) each partition is fetched up to a size limit, max.partition.fetch.bytes=1mb.

So on the consumer side, batches are split up according to the actual fetch position and the max record count.

Fetcher#fetchedRecords: iterates the completedFetches cache queue, where each CompletedFetch holds n batches for one partition. After taking records from a CompletedFetch, the partition's position is advanced, so the next FetchRequest uses the new position.

java
// cache of fetched data; each CompletedFetch holds n batches of one partition
private final ConcurrentLinkedQueue<CompletedFetch> completedFetches;
// the CompletedFetch currently being consumed
private CompletedFetch nextInLineFetch = null;

public Map<TopicPartition, List<ConsumerRecord<K, V>>> fetchedRecords() {
  // partition -> fetched records
  Map<TopicPartition, List<ConsumerRecord<K, V>>> fetched = new HashMap<>();
  // maxPollRecords = max.poll.records = 500
  int recordsRemaining = maxPollRecords;
  try {
      while (recordsRemaining > 0) {
          if (nextInLineFetch == null || nextInLineFetch.isConsumed) {
              // 1. take one partition's CompletedFetch from completedFetches and initialize it
              CompletedFetch records = completedFetches.peek();
              if (records == null) break;
              if (records.notInitialized()) {
                  // push HW and other offsets into the subscription state; notInitialized() becomes false
                  nextInLineFetch = initializeCompletedFetch(records);
              } else {
                  nextInLineFetch = records;
              }
              completedFetches.poll();
          } else {
              // 2. take up to recordsRemaining records from the partition being processed
              List<ConsumerRecord<K, V>> records = fetchRecords(nextInLineFetch, recordsRemaining);
              if (!records.isEmpty()) {
                  TopicPartition partition = nextInLineFetch.partition;
                  fetched.put(partition, records);
                  recordsRemaining -= records.size();
              }
          }
      }
  } catch (KafkaException e) {
      if (fetched.isEmpty())
          throw e;
  }
  return fetched;
}

private List<ConsumerRecord<K, V>> fetchRecords(CompletedFetch completedFetch, int maxRecords) {
    FetchPosition position = subscriptions.position(completedFetch.partition);
    if (completedFetch.nextFetchOffset == position.offset) {
        // read records from the CompletedFetch, advancing nextFetchOffset
        List<ConsumerRecord<K, V>> partRecords = completedFetch.fetchRecords(maxRecords);
        if (completedFetch.nextFetchOffset > position.offset) {
            // nextFetchOffset moved forward: update the consume position
            FetchPosition nextPosition = new FetchPosition(
                    completedFetch.nextFetchOffset,
                    completedFetch.lastEpoch,
                    position.currentLeader);
            subscriptions.position(completedFetch.partition, nextPosition);
        }
        return partRecords;
    }
    // completedFetch.isConsumed = true: mark this CompletedFetch as fully consumed
    completedFetch.drain();
    return emptyList();
}

Fetcher.CompletedFetch#fetchRecords: one CompletedFetch in turn holds n batches, so the batches are iterated, and then the records within each batch.

java 复制代码
// iterator over the record batches
private final Iterator<? extends RecordBatch> batches;
// record iterator of the batch currently being iterated
private CloseableIterator<Record> records;
// the batch currently being iterated
private RecordBatch currentBatch;
// the last record handed out
private Record lastRecord;
// the offset the iteration has advanced to
private long nextFetchOffset;
private List<ConsumerRecord<K, V>> fetchRecords(int maxRecords) {
  List<ConsumerRecord<K, V>> records = new ArrayList<>();
  for (int i = 0; i < maxRecords; i++) {
      lastRecord = nextFetchedRecord();
      if (lastRecord == null)
          break; // all batches in this CompletedFetch are drained
      records.add(parseRecord(partition, currentBatch, lastRecord));
      nextFetchOffset = lastRecord.offset() + 1;
  }
  return records;
}
private Record nextFetchedRecord() {
  while (true) {
      if (records == null || !records.hasNext()) {
          // just started iterating || the current batch is exhausted
          maybeCloseRecordStream();
          if (!batches.hasNext()) {
              // no more batches: mark the whole CompletedFetch as consumed
              drain();
              return null;
          }
          // move to the next record batch
          currentBatch = batches.next();
          maybeEnsureValid(currentBatch);
          records = currentBatch.streamingIterator(decompressionBufferSupplier);
      } else {
          // iterate the records inside the batch
          Record record = records.next();
          // skip records before the consume position;
          // batches can only be fetched whole, so the consumer must skip them itself
          if (record.offset() >= nextFetchOffset) {
              maybeEnsureValid(record);
              if (!currentBatch.isControlBatch()) {
                  return record;
              } else {
                  nextFetchOffset = record.offset() + 1;
              }
          }
      }
  }
}

DefaultRecordBatch#streamingIterator: messages are decompressed while being iterated, which realizes end-to-end compression.

java 复制代码
public CloseableIterator<Record> streamingIterator(BufferSupplier bufferSupplier) {
    if (isCompressed())
        return compressedIterator(bufferSupplier, false);
    else
        return uncompressedIterator();
}
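
"End-to-end" means the batch travels compressed from producer to consumer: with the broker default compression.type=producer, the broker stores the batch exactly as the producer compressed it, and only this iterator decompresses. A minimal producer-side sketch (address and topic are assumptions, not from the source):

java 复制代码
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // compressed once here; stays compressed on the broker and over both network hops
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}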

The ConsumerRecords eventually handed back to the user is in fact grouped by partition; its iterator walks the m records of each of the n partitions in order. A usage sketch follows the class definition below.

java 复制代码
public class ConsumerRecords<K, V> implements Iterable<ConsumerRecord<K, V>> {
    private final Map<TopicPartition, List<ConsumerRecord<K, V>>> records;
    public ConsumerRecords(Map<TopicPartition, List<ConsumerRecord<K, V>>> records) {
        this.records = records;
    }
    public Iterator<ConsumerRecord<K, V>> iterator() {
        return new ConcatenatedIterable<>(records.values()).iterator();
    }
}
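
Since the result is grouped by partition, callers can also walk it partition by partition; a small usage sketch (the dump helper is hypothetical; records is whatever KafkaConsumer#poll returned):

java 复制代码
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

public class PerPartitionIteration {
    // hypothetical helper: print each partition's batch and its last offset
    static void dump(ConsumerRecords<String, String> records) {
        for (TopicPartition tp : records.partitions()) {
            List<ConsumerRecord<String, String>> partRecords = records.records(tp);
            long lastOffset = partRecords.get(partRecords.size() - 1).offset();
            System.out.println(tp + ": " + partRecords.size() + " records, last offset=" + lastOffset);
        }
    }
}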

四、Committing consume offsets

4-1、Entry points of automatic offset commit

By default, enable.auto.commit=true and consume offsets are committed automatically.

KafkaConsumer#poll → ConsumerCoordinator#poll: auto-commit scenario 1. Every call of the poll API checks whether an auto-commit is due; one is triggered once auto.commit.interval.ms (default 5000 ms) has elapsed.

java 复制代码
public boolean poll(Timer timer, boolean waitForJoinGroup) {
    // try to refresh metadataSnapshot to the current metadata
    maybeUpdateSubscriptionMetadata();
    invokeCompletedOffsetCommitCallbacks();
    if (subscriptions.hasAutoAssignedPartitions()) {
        // case 1: exact subscription / pattern subscription
        // wake up the kafka-coordinator-heartbeat-thread
        // record the last poll time on the Heartbeat,
        // staying within max.poll.interval.ms = 5m to avoid triggering a rebalance
        pollHeartbeat(timer.currentTimeMs());
        // Step 1: discover the coordinator (FindCoordinatorRequest) and ensure the connection is up
        if (coordinatorUnknown() && !ensureCoordinatorReady(timer)) {
            return false;
        }

        // Step 2: rejoin the group if needed
        if (rejoinNeededOrPending()) {
            // ...
            // join the group: JoinGroupRequest + SyncGroupRequest
            if (!ensureActiveGroup(waitForJoinGroup ? timer : time.timer(0L))) {
                timer.update(time.milliseconds());

                return false;
            }
        }
    }
    // auto-commit offsets
    maybeAutoCommitOffsetsAsync(timer.currentTimeMs());
    return true;
}
public void maybeAutoCommitOffsetsAsync(long now) {
    if (autoCommitEnabled) {
        nextAutoCommitTimer.update(now);
        // auto.commit.interval.ms=5000
        if (nextAutoCommitTimer.isExpired()) {
            nextAutoCommitTimer.reset(autoCommitIntervalMs);
            doAutoCommitOffsetsAsync();
        }
    }
}

ConsumerCoordinator#onJoinPrepare: auto-commit scenario 2: before a rebalance, offsets are committed synchronously.

java 复制代码
protected void onJoinPrepare(int generation, String memberId) {
    maybeAutoCommitOffsetsSync(time.timer(rebalanceConfig.rebalanceTimeoutMs));
}

ConsumerCoordinator#close: auto-commit scenario 3: the consumer shuts down.

java 复制代码
public void close(final Timer timer) {
    // ...
    maybeAutoCommitOffsetsSync(timer);
}
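
For contrast with the three auto-commit entry points above, a hedged sketch of turning auto-commit off and committing manually (group id and topic are invented for illustration):

java 复制代码
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // hypothetical group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // disables scenarios 1-3
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.offset() + " -> " + r.value()));
                // synchronous OffsetCommitRequest covering all consumed positions
                consumer.commitSync();
            }
        }
    }
}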

4-2、The consumer sends an OffsetCommitRequest

SubscriptionState#allConsumed: collects the consume positions of all partitions in FETCHING state. Note that these positions were already advanced when poll returned (see Fetcher#fetchedRecords); the sketch after the snippet shows the resulting gap between position and committed offset.

java 复制代码
public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
    Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
    assignment.stream().forEach(state -> {
        TopicPartitionState partitionState = state.value();
        if (partitionState.hasValidPosition())
            allConsumed.put(state.topicPartition(), 
            new OffsetAndMetadata(partitionState.position.offset,
                    partitionState.position.offsetEpoch, ""));
    });
    return allConsumed;
}
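
That gap between the in-memory position and the last committed offset can be observed through the public consumer API; a minimal sketch (assumes a running broker and an already-subscribed consumer; the show helper is hypothetical):

java 复制代码
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PositionVsCommitted {
    // hypothetical helper: consumer is already configured and subscribed
    static void show(KafkaConsumer<String, String> consumer) {
        consumer.poll(Duration.ofSeconds(1)); // advances the in-memory positions
        for (TopicPartition tp : consumer.assignment()) {
            // position(): next offset to fetch; committed(): last offset written to __consumer_offsets
            System.out.println(tp + " position=" + consumer.position(tp)
                    + " committed=" + consumer.committed(Collections.singleton(tp)));
        }
    }
}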

ConsumerCoordinator#commitOffsetsAsync: ensure the coordinator is alive, then send the commit request.

java 复制代码
public void commitOffsetsAsync(final Map<TopicPartition, OffsetAndMetadata> offsets, final OffsetCommitCallback callback) {
    if (!coordinatorUnknown()) {
        // coordinator alive: send the offset-commit request
        doCommitOffsetsAsync(offsets, callback);
    } else {
        // otherwise FindCoordinatorRequest first; send the offset commit once the connection is up
        lookupCoordinator().addListener(new RequestFutureListener<Void>() {
            @Override
            public void onSuccess(Void value) {
                doCommitOffsetsAsync(offsets, callback);
                client.pollNoWakeup();
            }
        });
    }
    client.pollNoWakeup();
}

OffsetCommitRequestData: the request payload.

java 复制代码
public class OffsetCommitRequestData implements ApiMessage {
    // consumer group id
    private String groupId;
    // which rebalance round (generation)
    private int generationId;
    // member id assigned by the coordinator
    private String memberId;
    // static member id
    private String groupInstanceId;
    // the data to commit
    private List<OffsetCommitRequestTopic> topics;
}
static public class OffsetCommitRequestTopic implements Message {
    // topic
    private String name;
    // partitions
    private List<OffsetCommitRequestPartition> partitions;
}
static public class OffsetCommitRequestPartition implements Message {
    // partition id
    private int partitionIndex;
    // the offset to commit
    private long committedOffset;
    // the leaderEpoch corresponding to the committed offset
    private int committedLeaderEpoch;
}

4-3、The coordinator handles the OffsetCommitRequest

GroupMetadataManager#storeOffsets: the coordinator writes this commit request as one record batch into topic=__consumer_offsets, partition = the group's coordinator partition. Each record has key = groupId+topic+partition and value = req.offset + req.leaderEpoch + the timestamp at which the broker received the request.

scala 复制代码
  def storeOffsets(group: GroupMetadata,
       consumerId: String,
       offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
       responseCallback: immutable.Map[TopicPartition, Errors] => Unit,
       producerId: Long = RecordBatch.NO_PRODUCER_ID,
       producerEpoch: Short = RecordBatch.NO_PRODUCER_EPOCH): Unit = {
  val filteredOffsetMetadata = offsetMetadata.filter { case (_, offsetAndMetadata) =>
    validateOffsetMetadataLength(offsetAndMetadata.metadata)
  }
  val isTxnOffsetCommit = producerId != RecordBatch.NO_PRODUCER_ID
  if (filteredOffsetMetadata.isEmpty) {
    val commitStatus = offsetMetadata.map { case (k, _) => k -> Errors.OFFSET_METADATA_TOO_LARGE }
    responseCallback(commitStatus)
    None
  } else {
    getMagic(partitionFor(group.groupId)) match {
      case Some(magicValue) =>
        val timestampType = TimestampType.CREATE_TIME
        val timestamp = time.milliseconds()
        // one record per partition
        val records = filteredOffsetMetadata.map { case (topicPartition, offsetAndMetadata) =>
          // key = groupId + topic + partition
          val key = GroupMetadataManager.offsetCommitKey(group.groupId, topicPartition)
          // value = offset + leaderEpoch + creation timestamp
          val value = GroupMetadataManager.offsetCommitValue(offsetAndMetadata, interBrokerProtocolVersion)
          new SimpleRecord(timestamp, key, value)
        }
        // topic=__consumer_offsets, partition = the group's coordinator partition
        val offsetTopicPartition = new TopicPartition(Topic.GROUP_METADATA_TOPIC_NAME, partitionFor(group.groupId))
        // these records are written as a single batch
        val buffer = ByteBuffer.allocate(AbstractRecords.estimateSizeInBytes(magicValue, compressionType, records.asJava))
        val builder = MemoryRecords.builder(buffer, magicValue, compressionType, timestampType, 0L, time.milliseconds(),
          producerId, producerEpoch, 0, isTxnOffsetCommit, RecordBatch.NO_PARTITION_LEADER_EPOCH)
        records.foreach(builder.append)
        val entries = Map(offsetTopicPartition -> builder.build())
        // append the records
        appendForGroup(group, entries, putCacheCallback)
    }
  }
}

GroupMetadataManager#appendForGroup: the batch is appended like normal messages; acks defaults to -1, waiting for all ISRs to persist the write.

scala 复制代码
  private def appendForGroup(group: GroupMetadata,
   records: Map[TopicPartition, MemoryRecords],
   callback: Map[TopicPartition, PartitionResponse] => Unit): Unit = {
    replicaManager.appendRecords(
      // offsets.commit.timeout.ms=5000
      timeout = config.offsetCommitTimeoutMs.toLong,
      // offsets.commit.required.acks=-1 --- wait for all ISRs to write
      requiredAcks = config.offsetCommitRequiredAcks,
      internalTopicsAllowed = true,
      origin = AppendOrigin.Coordinator,
      entriesPerPartition = records,
      delayedProduceLock = Some(group.lock),
      responseCallback = callback)
  }

GroupMetadata#onOffsetCommitAppend: once the write completes, the offset is stored into the in-memory group metadata GroupMetadata#offsets, so the group's consume progress can later be read quickly from memory.

scala 复制代码
private val offsets = 
    new mutable.HashMap[TopicPartition, CommitRecordMetadataAndOffset]
def onOffsetCommitAppend(topicPartition: TopicPartition, 
           offsetWithCommitRecordMetadata: CommitRecordMetadataAndOffset): Unit = {
  if (!offsets.contains(topicPartition) 
      || offsets(topicPartition).olderThan(offsetWithCommitRecordMetadata))
    // update the stored offset only if the committed offset is newer than the current one
    offsets.put(topicPartition, offsetWithCommitRecordMetadata)
}

OffsetCommitResponseData: the response carries a per-partition error code.

java 复制代码
public class OffsetCommitResponseData implements ApiMessage {
    private int throttleTimeMs;
    private List<OffsetCommitResponseTopic> topics;
}
static public class OffsetCommitResponseTopic implements Message {
        private String name;
        private List<OffsetCommitResponsePartition> partitions;
}
static public class OffsetCommitResponsePartition implements Message {
    private int partitionIndex;
    private short errorCode;
}

ConsumerCoordinator#doAutoCommitOffsetsAsync: the consumer does nothing further with the response; it merely logs a warning on failure:

java 复制代码
private void doAutoCommitOffsetsAsync() {
    Map<TopicPartition, OffsetAndMetadata> allConsumedOffsets = subscriptions.allConsumed();
    commitOffsetsAsync(allConsumedOffsets, (offsets, exception) -> {
        if (exception != null) {
            log.warn("Asynchronous auto-commit of offsets {} failed: {}", offsets, exception.getMessage());
        }
    });
}

4-4、The coordinator loads a group's consume offsets

Recall from chapter five: the coordinator node = leaderOf(hash(groupId) % the number of __consumer_offsets partitions), where __consumer_offsets has 50 partitions by default; a sketch of the mapping follows.
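
The mapping itself is just a hash modulo the partition count; a minimal sketch mirroring GroupMetadataManager.partitionFor (the group id is made up, 50 is the default partition count):

java 复制代码
public class CoordinatorPartitionDemo {
    // abs(hash) % number of __consumer_offsets partitions
    static int partitionFor(String groupId, int offsetsTopicPartitionCount) {
        return (groupId.hashCode() & 0x7fffffff) % offsetsTopicPartitionCount;
    }
    public static void main(String[] args) {
        // the __consumer_offsets partition whose leader coordinates "demo-group"
        System.out.println(partitionFor("demo-group", 50));
    }
}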

onLeadershipChange: when a broker handles the controller's LeaderAndIsrRequest and discovers it has become leader of one of the __consumer_offsets partitions, it triggers loading of the consumer groups on that partition.

scala 复制代码
def onLeadershipChange(updatedLeaders: Iterable[Partition], updatedFollowers: Iterable[Partition]): Unit = {
  updatedLeaders.foreach { partition =>
    // topic==__consumer_offsets
    if (partition.topic == GROUP_METADATA_TOPIC_NAME)
      groupCoordinator.onElection(partition.partitionId)
  }
}

GroupMetadataManager#doLoadGroupsAndOffsets: reads the partition log and replays all records, loading the consumer groups and their consume offsets into memory.

Note that both group metadata and group consume offsets live in this one topic, distinguished by the record key.

scala 复制代码
private def doLoadGroupsAndOffsets(topicPartition: TopicPartition, onGroupLoaded: GroupMetadata => Unit): Unit = {
  def logEndOffset: Long = replicaManager.getLogEndOffset(topicPartition).getOrElse(-1L)
  replicaManager.getLog(topicPartition) match {
    case Some(log) =>
       // the consume offsets being loaded
      val loadedOffsets = mutable.Map[GroupTopicPartition, CommitRecordMetadataAndOffset]()
      val pendingOffsets = mutable.Map[Long, mutable.Map[GroupTopicPartition, CommitRecordMetadataAndOffset]]()
      val loadedGroups = mutable.Map[String, GroupMetadata]()
      val removedGroups = mutable.Set[String]()
      var buffer = ByteBuffer.allocate(0)
      var currOffset = log.logStartOffset
      var readAtLeastOneRecord = true
      // read the log from start to end
      while (currOffset < logEndOffset && readAtLeastOneRecord && !shuttingDown.get()) {
        // read messages 5 MB at a time
        val fetchDataInfo = log.read(currOffset,
           // offsets.load.buffer.size=5mb
          maxLength = config.loadBufferSize,
          isolation = FetchLogEnd,
          minOneMessage = true)
        readAtLeastOneRecord = fetchDataInfo.records.sizeInBytes > 0
        // convert to MemoryRecords
        val memRecords = fetchDataInfo.records match {
          case records: MemoryRecords => records
          case fileRecords: FileRecords =>
            val sizeInBytes = fileRecords.sizeInBytes
            val bytesNeeded = Math.max(config.loadBufferSize, sizeInBytes)
            if (buffer.capacity < bytesNeeded) {
              buffer = ByteBuffer.allocate(bytesNeeded)
            } else {
              buffer.clear()
            }
            fileRecords.readInto(buffer, 0)
            MemoryRecords.readableRecords(buffer)
        }
        // iterate over the batches
        memRecords.batches.forEach { batch =>
          val isTxnOffsetCommit = batch.isTransactional
          if (batch.isControlBatch) {
            // ...
          } else {
            var batchBaseOffset: Option[Long] = None
            // iterate over the records
            for (record <- batch.asScala) {
              if (batchBaseOffset.isEmpty)
                batchBaseOffset = Some(record.offset)
              GroupMetadataManager.readMessageKey(record.key) match {
                case offsetKey: OffsetKey =>
                  // a committed consume offset
                  // ... transactional offset commits ignored here
                  val groupTopicPartition = offsetKey.key
                  val offsetAndMetadata = GroupMetadataManager.readOffsetMessageValue(record.value)
                  loadedOffsets.put(groupTopicPartition, CommitRecordMetadataAndOffset(batchBaseOffset, offsetAndMetadata))
                case groupMetadataKey: GroupMetadataKey =>
                  // group metadata
                  val groupId = groupMetadataKey.key
                  val groupMetadata = GroupMetadataManager.readGroupMessageValue(groupId, record.value, time)
                  loadedGroups.put(groupId, groupMetadata)
              }
            }
          }
          currOffset = batch.nextOffset
        }
      }
      // offsets of groups that still exist vs. offsets of empty groups (all members offline)
      val (groupOffsets, emptyGroupOffsets) = loadedOffsets
        .groupBy(_._1.group)
        .map { case (k, v) =>
          k -> v.map { case (groupTopicPartition, offset) => (groupTopicPartition.topicPartition, offset) }
        }.partition { case (group, _) => loadedGroups.contains(group) }
      loadedGroups.values.foreach { group =>
        val offsets = groupOffsets.getOrElse(group.groupId, Map.empty[TopicPartition, CommitRecordMetadataAndOffset])
        val pendingOffsets = pendingGroupOffsets.getOrElse(group.groupId, Map.empty[Long, mutable.Map[TopicPartition, CommitRecordMetadataAndOffset]])
        // load into memory
        loadGroup(group, offsets, pendingOffsets)
        onGroupLoaded(group)
      }
  }
}

private def loadGroup(group: GroupMetadata, offsets: Map[TopicPartition, CommitRecordMetadataAndOffset],
                      pendingTransactionalOffsets: Map[Long, mutable.Map[TopicPartition, CommitRecordMetadataAndOffset]]): Unit = {
  group.initializeOffsets(offsets, pendingTransactionalOffsets.toMap)
  val currentGroup = addGroup(group)
}

// GroupMetadata#initializeOffsets
private val offsets = new mutable.HashMap[TopicPartition, CommitRecordMetadataAndOffset]
def initializeOffsets(offsets: collection.Map[TopicPartition, CommitRecordMetadataAndOffset],
                      pendingTxnOffsets: Map[Long, mutable.Map[TopicPartition, CommitRecordMetadataAndOffset]]): Unit = {
  this.offsets ++= offsets
  this.pendingTransactionalOffsetCommits ++= pendingTxnOffsets
}

Summary

Determining the consume position

On the consumer side, a partition's FetchState has four states:

1) INITIALIZING: the partition has just been assigned;

2) AWAIT_VALIDATION: the group has consumed before, and the committed position has been fetched;

3) AWAIT_RESET: the group has never consumed; the position awaits reset;

4) FETCHING: the position is valid and consumption can begin;

If the partition has a committed offset:

1) INITIALIZING→AWAIT_VALIDATION: the consumer sends an OffsetFetchRequest, and the coordinator returns the in-memory offset from GroupMetadata#offsets;

2) AWAIT_VALIDATION→FETCHING: the consumer sends an OffsetsForLeaderEpochRequest (carrying the offset's leaderEpoch) to the partition leader. From its leader-epoch records (each partition keeps multiple leaderEpoch-startOffset pairs recording the offset range of every leader term), the leader derives the end offset of that leaderEpoch; if the returned offset < the current offset, data truncation has occurred, and the consume position becomes that leaderEpoch's end offset;

If the partition has no committed offset, a reset strategy determines the starting position.

INITIALIZING→AWAIT_RESET: OffsetFetchRequest returns offset=-1, meaning the partition has no committed offset.

AWAIT_RESET→FETCHING: auto.offset.reset offers two reset strategies (see the config sketch after this list):

1) LATEST: the default; reset to the partition's latest write position according to the isolation level: READ_COMMITTED returns the LSO (transaction-related), READ_UNCOMMITTED (the default) returns the HW high watermark;

2) EARLIEST: reset to the partition's earliest write position, i.e. its LogStartOffset;
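
A hedged config sketch of the two strategies (the values shown are the available choices, not recommendations):

java 复制代码
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class OffsetResetConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // only applies when the group has no committed offset (AWAIT_RESET)
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // or "latest" (default)
        // with "latest": read_committed resets to the LSO, read_uncommitted (default) to the HW
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_uncommitted");
    }
}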

Later, while consuming (on each poll), the consumer checks whether the partition's leaderEpoch in the cached metadata still matches the one in the fetch state; if it changed, the fetch state re-enters AWAIT_VALIDATION.

Fetching messages

Fetching proceeds in several steps:

1) the consumer sends a FetchRequest carrying the consume positions of its FETCHING partitions;

2) the broker reads the messages; if data is available (default fetch.min.bytes = 1 byte) it responds immediately with a FetchResponse, otherwise it parks the request until the high watermark advances, then reads again and responds;

3) the consumer caches the returned record batches in a queue and consumes them in order;

KIP-227 introduced incremental FetchRequests: a fetch session is established so that only changed partition information is transmitted, sparing consumer and broker (and, above all, follower and leader) from shipping the full partition list on every request even when nothing changed.

First fetch: the request carries all partitions and their consume positions, the broker creates a sessionId, and the response covers all partitions. Incremental fetch: the request carries only the changed partitions and positions, the broker advances sessionEpoch+1, and the response contains only the partitions that changed (new data or some offset movement).

Step1: the consumer's FetchRequest carries three config parameters (see the sketch after this list):

1) minBytes = fetch.min.bytes, default 1 byte;

2) maxWait = fetch.max.wait.ms, default 500 ms: how long the broker may park the FetchRequest when fewer than minBytes of messages are available;

3) max.partition.fetch.bytes, default 1 MB: the maximum message size fetched per partition;
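
A hedged sketch of the three parameters in consumer configuration (the values shown are just the defaults spelled out):

java 复制代码
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class FetchTuningConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // broker responds as soon as at least this many bytes are available
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);
        // otherwise the broker parks the FetchRequest for up to this long
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        // per-partition cap on the bytes returned in one FetchResponse
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1024 * 1024);
    }
}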

Step2: the broker reads the data (a simplified index-lookup sketch follows this list):

1) loop over all partitions, fetching at batch granularity;

2) with the isolation level READ_UNCOMMITTED (the default), the end position = HW high watermark = (offset=778, pos=9999);

3) given the consume position offset=500, binary-search the index for the largest entry whose offset <= 500, e.g. (offset=456, pos=6666);

4) from that physical position (6666), scan the log forward to find the batch containing offset 500 (batch end offset=555, pos=8888) as the start position;

5) from 2 and 4, the data slice is located at (pos=8888, size=9999-8888);

6) use FileChannel.transferTo zero-copy to write the message bytes into the network channel;
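
Steps 3-5 amount to a floor lookup in a sparse index followed by a forward scan; a simplified model (a TreeMap stand-in, not Kafka's OffsetIndex, using the numbers from the example above):

java 复制代码
import java.util.Map;
import java.util.TreeMap;

public class OffsetIndexLookupDemo {
    public static void main(String[] args) {
        // sparse index: logical offset -> physical position in the log file
        TreeMap<Long, Integer> index = new TreeMap<>();
        index.put(0L, 0);
        index.put(456L, 6666);
        index.put(700L, 9000);
        long target = 500L; // the consume position carried by the FetchRequest
        Map.Entry<Long, Integer> floor = index.floorEntry(target); // -> (456, 6666)
        // from floor's position, Kafka scans batch headers forward to the batch containing 500,
        // then ships [batch start pos, HW pos) via FileChannel.transferTo
        System.out.println("scan log from pos=" + floor.getValue());
    }
}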

Step3: the consumer consumes the messages:

1) FetchResponse data is cached in the completedFetches queue;

2) records are drained from completedFetches, split by the actual consume position (the broker returns whole batches, so the consumer must skip offsets it has already passed) and by the maximum record count (each KafkaConsumer#poll returns at most max.poll.records = 500 records by default); the in-memory consume position is advanced during this iteration.

Committing consume offsets

By default enable.auto.commit=true and consume offsets are committed automatically: KafkaConsumer#poll triggers an auto-commit whenever more than auto.commit.interval.ms (default 5000 ms) has elapsed.

The consumer sends an OffsetCommitRequest containing the current consume positions of all FETCHING partitions (already advanced while fetching, as described above);

The coordinator handles the OffsetCommitRequest by writing all the partition offsets in the request as one record batch into topic=__consumer_offsets, partition = the group's coordinator partition = hash(groupId) % the number of __consumer_offsets partitions. Each record has key = groupId+topic+partition and value = req.offset + req.leaderEpoch + the timestamp at which the broker received the request. With the default offsets.commit.required.acks=-1 it waits for all ISR replicas to write before responding, then updates the in-memory GroupMetadata#offsets.

If a __consumer_offsets partition goes through leader election again (for example after a broker restart), loading of the corresponding consumer groups is triggered: the partition's records are re-read in full and the consume offsets are rebuilt into the in-memory GroupMetadata#offsets.
