Kafka Source Code (8): Data Replication

Preface

This chapter analyzes Kafka's data replication.

  1. A recap of Controller election and topic creation;
  2. Leader election;
  3. Data replication;
  4. High watermark and ISR;

Notes:

  1. Based on Kafka 2.6, without KRaft;
  2. Previous installments: juejin.cn/column/7523...

1. Controller

Controller: a special role within the broker cluster, responsible for managing topics and brokers.

ControllerEventThread#doWork: a single controller-event-thread processes all kinds of events, such as topic changes and broker changes.

Controller election:

  1. KafkaZkClient#registerBroker: each broker registers an ephemeral ZNode at /brokers/ids/{brokerId}.
json
# /brokers/ids/{brokerId}
{"listener_security_protocol_map":{"PLAINTEXT":"PLAINTEXT"},
 "endpoints":["PLAINTEXT://localhost:9092"],
 "jmx_port":-1,"port":9092,
 "host":"localhost","version":4,"timestamp":"1766887228838"}
  2. KafkaController#processStartup: each broker watches the /controller node; if it does not exist, the broker tries to create the ephemeral /controller ZNode and bumps the controller epoch to /controller_epoch + 1, thereby becoming the Controller.

This is implemented via a ZooKeeper MultiOp (bundling create -e /controller and set /controller_epoch) plus an optimistic update on the ZNode version (see the sketch after this list).

text
# /controller
data = {"version":1,"brokerid":1 (the brokerId that became Controller),"timestamp":"xxx"}
# /controller_epoch
data = 1 (the controller epoch)
  3. KafkaController#onControllerFailover: what the broker does after becoming Controller
  • Registers ZNode watchers, including /brokers/ids for broker changes and /brokers/topics for topic changes;

  • Builds the in-memory ControllerContext, including:

    1. /brokers/ids: broker information;
    2. /brokers/topics: topic information;
    3. /brokers/topics/{topic}: the topic's partition assignment;
    4. /brokers/topics/{topic}/partitions/{partitionId}/state: the partition state LeaderAndIsr;
    5. connections established to all brokers (including itself);
  • Sends an UpdateMetadataRequest to all live brokers, containing broker and topic partition information; each broker caches the data in its MetadataCache;
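
To make the MultiOp step above concrete, here is a minimal sketch of the optimistic election written directly against the ZooKeeper client API rather than Kafka's KafkaZkClient; the helper name is made up, and the real code also handles the case where /controller_epoch does not exist yet.

scala
import java.util.Arrays

import org.apache.zookeeper.data.Stat
import org.apache.zookeeper.{CreateMode, KeeperException, Op, ZooDefs, ZooKeeper}

object ControllerElectionSketch {

  /**
   * Try to become Controller: atomically create the ephemeral /controller node and
   * bump /controller_epoch in one MultiOp, guarded by the epoch ZNode's version.
   * Returns the new epoch on success, None if another broker won the race.
   */
  def tryElect(zk: ZooKeeper, brokerId: Int): Option[Int] = {
    val stat = new Stat()
    val currentEpoch = new String(zk.getData("/controller_epoch", false, stat), "UTF-8").toInt
    val newEpoch = currentEpoch + 1
    val controllerData =
      s"""{"version":1,"brokerid":$brokerId,"timestamp":"${System.currentTimeMillis()}"}"""
    val ops = Arrays.asList(
      // fails if another broker already created /controller
      Op.create("/controller", controllerData.getBytes("UTF-8"),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL),
      // fails with BadVersion if someone else bumped the epoch in between
      Op.setData("/controller_epoch", newEpoch.toString.getBytes("UTF-8"), stat.getVersion)
    )
    try {
      zk.multi(ops)
      Some(newEpoch)
    } catch {
      case _: KeeperException => None // race lost (or ZK error); keep watching /controller
    }
  }
}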

2. Topic Partition Leader Election

2-1. Initial State at Topic Creation

Recall topic creation and how a partition leader first comes into being; the simplified flow is as follows.

The Controller defines two pieces of state when creating a topic:

  1. AdminUtils#assignReplicasToBrokers: the partition assignment plan. Based on the requested partition count and replication factor, replicas are spread evenly across brokers, and the plan is persisted at /brokers/topics/{topic} (a simplified sketch follows this list).
json
# This topic has a single partition, and each partition has two replicas
# The two replicas of partition p0 live on brokerId=111 and brokerId=222
{"version":2,"partitions":{"0":[222,111]}}
  2. ZkPartitionStateMachine#initializeLeaderAndIsrForPartitions: the LeaderAndIsr partition state. The ISR list = the assigned replicas whose brokers are online at that moment, and the leader = the first replica in the ISR list; the state is persisted at /brokers/topics/{topic}/partitions/{partitionId}/state.
json
# The ISR of partition p0 is [222, 111], and the current leader replica is 222
{"controller_epoch":1,"leader":222,"version":1,"leader_epoch":0,"isr":[222,111]}
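
For reference, the rack-unaware assignment in step 1 is essentially a shifted round-robin over the broker list. Below is a minimal sketch of that idea, not the exact AdminUtils code; Kafka also randomizes the start index and the initial replica shift when they are not specified.

scala
object ReplicaAssignmentSketch {

  /**
   * Spread partition replicas evenly across brokers (rack-unaware case).
   * The first replica of each partition is its preferred replica.
   */
  def assign(brokers: IndexedSeq[Int], partitions: Int, replicationFactor: Int,
             startIndex: Int = 0): Map[Int, Seq[Int]] = {
    require(replicationFactor <= brokers.size, "replication factor exceeds broker count")
    val n = brokers.size
    (0 until partitions).map { p =>
      // the first (preferred) replica rotates over the broker list
      val firstReplicaIndex = (startIndex + p) % n
      // the shift for the remaining replicas grows each time we wrap around the broker
      // list, and stays in [1, n-1] so replicas of one partition never collide
      val nextReplicaShift = p / n
      val others = (1 until replicationFactor).map { j =>
        val shift = 1 + (nextReplicaShift + (j - 1)) % (n - 1)
        brokers((firstReplicaIndex + shift) % n)
      }
      p -> (brokers(firstReplicaIndex) +: others)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // 3 partitions, 2 replicas over brokers 111/222/333:
    // Map(0 -> Vector(111, 222), 1 -> Vector(222, 333), 2 -> Vector(333, 111))
    println(assign(Vector(111, 222, 333), partitions = 3, replicationFactor = 2))
  }
}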

After topic creation completes, the Controller pushes the partition state to the relevant brokers via a LeaderAndIsrRequest, and each broker reacts according to whether it is the leader or a follower of the partition.

2-2. Main Leader Election Flow

ZkPartitionStateMachine#doElectLeaderForPartitions: the Controller handles leader election as follows.

  1. Read /brokers/topics/{topic}/partitions/{partitionId}/state to get the current LeaderAndIsr;
  2. Elect a leader according to the strategy;
  3. Update /brokers/topics/{topic}/partitions/{partitionId}/state;
  4. Send LeaderAndIsr requests to the relevant brokers;
scala
private def doElectLeaderForPartitions(
    partitions: Seq[TopicPartition],
    partitionLeaderElectionStrategy: PartitionLeaderElectionStrategy
  ): (Map[TopicPartition, Either[Exception, LeaderAndIsr]], Seq[TopicPartition]) = {
    // 1. 查询/brokers/topics/{topic}/partitions/{partitionId}/state
    val getDataResponses = zkClient.getTopicPartitionStatesRaw(partitions)
    val validLeaderAndIsrs = mutable.Buffer.empty[(TopicPartition, LeaderAndIsr)]
    getDataResponses.foreach { getDataResponse =>
      val partition = getDataResponse.ctx.get.asInstanceOf[TopicPartition]
      if (getDataResponse.resultCode == Code.OK) {
        TopicPartitionStateZNode.decode(getDataResponse.data, getDataResponse.stat) match {
          case Some(leaderIsrAndControllerEpoch) =>
              validLeaderAndIsrs += partition -> leaderIsrAndControllerEpoch.leaderAndIsr
      } 
    }
    if (validLeaderAndIsrs.isEmpty) {
      return (failedElections.toMap, Seq.empty)
    }
    // 2. 根据策略选主
    val (partitionsWithoutLeaders, partitionsWithLeaders) = partitionLeaderElectionStrategy match {
      case OfflinePartitionLeaderElectionStrategy(allowUnclean) =>
        // ...
      case ReassignPartitionLeaderElectionStrategy =>
        // ...
      case PreferredReplicaPartitionLeaderElectionStrategy =>
        // ...
      case ControlledShutdownPartitionLeaderElectionStrategy =>
       // ...
    }
    // partition -> replicas
    val recipientsPerPartition = partitionsWithLeaders.map(result => result.topicPartition -> result.liveReplicas).toMap
    // partition -> leaderAndIsr
    val adjustedLeaderAndIsrs = partitionsWithLeaders.map(result => result.topicPartition -> result.leaderAndIsr.get).toMap
    // 3. 更新/brokers/topics/topicA/partitions/0/state
    val UpdateLeaderAndIsrResult(finishedUpdates, updatesToRetry) = zkClient.updateLeaderAndIsr(
      adjustedLeaderAndIsrs, controllerContext.epoch, controllerContext.epochZkVersion)
    // 4. LeaderAndIsr请求
    finishedUpdates.foreach { case (partition, result) =>
      result.foreach { leaderAndIsr =>
        val replicaAssignment = controllerContext.partitionFullReplicaAssignment(partition)
        val leaderIsrAndControllerEpoch = LeaderIsrAndControllerEpoch(leaderAndIsr, controllerContext.epoch)
        // 更新controllerContext内存
        controllerContext.partitionLeadershipInfo.put(partition, leaderIsrAndControllerEpoch)
        controllerBrokerRequestBatch.addLeaderAndIsrRequestForBrokers(recipientsPerPartition(partition), partition,
          leaderIsrAndControllerEpoch, replicaAssignment, isNew = false)
      }
    }
    (finishedUpdates ++ failedElections, updatesToRetry)
  }

2-3. Leader Election Strategies

There are four leader election strategies, covering several scenarios.

scala
sealed trait PartitionLeaderElectionStrategy
// case1 zk发现分区leader broker非正常下线
// case2 ElectLeadersRequest(忽略) admin api手动触发 allowUnclean=true
final case class OfflinePartitionLeaderElectionStrategy(allowUnclean: Boolean) extends PartitionLeaderElectionStrategy
// case3 AlterPartitionReassignmentsRequest: 分区重分配
final case object ReassignPartitionLeaderElectionStrategy extends PartitionLeaderElectionStrategy
// case4 controller定时:leader自动rebalance;
// case5 ElectLeadersRequest (忽略)admin api手动触发
final case object PreferredReplicaPartitionLeaderElectionStrategy extends PartitionLeaderElectionStrategy
// case6 ControlledShutdownRequest:Broker正常下线
final case object ControlledShutdownPartitionLeaderElectionStrategy extends PartitionLeaderElectionStrategy

2-3-1. Broker Goes Down Unexpectedly

Scenario 1: via its watch on /brokers/ids (BrokerChangeHandler), the Controller discovers that a broker has gone offline and that the broker is the leader of some partition.

ZkPartitionStateMachine#collectUncleanLeaderElectionState: collects, per partition, the LeaderAndIsr and whether unclean election is allowed.

If the topic-level (or global) config unclean.leader.election.enable=true (default false), the leader may be chosen from replicas outside the ISR.

scala
private def collectUncleanLeaderElectionState(
    leaderAndIsrs: Seq[(TopicPartition, LeaderAndIsr)],
    allowUnclean: Boolean
  ): Seq[(TopicPartition, Option[LeaderAndIsr], Boolean)] = {
    // isr中无存活副本的分区(1) | isr中有存活副本的分区(2)
    val (partitionsWithNoLiveInSyncReplicas, partitionsWithLiveInSyncReplicas) = leaderAndIsrs.partition {
      case (partition, leaderAndIsr) =>
        val liveInSyncReplicas = leaderAndIsr.isr.filter(controllerContext.isReplicaOnline(_, partition))
        liveInSyncReplicas.isEmpty
    }
    // 针对(1)中的分区,unclean.leader.election.enable=true的topic可以unclean选举
    val electionForPartitionWithoutLiveReplicas = if (allowUnclean) {
      // ... ElectLeadersRequest忽略
    } else {
      // unclean.leader.election.enable=true的topic可以unclean选举
      val (logConfigs, failed) = zkClient.getLogConfigs(
        partitionsWithNoLiveInSyncReplicas.iterator.map { case (partition, _) => partition.topic }.toSet,
        config.originals()
      )
      partitionsWithNoLiveInSyncReplicas.map { case (partition, leaderAndIsr) =>
          (
            partition,
            Option(leaderAndIsr),
            logConfigs(partition.topic).uncleanLeaderElectionEnable.booleanValue()
          )
      }
    }
    electionForPartitionWithoutLiveReplicas ++
    partitionsWithLiveInSyncReplicas.map { case (partition, leaderAndIsr) =>
      (partition, Option(leaderAndIsr), false)
    }
  }

PartitionLeaderElectionAlgorithms#offlinePartitionLeaderElection:

  1. Prefer a live replica that is in the ISR; 2. if unclean election is allowed, fall back to a live replica outside the ISR;
scala
  def offlinePartitionLeaderElection(assignment: Seq[Int],
                                     isr: Seq[Int],
                                     liveReplicas: Set[Int], 
                                     uncleanLeaderElectionEnabled: Boolean, 
                                     controllerContext: ControllerContext): Option[Int] = {
    // 1. 优先从 isr中的存活副本 选leader
    assignment.find(id => liveReplicas.contains(id) && isr.contains(id)).orElse {
      if (uncleanLeaderElectionEnabled) {
        // 2. unclean开启 允许从 非isr中的存活副本 选leader
        val leaderOpt = assignment.find(liveReplicas.contains)
        if (leaderOpt.isDefined)
          controllerContext.stats.uncleanLeaderElectionRate.mark()
        leaderOpt
      } else {
        None
      }
    }
  }

Election#leaderForOffline: sets the ISR according to the election result. For a clean election, the ISR is the old ISR minus offline brokers; for an unclean election, the ISR contains only the new leader.

scala
private def leaderForOffline(partition: TopicPartition,
                 leaderAndIsrOpt: Option[LeaderAndIsr],
                 uncleanLeaderElectionEnabled: Boolean,
                 controllerContext: ControllerContext): ElectionResult = {
    val assignment = controllerContext.partitionReplicaAssignment(partition)
    val liveReplicas = assignment.filter(replica => controllerContext.isReplicaOnline(replica, partition))
    leaderAndIsrOpt match {
      case Some(leaderAndIsr) =>
        val isr = leaderAndIsr.isr
        // 选leader
        val leaderOpt = PartitionLeaderElectionAlgorithms.offlinePartitionLeaderElection(
          assignment, isr, liveReplicas.toSet, uncleanLeaderElectionEnabled, controllerContext)
        val newLeaderAndIsrOpt = leaderOpt.map { leader =>
          // leader在isr里,正常选举,isr剔除下线broker
          val newIsr = if (isr.contains(leader)) isr.filter(replica => controllerContext.isReplicaOnline(replica, partition))
          // leader不在isr里,unlean选举,isr只包含leader
          else List(leader)
          leaderAndIsr.newLeaderAndIsr(leader, newIsr)
        }
        ElectionResult(partition, newLeaderAndIsrOpt, liveReplicas)

      case None =>
        ElectionResult(partition, None, liveReplicas)
    }
  }

2-3-2. Partition Reassignment

Scenario 2: triggered by a partition reassignment via AlterPartitionReassignmentsRequest (admin API); because the new replica assignment no longer contains the old leader, a leader election is needed.

Election#leaderForReassign: leader election in the reassignment case. The leader is picked from the reassignment's target replicas, restricted to live replicas that are in the ISR; the ISR itself stays unchanged.

scala
private def leaderForReassign(partition: TopicPartition,
                leaderAndIsr: LeaderAndIsr,
                controllerContext: ControllerContext): ElectionResult = {
  // reassign的目标副本集
  val targetReplicas = controllerContext.partitionFullReplicaAssignment(partition).targetReplicas
  val liveReplicas = targetReplicas.filter(replica => controllerContext.isReplicaOnline(replica, partition))
  val isr = leaderAndIsr.isr
  // reassign目标副本集 选择 isr中的存活副本
  val leaderOpt = PartitionLeaderElectionAlgorithms.reassignPartitionLeaderElection(targetReplicas, isr, liveReplicas.toSet)
  // isr保持不变 --- isr会在下一步stopRemovedReplicasOfReassignedPartition删除
  val newLeaderAndIsrOpt = leaderOpt.map(leader => leaderAndIsr.newLeader(leader))
  ElectionResult(partition, newLeaderAndIsrOpt, targetReplicas)
}
// PartitionLeaderElectionAlgorithms
def reassignPartitionLeaderElection(reassignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  reassignment.find(id => liveReplicas.contains(id) && isr.contains(id))
}

2-3-3. Automatic Leader Rebalance

Scenario 3: the Controller periodically checks whether the leader role is evenly distributed across the broker cluster; if not, it triggers a rebalance that moves partition leaders.

AdminUtils#assignReplicasToBrokers: when a topic is created, the Controller spreads the partition replicas evenly across brokers, and the first replica of each partition becomes the leader. The preferred replica is simply the first replica in the partition's assignment.

Looking at /kafka/brokers/topics/{topic}, for example, the preferred replica of partition 0 is brokerId=222.

json
{"partitions":{"0":[222,111],"1":[111,333],"2":[333,222]}}

Default configuration for automatic leader rebalance:

  1. auto.leader.rebalance.enable=true: automatic leader rebalance is enabled;
  2. leader.imbalance.check.interval.seconds=300: check every 5 minutes;
  3. leader.imbalance.per.broker.percentage=10: a rebalance of partition leaders is triggered once the imbalance ratio exceeds 10%;

KafkaController#checkAndTriggerAutoLeaderRebalance: the automatic rebalance check works as follows; the key point is how the 10% threshold is hit.

For example, suppose brokerId=333 is listed first (i.e. is the preferred replica) in 10 partitions.

Case 1: brokerId=333 is not the leader of 2 of those 10 partitions, so imbalanceRatio=20% and a rebalance is performed for those two partitions;

Case 2: if brokerId=333 is not the leader of only 1 partition, imbalanceRatio=10% does not exceed the threshold, so no rebalance happens.

scala
private def checkAndTriggerAutoLeaderRebalance(): Unit = {
  // preferred副本(每个分区的第一个副本) -> topic partition -> 副本
  val preferredReplicasForTopicsByBrokers: Map[Int, Map[TopicPartition, Seq[Int]]] =
    controllerContext.allPartitions.filterNot {
      tp => topicDeletionManager.isTopicQueuedUpForDeletion(tp.topic)
    }.map { tp =>
      (tp, controllerContext.partitionReplicaAssignment(tp) )
    }.toMap.groupBy { case (_, assignedReplicas) => assignedReplicas.head }

  preferredReplicasForTopicsByBrokers.foreach { case (leaderBroker, topicPartitionsForBroker) =>
    val topicsNotInPreferredReplica = topicPartitionsForBroker.filter { case (topicPartition, _) =>
      val leadershipInfo = controllerContext.partitionLeadershipInfo.get(topicPartition)
      leadershipInfo.exists(_.leaderAndIsr.leader != leaderBroker)
    }
    val imbalanceRatio = topicsNotInPreferredReplica.size.toDouble / topicPartitionsForBroker.size

    // 对于当前broker 非preferred分区 / 非preferred+preferred分区 > 10 %
    if (imbalanceRatio > (config.leaderImbalancePerBrokerPercentage.toDouble / 100)) {
       // 循环非preferred分区,broker在这个分区的isr里且存活,这个分区才会重新选举
      val candidatePartitions = topicsNotInPreferredReplica.keys.filter(tp =>
        controllerContext.partitionsBeingReassigned.isEmpty &&
        !topicDeletionManager.isTopicQueuedUpForDeletion(tp.topic) &&
        controllerContext.allTopics.contains(tp.topic) &&
        canPreferredReplicaBeLeader(tp)
     )
      onReplicaElection(candidatePartitions.toSet, ElectionType.PREFERRED, AutoTriggered)
    }
  }
}

Election#leaderForPreferredReplica: the strategy simply takes the first (preferred) replica from the partition assignment.

scala
  private def leaderForPreferredReplica(partition: TopicPartition,
            leaderAndIsr: LeaderAndIsr,
            controllerContext: ControllerContext): ElectionResult = {
  // 分区当前分配情况,如:brokerId=[1, 2]
  val assignment = controllerContext.partitionReplicaAssignment(partition)
  val liveReplicas = assignment.filter(replica => controllerContext.isReplicaOnline(replica, partition))
  val isr = leaderAndIsr.isr
  // preferred策略,选assignment中第一个副本,要求它在isr中且存活
  val leaderOpt = PartitionLeaderElectionAlgorithms.preferredReplicaPartitionLeaderElection(assignment, isr, liveReplicas.toSet)
  // isr不变
  val newLeaderAndIsrOpt = leaderOpt.map(leader => leaderAndIsr.newLeader(leader))
  ElectionResult(partition, newLeaderAndIsrOpt, assignment)
}
// PartitionLeaderElectionAlgorithms
def preferredReplicaPartitionLeaderElection(assignment: Seq[Int], isr: Seq[Int], liveReplicas: Set[Int]): Option[Int] = {
  assignment.headOption.filter(id => liveReplicas.contains(id) && isr.contains(id))
}

2-3-4. Broker Shuts Down Gracefully

Scenario 4: a broker shuts down gracefully.

KafkaServer#shutdown: on a controlled shutdown, the broker sends a ControlledShutdownRequest to the Controller.

KafkaController#doControlledShutdown: the Controller finds that this broker is the leader of some partitions and triggers leader election for them.

scala
private def doControlledShutdown(id: Int, brokerEpoch: Long): Set[TopicPartition] = {
  // ...
  val (partitionsLedByBroker, partitionsFollowedByBroker) = partitionsToActOn.partition { partition =>
    controllerContext.partitionLeadershipInfo(partition).leaderAndIsr.leader == id
  }
  // 对于自己是leader的分区,重新选举,发送LeaderAndIsr
  partitionStateMachine.handleStateChanges(partitionsLedByBroker.toSeq, OnlinePartition, Some(ControlledShutdownPartitionLeaderElectionStrategy))
  // ...
  // 对于自己是follower的分区,将自己从isr中移除,发送LeaderAndIsr
  replicaStateMachine.handleStateChanges(partitionsFollowedByBroker.map(partition =>
    PartitionAndReplica(partition, id)).toSeq, OfflineReplica)
  // ...
}

Election#leaderForControlledShutdown: the leader is chosen from the live ISR, excluding the broker that is shutting down.

scala
  private def leaderForControlledShutdown(partition: TopicPartition,
              leaderAndIsr: LeaderAndIsr,
              shuttingDownBrokerIds: Set[Int],
              controllerContext: ControllerContext): ElectionResult = {
  val assignment = controllerContext.partitionReplicaAssignment(partition)
  val liveOrShuttingDownReplicas = assignment.filter(replica =>
    controllerContext.isReplicaOnline(replica, partition, includeShuttingDownBrokers = true))
  val isr = leaderAndIsr.isr
  // leader = 从 (存活isr - 下线broker) 中选一个
  val leaderOpt = PartitionLeaderElectionAlgorithms.controlledShutdownPartitionLeaderElection(assignment, isr,
    liveOrShuttingDownReplicas.toSet, shuttingDownBrokerIds)
  // isr = isr - 下线broker
  val newIsr = isr.filter(replica => !shuttingDownBrokerIds.contains(replica))
  val newLeaderAndIsrOpt = leaderOpt.map(leader => leaderAndIsr.newLeaderAndIsr(leader, newIsr))
  ElectionResult(partition, newLeaderAndIsrOpt, liveOrShuttingDownReplicas)
}
// PartitionLeaderElectionAlgorithms
def controlledShutdownPartitionLeaderElection(assignment: Seq[Int], 
        isr: Seq[Int], liveReplicas: Set[Int], 
                shuttingDownBrokers: Set[Int]): Option[Int] = {
  assignment.find(id => liveReplicas.contains(id) 
                  && isr.contains(id) && !shuttingDownBrokers.contains(id))
}

3. Broker Handling of LeaderAndIsrRequest

Whenever the leader or the ISR changes, the Controller sends out a LeaderAndIsrRequest.

java
public class LeaderAndIsrRequestData implements ApiMessage {
    // controller的brokerId
    private int controllerId;
    // controller任期
    private int controllerEpoch;
    // controller看到当前broker的epoch
    private long brokerEpoch;
    // 分区状态
    private List<LeaderAndIsrTopicState> topicStates;
    // 相关的存活的leader brokers
    private List<LeaderAndIsrLiveLeader> liveLeaders;
}
static public class LeaderAndIsrTopicState implements Message {
    private String topicName;
    private List<LeaderAndIsrPartitionState> partitionStates;
}
static public class LeaderAndIsrPartitionState implements Message {
    private String topicName;
    // 分区
    private int partitionIndex;
    private int controllerEpoch;
    // 分区leader的brokerId
    private int leader;
    // leader epoch
    private int leaderEpoch;
    // isr列表
    private List<Integer> isr;
    // 副本列表
    private List<Integer> replicas;
    // ...
}

ReplicaManager#becomeLeaderOrFollower: how a broker handles the LeaderAndIsrRequest.

  1. Validate the request, e.g. the controller epoch;
  2. Run makeLeaders for partitions where this broker becomes the leader, makeFollowers otherwise;
  3. Start the highwatermark-checkpoint thread, which flushes the in-memory high watermarks to the replication-offset-checkpoint file (a parsing sketch follows this list);
  4. onLeadershipChange: for internal topics, consumer_offsets loads consumer offsets into memory and transaction_state loads transaction state into memory;
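
As referenced in step 3, the replication-offset-checkpoint file under each log directory is a small text file: a version line, an entry count, and then one "topic partition offset" line per partition. Below is a minimal parsing sketch assuming that layout; the object and method names are illustrative.

scala
import scala.io.Source

object HighWatermarkCheckpointSketch {

  /**
   * Parse a replication-offset-checkpoint file. The layout written by the
   * highwatermark-checkpoint thread is:
   *   line 1: version (0)
   *   line 2: number of entries
   *   line 3+: "<topic> <partition> <highWatermark>" per partition
   */
  def read(path: String): Map[(String, Int), Long] = {
    val source = Source.fromFile(path)
    try {
      val lines = source.getLines().toList
      require(lines.head.trim == "0", s"unexpected checkpoint version: ${lines.head}")
      val expectedEntries = lines(1).trim.toInt
      val entries = lines.drop(2).map { line =>
        val Array(topic, partition, hw) = line.trim.split("\\s+")
        (topic, partition.toInt) -> hw.toLong
      }.toMap
      require(entries.size == expectedEntries, "entry count mismatch")
      entries
    } finally source.close()
  }
}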
scala
  def becomeLeaderOrFollower(correlationId: Int,
         leaderAndIsrRequest: LeaderAndIsrRequest,
         onLeadershipChange: (Iterable[Partition], Iterable[Partition]) => Unit): LeaderAndIsrResponse = {
    replicaStateChangeLock synchronized {
      val controllerId = leaderAndIsrRequest.controllerId
      // 入参分区状态
      val requestPartitionStates = leaderAndIsrRequest.partitionStates.asScala
      // 1. 校验controller epoch
      if (leaderAndIsrRequest.controllerEpoch < controllerEpoch) {
        leaderAndIsrRequest.getErrorResponse(0, Errors.STALE_CONTROLLER_EPOCH.exception)
      } else {
        val responseMap = new mutable.HashMap[TopicPartition, Errors]
        controllerEpoch = leaderAndIsrRequest.controllerEpoch
        val partitionStates = new mutable.HashMap[Partition, LeaderAndIsrPartitionState]()
        requestPartitionStates.foreach { partitionState =>
        // 2. 内存获取或创建partition
          val topicPartition = new TopicPartition(partitionState.topicName, partitionState.partitionIndex)
          val partitionOpt = getPartition(topicPartition) match {
            case HostedPartition.Offline =>
              responseMap.put(topicPartition, Errors.KAFKA_STORAGE_ERROR)
              None
            case HostedPartition.Online(partition) =>
              Some(partition)
            case HostedPartition.None =>
              val partition = Partition(topicPartition, time, this)
              allPartitions.putIfNotExists(topicPartition, HostedPartition.Online(partition))
              Some(partition)
          }
          // 3. 校验partition leader的epoch,需要大于当前节点看到的leader的epoch,才能做其他操作
          partitionOpt.foreach { partition =>
            val currentLeaderEpoch = partition.getLeaderEpoch
            val requestLeaderEpoch = partitionState.leaderEpoch
            if (requestLeaderEpoch > currentLeaderEpoch) {
              if (partitionState.replicas.contains(localBrokerId))
                partitionStates.put(partition, partitionState)
              else {
                responseMap.put(topicPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION)
              }
            } else if (requestLeaderEpoch < currentLeaderEpoch) {
              responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH)
            } else {
              responseMap.put(topicPartition, Errors.STALE_CONTROLLER_EPOCH)
            }
          }
        }

        // 4. 根据partition状态,成为leader或follower,会创建log数据目录和文件
        val partitionsToBeLeader = partitionStates.filter { case (_, partitionState) =>
          partitionState.leader == localBrokerId
        }
        val partitionsToBeFollower = partitionStates.filter { case (k, _) => !partitionsToBeLeader.contains(k) }
        val highWatermarkCheckpoints = new LazyOffsetCheckpoints(this.highWatermarkCheckpoints)
        val partitionsBecomeLeader = if (partitionsToBeLeader.nonEmpty)
          makeLeaders(controllerId, controllerEpoch, partitionsToBeLeader, correlationId, responseMap,
            highWatermarkCheckpoints)
        else
          Set.empty[Partition]
        val partitionsBecomeFollower = if (partitionsToBeFollower.nonEmpty)
          makeFollowers(controllerId, controllerEpoch, partitionsToBeFollower, correlationId, responseMap,
            highWatermarkCheckpoints)
        else
          Set.empty[Partition]
        // 5. 开启highwatermark-checkpoint线程
        startHighWatermarkCheckPointThread()
        // 6. 如果是协调者系统topic,如consumer_offsets和transaction_state,加载 消费进度 和 事务状态
        onLeadershipChange(partitionsBecomeLeader, partitionsBecomeFollower)
        val responsePartitions = responseMap.iterator.map { case (tp, error) =>
          new LeaderAndIsrPartitionError()
            .setTopicName(tp.topic)
            .setPartitionIndex(tp.partition)
            .setErrorCode(error.code)
        }.toBuffer
        new LeaderAndIsrResponse(new LeaderAndIsrResponseData()
          .setErrorCode(Errors.NONE.code)
          .setPartitionErrors(responsePartitions.asJava))
      }
    }
  }

3-1. Leader

Partition#makeLeader: when this broker becomes the partition leader:

  1. Update the assignment and LeaderAndIsr in memory;
  2. Create the Log directory and files if they do not exist yet;
  3. If the leadership changed and this broker is the new leader, initialize the remote replica states;
  4. An ISR change may change the high watermark; a rising high watermark may complete pending delayed operations (e.g. produce requests with acks=-1);

Here we focus on steps 1 and 3; high watermark changes are covered later.

scala
def makeLeader(partitionState: LeaderAndIsrPartitionState,
                 highWatermarkCheckpoints: OffsetCheckpoints): Boolean = {
    val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
      controllerEpoch = partitionState.controllerEpoch
      val isr = partitionState.isr.asScala.map(_.toInt).toSet
      val addingReplicas = partitionState.addingReplicas.asScala.map(_.toInt)
      val removingReplicas = partitionState.removingReplicas.asScala.map(_.toInt)
      // 1. 更新assignment和leaderAndIsr到内存
      updateAssignmentAndIsr(
        assignment = partitionState.replicas.asScala.map(_.toInt),
        isr = isr,
        addingReplicas = addingReplicas,
        removingReplicas = removingReplicas
      )
      // 2. 尝试创建Log数据文件,高水位=replication-offset-checkpoint中的高水位
      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
      // 内存变量更新...
      val leaderLog = localLogOrException
      val leaderEpochStartOffset = leaderLog.logEndOffset
      leaderEpoch = partitionState.leaderEpoch
      leaderEpochStartOffsetOpt = Some(leaderEpochStartOffset)
      zkVersion = partitionState.zkVersion
      leaderLog.maybeAssignEpochStartOffset(leaderEpoch, leaderEpochStartOffset)
      val isNewLeader = !isLeader
      val curTimeMs = time.milliseconds
      remoteReplicas.foreach { replica =>
        val lastCaughtUpTimeMs = if (inSyncReplicaIds.contains(replica.brokerId)) curTimeMs else 0L
        replica.resetLastCaughtUpTime(leaderEpochStartOffset, curTimeMs, lastCaughtUpTimeMs)
      }
      // 记录leader任期的起始offset到leader-epoch-checkpoint
      leaderLog.maybeAssignEpochStartOffset(leaderEpoch, leaderEpochStartOffset)
      if (isNewLeader) {
        // 3. 刚成为leader,初始化其他副本状态
        leaderReplicaIdOpt = Some(localBrokerId)
        remoteReplicas.foreach { replica =>
          replica.updateFetchState(
            followerFetchOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata,
            followerStartOffset = Log.UnknownOffset,
            followerFetchTimeMs = 0L,
            leaderEndOffset = Log.UnknownOffset)
        }
      }
      // 4. 每次isr变更,重新计算hw,因为isr可能变小,导致hw变大
      (maybeIncrementLeaderHW(leaderLog), isNewLeader)
    }
    // hw增加,尝试完成延迟操作,比如produce请求acks=-1
    if (leaderHWIncremented)
      tryCompleteDelayedRequests()
    isNewLeader
  }

Partition#updateAssignmentAndIsr: updates the partition assignment and ISR in memory.

For example, with the local brokerId=1: assignment=[1,2,3] means there are 3 replicas, isr=[1,2] means 2 replicas have caught up with the leader, and remoteReplicasMap holds the Replica info for brokerId in (2,3).

scala
// 其他副本id(brokerId) 和 信息
private val remoteReplicasMap = new Pool[Int, Replica]
// isr集合 包含n个brokerId
var inSyncReplicaIds = Set.empty[Int]
// assignment分配信息 包含n个brokerId
var assignmentState: AssignmentState = SimpleAssignmentState(Seq.empty)
def updateAssignmentAndIsr(assignment: Seq[Int],
                           isr: Set[Int],
                           addingReplicas: Seq[Int],
                           removingReplicas: Seq[Int]): Unit = {
  // 更新其他副本map
  val newRemoteReplicas = assignment.filter(_ != localBrokerId)
  val removedReplicas = remoteReplicasMap.keys.filter(!newRemoteReplicas.contains(_))

  newRemoteReplicas.foreach(id => remoteReplicasMap.getAndMaybePut(id, new Replica(id, topicPartition)))
  remoteReplicasMap.removeAll(removedReplicas)

  if (addingReplicas.nonEmpty || removingReplicas.nonEmpty)
     // 分区重分配中间状态
    assignmentState = OngoingReassignmentState(addingReplicas, removingReplicas, assignment)
  else
    // 正常状态,更新assignment分配情况
    assignmentState = SimpleAssignmentState(assignment)
  // isr集合变更
  inSyncReplicaIds = isr
}

The leader maintains the following Replica information for each remote replica:

  1. logEndOffsetMetadata: the LEO, i.e. the follower's write offset;
  2. logStartOffset: the follower's start offset;
  3. lastFetchLeaderLogEndOffset: the leader's LEO at the time of the follower's last FetchRequest;
  4. lastFetchTimeMs: the timestamp of the follower's last FetchRequest;
  5. lastCaughtUpTimeMs: the last time the follower caught up with the leader, the key input for deciding whether the follower leaves the ISR (covered later);
scala
class Replica(val brokerId: Int, val topicPartition: TopicPartition) {
  private[this] var _logEndOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata
  private[this] var _logStartOffset = Log.UnknownOffset
  private[this] var lastFetchLeaderLogEndOffset = 0L
  private[this] var lastFetchTimeMs = 0L
  private[this] var _lastCaughtUpTimeMs = 0L
}

3-2. Follower

ReplicaManager#makeFollowers:

  1. partition.makeFollower: similar to the leader path, updateAssignmentAndIsr updates the in-memory state and the log files are created; if the leader (or the leader epoch) changed, it returns true and we proceed to the next step;
  2. On a leader change, the partition is added to the ReplicaFetcherManager, with the initial fetch offset = this replica's current high watermark;
scala
private def makeFollowers(controllerId: Int,
              controllerEpoch: Int,
              partitionStates: Map[Partition, LeaderAndIsrPartitionState],
              correlationId: Int,
              responseMap: mutable.Map[TopicPartition, Errors],
              highWatermarkCheckpoints: OffsetCheckpoints) : Set[Partition] = {
    partitionStates.foreach { case (partition, partitionState) =>
      responseMap.put(partition.topicPartition, Errors.NONE)
    }
    val partitionsToMakeFollower: mutable.Set[Partition] = mutable.Set()
    try {
      partitionStates.foreach { case (partition, partitionState) =>
        val newLeaderBrokerId = partitionState.leader
        try {
          metadataCache.getAliveBrokers.find(_.id == newLeaderBrokerId) match {
            case Some(_) =>
              // 1. 对于partition的leader存活的情况下,makeFollower
              if (partition.makeFollower(partitionState, highWatermarkCheckpoints))
                partitionsToMakeFollower += partition
          }
        } catch {
            responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
        }
      }
      // 先从fetcher线程中移除分区
      replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.map(_.topicPartition))
      partitionsToMakeFollower.foreach { partition =>
        completeDelayedFetchOrProduceRequests(partition.topicPartition)
      }
      if (isShuttingDown.get()) {
      } else {
        // 2. 分区加入fetcher
        val partitionsToMakeFollowerWithLeaderAndOffset = partitionsToMakeFollower.map { partition =>
          val leader = metadataCache.getAliveBrokers.find(_.id == partition.leaderReplicaIdOpt.get).get
            .brokerEndPoint(config.interBrokerListenerName)
          // follower从HW开始同步
          val fetchOffset = partition.localLogOrException.highWatermark
          partition.topicPartition -> InitialFetchState(leader, partition.getLeaderEpoch, fetchOffset)
       }.toMap
        // 分区 -> InitialFetchState(leader/epoch/offset)
        replicaFetcherManager.addFetcherForPartitions(partitionsToMakeFollowerWithLeaderAndOffset)
      }
    } catch {
        throw e
    }
    partitionsToMakeFollower
  }

AbstractFetcherManager#addFetcherForPartitions: each partition is assigned to a fixed fetcher thread. With the default num.replica.fetchers=1, the fetcher id for a partition is hash(partition) % 1 = 0, which means only one fetcher thread per source broker, e.g. ReplicaFetcherThread-0-{brokerId} (the hash itself is sketched after the code below).

scala
// brokerId + fetcherId -> FetcherThread
private[server] val fetcherThreadMap =
            new mutable.HashMap[BrokerIdAndFetcherId, T]
def addFetcherForPartitions(partitionAndOffsets: Map[TopicPartition, InitialFetchState]): Unit = {
  lock synchronized {
    // 1. getFetcherId 分配fetcherId = hash(TopicPartition) % num.replica.fetchers(1) = 0
    val partitionsPerFetcher = partitionAndOffsets.groupBy { case (topicPartition, brokerAndInitialFetchOffset) =>
      BrokerAndFetcherId(brokerAndInitialFetchOffset.leader, getFetcherId(topicPartition))
    }
    // ReplicaFetcherThread-$fetcherId-${sourceBroker.id}
    def addAndStartFetcherThread(brokerAndFetcherId: BrokerAndFetcherId,
                                 brokerIdAndFetcherId: BrokerIdAndFetcherId): T = {
      val fetcherThread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
      fetcherThreadMap.put(brokerIdAndFetcherId, fetcherThread)
      fetcherThread.start()
      fetcherThread
    }
    for ((brokerAndFetcherId, initialFetchOffsets) <- partitionsPerFetcher) {
      // 2. 创建或获取分区对应Fetcher线程 = hash(brokerId + fetcherId)
      val brokerIdAndFetcherId = BrokerIdAndFetcherId(brokerAndFetcherId.broker.id, brokerAndFetcherId.fetcherId)
      val fetcherThread = fetcherThreadMap.get(brokerIdAndFetcherId) match {
        case Some(currentFetcherThread) if currentFetcherThread.sourceBroker == brokerAndFetcherId.broker =>
          currentFetcherThread
        case Some(f) =>
          f.shutdown()
          addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
        case None =>
          addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
      }

      // initialOffsetAndEpochs = 拉取位点高水位 + 当前leader epoch
      val initialOffsetAndEpochs = initialFetchOffsets.map { case (tp, brokerAndInitOffset) =>
        tp -> OffsetAndEpoch(brokerAndInitOffset.initOffset, brokerAndInitOffset.currentLeaderEpoch)
      }

      // 3. 分区初始offset加入fetcherThread
      addPartitionsToFetcherThread(fetcherThread, initialOffsetAndEpochs)
    }
  }
}
def addPartitions(initialFetchStates: Map[TopicPartition, OffsetAndEpoch]): Set[TopicPartition] = {
    partitionMapLock.lockInterruptibly()
    try {
      failedPartitions.removeAll(initialFetchStates.keySet)

      initialFetchStates.foreach { case (tp, initialFetchState) =>
        val currentState = partitionStates.stateValue(tp)
        val updatedState = if (currentState != null && currentState.currentLeaderEpoch == initialFetchState.leaderEpoch) {
          currentState
        } else if (initialFetchState.offset < 0) {
          fetchOffsetAndTruncate(tp, initialFetchState.leaderEpoch)
        } else {
          // 初始状态=Truncating
          PartitionFetchState(initialFetchState.offset, None, initialFetchState.leaderEpoch, state = Truncating)
        }
        partitionStates.updateAndMoveToEnd(tp, updatedState)
      }
      partitionMapCond.signalAll()
      initialFetchStates.keySet
    } finally partitionMapLock.unlock()
  }
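
The fetcher id mentioned above is derived from a hash of the topic-partition. Here is a minimal sketch of that mapping, with math.abs standing in for Kafka's Utils.abs and purely illustrative values:

scala
final case class BrokerIdAndFetcherId(brokerId: Int, fetcherId: Int)

object FetcherIdSketch {

  /** Map a topic-partition to a fetcher id in [0, numFetchers). */
  def getFetcherId(topic: String, partition: Int, numFetchers: Int): Int =
    math.abs(31 * topic.hashCode + partition) % numFetchers

  def main(args: Array[String]): Unit = {
    // with the default num.replica.fetchers=1, every partition maps to fetcher 0,
    // i.e. a single ReplicaFetcherThread-0-{sourceBrokerId} per source broker
    val key = BrokerIdAndFetcherId(222, getFetcherId("topicA", partition = 0, numFetchers = 1))
    println(key) // BrokerIdAndFetcherId(222,0)
  }
}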

The fetcher thread maintains a fetch state for every partition it pulls from that broker; the initial state is Truncating.

scala
abstract class AbstractFetcherThread(...) {
  // map结构 key=partition value=PartitionFetchState
  private val partitionStates = new PartitionStates[PartitionFetchState]
}
case class PartitionFetchState(
               // 拉取offset
               fetchOffset: Long,
               lag: Option[Long],
               // leader epoch
               currentLeaderEpoch: Int,
               // 拉取异常,延迟时间
               delay: Option[DelayedItem],
               // 状态
               state: ReplicaState)
}
sealed trait ReplicaState
case object Truncating extends ReplicaState
case object Fetching extends ReplicaState

4. Data Replication

4-1. Follower

The fetcher thread loops over truncate and fetch, handling partitions in the Truncating and Fetching states respectively.

scala
  override def doWork(): Unit = {
    maybeTruncate()
    maybeFetch()
  }

4-1-1. Truncate

Every partition's data directory contains a leader-epoch-checkpoint file that records the start offset of each leader epoch. It is used to keep the partition's replicas eventually consistent (via data truncation).

For example, in the file below: epoch=0 covers offsets [0,1500); epoch=1 covers [1500,3200); epoch=2 covers [3200,?). A lookup sketch follows the file.

This was mentioned in juejin.cn/post/754391...

text
3 // number of entries
0 0 // epoch, start offset
1 1500 // epoch, start offset
2 3200 // epoch, start offset
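
Conceptually, the end offset of a given epoch is the start offset of the next recorded epoch, or the log end offset if it is the newest one; this is the answer the leader computes for OffsetsForLeaderEpoch. A minimal lookup sketch over the entries above (simplified; the real LeaderEpochFileCache also handles unknown and future epochs):

scala
object LeaderEpochLookupSketch {

  final case class EpochEntry(epoch: Int, startOffset: Long)

  /**
   * End offset of `requestedEpoch` = start offset of the next higher epoch,
   * or the log end offset when the requested epoch is the latest one.
   * Returns None if the epoch is not present in the checkpoint at all.
   */
  def endOffsetFor(entries: Seq[EpochEntry], requestedEpoch: Int, logEndOffset: Long): Option[Long] = {
    val sorted = entries.sortBy(_.epoch)
    sorted.find(_.epoch == requestedEpoch).map { _ =>
      sorted.find(_.epoch > requestedEpoch).map(_.startOffset).getOrElse(logEndOffset)
    }
  }

  def main(args: Array[String]): Unit = {
    // the three entries from the checkpoint file shown above
    val entries = Seq(EpochEntry(0, 0L), EpochEntry(1, 1500L), EpochEntry(2, 3200L))
    println(endOffsetFor(entries, requestedEpoch = 1, logEndOffset = 4000L)) // Some(3200)
    println(endOffsetFor(entries, requestedEpoch = 2, logEndOffset = 4000L)) // Some(4000)
  }
}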

When leader-epoch-checkpoint is written:

  1. Leader: while handling a LeaderAndIsrRequest, it records its new epoch together with its current LEO;
  2. Follower: while handling a FetchResponse and appending data, it creates an entry for any new leader epoch;

AbstractFetcherThread#maybeTruncate: two cases, depending on whether the partition has a leader epoch.

  1. The partition has data and therefore an epoch: ask the leader for the end offset of that epoch;
  2. The partition has no data and no epoch: truncate to the high watermark;
scala
 private def maybeTruncate(): Unit = {
    val (partitionsWithEpochs, partitionsWithoutEpochs) = fetchTruncatingPartitions()
    // case1 分区有数据记录,需要通过分区epoch
    if (partitionsWithEpochs.nonEmpty) {
      truncateToEpochEndOffsets(partitionsWithEpochs)
    }
    // case2 分区没数据记录,从高水位截断数据
    if (partitionsWithoutEpochs.nonEmpty) {
      truncateToHighWatermark(partitionsWithoutEpochs)
    }
  }

AbstractFetcherThread#truncateToEpochEndOffsets:

  1. The follower sends an OffsetsForLeaderEpochRequest to the leader, asking for the end offset of the last epoch in its local partition data;
  2. The leader looks up its in-memory leader-epoch-checkpoint and returns that epoch's end offset;
  3. If the returned offset is smaller than the follower's own offset, the follower truncates its data in maybeTruncateToEpochEndOffsets;
  4. The partition enters the Fetching state;

Steps 1 and 2 also came up on the consumer side: when a consumer notices that the partition's leader epoch has changed, it sends an OffsetsForLeaderEpochRequest to get the last offset of the leader epoch it fetched against.

scala
private def truncateToEpochEndOffsets(latestEpochsForPartitions: Map[TopicPartition, EpochData]): Unit = {
    // 1. 发送OffsetsForLeaderEpochRequest
    val endOffsets = fetchEpochEndOffsets(latestEpochsForPartitions)
    inLock(partitionMapLock) {
      // 校验请求期间 leader epoch没发生变化
      val epochEndOffsets = endOffsets.filter { case (tp, _) =>
        val curPartitionState = partitionStates.stateValue(tp)
        val partitionEpochRequest = latestEpochsForPartitions.getOrElse(tp, {
          throw new IllegalStateException()
        })
        val leaderEpochInRequest = partitionEpochRequest.currentLeaderEpoch.get
        curPartitionState != null && leaderEpochInRequest == curPartitionState.currentLeaderEpoch
      }
      // 2. 执行数据截断
      val ResultWithPartitions(fetchOffsets, partitionsWithError) = maybeTruncateToEpochEndOffsets(epochEndOffsets, latestEpochsForPartitions)
      // 3-1. partitionsWithError - leaderEpoch发生变更,延迟fetch
      handlePartitionsWithErrors(partitionsWithError, "truncateToEpochEndOffsets")
      // 3-2. fetchOffsets - 标记分区截断完成,进入Fetching
      updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
    }
  }

Log#truncateTo: the truncation logic is as follows.

scala
private[log] def truncateTo(targetOffset: Long): Boolean = {
    maybeHandleIOException() {
      if (targetOffset >= logEndOffset) {
        // 正常情况,当前数据的leader epoch的结束offset >= LEO当前写入进度
        false
      } else {
        // 异常情况,leader副本比follower副本数据多,比如发生unclean选举
        info(s"Truncating to offset $targetOffset")
        lock synchronized {
          checkIfMemoryMappedBufferClosed()
          if (segments.firstEntry.getValue.baseOffset > targetOffset) {
            // 如果所有数据都大于目标offset,全量截断
            truncateFullyAndStartAt(targetOffset)
          } else {
            // 删除超过targetOffset的segment
            val deletable = logSegments.filter(segment => segment.baseOffset > targetOffset)
            removeAndDeleteSegments(deletable, asyncDelete = true)
            // 对当前segment截断到targetOffset
            activeSegment.truncateTo(targetOffset)
            // 更新offset信息,如LEO、HW、recoveryPoint(刷盘进度)...
            updateLogEndOffset(targetOffset)
            updateLogStartOffset(math.min(targetOffset, this.logStartOffset))
            // leader-epoch-checkpoint也需要截断
            leaderEpochCache.foreach(_.truncateFromEnd(targetOffset))
            loadProducerState(targetOffset, reloadFromCleanShutdown = false)
          }
          true
        }
      }
    }
  }

4-1-2. Fetch

AbstractFetcherThread#maybeFetch:

  1. Build the FetchRequest;
  2. Send the FetchRequest;
  3. On receiving the FetchResponse, append the data to the partition log;
  4. Update the fetch state: the next fetch offset = the appended position;
scala
private def maybeFetch(): Unit = {
  // 1. 构建FetchRequest
  val fetchRequestOpt = inLock(partitionMapLock) {
    val ResultWithPartitions(fetchRequestOpt, partitionsWithError) 
          = buildFetch(partitionStates.partitionStateMap.asScala)
    fetchRequestOpt
  }
  // 2. 发送FetchRequest
  fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
    processFetchRequest(sessionPartitions, fetchRequest)
  }
}
// 分区 -> 分区fetch状态
private val partitionStates = new PartitionStates[PartitionFetchState]
protected val partitionMapLock = new ReentrantLock
private val partitionMapCond = partitionMapLock.newCondition()
private def processFetchRequest(sessionPartitions: util.Map[TopicPartition, FetchRequest.PartitionData],
                                  fetchRequest: FetchRequest.Builder): Unit = {
    val partitionsWithError = mutable.Set[TopicPartition]()
    var responseData: Map[TopicPartition, FetchData] = Map.empty
    // 1. 发送FetchRequest,接收FetchResponse
    try {
      trace(s"Sending fetch request $fetchRequest")
      responseData = fetchFromLeader(fetchRequest)
    } catch {
      case t: Throwable =>
        if (isRunning) {
          inLock(partitionMapLock) {
            partitionsWithError ++= partitionStates.partitionSet.asScala
            partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
          }
        }
    }
    inLock(partitionMapLock) {
      responseData.foreach { case (topicPartition, partitionData) =>
        Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
          partitionData.error match {
            case Errors.NONE =>
              // 2. 写数据
              val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset,
                partitionData)
              // 3. 设置下次fetch offset
              logAppendInfoOpt.foreach { logAppendInfo =>
                val validBytes = logAppendInfo.validBytes
                val nextOffset = if (validBytes > 0) logAppendInfo.lastOffset + 1 else currentFetchState.fetchOffset
                val lag = Math.max(0L, partitionData.highWatermark - nextOffset)
                if (validBytes > 0 && partitionStates.contains(topicPartition)) {
                  val newFetchState = PartitionFetchState(nextOffset, Some(lag), currentFetchState.currentLeaderEpoch, state = Fetching)
                  partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
                }
              }
              // 异常加入partitionsWithError...
          }
        }
      }
    }
    // 4. 异常分区 延迟fetch
    if (partitionsWithError.nonEmpty) {
      handlePartitionsWithErrors(partitionsWithError, "processFetchRequest")
    }
  }

We only look at steps 1 and 3 here.

Step 1: building the FetchRequest works the same as for a normal consumer:

  1. Incremental fetch is used, so when no partitions have changed there is no need to send all partitions;
  2. The default fetch parameters are analogous:
    • replica.fetch.min.bytes=1: if less than 1 byte of messages is available, the FetchRequest is parked on the server;
    • replica.fetch.wait.max.ms=500: how long the request may be parked while minBytes is not met;
    • replica.fetch.max.bytes=1MB: maximum bytes fetched per partition;
    • replica.fetch.response.max.bytes=10MB: maximum bytes in a single fetch response;

The only difference is that replicaId in the FetchRequest is set to the follower's brokerId, because the leader needs to track every replica's sync progress in order to manage the ISR.

ReplicaFetcherThread#processPartitionData: step 3, the follower receives the FetchResponse.

  1. Append the data to the partition log (same as the leader's log write, including building indexes, the leader-epoch-checkpoint, and the recovery-point flush progress);
  2. Set the high watermark = the leader's high watermark, and logStartOffset = the leader's logStartOffset;
scala
override def processPartitionData(topicPartition: TopicPartition,
                    fetchOffset: Long,
                    partitionData: FetchData): Option[LogAppendInfo] = {
  val partition = replicaMgr.nonOfflinePartition(topicPartition).get
  val log = partition.localLogOrException
  val records = toMemoryRecords(partitionData.records)
  // 写数据
  val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)
  val leaderLogStartOffset = partitionData.logStartOffset
  // follower更新hw=leader
  val followerHighWatermark = log.updateHighWatermark(partitionData.highWatermark)
  // follower更新logStartOffset=leader
  log.maybeIncrementLogStartOffset(leaderLogStartOffset, LeaderOffsetIncremented)
  logAppendInfo
}
// 写入数据appendRecordsToFollowerOrFutureReplica
// Log#appendAsFollower
def appendAsFollower(records: MemoryRecords): LogAppendInfo = {
  append(records,
    origin = AppendOrigin.Replication,
    interBrokerProtocolVersion = ApiVersion.latestVersion,
    // 不需要分配数据记录的offset
    assignOffsets = false,
    leaderEpoch = -1,
    ignoreRecordSize = true)
}

4-2. Leader

4-2-1. Handling FetchRequest

The leader handles a follower's FetchRequest the same way it handles a consumer's.

Partition#updateFollowerFetchState: the difference is that after reading the log data, the leader updates the follower's fetch state, which drives the high/low watermarks and the ISR.

scala
def updateFollowerFetchState(followerId: Int,
                               followerFetchOffsetMetadata: LogOffsetMetadata,
                               followerStartOffset: Long,
                               followerFetchTimeMs: Long,
                               leaderEndOffset: Long): Boolean = {
    getReplica(followerId) match {
      case Some(followerReplica) =>
        // 低水位 和删除数据有关 忽略
        val oldLeaderLW = if (delayedOperations.numDelayedDelete > 0) lowWatermarkIfLeader else -1L
        val prevFollowerEndOffset = followerReplica.logEndOffset
        // 1. 更新follower副本状态
        followerReplica.updateFetchState(
          followerFetchOffsetMetadata, // follower本次请求的offset
          followerStartOffset, // follower的logStartOffset
          followerFetchTimeMs, // 收到fetch请求的时间
          leaderEndOffset) // leader当前的写入位置LEO
        val newLeaderLW = if (delayedOperations.numDelayedDelete > 0) lowWatermarkIfLeader else -1L
        val leaderLWIncremented = newLeaderLW > oldLeaderLW
        // 2. 如果follower不在isr中,校验是否需要扩张isr
        if (!inSyncReplicaIds.contains(followerId))
          maybeExpandIsr(followerReplica, followerFetchTimeMs)
        // 3. follower可能已经在isr中,校验是否需要增加HW
        val leaderHWIncremented = if (prevFollowerEndOffset != followerReplica.logEndOffset) {
          leaderLogIfLocal.exists(leaderLog => maybeIncrementLeaderHW(leaderLog, followerFetchTimeMs))
        } else {
          false
        }
        // 高低水位变化,可能完成挂起的请求,比如ProduceRequest acks=-1
        if (leaderLWIncremented || leaderHWIncremented)
          tryCompleteDelayedRequests()
        true
      case None =>
        false
    }
  }

Replica#updateFetchState: every time the follower fetches, the leader updates that replica's sync progress.

The fetch offset represents the follower's LEO, and sync progress is expressed by lastCaughtUpTimeMs, the last time the follower caught up with the leader.

  1. If this fetch offset ≥ the leader's current LEO, the follower has fully caught up, and lastCaughtUpTimeMs = the time this fetch request was received;
  2. Otherwise, if this fetch offset ≥ the leader's LEO at the previous fetch, the follower had caught up as of the previous fetch, and lastCaughtUpTimeMs = the time the previous fetch request was received;
  3. Otherwise, lastCaughtUpTimeMs stays unchanged;
scala
// follower的写入进度LEO
private[this] var _logEndOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata
// follower的logStartOffset
private[this] var _logStartOffset = Log.UnknownOffset
// 上次fetch时的leader的LEO
private[this] var lastFetchLeaderLogEndOffset = 0L
// 上次fetch时间
private[this] var lastFetchTimeMs = 0L
// 上次追上leader的时间
private[this] var _lastCaughtUpTimeMs = 0L
def updateFetchState(
  // fetch请求offset
  followerFetchOffsetMetadata: LogOffsetMetadata,
  // fetch请求中的logStartOffset
  followerStartOffset: Long,
  // leader收到fetch请求时间
  followerFetchTimeMs: Long,
  // leader收到fetch请求时的LEO写入进度
  leaderEndOffset: Long): Unit = {
  // 如果fetch offset >= leader写入位置
  // lastCaughtUpTimeMs=本次fetch请求时间,代表follower已经完全追上leader
  if (followerFetchOffsetMetadata.messageOffset >= leaderEndOffset)
    _lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, followerFetchTimeMs)
  // 否则,如果本次fetch offset >= 上次fetch时leader的LEO,
  // lastCaughtUpTimeMs=上次fetch请求时间,代表follower在上一次fetch时追上leader
  else if (followerFetchOffsetMetadata.messageOffset >= lastFetchLeaderLogEndOffset)
    _lastCaughtUpTimeMs = math.max(_lastCaughtUpTimeMs, lastFetchTimeMs)
  // 记录follower的本次fetch请求情况
  _logStartOffset = followerStartOffset
  _logEndOffsetMetadata = followerFetchOffsetMetadata
  lastFetchLeaderLogEndOffset = leaderEndOffset
  lastFetchTimeMs = followerFetchTimeMs
}

4-2-2. High Watermark (HW) Changes

What the high watermark is used for:

  1. Producing with acks=-1 (all): success is only returned to the client once the HW passes the produced batch;
  2. Consuming: with the default isolation level READ_UNCOMMITTED, only messages below the HW can be consumed;

Scenarios in which the high watermark rises:

  1. ISR change: the ISR shrinks down to just the leader, e.g. a follower goes offline and the leader receives a LeaderAndIsrRequest;
  2. Replica LEO change: the leader receives a follower fetch and sees the follower's LEO advance; or, when the ISR contains only the leader, the leader receives a ProduceRequest and its own LEO advances;

Partition#maybeIncrementLeaderHW: recomputes the HW. The HW is monotonically increasing: if the new HW is smaller than the old one, it is not updated.

  1. Replicas participating in the HW calculation = replicas in the ISR + replicas whose last catch-up with the leader's LEO was within replica.lag.time.max.ms (30s);
  2. HW = the minimum LEO among those replicas;

When the ISR contains only the leader, the leader's LEO keeps advancing and so does the HW. Without the 30-second buffer, a follower might never catch up with the HW, and no replica other than the leader could ever enter the ISR.

scala
private def maybeIncrementLeaderHW(leaderLog: Log, curTime: Long = time.milliseconds): Boolean = {
  inReadLock(leaderIsrUpdateLock) {
    var newHighWatermark = leaderLog.logEndOffsetMetadata
    remoteReplicasMap.values.foreach { replica =>
      // 取LEO的最小值
      if (replica.logEndOffsetMetadata.messageOffset < newHighWatermark.messageOffset &&
        // 副本 = Fetch延迟小于30s(replica.lag.time.max.ms)
        // || ISR中的副本
        (curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || inSyncReplicaIds.contains(replica.brokerId))) {
        newHighWatermark = replica.logEndOffsetMetadata
      }
    }
    // 尝试更新HW,HW只能单调递增,如果新HW小于老HW,不更新
    leaderLog.maybeIncrementHighWatermark(newHighWatermark) match {
      case Some(oldHighWatermark) =>
        true
      case None =>
        false
    }
  }
}

4-2-3. ISR Expansion

The ISR (In-Sync Replicas) is the list of replicas in sync with the leader, including the leader itself.

The ISR strikes a balance between availability and consistency:

  1. For produce requests with acks=-1, after the leader appends the data it only responds success once the high watermark ≥ the LEO at append time. The producer does not wait for every replica to write, since the HW depends on the minimum LEO across the ISR, which preserves availability. In addition, with min.insync.replicas (default 1), the write is rejected back to the producer whenever the ISR is smaller than this value, which preserves consistency (a minimal sketch of this check follows this list);
  2. With the default unclean.leader.election.enable=false, when the leader goes offline the new leader can only be chosen from the ISR, which avoids data loss;
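
A minimal sketch of the min.insync.replicas guard from item 1; in Kafka this check happens while the leader appends an acks=-1 batch and surfaces to the producer as NOT_ENOUGH_REPLICAS, whereas the types and names below are simplified stand-ins.

scala
object MinIsrCheckSketch {

  final case class NotEnoughReplicasException(message: String) extends RuntimeException(message)

  /** Reject an acks=-1 append when the ISR has already shrunk below min.insync.replicas. */
  def checkEnoughReplicas(inSyncReplicaIds: Set[Int], minInSyncReplicas: Int, requiredAcks: Int): Unit =
    if (requiredAcks == -1 && inSyncReplicaIds.size < minInSyncReplicas)
      throw NotEnoughReplicasException(
        s"ISR size ${inSyncReplicaIds.size} is below min.insync.replicas=$minInSyncReplicas")

  def main(args: Array[String]): Unit = {
    checkEnoughReplicas(Set(1, 2), minInSyncReplicas = 2, requiredAcks = -1) // passes
    checkEnoughReplicas(Set(1), minInSyncReplicas = 2, requiredAcks = 1)     // passes: acks=1 skips the check
    // checkEnoughReplicas(Set(1), minInSyncReplicas = 2, requiredAcks = -1) // would throw
  }
}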

ISR expansion is triggered when the leader receives a follower's fetch request.

Partition#maybeExpandIsr: if a follower joins the ISR, the leader updates /brokers/topics/{topic}/partitions/{partitionId}/state and its own in-memory ISR.

scala
private def maybeExpandIsr(followerReplica: Replica, followerFetchTimeMs: Long): Unit = {
  // 读锁下 判一次是否需要扩张
  val needsIsrUpdate = inReadLock(leaderIsrUpdateLock) {
    needsExpandIsr(followerReplica)
  }
  if (needsIsrUpdate) {
    // 如果需要 升级写锁 判一次是否需要扩张
    inWriteLock(leaderIsrUpdateLock) {
      if (needsExpandIsr(followerReplica)) {
        val newInSyncReplicaIds = inSyncReplicaIds + followerReplica.brokerId
        // 更新/brokers/topics/{topic}/partitions/{partitionId}/state和内存
        expandIsr(newInSyncReplicaIds)
      }
    }
  }
}

Partition#needsExpandIsr: for a follower to join the ISR, it must satisfy: 1) follower LEO ≥ the leader's current HW; 2) follower LEO ≥ the start offset of the current leader epoch.

For the reasoning behind the second condition, see github.com/apache/kafk... and issues.apache.org/jira/browse...

scala
// isr副本集合 包含n个brokerId
var inSyncReplicaIds = Set.empty[Int]
// 当前leader任期的起始offset
private var leaderEpochStartOffsetOpt: Option[Long] = None
private def needsExpandIsr(followerReplica: Replica): Boolean = {
  leaderLogIfLocal.exists { leaderLog =>
    val leaderHighwatermark = leaderLog.highWatermark
    !inSyncReplicaIds.contains(followerReplica.brokerId) 
        && isFollowerInSync(followerReplica, leaderHighwatermark)
  }
}
private def isFollowerInSync(followerReplica: Replica, highWatermark: Long): Boolean = {
  val followerEndOffset = followerReplica.logEndOffset
  followerEndOffset >= highWatermark
      && leaderEpochStartOffsetOpt.exists(followerEndOffset >= _)
}

4-2-4. ISR Shrink

ReplicaManager#startup: a scheduled task runs every replica.lag.time.max.ms / 2 = 30000 / 2 ms, i.e. every 15 seconds, to check whether the ISR needs to shrink.

scala
private val allPartitions = new Pool[TopicPartition, HostedPartition](
  valueFactory = Some(tp => HostedPartition.Online(Partition(tp, time, this)))
)
def startup(): Unit = {
  // 定时校验是否需要收缩isr
  scheduler.schedule("isr-expiration", maybeShrinkIsr _, 
                     period = config.replicaLagTimeMaxMs / 2, 
                     unit = TimeUnit.MILLISECONDS)
}
private def maybeShrinkIsr(): Unit = {
  allPartitions.keys.foreach { topicPartition =>
    nonOfflinePartition(topicPartition).foreach(_.maybeShrinkIsr())
  }
}

Partition#maybeShrinkIsr: handled in a similar way to ISR expansion.

scala
def maybeShrinkIsr(): Unit = {
  // 1. 读锁 判断是否需要shrink
  val needsIsrUpdate = inReadLock(leaderIsrUpdateLock) {
    needsShrinkIsr()
  }
  // 2. 如果需要 上写锁再次判断
  val leaderHWIncremented = needsIsrUpdate 
          && inWriteLock(leaderIsrUpdateLock) {
    leaderLogIfLocal match {
      case Some(leaderLog) =>
        // 3. 获取踢出isr的副本ids
        val outOfSyncReplicaIds = getOutOfSyncReplicas(replicaLagTimeMaxMs)
        if (outOfSyncReplicaIds.nonEmpty) {
          val newInSyncReplicaIds = inSyncReplicaIds -- outOfSyncReplicaIds
          // 4. 更新zk和内存
          shrinkIsr(newInSyncReplicaIds)
          // 5. isr收缩,可能导致hw增加
          maybeIncrementLeaderHW(leaderLog)
        } else {
          false
        }
      case None => false
    }
  }
  // 6. 如果hw增加,可能需要完成produceRequest等挂起请求
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
}
// 如果有出isr的副本,需要上写锁
private def needsShrinkIsr(): Boolean = {
  if (isLeader) {
    val outOfSyncReplicaIds = getOutOfSyncReplicas(replicaLagTimeMaxMs)
    outOfSyncReplicaIds.nonEmpty
  } else {
    false
  }
}

Partition#getOutOfSyncReplicas: a follower is kicked out of the ISR when both conditions hold:

1) the follower replica's LEO has not reached the leader's LEO;

2) the time since the follower last caught up with the leader (lastCaughtUpTimeMs) exceeds 30s (replica.lag.time.max.ms);

scala
def getOutOfSyncReplicas(maxLagMs: Long): Set[Int] = {
  // 需要校验的副本id = isr - leader自己
  val candidateReplicaIds = inSyncReplicaIds - localBrokerId
  val currentTimeMs = time.milliseconds()
  // leader的leo
  val leaderEndOffset = localLogOrException.logEndOffset
  candidateReplicaIds.filter(replicaId => isFollowerOutOfSync(replicaId, leaderEndOffset, currentTimeMs, maxLagMs))
}

private def isFollowerOutOfSync(replicaId: Int,
                                leaderEndOffset: Long,
                                currentTimeMs: Long,
                                maxLagMs: Long): Boolean = {
  val followerReplica = getReplicaOrException(replicaId)
  followerReplica.logEndOffset != leaderEndOffset 
    && (currentTimeMs - followerReplica.lastCaughtUpTimeMs) > maxLagMs
}

Summary

Leader election

Leader election is performed by the broker acting as Controller. The flow is:

  1. Read /brokers/topics/{topic}/partitions/{partitionId}/state to get the current LeaderAndIsr;
  2. Elect a leader according to the strategy;
  3. Update /brokers/topics/{topic}/partitions/{partitionId}/state;
  4. Send LeaderAndIsr requests to the relevant brokers;

Election scenarios and algorithms:

  1. A broker that is a partition leader goes down unexpectedly; the Controller notices via the child watch on /brokers/ids. The leader is preferably chosen from live ISR replicas, with isr = isr - offline brokers; if unclean election is enabled (unclean.leader.election.enable=true), it falls back to a live non-ISR replica, and the ISR then contains only the new leader;
  2. A broker that is a partition leader shuts down gracefully and sends a ControlledShutdownRequest to the Controller. The leader is chosen from the live ISR replicas minus the shutting-down broker, and isr = isr - the shutting-down broker.
  3. An admin runs a partition reassignment in which the leader replica is removed from the assignment. The leader is chosen from the reassignment's target replicas that are live and in the ISR, and the ISR stays unchanged;
  4. The Controller runs a leader rebalance every 5 minutes (leader.imbalance.check.interval.seconds); if a broker's imbalance ratio exceeds 10% (leader.imbalance.per.broker.percentage), the affected partitions are re-elected using the preferred strategy, which picks the first replica in the assignment as leader, with the ISR unchanged;

Data replication

After the Controller finishes leader election, it sends a LeaderAndIsrRequest to the brokers hosting the relevant partition replicas.

Each broker takes a different path depending on whether it is the leader or a follower; overall, the follower acts like a consumer pulling messages from the leader, and its "consume" logic is to write those messages into its local data files.

Notes on the follower:

  1. A newly added partition starts in the Truncating state: the follower sends an OffsetsForLeaderEpochRequest to the leader containing the last leader epoch of its local data, and the leader returns that epoch's last offset. If that offset is smaller than the follower's own LEO, the follower truncates and starts sending fetch requests from that offset;
  2. Once the fetch offset is settled, the partition enters the Fetching state; the follower sends FetchRequests (with replicaId = its own brokerId), and for each FetchResponse writes the data to its local files, including building indexes and setting HW = the leader's HW;

Notes on the leader: the replicaId in the FetchRequest identifies a follower fetch, and the fetch offset represents the follower's LEO. After each fetch the leader:

  1. Updates the replica's sync progress, expressed by lastCaughtUpTimeMs, the last time the follower caught up with the leader:
    • If this fetch offset ≥ the leader's current LEO, the follower has fully caught up, and lastCaughtUpTimeMs = the time this fetch request was received;
    • If this fetch offset ≥ the leader's LEO at the previous fetch, the follower had caught up as of the previous fetch, and lastCaughtUpTimeMs = the time the previous fetch request was received;
    • Otherwise, lastCaughtUpTimeMs stays unchanged;
  2. Tries to expand the ISR;
  3. Tries to advance the high watermark;

High watermark

What the high watermark is used for:

  1. Producing with acks=-1 (all): success is only returned to the client once the HW passes the produced batch;
  2. Consuming: with the default isolation level READ_UNCOMMITTED, only messages below the HW can be consumed;

The high watermark is monotonically increasing; it rises when the ISR shrinks or when replica LEOs advance.

HW = min(LEO of the participating replicas), where the participating replicas = replicas in the ISR + replicas whose last catch-up time with the leader's LEO (lastCaughtUpTimeMs) is within 30s (replica.lag.time.max.ms).

ISR

The ISR (In-Sync Replicas) is the list of replicas in sync with the leader, including the leader itself, and is used to balance availability against consistency:

  1. For produce requests with acks=-1, the config min.insync.replicas (default 1) causes the write to be rejected back to the producer whenever the ISR is smaller than this value, preserving consistency;
  2. Under the default configuration, a new leader can only be chosen from the ISR, avoiding data loss;

ISR expansion:

  1. Triggered when the leader receives a follower's fetch request;
  2. A follower joins the ISR when: follower LEO ≥ the leader's high watermark and follower LEO ≥ the start offset of the current leader epoch;
  3. On a change, the leader updates /brokers/topics/{topic}/partitions/{partitionId}/state and its own in-memory state;

ISR shrink:

  1. The leader runs a check every replica.lag.time.max.ms / 2 = 15 seconds to decide whether the ISR needs to shrink;
  2. A follower is removed from the ISR when: the follower's LEO < the leader's LEO and the time since it last caught up with the leader (lastCaughtUpTimeMs) exceeds 30s (replica.lag.time.max.ms);
  3. On a change, the leader updates /brokers/topics/{topic}/partitions/{partitionId}/state and its own in-memory state;
  4. ISR shrink may cause the high watermark to rise;