Overview
The Kafka source tree consists of multiple modules, each responsible for a different concern. The core modules and their responsibilities are:
- Broker (server-side) source: implements the core functionality of the Kafka broker, including log storage, the controller, coordinators, metadata and state-machine management, delayed-operation mechanisms, consumer-group management, and the high-concurrency network model.
- Java client source: implements the Producer and Consumer interaction with brokers, plus the shared supporting components.
- Connect source: builds bidirectional streaming synchronization between heterogeneous data systems.
- Streams source: implements real-time stream processing.
- Raft source: implements the Raft consensus protocol.
- Admin module: Kafka's administration module; operates on topics and partitions, e.g. creating and deleting topics or expanding partitions.
- Api module: handles data exchange, i.e. encoding and decoding of the data passed between clients and the server.
- Client module: classes with which a producer reads broker metadata, such as topics, partitions, and their leaders.
- Cluster module: entity classes such as Broker, Cluster, Partition, and Replica.
- Common module: the various exception classes and error validation.
- Consumer module: client-side consumer data handling and logic.
- Controller module: election of the central controller, partition leader election, replica assignment and reassignment, partition and replica expansion, and so on.
- Coordinator module: manages (part of) the consumer groups and their offsets.
- Javaapi module: the Java-language Producer and Consumer API interfaces.
- Log module: Kafka's file storage; reads and writes the message data of all topics.
- Message module: wraps batches of records into message sets, optionally compressed.
- Metrics module: internal state monitoring.
- Network module: handles client connections and network events.
- Producer module: the producer implementation, including synchronous and asynchronous message sending.
- Security module: Kafka's security validation and management.
- Serializer module: serialization and deserialization of message contents.
- Server module: leader and offset checkpoints, dynamic configuration, delayed topic creation and deletion, leader election, admin and replica management, and more.
- Tools module: assorted tools, e.g. exporting consumer offsets, dumping LogSegments information, topic log locations, and offsets stored in ZooKeeper.
- Utils module: utility classes such as Json, ZkUtils, thread-pool helpers, and the KafkaScheduler shared scheduler.
Together these modules make up Kafka's overall architecture and allow it to provide a high-throughput, highly available messaging service.

The Kafka source branch analyzed here is 1.0.2.

When KafkaServer starts up, it calls replicaManager.startup():
```scala
// the replica manager
/* start replica manager */
replicaManager = createReplicaManager(isShuttingDown)
replicaManager.startup()
```

replicaManager.startup() registers two scheduled tasks:
```scala
def startup() {
  // start ISR expiration thread
  // A follower can lag behind leader for up to config.replicaLagTimeMaxMs x 1.5 before it is removed from ISR
  // Periodically check whether any replica in a topic-partition's ISR is lagging or stuck
  // and therefore needs to be removed from the ISR
  scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
  // Check whether ISR changes need to be propagated: if any topic-partition's ISR has changed, this task
  // touches the corresponding ZK node, which in turn triggers the controller to react.
  scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)

  val haltBrokerOnFailure = config.interBrokerProtocolVersion < KAFKA_1_0_IV0
  logDirFailureHandler = new LogDirFailureHandler("LogDirFailureHandler", haltBrokerOnFailure)
  logDirFailureHandler.start()
}

// Iterate over all Partition instances on this broker and check whether any replica
// needs to be removed from their ISR
private def maybeShrinkIsr(): Unit = {
  trace("Evaluating ISR list of partitions to see which replicas can be removed from the ISR")
  nonOfflinePartitionsIterator.foreach(_.maybeShrinkIsr(config.replicaLagTimeMaxMs))
}

/**
 * This function periodically runs to see if ISR needs to be propagated. It propagates ISR when:
 * 1. There is ISR change not propagated yet.
 * 2. There is no ISR Change in the last five seconds, or it has been more than 60 seconds since the last ISR propagation.
 * This allows an occasional ISR change to be propagated within a few seconds, and avoids overwhelming controller and
 * other brokers when large amount of ISR change occurs.
 */
// Pushes the topic-partitions whose ISR changed (isrChangeSet) to ZK via ReplicationUtils.propagateIsrChanges();
// only then does the controller learn which topic-partitions' ISRs have changed.
def maybePropagateIsrChanges() {
  val now = System.currentTimeMillis()
  isrChangeSet synchronized {
    // some topic-partition ISRs need updating
    if (isrChangeSet.nonEmpty &&
      // no ISR change was recorded within the last 5 seconds...
      (lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
        // ...or no ISR propagation has happened within the last 60 seconds
        lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
      // create the ISR-change notification in ZK
      ReplicationUtils.propagateIsrChanges(zkUtils, isrChangeSet)
      // clear isrChangeSet, which records the topic-partitions whose ISR changed
      isrChangeSet.clear()
      // record the time of this ISR propagation
      lastIsrPropagationMs.set(now)
    }
  }
}
```
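The two timestamps above implement a batch-and-throttle rule that is easy to miss on first read. The following self-contained sketch isolates just that decision; the constants mirror the 5-second blackout and 60-second interval named in the Javadoc, but the type and method names here are illustrative, not Kafka's:

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

// Minimal model of the isr-change-propagation throttle (hypothetical names).
object IsrPropagationThrottle {
  val BlackOutMs = 5000L   // assumed to match ReplicaManager.IsrChangePropagationBlackOut
  val IntervalMs = 60000L  // assumed to match ReplicaManager.IsrChangePropagationInterval

  private val isrChangeSet = mutable.Set.empty[String] // pending topic-partitions
  private val lastIsrChangeMs = new AtomicLong(0L)
  private val lastIsrPropagationMs = new AtomicLong(0L)

  def recordIsrChange(tp: String, now: Long): Unit = isrChangeSet.synchronized {
    isrChangeSet += tp
    lastIsrChangeMs.set(now)
  }

  // Returns the batch to write to ZK, or None if this round is suppressed.
  def maybePropagate(now: Long): Option[Set[String]] = isrChangeSet.synchronized {
    val quietLongEnough = lastIsrChangeMs.get() + BlackOutMs < now      // no change in the last 5s
    val overdue         = lastIsrPropagationMs.get() + IntervalMs < now // or last propagation > 60s ago
    if (isrChangeSet.nonEmpty && (quietLongEnough || overdue)) {
      val batch = isrChangeSet.toSet
      isrChangeSet.clear()
      lastIsrPropagationMs.set(now)
      Some(batch)
    } else None
  }
}
```

The effect is debouncing: a burst of ISR changes is folded into a single ZK notification once the burst quiets down, while the 60-second bound puts an upper limit on propagation delay.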
The methods in the Partition class involved in the maybeShrinkIsr task:
```scala
// Check whether each replica in the ISR needs to be removed from the ISR
def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
  val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
    // This check only runs on the leader replica: skip if the local replica is not this partition's leader
    leaderReplicaIfLocal match {
      case Some(leaderReplica) =>
        // Walk the ISR (excluding the leader) and collect the replicas that have lagged for more
        // than replicaMaxLagTimeMs and therefore must be removed from the ISR
        val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
        if (outOfSyncReplicas.nonEmpty) {
          val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
          assert(newInSyncReplicas.nonEmpty)
          info("Shrinking ISR from %s to %s".format(inSyncReplicas.map(_.brokerId).mkString(","),
            newInSyncReplicas.map(_.brokerId).mkString(",")))
          // update ISR in zk and in cache
          // With the new ISR computed, updateIsr() writes it to ZK; internally it also calls
          // ReplicaManager.recordIsrChange() to tell the ReplicaManager that this topic-partition's ISR changed.
          // (Note that even though the ISR in ZK has changed, the controller still cannot sense it at this point.)
          updateIsr(newInSyncReplicas)
          // we may need to increment high watermark since ISR could be down to 1
          // update metrics
          replicaManager.isrShrinkRate.mark()
          // Because the ISR changed, maybeIncrementLeaderHW() checks whether the partition's HW can advance:
          // with the ISR shrunk, the slow replicas holding the HW back may just have been removed.
          maybeIncrementLeaderHW(leaderReplica)
        } else {
          false
        }
      case None => false // do nothing if no longer leader
    }
  }
  // some delayed operations may be unblocked after HW changed
  if (leaderHWIncremented)
    tryCompleteDelayedRequests()
}

def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
  /**
   * there are two cases that will be handled here -
   * 1. Stuck followers: If the leo of the replica hasn't been updated for maxLagMs ms,
   *    the follower is stuck and should be removed from the ISR
   * 2. Slow followers: If the replica has not read up to the leo within the last maxLagMs ms,
   *    then the follower is lagging and should be removed from the ISR
   * Both these cases are handled by checking the lastCaughtUpTimeMs which represents
   * the last time when the replica was fully caught up. If either of the above conditions
   * is violated, that replica is considered to be out of sync
   **/
  val candidateReplicas = inSyncReplicas - leaderReplica
  val laggingReplicas = candidateReplicas.filter(r => (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
  if (laggingReplicas.nonEmpty)
    debug("Lagging replicas are %s".format(laggingReplicas.map(_.brokerId).mkString(",")))
  laggingReplicas
}

// Write the new ISR to ZK
private def updateIsr(newIsr: Set[Replica]) {
  val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(_.brokerId).toList, zkVersion)
  // perform the update
  val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partitionId,
    newLeaderAndIsr, controllerEpoch, zkVersion)
  // successfully written to ZK
  if (updateSucceeded) {
    // tell the replicaManager that this partition's ISR changed
    replicaManager.recordIsrChange(topicPartition)
    inSyncReplicas = newIsr
    zkVersion = newVersion
    trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
  } else {
    replicaManager.failedIsrUpdatesRate.mark()
    info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
  }
}

/**
 * Check and maybe increment the high watermark of the partition;
 * this function can be triggered when
 *
 * 1. Partition ISR changed
 * 2. Any replica's LEO changed
 *
 * The HW is determined by the smallest log end offset among all replicas that are in sync or are considered caught-up.
 * This way, if a replica is considered caught-up, but its log end offset is smaller than HW, we will wait for this
 * replica to catch up to the HW before advancing the HW. This helps the situation when the ISR only includes the
 * leader replica and a follower tries to catch up. If we don't wait for the follower when advancing the HW, the
 * follower's log end offset may keep falling behind the HW (determined by the leader's log end offset) and therefore
 * will never be added to ISR.
 *
 * Returns true if the HW was incremented, and false otherwise.
 * Note There is no need to acquire the leaderIsrUpdate lock here
 * since all callers of this private API acquire that lock
 */
// Check and possibly advance the leader's high watermark (HW), i.e. the largest committed offset visible to consumers.
// The HW may advance when: 1. the partition's ISR changes; 2. any replica's LEO changes.
// The new HW is the minimum LEO across the ISR plus the replicas considered caught up; the caught-up replicas are
// included to give a follower a chance to reach the HW, otherwise it might never make it back into the ISR.
private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
  // Collect the LEOs of the ISR plus the replicas considered caught up
  // (last fully-caught-up time within replica.lag.time.max.ms)
  val allLogEndOffsets = assignedReplicas.filter { replica =>
    curTime - replica.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
  }.map(_.logEndOffset)
  // the new HW value
  val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
  val oldHighWatermark = leaderReplica.highWatermark
  // Ensure that the high watermark increases monotonically. We also update the high watermark when the new
  // offset metadata is on a newer segment, which occurs whenever the log is rolled to a new segment.
  if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset ||
    (oldHighWatermark.messageOffset == newHighWatermark.messageOffset && oldHighWatermark.onOlderSegment(newHighWatermark))) {
    leaderReplica.highWatermark = newHighWatermark
    debug(s"High watermark updated to $newHighWatermark")
    true
  } else {
    debug(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old hw $oldHighWatermark. " +
      s"All LEOs are ${allLogEndOffsets.mkString(",")}")
    false
  }
}
```
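Both the shrink check and the HW computation reduce to simple arithmetic over per-replica state. The sketch below restates them with plain Longs (LogOffsetMetadata and the newer-segment tie-break are deliberately elided; ReplicaState is an illustrative type, not Kafka's):

```scala
// Illustrative per-replica snapshot as seen by the leader.
case class ReplicaState(brokerId: Int, logEndOffset: Long, lastCaughtUpTimeMs: Long)

object IsrMath {
  // Shrink side: one rule covers both "stuck" and "slow" followers, because both
  // manifest as lastCaughtUpTimeMs falling more than maxLagMs behind the clock.
  def outOfSync(isr: Set[ReplicaState], leaderId: Int, maxLagMs: Long, nowMs: Long): Set[ReplicaState] =
    isr.filter(r => r.brokerId != leaderId && nowMs - r.lastCaughtUpTimeMs > maxLagMs)

  // HW side: minimum LEO over the ISR plus caught-up replicas, never moving backwards.
  def newHighWatermark(isrOrCaughtUp: Set[ReplicaState], oldHw: Long): Long =
    math.max(oldHw, isrOrCaughtUp.map(_.logEndOffset).min)
}
```

For example, with an ISR of {1, 2, 3}, leader 1, maxLagMs = 10000, and replica 3 last caught up 15 seconds ago, outOfSync returns replica 3; once it is dropped, the minimum LEO is taken over replicas 1 and 2 only, which is exactly why a shrink can advance the HW.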
In ReplicaManager's fetchMessages() method, if the fetch request comes from a follower replica, updateFollowerLogReadResults() is called to update the leader's view of that remote replica:
```scala
/**
 * Update the follower's fetch state in the leader based on the last fetch request and update `readResult`,
 * if the follower replica is not recognized to be one of the assigned replicas. Do not update
 * `readResult` otherwise, so that log start/end offset and high watermark is consistent with
 * records in fetch response. Log start/end offset and high watermark may change not only due to
 * this fetch request, e.g., rolling new log segment and removing old log segment may move log
 * start offset further than the last offset in the fetched records. The followers will get the
 * updated leader's state in the next fetch response.
 */
// Called from ReplicaManager.fetchMessages() when the fetch request comes from a follower replica.
// It looks up this broker's Partition object and calls its updateReplicaLogReadResult() method to
// update the follower replica's LEO and fetch-time information.
private def updateFollowerLogReadResults(replicaId: Int,
                                         readResults: Seq[(TopicPartition, LogReadResult)]): Seq[(TopicPartition, LogReadResult)] = {
  debug(s"Recording follower broker $replicaId log end offsets: $readResults")
  readResults.map { case (topicPartition, readResult) =>
    var updatedReadResult = readResult
    nonOfflinePartition(topicPartition) match {
      case Some(partition) =>
        partition.getReplica(replicaId) match {
          case Some(replica) =>
            // update the follower replica's state
            partition.updateReplicaLogReadResult(replica, readResult)
          case None =>
            warn(s"Leader $localBrokerId failed to record follower $replicaId's position " +
              s"${readResult.info.fetchOffsetMetadata.messageOffset} since the replica is not recognized to be " +
              s"one of the assigned replicas ${partition.assignedReplicas.map(_.brokerId).mkString(",")} " +
              s"for partition $topicPartition. Empty records will be returned for this partition.")
            updatedReadResult = readResult.withEmptyFetchInfo
        }
      case None =>
        warn(s"While recording the replica LEO, the partition $topicPartition hasn't been created.")
    }
    topicPartition -> updatedReadResult
  }
}
```
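The field the shrink logic later reads, lastCaughtUpTimeMs, is maintained inside Replica.updateLogReadResult(). The sketch below condenses that bookkeeping under simplified types; it is a rough paraphrase of the 1.0.x logic rather than a verbatim copy, and the "credit the previous fetch time" rule is the subtle part:

```scala
// Simplified view of how the leader tracks one follower (illustrative class).
class FollowerView {
  var logEndOffset: Long = 0L
  var lastCaughtUpTimeMs: Long = 0L
  private var lastFetchLeaderLogEndOffset: Long = 0L
  private var lastFetchTimeMs: Long = 0L

  def onFetch(fetchOffset: Long, leaderLeoAtFetch: Long, fetchTimeMs: Long): Unit = {
    if (fetchOffset >= leaderLeoAtFetch)
      // Fully caught up to the leader's LEO as of this fetch: refresh the caught-up time.
      lastCaughtUpTimeMs = math.max(lastCaughtUpTimeMs, fetchTimeMs)
    else if (fetchOffset >= lastFetchLeaderLogEndOffset)
      // Caught up to where the leader was at the *previous* fetch: credit that older time.
      lastCaughtUpTimeMs = math.max(lastCaughtUpTimeMs, lastFetchTimeMs)
    logEndOffset = fetchOffset
    lastFetchLeaderLogEndOffset = leaderLeoAtFetch
    lastFetchTimeMs = fetchTimeMs
  }
}
```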
updateFollowerLogReadResults() in turn calls the corresponding method in the Partition class:
```scala
/**
 * Update the follower's state in the leader based on the last fetch request. See
 * [[kafka.cluster.Replica#updateLogReadResult]] for details.
 *
 * @return true if the leader's log start offset or high watermark have been updated
 */
def updateReplicaLogReadResult(replica: Replica, logReadResult: LogReadResult): Boolean = {
  val replicaId = replica.brokerId
  // No need to calculate low watermark if there is no delayed DeleteRecordsRequest
  val oldLeaderLW = if (replicaManager.delayedDeleteRecordsPurgatory.delayed > 0) lowWatermarkIfLeader else -1L
  // Update the replica's state: its LEO, lastFetchLeaderLogEndOffset and lastFetchTimeMs
  replica.updateLogReadResult(logReadResult)
  val newLeaderLW = if (replicaManager.delayedDeleteRecordsPurgatory.delayed > 0) lowWatermarkIfLeader else -1L
  // check if the LW of the partition has incremented
  // since the replica's logStartOffset may have incremented
  val leaderLWIncremented = newLeaderLW > oldLeaderLW
  // check if we need to expand ISR to include this replica
  // if it is not in the ISR yet
  // If this replica is not in the ISR, check whether it now satisfies the conditions to join it
  val leaderHWIncremented = maybeExpandIsr(replicaId, logReadResult)
  val result = leaderLWIncremented || leaderHWIncremented
  // some delayed operations may be unblocked after HW or LW changed
  if (result)
    tryCompleteDelayedRequests()
  debug(s"Recorded replica $replicaId log end offset (LEO) position ${logReadResult.info.fetchOffsetMetadata.messageOffset}.")
  result
}

/**
 * Check and maybe expand the ISR of the partition.
 * A replica will be added to ISR if its LEO >= current hw of the partition.
 *
 * Technically, a replica shouldn't be in ISR if it hasn't caught up for longer than replicaLagTimeMaxMs,
 * even if its log end offset is >= HW. However, to be consistent with how the follower determines
 * whether a replica is in-sync, we only check HW.
 *
 * This function can be triggered when a replica's LEO has incremented.
 *
 * @return true if the high watermark has been updated
 */
// Check whether the partition's ISR should be expanded: a replica whose LEO has reached the HW joins the ISR
def maybeExpandIsr(replicaId: Int, logReadResult: LogReadResult): Boolean = {
  inWriteLock(leaderIsrUpdateLock) {
    // check if this replica needs to be added to the ISR
    leaderReplicaIfLocal match {
      case Some(leaderReplica) =>
        val replica = getReplica(replicaId).get
        val leaderHW = leaderReplica.highWatermark
        // this replica is not yet in the ISR...
        if (!inSyncReplicas.contains(replica) &&
          // ...but it is in the assigned-replica (AR) list...
          assignedReplicas.map(_.brokerId).contains(replicaId) &&
          // ...and its LEO has reached the HW, so add it to the ISR
          replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
          val newInSyncReplicas = inSyncReplicas + replica
          info(s"Expanding ISR from ${inSyncReplicas.map(_.brokerId).mkString(",")} " +
            s"to ${newInSyncReplicas.map(_.brokerId).mkString(",")}")
          // update ISR in ZK and cache
          // write this topic-partition's new ISR to ZK
          updateIsr(newInSyncReplicas)
          replicaManager.isrExpandRate.mark()
        }
        // check if the HW of the partition can now be incremented
        // since the replica may already be in the ISR and its LEO has just incremented
        // Check whether the HW can advance, since the replica may have joined the ISR and its LEO has just grown
        maybeIncrementLeaderHW(leaderReplica, logReadResult.fetchTimeMs)
      case None => false // nothing to do if no longer leader
    }
  }
}
```
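Stripped of the locking and ZK plumbing, the expansion predicate is three conditions. A sketch with offsets flattened to Longs (the real code compares LogOffsetMetadata values via offsetDiff):

```scala
object IsrExpansion {
  // Hypothetical standalone restatement of the ISR-expansion condition.
  def shouldJoinIsr(replicaId: Int, replicaLeo: Long, leaderHw: Long,
                    isr: Set[Int], assigned: Set[Int]): Boolean =
    !isr.contains(replicaId) &&      // not already in the ISR
    assigned.contains(replicaId) &&  // but it is one of the assigned (AR) replicas
    replicaLeo >= leaderHw           // and its LEO has reached the leader's HW
}
```

Note the asymmetry the Javadoc above points out: joining only requires LEO >= HW at one instant, whereas staying in the ISR requires continuously catching up within replica.lag.time.max.ms.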
Handling of the update-metadata request sent by the controller starts in KafkaApis.handleUpdateMetadataRequest():
```scala
// Handle an update-metadata request from the controller
def handleUpdateMetadataRequest(request: RequestChannel.Request) {
  val correlationId = request.header.correlationId
  val updateMetadataRequest = request.body[UpdateMetadataRequest]
  if (authorize(request.session, ClusterAction, Resource.ClusterResource)) {
    // Ask the replicaManager to update the metadata cache; it returns the partitions to delete
    val deletedPartitions = replicaManager.maybeUpdateMetadataCache(correlationId, updateMetadataRequest)
    if (deletedPartitions.nonEmpty) {
      // Tell the GroupCoordinator to drop its state for the deleted partitions
      groupCoordinator.handleDeletedPartitions(deletedPartitions)
    }
    if (adminManager.hasDelayedTopicOperations) {
      updateMetadataRequest.partitionStates.keySet.asScala.map(_.topic).foreach { topic =>
        adminManager.tryCompleteDelayedTopicOperations(topic)
      }
    }
    sendResponseExemptThrottle(request, new UpdateMetadataResponse(Errors.NONE))
  } else {
    sendResponseMaybeThrottle(request, _ => new UpdateMetadataResponse(Errors.CLUSTER_AUTHORIZATION_FAILED))
  }
}
```
The implementation of ReplicaManager.maybeUpdateMetadataCache():
```scala
// The controller sends this request to all brokers so that they refresh their metadata caches
def maybeUpdateMetadataCache(correlationId: Int, updateMetadataRequest: UpdateMetadataRequest): Seq[TopicPartition] = {
  replicaStateChangeLock synchronized {
    // The request comes from a stale controller: throw an exception
    if (updateMetadataRequest.controllerEpoch < controllerEpoch) {
      val stateControllerEpochErrorMessage = s"Received update metadata request with correlation id $correlationId " +
        s"from an old controller ${updateMetadataRequest.controllerId} with epoch ${updateMetadataRequest.controllerEpoch}. " +
        s"Latest known controller epoch is $controllerEpoch"
      stateChangeLogger.warn(stateControllerEpochErrorMessage)
      throw new ControllerMovedException(stateChangeLogger.messageWithPrefix(stateControllerEpochErrorMessage))
    } else {
      // Update the metadata cache and return the topic-partitions that need to be deleted
      val deletedPartitions = metadataCache.updateCache(correlationId, updateMetadataRequest)
      controllerEpoch = updateMetadataRequest.controllerEpoch
      deletedPartitions
    }
  }
}
```
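The epoch comparison is a standard fencing pattern: reject anything from an older controller, remember the newest epoch seen. A minimal sketch of the same shape (the exception type and names here are illustrative):

```scala
// Hypothetical epoch-fencing guard, mirroring the shape of maybeUpdateMetadataCache.
class StaleControllerException(msg: String) extends RuntimeException(msg)

class MetadataEpochGuard {
  private var latestEpoch = 0

  def handleUpdate[T](requestEpoch: Int)(applyUpdate: => T): T = synchronized {
    if (requestEpoch < latestEpoch)
      throw new StaleControllerException(s"epoch $requestEpoch < latest known $latestEpoch")
    val result = applyUpdate   // e.g. metadataCache.updateCache(...)
    latestEpoch = requestEpoch // remember the newest controller epoch we have seen
    result
  }
}
```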
maybeUpdateMetadataCache() calls metadataCache.updateCache() to refresh the metadata cache and then returns the list of topic-partitions that need to be deleted:
```scala
// This method returns the deleted TopicPartitions received from UpdateMetadataRequest
// Update the local metadata cache and return the topic-partitions to delete
def updateCache(correlationId: Int, updateMetadataRequest: UpdateMetadataRequest): Seq[TopicPartition] = {
  inWriteLock(partitionMetadataLock) {
    controllerId = updateMetadataRequest.controllerId match {
      case id if id < 0 => None
      case id => Some(id)
    }
    // Clear the aliveNodes and aliveBrokers maps and rebuild them from the request
    aliveNodes.clear()
    aliveBrokers.clear()
    updateMetadataRequest.liveBrokers.asScala.foreach { broker =>
      // `aliveNodes` is a hot path for metadata requests for large clusters, so we use java.util.HashMap which
      // is a bit faster than scala.collection.mutable.HashMap. When we drop support for Scala 2.10, we could
      // move to `AnyRefMap`, which has comparable performance.
      val nodes = new java.util.HashMap[ListenerName, Node]
      val endPoints = new mutable.ArrayBuffer[EndPoint]
      broker.endPoints.asScala.foreach { ep =>
        endPoints += EndPoint(ep.host, ep.port, ep.listenerName, ep.securityProtocol)
        nodes.put(ep.listenerName, new Node(broker.id, ep.host, ep.port))
      }
      aliveBrokers(broker.id) = Broker(broker.id, endPoints, Option(broker.rack))
      aliveNodes(broker.id) = nodes.asScala
    }
    val deletedPartitions = new mutable.ArrayBuffer[TopicPartition]
    updateMetadataRequest.partitionStates.asScala.foreach { case (tp, info) =>
      val controllerId = updateMetadataRequest.controllerId
      val controllerEpoch = updateMetadataRequest.controllerEpoch
      // Partitions being deleted are removed from the cache and collected as this method's return value
      if (info.basePartitionState.leader == LeaderAndIsr.LeaderDuringDelete) {
        // remove from the cache
        removePartitionInfo(tp.topic, tp.partition)
        stateChangeLogger.trace(s"Deleted partition $tp from metadata cache in response to UpdateMetadata " +
          s"request sent by controller $controllerId epoch $controllerEpoch with correlation id $correlationId")
        deletedPartitions += tp
      } else {
        // All other partitions get an update-or-create
        addOrUpdatePartitionInfo(tp.topic, tp.partition, info)
        stateChangeLogger.trace(s"Cached leader info $info for partition $tp in response to " +
          s"UpdateMetadata request sent by controller $controllerId epoch $correlationId with correlation id $correlationId")
      }
    }
    deletedPartitions
  }
}
```
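Per partition, updateCache() boils down to a delete-or-upsert decision keyed on a sentinel leader id. The sketch below reduces it to just that branch; LeaderAndIsr.LeaderDuringDelete is a negative sentinel (assumed to be -2 here), and the map type is illustrative:

```scala
import scala.collection.mutable

object MetadataCacheSketch {
  val LeaderDuringDelete = -2 // assumed value of LeaderAndIsr.LeaderDuringDelete

  // Hypothetical reduction of the per-partition branch in updateCache().
  def applyPartitionState(tp: String, leader: Int,
                          cache: mutable.Map[String, Int]): Option[String] =
    if (leader == LeaderDuringDelete) {
      cache.remove(tp) // drop the partition from the local metadata cache
      Some(tp)         // surface it so the GroupCoordinator can clean up group/offset state
    } else {
      cache(tp) = leader // add-or-update path
      None
    }
}
```

Collecting the Some results across all partition states yields exactly the deletedPartitions list that handleUpdateMetadataRequest() then hands to the GroupCoordinator.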