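Kafka Source Code (3): Sending Messages, Client Side

Background

This article analyzes how the Kafka client sends messages.

Note: based on the Java client of Kafka 2.6.

I. Usage Example

Topic1 and Topic2 each have 3 partitions and 2 replicas.

Producer code:
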
public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", IntegerSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());
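    // Compression type; the default is none (no compression)
    props.put("compression.type", "zstd");
    // acks policy, default 1: 0 = do not wait; 1 = wait for the leader to write to its log; all (-1) = wait for all ISR replicas
    props.put("acks", "all");
    // How long a batch may wait before being sent; default 0 ms
    props.put("linger.ms", "1000");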
    KafkaProducer<Integer, String> producer = new KafkaProducer<>(props);
    for (int i = 0; i < 40000; i++) {
        producer.send(new ProducerRecord<>("Topic1", i, "Hello World"));
        producer.send(new ProducerRecord<>("Topic2", i, "Hello World"));
    }
    producer.close();
}

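Packet capture: a ProduceRequest sent by the producer carries messages for n partitions.

Each partition carries one batch (RecordBatch), and a batch contains multiple messages (Records).

On the broker side, the directory of Topic1's partition p0 looks like this.

Here the broker's log.segment.bytes is set to 32768 (32KB); the default is 1GB.
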
Topic1-0 % ls -lh
40B  00000000000000000000.index
30K  00000000000000000000.log
72B  00000000000000000000.timeindex
10M  00000000000000011208.index
5.8K  00000000000000011208.log
10B  00000000000000011208.snapshot
10M  00000000000000011208.timeindex
8B  leader-epoch-checkpoint

kafka-dump-log.sh shows the contents of a log file. The lastOffset of the last batch in the first log, plus 1, is exactly the file name of the second log.

./kafka-dump-log.sh --files ../config/server1/Topic1-0/00000000000000000000.log
Dumping ../config/server1/Topic1-0/00000000000000000000.log
Starting offset: 0
baseOffset: 0 lastOffset: 668 count: 669 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 0 CreateTime: 1753496555870 size: 1977 magic: 2 compresscodec: ZSTD crc: 273196058 isvalid: true
baseOffset: 669 lastOffset: 1344 count: 676 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 1977 CreateTime: 1753496555939 size: 1933 magic: 2 compresscodec: ZSTD crc: 196171263 isvalid: true
// ...
baseOffset: 10361 lastOffset: 11207 count: 847 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false position: 28449 CreateTime: 1753496556585 size: 2235 magic: 2 compresscodec: ZSTD crc: 2388557788 isvalid: true
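The naming rule can be sketched as follows. This is only an illustration, not broker code: `SegmentNames` and `fileNameFor` are hypothetical names, and the sketch assumes nothing beyond what the listing above shows, namely that a segment file is named after its base offset, zero-padded to 20 digits.

```java
import java.util.Locale;

final class SegmentNames {
    // Hypothetical helper: formats a base offset as a 20-digit, zero-padded
    // file name and appends the given suffix (".log", ".index", ...).
    static String fileNameFor(long baseOffset, String suffix) {
        return String.format(Locale.ROOT, "%020d%s", baseOffset, suffix);
    }

    public static void main(String[] args) {
        long lastOffsetOfFirstSegment = 11207L; // lastOffset of the final batch in the dump above
        long nextBaseOffset = lastOffsetOfFirstSegment + 1;
        System.out.println(fileNameFor(nextBaseOffset, ".log")); // prints 00000000000000011208.log
    }
}
```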

II. Client Components

public class KafkaProducer<K, V> implements Producer<K, V> {
    private final String clientId;
    // Partitioner
    private final Partitioner partitioner;
    // max.request.size = 1024 * 1024 = 1MB
    private final int maxRequestSize;
    // buffer.memory = 32 * 1024 * 1024 = 32MB
    private final long totalMemorySize;
    // Metadata cache
    private final ProducerMetadata metadata;
    // Record accumulator
    private final RecordAccumulator accumulator;
    // IO thread that sends messages
    private final Sender sender;
    private final Thread ioThread;
    // compression.type=none: compression codec, no compression by default
    private final CompressionType compressionType;
    // Serializers
    private final Serializer<K> keySerializer;
    private final Serializer<V> valueSerializer;
    // Producer configuration
    private final ProducerConfig producerConfig;
    // max.block.ms=60000: maximum time send() may block (fetching metadata + appending to the accumulator), 60s by default
    private final long maxBlockTimeMs;
    // Interceptors (extension point)
    private final ProducerInterceptors<K, V> interceptors;
    // Transaction manager (ignored in this article)
    private final TransactionManager transactionManager;
}

Partitioner

Partitioner: decides which partition a message is sent to.

The default is partitioner.class=org.apache.kafka.clients.producer.internals.DefaultPartitioner.

public interface Partitioner {
    public int partition(String topic, Object key, 
     byte[] keyBytes, Object value, 
     byte[] valueBytes, Cluster cluster);
}

ProducerMetadata

ProducerMetadata: the producer's metadata cache.

public class ProducerMetadata extends Metadata {
    // metadata.max.idle.ms = 5 * 60 * 1000 (5 minutes)
    private final long metadataIdleMs;
    // topic -> expiry time of its metadata
    private final Map<String, Long> topics = new HashMap<>();
    // Topics whose metadata has not been discovered yet
    private final Set<String> newTopics = new HashSet<>();
}

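Metadata: the base class for producer and consumer metadata.

Through the MetadataCache, the client can look up the connection information of the broker hosting a partition leader, which is what it needs to send messages.
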
public class Metadata implements Closeable {
    // Cluster metadata
    private MetadataCache cache = MetadataCache.empty();
}
/**
 * An internal mutable cache of nodes, topics, and partitions in the Kafka cluster. This keeps an up-to-date Cluster
 * instance which is optimized for read access.
 */
public class MetadataCache {
    private final String clusterId;
    // brokerId -> broker connection info (ip, port)
    private final Map<Integer, Node> nodes;
    private final Set<String> unauthorizedTopics;
    private final Set<String> invalidTopics;
    private final Set<String> internalTopics;
    // Controller connection info
    private final Node controller;
    // Partition metadata
    private final Map<TopicPartition, PartitionMetadata> metadataByPartition;
    // Immutable view built from the fields above
    private Cluster clusterInstance;
}
public static class PartitionMetadata {
    public final TopicPartition topicPartition;
    public final Errors error;
    // Leader (brokerId)
    public final Optional<Integer> leaderId;
    // Leader epoch
    public final Optional<Integer> leaderEpoch;
    // Replica ids (brokerIds)
    public final List<Integer> replicaIds;
    // In-sync replica (ISR) ids (brokerIds)
    public final List<Integer> inSyncReplicaIds;
    public final List<Integer> offlineReplicaIds;
}

On the client side, code generally works with the read-only Cluster view built from MetadataCache.

public final class Cluster {
    private final boolean isBootstrapConfigured;
    private final List<Node> nodes;
    private final Set<String> unauthorizedTopics;
    private final Set<String> invalidTopics;
    private final Set<String> internalTopics;
    // Controller node
    private final Node controller;
    // partition -> partition info
    private final Map<TopicPartition, PartitionInfo> partitionsByTopicPartition;
    // topic -> all partitions of the topic
    private final Map<String, List<PartitionInfo>> partitionsByTopic;
    // topic -> partitions whose leader is alive
    private final Map<String, List<PartitionInfo>> availablePartitionsByTopic;
    // brokerId -> partitions
    private final Map<Integer, List<PartitionInfo>> partitionsByNode;
    // brokerId -> broker
    private final Map<Integer, Node> nodesById;
}

RecordAccumulator

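RecordAccumulator: the record accumulator.

The Kafka producer can merge multiple messages into one batch (ProducerBatch) before sending it to the broker, which improves throughput.

Batching is governed mainly by two settings:

1) batch.size: the size of a batch, default 16384 = 16KB; if a single message exceeds this capacity, memory is allocated according to the message's actual size;

2) linger.ms: how long a batch that has not reached batch.size may wait before being sent, default 0, meaning the sender does not deliberately wait for more messages;
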
public final class RecordAccumulator {
    private final AtomicInteger flushesInProgress;
    private final AtomicInteger appendsInProgress;
    // Memory allocated per batch; default batch.size=16384=16KB
    private final int batchSize;
    // Compression type
    private final CompressionType compression;
    // How long a batch may linger; default linger.ms=0ms
    private final int lingerMs;
    private final long retryBackoffMs;
    private final int deliveryTimeoutMs;
    // Buffer pool; capacity = buffer.memory = 32 * 1024 * 1024 = 32MB
    private final BufferPool free;
    // partition -> queue of batches
    private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;
    private final IncompleteBatches incomplete;
    private final Set<TopicPartition> muted;
    private int drainIndex;
    private final TransactionManager transactionManager;
    private long nextBatchExpiryTimeMs = Long.MAX_VALUE;
}
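How batch.size and linger.ms interact can be sketched with a small predicate. This is a deliberately simplified model, not the accumulator's real readiness check (which also considers things such as retries and flushes); `BatchReadiness` and `isSendable` are hypothetical names used only for illustration.

```java
final class BatchReadiness {
    // Simplified rule: a batch may be sent once it is full,
    // or once it has waited at least lingerMs.
    static boolean isSendable(int batchBytes, int batchSize, long waitedMs, long lingerMs) {
        boolean full = batchBytes >= batchSize;
        boolean lingered = waitedMs >= lingerMs;
        return full || lingered;
    }

    public static void main(String[] args) {
        // linger.ms=0: even a tiny batch is sendable right away
        System.out.println(isSendable(100, 16384, 0, 0));      // true
        // linger.ms=1000: a small batch waits until the linger time elapses
        System.out.println(isSendable(100, 16384, 500, 1000)); // false
        // a full batch does not wait for linger.ms
        System.out.println(isSendable(16384, 16384, 0, 1000)); // true
    }
}
```

With linger.ms=1000 as in the usage example, small batches sit in the accumulator for up to a second unless they fill up first.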

Sender

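Application threads send messages through the KafkaProducer instance, which writes the data into the RecordAccumulator.

The Sender thread (kafka-producer-network-thread) reads data from the accumulator and performs the actual IO reads and writes.
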
public class Sender implements Runnable {
    // Network client
    private final KafkaClient client;
    // Accumulator
    private final RecordAccumulator accumulator;
}

Under the hood, KafkaClient uses an NIO Selector to handle IO events on the connections (SocketChannel) to multiple brokers.

Each connection between the client and a broker is wrapped in a KafkaChannel.

public class NetworkClient implements KafkaClient {
    private final Selectable selector;
}
public class Selector implements Selectable, AutoCloseable {
    // nio Selector
    private final java.nio.channels.Selector nioSelector;
    // connection id (brokerId) -> connection
    private final Map<String, KafkaChannel> channels;
}
public class KafkaChannel implements AutoCloseable {
    // Connection id (brokerId)
    private final String id;
    // Data is sent and received through the TransportLayer
    private final TransportLayer transportLayer;
    // Buffer being received
    private NetworkReceive receive;
    // Buffer being sent
    private Send send;
}
public class PlaintextTransportLayer implements TransportLayer {
    // The SelectionKey registers the read/write/connect events NIO should watch
    private final java.nio.channels.SelectionKey key;
    // Underlying NIO channel used to read and write data
    private final java.nio.channels.SocketChannel socketChannel;
}

Serializer

Serializer: the serializer.

key.serializer and value.serializer specify how the key and value of a Record are serialized.

public class IntegerSerializer implements Serializer<Integer> {
    public byte[] serialize(String topic, Integer data) {
        if (data == null)
            return null;

        return new byte[] {
            (byte) (data >>> 24),
            (byte) (data >>> 16),
            (byte) (data >>> 8),
            data.byteValue()
        };
    }
}
public class StringSerializer implements Serializer<String> {
    private String encoding = "UTF8";
    @Override
    public byte[] serialize(String topic, String data) {
        try {
            if (data == null)
                return null;
            else
                return data.getBytes(encoding);
        } catch (UnsupportedEncodingException e) {
            throw new SerializationException("Error when serializing string to byte[] due to unsupported encoding " + encoding);
        }
    }
}
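The shifts in IntegerSerializer produce a 4-byte big-endian encoding. The sketch below reproduces them on a primitive int and compares the result with ByteBuffer, which is big-endian by default. `BigEndianIntDemo`, `toBytes` and `fromBytes` are hypothetical names; `fromBytes` is an illustrative inverse and is not taken from the Kafka source.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class BigEndianIntDemo {
    // Same shifts as the IntegerSerializer snippet above, applied to a primitive int.
    static byte[] toBytes(int data) {
        return new byte[] {
            (byte) (data >>> 24),
            (byte) (data >>> 16),
            (byte) (data >>> 8),
            (byte) data
        };
    }

    // Illustrative inverse: reassemble the int from four big-endian bytes.
    static int fromBytes(byte[] bytes) {
        int value = 0;
        for (byte b : bytes) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] manual = toBytes(40000);
        byte[] viaBuffer = ByteBuffer.allocate(4).putInt(40000).array();
        System.out.println(Arrays.equals(manual, viaBuffer)); // true
        System.out.println(fromBytes(manual));                // 40000
    }
}
```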

III. Producer Main Flow

KafkaProducer#send: the entry point for sending a message. It runs all ProducerInterceptor hooks, which may modify the ProducerRecord.

  @Override
  public Future<RecordMetadata> send(ProducerRecord<K, V> record) {
      return send(record, null);
  }
  @Override
  public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
      ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
      return doSend(interceptedRecord, callback);
  }

ProducerInterceptors#onSend: runs every ProducerInterceptor in order; any exception is caught by a try-catch so that it does not affect sending.

public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
    ProducerRecord<K, V> interceptRecord = record;
    for (ProducerInterceptor<K, V> interceptor : this.interceptors) {
        try {
            interceptRecord = interceptor.onSend(interceptRecord);
        } catch (Exception e) {
            log.warn("Error executing interceptor onSend callback", e);
        }
    }
    return interceptRecord;
}
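The same "run every hook, swallow failures" pattern can be tried out without Kafka on the classpath. In this sketch plain `UnaryOperator` steps stand in for ProducerInterceptor; `InterceptorChainDemo` and `applyAll` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class InterceptorChainDemo {
    // Applies each step in order. If a step throws, the failure is logged
    // and the value produced by the previous step is carried forward.
    static <T> T applyAll(List<UnaryOperator<T>> steps, T input) {
        T current = input;
        for (UnaryOperator<T> step : steps) {
            try {
                current = step.apply(current);
            } catch (Exception e) {
                System.err.println("interceptor failed, continuing: " + e.getMessage());
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> steps = new ArrayList<>();
        steps.add(s -> s + "-a");
        steps.add(s -> { throw new IllegalStateException("boom"); });
        steps.add(s -> s + "-b");
        System.out.println(applyAll(steps, "record")); // record-a-b
    }
}
```

A failing interceptor therefore neither blocks the send nor discards the changes made by earlier interceptors.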

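KafkaProducer#doSend: the main flow for sending a message

[1] Fetch the topic metadata; on a cache miss, wake up the IO thread (Sender) to send a MetadataRequest;

[2] Serialize the key and value; (not covered here)

[3] Compute the message's partition;

[4] Estimate the message size and check that it does not exceed max.request.size, 1MB by default; (not covered here)

[5] Append the message to the accumulator;

[6] Wake up the IO thread (Sender), which drains messages from the accumulator and sends them; return a Future;
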
  private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
    TopicPartition tp = null;
    try {
        throwIfProducerClosed();
        long nowMs = time.milliseconds();
        ClusterAndWaitTime clusterAndWaitTime;
        try {
            // [1] Fetch metadata: cluster broker info and topic-partition info
            clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), nowMs, maxBlockTimeMs);
        } catch (KafkaException e) {
            if (metadata.isClosed())
                throw new KafkaException();
            throw e;
        }
        nowMs += clusterAndWaitTime.waitedOnMetadataMs;
        long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
        Cluster cluster = clusterAndWaitTime.cluster;
        // [2] Serialize key and value
        byte[] serializedKey;
        try {
            serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
        } catch (ClassCastException cce) {
            throw new SerializationException();
        }
        byte[] serializedValue;
        try {
            serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
        } catch (ClassCastException cce) {
            throw new SerializationException();
        }
        // [3] Compute the partition; if the ProducerRecord does not specify one explicitly, the partitioner decides (DefaultPartitioner by default)
        int partition = partition(record, serializedKey, serializedValue, cluster);
        tp = new TopicPartition(record.topic(), partition);

        setReadOnly(record.headers());
        Header[] headers = record.headers().toArray();

        // [4] Check that the message size does not exceed max.request.size = 1024 * 1024 = 1MB
        int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
                compressionType, serializedKey, serializedValue, headers);
        ensureValidRecordSize(serializedSize);
        long timestamp = record.timestamp() == null ? nowMs : record.timestamp();
        Callback interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
        if (transactionManager != null && transactionManager.isTransactional()) {
            transactionManager.failIfNotReadyForSend();
        }
        // [5] Append the message to the accumulator
        RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey,
                serializedValue, headers, interceptCallback, remainingWaitMs, true, nowMs);
        // KIP-480: a new batch is needed, so run the partitioner again
        if (result.abortForNewBatch) {
            int prevPartition = partition;
            partitioner.onNewBatch(record.topic(), cluster, prevPartition);
            partition = partition(record, serializedKey, serializedValue, cluster);
            tp = new TopicPartition(record.topic(), partition);
            interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
            result = accumulator.append(tp, timestamp, serializedKey,
                serializedValue, headers, interceptCallback, remainingWaitMs, false, nowMs);
        }

        if (transactionManager != null && transactionManager.isTransactional())
            transactionManager.maybeAddPartitionToTransaction(tp);

        // [6] Wake up the IO thread to send the messages
        if (result.batchIsFull || result.newBatchCreated) {
            this.sender.wakeup();
        }
        return result.future;
    } // catch blocks omitted
}

IV. Fetching Producer Metadata

The producer does not need the full cluster metadata: it has no use for topics it has never sent to.

public class ProducerMetadata extends Metadata {
    // metadata.max.idle.ms = 5 * 60 * 1000 (5 minutes)
    private final long metadataIdleMs;
    // topic -> expiry time of its metadata; these are the topics the producer cares about
    private final Map<String, Long> topics = new HashMap<>();
    // Topics whose metadata has not been discovered yet
    private final Set<String> newTopics = new HashSet<>();
}

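So the producer has to tell the Sender thread which topics need a metadata update.

The producer first reads metadata from the cache; on a miss it has to wait for the Sender thread to fetch it.

The Sender, however, may be blocked on IO, for example in Selector#select, so the producer has to wake it up.

Once woken, the Sender checks whether Metadata needs an update. Metadata builds the MetadataRequest (containing the required topics), the Sender sends it to a broker, and when the broker responds the Sender updates Metadata and wakes up the producer.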

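Producer thread

KafkaProducer#waitOnMetadata: how the producer thread obtains metadata

[1] metadata.add: mark the topic as needing metadata;

[2] cluster.partitionCountForTopic: read the topic metadata from the cache; on a hit, the flow ends;

[3] On a miss, sender.wakeup wakes up the Sender thread;

[4] metadata.awaitUpdate: wait for the Sender thread to update the cache;
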
private final ProducerMetadata metadata;
private final Sender sender;
private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long nowMs, long maxWaitMs) 
    throws InterruptedException {
    Cluster cluster = metadata.fetch();
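    // Mark that the producer needs this topic's metadata and refresh its expiry so it is not evicted
    metadata.add(topic, nowMs);
    // Get the topic's partition count from the Cluster view
    Integer partitionsCount = cluster.partitionCountForTopic(topic);
    // Cache hit for the topic metadata: return immediately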
    if (partitionsCount != null && (partition == null || partition < partitionsCount))
        return new ClusterAndWaitTime(cluster, 0);
    // Cache miss for the topic metadata: the IO thread has to send a MetadataRequest
    long remainingWaitMs = maxWaitMs;
    long elapsed = 0;
    do {
        metadata.add(topic, nowMs + elapsed);
        // Mark that the metadata needs a partial or full update
        int version = metadata.requestUpdateForTopic(topic);
        // Wake up the Sender's nioSelector
        sender.wakeup();
        try {
            // Wait for the Sender to update the metadata
            metadata.awaitUpdate(version, remainingWaitMs);
        } catch (TimeoutException ex) {
            throw new TimeoutException();
        }
        cluster = metadata.fetch();
        elapsed = time.milliseconds() - nowMs;
        if (elapsed >= maxWaitMs) {
            throw new TimeoutException();
        }
        metadata.maybeThrowExceptionForTopic(topic);
        remainingWaitMs = maxWaitMs - elapsed;
        // Get the topic's partition count from the Cluster view
        partitionsCount = cluster.partitionCountForTopic(topic);
    } while (partitionsCount == null || (partition != null && partition >= partitionsCount));

    return new ClusterAndWaitTime(cluster, elapsed);
}

ProducerMetadata#add:

[1] When the producer thread sends a message, it records the target topic in ProducerMetadata, marking the topic as needing metadata.

public class ProducerMetadata extends Metadata {
  public synchronized void add(String topic, long nowMs) {
      if (topics.put(topic, nowMs + metadataIdleMs) == null) {
          // No metadata discovered for this topic yet: add it to newTopics
          newTopics.add(topic);
          // Mark that Metadata needs a partial update
          requestUpdateForNewTopics();
      }
  }
  public synchronized int requestUpdateForTopic(String topic) {
      if (newTopics.contains(topic)) {
          // Metadata for this topic not discovered yet: request a partial update
          return requestUpdateForNewTopics();
      } else {
          // Otherwise request a full update
          return requestUpdate();
      }
  }

  // Called by the Sender thread: fetch metadata only for the new topics
  @Override
  public synchronized MetadataRequest.Builder newMetadataRequestBuilderForNewTopics() {
      return new MetadataRequest.Builder(new ArrayList<>(newTopics), true);
  }
  // Called by the Sender thread: fetch metadata for all cached topics
  @Override
  public synchronized MetadataRequest.Builder newMetadataRequestBuilder() {
      return new MetadataRequest.Builder(new ArrayList<>(topics.keySet()), true);
  }
}

ProducerMetadata#awaitUpdate:

[4] On a cache miss, the producer thread waits for the Sender thread to update the cache.

private final Time time; // SystemTime
public synchronized void awaitUpdate(final int lastVersion, final long timeoutMs) throws InterruptedException {
    long currentTimeMs = time.milliseconds();
    long deadlineMs = currentTimeMs + timeoutMs < 0 ? Long.MAX_VALUE : currentTimeMs + timeoutMs;

    time.waitObject(this, () -> {
        maybeThrowFatalException();
        // Wait for the MetadataResponse; after the cache is updated the version is incremented
        return updateVersion() > lastVersion || isClosed();
    }, deadlineMs);
    if (isClosed())
        throw new KafkaException("Requested metadata update after close");
}

// SystemTime
@Override
public void waitObject(Object obj, Supplier<Boolean> condition, long deadlineMs) throws InterruptedException {
    synchronized (obj) {
        while (true) {
            if (condition.get())
                return;
            long currentTimeMs = milliseconds();
            if (currentTimeMs >= deadlineMs)
                throw new TimeoutException();
            obj.wait(deadlineMs - currentTimeMs);
        }
    }
 }
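The version-based handshake between the two threads can be reduced to a few lines of plain Java. This is a minimal sketch of the wait/notify pattern only: `VersionedCache` and its methods are hypothetical names, and unlike the code above it reports a timeout through the return value instead of throwing.

```java
final class VersionedCache {
    private int version = 0;

    synchronized int currentVersion() {
        return version;
    }

    // Updater side (the Sender's role): bump the version and wake all waiters.
    synchronized void update() {
        version++;
        notifyAll();
    }

    // Waiter side (the producer's role): block until the version moves past
    // lastVersion or the timeout elapses. Returns true if an update was seen.
    synchronized boolean awaitUpdate(int lastVersion, long timeoutMs) {
        long deadlineMs = System.currentTimeMillis() + timeoutMs;
        while (version <= lastVersion) {
            long remainingMs = deadlineMs - System.currentTimeMillis();
            if (remainingMs <= 0)
                return false;
            try {
                wait(remainingMs); // releases the monitor while waiting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        VersionedCache cache = new VersionedCache();
        int seen = cache.currentVersion();
        Thread updater = new Thread(cache::update);
        updater.start();
        System.out.println(cache.awaitUpdate(seen, 5000));
        updater.join();
    }
}
```

Comparing versions rather than waiting for a bare notification means an update that lands before the waiter starts waiting is still observed, and the loop guards against spurious wakeups.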

Sender thread

The Sender thread's loop is as follows.

@Override
public void run() {
    while (running) {
        runOnce();
    }
}
void runOnce() {
    long currentTimeMs = time.milliseconds();
    // Convert messages in the accumulator into Sends and stage them on the KafkaChannels
    long pollTimeout = sendProducerData(currentTimeMs);
    // Perform IO
    client.poll(pollTimeout, currentTimeMs);
}

NetworkClient#poll: poll is where most of the Sender's work happens.

// NetworkClient#poll
private final Selectable selector;
private final MetadataUpdater metadataUpdater;
@Override
public List<ClientResponse> poll(long timeout, long now) {
    // [1] Possibly build a MetadataRequest and stage it on a channel
    long metadataTimeout = metadataUpdater.maybeUpdate(now);
    // [2] Perform IO: reads, writes, ...
    this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));
    long updatedNow = this.time.milliseconds();
    List<ClientResponse> responses = new ArrayList<>();
    // [3] Handle the received responses
    handleCompletedReceives(responses, updatedNow);
    // ..
    return responses;
}

DefaultMetadataUpdater#maybeUpdate: the Sender consults Metadata to decide whether the metadata needs to be refreshed.

// DefaultMetadataUpdater (inner class of NetworkClient) #maybeUpdate
private final Metadata metadata;
@Override
public long maybeUpdate(long now) {
    // Ask Metadata how long until the next refresh is due
    long timeToNextMetadataUpdate = metadata.timeToNextUpdate(now);
    // A metadata request is still in flight, so do not send another one
    long waitForMetadataFetch = hasFetchInProgress() ?
      defaultRequestTimeoutMs : 0;

    long metadataTimeout = Math.max(timeToNextMetadataUpdate, waitForMetadataFetch);
    if (metadataTimeout > 0) {
        return metadataTimeout;
    }
    // Pick the least loaded node
    Node node = leastLoadedNode(now);
    if (node == null) {
        return reconnectBackoffMs;
    }

    return maybeUpdate(now, node);
}
// DefaultMetadataUpdater (inner class of NetworkClient) #maybeUpdate
private long maybeUpdate(long now, Node node) {
    String nodeConnectionId = node.idString();
    // False while no connection is established: the request cannot be sent until initiateConnect has run
    if (canSendRequest(nodeConnectionId, now)) {
        Metadata.MetadataRequestAndVersion requestAndVersion = metadata.newMetadataRequestAndVersion(now);
        MetadataRequest.Builder metadataRequest = requestAndVersion.requestBuilder;
        // NetworkClient stages the Request on the Channel
        sendInternalMetadataRequest(metadataRequest, nodeConnectionId, now);
        inProgress = new InProgressData(requestAndVersion.requestVersion, requestAndVersion.isPartialUpdate);
        return defaultRequestTimeoutMs;
    }
    // Not connected to the node yet: connect first; the request can only be sent on a later poll
    if (connectionStates.canConnect(nodeConnectionId, now)) {
        initiateConnect(node, now);
        return reconnectBackoffMs;
    }
    // ...
}
// NetworkClient#sendInternalMetadataRequest
void sendInternalMetadataRequest(MetadataRequest.Builder builder, String nodeConnectionId, long now) {
    ClientRequest clientRequest = newClientRequest(nodeConnectionId, builder, now, true);
    doSend(clientRequest, true, now);
}

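Metadata#timeToNextUpdate: Metadata itself decides whether an update is due, based on these conditions

1) If a refresh happened within the last retry.backoff.ms=100, no update is allowed yet;

2) An update is allowed when either of the following holds:

2-1) The producer requested an update, for example a partial update for a topic that is being sent to for the first time and has no metadata yet;

2-2) More than metadata.max.age.ms=5 minutes have passed since the last successful Metadata update, so a refresh is due;
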
public synchronized long timeToNextUpdate(long nowMs) {
    long timeToExpire =
      // The producer requested an update (partial or full)
      updateRequested() ? 0 :
      // Refresh is due once metadata.max.age.ms=5 minutes has elapsed
      Math.max(this.lastSuccessfulRefreshMs + this.metadataExpireMs - nowMs, 0);
    return Math.max(timeToExpire,
           // No update if a refresh happened within retry.backoff.ms=100
            timeToAllowUpdate(nowMs));
}
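A few concrete numbers make the interplay of the two timers easier to see. The sketch below restates the calculation as a pure function; `MetadataTiming` is a hypothetical name, and the backoff term is written out under the assumption that timeToAllowUpdate is "last refresh + backoff - now, floored at 0".

```java
final class MetadataTiming {
    // Pure-function restatement of the rule above. A result of 0 means
    // "an update may be sent now"; any other value is the wait in milliseconds.
    static long timeToNextUpdate(boolean updateRequested,
                                 long lastSuccessfulRefreshMs, long metadataExpireMs,
                                 long lastRefreshMs, long refreshBackoffMs,
                                 long nowMs) {
        long timeToExpire = updateRequested
                ? 0
                : Math.max(lastSuccessfulRefreshMs + metadataExpireMs - nowMs, 0);
        // Assumed form of the backoff check (not shown in the excerpt above)
        long timeToAllowUpdate = Math.max(lastRefreshMs + refreshBackoffMs - nowMs, 0);
        return Math.max(timeToExpire, timeToAllowUpdate);
    }

    public static void main(String[] args) {
        long maxAgeMs = 300_000L, backoffMs = 100L;
        // Update requested 40 ms after the last refresh: still inside the backoff window
        System.out.println(timeToNextUpdate(true, 1_000L, maxAgeMs, 1_000L, backoffMs, 1_040L));  // 60
        // Update requested after the backoff window: allowed immediately
        System.out.println(timeToNextUpdate(true, 1_000L, maxAgeMs, 1_000L, backoffMs, 2_000L));  // 0
        // No request pending: wait until the metadata reaches its maximum age
        System.out.println(timeToNextUpdate(false, 1_000L, maxAgeMs, 1_000L, backoffMs, 2_000L)); // 299000
    }
}
```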

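The Sender's request/response handling works as follows:

[1] The Request is serialized into a Send and placed on the target Channel, which is then marked as interested in writes;
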
// NetworkClient#doSend
private final Selectable selector;
private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now, AbstractRequest request) {
    String destination = clientRequest.destination(); // brokerId
    RequestHeader header = clientRequest.makeHeader(request.version());
    // Serialize the request into a ByteBuffer and wrap it in a Send
    Send send = request.toSend(destination, header);
    // Track the request as in flight
    InFlightRequest inFlightRequest = new InFlightRequest(
                clientRequest, header,
                isInternalRequest, request,
                send, now);
    this.inFlightRequests.add(inFlightRequest);
    // Stage the Send on the target connection's KafkaChannel
    selector.send(send);
}
// org.apache.kafka.common.network.Selector#send
private final Map<String, KafkaChannel> channels;
public void send(Send send) {
    String connectionId = send.destination();
    // KafkaChannel channel = channels.get(connectionId);
    KafkaChannel channel = openOrClosingChannelOrFail(connectionId);
    channel.setSend(send);
}
// org.apache.kafka.common.network.KafkaChannel#setSend
private final TransportLayer transportLayer;
private Send send;
public void setSend(Send send) {
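    // A channel can hold only one pending Send at a time
    // Sending messages and fetching metadata on the same channel cannot overlap; the upper layer checks this
    if (this.send != null)
        throw new IllegalStateException();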
    this.send = send;
    this.transportLayer.addInterestOps(SelectionKey.OP_WRITE);
}
// PlaintextTransportLayer#addInterestOps
private final SelectionKey key;
public void addInterestOps(int ops) {
    key.interestOps(key.interestOps() | ops);
}
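The interest set is a plain bitmask, so adding OP_WRITE is a bitwise OR on the current value. The sketch below works on the int constants from `java.nio.channels.SelectionKey` without opening a socket. `InterestOpsDemo` and its helpers are hypothetical names; `removeOps` illustrates the inverse operation (clearing the write interest once nothing is left to send) and is not taken from the excerpt above.

```java
import java.nio.channels.SelectionKey;

final class InterestOpsDemo {
    // Same expression as addInterestOps above: set the requested bits.
    static int addOps(int current, int ops) {
        return current | ops;
    }

    // Illustrative inverse: clear the requested bits.
    static int removeOps(int current, int ops) {
        return current & ~ops;
    }

    static boolean has(int current, int op) {
        return (current & op) != 0;
    }

    public static void main(String[] args) {
        int ops = SelectionKey.OP_READ;
        ops = addOps(ops, SelectionKey.OP_WRITE);
        System.out.println(has(ops, SelectionKey.OP_READ));  // true: read interest is preserved
        System.out.println(has(ops, SelectionKey.OP_WRITE)); // true
        ops = removeOps(ops, SelectionKey.OP_WRITE);
        System.out.println(has(ops, SelectionKey.OP_WRITE)); // false
    }
}
```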

[2] selector.select finds the Channel writable, and the Send held by the Channel is written to the peer;

private final java.nio.channels.Selector nioSelector;
@Override
public void poll(long timeout) throws IOException {
    int numReadyKeys = select(timeout);
    long endSelect = time.nanoseconds();
    if (numReadyKeys > 0) {
        Set<SelectionKey> readyKeys = this.nioSelector.selectedKeys();
        // ...
        pollSelectionKeys(readyKeys, false, endSelect);
        // ...
    } 
    // ...
}

private int select(long timeoutMs) throws IOException {
    if (timeoutMs == 0L)
        return this.nioSelector.selectNow();
    else
        return this.nioSelector.select(timeoutMs);
}

void pollSelectionKeys(Set<SelectionKey> selectionKeys,
                           boolean isImmediatelyConnected,
                           long currentTimeNanos) {
    for (SelectionKey key : selectionKeys) {
        KafkaChannel channel = channel(key);
        try {
            // ...
            // Handle read events
            if (channel.ready() && (key.isReadable() || ...)) {
                attemptRead(channel);
            }
            // Handle write events (sending messages, sending MetadataRequest)
            attemptWrite(key, channel, nowNanos);
        }
    }
}
// When the connection is set up, the KafkaChannel is bound to the NIO SelectionKey as its attachment
private KafkaChannel channel(SelectionKey key) {
    return (KafkaChannel) key.attachment();
}

Selector#attemptWrite: writes the Send object cached in the KafkaChannel to the SocketChannel.

// org.apache.kafka.common.network.Selector#attemptWrite
private void attemptWrite(SelectionKey key, KafkaChannel channel, long nowNanos) throws IOException {
    if (channel.hasSend() && channel.ready() && key.isWritable()) {
        write(channel);
    }
}
// org.apache.kafka.common.network.Selector#write
void write(KafkaChannel channel) throws IOException {
    String nodeId = channel.id();
    long bytesSent = channel.write(); // write the Send to the channel
}
// org.apache.kafka.common.network.KafkaChannel#write
private Send send; // the Send staged in step [1], i.e. the request waiting to be sent
public long write() throws IOException {
    return send.writeTo(transportLayer);
}
// ByteBufferSend#writeTo
public long writeTo(GatheringByteChannel channel) throws IOException {
    long written = channel.write(buffers);
    return written;
}
// PlaintextTransportLayer#write
private final SocketChannel socketChannel;
public long write(ByteBuffer[] srcs) throws IOException {
    return socketChannel.write(srcs);
}

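[3] Once the peer has processed the request, selector.select finds the Channel readable; the data is read into a Receive, deserialized into a Response, and the callback is executed;
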
// Selector#attemptRead
private final LinkedHashMap<String, NetworkReceive> completedReceives;
private void attemptRead(KafkaChannel channel) throws IOException {
    String nodeId = channel.id();
    long bytesReceived = channel.read();
    if (bytesReceived != 0) {
        long currentTimeMs = time.milliseconds();
        NetworkReceive receive = channel.maybeCompleteReceive();
        if (receive != null) {
            // Keep the completed receive for later processing
            addToCompletedReceives(channel, receive, currentTimeMs);
        }
    }
}
private void addToCompletedReceives(KafkaChannel channel, NetworkReceive networkReceive, long currentTimeMs) {
    this.completedReceives.put(channel.id(), networkReceive);
}
// KafkaChannel#read
private NetworkReceive receive; // the response being received
public long read() throws IOException {
    long bytesReceived = receive(this.receive);
    return bytesReceived;
}
private long receive(NetworkReceive receive) throws IOException {
      return receive.readFrom(transportLayer);
}
// NetworkReceive#readFrom
private final ByteBuffer size = ByteBuffer.allocate(4);
private int requestedBufferSize = -1;
private ByteBuffer buffer;
public long readFrom(ScatteringByteChannel channel) throws IOException {
    int read = 0;
    // First read 4 bytes to get the response size
    if (size.hasRemaining()) {
        int bytesRead = channel.read(size);
        if (bytesRead < 0)
            throw new EOFException();
        read += bytesRead;
        if (!size.hasRemaining()) {
            size.rewind();
            int receiveSize = size.getInt();
            requestedBufferSize = receiveSize;
            if (receiveSize == 0) {
                buffer = EMPTY_BUFFER;
            }
        }
    }
    if (buffer == null && requestedBufferSize != -1) {
        // Allocate memory according to the size read above
        buffer = memoryPool.tryAllocate(requestedBufferSize);
    }
    if (buffer != null) {
        // Read data from the socketChannel into the buffer
        int bytesRead = channel.read(buffer);
        read += bytesRead;
    }
    return read;
}
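Because a non-blocking read may return only part of a response, the reader has to remember how far it got between calls. The sketch below shows the same "4-byte size, then payload" state machine fed from in-memory chunks instead of a socket. `FrameAssembler`, `feed`, `encode` and `decode` are hypothetical names, and size validation is left out for brevity.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class FrameAssembler {
    private final ByteBuffer size = ByteBuffer.allocate(4);
    private ByteBuffer payload; // allocated once the size prefix is complete

    // Consumes as many bytes as possible from 'in'.
    // Returns true once the whole frame (size prefix + payload) has arrived.
    boolean feed(ByteBuffer in) {
        while (size.hasRemaining() && in.hasRemaining())
            size.put(in.get());
        if (size.hasRemaining())
            return false; // size prefix still incomplete
        if (payload == null) {
            size.rewind();
            payload = ByteBuffer.allocate(size.getInt());
        }
        while (payload.hasRemaining() && in.hasRemaining())
            payload.put(in.get());
        return !payload.hasRemaining();
    }

    // Builds a frame: 4-byte big-endian length followed by the UTF-8 body.
    static byte[] encode(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + body.length).putInt(body.length).put(body).array();
    }

    // Feeds the wire bytes in chunks of chunkSize, simulating partial reads.
    // Returns the decoded body, or null if the frame never completed.
    static String decode(byte[] wire, int chunkSize) {
        FrameAssembler assembler = new FrameAssembler();
        for (int off = 0; off < wire.length; off += chunkSize) {
            int len = Math.min(chunkSize, wire.length - off);
            if (assembler.feed(ByteBuffer.wrap(wire, off, len)))
                return new String(assembler.payload.array(), StandardCharsets.UTF_8);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(decode(encode("metadata"), 3)); // metadata
    }
}
```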

NetworkClient#poll: at this point the responses to process can be taken from the receives cached by the Selector.

public List<ClientResponse> poll(long timeout, long now) {
    // Possibly build a MetadataRequest and stage it on a channel
    long metadataTimeout = metadataUpdater.maybeUpdate(now);
    // Perform IO: reads, writes, ...
    this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));

    long updatedNow = this.time.milliseconds();
    List<ClientResponse> responses = new ArrayList<>();
    // Handle the received responses (taken here from the selector's completedReceives)
    handleCompletedReceives(responses, updatedNow);
    return responses;
}

private void handleCompletedReceives(List<ClientResponse> responses, long now) {
    for (NetworkReceive receive : this.selector.completedReceives()) {
        String source = receive.source(); // KafkaChannel.id, i.e. the brokerId
        InFlightRequest req = inFlightRequests.completeNext(source);
        // Deserialize into a Response
        Struct responseStruct = parseStructMaybeUpdateThrottleTimeMetrics(receive.payload(), req.header,
            throttleTimeSensor, now);
        AbstractResponse response = AbstractResponse.
            parseResponse(req.header.apiKey(), responseStruct, req.header.apiVersion());
        if (req.isInternalRequest && response instanceof MetadataResponse)
            // Metadata response
            metadataUpdater.handleSuccessfulResponse(req.header, now, (MetadataResponse) response);
        else
            // ... other responses, e.g. the response to a produce request
            responses.add(req.completed(response, now));
    }
}

ProducerMetadata#update: finally, when the Sender thread receives the MetadataResponse, it calls Metadata to update the cache and wakes up the producer threads waiting for metadata.

 public class ProducerMetadata extends Metadata {

    // Called by the Sender thread when it receives a MetadataResponse, to update the cache
    @Override
    public synchronized void update(int requestVersion, MetadataResponse response, boolean isPartialUpdate, long nowMs) {
        // Logic shared by producer and consumer Metadata: updates the cache and calls back into retainTopic
        super.update(requestVersion, response, isPartialUpdate, nowMs);
        if (!newTopics.isEmpty()) {
            for (MetadataResponse.TopicMetadata metadata : response.topicMetadata()) {
                // Metadata for this topic has now been discovered: remove it from newTopics
                newTopics.remove(metadata.topic());
            }
        }
        // Wake up producer threads waiting for metadata
        notifyAll();
    }
    /**
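     * Called by super.update to decide whether the topic's metadata should be retained
     * @param topic the topic
     * @param isInternal whether the topic is internal
     * @param nowMs current time
     * @return true if the topic's metadata must be kept, false if it can be discarded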
     */
    @Override
    public synchronized boolean retainTopic(String topic, boolean isInternal, long nowMs) {
        Long expireMs = topics.get(topic);
        if (expireMs == null) {
            return false;
        } else if (newTopics.contains(topic)) {
            // Newly added topic: keep it
            return true;
        } else if (expireMs <= nowMs) {
            // No message sent to the topic for a long time: drop its metadata
            topics.remove(topic);
            return false;
        } else {
            // The topic was sent to recently: keep its metadata
            return true;
        }
    }
}
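The expiry rule behind retainTopic can be exercised on its own. The sketch below keeps only the "topic -> expiry time" map and leaves out the newTopics branch; `TopicExpiry`, `touch` and `retain` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class TopicExpiry {
    private final long idleMs;
    private final Map<String, Long> expiryByTopic = new HashMap<>();

    TopicExpiry(long idleMs) {
        this.idleMs = idleMs;
    }

    // Every send refreshes the topic's expiry time (the role of add above).
    void touch(String topic, long nowMs) {
        expiryByTopic.put(topic, nowMs + idleMs);
    }

    // Returns true if the topic's metadata should be kept.
    // An expired topic is forgotten as a side effect.
    boolean retain(String topic, long nowMs) {
        Long expireMs = expiryByTopic.get(topic);
        if (expireMs == null)
            return false; // never sent to, or already dropped
        if (expireMs <= nowMs) {
            expiryByTopic.remove(topic);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        TopicExpiry expiry = new TopicExpiry(300_000L); // 5 minutes, as metadata.max.idle.ms
        expiry.touch("Topic1", 0L);
        System.out.println(expiry.retain("Topic1", 299_999L)); // true
        System.out.println(expiry.retain("Topic1", 300_000L)); // false: idle for 5 minutes
    }
}
```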

V. Computing the Partition

KafkaProducer#partition: if the user did not set the Record's partition explicitly, the partitioner decides which partition the Record goes to.

private int partition(ProducerRecord<K, V> record, byte[] serializedKey, byte[] serializedValue, Cluster cluster) {
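    Integer partition = record.partition(); // partition specified by the user
    return partition != null ?
            partition :
            partitioner.partition(
                    record.topic(), record.key(), serializedKey, record.value(), serializedValue, cluster);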
}

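The default partitioner is DefaultPartitioner; a custom one can be configured with partitioner.class=<fully qualified class name>.

Case 1: if the message has a key, the partition is hash(key) % number of partitions.

Note that the partition count from cluster.partitionsForTopic(topic).size() covers all partitions of the topic, whether or not a partition is available (that is, whether its leader exists; a partition with leader=-1 in LeaderAndIsr can still be selected).
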
public class DefaultPartitioner implements Partitioner {
    private final StickyPartitionCache stickyPartitionCache = new StickyPartitionCache();
    public int partition(String topic, Object key,
             byte[] keyBytes, Object value, 
             byte[] valueBytes, Cluster cluster) {
        return partition(topic, key, keyBytes, value, valueBytes, 
             cluster, 
             cluster.partitionsForTopic(topic).size());
    }
    public int partition(String topic, Object key,
                         byte[] keyBytes, Object value,
                         byte[] valueBytes, Cluster cluster,
                         int numPartitions) {
        if (keyBytes == null) {
            // KIP-480 sticky partitioner: no key, so one partition per batch
            return stickyPartitionCache.partition(topic, cluster);
        }
        // hash(key) % number of partitions
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}
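
The keyed case can be illustrated with a minimal sketch. The `toPositive` helper mirrors the bit-mask idea behind Utils.toPositive; the hash function is a stand-in (an assumption for the sake of a self-contained example), since Kafka itself uses Utils.murmur2 here. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyedPartitionSketch {
    // Clear the sign bit rather than calling Math.abs, so that
    // Integer.MIN_VALUE also maps to a non-negative number.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    // Stand-in hash (assumption): the real partitioner hashes with murmur2.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return toPositive(Arrays.hashCode(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "order-42".getBytes(StandardCharsets.UTF_8);
        // The same key always maps to the same partition while the partition count is unchanged.
        System.out.println(partitionFor(key, 3));
    }
}
```

Because the divisor is the total partition count, the mapping of a key only changes when partitions are added, not when a leader goes offline.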

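Case 2 (introduced by KIP-480): if the message has no key, the sticky partitioning strategy is used. (Previously it was round-robin.)

The sticky strategy packs keyless messages into the batch of a single partition, which reduces send latency for keyless messages.

StickyPartitionCache caches the sticky partition of each topic; if no keyless message has been sent to a topic yet, a partition has to be chosen first.

When choosing, it prefers a random pick among available partitions (those with a leader) and falls back to a random pick among all partitions.

Keyed messages, by contrast, are never re-hashed to another partition when a partition leader goes offline.

java
public class StickyPartitionCache {
    // topic -> current sticky partition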
    private final ConcurrentMap<String, Integer> indexCache;
    public StickyPartitionCache() {
        this.indexCache = new ConcurrentHashMap<>();
    }

    public int partition(String topic, Cluster cluster) {
        Integer part = indexCache.get(topic);
        if (part == null) {
            // No sticky partition yet: pick one and cache it
            return nextPartition(topic, cluster, -1);
        }
        // Prefer the cached sticky partition
        return part;
    }

    public int nextPartition(String topic,
                             Cluster cluster, int prevPartition) {
        // All partitions
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        Integer oldPart = indexCache.get(topic);
        Integer newPart = oldPart;
        if (oldPart == null || oldPart == prevPartition) {
            // Partitions that have a leader
            List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
            if (availablePartitions.size() < 1) {
                // No partition has a leader: pick any one of all partitions
                Integer random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
                newPart = random % partitions.size();
            } else if (availablePartitions.size() == 1) {
                // Only one partition has a leader: use it
                newPart = availablePartitions.get(0).partition();
            } else {
                // Pick randomly among the partitions that have a leader;
                // it must differ from the old partition to keep the overall load balanced
                while (newPart == null || newPart.equals(oldPart)) {
                    Integer random = Utils.toPositive(ThreadLocalRandom.current().nextInt());
                    newPart = availablePartitions.get(random % availablePartitions.size()).partition();
                }
            }
            if (oldPart == null) {
                indexCache.putIfAbsent(topic, newPart);
            } else {
                indexCache.replace(topic, prevPartition, newPart);
            }
            return indexCache.get(topic);
        }
        return indexCache.get(topic);
    }

}

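KafkaProducer#doSend: if the accumulator append returns abortForNewBatch, the last batch of that partition could not take the record (it is full, or no batch exists yet) and a new batch has to be opened. Because of the sticky strategy, StickyPartitionCache must assign a new sticky partition at this point, which keeps keyless messages balanced across partitions overall.

java
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
  // [3] Compute the partition; DefaultPartitioner is used if the ProducerRecord does not specify one explicitly
  int partition = partition(...);

  // [5] Append the message to the accumulator (this first call passes abortOnNewBatch=true)
  RecordAccumulator.RecordAppendResult result = accumulator.append(...);
  if (result.abortForNewBatch) { // KIP-480
      // A new batch is about to be opened
      int prevPartition = partition;
      // Run the sticky strategy again
      partitioner.onNewBatch(record.topic(), cluster, prevPartition);
      // Recompute the partition
      partition = partition(...);
      // Append to the accumulator again (abortOnNewBatch=false this time)
      result = accumulator.append(...);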
  }
}

// DefaultPartitioner
public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
    stickyPartitionCache.nextPartition(topic, cluster, prevPartition);
}
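
The rotation can be sketched in a simplified form. The sketch below assumes every partition has a leader and takes the partition count directly instead of a Cluster; the class name and method signatures are hypothetical and only mirror the shape of StickyPartitionCache.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ThreadLocalRandom;

class StickySketch {
    // topic -> current sticky partition
    private final ConcurrentMap<String, Integer> sticky = new ConcurrentHashMap<>();

    int partition(String topic, int numPartitions) {
        Integer part = sticky.get(topic);
        return part != null ? part : nextPartition(topic, numPartitions, -1);
    }

    // Rotate only if the cached partition is still the one the caller saw;
    // otherwise another thread has already rotated and its choice is reused.
    int nextPartition(String topic, int numPartitions, int prevPartition) {
        Integer oldPart = sticky.get(topic);
        if (oldPart == null || oldPart == prevPartition) {
            int newPart;
            if (numPartitions == 1) {
                newPart = 0;
            } else {
                do {
                    newPart = ThreadLocalRandom.current().nextInt(numPartitions);
                } while (oldPart != null && newPart == oldPart);
            }
            if (oldPart == null) {
                sticky.putIfAbsent(topic, newPart);
            } else {
                sticky.replace(topic, prevPartition, newPart);
            }
        }
        return sticky.get(topic);
    }
}
```

The check against prevPartition is what makes concurrent producers converge: when several threads hit a full batch at the same time, only the first one actually switches the sticky partition.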

VI. Writing to the Accumulator

1. Key Models

RecordAccumulator

java
public final class RecordAccumulator {
    // Number of threads currently appending to the accumulator
    private final AtomicInteger appendsInProgress;
    // Memory allocated per batch, default batch.size=16KB
    private final int batchSize;
    // Compression type
    private final CompressionType compression;
    // Batch linger time, default linger.ms=0ms
    private final int lingerMs;
    // Retry interval, retry.backoff.ms=100ms
    private final long retryBackoffMs;
    // Upper bound on the time to report success or failure after send() returns; it bounds retrying (retries defaults to Integer.MAX_VALUE)
    // Default delivery.timeout.ms=2 minutes
    private final int deliveryTimeoutMs;
    // Buffer pool, capacity=buffer.memory=32MB
    private final BufferPool free;
    private final Time time;
    private final ApiVersions apiVersions;
    // partition -> queue of batches
    private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;
    // Batches that have not been acknowledged yet (whether already sent or not)
    private final IncompleteBatches incomplete;
    // The fields below are used by the sender
    private final Set<TopicPartition> muted;
    private int drainIndex;
    private final TransactionManager transactionManager;
    private long nextBatchExpiryTimeMs = Long.MAX_VALUE; 
}

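batches keeps one Deque (double-ended queue) per partition; its elements are the message batches (ProducerBatch).

Any number of producer threads append messages to the batch at the tail of each partition's queue, while the sender thread takes batches from the head of each queue and sends them.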
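
A minimal sketch of this tail-append, head-drain structure follows. It uses strings instead of TopicPartition and ProducerBatch, and a record-count limit (MAX_RECORDS_PER_BATCH, a hypothetical constant) instead of a byte-size limit, so it only illustrates the queueing discipline.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class BatchQueueSketch {
    static final int MAX_RECORDS_PER_BATCH = 2; // hypothetical limit for the sketch

    // partition -> queue of batches
    final ConcurrentMap<String, Deque<List<String>>> batches = new ConcurrentHashMap<>();

    // Producer side: append to the batch at the tail, opening a new one when it is full.
    void append(String partition, String record) {
        Deque<List<String>> dq = batches.computeIfAbsent(partition, p -> new ArrayDeque<>());
        synchronized (dq) {
            List<String> last = dq.peekLast();
            if (last == null || last.size() >= MAX_RECORDS_PER_BATCH) {
                last = new ArrayList<>();
                dq.addLast(last);
            }
            last.add(record);
        }
    }

    // Sender side: take the oldest batch from the head, or null if there is none.
    List<String> drainOne(String partition) {
        Deque<List<String>> dq = batches.get(partition);
        if (dq == null) {
            return null;
        }
        synchronized (dq) {
            return dq.pollFirst();
        }
    }
}
```

Locking on the individual deque, as the accumulator does, means producers writing to different partitions do not contend with each other.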

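ProducerBatch

A ProducerBatch belongs to one partition and represents one batch of messages.

java
// Batch creation time
final long createdMs;
// Partition
final TopicPartition topicPartition;
// Send future of the batch
final ProduceRequestResult produceFuture;
// Thunk = (callback, future of the record); after the batch response arrives,
// the thunks are iterated to invoke InterceptorCallback and the user callback
private final List<Thunk> thunks = new ArrayList<>();
// [Key point] messages are written into MemoryRecordsBuilder
private final MemoryRecordsBuilder recordsBuilder;
// Number of send attempts
private final AtomicInteger attempts = new AtomicInteger(0);
// Number of records
int recordCount;
// Size of the largest record in the batch
int maxRecordSize;
// Time of the last attempt, initially equal to createdMs
private long lastAttemptMs;
// Time of the last append (when a new message was appended to the batch)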
private long lastAppendTime;

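Each ProducerBatch uses a MemoryRecordsBuilder to manage the underlying message buffer.

For the producer, the ByteBuffer is wrapped in a ByteBufferOutputStream, which is used for writing.

For the sender, the same ByteBuffer is wrapped in MemoryRecords, which is used for reading.

java
public class MemoryRecordsBuilder {
    // producer: messages are written into this buffer
    private final ByteBufferOutputStream bufferStream;
    // sender: messages are read from here; it shares the underlying buffer with bufferStream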
    private MemoryRecords builtRecords;
}
public class ByteBufferOutputStream extends OutputStream {
    private ByteBuffer buffer;
}
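
The idea of one buffer with a write view and a read view can be shown with plain java.nio. This is only a sketch of the principle; the class and method names are hypothetical and it does not reproduce how MemoryRecordsBuilder builds its MemoryRecords.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SharedBufferSketch {
    static String writeThenRead(String payload) {
        ByteBuffer buffer = ByteBuffer.allocate(64);
        // Writer side: bytes are appended at the buffer's current position.
        buffer.put(payload.getBytes(StandardCharsets.UTF_8));
        // Reader side: duplicate() shares the same bytes but has its own
        // position and limit, so flipping the view does not disturb the writer.
        ByteBuffer view = buffer.duplicate();
        view.flip();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(writeThenRead("Hello World"));
    }
}
```

No bytes are copied between the two views, which is why the write path and the read path can share a single pooled buffer.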

BufferPool

BufferPool allocates buffers for ProducerBatch (MemoryRecordsBuilder); the default pool size is 32MB (buffer.memory).

java
public class BufferPool {
    // Pool capacity=buffer.memory=32MB
    private final long totalMemory;
    // Size of a pooled buffer=batch.size=16KB
    private final int poolableSize;
    private final ReentrantLock lock;
    // Pooled buffers are returned here and can be handed out again directly
    private final Deque<ByteBuffer> free;
    // Conditions of the threads waiting for a buffer
    private final Deque<Condition> waiters;
    // Memory not held by pooled buffers, initially=pool capacity=32MB
    private long nonPooledAvailableMemory;
}

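The default batch size is poolableSize=batch.size=16KB, so as long as a single message fits in 16KB, pooling works normally. On the producer side, creating a batch asks BufferPool for a buffer; BufferPool prefers a pooled buffer from the free queue and otherwise creates a new one. On the Sender side, once sending finishes the buffer is returned to BufferPool, which keeps it in the free queue for later producers.

If a single message is larger than 16KB, pooling cannot be used. On the producer side, BufferPool allocates the memory directly; once the Sender finishes sending, BufferPool releases it directly instead of pooling it.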
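
The pooled versus non-pooled accounting can be sketched as follows. This is a simplified, single-threaded model (an assumption): it has no lock and no waiting, and it throws where the real pool would block. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

class SimplePoolSketch {
    final int poolableSize;
    long nonPooledAvailableMemory;
    final Deque<ByteBuffer> free = new ArrayDeque<>();

    SimplePoolSketch(long totalMemory, int poolableSize) {
        this.nonPooledAvailableMemory = totalMemory;
        this.poolableSize = poolableSize;
    }

    ByteBuffer allocate(int size) {
        // A request of exactly the poolable size reuses a pooled buffer if one exists.
        if (size == poolableSize && !free.isEmpty()) {
            return free.pollFirst();
        }
        // Otherwise give pooled buffers back to the non-pooled budget until the request fits.
        while (!free.isEmpty() && nonPooledAvailableMemory < size) {
            nonPooledAvailableMemory += free.pollLast().capacity();
        }
        if (nonPooledAvailableMemory < size) {
            throw new IllegalStateException("not enough memory (the real pool blocks here)");
        }
        nonPooledAvailableMemory -= size;
        return ByteBuffer.allocate(size);
    }

    void deallocate(ByteBuffer buffer) {
        if (buffer.capacity() == poolableSize) {
            // Poolable size: keep the buffer itself for reuse.
            buffer.clear();
            free.add(buffer);
        } else {
            // Any other size: only return the bytes to the budget.
            nonPooledAvailableMemory += buffer.capacity();
        }
    }
}
```

With a 64-byte pool and 16-byte poolable buffers, allocating and returning a 16-byte buffer leaves it in the free queue, so the next 16-byte request gets the same instance back without a new allocation.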

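2. Parameters and Return Value

RecordAccumulator.append is the API producers use to write into the accumulator.

The parameters are:

java
public RecordAppendResult append(
     // Target partition
     TopicPartition tp,
     // Message timestamp (current time if the user did not set one)
     long timestamp,
     // Message key
     byte[] key,
     // Message value
     byte[] value, 
     // Message headers
     Header[] headers,
     // Interceptors plus the user callback
     Callback callback,
     // max.block.ms (60s) minus the time already spent waiting for metadata
     long maxTimeToBlock,
     // Whether to return immediately when a new batch would have to be created; needed so the KIP-480 sticky strategy above can switch partitions
     boolean abortOnNewBatch,
     // Current time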
     long nowMs) throws InterruptedException {
}

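The return value is:

java
public final static class RecordAppendResult {
    // Returned when the append succeeds;
    // the sender completes this future and the producer thread reads the send result from it
    public final FutureRecordMetadata future;
    // Whether some batch in the deque is full, in which case the sender thread should be woken up to drain it
    public final boolean batchIsFull;
    // Whether a new batch was created
    public final boolean newBatchCreated;
    // KIP-480: lets the sticky strategy rotate partitions;
    // true if the abortOnNewBatch parameter was true and a new batch would have had to be created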
    public final boolean abortForNewBatch;
}

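There are three possible outcomes:

Case 1: newBatchCreated=false, abortForNewBatch=false. The message was appended to an existing batch.

Case 2: newBatchCreated=false, abortForNewBatch=true. This happens when the abortOnNewBatch parameter is true and the message could not be appended to the most recent batch; the purpose is to let the caller rotate partitions under the KIP-480 sticky strategy.

Case 3: newBatchCreated=true, abortForNewBatch=false. A new batch was created and the message was appended to it.

3. Main Flow

RecordAccumulator#append: the main flow.

java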
public RecordAppendResult append(TopicPartition tp,
           long timestamp,
           byte[] key,
           byte[] value,
           Header[] headers,
           Callback callback,
           long maxTimeToBlock,
           boolean abortOnNewBatch,
           long nowMs) throws InterruptedException {
  appendsInProgress.incrementAndGet();
  ByteBuffer buffer = null;
  if (headers == null) headers = Record.EMPTY_HEADERS;
  try {
      // Each partition has one Deque holding n batches
      Deque<ProducerBatch> dq = getOrCreateDeque(tp);
      synchronized (dq) {
          // Step 1: try to append the message to an existing batch
          RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
          if (appendResult != null)
              // Case 1: appended to an existing batch
              return appendResult;
      }

      // Case 2: could not append to an existing batch and abortOnNewBatch=true (KIP-480)
      if (abortOnNewBatch) {
          return new RecordAppendResult(null, false, false, true);
      }

      // Case 3: could not append to an existing batch, so create a new one
      // size=Math.max(batch.size=16384=16KB, estimated message size)
      byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
      int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));

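      // Step 2: allocate size bytes; blocks if no memory is available
      // maxTimeToBlock=max.block.ms (60s) minus the time already spent waiting for metadata
      buffer = free.allocate(size, maxTimeToBlock);
      nowMs = time.milliseconds();
      synchronized (dq) {
          // Step 3: try the existing batch once more
          // This rarely succeeds: it does only if another thread entered the synchronized block first and created a new batch
          RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
          if (appendResult != null) {
              return appendResult;
          }
          // Step 4: create a new batch and write the message into it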
          MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
          ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, nowMs);
          FutureRecordMetadata future = Objects.requireNonNull(batch.tryAppend(timestamp, key, value, headers,
                  callback, nowMs));

          dq.addLast(batch);
          incomplete.add(batch);
          buffer = null;
          return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
      }
  } finally {
      if (buffer != null)
          // The Step 3 case: another thread created the batch,
          // but this thread had already allocated a buffer, which must be returned
          free.deallocate(buffer);
      appendsInProgress.decrementAndGet();
  }
}

RecordAccumulator#tryAppend: tries to write into the last batch at the tail of the deque.

java
private RecordAppendResult tryAppend(long timestamp, 
               byte[] key, byte[] value,
               Header[] headers, Callback callback, 
                Deque<ProducerBatch> deque, long nowMs) {
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        // Write the message into the batch
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
        if (future == null)
            // Case 1: this batch is full
            last.closeForRecordAppends();
        else
            // Case 2: the message went into this batch
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
    }
    // Case 1 or 3: the batch is full, or no batch has been created yet
    return null;
}

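Overall, writing to the accumulator has two steps: 1) if a new batch must be created, request a buffer; 2) write into the batch.

4. Requesting a Buffer

BufferPool#allocate: requests a buffer.

1) If the requested size is batch.size=16KB, a buffer is taken from the free list first;

2) If no pooled buffer can be used but enough memory is available (buffer.memory=32MB), a buffer is allocated directly;

3) If no pooled buffer can be used and memory is insufficient, the thread joins the waiters queue and waits up to maxTimeToBlockMs; when the sender thread finishes sending, it returns memory and wakes up the thread waiting in allocate;

java
// Pool capacity=buffer.memory=32MB
private final long totalMemory;
// Size of a pooled buffer=batch.size=16KB
private final int poolableSize;
private final ReentrantLock lock;
// Pooled buffers are returned here and can be handed out again directly
private final Deque<ByteBuffer> free;
// Conditions of the threads waiting for a buffer
private final Deque<Condition> waiters;
// Memory not held by pooled buffers, initially=pool capacity=32MB
private long nonPooledAvailableMemory;
public ByteBuffer allocate(int size, long maxTimeToBlockMs) throws InterruptedException {
    // totalMemory=buffer.memory=32MB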
    if (size > this.totalMemory)
        throw new IllegalArgumentException();
    ByteBuffer buffer = null;
    this.lock.lock();
    try {
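        // Step 1: the requested size equals poolableSize=batch.size=16KB, so try the free pool first
        if (size == poolableSize && !this.free.isEmpty())
            return this.free.pollFirst();
        int freeListSize = freeSize() * this.poolableSize;

        // Remaining allocatable memory = nonPooledAvailableMemory + free list >= requested size
        if (this.nonPooledAvailableMemory + freeListSize >= size) {
            // Step 2: no pooled buffer was used but enough memory remains, so reserve the requested size
            // If the non-pooled memory is below size and the free pool is not empty, release buffers from the free pool
            freeUp(size);
            this.nonPooledAvailableMemory -= size;
        } else {
            // Step 3: not enough memory left; wait for the sender to release some after sending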
            int accumulated = 0;
            Condition moreMemory = this.lock.newCondition();
            try {
                long remainingTimeToBlockNs = TimeUnit.MILLISECONDS.toNanos(maxTimeToBlockMs);
                this.waiters.addLast(moreMemory);
                while (accumulated < size) {
                    long startWaitNs = time.nanoseconds();
                    long timeNs;
                    boolean waitingTimeElapsed;
                    try {
                        // Wait for the sender to release memory
                        waitingTimeElapsed = !moreMemory.await(remainingTimeToBlockNs, TimeUnit.NANOSECONDS);
                    } finally {
                        long endWaitNs = time.nanoseconds();
                        timeNs = Math.max(0L, endWaitNs - startWaitNs);
                        recordWaitTime(timeNs);
                    }

                    // Timed out waiting on moreMemory for available memory
                    if (waitingTimeElapsed) {
                        throw new BufferExhaustedException("Failed to allocate memory within the configured max blocking time " + maxTimeToBlockMs + " ms.");
                    }
                    // Woken up: add what can be obtained to accumulated; the loop exits once it reaches size
                    remainingTimeToBlockNs -= timeNs;
                    if (accumulated == 0 && size == this.poolableSize && !this.free.isEmpty()) {
                        // A request of exactly 16KB still prefers a pooled buffer
                        buffer = this.free.pollFirst();
                        accumulated = size;
                    } else {
                        // Otherwise take memory from the non-pooled budget
                        freeUp(size - accumulated);
                        int got = (int) Math.min(size - accumulated, this.nonPooledAvailableMemory);
                        this.nonPooledAvailableMemory -= got;
                        accumulated += got;
                    }
                }
                accumulated = 0;
            } finally {
                this.nonPooledAvailableMemory += accumulated;
                this.waiters.remove(moreMemory);
            }
        }
    } finally {
        try {
            // If memory is still available, the current thread wakes up the next thread waiting for memory
            if (!(this.nonPooledAvailableMemory == 0 && this.free.isEmpty()) && !this.waiters.isEmpty())
                this.waiters.peekFirst().signal();
        } finally {
            lock.unlock();
        }
    }
    if (buffer == null)
        // No pooled buffer was obtained: allocate a new one
        return safeAllocateByteBuffer(size);
    else
        return buffer;
}
 private void freeUp(int size) {
    while (!this.free.isEmpty() && this.nonPooledAvailableMemory < size)
        this.nonPooledAvailableMemory += this.free.pollLast().capacity();
}

BufferPool#deallocate: after sending, the sender thread calls BufferPool to return the memory.

java
public void deallocate(ByteBuffer buffer, int size) {
    lock.lock();
    try {
        if (size == this.poolableSize && size == buffer.capacity()) {
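            // Buffer of the poolable size: add it to the free list
            buffer.clear();
            this.free.add(buffer);
        } else {
            // Non-poolable size (a single message larger than 16KB): only adjust the available non-pooled memory
            this.nonPooledAvailableMemory += size;
        }
        // Wake up a producer thread waiting for memory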
        Condition moreMem = this.waiters.peekFirst();
        if (moreMem != null)
            moreMem.signal();
    } finally {
        lock.unlock();
    }
}

5. Writing into a Batch

5-1. Creating a Batch

RecordAccumulator#append: if no usable batch exists, one must be created first; the batch's underlying buffer is managed by MemoryRecordsBuilder.

java
// MemoryRecordsBuilder
MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
// ProducerBatch
ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, nowMs);
// Write into the batch
FutureRecordMetadata future = Objects.requireNonNull(batch.tryAppend(timestamp, key, value, headers,
        callback, nowMs));
// Put the batch into the deque so later appends can keep writing to it
dq.addLast(batch);
incomplete.add(batch);

RecordAccumulator#recordsBuilder: MemoryRecordsBuilder is built from only a few parameters:

1) buffer: the memory obtained from BufferPool;

2) maxUsableMagic: the magic value, 2;

3) compression: the compression type, NONE by default;

4) timestampType: the timestamp type, CREATE_TIME;

5) baseOffset: 0, which stands for the first message of this batch;

java
// RecordAccumulator
private MemoryRecordsBuilder recordsBuilder(ByteBuffer buffer, byte maxUsableMagic) {
  return MemoryRecords.builder(buffer, maxUsableMagic, compression, TimestampType.CREATE_TIME, 0L);
}

public static MemoryRecordsBuilder builder(ByteBuffer buffer,
                                               byte magic,
                                               CompressionType compressionType,
                                               TimestampType timestampType,
                                               long baseOffset) {
    long logAppendTime = RecordBatch.NO_TIMESTAMP;
    if (timestampType == TimestampType.LOG_APPEND_TIME)
        logAppendTime = System.currentTimeMillis();
    return builder(buffer, magic, compressionType, timestampType, baseOffset, logAppendTime,
            RecordBatch.NO_PRODUCER_ID, RecordBatch.NO_PRODUCER_EPOCH, RecordBatch.NO_SEQUENCE, false,
            RecordBatch.NO_PARTITION_LEADER_EPOCH);
}

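MemoryRecordsBuilder: at construction time the buffer is wrapped as bufferStream; if a compression type is set, bufferStream is wrapped again in a compressing stream. The stream that records are finally written to is appendStream.

java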
public MemoryRecordsBuilder(ByteBuffer buffer,...) {
    this(new ByteBufferOutputStream(buffer), ...);
}
public MemoryRecordsBuilder(ByteBufferOutputStream bufferStream,
                             ...) {
    this.magic = magic;
    this.timestampType = timestampType;
    this.compressionType = compressionType;
    this.baseOffset = baseOffset; // 0
    this.logAppendTime = logAppendTime; // -1
    this.numRecords = 0;
    this.maxTimestamp = RecordBatch.NO_TIMESTAMP; // -1

    this.writeLimit = writeLimit; // usable size of the buffer
    // Initial position, 0
    this.initialPosition = bufferStream.position();
    // Size of the batch header
    this.batchHeaderSizeInBytes = AbstractRecords.recordBatchHeaderSizeInBytes(magic, compressionType);
    // Records are written starting right after the header
    bufferStream.position(initialPosition + batchHeaderSizeInBytes);
    this.bufferStream = bufferStream;
    // Wrap ByteBufferOutputStream(buffer) in one more layer that can compress
    this.appendStream = new DataOutputStream(compressionType.wrapForOutput(this.bufferStream, magic));
}

// For example, with lz4 the original ByteBufferOutputStream(buffer) is wrapped in one more layer
LZ4(3, "lz4", 1.0f) {
  @Override
  public OutputStream wrapForOutput(ByteBufferOutputStream buffer, byte messageVersion) {
      return new KafkaLZ4BlockOutputStream(buffer,...);
  }
}

5-2. tryAppend

ProducerBatch#tryAppend:

java
// [Key point] messages are written into MemoryRecordsBuilder
private final MemoryRecordsBuilder recordsBuilder;
// Send future of the batch, completed by the sender
final ProduceRequestResult produceFuture;
// Thunk = (callback, future of the record); after the batch response arrives,
// the thunks are iterated to invoke InterceptorCallback and the user callback
private final List<Thunk> thunks = new ArrayList<>();
public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value,
                 Header[] headers, Callback callback, long now) {
  if (!recordsBuilder.hasRoomFor(timestamp, key, value, headers)) {
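      // The underlying buffer has no room left: return null so the caller creates a new batch
      return null;
  } else {
      // Write to appendStream, which may compress
      Long checksum = 
        this.recordsBuilder.append(timestamp, key, value, headers);
      this.lastAppendTime = now;
      // There is one produceFuture per batch; the sender completes it when the batch is done
      // One produceFuture backs the FutureRecordMetadata (future) of each of the n send calls
      // When sending finishes, the sender completes produceFuture, which in turn completes every FutureRecordMetadata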
      FutureRecordMetadata future = new FutureRecordMetadata(
        this.produceFuture, this.recordCount,
         timestamp, checksum,
         key == null ? -1 : key.length,
         value == null ? -1 : value.length,
         Time.SYSTEM);
      // Add the callback and its future to the thunks list
      thunks.add(new Thunk(callback, future));
      this.recordCount++;
      return future;
  }
}

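Each message (Record) contains the following fields:

1) Length: varint-encoded, 1-5 bytes, the size of the record before compression;

2) Attributes: 1 byte, currently unused;

3) TimestampDelta: varlong-encoded, 1-10 bytes, the record's create time relative to the first record of the batch;

4) OffsetDelta: varint-encoded, 1-5 bytes, the index of the record within the batch, starting at 0;

5) key/value/headers: the message key, value, and headers;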
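
The "1-5 bytes" range follows from the varint scheme: the signed value is first zig-zag encoded so that small negative numbers stay small, then written 7 bits per byte. The sketch below computes the encoded length of an int under that scheme; the class name is hypothetical and it does not call Kafka's own utility classes.

```java
class VarintSketch {
    // Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    static int zigZag(int value) {
        return (value << 1) ^ (value >> 31);
    }

    // Number of bytes needed when 7 payload bits are stored per byte.
    static int sizeOfVarint(int value) {
        int v = zigZag(value);
        int bytes = 1;
        while ((v & 0xffffff80) != 0) {
            bytes++;
            v >>>= 7;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(sizeOfVarint(63));
        System.out.println(sizeOfVarint(64));
        System.out.println(sizeOfVarint(Integer.MAX_VALUE));
    }
}
```

Since OffsetDelta and TimestampDelta are usually small numbers, most records spend only one or two bytes on each of them, which is the point of storing deltas rather than absolute values.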

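MemoryRecordsBuilder#appendDefaultRecord:

Writing a message to appendStream runs the compression logic; whether the bytes actually reach the underlying buffer at that moment depends on the compression implementation.

If they have not reached the underlying buffer yet, they are flushed into it when the sender drains data from the accumulator.

java
// The underlying ByteBuffer allocated by BufferPool
private final ByteBufferOutputStream bufferStream;
// Wraps the compression algorithm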
private DataOutputStream appendStream = 
  new DataOutputStream(
  compressionType.wrapForOutput(this.bufferStream, magic));

private void appendDefaultRecord(long offset, 
                                 long timestamp, 
                                 ByteBuffer key, ByteBuffer value,
                                 Header[] headers) throws IOException {
    ensureOpenForRecordAppend();
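    // Index of the message within the batch, starting at 0
    int offsetDelta = (int) (offset - baseOffset);
    // Create time of the n-th message minus that of the first message in the batch
    long timestampDelta = timestamp - firstTimestamp;
    // Write to appendStream, which may compress
    int sizeInBytes = DefaultRecord.writeTo(appendStream, offsetDelta, timestampDelta, key, value, headers);
    // Record the written size so that fullness can be checked later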
    recordWritten(offset, timestamp, sizeInBytes);
}

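With zstd compression, for example, ZstdOutputStream holds an internal dst buffer; compressed data is first written to dst and only later, at a suitable point, written to the bottom-level buffer allocated by BufferPool.

java
public ZstdOutputStream(OutputStream outStream/* the underlying buffer allocated by BufferPool */) throws IOException {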
    super(outStream);
    // create compression context
    this.stream = createCStream();
    this.closeFrameOnFlush = false;
    this.dst = new byte[(int) dstSize];
}

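5-3. Returning the Future

Each time a producer's send call successfully writes into the buffer, a future (FutureRecordMetadata) is returned to the producer. The producer can wait for this future to complete or register a Callback to obtain the send result, RecordMetadata.

java
public final class FutureRecordMetadata implements Future<RecordMetadata> {
    // Future of the batch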
    private final ProduceRequestResult result;

   @Override
    public RecordMetadata get() throws InterruptedException, ExecutionException {
        this.result.await();
        if (nextRecordMetadata != null)
            return nextRecordMetadata.get();
        return valueOrError();
    }

    @Override
    public RecordMetadata get(long timeout, TimeUnit unit) throws InterruptedException, ExecutionException, TimeoutException {
        long now = time.milliseconds();
        long timeoutMillis = unit.toMillis(timeout);
        long deadline = Long.MAX_VALUE - timeoutMillis < now ? Long.MAX_VALUE : now + timeoutMillis;
        boolean occurred = this.result.await(timeout, unit);
        if (!occurred)
            throw new TimeoutException();
        if (nextRecordMetadata != null)
            return nextRecordMetadata.get(deadline - time.milliseconds(), TimeUnit.MILLISECONDS);
        return valueOrError();
    }

    RecordMetadata valueOrError() throws ExecutionException {
        if (this.result.error() != null)
            throw new ExecutionException(this.result.error());
        else
            return value();
    }
    // This is how producer.send obtains the RecordMetadata
    RecordMetadata value() {
        if (nextRecordMetadata != null)
            return nextRecordMetadata.value();
        return new RecordMetadata(
          result.topicPartition(), this.result.baseOffset(), 
          this.relativeOffset, timestamp(), this.checksum,
          this.serializedKeySize, this.serializedValueSize);
    }
}

Underneath, FutureRecordMetadata has to wait for the batch future (ProduceRequestResult) to complete.

Once the sender has finished sending the batch, it completes the batch future.

java
public class ProduceRequestResult {
    private final CountDownLatch latch = new CountDownLatch(1);
    // Completed by the sender
    public void done() {
        this.latch.countDown();
    }
    // One batch corresponds to n send calls;
    // the n FutureRecordMetadata instances wait on this ProduceRequestResult for the batch to finish
    public void await() throws InterruptedException {
        latch.await();
    }
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return latch.await(timeout, unit);
    }
}
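
The one-latch-per-batch idea can be sketched as below. BatchResult and RecordFuture are hypothetical stand-ins for ProduceRequestResult and FutureRecordMetadata; error handling and timeouts are left out.

```java
import java.util.concurrent.CountDownLatch;

class BatchLatchSketch {
    // One instance per batch; completed once by the sender.
    static class BatchResult {
        private final CountDownLatch latch = new CountDownLatch(1);
        private volatile long baseOffset = -1L;

        void done(long baseOffset) {
            this.baseOffset = baseOffset;
            latch.countDown();
        }
    }

    // One instance per send() call; all futures of a batch share the same BatchResult.
    static class RecordFuture {
        private final BatchResult result;
        private final long relativeOffset;

        RecordFuture(BatchResult result, long relativeOffset) {
            this.result = result;
            this.relativeOffset = relativeOffset;
        }

        // Blocks until the batch is done, then derives this record's offset.
        long get() {
            try {
                result.latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            return result.baseOffset + relativeOffset;
        }
    }
}
```

A single countDown releases every waiting record future at once, and each one computes its own offset from the batch's base offset plus its position in the batch.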

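VII. Sender Sends Messages

Sender#runOnce: the body of the sender thread's while-true loop.

The sender thread's working logic, client.poll, was already covered in the metadata analysis above. This part focuses on sendProducerData: how connections to the brokers are established and how messages are drained from the accumulator.

java
void runOnce() {
    if (transactionManager != null) {
      //...
    }
    long currentTimeMs = time.milliseconds();
    // Drain messages from the accumulator, convert them to Send objects, and stage them on the KafkaChannel of each broker
    long pollTimeout = sendProducerData(currentTimeMs);
    // Actually send the Send objects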
    client.poll(pollTimeout, currentTimeMs);
}

Sender#sendProducerData: stages messages on the KafkaChannel of each broker node, in 5 steps.

Note that no ProduceRequest is sent here; all IO happens in poll.

java
private long sendProducerData(long now) {
    Cluster cluster = metadata.fetch();
    // Step 1: ask the accumulator which nodes lead partitions that have messages ready
    RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

    // Topics with partitions whose leader is unknown are marked as needing metadata
    if (!result.unknownLeaderTopics.isEmpty()) {
        for (String topic : result.unknownLeaderTopics)
            this.metadata.add(topic, now);
        this.metadata.requestUpdate();
    }

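    // Step 2: for each node with messages, check whether sending is allowed; if not, remove it from readyNodes
    // 1. If no connection exists yet, one is initiated, but the node is not usable yet, so nothing can be sent
    // 2. If the node already has the maximum number of unanswered requests (max.in.flight.requests.per.connection=5), nothing can be sent
    // 3. If a MetadataRequest needs to be sent, nothing can be sent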
    Iterator<Node> iter = result.readyNodes.iterator();
    long notReadyTimeout = Long.MAX_VALUE;
    while (iter.hasNext()) {
        Node node = iter.next();
        if (!this.client.ready(node, now)) {
            iter.remove();
            notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
        }
    }

    // Step 3: drain from the accumulator the batches to send to each broker, at most one batch per partition
    Map<Integer/*brokerId*/, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);
    // Add the batches to the in-flight list
    addToInflightBatches(batches);
    // true when max.in.flight.requests.per.connection=1 (the default is 5)
    if (guaranteeMessageOrder) { // whether ordering must be guaranteed
        for (List<ProducerBatch> batchList : batches.values()) {
            for (ProducerBatch batch : batchList)
                // Mute the partition: no more batches are sent for it until the request has been handled
                this.accumulator.mutePartition(batch.topicPartition);
        }
    }

    accumulator.resetNextBatchExpiryTime();
    // delivery.timeout.ms=2 minutes
    // Step 4: collect expired in-flight batches plus expired batches in the accumulator, complete their futures with TimeoutException, and return their buffers to BufferPool
    List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
    List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
    expiredBatches.addAll(expiredInflightBatches);

    for (ProducerBatch expiredBatch : expiredBatches) {
        failBatch(expiredBatch, -1, NO_TIMESTAMP, new TimeoutException(), false);
    }

    long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
    pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
    pollTimeout = Math.max(pollTimeout, 0);
    if (!result.readyNodes.isEmpty()) {
        pollTimeout = 0;
    }
    // Step 5: stage the batches of each node on its KafkaChannel
    sendProduceRequests(batches, now);
    return pollTimeout;
}

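7-1. Finding the Nodes That Need to Send

RecordAccumulator#ready: asks the accumulator which broker nodes have messages to send.

It loops over the batch queue of each partition and checks whether the partition's leader node needs to send. The node needs to send if any of the following holds (and the batch is not in retry backoff):

1) full: a batch of the partition is full;

2) expired: linger.ms (default 0) has elapsed;

3) exhausted: BufferPool is out of memory and producer threads are waiting for an allocation;

4) closed: the producer has been closed;

5) flushInProgress: the producer called flush explicitly;

java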
private final BufferPool free;
private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;
public ReadyCheckResult ready(Cluster cluster, long nowMs) {
    Set<Node> readyNodes = new HashSet<>();
    long nextReadyCheckDelayMs = Long.MAX_VALUE;
    Set<String> unknownLeaderTopics = new HashSet<>();
    // True if at least one producer thread is waiting for BufferPool to allocate memory
    boolean exhausted = this.free.queued() > 0;
    for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
        Deque<ProducerBatch> deque = entry.getValue();
        synchronized (deque) {
            ProducerBatch batch = deque.peekFirst();
            if (batch != null) {
                TopicPartition part = entry.getKey();
                Node leader = cluster.leaderFor(part);
                if (leader == null) {
                    unknownLeaderTopics.add(part.topic());
                } else if (!readyNodes.contains(leader) && !isMuted(part)) {
                    // Math.max(0, nowMs - creation time or last attempt time)
                    long waitedTimeMs = batch.waitedTimeMs(nowMs);
                    // Retrying, but retryBackoffMs has not elapsed yet
                    boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                    // How long to wait: retryBackoffMs while backing off, otherwise lingerMs
                    long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                    // Whether some batch is full
                    boolean full = deque.size() > 1 || batch.isFull();
                    // Whether the wait has reached timeToWaitMs
                    boolean expired = waitedTimeMs >= timeToWaitMs;
                    boolean sendable = full || /* a batch is full */
                            expired || /* lingerMs elapsed */
                            exhausted || /* BufferPool out of memory, producers waiting */
                            closed || /* producer closed */
                            flushInProgress(); /* producer called flush */
                    if (sendable && !backingOff) {
                        readyNodes.add(leader);
                    } else {
                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                        nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                    }
                }
            }
        }
    }
    return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
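
The per-batch decision above can be isolated as a pure function. This is a minimal sketch that passes every input as a parameter; the class and method names are hypothetical, while the boolean logic follows the ready method shown above.

```java
class ReadySketch {
    // Returns true if the head batch of a partition should be sent now.
    static boolean ready(long waitedTimeMs, int attempts, boolean full,
                         boolean exhausted, boolean closed, boolean flushInProgress,
                         long lingerMs, long retryBackoffMs) {
        // A retried batch must first sit out the retry backoff.
        boolean backingOff = attempts > 0 && waitedTimeMs < retryBackoffMs;
        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
        boolean expired = waitedTimeMs >= timeToWaitMs;
        boolean sendable = full || expired || exhausted || closed || flushInProgress;
        return sendable && !backingOff;
    }

    public static void main(String[] args) {
        // linger.ms=1000 as in the example at the top of the article, retry.backoff.ms=100
        System.out.println(ready(500, 0, false, false, false, false, 1000, 100));
        System.out.println(ready(1000, 0, false, false, false, false, 1000, 100));
    }
}
```

Note that backoff wins over everything else: even a full batch is held back while it is still within retry.backoff.ms of its last attempt.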

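7-2. Removing Non-Ready Nodes and Establishing Connections

NetworkClient#ready: for each node that needs to send, checks whether it is ready; this is also where the connection to the broker is actually initiated.

java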
public boolean ready(Node node, long now) {
    if (isReady(node, now))
        return true;
    if (connectionStates.canConnect(node.idString(), now))
        // if we are interested in sending to a node and we don't have a connection to it, initiate one
        initiateConnect(node, now);
    return false;
}
private void initiateConnect(Node node, long now) {
  String nodeConnectionId = node.idString();
  connectionStates.connecting(nodeConnectionId, now, node.host(), clientDnsLookup);
  InetAddress address = connectionStates.currentAddress(nodeConnectionId);
  selector.connect(nodeConnectionId,
              new InetSocketAddress(address, node.port()),
              this.socketSendBuffer,
              this.socketReceiveBuffer);
}

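NetworkClient#isReady: a node is ready when all of the following hold:

1) No metadata update is due. This is a global constraint, independent of the node: while a metadata update is needed, no node may send messages;

2) The connection is established (if not, the check is repeated on the next sender loop);

3) The number of unanswered requests to the node is below the concurrency limit, default max.in.flight.requests.per.connection=5;

java
public boolean isReady(Node node, long now) {
    return !metadataUpdater.isUpdateDue(now)  // no metadata update is due
      && canSendRequest(node.idString(), now);
}
private boolean canSendRequest(String node, long now) {
    return connectionStates.isReady(node, now) // connection established
      && selector.isChannelReady(node) 
      && inFlightRequests.canSendMore(node); // unanswered requests < max in-flight requests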
}
// InFlightRequests#canSendMore
public boolean canSendMore(String node) {
  Deque<NetworkClient.InFlightRequest> queue = requests.get(node);
  return queue == null || queue.isEmpty()
    ||  (queue.peekFirst().send.completed()
     && queue.size() < this.maxInFlightRequestsPerConnection);
}
// Metadata#timeToNextUpdate
public synchronized long timeToNextUpdate(long nowMs) {
  long timeToExpire = updateRequested() ? // producer请求更新
          0 :
          // 到达metadata.max.age.ms=5分钟 需要刷新
          Math.max(this.lastSuccessfulRefreshMs + this.metadataExpireMs - nowMs, 0);
  return Math.max(timeToExpire,
          // 如果retry.backoff.ms=100内发生过更新,不更新
          timeToAllowUpdate(nowMs));
}
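
The in-flight check in condition 3 can be modeled in isolation. The following is a minimal sketch, not Kafka's implementation: the class InFlightLimiterSketch, its Pending type, and the written flag (standing in for send.completed()) are hypothetical names introduced only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a per-node in-flight request limit.
class InFlightLimiterSketch {
    // One pending request; `written` becomes true once its bytes are fully flushed to the socket.
    static final class Pending {
        boolean written;
    }

    private final int maxInFlightPerNode;
    private final Map<String, Deque<Pending>> requests = new HashMap<>();

    InFlightLimiterSketch(int maxInFlightPerNode) {
        this.maxInFlightPerNode = maxInFlightPerNode;
    }

    // A node accepts another request if nothing is pending, or if the newest
    // request is fully written and the queue is still below the limit.
    boolean canSendMore(String node) {
        Deque<Pending> queue = requests.get(node);
        return queue == null || queue.isEmpty()
            || (queue.peekFirst().written && queue.size() < maxInFlightPerNode);
    }

    // The newest request goes to the head of the deque.
    Pending add(String node) {
        Pending pending = new Pending();
        requests.computeIfAbsent(node, k -> new ArrayDeque<>()).addFirst(pending);
        return pending;
    }

    // A response arrived: the oldest request (tail of the deque) is completed.
    void completeOldest(String node) {
        requests.get(node).pollLast();
    }
}
```

With a limit of 2, a node stops accepting requests once two are pending, and accepts again as soon as the oldest one is completed.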

Selector#connect: establishes the connection to the broker.

java
public void connect(String id, InetSocketAddress address, int sendBufferSize, int receiveBufferSize) throws IOException {
    ensureNotRegistered(id);
    // create a SocketChannel
    SocketChannel socketChannel = SocketChannel.open();
    SelectionKey key = null;
    try {
        // configure the socket:
        // send.buffer.bytes=128 * 1024 -- SocketOptions.SO_SNDBUF
        // receive.buffer.bytes=32 * 1024 -- SocketOptions.SO_RCVBUF
        configureSocketChannel(socketChannel, sendBufferSize, receiveBufferSize);
        // start connecting to the broker: socketChannel.connect(address)
        boolean connected = doConnect(socketChannel, address);
        // register the socketChannel with the selector, interested in OP_CONNECT
        key = registerChannel(id, socketChannel, SelectionKey.OP_CONNECT);

        if (connected) {
            immediatelyConnectedKeys.add(key);
            key.interestOps(0);
        }
    } catch (IOException | RuntimeException e) {
        if (key != null)
            immediatelyConnectedKeys.remove(key);
        channels.remove(id);
        socketChannel.close();
        throw e;
    }
}

Selector#registerChannel: a KafkaChannel wraps the SocketChannel together with its SelectionKey.

java
private final Map<String, KafkaChannel> channels;
protected SelectionKey registerChannel(String id, SocketChannel socketChannel, int interestedOps) throws IOException {
    SelectionKey key = socketChannel.register(nioSelector, interestedOps);
    KafkaChannel channel = buildAndAttachKafkaChannel(socketChannel, id, key);
    this.channels.put(id, channel);
    if (idleExpiryManager != null)
        idleExpiryManager.update(channel.id(), time.nanoseconds());
    return key;
}

private KafkaChannel buildAndAttachKafkaChannel(SocketChannel socketChannel, String id, SelectionKey key) throws IOException {
    KafkaChannel channel = channelBuilder.buildChannel(...);
    // SelectionKey -> KafkaChannel:
    // when an I/O event fires, key.attachment() returns the KafkaChannel
    key.attach(channel);
    return channel;
}
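
The connect/register/attach pattern can be tried with plain java.nio. This is a rough sketch under stated assumptions rather than Kafka's code: NonBlockingConnectSketch and connectAndGetAttachment are hypothetical names, a throwaway ServerSocketChannel on the loopback address stands in for the broker, and a String node id is attached to the key where Kafka attaches a KafkaChannel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

class NonBlockingConnectSketch {
    // Connects to a local throwaway server and returns the object attached to the key once connected.
    static String connectAndGetAttachment(String nodeId) {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open();
             SocketChannel client = SocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress address = (InetSocketAddress) server.getLocalAddress();

            client.configureBlocking(false);
            // In non-blocking mode connect() usually returns false and completes later.
            boolean connected = client.connect(address);
            SelectionKey key = client.register(selector, SelectionKey.OP_CONNECT);
            key.attach(nodeId);
            if (connected) {
                // Connected immediately: no OP_CONNECT event will fire for this key.
                key.interestOps(0);
                return (String) key.attachment();
            }
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (System.nanoTime() < deadline) {
                selector.select(500);
                for (SelectionKey ready : selector.selectedKeys()) {
                    SocketChannel channel = (SocketChannel) ready.channel();
                    if (ready.isConnectable() && channel.finishConnect()) {
                        ready.interestOps(0);
                        // The attachment identifies which connection the event belongs to.
                        return (String) ready.attachment();
                    }
                }
                selector.selectedKeys().clear();
            }
            throw new IOException("connect timed out");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The immediately-connected branch mirrors why Selector#connect tracks immediatelyConnectedKeys: such a key never receives an OP_CONNECT event, so it has to be handled separately.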

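7-3. Draining batches from the accumulator

RecordAccumulator#drain: for each ready node, the Sender drains record batches from the accumulator. maxSize is the upper bound on the size of a single request (max.request.size, 1MB by default).

java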
public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) {
    Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
    for (Node node : nodes) {
        // maxSize=max.request.size=1MB: upper bound on the bytes drained for one broker
        List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);
        batches.put(node.id(), ready);
    }
    return batches;
}

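RecordAccumulator#drainBatchesForOneNode:

Iterates over all partitions led by this broker and takes at most one batch per partition, stopping once maxSize would be exceeded.

Note that the accumulator uses conditions such as full/expired to decide whether the node hosting a partition needs a send. As soon as one partition satisfies such a condition, the head batch of every partition on that node is drained and sent, even if an individual partition's batch is neither full nor expired.

java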
private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
    int size = 0;
    // iterate over all partitions led by this broker
    List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
    List<ProducerBatch> ready = new ArrayList<>();
    int start = drainIndex = drainIndex % parts.size();
    do {
        PartitionInfo part = parts.get(drainIndex);
        TopicPartition tp = new TopicPartition(part.topic(), part.partition());
        this.drainIndex = (this.drainIndex + 1) % parts.size();
        // Only proceed if the partition has no in-flight batches.
        if (isMuted(tp))
            continue;
        Deque<ProducerBatch> deque = getDeque(tp);
        if (deque == null)
            continue;
        synchronized (deque) {
            ProducerBatch first = deque.peekFirst();
            if (first == null)
                continue;
            boolean backoff = first.attempts() > 0 && first.waitedTimeMs(now) < retryBackoffMs;
            if (backoff)
                // the batch is being retried and has waited less than retry.backoff.ms=100ms
                continue;

            if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {
                // this partition's batch would push the request past maxSize: stop draining
                break;
            } else {
                // ... transactional handling omitted
                // take the head batch off the partition's deque
                ProducerBatch batch = deque.pollFirst();
                // close the batch
                batch.close();
                size += batch.records().sizeInBytes();
                ready.add(batch);
                batch.drained(now);
            }
        }
    } while (start != drainIndex);
    return ready;
}
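
The round-robin behaviour of drainIndex is easier to see on a stripped-down model. The sketch below is illustrative only: RoundRobinDrainSketch is a hypothetical class, each partition is reduced to a deque of batch sizes in bytes, and muting, backoff and transactions are left out.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class RoundRobinDrainSketch {
    // Remembered across calls so that each drain starts where the previous one stopped.
    private int drainIndex = 0;

    // partitions.get(i) holds the pending batch sizes (bytes) of partition i, oldest first.
    // Returns the indexes of the partitions that contributed a batch to this request.
    List<Integer> drain(List<Deque<Integer>> partitions, int maxSize) {
        List<Integer> ready = new ArrayList<>();
        if (partitions.isEmpty())
            return ready;
        int size = 0;
        int start = drainIndex = drainIndex % partitions.size();
        do {
            int index = drainIndex;
            drainIndex = (drainIndex + 1) % partitions.size();
            Deque<Integer> deque = partitions.get(index);
            Integer first = deque.peekFirst();
            if (first == null)
                continue;
            // Stop once the request is full, but always allow at least one batch,
            // so a single batch larger than maxSize can still be sent.
            if (size + first > maxSize && !ready.isEmpty())
                break;
            size += deque.pollFirst();
            ready.add(index);
        } while (start != drainIndex);
        return ready;
    }
}
```

Because at most one batch is taken per partition per call and the starting index moves between calls, no partition can starve the others when the request size limit is hit.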

ProducerBatch#close: closes the batch. It closes the underlying appendStream, writes the batch header via writeDefaultBatchHeader, and builds the MemoryRecords through which the Sender later reads the buffer.

java
// ProducerBatch#close
private final MemoryRecordsBuilder recordsBuilder;
public void close() {
    recordsBuilder.close();
    reopened = false;
}
// MemoryRecordsBuilder#close
// written by the producer thread
private DataOutputStream appendStream;
// read by the Sender
private MemoryRecords builtRecords;
public void close() {
    if (builtRecords != null)
        return;
    validateProducerState();
    closeForRecordAppends();
    if (numRecords == 0L) {
        buffer().position(initialPosition);
        builtRecords = MemoryRecords.EMPTY;
    } else {
        // write the batch header
        this.actualCompressionRatio = (float) writeDefaultBatchHeader() / this.uncompressedRecordsSizeInBytes;
        ByteBuffer buffer = buffer().duplicate();
        buffer.flip();
        buffer.position(initialPosition);
        // build builtRecords (a MemoryRecords view over the buffer)
        builtRecords = MemoryRecords.readableRecords(buffer.slice());
    }
}
public void closeForRecordAppends() {
    if (appendStream != CLOSED_STREAM) {
        try {
            // with compression enabled, this flushes the compressor's internal buffer into the underlying buffer (allocated from the BufferPool)
            appendStream.close();
        } catch (IOException e) {
            throw new KafkaException(e);
        } finally {
            appendStream = CLOSED_STREAM;
        }
    }
}

DefaultRecordBatch#writeHeader: every batch header contains the fields below; the Sender writes them into the underlying buffer.

java
static void writeHeader(ByteBuffer buffer, ...) {
  // 16bit attributes: Unused (6-15) | Control (5) | Transactional (4) | Timestamp Type (3) | Compression Type (0-2)
  short attributes = computeAttributes(compressionType, timestampType, isTransactional, isControlBatch);

  int position = buffer.position();
  buffer.putLong(position + BASE_OFFSET_OFFSET, baseOffset); // 0
  buffer.putInt(position + LENGTH_OFFSET, sizeInBytes - LOG_OVERHEAD); // batch size minus LOG_OVERHEAD (the baseOffset and length fields)
  buffer.putInt(position + PARTITION_LEADER_EPOCH_OFFSET, partitionLeaderEpoch); // -1
  buffer.put(position + MAGIC_OFFSET, magic); // magic value 2
  buffer.putShort(position + ATTRIBUTES_OFFSET, attributes); // compression type, timestamp type, transactional flag, ...
  buffer.putLong(position + FIRST_TIMESTAMP_OFFSET, firstTimestamp); // create time of the first record
  buffer.putLong(position + MAX_TIMESTAMP_OFFSET, maxTimestamp); // largest record create time in the batch
  buffer.putInt(position + LAST_OFFSET_DELTA_OFFSET, lastOffsetDelta); // offset of the last record relative to baseOffset (normally record count - 1)
  buffer.putLong(position + PRODUCER_ID_OFFSET, producerId); // producer id, -1
  buffer.putShort(position + PRODUCER_EPOCH_OFFSET, epoch); // producer epoch, -1
  buffer.putInt(position + BASE_SEQUENCE_OFFSET, sequence); // -1
  buffer.putInt(position + RECORDS_COUNT_OFFSET, numRecords); // number of records
  long crc = Crc32C.compute(buffer, ATTRIBUTES_OFFSET, sizeInBytes - ATTRIBUTES_OFFSET);
  buffer.putInt(position + CRC_OFFSET, (int) crc); // crc
  buffer.position(position + RECORD_BATCH_OVERHEAD);
}
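
The reason the header is written last is that the length and CRC depend on the records that follow it. Below is a minimal sketch of that idea using only the JDK, not Kafka's code: BatchHeaderSketch is a hypothetical class, it fills in only a few of the fields, and it uses java.util.zip.CRC32C (Java 9+) where Kafka uses its own Crc32C helper. The offsets follow the 61-byte v2 batch header layout.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

class BatchHeaderSketch {
    // Offsets of a few fields in the v2 record batch header (61 bytes in total).
    static final int BASE_OFFSET_OFFSET = 0;
    static final int LENGTH_OFFSET = 8;
    static final int MAGIC_OFFSET = 16;
    static final int CRC_OFFSET = 17;
    static final int ATTRIBUTES_OFFSET = 21;
    static final int RECORDS_COUNT_OFFSET = 57;
    static final int HEADER_SIZE = 61;
    static final int LOG_OVERHEAD = 12; // baseOffset (8 bytes) + length (4 bytes)

    // Fills in the header of a batch whose records already occupy [HEADER_SIZE, sizeInBytes).
    static void writeHeader(ByteBuffer buffer, int sizeInBytes, int numRecords) {
        // Absolute puts: the buffer's own position is left untouched.
        buffer.putLong(BASE_OFFSET_OFFSET, 0L);
        buffer.putInt(LENGTH_OFFSET, sizeInBytes - LOG_OVERHEAD);
        buffer.put(MAGIC_OFFSET, (byte) 2);
        buffer.putShort(ATTRIBUTES_OFFSET, (short) 0);
        buffer.putInt(RECORDS_COUNT_OFFSET, numRecords);
        // The CRC goes last because it covers every byte from the attributes field onwards.
        buffer.putInt(CRC_OFFSET, crcOf(buffer, sizeInBytes));
    }

    // CRC32C over everything from the attributes field to the end of the batch.
    static int crcOf(ByteBuffer buffer, int sizeInBytes) {
        ByteBuffer view = buffer.duplicate();
        view.limit(sizeInBytes);
        view.position(ATTRIBUTES_OFFSET);
        CRC32C crc = new CRC32C();
        crc.update(view);
        return (int) crc.getValue();
    }
}
```

A receiver can recompute crcOf over the same range and compare it with the stored value to detect corruption of the attributes, counters or records.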

ProducerBatch→MemoryRecordsBuilder→MemoryRecords: the batch ends up as a MemoryRecords, which the Sender reads later.

java
public class MemoryRecords extends AbstractRecords {
    // batch header + n records
    private final ByteBuffer buffer;
    private final Iterable<MutableRecordBatch> batches = this::batchIterator;
    private MemoryRecords(ByteBuffer buffer) {
        this.buffer = buffer;
    }
}

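7-4. Handling expired batches

Sender#sendProducerData: handles batches that have timed out by completing their futures with a TimeoutException.

The producer's default retry count is retries=Integer.MAX_VALUE, i.e. effectively unlimited. In practice the number of retries is bounded by delivery.timeout.ms, which defaults to 2 minutes.

java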
// Step 4: handle expired in-flight batches and expired batches still in the accumulator
// already sent to the broker, but no response arrived before the timeout
List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
// still in the accumulator, never drained, and timed out
List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
expiredBatches.addAll(expiredInflightBatches);

if (!expiredBatches.isEmpty())
    for (ProducerBatch expiredBatch : expiredBatches) {
        // complete the future with a TimeoutException and return the buffer to the BufferPool
        failBatch(expiredBatch, -1, NO_TIMESTAMP, new TimeoutException(errorMessage), false);
    }
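
The expiry rule itself is a single comparison between the batch's age and delivery.timeout.ms. Here is a small sketch of that rule under simplifying assumptions: ExpiredBatchSketch and its Batch type are hypothetical, a CompletableFuture stands in for the batch's produce future, and java.util.concurrent.TimeoutException stands in for Kafka's own TimeoutException class.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

class ExpiredBatchSketch {
    // Hypothetical stand-in for a batch: creation time plus the future handed to the caller.
    static final class Batch {
        final long createdMs;
        final CompletableFuture<Long> future = new CompletableFuture<>();

        Batch(long createdMs) {
            this.createdMs = createdMs;
        }
    }

    // Removes and fails every batch that has been alive for at least deliveryTimeoutMs.
    // Returns the number of expired batches.
    static int expire(List<Batch> batches, long deliveryTimeoutMs, long nowMs) {
        int expired = 0;
        Iterator<Batch> it = batches.iterator();
        while (it.hasNext()) {
            Batch batch = it.next();
            long ageMs = nowMs - batch.createdMs;
            if (ageMs >= deliveryTimeoutMs) {
                it.remove();
                batch.future.completeExceptionally(
                    new TimeoutException("Expiring batch: " + ageMs + " ms has passed since batch creation"));
                expired++;
            }
        }
        return expired;
    }
}
```

The age is measured from batch creation, so time spent waiting in the accumulator, in flight, and in retry backoff all count against the same budget.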

7-5. Placing the ProduceRequest on the KafkaChannel

Sender#sendProduceRequest: builds the ProduceRequest, which carries the acks setting and produceRecordsByPartition, the MemoryRecords batch data for each partition.

java
private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
    // one batch per partition
    // used to build the request
    Map<TopicPartition, MemoryRecords> produceRecordsByPartition = new HashMap<>(batches.size());
    // used to handle the response
    final Map<TopicPartition, ProducerBatch> recordsByPartition = new HashMap<>(batches.size());
    byte minUsedMagic = apiVersions.maxUsableProduceMagic();
    for (ProducerBatch batch : batches) {
        if (batch.magic() < minUsedMagic)
            minUsedMagic = batch.magic();
    }
    for (ProducerBatch batch : batches) {
        TopicPartition tp = batch.topicPartition;
        MemoryRecords records = batch.records(); // builtRecords
        produceRecordsByPartition.put(tp, records);
        recordsByPartition.put(tp, batch);
    }
    // build the request from the batch of each partition
    ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
            produceRecordsByPartition, transactionalId);
    // response handling
    RequestCompletionHandler callback = response -> handleProduceResponse(response, recordsByPartition, time.milliseconds());
    String nodeId = Integer.toString(destination);
    ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
            requestTimeoutMs, callback);
    // buffer the request as the KafkaChannel's Send
    client.send(clientRequest, now);
}

NetworkClient#send: serializes the ProduceRequest into a Send and places it on the KafkaChannel.

java
public void send(ClientRequest request, long now) {
    doSend(request, false, now);
}
private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now) {
    AbstractRequest.Builder<?> builder = clientRequest.requestBuilder();
    try {
        // ... omitted: `version` is resolved from the API versions supported by the broker
        doSend(clientRequest, isInternalRequest, now, builder.build(version));
    } catch (UnsupportedVersionException e) {
        // ... omitted: the request is failed locally with this exception
    }
}
private void doSend(ClientRequest clientRequest, boolean isInternalRequest, long now, AbstractRequest request) {
    String destination = clientRequest.destination();
    // build the request header
    RequestHeader header = clientRequest.makeHeader(request.version());
    // serialize the request into a ByteBuffer
    Send send = request.toSend(destination, header);
    InFlightRequest inFlightRequest = new InFlightRequest(...);
    this.inFlightRequests.add(inFlightRequest);
    // buffer the Send on the KafkaChannel of the target connection
    selector.send(send);
}

// Selector#send->KafkaChannel#setSend
// the outgoing buffer
private Send send;
public void setSend(Send send) {
    if (this.send != null)
        // a channel holds at most one pending Send, so it can send only one request per Sender loop iteration
        throw new IllegalStateException();
    this.send = send;
    // tell the nioSelector that this channel is interested in write events
    this.transportLayer.addInterestOps(SelectionKey.OP_WRITE);
}

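ClientRequest#makeHeader: builds the request header. Note the correlationId.

On each connection between the client and a broker, every request has a unique id. It is assigned each time the Sender creates a request and increases sequentially. As in any RPC request/response model, the response carries the same id, which is used to match it with its request.

java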
public RequestHeader makeHeader(short version) {
    short requestApiKey = requestBuilder.apiKey().id;
    return new RequestHeader(
        new RequestHeaderData().
            // which API is being called, e.g. ProduceRequest
            setRequestApiKey(requestApiKey).
            // API version
            setRequestApiVersion(version).
            // clientId
            setClientId(clientId).
            // request id
            setCorrelationId(correlationId),
        ApiKeys.forId(requestApiKey).requestHeaderVersion(version));
}
// NetworkClient.java
private int correlation;
int nextCorrelationId() {
    return correlation++;
}

AbstractRequest#toSend: converts the request into a Send.

java
public Send toSend(String destination, RequestHeader header) {
    return new NetworkSend(destination, serialize(header));
}
public ByteBuffer serialize(RequestHeader header) {
    // convert to Struct first, then to ByteBuffer
    return RequestUtils.serialize(header.toStruct(), toStruct());
}

Every API object defines its own Schema. A Schema consists of an ordered list of Fields, each Field has a Type, and each Type has its own serialization logic.

java
// ProduceRequestData 
public static final Schema SCHEMA_3 =
        new Schema(new Field("transactional_id", Type.NULLABLE_STRING),
            new Field("acks", Type.INT16),
            new Field("timeout_ms", Type.INT32),
            new Field("topics", new ArrayOf(TopicProduceData.SCHEMA_0)));

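ProduceRequest#toStruct: a request object must be converted to a Struct before it can be serialized. (Responses work the same way in reverse: the bytes are first deserialized into a Struct, which is then converted into the response object, i.e. bytes → Struct → object.)

The ProduceRequest regroups the data by topic, with n partitions under each topic.

java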
public Struct toStruct() {
    Map<TopicPartition, MemoryRecords> partitionRecords = partitionRecordsOrFail();
    short version = version();
    Struct struct = new Struct(ApiKeys.PRODUCE.requestSchema(version));
    Map<String, Map<Integer, MemoryRecords>> recordsByTopic = CollectionUtils.groupPartitionDataByTopic(partitionRecords);
    struct.set(ACKS_KEY_NAME, acks);
    struct.set(TIMEOUT_KEY_NAME, timeout);
    struct.setIfExists(NULLABLE_TRANSACTIONAL_ID, transactionalId);
    List<Struct> topicDatas = new ArrayList<>(recordsByTopic.size());
    for (Map.Entry<String, Map<Integer, MemoryRecords>> topicEntry : recordsByTopic.entrySet()) {
        Struct topicData = struct.instance(TOPIC_DATA_KEY_NAME);
        topicData.set(TOPIC_NAME, topicEntry.getKey());
        List<Struct> partitionArray = new ArrayList<>();
        for (Map.Entry<Integer, MemoryRecords> partitionEntry : topicEntry.getValue().entrySet()) {
            MemoryRecords records = partitionEntry.getValue();
            Struct part = topicData.instance(PARTITION_DATA_KEY_NAME)
                    .set(PARTITION_ID, partitionEntry.getKey())
                    .set(RECORD_SET_KEY_NAME, records);
            partitionArray.add(part);
        }
        topicData.set(PARTITION_DATA_KEY_NAME, partitionArray.toArray());
        topicDatas.add(topicData);
    }
    struct.set(TOPIC_DATA_KEY_NAME, topicDatas.toArray());
    return struct;
}

RequestUtils#serialize: serialization.

java
public static ByteBuffer serialize(Struct headerStruct, Struct bodyStruct) {
    ByteBuffer buffer = ByteBuffer.allocate(headerStruct.sizeOf() + bodyStruct.sizeOf());
    headerStruct.writeTo(buffer);
    bodyStruct.writeTo(buffer);
    buffer.rewind();
    return buffer;
}
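
The "compute the size, allocate once, write fields in schema order, rewind" pattern can be shown without Kafka's Struct classes. The following is a minimal sketch: ProduceFieldsSketch and its methods are hypothetical, and it hand-encodes only the three scalar fields from the schema shown earlier (transactional_id as NULLABLE_STRING, acks as INT16, timeout_ms as INT32).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ProduceFieldsSketch {
    // NULLABLE_STRING: INT16 length (-1 for null) followed by the UTF-8 bytes.
    static int sizeOfNullableString(String s) {
        return 2 + (s == null ? 0 : s.getBytes(StandardCharsets.UTF_8).length);
    }

    static void writeNullableString(ByteBuffer buffer, String s) {
        if (s == null) {
            buffer.putShort((short) -1);
            return;
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        buffer.putShort((short) bytes.length);
        buffer.put(bytes);
    }

    // Sizes the buffer exactly, writes the fields in schema order, then rewinds for reading.
    static ByteBuffer serialize(String transactionalId, short acks, int timeoutMs) {
        int size = sizeOfNullableString(transactionalId) + 2 /* acks */ + 4 /* timeout_ms */;
        ByteBuffer buffer = ByteBuffer.allocate(size);
        writeNullableString(buffer, transactionalId);
        buffer.putShort(acks);
        buffer.putInt(timeoutMs);
        buffer.rewind();
        return buffer;
    }
}
```

Since the field order is fixed by the schema, the reader needs no field names or tags on the wire; it simply reads the same types back in the same order.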

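Type#write: every data type in a Struct has its own serialization logic.

For example, MemoryRecords is serialized as a 4-byte buffer size followed by the buffer contents. This step copies the original buffer held by the MemoryRecords into the destination buffer of the NetworkSend.

java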
public static final DocumentedType RECORDS = new DocumentedType() {
  public void write(ByteBuffer buffer, Object o) {
      MemoryRecords records = (MemoryRecords) o;
      // copy the buffer inside MemoryRecords into the buffer allocated just above
      NULLABLE_BYTES.write(buffer, records.buffer().duplicate());
  }
}
public static final DocumentedType NULLABLE_BYTES = new DocumentedType() {
    public void write(ByteBuffer buffer, Object o) {
        if (o == null) {
            buffer.putInt(-1);
            return;
        }
        ByteBuffer arg = (ByteBuffer) o;
        int pos = arg.position();
        // write the size of the MemoryRecords buffer
        buffer.putInt(arg.remaining());
        // write the contents of the MemoryRecords buffer
        buffer.put(arg);
        arg.position(pos);
    }
}

7-6. poll

NetworkClient#poll: this is where the nioSelector finally watches for I/O events on the sockets, sending requests and receiving responses.

java
public List<ClientResponse> poll(long timeout, long now) {
    ensureActive();
    // may build a MetadataRequest and buffer it on a channel
    long metadataTimeout = metadataUpdater.maybeUpdate(now);
    try {
        // perform the I/O: reads, writes, ...
        this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));
    } catch (IOException e) {
        log.error("Unexpected error during I/O", e);
    }
    long updatedNow = this.time.milliseconds();
    List<ClientResponse> responses = new ArrayList<>();
    // handle sent requests that expect no response (acks=0)
    handleCompletedSends(responses, updatedNow);
    // handle received responses
    handleCompletedReceives(responses, updatedNow);
    // ...
    // invoke the callbacks
    completeResponses(responses);
    return responses;
}

Step 1: a brief outline of what selector.poll does.

For a write event (sending a request), the Send is written to the underlying SocketChannel.

java
Selector#poll
// an I/O event fired on a channel
Selector#pollSelectionKeys
// there is something to write, e.g. a ProduceRequest
Selector#attemptWrite
    // write the Send held by the KafkaChannel to its SocketChannel
    Selector#write
        KafkaChannel#write
            ByteBufferSend#writeTo
                PlaintextTransportLayer#write
                    SocketChannel#write
// Selector#write
void write(KafkaChannel channel) throws IOException {
    String nodeId = channel.id();
    long bytesSent = channel.write(); // write the Send to the channel
    Send send = channel.maybeCompleteSend(); // once the Send is fully written, remove OP_WRITE
}
// KafkaChannel#write
// the outgoing buffer
private Send send;
public long write() throws IOException {
    midWrite = true;
    return send.writeTo(transportLayer);
}

For a read event (receiving a response), the data is read into a NetworkReceive.

java
Selector#poll
// an I/O event fired on a channel
Selector#pollSelectionKeys
// a response arrived from the broker
Selector#attemptRead
    KafkaChannel#read
        KafkaChannel#receive
            NetworkReceive#readFrom
                PlaintextTransportLayer#read
                    SocketChannel#read
// KafkaChannel#read
// the incoming buffer
private NetworkReceive receive;
public long read() throws IOException {
    if (receive == null) {
        receive = new NetworkReceive(maxReceiveSize, id, memoryPool);
    }
    long bytesReceived = receive(this.receive);
    return bytesReceived;
}
public class NetworkReceive implements Receive {
    // message length
    private final ByteBuffer size;
    // message body
    private ByteBuffer buffer;
}
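
The two buffers in NetworkReceive reflect a length-prefixed framing: first a 4-byte size, then a body of exactly that many bytes, either of which may arrive across several reads. A minimal sketch of that reading loop follows; FrameReceiveSketch and its decode helper are hypothetical names, and an in-memory channel stands in for the socket.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

class FrameReceiveSketch {
    private final ByteBuffer size = ByteBuffer.allocate(4); // frame length
    private ByteBuffer payload;                             // frame body, allocated once the length is known

    // Reads whatever is available; may need several calls before complete() is true.
    long readFrom(ReadableByteChannel channel) throws IOException {
        long read = 0;
        if (size.hasRemaining()) {
            int n = channel.read(size);
            if (n < 0)
                throw new EOFException();
            read += n;
            if (!size.hasRemaining()) {
                size.rewind();
                int length = size.getInt();
                if (length < 0)
                    throw new IOException("Invalid frame length " + length);
                payload = ByteBuffer.allocate(length);
            }
        }
        if (payload != null && payload.hasRemaining()) {
            int n = channel.read(payload);
            if (n < 0)
                throw new EOFException();
            read += n;
        }
        return read;
    }

    boolean complete() {
        return payload != null && !payload.hasRemaining();
    }

    // Convenience helper for the example: decodes one frame from an in-memory byte array.
    static byte[] decode(byte[] wire) {
        FrameReceiveSketch receive = new FrameReceiveSketch();
        try (ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(wire))) {
            while (!receive.complete())
                receive.readFrom(channel);
            return receive.payload.array();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the partially filled buffers as fields is what lets a non-blocking reader return to the selector and resume the same frame on the next read event.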

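Step 2: handling responses.

NetworkClient#handleCompletedSends: iterates over the Sends that were fully written during this poll. If acks=0 (no response is expected), it builds a ClientResponse right away and adds it to the responses list for later processing.

java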
private void handleCompletedSends(List<ClientResponse> responses, long now) {
    for (Send send : this.selector.completedSends()) {
        // look up the request that was just sent
        InFlightRequest request =
          this.inFlightRequests.lastSent(send.destination());
        // with acks=0, expectResponse=false
        if (!request.expectResponse) {
            // pop the request that was just sent
            this.inFlightRequests.completeLastSent(send.destination());
            // build a ClientResponse and add it to the result list
            responses.add(request.completed(null, now));
        }
    }
}

NetworkClient#handleCompletedReceives: iterates over all received responses, deserializes each one, and wraps it as an AbstractResponse.

java
private void handleCompletedReceives(List<ClientResponse> responses, long now) {
  for (NetworkReceive receive : this.selector.completedReceives()) {
      String source = receive.source(); // brokerId
      // pop the oldest request that is still waiting for a response
      InFlightRequest req = inFlightRequests.completeNext(source);
      // deserialize into a Struct
      // this also verifies that the response's correlationId matches the request's: on a connection,
      // Kafka sends requests and receives responses strictly in order, so unlike a typical RPC
      // framework it does not look up the pending request by id
      Struct responseStruct = parseStructMaybeUpdateThrottleTimeMetrics(receive.payload(), req.header,
          throttleTimeSensor, now);
      // convert the Struct into a response object
      AbstractResponse response = AbstractResponse.
          parseResponse(req.header.apiKey(), responseStruct, req.header.apiVersion());

      // ...
      // add the response to the result list
      responses.add(req.completed(response, now));
  }
}
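
The in-order matching can be reduced to a counter and a deque. The sketch below illustrates the idea and is not NetworkClient's code: CorrelationSketch is a hypothetical class that tracks a single connection and stores only the ids.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class CorrelationSketch {
    private int correlation = 0;
    // In-flight correlation ids for one connection, newest at the head.
    private final Deque<Integer> inFlight = new ArrayDeque<>();

    // Assigns the next id and records the request as in flight.
    int send() {
        int id = correlation++;
        inFlight.addFirst(id);
        return id;
    }

    // Completes the oldest in-flight request; the response must carry the same id.
    int receive(int responseCorrelationId) {
        Integer oldest = inFlight.pollLast();
        if (oldest == null)
            throw new IllegalStateException("Response received with no in-flight request");
        if (oldest != responseCorrelationId)
            throw new IllegalStateException("Correlation id for response (" + responseCorrelationId
                + ") does not match request (" + oldest + ")");
        return oldest;
    }
}
```

In this model the id is a consistency check rather than a lookup key: a mismatch means the stream is out of sync, which is treated as an error instead of being resolved by searching the pending requests.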

Step 3: invoking the callbacks.

NetworkClient#completeResponses:

java
 private void completeResponses(List<ClientResponse> responses) {
    for (ClientResponse response : responses) {
        try {
            response.onComplete();
        } catch (Exception e) {
            log.error("Uncaught error in request completion:", e);
        }
    }
}
// ClientResponse.java
private final RequestCompletionHandler callback;
public void onComplete() {
    if (callback != null)
        callback.onComplete(this);
}

Sender#sendProduceRequest: back to the callback from 7-5.

java
private void sendProduceRequest(long now, int destination, short acks, int timeout, List<ProducerBatch> batches) {
    // ...
    // build the request
    ProduceRequest.Builder requestBuilder = ProduceRequest.Builder.forMagic(minUsedMagic, acks, timeout,
            produceRecordsByPartition, transactionalId);
    // response handling [focus]
    RequestCompletionHandler callback = response -> handleProduceResponse(response, recordsByPartition, time.milliseconds());

    String nodeId = Integer.toString(destination);
    ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0,
            requestTimeoutMs, callback);
    // buffer the request as the KafkaChannel's Send
    client.send(clientRequest, now);
}

Sender#handleProduceResponse: one ProduceResponse covers the batches of n partitions, so the handler loops over the response for each partition's batch.

java
private void handleProduceResponse(ClientResponse response, Map<TopicPartition, ProducerBatch> batches, long now) {
  RequestHeader requestHeader = response.requestHeader();
  int correlationId = requestHeader.correlationId();
  // ...
  if (response.hasResponse()) {
      // acks != 0: the broker sent a response
      ProduceResponse produceResponse = (ProduceResponse) response.responseBody();
      for (Map.Entry<TopicPartition, ProduceResponse.PartitionResponse> entry : produceResponse.responses().entrySet()) {
          TopicPartition tp = entry.getKey();
          ProduceResponse.PartitionResponse partResp = entry.getValue();
          ProducerBatch batch = batches.get(tp);
          completeBatch(batch, partResp, correlationId, now);
      }
  } else {
      // acks = 0: there is no broker response
      for (ProducerBatch batch : batches.values()) {
          completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NONE), correlationId, now);
      }
  }
}

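ProducerBatch#completeFutureAndFireCallbacks: one partition batch corresponds to n calls to send. This method loops over the callbacks of all those send calls (ProducerInterceptor#onAcknowledgement first, then the user-supplied callback) and finally completes produceFuture, waking up the producer threads that are blocked waiting for the send result.

As a result, all produce callbacks run on the I/O thread.

java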
// future for the whole batch
final ProduceRequestResult produceFuture;
// Thunk = (callback, per-record future)
private final List<Thunk> thunks = new ArrayList<>();
private void completeFutureAndFireCallbacks(long baseOffset, long logAppendTime, RuntimeException exception) {
    // set the result on the batch future
    produceFuture.set(baseOffset, logAppendTime, exception);
    // loop over the thunks, invoking InterceptorCallback and the user callback
    for (Thunk thunk : thunks) {
        try {
            if (exception == null) {
                RecordMetadata metadata = thunk.future.value();
                if (thunk.callback != null)
                    thunk.callback.onCompletion(metadata, null);
            } else {
                if (thunk.callback != null)
                    thunk.callback.onCompletion(null, exception);
            }
        } catch (Exception e) {
            log.error(..., e);
        }
    }
    // complete the batch future (countDownLatch.countDown)
    // and wake up every producer thread waiting for the send result
    produceFuture.done();
}

private static class InterceptorCallback<K, V> implements Callback {
    private final Callback userCallback;
    private final ProducerInterceptors<K, V> interceptors;
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        this.interceptors.onAcknowledgement(metadata, exception);
        if (this.userCallback != null)
            this.userCallback.onCompletion(metadata, exception);
    }
}
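
The relationship between the single batch-level result and the n per-record callbacks can be sketched with a CountDownLatch. This is an illustration under simplifying assumptions, not ProducerBatch itself: BatchCompletionSketch is a hypothetical class, a BiConsumer stands in for Callback, and the record offset stands in for RecordMetadata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.function.BiConsumer;

class BatchCompletionSketch {
    private final CountDownLatch latch = new CountDownLatch(1);
    // One callback per send() call that landed in this batch, in append order.
    private final List<BiConsumer<Long, RuntimeException>> callbacks = new ArrayList<>();
    private volatile long baseOffset = -1L;
    private volatile RuntimeException error;

    void addCallback(BiConsumer<Long, RuntimeException> callback) {
        callbacks.add(callback);
    }

    // Called by the I/O thread once the broker has answered for this batch.
    void complete(long baseOffset, RuntimeException exception) {
        this.baseOffset = baseOffset;
        this.error = exception;
        for (int i = 0; i < callbacks.size(); i++) {
            try {
                // Record i of the batch gets offset baseOffset + i; on failure the offset is null.
                callbacks.get(i).accept(exception == null ? baseOffset + i : null, exception);
            } catch (Exception e) {
                // One failing callback must not prevent the remaining ones from running.
            }
        }
        // Wake up every thread blocked in await().
        latch.countDown();
    }

    boolean isDone() {
        return latch.getCount() == 0;
    }

    // Called by a producer thread that wants the batch result synchronously.
    long await() throws InterruptedException {
        latch.await();
        if (error != null)
            throw error;
        return baseOffset;
    }
}
```

Since complete runs every callback on the calling thread, a slow callback delays all later callbacks and the rest of the I/O loop, which is why produce callbacks should stay cheap.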

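Summary

The Kafka producer client uses two threads:

1) The producer thread calls send, which returns a Future. The caller can either block until the Future completes or pass a callback as the second argument to send. Under the hood, send writes the record into the RecordAccumulator: each partition has one Deque of batches, and each ProducerBatch holds n records;

2) The Sender thread loops forever: it drains ProducerBatches from the RecordAccumulator, sends ProduceRequests, receives ProduceResponses, runs all callbacks, and completes the produce Futures;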

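Producer thread:

1) Fetch the topic metadata: read it from the cache first; on a cache miss, wake up the Sender to fetch the metadata into the cache while the producer thread waits for it;

2) Serialize the key and value;

3) Compute the partition: if the record has a key, the partition is hash(key) % number of partitions; otherwise the sticky strategy is used, which packs keyless records into the batch of a single partition to reduce their send latency;

4) Estimate the record size: verify that the record does not exceed max.request.size (1MB by default);

5) Write the record into the accumulator:

5-1) First try to append to the existing batch of the record's partition (under the partition lock); if that fails, a new batch has to be created;

5-2) Creating a new batch requires allocating a block of memory from the BufferPool (under a global lock, 16KB by default). The block is returned to the pool after the Sender has finished sending the batch. If the pool does not have enough free memory (32MB in total by default), the producer thread blocks;

5-3) Compression happens as records are written into the batch;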

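Sender thread:

1) Select from the accumulator the nodes that need a send, i.e. the node of any partition that meets one of these conditions: its batch is full; its batch has waited in the accumulator longer than linger.ms (0 by default); or the buffer pool is out of memory and a producer thread is waiting for an allocation;

2) Filter out nodes that are not ready: a metadata update is due (global condition); the connection is not established (this triggers a connect, but sending has to wait for a later loop iteration); or the number of unacknowledged requests to the node has reached max.in.flight.requests.per.connection=5;

3) For each node, drain one batch per partition from the accumulator, keeping the data for a single node within max.request.size=1MB, and write the batch header for each batch;

4) Remove expired batches: the number of retries is bounded by delivery.timeout.ms (2 minutes by default). Expired batches include both those already sent to the broker and those still waiting in the accumulator; their produce Futures are completed with a timeout exception;

5) Build one ProduceRequest per node, serialize it, wrap it in a Send, and place it on the KafkaChannel;

6) poll: perform the I/O, i.e. write Sends and read Receives, deserialize them into ProduceResponses, run the callbacks, and complete the produce Futures;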
