After the main thread calls KafkaProducer.send(), the message is first staged in the RecordAccumulator and the call returns. Once certain conditions are met, the Sender thread is woken up to send the messages accumulated in the RecordAccumulator. Because two threads operate on the RecordAccumulator, it must be thread-safe.
At the core of RecordAccumulator is a ConcurrentMap keyed by TopicPartition; each value is a Deque of RecordBatch objects, and each RecordBatch holds a reference to a MemoryRecords object, which is where the messages are actually stored.
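Before diving into the internals, here is a minimal sketch of the asynchronous contract this design enables (the topic name and broker address are assumptions, not from the source): send() returns as soon as the record is buffered in the RecordAccumulator, and the result arrives later via the callback, invoked from the Sender's I/O thread.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() only appends the record to the RecordAccumulator and returns;
            // the broker's response is delivered asynchronously to the callback.
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null)
                    exception.printStackTrace();
                else
                    System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
            });
        }
    }
}
```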
MemoryRecords
MemoryRecords is a collection of multiple messages. It wraps an NIO ByteBuffer to hold the messages, plus a Compressor that compresses the messages written into the buffer.
```java
// Compressor: compresses messages and writes the compressed bytes into `buffer`
private final Compressor compressor;
// Maximum number of bytes that may be written
private final int writeLimit;
// The Java NIO ByteBuffer that holds the messages
private ByteBuffer buffer;
// Whether this instance is in read-write mode or read-only mode
private boolean writable;
```
Compressor has two important stream-typed fields, bufferStream and appendStream. The former is a ByteBufferOutputStream, Kafka's own subclass of java.io.OutputStream: it wraps a ByteBuffer and automatically expands it when a write would exceed the buffer's capacity. The latter is a DataOutputStream that layers compression on top. The compression codec is specified by the compression.type configuration parameter.
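To make the auto-expansion concrete, here is a simplified, self-contained sketch of the idea (an illustration under stated assumptions, not Kafka's actual ByteBufferOutputStream): when a write would overflow the buffer, a larger buffer is allocated and the old contents are copied over.

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Simplified stand-in for Kafka's ByteBufferOutputStream.
class GrowableByteBufferOutputStream extends OutputStream {
    private ByteBuffer buffer;

    GrowableByteBufferOutputStream(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public void write(int b) {
        if (!buffer.hasRemaining())
            expand(buffer.position() + 1);
        buffer.put((byte) b);
    }

    // Allocate a larger buffer (with some headroom) and copy the old contents into it.
    private void expand(int minCapacity) {
        ByteBuffer grown = ByteBuffer.allocate(Math.max((int) (buffer.capacity() * 1.1), minCapacity));
        buffer.flip();
        grown.put(buffer);
        buffer = grown;
    }

    ByteBuffer buffer() {
        return buffer;
    }
}
```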
The compressor supports three codecs: GZIP, SNAPPY, and LZ4. Let's take a detailed look at how the compressor creates its compression streams.
```java
public Compressor(ByteBuffer buffer, CompressionType type) {
    this.type = type; // compression codec type
    // ....
    // create the stream
    bufferStream = new ByteBufferOutputStream(buffer);
    appendStream = wrapForOutput(bufferStream, type, COMPRESSION_DEFAULT_BUFFER_SIZE);
}

public static DataOutputStream wrapForOutput(ByteBufferOutputStream buffer, CompressionType type, int bufferSize) {
    try {
        switch (type) {
            case NONE:
                return new DataOutputStream(buffer);
            case GZIP:
                return new DataOutputStream(new GZIPOutputStream(buffer, bufferSize));
            case SNAPPY:
                try {
                    // SnappyOutputStream is instantiated reflectively (see below)
                    OutputStream stream = (OutputStream) snappyOutputStreamSupplier.get().newInstance(buffer, bufferSize);
                    return new DataOutputStream(stream);
                } catch (Exception e) {
                    throw new KafkaException(e);
                }
            case LZ4:
                // .... same pattern as SNAPPY
            default:
                throw new IllegalArgumentException("Unknown compression type: " + type);
        }
    } catch (IOException e) {
        throw new KafkaException(e);
    }
}
```
One point worth noting: the GZIP stream is created directly with new, while the snappy stream is created via reflection. The reason is that GZIP ships with the JDK, whereas snappy requires an extra dependency; by using reflection, the snappy dependency does not have to be present unless snappy compression is actually used.
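The same idea in a self-contained sketch (a hypothetical helper, though the class and constructor it loads, org.xerial.snappy.SnappyOutputStream(OutputStream, int), are snappy-java's real API): the class is resolved by name only when snappy is selected, so the jar remains an optional dependency.

```java
import java.io.OutputStream;
import java.lang.reflect.Constructor;

final class SnappyLoader {
    // Build a Snappy-compressing stream without a compile-time dependency on snappy-java.
    static OutputStream wrap(OutputStream out, int bufferSize) {
        try {
            Class<?> cls = Class.forName("org.xerial.snappy.SnappyOutputStream");
            Constructor<?> ctor = cls.getConstructor(OutputStream.class, int.class);
            return (OutputStream) ctor.newInstance(out, bufferSize);
        } catch (Exception e) {
            // snappy-java is missing from the classpath, or its API changed
            throw new RuntimeException(e);
        }
    }
}
```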
Compressor exposes a family of put* methods. This is the decorator pattern: appendStream adds compression, and bufferStream adds automatic buffer expansion underneath.
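A runnable sketch of the same layering using only JDK classes (GZIP codec and a ByteArrayOutputStream standing in for ByteBufferOutputStream; the sizes are arbitrary assumptions): the DataOutputStream decorates the compression stream, which decorates the growable byte sink.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class StreamLayeringSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(64); // grows automatically, like ByteBufferOutputStream
        try (DataOutputStream append = new DataOutputStream(new GZIPOutputStream(sink, 64))) {
            append.writeLong(42L);                  // roughly what Compressor.putLong() does via appendStream
            append.write("hello, kafka".getBytes());
        }
        System.out.println("compressed size: " + sink.size());
    }
}
```

Layering streams this way keeps each concern (writing primitives, compression, buffer growth) in its own decorator.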
estimatedBytesWritten() estimates how many bytes have been used, which is what hasRoomFor() relies on to decide whether another record still fits.
- Inputs:
  - writtenUncompressed: the number of uncompressed bytes written so far.
  - type.id: the identifier of the compression type, used to index into the TYPE_TO_RATE array.
  - TYPE_TO_RATE: an array of estimated compression rates per codec (e.g., 0.3 means the compressed output is expected to be 30% of the uncompressed size).
  - COMPRESSION_RATE_ESTIMATION_FACTOR: a global adjustment factor (e.g., 1.1) used to pad the estimate, compensating for overhead not otherwise accounted for, such as compression headers.
- Output: the estimated number of bytes written after compression, as a long.
```java
public long estimatedBytesWritten() {
    if (type == CompressionType.NONE) {
        return bufferStream.buffer().position();
    } else {
        // estimate the written bytes to the underlying byte buffer based on uncompressed written bytes
        return (long) (writtenUncompressed * TYPE_TO_RATE[type.id] * COMPRESSION_RATE_ESTIMATION_FACTOR);
    }
}
```
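As a worked example (constants as in the 0.10-era source, where the estimated rate for GZIP/SNAPPY/LZ4 is 0.5 and COMPRESSION_RATE_ESTIMATION_FACTOR is 1.05): after writing 1000 uncompressed bytes with GZIP, the estimate is 1000 × 0.5 × 1.05 = 525 bytes. The estimate is deliberately padded so that hasRoomFor() errs on the side of closing the batch early rather than overfilling it.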
Other methods:
- append: checks that the MemoryRecords instance is in writable mode, then calls the Compressor's put* methods to write into the ByteBuffer.
- hasRoomFor: uses estimatedBytesWritten to estimate whether there is room left for another record.
- close: flushes and closes the compressor and switches the MemoryRecords into read-only mode.
- sizeInBytes: returns the size of the written data (the underlying buffer's position while writable, the buffer's limit once closed).
RecordBatch
Besides the MemoryRecords object, a RecordBatch also carries bookkeeping fields:
```java
public int recordCount = 0;                      // number of records in this batch
public int maxRecordSize = 0;                    // size of the largest record appended so far
public volatile int attempts = 0;                // number of send attempts
public final long createdMs;                     // time the batch was created
public long drainedMs;                           // time the batch was drained for sending
public long lastAttemptMs;                       // time of the last send attempt
public final MemoryRecords records;              // the messages themselves
public final TopicPartition topicPartition;      // destination partition
public final ProduceRequestResult produceFuture; // batch-level result of the produce request
public long lastAppendTime;                      // time of the last append to this batch
private final List<Thunk> thunks;                // per-record callbacks plus their FutureRecordMetadata
private long offsetCounter = 0L;                 // relative offset of the next record within the batch
private boolean retry;                           // whether this batch is being retried
```
When every message in a RecordBatch has been acknowledged, has timed out, or the producer is being closed, ProduceRequestResult.done() is called to mark produceFuture as complete; the error field of ProduceRequestResult distinguishes normal completion from failure.
The callback field of the Thunk class points to the corresponding message's Callback object; its other field is of type FutureRecordMetadata.
FutureRecordMetadata has two key fields: result, of type ProduceRequestResult, which points to the produceFuture of the RecordBatch the message belongs to; and relativeOffset, which records the message's offset within that RecordBatch. FutureRecordMetadata implements the Future interface, but its implementation mostly delegates to the corresponding methods of ProduceRequestResult, because messages are sent and acknowledged at RecordBatch granularity. Once the producer has received the response for a message, FutureRecordMetadata.get() returns a RecordMetadata object containing the message's metadata.
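A hedged usage sketch of that contract (the topic and values are assumptions): because get() delegates to the batch-level ProduceRequestResult, it unblocks only once the whole RecordBatch has been acknowledged or has failed.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

final class SyncSendSketch {
    // Blocks until the record's enclosing RecordBatch is done(), then returns its metadata.
    static RecordMetadata syncSend(KafkaProducer<String, String> producer) throws Exception {
        return producer.send(new ProducerRecord<>("demo-topic", "k", "v")).get();
    }
}
```

The following is RecordBatch.tryAppend(), which performs the append and creates the FutureRecordMetadata: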
```java
public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Callback callback, long now) {
    // estimate whether there is room left
    if (!this.records.hasRoomFor(key, value)) {
        return null;
    } else {
        // append the record to the MemoryRecords object
        long checksum = this.records.append(offsetCounter++, timestamp, key, value);
        this.maxRecordSize = Math.max(this.maxRecordSize, Record.recordSize(key, value));
        this.lastAppendTime = now;
        // create the FutureRecordMetadata object
        FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture, this.recordCount,
                                                               timestamp, checksum,
                                                               key == null ? -1 : key.length,
                                                               value == null ? -1 : value.length);
        // save the user's Callback together with the FutureRecordMetadata in the thunks list
        if (callback != null)
            thunks.add(new Thunk(callback, future));
        this.recordCount++;
        return future;
    }
}
```
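Note that tryAppend() returns null when the batch has no room left for the record; as we will see in RecordAccumulator.append(), that null return is the signal to allocate a new ByteBuffer and open a fresh RecordBatch.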
When a RecordBatch receives a normal response, times out, or the producer is closed, its done() method is invoked: it runs the Callback of every message in the batch and then calls the done() method of its produceFuture field.
```java
public void done(long baseOffset, long timestamp, RuntimeException exception) {
    log.trace("Produced messages to topic-partition {} with base offset offset {} and error: {}.",
              topicPartition,
              baseOffset,
              exception);
    // execute callbacks
    for (int i = 0; i < this.thunks.size(); i++) {
        try {
            Thunk thunk = this.thunks.get(i);
            if (exception == null) {
                // If the timestamp returned by server is NoTimestamp, that means CreateTime is used. Otherwise LogAppendTime is used.
                RecordMetadata metadata = new RecordMetadata(this.topicPartition, baseOffset, thunk.future.relativeOffset(),
                                                             timestamp == Record.NO_TIMESTAMP ? thunk.future.timestamp() : timestamp,
                                                             thunk.future.checksum(),
                                                             thunk.future.serializedKeySize(),
                                                             thunk.future.serializedValueSize());
                thunk.callback.onCompletion(metadata, null);
            } else {
                thunk.callback.onCompletion(null, exception);
            }
        } catch (Exception e) {
            log.error("Error executing user-provided callback on message for topic-partition {}:", topicPartition, e);
        }
    }
    this.produceFuture.done(topicPartition, baseOffset, exception);
}
```
BufferPool
Creating and releasing ByteBuffers is relatively expensive, so BufferPool implements ByteBuffer reuse.
```java
private final long totalMemory;         // total memory the pool may hand out
private final int poolableSize;         // the single buffer size that gets pooled
private final ReentrantLock lock;       // guards all pool state
private final Deque<ByteBuffer> free;   // cached, fixed-size free buffers
private final Deque<Condition> waiters; // threads blocked waiting for memory, in FIFO order
private long availableMemory;           // unallocated memory outside the free list
private final Metrics metrics;
private final Time time;
private final Sensor waitTime;          // records how long allocations block
```
Each BufferPool only pools ByteBuffers of one specific size (poolableSize); allocations of any other size are not cached.
```java
public ByteBuffer allocate(int size, long maxTimeToBlockMs) throws InterruptedException {
    if (size > this.totalMemory)
        throw new IllegalArgumentException("Attempt to allocate " + size
                                           + " bytes, but there is a hard limit of "
                                           + this.totalMemory
                                           + " on memory allocations.");
    this.lock.lock();
    try {
        // check if we have a free buffer of the right size pooled
        if (size == poolableSize && !this.free.isEmpty())
            return this.free.pollFirst();

        // now check if the request is immediately satisfiable with the
        // memory on hand or if we need to block
        int freeListSize = this.free.size() * this.poolableSize;
        if (this.availableMemory + freeListSize >= size) {
            // we have enough unallocated or pooled memory to immediately
            // satisfy the request
            freeUp(size);
            this.availableMemory -= size;
            lock.unlock();
            return ByteBuffer.allocate(size);
        } else {
            // we are out of memory and will have to block
            int accumulated = 0;
            ByteBuffer buffer = null;
            Condition moreMemory = this.lock.newCondition();
            long remainingTimeToBlockNs = TimeUnit.MILLISECONDS.toNanos(maxTimeToBlockMs);
            this.waiters.addLast(moreMemory);
            // loop over and over until we have a buffer or have reserved
            // enough memory to allocate one
            while (accumulated < size) {
                long startWaitNs = time.nanoseconds();
                long timeNs;
                boolean waitingTimeElapsed;
                try {
                    waitingTimeElapsed = !moreMemory.await(remainingTimeToBlockNs, TimeUnit.NANOSECONDS);
                } catch (InterruptedException e) {
                    this.waiters.remove(moreMemory);
                    throw e;
                } finally {
                    long endWaitNs = time.nanoseconds();
                    timeNs = Math.max(0L, endWaitNs - startWaitNs);
                    this.waitTime.record(timeNs, time.milliseconds());
                }

                if (waitingTimeElapsed) {
                    this.waiters.remove(moreMemory);
                    throw new TimeoutException("Failed to allocate memory within the configured max blocking time " + maxTimeToBlockMs + " ms.");
                }

                remainingTimeToBlockNs -= timeNs;
                // check if we can satisfy this request from the free list,
                // otherwise allocate memory
                if (accumulated == 0 && size == this.poolableSize && !this.free.isEmpty()) {
                    // just grab a buffer from the free list
                    buffer = this.free.pollFirst();
                    accumulated = size;
                } else {
                    // we'll need to allocate memory, but we may only get
                    // part of what we need on this iteration
                    freeUp(size - accumulated);
                    int got = (int) Math.min(size - accumulated, this.availableMemory);
                    this.availableMemory -= got;
                    accumulated += got;
                }
            }

            // remove the condition for this thread to let the next thread
            // in line start getting memory
            Condition removed = this.waiters.removeFirst();
            if (removed != moreMemory)
                throw new IllegalStateException("Wrong condition: this shouldn't happen.");

            // signal any additional waiters if there is more memory left
            // over for them
            if (this.availableMemory > 0 || !this.free.isEmpty()) {
                if (!this.waiters.isEmpty())
                    this.waiters.peekFirst().signal();
            }

            // unlock and return the buffer
            lock.unlock();
            if (buffer == null)
                return ByteBuffer.allocate(size);
            else
                return buffer;
        }
    } finally {
        if (lock.isHeldByCurrentThread())
            lock.unlock();
    }
}
```
Releasing memory:
```java
public void deallocate(ByteBuffer buffer, int size) {
    lock.lock();
    try {
        // only buffers of exactly poolableSize are cached; anything else
        // simply returns its memory to the availableMemory counter
        if (size == this.poolableSize && size == buffer.capacity()) {
            buffer.clear();
            this.free.add(buffer);
        } else {
            this.availableMemory += size;
        }
        // wake up the first waiter, if any, now that memory is available
        Condition moreMem = this.waiters.peekFirst();
        if (moreMem != null)
            moreMem.signal();
    } finally {
        lock.unlock();
    }
}
```
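A usage sketch of the pool's contract (the constructor signature matches the 0.10-era class, but BufferPool is an internal API, so treat this as illustrative): buffers of exactly poolableSize are recycled through free, while any other size comes straight out of availableMemory.

```java
import java.nio.ByteBuffer;
import org.apache.kafka.clients.producer.internals.BufferPool;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.utils.SystemTime;

final class BufferPoolSketch {
    static void roundTrip() throws InterruptedException {
        // 32 MB total memory, 16 KB poolable size (the batch.size default)
        BufferPool pool = new BufferPool(32 * 1024 * 1024, 16 * 1024,
                new Metrics(), new SystemTime(), "producer-metrics");
        ByteBuffer buffer = pool.allocate(16 * 1024, 1000); // may block for up to 1000 ms
        try {
            buffer.putInt(42);
        } finally {
            pool.deallocate(buffer, 16 * 1024); // poolable size: cleared and cached in `free`
        }
    }
}
```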
RecordAccumulator
The key fields of RecordAccumulator:
```java
private volatile boolean closed;                  // whether the accumulator has been closed
private final AtomicInteger flushesInProgress;    // number of in-progress flush() calls
private final AtomicInteger appendsInProgress;    // number of threads currently in append()
private final int batchSize;                      // size of each new RecordBatch's ByteBuffer
private final CompressionType compression;        // compression codec for new batches
private final long lingerMs;                      // how long a batch may wait for more records
private final long retryBackoffMs;                // backoff before retrying a failed batch
private final BufferPool free;                    // the ByteBuffer pool described above
private final Time time;
private final ConcurrentMap<TopicPartition, Deque<RecordBatch>> batches;
private final IncompleteRecordBatches incomplete; // batches not yet acknowledged
// The following variables are only accessed by the sender thread, so we don't need to protect them.
private final Set<TopicPartition> muted;
private int drainIndex;
```
- batches: maps each TopicPartition to its collection of RecordBatch objects. The map itself is a ConcurrentMap, but Deque is not thread-safe, so appending a message or draining a RecordBatch requires synchronizing on the Deque.
KafkaProducer.send() ultimately calls RecordAccumulator.append() to add the message to the accumulator. The main logic is:
1. Look up the Deque for the TopicPartition in the batches collection; if none exists, create one and add it to batches.
2. Lock the Deque (with a synchronized block).
3. Call tryAppend() to try appending the Record to the last RecordBatch in the Deque.
4. The synchronized block ends, releasing the lock.
5. If the append succeeded, return a RecordAppendResult (which wraps the FutureRecordMetadata).
6. If it failed, allocate a new ByteBuffer from the BufferPool.
7. Lock the Deque again (synchronized) and retry step 3.
8. If the retry succeeds, return; otherwise use the ByteBuffer from step 6 to create a new RecordBatch.
9. Append the Record to the new RecordBatch and add the new RecordBatch to the tail of the Deque.
10. Add the new RecordBatch to the incomplete collection.
11. The synchronized block ends, releasing the lock.
12. Return a RecordAppendResult; its fields serve as the conditions for waking up the Sender thread.
```java
public RecordAppendResult append(TopicPartition tp,
                                 long timestamp,
                                 byte[] key,
                                 byte[] value,
                                 Callback callback,
                                 long maxTimeToBlock) throws InterruptedException {
    // We keep track of the number of appending threads to make sure we do not miss batches in
    // abortIncompleteBatches().
    appendsInProgress.incrementAndGet();
    try {
        // 1: look up (or create) the Deque for this TopicPartition
        Deque<RecordBatch> dq = getOrCreateDeque(tp);
        synchronized (dq) { // 2: lock the Deque
            // (boundary checks omitted)
            // 3: try appending the Record to the last RecordBatch in the Deque
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
            if (appendResult != null)
                return appendResult; // 5: the append succeeded, return directly
        }
        // 6: the append failed, so allocate space from the BufferPool
        int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
        ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
        synchronized (dq) {
            // ....
            // 7: with the Deque locked again, retry tryAppend
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
            if (appendResult != null) {
                // Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
                free.deallocate(buffer);
                return appendResult; // 8: the retry succeeded, return
            }
            // 8/9: create a new RecordBatch, append the Record to it, and add it to the Deque
            MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
            RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
            FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));
            dq.addLast(batch);
            // 10: add the new RecordBatch to the incomplete collection
            incomplete.add(batch);
            // 12: return the RecordAppendResult
            return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
        }
    } finally {
        appendsInProgress.decrementAndGet();
    }
}
```
The final step of KafkaProducer.doSend() is to check whether this append leaves the RecordAccumulator satisfying the send conditions. The ready() method returns the set of nodes in the cluster that meet the send conditions, which are:
- The Deque contains more than one RecordBatch, or its first RecordBatch is full.
- The first batch has timed out (it has waited at least lingerMs, or retryBackoffMs when retrying).
- Another thread is blocked waiting for BufferPool space (i.e., the pool has been exhausted).
- A thread is waiting for a flush operation to complete.
- The Sender thread is preparing to shut down.
Let's look at the code of ready(). It iterates over every partition in the batches collection and first finds the Node hosting the partition's leader replica; if the five conditions above are satisfied, that Node is recorded in the readyNodes set. After the traversal it returns a ReadyCheckResult, which records the set of Nodes that satisfy the send conditions, whether any partition with data had an unknown leader replica, and the delay before ready() should be called again.
```java
public ReadyCheckResult ready(Cluster cluster, long nowMs) {
    Set<Node> readyNodes = new HashSet<>();
    long nextReadyCheckDelayMs = Long.MAX_VALUE;
    Set<String> unknownLeaderTopics = new HashSet<>();

    boolean exhausted = this.free.queued() > 0;
    for (Map.Entry<TopicPartition, Deque<RecordBatch>> entry : this.batches.entrySet()) {
        TopicPartition part = entry.getKey();
        Deque<RecordBatch> deque = entry.getValue();

        Node leader = cluster.leaderFor(part);
        synchronized (deque) {
            if (leader == null && !deque.isEmpty()) {
                // This is a partition for which leader is not known, but messages are available to send.
                // Note that entries are currently not removed from batches when deque is empty.
                unknownLeaderTopics.add(part.topic());
            } else if (!readyNodes.contains(leader) && !muted.contains(part)) {
                RecordBatch batch = deque.peekFirst();
                if (batch != null) {
                    boolean backingOff = batch.attempts > 0 && batch.lastAttemptMs + retryBackoffMs > nowMs;
                    long waitedTimeMs = nowMs - batch.lastAttemptMs;
                    long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                    long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                    boolean full = deque.size() > 1 || batch.records.isFull();
                    boolean expired = waitedTimeMs >= timeToWaitMs;
                    boolean sendable = full || expired || exhausted || closed || flushInProgress();
                    if (sendable && !backingOff) {
                        readyNodes.add(leader);
                    } else {
                        // Note that this results in a conservative estimate since an un-sendable partition may have
                        // a leader that will later be found to have sendable data. However, this is good enough
                        // since we'll just wake up and then sleep again for the remaining time.
                        nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                    }
                }
            }
        }
    }

    return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
```
Once the ready nodes are known, drain() is called to convert the partition-keyed data (TopicPartition → Deque<RecordBatch>) into the node-keyed form in which it is actually sent (NodeId → List<RecordBatch>), since the network layer issues one request per broker node:
```java
public Map<Integer, List<RecordBatch>> drain(Cluster cluster,
                                             Set<Node> nodes,
                                             int maxSize,
                                             long now) {
    // the converted result: NodeId -> batches to send to that node
    Map<Integer, List<RecordBatch>> batches = new HashMap<>();
    for (Node node : nodes) { // iterate over the ready nodes
        int size = 0;
        // get the partitions whose leader lives on this node
        List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
        List<RecordBatch> ready = new ArrayList<>();
        /* to make starvation less likely this loop doesn't start at 0 */
        int start = drainIndex = drainIndex % parts.size();
        do {
            PartitionInfo part = parts.get(drainIndex);
            TopicPartition tp = new TopicPartition(part.topic(), part.partition());
            // Only proceed if the partition has no in-flight batches.
            if (!muted.contains(tp)) {
                Deque<RecordBatch> deque = getDeque(new TopicPartition(part.topic(), part.partition()));
                if (deque != null) {
                    synchronized (deque) {
                        RecordBatch first = deque.peekFirst();
                        if (first != null) {
                            boolean backoff = first.attempts > 0 && first.lastAttemptMs + retryBackoffMs > now;
                            // Only drain the batch if it is not during backoff period.
                            if (!backoff) {
                                if (size + first.records.sizeInBytes() > maxSize && !ready.isEmpty()) {
                                    // there is a rare case that a single batch size is larger than the request size due
                                    // to compression; in this case we will still eventually send this batch in a single
                                    // request
                                    break;
                                } else {
                                    RecordBatch batch = deque.pollFirst();
                                    batch.records.close(); // make the batch read-only before handing it off
                                    size += batch.records.sizeInBytes();
                                    ready.add(batch);
                                    batch.drainedMs = now;
                                }
                            }
                        }
                    }
                }
            }
            this.drainIndex = (this.drainIndex + 1) % parts.size();
        } while (start != drainIndex);
        batches.put(node.id(), ready);
    }
    return batches;
}
```
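Two details worth highlighting: the loop starts at drainIndex rather than 0 and advances it round-robin across calls, so partitions at the front of the list cannot starve the others; and batch.records.close() is invoked as each batch is drained, flipping the MemoryRecords into read-only mode before the batch is handed to the network layer.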