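This article covers the structure of the CommitLog file and how the CommitLog is flushed to disk. Flushing is essential to avoid losing messages, but it is also a major drag on performance. The last part introduces a mechanism that improves write performance further when asynchronous flushing is used.
1. CommitLog file structure
A commitlog file is 1 GB by default; the size can be changed through the mapedFileSizeCommitLog property in the broker configuration file. The storage layout is very simple: the first 4 bytes of each message hold the message's total length, followed by the message content.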
scss
totalSize (4 bytes) | rest of the message
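To make the length-prefixed layout concrete, here is a minimal sketch that appends and scans records in a plain ByteBuffer. The class and method names (LengthPrefixedLog, append, readAll) are made up for illustration and are not RocketMQ APIs. The sketch assumes that totalSize counts its own 4 bytes, and it treats everything after the length field as an opaque body.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedLog {
    // Appends one record: a 4-byte totalSize (assumed to include the
    // 4 bytes of the length field itself) followed by the body.
    static void append(ByteBuffer log, byte[] body) {
        log.putInt(4 + body.length);
        log.put(body);
    }

    // Scans the buffer from offset 0 up to `end`, using totalSize to
    // locate the boundary of each record.
    static List<String> readAll(ByteBuffer log, int end) {
        List<String> bodies = new ArrayList<>();
        ByteBuffer view = log.duplicate();
        view.position(0);
        view.limit(end);
        while (view.remaining() >= 4) {
            int totalSize = view.getInt();
            byte[] body = new byte[totalSize - 4];
            view.get(body);
            bodies.add(new String(body, StandardCharsets.UTF_8));
        }
        return bodies;
    }

    public static void main(String[] args) {
        ByteBuffer log = ByteBuffer.allocate(64);
        append(log, "hello".getBytes(StandardCharsets.UTF_8));
        append(log, "mq".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(log, log.position())); // prints [hello, mq]
    }
}
```

Because every record starts with its own length, a reader can walk the file sequentially without any separate index.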
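2. Flush logic
Two concepts need to be distinguished when talking about flushing:
- flush: writes the data in the memory mapping, or the data in the file system cache, to disk. This is the operation that actually persists data.
- commit: submits data held in memory to the file system cache, i.e. a memory-to-cache write.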
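The two operations can be illustrated with plain Java NIO. This is only a sketch of the distinction, not RocketMQ code: the class CommitVsFlush and its methods are hypothetical, "commit" is modelled as FileChannel.write (data lands in the OS page cache) and "flush" as FileChannel.force (the page cache is pushed to the storage device).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CommitVsFlush {
    // "commit": copy bytes from an in-memory buffer into the file channel.
    // After this call the data is in the OS page cache, not necessarily on disk.
    static int commit(FileChannel channel, ByteBuffer data) throws IOException {
        int written = 0;
        while (data.hasRemaining()) {
            written += channel.write(data);
        }
        return written;
    }

    // "flush": ask the OS to push cached file content to the storage device.
    static void flush(FileChannel channel) throws IOException {
        channel.force(false);
    }

    // Writes 5 bytes to a temporary file, flushes, and returns the file size.
    static long demo() {
        try {
            Path file = Files.createTempFile("commitlog-sketch", ".bin");
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                commit(channel, ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));
                flush(channel);
                return Files.size(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("file size after commit + flush: " + demo());
    }
}
```

The commit step is cheap because it only touches memory; the flush step is the expensive one, which is why the services described below batch it.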
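In the message-sending flow, flushing takes one of several paths (see my other article, "RocketMQ Message Storage: The Storage Flow"):
Synchronous flush: the caller has to wait; handled by GroupCommitService.
Asynchronous flush: depends on whether transientStorePool is enabled:
- Disabled: handled by FlushRealTimeService.
- Enabled: handled by CommitRealTimeService together with FlushRealTimeService.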
java
//CommitLog#handleDiskFlush
public void handleDiskFlush(AppendMessageResult result, PutMessageResult putMessageResult, MessageExt messageExt) {
    // Synchronization flush
    if (FlushDiskType.SYNC_FLUSH == this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
        final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
        if (messageExt.isWaitStoreMsgOK()) {
            GroupCommitRequest request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
            service.putRequest(request);
            boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
            if (!flushOK) {
                // Handling of a failed or timed-out flush is omitted in this excerpt.
            }
        } else {
            service.wakeup();
        }
    }
    // Asynchronous flush
    else {
        if (!this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
            flushCommitLogService.wakeup();
        } else {
            commitLogService.wakeup();
        }
    }
}
The flushCommitLogService and commitLogService objects are created as follows:
java
//CommitLog constructor
if (FlushDiskType.SYNC_FLUSH == defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
    this.flushCommitLogService = new GroupCommitService();
} else {
    this.flushCommitLogService = new FlushRealTimeService();
}
this.commitLogService = new CommitRealTimeService();
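2.1 Synchronous flush
The logic lives in GroupCommitService. Each request is wrapped in a GroupCommitRequest, and the service's run method executes periodically, every 10 ms by default; if it is notified, it runs immediately. The actual work is done in doCommit.
Because a message may be in the next file, the flush is attempted at most twice. The flush itself is implemented by MappedFileQueue's flush method; CommitLog, ConsumeQueue, and Index all flush in a similar way.
The flush result is then reported back to the caller, which has been waiting in the meantime; internally this is implemented with a CountDownLatch.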
java
//CommitLog$GroupCommitService#doCommit
for (GroupCommitRequest req : this.requestsRead) {
    // There may be a message in the next file, so a maximum of
    // two times the flush
    boolean flushOK = false;
    for (int i = 0; i < 2 && !flushOK; i++) {
        flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();
        if (!flushOK) {
            CommitLog.this.mappedFileQueue.flush(0);
        }
    }
    req.wakeupCustomer(flushOK);
}
long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
if (storeTimestamp > 0) {
    CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
}
this.requestsRead.clear();
The caller waits by calling await on the CountDownLatch with a timeout. The default timeout for a synchronous flush is 5000 ms.
java
//GroupCommitRequest#waitForFlush
public boolean waitForFlush(long timeout) {
    try {
        this.countDownLatch.await(timeout, TimeUnit.MILLISECONDS);
        return this.flushOK;
    } catch (InterruptedException e) {
        log.error("Interrupted", e);
        return false;
    }
}
The result is delivered through the CountDownLatch's countDown method, which lets the caller return from await.
java
//GroupCommitRequest#wakeupCustomer
public void wakeupCustomer(final boolean flushOK) {
    this.flushOK = flushOK;
    this.countDownLatch.countDown();
}
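The wait/notify handshake between the sending thread and the flush thread can be reproduced in isolation. The following is a simplified sketch modelled on the two methods above; FlushHandshake and its nested Request class are hypothetical names, and the flush itself is replaced by a direct call to wakeup.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FlushHandshake {
    // A stand-in for GroupCommitRequest: one latch per request.
    static class Request {
        private final CountDownLatch latch = new CountDownLatch(1);
        private volatile boolean flushOK = false;

        // Called by the sender: blocks until the flusher answers or the timeout expires.
        boolean waitForFlush(long timeoutMillis) {
            try {
                latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
                return flushOK;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        // Called by the flusher: publishes the result and releases the sender.
        void wakeup(boolean ok) {
            this.flushOK = ok;
            latch.countDown();
        }
    }

    public static void main(String[] args) {
        Request answered = new Request();
        new Thread(() -> answered.wakeup(true)).start();
        System.out.println("answered request: " + answered.waitForFlush(1000));

        // Nobody answers this request, so the wait times out and reports false.
        Request ignored = new Request();
        System.out.println("ignored request: " + ignored.waitForFlush(50));
    }
}
```

Note that a timeout and a failed flush look the same to the caller: in both cases waitForFlush returns false.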
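2.2 Asynchronous flush
When TransientStorePool is not enabled, the asynchronous flush service is FlushRealTimeService. By default it flushes in real time: as soon as data is written, the thread is woken up to run the flush logic. It can also be switched to timed flushing via flushCommitLogTimed, in which case it runs at the interval set by flushIntervalCommitLog, 500 ms by default.
The actual flush only happens when one of the following conditions is met. This is the usual approach for batched operations: trigger on size or on time.
- At least 4 pages of data, i.e. 16 KB, are waiting to be flushed.
- More than 10 s have passed without a flush. The underlying flush requires 4 pages of data, which may not accumulate within a short period, so time also acts as a trigger.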
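The size-or-time rule can be written down as two small functions. This is a simplified sketch under stated assumptions, not the RocketMQ implementation: FlushTrigger and both method names are hypothetical, the page size is assumed to be 4 KB, and special cases such as a full file are ignored.

```java
public class FlushTrigger {
    static final int OS_PAGE_SIZE = 4 * 1024; // assumed page size: 4 KB

    // Size trigger: with leastPages > 0, flush only when at least that many
    // whole pages are dirty; with leastPages == 0, flush any dirty data.
    static boolean isAbleToFlush(int flushedPos, int wrotePos, int leastPages) {
        if (leastPages > 0) {
            return (wrotePos / OS_PAGE_SIZE) - (flushedPos / OS_PAGE_SIZE) >= leastPages;
        }
        return wrotePos > flushedPos;
    }

    // Time trigger: once the thorough interval has elapsed since the last
    // forced flush, drop the page requirement to 0 for this round.
    static int effectiveLeastPages(long nowMs, long lastFlushMs, long thoroughIntervalMs, int leastPages) {
        return nowMs >= lastFlushMs + thoroughIntervalMs ? 0 : leastPages;
    }

    public static void main(String[] args) {
        System.out.println(isAbleToFlush(0, 8 * 1024, 4));  // false: only 2 dirty pages
        System.out.println(isAbleToFlush(0, 16 * 1024, 4)); // true: 4 dirty pages
        int pages = effectiveLeastPages(10_000, 0, 10_000, 4);
        System.out.println(isAbleToFlush(0, 100, pages));   // true: 10 s elapsed, any dirty data qualifies
    }
}
```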
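The difference between real-time and timed flushing:
- Real-time flush: waits in waitForRunning, which can be woken up; the thread is woken whenever there is data to flush.
- Timed flush: waits in sleep, which, apart from thread interruption, only returns once the time has elapsed.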
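The behavioural difference comes down to whether the wait can be cut short. Below is a minimal sketch of a wakeable wait built on Object.wait and notifyAll; it stands in for the idea behind waitForRunning and wakeup and is not how RocketMQ's service thread is actually implemented. The class name WakeableWait is hypothetical.

```java
public class WakeableWait {
    private final Object lock = new Object();
    private boolean notified = false;

    // Called by a producer when there is new data to flush.
    void wakeup() {
        synchronized (lock) {
            notified = true;
            lock.notifyAll();
        }
    }

    // Waits up to intervalMs, but returns early if wakeup() was called.
    // Returns true when woken by wakeup(), false when the interval elapsed.
    boolean waitForRunning(long intervalMs) {
        synchronized (lock) {
            long deadline = System.nanoTime() + intervalMs * 1_000_000L;
            while (!notified) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    break;
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            boolean woken = notified;
            notified = false; // reset for the next round
            return woken;
        }
    }

    public static void main(String[] args) {
        WakeableWait w = new WakeableWait();
        w.wakeup();
        System.out.println(w.waitForRunning(1000)); // true: returns at once, a wakeup is pending
        System.out.println(w.waitForRunning(20));   // false: nobody called wakeup, interval elapsed
    }
}
```

With Thread.sleep there is no equivalent of wakeup, so a timed flush always pays the full interval.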
The actual flush is delegated to MappedFileQueue.
java
//FlushRealTimeService#run
//Defaults to false. Indicates whether to flush on a timer; false means real-time flush.
boolean flushCommitLogTimed = CommitLog.this.defaultMessageStore.getMessageStoreConfig().isFlushCommitLogTimed();
//Defaults to 500
int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushIntervalCommitLog();
//Defaults to 4
int flushPhysicQueueLeastPages = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushCommitLogLeastPages();
//Defaults to 1000 * 10
int flushPhysicQueueThoroughInterval =
    CommitLog.this.defaultMessageStore.getMessageStoreConfig().getFlushCommitLogThoroughInterval();
boolean printFlushProgress = false;
//If more than 10 s have passed since the last forced flush, force one now. Normally a flush has to meet a condition such as 4 dirty pages.
long currentTimeMillis = System.currentTimeMillis();
if (currentTimeMillis >= (this.lastFlushTimestamp + flushPhysicQueueThoroughInterval)) {
    this.lastFlushTimestamp = currentTimeMillis;
    flushPhysicQueueLeastPages = 0;
    printFlushProgress = (printTimes++ % 10) == 0;
}
try {
    //When flushCommitLogTimed is true, wait with sleep (timed flush); otherwise use waitForRunning, which can be woken up.
    if (flushCommitLogTimed) {
        Thread.sleep(interval);
    } else {
        this.waitForRunning(interval);
    }
    if (printFlushProgress) {
        this.printFlushProgress();
    }
    long begin = System.currentTimeMillis();
    //Delegate the flush to MappedFileQueue
    CommitLog.this.mappedFileQueue.flush(flushPhysicQueueLeastPages);
    long storeTimestamp = CommitLog.this.mappedFileQueue.getStoreTimestamp();
    //Update the checkpoint
    if (storeTimestamp > 0) {
        CommitLog.this.defaultMessageStore.getStoreCheckpoint().setPhysicMsgTimestamp(storeTimestamp);
    }
    long past = System.currentTimeMillis() - begin;
    if (past > 500) {
        log.info("Flush data to disk costs {} ms", past);
    }
} catch (Throwable e) {
    CommitLog.log.warn(this.getServiceName() + " service has exception. ", e);
    this.printFlushProgress();
}
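When TransientStorePool is enabled, the flow is different; it is analyzed in the TransientStorePool section below.
3. TransientStorePool mechanism
TransientStorePool is a transient (short-lived) storage pool. RocketMQ creates a separate pool of direct ByteBuffers to hold data temporarily: data is first written into one of these buffers, and a commit thread periodically copies it from there to the target file through its FileChannel. The main purpose of TransientStorePool is to improve write performance.
Writing through a memory-mapped file actually writes into the PageCache, which may involve contention on the PageCache. Writing straight into memory avoids that contention, so it can be faster in the asynchronous flush scenario.
In addition, the JNA library (com.sun.jna) is used to lock the memory so that it is not swapped out, which further improves write performance.
3.1 Initialization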
java
//DefaultMessageStore constructor
this.transientStorePool = new TransientStorePool(messageStoreConfig);
if (messageStoreConfig.isTransientStorePoolEnable()) {
    this.transientStorePool.init();
}
The TransientStorePool constructor sets the number of buffers in the pool and the size of each one.
java
//TransientStorePool constructor
public TransientStorePool(final MessageStoreConfig storeConfig) {
    this.storeConfig = storeConfig;
    //Defaults to 5
    this.poolSize = storeConfig.getTransientStorePoolSize();
    //Defaults to 1 GB
    this.fileSize = storeConfig.getMapedFileSizeCommitLog();
    this.availableBuffers = new ConcurrentLinkedDeque<>();
}
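TransientStorePool is only in effect when all of the following conditions hold:
- transientStorePoolEnable is turned on in the configuration.
- The flush mode is asynchronous.
- The broker is a master (its role is not SLAVE).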
java
//MessageStoreConfig#isTransientStorePoolEnable
public boolean isTransientStorePoolEnable() {
    return transientStorePoolEnable && FlushDiskType.ASYNC_FLUSH == getFlushDiskType()
        && BrokerRole.SLAVE != getBrokerRole();
}
By default, five 1 GB ByteBuffers are created, and the JNA library (com.sun.jna) is used to lock their memory so that it is not swapped out.
java
//TransientStorePool#init
public void init() {
    for (int i = 0; i < poolSize; i++) {
        ByteBuffer byteBuffer = ByteBuffer.allocateDirect(fileSize);
        final long address = ((DirectBuffer) byteBuffer).address();
        Pointer pointer = new Pointer(address);
        LibC.INSTANCE.mlock(pointer, new NativeLong(fileSize));
        availableBuffers.offer(byteBuffer);
    }
}
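The borrow-and-return cycle of such a pool can be sketched without JNA. The class below is a simplified, hypothetical stand-in (BufferPoolSketch, borrow, giveBack and remaining are not RocketMQ names): it pre-allocates direct buffers and hands them out from a deque, but it does not lock the memory with mlock, and the demo uses tiny buffers instead of 1 GB ones.

```java
import java.nio.ByteBuffer;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

public class BufferPoolSketch {
    private final Deque<ByteBuffer> available = new ConcurrentLinkedDeque<>();

    // Pre-allocates poolSize direct (off-heap) buffers of bufferSize bytes each.
    BufferPoolSketch(int poolSize, int bufferSize) {
        for (int i = 0; i < poolSize; i++) {
            available.offer(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Returns a buffer, or null when the pool is exhausted.
    ByteBuffer borrow() {
        return available.pollFirst();
    }

    // Resets position and limit, then puts the buffer back for reuse.
    void giveBack(ByteBuffer buffer) {
        buffer.clear();
        available.offerFirst(buffer);
    }

    int remaining() {
        return available.size();
    }

    public static void main(String[] args) {
        BufferPoolSketch pool = new BufferPoolSketch(2, 1024);
        ByteBuffer first = pool.borrow();
        ByteBuffer second = pool.borrow();
        System.out.println(pool.borrow() == null); // true: both buffers are in use
        pool.giveBack(first);
        System.out.println(pool.remaining());      // 1
    }
}
```

Returning null on exhaustion, rather than blocking, matches the remark in the next section that the borrowed buffer may be null.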
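3.2 Writing data
When a MappedFile is initialized, it borrows a ByteBuffer from the TransientStorePool; the borrowed ByteBuffer may of course be null. It is stored in a field named writeBuffer, which is used in many places later on.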
java
//MappedFile#init
public void init(final String fileName, final int fileSize,
    final TransientStorePool transientStorePool) throws IOException {
    init(fileName, fileSize);
    this.writeBuffer = transientStorePool.borrowBuffer();
    this.transientStorePool = transientStorePool;
}
The file's FileChannel is also created, named fileChannel, together with the file's memory mapping, a MappedByteBuffer named mappedByteBuffer.
java
//MappedFile#init
this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
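While a message is being written to the CommitLog, the code checks whether writeBuffer is null; if it is not, the message is written into writeBuffer. wrotePosition records how much has been written so far.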
java
//MappedFile#appendMessagesInner
ByteBuffer byteBuffer = writeBuffer != null ? writeBuffer.slice() : this.mappedByteBuffer.slice();
//...
this.wrotePosition.addAndGet(result.getWroteBytes());
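At this point the data has been written, but only to memory; a few more steps are needed before it reaches the disk.
3.3 Flushing
When TransientStorePool is enabled, CommitRealTimeService writes the data held in writeBuffer to the file through fileChannel, and FlushRealTimeService then writes it to disk.
By default a commit is performed every 200 ms or once 4 pages have accumulated; the commit itself is also delegated to MappedFileQueue.
When the returned result indicates that data was committed, FlushRealTimeService is woken up to perform the flush.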
java
//CommitRealTimeService#run
//Defaults to 200
int interval = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitIntervalCommitLog();
//Defaults to 4
int commitDataLeastPages = CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitCommitLogLeastPages();
//Defaults to 200
int commitDataThoroughInterval =
    CommitLog.this.defaultMessageStore.getMessageStoreConfig().getCommitCommitLogThoroughInterval();
long begin = System.currentTimeMillis();
if (begin >= (this.lastCommitTimestamp + commitDataThoroughInterval)) {
    this.lastCommitTimestamp = begin;
    commitDataLeastPages = 0;
}
try {
    boolean result = CommitLog.this.mappedFileQueue.commit(commitDataLeastPages);
    long end = System.currentTimeMillis();
    if (!result) {
        this.lastCommitTimestamp = end; // result = false means some data committed.
        //now wake up flush thread.
        flushCommitLogService.wakeup();
    }
    if (end - begin > 500) {
        log.info("Commit data to file costs {} ms", end - begin);
    }
    this.waitForRunning(interval);
} catch (Throwable e) {
    CommitLog.log.error(this.getServiceName() + " service has exception. ", e);
}
MappedFileQueue finds the right MappedFile, calls commit on it, and then records the committed position. Whenever data was committed, the returned result is false.
java
//MappedFileQueue#commit
public boolean commit(final int commitLeastPages) {
    boolean result = true;
    MappedFile mappedFile = this.findMappedFileByOffset(this.committedWhere, this.committedWhere == 0);
    if (mappedFile != null) {
        int offset = mappedFile.commit(commitLeastPages);
        long where = mappedFile.getFileFromOffset() + offset;
        result = where == this.committedWhere;
        this.committedWhere = where;
    }
    return result;
}
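If writeBuffer is null, the current write position is returned directly. Otherwise the method checks whether a commit is needed, mainly by testing whether the number of uncommitted pages has reached the requested minimum; if so, it calls commit0. Once the whole writeBuffer has been committed, it is returned to the TransientStorePool for reuse.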
java
//MappedFile#commit
public int commit(final int commitLeastPages) {
    if (writeBuffer == null) {
        //no need to commit data to file channel, so just regard wrotePosition as committedPosition.
        return this.wrotePosition.get();
    }
    if (this.isAbleToCommit(commitLeastPages)) {
        if (this.hold()) {
            commit0(commitLeastPages);
            this.release();
        } else {
            log.warn("in commit, hold failed, commit offset = " + this.committedPosition.get());
        }
    }
    // All dirty data has been committed to FileChannel.
    if (writeBuffer != null && this.transientStorePool != null && this.fileSize == this.committedPosition.get()) {
        this.transientStorePool.returnBuffer(writeBuffer);
        this.writeBuffer = null;
    }
    return this.committedPosition.get();
}
commit0 checks that the write position is ahead of the committed position and then writes that uncommitted range of the buffer through the FileChannel.
java
//MappedFile#commit0
protected void commit0(final int commitLeastPages) {
    int writePos = this.wrotePosition.get();
    int lastCommittedPosition = this.committedPosition.get();
    if (writePos - this.committedPosition.get() > 0) {
        try {
            ByteBuffer byteBuffer = writeBuffer.slice();
            byteBuffer.position(lastCommittedPosition);
            byteBuffer.limit(writePos);
            this.fileChannel.position(lastCommittedPosition);
            this.fileChannel.write(byteBuffer);
            this.committedPosition.set(writePos);
        } catch (Throwable e) {
            log.error("Error occurred when commit data to FileChannel.", e);
        }
    }
}
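The incremental commit can be tried out on a temporary file. The sketch below is a standalone approximation of the idea in commit0, not the RocketMQ code: Commit0Sketch, commitRange and demo are hypothetical names, the buffer is 32 bytes instead of 1 GB, and the positions are passed in as plain ints rather than kept in atomic counters.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class Commit0Sketch {
    // Writes the range [committedPos, wrotePos) of writeBuffer to the channel
    // at the same file offset and returns the new committed position.
    static int commitRange(ByteBuffer writeBuffer, FileChannel channel,
                           int committedPos, int wrotePos) throws IOException {
        if (wrotePos <= committedPos) {
            return committedPos; // nothing new to commit
        }
        ByteBuffer view = writeBuffer.duplicate(); // independent position and limit
        view.clear();
        view.limit(wrotePos);
        view.position(committedPos);
        channel.position(committedPos);
        while (view.hasRemaining()) {
            channel.write(view);
        }
        return wrotePos;
    }

    // Commits "hello " and then "world" in two rounds; returns the file content.
    static String demo() {
        try {
            Path file = Files.createTempFile("commit0-sketch", ".bin");
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                ByteBuffer writeBuffer = ByteBuffer.allocateDirect(32);
                writeBuffer.put("hello ".getBytes(StandardCharsets.UTF_8));
                int committed = commitRange(writeBuffer, channel, 0, writeBuffer.position());
                writeBuffer.put("world".getBytes(StandardCharsets.UTF_8));
                commitRange(writeBuffer, channel, committed, writeBuffer.position());
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints hello world
    }
}
```

Each round only transfers the bytes written since the previous commit, which is what keeps the commit cheap when it runs every few hundred milliseconds.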
4. References
- zhuanlan.zhihu.com/p/360912438
- RocketMQ source code, 4.4.0 branch
- 《RocketMQ技术内幕》 (book)