RocketMQ Message Storage: Index

RocketMQ maintains an index so that messages can be looked up by key. This article covers the structure of the Index file, how Index data is built, and how message queries use it.

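1. File structure

An Index file consists of three parts: the IndexHead, the hash slots, and the Index entries. The file stores them in that order, and each part is described below.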

1.1 IndexHead

The IndexHead is a fixed 40 bytes:

beginTimestamp (8 bytes) | endTimestamp (8 bytes) | beginPhyoffset (8 bytes) | endPhyoffset (8 bytes) |
slotCount (4 bytes) | indexCount (4 bytes)

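beginTimestamp: the earliest store time among the messages covered by this index file.

endTimestamp: the latest store time among the messages covered by this index file.

beginPhyoffset: the smallest physical offset among the messages covered by this index file. The physical offset is the offset in the commitlog.

endPhyoffset: the largest physical offset among the messages covered by this index file, again as a commitlog offset.

slotCount: the hash slot count. It is not the number of distinct hash slots in use: as the putKey code in section 2.1 shows, it is incremented on every write.

indexCount: the number of Index entries used so far. Entries are stored sequentially in the entry list.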
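
The header layout above maps directly onto fixed byte offsets. The following is a minimal sketch of reading and writing those six fields with a plain ByteBuffer; the class name IndexHeadSketch and its methods are hypothetical and are not RocketMQ's own IndexHeader class.

```java
import java.nio.ByteBuffer;

/** Hypothetical holder for the six header fields; not RocketMQ's IndexHeader. */
final class IndexHeadSketch {
    static final int HEADER_SIZE = 40;

    final long beginTimestamp;
    final long endTimestamp;
    final long beginPhyOffset;
    final long endPhyOffset;
    final int slotCount;
    final int indexCount;

    IndexHeadSketch(long beginTimestamp, long endTimestamp, long beginPhyOffset,
                    long endPhyOffset, int slotCount, int indexCount) {
        this.beginTimestamp = beginTimestamp;
        this.endTimestamp = endTimestamp;
        this.beginPhyOffset = beginPhyOffset;
        this.endPhyOffset = endPhyOffset;
        this.slotCount = slotCount;
        this.indexCount = indexCount;
    }

    /** Writes the fields at the byte offsets implied by the 40-byte layout. */
    void writeTo(ByteBuffer buf) {
        buf.putLong(0, beginTimestamp);
        buf.putLong(8, endTimestamp);
        buf.putLong(16, beginPhyOffset);
        buf.putLong(24, endPhyOffset);
        buf.putInt(32, slotCount);
        buf.putInt(36, indexCount);
    }

    /** Reads the fields back from the same offsets. */
    static IndexHeadSketch readFrom(ByteBuffer buf) {
        return new IndexHeadSketch(buf.getLong(0), buf.getLong(8), buf.getLong(16),
                buf.getLong(24), buf.getInt(32), buf.getInt(36));
    }
}
```

Absolute get/put calls are used so that the buffer's position is left untouched, which keeps the sketch safe to call on a shared buffer.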

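1.2 Hash slots

By default an IndexFile contains 5 million hash slots, each 4 bytes. A hash slot does not hold index data itself: it stores the position (the entry number) of the most recent Index entry whose key hashed to that slot. This indirection exists to resolve hash collisions, because entries that land in the same slot can then be chained together.

1.3 Index entries

By default an IndexFile contains 20 million entries, each 20 bytes:

hashcode (4 bytes) | phyoffset (8 bytes) | timedif (4 bytes) | preIndexNo (4 bytes)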

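hashcode: the hash code of the key. Keys include both ordinary keys and the unique key.

phyoffset: the physical offset, i.e. the commitlog offset, which is used to locate the message in the commitlog.

timedif: the difference between the message's store time and the timestamp of the first message in the file. A value below 0 marks the entry as invalid.

preIndexNo: the entry number of the previous entry that fell into the same hash slot. On a hash collision these links form a linked list.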
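
As a worked example of the sizes quoted above, the sketch below computes the total file size for the default slot and entry counts and writes one 20-byte entry at its field offsets. IndexLayoutSketch and its method names are hypothetical helpers introduced only for illustration.

```java
import java.nio.ByteBuffer;

final class IndexLayoutSketch {
    static final int HEADER_SIZE = 40;
    static final int SLOT_SIZE = 4;
    static final int ENTRY_SIZE = 20;

    /** Total file size in bytes: header + all slots + all entries. */
    static long fileSize(int slotNum, int entryNum) {
        return HEADER_SIZE + (long) slotNum * SLOT_SIZE + (long) entryNum * ENTRY_SIZE;
    }

    /** Writes one entry starting at byte position pos. */
    static void writeEntry(ByteBuffer buf, int pos, int hashcode, long phyOffset,
                           int timeDiff, int preIndexNo) {
        buf.putInt(pos, hashcode);        // bytes 0..3
        buf.putLong(pos + 4, phyOffset);  // bytes 4..11
        buf.putInt(pos + 12, timeDiff);   // bytes 12..15
        buf.putInt(pos + 16, preIndexNo); // bytes 16..19
    }

    public static void main(String[] args) {
        // Defaults quoted in the text: 5 million slots, 20 million entries.
        System.out.println(fileSize(5_000_000, 20_000_000)); // prints 420000040
    }
}
```

With the defaults, one Index file is therefore about 420 MB: 40 bytes of header, 20,000,000 bytes of slots, and 400,000,000 bytes of entries.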

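Resolving hash collisions relies on preIndexNo and the hash slot value working together.

The hash slot value records the Index entry count at the time of the write, which is the number of the newest entry in that slot.
preIndexNo records the value the hash slot held before the write, i.e. the number of the previous entry in the same slot.

This is easier to follow alongside the source code for building Index data and for querying messages by key, both covered in the sections that follow.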
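
Before turning to the source, the interplay between the slot value and preIndexNo can be modelled in memory with a few arrays. This is a toy sketch under stated assumptions: the class SlotChainSketch is hypothetical, key hashes are assumed to be non-negative, entry numbers start at 1 so that 0 can mean "no entry", and there is no capacity check.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory toy model of the slot + preIndexNo chain; not RocketMQ code. */
final class SlotChainSketch {
    private final int[] slots;       // slot -> number of the newest entry, 0 = empty
    private final int[] hashes;      // entry number -> key hash
    private final long[] offsets;    // entry number -> commitlog offset
    private final int[] preIndexNo;  // entry number -> previous entry in the same slot
    private int indexCount = 1;      // next entry number; 0 is reserved for "none"

    SlotChainSketch(int slotNum, int entryNum) {
        slots = new int[slotNum];
        hashes = new int[entryNum + 1];
        offsets = new long[entryNum + 1];
        preIndexNo = new int[entryNum + 1];
    }

    /** Appends an entry and makes it the head of its slot's chain. */
    void put(int keyHash, long phyOffset) {
        int slot = keyHash % slots.length;
        hashes[indexCount] = keyHash;
        offsets[indexCount] = phyOffset;
        preIndexNo[indexCount] = slots[slot]; // remember the old head
        slots[slot] = indexCount;             // the new entry becomes the head
        indexCount++;
    }

    /** Walks the chain from newest to oldest and collects matching offsets. */
    List<Long> find(int keyHash) {
        List<Long> result = new ArrayList<>();
        for (int n = slots[keyHash % slots.length]; n != 0; n = preIndexNo[n]) {
            if (hashes[n] == keyHash) {
                result.add(offsets[n]);
            }
        }
        return result;
    }
}
```

With 4 slots, hashes 1 and 5 share a slot, so after put(1, 100), put(5, 200), put(1, 300) the chain is entry 3, then entry 2, then entry 1, and find(1) skips the entry for hash 5 while walking it.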

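2. Building the data

2.1 Writing data

Inside DefaultMessageStore, the inner class ReputMessageService forwards CommitLog messages to the ConsumeQueue and the Index in near real time. The build therefore starts in ReputMessageService, which hands each message to the CommitLogDispatcher implementations. The Index is handled by CommitLogDispatcherBuildIndex, which in turn calls IndexService's buildIndex method.

buildIndex first obtains an IndexFile, which may involve creating a new one (for example when the previous file is full). If the message's CommitLog offset is smaller than the last offset recorded in the IndexFile, the message is skipped so that Index data is not built twice.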

//IndexService#buildIndex
IndexFile indexFile = retryGetAndCreateIndexFile();
if (indexFile != null) {
    long endPhyOffset = indexFile.getEndPhyOffset();
    DispatchRequest msg = req;
    String topic = msg.getTopic();
    String keys = msg.getKeys();
    if (msg.getCommitLogOffset() < endPhyOffset) {
        return;
    }
}

Obtaining the IndexFile may create a new file. A new file is named after the current time.

//getAndCreateLastIndexFile
String fileName =
    this.storePath + File.separator
        + UtilAll.timeMillisToHumanString(System.currentTimeMillis());
indexFile =
    new IndexFile(fileName, this.hashSlotNum, this.indexNum, lastUpdateEndPhyOffset,
        lastUpdateIndexTimestamp);
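
To illustrate time-based file naming, here is a hedged sketch that renders a millisecond timestamp as a 17-digit string. The pattern yyyyMMddHHmmssSSS is an assumption about what UtilAll.timeMillisToHumanString produces, and IndexFileNameSketch is a hypothetical helper; the time zone is passed in explicitly so the result is reproducible.

```java
import java.io.File;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

final class IndexFileNameSketch {
    // Assumed format: year, month, day, hour, minute, second, millisecond.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    /** Builds storePath + separator + formatted timestamp. */
    static String fileName(String storePath, long millis, ZoneId zone) {
        return storePath + File.separator
                + FMT.format(Instant.ofEpochMilli(millis).atZone(zone));
    }
}
```

Because the name sorts lexicographically in creation order, listing the directory by name also lists the files from oldest to newest.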

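Index data is built separately for the message's UniqKey and for each of its Keys. Both paths call IndexService's putKey, which in turn calls IndexFile's putKey to write the Index data. Note that the key is stored in the form topic#key.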

//IndexService#buildIndex
if (req.getUniqKey() != null) {
    indexFile = putKey(indexFile, msg, buildKey(topic, req.getUniqKey()));
    if (indexFile == null) {
        return;
    }
}

if (keys != null && keys.length() > 0) {
    String[] keyset = keys.split(MessageConst.KEY_SEPARATOR);
    for (int i = 0; i < keyset.length; i++) {
        String key = keyset[i];
        if (key.length() > 0) {
            indexFile = putKey(indexFile, msg, buildKey(topic, key));
            if (indexFile == null) {
                return;
            }
        }
    }
}
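
The key handling in the excerpt above can be reproduced in isolation. In this sketch, IndexKeySketch is a hypothetical class, and the single-space separator is an assumption standing in for MessageConst.KEY_SEPARATOR.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexKeySketch {
    // Assumed value of the separator between multiple keys.
    static final String KEY_SEPARATOR = " ";

    /** Index keys are stored as topic#key. */
    static String buildKey(String topic, String key) {
        return topic + "#" + key;
    }

    /** Returns every index key that would be written for one message. */
    static List<String> indexKeys(String topic, String uniqKey, String keys) {
        List<String> result = new ArrayList<>();
        if (uniqKey != null) {
            result.add(buildKey(topic, uniqKey));
        }
        if (keys != null && keys.length() > 0) {
            for (String key : keys.split(KEY_SEPARATOR)) {
                if (key.length() > 0) { // skip empty pieces from repeated separators
                    result.add(buildKey(topic, key));
                }
            }
        }
        return result;
    }
}
```

Prefixing the topic means the same business key used in two topics produces two different index keys, so a query for one topic does not match the other's entries unless their hashes happen to collide.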

IndexFile's putKey first checks whether the file has reached its Index entry limit. If it has, putKey returns false and leaves the retry to the caller.

//IndexFile#putKey
this.indexHeader.getIndexCount() < this.indexNum

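keyHash is the hash of the key: it is the String hashCode, made non-negative with Math.abs if it is negative.

slotPos: the position of the key's hash slot, obtained as the key's hash modulo the total number of slots.

absSlotPos: the byte position of that slot in the file, i.e. the IndexHeader size plus the slot position multiplied by the slot size.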

//IndexFile#putKey
int keyHash = indexKeyHashMethod(key);
int slotPos = keyHash % this.hashSlotNum;
int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;
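
The three values above can be computed outside RocketMQ as follows. This is a sketch, assuming a 40-byte header and 4-byte slots as described in section 1; SlotPositionSketch is a hypothetical class, and the guard for Integer.MIN_VALUE is included because Math.abs cannot make that one value positive.

```java
final class SlotPositionSketch {
    static final int HEADER_SIZE = 40;
    static final int SLOT_SIZE = 4;

    /** Non-negative hash of the key, based on String.hashCode(). */
    static int keyHash(String key) {
        int h = Math.abs(key.hashCode());
        // Math.abs(Integer.MIN_VALUE) is still negative, so map that case to 0.
        return h < 0 ? 0 : h;
    }

    /** Byte position of the key's slot within the file. */
    static int absSlotPos(String key, int slotNum) {
        int slotPos = keyHash(key) % slotNum;
        return HEADER_SIZE + slotPos * SLOT_SIZE;
    }
}
```

For example, "a".hashCode() is 97, so with 10 slots the key falls into slot 7 and its slot starts at byte 40 + 7 * 4 = 68.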

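The value at that slot position is then read. If it is less than or equal to 0, or greater than the current Index entry count, it is reset to 0. This step prepares for hash collisions: a valid value is the number of the previous entry that landed in the same hash slot.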

//IndexFile#putKey
int slotValue = this.mappedByteBuffer.getInt(absSlotPos);
if (slotValue <= invalidIndex || slotValue > this.indexHeader.getIndexCount()) {
    slotValue = invalidIndex;
}

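timeDiff is the difference between the current message's store time and the begin timestamp of the Index file. The entry holds it in seconds (the query code in section 3 multiplies it by 1000 when reading it back); the excerpt below leaves out that conversion.

absIndexPos is the byte position at which the Index entry is written: the header size, plus the total size of all slots, plus the current Index entry count multiplied by the entry size.

Finally the entry is written into the IndexFile. Note that the last int written is slotValue: on a hash collision it is the number of the previous entry in the same slot, not a byte position.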

//IndexFile#putKey
long timeDiff = storeTimestamp - this.indexHeader.getBeginTimestamp();
int absIndexPos = IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
                        + this.indexHeader.getIndexCount() * indexSize;

this.mappedByteBuffer.putInt(absIndexPos, keyHash);
this.mappedByteBuffer.putLong(absIndexPos + 4, phyOffset);
this.mappedByteBuffer.putInt(absIndexPos + 4 + 8, (int) timeDiff);
this.mappedByteBuffer.putInt(absIndexPos + 4 + 8 + 4, slotValue);
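
The position arithmetic and the seconds conversion can be checked with small numbers. In this sketch EntryPositionSketch is a hypothetical class, and the clamping of the time difference to the range 0..Integer.MAX_VALUE is an assumption about how out-of-range values are handled.

```java
final class EntryPositionSketch {
    static final int HEADER_SIZE = 40;
    static final int SLOT_SIZE = 4;
    static final int ENTRY_SIZE = 20;

    /** Byte position of entry number indexCount. */
    static int absIndexPos(int slotNum, int indexCount) {
        return HEADER_SIZE + slotNum * SLOT_SIZE + indexCount * ENTRY_SIZE;
    }

    /** Time difference in whole seconds, clamped so it fits a non-negative int. */
    static int timeDiffSeconds(long storeTimestamp, long beginTimestamp) {
        long diff = (storeTimestamp - beginTimestamp) / 1000;
        if (beginTimestamp <= 0 || diff < 0) {
            return 0; // no usable begin timestamp, or a message older than the file
        }
        return (int) Math.min(diff, Integer.MAX_VALUE);
    }
}
```

Storing seconds in a 4-byte int keeps the entry at 20 bytes, at the cost of sub-second precision in time-range filtering.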

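The current Index entry count, which is the number of the entry just written, is then stored in the key's hash slot. For the first message, the begin physical offset and begin store time are recorded as well. The IndexHeader also increments its slot count and Index entry count, and records the end physical offset and end store time.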

//IndexFile#putKey
this.mappedByteBuffer.putInt(absSlotPos, this.indexHeader.getIndexCount());

if (this.indexHeader.getIndexCount() <= 1) {
    this.indexHeader.setBeginPhyOffset(phyOffset);
    this.indexHeader.setBeginTimestamp(storeTimestamp);
}

this.indexHeader.incHashSlotCount();
this.indexHeader.incIndexCount();
this.indexHeader.setEndPhyOffset(phyOffset);
this.indexHeader.setEndTimestamp(storeTimestamp);

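2.2 Flushing

Unlike CommitLog and ConsumeQueue, which are flushed on a timer, an Index file is flushed by a newly created thread once the file is full. This path is reached only when the previous file is full, and the flush goes through IndexService's flush method.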

//IndexService#getAndCreateLastIndexFile
final IndexFile flushThisFile = prevIndexFile;
Thread flushThread = new Thread(new Runnable() {
    @Override
    public void run() {
        IndexService.this.flush(flushThisFile);
    }
}, "FlushIndexFileThread");

flushThread.setDaemon(true);
flushThread.start();

In flush, if the file is full (always the case on this path), the Index write time is recorded in the CheckPoint file and the CheckPoint file is flushed. IndexFile's own flush method is called to do the actual flush.

//IndexService#flush
public void flush(final IndexFile f) {
    if (null == f)
        return;

    long indexMsgTimestamp = 0;

    if (f.isWriteFull()) {
        indexMsgTimestamp = f.getEndTimestamp();
    }

    f.flush();

    if (indexMsgTimestamp > 0) {
        this.defaultMessageStore.getStoreCheckpoint().setIndexMsgTimestamp(indexMsgTimestamp);
        this.defaultMessageStore.getStoreCheckpoint().flush();
    }
}

IndexFile's flush first writes the IndexHeader fields into the buffer and then forces the memory-mapped data to disk.

//IndexFile#flush
public void flush() {
    long beginTime = System.currentTimeMillis();
    if (this.mappedFile.hold()) {
        this.indexHeader.updateByteBuffer();
        this.mappedByteBuffer.force();
        this.mappedFile.release();
    }
}
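
The same update-then-force pattern can be tried with the JDK alone. The sketch below maps a temporary 40-byte file, writes one header field into the mapping, and calls force(); MappedFlushSketch is a hypothetical class and does not use RocketMQ's MappedFile.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedFlushSketch {
    static final int HEADER_SIZE = 40;

    /** Writes endTimestamp through a mapping, forces it, and reads it back. */
    static long roundTrip(long endTimestamp) {
        try {
            Path file = Files.createTempFile("index-sketch", ".tmp");
            file.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping READ_WRITE beyond the current size grows the file.
                MappedByteBuffer mapped =
                        ch.map(FileChannel.MapMode.READ_WRITE, 0, HEADER_SIZE);
                mapped.putLong(8, endTimestamp); // endTimestamp sits at byte 8
                mapped.force();                  // ask the OS to write the mapped pages out
            }
            // Reading back shows the write is visible through the file;
            // it does not by itself prove durability.
            return ByteBuffer.wrap(Files.readAllBytes(file)).getLong(8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without force(), the operating system decides when dirty pages of the mapping reach the disk, which is why the header is copied into the buffer first and forced together with the rest of the file.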

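3. Querying messages

Messages can be queried by UniqKey or by Key. On the broker the entry point is QueryMessageProcessor, with request code RequestCode.QUERY_MESSAGE. The request ends up in DefaultMessageStore's queryMessage method, which looks up physical offsets through IndexService and then reads the corresponding messages from the CommitLog.

A query normally carries a time range, i.e. begin and end parameters. IndexFiles are traversed from the newest file backwards. For each file, the requested range is compared with the file's begin and end timestamps: the file is searched if the requested range covers the whole file, or if begin or end falls within the file's range.

If a file's begin timestamp is earlier than the requested begin time, no older file can match, so the traversal stops there.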

//IndexService#queryOffset
for (int i = this.indexFileList.size(); i > 0; i--) {
    IndexFile f = this.indexFileList.get(i - 1);
    boolean lastFile = i == this.indexFileList.size();
    if (lastFile) {
        indexLastUpdateTimestamp = f.getEndTimestamp();
        indexLastUpdatePhyoffset = f.getEndPhyOffset();
    }

    if (f.isTimeMatched(begin, end)) {

        f.selectPhyOffset(phyOffsets, buildKey(topic, key), maxNum, begin, end, lastFile);
    }

    if (f.getBeginTimestamp() < begin) {
        break;
    }

    if (phyOffsets.size() >= maxNum) {
        break;
    }
}
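
The per-file time check can be sketched as an interval-overlap test. This is a reconstruction from the description of isTimeMatched above rather than a copy of the source; TimeMatchSketch and its parameter names are hypothetical.

```java
final class TimeMatchSketch {
    /**
     * True if the query range [begin, end] overlaps the file's range
     * [fileBegin, fileEnd] in one of the three ways described in the text.
     */
    static boolean isTimeMatched(long fileBegin, long fileEnd, long begin, long end) {
        boolean queryCoversFile = begin < fileBegin && end > fileEnd;
        boolean beginInsideFile = begin >= fileBegin && begin <= fileEnd;
        boolean endInsideFile = end >= fileBegin && end <= fileEnd;
        return queryCoversFile || beginInsideFile || endInsideFile;
    }
}
```

For a file covering 100..200, the queries 50..250, 150..300, and 50..150 all match, while 201..300 and 0..99 do not.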

As when building the Index, the positions are computed from the key.

//IndexFile#selectPhyOffset
int keyHash = indexKeyHashMethod(key);
int slotPos = keyHash % this.hashSlotNum;
int absSlotPos = IndexHeader.INDEX_HEADER_SIZE + slotPos * hashSlotSize;

int slotValue = this.mappedByteBuffer.getInt(absSlotPos);

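The value stored in the key's hash slot leads to an Index entry, from which the key hash, the physical offset, the time difference, and the previous entry number are read.

If the hash of the requested key equals the stored key hash and the time falls within the requested range, the offset is added to the result list.

The loop stops if the previous entry number is 0, is greater than the entry count, or equals the current entry number, or if the current entry's time is already earlier than begin. Otherwise the search continues at the previous entry.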

//IndexFile#selectPhyOffset
for (int nextIndexToRead = slotValue; ; ) {
    if (phyOffsets.size() >= maxNum) {
        break;
    }

    int absIndexPos =
        IndexHeader.INDEX_HEADER_SIZE + this.hashSlotNum * hashSlotSize
            + nextIndexToRead * indexSize;

    int keyHashRead = this.mappedByteBuffer.getInt(absIndexPos);
    long phyOffsetRead = this.mappedByteBuffer.getLong(absIndexPos + 4);

    long timeDiff = (long) this.mappedByteBuffer.getInt(absIndexPos + 4 + 8);
    int prevIndexRead = this.mappedByteBuffer.getInt(absIndexPos + 4 + 8 + 4);

    if (timeDiff < 0) {
        break;
    }

    timeDiff *= 1000L;

    long timeRead = this.indexHeader.getBeginTimestamp() + timeDiff;
    boolean timeMatched = (timeRead >= begin) && (timeRead <= end);

    if (keyHash == keyHashRead && timeMatched) {
        phyOffsets.add(phyOffsetRead);
    }

    if (prevIndexRead <= invalidIndex
        || prevIndexRead > this.indexHeader.getIndexCount()
        || prevIndexRead == nextIndexToRead || timeRead < begin) {
        break;
    }

    nextIndexToRead = prevIndexRead;
}
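
To tie the write and query paths together, here is a scaled-down sketch of an index file held in a heap ByteBuffer. MiniIndexFile is a hypothetical class with simplified validity checks and no header bookkeeping; it only aims to show that entries written with the slot-chaining scheme can be read back newest first.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class MiniIndexFile {
    static final int HEADER_SIZE = 40, SLOT_SIZE = 4, ENTRY_SIZE = 20;
    private final int slotNum;
    private final int entryNum;
    private final long beginTimestamp;
    private final ByteBuffer buf;
    private int indexCount = 1; // entry 0 is never used, so slot value 0 means "empty"

    MiniIndexFile(int slotNum, int entryNum, long beginTimestamp) {
        this.slotNum = slotNum;
        this.entryNum = entryNum;
        this.beginTimestamp = beginTimestamp;
        this.buf = ByteBuffer.allocate(
                HEADER_SIZE + slotNum * SLOT_SIZE + (entryNum + 1) * ENTRY_SIZE);
    }

    private static int hash(String key) {
        int h = Math.abs(key.hashCode());
        return h < 0 ? 0 : h;
    }

    private int slotPos(int keyHash) {
        return HEADER_SIZE + (keyHash % slotNum) * SLOT_SIZE;
    }

    private int entryPos(int n) {
        return HEADER_SIZE + slotNum * SLOT_SIZE + n * ENTRY_SIZE;
    }

    /** Appends an entry; returns false when the file is full. */
    boolean putKey(String key, long phyOffset, long storeTimestamp) {
        if (indexCount > entryNum) {
            return false;
        }
        int keyHash = hash(key);
        int pos = entryPos(indexCount);
        buf.putInt(pos, keyHash);
        buf.putLong(pos + 4, phyOffset);
        buf.putInt(pos + 12, (int) ((storeTimestamp - beginTimestamp) / 1000));
        buf.putInt(pos + 16, buf.getInt(slotPos(keyHash))); // link to the old head
        buf.putInt(slotPos(keyHash), indexCount++);         // new entry becomes head
        return true;
    }

    /** Walks the slot's chain from newest to oldest, filtering by hash and time. */
    List<Long> selectPhyOffset(String key, int maxNum, long begin, long end) {
        List<Long> result = new ArrayList<>();
        int keyHash = hash(key);
        int next = buf.getInt(slotPos(keyHash));
        while (next > 0 && next < indexCount && result.size() < maxNum) {
            int pos = entryPos(next);
            long timeRead = beginTimestamp + buf.getInt(pos + 12) * 1000L;
            if (buf.getInt(pos) == keyHash && timeRead >= begin && timeRead <= end) {
                result.add(buf.getLong(pos + 4));
            }
            int prev = buf.getInt(pos + 16);
            if (prev == next || timeRead < begin) {
                break; // self-link, or everything older is out of range
            }
            next = prev;
        }
        return result;
    }
}
```

Only the hash is compared, so two different keys with the same hash code would both be returned; a caller that needs exact matches has to check the key on the message it reads from the commitlog.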

4. References

RocketMQ source code, 4.4.0 branch
《RocketMQ技术内幕》