/** Writes the top count entries in pending, using prevTerm to compute the prefix. */
void writeBlocks(int prefixLength, int count) throws IOException {
assert count > 0;
//if (DEBUG2) {
// BytesRef br = new BytesRef(lastTerm.bytes());
// br.length = prefixLength;
// System.out.println("writeBlocks: seg=" + segment + " prefix=" + brToString(br) + " count=" + count);
//}
// Root block better write all remaining pending entries:
assert prefixLength > 0 || count == pending.size();
int lastSuffixLeadLabel = -1;
// True if we saw at least one term in this block (we record if a block
// only points to sub-blocks in the terms index so we can avoid seeking
// to it when we are looking for a term):
boolean hasTerms = false;
boolean hasSubBlocks = false;
int start = pending.size()-count;
int end = pending.size();
int nextBlockStart = start;
int nextFloorLeadLabel = -1;
for (int i=start; i<end; i++) {
PendingEntry ent = pending.get(i);
int suffixLeadLabel;
if (ent.isTerm) {
PendingTerm term = (PendingTerm) ent;
if (term.termBytes.length == prefixLength) {
// Suffix is 0, i.e. prefix 'foo' and term is 'foo', so this term has an empty suffix in this block:
assert lastSuffixLeadLabel == -1;
suffixLeadLabel = -1;
} else {
suffixLeadLabel = term.termBytes[prefixLength] & 0xff;
}
} else {
PendingBlock block = (PendingBlock) ent;
assert block.prefix.length > prefixLength;
suffixLeadLabel = block.prefix.bytes[block.prefix.offset + prefixLength] & 0xff;
}
// if (DEBUG) System.out.println(" i=" + i + " ent=" + ent + " suffixLeadLabel=" + suffixLeadLabel);
if (suffixLeadLabel != lastSuffixLeadLabel) {
int itemsInBlock = i - nextBlockStart;
if (itemsInBlock >= minItemsInBlock && end-nextBlockStart > maxItemsInBlock) {
// The count is too large for one block, so we must break it into "floor" blocks, where we record
// the leading label of the suffix of the first term in each floor block, so at search time we can
// jump to the right floor block. We just use a naive greedy segmenter here: make a new floor
// block as soon as we have at least minItemsInBlock. This is not always best: it often produces
// a too-small block as the final block:
boolean isFloor = itemsInBlock < count;
newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, i, hasTerms, hasSubBlocks));
hasTerms = false;
hasSubBlocks = false;
nextFloorLeadLabel = suffixLeadLabel;
nextBlockStart = i;
}
lastSuffixLeadLabel = suffixLeadLabel;
}
if (ent.isTerm) {
hasTerms = true;
} else {
hasSubBlocks = true;
}
}
// Write last block, if any:
if (nextBlockStart < end) {
int itemsInBlock = end - nextBlockStart;
boolean isFloor = itemsInBlock < count;
newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, end, hasTerms, hasSubBlocks));
}
assert newBlocks.isEmpty() == false;
PendingBlock firstBlock = newBlocks.get(0);
assert firstBlock.isFloor || newBlocks.size() == 1;
firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);
// Remove slice from the top of the pending stack, that we just wrote:
pending.subList(pending.size()-count, pending.size()).clear();
// Append new block
pending.add(firstBlock);
newBlocks.clear();
}

So look here: after writing out a block, it wraps that block into a single reference and pushes it back onto pending, effectively as a parent node, right? Which means it's really building a tree structure?

✅ Exactly right! Your understanding is spot on.
Yes. After writeBlocks flushes a batch of terms (or sub-blocks) to disk, it abstracts them into a single PendingBlock object and puts that object back into the pending list; this is essentially constructing an internal node of a prefix tree (trie), and the process ultimately yields a layered, efficiently searchable tree-shaped index.
🔍 Let's walk through how this "tree structure" is built, step by step:
- Input: a batch of consecutive entries (terms or sub-blocks)
int start = pending.size() - count;
int end = pending.size();
- These count entries share a common prefix (of length prefixLength);
- Each of them is either:
  - a real term (PendingTerm), or
  - a sub-block flushed earlier (PendingBlock).
🌳 They are like the "leaves" or "subtrees" at one level of a trie.
- Optional: further split into "floor blocks" (a multi-way branching optimization)
if (itemsInBlock >= minItemsInBlock && end - nextBlockStart > maxItemsInBlock) {
// split into several floor blocks
}
- If the batch is too large (more than maxItemsInBlock entries), it is split by the first byte of each suffix (suffixLeadLabel) into several "floor" sub-blocks;
- each floor block covers a range of suffix lead labels (e.g. 'a'~'m', 'n'~'z');
- at query time the reader can then skip straight past irrelevant sub-blocks (similar to the interior nodes of a B+ tree).
✅ This keeps any single block from growing too large and hurting seek performance.
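The greedy cut rule above can be sketched as a tiny standalone model. The `minItems`/`maxItems` thresholds and the cut-only-at-lead-label-boundaries check mirror the snippet; everything else, including the class and method names, is invented for illustration and is not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the greedy floor-block segmenter: cut a new block at a
 *  lead-label boundary once it holds at least minItems entries, as long as
 *  the remainder is still larger than maxItems. Not the real Lucene code. */
public class FloorSegmenter {
  static List<int[]> segment(int[] leadLabels, int minItems, int maxItems) {
    List<int[]> blocks = new ArrayList<>();   // each entry: {start, end}
    int nextBlockStart = 0;
    int lastLabel = -1;
    for (int i = 0; i < leadLabels.length; i++) {
      if (leadLabels[i] != lastLabel) {       // only cut between label groups
        int itemsInBlock = i - nextBlockStart;
        if (itemsInBlock >= minItems && leadLabels.length - nextBlockStart > maxItems) {
          blocks.add(new int[] {nextBlockStart, i});
          nextBlockStart = i;
        }
        lastLabel = leadLabels[i];
      }
    }
    if (nextBlockStart < leadLabels.length) { // last (possibly small) block
      blocks.add(new int[] {nextBlockStart, leadLabels.length});
    }
    return blocks;
  }

  public static void main(String[] args) {
    // 8 entries whose suffixes start with 'a','a','b','b','c','c','d','d':
    int[] labels = {'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'};
    for (int[] b : segment(labels, 3, 4)) {
      System.out.println("floor block: [" + b[0] + "," + b[1] + ")");
    }
  }
}
```

Note how the second block ends up with only the remaining entries, matching the comment in the code that the naive greedy segmenter "often produces a too-small block as the final block".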
- Call writeBlock(...) to write the block to disk
newBlocks.add(writeBlock(prefixLength, isFloor, ..., nextBlockStart, i, ...));
- writeBlock is responsible for:
  - encoding the entries into bytes (the common prefix is stored once, plus each entry's suffix);
  - writing them into the .tim file;
  - returning a lightweight PendingBlock object (holding the prefix, file offset, whether it is a floor block, sub-block index, and other metadata).
💾 At this point the data is persisted, but the index structure is still being built in memory.
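A toy sketch of why the prefix-sharing encoding saves space (the real on-disk .tim format is far richer; `encodedSize` and its 1-byte suffix-length field are purely illustrative assumptions, not Lucene's layout):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Toy sketch of prefix-sharing block encoding: the shared prefix is written
 *  once, then each entry contributes only its suffix (length byte + bytes).
 *  Illustrates the idea only; Lucene's actual .tim encoding is richer. */
public class BlockEncodeDemo {
  static int encodedSize(List<String> terms, int prefixLength) {
    int size = prefixLength;                 // shared prefix stored once
    for (String t : terms) {
      int suffixLen = t.getBytes(StandardCharsets.UTF_8).length - prefixLength;
      size += 1 + suffixLen;                 // 1 byte for suffix length (toy)
    }
    return size;
  }

  public static void main(String[] args) {
    List<String> terms = List.of("apple", "application", "apply");
    // prefix "appl" (4 bytes) + suffixes "e", "ication", "y":
    // 4 + (1+1) + (1+7) + (1+1) = 16 bytes, vs 21 bytes of raw term text
    System.out.println(encodedSize(terms, 4));
  }
}
```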
- The key step: put the new block back into pending
// Remove the original 'count' entries
pending.subList(pending.size()-count, pending.size()).clear();
// Append new block as a single entry
pending.add(firstBlock);
✅ This is the core of the "tree structure" mechanism!
- pending used to hold count leaves (terms) or subtrees (blocks);
- they are now replaced by a single parent node (a PendingBlock);
- that parent node means: "everything starting with this prefix lives inside this block";
- in a later, higher-level flush (say, when the prefix "app" is processed), this PendingBlock participates as a single unit in the next round of packing!
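The replace-children-with-parent step can be simulated in a few lines. The `subList(...).clear()` plus `add(...)` pair is taken straight from the snippet above; `collapseTop` and the plain string entries are illustrative stand-ins for the real PendingEntry objects:

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal simulation of the pending-stack mechanics: the top `count`
 *  entries are replaced by one parent entry, mirroring
 *  pending.subList(pending.size()-count, pending.size()).clear();
 *  pending.add(firstBlock);  (names here are illustrative). */
public class PendingStackDemo {
  static void collapseTop(List<String> pending, int count, String parent) {
    // subList is a view, so clear() removes that range from pending itself:
    pending.subList(pending.size() - count, pending.size()).clear();
    pending.add(parent);
  }

  public static void main(String[] args) {
    List<String> pending =
        new ArrayList<>(List.of("apple", "application", "apply"));
    collapseTop(pending, 3, "Block(appl)"); // three leaves become one parent
    System.out.println(pending);            // [Block(appl)]
  }
}
```

The design choice here is worth noting: because `subList` returns a live view of the backing list, clearing it removes exactly the top slice in one call, with no manual shifting.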
🌲 An example: how the tree gets built
Suppose the terms are:
apple
application
apply
banana
- First, apple, application, apply are processed → they share "appl"
- they are flushed into one block B1 (prefix "appl")
- pending was [apple, application, apply]
- it is now [B1]
- Then banana is processed
- the "a..." branch has ended
- check i=0 (prefix "a"): pending = [B1], count = 1
- if count is still below minItemsInBlock, no new block is written at this prefix; B1 simply stays in pending and is carried up to the next level
🔎 A lookup for "apply" then proceeds:
1. Read the root block → find the "a" sub-block
2. Read the "a" block → find the "appl" sub-block
3. Read the "appl" block → find the "y" suffix → match!
Streaming construction: the full term set never has to be held at once; the tree is built incrementally as terms stream in.
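The three-step descent can be mimicked with a toy nested-block structure. The `Node` record here is a hypothetical stand-in for on-disk blocks; real lookups walk the .tip index and read .tim blocks instead of in-memory objects:

```java
import java.util.List;

/** Toy nested-block lookup mirroring the three-step walk above: descend
 *  from the root, at each block picking the child whose prefix is still a
 *  prefix of the target, until a leaf term is reached. Illustrative only. */
public class BlockLookupDemo {
  record Node(String prefix, List<Node> children) {
    boolean isTerm() { return children.isEmpty(); }
  }

  static boolean lookup(Node block, String target) {
    if (block.isTerm()) {
      return block.prefix().equals(target);
    }
    for (Node child : block.children()) {
      if (target.startsWith(child.prefix())) {
        return lookup(child, target);   // one more "block read" per level
      }
    }
    return false;                       // no branch covers the target
  }

  public static void main(String[] args) {
    Node appl = new Node("appl", List.of(
        new Node("apple", List.of()),
        new Node("application", List.of()),
        new Node("apply", List.of())));
    Node root = new Node("", List.of(appl, new Node("banana", List.of())));
    System.out.println(lookup(root, "apply"));    // true
    System.out.println(lookup(root, "grape"));    // false
  }
}
```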
✅ To sum up your insight:
**writeBlocks does more than write data: by replacing child entries with their parent block, it maintains a prefix tree under construction inside pending.
That tree is ultimately written out as Lucene's .tim file (the terms dictionary), with its block index going into .tip, supporting efficient term lookup and range scans.**
You've fully grasped the core construction mechanism of Lucene's BlockTree terms dictionary! 👏
It really is a classic engineering example of "simulating recursive tree construction with a stack".