/** Writes the top count entries in pending, using prevTerm to compute the prefix. */
void writeBlocks(int prefixLength, int count) throws IOException {
assert count > 0;
//if (DEBUG2) {
// BytesRef br = new BytesRef(lastTerm.bytes());
// br.length = prefixLength;
// System.out.println("writeBlocks: seg=" + segment + " prefix=" + brToString(br) + " count=" + count);
//}
// Root block better write all remaining pending entries:
assert prefixLength > 0 || count == pending.size();
int lastSuffixLeadLabel = -1;
// True if we saw at least one term in this block (we record if a block
// only points to sub-blocks in the terms index so we can avoid seeking
// to it when we are looking for a term):
boolean hasTerms = false;
boolean hasSubBlocks = false;
int start = pending.size()-count;
int end = pending.size();
int nextBlockStart = start;
int nextFloorLeadLabel = -1;
for (int i=start; i<end; i++) {
PendingEntry ent = pending.get(i);
int suffixLeadLabel;
if (ent.isTerm) {
PendingTerm term = (PendingTerm) ent;
if (term.termBytes.length == prefixLength) {
// Suffix is 0, i.e. prefix 'foo' and term is 'foo', so this term has an empty suffix in this block:
assert lastSuffixLeadLabel == -1;
suffixLeadLabel = -1;
} else {
suffixLeadLabel = term.termBytes[prefixLength] & 0xff;
}
} else {
PendingBlock block = (PendingBlock) ent;
assert block.prefix.length > prefixLength;
suffixLeadLabel = block.prefix.bytes[block.prefix.offset + prefixLength] & 0xff;
}
// if (DEBUG) System.out.println(" i=" + i + " ent=" + ent + " suffixLeadLabel=" + suffixLeadLabel);
if (suffixLeadLabel != lastSuffixLeadLabel) {
int itemsInBlock = i - nextBlockStart;
if (itemsInBlock >= minItemsInBlock && end-nextBlockStart > maxItemsInBlock) {
// The count is too large for one block, so we must break it into "floor" blocks, where we record
// the leading label of the suffix of the first term in each floor block, so at search time we can
// jump to the right floor block. We just use a naive greedy segmenter here: make a new floor
// block as soon as we have at least minItemsInBlock. This is not always best: it often produces
// a too-small block as the final block:
boolean isFloor = itemsInBlock < count;
newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, i, hasTerms, hasSubBlocks));
hasTerms = false;
hasSubBlocks = false;
nextFloorLeadLabel = suffixLeadLabel;
nextBlockStart = i;
}
lastSuffixLeadLabel = suffixLeadLabel;
}
if (ent.isTerm) {
hasTerms = true;
} else {
hasSubBlocks = true;
}
}
// Write last block, if any:
if (nextBlockStart < end) {
int itemsInBlock = end - nextBlockStart;
boolean isFloor = itemsInBlock < count;
newBlocks.add(writeBlock(prefixLength, isFloor, nextFloorLeadLabel, nextBlockStart, end, hasTerms, hasSubBlocks));
}
assert newBlocks.isEmpty() == false;
PendingBlock firstBlock = newBlocks.get(0);
assert firstBlock.isFloor || newBlocks.size() == 1;
firstBlock.compileIndex(newBlocks, scratchBytes, scratchIntsRef);
// Remove slice from the top of the pending stack, that we just wrote:
pending.subList(pending.size()-count, pending.size()).clear();
// Append new block
pending.add(firstBlock);
newBlocks.clear();
}

So look here: after writing out a block, it wraps that block into a single reference and pushes it back onto pending, effectively as a parent node, right? Which means it's really building a tree structure?

✅ Exactly right! Your understanding is spot on.
Yes. After writeBlocks flushes a batch of terms (or sub-blocks) to disk, it abstracts them into a single PendingBlock object and puts that object back into the pending list; this is essentially constructing an internal node of a prefix tree (trie), and the process ultimately yields a layered, efficiently searchable tree-shaped index.
🔍 Let's walk through how this "tree structure" is built, step by step:
- Input: a batch of consecutive entries (terms or sub-blocks)
int start = pending.size() - count;
int end = pending.size();
- These count entries share a common prefix (of length prefixLength);
- Each of them is either:
  - a real term (PendingTerm), or
  - a sub-block flushed earlier (PendingBlock).
🌳 They are like the "leaves" or "subtrees" at one level of a trie.
- Optional: further split into "floor blocks" (a multi-way branching optimization)
if (itemsInBlock >= minItemsInBlock && end - nextBlockStart > maxItemsInBlock) {
// split into several floor blocks
}
- If the batch is too large (more than maxItemsInBlock entries), it is split by the first byte of each suffix (suffixLeadLabel) into several "floor" sub-blocks;
- each floor block covers a range of suffix lead labels (e.g. 'a'~'m', 'n'~'z');
- at query time the reader can then skip straight past irrelevant sub-blocks (similar to the interior nodes of a B+ tree).
✅ This keeps any single block from growing too large and hurting seek performance.
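The greedy cut rule above can be sketched as a tiny standalone model. The `minItems`/`maxItems` thresholds and the cut-only-at-lead-label-boundaries check mirror the snippet; everything else, including the class and method names, is invented for illustration and is not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the greedy floor-block segmenter: cut a new block at a
 *  lead-label boundary once it holds at least minItems entries, as long as
 *  the remainder is still larger than maxItems. Not the real Lucene code. */
public class FloorSegmenter {
  static List<int[]> segment(int[] leadLabels, int minItems, int maxItems) {
    List<int[]> blocks = new ArrayList<>();   // each entry: {start, end}
    int nextBlockStart = 0;
    int lastLabel = -1;
    for (int i = 0; i < leadLabels.length; i++) {
      if (leadLabels[i] != lastLabel) {       // only cut between label groups
        int itemsInBlock = i - nextBlockStart;
        if (itemsInBlock >= minItems && leadLabels.length - nextBlockStart > maxItems) {
          blocks.add(new int[] {nextBlockStart, i});
          nextBlockStart = i;
        }
        lastLabel = leadLabels[i];
      }
    }
    if (nextBlockStart < leadLabels.length) { // last (possibly small) block
      blocks.add(new int[] {nextBlockStart, leadLabels.length});
    }
    return blocks;
  }

  public static void main(String[] args) {
    // 8 entries whose suffixes start with 'a','a','b','b','c','c','d','d':
    int[] labels = {'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'};
    for (int[] b : segment(labels, 3, 4)) {
      System.out.println("floor block: [" + b[0] + "," + b[1] + ")");
    }
  }
}
```

Note how the second block ends up with only the remaining entries, matching the comment in the code that the naive greedy segmenter "often produces a too-small block as the final block".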
- Call writeBlock(...) to write the block to disk
newBlocks.add(writeBlock(prefixLength, isFloor, ..., nextBlockStart, i, ...));
- writeBlock is responsible for:
  - encoding the entries into bytes (the common prefix is stored once, plus each entry's suffix);
  - writing them into the .tim file;
  - returning a lightweight PendingBlock object (holding the prefix, file offset, whether it is a floor block, sub-block index, and other metadata).
💾 At this point the data is persisted, but the index structure is still being built in memory.
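A toy sketch of why the prefix-sharing encoding saves space (the real on-disk .tim format is far richer; `encodedSize` and its 1-byte suffix-length field are purely illustrative assumptions, not Lucene's layout):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Toy sketch of prefix-sharing block encoding: the shared prefix is written
 *  once, then each entry contributes only its suffix (length byte + bytes).
 *  Illustrates the idea only; Lucene's actual .tim encoding is richer. */
public class BlockEncodeDemo {
  static int encodedSize(List<String> terms, int prefixLength) {
    int size = prefixLength;                 // shared prefix stored once
    for (String t : terms) {
      int suffixLen = t.getBytes(StandardCharsets.UTF_8).length - prefixLength;
      size += 1 + suffixLen;                 // 1 byte for suffix length (toy)
    }
    return size;
  }

  public static void main(String[] args) {
    List<String> terms = List.of("apple", "application", "apply");
    // prefix "appl" (4 bytes) + suffixes "e", "ication", "y":
    // 4 + (1+1) + (1+7) + (1+1) = 16 bytes, vs 21 bytes of raw term text
    System.out.println(encodedSize(terms, 4));
  }
}
```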
- The key step: put the new block back into pending
// Remove the original 'count' entries
pending.subList(pending.size()-count, pending.size()).clear();
// Append new block as a single entry
pending.add(firstBlock);
✅ This is the core of the "tree structure" mechanism!
- pending used to hold count leaves (terms) or subtrees (blocks);
- they are now replaced by a single parent node (a PendingBlock);
- that parent node means: "everything starting with this prefix lives inside this block";
- in a later, higher-level flush (say, when the prefix "app" is processed), this PendingBlock participates as a single unit in the next round of packing!
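The replace-children-with-parent step can be simulated in a few lines. The `subList(...).clear()` plus `add(...)` pair is taken straight from the snippet above; `collapseTop` and the plain string entries are illustrative stand-ins for the real PendingEntry objects:

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal simulation of the pending-stack mechanics: the top `count`
 *  entries are replaced by one parent entry, mirroring
 *  pending.subList(pending.size()-count, pending.size()).clear();
 *  pending.add(firstBlock);  (names here are illustrative). */
public class PendingStackDemo {
  static void collapseTop(List<String> pending, int count, String parent) {
    // subList is a view, so clear() removes that range from pending itself:
    pending.subList(pending.size() - count, pending.size()).clear();
    pending.add(parent);
  }

  public static void main(String[] args) {
    List<String> pending =
        new ArrayList<>(List.of("apple", "application", "apply"));
    collapseTop(pending, 3, "Block(appl)"); // three leaves become one parent
    System.out.println(pending);            // [Block(appl)]
  }
}
```

The design choice here is worth noting: because `subList` returns a live view of the backing list, clearing it removes exactly the top slice in one call, with no manual shifting.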
🌲 An example: how the tree gets built
Suppose the terms are:
apple
application
apply
banana
- First, apple, application, apply are processed → they share "appl"
- they are flushed into one block B1 (prefix "appl")
- pending was [apple, application, apply]
- it is now [B1]
- Then banana is processed
- the "a..." branch has ended
- check i=0 (prefix "a"): pending = [B1], count = 1
- if count is still below minItemsInBlock, no new block is written at this prefix; B1 simply stays in pending and is carried up to the next level
🔎 A lookup for "apply" then proceeds:
1. Read the root block → find the "a" sub-block
2. Read the "a" block → find the "appl" sub-block
3. Read the "appl" block → find the "y" suffix → match!
Streaming construction: the full term set never has to be held at once; the tree is built incrementally as terms stream in.
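The three-step descent can be mimicked with a toy nested-block structure. The `Node` record here is a hypothetical stand-in for on-disk blocks; real lookups walk the .tip index and read .tim blocks instead of in-memory objects:

```java
import java.util.List;

/** Toy nested-block lookup mirroring the three-step walk above: descend
 *  from the root, at each block picking the child whose prefix is still a
 *  prefix of the target, until a leaf term is reached. Illustrative only. */
public class BlockLookupDemo {
  record Node(String prefix, List<Node> children) {
    boolean isTerm() { return children.isEmpty(); }
  }

  static boolean lookup(Node block, String target) {
    if (block.isTerm()) {
      return block.prefix().equals(target);
    }
    for (Node child : block.children()) {
      if (target.startsWith(child.prefix())) {
        return lookup(child, target);   // one more "block read" per level
      }
    }
    return false;                       // no branch covers the target
  }

  public static void main(String[] args) {
    Node appl = new Node("appl", List.of(
        new Node("apple", List.of()),
        new Node("application", List.of()),
        new Node("apply", List.of())));
    Node root = new Node("", List.of(appl, new Node("banana", List.of())));
    System.out.println(lookup(root, "apply"));    // true
    System.out.println(lookup(root, "grape"));    // false
  }
}
```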
✅ To sum up your insight:
**writeBlocks does more than write data: by replacing child entries with their parent block, it maintains a prefix tree under construction inside pending.
That tree is ultimately written out as Lucene's .tim file (the terms dictionary), with its block index going into .tip, supporting efficient term lookup and range scans.**
You've fully grasped the core construction mechanism of Lucene's BlockTree terms dictionary! 👏
It really is a classic engineering example of "simulating recursive tree construction with a stack".