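Preface
The previous article, "Paimon Source Code Walkthrough -- Compaction-1.MergeTreeCompactTask", covered the overall flow of Paimon's compaction stage.
Paimon's compaction consists of the following parts:
- SingleFileWriter and RollingFileWriter perform the file writes and file rolling. For details, see "Paimon Source Code Walkthrough -- Compaction-2.SingleFileWriter and RollingFileWriter".
- ReducerMergeFunctionWrapper runs the merge (aggregation) logic. For details, see "Paimon Source Code Walkthrough -- PartialUpdateMerge".
- readerForMergeTree() ultimately hands off to SortMergeReader, which runs the configured merge algorithm to sort, merge, and rewrite the files. For the call path, see "Paimon Source Code Walkthrough -- Compaction-3.MergeSorter".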
I. SortMergeReader
1. Entry point
SortMergeReader is created in MergeSorter.mergeSortNoSpill(). The flow is as follows:
java
// Merge without spilling to disk
public <T> RecordReader<T> mergeSortNoSpill(
        List<? extends ReaderSupplier<KeyValue>> lazyReaders,
        Comparator<InternalRow> keyComparator,
        @Nullable FieldsComparator userDefinedSeqComparator,
        MergeFunctionWrapper<T> mergeFunction)
        throws IOException {
    List<RecordReader<KeyValue>> readers = new ArrayList<>(lazyReaders.size());
    for (ReaderSupplier<KeyValue> supplier : lazyReaders) {
        try {
            // Ends up in MergeTreeReaders.readerForRun(): one RecordReader that
            // concatenates all data files of the current sorted run.
            readers.add(supplier.get());
        } catch (IOException e) {
            // if one of the readers creating failed, we need to close them all.
            // (That is, every reader already created for the earlier sorted runs.)
            readers.forEach(IOUtils::closeQuietly);
            throw e;
        }
    }
    // Entry point for creating the sort-merge reader
    return SortMergeReader.createSortMergeReader(
            readers, keyComparator, userDefinedSeqComparator, mergeFunction, sortEngine);
}
2. The SortMergeReader interface
java
public interface SortMergeReader<T> extends RecordReader<T> {
    // Called by MergeSorter.mergeSortNoSpill(); entry point for creating the sort-merge reader
    static <T> SortMergeReader<T> createSortMergeReader(
            List<RecordReader<KeyValue>> readers,
            Comparator<InternalRow> userKeyComparator,
            @Nullable FieldsComparator userDefinedSeqComparator,
            MergeFunctionWrapper<T> mergeFunctionWrapper,
            SortEngine sortEngine) {
        // Pick the implementation from the configured 'sort-engine'; the default is loser-tree
        switch (sortEngine) {
            case MIN_HEAP: // min-heap algorithm
                return new SortMergeReaderWithMinHeap<>(
                        readers, userKeyComparator, userDefinedSeqComparator, mergeFunctionWrapper);
            case LOSER_TREE: // loser-tree algorithm
                return new SortMergeReaderWithLoserTree<>(
                        readers, userKeyComparator, userDefinedSeqComparator, mergeFunctionWrapper);
            default:
                throw new UnsupportedOperationException("Unsupported sort engine: " + sortEngine);
        }
    }
}
II. SortMergeReaderWithLoserTree -- the loser-tree SortMergeReader
1. How the source works
(1) Core fields and constructor
swift
SortMergeReaderWithLoserTree<T>
├── MergeFunctionWrapper<T> mergeFunctionWrapper // merge function wrapper
└── LoserTree<KeyValue> loserTree // the loser tree
    ├── int[] tree // array of loser indices
    ├── List<LeafIterator<KeyValue>> leaves // list of leaves
    ├── Comparator<KeyValue> firstComparator // key comparator
    └── Comparator<KeyValue> secondComparator // sequence comparator
SortMergeIterator (inner class)
└── boolean released // whether the batch has been released
java
private final MergeFunctionWrapper<T> mergeFunctionWrapper; // wrapper around the MergeFunction, determined by the 'merge-engine' option
private final LoserTree<KeyValue> loserTree; // the loser tree

public SortMergeReaderWithLoserTree(
        List<RecordReader<KeyValue>> readers,
        Comparator<InternalRow> userKeyComparator,
        @Nullable FieldsComparator userDefinedSeqComparator,
        MergeFunctionWrapper<T> mergeFunctionWrapper) {
    this.mergeFunctionWrapper = mergeFunctionWrapper;
    // Build the loser tree
    this.loserTree =
            new LoserTree<>(
                    readers, // all RecordReaders; each one becomes a leaf of the loser tree
                    (e1, e2) -> userKeyComparator.compare(e2.key(), e1.key()), // key comparator (arguments reversed)
                    createSequenceComparator(userDefinedSeqComparator)); // sequence comparator
}
(2) createSequenceComparator() -- creating the sequence comparator
java
// Create the sequence comparator
private Comparator<KeyValue> createSequenceComparator(
        @Nullable FieldsComparator userDefinedSeqComparator) {
    if (userDefinedSeqComparator == null) {
        // Default: reversed comparison on sequenceNumber. In the loser tree a positive
        // result means the first argument wins, so the smaller (older) sequenceNumber wins.
        return (e1, e2) -> Long.compare(e2.sequenceNumber(), e1.sequenceNumber());
    }
    // User-defined sequence field: reversed comparison on the user field first,
    // then on sequenceNumber as the tie-breaker
    return (o1, o2) -> {
        int result = userDefinedSeqComparator.compare(o2.value(), o1.value());
        if (result != 0) {
            return result;
        }
        return Long.compare(o2.sequenceNumber(), o1.sequenceNumber());
    };
}
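The direction of these reversed comparators is easy to get backwards, so here is a minimal sketch of it. `DemoKv` is a hypothetical stand-in for Paimon's `KeyValue` (only an int key and a long sequence), and `wins` is an illustrative helper, not a Paimon method. It applies the same argument swap and the loser tree's rule that `compare(a, b) > 0` means `a` wins.

```java
import java.util.Comparator;

// Hypothetical stand-in for KeyValue: only a key and a sequence number.
final class DemoKv {
    final int key;
    final long seq;

    DemoKv(int key, long seq) {
        this.key = key;
        this.seq = seq;
    }
}

final class ComparatorDirectionDemo {
    // Same shape as the comparators above: the arguments are swapped,
    // so a positive result means the FIRST argument is the smaller one.
    static final Comparator<DemoKv> KEY_CMP = (e1, e2) -> Integer.compare(e2.key, e1.key);
    static final Comparator<DemoKv> SEQ_CMP = (e1, e2) -> Long.compare(e2.seq, e1.seq);

    // Loser-tree convention: compare(a, b) > 0 means a wins.
    static boolean wins(DemoKv a, DemoKv b) {
        int byKey = KEY_CMP.compare(a, b);
        return byKey != 0 ? byKey > 0 : SEQ_CMP.compare(a, b) > 0;
    }

    public static void main(String[] args) {
        // Different keys: the smaller key wins regardless of sequence.
        System.out.println(wins(new DemoKv(3, 200), new DemoKv(7, 150)));
        // Same key: the smaller (older) sequence wins and is popped first.
        System.out.println(wins(new DemoKv(1, 100), new DemoKv(1, 150)));
    }
}
```

Because older records win, they reach the merge function first and newer records are added after them.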
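(3) readBatch()
java
// Compared with min-heap, the loser tree produces only a single batch
@Nullable
@Override
public RecordIterator<T> readBatch() throws IOException {
    loserTree.initializeIfNeeded(); // build the loser tree on first use
    // peekWinner() returns the record of the current global winner.
    // null means there is nothing left to read, so readBatch() returns null;
    // otherwise it returns a SortMergeIterator over the remaining data.
    return loserTree.peekWinner() == null ? null : new SortMergeIterator();
}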
(4) SortMergeIterator -- the core
java
private class SortMergeIterator implements RecordIterator<T> {
    private boolean released = false;

    @Nullable
    @Override
    public T next() throws IOException {
        while (true) {
            // 1. Advance the popped leaves and re-adjust the tree to find the next winner
            loserTree.adjustForNextLoop();
            // 2. Pop that winner; the tree is re-adjusted so the next candidate moves to tree[0]
            KeyValue winner = loserTree.popWinner();
            if (winner == null) {
                return null;
            }
            // 3. Reset the merge function's buffer
            mergeFunctionWrapper.reset();
            // 4. Feed the winner to the merge function.
            // Remember: smaller keys are popped before larger keys, and for the
            // same key older records are popped before newer ones.
            mergeFunctionWrapper.add(winner);
            // 5. Merge all records with the same key and return the merged result
            T result = merge();
            if (result != null) {
                return result;
            }
        }
    }

    private T merge() {
        Preconditions.checkState(
                !released, "SortMergeIterator#nextImpl is called after release");
        // Keep popping winners that share the current key; each pop re-adjusts the tree
        while (loserTree.peekWinner() != null) {
            // Merge the record
            mergeFunctionWrapper.add(loserTree.popWinner());
        }
        // Return the merged result
        return mergeFunctionWrapper.getResult();
    }

    @Override
    public void releaseBatch() {
        released = true;
    }
}
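What `next()` and `merge()` produce can be sketched without a loser tree. The example below uses `java.util.PriorityQueue` instead (closer in spirit to the min-heap engine) to show only the output: records come out ordered by key, then by sequence, and each key's group collapses to its last record. `Rec`, `Cursor`, and `mergeKeepLatest` are hypothetical names for illustration, and the "keep the latest" rule is an assumed deduplicate-style merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class MergeSemanticsSketch {
    // Hypothetical record: key, sequence number, value.
    static final class Rec {
        final int key;
        final long seq;
        final String value;

        Rec(int key, long seq, String value) {
            this.key = key;
            this.seq = seq;
            this.value = value;
        }
    }

    // One sorted run together with its current head record.
    private static final class Cursor {
        final Iterator<Rec> it;
        Rec head;

        Cursor(Iterator<Rec> it) {
            this.it = it;
            this.head = it.next();
        }
    }

    // Merges sorted runs; for every key only the record added last survives.
    static List<Rec> mergeKeepLatest(List<List<Rec>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.<Cursor>comparingInt(c -> c.head.key)
                        .thenComparingLong(c -> c.head.seq));
        for (List<Rec> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<Rec> out = new ArrayList<>();
        Rec latest = null; // plays the role of the merge function's buffer
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            Rec r = c.head;
            if (latest != null && latest.key != r.key) {
                out.add(latest); // key changed: emit the merged result
            }
            latest = r; // a newer record overwrites the older one
            if (c.it.hasNext()) {
                c.head = c.it.next();
                heap.add(c);
            }
        }
        if (latest != null) {
            out.add(latest);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> merged = mergeKeepLatest(Arrays.asList(
                Arrays.asList(new Rec(1, 100, "A")),
                Arrays.asList(new Rec(1, 200, "B"), new Rec(2, 10, "X")),
                Arrays.asList(new Rec(1, 150, "C"))));
        for (Rec r : merged) {
            System.out.println(r.key + " -> " + r.value);
        }
    }
}
```

The loser tree reaches the same ordering with fewer comparisons per record, which is the motivation the article attributes to it.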
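2. LoserTree -- the loser tree
The key point: records with a smaller key, and for equal keys a smaller sequence, are popped and merged first; records with a larger key or a larger sequence are popped and merged later.
(1) Core fields, enum, and helper classes
<1> Fields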
java
private final int[] tree; // indices of the losers; tree[0] holds the index of the current winner
private final int size; // number of leaves
private final List<LeafIterator<T>> leaves; // the leaves
// Note: if comparator.compare(a, b) > 0, a is the winner. The loser b is stored in the
// internal (parent) node, and the winner a moves up to the next match.
private final Comparator<T> firstComparator; // key comparator, applied first
/** same as firstComparator, but mainly used to compare sequenceNumber. */
private final Comparator<T> secondComparator; // sequence comparator, applied second
private boolean initialized; // whether the tree has been initialized
<2> Constructor
java
public LoserTree(
        List<RecordReader<T>> nextBatchReaders, // all RecordReaders to merge
        Comparator<T> firstComparator, // key comparator
        Comparator<T> secondComparator) { // sequence comparator
    this.size = nextBatchReaders.size();
    this.leaves = new ArrayList<>(size);
    this.tree = new int[size]; // the tree array has as many slots as there are leaves
    // compare(a, b) > 0: a is the winner
    // compare(a, b) < 0: b is the winner
    // compare(a, b) = 0: fall back to secondComparator
    // null handling:
    //   e1 == null => return -1 (e1 loses)
    //   e2 == null => return 1 (e1 wins)
    this.firstComparator =
            (e1, e2) -> e1 == null ? -1 : (e2 == null ? 1 : firstComparator.compare(e1, e2));
    this.secondComparator =
            (e1, e2) -> e1 == null ? -1 : (e2 == null ? 1 : secondComparator.compare(e1, e2));
    this.initialized = false;
    // Wrap every RecordReader in a LeafIterator
    for (RecordReader<T> reader : nextBatchReaders) {
        LeafIterator<T> iterator = new LeafIterator<>(reader);
        this.leaves.add(iterator);
    }
}
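The null handling in the constructor can be tried in isolation. This is a small sketch, assuming plain `Integer` elements in place of `KeyValue`; `nullLoses` is a hypothetical helper name that reproduces the same ternary expression. A `null` element stands for a leaf whose reader is exhausted.

```java
import java.util.Comparator;

final class NullLosesDemo {
    // Wraps a comparator so that a null element always loses,
    // using the same expression as the LoserTree constructor.
    static <T> Comparator<T> nullLoses(Comparator<T> inner) {
        return (e1, e2) -> e1 == null ? -1 : (e2 == null ? 1 : inner.compare(e1, e2));
    }

    public static void main(String[] args) {
        // Reversed natural order: a positive result means the first argument is smaller.
        Comparator<Integer> smallerWins =
                NullLosesDemo.<Integer>nullLoses((a, b) -> Integer.compare(b, a));
        System.out.println(smallerWins.compare(3, 7) > 0);    // a live 3 beats a live 7
        System.out.println(smallerWins.compare(null, 7) < 0); // an exhausted leaf loses
        System.out.println(smallerWins.compare(3, null) > 0); // a live leaf beats an exhausted one
    }
}
```

Note that when both sides are `null` the wrapper returns -1, so the first argument is treated as the loser; either choice is harmless because neither leaf has data.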
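<3> The State enum -- the core
The algorithm uses this state to tell whether a node is a winner or a loser, and then adjusts the positions of the nodes in the tree accordingly.
java
private enum State {
    LOSER_WITH_NEW_KEY(false), // loser, new key
    LOSER_WITH_SAME_KEY(false), // loser, same key as the winner
    LOSER_POPPED(false), // loser, already popped
    WINNER_WITH_NEW_KEY(true), // winner, new key
    WINNER_WITH_SAME_KEY(true), // winner, same key as the previous winner
    WINNER_POPPED(true); // winner, already popped

    private final boolean winner;

    State(boolean winner) {
        this.winner = winner;
    }

    public boolean isWinner() {
        return winner;
    }
}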
<4> LeafIterator -- the leaf iterator
java
// Leaf iterator
private static class LeafIterator<T> implements Closeable {
    private final RecordReader<T> reader; // underlying RecordReader
    private RecordReader.RecordIterator<T> iterator; // iterator over the current batch
    private T kv; // current KeyValue
    private boolean endOfInput; // whether the reader is exhausted
    private int firstSameKeyIndex; // tree[] index of the first internal node recorded as holding a same-key loser
    private State state; // node state

    private LeafIterator(RecordReader<T> reader) {
        this.reader = reader;
        this.endOfInput = false;
        this.firstSameKeyIndex = -1;
        this.state = State.WINNER_WITH_NEW_KEY; // every leaf starts as a new-key winner and is then compared against the losers above it
    }

    // Return the current KeyValue of this leaf
    public T peek() {
        return kv;
    }

    // Pop the current KeyValue and mark this leaf as WINNER_POPPED
    public T pop() {
        this.state = State.WINNER_POPPED;
        return kv;
    }

    // Record the index of the first same-key node (only the first call takes effect)
    public void setFirstSameKeyIndex(int index) {
        if (firstSameKeyIndex == -1) {
            firstSameKeyIndex = index;
        }
    }

    // Advance to the next record
    public void advanceIfAvailable() throws IOException {
        // Reset the state
        this.firstSameKeyIndex = -1;
        this.state = State.WINNER_WITH_NEW_KEY;
        // Try to read the next record from the current batch iterator
        if (iterator == null || (kv = iterator.next()) == null) {
            // The current batch is used up; read new batches until one yields a record
            while (!endOfInput) {
                if (iterator != null) {
                    iterator.releaseBatch(); // release the memory of the current batch
                    iterator = null;
                }
                // Read the next batch
                iterator = reader.readBatch();
                if (iterator == null) {
                    // All data has been read: mark the end and close the reader
                    endOfInput = true;
                    kv = null;
                    reader.close();
                } else if ((kv = iterator.next()) != null) {
                    // Got a record, stop looping
                    break;
                }
            }
        }
    }

    @Override
    public void close() throws IOException {
        if (this.iterator != null) {
            this.iterator.releaseBatch();
            this.iterator = null;
        }
        this.reader.close();
    }
}
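The batch-skipping loop in `advanceIfAvailable()` is the subtle part of `LeafIterator`: a reader may return an empty batch, and the leaf must keep reading until it finds a record or the input ends. Below is a rough sketch of that loop, assuming a list of in-memory batches instead of a real `RecordReader`; every name in it is hypothetical and there is no batch release or reader close.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class BatchAdvanceDemo {
    private final Iterator<List<Integer>> batches; // stand-in for reader.readBatch()
    private Iterator<Integer> current;             // stand-in for the current batch iterator
    private boolean endOfInput;
    private Integer head;                          // stand-in for kv; null once exhausted

    BatchAdvanceDemo(List<List<Integer>> input) {
        this.batches = input.iterator();
    }

    Integer peek() {
        return head;
    }

    void advanceIfAvailable() {
        if (current != null && current.hasNext()) {
            head = current.next(); // the current batch still has records
            return;
        }
        head = null;
        while (!endOfInput) {
            if (!batches.hasNext()) {
                endOfInput = true; // no more batches: head stays null
            } else {
                current = batches.next().iterator();
                if (current.hasNext()) {
                    head = current.next(); // found a record in the new batch
                    break;
                }
                // empty batch: loop again and read the next one
            }
        }
    }

    static List<Integer> drain(List<List<Integer>> input) {
        BatchAdvanceDemo leaf = new BatchAdvanceDemo(input);
        List<Integer> out = new ArrayList<>();
        for (leaf.advanceIfAvailable(); leaf.peek() != null; leaf.advanceIfAvailable()) {
            out.add(leaf.peek());
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> input = Arrays.asList(
                Arrays.asList(1, 2), new ArrayList<Integer>(), Arrays.asList(3));
        System.out.println(drain(input));
    }
}
```

Once the input is exhausted the head stays `null`, which is exactly the value the null-aware comparators treat as a permanent loser.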
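(2) initializeIfNeeded() -- initializing the loser tree
Example
ini
Initial data (size = 3):
Leaf 0: key=5, seq=100
Leaf 1: key=3, seq=200
Leaf 2: key=7, seq=150
Execution:
1. i=2: advanceIfAvailable() → key=7
adjust(2) → parent=2 is still empty, so tree[2] = 2 (stored as a loser)
2. i=1: advanceIfAvailable() → key=3
adjust(1) → parent=2, compared with Leaf 2
key=3 < key=7 → Leaf 1 wins and moves up; parent=1 is still empty, so tree[1] = 1
3. i=0: advanceIfAvailable() → key=5
adjust(0) → parent=1, compared with Leaf 1
key=3 < key=5 → Leaf 1 wins and Leaf 0 stays behind as the loser, so tree[1] = 0
Final result:
tree = [1, 0, 2] (tree[0] = 1: Leaf 1 is the global winner; key=3 is the smallest and is popped first)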
java
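public void initializeIfNeeded() throws IOException {
    if (!initialized) {
        Arrays.fill(tree, -1); // fill the loser array with -1: no losers and no winner yet
        // Insert the leaves one by one, from the last leaf to the first.
        // Each adjust() either parks the leaf in the first empty internal node on its
        // path to the root or plays it against the loser already stored there.
        for (int i = size - 1; i >= 0; i--) {
            leaves.get(i).advanceIfAvailable(); // read the leaf's first KeyValue; sets state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=-1
            adjust(i); // let the leaf take part in the comparisons
        }
        initialized = true;
    }
}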
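To see which internal nodes a leaf visits during `adjust(i)`, the parent formula from the loop header can be run on its own. This is a small sketch; `pathToRoot` is a hypothetical helper that lists the `tree[]` indices visited when nothing stops the loop early.

```java
import java.util.ArrayList;
import java.util.List;

final class ParentPathDemo {
    // Internal nodes visited by adjust(leaf): start at (leaf + size) / 2,
    // then halve the index until the root slot (index 1) has been visited.
    static List<Integer> pathToRoot(int leaf, int size) {
        List<Integer> path = new ArrayList<>();
        for (int parent = (leaf + size) / 2; parent > 0; parent /= 2) {
            path.add(parent);
        }
        return path;
    }

    public static void main(String[] args) {
        int size = 3;
        for (int leaf = size - 1; leaf >= 0; leaf--) {
            System.out.println("leaf " + leaf + " -> " + pathToRoot(leaf, size));
        }
    }
}
```

With three leaves, Leaf 1 and Leaf 2 share internal node 2 and then meet Leaf 0 at node 1, which matches the order of comparisons in the example above.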
(3) adjustForNextLoop() -- adjusting the tree to find the next winner
java
public void adjustForNextLoop() throws IOException {
    LeafIterator<T> winner = leaves.get(tree[0]);
    // Keep adjusting until the winner is no longer in the WINNER_POPPED state
    while (winner.state == State.WINNER_POPPED) {
        winner.advanceIfAvailable(); // read the next KeyValue
        adjust(tree[0]); // replay the leaf, which now holds its next record, against the losers on its path
        winner = leaves.get(tree[0]);
    }
}
(4) popWinner() -- popping the current winner and marking it WINNER_POPPED
java
// Pop the current winner and mark it WINNER_POPPED
public T popWinner() {
    LeafIterator<T> winner = leaves.get(tree[0]);
    // If the current winner has already been popped, return null
    if (winner.state == State.WINNER_POPPED) {
        // if the winner has already been popped, it means that all the same key has been
        // processed.
        return null;
    }
    // Otherwise pop the winner, mark it WINNER_POPPED, and re-adjust the tree
    T result = winner.pop();
    adjust(tree[0]); // promote the next node with the same key (if any) to the winner slot so the key's remaining records can be popped
    return result;
}
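(5) peekWinner() -- looking at the current winner
Returns the current winner's record, provided the winner has not been popped yet; otherwise returns null.
java
public T peekWinner() {
    // Return the current winner's record unless it has already been popped
    return leaves.get(tree[0]).state != State.WINNER_POPPED ? leaves.get(tree[0]).peek() : null;
}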
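(6) adjust() -- the core of the algorithm
Note:
adjust(tree[0]) replays the current winner leaf from its parent up to the root, and what happens depends on the leaf's state. If the leaf has just been popped and has recorded a same-key loser (firstSameKeyIndex >= 0), that loser is promoted directly to be the next winner and the two swap places. If the leaf holds a record, it is compared against the loser stored at each parent slot and tree[0] ends up holding whoever wins.
java
private void adjust(int winner) {
    // Main loop: walk from the leaf up to the root; 'winner' starts as the leaf being adjusted.
    // Parent index: (winner + size) / 2. The leaves conceptually sit at positions
    // size .. 2*size-1 of a complete binary tree whose internal nodes are tree[1..size-1];
    // tree[0] is reserved for the overall winner.
    for (int parent = (winner + this.size) / 2; parent > 0 && winner >= 0; parent /= 2) {
        LeafIterator<T> winnerNode = leaves.get(winner);
        LeafIterator<T> parentNode;
        // CASE-1: initialization, this slot has no loser yet. Mark the current winner
        // as a new-key loser so that it is parked here for later comparisons.
        if (this.tree[parent] == -1) {
            // Set the winner node's state to LOSER_WITH_NEW_KEY
            winnerNode.state = State.LOSER_WITH_NEW_KEY;
        }
        // CASE-2: normal adjustment, the slot already holds a loser
        else {
            // The loser currently stored in this parent slot
            parentNode = leaves.get(this.tree[parent]);
            // Choose the adjustment strategy from the winner's current state
            switch (winnerNode.state) {
                case WINNER_WITH_NEW_KEY:
                    // The winner holds a new key: full comparison in adjustWithNewWinnerKey()
                    adjustWithNewWinnerKey(parent, parentNode, winnerNode);
                    break;
                case WINNER_WITH_SAME_KEY:
                    // The winner holds the same key: only the sequence is compared, in adjustWithSameWinnerKey()
                    adjustWithSameWinnerKey(parent, parentNode, winnerNode);
                    break;
                case WINNER_POPPED:
                    // The winner has been popped: fast path
                    if (winnerNode.firstSameKeyIndex < 0) {
                        // No more records with the same key: stop adjusting
                        parent = -1;
                    } else {
                        // Fast path: jump straight to the slot recorded in firstSameKeyIndex.
                        // The popped winner becomes LOSER_POPPED and the node stored there
                        // becomes WINNER_WITH_SAME_KEY; the two are swapped below.
                        parent = winnerNode.firstSameKeyIndex;
                        parentNode = leaves.get(this.tree[parent]);
                        winnerNode.state = State.LOSER_POPPED;
                        parentNode.state = State.WINNER_WITH_SAME_KEY;
                    }
                    break;
                default:
                    throw new UnsupportedOperationException(
                            "unknown state for " + winnerNode.state.name());
            }
        }
        // if the winner loses, exchange nodes.
        // If the node that entered this round as the winner is no longer a winner,
        // the loser stored in this slot becomes the new winner and continues upward.
        if (!winnerNode.state.isWinner()) {
            int tmp = winner;
            winner = this.tree[parent];
            this.tree[parent] = tmp;
        }
    }
    // Loop finished: the final winner reaches the root
    this.tree[0] = winner;
}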
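Stripped of the state machine and the same-key fast path, `adjust()` is the replay step of a classic loser tree. The sketch below is a simplified version over sorted `int` arrays, not Paimon's implementation: `SimpleLoserTree` and its members are hypothetical names, ties are broken by run index, and exhausted runs always lose. It keeps the same `tree[]` layout and the same "fill empty slots from the last leaf to the first" initialization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified loser tree over sorted int arrays: no states, no same-key fast path.
final class SimpleLoserTree {
    private final int[][] runs; // each run is sorted ascending
    private final int[] pos;    // read position per run
    private final int[] tree;   // tree[0] = winner, tree[1..k-1] = losers
    private final int k;

    SimpleLoserTree(int[][] runs) {
        this.runs = runs;
        this.k = runs.length;
        this.pos = new int[k];
        this.tree = new int[Math.max(k, 1)];
        Arrays.fill(tree, -1);
        for (int i = k - 1; i >= 0; i--) {
            adjust(i);
        }
    }

    private boolean exhausted(int leaf) {
        return pos[leaf] >= runs[leaf].length;
    }

    // Smaller head wins; an exhausted run always loses; ties go to the lower run index.
    private boolean beats(int a, int b) {
        if (exhausted(a)) {
            return false;
        }
        if (exhausted(b)) {
            return true;
        }
        int va = runs[a][pos[a]];
        int vb = runs[b][pos[b]];
        return va < vb || (va == vb && a < b);
    }

    // Replays the matches from the leaf's parent up to the root.
    private void adjust(int leaf) {
        int winner = leaf;
        for (int parent = (leaf + k) / 2; parent > 0 && winner >= 0; parent /= 2) {
            if (tree[parent] == -1 || !beats(winner, tree[parent])) {
                int tmp = winner;      // the candidate loses (or the slot is empty):
                winner = tree[parent]; // the stored node moves up,
                tree[parent] = tmp;    // and the candidate stays here as the loser
            }
        }
        tree[0] = winner;
    }

    boolean hasNext() {
        return k > 0 && tree[0] >= 0 && !exhausted(tree[0]);
    }

    int next() {
        int leaf = tree[0];
        int value = runs[leaf][pos[leaf]++];
        adjust(leaf); // only the path of the popped leaf is replayed
        return value;
    }

    static List<Integer> mergeAll(int[][] runs) {
        SimpleLoserTree t = new SimpleLoserTree(runs);
        List<Integer> out = new ArrayList<>();
        while (t.hasNext()) {
            out.add(t.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mergeAll(new int[][] {{1, 4, 7}, {2, 5, 8}, {3, 6, 9}}));
    }
}
```

Each pop replays one leaf-to-root path, with one comparison per level. Paimon's version adds the State machine so that records sharing the winner's key can be promoted through `firstSameKeyIndex` without a full key comparison.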
(7) adjustWithNewWinnerKey() -- handling a winner with a new key
java
private void adjustWithNewWinnerKey(
        int index, LeafIterator<T> parentNode, LeafIterator<T> winnerNode) {
    // Scenario: the current winner holds a new key (different from the previous global winner)
    // Branch on the state of the node stored in the parent slot
    switch (parentNode.state) {
        // CASE-1: parentNode also holds a new key, a full comparison is needed
        case LOSER_WITH_NEW_KEY:
            // when the new winner is also a new key, it needs to be compared.
            T parentKey = parentNode.peek();
            T childKey = winnerNode.peek();
            /* 1. Compare the keys first.
               Note: the key comparator passed in by SortMergeReaderWithLoserTree is
               (e1, e2) -> userKeyComparator.compare(e2.key(), e1.key()), so for
               firstComparator.compare(parentKey, childKey):
               > 0: the child's key is larger, parent is the winner
               < 0: the parent's key is larger, child is the winner
               = 0: the keys are equal, compare the sequence next
            */
            int firstResult = firstComparator.compare(parentKey, childKey);
            // SUB-CASE-1: same key, compare the sequence
            if (firstResult == 0) {
                // Note: by default secondComparator is (e1, e2) -> Long.compare(e2.sequenceNumber(), e1.sequenceNumber())
                int secondResult = secondComparator.compare(parentKey, childKey);
                // The parent's sequence is larger, so the child is the older record: child wins and is popped first
                if (secondResult < 0) {
                    parentNode.state = State.LOSER_WITH_SAME_KEY;
                    winnerNode.setFirstSameKeyIndex(index); // remember where the same-key loser is stored
                }
                // The child's sequence is larger or equal, so the parent is the older record: parent wins and is popped first
                else {
                    winnerNode.state = State.LOSER_WITH_SAME_KEY;
                    parentNode.state = State.WINNER_WITH_NEW_KEY;
                    parentNode.setFirstSameKeyIndex(index);
                }
            }
            // SUB-CASE-2: the child's key is larger, parent is the winner; update both states
            else if (firstResult > 0) {
                // the two keys are completely different and just need to update the state.
                parentNode.state = State.WINNER_WITH_NEW_KEY;
                winnerNode.state = State.LOSER_WITH_NEW_KEY;
            }
            // (firstResult < 0: the child keeps winning and nothing changes)
            return;
        // CASE-2: cannot happen, because same-key winners go through adjustWithSameWinnerKey(); throw
        case LOSER_WITH_SAME_KEY:
            throw new RuntimeException(
                    "This is a bug. Please file an issue. A node in the WINNER_WITH_NEW_KEY "
                            + "state cannot encounter a node in the LOSER_WITH_SAME_KEY state.");
        // CASE-3: only happens during adjustForNextLoop().
        // The parent was already popped; it is promoted back to WINNER_POPPED so that
        // adjustForNextLoop() advances it as well, and the incoming record waits as a loser.
        case LOSER_POPPED:
            parentNode.state = State.WINNER_POPPED;
            parentNode.firstSameKeyIndex = -1;
            winnerNode.state = State.LOSER_WITH_NEW_KEY;
            return;
        default:
            throw new UnsupportedOperationException(
                    "unknown state for " + parentNode.state.name());
    }
}
(8) adjustWithSameWinnerKey() -- handling a winner with the same key
java
private void adjustWithSameWinnerKey(
        int index, LeafIterator<T> parentNode, LeafIterator<T> winnerNode) {
    switch (parentNode.state) {
        // CASE-1: the parent has the same key, so only the sequence needs to be compared
        case LOSER_WITH_SAME_KEY:
            // the key of the previous loser is the same as the key of the current winner,
            // only the sequence needs to be compared.
            T parentKey = parentNode.peek();
            T childKey = winnerNode.peek();
            // Compare the sequence
            int secondResult = secondComparator.compare(parentKey, childKey);
            // The child's sequence is larger, so the parent is older and must be popped first: swap the roles
            if (secondResult > 0) {
                parentNode.state = State.WINNER_WITH_SAME_KEY;
                winnerNode.state = State.LOSER_WITH_SAME_KEY;
                parentNode.setFirstSameKeyIndex(index);
            }
            // The parent's sequence is larger or equal, so the child is popped first: states stay as they are, only the index is recorded
            else {
                winnerNode.setFirstSameKeyIndex(index);
            }
            return;
        // CASE-2: nothing to do here; the winner keeps its state and continues upward
        case LOSER_WITH_NEW_KEY:
        case LOSER_POPPED:
            return;
        default:
            throw new UnsupportedOperationException(
                    "unknown state for " + parentNode.state.name());
    }
}
3. Walkthrough summary
(1) Initial state
ini
tree = [-1, -1, -1]
size = 3
Initial state of all leaves:
Leaf 0: kv=null, state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=-1
Leaf 1: kv=null, state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=-1
Leaf 2: kv=null, state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=-1
(2) The initializeIfNeeded phase
<1> Step 1: i=2, adjust(2)
advanceIfAvailable()
ini
Leaf 2 reads its first record
Leaf 2: kv={key=1, seq=150, value="C"}, state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=-1
adjust(2)
java
winner = 2
parent = (2 + 3) / 2 = 2
Loop round 1: parent=2, winner=2
winnerNode = Leaf 2
tree[2] == -1 // initialization branch
→ winnerNode.state = LOSER_WITH_NEW_KEY
!winnerNode.state.isWinner() = true (it is a LOSER)
→ swap:
tmp = 2
winner = tree[2] = -1
tree[2] = 2
winner = -1
parent /= 2 → parent = 1
Loop condition: 1 > 0 && -1 >= 0 → false
tree[0] = -1
Result:
ini
tree = [-1, -1, 2]
Leaf 2: state=LOSER_WITH_NEW_KEY
<2> Step 2: i=1, adjust(1)
advanceIfAvailable()
ini
Leaf 1: kv={key=1, seq=200, value="B"}, state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=-1
adjust(1)
java
winner = 1
parent = (1 + 3) / 2 = 2
Loop round 1: parent=2, winner=1
winnerNode = Leaf 1 (key=1, seq=200, state=WINNER_WITH_NEW_KEY)
tree[2] == 2
parentNode = Leaf 2 (key=1, seq=150, state=LOSER_WITH_NEW_KEY)
winnerNode.state = WINNER_WITH_NEW_KEY
→ call adjustWithNewWinnerKey(2, Leaf 2, Leaf 1)
adjustWithNewWinnerKey():
parentNode.state = LOSER_WITH_NEW_KEY
parentKey = {key=1, seq=150}
childKey = {key=1, seq=200}
firstResult = firstComparator.compare(parentKey, childKey)
= userKeyComparator.compare(childKey, parentKey)
= userKeyComparator.compare({key=1}, {key=1})
= 0
// Same key, compare the sequence
secondResult = secondComparator.compare(parentKey, childKey)
= Long.compare(childKey.seq, parentKey.seq)
= Long.compare(200, 150)
= 1 > 0
// secondResult > 0, so the else branch is taken
// The parent's seq is smaller, so the parent is the winner
winnerNode.state = LOSER_WITH_SAME_KEY
parentNode.state = WINNER_WITH_NEW_KEY
parentNode.setFirstSameKeyIndex(2) // record index=2
// Leaf 2.firstSameKeyIndex == -1, so it is set to 2
Leaf 2.firstSameKeyIndex = 2
// Check after returning
!winnerNode.state.isWinner() = true (Leaf 1 is a LOSER)
→ swap:
tmp = 1
winner = tree[2] = 2
tree[2] = 1
winner = 2
parent /= 2 → parent = 1
Loop round 2: parent=1, winner=2
winnerNode = Leaf 2 (key=1, seq=150, state=WINNER_WITH_NEW_KEY)
tree[1] == -1 // initialization branch
→ winnerNode.state = LOSER_WITH_NEW_KEY
!isWinner() = true
→ swap:
tmp = 2
winner = tree[1] = -1
tree[1] = 2
winner = -1
parent = 0
Loop condition: 0 > 0 → false
tree[0] = -1
Result:
ini
tree = [-1, 2, 1]
Leaf 1: state=LOSER_WITH_SAME_KEY, firstSameKeyIndex=-1
Leaf 2: state=LOSER_WITH_NEW_KEY, firstSameKeyIndex=2
<3> Step 3: i=0, adjust(0)
advanceIfAvailable()
ini
Leaf 0: kv={key=1, seq=100, value="A"}, state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=-1
adjust(0)
java
winner = 0
parent = (0 + 3) / 2 = 1
Loop round 1: parent=1, winner=0
winnerNode = Leaf 0 (key=1, seq=100, state=WINNER_WITH_NEW_KEY)
tree[1] == 2
parentNode = Leaf 2 (key=1, seq=150, state=LOSER_WITH_NEW_KEY)
Call adjustWithNewWinnerKey(1, Leaf 2, Leaf 0):
parentKey = {key=1, seq=150}
childKey = {key=1, seq=100}
firstResult = compare({key=1}, {key=1}) = 0
secondResult = Long.compare(100, 150) = -1 < 0
// secondResult < 0, so the if branch is taken
// The child's seq is smaller, so the child is the winner
parentNode.state = LOSER_WITH_SAME_KEY
winnerNode.setFirstSameKeyIndex(1)
Leaf 0.firstSameKeyIndex = 1
// winnerNode is still a WINNER
!isWinner() = false
→ no swap
winner = 0
parent = 0
Loop condition: 0 > 0 → false
tree[0] = 0
<4> Initialization complete
ini
tree = [0, 2, 1]
- tree[0] = 0 → global winner Leaf 0 (key=1, seq=100)
- tree[1] = 2 → loser Leaf 2 (key=1, seq=150)
- tree[2] = 1 → loser Leaf 1 (key=1, seq=200)
Leaf 0: kv={key=1, seq=100}, state=WINNER_WITH_NEW_KEY, firstSameKeyIndex=1
Leaf 1: kv={key=1, seq=200}, state=LOSER_WITH_SAME_KEY, firstSameKeyIndex=-1
Leaf 2: kv={key=1, seq=150}, state=LOSER_WITH_SAME_KEY, firstSameKeyIndex=2
(3) Executing SortMergeIterator.next()
First call to next()
java
public T next() throws IOException {
    while (true) {
        loserTree.adjustForNextLoop(); // step 1
        KeyValue winner = loserTree.popWinner(); // step 2
        if (winner == null) return null;
        mergeFunctionWrapper.reset(); // step 3
        mergeFunctionWrapper.add(winner); // step 4
        T result = merge(); // step 5
        if (result != null) return result;
    }
}
<1> Step 1: adjustForNextLoop()
java
public void adjustForNextLoop() throws IOException {
LeafIterator<T> winner = leaves.get(tree[0]);
while (winner.state == State.WINNER_POPPED) {
winner.advanceIfAvailable();
adjust(tree[0]);
winner = leaves.get(tree[0]);
}
}
ini
winner = leaves.get(0) = Leaf 0
Leaf 0.state = WINNER_WITH_NEW_KEY ≠ WINNER_POPPED
→ the loop condition is not met, return immediately
<2> Step 2: popWinner()
java
public T popWinner() {
LeafIterator<T> winner = leaves.get(tree[0]);
if (winner.state == State.WINNER_POPPED) {
return null;
}
T result = winner.pop();
adjust(tree[0]);
return result;
}
java
winner = Leaf 0
Leaf 0.state = WINNER_WITH_NEW_KEY ≠ WINNER_POPPED
result = Leaf 0.pop()
Leaf 0.state = WINNER_POPPED
return {key=1, seq=100, value="A"}
调用 adjust(0):
winner = 0
parent = 1
循环第 1 轮:
winnerNode = Leaf 0 (state=WINNER_POPPED, firstSameKeyIndex=1)
tree[1] = 2
parentNode = Leaf 2
// WINNER_POPPED 分支
firstSameKeyIndex = 1 ≥ 0
→ 快速路径
parent = 1
parentNode = Leaf 2 (tree[1])
Leaf 0.state = LOSER_POPPED
Leaf 2.state = WINNER_WITH_SAME_KEY
// 检查 !winnerNode.state.isWinner()
// Leaf 0 现在是 LOSER_POPPED (不是赢家)
!isWinner() = true
→ 交换:
tmp = 0
winner = tree[1] = 2
tree[1] = 0
winner = 2
parent = 0
循环条件:0 > 0 → false
tree[0] = 2
返回:{key=1, seq=100, value="A"}
当前状态:
ini
tree = [2, 0, 1]
- tree[0] = 2 → Leaf 2 是新的全局赢家
Leaf 0: state=LOSER_POPPED, firstSameKeyIndex=1
Leaf 1: state=LOSER_WITH_SAME_KEY, firstSameKeyIndex=-1
Leaf 2: state=WINNER_WITH_SAME_KEY, firstSameKeyIndex=2
<3> Steps 3-4: reset() and add()
ini
mergeFunctionWrapper.reset();
latestKv = null
mergeFunctionWrapper.add({key=1, seq=100, value="A"});
latestKv = {key=1, seq=100, value="A"}
<4> Step 5: merge()
java
private T merge() {
while (loserTree.peekWinner() != null) {
mergeFunctionWrapper.add(loserTree.popWinner());
}
return mergeFunctionWrapper.getResult();
}
《1》Loop iteration 1
java
peekWinner():
tree[0] = 2
Leaf 2.state = WINNER_WITH_SAME_KEY ≠ WINNER_POPPED
return Leaf 2.peek() = {key=1, seq=150, value="C"} ✅
popWinner():
winner = Leaf 2
Leaf 2.state ≠ WINNER_POPPED
result = Leaf 2.pop()
Leaf 2.state = WINNER_POPPED
return {key=1, seq=150, value="C"}
adjust(2):
winner = 2
parent = 2
winnerNode = Leaf 2 (WINNER_POPPED, firstSameKeyIndex=2)
tree[2] = 1
parentNode = Leaf 1
firstSameKeyIndex = 2 ≥ 0
parent = 2
parentNode = Leaf 1 (tree[2])
Leaf 2.state = LOSER_POPPED
Leaf 1.state = WINNER_WITH_SAME_KEY
!isWinner() = true
→ swap:
tmp = 2
winner = tree[2] = 1
tree[2] = 2
winner = 1
parent = 1
Loop round 2:
winnerNode = Leaf 1 (WINNER_WITH_SAME_KEY)
tree[1] = 0
parentNode = Leaf 0 (LOSER_POPPED)
// adjustWithSameWinnerKey
parentNode.state = LOSER_POPPED
→ return (nothing to do)
!isWinner() = false (Leaf 1 is a WINNER)
→ no swap
parent = 0
Loop condition: 0 > 0 → false
tree[0] = 1
add({key=1, seq=150, value="C"}):
latestKv = {key=1, seq=150, value="C"} // overwrites the previous record
Current state:
ini
tree = [1, 0, 2]
Leaf 0: state=LOSER_POPPED, firstSameKeyIndex=1
Leaf 1: state=WINNER_WITH_SAME_KEY, firstSameKeyIndex=-1
Leaf 2: state=LOSER_POPPED, firstSameKeyIndex=2
《2》Loop iteration 2
java
peekWinner():
tree[0] = 1
Leaf 1.state = WINNER_WITH_SAME_KEY ≠ WINNER_POPPED
return {key=1, seq=200, value="B"} ✅
popWinner():
result = Leaf 1.pop()
Leaf 1.state = WINNER_POPPED
return {key=1, seq=200, value="B"}
adjust(1):
winner = 1
parent = 2
winnerNode = Leaf 1 (WINNER_POPPED, firstSameKeyIndex=-1)
firstSameKeyIndex = -1 < 0
→ parent = -1 (stop adjusting)
!isWinner() = false → no swap
parent /= 2 → parent = 0 (integer division: -1 / 2 = 0)
Loop condition: 0 > 0 → false
tree[0] = 1
add({key=1, seq=200, value="B"}):
latestKv = {key=1, seq=200, value="B"} // overwrites the previous record
Current state:
ini
tree = [1, 0, 2]
Leaf 0: state=LOSER_POPPED, firstSameKeyIndex=1
Leaf 1: state=WINNER_POPPED, firstSameKeyIndex=-1
Leaf 2: state=LOSER_POPPED, firstSameKeyIndex=2
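《3》Loop iteration 3
java
peekWinner():
tree[0] = 1
Leaf 1.state = WINNER_POPPED
return null ❌
The loop ends
getResult():
return latestKv = {key=1, seq=200, value="B"}
(4) Final result
ini
Using the deduplicate merge engine (the last record overwrites earlier ones) as the example
next() returns: {key=1, seq=200, value="B"}
All records with the same key were collected:
1. seq=100 is added first
2. seq=150 overwrites it
3. seq=200 overwrites it last
The record that is kept is the one with the largest (latest) sequence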
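The reset/add/getResult cycle used in this walkthrough can be reduced to a few lines. This is a minimal sketch of a keep-the-latest merge, assuming string values in place of `KeyValue`; `KeepLatestMerge` and `mergeInOrder` are hypothetical names and not Paimon's merge function classes.

```java
final class KeepLatestMerge {
    private String latest; // hypothetical buffer, analogous to latestKv in the trace

    void reset() {
        latest = null;
    }

    void add(String value) {
        latest = value; // every add overwrites the previously added record
    }

    String getResult() {
        return latest;
    }

    // Feeds the values in the order the loser tree pops them (oldest first).
    static String mergeInOrder(String... valuesInPopOrder) {
        KeepLatestMerge merge = new KeepLatestMerge();
        merge.reset();
        for (String value : valuesInPopOrder) {
            merge.add(value);
        }
        return merge.getResult();
    }

    public static void main(String[] args) {
        // Pop order from the walkthrough: seq=100 ("A"), seq=150 ("C"), seq=200 ("B")
        System.out.println(mergeInOrder("A", "C", "B"));
    }
}
```

This only works because the loser tree guarantees that, within one key, records arrive in ascending sequence order.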
For the official example, see PIP-2: Optimize SortMergeReader with LoserTree - Paimon - Apache Software Foundation