1 The Flink Source Interface
The basic interface on the Flink side is Source; its two core methods are createEnumerator and createReader.
createEnumerator is responsible for split discovery and for creating the split enumerator; createReader creates the reader that actually consumes data. A minimal usage example follows the interface definition below.
java
public interface Source<T, SplitT extends SourceSplit, EnumChkT> extends Serializable {
  Boundedness getBoundedness();

  SourceReader<T, SplitT> createReader(SourceReaderContext readerContext) throws Exception;

  SplitEnumerator<SplitT, EnumChkT> createEnumerator(
      SplitEnumeratorContext<SplitT> enumContext) throws Exception;

  SplitEnumerator<SplitT, EnumChkT> restoreEnumerator(
      SplitEnumeratorContext<SplitT> enumContext, EnumChkT checkpoint) throws Exception;

  SimpleVersionedSerializer<SplitT> getSplitSerializer();

  SimpleVersionedSerializer<EnumChkT> getEnumeratorCheckpointSerializer();
}
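To make these call sites concrete, here is a minimal job built on Flink's bundled NumberSequenceSource, an existing Source implementation; fromSource is the entry point that ultimately triggers createEnumerator on the JobManager side and createReader on each parallel subtask.
java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.lib.NumberSequenceSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceApiDemo {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // NumberSequenceSource implements Source<Long, ...>; fromSource() wires it into
    // the job so createEnumerator/createReader run when the job is deployed
    env.fromSource(
            new NumberSequenceSource(1, 100),
            WatermarkStrategy.noWatermarks(),
            "number-source")
        .print();
    env.execute("source-api-demo");
  }
}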
createEnumerator is invoked by Flink's SourceCoordinator, a standalone component in Flink (it is actually created while the JobGraph is translated into the ExecutionGraph).
When the SourceCoordinator starts, it calls createEnumerator:
java
if (enumerator == null) {
  final ClassLoader userCodeClassLoader =
      context.getCoordinatorContext().getUserCodeClassloader();
  try (TemporaryClassLoaderContext ignored =
      TemporaryClassLoaderContext.of(userCodeClassLoader)) {
    enumerator = source.createEnumerator(context);
  } catch (Throwable t) {
    ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
    LOG.error("Failed to create Source Enumerator for source {}", operatorName, t);
    context.failJob(t);
    return;
  }
}
2 Iceberg's Implementation of the Core Interfaces
2.1 getBoundedness
This method marks the source as bounded or unbounded, which corresponds to batch versus streaming processing.
java
public Boundedness getBoundedness() {
  return scanContext.isStreaming() ? Boundedness.CONTINUOUS_UNBOUNDED : Boundedness.BOUNDED;
}
Bounded versus unbounded is chosen from the scan configuration; in Iceberg it can be set as follows:
sql
SELECT * FROM hjfdb.test2 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s')*/ ;
The full configuration key is connector.iceberg.streaming; a DataStream-API equivalent is sketched after the snippet:
java
public static final String STREAMING = "streaming";
public static final ConfigOption<Boolean> STREAMING_OPTION =
    ConfigOptions.key(PREFIX + STREAMING).booleanType().defaultValue(false);
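The same two options can also be set from the DataStream API through the IcebergSource builder. A minimal sketch, assuming the builder methods exposed by the iceberg-flink module (verify names against your version; the table path is an example):
java
import java.time.Duration;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;

// builder-based equivalent of the SQL hint above (sketch)
IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/hjfdb/test2")) // example path
        .streaming(true)                         // same effect as 'streaming'='true'
        .monitorInterval(Duration.ofSeconds(10)) // same effect as 'monitor-interval'='10s'
        .build();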
2.2 createEnumerator
createEnumerator and restoreEnumerator both delegate to the same method in Iceberg's implementation:
java
@Override
public SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> createEnumerator(
    SplitEnumeratorContext<IcebergSourceSplit> enumContext) {
  return createEnumerator(enumContext, null);
}

@Override
public SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> restoreEnumerator(
    SplitEnumeratorContext<IcebergSourceSplit> enumContext, IcebergEnumeratorState enumState) {
  return createEnumerator(enumContext, enumState);
}
2.2.1 The split assigner
Inside createEnumerator, a split assigner is created first:
java
SplitAssigner assigner;
if (enumState == null) {
  assigner = assignerFactory.createAssigner();
} else {
  LOG.info(
      "Iceberg source restored {} splits from state for table {}",
      enumState.pendingSplits().size(),
      lazyTable().name());
  assigner = assignerFactory.createAssigner(enumState.pendingSplits());
}
At present there is only one assigner implementation, SimpleSplitAssigner. Its core is an ArrayDeque used to store and hand out splits; a toy model follows the constructor snippet below:
java
public SimpleSplitAssigner() {
  this.pendingSplits = new ArrayDeque<>();
}

public SimpleSplitAssigner(Collection<IcebergSourceSplitState> assignerState) {
  this.pendingSplits = new ArrayDeque<>(assignerState.size());
  // Because simple assigner only tracks unassigned splits,
  // there is no need to filter splits based on status (unassigned) here.
  assignerState.forEach(splitState -> pendingSplits.add(splitState.split()));
}
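To see the queueing behavior in isolation, here is a toy model (not Iceberg code) of an ArrayDeque-backed assigner: discovery appends at the tail, assignment polls from the head, and an empty queue plays the role of an UNAVAILABLE result.
java
import java.util.ArrayDeque;
import java.util.List;

class ToySplitAssigner {
  private final ArrayDeque<String> pendingSplits = new ArrayDeque<>();

  // discovery (onDiscoveredSplits) appends to the queue
  void onDiscoveredSplits(List<String> splits) {
    pendingSplits.addAll(splits);
  }

  // assignment (getNext) polls from the head; null stands in for UNAVAILABLE
  String getNext() {
    return pendingSplits.poll();
  }
}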
2.2.2 SplitEnumerator
Once the split assigner is in place, the SplitEnumerator itself is created; there is a streaming variant and a batch variant:
java
if (scanContext.isStreaming()) {
  ContinuousSplitPlanner splitPlanner =
      new ContinuousSplitPlannerImpl(tableLoader.clone(), scanContext, planningThreadName());
  return new ContinuousIcebergEnumerator(
      enumContext, assigner, scanContext, splitPlanner, enumState);
} else {
  List<IcebergSourceSplit> splits = planSplitsForBatch(planningThreadName());
  assigner.onDiscoveredSplits(splits);
  return new StaticIcebergEnumerator(enumContext, assigner);
}
- ContinuousSplitPlanner
Streaming needs a plan for continuously scanning data: the ContinuousSplitPlanner, whose implementation class is ContinuousSplitPlannerImpl.
When ContinuousSplitPlannerImpl is constructed, a worker thread pool is set up:
java
this.isSharedPool = threadName == null;
this.workerPool =
    isSharedPool
        ? ThreadPools.getWorkerPool()
        : ThreadPools.newWorkerPool(
            "iceberg-plan-worker-pool-" + threadName, scanContext.planParallelism());
In practice threadName is non-null here, so a dedicated pool is used. Its parallelism is set by table.exec.iceberg.worker-pool-size; the default is derived from the number of available CPU cores, and a way to override it is sketched after the snippet:
java
public static final int WORKER_THREAD_POOL_SIZE =
    getPoolSize(
        WORKER_THREAD_POOL_SIZE_PROP, Math.max(2, Runtime.getRuntime().availableProcessors()));
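To override the default, the option can be placed in the Flink configuration before the execution environment is created. A minimal sketch, assuming the configuration key quoted above is honored by your iceberg-flink version:
java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// plan splits with 4 worker threads instead of the CPU-derived default
conf.setString("table.exec.iceberg.worker-pool-size", "4");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);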
- ContinuousIcebergEnumerator
A central piece of ContinuousIcebergEnumerator is that start() registers a periodic asynchronous call:
java
public void start() {
  super.start();
  enumeratorContext.callAsync(
      this::discoverSplits,
      this::processDiscoveredSplits,
      0L,
      scanContext.monitorInterval().toMillis());
}
The interval is configured the same way as the streaming flag:
sql
SELECT * FROM hjfdb.test2 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s')*/ ;
The full configuration key is connector.iceberg.monitor-interval:
java
public static final String MONITOR_INTERVAL = "monitor-interval";
public static final ConfigOption<String> MONITOR_INTERVAL_OPTION =
    ConfigOptions.key(PREFIX + MONITOR_INTERVAL).stringType().defaultValue("60s");
start() is likewise invoked from SourceCoordinator, immediately after the enumerator has been created:
java
runInEventLoop(() -> enumerator.start(), "starting the SplitEnumerator.");
- planSplitsForBatch
In batch mode, the table is scanned right away to obtain the split list (the scan itself is covered in a later section).
Once the splits are determined, they are passed straight to the split assigner:
java
assigner.onDiscoveredSplits(splits);
Finally, a batch enumerator is created:
java
return new StaticIcebergEnumerator(enumContext, assigner);
2.3 createReader
createReader is comparatively simple: it directly creates an IcebergSourceReader.
java
public SourceReader<T, IcebergSourceSplit> createReader(SourceReaderContext readerContext) {
  IcebergSourceReaderMetrics metrics =
      new IcebergSourceReaderMetrics(readerContext.metricGroup(), lazyTable().name());
  return new IcebergSourceReader<>(metrics, readerFunction, readerContext);
}
3 ContinuousIcebergEnumerator
As shown earlier, ContinuousIcebergEnumerator starts as follows:
java
public void start() {
  super.start();
  enumeratorContext.callAsync(
      this::discoverSplits,
      this::processDiscoveredSplits,
      0L,
      scanContext.monitorInterval().toMillis());
}
This is a periodic asynchronous callback. Its core is the pair discoverSplits and processDiscoveredSplits: the result of discoverSplits is handed to processDiscoveredSplits. A simplified model of callAsync follows.
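callAsync itself is generic Flink machinery. As a simplified model of its contract (not Flink's implementation): evaluate the callable on a worker thread at a fixed period and hand the result, or the error, to the handler; the real coordinator additionally marshals the handler back onto its single-threaded event loop.
java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

class ToyAsyncCaller {
  private final ScheduledExecutorService worker = Executors.newSingleThreadScheduledExecutor();

  // model of SplitEnumeratorContext#callAsync(callable, handler, initialDelay, period)
  <T> void callAsync(Callable<T> callable, BiConsumer<T, Throwable> handler,
                     long initialDelayMs, long periodMs) {
    worker.scheduleWithFixedDelay(() -> {
      try {
        handler.accept(callable.call(), null); // success: pass the discovery result
      } catch (Throwable t) {
        handler.accept(null, t); // failure: pass the error instead
      }
    }, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
  }
}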
3.1 discoverSplits
This operation scans for the complete set of new splits. There is a threshold that pauses discovery when too many splits are already pending; the code comment explains it, and a worked numeric example follows the snippet:
java
// if ScanContext#maxPlanningSnapshotCount() is 10, each split enumeration can
// discover splits for up to 10 snapshots. if maxHistorySize is 3, the max number of
// splits tracked in assigner shouldn't be more than 10 * (3 + 1) snapshots
// worth of splits. +1 because there could be another enumeration when the
// pending splits fall just below the 10 * 3.
int totalSplitCountFromRecentDiscovery = Arrays.stream(history).reduce(0, Integer::sum);
return pendingSplitCountFromAssigner >= totalSplitCountFromRecentDiscovery;
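A quick worked example of the condition, assuming maxPlanningSnapshotCount = 10 and a discovery history of the last three rounds:
java
// split counts of the last 3 discovery rounds (maxHistorySize = 3)
int[] history = {10, 10, 10};
int pendingSplitCountFromAssigner = 35;
int totalSplitCountFromRecentDiscovery = java.util.Arrays.stream(history).sum(); // 30
// 35 >= 30, so discovery pauses until readers drain the backlog
boolean shouldPause = pendingSplitCountFromAssigner >= totalSplitCountFromRecentDiscovery;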
When the threshold has not been reached, splitPlanner.planSplits performs the scan; lastPosition here corresponds to a snapshot:
java
public ContinuousEnumerationResult planSplits(IcebergEnumeratorPosition lastPosition) {
  table.refresh();
  if (lastPosition != null) {
    return discoverIncrementalSplits(lastPosition);
  } else {
    return discoverInitialSplits();
  }
}
3.1.1 discoverInitialSplits
This performs the initial scan. There are several starting modes, enumerated in StreamingStartingStrategy; only TABLE_SCAN_THEN_INCREMENTAL performs an ordinary full scan first and then switches to incremental scanning.
Accordingly, the code branches on the strategy: for TABLE_SCAN_THEN_INCREMENTAL it runs one batch scan and returns its splits; for every other strategy the split list is empty. In all modes the position is recorded, so subsequent cycles no longer enter this method.
The initial full scan uses the following entry point, which is also the batch-scan entry point:
java
public static List<IcebergSourceSplit> planIcebergSourceSplits(
    Table table, ScanContext context, ExecutorService workerPool) {
  try (CloseableIterable<CombinedScanTask> tasksIterable =
      planTasks(table, context, workerPool)) {
    return Lists.newArrayList(
        CloseableIterable.transform(tasksIterable, IcebergSourceSplit::fromCombinedScanTask));
  } catch (IOException e) {
    throw new UncheckedIOException("Failed to process task iterable: ", e);
  }
}
The real scan happens in planTasks, which distinguishes incremental from non-incremental mode as shown below. The initial streaming scan leaves all four of these values null, so it also takes the batch path:
java
static ScanMode checkScanMode(ScanContext context) {
  if (context.startSnapshotId() != null
      || context.endSnapshotId() != null
      || context.startTag() != null
      || context.endTag() != null) {
    return ScanMode.INCREMENTAL_APPEND_SCAN;
  } else {
    return ScanMode.BATCH;
  }
}
The batch scan is just Iceberg's ordinary TableScan (a standalone usage sketch follows the snippet):
java
TableScan scan = table.newScan();
scan = refineScanWithBaseConfigs(scan, context, workerPool);

if (context.snapshotId() != null) {
  scan = scan.useSnapshot(context.snapshotId());
} else if (context.tag() != null) {
  scan = scan.useRef(context.tag());
} else if (context.branch() != null) {
  scan = scan.useRef(context.branch());
}

if (context.asOfTimestamp() != null) {
  scan = scan.asOfTime(context.asOfTimestamp());
}

return scan.planTasks();
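For comparison, the same pattern through Iceberg's public core API: a plain TableScan pinned to a snapshot, followed by task planning. A minimal sketch; table and snapshotId are assumed inputs:
java
import java.io.IOException;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.io.CloseableIterable;

static void printBatchTasks(Table table, long snapshotId) throws IOException {
  TableScan scan = table.newScan().useSnapshot(snapshotId);
  // planTasks() groups file splits into CombinedScanTasks, as in planIcebergSourceSplits above
  try (CloseableIterable<CombinedScanTask> tasks = scan.planTasks()) {
    for (CombinedScanTask task : tasks) {
      System.out.println("task with " + task.files().size() + " file splits");
    }
  }
}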
3.1.2 discoverIncrementalSplits
This method reads incremental data. If the current snapshot is null, or it equals the snapshot scanned last time, there is no incremental data and an empty split list is returned; otherwise the incremental splits are planned.
First the range between the last scanned snapshot and the current snapshot is computed, then the scan is invoked:
java
ScanContext incrementalScan =
    scanContext.copyWithAppendsBetween(
        lastPosition.snapshotId(), toSnapshotInclusive.snapshotId());
List<IcebergSourceSplit> splits =
    FlinkSplitPlanner.planIcebergSourceSplits(table, incrementalScan, workerPool);
The scan goes through the same entry point as in the previous section, but no longer takes the batch path: the mode is now INCREMENTAL_APPEND_SCAN, which uses a dedicated incremental scanner:
java
IncrementalAppendScan scan = table.newIncrementalAppendScan();
Its implementation class is BaseIncrementalAppendScan. The scan first computes all incremental snapshots (a toy model of the ancestor walk follows the snippet):
java
// appendsBetween handles null fromSnapshotId (exclusive) properly
List<Snapshot> snapshots =
    appendsBetween(table(), fromSnapshotIdExclusive, toSnapshotIdInclusive);

private static List<Snapshot> appendsBetween(
    Table table, Long fromSnapshotIdExclusive, long toSnapshotIdInclusive) {
  List<Snapshot> snapshots = Lists.newArrayList();
  for (Snapshot snapshot :
      SnapshotUtil.ancestorsBetween(
          toSnapshotIdInclusive, fromSnapshotIdExclusive, table::snapshot)) {
    if (snapshot.operation().equals(DataOperations.APPEND)) {
      snapshots.add(snapshot);
    }
  }
  return snapshots;
}
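The ancestor walk is easy to model in isolation. A toy version (not Iceberg code; uses a Java 16+ record) that follows parent pointers from the inclusive end snapshot back to the exclusive start, keeping append snapshots only:
java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

record ToySnapshot(long id, Long parentId, String operation) {}

static List<ToySnapshot> appendsBetween(
    long toInclusive, Long fromExclusive, LongFunction<ToySnapshot> lookup) {
  List<ToySnapshot> result = new ArrayList<>();
  Long cursor = toInclusive;
  while (cursor != null && !cursor.equals(fromExclusive)) {
    ToySnapshot snapshot = lookup.apply(cursor);
    if ("append".equals(snapshot.operation())) {
      result.add(snapshot); // only append snapshots contribute incremental splits
    }
    cursor = snapshot.parentId();
  }
  return result;
}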
Next, all dataManifests are collected from those snapshots (manifests come in two kinds, dataManifests and deleteManifests):
java
Set<ManifestFile> manifests =
    FluentIterable.from(snapshots)
        .transformAndConcat(snapshot -> snapshot.dataManifests(table().io()))
        .filter(manifestFile -> snapshotIds.contains(manifestFile.snapshotId()))
        .toSet();
Finally, files are planned through the regular Iceberg manifest scan. The key to reading only appended data while skipping deletes is the ManifestEntry.Status.ADDED filter; data removed by an overwrite or delete lives in manifests of the delete type.
java
ManifestGroup manifestGroup =
    new ManifestGroup(table().io(), manifests)
        .caseSensitive(isCaseSensitive())
        .select(scanColumns())
        .filterData(filter())
        .filterManifestEntries(
            manifestEntry ->
                snapshotIds.contains(manifestEntry.snapshotId())
                    && manifestEntry.status() == ManifestEntry.Status.ADDED)
        .specsById(table().specs())
        .ignoreDeleted();

if (context().ignoreResiduals()) {
  manifestGroup = manifestGroup.ignoreResiduals();
}

if (manifests.size() > 1 && shouldPlanWithExecutor()) {
  manifestGroup = manifestGroup.planWith(planExecutor());
}

return manifestGroup.planFiles();
3.2 processDiscoveredSplits
This step puts every split discovered in the previous step into the assigner. Note that, because discovery runs asynchronously, the result's starting position must match the enumerator's current position, otherwise the result is discarded; a sketch of that guard follows the snippet.
java
assigner.onDiscoveredSplits(result.splits());
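The surrounding guard looks roughly like the following sketch of ContinuousIcebergEnumerator#processDiscoveredSplits; exact field and method names may differ between versions:
java
private void processDiscoveredSplits(ContinuousEnumerationResult result, Throwable error) {
  if (error != null) {
    LOG.error("Failed to discover new splits", error);
  } else if (!Objects.equals(result.fromPosition(), enumeratorPosition.get())) {
    // the enumerator position moved (e.g. a restore) while discovery ran on the
    // worker thread: the result is stale and is dropped
  } else {
    assigner.onDiscoveredSplits(result.splits());
    enumeratorPosition.set(result.toPosition());
  }
}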
4 Split Assignment: AbstractIcebergEnumerator
After the steps above, all splits sit in the assigner; none have been handed out yet. Assignment is defined in AbstractIcebergEnumerator, the topmost enumerator base class in Iceberg, which directly implements Flink's SplitEnumerator.
SplitEnumerator defines the handleSourceEvent method, which processes custom events coming from the readers and is invoked by SourceCoordinator, as the routing snippet below shows; Iceberg's own handling is sketched right after it:
java
} else if (event instanceof SourceEventWrapper) {
  final SourceEvent sourceEvent =
      ((SourceEventWrapper) event).getSourceEvent();
  LOG.debug(
      "Source {} received custom event from parallel task {}: {}",
      operatorName,
      subtask,
      sourceEvent);
  enumerator.handleSourceEvent(subtask, sourceEvent);
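On the Iceberg side, handleSourceEvent records the requesting reader and triggers assignment. Roughly, as a sketch of AbstractIcebergEnumerator#handleSourceEvent (check your version for the exact code):
java
@Override
public void handleSourceEvent(int subtaskId, SourceEvent sourceEvent) {
  if (sourceEvent instanceof SplitRequestEvent) {
    SplitRequestEvent splitRequestEvent = (SplitRequestEvent) sourceEvent;
    // acknowledge finished splits, queue the reader, then try to assign
    assigner.onCompletedSplits(splitRequestEvent.finishedSplitIds());
    readersAwaitingSplit.put(subtaskId, splitRequestEvent.requesterHostname());
    assignSplits();
  } else {
    throw new IllegalArgumentException(
        String.format("Received unknown event from subtask %d: %s", subtaskId, sourceEvent));
  }
}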
Iceberg's implementation completes the split assignment in the assignSplits method. Notably, it does not hand out everything at once: each waiting reader receives one split per request.
java
Iterator<Map.Entry<Integer, String>> awaitingReader =
    readersAwaitingSplit.entrySet().iterator();
while (awaitingReader.hasNext()) {
  Map.Entry<Integer, String> nextAwaiting = awaitingReader.next();
  // if the reader that requested another split has failed in the meantime, remove
  // it from the list of waiting readers
  if (!enumeratorContext.registeredReaders().containsKey(nextAwaiting.getKey())) {
    awaitingReader.remove();
    continue;
  }

  int awaitingSubtask = nextAwaiting.getKey();
  String hostname = nextAwaiting.getValue();
  GetSplitResult getResult = assigner.getNext(hostname);
  if (getResult.status() == GetSplitResult.Status.AVAILABLE) {
    LOG.info("Assign split to subtask {}: {}", awaitingSubtask, getResult.split());
    enumeratorContext.assignSplit(getResult.split(), awaitingSubtask);
    awaitingReader.remove();
  }
  // ... handling of the UNAVAILABLE statuses is omitted here
}
5 IcebergSourceReader
The Source's createReader is invoked when the stream operator is created (createStreamOperator), so readers are task-level and run in parallel on the TaskManagers.
5.1 Requesting splits
Many of IcebergSourceReader's code paths eventually call requestSplit, which sends a split request to fetch new splits; this is exactly the request received by the SplitEnumerator from the previous chapter. One trigger point is sketched after the snippet:
java
private void requestSplit(Collection<String> finishedSplitIds) {
  context.sendSourceEventToCoordinator(new SplitRequestEvent(finishedSplitIds));
}
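One place this fires is reader startup: if no splits were restored from a checkpoint, the reader immediately requests its first split. A sketch of IcebergSourceReader#start (verify against your version):
java
@Override
public void start() {
  // request the first split only if the checkpoint restored none
  if (getNumberOfCurrentlyAssignedSplits() == 0) {
    requestSplit(Collections.emptyList());
  }
}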
5.2 Processing splits
When IcebergSourceReader is constructed, it delegates to its superclass and passes in several handler classes; the most important is IcebergSourceSplitReader.
java
public IcebergSourceReader(
    IcebergSourceReaderMetrics metrics,
    ReaderFunction<T> readerFunction,
    SourceReaderContext context) {
  super(
      () -> new IcebergSourceSplitReader<>(metrics, readerFunction, context),
      new IcebergSourceRecordEmitter<>(),
      context.getConfiguration(),
      context);
}
In Flink, IcebergSourceSplitReader is wrapped inside a SplitFetcherManager. When Flink's SourceOperator handles an AddSplitEvent, it calls sourceReader.addSplits(newSplits), which in turn calls SplitFetcherManager's addSplits method. That method performs two actions: 1. createSplitFetcher; 2. fetcher.addSplits(splitsToAdd).
createSplitFetcher creates a SplitFetcher, which internally creates a FetchTask whose parameters include the SplitReader:
java
this.fetchTask =
    new FetchTask<>(
        splitReader,
        elementsQueue,
        ids -> {
          ids.forEach(assignedSplits::remove);
          splitFinishedHook.accept(ids);
          LOG.info("Finished reading from splits {}", ids);
        },
        id);
FetchTask's run method calls the SplitReader's fetch method to read records from the assigned splits; Iceberg's implementation is as follows:
java
public RecordsWithSplitIds<RecordAndPosition<T>> fetch() throws IOException {
  metrics.incrementSplitReaderFetchCalls(1);
  if (currentReader == null) {
    IcebergSourceSplit nextSplit = splits.poll();
    if (nextSplit != null) {
      currentSplit = nextSplit;
      currentSplitId = nextSplit.splitId();
      currentReader = openSplitFunction.apply(currentSplit);
    } else {
      // return an empty result, which will lead to split fetch to be idle.
      // SplitFetcherManager will then close idle fetcher.
      return new RecordsBySplits(Collections.emptyMap(), Collections.emptySet());
    }
  }

  if (currentReader.hasNext()) {
    // Because Iterator#next() doesn't support checked exception,
    // we need to wrap and unwrap the checked IOException with UncheckedIOException
    try {
      return currentReader.next();
    } catch (UncheckedIOException e) {
      throw e.getCause();
    }
  } else {
    return finishSplit();
  }
}
openSplitFunction is the function that actually opens and reads files; it is chosen in IcebergSource:
java
if (readerFunction == null) {
  if (table instanceof BaseMetadataTable) {
    MetaDataReaderFunction rowDataReaderFunction =
        new MetaDataReaderFunction(
            flinkConfig, table.schema(), context.project(), table.io(), table.encryption());
    this.readerFunction = (ReaderFunction<T>) rowDataReaderFunction;
  } else {
    RowDataReaderFunction rowDataReaderFunction =
        new RowDataReaderFunction(
            flinkConfig,
            table.schema(),
            context.project(),
            context.nameMapping(),
            context.caseSensitive(),
            table.io(),
            table.encryption(),
            context.filters());
    this.readerFunction = (ReaderFunction<T>) rowDataReaderFunction;
  }
}
The data consumed in fetch comes from splits.poll(); the splits queue is populated by handleSplitsChanges, which is driven by the fetcher.addSplits(splitsToAdd) call that follows createSplitFetcher, as described above:
java
public void handleSplitsChanges(SplitsChange<IcebergSourceSplit> splitsChange) {
  if (!(splitsChange instanceof SplitsAddition)) {
    throw new UnsupportedOperationException(
        String.format("Unsupported split change: %s", splitsChange.getClass()));
  }

  LOG.info("Add {} splits to reader", splitsChange.splits().size());
  splits.addAll(splitsChange.splits());
  metrics.incrementAssignedSplits(splitsChange.splits().size());
  metrics.incrementAssignedBytes(calculateBytes(splitsChange));
}
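Putting it all together, a streaming read can be driven end to end from Flink SQL. A hedged recap using the options discussed above; the catalog, warehouse path, and table names are examples:
java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingReadDemo {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // register an Iceberg catalog (Hadoop-type here; adjust for your deployment)
    tEnv.executeSql(
        "CREATE CATALOG iceberg WITH ("
            + "'type'='iceberg', 'catalog-type'='hadoop',"
            + "'warehouse'='hdfs://nn:8020/warehouse')");
    tEnv.executeSql("USE CATALOG iceberg");
    // the hint flips getBoundedness() to CONTINUOUS_UNBOUNDED and sets the
    // discovery period used by ContinuousIcebergEnumerator#start
    tEnv.executeSql(
            "SELECT * FROM hjfdb.test2 "
                + "/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */")
        .print();
  }
}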