1 The Flink Source Interface
The basic interface on the Flink side is Source; its two core methods are createEnumerator and createReader.
createEnumerator is responsible for split discovery and for creating the split enumerator; createReader creates the reader that actually consumes data. A minimal usage example follows the interface definition below.
java
public interface Source<T, SplitT extends SourceSplit, EnumChkT> extends Serializable {
  Boundedness getBoundedness();

  SourceReader<T, SplitT> createReader(SourceReaderContext readerContext) throws Exception;

  SplitEnumerator<SplitT, EnumChkT> createEnumerator(
      SplitEnumeratorContext<SplitT> enumContext) throws Exception;

  SplitEnumerator<SplitT, EnumChkT> restoreEnumerator(
      SplitEnumeratorContext<SplitT> enumContext, EnumChkT checkpoint) throws Exception;

  SimpleVersionedSerializer<SplitT> getSplitSerializer();

  SimpleVersionedSerializer<EnumChkT> getEnumeratorCheckpointSerializer();
}
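To make these call sites concrete, here is a minimal job built on Flink's bundled NumberSequenceSource, an existing Source implementation; fromSource is the entry point that ultimately triggers createEnumerator on the JobManager side and createReader on each parallel subtask.
java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.lib.NumberSequenceSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceApiDemo {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // NumberSequenceSource implements Source<Long, ...>; fromSource() wires it into
    // the job so createEnumerator/createReader run when the job is deployed
    env.fromSource(
            new NumberSequenceSource(1, 100),
            WatermarkStrategy.noWatermarks(),
            "number-source")
        .print();
    env.execute("source-api-demo");
  }
}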
createEnumerator is invoked by Flink's SourceCoordinator, a standalone component in Flink (it is actually created while the JobGraph is translated into the ExecutionGraph).
When the SourceCoordinator starts, it calls createEnumerator:
java
if (enumerator == null) {
  final ClassLoader userCodeClassLoader =
      context.getCoordinatorContext().getUserCodeClassloader();
  try (TemporaryClassLoaderContext ignored =
      TemporaryClassLoaderContext.of(userCodeClassLoader)) {
    enumerator = source.createEnumerator(context);
  } catch (Throwable t) {
    ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
    LOG.error("Failed to create Source Enumerator for source {}", operatorName, t);
    context.failJob(t);
    return;
  }
}
2 Iceberg's Implementation of the Core Interfaces
2.1 getBoundedness
This method marks the source as bounded or unbounded, which corresponds to batch versus streaming processing.
java
public Boundedness getBoundedness() {
  return scanContext.isStreaming() ? Boundedness.CONTINUOUS_UNBOUNDED : Boundedness.BOUNDED;
}
Bounded versus unbounded is chosen from the scan configuration; in Iceberg it can be set as follows:
sql
SELECT * FROM hjfdb.test2 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s')*/ ;
The full configuration key is connector.iceberg.streaming; a DataStream-API equivalent is sketched after the snippet:
java
public static final String STREAMING = "streaming";
public static final ConfigOption<Boolean> STREAMING_OPTION =
    ConfigOptions.key(PREFIX + STREAMING).booleanType().defaultValue(false);
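The same two options can also be set from the DataStream API through the IcebergSource builder. A minimal sketch, assuming the builder methods exposed by the iceberg-flink module (verify names against your version; the table path is an example):
java
import java.time.Duration;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;

// builder-based equivalent of the SQL hint above (sketch)
IcebergSource<RowData> source =
    IcebergSource.forRowData()
        .tableLoader(TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/hjfdb/test2")) // example path
        .streaming(true)                         // same effect as 'streaming'='true'
        .monitorInterval(Duration.ofSeconds(10)) // same effect as 'monitor-interval'='10s'
        .build();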
2.2 createEnumerator
createEnumerator and restoreEnumerator both delegate to the same method in Iceberg's implementation:
java
@Override
public SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> createEnumerator(
    SplitEnumeratorContext<IcebergSourceSplit> enumContext) {
  return createEnumerator(enumContext, null);
}

@Override
public SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> restoreEnumerator(
    SplitEnumeratorContext<IcebergSourceSplit> enumContext, IcebergEnumeratorState enumState) {
  return createEnumerator(enumContext, enumState);
}
2.2.1 The split assigner
Inside createEnumerator, a split assigner is created first:
java
SplitAssigner assigner;
if (enumState == null) {
  assigner = assignerFactory.createAssigner();
} else {
  LOG.info(
      "Iceberg source restored {} splits from state for table {}",
      enumState.pendingSplits().size(),
      lazyTable().name());
  assigner = assignerFactory.createAssigner(enumState.pendingSplits());
}
At present there is only one assigner implementation, SimpleSplitAssigner. Its core is an ArrayDeque used to store and hand out splits; a toy model follows the constructor snippet below:
java
public SimpleSplitAssigner() {
  this.pendingSplits = new ArrayDeque<>();
}

public SimpleSplitAssigner(Collection<IcebergSourceSplitState> assignerState) {
  this.pendingSplits = new ArrayDeque<>(assignerState.size());
  // Because simple assigner only tracks unassigned splits,
  // there is no need to filter splits based on status (unassigned) here.
  assignerState.forEach(splitState -> pendingSplits.add(splitState.split()));
}
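To see the queueing behavior in isolation, here is a toy model (not Iceberg code) of an ArrayDeque-backed assigner: discovery appends at the tail, assignment polls from the head, and an empty queue plays the role of an UNAVAILABLE result.
java
import java.util.ArrayDeque;
import java.util.List;

class ToySplitAssigner {
  private final ArrayDeque<String> pendingSplits = new ArrayDeque<>();

  // discovery (onDiscoveredSplits) appends to the queue
  void onDiscoveredSplits(List<String> splits) {
    pendingSplits.addAll(splits);
  }

  // assignment (getNext) polls from the head; null stands in for UNAVAILABLE
  String getNext() {
    return pendingSplits.poll();
  }
}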
2.2.2 SplitEnumerator
Once the split assigner is in place, the SplitEnumerator itself is created; there is a streaming variant and a batch variant:
java
if (scanContext.isStreaming()) {
  ContinuousSplitPlanner splitPlanner =
      new ContinuousSplitPlannerImpl(tableLoader.clone(), scanContext, planningThreadName());
  return new ContinuousIcebergEnumerator(
      enumContext, assigner, scanContext, splitPlanner, enumState);
} else {
  List<IcebergSourceSplit> splits = planSplitsForBatch(planningThreadName());
  assigner.onDiscoveredSplits(splits);
  return new StaticIcebergEnumerator(enumContext, assigner);
}
- ContinuousSplitPlanner
Streaming needs a plan for continuously scanning data: the ContinuousSplitPlanner, whose implementation class is ContinuousSplitPlannerImpl.
When ContinuousSplitPlannerImpl is constructed, a worker thread pool is set up:
java
this.isSharedPool = threadName == null;
this.workerPool =
    isSharedPool
        ? ThreadPools.getWorkerPool()
        : ThreadPools.newWorkerPool(
            "iceberg-plan-worker-pool-" + threadName, scanContext.planParallelism());
In practice threadName is non-null here, so a dedicated pool is used. Its parallelism is set by table.exec.iceberg.worker-pool-size; the default is derived from the number of available CPU cores, and a way to override it is sketched after the snippet:
java
public static final int WORKER_THREAD_POOL_SIZE =
    getPoolSize(
        WORKER_THREAD_POOL_SIZE_PROP, Math.max(2, Runtime.getRuntime().availableProcessors()));
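To override the default, the option can be placed in the Flink configuration before the execution environment is created. A minimal sketch, assuming the configuration key quoted above is honored by your iceberg-flink version:
java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// plan splits with 4 worker threads instead of the CPU-derived default
conf.setString("table.exec.iceberg.worker-pool-size", "4");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);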
- ContinuousIcebergEnumerator
A central piece of ContinuousIcebergEnumerator is that start() registers a periodic asynchronous call:
java
public void start() {
  super.start();
  enumeratorContext.callAsync(
      this::discoverSplits,
      this::processDiscoveredSplits,
      0L,
      scanContext.monitorInterval().toMillis());
}
The interval is configured the same way as the streaming flag:
sql
SELECT * FROM hjfdb.test2 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s')*/ ;
The full configuration key is connector.iceberg.monitor-interval:
java
public static final String MONITOR_INTERVAL = "monitor-interval";
public static final ConfigOption<String> MONITOR_INTERVAL_OPTION =
    ConfigOptions.key(PREFIX + MONITOR_INTERVAL).stringType().defaultValue("60s");
start() is likewise invoked from SourceCoordinator, immediately after the enumerator has been created:
java
runInEventLoop(() -> enumerator.start(), "starting the SplitEnumerator.");
- planSplitsForBatch
In batch mode, the table is scanned right away to obtain the split list (the scan itself is covered in a later section).
Once the splits are determined, they are passed straight to the split assigner:
java
assigner.onDiscoveredSplits(splits);
Finally, a batch enumerator is created:
java
return new StaticIcebergEnumerator(enumContext, assigner);
2.3 createReader
createReader is comparatively simple: it directly creates an IcebergSourceReader.
java
public SourceReader<T, IcebergSourceSplit> createReader(SourceReaderContext readerContext) {
  IcebergSourceReaderMetrics metrics =
      new IcebergSourceReaderMetrics(readerContext.metricGroup(), lazyTable().name());
  return new IcebergSourceReader<>(metrics, readerFunction, readerContext);
}
3 ContinuousIcebergEnumerator
As shown earlier, ContinuousIcebergEnumerator starts as follows:
java
public void start() {
  super.start();
  enumeratorContext.callAsync(
      this::discoverSplits,
      this::processDiscoveredSplits,
      0L,
      scanContext.monitorInterval().toMillis());
}
This is a periodic asynchronous callback. Its core is the pair discoverSplits and processDiscoveredSplits: the result of discoverSplits is handed to processDiscoveredSplits. A simplified model of callAsync follows.
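callAsync itself is generic Flink machinery. As a simplified model of its contract (not Flink's implementation): evaluate the callable on a worker thread at a fixed period and hand the result, or the error, to the handler; the real coordinator additionally marshals the handler back onto its single-threaded event loop.
java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

class ToyAsyncCaller {
  private final ScheduledExecutorService worker = Executors.newSingleThreadScheduledExecutor();

  // model of SplitEnumeratorContext#callAsync(callable, handler, initialDelay, period)
  <T> void callAsync(Callable<T> callable, BiConsumer<T, Throwable> handler,
                     long initialDelayMs, long periodMs) {
    worker.scheduleWithFixedDelay(() -> {
      try {
        handler.accept(callable.call(), null); // success: pass the discovery result
      } catch (Throwable t) {
        handler.accept(null, t); // failure: pass the error instead
      }
    }, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
  }
}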
3.1 discoverSplits
This operation scans for the complete set of new splits. There is a threshold that pauses discovery when too many splits are already pending; the code comment explains it, and a worked numeric example follows the snippet:
java
// if ScanContext#maxPlanningSnapshotCount() is 10, each split enumeration can
// discover splits for up to 10 snapshots. if maxHistorySize is 3, the max number of
// splits tracked in assigner shouldn't be more than 10 * (3 + 1) snapshots
// worth of splits. +1 because there could be another enumeration when the
// pending splits fall just below the 10 * 3.
int totalSplitCountFromRecentDiscovery = Arrays.stream(history).reduce(0, Integer::sum);
return pendingSplitCountFromAssigner >= totalSplitCountFromRecentDiscovery;
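A quick worked example of the condition, assuming maxPlanningSnapshotCount = 10 and a discovery history of the last three rounds:
java
// split counts of the last 3 discovery rounds (maxHistorySize = 3)
int[] history = {10, 10, 10};
int pendingSplitCountFromAssigner = 35;
int totalSplitCountFromRecentDiscovery = java.util.Arrays.stream(history).sum(); // 30
// 35 >= 30, so discovery pauses until readers drain the backlog
boolean shouldPause = pendingSplitCountFromAssigner >= totalSplitCountFromRecentDiscovery;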
When the threshold has not been reached, splitPlanner.planSplits performs the scan; lastPosition here corresponds to a snapshot:
java
public ContinuousEnumerationResult planSplits(IcebergEnumeratorPosition lastPosition) {
  table.refresh();
  if (lastPosition != null) {
    return discoverIncrementalSplits(lastPosition);
  } else {
    return discoverInitialSplits();
  }
}
3.1.1 discoverInitialSplits
This performs the initial scan. There are several starting modes, enumerated in StreamingStartingStrategy; only TABLE_SCAN_THEN_INCREMENTAL performs an ordinary full scan first and then switches to incremental scanning.
Accordingly, the code branches on the strategy: for TABLE_SCAN_THEN_INCREMENTAL it runs one batch scan and returns its splits; for every other strategy the split list is empty. In all modes the position is recorded, so subsequent cycles no longer enter this method.
The initial full scan uses the following entry point, which is also the batch-scan entry point:
java
public static List<IcebergSourceSplit> planIcebergSourceSplits(
    Table table, ScanContext context, ExecutorService workerPool) {
  try (CloseableIterable<CombinedScanTask> tasksIterable =
      planTasks(table, context, workerPool)) {
    return Lists.newArrayList(
        CloseableIterable.transform(tasksIterable, IcebergSourceSplit::fromCombinedScanTask));
  } catch (IOException e) {
    throw new UncheckedIOException("Failed to process task iterable: ", e);
  }
}
The real scan happens in planTasks, which distinguishes incremental from non-incremental mode as shown below. The initial streaming scan leaves all four of these values null, so it also takes the batch path:
java
static ScanMode checkScanMode(ScanContext context) {
  if (context.startSnapshotId() != null
      || context.endSnapshotId() != null
      || context.startTag() != null
      || context.endTag() != null) {
    return ScanMode.INCREMENTAL_APPEND_SCAN;
  } else {
    return ScanMode.BATCH;
  }
}
The batch scan is just Iceberg's ordinary TableScan (a standalone usage sketch follows the snippet):
java
TableScan scan = table.newScan();
scan = refineScanWithBaseConfigs(scan, context, workerPool);

if (context.snapshotId() != null) {
  scan = scan.useSnapshot(context.snapshotId());
} else if (context.tag() != null) {
  scan = scan.useRef(context.tag());
} else if (context.branch() != null) {
  scan = scan.useRef(context.branch());
}

if (context.asOfTimestamp() != null) {
  scan = scan.asOfTime(context.asOfTimestamp());
}

return scan.planTasks();
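For comparison, the same pattern through Iceberg's public core API: a plain TableScan pinned to a snapshot, followed by task planning. A minimal sketch; table and snapshotId are assumed inputs:
java
import java.io.IOException;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.io.CloseableIterable;

static void printBatchTasks(Table table, long snapshotId) throws IOException {
  TableScan scan = table.newScan().useSnapshot(snapshotId);
  // planTasks() groups file splits into CombinedScanTasks, as in planIcebergSourceSplits above
  try (CloseableIterable<CombinedScanTask> tasks = scan.planTasks()) {
    for (CombinedScanTask task : tasks) {
      System.out.println("task with " + task.files().size() + " file splits");
    }
  }
}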
3.1.2 discoverIncrementalSplits
This method reads incremental data. If the current snapshot is null, or it equals the snapshot scanned last time, there is no incremental data and an empty split list is returned; otherwise the incremental splits are planned.
First the range between the last scanned snapshot and the current snapshot is computed, then the scan is invoked:
java
ScanContext incrementalScan =
    scanContext.copyWithAppendsBetween(
        lastPosition.snapshotId(), toSnapshotInclusive.snapshotId());
List<IcebergSourceSplit> splits =
    FlinkSplitPlanner.planIcebergSourceSplits(table, incrementalScan, workerPool);
The scan goes through the same entry point as in the previous section, but no longer takes the batch path: the mode is now INCREMENTAL_APPEND_SCAN, which uses a dedicated incremental scanner:
java
IncrementalAppendScan scan = table.newIncrementalAppendScan();
Its implementation class is BaseIncrementalAppendScan. The scan first computes all incremental snapshots (a toy model of the ancestor walk follows the snippet):
java
// appendsBetween handles null fromSnapshotId (exclusive) properly
List<Snapshot> snapshots =
    appendsBetween(table(), fromSnapshotIdExclusive, toSnapshotIdInclusive);

private static List<Snapshot> appendsBetween(
    Table table, Long fromSnapshotIdExclusive, long toSnapshotIdInclusive) {
  List<Snapshot> snapshots = Lists.newArrayList();
  for (Snapshot snapshot :
      SnapshotUtil.ancestorsBetween(
          toSnapshotIdInclusive, fromSnapshotIdExclusive, table::snapshot)) {
    if (snapshot.operation().equals(DataOperations.APPEND)) {
      snapshots.add(snapshot);
    }
  }
  return snapshots;
}
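The ancestor walk is easy to model in isolation. A toy version (not Iceberg code; uses a Java 16+ record) that follows parent pointers from the inclusive end snapshot back to the exclusive start, keeping append snapshots only:
java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

record ToySnapshot(long id, Long parentId, String operation) {}

static List<ToySnapshot> appendsBetween(
    long toInclusive, Long fromExclusive, LongFunction<ToySnapshot> lookup) {
  List<ToySnapshot> result = new ArrayList<>();
  Long cursor = toInclusive;
  while (cursor != null && !cursor.equals(fromExclusive)) {
    ToySnapshot snapshot = lookup.apply(cursor);
    if ("append".equals(snapshot.operation())) {
      result.add(snapshot); // only append snapshots contribute incremental splits
    }
    cursor = snapshot.parentId();
  }
  return result;
}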
Next, all dataManifests are collected from those snapshots (manifests come in two kinds, dataManifests and deleteManifests):
java
Set<ManifestFile> manifests =
    FluentIterable.from(snapshots)
        .transformAndConcat(snapshot -> snapshot.dataManifests(table().io()))
        .filter(manifestFile -> snapshotIds.contains(manifestFile.snapshotId()))
        .toSet();
Finally, files are planned through the regular Iceberg manifest scan. The key to reading only appended data while skipping deletes is the ManifestEntry.Status.ADDED filter; data removed by an overwrite or delete lives in manifests of the delete type.
java
ManifestGroup manifestGroup =
    new ManifestGroup(table().io(), manifests)
        .caseSensitive(isCaseSensitive())
        .select(scanColumns())
        .filterData(filter())
        .filterManifestEntries(
            manifestEntry ->
                snapshotIds.contains(manifestEntry.snapshotId())
                    && manifestEntry.status() == ManifestEntry.Status.ADDED)
        .specsById(table().specs())
        .ignoreDeleted();

if (context().ignoreResiduals()) {
  manifestGroup = manifestGroup.ignoreResiduals();
}

if (manifests.size() > 1 && shouldPlanWithExecutor()) {
  manifestGroup = manifestGroup.planWith(planExecutor());
}

return manifestGroup.planFiles();
3.2 processDiscoveredSplits
This step puts every split discovered in the previous step into the assigner. Note that, because discovery runs asynchronously, the result's starting position must match the enumerator's current position, otherwise the result is discarded; a sketch of that guard follows the snippet.
java
assigner.onDiscoveredSplits(result.splits());
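The surrounding guard looks roughly like the following sketch of ContinuousIcebergEnumerator#processDiscoveredSplits; exact field and method names may differ between versions:
java
private void processDiscoveredSplits(ContinuousEnumerationResult result, Throwable error) {
  if (error != null) {
    LOG.error("Failed to discover new splits", error);
  } else if (!Objects.equals(result.fromPosition(), enumeratorPosition.get())) {
    // the enumerator position moved (e.g. a restore) while discovery ran on the
    // worker thread: the result is stale and is dropped
  } else {
    assigner.onDiscoveredSplits(result.splits());
    enumeratorPosition.set(result.toPosition());
  }
}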
4 Split Assignment: AbstractIcebergEnumerator
After the steps above, all splits sit in the assigner; none have been handed out yet. Assignment is defined in AbstractIcebergEnumerator, the topmost enumerator base class in Iceberg, which directly implements Flink's SplitEnumerator.
SplitEnumerator defines the handleSourceEvent method, which processes custom events coming from the readers and is invoked by SourceCoordinator, as the routing snippet below shows; Iceberg's own handling is sketched right after it:
java
} else if (event instanceof SourceEventWrapper) {
  final SourceEvent sourceEvent =
      ((SourceEventWrapper) event).getSourceEvent();
  LOG.debug(
      "Source {} received custom event from parallel task {}: {}",
      operatorName,
      subtask,
      sourceEvent);
  enumerator.handleSourceEvent(subtask, sourceEvent);
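On the Iceberg side, handleSourceEvent records the requesting reader and triggers assignment. Roughly, as a sketch of AbstractIcebergEnumerator#handleSourceEvent (check your version for the exact code):
java
@Override
public void handleSourceEvent(int subtaskId, SourceEvent sourceEvent) {
  if (sourceEvent instanceof SplitRequestEvent) {
    SplitRequestEvent splitRequestEvent = (SplitRequestEvent) sourceEvent;
    // acknowledge finished splits, queue the reader, then try to assign
    assigner.onCompletedSplits(splitRequestEvent.finishedSplitIds());
    readersAwaitingSplit.put(subtaskId, splitRequestEvent.requesterHostname());
    assignSplits();
  } else {
    throw new IllegalArgumentException(
        String.format("Received unknown event from subtask %d: %s", subtaskId, sourceEvent));
  }
}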
Iceberg's implementation completes the split assignment in the assignSplits method. Notably, it does not hand out everything at once: each waiting reader receives one split per request.
java
Iterator<Map.Entry<Integer, String>> awaitingReader =
    readersAwaitingSplit.entrySet().iterator();
while (awaitingReader.hasNext()) {
  Map.Entry<Integer, String> nextAwaiting = awaitingReader.next();
  // if the reader that requested another split has failed in the meantime, remove
  // it from the list of waiting readers
  if (!enumeratorContext.registeredReaders().containsKey(nextAwaiting.getKey())) {
    awaitingReader.remove();
    continue;
  }

  int awaitingSubtask = nextAwaiting.getKey();
  String hostname = nextAwaiting.getValue();
  GetSplitResult getResult = assigner.getNext(hostname);
  if (getResult.status() == GetSplitResult.Status.AVAILABLE) {
    LOG.info("Assign split to subtask {}: {}", awaitingSubtask, getResult.split());
    enumeratorContext.assignSplit(getResult.split(), awaitingSubtask);
    awaitingReader.remove();
  }
  // ... handling of the UNAVAILABLE statuses is omitted here
}
5 IcebergSourceReader
The Source's createReader is invoked when the stream operator is created (createStreamOperator), so readers are task-level and run in parallel on the TaskManagers.
5.1 Requesting splits
Many of IcebergSourceReader's code paths eventually call requestSplit, which sends a split request to fetch new splits; this is exactly the request received by the SplitEnumerator from the previous chapter. One trigger point is sketched after the snippet:
java
private void requestSplit(Collection<String> finishedSplitIds) {
  context.sendSourceEventToCoordinator(new SplitRequestEvent(finishedSplitIds));
}
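One place this fires is reader startup: if no splits were restored from a checkpoint, the reader immediately requests its first split. A sketch of IcebergSourceReader#start (verify against your version):
java
@Override
public void start() {
  // request the first split only if the checkpoint restored none
  if (getNumberOfCurrentlyAssignedSplits() == 0) {
    requestSplit(Collections.emptyList());
  }
}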
5.2 Processing splits
When IcebergSourceReader is constructed, it delegates to its superclass and passes in several handler classes; the most important is IcebergSourceSplitReader.
java
public IcebergSourceReader(
    IcebergSourceReaderMetrics metrics,
    ReaderFunction<T> readerFunction,
    SourceReaderContext context) {
  super(
      () -> new IcebergSourceSplitReader<>(metrics, readerFunction, context),
      new IcebergSourceRecordEmitter<>(),
      context.getConfiguration(),
      context);
}
In Flink, IcebergSourceSplitReader is wrapped inside a SplitFetcherManager. When Flink's SourceOperator handles an AddSplitEvent, it calls sourceReader.addSplits(newSplits), which in turn calls SplitFetcherManager's addSplits method. That method performs two actions: 1. createSplitFetcher; 2. fetcher.addSplits(splitsToAdd).
createSplitFetcher creates a SplitFetcher, which internally creates a FetchTask whose parameters include the SplitReader:
java
this.fetchTask =
    new FetchTask<>(
        splitReader,
        elementsQueue,
        ids -> {
          ids.forEach(assignedSplits::remove);
          splitFinishedHook.accept(ids);
          LOG.info("Finished reading from splits {}", ids);
        },
        id);
FetchTask's run method calls the SplitReader's fetch method to read records from the assigned splits; Iceberg's implementation is as follows:
java
public RecordsWithSplitIds<RecordAndPosition<T>> fetch() throws IOException {
  metrics.incrementSplitReaderFetchCalls(1);
  if (currentReader == null) {
    IcebergSourceSplit nextSplit = splits.poll();
    if (nextSplit != null) {
      currentSplit = nextSplit;
      currentSplitId = nextSplit.splitId();
      currentReader = openSplitFunction.apply(currentSplit);
    } else {
      // return an empty result, which will lead to split fetch to be idle.
      // SplitFetcherManager will then close idle fetcher.
      return new RecordsBySplits(Collections.emptyMap(), Collections.emptySet());
    }
  }

  if (currentReader.hasNext()) {
    // Because Iterator#next() doesn't support checked exception,
    // we need to wrap and unwrap the checked IOException with UncheckedIOException
    try {
      return currentReader.next();
    } catch (UncheckedIOException e) {
      throw e.getCause();
    }
  } else {
    return finishSplit();
  }
}
openSplitFunction is the function that actually opens and reads files; it is chosen in IcebergSource:
java
if (readerFunction == null) {
  if (table instanceof BaseMetadataTable) {
    MetaDataReaderFunction rowDataReaderFunction =
        new MetaDataReaderFunction(
            flinkConfig, table.schema(), context.project(), table.io(), table.encryption());
    this.readerFunction = (ReaderFunction<T>) rowDataReaderFunction;
  } else {
    RowDataReaderFunction rowDataReaderFunction =
        new RowDataReaderFunction(
            flinkConfig,
            table.schema(),
            context.project(),
            context.nameMapping(),
            context.caseSensitive(),
            table.io(),
            table.encryption(),
            context.filters());
    this.readerFunction = (ReaderFunction<T>) rowDataReaderFunction;
  }
}
The data consumed in fetch comes from splits.poll(); the splits queue is populated by handleSplitsChanges, which is driven by the fetcher.addSplits(splitsToAdd) call that follows createSplitFetcher, as described above:
java
public void handleSplitsChanges(SplitsChange<IcebergSourceSplit> splitsChange) {
  if (!(splitsChange instanceof SplitsAddition)) {
    throw new UnsupportedOperationException(
        String.format("Unsupported split change: %s", splitsChange.getClass()));
  }

  LOG.info("Add {} splits to reader", splitsChange.splits().size());
  splits.addAll(splitsChange.splits());
  metrics.incrementAssignedSplits(splitsChange.splits().size());
  metrics.incrementAssignedBytes(calculateBytes(splitsChange));
}
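Putting it all together, a streaming read can be driven end to end from Flink SQL. A hedged recap using the options discussed above; the catalog, warehouse path, and table names are examples:
java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingReadDemo {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // register an Iceberg catalog (Hadoop-type here; adjust for your deployment)
    tEnv.executeSql(
        "CREATE CATALOG iceberg WITH ("
            + "'type'='iceberg', 'catalog-type'='hadoop',"
            + "'warehouse'='hdfs://nn:8020/warehouse')");
    tEnv.executeSql("USE CATALOG iceberg");
    // the hint flips getBoundedness() to CONTINUOUS_UNBOUNDED and sets the
    // discovery period used by ContinuousIcebergEnumerator#start
    tEnv.executeSql(
            "SELECT * FROM hjfdb.test2 "
                + "/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */")
        .print();
  }
}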