Flink Partitioning Internals

0. Key Takeaway

Flink does not write partition columns into the data files: if a table has two columns and one of them is a partition column, the data files contain only the other column's values (the partition value is encoded in the directory path instead).
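
As a quick illustration of this point, here is a minimal sketch (assuming Flink 1.13+ with the filesystem connector and CSV format on the classpath; the table name, path, and values are made up): after the INSERT finishes, the file under the dt=2024-01-01 directory contains only the user_id value, while dt shows up only in the directory name.

java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PartitionLayoutDemo {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Hypothetical table: two columns, dt is the partition column.
        tEnv.executeSql(
                "CREATE TABLE t ("
                        + "  user_id BIGINT,"
                        + "  dt STRING"
                        + ") PARTITIONED BY (dt) WITH ("
                        + "  'connector' = 'filesystem',"
                        + "  'path' = 'file:///tmp/t',"
                        + "  'format' = 'csv'"
                        + ")");

        // After this finishes, the file under /tmp/t/dt=2024-01-01/ contains only "1";
        // the dt value exists only as the directory name.
        tEnv.executeSql("INSERT INTO t VALUES (1, '2024-01-01')").await();
    }
}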

1. CreateTable

Following the SQL execution flow, we enter TableEnvironmentImpl.executeInternal and hit the createTable branch:

java
} else if (operation instanceof CreateTableOperation) {
    CreateTableOperation createTableOperation = (CreateTableOperation) operation;
    if (createTableOperation.isTemporary()) {
        catalogManager.createTemporaryTable(
                createTableOperation.getCatalogTable(),
                createTableOperation.getTableIdentifier(),
                createTableOperation.isIgnoreIfExists());
    } else {
        catalogManager.createTable(
                createTableOperation.getCatalogTable(),
                createTableOperation.getTableIdentifier(),
                createTableOperation.isIgnoreIfExists());
    }
    return TableResultImpl.TABLE_RESULT_OK;

1.1 GenericInMemoryCatalog

Next, catalog.createTable is called. Taking GenericInMemoryCatalog as an example, it holds several partition-related Maps, but no partition information is actually stored at this point; as the code shows, the partition maps are created empty:

java
} else {
    tables.put(tablePath, table.copy());

    if (isPartitionedTable(tablePath)) {
        partitions.put(tablePath, new LinkedHashMap<>());
        partitionStats.put(tablePath, new LinkedHashMap<>());
        partitionColumnStats.put(tablePath, new LinkedHashMap<>());
    }
}

1.2 Partition Maps in the Catalog

partitionStats and partitionColumnStats hold statistics. partitions, as far as I can tell, is only used by explicit partition operations such as createPartition (corresponding to the SQL statement ALTER TABLE ADD PARTITION), and what it stores is only the partition information touched by the ALTER statement — mostly descriptive properties, not the primary record of the table's partitioning. The information originates in SqlToOperationConverter.convertAlterTable:

java
for (int i = 0; i < addPartitions.getPartSpecs().size(); i++) {
    specs.add(new CatalogPartitionSpec(addPartitions.getPartitionKVs(i)));
    Map<String, String> props =
            OperationConverterUtils.extractProperties(
                    addPartitions.getPartProps().get(i));
    partitions.add(new CatalogPartitionImpl(props, null));
}
return new AddPartitionsOperation(
        tableIdentifier, addPartitions.ifNotExists(), specs, partitions);

1.3 AbstractCatalogTable

The information that really matters lives in the table metadata itself. The key statement is tables.put(tablePath, table.copy()); — table.copy() stores the table definition, ultimately ending up in the implementation class CatalogTableImpl, whose parent class constructor takes the partition keys. Since SQL statements always operate on the table, this is where partition information is read from. Note that it is just a List of Strings:

java
public AbstractCatalogTable(
        TableSchema tableSchema,
        List<String> partitionKeys,
        Map<String, String> options,
        String comment) {
    this.tableSchema = checkNotNull(tableSchema, "tableSchema cannot be null");
    this.partitionKeys = checkNotNull(partitionKeys, "partitionKeys cannot be null");
    this.options = checkNotNull(options, "options cannot be null");
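
As a small sketch of what this metadata looks like from the API side (assuming a Flink version where the CatalogTable.of factory is available; table name and options are made up), the partition keys are carried as exactly such a list:

java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.Schema;
import org.apache.flink.table.catalog.CatalogTable;

public class CatalogTablePartitionKeysDemo {
    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("connector", "filesystem");
        options.put("path", "file:///tmp/t");
        options.put("format", "csv");

        // The partition keys are carried as a plain List<String> on the table metadata.
        CatalogTable table =
                CatalogTable.of(
                        Schema.newBuilder()
                                .column("user_id", DataTypes.BIGINT())
                                .column("dt", DataTypes.STRING())
                                .build(),
                        "demo table",
                        Collections.singletonList("dt"),
                        options);

        System.out.println(table.getPartitionKeys()); // [dt]
    }
}
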

2. DescribeTable

Likewise in TableEnvironmentImpl.executeInternal, this time the describe branch:

java
} else if (operation instanceof DescribeTableOperation) {
    DescribeTableOperation describeTableOperation = (DescribeTableOperation) operation;
    Optional<ContextResolvedTable> result =
            catalogManager.getTable(describeTableOperation.getSqlIdentifier());
    if (result.isPresent()) {
        return buildDescribeResult(result.get().getResolvedSchema());
    } else {
        throw new ValidationException(
                String.format(
                        "Tables or views with the identifier '%s' doesn't exist",
                        describeTableOperation.getSqlIdentifier().asSummaryString()));
    }

2.1 Fetching and Resolving the Table

First, the getTable method fetches the table, with CatalogManager as the entry point; regular (non-temporary) tables go through getPermanentTable:

java
public Optional<ContextResolvedTable> getTable(ObjectIdentifier objectIdentifier) {
    CatalogBaseTable temporaryTable = temporaryTables.get(objectIdentifier);
    if (temporaryTable != null) {
        final ResolvedCatalogBaseTable<?> resolvedTable =
                resolveCatalogBaseTable(temporaryTable);
        return Optional.of(ContextResolvedTable.temporary(objectIdentifier, resolvedTable));
    } else {
        return getPermanentTable(objectIdentifier);
    }
}

The table gets wrapped in several layers here, but at the bottom it still comes from GenericInMemoryCatalog: the tables map we saw during CreateTable is where the stored table object is fetched from:

java
public CatalogBaseTable getTable(ObjectPath tablePath) throws TableNotExistException {
    checkNotNull(tablePath);

    if (!tableExists(tablePath)) {
        throw new TableNotExistException(getName(), tablePath);
    } else {
        return tables.get(tablePath).copy();
    }
}

Before being returned to the top level, the table is resolved and wrapped in CatalogManager.resolveCatalogTable. An important step here is the validation of partition keys: each one must correspond to a physical column of the table:

java
// Validate partition keys are included in physical columns
final List<String> physicalColumns =
        resolvedSchema.getColumns().stream()
                .filter(Column::isPhysical)
                .map(Column::getName)
                .collect(Collectors.toList());
table.getPartitionKeys()
        .forEach(
                partitionKey -> {
                    if (!physicalColumns.contains(partitionKey)) {
                        throw new ValidationException(
                                String.format(
                                        "Invalid partition key '%s'. A partition key must "
                                                + "reference a physical column in the schema. "
                                                + "Available columns are: %s",
                                        partitionKey, physicalColumns));
                    }
                });
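
For instance, a DDL along the following lines should trip this check, because a computed column is not a physical column (a sketch only; the exact point at which the exception surfaces may depend on the Flink version, since resolution can already happen during CREATE TABLE):

java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PartitionKeyValidationDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // dt is a computed column, i.e. not a physical column, so using it as a
        // partition key should be rejected with the ValidationException above.
        tEnv.executeSql(
                "CREATE TABLE bad_t ("
                        + "  user_id BIGINT,"
                        + "  ts TIMESTAMP(3),"
                        + "  dt AS DATE_FORMAT(ts, 'yyyy-MM-dd')"
                        + ") PARTITIONED BY (dt) WITH ("
                        + "  'connector' = 'filesystem',"
                        + "  'path' = 'file:///tmp/bad_t',"
                        + "  'format' = 'csv'"
                        + ")");
    }
}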

2.2 Building the Result

Finally the result is built. Note that only the table's schema is passed in here — no partition information:

java
return buildDescribeResult(result.get().getResolvedSchema());

DESCRIBE prints to the console in tabular form, so a table layout is built here:

java
private TableResultInternal buildDescribeResult(ResolvedSchema schema) {
    Object[][] rows = buildTableColumns(schema);
    return buildResult(generateTableColumnsNames(), generateTableColumnsDataTypes(), rows);
}

buildTableColumns turns the schema into row data. Since the table printed by DESCRIBE has a fixed set of columns, values are filled in for those fixed columns.

The column names are defined in generateTableColumnsNames, which is also the header of the final output:

java
private String[] generateTableColumnsNames() {
    return new String[] {"name", "type", "null", "key", "extras", "watermark"};
}

generateTableColumnsDataTypes sets the data types of those columns:

java
private DataType[] generateTableColumnsDataTypes() {
    return new DataType[] {
        DataTypes.STRING(),
        DataTypes.STRING(),
        DataTypes.BOOLEAN(),
        DataTypes.STRING(),
        DataTypes.STRING(),
        DataTypes.STRING()
    };
}

Finally, inserting the rows built from the schema into the table yields the return value, a TableResultImpl:

java
private TableResultInternal buildResult(String[] headers, DataType[] types, Object[][] rows) {
    ResolvedSchema schema = ResolvedSchema.physical(headers, types);
    ResultProvider provider =
            new StaticResultProvider(
                    Arrays.stream(rows).map(Row::of).collect(Collectors.toList()));
    return TableResultImpl.builder()
            .resultKind(ResultKind.SUCCESS_WITH_CONTENT)
            .schema(ResolvedSchema.physical(headers, types))
            .resultProvider(provider)
            .setPrintStyle(
                    PrintStyle.tableauWithDataInferredColumnWidths(
                            schema,
                            provider.getRowDataStringConverter(),
                            Integer.MAX_VALUE,
                            true,
                            false))
            .build();
}

The overall output looks like this:

+---------+--------+------+-----+--------+-----------+
|    name |   type | null | key | extras | watermark |
+---------+--------+------+-----+--------+-----------+
|    user | BIGINT | TRUE |     |        |           |
| product | STRING | TRUE |     |        |           |
+---------+--------+------+-----+--------+-----------+
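
For reference, output of this shape can be produced with something like the sketch below (table name, path, and options are made up). Note that even though the table is partitioned, nothing in the DESCRIBE output marks the partition key, matching the observation above that only the schema is passed in:

java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DescribeDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        tEnv.executeSql(
                "CREATE TABLE orders ("
                        + "  `user` BIGINT,"
                        + "  product STRING"
                        + ") PARTITIONED BY (product) WITH ("
                        + "  'connector' = 'filesystem',"
                        + "  'path' = 'file:///tmp/orders',"
                        + "  'format' = 'csv'"
                        + ")");

        // Prints the fixed six-column layout built by buildDescribeResult.
        // Even though the table is partitioned, nothing in the output marks
        // product as a partition key, because only the schema is passed in.
        tEnv.executeSql("DESCRIBE orders").print();
    }
}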

3. Insert

3.1 Building the SinkModifyOperation

First, when the SinkModifyOperation is built in SqlToOperationConverter.convertSqlInsert, it carries the table. getTableOrError ends up in the same call path as the DESCRIBE case above: the table is fetched from the catalog and the partition keys are validated, with none of those steps skipped, so the contextResolvedTable wrapped into the SinkModifyOperation carries the partition information:

java
ContextResolvedTable contextResolvedTable = catalogManager.getTableOrError(identifier);

return new SinkModifyOperation(
        contextResolvedTable,
        query,
        insert.getStaticPartitionKVs(),
        insert.isOverwrite(),
        dynamicOptions);

3.2 Converting to a TableSink

In PlannerBase.translateToRel of the SQL translation flow, the catalogSink branch is taken; the getTableSink call reaches TableFactoryUtil.findAndCreateTableSink:

java
public static <T> TableSink<T> findAndCreateTableSink(TableSinkFactory.Context context) {
    try {
        return TableFactoryService.find(
                        TableSinkFactory.class, context.getTable().toProperties())
                .createTableSink(context);
    } catch (Throwable t) {
        throw new TableException("findAndCreateTableSink failed.", t);
    }
}

Inside toProperties, the partition keys are turned into a property:

java
public Map<String, String> toProperties() {
    DescriptorProperties descriptor = new DescriptorProperties(false);

    descriptor.putTableSchema(SCHEMA, getSchema());
    descriptor.putPartitionKeys(getPartitionKeys());

    Map<String, String> properties = new HashMap<>(getOptions());

    descriptor.putProperties(properties);

    return descriptor.asMap();
}

The subsequent createTableSink call also invokes toProperties, but at the moment only the two CSV implementations seem to take this path, so the details of this TableSink route remain to be studied.

What getTableSink ultimately calls is createDynamicTableSink, which wraps the table — the same table fetched from the catalog as before, so it contains the partition information:

java
val tableSink = FactoryUtil.createDynamicTableSink(
  factory,
  objectIdentifier,
  tableToFind,
  Collections.emptyMap(),
  getTableConfig,
  getFlinkContext.getClassLoader,
  isTemporary)

3.3 Partition Assignment

Tracing down from the file source, there is a partition assignment class, PartitionComputer, in the file connector; it has four implementations, for file and for Hive.

During SQL translation, the sink's translateToPlanInternal builds a SinkRuntimeProvider in CommonExecSink.createSinkTransformation:

java
final SinkRuntimeProvider runtimeProvider =
        tableSink.getSinkRuntimeProvider(new SinkRuntimeProviderContext(isBounded));

In FileSystemTableSink, the partition computer is eventually constructed:

java
private DataStreamSink<RowData> createBatchSink(
        DataStream<RowData> inputStream, Context sinkContext, final int parallelism) {
    FileSystemOutputFormat.Builder<RowData> builder = new FileSystemOutputFormat.Builder<>();
    builder.setPartitionComputer(partitionComputer());


private RowDataPartitionComputer partitionComputer() {
    return new RowDataPartitionComputer(
            defaultPartName,
            DataType.getFieldNames(physicalRowDataType).toArray(new String[0]),
            DataType.getFieldDataTypes(physicalRowDataType).toArray(new DataType[0]),
            partitionKeys.toArray(new String[0]));
}

Inside it, partition indexes are derived from the partition column names for later use — which means the partition keys must correspond to real columns — though partition computation still works off the list of partition column names:

java
this.partitionIndexes =
        Arrays.stream(partitionColumns).mapToInt(columnList::indexOf).toArray();
this.partitionTypes =
        Arrays.stream(partitionIndexes)
                .mapToObj(columnTypeList::get)
                .toArray(LogicalType[]::new);
this.partitionFieldGetters =
        IntStream.range(0, partitionTypes.length)
                .mapToObj(
                        i ->
                                RowData.createFieldGetter(
                                        partitionTypes[i], partitionIndexes[i]))
                .toArray(RowData.FieldGetter[]::new);

List<Integer> partitionIndexList =
        Arrays.stream(partitionIndexes).boxed().collect(Collectors.toList());

generatePartValues computes the partition of each record, using the objects built from the partition keys when RowDataPartitionComputer was initialized:

java
public LinkedHashMap<String, String> generatePartValues(RowData in) {
    LinkedHashMap<String, String> partSpec = new LinkedHashMap<>();

    for (int i = 0; i < partitionIndexes.length; i++) {
        Object field = partitionFieldGetters[i].getFieldOrNull(in);
        String partitionValue = field != null ? field.toString() : null;
        if (partitionValue == null || "".equals(partitionValue)) {
            partitionValue = defaultPartValue;
        }
        partSpec.put(partitionColumns[i], partitionValue);
    }
    return partSpec;
}
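
To make the index bookkeeping concrete, here is a simplified, dependency-free re-implementation of the same idea (column names, the sample row, and the default partition name are made up; in Flink the default comes from a table option):

java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;

public class PartitionComputerSketch {
    public static void main(String[] args) {
        // Schema: user_id, product, dt; partition column: dt.
        List<String> columns = Arrays.asList("user_id", "product", "dt");
        String[] partitionColumns = {"dt"};
        String defaultPartValue = "__DEFAULT_PARTITION__"; // placeholder default

        // Same idea as the constructor above: resolve partition columns to indexes once.
        int[] partitionIndexes =
                Arrays.stream(partitionColumns).mapToInt(columns::indexOf).toArray();

        Object[] row = {1L, "apple", "2024-01-01"};

        // Same idea as generatePartValues: read the partition fields of one record.
        LinkedHashMap<String, String> partSpec = new LinkedHashMap<>();
        for (int i = 0; i < partitionIndexes.length; i++) {
            Object field = row[partitionIndexes[i]];
            String value = field == null ? null : field.toString();
            if (value == null || "".equals(value)) {
                value = defaultPartValue;
            }
            partSpec.put(partitionColumns[i], value);
        }
        System.out.println(partSpec); // {dt=2024-01-01}
    }
}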

At the lowest level, FileSystemTableSink writes through GroupedPartitionWriter and DynamicPartitionWriter; GroupedPartitionWriter.write looks like this:

java
public void write(T in) throws Exception {
    String partition = generatePartitionPath(computer.generatePartValues(in));
    if (!partition.equals(currentPartition)) {
        if (currentFormat != null) {
            currentFormat.close();
        }

        currentFormat = context.createNewOutputFormat(manager.createPartitionDir(partition));
        currentPartition = partition;
    }
    currentFormat.writeRecord(computer.projectColumnsToWrite(in));
}

3.4 Partition Columns Are Not Written

RowDataPartitionComputer.projectColumnsToWrite computes the columns that actually get written — only these columns end up in the data files — and its core is removing the partition columns:

java
for (int i = 0; i < nonPartitionIndexes.length; i++) {
    reuseRow.setField(i, nonPartitionFieldGetters[i].getFieldOrNull(in));
}
reuseRow.setRowKind(in.getRowKind());
return reuseRow;

nonPartitionIndexes is built in the RowDataPartitionComputer constructor; as shown, it simply iterates over the column indexes and filters out the partition columns:

java
this.nonPartitionIndexes =
        IntStream.range(0, columnNames.length)
                .filter(c -> !partitionIndexList.contains(c))
                .toArray();

3.5 Partition Directories

PartitionPathUtils.generatePartitionPath defines the layout of partition directories, in the form {column-name}={partition-value}, since the keys of the partition spec are exactly the column names:

java
for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
    if (i > 0) {
        suffixBuf.append(Path.SEPARATOR);
    }
    suffixBuf.append(escapePathName(e.getKey()));
    suffixBuf.append('=');
    suffixBuf.append(escapePathName(e.getValue()));
    i++;
}
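
The following stripped-down sketch shows the resulting path fragment for a two-level partition spec (the values are made up; the real method additionally escapes special characters in keys and values):

java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathSketch {
    public static void main(String[] args) {
        // Same shape as the partSpec produced by generatePartValues.
        LinkedHashMap<String, String> partitionSpec = new LinkedHashMap<>();
        partitionSpec.put("dt", "2024-01-01");
        partitionSpec.put("hour", "10");

        // Stripped-down version of the loop above; the real method additionally
        // escapes special characters in keys and values.
        StringBuilder suffix = new StringBuilder();
        int i = 0;
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            if (i > 0) {
                suffix.append('/');
            }
            suffix.append(e.getKey()).append('=').append(e.getValue());
            i++;
        }
        System.out.println(suffix); // dt=2024-01-01/hour=10
    }
}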

4. Select

Mirroring the sink flow, the source is built in CommonExecTableSourceScan.translateToPlanInternal:

java
final ScanTableSource tableSource =
        tableSourceSpec.getScanTableSource(planner.getFlinkContext());
ScanTableSource.ScanRuntimeProvider provider =
        tableSource.getScanRuntimeProvider(ScanRuntimeProviderContext.INSTANCE);

4.1 getScanTableSource

Inside getScanTableSource, the TableSource is created by createDynamicTableSource:

java
tableSource =
        FactoryUtil.createDynamicTableSource(
                factory,
                contextResolvedTable.getIdentifier(),
                contextResolvedTable.getResolvedTable(),
                loadOptionsFromCatalogTable(contextResolvedTable, flinkContext),
                flinkContext.getTableConfig(),
                flinkContext.getClassLoader(),
                contextResolvedTable.isTemporary());

Taking FileSystemTableFactory as an example, it ultimately creates a FileSystemTableSource whose constructor parameters include the partition keys — again taken from the catalog table:

java
public DynamicTableSource createDynamicTableSource(Context context) {
    FactoryUtil.TableFactoryHelper helper = FactoryUtil.createTableFactoryHelper(this, context);
    validate(helper);
    return new FileSystemTableSource(
            context.getObjectIdentifier(),
            context.getPhysicalRowDataType(),
            context.getCatalogTable().getPartitionKeys(),
            helper.getOptions(),
            discoverDecodingFormat(context, BulkReaderFormatFactory.class),
            discoverDecodingFormat(context, DeserializationFormatFactory.class));
}

4.2 getScanRuntimeProvider

Next the provider is created, via calls on the FileSystemTableSource above, eventually reaching its getScanRuntimeProvider method, which contains quite a few partition-related operations.

First, if the table is partitioned but no partitions actually exist, a trivial empty source is returned:

java
// When this table has no partition, just return an empty source.
if (!partitionKeys.isEmpty() && getOrFetchPartitions().isEmpty()) {
    return InputFormatProvider.of(new CollectionInputFormat<>(new ArrayList<>(), null));
}

Next, the partition keys are filtered against the table's columns (the produced data type):

java
// Filter out partition columns not in producedDataType
final List<String> partitionKeysToExtract =
        DataType.getFieldNames(producedDataType).stream()
                .filter(this.partitionKeys::contains)
                .collect(Collectors.toList());

The filtered partition keys are then passed into the format — the class that ultimately performs the reading and writing (some code paths pass them and some do not; the difference still needs a closer look).

Whatever the format, the call ends up as return createSourceProvider(format);, and inside createSourceProvider the partitions are fetched (partition values, not partition keys):

java
private SourceProvider createSourceProvider(BulkFormat<RowData, FileSourceSplit> bulkFormat) {
    final FileSource.FileSourceBuilder<RowData> fileSourceBuilder =
            FileSource.forBulkFileFormat(bulkFormat, paths());

paths() here derives the partition directories to read from remainingPartitions:

java
private Path[] paths() {
    if (partitionKeys.isEmpty()) {
        return new Path[] {path};
    } else {
        return getOrFetchPartitions().stream()
                .map(FileSystemTableSource.this::toFullLinkedPartSpec)
                .map(PartitionPathUtils::generatePartitionPath)
                .map(n -> new Path(path, n))
                .toArray(Path[]::new);
    }
}

4.3 remainingPartitions

This field is used by partition pushdown: when pushdown is supported, it is set to the remaining partitions in PartitionPushDownSpec.apply.

Note that what is stored here is not the partition columns but the actual partition values:

java
public void apply(DynamicTableSource tableSource, SourceAbilityContext context) {
    if (tableSource instanceof SupportsPartitionPushDown) {
        ((SupportsPartitionPushDown) tableSource).applyPartitions(partitions);
    } else {
        throw new TableException(
                String.format(
                        "%s does not support SupportsPartitionPushDown.",
                        tableSource.getClass().getName()));
    }
}

applyPartitions simply sets remainingPartitions:

java
public void applyPartitions(List<Map<String, String>> remainingPartitions) {
    this.remainingPartitions = remainingPartitions;
}
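
Just to illustrate the shape of the data: remainingPartitions is a list of concrete partition specs, one map per partition directory, not a list of partition columns (values below are made up):

java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class RemainingPartitionsShape {
    public static void main(String[] args) {
        // Each element is one concrete partition (key -> value), e.g. two daily
        // partitions of a table partitioned by dt. Values are made up.
        List<Map<String, String>> remainingPartitions =
                Arrays.asList(
                        Collections.singletonMap("dt", "2024-01-01"),
                        Collections.singletonMap("dt", "2024-01-02"));
        System.out.println(remainingPartitions);
    }
}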

It can also be set elsewhere, in getOrFetchPartitions:

java
private List<Map<String, String>> getOrFetchPartitions() {
    if (remainingPartitions == null) {
        remainingPartitions = listPartitions().get();
    }
    return remainingPartitions;
}

Here listPartitions scans the data directory for partitions:

java
public Optional<List<Map<String, String>>> listPartitions() {
    try {
        return Optional.of(
                PartitionPathUtils.searchPartSpecAndPaths(
                                path.getFileSystem(), path, partitionKeys.size())
                        .stream()
                        .map(tuple2 -> tuple2.f0)
                        .map(
                                spec -> {
                                    LinkedHashMap<String, String> ret = new LinkedHashMap<>();
                                    spec.forEach(
                                            (k, v) ->
                                                    ret.put(
                                                            k,
                                                            defaultPartName.equals(v)
                                                                    ? null
                                                                    : v));
                                    return ret;
                                })
                        .collect(Collectors.toList()));
    } catch (Exception e) {
        throw new TableException("Fetch partitions fail.", e);
    }
}

5. Partition Pushdown

The capability is described by PartitionPushDownSpec, and the rule is PushPartitionIntoTableSourceScanRule.

Partition information is again taken from the catalog table:

java
List<String> partitionFieldNames =
        tableSourceTable
                .contextResolvedTable()
                .<ResolvedCatalogTable>getResolvedTable()
                .getPartitionKeys();

Next, the partition-related predicates are extracted from the filter condition:

java
// extract partition predicates
RelBuilder relBuilder = call.builder();
RexBuilder rexBuilder = relBuilder.getRexBuilder();
Tuple2<Seq<RexNode>, Seq<RexNode>> allPredicates =
        RexNodeExtractor.extractPartitionPredicateList(
                filter.getCondition(),
                FlinkRelOptUtil.getMaxCnfNodeCount(scan),
                inputFieldNames.toArray(new String[0]),
                rexBuilder,
                partitionFieldNames.toArray(new String[0]));
RexNode partitionPredicate =
        RexUtil.composeConjunction(
                rexBuilder, JavaConversions.seqAsJavaList(allPredicates._1));

Then the types of the partition columns are resolved — again driven by the partition column names:

java
// build pruner
LogicalType[] partitionFieldTypes =
        partitionFieldNames.stream()
                .map(
                        name -> {
                            int index = inputFieldNames.indexOf(name);
                            if (index < 0) {
                                throw new TableException(
                                        String.format(
                                                "Partitioned key '%s' isn't found in input columns. "
                                                        + "Validator should have checked that.",
                                                name));
                            }
                            return inputFieldTypes.getFieldList().get(index).getType();
                        })
                .map(FlinkTypeFactory::toLogicalType)
                .toArray(LogicalType[]::new);

Partition pruning happens next, and this is where the remainingPartitions above are produced. The rule first calls the TableSource's listPartitions; if that yields a result it is used directly, otherwise it falls back to readPartitionFromCatalogAndPrune. Note that at this point the partition predicates have already been converted into RexNode form, and the final filtering is still catalog-based — currently only HiveCatalog appears to implement it:

java
return catalog.listPartitionsByFilter(tablePath, partitionFilters).stream()
        .map(CatalogPartitionSpec::getPartitionSpec)
        .collect(Collectors.toList());

After that, based on the pruned partitions, a new TableSourceTable is built with its tableStats replaced:

java
FlinkStatistic newStatistic =
        FlinkStatistic.builder()
                .statistic(tableSourceTable.getStatistic())
                .tableStats(newTableStat)
                .build();

TableSourceTable newTableSourceTable =
        tableSourceTable.copy(
                dynamicTableSource,
                newStatistic,
                new SourceAbilitySpec[] {partitionPushDownSpec});

LogicalTableScan newScan =
        LogicalTableScan.create(scan.getCluster(), newTableSourceTable, scan.getHints());
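
Putting it together, a query of the following shape is what PushPartitionIntoTableSourceScanRule targets: the predicate on dt is extracted, the matching partitions become remainingPartitions, and only those directories are scanned. A sketch reusing the made-up partitioned table from section 0, assuming data has already been written there:

java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PartitionPruneQueryDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        tEnv.executeSql(
                "CREATE TABLE t ("
                        + "  user_id BIGINT,"
                        + "  dt STRING"
                        + ") PARTITIONED BY (dt) WITH ("
                        + "  'connector' = 'filesystem',"
                        + "  'path' = 'file:///tmp/t',"
                        + "  'format' = 'csv'"
                        + ")");

        // The dt predicate is pushed into the source, so only the directory
        // /tmp/t/dt=2024-01-01/ should be scanned.
        tEnv.executeSql("SELECT * FROM t WHERE dt = '2024-01-01'").print();
    }
}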
后端·python·动效