Background
In the earlier article Apache Hudi初探(二)(与flink的结合)--flink写hudi的操作(JobManager端的提交操作), we mentioned that writing Hudi data involves writing both the actual Hudi data and the Hudi metadata. This article walks through the concrete implementation.
Writing the actual Hudi data
This work happens in the HoodieFlinkWriteClient.upsert method:
public List<WriteStatus> upsert(List<HoodieRecord<T>> records, String instantTime) {
  HoodieTable<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> table =
      initTable(WriteOperationType.UPSERT, Option.ofNullable(instantTime));
  table.validateUpsertSchema();
  preWrite(instantTime, WriteOperationType.UPSERT, table.getMetaClient());
  final HoodieWriteHandle<?, ?, ?, ?> writeHandle = getOrCreateWriteHandle(records.get(0), getConfig(),
      instantTime, table, records.listIterator());
  HoodieWriteMetadata<List<WriteStatus>> result = ((HoodieFlinkTable<T>) table).upsert(context, writeHandle, instantTime, records);
  if (result.getIndexLookupDuration().isPresent()) {
    metrics.updateIndexMetrics(LOOKUP_STR, result.getIndexLookupDuration().get().toMillis());
  }
  return postWrite(result, instantTime, table);
}
- initTable: initializes the HoodieFlinkTable.
- preWrite: almost nothing happens here.
- getOrCreateWriteHandle: creates a handle for writing files (suppose a FlinkMergeAndReplaceHandle is created here). The handle records the paths of the existing files, which FlinkMergeHelper.runMerge later reads data from. Note the init method called in its constructor: it builds a map of type ExternalSpillableMap to hold the records that are about to be inserted, which is used later in the upsert.
- HoodieFlinkTable.upsert: this is where the real upsert happens. It calls FlinkUpsertDeltaCommitActionExecutor.execute, which ends up in BaseFlinkCommitActionExecutor.execute and finally in FlinkMergeHelper.newInstance().runMerge:

public void runMerge(HoodieTable<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> table, ..) {
  final boolean externalSchemaTransformation = table.getConfig().shouldUseExternalSchemaTransformation();
  HoodieBaseFile baseFile = mergeHandle.baseFileForMerge();
  if (externalSchemaTransformation || baseFile.getBootstrapBaseFile().isPresent()) {
    readSchema = baseFileReader.getSchema();
    gWriter = new GenericDatumWriter<>(readSchema);
    gReader = new GenericDatumReader<>(readSchema, mergeHandle.getWriterSchemaWithMetaFields());
  } else {
    gReader = null;
    gWriter = null;
    readSchema = mergeHandle.getWriterSchemaWithMetaFields();
  }
  wrapper = new BoundedInMemoryExecutor<>(table.getConfig().getWriteBufferLimitBytes(),
      new IteratorBasedQueueProducer<>(readerIterator), Option.of(new UpdateHandler(mergeHandle)), record -> {
        if (!externalSchemaTransformation) {
          return record;
        }
        return transformRecordBasedOnNewSchema(gReader, gWriter, encoderCache, decoderCache, (GenericRecord) record);
      });
  wrapper.execute();
  ...
  mergeHandle.close();
}
- externalSchemaTransformation: controlled by the hoodie.avro.schema.external.transformation configuration (false by default), which converts data written under the previous schema into data under the new schema.
- wrapper.execute(): this eventually calls upsertHandle.write(record), which is where UpdateHandler.consumeOneRecord is invoked (a simplified sketch of the decision follows this list):

public void write(GenericRecord oldRecord) {
  ...
  if (keyToNewRecords.containsKey(key)) {
    if (combinedAvroRecord.isPresent() && combinedAvroRecord.get().equals(IGNORE_RECORD)) {
      copyOldRecord = true;
    } else if (writeUpdateRecord(hoodieRecord, oldRecord, combinedAvroRecord)) {
      copyOldRecord = false;
    }
    writtenRecordKeys.add(key);
  }
}

If keyToNewRecords contains the corresponding key, i.e. there is an update for this record, the new data is written: writeUpdateRecord performs the update, and the written key is recorded in writtenRecordKeys.
- mergeHandle.close():

public List<WriteStatus> close() {
  writeIncomingRecords();
  ...
}
...
protected void writeIncomingRecords() throws IOException {
  // write out any pending records (this can happen when inserts are turned into updates)
  Iterator<HoodieRecord<T>> newRecordsItr = (keyToNewRecords instanceof ExternalSpillableMap)
      ? ((ExternalSpillableMap) keyToNewRecords).iterator() : keyToNewRecords.values().iterator();
  while (newRecordsItr.hasNext()) {
    HoodieRecord<T> hoodieRecord = newRecordsItr.next();
    if (!writtenRecordKeys.contains(hoodieRecord.getRecordKey())) {
      writeInsertRecord(hoodieRecord);
    }
  }
}

Here writeIncomingRecords checks writtenRecordKeys: any record not contained in it is written directly as an insert rather than an update.
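To make the keyToNewRecords / writtenRecordKeys bookkeeping easier to follow, here is a minimal, self-contained sketch of the update-vs-insert decision described above. The class and names in it (MergeSketch, writeToNewFile, the String payloads) are hypothetical illustrations only, not Hudi's actual API; in Hudi the incoming-record map is the ExternalSpillableMap built in HoodieMergeHandle's init, which can spill to disk when it outgrows its memory budget.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the merge flow, not Hudi code.
public class MergeSketch {
  private final Map<String, String> newRecords;            // plays the role of keyToNewRecords
  private final Set<String> writtenKeys = new HashSet<>(); // plays the role of writtenRecordKeys

  public MergeSketch(Map<String, String> incoming) {
    this.newRecords = incoming;
  }

  // Called once per record read from the existing base file, like HoodieMergeHandle#write(oldRecord).
  public void write(String key, String oldValue) {
    if (newRecords.containsKey(key)) {
      writeToNewFile(key, newRecords.get(key)); // update: replace the old value with the incoming one
      writtenKeys.add(key);
    } else {
      writeToNewFile(key, oldValue);            // no incoming record: copy the old record over unchanged
    }
  }

  // Called at the end, like HoodieMergeHandle#close -> writeIncomingRecords():
  // whatever was never matched against an old record is written as a plain insert.
  public void close() {
    newRecords.forEach((key, value) -> {
      if (!writtenKeys.contains(key)) {
        writeToNewFile(key, value);
      }
    });
  }

  // Stand-in for writing a record into the new file slice.
  private void writeToNewFile(String key, String value) {
    System.out.println(key + " -> " + value);
  }

  public static void main(String[] args) {
    Map<String, String> incoming = new HashMap<>();
    incoming.put("key-1", "updated value"); // matches an existing record -> update path
    incoming.put("key-3", "new value");     // no existing record -> insert at close()

    MergeSketch sketch = new MergeSketch(incoming);
    sketch.write("key-1", "old value 1");   // updated
    sketch.write("key-2", "old value 2");   // copied over unchanged
    sketch.close();                         // flushes key-3 as an insert
  }
}

Records seen while scanning the old base file take the update path; whatever is left in the map at close() time is flushed as plain inserts, which matches writeIncomingRecords above.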
To summarize the key points of upsert:
- mergeHandle.close() is when the data is actually written as inserts; when the handle is initialized, the incoming records are put into keyToNewRecords (in HoodieMergeHandle's init method).
- mergeHandle's write() method writes the new data (the update path) whenever it finds an incoming record for the key currently being merged.
Writing the Hudi metadata
This work happens in the StreamWriteOperatorCoordinator.notifyCheckpointComplete method:
public void notifyCheckpointComplete(long checkpointId) {
  ...
  final boolean committed = commitInstant(this.instant, checkpointId);
  ...
}
...
private boolean commitInstant(String instant, long checkpointId) {
  ...
  doCommit(instant, writeResults);
  ...
}
...
private void doCommit(String instant, List<WriteStatus> writeResults) {
  // commit or rollback
  long totalErrorRecords = writeResults.stream().map(WriteStatus::getTotalErrorRecords).reduce(Long::sum).orElse(0L);
  long totalRecords = writeResults.stream().map(WriteStatus::getTotalRecords).reduce(Long::sum).orElse(0L);
  boolean hasErrors = totalErrorRecords > 0;

  if (!hasErrors || this.conf.getBoolean(FlinkOptions.IGNORE_FAILED)) {
    HashMap<String, String> checkpointCommitMetadata = new HashMap<>();
    if (hasErrors) {
      LOG.warn("Some records failed to merge but forcing commit since commitOnErrors set to true. Errors/Total="
          + totalErrorRecords + "/" + totalRecords);
    }

    final Map<String, List<String>> partitionToReplacedFileIds = tableState.isOverwrite
        ? writeClient.getPartitionToReplacedFileIds(tableState.operationType, writeResults)
        : Collections.emptyMap();
    boolean success = writeClient.commit(instant, writeResults, Option.of(checkpointCommitMetadata),
        tableState.commitAction, partitionToReplacedFileIds);
    if (success) {
      reset();
      this.ckpMetadata.commitInstant(instant);
      LOG.info("Commit instant [{}] success!", instant);
    } else {
      throw new HoodieException(String.format("Commit instant [%s] failed!", instant));
    }
  } else {
    LOG.error("Error when writing. Errors/Total=" + totalErrorRecords + "/" + totalRecords);
    LOG.error("The first 100 error messages");
    writeResults.stream().filter(WriteStatus::hasErrors).limit(100).forEach(ws -> {
      LOG.error("Global error for partition path {} and fileID {}: {}",
          ws.getGlobalError(), ws.getPartitionPath(), ws.getFileId());
      if (ws.getErrors().size() > 0) {
        ws.getErrors().forEach((key, value) -> LOG.trace("Error for key:" + key + " and value " + value));
      }
    });
    // Rolls back instant
    writeClient.rollback(instant);
    throw new HoodieException(String.format("Commit instant [%s] failed and rolled back !", instant));
  }
}
The key call chain is commitInstant invoking doCommit(instant, writeResults).
If no errors occurred (or failed records are ignored via FlinkOptions.IGNORE_FAILED), the commit goes ahead.
The commit process itself is the same as in Spark; see Apache Hudi初探(五)(与spark的结合) for the details.
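As the doCommit code above shows, when some records fail the coordinator either rolls the instant back or, if FlinkOptions.IGNORE_FAILED is enabled, logs a warning and commits anyway. Below is a minimal sketch of turning that on in the writer configuration; it assumes FlinkOptions lives in org.apache.hudi.configuration (as in the hudi-flink module) and uses a made-up table path:

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.configuration.FlinkOptions;

public class IgnoreFailedConfigSketch {
  public static void main(String[] args) {
    // Writer configuration for the Hudi Flink sink (the table path is a made-up example).
    Configuration conf = new Configuration();
    conf.setString(FlinkOptions.PATH, "file:///tmp/hudi/demo_table");
    // Commit even when some WriteStatus objects report error records;
    // with the default (false), doCommit logs the errors and rolls the instant back.
    conf.setBoolean(FlinkOptions.IGNORE_FAILED, true);
  }
}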
Other notes
Where the fileId of a newly written file is allocated in Flink and Spark:
// In Flink
In BucketAssignFunction, processRecord -> getNewRecordLocation allocates the new fileId.
// In Spark
In BaseSparkCommitActionExecutor's execute method, handleUpsertPartition goes through UpsertPartitioner's getBucketInfo method;
the assignInserts method called from UpsertPartitioner's constructor is where the new fileIds are allocated.
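Conceptually, the assignment in both engines follows the same rule: a record key that already belongs to a file group keeps that group's fileId (the update path), and a brand-new key gets a newly generated fileId (the insert path). Below is a minimal sketch of that rule; the class name, the plain HashMap index, and the bare random UUID are illustrative assumptions only, not the actual Hudi implementation (which, among other things, also packs inserts into buckets based on small files and target file size).

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical illustration of the assignment rule: a key already tracked by the index keeps
// its existing fileId (update), while an unseen key is assigned a freshly generated fileId (insert).
// This is not how BucketAssignFunction or UpsertPartitioner are actually implemented.
public class FileIdAssignSketch {
  private final Map<String, String> keyToFileId = new HashMap<>(); // stands in for the index

  public String assign(String recordKey) {
    return keyToFileId.computeIfAbsent(recordKey, k -> UUID.randomUUID().toString());
  }

  public static void main(String[] args) {
    FileIdAssignSketch assigner = new FileIdAssignSketch();
    String first = assigner.assign("key-1");  // new key -> new fileId
    String second = assigner.assign("key-1"); // same key -> same fileId
    System.out.println(first.equals(second)); // true
  }
}

In Flink, roughly speaking, the key-to-location mapping lives in the keyed state used by BucketAssignFunction, so the lookup above corresponds to a state access rather than an in-memory map.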