Iceberg: Columnar Reading of Parquet Data

The basic flow of reading Parquet files through Spark

text
SQL
==> Spark parses the SQL and produces a logical plan tree
LogicalPlan
==> Spark creates the logical plan node for scanning the table / reading the data
DataSourceV2ScanRelation
==> Spark optimizes the logical plan tree and generates the physical plan tree
SparkPlan
==> Based on the relevant properties, Spark converts the logical plan node DataSourceV2ScanRelation into the physical plan node BatchScanExec
BatchScanExec
==> The lazily evaluated BatchScanExec::inputRDD property creates a DataSourceRDD instance
DataSourceRDD
==> DataSourceRDD::compute creates a PartitionReader instance
PartitionReader
==> Iceberg's BatchDataReader implements Spark's PartitionReader interface
BatchDataReader
==> BatchDataReader::open creates an iterator over the Parquet file (data traversal in Spark is iterator-based)
VectorizedParquetReader
==> VectorizedParquetReader::next reads the contents of the Parquet file and wraps them into Spark ColumnarBatch objects
ColumnarBatch
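
For orientation, here is a minimal sketch of a query that exercises this path. The catalog configuration, warehouse path, and table name (demo.db.events) are hypothetical; it assumes the Iceberg Spark runtime is on the classpath.

java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IcebergVectorizedReadExample {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder()
            .appName("iceberg-vectorized-read")
            .master("local[*]")
            // assumes a Hadoop catalog named "demo"; adjust to your environment
            .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.demo.type", "hadoop")
            .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
            .getOrCreate();

    // SQL -> LogicalPlan -> DataSourceV2ScanRelation -> BatchScanExec -> DataSourceRDD
    Dataset<Row> df = spark.sql("SELECT id, data FROM demo.db.events WHERE id > 100");
    df.explain(true); // the physical plan shows a BatchScan node
    df.show();
  }
}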

Two implementations of BaseBatchReader

BaseBatchReader reads the underlying files in a batched, vectorized fashion.

ColumnarBatchReader

A reader created via the static method VectorizedSparkParquetReaders::buildReader(). Its key characteristics:

  1. Supports reading Delete Files
  2. Reads Parquet files directly in the Arrow format
  3. The final result is of type ColumnarBatch, Spark's implementation class
java
  public static ColumnarBatchReader buildReader(
      Schema expectedSchema,
      MessageType fileSchema,
      Map<Integer, ?> idToConstant,
      DeleteFilter<InternalRow> deleteFilter) {
    return (ColumnarBatchReader)
        TypeWithSchemaVisitor.visit(
            expectedSchema.asStruct(),
            fileSchema,
            new ReaderBuilder(
                expectedSchema,
                fileSchema,
                NullCheckingForGet.NULL_CHECKING_ENABLED,
                idToConstant,
                ColumnarBatchReader::new,
                deleteFilter));
  }

ArrowBatchReader

A reader created via the static method ArrowReader::buildReader(). Its key characteristics:

  1. Does not support reading Delete Files
  2. Reads Parquet files directly in the Arrow format
  3. The final result is also of type ColumnarBatch, but Iceberg's own built-in implementation class

In Iceberg 1.2.x it is only used in test cases, so it is not discussed further here; its implementation is simpler than that of ColumnarBatchReader.

Creating the ColumnarBatchReader

DataSourceRDD::compute creates a PartitionReader instance

scala
// While computing the RDD's data, the following call creates an instance of a concrete class
// implementing the PartitionReader interface. Here partitionReaderFactory is of type
// SparkColumnarReaderFactory, an Iceberg implementation that overrides createColumnarReader(InputPartition)
// to return a PartitionReader<ColumnarBatch> instance.
val batchReader = partitionReaderFactory.createColumnarReader(inputPartition)

PartitionReaderFactory::createColumnarReader creates a BatchDataReader instance

java
class SparkColumnarReaderFactory implements PartitionReaderFactory {
  public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition inputPartition) {
    SparkInputPartition partition = (SparkInputPartition) inputPartition;
    if (partition.allTasksOfType(FileScanTask.class)) {
      return new BatchDataReader(partition, batchSize);
    } else {
      throw new UnsupportedOperationException(
          "Unsupported task group for columnar reads: " + partition.taskGroup());
    }
  }
}
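
The batchSize passed to BatchDataReader above comes from the scan's read configuration, which can be tuned through Iceberg's read.parquet.vectorization.* table properties. Below is a minimal sketch of adjusting them on an existing table; the values shown are illustrative defaults to verify against your Iceberg version.

java
import org.apache.iceberg.Table;

class VectorizationTuning {
  static void tune(Table table) {
    table
        .updateProperties()
        .set("read.parquet.vectorization.enabled", "true")    // turn on batched/vectorized reads
        .set("read.parquet.vectorization.batch-size", "5000") // rows per ColumnarBatch
        .commit();
  }
}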

BatchDataReader::open creates the VectorizedParquetReader iterator

BatchDataReader::open

java
class BatchDataReader extends BaseBatchReader<FileScanTask>
    implements PartitionReader<ColumnarBatch> {
  @Override
  protected CloseableIterator<ColumnarBatch> open(FileScanTask task) {
    // Get the path of the data file
    String filePath = task.file().path().toString();
    LOG.debug("Opening data file {}", filePath);

    // update the current file for Spark's filename() function
    InputFileBlockHolder.set(filePath, task.start(), task.length());

    Map<Integer, ?> idToConstant = constantsMap(task, expectedSchema());
    // Get a handle to the underlying file
    InputFile inputFile = getInputFile(filePath);
    Preconditions.checkNotNull(inputFile, "Could not find InputFile associated with FileScanTask");
    // Get the delete files associated with this data file
    SparkDeleteFilter deleteFilter =
        task.deletes().isEmpty()
            ? null
            : new SparkDeleteFilter(filePath, task.deletes(), counter());
    // Return an iterator over the data file
    return newBatchIterable(
            inputFile,
            task.file().format(),
            task.start(),
            task.length(),
            task.residual(),
            idToConstant,
            deleteFilter)
        .iterator();
  }
}

BaseBatchReader::newBatchIterable creates a VectorizedParquetReader instance

VectorizedParquetReader is the top-level class; it provides the entry point for iterating over the file's contents.

java
abstract class BaseBatchReader<T extends ScanTask> extends BaseReader<ColumnarBatch, T> {

  protected CloseableIterable<ColumnarBatch> newBatchIterable(
      InputFile inputFile,
      FileFormat format,
      long start,
      long length,
      Expression residual,
      Map<Integer, ?> idToConstant,
      SparkDeleteFilter deleteFilter) {
    switch (format) {
      case PARQUET:
        // If the file format is PARQUET, create an iterable over the Parquet file
        return newParquetIterable(inputFile, start, length, residual, idToConstant, deleteFilter);

      case ORC:
        // Ignored; not discussed here
        return newOrcIterable(inputFile, start, length, residual, idToConstant);

      default:
        throw new UnsupportedOperationException(
            "Format: " + format + " not supported for batched reads");
    }
  }
 
  private CloseableIterable<ColumnarBatch> newParquetIterable(
      InputFile inputFile,
      long start,
      long length,
      Expression residual,
      Map<Integer, ?> idToConstant,
      SparkDeleteFilter deleteFilter) {
    // get required schema if there are deletes
    Schema requiredSchema = deleteFilter != null ? deleteFilter.requiredSchema() : expectedSchema();

    return Parquet.read(inputFile)
        .project(requiredSchema)
        .split(start, length)
        // Register the function that creates an instance of a BaseBatchReader implementation
        .createBatchedReaderFunc(
            fileSchema ->
                VectorizedSparkParquetReaders.buildReader(
                    requiredSchema, fileSchema, idToConstant, deleteFilter))
        .recordsPerBatch(batchSize)
        .filter(residual)
        .caseSensitive(caseSensitive())
        // Spark eagerly consumes the batches. So the underlying memory allocated could be reused
        // without worrying about subsequent reads clobbering over each other. This improves
        // read performance as every batch read doesn't have to pay the cost of allocating memory.
        .reuseContainers()
        .withNameMapping(nameMapping())
        .build();
  }
}

ColumnarBatchReader::new creates the ColumnarBatchReader instance

VectorizedSparkParquetReaders.buildReader() was briefly covered in the first section above.

java
public class ColumnarBatchReader extends BaseBatchReader<ColumnarBatch> {
  private final boolean hasIsDeletedColumn;
  private DeleteFilter<InternalRow> deletes = null;
  private long rowStartPosInBatch = 0;
  // The only constructor. readers holds one reader per column (field) of the file; each is an
  // instance of VectorizedArrowReader, which implements the VectorizedReader<T> interface.
  public ColumnarBatchReader(List<VectorizedReader<?>> readers) {
    super(readers);
    // Inspect each field's reader type to see whether the built-in _deleted metadata column is present; it marks whether a row has been deleted.
    this.hasIsDeletedColumn =
        readers.stream().anyMatch(reader -> reader instanceof DeletedVectorReader);
  }
}

Reading the Parquet file

From the preceding analysis we know that the interface visible to the upper layer (the Spark RDD) is provided by VectorizedParquetReader, an implementation of CloseableIterable, which internally wraps the operations on ColumnarBatchReader.

VectorizedParquetReader::iterator returns an iterator over the Parquet file

java
public class VectorizedParquetReader<T> extends CloseableGroup implements CloseableIterable<T> {

  @Override
  public CloseableIterator<T> iterator() {
    FileIterator<T> iter = new FileIterator<>(init());
    addCloseable(iter);
    return iter;
  }
}
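
To make the hand-off concrete, here is a minimal consumption sketch (not Spark's actual compute loop); it only assumes a CloseableIterable<ColumnarBatch> built as shown earlier, and the class and method names are made up.

java
import java.io.IOException;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.CloseableIterator;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class BatchCountExample {
  // Drains the iterable batch by batch; each next() call returns one ColumnarBatch
  // holding up to recordsPerBatch rows.
  static long countRows(CloseableIterable<ColumnarBatch> batches) throws IOException {
    long rows = 0;
    try (CloseableIterator<ColumnarBatch> it = batches.iterator()) {
      while (it.hasNext()) {
        rows += it.next().numRows();
      }
    }
    return rows;
  }
}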

FileIterator::next reads the data

Since FileIterator implements Java's Iterator interface, the file's contents can be pulled through this iterator while computing the Spark RDD, namely the ColumnarBatch objects returned by next().

java
  /**
   * Here T is ColumnarBatch.
   */
  private static class FileIterator<T> implements CloseableIterator<T> {
  
    public T next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      if (valuesRead >= nextRowGroupStart) {
        // On the first call, valuesRead == nextRowGroupStart, meaning a new row group is about to be read.
        // After advance() is called, nextRowGroupStart points to the start position of the next row group,
        // but the current row group's rows have not been read yet; that is deferred to the code below.
        advance();
      }

      // batchSize is an integer, so casting to integer is safe
      // Read data from the current row group, where:
      //   nextRowGroupStart points to the start position of the next row group,
      //   valuesRead is the total number of rows read so far.
      // It always holds that nextRowGroupStart >= valuesRead, and their difference is the number of rows of the current row group not yet read.
      int numValuesToRead = (int) Math.min(nextRowGroupStart - valuesRead, batchSize);
      // Read the requested number of rows; model here is the ColumnarBatchReader instance mentioned earlier.
      if (reuseContainers) {
        this.last = model.read(last, numValuesToRead);
      } else {
        this.last = model.read(null, numValuesToRead);
      }
      // Accumulate the number of rows read
      valuesRead += numValuesToRead;

      return last;
    }

    /**
     * Advance the read position to the start of the next row group.
     */
    private void advance() {
      while (shouldSkip[nextRowGroup]) {
        nextRowGroup += 1;
        reader.skipNextRowGroup();
      }
      PageReadStore pages;
      try {
        pages = reader.readNextRowGroup();
      } catch (IOException e) {
        throw new RuntimeIOException(e);
      }
      // Look up the starting row position of the row group being opened from the bound row-group metadata
      long rowPosition = rowGroupsStartRowPos[nextRowGroup];
      model.setRowGroupInfo(pages, columnChunkMetadata.get(nextRowGroup), rowPosition);
      nextRowGroupStart += pages.getRowCount();
      nextRowGroup += 1;
    }
  }
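
As a toy illustration (not Iceberg code) of how next() slices a row group into batches: with a 1000-row row group and batchSize = 300, successive calls read 300, 300, 300 and finally 100 rows.

java
class BatchSlicingDemo {
  public static void main(String[] args) {
    long nextRowGroupStart = 1000; // one past the last row of the current row group
    long valuesRead = 0;           // rows consumed so far
    int batchSize = 300;

    while (valuesRead < nextRowGroupStart) {
      // same arithmetic as FileIterator::next
      int numValuesToRead = (int) Math.min(nextRowGroupStart - valuesRead, batchSize);
      System.out.println("read " + numValuesToRead + " rows");
      valuesRead += numValuesToRead;
    }
  }
}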

ColumnarBatchReader::read

java
public class ColumnarBatchReader extends BaseBatchReader<ColumnarBatch> {
  protected final VectorHolder[] vectorHolders;

  @Override
  public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
    if (reuse == null) {
      // If the caller is not reusing containers, close the current VectorHolders
      closeVectors();
    }
    // The inner class ColumnBatchLoader performs the actual read on our behalf.
    ColumnarBatch columnarBatch = new ColumnBatchLoader(numRowsToRead).loadDataToColumnBatch();
    rowStartPosInBatch += numRowsToRead;
    return columnarBatch;
  }
}

ColumnBatchLoader::loadDataToColumnBatch reads the data and wraps it into a ColumnarBatch

java
  private class ColumnBatchLoader {
    // The number of rows to read in this batch
    private final int numRowsToRead;
    // the rowId mapping to skip deleted rows for all column vectors inside a batch, it is null when
    // there is no deletes
    private int[] rowIdMapping;
    // the array to indicate if a row is deleted or not, it is null when there is no "_deleted"
    // metadata column
    private boolean[] isDeleted;

    ColumnBatchLoader(int numRowsToRead) {
      Preconditions.checkArgument(
          numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
      this.numRowsToRead = numRowsToRead;
      if (hasIsDeletedColumn) {
        isDeleted = new boolean[numRowsToRead];
      }
    }

    ColumnarBatch loadDataToColumnBatch() {
      // Initialize the row id mapping and get the number of rows that survive position deletes
      int numRowsUndeleted = initRowIdMapping();
      // Read each column's data in Arrow format, exposed as Spark ColumnVector instances
      ColumnVector[] arrowColumnVectors = readDataToColumnVectors();
      // Create a ColumnarBatch instance containing the surviving rows
      ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);
      newColumnarBatch.setNumRows(numRowsUndeleted);

      if (hasEqDeletes()) {
        // If equality delete files exist, rows deleted by value must also be filtered out.
        // Filtering with equality deletes requires each row's actual values, so which rows to drop can only be decided after the data has been read into memory.
        applyEqDelete(newColumnarBatch);
      }

      if (hasIsDeletedColumn && rowIdMapping != null) {
        // When the _deleted metadata column is projected, keep all rows: reset the row id mapping to the identity so deleted rows are not filtered out
        // reset the row id mapping array, so that it doesn't filter out the deleted rows
        for (int i = 0; i < numRowsToRead; i++) {
          rowIdMapping[i] = i;
        }
        newColumnarBatch.setNumRows(numRowsToRead);
      }
      // Return the assembled batch
      return newColumnarBatch;
    }

    ColumnVector[] readDataToColumnVectors() {
      ColumnVector[] arrowColumnVectors = new ColumnVector[readers.length];

      ColumnVectorBuilder columnVectorBuilder = new ColumnVectorBuilder();
      for (int i = 0; i < readers.length; i += 1) {
        vectorHolders[i] = readers[i].read(vectorHolders[i], numRowsToRead);
        int numRowsInVector = vectorHolders[i].numValues();
        Preconditions.checkState(
            numRowsInVector == numRowsToRead,
            "Number of rows in the vector %s didn't match expected %s ",
            numRowsInVector,
            numRowsToRead);

        arrowColumnVectors[i] =
            columnVectorBuilder
                .withDeletedRows(rowIdMapping, isDeleted)
                .build(vectorHolders[i], numRowsInVector);
      }
      return arrowColumnVectors;
    }

    boolean hasEqDeletes() {
      return deletes != null && deletes.hasEqDeletes();
    }

    int initRowIdMapping() {
      Pair<int[], Integer> posDeleteRowIdMapping = posDelRowIdMapping();
      if (posDeleteRowIdMapping != null) {
        rowIdMapping = posDeleteRowIdMapping.first();
        return posDeleteRowIdMapping.second();
      } else {
        rowIdMapping = initEqDeleteRowIdMapping();
        return numRowsToRead;
      }
    }

    /**
     * If the current data file has position delete files, build the row id mapping for them.
     */
    Pair<int[], Integer> posDelRowIdMapping() {
      if (deletes != null && deletes.hasPosDeletes()) {
        return buildPosDelRowIdMapping(deletes.deletedRowPositions());
      } else {
        return null;
      }
    }
    /**
     * Build a row id mapping inside a batch, which skips deleted rows. Here is an example of how
     * we delete 2 rows in a batch with 8 rows in total:
     *
     *   [0,1,2,3,4,5,6,7] -- Original status of the row id mapping array
     *   [F,F,F,F,F,F,F,F] -- Original status of the isDeleted array
     *   Position delete 2, 6
     *   [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num records to 6]
     *   [F,F,T,F,F,F,T,F] -- After applying position deletes
     *
     * @param deletedRowPositions a set of deleted row positions
     * @return the mapping array and the new num of rows in a batch, null if no row is deleted
     */
    Pair<int[], Integer> buildPosDelRowIdMapping(PositionDeleteIndex deletedRowPositions) {
      if (deletedRowPositions == null) {
        return null;
      }
      // Create an array for the rows about to be read that holds the row ids of all non-deleted rows, starting from 0.
      // Basic algorithm: use two pointers to pack all surviving rows, in order, at the front of the array.
      int[] posDelRowIdMapping = new int[numRowsToRead];
      int originalRowId = 0; // index of the row currently being examined
      int currentRowId = 0; // index of the next slot for a surviving row
      while (originalRowId < numRowsToRead) {
        if (!deletedRowPositions.isDeleted(originalRowId + rowStartPosInBatch)) {
          // If the current row is not deleted, record it at the position pointed to by currentRowId
          posDelRowIdMapping[currentRowId] = originalRowId;
          // Advance currentRowId to the next slot to fill
          currentRowId++;
        } else {
          if (hasIsDeletedColumn) {
            isDeleted[originalRowId] = true;
          }

          deletes.incrementDeleteCount();
        }
        originalRowId++;
      }

      if (currentRowId == numRowsToRead) {
        // there is no delete in this batch
        return null;
      } else {
        return Pair.of(posDelRowIdMapping, currentRowId);
      }
    }

    int[] initEqDeleteRowIdMapping() {
      int[] eqDeleteRowIdMapping = null;
      if (hasEqDeletes()) {
        eqDeleteRowIdMapping = new int[numRowsToRead];
        for (int i = 0; i < numRowsToRead; i++) {
          eqDeleteRowIdMapping[i] = i;
        }
      }

      return eqDeleteRowIdMapping;
    }

    /**
     * Filter out the equality deleted rows. Here is an example:
     *
     *   [0,1,2,3,4,5,6,7] -- Original status of the row id mapping array
     *   [F,F,F,F,F,F,F,F] -- Original status of the isDeleted array
     *   Position delete 2, 6
     *   [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num records to 6]
     *   [F,F,T,F,F,F,T,F] -- After applying position deletes
     *   Equality delete 1 <= x <= 3
     *   [0,4,5,7,-,-,-,-] -- After applying equality deletes [Set Num records to 4]
     *   [F,T,T,T,F,F,T,F] -- After applying equality deletes
     *
     * @param columnarBatch the {@link ColumnarBatch} to apply the equality delete
     */
    void applyEqDelete(ColumnarBatch columnarBatch) {
      // Apply equality (by-value) deletes to the rows that survived the position deletes
      Iterator<InternalRow> it = columnarBatch.rowIterator();
      int rowId = 0;
      int currentRowId = 0;
      while (it.hasNext()) { // row-by-row traversal
        InternalRow row = it.next();
        if (deletes.eqDeletedRowFilter().test(row)) {
          // the row is NOT deleted
          // skip deleted rows by pointing to the next undeleted row Id
          // Update the rowIdMapping member in place
          rowIdMapping[currentRowId] = rowIdMapping[rowId];
          currentRowId++;
        } else {
          if (hasIsDeletedColumn) {
            isDeleted[rowIdMapping[rowId]] = true;
          }

          deletes.incrementDeleteCount();
        }

        rowId++;
      }
      // Update the number of surviving rows
      columnarBatch.setNumRows(currentRowId);
    }
  }
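
To close, here is a toy, self-contained sketch (not Iceberg code; the class name is made up) of the rowIdMapping idea used above: applying position deletes {2, 6} to a batch of 8 rows packs the mapping down to [0, 1, 3, 4, 5, 7] with 6 live rows, matching the example in the javadoc of buildPosDelRowIdMapping.

java
import java.util.Arrays;
import java.util.Set;

class RowIdMappingDemo {
  public static void main(String[] args) {
    int numRowsToRead = 8;
    Set<Long> deletedPositions = Set.of(2L, 6L); // positions deleted by position delete files
    long rowStartPosInBatch = 0;                 // absolute position of the batch's first row

    int[] rowIdMapping = new int[numRowsToRead];
    int currentRowId = 0;
    for (int originalRowId = 0; originalRowId < numRowsToRead; originalRowId++) {
      if (!deletedPositions.contains(originalRowId + rowStartPosInBatch)) {
        // pack surviving rows, in order, at the front of the mapping
        rowIdMapping[currentRowId++] = originalRowId;
      }
    }

    System.out.println("live rows = " + currentRowId); // live rows = 6
    System.out.println(Arrays.toString(Arrays.copyOf(rowIdMapping, currentRowId))); // [0, 1, 3, 4, 5, 7]
  }
}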