Spark DataFusion Comet Vectorized Rust Native: the Native ScanExec Operator and the Selection Vectors Involved

Background

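Apache DataFusion Comet is a vectorization project, open-sourced by Apple, that accelerates Spark execution.

The project is built on a Spark plugin + Protobuf + Arrow + DataFusion architecture.

Specifically: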

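The Spark plugin part uses SparkPlugin, which is split into DriverPlugin and ExecutorPlugin; both are invoked when the driver and the executors start up
Protobuf serializes Spark expressions and plans so they can be handed to the native engine for execution, taking advantage of its small size and speed
Arrow handles efficient data exchange between Spark and the native engine (native results or Spark-side data), relying on Arrow's columnar format and zero-copy properties across JNI; the scan path analyzed here goes through the Arrow C Data Interface
DataFusion is a vectorized execution engine built on Rust native code and the Arrow memory format; Spark offloads the corresponding operators to this engine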

This article is based on the latest code on the main branch of datafusion comet as of January 13, 2026 (commit eef5f28a0727d9aef043fa2b87d6747ff68b827a).
It mainly analyzes ScanExec in the natively executed DataFusion plan and the Selection Vectors involved.

`Selection Vectors`

What are `Selection Vectors`

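Selection Vectors are one way of representing filter results in a vectorized query execution engine. The other representation is the Bitmap:

Bitmap: a bitmap marks which rows were selected by the filter
Selection Vector: a vector stores the indices of the rows that were hit
The difference is that a Bitmap keeps an entry for every row, with 0/1 marking whether the row survives, while a Selection Vector records only the rows that were hit
For details, see the paper Filter Representation in Vectorized Query Execution
From these two filter representations, three execution strategies can be derived: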

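BMFull: always processes all rows; the values at unselected positions are undefined. Its advantage is that it can fully exploit vectorization
BMPartial: processes only the selected rows; it cannot exploit vectorization and still has to walk every index
SVPartial: only walks the selected indices; it cannot exploit vectorization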
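The two representations and the three strategies can be sketched in plain Rust. This is only an illustration under stated assumptions: the function names are hypothetical, the "operation" is a simple doubling of each value, and nothing here is taken from Comet or from the paper's implementation.

```rust
/// Bitmap representation: one flag per input row (true = row passed the filter).
fn filter_to_bitmap(values: &[i32], pred: impl Fn(i32) -> bool) -> Vec<bool> {
    values.iter().map(|&v| pred(v)).collect()
}

/// Selection-vector representation: only the indices of the rows that passed.
fn filter_to_selection_vector(values: &[i32], pred: impl Fn(i32) -> bool) -> Vec<u32> {
    values
        .iter()
        .enumerate()
        .filter_map(|(i, &v)| if pred(v) { Some(i as u32) } else { None })
        .collect()
}

/// BMFull: compute on every row; results at unselected positions are ignored later.
fn bm_full_double(values: &[i32]) -> Vec<i32> {
    values.iter().map(|&v| v * 2).collect()
}

/// BMPartial: walk every position, but compute only where the bitmap is set.
fn bm_partial_double(values: &[i32], bitmap: &[bool]) -> Vec<Option<i32>> {
    values
        .iter()
        .zip(bitmap)
        .map(|(&v, &keep)| if keep { Some(v * 2) } else { None })
        .collect()
}

/// SVPartial: walk only the selected indices.
fn sv_partial_double(values: &[i32], sel: &[u32]) -> Vec<i32> {
    sel.iter().map(|&i| values[i as usize] * 2).collect()
}
```

The BMFull loop has no branch in its body, which is why it is the easiest of the three for a compiler to vectorize; the other two trade that for doing less work when few rows are selected.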

ScanExec reads and the `Selection Vectors` involved

Java side

```
public boolean hasSelectionVectors() {
  if (currentBatch == null) {
    return false;
  }

  // Check if all columns are CometSelectionVector instances
  for (int i = 0; i < currentBatch.numCols(); i++) {
    if (!(currentBatch.column(i) instanceof CometSelectionVector)) {
      return false;
    }
  }
  return true;
}
```

The CometSelectionVector here is the selection vector described above. Its code is as follows:

```
public class CometSelectionVector extends CometVector {
  /** The original vector containing all values */
  private final CometVector values;

  /**
   * The valid indices in the values vector. This array is converted into an Arrow vector so we can
   * transfer the data to native in one JNI call. This is used to represent the rowid mapping used
   * by Iceberg
   */
  private final int[] selectionIndices;

  /**
   * The indices vector containing selection indices. This is currently allocated by the JVM side
   * unlike the values vector which is allocated on the native side
   */
  private final CometVector indices;

  // ...
}
```

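values holds all the original values of a column
selectionIndices holds the indices of the selected rows
indices is the Arrow vector form of the Java array selectionIndices, which lets other languages access the data zero-copy; it is later passed to the native (Rust) layer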
```
  this.indices =
     CometVector.getVector(indicesVector, values.useDecimal128, values.getDictionaryProvider());
```
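
As a rough mental model of the values/indices pair, here is a minimal plain-Rust sketch. It assumes that logical row i of the wrapped column resolves to values[indices[i]], which is the usual meaning of a selection vector; the SelectedColumn type and its methods are hypothetical and are not Comet's API.

```rust
/// Plain-Rust stand-in for a column wrapped by a selection vector (hypothetical type).
struct SelectedColumn<'a> {
    /// All original values of the column.
    values: &'a [i64],
    /// Positions of the selected rows inside `values`.
    indices: &'a [i32],
}

impl<'a> SelectedColumn<'a> {
    /// Number of logical rows: the number of selected indices, not values.len().
    fn num_rows(&self) -> usize {
        self.indices.len()
    }

    /// Logical row `row` is read through the index: values[indices[row]].
    /// Returns None when `row` or the stored index is out of range.
    fn get(&self, row: usize) -> Option<i64> {
        let idx = *self.indices.get(row)?;
        if idx < 0 {
            return None;
        }
        self.values.get(idx as usize).copied()
    }
}
```

Keeping the indices next to the untouched values means the filter itself copies no column data; the cost is one extra indirection on every read.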

Rust side

The get_selection_indices method

```
fn get_selection_indices(
    env: &mut jni::JNIEnv,
    iter: &JObject,
    num_cols: usize,
) -> Result<Option<Vec<ArrayRef>>, CometError> {
    // Check if all columns have selection vectors
    let has_selection_vectors_result: jni::sys::jboolean = unsafe {
        jni_call!(env,
            comet_batch_iterator(iter).has_selection_vectors() -> jni::sys::jboolean)?
    };
    let has_selection_vectors = has_selection_vectors_result != 0;

    let selection_indices_arrays = if has_selection_vectors {
        // Allocate arrays for selection indices export (one per column)
        let mut indices_array_addrs = Vec::with_capacity(num_cols);
        let mut indices_schema_addrs = Vec::with_capacity(num_cols);

        for _ in 0..num_cols {
            let arrow_array = Rc::new(FFI_ArrowArray::empty());
            let arrow_schema = Rc::new(FFI_ArrowSchema::empty());
            indices_array_addrs.push(Rc::into_raw(arrow_array) as i64);
            indices_schema_addrs.push(Rc::into_raw(arrow_schema) as i64);
        }

        // Prepare JNI arrays for the export call
        let indices_array_obj = env.new_long_array(num_cols as jsize)?;
        let indices_schema_obj = env.new_long_array(num_cols as jsize)?;
        env.set_long_array_region(&indices_array_obj, 0, &indices_array_addrs)?;
        env.set_long_array_region(&indices_schema_obj, 0, &indices_schema_addrs)?;

        // Export selection indices from JVM
        let _exported_count: i32 = unsafe {
            jni_call!(env,
                comet_batch_iterator(iter).export_selection_indices(
                    JValueGen::Object(JObject::from(indices_array_obj).as_ref()),
                    JValueGen::Object(JObject::from(indices_schema_obj).as_ref())
                ) -> i32)?
        };

        // Convert to ArrayRef for easier handling
        let mut selection_arrays = Vec::with_capacity(num_cols);
        for i in 0..num_cols {
            let array_data =
                ArrayData::from_spark((indices_array_addrs[i], indices_schema_addrs[i]))?;
            selection_arrays.push(make_array(array_data));

            // Drop the references to the FFI arrays
            unsafe {
                Rc::from_raw(indices_array_addrs[i] as *const FFI_ArrowArray);
                Rc::from_raw(indices_schema_addrs[i] as *const FFI_ArrowSchema);
            }
        }

        Some(selection_arrays)
    } else {
        None
    };

    Ok(selection_indices_arrays)
}
```

The method first checks whether Selection Vectors exist by calling the Java-side method hasSelectionVectors through JNI:

```
jni_call!(env,
    comet_batch_iterator(iter).has_selection_vectors() -> jni::sys::jboolean)?
```

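If they exist, the Selection Vector of each column is fetched; otherwise None is returned.
- First, for each column, create an empty FFI_ArrowArray and an empty FFI_ArrowSchema, and push their addresses into two Vecs (indices_array_addrs and indices_schema_addrs)
- Use JNIEnv.new_long_array to create two Java long arrays
- Use JNIEnv.set_long_array_region to copy those addresses (the FFI array/schema addresses) into the Java long arrays
- Call the JVM's exportSelectionIndices method through JNI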
```
  let _exported_count: i32 = unsafe {
    jni_call!(env,
        comet_batch_iterator(iter).export_selection_indices(
            JValueGen::Object(JObject::from(indices_array_obj).as_ref()),
            JValueGen::Object(JObject::from(indices_schema_obj).as_ref())
        ) -> i32)?
 };
```
  The as_ref call here borrows the wrapper object as a shared reference (&JObject), so it can be passed as a JNI argument without giving up ownership of the object.
  
  On the JVM side, this mainly comes down to a call to NativeUtil.exportSingleVector:
```
   def exportSingleVector(vector: CometVector, arrayAddr: Long, schemaAddr: Long): Unit = {
     val valueVector = vector.getValueVector

     val provider = if (valueVector.getField.getDictionary != null) {
       vector.getDictionaryProvider
     } else {
       null
     }

     val arrowSchema = ArrowSchema.wrap(schemaAddr)
     val arrowArray = ArrowArray.wrap(arrayAddr)
     Data.exportVector(
       allocator,
       getFieldVector(valueVector, "export"),
       provider,
       arrowArray,
       arrowSchema)
   }
```
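  Data.exportVector is the call that hands the Selection Vector back to the Rust side.
- Call ArrayData::from_spark to turn the memory addresses passed from the Spark side through the Arrow C Data Interface into an Arrow ArrayData object on the Rust side
  
  This mainly uses from_ffi to rebuild Rust's ArrayData from the raw pointers. The process is zero-copy and directly reuses the memory allocated on the Spark side. It then calls align_buffers() so that the data can be accessed correctly and safely on the Rust side (SIMD operations, for example, have memory alignment requirements).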
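
The address handoff in the steps above (allocate empty structs, leak them as raw addresses, let the other side fill them in, then reclaim them) can be sketched with the standard library alone. This is a simplified model under stated assumptions: FfiSlot is a hypothetical stand-in for FFI_ArrowArray / FFI_ArrowSchema, export_into plays the role that the JVM plays in Comet, and a Cell is used so that the write goes through a shared reference.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for an empty FFI struct that the producer fills in.
#[derive(Default)]
struct FfiSlot {
    len: Cell<i64>,
}

/// Consumer side: allocate one empty slot per column and leak it as a raw address.
fn allocate_slots(num_cols: usize) -> Vec<i64> {
    (0..num_cols)
        .map(|_| Rc::into_raw(Rc::new(FfiSlot::default())) as i64)
        .collect()
}

/// Producer side: write the exported data through the addresses it was handed.
fn export_into(addrs: &[i64], lens: &[i64]) {
    for (&addr, &len) in addrs.iter().zip(lens) {
        // SAFETY: addr came from Rc::into_raw above and the slot has not been freed.
        let slot: &FfiSlot = unsafe { &*(addr as *const FfiSlot) };
        slot.len.set(len);
    }
}

/// Consumer side again: read each slot and reclaim its Rc so the memory is freed.
fn collect_and_release(addrs: Vec<i64>) -> Vec<i64> {
    addrs
        .into_iter()
        .map(|addr| {
            // SAFETY: every address is turned back into an Rc exactly once.
            let slot = unsafe { Rc::from_raw(addr as *const FfiSlot) };
            slot.len.get()
        })
        .collect()
}
```

Rc::into_raw deliberately leaks the allocation so that the address stays valid while the other side writes into it. Every leaked address then needs exactly one matching Rc::from_raw, which is what the "Drop the references" block in get_selection_indices does.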

The allocate_and_fetch_batch method and what follows

```
let (num_rows, array_addrs, schema_addrs) =
    Self::allocate_and_fetch_batch(&mut env, iter, num_cols)?;

let mut inputs: Vec<ArrayRef> = Vec::with_capacity(num_cols);

// Process each column
for i in 0..num_cols {
    let array_ptr = array_addrs[i];
    let schema_ptr = schema_addrs[i];
    let array_data = ArrayData::from_spark((array_ptr, schema_ptr))?;

    // TODO: validate array input data
    // array_data.validate_full()?;

    let array = make_array(array_data);

    // Apply selection if selection vectors exist (applies to all columns)
    let array = if let Some(ref selection_arrays) = selection_indices_arrays {
        let indices = &selection_arrays[i];
        // Apply the selection using Arrow's take kernel
        match take(&*array, &**indices, None) {
            Ok(selected_array) => selected_array,
            Err(e) => {
                return Err(CometError::from(ExecutionError::ArrowError(format!(
                    "Failed to apply selection for column {i}: {e}",
                ))));
            }
        }
    } else {
        array
    };

    let array = if arrow_ffi_safe {
        // ownership of this array has been transferred to native
        // but we still need to unpack dictionary arrays
        copy_or_unpack_array(&array, &CopyMode::UnpackOrClone)?
    } else {
        // it is necessary to copy the array because the contents may be
        // overwritten on the JVM side in the future
        copy_array(&array)
    };

    inputs.push(array);

    // Drop the Arcs to avoid memory leak
    unsafe {
        Rc::from_raw(array_ptr as *const FFI_ArrowArray);
        Rc::from_raw(schema_ptr as *const FFI_ArrowSchema);
    }
}

// If selection was applied, determine the actual row count from the selected arrays
let actual_num_rows = if let Some(ref selection_arrays) = selection_indices_arrays {
    if !selection_arrays.is_empty() {
        // Use the length of the first selection array as the actual row count
        selection_arrays[0].len()
    } else {
        num_rows as usize
    }
} else {
    num_rows as usize
};

Ok(InputBatch::new(inputs, Some(actual_num_rows)))
```

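allocate_and_fetch_batch fetches one batch of data from the Java side and fills it into the FFI_ArrowArray and FFI_ArrowSchema structures that were passed in.

On the Java side the handling is the same as in exportSingleVector above: Data.exportVector is used to hand the data over.
If a Selection Vector exists, Arrow's take kernel is used to obtain the actual values; otherwise the original array is used as is.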
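
What the selection step does to each column can be sketched without the arrow crate. The take_column function below is a plain-Rust stand-in that only mimics the gather semantics, not Arrow's take kernel itself; apply_selection is likewise a hypothetical helper. Its row count rule mirrors the code above: the length of the first selection array if a selection is present, otherwise num_rows.

```rust
/// Plain-Rust stand-in for a take kernel: gather values[indices[i]] into a new column.
fn take_column(values: &[i64], indices: &[i32]) -> Result<Vec<i64>, String> {
    indices
        .iter()
        .map(|&idx| {
            if idx < 0 {
                return Err(format!("negative index {idx}"));
            }
            values
                .get(idx as usize)
                .copied()
                .ok_or_else(|| format!("index {idx} out of bounds for length {}", values.len()))
        })
        .collect()
}

/// Apply the per-column selection (if any) and report the resulting row count.
fn apply_selection(
    columns: &[Vec<i64>],
    selection: Option<&[Vec<i32>]>,
    num_rows: usize,
) -> Result<(Vec<Vec<i64>>, usize), String> {
    match selection {
        Some(sel) => {
            let mut out = Vec::with_capacity(columns.len());
            for (i, (col, idx)) in columns.iter().zip(sel).enumerate() {
                let taken = take_column(col, idx)
                    .map_err(|e| format!("Failed to apply selection for column {i}: {e}"))?;
                out.push(taken);
            }
            // Row count: length of the first selection array, else the batch row count.
            let rows = sel.first().map_or(num_rows, |s| s.len());
            Ok((out, rows))
        }
        None => Ok((columns.to_vec(), num_rows)),
    }
}
```

Unlike the lazy values/indices pair on the JVM side, this materializes a dense column, so downstream native operators no longer need to know that a selection ever existed.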

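The &** applied to a &Arc goes from the reference &Arc to the Arc, then to the inner T, and finally borrows it again as &T, which reads the data without transferring ownership.
If Selection Vectors exist, the actual row count is the length of the first selection array; otherwise it is num_rows.

Finally the result is assembled and returned as InputBatch::new(inputs, Some(actual_num_rows)).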
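
The `&**` step can be tried in isolation with the standard library. In this minimal sketch the Column trait and the IntColumn type are hypothetical stand-ins for Arrow's Array trait and a concrete array type; only the reference handling is the point.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a dynamically typed array trait.
trait Column {
    fn len(&self) -> usize;
}

struct IntColumn(Vec<i32>);

impl Column for IntColumn {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Accepts a plain trait-object reference, like a kernel that takes `&dyn Array`.
fn column_len(col: &dyn Column) -> usize {
    col.len()
}

fn len_through_arc(columns: &[Arc<dyn Column>], i: usize) -> usize {
    let col: &Arc<dyn Column> = &columns[i];
    // *col   : Arc<dyn Column>  (a place expression, nothing is moved)
    // **col  : dyn Column       (reached through Arc's Deref impl)
    // &**col : &dyn Column      (a borrow; the Arc keeps ownership)
    column_len(&**col)
}
```

Because only a borrow is taken, the Arc's reference count does not change, so the call costs no atomic increment or decrement.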
