Spark DataFusion Comet Vectorized Rust Native: Executing the DataFusion Plan

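Background

Apache DataFusion Comet is a vectorization project, open-sourced by Apple, that accelerates Spark execution.

The project is built on a Spark plugin + Protobuf + Arrow + DataFusion architecture.

Specifically: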

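The Spark plugin is built on SparkPlugin, which consists of a DriverPlugin and an ExecutorPlugin; they are invoked when the driver and the executors start.
Protobuf serializes Spark expressions and plans so that they can be handed to the native engine for execution; it is chosen for its compact encoding and speed.
Arrow provides efficient data exchange between Spark and the native engine (results produced by native execution, or data produced by Spark), mainly relying on Arrow's columnar format (Arrow IPC) and zero-copy properties to move data across the JNI boundary.
DataFusion is a vectorized execution engine implemented in native Rust on top of the Arrow memory format; Spark offloads the supported operators to this engine for execution.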

This article is based on the latest code on the main branch of datafusion comet as of January 13, 2026 (commit eef5f28a0727d9aef043fa2b87d6747ff68b827a).
It focuses on how the Rust native side executes the physical plan.

native executePlan

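The code discussed here lives mainly in the CometExecIterator class, which is used by CometNativeShuffleWriter and CometNativeExec.

CometNativeShuffleWriter is used when native shuffle is enabled: it builds a native writer plan that writes the shuffle intermediate files, which is more efficient than writing them from the JVM.

CometNativeExec is used when native operators run: it executes a single native operator, and the results are then fetched from the native side.
JVM side (Scala):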

private def getNextBatch: Option[ColumnarBatch] = {
assert(partitionIndex >= 0 && partitionIndex < numParts)

  if (tracingEnabled) {
    traceMemoryUsage()
  }

  val ctx = TaskContext.get()

  try {
    withTrace(
      s"getNextBatch[JVM] stage=${ctx.stageId()}",
      tracingEnabled, {
        nativeUtil.getNextBatch(
          numOutputCols,
          (arrayAddrs, schemaAddrs) => {
            nativeLib.executePlan(ctx.stageId(), partitionIndex, plan, arrayAddrs, schemaAddrs)
          })
      })

If spark.comet.tracing.enabled is true (the default is false), memory usage is logged:

  nativeLib.logMemoryUsage("jvm_heapUsed", memoryMXBean.getHeapMemoryUsage.getUsed)
  val totalTaskMemory = cometTaskMemoryManager.internal.getMemoryConsumptionForThisTask
  val cometTaskMemory = cometTaskMemoryManager.getUsed
  val sparkTaskMemory = totalTaskMemory - cometTaskMemory
  val threadId = Thread.currentThread().getId
  nativeLib.logMemoryUsage(s"task_memory_comet_$threadId", cometTaskMemory)
  nativeLib.logMemoryUsage(s"task_memory_spark_$threadId", sparkTaskMemory)

The memory figures are written to the comet-event-trace.json file.

executePlan
NativeUtil was already explained in "Spark DataFusion Comet Vectorization Rule: CometExecRule Shuffle Analysis", so it is not repeated here. The callback simply calls nativeLib.executePlan(ctx.stageId(), partitionIndex, plan, arrayAddrs, schemaAddrs) to execute the plan.

Rust side:

Obtain the ExecutionContext pointer

 (id as *mut ExecutionContext)
 .as_mut()
 .expect("Comet execution context shouldn't be null!")

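The id is cast directly to a raw pointer, which is then converted into a mutable reference; a null pointer triggers the expect panic.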

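Check the type of exec_context.spark_plan.op_struct to determine whether the operator is a ShuffleWriter
If it is, tracing_event_name is set to executePlan(ShuffleWriter); otherwise it is executePlan
With the jemalloc feature enabled
If the jemalloc feature is enabled, a "Management Information Base" (MIB) mapping is obtained. The MIB is a lightweight, high-performance interface that, together with jemalloc's periodic epoch refresh, reads jemalloc statistics (such as allocated and active memory) at low overhead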

If root_op is empty, build the native plan and execute it

 let start = Instant::now();
 let planner: PhysicalPlanner =
     PhysicalPlanner::new(Arc::clone(&exec_context.session_ctx), partition)
         .with_exec_id(exec_context_id);
 let (scans, root_op) = planner.create_plan(
     &exec_context.spark_plan,
     &mut exec_context.input_sources.clone(),
     exec_context.partition_count,
 )?;
 let physical_plan_time = start.elapsed();

 exec_context.plan_creation_time += physical_plan_time;
 exec_context.root_op = Some(Arc::clone(&root_op));
 exec_context.scans = scans;

 if exec_context.explain_native {
     let formatted_plan_str =
         DisplayableExecutionPlan::new(root_op.native_plan.as_ref()).indent(true);
     info!("Comet native query plan:\n{formatted_plan_str:}");
 }

 let task_ctx = exec_context.session_ctx.task_ctx();
 // Each Comet native execution corresponds to a single Spark partition,
 // so we should always execute partition 0.
 let stream = root_op.native_plan.execute(0, task_ctx)?;
 exec_context.stream = Some(stream);

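Construct the PhysicalPlanner

Then call planner.create_plan to build the scans and root_op and store them in the corresponding exec_context fields. The time spent creating the physical plan is added to plan_creation_time.

At this point each protobuf Operator is mapped one-to-one to a DataFusion physical plan node so that it can actually be executed.
Execute the DataFusion plan

Call native_plan.execute to obtain the result of the plan, which is a RecordBatch stream.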
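
The "build once, reuse afterwards" behaviour described above can be sketched with the standard library alone. This is only an illustration: NativePlan, create_plan, ensure_plan and the plans_built counter are hypothetical stand-ins, not Comet's actual types or functions.

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the planned operator tree.
struct NativePlan {
    description: String,
}

/// Hypothetical, trimmed-down execution context.
#[derive(Default)]
struct ExecutionContext {
    root_op: Option<Arc<NativePlan>>,
    plan_creation_time: Duration,
    plans_built: u32,
}

/// Hypothetical planner: turns a serialized plan into a NativePlan.
fn create_plan(serialized: &str) -> Result<NativePlan, String> {
    if serialized.is_empty() {
        return Err("empty plan".to_string());
    }
    Ok(NativePlan {
        description: format!("Plan({serialized})"),
    })
}

/// Build the plan on the first call only; later calls reuse the cached one.
fn ensure_plan(ctx: &mut ExecutionContext, serialized: &str) -> Result<Arc<NativePlan>, String> {
    if let Some(plan) = &ctx.root_op {
        return Ok(Arc::clone(plan));
    }
    let start = Instant::now();
    let plan = Arc::new(create_plan(serialized)?);
    // Accumulate the planning time, mirroring `plan_creation_time += ...`.
    ctx.plan_creation_time += start.elapsed();
    ctx.plans_built += 1;
    ctx.root_op = Some(Arc::clone(&plan));
    Ok(plan)
}
```

Calling ensure_plan twice on the same context returns the same Arc and increments plans_built only once, which is the reason the planning cost is paid only on the first executePlan call for a given context.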
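If root_op is not empty

Call pull_input_batches to pull the next batch of input from the JVM; its implementation is explained later
Obtain the RecordBatch stream

The get_runtime().block_on call here uses the Rust tokio crate; for details, see the article "Understanding the tokio core"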
```
let next_item = exec_context.stream.as_mut().unwrap().next();
let poll_output = poll!(next_item);
```

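The poll! macro comes from the futures crate. Inside async code it polls a future exactly once and returns the Poll value immediately, instead of suspending until the future completes. poll is the low-level core of the Future trait, and inspecting its result (Poll::Pending or Poll::Ready) lets custom asynchronous logic drive the underlying state machine without blocking.
In addition, metrics are updated once every 100 polls.
Poll has two possible results:
Poll::Ready(val): the task has completed and returns its result.
Poll::Pending: the task has not completed yet.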
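
To make the two Poll states concrete, here is a minimal sketch that polls a future by hand using only the standard library. Countdown, NoopWaker and drive are hypothetical names introduced for this illustration; Comet itself uses futures::poll! inside a tokio runtime rather than a hand-written loop like this.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; sufficient when we poll in a plain loop.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// A future that returns Pending `remaining` times, then Ready(value).
struct Countdown {
    remaining: u32,
    value: i32,
}

impl Future for Countdown {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        if self.remaining == 0 {
            Poll::Ready(self.value)
        } else {
            self.remaining -= 1;
            // Signal that polling again is worthwhile.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Poll the future until it is ready; returns (value, number of Pending results).
fn drive(mut fut: Countdown) -> (i32, u32) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut pinned = Pin::new(&mut fut);
    let mut pending = 0u32;
    loop {
        match pinned.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return (value, pending),
            Poll::Pending => pending += 1,
        }
    }
}
```

For example, drive(Countdown { remaining: 3, value: 42 }) observes three Pending results before the Ready value arrives. A real executor would park the task between polls instead of spinning.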

match poll_output {
    Poll::Ready(Some(output)) => {
        // prepare output for FFI transfer
        return prepare_output(
            &mut env,
            array_addrs,
            schema_addrs,
            output?,
            exec_context.debug_native,
        );
    }
    Poll::Ready(None) => {
        // Reaches EOF of output.
        if exec_context.explain_native {
            if let Some(plan) = &exec_context.root_op {
                let formatted_plan_str = DisplayableExecutionPlan::with_metrics(
                    plan.native_plan.as_ref(),
                )
                .indent(true);
                info!(
                    "Comet native query plan with metrics (Plan #{} Stage {} Partition {}):\
                \n plan creation took {:?}:\
                \n{formatted_plan_str:}",
                    plan.plan_id, stage_id, partition, exec_context.plan_creation_time
                );
            }
        }
        return Ok(-1);
    }
    // A poll pending means the stream is not ready yet.
    Poll::Pending => {
        if exec_context.scans.is_empty() {
            // Pure async I/O (e.g., IcebergScanExec, DataSourceExec)
            // Yield to let the executor drive I/O instead of busy-polling
            tokio::task::yield_now().await;
        } else {
            // Has ScanExec operators
            // Busy-poll to pull batches from JVM
            // TODO: Investigate if JNI calls are safe without block_in_place.
            // block_in_place prevents Tokio from migrating this task to another thread,
            // which is necessary because JNI env is thread-local. If we can guarantee
            // thread safety another way, we could remove this wrapper for better perf.
            tokio::task::block_in_place(|| {
                pull_input_batches(exec_context)
            })?;
        }

        // Output not ready yet
        continue;
    }
}

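If the poll is ready and there is data, prepare_output is called to export the RecordBatch to the address arrays supplied by the JVM and to return the number of rows; this method was already explained in "Spark DataFusion Comet Vectorized Rust Native: Reading Data".
If the poll is ready but there is no more data, -1 is returned.
If the task is not finished and scans is empty, tokio::task::yield_now().await is called: the currently running Tokio task immediately gives up the CPU so that other pending tasks get a chance to run, which prevents long-running work from starving them.
If the task is not finished and scans is non-empty, pull_input_batches is called to pull the next batch of input from the JVM.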
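
The control flow of the scan branch can be modelled synchronously as follows. This is a simplified sketch under stated assumptions: FakeContext, its fields and the row-count-only "batches" are hypothetical, and the real code is async, calls prepare_output for the FFI export, and yields to the executor when there are no scans.

```rust
use std::collections::VecDeque;
use std::task::Poll;

/// Hypothetical model of an execution context fed by the JVM.
struct FakeContext {
    /// Row counts of batches still waiting on the "JVM" side.
    jvm_batches: VecDeque<usize>,
    /// Batches already pulled to the native side but not yet emitted.
    buffered: VecDeque<usize>,
    /// Number of times input was pulled.
    pulls: u32,
}

impl FakeContext {
    fn new(rows_per_batch: Vec<usize>) -> Self {
        Self {
            jvm_batches: rows_per_batch.into(),
            buffered: VecDeque::new(),
            pulls: 0,
        }
    }

    /// Stand-in for polling the output stream once.
    fn poll_next(&mut self) -> Poll<Option<usize>> {
        if let Some(rows) = self.buffered.pop_front() {
            Poll::Ready(Some(rows))
        } else if self.jvm_batches.is_empty() {
            Poll::Ready(None) // end of stream
        } else {
            Poll::Pending // needs more input first
        }
    }

    /// Stand-in for pull_input_batches: move one batch across the boundary.
    fn pull_input_batches(&mut self) {
        if let Some(rows) = self.jvm_batches.pop_front() {
            self.buffered.push_back(rows);
            self.pulls += 1;
        }
    }
}

/// Returns the row count of the next batch, or -1 at end of stream.
fn execute_plan(ctx: &mut FakeContext) -> i64 {
    loop {
        match ctx.poll_next() {
            Poll::Ready(Some(rows)) => return rows as i64,
            Poll::Ready(None) => return -1,
            Poll::Pending => {
                // Output not ready yet: feed more input, then poll again.
                ctx.pull_input_batches();
                continue;
            }
        }
    }
}
```

With two input batches of 10 and 20 rows, three successive calls to execute_plan return 10, 20 and then -1, matching the contract the JVM iterator relies on.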

nativeLib releasePlan

This method is called from the close method on the JVM side:

  def close(): Unit = synchronized {
    if (!closed) {
      ...
      nativeLib.releasePlan(plan)
      ...

      closed = true
    }
  }

It mainly does the following:

Update the native metrics
Release memory
Take back ownership of the ExecutionContext memory

Update metrics
```
  // Update metrics
  update_metrics(&mut env, execution_context)?;
```
This pushes the runtime metrics collected by DataFusion into the JVM-side CometMetricNode; the details are covered later.

Release memory

 handle_task_shared_pool_release(
     execution_context.memory_pool_config.pool_type,
     execution_context.task_attempt_id,
 );

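If a task-shared memory pool is in use and the number of native plans for the task drops to 0, the memory pool registered for that task_attempt_id is removed.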
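
The reference-counting idea behind this release step can be sketched as below. TaskSharedPools and its methods are hypothetical names for illustration; the actual bookkeeping inside handle_task_shared_pool_release is not shown in this article and may differ.

```rust
use std::collections::HashMap;

/// Hypothetical registry: task attempt id -> number of live native plans
/// that share the task's memory pool.
#[derive(Default)]
struct TaskSharedPools {
    plans_per_task: HashMap<i64, usize>,
}

impl TaskSharedPools {
    /// Register one more native plan for the task.
    fn acquire(&mut self, task_attempt_id: i64) {
        *self.plans_per_task.entry(task_attempt_id).or_insert(0) += 1;
    }

    /// Unregister one plan; returns true when the task's entry was removed
    /// because no plan uses the pool any more.
    fn release(&mut self, task_attempt_id: i64) -> bool {
        let Some(count) = self.plans_per_task.get_mut(&task_attempt_id) else {
            return false; // unknown task: nothing to release
        };
        *count = count.saturating_sub(1);
        if *count == 0 {
            self.plans_per_task.remove(&task_attempt_id);
            true
        } else {
            false
        }
    }

    /// Whether a pool is still registered for the task.
    fn contains(&self, task_attempt_id: i64) -> bool {
        self.plans_per_task.contains_key(&task_attempt_id)
    }
}
```

With two plans registered for the same task, the first release keeps the entry and only the second one removes it, so a pool shared by several plans of one task outlives each individual plan.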

Take back ownership of the ExecutionContext memory
```
 let _: Box<ExecutionContext> = Box::from_raw(execution_context);
```
Box::from_raw(ptr) turns the raw pointer back into a Box, so the ExecutionContext is freed correctly when the Box goes out of scope.
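
The full handle lifecycle (leak a Box as an integer handle, borrow it on each call, rebuild the Box on release) can be sketched with the standard library. The ExecutionContext here is a hypothetical one-field stand-in, and create_handle, borrow_handle, release_handle and the DROPPED counter are names invented for this illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts dropped contexts, only to make the release observable.
static DROPPED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for Comet's ExecutionContext.
struct ExecutionContext {
    plan_id: u32,
}

impl Drop for ExecutionContext {
    fn drop(&mut self) {
        DROPPED.fetch_add(1, Ordering::SeqCst);
    }
}

/// Give up ownership and expose the allocation as an opaque i64 handle.
fn create_handle(plan_id: u32) -> i64 {
    Box::into_raw(Box::new(ExecutionContext { plan_id })) as i64
}

/// Borrow the context behind a handle; a zero (null) handle yields None.
///
/// Safety: `handle` must be 0 or a live value returned by `create_handle`,
/// and no other reference to the context may exist at the same time.
unsafe fn borrow_handle<'a>(handle: i64) -> Option<&'a mut ExecutionContext> {
    unsafe { (handle as *mut ExecutionContext).as_mut() }
}

/// Rebuild the Box so that Rust runs Drop and frees the allocation.
///
/// Safety: `handle` must come from `create_handle` and be released only once.
unsafe fn release_handle(handle: i64) {
    let _: Box<ExecutionContext> =
        unsafe { Box::from_raw(handle as *mut ExecutionContext) };
}
```

Nothing is dropped between create_handle and release_handle, which is why the release call must happen exactly once: skipping it leaks the context, and calling it twice is a double free.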