Batch Processing: Concurrency Control and Interruptible Batches

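Installment E22 of the series "Dissecting Enterprise-Grade AI Agent Implementations". The previous installment, "Checkpoint Mechanism: How an Agent Picks Up After a Power Loss", covered interrupt and resume for a single agent. This one tackles a practical engineering problem: when the same workflow has to run over a batch of inputs, how do you cap concurrency, and how do you let a few tasks wait for human confirmation while the rest keep running? Eino provides BatchNode in compose/batch, and this post dissects its design.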

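What you will learn

What BatchNode is: a generic batch node that accepts []I and returns []O

The two modes of MaxConcurrency: sequential vs. semaphore-bounded concurrency

What happens on interrupt: which information NodeInterruptState saves

Which tasks are rerun on resume: the role of CompletedResults

CompositeInterrupt vs. a plain Interrupt: where they differ

A walkthrough of seven complete example scenarios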

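1. Scenario: Batch Review of Compliance Documents

Suppose the compliance team has 100 documents to review. Every document goes through the same workflow: score it automatically, pause the high-priority ones for human confirmation, and auto-approve the rest. Running them by hand one at a time is too slow, and running them all in parallel would blow through the LLM's QPS limit.

This is the problem BatchNode solves.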

2. Interface: A Generic Wrapper


// compose/batch/batch/node.go
type Node[I, O any] struct { ... }

func NewBatchNode[I, O any](config *NodeConfig[I, O]) *Node[I, O]

type NodeConfig[I, O any] struct {
    Name                string
    InnerTask           Compilable[I, O] // accepts a Graph or a Workflow
    MaxConcurrency      int              // 0 = sequential, >0 = concurrency cap
    InnerCompileOptions []compose.GraphCompileOption
}

Usage:


batchNode := batch.NewBatchNode(&batch.NodeConfig[ReviewRequest, ReviewResult]{
    Name:           "DocReviewer",
    InnerTask:      reviewWorkflow, // a compose.Workflow
    MaxConcurrency: 3,              // at most 3 in parallel
})

results, err := batchNode.Invoke(ctx, docs)

The input is []ReviewRequest, the output is []ReviewResult, and the output order matches the input order.

3. Concurrency Control: A Semaphore Implementation


// batch/node.go
if b.maxConcurrency == 0 {
    // Sequential: run one at a time
    for _, idx := range indicesToProcess {
        wg.Add(1)
        runTask(idx, effectiveInputs[idx])
    }
} else {
    // Concurrent: a semaphore caps the parallelism
    sem := make(chan struct{}, b.maxConcurrency)
    for i, idx := range indicesToProcess {
        wg.Add(1)
        if i == 0 {
            runTask(idx, effectiveInputs[idx]) // the first task runs on the calling goroutine
        } else {
            go func(index int, input I) {
                sem <- struct{}{}        // acquire the semaphore
                defer func() { <-sem }() // release the semaphore
                runTask(index, input)
            }(idx, effectiveInputs[idx])
        }
    }
}

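Design details worth noting:

MaxConcurrency == 0 means sequential execution, not "unlimited concurrency".
The first task always runs on the calling goroutine, which saves spawning one goroutine.
In the excerpt as written, that first task runs synchronously in the loop's first iteration, so it completes before any goroutine is launched. The remaining tasks are then capped by the semaphore, whose capacity is MaxConcurrency. The number of tasks in flight therefore never exceeds MaxConcurrency.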

4. Internals: How Each Subtask Is Isolated

Each subtask gets its own context address and its own checkpointID:


func (b *Node[I, O]) invoke(...) {
    runTask := func(index int, input I) {
        // Each subtask gets its own address segment: batch_process:0, batch_process:1, ...
        subCtx := compose.AppendAddressSegment(ctx,
            AddressSegmentBatchProcess, strconv.Itoa(index))
        // Each subtask gets its own checkpoint: <parent_id>:batch_0, :batch_1, ...
        invokeOpts := append([]compose.Option{
            compose.WithCheckPointID(makeBatchCheckpointID(index)),
        }, batchOpts.innerOptions...)
        output, taskErr := runner.Invoke(subCtx, input, invokeOpts...)
        resultCh <- taskResult{index, output, taskErr}
    }
    // ... (rest of invoke omitted in this excerpt)
}

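Independent addresses plus independent checkpoint IDs mean that each subtask saves its own state on interrupt and is located precisely on resume, without interfering with the others.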

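5. Interrupts: CompositeInterrupt Aggregates Interrupts From Multiple Tasks

If several subtasks interrupt at the same time (for example, three high-priority documents all need approval), BatchNode uses compose.CompositeInterrupt to bundle all of the interrupts: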


// Collect results
for result := range resultCh {
    if result.err != nil {
        if _, ok := compose.ExtractInterruptInfo(result.err); ok {
            interruptErrs = append(interruptErrs, result.err)
            interruptedIndices = append(interruptedIndices, result.index)
        } else if normalErr == nil {
            normalErr = result.err
        }
    } else {
        outputs[result.index] = result.output
        completedResults[result.index] = result.output // keep completed results
    }
}

// Bundle all interrupts and save the state
if len(interruptErrs) > 0 {
    state := &NodeInterruptState{
        OriginalInputs:     originalInputs,     // full input list
        CompletedResults:   completedResults,   // results already completed
        InterruptedIndices: interruptedIndices, // indices that were interrupted
        TotalCount:         len(effectiveInputs),
    }
    return nil, compose.CompositeInterrupt(ctx, nil, state, interruptErrs...)
}

The caller retrieves every interrupt context through compose.ExtractInterruptInfo(err):


results, err := runner.Invoke(ctx, docs, compose.WithCheckPointID("session-1"))
if err != nil {
    info, ok := compose.ExtractInterruptInfo(err)
    if ok {
        // info.InterruptContexts holds the details of each interrupted subtask
        for _, iCtx := range info.InterruptContexts {
            fmt.Printf("ID=%s, doc=%v\n", iCtx.ID, iCtx.Info)
        }
    }
}

6. Resume: Rerun Only the Interrupted Tasks

NodeInterruptState is the key to resuming:


wasInterrupted, hasState, prevState := compose.GetInterruptState[*NodeInterruptState](ctx)

if wasInterrupted && hasState && prevState != nil {
    // Restore the inputs from prevState
    effectiveInputs = restoreFrom(prevState.OriginalInputs)
    // Process only the interrupted tasks
    indicesToProcess = prevState.InterruptedIndices

    // Fill completed results straight back in
    for idx, result := range prevState.CompletedResults {
        outputs[idx] = result.(O)
    }
}

How to call resume:


// 1. Build the resume data for each interrupt
resumeData := map[string]any{
    interruptCtx1.ID: &ApprovalDecision{Approved: true, Comments: "Approved"},
    interruptCtx2.ID: &ApprovalDecision{Approved: false, Comments: "Content is non-compliant"},
}

// 2. Inject the resume data
resumeCtx := compose.BatchResumeWithData(ctx, resumeData)

// 3. Invoke again with the same CheckpointID (pass nil for inputs)
results, err = runner.Invoke(resumeCtx, nil, compose.WithCheckPointID("session-1"))

Note: pass nil for inputs. The original inputs are already stored in NodeInterruptState.OriginalInputs, so the caller does not need to pass them again.

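7. Embedding in a Parent Graph: The Map-Reduce Pattern

BatchNode can serve as one node in a parent graph, combined with preprocessing and aggregation nodes to form a complete pipeline (example scenario 7):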


parentGraph := compose.NewGraph[BatchReviewInput, ReviewReport]()

// preprocess: split the raw input into []ReviewRequest
parentGraph.AddLambdaNode("preprocess", preprocess)

// batch_review: run reviewWorkflow over the batch
parentGraph.AddLambdaNode("batch_review",
    compose.InvokableLambda(func(ctx context.Context, inputs []ReviewRequest) ([]ReviewResult, error) {
        return batchNode.Invoke(ctx, inputs,
            batch.WithInnerOptions(compose.WithCallbacks(progressHandler)), // passed through at run time
        )
    }))

// reduce: aggregate the results into a report
parentGraph.AddLambdaNode("reduce", reduce)

parentGraph.AddEdge(compose.START, "preprocess")
parentGraph.AddEdge("preprocess", "batch_review")
parentGraph.AddEdge("batch_review", "reduce")
parentGraph.AddEdge("reduce", compose.END)

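batch.WithInnerOptions passes run-time options (such as a progress callback) through to each subtask's Invoke, while InnerCompileOptions holds compile-time options. The two are kept separate and do not interfere with each other.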

8. Types You Must Register

BatchNode stores its state in fields of type any, so every type involved must be registered:


// batch/types.go: already registered by the framework
func init() {
    schema.RegisterName[*NodeInterruptState]("batch.NodeInterruptState")
}

// Your business types (you must register these yourself)
func init() {
    schema.RegisterName[ReviewRequest]("batch_example.ReviewRequest")
    schema.RegisterName[ReviewResult]("batch_example.ReviewResult")
    schema.RegisterName[*ApprovalDecision]("batch_example.ApprovalDecision")
    // Note: slice types must be registered separately
    schema.RegisterName[[]ReviewRequest]("batch_example.ReviewRequestSlice")
    schema.RegisterName[[]ReviewResult]("batch_example.ReviewResultSlice")
}

[]ReviewRequest and ReviewRequest are separate registry entries, and both must be registered.

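9. The Seven Scenarios at a Glance

Scenario	MaxConcurrency	Notes
Sequential processing	0	One at a time; safest and slowest
Concurrent processing	3	At most 3 in parallel; protects rate limits
With compile options	2	WithGraphName makes tracing easier
With run-time callbacks	0	WithInnerOptions passes a progress handler
Error handling	0	When a task fails, the first error is returned
Interrupt and resume	0	High-priority documents pause for human confirmation
Parent-graph Map-Reduce	3	Preprocess → batch → aggregate report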

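Summary

BatchNode's core value is wrapping a single-task workflow into a batch run while keeping human-in-the-loop (HITL) capability. Three design points matter: a semaphore caps concurrency (no unbounded goroutines); NodeInterruptState stores the full original inputs and the completed results (so resume does not rerun finished tasks); and CompositeInterrupt aggregates interrupts from multiple subtasks (one call can have several waiting points). Registering types is a prerequisite for using checkpoints, and slice types must be registered separately.

To be continued in the next installment.

Code source: cloudwego/eino · cloudwego/eino-examples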