The Checkpoint Mechanism: How an Agent Picks Up Where It Left Off After a Power Loss

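Part E21 of the series "Dissecting Enterprise-Grade AI Agent Implementations". The previous post, E20, broke down eight HITL (human-in-the-loop) patterns, all of which share one prerequisite: after an Agent pauses, it must be able to resume in another process, or even on another machine. Checkpoint provides that capability. This post digs into Checkpoint's interface design, trigger timing, serialization format, and advanced usage.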

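What you will learn from this post

Why the Checkpoint interface has only two methods, and why that is enough

Trigger timing: when a Checkpoint is written and when it is read

Internal structure: what a Checkpoint stores

ADK layer vs. compose layer: how the two Checkpoint mechanisms differ

Advanced usage: forked saves, forced reruns, state migration

How much code it takes to implement a Redis CheckPointStore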

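1. Starting from a problem

Suppose a user asks an Agent to handle a ticket that needs several human approvals. The first approval arrives in the morning and the second in the afternoon; in between, the Agent process may well restart or even move to a different server.

This is the problem Checkpoint solves: serialize and store the Agent's complete execution state at a given moment, then load it from any process and continue executing.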

2. The interface: just two methods

// internal/core/interrupt.go
type CheckPointStore interface {
    Get(ctx context.Context, checkPointID string) ([]byte, bool, error)
    Set(ctx context.Context, checkPointID string, checkPoint []byte) error
}

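Get returns (data []byte, existed bool, error); Set writes bytes that are already serialized. The framework handles all serialization and deserialization; the Store only stores and retrieves bytes.

There is also an optional interface: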

type CheckPointDeleter interface {
    Delete(ctx context.Context, checkPointID string) error
}

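If the Store implements CheckPointDeleter, the framework automatically deletes the stale Checkpoint once the Agent finishes normally, which keeps storage from growing without bound. Not implementing it is fine too; you just have to clean up yourself.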

3. In-memory implementation: the smallest runnable version

The InMemoryStore in the Eino examples shows the simplest implementation (eino-examples/adk/common/store/store.go):

type inMemoryStore struct {
    mem map[string][]byte
}

func (i *inMemoryStore) Set(_ context.Context, key string, value []byte) error {
    i.mem[key] = value
    return nil
}

func (i *inMemoryStore) Get(_ context.Context, key string) ([]byte, bool, error) {
    v, ok := i.mem[key]
    return v, ok, nil
}

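About 20 lines of code, no locks, no serialization: the framework has already done all the serialization, and what it hands to the Store is just []byte. Skipping the lock is acceptable in a single-goroutine demo, but a Go map is not safe for concurrent writes, so a store shared across goroutines needs synchronization.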

Swapping in Redis for production follows the same structure:

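import (
    "context"
    "errors"
    "time"

    "github.com/redis/go-redis/v9" // assumed client library; the original does not name one
)

// redisStore persists checkpoints in Redis under a "checkpoint:" key prefix.
type redisStore struct {
    client *redis.Client
    ttl    time.Duration // expiry applied to every checkpoint; 0 means no expiry
}

func (r *redisStore) Set(ctx context.Context, key string, value []byte) error {
    return r.client.Set(ctx, "checkpoint:"+key, value, r.ttl).Err()
}

func (r *redisStore) Get(ctx context.Context, key string) ([]byte, bool, error) {
    data, err := r.client.Get(ctx, "checkpoint:"+key).Bytes()
    if errors.Is(err, redis.Nil) {
        return nil, false, nil // a missing key is not an error
    }
    if err != nil {
        return nil, false, err
    }
    return data, true, nil
}

// Delete implements the optional CheckPointDeleter interface.
func (r *redisStore) Delete(ctx context.Context, key string) error {
    return r.client.Del(ctx, "checkpoint:"+key).Err()
}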

4. Trigger timing

ADK layer (Runner)

runner.Query(ctx, query, adk.WithCheckPointID("session-001"))
    ↓ the Agent calls StatefulInterrupt during execution
    ↓ the framework calls runnerSaveCheckPointImpl → store.Set("session-001", data)
    ↓ the event stream closes and control returns to the caller

On resume:

runner.ResumeWithParams(ctx, "session-001", &adk.ResumeParams{...})
    ↓ runnerLoadCheckPointImpl → store.Get("session-001")
    ↓ deserialize runContext + InterruptID2Address + InterruptID2State
    ↓ the Agent continues from the breakpoint

Core constraint: a CheckPointID uniquely identifies one conversation session. The same CheckPointID is written during Query/Run and read during ResumeWithParams.

compose layer (Graph)

The Store is injected when the Graph is compiled:

runner, _ := graph.Compile(ctx,
    compose.WithCheckPointStore(store),
    compose.WithInterruptBeforeNodes([]string{"HumanNode"}),  // pause before this node
)

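Each time the Graph reaches an interrupt-before node, it automatically saves the values of all Channels, the pending inputs of each node, and the shared State.

5. ADK Checkpoint internals

The ADK Runner uses a gob-encoded serialization struct (adk/interrupt.go:210):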

type serialization struct {
    RunCtx              *runContext                    // current run context (agent stack, session, etc.)
    Info                *InterruptInfo                 // interrupt info (deprecated, kept for compatibility)
    EnableStreaming     bool                           // whether streaming mode is enabled
    InterruptID2Address map[string]Address             // interrupt ID → address in the Agent tree
    InterruptID2State   map[string]core.InterruptState // interrupt ID → the tool's internal state
}

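InterruptID2State is the key piece: each StatefulInterrupt call generates a unique interruptID and stores the state saved by the tool (such as argumentsInJSON or FollowUpState) under it.

On resume, the framework looks up the matching state by interruptID, locates the Agent/Tool within the execution tree through InterruptID2Address, and injects the resume data into exactly the right tool call.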
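The two maps act as an index keyed by interrupt ID. The sketch below models that lookup in a heavily simplified form: `Address` is reduced to a slice of names, tool state to a string, and `resumeIndex`, `target`, and `locate` are hypothetical names rather than Eino types.

```go
package main

import (
	"fmt"
	"strings"
)

// Address is a hypothetical simplification: the path from the root agent to a tool call.
type Address []string

// resumeIndex holds the two lookup tables, with tool state reduced to a string.
type resumeIndex struct {
	InterruptID2Address map[string]Address
	InterruptID2State   map[string]string
}

// target says where resume data should be delivered and which saved state goes with it.
type target struct {
	Path  string
	State string
}

// locate resolves an interrupt ID to its position in the tree and its saved state.
func (r resumeIndex) locate(interruptID string) (target, error) {
	addr, ok := r.InterruptID2Address[interruptID]
	if !ok {
		return target{}, fmt.Errorf("unknown interrupt ID %q", interruptID)
	}
	return target{
		Path:  strings.Join(addr, "/"),
		State: r.InterruptID2State[interruptID],
	}, nil
}

func main() {
	idx := resumeIndex{
		InterruptID2Address: map[string]Address{
			"int-1": {"supervisor", "ticket-agent", "approve_tool"},
		},
		InterruptID2State: map[string]string{
			"int-1": `{"ticket":"T-42"}`,
		},
	}
	t, err := idx.locate("int-1")
	fmt.Println(t.Path, t.State, err)
}
```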

6. compose Checkpoint internals

The compose layer's checkpoint struct (compose/checkpoint.go:106):

type checkpoint struct {
    Channels       map[string]channel // output Channel state of each node
    Inputs         map[string]any     // pending inputs of each node
    State          any                // shared State of the whole graph
    SkipPreHandler map[string]bool
    RerunNodes     []string           // nodes that need to be rerun

    SubGraphs map[string]*checkpoint // checkpoints of nested subgraphs (recursive structure)

    InterruptID2Addr  map[string]Address
    InterruptID2State map[string]core.InterruptState
}

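SubGraphs is a recursive structure: each subgraph nested in the graph has its own Checkpoint, and together they form a tree, so graphs of any depth can be restored correctly.

7. Advanced usage

Writing to a different CheckPointID (forking)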

// Read the state of session-001 and write the new state to session-002
// Suits the "fork a new conversation branch" scenario
iter := runner.Run(ctx, messages,
    adk.WithCheckPointID("session-001"),
    compose.WithWriteToCheckPointID("session-002"),
)

Forcing a fresh start

// Ignore any existing Checkpoint and run from scratch (clears the conversation history)
iter := runner.Query(ctx, query,
    adk.WithCheckPointID("session-001"),
    compose.WithForceNewRun(),
)

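State migration

When the structure of the Graph State changes (fields added, types changed), old Checkpoints can no longer be deserialized directly. MigrateCheckpointState provides a migration hook (compose/checkpoint.go:231):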

newBytes, err := compose.MigrateCheckpointState(oldBytes, serializer,
    func(state any) (any, bool, error) {
        old, ok := state.(*OldState)
        if !ok {
            return state, false, nil  // not the target type, skip
        }
        newState := &NewState{
            User:    old.Username,
            Profile: old.UserData,
        }
        return newState, true, nil  // changed=true triggers re-encoding
    },
)

Migration is recursive and automatically handles every nested Checkpoint in SubGraphs.
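To make the recursion concrete, here is a minimal sketch of applying a migration callback of the same shape to a tree of nested checkpoints. It is not the body of MigrateCheckpointState: `node` is a hypothetical, reduced checkpoint holding only State and SubGraphs, and `migrateTree` and `renameUser` are names invented for the example.

```go
package main

import "fmt"

type OldState struct{ Username string }
type NewState struct{ User string }

// node is a hypothetical, reduced checkpoint: a state plus nested subgraphs.
type node struct {
	State     any
	SubGraphs map[string]*node
}

// migrateFunc has the same shape as the callback shown above.
type migrateFunc func(state any) (newState any, changed bool, err error)

// migrateTree applies fn to this node and to every nested subgraph.
// It reports whether anything in the tree changed.
func migrateTree(n *node, fn migrateFunc) (bool, error) {
	if n == nil {
		return false, nil
	}
	newState, changed, err := fn(n.State)
	if err != nil {
		return false, err
	}
	if changed {
		n.State = newState
	}
	for name, sub := range n.SubGraphs {
		subChanged, err := migrateTree(sub, fn)
		if err != nil {
			return false, fmt.Errorf("subgraph %q: %w", name, err)
		}
		changed = changed || subChanged
	}
	return changed, nil
}

// renameUser converts *OldState to *NewState and leaves other states alone.
func renameUser(state any) (any, bool, error) {
	old, ok := state.(*OldState)
	if !ok {
		return state, false, nil
	}
	return &NewState{User: old.Username}, true, nil
}

func main() {
	root := &node{
		State: "unrelated",
		SubGraphs: map[string]*node{
			"child": {State: &OldState{Username: "alice"}},
		},
	}
	changed, err := migrateTree(root, renameUser)
	fmt.Println(changed, err, root.SubGraphs["child"].State.(*NewState).User)
}
```

Propagating the changed flag upward is what lets a caller skip re-encoding a checkpoint that the callback never touched.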

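8. The cost of backward compatibility

ADK Checkpoints are gob-encoded, which carries a hidden risk: gob requires every concrete type carried in an interface-typed (any) field to be registered, and once a registered name is fixed it cannot change. Change it and old Checkpoints become unreadable.

The Eino source contains a function, preprocessADKCheckpoint (adk/interrupt.go:267), dedicated to compatibility issues in versions v0.8.0 to v0.8.3: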

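// Replace the old type name with a compatibility alias, because the two versions used the same name with different encodings
return bytes.ReplaceAll(data,
    []byte(lenPrefixedReactStateName),    // "_eino_adk_react_state" (old)
    []byte(lenPrefixedCompatName))        // "_eino_adk_state_v080_" (compatibility alias)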

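Rules for registering your own custom types:

Use schema.Register[*MyType]() or schema.RegisterName[*MyType]("my_unique_name")
Once a name has been used in production Checkpoints it must not change, otherwise old Checkpoints cannot be read back
Do not start the name with _eino (a prefix reserved by the framework)
Pointer and non-pointer types are separate registration entries; consider *MyType and MyType individually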
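The reason registration matters can be shown with the standard library alone. The sketch below uses encoding/gob directly, not Eino's schema helpers, so it only illustrates the underlying behaviour that those helpers are assumed to build on. `UserState`, `envelope`, and `roundTrip` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// UserState is a hypothetical custom type stored inside an any field.
type UserState struct{ Username string }

// envelope stands in for a checkpoint struct that carries state as any.
type envelope struct{ State any }

func init() {
	// The name is written into every encoded payload, so it has to stay
	// stable for as long as old payloads need to be readable.
	gob.RegisterName("my_unique_name", &UserState{})
}

// roundTrip encodes the envelope with gob and decodes it again.
func roundTrip(in envelope) (envelope, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return envelope{}, err
	}
	var out envelope
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	out, err := roundTrip(envelope{State: &UserState{Username: "alice"}})
	if err != nil {
		panic(err)
	}
	st, ok := out.State.(*UserState)
	if !ok {
		panic("decoded state has an unexpected type")
	}
	fmt.Println(st.Username)
}
```

Removing the RegisterName call makes Encode fail, because gob has no name to write for the concrete type held in the any field.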

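9. Comparing the two Checkpoint mechanisms

	ADK Runner layer	compose Graph layer
Encoding format	gob	Framework's built-in Serializer (replaceable)
Storage granularity	Entire Agent execution state	Per-node Channels + State
Trigger	`StatefulInterrupt` is called	Reaching a `WithInterruptBeforeNodes` node
Usage	`RunnerConfig{CheckPointStore: s}`	`graph.Compile(WithCheckPointStore(s))`
Subgraph support	Passed through via bridgeStore	`SubGraphs` recursive structure
Typical scenario	ADK ChatModelAgent / Custom Agent	Low-level Graph orchestration

Most scenarios should use the ADK layer; the Graph layer is only needed when you work with compose graphs directly.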

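Summary

Checkpoint's design philosophy is separation of responsibilities: the framework owns serialization and resume logic, and the Store only has to implement two methods, Get and Set. That makes swapping the storage backend very simple: memory, Redis, and DynamoDB all plug in the same way. The key constraint is type registration: a custom type stored in an any field must be registered with schema.Register, and its registered name cannot be changed afterward.

To be continued in the next post.

Code sources: cloudwego/eino · cloudwego/eino-examples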