LangGraph Design and Implementation, Chapter 7: Task Scheduling and Parallel Execution

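Chapter 7: Task Scheduling and Parallel Execution

7.1 Introduction

The previous chapter examined the macro-level architecture of the Pregel execution loop: tick(), after_tick(), and the BSP superstep model. Inside each superstep, however, lies an equally intricate world. How are multiple tasks scheduled in parallel? How is a failed task retried? How does caching avoid repeated computation? How do PUSH tasks and PULL tasks differ at runtime?

This chapter digs into LangGraph's task execution layer, which involves the following core components:

PregelExecutableTask (types.py): the data structure for an executable task
PregelRunner (pregel/_runner.py): the task scheduler, which manages parallel execution and result collection
BackgroundExecutor / AsyncBackgroundExecutor (pregel/_executor.py): the thread-pool and asyncio parallelism primitives
run_with_retry / arun_with_retry (pregel/_retry.py): the retry logic
Cache matching: how cache_policy and CacheKey work together

Together these components form a parallel execution framework that maximizes throughput without sacrificing correctness.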

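:::tip Key points of this chapter

PregelExecutableTask is the smallest unit of task execution; it carries the input, the processor, the write buffer, the config, and everything else a task needs
PregelRunner manages concurrent tasks through FuturesDict and implements "if any task fails, stop them all" semantics
PULL tasks are triggered by Channel version changes and read their input from Channels; PUSH tasks are created by the Send API and take their input from the caller
BackgroundExecutor uses a thread pool for synchronous parallelism; AsyncBackgroundExecutor uses asyncio tasks for asynchronous parallelism
Retry policies support exponential backoff, jitter, a maximum number of attempts, and multiple policies matched by exception type
Cache policies use CacheKey to tie together node identity and an input hash, with optional TTL expiry
:::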

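7.2 PregelExecutableTask: The Full Picture of a Task

PregelExecutableTask is defined in types.py as an immutable dataclass:

@dataclass(frozen=True)
class PregelExecutableTask:
    name: str                          # node name
    input: Any                         # task input
    proc: Runnable                     # executable processor (bound combined with writers)
    writes: deque[tuple[str, Any]]     # write buffer
    config: RunnableConfig             # full run configuration
    triggers: Sequence[str]            # Channels that triggered this task
    retry_policy: Sequence[RetryPolicy] # retry policies
    cache_key: CacheKey | None         # cache key (if caching is enabled)
    id: str                            # globally unique task ID
    path: tuple[str | int | tuple, ...] # task path (used for ordering and identification)
    writers: Sequence[Runnable] = ()    # references to the writers
    subgraphs: Sequence[PregelProtocol] = ()  # references to subgraphs

Although the class is marked frozen=True (immutable), the writes field is a deque: the reference cannot be rebound, but its contents can change. This lets a running task append to writes while preventing the whole writes object from being replaced by accident.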
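The "frozen reference, mutable contents" behavior can be reproduced with the standard library alone. The sketch below uses a hypothetical DemoTask class (not the LangGraph class) to show that appending to the deque works while rebinding the field is rejected.

```python
from collections import deque
from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class DemoTask:  # hypothetical stand-in, not PregelExecutableTask
    name: str
    writes: deque = field(default_factory=deque)


task = DemoTask("agent")
task.writes.append(("count", 5))  # allowed: mutates the deque's contents

try:
    task.writes = deque()  # rejected: rebinds a field of a frozen dataclass
    rebound = True
except FrozenInstanceError:
    rebound = False
```

Anything that captured a reference to task.writes earlier still sees the same object, which is the property the later sections rely on.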

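7.2.1 Generating the Task ID

Task IDs come from a deterministic hash function, so the same task under the same Checkpoint state always receives the same ID:

# For PULL tasks
task_id = task_id_func(
    checkpoint_id_bytes,    # byte representation of the Checkpoint ID
    checkpoint_ns,          # namespace (e.g. "parent|agent")
    str(step),              # step number
    name,                   # node name
    PULL,                   # task type
    *triggers,              # triggering Channels
)

# For PUSH tasks (Send API)
task_id = task_id_func(
    checkpoint_id_bytes,
    checkpoint_ns,
    str(step),
    name,
    PUSH,
    task_path_str(parent_path),  # path of the parent task
    str(idx),                    # index within the parent task's writes
)

LangGraph 1.1.6 supports two hash functions: xxhash (v2 Checkpoint format, faster) and uuid5 (v1 format, kept for backward compatibility). Deterministic IDs are what make Checkpoint recovery work: the task IDs recomputed after recovery must match the task IDs stored with the saved pending writes, so that _match_writes can attach the saved writes to the rebuilt tasks.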
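To make the determinism argument concrete, here is a minimal sketch built on the standard library's uuid5. The helper demo_task_id, the namespace constant, and the "|" separator are assumptions made for illustration; the real task_id_func may combine its parts differently.

```python
import uuid

# Hypothetical namespace; the real implementation defines its own.
DEMO_NAMESPACE = uuid.UUID(int=0)


def demo_task_id(*parts: str) -> str:
    """Hash the ordered parts into a stable ID (illustrative only)."""
    return str(uuid.uuid5(DEMO_NAMESPACE, "|".join(parts)))


parts = ("ckpt-1", "parent|agent", "3", "agent", "__pull__", "branch:to:agent")
before_crash = demo_task_id(*parts)
after_resume = demo_task_id(*parts)  # recomputed from the same checkpoint state
other_step = demo_task_id("ckpt-1", "parent|agent", "4", "agent",
                          "__pull__", "branch:to:agent")
```

Because before_crash and after_resume are equal, writes stored under the first ID can be found again after a restart, while a different step yields a different ID.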

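7.2.2 How proc Is Composed

PregelExecutableTask.proc is a RunnableSeq that chains the user logic with the writers:

flowchart LR
    INPUT[Task input] --> BOUND[bound: user function]
    BOUND --> W1[ChannelWrite: state update]
    W1 --> W2[ChannelWrite: edge routing]
    subgraph "proc = RunnableSeq(bound, *writers)"
        BOUND
        W1
        W2
    end

When task.proc.invoke(task.input, task.config) runs:

The user function is called first, with the state read from the Channels
The user function returns a state update (for example {"count": 5})
The first ChannelWrite turns the update into Channel write tuples and sends them through CONFIG_KEY_SEND
The second ChannelWrite (if the node has outgoing edges) writes a routing signal to the target node's trigger Channel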
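The four steps above can be mimicked with plain functions. This is only a sketch: run_seq, state_writer, and route_writer are hypothetical stand-ins for RunnableSeq and the two ChannelWrite steps, and a module-level deque plays the role of task.writes.

```python
from collections import deque

writes = deque()  # plays the role of task.writes


def bound(state):
    """Stand-in for the user function: returns a state update."""
    return {"count": state["count"] + 1}


def state_writer(update):
    """Stand-in for the first ChannelWrite: records (channel, value) tuples."""
    writes.extend(update.items())
    return update


def route_writer(update):
    """Stand-in for the routing ChannelWrite: signals the next node."""
    writes.extend([("branch:to:next", None)])
    return update


def run_seq(value, *steps):
    """Pass the value through each step in order, like a sequence of runnables."""
    for step in steps:
        value = step(value)
    return value


result = run_seq({"count": 4}, bound, state_writer, route_writer)
```

After the call, the buffer holds the state update followed by the routing signal, in the order the writers ran.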

7.2.3 Key Functions Injected into config

Each task's config has several callbacks injected into it so that the running task can interact with PregelLoop:

config = patch_config(
    config,
    configurable={
        CONFIG_KEY_TASK_ID: task_id,
        CONFIG_KEY_SEND: writes.extend,     # write collector
        CONFIG_KEY_READ: partial(            # state reader
            local_read, scratchpad, channels, managed,
            PregelTaskWrites(path, name, writes, triggers),
        ),
        CONFIG_KEY_CHECKPOINTER: checkpointer,
        CONFIG_KEY_CHECKPOINT_NS: task_checkpoint_ns,
        CONFIG_KEY_SCRATCHPAD: scratchpad,
        CONFIG_KEY_RUNTIME: runtime,
    },
)

CONFIG_KEY_SEND: bound to writes.extend. When ChannelWrite.do_write is called, the write tuples are appended to the task's writes deque. The design relies on deque operations being thread-safe in CPython.
CONFIG_KEY_READ: bound to the local_read function. Conditional edges use it to read a snapshot of the state "with the current task's writes applied", so that routing decisions are based on the latest state.

7.3 PULL Tasks vs PUSH Tasks

LangGraph has two fundamentally different ways of triggering a task:

flowchart TB
    subgraph "PULL task"
        direction TB
        EDGE["Edge / conditional edge write\nbranch:to:agent"] --> CHAN_VER["Channel version updated"]
        CHAN_VER --> TRIGGERS["_triggers() check\nversion > seen"]
        TRIGGERS --> PULL_PREP["prepare_single_task\npath = (PULL, name)"]
        PULL_PREP --> PULL_INPUT["Input = value read from Channels\nreads the Channels listed in node.channels"]
        PULL_INPUT --> PULL_TASK["PregelExecutableTask"]
    end
    subgraph "PUSH task"
        direction TB
        SEND_API["Node returns Send('tool', data)\nor a conditional edge returns Send"] --> TASKS_CH["__pregel_tasks Topic"]
        TASKS_CH --> PUSH_PREP["prepare_single_task\npath = (PUSH, idx)"]
        PUSH_PREP --> PUSH_INPUT["Input = Send.arg\nspecified directly by the caller"]
        PUSH_INPUT --> PUSH_TASK["PregelExecutableTask"]
    end
    style PULL_TASK fill:#c8e6c9
    style PUSH_TASK fill:#fff3e0

PULL tasks

PULL tasks are the standard BSP triggering mechanism. In prepare_single_task, for a (PULL, name) path:

if task_path[0] == PULL:
    name = task_path[1]
    proc = processes[name]
    # check the trigger condition
    if _triggers(channels, checkpoint["channel_versions"],
                 checkpoint["versions_seen"].get(name),
                 null_version, proc):
        # read the input
        val = _proc_input(proc, managed, channels,
                          for_execution=True, ...)
        if val is MISSING:
            return  # Channel is empty, skip
        # create the task
        return PregelExecutableTask(name, val, node, writes, ...)

A PULL task's input comes from Channels: _proc_input reads the Channel values named in proc.channels and, if a mapper is configured, applies it to convert the type.
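The "version > seen" test that gates a PULL task can be sketched with two dictionaries. The function demo_triggers below is a hypothetical simplification written for this illustration; the real _triggers takes more arguments and handles null versions.

```python
def demo_triggers(channel_versions, versions_seen, trigger_channels):
    """Return True if any trigger channel has a version the node has not seen."""
    return any(
        channel_versions.get(ch, 0) > versions_seen.get(ch, 0)
        for ch in trigger_channels
    )


channel_versions = {"branch:to:agent": 3, "messages": 2}
seen_by_agent = {"branch:to:agent": 2, "messages": 2}

should_run = demo_triggers(channel_versions, seen_by_agent, ["branch:to:agent"])

# Once the node has consumed the new version, it is no longer triggered.
seen_by_agent["branch:to:agent"] = 3
should_run_again = demo_triggers(channel_versions, seen_by_agent,
                                 ["branch:to:agent"])
```

The node runs once for the new write and then stays idle until one of its trigger Channels is written again.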

PUSH tasks

PUSH tasks are created in two ways:

Send API (prepare_push_task_send): when the __pregel_tasks Topic Channel contains Send objects
Functional API (prepare_push_task_functional): when the task path ends with a Call object

For a PUSH task created by the Send API:

if task_path[0] == PUSH:
    # fetch the Send object
    send = tasks_channel.get()[task_path[1]]
    name = send.node
    val = send.arg  # use the Send's argument directly as the input
    proc = processes[name]
    # create the task (no _triggers check)
    return PregelExecutableTask(name, val, node, writes, ...)

The key difference: PUSH tasks do not check _triggers; they always run. Their input comes directly from Send.arg rather than from a Channel read, which lets the same node be invoked several times, each time with a different input.

7.4 PregelRunner: The Parallel Scheduler

PregelRunner is defined in pregel/_runner.py and runs all tasks of a superstep in parallel:

class PregelRunner:
    def __init__(self, *, submit, put_writes,
                 use_astream=False, node_finished=None):
        self.submit = submit          # submit function (weak reference)
        self.put_writes = put_writes  # function that saves writes (weak reference)
        self.use_astream = use_astream
        self.node_finished = node_finished

7.4.1 Execution Flow of the Synchronous tick

def tick(self, tasks, *, reraise=True, timeout=None,
         retry_policy=None, get_waiter=None, schedule_task):
    tasks = tuple(tasks)
    futures = FuturesDict(
        callback=weakref.WeakMethod(self.commit),
        event=threading.Event(),
        future_type=concurrent.futures.Future,
    )
    # hand control back to the caller
    yield

    # fast path: a single task with no timeout
    if len(tasks) == 1 and timeout is None and get_waiter is None:
        t = tasks[0]
        try:
            run_with_retry(t, retry_policy, ...)
            self.commit(t, None)
        except Exception as exc:
            self.commit(t, exc)
            ...
        return

    # schedule every task on the thread pool
    for t in tasks:
        fut = self.submit()(
            run_with_retry, t, retry_policy, ...
        )
        futures[fut] = t

    # wait for tasks to finish and handle them one by one
    while len(futures) > 0:
        done, inflight = concurrent.futures.wait(
            futures,
            return_when=concurrent.futures.FIRST_COMPLETED,
            timeout=...,
        )
        for fut in done:
            futures.pop(fut)
        if _should_stop_others(done):
            break
        yield  # hand control back so the caller can process streaming output

    # wait for all callbacks to finish
    futures.event.wait(timeout=...)
    yield

    # check for exceptions
    _panic_or_proceed(futures.done, panic=reraise)

7.4.2 FuturesDict: Smart Concurrency Management

FuturesDict is a custom dict that automatically invokes a callback and maintains a counter whenever a Future completes:

class FuturesDict(dict):
    event: threading.Event  # set when all tasks have finished
    callback: weakref.ref   # the commit callback
    counter: int            # number of active tasks
    done: set              # Futures that have completed

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if value is not None:
            self.event.clear()
            self.counter += 1
            key.add_done_callback(partial(self.on_done, value))

    def on_done(self, task, fut):
        try:
            if cb := self.callback():
                cb(task, _exception(fut))
        finally:
            self.done.add(fut)
            self.counter -= 1
            if self.counter == 0 or _should_stop_others(self.done):
                self.event.set()

sequenceDiagram
    participant Runner as PregelRunner
    participant FD as FuturesDict
    participant Pool as Thread pool
    participant T1 as Task A
    participant T2 as Task B
    participant Loop as PregelLoop
    Runner->>FD: futures[fut_a] = task_a
    FD->>Pool: submit(run_with_retry, task_a)
    Runner->>FD: futures[fut_b] = task_b
    FD->>Pool: submit(run_with_retry, task_b)
    Pool->>T1: execute
    Pool->>T2: execute
    T1-->>FD: on_done(task_a, result)
    FD->>Runner: commit(task_a, None)
    Runner->>Loop: put_writes(task_a.id, task_a.writes)
    Note over Runner: yield - hand control back
    Note over Loop: process streaming output
    T2-->>FD: on_done(task_b, result)
    FD->>Runner: commit(task_b, None)
    Runner->>Loop: put_writes(task_b.id, task_b.writes)
    FD-->>FD: counter == 0, event.set()
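The callback-plus-counter pattern can be tried out with the standard library. The DemoTracker class below is a hypothetical, much smaller cousin of FuturesDict: it only counts in-flight futures and sets an event when the count returns to zero, and it adds an explicit lock that the excerpt above does not show.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DemoTracker:  # hypothetical; far simpler than FuturesDict
    def __init__(self):
        self.event = threading.Event()
        self.lock = threading.Lock()
        self.counter = 0
        self.committed = []

    def track(self, fut, name):
        with self.lock:
            self.counter += 1
            self.event.clear()
        # The callback fires when the future finishes (or at once if it already has).
        fut.add_done_callback(lambda f: self.on_done(name, f))

    def on_done(self, name, fut):
        with self.lock:
            self.committed.append((name, fut.result()))
            self.counter -= 1
            if self.counter == 0:
                self.event.set()


tracker = DemoTracker()
with ThreadPoolExecutor(max_workers=2) as pool:
    for name, value in [("a", 1), ("b", 2)]:
        tracker.track(pool.submit(lambda v=value: v * 10), name)
    finished = tracker.event.wait(timeout=5)
```

The scheduler never polls individual futures; it waits on a single event, and each completion "commits" its own result from the callback.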

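7.4.3 The Single-Task Fast Path

When there is exactly one task and no timeout, PregelRunner takes a fast path: it runs the task directly on the current thread and avoids the thread-pool overhead:

if len(tasks) == 1 and timeout is None and get_waiter is None:
    t = tasks[0]
    try:
        run_with_retry(t, retry_policy, ...)
        self.commit(t, None)
    except Exception as exc:
        self.commit(t, exc)

For simple linear graphs, where only one node runs per step, this removes the scheduling overhead on every step.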

7.4.4 commit: Submitting Results

The commit method handles each kind of task outcome differently:

def commit(self, task, exception):
    if isinstance(exception, asyncio.CancelledError):
        # cancelled task: save the error
        task.writes.append((ERROR, exception))
        self.put_writes()(task.id, task.writes)
    elif exception:
        if isinstance(exception, GraphInterrupt):
            # interrupt: save the interrupt information
            writes = [(INTERRUPT, exception.args[0])]
            self.put_writes()(task.id, writes)
        elif isinstance(exception, GraphBubbleUp):
            # bubbling exception: not saved, handled in _panic_or_proceed
            pass
        else:
            # ordinary error: save the error information
            task.writes.append((ERROR, exception))
            self.put_writes()(task.id, task.writes)
    else:
        # success: report that the node finished, then save the writes
        if self.node_finished:
            self.node_finished(task.name)
        if not task.writes:
            task.writes.append((NO_WRITES, None))
        self.put_writes()(task.id, task.writes)

Note the NO_WRITES marker: even a task that produced no writes records a placeholder write. This tells the Checkpointer that the task has already run, so it is not executed again on recovery.

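7.4.5 Error Propagation and Task Cancellation

The _should_stop_others function checks whether any task has failed:

def _should_stop_others(done):
    for fut in done:
        if fut.cancelled():
            continue
        elif exc := fut.exception():
            if not isinstance(exc, GraphBubbleUp) and \
               fut not in SKIP_RERAISE_SET:
                return True
    return False

When a task fails with anything other than a GraphBubbleUp exception, all other tasks are cancelled. GraphBubbleUp (which includes GraphInterrupt) is special: it is not treated as an error and does not cause other tasks to be cancelled.

_panic_or_proceed performs the final check once all tasks have finished:

def _panic_or_proceed(futs, *, panic=True):
    # split the futures into finished and still-running sets
    # (simplified here so the excerpt is self-consistent)
    done = {fut for fut in futs if fut.done() and not fut.cancelled()}
    inflight = {fut for fut in futs if not fut.done()}
    interrupts = []
    while done:
        fut = done.pop()
        if exc := _exception(fut):
            if isinstance(exc, GraphInterrupt):
                interrupts.append(exc)
            elif fut not in SKIP_RERAISE_SET:
                raise exc
    if interrupts:
        # merge all interrupts
        raise GraphInterrupt(
            tuple(i for exc in interrupts for i in exc.args[0])
        )
    if inflight:
        raise TimeoutError("Timed out")

GraphInterrupt exceptions from several tasks are merged into one, so no interrupt information is lost.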

7.5 BackgroundExecutor: Thread-Pool Parallelism

BackgroundExecutor is defined in pregel/_executor.py and manages background tasks for synchronous execution:

class BackgroundExecutor(AbstractContextManager):
    def __init__(self, config: RunnableConfig):
        self.stack = ExitStack()
        self.executor = self.stack.enter_context(
            get_executor_for_config(config)
        )
        self.tasks: dict[Future, tuple[bool, bool]] = {}

    def submit(self, fn, *args,
               __cancel_on_exit__=False,
               __reraise_on_exit__=True,
               __next_tick__=False,
               **kwargs) -> Future:
        ctx = copy_context()
        if __next_tick__:
            task = self.executor.submit(
                next_tick, ctx.run, fn, *args, **kwargs
            )
        else:
            task = self.executor.submit(ctx.run, fn, *args, **kwargs)
        self.tasks[task] = (__cancel_on_exit__, __reraise_on_exit__)
        task.add_done_callback(self.done)
        return task

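7.5.1 Key Parameters

Parameter	Meaning	Typical use
`__cancel_on_exit__`	Whether to cancel the task on exit	Waiter tasks (waiting on streaming output)
`__reraise_on_exit__`	Whether to re-raise the task's exception on exit	Set to False for PUSH subtasks (the parent task handles them)
`__next_tick__`	Whether to run on the next "tick"	Dynamically created subtasks, so the current step's writes are committed first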
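The submit method wraps every call in ctx.run with a context copied from the submitting thread. The sketch below, which uses only the standard library and a hypothetical request_id variable, shows why: passing the copied context explicitly is what makes context variables set by the caller visible inside the worker thread.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical context variable, used only for this demonstration.
request_id = contextvars.ContextVar("request_id", default="unset")


def read_request_id():
    return request_id.get()


request_id.set("run-42")

with ThreadPoolExecutor(max_workers=1) as pool:
    ctx = contextvars.copy_context()  # snapshot of the caller's context
    with_ctx = pool.submit(ctx.run, read_request_id).result()
```

Here with_ctx is "run-42" because read_request_id runs inside the copied context. Submitting read_request_id without ctx.run would typically return the default instead, since worker threads usually start with their own context.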

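7.5.2 The next_tick Mechanism

The __next_tick__ parameter routes the call through the next_tick function:

def next_tick(fn, *args, **kwargs):
    time.sleep(0)  # yield the CPU time slice
    return fn(*args, **kwargs)

time.sleep(0) looks pointless, but it yields the current thread's time slice so that other threads in the pool (in particular the one running the commit callback) can finish first. This is intended to make a dynamically created subtask start only after the parent task's current writes have been committed to PregelLoop.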

7.5.3 Exiting the Context Manager

def __exit__(self, exc_type, exc_value, traceback):
    tasks = self.tasks.copy()
    # cancel tasks marked cancel_on_exit
    for task, (cancel, _) in tasks.items():
        if cancel:
            task.cancel()
    # wait for all tasks to finish
    if pending := {t for t in tasks if not t.done()}:
        concurrent.futures.wait(pending)
    # shut down the thread pool
    self.stack.__exit__(exc_type, exc_value, traceback)
    # re-raise exceptions from tasks marked reraise
    if exc_type is None:
        for task, (_, reraise) in tasks.items():
            if not reraise:
                continue
            try:
                task.result()
            except concurrent.futures.CancelledError:
                pass

The order of operations on exit matters: cancel first, then wait, then clean up, and only then surface exceptions. This ensures that all resources are released before any error propagates.

7.6 AsyncBackgroundExecutor: Asynchronous Parallelism

AsyncBackgroundExecutor is the asynchronous counterpart; it uses asyncio tasks instead of threads:

class AsyncBackgroundExecutor(AbstractAsyncContextManager):
    def __init__(self, config: RunnableConfig):
        self.tasks: dict[asyncio.Future, tuple[bool, bool]] = {}
        self.loop = asyncio.get_running_loop()
        if max_concurrency := config.get("max_concurrency"):
            self.semaphore = asyncio.Semaphore(max_concurrency)
        else:
            self.semaphore = None

    def submit(self, fn, *args, __cancel_on_exit__=False,
               __reraise_on_exit__=True, __next_tick__=False,
               **kwargs) -> asyncio.Future:
        coro = fn(*args, **kwargs)
        if self.semaphore:
            coro = gated(self.semaphore, coro)
        task = run_coroutine_threadsafe(
            coro, self.loop,
            context=copy_context(),
            lazy=__next_tick__,
        )
        self.tasks[task] = (__cancel_on_exit__, __reraise_on_exit__)
        return task

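7.6.1 Concurrency Control: Semaphore

max_concurrency is enforced by the gated coroutine:

async def gated(semaphore, coro):
    async with semaphore:
        return await coro

This gives fine-grained control over concurrency. In AI workflows, too many concurrent calls can trigger API rate limiting or exhaust memory, and max_concurrency acts as a safety valve.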
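A small experiment shows the gate at work. The gated coroutine is copied from the excerpt above; main, job, and the running/peak counters are names invented for this sketch, which launches six jobs under a limit of two and records the highest number that ran at once.

```python
import asyncio


async def gated(semaphore, coro):
    async with semaphore:
        return await coro


async def main(limit=2, jobs=6):
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def job(i):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)  # record the highest observed concurrency
        await asyncio.sleep(0.01)  # simulate an I/O-bound call
        running -= 1
        return i

    results = await asyncio.gather(
        *(gated(semaphore, job(i)) for i in range(jobs))
    )
    return results, peak


results, peak = asyncio.run(main())
```

All six jobs complete and gather preserves their order, but the body of job never runs in more than two coroutines at a time because each one must first acquire the semaphore.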

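7.6.2 Sync vs Async Comparison

Feature	BackgroundExecutor	AsyncBackgroundExecutor
Parallelism primitive	`ThreadPoolExecutor`	`asyncio.Task`
Concurrency control	Thread pool size	`asyncio.Semaphore`
Context propagation	`copy_context()` + `ctx.run`	`copy_context()` passed as an argument
Cancellation	`Future.cancel()` (only before start)	`Task.cancel()` (takes effect at the next await)
next_tick	`time.sleep(0)`	`lazy=True` deferred scheduling

Cancellation is more thorough in the async version: an asyncio task can be cancelled while it is running (CancelledError is raised at its next await point), whereas a thread-pool Future can only be cancelled before it starts.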

7.7 Retry Policy: run_with_retry

The retry logic lives in pregel/_retry.py; run_with_retry is the synchronous version:

def run_with_retry(task, retry_policy, configurable=None):
    retry_policy = task.retry_policy or retry_policy
    attempts = 0
    config = task.config
    if configurable is not None:
        config = patch_configurable(config, configurable)

    while True:
        try:
            # clear writes left over from the previous attempt
            task.writes.clear()
            # run the task
            return task.proc.invoke(task.input, config)
        except ParentCommand as exc:
            # the Command is routed to the current graph or to a parent graph
            ns = config[CONF][CONFIG_KEY_CHECKPOINT_NS]
            cmd = exc.args[0]
            if cmd.graph in (ns, recast_checkpoint_ns(ns), task.name):
                for w in task.writers:
                    w.invoke(cmd, config)
                break
            elif cmd.graph == Command.PARENT:
                exc.args = (replace(cmd, graph=parent_ns),)
            raise
        except GraphBubbleUp:
            # interrupt signal: propagate upward unchanged
            raise
        except Exception as exc:
            if not retry_policy:
                raise

            # find the first matching retry policy
            matching_policy = None
            for policy in retry_policy:
                if _should_retry_on(policy, exc):
                    matching_policy = policy
                    break

            if not matching_policy:
                raise

            attempts += 1
            if attempts >= matching_policy.max_attempts:
                raise

            # compute the backoff delay
            interval = matching_policy.initial_interval
            interval = min(
                matching_policy.max_interval,
                interval * (matching_policy.backoff_factor ** (attempts - 1)),
            )
            sleep_time = (
                interval + random.uniform(0, 1)
                if matching_policy.jitter else interval
            )
            time.sleep(sleep_time)

            # mark the next attempt as resuming
            config = patch_configurable(
                config, {CONFIG_KEY_RESUMING: True}
            )

7.7.1 The RetryPolicy Data Structure

class RetryPolicy(NamedTuple):
    initial_interval: float = 0.5   # wait before the first retry (seconds)
    backoff_factor: float = 2.0     # backoff multiplier
    max_interval: float = 128.0     # maximum wait (seconds)
    max_attempts: int = 3           # maximum number of attempts, including the first
    jitter: bool = True             # whether to add random jitter
    retry_on: type[Exception] | Sequence[type[Exception]] | Callable = default_retry_on

7.7.2 Visualizing the Backoff Algorithm

flowchart TB
    subgraph "Exponential backoff + jitter"
        A1["Attempt 1: fails"] --> W1["wait 0.5s + jitter"]
        W1 --> A2["Attempt 2: fails"]
        A2 --> W2["wait 1.0s + jitter"]
        W2 --> A3["Attempt 3: fails"]
        A3 --> FAIL["max_attempts=3 reached\nraise the exception"]
    end
    subgraph "Backoff computation"
        direction TB
        F["interval = initial_interval * backoff_factor^(attempts-1)"]
        F --> C["interval = min(interval, max_interval)"]
        C --> J{"jitter?"}
        J -->|yes| JV["sleep = interval + random(0,1)"]
        J -->|no| NJ["sleep = interval"]
    end
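The delay formula from the diagram is easy to tabulate. The helper backoff_schedule below is a hypothetical function written for this sketch; it applies the same min(max_interval, initial_interval * backoff_factor ** (attempts - 1)) rule and leaves jitter out so the numbers are reproducible.

```python
def backoff_schedule(max_attempts=3, initial_interval=0.5,
                     backoff_factor=2.0, max_interval=128.0):
    """Waits (without jitter) that precede attempts 2..max_attempts."""
    return [
        min(max_interval, initial_interval * backoff_factor ** (attempt - 1))
        for attempt in range(1, max_attempts)
    ]


default_waits = backoff_schedule()                # defaults from RetryPolicy
capped_waits = backoff_schedule(max_attempts=11)  # long enough to hit the cap
```

With the defaults there are only two waits, 0.5 s and 1.0 s, because the third failed attempt raises instead of sleeping. With eleven attempts the waits double until they reach max_interval and then stay at 128 s.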

7.7.3 Matching Across Multiple Policies

LangGraph lets you configure several retry policies on one node and picks among them by exception type:

def _should_retry_on(retry_policy, exc):
    if isinstance(retry_policy.retry_on, Sequence):
        return isinstance(exc, tuple(retry_policy.retry_on))
    elif isinstance(retry_policy.retry_on, type):
        return isinstance(exc, retry_policy.retry_on)
    elif callable(retry_policy.retry_on):
        return retry_policy.retry_on(exc)

This enables fine-grained retry control:

# allow more attempts for API rate limiting, retry quickly on network errors
graph.add_node(
    "llm_call", llm_fn,
    retry_policy=[
        RetryPolicy(max_attempts=5, initial_interval=2.0,
                     retry_on=RateLimitError),
        RetryPolicy(max_attempts=3, initial_interval=0.1,
                     retry_on=ConnectionError),
    ]
)
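The first-match rule used by the loop in run_with_retry can be isolated in a few lines. DemoPolicy, RateLimitError, and first_match are hypothetical names defined only for this sketch; DemoPolicy mirrors just two of the RetryPolicy fields.

```python
from typing import NamedTuple


class DemoPolicy(NamedTuple):  # hypothetical; mirrors two RetryPolicy fields
    retry_on: type
    max_attempts: int


class RateLimitError(Exception):  # hypothetical exception type
    pass


policies = [
    DemoPolicy(retry_on=RateLimitError, max_attempts=5),
    DemoPolicy(retry_on=ConnectionError, max_attempts=3),
]


def first_match(exc):
    """Return the first policy whose retry_on matches, or None."""
    return next((p for p in policies if isinstance(exc, p.retry_on)), None)


rate_limited = first_match(RateLimitError("429"))
dropped = first_match(ConnectionResetError("reset"))  # subclass of ConnectionError
unmatched = first_match(ValueError("bad input"))
```

Because matching uses isinstance, subclasses such as ConnectionResetError fall under the ConnectionError policy, and an exception that matches no policy is not retried at all. Order matters: a broad policy listed first would shadow more specific ones after it.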

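7.7.4 State Management During Retries

Two state operations matter on every retry:

task.writes.clear(): discards the writes from the previous attempt, so writes from a failed attempt are never kept; only writes from a successful run are committed.
CONFIG_KEY_RESUMING: True: marks the attempt as resuming. This tells a subgraph "you are being retried", and the subgraph can use that to skip steps it has already completed.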

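7.8 Cache Policy

LangGraph 1.1.6, the version examined here, supports node-level caching, which avoids recomputing deterministic nodes.

7.8.1 CachePolicy and CacheKey

@dataclass
class CachePolicy:
    key_func: Callable = default_cache_key  # function that produces the cache key
    ttl: int | None = None                  # time to live (seconds)

class CacheKey(NamedTuple):
    ns: tuple[str, ...]   # namespace (identifies the node)
    key: str              # cache key (hash of the input)
    ttl: int | None       # time to live

A cache key has three parts:

Namespace: (CACHE_NS_WRITES, identifier(proc), name), which identifies which implementation of which node this is
Key: xxh3_128_hexdigest(key_func(input)), a hash of the input data
TTL: an optional expiry time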

7.8.2 The Cache Matching Flow

The cache is consulted in two places:

1. At the start of a superstep (right after tick):

while loop.tick():
    for task in loop.match_cached_writes():
        loop.output_writes(task.id, task.writes, cached=True)
    # only run tasks that did not hit the cache
    for _ in runner.tick(
        [t for t in loop.tasks.values() if not t.writes],
        ...
    ):

match_cached_writes queries the cache in one batch:

def match_cached_writes(self):
    if self.cache is None:
        return ()
    cached = {
        (t.cache_key.ns, t.cache_key.key): t
        for t in self.tasks.values()
        if t.cache_key and not t.writes  # only look up tasks with no writes yet
    }
    matched = []
    for key, values in self.cache.get(tuple(cached)).items():
        task = cached[key]
        task.writes.extend(values)  # fill the task with the cached writes
        matched.append(task)
    return matched

2. When a task's writes are saved (put_writes):

def put_writes(self, task_id, writes):
    super().put_writes(task_id, writes)
    if self.cache is None or not hasattr(self, "tasks"):
        return
    task = self.tasks.get(task_id)
    if task is None or task.cache_key is None:
        return
    # save to the cache asynchronously
    self.submit(
        self.cache.set,
        {(task.cache_key.ns, task.cache_key.key):
         (task.writes, task.cache_key.ttl)}
    )

flowchart TB
    START[Superstep starts] --> PREP[prepare_next_tasks\ncreate tasks, compute cache_key]
    PREP --> MATCH[match_cached_writes\nbatch cache lookup]
    MATCH --> HIT{Cache hit?}
    HIT -->|yes| FILL[fill task.writes\nemit cached=True event]
    HIT -->|no| EXEC[runner.tick runs the task]
    EXEC --> COMMIT[commit the result]
    COMMIT --> SAVE[put_writes saves the writes]
    SAVE --> CACHE_SAVE[save to the cache asynchronously]
    FILL --> AFTER[after_tick applies the writes]
    CACHE_SAVE --> AFTER

7.8.3 Generating the Cache Key

The default default_cache_key function serializes the input with pickle, and the result is then hashed:

# _internal/_cache.py
def default_cache_key(input: Any) -> bytes:
    return pickle.dumps(input)

In prepare_single_task, the full cache key computation is:

if cache_policy:
    args_key = cache_policy.key_func(val)  # user-defined or default key function
    cache_key = CacheKey(
        (CACHE_NS_WRITES, identifier(proc) or "__dynamic__", name),
        xxh3_128_hexdigest(
            args_key.encode() if isinstance(args_key, str) else args_key
        ),
        cache_policy.ttl,
    )

identifier(proc) returns a unique identifier for the processor (usually based on the function's module and name), so nodes that share a name but differ in implementation do not share cache entries.
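The structure of the key can be reproduced without third-party packages. In the sketch below, DemoCacheKey and demo_cache_key are hypothetical, the "writes" namespace label is a placeholder for CACHE_NS_WRITES, and hashlib.blake2b with a 16-byte digest is swapped in for xxh3_128 purely so the example runs on the standard library.

```python
import hashlib
import pickle
from typing import NamedTuple, Optional


class DemoCacheKey(NamedTuple):  # hypothetical mirror of CacheKey
    ns: tuple
    key: str
    ttl: Optional[int]


def demo_cache_key(node_name, impl_id, value, ttl=None):
    # blake2b (128-bit) stands in for xxh3_128; this is an assumption, not the real hash
    digest = hashlib.blake2b(pickle.dumps(value), digest_size=16).hexdigest()
    return DemoCacheKey(("writes", impl_id, node_name), digest, ttl)


k1 = demo_cache_key("embed", "mymod.embed", {"text": "hello"}, ttl=60)
k2 = demo_cache_key("embed", "mymod.embed", {"text": "hello"}, ttl=60)
k3 = demo_cache_key("embed", "mymod.embed", {"text": "world"}, ttl=60)
k4 = demo_cache_key("embed", "othermod.embed", {"text": "hello"}, ttl=60)
```

Identical node, implementation, and input give an identical key (k1 and k2), a changed input changes the hash (k3), and a different implementation lands in a different namespace (k4). One caveat of pickle-based keys is that objects that compare equal do not always pickle to the same bytes, which is one reason a custom key_func can be useful.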

7.9 Dynamic Task Creation: accept_push

In the Functional API, a task can create subtasks while it is running. This is implemented by the PregelLoop.accept_push method:

def accept_push(self, task, write_idx, call=None):
    """Accept a PUSH request from a task that is currently running."""
    pushed = prepare_single_task(
        (PUSH, task.path, write_idx, task.id, call),
        None,
        checkpoint=self.checkpoint,
        ...
    )
    if pushed:
        # emit a debug event
        self._emit("tasks", map_debug_tasks, [pushed])
        # register the new task
        self.tasks[pushed.id] = pushed
        # match any writes that already exist
        if not self.is_replaying:
            self._match_writes({pushed.id: pushed})
        return pushed

accept_push is passed to PregelRunner.tick as the schedule_task callback and is invoked while tasks are running:

# the _call function in _runner.py
if next_task := schedule_task(task(), scratchpad.call_counter(), call):
    if next_task.writes:
        # a result already exists (from the cache or a Checkpoint), return it directly
        fut = concurrent.futures.Future()
        ret = next((v for c, v in next_task.writes if c == RETURN), MISSING)
        fut.set_result(ret)
    else:
        # schedule the new task on the thread pool
        fut = submit()(
            run_with_retry, next_task, retry_policy, ...,
            __next_tick__=True,  # make sure the writes are committed first
        )
        SKIP_RERAISE_SET.add(fut)
        futures()[fut] = next_task

Dynamically created subtasks use __next_tick__=True so that the parent task's current writes are committed to PregelLoop before the subtask starts. This preserves the causal order of writes.

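7.10 Layered Exception Handling

Exception handling in LangGraph is organized into several layers:

flowchart TB
    subgraph "Exception hierarchy"
        direction TB
        L1["GraphBubbleUp (base class)"]
        L2["GraphInterrupt\n(interrupt signal)"]
        L3["ParentCommand\n(cross-graph command)"]
        L4["Ordinary Exception\n(retryable)"]
        L5["CancelledError\n(cancelled)"]
        L1 --- L2
        L1 --- L3
    end
    subgraph "Handling strategy"
        direction TB
        S1["GraphBubbleUp: no retry, does not stop other tasks, propagates upward"]
        S2["GraphInterrupt: save the interrupt info, merge, then propagate upward"]
        S3["ParentCommand: route to the target graph, or keep bubbling"]
        S4["Exception: match a retry policy; once exhausted, save the error and stop other tasks"]
        S5["CancelledError: save the error to the Checkpoint"]
    end
    L1 --> S1
    L2 --> S2
    L3 --> S3
    L4 --> S4
    L5 --> S5

Exception handling in run_with_retry

while True:
    try:
        task.writes.clear()
        return task.proc.invoke(task.input, config)
    except ParentCommand as exc:
        # check which graph the Command targets
        if cmd.graph in (ns, recast_checkpoint_ns(ns), task.name):
            # handled by the current graph
            for w in task.writers:
                w.invoke(cmd, config)
            break
        elif cmd.graph == Command.PARENT:
            # set the parent graph's namespace, then bubble up
            exc.args = (replace(cmd, graph=parent_ns),)
        raise
    except GraphBubbleUp:
        raise  # propagate upward unchanged
    except Exception as exc:
        if not retry_policy:
            raise
        # match and apply a retry policy...

This layering ensures that:

Control-flow signals (Interrupt, Command) are never caught by the retry logic by accident
Ordinary exceptions can be retried according to policy
Exception information is saved to the Checkpoint, which supports debugging and recovery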

7.11 The Write Flow During Execution

Let us trace one complete write flow, from the node function's return value to the Channel update:

sequenceDiagram
    participant Node as User function
    participant CW as ChannelWrite
    participant Writes as task.writes (deque)
    participant Runner as PregelRunner
    participant Loop as PregelLoop
    participant Chan as Channels
    Note over Node: return {"count": 5}
    Node->>CW: invoke({"count": 5})
    Note over CW: _get_updates extracts tuples
    CW->>CW: do_write(config, writes)
    Note over CW: config[CONF][CONFIG_KEY_SEND]
    CW->>Writes: extend([("count", 5)])
    Note over CW: second ChannelWrite (routing)
    CW->>Writes: extend([("branch:to:next", None)])
    Note over Runner: task.proc.invoke completes
    Runner->>Runner: commit(task, None)
    Runner->>Loop: put_writes(task.id, task.writes)
    Note over Loop: saved to checkpoint_pending_writes
    Note over Loop: saved to the Checkpointer asynchronously
    Note over Loop: after_tick()
    Loop->>Chan: apply_writes(checkpoint, channels, tasks, ...)
    Note over Chan: channels["count"].update([5])
    Note over Chan: channels["branch:to:next"].update([None])

The key observation: writes are collected in task.writes while the task runs, but they are only applied to the Channels when after_tick() calls apply_writes. This is the "in-step isolation" of the BSP model: tasks in the same superstep cannot see each other's writes.
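In-step isolation can be illustrated with a toy superstep. Everything here is a simplified assumption: channels is a plain dict, run_task is a hypothetical task body, and the final loop stands in for apply_writes with last-write-wins semantics rather than real reducers.

```python
from collections import deque

channels = {"count": 0}


def run_task(name, snapshot):
    """Read from the shared snapshot and buffer writes instead of applying them."""
    writes = deque()
    writes.append(("count", snapshot["count"] + 1))
    return name, writes


# Execution phase: both tasks see the same pre-step state.
snapshot = dict(channels)
buffered = [run_task("a", snapshot), run_task("b", snapshot)]
seen_during_step = channels["count"]  # channels are untouched so far

# Apply phase (the role apply_writes plays in after_tick).
for _name, task_writes in buffered:
    for channel, value in task_writes:
        channels[channel] = value
```

Both tasks computed 0 + 1, so the channel ends at 1 rather than 2: neither task observed the other's write. In LangGraph a reducer on the Channel decides how such concurrent writes are combined.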

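7.12 Design Decision Analysis

Why use weak references (weakref)?

Both submit and put_writes in PregelRunner are weak references:

class PregelRunner:
    def __init__(self, *, submit, put_writes, ...):
        self.submit = submit          # weakref.WeakMethod
        self.put_writes = put_writes  # weakref.WeakMethod

This prevents memory leaks caused by reference cycles. PregelRunner references methods of PregelLoop, and PregelLoop may indirectly reference PregelRunner (through callbacks injected into config). Weak references ensure that once PregelLoop exits its context manager, all related objects can be garbage collected.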
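The call pattern self.put_writes()(task.id, task.writes) seen earlier follows directly from how weakref.WeakMethod works: calling the reference returns the bound method, or None once the owner is gone. The sketch below uses a hypothetical DemoLoop class to show both states.

```python
import gc
import weakref


class DemoLoop:  # hypothetical stand-in for PregelLoop
    def put_writes(self, task_id, writes):
        return (task_id, list(writes))


loop = DemoLoop()
put_writes = weakref.WeakMethod(loop.put_writes)

# Dereference the weak method first, then call the bound method it returns.
alive_result = put_writes()("task-1", [("count", 5)])

del loop       # drop the only strong reference to the owner
gc.collect()   # make collection explicit for this demonstration
dead_ref = put_writes()
```

While the loop is alive the call goes through normally; afterwards the reference yields None instead of keeping the loop object alive, which is why code holding such references checks the result before calling it.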

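Why is writes a deque rather than a list?

The rationale given is thread safety: in parallel execution several writers may append to the same writes deque at the same time, and atomic appends keep the buffer from being corrupted. CPython documents thread-safe appends and pops for deque. Individual list operations are also serialized by the GIL in CPython, so the difference is mainly that deque comes with a documented guarantee.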

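Why do PUSH subtasks use SKIP_RERAISE_SET?

When task A dynamically creates subtask B, B's exception should be handled by A (through the returned Future) rather than by PregelRunner's _panic_or_proceed. SKIP_RERAISE_SET records which Futures should have their exceptions skipped:

# _runner.py
fut = submit()(run_with_retry, next_task, ...)
SKIP_RERAISE_SET.add(fut)  # mark: the parent task handles this exception
futures()[fut] = next_task

This gives "exceptions propagate along the call chain" semantics instead of "all exceptions converge at the top level".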

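Why clear writes on retry instead of creating a new deque?

task.writes.clear()  # rather than task = replace(task, writes=deque())

PregelExecutableTask is a frozen=True dataclass, so its fields cannot be reassigned. In addition, the CONFIG_KEY_SEND entry injected into config holds the extend method of the original deque. If a new deque were created, later writes would still go to the old object and be lost. clear() empties the contents without changing the reference.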
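The bound-method argument is easy to verify in isolation. In this sketch, send plays the role of the callable stored under CONFIG_KEY_SEND, and replaced is a hypothetical "fresh buffer" that a naive retry implementation might have created.

```python
from collections import deque

writes = deque([("count", 1)])
send = writes.extend  # bound to this exact deque, like CONFIG_KEY_SEND

writes.clear()        # same object, now empty
send([("count", 2)])  # the retry's write lands in the cleared deque
after_clear = list(writes)

replaced = deque()    # a brand-new buffer that nothing else knows about
send([("count", 3)])  # still goes to the original deque, not to `replaced`
```

After clear(), the captured send callable keeps working against the same buffer. Swapping in a new deque would leave send pointing at the old one, so the new buffer would stay empty.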

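7.13 Summary

This chapter analyzed LangGraph's task scheduling and parallel execution mechanisms in depth. The key points:

PregelExecutableTask is an immutable task container; its proc chains the user logic with the writers, and its writes deque serves as a thread-safe write buffer
PULL vs PUSH: PULL tasks are triggered by Channel version changes and read their input from Channels; PUSH tasks are created by the Send API, take their input from the caller, and skip the version check
PregelRunner manages concurrent tasks through FuturesDict and supports a single-task fast path, timeouts, and "if any task fails, stop them all" semantics
BackgroundExecutor uses a thread pool plus copy_context for synchronous parallelism; AsyncBackgroundExecutor uses asyncio plus a Semaphore for asynchronous parallelism and concurrency control
Retry policies support exponential backoff, random jitter, a maximum number of attempts, and multiple policies matched by exception type. Each retry clears the old writes and marks the attempt as resuming
Cache policies build a CacheKey from node identity plus an input hash; the cache is queried in batch at the start of a superstep, and a hit fills writes directly and skips execution
The write flow follows the BSP model: writes are collected in a deque during execution, saved to the Checkpointer on commit, and applied to the Channels together in after_tick

These three layers, PregelLoop (scheduling), PregelRunner (parallelism), and run_with_retry (execution), together make up LangGraph's runtime engine, turning a declarative graph definition into efficient, reliable, and recoverable AI workflow execution.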
