The enforce_eager Parameter for Running Large Models

flyfish

enforce_eager=True:

vLLM runs entirely in plain PyTorch eager mode.

Every generated token goes through a regular model.forward() call. No CUDA Graph is captured, so all of the kernel launch overhead remains.

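Eager mode (immediate execution): PyTorch executes each line of code as soon as it is reached. Every call to model(x) immediately issues its operations (kernels) to the GPU one by one, and the Python → C++ → CUDA Driver call chain runs live, step by step.

Its counterpart is lazy (deferred) execution. In TensorFlow 1.x Graph Mode (static graphs), all operations are first built into a graph and then executed together. PyTorch is eager by default, which suits debugging and research well but incurs higher kernel launch overhead (the cost of the CPU repeatedly submitting operations to the GPU).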

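Eager means immediate execution. CPU launch overhead (often called CUDA kernel launch overhead) is:

the extra time the CPU spends telling the GPU to run a CUDA kernel (a GPU compute function).

This overhead excludes the time the kernel actually spends computing on the GPU. It covers only the CPU-side work of preparing the task and submitting it to the GPU.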
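To make the distinction concrete, here is a back-of-the-envelope model of one forward pass. The function names and all microsecond figures are illustrative assumptions, not measurements. The model assumes launches are pipelined, so each kernel costs whichever is larger: the CPU time to submit it or the GPU time to run it.

```python
def step_time_us(num_kernels: int, launch_us: float, kernel_us: float) -> float:
    """Rough time for one forward pass with pipelined launches.

    If submitting a kernel takes longer than running it, the GPU idles
    and the step is launch-bound; otherwise it is compute-bound.
    """
    return num_kernels * max(launch_us, kernel_us)


def gpu_utilization(num_kernels: int, launch_us: float, kernel_us: float) -> float:
    """Fraction of the step during which the GPU is actually computing."""
    busy = num_kernels * kernel_us
    return busy / step_time_us(num_kernels, launch_us, kernel_us)


# Hypothetical small-batch case: 500 kernels, 5 us to launch, 2 us of GPU work each.
print(step_time_us(500, 5.0, 2.0))      # 2500.0
print(gpu_utilization(500, 5.0, 2.0))   # 0.4

# Hypothetical large-batch case: the same kernels now take 50 us each on the GPU.
print(gpu_utilization(500, 5.0, 50.0))  # 1.0
```

With short kernels the GPU is busy less than half of the time in this toy model, which is the situation CUDA Graphs are meant to address.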

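enforce_eager=False (recommended for production):

vLLM uses a hybrid of eager execution and CUDA Graphs.

The prefill phase usually still runs in eager mode.
The decoding phase captures CUDA Graphs for the portions with a fixed batch size / sequence length and then replays them with replay(), which greatly reduces CPU → GPU launch overhead.

If the model has dynamic control flow, dynamic shapes, or operations that cannot be captured in a graph, it may crash or fall back to eager execution.

Capture requires warmup and fixed shapes, which makes it a poor fit for some complex models.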
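Because a captured graph only works for the shape it was recorded with, one common way to cope with varying decode batches is to capture graphs for a handful of batch sizes, pad each incoming batch up to the nearest captured size, and fall back to eager execution when the batch is too large. The sketch below illustrates that idea only. CAPTURED_SIZES and pick_graph_batch_size are hypothetical names and values, not vLLM's actual implementation.

```python
import bisect
from typing import Optional

# Hypothetical, sorted list of batch sizes for which a graph has been captured.
CAPTURED_SIZES = [1, 2, 4, 8, 16, 32]


def pick_graph_batch_size(batch_size: int) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None to run eagerly."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    idx = bisect.bisect_left(CAPTURED_SIZES, batch_size)
    if idx == len(CAPTURED_SIZES):
        return None  # larger than any captured graph: fall back to eager
    return CAPTURED_SIZES[idx]


for bs in (1, 3, 8, 33):
    print(bs, "->", pick_graph_batch_size(bs))
```

A batch of 3 would be padded to 4 and run through the graph captured for size 4, while a batch of 33 would run eagerly in this sketch.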

From the official vLLM documentation:

"If True, we will disable CUDA graph and always execute the model in eager mode.

If False, we will use CUDA graph and eager execution in hybrid for maximal performance and flexibility."

enforce_eager=True = force CUDA Graph off and use only plain eager execution.
enforce_eager=False = enable CUDA Graph acceleration.
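In practice the flag is passed when the engine is constructed. The sketch below assumes vLLM is installed and uses its LLM class with the enforce_eager argument. The model name is only a small placeholder, and engine_kwargs / make_llm are hypothetical helpers added here for illustration.

```python
def engine_kwargs(debug: bool) -> dict:
    """Build constructor arguments: eager while debugging, CUDA Graphs otherwise."""
    return {
        "model": "facebook/opt-125m",  # placeholder model name (assumption)
        "enforce_eager": debug,        # True disables CUDA Graph capture
    }


def make_llm(debug: bool = False):
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM
    return LLM(**engine_kwargs(debug))


print(engine_kwargs(debug=True))
print(engine_kwargs(debug=False))
```

On a machine with vLLM and a GPU, make_llm(debug=True).generate("Hello, my name is") would then run generation without CUDA Graphs. Switching to debug=False trades a longer startup (capture) for lower per-token CPU overhead.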

The full CUDA Graph workflow has four main stages: Warmup → Capture → Replay → (optional) Reset/destroy.

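1. Why is this workflow needed?

In ordinary eager mode:

Every call to model(x) goes through the layers Python → C++ → CUDA Driver.

For each kernel it starts (a GPU compute function), the CPU performs one launch (argument preparation, driver calls, and so on).

With small models and small batches, this CPU launch overhead makes up a large share of the total time and leads to low GPU utilization (the GPU often sits idle waiting for the CPU to issue work).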

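The idea behind CUDA Graph:
Record once what the GPU has to do (which kernels, with which arguments, and their dependencies).

Afterwards, replay that recording directly: the CPU issues essentially a single call (cudaGraphLaunch) to submit the whole sequence to the GPU. This sharply reduces CPU overhead, and the kernels may also run in a better order or with more concurrency.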

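Key constraints:

Shapes must be static.

Input/output memory addresses must stay fixed (the pointers are baked into the graph).

The sequence of GPU operations must be deterministic (no if or other control flow that depends on runtime data, no dynamic shapes, and so on).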

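2. The CUDA Graph workflow in detail

Step 1: Warmup

Before capture, run the model a few times in ordinary eager mode (typically 5 to 20 iterations).

Purpose:

Let cuDNN, JIT compilation, and the memory allocator (caching allocator) settle into a steady state.

Keep one-off work from the first runs (such as kernel JIT compilation or memory allocation) out of the recording, so the graph stays clean.
Recommended practice: run the warmup on a side stream and wait for it to finish.

Without enough warmup, capture may fail, performance may suffer, or results may be numerically inconsistent.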

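Step 2: Capture

Create a torch.cuda.CUDAGraph() object.

Use the context manager with torch.cuda.graph(g): (internally PyTorch calls cudaStreamBeginCapture).

Inside that context, run one normal forward pass (static_output = model(static_input)).
What happens at this point?

The operations sent to the GPU (kernel launches, memcpys, events, and so on) are not actually executed.

The CUDA Driver records them as a graph (nodes plus dependencies).

PyTorch's memory allocator uses a separate private memory pool, so tensors allocated during capture keep the same addresses on later replays.

Once capture ends, the graph is "recorded", with every kernel's arguments (pointer addresses in particular) fixed in place.

Note: CPU code that runs during capture (such as a Python if statement) still executes normally but is not recorded in the graph. Only GPU operations are recorded.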
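The point about CPU code is easy to get wrong, so here is a toy model in plain Python. ToyGraph is a hypothetical stand-in for a real CUDA Graph, not a PyTorch API: during "capture" it records the names of GPU operations without running them, and the Python if around them is evaluated only once, at capture time.

```python
class ToyGraph:
    """Toy stand-in for a CUDA Graph: records operations, then replays them."""

    def __init__(self):
        self.ops = []

    def launch(self, name):
        # During capture the operation is recorded, not executed.
        self.ops.append(name)

    def replay(self):
        # Replay yields exactly what was recorded; no Python control flow reruns.
        return list(self.ops)


def forward(graph, use_bias):
    graph.launch("matmul")
    if use_bias:              # CPU-side branch: evaluated during capture only
        graph.launch("add_bias")
    graph.launch("relu")


g = ToyGraph()
forward(g, use_bias=True)     # "capture" with the branch taken
use_bias = False              # changing the flag later does not affect the graph
print(g.replay())             # ['matmul', 'add_bias', 'relu']
```

The branch that was taken during capture is frozen into the recording. To get the other path, the graph would have to be captured again.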

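Step 3: Replay

Prepare the new input: use .copy_() to copy the new data into the static_input tensor used during capture (the address stays the same; do not rebind it with static_input = new_data).

Call g.replay(), which is a very lightweight call.
What happens underneath?

PyTorch/CUDA calls cudaGraphLaunch once and submits all of the graph's work to the GPU in one go.

All kernels run in the order and with the dependencies recorded at capture time.

The kernels themselves may also run slightly faster on the GPU (better scheduling), but the main gain comes from eliminating the repeated CPU-side launch overhead.

After execution, read the result from static_output (the output tensor kept from capture), whose address is also fixed.

Repeat the cycle copy_ → replay → read output to process many batches quickly.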
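The difference between copying into the static tensor and rebinding the name can be checked with data_ptr(). The sketch below uses CPU tensors so that it runs without a GPU; the same reasoning applies to CUDA tensors, whose addresses a graph records at capture time.

```python
import torch

static_input = torch.zeros(4)
captured_ptr = static_input.data_ptr()   # the address a graph would have recorded

new_data = torch.arange(4, dtype=torch.float32)

# Correct: copy into the existing buffer; the address is unchanged.
static_input.copy_(new_data)
print(static_input.data_ptr() == captured_ptr)   # True

# Wrong: a rebound name refers to a different buffer the graph knows nothing about.
rebound = new_data.clone()
print(rebound.data_ptr() == captured_ptr)        # False
```

A replayed graph would keep reading from captured_ptr, so data placed in the rebound tensor would simply be ignored.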

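Step 4: (Optional) Reset / destroy

To change shapes or capture again, call graph.reset().

When the graph object is no longer used, Python cleans it up automatically (or you can delete it manually).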

Comparing two approaches

Approach 1

import torch
import time
import torch.nn as nn

# ====================== Configuration ======================
assert torch.cuda.is_available(), "CUDA is not available"
device = torch.device("cuda:0")

torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision('high')

# ====================== Model ======================
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1024, 4096, device=device)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(4096, 2048, device=device)
        self.fc3 = nn.Linear(2048, 512, device=device)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        return self.fc3(x)


model = SimpleModel().eval()

# ====================== Test data ======================
batch_size = 64
num_runs = 1000
input_shape = (batch_size, 1024)
input_tensor = torch.randn(input_shape, device=device, dtype=torch.float32)

# Prepare 1000 different input samples
test_data = [torch.randn_like(input_tensor) for _ in range(num_runs)]

# ====================== 1. Pure eager mode (analogous to enforce_eager=True) ======================
def run_eager(model, data):
    model.eval()
    total_time = 0.0
    with torch.no_grad():
        for x in data:
            start = time.perf_counter()
            _ = model(x)
            torch.cuda.synchronize()
            total_time += time.perf_counter() - start
    return total_time


# ====================== 2. Manual CUDA Graph mode (analogous to enforce_eager=False) ======================
def run_manual_cuda_graph(model, data):
    model.eval()
    graph = torch.cuda.CUDAGraph()
    
    # Static input: must stay alive with an unchanged memory address.
    # The static output is created during capture below, so it is not preallocated here.
    static_input = torch.randn_like(input_tensor, device=device)
    
    # ------------------ Step 1: warmup ------------------
    # Warm up on a side stream so cuDNN/JIT/allocator reach a steady state
    warmup_stream = torch.cuda.Stream()
    # Make the side stream wait for work already queued on the current stream
    warmup_stream.wait_stream(torch.cuda.current_stream())
    with torch.no_grad(), torch.cuda.stream(warmup_stream):
        for _ in range(10):          # 5 to 20 iterations, depending on model complexity
            _ = model(static_input)
    
    # Wait for the warmup stream to finish
    torch.cuda.current_stream().wait_stream(warmup_stream)
    
    # ------------------ Step 2: capture the CUDA Graph ------------------
    with torch.no_grad(), torch.cuda.graph(graph):
        # The output is allocated from the graph's private pool; keep this reference
        static_output = model(static_input)
    
    # ------------------ Step 3: timed replay ------------------
    total_time = 0.0
    with torch.no_grad():
        for x in data:
            start = time.perf_counter()
            
            static_input.copy_(x)      # update the input in place (same address)
            graph.replay()             # run the captured graph
            
            # In real use, read the result from static_output
            # (only timing here; clone it or use it directly)
            # result = static_output.clone()
            
            torch.cuda.synchronize()
            total_time += time.perf_counter() - start
    
    return total_time


# ====================== Main ======================
if __name__ == "__main__":
    print("Starting benchmark (please wait for warmup and capture)...\n")
    
    # One global warmup run to remove first-call overhead
    with torch.no_grad():
        _ = model(input_tensor)
    torch.cuda.empty_cache()
    
    print("Running eager mode (enforce_eager=True)...")
    eager_time = run_eager(model, test_data)
    
    print("Running manual CUDA Graph mode (enforce_eager=False)...")
    graph_time = run_manual_cuda_graph(model, test_data)
    
    print("\n" + "="*70)
    print(f"enforce_eager=True  → pure eager mode        : {eager_time:.4f} s")
    print(f"enforce_eager=False → manual CUDA Graph mode : {graph_time:.4f} s")
    print(f"Speedup                                      : {eager_time / graph_time:.2f}x")
    print("="*70)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Batch Size: {batch_size} | Runs: {num_runs}")
    print("Note: in a real deployment, read results from static_output")

Approach 2

import torch
import time
import torch.nn as nn

assert torch.cuda.is_available(), "CUDA is not available"
device = torch.device("cuda:0")

torch.backends.cudnn.benchmark = True

# ====================== Model ======================
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1024, 4096, device=device)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(4096, 2048, device=device)
        self.fc3 = nn.Linear(2048, 512, device=device)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        return self.fc3(x)


model = SimpleModel().eval()

# ====================== Test data ======================
batch_size = 64
num_runs = 1000
input_tensor = torch.randn(batch_size, 1024, device=device, dtype=torch.float32)
test_data = [torch.randn_like(input_tensor) for _ in range(num_runs)]

# ====================== 1. Pure eager mode (analogous to enforce_eager=True) ======================
def run_eager(model, data):
    model.eval()
    total_time = 0.0
    with torch.no_grad():
        for x in data:
            start = time.perf_counter()
            _ = model(x)
            torch.cuda.synchronize()
            total_time += time.perf_counter() - start
    return total_time


# ====================== 2. Manual CUDA Graph mode (analogous to enforce_eager=False) ======================
def run_cuda_graph(model, data):
    model.eval()
    graph = torch.cuda.CUDAGraph()
    
    # Static input (must stay alive; the static output is created during capture)
    static_input = torch.randn_like(input_tensor, device=device)
    with torch.no_grad():
        for _ in range(10):                    # warmup on the current stream (no side stream here)
            _ = model(static_input)
    
    # Capture the CUDA Graph
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)    # keep a reference to the output
    
    # Timed replay
    total_time = 0.0
    with torch.no_grad():
        for x in data:
            start = time.perf_counter()
            static_input.copy_(x)              # update the input in place
            graph.replay()                     # run the graph
            # _ = static_output.clone()        # in real use, read the output here
            torch.cuda.synchronize()
            total_time += time.perf_counter() - start
    return total_time


# ====================== Main ======================
if __name__ == "__main__":
    print("Starting benchmark (please wait for warmup)...\n")
    
    # Global warmup
    with torch.no_grad():
        _ = model(input_tensor)
    torch.cuda.empty_cache()
    
    print("Running eager mode (enforce_eager=True)...")
    eager_time = run_eager(model, test_data)
    
    print("Running CUDA Graph mode (enforce_eager=False)...")
    graph_time = run_cuda_graph(model, test_data)
    
    print("\n" + "="*65)
    print(f"enforce_eager=True  (pure eager mode)    : {eager_time:.4f} s")
    print(f"enforce_eager=False (CUDA Graph mode)    : {graph_time:.4f} s")
    print(f"Speedup                                  : {eager_time / graph_time:.2f}x")
    print("="*65)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Batch Size: {batch_size} | Runs: {num_runs}")
