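Porting a Deep Learning Model to CPU in Practice: Migrating MinivLLM from GPU to a CPU Environment

1. Background

MinivLLM is a lightweight vLLM-style implementation designed for GPU environments. It relies heavily on GPU-specific optimizations:

Triton: a framework for writing GPU-accelerated kernels
CUDA Graph: captures GPU work once and replays it to cut launch overhead
Paged Attention: a paged attention mechanism built on a GPU-resident KV cache
NCCL: NVIDIA's multi-GPU communication backend

However, we only had a CPU environment (Linux under WSL2), so the whole project had to be moved to CPU. This article records the challenges, the fixes, and the results so far.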

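2. Challenges

2.1 Dependency conflict

Problem: the project's pyproject.toml lists vllm>=0.15.0 as a dependency, even though MinivLLM is itself a vLLM implementation.

Cause: vllm is only used in benchmark_tps.py for performance comparison; the core functionality does not depend on it. pip still tries to install it, which wastes time and disk space.

Solution: remove vllm from the dependencies.

# pyproject.toml
dependencies = [
    "transformers",
    "torch",
    "xxhash",
    # "vllm>=0.15.0",  # removed
]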
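
Once vllm is no longer a hard dependency, the benchmark script has to cope with it being absent. One way to do that is a small availability check, sketched below. The helper name has_module and the printed message are illustrative assumptions, not code from the project.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical use inside benchmark_tps.py: only run the comparison
# against vllm when the optional package is actually installed.
if not has_module("vllm"):
    print("vllm not installed; skipping the reference comparison")
```

Using find_spec rather than a bare import keeps the check cheap and avoids triggering vllm's own import-time side effects on a machine that cannot run it.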

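2.2 CPU-only PyTorch is not on the mirror

Problem: installing PyTorch from the Tsinghua mirror only finds the CUDA build (766.7 MB plus a long list of dependencies); there is no CPU-only build.

Cause: on Linux, the torch wheel published on PyPI (and therefore on its mirrors) is the CUDA build. CPU-only Linux wheels are served from the cpu subdirectory of PyTorch's own package index.

Solution: mix indexes. Install PyTorch from the official CPU index and everything else from the domestic mirror.

# 178.7 MB vs 766.7 MB (CUDA build)
pip install torch --index-url https://download.pytorch.org/whl/cpu \
    --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple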
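
After installing, it is worth confirming which build pip actually picked. The snippet below is a minimal check, assuming torch imports cleanly; describe_torch_build is a name made up for this illustration.

```python
import torch


def describe_torch_build() -> dict:
    """Summarize which PyTorch build is installed."""
    return {
        "version": torch.__version__,
        # torch.version.cuda is None for CPU-only wheels
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch_build())
```

On a CPU-only wheel the cuda_build entry should be None and cuda_available should be False.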

2.3 Triton is a GPU-only library

Problem: the code makes heavy use of the @triton.jit decorator to define GPU kernels such as store_kvcache_kernel and flash_attention_varlen_kernel.

Cause: Triton is a kernel language and compiler that targets GPUs; the standard release has no CPU backend.

Solution: reimplement the Triton kernels in standard PyTorch.

Core replacements:

Triton kernel → Python loop + PyTorch ops
tl.load/tl.store → PyTorch tensor slicing
@triton.jit → plain Python function

# Original Triton kernel
@triton.jit
def flash_attention_varlen_kernel(Q, K, V, O, ...):
    ...
    tl.store(O_ptr + offset, acc)

# CPU implementation
def flash_attention_prefill(q, k, v, cu_seqlens, ...):
    # Standard PyTorch: handle one packed sequence at a time
    for seq_idx in range(len(cu_seqlens) - 1):
        seq_start = cu_seqlens[seq_idx]
        seq_end = cu_seqlens[seq_idx + 1]
        q_seq = q[seq_start:seq_end]
        # use torch.matmul + F.softmax
        ...
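
To make the loop above concrete, here is a minimal runnable sketch of causal attention over packed variable-length sequences. It is not the project's implementation: the tensor layout (total_tokens, num_heads, head_dim), the function name, and the assumption that queries and keys have the same number of heads (no grouped-query attention) are all choices made for this illustration.

```python
import math

import torch
import torch.nn.functional as F


def flash_attention_prefill_cpu(q, k, v, cu_seqlens, scale=None):
    """Causal attention over packed sequences.

    q, k, v: (total_tokens, num_heads, head_dim)
    cu_seqlens: cumulative boundaries, e.g. [0, 3, 7] for lengths 3 and 4
    """
    if scale is None:
        scale = 1.0 / math.sqrt(q.shape[-1])
    out = torch.empty_like(q)
    bounds = [int(x) for x in cu_seqlens]
    for start, end in zip(bounds[:-1], bounds[1:]):
        # (heads, seq, dim) layout so matmul batches over heads
        q_s = q[start:end].transpose(0, 1)
        k_s = k[start:end].transpose(0, 1)
        v_s = v[start:end].transpose(0, 1)
        scores = torch.matmul(q_s, k_s.transpose(-1, -2)) * scale
        # Causal mask: token i may only attend to tokens 0..i
        n = end - start
        causal = torch.ones(n, n, dtype=torch.bool).tril()
        scores = scores.masked_fill(~causal, float("-inf"))
        probs = F.softmax(scores, dim=-1)
        out[start:end] = torch.matmul(probs, v_s).transpose(0, 1)
    return out
```

This materializes the full score matrix for each sequence, so memory grows quadratically with sequence length. That is the trade-off for dropping the tiled flash-attention kernel; it is acceptable for short prompts but not a performance-equivalent replacement.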

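2.4 torch._dynamo compilation error

Problem: the CPU Inductor backend in PyTorch 2.6 fails with: fatal error: 'omp.h' file not found

Cause: on CPU, torch.compile generates C++ code that is built with OpenMP, and the WSL environment may not have the OpenMP headers installed (for example libomp-dev).

Solution: either let torch.compile fall back to eager execution when compilation fails, or disable it altogether.

import torch._dynamo

# On a compile error (such as the missing omp.h), fall back to eager execution
torch._dynamo.config.suppress_errors = True

Or disable torch.compile entirely from the command line:

TORCH_COMPILE_DISABLE=1 python main.py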
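
A third option is to decide per function whether compiling is worthwhile. The decorator below is a sketch of that idea; maybe_compile and scale_logits are hypothetical names, and compiling only when CUDA is present is an assumed policy, not something the project prescribes.

```python
import os

import torch


def maybe_compile(fn):
    """Apply torch.compile only when a GPU is present and compile is not disabled."""
    disabled = os.environ.get("TORCH_COMPILE_DISABLE", "0") == "1"
    if disabled or not torch.cuda.is_available():
        return fn  # plain eager function on CPU
    return torch.compile(fn)


@maybe_compile
def scale_logits(logits, temperature):
    return logits / temperature.unsqueeze(-1)


print(scale_logits(torch.ones(2, 3), torch.tensor([1.0, 2.0])))
```

This keeps the GPU path unchanged while the CPU path never touches the Inductor toolchain, so a missing omp.h cannot surface.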

2.5 In-place operation error in SamplerLayer

Problem: combining the @torch.compile decorator with @torch.inference_mode() raises:

RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.

Cause: tensors created under torch.inference_mode() are inference tensors, and PyTorch rejects in-place updates to them (such as logits /= temperature) from code that runs outside inference mode.

Solution: remove the @torch.compile decorator and use out-of-place operations.

# Original code (raises the error)
@torch.compile
def forward(self, logits, temperature):
    logits /= temperature.unsqueeze(-1)  # in-place
    probs = torch.softmax(logits, dim=-1)
    sample_tokens = probs.div_(...).argmax(dim=-1)

# Modified
def forward(self, logits, temperature):
    logits = logits / temperature.unsqueeze(-1)  # out-of-place
    probs = torch.softmax(logits, dim=-1)
    # Dividing by Exponential(1) noise and taking argmax samples from probs
    noise = torch.empty_like(probs).exponential_(1).clamp_min_(1e-10)
    sample_tokens = (probs / noise).argmax(dim=-1)
    return sample_tokens
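
The out-of-place sampler can be checked in isolation. The sketch below feeds it an inference tensor from outside inference mode, the situation that triggered the error, and compares the empirical token frequencies with the target distribution. The standalone sample function and the three-token distribution are assumptions made for the demonstration.

```python
import torch


def sample(logits: torch.Tensor, temperature: torch.Tensor) -> torch.Tensor:
    """Draw one token per row without modifying `logits` in place."""
    logits = logits / temperature.unsqueeze(-1)
    probs = torch.softmax(logits, dim=-1)
    noise = torch.empty_like(probs).exponential_(1).clamp_min_(1e-10)
    return (probs / noise).argmax(dim=-1)


torch.manual_seed(0)
target = torch.tensor([0.1, 0.2, 0.7])
with torch.inference_mode():
    # An inference tensor, as a model running under inference_mode would produce
    logits = target.log().repeat(20000, 1)

# Called outside inference mode: fine, because nothing is updated in place
tokens = sample(logits, torch.ones(20000))
freqs = torch.bincount(tokens, minlength=3).float() / tokens.numel()
print(freqs)  # should be close to `target`
```

The in-place calls on noise are safe because that tensor is freshly allocated outside inference mode; only the incoming logits are off limits.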

2.6 The NCCL backend does not support CPU

Problem: dist.init_process_group('nccl', ...) fails at startup.

Cause: NCCL (NVIDIA Collective Communications Library) only works with GPUs.

Solution: switch to the gloo backend, which supports CPU tensors.

# Original
dist.init_process_group('nccl', "tcp://localhost:12345", ...)

# Modified
dist.init_process_group('gloo', "tcp://localhost:12345", ...)
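
Rather than hard-coding either backend, the choice can be made at runtime. The following is a sketch under the assumption that a single helper owns process-group setup; pick_dist_backend and init_distributed are hypothetical names, and the default address simply mirrors the snippet above.

```python
import torch
import torch.distributed as dist


def pick_dist_backend() -> str:
    """Use NCCL only when CUDA and NCCL are both usable; otherwise gloo."""
    if torch.cuda.is_available() and dist.is_nccl_available():
        return "nccl"
    return "gloo"


def init_distributed(rank: int, world_size: int,
                     init_method: str = "tcp://localhost:12345") -> None:
    """Initialize the default process group with the selected backend."""
    dist.init_process_group(
        backend=pick_dist_backend(),
        init_method=init_method,
        world_size=world_size,
        rank=rank,
    )


print(pick_dist_backend())
```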

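2.7 Pinned memory and CUDA calls

Problem: the code uses .cuda() and pin_memory=True throughout, and neither works on a CPU-only machine.

Cause: pin_memory allocates page-locked host memory to speed up CPU-to-GPU transfers, and .cuda() moves a tensor to the GPU.

Solution: remove all GPU-specific calls.

# Original
input_ids = torch.tensor(input_ids, pin_memory=True).cuda(non_blocking=True)

# Modified
input_ids = torch.tensor(input_ids)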
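
If the GPU path should keep working, the two variants can be folded into one device-aware helper instead of deleting the GPU calls outright. This is a sketch; to_device_tensor is a name invented here, and the int64 default is an assumption suited to token ids.

```python
import torch


def to_device_tensor(values, device: torch.device, dtype=torch.int64) -> torch.Tensor:
    """Build a tensor on `device`, using pinned memory only for CUDA targets."""
    if device.type == "cuda":
        host = torch.tensor(values, dtype=dtype, pin_memory=True)
        return host.to(device, non_blocking=True)
    return torch.tensor(values, dtype=dtype)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_ids = to_device_tensor([1, 2, 3], device)
print(input_ids.device)
```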

2.8 CUDA Graph does not support CPU

Problem: torch.cuda.CUDAGraph() and torch.cuda.graph() only work on GPU.

Cause: CUDA Graph is a GPU-specific optimization.

Solution: always run in eager mode and skip CUDA Graph capture.

# run_model changed to always run eagerly
@torch.inference_mode()
def run_model(self, input_ids, is_prefill):
    # CPU mode: always eager, no CUDA Graph replay
    hidden_states = self.model(input_ids)
    logits = self.model.compute_logits(hidden_states)
    return logits

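2.9 block_table allocation during warmup

Problem: in the decode phase, paged_attention_decode fails because block_table is empty.

Cause: the Sequence objects created for warmup never go through Scheduler.schedule(), so no block_table is allocated for them; running them directly bypasses the allocation logic.

Solution: skip warmup, which is not needed in CPU mode.

def warmup_model(self):
    pass  # skip warmup in CPU mode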
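
For context, the allocation step that the warmup sequences skipped looks roughly like the toy model below. It is not the project's Scheduler or BlockManager: ToySequence and ToyBlockAllocator are hypothetical stand-ins that only show how a block_table gets filled before attention runs.

```python
from dataclasses import dataclass, field


@dataclass
class ToySequence:
    token_ids: list
    block_table: list = field(default_factory=list)


class ToyBlockAllocator:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self, seq: ToySequence) -> None:
        """Give `seq` enough physical blocks to hold all of its tokens."""
        needed = -(-len(seq.token_ids) // self.block_size)  # ceiling division
        missing = needed - len(seq.block_table)
        if missing > len(self.free_blocks):
            raise RuntimeError("out of KV cache blocks")
        for _ in range(missing):
            seq.block_table.append(self.free_blocks.pop(0))


allocator = ToyBlockAllocator(num_blocks=4, block_size=4)
seq = ToySequence(token_ids=list(range(6)))
allocator.allocate(seq)
print(seq.block_table)  # two blocks cover six tokens
```

A sequence that never passes through a step like allocate keeps an empty block_table, which is the state the decode kernel tripped over.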

3. Partial results: prefill runs successfully

After the changes above, the project's prefill phase runs successfully on CPU.

3.1 Success indicators

================================================================================
Weight Loading Summary:
================================================================================
Successfully loaded: 283 parameter groups
Skipped (merged into other weights): 28
================================================================================
51 number of processed tokens 63.79 tokens/sec during prefilling

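Model loading works: 283 parameter groups loaded
Prefill speed: about 64 tokens/sec (CPU; far slower than a GPU)
No Triton errors: the native PyTorch implementation works

3.2 Status

✅ Model initialization succeeds
✅ Tokenizer download succeeds (via a HuggingFace mirror)
✅ Prefill phase runs
❌ Decode phase still fails (paged_attention needs further fixes)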

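4. Remaining issues

4.1 Block allocation for Paged Attention

Problem: in the decode phase, block_table allocation and hand-off do not work correctly in CPU mode.

Symptom:

RuntimeError: stack expects a non-empty TensorList

Root cause:

The project manages memory with a paged KV cache
In GPU mode, BlockManager manages the physical blocks
In CPU mode, skipping warmup leaves the initial block_table empty
The decode phase then fails when it tries to read the cache through an empty block_table

Possible solutions:

Rewrite paged attention entirely as standard attention
Add CPU-specific block allocation logic
Use PyTorch's torch.nn.MultiheadAttention instead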
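
As a sketch of the first option, a decode step can gather a sequence's blocks from the paged cache and then run ordinary attention over them. The cache layout (num_blocks, block_size, num_heads, head_dim), the single-sequence signature, and the function name are assumptions for this illustration; the project's real layout may differ. The explicit check turns the opaque torch.stack failure into a readable error.

```python
import math

import torch


def paged_attention_decode_cpu(query, k_cache, v_cache, block_table, context_len):
    """One decode step for one sequence.

    query: (num_heads, head_dim)
    k_cache, v_cache: (num_blocks, block_size, num_heads, head_dim)
    block_table: physical block ids owned by this sequence, in order
    context_len: number of valid cached tokens
    """
    if len(block_table) == 0:
        raise ValueError("empty block_table: the sequence was never scheduled")
    idx = torch.tensor(block_table, dtype=torch.long)
    # Gather the blocks and flatten to (tokens, heads, dim), dropping padding
    k = k_cache[idx].flatten(0, 1)[:context_len]
    v = v_cache[idx].flatten(0, 1)[:context_len]
    scores = torch.einsum("hd,thd->ht", query, k) / math.sqrt(query.shape[-1])
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("ht,thd->hd", probs, v)
```

No causal mask is needed here because the single query token is the newest position and may attend to every cached token.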

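4.2 Tight architectural coupling

Problem: GPU-specific code is scattered across several files and needs extensive changes.

Affected files:

src/myvllm/layers/attention.py (Triton kernels)
src/myvllm/layers/sampler.py (@torch.compile)
src/myvllm/engine/model_runner.py (CUDA calls)
src/myvllm/engine/llm_engine.py (distributed initialization)

4.3 Performance gap

Expectation: CPU mode will be significantly slower than GPU mode.

Prefill: a GPU may reach thousands of tokens/sec, while a CPU manages tens to hundreds
Decode: the gap will be larger still, because optimizations such as CUDA Graph are unavailable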
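
These expectations are easy to check once both paths run. A throughput figure like the one in the prefill log can be reproduced with a small timing helper such as the sketch below; tokens_per_second is a hypothetical name, and the sleep stands in for a real model call.

```python
import time


def tokens_per_second(fn, num_tokens: int) -> float:
    """Time one call to `fn` and convert it to a tokens/sec figure."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed


# Stand-in workload; in practice `fn` would run prefill on 51 prompt tokens
tps = tokens_per_second(lambda: time.sleep(0.01), num_tokens=51)
print(f"{tps:.2f} tokens/sec")
```

A single timed call is noisy, so averaging over several runs gives a fairer comparison between backends.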

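5. Key lessons

5.1 Design assumptions

MinivLLM's architecture assumes the following GPU features:

NVIDIA GPU with CUDA support
Triton kernel programming
CUDA Graph optimization
NCCL multi-GPU communication
GPU-oriented memory management (pin_memory)

Lesson: CPU/GPU compatibility should be considered from the start of a project, using conditional compilation or runtime branching.

5.2 The importance of an abstraction layer

Problem: GPU optimizations are embedded directly in the business logic, with no abstraction layer.

A better approach: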

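from abc import ABC, abstractmethod

import torch

# Abstraction layer: one interface, one implementation per platform
class AttentionBackend(ABC):
    @abstractmethod
    def flash_attention_prefill(self, q, k, v, **kwargs): ...

    @abstractmethod
    def paged_attention_decode(self, query, k_cache, v_cache, **kwargs): ...

# GPU implementation
class TritonAttentionBackend(AttentionBackend):
    def flash_attention_prefill(self, q, k, v, **kwargs):
        ...  # launch the @triton.jit prefill kernel

    def paged_attention_decode(self, query, k_cache, v_cache, **kwargs):
        ...  # launch the @triton.jit decode kernel

# CPU implementation
class PyTorchAttentionBackend(AttentionBackend):
    def flash_attention_prefill(self, q, k, v, **kwargs):
        ...  # torch.matmul + F.softmax

    def paged_attention_decode(self, query, k_cache, v_cache, **kwargs):
        ...  # gather blocks from the cache, then standard attention

# Runtime selection
if torch.cuda.is_available():
    backend = TritonAttentionBackend()
else:
    backend = PyTorchAttentionBackend()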

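5.3 Test-driven development

Current state: each problem only surfaced when the code was run after the previous fix.

Recommendations:

Write unit tests that cover the core components
Use mock objects to simulate the GPU environment
Validate incrementally (first make sure the model can run forward, then test attention, then test end to end)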
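
The mocking recommendation can be illustrated with the standard library's unittest.mock. The sketch below patches torch.cuda.is_available so that both branches of a backend selector are exercised on any machine; select_backend_name is a hypothetical function, not part of the project.

```python
from unittest import mock

import torch


def select_backend_name() -> str:
    """Pick the attention backend based on CUDA availability."""
    return "triton" if torch.cuda.is_available() else "pytorch"


def test_backend_selection():
    # Pretend a GPU is present, regardless of the real hardware
    with mock.patch("torch.cuda.is_available", return_value=True):
        assert select_backend_name() == "triton"
    # Pretend no GPU is present
    with mock.patch("torch.cuda.is_available", return_value=False):
        assert select_backend_name() == "pytorch"


test_backend_selection()
print("backend selection test passed")
```

Mocking only covers the selection logic. The kernels themselves still need a real GPU, or a reference implementation to compare against.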

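5.4 Using domestic mirrors

Lessons:

CPU-only PyTorch builds are not available from PyPI mirrors
Mixed-index install: --index-url <index> --extra-index-url <additional index> (pip searches both and applies no priority between them)
HuggingFace mirror: HF_ENDPOINT=https://hf-mirror.com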

6. Next steps

6.1 Short term

If CPU support is to continue:

Replace Paged Attention entirely
- Use torch.nn.MultiheadAttention or standard attention
- Remove the block_table management

Add configuration options

config = {
    "use_triton": False,  # whether to use Triton
    "use_cuda_graph": False,  # whether to use CUDA Graph
    "backend": "cpu",  # "cpu" or "gpu"
}

Performance optimization
- Use PyTorch JIT (torch.jit.script)
- Consider ONNX Runtime or OpenVINO
- Optimize batching
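
The configuration options listed above could also be given a typed form that rejects impossible combinations early. The dataclass below is one possible shape, with the same three keys as the dict; RuntimeConfig and its validation rules are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    backend: str = "cpu"          # "cpu" or "gpu"
    use_triton: bool = False      # Triton kernels need a GPU
    use_cuda_graph: bool = False  # CUDA Graph capture needs a GPU

    def __post_init__(self):
        if self.backend not in ("cpu", "gpu"):
            raise ValueError(f"unknown backend: {self.backend!r}")
        if self.backend == "cpu" and (self.use_triton or self.use_cuda_graph):
            raise ValueError("Triton and CUDA Graph require backend='gpu'")


cpu_config = RuntimeConfig()
gpu_config = RuntimeConfig(backend="gpu", use_triton=True, use_cuda_graph=True)
print(cpu_config, gpu_config)
```

Failing at construction time means a misconfigured CPU run stops before any model weights are loaded.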

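6.2 Long term

Layered architecture refactor
- Abstract the backend interface
- Implement CPU and GPU versions separately
- Select the backend at runtime
Incremental migration
- Start with small models
- Add features step by step
- Keep backward compatibility
Multi-platform support
- CPU (x86, ARM)
- GPU (NVIDIA, AMD, Apple Silicon)
- Dedicated accelerators (TPU, NPU)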

7. Full list of code changes

File	Change	Status
`pyproject.toml`	Removed the vllm dependency	✅ Done
`main.py`	Added `torch._dynamo.config.suppress_errors = True`	✅ Done
`src/myvllm/layers/attention.py`	Removed Triton, added a CPU implementation	✅ Done (partially)
`src/myvllm/layers/sampler.py`	Removed `@torch.compile`, fixed in-place ops	✅ Done
`src/myvllm/engine/model_runner.py`	Switched to gloo, removed CUDA calls	✅ Done (mostly)

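8. Conclusion

This CPU port achieved the following:

Identified all GPU-specific dependencies
Removed the Triton and CUDA Graph dependencies
Ran the prefill phase on CPU (~64 tokens/sec)
Documented the problems and their solutions in detail

The decode phase still has open problems, but the effort produced useful lessons. For CPU ports of similar projects, the key recommendations are:

Design for multi-platform compatibility early
Isolate platform-specific code behind an abstraction layer
Practice test-driven development
Validate incrementally

The full change log and code examples are saved in CPU_PORTING_LESSONS.md for reference.

Author: Claude (Sonnet 4.5)
Date: 2026-02-24
Project: MinivLLM
Topic: Porting a Deep Learning Model to CPU in Practice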