A Close Reading of a Classic: ZeRO

Paper link: https://arxiv.org/pdf/1910.02054

Common forms of training parallelism:

Pipeline Parallelism (PP)
Model Parallelism (MP)
Data Parallelism

So, how can we overcome the limitations of existing solutions and train large models more efficiently? To answer this question, we first analyze the full spectrum of memory consumption of the existing systems on model training and classify it into two parts: 1) For large models, the majority of the memory is occupied by model states which include the optimizer states (such as momentum and variances in Adam [6]), gradients, and parameters. 2) The remaining memory is consumed by activation, temporary buffers and unusable fragmented memory, which we refer to collectively as residual states. We develop ZeRO--- Zero Redundancy Optimizer --- to optimize memory efficiency on both while obtaining high compute and communication efficiency. As these two parts face different challenges, we develop and discuss their solutions correspondingly.

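The paper introduces a new form of data parallelism, ZeRO-DP, which targets the model states. It also proposes ZeRO-R, which optimizes the residual memory consumed by three factors (activations, temporary buffers, and fragmented memory) respectively: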

1) For activations (stored from forward pass in order to perform backward pass), we noticed checkpointing [7] helps but not sufficient for large models. Thus ZeRO-R optimizes activation memory by identifying and removing activation replication in existing MP approaches through activation partitioning. It also offloads activations to CPU when appropriate. 2) ZeRO-R defines appropriate size for temporary buffers to strike for a balance of memory and computation efficiency. 3) We observe fragmented memory during training due to variations in the lifetime of different tensors. Lack of contiguous memory due to fragmentation can cause memory allocation failure, even when enough free memory is available. ZeRO-R proactively m

Finally, here is a summary generated by GPT.

The paper is:

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Authors: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
Affiliation: Microsoft
ZeRO later became a core technology of DeepSpeed.

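This is one of the most important papers in large-model training systems. It proposes ZeRO (Zero Redundancy Optimizer), whose core goal is to remove the GPU memory bottleneck in large-model training.

The summary below goes through: problem → method → the three ZeRO stages → results.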

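I. The problem the paper addresses

As deep learning models grow rapidly (BERT, GPT, and so on), training runs into a core bottleneck:

GPU memory runs out.

In standard data parallelism, every GPU has to keep a full copy of the model states. Training needs three kinds of state:

Parameters
Gradients
Optimizer states (such as Adam's m and v)

Of these, the optimizer states are the largest.

With Adam: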

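memory per parameter ≈
2 bytes (FP16 parameters)
+ 2 bytes (FP16 gradients)
+ 12 bytes (FP32 optimizer states: parameter copy, momentum, variance)

In other words:

Each parameter needs about 16 bytes under mixed-precision (FP16) training with Adam.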

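For example:

Model size	Memory for model states
1B parameters	~16 GB
10B parameters	~160 GB

This is why:

Plain data parallelism cannot train a model whose states do not fit in the memory of a single GPU.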
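The 16 bytes per parameter can be checked with a few lines of arithmetic. The sketch below assumes the breakdown used in the ZeRO paper for mixed-precision Adam: 2 bytes for the FP16 parameter, 2 bytes for the FP16 gradient, and 12 bytes of FP32 optimizer state (parameter copy, momentum, variance). The function name `model_state_bytes` is made up for this illustration, and activations and buffers are not counted.

```python
# Bytes per parameter under mixed-precision Adam (assumed breakdown, see lead-in).
FP16_PARAM_BYTES = 2
FP16_GRAD_BYTES = 2
# FP32 parameter copy + momentum + variance, 4 bytes each.
FP32_OPTIMIZER_BYTES = 4 + 4 + 4


def model_state_bytes(num_params: int) -> int:
    """Bytes of model states held by one GPU under plain data parallelism."""
    per_param = FP16_PARAM_BYTES + FP16_GRAD_BYTES + FP32_OPTIMIZER_BYTES
    return num_params * per_param


if __name__ == "__main__":
    for n in (1_000_000_000, 10_000_000_000):
        # Uses 1 GB = 1e9 bytes, matching the rough figures in the table.
        print(f"{n / 1e9:.0f}B parameters -> {model_state_bytes(n) / 1e9:.0f} GB")
```

Because every data-parallel replica holds all of these states, adding GPUs does not lower the per-GPU figure.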

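II. Limitations of traditional approaches

The paper analyzes three existing approaches:

1 Data Parallelism

Characteristics:

Every GPU keeps the full model
Only the batch is split

Problem:

per-GPU memory = O(model size)

Per-GPU memory does not shrink as GPUs are added, so the model size cannot scale.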

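2 Model Parallelism

Representative systems:

Megatron-LM
Mesh TensorFlow

Method:

Split each layer across multiple GPUs.

Problems:

Heavy communication
Complex to program
Poor efficiency across nodes

Experiment in the paper:

Training a 40B-parameter model with Megatron-LM across two DGX-2 nodes

GPU utilization falls below 5% of hardware peak.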

3 Pipeline Parallelism

Representative system:

GPipe

Problems:

Pipeline bubbles
Complex to implement

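III. Core idea: Zero Redundancy

The paper starts from one key observation:

Data parallelism keeps a large amount of redundant state.

With DP across four GPUs:

GPU0
GPU1
GPU2
GPU3

every GPU stores:

parameters
gradients
optimizer states

Once gradients have been averaged, these are identical copies on every GPU.

The ZeRO idea:

Eliminate the redundant copies in data parallelism

so that each GPU keeps only a part of the states.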

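IV. The three ZeRO stages (the most important part)

ZeRO partitions the three kinds of state step by step.

Stage 1: Optimizer State Partition

Only this is partitioned:

optimizer states

For example:

GPU0 -> optimizer 0-25%
GPU1 -> optimizer 25-50%
GPU2 -> optimizer 50-75%
GPU3 -> optimizer 75-100%

while:

parameters   -> full copy
gradients    -> full copy

Benefit:

Up to 4× less memory for model states (the limit as the number of GPUs grows)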
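The 0-25% / 25-50% split above amounts to giving each rank one contiguous slice of a flattened optimizer-state buffer. A minimal sketch of that bookkeeping follows; the helper `shard_bounds` is hypothetical and is not taken from DeepSpeed, which may partition and pad differently.

```python
def shard_bounds(num_elements: int, world_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) slice of a flat state buffer owned by `rank`."""
    # Ceiling division so every element is owned by exactly one rank;
    # trailing ranks may hold a shorter (or empty) shard when the split is uneven.
    shard_size = (num_elements + world_size - 1) // world_size
    start = min(rank * shard_size, num_elements)
    end = min(start + shard_size, num_elements)
    return start, end


if __name__ == "__main__":
    # Four GPUs sharing the optimizer state of a 1000-element parameter buffer.
    for rank in range(4):
        print(f"GPU{rank} -> optimizer elements {shard_bounds(1000, 4, rank)}")
```

Each rank then allocates momentum and variance only for its own slice, which is where the memory saving of Stage 1 comes from.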

Stage 2: Gradient Partition

Gradients are partitioned as well:

gradients

Now:

State	Partitioned?
parameters	❌
gradients	✅
optimizer	✅

Benefit:

Memory for model states drops by up to:

≈ 8×
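To make the partitioned update concrete, here is a toy single-process simulation of one training step in the style of Stage 2. It is only a sketch under several assumptions: the optimizer is SGD with momentum rather than Adam (to keep the state to one buffer), the parameter count divides evenly by the number of ranks, and the collectives are imitated with plain Python loops. The function name `zero2_step` is invented for this example.

```python
def zero2_step(params, per_rank_grads, momentum, lr=0.1, beta=0.9):
    """Simulate one update where each rank owns one gradient/optimizer shard.

    params: full parameter list (still replicated on every rank in Stage 2).
    per_rank_grads: one full local gradient list per rank.
    momentum: momentum[r] is the optimizer-state shard owned by rank r (updated in place).
    """
    world_size = len(per_rank_grads)
    shard = len(params) // world_size  # assumes an even split
    new_params = list(params)
    for rank in range(world_size):
        for j in range(shard):
            i = rank * shard + j
            # Reduce-scatter: rank `rank` ends up with the averaged gradient
            # for the indices in its own shard only.
            grad = sum(g[i] for g in per_rank_grads) / world_size
            # The owner updates its optimizer-state shard and its parameter shard.
            momentum[rank][j] = beta * momentum[rank][j] + grad
            # All-gather: the updated shard is written back to the replicated copy.
            new_params[i] = params[i] - lr * momentum[rank][j]
    return new_params


if __name__ == "__main__":
    params = [1.0, 2.0, 3.0, 4.0]
    grads = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]  # two ranks
    state = [[0.0, 0.0], [0.0, 0.0]]  # one momentum shard per rank
    print(zero2_step(params, grads, state))
```

No rank ever needs the averaged gradient or the momentum for indices it does not own, so those buffers can be 1/N of the full size.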

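Stage 3: Parameter Partition

Finally, the parameters are partitioned too.

Each GPU keeps only:

1/N parameters
1/N gradients
1/N optimizer

During training:

All-gather a layer's parameters before its forward computation, and again before its backward computation
Reduce-scatter the gradients after backward

As a result:

per-GPU memory for model states = O(model_size / N)

This is ZeRO's key breakthrough.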
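The per-GPU memory of each stage can be written down directly. The sketch below follows the accounting in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 parameters, 2 for FP16 gradients, and K = 12 for FP32 optimizer states, with each partitioned term divided by the number of GPUs N. The function name and the `stage` argument (0 meaning plain data parallelism) are choices made for this illustration.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int, k: int = 12) -> float:
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes) for ZeRO stages 0 to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params      # FP16 parameters
    grads = 2.0 * num_params       # FP16 gradients
    opt = float(k) * num_params    # FP32 optimizer states
    if stage >= 1:
        opt /= num_gpus            # Stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus          # Stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus         # Stage 3 also partitions parameters
    return (params + grads + opt) / 1e9


if __name__ == "__main__":
    # 7.5B parameters on 64 GPUs, the configuration used as an example in the paper.
    for stage in range(4):
        print(f"stage {stage}: {zero_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

For this configuration the four stages come out at roughly 120, 31.4, 16.6, and 1.9 GB per GPU, which shows why only Stage 3 scales as O(model_size / N).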

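V. ZeRO's communication pattern

ZeRO's core communication pattern:

Forward

All-Gather parameters

In Stage 3, parameters are gathered before each layer's computation.

Backward

Reduce-Scatter gradients

Each gradient shard is reduced directly onto the GPU that owns it.

Compared with the traditional:

AllReduce gradients

ZeRO saves memory. Stages 1 and 2 keep the same communication volume as plain data parallelism, while Stage 3 raises it to about 1.5×.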
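One way to see why replacing all-reduce costs nothing extra in Stages 1 and 2 is that an all-reduce can itself be decomposed into a reduce-scatter followed by an all-gather. The toy simulation below imitates the three collectives with plain Python lists in a single process; it assumes the buffer length divides evenly by the number of ranks and is not how a real backend such as NCCL implements them.

```python
def reduce_scatter(per_rank: list[list[float]]) -> list[list[float]]:
    """Sum element-wise across ranks; rank r keeps only shard r of the sum."""
    world_size = len(per_rank)
    length = len(per_rank[0])
    shard = length // world_size  # assumes length is divisible by world_size
    summed = [sum(buf[i] for buf in per_rank) for i in range(length)]
    return [summed[r * shard:(r + 1) * shard] for r in range(world_size)]


def all_gather(shards: list[list[float]]) -> list[list[float]]:
    """Every rank receives the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(per_rank: list[list[float]]) -> list[list[float]]:
    """Every rank receives the full element-wise sum."""
    length = len(per_rank[0])
    summed = [sum(buf[i] for buf in per_rank) for i in range(length)]
    return [list(summed) for _ in per_rank]


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]  # two ranks
    shards = reduce_scatter(grads)
    print(shards)  # each rank holds half of the summed gradient
    print(all_gather(shards) == all_reduce(grads))
```

ZeRO stops after the reduce-scatter for gradients, so each rank stores only its shard, and the later all-gather moves updated parameters instead of gradients.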

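VI. Experimental results

Scale:

400 GPUs

Results:

Models with 100B+ parameters become trainable
15 PFLOPS aggregate throughput
Super-linear speedup

Compared with the state of the art:

Metric	Improvement
Model size	8×
Training performance	10×

ZeRO was also used to train:

the 17B-parameter Turing-NLG model.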

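VII. Historical significance

This paper largely shaped how large models are trained today.

Many later techniques build on it:

ZeRO-Offload

Offloads model states to CPU memory.

ZeRO-Infinity

Extends offloading to NVMe storage.

PyTorch FSDP

Fully Sharded Data Parallel

Essentially an implementation of the same idea as ZeRO Stage 3.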

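VIII. One-sentence summary

ZeRO's core contribution:

By partitioning optimizer states, gradients, and parameters,

it eliminates the redundant memory of data parallelism, so that per-GPU memory for model states scales as O(model_size / N),

which makes:

Trillion parameter training

possible.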
