Distributed Training

Data Parallelism (DP & DDP)

DataParallel

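![](pics/image-20240327152041750.png)

DP is the simpler form of data parallelism. The model is replicated onto several GPUs, each GPU processes one slice of the batch, and once every replica has finished its forward and backward pass the gradients are collected on the main GPU.

Basic flow:

1. Load the model and the data into host memory.
2. Create the DP model.
3. Forward pass of the DP model:
   - split one batch evenly across the devices;
   - replicate the model onto every device;
   - each device now holds a model replica and a slice of the data, and the forward passes run in parallel;
   - gather the outputs from all devices.
4. After each replica's backward pass, gather the gradients on the main device, update the model on the main device, and broadcast the updated model to the other devices.
5. Repeat steps 3-4.

DP uses a single main process with multiple threads, and each thread drives training on one device.
Only one copy of the data exists in host memory, and it is shared by all threads. DP works much like the Parameter Server approach.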
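
The flow above can be mimicked without any GPU. The following is a minimal sketch, assuming a hypothetical one-parameter model `y = w * x` with a squared-error loss; `grad_sum` and `dp_step` are illustrative names, not PyTorch APIs. It walks through scatter, replicate, per-device compute, gather, and update, and lets you check that the gathered gradient matches the full-batch gradient.

```python
def grad_sum(w, shard):
    """Sum over one shard of d/dw (w * x - y)**2 for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)


def dp_step(w_main, batch, num_devices, lr):
    """One simulated DataParallel iteration; returns (new_w, averaged_grad)."""
    # Scatter: split the batch evenly (assumes len(batch) divides evenly).
    shard_size = len(batch) // num_devices
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_devices)]
    # Replicate: every device gets a copy of the main device's parameter.
    replicas = [w_main] * num_devices
    # Each replica runs forward/backward on its own shard
    # (sequentially here; real DP uses one thread per device).
    partial_grads = [grad_sum(w, shard) for w, shard in zip(replicas, shards)]
    # Gather the partial gradients on the main device and average them.
    grad = sum(partial_grads) / len(batch)
    # Update on the main device; the next call replicates the new value again.
    return w_main - lr * grad, grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
new_w, dp_grad = dp_step(w_main=0.0, batch=batch, num_devices=2, lr=0.01)
full_grad = grad_sum(0.0, batch) / len(batch)
print(dp_grad, full_grad, new_w)
```

Because the main device both gathers every gradient and sends out every parameter copy, its traffic grows with the number of devices, which is the bottleneck DDP removes.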

DistributedDataParallel

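Basic flow:

Preparation
- Environment initialization: start one process per GPU and set up inter-process communication. Corresponding code: `init_process_group`.
- Model broadcast: broadcast the model's parameters and buffers from rank 0 to every process. Corresponding code: `model = DDP(model.to(local_rank), device_ids=[local_rank])`.
- Create the reducer, which manages gradient synchronization, and register a gradient-averaging hook on every parameter.

Data preparation
- Load the dataset and create a sampler suited to distributed training (for example `DistributedSampler`) so that different processes do not draw overlapping samples.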

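Training

Forward pass
- Synchronize state (parameters and buffers) across processes. The parameters were already made identical by the initial broadcast, so in practice this step re-broadcasts the buffers, such as BatchNorm running statistics.
- When the DDP argument `find_unused_parameters` is `True`, DDP traverses the autograd graph at the end of `forward`, marks the parameters that were not used, and sets them to ready in advance.

Gradient computation

Outside the reducer:
- Each process starts its own backward pass.
- As soon as a parameter's gradient has been computed, the grad hook registered earlier fires and marks that parameter as ready in the reducer.

Inside the reducer:
- When every parameter in a bucket is ready, the reducer launches an asynchronous all-reduce that averages the gradients of that bucket.
- Once the all-reduce of every bucket has finished, the reducer writes the averaged gradients into `parameter.grad`.

The optimizer then applies the gradients to update the parameters.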
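
The interplay between grad hooks and buckets can be sketched in plain Python. This is a toy model under loud assumptions: `ToyReducer` is a hypothetical class, all ranks are simulated inside one process, and the "all-reduce" runs synchronously, whereas the real reducer launches it asynchronously so that communication overlaps with the rest of the backward pass.

```python
class ToyReducer:
    """Toy stand-in for DDP's reducer; every rank is simulated in one process."""

    def __init__(self, buckets):
        self.buckets = buckets  # list of lists of parameter names
        self.bucket_of = {name: i for i, b in enumerate(buckets) for name in b}
        self.ready = [set() for _ in buckets]
        self.local_grads = {}   # name -> one gradient value per simulated rank
        self.launched = []      # bucket indices, in the order they were reduced
        self.averaged = {}      # name -> averaged gradient

    def grad_hook(self, name, grads_per_rank):
        """Fires when the gradient of `name` has been computed."""
        i = self.bucket_of[name]
        self.local_grads[name] = grads_per_rank
        self.ready[i].add(name)
        # A bucket is reduced as soon as all of its parameters are ready.
        if len(self.ready[i]) == len(self.buckets[i]):
            self._all_reduce(i)

    def _all_reduce(self, i):
        self.launched.append(i)
        for name in self.buckets[i]:
            grads = self.local_grads[name]
            self.averaged[name] = sum(grads) / len(grads)

    def finalize(self):
        """Return the averaged gradients once every bucket has been reduced."""
        if len(self.launched) != len(self.buckets):
            raise RuntimeError("some buckets never became ready")
        return dict(self.averaged)


# Backward produces gradients roughly in reverse order of the forward pass.
reducer = ToyReducer(buckets=[["w3", "w2"], ["w1"]])
reducer.grad_hook("w3", [1.0, 3.0])   # gradient on rank 0 and on rank 1
launched_after_w3 = list(reducer.launched)
reducer.grad_hook("w2", [2.0, 4.0])
launched_after_w2 = list(reducer.launched)
reducer.grad_hook("w1", [0.0, 1.0])
final_grads = reducer.finalize()
print(launched_after_w3, launched_after_w2, final_grads)
```

The first bucket is reduced while `w1` is still waiting for its gradient, which is the point of bucketing: communication starts before the backward pass has finished.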

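Differences between DDP and DP

| | DP | DDP |
| --- | --- | --- |
| Concurrency model | Multi-threaded: (1) limited by the Python GIL; (2) works on a single machine only | Multi-process: (1) supports multiple machines with multiple GPUs |
| Iterative update | Transfers both gradients and parameters: (1) a single optimizer is maintained throughout; (2) gradients are gathered on the main GPU, which updates the parameters; (3) the main GPU broadcasts the parameters to the other GPUs | Transfers gradients only: (1) each process has its own optimizer; (2) each process computes its own gradients; (3) Ring All-Reduce sums and averages the gradients; (4) each process applies the averaged gradients to update its parameters independently |
| Communication efficiency | Communication cost grows linearly with the number of GPUs | With Ring All-Reduce the per-GPU communication volume is roughly constant: $2(N-1)/N$ times the gradient size for $N$ GPUs, which is nearly independent of $N$ |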

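In DDP every process starts from identical parameters (one broadcast is performed at initialization), and every update uses the same averaged gradients, so the model parameters of all processes stay identical throughout training.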
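
To make the Ring All-Reduce claim concrete, here is a single-process simulation. It is a sketch, assuming the standard reduce-scatter followed by all-gather formulation; `ring_all_reduce` is a hypothetical helper, not a library call. It returns the averaged gradients held by each worker together with the number of elements each worker sent.

```python
def ring_all_reduce(worker_grads):
    """Average equal-length gradient lists over a simulated ring of workers."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "sketch assumes the gradient splits into n equal chunks"
    chunk = length // n
    data = [list(g) for g in worker_grads]
    sent = [0] * n  # elements sent by each worker

    def chunk_slice(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n - 1 steps worker r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so the sends are simultaneous.
        messages = [((r - step) % n, data[r][chunk_slice((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(messages):
            dst = (r + 1) % n
            for i, value in enumerate(payload):
                data[dst][c * chunk + i] += value
            sent[r] += chunk

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        messages = [((r + 1 - step) % n, data[r][chunk_slice((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(messages):
            dst = (r + 1) % n
            data[dst][chunk_slice(c)] = payload
            sent[r] += chunk

    averaged = [[value / n for value in d] for d in data]
    return averaged, sent


grads = [[float(r + i) for i in range(8)] for r in range(4)]  # 4 workers, 8 elements
averaged, sent = ring_all_reduce(grads)
print(averaged[0], sent)
```

Each worker sends $2(N-1)$ chunks of size $1/N$ of the gradient, so the per-worker volume stays close to twice the gradient size no matter how many workers join the ring.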

TP (Tensor Parallelism)

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

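Each tensor is split horizontally into multiple chunks, so that each shard lives on its designated GPU instead of the whole tensor residing on a single GPU. During computation every shard is processed separately and in parallel on its own GPU, and the results are synchronized at the end of the step.
![](pics/image-20240325202756320.png)

Parallelizing the MLP

![](pics/image-20241113154251922.png)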

Consider an input $\mathbf{X} \in \mathbb{R}^{(B\times L) \times D}$: the number of rows is the batch size $B$ times the sequence length $L$, and the number of columns is the hidden width $D$.

For convenience let $B=1$, so $\mathbf{X} \in \mathbb{R}^{L \times D}$.
The MLP block is simply two fully connected layers:
- Let the weight of the first layer be $\mathbf A \in \mathbb{R}^{D\times D^\prime}$ ($D^\prime$ is usually $4$ times $D$). It is applied as a matrix multiplication followed by an activation function such as GeLU.
- Let the weight of the second layer be $\mathbf B \in \mathbb{R}^{D^\prime \times D}$. The final output is $\mathbf Z = \sigma(\mathbf X \cdot \mathbf A) \mathbf B$, where $\sigma$ is the activation function.

To keep each intermediate result whole on its GPU and avoid communication between the two matrix multiplications:
- $\mathbf A \in \mathbb{R}^{D\times D^\prime}$ is split along the dimension of size $D^\prime$, i.e. by columns. $\mathbf{X}$ is not split; it is simply replicated so that every GPU holds a full copy.
- $\mathbf B \in \mathbb{R}^{D^\prime \times D}$ is split along the dimension of size $D^\prime$, i.e. by rows.

Split $\mathbf A$ by columns into $n$ parts: $\mathbf A= \begin{bmatrix}\mathbf A_1,\cdots, \mathbf A_n \end{bmatrix}$, where $\mathbf A_i \in \mathbb{R}^{D \times \frac{D^\prime}{n}}$. The matrix multiplication gives:

$$\mathbf X \cdot \mathbf A = \begin{bmatrix}\mathbf X \mathbf A_1,\cdots, \mathbf X\mathbf A_n \end{bmatrix} , \quad \mathbf X \mathbf A_i \in \mathbb{R}^{L\times \frac{D^\prime}{n}}$$

Because GeLU acts element-wise, each block can be fed through it independently:
$$\begin{bmatrix}\mathbf Y_1,\cdots, \mathbf Y_n\end{bmatrix} = \begin{bmatrix}\operatorname{GeLU}\left(\mathbf X \mathbf A_1\right),\cdots, \operatorname{GeLU} \left(\mathbf X\mathbf A_n \right)\end{bmatrix} , \quad \mathbf Y_i \in \mathbb{R}^{L\times \frac{D^\prime}{n}}$$
Split $\mathbf B$ by rows into $n$ parts: $\mathbf B= \begin{bmatrix}\mathbf B_1 \\ \vdots \\ \mathbf B_n \end{bmatrix}$, where $\mathbf B_i \in \mathbb{R}^{\frac{D^\prime}{n}\times D}$. The matrix multiplication gives

$$\mathbf Z = \begin{bmatrix}\mathbf Y_1,\cdots, \mathbf Y_n\end{bmatrix} \begin{bmatrix}\mathbf B_1 \\ \vdots \\ \mathbf B_n \end{bmatrix} = \sum_{i=1}^{n} \mathbf Y_i \mathbf B_i = \sum_{i=1}^{n}\mathbf Z_i , \quad \mathbf Z \in \mathbb{R}^{L\times D}$$
With this scheme an MLP of any depth can be processed, and the GPUs only need to synchronize (one all-reduce that sums the partial results $\mathbf Z_i$) after each column-split/row-split pair.
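
The identity $\mathbf Z = \sum_i \operatorname{GeLU}(\mathbf X \mathbf A_i)\,\mathbf B_i$ can be checked numerically. The sketch below uses plain Python lists so it runs anywhere; the helper names (`matmul`, `split_cols`, `split_rows`, `mlp_tensor_parallel`) and the tiny matrices are illustrative assumptions. $\mathbf A$ is split along its $D^\prime$ dimension (its columns), $\mathbf B$ along its $D^\prime$ dimension (its rows), and the final loop stands in for the all-reduce.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gelu(m):
    """Exact (erf-based) GeLU applied element-wise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def split_cols(m, n):
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]


def split_rows(m, n):
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]


def mlp_serial(x, a, b):
    return matmul(gelu(matmul(x, a)), b)


def mlp_tensor_parallel(x, a, b, n):
    """Each simulated GPU i holds A_i (column shard) and B_i (row shard)."""
    partials = [matmul(gelu(matmul(x, a_i)), b_i)
                for a_i, b_i in zip(split_cols(a, n), split_rows(b, n))]
    z = partials[0]
    for p in partials[1:]:  # stands in for the all-reduce (sum) across GPUs
        z = [[u + v for u, v in zip(rz, rp)] for rz, rp in zip(z, p)]
    return z


x = [[1.0, -2.0], [0.5, 3.0]]                                # L = 2, D = 2
a = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]           # D x D' with D' = 4
b = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.5], [-0.5, 0.25]]      # D' x D
z_serial = mlp_serial(x, a, b)
z_parallel = mlp_tensor_parallel(x, a, b, n=2)
print(z_serial, z_parallel)
```

No communication is needed between the two matrix multiplications because GeLU only ever sees complete columns of $\mathbf X \mathbf A$; splitting $\mathbf A$ by rows instead would require a sum before the non-linearity.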

Parallelizing Self-Attention

![](pics/image-20241113162947135.png)

Each head computes independently

Consider an input $\mathbf{X} \in \mathbb{R}^{(B\times L) \times D}$: the number of rows is the batch size $B$ times the sequence length $L$, and the number of columns is the hidden width $D$.

For convenience let $B=1$, so $\mathbf{X} \in \mathbb{R}^{L \times D}$.

In self-attention the input $\mathbf{X}$ is used three times, once each to produce the query, key, and value matrices $\mathbf Q$, $\mathbf K$, $\mathbf V$.
In multi-head attention each head has dimension $\frac{D}{h}$; assume $h=2$. Within each head, the $\mathbf Q$ vector of every token in $\mathbf{X}$ takes a scaled dot product with the $\mathbf K$ vectors of its context, followed by a softmax, to obtain the attention scores (weights). These are then multiplied by $\mathbf V$, giving an output of shape $L \times \frac{D}{h}$.
The heads are computed independently and in parallel, which means one head can be placed on GPU 0 and the other on GPU 1, with a single all-reduce at the end to combine the results of the heads.
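
Why one all-reduce is enough can be written out for $h=2$. This is a sketch that assumes, following the Megatron-LM paper rather than anything stated above, an output projection $\mathbf W_O \in \mathbb{R}^{D \times D}$ split by rows to match the heads; the symbols $\mathbf H_i$ and $\mathbf W_{O,i}$ are introduced here for illustration only.

$$\mathbf H_i = \operatorname{softmax}\!\left(\frac{\mathbf Q_i \mathbf K_i^{\top}}{\sqrt{D/h}}\right)\mathbf V_i \in \mathbb{R}^{L \times \frac{D}{h}}, \qquad \mathbf Z = \begin{bmatrix}\mathbf H_1, \mathbf H_2\end{bmatrix}\begin{bmatrix}\mathbf W_{O,1} \\ \mathbf W_{O,2}\end{bmatrix} = \mathbf H_1 \mathbf W_{O,1} + \mathbf H_2 \mathbf W_{O,2}, \qquad \mathbf W_{O,i} \in \mathbb{R}^{\frac{D}{h} \times D}$$

GPU $i$ computes $\mathbf H_i \mathbf W_{O,i}$ entirely from its own shards, and the all-reduce sums the two $L \times D$ partial products, exactly as in the row-split second layer of the MLP.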

Each Transformer layer therefore needs $2$ all-reduce operations in the forward pass and $2$ in the backward pass (one for the MLP and one for self-attention), so TP requires a very fast interconnect between devices.

For this reason, running TP across multiple nodes is not recommended.

If a node has $4$ GPUs, a TP degree of at most $4$ is best. If a TP degree of $8$ is needed, use nodes with at least $8$ GPUs.

PP (Pipeline Parallelism)

GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism

The model is split vertically (that is, by layer) across multiple GPUs,

so that only one or a few model layers are placed on a single GPU.
Each GPU processes a different stage of the pipeline in parallel and works on a small part of the batch.

![](pics/image-20240326092738269.png)

The network is divided into $4$ blocks, each placed on one GPU (different colors denote different GPUs). This gives the $4$ forward paths $F_0$, $F_1$, $F_2$, $F_3$ and the backward paths $B_3$, $B_2$, $B_1$, $B_0$ in reverse order.
![](pics/image-20241113165059253.png)

Naive PP

At any point in time only one device is computing; when it finishes, it sends its result to the next device.
![](pics/image-20241113164952642.png)

PP

![](pics/image-20241113165156140.png)

PP introduces a new hyperparameter to tune, called chunks. It defines how many chunks of data are sent in sequence through the same pipeline stage. In the figure, $\text{chunks} = 4$.

GPU 0 runs the same forward path on chunks 0, 1, 2 and 3 ($F_{0,0}$, $F_{0,1}$, $F_{0,2}$, $F_{0,3}$) and then waits.

Once the other GPUs have finished their work, GPU 0 starts working again and runs the backward path for chunks 3, 2, 1 and 0 ($B_{0,3}$, $B_{0,2}$, $B_{0,1}$, $B_{0,0}$).

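Note that conceptually this is the same as gradient accumulation steps (GAS). PyTorch calls it chunks, whereas DeepSpeed calls it GAS.

The main idea of **gradient accumulation** is not to update the model parameters right after computing the gradient of one batch, but to accumulate the gradients of several batches before updating. This simulates a larger batch without increasing GPU memory consumption.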
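
That accumulating over micro-batches reproduces the large-batch gradient can be checked directly. The sketch below assumes a hypothetical one-parameter model `y = w * x` with a mean squared-error loss and equally sized micro-batches; `mean_grad` and `accumulated_grad` are illustrative names.

```python
def mean_grad(w, batch):
    """Mean over the batch of d/dw (w * x - y)**2."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, chunks):
    """Accumulate micro-batch gradients; parameters are not updated in between."""
    size = len(batch) // chunks  # assumes the batch divides evenly
    total = 0.0
    for i in range(chunks):
        micro_batch = batch[i * size:(i + 1) * size]
        # Dividing by `chunks` turns the sum of micro-batch means
        # into the mean over the whole batch.
        total += mean_grad(w, micro_batch) / chunks
    return total


batch = [(float(x), 2.0 * x) for x in range(1, 9)]
full = mean_grad(0.5, batch)
accumulated = accumulated_grad(0.5, batch, chunks=4)
print(full, accumulated)
```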

Because of chunks, PP introduces the concept of **micro-batches (MBS)**.
- DP splits the global batch size into mini-batches.

  If $\text{dp\_degree} = 4$, a global $\text{batch\_size}_{\text{all}}=1024$ is split into $4$ mini-batches, each of size $\text{batch\_size}_{\text{dp}}=1024/4 = 256$.
- If $\text{chunks} = 32$, the resulting $\text{micro\_batch\_size} = 256/32 = 8$.
- Each pipeline stage processes one micro-batch at a time.
- The global batch size of a DP + PP setup is computed as $\text{mbs} \times \text{chunks} \times \text{dp\_degree}$ ($8 \times 32 \times 4 = 1024$).

Each mini-batch is thus further divided into smaller micro-batches and processed in a pipelined fashion: a device processes one micro-batch, sends the result to the downstream device, and immediately starts on the next micro-batch. This reduces the bubble, i.e. the time during which devices sit idle.
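
How much the bubble shrinks can be estimated with a small schedule simulator. This is a sketch under a simplifying assumption that is not from the text: every forward and every backward micro-step takes exactly one time slot. Under that assumption the idle fraction of a GPipe-style schedule with $p$ stages and $m$ chunks is $(p-1)/(m+p-1)$. The function names are illustrative.

```python
def gpipe_schedule(num_stages, chunks):
    """Return a stages x time grid of 'F{stage},{chunk}', 'B{stage},{chunk}' or '.'."""
    forward_slots = chunks + num_stages - 1
    grid = [["."] * (2 * forward_slots) for _ in range(num_stages)]
    for stage in range(num_stages):
        for chunk in range(chunks):
            # A chunk reaches stage `stage` after passing the earlier stages.
            grid[stage][stage + chunk] = f"F{stage},{chunk}"
            # Backward starts after all forwards, from the last stage and the
            # last chunk, and moves towards stage 0 and chunk 0.
            t = forward_slots + (num_stages - 1 - stage) + (chunks - 1 - chunk)
            grid[stage][t] = f"B{stage},{chunk}"
    return grid


def bubble_fraction(grid):
    """Fraction of (device, time slot) cells in which the device is idle."""
    idle = sum(cell == "." for row in grid for cell in row)
    return idle / (len(grid) * len(grid[0]))


naive = gpipe_schedule(num_stages=4, chunks=1)
pipelined = gpipe_schedule(num_stages=4, chunks=4)
gpu0_work = [cell for cell in pipelined[0] if cell != "."]
print(bubble_fraction(naive), bubble_fraction(pipelined))
print(gpu0_work)
```

With `chunks=1` the schedule degenerates to naive PP, where three of the four devices are idle at any moment; raising `chunks` relative to the number of stages shrinks the idle share further.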

ZeRO

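Data parallelism stores the Model States with a great deal of redundancy. Every GPU must hold an identical copy of the large language model, including the model parameters and the optimizer states. Yet while a GPU is computing one layer, the parameters and optimizer states of the other layers take no part in the computation, which results in severe memory redundancy.

In essence, ZeRO is a deep optimization of this redundant memory on top of data parallelism.

ZeRO keeps only a partition of the model parameters and optimizer states on each GPU, fetches the rest from the other GPUs when it is needed for computation, and frees the corresponding memory as soon as it has been used.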

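Memory usage

GPU memory consumption in large-scale training can be divided into two parts: Model States and Residual States.

Model States

- Optimizer States: the data the optimizer needs when applying gradient updates, such as the momentum in SGD and, in mixed-precision training, the float32 master parameters.
- Gradients: the gradient information produced by backpropagation, which determines the direction of the parameter update.
- Model Parameters: the model's parameters, i.e. the information "learned" from the data over the course of training.

With conventional DDP every process trains with the same parameters. Every process also holds a complete copy of the Optimizer States, which likewise takes up a large amount of GPU memory.

In the mixed-precision setting, let the number of model parameters be $\mathbf\Phi$, so the gradient also has $\mathbf\Phi$ elements. The model parameters (fp16), the gradients (fp16), and the optimizer states (fp32) together occupy, in bytes:
$$(2 +2+K)\mathbf\Phi$$
Here $K$ is the number of bytes of optimizer state per parameter. For Adam, $K=12$: an fp32 master copy of the parameters, the momentum, and the variance, at $4$ bytes each.

Residual States

GPU memory used by everything other than the model states, including activations, assorted temporary buffers, and unusable memory fragments (fragmentation).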

ZeRO-DP (Model States)

![](pics/DeepSpeed-Image-1.png)

ZeRO has three levels, corresponding to progressively deeper partitioning of the Model States. In the figure, $P_{os}$, $P_{os+g}$ and $P_{os+g+p}$ denote ZeRO-1, ZeRO-2 and ZeRO-3 respectively.

ZeRO-1 [$P_{os}$]: partition the Optimizer States

Parameters and gradients are still kept in full on every GPU. The model-state memory per GPU is then $2\mathbf\Phi+2\mathbf\Phi+ \frac{K\mathbf\Phi}{N_d}$, where $N_d$ is the number of GPUs (the data-parallel degree). When $N_d$ is large this approaches $4\mathbf\Phi$.

ZeRO-2 [$P_{os+g}$]: partition the Optimizer States and Gradients

The gradients are partitioned as well, while the parameters are still kept in full on every GPU. The model-state memory per GPU is then $2\mathbf\Phi+ \frac{(2+K)\mathbf\Phi}{N_d}$. When $N_d$ is large this approaches $2\mathbf\Phi$.

ZeRO-3 [$P_{os+g+p}$]: partition the Optimizer States, Gradients and Parameters

The parameters are partitioned too. The model-state memory per GPU is then $\frac{(2+2+K)\mathbf\Phi}{N_d}$. When $N_d$ is large this approaches $0$.
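
The three formulas are easy to evaluate for a concrete setup. The sketch below is illustrative only: `zero_model_state_bytes` is a hypothetical helper, stage 0 denotes plain data parallelism, and the example numbers (a 7.5B-parameter model, $K=12$ as for Adam, 64 GPUs) are assumptions chosen for the arithmetic, not measurements.

```python
def zero_model_state_bytes(num_params, k, n_d, stage):
    """Per-GPU bytes of model states for ZeRO stage 0 (plain DP) through 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params      # fp16 parameters
    grads = 2.0 * num_params       # fp16 gradients
    opt_states = k * num_params    # fp32 optimizer states
    if stage >= 1:
        opt_states /= n_d          # ZeRO-1 partitions the optimizer states
    if stage >= 2:
        grads /= n_d               # ZeRO-2 also partitions the gradients
    if stage >= 3:
        params /= n_d              # ZeRO-3 also partitions the parameters
    return params + grads + opt_states


phi, k, n_d = 7.5e9, 12, 64
per_gpu_gb = [zero_model_state_bytes(phi, k, n_d, stage) / 1e9 for stage in range(4)]
for stage, gb in enumerate(per_gpu_gb):
    print(f"stage {stage}: {gb:.2f} GB per GPU")
```

Activations and other residual states are not included, so real per-GPU usage is higher than these figures.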

ZeRO-1 and ZeRO-2 add no communication compared with plain data parallelism. With ZeRO-3, however, every time the weights $\mathbf W$ are needed they have to be fetched from the other GPUs, which adds communication (about 50% more).

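ZeRO vs. model parallelism

ZeRO has the form of model parallelism but the substance of data parallelism.

In model parallelism, each GPU uses only the slice of $\mathbf W$ that it maintains during forward and backward.

That is, for the same input X, each GPU computes one part of the model, and the results are aggregated in some way at the end.
In ZeRO, the slices of $\mathbf W$ maintained on the different GPUs have to be gathered for forward and backward.

That is, the computation is still carried out with the complete W: different inputs X, the full parameters W, and an aggregation at the end.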

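ZeRO-R (Residual States)

$P_\alpha$: Partitioned Activation Checkpointing

Activations are stored during the forward pass so that the backward pass can compute gradients quickly, without redoing the forward computation.

By combining partitioning with checkpointing, the storage of activations can be configured flexibly. Each GPU keeps only part of the activations and gathers the rest from the other GPUs when they are needed. Note that activations usually occupy far more GPU memory than the model itself, and the communication volume is correspondingly large, so this setting has to be tuned flexibly and validated by experiment.

$C_B$: Constant Size Buffer (temporary buffers)

Training often creates temporary buffers of varying sizes, for example when running AllReduce on gradients.

**The remedy is to allocate a fixed-size buffer in advance** and stop creating buffers dynamically during training. When the data to transfer is small, several pieces are bucketed together and sent in one go, which improves efficiency.

The fixed-size buffer serves two purposes:

- Better bandwidth utilization. As the number of GPUs grows, the number of communication rounds between GPUs grows too, while the amount of data per round may shrink (the total volume does not change). Small slices cannot make good use of the bandwidth, so the buffer accumulates data until it reaches a certain size before communicating.
- Predictable memory usage. The amount of data accumulated before each communication is a known constant, which makes it easier to estimate memory consumption and communication time during training.

$M_D$: Memory Defragmentation

A major cause of GPU memory fragmentation is that, with gradient checkpointing enabled, the activations that are not saved are continually created and destroyed.

The remedy is to pre-allocate a contiguous block of GPU memory that holds the long-lived model states and checkpointed activations; the remaining memory is used for the discarded activations that are created and destroyed dynamically.

A mechanism is also provided to consolidate fragmented memory into contiguous regions. This prevents allocation requests from failing when the total free memory is sufficient but no contiguous region is large enough.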

ZeRO-Offload

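The forward and backward passes are compute-intensive, so everything they need, such as the fp16 parameters W and the activations, stays on the GPU.
The update step is computationally light, so everything it needs is placed on the CPU: for example the fp32 W, the fp32 optimizer states, and the fp16 gradients.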

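Mixed Precision

Mixed Precision Training

Mixed-precision training means mixing single-precision float32 ($32$ bits, $4$ bytes) with half-precision float16 ($16$ bits, $2$ bytes).

Half precision

Advantages of half precision:

Lower memory usage: a typical model stored in fp16 needs only half the memory.
- The model takes less memory, so a larger batch size can be used for training.
- During training the communication volume (especially with multiple GPUs, or multiple machines with multiple GPUs) drops substantially, which cuts waiting time and speeds up data movement.

Faster computation:
- Many current GPUs are optimized for fp16 computation. The paper notes that on recent GPUs half-precision throughput can be 2-8 times that of single precision.

Problems with half precision

- Overflow / underflow: for deep learning the main problem is underflow. Late in training, gradients such as those of the activation functions become very small, and they become smaller still once multiplied by the learning rate.
- Rounding error: when a small value is added to a much larger one, the result can round back to the larger value, so the small value is lost.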
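
Both problems can be reproduced with the standard library alone. The sketch below relies on the half-precision format code `"e"` of Python's `struct` module to round a value to fp16; `to_fp16` is a hypothetical helper, and mapping out-of-range values to infinity is an assumption made to imitate what hardware does (`struct` itself raises `OverflowError`).

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; imitate hardware here.
        return float("inf") if x > 0 else float("-inf")


underflow = to_fp16(1e-8)        # below the smallest fp16 subnormal, 2**-24
overflow = to_fp16(70000.0)      # above the largest finite fp16 value, 65504
rounding = to_fp16(1.0 + 1e-4)   # neighbouring fp16 values near 1.0 are 2**-10 apart
print(underflow, overflow, rounding)
```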

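Mixed-precision training (Mixed Precision)

fp16 is used for multiplication and storage, and fp32 is used for addition. This reduces the rounding error introduced by the additions and avoids a loss of accuracy.

During a matrix multiplication in the model, the intermediate sums are accumulated in fp32,
and the fp32 result is then converted to fp16 for storage.

![](pics/image-20240325171701034-17315482395991.png)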

FP32 master copy of the weights

This mainly addresses the rounding-error problem.

Weights, activations, gradients and other data are all stored in fp16 during training.
- Keeping an extra fp32 copy of the weights does increase memory usage during training.

  In practice, however, most of the memory is taken up by activations, particularly when the batch size is large. Activations are saved mainly because they are needed during backpropagation. So as long as the activations are stored in fp16, the overall memory footprint is still roughly half that of fp32 training.

An fp32 copy of the weights is kept and used for the update.
- The weight update is $\text{weight}_t= \text{weight}_{t-1} - \text{lr} \times \text{gradients}$. In deep models $\text{lr} \times \text{gradients}$ is often very small, so carrying out this operation in fp16 is likely to incur a rounding error that makes the update ineffective.
- Copying the weights to fp32 and making sure the whole update is carried out in fp32 avoids this.
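
The effect of the fp32 master copy shows up in a few lines. This is a toy sketch: `to_fp16` and `to_fp32` are hypothetical helpers that round through Python's `struct` formats `"e"` and `"f"`, and the learning rate, gradient, and step count are made-up values chosen so that $\text{lr} \times \text{gradients} = 10^{-4}$.

```python
import struct


def to_fp16(x):
    """Round to the nearest half-precision value (inputs here stay in range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


lr, grad, steps = 1e-3, 0.1, 100   # lr * grad = 1e-4, well below fp16 spacing near 1.0

w_fp16_only = to_fp16(1.0)         # weight stored and updated in fp16
master_fp32 = to_fp32(1.0)         # fp32 master copy of the same weight
for _ in range(steps):
    # fp16-only update: the small step is rounded away every time.
    w_fp16_only = to_fp16(w_fp16_only - to_fp16(lr * grad))
    # Master-copy update: accumulate in fp32, then cast for the next forward pass.
    master_fp32 = to_fp32(master_fp32 - lr * grad)
    w_fp16_from_master = to_fp16(master_fp32)

print(w_fp16_only, master_fp32, w_fp16_from_master)
```

The fp16-only weight never moves, while the fp32 master copy accumulates the hundred small steps and the fp16 working copy derived from it eventually changes.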

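Loss Scaling (Loss Scale)

This mainly addresses fp16 underflow.

Late in training, gradients (especially those in the flat regions of activation functions) become very small, and representing them in fp16 easily underflows.

Loss Scale

The computed loss is multiplied by a scale factor. Because of the chain rule, scaling the loss also scales every gradient, which is cheaper than scaling each gradient individually. The scaled gradients are shifted into the range that fp16 can represent.

**Before backpropagation**, the loss gradient (dLoss) is manually multiplied by $2^k$, so that the intermediate values obtained during backpropagation (the activation gradients) do not underflow.

After backpropagation, the weight gradients are divided by $2^k$ to restore their true values.

This way the scaled gradients can be stored in fp16 throughout. Only at update time are they converted to fp32, and the scale is removed at the same time.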
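
A two-factor chain rule is enough to see loss scaling at work. The sketch below is illustrative: `to_fp16` is a hypothetical helper built on Python's `struct` format `"e"`, the two local derivatives are made-up values, $k = 16$ is an arbitrary choice of scale, and an ordinary Python float stands in for the fp32 value used during unscaling.

```python
import struct


def to_fp16(x):
    """Round to the nearest half-precision value (inputs here stay in range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


scale = 2.0 ** 16        # the 2**k factor; k = 16 is an illustrative choice
upstream = 1e-4          # dLoss/dy, a made-up value
local = 1e-4             # dy/dw, a made-up value; the true gradient is 1e-8

# Without scaling: the fp16 product underflows to zero.
grad_plain = to_fp16(to_fp16(upstream) * to_fp16(local))

# With scaling: scale dLoss before backpropagation, keep everything in fp16.
grad_scaled = to_fp16(to_fp16(scale * upstream) * to_fp16(local))

# After backpropagation: convert to higher precision and remove the scale.
grad_recovered = grad_scaled / scale

print(grad_plain, grad_scaled, grad_recovered)
```

Choosing the scale as a power of two means that multiplying and dividing by it changes only the exponent, so the scaling itself adds no rounding error as long as nothing overflows.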