GPU Memory Analysis

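During fine-tuning, a model's GPU memory usage consists of four main parts: the model parameters, the parameter gradients, the optimizer state, and the intermediate results (activations).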

For a model with 6B parameters stored in FP32 (4 bytes per parameter), the parameters alone occupy:
$\frac{6 \times 10^9 \times 4\ (\text{FP32})}{1024^3} \approx 22\ \text{GB}$

Taking the model parameters as the baseline, the gradients occupy the same amount of memory as the parameters.

The optimizer is usually Adam, whose core update equations are:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

Both m and v must be stored, and each has the same size as the parameter gradients, so the optimizer needs twice the memory of the parameters.
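To make the "two extra buffers" point concrete, here is a minimal sketch of one Adam step over a flat list of parameters. The function name `adam_step` and the hyperparameter defaults are illustrative assumptions, and the bias correction and the final parameter update are the standard Adam steps rather than something spelled out in the formulas above.

```python
def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of floats; params, m and v are updated in place."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # first moment m_t
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # second moment v_t
        m_hat = m[i] / (1 - beta1 ** t)              # bias correction (standard Adam)
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)


params = [1.0, -2.0, 0.5]
grads = [0.1, -0.2, 0.3]
m = [0.0] * len(params)  # one first-moment value per parameter
v = [0.0] * len(params)  # one second-moment value per parameter
adam_step(params, grads, m, v, t=1)
```

The optimizer state `m` and `v` together hold exactly two values per parameter, which is where the factor of two comes from.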

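In addition, the intermediate results produced during the forward pass must be kept in GPU memory so that gradients can be computed during backpropagation. Each intermediate result has shape $(Batch, SeqLen, Dim)$.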
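The arithmetic in this section can be checked with a few lines of Python. This is only a rough sketch: the function names are hypothetical, everything is assumed to be FP32, and the batch size, sequence length, and hidden dimension in the activation example are made-up values. The number of intermediate results a real model keeps depends on its architecture, so only a single tensor is sized here.

```python
GB = 1024 ** 3  # the text's "GB", i.e. 1024^3 bytes


def training_state_gb(n_params, bytes_per_value=4):
    """Memory for parameters, gradients and Adam state (m and v), in GB."""
    params = n_params * bytes_per_value / GB
    grads = params          # same size as the parameters
    optimizer = 2 * params  # m and v
    return {
        "params": params,
        "grads": grads,
        "optimizer": optimizer,
        "total": params + grads + optimizer,
    }


def activation_gb(batch, seq_len, dim, bytes_per_value=4):
    """Memory for ONE intermediate result of shape (batch, seq_len, dim), in GB."""
    return batch * seq_len * dim * bytes_per_value / GB


est = training_state_gb(6e9)
for name, size in est.items():
    print(f"{name:>9}: {size:6.2f} GB")

# Hypothetical shape, chosen only for illustration.
print(f"one activation: {activation_gb(8, 2048, 4096):.2f} GB")
```

Under these assumptions the parameters, gradients, and optimizer state of a 6B model add up to about four times the 22 GB baseline, before counting any intermediate results.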

Collective Operations

To save GPU memory, the model or the data can be distributed across multiple GPUs. The GPUs then communicate through the following collective operations.

Broadcast

The Broadcast operation copies an N-element buffer on the root rank to all ranks.

In other words, Broadcast copies the data on one GPU to every GPU.
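Broadcast can be simulated on the CPU with plain Python lists, where `buffers[r]` stands in for the buffer on rank `r`. The helper `broadcast` is a hypothetical illustration of the semantics, not a call from any communication library.

```python
def broadcast(buffers, root=0):
    """Return new per-rank buffers in which every rank holds a copy of the root's data."""
    return [list(buffers[root]) for _ in buffers]


ranks = [[1, 2, 3], [0, 0, 0], [0, 0, 0]]  # only rank 0 has meaningful data
result = broadcast(ranks, root=0)
print(result)
```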

AllReduce, Reduce, ReduceScatter

The AllReduce operation is performing reductions on data (for example, sum, min, max) across devices and writing the result in the receive buffers of every rank.

The Reduce operation is performing the same operation as AllReduce, but writes the result only in the receive buffers of a specified root rank.

The ReduceScatter operation performs the same operation as the Reduce operation, except the result is scattered in equal blocks between ranks, each rank getting a chunk of data based on its rank index.

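AllReduce aggregates the data on all GPUs (for example by summing, or by taking the maximum or minimum) and writes the result to every GPU.

Reduce writes the result to a single GPU only.

ReduceScatter splits the result into equal blocks and distributes them across the GPUs, so each GPU holds one block.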
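The difference between the three reductions is easiest to see side by side. The sketch below simulates them with Python lists, one list per rank; the function names are hypothetical, and `reduce_scatter` assumes the buffer length is divisible by the number of ranks.

```python
def _elementwise(buffers, op):
    """Apply op across ranks at each position, e.g. sum of element 0 on all ranks."""
    return [op(values) for values in zip(*buffers)]


def all_reduce(buffers, op=sum):
    """Every rank receives the full reduced buffer."""
    reduced = _elementwise(buffers, op)
    return [list(reduced) for _ in buffers]


def reduce_to_root(buffers, root=0, op=sum):
    """Only the root rank receives the reduced buffer; other ranks get None here."""
    reduced = _elementwise(buffers, op)
    return [reduced if r == root else None for r in range(len(buffers))]


def reduce_scatter(buffers, op=sum):
    """Rank r receives the r-th equal block of the reduced buffer."""
    reduced = _elementwise(buffers, op)
    k = len(buffers)
    chunk = len(reduced) // k  # assumes len(reduced) is divisible by k
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(k)]


buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
print(all_reduce(buffers))          # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_to_root(buffers))      # [[11, 22, 33, 44], None]
print(reduce_scatter(buffers))      # [[11, 22], [33, 44]]
print(all_reduce(buffers, op=max))  # [[10, 20, 30, 40], [10, 20, 30, 40]]
```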

AllGather

The AllGather operation gathers N values from k ranks into an output of size k*N, and distributes that result to all ranks.

AllGather collects the data from all GPUs and writes the combined result to every GPU.
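In the same list-based style, AllGather can be sketched as concatenating the per-rank buffers in rank order and handing the result to every rank. `all_gather` is again a hypothetical helper used only to show the semantics.

```python
def all_gather(buffers):
    """Concatenate the N values of each of the k ranks; every rank gets all k*N values."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]


buffers = [[1, 2], [3, 4], [5, 6]]  # k = 3 ranks, N = 2 values each
result = all_gather(buffers)
print(result[0])  # [1, 2, 3, 4, 5, 6]
```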

Data Parallelism

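Data parallelism splits the data into several parts and loads them onto different nodes for computation.

The data-parallel workflow is as follows: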

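A parameter server holds the model parameters.
The parameters are copied to the different devices, forming several replicas. Each replica processes a portion of the data and runs the forward and backward passes.
The gradients from all devices are combined with a Reduce operation to obtain the final gradient, which is then used to update the model parameters on the parameter server.
During the backward pass, the Reduce for a layer can start as soon as that layer's gradients are computed, which improves parallelism.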
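The steps above can be sketched with a toy one-parameter model $y = w x$ trained by least squares. Everything here is a simplification for illustration: the replicas are simulated sequentially on the CPU, plain gradient descent stands in for the optimizer, the function names are hypothetical, and the per-layer overlap of Reduce with the backward pass is not modeled.

```python
def grad_sum(w, shard):
    """Sum of d/dw (w*x - y)^2 over the (x, y) pairs in one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)


def data_parallel_step(server_w, data, n_replicas, lr=0.01):
    # 1. Copy the server's parameter to each replica and give each a slice of the data.
    shards = [data[r::n_replicas] for r in range(n_replicas)]
    replica_ws = [server_w for _ in range(n_replicas)]
    # 2. Each replica runs forward and backward on its own shard.
    local_grads = [grad_sum(w, shard) for w, shard in zip(replica_ws, shards)]
    # 3. Reduce: sum the local gradients on the server and update its parameter.
    grad = sum(local_grads) / len(data)
    return server_w - lr * grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, n_replicas=2)
print(w)  # close to 2.0
```

Because the reduced gradient equals the full-batch gradient, one step with two replicas gives the same update as one step on a single device.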

Distributed Data Parallelism

Distributed data parallelism has no parameter server. Its workflow is as follows:

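Every replica keeps its own copy of the model parameters but processes only a portion of the data, running the forward and backward passes on it.
Once every device has its gradients, an AllReduce operation writes the aggregated gradient to all devices, and each device updates its parameters using its own optimizer and that gradient.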
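The same toy model $y = w x$ shows how the workflow changes without a parameter server. In this sketch, which rests on several assumptions, each replica owns a parameter copy, AllReduce is simulated by averaging the local gradients, plain gradient descent stands in for each device's optimizer, and the shards are of equal size so that the average of local means equals the global mean. All function names are hypothetical.

```python
def grad_mean(w, shard):
    """Mean of d/dw (w*x - y)^2 over the (x, y) pairs in one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Every rank receives the average of all ranks' values."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


def ddp_step(replica_ws, shards, lr=0.01):
    # Each replica computes gradients on its own shard with its own parameter copy.
    local_grads = [grad_mean(w, shard) for w, shard in zip(replica_ws, shards)]
    # AllReduce: every replica ends up holding the same aggregated gradient.
    synced = all_reduce_mean(local_grads)
    # Each replica applies the update locally; no parameter server is involved.
    return [w - lr * g for w, g in zip(replica_ws, synced)]


shards = [[(1.0, 2.0), (3.0, 6.0)], [(2.0, 4.0), (4.0, 8.0)]]  # y = 2x, two equal shards
ws = [0.0, 0.0]  # replicas start from identical parameters
for _ in range(200):
    ws = ddp_step(ws, shards)
print(ws)  # both replicas stay identical and approach 2.0
```

Since all replicas start from the same parameters and apply the same gradient, their copies never diverge, which is why no central copy is needed.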

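In distributed data parallelism, the memory usage on each device is shown in the figure:

Each device still has to store the model parameters, the gradients, and the optimizer state.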

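Model Parallelism

As models keep growing, it becomes increasingly difficult for a single device to hold the model parameters, gradients, and optimizer state. Deep learning consists mostly of matrix computations, and matrix computations can be carried out block by block, so the model parameters can be split into several parts that are each computed separately, reducing the memory usage per device.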
$y_A = W_{A \times B} X_B \\ = \left[ W^{(1)}_{\frac{A}{n} \times B}; W^{(2)}_{\frac{A}{n} \times B}; W^{(3)}_{\frac{A}{n} \times B} \right] X_B \\ = \left[ W^{(1)}_{\frac{A}{n} \times B} X_B; W^{(2)}_{\frac{A}{n} \times B} X_B; W^{(3)}_{\frac{A}{n} \times B} X_B \right]$
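The row-wise split in the equation can be verified numerically. The following sketch uses nested Python lists as matrices, with $A = 6$, $B = 2$, and $n = 3$ chosen arbitrarily; `matvec` and `split_rows` are hypothetical helpers, and the "devices" are just loop iterations. It assumes $A$ is divisible by $n$.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def split_rows(W, n):
    """Split W (A x B) into n blocks of shape (A/n) x B; assumes A is divisible by n."""
    rows = len(W) // n
    return [W[i * rows:(i + 1) * rows] for i in range(n)]


W = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]  # A = 6, B = 2
x = [1, -1]

blocks = split_rows(W, n=3)                   # W^(1), W^(2), W^(3), one per device
partials = [matvec(W_i, x) for W_i in blocks]  # each device computes W^(i) X
y = [value for part in partials for value in part]  # concatenate the partial outputs

print(y)
print(y == matvec(W, x))  # True
```

Each device only needs to store its own $\frac{A}{n} \times B$ block of the weights, and concatenating the partial outputs recovers the full result.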