Pretraining, Evaluating, and Serving GPT-2 from Scratch with Megatron-LM

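This article is a contracted first publication on the Juejin (稀土掘金) technical community. Reposting is prohibited within 30 days of publication and prohibited without authorization thereafter; infringement will be pursued.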

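The rapid rise of ChatGPT brought large Transformer-based models from behind the scenes into the spotlight. ChatGPT's success did not come overnight, however: it grew out of a progression from the early GPT-1 to GPT-2, then GPT-3 and InstructGPT, then GPT-3.5 and ChatGPT, and finally today's multimodal GPT-4.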

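OpenAI has not open-sourced the models from GPT-3 onward, so we cannot dissect the mechanisms behind them ourselves. GPT-2, one of the founding members of the GPT family, is open source, so this article uses Megatron-LM to pretrain a GPT-2 model. To keep the article readable, the complete scripts and code are hosted on GitHub: llm-action.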

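Setting up the runtime environment

The base environment is configured as follows:

Operating system: Ubuntu 18.04
CPUs: Intel CPUs on a single node with 384 GB of memory; 2 physical CPUs with 20 cores each
GPUs: 4 × A800 80GB
Python: 3.10 (upgrade OpenSSL to 1.1.1t first, then compile and install Python)
NVIDIA driver version: 525.105.17 (choose the driver that matches your GPU model)
CUDA toolkit: 11.6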

To reproduce the whole GPT-2 pretraining process quickly, this article builds the runtime environment on the official Docker image provided by NVIDIA.

First, pull the matching PyTorch image from NVIDIA.

docker pull nvcr.io/nvidia/pytorch:23.04-py3

Once the image has been pulled, create the container for the training environment.

docker run -dt --name nvidia_pytorch_env --restart=always --gpus all \
--network=host \
--shm-size 4G \
-v /home/gdong/workspace:/workspace \
-w /workspace \
nvcr.io/nvidia/pytorch:23.04-py3 \
/bin/bash

Then enter the container to prepare the code, model, data, and so on.

docker exec -it nvidia_pytorch_env bash

Preparing the code

Download the Megatron-LM source code, then check out the corresponding commit ID:

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout 992da75

Preparing the model weights and vocabulary

Download the GPT-2 weights:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip

After unzipping, the files are laid out as follows:

> tree -h megatron
megatron
├── [   8]  latest_checkpointed_iteration.txt
└── [4.0K]  release
    └── [4.0K]  mp_rank_00
        └── [677M]  model_optim_rng.pt

2 directories, 2 files

> cat megatron/latest_checkpointed_iteration.txt 
release
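
The layout above follows a simple convention: latest_checkpointed_iteration.txt holds a value that selects the subdirectory containing the weights. The sketch below illustrates that lookup in Python. It is inferred only from the directory listings in this article, not from Megatron-LM's source; the function name resolve_checkpoint_dir is made up for illustration, and the zero-padded iter_ naming is an assumption based on the iter_0005000 and iter_0002000 directories shown in the training sections.

```python
from pathlib import Path


def resolve_checkpoint_dir(root: str) -> Path:
    """Return the directory that the tracker file points at.

    Assumes the layout shown in this article: the tracker contains either
    the word "release" or an integer iteration, and iterations are stored
    in directories named iter_<7-digit zero-padded number>.
    """
    root_path = Path(root)
    tracker = root_path / "latest_checkpointed_iteration.txt"
    value = tracker.read_text().strip()
    if value == "release":
        return root_path / "release"
    return root_path / f"iter_{int(value):07d}"
```

For the downloaded 345M checkpoint this would resolve to the release directory; for a tracker containing 5000 it would resolve to iter_0005000.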

Download the GPT-2 vocabulary:

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

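Preparing the dataset

Megatron-LM's GPT-2 training uses the publicly available OpenWebText library from jcpeterson and eukaryote31 to download URLs. All downloaded content is then filtered, cleaned, and deduplicated following the process described in the openwebtext directory. For Reddit URLs up to October 2018, this yields roughly 37 GB of content.

Below, we prepare the training data following the openwebtext documentation in Megatron-LM.

First, install the dependencies.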

pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract -i https://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn

Then install LSH.

git clone https://github.com/mattilyra/LSH
cd LSH
git checkout a57069b
python setup.py install

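Because the Python version used here is 3.8.10, there is a compatibility problem and the installation fails with an error; make the changes the error message suggests.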

Edit the lsh/cMinhash.cpp file:

Change exc_type to curexc_type
Change exc_value to curexc_value
Change exc_traceback to curexc_traceback
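
If you would rather not edit the file by hand, the three renames can be scripted. The sketch below is one hedged way to do it: it assumes the offending fields are reached through pointer member access (for example tstate->exc_type), which is an assumption about the generated C++ rather than something verified here, so treat the compiler's error messages as the authority. The function name patch_exc_fields is hypothetical.

```python
import re

# Matches pointer member accesses such as `tstate->exc_type`.
# Already-patched names (`->curexc_type`) do not match, so the
# substitution is safe to run more than once.
_EXC_FIELD = re.compile(r"->\s*exc_(type|value|traceback)\b")


def patch_exc_fields(source: str) -> str:
    """Rename ->exc_* accesses to ->curexc_* and leave all other text alone."""
    return _EXC_FIELD.sub(r"->curexc_\1", source)


# Example usage (run from the LSH checkout; path taken from the article):
#   from pathlib import Path
#   path = Path("lsh/cMinhash.cpp")
#   path.write_text(patch_exc_fields(path.read_text()))
```

Restricting the pattern to member accesses keeps the script from touching unrelated identifiers that merely contain exc_type.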

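Once installation is done, download the deduplicated URLs from jcpeterson and place them in the urls directory. Because there are so many files, only one URL file is downloaded here for demonstration.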

> mkdir urls

> tree -h urls/
urls/
└── [5.3M]  RS_2011-01.bz2.deduped.txt

0 directories, 1 file

Then remove the blacklisted URLs.

# python blacklist_urls.py <path to the downloaded deduplicated URLs> <filename for clean urls. e.g. clean_urls.txt>
python3 blacklist_urls.py ./urls clean_urls.txt
# Keep only the first 100 cleaned URLs.
# head -n100 clean_urls.txt >> clean_urls_100.txt
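
To make the idea of URL blacklisting concrete, here is a minimal Python sketch. It is not the code in blacklist_urls.py: it assumes filtering by domain and by file extension, and the two block lists below are hypothetical examples rather than the script's real lists.

```python
from urllib.parse import urlparse

# Hypothetical examples only; the real lists live in blacklist_urls.py.
BLOCKED_DOMAINS = {"youtube.com", "imgur.com"}
BLOCKED_EXTENSIONS = (".jpg", ".png", ".gif", ".mp4")


def is_allowed(url: str) -> bool:
    """Reject malformed URLs, blocked domains (and subdomains), and media files."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if not host:
        return False
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    return not parsed.path.lower().endswith(BLOCKED_EXTENSIONS)


def clean_urls(urls):
    """Drop blocked URLs and duplicates while keeping first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if url and url not in seen and is_allowed(url):
            seen.add(url)
            kept.append(url)
    return kept
```

Filtering before downloading keeps the scraper from spending time on pages that would be discarded anyway, such as images and video hosts.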

Next, use the openwebtext utilities to download the content behind the cleaned URLs.

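You need to change the default values of --sqlite_meta and --save_uncompressed in download.py to False and True respectively. Running python3 openwebtext/download.py clean_urls.txt then creates a scraped folder, with the text downloaded from each URL saved in its data subfolder.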

# ef42b51
git clone https://github.com/yet-another-account/openwebtext.git

# vim openwebtext/download.py

python3 openwebtext/download.py ./Megatron-LM/tools/openwebtext/clean_urls.txt  --output_dir /workspace/code/scraped

Once the download finishes, the output is laid out as follows:

> tree -h /workspace/code/scraped
/workspace/code/scraped
├── [304K]  data
│   ├── [ 176]  0000300-ab9ff12f7658b8764a413bf58d58bc48b866b0c163ce5c0442296dce46ff0ff8.txt
│	│	...
│   └── [ 634]  0009896-6e15400f49434b3dbf9421a8f342f80f26c1e901f78f6350d4b738f58c456bdd.txt
└── [296K]  meta
    ├── [ 154]  0001000-ab50f2cd5366369108d58d6e4eb77e8c4babf56e634a33dcd880597684109fc4.json
    │	...
    └── [ 224]  0009896-6e15400f49434b3dbf9421a8f342f80f26c1e901f78f6350d4b738f58c456bdd.json

2 directories, 4860 files

The file contents look like this:

# The meta subfolder stores metadata
> cat /workspace/code/scraped/meta/0009896-6e15400f49434b3dbf9421a8f342f80f26c1e901f78f6350d4b738f58c456bdd.json
{"url": "http://minnesotaindependent.com/74302/bachmann-says-transportation-projects-shouldnt-count-as-earmarks", "word_count": 73, "elapsed": 3.2160894870758057, "scraper": "newspaper", "domain": "minnesotaindependent.com"}

# The data subfolder stores the text data
> cat /workspace/code/scraped/data/0009896-6e15400f49434b3dbf9421a8f342f80f26c1e901f78f6350d4b738f58c456bdd.txt 
Der eigene Bodenwischer ist der wichtigste Begleiter im täglichen Haushalt. Ob für Parkett, Fliesen oder Laminat: Qualität, Ausstattung und Preis spielen bei der Kaufentscheidung eine große Rolle.
...
Bodenwischer für ...

Merge the text files in the data subfolder into a single JSON file.

python3 Megatron-LM/tools/openwebtext/merge_data.py --data_path /workspace/code/scraped/data --output_file /workspace/data/merged_output.json

The merged file looks like this:

> head -n6 /workspace/data/merged_output.json
{"text": "With every new year, it's murder for Neal Smither and his crew.\n"}
{"text": "\n"}
{"text": "Suicide, too.\n"}
{"text": "\n"}
{"text": "As owner of Crime Scene Cleaners, Smither's job is to clean up the bloody messes left behind when people kill each other or themselves - and those first few weeks after Jan. 1 are his busiest time of year.\n"}
{"text": "\n"}
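
Judging from the output above, the merged file is in JSON Lines form, with each line of each source text file becoming its own {"text": ...} record (blank lines included). A minimal Python sketch that produces the same shape is shown below. It is an illustration inferred from that output, not the code of merge_data.py, and the function name merge_text_files is hypothetical.

```python
import json
from pathlib import Path


def merge_text_files(data_dir: str, output_file: str) -> int:
    """Write one {"text": line} JSON object per input line; return the record count."""
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        # Sort for a deterministic order across runs.
        for path in sorted(Path(data_dir).glob("*.txt")):
            with open(path, encoding="utf-8") as src:
                for line in src:
                    # The trailing newline is kept inside the "text" value,
                    # matching the records shown above.
                    out.write(json.dumps({"text": line}) + "\n")
                    count += 1
    return count
```

This line-per-record shape explains why the merged file has far more lines than there are downloaded documents.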

Data cleaning

Run ftfy, apply English detection, and remove documents with fewer than 128 tokens.

python3 cleanup_dataset.py /workspace/data/merged_output.json /workspace/data/merged_cleand.json

Record counts before and after cleaning:

> wc -l merged_output.json 
78802 merged_output.json

> wc -l merged_cleand.json 
2456 merged_cleand.json
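
Only about 3% of the records survive (2456 of 78802), which is what you would expect when most records are single short lines. The sketch below shows the shape of such a filter under loud assumptions: whitespace splitting stands in for the real tokenizer, the language check is a pluggable placeholder that accepts everything by default, and the ftfy text-repair step is omitted. It is not the logic of cleanup_dataset.py, and all names are hypothetical.

```python
import json


def keep_document(text, min_tokens=128, tokenize=str.split, is_english=lambda t: True):
    """Keep a document only if it passes the language check and is long enough.

    `tokenize` and `is_english` are stand-ins: swap in a real tokenizer and a
    real language detector (for example langdetect) for meaningful results.
    """
    if not is_english(text):
        return False
    return len(tokenize(text)) >= min_tokens


def filter_jsonl_lines(lines, **kwargs):
    """Return the JSON Lines records whose "text" field passes keep_document."""
    kept = []
    for line in lines:
        record = json.loads(line)
        if keep_document(record["text"], **kwargs):
            kept.append(line)
    return kept
```

Passing the tokenizer and language detector as parameters keeps the threshold logic testable without installing either dependency.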

Then shuffle the cleaned dataset.

shuf /workspace/data/merged_cleand.json -o /workspace/data/train_data.json

Data preprocessing

Next, preprocess the training data.

python tools/preprocess_data.py \
       --input /workspace/data/train_data.json \
       --output-prefix /workspace/data/my-gpt2 \
       --vocab-file /workspace/model/gpt2-vocab/gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file /workspace/model/gpt2-vocab/gpt2-merges.txt \
       --append-eod \
       --workers 20 \
       --chunk-size 25

The output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. When training GPT-2, pass the name without the extension as --data-path.

All the groundwork is now in place, so we can move on to training the model.

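Model training

Single-GPU training

Edit the examples/pretrain_gpt.sh script to set the checkpoint path (CHECKPOINT_PATH), the vocabulary file path (VOCAB_FILE), the merges file path (MERGE_FILE), the dataset path (DATA_PATH), and so on: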

#!/bin/bash

# Runs the "345M" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

CHECKPOINT_PATH=/workspace/model/megatron-models/345m
VOCAB_FILE=/workspace/model/gpt2-vocab/gpt2-vocab.json
MERGE_FILE=/workspace/model/gpt2-vocab/gpt2-merges.txt
DATA_PATH=/workspace/data/my-gpt2_text_document
MODEL_PATH=/workspace/model/megatron-models/output

# Model hyperparameters
GPT_ARGS="
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 1 \
    --global-batch-size 2 \
    --lr 0.00015 \
    --train-iters 5000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16
"

# Dataset and vocabulary path arguments
DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --data-impl mmap \
    --split 700,200,100
"

# Arguments for checkpoint output, evaluation, and logging
OUTPUT_ARGS="
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 10
"

# Launch the training job
torchrun pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --save $MODEL_PATH \
    --load $CHECKPOINT_PATH
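
Two of the settings in the script are easier to read with the arithmetic spelled out. The global batch size is the micro batch size times the data-parallel size times the number of micro-batches accumulated per optimizer step, and the --split value is a set of relative weights for the train, validation, and test portions. The Python sketch below works through both; it reflects my understanding of how these arguments are interpreted rather than code taken from Megatron-LM, and the function names are hypothetical.

```python
def num_microbatches(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each data-parallel rank accumulates per optimizer step."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size must be divisible by micro batch size x DP size")
    return global_batch_size // per_pass


def split_fractions(split):
    """Normalize a comma-separated weight string such as "700,200,100"."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]


# Values from the single-GPU script above: micro batch 1, global batch 2, one GPU.
print(num_microbatches(global_batch_size=2, micro_batch_size=1, data_parallel_size=1))
print(split_fractions("700,200,100"))
```

With the values above, each optimizer step accumulates gradients over 2 micro-batches, and the data is divided 70% / 20% / 10%.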

Then run the script to start training:

CUDA_VISIBLE_DEVICES=3 sh examples/pretrain_gpt.sh

After training completes, the model weights are written as follows:

> tree -h 345m
345m
├── [4.0K]  iter_0005000
│   └── [4.0K]  mp_rank_00
│       └── [4.6G]  model_optim_rng.pt
└── [   4]  latest_checkpointed_iteration.txt

> cat 345m/latest_checkpointed_iteration.txt 
5000

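Besides single-GPU training, we can also train on multiple GPUs. The following sections demonstrate training with 4-GPU data parallelism, 4-GPU tensor parallelism, 4-GPU pipeline parallelism, and multi-dimensional hybrid parallelism (2-way tensor parallelism combined with 2-way pipeline parallelism).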
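
All four setups use the same 4 GPUs; they differ only in how the GPU count is factored into tensor-parallel (TP), pipeline-parallel (PP), and data-parallel (DP) sizes, with TP × PP × DP equal to the number of GPUs. The short Python sketch below works out the implied DP size for each setup. The helper name is hypothetical and the code is only a worked calculation, not part of Megatron-LM.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """DP size implied by a GPU count and the TP and PP sizes."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP x PP")
    return world_size // model_parallel


# (TP, PP) for the four setups demonstrated in this article, all on 4 GPUs.
SETUPS = {"4DP": (1, 1), "4TP": (4, 1), "4PP": (1, 4), "2TP+2PP": (2, 2)}

for name, (tp, pp) in SETUPS.items():
    dp = data_parallel_size(4, tp, pp)
    print(f"{name}: TP={tp} PP={pp} DP={dp}")
```

Only the 4DP setup keeps a full copy of the model on every GPU; in the other three, the model itself is divided across the 4 GPUs and the DP size is 1.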

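Data-parallel training (4DP)

Below, we train with 4-way data parallelism (4DP) by running the pretrain_gpt_distributed.sh script.

After training completes, the model weights are written as follows: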

tree -h /workspace/model/megatron-models/345m-init-4tp
/workspace/model/megatron-models/345m-init-4tp
├── [4.0K]  iter_0002000
│   ├── [4.0K]  mp_rank_00
│   │   └── [1.2G]  model_optim_rng.pt
...
│   └── [4.0K]  mp_rank_03
│       └── [1.2G]  model_optim_rng.pt
└── [   4]  latest_checkpointed_iteration.txt

10 directories, 9 files

> cat /workspace/model/megatron-models/345m-init-4tp/latest_checkpointed_iteration.txt 
2000

GPU memory usage during training:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3227288      C   /usr/bin/python                  9652MiB |
|    1   N/A  N/A   3227289      C   /usr/bin/python                  9652MiB |
|    2   N/A  N/A   3227290      C   /usr/bin/python                  9652MiB |
|    3   N/A  N/A   3227291      C   /usr/bin/python                  9652MiB |
+-----------------------------------------------------------------------------+

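Model-parallel training (4PP)

Below, we train with 4-way pipeline parallelism (4PP) using the pretrain_gpt_distributed_with_4pp.sh script.

After training completes, the model weights are written as follows: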

> tree -h /workspace/model/megatron-models/345m-init-4pp
/workspace/model/megatron-models/345m-init-4pp
├── [4.0K]  iter_0002000
│   ├── [4.0K]  mp_rank_00_000
│   │   └── [1.7G]  model_optim_rng.pt
│   ├── [4.0K]  mp_rank_00_001
│   │   └── [1009M]  model_optim_rng.pt
│   ├── [4.0K]  mp_rank_00_002
│   │   └── [1009M]  model_optim_rng.pt
│   └── [4.0K]  mp_rank_00_003
│       └── [1.7G]  model_optim_rng.pt
└── [   4]  latest_checkpointed_iteration.txt

> cat /workspace/model/megatron-models/345m-init-4pp/latest_checkpointed_iteration.txt 
2000

GPU memory usage during training:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2630871      C   /usr/bin/python                  8680MiB |
|    1   N/A  N/A   2630872      C   /usr/bin/python                  6408MiB |
|    2   N/A  N/A   2630873      C   /usr/bin/python                  5080MiB |
|    3   N/A  N/A   2630874      C   /usr/bin/python                  5436MiB |
+-----------------------------------------------------------------------------+

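Model-parallel training (4TP)

Below, we train with 4-way tensor parallelism (4TP) using the pretrain_gpt_distributed_with_4tp.sh script.

After training completes, the model weights are written as follows: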

tree -h /workspace/model/megatron-models/345m-init-4tp
/workspace/model/megatron-models/345m-init-4tp
├── [4.0K]  iter_0002000
│   ├── [4.0K]  mp_rank_00
│   │   └── [1.2G]  model_optim_rng.pt
...
│   └── [4.0K]  mp_rank_03
│       └── [1.2G]  model_optim_rng.pt
└── [   4]  latest_checkpointed_iteration.txt

> cat /workspace/model/megatron-models/345m-init-4tp/latest_checkpointed_iteration.txt 
2000

GPU memory usage during training:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3895346      C   /usr/bin/python                  4236MiB |
|    1   N/A  N/A   3895347      C   /usr/bin/python                  4176MiB |
|    2   N/A  N/A   3895348      C   /usr/bin/python                  4168MiB |
|    3   N/A  N/A   3895349      C   /usr/bin/python                  4176MiB |
+-----------------------------------------------------------------------------+

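Model-parallel training (2TP+2PP)

Below, we train with 2-way tensor parallelism and 2-way pipeline parallelism by running the pretrain_gpt_distributed_with_mp.sh script.

After training completes, the model weights are written as follows: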

> tree -h 345m-init-mp
345m-init-mp
├── [4.0K]  iter_0005000
│   ├── [4.0K]  mp_rank_00_000
│   │   └── [1.3G]  model_optim_rng.pt
│   ├── [4.0K]  mp_rank_00_001
│   │   └── [1.3G]  model_optim_rng.pt
│   ├── [4.0K]  mp_rank_01_000
│   │   └── [1.3G]  model_optim_rng.pt
│   └── [4.0K]  mp_rank_01_001
│       └── [1.3G]  model_optim_rng.pt
└── [   4]  latest_checkpointed_iteration.txt
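
The directory names in these listings encode the parallel rank of each shard. Based only on the trees shown in this article, the pattern appears to be mp_rank_<TP rank> when there is no pipeline parallelism and mp_rank_<TP rank>_<PP rank> otherwise. The Python sketch below generates the names under that assumption; the function name is hypothetical.

```python
def rank_dir_names(tensor_parallel, pipeline_parallel):
    """Shard directory names for a given TP x PP layout.

    The naming pattern is inferred from the listings in this article:
    a 2-digit TP rank, plus a 3-digit PP rank when PP > 1.
    """
    names = []
    for tp in range(tensor_parallel):
        for pp in range(pipeline_parallel):
            if pipeline_parallel == 1:
                names.append(f"mp_rank_{tp:02d}")
            else:
                names.append(f"mp_rank_{tp:02d}_{pp:03d}")
    return names


print(rank_dir_names(2, 2))
```

For the 2TP+2PP layout this yields the four directories listed above, and for 4TP or 4PP it yields the names shown in the earlier sections.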

GPU memory usage during training:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3448098      C   /usr/bin/python                  8732MiB |
|    1   N/A  N/A   3448099      C   /usr/bin/python                  8732MiB |
|    2   N/A  N/A   3448100      C   /usr/bin/python                  6828MiB |
|    3   N/A  N/A   3448101      C   /usr/bin/python                  7078MiB |
+-----------------------------------------------------------------------------+

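Merging model weights

Merging a model trained with distributed parallelism can be useful when you want to run it on fewer GPUs.

The following script performs the merge. This example reads a GPT model trained with 2TP and 2PP model parallelism and writes out a model with 1TP and 1PP.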

python tools/checkpoint_util.py \
        --model-type GPT \
        --load-dir /workspace/model/megatron-models/345m-init-mp \
        --save-dir /workspace/model/megatron-models/345m-init-mp-out \
        --target-tensor-parallel-size 1 \
        --target-pipeline-parallel-size 1

With the weights merged, the merged checkpoint is used below for model evaluation and inference.

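Model evaluation

Below, we evaluate cloze accuracy on the LAMBADA dataset, that is, how accurately the model predicts the last token given the preceding tokens.

Run the following command to evaluate the model. Before running the script, configure the model weights, evaluation dataset, vocabulary paths, and so on.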

sh eval_gpt2_lambada.sh

Note: use --strict-lambada to require whole-word matching.

Part of the run log is shown below:

using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
setting global batch size to 8
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  ...
  world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
...
make: Leaving directory '/workspace/code/bak/Megatron-LM/megatron/data'
>>> done with dataset index builder. Compilation time: 13.399 seconds
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 1.411 seconds
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296
 loading checkpoint from /workspace/model/megatron-models/345m-init-mp-out at iteration 5000
 checkpoint version 3.0
  successfully loaded checkpoint from /workspace/model/megatron-models/345m-init-mp-out at iteration 5000
> building lambada dataset from /workspace/data/lambada_test.jsonl ...
 > found 5153 samples.
> working on iteration: 0
...
> working on iteration: 640
--------------------------------------------------------------------------------------------------------------------
 validation results on LAMBADA | number correct: 0.0000E+00 | total examples: 5.1530E+03 | avg accuracy: 0.0000E+00
--------------------------------------------------------------------------------------------------------------------
done :-)
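
The numbers in this log can be checked by hand. The reported accuracy is simply the number of correct predictions divided by the number of examples, and the iteration count follows from the sample count and the batch size. The Python sketch below redoes that arithmetic, assuming the evaluation batch size equals the global batch size of 8 printed at the top of the log; the function names are hypothetical.

```python
import math


def accuracy(num_correct, total_examples):
    """Fraction of examples predicted correctly."""
    return num_correct / total_examples if total_examples else 0.0


def num_batches(num_samples, batch_size):
    """Batches needed to cover every sample, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)


# Values from the log: 5153 LAMBADA samples, 0 correct, batch size 8 (assumed).
print(accuracy(0, 5153))
print(num_batches(5153, 8))
```

That gives 645 batches, numbered 0 through 644, which is consistent with the last progress line in the log being iteration 640 if progress is printed every 10 iterations.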

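Model inference service

tools/run_text_generation_server.py contains a simple REST service for text generation. To run it, you need to specify a suitable pretrained checkpoint. Several optional parameters such as temperature, top-k, and top-p can also be configured; see --help or the source file for details.

Before starting the inference service, install the dependencies: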

pip install flask flask-restful -i https://pypi.tuna.tsinghua.edu.cn/simple  --trusted-host pypi.tuna.tsinghua.edu.cn

Once they are installed, use the examples/run_text_generation_server_345M.sh script to start the GPT-2 inference service.

sh examples/run_text_generation_server_345M.sh

With the inference service running, you can query the API using tools/text_generation_cli.py, which takes one argument: the host the service is running on.

> python tools/text_generation_cli.py localhost:5000
Enter prompt: hello
Enter number of tokens to generate: 5
Megatron Response: 
hello! Until that protagonist receive
Enter prompt: world 
Enter number of tokens to generate: 2
Megatron Response: 
worldboarding-
Enter prompt:

Alternatively, you can call the API directly with curl or any other API testing tool:

> curl 'http://localhost:5000/api' -X 'PUT' -H 'Content-Type: application/json; charset=UTF-8'  -d '{"prompts":["Hello world"], "tokens_to_generate":1}'

{"logprobs":null,"segments":[["Hello"," world",","]],"text":["Hello world,"]}
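
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above: a PUT to /api with a JSON body containing prompts and tokens_to_generate, reading the text field of the response. The endpoint, field names, and response shape are taken from the example above; the function names build_request and generate are hypothetical, and error handling is left out for brevity.

```python
import json
import urllib.request


def build_request(host, prompts, tokens_to_generate):
    """Build the PUT request that the curl example above sends."""
    body = json.dumps({"prompts": prompts, "tokens_to_generate": tokens_to_generate})
    return urllib.request.Request(
        url=f"http://{host}/api",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="PUT",
    )


def generate(host, prompts, tokens_to_generate):
    """Send the request and return the list of generated texts."""
    request = build_request(host, prompts, tokens_to_generate)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["text"]


# Example usage (requires the inference server to be running):
#   print(generate("localhost:5000", ["Hello world"], 1))
```

Separating request construction from sending makes it possible to inspect or test the request without a running server.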

The above uses a single GPU for inference; multi-GPU model-parallel inference is also possible.

Inference with 4TP:

sh examples/run_text_generation_server_345M_4_tensor_parallel.sh

GPU memory usage:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1844443      C   /usr/bin/python                   788MiB |
|    1   N/A  N/A   1844444      C   /usr/bin/python                   788MiB |
|    2   N/A  N/A   1844445      C   /usr/bin/python                   788MiB |
|    3   N/A  N/A   1844446      C   /usr/bin/python                   788MiB |
+-----------------------------------------------------------------------------+

Inference with 2TP+2PP:

sh examples/run_text_generation_server_345M_2tp_2dp.sh

GPU memory usage:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1869409      C   /usr/bin/python                  1222MiB |
|    1   N/A  N/A   1869410      C   /usr/bin/python                  1222MiB |
|    2   N/A  N/A   1869411      C   /usr/bin/python                  1222MiB |
|    3   N/A  N/A   1869412      C   /usr/bin/python                  1222MiB |
+-----------------------------------------------------------------------------+

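Conclusion

This article used NVIDIA's open-source Megatron-LM framework to walk through the whole process of pretraining, evaluating, and running inference with a GPT-2 model. It also covered the full preprocessing pipeline for preparing a GPT-2 training dataset.

If you found this article helpful, a like, a bookmark, and a follow would be much appreciated.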