Building a GPT Model from Scratch for Text Generation

Preface

In this post we build a GPT model from scratch using the Transformer decoder and generate text with several different sampling algorithms.

Data Processing

  1. Download the file simplebooks.zip from the given URL and extract it locally; this dataset contains some simple text data.
  2. Use tf.data.TextLineDataset to create the training and validation datasets. For the training set, a filter drops lines shorter than MIN_STRING_LEN before batching and shuffling; the validation set is only filtered and batched.
  3. Use keras_nlp.tokenizers.compute_word_piece_vocabulary to build the vocabulary. The function takes the training data along with the vocabulary size, whether to lowercase, and the reserved special tokens.
  4. Use keras_nlp.tokenizers.WordPieceTokenizer to set up the tokenizer, passing in the vocabulary built above and the sequence length.
  5. Use keras_nlp.layers.StartEndPacker to set up the StartEndPacker, specifying the sequence length and the ID of the start token.
import os
import keras_nlp
import tensorflow as tf

# Download and extract the SimpleBooks dataset.
tf.keras.utils.get_file(origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip", extract=True)
data_dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Training set: drop lines shorter than MIN_STRING_LEN, then batch and shuffle; validation set: filter and batch only.
raw_train_ds = (tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/train.txt").filter(lambda x: tf.strings.length(x) > MIN_STRING_LEN).batch(BATCH_SIZE).shuffle(buffer_size=BATCH_SIZE * 4))
raw_val_ds = tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/valid.txt").filter(lambda x: tf.strings.length(x) > MIN_STRING_LEN).batch(BATCH_SIZE)

# Build a WordPiece vocabulary from the training data, then set up the tokenizer and the [BOS] packer.
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(raw_train_ds, vocabulary_size=VOCAB_SIZE, lowercase=True, reserved_tokens=["[PAD]", "[UNK]", "[BOS]"])
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab, sequence_length=SEQ_LEN, lowercase=True)
start_packer = keras_nlp.layers.StartEndPacker(sequence_length=SEQ_LEN, start_value=tokenizer.token_to_id("[BOS]"))

def preprocess(inputs):
    # Features are the tokens with [BOS] prepended; labels are the unshifted tokens, so each position predicts the next token.
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels

train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

Model Training

  1. Use tf.keras.layers.Input to define the input layer; the input is a sequence of integers, i.e. the token IDs of the text.
  2. Use keras_nlp.layers.TokenAndPositionEmbedding to create an embedding layer that maps the integer sequence into a high-dimensional space. It takes the vocabulary size, sequence length, and embedding dimension, and masks the padding token in the sequence.
  3. Stack NUM_LAYERS keras_nlp.layers.TransformerDecoder layers in a loop, each with the specified number of heads num_heads and intermediate dimension intermediate_dim.
  4. Use a fully connected Dense layer to produce the final output, with the vocabulary size as the output dimension.
  5. Use tf.keras.Model to define the whole model, specifying its inputs and outputs. Use SparseCategoricalCrossentropy as the loss, Adam as the optimizer, and KerasNLP's Perplexity metric for evaluation.
  6. Call model.fit to train, passing in the training and validation sets and the number of epochs.
# The inputs are variable-length sequences of token IDs.
inputs = tf.keras.layers.Input(shape=(None,), dtype="int32")
# Token + position embedding; mask_zero=True masks the [PAD] token (ID 0).
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(vocabulary_size=VOCAB_SIZE, sequence_length=SEQ_LEN, embedding_dim=EMBED_DIM, mask_zero=True)
x = embedding_layer(inputs)
# Stack NUM_LAYERS decoder layers; with no cross-attention input, each applies causal self-attention.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(num_heads=NUM_HEADS, intermediate_dim=FEED_FORWARD_DIM)
    x = decoder_layer(x)
# Project back to the vocabulary to obtain next-token logits.
outputs = tf.keras.layers.Dense(VOCAB_SIZE)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Training log:

Epoch 1/5
2445/2445 [==============================] - 82s 30ms/step - loss: 4.7067 - perplexity: 111.3245 - val_loss: 4.1930 - val_perplexity: 67.9194
Epoch 2/5
2445/2445 [==============================] - 77s 30ms/step - loss: 4.1393 - perplexity: 63.0901 - val_loss: 4.0543 - val_perplexity: 59.1305
Epoch 3/5
2445/2445 [==============================] - 94s 35ms/step - loss: 4.0120 - perplexity: 55.5379 - val_loss: 4.0131 - val_perplexity: 56.7052
Epoch 4/5
2445/2445 [==============================] - 93s 35ms/step - loss: 3.9460 - perplexity: 51.9870 - val_loss: 3.9597 - val_perplexity: 53.7640
Epoch 5/5
2445/2445 [==============================] - 96s 36ms/step - loss: 3.9003 - perplexity: 49.6596 - val_loss: 3.9651 - val_perplexity: 54.0120

Generation Results with Different Sampling Algorithms

GreedySampler

Greedy search is a locally optimal strategy: at each step it outputs the token with the highest probability, and these tokens are concatenated into the final sentence.
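The samples below are produced by driving the trained model through KerasNLP's sampler API. As a minimal sketch (not part of the original post), assuming the model, tokenizer, and start_packer defined above, a prompt of just the [BOS] token and an illustrative next callback can be passed to keras_nlp.samplers.GreedySampler:

# Prompt containing only [BOS]; the packer pads it out to SEQ_LEN.
prompt_tokens = start_packer(tokenizer([""]))

def next(prompt, cache, index):
    # Logits for the position being generated; hidden states are only needed
    # for contrastive search, so None is returned in their place.
    logits = model(prompt)[:, index - 1, :]
    return logits, None, cache

sampler = keras_nlp.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after [BOS].
)
print("Greedy search generated text:", tokenizer.detokenize(output_tokens))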

Greedy search generated text:
[b'[BOS] " i am glad to see you , " said the captain , " and i am glad to see you , for i have been so glad to see you , and i have been so glad to see you . i have been thinking of that i have been so much of you , and i have been so glad to see you , and i have been so glad to see you . i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been']

BeamSampler

Beam search keeps the beam_size highest-probability candidate paths at every time step. If the vocabulary size is v and the beam size is k, each step expands k*v candidate paths, of which the k most probable are kept and carried into the next step. With k=1, beam search reduces to greedy search.
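The same illustrative generation loop works with keras_nlp.samplers.BeamSampler; this sketch reuses the next callback and prompt_tokens from the greedy example above, and num_beams=10 is an assumed example value:

sampler = keras_nlp.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Beam search generated text:", tokenizer.detokenize(output_tokens))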

Beam search generated text: 
[b'[BOS] " i don \' t think of it , " he said . " i don \' t want to know that , but i don \' t know what to do . i don \' t know what to do , but i don \' t know what to do . i don \' t know what to do , but i don \' t know what i want to do . i don \' t know what to do . i don \' t know what i want to do . i don \' t know what to do . i don \' t want to know what to do . i don \' t know what to do , but i don \' t']

RandomSampler

Random sampling draws the next token from the model's full probability distribution over all tokens at each step, so each token is chosen with a chance equal to its predicted probability.
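A corresponding sketch with keras_nlp.samplers.RandomSampler, again reusing the illustrative next callback and prompt_tokens from the greedy example:

sampler = keras_nlp.samplers.RandomSampler()
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Random search generated text:", tokenizer.detokenize(output_tokens))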

Random  search generated text: 
[b"[BOS] and on the porch he could not get off , for the lamp she had been sitting on a table , mounted on the crest of griday , and , watcher , with their nose and auber pocket in humoniation , and blue indignant himself heartily , had no particular reason upon going in the first place . and on account of his own helping tom so . he knew the whole art of charleston , and had no pain of pains to follow his brother ' s boots , that one knew that such an accident was happening away from the schoolmaster . [PAD] ! whitevinne came across the"]

TopKSampler

Top-K sampling keeps only the K tokens with the highest probabilities, renormalizes them with softmax, and samples the next token from this reduced distribution. Temperature sampling can also be combined with it.
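A sketch with keras_nlp.samplers.TopKSampler, reusing the same illustrative helpers; k=10 is an assumed example value:

sampler = keras_nlp.samplers.TopKSampler(k=10)
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Top-K search generated text:", tokenizer.detokenize(output_tokens))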

Topk search generated text: 
[b'[BOS] but the spaniard of the day , with a corporal , came to the conclusion that the latter was the case . it was to be affected to recall . the superintender was to be adorned . in this case the acres of colony had been the case with the abroachus of praunctual , and the breach of the picycle - - that the thoroughfare was not in the chateau , and the superintendent was now a']

TopPSampler

Top-P (nucleus) sampling fixes a probability threshold p in advance, sorts the candidate tokens by probability from high to low, and keeps adding tokens until their cumulative probability reaches p; the next token is then sampled from this selected set. Temperature sampling can likewise be applied.
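A sketch with keras_nlp.samplers.TopPSampler, reusing the same illustrative helpers; p=0.5 is an assumed example threshold:

sampler = keras_nlp.samplers.TopPSampler(p=0.5)
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Top-P search generated text:", tokenizer.detokenize(output_tokens))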

Topp search generated text: 
[b"[BOS] the general , of the enemy , and , with a small party of sailors , was a stout fellow - master , who was of the officers of the prussians , and his men were killed , as well as the major of the party of the prisoners , the french had gathered round their own . he was not the only skimos , but the french was at the same time . they had now returned to the admiral ' s portuguese , and , in the meantime , the english general , with his commander , had sent to the spanish admiral to pay the respects to the admiral . [PAD] had they been in command of the"]